
Big companies have started a video generation "arms race". Can AI really take on Hollywood?

2024-07-15



Synced (Machine Heart) Report

Synced Editorial Department

The AI video industry is locked in fierce competition.

No sooner had Kuaishou made a splash with the launch of Keling (Kling) than Luma, not to be outdone, released its latest video model, Dream Machine. Then Runway stepped in with its heavyweight Gen-3.

Driven by an undercurrent of FOMO, ever more players are piling into the race, determined to "grind themselves to death and grind down their peers":

Alibaba DAMO Academy is betting on the "Xunguang Video Creation Platform", ByteDance AI is exploring "Generative Film and Television", Meitu MOKI is eyeing AI short film creation, and Haiper AI is focusing on creative expression...

July 5th in Shanghai was sweltering, much like the anxiety gripping the AI video scene.

On this day, the X conference room of Hall H3 of the Shanghai World Expo Exhibition and Convention Center was crowded with people. The "2024 WAIC Video Generation Frontier Technology Forum" hosted by the World Artificial Intelligence Conference Organizing Committee Office and co-organized by Machine Heart and East Best Lansheng was in full swing.



The forum brought together many star companies and experts in the field of AI video to discuss the latest developments in video generation technology and its innovative practices in industrial applications.

In-depth sharing: candid words from industry insiders

Since ChatGPT burst onto the scene, video generation, ignited by Sora, has become the undisputed "hottest new thing" in the technology industry.

Although the field is still in its infancy, video generation technology keeps expanding the boundaries of digital content creation, with an astonishing pace of development and vast application potential.

Chen Weihua, head of video generation at Alibaba DAMO Academy, Ni Bingbing, professor of the Department of Electronics at Shanghai Jiao Tong University, Chen Jianyi, senior vice president of Meitu Group, and Miao Yishu, founder of Haiper AI, attended the forum and delivered keynote speeches.



Chen Weihua, head of video generation at Alibaba DAMO Academy, said that the launch of Sora at the beginning of the year not only demonstrated the huge potential of AI video generation in high definition, high fidelity and high quality, but also inspired people's infinite imagination of this technology.

For all its wow factor, though, Sora's generation process remains hard to control and character consistency hard to guarantee; getting the best results still takes heavy manual post-editing.

"Control of video content is the biggest demand in creation, and also the biggest challenge facing our algorithms today," said Chen Weihua.

Alibaba DAMO Academy's latest AIGC product, the Xunguang Video Creation Platform, aims to improve video production efficiency and solve post-editing pain points. Through a simple storyboard-style shot organization and rich video editing capabilities, it lets users precisely control video content and keep characters and scenes consistent across multiple videos.

Xunguang provides a one-stop tool platform for bringing AI video into wide use. AI will not replace creators' work; it will streamline the video creation workflow and become a new engine powered by creativity.



Ni Bingbing, professor of the Department of Electronics at Shanghai Jiao Tong University, shared vector-oriented media content generation technology.

At the start of his speech, he threw a bucket of cold water on the hype.

"Current generation algorithms face structural and detail problems. For example, the generated content may have extra or missing elements, or may be hand-pierced. For those refined videos that need to conform to physical rules, current generation technology still faces challenges." Ni Bingbing said that the reason is that all generative intelligence is essentially a sampling process, and video is a high-dimensional space. Although the quality of content can be improved by increasing training data and reducing sampling accuracy, due to the extremely high dimensional space, it is still difficult to achieve perfection under the current technical framework.

Computing power is another key constraint. The compute scale of today's large language models and image and video generation models has reached tens, hundreds, or even thousands of T. Going forward, generative intelligence is bound to migrate to edge devices, and edge devices cannot solve the problem by throwing unbounded compute at sampling.

In response, Ni Bingbing proposed a vectorized representation framework that instantiates video content as network parameters, enabling precise manipulation of the generated content and better adherence to the rules of the physical world.
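To make the "vectorized representation" idea concrete, here is a minimal, hypothetical sketch in the spirit of implicit neural representations, assuming PyTorch: the video is stored not as pixels but as the parameters of a small network mapping (x, y, t) coordinates to RGB values. The class and training loop below are illustrative assumptions, not Ni Bingbing's actual framework.

```python
import torch
import torch.nn as nn

class ImplicitVideo(nn.Module):
    """Toy 'vectorized' video representation: the clip lives entirely
    in the network parameters, as a map from normalized (x, y, t)
    coordinates to RGB colors."""
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),  # RGB in [0, 1]
        )

    def forward(self, coords):  # coords: (N, 3)
        return self.net(coords)

def fit(model, frames, steps=1000, lr=1e-3):
    """Fit the network to a target clip; frames: (T, H, W, 3) in [0, 1]."""
    T, H, W, _ = frames.shape
    t, y, x = torch.meshgrid(
        torch.linspace(0, 1, T),
        torch.linspace(0, 1, H),
        torch.linspace(0, 1, W),
        indexing="ij",
    )
    coords = torch.stack([x, y, t], dim=-1).reshape(-1, 3)
    target = frames.reshape(-1, 3)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        idx = torch.randint(0, coords.shape[0], (4096,))  # random pixel batch
        loss = ((model(coords[idx]) - target[idx]) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```

Because the clip now lives in the network's weights, edits can in principle be made by manipulating parameters rather than pixels, which is the kind of precise control over generated content the talk describes.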

He believes the success generative AI has achieved so far has come at the cost of excessive compute and data consumption. Going forward, the focus should be on new representations of media content and new paradigms of generative computing, actively building a higher-quality, more efficient kind of media productivity.



Chen Jianyi, senior vice president of Meitu Group, analyzed the application scenarios and challenges of AI video generation from the perspective of a product manager.

In user research, he noticed two interesting phenomena.

First, industry insiders marvel that a video was generated by AI, but ordinary users do not care whether it was; they only care whether the content is engaging.

"This means that no matter what kind of visual experience AI video generation technology achieves, we must return to the content itself and focus on the values ​​​​and stories that the video wants to convey," said Chen Jianyi.

Second, most ordinary users are unfamiliar with professional terms like "text-to-image" and "text-to-video", and have no idea what they are for. Take "text-to-image": the term is as opaque as Photoshop's "Liquify" function once was. But scope it to a concrete scenario and describe it as a "face-slimming and body-reshaping" feature, and users immediately grasp its value. The same goes for "text-to-video".

At the same time, he said AI video generation makes content expression more concrete and enriches visual creativity and experience, but key problems remain to be solved: controllability of visual settings, of motion, and of audio.

Meitu's AI short video creation platform MOKI is working to overcome these difficulties.

MOKI has built a complete short-film workflow around AI video generation. In the early stages, creators write scripts, design visual styles, and set up characters, then use AI to generate video material. Finally, AI-powered post-production stitches all the material together into a coherent short film.



As the founder of the star startup Haiper AI, Miao Yishu deeply explored the significance and value of video generation technology.

Miao Yishu said: "We often hear such views as 'language is intelligence' or 'large language models are general artificial intelligence (AGI)'. However, can language learning alone really lead us directly to AGI? Language is one of the important ways for humans to acquire knowledge, but it is not the only way. Humans learn through multiple learning methods such as vision, hearing, reading and kinesthetics. AI also needs to learn and build true general intelligence through the fusion of multiple modalities."

After GPT-3.5 launched, many argued that "natural language processing (NLP) no longer exists": by predicting the next word at each step, autoregressive large language models have essentially solved language learning and semantic reasoning, and we no longer even need discriminative models fine-tuned for specific reasoning problems.

Similarly, video generation models are built through autoregression (predicting the next video frame at each step), so the model implicitly learns key computer vision tasks such as depth prediction, semantic annotation, and semantic segmentation. Hence, in 2024 we will hear claims like "computer vision (CV) no longer exists", because in learning to generate video content, the model gradually masters perception and physical laws.
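Both claims rest on the same autoregressive factorization: the model learns to predict the next element (a word token or a video frame) given everything before it. Below is a minimal sketch of that sampling loop, assuming a generic PyTorch-style predictor; the model interface is a hypothetical stand-in, not any specific company's system.

```python
import torch

def autoregressive_sample(model, prefix, num_steps):
    """Generic autoregressive generation: at each step, predict a
    distribution over the next element given everything generated so
    far, then sample from it and condition on the result.

    `model` is any callable mapping a (1, T) tensor of ids to
    (1, T, vocab) logits: a hypothetical stand-in for a next-word
    or next-frame predictor. `prefix` is the (1, T0) conditioning."""
    seq = prefix
    for _ in range(num_steps):
        logits = model(seq)[:, -1, :]       # logits for the next element
        probs = torch.softmax(logits, dim=-1)
        nxt = torch.multinomial(probs, 1)   # sample one id
        seq = torch.cat([seq, nxt], dim=1)  # feed it back in as context
    return seq
```

Whether the ids index words or patches of video frames, the loop is identical, which is exactly the parallel Miao Yishu draws between NLP and CV.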

"Does a puppy need to understand Newton's first law to chase butterflies down the street? Does a five-year-old need to know all the laws of physics to walk or ride a bicycle? The answer is no. Humans learn through continuous interaction with and observation of the world, building models along the way. In fact, by learning to generate diverse video content, a video generation model has built a world model. We can interact with that world model simply through prompts and render the video content we want, without ever explicitly building a simulator to encode so-called physical laws."

Miao Yishu emphasized that "video generation is beyond generating videos". In his view, a video generation model does more than produce video content: it is an important step toward learning basic perception capabilities through multimodal learning, and a necessary path for artificial intelligence on the way to AGI.

Roundtable debate: What is the path to video generation?

In addition to the keynote speeches by four experts and scholars, the forum also invited guests from academia, enterprises, start-ups, and well-known investment institutions to conduct in-depth roundtable discussions on topics such as cutting-edge technologies in video generation and innovative application practices in scene-based industries.



In the first roundtable, Zhu Jiang, founder and CEO of Jingying Technology, Liu Ziwei, assistant professor at Nanyang Technological University, Singapore, Li Feng, head of AI at Shengqu Games Technology Center, and Le Yuan, partner at Yitian Capital, held an in-depth discussion on the theme "Where is the improvement path of video generation technology headed, driven by large models?" and on the prospects for deploying the technology in industry.

Zhu Jiang, founder and CEO of Jingying Technology, likened video generation technology to the Cambrian explosion: both technology and applications are now developing at breakneck speed. He stressed that application-layer companies must stay on top of, and ahead of, the technology while paying close attention to user needs in order to stand out from the competition. In the end, he said, both model companies and application companies will survive, but model companies may skew general-purpose, while application companies must focus on understanding users and the business.

Liu Ziwei, assistant professor at Nanyang Technological University in Singapore, believes video generation is currently in its GPT-3 era, still roughly half a year from maturity. He weighed the pros and cons of three technical paths, diffusion models, Transformers, and language models, and suggested they may converge in the future. He also stressed the need to find the "Newton's first law" of video generation: how to obtain predictable improvements from invested compute and data.
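That "predictable improvement" is usually formalized as a power-law scaling curve, with loss falling as a power of training compute. A hedged illustration of the shape of such a curve follows; the coefficients are invented for illustration, not measurements from any video model.

```python
import numpy as np

# Power-law scaling curve: loss(C) = a * C**(-b) + c, the usual form
# for extrapolating how loss falls as training compute C grows.
# All coefficients here are invented for illustration only.
a, b, c = 10.0, 0.05, 1.5

for compute in np.logspace(20, 23, 4):  # training FLOPs: 1e20 .. 1e23
    loss = a * compute ** (-b) + c
    print(f"compute={compute:.0e} FLOPs -> predicted loss {loss:.3f}")
```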

Li Feng, head of AI at Shengqu Games Technology Center, argued from the game industry's perspective that video generation can boost development efficiency and creativity. He hopes to work with model companies to bring video generation into the game development pipeline, for example using micro-rendering ideas for level design and layout previews, using visuals to align communication during R&D collaboration, and generating other dynamic asset images.

Le Yuan, partner at Yitian Capital, analyzed the commercialization challenges from a capital perspective. Progress over the past two or three years has far exceeded expectations, he said, which is remarkable, but objectively, today's technology still cannot support large-scale commercialization. The methodology for building applications on language models, and the challenges encountered there, apply equally to video applications.



The second roundtable discussion of the forum focused on "Innovation and opportunities in video generation applications under the wave of deconstructing generative AI". Guests from 5Y Capital, FancyTech, Morph AI and Stanford University explored the development direction and application scenarios of video generation technology from multiple perspectives including investment, application, technology and art.

FancyTech founder and CEO Kong Jie believes video generation will bring supply-side reform and let more people participate in content creation. He introduced FancyTech's B2B video generation platform, which helps businesses cut content creation costs by recreating real products in virtual scenes.

Shi Yunfeng, vice president of 5Y Capital, noted that video generation is still at an early stage, similar to the exploratory state when GPT-2 first came out. Finding product-market fit (PMF) on an unsettled technical foundation is very challenging. Although the technology keeps improving, creators are enthusiastic, and some work spreads within a certain range, there is still no mass content consumption; talented product managers are needed to shape the product and create new content forms distinct from the existing information feed.

Morph AI founder and CEO Xu Huaizhe believes that the technology and application of video generation are equally important. As a technical team, they must coordinate the development of the model layer and the application layer. He introduced Morph Studio, an all-in-one AI video production tool, which is based on Morph's leading AI video model. It has been tested globally and has received positive feedback. In the future, Morph will continue to optimize product functions and user experience through user feedback, so that its AI video technology can be implemented faster through products and better help creators.

Rao Anyi, a postdoctoral researcher at Stanford University, argued from the intersection of art and technology that video generation can inspire more interactive modes of creation. He stressed that neither machines nor humans are ever 100% correct, so the creative process needs an interactive refinement mechanism that lets machines and humans create together.

Overall, the roundtable guests were optimistic about the application prospects of video generation technology, while also recognizing that it is still early days and that new business models and application scenarios must be explored to realize greater value.

This forum not only gave practitioners in the AI video field a platform to exchange ideas and learn, but also opened up more opportunities for cooperation across every link of the industry chain. Looking ahead, AI video technology will enjoy broader room for development and richer application scenarios, creating better visual experiences for everyone.