
Video generation war 2.0! Big companies scramble over underlying models as startups attract 4.4 billion yuan in 5 months

2024-07-24



Zhidongxi (public account: zhidxcom)
Author: Vanilla
Editor: Li Shuiqing

If we had to name the hottest track in large models in 2024, video generation would surely be on the list.

After Sora opened a new era of AI video generation in February, the intensive model releases in June this year pushed the video generation war to a new climax.

"The next generation of AI film and television is here," "It's so exciting, one company exits the stage just as another takes it," "There is finally hope of leaving the PPT era behind," "It looks like we'll soon be able to make MVs with AI"... Among AI video creators and practitioners, the most common emotion is excitement.

Since Sora was released, more than eight AI companies at home and abroad have launched new products or models one after another. Publicly available tools now generate videos longer than 10 seconds, and some models reportedly support ultra-long videos of up to 2 minutes. The AI video generation track has erupted into a heated 2.0 war.

On one side, ByteDance launched its first AI video generation product, Jimeng, extending generation time from the common 3-4 seconds to 12 seconds, while Kuaishou's surprise release of its Keling large model, with its stunning results, triggered heated discussion across the internet, the waiting list at one point approaching a million people.


▲The number of people queuing to apply for Kuaishou Keling

On the other side, startup Luma AI pivoted away from 3D generation and launched Dream Machine in a high-profile market entry, while veteran player Runway, not to be outdone, released its next-generation Gen-3 model, pushing physical simulation to new heights.


▲Gen-3 video generation effect

The financing battlefield is equally fierce. Aishi Technology and Shengshu Technology have each raised hundreds of millions of yuan since March; overseas, Pika raised $80 million in June, doubling its valuation to $500 million, and Runway was reported to be preparing a round of up to $450 million.

Sora landed like a bombshell on the AI video generation industry. Now, after five months of fierce competition, how are AI video generation products at home and abroad progressing? Can they compete with Sora? What challenges do they face? Through hands-on comparison of available products and conversations with practitioners and creators, Zhidongxi digs into these questions.

In actual testing, the speed of video generation has clearly improved, outright failures have been greatly reduced, and motion has evolved from simple "PPT-style" panning to camera movement with changing angles and genuine action. Overall, among the free products available, the best results come from Jimeng and Keling, which lead in duration, stability, and physical simulation.

In terms of financing, compared with the period before Sora's release, both the frequency and the size of AI-video-related rounds have increased significantly, attracting more than 4.4 billion yuan in 5 months and drawing capital to "upstream and downstream" products in the video production pipeline, such as AI editing and AI lighting. Many new players are also entering the market, some raising hundreds of millions of dollars before releasing any product or technology.

1. Technology War: Length, HD, and Physical Simulation

On February 16, OpenAI released Sora, revolutionizing the AI video generation track overnight. Yet five months later, Sora remains a "futures" product, and it seems a long wait before the general public can use it.

During this period, large companies and startups at home and abroad have rushed out new products and model upgrades, most of them open to all users and some with stunning results, once again reshaping the AI video generation landscape. After all, no matter how good Sora is, what value does it have if no one can use it?

According to incomplete statistics from Zhidongxi, at least 8 companies have released new products or models, including Vidu from Shengshu Technology; most of them are publicly available.


▲AI video generation product release/model upgrade (Zhidongxi Tabulation)

On February 21, Stability AI officially launched the web version of Stable Video, open to all users. Although its underlying model, Stable Video Diffusion, was open-sourced last November, a raw model still carries deployment and usage barriers; packaged as a web product, it became easy and convenient for far more users.

On April 27, Shengshu Technology, in collaboration with Tsinghua University, released Vidu, a long-duration, highly consistent, highly dynamic video model said to generate videos up to 16 seconds long at 1080P resolution while imitating the real physical world.

Judging from the released demos, Vidu has indeed achieved good results in clarity, motion range, and physical simulation, but unfortunately, like Sora, it is not yet available. Zhidongxi learned from Shengshu Technology that internal testing of the product will begin in the near future.


▲Vidu video demo of Shengshu Technology

On May 9, ByteDance renamed Dreamina, the AI creation platform under Jianying, to "Jimeng" and launched AI image and AI video generation features, supporting videos up to 12 seconds long.

On June 6, Kuaishou released its AI video model Keling and launched it in the Kuaiying app; users need only fill out a questionnaire to apply for access. The Keling large model focuses on high-fidelity simulation of the physical world: for example, "eating noodles," a problem that has stumped many AI models, is handled in its released demo videos.

Currently, Keling generates videos at a fixed length of 5 or 10 seconds. According to its official website, the model can generate videos up to 2 minutes long at 30fps and 1080P, and features such as video continuation will launch later.

On June 13, Luma AI, a startup previously focused on AI-generated 3D, announced Dream Machine, a video generation tool that produces 5-second videos from text or images and offers an extension feature that adds 5 seconds at a time.

On June 17, Runway released the Alpha version of its next-generation Gen-3 model, made available to all users on July 2 with subscriptions starting at $15 per month. Gen-3 currently supports text-to-video at 5 and 10 seconds; image-to-video and other controllability tools are not yet available.


▲Gen-3 Alpha generates video effects

On July 6, HiDream released Zhixiang Large Model 2.0 at WAIC, offering generation durations of 5, 10, and 15 seconds and adding capabilities such as embedded text generation, multi-shot script-based video generation, and IP consistency.

On July 17, Haiper AI, a British AI startup previously focused on AI 3D reconstruction, announced that its video generation product Haiper had been upgraded to v1.5, extending video length to 8 seconds and adding features such as video extension and image quality enhancement.

The following table shows the generation time, resolution, frame rate and other parameters of these models, as well as additional capabilities beyond basic generation.


▲Upgraded AI video generation product parameters (by Zhidongxi)

Parameter-wise, these AI video generation products have made significant progress on duration: the baseline has stretched from 2-4 seconds to 5 seconds, more than half support over 10 seconds, and some also offer extension features. Among the currently available free products, the longest generated video is Jimeng's 12 seconds.

Visual quality has also improved greatly: more products now support 720P and above, with frame rates approaching 24/30fps, compared with earlier products' roughly 1024×576 resolution and 8-12fps.

2. Product War: Hands-on Test of 6 Free "Ready-to-Use" Products, with Douyin and Kuaishou Leading the Pack

When Sora was first released, Zhidongxi conducted an in-depth test of 8 AI video generation tools available in China. At the time, the gap was still quite obvious, with many failures. (See: The first "Chinese version of Sora" tools on the internet! 15 companies competed, and ByteDance took the lead)

So how do these players fare after several months of iteration and upgrades? Zhidongxi tried out the newly released or upgraded AI video generation products. For fairness, we used only the free features and selected the first video generated in each case.

It should be noted that video generation involves an element of luck, much like drawing cards, and is closely tied to prompt writing, so a small number of cases cannot fully represent a model's capabilities.

For the first round, I chose a still life scene, with the prompt: "Close-up of tulips bathed in the warm light of the setting sun."

Stable Video shows high stability on this prompt. Clarity and color richness are relatively high, and the motion comes mainly from camera movement.


▲Stable Video generates video

Dream Machine's picture clarity drops a clear notch, but its rendering of the prompt is still fairly accurate, and the motion is again mainly camera panning.


▲Dream Machine generates video

The video produced by Haiper has good visual quality, though the range of motion is on the small side.


▲Haiper generates video

The Zhixiang large model also performs well, with a strong depth-of-field effect, but a close look at the petals reveals some flaws in detail and stability.


▲Zhixiang large model generates video

Jimeng generates a fixed-camera shot, with the motion mainly in the swaying tulips; the overall effect is relatively stable.


Keling's video takes the word "close-up" to the extreme, with high clarity and detailed petal texture. Then again, how to interpret "close-up of tulips" has no fixed answer, so it is hard to say who is right or wrong.


▲ Keling generates video

Overall, the performance of various players in still life scenes is very stable, and the generated videos are highly usable.

For the second round, I chose an animal scene with added stylization and dynamic action, using the prompt: "A cartoon kangaroo dancing disco." This is actually one of Sora's published examples, so first let's look at Sora's version.


▲Sora generated video example

Stable Video failed this round. The first frame was perfect, perhaps because of Stable Video's workflow: it first generates four images for the user to choose from, then generates the video from the chosen image. After that, the kangaroo's entire body began to distort.

More interestingly, the people and anthropomorphic animals in the background show few problems; perhaps it was the "disco dancing" action itself that stumped Stable Video.


▲Stable Video generates video effects

Dream Machine's video has good overall stability, though details such as the kangaroo's feet and hands remain unstable. As for motion, beyond the kangaroo's own movement, the camera also pushes from a close-up out to a panoramic view.

I also tried Dream Machine's video extension feature; the last 5 seconds of the video are the extended content. It is not limited to a single shot, switching from full body to an upper-body close-up. However, while the crowd in the background is more stable in the extension, the kangaroo becomes less stable.


▲Dream Machine generates video effects

The kangaroo generated by Haiper is somewhat distorted and does not reflect the keyword "dancing disco".


▲Haiper generates video

The Zhixiang model failed badly at this level: like Stable Video, its main subject distorted significantly, and "disco dancing" was not reflected.


▲Video effect generated by the Zhixiang large model

The video generated by Jimeng looks good overall, with high clarity and rich colors. In terms of stability, the first few seconds are relatively normal, but there is obvious distortion in roughly the last 3 seconds, comparable in degree to Dream Machine.

In terms of semantic understanding, the picture shows some "dancing" movements, but they have little to do with "disco." In addition, the text in the background appears as garbled, unreadable glyphs.


▲Jimeng generates video effect

The video generated by Keling is relatively stable overall, with the main problems concentrated on the hands and eyes. However, in terms of semantic understanding, the keyword "dancing disco" is not reflected.


▲Keling generates video effects

Overall, Dream Machine, Jimeng, and Keling performed better, but none of them could reach Sora's level. In addition, this prompt also shows the aesthetic differences between the models, including color tendencies, style choices, camera switching, etc.

The third round was a character close-up, with the prompt: "Close-up of an astronaut floating outside a space station, with the Earth and Moon in the background and stars reflected in the visor of his helmet."

Stable Video performs well in this round, accurately depicting "astronaut," "Earth," "Moon," and "star reflections," with very high stability. The motion is not a simple camera pan but movement of the subject relative to the background.


▲Stable Video generates video

Dream Machine went awry, completely forgetting about the "astronauts" and drawing a cosmic scene.


▲Dream Machine generates video

Haiper performed well in this round. Although it missed the "Moon," the other keywords are reflected, and the reflection in the helmet looks very natural.


▲Haiper generates video

The Zhixiang model initially refused the prompt, flagging sensitive content. After trimming the prompt several times, I finally generated a video from "a close-up of a man floating outside the space station."

The overall picture is quite realistic. Although the final prompt mentions only the "space station," the video still depicts elements such as the Earth and a spacesuit. The protagonist, however, wears no space helmet, and how he breathes, let alone speaks, is anyone's guess (doge).


▲Video effect generated by the Zhixiang large model

Jimeng is quite good at character detail: the face and costume are relatively delicate, and stability is very high. However, there seems to be a second "Earth" in the background, and the framing is closer to a medium shot than a true close-up.


▲Jimeng generates video

Keling's video shows no person at first; then the astronaut slowly drifts into frame while the background stays still, which looks a bit comical. Still, the accuracy and stability of the video are very high: every keyword is reflected, including the "space station" that some contestants missed.


▲ Keling generates video

Although the character round is less stable overall than the still life scene, it is much better than the animal round, which may reflect abundant training data and small motion ranges. The best performers here were Stable Video, Haiper, Jimeng, and Keling.

Overall, among the 6 AI video generation products Zhidongxi tested this time, Jimeng and Keling lead noticeably, achieving good results in both duration and stability. Domestic products such as Morph Studio and NeverEnds also perform very well, but since they have not released new products or model upgrades since Sora, they were not included in this test.

3. Capital War: 4.4 Billion Yuan Raised in 5 Months, New Players Emerge

When Sora was released, it set off another generative AI craze, just as GPT-4 once did, sending text-to-video concept stocks collectively to their daily limit.

The primary market has also seen a new wave of celebration. According to incomplete statistics from Zhidongxi, in the five months since Sora's release, at least 5 startups on the AI video generation track have each raised over 100 million yuan, totaling roughly 1.2 billion yuan; in addition, Runway was reported to be negotiating $450 million (about 3.268 billion yuan) in new financing.


▲Large-scale investment and financing related to AI video generation (by Zhidongxi)

Domestically, Aishi Technology raised two rounds worth hundreds of millions of yuan in March and April, backed by well-known investors such as Ant Group. Before that, it had raised only a tens-of-millions-yuan angel round last August.

In January this year, Aishi Technology launched PixVerse, the overseas version of its AI video generation product, which became a dark horse competing with Pika and Runway. After Sora's release, founder Wang Changhu said it would surpass Pika and Runway within 3-6 months.

Five months on, Aishi Technology has not released an iteration of its underlying model, but has shipped new features such as character consistency and motion brushes. Zhidongxi asked about its product progress and learned that a new generation of models and a new "text-to-video feature film" capability will be released this week, generating 8-second videos and producing 3-5 continuous shots with audio at one time.


▲PixVerse launches motion brush function (Source: Aishi Technology)

Shengshu Technology likewise closed two rounds worth hundreds of millions of yuan within just three months, with Baidu Ventures re-investing as an existing shareholder. Before that, Shengshu had raised over 100 million yuan across two earlier rounds.

Sand AI is a startup that has only just entered the public eye and has yet to release a product. On July 10, it was revealed that Sand AI had raised tens of millions of dollars in a Series A led by Capital Today in May.

Sand AI was founded in October 2023 and is developing Sora-like video generation technology. Notably, its founder Cao Yue was a co-founder of Light Years Beyond, and before that headed the Visual Model Research Center at the Beijing Academy of Artificial Intelligence and was a senior researcher at Microsoft Research Asia.

Public information shows that Cao Yue graduated from Tsinghua University for both his undergraduate and doctoral degrees. He won the Marr Prize for best paper at the top computer vision conference ICCV and has been cited over 40,000 times on Google Scholar.


▲Cao Yue (Photo source: his personal homepage)

Haiper AI is also a newcomer to the video generation field. Founded in London, UK, in 2022, the company previously focused on AI-based 3D reconstruction.

According to foreign media reports in March, Haiper AI received US$13.8 million (approximately RMB 100 million) in seed round financing, having previously raised US$5.4 million in April 2022.

Haiper AI was founded by two Chinese entrepreneurs: Yishu Miao, formerly of TikTok's global trust and safety team, and Ziyu Wang, formerly a research scientist at DeepMind. Late last year, the team decided to focus on video generation, releasing a beta of its eponymous first video product in December.


▲Haiper releases a beta version of the product with the same name

Pika announced a new round of roughly $80 million (about 581 million yuan) in June, doubling its valuation to nearly $500 million. Last November, Pika had announced a cumulative $55 million in financing at a valuation of $200-300 million.

On July 2, Runway, the "old player" of the AI video generation track, was reported to be negotiating $450 million (about 3.268 billion yuan) in new financing at a $4 billion valuation.

Runway's last round closed in June last year with investors including Google and Nvidia, raising $141 million at a $1.5 billion valuation and bringing its total financing to $237 million. If the new round closes, both its total financing and its valuation will more than double.

In general, in the months since Sora's release, new AI video generation financings have kept appearing in the primary market, not only more frequent but also much larger, with single rounds often exceeding a company's previous cumulative total. Even startups that have yet to release a product or upgrade a model have not deterred investors' enthusiasm.

4. 150 Days of the AI Video War: From "PPT" to Real "Video"

In the 150 days of Sora's "invisibility," under siege from large companies and startups alike, mainstream AI video generation products have greatly narrowed the gap with Sora, with one crucial difference: they are ready to use, and many features are even free.

At present, the leading AI video generation products have achieved good duration and stability, and the next iteration will focus on physical simulation. Judging from official demos, Gen-3, Keling, Jimeng, and Vidu simulate the real world with high fidelity, and their curated examples are nearly on par with those released for Sora.

So from the creator's perspective, what is the current product experience like?

Recently, director and AI film-and-television creator Chen Kun (Xianren Yikun) remade the trailer for his AI short drama "Mountain and Sea Mirror" and compared it with the original.

At the short drama's premiere, he told Zhidongxi and other media that AI's progress over the past six months is very obvious, especially in physical simulation, where in his view it has achieved a "generational" iteration. Specifically, video generation models such as Keling now deliver native high definition rather than relying on sliced content; subjects move plausibly, with motion that is both large in range and smooth, and the models respond well to prompts. At the same time, AI video generation still faces several major pain points: character consistency, scene consistency, character performance, action interaction, and motion range.


▲Comparison between the remake and the original trailer of "Shan Hai Qi Jing"

From the application perspective, in scenarios such as film and television production, AI is still in the process of catching up with traditional film and television.

Within a complete production pipeline, AI remains an auxiliary tool rather than the main one: in scripting, dubbing, editing, post-production, and other stages, no current product reaches productivity-grade quality.

In terms of cost, however, including labor, AI-based workflows have compressed expenses to less than a quarter of traditional production processes.


▲Chen Kun was interviewed at the preview

At WAIC 2024, Xie Xuzhang, co-founder of Aishi Technology, said that what we now call "video generation" is really just the generation of video material, a small part of the full video production process: there is no sound, editing, transitions, or script. Technically and commercially, there is still a very long way to go.

This points to another important direction for AI video, alongside continuing to iterate the underlying models to overcome the existing pain points of video generation.

Many companies in the market are also exploring other parts of the video production process, and they too are favored by the primary market. In the past week alone, the AI-driven video editing tool Captions and Beeble, which provides lighting and compositing tools for AI virtual environments, raised $60 million and $4.75 million respectively.

Conclusion: AI Video Generation Awaits Its GPT-4 Moment

Sora's release has ignited enthusiasm among large-company teams and entrepreneurs at home and abroad, but overall the field is still early: there is no consensus on the technical route, and generation quality remains some distance from commercial standards. As for which stage exactly, many industry insiders compare it to the early days of language and image models, the "GPT-3 era" or "image generation on the eve of 2022."

But what is certain is that AI video generation technology is developing exponentially, with new products and technologies constantly emerging. Although there are some technical pain points and challenges, with the iteration of technology and the promotion of the market, this field is expected to achieve more breakthroughs and applications.

The battle of AI video generation is not only a competition of technology, but also a competition of capital. In this storm of money-making, who will have the last laugh? Let's wait and see.