
Half a year has passed: where has AI video gone?

2024-07-23




Dingjiao original

Author | Wang Lu

Editor | Wei Jia

Since Sora made its debut earlier this year, people at home and abroad have been trying to use AI to subvert Hollywood. The AI video scene has been very lively recently, with products released one after another, each claiming it will surpass Sora.

Two foreign AI video startups took the lead. Luma, a San Francisco artificial intelligence company, launched the Dream Machine video generation model along with a movie-grade promotional video, and opened the product for users to try for free. Runway, another startup well known in the AI video field, announced it would open its Gen-3 Alpha model to some users for testing, claiming it can render details such as light and shadow.

Not to be outdone, Kuaishou launched the web version of Keling, which lets users generate videos up to 10 seconds long and offers first-and-last-frame control and camera movement control. Kuaishou's original AI fantasy short drama "Shanhai Qijing: Cutting the Waves" is airing on the platform, with all of its visuals generated by AI. The AI science fiction short drama "Sanxingdui: Future Revelation", produced with ByteDance's AI video product Jimeng, was also broadcast recently.

AI video is updating so quickly that many netizens exclaimed, "Hollywood may go on strike again."

Today, the AI video track includes domestic and foreign technology and internet giants such as Google, Microsoft, Meta, Alibaba, ByteDance, and Meitu, as well as startups such as Runway and Aishi Technology. According to incomplete statistics from "Dingjiao", in China alone about 20 companies have launched self-developed AI video products or models.

According to data from TouBao Research Institute, China's AI video generation market was worth 8 million yuan in 2021 and is expected to reach 9.279 billion yuan by 2026. Many industry insiders believe the video generation track will have its "Midjourney moment" in 2024.

What stage have these would-be Soras actually reached? Which one is the strongest? Can AI take down Hollywood?

Siege of Sora: Many products, few usable ones

Many products and models have been launched on the AI video track, but only a few can actually be used by the public. The most prominent representative abroad is Sora, which half a year later is still in closed testing, open only to security teams and some visual artists, designers, and filmmakers. The situation in China is similar: Alibaba DAMO Academy's AI video product "Xunguang" and Baidu's AI video model UniVG are both in closed testing, and even the currently popular Kuaishou Keling requires users to queue up and apply for access. This alone rules out more than half of the products.

Among the remaining AI video products that can be used, some set usage thresholds: users need to pay or understand certain technologies. For example, a user with no coding knowledge will not know where to start with Open-Sora from Luchen Technology.

"Dingjiao" sorted out the AI ​​video products released at home and abroad and found that the operation methods and functions of each company are similar. Users first generate instructions with text, and then select functions such as frame size, image clarity, generation style, and generation seconds, and finally click one button to generate.

The technical difficulty behind these features varies. The hardest part is the resolution and length of the generated video, which is also the focus of competition in AI video companies' marketing. The reason is closely tied to the quality of the material and the computing power used during training.

AI researcher Cyrus told "Dingjiao" that most AI video tools at home and abroad currently generate at 480p/720p, and a small number support 1080p high-definition video.

He said that the more high-quality material and computing power go into training, the higher the quality of video the trained model can generate; but a model cannot be pushed beyond the material it was trained on. If a model trained on low-resolution material is forced to generate high-resolution video, the output collapses or repeats, producing artifacts such as extra hands and feet. Such problems can be patched by upscaling, repairing, and redrawing, but the results and details are mediocre.
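As an illustration of that "upscale and repair" patch-up, here is a minimal sketch. The upscale step assumes Pillow; `repair_frame` is a hypothetical placeholder for whatever inpainting or redraw model a product might apply.

```python
from PIL import Image  # pip install pillow

def upscale_frame(frame: Image.Image, factor: int = 2) -> Image.Image:
    """Naively enlarge a generated frame; detail is interpolated, not recovered."""
    w, h = frame.size
    return frame.resize((w * factor, h * factor), Image.LANCZOS)

# A repair/redraw pass (e.g. an inpainting model fixing extra limbs) would
# follow here; `repair_frame` is a hypothetical placeholder for it.
# fixed = repair_frame(upscale_frame(frame))
```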

Many companies also use clip length as a selling point.

Most domestic AI video tools generate 2-3 second clips, and products that reach 5-10 seconds are considered relatively strong. A few popular products, such as Jimeng, go up to 12 seconds. None of them matches Sora, which once claimed a maximum of 60 seconds; but since Sora has not been opened for use, its actual performance cannot be verified.

Length alone is not enough; the content of the generated video must also hold together. Zhang Heng, chief researcher at Shiliu AI, told "Dingjiao": technically, AI can be made to output continuously, and it is no exaggeration to say that generating an hour of video is not a problem. But most of the time, what we want is not surveillance footage or a looping landscape animation, but a short film with beautiful images and a story.

"Dingjiao" tested five popular free AI products in China: ByteDance's Jimeng, Morph AI's Morph Studio, Aishi Technology's PixVerse, MewXAI's Yiying AI, and Right Brain Technology's Vega AI, giving each the same text prompt: "A little girl in a red skirt, in the park, feeds a white rabbit with carrots."

The generation speeds are similar, about 2-3 minutes each, but clarity and duration vary widely, and accuracy is all over the map. The results are as follows:


[Embedded test clips, in order: Yiying AI, Vega AI, Jimeng, Morph Studio, PixVerse]

Each has obvious strengths and weaknesses. Jimeng wins on duration, but its image quality is not high, and the main character, the little girl, deforms outright in the later part of the clip. Vega AI has the same problem. PixVerse's image quality is poor.

In comparison, the content generated by Morph Studio is very accurate, but only 2 seconds long. Yiying AI's image quality is also good, but it does not understand the text well enough and drops the key element, the rabbit; its output is also not realistic enough, tending toward the cartoonish.

In short, no product has yet been able to provide a video that meets the requirements.

AI video challenges: accuracy, consistency, and richness

"Dingjiao's" hands-on experience differs sharply from the promotional videos the companies release. If AI video is to be truly commercialized, there is still a long way to go.

Zhang Heng told "Dingjiao" that, technically, they judge the level of different AI video models along three dimensions: accuracy, consistency, and richness.

Zhang Heng gave an example to illustrate how to understand these three dimensions.

For example, generate a video of "two girls watching a basketball game on the playground".

Accuracy shows up in three places. First, accurate understanding of content structure: the objects in the video must be girls, and there must be two of them. Second, accurate process control: after a shot is made, the basketball must gradually fall from the net. Third, accurate modeling of static objects: when the view is partially blocked, the basketball cannot turn into a rugby ball.

Consistency refers to AI's modeling capabilities in time and space, which includes subject attention and long-term attention.

Subject attention can be understood as: while watching the basketball game, the two girls must stay in the picture and cannot wander off. Long-term attention means that as things move, the elements in the video cannot be lost, deformed, or otherwise corrupted.

Richness means that AI also has its own logic and can generate some reasonable details even without text prompts.
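One way to picture the three dimensions is as a checklist rubric. The sketch below encodes Zhang Heng's basketball example that way; the structure and the toy scoring function are illustrative assumptions, not an actual benchmark used by any company.

```python
# Zhang Heng's basketball example, encoded as a checklist rubric.

RUBRIC = {
    "accuracy": [
        "content structure: exactly two girls, a playground, a game",
        "process control: the ball falls from the net after a shot",
        "static modeling: an occluded basketball stays a basketball",
    ],
    "consistency": [
        "subject attention: both girls stay in the frame",
        "long-term attention: no element is lost or deformed over time",
    ],
    "richness": [
        "plausible details the model adds without being prompted",
    ],
}

def score(checks_passed: dict) -> float:
    """Toy aggregate: fraction of rubric checks a clip passes."""
    total = sum(len(checks) for checks in RUBRIC.values())
    return sum(checks_passed.values()) / total

print(score({"accuracy": 2, "consistency": 1, "richness": 1}))  # 4/6 ≈ 0.67
```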

Basically, the AI video tools available on the market have not been able to fully achieve the above dimensions, and companies are constantly proposing solutions.

For example, on the crucial problem of character consistency, Jimeng and Keling hit on using image-to-video generation in place of text-to-video generation. That is, users first generate a picture from text and then generate a video from the picture, or directly supply one or two pictures and let the AI connect them into a moving video.

"But this is not a new technological breakthrough, and the difficulty of image-generated video is lower than that of text-generated video," Zhang Heng told Dingjiao. The principle of text-generated video is that AI first analyzes the text input by the user, breaks it down into a group of lens descriptions, converts the descriptions into text and then into pictures, and then gets the middle key frames of the video. By connecting these pictures, you can get a continuous video with action. Image-generated video is equivalent to giving AI a specific picture to imitate, and the generated video will continue the facial features in the picture to achieve consistency of the protagonist.

He also said that in real scenarios, image-to-video results better match user expectations, because text has limited power to express visual detail and a reference picture helps the generation; even so, it is not yet at a commercial level. Intuitively, 5 seconds is the practical upper limit of image-to-video, and going past 10 seconds is rarely meaningful: either the content repeats, or the structure distorts and quality drops.

Currently, many short films and TV shows that claim to be produced entirely with AI mostly use image-to-video or video-to-video.

"Dingjiao" also tried Jimeng's image-to-video generation with its last-frame feature. The results are as follows:



In the stitching process, the characters became deformed and distorted.

Cyrus also noted that video requires continuity, and many AI video tools that support image-to-video infer the subsequent motion from a single frame. Whether the inference turns out right still comes down to luck.

It is understood that, to achieve protagonist consistency in text-to-video, companies do not rely on generation alone. Zhang Heng said that most models start from the underlying DiT model and stack various techniques on top, such as ControlVideo (a controllable text-to-video generation method proposed by Harbin Institute of Technology and Huawei Cloud), to deepen the AI's memory of the protagonist's facial features so the face does not drift too much during motion.
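Conceptually, stacking a control technique on a DiT backbone means injecting an extra conditioning signal into the denoising loop. The sketch below shows only that idea; the `model` object and all of its methods are hypothetical, and this is not ControlVideo's actual API.

```python
# Conceptual sketch only: `model` and its methods are hypothetical.
# The point is where the extra conditioning signal enters the loop.

def generate_consistent_video(prompt, reference_face, model, steps=50):
    face_emb = model.encode_face(reference_face)  # frozen identity anchor
    latents = model.init_latents(prompt)
    for t in range(steps):
        # Injecting the face embedding at every denoising step is what
        # "deepens the memory" of the protagonist's features during motion.
        latents = model.denoise(latents, t, prompt, extra_cond=face_emb)
    return model.decode(latents)
```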

However, this is still at the trial stage. Even with these added techniques, the character-consistency problem has not been completely solved.

AI video: why is it evolving so slowly?

In the AI circle, the United States and China are currently the two fiercest competitors.

The report accompanying the 2023 "World's Most Influential AI Scholars" list (the "AI 2000") shows that of the 1,071 institutions represented on the AI 2000 lists across the four years from 2020 to 2023, 443 are in the United States, followed by China with 137. By country, the 2023 AI 2000 scholars are also led by the United States, with 1,079 selected, 54.0% of the global total, followed by China with 280.

In the past two years, in addition to great progress in images and music, AI has also made some breakthroughs in the hardest area of all: video.

At the recent World Artificial Intelligence Conference, Yitian Capital partner Le Yuan said publicly that video generation technology has progressed far beyond expectations in the past two to three years. Liu Ziwei, an assistant professor at Nanyang Technological University in Singapore, believes video generation is currently in its GPT-3 era and will take about half a year to mature.

However, Le Yuan also stressed that the technology is still not mature enough to support large-scale commercialization, and that the methodology and challenges of building applications on top of language models also apply to video-related applications.

At the beginning of the year, Sora's debut shocked the world. It achieved technical breakthroughs in diffusion-based generation with DiT, a diffusion model built on the transformer architecture, improving the quality and realism of generated images and giving AI video a major leap forward. Cyrus said that most text-to-video products at home and abroad currently use similar techniques.


Image source: Sora official website

For now, everyone's underlying technology is basically the same. Although each company is pursuing breakthroughs on top of it, richer product features mostly come down to more training data.

When using ByteDance's Jimeng or Morph AI's Morph Studio, users can choose how the camera moves in the generated video. What differs behind this feature is the dataset.

"In the past, the images used by various companies in training were relatively simple. They mostly labeled the elements in the images, but did not explain what lens was used to shoot the element. This allowed many companies to discover this gap, so they used 3D rendered video datasets to complete the lens features." Zhang Heng said that currently these data come from renderings from the film and television industry and game companies.

"Fixed focus" also tried this function, but the lens changes were not very obvious.

Sora and its peers are developing more slowly than GPT and Midjourney because video has a time dimension, and training video models is harder than training text or image models. "All the video training data that can be used has already been mined, and we are also thinking of new ways to create a batch of data usable for training," said Zhang Heng.

Moreover, each AI video model has its own strengths. For example, Kuaishou Keling does better with food videos because it is backed by a large amount of such data.

Shen Renkui, founder of Shiliu AI, believes AI video technologies include text-to-video, image-to-video, video-to-video, and avatar-to-video. Digital humans with customized appearance and voice are already used in marketing and have reached a commercial level, while avatar video still has to solve accuracy and controllability.

For now, whether it is the AI sci-fi short drama "Sanxingdui: Future Revelation" co-produced by Douyin and Bona, or Kuaishou's original AI fantasy short drama "Shanhai Qijing: Cutting the Waves", these are mostly cases of large-model companies actively seeking out film and television production teams to work with: they need to promote their own technology products, and neither work has broken into the mainstream.

In the field of short videos, AI still has a long way to go, and it is premature to say that it has taken over Hollywood.

*The title image comes from Pexels.