
Generate a video in 30 seconds, free and unlimited: "Qingying", released today by Zhipu, the "Chinese OpenAI", is already being played with like crazy

2024-07-26



Over the past six months, video generation models at home and abroad have entered a new round of rapid technological progress, and each new release quickly goes viral on social networks.

However, unlike the perceived "falling behind" in language models, recent trends show that China's progress in video generation has pulled well ahead internationally. Many foreign netizens say that videos from China's Keling AI are setting the internet on fire while OpenAI's Sora sleeps.

Today, Zhipu AI, a leading domestic large-model maker, also released its AI video generation product "Qingying".


Of course, AI video models at home and abroad all still have plenty of flaws. But compared with Sora, which remains a "futures product", these AI video tools are visible and tangible; at worst, you may need a few extra attempts before you "draw" a usable video.

And this exploration itself is part of technological progress.

Just as GPT-3 was questioned and criticized when it first appeared but proved its value over time, these AI video generation tools may need only a little more time to transform from toys into tools.

Qingying PC access link:
https://chatglm.cn/video?fr=opt_homepage_PC
Qingying mobile access link:
https://chatglm.cn/video?&fr=opt_888_qy3

Generate a 6-second video in half a minute: "Zhipu Qingying" is officially released

Many people may be more familiar with Zhipu Qingyan than with the Qingying released today. But rather than judging the product by its advertising, take a look first at some demos generated by "Qingying".

In a lush forest, sunlight streams through gaps in the leaves, creating a Tyndall effect that gives the light visible shape.


A tsunami roars in like a raging monster and an entire village is swallowed by the sea in an instant, like a classic scene from a doomsday movie.


In a neon-lit city at night, a little monkey with a mechanical aesthetic wields high-tech tools to repair equally flashy, futuristic electronic equipment.


Switching art styles, a kitten opens its mouth wide with a human-like look of confusion, question marks practically written all over its face.


No palace scheming, no intrigue, only the sincere sisterly bond between Zhen Huan and Mei Zhuang in an embrace across the screen that transcends time and space.


Moreover, thanks to CogVideoX, the efficient video generation model developed by Zhipu's large-model team, Qingying supports multiple generation modes, including text-to-video and image-to-video, and can be applied to advertising production, film editing, short-video production and other fields.

Qingying has strong instruction-following capability and can fully understand and execute the prompts users give it.

According to Zhipu, the company developed an end-to-end video understanding model to produce detailed, content-faithful descriptions for massive amounts of video data. This strengthens the model's text comprehension and instruction following, so the generated videos better match what users ask for.


For content coherence, Zhipu AI's self-developed efficient three-dimensional variational autoencoder (3D VAE) compresses the original video space to 2% of its size. Combined with a 3D RoPE positional-encoding module, this makes it easier to capture relationships between frames along the time dimension and to establish long-range dependencies across the video.
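The article does not spell out how the 3D RoPE is applied; as an illustrative sketch only (the even channel split across time/height/width and all sizes below are our assumptions, not CogVideoX's actual configuration), here is a minimal numpy version of rotary position encoding extended to three axes:

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotary embedding along one axis.
    x: (n, d) token features with d even; pos: (n,) integer coordinates on that axis."""
    d = x.shape[-1]
    half = d // 2
    inv_freq = base ** (-np.arange(half) / half)      # (half,)
    ang = pos[:, None] * inv_freq[None, :]            # (n, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def rope_3d(x, t, h, w):
    """Split the channel dimension into three equal groups and rotate each group
    by one axis's coordinate (time, height, width). d must be divisible by 6."""
    d = x.shape[-1]
    c = d // 3
    chunks = [rope_1d(x[:, i * c:(i + 1) * c], pos) for i, pos in enumerate((t, h, w))]
    return np.concatenate(chunks, axis=-1)

# Toy example: a 4-frame, 2x2 latent grid (16 tokens) with 12-dim features.
T, H, W, D = 4, 2, 2, 12
tt, hh, ww = np.meshgrid(np.arange(T), np.arange(H), np.arange(W), indexing="ij")
tokens = np.random.randn(T * H * W, D)
rotated = rope_3d(tokens, tt.ravel(), hh.ravel(), ww.ravel())
print(rotated.shape)  # (16, 12)
```

The point of the per-axis rotation is that attention scores become sensitive to relative offsets in time as well as in space, which is what lets the transformer model long-range dependencies across frames.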

For example, how many steps does it take to turn a potato into French fries? No need to fire up the stove: one simple prompt, and the potato becomes a plate of golden, tempting fries. Zhipu says that no matter how wild your ideas are, Qingying can turn them into reality one by one.


In addition, CogVideoX, whose design draws on Sora's algorithm, is also a DiT architecture that integrates the three dimensions of text, time, and space. After technical optimization, CogVideoX's inference speed is 6 times that of its predecessor, CogVideo; in theory, the model needs only 30 seconds to generate a 6-second video.

By comparison, Keling AI, currently in the first tier, usually takes 2 to 5 minutes to generate a single 5-second video.

At today's launch event, Zhipu AI CEO Zhang Peng had Qingying generate a video of a cheetah sleeping on the ground, its body gently rising and falling; the task finished in about 30 seconds. Making a static rose "bloom", however, took longer.

In addition, Qingying's video resolution reaches 1440x960 (3:2) at a frame rate of 16 fps.

Qingying also thoughtfully provides a music soundtrack function, so you can add music to the generated video and publish it directly.

I thought the static picture of an astronaut playing the guitar was fantastic enough, but when it moved, accompanied by a leisurely melody, it was as if the astronaut was holding a concert in space.

Unlike the "futures product" Sora, "Qingying" does not play at scarcity marketing: it is fully open from the moment it goes online, and anyone can try it without a reservation or a queue. Subsequent versions will add higher-resolution and longer-duration generation.

Zhang Peng also said at the Zhipu Open Day: "All users can experience text-to-video and image-to-video capabilities through Ying."

Qingying is currently in an initial testing period during which all users can use it for free. For a smoother experience, 5 yuan unlocks the high-speed channel for one day (24 hours), and 199 yuan unlocks the paid high-speed channel for a full year.

In addition, the Ying API launched simultaneously on the open platform bigmodel.cn; enterprises and developers can access the text-to-video and image-to-video capabilities by calling the API.
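The article does not document the API itself. As a rough sketch only, the snippet below assumes the zhipuai Python SDK exposes an asynchronous video endpoint for this model; the method names (videos.generations, retrieve_videos_result), the model id "cogvideox", and the response fields are our assumptions, so check the bigmodel.cn documentation before relying on them:

```python
# Sketch only: method names, model id and response fields are assumptions
# based on zhipuai SDK conventions, not taken from the article.
import time
from zhipuai import ZhipuAI

client = ZhipuAI(api_key="YOUR_API_KEY")

# Submit an asynchronous text-to-video task.
task = client.videos.generations(
    model="cogvideox",
    prompt="A corgi lies on a beach towel, basking in warm sunlight, gentle waves behind",
)

# Poll until the task finishes, then print the resulting video URL.
while True:
    result = client.videos.retrieve_videos_result(id=task.id)
    if result.task_status == "SUCCESS":
        print(result.video_result[0].url)
        break
    if result.task_status == "FAIL":
        raise RuntimeError("video generation failed")
    time.sleep(5)
```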

The barrier to entry is low, but you still have to "pull the gacha"; newcomers no longer need to agonize over writing prompts

APPSO also tried Qingying right away. After testing a number of scenarios, we summarized some experience with using it:

  • Video generation is like "alchemy": output is unstable, so it pays to try several times
  • The ceiling of the result depends on the prompt; keep the prompt structure as clear as possible
  • Close-up shots work best; other shot types are less stable
  • How reliably different subjects come out, roughly ranked: animals > plants > objects > buildings > people

A scientist who doesn't appreciate art is not a good scientist. Einstein takes to the guitar like a fish to water, bobbing his head to a rhythm of his own; it hardly looks like he is acting.


The giant panda can also play the guitar very well and is very versatile.


The usually serious Tang Monk waves hello to you and sways to the rhythm.


Of course, those were some of the better results. Along the way, we also accumulated plenty of failed takes.

For example, when the emperor lying in bed was asked to eat a chicken drumstick with his right hand, an extra hand appeared out of thin air, and in the final second of the video the emperor looked as if he were about to switch into feminine makeup and hair.


Or the moment Leslie Cheung looked at the camera, the "Gor Gor" in our minds had already become "some other man".


Unnatural transitions in character motion, physics in complex scenes that cannot be accurately simulated, insufficient accuracy in generated content: these shortcomings are not unique to Qingying but are current limitations of video generation models in general.

In practice, users can improve video quality by refining their prompts, but failures are still common. Fortunately, a reasonably well-written prompt largely guarantees the model's floor.

To help newcomers along, we have also prepared some tips for writing prompts:

  • Simple formula: [camera movement] + [scene setup] + [more details]
  • Complex formula: [shot language] + [light and shadow] + [subject (subject description)] + [subject movement] + [scene (scene description)] + [mood/atmosphere]

For example: The camera pans (camera movement) to a little boy sitting on a park bench (subject description), a steaming cup of coffee in his hand (subject movement). He wears a blue shirt and looks happy (subject detail), against the backdrop of a tree-lined park with sunlight filtering through the leaves onto the boy (scene description). A small helper that strings these labeled pieces into one prompt is sketched below.
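As a trivial illustration of the complex formula above (the field names are our own labels, not an official Qingying schema), this is all the "formula" amounts to in code:

```python
def build_prompt(shot, light, subject, movement, scene, mood=""):
    """Assemble a video prompt from the 'complex formula' pieces.
    The parameter names are our own labels, not an official Qingying schema."""
    parts = [shot, light, subject, movement, scene, mood]
    return ", ".join(p for p in parts if p)

print(build_prompt(
    shot="The camera pans to",
    light="soft afternoon sunlight filtering through leaves",
    subject="a little boy in a blue shirt sitting on a park bench",
    movement="holding a steaming cup of coffee and smiling",
    scene="a tree-lined park",
    mood="warm and relaxed",
))
```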

If you are still stuck, I recommend using the agent provided in Zhipu Qingyan to help write video prompts. Even if you input an ordinary, everyday scene, you get three high-quality prompts back.


For example, a casual line like "Corgi sunbathing on the beach" yields Chinese and English prompts in a natural-photography style, plus watercolor, 3D animation and other styles to choose from:

English: On a sunny beach, a Corgi lies lazily on a beach towel, basking in the warm sunlight. The camera captures the scene from a low angle, showcasing the vast blue ocean and pristine white sand in the background, with gentle waves lapping at the shore. The atmosphere is tranquil, captured in 4K ultra-high definition.

Seeing such a satisfying prompt: yes, that is exactly what we had wanted to write.

Attached is the address of the Qingying prompt agent (text-to-video): https://chatglm.cn/main/gdetail/669911fe0bef38883947d3c6

The same goes for image-to-video: upload the main image and select a style, and Zhipu Qingyan will write the corresponding prompt for you. Going from no prompt, to "put on glasses", to "Tang Monk stretches out his hand and puts on glasses" makes a very noticeable difference in the result.


Attached is the address of the Qingying prompt agent (image-to-video): https://chatglm.cn/main/gdetail/669fb16ffdf0683c86f7d903

To do good work, one must first sharpen one's tools. If you want to go a step further, you can also explore the other content creation tools inside Zhipu Qingyan.

From gathering topic material, to script writing, to image and video creation, and on to promotional copy, it covers almost the entire creative chain of video generation. You only need to come up with the idea and leave the rest to it.

We found that recently released AI video products, including Keling, have improved controllability through methods such as first and last frame control.


AI creator Chen Kun once told APPSO that almost all AI videos suitable for commercial delivery are image-to-video, because text-to-video alone cannot yet meet that bar. At its core, this is a question of controllability.

Qingying, released today by Zhipu AI, further improves the controllability of text-to-video. In an interview with APPSO, Zhipu AI argued that text-to-video represents a more general form of controllability.

Most AI-generated video is still directed by humans through language, so being able to interpret text or simple verbal instructions is itself a higher level of control.

AI video is evolving from a toy into a tool for creators

If last year was the year large models exploded, this year is an important inflection point for AI video moving toward real applications.

Although Sora, which set off all this, has not yet been launched, it has brought some inspiration to AI videos.

Through careful design, Sora tackles the problem of details jumping between frames. At the same time, it directly generates high-resolution (1080p) frames and can produce semantically rich videos up to 60 seconds long, which implies the training sequences behind it are correspondingly long.


In the past two months alone, no fewer than 10 companies have launched new AI video products or major updates.


Just a few days before Zhipu Qingying's release, Kuaishou's Keling AI opened its beta globally, and PixVerse, another product often likened to Sora, released its V2, which supports one-click generation of one to five continuous video clips.


Not long ago, Runway's Gen-3 Alpha also opened a public beta for paid users, with noticeable gains in detail fidelity and smoothness. Dream Machine, the cinematic video generation model released just last month, has also recently added first- and last-frame controls.

In just a few months, AI video generation has improved markedly in physical simulation, motion smoothness, and prompt understanding. Chen Kun, a director of AI fantasy dramas, is especially attuned to this and believes the technology may be progressing faster than expected.

The AI videos of 2023 were more like animated PowerPoints, with characters performing in slow motion and montage editing doing the heavy lifting. Now, that "PPT feel" has faded considerably.

China's first AIGC spectacle series, "Mountain and Sea Mirror: Breaking the Waves", directed by Chen Kun, recently premiered. He used AI to replace many traditional film and television production steps, and told APPSO that a comparable fantasy production used to require at least 100 people, whereas his team has just over 10, greatly shortening the production cycle and cutting costs.

Over the past six months, more professional film and television creators have begun experimenting with AI video. Kuaishou and Douyin have launched AI short dramas in China, and "Our T2 Remake", the first AI feature film co-produced by 50 AIGC creators, premiered in Los Angeles.


Although AI video generation is still limited in character and scene consistency, character performance, and action interaction, it is undeniable that AI video is slowly transforming from last year's curiosity into a tool for creators.

This may also be an important reason why Zhipu Qingying, Kuaishou Keling, Luma Dream Machine and other products have begun rolling out membership plans. Note that most domestic consumer-facing large-model products are free, which reflects domestic subscription-payment habits and a strategy of prioritizing user growth. Beyond curious onlookers, paid AI video will need the support of far more content creators to be sustainable.

Of course, AI video generation is still in its early stages. The so-called "generate a movie from a single sentence" is just misleading clickbait; video models still need better instruction following and controllability, and a better grasp of the physical world.

Zhipu also mentioned in today's press conference that the exploration of multimodal models is still in a very early stage.

Judging by the generated videos, there is still much room for improvement in understanding the laws of the physical world, resolution, camera-motion coherence, and duration. On the model side, a more fundamentally innovative architecture is needed: one that compresses video information more efficiently, fuses text and video content more fully, and makes generated content both more realistic and more faithful to user instructions.

"We are actively exploring more efficient scaling approaches at the model level." Zhang Peng is also confident about the development of multimodal models: "With the continued iteration of algorithms and data, I believe the Scaling Law will continue to exert its powerful force."

AI creator Chen Kun believes it is only a matter of time before AI-generated shots can fully hold up on the big screen. How long that takes matters less than taking part in the process, as Zhipu AI CEO Zhang Peng put it in an interview with APPSO:

Many things have to be explored step by step, and that process is very important. Don't look only at the final result; what matters more is that we take action. I think that is what everyone should pay more attention to right now.

Author: Li Chaofan, Mo Chongyu