
Zhipu AI releases a video generation large model, with Bilibili participating in R&D and Yizhuang providing computing power | Jia Zi Guang Nian

2024-07-26



Video models join the "war of a hundred models."

Author: Zhao Jian

This year marks the breakout of "video generation" large models. In just the past two months, we have seen a flurry of competing video models, including Kuaishou's Keling, SenseTime's Vimi, Luma AI, Aishi Technology's PixVerse, and Runway's Gen-3.

In the first half of the year, however, the companies building video generation models were mostly focused on that single function.

In the second half of the year, large language model companies are following OpenAI's lead into video models, aiming to unify language models and video models.

Among the highly anticipated "Six Big Model Giants," Zhipu AI has moved the fastest.

This morning, the Tsinghua-affiliated large model unicorn launched Qingying, a video generation product that is open to all users and supports both text-to-video and image-to-video generation.

After entering a piece of text or an image (i.e. a prompt) in the Zhipu Qingyan PC client or app, users can choose the style they want, including cartoon 3D, black and white, oil painting, and cinematic, and, with Qingying's built-in music, generate a short video clip full of AI imagination. In addition, the "AI Dynamic Photo Mini Program" supports generating videos from images.

Regarding the current landscape of video models, Zhipu AI CEO Zhang Peng believes the field will likely enter a period of contention among many players, just as large language models did.

In terms of commercialization, Qingying's current pricing is: during the initial testing period, all users can use it for free; 5 yuan unlocks high-speed channel access for one day (24 hours); 199 yuan unlocks high-speed channel access for one year. Zhang Peng said: "Commercialization is still at a very early stage, and the cost is actually very high. We will iterate gradually based on market feedback."

The Qingying API has also been launched on Zhipu's large model open platform, so enterprises and developers can try out and use its text-to-video and image-to-video capabilities by calling the API.
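For developers, a minimal sketch of what such an API call might look like is shown below. It assumes a REST-style endpoint reachable with a bearer token; the base URL, path, payload fields, and response shape are illustrative assumptions for this sketch, not documentation of the Zhipu open platform.

```python
import os
import requests

# Illustrative only: the endpoint path, payload fields, and response handling below
# are assumptions for sketching purposes, not the documented Zhipu open platform API.
API_BASE = "https://open.bigmodel.cn/api/paas/v4"   # assumed base URL
API_KEY = os.environ["ZHIPU_API_KEY"]               # assumed auth scheme: bearer token

def submit_text_to_video(prompt: str) -> dict:
    """Submit a text-to-video generation request (hypothetical endpoint)."""
    resp = requests.post(
        f"{API_BASE}/videos/generations",            # assumed path
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": "cogvideox", "prompt": prompt},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    result = submit_text_to_video("A capybara lazily drinks cola through a straw")
    print(result)
```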

Qingying's research and development received strong support from the Beijing municipal government. Haidian District, where Zhipu AI is headquartered, provides all-round support for the company's large model R&D, including industrial investment, computing power subsidies, application scenario demonstrations, and talent. Qingying was trained on, and born out of, the high-performance computing cluster in Beijing Yizhuang; in the future it will also be applied across Yizhuang's vast high-tech industrial cluster, forming a new pattern in which large models empower the real economy.


In terms of ecosystem cooperation, Bilibili participated in Qingying's technical R&D as a partner and is committed to exploring possible future application scenarios. Partner Huace Film and Television also took part in building the model.

1. Generate video from any text in 30 seconds

What does Qingying actually produce? Here are several video examples released officially (all with music).

  • Text-to-video:

Prompt: A low-angle shot pushing upward, the camera slowly tilting up; suddenly a dragon appears on the iceberg, then the dragon spots you and rushes toward you. Hollywood movie style.

Prompt: In the night scene of a cyberpunk-style city with flashing neon lights, a handheld tracking camera slowly moves in on a mechanical monkey repairing something with high-tech tools, surrounded by flashing electronic devices and futuristic decorative materials. Cyberpunk style, mysterious atmosphere, 4K high definition.

Prompt: Advertising shot angle, yellow background, white table; a potato is tossed down and turns into a serving of French fries.

  • Image-to-video:

Prompt: A classical beauty.

Prompt: Flames spew from a dragon's mouth and burn down a small village.

Prompt: A capybara lazily drinks cola through a straw, turning its head toward the camera.

Qingying's generated videos are about 6 seconds long, with a wait of roughly 30 seconds after entering a prompt. Zhang Peng said this generation speed is already very fast for the industry.

Zhang Peng believes the exploration of multimodal models is still at an early stage. Judging from the generated videos, there is plenty of room for improvement in understanding the laws of the physical world, resolution, continuity of camera movement, and duration. On the model side, a new architecture with more breakthrough innovation is needed, one that compresses video information more efficiently, fuses text and video content more fully, and makes the generated content more realistic while following user instructions.

2. Self-developed DiT architecture

The base video generation model behind Qingying is CogVideoX, which fuses the three dimensions of text, time, and space and draws on Sora's algorithm design. CogVideoX is likewise a DiT architecture; through optimization, its inference speed is 6 times that of its predecessor, CogVideo.

Zhipu highlighted three technical aspects of CogVideoX: content coherence, controllability, and model structure.


First, to address content coherence, Zhipu developed an efficient three-dimensional variational autoencoder (3D VAE) that compresses the original video space to 2% of its size, reducing the training cost and difficulty of the video diffusion generation model.
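As a rough sanity check of the 2% figure: assuming (these factors are not stated in the article) a typical 3D VAE setup that downsamples time by 4x, each spatial axis by 8x, and maps 3 input channels to 16 latent channels, the latent occupies about 2.1% of the original video volume.

```python
# Back-of-the-envelope check of the "compress to 2%" figure, assuming (not stated in
# the article) 4x temporal and 8x8 spatial downsampling with 16 latent channels.
frames, height, width, channels = 48, 480, 720, 3    # example input video
t_down, s_down, latent_channels = 4, 8, 16           # assumed VAE factors

original = frames * height * width * channels
latent = (frames // t_down) * (height // s_down) * (width // s_down) * latent_channels

print(f"latent / original = {latent / original:.3%}")  # ~2.08%
```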

In terms of model structure, Zhipu uses causal 3D convolution as the main model component and removes the attention modules commonly used in autoencoders, so that the model can transfer to different resolutions.

Meanwhile, the causal form of the convolution in the time dimension means that video encoding and decoding proceed front to back without depending on future frames, which makes it easier to generalize to higher frame rates and longer durations through fine-tuning.
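A minimal PyTorch sketch of a temporally causal 3D convolution follows; the kernel size, channel counts, and padding scheme are illustrative choices, not CogVideoX's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """Minimal sketch of a temporally causal 3D convolution: frames only see the past.

    This illustrates the idea described in the article; kernel size and channel
    counts here are arbitrary, not CogVideoX's real settings."""

    def __init__(self, in_ch: int, out_ch: int, kernel: int = 3):
        super().__init__()
        self.kernel = kernel
        # No built-in padding: we pad manually so the temporal axis is causal.
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=kernel, padding=0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width)
        pad_s = self.kernel // 2
        pad_t = self.kernel - 1
        # F.pad order is (w_left, w_right, h_left, h_right, t_left, t_right):
        # symmetric padding in space, but all temporal padding on the "past" side.
        x = F.pad(x, (pad_s, pad_s, pad_s, pad_s, pad_t, 0))
        return self.conv(x)

video = torch.randn(1, 3, 8, 32, 32)        # tiny toy video
out = CausalConv3d(3, 16)(video)
print(out.shape)                            # torch.Size([1, 16, 8, 32, 32])
```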

From an engineering and deployment perspective, Zhipu fine-tunes and deploys the variational autoencoder with temporal sequence parallelism, allowing it to encode and decode extremely high-frame-rate videos with a smaller memory footprint.

The second point is controllability. Most existing video data lacks corresponding descriptive text, or the descriptions are of low quality. Zhipu therefore developed an end-to-end video understanding model that generates detailed, content-faithful descriptions for massive amounts of video data. This strengthens the model's text comprehension and instruction following, so that generated videos better match user input and the model can understand extremely long and complex prompts.

This is also the approach Sora used. OpenAI trained a highly descriptive captioner with the "re-captioning technique" from DALL·E 3 and used it to generate text captions for the videos in its training set. OpenAI also used GPT to expand short user prompts into longer, detailed captions, which are then fed to the video model.
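Conceptually, the pipeline described above has two stages: attach dense captions to raw training videos, and expand terse user prompts at inference time. The sketch below uses hypothetical stand-in functions (`dense_caption`, `expand_prompt`) purely to show the data flow; neither is a real Zhipu or OpenAI API.

```python
from typing import List, Tuple

def dense_caption(video_path: str) -> str:
    """Hypothetical stand-in for the end-to-end video understanding / captioning model."""
    raise NotImplementedError

def expand_prompt(short_prompt: str) -> str:
    """Hypothetical stand-in for an LLM that rewrites terse prompts into detailed ones."""
    raise NotImplementedError

def build_training_pairs(video_paths: List[str]) -> List[Tuple[str, str]]:
    # Stage 1 (training time): attach detailed, content-faithful captions to raw videos.
    return [(path, dense_caption(path)) for path in video_paths]

def prepare_inference_prompt(user_prompt: str) -> str:
    # Stage 2 (inference time): expand a short user prompt into the long, detailed
    # caption style the model was trained on.
    return expand_prompt(user_prompt)
```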

Finally, there is Zhipu's self-developed transformer architecture that fuses the three dimensions of text, time, and space. It abandons the traditional cross-attention module and instead concatenates text embeddings and video embeddings at the input stage, enabling fuller interaction between the two modalities.

However, the feature spaces of the two modalities differ greatly. Zhipu compensates for this by processing the text and video modalities separately through expert adaptive layernorm, which makes more effective use of the timestep information in the diffusion model and lets the model use its parameters efficiently to align visual information with semantic information.

The attention module adopts full 3D attention. Previous work usually uses separated spatial and temporal attention or blockwise spatio-temporal attention, which requires a large amount of implicit transfer of visual information, greatly increasing modeling difficulty, and does not fit existing efficient training frameworks.
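The following sketch combines the three ideas above in one toy transformer block: text and video tokens concatenated into a single sequence, separate ("expert") adaptive LayerNorm modulation per modality driven by the diffusion timestep embedding, and a single full self-attention over all tokens. The layer sizes and wiring are assumptions for illustration, not the actual CogVideoX block.

```python
import torch
import torch.nn as nn

class ExpertAdaLNBlock(nn.Module):
    """Toy block mirroring the article's description: concatenated text + video tokens,
    per-modality adaptive LayerNorm driven by the timestep embedding, and one full
    self-attention over the whole sequence. Dimensions are illustrative only."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.norm_text = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm_video = nn.LayerNorm(dim, elementwise_affine=False)
        # "Expert" adaLN: separate scale/shift predictors per modality, both driven
        # by the same diffusion timestep embedding.
        self.mod_text = nn.Linear(dim, 2 * dim)
        self.mod_video = nn.Linear(dim, 2 * dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_tok, video_tok, t_emb):
        s_t, b_t = self.mod_text(t_emb).chunk(2, dim=-1)
        s_v, b_v = self.mod_video(t_emb).chunk(2, dim=-1)
        text = self.norm_text(text_tok) * (1 + s_t.unsqueeze(1)) + b_t.unsqueeze(1)
        video = self.norm_video(video_tok) * (1 + s_v.unsqueeze(1)) + b_v.unsqueeze(1)
        # Full (3D) attention: every text token and video patch attends to all others.
        seq = torch.cat([text, video], dim=1)
        out, _ = self.attn(seq, seq, seq)
        return out

block = ExpertAdaLNBlock()
text = torch.randn(2, 16, 256)    # (batch, text tokens, dim)
video = torch.randn(2, 128, 256)  # (batch, flattened T*H*W video patches, dim)
t_emb = torch.randn(2, 256)       # diffusion timestep embedding
print(block(text, video, t_emb).shape)  # torch.Size([2, 144, 256])
```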

The position encoding module uses 3D RoPE, which better captures inter-frame relationships in the temporal dimension and establishes long-range dependencies within the video.
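A minimal sketch of the 3D RoPE idea: split each attention head's channels into three groups and apply ordinary rotary embeddings to each group using the token's frame, row, and column index. The equal three-way split used here is an assumption; the real partition and frequency schedule may differ.

```python
import torch

def rope_1d(x: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
    """Standard rotary embedding over the last dimension of x for positions pos."""
    dim = x.shape[-1]
    freqs = 1.0 / (10000 ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = pos[:, None] * freqs[None, :]                  # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1).flatten(-2)

def rope_3d(x: torch.Tensor, t: torch.Tensor, h: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Sketch of 3D RoPE: rotate one third of the channels by the frame index,
    one third by the row index, and one third by the column index (assumed split)."""
    d = x.shape[-1] // 3
    return torch.cat(
        [rope_1d(x[..., :d], t), rope_1d(x[..., d:2 * d], h), rope_1d(x[..., 2 * d:], w)],
        dim=-1,
    )

# Toy usage: 4 frames x 6 rows x 10 columns of patches, 96 channels per head.
q = torch.randn(240, 96)
tt, hh, ww = torch.meshgrid(torch.arange(4), torch.arange(6), torch.arange(10), indexing="ij")
q_rot = rope_3d(q, tt.flatten().float(), hh.flatten().float(), ww.flatten().float())
print(q_rot.shape)  # torch.Size([240, 96])
```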

3. Scaling Laws Still Work

From the beginning of its work on large models, Zhipu has been laying out multimodal research. From text to images to video, a large model's understanding of the world grows ever more complex and multidimensional. By learning from multiple modalities, large models develop the ability to understand, acquire knowledge, and handle different tasks.

Zhipu's research on multimodal large models can be traced back to 2021. Since 2021, Zhipu has successively developed CogView (NeurIPS'21), CogView2 (NeurIPS'22), CogVideo (ICLR'23), Relay Diffusion (ICLR'24), and CogView3 (2024).


Building on CogView, the team developed CogVideo, a large text-to-video generation model. It used a multi-frame-rate hierarchical training strategy to generate high-quality video clips, and proposed a recursive interpolation-based method that progressively generates the video clips corresponding to each sub-description and interpolates them layer by layer into the final video. The work attracted wide attention from Facebook, Google, and Microsoft, and has been cited by video generation works including Facebook's Make-A-Video, Google's Phenaki and MAGVIT, Microsoft's DragNUWA, and NVIDIA's Video LDMs.
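The recursive interpolation idea can be sketched roughly as follows, with hypothetical stand-in functions for the keyframe generator and the frame interpolator; each level roughly doubles the effective frame rate by inserting a generated frame between every adjacent pair. This is a conceptual illustration, not CogVideo's actual code.

```python
from typing import Callable, List

def hierarchical_generate(
    prompt: str,
    generate_keyframes: Callable[[str], List],             # hypothetical low-frame-rate generator
    interpolate: Callable[[object, object, str], object],  # hypothetical frame interpolator
    levels: int = 2,
) -> List:
    """Sketch of a multi-frame-rate hierarchy: sparse keyframes first, then
    recursively insert in-between frames conditioned on the prompt."""
    frames = generate_keyframes(prompt)
    for _ in range(levels):
        denser = []
        for a, b in zip(frames[:-1], frames[1:]):
            denser.append(a)
            denser.append(interpolate(a, b, prompt))  # insert a frame between each pair
        denser.append(frames[-1])
        frames = denser
    return frames
```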

In May 2024, the GLM large model team laid out the three major technical trends of GLM toward AGI in a keynote at ICLR 2024, in which native multimodal large models play an important role: the team believes that text is the key foundation for building a large model, and that the next step is to mix text, images, video, audio, and other modalities in training to build a truly native multimodal model.


Zhipu has laid out a comprehensive line of large model products, in which multimodal models have always played an important role. Zhipu has verified that scaling laws remain effective for video generation. Going forward, while continuously scaling up data and model size, it will explore new model architectures with more breakthrough innovation that compress video information more efficiently and fuse text and video content more fully.

Zhang Peng believes that one of the future technological breakthroughs for large models will be native multimodal large models, and Scaling Law will continue to play a role in both algorithms and data.

"We haven't seen any signs of the technology curve slowing down," said Zhang Peng.

(Cover image and accompanying images in the article are from: Zhipu)