
Zhipu AI enters the video generation market: "Qingying" launches with 6-second clips, free and unlimited

2024-07-26


Synced (Machine Heart) Editorial Department

Developed by the Zhipu AI large model team.

Ever since Kuaishou's KeLing (Kling) AI took off both at home and abroad, video generation in China has been heating up, much as large language models did in 2023.

Just now, another large video generation model officially launched: Zhipu AI released "Qingying". Give it a good idea (anywhere from a few words to a few hundred) and a little patience (about 30 seconds), and "Qingying" will generate a high-precision video at 1440×960 resolution.



Video link: https://mp.weixin.qq.com/s/fNsMxyuutjVkEtX_xRnsMA

Starting today, Qingying is available in the Qingyan App, where all users can try the full range of conversation, image, video, code, and agent generation features. Beyond the Zhipu Qingyan web page and App, you can also use the "AI Dynamic Photo Mini Program" to quickly animate photos on your phone.



Videos generated by Zhipu's "Qingying" are 6 seconds long at 1440×960 resolution, and the service is free for all users.



  • PC access link: https://chatglm.cn/
  • Mobile access link: https://chatglm.cn/download?fr=web_home

Zhipu AI says that as the technology continues to develop, the generation capabilities of "Qingying" will soon be applied to short-video production, advertisement generation, and even film editing.

In generative AI video models, Scaling Law continues to operate in both algorithms and data. "We are actively exploring more efficient scaling methods at the model level," Zhipu AI CEO Zhang Peng said at Zhipu Open Day. "As algorithms and data keep iterating, I believe Scaling Law will continue to play a strong role."



A variety of styles

Judging from the current demos and some quick hands-on trials, Zhipu AI's "Qingying" has the following characteristics:

  • It performs well on video content such as landscapes, animals, science fiction, and the humanities and history.
  • The styles it generates well include cartoon, photorealistic, and 2D anime.
  • In terms of how faithfully entity types are rendered: animals > plants > objects > buildings > people.

It can generate videos from text as well as from images, in styles that include fantasy animation.

Text-to-Video

Prompt: A low-angle upward push-in, slowly tilting up; a dragon suddenly appears on the iceberg, then the dragon spots you and rushes toward you. Hollywood movie style.



Prompt: A wizard casts a spell amid the waves; a gem gathers the seawater and opens a magic portal.



Prompt: A mushroom turns into a bear.



And realistic scenes:

Prompt: In a forest, towering trees block out the sun, with some sunlight shining through the gaps between the leaves; the Tyndall effect.



Prompt: A capybara stands like a human, holding an ice cream in its hand and eating it happily.



Image-to-Video

In addition to text-to-video, you can also use Qingying to generate videos from images. Image-to-video brings many new ways to play, including meme creation, advertisement production, plot creation, and short-video creation. At the same time, the "Old Photos Animation" mini program built on Qingying launches simultaneously: just upload an old photo in a single step, and AI brings the moment frozen in the past back to life.

Prompt: A free-moving colorful fish.



Prompt: The man in the picture stands up as the wind blows through his hair.



Prompt: A little yellow duck toy floats on the surface of the swimming pool, close-up.



And modern art:

Prompt: The camera rotates around a stack of vintage TVs showing different programs (1950s sci-fi movies, horror films, news, static, 1970s sitcoms, and so on), set in a large gallery of a New York museum.



Prompt: Take out an iPhone and take a photo.



No prompt.



Zhipu AI can extend the memes you use every day into a whole series.

Prompt: The master and his three apprentices stretch out their hands and high-five each other, with confused expressions on their faces.




Prompt: The kitten opens its mouth wide, with a confused expression and many question marks on its face.




As you can see, Qingying can handle all kinds of styles, and there are more ways to play waiting to be discovered. Just open the "Qingying" agent on the Zhipu Qingyan web page or App, and you can turn any idea into video in an instant.

Fully self-developed technology

Zhipu AI, which has gone all in on large models, began building multimodal generative AI models early. Since 2021 it has released a series of research results, including CogView (NeurIPS'21), CogView2 (NeurIPS'22), CogVideo (ICLR'23), Relay Diffusion (ICLR'24), and CogView3 (2024).

"Qingying" is reportedly powered by CogVideoX, a new-generation video generation model developed in-house by Zhipu AI's large model team.

Earlier, the team built the text-to-video generation model CogVideo on top of its text-to-image model CogView2, and subsequently open-sourced it.



CogVideo has 9.4 billion parameters. It uses CogView2 to generate a series of initial frames, then interpolates frames with a bidirectional attention model to produce the video. In addition, CogVideo can generate 3D environments from text descriptions, reuses pre-trained models directly to avoid expensive training from scratch, and supports Chinese prompts.

Qingying is powered by CogVideoX, a new-generation video model that fuses the three dimensions of text, time, and space. Like Sora, its algorithm design is based on a DiT architecture; through optimization, CogVideoX's inference is 6 times faster than the previous generation, CogVideo.

The arrival of OpenAI's Sora marked major progress in AI video generation, but most models still struggle to produce video content that is coherent and logically consistent.

To address these problems, Zhipu AI developed an efficient three-dimensional variational autoencoder (3D VAE) that compresses the original video space to 2% of its size, greatly reducing the cost and difficulty of training the model.

The structure uses causal 3D convolutions as its main component and removes the attention modules commonly used in autoencoders, which lets the model transfer across different resolutions.

At the same time, causal convolution along the time dimension makes video encoding and decoding strictly front-to-back: each frame depends only on earlier frames, never later ones, which helps extend the model to higher frame rates and longer durations through fine-tuning.
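To make that causality concrete, here is a minimal PyTorch sketch of a 3D convolution that pads only toward the past along the time axis, so frame t never sees later frames. The layer, kernel sizes, and shapes are illustrative assumptions, not Zhipu AI's actual implementation.

```python
# Minimal sketch of a time-causal 3D convolution (assumes PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """3D convolution that is causal along the time axis: each output
    frame depends only on the current and past frames, never future ones."""
    def __init__(self, in_ch, out_ch, kernel=(3, 3, 3)):
        super().__init__()
        kt, kh, kw = kernel
        self.time_pad = kt - 1                  # pad only the "past" side
        self.space_pad = (kw // 2, kw // 2, kh // 2, kh // 2)
        self.conv = nn.Conv3d(in_ch, out_ch, kernel, padding=0)

    def forward(self, x):  # x: (batch, channels, time, height, width)
        # F.pad order for 5D input: (w_left, w_right, h_top, h_bottom,
        # t_front, t_back). Time is padded at the front only.
        x = F.pad(x, self.space_pad + (self.time_pad, 0))
        return self.conv(x)

# Example: encode a 17-frame clip; output keeps the same length, and
# output frame t is computed without looking at frames after t.
clip = torch.randn(1, 3, 17, 64, 64)
out = CausalConv3d(3, 16)(clip)
print(out.shape)  # torch.Size([1, 16, 17, 64, 64])
```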

Video generation also faces another problem: most video data comes without descriptive text, or the descriptions are of low quality. To address this, Zhipu AI developed an end-to-end video understanding model that generates detailed, content-faithful descriptions for massive amounts of video, and used it to build a large set of high-quality video-text pairs, which makes the trained model highly responsive to instructions.
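As a rough illustration of that data-construction step, the sketch below captions each video and keeps only sufficiently detailed descriptions. The `captioner.describe` call and the length-based filter are hypothetical stand-ins; the article does not name Zhipu AI's actual captioning interface or quality criteria.

```python
# Illustrative sketch of building video-text training pairs.
from dataclasses import dataclass

@dataclass
class VideoTextPair:
    video_path: str
    caption: str

def build_pairs(video_paths, captioner, min_len=30):
    """Caption each video with a video understanding model and keep
    only detailed descriptions as high-quality video-text pairs."""
    pairs = []
    for path in video_paths:
        caption = captioner.describe(path)  # hypothetical captioner API
        if len(caption) >= min_len:         # crude stand-in quality filter
            pairs.append(VideoTextPair(path, caption))
    return pairs
```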

Finally, it is worth mentioning that Zhipu AI designed a transformer architecture that fuses text, time, and space. Rather than the traditional cross-attention module, it concatenates text embeddings and video embeddings at the input stage, enabling fuller interaction between the two modalities.

However, the feature spaces of text and video differ significantly. Zhipu AI processes the two separately through expert adaptive LayerNorm, letting the model use its parameters efficiently to align visual information with semantic information.
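A minimal PyTorch sketch of this idea: text and video tokens are concatenated into one sequence (so full self-attention stands in for cross-attention), while each modality gets its own adaptive LayerNorm scale and shift. Dimensions, token counts, and module names are illustrative assumptions.

```python
# Sketch of "concatenate at the input, normalize per modality".
import torch
import torch.nn as nn

class ExpertAdaLN(nn.Module):
    """Separate adaptive LayerNorm parameters for text and video tokens,
    so each modality is modulated by its own scale/shift, conditioned
    on e.g. a diffusion timestep embedding."""
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.text_mod = nn.Linear(cond_dim, 2 * dim)   # text "expert"
        self.video_mod = nn.Linear(cond_dim, 2 * dim)  # video "expert"

    def forward(self, tokens, cond, n_text):
        scale_t, shift_t = self.text_mod(cond).chunk(2, dim=-1)
        scale_v, shift_v = self.video_mod(cond).chunk(2, dim=-1)
        x = self.norm(tokens)
        text = x[:, :n_text] * (1 + scale_t.unsqueeze(1)) + shift_t.unsqueeze(1)
        video = x[:, n_text:] * (1 + scale_v.unsqueeze(1)) + shift_v.unsqueeze(1)
        return torch.cat([text, video], dim=1)

# Text and video embeddings joined into one sequence for self-attention.
text_emb = torch.randn(2, 77, 512)    # (batch, text tokens, dim)
video_emb = torch.randn(2, 256, 512)  # (batch, video patches, dim)
seq = torch.cat([text_emb, video_emb], dim=1)
cond = torch.randn(2, 512)            # e.g. timestep embedding
out = ExpertAdaLN(512, 512)(seq, cond, n_text=77)
print(out.shape)  # torch.Size([2, 333, 512])
```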

Zhipu AI says that through optimization, the inference speed of its video generation model has increased 6-fold; at present, the model takes 30 seconds to generate a 6-second video.

Now, with the launch of "Qingying", a heavyweight player has entered the video generation race.

Beyond the consumer apps anyone can try, the Qingying API is also live on Zhipu's open platform, bigmodel.cn: enterprises and developers can call the API to use its text-to-video and image-to-video capabilities.
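As a rough illustration, here is how such a call might look with the official `zhipuai` Python SDK. The model id, method names, polling loop, and response fields below are assumptions to verify against the bigmodel.cn documentation.

```python
# Hedged sketch of calling the video API on bigmodel.cn (assumes the
# `zhipuai` SDK; exact names should be checked against the platform docs).
import time
from zhipuai import ZhipuAI

client = ZhipuAI(api_key="YOUR_API_KEY")  # key issued on bigmodel.cn

# Submit an asynchronous text-to-video task. Per the article, an image
# input would similarly drive image-to-video.
task = client.videos.generations(
    model="cogvideox",  # assumed model id
    prompt="A little yellow duck toy floats on the surface of the "
           "swimming pool, close-up.",
)

# Generation takes on the order of 30 seconds, so poll for the result.
while True:
    result = client.videos.retrieve_videos_result(id=task.id)
    if result.task_status != "PROCESSING":
        break
    time.sleep(5)

if result.task_status == "SUCCESS":
    print(result.video_result[0].url)  # downloadable video URL (assumed field)
```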

As company after company rolls out AI video generation features, this year's generative AI competition has reached a fever pitch. For most users, that means more choice: whether you have no video production experience or are a professional content creator, you can now harness large models to make videos.