
Zhipu AI's version of Sora is here! Free for everyone, unlimited use, anyone with a phone can try it, and the API is open too

2024-07-26


Jin Lei, from Aofeisi
QbitAI | WeChat official account QbitAI

Just now, Zhipu AI released its own version of Sora, named Qingying.

Without further ado, let's take a look at a short film generated by Qingying.



Video address: https://mp.weixin.qq.com/s/XmXR-XZtMvhZHtLTCxU4ZQ

Take text-to-video as an example. Give Qingying a prompt that challenges its imagination:

In the neon-lit cyberpunk-style city night scene, the handheld tracking camera slowly moves closer to show a mechanical monkey repairing something with high-tech tools, surrounded by flashing electronic devices and futuristic decoration materials. Cyberpunk style, mysterious atmosphere, 4K high definition.



Video address: https://mp.weixin.qq.com/s/XmXR-XZtMvhZHtLTCxU4ZQ

The result has a strong cyberpunk, futuristic feel, close to the image we picture in our minds.

Besides text-to-video, Qingying also released image-to-video capabilities this time.

Now, let's pit your imagination against Qingying's creativity and see which is better.

Here is the first picture: Cave Civilization.



And here is the video Qingying created from it:



Video address: https://mp.weixin.qq.com/s/XmXR-XZtMvhZHtLTCxU4ZQ

At the end of the video, Qingying even learned to shake the camera at the key frames, making the video feel even more mysterious.

Next, Round 2. Let's look at the picture first: Dragon's Breath.



Here is the video Qingying made from this picture:



Video address: https://mp.weixin.qq.com/s/XmXR-XZtMvhZHtLTCxU4ZQ

You might imagine the dragon is about to breathe fire, but we didn't expect it to actually torch the village below. Still, it makes sense.

Looking at Zhipu AI's whole launch event, though, high definition and picture consistency are only a small part of the highlights. More importantly, the perks are generous!

Free for everyone, no queue, unlimited use!

Moreover, Zhipu is shipping the full capability of its own video-generation model, CogVideo, with no scarcity marketing.

According to Zhipu AI, generating a 6-second 1440x960 video takes only 30 seconds, and model inference is now 6 times faster.



Not only that: the text-to-video and image-to-video features are now live on both the PC version and the app of Zhipu Qingyan; on the mini program, only image-to-video is currently supported.

There is also good news for developers: the API for the video-generation model is now fully open, a first in China!
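For reference, here is a minimal sketch of calling the newly opened API from Python. It assumes the official zhipuai SDK and the video endpoints (videos.generations / retrieve_videos_result) as documented around launch; the model id and response field names are assumptions to verify against the current official docs.

```python
# Minimal sketch of calling the video-generation API. Method names and
# the model id follow Zhipu's docs at launch -- verify before relying
# on them.
import time

from zhipuai import ZhipuAI

client = ZhipuAI(api_key="YOUR_API_KEY")  # key from the open platform

# Submit an asynchronous generation task from a text prompt.
task = client.videos.generations(
    model="cogvideox",
    prompt="In a neon-lit cyberpunk city at night, a handheld tracking "
           "shot slowly closes in on a mechanical monkey repairing a "
           "device with high-tech tools. Mysterious atmosphere, 4K.",
)

# Poll until the task finishes, then print the resulting video URL.
while True:
    result = client.videos.retrieve_videos_result(id=task.id)
    if result.task_status == "SUCCESS":
        print(result.video_result[0].url)
        break
    if result.task_status == "FAIL":
        raise RuntimeError("generation failed")
    time.sleep(5)
```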

It has to be said that in terms of convenience and efficiency, Zhipu AI has done a great job this time.

Next, it’s time to use Zhipu AI’s video generation function to conduct some actual tests.

Testing the Zhipu AI version of Sora

First, let's test the text-to-video feature.

Open the Zhipu Qingyan app or PC version; the entrance to text-to-video is in the main conversation window.





Taking the app as an example, the interface looks like this:



Now everything is ready except the prompt.

But note: the prompt is the key to whether video generation succeeds or fails.

One of the most important principles is structure. The formulas are as follows (a small sketch follows the list):

  • Simple formula: [camera movement] + [scene setup] + [more details]
  • Complex formula: [camera language] + [light and shadow] + [subject (subject description)] + [subject movement] + [scene (scene description)] + [mood/atmosphere/style]
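To make the formula concrete, here is a tiny illustrative helper (our own, not part of any Zhipu tooling) that assembles a prompt from the complex formula's slots:

```python
# Illustrative only: assemble a prompt from the "complex formula" slots.
def build_prompt(camera: str, lighting: str, subject: str,
                 action: str, scene: str, mood: str) -> str:
    """[camera language] + [light and shadow] + [subject] +
    [subject movement] + [scene] + [mood/atmosphere/style]."""
    return ", ".join([camera, lighting, subject, action, scene, mood]) + "."

prompt = build_prompt(
    camera="handheld tracking shot, slowly pushing in",
    lighting="neon light and deep shadows",
    subject="a mechanical monkey",
    action="repairing a device with high-tech tools",
    scene="a cyberpunk city street at night",
    mood="mysterious atmosphere, 4K high definition",
)
print(prompt)
```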

So how much difference does the prompt make?

For example, if you enter only "a little boy drinking coffee," the result is this:



It's passable, but at first glance it looks like AI.

If you enrich the prompt according to the formula, though, the result is completely different:

The camera pans to reveal a young boy sitting on a park bench with a steaming cup of coffee in his hands. He is wearing a blue shirt and looks cheerful, with a tree-lined park in the background and sunlight filtering through the leaves onto the boy.



Video address: https://mp.weixin.qq.com/s/XmXR-XZtMvhZHtLTCxU4ZQ

Just like that, the cinematic feel comes through.

Besides the formula above, there are a few more principles worth keeping in mind.

First, repetition is power.

Repeating or reinforcing keywords in different parts of the prompt helps improve consistency in the output. For example: "the camera flies through the scene at super-fast speed" (the words "super-fast" and "fast" reinforce each other).

Second, keep the prompt focused on what should be in the scene. For example, prompt for a clear sky, not a sky without clouds.

With these formulas and principles in hand, we can start trying it out.

The little prince and the fox were looking at the stars together on the moon, and the fox looked at the little prince from time to time.



Video address: https://mp.weixin.qq.com/s/XmXR-XZtMvhZHtLTCxU4ZQ

Realistic depiction, close up, of a cheetah lying on the ground sleeping, its body slightly rising and falling.



Video address: https://mp.weixin.qq.com/s/XmXR-XZtMvhZHtLTCxU4ZQ

In addition, according to Zhipu AI, trying a few more times may bring unexpected results (it's free, after all).

After text-to-video, let's test image-to-video.

There are also two key techniques here.

First, the uploaded image should be as clear as possible, with a 3:2 aspect ratio, in JPG or PNG format.

Second, the prompt still matters: there must be a subject, and you can write the prompt following the formula "[subject] + [subject movement] + [background] + [background movement]".

Of course, you can also go without a prompt, but then the AI will generate the video based on its own ideas.
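Before uploading, it can help to crop the image to the recommended 3:2 ratio. Here is a small illustrative Pillow helper (the function name and sample file are hypothetical, not part of Qingying):

```python
# Illustrative helper: center-crop an image to the recommended 3:2
# ratio and save it as PNG before uploading.
from PIL import Image

def prepare_for_qingying(src: str, dst: str = "upload.png") -> None:
    img = Image.open(src)
    w, h = img.size
    target = 3 / 2                 # recommended width : height
    if w / h > target:             # too wide: trim the sides
        new_w = int(h * target)
        left = (w - new_w) // 2
        img = img.crop((left, 0, left + new_w, h))
    else:                          # too tall: trim top and bottom
        new_h = int(w / target)
        top = (h - new_h) // 2
        img = img.crop((0, top, w, top + new_h))
    img.save(dst)

prepare_for_qingying("tang_monk.jpg")  # hypothetical input file
```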

For example, we "feed" it a photo of Tang Seng, the Tang Monk:



Then, following the formula above, we give this prompt:

Tang Seng stretched out his hand and put on his sunglasses.



Video address: https://mp.weixin.qq.com/s/XmXR-XZtMvhZHtLTCxU4ZQ

And there are even more playful ways to use it.

For example, have Zhen Huan and Shen Meizhuang hug across the "dimensional wall":

Zhen Huan and Mei Zhuang hug across the screen.



Video address: https://mp.weixin.qq.com/s/XmXR-XZtMvhZHtLTCxU4ZQ

Reviving old photos is no problem either:

Hu Shi turned and left.



Video address: https://mp.weixin.qq.com/s/XmXR-XZtMvhZHtLTCxU4ZQ

Judging from all these results, Zhipu AI's Qingying is a Sora you can actually use right away.

So the next question is:

How is it done?

In the field of video generation, the consistency and coherence of the output content are key factors that determine the final effect.

To address this, according to Zhipu AI, the team developed an efficient three-dimensional variational autoencoder (3D VAE) that compresses the original video space to 2% of its size, greatly reducing the training cost and difficulty of the video diffusion generation model.
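The article does not give Qingying's exact downsampling factors, but the roughly 2% figure is consistent with factors typical of video VAEs. A hypothetical back-of-the-envelope check, assuming 4x temporal and 8x8 spatial downsampling with 16 latent channels versus 3 RGB channels:

```python
# Hypothetical illustration of the ~2% figure; the downsampling
# factors below are assumptions typical of video VAEs, not Zhipu's
# published numbers.
T, H, W, C = 48, 960, 1440, 3            # frames x height x width x RGB
t, h, w, c = T // 4, H // 8, W // 8, 16  # latent grid after compression

ratio = (t * h * w * c) / (T * H * W * C)
print(f"latent size is {ratio:.1%} of the original video")  # ~2.1%
```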

In terms of model structure, the Zhipu team adopted causal 3D convolution as the main model component, removing the attention modules commonly used in autoencoders so that the model can transfer to and run at different resolutions.

Meanwhile, the causal form of the convolution along the time dimension gives the model front-to-back sequential independence in video encoding and decoding, which makes it easier to generalize to higher frame rates and longer durations through fine-tuning.
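To illustrate what "causal" means here, below is a minimal sketch of a causal 3D convolution (our own illustration, not Zhipu's code): all temporal padding is applied on the past side, so an output frame never depends on future frames.

```python
# Minimal causal 3D convolution sketch: temporal padding only on the
# "past" side, symmetric padding in space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.pad_t = k - 1                  # all temporal pad is causal
        self.pad_s = k // 2                 # symmetric spatial pad
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=k, padding=0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width). F.pad pads the last
        # dims first: (W_left, W_right, H_left, H_right, T_past, T_future).
        x = F.pad(x, (self.pad_s, self.pad_s,
                      self.pad_s, self.pad_s,
                      self.pad_t, 0))
        return self.conv(x)

video = torch.randn(1, 3, 8, 32, 32)        # tiny dummy clip
print(CausalConv3d(3, 16)(video).shape)     # -> (1, 16, 8, 32, 32)
```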

From an engineering-deployment perspective, Zhipu AI fine-tunes and deploys the variational autoencoder with sequence parallelism along the time dimension (temporal sequential parallelism), so that it can encode and decode videos at extremely high frame rates with a smaller memory footprint.
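As a rough single-device illustration of the memory idea (the real deployment shards the work across devices, and a faithful version would also carry causal context between chunks; both are omitted here for brevity):

```python
# Encode a long clip chunk by chunk along time so peak memory stays
# bounded. Our own sketch of the idea, not Zhipu's implementation.
import torch

def encode_in_time_chunks(encoder, video: torch.Tensor,
                          chunk: int = 8) -> torch.Tensor:
    # video: (batch, channels, time, height, width)
    latents = [encoder(video[:, :, t:t + chunk])
               for t in range(0, video.shape[2], chunk)]
    return torch.cat(latents, dim=2)   # re-join along the time axis
```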



Beyond content consistency and coherence, another problem in video generation is that most existing video data lacks corresponding descriptive text, or the descriptions are of poor quality.

To this end, Zhipu AI has developed an end-to-end video understanding model to generate detailed and content-appropriate descriptions for massive amounts of video data.

This can enhance the model's text understanding and instruction-following capabilities, making the generated videos more consistent with user input and able to understand extremely long and complex prompt instructions.

Finally, Zhipu AI also developed a Transformer architecture that fuses the three dimensions of text, time, and space.

It abandons the traditional cross-attention module and instead concatenates the text embeddings and video embeddings at the input stage, allowing fuller interaction between the two modalities.
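In shape terms, the idea looks roughly like this (an illustrative sketch, not Zhipu's implementation; all sizes are made up):

```python
# "Concatenate instead of cross-attend": text tokens and video patch
# tokens are joined into one sequence and processed by ordinary
# self-attention, so both modalities interact freely.
import torch
import torch.nn as nn

d_model = 512
text_emb = torch.randn(1, 77, d_model)       # (batch, text tokens, dim)
video_emb = torch.randn(1, 1024, d_model)    # (batch, video patches, dim)

tokens = torch.cat([text_emb, video_emb], dim=1)   # one joint sequence
attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
out, _ = attn(tokens, tokens, tokens)
```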

However, the feature spaces of the two modalities are very different, so the team compensates by processing the text and video modalities separately through expert adaptive LayerNorm. This makes more effective use of the timestep information in the diffusion model, letting the model use its parameters efficiently to align visual information with semantic information.
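A minimal sketch of what an "expert" adaptive LayerNorm could look like, with a separate modulation expert per modality (our own illustration; the layer names are hypothetical):

```python
# The diffusion-timestep embedding produces separate scale/shift pairs
# for text tokens and video tokens, compensating for their different
# feature spaces.
import torch
import torch.nn as nn

class ExpertAdaLN(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.text_mod = nn.Linear(dim, 2 * dim)   # expert for text
        self.video_mod = nn.Linear(dim, 2 * dim)  # expert for video

    def forward(self, text, video, t_emb):
        # t_emb: (batch, dim) timestep embedding
        ts, tb = self.text_mod(t_emb).unsqueeze(1).chunk(2, dim=-1)
        vs, vb = self.video_mod(t_emb).unsqueeze(1).chunk(2, dim=-1)
        text = self.norm(text) * (1 + ts) + tb
        video = self.norm(video) * (1 + vs) + vb
        return text, video
```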

The attention module adopts a 3D full-attention mechanism. Previous work usually used separated spatial and temporal attention, or blocked spatio-temporal attention, which requires passing a large amount of visual information implicitly, greatly increasing modeling difficulty; such schemes also cannot be adapted to existing efficient training frameworks.
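The difference is easiest to see at the level of token shapes (an illustrative sketch; sizes are made up):

```python
# Factorized attention sees two restricted views of the tokens; 3D full
# attention flattens everything into one sequence.
import torch

B, T, N, D = 1, 8, 256, 512        # batch, frames, patches/frame, dim
x = torch.randn(B, T, N, D)

spatial_seq = x.reshape(B * T, N, D)                       # within a frame
temporal_seq = x.permute(0, 2, 1, 3).reshape(B * N, T, D)  # across frames
full_seq = x.reshape(B, T * N, D)  # full attention: every patch attends
                                   # to every other patch in every frame
```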

The position-encoding module uses 3D RoPE, which helps capture relationships between frames along the temporal dimension and establish long-range dependencies across the video.
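A sketch of the 3D RoPE idea, assuming the common construction in which each head's channels are split equally across the three axes (the exact split Qingying uses is not stated in the article):

```python
# Rotate each channel group by the token's t, h, or w coordinate, so
# attention scores see relative offsets along all three axes.
import torch

def rope_1d(x: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
    # x: (..., tokens, d) with d even; pos: (tokens,) coordinates
    d = x.shape[-1]
    freq = 1.0 / (10000 ** (torch.arange(0, d, 2) / d))
    ang = pos[:, None] * freq[None, :]          # (tokens, d/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack([x1 * cos - x2 * sin,
                        x1 * sin + x2 * cos], dim=-1).flatten(-2)

def rope_3d(x, t_pos, h_pos, w_pos):
    # x: (tokens, d); channels split equally across the three axes.
    d = x.shape[-1] // 3
    return torch.cat([rope_1d(x[..., :d], t_pos),
                      rope_1d(x[..., d:2 * d], h_pos),
                      rope_1d(x[..., 2 * d:], w_pos)], dim=-1)
```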

That, then, is the key technical work Zhipu built behind Qingying.

One More Thing

In addition to the free version, Zhipu AI also launched a paid tier, priced as follows:

  • 5 yuan: unlock 24 hours of high-speed access
  • 199 yuan: unlock one year of high-speed access

The annual fee works out to only about 0.54 yuan a day.

Mm, that does sound rather tempting.

The experience link is below, and those who are interested can try it~

https://chatglm.cn/video