
Zhipu AI version of Sora is open source! The first commercially available, playable online, 3.7K stars on GitHub in 5 hours

2024-08-06


Jin Lei from Aofei Temple
Quantum Bit | Public Account QbitAI

The Chinese counterpart to Sora is seriously popular right now.

Just now, Zhipu AI open-sourced the large model behind Qingying, its video generation product.

And it's the first commercially usable one of its kind!



The model is called CogVideoX. Just 5 hours after its release on GitHub, it had already racked up 3.7K stars.



Let’s take a look at the effect directly.

Prompt 1: Close-up

In the haunting backdrop of a war-torn city, where ruins and crumbled walls tell a story of devastation, a poignant close-up frames a young girl. Her face is smudged with ash, a silent testament to the chaos around her. Her eyes glistening with a mix of sorrow and resilience, capturing the raw emotion of a world that has lost its innocence to the ravages of conflict.



Video address: https://mp.weixin.qq.com/s/IXRQ6PJ7NteZGXLi2x228g

As you can see, not only are details such as the character's eyes rendered in high definition, but continuity is also maintained before and after she blinks.

Next, Prompt 2: One shot

The camera follows behind a white vintage SUV with a black roof rack as it speeds up a steep dirt road surrounded by pine trees on a steep mountain slope, dust kicks up from it’s tires, the sunlight shines on the SUV as it speeds along the dirt road, casting a warm glow over the scene. The dirt road curves gently into the distance, with no other cars or vehicles in sight. The trees on either side of the road are redwoods, with patches of greenery scattered throughout. The car is seen from the rear following the curve with ease, making it seem as if it is on a rugged drive through the rugged terrain. The dirt road itself is surrounded by steep hills and mountains, with a clear blue sky above with wispy clouds.



Video address: https://mp.weixin.qq.com/s/IXRQ6PJ7NteZGXLi2x228g

Light and shadow, distant and close-up views, and the vehicle's motion along the road are all captured.

And these results aren't just an official showcase; anyone can try the model online~

A single A100 card can generate a video in 90 seconds

It is worth mentioning that Zhipu AI's CogVideoX comes in multiple sizes, and the one open-sourced this time is CogVideoX-2B.

Its relevant basic information is as follows:



It requires only 18GB of video memory for inference at FP16 precision and only 40GB for fine-tuning, which means a single 4090 graphics card can handle inference and a single A6000 can handle fine-tuning.

The model is already supported for deployment via Hugging Face's diffusers library, and getting it running takes just two steps:

1. Install the corresponding dependencies

pip install --upgrade opencv-python transformers
pip install git+https://github.com/huggingface/diffusers.git@878f609aa5ce4a78fea0f048726889debde1d7e8#egg=diffusers  # Still in PR

2. Run the code

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."

# Load the open-sourced 2B checkpoint in half precision and move it to the GPU.
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    torch_dtype=torch.float16,
).to("cuda")

# Encode the text prompt (with classifier-free guidance) into embeddings.
prompt_embeds, _ = pipe.encode_prompt(
    prompt=prompt,
    do_classifier_free_guidance=True,
    num_videos_per_prompt=1,
    max_sequence_length=226,
    device="cuda",
    dtype=torch.float16,
)

# Generate the frames and write them out as an .mp4 file.
video = pipe(
    num_inference_steps=50,
    guidance_scale=6,
    prompt_embeds=prompt_embeds,
).frames[0]

export_to_video(video, "output.mp4", fps=8)

On a single A100 card, following the steps above, it takes only 90 seconds to generate a video.
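If you don't have an A100 (or 18GB of free VRAM) on hand, here is a minimal sketch of squeezing the same pipeline onto a smaller card. It relies on enable_model_cpu_offload, a generic diffusers helper rather than anything specific to CogVideoX, and passes the prompt straight to the pipeline in the standard diffusers style, so treat the exact memory savings and speed as something to verify on your own hardware:

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    torch_dtype=torch.float16,  # FP16 weights roughly halve memory versus FP32
)

# Generic diffusers helper: keeps each submodule on the CPU and moves it to the
# GPU only while it is running, trading some speed for a lower VRAM peak.
pipe.enable_model_cpu_offload()

video = pipe(
    prompt="A golden retriever puppy chasing soap bubbles in a sunny garden",
    num_inference_steps=50,
    guidance_scale=6,
).frames[0]

export_to_video(video, "offloaded_output.mp4", fps=8)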

Not only that: Zhipu AI has also put up a playable online demo on Hugging Face. Our hands-on results are as follows:



Video address: https://mp.weixin.qq.com/s/IXRQ6PJ7NteZGXLi2x228g

As you can see, the generated results can be downloaded not only in .mp4 format but also as a GIF.
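If you are generating locally rather than through the demo, saving both formats from the frames returned by the snippet above is a one-liner each. export_to_gif ships in recent versions of diffusers.utils alongside export_to_video; if your version lacks it, Pillow's save(..., save_all=True) does the same job:

from diffusers.utils import export_to_gif, export_to_video

# `video` is the list of PIL frames returned by the pipeline call above.
export_to_video(video, "output.mp4", fps=8)  # .mp4, as in the official snippet
export_to_gif(video, "output.gif", fps=8)    # the same frames as an animated GIF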

So the next question is, how does Zhipu AI do it?

The paper has also been published

This time, Zhipu AI not only open-sourced the video generation model, but also released the technical report behind it.



Looking at the content of the report, there are three major technical highlights worth talking about.

First, the team developed an efficient three-dimensional variational autoencoder (3D VAE) that compresses the raw video space to 2% of its original size, greatly reducing the training cost and difficulty of the video diffusion model.

The model structure includes an encoder, a decoder, and a latent-space regularizer, with compression achieved through four stages of downsampling and upsampling. Temporal causal convolution preserves the causality of information and reduces communication overhead, and the team uses context parallelism to scale to large videos.
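To make the "temporal causal convolution" idea concrete, here is a small conceptual sketch (an illustration of the padding trick, not the released architecture): by padding only on the past side of the time axis, each output frame depends solely on the current and earlier frames, which is what keeps the information flow causal.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """Illustrative 3D convolution that is causal along the time axis."""

    def __init__(self, in_ch, out_ch, kernel=(3, 3, 3)):
        super().__init__()
        kt, kh, kw = kernel
        self.time_pad = kt - 1                                 # pad only toward the past
        self.space_pad = (kw // 2, kw // 2, kh // 2, kh // 2)  # symmetric spatial padding
        self.conv = nn.Conv3d(in_ch, out_ch, kernel, padding=0)

    def forward(self, x):  # x: (batch, channels, time, height, width)
        # F.pad order for 5D input: (w_left, w_right, h_top, h_bottom, t_front, t_back)
        x = F.pad(x, self.space_pad + (self.time_pad, 0))
        return self.conv(x)

# A frame at time t only "sees" frames at times <= t:
frames = torch.randn(1, 3, 8, 32, 32)
print(CausalConv3d(3, 16)(frames).shape)  # torch.Size([1, 16, 8, 32, 32])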

In experiments, the team found that encoding at larger resolutions generalizes easily, while increasing the number of frames is more challenging.

Therefore, the team trained the model in two stages: first at a lower frame rate with small batches, then fine-tuning at a higher frame rate with context parallelism. The training loss combines an L2 loss, an LPIPS perceptual loss, and a GAN loss from the 3D discriminator.
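As a rough sketch of how such a combined reconstruction objective is usually assembled (the loss weights below are hypothetical placeholders rather than values from the report, and lpips is the third-party LPIPS package, not something shipped with CogVideoX):

import torch
import torch.nn.functional as F
import lpips  # pip install lpips; frame-level perceptual similarity metric

lpips_fn = lpips.LPIPS(net="vgg")

def vae_reconstruction_loss(recon, target, disc_logits_fake,
                            w_lpips=0.1, w_gan=0.05):
    """recon/target: (B, 3, T, H, W) videos scaled to [-1, 1]; weights are placeholders."""
    # Pixel-level L2 term.
    l2 = F.mse_loss(recon, target)

    # LPIPS is an image metric, so apply it frame by frame and average.
    b, c, t, h, w = recon.shape
    frames_r = recon.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
    frames_t = target.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
    perceptual = lpips_fn(frames_r, frames_t).mean()

    # Non-saturating generator loss against the 3D discriminator's logits.
    gan = F.softplus(-disc_logits_fake).mean()

    return l2 + w_lpips * perceptual + w_gan * gan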



Next up is the expert Transformer.

The team used the VAE encoder to compress the video into the latent space, then split the latents into patches and unfolded them into a long sequence embedding, z_vision.

At the same time, they use T5 to encode the text input into a text embedding z_text, and then concatenate z_text and z_vision along the sequence dimension. The concatenated embedding is fed into the expert Transformer block stack for processing.

Finally, the team splits the concatenated sequence back apart to recover the original latent shape and decodes it with the VAE to reconstruct the video.
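In code, that concatenate-process-split flow boils down to something like the sketch below, where a vanilla transformer encoder stands in for the expert Transformer blocks and all shapes and names are purely illustrative:

import torch
import torch.nn as nn

hidden = 512
text_len, vision_len = 226, 1024              # illustrative sequence lengths

z_text = torch.randn(1, text_len, hidden)     # T5-encoded prompt embedding
z_vision = torch.randn(1, vision_len, hidden) # patchified video latents from the VAE

# 1) Concatenate text and vision tokens along the sequence dimension.
z = torch.cat([z_text, z_vision], dim=1)      # (1, 226 + 1024, hidden)

# 2) Run the joint sequence through the transformer stack
#    (a plain encoder here, standing in for the expert blocks).
blocks = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True),
    num_layers=2,
)
z = blocks(z)

# 3) Split the sequence back apart; the vision part is reshaped to the
#    latent video shape and decoded by the VAE.
_, z_vision_out = z.split([text_len, vision_len], dim=1)
print(z_vision_out.shape)  # torch.Size([1, 1024, 512])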



The final highlight is the data.

The team developed negative labels to identify and exclude low-quality videos, such as over-edited, choppy motion, low-quality, lecture-style, text-dominated, and screen-noise videos.

Using filters trained with video-llama, they labeled and screened 20,000 video samples. They also computed optical-flow and aesthetic scores and dynamically adjusted the thresholds to ensure the quality of the generated videos.
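Here is a simplified sketch of what this kind of score-based filtering amounts to in practice; the field names and thresholds are invented for illustration, whereas the report's actual thresholds are adjusted dynamically:

from dataclasses import dataclass, field

@dataclass
class VideoSample:
    path: str
    optical_flow: float            # average motion magnitude
    aesthetic: float               # aesthetic-model score
    negative_tags: set = field(default_factory=set)  # e.g. "lecture", "screen-noise"

def keep(sample: VideoSample, min_flow=0.5, min_aesthetic=4.5) -> bool:
    """Drop statically shot, low-quality, or negatively tagged clips."""
    if sample.negative_tags:               # any negative label excludes the clip
        return False
    if sample.optical_flow < min_flow:     # too little motion
        return False
    return sample.aesthetic >= min_aesthetic

dataset = [
    VideoSample("a.mp4", 1.2, 5.1),
    VideoSample("b.mp4", 0.1, 5.9),                    # nearly static
    VideoSample("c.mp4", 2.3, 3.0, {"screen-noise"}),  # tagged as noise
]
print([s.path for s in dataset if keep(s)])  # ['a.mp4']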

Video data usually comes without text descriptions, which must be generated before it can be used to train a text-to-video model. Existing video-captioning datasets provide only short captions that cannot fully describe the video content.

To address this, the team proposed a pipeline that generates video captions from image captions, and fine-tuned an end-to-end video-captioning model to obtain denser captions.

This method generates short captions with the Panda70M model and dense image captions with the CogView3 model, then uses GPT-4 to summarize them into the final video caption.

They also fine-tuned a CogVLM2-Caption model, built on CogVLM2-Video and Llama 3 and trained on the dense caption data, to speed up video caption generation.
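Put together, the captioning pipeline reads roughly like the runnable pseudocode below. Every function is a hypothetical stand-in for the corresponding model (a Panda70M-style short captioner, a CogView3-derived image captioner, and GPT-4 as the summarizer), not a real API:

# Hypothetical stand-ins for the real models; each returns placeholder text.
def short_video_captioner(frames):       # Panda70M-style short video captioner
    return "a panda plays a guitar in a bamboo forest"

def dense_image_captioner(frame):        # CogView3-derived dense image captioner
    return "a close-up of a panda's paws strumming guitar strings"

def summarize_with_llm(short_caption, frame_captions):  # GPT-4 as the summarizer
    return short_caption + ". " + " ".join(frame_captions)

def caption_video(video_frames):
    """Sketch of the dense-caption pipeline described above."""
    short_caption = short_video_captioner(video_frames)             # 1) clip-level caption
    keyframes = video_frames[::16]                                   # sample keyframes
    frame_captions = [dense_image_captioner(f) for f in keyframes]   # 2) dense per-frame captions
    return summarize_with_llm(short_caption, frame_captions)         # 3) one dense video caption

print(caption_video(list(range(64))))

The dense captions produced this way are the training data for the fine-tuned CogVLM2-Caption model mentioned above, which accelerates the caption-generation process.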



The above is the technical strength behind CogVideoX.

One More Thing

In the video generation space, Runway's Gen-3 has also made a new move:

Gen-3 Alpha's image-to-video now supports feeding in an image not only as the first frame of the video, but also as its last frame.

It feels like AI is turning back time.

Let’s take a look at the effect:



Video address: https://mp.weixin.qq.com/s/IXRQ6PJ7NteZGXLi2x228g



Video address: https://mp.weixin.qq.com/s/IXRQ6PJ7NteZGXLi2x228g

Finally, here are the relevant links for Zhipu AI's open-source video generation model~

Code repository:
https://github.com/THUDM/CogVideo

Model download:
https://huggingface.co/THUDM/CogVideoX-2b

Technical report:
https://github.com/THUDM/CogVideo/blob/main/resources/CogVideoX.pdf

Online experience:
https://huggingface.co/spaces/THUDM/CogVideoX