news

Zhipu open-sources the Qingying CogVideoX 2B model, which can be used for inference with a single RTX 4090

2024-08-06


Author: Large Model Mobile Group
Email: [email protected]


With the continued development of large-model technology, video generation is rapidly maturing. Closed-source video generation models such as Sora and Gen-3 are redefining the industry's future, yet to date no open-source video generation model meets the requirements of commercial-grade applications.

Adhering to its principle of "serving global developers with advanced technology", Zhipu AI announced it will open-source CogVideoX, the video generation model behind "Qingying", so that every developer and every company can freely build their own video generation models, accelerating iteration and innovation across the industry.

The CogVideoX family includes models of several sizes. We are currently open-sourcing CogVideoX-2B, which requires only 18 GB of video memory for inference at FP16 precision and 40 GB for fine-tuning. In other words, a single RTX 4090 can run inference, and a single A6000 can handle fine-tuning.

CogVideoX-2B accepts prompts of up to 226 tokens and generates 6-second videos at 8 frames per second with a resolution of 720×480. We have left ample headroom for improving video quality, and we look forward to open-source contributions in prompt optimization, video length, frame rate, resolution, scene fine-tuning, and new features built around video generation.
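The figures above pin down the generation budget per clip; a quick sanity check of what they imply (pure arithmetic on the published numbers):

```python
# Sanity-check the CogVideoX-2B generation budget from the figures above.
MAX_PROMPT_TOKENS = 226   # prompt upper limit
DURATION_S = 6            # clip length, seconds
FPS = 8                   # frames per second
WIDTH, HEIGHT = 720, 480  # output resolution

num_frames = DURATION_S * FPS
pixels_per_frame = WIDTH * HEIGHT
total_pixels = num_frames * pixels_per_frame

print(num_frames)        # 48 frames per clip
print(pixels_per_frame)  # 345600 pixels per frame
print(total_pixels)      # 16588800 pixels per 6-second clip
```

This is what the 3D VAE described below must compress before the Transformer ever sees the video.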

Models with stronger performance and larger parameters are on the way, so please stay tuned.

Code repository:
https://github.com/THUDM/CogVideo

Model Download:
https://huggingface.co/THUDM/CogVideoX-2b

Technical report: https://github.com/THUDM/CogVideo/blob/main/resources/CogVideoX.pdf


Model


VAE

Because video data contains both spatial and temporal information, its volume and computational cost far exceed those of image data. To address this, we propose a video compression method based on a 3D variational autoencoder (3D VAE), which compresses the spatial and temporal dimensions of the video simultaneously through three-dimensional convolutions, achieving a higher compression rate and better reconstruction quality.


The model consists of an encoder, a decoder, and a latent-space regularizer, with compression achieved through four stages of downsampling and upsampling. Temporal causal convolution preserves the causal ordering of frames and reduces communication overhead, and we use context parallelism to scale to long videos. In experiments we found that encoding at larger resolutions generalizes easily, whereas increasing the number of frames is more challenging. We therefore train the model in two stages: first at a lower frame rate with small batches, then fine-tuning at a higher frame rate via context parallelism. The training loss combines an L2 loss, an LPIPS perceptual loss, and the GAN loss of a 3D discriminator.
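The temporal causal convolution mentioned above can be illustrated in one dimension: padding only on the past side of the time axis guarantees each output frame depends on current and earlier frames, never future ones. This is a simplified 1-D sketch, not the model's actual 3-D kernels:

```python
import numpy as np

def causal_temporal_conv1d(x, kernel):
    """1-D temporal convolution (cross-correlation convention) made causal
    by padding only on the past side: output[t] never reads x[t+1:]."""
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])  # pad the past, not the future
    return np.array([padded[t:t + k] @ kernel for t in range(len(x))])

# Editing a future frame leaves all earlier outputs unchanged (causality):
a = causal_temporal_conv1d(np.array([1., 2., 3., 4.]), np.array([1., 1.]))
b = causal_temporal_conv1d(np.array([1., 2., 3., 9.]), np.array([1., 1.]))
print(a)                          # [1. 3. 5. 7.]
print(np.allclose(a[:3], b[:3]))  # True
```

In the autoregressive setting this property is what lets later frames be processed without waiting on, or communicating, future activations.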

Expert Transformer

We use the VAE encoder to compress the video into a latent space, then patchify the latents and unfold them into a long sequence embedding z_vision. In parallel, we encode the text input into an embedding z_text with T5, and concatenate z_text and z_vision along the sequence dimension. The concatenated sequence is fed through a stack of expert Transformer blocks. Finally, we split the output sequence back apart, restore the original latent-space shape, and decode it with the VAE to reconstruct the video.
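The concatenate-process-split flow above can be sketched with toy shapes (the hidden size and patch count here are illustrative, not the model's real dimensions; the Transformer stack itself is omitted):

```python
import numpy as np

# Toy dimensions, purely illustrative; real hidden sizes and patch counts differ.
HIDDEN = 64
z_text = np.random.randn(226, HIDDEN)     # T5 text embedding: one vector per token
z_vision = np.random.randn(1350, HIDDEN)  # patchified video latent sequence

# Concatenate along the sequence dimension so text and video tokens
# attend to each other jointly inside the expert Transformer stack.
z = np.concatenate([z_text, z_vision], axis=0)
print(z.shape)  # (1576, 64)

# After the Transformer, split the sequence back apart and restore the
# vision part to its latent-space shape for VAE decoding.
z_text_out, z_vision_out = z[:226], z[226:]
print(z_vision_out.shape)  # (1350, 64)
```

Joint attention over one concatenated sequence is what couples the two modalities, as opposed to cross-attention between separate text and video streams.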


Data

Training a video generation model requires screening for high-quality video data from which to learn real-world dynamics, and raw videos are often unsuitable due to heavy editing or shooting problems. We therefore defined negative labels to identify and exclude low-quality clips: over-edited, choppy-motion, low-quality, lecture-style, text-dominated, and screen-noise videos. With filters trained via Video-LLaMA, we labeled and screened 20,000 videos. We also computed optical-flow and aesthetic scores and dynamically adjusted their thresholds to ensure the quality of the generated videos.
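The filtering rule reads as: a clip survives only if it carries no negative label and clears both score thresholds. A minimal sketch, where the threshold values are invented for illustration and not the actual ones used in training:

```python
# Hypothetical filtering sketch; label names follow the article, but the
# threshold values below are illustrative assumptions.
NEGATIVE_LABELS = {"over-edited", "choppy-motion", "low-quality",
                   "lecture-style", "text-dominated", "screen-noise"}

def keep_video(labels, optical_flow_score, aesthetic_score,
               flow_min=0.5, aesthetic_min=4.0):
    """Keep a clip only if it has no negative label and clears both the
    motion (optical flow) and aesthetic thresholds."""
    if NEGATIVE_LABELS & set(labels):
        return False
    return optical_flow_score >= flow_min and aesthetic_score >= aesthetic_min

print(keep_video([], 1.2, 5.1))                 # True: clean, dynamic, pleasant
print(keep_video(["lecture-style"], 1.2, 5.1))  # False: negative label
print(keep_video([], 0.1, 5.1))                 # False: nearly static clip
```

In practice the thresholds would be tuned per batch, which is what "dynamically adjusted" refers to above.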

Video data usually lacks textual descriptions, which must be generated for text-to-video training, and existing video-caption datasets contain captions too short to fully describe the content. We propose a pipeline that builds video captions from image captions and fine-tune an end-to-end video captioning model to obtain denser captions: the Panda70M model produces a short clip caption, the CogView3 model generates dense captions for sampled frames, and GPT-4 summarizes them into the final dense video caption. To accelerate captioning, we also fine-tuned a CogVLM2-Caption model, built on CogVLM2-Video and Llama 3 and trained on this dense caption data.
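The three-stage pipeline can be sketched as function composition. The stage functions below are stand-in stubs (returning fixed strings) for the real models named above; only the chaining structure is the point:

```python
# Hedged sketch of the caption-densification pipeline; the stubs stand in
# for Panda70M, the per-frame image captioner, and the GPT-4 summarizer.

def short_video_caption(video):
    """Stub for the short clip-level caption (Panda70M in the article)."""
    return "a toy ship on a carpet"

def dense_frame_captions(video):
    """Stub for dense captions of sampled frames."""
    return ["wooden ship with carved masts", "blue carpet like sea waves"]

def summarize(short_cap, frame_caps):
    """Stub for the LLM that merges everything into one dense caption."""
    return short_cap + "; " + "; ".join(frame_caps)

def dense_video_caption(video):
    # Chain the stages: clip caption + frame captions -> dense caption.
    return summarize(short_video_caption(video), dense_frame_captions(video))

print(dense_video_caption(None))
```

The fine-tuned CogVLM2-Caption model then replaces this whole chain with a single end-to-end pass, which is where the speedup comes from.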



Performance

To evaluate the quality of text-to-video generation, we used multiple metrics from VBench, such as human action, scene, and dynamics. We also used two additional evaluation tools that focus on the dynamic characteristics of videos: Dynamic Quality from Devil and the GPT4o-MT score from ChronoMagic. The results are shown in the following table.


We have also verified that scaling laws hold for video generation. Going forward, we will continue scaling up data and model size while exploring novel model architectures that compress video information more efficiently and fuse text and video content more fully.


Demo

A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting.

The camera follows behind a white vintage SUV with a black roof rack as it speeds up a steep dirt road surrounded by pine trees on a steep mountain slope, dust kicks up from its tires, the sunlight shines on the SUV as it speeds along the dirt road, casting a warm glow over the scene. The dirt road curves gently into the distance, with no other cars or vehicles in sight. The trees on either side of the road are redwoods, with patches of greenery scattered throughout. The car is seen from the rear following the curve with ease, making it seem as if it is on a rugged drive through the rugged terrain. The dirt road itself is surrounded by steep hills and mountains, with a clear blue sky above with wispy clouds.

In the haunting backdrop of a war-torn city, where ruins and crumbled walls tell a story of devastation, a poignant close-up frames a young girl. Her face is smudged with ash, a silent testament to the chaos around her. Her eyes glistening with a mix of sorrow and resilience, capturing the raw emotion of a world that has lost its innocence to the ravages of conflict.

A single butterfly with wings that resemble stained glass flutters through a field of flowers. The shot captures the light as it passes through the delicate wings, creating a vibrant, colorful display. HD.

A snowy forest landscape with a dirt road running through it. The road is flanked by trees covered in snow, and the ground is also covered in snow. The sun is shining, creating a bright and serene atmosphere. The road appears to be empty, and there are no people or animals visible in the video. The style of the video is a natural landscape shot, with a focus on the beauty of the snowy forest and the peacefulness of the road.

Extreme close-up of chicken and green pepper kebabs grilling on a barbeque with flames. Shallow focus and light smoke. vivid colours
