2024-08-06
Smart Things
Author: ZeR0
Editor: Mo Ying
Zhidongxi reported on August 6 that Zhipu AI's video generation model CogVideoX-2B was officially open-sourced last night.
The model has been released on GitHub and Hugging Face. Inference at FP16 precision requires only 18 GB of video memory, and fine-tuning requires only 40 GB, so a single RTX 4090 suffices for inference and a single A6000 for fine-tuning.
CogVideoX-2B accepts prompts of up to 226 tokens and generates 6-second videos at 8 frames per second with a resolution of 720×480.
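For reference, below is a minimal inference sketch (not from the article) using the Hugging Face diffusers integration, assuming a diffusers release that ships CogVideoXPipeline; the prompt is abridged from the first sample shown later, and num_frames=49 approximates 6 seconds at 8 fps. enable_model_cpu_offload() is one way to keep peak VRAM within a single consumer GPU.

```python
# Minimal text-to-video inference sketch for CogVideoX-2b via diffusers
# (assumes a diffusers version that includes CogVideoXPipeline).
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    torch_dtype=torch.float16,   # FP16 inference as described above
)
pipe.enable_model_cpu_offload()  # trade speed for lower peak VRAM on one GPU

prompt = "A detailed wooden toy ship gliding smoothly over a plush blue carpet."
frames = pipe(
    prompt=prompt,
    num_frames=49,           # ~6 seconds at 8 fps
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]

export_to_video(frames, "cogvideox_sample.mp4", fps=8)
```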
The open-source CogVideoX series is derived from "Qingying", Zhipu AI's commercial video generation model. Following this initial 2B release, larger and more capable open-source models will be launched later.
Code repository: https://github.com/THUDM/CogVideo
Model download: https://huggingface.co/THUDM/CogVideoX-2b
Technical report: https://github.com/THUDM/CogVideo/blob/main/resources/CogVideoX.pdf
According to the paper's radar chart, CogVideoX covers a larger area than several other video generation models, with attribute scores closer to a full hexagon.
To evaluate text-to-video quality, Zhipu AI used multiple metrics from VBench, such as human action, scene, and dynamic degree. It also used two additional video evaluation tools that focus on the dynamic characteristics of videos: Dynamic Quality from Devil and the GPT4o-MT Score from Chrono-Magic. As the table below shows, CogVideoX leads on multiple metrics.
In blind human evaluation, CogVideoX scored higher than Kuaishou KeLing on all five metrics.
The GitHub page shows several videos generated by CogVideoX-2B:
▲Prompt: A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. The ship’s hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. Surrounding the ship are various other toys and children’s items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship’s journey symbolizing endless adventures in a whimsical, indoor setting.
▲Prompt: The camera follows behind a white vintage SUV with a black roof rack as it speeds up a steep dirt road surrounded by pine trees on a steep mountain slope, dust kicks up from its tires, the sunlight shines on the SUV as it speeds along the dirt road, casting a warm glow over the scene. The dirt road curves gently into the distance, with no other cars or vehicles in sight. The trees on either side of the road are redwoods, with patches of greenery scattered throughout. The car is seen from the rear following the curve with ease, making it seem as if it is on a rugged drive through the rugged terrain. The dirt road itself is surrounded by steep hills and mountains, with a clear blue sky above with wispy clouds.
▲Prompt: A street artist, clad in a worn-out denim jacket and a colorful bandana, stands before a vast concrete wall in the heart, holding a can of spray paint, spray-painting a colorful bird on a mottled wall.
▲Prompt: In the haunting backdrop of a war-torn city, where ruins and crumbled walls tell a story of devastation, a poignant close-up frames a young girl. Her face is smudged with ash, a silent testament to the chaos around her. Her eyes glistening with a mix of sorrow and resilience, capturing the raw emotion of a world that has lost its innocence to the ravages of conflict.
CogVideoX uses a 3D VAE and an expert Transformer architecture to generate coherent long videos, and it builds a relatively high-quality collection of text-described video clips with a self-developed video understanding model.
Because video data contains both spatial and temporal information, its volume and computational burden far exceed those of image data. CogVideoX therefore uses a 3D Variational Autoencoder (3D VAE), a video compression method that compresses the spatial and temporal dimensions of the video simultaneously through three-dimensional convolution, achieving a higher compression rate and better reconstruction quality.
▲3D VAE architecture in CogVideoX
The model structure comprises an encoder, a decoder, and a latent-space regularizer, with compression achieved through four stages of downsampling and upsampling. Temporally causal convolution preserves the causal order of frames and reduces communication overhead, while context parallelism adapts the model to large-scale video processing.
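As a rough illustration of the temporally causal convolution idea, here is an independent PyTorch sketch; CausalConv3d and its padding scheme are simplified assumptions, not the released CogVideoX implementation. Padding only the "past" side of the time axis keeps each output frame from depending on future frames.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """Illustrative temporally-causal 3D convolution (simplified sketch).

    Spatial dims are padded symmetrically; the time dim is padded only on
    the past side, so the output at frame t never sees frames after t.
    """
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.pad_t = kernel_size - 1      # causal padding along time
        self.pad_s = kernel_size // 2     # symmetric spatial padding
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size, stride=stride)

    def forward(self, x):                 # x: (B, C, T, H, W)
        x = F.pad(x, (self.pad_s, self.pad_s,   # W
                      self.pad_s, self.pad_s,   # H
                      self.pad_t, 0))           # T: pad past side only
        return self.conv(x)

# Example: downsample a short clip spatially by 2x with a causal 3D conv
clip = torch.randn(1, 3, 9, 64, 64)       # (B, C, T, H, W)
down = CausalConv3d(3, 16, kernel_size=3, stride=(1, 2, 2))
print(down(clip).shape)                   # torch.Size([1, 16, 9, 32, 32])
```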
In experiments, Zhipu AI found that encoding at large resolutions generalizes easily, whereas increasing the number of frames is more challenging, so training proceeds in two stages: first at a lower frame rate with small batches, then fine-tuning at a higher frame rate with context parallelism. The training loss combines an L2 loss, an LPIPS perceptual loss, and a GAN loss from a 3D discriminator.
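A sketch of how such a combined objective might be assembled is shown below; the loss weights and the per-frame use of the lpips package are illustrative assumptions, not the paper's values or implementation.

```python
import torch
import torch.nn.functional as F
import lpips  # pip install lpips; stand-in for the perceptual loss term

lpips_fn = lpips.LPIPS(net="vgg")

def vae_recon_loss(recon, target, disc_logits_fake,
                   w_lpips=0.1, w_gan=0.05):
    """Illustrative combination of L2 + LPIPS + GAN losses for a video VAE.

    recon, target: (B, C, T, H, W) tensors in [-1, 1].
    disc_logits_fake: 3D-discriminator logits for the reconstruction.
    The weights are placeholders, not the paper's values.
    """
    l2 = F.mse_loss(recon, target)

    # LPIPS is an image metric, so apply it per frame and average.
    b, c, t, h, w = recon.shape
    frames_r = recon.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
    frames_t = target.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
    perceptual = lpips_fn(frames_r, frames_t).mean()

    # Non-saturating generator loss against the 3D discriminator.
    gan = F.softplus(-disc_logits_fake).mean()

    return l2 + w_lpips * perceptual + w_gan * gan
```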
Zhipu AI uses the VAE encoder to compress the video into a latent space, then splits the latent into patches and unfolds them into a long sequence embedding z_vision. Meanwhile, T5 encodes the text input into a text embedding z_text, and z_text and z_vision are concatenated along the sequence dimension. The concatenated embedding is fed into the expert Transformer; the output embeddings are then split back apart to restore the original latent shape and decoded by the VAE to reconstruct the video.
▲CogVideoX architecture
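In pseudocode, the data flow described above might look like the sketch below; vae, patchify, unpatchify, expert_transformer, and the other callables are hypothetical placeholders rather than the repository's actual API.

```python
import torch

def cogvideox_forward(video, prompt, vae, t5_encoder, tokenizer,
                      patchify, unpatchify, expert_transformer):
    """Illustrative sketch of the CogVideoX data flow (names are placeholders).

    1. Compress the video into a latent with the 3D VAE encoder.
    2. Patchify the latent into a sequence embedding z_vision.
    3. Encode the prompt with T5 into z_text.
    4. Concatenate along the sequence dimension and run the expert Transformer.
    5. Drop the text tokens, unpatchify, and decode back to pixels.
    """
    latent = vae.encode(video)                        # (B, C, T', H', W')
    z_vision = patchify(latent)                       # (B, N_vis, D)

    tokens = tokenizer(prompt, return_tensors="pt")
    z_text = t5_encoder(**tokens).last_hidden_state   # (B, N_txt, D)

    z = torch.cat([z_text, z_vision], dim=1)          # sequence-dim concat
    z = expert_transformer(z)

    z_vision_out = z[:, z_text.shape[1]:]             # keep only visual tokens
    latent_out = unpatchify(z_vision_out)             # restore latent shape
    return vae.decode(latent_out)                     # reconstructed video
```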
In terms of training data, Zhipu AI developed negative labels to identify and exclude low-quality videos, annotating and screening 20,000 video samples with filters trained on video-llama; it also computed optical-flow and aesthetic scores, dynamically adjusting thresholds to ensure the quality of the generated videos.
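One way to picture threshold-based screening on motion and aesthetics is the sketch below; optical_flow_score, aesthetic_score, and the threshold values are hypothetical stand-ins, not Zhipu AI's released filters.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Clip:
    path: str
    caption: str

def filter_clips(clips: List[Clip],
                 optical_flow_score: Callable[[str], float],
                 aesthetic_score: Callable[[str], float],
                 min_flow: float = 0.5,
                 min_aesthetic: float = 4.5) -> List[Clip]:
    """Keep clips with enough motion and sufficient visual quality.

    The two scoring callables and both thresholds are illustrative; in
    practice the thresholds would be tuned dynamically against the data.
    """
    kept = []
    for clip in clips:
        if optical_flow_score(clip.path) < min_flow:
            continue   # nearly static clip: too little dynamic content
        if aesthetic_score(clip.path) < min_aesthetic:
            continue   # low visual quality
        kept.append(clip)
    return kept
```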
To address the lack of video caption data, Zhipu AI proposed a pipeline for generating video captions from image captions and fine-tuned an end-to-end video captioning model to obtain denser captions. The method generates short captions with the Panda70M model, produces dense image captions with the CogVLM model, and then has GPT-4 summarize them into the final video caption.
The team also fine-tuned a CogVLM2-Caption model on the dense caption data to accelerate video caption generation.
▲Dense caption data generation process
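The captioning pipeline described above can be summarized in the sketch below; sample_frames, generate_short_caption, caption_frame, and summarize_with_llm are hypothetical placeholders for frame sampling, a Panda70M-style short captioner, the per-frame image captioner, and the GPT-4 summarization step.

```python
from typing import Callable, List

def dense_video_caption(video_path: str,
                        sample_frames: Callable,
                        generate_short_caption: Callable,
                        caption_frame: Callable,
                        summarize_with_llm: Callable,
                        num_frames: int = 8) -> str:
    """Illustrative sketch of the image-caption-to-video-caption pipeline.

    All callables are hypothetical stand-ins: a short video captioner,
    an image captioning model for per-frame dense captions, and an LLM
    that fuses everything into one dense video caption.
    """
    short_caption = generate_short_caption(video_path)       # coarse summary
    frames = sample_frames(video_path, num_frames)            # key frames
    frame_captions: List[str] = [caption_frame(f) for f in frames]
    return summarize_with_llm(short_caption, frame_captions)  # final caption
```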
The Zhipu AI team is continuing to improve CogVideoX's ability to capture complex dynamics, exploring new model architectures, more efficient video compression, and tighter integration of text and video content. The goal is to keep probing the scaling laws of video generation models and to train larger, more powerful models that generate longer, higher-quality videos.
Video generation models and applications are proliferating and the technology is gradually maturing, but no open-source video generation model has yet met the bar for commercial applications. We look forward to more video generation models being open-sourced, encouraging more developers and companies to take part in building video generation models and applications and to contribute technical optimizations and new features around video generation.