
Alibaba releases "Magic Pen Ma Liang Version of Sora", which can make cats turn with just a touch, 20 demonstration videos and 10 pages of technical reports

2024-08-03



Zhidongxi (WeChat public account: zhidxcom)
Author: Vanilla
Editor: Li Shuiqing

The AI video generation race is booming, with new text-to-video and image-to-video products appearing one after another both in China and abroad. Amid fierce competition among major vendors, current video generation models are approaching "indistinguishable from the real thing" quality in most respects.

However, most video generation models still fall short in accuracy and instruction following. Generating a video remains a game of chance: users often have to generate many times before getting a result that meets their needs, which drives up compute costs and wastes resources.

How can video generation be made more accurate, the number of retries reduced, and satisfactory videos produced with as few resources as possible?

Zhidongxi reported on August 3 that the Alibaba team recently launched Tora, a video generation model that can quickly produce precisely motion-controlled videos from trajectories, images, text, or any combination of them, with just a few strokes. It also supports first- and last-frame control, bringing the controllability of video generation to a new level.

//oss.zhidx.com/uploads/2024/08/66acd09cc2d2b_66acd09cbf165_66acd09cbf141_Opening.mp4

Tora is the first trajectory-oriented DiT framework. Leveraging the scalability of DiT, the object motion Tora generates not only follows the given trajectory accurately but also plausibly simulates the dynamics of the physical world. The related paper was published on arXiv on August 1.


▲Tora's paper

Tora currently offers only video demonstrations; its project homepage states that an online demo and the inference and training code will be released in the future.

Paper address:

https://arxiv.org/abs/2407.21705

Project address:

https://ali-videoai.github.io/tora_video/

1. Combined input across three modalities for precise control of motion trajectories

Tora supports three modalities, trajectory, text, and image, individually or in combination, enabling dynamic and precise control over video content at different lengths, aspect ratios, and resolutions.

The trajectory input can be any directed straight line or curve, and multiple trajectories in different directions can be combined. For example, an S-shaped curve can steer a floating object's path while the text prompt controls its speed; the prompts in the video below use adverbs such as "slowly", "gracefully", and "gently".

//oss.zhidx.com/uploads/2024/08/66acd0922df15_66acd0921dea0_66acd0921de7e_curve trajectory.mp4

A trajectory can also move back and forth along an axis to create a shaking effect.

//oss.zhidx.com/uploads/2024/08/66acd09e8ab1e_66acd09e86884_66acd09e86862_back and forth trajectory.mp4

Drawing different trajectories on the same image can also allow Tora to generate videos with different movement directions.

//oss.zhidx.com/uploads/2024/08/66acd0948ef53_66acd0948af6b_66acd0948af47_same picture.mp4

Given the same trajectory input, Tora generates different motion patterns depending on the subject.

//oss.zhidx.com/uploads/2024/08/66acd09285368_66acd09281598_66acd09281575_circle.mp4

Unlike the common "motion brush" feature, Tora can generate videos from a combination of trajectories and text even without an input image.

For example, the first and third clips in the video below were generated without an initial frame, from trajectories and text alone.

//oss.zhidx.com/uploads/2024/08/66acd09712f12_66acd0970ea1c_66acd0970e9fa_Track Text.mp4

Tora also supports first- and last-frame control, but this case appears only as images in the paper; no video demonstration is provided.


▲Tora first and last frame control

So can the same effect be achieved with only two modalities, text and image? With this question in mind, I fed the same initial frame and prompts into other AI video generators.

The following videos were generated by Tora, Vidu, Qingying, and Keling (left to right, top to bottom). When the trajectory is a straight line, generation without a trajectory input barely meets the requirement.

//oss.zhidx.com/uploads/2024/08/66acd5287df2f_66acd5287a1b5_66acd5287a197_鱼.mp4

But when the required motion trajectory becomes a curve, traditional text + image input can hardly meet the needs.

//oss.zhidx.com/uploads/2024/08/66acd51822425_66acd5181dfab_66acd5181df87_花.mp4

2. Built on the OpenSora framework, with two novel motion-processing modules

Tora adopts OpenSora as the DiT architecture of its base model. OpenSora is an open-source video generation framework designed by the AI startup Luchen Technology.

To achieve DiT-based trajectory-controlled video generation, Tora introduces two new motion-processing modules, the Trajectory Extractor and the Motion-guidance Fuser, which encode the provided trajectory into multi-level spatio-temporal motion patches.

The figure below shows Tora's overall architecture. The design preserves DiT's scalability and can create longer, high-resolution, motion-controlled videos.


▲Tora overall architecture

Among them, the Trajectory Extractor uses a 3D motion VAE (variational autoencoder) to embed trajectory vectors into the same latent space as the video patches, effectively preserving motion information across consecutive frames. Stacked convolutional layers then extract hierarchical motion features.
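Tora's code has not been released, but the idea can be illustrated with a minimal PyTorch sketch: a small 3D convolutional encoder stands in for the 3D motion VAE, and stacked convolutions then produce one motion feature map per level. The class name, channel sizes, and strides below are illustrative assumptions, not Tora's actual implementation.

import torch
import torch.nn as nn

class TrajectoryExtractorSketch(nn.Module):
    """Embeds per-frame trajectory displacement maps into a video-like latent
    space, then extracts hierarchical (multi-level) motion features."""
    def __init__(self, in_channels=2, latent_channels=16, levels=3):
        super().__init__()
        # Stand-in for the 3D motion VAE encoder: compress (dx, dy) maps over
        # time down to the spatial resolution of the video latents.
        self.encoder = nn.Sequential(
            nn.Conv3d(in_channels, latent_channels, 3, stride=(1, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(latent_channels, latent_channels, 3, stride=(1, 2, 2), padding=1),
        )
        # Stacked convolutions yield one motion feature map per DiT level.
        self.level_convs = nn.ModuleList(
            nn.Conv3d(latent_channels, latent_channels, 3, padding=1)
            for _ in range(levels)
        )

    def forward(self, traj):  # traj: (B, 2, T, H, W) per-frame displacements
        z = self.encoder(traj)
        features = []
        for conv in self.level_convs:
            z = torch.relu(conv(z))
            features.append(z)  # multi-level spatio-temporal motion patches
        return features

# Example: one 16-frame trajectory map at 64x64 resolution.
feats = TrajectoryExtractorSketch()(torch.randn(1, 2, 16, 64, 64))
print([tuple(f.shape) for f in feats])  # three feature levels at 16x16 spatial size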

The Motion-guidance Fuser uses adaptive normalization layers to seamlessly inject these multi-level motion conditions into the corresponding DiT blocks, ensuring that video generation always follows the defined trajectory.

To combine DiT-based video generation with trajectories, the authors explore three variants of the fusion architecture that inject motion patches into each STDiT block; among them, adaptive norm performs best.


▲Three architectural designs of motion-guided fusion devices
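For intuition, here is a hedged sketch of the adaptive-norm idea: the motion feature predicts a per-token scale and shift that modulates the normalized hidden states of a transformer block (AdaLN-style conditioning). The class name and dimensions are assumptions for illustration; Tora's actual fuser may differ in detail.

import torch
import torch.nn as nn

class AdaptiveNormFuserSketch(nn.Module):
    """Injects motion features into a DiT block's hidden states by predicting
    a per-token scale and shift (AdaLN-style conditioning)."""
    def __init__(self, hidden_dim, motion_dim):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        # Map motion features to a (scale, shift) pair per hidden dimension.
        self.to_scale_shift = nn.Linear(motion_dim, 2 * hidden_dim)

    def forward(self, hidden, motion):  # hidden: (B, N, D); motion: (B, N, M)
        scale, shift = self.to_scale_shift(motion).chunk(2, dim=-1)
        # Modulate the normalized tokens so generation follows the trajectory.
        return self.norm(hidden) * (1 + scale) + shift

fuser = AdaptiveNormFuserSketch(hidden_dim=768, motion_dim=16)
out = fuser(torch.randn(1, 1024, 768), torch.randn(1, 1024, 16))
print(out.shape)  # torch.Size([1, 1024, 768])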

During training, the authors adopted different strategies for different input conditions.

For trajectory training, Tora uses a two-stage approach. The first stage extracts dense optical flow from the training videos; the second stage randomly selects 1 to N object trajectory samples from the flow based on motion segmentation results and optical-flow scores, and finally applies a Gaussian filter for refinement.
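As a rough illustration of the second stage, the sketch below samples a single trajectory by following the dense optical flow from a seed point and then smooths it with a Gaussian filter, using NumPy/SciPy. The function name, the seeding, and the smoothing parameter are assumptions rather than the paper's exact pipeline, which also uses motion segmentation and flow scores to choose which objects to track.

import numpy as np
from scipy.ndimage import gaussian_filter1d

def sample_trajectory(flow, start_xy, sigma=2.0):
    """flow: (T, H, W, 2) dense optical flow; start_xy: (x, y) seed point.
    Returns a smoothed (T + 1, 2) point trajectory."""
    T, H, W, _ = flow.shape
    points = [np.asarray(start_xy, dtype=float)]
    for t in range(T):
        x, y = points[-1]
        xi = int(np.clip(round(x), 0, W - 1))
        yi = int(np.clip(round(y), 0, H - 1))
        # Advance the point by the local flow vector of frame t.
        points.append(points[-1] + flow[t, yi, xi])
    traj = np.stack(points)
    return gaussian_filter1d(traj, sigma, axis=0)  # Gaussian refinement step

flow = 0.5 * np.random.randn(16, 64, 64, 2)  # toy flow field
print(sample_trajectory(flow, (32, 32)).shape)  # (17, 2)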

For image training, Tora follows OpenSora's masking strategy for visual conditioning: frames are randomly unmasked during training, and the video patches of unmasked frames are not perturbed by noise. This allows Tora to integrate text, images, and trajectories seamlessly in a unified model.
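The masking idea can be sketched as follows: randomly chosen frames are kept "clean" (no noise is added to their patches) so they act as fixed visual conditions such as a first or last frame. This is a simplified, assumed formulation for illustration, not OpenSora's exact masking code.

import torch

def add_noise_with_frame_mask(latents, noise, keep_prob=0.25):
    """latents, noise: (B, C, T, H, W). Frames drawn as 'unmasked' keep their
    clean latents so they serve as fixed visual conditions (e.g. a first frame)."""
    B, _, T, _, _ = latents.shape
    keep = torch.rand(B, 1, T, 1, 1) < keep_prob   # which frames stay clean
    noisy = latents + noise                        # toy stand-in for the noising step
    return torch.where(keep, latents, noisy), keep

latents = torch.randn(1, 4, 16, 32, 32)
noisy, keep = add_noise_with_frame_mask(latents, torch.randn_like(latents))
print(keep.flatten().int().tolist())  # 1 marks frames kept as clean conditions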

In quantitative comparisons with state-of-the-art motion-controllable video generation models, Tora's advantage over UNet-based methods grows as the number of generated frames increases, and it maintains highly stable trajectory control.


▲ Comparison of Tora with other controllable video generation models

For example, based on the same input, the video generated by Tora is smoother than that generated by the DragNUWA and MotionCtrl models, and follows the motion trajectory more accurately.

//oss.zhidx.com/uploads/2024/08/66acd0bd4936e_66acd0bd456db_66acd0bd456b9_comparison video.mp4

3. The "futures" have been delivered, and Alibaba continues its push into AI video

The AI video generation industry is in full swing, and Alibaba has been pushing hard on this track. Compared with general-purpose models such as Sora that focus on video length and quality, the Alibaba team's projects seem to focus more on applying algorithms to specific forms of video generation.

In January this year, Tongyi Qianwen launched "National Dance King", which went viral with the "Terracotta Warriors dancing Subject 3" clips; in February, Alibaba released EMO, a portrait video generation framework that can make the person in a single photo speak.

At the time, Zhidongxi tallied Alibaba's AI video lineup: at least 7 new projects in 4 months, covering text-to-video, image-to-video, character dancing, talking portraits, and more (see: "Domestic god-level AI debuts! Gao Qiqiang becomes Luo Xiang, Cai Xukun becomes the king of rap, and also cooperates with Sora").

Half a year later, EMO has gone from "futures" to a "national singing and performing" feature in the Tongyi app, available to everyone, and Alibaba has released more AI video projects since.

1. AtomoVideo: High-fidelity image-to-video generation

AtomoVideo, released on March 5, is a high-fidelity image-to-video framework built on multi-granularity image injection plus high-quality datasets and training strategies. It keeps the generated video highly faithful to the given reference image while achieving rich motion intensity and good temporal consistency.


▲AtomoVideo generates video effects

Project homepage: https://atomo-video.github.io/

2. EasyAnimate-v3: Generating high-resolution long videos from a single image + text

EasyAnimate is a video generation and processing pipeline launched by Alibaba on April 12 and upgraded to v3 within just three months. It extends the DiT framework with a motion module that better captures temporal dynamics, ensuring smooth and consistent output, and it can generate videos of roughly 6 seconds at various resolutions and a frame rate of 24 fps.


▲EasyAnimate v3 generates video effects

Project homepage: https://github.com/aigc-apps/EasyAnimate

Conclusion: AI video generation becomes more controllable

Now that AI video generation has reached a certain level in length and quality, making the generated videos more controllable and better aligned with users' needs has become a key issue.

As accuracy, controllability, and resource efficiency continue to improve, the user experience of AI video generation products will enter a new stage, and prices will become more affordable, allowing more creators to take part.