
MotionClone: Clone video motion with one click without training

2024-07-15


AIxiv is a column in which Synced publishes academic and technical content. Over the past few years, the AIxiv column has received more than 2,000 submissions covering top laboratories at major universities and companies around the world, effectively promoting academic exchange and dissemination. If you have excellent work to share, please submit it or contact us for coverage. Submission email: [email protected]; [email protected]

Without any training or fine-tuning, the motion of a reference video can be cloned into a new scene specified by a text prompt; whether it is global camera motion or local body motion, it can be done with one click.



Paper: https://arxiv.org/abs/2406.05338

Home page: https://bujiazi.github.io/motionclone.github.io/

Code: https://github.com/Bujiazi/MotionClone

This paper proposes MotionClone, a new framework that, given any reference video, extracts the corresponding motion information without any model training or fine-tuning. This motion information can then be combined with text prompts to guide the generation of new videos, enabling text-to-video generation with customized motion.



Compared with previous studies, MotionClone has the following advantages:

No training or fine-tuning required: Previous methods typically either train a model to encode motion cues or fine-tune a video diffusion model to fit specific motion patterns. Models trained to encode motion cues generalize poorly to motion outside the training domain, while fine-tuning an existing video generation model can damage its underlying generation quality. MotionClone requires no additional training or fine-tuning, preserving the generation quality of the base model to the greatest extent while improving motion generalization.

Higher motion quality: Existing open-source video models struggle to generate large yet plausible motion. MotionClone introduces principal-component temporal-attention motion guidance, which greatly increases the motion amplitude of generated videos while effectively keeping the motion plausible.

Better spatial relationships: To avoid the spatial semantic mismatch caused by direct motion cloning, MotionClone proposes spatial semantic guidance based on a cross-attention mask, which helps couple spatial semantic information correctly with spatiotemporal motion information.

Motion Information in the Temporal Attention Module



In text-to-video models, the temporal attention module is widely used to model inter-frame correlations in a video. Since the attention scores in the temporal attention module represent the correlations between frames, an intuitive idea is to enforce exactly the same attention scores, replicating the inter-frame relationships and thereby cloning the motion.

However, experiments show that directly copying the complete attention map (plain control) achieves only very rough motion transfer. This is because most of the weights in the attention map correspond to noise or very subtle motion, which on the one hand is hard to reconcile with the new scene specified by the text, and on the other hand drowns out the genuinely useful motion guidance.

To solve this problem, MotionClone introduces primary temporal-attention guidance, which uses only the principal components of the temporal attention to sparsely guide video generation. This filters out the negative impact of noise and subtle motion information and achieves effective motion cloning in the new scene specified by the text.
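As a rough illustration of this sparse guidance, the PyTorch sketch below keeps only the largest entries of the reference temporal attention map and penalizes deviations from the reference only at those entries. The tensor shapes, the `top_ratio` parameter, and the masked-MSE form of the loss are illustrative assumptions, not the paper's exact formulation.

```python
import torch


def primary_temporal_attention_loss(
    attn_gen: torch.Tensor,   # temporal attention of the video being generated
    attn_ref: torch.Tensor,   # temporal attention from the DDIM-inverted reference,
                              # same shape, e.g. (heads, tokens, frames, frames)
    top_ratio: float = 0.1,   # fraction of entries treated as "principal" (assumed)
) -> torch.Tensor:
    """Penalize deviation from the reference attention only at its dominant entries."""
    # Flatten the frame-to-frame attention so entries can be ranked per head/token.
    flat_ref = attn_ref.flatten(start_dim=-2)            # (..., frames * frames)
    k = max(1, int(top_ratio * flat_ref.shape[-1]))

    # Mark the top-k largest reference attention weights as the principal components.
    topk_idx = flat_ref.topk(k, dim=-1).indices
    mask = torch.zeros_like(flat_ref)
    mask.scatter_(-1, topk_idx, 1.0)
    mask = mask.view_as(attn_ref)

    # Sparse guidance: match the reference only where the mask is active, leaving
    # the remaining attention weights free to adapt to the new text prompt.
    diff = (attn_gen - attn_ref) ** 2
    return (diff * mask).sum() / mask.sum().clamp(min=1.0)
```

The gradient of such a loss with respect to the noisy latent could then be used to steer each denoising step, in the spirit of classifier guidance.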



Spatial semantic correction

Principal-component temporal-attention motion guidance can clone the motion of the reference video, but it cannot ensure that the moving subject matches the user's intent, which degrades video quality and, in some cases, even misplaces the moving subject.

To solve this problem, MotionClone introduces a location-aware semantic guidance mechanism, which separates the foreground and background regions of the video using a cross-attention mask. By constraining the semantic information of the foreground and background separately, it ensures a reasonable spatial semantic layout and promotes the correct coupling of temporal motion and spatial semantics.
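As an illustration, the sketch below derives a foreground/background split from the cross-attention map of a foreground subject token and constrains the two regions separately. The per-frame normalization, the threshold, and the use of region-averaged features are assumptions made for this example, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F


def split_foreground_background(cross_attn: torch.Tensor, threshold: float = 0.5):
    """cross_attn: (frames, H, W) attention map of the foreground subject token."""
    # Normalize each frame to [0, 1] so the threshold is comparable across frames.
    lo = cross_attn.amin(dim=(-2, -1), keepdim=True)
    hi = cross_attn.amax(dim=(-2, -1), keepdim=True)
    norm = (cross_attn - lo) / (hi - lo + 1e-8)
    fg_mask = (norm > threshold).float()                 # foreground region
    return fg_mask, 1.0 - fg_mask                        # background is the complement


def region_semantic_loss(feat_gen, feat_target, fg_mask, bg_mask):
    """feat_*: (frames, C, H, W) intermediate features; masks: (frames, H, W)."""
    def masked_mean(feat, mask):
        mask = mask.unsqueeze(1)                         # broadcast over channels
        return (feat * mask).sum(dim=(-2, -1)) / mask.sum(dim=(-2, -1)).clamp(min=1.0)

    # Constrain foreground and background semantics separately, so the moving
    # subject stays where the cloned motion says it should be.
    loss_fg = F.mse_loss(masked_mean(feat_gen, fg_mask), masked_mean(feat_target, fg_mask))
    loss_bg = F.mse_loss(masked_mean(feat_gen, bg_mask), masked_mean(feat_target, bg_mask))
    return loss_fg + loss_bg
```

Constraining region averages rather than per-pixel values keeps this sketch's guidance focused on where the subject should appear, rather than on its exact appearance.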

MotionClone implementation details



DDIM inversion: MotionClone uses DDIM inversion to map the input reference video back into the latent space, from which the principal components of the reference video's temporal attention are extracted.

Guidance stage: At each denoising step, MotionClone applies principal-component temporal-attention motion guidance and spatial semantic guidance simultaneously; together they provide comprehensive motion and semantic guidance for controllable video generation.

Gaussian mask: In the spatial semantic guidance mechanism, a Gaussian kernel is used to blur the cross-attention mask, eliminating the potential influence of structural information.
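Putting these pieces together, the sketch below shows what a single guided denoising step could look like. The model call that also returns attention maps, the `ref_cache` dictionary holding quantities precomputed from the DDIM-inverted reference, and the guidance weights are hypothetical placeholders; the two loss callables stand in for the guidance terms sketched above, and the scheduler is assumed to follow the diffusers convention of returning an object with a `prev_sample` field.

```python
import torch


def guided_denoise_step(latent, t, model, scheduler, ref_cache,
                        motion_loss, semantic_loss,
                        w_motion: float = 1.0, w_semantic: float = 1.0):
    """One denoising step with gradient-based motion and semantic guidance (sketch)."""
    latent = latent.detach().requires_grad_(True)

    # Hypothetical forward pass that also exposes the temporal attention maps and
    # cross-attention maps needed by the two guidance terms.
    noise_pred, attn_temporal, attn_cross = model(latent, t)

    # Combine principal-component motion guidance with location-aware semantic
    # guidance; the reference attention and the blurred foreground mask are assumed
    # to have been precomputed from the DDIM-inverted reference video.
    loss = (w_motion * motion_loss(attn_temporal, ref_cache["temporal_attn"])
            + w_semantic * semantic_loss(attn_cross, ref_cache))

    # Nudge the latent against the gradient of the guidance loss before the usual
    # scheduler update, in the style of classifier guidance.
    grad = torch.autograd.grad(loss, latent)[0]
    guided_latent = latent.detach() - grad

    return scheduler.step(noise_pred.detach(), t, guided_latent).prev_sample
```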

Thirty videos from the DAVIS dataset were used for testing. Experimental results show that MotionClone achieves significant improvements in text alignment, temporal consistency, and multiple user-study metrics, surpassing previous motion transfer methods. The detailed results are shown in the table below.



A comparison between videos generated by MotionClone and by existing motion transfer methods is shown in the figure below, illustrating MotionClone's leading performance.



In summary, MotionClone is a new motion transfer framework that can effectively clone the motion of a reference video into a new scene specified by a user prompt, without any training or fine-tuning, providing a plug-and-play motion customization solution for existing video models.

MotionClone introduces efficient principal-component motion guidance and spatial semantic guidance while retaining the generation quality of the existing base model. While maintaining semantic alignment with the text, it significantly improves motion consistency with the reference video, achieving high-quality, controllable video generation.

In addition, MotionClone can be directly combined with a wide range of community models for diverse video generation, giving it excellent extensibility.