
Kuaishou open-sources LivePortrait, now at 6.6K GitHub Stars, enabling fast transfer of facial expressions and poses

2024-07-17


Published by Synced

Synced Editorial Department

Recently, the Kuaishou Keling large model team open-sourced LivePortrait, a controllable portrait video generation framework that can accurately transfer the expressions and poses of a driving video onto a static or dynamic portrait in real time, producing highly expressive video results. As shown in the following animated images:



LivePortrait test from netizens



LivePortrait test from netizens

The title of the paper corresponding to LivePortrait open sourced by Kuaishou is:

"LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control"



LivePortrait paper homepage

Moreover, true to Kuaishou style, LivePortrait was usable immediately upon release, with one-click access to the paper, project homepage, and code. Hugging Face CEO Clément Delangue followed and reposted it, and Chief Strategy Officer Thomas Wolf tried the feature himself and called it amazing.



It also attracted attention and large-scale testing from netizens around the world.



Video clips are from X

Video link: https://mp.weixin.qq.com/s?__biz=MzA3MzI4MjgzMw==&mid=2650926594&idx=3&sn=7d44eac3409c6c2d5587ef80d7575a69&chksm=84e42a7cb393a36a0da7b8d223f28c5ed51095e53a449ea8e341ddd5f71576595776c02109b6&token=1755385124&lang=zh_CN#rd

At the same time, LivePortrait has received widespread attention from the open-source community. In just over a week it has gained 6.4K Stars, 550 Forks, and 140 Issues & PRs, has been widely praised, and the attention is still growing:



In addition, it held first place on the HuggingFace Space and Papers with Code trending lists for a week, and recently topped the HuggingFace all-topics ranking:



HuggingFace Space No.1



Papers with Code ranked first



HuggingFace All Topics Ranking No.1

For more resource information, see:

  • Code address: https://github.com/KwaiVGI/LivePortrait
  • Paper link: https://arxiv.org/abs/2407.03168
  • Project homepage: https://liveportrait.github.io/
  • HuggingFace Space one-click online experience: https://huggingface.co/spaces/KwaiVGI/LivePortrait

What technology lies behind LivePortrait's rapid rise in popularity across the Internet?

Method Introduction

Unlike current mainstream diffusion-based methods, LivePortrait explores and extends the potential of the implicit-keypoint framework, balancing computational efficiency and controllability. LivePortrait focuses on better generalization, controllability, and practical efficiency. To improve generation quality and controllability, LivePortrait uses 69M high-quality training frames, a video-image hybrid training strategy, an upgraded network structure, and better motion modeling and optimization objectives. In addition, LivePortrait regards implicit keypoints as an effective implicit representation of facial blendshapes and, on that basis, carefully designs stitching and retargeting modules. These two modules are lightweight MLP networks, so they improve controllability at negligible computational cost. Even compared with some existing diffusion-based methods, LivePortrait remains very competitive. On an RTX 4090 GPU, LivePortrait generates a single frame in 12.8 ms, and with further optimization such as TensorRT it is expected to drop below 10 ms.
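
To make this flow concrete, the following Python-style sketch illustrates an implicit-keypoint driving loop of the kind described above. The component interfaces (an appearance feature extractor F, a unified motion extractor M, a warping module W, a decoder G), the tensor layouts and the exact placement of the scale term are illustrative assumptions, not the released implementation.

import torch

# Minimal sketch of implicit-keypoint portrait driving: extract appearance
# features from the reference image once, extract motion (pose, expression,
# scale, translation) per driving frame, transform the canonical keypoints,
# warp the appearance features from the source to the driven keypoints, and
# decode each driven frame. All names and shapes here are illustrative.
def drive_portrait(F, M, W, G, source_img, driving_frames):
    f_s = F(source_img)                              # appearance features of the reference portrait
    x_c, R_s, delta_s, s_s, t_s = M(source_img)      # canonical keypoints + source motion
    x_s = s_s * (x_c @ R_s + delta_s) + t_s          # source implicit keypoints

    outputs = []
    for frame in driving_frames:
        _, R_d, delta_d, s_d, t_d = M(frame)         # motion of the driving frame
        x_d = s_d * (x_c @ R_d + delta_d) + t_d      # driven implicit keypoints
        warped = W(f_s, x_s, x_d)                    # warp appearance features from x_s to x_d
        outputs.append(G(warped))                    # decode the driven image
    return torch.stack(outputs)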

The training of LivePortrait is divided into two stages: the first stage trains the base model, and the second stage trains the stitching and retargeting modules.

First stage: base model training



First stage: base model training

In the first stage, LivePortrait makes a series of improvements to implicit-keypoint-based frameworks such as face vid2vid [1], including:

High-quality training data collection: LivePortrait uses the public video datasets Voxceleb [2], MEAD [3] and RAVDESS [4] and the stylized image dataset AAHQ [5]. In addition, it uses large-scale 4K-resolution portrait videos with diverse expressions and poses, more than 200 hours of talking-head videos, the private dataset LightStage [6], and some stylized videos and images. LivePortrait splits long videos into segments of less than 30 seconds and ensures that each segment contains only one person. To guarantee training data quality, LivePortrait uses Kuaishou's self-developed KVQ [7] (a video quality assessment method that comprehensively perceives the quality, content, scene, aesthetics, encoding, audio and other characteristics of a video and evaluates it along multiple dimensions) to filter out low-quality clips. In total, the training data comprises 69M frames, covering 18.9K identities and 60K static stylized portraits.
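
The curation step can be pictured roughly as below; the helpers segment, count_faces and quality_score are hypothetical stand-ins (KVQ itself is not public) and the threshold is arbitrary, so this is only a sketch of the described filtering, not the actual pipeline.

from typing import Callable, List

# Illustrative curation pass mirroring the description above: split long
# videos into clips of at most 30 seconds, keep only single-person clips,
# and drop clips whose quality score falls below a threshold.
def curate_clips(videos: List[str],
                 segment: Callable[[str, float], List[str]],
                 count_faces: Callable[[str], int],
                 quality_score: Callable[[str], float],
                 max_len_s: float = 30.0,
                 min_quality: float = 0.5) -> List[str]:
    kept = []
    for video in videos:
        for clip in segment(video, max_len_s):
            if count_faces(clip) != 1:              # each segment must contain exactly one person
                continue
            if quality_score(clip) < min_quality:   # filter low-quality clips (KVQ in the real pipeline)
                continue
            kept.append(clip)
    return kept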

Video-image hybrid training: A model trained only on real-person portrait videos performs well on real portraits but generalizes poorly to stylized portraits (such as anime). Stylized portrait videos are relatively scarce; LivePortrait collected only about 1.3K video clips from fewer than 100 identities. In contrast, high-quality stylized portrait images are more abundant: LivePortrait collected about 60K images of different identities, providing diverse identity information. To exploit both data types, LivePortrait treats each image as a single-frame video clip and trains the model on videos and images simultaneously. This hybrid training improves the model's generalization ability.
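
The hybrid strategy can be thought of as wrapping every stylized image as a one-frame clip and sampling videos and images from the same pool. The PyTorch-style dataset below is a minimal sketch of that idea; the class and loader names are illustrative assumptions.

import random
from torch.utils.data import Dataset

# Minimal sketch of video-image hybrid training: stylized images are treated
# as single-frame "clips" so that videos and images can be sampled from the
# same dataset. `load_video` and `load_image` are placeholder loaders.
class HybridPortraitDataset(Dataset):
    def __init__(self, video_clips, style_images, load_video, load_image):
        self.items = [("video", p) for p in video_clips] + \
                     [("image", p) for p in style_images]
        self.load_video, self.load_image = load_video, load_image

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        kind, path = self.items[idx]
        if kind == "video":
            frames = self.load_video(path)              # list of frames
            i, j = random.sample(range(len(frames)), 2)
            return frames[i], frames[j]                 # same-identity source/driving pair
        img = self.load_image(path)
        return img, img                                 # an image acts as a one-frame clip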

Upgraded network structure: LivePortrait unifies the canonical implicit keypoint estimation network (L), the head pose estimation network (H) and the expression deformation estimation network (Δ) into a single model (M) with a ConvNeXt-V2-Tiny [8] backbone, which directly estimates the canonical implicit keypoints, head pose and expression deformation of the input image. In addition, inspired by follow-up work on face vid2vid, LivePortrait uses the more effective SPADE [9] decoder as the generator (G): the warped implicit features (fs) are fed into the SPADE decoder, where each channel of the implicit features serves as a semantic map for generating the driven image. To improve efficiency, LivePortrait also inserts a PixelShuffle [10] layer as the last layer of (G), raising the output resolution from 256 to 512.
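
A unified motion extractor of this kind could look roughly like the sketch below (assuming a recent timm for the ConvNeXt-V2-Tiny backbone). The number of implicit keypoints, the head layouts and the pose parameterization are assumptions for illustration, not the released architecture.

import timm
import torch
import torch.nn as nn

# Illustrative unified motion extractor M: one ConvNeXt-V2-Tiny backbone with
# separate linear heads for canonical implicit keypoints, head pose (as
# yaw/pitch/roll here, to be converted to a rotation matrix), translation,
# scale and per-keypoint expression deformation.
class MotionExtractor(nn.Module):
    def __init__(self, num_kp: int = 21, feat_dim: int = 768):
        super().__init__()
        self.num_kp = num_kp
        self.backbone = timm.create_model("convnextv2_tiny",
                                          pretrained=False, num_classes=0)  # pooled features
        self.kp_head    = nn.Linear(feat_dim, num_kp * 3)   # canonical keypoints x_c
        self.pose_head  = nn.Linear(feat_dim, 3)            # yaw, pitch, roll
        self.exp_head   = nn.Linear(feat_dim, num_kp * 3)   # expression deformation delta
        self.scale_head = nn.Linear(feat_dim, 1)            # scale s
        self.trans_head = nn.Linear(feat_dim, 3)            # translation t

    def forward(self, img: torch.Tensor):
        f = self.backbone(img)
        x_c   = self.kp_head(f).view(-1, self.num_kp, 3)
        delta = self.exp_head(f).view(-1, self.num_kp, 3)
        return x_c, self.pose_head(f), delta, self.scale_head(f), self.trans_head(f)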

More flexible motion transformation modeling: The original implicit-keypoint formulation ignores the scaling factor, so the scale tends to be absorbed into the expression coefficients, which makes training harder. To address this, LivePortrait introduces an explicit scaling factor into the motion modeling. LivePortrait also found that applying the scale only to the canonical keypoint projection makes the learnable expression coefficients overly flexible, causing texture sticking when driving across identities. The transformation LivePortrait adopts is therefore a compromise between flexibility and drivability.
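
In the notation used later in this article, the adopted transformation is roughly xd = sd · (xc,s Rd + δd) + td, i.e., the scale sd multiplies the rotated canonical keypoints together with the expression deformation δd rather than the canonical projection alone; this is the compromise between flexibility and drivability referred to above.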

Keypoint-guided implicit keypoint optimization: The original implicit-keypoint framework seems unable to vividly drive subtle facial expressions such as blinking and eye movements; in particular, the eye direction and head orientation in the driven results tend to remain parallel. LivePortrait attributes these limitations to the difficulty of learning subtle facial expressions without supervision. To address this, LivePortrait introduces 2D keypoints to capture micro-expressions and uses a keypoint-guided loss (Lguide) to guide the optimization of the implicit keypoints.

Cascade loss function: LivePortrait uses the implicit keypoint invariance loss (LE), keypoint prior loss (LL), head pose loss (LH) and deformation prior loss (LΔ) of face vid2vid. To further improve texture quality, it adds perceptual and GAN losses, applied not only to the global region of the input image but also to local face and mouth regions, denoted as the cascade perceptual loss (LP, cascade) and cascade GAN loss (LG, cascade); the face and mouth regions are defined by 2D semantic keypoints. LivePortrait also uses a face identity loss (Lfaceid) to preserve the identity of the reference image.

All modules in the first stage are trained from scratch, and the overall training objective (Lbase) is a weighted sum of the above loss terms.
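
Written out, this objective is simply (the weights λ are training hyperparameters not specified here):

Lbase = λE·LE + λL·LL + λH·LH + λΔ·LΔ + λP,cascade·LP,cascade + λG,cascade·LG,cascade + λfaceid·Lfaceid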

Second stage: stitching and retargeting module training

LivePortrait regards implicit keypoints as an implicit facial blendshape representation and finds that this combination can be learned well with a lightweight MLP at negligible computational cost. Considering practical needs, LivePortrait designs a stitching module, an eye retargeting module, and a mouth retargeting module. When the reference portrait is cropped, the driven portrait is pasted back from the cropped space into the original image space; the stitching module is added to avoid pixel misalignment during this paste-back, for example around the shoulders. As a result, LivePortrait can drive larger image sizes and group photos. The eye retargeting module addresses incomplete eye closure during cross-identity driving, especially when a portrait with small eyes drives a portrait with large eyes. The mouth retargeting module follows a similar design: it normalizes the input by driving the mouth of the reference image to a closed state, which makes driving easier.



Second stage of model training: stitching and retargeting module training

Stitching module: During training, the input of the stitching module (S) is the implicit keypoints (xs) of the reference image and the implicit keypoints (xd) of a driving frame from another identity, and it estimates the stitching change (Δst) of the driving implicit keypoints (xd). Unlike the first stage, LivePortrait uses cross-identity rather than same-identity motion to increase the training difficulty, aiming to give the stitching module better generalization. The driving implicit keypoints (xd) are then updated, and the corresponding driven output is (Ip,st). LivePortrait also outputs the self-reconstructed image (Ip,recon) at this stage. Finally, the loss function (Lst) of the stitching module computes a pixel consistency loss over the shoulder regions of the two outputs and a regularization loss on the stitching change.
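
Since the stitching module is described as a lightweight MLP over the implicit keypoints, a minimal sketch could look like the following; the layer widths, the flattened input layout and the output shape are assumptions for illustration.

import torch
import torch.nn as nn

# Illustrative stitching module S: a small MLP that takes the flattened source
# keypoints x_s and driving keypoints x_d and predicts a per-keypoint stitching
# change delta_st, which is added to x_d before warping and decoding.
class StitchingModule(nn.Module):
    def __init__(self, num_kp: int = 21, hidden: int = 128):
        super().__init__()
        in_dim = num_kp * 3 * 2                  # x_s and x_d, flattened and concatenated
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_kp * 3),       # stitching change delta_st
        )

    def forward(self, x_s: torch.Tensor, x_d: torch.Tensor) -> torch.Tensor:
        inp = torch.cat([x_s.flatten(1), x_d.flatten(1)], dim=1)
        return self.mlp(inp).view_as(x_d)        # same shape as the driving keypoints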

Eye and mouth retargeting modules: The input of the eye retargeting module (Reyes) is the implicit keypoints of the reference image (xs), the eye-openness condition tuple of the reference image, and a randomly sampled driving eye-openness coefficient; from these it estimates the deformation change (Δeyes) of the driving keypoints. The eye-openness condition tuple represents the eye-open ratio: the larger the ratio, the more open the eyes. Similarly, the input of the mouth retargeting module (Rlip) is the implicit keypoints of the reference image (xs), the mouth-openness condition coefficient of the reference image, and a randomly sampled driving mouth-openness coefficient, from which it estimates the change (Δlip) of the driving keypoints. The driving keypoints (xd) are then updated by the eye and mouth changes respectively, and the corresponding driven outputs are (Ip, eyes) and (Ip, lip). Finally, the objective functions of the eye and mouth retargeting modules, (Leyes) and (Llip), compute a pixel consistency loss over the eye and mouth regions, a regularization loss on the eye and mouth changes, and a loss between the random driving coefficient and the openness condition coefficient of the driven output. The eye and mouth changes (Δeyes) and (Δlip) are independent of each other, so during inference they can simply be added linearly to update the driving implicit keypoints.
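
Concretely, at inference time the updated driving keypoints can be formed by plain addition, for example xd,new = xd + Δeyes + Δlip, with the stitching change Δst applied in the same additive way when paste-back into the original image is needed.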

Experimental comparison





Same-identity driving: As the same-identity driving comparison above shows, compared with existing non-diffusion and diffusion-based methods, LivePortrait achieves better generation quality and driving accuracy, capturing the subtle eye and mouth expressions of the driving frame while preserving the texture and identity of the reference image. Even under larger head poses, LivePortrait remains relatively stable.





Cross-identity driving: As the cross-identity driving comparison above shows, compared with existing methods, LivePortrait can accurately transfer the subtle eye and mouth movements of the driving video and remains relatively stable under large poses. LivePortrait is slightly inferior in generation quality to the diffusion-based method AniPortrait [11], but compared with the latter it offers far faster inference and requires fewer FLOPs.

Extensions

Multi-person driving: Thanks to its stitching module, LivePortrait can drive specified faces in a group photo with specified driving videos, enabling group-photo driving and broadening its practical applications.



Video link: https://mp.weixin.qq.com/s?__biz=MzA3MzI4MjgzMw==&mid=2650926594&idx=3&sn=7d44eac3409c6c2d5587ef80d7575a69&chksm=84e42a7cb393a36a0da7b8d223f28c5ed51095e53a449ea8e341ddd5f71576595776c02109b6&token=1755385124&lang=zh_CN#rd

Animal driving: LivePortrait not only generalizes well to human portraits; after fine-tuning on animal datasets, it can also accurately drive animal portraits.

Portrait video editing: Beyond portrait photos, given a portrait video such as a dance clip, LivePortrait can use the driving video to edit the motion of the head region. Thanks to the stitching module, LivePortrait can accurately edit the expressions and poses of the head region without affecting the rest of the frame.



Video link: https://mp.weixin.qq.com/s?__biz=MzA3MzI4MjgzMw==&mid=2650926594&idx=3&sn=7d44eac3409c6c2d5587ef80d7575a69&chksm=84e42a7cb393a36a0da7b8d223f28c5ed51095e53a449ea8e341ddd5f71576595776c02109b6&token=1755385124&lang=zh_CN#rd

Implementation and Outlook

LivePortrait's technology has already been deployed in many Kuaishou products, including Kuaishou Magic Watch, Kuaishou private messaging, Kuaishou's AI expression gameplay, Kuaishou Live, and Puji, an app incubated by Kuaishou for young people, and the team will continue to explore new deployment scenarios and create value for users. In addition, building on the Keling large model, LivePortrait will further explore multimodal-driven portrait video generation in pursuit of higher-quality results.

References

[1] Ting-Chun Wang, Arun Mallya, and Ming-Yu Liu. One-shot free-view neural talking-head synthesis for video conferencing. In CVPR, 2021.

[2] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. Voxceleb: a large-scale speaker identification dataset. In Interspeech, 2017.

[3] Kaisiyuan Wang, Qianyi Wu, Linsen Song, Zhuoqian Yang, Wayne Wu, Chen Qian, Ran He, Yu Qiao, and Chen Change Loy. Mead: A large-scale audio-visual dataset for emotional talking-face generation. In ECCV, 2020.

[4] Steven R Livingstone and Frank A Russo. The Ryerson audio-visual database of emotional speech and song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. In PloS one, 2018.

[5] Mingcong Liu, Qiang Li, Zekui Qin, Guoxin Zhang, Pengfei Wan, and Wen Zheng. Blendgan: Implicitly gan blending for arbitrary stylized face generation. In NeurIPS, 2021.

[6] Haotian Yang, Mingwu Zheng, Wanquan Feng, Haibin Huang, Yu-Kun Lai, Pengfei Wan, Zhongyuan Wang, and Chongyang Ma. Towards practical capture of high-fidelity relightable avatars. In SIGGRAPH Asia, 2023.

[7] Kai Zhao, Kun Yuan, Ming Sun, Mading Li, and Xing Wen. Quality-aware pre-trained models for blind image quality assessment. In CVPR, 2023.

[8] Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, and Saining Xie. ConvNeXt V2: Co-designing and scaling ConvNets with masked autoencoders. In CVPR, 2023.

[9] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In CVPR, 2019.

[10] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In CVPR, 2016.

[11] Huawei Wei, Zejun Yang, and Zhisheng Wang. Aniportrait: Audio-driven synthesis of photorealistic portrait animation. arXiv preprint:2403.17694, 2024.