2024-08-12
- AIxiv is a column where Synced publishes academic and technical content. Over the past few years, Synced's AIxiv column has received more than 2,000 submissions covering top laboratories at major universities and companies around the world, effectively promoting academic exchange and dissemination. If you have excellent work to share, please submit it or contact us for coverage. Submission email: [email protected]; [email protected]
The authors of this paper are all from the S-Lab team at Nanyang Technological University, Singapore: postdoctoral fellow Hu Tao, doctoral student Hong Fangzhou, and Professor Liu Ziwei of the College of Computing and Data Science (MIT Technology Review Innovators Under 35 Asia Pacific). In recent years, S-Lab has published numerous CV/CG/AIGC research works at top conferences such as CVPR, ICCV, ECCV, NeurIPS, and ICLR, and collaborates extensively with well-known universities and research institutions in Singapore and abroad.
The generation and editing of 3D digital humans are widely used in digital twins, the metaverse, games, holographic communication, and related fields. Traditional 3D digital human production is time-consuming and labor-intensive. In recent years, researchers have proposed learning 3D digital humans from 2D images using 3D generative adversarial networks (3D GANs), which greatly improves production efficiency.
However, these methods typically model digital humans in a one-dimensional latent vector space, which cannot represent the geometric structure and semantic information of the human body, limiting both generation quality and editing capability.
To solve this problem, the S-Lab team at Nanyang Technological University, Singapore proposed StructLDM, a new paradigm for 3D digital human generation based on a structured latent diffusion model. The paradigm comprises three key designs: a structured, high-dimensional human representation; a structured auto-decoder; and a structured latent-space diffusion model.
StructLDM is a feed-forward 3D generative model learned from images and videos. Compared with existing 3D GAN methods, it generates high-quality, diverse, and view-consistent 3D digital humans, and supports controllable generation and editing at different levels, including part-aware editing tasks such as local clothing editing and 3D virtual try-on. It does not depend on specific clothing types or mask conditions, making it broadly applicable.
Paper title: StructLDM: Structured Latent Diffusion for 3D Human Generation
Paper address: https://arxiv.org/pdf/2404.01241
Project homepage: https://taohuumd.github.io/projects/StructLDM
Laboratory homepage: https://www.ntu.edu.sg/s-lab
Method Overview
The StructLDM training process consists of two stages:
Structured auto-decoding: Given SMPL body pose and camera parameters, the auto-decoder fits a structured UV latent for each individual subject in the training set. The difficulty lies in fitting images of subjects with different poses, camera viewpoints, and clothing into a unified UV latent. To this end, StructLDM proposes structured local NeRFs that model each body part separately, and merges the parts through a global style mixer to learn the subject's overall appearance. In addition, adversarial learning is introduced during auto-decoder training to compensate for pose estimation errors. At the end of this stage, the auto-decoder has converted every subject in the training set into a structured UV latent.
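Below is a minimal PyTorch sketch of the structured auto-decoder idea, assuming a rectangular per-part UV layout, small per-part convolutional decoders standing in for the structured local NeRFs, and a 1×1-conv style mixer. All module names and sizes are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class StructuredAutoDecoder(nn.Module):
    def __init__(self, num_subjects, latent_dim=16, uv_size=128, num_parts=4):
        super().__init__()
        # One learnable structured UV latent per training subject.
        self.uv_latents = nn.Parameter(
            torch.randn(num_subjects, latent_dim, uv_size, uv_size) * 0.01)
        # One small decoder per body part (stand-in for the local NeRF MLPs).
        self.part_decoders = nn.ModuleList(
            nn.Sequential(nn.Conv2d(latent_dim, 32, 1), nn.ReLU(),
                          nn.Conv2d(32, 32, 1))
            for _ in range(num_parts))
        # Global style mixer: fuses per-part features into one appearance map.
        self.style_mixer = nn.Conv2d(32 * num_parts, 3, 1)
        # Fixed rectangular UV bands per part (hypothetical layout).
        step = uv_size // num_parts
        self.bands = [(i * step, (i + 1) * step) for i in range(num_parts)]

    def forward(self, subject_ids):
        z = self.uv_latents[subject_ids]                     # (B, C, H, W)
        feats = []
        for dec, (lo, hi) in zip(self.part_decoders, self.bands):
            # Decode only this part's UV band, padded back to full UV size.
            part = z.new_zeros(z.shape[0], 32, z.shape[2], z.shape[3])
            part[:, :, lo:hi] = dec(z[:, :, lo:hi])
            feats.append(part)
        # The real model volume-renders local NeRFs under SMPL pose + camera;
        # here the mixer just outputs an RGB map for illustration.
        return self.style_mixer(torch.cat(feats, dim=1))

model = StructuredAutoDecoder(num_subjects=100)
out = model(torch.tensor([0, 7]))                            # (2, 3, 128, 128)
```

In the actual method, the per-part features condition local NeRFs that are volume-rendered under the given SMPL pose and camera, with an adversarial loss supervising the renderings.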
Structured latent diffusion: The diffusion model is trained on the UV latent space obtained in the first stage, learning a three-dimensional prior over the human body.
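A sketch of one training step of the stage-2 diffusion model on the fitted UV latents, using a standard DDPM noise schedule and ε-prediction loss; `denoiser` is assumed to be any 2D network over latent maps (in practice a U-Net):

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                 # linear DDPM schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)        # cumulative alpha-bar

def diffusion_loss(denoiser, uv_latents):
    """One epsilon-prediction step; uv_latents: (B, C, H, W) stage-1 latents."""
    b = uv_latents.shape[0]
    t = torch.randint(0, T, (b,))                     # random timestep per sample
    a = alphas_bar[t].view(b, 1, 1, 1)
    noise = torch.randn_like(uv_latents)
    noisy = a.sqrt() * uv_latents + (1 - a).sqrt() * noise   # q(x_t | x_0)
    return F.mse_loss(denoiser(noisy, t), noise)      # predict the added noise
```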
At inference time, StructLDM can generate a 3D digital human from scratch: noise is randomly sampled and denoised into a UV latent, which the auto-decoder renders into a human image.
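A matching sketch of inference, reusing `T`, `betas`, and `alphas_bar` from the training sketch above: ancestral DDPM sampling from pure noise down to a clean UV latent, followed by rendering. `auto_decoder.render` is a hypothetical handle on the stage-1 decoder:

```python
import torch

@torch.no_grad()
def generate(denoiser, auto_decoder, shape=(1, 16, 128, 128)):
    x = torch.randn(shape)                            # start from pure noise
    for t in reversed(range(T)):
        eps = denoiser(x, torch.full((shape[0],), t))
        alpha_t, a_bar = 1.0 - betas[t], alphas_bar[t]
        # DDPM posterior mean of x_{t-1} given the predicted noise.
        x = (x - (1.0 - alpha_t) / (1.0 - a_bar).sqrt() * eps) / alpha_t.sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return auto_decoder.render(x)                     # UV latent -> human image
```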
Experimental Results
This study conducted experimental evaluations on four datasets: the single-view image dataset DeepFashion [Liu et al. 2016], the video dataset UBCFashion [Zablotskaia et al. 2019], the real 3D human dataset THUman 2.0 [Yu et al. 2021], and the virtual 3D human dataset RenderPeople.
3.1 Comparison of qualitative results
On the UBCFashion dataset, StructLDM is compared with existing 3D GAN methods such as EVA3D, AG3D, and StyleSDF. Compared with these methods, StructLDM generates high-quality, diverse, and view-consistent 3D digital humans, capturing, for example, different skin tones, hairstyles, and clothing details (such as high heels).
On the RenderPeople dataset, StructLDM is compared with existing 3D GAN methods (EG3D, StyleSDF, and EVA3D) and the diffusion-based PrimDiff. Compared with these methods, StructLDM generates high-quality 3D digital humans with varied poses and appearances, including high-quality facial details.
3.2 Comparison of quantitative results
The researchers compared quantitative results against existing methods on UBCFashion, RenderPeople, and THUman 2.0, randomly sampling 50,000 images per dataset to compute the FID. StructLDM significantly reduces the FID. In addition, a user study showed that about 73% of users judged StructLDM's results superior to AG3D in terms of facial details and full-body image quality.
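For reference, FID fits a Gaussian to Inception features of the real image set (mean μ_r, covariance Σ_r) and of the generated set (μ_g, Σ_g), and measures the distance between the two fits; lower is better:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```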
3.3 Application
3.3.1 Controllable Generation
StructLDM supports controllable generation, including control over camera view, pose, and body shape, as well as 3D virtual try-on, and it can interpolate in the 2D latent space.
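A sketch of latent interpolation, assuming `z_a` and `z_b` are two subjects' UV latents (obtained by sampling or inversion); because the latents are UV-aligned 2D maps, a pixel-wise lerp already yields semantically meaningful in-betweens:

```python
import torch

def interpolate_latents(z_a, z_b, steps=5):
    """Pixel-wise lerp between two UV latents; render each frame to animate."""
    return [(1 - w) * z_a + w * z_b for w in torch.linspace(0, 1, steps)]
```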
3.3.2 Combination Generation
StructLDM supports compositional generation, for example combining parts ①②③④⑤ into a new digital human, and supports different editing tasks such as identity editing, sleeve editing, skirt editing, 3D virtual try-on, and full-body stylization, as sketched below.
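A sketch of such part-aware swapping, with hypothetical UV region boxes; in the actual model the regions follow the SMPL UV parameterization:

```python
import torch

# Hypothetical UV boxes; the real layout follows the SMPL UV parameterization.
UV_REGIONS = {                    # (row_lo, row_hi, col_lo, col_hi)
    "top":   (0, 48, 0, 128),
    "pants": (48, 96, 0, 128),
    "shoes": (96, 128, 0, 128),
}

def swap_part(z_target, z_donor, part):
    """Copy one body part's UV region from a donor latent (part-aware edit)."""
    r0, r1, c0, c1 = UV_REGIONS[part]
    z = z_target.clone()
    z[..., r0:r1, c0:c1] = z_donor[..., r0:r1, c0:c1]
    return z
```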
3.3.3 Editing Internet Pictures
StructLDM can edit photos from the Internet. First, the corresponding UV latent is recovered through inversion; the digital human can then be edited by editing that UV latent, for example changing shoes, tops, or pants.
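A minimal sketch of optimization-based inversion, assuming a differentiable `auto_decoder.render` and a plain L2 reconstruction loss; the paper's pipeline additionally requires pose and camera estimation for the input photo:

```python
import torch

def invert(auto_decoder, target_img, steps=500, lr=0.01):
    """Fit a UV latent whose rendering reconstructs the target photo."""
    z = torch.zeros(1, 16, 128, 128, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((auto_decoder.render(z) - target_img) ** 2).mean()
        loss.backward()
        opt.step()
    return z.detach()             # edit this latent, then re-render
```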
3.4 Ablation Experiment
3.4.1 Latent Space Diffusion
The latent-space diffusion model proposed by StructLDM supports different editing tasks, such as compositional generation. The ablation explores the impact of diffusion parameters (such as the number of denoising steps and the noise scale) on the generated results: by controlling these parameters, StructLDM improves generation quality.
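One way to picture the noise-scale knob is SDEdit-style refinement (an analogy, not necessarily the paper's exact procedure): noise an edited latent up to an intermediate step, then denoise it back, so that larger starting steps harmonize the edit more strongly but preserve it less faithfully:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

@torch.no_grad()
def refine(denoiser, z_edited, t_start=250):
    """Noise an edited latent to step t_start, then denoise it back."""
    a = alphas_bar[t_start]
    x = a.sqrt() * z_edited + (1 - a).sqrt() * torch.randn_like(z_edited)
    for t in reversed(range(t_start)):
        eps = denoiser(x, torch.full((x.shape[0],), t))
        alpha_t, a_bar = 1.0 - betas[t], alphas_bar[t]
        x = (x - (1.0 - alpha_t) / (1.0 - a_bar).sqrt() * eps) / alpha_t.sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x    # larger t_start: stronger harmonization, weaker edit fidelity
```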
3.4.2 One-dimensional and two-dimensional human representation
The researchers compared one-dimensional and two-dimensional latent representations of the human body and found that 2D latents capture high-frequency details (such as clothing textures and facial expressions), and that adding adversarial learning further improves both image quality and fidelity.
3.4.3 Structure-aware Normalization
To improve the learning efficiency of the diffusion model, StructLDM proposes a structure-aware normalization technique that normalizes the latents pixel by pixel. The study found that the normalized latent distribution is closer to a Gaussian, which is more amenable to diffusion model learning.
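A sketch of this pixel-wise normalization over the fitted training latents; statistics are computed per UV location (and channel) across subjects, and samples from the diffusion model are de-normalized before rendering. Variable names are illustrative:

```python
import torch

def normalize_latents(uv_latents, eps=1e-6):
    """uv_latents: (N, C, H, W) stage-1 latents; stats per UV pixel/channel."""
    mu = uv_latents.mean(dim=0, keepdim=True)          # (1, C, H, W)
    sigma = uv_latents.std(dim=0, keepdim=True) + eps
    return (uv_latents - mu) / sigma, (mu, sigma)

def denormalize(z, stats):
    mu, sigma = stats
    return z * sigma + mu                              # invert before rendering
```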