
Complex compositional 3D scene generation: an LLM-driven framework for conversational, controllable 3D generation and editing is here

2024-07-31


AIxiv is a column where Synced publishes academic and technical content. Over the past few years, the AIxiv column has received more than 2,000 submissions covering top laboratories at major universities and companies around the world, effectively promoting academic exchange and dissemination. If you have excellent work to share, please submit it or contact us for coverage. Submission email: [email protected]; [email protected]

The first author and the corresponding author of the paper are from the VDIG (Visual Data Interpreting and Generation) Laboratory of the Wangxuan Institute of Computer Technology, Peking University. The first author is Ph.D. student Zhou Xiaoyu, and the corresponding author is doctoral supervisor Wang Yongtao. In recent years, the VDIG Laboratory has published a number of representative results at top venues such as IJCV, CVPR, AAAI, ICCV, ICML, and ECCV, has repeatedly won first and second places in major computer vision competitions at home and abroad, and collaborates extensively with well-known universities and research institutions in China and overseas.

In recent years, text-to-3D methods for single objects have achieved a series of breakthroughs, but generating controllable, high-quality, complex multi-object 3D scenes from text remains highly challenging. Previous methods suffer from major shortcomings in the complexity of the generated scenes, geometric quality, texture consistency, multi-object interaction, controllability, and editability.

Recently, the VDIG research team from the Wangxuan Institute of Computer Technology, Peking University, together with its collaborators, released its latest research result, GALA3D. Targeting multi-object complex 3D scene generation, this work proposes GALA3D, an LLM-guided framework for controllable generation of complex 3D scenes. It can generate high-quality, highly consistent 3D scenes with multiple objects and complex interactions, and supports controllable editing through conversational interaction. The paper has been accepted at ICML 2024.



Paper title: GALA3D: Towards Text-to-3D Complex Scene Generation via Layout-guided Generative Gaussian Splatting

Paper link: https://arxiv.org/pdf/2402.07207

Paper code: https://github.com/VDIGPKU/GALA3D

Project website: https://gala3d.github.io/



GALA3D is a high-quality framework for generating and controllably editing complex compositional 3D scenes from text. Given a descriptive text prompt, GALA3D generates the corresponding 3D scene, with multiple objects and complex interactions, in a zero-shot manner. While keeping the generated 3D scene highly aligned with the text, GALA3D delivers excellent scene quality, complex multi-object interaction, and geometric consistency.

In addition, GALA3D supports user-friendly, end-to-end generation and controllable editing, letting ordinary users easily customize and edit 3D scenes through dialogue. In conversation with the user, GALA3D can accurately perform controllable conversational editing of complex 3D scenes, covering diverse editing needs such as layout transformation, embedding of digital assets, and changes in decoration style.

Method Introduction

The overall architecture of GALA3D is shown in the figure below:



GALA3D uses large language models (LLMs) to generate initial layouts and proposes a layout-guided generative 3D Gaussian representation to construct complex 3D scenes. GALA3D optimizes the shape and distribution of the 3D Gaussians through adaptive geometry control, generating 3D scenes with consistent geometry, texture, and scale, and with precise object interactions. In addition, GALA3D proposes a compositional optimization mechanism that combines conditional diffusion priors with text-to-image models to jointly generate multi-object 3D scenes with a consistent style, while iteratively refining the initial layout prior extracted from the LLM toward a more realistic and spatially accurate scene layout. Extensive quantitative experiments and qualitative studies show that GALA3D achieves significant gains in text-to-complex-3D-scene generation, surpassing existing text-to-3D scene methods.

a. Scene layout prior based on LLMs

Large language models have demonstrated excellent natural language understanding and reasoning capabilities. This paper further explores the reasoning and layout-generation abilities of LLMs in complex 3D scenes: obtaining a reasonably good layout prior without manual design helps reduce the cost of scene modeling and generation. To this end, we use an LLM (e.g., GPT-3.5) to extract the object instances in the input text and their spatial relationships, and to generate the corresponding layout prior. However, there is a gap between the 3D spatial layout interpreted by the LLM and the actual scene, typically manifesting as floating or interpenetrating objects and as combinations of objects with wildly mismatched proportions. We therefore propose a Layout Refinement module that adjusts and optimizes these coarse layout priors through vision-based diffusion priors and layout-guided generative 3D Gaussians.
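As an illustration of this extraction step, the following minimal sketch queries an LLM for a coarse per-instance layout. The prompt wording, the JSON schema, the model choice, and the `extract_layout` helper are our own assumptions for illustration, not the authors' released implementation:

```python
# Hypothetical sketch of the LLM layout-extraction step; prompt format
# and output schema are assumptions, not the paper's exact setup.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = """List every object instance in the scene description below.
Return a JSON array of objects with fields:
  "name", "center" [x, y, z], "size" [w, h, l], "rotation" (degrees).
Scene: {scene_text}"""

def extract_layout(scene_text: str) -> list[dict]:
    """Ask the LLM for a coarse per-instance 3D bounding-box layout."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT.format(scene_text=scene_text)}],
    )
    return json.loads(response.choices[0].message.content)

layout_prior = extract_layout("a laptop and a coffee mug on a wooden desk")
```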

b. Layout Refinement

GALA3D uses a diffusion-prior-based layout optimization module to refine the layout priors generated by the LLM. Specifically, we add layout-guided gradient optimization of the 3D Gaussians' spatial layout to the 3D generation process, adjusting the spatial position, rotation angle, and size ratio of the LLM-generated layouts through ControlNet. The figure shows the correspondence between the 3D scene and the layout before and after optimization: the optimized layout has more accurate spatial positions and scales, and makes the interactions among the multiple objects in the 3D scene more plausible.
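The following PyTorch sketch shows the underlying idea: each instance's layout parameters are made learnable and updated by gradients from a rendering loss. Here `render_scene` and `diffusion_guidance_loss` are simplified placeholders for the splatting renderer and the ControlNet/diffusion-based guidance, not the paper's actual objective:

```python
# Minimal sketch: refine LLM layouts by gradient descent through a
# rendering loss. Renderer and loss are placeholders for illustration.
import torch

class InstanceLayout(torch.nn.Module):
    """Learnable layout of one instance: center, yaw angle, scale."""
    def __init__(self, center, rotation, scale):
        super().__init__()
        self.center = torch.nn.Parameter(torch.tensor(center))
        self.rotation = torch.nn.Parameter(torch.tensor(rotation))
        self.scale = torch.nn.Parameter(torch.tensor(scale))

def render_scene(layouts):
    # Placeholder: stands in for splatting the scene under the current
    # layout, keeping the gradient path to the layout parameters intact.
    return torch.stack([l.center * l.scale for l in layouts])

def diffusion_guidance_loss(image):
    # Placeholder for the diffusion-prior guidance term.
    return image.pow(2).mean()

layouts = [InstanceLayout([0.0, 0.5, 0.0], 0.0, [1.0, 1.0, 1.0])]
optimizer = torch.optim.Adam([p for l in layouts for p in l.parameters()], lr=1e-3)

for step in range(100):
    loss = diffusion_guidance_loss(render_scene(layouts))
    optimizer.zero_grad()
    loss.backward()   # gradients flow back into position, rotation, and scale
    optimizer.step()
```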



c. Layout-guided generative 3D Gaussian representation

We first introduce 3D layout constraints into the 3D Gaussian representation and propose layout-guided generative 3D Gaussians for complex 3D scenes. The layout-guided 3D Gaussian representation contains multiple semantically extracted instance objects, where the layout prior of each instance object can be parameterized as:
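One plausible notation, assuming each instance layout is an oriented 3D bounding box with a center, a size, and a rotation (the symbols are our own, not necessarily the paper's):

```latex
\mathcal{L} = \{\ell_n\}_{n=1}^{N}, \qquad
\ell_n = (\mu_n,\, s_n,\, \theta_n), \quad
\mu_n \in \mathbb{R}^3 \text{ (center)},\;
s_n \in \mathbb{R}^3 \text{ (size)},\;
\theta_n \in [0, 2\pi) \text{ (rotation)}
```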

where N denotes the total number of instance objects in the scene. Specifically, each instance's 3D Gaussians are optimized through adaptive geometry control to obtain an instance-level 3D Gaussian representation of the object. We then compose the per-object Gaussians into the whole scene according to their relative positions, producing a layout-guided global 3D Gaussian, and render the entire scene through global Gaussian splatting.
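This composition step can be sketched as follows. The attribute names and the yaw-only rotation are illustrative simplifications; a full 3D Gaussian representation also carries covariances, opacities, and spherical-harmonic colors:

```python
# Illustrative composition of per-instance Gaussians into one global set.
import torch

def yaw_rotation(theta):
    # 3x3 rotation about the z axis (assumed to be "up").
    c, s = torch.cos(theta), torch.sin(theta)
    zero, one = torch.zeros_like(c), torch.ones_like(c)
    return torch.stack([
        torch.stack([c, -s, zero]),
        torch.stack([s, c, zero]),
        torch.stack([zero, zero, one]),
    ])

def compose_scene(instances):
    """Map each instance's Gaussian centers into world coordinates via its
    layout (scale -> rotate -> translate), then concatenate everything."""
    world_means, features = [], []
    for inst in instances:
        R = yaw_rotation(inst["theta"])
        means = (inst["means"] * inst["scale"]) @ R.T + inst["center"]
        world_means.append(means)
        features.append(inst["features"])  # colors/opacities pass through
    return torch.cat(world_means), torch.cat(features)

# Toy usage: two instances with random Gaussians.
chair = {"means": torch.randn(100, 3), "features": torch.rand(100, 4),
         "scale": torch.tensor([1.0, 1.0, 1.0]), "theta": torch.tensor(0.0),
         "center": torch.tensor([0.0, 0.0, 0.0])}
table = {"means": torch.randn(200, 3), "features": torch.rand(200, 4),
         "scale": torch.tensor([2.0, 1.0, 2.0]), "theta": torch.tensor(0.5),
         "center": torch.tensor([1.5, 0.0, 0.0])}
all_means, all_features = compose_scene([chair, table])
```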

d. Adaptive geometry control

To better control the spatial distribution and geometry of the 3D Gaussians during generation, we propose an adaptive geometry control method for generative 3D Gaussians. First, given a set of initial Gaussians, GALA3D uses a set of density distribution functions to constrain the spatial positions of the Gaussian ellipsoids, confining the 3D Gaussians within the layout. Gaussians near the layout surface are then sampled to fit the distribution functions. Next, we apply shape regularization to control the geometry of the 3D Gaussians. Throughout generation, adaptive geometry control continually optimizes the distribution and geometry of the Gaussians, producing multi-object 3D content and scenes with richer texture detail and more regular geometry, while keeping the layout-guided generative 3D Gaussians controllable and consistent.
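As a rough illustration, the two constraints above could be written as simple penalty terms. Both are our simplified stand-ins; the paper's exact density functions and regularizer may differ:

```python
# Simplified stand-ins for layout containment and shape regularization.
import torch

def layout_containment_loss(means, center, half_size):
    """Penalize Gaussian centers that drift outside the layout box,
    pushing floating or escaping Gaussians back inside."""
    overflow = (means - center).abs() - half_size
    return torch.relu(overflow).pow(2).mean()

def shape_regularizer(scales, max_ratio=10.0):
    """Penalize needle-like Gaussians whose longest axis greatly exceeds
    the shortest, encouraging regular geometry."""
    ratio = scales.max(dim=-1).values / scales.min(dim=-1).values.clamp_min(1e-8)
    return torch.relu(ratio - max_ratio).mean()

# Toy usage on random Gaussians inside a unit box.
means = torch.randn(500, 3, requires_grad=True)
scales = torch.rand(500, 3) + 1e-3
loss = layout_containment_loss(means, torch.zeros(3), torch.ones(3)) \
       + shape_regularizer(scales)
```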

Experimental Results

Compared with existing text-to-3D generation methods, GALA3D achieves better 3D scene generation quality and consistency. The quantitative results are shown in the table below:



We also conducted an extensive user study, inviting 125 participants (39.2% of whom were experts and practitioners in related fields) to evaluate the scenes generated by our method and by existing methods from multiple perspectives. The results are shown in the table below:



The results show that GALA3D surpasses existing methods on multiple evaluation dimensions, including generated scene quality, geometric fidelity, text consistency, and scene consistency, achieving the best generation quality.

As the qualitative results below show, GALA3D generates complex multi-object 3D scenes zero-shot with good consistency:



The figure below shows that GALA3D supports user-friendly, conversational, controllable generation and editing:



For more research details, please refer to the original paper.