2024-08-17
- Cressey from Aofei Temple
Quantum Bit | Public Account QbitAI
With less than 10% of the training parameters, it achieves controllable generation on a par with ControlNet!
Moreover, it adapts to common Stable Diffusion family models such as SDXL and SD1.5, and it is plug-and-play.
It can also be paired with SVD to control video generation, with motion details controlled accurately down to the fingers.
Behind these images and videos is ControlNeXt, an open-source image/video controllable-generation guidance tool released by Jia Jiaya's team at CUHK.
As can be seen from the name, the R&D team positioned it as the next generation of ControlNet.
The naming follows the same pattern as the classic ResNeXt (an extension of ResNet) by Kaiming He and Saining Xie.
Some netizens believe that this name is well-deserved and it is indeed a next-generation product that takes ControlNet to a higher level.
Some even called ControlNeXt a game changer that greatly improves the efficiency of controllable generation, and said they look forward to seeing the works people create with it.
ControlNeXt supports multiple SD series models and is plug-and-play.
These include image generation models SD1.5, SDXL, SD3 (supporting Super Resolution), and video generation model SVD.
Without further ado, let’s take a look at the results.
It can be seen that, with edge (Canny) guidance added in SDXL, the generated anime girl matches the control lines almost perfectly.
Even if the control contours are numerous and fragmented, the model can still draw pictures that meet the requirements.
And it can be seamlessly integrated with other LoRA weights without additional training.
For example, in SD1.5, the Pose control conditions can be used in combination with various LoRAs to create characters with different styles or even across dimensions, but with the same movements.
In addition, ControlNeXt also supports mask and depth control modes.
SD3 also supports Super Resolution, which can generate ultra-high-definition images.
During video generation, ControlNeXt can control character movements.
For example, Spider-Man can be made to perform a TikTok dance, with even the finger movements imitated quite accurately.
Even a chair was made to grow hands and do the same dance. Although it is a bit abstract, the movements are replicated quite well.
Moreover, compared with the original ControlNet, ControlNeXt requires fewer training parameters and converges faster.
For example, on SD1.5 and SDXL, ControlNet requires 361 million and 1.251 billion learnable parameters respectively, while ControlNeXt needs only 30 million and 108 million, less than 10% of ControlNet's.
During the training process, ControlNeXt is close to convergence in about 400 steps, but ControlNet requires ten or even dozens of times more steps.
Generation is also faster than ControlNet: on average, ControlNet adds 41.9% latency relative to the base model, while ControlNeXt adds only 10.4%.
So, how is ControlNeXt implemented and what improvements have been made to ControlNet?
First, let’s use a picture to understand the entire workflow of ControlNeXt.
The key to the lightweight design is that ControlNeXt removes the huge control branch of ControlNet and replaces it with a lightweight convolutional module consisting of a small number of ResNet blocks.
This module is responsible for extracting feature representations of the control conditions (such as semantic segmentation masks, keypoint priors, etc.).
Its training parameters usually amount to less than 10% of ControlNet's, yet it can still learn the input conditional control information well. This design greatly reduces computational overhead and memory usage.
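Below is a minimal PyTorch sketch of this kind of lightweight control encoder, not the official implementation: a few ResNet blocks that map a control image such as a pose or edge map to a feature map for the denoising network. The channel widths, block count, and downsampling factor are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """A small residual block: conv -> norm -> SiLU -> conv -> norm, plus skip."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.GroupNorm(8, channels),
            nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.GroupNorm(8, channels),
        )

    def forward(self, x):
        return F.silu(x + self.body(x))

class ControlEncoder(nn.Module):
    """Maps a control image (e.g., a 3xHxW pose map) to features that can be
    injected into the frozen base model's denoising features."""
    def __init__(self, in_channels=3, hidden=128, out_channels=320, num_blocks=3):
        super().__init__()
        self.stem = nn.Conv2d(in_channels, hidden, 3, stride=2, padding=1)  # downsample
        self.blocks = nn.Sequential(*[ResBlock(hidden) for _ in range(num_blocks)])
        self.proj = nn.Conv2d(hidden, out_channels, 1)  # match the UNet feature width

    def forward(self, cond):
        return self.proj(self.blocks(self.stem(cond)))

# Example: encode a batch of 512x512 condition maps into control features.
features = ControlEncoder()(torch.randn(1, 3, 512, 512))  # -> (1, 320, 256, 256)
```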
Specifically, ControlNeXt samples a subset of parameters equidistantly across the layers of the pre-trained model and trains only that subset, while the remaining parameters are frozen.
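As a rough illustration of this idea, the sketch below freezes a pre-trained model and then unfreezes an evenly spaced subset of its parameter tensors. The stride and the selection rule are assumptions for illustration, not the exact scheme used by ControlNeXt.

```python
import torch.nn as nn

def select_trainable_subset(model: nn.Module, stride: int = 4) -> nn.Module:
    """Freeze all parameters, then unfreeze every `stride`-th parameter tensor,
    so the trainable subset is sampled evenly across the network depth."""
    named = list(model.named_parameters())
    for _, p in named:
        p.requires_grad = False
    for i, (_, p) in enumerate(named):
        if i % stride == 0:          # equidistant sampling over the layer list
            p.requires_grad = True
    trainable = sum(p.numel() for _, p in named if p.requires_grad)
    total = sum(p.numel() for _, p in named)
    print(f"trainable parameters: {trainable}/{total} ({100 * trainable / total:.1f}%)")
    return model
```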
In addition, when designing ControlNeXt's architecture, the research team kept the model structure consistent with the original architecture, which is what makes it plug-and-play.
Whether it is ControlNet or ControlNeXt, the injection of conditional control information is an important link.
During this process, the ControlNeXt research team conducted in-depth research on two key issues: the selection of injection location and the design of injection method.
The research team observed that in most controllable generation tasks, the conditional information guiding generation is relatively simple in form and is highly correlated with the features in the denoising process.
So the team concluded that there is no need to inject control information into every layer of the denoising network, and chose to aggregate the conditional features with the denoising features only at a middle layer of the network.
The aggregation method is also kept as simple as possible: after the distributions of the two sets of features are aligned with Cross Normalization, they are simply added together.
This ensures that the control signal influences the denoising process while avoiding the extra learnable parameters and instability that complex operations such as attention mechanisms would introduce.
Cross Normalization is another core technique of ControlNeXt; it replaces previously common progressive initialization strategies such as zero convolution.
Traditional methods alleviate the collapse problem by gradually releasing the influence of new modules from scratch, but this often results in slow convergence.
Cross Normalization directly uses the mean μ and variance σ² of the backbone network's denoising features to normalize the features output by the control module, so that the two data distributions are aligned as closely as possible.
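Based on this description, the normalization plausibly takes the following form, where x_c is the control-module output and μ, σ² are computed from the denoising features (the notation is an assumption rather than a formula quoted from the paper):

$$\hat{x}_c = \gamma \cdot \frac{x_c - \mu}{\sqrt{\sigma^2 + \epsilon}}$$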
(Note: ϵ is a small constant added for numerical stability, and γ is a scaling parameter.)
The normalized control features are then adjusted in amplitude and baseline through scale and offset parameters and added to the denoising features. This avoids sensitivity to parameter initialization and lets the control conditions take effect early in training, thereby accelerating convergence.
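Putting the pieces together, here is a minimal PyTorch sketch, under the assumptions above, of Cross Normalization followed by the simple additive injection: the control features are normalized with the mean and variance of the denoising features, rescaled, and added. The parameter names (gamma, beta) and tensor shapes are illustrative, not the official implementation.

```python
import torch
import torch.nn as nn

class CrossNormalization(nn.Module):
    def __init__(self, channels: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(1, channels, 1, 1))   # scale
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))   # offset

    def forward(self, control_feat: torch.Tensor, denoise_feat: torch.Tensor) -> torch.Tensor:
        # Statistics are taken from the *denoising* features of the frozen backbone...
        mu = denoise_feat.mean(dim=(0, 2, 3), keepdim=True)
        var = denoise_feat.var(dim=(0, 2, 3), keepdim=True, unbiased=False)
        # ...and used to normalize the control features so the two distributions align.
        aligned = (control_feat - mu) / torch.sqrt(var + self.eps)
        # Rescale/offset the aligned control features and inject them by simple addition.
        return denoise_feat + self.gamma * aligned + self.beta

# Example: inject control features into mid-block denoising features (shapes assumed).
cross_norm = CrossNormalization(channels=1280)
out = cross_norm(torch.randn(2, 1280, 32, 32), torch.randn(2, 1280, 32, 32))
```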
In addition, ControlNeXt uses the control module to learn a mapping from conditional information to latent-space features, yielding a more abstract and semantic representation that generalizes better to unseen control conditions.
Project homepage:
https://pbihao.github.io/projects/controlnext/index.html
Paper address:
https://arxiv.org/abs/2408.06070
GitHub:
https://github.com/dvlab-research/ControlNeXt