
118 times cheaper than Stable Diffusion! Training a high-quality 1.16-billion-parameter text-to-image model for $1,890

2024-08-12



New Intelligence Report

Editor: alan

[New Intelligence Introduction] Recently, researchers from the University of California, Irvine and other institutions reduced the cost of training a diffusion model to $1,890 by using strategies such as deferred masking, MoE, and layer-wise scaling.

How much does it cost to train a diffusion model?

The cheapest previous method (Wuerstchen) cost $28,400, and models like Stable Diffusion are an order of magnitude more expensive.

In the era of large models, costs like these are out of reach for ordinary people; if you want AI-generated images of every kind, you have to rely on the big vendors to shoulder the burden.

In order to reduce this huge overhead, researchers have tried various solutions.


For example, the original diffusion model takes about 1,000 steps from noise to image, but it has now been reduced to about 20 steps or even less.

As the basic building block of diffusion models gradually shifted from U-Net (CNN) to DiT (Transformer), optimizations tailored to the Transformer's characteristics followed.


For example, quantization, skipping redundant computation in attention, and pipelining.

Recently, researchers from the University of California, Irvine and other institutions have taken the goal of "saving money" a big step forward:


Paper address: https://arxiv.org/abs/2407.15811

- Train a 1.16 billion parameter diffusion model from scratch for only $1,890!

Compared with the previous state of the art, this is an order-of-magnitude reduction, giving ordinary people a realistic shot at pre-training a model of their own.

More importantly, the cost-cutting techniques did not hurt model quality; the 1.16-billion-parameter model produces very good results, as shown below.



Beyond the look and feel, the model's quantitative metrics are also strong. For example, the FID score given in the table below is very close to Stable Diffusion 1.5 and DALL·E 2.

In contrast, Wuerstchen's cost-cutting plan resulted in less-than-ideal test scores.


Tips for saving money

With the goal of "Stretching Each Dollar", the researchers started with DiT, the basic module of the diffusion model.

First of all, sequence length is the main driver of a Transformer's computational cost, so it needs to be cut down.

For images, it is necessary to minimize the number of patches involved in the calculation without affecting performance (while also reducing memory overhead).


There are two ways to reduce the number of patches: increase the size of each patch, or remove (mask) a subset of the patches.


Because the former significantly degrades model performance, the authors opted for masking.

The most naive approach (naive token masking) is similar to random-crop training in convolutional U-Nets, except that it allows training on non-contiguous regions of the image; a minimal sketch follows.
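Below is an illustrative sketch of naive token masking (not the paper's code): a fraction of the patch tokens is dropped at random before the backbone, which shrinks the sequence length and hence the compute.

```python
# Minimal sketch of naive token masking: randomly drop a fraction of patch
# tokens before the transformer backbone to reduce sequence length.
import torch

def naive_token_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """tokens: (batch, num_patches, dim) -> kept tokens and their indices."""
    B, N, D = tokens.shape
    num_keep = max(1, int(N * (1.0 - mask_ratio)))
    # Random permutation per sample; keep the first `num_keep` positions.
    noise = torch.rand(B, N, device=tokens.device)
    ids_keep = noise.argsort(dim=1)[:, :num_keep]            # (B, num_keep)
    kept = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return kept, ids_keep                                     # backbone sees only `kept`

# Example: 256 patches at 75% masking -> the backbone processes only 64 tokens.
x = torch.randn(8, 256, 384)
kept, ids = naive_token_masking(x, mask_ratio=0.75)
print(kept.shape)  # torch.Size([8, 64, 384])
```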


The previous state-of-the-art method (MaskDiT) adds a reconstruction module before the output and trains it with an additional loss function, hoping to recover the lost information through learning.


To cut computational cost, both masking schemes discard most of the patches right at the input. The resulting loss of information significantly degrades the Transformer's overall performance, and MaskDiT's reconstruction objective recovers only a small part of it.

Losing information is undesirable, so how can the input be reduced without losing information?

Deferred Masking

The paper proposes a deferred masking strategy: a patch-mixer preprocesses all patches before masking, embedding the information of soon-to-be-discarded patches into the surviving ones, which significantly reduces the performance degradation caused by high masking ratios.


In this architecture, the patch-mixer is implemented as a combination of attention and feed-forward layers, masking is applied with a binary mask, and the whole model is trained with the standard diffusion denoising loss.

Compared with MaskDiT, no additional loss function is required here, and the overall design and training are simpler.

The mixer itself is a very lightweight structure, meeting the cost-saving criteria.
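As a rough sketch of the idea (assuming a single standard transformer block as the patch-mixer, an illustrative choice rather than the paper's exact architecture), the mixer first lets every patch exchange information, and only then is the binary mask applied, so the surviving tokens carry context from the discarded ones:

```python
# Hedged sketch of deferred masking: mix all patches first, then mask.
import torch
import torch.nn as nn

class PatchMixerBlock(nn.Module):
    """A lightweight attention + feed-forward block standing in for the patch-mixer."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.ffn(self.norm2(x))

def deferred_masking(tokens, mixer, mask_ratio=0.75):
    """Mix all patches first, then randomly drop `mask_ratio` of them."""
    B, N, D = tokens.shape
    mixed = mixer(tokens)                                     # mixer sees every patch
    num_keep = max(1, int(N * (1.0 - mask_ratio)))
    ids_keep = torch.rand(B, N, device=tokens.device).argsort(1)[:, :num_keep]
    return torch.gather(mixed, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

mixer = PatchMixerBlock(dim=384)
kept = deferred_masking(torch.randn(8, 256, 384), mixer, mask_ratio=0.75)
print(kept.shape)  # torch.Size([8, 64, 384]); only these tokens enter the DiT backbone
```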

Fine-tuning

Since very high masking ratios can significantly reduce the ability of the diffusion model to learn global structures in images and introduce a distribution shift from training to testing, the authors performed a small amount of fine-tuning (unmasking) after pre-training (masking).

Additionally, fine-tuning can also mitigate any undesirable generation artifacts due to the use of masks.

MoE and Layer-wise Scaling

MoE can increase the parameters and expressiveness of the model without significantly increasing the training cost.

The authors use a simplified MoE layer based on expert-choice routing, where each expert selects the tokens routed to it, with no auxiliary loss function needed to balance the load between experts.
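A minimal sketch of expert-choice routing, under the assumptions of dense feed-forward experts and a fixed per-expert capacity (details the article does not specify): each expert picks the tokens with the highest routing scores for it, rather than each token picking an expert.

```python
# Illustrative expert-choice MoE layer; no auxiliary load-balancing loss needed,
# because each expert processes a fixed number of tokens by construction.
import torch
import torch.nn as nn

class ExpertChoiceMoE(nn.Module):
    def __init__(self, dim: int, num_experts: int = 4, capacity_factor: float = 1.0):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.capacity_factor = capacity_factor

    def forward(self, x):                                     # x: (batch, tokens, dim)
        B, N, D = x.shape
        tokens = x.reshape(B * N, D)
        scores = self.router(tokens).softmax(dim=-1)          # (B*N, num_experts)
        capacity = int(self.capacity_factor * B * N / len(self.experts))
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            # Expert e chooses its top-`capacity` tokens by routing score.
            weight, idx = scores[:, e].topk(capacity)
            out[idx] += weight.unsqueeze(-1) * expert(tokens[idx])
        return out.reshape(B, N, D)

moe = ExpertChoiceMoE(dim=384, num_experts=4)
print(moe(torch.randn(2, 64, 384)).shape)  # torch.Size([2, 64, 384])
```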


In addition, the authors adopted a layer-wise scaling approach, linearly increasing the width of the Transformer blocks (i.e., the hidden size of the attention and feed-forward layers) with depth.

Since deeper layers in vision models tend to learn more complex features, using more parameters in deeper layers will lead to better performance.
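A minimal sketch of such a layer-wise width schedule; the exact schedule and rounding below are illustrative assumptions, not the paper's values:

```python
# Layer-wise (linearly increasing) width scaling: deeper blocks get wider
# hidden sizes, so more parameters sit in the deeper layers.
def layerwise_widths(depth: int, base_dim: int, max_dim: int):
    """Return a per-block hidden width that increases linearly with depth."""
    if depth == 1:
        return [base_dim]
    step = (max_dim - base_dim) / (depth - 1)
    # Round to a multiple of 64 so attention head dimensions stay well-formed.
    return [int(round((base_dim + i * step) / 64) * 64) for i in range(depth)]

print(layerwise_widths(depth=12, base_dim=256, max_dim=1024))
# Prints 12 widths that grow monotonically from 256 toward 1024.
```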

Experimental setup

The authors use two variants of DiT, DiT-Tiny/2 and DiT-XL/2, both with a patch size of 2.

All models are trained using the AdamW optimizer with cosine learning rate decay and high weight decay.


The model front end uses the four-channel variational autoencoder (VAE) from Stable-Diffusion-XL to extract image features. The latest 16-channel VAE is also tested in the large-scale (cost-saving) training run.


The authors use the EDM framework as a unified training setting for all diffusion models and use FID as well as CLIP scores to measure the performance of image generation models.

For the text encoder, the most commonly used CLIP model is chosen. Although larger models such as T5-XXL perform better on challenging tasks such as text synthesis, they are not adopted here, again to save money.

Training Dataset

Training uses three real-image datasets (Conceptual Captions, Segment Anything, TextCaps), together containing 22 million image-text pairs.

Since SA1B does not provide real captions, synthetic captions generated by the LLaVA model are used here. The authors also added two synthetic image datasets containing 15 million image-text pairs to the large-scale training: JourneyDB and DiffusionDB.

For small-scale ablations, the researchers constructed a text-to-image dataset called cifar-captions by subsampling images of 10 CIFAR-10 classes from the larger COYO-700M dataset.

Evaluation

All evaluation experiments are performed using the DiT-Tiny/2 model and the cifar-captions dataset (256×256 resolution).

Each model was trained for 60K optimization steps using the AdamW optimizer and an exponential moving average of the weights (smoothing factor 0.995 over the last 10K steps).
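For reference, a minimal sketch of maintaining an EMA of the model weights with smoothing factor 0.995 (the model here is just a stand-in):

```python
# Keep an exponential moving average (EMA) copy of the weights for evaluation.
import copy
import torch

def update_ema(ema_model, model, decay: float = 0.995):
    """ema_param <- decay * ema_param + (1 - decay) * param"""
    with torch.no_grad():
        for ema_p, p in zip(ema_model.parameters(), model.parameters()):
            ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

model = torch.nn.Linear(384, 384)              # stand-in for the diffusion model
ema_model = copy.deepcopy(model)
# Inside the training loop (e.g. over the last 10K steps):
#     optimizer.step()
#     update_ema(ema_model, model, decay=0.995)
```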

Deferred Masking

The baseline of the experiment is the naive masking described above, while the deferred masking of this paper adds a lightweight patch-mixer with less than 10% of the parameters of the backbone network.

Generally, the more patches are dropped (the higher the masking ratio), the worse the model performs; for example, MaskDiT's performance drops sharply once the ratio exceeds 50%.

The comparison here trains both models with the default hyperparameters (learning rate 1.6×10⁻⁴, weight decay 0.01, and a cosine learning-rate schedule).
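A sketch of this default training setup in PyTorch (the model and the loss helper are placeholders for illustration; the 60K step count comes from the evaluation setup above):

```python
# AdamW with cosine learning-rate decay, using the stated default hyperparameters.
import torch

model = torch.nn.Linear(384, 384)              # stand-in for DiT-Tiny/2
optimizer = torch.optim.AdamW(model.parameters(), lr=1.6e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=60_000)

# Training loop skeleton:
#     loss = diffusion_loss(model, batch)      # hypothetical loss helper
#     loss.backward(); optimizer.step(); optimizer.zero_grad(); scheduler.step()
```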


The results in the figure above show that the deferred masking method improves all three metrics: FID, CLIP-FID, and CLIP score.

Moreover, the gap with the baseline widens as the masking ratio increases: at a 75% masking ratio, naive masking yields an FID of 16.5 (lower is better), while deferred masking reaches 5.03, much closer to the no-masking FID of 3.79.

Hyperparameters

Borrowing from common practice in LLM training, the authors compared how those hyperparameter choices transfer to diffusion training.

First, in the feed-forward layer, the SwiGLU activation function outperforms GELU (a sketch of a SwiGLU block appears at the end of this subsection). Second, higher weight decay leads to better image generation performance.


In addition, unlike LLM training, the diffusion model achieves better performance with a higher running-average coefficient for AdamW's second moment (β₂).

Finally, the authors found that using a small number of training steps while pushing the learning rate as high as possible (up to the point where training becomes unstable) also significantly improves image generation performance.
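Here is a hedged sketch of a SwiGLU feed-forward block, the variant the ablation found to outperform GELU; the 2/3 width multiplier is a common convention assumed here to keep the parameter count comparable to a GELU block, not a value taken from the paper:

```python
# SwiGLU feed-forward block: silu(x W_gate) * (x W_up), projected back down.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, dim: int, hidden_mult: float = 4.0):
        super().__init__()
        hidden = int(2 * hidden_mult * dim / 3)   # 2/3 scaling keeps params comparable
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

ffn = SwiGLUFFN(dim=384)
print(ffn(torch.randn(2, 64, 384)).shape)  # torch.Size([2, 64, 384])
```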

Mixer design

As one might expect, the authors observed that model performance kept improving as the patch-mixer was made larger.

However, in order to save money, a small mixer is chosen here.

The authors modified the noise distribution to (−0.6, 1.2), which improved the alignment between captions and generated images.
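Assuming "(−0.6, 1.2)" refers to the mean and standard deviation of the log-normal noise-level distribution used by the EDM framework (the article does not spell this out), sampling noise levels would look roughly like this:

```python
# Sample per-example noise levels sigma with ln(sigma) ~ N(p_mean, p_std^2).
import torch

def sample_noise_levels(batch_size: int, p_mean: float = -0.6, p_std: float = 1.2):
    return torch.exp(p_mean + p_std * torch.randn(batch_size))

sigma = sample_noise_levels(8)
print(sigma)  # eight positive noise levels drawn from the shifted log-normal
```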

The authors also studied the impact of different mask patch sizes at a 75% masking ratio, as shown in the figure below.


Masking larger contiguous regions (i.e., larger mask patches) degrades performance, so the original strategy of randomly masking individual patches is retained.

Layer-wise scaling

This experiment trained two variants of the DiT-Tiny architecture, one with constant width and the other with a layer-wise scaling structure.

Both use naive masking, with the Transformer sized so that the two models have the same compute, and both are trained for the same number of steps and the same amount of time.


The results in the table above show that layer-wise scaling outperforms the constant-width baseline on all three metrics, indicating that it is better suited to masked training of DiT.

References:

https://arxiv.org/abs/2407.15811