
70× extreme compression! No more worrying about storing checkpoints for large models

2024-08-05


AIxiv is the column through which Synced publishes academic and technical content. In the past few years, the AIxiv column has carried more than 2,000 reports, covering top laboratories at major universities and companies around the world and effectively promoting academic exchange and dissemination. If you have excellent work to share, please submit it or contact us for coverage. Submission email: [email protected]; [email protected]

The authors of this paper are all from Huawei Noah's Ark Lab; Li Wenshuo is the first author, and Wang Yunhe and Chen Xinghao are the corresponding authors. In recent years the team has published a series of representative works at top conferences such as ICML, CVPR, NeurIPS, ICCV, and ECCV, has produced rich results on efficient large language models and vision models, and collaborates extensively with well-known universities and research institutions.

As the undisputed "king of traffic" in today's AI industry and academia, large models have drawn heavy investment in research and training from scholars and companies alike. But as models scale up, systems and engineering problems become unavoidable in training. For example, during the 54-day training run of Llama 3.1, the system crashed 466 times, an average of once every 2.78 hours!



Frequent checkpointing is therefore essential. But saving checkpoints is itself a major engineering undertaking.



Meta has put a lot of effort into speeding up checkpoint saves and increasing save frequency to cope with frequent system failures. But frequent saving also means massive storage overhead: Meta's training cluster is equipped with 240 PB of SSDs to meet the challenge, and the storage alone costs 100 million yuan!

Huawei Noah's ExCP method was created to tackle exactly this storage overhead: an extreme checkpoint compression technique that shrinks checkpoints by a factor of 70 almost losslessly, greatly reducing the storage cost of training.





The code is now open source under the Apache 2.0 license, and users have already reported successfully reproducing the results in the repository's issues.



  • Paper: https://arxiv.org/abs/2406.11257
  • Code: https://github.com/Gaffey/ExCP

The method is also quite novel. The article introduces two key ideas: first, using the residual between successive checkpoints taken during training, whose sparsity along the time dimension allows a much higher pruning ratio; second, compressing the optimizer states together with the weights to achieve a high overall compression rate.



Method

1. Checkpoint residuals

During training, the current parameters can be viewed as the weights stored at the previous checkpoint plus the gradient updates accumulated over the iterations in between. This residual is relatively sparse and carries little information, so compressing the residual instead of the raw weights yields a much better compression ratio. In contrast, the momenta stored in the optimizer are exponential moving averages of the first- and second-order moments of the gradient; for the first moment, the default moving-average coefficient is 0.9, so after hundreds to thousands of iterations it is no longer closely tied to what the previous checkpoint stored. The optimizer states are therefore compressed directly rather than as residuals. The final checkpoint to be compressed can thus be written as

$$\Delta W_t = W_t - W_{t-1}, \qquad \mathcal{C}_t = \{\Delta W_t,\; m_t,\; v_t\},$$

where $W_t$ denotes the weights at checkpoint $t$, and $m_t$, $v_t$ the optimizer's first- and second-order moments.

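For intuition, here is a minimal PyTorch-style sketch of how such a payload might be assembled. This is an illustration written for this article, not the authors' code; the function name and dict layout are assumptions, while `exp_avg`/`exp_avg_sq` are the state keys PyTorch's Adam actually uses.

```python
import torch

def build_compressible_payload(curr_weights, prev_weights, optim_state):
    """Assemble the per-layer quantities that checkpoint compression operates on.

    curr_weights / prev_weights: dicts mapping parameter name -> tensor.
    optim_state: dict mapping parameter name -> {"exp_avg": m, "exp_avg_sq": v},
    the names PyTorch's Adam uses for the first and second moments.
    """
    payload = {}
    for name, w in curr_weights.items():
        payload[name] = {
            # Residual against the previous checkpoint: sparse, low-information,
            # and therefore highly compressible.
            "delta_w": w - prev_weights[name],
            # Optimizer moments are stored directly, since after many steps
            # they no longer correlate with the previous checkpoint.
            "m": optim_state[name]["exp_avg"],
            "v": optim_state[name]["exp_avg_sq"],
        }
    return payload
```
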
2. Weight-Optimizer Momentum Joint Compression

Existing work on model compression generally focuses only on a model's inference performance or on the size of its final stored checkpoint, not on the storage overhead incurred across the entire training process. Such work therefore compresses only the weights, overlooking the fact that a common optimizer such as Adam actually stores moments amounting to twice the number of weights. This work compresses the two together, which on the one hand significantly improves the overall compression ratio, and on the other hand exploits the correlation between the weights and the optimizer momenta so that each improves the pruning of the other.

Weight pruning: since the weights being pruned are residuals, and the second-order moment of the optimizer roughly reflects how much a weight's residual has changed over a recent period, the second moment can serve as the indicator that sets the pruning ratio of each layer. The pruning strategy is shown in the following formula:

$$\tilde{W} = \Delta W \odot \mathbb{1}\!\left(|\Delta W| > \alpha \sqrt{\bar{v}}\right),$$

where $W$ and $v$ denote the weight and the second-order moment, respectively; $\bar{v}$ is the layer-wise mean of $v$, and the hyperparameter $\alpha$ controls the overall pruning ratio.
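
A minimal sketch of what such second-moment-guided residual pruning could look like; the exact thresholding rule and the default value of `alpha` are assumptions based on the description above, not the paper's implementation:

```python
import torch

def prune_weight_residual(delta_w: torch.Tensor,
                          v: torch.Tensor,
                          alpha: float = 1.0) -> torch.Tensor:
    """Zero out residual entries whose movement, as estimated by the
    second moment, is negligible.

    delta_w: weight residual (ΔW) of one layer.
    v:       Adam second moment of the same layer.
    alpha:   global hyperparameter; scaled by the per-layer mean of sqrt(v),
             it yields a per-layer threshold, so layers whose weights moved
             more keep more entries.
    """
    threshold = alpha * v.sqrt().mean()
    return delta_w * (delta_w.abs() > threshold)
```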



Optimizer momentum pruning: for the momenta, the first-order moment serves as the pruning indicator (the paper gives a brief proof of convergence). In addition, if the weight at a position has been pruned, the optimizer momentum at that position should be pruned synchronously, so the pruning strategy is shown in the following formula:

$$\tilde{m} = m \odot \mathbb{1}\!\left(|m| > \beta\,\overline{|m|}\right) \odot \mathbb{1}\!\left(\tilde{W} \neq 0\right),$$

where $m$ denotes the first-order moment, $\overline{|m|}$ its layer-wise mean magnitude, and $\beta$ the momentum pruning threshold; the mask $\mathbb{1}(\tilde{W} \neq 0)$ keeps the momenta aligned with the surviving weights.
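
A matching sketch for the momentum side, again purely illustrative: `beta` is a hypothetical threshold hyperparameter, and masking the second moment with the same mask is an assumption rather than something the article states.

```python
import torch

def prune_momentum(m: torch.Tensor,
                   v: torch.Tensor,
                   pruned_delta_w: torch.Tensor,
                   beta: float = 1.0):
    """Prune optimizer moments jointly with the already-pruned weight residual.

    The first moment drives the mask, and any position whose weight residual
    was pruned is cleared in the moments as well, keeping the two aligned.
    """
    keep = (m.abs() > beta * m.abs().mean()) & (pruned_delta_w != 0)
    return m * keep, v * keep
```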

3. Overall compression process

The overall compression process is shown in Algorithm 1: compute the weight residual, apply joint weight-momentum pruning, perform non-uniform quantization, and finally encode the result, yielding the final compressed file.
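
Putting the pieces together, a toy end-to-end version of this compression path might look as follows. The 16-entry k-means codebook for non-uniform quantization and zlib as the final coder are illustrative stand-ins, not the paper's exact choices, and the pruning helpers are the sketches above.

```python
import zlib
import torch

def nonuniform_quantize(x: torch.Tensor, n_clusters: int = 16, iters: int = 10):
    """Simple 1-D k-means: returns a codebook and uint8 indices into it."""
    flat = x.flatten().float()
    # Initialize centers at evenly spaced quantiles of the value distribution.
    codebook = torch.quantile(flat, torch.linspace(0, 1, n_clusters))
    for _ in range(iters):
        idx = (flat[:, None] - codebook[None, :]).abs().argmin(dim=1)
        for k in range(n_clusters):
            members = flat[idx == k]
            if members.numel() > 0:
                codebook[k] = members.mean()
    idx = (flat[:, None] - codebook[None, :]).abs().argmin(dim=1)
    return codebook, idx.to(torch.uint8)

def compress_layer(delta_w, m, v):
    """Residual -> joint pruning -> non-uniform quantization -> coding."""
    delta_w = prune_weight_residual(delta_w, v)   # sketch from section 2
    m, v = prune_momentum(m, v, delta_w)
    blobs = {}
    for name, t in (("delta_w", delta_w), ("m", m), ("v", v)):
        codebook, idx = nonuniform_quantize(t)
        blobs[name] = (codebook, t.shape, zlib.compress(idx.numpy().tobytes()))
    return blobs
```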



Recovering a complete checkpoint file follows Algorithm 2: after decompression, the floating-point values are first reconstructed from the codebook and indices stored by the non-uniform quantization, then added to the reference weights (the original weights of the previous checkpoint, or the weights reconstructed from it) to obtain the complete checkpoint. Algorithm 3 shows how checkpoints are recovered over an entire training run: after training finishes, only the random seed used to initialize the weights and the compressed result stored at each checkpoint are kept; the checkpoints are then reconstructed in sequence to obtain the complete checkpoint sequence, from which one or more checkpoints can be selected for resuming training, testing, and so on.
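
A corresponding recovery loop, matching the compression sketch above: for simplicity the model is treated as a single weight tensor, and `init_weights_from_seed` is a hypothetical stand-in for rebuilding the initial weights from the stored random seed.

```python
import zlib
import torch

def dequantize(codebook: torch.Tensor, shape, compressed_idx: bytes) -> torch.Tensor:
    """Invert nonuniform_quantize: look the stored indices up in the codebook."""
    idx = torch.frombuffer(bytearray(zlib.decompress(compressed_idx)),
                           dtype=torch.uint8)
    return codebook[idx.long()].reshape(shape)

def recover_checkpoints(seed: int, compressed_ckpts: list):
    """Algorithm 3, sketched: rebuild the whole checkpoint sequence from the
    init seed plus the per-checkpoint compression results."""
    torch.manual_seed(seed)
    w = init_weights_from_seed()          # hypothetical: reconstruct W_0
    history = []
    for blobs in compressed_ckpts:        # one entry per saved checkpoint
        codebook, shape, data = blobs["delta_w"]
        w = w + dequantize(codebook, shape, data)   # W_t = W_{t-1} + ΔW_t
        history.append(w.clone())
    return history
```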

Experimental Results

The article evaluates not only large language models but also relatively large vision models such as ViT-L32, with good results.



The ablation study also shows that pruning the residuals, rather than the weights themselves, greatly reduces the loss caused by pruning.



The article also gives question-answering examples from a large language model before and after compression. As the examples show, the compression itself does not harm the model's question-answering ability.