
A single card can handle Llama 3.1 405B, making it easy to slim down large models! A super compression toolkit is here

2024-08-02


Contributed by the Model Toolchain Team
Quantum Bit | WeChat Official Account QbitAI

Llama 3.1 (405B) can be processed with a single card. The latest large model compression tool is here!

Llama 3.1 recently reached the top of the open-source rankings, but its strongest 405B version requires more than 900 GB of memory, which places demanding requirements on resources.

LLMC, a large-model compression toolkit and benchmark jointly launched by teams from Beihang University, SenseTime, Nanyang Technological University, and others, solves this problem well.

It enables a single 80 GB A100 to complete the calibration and evaluation of Llama 3.1 405B, achieving quantization at ultra-low cost.

It supports a variety of compression algorithms, models, and inference backends, and has strong scalability and comprehensive evaluation capabilities.



The research team has posted usage instructions on the GitHub homepage; the link can be found at the end of the article.

Llama 3.1 is larger and harder to compress

Low-bit quantization is one of the common techniques for resource-constrained scenarios. The researchers therefore used LLMC to quantize and compress Llama 3.1.
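To ground the comparison that follows, here is a minimal sketch of this style of compression, assuming the "naive" baseline denotes plain round-to-nearest quantization with a single per-tensor scale (an illustration, not LLMC's implementation):

```python
# Round-to-nearest (RTN) quantization with one per-tensor scale.
# Illustrative sketch only; LLMC's actual implementations differ.
import torch

def naive_quantize(w: torch.Tensor, n_bits: int = 8):
    """Quantize a tensor to signed integers and return (values, scale)."""
    qmax = 2 ** (n_bits - 1) - 1                # e.g. 127 for int8
    scale = w.abs().max() / qmax                # one scale for the whole tensor
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return w_q.to(torch.int8), scale

def dequantize(w_q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return w_q.float() * scale

w = torch.randn(4096, 4096)                     # a dummy weight matrix
w_q, s = naive_quantize(w)
print("mean abs error:", (dequantize(w_q, s) - w).abs().mean().item())
```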

The results are shown in Table 1. Some algorithms in LLMC, such as QuaRot and AWQ, effectively maintain quantization accuracy on the 70B- and 405B-parameter models. However, the simplest "naive" algorithm shows a significant drop in accuracy on these large-scale models, especially when activations are quantized.



The research team found that the drop in quantization accuracy of the Llama 3.1 series stems from outliers in its activation tensors that are more pronounced than in other models, and this phenomenon grows more severe as the model size increases. Outliers are data points whose values differ significantly from the rest, and they are one of the key factors affecting quantization accuracy.
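A tiny numeric illustration (ours, not from the paper) of why such outliers hurt: with per-tensor quantization, a single extreme value inflates the scale, so all ordinary values collapse onto a few grid points.

```python
import torch

x = torch.full((1024,), 0.1)                    # ordinary activations
x[0] = 100.0                                    # one outlier value

scale = x.abs().max() / 127                     # int8 scale set by the outlier
x_q = torch.clamp(torch.round(x / scale), -128, 127)
x_hat = x_q * scale                             # dequantized activations

# 0.1 / (100 / 127) ≈ 0.127, which rounds to 0: every ordinary value is lost.
print(x_hat[1:].unique())                       # tensor([0.])
```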

Using the LLMC tool, the research team visualized the input activation tensors of four layers (q_proj, o_proj, gate_proj, down_proj) in the first block of the Llama 3.1 series models (8B, 70B, 405B), as shown in Figures 1-3. The bottom of each subplot shows the mean and standard deviation of the kurtosis values across all tokens of that layer's activations.
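The statistic can be reproduced in a few lines of PyTorch; the sketch below assumes the usual formulation of kurtosis (the fourth standardized moment), computed per token across the hidden dimension:

```python
import torch

def per_token_kurtosis(act: torch.Tensor) -> torch.Tensor:
    """act: [num_tokens, hidden_dim] input activations of one linear layer."""
    mu = act.mean(dim=-1, keepdim=True)
    sigma = act.std(dim=-1, keepdim=True)
    z = (act - mu) / (sigma + 1e-6)             # standardize each token
    return (z ** 4).mean(dim=-1)                # fourth moment = kurtosis

act = torch.randn(128, 4096)                    # dummy activations
k = per_token_kurtosis(act)
print(f"kurtosis mean={k.mean():.2f}, std={k.std():.2f}")  # ~3 for Gaussian data
```

Heavy-tailed activations, i.e. channels with outliers, push the kurtosis well above the Gaussian baseline of about 3, which is presumably why the statistic accompanies the visualizations.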







As shown in Figures 1-3, in the Llama 3.1 series models, outliers appear in certain channels of the activation tensors, and the phenomenon is more pronounced in larger models.

Therefore, it can be reasonably inferred that although the Llama 3.1 405B model has become stronger, it has also become more "abnormal" and harder to quantize.

The LLMC tool supports a series of quantization algorithms that suppress outliers in large models, including AWQ, SmoothQuant, OS+, and QuaRot. As Table 1 shows, these methods greatly improve the quantization accuracy of Llama 3.1 by effectively suppressing outliers. For example, in W8A8 quantization of the 405B model, SmoothQuant, OS+, and QuaRot achieve accuracy almost identical to the floating-point model.
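As one concrete example, the core trick of SmoothQuant can be sketched in a few lines: migrate activation outliers into the weights with a per-channel scale s, so that the layer's output is mathematically unchanged while both tensors become easier to quantize. This is a simplified sketch of the idea from the SmoothQuant paper, not LLMC's code:

```python
import torch

def smooth_scales(x: torch.Tensor, w: torch.Tensor, alpha: float = 0.5):
    """x: [tokens, in_features] calibration activations; w: [out, in] weights."""
    x_max = x.abs().amax(dim=0)                 # per-channel activation range
    w_max = w.abs().amax(dim=0)                 # per-channel weight range
    return x_max ** alpha / (w_max ** (1 - alpha) + 1e-6)

x = torch.randn(64, 512)
x[:, 7] *= 50                                   # plant an outlier channel
w = torch.randn(256, 512)

s = smooth_scales(x, w)
x_s, w_s = x / s, w * s                         # mathematically equivalent layer
assert torch.allclose(x @ w.T, x_s @ w_s.T, atol=1e-2)
print(x.abs().max().item(), "->", x_s.abs().max().item())  # flatter activations
```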

LLMC: One-stop large model slimming toolkit



△LLMC framework diagram

Support for multiple algorithms

LLMC supports a variety of compression algorithms, including 16 different quantization methods covering weight-only, weight-activation, and mixed-precision quantization. This diversity allows fair comparison and in-depth analysis of the different methods. Beyond quantization, various sparsification and related algorithms are also supported.
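For readers unfamiliar with the naming, these settings differ in what gets quantized; a small fake-quantization sketch (illustrative, not LLMC's API) makes the distinction concrete:

```python
import torch

def fake_quant(t: torch.Tensor, n_bits: int) -> torch.Tensor:
    """Quantize and immediately dequantize, to simulate precision loss."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = t.abs().max() / qmax
    return torch.clamp(torch.round(t / scale), -qmax - 1, qmax) * scale

x, w = torch.randn(8, 512), torch.randn(512, 512)

y_w4a16 = x @ fake_quant(w, 4)                  # weight-only (W4A16)
y_w8a8 = fake_quant(x, 8) @ fake_quant(w, 8)    # weight-activation (W8A8)

y_ref = x @ w
print("W4A16 error:", (y_w4a16 - y_ref).abs().mean().item())
print("W8A8  error:", (y_w8a8 - y_ref).abs().mean().item())
```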



△Classification of some hardware-friendly compression algorithms currently supported by LLMC

High precision alignment

The LLMC team conducted alignment experiments comparing several established quantization algorithms as implemented in LLMC against their original papers or code.

The experimental settings were identical to those in the original papers or the default settings of their open-source code (see Table 3).

The results are summarized in Tables 4-6 and show that LLMC performs almost identically to the original quantization algorithms reported in the literature. These experiments demonstrate that LLMC is not only effective but also reliable in reproducing the results of existing quantization methods, ensuring that the tool's contribution to LLM quantization research is credible and valuable.





Quantization at ultra-low cost

The LLMC toolkit is designed to be resource-efficient and can handle large models with minimal hardware. Thanks to its single-block-level operation mechanism, only one 80 GB A100 is needed to complete the calibration and evaluation of Llama 3.1 405B, achieving quantization at ultra-low cost.
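A conceptual sketch of such a block-level loop (our illustration of the general idea, not LLMC's actual code) keeps only one transformer block on the GPU at a time:

```python
import torch

@torch.no_grad()
def blockwise_calibrate(blocks, hidden_states, quantize_block):
    """blocks: transformer blocks resident on CPU;
    hidden_states: calibration inputs to the first block.
    Requires a CUDA device."""
    for block in blocks:
        block.to("cuda")                        # load one block onto the GPU
        hidden_states = hidden_states.to("cuda")
        quantize_block(block, hidden_states)    # calibrate/quantize in place
        hidden_states = block(hidden_states)    # outputs feed the next block
        block.to("cpu")                         # offload before the next block
        torch.cuda.empty_cache()
    return hidden_states

# Toy usage with stand-in blocks and a no-op quantizer:
blocks = [torch.nn.Linear(64, 64) for _ in range(4)]
out = blockwise_calibrate(blocks, torch.randn(2, 8, 64), lambda blk, x: None)
```

Because peak GPU memory is bounded by a single block plus its activations rather than the whole model, even a 405B-parameter model fits on one 80 GB card.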

Multi-backend compatibility

LLMC supports a variety of quantization settings and model formats and is compatible with multiple inference backends and hardware platforms, such as LightLLM, TRT-LLM, PPL-LLM, vLLM, MLC-TVM, and llama.cpp, making it highly versatile.



High scalability

The toolkit is modular and extensible, adapting easily from integer quantization to floating-point quantization, from dense models to mixture-of-experts (MoE) models, from LLMs to vision-language models (VLMs), and from quantization to sparsification. This modular design ensures that users can extend and customize the toolkit to meet their needs.





Diverse evaluation

LLMC provides comprehensive evaluation of compressed models, with detailed performance metrics and analyses such as perplexity (PPL), data visualization, kurtosis, quantization error, and outlier distribution. This evaluation capability ensures that users can make informed decisions about the best compression strategy for their models.
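Of these metrics, perplexity is the most common headline number; the sketch below shows how it is typically computed, assuming a Hugging Face-style causal LM whose forward pass returns .logits (an assumption for illustration, not LLMC's evaluation code):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, input_ids: torch.Tensor) -> float:
    """input_ids: [1, seq_len] token ids of an evaluation text."""
    logits = model(input_ids).logits            # [1, seq_len, vocab_size]
    loss = F.cross_entropy(                     # next-token cross-entropy
        logits[:, :-1, :].reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )
    return loss.exp().item()                    # PPL = exp(average NLL)

# Hypothetical usage with a (quantized) checkpoint path:
# from transformers import AutoModelForCausalLM, AutoTokenizer
# model = AutoModelForCausalLM.from_pretrained("path/to/quantized-model")
# tok = AutoTokenizer.from_pretrained("path/to/quantized-model")
# ids = tok("some held-out evaluation text", return_tensors="pt").input_ids
# print(perplexity(model, ids))
```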



In summary, the LLMC team has released LLMC, a versatile large-model compression toolkit that supports a variety of compression algorithms, models, and inference backends, with strong scalability and comprehensive evaluation capabilities.

The toolkit lets users compress LLMs with hundreds of billions of parameters on a single GPU, greatly facilitating the application of LLM quantization. Equipped with this toolkit, future large-model researchers and everyday users can integrate the appropriate algorithms with the formats required by their target backend platforms, helping popularize the compression of large models.

Tool address: https://github.com/ModelTC/llmc
Paper address: https://arxiv.org/abs/2405.06001