
NVIDIA experiments with pruning and distillation: halving the parameters of Llama 3.1 8B while outperforming models of the same size

2024-08-16


Machine Heart Report

Editors: Du Wei, Chen Chen, Zenan

The rise of small models.

Last month, Meta released the Llama 3.1 series of models, which includes Meta's largest model to date at 405B, as well as two smaller models with 70 billion and 8 billion parameters respectively.

Llama 3.1 is widely seen as the start of a new era for open source. But although the new generation of models is powerful, it still requires substantial computing resources to deploy.

As a result, another trend has emerged in the industry: developing small language models (SLMs) that perform well enough on many language tasks while being far cheaper to deploy.

Recently, NVIDIA research has shown that structured weight pruning combined with knowledge distillation can progressively derive smaller language models from an initial larger model.



Turing Award winner and Meta's chief AI scientist Yann LeCun also liked and reposted the research.

Using pruning and distillation, the NVIDIA research team compressed Llama 3.1 8B into Llama-3.1-Minitron 4B and released it publicly. This is NVIDIA's first work in the Llama 3.1 open-source series.

Llama-3.1-Minitron 4B outperforms state-of-the-art open-source models of similar size, including Minitron 4B, Phi-2 2.7B, Gemma2 2.6B, and Qwen2-1.5B.



The accompanying paper was released as early as last month.



  • Paper link: https://www.arxiv.org/pdf/2407.14679
  • Paper title: Compact Language Models via Pruning and Knowledge Distillation

Pruning and distillation

Pruning makes a model smaller and leaner, either by removing layers (depth pruning) or by removing neurons, attention heads, and embedding channels (width pruning). Pruning is usually accompanied by some amount of retraining to recover accuracy.

Model distillation is a technique that transfers knowledge from a large, complex model (often called a teacher model) to a smaller, simpler student model. The goal is to create a more efficient model that retains most of the predictive power of the original larger model while running faster and consuming fewer resources.

There are two main distillation approaches: synthetic data generation (SDG) fine-tuning and classic knowledge distillation. The two complement each other; this article focuses mainly on the classic knowledge distillation method.

NVIDIA combines pruning with classic knowledge distillation to efficiently obtain smaller models from a larger one. The following figure shows the pruning and distillation process of a single model (top) and the chain of model pruning and distillation (bottom). The specific process is as follows:

1. NVIDIA starts with a 15B model, evaluates the importance of each component (layers, neurons, heads, and embedding channels), then ranks them and prunes the model to the target size: an 8B model.

2. It then performs light retraining via model distillation, with the original model as the teacher and the pruned model as the student.

3. After training, the small model (8B) is used as the starting point for pruning and distillation into a smaller 4B model.



The process of pruning and distilling from the 15B model.
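In pseudocode, the chained procedure looks roughly like the sketch below; prune_to_size and distill are hypothetical helpers standing in for the importance-based pruning and distillation steps the rest of this article describes, not functions from an NVIDIA library.

    # Illustrative sketch of the iterative prune-and-distill chain (15B -> 8B -> 4B).
    # prune_to_size() and distill() are hypothetical placeholders for the steps
    # described in this article.

    def compress_chain(teacher, target_sizes, calibration_data, training_data):
        students = []
        for target in target_sizes:            # e.g. [8e9, 4e9] parameters
            # 1. Rank components (layers, neurons, heads, embedding channels) on a
            #    small calibration set, then prune the model to the target size.
            student = prune_to_size(teacher, target, calibration_data)
            # 2. Lightly retrain the pruned student with the unpruned model as teacher.
            student = distill(teacher=teacher, student=student, data=training_data)
            students.append(student)
            # 3. The distilled student becomes the starting point (and teacher)
            #    for the next, smaller model.
            teacher = student
        return students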

It is important to note that before pruning, you need to understand which parts of the model are important. NVIDIA proposes a purely activation-based importance estimation strategy that simultaneously computes sensitivity information for all relevant axes (depth, neurons, heads, and embedding channels), using a small calibration dataset of 1,024 samples and only forward passes. This approach is simpler and more cost-effective than strategies that rely on gradient information and require backpropagation.

During pruning, you can iteratively alternate between pruning and importance estimation for a given axis or combination of axes. Empirical studies show that using a single importance estimation is sufficient and iterative estimation does not bring additional benefits.
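As a rough illustration of what such an activation-based score could look like, the sketch below uses PyTorch forward hooks to accumulate per-channel mean activation magnitudes over a small calibration set. It is a simplified stand-in for NVIDIA's criterion; hooking nn.Linear modules and a loader that yields token-id tensors are assumptions made for the example.

    import torch

    def activation_importance(model, calibration_loader, device="cuda"):
        """Rank prunable channels by mean absolute activation on a small calibration
        set. Only forward passes are needed; no backpropagation."""
        scores, hooks = {}, []

        def make_hook(name):
            def hook(module, inputs, output):
                # Per-output-channel mean |activation| over batch and sequence dims.
                act = output.detach().abs().mean(dim=tuple(range(output.dim() - 1)))
                scores[name] = scores.get(name, 0) + act
            return hook

        for name, module in model.named_modules():
            if isinstance(module, torch.nn.Linear):   # e.g. MLP and attention projections
                hooks.append(module.register_forward_hook(make_hook(name)))

        model.eval()
        with torch.no_grad():
            for input_ids in calibration_loader:      # e.g. 1,024 calibration samples
                model(input_ids.to(device))

        for h in hooks:
            h.remove()
        return scores                                 # higher score = more important channel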

Retraining using classic knowledge distillation

Figure 2 below illustrates the distillation process, where an N-layer student model (the pruned model) is distilled from an M-layer teacher model (the original, unpruned model). The student is trained by minimizing a combination of the embedding output loss, the logit loss, and Transformer-specific losses mapped between student blocks S and teacher blocks T.



Figure 2: Distillation training loss.
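A minimal sketch of the combined objective in Figure 2, assuming Hugging Face-style model outputs with logits and hidden_states; the loss weights, temperature, and student-to-teacher layer mapping are illustrative placeholders rather than the paper's exact settings.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_out, teacher_out, layer_map,
                          w_logit=1.0, w_hidden=1.0, w_embed=1.0, temperature=1.0):
        """Combine logit, intermediate-state, and embedding losses between an
        N-layer student and an M-layer teacher (weights are illustrative)."""
        # Logit loss: KL divergence between temperature-scaled token distributions.
        s_logp = F.log_softmax(student_out.logits / temperature, dim=-1)
        t_prob = F.softmax(teacher_out.logits / temperature, dim=-1)
        loss = w_logit * F.kl_div(s_logp, t_prob, reduction="batchmean") * temperature ** 2

        # Embedding-output loss: match the first hidden state (embedding layer output).
        # Assumes matching hidden sizes; a learned projection would be needed after
        # width pruning shrinks the student's hidden dimension.
        loss += w_embed * F.mse_loss(student_out.hidden_states[0],
                                     teacher_out.hidden_states[0])

        # Intermediate-state losses: student block S mapped to teacher block T.
        for s_idx, t_idx in layer_map:               # e.g. [(1, 2), (2, 4), ...]
            loss += w_hidden * F.mse_loss(student_out.hidden_states[s_idx],
                                          teacher_out.hidden_states[t_idx])
        return loss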

Pruning and distillation best practices

Based on extensive ablation research on pruning and knowledge distillation in compact language models, NVIDIA summarizes its learnings into the following best practices for structured compression.

The first is sizing.

  • To train a set of LLMs, we first train the largest one, and then iteratively prune and distill to obtain smaller LLMs.
  • If you use a multi-stage training strategy to train the largest model, it is best to prune and retrain the model obtained in the last stage of training.
  • Prune the available source model that is closest to the target size.

The second is pruning.

  • Prioritizing width pruning over depth pruning works well for models below 15B parameters.
  • Use single-shot importance estimation, as iterative importance estimation does not provide any benefit.

The third is retraining.

  • Retrain using only the distillation loss, rather than conventional training.
  • When depth is reduced significantly, use logit, intermediate-state, and embedding distillation.
  • When depth is not reduced significantly, use logit-only distillation.

Llama-3.1-Minitron: Putting best practices into practice

Meta recently launched the powerful Llama 3.1 open-source model series, which is comparable to closed-source models on many benchmarks. The Llama 3.1 series ranges from a massive 405B parameters down to 70B and 8B.

Drawing on its experience with Nemotron distillation, NVIDIA set out to distill the Llama 3.1 8B model into a smaller, more efficient 4B model through the following steps:

  • Teacher fine-tuning
  • Depth-only pruning
  • Width-only pruning
  • Accuracy Benchmark
  • Performance Benchmarks

Teacher fine-tuning

To correct the distribution bias of the original dataset on which the model was trained, NVIDIA first fine-tuned the unpruned 8B model on its own dataset (94B tokens). Experiments show that if this bias is not corrected, the teacher provides suboptimal guidance on that dataset during distillation.

Depth-only pruning

To go from 8B to 4B, NVIDIA pruned 16 layers (50%). They first evaluated the importance of each layer or contiguous group of layers by removing it from the model and observing the increase in LM loss or the drop in downstream-task accuracy.

Figure 5 below shows the LM loss on the validation set after removing 1, 2, 8, or 16 layers. For example, the red plot at layer 16 gives the LM loss if the first 16 layers are dropped, and the point at layer 17 gives the LM loss if the first layer is kept and layers 2 to 17 are dropped. NVIDIA observed that the layers at the beginning and end of the model are the most important.



Figure 5: Importance of layers in depth-only pruning.
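The probe behind Figure 5 can be sketched as follows, assuming a Hugging Face-style LlamaForCausalLM whose transformer blocks live in model.model.layers; eval_lm_loss is a hypothetical helper that returns the mean LM loss on a validation set.

    import copy
    import torch

    def contiguous_block_importance(model, val_loader, eval_lm_loss, block_size=16):
        """Drop each contiguous block of `block_size` layers and record the LM loss
        on a validation set; higher loss after removal means a more important block.
        Deep-copying the model for each probe is for clarity only; a real run would
        drop layers in place and restore them afterwards."""
        num_layers = len(model.model.layers)
        results = {}
        for start in range(num_layers - block_size + 1):
            pruned = copy.deepcopy(model)
            keep = [layer for i, layer in enumerate(pruned.model.layers)
                    if not (start <= i < start + block_size)]
            pruned.model.layers = torch.nn.ModuleList(keep)
            pruned.config.num_hidden_layers = len(keep)
            results[(start, start + block_size - 1)] = eval_lm_loss(pruned, val_loader)
        return results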

However, NVIDIA observed that this LM loss does not necessarily directly correlate to downstream performance.

Figure 6 below shows the Winogrande accuracy of each pruned model. It indicates that it is best to remove layers 16 to 31 (layer 31 being the second-to-last layer), where the pruned model's 5-shot accuracy is significantly higher than random (0.5). NVIDIA acted on this insight and removed layers 16 to 31.



Figure 6: Accuracy on the Winogrande task when 16 layers are removed.

Width-only pruning

NVIDIA pruned the embedding (hidden) and MLP intermediate dimensions along the width axis to compress Llama 3.1 8B. Specifically, they used the activation-based strategy described earlier to compute importance scores for each attention head, embedding channel, and MLP hidden dimension.

Based on these importance estimates, NVIDIA chose to:

  • Prune the MLP intermediate dimension from 14336 to 9216.
  • Prune the hidden size from 4096 to 3072.
  • Retain the number of attention heads and the number of layers.

It is worth noting that immediately after one-shot pruning, the LM loss of width pruning is higher than that of depth pruning. After a short retraining, however, the trend reverses.
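To make the width-pruning step concrete, here is a hedged sketch of slicing a Llama-style MLP's intermediate dimension down to its most important channels. The gate_proj/up_proj/down_proj layout follows Hugging Face Llama MLP modules, and the channel scores are assumed to come from an activation-based importance pass like the one sketched earlier.

    import torch

    def prune_mlp_intermediate(mlp, channel_scores, new_dim=9216):
        """Keep the top `new_dim` intermediate channels of a Llama-style MLP
        (gate_proj/up_proj: hidden -> intermediate, down_proj: intermediate -> hidden)."""
        keep = torch.topk(channel_scores, k=new_dim).indices.sort().values

        def slice_linear(linear, idx, dim):
            out_f, in_f = linear.weight.shape
            new = torch.nn.Linear(
                in_f if dim == 0 else len(idx),
                len(idx) if dim == 0 else out_f,
                bias=linear.bias is not None)
            new.weight.data = linear.weight.data.index_select(dim, idx).clone()
            if linear.bias is not None:
                new.bias.data = (linear.bias.data[idx].clone() if dim == 0
                                 else linear.bias.data.clone())
            return new

        # Rows of the gate/up projections and columns of the down projection
        # correspond to the same intermediate channels, so slice them consistently.
        mlp.gate_proj = slice_linear(mlp.gate_proj, keep, dim=0)
        mlp.up_proj = slice_linear(mlp.up_proj, keep, dim=0)
        mlp.down_proj = slice_linear(mlp.down_proj, keep, dim=1)
        return mlp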

Accuracy Benchmark

NVIDIA distilled the model using the following hyperparameters:

  • Peak learning rate = 1e-4
  • Minimum learning rate = 1e-5
  • 40-step linear warm-up
  • Cosine decay scheme
  • Global batch size = 1152
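For reference, the schedule listed above (a 40-step linear warm-up followed by cosine decay from the peak to the minimum learning rate) can be written as a small function; the total step count below is a placeholder, not a value reported by NVIDIA.

    import math

    PEAK_LR, MIN_LR, WARMUP_STEPS = 1e-4, 1e-5, 40
    TOTAL_STEPS = 10_000  # placeholder; the actual number of distillation steps is not given here

    def lr_at(step):
        """Linear warm-up to PEAK_LR, then cosine decay down to MIN_LR."""
        if step < WARMUP_STEPS:
            return PEAK_LR * (step + 1) / WARMUP_STEPS
        progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
        return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

    # Usage (e.g. with PyTorch): wrap lr_at() in a LambdaLR that returns a
    # multiplier of the optimizer's base learning rate, lr_at(step) / PEAK_LR.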

Table 1 below compares the Llama-3.1-Minitron 4B model variants (width-pruned and depth-pruned) with the original Llama 3.1 8B model and other similarly sized models on benchmarks across multiple domains. Overall, the results once again confirm the effectiveness of width pruning over depth pruning, in line with the best practices above.



Table 1: Accuracy of the Minitron 4B base model variants compared with similarly sized base models.

To verify whether the distilled model can become a powerful instruction model, NVIDIA fine-tuned the Llama-3.1-Minitron 4B model using NeMo-Aligner.

They used the Nemotron-4 340B training data and evaluated the model on IFEval, MT-Bench, ChatRAG-Bench, and the Berkeley Function Calling Leaderboard (BFCL) to test instruction following, role playing, RAG, and function calling. The results confirm that Llama-3.1-Minitron 4B can serve as a reliable instruction model, outperforming other baseline SLMs.



Table 2: Comparison of the accuracy of the aligned Minitron 4B base model with similarly sized aligned models.

Performance Benchmarks

NVIDIA optimized the Llama 3.1 8B and Llama-3.1-Minitron 4B models using NVIDIA TensorRT-LLM, an open source toolkit for optimizing LLM inference.

The next two figures show throughput in requests per second for the different models across various use cases at FP8 and FP16 precision, expressed as input/output sequence length (ISL/OSL) combinations, at batch size 32 for the 8B model and batch size 64 for the 4B models (the smaller weights allow larger batches) on a single NVIDIA H100 80GB GPU.

The Llama-3.1-Minitron-4B-Depth-Base variant is the fastest, with an average throughput of about 2.7 times that of Llama 3.1 8B, while the Llama-3.1-Minitron-4B-Width-Base variant has an average throughput of about 1.8 times that of Llama 3.1 8B. Deploying in FP8 also improves the performance of all three models by about 1.3 times compared to BF16.





Figure 8: Throughput for ISL/OSL combinations: Llama 3.1 8B at BS=32 and Llama-3.1-Minitron 4B models at BS=64, on one H100 80GB GPU.

Conclusion

Pruning and classical knowledge distillation is a very cost-effective way to progressively obtain LLMs of smaller size, achieving higher accuracy than training from scratch in all domains. This is a more effective and data-efficient approach than fine-tuning with synthetic data or pre-training from scratch.

Llama-3.1-Minitron 4B is NVIDIA's first attempt at using the state-of-the-art open source Llama 3.1 series. To use the SDG fine-tuning of Llama-3.1 in NVIDIA NeMo, see the /sdg-law-title-generation section on GitHub.

For more information, see the following resources:

  • https://arxiv.org/abs/2407.14679
  • https://github.com/NVlabs/Minitron
  • https://huggingface.co/nvidia/Llama-3.1-Minitron-4B-Width-Base
  • https://huggingface.co/nvidia/Llama-3.1-Minitron-4B-Depth-Base

https://developer.nvidia.com/blog/how-to-prune-and-distill-llama-3-1-8b-to-an-nvidia-llama-3-1-minitron-4b-model/