
Only 3.8B activated parameters, yet performance comparable to a full 7B model! Works for training and fine-tuning, from Microsoft

2024-07-18


  • Cressey from Aofei Temple
    Quantum Bit | Public Account QbitAI

With only 60% of parameters activated, performance comparable to a fully activated dense model can be achieved.

A new study from Microsoft Research Asia achieves fully sparse activation, which greatly reduces the cost of inference.

It is also broadly applicable, providing effective support whether you are training from scratch, continuing training, or fine-tuning.



The method is called Q-Sparse. It sparsifies the model at the neuron level, a finer granularity than other approaches, and at the same inference cost it achieves both better performance and higher sparsity.

The "Q" in the name refers to Quantization, which means that in addition to the ordinary model, it alsoCompatible quantization technology, which is applicable to models with various quantization methods.

The authors further state that combining Q-Sparse with model quantization can deliver even greater cost and efficiency gains.

In addition, while studying Q-Sparse, the team explored in depth the relationship between parameter scale, sparsity, and model performance, and discovered a "scaling law" for inference-optimal models.

Some netizens think the technique is genuinely good, and better than ReLU.



Others have started making wishes, saying it would be great if (AMD's) ROCm supported the technique before Nvidia does.



Using a Top-K function to achieve sparsity

The core operation of Q-Sparse is applying a Top-K sparsification function to the input tensor.

Specifically, the Transformer architecture uses nn.Linear layers (matrix multiplications) for the projections in both the attention and feed-forward blocks, which can be written as Y = X·W^T (where X is the input tensor, W is its weight, and Y is the output tensor).

In Q-Sparse, for an input activation tensor X, the absolute values |X| are first computed and sorted to find the K elements with the largest magnitude.

Here K is a pre-set hyperparameter that determines the degree of sparsification.

Q-Sparse then creates a binary mask tensor M with the same shape as X: the positions corresponding to the K largest-magnitude elements of |X| are set to 1 in M, and all remaining positions are set to 0.

Next, the input tensor X is multiplied by the mask tensor M via the Hadamard product (element-wise multiplication) to obtain the sparse tensor X_sparse.

During the forward propagation process, the sparse tensor X_sparse will replace the original input tensor X to participate in subsequent calculations (such as matrix multiplication).

Since most elements in X_sparse have been set to zero, the computation and memory bandwidth requirements can be significantly reduced.
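To make the steps above concrete, here is a minimal PyTorch-style sketch (an illustration only, not the authors' code; the function name and the choice of applying Top-K per row are assumptions):

```python
import torch

def top_k_sparsify(x: torch.Tensor, k: int) -> torch.Tensor:
    """Keep only the k largest-magnitude elements in each row of x, zeroing the rest."""
    # Indices of the K elements with the largest absolute value (per row).
    _, idx = torch.topk(x.abs(), k, dim=-1)
    # Binary mask M with the same shape as X: 1 at kept positions, 0 elsewhere.
    mask = torch.zeros_like(x)
    mask.scatter_(-1, idx, 1.0)
    # Hadamard (element-wise) product gives the sparse tensor X_sparse.
    return x * mask

# Usage: the sparse activations replace X in the projection Y = X·W^T.
x = torch.randn(2, 8)              # toy input activations
w = torch.randn(4, 8)              # toy weight of an nn.Linear(8, 4)
x_sparse = top_k_sparsify(x, k=5)
y = x_sparse @ w.T                 # equivalent to F.linear(x_sparse, w)
```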



During back-propagation, Q-Sparse uses the straight-through estimator (Straight-Through Estimator, STE) to compute the gradient of the Top-K function.

In traditional training methods, it is usually necessary to calculate the gradient of the loss function with respect to the network parameters and use the gradient descent method to update the parameters to minimize the loss.

However, when the network contains non-differentiable operations such as quantization and Top-K, gradient computation runs into trouble, because the gradient of these operations' output with respect to their input is 0 at most points, so the gradient cannot propagate effectively.

STE avoids the vanishing gradient problem by directly passing the gradient to the tensor before sparsification.

In ordinary back-propagation, the gradient of the loss function L with respect to x is ∂L/∂x = ∂L/∂y·∂y/∂x, but ∂y/∂x cannot be computed directly because the Top-K operation is non-differentiable.

STE's solution is to compute only the gradient of the loss with respect to the sparse tensor y and copy it directly to the original tensor x, i.e., to use ∂L/∂y as an estimate of ∂L/∂x.


△ Gradient comparison with/without STE
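As a rough sketch of how the STE trick looks in code (hypothetical, not the paper's implementation): the forward pass applies the Top-K mask, while the backward pass simply passes the incoming gradient through unchanged.

```python
import torch

class TopKSTE(torch.autograd.Function):
    """Top-K sparsification whose gradient is estimated with the straight-through estimator."""

    @staticmethod
    def forward(ctx, x: torch.Tensor, k: int) -> torch.Tensor:
        _, idx = torch.topk(x.abs(), k, dim=-1)
        mask = torch.zeros_like(x)
        mask.scatter_(-1, idx, 1.0)
        return x * mask

    @staticmethod
    def backward(ctx, grad_output):
        # STE: pass ∂L/∂y straight through as the estimate of ∂L/∂x;
        # the integer hyperparameter k needs no gradient.
        return grad_output, None

x = torch.randn(2, 8, requires_grad=True)
y = TopKSTE.apply(x, 5)
y.sum().backward()
print(x.grad)  # all ones: the gradient flows as if Top-K were the identity
```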

For the feed-forward layer, Q-Sparse uses a squared ReLU function instead of the regular ReLU activation; the squaring operation further increases the sparsity of the activations (⊙ denotes the Hadamard product).
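In code, squared ReLU is simply ReLU followed by an element-wise square, i.e. ReLU(x) ⊙ ReLU(x); a minimal sketch, not tied to any particular model definition:

```python
import torch
import torch.nn.functional as F

def squared_relu(x: torch.Tensor) -> torch.Tensor:
    """ReLU followed by an element-wise square, i.e. ReLU(x) ⊙ ReLU(x).
    Negative inputs remain exactly zero, so the output is at least as sparse as plain ReLU."""
    return F.relu(x) ** 2
```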



In addition, to adapt to quantized models, Q-Sparse quantizes the input tensor before applying Top-K sparsification, ensuring that the sparsification operation is compatible with the quantized representation. The function is expressed as follows:



Here, ε is a small constant used to prevent the denominator from being zero.

Specifically, for 1-bit quantized weights, Q-Sparse uses the following quantization function, where α is the mean absolute value of the weight tensor W.



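Since the formula images are not reproduced here, the following is only a hedged sketch of what such quantization functions could look like, inferred from the textual description (ε as a small constant against a zero denominator, α as the mean absolute value of W); the exact scaling, rounding, and bit-width details in the paper may differ.

```python
import torch

EPS = 1e-5  # the small constant ε from the text

def quantize_activations(x: torch.Tensor) -> torch.Tensor:
    """Assumed 8-bit-style activation quantization: scale by the max absolute value
    (plus ε to avoid a zero denominator), round, clip, then rescale."""
    gamma = x.abs().max()
    scale = 127.0 / (gamma + EPS)
    return torch.round(x * scale).clamp(-128, 127) / scale

def quantize_weights_low_bit(w: torch.Tensor) -> torch.Tensor:
    """Assumed low-bit weight quantization in the spirit of BitNet b1.58:
    α is the mean absolute value of W, and weights are rounded to {-1, 0, +1}·α."""
    alpha = w.abs().mean()
    return torch.round(w / (alpha + EPS)).clamp(-1, 1) * alpha

# In the quantized setting, the input is quantized first and Top-K sparsification
# (as sketched earlier) is then applied to the quantized activations.
```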
60% of activated parameters achieve the same effect

Comparative experiments show that Q-Sparse is significantly better than the previous ReLU method in terms of both sparsity rate and model performance.



Regarding the specific effect of Q-Sparse, the authors evaluated its performance on three tasks: training from scratch, continuing training, and fine-tuning.

In the training-from-scratch experiments, the model used is Llama. The results show that on 700M and 7B models, Q-Sparse with 70% top-K (i.e., a 40% overall sparsity rate) achieves a training loss comparable to that of the dense baseline.



The goal of continued training is to sparsify a dense model; the experiments here use Mistral-7B.

As a result, with 2.9B and 3.8B activated parameters, the model's scores on benchmarks such as ARC and MMLU did not drop significantly.



In the fine-tuning experiments, for the Qwen-7B and Mistral-7B models, Q-Sparse showed results similar to continued training, achieving performance very close to the dense models with about 60% of the parameters activated.



These results mean that, at the same performance level, sparsely activated models can significantly reduce the number of activated parameters during inference compared with dense models, and thus the number of FLOPs consumed.

For quantized models, the team applied Q-Sparse to its own BitNet b1.58 model and trained and evaluated it on multiple datasets.

It can be seen that at the 700M and 7B scales, the convergence speed and final loss function value of the quantized model using Q-Sparse are comparable to those of the quantized model (BitNet b1.58) without Q-Sparse.

This shows that Q-Sparse can be seamlessly integrated into quantized models without significantly affecting training or convergence.

Based on this, the author believes that combining Q-Sparse with quantization technology can further improve the efficiency of large language models in the inference stage.



Discovering a new “Scaling Law” for inference optimization

In addition to evaluating the performance of these models when sparse activation is used, the authors also explored the relationship between model performance, scale, and sparsity, and made some new discoveries.

Performance Scaling Law of Sparse Activation Models: The authors found that, similar to dense models, the performance of sparse activation models also follows a power-law scaling relationship.

Specifically, given the sparsity rate S, the loss function value L(N,S) of the model at convergence can be approximated by the following formula:



Where N is the number of model parameters; E is a constant representing the loss as the model size tends to infinity; and A(S) is a scaling factor related to the sparsity rate S.
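The formula image does not survive here; a power-law form consistent with this description (the decay exponent, written here as α, is an assumed fitted parameter) would be:

```latex
% Hedged reconstruction; the exact form and exponent in the paper may differ.
L(N, S) \;\approx\; E + \frac{A(S)}{N^{\alpha}}
```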

This scaling law shows that the performance of sparsely activated models improves as the model size increases, but the rate of improvement gradually slows down.



At the same time, the authors found that the performance of the model is also affected by the sparsity rate.

As mentioned in the section on the relationship between parameter size and performance, A(S) is a scaling factor related to the sparsity rate S, which can be approximated by the following formula:



Where B and C are constants, and β is a parameter that controls the speed of the exponential decay.
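The formula image is likewise missing; one parameterization consistent with the prose (constants B and C, exponential decay in the kept fraction 1−S governed by β) would be:

```latex
% Hedged reconstruction; the paper's exact parameterization may differ.
A(S) \;\approx\; B + C \, e^{-\beta (1 - S)}
```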

This formula shows that as the sparsity rate S increases (the model becomes sparser), performance decreases, and the decline is exponential.



Based on the above findings, the authors derived an inference-optimal sparsity rate S*, which minimizes the model's loss when the inference budget (the number of floating-point operations during inference) is held constant.
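As a sketch of the setup (illustrative notation, not the paper's): with inference FLOPs roughly proportional to the number of activated parameters N·(1−S), one trades model size N against sparsity S under a fixed budget.

```latex
% Hedged sketch of the inference-optimal sparsity problem; notation is illustrative.
(N^{*}, S^{*}) \;=\; \arg\min_{N,\,S} \; L(N, S)
\quad \text{subject to} \quad
N\,(1 - S) \;=\; C_{\text{budget}}
```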

For the full-precision (FP32) model, the optimal sparsity rate is about 45.58%; while the optimal sparsity rate for the low-precision (such as 1.58-bit) model is higher, about 61.25%.



The authors observed that the performance gap between sparse activation models and dense models gradually narrowed as the model size increased.

This can be explained from the scaling law: when the model size N tends to infinity, the loss function value of the sparse activation model tends to L(∞,S)=E, while the loss function value of the dense model tends to L(∞,0)=E.

This means that at extremely large scale, sparse activation models have the potential to achieve performance comparable to dense models, providing a useful reference for designing and training large-scale sparse activation models.

Paper address: https://arxiv.org/abs/2407.10969