
10,000 words of technical knowledge! A must-read quantitative guide for LLM engineers, with visual illustrations revealing how to compress large models

2024-07-31



New Intelligence Report

Editor: Editorial Department

【New Intelligence Introduction】Faced with the ever-growing parameter counts of LLMs, developers and researchers without an H100 have come up with many ways to compensate, and "quantization" is one of them. This visual guide uses a variety of diagrams to comprehensively summarize the basic concepts and main branches of quantization.

Large language models (LLMs) are often too large to run on consumer hardware. These models can have billions of parameters and often require GPUs with large video memory to accelerate the inference process.

Therefore, more and more research has begun to focus on how to shrink the model, such as improving training methods or using adapters. One of the main techniques in this field is called quantization.

ML engineer Maarten Grootendorst wrote a blog post introducing quantization specifically in the context of language modeling, walking through the relevant concepts one by one with visualizations to help build an intuitive understanding of the technique.


In this blog post, Maarten explores the various approaches, use cases, and the rationale behind quantization.

The table of contents and content of the article are shown in the figure below. It mainly introduces two methods: post-training quantization (PTQ) and quantization-aware training (QAT). Readers with an AI background may want to jump straight to the section on symmetric quantization:


Part 1: The “Problem” with LLMs

A “large language model” is large in terms of its number of parameters, which usually reaches into the billions (mostly weights).

Not only is storing these parameters expensive, but the amount of computation during the inference phase is also very large.

During inference, activations are produced as the product of the inputs and the weights, so the more weights there are, the larger these intermediate activations become.


Therefore, we want to represent billions of values ​​as efficiently as possible, minimizing the space required to store the parameters.

Let's start from the beginning, explore how values ​​are represented, and then move on to optimization.

How to represent numbers

Numerical values ​​are usually stored as floating point numbers (or floats for short): a positive or negative number with a decimal point.

Each value is represented by a sequence of bits (binary digits).

The IEEE-754 standard describes how the bits of a number map to a specific value. Specifically, a number consists of three parts: the sign, the exponent, and the fraction (mantissa).


These three parts can be combined to calculate the value represented by a set of bit values:


Generally, the more bits used, the more precise the represented value. For example, the FP32 format can represent more digits after the decimal point than FP16:
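You can observe this precision loss directly by casting a value down from FP32 to FP16. Below is a minimal NumPy sketch (my own illustration, not from the original post):

```python
import numpy as np

pi32 = np.float32(3.141592653589793)  # FP32 keeps roughly 7 significant decimal digits
pi16 = np.float16(pi32)               # FP16 keeps roughly 3-4 significant decimal digits

print(pi32)  # 3.1415927
print(pi16)  # ~3.14 (stored as 3.140625)
```
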


Memory Limits

The more bits available, not only is the value more precise, but the range of values ​​that can be represented is also wider.


Given the number of bits and representation format, the range of representable values ​​is called the dynamic range, and the distance between two adjacent values ​​is called the precision.


A neat property of this representation is that we can calculate how much memory a device needs to store a given value.

Since each byte in memory contains 8 bits, we can create a basic formula for most forms of floating point numbers -


In practice, there are many more factors that affect the amount of GPU/memory required during inference, such as context size and model architecture.

Now suppose we have a model with 70 billion parameters. Most models themselves are represented using 32-bit floating point numbers (often called full precision), which requires 280GB of memory to load the model.


But if all parameters can be represented by 16-bit floating point numbers, the required memory size can be directly reduced by half.
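As a rough sanity check (a small sketch of my own that only counts the parameters themselves, ignoring activations, the KV cache, and so on), the formula boils down to parameters × bits ÷ 8:

```python
def param_memory_gb(n_params: float, bits: int) -> float:
    """Memory needed just to store the parameters: bits -> bytes -> gigabytes."""
    return n_params * bits / 8 / 1e9

print(param_memory_gb(70e9, 32))  # 280.0 GB in FP32 (full precision)
print(param_memory_gb(70e9, 16))  # 140.0 GB in FP16 (half precision)
print(param_memory_gb(70e9, 8))   # 70.0  GB in INT8
```
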

Therefore, minimizing the number of representation bits of model parameters (not only during inference but also during training) is very attractive.

However, this approach is not without cost: as the number of bits decreases and precision is lost, the model's accuracy generally drops as well.

We want to reduce the number of bits used to represent a value while maintaining accuracy... this is where quantization comes in handy.

Part 2: Introduction to Quantization

Now we know that the purpose of quantization is to reduce the precision of model parameters from a higher bit width (such as 32-bit floating point numbers) to a lower bit width (such as 8-bit integers).


When reducing the number of bits used to represent the original parameter, there is usually some loss of precision (granularity).

To make this effect more intuitive, we can use the colors of a photo as an analogy. For example, choose any image (left), but only use 8 colors to represent it (right):


Note that the enlarged cookie looks more "grainy" than the original one.

Similarly, the main goal of quantization is to reduce the number of bits (colors) required to represent the original parameters while retaining as much accuracy as possible.

Common Data Types

First, let’s look at common data types and the impact of using them as an alternative to 32-bit (called full precision, or FP32) representation.

FP16

First an example of going from 32-bit to 16-bit (called half-precision or FP16) floating point numbers:


The range of possible values ​​for FP16 is much smaller than that for FP32.

BF16

In order to obtain a numerical range similar to the original FP32, bfloat16 (BF16) was introduced as a kind of "truncated FP32":


BF16 uses the same total number of bits as FP16, but allocates more of them to the exponent (and fewer to the mantissa), so it can represent a wider range of values and is widely used in deep learning.

INT8

As the number of bits is reduced further, the representation moves closer to integers rather than floating point numbers. For example, going from FP32 to INT8 leaves only 8 bits, a quarter of the original bit count:


Each time the number of bits is reduced, a mapping is performed to "compress" the original FP32 representation into fewer bits.

But in actual operation, we don’t need to map the entire FP32 range [-3.4e38, 3.4e38] to INT8. We just need to find a way to map the data range of the actual model parameters to INT8.

Common compression/mapping methods include symmetric quantization and asymmetric quantization, both of which belong to linear mapping.

What we will discuss next is the quantization method from FP32 to INT8.

Symmetric Quantization

In symmetric quantization, the range of the original floating-point values is mapped to a symmetric range around zero in the quantized space: the range is centered on zero both before and after quantization.

This means that zero in the original floating-point space is also exactly zero after being mapped to the quantized space.


A typical example of symmetric quantization is maximum absolute value (absmax) quantization.

Given a list of numbers, we take the highest absolute value (α) as the range to perform the linear mapping on.


[-127, 127] represents the restricted range, and the unrestricted range is [-128, 127], depending on the quantization method

Since this is a linear map centered at zero, the formula is simple.

First calculate the scale factor (s) using the following formula:

- b is the number of bits we want to quantize to (8)

- α is the highest absolute value

We then use s to quantize the input x:


As shown in the figure above, the maximum absolute value α is 10.8. When mapping FP32 to INT8, the following formula is obtained:


If you want to restore the original FP32 values, you can also use the previously calculated scale factor (s) to do the dequantization.


First quantize, then dequantize to restore the original value. The whole process is as follows:


You can see that some values, like 3.08 and 3.02, are both 36 when quantized to INT8. So when dequantized back to FP32, they lose some precision and are no longer distinguishable.

The difference between the original value and the dequantized value is called the quantization error. Generally, the fewer the number of bits in the quantization result, the larger the error.
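The whole absmax round trip fits in a few lines. Below is a minimal NumPy sketch (the input values are taken from the figures above; the function names are my own):

```python
import numpy as np

def absmax_quantize(x: np.ndarray, bits: int = 8):
    """Symmetric (absmax) quantization: scale by the highest absolute value."""
    alpha = np.max(np.abs(x))              # highest absolute value in the tensor
    s = (2 ** (bits - 1) - 1) / alpha      # scale factor, here 127 / alpha
    return np.round(s * x).astype(np.int8), s

def dequantize(x_q: np.ndarray, s: float) -> np.ndarray:
    """Recover approximate FP32 values from the INT8 values and the scale factor."""
    return x_q / s

x = np.array([3.08, 3.02, 10.8, -7.59], dtype=np.float32)
x_q, s = absmax_quantize(x)
print(x_q)                 # [ 36  36 127 -89]  ->  3.08 and 3.02 collide at 36
print(dequantize(x_q, s))  # roughly [ 3.06  3.06  10.8  -7.57]  ->  the quantization error
```
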


Asymmetric Quantization

Unlike symmetric quantization, asymmetric quantization is not symmetric around zero. Instead, it maps the minimum value (β) and maximum value (α) in the floating point range to the minimum and maximum values ​​of the quantization range, respectively.

The method we explore here is called zero-point quantization.


Notice how the position of 0 is shifted. That's why it's called asymmetric quantization. In the range [-7.59, 10.8], the maximum and minimum values ​​are at different distances from 0.

Due to the shift in the zero position, we have to calculate a zero point in the INT8 range to perform the linear mapping. As before, we also have to calculate a scale factor (s), but this time using the full extent of the INT8 range, [-128, 127].


This is slightly more complicated because of the need to compute the zero point (z) in the INT8 range to shift the weights.

Like before, let's fill in the formula:


To dequantize the quantized values ​​from INT8 back to FP32, we need to use the previously calculated scale factor (s) and zero point (z).

Other than that, dequantization is simple:


When we put symmetric and asymmetric quantization side by side, we can quickly see the difference between the two approaches:


In the figure above, we can see the zero-centered nature of symmetric quantization and the shifted zero of asymmetric quantization.
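For completeness, here is a matching sketch of asymmetric (zero-point) quantization. The exact sign and rounding conventions vary between implementations; this version follows the multiply-by-s convention used in the symmetric example above:

```python
import numpy as np

def zeropoint_quantize(x: np.ndarray, bits: int = 8):
    """Asymmetric (zero-point) quantization to the full signed integer range."""
    alpha, beta = float(x.max()), float(x.min())          # max and min of the original values
    qmin, qmax = -2 ** (bits - 1), 2 ** (bits - 1) - 1    # [-128, 127] for INT8
    s = (qmax - qmin) / (alpha - beta)                    # scale factor
    z = int(round(-beta * s)) + qmin                      # zero point: where 0.0 lands in INT8
    x_q = np.clip(np.round(s * x) + z, qmin, qmax).astype(np.int8)
    return x_q, s, z

def zeropoint_dequantize(x_q: np.ndarray, s: float, z: int) -> np.ndarray:
    return (x_q.astype(np.float32) - z) / s

x = np.array([3.08, 3.02, 10.8, -7.59], dtype=np.float32)
x_q, s, z = zeropoint_quantize(x)
print(x_q, z)                          # [ 20  19 127 -128], zero point z = -23
print(zeropoint_dequantize(x_q, s, z)) # values close to, but not exactly, the originals
```
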

Range Mapping and Clipping

In the previous examples, we explored how to map the range of values in a given vector to a lower-bit representation. While this allows the entire range of vector values to be mapped, it has one major drawback: outliers.

Imagine you have a vector containing the following values:


A value that is much larger than all other values ​​can be considered an outlier. If we map the entire range of the vector, all small values ​​will be mapped to the same low-bit representation and lose their distinctiveness:


This is the absmax method used before. The same thing happens with asymmetric quantization if no clipping is done.

Instead, we can choose to clip certain values. Clipping means setting a different dynamic range of the original values ​​so that all outliers are set to the same value.

In the following example, we manually set the dynamic range to [-5, 5], and all values ​​outside this range will be mapped to -127 or 127, regardless of their actual values:


The main advantage of this method is that the quantization error of non-outliers is significantly reduced. However, it will lead to an increase in the quantization error of outliers.
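A tiny sketch of the effect (the numbers here are made up for illustration; the outlier and the [-5, 5] range are hypothetical choices):

```python
import numpy as np

x = np.array([0.1, 0.2, -0.3, 256.0], dtype=np.float32)   # one large outlier

# Plain absmax: the outlier dictates the scale, so the small values all collapse to 0
s_full = 127 / np.abs(x).max()
print(np.round(s_full * x))                        # [  0.   0.  -0. 127.]

# Clipping: fix the dynamic range to [-5, 5]; anything outside saturates at +/-127
s_clip = 127 / 5.0
print(np.clip(np.round(s_clip * x), -127, 127))    # [  3.   5.  -8. 127.]
```
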

Calibration

In the example above, we arbitrarily set the dynamic range to [-5, 5], but this should actually be determined through a "calibration" process that finds a range including as many values as possible while minimizing quantization error.

The specific implementation of the calibration step is different for different types of parameters.

Weights (and biases)

We can think of the weights and biases of a large language model (LLM) as static values, since they are known before running the model. For example, the ~20GB file for Llama 3 consists mostly of its weights and biases.

Since the number of bias variables (millions) is significantly less than that of weights (billions), biases are usually kept at a higher precision (such as INT16), while the main work of quantization is focused on weights.

For known static weights, calibration techniques for a selected range include:

- Manually selecting a percentile of the input range

- Optimize the mean square error (MSE) between the original weights and the quantized weights

- Minimize the entropy between the original and quantized values ​​(KL divergence)


For example, choosing a percentile can result in clipping behavior similar to what we saw previously.
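An MSE-based calibration, for instance, can be sketched as a simple grid search over candidate clipping ranges (a toy illustration of my own, not how any particular library implements it):

```python
import numpy as np

def mse_calibrate(w: np.ndarray, bits: int = 8, n_candidates: int = 100) -> float:
    """Pick the clipping range whose quantize->dequantize reconstruction
    has the lowest mean squared error against the original weights."""
    qmax = 2 ** (bits - 1) - 1
    best_range, best_mse = None, np.inf
    for r in np.linspace(0.01, np.abs(w).max(), n_candidates):
        s = qmax / r
        w_hat = np.clip(np.round(s * w), -qmax, qmax) / s   # quantize then dequantize
        mse = np.mean((w - w_hat) ** 2)
        if mse < best_mse:
            best_range, best_mse = r, mse
    return best_range

w = np.random.randn(10_000).astype(np.float32)
w[0] = 25.0                    # a single outlier stretches the naive absmax range
print(mse_calibrate(w))        # usually far below 25: the outlier gets clipped away
```
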

Activations

The inputs that are constantly updated throughout the large language model are often called activations.


They are called activations because they are usually passed through some activation function, such as a sigmoid or ReLU.

Unlike weights, activation values vary with the input data during inference, which makes them difficult to quantize accurately.

Since these values ​​are updated after each hidden layer, during inference, their values ​​are not known until the input data passes through the model.


In general, there are two methods for calibrating weights and activations, applied at different stages of the model:

- Post-Training Quantization (PTQ)

- As the name implies, quantization is performed after training

- Quantization Aware Training (QAT)

- Quantization during training/fine-tuning

Part 3: Post-Training Quantization (PTQ)

Post-training quantization (PTQ) is one of the most popular quantization techniques. It quantizes the model parameters (including weights and activation values) after the model training is completed.

The weight quantization can be performed using symmetric quantization or asymmetric quantization.

However, quantization of activations requires an inference phase to obtain their latent distribution, since we do not know their range in advance.

There are two forms of activation value quantization:

- Dynamic quantization

- Static quantization

Dynamic Quantization

After data passes through a hidden layer, its activation values are collected, and the maximum (α) and minimum (β) values of each layer are recorded:


The distribution of these activation values ​​is then used to calculate the zero point (z) and scale factor (s) values ​​required to quantize the output:


This process is repeated each time the data passes through a new network layer. Therefore, each layer has its own independent z and s values, and thus uses a different quantization scheme.

Static Quantization

Unlike dynamic quantization, static quantization does not calculate the zero point (z) and scale factor (s) during inference, but calculates these values ​​before inference.

To find these values, we use a calibration dataset and feed it into the model to collect these latent activation value distributions.


Once these distributions are collected, we can calculate the s and z values ​​needed for quantization during inference.

At actual inference time, the s and z values ​​are not recomputed, but rather used globally across all activations to quantize them.

In general, dynamic quantization computes s and z values ​​for each hidden layer, which tends to be more accurate. However, this may increase computation time because these values ​​need to be calculated at every inference.

In contrast, static quantization, while not as accurate as dynamic quantization, is faster because it already knows in advance the s and z values ​​used for quantization.
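The difference is easy to see in a short Python sketch (with made-up activation tensors; the quant_params helper simply reuses the zero-point formulas from Part 2):

```python
import numpy as np

def quant_params(x: np.ndarray, bits: int = 8):
    """Zero point (z) and scale factor (s) derived from an activation tensor."""
    alpha, beta = float(x.max()), float(x.min())
    qmin, qmax = -2 ** (bits - 1), 2 ** (bits - 1) - 1
    s = (qmax - qmin) / (alpha - beta)
    z = int(round(-beta * s)) + qmin
    return s, z

# Dynamic quantization: recompute (s, z) from the live activations of every forward pass
activations = np.random.randn(32, 512).astype(np.float32)
s_dynamic, z_dynamic = quant_params(activations)

# Static quantization: compute (s, z) once from a calibration dataset, then reuse them
calibration_batches = [np.random.randn(32, 512).astype(np.float32) for _ in range(10)]
s_static, z_static = quant_params(np.concatenate(calibration_batches))
```
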

The Realm of 4-Bit Quantization

Quantization below 8 bits has always been a challenge, as the quantization error increases with each bit reduction. Fortunately, there are several clever ways to reduce the number of bits to 6, 4, or even 2 bits (although dropping below 4 bits is generally not recommended).

We will explore two methods commonly found on HuggingFace:

- GPTQ (the full model runs on the GPU)

- GGUF (can offload layers to the CPU)

GPTQ

GPTQ is arguably one of the best-known 4-bit quantization methods in practical use.

It uses asymmetric quantization and processes each layer independently before moving on to the next layer:


In this layer-by-layer quantization process, it first computes the inverse Hessian matrix for the layer's weights. The Hessian is a second-order derivative of the model's loss function and tells us how sensitive the model's output is to changes in each weight.

In simple terms, it essentially indicates the (inverse) importance of each weight in the layer.

Weights corresponding to smaller values in the inverse Hessian matrix are more important, because small changes in these weights can lead to significant changes in the model's performance.


In the inverse Hessian matrix, lower values ​​represent more "important" weights

Next, we quantize and dequantize the first row of the weight matrix:


This process allows us to calculate the quantization error (q), which we can weight using the previously calculated inverse Hessian value (h_1).

Essentially, we are creating a weighted quantization error based on the importance of the weights:


Next, we redistribute this weighted quantization error to the other weights in that row. This helps maintain the overall functionality and output of the network.

For example, if we do this for the second weight (i.e. x_2=0.3), we will add the quantization error (q) multiplied by the inverse Hessian of the second weight (h_2):


Next, continue with the same operation for the third weight in a given row:


This process of redistributing the weighted quantization error q is repeated until all values ​​have been quantized.

This method works because weights are usually correlated with each other, so when one weight has a quantization error, the related weights will be updated accordingly via the inverse Hessian.
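To make the error-redistribution idea concrete, here is a deliberately simplified toy that follows the per-row description above. Real GPTQ works on the full (blocked) inverse Hessian with further numerical tricks; the h_inv values and the absmax row scale here are illustrative assumptions only:

```python
import numpy as np

def gptq_row_toy(w_row: np.ndarray, h_inv: np.ndarray, bits: int = 4):
    """Quantize one weight at a time and spread its Hessian-weighted error
    over the weights that have not been quantized yet."""
    w = w_row.astype(np.float32).copy()
    qmax = 2 ** (bits - 1) - 1
    s = qmax / np.abs(w).max()                  # one absmax scale for the whole row
    w_q = np.zeros_like(w)
    for i in range(len(w)):
        w_q[i] = np.clip(np.round(s * w[i]), -qmax, qmax) / s   # quantize + dequantize
        err = (w[i] - w_q[i]) / h_inv[i]        # quantization error, weighted by h_i
        w[i + 1:] += err * h_inv[i + 1:]        # redistribute it over the remaining weights
    return w_q

w_row = np.array([0.8, 0.3, -0.5, 0.2], dtype=np.float32)
h_inv = np.array([0.4, 0.6, 0.3, 0.9], dtype=np.float32)   # lower = more "important"
print(gptq_row_toy(w_row, h_inv))
```
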

GGUF

While GPTQ is a great way to quantize an entire large language model (LLM) on a GPU, if you lack the hardware, GGUF makes it possible to offload any layer of the LLM to the CPU.

This is equivalent to running the model with the CPU and GPU at the same time to make up for the lack of video memory (VRAM).

GGUF's quantization methods are updated frequently and depend on the specific bit width, but the basic principle is as follows.

First, the weights of a given layer are divided into "super blocks", each of which contains a set of "sub blocks". From these "sub blocks", we calculate the scale factor (s) and the α value:


To quantize a given "sub-block", you can use the absmax quantization mentioned earlier and multiply the given weight by the scaling factor (s_sub):


The scale factor s_sub is calculated from the information in the sub-block, but is itself quantized using the super-block's scale factor s_super:


In summary, this block-based quantization uses the scale factor of the "super block" (s_super) to quantize the scale factor of the "sub block" (s_sub).

The quantization level of each scale factor may be different, and the scale factors of "super blocks" usually have higher precision than those of "sub blocks".

To illustrate this, let's explore several quantization levels (2-bit, 4-bit, and 6-bit):


Depending on the quantization type, an additional minimum value (m) is required to adjust the zero point. These are quantized in the same way as the scale factor (s).
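A rough sketch of the super-block/sub-block idea (a toy of my own, not any actual GGUF format such as Q4_K; the block sizes and the 6-bit scale-of-scales are illustrative assumptions):

```python
import numpy as np

def block_quantize_toy(w, sub_block=32, subs_per_super=8, bits=4):
    """Each sub-block gets its own absmax scale; those scales are in turn
    quantized with a single higher-precision scale per super-block."""
    qmax = 2 ** (bits - 1) - 1
    w = w.reshape(-1, subs_per_super, sub_block)            # (super, sub, values)
    s_sub = np.abs(w).max(axis=-1) / qmax                   # float scale per sub-block
    s_super = s_sub.max(axis=-1, keepdims=True) / 63        # scale of the scales (6-bit here)
    s_sub_q = np.maximum(np.round(s_sub / s_super), 1)      # quantized sub-block scales
    w_q = np.clip(np.round(w / (s_sub_q * s_super)[..., None]), -qmax, qmax)
    return w_q.astype(np.int8), s_sub_q.astype(np.uint8), s_super

w = np.random.randn(2 * 8 * 32).astype(np.float32)
w_q, s_sub_q, s_super = block_quantize_toy(w)
print(w_q.shape, s_sub_q.shape, s_super.shape)   # (2, 8, 32) (2, 8) (2, 1)
```
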

Part 4: Quantization Aware Training (QAT)

The third part describes how to quantize the model after training. The disadvantage of this method is that it does not take into account the actual training process.

This is where Quantization Aware Training (QAT) comes in. Unlike Post-Training Quantization (PTQ), QAT aims to learn the quantization process during training.


QAT tends to be more accurate than PTQ because quantization is already taken into account during training. Here's how it works:

During the training process, so-called "fake" quantization is introduced. For example, the weights are first quantized to INT4, and then dequantized back to FP32:


This process allows the model to take quantization errors into account when calculating losses and updating weights during the training phase.
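A common way to implement this "fake" quantization is a straight-through estimator: quantize and dequantize in the forward pass, but let gradients flow through the rounding as if it were the identity. Below is a minimal PyTorch sketch (my own illustration, not tied to any specific QAT library):

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Forward pass sees the quantized-then-dequantized weights;
    backward pass treats the rounding as identity (straight-through estimator)."""
    qmax = 2 ** (bits - 1) - 1
    s = qmax / w.abs().max()
    w_q = torch.clamp(torch.round(s * w), -qmax, qmax) / s
    return w + (w_q - w).detach()

# During QAT, a layer would use fake_quantize(self.weight) in its forward pass,
# so the training loss already reflects the quantization error.
```
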

As shown in the figure below, QAT attempts to explore the loss value in the case of "wide" minima to reduce quantization error, because "narrow" minima tend to lead to larger quantization error.


If quantization is not taken into account during backpropagation, gradient descent will select the weights with the smallest loss value. However, if those weights lie in a "narrow" minimum, quantization will introduce larger errors.

In contrast, if quantization is taken into account, different weights will be selected in a "wide" minimum, where the quantization error is much smaller.


Therefore, although PTQ achieves a lower loss at high precision (e.g. FP32), QAT achieves a lower loss at low precision (e.g. INT4), which is what we are after.

The 1-Bit Era: BitNet

Previously we saw that quantizing down to 4 bits is already quite small, but what if we go even lower?

This is where BitNet comes in, which represents the model’s weights as a single bit, either -1 or 1, and does this by injecting the quantization process directly into the Transformer architecture.

The Transformer architecture, which is the basis of most LLMs, consists of computations involving linear layers:


These linear layers are usually represented in higher precision, such as FP16, and are where most of the weights reside.

BitNet replaces these linear layers with BitLinear layers:


A BitLinear layer works the same as a normal linear layer, multiplying the weights by the activation values ​​to compute the output.

But the difference is that the BitLinear layer uses only 1 bit to represent the weight of the model and uses INT8 to represent the activation value:


The BitLinear layer performs a kind of "fake" quantization during training, just like Quantization Aware Training (QAT), to analyze the effects of quantization on weights and activations:


Let’s take a look at BitLinear step by step.

Weight Quantization

During training, weights are stored in INT8 and then quantized to 1 bit using a basic strategy called the signum function.

Essentially, it shifts the distribution of weights to be centered around 0, and then assigns all values ​​less than 0 to -1, and all values ​​greater than 0 to 1:


Additionally, it keeps track of a value β (mean absolute value) which we will use later in the dequantization process.

Activation Quantization

To quantize activation values, BitLinear uses the maximum absolute value method (absmax) to convert activation values ​​from FP16 to INT8 because they need to perform matrix multiplication (×) with higher precision.


Additionally, it keeps track of a value α (maximum absolute value) which we will use later in the dequantization process.

Dequantization

We keep track of α (maximum absolute value of activations) and β (mean absolute value of weights) which will help us dequantize the activations back to FP16.

The output activations are rescaled using {α, β} to dequantize them back to their original precision:


This process is relatively simple and allows the model's weights to be represented with only two values, -1 or 1.
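Putting the three steps together, a toy float simulation of a BitLinear forward pass might look like this (a sketch only: real BitNet also applies LayerNorm before activation quantization and uses dedicated low-bit kernels rather than plain floating-point matmuls):

```python
import torch

def bitlinear_forward(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """x: (batch, in_features) activations, w: (out_features, in_features) weights."""
    beta = w.abs().mean()                        # mean absolute value, kept for rescaling
    w_bin = torch.sign(w - w.mean())             # 1-bit weights: -1 or +1 (sign(0) = 0)
    alpha = x.abs().max()                        # max absolute value, kept for rescaling
    x_q = torch.clamp(torch.round(127 * x / alpha), -127, 127)   # absmax INT8 activations
    y = x_q @ w_bin.t()                          # the linear layer itself
    return y * (alpha / 127) * beta              # dequantize the output back to full range

y = bitlinear_forward(torch.randn(2, 16), torch.randn(4, 16))
print(y.shape)   # torch.Size([2, 4])
```
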

With this approach, the authors observed that the performance gap between 1-bit training and FP16 training narrowed as the model size grew.

However, this only holds for larger models (>30B parameters); for smaller models the gap remains large.

All Large Language Models Are in 1.58 Bits

To address this scaling issue, BitNet b1.58 was introduced.

In this new approach, each weight of the model can take not only the values -1 and 1 but also 0, making every weight ternary (three-valued).

Interestingly, just the simple operation of adding 0 greatly improved BitNet and accelerated the calculation process.

The Power of 0

Why is adding 0 a significant improvement?

It's all about matrix multiplication!

First, let's explore the basics of how matrix multiplication works.

When computing the output, we multiply the weight matrix by the input vector. Here is a visualization of the first row of multiplications for the first layer of the weight matrix:


This multiplication involves two operations: multiplying each weight by the corresponding input, and then adding all the products together.

In contrast, BitNet b1.58 manages to avoid multiplication altogether, because the ternary weights essentially tell you the following:

- 1: I want to add this value

- 0: I don't want this value

- -1: I want to subtract this value

So if your weights are quantized to 1.58 bits, you only need to perform the addition:


This not only speeds up computation significantly, but also allows feature filtering.

Setting a given weight to 0 is equivalent to ignoring that input, rather than having to add or subtract it as with 1-bit weights.
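A tiny numerical example of why this matters (the values are made up for illustration):

```python
import numpy as np

x = np.array([2.0, -1.0, 3.0, 0.5])      # inputs
w = np.array([1, 0, -1, 1])              # one row of ternary weights

# Ordinary linear layer: multiply, then sum
print(np.dot(w, x))                      # 2.0 - 3.0 + 0.5 = -0.5

# Ternary trick: no multiplications, just add the +1 inputs and subtract the -1 inputs
print(x[w == 1].sum() - x[w == -1].sum())   # same result: -0.5
```
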

Quantization

To quantize the weights, BitNet b1.58 uses absolute-mean (absmean) quantization, a variant of the absolute-maximum (absmax) quantization we saw earlier.

It compresses the distribution of weights by scaling them with the absolute mean (α), and then rounds each value to -1, 0, or 1:


Compared to BitNet, activation quantization is identical except for one aspect: instead of scaling activations to the range [0, 2ᵇ⁻¹], they are scaled to [-2ᵇ⁻¹, 2ᵇ⁻¹] using the maximum absolute value method.

To summarize, 1.58-bit quantization mainly involves two techniques:

- Add 0 to create a three-value representation [-1, 0, 1]

- Absolute mean quantization of weights
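The absmean step described above can be sketched in a couple of lines (a PyTorch sketch; the small epsilon is my own addition for numerical safety):

```python
import torch

def absmean_ternary_quantize(w: torch.Tensor):
    """Scale the weights by their mean absolute value, then round and
    clip each one to -1, 0, or +1 (the three 1.58-bit values)."""
    alpha = w.abs().mean()
    w_q = torch.clamp(torch.round(w / (alpha + 1e-8)), -1, 1)
    return w_q, alpha          # alpha is kept to rescale the layer output

w = torch.randn(4, 8)
w_q, alpha = absmean_ternary_quantize(w)
print(torch.unique(w_q))       # typically tensor([-1.,  0.,  1.])
```
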

The BitNet b1.58 paper concludes: "13B BitNet b1.58 is more efficient than 3B FP16 LLM in terms of latency, memory usage, and energy consumption."


Paper address: https://arxiv.org/abs/2402.17764

With only 1.58 bits, we get a computationally efficient and lightweight model.

References:

https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-quantization