2024-08-19
AIxiv is a column where Synced publishes academic and technical content. In the past few years, Synced's AIxiv column has received more than 2,000 submissions covering top laboratories at major universities and companies around the world, effectively promoting academic exchange and dissemination. If you have excellent work to share, please submit it or contact us for coverage. Submission email: [email protected]; [email protected]
In the field of artificial intelligence, more model parameters often mean better performance. However, as models grow, so do the computing power and memory they demand from end devices. Low-bit quantization has become a key technique for running large models efficiently on resource-constrained devices because it significantly reduces storage and compute costs and improves inference efficiency. Yet if the hardware does not support the data formats produced by low-bit quantization, these advantages cannot be realized.
To solve this problem, Microsoft Research Asia introduced the data compiler Ladder and the algorithm T-MAC, which enable hardware that only supports symmetric-precision computation to run mixed-precision matrix multiplication directly. Test results show that Ladder achieves speedups of up to 14.6x when supporting custom data types that GPUs do not natively support, and that T-MAC doubles the throughput of large models running on the CPU compared with the dedicated NPU accelerator on a Surface AI PC equipped with the latest Qualcomm Snapdragon X Elite chipset. In addition, the researchers designed the LUT Tensor Core hardware architecture, a streamlined design that lets hardware directly support a variety of low-bit mixed-precision computations, offering new ideas for AI hardware design.
Large models are increasingly deployed on end-side devices such as smartphones, laptops, and robots to provide advanced intelligence and real-time responses. However, models containing hundreds of millions or even billions of parameters place extremely high demands on the memory and computing power of end devices, which limits their widespread adoption. Low-bit quantization, which significantly compresses model size and reduces the demand for computing resources, has become an effective way to deploy large models on the end side and achieve efficient inference.
With the development of low-bit quantization, data types have become increasingly diverse, including low-bit formats such as int4, int2, and int1. As a result, large models increasingly rely on mixed-precision matrix multiplication (mpGEMM) between low-bit weights and higher-bit activations during inference. However, existing computing units such as CPUs and GPUs usually support only symmetric computation modes and are not compatible with this kind of mixed-precision matrix multiplication.
How does mixed-precision matrix multiplication differ from traditional matrix multiplication?
In traditional matrix multiplication, the two operands have the same precision, for example FP16*FP16 or int8*int8. Low-bit quantization of large models breaks this symmetry: one operand is high-bit and the other is low-bit, for example the int8*int1 or int8*int2 used in the 1-bit BitNet model, or FP16*int4, which mixes floating-point numbers and integers.
In order to give full play to the advantages of low-bit quantization, enable hardware devices to directly support mixed-precision matrix multiplication, and ensure that large models can run efficiently and quickly on end-side devices, researchers at Microsoft Research Asia have innovated existing CPU and GPU computing operators and hardware architectures:
Ladder: lossless conversion of custom data types into data types supported by hardware
Currently, cutting-edge accelerators are integrating computing units for lower-bit formats, such as FP32, FP16, and even FP8, into their next-generation architectures. However, due to limited chip area and high hardware costs, each accelerator can only provide compute units for a limited set of standard data types. For example, the NVIDIA V100 Tensor Core GPU supports only FP16, and although the A100 adds support for data types such as int4 and int8, it does not cover newer formats such as FP8 or OCP-MXFP. In addition, the rapid iteration of large models outpaces the slow cadence of hardware upgrades, leaving many new data types unsupported by hardware and thus hindering the acceleration and deployment of large models.
Researchers at Microsoft Research Asia found that although hardware accelerators lack compute instructions for custom data types, their memory systems can store arbitrary data types as opaque data blocks of fixed bit width. At the same time, most custom data types can be losslessly converted to standard data types that existing compute units already support. For example, NF4 tensors can be converted to FP16 or FP32 for floating-point operations.
Based on these findings, the researchers proposed supporting all custom data types by separating data storage from computation, and developed the data compiler Ladder to bridge the gap between emerging custom data types and the precision formats natively supported by current hardware.
Ladder defines a data type system, including abstractions for lossless conversion between data types. It can represent the various data types used by algorithms and supported by hardware, and defines conversion rules between them. When handling low-bit workloads, Ladder translates low-bit data into the most efficient execution format for the target hardware through a series of compute and storage optimizations: mapping operations to matching compute instructions, and placing data, in appropriate formats, at different levels of the memory hierarchy.
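As a rough illustration of this separation between storage and computation (a minimal NumPy sketch, not Ladder's actual implementation or API), the snippet below stores a hypothetical 4-bit codebook type as opaque packed bytes and losslessly decodes it to FP16 only when a computation needs it. The codebook values and helper names are illustrative assumptions.

```python
import numpy as np

# Hypothetical 16-entry codebook for a custom 4-bit data type
# (illustrative values; real formats such as NF4 define their own levels).
CODEBOOK_FP16 = np.linspace(-1.0, 1.0, 16).astype(np.float16)

def pack_4bit(codes: np.ndarray) -> np.ndarray:
    """Store 4-bit codes as opaque bytes: two codes per uint8."""
    codes = codes.astype(np.uint8)
    return (codes[0::2] << 4) | codes[1::2]

def unpack_4bit(packed: np.ndarray) -> np.ndarray:
    """Recover the 4-bit codes from the opaque byte storage."""
    hi = packed >> 4
    lo = packed & 0x0F
    return np.stack([hi, lo], axis=-1).reshape(-1)

def decode_to_fp16(packed: np.ndarray) -> np.ndarray:
    """Lossless conversion: map each 4-bit code to its FP16 value."""
    return CODEBOOK_FP16[unpack_4bit(packed)]

# Weights live in memory as fixed-width opaque blocks...
codes = np.random.randint(0, 16, size=128)
stored = pack_4bit(codes)
# ...and are converted to a hardware-supported type only at compute time.
activations = np.random.randn(128).astype(np.float16)
result = activations @ decode_to_fp16(stored)
```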
Figure 1: Ladder system architecture
Evaluations of DNN inference on NVIDIA A100, NVIDIA V100, NVIDIA RTX A6000, NVIDIA RTX 4090, and AMD Instinct MI250 GPUs show that Ladder outperforms existing state-of-the-art DNN compilers on natively supported data types, and excels at supporting custom data types that GPUs do not natively support, with speedups of up to 14.6x.
Ladder is the first system to systematically support custom data types for representing low-bit-precision data when running DNNs on modern hardware accelerators. It gives model researchers more flexible options for data type optimization, while allowing hardware architects to support a wider range of data types without changing the hardware.
T-MAC: Universal low-bit mixed-precision matrix multiplication without multiplication
To enable existing hardware to handle different data formats and mixed-precision matrix multiplication, a common practice when deploying large models on the end side is to dequantize the low-bit model. However, this approach has two major problems: first, from a performance perspective, the conversion overhead of dequantization can offset the gains brought by low-bit quantization; second, from a development perspective, developers must redesign data layouts and compute kernels for every mixed-precision combination. Researchers at Microsoft Research Asia believe the key to deploying low-bit quantized large models on devices is to rethink matrix multiplication around the characteristics of low-bit data, rather than relying on its traditional implementation.
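For concreteness, the conventional dequantize-then-compute path described above looks roughly like the NumPy sketch below (an illustrative baseline, not any specific framework's kernel; the per-group scaling scheme and function names are assumptions). The int4 weights must be expanded to FP16 on every call before a standard GEMM can run, which is exactly the overhead T-MAC aims to eliminate.

```python
import numpy as np

def dequant_gemm(activations_fp16, weights_int4, scales_fp16, group_size=32):
    """Conventional path: dequantize low-bit weights, then run a standard GEMM.

    activations_fp16: (M, K) FP16 activations
    weights_int4:     (K, N) int4 values stored in int8, range [-8, 7]
    scales_fp16:      (K // group_size, N) per-group scales
    """
    K, N = weights_int4.shape
    # Dequantization happens on every call -- this overhead can cancel out
    # the benefit of storing the weights in 4 bits.
    w_fp16 = weights_int4.astype(np.float16).reshape(K // group_size, group_size, N)
    w_fp16 = (w_fp16 * scales_fp16[:, None, :]).reshape(K, N)
    return activations_fp16 @ w_fp16  # plain symmetric FP16*FP16 GEMM

M, K, N = 4, 64, 16
A = np.random.randn(M, K).astype(np.float16)
W = np.random.randint(-8, 8, size=(K, N)).astype(np.int8)
S = (np.random.rand(K // 32, N) * 0.1).astype(np.float16)
C = dequant_gemm(A, W, S)
```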
To this end, the researchers proposed T-MAC, a lookup table (LUT)-based method that helps low-bit quantized large models achieve efficient inference on the CPU. The core idea of T-MAC exploits the fact that in mixed-precision matrix multiplication one operand has very few bits (such as 1 or 2 bits), so its possible values number only 2^1 or 2^2. The corresponding partial results can be computed in advance and stored in a table; at run time, results are simply read from the table, avoiding repeated computation and greatly reducing the number of multiply-accumulate operations.
Specifically, T-MAC transforms traditional data-type-centric multiplication into bit-wise table lookups, yielding a unified and scalable mixed-precision matrix multiplication solution. It also shrinks the table and keeps it in the fastest memory units, lowering the cost of random table accesses. This innovation paves the way for deploying low-bit quantized large models on resource-constrained edge devices.
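The minimal NumPy sketch below conveys the lookup-table idea described above (a simplified illustration under assumed shapes and 1-bit weights, not T-MAC's actual CPU kernels): activations are split into groups of 4, the partial sums for all 2^4 possible weight-bit patterns of each group are precomputed once, and the weight bits then serve only as table indices, so the inner loop contains no multiplications.

```python
import numpy as np

G = 4  # activations per group; each group has 2**G possible bit patterns

def build_tables(x):
    """Precompute, for every group of G activations, the partial sum
    corresponding to each of the 2**G possible 1-bit weight patterns."""
    x = x.reshape(-1, G)                                   # (K // G, G)
    patterns = np.array([[(p >> b) & 1 for b in range(G)]
                         for p in range(2 ** G)], dtype=x.dtype)  # (16, G)
    return x @ patterns.T                                  # (K // G, 16)

def lut_gemv(w_bits, tables):
    """Multiply 1-bit weights with activations using only table lookups.

    w_bits: (N, K) matrix of 0/1 weights
    tables: (K // G, 16) precomputed group partial sums
    """
    N, K = w_bits.shape
    w_groups = w_bits.reshape(N, K // G, G)
    # Pack each group of G weight bits into a table index in [0, 15].
    idx = (w_groups * (1 << np.arange(G))).sum(axis=-1)    # (N, K // G)
    # Gather and accumulate: no multiply-adds with the weights at all.
    return tables[np.arange(K // G), idx].sum(axis=-1)     # (N,)

K, N = 64, 8
x = np.random.randn(K).astype(np.float32)
W = np.random.randint(0, 2, size=(N, K))
y = lut_gemv(W, build_tables(x))
assert np.allclose(y, W @ x, atol=1e-4)  # matches the direct matmul
```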
Figure 2: T-MAC schematic diagram
In tests on low-bit quantized Llama and 1-bit BitNet large language models, T-MAC showed significant performance advantages. On a Surface Laptop 7 equipped with the latest Qualcomm Snapdragon X Elite chipset, T-MAC enabled the 3B BitNet-b1.58 model to generate 48 tokens per second, the 2-bit 7B Llama model to generate 30 tokens per second, and the 4-bit 7B Llama model to generate 20 tokens per second, all far faster than the average human reading speed. This is 4 to 5 times faster than the original llama.cpp framework, and even twice as fast as a dedicated NPU accelerator.
Even on lower-performance devices such as the Raspberry Pi 5, T-MAC enables the 3B BitNet-b1.58 model to reach 11 tokens per second. T-MAC also has a significant power advantage: it needs only 1/4 to 1/6 of the CPU cores that the original llama.cpp requires to achieve the same generation rate on resource-constrained devices.
These results show that T-MAC offers a practical solution for deploying large language models on edge devices with general-purpose CPUs, without relying on GPUs, allowing large models to run efficiently on resource-constrained devices and promoting their application in a wider range of scenarios.
LUT Tensor Core: Driving the Next Generation of Hardware Accelerators to Natively Support Mixed-Precision Matrix Multiplication
Both T-MAC and Ladder implement optimized support for mixed-precision matrix multiplication on existing CPU and GPU architectures. Although these software-level innovations significantly improve computing efficiency, they still fall short of hardware accelerators that implement the lookup table directly. The researchers believe the ideal approach is to redesign hardware accelerators so that CPUs, GPUs, and other processors natively support mixed-precision matrix multiplication, but this goal faces three major challenges.
To address these challenges, researchers at Microsoft Research Asia designed LUT Tensor Core, a GPU Tensor Core microarchitecture that performs mixed-precision matrix multiplication directly via lookup tables. On the one hand, the lookup-table design reduces multiplication to table precomputation plus lookups, with results read directly from the table, improving computational efficiency. On the other hand, it simplifies the hardware: only registers for table storage and multiplexers for lookups are needed, with no multipliers or adders. At the same time, LUT Tensor Core achieves flexibility in weight precision through a bit-serial design and flexibility in activation precision through table quantization.
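As a rough sketch of the bit-serial principle (a NumPy illustration of the idea only, not the LUT Tensor Core microarchitecture; the unsigned 2-bit weight layout is an assumption), the snippet below splits 2-bit weights into two 1-bit planes, multiplies each plane with the activations, and recombines the plane results with a shift. A single 1-bit datapath, which in LUT Tensor Core is a table lookup, can thus serve multiple weight precisions.

```python
import numpy as np

def bit_serial_matmul(a, w_uint2):
    """Compute a @ w for unsigned 2-bit weights by decomposing them into
    two 1-bit planes and recombining the per-plane results with shifts."""
    plane0 = (w_uint2 >> 0) & 1   # least-significant bit plane
    plane1 = (w_uint2 >> 1) & 1   # most-significant bit plane
    # Each plane only needs a 1-bit multiplication path (or a table lookup).
    return (a @ plane0) + 2 * (a @ plane1)

a = np.random.randn(4, 32).astype(np.float32)
w = np.random.randint(0, 4, size=(32, 8))          # 2-bit weights in [0, 3]
assert np.allclose(bit_serial_matmul(a, w), a @ w, atol=1e-4)
```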
In addition, to integrate with existing GPU microarchitectures and software stacks, the researchers extended the GPU's MMA instruction set with a set of LMMA instructions and designed a cuBLAS-like software stack that can be integrated into existing DNN frameworks. They also built a compiler for end-to-end execution planning on GPUs with LUT Tensor Cores. These measures allow LUT Tensor Core to be adopted seamlessly and quickly.
Figure 3: LUT Tensor Core microarchitecture overview
Tests on the Llama and BitNet models show that LUT Tensor Core delivers up to 6.93x inference speed while occupying only 38.7% of the area of a traditional Tensor Core. With almost no loss in model accuracy, this corresponds to a 20.7x increase in compute density and a 19.1x improvement in energy efficiency. As the scale and complexity of large AI models continue to grow, LUT Tensor Core will help further unleash the potential of low-bit large language models and promote the application of AI in new scenarios.
"The lookup table method has led a shift in the computing paradigm. In the past, we relied on matrix multiplication and accumulation operations, but in the era of large models, thanks to low-bit quantization technology, the lookup table method will become mainstream. Compared with traditional floating-point operations or matrix multiplication, the lookup table method is computationally lighter and more efficient, and it is easier to scale at the hardware level. It can achieve higher transistor density and provide greater throughput per unit chip area, thereby promoting innovation in hardware architecture." said Ting Cao, principal researcher at Microsoft Research Asia.
The long tail effect of low-bit quantization: bringing new possibilities for embodied intelligence
Low-bit quantization not only improves the running efficiency of large models on end-side devices, but also frees up room to scale model parameters by shrinking the footprint of each parameter. This scaling capability gives models greater flexibility and expressiveness, as demonstrated by BitNet, which starts from a low-bit model and gradually scales up to larger training runs.
Microsoft Research Asia's T-MAC, Ladder, and LUT Tensor Core provide efficient execution solutions for a variety of low-bit quantized large models, enabling them to run efficiently on a wide range of devices and encouraging researchers to design and optimize large models from a low-bit perspective. Some of these technologies are already in use in large search models within Microsoft's Bing search and advertising business. As memory and compute requirements decrease, deploying low-bit large models on embodied intelligent systems such as robots will also become possible, enabling these devices to better perceive and interact with their environment in real time.
Currently, T-MAC and Ladder have been open-sourced on GitHub. Interested developers and researchers are welcome to test them and explore further possibilities of artificial intelligence technology with Microsoft Research Asia.