
GPU training of Llama 3.1 crashed constantly. Meanwhile, have big companies really started running hundred-billion-parameter models on CPU servers?

2024-08-01



New Intelligence Report

Editor: Editorial Department

【New Intelligence Introduction】It's time to run large models with hundreds of billions of parameters on general-purpose CPU servers!

Musk built the world's largest supercomputer consisting of 100,000 H100 chips in 19 days and has devoted all his efforts to the training of Grok 3.

At the same time, foreign media revealed that the next supercomputing cluster jointly built by OpenAI and Microsoft will be composed of 100,000 GB200s.

In this AI race, major technology companies keep pouring money into GPUs, as if owning more, and more powerful, GPUs guarantees invincibility.

However, this frenzied pursuit of high-end GPUs is not the ideal solution in every case.


The creator of PyTorch noted that the Llama 3.1 technical report contains many interesting infrastructure details, including how training was parallelized and how the system was made more reliable.

Take stability as an example. During the 54 days of Llama 3.1 training, Meta's 16,000-GPU H100 cluster encountered a total of 419 unexpected interruptions, an average of roughly one every 3 hours.

Of these, 148 (30.1%) were caused by various GPU failures.

In contrast, only 2 interruptions were caused by CPU failures.


On the other hand, if you want to run Llama 3.1 405B, you need two 8×H100 DGX servers, that is, 1280GB of video memory.

One brave soul once tried to run it on a single RTX 4090; after a 30-minute wait, the model slowly spat out a single "The".


The complete reply took a full 20 hours

Anyone familiar with model training and inference knows that none of this is surprising.

Cluster construction (GPU configuration, network design, rail optimization, etc.), cluster management (real-time monitoring, troubleshooting, etc.)... each of these is a "roadblock".

What should companies that lack relevant experience and funds do?


Recently, R&D engineers at Inspur Information managed to run "Yuan 2.0", a model with hundreds of billions of parameters, on a general-purpose server using only four CPUs!

Given the task of writing a program in Java, Yuan 2.0 produced results very quickly.


Give it another reasoning problem: a ladder hangs over the side of a boat, 2 meters above the sea surface. The sea water rises half a meter every hour. How many hours will it take for the water to submerge the ladder?

Similarly, AI gave detailed problem-solving steps and answers with almost zero delay.



Running a hundred-billion-parameter model on a general-purpose server is unprecedented; there is no prior work in this area and no experience to draw on.

How did Inspur Information do it?

Using 4 CPUs to leverage large models with hundreds of billions of parameters

Inference for a hundred-billion-parameter model on a single server involves two main stages, both of which place hard demands on computing power.

First, there is the pre-filling phase, also called the forward propagation phase.

This phase involves the processing of input data and the first reading of model parameters.

For example, when you enter the prompt "Write me an article about AI", the prefill stage feeds all of the prompt's tokens, together with the model parameters, into the computation at once.

Sometimes, this input may be a few words, or it may be thousands of words, or a book.

How computationally demanding the first stage is depends primarily on the length of our input.

When the first token is computed, the model is loaded for the first time, and all of the weight parameters, the KV cache, and other data are stored in memory.

This is 2-3 times the memory space occupied by the model parameters themselves.

For a model with hundreds of billions of parameters, the huge volume of parameters and input data must be processed by a powerful compute unit, which needs to support vector and matrix instruction sets to carry out the large numbers of matrix multiplications and tensor operations involved.

Second comes the decode stage, in which the model begins producing output once the entire prompt has been fed in.

At this stage, the only requirement is to produce output as quickly as possible; the challenge is no longer compute but "data movement".

This data movement has two parts:

  • The large KV cache generated during prefill must be moved from video memory/main memory to the compute unit (a very heavy workload)

  • Transfer of model parameters themselves

These transfers largely determine inference speed: the faster the data moves, the faster the LLM "speaks".

LLM output is generated token by token through the KV cache, with the key-value vectors of each new token stored after every step.

Therefore, real-time inference for a model at the hundred-billion scale requires a server with both high compute power and high data-transfer efficiency between memory and compute units.

In short, the two stages of large-model inference have very different computational characteristics and require coordinated software and hardware optimization.
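To make the two stages concrete, here is a minimal, self-contained PyTorch sketch (a toy single-head attention, not Yuan 2.0's actual architecture): prefill processes the whole prompt in one large matrix computation, while each decode step appends one token's key/value pair to the KV cache and must stream the growing cache and the weights back through the compute unit.

```python
import torch

torch.manual_seed(0)
d = 64                                          # toy hidden size
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

def attend(q, K, V):
    # scaled dot-product attention over everything cached so far
    scores = q @ K.T / d ** 0.5
    return torch.softmax(scores, dim=-1) @ V

# --- Prefill: process the whole prompt in one shot --------------------
prompt = torch.randn(10, d)                     # 10 prompt "tokens" (toy embeddings)
K_cache, V_cache = prompt @ Wk, prompt @ Wv     # KV cache for all prompt tokens
out = attend(prompt @ Wq, K_cache, V_cache)     # compute-bound: big matmuls

# --- Decode: one token at a time, reusing (and growing) the cache -----
x = out[-1:]                                    # start from the last prefill output
for _ in range(5):
    K_cache = torch.cat([K_cache, x @ Wk])      # append this token's K
    V_cache = torch.cat([V_cache, x @ Wv])      # append this token's V
    x = attend(x @ Wq, K_cache, V_cache)        # bandwidth-bound: cache and weights
                                                # are streamed in at every step
print(K_cache.shape)                            # cache keeps growing per generated token
```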

GPUs are not everything

Traditionally, GPUs have been the first choice for AI training and inference thanks to their superior parallel-processing capabilities.

Cost

However, high-end GPU servers are often in short supply and extremely difficult to obtain in the market.

Only deep-pocketed tech giants, such as Microsoft and Google, can afford this expense.

On the other hand, not only can we not afford to buy it, we can’t even afford to use it.

Renting GPU-based cloud services for inference is expensive; researchers and application vendors who want better cost-effectiveness have to look elsewhere.

Video Memory

In addition, one of the GPU's biggest drawbacks is its limited video memory capacity.

The industry's mainstream LLM architectures have been shifting from GPT-style dense models to MoE, and the parameter scale of the large models on the path to AGI will only grow exponentially.

This means that the size of closed-source/open-source mainstream models will only get bigger and bigger, and models with hundreds of billions or even trillions of parameters will become mainstream.

For a model with 10 billion parameters, 20-30GB of video memory is enough. However, if you want to run a model with 100 billion parameters, you will need about 200-300GB of video memory space.

The video memory of the current mainstream AI chips is usually only tens of GB, which obviously cannot accommodate such a large model. (The most powerful AI chip has not yet reached 200GB)
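As a rough back-of-the-envelope check of these figures, assuming about 2 bytes per parameter for FP16/BF16 weights plus some runtime overhead (the exact overhead factor below is an illustrative assumption):

```python
def est_memory_gb(n_params, bytes_per_param=2, overhead=1.3):
    """Very rough inference memory estimate: weights plus KV cache / activation overhead."""
    return n_params * bytes_per_param * overhead / 1e9

print(est_memory_gb(10e9))    # ~26 GB  -> fits within a single high-end GPU
print(est_memory_gb(100e9))   # ~260 GB -> exceeds any single GPU's memory
```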


An underrated general-purpose server

If the GPU doesn’t work, then start with the CPU.

Although they cannot yet handle large-scale model training, general-purpose servers turn out to have considerable advantages for inference.

In practice, Inspur Information's engineers tackled the problem at both the hardware and algorithm levels, clearing one "obstacle" after another.

Large memory + high-speed bandwidth

In terms of computing power, leading server CPUs already have built-in AI acceleration capabilities.

Similar to a GPU's Tensor Cores, AMX (Advanced Matrix Extensions) accelerates low-precision matrix calculations directly on the CPU cores through dedicated matrix-multiply units.
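As an illustration only (this is standard PyTorch usage, not Inspur's specific software stack): BF16 autocast on the CPU requests low-precision matrix math, which AMX-capable Xeon processors can accelerate through the underlying oneDNN kernels where the hardware supports it.

```python
import torch

a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)

# BF16 autocast on CPU: matmuls run in bfloat16, which AMX-capable CPUs
# can accelerate via the underlying oneDNN kernels (hardware permitting).
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    c = a @ b

print(c.dtype)   # torch.bfloat16
```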

In terms of algorithms, Inspur Information's general-purpose servers support mainstream AI frameworks such as PyTorch and TensorFlow, as well as popular development tools such as DeepSpeed, meeting users' needs for a mature, easy-to-deploy, and convenient open ecosystem.

In terms of communication, the full-link UPI (Ultra Path Interconnect) bus interconnection design enables efficient data transfer between CPUs:

  1. Allows direct data transmission between any two CPUs, reducing communication delays

  2. Provides high transfer rates, up to 16GT/s (Giga Transfers per second)


In addition, Inspur Information's R&D engineers also optimized the routing paths and impedance continuity between CPUs and between CPUs and memory.

Based on the 3D simulation results, they adjusted the via arrangement and reduced the signal crosstalk to below -60dB, a 50% reduction from the previous generation.

In addition, through DOE (design of experiments) matrix simulation, they found the optimal combination across all channel corner cases, so that the compute performance can be fully exploited.

Memory: arguably the biggest advantage of general-purpose servers.

  • capacity

In a 4-socket server, populating each CPU with eight 32GB DIMMs already gives 1TB of memory. Fully populated, it can be expanded to as much as 16TB, enough to support models with up to a trillion parameters.

  • bandwidth

When paired with DDR5 memory, a theoretical bandwidth of 4800 MT/s × 8 bytes per channel × 8 channels × 4 CPUs ÷ 1024 = 1200GB/s can be achieved.

Measured results show a read bandwidth of 995GB/s, a write bandwidth of 423GB/s, and a mixed read/write bandwidth of 437GB/s.

These figures are on par with some GPUs and accelerator cards equipped with GDDR memory.
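The theoretical figure above can be reproduced with a one-line calculation (DDR5-4800 transfers, 8-byte i.e. 64-bit channels, 8 channels per CPU, 4 CPUs):

```python
# DDR5-4800: 4800 MT/s x 8 bytes/channel x 8 channels/CPU x 4 CPUs
peak_mb_s = 4800 * 8 * 8 * 4          # in MB/s (mega-transfers times bytes)
print(peak_mb_s / 1024)               # ~1200 GB/s theoretical peak
```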


But hardware alone is not enough

Hardware innovation alone is far from enough: CPUs still struggle with the massively parallel computation that large-model algorithms demand.

As mentioned earlier, large models place very high demands on communication bandwidth, whether for the computation itself, between compute units, or between compute units and memory.

At BF16 precision, bringing the runtime latency of a hundred-billion-parameter model below 100ms requires at least 2TB/s of bandwidth between memory and the compute units (roughly 100B parameters × 2 bytes per weight ÷ 0.1s ≈ 2TB/s).

Moreover, large AI models are designed around accelerator cards that excel at massively parallel computation, which general-purpose server processors are not well suited to.

The reason is obvious: although the CPU has highly versatile, high-performance cores, it lacks an environment for massively parallel work.

Typically, a general-purpose server first loads the model's weights onto one CPU, which then passes the weight data on to the other CPUs in series.

However, because a large model must constantly shuttle its weights between memory and CPU at runtime, CPU-memory bandwidth is poorly utilized and communication overhead becomes extremely high.


How to solve this? With algorithmic innovation

In response to these challenges, Inspur Information proposed two technical innovations, tensor parallelism and NF4 quantization, and achieved real-time inference for the hundred-billion-parameter model Yuan2.0-102B.

Based on the performance analysis results, we can clearly see the distribution of computation time for different parts of the model:

Linear layers account for 50% of the runtime, convolutions for 20%, collective communication for 20%, and other computation for 10%.

Note that during the entire inference process, computation time accounts for 80%!

This stands in stark contrast to multi-card PCIe AI accelerator setups, where communication overhead can reach 50%, wasting a great deal of compute.


Performance analysis of Yuan2.0-102B model inference

Tensor Parallelism

Tensor parallelism first splits operators such as the convolutions at the tensor level, then distributes the matrix-multiplication weights of the model's attention and feed-forward layers across the memory of multiple processors.

In this way, the four CPUs in a general-purpose server can obtain algorithm weights at the same time and accelerate calculations.

However, tensor parallelism partitions the model parameters at a fine granularity, requiring the CPUs to synchronize after every tensor computation.

The full-link UPI bus interconnection technology mentioned above fully meets this requirement (with a transfer rate of up to 16GT/s).

Ultimately, this collaborative and parallel work directly increased computing efficiency by 4 times!
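Here is a minimal single-process sketch of the idea, simulating the four CPU sockets with four weight shards; the real system distributes the shards across sockets and synchronizes the partial results over the UPI links:

```python
import torch

torch.manual_seed(0)
n_sockets = 4
x = torch.randn(8, 1024)              # a batch of activations
W = torch.randn(1024, 4096)           # one large linear-layer weight

# Column-parallel split: each "socket" holds a slice of the output columns.
shards = W.chunk(n_sockets, dim=1)

# Each socket computes its partial output with only its own shard...
partials = [x @ w for w in shards]

# ...and the results are gathered (in the real system, over the UPI links).
y_parallel = torch.cat(partials, dim=1)

assert torch.allclose(y_parallel, x @ W, atol=1e-5)
```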


NF4 Quantization

As for the problem of insufficient memory bandwidth, the model needs to be "slimmed down" without affecting the accuracy, that is, quantized.

The benefit is twofold: quantizing the LLM parameters to low-bit data shrinks the weights, and smaller weights in turn mean less data to move during computation.

Here, Inspur Information uses a relatively uncommon quantile quantization method: NF4 (4-bit NormalFloat).


The NF4 quantization method can compress the size of Yuan2.0-102B to 1/4 of its original size.

Specifically, the core idea of NF4 is to make each quantization interval contain an equal number of values from the input tensor.

This property is well suited to LLM weights, which are approximately normally distributed.

Since the standard deviation can be adjusted to fit the range of the quantized data type, NF4 can achieve higher accuracy than traditional 4-bit integer or 4-bit floating point quantization.

In this way, the quantized model not only meets the accuracy requirements but also greatly reduces the volume of memory accesses in large-scale parallel computation, satisfying the decoding requirements of real-time inference.


Integer or floating point quantization methods usually have data intervals that are evenly or exponentially distributed.
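Below is a minimal sketch of block-wise quantile (NF4-style) 4-bit quantization. It only assumes that the code book is built from equal-probability quantiles of a standard normal distribution and rescaled to [-1, 1]; the exact NF4 code book and Inspur's implementation may differ in detail.

```python
import torch
from torch.distributions import Normal

def make_codebook(bits=4):
    """16 levels placed at equal-probability quantiles of N(0, 1), rescaled to [-1, 1]."""
    n = 2 ** bits
    probs = (torch.arange(n, dtype=torch.float32) + 0.5) / n
    levels = Normal(0.0, 1.0).icdf(probs)
    return levels / levels.abs().max()

def quantize_block(w, codebook):
    """Absmax-normalize one block, then snap each weight to the nearest code-book level."""
    scale = w.abs().max()
    idx = (w / scale - codebook[:, None]).abs().argmin(dim=0)   # 4-bit indices
    return idx.to(torch.uint8), scale

def dequantize_block(idx, scale, codebook):
    return codebook[idx.long()] * scale

codebook = make_codebook()
w = torch.randn(64)                      # one quantization block of 64 weights
idx, scale = quantize_block(w, codebook)
w_hat = dequantize_block(idx, scale, codebook)
print((w - w_hat).abs().mean())          # small reconstruction error
```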

In order to further compress the weight parameters of the model, the team also adopted the nested quantization (Double Quant) technology.

This is a second round of quantization applied on top of NF4 quantization.

NF4 quantization produces a large number of scale parameters, and storing them as 32-bit floating point numbers (FP32) would consume a lot of memory.

For an LLM with 100 billion parameters, if every 64 parameters form one quantization block (block size = 64), an additional 6GB of memory is needed just to store the scale parameters: (100B ÷ 64) × 4 bytes ≈ 6GB.

The team significantly reduced the required storage space by quantizing these scale parameters to 8-bit floating point numbers (FP8).

With a second-level block size of 256, the additional space needed for all scale parameters drops to about 1.57GB: (100B ÷ 64 ÷ 256) × 4 bytes + (100B ÷ 64) × 1 byte ≈ 1.57GB.
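These savings can be checked with a short calculation (block size 64 for the weights and 256 for the second-level quantization of the scales, as described above):

```python
n_params  = 100e9
block     = 64        # weights per quantization block
scale_blk = 256       # first-level scales per second-level block

# Without double quantization: one FP32 scale (4 bytes) per 64-weight block.
plain = n_params / block * 4

# With double quantization: FP8 scales (1 byte each) plus a few FP32
# second-level scales, one for every 256 first-level scales.
nested = n_params / block * 1 + n_params / block / scale_blk * 4

print(plain / 1e9, nested / 1e9)   # ~6.25 GB vs ~1.59 GB, close to the figures quoted above
```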

With nested quantization, each weight parameter of the model ends up occupying only about 4 bits of memory, a huge saving compared with the original FP32 representation.

At the same time, it increases the efficiency of data transfer from memory to CPU by 4 times.

Such optimization significantly alleviates the limitation of memory bandwidth on the inference decoding efficiency of the Yuan2.0-102B model, thereby further improving the inference performance of the model.

"General-purpose" means everyone can use it

At this point, Inspur Information has successfully turned in its answer sheet!

Through system-level optimization, Inspur Information's NF8260G7 is the first in the industry to run a large model with hundreds of billions of parameters on general-purpose processors alone.

With this, the parameter scale of AI models supported by general-purpose computing has surpassed 100 billion, filling a gap in the industry and giving enterprises a new starting point for owning AI.

The deployment of AI models with hundreds of billions of parameters now has a more powerful and cost-effective option; the application of large AI models can achieve closer integration with cloud, big data, and databases.


The ultimate goal of scientific and technological progress is, in the end, to reach the everyday world.

AIGC has already spread into thousands of industries, and AI is reaching every computing device at an astonishing pace.

From January to April 2024, the number of domestic large-model tenders won already exceeded the total for all of 2023, and the disclosed contract value reached 77% of 2023's full-year total.

Practitioners in the financial industry, hospital outpatient departments, and corporate IT departments have all discovered this: the computing power infrastructure of traditional industries is no longer sufficient!

Today, hundred-billion-parameter models are the key to the emergence of intelligence across industries, and whether general-purpose computing can run them is the yardstick for whether it can support that emergence.

Inspur Information's pioneering work enables customers in industries such as the Internet, finance, and healthcare to achieve efficient deployment, saving more than 80% of construction costs on the initial investment.

From financial fraud prevention and financial data analysis to enterprise CRM marketing insights, intelligent medical diagnosis, personalized treatment plans, and education and training, we will see AI applied everywhere.

From now on, all computing is AI.

References:

https://mp.weixin.qq.com/s/1wYt7dfoVy2J1FFkOJjRTg