
How was the open-source model that beats GPT-4o created? Everything about Llama 3.1 405B is in the paper

2024-07-24




Synced Report

Synced Editorial Department

After an "accidental leak" two days in advance, Llama 3.1 was finally officially released last night.

Llama 3.1 extends the context length to 128K, comes in 8B, 70B, and 405B versions, and once again raises the bar for competition among large models.

For the AI community, the most important significance of Llama 3.1 405B is that it raises the capability ceiling of open-source foundation models. Meta says that on a range of tasks, its performance is comparable to the best closed-source models.

The following table shows the performance of the current Llama 3 series models on key benchmarks. As can be seen, the performance of the 405B model is very close to that of GPT-4o.



At the same time, Meta released the paper "The Llama 3 Herd of Models", which discloses the research details behind the Llama 3 series of models to date.



Paper address: https://ai.meta.com/research/publications/the-llama-3-herd-of-models/

Next, let’s take a look at the content of the paper.

Llama 3 Paper Highlights

1. Llama 3.1 405B is first pre-trained with an 8K context length, then continually trained with a 128K context length, and supports multiple languages and tool use.
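For context on how the 128K window is supported, the paper notes that Llama 3 uses rotary position embeddings (RoPE) with the base frequency raised to 500,000. The snippet below is a minimal sketch of how the RoPE rotation angles are computed; the function name and shapes are illustrative, not Meta's implementation.

```python
import torch

def rope_frequencies(head_dim: int, max_seq_len: int, base: float = 500_000.0):
    """Minimal RoPE sketch: per-dimension rotation angles for each position.

    `base` is the RoPE base frequency; the Llama 3 paper raises it to 500,000
    (from the original 10,000) so that rotations remain distinguishable at
    long context lengths such as 128K tokens.
    """
    # One inverse frequency per pair of channels in the attention head.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_seq_len).float()
    # angles[p, i] = p * inv_freq[i]; cos/sin of these rotate query/key pairs.
    angles = torch.outer(positions, inv_freq)
    return torch.cos(angles), torch.sin(angles)

cos, sin = rope_frequencies(head_dim=128, max_seq_len=131_072)
print(cos.shape)  # torch.Size([131072, 64])
```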

2. Compared with previous Llama models, Meta strengthened the pre-processing and curation pipelines for pre-training data, as well as the quality-assurance and filtering methods for post-training data.

Meta believes there are three key levers in developing high-quality foundation models: data, scale, and complexity management.

First, compared with earlier versions of Llama, Meta improved the data used for pre-training and post-training in both quantity and quality. Meta pre-trained Llama 3 on a multilingual corpus of approximately 15 trillion tokens, compared with about 1.8 trillion tokens for Llama 2.

The models trained this time are also much larger than previous Llama models: the flagship language model was pre-trained with 3.8 × 10²⁵ floating-point operations (FLOPs), nearly 50 times the compute of the largest version of Llama 2.

According to scaling laws, the flagship model is approximately compute-optimal in size for Meta's training budget, but the smaller models were trained far longer than would be compute-optimal. The result is that these smaller models outperform compute-optimal models at the same inference budget. In the post-training stage, Meta also used the 405B flagship model to further improve the quality of the smaller 70B and 8B models.
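As a sanity check on these numbers, the standard approximation FLOPs ≈ 6 × parameters × tokens reproduces the reported compute budget. The figures below are a back-of-envelope sketch, with the Llama 2 comparison assuming its 70B flagship and the ~1.8T-token corpus mentioned above.

```python
# Back-of-envelope check of the pre-training compute budget using the
# common approximation FLOPs ≈ 6 * N_parameters * N_tokens.
params = 405e9        # Llama 3.1 405B parameters
tokens = 15.6e12      # pre-training tokens reported for the flagship model

flops = 6 * params * tokens
print(f"{flops:.2e} FLOPs")        # ≈ 3.79e+25, matching the reported 3.8e25

# Rough comparison with Llama 2 70B (assumed 70B params, ~1.8T tokens):
llama2_flops = 6 * 70e9 * 1.8e12   # ≈ 7.6e+23
print(f"~{flops / llama2_flops:.0f}x the compute of Llama 2 70B")  # ≈ 50x
```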

3. To support large-scale production inference of the 405B model, Meta quantized it from 16-bit (BF16) to 8-bit (FP8) numerics, reducing compute requirements and enabling the model to run on a single server node.
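To give a concrete sense of what FP8 quantization involves, here is a minimal sketch of rowwise FP8 (E4M3) weight quantization in PyTorch (requires float8 dtype support, PyTorch 2.1+). It is illustrative only: Meta's production inference uses scaled FP8 matmul kernels rather than dequantizing back to BF16, and the function names here are hypothetical.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def quantize_rowwise_fp8(w_bf16: torch.Tensor):
    """Quantize a BF16 weight matrix to FP8 (E4M3) with one scale per row.

    Illustrative sketch only: real FP8 inference multiplies the FP8 tensors
    directly with scaled matmul kernels instead of dequantizing as below.
    """
    # Per-row scale so the largest magnitude in each row maps near FP8 max.
    amax = w_bf16.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    scale = amax / FP8_E4M3_MAX
    w_fp8 = (w_bf16 / scale).to(torch.float8_e4m3fn)  # 8-bit storage
    return w_fp8, scale

def dequantize_fp8(w_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return w_fp8.to(torch.bfloat16) * scale

w = torch.randn(4096, 4096, dtype=torch.bfloat16)
w_fp8, scale = quantize_rowwise_fp8(w)
err = (dequantize_fp8(w_fp8, scale) - w).abs().mean()
print(w_fp8.dtype, f"mean abs error: {err.item():.4f}")
```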

4. Pre-training the 405B model on 15.6T tokens (3.8 × 10²⁵ FLOPs) was a major challenge. Meta optimized the entire training stack and used more than 16K H100 GPUs.
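A rough wall-clock estimate shows why this is an infrastructure problem. The sustained per-GPU throughput used below is an assumption (~400 TFLOPs of effective BF16 throughput per H100 after parallelism overheads), not a figure from this article.

```python
# Rough wall-clock estimate for the 405B pre-training run.
total_flops = 3.8e25
gpus = 16_000
sustained_flops_per_gpu = 4.0e14   # 400 TFLOPs/s per H100, assumed

seconds = total_flops / (gpus * sustained_flops_per_gpu)
print(f"≈ {seconds / 86_400:.0f} days of continuous training")  # ≈ 69 days
```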

As PyTorch creator and Meta distinguished engineer Soumith Chintala noted, the Llama 3 paper reveals many interesting details, one of which is the construction of the infrastructure.



5. In post-training, Meta refined the chat model through multiple rounds of alignment, each including supervised fine-tuning (SFT), rejection sampling, and direct preference optimization (DPO). Most SFT samples were produced using synthetic data.

The researchers made design choices aimed at maximizing the scalability of the model development process. For example, a standard dense Transformer architecture with only minor adjustments was chosen over a mixture-of-experts (MoE) model to maximize training stability. Similarly, a relatively simple post-training procedure based on supervised fine-tuning (SFT), rejection sampling (RS), and direct preference optimization (DPO) was adopted instead of more complex reinforcement learning algorithms, which tend to be less stable and harder to scale.
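For reference, the standard DPO objective that this pipeline builds on fits in a few lines. The sketch below operates on pre-computed per-sequence log-probabilities and is not Meta's training code; the paper describes additional adjustments on top of this basic form, and the value of `beta` here is illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Standard DPO objective on per-sequence log-probabilities.

    Each argument is a tensor of summed token log-probs for a batch of
    (prompt, chosen, rejected) triples. `beta` controls how strongly the
    policy is pushed away from the frozen reference model.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between chosen and rejected relative to the reference.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs.
pc, pr = torch.randn(4), torch.randn(4)
rc, rr = torch.randn(4), torch.randn(4)
print(dpo_loss(pc, pr, rc, rr))
```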

6. As part of the development process of Llama 3, the Meta team also developed multimodal extensions of the model to enable image recognition, video recognition, and speech understanding. These models are still under active development and are not yet ready for release, but the paper shows the results of preliminary experiments on these multimodal models.

7. Meta updated its license to allow developers to use the output of the Llama model to enhance other models.

At the end of this paper, we also see a long list of contributors:





Together, these factors ultimately produced today's Llama 3 models.

Of course, for ordinary developers, using a model at the 405B scale is a challenge in itself, requiring substantial computing resources and expertise.

With the release, Llama 3.1 is ecosystem-ready, with more than 25 partners providing services that can be used with the latest models, including Amazon Web Services, NVIDIA, Databricks, Groq, Dell, Azure, Google Cloud, and Snowflake.



For more technical details, please refer to the original paper.