
Llama 3.1 is out! Open source beats closed source for the first time, and the era of GPT-4 for everyone has arrived

2024-07-24



New Intelligence Report

Editor: Editorial Department

【New Intelligence Introduction】The big model landscape has changed overnight yet again. Llama 3.1 405B has made its debut, surpassing GPT-4o and Claude 3.5 Sonnet in multiple tests. For the first time in history, an open source model has beaten the strongest closed source models. Zuckerberg made a bold statement: open source AI will win, just as Linux eventually won.

The new open source king Llama 3.1 405B was officially launched last night!

In multiple benchmark tests, it surpassed GPT-4o and Claude 3.5 Sonnet. In other words, open source models have caught up with closed-source SOTA models.


Overnight, Llama 3.1 405B became the most powerful model in the world.

(Also launched at the same time are the new 70B and 8B models)


LeCun summarized the key points of the Llama 3.1 model family:

- 405B performance is comparable to the best closed-source models

- Open weights and code, free to use, allowing fine-tuning, distillation into other models, and deployment anywhere

- 128k context, multiple languages, good code generation, complex reasoning, and tool use

- Llama Stack API allows for easy integration


Meta has truly lived up to the spirit of open source this time, generously releasing a 92-page paper as well.

Thomas Wolf, chief scientist of Hugging Face, praised it: if you want to study large models from scratch, this paper is what you need!

It covers literally everything — pre-training data, filtering, annealing, synthetic data, scaling laws, infrastructure, parallel processing, training methods, post-training adaptation, tooling, benchmarking, inference strategies, quantization, vision, speech, and video…

Nathan Lambert, a researcher at AI2, estimates that the 92-page Llama 3.1 paper will directly advance open source model progress by 3 to 9 months!


Meta CEO Zuckerberg proudly published a long essay: open source AI is the path forward.


In an interview with the New York Times, Zuckerberg voiced his support for open source AI

In this essay, Zuckerberg recalled how Meta turned things around in the LLM wave:

Last year, Llama 2 was only comparable to older models behind the frontier; this year, Llama 3 is already ahead of state-of-the-art models in some respects; and starting next year, future Llama models will be state-of-the-art.

When asked repeatedly whether he worried about losing his technological edge by open-sourcing Llama, Zuckerberg drew a direct comparison to Linux.

He said that big technology companies once invested heavily in their own versions of Unix, but in the end open source Linux won, because it let developers modify the code at will and became more advanced, more secure, and supported by a broader ecosystem.

AI, too, will develop in a similar way.

To this end, Meta has specially relaxed its license: for the first time, developers may use the high-quality outputs of Llama 3.1 models to improve and develop third-party AI models.


Netizens: A new era begins

After Llama 3.1 was officially released, it caused an uproar across the internet.

AI guru Karpathy shared his thoughts:

Today, with the release of the 405B model, cutting-edge large models at the GPT-4/Claude 3.5 Sonnet level are open for everyone to use and build upon for the first time. The weights are open source and commercially licensed, permitting synthetic data generation, distillation, and fine-tuning.

This is a truly open, cutting-edge LLM released by Meta. In addition, they also released a 92-page technical report with a lot of model details: https://ai.meta.com/research/publications/the-llama-3-herd-of-models/


The philosophy behind this model release is detailed in a long essay by Zuckerberg, which is well worth reading, as it covers all the main ideas and arguments in support of the worldview of an open AI ecosystem:

Open source AI is the future.

I often say it is still early days, as if it were the 1980s of computing all over again. LLMs are the next major computing paradigm, and Meta is clearly positioning itself as a leader of the open ecosystem.

- People will prompt these models and use RAG with them

- People will fine-tune them

- People will distill them into smaller expert models for specific tasks and applications

- People will research, benchmark, and optimize them

In addition, an open ecosystem self-organizes, in a modular fashion, into products, applications, and services, with each participant contributing unique expertise.

One example: AI chip startup Groq has already integrated the Llama 3.1 models, achieving near-instant inference with the 8B model.

Karpathy said that, due to server load, he had not yet managed to try the 405B running on Groq, which may be the most capable and fastest large model available today.


He also predicts that closed source models will soon catch back up, and he is looking forward to it.

Meta researcher Yuandong Tian said: a new era has begun! Open source LLMs are now comparable to, or better than, closed source LLMs!


A new king of open source models is born.


After testing the fine-tuned Llama 3.1 8B, the founder of OpenPipe exclaimed: there has never been an open source model this small and this powerful, and it performs better than GPT-4o mini on every task!



Jim Fan, senior scientist at Nvidia, said, "The power of GPT-4 is in our hands. This is a historic moment."


Few people pay attention to the infrastructure behind AI model training. Soumith Chintala, the father of PyTorch, stepped up to point out that even in a facility built with 16,000 GPUs, failures occur.

These details are buried in the Llama 3.1 paper, including how training is parallelized and how system reliability is maintained. Notably, the Meta team achieved roughly 90% effective training time.



Some netizens have tallied the growing GPU usage across Llama model generations:

Llama 1: 2048 GPUs

Llama 2: 4096 GPUs

Llama 3.1: 16,384 GPUs (in fact, Llama 3 was trained on two clusters with 24,000 GPUs)

Llama 4:......


The strongest open source model family

In fact, the key details of the Llama 3.1 series had essentially all leaked the day before.

As the leaks indicated, Llama 3.1 supports 8 languages (English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai) and use cases such as multilingual conversational agents and translation.

In terms of context length, all models in the Llama 3.1 series jump to 128K tokens, a 16-fold increase over Llama 3's 8K.


Meta stressed that Llama 3.1 also improves tool use, supporting zero-shot tool calling, including web search, mathematical operations, and code execution.

Thanks to the long context, the model not only knows when to use a tool, but also understands how to use it and how to interpret the results.

Additionally, through fine-tuning, Llama 3.1 offers strong flexibility in invoking custom tools.


Main Capabilities

First, Llama 3.1 can be run as a system capable of performing "agentic" tasks:

- Decompose tasks and perform multi-step reasoning

- Use tools:

  - Built-in tools: the model ships with knowledge of tools such as search or the code interpreter

  - Zero-shot tool use: the model can learn to call tools from in-context tool definitions it has never seen before

For example, ask the model: "This is a CSV file, can you describe what's in it?"

It recognizes that the CSV file contains monthly inflation rates across multiple years, with a year column indicating the year for each set of monthly rates.
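The article shows this demo only as screenshots. As a rough illustration of the zero-shot tool-use idea, a contextual tool definition might be passed to the model as in the hypothetical sketch below; the JSON schema, function name, and calling convention are our assumptions, not Meta's official format:

```python
import json

# Hypothetical in-context tool definition. Llama 3.1 can learn to call tools
# it has never seen from definitions like this (zero-shot); the schema and
# calling convention below are illustrative assumptions, not Meta's format.
describe_csv_tool = {
    "name": "describe_csv",
    "description": "Load a CSV file and summarize its columns and contents.",
    "parameters": {
        "type": "object",
        "properties": {"path": {"type": "string", "description": "CSV file path"}},
        "required": ["path"],
    },
}

system_prompt = (
    "You have access to the following tool. To call it, reply with JSON of "
    'the form {"tool": <name>, "arguments": {...}}.\n'
    + json.dumps(describe_csv_tool, indent=2)
)

# A reply the model might produce for "This is a CSV file, can you describe
# what's in it?" -- hard-coded here so the sketch runs stand-alone.
model_reply = '{"tool": "describe_csv", "arguments": {"path": "inflation.csv"}}'

call = json.loads(model_reply)
if call["tool"] == "describe_csv":
    print(f"Would load and summarize: {call['arguments']['path']}")
```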


Next, we can ask it to plot the data as a time series.


It can then handle a series of trickier tasks, such as drawing the S&P 500 on the same chart.


Once that's done, you can have it rescale the chart, placing the two series on different axes.


As noted above, Llama 3.1 supports 8 languages, so it is well capable of multilingual translation.

We can have it translate the fairy tale Hansel and Gretel into Spanish.


Even when faced with more complex reasoning questions, Llama 3.1 can solve them with ease.

"I have 3 shirts, 5 pairs of shorts and 1 dress. I'm going to be away for 10 days. Are these enough clothes for my vacation?"

The AI breaks down the given conditions, comes up with reasonable combinations of tops, shorts, and the dress, and suggests packing a few extra tops.


After completing the reasoning, it also thoughtfully provides us with a more detailed travel clothing guide and packing list.


We can also have the AI write code.

For example, ask it to create a program that uses a recursive backtracking algorithm or depth-first search to generate a perfect maze, with customizable size and complexity.

The AI got straight to work and produced Python code for the maze program.
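The article shows the generated program only as a screenshot; below is a minimal sketch of what such a recursive-backtracking maze generator typically looks like (our own illustrative version, not the model's actual output):

```python
import random

def generate_maze(width, height):
    """Generate a perfect maze (all cells connected, no loops) using
    recursive backtracking, implemented as an iterative depth-first search."""
    opposite = {"N": "S", "S": "N", "E": "W", "W": "E"}
    moves = {"N": (0, -1), "S": (0, 1), "E": (1, 0), "W": (-1, 0)}
    walls = [[set("NSEW") for _ in range(width)] for _ in range(height)]
    visited = [[False] * width for _ in range(height)]

    stack = [(0, 0)]
    visited[0][0] = True
    while stack:
        x, y = stack[-1]
        candidates = [
            (d, x + dx, y + dy)
            for d, (dx, dy) in moves.items()
            if 0 <= x + dx < width and 0 <= y + dy < height
            and not visited[y + dy][x + dx]
        ]
        if not candidates:
            stack.pop()  # dead end: backtrack
            continue
        d, nx, ny = random.choice(candidates)
        walls[y][x].remove(d)              # knock down the shared wall
        walls[ny][nx].remove(opposite[d])
        visited[ny][nx] = True
        stack.append((nx, ny))
    return walls

def render(walls):
    """Print the maze as ASCII art."""
    height, width = len(walls), len(walls[0])
    print("+" + "---+" * width)
    for y in range(height):
        row, floor = "|", "+"
        for x in range(width):
            row += "   " + ("|" if "E" in walls[y][x] else " ")
            floor += ("---" if "S" in walls[y][x] else "   ") + "+"
        print(row + "\n" + floor)

render(generate_maze(8, 6))
```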


After the code was completed, the AI also gave a detailed explanation.


Next, if you want to customize the program, the AI code assistant offers corresponding code suggestions, such as adjusting the width and height.


Evaluation Results

To evaluate the performance of Llama 3.1, Meta not only included 150 benchmark datasets covering multiple languages, but also conducted comparisons in real-world scenarios.

In a variety of tasks, 405B can compete with leading closed-source models such as GPT-4, GPT-4o, and Claude 3.5 Sonnet.


The smaller 8B and 70B models also perform well among closed source and open source models with similar parameter counts.

Across general tasks, code, mathematics, reasoning, tool use, multilingual tasks, and long-context tasks, the 8B and 70B models achieve SOTA results.


In human evaluation, the Llama 3.1 405B model performs on par with GPT-4 but slightly behind GPT-4o.

However, against Claude 3.5 Sonnet, the 405B model holds the edge, with a win rate of 24.9%.


In addition, in Scale's rankings, the fine-tuned Llama 3.1 405B crushed Claude 3.5 Sonnet and GPT-4o in the instruction-following evaluation.

In the math task, 405B ranked second only to Claude 3.5 Sonnet; in the coding task, however, Llama 3.1 scored relatively low.


A 92-page, extremely detailed technical report

No one does open source quite like Meta: a 92-page technical report was also released today.


Paper address: https://ai.meta.com/research/publications/the-llama-3-herd-of-models/

The paper identifies three key levers behind a high-quality foundation model like Llama 3.1: data, scale, and complexity management.

On the data front, both the total volume and the quality improve over its predecessors: more careful preprocessing and curation pipelines for pre-training data, and more stringent quality assurance and filtering for post-training data.

Llama 2 was pre-trained on only 1.8T tokens, while Llama 3.1's multilingual pre-training corpus reaches 15.6T tokens, an increase of more than 8 times.

On scale, training Llama 3.1 used more than 16,000 NVIDIA H100 GPUs, for a total compute of 3.8e25 FLOPs, almost 50× that of Llama 2.
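As a sanity check, the reported figure lines up with the common C ≈ 6ND approximation for dense-Transformer training compute (our approximation; the paper's exact accounting may differ):

```python
# Back-of-envelope check of the reported training compute using the common
# C ~ 6 * N * D approximation for dense Transformers (an approximation on
# our part; the paper's exact accounting may differ).
N = 405e9     # parameters
D = 15.6e12   # pre-training tokens
print(f"~{6 * N * D:.2e} FLOPs")  # ~3.79e+25, matching the reported 3.8e25
```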

To make this "scale-up" work, the paper highlights "complexity management": when choosing model architecture and algorithms, stability and scalability come first.

Notably, Llama 3.1 does not use the currently popular MoE architecture but a dense, decoder-only Transformer, with only small modifications to the original architecture to maximize training stability.

In the same spirit, post-training relies on simple procedures such as SFT, rejection sampling (RS), and DPO rather than more complex reinforcement learning algorithms.

Similar to many large models, the development of Llama 3 also mainly includes two stages: pre-training and post-training.

Pre-training uses the standard next-token-prediction objective, with the context window first set to 8K and then expanded to 128K late in the pre-training phase.

The post-training phase improves the model through multiple rounds of iterative human feedback, significantly boosting coding and reasoning performance and integrating tool-use capabilities.

In addition, the paper describes adding multimodal capabilities such as images, video, and speech to Llama 3.1 via 3 extra stages:

- Multimodal encoder pre-training: the image and speech encoders are trained separately; the former on image-text pairs, the latter with a self-supervised objective that reconstructs masked-out portions of speech from discretized tokens.

- Vision adapter: a series of cross-attention layers that inject the image encoder's representations into the pre-trained language model. Building on images, the paper also trains a video adapter on video-text pairs.

- Speech adapter: connects the speech encoder to the language model, and also integrates a text-to-speech system.
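As a rough picture of the vision-adapter idea (cross-attention layers injecting image features into a frozen language model), here is a hypothetical PyTorch sketch; the dimensions, gating, and naming are our assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    """Hypothetical vision-adapter block: frozen LM hidden states attend to
    image-encoder outputs via cross-attention. Dimensions, gating, and names
    are illustrative assumptions, not the paper's exact configuration."""

    def __init__(self, lm_dim: int, vision_dim: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=lm_dim, num_heads=n_heads,
            kdim=vision_dim, vdim=vision_dim, batch_first=True,
        )
        self.norm = nn.LayerNorm(lm_dim)
        self.gate = nn.Parameter(torch.zeros(1))  # starts as a no-op

    def forward(self, hidden, image_feats):
        # hidden: (batch, seq, lm_dim); image_feats: (batch, patches, vision_dim)
        attended, _ = self.attn(self.norm(hidden), image_feats, image_feats)
        return hidden + torch.tanh(self.gate) * attended

# Toy usage: 4096-dim LM states attend over 1024-dim image-patch features.
adapter = CrossAttentionAdapter(lm_dim=4096, vision_dim=1024)
out = adapter(torch.randn(1, 16, 4096), torch.randn(1, 256, 1024))
print(out.shape)  # torch.Size([1, 16, 4096])
```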


Unfortunately, the aforementioned multimodal functionality is still under development and therefore not included in the newly released Llama 3.1.

Model Architecture

Llama 3.1 still uses a standard dense Transformer, with no significant architectural differences from Llama and Llama 2; the performance gains come mainly from improved training-data quality and diversity, and from scale.


Compared with Llama 2, the architecture of Llama 3.1 has the following changes:

- Grouped query attention (GQA): 8 key-value heads, improving inference speed and reducing KV-cache size during decoding (see the sketch after this list)

- Attention mask: prevents self-attention between different documents in the same sequence. This trick has limited effect in standard pre-training, but is very important when continuing pre-training on very long sequences

- 128K-token vocabulary: 100K tokens from tiktoken plus an extra 28K to better support non-English languages. Compared with Llama 2, compression ratios improve for both English and non-English text

- RoPE hyperparameter θ raised to 500,000: better support for long contexts
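For the curious, here is a minimal, unoptimized sketch of the GQA idea from the first bullet, assuming 32 query heads sharing 8 KV heads (shapes are illustrative and causal masking is omitted):

```python
import torch

def grouped_query_attention(q, k, v):
    """Unoptimized sketch of GQA: many query heads share fewer KV heads,
    shrinking the KV cache during decoding. Causal masking is omitted and
    shapes are illustrative; real implementations use fused kernels."""
    n_q_heads, n_kv_heads = q.shape[1], k.shape[1]
    group = n_q_heads // n_kv_heads              # query heads per KV head
    k = k.repeat_interleave(group, dim=1)        # expand KV heads for sharing
    v = v.repeat_interleave(group, dim=1)
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

# Toy shapes: 32 query heads, 8 KV heads -> a 4x smaller KV cache.
q = torch.randn(1, 32, 16, 64)   # (batch, q_heads, seq, head_dim)
kv = torch.randn(1, 8, 16, 64)   # (batch, kv_heads, seq, head_dim)
print(grouped_query_attention(q, kv, kv).shape)  # torch.Size([1, 32, 16, 64])
```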

The key hyperparameters of the model are shown in Table 3. Given the data volume and training compute, the model size is compute-optimal in the sense of scaling laws.


Parallel efficiency

Training a 405B model on 16,000 GPUs is a massive engineering project in itself, just in terms of parallelism and fault handling.

In addition to the model itself, the paper also explains the parallelization scheme used in the training process, as well as infrastructure such as storage and network.

Llama 3.1 training uses 4D parallelism (tensor + pipeline + context + data). At BF16 precision, model FLOPs utilization (MFU) is about 38% to 41%.
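For reference, MFU simply measures achieved model-FLOPs throughput against the hardware's theoretical peak. Assuming an H100 peak of roughly 989 TFLOPS at dense BF16 (our assumption, not a figure from the paper), the reported range works out to:

```python
# MFU (model FLOPs utilization) = achieved model FLOPs / theoretical peak.
# The peak figure is our assumption: ~989 TFLOPS per H100 at dense BF16.
peak_per_gpu = 989e12  # FLOPS
for mfu in (0.38, 0.41):
    print(f"MFU {mfu:.0%} -> ~{mfu * peak_per_gpu / 1e12:.0f} TFLOPS achieved per GPU")
```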


The Llama 3.1 training cluster also handled failures very well, achieving over 90% effective training time; even so, that still meant at least one interruption per day over the 54 days of pre-training.

The paper lists the causes of all 419 unexpected interruptions in detail (Table 5), a valuable reference for building future GPU clusters. 78% of them were confirmed or suspected to be hardware-related.


Because the cluster's automated operations were fairly mature, most failures were handled automatically despite their number; only 3 incidents required manual intervention during the entire run.

Improving performance on specific capabilities

Code

To improve the model's coding capabilities, Meta trained a code expert, generated synthetic SFT data, guided output format via system prompts, and built quality filters to remove bad samples from the training data.


Using Llama 3 to convert Python code (left) to PHP code (right), augmenting the SFT dataset with a wider range of programming languages


Improving code quality via system prompts. Left: without a system prompt. Right: with a system prompt
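The paper's quality filters are model-based and more elaborate, but as a toy illustration of the idea of removing bad samples from code training data, a minimal hypothetical filter might simply drop samples whose code fails to parse:

```python
import ast

def keep_sample(sample: dict) -> bool:
    """Toy quality filter for Python SFT samples: drop anything whose code
    does not even parse. Meta's real filters (model-based scoring, execution
    checks, etc.) are far more involved; this only illustrates the idea."""
    try:
        ast.parse(sample["code"])
        return True
    except SyntaxError:
        return False

samples = [
    {"code": "def add(a, b):\n    return a + b"},
    {"code": "def broken(:"},
]
print([keep_sample(s) for s in samples])  # [True, False]
```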

Multilingual

To improve Llama 3's multilingual capabilities, Meta trained a dedicated expert on larger amounts of multilingual data, used it to acquire and generate high-quality multilingual instruction fine-tuning data (for German, French, Italian, Portuguese, Hindi, Spanish, and Thai), and addressed the specific challenges of multilingual steering.


Mathematical Reasoning

Training models that are good at mathematical reasoning faces several major challenges: a shortage of prompts, a lack of ground-truth chains of thought, incorrect intermediate steps, the need to teach the model to use external tools, and mismatches between training and inference.

To address these, Meta augments the insufficient prompts, enriches the step-by-step reasoning traces in the training data, filters out incorrect reasoning traces, combines code and text reasoning, and learns from feedback and errors.


Long context

In the final pre-training stage, Meta expands the context length of Llama 3 from 8K tokens to 128K.

In practice, the team found that doing SFT with only short-context data significantly degrades the model's long-context ability; at the same time, reading lengthy contexts is tedious and time-consuming, so having humans annotate such examples is impractical.

Therefore, Meta chose synthetic data to fill this gap.

Using an early version of Llama 3, they generated synthetic data for the key long-context use cases: (multi-turn) question answering, long-document summarization, and reasoning over codebases.
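Schematically, such a synthetic-data recipe might look like the hypothetical sketch below; the `generate` callable stands in for an early Llama 3 checkpoint, and the names and chunk size are our assumptions:

```python
def build_long_context_qa(document, generate, chunk_size=8000):
    """Hypothetical sketch of the recipe described above: have an earlier
    model write QA pairs about chunks of a long document, then train on the
    full document so answering requires long-range retrieval. `generate`
    stands in for an LLM call; names and chunk size are assumptions."""
    examples = []
    for start in range(0, len(document), chunk_size):
        chunk = document[start:start + chunk_size]
        question = generate(f"Write a question answerable from:\n{chunk}")
        answer = generate(f"Answer using only this text:\n{chunk}\n\nQ: {question}")
        examples.append({"context": document, "question": question, "answer": answer})
    return examples

fake_llm = lambda prompt: "stub"  # placeholder for an early Llama 3 checkpoint
print(len(build_long_context_qa("a very long document " * 2000, fake_llm)))
```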

Tool Usage

Meta trained Llama 3 to interact with search engines, Python interpreters, and mathematical calculation engines.

During development, Meta gradually made the manual annotation protocol more demanding as Llama 3 improved: starting with annotations of single-turn tool use, then tool use within conversations, and finally multi-step tool use and data analysis.


Llama 3 performs multi-step planning, reasoning, and tool invocation to solve tasks


Given uploaded files, the model can be asked to summarize their contents, find and fix bugs, optimize the code, and perform data analysis or visualization

Factuality

For the admittedly challenging hallucination problem of LLMs, Meta took a hallucination-first approach.

The principle they follow is that, after training, the model should "know what it knows" rather than have knowledge added.

Steerability

For Llama 3, Meta enhances steerability through system prompts with natural-language instructions, specifically around response length, format, tone, and persona.


“You are a helpful, cheerful AI chatbot that acts as a meal planning assistant for busy families”
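In practice, such a system prompt is passed alongside the user's turn. Below is a minimal sketch using the common chat-messages convention; the serving API that would consume it is omitted, and the extra length/format/tone instructions are illustrative additions of ours:

```python
# Minimal sketch of steering the model with the system prompt quoted above.
# The chat-messages structure follows the common convention; the extra
# length/format/tone instructions are illustrative additions, echoing the
# steerability knobs (length, format, tone, persona) described in the text.
messages = [
    {"role": "system", "content": (
        "You are a helpful, cheerful AI chatbot that acts as a meal planning "
        "assistant for busy families. Reply in under 100 words, as a bulleted "
        "list, in an upbeat tone."
    )},
    {"role": "user", "content": "Plan dinners for a busy school week."},
]
```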

The Team

The Llama 3 team is very large, with nearly 220 core members and another 312 contributors.




Zuckerberg: Open source AI is the future

As we all know, Zuckerberg has always been a loyal supporter of open source AI.

This time it is not just about releasing a new strongest model, but about a vow to make open source AI the best.


In his blog, Zuckerberg drew lessons straight from history: in the past, major technology companies invested huge sums in developing their own closed-source versions of Unix.

The Unix battlefield was fierce, but unexpectedly, it was open source Linux that had the last laugh.


Linux originally became popular with developers because it let them modify the code at will and was more affordable.

But over time, it became more advanced, more secure, and supported by a broader ecosystem with more features than any closed Unix.

Today, Linux has become the industry standard for cloud computing and most mobile device operating systems, and everyone benefits from it.

Zuckerberg believes that AI will develop along a similar trajectory, and he points the finger at the closed-source models of "a few tech companies."


“Today, a few tech companies are developing the leading closed model, but open source is rapidly closing the gap.”

Zuckerberg dares to name names because he has the results to back it up. Last year, Llama 2 still lagged behind even older generations of frontier models.

This year, Llama 3 is able to compete with other giant models in terms of performance.

As the first cutting-edge open source AI model, Llama 3.1 405B has a significantly better cost/performance ratio than closed models. The openness of the 405B model makes it the best choice for fine-tuning and distilling small models.

Why open source AI is good for developers

For developers, the open source model brings five major benefits:

First, open source models allow developers to freely train, fine-tune, and distill their own models.

Each developer's needs differ: on-device tasks and classification tasks call for small models, while more complex tasks need large ones.

With state-of-the-art open source models, developers can continue training on their own data and then distill the models down to the ideal size.

Second, it avoids lock-in to a single vendor.

Developers don't want to depend on a model they cannot run and control, nor do they want vendors to change the model, alter the terms of use, or even stop providing the service altogether.

Open source allows models to be easily switched and deployed, creating a broad ecosystem.

Third, protect data security.

When developers handle sensitive data, they need to keep it secure, which means not sending it to closed-source models through an API.

It is well known that open source software is generally more secure due to a more transparent development process.

Fourth, it is efficient and cheaper to run.

Developers running Llama 3.1 405B can perform inference at roughly half the cost of GPT-4o, for both user-facing and offline inference tasks.

Fifth, in the long run, open source will become the standard for the entire industry.

In fact, open source is evolving faster than closed source models, and developers want to build their systems on the architecture that holds the long-term advantage.

In Zuckerberg's view, the release of Llama 3.1 will be a turning point for the industry, making open source even more unstoppable.

References:

https://ai.meta.com/blog/meta-llama-3-1/

https://llama.meta.com/

https://www.facebook.com/4/posts/10115716861061241/?rdid=VE0wPWaJDdF21j32