
The open source model surpasses the strongest closed source model. Can Llama 3.1 subvert the AI ecosystem? | Jiazi Guangnian

2024-07-24



Zuckerberg vowed to continue open source.

Author: Sukhoi

Editor: Zhao Jian

Llama 3.1 is finally here.

On July 23rd, local time in the United States, Meta officially released Llama 3.1. It comes in three sizes: 8B, 70B, and 405B, with the maximum context window increased to 128K tokens. Llama is currently one of the most widely used and best performing large model series in the open source field.

The highlights of Llama 3.1 are:

1. There are three versions: 8B, 70B, and 405B. The 405B version is one of the largest open source models to date.
2. The 405B model has 405 billion parameters, surpassing existing top AI models in performance.
3. The model introduces a longer context window (up to 128K tokens), allowing it to handle more complex tasks and conversations.
4. It supports multilingual input and output, enhancing the model's versatility and applicability.
5. It improves reasoning capabilities, especially for solving complex mathematical problems and generating content in real time.

Meta wrote in an official blog: "To this day, it is still the norm that the performance of open source large language models lags behind closed source models. But now, we are ushering in a new era led by open source. We are publicly releasing Meta Llama 3.1 405B, the world's largest and most powerful open source base model. To date, the cumulative number of downloads of all Llama versions has exceeded 300 million, and this is just the beginning."

The debate between open source and closed source has always been a hot topic in the technology field.

Open source software is more transparent and flexible, allowing developers around the world to jointly review, modify and improve the code, thus promoting rapid innovation and progress in technology. The closed source model is usually developed and maintained by a single company or organization, which can provide professional support and services to ensure the security and stability of the software. However, this model also limits the user's control and customization capabilities.

Previously, closed source models had always held a slight edge. With the release of Llama 3.1, the ongoing, fierce open source vs. closed source debate reached a significant breakthrough: an open source model can finally compete head-on with closed source models.

According to the benchmark data provided by Meta, the most popular 405B version is comparable to GPT-4 and Claude 3.5 in performance. On HumanEval, which mainly evaluates a model's ability to understand and generate code and to solve abstract logic problems, Llama 3.1 405B comes out slightly ahead of other large models.
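
For context on what this benchmark measures: a HumanEval-style task hands the model a function signature and docstring, and the completion is scored by running unit tests against it. The sketch below is modeled on the benchmark's first problem; the inline checks are illustrative, not the benchmark's hidden test suite.

```python
# Illustrative HumanEval-style task: the model receives the signature and
# docstring and must produce the body; correctness is judged by unit tests.
def has_close_elements(numbers: list[float], threshold: float) -> bool:
    """Check if any two numbers in the list are closer than `threshold`."""
    # A completion the tests would accept:
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False

# Example checks (illustrative, not the benchmark's hidden tests):
assert has_close_elements([1.0, 2.8, 3.0], 0.3) is True
assert has_close_elements([1.0, 2.0, 3.0], 0.5) is False
```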


Llama 3.1 is on par with GPT-4 and Claude 3.5. Source: Meta

Andrew Ng, adjunct professor of computer science at Stanford University and former director of the Stanford Artificial Intelligence Laboratory, praised the "great contribution of Meta and the Llama team to open source" on social media. He said: "Llama 3.1 increases context length and improves functionality, which is a wonderful gift for everyone." He also hoped that "stupid regulations like California's proposed SB1047 will not prevent such innovation."


Andrew Ng’s social media, source: X

Turing Award winner and Meta chief AI scientist Yann LeCun quoted The Verge's description of Llama 3.1, Meta's largest and best open source AI model released to date: Llama 3.1 surpasses OpenAI's models and other competitors on certain benchmarks.


Yann LeCun's social media, source: X

Interestingly, a day before the official launch, the 405B version of Llama 3.1 was suspected of having been "leaked" on Hugging Face and GitHub, and the evaluation data posted by the leaker is basically consistent with the version information officially released today.

Meta's founder and CEO Mark Zuckerberg wrote a long article titled "Open Source AI Is the Path Forward", detailing why open source is important to developers, Meta, and the world.

He predicts that by the end of this year, Meta AI will surpass ChatGPT to become the most widely used assistant.

He also vowed to carry open source through to the end.


Excerpt from "Open Source AI Is the Path Forward", source: Meta

1. The Making of Llama 3.1

In terms of model architecture, Llama 3.1, Meta's largest model to date, was trained on more than 15 trillion tokens, with a pre-training data cutoff of December 2023.

To complete training at such a large scale in a reasonable time and achieve the desired results, Meta optimized the entire training stack, using more than 16,000 H100 GPUs. The 405B is the first Llama model trained at this scale.


Transformer model architecture in the text generation process of Llama 3.1, source: Meta

To maximize training stability and simplicity, Meta chose a standard decoder-only Transformer architecture with minor adaptations, rather than the currently popular mixture-of-experts (MoE) architecture.

This decision allows Llama 3.1 to guarantee high-quality output on short texts while supporting context lengths of up to 128K tokens, handling both long and short texts flexibly instead of focusing only on long ones.
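
For readers unfamiliar with the term, here is a minimal PyTorch sketch of a decoder-only Transformer block. The dimensions and layer choices are illustrative defaults, not Llama 3.1's actual configuration (which adds grouped-query attention, rotary position embeddings, RMSNorm, and other adaptations).

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One decoder-only Transformer block: masked self-attention + MLP."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: each position may attend only to itself and earlier tokens.
        n = x.size(1)
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=x.device), 1)
        attn_out, _ = self.attn(x, x, x, attn_mask=mask, need_weights=False)
        x = self.norm1(x + attn_out)    # residual connection + normalization
        return self.norm2(x + self.ff(x))

# Toy usage: a batch of 2 sequences, 16 tokens each, embedding width 512.
h = DecoderBlock()(torch.randn(2, 16, 512))
print(h.shape)  # torch.Size([2, 16, 512])
```

An MoE variant would replace the single feed-forward network with several expert networks plus a router; the dense block Meta describes activates all parameters for every token but keeps training more stable.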

At the same time, the research team adopted an iterative post-training procedure, with each round of supervised fine-tuning and direct preference optimization generating high-quality synthetic data and improving the model's capabilities. Compared with previous versions, Llama 3.1 increases both the quantity and quality of pre-training and post-training data, introducing more careful pre-processing and curation pipelines along with more stringent quality assurance and filtering techniques.

In line with the scaling laws of language models, Llama 3.1 outperforms previous, smaller models trained with the same procedure.

To meet the runtime requirements of the massive 405B model, Meta quantized the model weights from 16-bit (BF16) down to 8-bit (FP8) precision, which greatly reduces the computing resources required and allows the model to run within a single server node.
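
The memory arithmetic is straightforward: halving the bytes per weight shrinks a 405-billion-parameter model from roughly 810 GB to roughly 405 GB of weights. The sketch below conveys the general idea of scaled 8-bit quantization; it uses int8 as a stand-in, since FP8 is a different 8-bit format and Meta's actual inference kernels are not sketched here.

```python
import torch

def quantize_8bit(w: torch.Tensor):
    """Per-tensor symmetric quantization of BF16 weights to 8 bits."""
    scale = w.abs().max().float() / 127.0          # map the max magnitude to 127
    q = torch.round(w.float() / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale                       # approximate original weights

w = torch.randn(4096, 4096, dtype=torch.bfloat16)  # one weight matrix
q, scale = quantize_8bit(w)
print(w.element_size(), "->", q.element_size(), "byte(s) per weight")  # 2 -> 1
print("max error:", (dequantize(q, scale) - w.float()).abs().max().item())
```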

For instruction and chat fine-tuning of the Llama 3.1 405B model, the development team focused on improving the model's responsiveness to user instructions, its practicality, and its output quality, while maintaining a high level of safety.

In the post-training phase, the team performed several rounds of alignment on top of the pre-trained model. Each round included supervised fine-tuning (SFT), rejection sampling (RS), and direct preference optimization (DPO). Additionally, the team used synthetic data generation to produce the vast majority of SFT examples, meaning they did not rely solely on real-world data but also trained the model on algorithmically generated data.
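
Of the three steps, DPO is the least self-explanatory, so a minimal sketch of its loss follows. The log-probability inputs are placeholders standing in for real model outputs: summed token log-probabilities of a chosen and a rejected response under the trainable policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO: push the policy to prefer the chosen response over the rejected one."""
    # Implicit rewards: how much more (log-)likely the policy makes each
    # response, relative to the frozen reference model.
    chosen_reward = beta * (pi_chosen - ref_chosen)
    rejected_reward = beta * (pi_rejected - ref_rejected)
    # Logistic loss on the reward margin; minimizing it widens the margin.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy usage with made-up log-probabilities:
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss.item())
```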

At the same time, the team also uses a variety of data processing methods to filter this data to ensure the highest quality and expand the scope of application of fine-tuning data.

Meta is also exploring a new strategy: using the 405B model as a "teacher model" for the 70B and 8B models, distilling small, customized models suited to the needs of various industries out of the large one. This approach coincides with the strategy behind GPT-4o mini, that is, "first make it big, then make it small."

Andrej Karpathy, a founding member of OpenAI, once commented on GPT-4o mini: "Models must become larger before they can become smaller. Because we need them to (automatically) help reconstruct the training data into an ideal, synthetic format." He pointed out that this method can effectively transfer the deep and broad knowledge of large models to more practical, lower-cost small models.
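
One classical form of "big teaches small" is logit distillation, sketched below with both models as plain callables returning next-token logits. Note this is an assumption-laden illustration of the transfer idea; the approach the article describes works chiefly through teacher-generated synthetic data rather than logit matching.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, tokens, optimizer, T: float = 2.0):
    """One training step where a small student mimics a large teacher."""
    with torch.no_grad():
        teacher_logits = teacher(tokens)          # the big "teacher" (e.g. 405B)
    student_logits = student(tokens)              # the small "student" (e.g. 8B)
    # KL divergence between temperature-softened output distributions;
    # the T*T factor keeps gradient magnitudes comparable across temperatures.
    loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```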

As a leader in the open source model route, Meta has also shown great sincerity in the supporting facilities of the Llama model.

The Llama system is designed as a comprehensive framework that can integrate multiple components, including calling external tools. Meta's goal is to provide a broader system that allows developers to flexibly design and create customized products that meet their needs.

To advance AI responsibly beyond the model layer, the research team released a complete reference system with multiple sample applications and new components, such as the multilingual security model Llama Guard 3 and the prompt injection filter Prompt Guard. These applications are open source and available for further development by the community.

To better define component interfaces and promote their standardization across the industry, the researchers worked with industry, startups, and the broader community to publish a "Llama Stack" proposal on GitHub: a set of standardized interfaces that simplify building toolchain components (such as fine-tuning and synthetic data generation) and agent applications.

According to the benchmark data provided by Meta, Llama 3.1 405B scored 98.1 on the NIH/Multi-needle benchmark, comparable to GPT-4 and Claude 3.5 in performance. The 405B version also scored 95.2 on the ZeroSCROLLS/QuALITY benchmark, demonstrating an excellent ability to integrate massive amounts of text, which makes it very attractive to AI application developers concerned with RAG performance.


Comparison of Llama 3.1 with closed source models such as GPT-4, source: Meta


Llama 3.1 compared with open source models such as Mistral 7B Instruct, source: Meta

The Llama 3.1 8B version significantly outperforms Gemma 2 9B IT and Mistral 7B Instruct, and marks a clear improvement over the previous generation Llama 3 8B. Meanwhile, the Llama 3.1 70B version even surpasses GPT-3.5 Turbo.

According to the Llama team's official report, they conducted in-depth performance evaluations and extensive manual testing of these models on more than 150 multilingual benchmark datasets. The results show that Llama's top model is comparable across a variety of tasks to leading foundation models on the market, such as GPT-4, GPT-4o, and Claude 3.5 Sonnet. At the same time, compared with closed and open source models of similar parameter scale, Llama's smaller versions also show strong competitiveness.


2. The debate between open source and closed source models

Can the open source model surpass the closed source model?

This issue has been controversial since last year. The development paths of the two models represent different technical philosophies, and they have their own advantages in promoting technological progress and meeting business needs.

For example, Llama 3.1 is an open source large model that allows researchers and developers to access its source code, allowing people to freely study, modify, and even improve the model. This openness encourages extensive collaboration and innovation, allowing developers from different backgrounds to work together to solve problems.

In contrast, ChatGPT is a closed source model developed by OpenAI. Although it provides API access, its core algorithms and training data are not fully disclosed. This closed source nature makes it more robust on the path to commercialization, and its controllability ensures product stability and security, making it more trusted by companies handling sensitive information. However, this closedness also limits outside researchers' ability to fully understand and innovate on the model.

In May last year, foreign media reported on a leaked internal Google document titled "We have no moat, and neither does OpenAI", which argued that while the two were still arguing, open source had quietly taken their jobs. In the same year, after Meta released the open source large model Llama 2, Yann LeCun said that Llama 2 would change the market landscape of large language models.

The open source community led by the Llama series of models is highly anticipated. Previously, the most advanced closed-source model GPT-4 was always slightly better, although the gap between it and Llama 3 at the time was already very small.

The most authoritative leaderboard in the large model field is LLM Arena, which uses the Elo rating system long used in chess. Its basic rule is to let users pose any question to two anonymous models (such as ChatGPT, Claude, or Llama) and vote for the better answer. The winning model gains points, and the final ranking is determined by accumulated points. The Arena's Elo ratings are built from the votes of some 500,000 users.
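
For reference, the Elo update after a single vote looks like the sketch below; the K-factor of 32 is a common chess default, an assumption rather than LLM Arena's published setting.

```python
def elo_update(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    """Update two ratings after one pairwise vote (draws ignored for brevity)."""
    # Expected score of A given the current rating gap.
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    delta = k * (score_a - expected_a)     # surprising results move more points
    return rating_a + delta, rating_b - delta

# Toy usage: the higher-rated model loses an upset and sheds points.
a, b = elo_update(1250.0, 1200.0, a_wins=False)
print(round(a), round(b))  # ~1232, ~1218
```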


List of large model rankings, source: LLM Arena

On the LLM Arena leaderboard, OpenAI's GPT-4o currently tops the list, and all of the top ten models are closed source. Although closed source models still lead the rankings, the gap between open source and closed source models is not widening, as Robin Li claimed at the 2024 Baidu AI Developer Conference, but is in fact gradually narrowing.


During WAIC, Robin Li said: "Open source is actually a kind of IQ tax." Source: Baidu

With today's release of Llama 3.1, open source models can finally compete with closed source ones.

As to whether open source or closed source models are better, Jiazi Guangnian has discussed the question with many AI industry practitioners. The industry generally believes that it often depends on one's standpoint and is not a simple black-and-white issue.

The issue of open source and closed source is not a purely technical difference, but more about the choice of business model. Currently, neither the open source nor the closed source model has found a completely successful business model.

So what factors influence the capability differences between open source and closed source models?

Zhang Junlin, the person in charge of new technology research and development at Weibo, pointed out that the growth rate of model capabilities is a key factor. If the model capabilities grow very fast, it means that a large amount of computing resources will be needed in a short period of time. In this case, closed-source models have more advantages because of their resource advantages. On the contrary, if the model capabilities grow slowly, the gap between open source and closed source will be reduced, and the catch-up speed will also be faster.

He believes that in the next few years, the difference in the capabilities of open source and closed source models will depend on the development of "synthetic data" technology. If "synthetic data" technology makes significant progress in the next two years, the gap between the two may widen; if there is no breakthrough, the capabilities of the two will become similar.

Overall, "synthetic data" will become a key technology for the development of large language models in the future.

Whether a model is open source or closed source does not determine its performance. A closed source model is not ahead because it is closed source, and an open source model is not behind because it is open source. Rather, a model is closed source because it is ahead, and goes open source because it is not far enough ahead.

If a company creates a high-performance model, it may no longer be open source.

Take the French star startup Mistral: its Mistral-7B, once the strongest open source 7B model, and Mixtral 8x7B, the first open source MoE model (MMLU 70), are among the most popular models in the open source community. However, the subsequently trained Mistral Medium (MMLU 75) and Mistral Large (MMLU 81) are both closed source.

Currently, both the best performing closed source models and the best performing open source models come from large companies, and among the large companies, Meta has shown the greatest determination to open source. If OpenAI keeps its models closed for commercial returns, then what is Meta's purpose in open sourcing its models and letting users try them for free?

At last quarter's earnings conference, Zuckerberg responded to this matter by saying that Meta open-sourced its AI technology in order to promote technological innovation, improve model quality, establish industry standards, attract talent, increase transparency and support long-term strategy.

This time, Zuckerberg explained in detail in "Open Source AI Is the Path Forward" why open source AI is good for developers:

In conversations with developers, CEOs, and government officials from around the world, I often hear them emphasize the need to train, fine-tune, and optimize their own models.

Each organization has unique needs, and models of different sizes can be optimized for those needs and trained or fine-tuned using specific data. Simple on-device tasks and classification tasks may require smaller models, while more complex tasks may require larger models.

Now you can use state-of-the-art Llama models and continue to train them on your own data, optimizing them to their ideal size — without us or anyone else ever touching your data.

We need to control our own destiny, not be beholden to a closed source vendor.

Many organizations don't want to rely on models that they can't run and control themselves. They worry that the provider of a closed-source model might change the model, terms of use, or even stop the service altogether. They also don't want to be locked into a single cloud platform that has exclusive rights to a certain model. Open source provides a compatible toolchain for many companies, making it easy to switch between different systems.

We need to protect our data.

Many organizations handle sensitive data and need to protect it from being sent to closed source models via cloud APIs. Others simply don't trust closed source model providers with their data. Open source solves these problems by letting you run models wherever you want, and it is widely considered more secure thanks to the transparency of its development process.

We need a way to operate efficiently and economically.

Developers can run the Llama 3.1 405B model for inference on their own infrastructure at about half the cost of using closed-source models such as GPT-4o, for both user-facing and offline inference tasks.

We are betting on an ecosystem that has the potential to become a long-term standard.

Many people see that the open source model is developing faster than the closed source model, and they hope to build the system architecture that will bring the greatest long-term advantages.

(Cover image from Meta X account)