
Llama 3.1 leaked ahead of schedule: will it dethrone GPT-4o? Faster and 10 times cheaper

2024-07-24


Text|Chang Minxiao and Yuan Yingliang

Editor: Deng Yongyi

If becoming the ceiling of open-source models is the destiny of Meta's Llama models, then "being leaked" is the trial Llama seems fated to endure along the way.

In March 2023, the weights of the first-generation LLaMA were leaked ahead of schedule, and Meta ended up having to release the model openly.

Today, history repeats itself.

On July 12, Pacific Time, a Meta employee revealed that Meta planned to release its largest Llama model to date, Llama 3.1 405B, on July 23, 2024 local time, and said that the 405B model would be the first multimodal model in the Llama series.

However, on July 22, Pacific Time, one day before the scheduled release, Llama 3.1's model weights and benchmark results were leaked on technical communities such as Reddit, and a magnet link for Llama 3.1 (a link used to download the files via BitTorrent) circulated in communities such as HuggingFace.

Judging from the leaked results, Llama 3.1's performance is comparable to OpenAI's GPT-4o.

One AI blogger went so far as to say that the release of Llama 3.1 would be another day that changes the fate of the AI industry:


△Source: X

The leaked benchmark results show that Llama 3.1 comes in three sizes: 8B, 70B, and 405B. Even the mid-sized 70B model performs comparably to GPT-4o in many respects.


△ Comparison of the Llama 3.1 versions against OpenAI's GPT-4o and Llama 3 8B/70B. The mid-sized 70B version also surpasses GPT-4o in several respects. Image source: X user @mattshumer_

Some netizens pointed out that, based on this benchmark, Llama 3.1 405B ≈ GPT-4o, and Llama 3.1 70B would become the first lightweight model able to beat OpenAI's GPT-4o mini.


△Image source: X user @corbtt

However, many netizens who downloaded the model to try it out found that the leaked Llama 3.1 405B weighs in at about 820 GB across all its files, requiring nearly three times the memory of a full-precision Llama 2 (about 280 GB).

This means that unless you have a mining rig's worth of GPUs at home, individual developers will find it difficult to run Llama 3.1 on their own computers. Some netizens speculate that Llama 3.1 is aimed not at individuals but at institutions and enterprises.

The not-yet-officially-announced Llama 3.1 has also had cold water poured on it: many netizens complained that its GPU requirements are too high and that it is not as attractive as GPT-4o mini from OpenAI next door.


△ Comments from netizens on X. Image source: X user @_Talesh


Feature upgrades, improved metrics, and reduced computing requirements

According to the leaked model information, Llama 3.1 brings further feature iterations over Llama 3, which was released on April 19, 2024, including a longer context window, multilingual input and output, and possible integration with third-party developer tools.

Data and training: Llama 3.1 was trained on more than 15T tokens from public sources, and its fine-tuning data includes publicly available instruction-tuning datasets (unlike Llama 3!) plus over 25 million synthetically generated examples.

Multilingual dialogue: Llama 3.1 supports eight languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. Chinese is unfortunately absent, but developers can fine-tune the Llama 3.1 models for languages beyond the eight supported ones.

Context window: the context length of every version has been expanded from 8k to 128k tokens, which means the model can remember, understand, and process roughly 96,000 words at a time, almost an entire original Harry Potter novel (a rough conversion is sketched below).
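As a quick aside, the 96,000-word figure follows from the common rule of thumb that one token maps to roughly 0.75 English words. The short sketch below reproduces that back-of-the-envelope conversion; the 0.75 ratio is an assumption, not an official figure.

```python
# Back-of-the-envelope: context length in tokens -> approximate English word count.
# The 0.75 words-per-token ratio is a rough rule of thumb, not an official figure.
WORDS_PER_TOKEN = 0.75

def tokens_to_words(num_tokens: int) -> int:
    return round(num_tokens * WORDS_PER_TOKEN)

print(tokens_to_words(8_000))    # old 8k window:   ~6,000 words
print(tokens_to_words(128_000))  # new 128k window: ~96,000 words
```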

Many netizens were eager to pit Llama 3.1 against its predecessors, and found not only that the metrics improved markedly, but also that a lot of computing resources were saved.

Judging from netizens' tests, Llama 3.1's capabilities have improved significantly over Llama 3. In particular, human_eval and truthfulqa_mc1 show notable progress, which suggests stronger code-generation ability and more truthful answers to questions.

At the same time, compared with the base model, Llama 3's instruct model shows clear improvements in prompt learning, in-context learning, and parameter-efficient fine-tuning.

This makes sense: a base model is usually not fine-tuned for any specific task, whereas an instruct model is trained specifically to follow instructions and complete particular tasks, so the instruct model's metrics are usually better.

This raises expectations for the official release of Llama 3.1: the leaked test results cover only the base model, and the instruct model may perform even better!


△Image source: X user @thenameless7741


△Image source: X user @thenameless7741

Surprisingly, in the leaked benchmark results the Llama 3.1 70B model ties or even beats GPT-4o, and the Llama 3.1 8B model comes close to the Llama 3 70B model in performance. Some netizens speculated that this may be thanks to model distillation, that is, the 8B and 70B models were distilled from the largest 405B model, making the big model "smaller".

Model distillation can be pictured as a student learning from a teacher. The large, powerful model acts as the teacher, and the smaller, simpler model acts as the student: the student learns by imitating the teacher, pushing its outputs as close as possible to the teacher's outputs and thereby acquiring similar knowledge and capabilities.

The student model trained by distillation can reduce the model size and computing resource requirements while maintaining high performance and comparable accuracy.
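To make the idea concrete, here is a minimal sketch of logit-based knowledge distillation in PyTorch. It illustrates the general technique only, not Meta's confirmed recipe; the temperature value and the toy vocabulary size are assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Standard logit distillation: the student matches the teacher's softened distribution."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # T^2 scaling keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2

# Toy usage: a batch of 4 next-token predictions over a 32,000-entry vocabulary.
teacher_logits = torch.randn(4, 32000)                      # from the frozen "teacher" (e.g. a 405B model)
student_logits = torch.randn(4, 32000, requires_grad=True)  # from the "student" (e.g. an 8B model)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()  # in real training, these gradients would update the student's parameters
```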


△Image source: Reddit


Not everyone can run it, but the price is reasonable

Whether Llama 3.1 will be open-sourced as expected is still unknown. But even if it is, affording Llama 3.1 still practically requires having a mine at home.

If you want to run Llama 3.1, the most basic ticket of admission is enough GPUs.

The leaked documentation shows that training Llama 3.1 405B took 30.84M GPU hours on H100-80GB hardware. In other words, running on a single H100-80GB, it would take 30.84 million hours, roughly 3,500 years, just to train the model!


△Image source: Reddit

As for private deployment: if an enterprise wanted to compress that 30.84M GPU-hour workload into a single month, it would need to reserve at least 43,000 H100-80GB cards. At a unit price of about US$40,000 per H100, the compute entry ticket for Llama 3.1 405B comes to roughly US$1.7 billion, about 12.5 billion yuan (the arithmetic is checked in the sketch below).
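A short sketch to double-check the figures above; the 30-day month, the US$40,000 unit price, and the ~7.3 CNY/USD exchange rate are assumptions used only for this estimate.

```python
# Sanity-check the GPU-hour arithmetic from the leaked 30.84M H100-80GB hours.
GPU_HOURS = 30.84e6
HOURS_PER_YEAR = 24 * 365
HOURS_PER_MONTH = 24 * 30           # assume a 30-day month
H100_PRICE_USD = 40_000             # assumed unit price per H100-80GB
CNY_PER_USD = 7.3                   # assumed exchange rate

years_on_one_gpu = GPU_HOURS / HOURS_PER_YEAR         # ~3,520 years on a single card
gpus_for_one_month = GPU_HOURS / HOURS_PER_MONTH      # ~42,833 cards, i.e. roughly 43,000
fleet_cost_usd = gpus_for_one_month * H100_PRICE_USD  # ~US$1.71 billion
fleet_cost_cny = fleet_cost_usd * CNY_PER_USD         # ~12.5 billion yuan

print(f"{years_on_one_gpu:,.0f} years on a single H100")
print(f"{gpus_for_one_month:,.0f} H100s to finish within one month")
print(f"US${fleet_cost_usd / 1e9:.2f}B ≈ ¥{fleet_cost_cny / 1e9:.1f}B")
```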

The good news, though, is that Llama 3.1's inference cost may be lower.

According to Artificial Analysis, in cost per million tokens of throughput, Llama 3.1 405B should come in cheaper than frontier models of similar quality (GPT-4o and Claude 3.5 Sonnet), making it the more cost-effective option.


△Image source: X user @ArtificialAnlys

In addition, some netizens speculated from the source code that Llama 3.1 405B may become a paid membership product. We will have to wait for the official release to know for sure.


△Image source: X user @testingcatalog

(36Kr author Zhou Xinyu also contributed to this article)
