
Llama 3.1 405B, the strongest open-source model, is officially released. Zuckerberg: open source will lead the new era

2024-07-24




Synced (Machine Heart) Editorial Department

Just now, the long-awaited Llama 3.1 was officially released!

Meta officially announced that "open source will lead the new era".



In the official blog, Meta said: "Until today, open-source large language models have mostly lagged behind their closed counterparts in functionality and performance. Now, we are ushering in a new era led by open source. We are publicly releasing Meta Llama 3.1 405B, which we believe is the world's largest and most capable open-source foundation model. To date, total downloads across all Llama versions have exceeded 300 million, and we are just getting started."

Meta founder and CEO Mark Zuckerberg also wrote a long article titled "Open Source AI Is the Path Forward", explaining why open source is good for all developers, Meta, and the world.



Highlights of this release include:

  • The latest series of models extends the context length to 128K, adds support for eight languages, and includes the top open source model Llama 3.1 405B;
  • Llama 3.1 405B is in a class of its own, with Meta officially claiming it is comparable to the best closed-source models;
  • This release also provides more components that work with the model, including a reference system, helping developers build Llama into a complete system;
  • Users can experience Llama 3.1 405B through WhatsApp and meta.ai.



Address: https://llama.meta.com/

Anyone can download it and try it out.

Introducing Llama 3.1

Llama 3.1 405B is the first openly available model that performs on par with top AI models in general knowledge, steerability, mathematics, tool use, and multilingual translation.

Meta says the latest generation of Llama will inspire new applications and modeling paradigms, including leveraging synthetic data generation to improve and train smaller models, and model distillation, a capability never before achieved in open source.

At the same time, Meta also launched upgraded versions of the 8B and 70B models, with multilingual support, a context length of up to 128K, and stronger reasoning capabilities. The latest models support advanced use cases such as long-text summarization, multilingual conversational agents, and coding assistants.

For example, Llama 3.1 can translate stories into Spanish:



When a user asks a question like "I have 3 shirts, 5 pairs of shorts and 1 dress. If I plan to travel for 10 days, are the clothes I prepared enough?" the model can quickly reason through the answer.



Long context: Llama 3.1 can analyze and summarize uploaded documents of up to 8k tokens.



As a coding assistant, it can quickly write code to meet user requirements:



In addition, a Llama 3.1 405B developer posted a "spoiler" on Twitter, indicating that a model integrating speech and vision capabilities, similar to GPT-4o, is still in development.



Meta has also updated its open-source license to allow developers to use the outputs of Llama models (including the 405B) to improve other models. In addition, to fulfill its open-source commitment, Meta is making these models available to the community starting today; users can download them from llama.meta.com and Hugging Face.

Download links:

  • https://huggingface.co/meta-llama
  • https://llama.meta.com/
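For those starting with the smaller checkpoints, here is a minimal sketch of running inference through the Hugging Face transformers library. The repository id below is an assumption based on Meta's naming convention, and access requires accepting the license on Hugging Face first:

```python
# Minimal sketch: chatting with a Llama 3.1 checkpoint via Hugging Face
# transformers. The repo id is an assumption; access is gated behind
# accepting Meta's license on Hugging Face. Requires `accelerate` for
# device_map="auto".
import torch
from transformers import pipeline

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # assumed repo id

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,  # BF16 matches the released weights
    device_map="auto",
)

messages = [{"role": "user", "content": "Summarize the Llama 3.1 release in one sentence."}]
out = pipe(messages, max_new_tokens=128)
# The pipeline returns the full chat; the last message is the assistant reply.
print(out[0]["generated_text"][-1]["content"])
```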

Model Evaluation

Meta evaluated the models on over 150 benchmark datasets and, in addition, conducted extensive human evaluations.

Experimental results show that the flagship model Llama 3.1 405B is competitive with leading foundation models, including GPT-4, GPT-4o, and Claude 3.5 Sonnet, across a range of tasks. The 8B and 70B models are also competitive with closed- and open-source models that have a similar number of parameters.


Model Architecture

Llama 3.1 405B is Meta's largest model to date, and training it on over 15 trillion tokens was a significant challenge. To achieve training runs at this scale, Meta optimized the entire training stack and used over 16,000 H100 GPUs, making it the first Llama model trained at this scale.



To address this, Meta made the following design choices, focusing on keeping the model development process scalable and simple.

  • A standard decoder-only Transformer architecture with only minor tweaks was chosen over a mixture-of-experts model to maximize training stability.
  • An iterative post-training procedure is employed, with supervised fine-tuning and direct preference optimization in each round. This enabled Meta to create the highest-quality synthetic data for each round and improve performance across each capability.

Compared with previous versions of Llama, Meta improved both the quantity and quality of the data used for pre-training and post-training, for example by developing more careful preprocessing and curation pipelines for pre-training data and more rigorous quality-assurance and filtering methods for post-training data.

As expected from language-model scaling laws, Meta's new flagship model outperforms smaller models trained with the same procedure. Meta also used the 405B-parameter model to improve the post-training quality of its smaller models.

To support large-scale inference with the 405B model, Meta quantized it from 16-bit (BF16) to 8-bit (FP8) precision, effectively reducing compute requirements and allowing the model to run on a single server node.
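The idea behind 8-bit quantization can be illustrated with a toy per-tensor scheme. The sketch below uses int8 absmax scaling as a simplified stand-in for FP8 (the real FP8 format and Meta's recipe differ); it shows how halving the bit width halves weight memory while keeping values close to the original:

```python
# Toy illustration of 8-bit per-tensor quantization. This is NOT Meta's
# FP8 recipe; int8 absmax scaling is used as a simplified stand-in to
# show the memory/accuracy trade-off of going from 16 to 8 bits.
import numpy as np

def quantize_absmax(w: np.ndarray):
    """Scale weights so the largest magnitude maps to 127, round to int8."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_absmax(w)
w_hat = dequantize(q, scale)

print(f"fp32: {w.nbytes / 2**20:.0f} MiB -> int8: {q.nbytes / 2**20:.0f} MiB "
      f"(BF16 would be {w.nbytes / 2**21:.0f} MiB)")
print(f"mean abs error after round-trip: {np.abs(w - w_hat).mean():.5f}")
```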

Instruction and chat fine-tuning

With Llama 3.1 405B, Meta strove to improve the helpfulness, quality, and detailed instruction-following of the model's responses to user instructions, while ensuring a high level of safety.

In the post-training stage, the research team built the final chat model by performing several rounds of alignment on the basis of the pre-trained model. Each round involved supervised fine-tuning (SFT), rejection sampling (RS), and direct preference optimization (DPO).
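For reference, the core of DPO fits in a few lines. The sketch below is a generic DPO loss, not Meta's training code; it assumes the summed log-probabilities of each chosen/rejected response under the policy being trained and under a frozen reference model have already been computed:

```python
# Generic DPO loss sketch (not Meta's training code). Inputs are summed
# log-probs of the chosen/rejected responses under the trained policy
# and under a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    # How much more the policy prefers each answer than the reference does.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Logistic loss maximizing the margin between chosen and rejected.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy usage with made-up log-probabilities for a batch of two pairs:
loss = dpo_loss(torch.tensor([-10.0, -12.0]), torch.tensor([-15.0, -14.0]),
                torch.tensor([-11.0, -12.5]), torch.tensor([-14.0, -13.5]))
print(loss)
```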

The research team used synthetic data generation to produce the vast majority of SFT examples and iterated multiple times to generate increasingly high-quality synthetic data across all capabilities. They also adopted a variety of data-processing techniques to filter this synthetic data down to the highest quality and to scale the amount of fine-tuning data across capabilities.

Llama System

The Llama model has always existed as part of an AI system that coordinates multiple components, including calling external tools. Meta aims to go beyond the base model, giving developers the flexibility to design and create custom products that fit their vision.

To enable responsible development of AI beyond the model layer, Meta has released a complete reference system that includes multiple sample applications as well as new components such as Llama Guard 3 (a multilingual safety model) and Prompt Guard (a prompt-injection filter). These sample applications are open source and can be built upon by the community.
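As an illustration of how such a safety model slots into a system, here is a hedged sketch of screening a user prompt with Llama Guard 3 via transformers. The repository id and chat-template behavior are assumptions drawn from Meta's model cards and should be verified against the official documentation:

```python
# Hedged sketch: screening a user prompt with Llama Guard 3 through
# Hugging Face transformers. The repo id and the template behavior are
# assumptions; check Meta's model card before relying on this.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-Guard-3-8B"  # assumed HF repo name
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

chat = [{"role": "user", "content": "How do I pick a lock?"}]
# The chat template is assumed to wrap the conversation in the safety
# taxonomy prompt that Llama Guard was trained on.
input_ids = tok.apply_chat_template(chat, return_tensors="pt").to(model.device)
out = model.generate(input_ids, max_new_tokens=32)
# Llama Guard is expected to reply "safe" or "unsafe" plus category codes.
print(tok.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=True))
```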

To collaborate more broadly with industry, startups, and the open-source community on defining the interfaces between components, Meta has published a request for comment (RFC) for "Llama Stack" on GitHub. Llama Stack is a set of standardized interfaces for building canonical toolchain components (fine-tuning, synthetic data generation) and agentic applications, making interoperability easier to achieve.

Unlike closed models, Llama model weights are available for download. Developers can fully customize the model for their needs and applications, train it on new datasets, and perform additional fine-tuning.

Developing with Llama 3.1 405B

For ordinary developers, deploying a model at the scale of 405B is undoubtedly a challenge, requiring substantial computing resources and expertise. In conversations with the developer community, Meta realized that generative AI development is about more than just prompting a model. It expects all developers to unlock the full potential of Llama 3.1 405B in areas such as:

  • Real-time and batch inference
  • Supervised fine-tuning
  • Testing and evaluating model performance in specific applications
  • Continuous pre-training
  • Retrieval-augmented generation (RAG; a minimal sketch follows this list)
  • Function calling
  • Synthetic data generation
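To make the RAG item concrete, here is a minimal, self-contained sketch: documents are embedded (with a deliberately crude bag-of-words vectorizer standing in for a real embedding model), the best match is retrieved by cosine similarity, and it is stuffed into a prompt for the model. Every name and format here is illustrative:

```python
# Minimal RAG sketch. The bag-of-words "embedding" is a deliberately
# crude stand-in for a real embedding model; the prompt format is
# illustrative. hash() is salted per process but consistent within one
# run, which is all this demo needs.
import numpy as np

DOCS = [
    "Llama 3.1 extends the context window to 128K tokens.",
    "Llama 3.1 405B was trained on over 16,000 H100 GPUs.",
    "Llama Guard 3 is a multilingual safety model.",
]

def embed(text: str) -> np.ndarray:
    """Crude bag-of-words vector over a fixed hash space (illustrative)."""
    v = np.zeros(512)
    for word in text.lower().split():
        v[hash(word) % 512] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

doc_vecs = np.stack([embed(d) for d in DOCS])

def retrieve(query: str) -> str:
    scores = doc_vecs @ embed(query)  # cosine similarity on unit vectors
    return DOCS[int(np.argmax(scores))]

question = "How many GPUs were used to train the 405B model?"
context = retrieve(question)
prompt = f"Answer using the context.\n\nContext: {context}\n\nQuestion: {question}"
print(prompt)  # this prompt would then be sent to a Llama 3.1 endpoint
```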

Starting today, all the advanced capabilities of Llama 3.1 405B are available for developers to use right away. Developers can also explore higher-level workflows, such as synthetic data generation based on model distillation. With this upgrade, Meta has also seamlessly integrated solutions from partners AWS, NVIDIA, and Databricks to enable more efficient retrieval-augmented generation (RAG). In addition, Groq has optimized low-latency inference for cloud deployments, and similar performance improvements have been made for local systems.
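A sketch of what distillation-style synthetic data generation might look like in practice: a large teacher model labels prompts, and the resulting pairs are written out as fine-tuning data for a smaller student. The generate() stub below is hypothetical and stands in for a call to a 405B endpoint; the JSONL format is a common but assumed convention:

```python
# Sketch of distillation-style synthetic data generation. `generate` is
# a hypothetical stub standing in for a call to a Llama 3.1 405B
# endpoint; the JSONL layout is a common but assumed SFT data format.
import json

def generate(prompt: str) -> str:
    """Hypothetical teacher call; replace with a real 405B API/endpoint."""
    return f"<teacher answer to: {prompt}>"

prompts = [
    "Explain FP8 quantization in two sentences.",
    "Write a haiku about open-source AI.",
]

with open("sft_data.jsonl", "w") as f:
    for p in prompts:
        record = {"prompt": p, "response": generate(p)}
        f.write(json.dumps(record) + "\n")
# The resulting file could then be used to fine-tune a smaller student model.
```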

Meta has also assembled a "toolkit" for Llama 3.1 405B spanning key projects such as vLLM, TensorRT, and PyTorch, so that everything from model development to deployment works out of the box.

Reference link: https://ai.meta.com/blog/meta-llama-3-1/