news

How important is post-training? An AI2 researcher explains the post-training secrets of cutting-edge models in detail

2024-08-19



New Intelligence Report

Editor: Qiao Yang

[New Intelligence Introduction] A growing body of research finds that post-training matters just as much to model performance as pre-training. Nathan Lambert, a machine learning researcher at Allen AI, recently published a technical blog post summarizing the post-training recipes used by the technology giants.

As LLM research and industry develop rapidly, it is not only pre-training compute and data that keep growing; the alignment and fine-tuning methods used in post-training are also constantly being updated.

Earlier models such as InstructGPT and WebGPT used standard RLHF methods whose data curation style and scale now look outdated.

In recent months, AI giants such as Meta, Google, and Nvidia have released open models accompanied by detailed papers or reports, and Apple has published a report on its Apple Intelligence foundation models.

From these disclosures, we can see some cutting-edge trends in post-training methods. Allen AI research scientist Nathan Lambert recently published an article on this topic.


Original address: https://www.interconnects.ai/p/frontier-model-post-training


Dr. Nathan Lambert graduated from UC Berkeley, led the RLHF team at HuggingFace, and is currently a machine learning researcher at Allen AI.

In his article, he pointed out that synthetic data, iterative training, human preference labels, and extensive filtering are the common characteristics of the post-training methods used by these models. Specifically, the new post-training recipe is based on the following assumptions:

- The quality of synthetic data can be higher than human data, especially for challenging tasks

- RLHF can scale much further than instruction fine-tuning

- Multiple rounds of training and generation are required to get the best model

- Data filtering is the most important part of training

These assumptions are largely intertwined, forming a training recipe that scales to large teams and is well suited to the technology giants. The rest of the article explains each of the four points above in detail.

New Standard Pipeline

If we take the ChatBot Arena score as a measure of a model's post-training performance, which largely reflects style and robustness, almost all major labs have achieved significant gains through iterative training.

We have yet to see the release of Gemini 2 or GPT-5, which may reset the current post-training paradigm and potentially unlock deeper control over our models.

But from the current perspective, the methods used by top laboratories are clearly converging, and this trend is much clearer than expected.

Human Preference Data

The initial RLHF pipelines focused on human data, which comes in two main forms: 1) human data for instruction fine-tuning on specialized tasks; and 2) human preference data about task completion.

This type of fine-tuning dataset is expensive and strictly guarded. As far as Lambert knows, the only public one is No Robots, which he released while on the HuggingFace team.


Repository: https://huggingface.co/datasets/HuggingFaceH4/no_robots

Human preference data is largely tied to improving a specific model, and even when the data is open, there is no guarantee that preferences collected for one model transfer to another.

Lambert and his team tried something similar at HuggingFace, but the effort foundered on small paid-data contracts.

Right now, the main remaining use of human data is preference data. Based on figures disclosed for Llama 2 and other rumors, Meta may have spent $10M-20M or even more on preference data, and that covers only the final released model, not the broader experiments and evaluations.

Nemotron replaces much of the human data with synthetic data, but its fine-tuning is comparatively weaker.

There is an immediate challenge, but also an opportunity, for the open community: to understand the extent of human intervention in such data and whether it can be replaced with approaches such as LLM-as-a-Judge or reward models.
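As a concrete illustration of the LLM-as-a-Judge direction, here is a minimal Python sketch. The prompt template, the `judge_model` callable, and the parsing logic are illustrative assumptions, not any lab's published pipeline:

```python
from typing import Callable

# Hypothetical judging template; real pipelines use far more careful rubrics
# and debiasing tricks (e.g. swapping the A/B order).
JUDGE_PROMPT = """You are a strict judge. Given a user prompt and two candidate answers,
reply with exactly "A" or "B" to indicate the better answer.

Prompt: {prompt}
Answer A: {answer_a}
Answer B: {answer_b}
Better answer:"""

def judge_preference(
    prompt: str,
    answer_a: str,
    answer_b: str,
    judge_model: Callable[[str], str],  # any text-in / text-out LLM call
) -> dict:
    """Use an LLM judge instead of a human annotator to produce one preference pair."""
    verdict = judge_model(
        JUDGE_PROMPT.format(prompt=prompt, answer_a=answer_a, answer_b=answer_b)
    ).strip().upper()
    chosen, rejected = (answer_a, answer_b) if verdict.startswith("A") else (answer_b, answer_a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```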

Scaling RLHF

Thomas Scialom, Llama 3's alignment lead, said on the Latent Space podcast:

RLHF is much more scalable. It costs less, is easier to operate, and generally results in better performance.


He also said that he would spend "100% of the alignment data budget on alignment data required for the RL phase, rather than spending more time on instructions."

Most open source alignment work focuses on scaling instruction fine-tuning (IFT, or SFT). IFT is easy to run, applies to a wide variety of tasks, and pairs well with synthetic data.

But it is clear that industry uses IFT only as a starting point for scaling RLHF: SFT data focuses mainly on specific areas that previous models did not cover, and RLHF is then scaled up on top of that.

RLHF is an iterative process; each round of generation lets the model keep improving. The Llama 2 and Nemotron papers detail multiple rounds of training, but we don't know whether there is an upper limit to this number.

Llama 3.1 was trained on preference data for 6 rounds, Llama 2 for 5 rounds, and Nemotron for 4 rounds, each preceded by multiple rounds of instruction fine-tuning.
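To make the round structure concrete, here is a minimal sketch of such an iterative loop, assuming hypothetical `collect_preferences` and `train_on_preferences` helpers; a real pipeline would add reward-model training, evaluation, and data filtering between rounds:

```python
from typing import Callable, List

def iterative_preference_training(
    model,
    prompts: List[str],
    collect_preferences: Callable,   # (model, prompts) -> list of {"prompt", "chosen", "rejected"}
    train_on_preferences: Callable,  # (model, data) -> updated model (e.g. via DPO or PPO)
    num_rounds: int = 5,
):
    """Run several rounds of generate -> label -> train, Llama/Nemotron style.

    Both helpers are hypothetical stand-ins, not a published API.
    """
    for round_idx in range(num_rounds):
        # 1. Generate fresh responses with the *current* checkpoint so the
        #    preference data tracks the model's own distribution.
        preference_batch = collect_preferences(model, prompts)
        # 2. Train on this round's batch and roll the checkpoint forward.
        model = train_on_preferences(model, preference_batch)
        print(f"finished preference round {round_idx + 1}/{num_rounds}")
    return model
```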

For human preference data, multiple iterations may be necessary mainly for feasibility reasons:

1. Data is transferred from annotation companies to laboratories in batches

2. Conducting multiple rounds of small-scale training reduces the risk in delivering the final product. Rather than waiting for all the data to arrive before starting training, it is better to let the model get on track gradually.

These kinds of practical factors may seem insignificant, but they often end up shaping industry norms.

The figure below, from the Llama 2 paper, records data from its 5 rounds of rejection sampling and PPO training.


Nemotron also performed 2 rounds of SFT fine-tuning and 4 rounds of alignment, where RPO is a DPO-style optimizer weighted by a reward model.
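As a rough illustration of what a reward-aware DPO variant can look like (this is a hedged sketch inspired by the description above, not necessarily Nvidia's exact RPO formulation), one option is to pull the policy's implicit reward gap toward the reward model's score gap:

```python
import torch

def reward_aware_pref_loss(policy_chosen_logps, policy_rejected_logps,
                           ref_chosen_logps, ref_rejected_logps,
                           rm_chosen_scores, rm_rejected_scores,
                           beta: float = 0.1, eta: float = 1.0) -> torch.Tensor:
    """Illustrative reward-aware preference loss (all inputs are 1-D tensors, one entry per pair).

    The DPO-style implicit reward gap is regressed toward the reward model's
    score gap, so near-tied pairs exert less pressure than clearly separated ones.
    """
    implicit_gap = beta * ((policy_chosen_logps - ref_chosen_logps)
                           - (policy_rejected_logps - ref_rejected_logps))
    rm_gap = eta * (rm_chosen_scores - rm_rejected_scores)
    # Plain DPO would instead minimize -logsigmoid(implicit_gap).
    return ((implicit_gap - rm_gap) ** 2).mean()
```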


A similar iterative RLHF approach traces back to Anthropic's "Constitutional AI", but the open source community does not seem to have replicated it at scale.


Currently, the academic community is focusing on "online DPO training", which moves in a similar direction but pays less attention to the data collected between rounds. The method still requires a lot of manual work, but once the process is automated, online DPO will be the future.

In fact, the choice of algorithm for the post-training stage should not be so rigid. DPO and PPO each have advantages and disadvantages: the former is easier to scale, but PPO-inspired methods (such as online RL) have a higher performance ceiling.
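For reference, the core DPO objective fits in a few lines of PyTorch. The loss below is the standard published formulation, taking per-sequence log-probabilities from the policy and a frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1) -> torch.Tensor:
    """Standard DPO: increase the implicit reward margin of chosen over rejected responses."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with a batch of two preference pairs (the sequence log-probs are made up).
loss = dpo_loss(torch.tensor([-10.0, -12.0]), torch.tensor([-11.0, -13.5]),
                torch.tensor([-10.5, -12.4]), torch.tensor([-10.8, -12.9]))
```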

For now these choices are driven mainly by simplicity: the teams are still relatively new and are building modular systems. A statement from a member of the Llama 3 post-training team also confirms this engineering-simplicity approach:


Llama 3 has a simple post-training loop: rejection sampling, SFT, and DPO. This not only achieves state-of-the-art performance empirically, but is also reproducible. Moreover, the team can asynchronously explore many different workflows (e.g. coding, math) and feed the data into the same simple loop.
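A stripped-down sketch of the rejection-sampling step in that loop might look like the following, where `generate_fn` and `reward_fn` are hypothetical stand-ins for a sampling endpoint and a trained reward model:

```python
from typing import Callable, List

def rejection_sample_sft_data(
    prompts: List[str],
    generate_fn: Callable[[str, int], List[str]],  # (prompt, n) -> n candidate completions
    reward_fn: Callable[[str, str], float],        # (prompt, completion) -> scalar score
    num_samples: int = 8,
) -> List[dict]:
    """Sample several completions per prompt and keep only the highest-scoring one.

    The surviving (prompt, best completion) pairs become SFT data for the next
    round; everything else is discarded.
    """
    sft_data = []
    for prompt in prompts:
        candidates = generate_fn(prompt, num_samples)
        best = max(candidates, key=lambda c: reward_fn(prompt, c))
        sft_data.append({"prompt": prompt, "completion": best})
    return sft_data
```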
Synthetic Data

An important part of this new RLHF cycle is synthetic instruction data that surpasses human capabilities on most tasks.

If you can make the model a little better, it generates better instruction data, and then you "start over" from the updated checkpoint.

Meta explicitly states in the paper that it "used the 405B model to improve the post-training quality of our smaller models"; Google did this by distilling Gemini Flash, and in practice most frontier models probably include some similar step.
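A minimal sketch of that "big model teaches small model" step is shown below; `teacher_generate` is a hypothetical call into the larger model, and the length check is a deliberately crude stand-in for real quality filtering:

```python
from typing import Callable, List

def build_synthetic_sft_set(
    seed_prompts: List[str],
    teacher_generate: Callable[[str], str],  # call into the larger "teacher" model
    min_chars: int = 32,
) -> List[dict]:
    """Have a stronger model write the target responses for a smaller model's SFT set."""
    dataset = []
    for prompt in seed_prompts:
        response = teacher_generate(prompt)
        if len(response) >= min_chars:  # crude filter; real pipelines do far more (see below)
            dataset.append({"prompt": prompt, "response": response})
    return dataset
```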

I heard that OpenAI is using 50 trillion tokens of data to train the next generation of models, most of which is synthetic data. There was a rumor last year that Anthropic had a "pre-trained constitutional AI corpus", which now seems reasonable.

These AI companies realized the importance of synthetic data 12 to 18 months ago, when they no longer relied on model outputs for self-iterative training. Meta is different, though, because it also benefits from other, better open models.

A look at today's post-training makes clear that the problem of model collapse from synthetic data is overstated. Model collapse occurs only in artificial settings where the original data is discarded and only the newly generated data remains.

Data quality is king

The bulk of the Llama 3.1 report is devoted to the details of data curation, with each relevant sub-area requiring extensive and specific curation instructions.

This is consistent with what I know of the work done by John Schulman’s post-training team at OpenAI and other similar groups — give them a specific domain, get data about it, and the model will get better.

But without extensive data filtering and curation, none of the above RLHF methods will work.

At Allen AI, we started prioritizing data more in the post-training process and saw an immediate difference in the speed at which our models improved.
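As a toy illustration of the kind of filtering involved, here is a hedged sketch combining a reward-model threshold, exact deduplication, and a length cap; the thresholds and the `reward_fn` callable are arbitrary placeholders, and production pipelines are far more elaborate:

```python
from typing import Callable, List

def filter_training_data(
    examples: List[dict],                    # each: {"prompt": ..., "response": ...}
    reward_fn: Callable[[str, str], float],  # scores a (prompt, response) pair
    min_reward: float = 0.5,
    max_chars: int = 8000,
) -> List[dict]:
    """Keep only high-scoring, non-duplicate, reasonably sized examples."""
    seen = set()
    kept = []
    for ex in examples:
        key = (ex["prompt"], ex["response"])
        if key in seen:
            continue                      # drop exact duplicates
        seen.add(key)
        if len(ex["response"]) > max_chars:
            continue                      # drop pathologically long generations
        if reward_fn(ex["prompt"], ex["response"]) < min_reward:
            continue                      # drop low-scoring examples per the reward model
        kept.append(ex)
    return kept
```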

Case Study - Nemotron and Llama

The post-training process of Llama is as follows:


Nemotron's diagram is a bit simpler:


Taken together, we can see the commonalities of most methods.

But the chart below, like most industry research papers, largely glosses over the data.


Reports like Llama 3.1's mention many other details, such as regularization, loss-function adjustments, and model averaging, but these bring only marginal gains in model performance and sit largely outside the core fine-tuning loop.

At a certain point in time, these details will become insignificant.

References:

https://www.interconnects.ai/p/frontier-model-post-training