news

After 4 rounds of intensive training, Llama-3 8B beats GPT-4! Meta and others have the LLM play three roles at once to evaluate itself and evolve

2024-07-31



New Intelligence Report

Editor: Editorial Department

【New Intelligence Introduction】Meta, UC Berkeley, and NYU jointly proposed Meta-Rewarding language models, offering a clear path toward "super alignment": let the AI act as its own referee and improve its own alignment, an approach more effective than self-rewarding models.

LLM consumes a lot of data, not only in the pre-training corpus, but also in the alignment stages such as RLHF and DPO.

The latter not only relies on expensive manually annotated data, but may also cap the further development of LLMs at human level.

In January this year, the Meta and NYU teams proposed a self-rewarding mechanism for language models, using the LLM-as-a-Judge prompting mechanism to let the model provide feedback to itself during training.


Paper address: https://arxiv.org/abs/2401.10020

The paper found that even without relying on human annotators, LLM can achieve performance improvements by evaluating its own responses.

Recently, the team published another study, taking the LLM "self-reward" to a higher level.


Paper address: https://arxiv.org/abs/2407.19594

After all, the model is grading itself, so it is not enough to focus on how the model as an actor learns from feedback; the model as a judge must also have strong evaluation capabilities.

Previous studies focused too much on the former and ignored the latter, causing performance to saturate too quickly during iterative training.

It may even lead to something worse than saturation: overfitting to the reward signal (reward hacking).

Therefore, researchers from Meta, NYU, UC Berkeley and other institutions proposed adding a "meta-rewarding" step: let the model evaluate its own evaluations, thereby improving its ability as a judge.


Although this sounds a bit convoluted, it is actually quite reasonable, and experiments show that adding this extra layer of nesting brings significant improvements.

For example, the win rate of Llama-3-8B-Instruct on AlpacaEval 2 increased from 22.9% to 39.4%, surpassing GPT-4; on Arena-Hard, it increased from 20.6% to 29.1%.

If the research published in January this year is LLM-as-a-Judge, then the "meta-reward" proposed in this paper is equivalent to LLM-as-a-Meta-Judge.

Not only does the judge require no humans, the meta-judge is also self-sufficient, which seems to further show that model self-improvement can shed its dependence on human supervision.

Meta's Chief AI Scientist Yann LeCun also retweeted the study and made a pun of his own:


Can the Meta-Judge proposed by Meta FAIR actually be fair?

Never mind the research itself; what matters is that Meta FAIR gets maximum exposure.


Meta-Rewarding

Put plainly, the "meta-rewarding" method adds a meta-judge on top of the original actor-judge interaction, with the same model playing all three roles, and no additional human data is needed.


Among them, the actor generates responses to a given prompt; the judge evaluates and scores the model's own responses; and the meta-judge compares the quality of the judge's own evaluations.

The ultimate optimization goal is for the actor to generate better responses, but training efficiency depends on the accuracy of the judge.

Therefore, the meta-judge, by training the judge, can improve the model's performance both as an actor and as a judge.

The iterative training scheme formed by these three roles is shown in Figure 1. In the t-th iteration, responses of model M_t to prompt x are collected first, and then M_t is asked to evaluate them itself, yielding preference data for training the actor.

Afterwards, for a given response y, M_t generates multiple judgment variants, which the meta-judge scores and ranks, yielding preference data for training the judge.

Combining these two kinds of preference data, DPO is used to optimize model M_t, completing one round of iteration and yielding model M_(t+1).
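To make the three-role loop concrete, here is a minimal Python sketch of one data-collection round. The helper callables (generate, judge, meta_judge and the two select_* functions) are hypothetical stand-ins for prompting the same model in different roles; this illustrates the procedure described above, not the authors' actual code.

```python
def build_preference_data(prompts, generate, judge, meta_judge,
                          select_actor_pair, select_judge_pairs, K=7, N=11):
    """One data-collection round of Meta-Rewarding (illustrative sketch).

    generate(x)             -> one sampled response to prompt x       (actor role)
    judge(x, y)             -> one judgment (text + score) of y       (judge role)
    meta_judge(x, j_a, j_b) -> the preferred judgment of the two      (meta-judge role)
    select_actor_pair / select_judge_pairs implement the pair-selection rules
    described in the article (length control, score variance, etc.)."""
    actor_pairs, judge_pairs = [], []
    for x in prompts:
        # Actor: sample K responses to the prompt.
        responses = [generate(x) for _ in range(K)]
        # Judge: the same model judges each of its own responses N times.
        judgments = {y: [judge(x, y) for _ in range(N)] for y in responses}
        # Actor preference data: a (chosen, rejected) response pair from the scores.
        chosen, rejected = select_actor_pair(responses, judgments)
        actor_pairs.append((x, chosen, rejected))
        # Meta-judge: compare pairs of judgments of one response; the preferred
        # judgment becomes "chosen", the other "rejected".
        for j_a, j_b in select_judge_pairs(judgments):
            winner = meta_judge(x, j_a, j_b)      # assumed to return j_a or j_b
            loser = j_b if winner is j_a else j_a
            judge_pairs.append((x, winner, loser))
    # Both preference sets are then used in one round of DPO to obtain M_(t+1).
    return actor_pairs, judge_pairs
```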

Length preference

Previous work has found that the model acting as a judge prefers longer responses, which can lead to a "length explosion" of answers after multiple rounds of iterations.

Therefore, the authors introduced a simple "length-control" mechanism, using a parameter ρ ∈ [0,1] to trade off the judge's score against the length of the response text.

For example, among the model responses whose scores fall in the top tier, i.e. in the range [(1 − ρ)S_max + ρS_min, S_max], the shortest response is selected as the optimal answer.
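As a rough illustration of this rule, the snippet below selects the "chosen" response under such a length control; the default ρ is arbitrary and the handling of the rejected response is omitted. It could serve as part of the hypothetical select_actor_pair helper from the earlier sketch.

```python
def select_chosen_response(responses, scores, rho=0.5):
    """Length-controlled selection: among responses whose judge score falls in the
    top tier [(1 - rho) * S_max + rho * S_min, S_max], return the shortest one.
    `responses` is a list of strings and `scores` their judge scores."""
    s_max, s_min = max(scores), min(scores)
    threshold = (1 - rho) * s_max + rho * s_min
    top_tier = [y for y, s in zip(responses, scores) if s >= threshold]
    return min(top_tier, key=len)  # prefer the shortest high-scoring response
```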

Creation of Judge Preference Data

First, the model responses for which the judge is least confident are selected, with the judge's certainty measured by the variance of its scores. For each selected response y, there are at most N corresponding model judgments {j_1, …, j_N}.
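A minimal sketch of this selection criterion, assuming each response already comes with its list of judge scores (again a hypothetical helper, usable as part of select_judge_pairs above):

```python
import statistics

def least_confident_response(responses, scores_per_response):
    """Return the response whose judge scores disagree the most (highest variance),
    i.e. the one the judge is least certain about.
    `scores_per_response[i]` holds the N scores the judge gave responses[i]."""
    variances = [statistics.pvariance(s) for s in scores_per_response]
    return responses[max(range(len(responses)), key=lambda i: variances[i])]
```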

Afterwards, each pair (jm, jn) is evaluated pairwise using the meta-judge prompt template shown in Figure 2.


In addition to delivering a verdict, the meta-judge also needs to generate a chain-of-thought (CoT) reasoning process.

To reduce the meta-judge's possible position bias (it may tend to prefer Judgment A, which appears first), each pair (j_m, j_n) is also evaluated with the order swapped, so the meta-judge evaluates it twice, and the two verdicts are combined into a single result r_mn:


Parameters w_1 and w_2 are introduced to characterize the possible position bias:


Here, win_1st and win_2nd count how many times the judgment in the first or second position wins across all of the meta-judge's evaluations.

These variables are used to construct a "battle matrix" B that records the final result of each comparison:


Using Elo ratings, the meta-reward score that the meta-judge assigns to each judgment can then be computed from the matrix B.
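The article does not show the exact Elo computation, so the sketch below uses a standard Bradley-Terry maximum-likelihood fit as one plausible way to turn a battle matrix into Elo-style scores; it is an assumption for illustration, not the paper's implementation.

```python
import math

def elo_from_battle_matrix(B, iters=100, eps=1e-3):
    """Fit Bradley-Terry strengths from a battle matrix B, where B[m][n] is the
    (possibly weighted) number of wins of judgment m over judgment n, then map
    the strengths onto an Elo-like scale (best judgment = 0, others negative)."""
    n_items = len(B)
    p = [1.0] * n_items  # initial strengths
    for _ in range(iters):
        new_p = []
        for m in range(n_items):
            wins = sum(B[m]) + eps  # smoothing so winless judgments stay positive
            denom = sum((B[m][k] + B[k][m]) / (p[m] + p[k])
                        for k in range(n_items) if k != m)
            new_p.append(wins / denom if denom > 0 else p[m])
        total = sum(new_p)
        p = [x / total for x in new_p]  # normalize for numerical stability
    best = max(p)
    return [400 * math.log10(x / best) for x in p]  # Elo-style scale
```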


The authors found that the meta-judge, like the judge, also exhibits a "length preference" and tends to choose longer judgments.

To prevent the final trained model from being too verbose, a filtering step was also applied when constructing the judge dataset: if the judgment chosen by the meta-judge exceeds a certain length, the entire data pair is discarded.
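As a small illustration, such a filter might look like the following, where the character threshold is a placeholder rather than the paper's value.

```python
def keep_judge_pair(chosen_judgment, max_chars=1000):
    """Drop a judge preference pair whenever the judgment chosen by the
    meta-judge is too long (placeholder threshold, not the paper's value)."""
    return len(chosen_judgment) <= max_chars
```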

Evaluation Experiment

Experimental preparation

The experiment uses Llama-3-8B-Instruct as the seed model, and other experimental settings are consistent with the previously published paper "Self-Rewarding Language Models".

Before meta-reward training, the experiment first performs supervised fine-tuning (SFT) on the seed model on the EFT (Evaluation Fine-Tuning) dataset.

The EFT dataset is built on Open Assistant and provides initial LLM-as-a-Judge training data, which contains ranked human responses that can be used to train the model to act as a judge.

For the meta-reward iteration, the experiment uses 20,000 prompts generated by Llama-2-70B-Chat via 8-shot prompts.


As shown in the figure above, the prompts used for training are closer in distribution to the AlpacaEval dataset, while the Arena-Hard prompts are concentrated in a subset of the training prompts.

For each iteration, 5,000 prompts were sampled from this seed set, for a total of four iterations.

The iteration process is as follows:

- Iter 1: Starting from the initial SFT model, use DPO (Direct Preference Optimization) to train on the actor and judge preference pairs it generates, obtaining M1.

- Iter 2: Use DPO to train on the actor and judge preference pairs generated by M1, obtaining M2.

- Iter 3/4: Use DPO to train only on the actor preference pairs generated by M2/M3, obtaining M3/M4.

Each prompt asked the model to generate K = 7 responses, for a total of 35,000 responses per iteration. Identical responses were then filtered out (usually removing no more than 50 duplicates).

Next, the same sampling parameters are used to generate N = 11 different judgments for each response.

Assessment method

The goal of the meta-reward model is to allow the model to both “act” and “evaluate” itself, so the experiment also needs to evaluate how the model performs in these two roles.

The baseline is the self-rewarding model proposed in the aforementioned paper, equipped with the same "length-control" mechanism, so the performance gain brought by the meta-rewarding mechanism can be compared directly.

First, let's look at how the model's performance as an actor is evaluated.

The experiments use three automatic evaluation benchmarks based on GPT-4-as-a-Judge, namely AlpacaEval 2, Arena-Hard, and MT-Bench, each focusing on a different aspect of the model.

For example, AlpacaEval mainly focuses on chat scenarios, and the prompt set covers a variety of daily problems.

In contrast, Arena-Hard contains more complex and challenging questions that must satisfy more criteria across 7 predefined areas (creativity, complexity, problem solving, etc.).

MT-Bench has 8 different question categories, which mainly evaluate the multi-round dialogue ability of the model.

On the other hand, to evaluate how well the LLM "judges", the experiments measure the correlation between the scores it gives and human preferences; where no human-labeled data is available, a stronger AI judge is used instead.

Instruction-following assessment

Figure 3 shows the win rate of the meta-reward method (with length control mechanism) as a function of training iterations on the AlpacaEval benchmark.

Overall, the win rate of the meta-rewarding model increased significantly from 22.9% to 39.4%, exceeding GPT-4 and approaching the Claude Opus model.


This is a remarkable result considering that the seed model has only 8B parameters and no additional human data is introduced beyond the EFT dataset used in the SFT stage.

In addition, the results also demonstrate the importance of meta-judge and length control mechanisms.

The self-rewarding model began to show signs of saturation after more than 3 rounds of training, but the meta-rewarding model did not: its performance continued to increase through the 4th round.

This demonstrates the importance of training the model's evaluation ability and the effectiveness of the meta-judge.

As shown in Table 1, after 4 rounds of iterations, the average response length (in characters) did not increase significantly in either the self-reward model or the meta-reward model, proving the effectiveness of the length control mechanism.


The meta-rewarding mechanism brings three clear improvements.

First, the 805 prompts in AlpacaEval are broken down into 18 categories for detailed analysis. Meta-rewarding improves responses in almost all categories (Figure 4), including subjects that require substantial knowledge and reasoning, such as Science, Gaming, and Literature.

It is worth noting that the model did not achieve significant improvement in two categories, Travel and Mathematics.


Second, meta-rewards improve responses to complex and difficult questions.

Experiments further use Arena-Hard to evaluate the performance of the meta-reward method in answering complex and challenging questions.

The evaluation results in Table 2 show that meta-rewarding improves the score in all 4 iterations, a significant gain of 8.5 points over the seed model (20.6%).


Third, meta-rewarding does not sacrifice multi-round dialogue ability even though only single-round dialogue is trained.

The paper conducted an MT-Bench evaluation to examine how much multi-round dialogue capability is lost when training only on single-round data.

The results are shown in the table below. Four iterations of meta-rewarding significantly improved the first-round dialogue score from 8.319 (seed model) to 8.738, while the second-round dialogue score dropped by no more than 0.1.


This is a marked improvement over the Self-Rewarding + LC baseline, which often drops the second-round dialogue score by more than 0.2 without improving the first-round score.

Reward Model Evaluation

The experiments evaluated the model's judgment accuracy on responses generated by the seed model Llama-3-8B-Instruct.

In the absence of human annotations, the authors measured the score correlation between the meta-rewarding model and the currently strongest judge model, gpt-4-1106-preview.

The analysis used two slightly different settings, differing mainly in how ties given by the judge model are handled, and therefore reports two metrics: an agreement score that counts ties as 0.5, and one that discards tied results.
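Under one plausible reading of these two metrics, they could be computed as follows (illustrative only; each pairwise preference is encoded as 'A', 'B', or 'tie'):

```python
def agreement_scores(model_prefs, reference_prefs):
    """Agreement between the model's pairwise preferences and a reference judge's
    (e.g. gpt-4-1106-preview). Returns (agreement counting ties as 0.5,
    agreement discarding ties)."""
    pairs = list(zip(model_prefs, reference_prefs))
    # Variant 1: a tie contributes 0.5 credit instead of a hard match/mismatch.
    credit = [1.0 if m == r else 0.5 if 'tie' in (m, r) else 0.0 for m, r in pairs]
    with_ties = sum(credit) / len(credit)
    # Variant 2: comparisons involving a tie are discarded entirely.
    decisive = [(m, r) for m, r in pairs if m != 'tie' and r != 'tie']
    without_ties = (sum(m == r for m, r in decisive) / len(decisive)
                    if decisive else float('nan'))
    return with_ties, without_ties
```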

The results show that the model's judgment ability has improved after training.

The analysis in Table 3 shows that, compared with the baseline model, the correlation between the meta-rewarding model and the strong GPT-4 judge improves significantly in both evaluation settings.


These results show that the meta-reward approach can improve the model's judgment ability, making its evaluation results closer to those of the more complex language model GPT-4.

In addition, the experiment compared the correlation between the model judgment results and the human response ranking in the Open Assistant dataset (Table 7), and found that meta-reward training improved the correlation with human judgment.


However, this improvement did not persist in subsequent training iterations, likely due to differences in the distribution between model-generated responses and human responses.

Analysis

Length control mechanism

Length control mechanisms are critical to maintaining a balance between comprehensiveness and simplicity of model responses.

The experiment compares the results of different length control parameters ρ in the last training iteration, as shown in Table 4:


ρ = 0 is equivalent to not performing any length control in the preference data selection.

As expected, this training method makes the responses generated by the model too lengthy and the LC win rate decreases.

Training with an external reward model

The meta-rewarding mechanism lets the model act as a judge of its own responses; for comparison, the experiments also tried using the powerful external reward model Starling-RM-34B.

However, Starling-RM-34B failed to improve the AlpacaEval LC win rate in the first iteration (24.63% vs. 27.85%), probably because of its length bias.

Meta-judge bias

After the first iteration of meta-reward training, the meta-judge almost always prefers higher-scoring judgments, as shown in Table 5.


This score bias significantly skews the distribution of judgment scores toward a perfect score of 5. Position bias also tends to increase during training, especially when comparing two judgments with the same score.

Changes in judgment scores: to investigate how the distribution of judgment scores changes over the course of meta-rewarding training iterations, the same validation prompts as in the reward-model evaluation were used.

Llama-3-8B-Instruct was used to generate 7 responses per prompt, and 11 judgments were then generated for each response. Figure 5 visualizes the resulting score distributions, with densities estimated using a Gaussian kernel.
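A sketch of this kind of visualization, using scipy's Gaussian kernel density estimate over hypothetical per-iteration score lists:

```python
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

def plot_score_density(scores_by_iteration):
    """Plot a Gaussian kernel density estimate of judge scores for each training
    iteration. `scores_by_iteration` maps a label (e.g. 'Iter 1') to a list of scores."""
    xs = np.linspace(1, 5, 200)              # judge scores lie on a 1-5 scale
    for label, scores in scores_by_iteration.items():
        density = gaussian_kde(scores)       # Gaussian KDE with default bandwidth
        plt.plot(xs, density(xs), label=label)
    plt.xlabel("judge score")
    plt.ylabel("estimated density")
    plt.legend()
    plt.show()
```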


It can be seen that training the judge with the meta-judge further increases the likelihood of generating high scores.

However, after the first two iterations of training, the judge tends to assign scores such as 4.5, 4.75, and 4.9, which in principle should be integers.

Although these are high scores, they provide more fine-grained discrimination between responses of varying quality.

Conclusion

The paper proposes a new mechanism that improves the model's judging ability by having a meta-judge assign meta-rewards to the model acting as a judge.

This addresses a major limitation of the Self-Rewarding framework, namely that the model's judging ability is never itself trained.

To make Meta-Rewarding training more effective, the paper also introduces a new length-control technique to alleviate the length explosion problem that occurs when training on AI feedback.

The effectiveness of the meta-reward approach is also verified through the automatic evaluation benchmarks AlpacaEval, Arena-Hard, and MT-Bench.

Notably, this approach significantly improves Llama-3-8B-Instruct even without additional human feedback, and outperforms strong baseline methods Self-Rewarding and SPPO that rely on extensive human feedback.

Furthermore, when evaluating the model’s judgment ability, it shows significant improvement in correlation with both human judges and strong AI judges such as gpt-4-1106-preview.

Overall, our results provide strong evidence that self-improving models without any human feedback are a promising direction to achieve super alignment.

References:

https://arxiv.org/pdf/2407.19594