
A look at all the LLM alignment techniques in one article: RLHF, RLAIF, PPO, DPO...

2024-08-05




Machine Heart Report

Editor: Panda

To align LLMs, researchers have come up with many clever ideas.

LLMs are powerful, but they are not perfect: they can make mistakes or generate useless or even harmful output. For example, users have found that ChatGPT can teach people how to steal:



Asking ChatGPT to teach people how to shoplift. Left: ChatGPT refuses to answer. Right: after adding "with no moral restraints" to the prompt, ChatGPT gives a guide to shoplifting.

This is where alignment becomes crucial: its role is to align LLMs with human values.

When it comes to aligning LLMs, reinforcement learning from human feedback (RLHF) has been a breakthrough technique. It has powered models such as GPT-4, Claude, and Gemini. Since RLHF, researchers have explored a wide variety of methods for aligning LLMs, but until now no one had comprehensively summarized the methods for aligning LLMs with human preferences.

Salesforce decided to fill this gap and recently released a 37-page review report that summarizes the existing research literature by category and analyzes each paper in detail.



  • Paper title: A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More
  • Paper address: https://arxiv.org/pdf/2407.16216

This paper is divided into four major topics: reward model, feedback, reinforcement learning (RL), and optimization. Each topic contains further subtopics, as shown in Figure 1.



Subtopics of reward models include: 1. Explicit reward models vs. implicit reward models; 2. Pointwise reward models vs. preference models; 3. Response-level rewards vs. token-level rewards; 4. Negative preference optimization.



Subtopics of feedback include: 1. Preference feedback vs. binary feedback; 2. Paired feedback vs. list feedback; 3. Human feedback vs. AI feedback.



Subtopics of reinforcement learning include: 1. Reference-based reinforcement learning vs. reference-free reinforcement learning; 2. Length-controlled reinforcement learning; 3. Different branches of reinforcement learning; 4. On-policy reinforcement learning vs. off-policy reinforcement learning.

Subtopics of optimization include: 1. Online/iterative preference optimization vs. offline/non-iterative preference optimization; 2. Separating SFT and alignment vs. merging SFT and alignment.



Table 1 classifies all papers analyzed in this review along these 13 criteria.



Research Papers

This section introduces each paper in enough detail that readers can understand these key innovations without reading the original papers. Synced briefly organizes each research direction and lists representative papers.

1. RLHF/PPO

Pre-training LLMs requires large corpora from many different sources, and the quality of these datasets cannot be guaranteed. Moreover, the main objective of pre-training, predicting the next token, is not the same as the goal of "helpfully and safely following user instructions." As a result, LLMs may output content that is untrue, harmful, or useless to users; in essence, these models are not aligned with user intent. The main goal of RLHF/PPO is to align language models with user intent across a variety of tasks by fine-tuning the model with human feedback. There is a large body of research on this topic.

InstructGPT

InstructGPT comes from OpenAI and forms the basis for training models such as ChatGPT and GPT-4. See the "GPT-4 Technical Report" as well as the Machine Heart reports "GPT-4 shocking release: a multimodal large model that directly upgrades ChatGPT and Bing, with an open API. Game over?" and "Learn the technology behind ChatGPT from Li Mu: read the InstructGPT paper in 67 minutes".

Incorporating human preferences addresses the difficulty of evaluating responses generated by LLMs. Traditional evaluation metrics such as BLEU, ROUGE, and BERTScore cannot guarantee consistency with human preferences. To solve this problem, researchers integrate human preferences directly into the LLM to enhance its performance. This process usually involves two main steps: reward model learning and reinforcement learning policy training.

During the reward-model learning phase, an explicit pointwise reward function is trained using prompts and paired responses.

After that, the reinforcement learning policy training phase begins; in this phase, the LLM and the pre-trained reward model act as the agent and the environment, respectively, within a reinforcement learning framework.
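The pointwise reward model in the first phase is typically trained with a Bradley-Terry-style pairwise loss on (chosen, rejected) response pairs. Below is a minimal PyTorch-style sketch; the `reward_model` interface and batch fields are illustrative placeholders, not OpenAI's actual implementation.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, batch):
    # reward_model maps (input_ids, attention_mask) -> one scalar reward per sequence
    r_chosen = reward_model(batch["chosen_ids"], batch["chosen_mask"])        # shape (B,)
    r_rejected = reward_model(batch["rejected_ids"], batch["rejected_mask"])  # shape (B,)
    # Maximize the log-probability that the chosen response outranks the rejected one:
    # loss = -log sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```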

To train InstructGPT, three datasets are used: 1. SFT dataset: Contains annotator demonstrations used to train the SFT model. 2. RM (Reward Model) dataset: Consists of human annotators’ rankings of model outputs, used to train the reward model. 3. PPO dataset: Consists of prompts used as input for RLHF fine-tuning.

After training, InstructGPT is evaluated along three dimensions: helpfulness, truthfulness, and harmlessness.

According to the results, human evaluation shows that "people prefer the output of the 1.3B-parameter InstructGPT model to that of the 175B GPT-3, even though the former has more than 100 times fewer parameters." Notably, InstructGPT outperforms GPT-3 on both helpfulness and toxicity, which is crucial for alignment.

Anthropic RLHF

Anthropic also studied the same topic in the paper "Training a helpful and harmless assistant with reinforcement learning from human feedback".

OpenAI found that RLHF helps with alignment but can also degrade the model's performance on some NLP benchmarks, a phenomenon known as the "alignment tax." Its InstructGPT model has 1.3B parameters. In contrast, Anthropic researchers evaluated seven models ranging in size from 13M to 52B parameters, with sizes growing geometrically by a factor of roughly 4.

They conclude that alignment imposes a “tax” on smaller models, but only provides benefits for larger models, especially those with 13B to 52B parameters.

Given this benefit of alignment, they also experimented with using programming datasets to improve the LLM's capabilities. OpenAI's RLHF method includes PPO and PPO-ptx, where PPO-ptx is designed to reduce the alignment tax on NLP benchmarks. Anthropic's RLHF study found that, as long as the model is large enough, PPO by itself can bring the benefits of alignment on downstream NLP tasks. They also determined that the optimal KL-divergence coefficient in RL policy training is β = 0.001.
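In both pipelines, the policy is trained against a reward that penalizes divergence from the reference (SFT) model via a KL term. A minimal sketch of this combined reward is shown below; the variable names are illustrative, and β = 0.001 is the coefficient Anthropic reports, not a universal default.

```python
import torch

def kl_penalized_reward(rm_reward, logprob_policy, logprob_ref, beta=0.001):
    """Combined reward used in RLHF policy training (sketch):
    r_total = r_RM - beta * (log pi_policy(y|x) - log pi_ref(y|x)).
    All inputs are tensors of per-sample values; beta=0.001 is the KL
    coefficient Anthropic found to work best in its experiments."""
    kl_term = logprob_policy - logprob_ref   # sample-based estimate of the KL penalty
    return rm_reward - beta * kl_term
```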

Online/Iterative RLHF

Traditionally, RLHF techniques for aligning LLMs have been offline methods, but such methods have drawbacks; for example, they struggle to handle out-of-distribution data.

To address this, the LLM needs to be continuously fine-tuned through iterative/online learning: the intermediate policy generates responses for prompts, an oracle provides preference feedback on these response pairs, and this feedback is fed back into the policy. In practice, iterative learning is divided into two parts: preference oracle learning and iterative policy optimization. See the paper "RLHF workflow: From reward modeling to online RLHF".
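The workflow described above can be sketched as a simple loop. This is a schematic under assumed helper methods (`generate`, `preference_oracle`, `update_on_preferences`), not the exact procedure of the cited paper.

```python
def online_rlhf(policy, prompts, preference_oracle, num_iterations=3):
    """Schematic of iterative/online RLHF; all helpers are hypothetical."""
    for _ in range(num_iterations):
        # 1. The current (intermediate) policy samples a pair of responses per prompt.
        pairs = [(p, policy.generate(p), policy.generate(p)) for p in prompts]
        # 2. An oracle (human or learned preference model) labels each pair.
        labeled = [(p, a, b, preference_oracle(p, a, b)) for p, a, b in pairs]
        # 3. The feedback is fed back into policy optimization
        #    (e.g., reward-model update + PPO, or a DPO-style update).
        policy = policy.update_on_preferences(labeled)
    return policy
```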

2. RLAIF

Obtaining human preference datasets is costly, which gave rise to reinforcement learning from AI feedback (RLAIF). Moreover, as LLM capabilities continue to improve, the quality of the AI preference datasets that can be collected improves as well, which in turn improves LLM alignment.

Anthropic RLAIF

Anthropic proposed RLAIF, building on its foundational RLHF work. See the paper "Constitutional AI: Harmlessness from AI feedback".

The method consists of two main phases: 1. supervised learning through critiques and revisions, guided by a constitution; 2. RLAIF.

Google’s RLAIF

Building on Anthropic's RLAIF results, a Google research team argued that previous studies could not directly compare the effectiveness of human feedback and AI feedback, which deserves further study. To collect AI feedback, a structured prompt is created, consisting of: an introduction, few-shot examples (optional), the samples to be annotated, and an ending.

To generate AI feedback, a two-step evaluation is performed. First, the LLM is given the prompt built from the four components above, plus a chain of thought (CoT), and generates a response. Next, this response is sent back to the LLM with an ending such as "preferred summary=", which yields preference probabilities like "summary 1 = 0.6, summary 2 = 0.4". To reduce position bias, the two candidate responses are presented in both orders and their scores are averaged.
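A small sketch of the position-debiasing step: the two candidates are scored in both presentation orders and the probabilities are averaged. `llm_preference_prob` is a hypothetical helper returning the probability the LLM assigns to the first candidate being preferred.

```python
def debiased_preference(llm_preference_prob, prompt, cand1, cand2):
    """Average over both presentation orders to reduce position bias (sketch).
    `llm_preference_prob(prompt, a, b)` is a hypothetical helper that returns
    the LLM's probability that `a` is preferred over `b`."""
    p_first = llm_preference_prob(prompt, cand1, cand2)         # cand1 shown first
    p_second = 1.0 - llm_preference_prob(prompt, cand2, cand1)  # cand1 shown second
    return 0.5 * (p_first + p_second)  # final probability that cand1 is preferred
```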

The RLAIF process adopts two strategies: 1. "distilled RLAIF", which follows the traditional RLHF approach: the preferences are used to train a reward model, which is then used to train the LLM policy; 2. "direct RLAIF", which uses LLM feedback directly, prompting the LLM to output an evaluation score that is then used as the reward signal for RL policy training.

Finally, the evaluation process uses three key metrics: 1. AI-annotator alignment: how consistent the AI feedback is with human annotators. 2. Win rate: how often human annotators prefer one of two candidate responses when comparing them head to head. 3. Harmless rate: the percentage of responses that human evaluators judge to be harmless.

For more details, please refer to the paper "RLAIF: Scaling reinforcement learning from human feedback with AI feedback".

Direct Human Preference Optimization

Traditional RLHF methods usually involve optimizing a reward function derived from human preferences. Although effective, this approach can introduce difficulties such as increased computational complexity and the need to trade off bias against variance when estimating and optimizing rewards. See the paper "High-dimensional continuous control using generalized advantage estimation".

Recent research has explored other approaches that aim to optimize LLM policies directly based on human preferences (without relying on a scalar reward signal).

The goal of these methods is to simplify the alignment process, reduce computational overhead, and achieve more robust optimization by using preference data more directly. By formulating the problem as a preference optimization problem rather than a reward estimation and maximization problem, these methods can provide a different perspective on aligning language models with human judgment:

  • SliC-HF, sequence likelihood calibration using human feedback, see the paper "SliC-HF: Sequence likelihood calibration with human feedback".
  • RSO, rejection sampling optimization, see the paper "Statistical rejection sampling improves preference optimization".
  • DPO, direct preference optimization, see the paper "Direct preference optimization: Your language model is secretly a reward model".
  • DPOP, DPO-positive, see the paper "Smaug: Fixing failure modes of preference optimisation with DPO-positive".
  • β-DPO, see the paper “β-DPO: Direct preference optimization with dynamic β”.
  • IPO, identity preference optimization, see the paper "A general theoretical paradigm to understand learning from human preferences".
  • sDPO, step-by-step DPO, see the paper "sDPO: Don't use your data all at once".
  • GPO, generalized preference optimization, see the paper "Generalized preference optimization: A unified approach to offline alignment".
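To make the idea concrete, here is a minimal PyTorch-style sketch of the DPO loss, the most widely used of these objectives. The inputs are assumed to be summed log-probabilities of whole responses under the policy and a frozen reference model; names and the β default are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Minimal DPO loss sketch. Each input is the summed log-probability of a
    whole response under the policy (logp_*) or the frozen reference (ref_logp_*)."""
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    # -log sigmoid(beta * (log-ratio of chosen minus log-ratio of rejected))
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```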

Token-level DPO

In DPO, the reward is assigned to the prompt and response as a whole. In an MDP formulation, by contrast, a reward is assigned to each individual action. Two subsequent papers elaborate DPO at the token level and extend its analysis to token-level credit assignment.

  • DPO can be analyzed in terms of token-level credit assignment, see the paper "From r to Q∗: Your language model is secretly a Q-function" and the report "Is this OpenAI's mysterious Q*? Stanford: Language models are Q functions".
  • TDPO, token-level DPO, see the paper "Token-level direct preference optimization".
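The token-level view rests on the fact that the response-level log-ratio used by DPO decomposes exactly into a sum of per-token log-ratios, which is what allows credit to be assigned to individual tokens. A small sketch of that decomposition (tensor names are placeholders; per-token log-probabilities are assumed precomputed):

```python
import torch

def token_level_logratios(token_logps_policy, token_logps_ref, response_mask):
    """Per-token log(pi/pi_ref) values; summing them over the response recovers
    the sequence-level log-ratio used by vanilla DPO. All tensors are (B, T)."""
    per_token = (token_logps_policy - token_logps_ref) * response_mask
    sequence_level = per_token.sum(dim=-1)  # equals the response-level log-ratio
    return per_token, sequence_level
```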

Iterative/Online DPO

When using DPO, all available preference data is used at once to align the LLM. To continuously improve the LLM, iterative/online DPO should be implemented, which raises an interesting question: how to efficiently collect new preference data. The following two papers explore this topic in depth.

  • Self-rewarding language models, see the paper "Self-rewarding language models".
  • CRINGE, see the paper "The cringe loss: Learning what language not to model".
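A schematic of how new preference data can be collected in this setting, loosely in the spirit of self-rewarding language models: the current model both generates candidate responses and scores them as a judge. The `generate` and `judge_score` helpers are hypothetical.

```python
def self_rewarding_iteration(model, prompts, num_candidates=4):
    """Collect one round of self-generated preference pairs for online DPO (sketch)."""
    preference_pairs = []
    for prompt in prompts:
        candidates = [model.generate(prompt) for _ in range(num_candidates)]
        # The model acts as its own judge (an LLM-as-a-judge prompt under the hood).
        scores = [model.judge_score(prompt, c) for c in candidates]
        chosen = candidates[scores.index(max(scores))]
        rejected = candidates[scores.index(min(scores))]
        preference_pairs.append((prompt, chosen, rejected))
    return preference_pairs  # fed into the next round of DPO training
```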

Binary Feedback

It turns out that collecting preference feedback is more difficult than collecting binary feedback (such as thumbs up or thumbs down), so the latter can facilitate the scaling of the alignment process. Two studies, KTO and DRO, focus on using binary feedback to align LLMs.

  • KTO, Kahneman-Tversky optimization, see the paper "KTO: Model alignment as prospect theoretic optimization".
  • DRO, direct reward optimization, see the paper "Offline regularised reinforcement learning for large language models alignment".
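To illustrate how binary labels can replace pairwise comparisons, here is a heavily simplified loss in the spirit of KTO: desirable responses are pushed above a reference point and undesirable responses below it. The batch-mean baseline stands in for KTO's KL reference term, and the weights and defaults are illustrative, not the published formulation.

```python
import torch

def binary_feedback_loss(logp_policy, logp_ref, is_desirable,
                         beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Simplified KTO-style objective on binary (thumbs-up/down) labels (sketch).
    logp_* are summed response log-probs of shape (B,); is_desirable is a bool
    tensor. The batch-mean reward is a crude stand-in for KTO's KL reference."""
    reward = beta * (logp_policy - logp_ref)   # implicit reward, as in DPO
    z_ref = reward.mean().detach()             # illustrative reference point
    loss_desirable = lambda_d * (1.0 - torch.sigmoid(reward - z_ref))
    loss_undesirable = lambda_u * (1.0 - torch.sigmoid(z_ref - reward))
    return torch.where(is_desirable, loss_desirable, loss_undesirable).mean()
```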

Fusion of SFT and alignment

Previous studies mainly performed SFT and alignment sequentially, but this approach proved laborious and led to catastrophic forgetting. Subsequent research follows two directions: one integrates the two processes into a single step; the other fine-tunes two models in parallel and fuses them at the end.

  • ORPO, odds ratio preference optimization, see the paper “ORPO: Monolithic preference optimization without reference model”.
  • PAFT, parallel fine-tuning, see the paper "PAFT: A parallel training paradigm for effective llm fine-tuning".

Length-controlled DPO and reference-free DPO

Previous studies have shown that LLM output is often excessively long. To address this, R-DPO and SimPO focus on controlling response length without degrading generation performance.

In addition, DPO requires a reference policy to ensure that the aligned model does not deviate too far from the reference model. In contrast, SimPO and RLOO propose methods that remove the need for a reference model without hurting LLM performance.

  • R-DPO, regularized DPO, see the paper "Disentangling length from quality in direct preference optimization".
  • SimPO, simple preference optimization, see the paper "SimPO: Simple preference optimization with a reference-free reward", report "Overall surpassing DPO: Chen Danqi's team proposed simple preference optimization SimPO and also refined the strongest 8B open source model".
  • RLOO, REINFORCE Leave-One-Out, see the paper "Back to basics: Revisiting reinforce style optimization for learning from human feedback in LLMs".
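As a concrete example of a reference-free, length-aware objective, here is a sketch in the spirit of SimPO: the implicit reward is the length-normalized log-probability of a response, and the chosen response must beat the rejected one by a margin γ. Names and default values are illustrative.

```python
import torch
import torch.nn.functional as F

def simpo_style_loss(logp_chosen, logp_rejected, len_chosen, len_rejected,
                     beta=2.0, gamma=0.5):
    """Reference-free, length-normalized preference loss (SimPO-like sketch).
    logp_* are summed response log-probs; len_* are response token counts."""
    r_chosen = beta * logp_chosen / len_chosen        # average per-token log-prob
    r_rejected = beta * logp_rejected / len_rejected
    # Require the chosen response to beat the rejected one by a margin gamma.
    return -F.logsigmoid(r_chosen - r_rejected - gamma).mean()
```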

List-wise preference optimization

Previous work on PPO and DPO focused on pairwise preferences, while work on RLHF collected list-wise preferences to speed up data collection and then converted them into pairwise preferences. Nevertheless, to improve LLM performance, it is feasible to perform preference optimization directly on list-wise datasets. The following three papers specifically discuss this approach.

  • LiPO, list-wise preference optimization, see the paper "LIPO: Listwise preference optimization through learning-to-rank".
  • RRHF, see the paper "RRHF: Rank responses to align language models with human feedback without tears".
  • PRO, preference ranking optimization, see the paper "Preference ranking optimization for human alignment".
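To make the list-wise idea concrete, below is a generic Plackett-Luce-style listwise loss over model scores for a ranked list of responses. This is a textbook learning-to-rank objective used only as an illustration, not the exact LiPO, RRHF, or PRO formulation.

```python
import torch

def plackett_luce_loss(scores):
    """Negative log-likelihood of a ranked list under the Plackett-Luce model.
    `scores` is a (B, K) tensor of model scores for K responses per prompt,
    already ordered from most preferred to least preferred (illustrative setup)."""
    B, K = scores.shape
    loss = torch.zeros(B, device=scores.device)
    for i in range(K - 1):
        # Log-probability that item i ranks first among the remaining items i..K-1.
        loss = loss - (scores[:, i] - torch.logsumexp(scores[:, i:], dim=-1))
    return loss.mean()
```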

Negative Preference Optimization

These studies share a common premise: the current generation of LLMs has surpassed human performance on tasks such as translation and summarization. It is therefore beneficial to treat LLM outputs as the desired responses, without relying on human-annotated data for preferred responses; conversely, undesired responses can still be used to align LLMs, a process known as negative preference optimization (NPO).

  • NN, negative example method, see the paper "Negating negatives: Alignment without human positive samples via distributional dispreference optimization".
  • NPO, negative preference optimization, see the paper "Negative preference optimization: From catastrophic collapse to effective unlearning".
  • CPO, contrast preference optimization, see the paper "Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation".

Nash Learning

Previous studies usually use pointwise rewards and the Bradley-Terry (BT) model to obtain pairwise preferences. However, this approach is inferior to modeling pairwise preferences directly and cannot resolve inconsistencies among pairwise preferences. To overcome these limitations, some studies have proposed Nash learning methods.

  • Nash learning from human feedback, see the paper “Nash learning from human feedback”.
  • SPPO, self-play preference optimization, see the paper "A minimaximalist approach to reinforcement learning from human feedback".
  • DNO, direct Nash optimization, see the paper "Direct nash optimization: Teaching language models to self-improve with general preferences".

Comparison of different methods

Some studies are designed to compare these different methods. Such studies can illustrate the advantages and disadvantages of each method.

  • Evaluating DPO and its variants

The paper "Insights into alignment: Evaluating DPO and its variants across multiple tasks" comprehensively evaluates implicit reward models, i.e., algorithms that do not use reinforcement learning, including DPO, KTO, IPO, and CPO, on tasks such as reasoning, mathematical problem solving, truthfulness, question answering, and multi-task understanding. The evaluations cover three scenarios: 1) fine-tuning a supervised fine-tuned (SFT) model, 2) fine-tuning a pre-trained model, and 3) fine-tuning an instruction-tuned model.

The study found that KTO outperformed the other alignment methods on most benchmarks. It also showed that alignment does not significantly improve the model's reasoning or question-answering performance, but it does significantly improve mathematical problem solving. The study also noted the importance of data volume: alignment methods performed best on smaller data subsets. In addition, KTO and CPO can effectively skip the SFT stage and go directly to alignment without hurting performance, whereas DPO and IPO show a significant performance drop when the SFT stage is skipped.

  • Is DPO a better LLM alignment method than PPO?

The paper "Is DPO superior to PPO for LLM alignment? A comprehensive study" shows that DPO may have inherent limitations: it may produce biased answers and may suffer performance degradation under distribution shift.

They found that DPO-trained policies tend to favor unseen responses, especially out-of-distribution samples. Iterative/online DPO mitigates this problem by broadly exploring the response space and continuously updating the reference model. In contrast, RLHF/PPO addresses these challenges through advantage normalization, large batch sizes, and an exponential moving average for the reference model. Ultimately, these findings show that PPO outperforms iterative/online DPO, which in turn outperforms standard DPO.

For more details, see the Synced column "ICML 2024 Oral | Is DPO more suitable than PPO for LLM alignment? The latest findings from Tsinghua's Wu Yi team".

Future Directions

By analyzing previous papers, the team identified a number of research questions that need further exploration.

General tasks for alignment evaluation

Different papers have used different tasks to evaluate these methods. However, some tasks, such as GSM8K, focus primarily on reasoning and may not be suitable for evaluating alignment performance. Instead, tasks such as TruthfulQA, or tasks focused on toxicity, should be prioritized for evaluating the toxicity of fine-tuned LLMs. A way should be found to combine these tasks into a unified leaderboard for evaluating alignment.

Using Implicit Reward Models, List-wise Preferences, and Nash Learning for Larger Language Models

Currently, the largest model trained with implicit reward models has only 70B parameters. If these methods could be scaled to even larger models, on the order of GPT-4 and Claude 3, it would help us better understand their effectiveness relative to RLHF/PPO.

Similarly, list-wise preference models deserve further study. With RLHF, preference datasets are collected as list-wise preferences and then converted into multiple pair-wise preferences. The potential problems of applying list-wise preference models at scale remain to be solved.

Finally, Nash learning can resolve inconsistencies among human annotators. If a Nash learning model can be integrated into larger LLMs, it can demonstrate its ability to capture the complexity of human preferences.

Experiments with binary feedback

Both KTO and DRO use a binary "thumbs-up"/"thumbs-down" feedback mechanism instead of pairwise preferences. However, these binary labels currently come from preference datasets, in which desired responses are marked as positive examples and undesired responses as negative examples; further research on realistic binary datasets is still needed. In addition, binary data is easier to collect than preference data, so it should be possible to use much larger binary feedback datasets for alignment. However, noise in binary feedback may be more pronounced than in preference datasets, so how to effectively filter out noisy data is also a very interesting research direction.

Experimenting with useful AI feedback

Current AI feedback mainly consists of harmlessness feedback in RLAIF and feedback ranking in iterative DPO. However, with RLAIF, helpfulness feedback is still provided by human annotators. This is reasonable because generating helpful responses is significantly harder than identifying harmful ones. An interesting future research direction is to use LLMs to generate helpfulness feedback, thereby allowing LLMs to improve themselves.

Accelerate Nash Learning

The Nash learning method can model pairwise preferences directly and resolve inconsistencies among human annotations. However, it requires multiple iterations to converge to the optimal policy. Although the authors do not report the time required for alignment, it is presumably much slower than implicit reward model methods such as DPO. Speeding up the Nash learning process is therefore a research direction worth attention.

Terminating iterative/online learning

When using iterative/online training, it is critical to determine when to terminate the iterations. Previous studies have found that iterative learning sometimes degrades LLM performance on certain tasks, which may be a sign of overfitting. However, no one has yet explored how to determine a reasonable point at which to stop iterating.

Simplified SFT + Alignment

Current methods typically perform SFT and alignment sequentially. However, this approach often leads to catastrophic forgetting and makes the overall training process more laborious. The PAFT method mitigates catastrophic forgetting by fine-tuning for SFT and alignment separately and then fusing the two models, at the cost of added complexity. In contrast, the ORPO technique integrates the two processes into one, but this can degrade performance. How to effectively combine SFT and alignment to achieve high performance while maintaining efficiency remains an open challenge.

See the original paper for more details.