
High-scoring paper at the inaugural Conference on Language Modeling (COLM): a preference search algorithm makes large model evaluation more efficient

2024-08-05




AIxiv is a column where Synced publishes academic and technical content. Over the past few years, the AIxiv column has received more than 2,000 submissions covering top laboratories at major universities and companies around the world, effectively promoting academic exchange and dissemination. If you have excellent work to share, please submit it or contact us for coverage. Submission email: [email protected]; [email protected]

Both authors of the article are from the Language Technology Laboratory at the University of Cambridge. Liu Yinhong is a third-year doctoral student supervised by Professors Nigel Collier and Ehsan Shareghi; his research interests include large models, text evaluation, and data generation. Zhou Han is a second-year doctoral student supervised by Professors Anna Korhonen and Ivan Vulić; his research focuses on efficient methods for large models.

Large models exhibit strong instruction-following and task-generalization capabilities, abilities that stem from training LLMs with instruction-following data and reinforcement learning from human feedback (RLHF). In the RLHF training paradigm, a reward model is aligned with human preferences using ranking-comparison data. This strengthens the alignment of LLMs with human values and yields responses that better help humans and adhere to human values.

Recently, the first COLM conference announced its acceptance results. One high-scoring paper analyzes the score biases that are difficult to avoid and correct when an LLM is used as a text evaluator, and proposes recasting evaluation as a preference-ranking problem. This leads to PairS, an algorithm that searches for a ranking from pairwise preferences. By exploiting preference uncertainty and the assumed transitivity of LLM preferences, PairS produces efficient and accurate preference rankings and shows higher agreement with human judgments on multiple test sets.



Paper link: https://arxiv.org/abs/2403.16950

Paper title: Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators

Github address: https://github.com/cambridgeltl/PairS

What are the problems with using large models as evaluators?

A large number of recent works have demonstrated the excellent performance of LLMs in evaluating text quality, forming a new reference-free paradigm for evaluating generation tasks that avoids expensive human annotation. However, LLM evaluators are highly sensitive to prompt design and are affected by multiple biases, including position bias, length bias, and context bias. These biases undermine the fairness and credibility of LLM evaluators, leading to inconsistency and misalignment with human judgment.



To reduce biased predictions, previous work has developed calibration techniques for LLM outputs. We first systematically analyze how effective these calibration techniques are at aligning pointwise LLM evaluators. As shown in Figure 2 above, even with supervised data, existing calibration methods still cannot align LLM evaluators well.

As shown in Equation 1, we argue that the main cause of evaluation misalignment is not a biased prior over the LLM's evaluation-score distribution, but a misaligned evaluation standard, that is, the likelihood term of the LLM evaluator. We believe that when performing pairwise evaluation, the LLM evaluator applies an evaluation standard more consistent with humans, so we explore a new LLM evaluation paradigm to promote more aligned judgments.
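Equation 1 itself is not reproduced in this article. By our reading it refers to the standard Bayesian factorization of pointwise scoring; the sketch below uses our own notation under that assumption and is not necessarily the paper's exact formulation:

```latex
% Sketch of the decomposition presumably behind Equation 1 (notation is ours).
% The pointwise evaluator's posterior over a score s for a text x factors as
% posterior ∝ likelihood × prior:
\[
  p_{\mathrm{LLM}}(s \mid x)
  \;\propto\;
  \underbrace{p_{\mathrm{LLM}}(x \mid s)}_{\text{evaluation standard (likelihood)}}
  \;\cdot\;
  \underbrace{p_{\mathrm{LLM}}(s)}_{\text{score prior}}
\]
% Calibration methods adjust the score prior p(s); the paper's claim is that
% the misalignment mainly lives in the likelihood, which calibration cannot fix.
```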



Inspiration from RLHF

As shown in Figure 1 below, inspired by how reward models in RLHF are aligned through preference data, we believe the LLM evaluator can produce predictions more aligned with humans by generating preference rankings. Recently, some work has begun to obtain preference rankings by letting LLMs perform pairwise comparisons. However, the complexity and scalability of ranking via pairwise evaluation have been largely ignored: without a transitivity assumption, the number of comparisons grows as O(N^2), making the evaluation process expensive and impractical.

PairS: An efficient preference search algorithm

In this work, we propose two pairwise preference search algorithms, PairS-greedy and PairS-beam. PairS-greedy assumes complete transitivity and is based on merge sort, so it obtains a global preference ranking with only O(N log N) comparisons. Transitivity means, for example, that for three candidates the LLM always prefers A ≻ C whenever A ≻ B and B ≻ C. Under this assumption, we can directly use a traditional sorting algorithm to recover a preference ranking from pairwise preferences, as sketched below.
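A minimal sketch of the PairS-greedy idea, assuming a hypothetical llm_prefers(a, b) callable that asks the LLM which of two texts it prefers; the function name and prompting details are ours, not from the paper:

```python
from typing import Callable, List

def merge_sort_by_preference(items: List[str],
                             llm_prefers: Callable[[str, str], bool]) -> List[str]:
    """Rank items with merge sort, using an LLM pairwise comparison as the
    ordering predicate. Assuming transitive preferences, this needs only
    O(N log N) LLM calls instead of the O(N^2) of all-pairs comparison."""
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    left = merge_sort_by_preference(items[:mid], llm_prefers)
    right = merge_sort_by_preference(items[mid:], llm_prefers)

    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        # One LLM call decides which candidate is preferred at this step.
        if llm_prefers(left[i], right[j]):
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

# Hypothetical usage: llm_prefers would wrap a prompt such as
# "Which summary is better, A or B?" and return True if A is preferred.
```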

However, LLM preferences are not perfectly transitive, so we designed the PairS-beam algorithm. Under a relaxed transitivity assumption, we derive and simplify a likelihood function for preference rankings. PairS-beam performs a beam search, scored by this likelihood, within each merge operation of the merge sort, and uses preference uncertainty to prune the space of pairwise comparisons. PairS-beam can trade comparison complexity against ranking quality, and efficiently yields a maximum likelihood estimate (MLE) of the preference ranking. Figure 3 below illustrates how PairS-beam performs a merge operation; a simplified sketch of such a beam merge follows.
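A simplified sketch of a beam-search merge step, assuming a hypothetical pref_prob(a, b) that returns the LLM's probability of preferring a over b. The scoring and pruning here are our simplification for illustration, not the paper's exact likelihood or its uncertainty-based pruning:

```python
import math
from typing import Callable, List, Tuple

def beam_merge(left: List[str], right: List[str],
               pref_prob: Callable[[str, str], float],
               beam_width: int = 4) -> List[str]:
    """Merge two already-ranked lists while keeping the beam_width partial
    merges with the highest cumulative log-likelihood of the chosen pairwise
    preferences (a simplified stand-in for the PairS-beam likelihood)."""
    # Each state: (log-likelihood, index into left, index into right, merged prefix)
    beam: List[Tuple[float, int, int, List[str]]] = [(0.0, 0, 0, [])]
    for _ in range(len(left) + len(right)):
        candidates = []
        for logp, i, j, prefix in beam:
            if i < len(left) and j < len(right):
                p = pref_prob(left[i], right[j])  # P(left[i] preferred over right[j])
                candidates.append((logp + math.log(max(p, 1e-9)),
                                   i + 1, j, prefix + [left[i]]))
                candidates.append((logp + math.log(max(1.0 - p, 1e-9)),
                                   i, j + 1, prefix + [right[j]]))
            elif i < len(left):   # right list exhausted: only one way to extend
                candidates.append((logp, i + 1, j, prefix + [left[i]]))
            else:                 # left list exhausted
                candidates.append((logp, i, j + 1, prefix + [right[j]]))
        # Keep only the beam_width most likely partial merges.
        beam = sorted(candidates, key=lambda s: s[0], reverse=True)[:beam_width]
    # Return the complete merged ordering with the highest likelihood.
    return max(beam, key=lambda s: s[0])[3]
```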



Experimental Results

We test on several representative datasets, including the closed-ended summarization tasks NewsRoom and SummEval and the open-ended story generation task HANNA, and compare against multiple pointwise LLM evaluation baselines, including unsupervised direct scoring, G-Eval, GPTScore, and the supervised, trained UniEval and BARTScore. As shown in Table 1 below, PairS achieves higher agreement with human scores on every task, and with GPT-4-turbo it even reaches SOTA results.

The paper also compares two baseline methods for obtaining preference rankings, win rate and ELO rating; PairS achieves the same ranking quality with only about 30% of their number of comparisons. The paper further provides insights on how pairwise preferences can be used to quantify the transitivity of LLM evaluators, and how pairwise evaluators benefit from calibration.

For more research details, please refer to the original paper.