
OpenAI's Lilian Weng on LLM "extrinsic hallucination": a 10,000-word explanation of what causes hallucinations...

2024-07-15


Xi Feng, from Aofeisi
Quantum Bit | Official Account QbitAI

Hallucinations in large models can be divided into two kinds: intrinsic and extrinsic.

The latest blog post by OpenAI scientist Lilian Weng focuses on LLM extrinsic hallucination.



Rather than the broad usage, where hallucination refers to any model output that is unfaithful, fabricated, inconsistent, or nonsensical, Weng narrows the LLM "hallucination" problem to cases where the model output is fabricated and not grounded in either the provided context or world knowledge.

Thus, hallucinations are of two types:

  • In-context hallucination: the model output should be consistent with the source content in the context; an in-context hallucination occurs when the output contradicts that source.
  • Extrinsic hallucination: the model output should be grounded in the pre-training dataset. However, given the size of the pre-training corpus, retrieving it and checking for conflicts at every generation is too expensive. If we regard the pre-training corpus as a proxy for world knowledge, we are essentially trying to ensure that the model output is factual and verifiable against external world knowledge. Equally important, when the model does not know a fact, it should say so.



Previously, Weng also proposed the Agent formula: Agent = large model + memory + planning + tool use, which some netizens called "the best article about Agents I have ever seen."





This new blog post on LLM hallucination is similarly weighty: the article is very long and cites 24 references:



Weng focuses on extrinsic hallucination and discusses three questions: what causes hallucinations, how to detect them, and how to resist them.



Quantum Bit has compiled and edited the original text without changing its meaning.

Quantum Bit has obtained the original author's permission to translate and republish it.

The original text is here:

https://lilianweng.github.io/posts/2024-07-07-hallucination/

What causes hallucinations?

Since a standard deployable LLM goes through pre-training and then fine-tuning for alignment and other improvements, the causal analysis looks at both stages.

The problem with pre-training data

Pre-training datasets are designed to represent all available world knowledge in written form and are therefore very large.

Scraping the public internet is the most common choice, so some outdated, missing, or incorrect information inevitably appears. Because the model can incorrectly memorize such information simply by maximizing log-likelihood, it can be expected to make mistakes.

Fine-tuning new knowledge

Fine-tuning pre-trained LLMs with supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) is a common way to improve specific capabilities of the model, such as instruction following. The fine-tuning stage inevitably introduces new knowledge.

Fine-tuning usually consumes far less compute, and whether a model can reliably learn new knowledge through small-scale fine-tuning is still debatable.

In a study this year, Gekhman et al. discussed the question of whether fine-tuning the LLM with new knowledge could lead to hallucinations.

They found that the LLM learned fine-tuned examples with new knowledge more slowly than examples consistent with the model’s pre-existing knowledge, and that once it learned these examples with new knowledge, the model’s tendency to hallucinate increased.

Specifically, given a closed-book question-answering dataset (EntityQuestions) D = {(q, a)}, Correct(q, a; M, T) is defined as an estimate of how likely model M is to generate the correct answer a to question q, when prompted with random few-shot examples at decoding temperature T.

They divided the examples into four categories based on conditions on Correct(q, a; M, T): a Known group (with three subgroups: HighlyKnown, MaybeKnown, and WeaklyKnown) and an Unknown group.
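
Roughly, the four categories can be written as follows (a paraphrase of Gekhman et al.'s definitions; the exact formulation in the paper may differ):

```latex
% P(T) is shorthand for Correct(q, a; M, T)
\begin{aligned}
\text{HighlyKnown} &: \; P(T{=}0) = 1 \\
\text{MaybeKnown}  &: \; 0 < P(T{=}0) < 1 \\
\text{WeaklyKnown} &: \; P(T{=}0) = 0 \;\text{ and }\; P(T{>}0) > 0 \\
\text{Unknown}     &: \; P(T) = 0 \;\text{ for all } T \geq 0
\end{aligned}
```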



Some interesting observations from the experiments, where dev-set accuracy is taken as a proxy for hallucination:

  • Unknown examples are fitted noticeably more slowly than Known ones;
  • The best dev-set performance is achieved when the LLM fits most of the Known training examples but only a few of the Unknown ones;
  • The model starts to hallucinate once it has learned most of the Unknown examples.



These results from Gekhman et al. point out the risks of using supervised fine-tuning to update LLM knowledge.

Hallucination Detection

Retrieval-augmented evaluation

To quantify model hallucinations, Lee et al. (2022) introduced a new benchmark dataset, FactualityPrompt, which contains both factual and non-factual prompts and uses Wikipedia documents or sentences as the factual grounding knowledge base.

Wikipedia documents are known ground truth from the FEVER dataset, while sentences are selected by tf-idf or similarity based on sentence embeddings.



Given a model continuation and the paired Wikipedia text, two hallucination metrics are considered: the hallucinated named entity (NE) error rate and the entailment ratio.

A higher NE error rate and a lower entailment ratio indicate lower factuality; both metrics are found to correlate with human annotations, and larger models perform better on this benchmark.

In addition, Min et al. (2023) proposed FActScore, which decomposes a long generation into multiple atomic facts and verifies each one individually against a knowledge base such as Wikipedia. The fraction (precision) of generated facts supported by the knowledge source can then be measured, and FActScore is this precision averaged over a set of prompts.
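
In symbols, a sketch of the definition (notation is mine, not necessarily the paper's):

```latex
% A_y: atomic facts in generation y = M(x); C: knowledge source (e.g. Wikipedia)
\mathrm{FActScore}(M) \;=\; \mathbb{E}_{x \sim \mathcal{X}}
\left[ \frac{1}{|A_y|} \sum_{a \in A_y} \mathbb{1}\,[\, a \text{ is supported by } C \,] \right],
\qquad y = M(x)
```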

The paper experiments with several fact-verification methods on the biography generation task and finds that using retrieval consistently outperforms the no-context LLM; among the retrieval-augmented methods, the best estimator depends on the model.

  • No-context LLM: prompt the LLM directly with "<atomic fact> True or False?" without additional context.
  • Retrieve → LLM: prompt with relevant passages retrieved from the knowledge source as context.
  • Nonparametric probability (NP): compute the average likelihood of the tokens in the atomic fact with a masked LM and use it for prediction.
  • Retrieve → LLM + NP: an ensemble of the two methods.

Some interesting observations about the model's hallucination behavior:

  • In the biography generation task, the error rate is higher for rarer entities
  • Facts mentioned later in the generated content also have a higher error rate
  • Using retrieval to provide a basis for model generation can significantly help reduce hallucinations

Wei et al. (2024) also proposed a method for evaluating the factuality of long-form LLM output, called SAFE (Search-Augmented Factuality Evaluator).

The main difference from FActScore is that SAFE uses a language model as an agent that iteratively issues Google Search queries in a multi-step process and reasons about whether the search results support the fact in question.

In each step, the agent generates a search query based on the fact to be checked and the search results obtained previously. After several steps, the model performs reasoning to determine whether the fact is supported by the search results.

In experiments, SAFE outperforms human annotators despite being 20 times cheaper: it agrees with humans 72% of the time, and when they disagree, SAFE is right 76% of the time.



SAFE's evaluation metric is F1@K. For a long, factual model response, the metric should ideally capture both precision and recall, because the response should be both:

  • Factual: measured by precision, i.e., the percentage of supported facts among all facts in the response.
  • Long: measured by recall, i.e., the percentage of provided facts out of all the relevant facts that should appear in the response; the count of supported facts is therefore only considered up to a maximum of K.

Given the model response, the metric F1@K is defined as:
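
Since the original formula image is not reproduced here, the following is a reconstruction of the definition from Wei et al. (2024); notation may differ slightly from the original. Let S(y) be the number of supported facts and N(y) the number of not-supported facts in response y:

```latex
\mathrm{Prec}(y) = \frac{S(y)}{S(y) + N(y)}, \qquad
R_K(y) = \min\!\left(\frac{S(y)}{K},\, 1\right)

F_1@K \;=\;
\begin{cases}
\dfrac{2\,\mathrm{Prec}(y)\,R_K(y)}{\mathrm{Prec}(y) + R_K(y)} & \text{if } S(y) > 0,\\[2ex]
0 & \text{if } S(y) = 0.
\end{cases}
```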





In addition, Chern et al. (2023) proposed FacTool, a fact-checking workflow that follows a standard pipeline and is designed to detect factual errors across a variety of tasks, including knowledge-based question answering, code generation, math problem solving, and scientific literature review. The steps (sketched in code after the list) include:

  • Claim extraction: extract all verifiable claims by prompting the LLM.
  • Query generation: convert each claim into a list of queries suitable for external tools, such as search-engine queries, unit test cases, code snippets, and paper titles.
  • Tool querying and evidence collection: query external tools such as search engines, code interpreters, or Google Scholar, and collect the returned results.
  • Agreement verification: assign each claim a binary factuality label based on how well the evidence from the external tools supports it.
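
For illustration, here is a minimal sketch of this kind of claim → query → evidence → verdict loop. The function names, prompts, and the `llm`/`web_search` callables are hypothetical placeholders, not FacTool's actual API:

```python
# Hypothetical sketch of a FacTool-style fact-checking loop for knowledge-based QA.
# `llm` and `web_search` stand in for an LLM client and a search tool.

def check_response(llm, web_search, response: str) -> list[dict]:
    # 1. Claim extraction: ask the LLM for verifiable atomic claims.
    claims = llm(f"List every verifiable factual claim in the text, one per line:\n{response}").splitlines()

    results = []
    for claim in claims:
        # 2. Query generation: turn the claim into search queries.
        queries = llm(f"Write two web search queries to verify: {claim}").splitlines()

        # 3. Tool querying and evidence collection.
        evidence = [snippet for q in queries for snippet in web_search(q, top_k=3)]

        # 4. Agreement verification: binary factuality label given the evidence.
        verdict = llm(
            "Given the evidence below, answer SUPPORTED or NOT_SUPPORTED.\n"
            f"Claim: {claim}\nEvidence:\n" + "\n".join(evidence)
        ).strip()
        results.append({"claim": claim, "label": verdict, "evidence": evidence})
    return results
```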



Sampling-based detection

Manakul et al. (2023) proposed SelfCheckGPT, which identifies factual errors by checking the consistency of multiple samples drawn from a black-box LLM.

Whereas grey-box fact-checking measurements require access to the LLM's token-level log-probabilities, SelfCheckGPT only needs samples, so black-box API access is sufficient and no external knowledge base is required.

The method measures the consistency between the model's response and a number of other stochastic model samples using different metrics, including BERTScore, NLI, and prompting (asking yes/no). In experiments on WikiBio passages generated by GPT-3, the prompting variant of SelfCheckGPT performs best.
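
A rough sketch of the prompt-based variant is below (my own simplification; the real SelfCheckGPT scoring and prompts are more involved, and `llm` is a placeholder for any black-box generation API):

```python
# Simplified sketch of a SelfCheckGPT-style prompt-based consistency check.
# `llm(prompt, temperature)` is a placeholder for a black-box text-generation call.

def selfcheck_scores(llm, question: str, response_sentences: list[str], n_samples: int = 5) -> list[float]:
    # Draw N stochastic samples for the same question.
    samples = [llm(question, temperature=1.0) for _ in range(n_samples)]

    scores = []
    for sentence in response_sentences:
        # The hallucination score of a sentence is the fraction of samples
        # that do NOT support it.
        unsupported = 0
        for sample in samples:
            answer = llm(
                f"Context: {sample}\nSentence: {sentence}\n"
                "Is the sentence supported by the context? Answer Yes or No.",
                temperature=0.0,
            )
            unsupported += answer.strip().lower().startswith("no")
        scores.append(unsupported / n_samples)
    return scores  # higher = more likely hallucinated
```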



Calibrating the unknown

Asking a model to answer unanswerable or unknowable questions can induce hallucination. TruthfulQA (Lin et al., 2021) and SelfAware (Yin et al., 2023) are two benchmarks that measure how truthfully a model responds in such situations: the former is adversarially constructed to highlight common human misconceptions, and the latter contains questions that are inherently unanswerable.

When faced with such questions, the model should refuse to answer or provide only the relevant information it can support.

In TruthfulQA, test questions are adversarially designed based on common human misconceptions or mistakes. The benchmark contains 817 questions covering 38 topics including health, law, finance, and politics.

At the time of testing, the best LLM reached 58% accuracy, while humans achieved 94%. The team found that larger models are less truthful because of common misconceptions, but this trend does not show up on other standard (non-adversarial) factuality benchmarks.

Here are some examples of incorrect answers from GPT-3 on TruthfulQA:



Yin et al. (2023) studied the concept of SelfAware, i.e., whether language models know what they do and do not know.

SelfAware contains 1032 unanswerable questions and 2337 answerable questions in five categories. The unanswerable questions are collected from online forums with human annotations, and the answerable questions are collected from SQuAD, HotpotQA, and TriviaQA.

A question may be unanswerable for various reasons, such as lack of scientific consensus, imagining the future, being completely subjective, philosophical reasons that may produce multiple responses, etc.

The study treated the distinction between answerable and unanswerable questions as a binary classification task and used the F1 score or accuracy to evaluate the performance of the model. Experiments showed that larger models performed better on this task.



Another way to assess how well a model knows about the unknown is to measure the uncertainty in the model output. When a problem is somewhere between what is known and what is unknown, the model should show the right confidence.

Kadavath et al. (2022) showed that on multiple-choice questions with visible lettered answer options in various formats (MMLU, TruthfulQA, QuALITY, LogiQA), LLMs are well calibrated at estimating the probability that an answer is correct, i.e., the predicted probability matches how often the answer is actually true.

RLHF fine-tuning results in a poorer model calibration, but higher sampling temperatures lead to better calibration results.



Lin et al. (2022) proposed the CalibratedMath task suite: programmatically generated math problems of varying difficulty that test how well a model's output probabilities are calibrated.

For each question, the model must provide a numerical answer and its confidence in that answer. Three types of probabilities are considered:

  • Verbalized: a number or word expressed in text (e.g., "lowest", "low", "medium", "high", "highest"), such as "Confidence: 60% / Medium".
  • Normalized log probability of the answer tokens (note that this one was not used in the fine-tuning experiments).
  • Log probability of an indirect "True/False" token appended after the raw answer.

The experiments focus on how well calibration generalizes under distribution shifts in task difficulty or content. Each fine-tuning data point is a question, the model's answer (which may be wrong), and a calibrated confidence. Verbalized probability generalizes well under both kinds of shift, and all setups transfer well to the multiply-divide task shift. Few-shot prompting is weaker than the fine-tuned models at predicting confidence; including more examples helps, and 50-shot is almost as good as the fine-tuned version.



Indirect query

Agrawal et al. (2023) specifically studied hallucinated references in LLM generations, including fabricated book, article, and paper titles. They used two consistency-based methods to detect them, direct queries and indirect queries; both run the check multiple times at T > 0 and verify consistency.



A direct query asks the model whether the generated reference exists, while an indirect query asks for auxiliary details, e.g. who the authors of the reference are.

The hypothesis is that, for a hallucinated reference, the chance of repeatedly generating the same authors across samples is lower than the chance that repeated direct queries will all claim the reference exists.

Experiments show that the indirect query method works better, and that larger models are more capable and hallucinate less.

Ways to fight hallucinations

Next, we review a set of methods for improving the factuality of LLM responses, ranging from retrieval of external knowledge bases and special sampling methods to alignment fine-tuning. Interpretability methods that reduce hallucination through neuron editing are not discussed here.

RAG → Editing and Attribution

RAG (Retrieval-Augmented Generation) is a very common way to provide grounding information: retrieve relevant documents and then generate with those documents as additional context.

RARR (Retrofit Attribution using Research and Revision) is a framework proposed by Gao et al. (2022) that retroactively allows an LLM output to be attributed to external evidence through research and revision.

Given a model-generated text x, RARR processes it in two stages, outputting a revised text y and an attribution report A:

1. Research phase: find relevant documents as evidence.

A query generation model (via few-shot prompting, x → q1, ..., qN) first constructs a set of search queries q1, ..., qN to verify every aspect of each sentence.
Run a Google search for each query, keeping K = 5 results per query qi.
A pre-trained query-document relevance model assigns relevance scores, and only the single most relevant (J = 1) document ei1, ..., eiJ is retained for each query qi.

2. Revision phase: edit the output to correct content not supported by the evidence, while preserving the original content as much as possible. Initialize the revised text as y = x.

For each evidence pair (qi, eij), an agreement model (via few-shot prompting + CoT, (y, q, e) → {0, 1}) checks whether the evidence e disagrees with the current revised text y.

Only when a disagreement is detected does an edit model (via few-shot prompting + CoT, (y, q, e) → new y) output a new version of y that aims to agree with the evidence while changing y as little as possible.

Finally, only a limited number (M = 5) of evidence snippets go into the attribution report A.
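
A compressed sketch of the two stages, under the assumption that the few-shot-prompted modules and retrieval components are available as callables (names are hypothetical, not RARR's actual code):

```python
# Hypothetical sketch of RARR's research-then-revise loop.
# query_gen, agreement_model, edit_model stand in for the few-shot-prompted LLM modules;
# google_search and relevance_model for the retrieval components.

def rarr(x: str, query_gen, google_search, relevance_model, agreement_model, edit_model, m_report: int = 5):
    # Research stage: queries -> evidence snippets.
    queries = query_gen(x)                                   # x -> q_1..q_N
    evidence = []
    for q in queries:
        results = google_search(q, k=5)                      # K = 5 results per query
        best = max(results, key=lambda e: relevance_model(q, e))
        evidence.append((q, best))                           # keep J = 1 document per query

    # Revision stage: edit y only where the evidence disagrees with it.
    y = x                                                    # initialize revised text y = x
    report = []
    for q, e in evidence:
        if not agreement_model(y, q, e):                     # disagreement detected
            y = edit_model(y, q, e)                          # minimal edit so y agrees with e
        report.append((q, e))
    return y, report[:m_report]                              # attribution report with up to M = 5 snippets
```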



When evaluating a revised text, both attribution and retention are important.

Attribution uses an AIS (Attributed to Identified Source) score to measure how much of a piece of content is attributable. Human annotations can be collected or an NLI model can be used to approximate automatic AIS scoring.

Preservation measures how much of the original text is preserved, computed as Prev_intent × Prev_Lev, where Prev_intent requires human annotation and Prev_Lev is based on the character-level Levenshtein edit distance. Compared with two baselines, RARR yields a better balance, especially on the preservation metric.

Similar to RARR's search-then-edit approach, FAVA (Factuality Verification with Augmented Knowledge), proposed by Mishra et al. (2024), also retrieves relevant documents and then edits the model output to remove hallucination errors. The FAVA model consists of a retriever and an editor.

Given a prompt x and a model output y, the retriever selects the most relevant documents: d = Retrieve(x, y).



The editor then generates an augmented (corrected) output: y* = Edit(x, y, d).



RARR requires no training, but the editor model in FAVA requires fine-tuning. With a more detailed taxonomy of hallucination error types, synthetic training data for the editor can be generated by inserting random errors into model generations.

Each example is a triplet (c, y, y*), where c is the original Wikipedia paragraph serving as the gold context, y is the LM output containing errors, and y* is the output annotated with error tags and correct edits.



RR (Rethinking with Retrieval), proposed by He et al. (2022), also relies on retrieving relevant external knowledge, but does not involve additional editing.

Rather than using a search query generation model, RR's retrieval is based on decomposed CoT prompts.

Given an input prompt Q, RR uses CoT prompting to generate multiple reasoning paths R1, ..., RN at temperature > 0, where each reasoning path Ri consists of an explanation Ei (the reasoning part) followed by a prediction Pi (the actual model output). External knowledge K1, ..., KM is retrieved to support each explanation. Then the most faithful answer is selected based on how well the predictions fit the retrieved knowledge K1, ..., KM.

  • Knowledge retrieval: RR's experiments use sparse retrieval with BM25 over Wikipedia, then rerank by embedding cosine similarity from a pre-trained MPNet model.
  • Faithfulness score: the faithfulness of each reasoning path is estimated by a combination of an entailment score, a contradiction score, and MPNet similarity; the entailment and contradiction scores are provided by a pre-trained NLI model.



Self-RAG (Asai et al., 2024) trains a language model end-to-end so that it learns to reflect on its own generation, outputting both the task output and interleaved special reflection tokens.

The team created a supervised dataset for the critic and generator models by prompting GPT-4, then distilled it into an in-house model to reduce inference costs.



Given an input prompt, the generated output consists of multiple segments (e.g., one segment per sentence). There are four types of reflection tokens: one for retrieval and three for critique:

  • Retrieve: decides whether to run retrieval in parallel to obtain a set of documents; output values: {yes, no, continue}.
  • IsRel: whether the prompt and the retrieved document are relevant; output values: {relevant, irrelevant}.
  • IsSup: whether the output text is supported by the retrieved document; output values: {fully supported, partially supported, no support}.
  • IsUse: whether the output text is useful; output values: {5, 4, 3, 2, 1}.

Self-RAG generates one segment at a time. Given the input x and the preceding generation y<t, the model first decodes the Retrieve token:

  • If Retrieve == no, it generates the segment directly;
  • If Retrieve == yes, the model retrieves multiple passages in parallel and uses the IsRel token to check whether each retrieved document is relevant. If so, candidate segments are generated and the remaining critique tokens are used to score, rank, and select the best among the multiple outputs (a simplified decoding loop is sketched below).
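
A simplified sketch of that decode-time control flow (names are illustrative and the reflection-token handling is heavily abbreviated; this is not Self-RAG's actual implementation):

```python
# Illustrative sketch of Self-RAG-style segment-by-segment decoding with reflection tokens.
# `generator` emits segments plus reflection-token judgments; `retriever` returns candidate passages.

SUPPORT_SCORE = {"fully supported": 2, "partially supported": 1, "no support": 0}

def self_rag_generate(generator, retriever, x: str, max_segments: int = 8) -> str:
    output = []
    for _ in range(max_segments):                                  # stopping criteria omitted for brevity
        context = x + "".join(output)
        if generator.decode_retrieve_token(context) == "no":
            seg = generator.generate_segment(context)              # Retrieve == no: generate directly
        else:
            scored = []
            for d in retriever(context):                           # Retrieve == yes: fetch passages in parallel
                if generator.is_rel(context, d) != "relevant":     # IsRel: drop irrelevant documents
                    continue
                cand = generator.generate_segment(context, doc=d)
                score = SUPPORT_SCORE[generator.is_sup(cand, d)] + generator.is_use(x, cand)  # IsSup + IsUse
                scored.append((score, cand))
            seg = max(scored)[1] if scored else generator.generate_segment(context)
        output.append(seg)
    return "".join(output)
```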

Chain of Actions

Without retrieving external knowledge, hallucinations can be reduced by a process that uses the model itself for verification and revision.

Dhuliawala et al. (2023) proposed Chain-of-Verification (CoVe), a method based on a chain of actions for planning and executing verification. CoVe has four core steps:

  • Baseline response: the model produces an initial draft response, called the "baseline".
  • Plan verifications: based on this initial generation, the model designs non-templated verification questions for fact-checking; this can be done via few-shot prompting with (response, verification questions) examples.
  • Execute verifications: the model answers these questions independently. There are several setup variants:

1) Joint: combined with step 2, where the few-shot examples are structured as (response, verification questions, verification answers); the drawback is that the original response remains in context, so the model may repeat similar hallucinations.

2) Two-step: separate the verification planning and execution steps, so that the original response does not influence the verification answers.

3) Factored: answer each verification question separately. For example, if a long baseline generation yields multiple verification questions, each one is answered on its own.

4) Factored + revise: add a "cross-checking" step after factored verification, conditioned on both the baseline response and the verification question-answer pairs, to detect inconsistencies.

  • Final verified response: generate the final, refined output; any detected inconsistencies are corrected in this step. (A minimal sketch of the factored variant follows below.)
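
Here is a minimal sketch of the factored variant, with illustrative prompts and a placeholder `llm` call rather than the paper's exact setup:

```python
# Illustrative sketch of Chain-of-Verification (factored variant).
# `llm(prompt)` is a placeholder chat-completion call returning text.

def cove(llm, question: str) -> str:
    # 1. Baseline response.
    baseline = llm(question)

    # 2. Plan verification questions from the draft.
    plan = llm(f"Draft answer:\n{baseline}\n\nList short fact-checking questions for this draft, one per line.")
    verif_questions = [q for q in plan.splitlines() if q.strip()]

    # 3. Execute verification: answer each question independently,
    #    WITHOUT the baseline in context, to avoid repeating its hallucinations.
    verif_answers = [llm(q) for q in verif_questions]

    # 4. Final verified response conditioned on the Q/A pairs.
    qa = "\n".join(f"Q: {q}\nA: {a}" for q, a in zip(verif_questions, verif_answers))
    return llm(
        f"Original question: {question}\nDraft answer:\n{baseline}\n"
        f"Verification results:\n{qa}\n"
        "Rewrite the draft, fixing anything inconsistent with the verification results."
    )
```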

CoVe is designed this way because generating a long verification chain in one pass can reproduce hallucinations: the initial hallucinated response is still in context and can be attended to during the new generation, and answering each verification question separately was found to yield better results than long-form generation.



Here are some interesting observations from the CoVe experiments:

  • Instruction tuning and CoT do not reduce hallucinations.
  • Factored and two-step CoVe improve performance, and adding explicit reasoning for inconsistency detection helps further (the "factored + revise" approach).
  • Short-form verification questions are answered more accurately than long-form ones.
  • Free-form verification questions generated by the LLM work better than heuristic ones (e.g., "Does X answer the question?"), and questions requiring open-ended answers work better than yes/no questions.

In addition, Sun et al. (2023) proposed RECITE, which uses recitation as an intermediate step to improve the factual correctness of model generations and reduce hallucination.

The motivation is to use the Transformer's memory as an information-retrieval mechanism. In RECITE's recite-and-answer scheme, the LLM is first asked to recite the relevant information and then to generate the answer.

Specifically, few-shot in-context prompting can be used to teach the model to recite and then generate answers conditioned on the recitations. This can further be combined with a self-consistency ensemble over multiple samples and extended to multi-hop question answering.



The generated recitations are comparable to a BM25-based retrieval model, but both fall short of using ground-truth passages. According to the team's error analysis, about 7-10% of questions had correct recitations but failed to produce the correct answer, while about 12% had incorrect recitations yet were still answered correctly.

Sampling Method

Lee et al. (2022) found that nucleus sampling (top-p sampling) performs worse than greedy sampling on the FactualityPrompt benchmark: the extra randomness it introduces hurts factuality, even though it achieves better diversity and fewer repetitions.

They therefore proposed factual-nucleus sampling, based on the hypothesis that the randomness of sampling harms factuality more in the latter part of a sentence than at its beginning. Factual-nucleus sampling dynamically adjusts the nucleus probability p as each sentence is sampled: for the t-th token in a sentence, p_t = max(ω, p · λ^(t-1)), where the lower bound ω prevents the procedure from decaying into greedy sampling, which would harm generation quality and diversity.
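
A toy top-p filter implementing this decay schedule is sketched below; resetting the token index at each sentence boundary is assumed to happen in the caller:

```python
import torch

def factual_nucleus_filter(logits: torch.Tensor, t: int,
                           p: float = 0.9, lam: float = 0.9, omega: float = 0.3) -> torch.Tensor:
    """Top-p filter with a per-token nucleus p_t = max(omega, p * lam**(t-1)).

    `t` is the 1-based index of the token within the current sentence; the caller
    is assumed to reset t to 1 at every sentence boundary (an assumption of this sketch).
    """
    p_t = max(omega, p * lam ** (t - 1))                  # nucleus shrinks as the sentence progresses
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    probs = torch.softmax(sorted_logits, dim=-1)
    cum_probs = probs.cumsum(dim=-1)
    # Drop tokens whose cumulative probability (excluding themselves) already reaches p_t;
    # the top-1 token is always kept.
    drop = (cum_probs - probs) >= p_t
    sorted_logits = sorted_logits.masked_fill(drop, float("-inf"))
    filtered = torch.full_like(logits, float("-inf"))
    return filtered.scatter(-1, sorted_idx, sorted_logits)
```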



Li et al. (2023) proposed Inference-Time Intervention (ITI), which investigates whether certain attention heads are more correlated with factuality by linearly probing the activations of each layer to separate truthful from untruthful outputs.

They found that for many attention heads the probes perform no better than chance, while some show strong performance. After identifying a sparse set of attention heads with high linear probing accuracy for truthfulness, ITI shifts the activations of the top-selected heads along the "truthful" direction at inference time.



Fine-tuning for factuality

Lee et al. (2022) proposed two ideas for factuality-enhanced training:

  • Introduce TopicPrefix for better awareness of facts: prepend the topic (i.e., the Wikipedia document title) to every sentence in that document.
  • Use a sentence-completion loss as the training objective: update the training loss to focus on the second half of each sentence, on the assumption that the latter half contains more factual knowledge. The implementation is simple: choose a pivot point t and apply a zero mask to all tokens before the t-th token. In their experiments, the best pivot was 0.5× the sentence length (see the sketch after this list).
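
A toy illustration of masking the loss before the pivot, assuming a standard causal-LM setup where label -100 is ignored by the loss (PyTorch's CrossEntropyLoss default ignore_index):

```python
import torch

def sentence_completion_labels(input_ids: torch.Tensor, sentence_start: int, sentence_end: int,
                               pivot_ratio: float = 0.5) -> torch.Tensor:
    """Ignore the loss on tokens before the pivot of one sentence.

    input_ids: (seq_len,) token ids; sentence_start/sentence_end delimit one sentence.
    Tokens labeled -100 are skipped by torch.nn.CrossEntropyLoss(ignore_index=-100).
    """
    labels = input_ids.clone()
    pivot = sentence_start + int(pivot_ratio * (sentence_end - sentence_start))  # best pivot ~0.5x length
    labels[sentence_start:pivot] = -100   # mask the first half: loss applies only to the second half
    return labels
```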

Lin et al. (2024) proposed factuality-focused SFT + RLHF alignment training, named FLAME.

  • SFT stage (factuality-aware SFT): the goal is to generate training data that is more factual than the model's own generations (as measured by FActScore).
  • RLHF stage (factuality-aware DPO): two methods were tested. Method 1 performed poorly while Method 2 performed reasonably well, probably because Method 1 tries to distill new knowledge into the model without sufficient training.

As mentioned earlier, there is some evidence that fine-tuning on new knowledge can cause hallucinations, and RAG-based supervision contains information unknown to the LLM.

Method 1: use RAG data samples as positive examples and the model's own generations as negative examples for the reward-model data.

Method 2: use FActScore as the factuality reward signal.



To avoid accidentally distilling unknown knowledge into the model during alignment training, they propose constructing the SFT/DPO datasets from responses generated by the model itself.



Tian & Mitchell et al. (2024) proposed factuality tuning, which also relies on fine-tuning a language model for better factuality. They experimented with different ways of estimating the truthfulness of the atomic claims in each model sample and then ran DPO.



The factuality tuning process:

1. Sample pairs of model completions for a given set of prompts (e.g., "Write a bio of Yo-Yo Ma").

2. Annotate them for truthfulness using two methods that require no human labeling:

Reference-based: check whether the model's claims are supported by an external knowledge base, similar to the retrieval-based hallucination evaluation above: (a) extract a list of atomic claims; (b) find Wikipedia references; (c) use a small fine-tuned NLI model to check whether the reference text supports each atomic claim.

Non-reference-based: use the model's own confidence as a proxy for truthfulness, similar to the indirect query approach: (a) convert each claim into a corresponding question (careful rephrasing is needed to keep the question unambiguous; few-shot prompting is used); (b) sample the model multiple times to answer the question; (c) compute an aggregated score, using string matching or asking GPT to judge whether two answers are semantically equivalent.

3. Build a preference dataset by generating multiple samples from the model and assigning preferences according to the truthfulness scores, then fine-tune the model with DPO on this dataset (a sketch follows).
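
A sketch of turning truthfulness scores into DPO preference pairs; the sampling, scoring, and training functions are placeholders rather than the authors' actual pipeline:

```python
from itertools import combinations

def build_factuality_preferences(prompts, sample_fn, truthfulness_fn, n_samples: int = 4):
    """For each prompt, sample several completions, score their truthfulness,
    and emit (prompt, chosen, rejected) pairs for DPO fine-tuning."""
    pairs = []
    for prompt in prompts:
        candidates = [sample_fn(prompt) for _ in range(n_samples)]
        scored = [(truthfulness_fn(prompt, c), c) for c in candidates]   # e.g. fraction of supported atomic claims
        for (s_a, a), (s_b, b) in combinations(scored, 2):
            if s_a == s_b:
                continue                                                 # skip ties
            chosen, rejected = (a, b) if s_a > s_b else (b, a)
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```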



Fine-tuning for attribution

Assigning attributions is a good way to reduce hallucinations when generating model outputs that rely on search results. There is a line of work that aims to train LLMs to better utilize retrieved content and assign high-quality attributions.

WebGPT, proposed by Nakano et al. (2022), combines web search for document retrieval with a fine-tuned GPT model, aiming to answer long-form questions while reducing hallucination and improving factual accuracy.

The model interacts with an internet search engine through a text-based web browser and learns to cite web pages when answering questions. While browsing, one action the model can take is to quote an excerpt from the current page; when it does, the page title, domain name, and excerpt are recorded for later use as a reference. The core of WebGPT is to use references to help humans judge factual correctness.

The model is first fine-tuned with supervision on demonstrations of humans answering questions using a web browsing environment for behavioral cloning.

Comparison data is then collected between pairs of model-generated answers to the same question (each with its own set of references), judged on factual accuracy, coherence, and overall usefulness. The reward model is used for RL training and for best-of-n rejection sampling. In practice, RL brings only a limited benefit, and the benefit is even smaller when rejection sampling is used.



GopherCite, proposed by Menick et al. (2022), is very similar to WebGPT in using a search engine to gather supporting material and teaching the model to supply references. Both use supervised fine-tuning for bootstrapping, and both apply RLHF training.

Unlike WebGPT, which relies on human demonstrations for behavior cloning, GopherCite generates demonstrations via few-shot prompting, with each generation conditioned on relevant documents in context, and then uses a reward model to score which generations are best.



Another trick for avoiding low-quality responses is to let the model decline to answer with the preset response "I don't know", decided by a global reward-model threshold; this is known as selective prediction.

The RL empirical results are similar to those of WebGPT, i.e., RL brings only limited improvements or no improvement when combined with rejection sampling.



Who is Lilian Weng?

Lilian Weng (Weng Li) is a Chinese scientist at OpenAI, one of the contributors to ChatGPT, and a graduate of Peking University.



She heads applied AI research at OpenAI. She joined OpenAI in 2018 and worked mainly on pre-training, reinforcement learning & alignment, model safety, and other aspects of the GPT-4 project.

In the safety advisory structure OpenAI established at the end of last year, she leads the Safety Systems team, which tackles problems such as reducing misuse of existing models like ChatGPT.