
OpenAI's "last" super alignment paper: large and small models play a game, and output readability improves

2024-07-18


Cressey from Aofei Temple
Quantum Bit | Public Account QbitAI

Pitting big and small models against each other can improve the readability of the generated content!

This is serious research from OpenAI, and its goal is to make the model output easier to understand while ensuring accuracy.

After training with this method, humans judge the model's output significantly more accurately, and faster.



The research, from OpenAI's now-disbanded super alignment team, builds on the prover-verifier game proposed in a 2021 paper by scholars at the University of Toronto and elsewhere.

During the training process, OpenAI lets the large model serve as the "prover" and the small model as the "verifier", allowing these two identities to compete against each other.

Ultimately, the large model's output not only becomes easier to understand with no significant loss in accuracy; the small model's judgment also improves.

Yining Chen, an OpenAI researcher and one of the paper's authors, even said that this method "has the potential to align future models that are smarter than humans."

It is worth mentioning that Jan Leike, the original head of the super alignment team and an author of RLHF, also took part in this research. When he left OpenAI, he criticized the company for not paying enough attention to safety.

However, at the end of his tweet introducing the paper, Jan Leike put in a recruiting plug for Anthropic (the maker of Claude), where he now works.



Another author of the paper, OpenAI researcher Nat McAleese, revealed in a tweet that this will be OpenAI's final super alignment paper.



The news got some netizens quite worked up, insisting this must not be the last paper: Nat, you'll keep doing super alignment, right?



Let the models compete with each other

OpenAI's research found that there is a trade-off between accuracy and readability in the content large models generate.

If you optimize for accuracy, readability will decrease; if you optimize for readability, accuracy will be lost.

Even on simple elementary-school math tasks, the gap is easy to see; OpenAI calls it a "legibility tax."



However, humans being able to understand what the model generates is an important prerequisite for judging it accurately, and therefore also matters for safety.

Even leaving this aside, the comprehensibility of the content directly determines whether the model can provide useful help to humans.

This research was inspired by the "prover-verifier" game proposed by University of Toronto scholars in 2021, and it trains the model in that game-theoretic setup.

The prover has powerful computing resources and can perform complex calculations and generate proofs; its task is to convince the verifier that a statement is correct.

The verifier can only perform relatively simple calculations and must rely on an interactive process to determine whether the proof provided by the prover is credible.

For example, take a math problem that asks for the expression of a function. The prover claims to know this expression (which is not necessarily correct) and wants the verifier to believe its answer is right.

The verifier knows the function's values at some inputs and can use them to test the prover, judging whether it knows the correct function by whether its answers match.
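To make this toy example concrete, here is a minimal Python sketch (not code from the paper; the function, sample points, and names are invented for illustration) of a verifier that can only spot-check the claimed expression at the few points it knows:

```python
# A minimal sketch of the function-guessing example above, NOT code from the
# paper; all names and values here are illustrative. The verifier only knows a
# few (x, f(x)) pairs and spot-checks the prover's claimed expression against them.

def verifier_accepts(claimed_f, known_points, tolerance=1e-9):
    """Verifier: cheap spot-checks against the few points it already knows."""
    return all(abs(claimed_f(x) - y) <= tolerance for x, y in known_points)

# Points the verifier has sampled in advance (it never sees the full expression).
known_points = [(0, 1), (1, 3), (2, 9)]  # consistent with f(x) = 3**x

honest_claim = lambda x: 3 ** x      # helpful prover: the correct expression
sneaky_claim = lambda x: 2 * x + 1   # sneaky prover: fits (0,1) and (1,3) but not (2,9)

print(verifier_accepts(honest_claim, known_points))  # True  -> verifier is convinced
print(verifier_accepts(sneaky_claim, known_points))  # False -> caught at x = 2
```

The cheap spot-check is the whole point: the verifier never needs the prover's computing power, it only needs enough information to catch an answer that does not hold up.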

In this way, OpenAI strives to achieve a balance between accuracy and readability, exchanging relatively small accuracy losses for higher readability.

For this project, the OpenAI team used a dataset containing elementary school math problems, solutions, and correct answers.

The models chosen were GPT-4-series models of different sizes, with the larger ones used as provers and the smaller ones as verifiers.

The dataset is randomly partitioned into two subsets of equal size for training the prover and the verifier, respectively.

The prover is given two roles: helpful and sneaky, which generate correct and incorrect solutions respectively.

What the two roles have in common is that both need the verifier to give their solutions high scores.


△How the sneaky prover evolves over multiple rounds of training

The two roles are not fixed; the prover takes turns playing them and is rewarded when its behavior is consistent with the role it is playing.

Training is divided into multiple rounds. In each round, the team samples a batch of correct and incorrect solutions from those generated by the two provers in the previous round.

These samples serve as the verifier's training data for the current round: the verifier is trained with supervised learning to judge more accurately, so that it scores correct solutions higher than incorrect ones.

In this process, the sneaky prover constantly pushes the verifier to judge more accurately, while the verifier pushes the prover to generate content that better matches its scoring preferences; the two improve together through the game.
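For readers who want the loop structure at a glance, here is a schematic Python sketch of the rounds just described. It is only a rough reading of the setup, not OpenAI's implementation: every model call is a trivial placeholder, and the reward logic is simplified.

```python
# Schematic sketch of the round-based training loop described above. This is a
# plausible reading of the setup, NOT OpenAI's implementation: every model call
# is replaced by a trivial placeholder, and the reward shaping is simplified.
import random

ROLES = ["helpful", "sneaky"]  # helpful -> correct solutions, sneaky -> incorrect ones

def prover_generate(problem, role):
    """Placeholder for the large-model prover writing a role-conditioned solution."""
    return {"problem": problem, "role": role, "is_correct": role == "helpful"}

def verifier_score(verifier_state, solution):
    """Placeholder for the small-model verifier scoring how convincing a solution looks."""
    p = verifier_state["p_correct_scores_higher"]
    return p if solution["is_correct"] else 1.0 - p

def supervised_update(verifier_state, labeled_batch):
    """Placeholder supervised step: nudge the verifier toward scoring
    correct solutions above incorrect ones."""
    verifier_state["p_correct_scores_higher"] = min(
        0.99, verifier_state["p_correct_scores_higher"] + 0.02 * len(labeled_batch)
    )
    return verifier_state

def prover_reward(solution, score):
    """Reward the prover only when the verifier is convinced AND the solution
    matches its assigned role (correct for helpful, incorrect for sneaky)."""
    role_consistent = solution["is_correct"] == (solution["role"] == "helpful")
    return score if role_consistent else 0.0

problems = [f"grade-school math problem #{i}" for i in range(8)]
verifier_state = {"p_correct_scores_higher": 0.5}

for round_idx in range(5):
    # 1) Sample correct and incorrect solutions from the two prover roles.
    batch = [prover_generate(p, random.choice(ROLES)) for p in problems]
    # 2) Train the verifier on this labeled batch (supervised learning).
    verifier_state = supervised_update(verifier_state, batch)
    # 3) Reinforce the prover against the updated verifier's scores.
    rewards = [prover_reward(s, verifier_score(verifier_state, s)) for s in batch]
    print(f"round {round_idx}: mean prover reward = {sum(rewards) / len(rewards):.2f}")
```

In this toy version, the sneaky prover's reward shrinks as the verifier gets harder to fool, which mirrors the dynamic the paper describes.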



An important prerequisite for earning a high score is that the verifier can understand the generated content. Thus, as long as human preferences are built into the verifier, the readability of the model's output can be improved.

Balance between accuracy and readability

Ultimately, this game-based training method achieves a good balance between high accuracy and high comprehensibility.

Traditional training methods focus only on the model's own accuracy: as training progresses, the answers become more accurate, but human judgments of them become less accurate and slower.



Because the OpenAI team built human preferences into the verifier, the verifier's judgments correlate to some extent with human judgments.

The final results confirmed that, as the number of training rounds increases, the prover trained with this game-based method lets humans judge its output faster and more accurately, without significantly losing accuracy itself.

Compared with the baseline method, after 5 rounds the accuracy of human judgment exceeds 82%, while the baseline stays below 75% even after 7,000 training steps;

In terms of time, the baseline human judgment took nearly 27 seconds, while the new method was only slightly more than 26 seconds.



In addition, for the verifier, as the training continues, its robustness gradually increases, and it can identify wrong answers more accurately.



In addition to evaluating the effectiveness of this game-based training method, the OpenAI team made some other discoveries during the research process:

  • The effect of verifier size: too small leads to unstable training, too large makes it hard to improve robustness over multiple rounds; a verifier between one-thousandth and one-hundredth of the prover's size works best.
  • The errors in the sneaky solutions become more subtle and localized as training proceeds, indicating that the verifier pushes the sneaky prover to generate adversarial examples that are closer to the true solution.
  • As the number of training rounds increases, human evaluators become less accurate at judging the sneaky prover's solutions.

One More Thing

Since the super alignment team that conducted this research, and that was responsible for safety, has now been disbanded, OpenAI's safety strategy has once again attracted widespread attention.

Regarding this issue, OpenAI CTO Mira Murati gave an exclusive interview at Johns Hopkins University earlier this month.

In the interview, Murati said that OpenAI did not "put product before safety," as Jan Leike, the former head of the super alignment team, had accused.

She also said that although the super alignment team has been disbanded, it was just one of OpenAI's several safety teams, and many people in the company still work on safety.



Reference Links:
[1]https://openai.com/index/prover-verifier-games-improve-legibility/
[2]https://venturebeat.com/ai/openai-used-a-game-to-help-ai-models-explain-themselves-better/
[3]https://x.com/__nmca__/status/1813646245602435542