
The final work of OpenAI's Superalignment team: two large models compete with each other, and the output becomes easier to understand

2024-07-18


Machine Heart Report

Synced Editorial Department

If the answers given by the AI model are completely incomprehensible, would you dare to use it?

As machine learning systems are applied in more important domains, it becomes increasingly important to justify why we can trust their outputs and to make clear when we should not trust them.

One possible way to gain trust in the output of a complex system is to require that the system produce an explanation of its output that is readable by a human or another trusted system, i.e., fully understandable enough that any possible errors can be detected. For example, to build trust in the judicial system, we require courts to produce legible written opinions that explain and support their decisions.

We can take a similar approach for large language models.

However, when taking this approach, it is very important to ensure that the language model produces understandable text, especially when dealing with complex tasks such as mathematics and coding.

For example, if you ask an AI to write a quicksort algorithm, it will produce one very quickly, and the answer will be very concise. But if you cannot write code yourself, how can you judge whether the AI got it right?
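To make this concrete, here is the kind of terse answer a model might return for the quicksort request. The Python snippet below is our own illustration, not an excerpt from OpenAI's paper:

```python
def quicksort(arr):
    # Lists of length 0 or 1 are already sorted.
    if len(arr) <= 1:
        return arr
    pivot = arr[0]
    # Partition the remaining elements around the pivot.
    smaller = [x for x in arr[1:] if x <= pivot]
    larger = [x for x in arr[1:] if x > pivot]
    return quicksort(smaller) + [pivot] + quicksort(larger)

print(quicksort([5, 2, 9, 1, 5, 6]))  # [1, 2, 5, 5, 6, 9]
```

The code is short and looks plausible, but checking the partitioning and the recursion requires exactly the programming knowledge the user may lack.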

OpenAI studied this question in a recent paper.

Paper title: Prover-Verifier Games Improve Legibility of LLM Outputs

Paper link: https://cdn.openai.com/prover-verifier-games-improve-legibility-of-llm-outputs/legibility.pdf

They found that when large language models are asked to generate answers with the goal of getting the right answer, the answers they produce can be so difficult to understand that human evaluators are twice as likely to make mistakes when judging these answers.

So their goal is to train advanced language models to generate text that simpler models can easily verify. They found that such text is not only friendly to simple models, but also more readable to humans.

So, how is this implemented? The new work builds on a 2021 paper, "Learning to Give Checkable Answers with Prover-Verifier Games", in which Cem Anil et al. proposed a game-theoretic framework called "Prover-Verifier Games". The framework consists of two competing learners: a trusted verifier network that tries to choose the correct answer, and a more powerful but untrusted prover network that tries to convince the verifier to accept a particular answer, regardless of whether it is correct. The goal of the framework is for the game to produce a reliable proof protocol.

OpenAI borrowed this framework in the new study, letting two large models (the stronger one as the prover, the weaker one as the verifier) compete with each other until they reach an equilibrium in which the prover gives answers that are both readable and reasonably correct.

OpenAI says that by improving the readability of model outputs, it can make the application of artificial intelligence more effective and trustworthy, which is beneficial for various fields that require precise communication.

It is worth mentioning that this paper comes from OpenAI's Superalignment team, several of whose authors left OpenAI earlier this year; the team itself has since been disbanded. One of the authors revealed that this is the team's last paper.

An intuitive example

In order to improve the readability of the model output, OpenAI optimized the chain of thought (CoT) of some elementary school math problems so that it can be understood by weaker models. For example:

Question: Shawna's father is five times as old as Shawna. Shawna is currently three times as old as Aliya. If Aliya is 3 years old, how old is Shawna's father?
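For reference, the arithmetic behind a legible step-by-step answer (our own worked example, not an excerpt from the paper) runs: Aliya is 3 years old, so Shawna is 3 × 3 = 9 years old, and Shawna's father is therefore 5 × 9 = 45 years old.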

The purpose of doing this is to verify whether the output of the model trained in this way is more direct and understandable.

This method improved the model's performance by 50% with only some optimizations to the training set. At the same time, the answers output by the model are easier to understand, which helps human users who need to evaluate the correctness of these answers to make more accurate judgments. OpenAI believes that further narrowing the performance gap between models with strong readability and models with strong capabilities is a key goal in future alignment research.

How do “provers” and “verifiers” compete during training?

To motivate the strong model (the prover) to generate more readable text, the researchers adopted a training scheme that requires the answers generated by the strong model to be easily verifiable by the weaker model (the verifier). Specifically, they used a large and a small model from the GPT-4 series, which differ by about three orders of magnitude in the amount of pre-training data. Both models received some training on human-annotated mathematical derivations, but neither underwent extensive post-training.

First, the research team asked the prover model to try to solve a series of mathematical problems. Since the prover model itself may make mistakes, some of the answers it generated may be incorrect. Next, they handed these answers to the verifier model for verification, alternating between the following two stages for multiple rounds:

“Verifier” training: The research team trained a weaker “verifier” model to verify whether the answer given by the “prover” is correct.

“Prover” training: The research team instructed the stronger “prover” to play either a “helpful” or a “sneaky” role.

When the “prover” is set to “helpful”, the research team rewards correct answers that score highly according to the “verifier”.

When the “prover” is set to “sneaky”, the research team rewards incorrect answers that nonetheless score highly according to the “verifier”.

The research team found that, at first, a sneaky “prover” could easily generate wrong answers that convinced the “verifier”. However, these wrong answers were fed into the next round of “verifier” training, so the same loopholes were no longer convincing in the following round, forcing the “prover” to explore different strategies in each round.

Training ends when the sneaky “prover” can no longer generate a wrong solution that convinces the “verifier”.
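To make the role-conditioned rewards above concrete, here is a minimal Python sketch of how the prover's reward could be computed in each round. It is our own simplification under assumed names (prover_reward, the role labels, a 0-to-1 verifier acceptance score), not OpenAI's actual implementation, which trains the prover with reinforcement learning against a separately trained verifier:

```python
def prover_reward(role, is_correct, verifier_score):
    """Reward for one sampled solution in the checkability game (sketch).

    role: "helpful" or "sneaky" (hypothetical labels for the two prover roles).
    is_correct: whether the solution matches the ground-truth answer.
    verifier_score: the verifier's acceptance score for the solution, in [0, 1].
    """
    if role == "helpful":
        # Helpful prover: only correct solutions that convince the verifier pay off.
        return verifier_score if is_correct else 0.0
    # Sneaky prover: only *incorrect* solutions that convince the verifier pay off.
    return verifier_score if not is_correct else 0.0


# A correct, convincing solution rewards the helpful prover...
print(prover_reward("helpful", True, 0.9))   # 0.9
# ...while the sneaky prover is rewarded only for convincing wrong answers.
print(prover_reward("sneaky", False, 0.8))   # 0.8
print(prover_reward("sneaky", True, 0.8))    # 0.0
```

Alternating this prover update with retraining the verifier on the newly generated, labeled solutions is what gradually closes the loopholes the sneaky prover exploits.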

What did the researchers find?

Performance vs. readability tradeoff:

A model that is supervised fine-tuned only on human-written derivations is more readable but less accurate.

When the model is optimized only for correctness, that is, trained solely to produce the right answer, it achieves high accuracy, but the solutions it generates are harder to read.

The checkability-game approach proposed by OpenAI strikes a balance between the two, maintaining high readability while achieving moderate accuracy.

Three useful models:

Robust Verifier: Able to effectively distinguish correct from incorrect solutions even when the solutions are misleading.

Helpful Prover: Generates solutions that are understandable to humans, reducing errors made by human evaluators.

Sneaky Prover: Produces subtly flawed solutions that initially confuse human evaluators, highlighting areas where further improvement is needed.

The research team used the same test problem shown above to demonstrate the generation and training process of the sneaky prover.

Conclusion

With the increasing use of large language models (LLMs) in critical domains and complex applications, it becomes crucial to know whether the content they generate is trustworthy. Requiring models to generate clear and verifiable reasons for their answers can enhance trust in what they produce.

A significant advantage of this approach is that it reduces reliance on human demonstrations or human judgments of readability. This autonomy is particularly important for aligning future superintelligent AI systems, where the ultimate goal is to reliably align AI systems with human values and expectations without direct human supervision.

Although this work experimented on only one dataset and still requires ground-truth labels, the research team expects such methods to play a key role in developing correct, transparent, and verifiable AI systems, enhancing their trustworthiness and safety in real-world applications.

For more details, please refer to the original paper.

Reference Links:

https://openai.com/index/prover-verifier-games-improve-legibility/