
OpenAI's latest paper allows super models to explain themselves

2024-07-18


Text: Hao Boyang, AI Future Guide

Editor: Zhou Xiaoyan

Today, OpenAI released a new paper, a kind of "posthumous work" from the now-disbanded OpenAI Superalignment team. It proposes a new way to improve the interpretability of models: have the model explain its own reasoning, and the easier that explanation is to follow, the better.

The idea itself is not new, but until now it has mostly been done with prompt engineering, for example using chain-of-thought prompts to force the AI to explain its answers. Wouldn't it be better if the model were good at explaining itself from the start? That requires adjusting the training process itself. To achieve this, OpenAI proposes a method it calls "checkability training", which uses a "prover-verifier game" to improve the model's ability to explain itself. And it turns out the method actually works.

So how exactly did they do it?

Suppose we want to train a very smart AI (call it the "prover") to solve math problems. We want this AI not only to give the correct answer, but also to explain it clearly, in a way that people can easily follow. However, if we only push the AI to pursue correct answers, it may develop a very complex, hard-to-understand way of solving problems.

For example, in the example below, the answer given by the model is correct, but the proof is very hard to follow. This makes it difficult for ordinary users, including customers in industries that demand high accuracy, to judge whether its reasoning is correct.

So to avoid this, the researchers came up with a clever method:

1. They created a "referee" AI (called "verifier"). This referee is not as smart as the problem-solving AI, but its job is to judge whether the problem-solving AI's answer is correct and clear.

2. Then, they let the problem-solving AI play two roles: a "good student" (helpful AI), which tried to give correct and clear answers; and a "naughty student" (cunning AI), which tried to give answers that seemed correct but were actually subtly wrong.

3. Both "students" try to convince the referee AI that their answer is correct.

4. Over time, the referee AI became better and better at telling correct answers from incorrect ones. At the same time, the "good student" AI learned how to explain its answers more clearly, while the "naughty student" AI learned how to make more subtle mistakes.

5. This process repeats itself over and over again, like an ongoing game (a rough sketch of one round is given below).
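
To make the steps above concrete, here is a minimal sketch of one training round in Python. The function names, the random scoring stub, and the exact reward scheme are illustrative assumptions for exposition only; they are not OpenAI's actual training code, in which the prover and verifier are language models updated by reinforcement learning.

import random

def prover_generate(problem, role):
    # Placeholder for the large "prover" model.
    # role == "helpful": aim for a correct, clearly explained solution.
    # role == "sneaky":  aim for a convincing but subtly wrong solution.
    correct = (role == "helpful")
    return {"solution": f"[{role} solution for: {problem}]", "is_correct": correct}

def verifier_score(solution):
    # Placeholder for the small "verifier" model: how convincing does the
    # solution look, on a scale from 0 to 1?
    return random.random()

def play_round(problems):
    for problem in problems:
        for role in ("helpful", "sneaky"):
            attempt = prover_generate(problem, role)
            score = verifier_score(attempt["solution"])

            # The verifier is trained to give high scores only to genuinely
            # correct solutions, so its training target is the ground truth.
            verifier_target = 1.0 if attempt["is_correct"] else 0.0

            # The helpful prover is rewarded for correct solutions that
            # convince the verifier; the sneaky prover is rewarded for
            # incorrect solutions that still convince it.
            if role == "helpful":
                prover_reward = score if attempt["is_correct"] else 0.0
            else:
                prover_reward = score if not attempt["is_correct"] else 0.0

            # In a real setup these values would drive gradient updates to
            # the verifier and the prover; here we just print them.
            print(role, round(score, 2), verifier_target, round(prover_reward, 2))

if __name__ == "__main__":
    play_round(["a toy math word problem"])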

This process is called the "prover-verifier game". Through this method, over multiple rounds of training, the model's explanations become clearer and clearer. For example, after just one round of this training, the model's explanations are already noticeably better.

(The results of the first round of training are already much more detailed)

After five rounds, the model can basically explain the questions to you in the tone of an elementary school teacher.

(Results of the fifth round of training)

Under pressure from the cunning AI, the referee AI, the "verifier", has also evolved: it has become better at telling genuine solutions from fake ones. Even when the cunning AI produces a very plausible-sounding answer, the verifier is more likely to find the holes in it.

(An answer from the cunning AI, which looks quite convincing)

(As time goes by, the level of cunning AI is also improving)

In the process, OpenAI researchers also discovered a phenomenon they call the "legibility tax": the more legible the AI's output, the more likely it is that some degree of performance or accuracy has been sacrificed.

They made a graph to illustrate this phenomenon.

In the figure, the X-axis represents the accuracy of the generator model, and the Y-axis represents the accuracy of human evaluators, which is used here as a measure of legibility. Models trained directly with reinforcement learning (RL), without checkability training, are highly capable but poorly legible. Models that have undergone checkability training can improve accuracy while keeping legibility roughly constant, but only up to a limit. The "legibility tax" marked with a red arrow in the figure is the performance that has to be given up in order not to lose legibility.

What is the purpose of this research? The examples above show that explanations that are easier for the referee AI to understand are also easier for us humans to understand. This suggests that smaller, easier-to-train models could be used in place of human annotation feedback (such as RLHF) to improve the comprehensibility of large-model outputs. The paper has high hopes for this approach: "this autonomy is particularly important for the alignment of future superintelligent AI systems. The goal is to ensure that AI systems are reliably aligned with human values and expectations without direct human supervision."

Looking back, the Superalignment team's pioneering work was the idea of using weak models to supervise strong models. It seems that, in pursuit of the original ideal of superalignment, they really did try every method they could. Unfortunately, things have changed: the team has disbanded, and now we can only hear its last echo.