
Google DeepMind demonstrates GenRM: fine-tuning LLMs as reward models to improve generative AI reasoning capabilities

2024-09-03


IT Home reported on September 3 that the Google DeepMind team published a paper on arXiv on August 27 introducing and demonstrating GenRM, a generative verifier that recasts reward modeling as text generation in order to improve the reasoning ability of generative AI.

In the AI industry, the current mainstream approach to improving large language model (LLM) performance is best-of-N: the LLM generates N candidate solutions, a verifier ranks them, and the best one is selected.
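The selection loop itself is simple. The sketch below shows the general best-of-N pattern; the generator and verifier interfaces (`generate_candidates`, `score`) are hypothetical placeholders for illustration, not code from the paper.

```python
# Minimal sketch of the best-of-N setup described above.
# `generate_candidates` and `score` are hypothetical callables standing in
# for the LLM sampler and the verifier, respectively.

from typing import Callable, List

def best_of_n(
    prompt: str,
    generate_candidates: Callable[[str, int], List[str]],  # samples N solutions from the LLM
    score: Callable[[str, str], float],                     # verifier score for (prompt, solution)
    n: int = 8,
) -> str:
    """Sample N candidate solutions and return the one the verifier ranks highest."""
    candidates = generate_candidates(prompt, n)
    return max(candidates, key=lambda solution: score(prompt, solution))
```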

Such LLM-based verifiers are usually trained as discriminative classifiers to score solutions, but that setup cannot exploit the text-generation capabilities of pretrained LLMs.
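For contrast, a discriminative verifier of this kind is typically a pretrained LM with a scalar scoring head trained on correct/incorrect labels. The sketch below is an illustrative assumption of that setup (the model name, prompt format, and head are placeholders), not code from the paper.

```python
# Illustrative sketch of a discriminative verifier: a scalar scoring head on
# top of a pretrained LM, trained on correct/incorrect labels rather than
# with next-token prediction. Model name and prompt format are assumptions.

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class DiscriminativeVerifier(nn.Module):
    def __init__(self, base_model_name: str = "google/gemma-2b"):  # placeholder checkpoint
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(base_model_name)
        self.backbone = AutoModel.from_pretrained(base_model_name)
        self.score_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, question: str, solution: str) -> torch.Tensor:
        text = f"Question: {question}\nSolution: {solution}"
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True)
        hidden = self.backbone(**inputs).last_hidden_state   # [1, seq_len, hidden]
        # Score the solution from the final token's representation.
        return self.score_head(hidden[:, -1]).squeeze(-1)    # scalar correctness score

# Training (not shown) would use a binary correct/incorrect loss, which is
# precisely what leaves the LM's text-generation ability unused.
```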

To overcome this limitation, the DeepMind team instead trained the verifier with the standard next-token prediction objective, so that it performs verification and solution generation jointly.
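Concretely, this turns verification into an ordinary language-modeling task: the verifier is asked whether a solution is correct, and its score is read off the probability it assigns to a "Yes" token. The sketch below illustrates that idea under assumptions of my own (Hugging Face transformers, a placeholder Gemma checkpoint, and an illustrative prompt template); it is not DeepMind's implementation.

```python
# Minimal sketch (not DeepMind's code) of verification as next-token
# prediction: score a solution by the probability the model assigns to a
# "Yes" token after a verification prompt. Model name and prompt template
# are illustrative assumptions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "google/gemma-2b"  # placeholder; any causal LM works for the sketch
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def genrm_score(question: str, solution: str) -> float:
    """Score = P('Yes' | verification prompt), i.e. ordinary next-token prediction."""
    prompt = (
        f"Question: {question}\n"
        f"Proposed solution: {solution}\n"
        f"Is the solution correct? Answer Yes or No.\nAnswer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]          # logits for the next token
    probs = torch.softmax(logits, dim=-1)
    yes_id = tokenizer.encode(" Yes", add_special_tokens=False)[0]
    no_id = tokenizer.encode(" No", add_special_tokens=False)[0]
    # Normalize over the two answer tokens so the score lies in [0, 1].
    return (probs[yes_id] / (probs[yes_id] + probs[no_id])).item()
```

Because the verifier is trained with the same next-token objective used for generation, the one model can both produce solutions and judge them, which is what "performing verification and solution generation simultaneously" refers to above.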

Compared with traditional verifiers, the DeepMind team's generative verifier (GenRM) offers the following advantages:

Seamless integration with instruction tuning

Support for chain-of-thought reasoning

The ability to leverage additional inference-time compute via majority voting (see the sketch after this list)
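The third point follows from the generative framing: because the verifier produces its verdict as text, one can sample several chain-of-thought verification rationales and average their Yes/No verdicts. The sketch below illustrates that majority-voting step; `sample_verification` is a hypothetical stand-in for sampling a rationale plus verdict from the generative verifier, not an API from the paper.

```python
# Illustrative sketch of spending extra inference-time compute: sample K
# chain-of-thought verification rationales and average their verdicts.
# `sample_verification` is a hypothetical callable returning (rationale, verdict).

from typing import Callable, Tuple

def majority_vote_score(
    question: str,
    solution: str,
    sample_verification: Callable[[str, str], Tuple[str, bool]],
    k: int = 32,
) -> float:
    """Fraction of K sampled verification rationales that vote 'correct'."""
    votes = 0
    for _ in range(k):
        _rationale, is_correct = sample_verification(question, solution)
        votes += int(is_correct)
    return votes / k
```

Sampling more rationales (larger K) trades extra compute for a less noisy correctness estimate, which is how additional inference-time computation is leveraged.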

Using Gemma-based verifiers on algorithmic and grade-school math reasoning tasks, GenRM outperforms both discriminative verifiers and LLM-as-a-judge verifiers, improving the percentage of problems solved with best-of-N by 16% to 64%.

According to Google DeepMind, GenRM's edge over classification-based reward models marks a key evolution in AI reward systems, particularly in its ability to guard against deceptive behaviors learned by newer models. The advance highlights the urgent need to refine reward models so that AI outputs align with socially responsible standards.