
Searching Llama 8B 100 times surpasses GPT-4o! Inference-time search boosts performance: a new "Scaling Law"

2024-08-15



New Intelligence Report

Editor: Qiao Yang

【New Intelligence Introduction】A recent paper shows that generative models such as LLMs can be scaled through search and achieve very significant performance gains. A follow-up reproduction experiment also found that by sampling the 8B-parameter Llama 3.1 model 100 times, it can match GPT-4o on a Python code generation task.

Rich Sutton, a pioneer of reinforcement learning and professor of CS at the University of Alberta in Canada, wrote a blog post titled "The Bitter Lesson" in 2019, which became one of the classic discussions in the field of AI.

The intuition between the lines of Rich Sutton's essay already hints at what we would now call a Scaling Law.


Original address: https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson.pdf

The article briefly reviews the development of AI in areas such as chess, Go, speech recognition, and vision, and puts forward the following viewpoints:


One of the bitter lessons we should learn is the great power of general-purpose methods, methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning.

However, this view is not identical to the Scaling Law, and it cannot be used to conclude that small models are doomed to irrelevance.

As Sutton describes, we have two axes to scale: learning and searching.

The Scaling Law proposed by OpenAI emphasizes the former. When other conditions remain unchanged, larger models perform better because they can learn more knowledge and patterns from the training set.

What we often overlook is the latter: search can also scale smoothly with the compute available at inference time, generating more candidate answers or answers of higher quality.

A recent article published by scholars from Stanford, Oxford, DeepMind and other institutions focused on this point.


Paper address: https://arxiv.org/abs/2407.21787

As the number of repeated samples at inference time increases, model performance (measured as problem coverage) improves significantly on math, reasoning, and code benchmarks such as GSM8K, MATH, MiniF2F-Math, and SWE-bench Lite.

The relationship between the two even appears to be roughly log-linear and can be modeled with an exponentiated power law, which suggests that a scaling law also exists at the inference stage.
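Here, coverage means the fraction of problems for which at least one of the k samples is correct. A hedged sketch of such a fit (the exact parameterization used in the paper may differ):

$$\log c \approx a \cdot k^{b} \quad\Longleftrightarrow\quad c \approx \exp\!\left(a k^{b}\right),$$

where $c$ is coverage, $k$ is the number of samples, and $a$, $b$ are task-dependent constants estimated from the data.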


Inspired by this paper, two engineers set out to reproduce it - the result was that searching over 100 samples from a small Llama model could catch up with, or even beat, GPT-4o on a Python programming task.


The two authors used a vivid metaphor: before, you needed one horse-sized duck to reach frontier capability; now you can instead use 100 duck-sized horses (or, more precisely, llamas).

The source code used in the experiment has been uploaded to GitHub, and the cost of reproduction is quite low.


https://gist.github.com/charlesfrye/27f25188dbbcfdf20a83c0230020fe05

To push performance further, the authors used the vLLM library for batched inference and scaled the hardware up to 10 A100-40GB GPUs, reaching an output speed of about 40k tokens/s.
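A minimal sketch of what this kind of batched repeated sampling with vLLM could look like (the model name, sampling temperature, and prompt below are illustrative assumptions, not values confirmed by the post):

```python
from vllm import LLM, SamplingParams

# Assumption: the instruct variant of Llama 3.1 8B; the post only says "Llama 3.1 8B".
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct", tensor_parallel_size=1)

# Draw n = 100 completions per prompt; temperature and top_p are illustrative values.
params = SamplingParams(n=100, temperature=0.8, top_p=0.95, max_tokens=512)

# Placeholder prompt; the actual experiment feeds in HumanEval problems instead.
prompts = ["Write a Python function that returns the n-th Fibonacci number."]

# vLLM batches requests internally; each RequestOutput carries the 100 candidates.
for request_output in llm.generate(prompts, params):
    candidates = [completion.text for completion in request_output.outputs]
    print(f"generated {len(candidates)} candidates for one prompt")
```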

Evaluation Metrics and Results

The authors chose a benchmark not covered in the aforementioned Large Language Monkeys paper - HumanEval.

The advantage of this dataset is that generated code is evaluated by running unit tests, with no LLM-as-judge or human evaluation involved, so correctness can be measured more objectively.
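As a rough illustration of this execution-based checking (a simplified, unsandboxed sketch; the official human-eval harness runs each candidate in a separate process with a timeout, and the field names below follow the HumanEval dataset format):

```python
def passes_humaneval_tests(problem: dict, completion: str) -> bool:
    """Return True if a model completion passes the problem's unit tests.

    `problem` is assumed to use the HumanEval format: "prompt" (signature +
    docstring), "test" (defines check(candidate)), and "entry_point" (function name).
    `completion` is the model's continuation of the prompt.
    """
    program = problem["prompt"] + completion + "\n" + problem["test"]
    namespace: dict = {}
    try:
        exec(program, namespace)  # defines the candidate function and check()
        namespace["check"](namespace[problem["entry_point"]])  # asserts on failure
        return True
    except Exception:
        return False
```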

Model performance is measured with two metrics: pass@k and fail@k. According to the results reported on Papers with Code, GPT-4o achieves a zero-shot pass@1 score of 90.2%.
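For reference, pass@k is usually computed with the unbiased estimator introduced in the original HumanEval paper (whether the blog post uses exactly this estimator is an assumption):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total candidates sampled for a problem
    c: number of those candidates that passed the tests
    k: evaluation budget
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one passing candidate
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

def fail_at_k(n: int, c: int, k: int) -> float:
    # fail@k is simply the complement of pass@k.
    return 1.0 - pass_at_k(n, c, k)
```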


https://paperswithcode.com/sota/code-generation-on-humaneval

Using the method proposed in the paper above, plus minimal prompt tweaking (no other hyperparameters were adjusted), the pass@k score of Llama 3.1 8B improved significantly.

When the number of repeated sampling k is 100, the performance is comparable to GPT-4o (90.5% vs. 90.2%); when k reaches 1000, the score is 95.1%, which is significantly better than GPT-4o.


If we instead use the fail@k metric (equal to 1 - pass@k) and apply a logarithmic transformation to both axes of the figure above, we obtain the curve shown below, which appears to conform almost perfectly to a "scaling law".
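The reason a straight line on log-log axes reads as a scaling law: if fail@k decays as a power law in k, taking logarithms of both sides makes the relationship linear (the constant and exponent below are symbols for illustration, not fitted values from the post):

$$\mathrm{fail@}k \approx C \cdot k^{-\alpha} \quad\Longrightarrow\quad \log\bigl(\mathrm{fail@}k\bigr) \approx \log C - \alpha \log k,$$

so the slope of the fitted line estimates the exponent $\alpha$.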


It is worth noting that this small experiment is not a strict reproduction of the paper, but only extracts the core methods.

However, these results further emphasize that when search is used to scale inference-time compute, smaller models can outperform "monster" models like GPT-4o in a predictable way.

The Future of Search

Search is powerful because it scales "transparently" as compute increases, and it can also shift resource consumption from memory to computation, allowing resources to be balanced further.

Recent major AI achievements in mathematics, such as DeepMind's AlphaProof and AlphaGeometry reaching silver-medal level on IMO problems, are inseparable from the search used in them.

However, implementing search first requires high-quality evaluation of the results. DeepMind's model translates mathematical problems expressed in natural language into formal statements, whose proofs can then be rigorously checked by compilers/verifiers such as Lean, which greatly improves parallelism and automation.

According to the Curry-Howard-Lambek correspondence, mathematical proofs and generated code are the kinds of artifacts that computer programs can identify and evaluate automatically with relative ease.
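A tiny illustration of what such mechanical checking looks like (a generic Lean 4 example, not taken from any specific system): if the file compiles, the proof is machine-verified, with no human judge involved.

```lean
-- If Lean accepts this theorem, the proof is checked by the verifier itself.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```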

But similar methods may fail outside mathematics and programming. For open-ended NLP tasks such as "summarize emails", it is difficult to run an effective search.

From this perspective, search is downstream of evaluation, and we can roughly expect a generative model's performance in a given domain to improve in proportion to the evaluation and search capabilities available in that domain.

To achieve this goal, agents in repeatable digital environments seem to be a promising direction.

References:

https://modal.com/blog/llama-human-eval