the new king of open source big models beats gpt4o, new technology can self-correct errors, and 99.2 in mathematics breaks the test set

2024-09-06

the west wind blows from aofei temple
quantum bit | public account qbitai

the throne of open source big models suddenly changed hands, and it turned out to be from a small startup team, which instantly set the industry on fire.

the new model is calledReflection 70B, using a new training technique, allows ai to learn to correct its own errors and illusions during reasoning.

for example, in the recently popular numerical r test, it initially made the same mistakes as most models, but actively<reflection> tagcorrected himself.

in the official evaluation, the 70b model completely surpassed the strongest open source llama 3.1 405b, gpt-4o, claude 3 opus, and gemini 1.5 pro, especially the math benchmark gsm8k.score 99.2%。

this result also made noam brown, a scientist at openai and the father of texas poker ai, speak passionately:

gsm8k scores 99%! can we officially retire this benchmark?

as soon as the model was launched, netizens rushed to try it out, and meta took the initiative to provide more computing power.

in the test by netizens, reflection 70b was able to answer questions that were wrong in the gsm8k dataset:

i fed the model 5 problems where the “ground_truth” present in gsm8k was inherently incorrect.
rather than repeating the wrong answers in the dataset, the model got all of them right, which is pretty impressive.this shows that the 99.2% accuracy does not come from memorizing the test set.！

counting all kinds of r is no problem, evencoined wordsseveral r's in "drirrrngrrrrrnnn" can also be counted correctly.

netizens were surprised that the open source made by a small team surpassed the top closed source. now the strongest open source model can be run locally.

the key 70b is just the beginning. officials said that a bigger one will be released next week.Reflection 405B。

it is expected that the performance of 405b will be significantly better than sonnet and gpt-4o.

the reflection 70b weights are now public, and api access will be provided by hyperbolic labs later today.

the model can self-reflect and correct its mistakes

more details about the reflection 70b are available below.

the key to improving the reflection 70b's capabilities is the use of a newReflection-Tuninga training method that enables the model to reflect on the text it generates, detecting and correcting errors in its reasoning before finalizing a response.

the data used in training comes from synthetic data generated using the glaiveai platform.

reflection 70b is based on the llama 3.1 70b instruct and can be sampled from reflection llama-3.1 70b using the same code, pipeline, etc. as other llama models.

it even uses the standard llama 3.1 chat format.

however, reflection 70b introduces somespecial tokens, structured output process.

as shown in the following example, splitting the planning process into a separate step can improve cot and keep the output refined:

the model will be<thinking> and</thinking> the output reasoning starts in the label, and once it is satisfied with its reasoning, it will be<output> and</output> the final answer is output within the label.

so it is able to separate its internal thinking and reasoning from the final answer.

exist<thinking> part, the model may output one or more<reflection>label, indicating that the model has discovered an error in its reasoning and will attempt to correct it before providing a final answer.

the system prompts as follows:

You are a world-class AI system, capable of complex reasoning and reflection. Reason through the query inside tags, and then provide your final response inside
tags. If you detect that you made a mistake in your reasoning at any point, correct yourself inside tags.
(you are a world-class ai system capable of complex reasoning and reflection. you reason about the query within the label and then
provide your final response within the tag. if you find yourself making an error in your reasoning at any time, correct yourself within the tag.)

it is also worth mentioning that in the benchmark test, all benchmarks have been checked for pollution by lmsys's llm decontaminator, isolating<output> part and test this part separately.

when using reflection 70b, the official also shared some tips:

the initial recommended parameters are temperature.7 and top_p.95
to improve accuracy, it is best to add "think carefully." at the end of the prompt.

officials also stated thata report will be released next week, detailing the model training process and findings.

created by agent entrepreneurial team

behind reflection 70b is a small team led by ceo of hyperwriteai Mutt Shumerlead.

linkedin shows that mutt shumer is a serial entrepreneur, graduated from syracuse university in the united states, and is currently the co-founder and ceo of othersideai.

othersideai is an ai application company dedicated to developing the world's most advanced auto-completion tools through large-scale ai systems. it is also the company behind hyperwrite.

hyperwrite is a browser manipulation agent that can operate google chrome like a human to complete a series of tasks, such as ordering pizza:

like gpt-llm-trainer, you only need to describe the goal in words, and it will list the steps and execute them at the same time.

when it was first launched, it was claimed to be "stronger than autogpt".

hyperwrite can also be installed as a google extension.

in addition, mutt shumer founded visos while in high school, dedicated to developing the next generation of virtual reality software for medical purposes.

he also founded furi, a company that aims to disrupt the sporting goods industry by creating high-performance products and selling them at fair prices.

although there is meta support, the trial version is currently open and is still temporarily inaccessible.

if you are interested, you can save the code first~

https://reflection-playground-production.up.railway.app/

reference links:
[1]https://huggingface.co/mattshumer/Reflection-Llama-3.1-70B
[2]https://x.com/mattshumer_/status/1831767014341538166
[3]https://x.com/polynoamial/status/1831798985528635806
[4]https://x.com/degeneratoor/status/1831809610451448196
[5]https://x.com/kimmonismus/status/1831772661296345333

news

the new king of open source big models beats gpt4o, new technology can self-correct errors, and 99.2 in mathematics breaks the test set

the model can self-reflect and correct its mistakes

created by agent entrepreneurial team

introduction

my contact information