2024-08-19
Mengchen, from Aofei Temple
QbitAI | Official account QbitAI
The SOTA small models that can run directly in the browser are here, each leading its class at the 135M, 360M, and 1.7B parameter scales, courtesy of Hugging Face.
There are only two secrets: strictly filtered web data, and training even small models on trillions of tokens.
Hugging Face Chief Scientist Thomas Wolf summed up the team's experience building small models and put forward a new view that caught the industry's attention:
Synthetic data is currently only useful in certain domains; the web is so large and diverse that the potential of real data has not yet been fully tapped.
Currently, the 360M version has been released as a demo that you can try online (mind your data usage).
It runs in the browser on the local GPU, and the model weights plus the web front-end UI add up to only about 400MB.
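For reference, here is a minimal sketch of loading the same 360M instruct checkpoint locally with transformers; the model id and generation settings below are assumptions, and this is not the browser demo's own client-side code.

```python
# Sketch: load the SmolLM-360M instruct checkpoint locally (assumed model id).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM-360M-Instruct"   # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

messages = [{"role": "user", "content": "Explain gravity to a 12-year-old in two sentences."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
# Strip the prompt tokens and print only the generated reply.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```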
Strictly filtering web data sends performance soaring
Microsoft's Phi series of small models claims that half of its training data is synthetic, and the results are very good, but the data has never been made public.
The standard-bearer of the open-source camp couldn't stand it any longer:
Build a large synthetic dataset to match, and open-source it.
The team also hinted that this move was partly meant to test whether the rumors of Microsoft gaming the leaderboards on test sets were true.
Hugging Face used the best open-source model available at the time, Mixtral-8x7B, to construct 25B tokens of synthetic data.
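As a rough illustration of how such textbook-style synthetic data can be generated (the prompt wording and sampling settings below are illustrative assumptions, not the team's actual Cosmopedia pipeline):

```python
# Sketch: generate a textbook-style passage with Mixtral-8x7B-Instruct.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    device_map="auto",
)

topic = "photosynthesis"   # topic seeding strategy is an assumption here
prompt = (
    f"Write a clear, textbook-style explanation of {topic} "
    "for middle-school students, in about 300 words."
)
sample = generator(prompt, max_new_tokens=512, do_sample=True, temperature=0.8)
print(sample[0]["generated_text"])
```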
The model trained on it performed well, but still fell somewhat short of Phi-1 and Phi-1.5.
They had prompted the large model to explain various topics at a middle-school level, but the resulting model ended up doing poorly on MMLU, whose questions are closer to PhD level.
The real performance breakthrough came from a side quest:
In addition to using large models to generate synthetic data from scratch, they tried using large models to filter web data.
Specifically, they trained a classifier on annotations generated by Llama3-70B-Instruct and kept only the most educational web pages in the FineWeb dataset.
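A minimal sketch of this filtering step, assuming the published fineweb-edu classifier checkpoint on the Hub and a score threshold of 3; the team's production pipeline runs at a vastly larger scale.

```python
# Sketch: score web pages for educational value and keep only high scorers.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

clf_id = "HuggingFaceFW/fineweb-edu-classifier"   # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(clf_id)
classifier = AutoModelForSequenceClassification.from_pretrained(clf_id)

def edu_score(text: str) -> float:
    """Return a rough 0-5 educational-value score for a web page."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = classifier(**inputs).logits
    return logits.squeeze().item()

pages = [
    "An introduction to photosynthesis for students ...",
    "Buy cheap watches now!!!",
]
kept = [p for p in pages if edu_score(p) >= 3.0]   # threshold is an assumption
```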
After switching to this heavily filtered web data, performance skyrocketed, surpassing all other models of similar size, including Phi-1.5, on most benchmarks.
The Hugging Face team described the result of this experiment as "bittersweet": although the model's performance was unprecedentedly high, it also showed that synthetic data is still no match for real data.
Later, they applied the same idea to code, and the filtered code dataset also proved very powerful, lifting the HumanEval benchmark score from around 13% straight to over 20%.
In the final data mixture they built, the deduplicated, filtered data made up the vast majority, while the purely synthetic Cosmopedia v2 accounted for only 15%.
So, in summary, is synthetic data still useful?
The team believes it may only make sense in areas where real data is genuinely scarce, such as reasoning and mathematics.
Even small models require training on trillions of tokens
Just as they were excited about these new discoveries and results, a new intern, Elie Bakouch, joined.
Although he was just an intern at the time, he was indeed an expert in various training techniques.
With Elie's help, the team shrank the model from 1.7B parameters down to 360M and even 135M, sizes comparable to classic models like GPT-1, GPT-2, and BERT.
In the process came a second important discovery: contrary to past consensus, even small models should be trained on trillions of tokens, and the longer the better.
Data annealing also proved effective, that is, reserving a special set of high-quality data for the last part of training.
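A toy sketch of such a schedule follows; the mixture names, step counts, and 10% annealing fraction are illustrative assumptions, not the team's actual training configuration.

```python
# Sketch: switch to a reserved high-quality mixture for the final training phase.
def pick_mixture(step: int, total_steps: int, anneal_fraction: float = 0.1) -> str:
    """Return which data mixture to sample from at a given training step."""
    if step >= int(total_steps * (1 - anneal_fraction)):
        return "high_quality_mix"   # curated data reserved for the annealing phase
    return "general_web_mix"        # deduplicated, filtered web data

# Example: with 1,000,000 optimizer steps, the switch happens at step 900,000.
print(pick_mixture(899_999, 1_000_000))  # general_web_mix
print(pick_mixture(900_000, 1_000_000))  # high_quality_mix
```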
The final released series of models is suited to deployment on everything from smartphones to laptops, and even the largest 1.7B model takes up only about 3GB of memory at BF16 precision.
For reference, even the entry-level iPhone 15 has 6GB of RAM, and plenty of Android phones have more.
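The memory figure is easy to sanity-check: BF16 stores each parameter in 2 bytes, so the weights of a 1.7B-parameter model alone come to roughly 3.2 GiB.

```python
# Back-of-envelope check of the ~3GB figure cited above.
params = 1.7e9            # parameter count of the largest model
bytes_per_param = 2       # bfloat16 uses 2 bytes per parameter
weights_gib = params * bytes_per_param / 2**30
print(f"{weights_gib:.1f} GiB")   # ~3.2 GiB for the weights alone
```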
Although the base models trained this time are good enough, the team still found a problem.
Existing alignment and fine-tuning techniques such as SFT, DPO, and PPO work very well for large models, but the results on small models are not ideal.
The team's analysis is that alignment datasets contain many concepts that are too complex for small models, and lack well-designed simple tasks.
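For concreteness, here is a minimal SFT sketch for a small model; the base checkpoint id, prompt format, toy examples, and hyperparameters are all illustrative assumptions rather than the team's recipe.

```python
# Sketch: supervised fine-tuning of a small causal LM on instruction-response pairs.
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM-360M"    # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
optimizer = AdamW(model.parameters(), lr=1e-5)

# Toy pairs; a real run would use a full alignment dataset, ideally one with
# simple, well-designed tasks suited to a small model's capacity.
examples = [
    ("List three primary colors.", "Red, blue, and yellow."),
    ("What is 2 + 2?", "2 + 2 equals 4."),
]

model.train()
for instruction, response in examples:
    text = f"### Instruction:\n{instruction}\n### Response:\n{response}{tokenizer.eos_token}"
    batch = tokenizer(text, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss  # standard causal-LM loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```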
With that, the next open problem has been staked out; interested teams can get to work on it, and it might just turn out to be the great savior of small models.
Online trial:
https://huggingface.co/spaces/HuggingFaceTB/instant-smollm
Reference Links:
[1]https://huggingface.co/blog/smollm
[2]https://x.com/Thom_Wolf/status/1825094850686906857