
Apple open-sources a 7B model and gives away the full training dataset; netizens: this doesn't look like Apple

2024-07-22


Apple is the latest entrant into the open-source large-model arena, and it is more open than any other company.

It rolled out a 7B model that not only performs on par with Llama 3 8B, but also open-sources the full training process and resources in one go.



You know, not long ago, Nature reporter Elizabeth Gibney wrote a critique:

  • Many AI models that claim to be open source are in fact not transparent about their data and training methods, and cannot meet the needs of real scientific research.

And this time, Apple really means it!

Even the NLP scientist who created AutoAWQ was amazed:

  • Apple released a model that beats Mistral 7B, but what's even better is that they completely open-sourced everything, including the pre-training dataset.



It also drew some playful jabs from netizens online:



As for the significance of this release, enthusiastic netizens helped sum it up:

  • For anyone who wants to train a model from scratch or fine-tune an existing one, the data-curation process is required study.



Of course, beyond OpenAI and Apple, Mistral AI and NVIDIA also jointly released a small model with 12B parameters last week.

The founder of Hugging Face declared that "Small Model Week" has arrived!



The race keeps heating up! So just how capable is Apple's new small model?

Performance approaches Llama 3 8B

Before we get to how capable it is, let's first look at the basic model configuration that Hugging Face's technical lead just "unboxed".

To sum up:

  • 7B base model, trained on 2.5T tokens from open datasets
  • Mainly English data, with a 2048-token context window
  • Training data combines DCLM-BASELINE, StarCoder, and ProofPile2
  • MMLU score close to Llama 3 8B
  • Trained with PyTorch and the OpenLM framework



Specifically, the research team first proposed DCLM, a new benchmark for comparing language-model training data.

This benchmark was proposed because the team found that:

  • Automatically filtering and selecting high-quality data from larger datasets with machine-learning (ML) models may be the key to constructing a high-quality training set.

The team therefore used DCLM to design high-quality datasets that improve model performance, especially in the multimodal domain.

The idea is simple: use a standardized framework to run experiments (fixed model architecture, training code, hyperparameters, and evaluation) and figure out which data-curation strategy works best for training high-performance models.
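The setup above can be sketched in miniature: hold the training and evaluation harness fixed, and vary only the data-curation strategy. This toy example uses a hypothetical quality heuristic (short-document and repetition filters are my illustrative stand-ins, not Apple's actual pipeline):

```python
# Toy DCLM-style experiment: fixed harness, variable data-curation strategy.
# The quality heuristic below is a hypothetical stand-in for illustration.

def quality_filter(doc: str, min_words: int = 5, max_repeat_ratio: float = 0.3) -> bool:
    """Drop very short documents and documents dominated by one repeated word."""
    words = doc.split()
    if len(words) < min_words:
        return False
    most_common = max(words.count(w) for w in set(words))
    return most_common / len(words) <= max_repeat_ratio

def curate(corpus, strategy):
    """Fixed harness: apply one curation strategy to the raw corpus."""
    return [doc for doc in corpus if strategy(doc)]

raw_corpus = [
    "the quick brown fox jumps over the lazy dog",
    "buy buy buy buy buy now",  # spammy repetition, should be dropped
    "hello world",              # too short, should be dropped
    "language models benefit from clean diverse training data",
]

baseline = curate(raw_corpus, lambda doc: True)   # keep everything
filtered = curate(raw_corpus, quality_filter)     # heuristic curation

print(len(baseline), len(filtered))  # 4 2
```

In the real benchmark, each curated corpus would then be used to train the same fixed model, and the downstream evaluation score decides which strategy wins.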



Based on this idea, the team built a high-quality dataset, DCLM-BASELINE, and used it to train a 7B-parameter model from scratch: DCLM-7B.



How does DCLM-7B perform specifically?

The results show that DCLM-7B reaches 64% 5-shot accuracy on the MMLU benchmark, comparable to Mistral-7B-v0.3 (63%) and Llama 3 8B (66%); its average performance across 53 natural-language-understanding tasks is also comparable to Llama 3 8B, while requiring only about 1/6 of the latter's training compute.
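The compute claim can be roughly sanity-checked with the common FLOPs ≈ 6 · N · D estimate (N parameters, D training tokens). This is a back-of-the-envelope sketch: Llama 3 8B's ~15T training-token figure is Meta's reported number and an assumption here, and the estimate lands near, not exactly at, the 1/6 ratio:

```python
# Rough training-compute comparison using FLOPs ~= 6 * params * tokens.
# Llama 3 8B's ~15T token count is Meta's reported figure (an assumption here).

def train_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

dclm_7b = train_flops(7e9, 2.5e12)    # 7B params on 2.5T tokens
llama3_8b = train_flops(8e9, 15e12)   # 8B params on ~15T tokens

ratio = dclm_7b / llama3_8b
print(f"DCLM-7B uses roughly 1/{1 / ratio:.1f} of Llama 3 8B's training compute")
```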



Compared with other models of the same size, DCLM-7B's MMLU score surpasses Mistral-7B and approaches Llama 3 8B.



Finally, to test the new dataset's effectiveness, some practitioners used Karpathy's llm.c to train GPT-2 1.5B and compare the DCLM-Baseline and FineWeb-Edu datasets.



The results show that DCLM-Baseline achieves a higher average score and performs better on tasks such as ARC (grade-school science reasoning), HellaSwag (commonsense reasoning), and MMLU.



"Small" models become a new trend

Back to where we began: "small" models have recently become a new trend.

First, Hugging Face launched "SmolLM", a family of small models with 135M, 360M, and 1.7B variants.



They outperform similarly sized models on a wide range of reasoning and commonsense benchmarks.



Then OpenAI abruptly released GPT-4o mini: not only does its capability approach GPT-4's, but its price is also dramatically lower.



On the very day GPT-4o mini was released, Mistral AI and NVIDIA jointly launched a 12B-parameter small model: Mistral NeMo.

In terms of overall performance, Mistral NeMo beats Gemma 2 9B and Llama 3 8B in multiple benchmark tests.



So why has everyone started racing to build small models?

The reason may be, as the founder of smol AI pointed out: the models have gotten smaller, but at comparable capability, small models dramatically reduce costs.



As the chart he shared shows, small models such as GPT-4o mini are generally cheaper than the larger models on the right.



At this point, we spectators are like:



So, which one do you prefer?