
OpenAI kicks off a small-model war! Apple's DCLM makes a strong debut, beating Mistral 7B, and is fully open source

2024-07-21



New Intelligence Report

Editors: Taozi, Qiao Yang

【New Intelligence Introduction】Is the era of small models here? OpenAI entered the small-model battlefield for the first time with GPT-4o mini, and Mistral AI and HuggingFace both released small models this week. Now Apple has also released DCLM, a 7-billion-parameter small model that outperforms Mistral-7B.

The battle of the small models has begun!

Following the release of GPT-4o mini and Mistral NeMo, Apple also entered the market.

DCLM comes in two parameter sizes, 7 billion and 1.4 billion, and was open-sourced upon release. The larger 7B model outperforms Mistral-7B and comes close to Llama 3 and Gemma.


According to Vaishaal Shankar, a research scientist in Apple's ML group (and a DCLM developer), this is the best-performing "truly open source" model to date: not only are the weights and training code released, it is also built on the open dataset DCLM-Baseline.


Even more eye-catching than the model's performance is the example of "true open source" that DCLM sets.

By contrast, most technology giants keep their models closed source, or open them only halfway.


In addition, Shankar announced that intermediate model checkpoints and optimizer states will also be published online going forward.


Could this be the spring of the LLM open source community?


The DCLM series is fully open source

All model weights have now been released on HuggingFace, and the model card covers the key information.


https://huggingface.co/apple/DCLM-7B

DCLM-7B also uses a decoder-only architecture and is pre-trained using the PyTorch and OpenLM frameworks.
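
For readers who want to try it, here is a minimal loading sketch. It assumes the checkpoint can be pulled straight through Hugging Face transformers' AutoModelForCausalLM with trust_remote_code; the official model card may additionally require the open_lm package, so treat this as illustrative rather than the exact recipe.

```python
# Minimal sketch (assumption: the repo exposes a causal-LM interface to
# transformers via trust_remote_code; the model card may also require
# installing the open_lm package).
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "apple/DCLM-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Greedy generation from a short prompt.
inputs = tokenizer("Machine learning is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```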

The DCLM-Baseline dataset, totaling 4T tokens, was filtered from the full 240T-token DCLM pool; the DCLM-7B model was then trained on a further-filtered 2.5T tokens.


The context length is 2048, which is smaller than the 8k length of Mistral 7B and Gemma 2 9B.

In terms of performance, the authors used the LLM Foundry evaluation suite to test the model's scores on 53 benchmark tasks.

When comparing against other models, in addition to the MMLU score, the authors also defined two custom metrics: "core accuracy" and "extended accuracy".

The former is the average centered accuracy over 22 tasks including HellaSwag and ARC-E, while the latter covers all 53 tasks.
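
For context, "centered accuracy" is generally understood as raw accuracy rescaled so that random guessing scores zero. The sketch below illustrates the idea with hypothetical numbers; it is not the official LLM Foundry code.

```python
# Illustrative only: centered accuracy rescales raw accuracy so that the
# random-guess baseline maps to 0 and a perfect score maps to 1.
def centered_accuracy(raw_acc: float, chance: float) -> float:
    return (raw_acc - chance) / (1.0 - chance)

# Hypothetical example: a 4-way multiple-choice task (chance = 0.25)
# on which a model scores 70% raw accuracy.
print(centered_accuracy(0.70, 0.25))  # 0.6

# "Core" then averages centered accuracy over the 22 core tasks;
# the per-task scores below are placeholders.
core_task_scores = [0.60, 0.55, 0.48]
print(sum(core_task_scores) / len(core_task_scores))
```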

Although it is not trained on the most data, DCLM achieves the best results on all three metrics among open-data models of the same size (models whose weights and datasets are both open source).


The three columns of benchmark scores are from left to right: Core, MMLU, Extended

Compared with the previous SOTA MAP-Neo model, DCLM-7B achieves 63.7% accuracy in the 5-shot MMLU task, an increase of 6.6 percentage points, while reducing the amount of computation required for training by 40%.

However, compared with models that have open weights but closed training data, the results are less impressive.

DCLM trails Phi-3 significantly on all metrics, and its scores are roughly on par with Mistral-7B-v0.3 and Gemma 8B.


The researchers found that training on an additional 100B tokens from the same dataset and extending the context length to 8k further improved the model's scores on the core and extended benchmarks, while the MMLU result was unchanged.


This result completely surpasses the score of Mistral 7B-v0.3.

In addition, an instruction fine-tuned version of the 7B model was released on HuggingFace, with a dramatic improvement on the GSM8K mathematical-reasoning task: the score jumps from 2.1 to 52.5.


https://huggingface.co/apple/DCLM-7B-8k

Alongside the 7B version, a 1.4B version was released simultaneously. Remarkably, its training data is 0.1T tokens more than that of the 7B version.


https://huggingface.co/TRI-ML/DCLM-1B

Compared with SmolLM, recently released by HuggingFace, DCLM-1B performs significantly better, most notably on 5-shot MMLU, where it scores 11.9 points higher than SmolLM.

Not only that, DCLM-1B's score of 41.9 on MMLU is also higher than Qwen-1.5B's 37.87 and Phi-1.5B's 35.90.


What the 7B model lost ground on, the 1.4B model wins back. It seems small models really are Apple's forte.

It is worth noting that the 7B model is available only under Apple's Sample Code License (ASCL), while the 1.4B version is released under Apache 2.0, allowing commercial use, distribution and modification.

Any discussion of the newly released DCLM models has to mention their important foundation: the DataComp benchmark.


Paper address: https://arxiv.org/pdf/2406.11794

The DataComp paper was first published on June 17. Co-first authors Jeffrey Li and Alex Fang and co-last author Vaishaal Shankar are also developers of Apple's DCLM.

The article not only elaborates on the process of constructing the dataset, but also mentions some content about the DCLM model.

Vaishaal Shankar said that an updated version of this paper will be released soon, providing more technical details about model pre-training.

Rather than fixing the dataset and tweaking the model, DataComp takes the opposite approach: the model used for evaluation is fixed, and the task is to filter and curate the best possible training data from a 240T-token pool.
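
According to the paper, DCLM-Baseline combines heuristic filtering, deduplication, and a fastText-based quality classifier. The sketch below illustrates only the classifier-scoring step; the model file, label name, and threshold are hypothetical placeholders, not the project's actual pipeline code.

```python
# Rough sketch of fastText-based quality filtering as described at a high
# level in the DataComp-LM paper. Paths, label names, and the threshold
# are hypothetical placeholders.
import fasttext

clf = fasttext.load_model("quality_classifier.bin")  # hypothetical model file

def quality_score(doc: str) -> float:
    # fastText predict() expects single-line input, hence the newline strip.
    labels, probs = clf.predict(doc.replace("\n", " "), k=2)
    return dict(zip(labels, probs)).get("__label__hq", 0.0)  # hypothetical label

def filter_pool(docs, threshold=0.9):
    # Keep only documents whose quality score clears the threshold.
    return [d for d in docs if quality_score(d) >= threshold]
```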

This approach is very much in line with how the technology giants now think about R&D: for LLM performance, pre-training data is becoming a more important factor than model architecture and weights.

After all, "open source" models such as Llama, Gemma, and Phi release only their weights and do not publish their data.

Both scaling laws and small models are needed

For the AI tech giants, a bigger model is not always better.


In fact, the AI community has never lacked small models, from the multiple iterations of Microsoft's Phi series to Google's Gemma 2 9B, released just at the end of June.

This week, OpenAI suddenly released GPT-4o mini, Mistral AI teamed up with NVIDIA to release Mistral NeMo, and HuggingFace released SmolLM, adding yet more fuel to the small-model field.

As an OpenAI researcher said, "While we like training big models more than anyone, OpenAI also knows how to train small models."


Small models have the advantages of low cost, high speed, and greater specialization; they are usually trained on a relatively small amount of data and designed for specific tasks.

Making large models smaller and then expanding their scale may be one of the future development trends.


A few days ago, when GPT-4o mini was released, Andrej Karpathy also posted a long tweet expressing similar views.


He believes that the competition over model size will "reverse course": instead of getting ever bigger, models will compete to be smaller and lighter.

The reason current LLMs are gradually becoming "behemoths" is that the training process is still very wasteful: we are essentially asking the model to memorize the contents of the entire Internet (and LLMs are in fact remarkably good at memorization, far better than humans).

But for small models, the training objectives have changed. The key question is how AI systems can learn more from less data.

We need models to become larger first and then smaller, because we need a "behemoth" to reconstruct and reshape the data into an ideal synthetic form, gradually producing a "perfect training set" that is then fed to the small model.
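
In code terms, this recipe amounts to letting a large "teacher" model rewrite raw text into a cleaner synthetic corpus on which a small model is then trained. The sketch below is purely schematic; the teacher model, prompt, and downstream training step are stand-ins, not anything Karpathy or Apple has actually released.

```python
# Schematic sketch of "big model distills data for a small model".
# gpt2-xl is a stand-in teacher; the prompt and workflow are illustrative.
from transformers import pipeline

teacher = pipeline("text-generation", model="gpt2-xl")

def synthesize(raw_texts, max_new_tokens=128):
    # Ask the teacher to rewrite noisy raw text into cleaner training examples.
    synthetic = []
    for text in raw_texts:
        out = teacher(
            f"Rewrite the following as a clear explanation:\n{text}\n",
            max_new_tokens=max_new_tokens,
            do_sample=True,
        )
        synthetic.append(out[0]["generated_text"])
    return synthetic

# The resulting synthetic corpus would then be used to pre-train or fine-tune
# a much smaller student model with a standard training loop.
```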

Musk also agreed with this view. The model improvement ladder described by Karpathy is exactly the path that Tesla has taken in reality.


In April 2023, Sam Altman announced the end of the era of large AI models. In a recent interview, he also confirmed that data quality is a key success factor for further AI training.


Microsoft researchers made this assumption when developing the Phi models, and Hugging Face researchers recently confirmed it and released a high-quality training dataset.

Take GPT-4 as an example: developing and running this model, with over one trillion parameters, cost more than $100 million.

A small model, such as one trained specifically on a legal dataset, might use fewer than 10 billion parameters, cost less than $10 million, and use less computing power to respond to each query, so it’s less expensive.

Nadella has said that the Phi small-model series is only 1/100 the size of the model behind OpenAI's free ChatGPT, yet performs nearly as well on many tasks.


In addition, Google and AI startups Mistral, Anthropic, and Cohere have also released smaller models this year.

In June, Apple announced its own AI development roadmap, planning to use small models so that the software can run entirely on the phone, making it faster and safer.

For many tasks, such as summarizing documents or generating images, large models may be overkill.

Illia Polosukhin, one of the authors of the pioneering Transformer paper, said that calculating 2+2 shouldn't require a quadrillion operations.

However, technology giants have not given up on big models. At this year's WWDC conference, Apple announced that ChatGPT would be embedded in the Siri assistant to perform complex tasks such as writing emails.

After all, on the road to ultimate AGI/ASI, the expansion of parameter scale is proportional to the growth of intelligence.


References:

https://venturebeat.com/ai/apple-shows-off-open-ai-prowess-new-models-outperform-mistral-and-hugging-face-offerings/

https://www.wsj.com/tech/ai/for-ai-giants-smaller-is-sometimes-better-ef07eb98?mod=tech_lead_story

https://the-decoder.com/ai-models-might-need-to-scale-down-to-scale-up-again/