
From the original authors themselves! Mistral's first open-source 7B Mamba model, "Cleopatra", is impressive

2024-07-17



New Intelligence Report

Editor: Editorial Department

【New Intelligence Introduction】Recently, small 7B models have become a trend that AI giants are racing to catch up on. Following Google's Gemma 2, Mistral today released two more 7B models: Mathstral, for STEM subjects, and Codestral Mamba, a code model built on the Mamba architecture.

Mistral has a surprise new product!

Just today, Mistral released two small models: Mathstral 7B and Codestral Mamba 7B.

First up is the Mathstral 7B, designed specifically for mathematical reasoning and scientific discovery.

On the MATH benchmark, it achieved 56.6% pass@1, more than 20 points higher than Minerva 540B. With majority voting, Mathstral reaches 68.4% on MATH, and 74.6% with an additional reward model.

The code model Codestral Mamba is one of the first open source models to adopt the Mamba 2 architecture.

It is the best available 7B code model and is trained with a context length of 256k tokens.


Both models are released under the Apache 2.0 license, and the weights have been uploaded to the HuggingFace repository.


Hugging Face address: https://huggingface.co/mistralai

Mathstral

Interestingly, according to the official announcement, the launch of Mathstral coincides with the 2311th anniversary of Archimedes’ birth.

Mathstral is designed for STEM subjects, to solve advanced math problems that require complex, multi-step reasoning. It has only 7B parameters and a 32k context window.

Moreover, Mathstral was developed with a heavyweight partner: Numina, which just won the championship in the first AI Mathematical Olympiad on Kaggle last week.


A Twitter user also found that Mathstral correctly answers the question "Which is bigger, 9.11 or 9.9?", which has stumped many large models.

It compares the integer and decimal parts separately, with a clear chain of thought, making it a textbook example of how a math model should answer.
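To make that reasoning concrete, here is a plain-Python illustration of the same comparison strategy: compare the integer parts first, then the fractional parts. This is only an illustration of the arithmetic, not Mathstral's actual output.

```python
from decimal import Decimal

# Compare the integer parts first, then the fractional parts.
a, b = Decimal("9.11"), Decimal("9.9")

int_a, frac_a = divmod(a, 1)   # 9 and 0.11
int_b, frac_b = divmod(b, 1)   # 9 and 0.9

if int_a != int_b:
    larger = a if int_a > int_b else b
else:
    larger = a if frac_a > frac_b else b

print(f"{larger} is bigger")   # -> 9.9 is bigger, because 0.9 > 0.11
```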


Building on Mistral 7B's language ability, Mathstral further focuses on STEM subjects. According to the per-subject breakdown of MMLU, mathematics, physics, biology, chemistry, statistics, computer science, and related fields are all areas where Mathstral holds a clear advantage.


According to the official blog post, Mathstral seems to have sacrificed some inference speed in exchange for model performance, but judging from the evaluation results, this trade-off is worth it.

In multiple benchmarks in the fields of mathematics and reasoning, Mathstral beat popular small models such as Llama 3 8B and Gemma 2 9B, and in particular achieved SOTA on math-competition sets such as AMC 2023 and AIME 2024.


Moreover, spending more inference-time compute can push the model's results even higher.

If majority voting is used over 64 candidates, Mathstral's score on MATH reaches 68.37%; with an additional reward model, it reaches as high as 74.59%.
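For readers unfamiliar with maj@k scoring, here is a minimal sketch of how majority voting over 64 candidates could be scored. The `sample_answer` function is a hypothetical stand-in for one stochastic generation from the model; it is not part of any Mistral SDK.

```python
from collections import Counter

def majority_vote_accuracy(problems, sample_answer, k=64):
    """Score with maj@k: sample k answers per problem and keep the most
    common final answer.  `sample_answer(question)` is a hypothetical
    stand-in for one stochastic generation from the model."""
    correct = 0
    for question, reference in problems:
        candidates = [sample_answer(question) for _ in range(k)]
        voted, _ = Counter(candidates).most_common(1)[0]
        correct += int(voted == reference)
    return correct / len(problems)
```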

In addition to the Hugging Face and la Plateforme platforms, you can also use or fine-tune the model through the two officially released open-source SDKs, mistral-inference and mistral-finetune.
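As a quick way to try the model, here is a minimal sketch that loads Mathstral through the standard transformers causal-LM path; the repository id below is an assumption, so check Mistral's Hugging Face page for the exact name.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repository id -- verify against Mistral's Hugging Face page.
model_id = "mistralai/Mathstral-7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Which is bigger, 9.11 or 9.9? Think step by step."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```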

Codestral Mamba

Following the Transformer-based Mixtral series, Mistral has now also released Codestral Mamba, the first code-generation model to use the Mamba 2 architecture.

In addition, the development process also received assistance from Mamba's original authors Albert Gu and Tri Dao.

Interestingly, the official announcement specifically mentions the Egyptian queen Cleopatra VII, whose life came to a dramatic end by the bite of a venomous snake.

Since the Mamba architecture was released, its strong experimental performance has drawn widespread attention and optimism, but because the AI community as a whole has invested so heavily in Transformers, industrial models that actually use Mamba have been rare.

Codestral Mamba now gives us a new vantage point from which to study the new architecture.

The Mamba architecture was first released in December 2023, and the two authors launched an updated version of Mamba-2 in May this year.

Unlike Transformers, Mamba models offer linear-time inference and can, in theory, model sequences of unlimited length.

At the same 7B size, while Mathstral's context window is only 32k, Codestral Mamba's can be extended to 256k.

This efficiency in inference time and context length, together with the potential for fast responses, is particularly important in practical scenarios for improving coding productivity.
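To see why the per-token cost stays constant, here is a toy linear state-space recurrence: the entire context is summarized in a fixed-size state vector, so each decoding step costs the same no matter how long the sequence gets. This is a simplified illustration of the principle, not Mamba's actual selective-scan kernel.

```python
import numpy as np

# Toy linear state-space recurrence: everything the model "remembers"
# lives in a fixed-size state h, so each decoding step costs O(1)
# regardless of how long the context already is.
d_state, d_model = 16, 8
A = np.random.randn(d_state, d_state) * 0.01   # state transition
B = np.random.randn(d_state, d_model) * 0.1    # input projection
C = np.random.randn(d_model, d_state) * 0.1    # output projection

def step(h, x):
    """One decoding step: constant time and memory in sequence length."""
    h = A @ h + B @ x
    y = C @ h
    return h, y

h = np.zeros(d_state)
for x in np.random.randn(100_000, d_model):    # an arbitrarily long sequence
    h, y = step(h, x)                          # the state never grows
```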

The Mistral team saw this advantage of the Mamba model and took the lead in trying it. From the benchmark test, the 7B parameter Codestral Mamba not only has obvious advantages over other 7B models, but can even compete with larger-scale models.


In 8 benchmark tests, Codestral Mamba basically matched the performance of Code Llama 34B, and even surpassed it in 6 of them.

However, compared with its larger sibling Codestral 22B, Codestral Mamba's parameter disadvantage shows, and it still falls somewhat short.

It is worth mentioning that Codestral 22B itself is a new model released less than two months ago; once again, one has to marvel at the pace that Mistral, headquartered in Paris, keeps up.

Codestral Mamba can also be deployed with mistral-inference, or with TensorRT-LLM, NVIDIA's open-source library for fast LLM inference.


GitHub address: https://github.com/NVIDIA/TensorRT-LLM

For running it locally, the official blog says to watch for upcoming support in llama.cpp. But Ollama moved quickly and has already added Mathstral to its model library.
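For those who want to try it locally, here is a minimal sketch using the Ollama Python client; it assumes a local Ollama server is running and that the library tag is simply `mathstral`.

```python
import ollama  # pip install ollama; requires a local Ollama server

ollama.pull("mathstral")  # downloads the model on first use (tag assumed)
reply = ollama.chat(
    model="mathstral",
    messages=[{"role": "user", "content": "Which is bigger, 9.11 or 9.9?"}],
)
print(reply["message"]["content"])
```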


In response to netizens urging an update for Codestral Mamba, Ollama also replied helpfully: "I'm already working on it, so just be patient."


References:

https://mistral.ai/news/codestral-mamba/

https://mistral.ai/news/mathstral/

https://venturebeat.com/ai/mistral-releases-codestral-mamba-for-faster-longer-code-generation/