
Mamba really can code better than Transformer! And the original paper has now been accepted at a new top-tier conference

2024-07-17


Xifeng from Aofeisi
Quantum Bit | Official Account QbitAI

"European OpenAI" and "Transformer Challenger" have joined forces!

Mistral AI has just launched its first open-source model built on the Mamba2 architecture, Codestral Mamba (7B), which specializes in code generation.



Unlike the Transformer architecture, the Mamba architecture offers "linear-time inference" and can, in theory, handle inputs of unlimited length.

Mistral AI: this is also why the code reasoning model we built on the Mamba architecture can hold its own.



Mistral AI says it has already tested Codestral Mamba with a 256k-token context.
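
For readers who want to try it, here is a minimal sketch of how one might prompt the model through Hugging Face transformers. The repository name mistralai/Mamba-Codestral-7B-v0.1 and the generation settings are assumptions on our part, not details taken from Mistral's announcement.

```python
# Minimal sketch: prompting Codestral Mamba for code generation.
# Assumption: the model is published on Hugging Face under
# "mistralai/Mamba-Codestral-7B-v0.1" and a transformers release
# with Mamba2 support is installed; adjust names as needed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mamba-Codestral-7B-v0.1"  # assumed repo name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Write a Python function that checks whether a string is a palindrome."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Linear-time inference means generation cost grows with sequence length,
# not with the square of the context length.
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```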

In benchmark tests, Codestral Mamba's overall performance surpasses CodeGemma-1.1 7B, CodeLlama 7B, DeepSeek v1.5 7B, and CodeLlama 34B.

Some netizens commented that with this release, Mistral AI looks set to carry the Mamba architecture into the spotlight.

Albert Gu, one of the authors of the Mamba architecture and an assistant professor at CMU, commented:

Different modalities or data formats with weaker "tokenization" (e.g., code, byte-level modeling) will increasingly benefit from compressive models such as SSMs.



In addition to Codestral Mamba, Mistral AI also released a new mathematical model: Mathstral (7B).

Amusingly, netizens asked it the "which is bigger, 9.11 or 9.9?" question; Mathstral compared the integer parts first, then the decimal parts, and got it right.
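
The reasoning it is described as using boils down to the two-step comparison below, reproduced as a small sketch (the code and variable names are ours; only the numbers come from the example):

```python
# Sketch of the comparison Mathstral is described as making:
# compare integer parts first, then the fractional parts.
a, b = 9.11, 9.9

int_a, int_b = int(a), int(b)          # 9 == 9, so the integer parts tie
frac_a, frac_b = a - int_a, b - int_b  # roughly 0.11 vs 0.9 (floating point)

if int_a != int_b:
    larger = a if int_a > int_b else b
else:
    larger = a if frac_a > frac_b else b

print(larger)  # 9.9, since 0.11 < 0.9
```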





7B performance close to a 22B Transformer

The full Codestral Mamba benchmark results are as follows:



On all HumanEval C++/Java/JavaScript/Bash benchmarks, Codestral Mamba surpasses CodeGemma-1.1 7B, CodeLlama 7B, and the larger CodeLlama 34B.

Even compared with Codestral 22B, Mistral AI's most powerful open-source coding model, Codestral Mamba is not far behind.

In addition, DeepSeek v1.5 7B also stands out in the benchmark, competing back and forth with Codestral Mamba.

DeepSeek v1.5 7B outperforms Codestral Mamba on Spider (complex cross-domain semantic parsing and text-to-SQL tasks), HumanEval Java, HumanEval Bash, and MBPP.

Besides the benchmark results, the most interesting thing about Codestral Mamba is that it is one of the first Mamba2 architecture models.

The Mamba architecture was proposed at the end of last year by Tri Dao, the author of FlashAttention, and Albert Gu, an assistant professor at CMU and co-founder and chief scientist of Cartesia AI.



Previously, large Transformer-based models such as ChatGPT had a major pain point: processing long texts required enormous amounts of compute, a consequence of the quadratic complexity of the attention mechanism in the Transformer architecture.

Mamba was the first linear-time sequence model to truly match Transformer performance; it is also a state space model (SSM).

Mamba builds on the more modern structured SSM (S4), which is well suited to deep learning, and shares similarities with the classic RNN architecture.

It has three main innovations: selective processing of the input, a hardware-aware algorithm, and a simpler architecture.
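
For intuition about where the linear cost comes from, here is a toy, heavily simplified scalar version of a selective state-space recurrence. It is a didactic sketch, not Mamba's actual implementation, and all names and parameter choices are ours.

```python
import numpy as np

def toy_selective_ssm(x, w_a, w_b, w_c):
    """Toy scalar state-space scan: one pass over the sequence, O(n) time.

    A real Mamba layer uses multi-dimensional states, learned projections,
    and a hardware-aware parallel scan; this is only an illustration.
    """
    h = 0.0
    ys = []
    for x_t in x:                                # single left-to-right pass
        a_t = 1.0 / (1.0 + np.exp(-w_a * x_t))   # input-dependent decay ("selectivity")
        b_t = w_b * x_t                          # input-dependent write into the state
        h = a_t * h + b_t                        # state update: h_t = a_t * h_{t-1} + b_t
        ys.append(w_c * h)                       # readout: y_t = c * h_t
    return np.array(ys)

# Each step touches a fixed-size state, so cost grows linearly with length,
# unlike attention, which compares every token with every other token (O(n^2)).
print(toy_selective_ssm(np.array([0.5, -1.0, 2.0, 0.1]), 1.0, 0.5, 2.0))
```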

The Mamba architecture has attracted widespread attention since its release; the founder of Stability AI, NVIDIA scientist Jim Fan, and others welcomed its arrival with excitement.





Mamba's first paper was rejected by ICLR at the beginning of the year, which sparked heated discussion in the community at the time.

However, it has recently been accepted by CoLM 2024, a new top-tier conference.



Mamba2 is its second generation, with a state space 8 times larger and training 50% faster.

The Mamba2 paper even showed that there is a very close mathematical connection between the Transformer's attention mechanism and SSMs, and the paper was accepted at ICML 2024.
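
A rough sketch of that connection, in simplified SSM notation (the paper's "state space duality" result is stated far more generally than this):

```latex
% A (selective) state space model computes
\[
  h_t = A_t\, h_{t-1} + B_t\, x_t, \qquad y_t = C_t^{\top} h_t .
\]
% Unrolling the recurrence gives
\[
  y_t \;=\; \sum_{s \le t} C_t^{\top}\Big(\prod_{r=s+1}^{t} A_r\Big) B_s\, x_s
        \;=\; \sum_{s \le t} M_{t,s}\, x_s ,
\]
% so the whole layer is a lower-triangular (causally masked) mixing matrix $M$
% applied to the input sequence, which has the same algebraic shape as masked attention.
```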



A mathematical model was also published

In addition to Codestral Mamba, Mistral AI also launched an open-source mathematical model, Mathstral (7B), to commemorate the 2311th anniversary of Archimedes' birth.

Mathstral is based on Mistral 7B and focuses on STEM (science, technology, engineering, mathematics). The context window is 32k.

In benchmarks, Mathstral scored 56.6% on MATH and 63.47% on MMLU.

Notably, Mathstral can achieve even better results with more inference-time computation:

With majority voting, Mathstral 7B reached 68.37% on MATH, and this rose to 74.59% when a strong reward model was used to select among 64 candidate answers.
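
Majority voting and reward-model reranking are standard test-time techniques; the sketch below illustrates the general idea with hypothetical sample_answer and reward_model_score helpers, and is not Mistral's actual evaluation code.

```python
from collections import Counter

def majority_vote(answers):
    """maj@k: sample k answers and return the most frequent final answer."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(answers, score_fn):
    """Best-of-N: rerank the k candidates with a reward model, keep the top one."""
    return max(answers, key=score_fn)

# Hypothetical usage, assuming sample_answer(problem) draws one model answer
# and reward_model_score(answer) is a learned scorer, neither of which is public code:
# answers = [sample_answer(problem) for _ in range(64)]
# print(majority_vote(answers))
# print(best_of_n(answers, reward_model_score))
```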



The following are the performance differences between Mathstral 7B and Mistral 7B in various subjects of MMLU:



Reference links:
[1] https://mistral.ai/news/codestral-mamba/
[2] https://mistral.ai/news/mathstral/
[3] https://x.com/MistralAI/status/1813222156265791531
[4] https://x.com/GuillaumeLample/status/1813231491154899012
[5] https://x.com/theo_gervet/status/1813226968600469824
[6] https://x.com/tuturetom/status/1813238885453033540
[7] https://x.com/WenhuChen/status/1812562112524226569