2024-08-13
Machine Heart Report
Editors: Du Wei, Chen Chen
A large model built on the Mamba architecture once again challenges the Transformer.
Will a Mamba-architecture model finally "stand up" this time? Since its debut in December 2023, Mamba has grown into a strong competitor to the Transformer.
Since then, Mamba-based models have kept appearing, such as Codestral Mamba 7B, the first large open-source Mamba-architecture model released by Mistral.
Today, the Technology Innovation Institute (TII) in Abu Dhabi released a new open-source Mamba model, Falcon Mamba 7B.
Let's first summarize the highlights of Falcon Mamba 7B: it can process sequences of arbitrary length without any increase in memory usage, and it runs on a single 24GB A10 GPU.
Now available on Hugging Face, Falcon Mamba 7B is a causal decoder-only model that uses a novel Mamba state-space language model (SSLM) architecture to handle a variety of text generation tasks.
Judging from the results, Falcon Mamba 7B outperforms leading models in its size class, including Meta's Llama 3 8B, Llama 3.1 8B, and Mistral 7B, on some benchmarks.
Falcon Mamba 7B comes in four variants: the base model, an instruction-tuned model, a 4-bit model, and an instruction-tuned 4-bit model.
As an open-source model, Falcon Mamba 7B is released under the "Falcon License 2.0", an Apache 2.0-based license, to support both research and application use.
Hugging Face Address: https://huggingface.co/tiiuae/falcon-mamba-7b
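For reference, the snippet below is a minimal sketch of how the model can be loaded and queried through the Hugging Face transformers library (assuming a recent transformers release with Falcon Mamba support and a GPU with enough memory); the prompt and generation settings are purely illustrative.
```python
# Minimal sketch: loading Falcon Mamba 7B with Hugging Face transformers.
# Assumes a recent transformers release with Falcon Mamba support and a GPU
# with enough memory (TII reports the model fits on a single 24GB A10).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-mamba-7b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision to reduce the memory footprint
    device_map="auto",
)

prompt = "State-space language models differ from Transformers because"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding of a short continuation; generation settings are illustrative.
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```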
Falcon Mamba 7B is the fourth model TII has open-sourced, following Falcon 180B, Falcon 40B, and Falcon 2, and it is the institute's first model built on the Mamba SSLM architecture.
The first general-purpose large-scale pure Mamba model
Transformer-based models have long dominated generative AI. However, researchers have noticed that the Transformer architecture can struggle with longer pieces of text.
Essentially, the attention mechanism in a Transformer understands context by comparing each word (or token) with every other word in the text, so compute and memory requirements grow as the context window grows.
If computing resources are not scaled accordingly, inference slows down, and texts beyond a certain length cannot be processed at all. To overcome these obstacles, the state-space language model (SSLM) architecture, which works by continuously updating a state as it processes words, has emerged as a promising alternative and has been deployed by a number of institutions, including TII.
The Falcon Mamba 7B uses the Mamba SSM architecture originally proposed in a December 2023 paper by researchers from Carnegie Mellon University and Princeton University.
The architecture uses a selection mechanism that lets the model dynamically adjust its parameters based on the input. This allows the model to focus on or ignore particular inputs, similar in spirit to how attention works in Transformers, while processing long text sequences (such as entire books) without requiring additional memory or compute as the sequence grows.
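To make the idea concrete, the toy sketch below implements a heavily simplified, untrained selective state-space recurrence: per-token gates decide how much of the running state to keep and how much of the new input to write, and the memory footprint stays fixed at the size of the state regardless of sequence length. It illustrates the principle only and is not the actual Mamba or Falcon Mamba implementation.
```python
# Toy illustration of a selective state-space recurrence (not the actual
# Mamba/Falcon Mamba implementation): per-token gates modulate how much of
# the running state is kept and how much of the new input is written in.
# Memory is O(d_state), independent of sequence length, unlike full attention,
# whose cost grows with every additional token in the context window.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_state, seq_len = 16, 32, 1000

# Random, untrained projection matrices, purely for demonstration.
W_keep = rng.normal(scale=0.1, size=(d_model, d_state))   # input -> "keep" gate
W_write = rng.normal(scale=0.1, size=(d_model, d_state))  # input -> "write" gate
B = rng.normal(scale=0.1, size=(d_model, d_state))        # input projection
C = rng.normal(scale=0.1, size=(d_state, d_model))        # state -> output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

state = np.zeros(d_state)
outputs = []
for x in rng.normal(size=(seq_len, d_model)):   # stream of token embeddings
    keep = sigmoid(x @ W_keep)                  # input-dependent: how much state to retain
    write = sigmoid(x @ W_write)                # input-dependent: how much of x to absorb
    state = keep * state + write * (x @ B)      # constant-size state update
    outputs.append(state @ C)                   # per-token output

print(len(outputs), state.shape)  # 1000 outputs, state stays at shape (32,)
```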
TII noted that the approach makes the model suitable for enterprise-level machine translation, text summarization, computer vision and audio processing tasks, as well as estimation and prediction tasks.
Training Data
Falcon Mamba 7B was trained on up to 5500 GT of data, drawn mainly from the RefinedWeb dataset and supplemented with high-quality technical, code, and mathematics data from public sources. All data were tokenized with the Falcon-7B/11B tokenizer.
Like other Falcon-series models, Falcon Mamba 7B was trained with a multi-stage strategy in which the context length was increased from 2048 to 8192. Inspired by curriculum learning, TII also carefully chose the data mixture throughout training, taking the diversity and complexity of the data into account.
In the final training stage, TII used a small set of high-quality curated data (i.e., samples from FineWeb-Edu) to further improve performance.
Training process and hyperparameters
Most of the training of Falcon Mamba 7B was completed on 256 H100 80GB GPUs, using a strategy that combines 3D parallelism (TP=1, PP=1, DP=256) with ZeRO. The figure below shows the model hyperparameter details, including precision, optimizer, maximum learning rate, weight decay, and batch size.
Specifically, Falcon Mamba 7B was trained with the AdamW optimizer and a WSD (warmup-stable-decay) learning rate schedule, and the batch size was increased from b_min = 128 to b_max = 2048 over the first 50 GT of training.
In the stable phase, TII used a maximum learning rate of η_max = 6.4×10^−4, then decayed it to its minimum value with an exponential schedule over more than 500 GT. In the ramp-up phase, TII also applied batch scaling to readjust the learning rate η so that the Adam noise temperature remains constant.
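As an illustration, the sketch below shows what such a schedule could look like. The ramp, stable-phase, and decay lengths, the minimum learning rate, and the square-root batch-scaling rule (a common choice for keeping the Adam noise temperature roughly constant) are assumptions for this sketch, not TII's exact recipe.
```python
# Hedged sketch of a WSD (warmup-stable-decay) schedule with batch scaling.
# The constants below (ramp/stable/decay lengths in GT, minimum LR) are
# illustrative assumptions, not TII's exact values; the square-root rule is
# one common way to keep the Adam noise temperature roughly constant while
# the batch size grows.
import math

ETA_MAX = 6.4e-4                 # peak learning rate reported for the stable phase
B_MIN, B_MAX = 128, 2048         # reported batch-size ramp
RAMP_GT, STABLE_GT, DECAY_GT = 50, 5000, 500  # assumed split of the token budget
ETA_MIN = 6.4e-5                 # assumed floor for the exponential decay

def batch_size(gt: float) -> int:
    """Linearly ramp the batch size from B_MIN to B_MAX over the first RAMP_GT."""
    if gt >= RAMP_GT:
        return B_MAX
    return int(B_MIN + (B_MAX - B_MIN) * gt / RAMP_GT)

def learning_rate(gt: float) -> float:
    """WSD schedule: batch-scaled ramp, flat stable phase, exponential decay."""
    if gt < RAMP_GT:
        # Scale the LR with sqrt(batch) so the Adam noise temperature stays ~constant.
        return ETA_MAX * math.sqrt(batch_size(gt) / B_MAX)
    if gt < RAMP_GT + STABLE_GT:
        return ETA_MAX
    # Exponential decay from ETA_MAX towards ETA_MIN over DECAY_GT.
    progress = min((gt - RAMP_GT - STABLE_GT) / DECAY_GT, 1.0)
    return ETA_MAX * (ETA_MIN / ETA_MAX) ** progress

for gt in (0, 25, 50, 1000, 5300):
    print(f"{gt:>5} GT: batch={batch_size(gt):>4}, lr={learning_rate(gt):.2e}")
```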
The entire model training took about two months.
Model Evaluation
To understand how Falcon Mamba 7B compares with leading Transformer models of its size class, the team ran a test to determine the maximum context length each model can handle on a single 24GB A10 GPU.
The results show that Falcon Mamba can fit longer sequences than current Transformer models and, in theory, can accommodate unlimited context lengths.
Next, the researchers measured generation throughput with a batch size of 1 on an H100 GPU. As the figure below shows, Falcon Mamba generates all tokens at a constant throughput with no increase in peak CUDA memory, whereas for Transformer models, peak memory grows and generation slows as the number of generated tokens increases.
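The snippet below is a hedged sketch of this kind of measurement: it times generation and records peak CUDA memory as the number of generated tokens grows. The token counts, prompt, and loading settings are illustrative, not TII's exact benchmarking code.
```python
# Hedged sketch of the measurement described above: time generation and track
# peak CUDA memory as the number of generated tokens grows. Model loading
# follows the earlier snippet; token counts and the prompt are illustrative.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-mamba-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)

inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)

for n_tokens in (128, 512, 2048, 8192):
    torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=n_tokens, do_sample=False)
    elapsed = time.perf_counter() - start
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    print(f"{n_tokens:>5} new tokens: {n_tokens / elapsed:6.1f} tok/s, "
          f"peak memory {peak_gib:.2f} GiB")
```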
Even on standard industry benchmarks, the new model performs better than or close to popular transformer models as well as pure state-space models and hybrid state-space models.
For example, on the ARC, TruthfulQA, and GSM8K benchmarks, Falcon Mamba 7B scores 62.03%, 53.42%, and 52.54%, respectively, surpassing Llama 3 8B, Llama 3.1 8B, Gemma 7B, and Mistral 7B. However, on the MMLU and HellaSwag benchmarks, Falcon Mamba 7B lags far behind these models.
"The launch of the Falcon Mamba 7B represents a significant step forward for the institution, inspiring new perspectives and furthering the exploration of intelligent systems," said Hakim Hacid, principal researcher at TII, in a statement. "At TII, we are pushing the boundaries of SSLM and transformer models to inspire further innovation in generative AI."
To date, TII's Falcon series of language models has been downloaded more than 45 million times, making it one of the most successful LLM releases from the UAE.
The Falcon Mamba 7B paper will be released soon; stay tuned.
https://huggingface.co/blog/falconmamba
https://venturebeat.com/ai/falcon-mamba-7bs-powerful-new-ai-architecture-offers-alternative-to-transformer-models/