
Replacing the Transformer, a 7B open-source model immediately tops the charts! It can process sequences of any length

2024-08-13


Mingmin from Aofei Temple
Quantum Bit | Public Account QbitAI

Just by replacing the Transformer architecture, performance immediately improves across the board, outperforming open-source models of the same scale!

(The attention mechanism is gone entirely.)

This is the latest Falcon Mamba 7B model.



It uses the Mamba state space language model architecture to handle a variety of text generation tasks.

By dropping the traditional attention mechanism, it effectively addresses the low computational efficiency that models face when processing long sequences.

It can handle sequences of unlimited length while its memory requirements stay constant.

No matter how long the context is, the time to generate each token is basically the same.

As a result, the Falcon Mamba model improves performance across the board, beating a number of Transformer-based models such as Llama-3.1 (8B), Mistral (7B), and Falcon-2 (11B).



These results come from the Technology Innovation Institute (TII) in Abu Dhabi, UAE, the team behind the Falcon models.

The series includes four models: a base version, an instruction-tuned version, a 4-bit version, and an instruction-tuned 4-bit version.

The latest models are released under the TII Falcon License 2.0, an open license based on Apache 2.0.

Onlookers exclaimed: the rules of the game are about to change!



The world's first open source SSLM

In terms of performance, Falcon Mamba 7B outperforms open-source models of the same scale.



It is based on the first-generation Mamba architecture.

Mamba is a state space model (SSM). It combines characteristics of RNNs and CNNs and introduces a selection mechanism that lets the model selectively propagate or forget information based on the current input, improving the efficiency of processing text.

At the same time, it uses a hardware-aware parallel algorithm that runs in recurrent mode, avoiding I/O between GPU memory levels and improving computational efficiency.

Finally, it simplifies the architecture by merging the SSM block and the Transformer's MLP block into a single block.
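
To make the "selectively propagate or forget" idea concrete, here is a minimal toy sketch of a selective state-space recurrence in the spirit of Mamba. It is not TII's implementation; the parameter names (A, B, C, delta) follow common SSM notation, and the input-dependent values below are stand-ins for learned projections.

```python
import numpy as np

def selective_ssm_step(h, x_t, A, B_t, C_t, delta_t):
    """One recurrent step. Because B_t, C_t and delta_t depend on the current
    input, the model can choose how strongly to keep or forget its state."""
    A_bar = np.exp(delta_t * A)           # discretized state transition
    h = A_bar * h + delta_t * B_t * x_t   # update the fixed-size hidden state
    y_t = np.dot(C_t, h)                  # read out an output for this step
    return h, y_t

d_state = 4
h = np.zeros(d_state)
A = -np.ones(d_state)                     # stable, decaying dynamics
for x_t in [0.5, -1.0, 2.0]:              # toy input sequence
    # In Mamba, B_t, C_t and delta_t come from learned projections of x_t.
    B_t, C_t, delta_t = np.ones(d_state), np.ones(d_state), 0.1
    h, y_t = selective_ssm_step(h, x_t, A, B_t, C_t, delta_t)
print(y_t)
```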

Switching from the Transformer to Mamba allows the Falcon model to process arbitrarily long sequences without additional memory; in particular, it fits on a single A10 24GB GPU.

The study also discussed two different approaches to processing sequences.

The parallel prefill method suits GPU parallel processing but has high memory requirements; the sequential fill method suits SSM models and can process sequences of arbitrary length, so it is not subject to memory limitations.
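
The difference can be sketched in a few lines of toy Python (purely illustrative, not the actual implementation): a recurrent "sequential fill" keeps a fixed-size state, while an attention-style "parallel prefill" keeps a cache that grows with the prompt.

```python
def sequential_prefill(prompt_tokens):
    """SSM-style prefill: consume the prompt token by token.
    Memory stays constant regardless of prompt length."""
    state = 0.0                        # stand-in for a fixed-size recurrent state
    for tok in prompt_tokens:
        state = 0.9 * state + tok      # toy recurrence
    return state

def parallel_prefill(prompt_tokens):
    """Attention-style prefill: process all tokens at once.
    The cache (here just a list) grows linearly with prompt length."""
    return list(prompt_tokens)

prompt = [1.0, 2.0, 3.0]
print(sequential_prefill(prompt))      # one number, however long the prompt
print(len(parallel_prefill(prompt)))   # grows with the prompt length
```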



To ensure robust training at large scale, the Falcon Mamba model uses additional RMS normalization layers.

The RMS normalization layer simplifies LayerNorm's computation and reduces the amount of calculation.
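
As a quick illustration (a toy sketch, not the model's actual code), RMS normalization keeps only the root-mean-square rescaling and drops LayerNorm's mean subtraction:

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    # Rescale by the root mean square only; no mean centering, unlike LayerNorm.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gamma

x = np.array([1.0, -2.0, 3.0])
print(rms_norm(x, gamma=np.ones_like(x)))
```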

The model was trained on about 5,500 GT (gigatokens) of data, drawn mainly from the RefinedWeb dataset plus public data. Most of the training proceeded at a constant rate, and a small amount of high-quality curated data was added in the final stage, which helped optimize the model at the end.

In a test on an H100 generating tokens with batch size 1 and prompt lengths from 1 to 130k, Falcon Mamba maintained stable throughput when generating new tokens, meaning its performance is unaffected by text length and it can handle long sequences without degradation.





Falcon Mamba supports multiple Hugging Face APIs, including AutoModelForCausalLM and pipeline.
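
For reference, a minimal usage sketch with the transformers AutoModelForCausalLM API might look like the following; the repo id "tiiuae/falcon-mamba-7b" and the generation settings are assumptions, not taken from the article.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-mamba-7b"   # assumed Hugging Face Hub repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("The Falcon Mamba architecture", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```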

An instruction-tuned version was also released; it was fine-tuned on an additional 5 billion tokens to make the model more accurate.

The latest models are available on Hugging Face and GitHub.

Reference Links:
https://huggingface.co/blog/falconmamba#hardware-performance