
Understanding Mamba, the strongest competitor of Transformer

2024-08-19




Machine Heart Report

Editor: Panda

Mamba is promising, but its development is still in its early stages.

There are many deep learning architectures, but the most successful one in recent years is the Transformer, which has established a dominant position across many application fields.

A key driver of this success is the attention mechanism, which lets Transformer-based models focus on the relevant parts of the input sequence and achieve better contextual understanding. The drawback is that attention is computationally expensive: its cost grows quadratically with input length, which makes very long sequences difficult to process.

Fortunately, a new architecture with great potential emerged some time ago: the structured state-space sequence model (SSM). It can efficiently capture the complex dependencies in sequence data, making it a strong rival to the Transformer.

The design of this type of model is inspired by the classic state-space model; we can think of it as a fusion of recurrent neural networks and convolutional neural networks. These models can be computed efficiently with either recurrent or convolutional operations, so the computational overhead scales linearly or near-linearly with sequence length, greatly reducing the computational cost.

More specifically, one of the most successful variants of SSM, Mamba, has modeling capabilities comparable to Transformer while maintaining linear scalability with sequence length.

Mamba first introduces a simple but effective selection mechanism that reparameterizes the SSM based on the input, allowing the model to retain necessary, relevant information indefinitely while filtering out what is irrelevant. Mamba also includes a hardware-aware algorithm that computes the model recurrently with a scan rather than a convolution, which speeds up computation by up to 3x on an A100 GPU.

As shown in Figure 1, with its powerful ability to model complex long sequence data and nearly linear scalability, Mamba has emerged as a foundational model and is expected to revolutionize multiple research and application fields such as computer vision, natural language processing, and medicine.



As a result, the literature on studying and applying Mamba has grown rapidly enough to be dizzying, so a comprehensive survey is of real value. Recently, a research team from the Hong Kong Polytechnic University published such a survey on arXiv.



  • Paper title: A Survey of Mamba
  • Paper address: https://arxiv.org/pdf/2408.01129

This survey summarizes Mamba from multiple perspectives; it can help beginners learn Mamba's basic working mechanism and help experienced practitioners keep up with the latest developments.

Mamba is a hot research topic, so many teams have been writing surveys. In addition to the one introduced in this article, there are other reviews focusing on state-space models or visual Mamba; for details, please refer to the corresponding papers:

  • Mamba-360: Survey of state space models as transformer alternative for long sequence modelling: Methods, applications, and challenges. arXiv:2404.16112
  • State space model for new-generation network alternative to transformers: A survey. arXiv:2404.09516
  • Vision Mamba: A Comprehensive Survey and Taxonomy. arXiv:2405.04404
  • A survey on vision mamba: Models, applications and challenges. arXiv:2404.18861
  • A survey on visual mamba. arXiv:2404.15956

Prerequisites

Mamba combines the recurrent framework of recurrent neural networks (RNNs), the parallel computation and attention mechanism of the Transformer, and the linear properties of state-space models (SSMs). Therefore, to thoroughly understand Mamba, it is necessary to first understand these three architectures.

Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are good at processing sequential data because of their ability to retain internal memory.

Specifically, at each discrete time step k, a standard RNN processes an input vector together with the hidden state from the previous time step, then produces an output vector and an updated hidden state. This hidden state serves as the RNN's memory, retaining information about previously seen inputs. This dynamic memory allows an RNN to process sequences of different lengths.

That is, an RNN is a nonlinear recurrent model that effectively captures temporal patterns by using the historical knowledge stored in its hidden state.
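As a minimal sketch of this recurrence (the weight names and the tanh nonlinearity below are standard textbook choices, not taken from the survey):

```python
import numpy as np

def rnn_step(x_k, h_prev, W_x, W_h, b):
    """One vanilla-RNN step: mix the current input with the previous hidden
    state, then apply a nonlinearity to produce the updated memory."""
    return np.tanh(W_x @ x_k + W_h @ h_prev + b)

# toy usage: process a short sequence of 3-dimensional inputs
rng = np.random.default_rng(0)
d_in, d_hidden = 3, 5
W_x = rng.normal(size=(d_hidden, d_in))
W_h = rng.normal(size=(d_hidden, d_hidden))
b = np.zeros(d_hidden)

h = np.zeros(d_hidden)                     # initial memory
for x_k in rng.normal(size=(4, d_in)):     # each step sees the input and the previous state
    h = rnn_step(x_k, h, W_x, W_h, b)
```

Each step depends on the previous hidden state, which is exactly what makes RNNs sequential at both training and inference time.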

Transformer

The Transformer's self-attention mechanism helps capture global dependencies in the input. It does this by assigning each position a weight based on its importance relative to the other positions. More specifically, the original input is first linearly transformed to convert the sequence of input vectors x into three types of vectors: queries Q, keys K, and values V.

The normalized attention scores S are then computed from the queries and keys and used as attention weights over the values.

In addition to performing a single attention function, we can also perform multi-head attention. This allows the model to capture different types of relationships and understand the input sequence from multiple perspectives. Multi-head attention uses multiple sets of self-attention modules to process the input sequence in parallel. Each head operates independently and performs the same calculations as the standard self-attention mechanism.

Afterwards, the outputs of the heads, each a weighted sum of value vectors, are aggregated and combined. This aggregation step allows the model to use information from multiple heads and capture many different patterns and relationships in the input sequence.
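As a minimal sketch of a single attention head (the softmax normalization and the 1/sqrt(d) scaling follow the standard Transformer formulation; the projection matrices below are stand-ins for the learned linear transforms mentioned above):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product attention over a sequence X of shape (L, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # linear projections to queries, keys, values
    d_k = Q.shape[-1]
    S = softmax(Q @ K.T / np.sqrt(d_k))          # (L, L) normalized attention weights
    return S @ V                                 # weighted sum of the value vectors

# toy usage
rng = np.random.default_rng(0)
L, d_model, d_k = 6, 8, 4
X = rng.normal(size=(L, d_model))
out = attention(X,
                rng.normal(size=(d_model, d_k)),
                rng.normal(size=(d_model, d_k)),
                rng.normal(size=(d_model, d_k)))
```

The (L, L) score matrix is where the quadratic cost in sequence length comes from. Multi-head attention simply runs several such heads in parallel and combines their outputs.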

State Space

The state-space model (SSM) is a classical mathematical framework used to describe the dynamic behavior of a system over time. In recent years, SSMs have been widely used in many different fields, such as control theory, robotics, and economics.

At its core, an SSM reflects the behavior of a system through a set of hidden variables called the "state", which lets it effectively capture dependencies in temporal data. Unlike RNNs, SSMs are linear models with an associative property. Specifically, the classic state-space model constructs two key equations (a state equation and an observation equation) to model the relationship between the input x and the output y at the current time t through an N-dimensional hidden state h(t).
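In the standard notation (reproduced here for reference, since the article does not show the equations), the two equations read:

```latex
% state equation: evolution of the N-dimensional hidden state h(t)
h'(t) = A\,h(t) + B\,x(t)
% observation equation: read-out of the output y(t)
y(t)  = C\,h(t)
```

where A, B, and C are the (learned) state, input, and output matrices.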

  • Discretization

To meet the needs of machine learning, an SSM must undergo discretization, converting its continuous parameters into discrete ones. Generally speaking, discretization aims to divide continuous time into K discrete intervals with integration areas that are as equal as possible. One of the most representative solutions adopted by SSMs is the zero-order hold (ZOH), which assumes the function value stays constant over each interval Δ = [t_{k−1}, t_k]. A discrete SSM has a structure similar to a recurrent neural network, so it can perform inference more efficiently than Transformer-based models.
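Under zero-order hold with step size Δ, the continuous parameters map to discrete ones in the standard way (shown here for reference), giving an RNN-like recurrence:

```latex
\bar{A} = \exp(\Delta A), \qquad
\bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B,
\qquad\text{so that}\qquad
h_k = \bar{A}\,h_{k-1} + \bar{B}\,x_k, \quad y_k = C\,h_k .
```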

  • Convolution calculation

The discrete SSM is a linear system with associative properties and can therefore be seamlessly integrated with convolutional computations.
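Concretely, because the discrete system is linear and time-invariant, unrolling the recurrence turns it into a single global convolution whose kernel is built from the SSM parameters (a standard identity, shown here for reference):

```latex
\bar{K} = \bigl(C\bar{B},\; C\bar{A}\bar{B},\; \dots,\; C\bar{A}^{\,L-1}\bar{B}\bigr),
\qquad y = x * \bar{K},
```

where L is the sequence length, so the entire output sequence can be computed in parallel as one convolution.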

The relationship between RNN, Transformer and SSM

Figure 2 shows the computational algorithms of RNN, Transformer, and SSM.



On the one hand, conventional RNNs operate based on a non-linear recurrent framework where each computation depends only on the previous hidden state and the current input.

Although this form allows an RNN to produce outputs quickly during autoregressive inference, it also prevents the RNN from fully exploiting the parallel computing power of GPUs, which slows down training.

On the other hand, the Transformer architecture performs matrix multiplications over many query-key pairs in parallel, which can be distributed efficiently across hardware resources, enabling faster training of attention-based models. However, when a Transformer-based model generates responses or predictions, the inference process is very time-consuming.

Unlike RNNs and Transformers, which each support only one type of computation, a discrete SSM is very flexible; thanks to its linearity, it supports both recurrent and convolutional computation. This lets an SSM achieve not only efficient inference but also parallel training. Note, however, that the most conventional SSMs are time-invariant, meaning their A, B, C, and Δ are independent of the model input x. This limits their context-aware modeling ability and causes SSMs to perform poorly on certain tasks such as selective copying.
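A small numerical sketch of this duality (a toy scalar-input LTI SSM; all shapes and names here are our own illustration) shows the recurrent and convolutional computations producing the same output:

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 4, 16                                  # state size, sequence length
A_bar = 0.9 * np.eye(N)                       # discrete, time-invariant parameters
B_bar = rng.normal(size=(N, 1))
C = rng.normal(size=(1, N))
x = rng.normal(size=L)

# recurrent computation: O(L) sequential steps (efficient inference)
h = np.zeros((N, 1))
y_rec = []
for k in range(L):
    h = A_bar @ h + B_bar * x[k]
    y_rec.append((C @ h).item())

# convolutional computation: build the kernel K_bar and convolve (parallel training)
K_bar = np.array([(C @ np.linalg.matrix_power(A_bar, i) @ B_bar).item() for i in range(L)])
y_conv = [np.dot(K_bar[:k + 1][::-1], x[:k + 1]) for k in range(L)]

assert np.allclose(y_rec, y_conv)             # same output, two computation modes
```

The equivalence holds precisely because A_bar, B_bar, and C do not depend on the input; Mamba's selection mechanism gives up this time-invariance, which is why it needs a different parallelization strategy (the scan).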

Mamba

To address these shortcomings of traditional SSMs and achieve context-aware modeling, Albert Gu and Tri Dao proposed Mamba, which can serve as the backbone network of a general-purpose sequence foundation model. See the Machine Heart report "Five times the throughput, performance comprehensively surpassing the Transformer: the new architecture Mamba ignites the AI community".

Later, the two of them further proposed Mamba-2, whose structured state-space duality (SSD) builds a robust theoretical framework connecting structured SSMs with various forms of attention, allowing the algorithm and system optimizations originally developed for the Transformer to be transferred to SSMs. See also the Machine Heart report "Taking on the Transformer again! Mamba-2, led by the original authors, is here, and the new architecture greatly improves training efficiency".

Mamba-1: Selective state-space modeling using hardware-aware algorithms

Mamba-1 introduces three innovations on top of the structured state-space model: memory initialization based on the high-order polynomial projection operator (HiPPO), a selection mechanism, and hardware-aware computation, as shown in Figure 3. These techniques aim to enhance the long-range, linear-time sequence modeling capability of SSMs.



Specifically, the initialization strategy constructs a coherent hidden state matrix to effectively promote long-term memory.

The selection mechanism then enables the SSM to acquire content-aware representations.

Finally, to improve training efficiency, Mamba also includes two hardware-aware computing techniques: a parallel associative scan and memory recomputation.
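As a rough illustration of the selection mechanism (a sequential reference implementation; the projection matrices, the softplus parameterization, and the simplified discretization below are illustrative assumptions, whereas the real Mamba kernel fuses this loop into a hardware-aware parallel scan):

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_ssm(X, A, W_B, W_C, W_delta):
    """Reference (sequential) selective SSM scan.
    X: (L, D) input sequence; A: (D, N) state-decay parameters.
    B, C, and the step size Delta are computed from each input token,
    so the model can decide what to write into and read from its state."""
    L, D = X.shape
    N = A.shape[1]
    h = np.zeros((D, N))
    ys = []
    for x in X:                                   # one token at a time
        delta = softplus(x @ W_delta)             # (D,) input-dependent step size
        B = x @ W_B                               # (N,) input-dependent input matrix
        C = x @ W_C                               # (N,) input-dependent output matrix
        A_bar = np.exp(delta[:, None] * A)        # (D, N) discretized state transition
        B_bar = delta[:, None] * B[None, :]       # (D, N) simplified (Euler-style) discretization
        h = A_bar * h + B_bar * x[:, None]        # selective state update
        ys.append(h @ C)                          # (D,) read-out
    return np.stack(ys)

# toy usage
rng = np.random.default_rng(0)
L, D, N = 8, 4, 16
X = rng.normal(size=(L, D))
A = -np.abs(rng.normal(size=(D, N)))              # negative values give a stable decay
Y = selective_ssm(X, A,
                  rng.normal(size=(D, N)),
                  rng.normal(size=(D, N)),
                  rng.normal(size=(D, D)))
```

Because A_bar and B_bar now change from token to token, the global-convolution trick no longer applies, which is why Mamba relies on the parallel associative scan instead.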

Mamba-2: State-space Duality

The Transformer has inspired many techniques, such as parameter-efficient fine-tuning, catastrophic forgetting mitigation, and model quantization. To allow state-space models to benefit from these techniques originally developed for the Transformer, Mamba-2 introduces a new framework: structured state-space duality (SSD). This framework theoretically connects SSMs with different forms of attention.

Essentially, SSD shows that the attention mechanism used by the Transformer and the linear time-invariant system used in SSM can both be viewed as semi-separable matrix transformations.

In addition, Albert Gu and Tri Dao also proved that selective SSM is equivalent to a structured linear attention mechanism implemented using a semi-separable mask matrix.

Building on SSD, Mamba-2 designs a more hardware-efficient computation that uses a block-decomposition matrix multiplication algorithm.

Specifically, by treating the state-space model as a semi-separable matrix through this matrix transformation, Mamba-2 decomposes the computation into matrix blocks, where the diagonal blocks represent intra-block computations and the off-diagonal blocks represent inter-block computations factored through the SSM's hidden state. This approach lets Mamba-2 train 2-8 times faster than Mamba-1's parallel associative scan, while remaining competitive with the Transformer.
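In rough terms (our paraphrase of the SSD view, not a formula quoted from the survey), both attention and the selective SSM compute a sequence transformation y = Mx whose lower-triangular (causal) matrix is semi-separable:

```latex
M_{ji} = C_j^{\top} A_j A_{j-1} \cdots A_{i+1} B_i \quad (j \ge i),
\qquad y = M x ,
```

and the block decomposition above partitions M into diagonal (intra-block) pieces and low-rank off-diagonal (inter-block) pieces carried by the hidden state.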

Mamba Block

Let's look at the block design of Mamba-1 and Mamba-2. Figure 4 compares the two architectures.



The design of Mamba-1 is centered on the SSM, where the selective SSM layer maps an input sequence X to Y. In this design, an initial linear projection creates X, and further linear projections produce the SSM parameters (A, B, C). The input tokens and state matrices are then processed by the selective SSM unit with a parallel associative scan to obtain the output Y. Afterwards, Mamba-1 uses a skip connection to encourage feature reuse and alleviate the performance degradation that often occurs during training. Finally, the Mamba model is built by stacking this block, interleaved with standard normalization and residual connections.

As for Mamba-2, it introduces an SSD layer that maps [X, A, B, C] to Y. This is achieved with a single projection at the start of the block that produces [X, A, B, C] simultaneously, similar to how the standard attention architecture generates the Q, K, and V projections in parallel.

That is, the Mamba-2 block simplifies the Mamba-1 block by removing the sequential linear projections. This allows the SSD structure to be computed faster than Mamba-1's parallel selective scan. In addition, to improve training stability, Mamba-2 adds a normalization layer after the skip connection.
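As a rough, self-contained sketch of the Mamba-1 block just described (the SiLU gating and the depthwise causal convolution follow the published design, but shapes are simplified and `ssm_fn` is a placeholder for the selective scan; none of the names below come from an actual library):

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def causal_conv1d(X, kernel):
    """Depthwise causal 1D convolution over a (L, D) sequence; kernel has shape (K, D)."""
    L, D = X.shape
    K = kernel.shape[0]
    X_pad = np.vstack([np.zeros((K - 1, D)), X])          # left-pad so step t only sees the past
    return np.stack([(X_pad[t:t + K] * kernel).sum(axis=0) for t in range(L)])

def mamba_block(X, params, ssm_fn):
    """Sketch of a Mamba-1 block: expand -> conv + SiLU -> selective SSM -> gate -> project back."""
    x_in, z = X @ params["W_in_x"], X @ params["W_in_z"]  # two parallel input projections
    x_conv = silu(causal_conv1d(x_in, params["conv_kernel"]))
    y = ssm_fn(x_conv)                                    # selective state-space transformation
    y = y * silu(z)                                       # multiplicative gating branch
    return y @ params["W_out"]                            # project back to the model dimension

# toy usage with an identity stand-in for the selective scan
rng = np.random.default_rng(0)
L, d_model, d_inner = 8, 16, 32
params = {
    "W_in_x": rng.normal(size=(d_model, d_inner)),
    "W_in_z": rng.normal(size=(d_model, d_inner)),
    "conv_kernel": rng.normal(size=(4, d_inner)),
    "W_out": rng.normal(size=(d_inner, d_model)),
}
out = mamba_block(rng.normal(size=(L, d_model)), params, ssm_fn=lambda v: v)
```

In the full model, this block is stacked with normalization and residual connections, as described above.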

The Mamba model is evolving

State-space models and Mamba have developed rapidly in recent years and have become a promising choice of backbone for foundation models. Although Mamba performs well on natural language processing tasks, it still has some problems, such as memory loss, difficulty generalizing to different tasks, and weaker performance than Transformer-based language models on complex patterns. To address these problems, the research community has proposed many improvements to the Mamba architecture. Existing research mainly focuses on modifying the block design, the scanning pattern, and memory management. Table 1 summarizes the relevant research by category.



Block Design

The design and structure of the Mamba block has a great impact on the overall performance of the Mamba model, and therefore this has become a major research hotspot.



As shown in Figure 5, existing research can be divided into three categories based on different methods of building new Mamba modules:

  • Integration methods: integrate the Mamba block with other models to balance effectiveness and efficiency;
  • Replacement methods: replace the main layers of other model frameworks with Mamba blocks;
  • Modification methods: modify components within the classic Mamba block.

Scan Mode

The parallel associative scan is a key component of the Mamba model. It addresses the computational issues introduced by the selection mechanism, speeds up training, and reduces memory requirements by exploiting the linear properties of time-varying SSMs to design kernel fusion and recomputation at the hardware level. However, Mamba's unidirectional sequence modeling paradigm makes it hard to learn comprehensively from diverse data such as images and videos.



To alleviate this problem, some researchers have explored new efficient scanning methods to improve the performance of the Mamba model and facilitate its training process. As shown in Figure 6, in terms of developing scanning patterns, existing research results can be divided into two categories:

  • Flattened scanning methods: view the token sequence from a flattened perspective and process the model input accordingly;
  • Stereoscopic scanning methods: scan the model input across dimensions, channels, or scales; these can be further divided into hierarchical scanning, spatiotemporal scanning, and hybrid scanning.

Memory Management

Similar to RNNs, in state-space models the memory held in the hidden state effectively stores information from previous steps and therefore has a crucial impact on overall SSM performance. Although Mamba introduces a HiPPO-based method for memory initialization, managing the memory inside SSM units remains difficult, including transferring hidden information between layers and achieving lossless memory compression.

To this end, some pioneering studies have proposed some different solutions, including memory initialization, compression, and concatenation.

Making Mamba adaptable to diverse data

The Mamba architecture is an extension of the selective state-space model. It has the basic characteristics of a recurrent model and is therefore very suitable as a general base model for processing sequence data such as text, time series, and speech.

In addition, some recent groundbreaking studies have expanded the application scenarios of the Mamba architecture, enabling it to process not only sequence data but also data such as images and graphs, as shown in Figure 7.



The goal of these studies is to fully utilize Mamba's excellent ability to obtain long-range dependencies and its efficiency advantages in learning and reasoning. Table 2 briefly summarizes these research results.



Sequence data

Sequence data refers to data collected and organized in a specific order, where the order of the data points is important. This review report comprehensively summarizes the application of Mamba on a variety of sequence data, including natural language, video, time series, speech, and human motion data. See the original paper for details.

Non-sequential data

Unlike sequential data, non-sequential data does not follow a specific order; its data points can be organized in any order without significantly affecting the meaning of the data. This lack of inherent order is difficult to handle for recurrent models (such as RNNs and SSMs) that are specifically designed to capture temporal dependencies in data.

Surprisingly, some recent studies have succeeded in getting Mamba (a representative SSM) to process non-sequential data efficiently, including images, graphs, and point clouds.

Multimodal data

In order to improve AI's perception and scene understanding capabilities, data from multiple modalities, such as language (sequential data) and images (non-sequential data), can be integrated. Such integration can provide very valuable and complementary information.

Recently, multimodal large language models (MLLMs) have become a major research hotspot; these models inherit the powerful capabilities of large language models (LLMs), including strong language expression and logical reasoning. Although the Transformer has become the dominant approach in this field, Mamba is emerging as a strong competitor: it performs well at aligning mixed-source data and scales linearly with sequence length, which makes it a potential replacement for the Transformer in multimodal learning.

Applications

Here are some notable applications of Mamba-based models. The team grouped these applications into the following categories: natural language processing, computer vision, speech analysis, drug discovery, recommender systems, and robotics and autonomous systems.

We will not cover them in detail here; please refer to the original paper.

Challenges and opportunities

Although Mamba has achieved outstanding performance in some areas, overall, Mamba research is still in its infancy and there are still some challenges to be overcome. Of course, these challenges are also opportunities.

  • How to develop and improve the Mamba-based base model;
  • How to fully implement hardware-aware computing to maximize the use of hardware such as GPUs and TPUs to improve model efficiency;
  • How to improve the credibility of the Mamba model, which requires further research in terms of security and robustness, fairness, explainability, and privacy;
  • How to apply new techniques from the Transformer field to Mamba, such as parameter-efficient fine-tuning, catastrophic forgetting mitigation, and retrieval-augmented generation (RAG).