
Meta proposes a modality-aware mixture of experts

2024-08-14


Synced (Machine Heart) Report

Even a mixture of experts needs to specialize in its own field.

For today's mixed-modal foundation models, a common architectural design is to fuse modality-specific encoders or decoders. This approach has limitations: it constrains the model's ability to integrate information across modalities, and it makes it difficult to generate outputs that contain multiple modalities.

To overcome these limitations, the Chameleon team at Meta FAIR proposed, in their recent paper "Chameleon: Mixed-modal early-fusion foundation models", a single Transformer architecture that models mixed-modal sequences of discrete image and text tokens with a next-token prediction objective, enabling seamless reasoning and generation across modalities.



After pre-training on about 10 trillion mixed-modal tokens, Chameleon demonstrates broad vision and language capabilities and handles a wide variety of downstream tasks well. It is particularly strong at generating long mixed-modal answers, even beating commercial models such as Gemini 1.0 Pro and GPT-4V. However, for a model like Chameleon, where modalities are mixed from the earliest stages of training, scaling up its capabilities demands a great deal of compute.

Motivated by these problems, the Meta FAIR team explored routed sparse architectures and proposed MoMa, a modality-aware mixture-of-experts architecture.



Paper title: MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts

Paper address: https://arxiv.org/pdf/2407.21770

Previous studies have shown that this type of architecture can effectively scale unimodal foundation models and improve the performance of multimodal contrastive learning models. However, using it to train models that fuse modalities early remains a topic with both opportunities and challenges, and one that few have studied.

The team's research is based on the insight that different modalities are inherently heterogeneous: text and image tokens have different information densities and redundancy patterns.

While integrating these tokens into a unified fusion architecture, the team also proposed to further optimize the framework by incorporating modality-specific modules. The team calls this concept modality-aware sparsity (MaS); it allows the model to better capture the characteristics of each modality while maintaining strong cross-modal integration through partial parameter sharing and attention mechanisms.

Previous studies such as VLMo, BEiT-3, and VL-MoE adopted the mixture-of-modality-experts (MoME) approach to train vision-language encoders and masked language models. The FAIR research team's work further expands the scope of MoE.

Model Architecture

Early Fusion

The new model proposed in this paper is based on Chameleon's early fusion architecture, which represents images and text as a series of discrete tokens in a unified Transformer. At its core, Chameleon is a Transformer-based model that applies a self-attention mechanism on a combined sequence of image and text tokens. This allows the model to capture complex intra- and inter-modal correlations. The model is trained using the next token prediction objective, generating text and image tokens in an autoregressive manner.

In Chameleon, the image tokenization scheme uses a learned image tokenizer, which encodes a 512 × 512 image into 1024 discrete tokens based on a codebook of size 8192. For text tokenization, a BPE tokenizer with a vocabulary size of 65,536 is used, which includes image tokens. This unified tokenization method allows the model to seamlessly handle arbitrary sequences of image and text tokens.
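
To make the tokenization concrete, here is a minimal sketch of how such a unified mixed-modal sequence could be assembled. The tokenizer interfaces and the begin/end-of-image sentinel tokens are our own assumptions, not Chameleon's actual API.

```python
# Sketch: assembling a unified mixed-modal token sequence.
# Assumptions (not Chameleon's actual API): `text_tokenizer.encode` returns
# BPE ids from the 65,536-entry vocabulary, `image_tokenizer.encode` maps a
# 512x512 image to 1024 codebook indices (codebook size 8192) already remapped
# into the shared vocabulary, and boi_id / eoi_id are sentinel tokens.
from typing import List, Sequence, Tuple

def build_mixed_sequence(segments: Sequence[Tuple[str, object]],
                         text_tokenizer, image_tokenizer,
                         boi_id: int, eoi_id: int) -> List[int]:
    """Interleave text and image segments into one flat token sequence."""
    tokens: List[int] = []
    for kind, payload in segments:
        if kind == "text":
            tokens.extend(text_tokenizer.encode(payload))
        elif kind == "image":
            tokens.append(boi_id)
            tokens.extend(image_tokenizer.encode(payload))  # 1024 image tokens
            tokens.append(eoi_id)
        else:
            raise ValueError(f"unknown segment type: {kind}")
    return tokens
```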

With this approach, the new model inherits the advantages of unified representation, good flexibility, high scalability, and support for end-to-end learning.

On this basis (Figure 1a), in order to further improve the efficiency and performance of the early-fusion model, the team also introduced modality-aware sparsity techniques.



Width scaling: modality-aware mixture of experts

The team proposed a width-scaling approach: incorporating modality-aware block sparsity into the feed-forward module, thereby extending the standard mixture-of-experts (MoE) architecture.

The approach is based on the insight that tokens of different modalities have different characteristics and information density.

By constructing different expert groups for each modality, the model can develop specialized processing pathways while maintaining the ability to integrate information across modalities.

Figure 1b shows the key components of this modality-aware mixture of experts (MoMa). In short, experts are first grouped by modality; hierarchical routing (modality-aware routing followed by intra-modality routing) is then applied; finally, experts are selected within each group. The detailed procedure can be found in the original paper.
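
As an illustration of this two-level routing, here is a minimal PyTorch-style sketch. All names are ours; for brevity it uses top-1 token-choice routing inside each modality group, whereas the paper itself uses expert-choice routing.

```python
import torch
import torch.nn as nn

class MoMaFFNSketch(nn.Module):
    """Sketch of a modality-aware mixture-of-experts feed-forward block.

    Routing is hierarchical: (1) tokens are partitioned by modality, then
    (2) a per-modality router picks an expert within that modality's group.
    Top-1 token-choice routing is used here only to keep the sketch short.
    """

    def __init__(self, d_model: int, d_ff: int, experts_per_modality=(4, 4)):
        super().__init__()
        self.groups = nn.ModuleList()   # one expert group per modality
        self.routers = nn.ModuleList()  # one router per modality
        for n_experts in experts_per_modality:
            self.groups.append(nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(),
                              nn.Linear(d_ff, d_model))
                for _ in range(n_experts)]))
            self.routers.append(nn.Linear(d_model, n_experts))

    def forward(self, x: torch.Tensor, modality: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model); modality: (num_tokens,), 0 = text, 1 = image
        out = torch.zeros_like(x)
        for m, (experts, router) in enumerate(zip(self.groups, self.routers)):
            idx = (modality == m).nonzero(as_tuple=True)[0]
            if idx.numel() == 0:
                continue
            xm = x[idx]
            gates = router(xm).softmax(dim=-1)      # (n_m, n_experts)
            weight, choice = gates.max(dim=-1)      # top-1 expert per token
            ym = torch.zeros_like(xm)
            for e, expert in enumerate(experts):
                sel = (choice == e).nonzero(as_tuple=True)[0]
                if sel.numel() > 0:
                    ym[sel] = weight[sel, None] * expert(xm[sel])
            out[idx] = ym
        return out
```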

In general, for an input token x, the formal definition of the MoMa module is:
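
A plausible way to write this, in our own notation rather than the paper's exact symbols, is sketched below: each token is dispatched only to the expert group of its own modality, and its output is a gated combination of the experts that selected it.

```latex
% Sketch in our own notation (not necessarily the paper's exact formulation).
% m(x): modality of token x;  E_{m(x)}: the expert group for that modality;
% g_e(x): gating weight, nonzero only for the experts that select x.
\[
  \mathrm{MoMa}(x) \;=\; \sum_{e \,\in\, E_{m(x)}} g_{e}(x)\,\mathrm{FFN}_{e}(x)
\]
```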



After the MoMa computation, the team further applies residual connections and Swin Transformer normalization.

Mixture-of-Depths (MoD)

Previously, some researchers have explored introducing sparsity along the depth dimension, either by randomly discarding certain layers or by using learnable routers.

The team's approach follows the second line of work and integrates the recently proposed mixture-of-depths (MoD) technique. For more on MoD, see the Synced report "DeepMind upgrades Transformer, and forward pass FLOPs can be reduced by up to half".

Specifically, as shown in the figure below, the team integrates MoD before the mixture-of-experts (MoE) routing in each MoD layer, ensuring that MoD is applied to the full batch of tokens before they are separated by modality.
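
A rough sketch of this ordering, under our own simplifications (names are hypothetical and the real implementation is more involved): a capacity-limited MoD router first chooses which tokens enter the layer at all, and only those tokens reach the attention and modality-aware MoE sub-blocks.

```python
import torch
import torch.nn as nn

class MoDThenMoELayerSketch(nn.Module):
    """Sketch: mixture-of-depths routing applied before modality-aware MoE.

    The MoD router scores every token in the batch (before any modality split)
    and keeps only a fixed fraction; kept tokens are processed by `block`
    (e.g. attention followed by a MoMa FFN), skipped tokens pass through.
    """

    def __init__(self, d_model: int, block: nn.Module, capacity: float = 0.5):
        super().__init__()
        self.mod_router = nn.Linear(d_model, 1)
        self.block = block        # assumed to accept (tokens, modality)
        self.capacity = capacity  # fraction of tokens that go through the block

    def forward(self, x: torch.Tensor, modality: torch.Tensor) -> torch.Tensor:
        scores = self.mod_router(x).squeeze(-1)        # (num_tokens,)
        k = max(1, int(self.capacity * x.shape[0]))
        keep = scores.topk(k).indices                  # tokens that enter the block
        out = x.clone()                                # skipped tokens: identity
        gate = scores[keep].sigmoid().unsqueeze(-1)    # scale by router score
        out[keep] = x[keep] + gate * self.block(x[keep], modality[keep])
        return out
```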



Inference

In the inference phase, the expert-selection routing of MoE and the layer-selection routing of MoD cannot be used directly, because top-k selection over a batch of tokens breaks causality.

To preserve causality during inference, and inspired by the MoD paper mentioned above, the research team introduced an auxiliary router whose role is to predict the probability that a token will be selected by an expert or a layer based only on that token's hidden representation.
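
A minimal sketch of such an auxiliary router, under our own assumptions: a small classifier learns, from a single token's hidden state, to predict whether the batch-level top-k routing would have selected that token, so that decoding can proceed one token at a time.

```python
import torch
import torch.nn as nn

class AuxiliaryRouterSketch(nn.Module):
    """Sketch: predicts a token's selection probability from its hidden state alone."""

    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (..., d_model) -> probability in [0, 1] that the token is selected
        return torch.sigmoid(self.proj(h)).squeeze(-1)

def auxiliary_router_loss(router: AuxiliaryRouterSketch,
                          h: torch.Tensor, selected: torch.Tensor) -> torch.Tensor:
    """Train against the (non-causal) batch top-k decisions; at inference time
    only the per-token prediction is used, e.g. `router(h) > 0.5`, which keeps
    generation strictly causal."""
    return nn.functional.binary_cross_entropy(router(h), selected.float())
```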

Upcycling

Training an MoE architecture from scratch presents a unique challenge in optimizing both the representation space and the routing mechanism. The team observes that the MoE router is responsible for partitioning the representation space among the experts; however, in the early stages of training this representation space is far from optimal, which results in a suboptimally trained routing function.

To overcome this limitation, they proposed an upcycling method based on the paper "Sparse upcycling: Training mixture-of-experts from dense checkpoints" by Komatsuzaki et al.



Specifically, the team first trains an architecture with one FFN expert per modality. After a preset number of steps, the model is upcycled: each modality-specific FFN is converted into an expert-choice MoE module, and every expert is initialized from the FFN trained in the first phase. The learning-rate scheduler is reset, while the data-loader state from the previous phase is retained, so that the second training phase sees fresh data.
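
A sketch of what this upcycling step might look like in code; the function and its signature are our own, not FAIR's implementation. Every expert in a modality's group starts as a copy of the phase-one FFN for that modality, and only the router is freshly initialized.

```python
import copy
import torch.nn as nn

def upcycle_ffn_to_moe(dense_ffn: nn.Module, num_experts: int, d_model: int):
    """Sketch: convert one trained modality-specific FFN into an MoE expert group.

    Each expert is an exact copy of the phase-one FFN; the router is new.
    (Separately, the learning-rate scheduler is reset while the data-loader
    state is carried over, as described above.)
    """
    experts = nn.ModuleList([copy.deepcopy(dense_ffn) for _ in range(num_experts)])
    router = nn.Linear(d_model, num_experts)
    return experts, router
```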

To help the experts specialize further, the team also augments the MoE routing function with Gumbel noise, allowing the new router to sample experts in a differentiable way.
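
To illustrate the idea, here is a generic Gumbel-sigmoid sketch in our own notation (not the exact routing function used in the paper): logistic noise is added to the router logits so that the gate remains stochastic yet differentiable.

```python
import torch

def gumbel_sigmoid(logits: torch.Tensor, tau: float = 1.0, hard: bool = False) -> torch.Tensor:
    """Sketch of Gumbel-sigmoid sampling for a differentiable binary gate.

    The difference of two Gumbel(0, 1) samples is logistic noise; adding it to
    the logits and applying a temperature-scaled sigmoid yields a relaxed,
    differentiable sample of a Bernoulli gate.
    """
    u1 = torch.rand_like(logits).clamp_min(1e-9)
    u2 = torch.rand_like(logits).clamp_min(1e-9)
    noise = -torch.log(-torch.log(u1)) + torch.log(-torch.log(u2))
    y_soft = torch.sigmoid((logits + noise) / tau)
    if hard:
        # Straight-through estimator: hard 0/1 in the forward pass,
        # gradients flow through the soft sample in the backward pass.
        y_hard = (y_soft > 0.5).float()
        return y_hard + y_soft - y_soft.detach()
    return y_soft
```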

This upcycling approach, coupled with the Gumbel-sigmoid technique, overcomes the limitations of learned routers and improves the performance of the proposed modality-aware sparse architecture.

Efficiency optimization

To facilitate distributed training of MoMa, the team adopted Fully Sharded Data Parallel (FSDP). However, compared with conventional MoE, this method raises some unique efficiency issues, including load balancing and the efficiency of expert execution.

For the load balancing problem, the team developed a balanced data mixing method that keeps the text-to-image data ratio on each GPU consistent with the expert ratio.
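
As a toy illustration of that idea (our own simplification, not FAIR's code): each GPU's local batch is composed so that its text-to-image token ratio mirrors the text-to-image expert ratio, which keeps the per-modality expert load roughly even across devices.

```python
def per_gpu_modality_split(batch_tokens: int, n_text_experts: int, n_image_experts: int):
    """Sketch: split a per-GPU token budget so that the text:image data ratio
    matches the text:image expert ratio (e.g. 4 text / 4 image experts -> 1:1)."""
    total = n_text_experts + n_image_experts
    n_text_tokens = round(batch_tokens * n_text_experts / total)
    n_image_tokens = batch_tokens - n_text_tokens
    return n_text_tokens, n_image_tokens
```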

Regarding the efficiency of expert execution, the team explored some strategies to help improve the execution efficiency of experts in different modalities:

Restrict the experts within each modality to homogeneous experts, and prohibit routing text tokens to image experts and vice versa;

Use block sparsity to improve execution efficiency;

When the number of modalities is limited, experts from different modalities are run sequentially.

Since each GPU processes enough tokens in these experiments, hardware utilization is not a major problem even when multiple batched matrix multiplications are executed one after another. The team therefore considers sequential execution the better choice at the current experimental scale.
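
A minimal sketch of this sequential scheme (again our own simplification): text tokens and image tokens are gathered separately and their expert groups are executed one after the other, so no token is ever routed across modalities.

```python
import torch

def run_modality_experts_sequentially(x: torch.Tensor, modality: torch.Tensor,
                                      text_group, image_group) -> torch.Tensor:
    """Sketch: execute the text expert group, then the image expert group.

    `text_group` / `image_group` are callables (e.g. per-modality MoE modules)
    mapping a (n, d_model) tensor to an output of the same shape.
    """
    out = torch.empty_like(x)
    text_idx = (modality == 0).nonzero(as_tuple=True)[0]
    image_idx = (modality == 1).nonzero(as_tuple=True)[0]
    if text_idx.numel() > 0:
        out[text_idx] = text_group(x[text_idx])
    if image_idx.numel() > 0:
        out[image_idx] = image_group(x[image_idx])
    return out
```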

Other optimizations

To further improve throughput, the team also used several other optimization techniques.

These include general optimization operations such as reducing gradient communication volume and automated GPU kernel fusion. The research team also implemented graph optimization through torch.compile.

In addition, they developed several MoMa-specific optimizations, including reusing modality token indices across layers to minimize device synchronization between the CPU and GPU.

Experiments

Setup

The pre-training dataset and pre-processing pipeline used in the experiments are the same as Chameleon's. To evaluate scaling behavior, the team trained the models on more than 1 trillion tokens.



Table 1 gives the detailed configurations of dense and sparse models.

Scaling at different compute budgets

The team analyzed the scaling performance of different models at different compute levels (FLOPs) equivalent to three dense model sizes: 90M, 435M, and 1.4B.

Experimental results show that a sparse model can match the pre-training loss of a dense model with equivalent FLOPs while using only 1/η of the total FLOPs, where η is the pre-training speedup factor. For example, η = 2 would mean the sparse model reaches the dense model's loss with half the training FLOPs.

Modality Untying

Introducing modality-specific expert groups improves the pre-training efficiency of models of different sizes, which is particularly beneficial for the image modality. As shown in Figure 3, the moe_1t1i configuration using 1 image expert and 1 text expert significantly outperforms the corresponding dense model.



Expanding the number of experts in each modality grouping can further improve model performance.

Combining Mixture-of-Depths and Mixture-of-Experts

The team observed that using MoE, MoD, and their combination improves the convergence rate of the training loss. As shown in Figure 4, adding MoD to the moe_1t1i architecture (mod_moe_1t1i) significantly improves model performance across different model sizes.



In addition, mod_moe_1t1i is comparable to or even exceeds moe_4t4i across different model sizes and modalities, which indicates that introducing sparsity in the depth dimension can also effectively improve training efficiency.

On the other hand, it can also be seen that the benefits of stacking MoD and MoE gradually diminish.

Scaling the number of experts

To study the impact of expanding the number of experts, the team conducted further ablation experiments. They explored two scenarios: assigning an equal number of experts to each modality (balanced) and assigning a different number of experts to each modality (unbalanced). The results are shown in Figure 5.



For the balanced setting, as can be seen from Figure 5a, the training loss decreases significantly as the number of experts increases. However, the text and image losses show different scaling patterns. This suggests that the inherent characteristics of each modality lead to different sparse modeling behaviors.

For the unbalanced setting, Figure 5b compares three different configurations with the same total number of experts (8). It can be seen that the more experts there are in a modality, the better the model generally performs on that modality.

Upcycling

The team also verified the effect of the upcycling approach described above. Figure 6 compares the training curves of different model variants.



The results show that upcycling can indeed further improve model training: with a 10k-step first stage, upcycling yields a 1.2x FLOPs benefit; with a 20k-step first stage, it still yields a 1.16x FLOPs benefit.

Additionally, the performance gap between the upcycled model and the model trained from scratch widens as training progresses.

Throughput Analysis

Sparse models do not automatically translate into immediate throughput gains, because they introduce additional dynamism and associated data-balancing issues. To quantify the impact of the proposed methods on training efficiency, the team ran controlled-variable experiments comparing the training throughput of different architectures. The results are shown in Table 2.



As can be seen, modality-aware sparsity achieves a better quality-throughput trade-off than the dense model and scales reasonably as the number of experts grows. On the other hand, although the MoD variants achieve the best absolute loss, they tend to be more computationally expensive due to the additional dynamism and imbalance.

Inference-time performance

The team also evaluated the model's performance on held-out language-modeling data and downstream tasks. The results are shown in Tables 3 and 4.



As shown in Table 3, with its modality-specific experts, the 1.4B MoMa 1t1i model outperforms the corresponding dense model on most metrics, except for the image-to-text conditional perplexity on COCO and Flickr. Further expanding the number of experts also improves performance, with 1.4B MoE 8x achieving the best image-to-text performance.

In addition, the 1.4B MoE 8x model is also very good at text-to-text tasks, as shown in Table 4. 1.4B MoMa 4t4i performs best on all conditional image perplexity metrics, while its text perplexity on most benchmarks is also very close to 1.4B MoE 8x.

Overall, the 1.4B MoMa 4t4i model has the best modeling results on mixed text and image modal data.

For more details, please read the original paper.