
Algorithms, systems and applications: three perspectives to fully understand Mixture of Experts (MoE)

2024-07-26




Machine Heart Report

Editor: Panda W

LLMs are very powerful, but scaling them sustainably requires finding and implementing methods that improve their efficiency. Mixture of Experts (MoE) is an important member of this family of methods.

Recently, the new generation of large models proposed by various technology companies all use the Mixture of Experts (MoE) approach.

The concept of mixture of experts was first proposed in the 1991 paper "Adaptive mixtures of local experts" and has been widely explored and developed over the past thirty years. In recent years, with the emergence and development of sparse gated MoE, especially in combination with large-scale language models based on Transformer, this technology, which has been around for more than thirty years, has gained new vitality.

The MoE framework is based on a simple yet powerful idea: different parts of the model (called experts) focus on different tasks or different aspects of the data.

When using this paradigm, only experts related to an input are involved in processing, which can control the computational cost while still benefiting from a large amount of expertise. Therefore, MoE can improve the capabilities of large language models without significantly increasing computational requirements.

As shown in Figure 1, MoE-related research has grown strongly, especially since the emergence of Mixtral-8x7B and, in 2024, various industrial-grade LLMs such as Grok-1, DBRX, Arctic, and DeepSeek-V2.



This picture comes from a MoE review report recently released by a research team from the Hong Kong University of Science and Technology (Guangzhou), which clearly and comprehensively summarizes MoE-related research and proposes a new classification method, classifying these studies into three categories: algorithms, systems, and applications.



Paper title: A Survey on Mixture of Experts

Paper address: https://arxiv.org/pdf/2407.06204

Synced has compiled the main content of this review report to help readers understand the current development of MoE. For more details, please read the original paper. In addition, we have also compiled some reports related to MoE at the end of the article.

Background knowledge on mixture of experts

In Transformer-based large language models (LLMs), each mixture of experts (MoE) layer is usually composed of N "expert networks" {f_1, ..., f_N} and a "gating network" G.

This gating network is usually a linear network with a softmax activation function, which directs the input to the appropriate expert networks. Within a Transformer block, the MoE layer is typically used to select the feed-forward network (FFN), and is usually placed after the self-attention (SA) sublayer. This placement is critical because the computational requirements of the FFN grow as the model scales up. For example, in the 540-billion-parameter PaLM model, 90% of the parameters are in its FFN layers.

To describe it in mathematical form: each expert network f_i (usually a linear-ReLU-linear network), parameterized by W_i, receives the same input x and produces an output f_i(x; W_i). Meanwhile, the gating network G (usually composed of a linear-ReLU-linear-softmax network), with parameters Θ, produces the output G(x; Θ). Based on the design of the gating function, MoE layers can be roughly divided into the following two categories.
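To make these components concrete, below is a minimal PyTorch-style sketch of an expert network and a gating network as just described; the class names, dimensions, and the use of PyTorch are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """One expert f_i(x; W_i): a linear-ReLU-linear feed-forward network."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(F.relu(self.fc1(x)))

class Gate(nn.Module):
    """Gating network G(x; Θ): a linear projection followed by softmax,
    producing one weight per expert for every input token."""
    def __init__(self, d_model: int, num_experts: int):
        super().__init__()
        self.proj = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.softmax(self.proj(x), dim=-1)
```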



Dense MoE

The dense mixture-of-experts layer activates all expert networks {f_1, ..., f_N} in each iteration. This strategy was widely adopted in early MoE research. Some recent studies have also adopted dense MoE, such as EvoMoE, MoLE, LoRAMoE and DS-MoE. Figure 2a shows the structure of the dense MoE layer. Therefore, the output of the dense MoE layer can be expressed as:



where g(x; Θ) is the gating value before the softmax operation.
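Spelled out in the notation just defined, this dense output takes the standard form below (a reconstruction from the surrounding definitions; the paper's exact formula may differ slightly in notation):

$$
\mathbf{y} \;=\; \sum_{i=1}^{N} G(\mathbf{x};\Theta)_i \, f_i(\mathbf{x};\mathbf{W}_i),
\qquad
G(\mathbf{x};\Theta) \;=\; \mathrm{softmax}\big(g(\mathbf{x};\Theta)\big).
$$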

Sparse MoE

Although dense mixture-of-experts layers generally achieve higher prediction accuracy, they also incur a very high computational load.

To address this problem, Shazeer et al.'s paper "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer" introduces a sparse gated MoE layer that activates only a selected subset of experts in each forward pass. This strategy achieves sparsity by computing a weighted sum of the outputs of the top-k experts, rather than aggregating the outputs of all experts together. Figure 2b shows the structure of this sparse MoE layer.

According to the framework proposed in the above paper, Equation 2.2 can be modified to reflect the sparse gating mechanism:



Here is an explanation: the TopK(·, k) function retains only the top-k entries of the vector at their original values, while setting the other entries to −∞. After the softmax operation, these −∞ entries become approximately zero. The hyperparameter k should be selected according to the specific application; common choices are k = 1 or k = 2. Adding the noise term R_noise is a common strategy for training sparsely gated MoE layers, as it promotes exploration among the experts and improves the stability of MoE training.
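Putting these pieces together, the sparse gating described here takes the standard Shazeer-style form below (a reconstruction based on the surrounding text, not a verbatim copy of the paper's equation):

$$
G(\mathbf{x};\Theta) = \mathrm{softmax}\Big(\mathrm{TopK}\big(g(\mathbf{x};\Theta) + R_{\text{noise}},\, k\big)\Big),
\qquad
\mathrm{TopK}(\mathbf{v}, k)_i =
\begin{cases}
v_i, & \text{if } v_i \text{ is among the top-}k \text{ entries of } \mathbf{v},\\
-\infty, & \text{otherwise,}
\end{cases}
$$

so that only the k experts with nonzero gate values need to be evaluated in the weighted sum $\mathbf{y} = \sum_{i=1}^{N} G(\mathbf{x};\Theta)_i\, f_i(\mathbf{x};\mathbf{W}_i)$.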

Although sparse gating G(x; Θ) can significantly expand the parameter space of the model without increasing the corresponding computational cost, it also leads to the load balancing problem. The load balancing problem refers to the uneven distribution of loads among experts - some experts are frequently used, while others are rarely used or not used at all.

To solve this problem, each MoE layer integrates an auxiliary loss function that encourages the tokens of each batch to be distributed evenly across the experts. In mathematical terms, first define a query batch B = {x_1, x_2, ..., x_T} containing T tokens, along with N experts. The auxiliary load-balancing loss is then defined as:



where D_i is the fraction of tokens assigned to expert i, and P_i is the fraction of gating probability allocated to expert i. To ensure that the batch is distributed evenly among the N experts, the load-balancing loss function L_{load-balancing} should be minimized. The optimum is reached when every expert receives the same fraction of tokens, D_i = 1/N, and the same fraction of gating probability, P_i = 1/N:



At this point, the loads of the experts are balanced.
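Based on these definitions, the auxiliary loss has the widely used Switch-Transformer-style form below (again a reconstruction consistent with the text rather than the paper's verbatim formula):

$$
\mathcal{L}_{\text{load-balancing}} = N \sum_{i=1}^{N} \mathcal{D}_i \, \mathcal{P}_i,
\qquad
\mathcal{D}_i = \frac{1}{T}\sum_{t=1}^{T} \mathbb{1}\!\left\{\arg\max_j G(\mathbf{x}_t;\Theta)_j = i\right\},
\qquad
\mathcal{P}_i = \frac{1}{T}\sum_{t=1}^{T} G(\mathbf{x}_t;\Theta)_i,
$$

which equals 1 under the balanced optimum $\mathcal{D}_i = \mathcal{P}_i = 1/N$ described above.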

In the following text, unless explicitly stated otherwise, the term “MoE” refers only to “sparse MoE”.

Classification of mixture of experts

To help researchers navigate the large number of LLM studies that use MoE, the team developed a taxonomy that categorizes these models along three dimensions: algorithm design, system design, and application.

Figure 3 shows this taxonomy and some representative research results.



A comprehensive and in-depth look at each category is provided below.

Algorithm design of mixture of experts

Gating Function

The gating function (also called routing function or router) is a fundamental component of all MoE architectures. Its role is to coordinate the use of expert computations and combine the outputs of various experts.

Based on how each input is processed, gating can be divided into three types: sparse, dense, and soft. Sparse gating activates only a subset of experts, while dense gating activates all of them. Soft gating covers fully differentiable approaches, including input-token fusion and expert fusion. Figure 4 shows the various gating functions used in MoE models.



Sparse

The sparse gating function activates a selected subset of experts when processing each input token, which can be viewed as a form of conditional computation.

Gating functions can implement many forms of gating decisions, such as binary decisions, sparse or continuous decisions, and stochastic or deterministic decisions; they have been studied intensively and can be trained using various forms of reinforcement learning and backpropagation.

The study "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer" by Shazeer et al. pioneered a differentiable heuristic method using an auxiliary load balancing loss, where the output of the expert calculation can be weighted according to the probability of selection. This introduces differentiability to the gating process, thereby guiding the optimization of the gating function through the gradient.

Later, this paradigm became the dominant paradigm in the MoE research field. Since this method selects experts for each input token, it can be regarded as a token-selective gating function.

The following are the main points of this section, see the original paper for details:

Token selective gating

Auxiliary loss for token selective gating

Expert capacity for token selective gating

Other developments in token selective gating

Non-trainable token selective gating

Expert-choice gating



Dense

Dense MoE means that all experts are activated when processing each input.

Although sparse MoE has efficiency advantages, dense MoE continues to see innovation. In particular, dense activation works well for LoRA-MoE fine-tuning, since the computational overhead of LoRA experts is relatively low. This approach can effectively and flexibly integrate multiple LoRAs to handle various downstream tasks, preserving the generative power of the original pre-trained model while retaining the task-specific characteristics of each LoRA.
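As a concrete illustration of this dense LoRA-expert setup, here is a minimal PyTorch-style sketch in which a frozen base linear layer is combined with several always-active LoRA experts; the class, its arguments, and the default rank are illustrative assumptions rather than the design of any specific LoRA-MoE system.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseLoRAMoE(nn.Module):
    """A frozen base linear layer plus a dense mixture of LoRA experts.
    Every LoRA expert is active for every token; their low-rank updates
    are blended with softmax gating weights."""
    def __init__(self, d_in: int, d_out: int, num_experts: int, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        for p in self.base.parameters():          # keep pre-trained weights frozen
            p.requires_grad_(False)
        self.gate = nn.Linear(d_in, num_experts)
        self.lora_A = nn.Parameter(torch.randn(num_experts, rank, d_in) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(num_experts, d_out, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = F.softmax(self.gate(x), dim=-1)                      # (tokens, E)
        low_rank = torch.einsum("erd,td->ter", self.lora_A, x)         # (tokens, E, rank)
        updates = torch.einsum("eor,ter->teo", self.lora_B, low_rank)  # (tokens, E, d_out)
        return self.base(x) + torch.einsum("te,teo->to", weights, updates)
```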

Soft

For sparse MoE, a fundamental discrete optimization problem is how to decide which experts to assign to each token. This usually requires a heuristic auxiliary loss to ensure balanced participation of experts and minimize unassigned tokens. This problem is particularly prominent in scenarios involving out-of-distribution data (such as small inference batches, new inputs, or transfer learning).

Similar to dense MoE, soft MoE methods also use all experts when processing each input, thereby maintaining full differentiability and avoiding the inherent problems of discrete expert selection methods. The difference between soft MoE and dense MoE is that the former alleviates the computational requirements by performing a gated weighted fusion of the input tokens or experts.
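For intuition, here is a schematic NumPy sketch of the token-fusion idea, in the spirit of Soft MoE: each expert processes "slots" that are soft, weighted blends of all input tokens, and each output token is a soft blend of all slot outputs. The function names, shapes, and slot mechanism are illustrative simplifications, not the formulation used in any particular paper.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def soft_moe_layer(X, Phi, experts):
    """Soft token fusion: every slot is a weighted average of all input
    tokens, each expert processes its own slots, and every output token is
    a weighted average of all slot outputs. Fully differentiable, with no
    discrete expert assignment.
      X:       (num_tokens, d) input tokens
      Phi:     (d, num_slots)  learned slot parameters
      experts: list of callables; num_slots must be divisible by len(experts)
    """
    logits = X @ Phi                              # (tokens, slots)
    dispatch = softmax(logits, axis=0)            # normalize over tokens
    combine = softmax(logits, axis=1)             # normalize over slots
    slots = dispatch.T @ X                        # (slots, d) fused inputs
    per_expert = np.split(slots, len(experts))    # contiguous slot blocks per expert
    outputs = np.concatenate([f(s) for f, s in zip(experts, per_expert)])
    return combine @ outputs                      # (tokens, d)
```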

Experts

This section introduces the architecture of the expert network within the MoE framework and discusses the gating functions that coordinate the activations of these experts.

Network Type

Since MoE was incorporated into the Transformer architecture, MoE layers have typically been used to replace the feed-forward network (FFN) modules in these models, and each expert in an MoE layer typically replicates the architecture of the FFN it replaces.

This paradigm of using FFN as an expert is still the mainstream today, but people have also made a lot of improvements to it.

Hyperparameters

The size of the sparse MoE model is controlled by several key hyperparameters, including:

Number of experts per MoE layer

Size of each expert

How often the MoE layer is placed throughout the model

The choice of these hyperparameters is critical as it profoundly affects the performance and computational efficiency of the model in various tasks. Therefore, the optimal hyperparameters should be selected based on the specific application requirements and computing infrastructure. Table 2 shows the configuration of some models using MoE.
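For illustration, such a configuration might be described by a handful of values like those below; the numbers are made up for the example and are not taken from Table 2.

```python
# Hypothetical sparse-MoE configuration; the values are illustrative only.
moe_config = {
    "num_layers": 24,            # Transformer blocks in total
    "moe_layer_frequency": 2,    # replace the FFN with an MoE layer every 2 blocks
    "num_experts_per_layer": 8,  # experts in each MoE layer
    "expert_hidden_dim": 4096,   # hidden size of each expert's FFN
    "top_k": 2,                  # experts activated per token by the sparse gate
}
```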



In addition, Table 3 lists the number of parameters and benchmark performance of some recent open source models.



Activation Function

Sparse MoE models built on the dense Transformer architecture use activation functions similar to those of leading dense LLMs such as BERT, T5, GPT, and LLaMA. Activation functions have evolved from ReLU to more advanced options such as GeLU, GeGLU, and SwiGLU.

This trend also extends to other components of MoE models, which often integrate techniques such as root mean square layer normalization (RMSNorm), grouped query attention (GQA), and rotary position embedding (RoPE).
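As an example of this evolution, the SwiGLU feed-forward block that many recent models use for their FFNs (and hence for their experts) can be sketched as follows; the class name and dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """A SwiGLU feed-forward block: SwiGLU(x) = (SiLU(x W_gate) * x W_up) W_down."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```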

Shared Experts

DeepSpeed-MoE innovatively introduces a residual MoE architecture in which each token is processed by a fixed expert plus a gated expert, so that two experts participate at each layer without letting the communication cost exceed that of top-1 gating. This method treats the gated MoE expert as an error-correction aid for the fixed dense FFN.

Conditional MoE Routing (CMR) used in NLLB also adopts a similar approach, combining the outputs of the dense FFN and MoE layers.

The paradigm that integrates fixed FFN and sparse MoE is often referred to as shared experts, as shown in Figure 5b.



Recently, models such as DeepSeekMoE, OpenMoE, Qwen1.5-MoE, and MoCLE have adopted this paradigm, indicating that it is becoming a mainstream configuration. However, DeepSeekMoE and Qwen1.5-MoE use multiple shared experts instead of a single one.
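A minimal sketch of the shared-expert layout is shown below, assuming one always-active shared FFN whose output is added to that of the top-k routed experts; the class, the simple stand-in FFN, and the per-token loop are illustrative simplifications rather than any specific model's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_ffn(d_model: int, d_hidden: int) -> nn.Module:
    """A simple linear-ReLU-linear FFN used as a stand-in expert."""
    return nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))

class SharedExpertMoE(nn.Module):
    """Shared-expert layout: a fixed dense FFN processes every token, and its
    output is added to the weighted output of the top-k routed experts."""
    def __init__(self, d_model: int, d_hidden: int, num_experts: int, k: int = 1):
        super().__init__()
        self.shared = make_ffn(d_model, d_hidden)      # always-active shared expert
        self.experts = nn.ModuleList(make_ffn(d_model, d_hidden) for _ in range(num_experts))
        self.gate = nn.Linear(d_model, num_experts)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = self.gate(x)                          # (tokens, N)
        topk_val, topk_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk_val, dim=-1)          # renormalize over the selected experts
        routed = []
        for t in range(x.size(0)):                     # naive per-token loop, for clarity
            routed.append(sum(weights[t, s] * self.experts[int(topk_idx[t, s])](x[t])
                              for s in range(self.k)))
        return self.shared(x) + torch.stack(routed)
```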

Mixture of parameter-efficient experts

Parameter-efficient fine-tuning (PEFT) is a family of methods for improving fine-tuning efficiency. Simply put, PEFT updates only a small fraction of the base model's parameters during fine-tuning.

PEFT has been successful, but because of its limited number of trainable parameters and the risk of catastrophic forgetting, it is difficult to apply in settings that require generalization across multiple tasks.

To alleviate these limitations, the mixture of parameter-efficient experts (MoPE) was born, which combines the MoE framework with PEFT. MoPE integrates the gating mechanism and multi-expert architecture of MoE, while each expert is built using PEFT techniques. This combination can greatly improve the performance of PEFT in multi-task scenarios. In addition, since the experts are built with PEFT, MoPE uses fewer parameters and is far more resource-efficient than a traditional MoE model.

MoPE combines the multi-task characteristics of MoE with the resource efficiency of PEFT, and is a very promising research direction. Figure 6 classifies MoPE according to its position in the Transformer model architecture. For a more detailed introduction to the research results on MoPE, please refer to the original paper.



Training and inference schemes

Mixture of experts is advancing, and so are the associated training and inference schemes.

The original training and inference scheme is to train the MoE model from scratch and perform inference directly with the trained model configuration.

But now, many new paradigms have emerged for training and inference of MoE models, including ones that combine the complementary advantages of dense and sparse models.



Figure 7 shows the training and inference schemes related to MoE. The emerging schemes can be divided into three categories:

Dense to Sparse: Start with dense model training and gradually transition to a sparse MoE configuration;

Sparse to dense: distill the sparse MoE model down to a dense form, which facilitates deploying it for inference on hardware;

Expert model fusion: Integrate multiple pre-trained dense expert models into a unified MoE model.

MoE-derived technologies

Mixture of Experts (MoE) has inspired many different variants. For example, Xue et al.'s paper "Go wider instead of deeper" proposed WideNet, which increases model width by replacing the feed-forward network (FFN) with an MoE layer while sharing the trainable parameters across Transformer layers, except for the normalization layers.

There are also SUT (Sparse Universal Transformer) proposed by Tan et al., MoT (Mixture of Tokens) proposed by Antoniak et al., SMoP (Sparse Mixture of Prompts) proposed by Choi et al., Lifelong-MoE proposed by Chen et al., and MoD (Mixture of Depths) proposed by Raposo et al.

To sum up, the development of MoE-derived technologies reveals a trend: MoE has more and more functions and is increasingly adaptable to different fields.

System design of mixture of experts

While Mixture of Experts (MoE) can enhance the capabilities of large language models, it also brings new technical challenges due to its sparse and dynamic computational load.

GShard introduces expert parallelism, which dispatches partitioned local tokens under expert-capacity load-balancing constraints, enabling parallel gating and expert computation. This paradigm has become a foundational strategy for scaling MoE models efficiently. We can think of this method as an enhanced version of data parallelism: each expert in the MoE layer is assigned to a different device, while all non-expert layers are replicated on every device.

As shown in Figure 8a, the workflow of expert parallelism performs the following operations in sequence: gate routing, input encoding, All-to-All dispatch, expert computation, All-to-All combine, and output decoding.



Generally speaking, the input size of a GEMM needs to be large enough to fully utilize the computing device. Input encoding is therefore used to gather the input tokens belonging to the same expert into a contiguous memory region, as determined by the token-expert mapping produced by gate routing. All-to-All dispatch then sends the input tokens to their corresponding experts on each device, after which the experts perform their computation locally. Once the computation is complete, the results are gathered through the All-to-All combine, and output decoding restores the original data layout according to the gating indices.
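The sketch below walks through this sequence in a single process, using NumPy and a top-1 router for simplicity; the two All-to-All steps are only marked with comments, since in a real deployment they are collective-communication calls that move each expert's buffer between devices. All names and shapes are illustrative assumptions.

```python
import numpy as np

def expert_parallel_forward(tokens, gate_logits, experts):
    """Single-process sketch of the expert-parallel workflow:
    gate routing -> input encoding -> (All-to-All dispatch) -> expert
    computation -> (All-to-All combine) -> output decoding.
      tokens:      (T, d) array of token representations
      gate_logits: (T, N) router scores, N = number of experts
      experts:     list of N callables
    """
    # 1. Gate routing: pick the top-1 expert index for every token.
    assignment = gate_logits.argmax(axis=1)

    # 2. Input encoding: group tokens of the same expert into contiguous buffers.
    order = np.argsort(assignment, kind="stable")
    grouped = tokens[order]
    counts = np.bincount(assignment, minlength=len(experts))

    # 3. All-to-All dispatch would send each expert's buffer to the device
    #    hosting that expert (omitted in this single-process sketch).

    # 4. Expert computation: each expert processes only its own tokens.
    outputs, start = np.empty_like(grouped), 0
    for i, expert in enumerate(experts):
        end = start + counts[i]
        outputs[start:end] = expert(grouped[start:end])
        start = end

    # 5. All-to-All combine would gather the results back (omitted here).

    # 6. Output decoding: restore the original token order.
    decoded = np.empty_like(outputs)
    decoded[order] = outputs
    return decoded
```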

In addition, some researchers have explored the synergy between expert parallelism and other existing parallel strategies (such as tensor, pipeline, and sequence parallelism) to improve the scalability and efficiency of MoE models in large-scale distributed environments.

Figure 8 shows some examples of hybrid parallelism, including (b) data + expert + tensor parallelism, (c) data + expert + pipeline parallelism, and (d) expert + tensor parallelism.

It is important to recognize that there are complex interactions among computational efficiency, communication load, and memory usage, which are influenced both by the choice of distributed parallelization strategy and by the hardware configuration. Therefore, when deploying strategies for real-world applications, careful trade-offs must be made and adjusted for the specific scenario.

The team then introduces the system-design challenges facing MoE model development, together with the research addressing them, in three major sections: computation, communication, and storage. For details, please refer to the original paper. Table 4 gives an overview of open-source MoE frameworks.



Applications of mixture of experts

In the current Transformer-dominated large language model (LLM) field, the mixture of experts (MoE) paradigm is attractive because it can significantly improve model capabilities without introducing excessive computational demands in the training and inference stages. Such techniques can significantly improve the performance of LLMs on a variety of downstream tasks, and have even enabled some AI applications that surpass human-level performance.

There are rumors that the powerful GPT-4 may also use some kind of MoE architecture, consisting of 8 experts of 220 billion parameters each, trained on a variety of datasets and tasks, and using a 16-iteration inference process. For more details about this rumor, please refer to the Synced report "The Ultimate 'Revelation': GPT-4 Model Architecture, Training Cost, and Dataset Information Are All Revealed".

So it’s no surprise that MoE has blossomed in natural language processing, computer vision, recommender systems, and multimodal applications.

These applications essentially rely on conditional computation to significantly increase the number of model parameters and thus enhance model performance at a fixed computational cost, or on gating mechanisms that perform dynamic expert selection to achieve efficient multi-task learning.

The team also introduced representative MoE applications in these different fields to help readers understand how to use MoE for specific tasks. See the original paper for details.

Challenges and opportunities

Mixture of experts: powerful capabilities, lower cost, higher performance. The prospects are promising, but challenges remain.

In this section, the team lays out the key challenges related to MoE and points out future research directions that are expected to yield important results. These challenges and research directions are briefly listed below. For more details, please refer to the original paper.

Training stability and load balancing

Scalability and communication overhead

Specialization and collaboration of experts

Sparse activations and computational efficiency

Generalization and robustness

Explainability and transparency

Optimal expert architecture

Integrate with existing frameworks