
LLM Mixture of Experts Explained

Author: Claudio


Mixture of Experts (MoE) is an AI technique in which a set of specialized models (experts) is orchestrated by a gating mechanism that sends different parts of the input to the experts best suited to handle them. This "divide and conquer" strategy optimizes for both performance and efficiency. It builds on the same insight as traditional machine learning ensembles: a collection of weaker models, each specializing in a specific area, can together produce more accurate results. Unlike classic ensembles, however, MoE routes each input dynamically during generation rather than consulting every model.

Modern AI models face two major challenges: the enormous consumption of computational resources and the difficulty of fitting a single model to increasingly diverse and complex data. MoE has emerged as a powerful answer to both. In this blog post, I'll explain how OpenAI reportedly combined eight models into what we call GPT-4, how Mixtral's architecture made the method even more efficient, and dive deeper into the technical components that make it all work.


Index

The Mixture of Experts: Explained

How GPT-4 Implements Mixture of Experts

Mistral AI's Mixtral 8x7B Explained

Technical Components of MoE

Advantages: Speed and Efficiency

Disadvantages: GPU Needs & Training Challenges

The Evolution of MoE

MoE in the Wild: Beyond LLMs

Open Source MoEs and Exciting Directions

The Bottom Line: Why MoE Matters


The Mixture of Experts: Explained

Let's start with an observation: to build an LLM application, you will, of course, need an LLM. However, when you break down your app's functionalities, you'll find that different components serve distinct purposes. Some components might retrieve data, others might drive a "chat" experience, and some could handle formatting or summarization. Similar to traditional machine learning, where combining models in ensembles like boosting and bagging improves results, MoE in LLMs uses a set of expert sub-networks inside the transformer. These are trained differently and weighted differently to create a complex, dynamic inference pipeline.

Room of Experts

In the context of LLMs, each model, or 'expert,' naturally develops proficiency in different areas during training. The role of a 'coordinator' is played by a Gating Network (also called a Router). This network's crucial task is to direct inputs to the appropriate expert(s) based on the topic. As it trains, the Gating Network gets better at understanding each expert's strengths and fine-tunes its routing decisions.

It's important to clarify that these LLM 'experts' don't possess expertise like human specialists. Their 'expertise' resides in a complex, high-dimensional embedding space. The notion of categorizing them into domains is a conceptual tool to help us understand their diverse capabilities.

What makes MoE unique?

In traditional "dense" models, all tasks are processed by a single, monolithic neural network—like a generalist handling every problem. For complex problems, finding a single generalist model capable of handling everything is difficult and computationally expensive, which is why the MoE architecture is so valuable. It activates only the most relevant subset of parameters for any given input, enhancing specialization while saving on computation.


How GPT-4 Implements Mixture of Experts 📄

On June 20th, 2023, George Hotz, founder of Comma.ai, claimed that GPT-4 is not a single massive model but a combination of 8 smaller models, each with roughly 220 billion parameters. The claim was later echoed by Soumith Chintala, co-creator of PyTorch at Meta, though OpenAI has never officially confirmed it.

GPT-4 -> 8 x 220B params = ~1.76 Trillion params

For context, GPT-3 has around 175B parameters, and GPT-3.5 is widely assumed to be of a similar scale. However, the total parameter calculation for an MoE model isn't straightforward. Typically, only the feed-forward network (FFN) layers are replicated for each expert, while other layers (like attention mechanisms) are shared. This significantly reduces the true total parameter count, which by these estimates lands somewhere between 1.2 and 1.76 trillion for GPT-4.
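
To see how shared layers shrink the total, here's a back-of-the-envelope sketch. Both the 220B-per-expert figure and the fraction of parameters assumed to sit in the FFN layers are unverified assumptions, so treat the result as illustrative only:

```python
# Rough estimate of total parameters in a hypothetical 8-expert MoE, assuming
# (purely for illustration) that only the FFN layers are replicated per expert
# while attention and embedding layers are shared.
experts = 8
params_per_expert_model = 220e9  # rumored per-expert size (unverified)
ffn_fraction = 0.65              # assumed share of parameters in FFN layers

shared = params_per_expert_model * (1 - ffn_fraction)          # attention, embeddings, ...
replicated = params_per_expert_model * ffn_fraction * experts  # one FFN stack per expert

total = shared + replicated
print(f"Estimated total: {total / 1e12:.2f}T parameters")      # ~1.22T with these numbers
```

With different assumptions about how much of each expert is shared, the total slides anywhere between the ~1.2T lower bound and the naive 1.76T upper bound.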

Why is GPT-4 sometimes seen as becoming "dumber" or "lazy"?

Recent reports of degraded answer quality and "laziness" in GPT-4 may be directly connected to its MoE architecture. As OpenAI focuses on reducing inference costs and prices, they might be using fewer or smaller experts for certain queries.

Cost Reduction: Since every expert must be loaded into VRAM, the hardware requirements are immense. Reducing the number or size of experts used per query has a big impact on costs but can affect performance.

Aggressive RLHF: Continuous Reinforcement Learning from Human Feedback (RLHF) is used to make GPT-4 safer and more useful for specific products, but this can sometimes make it less creative or interesting for the everyday ChatGPT user.

Distillation/Quantization: OpenAI may be using techniques like distillation (compressing the MoE into a smaller dense model) or quantization (reducing the precision of the model's weights) to further cut costs.

The lack of transparency means we can only speculate based on performance and occasional leaks.


Mistral AI's Mixtral 8x7B Explained 📄

Mixtral is outperforming many large models while being incredibly efficient. It employs a routing layer that selects a combination of two experts for each token, optimizing resource usage. It has a total of 46.7B parameters, but it only uses about 12.9B active parameters per token.
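
Those two published numbers are enough to back out a rough split between shared and expert-specific parameters. The quick calculation below is derived only from the 46.7B/12.9B figures and assumes all eight experts are the same size; it is not an official breakdown:

```python
# Rough decomposition of Mixtral's parameter budget from its published totals.
#   total  = shared + 8 * per_expert   (all eight expert FFN stacks are stored)
#   active = shared + 2 * per_expert   (only two experts run per token)
total, active = 46.7e9, 12.9e9

per_expert = (total - active) / 6   # ~5.6B parameters per expert FFN stack
shared = active - 2 * per_expert    # ~1.6B shared (attention, embeddings, ...)

print(f"per expert ≈ {per_expert / 1e9:.1f}B, shared ≈ {shared / 1e9:.1f}B")
```

In other words, the overwhelming majority of Mixtral's parameters live in the expert FFNs, which is exactly why replicating only those layers is such an effective way to scale.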

The Architecture of Mixtral

Mixtral is a sparse mixture-of-experts (SMoE) network. At its core, it's a decoder-only model, and like its predecessor Mistral 7B, it's fully open-source with an Apache 2.0 license.

The Expert Mechanism: Mixtral's magic lies in its feedforward blocks. Instead of a single set of parameters, it picks from eight distinct groups.

Token Routing: For every token, a router network chooses two groups of experts. This dual selection allows for more nuanced processing.

Additive Combination: The outputs from the two chosen experts are combined as a weighted sum, using the router's probabilities as weights, blending their specialized knowledge (see the sketch after this list).

Efficiency: This ingenious approach means Mixtral operates with the speed and cost of a 12.9B parameter model, despite having a much larger total parameter count (46.7B).
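
To make the routing-and-combine step concrete, here is a simplified PyTorch sketch of a sparse MoE feed-forward block with top-2 routing. It is a minimal illustration under assumed dimensions and a plain two-layer MLP per expert, not Mixtral's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEBlock(nn.Module):
    """Simplified sparse MoE feed-forward block with top-2 routing (illustrative only)."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                  # x: (tokens, d_model)
        logits = self.router(x)                            # (tokens, n_experts)
        top_w, top_idx = logits.topk(self.top_k, dim=-1)   # pick 2 experts per token
        top_w = F.softmax(top_w, dim=-1)                   # normalize their weights

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e               # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += top_w[mask, slot, None] * expert(x[mask])
        return out

# Usage: y = SparseMoEBlock()(torch.randn(10, 512))
```

Only the two selected experts ever run for a given token, which is where the compute savings come from; the loop over experts is kept for readability rather than speed.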

Performance and Benchmarks

Mixtral's performance is a major highlight. It outperforms larger models like Llama 2 70B and GPT-3.5 on various benchmarks and is about six times faster in inference. It currently sits at the top of many open-source leaderboards, proving that open-weights models can compete with proprietary giants.

Training and Implementation Challenges

VRAM Requirements: Running Mixtral effectively requires at least around 30 GB of VRAM even with 4-bit quantization (considerably more at full precision), necessitating high-end GPUs like the NVIDIA A100 or A6000.

Engineering Hurdles: Fine-tuning or even just serving such a model on accessible hardware requires engineering tricks like 4-bit quantization and QLoRA, especially for the linear layers in the attention blocks; a loading sketch follows below.
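
As a concrete starting point, here is a sketch of how one might load Mixtral in 4-bit precision with Hugging Face transformers and bitsandbytes. The checkpoint name and the ~30 GB VRAM budget are assumptions, and exact behavior can vary across library versions; QLoRA fine-tuning would typically add PEFT adapters on top of a model loaded this way:

```python
# Sketch: loading Mixtral with 4-bit quantization (transformers + bitsandbytes).
# Assumes the Hugging Face checkpoint "mistralai/Mixtral-8x7B-v0.1" and roughly
# 30 GB of free GPU memory; details may differ across library versions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "mistralai/Mixtral-8x7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across the available GPUs
)

inputs = tokenizer("Mixture of Experts is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```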


Technical Components of MoE

The survey "A Comprehensive Survey of Mixture-of-Experts" provides a fantastic breakdown of the core components. Let's dive deeper.

1. Gating Network (The Router)

This is the conductor of the orchestra, deciding which tokens go to which experts. Its design is critical for the model's effectiveness. The router's goal is to learn which expert is best for a given input, ensuring that experts specialize and that the workload is distributed evenly.

Linear (Softmax) Gating: Most MoE models, including Mixtral, use a simple and effective linear layer followed by a softmax function. The router calculates a score for each expert, and softmax turns these scores into probabilities. The Top-K function then selects the highest-scoring experts (for Mixtral, K=2). A bit of noise is often added during training to encourage the router to explore different experts. The formula looks something like this:

G(x) = softmax(TopK(W_g · x + noise, K))

Here, G(x) gives the final weights for the experts, x is the input token, and W_g is the trainable weight matrix of the gating network. TopK keeps the K highest scores and masks the rest to −∞, so non-selected experts receive a weight of exactly zero after the softmax. A minimal sketch of this gating function follows.
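
Here is a minimal PyTorch sketch of that noisy top-K gating, mapping directly onto the formula above; the dimensions and noise scale are arbitrary choices for illustration:

```python
import torch
import torch.nn.functional as F

def noisy_top_k_gating(x, W_g, k=2, noise_std=1.0, training=True):
    """Compute G(x) = softmax(TopK(W_g · x + noise, K)) -- illustrative sketch.

    x:   (tokens, d_model) input tokens
    W_g: (d_model, n_experts) trainable gating weights
    """
    scores = x @ W_g                                            # raw score per expert
    if training:
        scores = scores + noise_std * torch.randn_like(scores)  # exploration noise

    top_scores, top_idx = scores.topk(k, dim=-1)                # keep the K best experts
    masked = torch.full_like(scores, float("-inf"))             # mask out the rest
    masked.scatter_(-1, top_idx, top_scores)
    return F.softmax(masked, dim=-1)                            # zero weight for non-selected experts

# Example: gate 4 tokens over 8 experts
# G = noisy_top_k_gating(torch.randn(4, 16), torch.randn(16, 8))
```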

Non-linear Gating: More advanced designs exist. For example, a cosine router calculates the cosine similarity between the input and an "embedding" representing each expert. This can be better at handling data from different domains and improving generalization.

2. Expert Networks

Experts are the specialized neural networks that do the actual processing. In principle, they could be any kind of model, but in practice, they are usually integrated into a larger architecture.

FFN Experts (Most Common): In Transformers like GPT-4 and Mixtral, the feed-forward network (FFN) layers are replaced with MoE layers. Each expert is its own FFN. This is ideal because FFN layers account for a large portion of a Transformer's computational cost and have shown a natural tendency for specialization.

Attention Experts (MoA): Some models apply the MoE concept to the attention mechanism itself, creating a Mixture-of-Attention (MoA). Here, each "expert" is a distinct attention head, and the router selects the most relevant heads for a given token.

CNN Experts: In computer vision, MoE can be applied to Convolutional Neural Networks (CNNs), where each expert is a set of convolutional layers designed to handle different visual features or image types.

3. Routing Strategies

This defines the level at which routing decisions are made.

Token-Level Routing (Classic): Each token in a sequence is routed independently to the best expert(s). This is the most common approach and is used by Mixtral.

Task-Level Routing: For multi-task learning, the router can send all tokens related to a specific task (e.g., translation vs. summarization) to a dedicated set of experts. This minimizes interference between tasks.

Modality-Level Routing: In multi-modal models (e.g., handling text and images), routing can be done based on the data's modality, sending text tokens to text experts and image patches to vision experts.

4. Training Strategies

Training MoEs is tricky due to their sparse nature. A key challenge is load balancing. If the router sends too many tokens to one expert and neglects others, the model "collapses," wasting capacity and hurting performance. To prevent this, an auxiliary loss function is added during training. This loss penalizes the router for unbalanced decisions, encouraging it to distribute tokens more evenly across all available experts.
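
To make this concrete, here is a sketch of one common formulation of that auxiliary loss, in the style of Switch Transformers: per expert, it multiplies the fraction of tokens actually routed there by the average router probability it received, a product that is minimized when the load is uniform. Variable names are illustrative:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_probs, expert_indices, n_experts):
    """Auxiliary load-balancing loss (Switch Transformer-style, illustrative sketch).

    router_probs:   (tokens, n_experts) softmax probabilities from the router
    expert_indices: (tokens,) index of the expert each token was dispatched to
    """
    # f_i: fraction of tokens actually sent to expert i
    dispatch = F.one_hot(expert_indices, n_experts).float()
    tokens_per_expert = dispatch.mean(dim=0)

    # P_i: average router probability assigned to expert i
    mean_prob_per_expert = router_probs.mean(dim=0)

    # N * sum_i f_i * P_i -- minimized when both distributions are uniform
    return n_experts * torch.sum(tokens_per_expert * mean_prob_per_expert)

# total_loss = lm_loss + alpha * load_balancing_loss(probs, top1_idx, n_experts=8)
```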


Advantages: Speed and Efficiency

Faster Pretraining & Inference: Because only a fraction of the model's parameters are used for any given token, MoE models are much faster to train and run than dense models with the same total parameter count.

Higher Quality: By allowing experts to specialize, the overall model can store more knowledge and handle niche scenarios better, leading to higher-quality outputs.

Cost-Effective Scaling: MoE provides a path to scale models to trillions of parameters without a proportional increase in computational cost.


Disadvantages: GPU Needs & Training Challenges

High VRAM Requirement: The biggest catch is that all experts must be loaded into GPU memory (VRAM), even though only a few are used at a time. This results in massive VRAM requirements.

Training Instability: As mentioned, MoEs can be difficult to train. They require careful tuning and auxiliary losses to ensure load balancing and prevent model collapse, where the router favors only a few experts.

Fine-tuning Difficulties: Historically, fine-tuning MoEs was challenging and could lead to overfitting, though recent advancements are making this easier.


The Evolution of MoE

The MoE concept isn't new. It has a rich history:

1991: The idea of "Adaptive Mixtures of Local Experts" was first proposed by researchers including Robert A. Jacobs and Geoffrey Hinton.

2014: MoE was first applied to modern deep learning, in work such as "Learning Factored Representations in a Deep Mixture of Experts."

2017: A Google team including Geoffrey Hinton and Noam Shazeer proposed using MoE for large-scale models in the paper "Outrageously Large Neural Networks."

2020: Google's GShard showed how to apply MoE to giant transformer models for machine translation.

2022: Google's Switch Transformers addressed many of MoE's training and fine-tuning issues, scaling a model to over a trillion parameters.


MoE in the Wild: Beyond LLMs 🏞️

While LLMs have put MoE in the spotlight, the technique is highly versatile and is being used across AI:

Computer Vision: MoE is used in models for image classification, object detection, and semantic segmentation. For example, Vision MoE (V-MoE) scales vision transformers to 15 billion parameters, allowing experts to specialize in recognizing different visual features (e.g., one expert for fur, another for faces).

Reinforcement Learning (RL): In RL, MoE can be used to create more adaptable agents. Different experts can learn distinct policies or skills (e.g., walking vs. jumping), and a gating network chooses the right expert based on the agent's current situation or goal.

Multi-task & Multi-modal Learning: MoE is a natural fit for models that handle multiple tasks or data types (text, images, audio) simultaneously, as it allows for clean specialization and reduces interference between different domains.


Open Source MoEs and Exciting Directions

Today, the community is buzzing with open-source MoE projects like MegaBlocks, Fairseq, and OpenMoE. The release of Mixtral with an Apache 2.0 license is a massive win for the democratization of AI.

Exciting research directions include:

Distilling MoEs: Compressing large, sparse MoE models into smaller, dense models that retain their performance but are easier to deploy.

Quantization of MoEs: Reducing the memory footprint by using lower-precision numbers for the model weights.

Model Merging: Exploring new, efficient ways to combine experts.


The Bottom Line: Why MoE Matters

MoE models represent a fundamental shift in how we build and scale AI. By moving from monolithic, dense models to sparse, specialized architectures, we can create systems that are not only more powerful but also more efficient. As we advance, MoE models will likely become more prevalent, pushing the boundaries of what's possible.

My hunch is that many future open-source models will be MoEs. It wouldn't be surprising if models like Llama 3 appear as a Mixture of Experts to achieve next-level performance. The story of MoE is still being written, and it's one of the most exciting chapters in AI today.