Technology · Spring 2026

Mixture of Experts explained.

A 2026 field guide to Mixture of Experts: how it works, why GPT-4, DeepSeek, Llama 4, Qwen3, Mistral Large 3 and Kimi K2 all rely on it, plus Super Experts, routing strategies, and the systems trade-offs MoE forces on production teams.

Miguel Neves · May 3, 2026 · 14 min read · 3,073 words · Filed under Technology
Frontispiece · Spring 2026 · TensorOps Blog

Dense models activate every parameter. MoE picks a few specialists per token — trillion-parameter capacity at the compute cost of something much smaller. The default architecture of frontier AI in 2026.

Inside this dispatch · 8 sections · 14 minutes
  1. Foundations · what MoE is and why it works
  2. From GPT-4 to Mixtral
  3. The 2026 MoE landscape
  4. Inside the model · routing, experts, training
  5. 2026 research findings
  6. Inference, deployment, and systems
  7. Beyond LLMs
  8. Where MoE is going
LLM Mixture of Experts

TensorOps Technology · Spring 2026 · A field guide for architects, engineers, and AI thought leaders.

Mixture of Experts has moved from an experimental scaling trick to the default architecture behind frontier AI. GPT-4's rumored sparse design was the early signal. Today, DeepSeek-V3/R1 and V4, Qwen3-235B-A22B, the Llama 4 family, Mistral Large 3, Kimi K2, Grok-1/2, and GPT-5/GPT-OSS all rely on it. Modern AI faces two linked challenges — exploding compute requirements and the difficulty of fitting one monolithic model to increasingly diverse data — and MoE has emerged as the dominant answer to both. The core idea is simple: MoE lets models grow dramatically larger without activating every parameter for every token. It is how the industry now builds larger, more capable systems while keeping inference cost, latency, and training compute within reach.

Foundations · what MoE is and why it works

What is Mixture of Experts?

MoE is an AI architecture where a set of specialized sub-networks ("experts") sit inside a larger model, and a lightweight gating mechanism decides which experts process each input token. It's a divide-and-conquer strategy that optimizes for both performance and efficiency. Instead of using the whole model every time, only a small subset activates per token — letting a model contain hundreds of billions or trillions of parameters while using a fraction of them for any given input.

How does it differ from traditional ensembles?

Traditional ML ensembles like boosting and bagging combine models statically. MoE is dynamic and internal to the model. For each token, a router picks the most relevant experts in real time. The result feels less like a committee vote and more like a coordinated team of specialists working inside a single neural architecture, with routing decisions baked into inference itself.

What does "expert" really mean here?

Each expert naturally develops proficiency in different regions of a high-dimensional embedding space during training. These aren't experts in the human sense — categorizing them as "the math expert" or "the code expert" is a useful conceptual shorthand, but their actual specialization is statistical and emergent. The router learns which expert is best for which token representation, and refines those decisions over training.

Why does MoE matter for modern LLMs?

Dense models activate all parameters for every token, which makes scaling expensive. MoE breaks the link between total model capacity and active compute. A 671B-parameter MoE may activate only ~37B per token, giving it enormous representational capacity at the cost profile of a much smaller dense model. This decoupling is the single most important shift in frontier-model design over the past two years.
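
To make the decoupling concrete, here is a back-of-the-envelope comparison using the publicly cited DeepSeek-V3 figures and the common approximation of roughly 2 FLOPs per active parameter per token; the numbers are illustrative, not a benchmark.

```python
# Illustrative only: rough per-token compute for a dense vs. MoE model,
# using the common ~2 FLOPs per active parameter per token approximation.
total_params = 671e9    # DeepSeek-V3 total parameters (publicly cited)
active_params = 37e9    # parameters actually activated per token

dense_flops_per_token = 2 * total_params   # dense: every weight participates
moe_flops_per_token = 2 * active_params    # MoE: only the routed experts participate

print(f"active fraction: {active_params / total_params:.1%}")  # ~5.5%
print(f"compute ratio (dense / MoE): "
      f"{dense_flops_per_token / moe_flops_per_token:.0f}x")   # ~18x
```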

What problem does MoE actually solve?

Modern LLMs have to handle code, math, conversation, multilingual reasoning, vision, retrieval, and agentic workflows — each with different demands. A single dense feed-forward path forces all of this through the same parameters, creating interference. MoE lets the model specialize internally, routing different inputs through different expert pathways without ballooning per-token compute.

What happens inside an MoE layer?

In most LLMs, MoE replaces some dense feed-forward (FFN) layers with expert-based FFN blocks. The router scores available experts for a token and dispatches it to the top-ranked one or two. Classic routing uses Top-K softmax gating: G(x) = softmax(TopK(W_g · x + noise, K)). Mixtral used Top-2; newer systems lean toward finer-grained experts, shared experts, and sigmoid gating to reduce winner-take-all dynamics.
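
A minimal PyTorch sketch of such a layer is shown below. It illustrates the Top-K softmax gating described above rather than any specific model's implementation: the hidden sizes, the 8-expert pool, and K = 2 are arbitrary choices, and production systems add router noise, capacity limits, load-balancing terms, and fused kernels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFFN(nn.Module):
    """Sketch of a sparse MoE feed-forward block with Top-K softmax gating."""

    def __init__(self, d_model=512, d_hidden=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)  # W_g
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                        # x: (tokens, d_model)
        logits = self.router(x)                  # (tokens, n_experts)
        topk_logits, topk_idx = logits.topk(self.k, dim=-1)
        gates = F.softmax(topk_logits, dim=-1)   # renormalize over the K winners

        out = torch.zeros_like(x)
        for slot in range(self.k):               # loop form for clarity, not speed
            idx = topk_idx[:, slot]              # chosen expert per token
            gate = gates[:, slot].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    out[mask] += gate[mask] * expert(x[mask])
        return out

# Usage: route a batch of 16 token vectors through 2 of 8 experts each.
layer = MoEFFN()
y = layer(torch.randn(16, 512))
print(y.shape)  # torch.Size([16, 512])
```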

Fig. 01 · Inside an MoE layer. Dense FFN layers run every parameter for every token (cost ∝ total params); MoE FFN layers send each token through a router, G(x) = softmax(TopK(·)), which picks the top-K of N experts and sums their outputs (cost ∝ active params). Mixtral picks 2 of 8 experts per token; DeepSeek-V3 picks 8 of 256 routed experts plus 1 always-on shared expert.

From GPT-4 to Mixtral

Why did GPT-4 become associated with MoE?

On June 20, 2023, George Hotz claimed that GPT-4 was a mixture of eight smaller models of ~220B parameters each, a rumor later echoed by Soumith Chintala. That gives a headline figure of ~1.76T parameters, but the math isn't that clean: in MoE transformers, typically only the FFN layers are replicated per expert while attention layers are shared. GPT-4's true total likely sits between 1.2T and 1.76T. OpenAI never confirmed the architecture, but inference-cost patterns and reverse-engineering made sparse activation the credible explanation.
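
To see why 8 × 220B does not simply sum to a unique total, the short calculation below separates shared attention weights from replicated FFN weights. The 35% shared fraction is a made-up assumption chosen only to illustrate the arithmetic, not a claim about GPT-4's real layout.

```python
# Hypothetical numbers, only to show why shared layers deflate the naive 8 x 220B total.
n_experts = 8
per_expert_total = 220e9      # rumored size of each "sub-model"
shared_fraction = 0.35        # ASSUMED share of attention/embedding weights common to all experts

shared = per_expert_total * shared_fraction
ffn_per_expert = per_expert_total * (1 - shared_fraction)

naive_total = n_experts * per_expert_total           # 1.76T: double-counts shared weights
dedup_total = shared + n_experts * ffn_per_expert    # counts shared weights only once

print(f"naive: {naive_total / 1e12:.2f}T, deduplicated: {dedup_total / 1e12:.2f}T")
# naive: 1.76T, deduplicated: 1.22T (with this assumed split)
```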

How should we read the "lazy GPT-4" debate?

Some users described later GPT-4 variants as less thorough or "lazier" than earlier releases. Three plausible causes, all consistent with MoE-based serving:

  • Cost reduction: Every expert must be loaded into VRAM, so reducing the number or size of experts used per query has a big impact on cost — and on quality.
  • Aggressive RLHF: Continuous reinforcement learning with human feedback makes models safer and more useful for specific products, but can flatten creativity.
  • Distillation and quantization: Compressing the MoE into a smaller dense model or reducing weight precision cuts costs further.

MoE isn't inherently weaker; it gives operators more levers to trade cost, latency, and quality.

What is OpenAI doing with MoE now?

GPT-5 (2025) uses multimodal MoE with experts specialized across coding, math, conversation, and vision, and the GPT-OSS open-weight release (2025) is sparse MoE as well. The lineage is now openly sparse, even if GPT-5's exact parameter counts remain undisclosed.

What made Mixtral the turning point?

Mixtral 8x7B (2023) proved that open-weight MoE could be practical and competitive. It's a sparse mixture-of-experts (SMoE) decoder-only model under Apache 2.0:

  • Total / active parameters: 46.7B total, ~12.9B active.
  • Expert mechanism: Each FFN block picks from eight distinct expert groups.
  • Token routing: A router selects two experts per token (Top-2).
  • Additive combination: The outputs of the two chosen experts are combined additively, blending their specialized knowledge.
  • Efficiency: Operates at the speed and cost of a 12.9B model despite a 46.7B total footprint, ~6× faster than Llama 2 70B at higher quality.

Mixtral 8x22B followed with 141B total / 39B active, pushing stronger results in coding and math.

What changed after Mixtral?

The industry shifted from small expert pools to much finer-grained systems — 128, 256, or more routed experts, often combined with always-on shared experts. This improves specialization and routing flexibility while reducing the risk that a few coarse experts become overloaded or poorly differentiated.

The 2026 MoE landscape

What are the leading open MoE models in 2026?

The 2025–2026 generation pushed MoE into much larger deployments:

  • DeepSeek-V3 / R1 — ~671B total / ~37B active. 256 routed experts plus 1 shared expert, with auxiliary-loss-free load balancing.
  • DeepSeek-V4 (April 2026) — 1.6T total / 49B active in the Pro variant; a Flash variant at 284B total / 13B active. Native 1M-token context, trained on >32T tokens.
  • Qwen3-235B-A22B — 235B total / ~22B active, activating 8 experts per token from a pool of 128. Strong multilingual and agentic performance.
  • Llama 4 Maverick — Meta's first major MoE flagship, ~400B total / ~17B active, 128 routed plus 1 shared expert per MoE layer, interleaved dense/MoE layers, native multimodality.
  • Mistral Large 3 — 675B total / 41B active, 256K context, top open-source coding performer on LM Arena, deployable on a single 8-GPU node.
  • Kimi K2 — ~1T total / ~32B active, natively multimodal.
  • Grok-1 / Grok-2 — xAI's open MoE releases (314B total, ~70–80B active for Grok-1) extended the open ecosystem beyond DeepSeek, Qwen, Mistral, and Meta.
Fig. 02 · Total vs active parameters across leading 2026 MoE models. MoE decouples model capacity from per-token compute; active parameters are a small fraction of totals: DeepSeek-V4 Pro 1,600B / 49B, Kimi K2 1,000B / 32B, Mistral Large 3 675B / 41B, DeepSeek-V3/R1 671B / 37B, Llama 4 Maverick 400B / 17B, Qwen3-235B-A22B 235B / 22B, Llama 4 Scout 109B / 17B. DeepSeek-V4 Pro carries roughly 33× more parameters than it activates per token; this is the gap MoE was built to exploit.

What about the full Llama 4 family?

Llama 4 expanded into a family rather than a single release. Scout is the smaller variant (109B total / 17B active, with up to 10M-token context in some configs, runnable on a single H100). Maverick is the larger flagship at ~400B / 17B active. Behemoth is a research-scale system with significantly larger active capacity — some reports cite ~288B active. The design reflects a broader trend: MoE is now a systems architecture, not just a model-layer trick.

Why are shared experts important?

Shared experts process every token and provide stability, common knowledge, and a baseline representation that all routed paths build on. DeepSeek and Llama 4 both use them. This reduces fragmentation: routed experts can specialize aggressively while shared experts preserve global continuity across tasks, languages, modalities, and reasoning patterns.
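
Extending the earlier routing sketch, the shared-plus-routed pattern can be written in a few lines; the combination rule (shared output plus gated routed outputs) follows the description above, while the module sizes, pool size, and K below are arbitrary placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def ffn(d=256, h=1024):
    return nn.Sequential(nn.Linear(d, h), nn.GELU(), nn.Linear(h, d))

class SharedPlusRoutedMoE(nn.Module):
    """Sketch: one always-on shared expert plus a top-K routed pool (DeepSeek / Llama 4 style)."""

    def __init__(self, d=256, n_routed=16, k=2):
        super().__init__()
        self.shared = ffn(d)                            # processes every token
        self.routed = nn.ModuleList(ffn(d) for _ in range(n_routed))
        self.router = nn.Linear(d, n_routed, bias=False)
        self.k = k

    def forward(self, x):
        out = self.shared(x).clone()                    # baseline, global knowledge
        scores, idx = self.router(x).topk(self.k, dim=-1)
        gates = F.softmax(scores, dim=-1)
        for slot in range(self.k):
            for e, expert in enumerate(self.routed):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += gates[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

print(SharedPlusRoutedMoE()(torch.randn(4, 256)).shape)  # torch.Size([4, 256])
```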

Fig. 04 · Shared + routed experts (DeepSeek, Llama 4). The always-on shared expert processes every token and preserves global continuity, while routed experts in the pool specialize and fire only when the router picks them; their outputs are summed. DeepSeek-V3 uses 1 shared expert plus 256 routed (top-8); Llama 4 Maverick uses 1 shared plus 128 routed. Shared experts absorb common patterns so routed experts can specialize without fragmenting basic capability.

Inside the model · routing, experts, training

What is the router actually learning?

The router learns which expert pathways are useful for each token. During training, different experts naturally become better suited to different regions of the embedding space. But specialization is rarely as clean as "one expert for code, one for math." Experts often specialize in syntax, entities, languages, token patterns, reasoning modes, or hidden internal features that don't map neatly to task labels — specialization is more emergent than designed.

What gating designs are in use?

The router is the conductor of the orchestra, and its design is critical:

  • Linear (softmax) gating — A linear layer plus softmax produces expert probabilities, with Top-K selecting the winners. Noise added during training encourages exploration. Used by Mixtral and most production MoEs (compared with cosine routing in the sketch after this list).
  • Cosine routers — Compute cosine similarity between input and a learned embedding for each expert. Better at handling cross-domain data and improving generalization.
  • Sigmoid gating — Reduces winner-take-all competition between experts.
  • Routing-free designs — Experts self-activate via learned mechanisms, eliminating the explicit router bottleneck. An active 2026 research direction.
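
The sketch below contrasts the two most common scoring rules from the list above: a plain linear-softmax router and a cosine router that compares the normalized token against a learned per-expert embedding. Dimensions, initialization, and the temperature value are arbitrary placeholders, not values from any published model.

```python
import torch
import torch.nn.functional as F

d_model, n_experts, k = 512, 8, 2
x = torch.randn(16, d_model)                       # a batch of token representations

# Linear (softmax) gating: logits from a learned projection W_g, then Top-K.
W_g = torch.randn(d_model, n_experts) * 0.02
linear_scores = x @ W_g
top_vals, top_idx = linear_scores.topk(k, dim=-1)
linear_gates = F.softmax(top_vals, dim=-1)

# Cosine routing: similarity between the normalized token and learned expert embeddings.
expert_emb = torch.randn(n_experts, d_model)
temperature = 0.07                                 # arbitrary scaling, tuned in practice
cos_scores = F.normalize(x, dim=-1) @ F.normalize(expert_emb, dim=-1).T
cos_top_vals, cos_top_idx = (cos_scores / temperature).topk(k, dim=-1)
cos_gates = F.softmax(cos_top_vals, dim=-1)

print(top_idx[0], cos_top_idx[0])                  # the two routers can pick different experts
```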

What kinds of experts exist beyond FFN?

Three families matter:

  • FFN experts — The dominant pattern in LLMs (GPT-4, Mixtral, DeepSeek, Llama 4). FFN layers carry a large fraction of transformer compute and show natural specialization tendencies.
  • Attention experts (Mixture-of-Attention, MoA) — Each expert is a distinct attention head; the router picks the most relevant heads per token.
  • CNN experts — In computer vision, each expert is a set of convolutional layers tuned for different visual features or image types.

What are the main routing strategies?

Token-level routing remains the standard for LLMs — each token independently routed based on its hidden representation. Task-level routing sends all tokens for a given task (translation vs. summarization) to dedicated experts, minimizing interference. Modality-level routing sends text tokens to text experts and image patches to vision experts in multimodal systems — increasingly common as native multimodality becomes the norm.
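
As a toy illustration of modality-level routing, the sketch below masks the router's logits so that text tokens can only reach text experts and image patches can only reach vision experts; the half-and-half split of the expert pool and the modality flags are assumptions made up for the example.

```python
import torch
import torch.nn.functional as F

n_experts, d_model, k = 8, 512, 2
text_experts = torch.tensor([0, 1, 2, 3])      # ASSUMED: first half of the pool handles text
vision_experts = torch.tensor([4, 5, 6, 7])    # ASSUMED: second half handles image patches

tokens = torch.randn(6, d_model)
modality = torch.tensor([0, 0, 0, 1, 1, 1])    # 0 = text token, 1 = image patch
router = torch.nn.Linear(d_model, n_experts, bias=False)

# Build a per-modality mask of allowed experts, then block the rest with -inf logits.
allowed = torch.zeros(2, n_experts, dtype=torch.bool)
allowed[0, text_experts] = True
allowed[1, vision_experts] = True

logits = router(tokens)
masked_logits = logits.masked_fill(~allowed[modality], float("-inf"))
top_vals, top_idx = masked_logits.topk(k, dim=-1)
gates = F.softmax(top_vals, dim=-1)
print(top_idx)   # text tokens only land on experts 0-3, image patches only on 4-7
```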

How is MoE training stabilized?

The central challenge is load balancing. If the router sends most tokens to a few favorite experts, the model collapses, wasting capacity. Classic MoE adds an auxiliary loss that penalizes unbalanced routing decisions. DeepSeek-style designs now lead with auxiliary-loss-free load balancing, supplemented by orthogonality losses, variance objectives, and router regularization. The goal is genuine specialization rather than artificial uniformity. Recent work also addresses gradient noise, RL instability specific to sparse models, and expert collapse.
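
The classic auxiliary objective, in the Switch Transformer style, is easy to state in code: penalize the product of each expert's token fraction and its mean routing probability, which is smallest when both are uniform. The sketch below is schematic and uses top-1 dispatch for simplicity; DeepSeek's auxiliary-loss-free scheme instead nudges per-expert bias terms and is not shown here.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, expert_idx, n_experts):
    """Switch-style auxiliary loss: N * sum_i f_i * P_i over experts.

    router_logits: (tokens, n_experts) raw router scores
    expert_idx:    (tokens,) expert each token was dispatched to (top-1 for simplicity)
    """
    probs = F.softmax(router_logits, dim=-1)
    p_mean = probs.mean(dim=0)                                    # P_i: mean routing probability
    f = torch.bincount(expert_idx, minlength=n_experts).float()
    f = f / expert_idx.numel()                                    # f_i: fraction of tokens per expert
    return n_experts * torch.sum(f * p_mean)

logits = torch.randn(32, 8)
loss = load_balancing_loss(logits, logits.argmax(dim=-1), n_experts=8)
print(loss)   # ~1.0 when routing is balanced, larger when a few experts dominate
```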

2026 research findings

What are Super Experts, and why are they the headline 2026 finding?

ICLR 2026 papers (and arXiv 2507.23279) identified a tiny subset of experts — often just 3–10 out of thousands — that dominate extreme activation outliers in the down_proj layer and create massive hidden-state activations between decoder layers. Pruning just 3 of 6,144 experts in Qwen3-30B-A3B caused catastrophic collapse: repetitive, uninformative outputs and major drops in math and reasoning. Super Experts are model-specific, data-agnostic, and unaffected by post-training.

Why do Super Experts matter for deployment?

They reshape how teams think about compression, pruning, and quantization. If a handful of experts carry disproportionate responsibility, naive pruning destroys quality. They also offer a window into why MoE works: Super Experts appear connected to hidden-state outliers and attention-sink behavior, suggesting they stabilize information flow across layers. Compression strategies built around Super Expert preservation are now an active research area.

What is MoE entanglement?

ICLR 2026 cross-layer routing studies show that MoE layers are not independent. Activations from one MoE layer strongly predict which experts fire in subsequent layers, and MoE outputs often dominate routing decisions more than attention layers do. This cross-layer dependency — MoE entanglement — means routing forms a dynamic pathway through the network, with expert outputs shaping future selection more strongly than earlier work assumed.

Does MoE reduce polysemanticity?

There is growing evidence that sparse routing reduces some forms of superposition compared with dense models of similar active size. Because different token patterns route through different experts, features become less entangled. Mechanistic studies on OLMoE and Qwen3 variants have automatically labeled experts as detectors for subword families, proper names, locations, languages, and entity types — though the mapping remains imperfect.

Inference, deployment, and systems

Why is MoE faster at inference?

Only active parameters matter for compute. A model with 671B total but 37B active has per-token compute resembling a much smaller dense model. In production this means better throughput, lower marginal cost, or higher quality at the same serving budget — typically 5–10× better cost-performance than equivalent dense models, and meaningful speedups over comparable dense systems (Mixtral was ~6× faster than Llama 2 70B at higher quality).

Fig. 05 · Capability per active parameter (schematic). At equal active compute per token, MoE models (Mixtral 8x7B, Llama 4 Scout, Qwen3-A22B, DeepSeek-V3, Mistral Large 3) sit on a higher capability frontier than dense models (Llama 3 8B, Llama 3 70B); the gap between the two frontiers is the value MoE captures. The vertical capability scale is illustrative, while the active-parameter counts are those publicly disclosed by model authors.

Why is MoE still hard to deploy?

The full model has to live somewhere. Even if only a few experts activate per token, the serving stack must load, shard, offload, or retrieve the full expert set efficiently. That creates VRAM, bandwidth, and orchestration challenges. Even Mixtral required at least 30 GB of VRAM and high-end GPUs (A100, A6000, H100); frontier-scale MoEs need clusters or aggressive quantization. Production MoE serving depends heavily on expert parallelism, quantization (FP8, INT4, 4-bit with QLoRA-style techniques for attention layers), fast routing kernels, communication scheduling, and hardware-aware placement.
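
A rough sizing helper puts the paradox in numbers. The bytes-per-parameter values are the usual approximations (FP16 ≈ 2, FP8 ≈ 1, INT4 ≈ 0.5) and the estimate covers weights only, ignoring KV cache, activations, and communication buffers, so treat the results as lower bounds.

```python
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}   # weight storage only, approximate

def weight_footprint_gb(total_params_b, precision="fp8"):
    """Approximate GB needed just to hold the weights of the full expert set."""
    return total_params_b * 1e9 * BYTES_PER_PARAM[precision] / 1e9

models = {                      # (total B params, active B params), figures cited in this article
    "Mixtral 8x7B": (46.7, 12.9),
    "Llama 4 Scout": (109, 17),
    "DeepSeek-V3": (671, 37),
}

for name, (total, active) in models.items():
    gb = weight_footprint_gb(total, "fp8")
    print(f"{name:>14}: ~{gb:.0f} GB of FP8 weights, but only {active}B params active per token")
```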

Fig. 06 · The deployment paradox. Per-token compute is small (active parameters) while memory is huge (every expert must be loaded somewhere), and both must be planned for. At roughly 1 byte per parameter in FP8: Mixtral 8x7B ~47 GB for 12.9B active, Llama 4 Scout ~109 GB for 17B active, Qwen3-A22B ~235 GB for 22B active, DeepSeek-V3 ~671 GB for 37B active, Mistral Large 3 ~675 GB for 41B active, DeepSeek-V4 Pro ~1.6 TB for 49B active. Add KV cache, activation buffers, and expert-parallel communication overhead in production. Llama 4 Scout fits a single H100 (80 GB) only with quantization; DeepSeek-V4 Pro needs at least an 8-GPU node even at FP8.

What hardware and systems advances are changing MoE?

The 2026 acceleration surveys (arXiv 2503.07137 updated; "A Survey on Accelerated Technologies for MoE") cover hybrid parallel computing, fine-grained memory management, communication scheduling, ML-guided load balancing, cross-layer optimization, and hardware-software co-design. NVIDIA Blackwell NVL72 reportedly delivers ~10× inference gains on DeepSeek-R1, Kimi K2, and Mistral Large 3. Wide-expert parallelism, FP8 and INT4 quantization, and SSD expert offloading for consumer hardware are all moving from research into production.

Can MoE run on smaller infrastructure?

Increasingly, yes. DeepSeek-V3.2 Speciale fits on 8×H100 with FP8. Llama 4 Scout runs on a single H100. Mistral Large 3 deploys on a single 8-GPU node. Heavy quantization, expert offloading, and specialized inference engines have made smaller deployments realistic, but frontier-scale MoEs still require serious memory capacity. The defining tradeoff remains the gap between active compute (small) and memory footprint (huge).

How does MoE affect fine-tuning?

Easier than in 2024 but still more complex than dense fine-tuning. Teams must account for routing behavior, expert balance, memory layout, and whether adapters target shared layers, routed experts, or both. Unsloth offers 12–30× faster MoE training; Hugging Face expert backends and native vLLM support have made MoE first-class citizens. For narrow enterprise tasks with constrained infrastructure, dense models may still be simpler.
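
As one illustration of the "which parameters to touch" question, the sketch below freezes everything except the routers and shared experts before fine-tuning. The name patterns it matches are hypothetical; real module names differ across DeepSeek, Qwen, and Llama 4 implementations, so check your checkpoint before using anything like this.

```python
import torch.nn as nn

def freeze_for_moe_finetune(model: nn.Module):
    """Sketch: train only routers and shared experts, keep routed experts frozen.

    The substrings below ("router", "gate", "shared_expert") are HYPOTHETICAL
    parameter-name patterns; inspect the actual names in your checkpoint first.
    """
    trainable_markers = ("router", "gate", "shared_expert")
    n_trainable = 0
    for name, param in model.named_parameters():
        param.requires_grad = any(marker in name for marker in trainable_markers)
        if param.requires_grad:
            n_trainable += param.numel()
    return n_trainable

# Usage (with any nn.Module that follows a shared + routed naming scheme):
# n = freeze_for_moe_finetune(model)
# print(f"trainable params: {n / 1e6:.1f}M")
```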

Beyond LLMs

Where is MoE used beyond LLMs?

The technique generalizes well across AI:

  • Computer vision — V-MoE scaled vision transformers to 15B parameters with experts specializing in different visual features (one for fur, another for faces). Swin-MoE extends this to hierarchical vision transformers.
  • Reinforcement learning — Different experts learn distinct policies or skills (walking vs. jumping), with the gating network choosing based on the agent's state or goal.
  • Multimodal models — MoE-LLaVA, Qwen3-VL, and others coordinate text, image, video, code, and tool-use pathways through specialized experts.
  • Agentic systems — Task-specific experts handle planning, retrieval, and tool calls without cross-interference.

The core advantage is consistent: reduce interference by letting different experts handle different data regimes.

How does MoE relate to multimodality?

Native multimodality is now standard across flagship MoEs — Qwen3, Llama 4, Mistral Large 3, Kimi K2, and DeepSeek-V4, not just the GPT-5 lineage. Instead of bolting vision onto a text model, newer systems route different modalities through specialized expert pathways. This matters for enterprise architectures: document understanding, code agents, visual inspection, long-context retrieval, and workflow automation increasingly need models that coordinate multiple data types without collapsing them into one overloaded representation.

How does MoE fit into the post-Transformer discussion?

MoE is still transformer-based but belongs to the broader 2026 conversation about changing scaling laws — alongside state-space models, Mamba-style architectures, retrieval, and agentic control. MoE's advantage is that it works with today's transformer ecosystem. It improves scaling efficiency without requiring the industry to abandon the architecture, tooling, and hardware paths already built.

Where MoE is going

A brief evolution

  • 1991 — Jacobs, Jordan, Nowlan, and Hinton publish Adaptive Mixtures of Local Experts.
  • 2014 — MoE first applied to modern deep learning.
  • 2017 — Google's Outrageously Large Neural Networks (Shazeer, Hinton, Dean, et al.) proposes the sparsely-gated MoE layer for large-scale models.
  • 2020 — GShard scales MoE to giant transformers for machine translation.
  • 2021 — Switch Transformers cross 1T parameters and address MoE training and fine-tuning issues.
  • 2023 — Mixtral 8x7B brings MoE to open source.
  • 2024–2025 — DeepSeek-V2/V3, Qwen3, Llama 4, GPT-OSS, Mistral Large 3 establish MoE as the standard.
  • 2026 — Focus shifts to Super Experts and interpretability, routing-free designs, multimodal MoE, and inference-time optimization.
Fig. 03 · A brief evolution of MoE. Three decades from a regional-experts paper to the default architecture of frontier AI: Adaptive Mixtures of Local Experts (1991), MoE meets deep learning (2014), Outrageously Large Neural Networks (2017), GShard scales to trillion-parameter regimes (2020), Switch Transformers cross 1T (2021), Mixtral 8x7B open-weight MoE (2023), DeepSeek / Qwen / Llama 4 / GPT-OSS (2024–25), Super Experts and entanglement (2026). The 2017 Shazeer et al. paper is the inflection point where MoE moved from idea to scaling lever for transformers.

What are the most important open-source MoE directions?

The ecosystem has exploded. Megablocks, DeepSpeed-MoE, Fairseq, OpenMoE, and Unsloth all power MoE workloads, with native support in Hugging Face Transformers and vLLM. Active research directions include:

  • Distilling MoEs into smaller dense models that retain performance.
  • Quantization for memory-footprint reduction (FP8, INT4).
  • Model merging and new ways to combine experts.
  • Routing-free and self-activating experts.
  • Orthogonality losses for genuine specialization.
  • Compression strategies built around Super Expert preservation.
  • Hybrid MoE plus retrieval or agentic control.
  • Long-context multimodal variants.
  • Hardware co-design for sparse activation patterns.

Why is MoE now the default scaling recipe?

It offers a practical answer to the frontier-model scaling problem. Labs can build models with massive total capacity while controlling per-token compute. That combination — quality, lower inference cost, specialization, and a real path to trillion-parameter systems — is hard to beat economically with dense architectures.

What should AI architects take away?

MoE changes model architecture, but it also changes infrastructure architecture. The design question is no longer just "How many parameters does the model have?" but "How many parameters are active, where are the experts stored, and how efficiently can they be routed?" Production teams must think across the full stack: model design, memory layout, parallelism, quantization, routing behavior, observability, and cost controls. Sparse intelligence is as much a systems problem as a modeling breakthrough.

What is the bottom line?

MoE is no longer exotic — it is the architecture that makes trillion-parameter AI practical. By activating only a fraction of parameters per token, it delivers higher capacity, stronger specialization, faster inference, and better cost-performance than dense models of equivalent total size. As of spring 2026, the vast majority of frontier open and closed models are MoE-based, and the gap between open and closed performance has never been smaller. The era of purely monolithic dense frontier models is giving way to sparse, specialized, and ruthlessly efficient AI systems. The story of MoE is still being written — but it's already the most consequential architectural shift of the decade.

TensorOps · Your partners in AI

End. Set in Fraunces, Newsreader & JetBrains Mono.
TensorOps · Blog · 2026