A 2026 field guide to Mixture of Experts: how it works, why GPT-4, DeepSeek, Llama 4, Qwen3, Mistral Large 3 and Kimi K2 all rely on it, plus Super Experts, routing strategies, and the systems trade-offs MoE forces on production teams.
Dense models activate every parameter. MoE picks a few specialists per token — trillion-parameter capacity at the compute cost of something much smaller. The default architecture of frontier AI in 2026.

TensorOps Technology · Spring 2026 · A field guide for architects, engineers, and AI thought leaders.
Mixture of Experts has moved from an experimental scaling trick to the default architecture behind frontier AI. GPT-4's rumored sparse design was the early signal. Today, DeepSeek-V3/R1 and V4, Qwen3-235B-A22B, the Llama 4 family, Mistral Large 3, Kimi K2, Grok-1/2, and GPT-5/GPT-OSS all rely on it. Modern AI faces two linked challenges — exploding compute requirements and the difficulty of fitting one monolithic model to increasingly diverse data — and MoE has emerged as the dominant answer to both. The core idea is simple: MoE lets models grow dramatically larger without activating every parameter for every token. It is how the industry now builds larger, more capable systems while keeping inference cost, latency, and training compute within reach.
MoE is an AI architecture where a set of specialized sub-networks ("experts") sit inside a larger model, and a lightweight gating mechanism decides which experts process each input token. It's a divide-and-conquer strategy that optimizes for both performance and efficiency. Instead of using the whole model every time, only a small subset activates per token — letting a model contain hundreds of billions or trillions of parameters while using a fraction of them for any given input.
Traditional ML ensembles like boosting and bagging combine models statically. MoE is dynamic and internal to the model. For each token, a router picks the most relevant experts in real time. The result feels less like a committee vote and more like a coordinated team of specialists working inside a single neural architecture, with routing decisions baked into inference itself.
Each expert naturally develops proficiency in different regions of a high-dimensional embedding space during training. These aren't experts in the human sense — categorizing them as "the math expert" or "the code expert" is a useful conceptual shorthand, but their actual specialization is statistical and emergent. The router learns which expert is best for which token representation, and refines those decisions over training.
Dense models activate all parameters for every token, which makes scaling expensive. MoE breaks the link between total model capacity and active compute. A 671B-parameter MoE may activate only ~37B per token, giving it enormous representational capacity at the cost profile of a much smaller dense model. This decoupling is the single most important shift in frontier-model design over the past two years.
Modern LLMs have to handle code, math, conversation, multilingual reasoning, vision, retrieval, and agentic workflows — each with different demands. A single dense feed-forward path forces all of this through the same parameters, creating interference. MoE lets the model specialize internally, routing different inputs through different expert pathways without ballooning per-token compute.
In most LLMs, MoE replaces some dense feed-forward (FFN) layers with expert-based FFN blocks. The router scores available experts for a token and dispatches it to the top-ranked one or two. Classic routing uses Top-K softmax gating: G(x) = softmax(TopK(W_g · x + noise, K)). Mixtral used Top-2; newer systems lean toward finer-grained experts, shared experts, and sigmoid gating to reduce winner-take-all dynamics.
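To make the gating concrete, here is a minimal, illustrative PyTorch sketch of noisy Top-K softmax gating in the spirit of the formula above. It is not any particular model's implementation; the class name, noise scale, and shapes are placeholders of our own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKRouter(nn.Module):
    """Minimal noisy Top-K gate: score experts, keep the K best, softmax over the survivors."""
    def __init__(self, d_model: int, n_experts: int, k: int = 2):
        super().__init__()
        self.w_gate = nn.Linear(d_model, n_experts, bias=False)   # W_g in the formula above
        self.k = k

    def forward(self, x: torch.Tensor):
        # x: [tokens, d_model] -> logits: [tokens, n_experts]
        logits = self.w_gate(x)
        if self.training:
            logits = logits + 0.01 * torch.randn_like(logits)     # small exploration noise
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)         # keep the K highest-scoring experts
        gates = F.softmax(topk_vals, dim=-1)                      # gate weights over the chosen experts
        return topk_idx, gates

# Each token's output is then the gate-weighted sum of its K chosen experts' FFN outputs.
```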
On June 20, 2023, George Hotz claimed that GPT-4 was built from eight smaller models of ~220B parameters each — a rumor later echoed by Soumith Chintala. That gives a headline figure of ~1.76T parameters, but the math isn't that clean: in MoE transformers, typically only the FFN layers are replicated per expert while attention layers are shared. The true total for GPT-4 likely sits between 1.2T and 1.76T. OpenAI never confirmed the architecture, but inference-cost patterns and reverse-engineering made sparse activation the credible explanation.
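As a back-of-envelope illustration of why the headline number overcounts: the sketch below assumes attention, embeddings, and norms make up 30% of each ~220B model and are shared, with only the remaining FFN parameters replicated per expert. Both fractions are illustrative assumptions, not reported GPT-4 figures.

```python
# Back-of-envelope parameter count: shared attention, replicated FFN experts.
# The 30/70 split is an illustrative assumption, not a known GPT-4 figure.
per_expert_model = 220e9
attn_fraction = 0.30                       # assumed shared: attention, embeddings, norms
ffn_fraction = 1.0 - attn_fraction         # assumed replicated once per expert
n_experts = 8

total = per_expert_model * attn_fraction + per_expert_model * ffn_fraction * n_experts
print(f"~{total / 1e12:.2f}T parameters")  # ~1.30T under these assumptions, vs 1.76T if everything were replicated
```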
Some users described later GPT-4 variants as less thorough or "lazier" than earlier releases, a pattern with several plausible causes, all of them consistent with MoE-based serving.
MoE isn't inherently weaker; it gives operators more levers to trade cost, latency, and quality.
GPT-5 and the GPT-OSS open-weight release (2025) use multimodal MoE with experts specialized across coding, math, conversation, and vision. The lineage is now openly sparse, even if exact parameter counts remain undisclosed.
Mixtral 8x7B (2023) proved that open-weight MoE could be practical and competitive. It's a sparse mixture-of-experts (SMoE) decoder-only model released under Apache 2.0, with roughly 47B total parameters, about 13B active per token, and eight experts per layer routed top-2.
Mixtral 8x22B followed with 141B total / 39B active, pushing stronger results in coding and math.
The industry shifted from small expert pools to much finer-grained systems — 128, 256, or more routed experts, often combined with always-on shared experts. This improves specialization and routing flexibility while reducing the risk that a few coarse experts become overloaded or poorly differentiated.
The 2025–2026 generation pushed MoE into much larger deployments:
Llama 4 expanded into a family rather than a single release. Scout is the smaller variant (109B total / 17B active, with up to 10M-token context in some configs, runnable on a single H100). Maverick is the larger flagship at ~400B / 17B active. Behemoth is a research-scale system with significantly larger active capacity — some reports cite ~288B active. The design reflects a broader trend: MoE is now a systems architecture, not just a model-layer trick.
Shared experts process every token and provide stability, common knowledge, and a baseline representation that all routed paths build on. DeepSeek and Llama 4 both use them. This reduces fragmentation: routed experts can specialize aggressively while shared experts preserve global continuity across tasks, languages, modalities, and reasoning patterns.
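A minimal sketch of how a shared expert combines with routed experts follows, assuming a simple per-token loop for dispatch (real systems batch tokens per expert and enforce capacity limits). All names here are illustrative rather than taken from any specific codebase.

```python
import torch
import torch.nn as nn

def ffn(d_model: int, d_hidden: int) -> nn.Sequential:
    return nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))

class SharedExpertMoE(nn.Module):
    """Illustrative MoE FFN block: one always-on shared expert plus K routed experts per token."""
    def __init__(self, d_model: int = 512, d_hidden: int = 1024, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.shared = ffn(d_model, d_hidden)                       # processes every token
        self.experts = nn.ModuleList([ffn(d_model, d_hidden) for _ in range(n_experts)])
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:           # x: [tokens, d_model]
        y = self.shared(x)                                         # stable baseline for every token
        gates, idx = torch.softmax(self.router(x), dim=-1).topk(self.k, dim=-1)
        gates = gates / gates.sum(dim=-1, keepdim=True)            # renormalize over the chosen K
        routed = torch.zeros_like(y)
        for slot in range(self.k):                                 # naive dispatch, fine for illustration
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    routed[mask] += gates[mask, slot].unsqueeze(-1) * expert(x[mask])
        return y + routed
```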
The router learns which expert pathways are useful for each token. During training, different experts naturally become better suited to different regions of the embedding space. But specialization is rarely as clean as "one expert for code, one for math." Experts often specialize in syntax, entities, languages, token patterns, reasoning modes, or hidden internal features that don't map neatly to task labels — specialization is more emergent than designed.
The router is the conductor of the orchestra, and its design is critical.
Three families matter:
Token-level routing remains the standard for LLMs — each token independently routed based on its hidden representation. Task-level routing sends all tokens for a given task (translation vs. summarization) to dedicated experts, minimizing interference. Modality-level routing sends text tokens to text experts and image patches to vision experts in multimodal systems — increasingly common as native multimodality becomes the norm.
The central challenge is load balancing. If the router sends most tokens to a few favorite experts, routing collapses: a handful of experts are overloaded while the rest go underused, wasting capacity. Classic MoE adds an auxiliary loss that penalizes unbalanced routing decisions, as sketched below. DeepSeek-style designs now lead with auxiliary-loss-free load balancing, supplemented by orthogonality losses, variance objectives, and router regularization. The goal is genuine specialization rather than artificial uniformity. Recent work also addresses gradient noise, RL instability specific to sparse models, and expert collapse.
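For intuition, here is a minimal sketch of the classic auxiliary load-balancing loss in the style popularized by the Switch Transformer, not the auxiliary-loss-free scheme DeepSeek uses. The function name and the top-1 dispatch assumption are ours.

```python
import torch

def load_balancing_loss(router_logits: torch.Tensor, expert_idx: torch.Tensor, n_experts: int) -> torch.Tensor:
    """Classic auxiliary loss: penalize routers that pile tokens onto a few experts.

    router_logits: [tokens, n_experts] raw gate scores
    expert_idx:    [tokens] expert each token was actually dispatched to (top-1 here)
    """
    probs = torch.softmax(router_logits, dim=-1)
    f = torch.bincount(expert_idx, minlength=n_experts).float() / expert_idx.numel()  # fraction routed to expert i
    p = probs.mean(dim=0)                                                             # mean gate probability for expert i
    # Equals 1.0 when routing is perfectly uniform; gradients flow through p (f is a hard count).
    return n_experts * torch.sum(f * p)
```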
ICLR 2026 papers (and arXiv 2507.23279) identified a tiny subset of experts — often just 3–10 out of thousands — that dominate extreme activation outliers in the down_proj layer and create massive hidden-state activations between decoder layers. Pruning just 3 of 6,144 experts in Qwen3-30B-A3B caused catastrophic collapse: repetitive, uninformative outputs and major drops in math and reasoning. Super Experts are model-specific, data-agnostic, and unaffected by post-training.
They reshape how teams think about compression, pruning, and quantization. If a handful of experts carry disproportionate responsibility, naive pruning destroys quality. They also offer a window into why MoE works: Super Experts appear connected to hidden-state outliers and attention-sink behavior, suggesting they stabilize information flow across layers. Compression strategies built around Super Expert preservation are now an active research area.
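As a rough, hypothetical sketch of how a compression pipeline might act on this finding: collect per-expert activation statistics on a calibration batch (for example, the peak absolute down_proj output per expert), then exclude extreme outliers from the pruning candidate set. The threshold and the stats-collection step are our assumptions, not the method from the cited papers.

```python
# Hypothetical sketch: shield Super-Expert candidates from pruning.
# `max_abs_down_proj` maps (layer, expert) -> peak |down_proj output| observed on a calibration
# batch; how you collect it (e.g., via forward hooks) depends on your model code. The 50x-median
# threshold is an illustrative choice, not a published criterion.
def super_expert_candidates(max_abs_down_proj: dict, outlier_factor: float = 50.0) -> set:
    peaks = sorted(max_abs_down_proj.values())
    median = peaks[len(peaks) // 2]
    return {key for key, peak in max_abs_down_proj.items() if peak > outlier_factor * median}

# Anything returned here is removed from the pruning / aggressive-quantization candidate list.
```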
ICLR 2026 cross-layer routing studies show that MoE layers are not independent. Activations from one MoE layer strongly predict which experts fire in subsequent layers, and MoE outputs often dominate routing decisions more than attention layers do. This cross-layer dependency — MoE entanglement — means routing forms a dynamic pathway through the network, with expert outputs shaping future selection more strongly than earlier work assumed.
There is growing evidence that sparse routing reduces some forms of superposition compared with dense models of similar active size. Because different token patterns route through different experts, features become less entangled. Mechanistic studies on OLMoE and Qwen3 variants have automatically labeled experts as detectors for subword families, proper names, locations, languages, and entity types — though the mapping remains imperfect.
Per-token compute scales with active parameters, not total capacity. A model with 671B total but 37B active parameters has per-token compute resembling a much smaller dense model. In production this means better throughput, lower marginal cost, or higher quality at the same serving budget — typically 5–10× better cost-performance than equivalent dense models, with meaningful speedups in practice (Mixtral was ~6× faster than Llama 2 70B at higher quality).
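A rough rule of thumb, sketched below, makes the decoupling concrete: roughly two FLOPs per active parameter per decoded token, ignoring attention over the KV cache and memory-bandwidth effects. The numbers are illustrative estimates, not benchmarks.

```python
# Rough rule of thumb: decoding one token costs ~2 FLOPs per *active* parameter
# (ignores attention over the KV cache and memory-bandwidth effects).
def per_token_tflops(active_params: float) -> float:
    return 2 * active_params / 1e12

print(per_token_tflops(37e9))   # ~0.07 TFLOPs/token for a 671B-total / 37B-active MoE
print(per_token_tflops(70e9))   # ~0.14 TFLOPs/token for a 70B dense model
```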
The full model has to live somewhere. Even if only a few experts activate per token, the serving stack must load, shard, offload, or retrieve the full expert set efficiently. That creates VRAM, bandwidth, and orchestration challenges. Even Mixtral required at least 30 GB of VRAM and high-end GPUs (A100, A6000, H100); frontier-scale MoEs need clusters or aggressive quantization. Production MoE serving depends heavily on expert parallelism, quantization (FP8, INT4, 4-bit with QLoRA-style techniques for attention layers), fast routing kernels, communication scheduling, and hardware-aware placement.
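A quick estimate, sketched below, shows why: weight memory scales with total parameters and the chosen precision, regardless of how few parameters are active per token. The figures cover weights only and exclude KV cache, activations, and framework overhead.

```python
# Weight memory is set by *total* parameters and precision, not by active parameters.
def weight_memory_gb(total_params: float, bits_per_param: float) -> float:
    return total_params * bits_per_param / 8 / 1e9

for name, total in [("Mixtral 8x7B", 47e9), ("DeepSeek-V3-class", 671e9)]:
    for bits in (16, 8, 4):
        print(f"{name} @ {bits}-bit: ~{weight_memory_gb(total, bits):.0f} GB of weights")
# Mixtral 8x7B at 4-bit is ~24 GB of weights alone, before KV cache and activations,
# which is why ~30 GB of VRAM was the practical floor for serving it.
```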
The 2026 acceleration surveys (arXiv 2503.07137 updated; "A Survey on Accelerated Technologies for MoE") cover hybrid parallel computing, fine-grained memory management, communication scheduling, ML-guided load balancing, cross-layer optimization, and hardware-software co-design. NVIDIA Blackwell NVL72 reportedly delivers ~10× inference gains on DeepSeek-R1, Kimi K2, and Mistral Large 3. Wide-expert parallelism, FP8 and INT4 quantization, and SSD expert offloading for consumer hardware are all moving from research into production.
Running MoE models outside hyperscale clusters is increasingly practical. DeepSeek-V3.2 Speciale fits on 8×H100 with FP8. Llama 4 Scout runs on a single H100. Mistral Large 3 deploys on a single 8-GPU node. Heavy quantization, expert offloading, and specialized inference engines have made these smaller deployments realistic, but frontier-scale MoEs still require serious memory capacity. The defining tradeoff remains the gap between active compute (small) and memory footprint (huge).
Fine-tuning MoE models is easier than in 2024 but still more complex than dense fine-tuning. Teams must account for routing behavior, expert balance, memory layout, and whether adapters target shared layers, routed experts, or both. Unsloth offers 12–30× faster MoE training; Hugging Face expert backends and native vLLM support have made MoE a first-class citizen. For narrow enterprise tasks with constrained infrastructure, dense models may still be simpler.
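As one illustration of the adapter-targeting choice, a LoRA setup with Hugging Face PEFT might target only the shared attention projections, or additionally the routed expert FFNs. The module names below follow Mixtral-style naming and are assumptions for other MoE families; check model.named_modules() before use.

```python
# Illustrative PEFT/LoRA configs for an MoE checkpoint. Module names vary by model family
# (the expert names below follow Mixtral-style "w1/w2/w3"); verify with model.named_modules().
from peft import LoraConfig

attention_only = LoraConfig(
    r=16, lora_alpha=32, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],                     # shared, non-routed layers
)
attention_plus_experts = LoraConfig(
    r=16, lora_alpha=32, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "w1", "w2", "w3"],   # also adapt routed expert FFNs
)
```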
The technique generalizes well beyond language models: sparse expert layers appear in vision, speech, recommendation, and multimodal systems as well. The core advantage is consistent: reduce interference by letting different experts handle different data regimes.
Native multimodality is now standard across flagship MoEs — Qwen3, Llama 4, Mistral Large 3, Kimi K2, and DeepSeek-V4, not just the GPT-5 lineage. Instead of bolting vision onto a text model, newer systems route different modalities through specialized expert pathways. This matters for enterprise architectures: document understanding, code agents, visual inspection, long-context retrieval, and workflow automation increasingly need models that coordinate multiple data types without collapsing them into one overloaded representation.
MoE is still transformer-based but belongs to the broader 2026 conversation about changing scaling laws — alongside state-space models, Mamba-style architectures, retrieval, and agentic control. MoE's advantage is that it works with today's transformer ecosystem. It improves scaling efficiency without requiring the industry to abandon the architecture, tooling, and hardware paths already built.
The ecosystem has exploded. Megablocks, DeepSpeed-MoE, Fairseq, OpenMoE, and Unsloth all power MoE workloads, with native support in Hugging Face Transformers and vLLM. Active research directions include auxiliary-loss-free load balancing, Super Expert-aware compression and quantization, cross-layer routing analysis, RL stability for sparse models, and hardware-software co-design.
MoE offers a practical answer to the frontier-model scaling problem. Labs can build models with massive total capacity while controlling per-token compute. That combination — quality, lower inference cost, specialization, and a real path to trillion-parameter systems — is hard to beat economically with dense architectures.
MoE changes model architecture, but it also changes infrastructure architecture. The design question is no longer just "How many parameters does the model have?" but "How many parameters are active, where are the experts stored, and how efficiently can they be routed?" Production teams must think across the full stack: model design, memory layout, parallelism, quantization, routing behavior, observability, and cost controls. Sparse intelligence is as much a systems problem as a modeling breakthrough.
MoE is no longer exotic — it is the architecture that makes trillion-parameter AI practical. By activating only a fraction of parameters per token, it delivers higher capacity, stronger specialization, faster inference, and better cost-performance than dense models of equivalent total size. As of spring 2026, the vast majority of frontier open and closed models are MoE-based, and the gap between open and closed performance has never been smaller. The era of purely monolithic dense frontier models is giving way to sparse, specialized, and ruthlessly efficient AI systems. The story of MoE is still being written — but it's already the most consequential architectural shift of the decade.
TensorOps · Your partners in AI