Chapter 12. Mixture of Experts (MoE), The Dominant Architecture of 2026
In Chapter 11, you saw a striking pattern in the model size tables: the most capable open-weight models of 2025 and 2026 all have two parameter counts listed, a large “total” number and a much smaller “active” number. DeepSeek-V3 has 671 billion total parameters but only 37 billion active. LLaMA 4 Maverick has 400 billion total but only 17 billion active. This is not a typo or a marketing trick. It is the result of a specific architectural choice called Mixture of Experts (MoE), and it has become the dominant design pattern for frontier language models. In this chapter, we will explain exactly how MoE works, why it exists, how routing decides which experts to use, and why this architecture enables models that are simultaneously larger and cheaper to run than their dense predecessors.
The Problem: Dense Models Waste Compute
In every chapter so far, we have described a dense Transformer: a model where every parameter participates in processing every token. When LLaMA 3 8B processes a token, all 8 billion parameters are involved. When LLaMA 3.1 405B processes a token, all 405 billion parameters are involved. The computational cost scales linearly with the number of parameters.
This creates a fundamental tension. On one hand, more parameters means more capacity to store knowledge, recognize patterns, and perform complex reasoning (Chapter 11). On the other hand, more parameters means more computation per token, which means slower inference, higher costs, and more GPU memory.
Consider the feed-forward network (FFN) from Chapter 9. In LLaMA 3 8B, the FFN has three weight matrices (W_gate, W_up, W_down) with a combined 176 million parameters per layer. Every single one of those parameters is used for every single token, regardless of what that token is. Whether the model is processing a question about quantum physics or a simple greeting, the same 176 million FFN parameters are activated.
This is wasteful. Different types of inputs require different types of processing. A question about Python code activates different knowledge than a question about medieval history. A dense model has no choice but to use all its parameters for everything, even though most of those parameters are irrelevant to any given input.
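The 176 million figure quoted above can be reproduced with a few lines of arithmetic. This is a minimal sketch; the two dimensions are assumed from LLaMA 3 8B's public configuration and are not stated in this chapter.

```python
# Assumed LLaMA 3 8B dimensions (from its public config, not from this chapter).
hidden_size = 4_096
intermediate_size = 14_336

# SwiGLU FFN: W_gate and W_up map hidden -> intermediate, W_down maps back.
w_gate = hidden_size * intermediate_size
w_up = hidden_size * intermediate_size
w_down = intermediate_size * hidden_size

ffn_params_per_layer = w_gate + w_up + w_down
print(f"FFN parameters per layer: {ffn_params_per_layer:,}")
```

In a dense model, every one of these weights is multiplied against every token, which is the cost that MoE sets out to reduce.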
The question becomes: what if we could have a model with hundreds of billions of parameters’ worth of knowledge, but only use a small fraction of those parameters for each token? That is exactly what Mixture of Experts does.
The MoE Idea: Many Experts, Few Active
The core idea of MoE is simple: instead of one large FFN per layer, have many smaller FFN blocks (called experts), and for each token, only activate a few of them. A small network called the router (or gating network) decides which experts to use for each token.
Here is the key insight: the attention mechanism (Chapters 7-8) stays the same. MoE only replaces the FFN portion of the Transformer block. Recall from Chapter 10 that each Transformer block has two main components:
- Multi-head attention (gathers information from other tokens)
- Feed-forward network (processes each token independently)
In a dense model, the second component is a single FFN. In an MoE model, it is a collection of expert FFNs, and a router selects which ones to activate for each token.
The MoE Layer Structure
A single MoE layer contains:
- N routed experts: Each expert is a standard FFN (the same SwiGLU architecture from Chapter 9, with W_gate, W_up, and W_down matrices). The only difference is that each expert’s FFN is typically smaller than the FFN in a dense model of equivalent quality.
- A router network: A small linear layer that takes the token’s hidden state as input and produces a score for each expert. The router has shape [hidden_size x num_experts], so it adds very few parameters.
- Optional shared expert(s): Some architectures include one or more “shared” experts that are always active for every token, in addition to the routed experts.
For each token, the router computes scores for all N experts, selects the top K experts with the highest scores, and routes the token to only those K experts. The outputs of the selected experts are combined (typically as a weighted sum, where the weights come from the router scores) to produce the final FFN output for that token.
A Concrete Example
Let’s make this concrete with LLaMA 4 Maverick’s numbers:
- 128 routed experts per MoE layer
- 1 shared expert per MoE layer (always active)
- Top-1 routing: only 1 routed expert is selected per token
- Each expert’s FFN has intermediate_size = 8,192
When a token arrives at an MoE layer in Maverick:
- The router computes 128 scores (one per routed expert)
- The router selects the 1 expert with the highest score
- The token is processed by the shared expert AND the 1 selected routed expert
- The outputs of both experts are combined
- The result is added to the residual stream (just like a normal FFN output)
Out of 129 total expert FFNs in the layer (128 routed + 1 shared), only 2 are active for any given token. The other 127 routed experts sit idle for that token. This is why Maverick has 400 billion total parameters but only 17 billion active per token: most of the parameters are in expert FFNs that are not selected.
Source: LLaMA 4 Maverick from Meta (April 5, 2025). Configuration from HuggingFace Transformers Llama4TextConfig and PyTorch blog: 48 layers, hidden_size=5,120, 128 routed experts + 1 shared expert per MoE layer, top-1 routing, intermediate_size=8,192 (per expert), interleave_moe_layer_step=2 (alternating dense and MoE layers, so 24 of 48 layers are MoE).
A Brief History of MoE
The MoE concept is not new. It predates the Transformer by decades.
1991: The Original Idea
The concept of Mixture of Experts was introduced by Jacobs, Jordan, Nowlan, and Hinton in their 1991 paper “Adaptive Mixtures of Local Experts” (Neural Computation, 1991). The core idea was to divide a complex problem into simpler subproblems, each handled by a specialized “expert” network, with a “gating” network that learned to assign inputs to the appropriate experts. This was not about language models or Transformers; it was a general machine learning technique for combining multiple specialized networks.
Source: Jacobs, Jordan, Nowlan, and Hinton, “Adaptive Mixtures of Local Experts,” Neural Computation, Vol. 3, No. 1, pp. 79-87, 1991.
2017: Scaling MoE to Neural Networks
Shazeer et al. (2017) published “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer” (ICLR 2017), which applied the MoE concept to deep neural networks at scale. They introduced a sparsely-gated MoE layer consisting of up to thousands of feed-forward sub-networks, with a trainable gating network that selected a sparse combination of experts for each input. They applied this to language modeling and machine translation using LSTM-based models (not Transformers, which had not been invented yet), achieving models with up to 137 billion parameters. This paper established the key principles that modern MoE models still use: sparse activation (only a few experts per input), a learned gating function, and the need for load balancing to prevent expert collapse.
Source: Shazeer et al., “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer,” ICLR 2017 (arXiv:1701.06538). Applied MoE with up to 137B parameters to LSTM-based language models.
2022: Switch Transformer
Fedus, Zoph, and Shazeer (2022) published the Switch Transformer paper, which simplified MoE routing by sending each token to just one expert (top-1 routing) instead of two or more. This counterintuitive simplification made MoE models much easier to train at scale. They demonstrated a model with 1.6 trillion parameters, showing that MoE could scale to enormous sizes while keeping per-token compute manageable. The Switch Transformer also introduced a simplified auxiliary load-balancing loss to ensure all experts were used roughly equally.
Source: Fedus, Zoph, and Shazeer, “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity,” JMLR 2022 (arXiv:2101.03961). Demonstrated 1.6T parameter MoE model with top-1 routing.
2023-2026: The MoE Revolution in LLMs
The modern MoE era for language models began in December 2023 with Mixtral 8x7B from Mistral AI, the first widely available open-weight MoE language model. Since then, MoE has become the dominant architecture for frontier models:
| Model | Release | Total Params | Active Params | Experts | Top-K |
|---|---|---|---|---|---|
| Mixtral 8x7B | Dec 2023 | 46.7B | 12.9B | 8 | 2 |
| Grok-1 | Mar 2024 | 314B | ~78.5B | 8 | 2 |
| Mixtral 8x22B | Apr 2024 | 141B | 39B | 8 | 2 |
| DeepSeek-V3 | Dec 2024 | 671B | 37B | 256 + 1 shared | 8 |
| LLaMA 4 Scout | Apr 2025 | 109B | 17B | 16 + 1 shared | 1 |
| LLaMA 4 Maverick | Apr 2025 | 400B | 17B | 128 + 1 shared | 1 |
| Qwen 3 235B-A22B | Apr 2025 | 235B | 22B | 128 | 8 |
| Qwen 3.5 397B-A17B | Feb 2026 | 397B | 17B | 512 + 1 shared | 10 |
| Mistral Small 4 | Mar 2026 | 119B | 6.5B | 128 + 1 shared | 4 |
Sources: Mixtral 8x7B from Mistral AI (December 2023): 46.7B total, 12.9B active, 8 experts, top-2 routing. Grok-1 from xAI (March 17, 2024, open-sourced under Apache 2.0): 314B total, ~78.5B active (25% of weights active per token per xAI), 8 experts, top-2 routing. Mixtral 8x22B from Mistral AI (April 2024). DeepSeek-V3 from DeepSeek (December 26, 2024): arXiv:2412.19437. LLaMA 4 Scout/Maverick from Meta (April 5, 2025): HuggingFace Transformers Llama4TextConfig, PyTorch blog. Qwen 3 235B-A22B from Alibaba (April 29, 2025): HuggingFace config.json (Qwen/Qwen3-235B-A22B), num_experts=128, num_experts_per_tok=8, moe_intermediate_size=1,536, intermediate_size=12,288, hidden_size=4,096, 94 layers, 64 query heads, 4 KV heads, head_dim=128, vocab_size=151,936, decoder_sparse_step=1 (all layers are MoE), router_aux_loss_coef=0.001. Qwen 3.5 397B-A17B from Alibaba (February 16, 2026): HuggingFace config.json (Qwen/Qwen3.5-397B-A17B), num_experts=512, num_experts_per_tok=10, moe_intermediate_size=1,024, shared_expert_intermediate_size=1,024, hidden_size=4,096, 60 layers, vocab_size=248,320, 32 query heads, 2 KV heads, head_dim=256, 256K native context (max_position_embeddings=262,144, extendable to 1M+), hybrid attention (Gated DeltaNet linear + full attention in 3:1 ratio, full_attention_interval=4, layer_types explicitly listed), linear attention config: linear_key_head_dim=128, linear_num_key_heads=16, linear_num_value_heads=64, linear_conv_kernel_dim=4, mtp_num_hidden_layers=1 (multi-token prediction), router_aux_loss_coef=0.001. 
Mistral Small 4 from Mistral AI (March 16, 2026, announced at GTC 2026): HuggingFace config.json (mistralai/Mistral-Small-4-119B-2603), 119B total, 6.5B active per HuggingFace (Mistral’s official blog states 6B active per token, or 8B including embedding and output layers), 128 routed experts + 1 shared expert, top-4 routing, hidden_size=4,096, moe_intermediate_size=2,048, 36 layers, 256K context window, uses Multi-head Latent Attention (MLA) with kv_lora_rank=256, vocab_size=131,072, Apache 2.0 license.
The trend is clear: the number of experts has grown from 8 (Mixtral) to 128 (LLaMA 4 Maverick, Qwen 3, Mistral Small 4) to 256 (DeepSeek-V3) to 512 (Qwen 3.5), while the number of active experts per token has remained small (1 to 10). This means the ratio of total to active parameters has grown dramatically, from about 3.6x for Mixtral 8x7B to about 23x for Qwen 3.5.
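The total-to-active ratios can be checked directly from the table. The sketch below simply divides the two published parameter counts; the dictionary and function names are illustrative.

```python
# (total, active) parameter counts in billions, copied from the table above.
models = {
    "Mixtral 8x7B": (46.7, 12.9),
    "DeepSeek-V3": (671, 37),
    "LLaMA 4 Maverick": (400, 17),
    "Qwen 3.5 397B-A17B": (397, 17),
}

def sparsity_ratio(total_b, active_b):
    """How many parameters are stored for each parameter used per token."""
    return total_b / active_b

for name, (total_b, active_b) in models.items():
    print(f"{name:22s} {sparsity_ratio(total_b, active_b):5.1f}x")
```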
How Routing Works
The router is the brain of the MoE layer. It decides which experts process each token. Despite its critical role, the router is remarkably simple: it is a single linear layer (a weight matrix) with no bias and no activation function.
The Router Computation
For a token with hidden state vector h of shape [hidden_size]:
router_scores = h @ W_router
Here W_router has shape [hidden_size x num_experts]. The result is a vector of num_experts scores, one per expert. The router then selects the top-K experts with the highest scores.
For Mixtral 8x7B, W_router has shape [4,096 x 8], producing 8 scores. The top 2 are selected.
For DeepSeek-V3, W_router has shape [7,168 x 256], producing 256 scores. The top 8 are selected.
For LLaMA 4 Maverick, W_router has shape [5,120 x 128], producing 128 scores. The top 1 is selected.
The router adds very few parameters to the model. LLaMA 4 Maverick’s router has 5,120 x 128 = 655,360 parameters per MoE layer, which is negligible compared to the millions of parameters in the expert FFNs.
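To put the router's size in perspective, the following sketch computes the router parameter count for the three models above and compares Maverick's router with a single Maverick expert (3 x 5,120 x 8,192 weights, using the expert size given earlier in this chapter).

```python
# (hidden_size, num_routed_experts) for each router, as quoted in the text.
routers = {
    "Mixtral 8x7B": (4_096, 8),
    "DeepSeek-V3": (7_168, 256),
    "LLaMA 4 Maverick": (5_120, 128),
}

# A router is one weight matrix of shape [hidden_size x num_experts].
router_params = {name: h * n for name, (h, n) in routers.items()}

# One Maverick expert: three SwiGLU matrices of size 5,120 x 8,192.
maverick_expert_params = 3 * 5_120 * 8_192
router_share = router_params["LLaMA 4 Maverick"] / maverick_expert_params

for name, count in router_params.items():
    print(f"{name:18s} router parameters: {count:,}")
print(f"Maverick router vs. one expert: {router_share:.2%}")
```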
Softmax vs. Sigmoid Gating
After computing the raw router scores, the model needs to convert them into weights that determine how much each selected expert contributes to the final output. There are two main approaches:
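Softmax gating (used by Mixtral): The router scores for the selected top-K experts are passed through a softmax function, which normalizes them to sum to 1. This means the selected experts’ contributions are treated as a weighted average. Note that a softmax over only the selected scores is informative only when K is at least 2; with a single selected expert it would always return a weight of 1.0.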
# Softmax gating (Mixtral style)
scores = h @ W_router # shape: [num_experts]
top_k_indices = top_k(scores, k=2)
top_k_scores = scores[top_k_indices]
weights = softmax(top_k_scores) # sum to 1.0
output = weights[0] * Expert_A(h) + weights[1] * Expert_B(h)
Sigmoid gating (used by DeepSeek-V3): Each expert’s score is passed through a sigmoid function independently, producing a value between 0 and 1 for each expert. Unlike softmax, the scores are not normalized to sum to 1. This decouples the experts’ scores from each other: one expert having a high score does not force other experts’ scores down.
# Sigmoid gating (DeepSeek-V3 style)
scores = sigmoid(h @ W_router) # shape: [num_experts], each in (0, 1)
top_k_indices = top_k(scores, k=8)
top_k_scores = scores[top_k_indices]
# Normalize selected scores to sum to 1
weights = top_k_scores / sum(top_k_scores)
# weights is ordered like top_k_indices, so pair each weight with its expert index
output = sum(w * experts[i](h) for w, i in zip(weights, top_k_indices))
DeepSeek-V3’s use of sigmoid gating was a deliberate design choice. With softmax, increasing one expert’s score automatically decreases all other experts’ scores (because softmax normalizes to sum to 1). This creates competition between experts that can lead to routing collapse, where a few experts dominate and others are never selected. Sigmoid gating avoids this by scoring each expert independently.
Source: DeepSeek-V3 technical report (arXiv:2412.19437). Configuration from HuggingFace: scoring_func=‘sigmoid’, topk_method=‘noaux_tc’.
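The two gating styles can be compared side by side in a small runnable sketch. It uses plain Python lists instead of tensors, and the helper names (`softmax_gate`, `sigmoid_gate`, `top_k_indices`) are illustrative rather than taken from any model's codebase. Each function returns a mapping from the selected expert index to its gating weight.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def top_k_indices(scores, k):
    """Indices of the k largest scores, highest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def softmax_gate(logits, k):
    """Select top-k by raw logit, then softmax over the selected logits."""
    idx = top_k_indices(logits, k)
    weights = softmax([logits[i] for i in idx])
    return dict(zip(idx, weights))

def sigmoid_gate(logits, k):
    """Score each expert independently, select top-k, then renormalize."""
    scores = [sigmoid(x) for x in logits]
    idx = top_k_indices(scores, k)
    total = sum(scores[i] for i in idx)
    return {i: scores[i] / total for i in idx}

logits = [2.0, 0.5, -1.0, 1.5]  # made-up router outputs for 4 experts
soft = softmax_gate(logits, k=2)
sig = sigmoid_gate(logits, k=2)
print("softmax gating:", soft)
print("sigmoid gating:", sig)
```

Both variants pick the same two experts here, because sigmoid is monotonic and so preserves the ranking of the logits. The difference is in the weights: the sigmoid scores saturate, so the renormalized weights tend to be closer together than the softmax weights for the same logits.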
The Complete MoE Forward Pass
Let’s trace the complete forward pass through an MoE layer, step by step:
Input: x (shape [seq_len, hidden_size])
For each token t in the sequence:
1. Compute router scores: scores = x[t] @ W_router (shape [num_experts])
2. Select top-K experts: indices = top_k(scores, K)
3. Compute gating weights: weights = softmax(scores[indices]) (or sigmoid + normalize)
4. Run selected experts: expert_outputs = [Expert_i(x[t]) for i in indices]
5. Combine: output[t] = sum(weights[i] * expert_outputs[i])
6. If shared expert exists: output[t] += SharedExpert(x[t])
Output: output (shape [seq_len, hidden_size])
The output has the same shape as the input, just like a regular FFN. From the perspective of the rest of the Transformer (the attention layers, the residual connections, the normalization), the MoE layer is a drop-in replacement for the dense FFN. The only difference is internal: instead of one large FFN, there are many small FFNs with a router selecting among them.
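The six steps above can be turned into a small runnable sketch. It is written in plain Python with toy dimensions so that it runs anywhere; a real implementation would use batched tensor operations and would group tokens by expert. The `Expert` class, `moe_layer` function, and the random initialization are illustrative assumptions, and the router matrix is stored as [num_experts][hidden_size] (the transpose of the shape used in the text) purely for convenience.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def silu(v):
    return v / (1.0 + math.exp(-v))

class Expert:
    """A tiny SwiGLU FFN: W_down(silu(W_gate x) * (W_up x))."""
    def __init__(self, hidden, inter, rng):
        self.w_gate = rand_matrix(inter, hidden, rng)
        self.w_up = rand_matrix(inter, hidden, rng)
        self.w_down = rand_matrix(hidden, inter, rng)

    def __call__(self, x):
        gate = [silu(g) for g in matvec(self.w_gate, x)]
        up = matvec(self.w_up, x)
        return matvec(self.w_down, [g * u for g, u in zip(gate, up)])

def moe_layer(x, router_w, experts, k, shared=None):
    outputs = []
    for token in x:
        scores = matvec(router_w, token)                      # step 1
        idx = sorted(range(len(scores)),
                     key=scores.__getitem__, reverse=True)[:k]  # step 2
        m = max(scores[i] for i in idx)
        exps = [math.exp(scores[i] - m) for i in idx]
        weights = [e / sum(exps) for e in exps]               # step 3 (softmax)
        out = [0.0] * len(token)
        for w, i in zip(weights, idx):                        # steps 4 and 5
            out = [o + w * y for o, y in zip(out, experts[i](token))]
        if shared is not None:                                # step 6
            out = [o + s for o, s in zip(out, shared(token))]
        outputs.append(out)
    return outputs

rng = random.Random(0)
hidden, inter, num_experts, seq_len = 8, 16, 4, 3
experts = [Expert(hidden, inter, rng) for _ in range(num_experts)]
shared = Expert(hidden, inter, rng)
router_w = rand_matrix(num_experts, hidden, rng)
x = rand_matrix(seq_len, hidden, rng)

y = moe_layer(x, router_w, experts, k=2, shared=shared)
print(len(y), len(y[0]))  # same shape as the input: seq_len x hidden
```

Only the K selected experts are ever called for a given token, which is where the compute savings come from; the unselected experts contribute no work at all.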
Why “400B Total, 17B Active”
Let’s do the math to understand exactly where these numbers come from, using LLaMA 4 Maverick as our example.
Maverick’s Architecture
LLaMA 4 Maverick has 48 Transformer layers. Due to the interleave_moe_layer_step=2 configuration, every other layer is an MoE layer and the alternating layers are dense. This gives 24 MoE layers and 24 dense layers.
For each MoE layer:
- 128 routed expert FFNs, each with intermediate_size = 8,192
- 1 shared expert FFN, also with intermediate_size = 8,192
- Each expert FFN has 3 weight matrices (SwiGLU): W_gate, W_up, W_down
- Per expert: 3 x 5,120 x 8,192 = 125,829,120 parameters (~126M)
- 128 routed experts: 128 x 126M = ~16.1B parameters
- 1 shared expert: ~126M parameters
- Router: 5,120 x 128 = 655,360 parameters (~0.7M)
- Total MoE FFN per layer: ~16.2B parameters
For each dense layer:
- 1 FFN with intermediate_size_mlp = 16,384
- Per FFN: 3 x 5,120 x 16,384 = 251,658,240 parameters (~252M)
For the attention portion (same in both layer types):
- W_Q: 5,120 x (40 x 128) = 26,214,400
- W_K: 5,120 x (8 x 128) = 5,242,880
- W_V: 5,120 x (8 x 128) = 5,242,880
- W_O: (40 x 128) x 5,120 = 26,214,400
- Total attention per layer: ~62.9M
Total Parameters
24 MoE layers: 24 x (16.2B + 62.9M + 2 x 5,120) ≈ 390B
24 dense layers: 24 x (252M + 62.9M + 2 x 5,120) ≈ 7.6B
Embedding: 202,048 x 5,120 ≈ 1.0B
Output proj: 202,048 x 5,120 ≈ 1.0B
Final norm: 5,120
≈ 400B total
Active Parameters Per Token
For each token, only 1 routed expert + 1 shared expert are active per MoE layer:
24 MoE layers: 24 x (2 x 126M + 62.9M + norms) ≈ 7.6B
24 dense layers: 24 x (252M + 62.9M + norms) ≈ 7.6B
Embedding + output: ≈ 2.0B
≈ 17B active
The ratio is dramatic: 400B / 17B ≈ 23.5x. The model stores 23.5 times more parameters than it uses for any given token. This is the fundamental efficiency of MoE: massive knowledge capacity with modest computational cost.
Source: LLaMA 4 Maverick architecture from HuggingFace Transformers Llama4TextConfig: vocab_size=202,048, hidden_size=5,120, num_hidden_layers=48, num_attention_heads=40, num_key_value_heads=8, head_dim=128, intermediate_size=8,192 (MoE expert), intermediate_size_mlp=16,384 (dense layer), num_experts=128, num_experts_per_tok=1, interleave_moe_layer_step=2. PyTorch blog confirms: “Scout / Maverick has a shared expert and 16 / 128 routed experts with dropless token-choice routing and Top-1 selection for each MoE layer.”
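The whole calculation can be reproduced in a few lines. This sketch follows the chapter's accounting (two norm vectors per layer, embedding and output projection both counted as active) and uses only the configuration values listed in the source note above; the variable names are illustrative.

```python
hidden = 5_120
expert_inter = 8_192      # intermediate_size (per MoE expert)
dense_inter = 16_384      # intermediate_size_mlp (dense layers)
n_routed, n_shared, top_k = 128, 1, 1
n_moe_layers = n_dense_layers = 24
vocab = 202_048
n_heads, n_kv_heads, head_dim = 40, 8, 128

def swiglu_params(h, inter):
    return 3 * h * inter  # W_gate, W_up, W_down

expert = swiglu_params(hidden, expert_inter)
dense_ffn = swiglu_params(hidden, dense_inter)
# W_Q and W_O use all query heads; W_K and W_V use the KV heads.
attn = 2 * hidden * n_heads * head_dim + 2 * hidden * n_kv_heads * head_dim
router = hidden * n_routed
norms = 2 * hidden
embeddings = 2 * vocab * hidden  # input embedding + output projection
final_norm = hidden

def model_params(experts_counted):
    moe_layer = experts_counted * expert + router + attn + norms
    dense_layer = dense_ffn + attn + norms
    return (n_moe_layers * moe_layer + n_dense_layers * dense_layer
            + embeddings + final_norm)

total = model_params(n_routed + n_shared)   # every expert is stored
active = model_params(top_k + n_shared)     # only selected + shared run

print(f"total : {total / 1e9:.1f}B")
print(f"active: {active / 1e9:.1f}B")
print(f"ratio : {total / active:.1f}x")
```

The unrounded ratio comes out slightly below the 23.5x quoted above, because that figure divides the rounded headline numbers (400B and 17B).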
Load Balancing: The Expert Collapse Problem
MoE has a critical failure mode: expert collapse (also called routing collapse). If the router learns to always send tokens to the same few experts, the other experts never get trained and become useless. In the worst case, the model degenerates into a small dense model that ignores most of its parameters.
This happens because the router is trained by gradient descent along with the rest of the model. If expert A happens to perform slightly better than expert B early in training (perhaps due to random initialization), the router will send more tokens to expert A. Expert A gets more training data and improves further, while expert B stagnates. This creates a positive feedback loop where a few experts monopolize all the tokens and the rest are never used.
The Auxiliary Loss Solution
The standard solution is to add an auxiliary loss (also called a load-balancing loss) to the training objective. This is an additional penalty term that encourages the router to distribute tokens evenly across all experts.
The auxiliary loss works by measuring how unevenly tokens are distributed across experts and adding a penalty proportional to the imbalance. The total training loss becomes:
total_loss = language_modeling_loss + alpha * auxiliary_load_balancing_loss
Here alpha is a small coefficient (typically 0.001 to 0.01) that controls how strongly the model is penalized for uneven routing.
The auxiliary loss was introduced in the original Shazeer et al. (2017) paper and refined in the Switch Transformer (Fedus et al., 2022). It has been the standard approach for most MoE models, including Mixtral.
However, the auxiliary loss creates a tension: it pushes the router toward uniform distribution (all experts equally used), which may conflict with the optimal routing for the language modeling task. A large auxiliary loss coefficient forces more uniform routing but can hurt model quality. A small coefficient allows better routing but risks expert collapse.
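One widely used concrete form is the Switch Transformer loss, N · Σ f_i · P_i, where N is the number of experts, f_i is the fraction of tokens dispatched to expert i, and P_i is the mean router probability assigned to expert i. The sketch below computes it for top-1 routing on two hand-made batches; the function name and the example probabilities are illustrative.

```python
def load_balancing_loss(router_probs, assignments, num_experts):
    """Switch-style auxiliary loss for top-1 routing.

    router_probs: one probability vector over experts per token.
    assignments:  the expert index each token was actually sent to.
    """
    n_tokens = len(assignments)
    # f_i: fraction of tokens dispatched to expert i
    f = [assignments.count(e) / n_tokens for e in range(num_experts)]
    # P_i: mean router probability for expert i
    p = [sum(tok[e] for tok in router_probs) / n_tokens
         for e in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Perfectly balanced: uniform probabilities, one token per expert.
balanced = load_balancing_loss([[0.25] * 4 for _ in range(4)], [0, 1, 2, 3], 4)

# Collapsed: every token goes to expert 0 with high probability.
collapsed = load_balancing_loss(
    [[0.97, 0.01, 0.01, 0.01] for _ in range(4)], [0, 0, 0, 0], 4)

alpha = 0.01
print(f"balanced : {balanced:.2f}  (penalty {alpha * balanced:.4f})")
print(f"collapsed: {collapsed:.2f}  (penalty {alpha * collapsed:.4f})")
```

The loss reaches its minimum of 1.0 under uniform routing and grows toward N as routing collapses onto a single expert, so minimizing it nudges the router toward an even spread.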
DeepSeek-V3’s Innovation: Auxiliary-Loss-Free Load Balancing
DeepSeek-V3 pioneered a different approach that no longer relies on an auxiliary loss as the main balancing mechanism. Instead of adding a penalty to the training loss, DeepSeek-V3 dynamically adjusts a per-expert bias term that is added to the router scores before the top-K selection. If an expert is receiving too many tokens, its bias is decreased, making it less likely to be selected. If an expert is receiving too few tokens, its bias is increased. (The technical report does keep a complementary sequence-wise balance loss with a very small coefficient, intended only to prevent extreme imbalance within a single sequence.)
The key insight is that this bias is only used for expert selection (deciding which experts to route to), not for computing the final gating weights. The actual expert contribution weights are computed from the unbiased scores. This means the load balancing mechanism does not interfere with the gradient signal for the language modeling objective.
DeepSeek-V3’s technical report describes this as an “auxiliary-loss-free strategy for load balancing, which minimizes the performance degradation that arises from encouraging load balancing.” This approach has been influential: it demonstrates that load balancing can be achieved with far less cost to model quality.
Source: DeepSeek-V3 technical report (arXiv:2412.19437, December 2024). The auxiliary-loss-free load balancing strategy uses dynamic per-expert bias terms for expert selection while keeping the actual gating weights unbiased.
Real MoE Architectures: A Detailed Comparison
Let’s examine how the major MoE models differ in their architectural choices.
Mixtral 8x7B: The Pioneer
Mixtral 8x7B (Mistral AI, December 2023) was the first widely available open-weight MoE language model. Its design is straightforward:
- 32 Transformer layers, all MoE (no dense layers)
- 8 routed experts per layer, no shared experts
- Top-2 routing with softmax gating
- Each expert is a full SwiGLU FFN with intermediate_size = 14,336 (same as Mistral 7B)
- hidden_size = 4,096
- Total: 46.7B parameters, 12.9B active per token
The name “8x7B” is slightly misleading. Each expert is not a full 7B model; only the FFN portion is replicated 8 times. The attention layers, embeddings, and normalization are shared across all experts. The “7B” refers to the fact that each expert’s FFN is the same size as Mistral 7B’s FFN.
With top-2 routing, each token activates 2 of the 8 experts. The outputs of the 2 selected experts are combined as a weighted sum, where the weights come from the softmax of the router scores for those 2 experts.
Source: Mixtral 8x7B from Mistral AI (December 2023). Jiang et al., “Mixtral of Experts” (arXiv:2401.04088). 32 layers, 8 experts per layer, top-2 routing, 46.7B total, 12.9B active.
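The 46.7B and 12.9B figures follow from the layer structure described above. The sketch below reconstructs them; the attention head counts (32 query heads, 8 KV heads, head_dim 128) and the vocabulary size (32,000) are assumed from Mixtral's public configuration and are not given in this chapter.

```python
hidden, inter = 4_096, 14_336
n_layers, n_experts, top_k = 32, 8, 2

# Assumed from Mixtral's public config (not stated in this chapter).
n_heads, n_kv_heads, head_dim = 32, 8, 128
vocab = 32_000

expert = 3 * hidden * inter  # one SwiGLU FFN
attn = 2 * hidden * n_heads * head_dim + 2 * hidden * n_kv_heads * head_dim
router = hidden * n_experts
norms = 2 * hidden
embeddings = 2 * vocab * hidden  # input embedding + output projection
final_norm = hidden

def model_params(experts_counted):
    layer = experts_counted * expert + attn + router + norms
    return n_layers * layer + embeddings + final_norm

total = model_params(n_experts)  # all 8 experts are stored
active = model_params(top_k)     # only 2 run per token

print(f"total : {total:,} ({total / 1e9:.1f}B)")
print(f"active: {active:,} ({active / 1e9:.1f}B)")
print(f"8 x 7B would be 56B; sharing attention and embeddings saves "
      f"{56 - total / 1e9:.1f}B")
```

The gap between 56B and the true total makes the point in the text concrete: only the FFNs are replicated, so the model is considerably smaller than eight independent 7B models.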
DeepSeek-V3: Maximum Efficiency
DeepSeek-V3 (DeepSeek, December 26, 2024) pushed MoE to a new scale with several innovations:
- 61 Transformer layers total
- First 3 layers are dense (first_k_dense_replace = 3), remaining 58 layers are MoE
- 256 routed experts + 1 shared expert per MoE layer
- Top-8 routing with sigmoid gating
- moe_intermediate_size = 2,048 per expert (much smaller than the dense intermediate_size of 18,432)
- hidden_size = 7,168
- Auxiliary-loss-free load balancing
- Total: 671B parameters, 37B active per token
DeepSeek-V3’s design philosophy is “many small experts.” With 256 experts of intermediate_size 2,048 each, the individual experts are quite small. But with 8 experts active per token, the effective FFN capacity per token is 8 x 2,048 = 16,384 intermediate dimensions, which is comparable to a dense model’s FFN. The advantage is that the model has 256 experts’ worth of specialized knowledge (256 x 2,048 = 524,288 total intermediate dimensions across all experts), giving it enormous knowledge capacity.
The first 3 dense layers are a deliberate design choice. Early layers in a Transformer handle basic token processing (Chapter 10), and routing decisions at this stage may not be meaningful because the token representations have not yet been refined by attention. Using dense layers for the first few layers ensures stable, uniform processing before the MoE routing kicks in.
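DeepSeek-V3 also uses a grouped routing strategy: the 256 experts are divided into 8 groups of 32. The router first ranks the groups (the HuggingFace implementation scores each group by the sum of its two highest expert scores) and keeps the top 4, then selects the 8 highest-scoring experts from among the 128 experts in those groups. A token’s 8 experts therefore come from at most 4 groups, but they need not be split evenly as 2 per group. This reduces the communication overhead in distributed training and inference, because the experts of a group can be placed on the same node, so each token is sent to at most 4 nodes.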
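Group-limited selection can be sketched as follows. Note the assumptions: in this sketch each group is ranked by the sum of its two highest expert scores, and the final 8 experts are drawn from the pooled candidates of the kept groups rather than being fixed at two per group. Treat both details as assumptions to check against the reference implementation; the function name and the random scores are illustrative.

```python
import random

def group_limited_top_k(scores, n_group, topk_group, top_k, per_group_k=2):
    """Restrict expert selection to the best-scoring groups."""
    group_size = len(scores) // n_group
    groups = [range(g * group_size, (g + 1) * group_size)
              for g in range(n_group)]

    # Rank each group by the sum of its per_group_k highest scores.
    group_scores = [
        sum(sorted((scores[i] for i in grp), reverse=True)[:per_group_k])
        for grp in groups
    ]
    kept = sorted(range(n_group), key=group_scores.__getitem__,
                  reverse=True)[:topk_group]

    # Choose the top_k experts among all experts in the kept groups.
    candidates = [i for g in kept for i in groups[g]]
    return sorted(candidates, key=scores.__getitem__, reverse=True)[:top_k]

rng = random.Random(0)
scores = [rng.random() for _ in range(256)]  # stand-in for sigmoid scores
chosen = group_limited_top_k(scores, n_group=8, topk_group=4, top_k=8)
groups_used = sorted({i // 32 for i in chosen})
print("experts:", chosen)
print("groups :", groups_used)
```

However the scores fall, the selected experts never span more than `topk_group` groups, which bounds how many devices a single token has to be sent to.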
DeepSeek-V3’s training cost was remarkably low: 2.788 million H800 GPU hours for the full training pipeline (pre-training, context extension, and post-training), at an estimated cost of approximately $5.576 million (assuming $2 per GPU hour). This figure covers GPU compute time only; it does not include hardware purchase costs, research and development, data acquisition, or failed experiments. For comparison, Meta’s LLaMA 3.1 405B (a smaller dense model) required approximately 30.84 million H100 GPU hours, roughly 11x more.
Source: DeepSeek-V3 technical report (arXiv:2412.19437, December 26, 2024). Configuration from HuggingFace: vocab_size=129,280, hidden_size=7,168, intermediate_size=18,432, moe_intermediate_size=2,048, num_hidden_layers=61, num_attention_heads=128, num_key_value_heads=128, n_routed_experts=256, n_shared_experts=1, num_experts_per_tok=8, first_k_dense_replace=3, scoring_func=‘sigmoid’, n_group=8, topk_group=4. Training cost: 2.788M H800 GPU hours, ~$5.576M.
LLaMA 4 Maverick: Alternating Dense and MoE
LLaMA 4 Maverick (Meta, April 5, 2025) takes a different approach to mixing dense and MoE layers:
- 48 Transformer layers total
- Alternating dense and MoE layers (interleave_moe_layer_step = 2): layers 0, 2, 4, … are dense; layers 1, 3, 5, … are MoE
- 128 routed experts + 1 shared expert per MoE layer
- Top-1 routing (dropless token-choice routing)
- intermediate_size = 8,192 per expert
- Dense layer intermediate_size_mlp = 16,384
- hidden_size = 5,120
- Total: 400B parameters, 17B active per token
The alternating pattern is interesting. By interleaving dense and MoE layers, Maverick ensures that every token passes through a full dense FFN every other layer. The dense layers provide a “baseline” of processing that all tokens share, while the MoE layers provide specialized processing that varies by token. This is a different philosophy from DeepSeek-V3, which uses MoE for nearly all layers (58 out of 61).
Maverick’s top-1 routing is also notable. Each token is routed to only 1 of the 128 routed experts (plus the shared expert). This is the most aggressive sparsity among major MoE models: only 1/128 = 0.78% of the routed experts are active per token. Despite this extreme sparsity, Maverick achieves competitive performance, suggesting that the combination of a shared expert (providing common knowledge) and a single specialized expert (providing token-specific knowledge) is sufficient.
Source: LLaMA 4 Maverick from Meta (April 5, 2025). HuggingFace Transformers Llama4TextConfig. PyTorch blog: “Scout / Maverick has a shared expert and 16 / 128 routed experts with dropless token-choice routing and Top-1 selection for each MoE layer.”
LLaMA 4 Scout: Compact MoE
LLaMA 4 Scout (Meta, April 5, 2025) is the smaller sibling of Maverick:
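- 48 Transformer layers, all of them MoE (interleave_moe_layer_step = 1, unlike Maverick’s alternating pattern); with the same hidden size and expert size as Maverick, 16 routed experts in only 24 layers could not add up to 109B parameters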
- 16 routed experts + 1 shared expert per MoE layer
- Top-1 routing
- Total: 109B parameters, 17B active per token
Scout demonstrates that MoE can be effective even with a modest number of experts. With only 16 routed experts (compared to Maverick’s 128), Scout has far fewer total parameters but the same active parameter count. This makes it much more practical to deploy: at INT4 quantization, Scout fits on a single NVIDIA H100 GPU (80 GB VRAM).
Source: LLaMA 4 Scout from Meta (April 5, 2025). StorageReview: “Llama 4 Scout is a compact model with 17 billion active parameters and 109 billion total parameters across 16 experts. It is optimized for efficiency and can run on a single NVIDIA H100 GPU (FP4 Quantized).”
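The single-GPU claim is easy to sanity-check with back-of-the-envelope arithmetic. The sketch below estimates weight memory only, ignoring the KV cache, activations, and any quantization metadata, so the real footprint is somewhat higher; the function name is illustrative.

```python
def weight_memory_gb(total_params_billion, bits_per_param):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return total_params_billion * 1e9 * bits_per_param / 8 / 1e9

H100_VRAM_GB = 80

for name, params_b in [("LLaMA 4 Scout", 109), ("LLaMA 4 Maverick", 400)]:
    for label, bits in [("BF16", 16), ("INT4", 4)]:
        gb = weight_memory_gb(params_b, bits)
        verdict = "fits" if gb < H100_VRAM_GB else "does not fit"
        print(f"{name:17s} {label}: {gb:6.1f} GB -> {verdict} on one H100")
```

This also shows the flip side of MoE: memory is driven by total parameters, not active ones, so Maverick needs several GPUs even though it does the same per-token compute as Scout.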
Qwen 3.5 397B-A17B: Maximum Expert Count
Qwen 3.5 (Alibaba, February 16, 2026) pushes the expert count to a new extreme:
- 60 Transformer layers
- 512 routed experts + 1 shared expert per MoE layer
- Top-10 routing
- moe_intermediate_size = 1,024 per expert
- hidden_size = 4,096
- Hybrid attention architecture (mixing linear and full attention)
- 256K native context window (extendable to 1M+)
- Total: 397B parameters, 17B active per token
With 512 experts, Qwen 3.5 has the highest expert count of any major open-weight model as of March 2026. Each individual expert is very small (intermediate_size = 1,024), but with 10 active per token, the effective FFN capacity is 10 x 1,024 = 10,240 intermediate dimensions. The model compensates for the small expert size by activating more of them (top-10 vs. Maverick’s top-1).
Qwen 3.5 also introduces a hybrid attention architecture. Instead of using standard full attention in every layer, it alternates between Gated DeltaNet layers (a linear attention variant) and full attention layers in a 3:1 ratio: three out of every four layers use linear attention, and every fourth layer uses standard full attention. The config explicitly lists all 60 layer types: the pattern is [linear_attention, linear_attention, linear_attention, full_attention] repeated 15 times. The linear attention layers use a different head configuration than the full attention layers: 16 key heads with 128 dimensions each and 64 value heads with 128 dimensions each, plus a convolution kernel of dimension 4 (linear_conv_kernel_dim=4). Linear attention scales near-linearly with sequence length (rather than quadratically), which is why Qwen 3.5 can support a 1M token context window more efficiently than models using full attention everywhere. This hybrid approach is a separate innovation from MoE, but the two work together: the MoE routing operates the same way regardless of whether the layer uses linear or full attention.
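The 3:1 layer pattern can be generated from the two config values mentioned above. This is a sketch of how such a list could be derived from full_attention_interval; the actual config lists the layer types explicitly, and the variable names here are illustrative.

```python
num_hidden_layers = 60
full_attention_interval = 4

# Every fourth layer (1-based) uses full attention; the rest use linear attention.
layer_types = [
    "full_attention" if (i + 1) % full_attention_interval == 0
    else "linear_attention"
    for i in range(num_hidden_layers)
]

print(layer_types[:8])
print("linear layers:", layer_types.count("linear_attention"))
print("full layers  :", layer_types.count("full_attention"))
```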
Another notable detail: Qwen 3.5’s config includes mtp_num_hidden_layers=1, indicating it uses multi-token prediction (MTP) during training. This is the same technique used by DeepSeek-V3, where the model is trained to predict multiple future tokens simultaneously rather than just the next one. MTP has been shown to improve model quality by encouraging the model to plan further ahead. We will cover this training technique in more detail in Chapter 14.
Source: Qwen 3.5 397B-A17B from Alibaba (February 16, 2026). HuggingFace config.json (Qwen/Qwen3.5-397B-A17B): 60 layers, hidden_size=4,096, moe_intermediate_size=1,024, shared_expert_intermediate_size=1,024, vocab_size=248,320, 32 query heads, 2 KV heads, head_dim=256, 512 routed experts + 1 shared expert, top-10 routing, full_attention_interval=4 (hybrid Gated DeltaNet linear + full attention in 3:1 ratio), linear attention config: linear_key_head_dim=128, linear_num_key_heads=16, linear_num_value_heads=64, linear_conv_kernel_dim=4, 256K native context (max_position_embeddings=262,144, extendable to 1M+), mtp_num_hidden_layers=1 (multi-token prediction, same technique used by DeepSeek-V3), router_aux_loss_coef=0.001.
Mistral Small 4: MoE Meets MLA
Mistral Small 4 (Mistral AI, March 16, 2026, announced at GTC 2026) is notable for combining two efficiency innovations in a single model: Mixture of Experts and Multi-head Latent Attention (MLA, which we covered in Chapter 8 as a DeepSeek innovation).
- 36 Transformer layers, all MoE (first_k_dense_replace = 0)
- 128 routed experts + 1 shared expert per MoE layer
- Top-4 routing
- moe_intermediate_size = 2,048 per routed expert
- intermediate_size = 12,288 (the shared expert’s FFN intermediate size, larger than the routed experts)
- hidden_size = 4,096
- MLA with kv_lora_rank = 256 and q_lora_rank = 1,024
- 32 attention heads, 32 KV heads, head_dim = 128
- 256K context window
- Total: 119B parameters, 6.5B active per token (per HuggingFace; Mistral’s official blog states 6B active, or 8B including embedding and output layers)
Mistral Small 4 follows DeepSeek-V2 and DeepSeek-V3 in combining MoE with MLA. MLA compresses the key-value cache by projecting keys and values through a low-rank bottleneck (rank 256 in this case), which dramatically reduces the memory needed for the KV cache during inference. Combined with MoE’s sparse activation, this makes Mistral Small 4 exceptionally efficient: it has the knowledge capacity of a 119B model, the per-token compute of a ~6.5B model, and the KV cache memory footprint of a much smaller model.
The MLA implementation in Mistral Small 4 splits each query/key head into two components: a 64-dimensional portion that does not use rotary position embeddings (qk_nope_head_dim=64) and a 64-dimensional portion that does (qk_rope_head_dim=64), for a total qk_head_dim of 128. This split allows the model to combine position-aware matching (through the RoPE portion) with position-independent semantic matching (through the non-RoPE portion), which is the same design pattern used in DeepSeek-V2 and V3’s MLA implementation (Chapter 8).
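A rough calculation shows how much cache MLA can save with these dimensions. The sketch below assumes the DeepSeek-style cache layout (one compressed latent of size kv_lora_rank plus one shared RoPE key of size qk_rope_head_dim per layer) and compares it with caching full keys and values for 32 heads. The function names are illustrative, and the layout is an assumption rather than something the config confirms.

```python
def kv_cache_elems_standard(num_layers, num_kv_heads, head_dim):
    """Cached values per token when K and V are stored for every KV head."""
    return num_layers * 2 * num_kv_heads * head_dim

def kv_cache_elems_mla(num_layers, kv_lora_rank, qk_rope_head_dim):
    """Cached values per token with MLA.

    Assumption: each layer caches one compressed KV latent plus one
    shared RoPE key (the layout described for DeepSeek-V2/V3).
    """
    return num_layers * (kv_lora_rank + qk_rope_head_dim)

# Dimensions quoted in this section for Mistral Small 4
num_layers, num_heads, head_dim = 36, 32, 128
kv_lora_rank, qk_rope_head_dim = 256, 64

standard = kv_cache_elems_standard(num_layers, num_heads, head_dim)
mla = kv_cache_elems_mla(num_layers, kv_lora_rank, qk_rope_head_dim)

bytes_per_elem = 2       # FP16/BF16 cache entries
context_len = 262_144    # 256K tokens

print(f"Standard cache: {standard:,} values per token")
print(f"MLA cache:      {mla:,} values per token")
print(f"Reduction:      {standard / mla:.1f}x")
print(f"At 256K tokens: {standard * bytes_per_elem * context_len / 1e9:.1f} GB "
      f"vs {mla * bytes_per_elem * context_len / 1e9:.1f} GB")
```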
An interesting architectural detail: the shared expert in Mistral Small 4 has an intermediate_size of 12,288, which is 6x larger than each routed expert’s 2,048. This means the shared expert has far more capacity than any individual routed expert, reflecting its role as the “generalist” that handles common patterns for every token. The routed experts are small and specialized, while the shared expert carries the bulk of the common knowledge.
The model is designed to fit on a single high-end GPU at reduced precision. At NVFP4 quantization (4 bits, or 0.5 bytes, per weight), the 119B parameters compress to roughly 60 GB before the small overhead of quantization scale factors, making single-GPU deployment feasible on an NVIDIA H100 (80 GB VRAM) with some room left for the KV cache. The config also shows YaRN rope scaling with factor=128 (extending from an original 8,192 base context to the full 1M max_position_embeddings), the same context extension technique used by DeepSeek-V3 (Chapter 6).
Source: Mistral Small 4 from Mistral AI (March 16, 2026, announced at GTC 2026). HuggingFace config.json (mistralai/Mistral-Small-4-119B-2603): outer model_type=‘mistral3’, text_config model_type=‘mistral4’, 36 layers, hidden_size=4,096, intermediate_size=12,288 (shared expert), moe_intermediate_size=2,048 (routed experts), n_routed_experts=128, n_shared_experts=1, num_experts_per_tok=4, num_attention_heads=32, num_key_value_heads=32, head_dim=128, kv_lora_rank=256, q_lora_rank=1,024, qk_nope_head_dim=64, qk_rope_head_dim=64, v_head_dim=128, vocab_size=131,072, max_position_embeddings=1,048,576, first_k_dense_replace=0, norm_topk_prob=true, rope_type=‘yarn’ with factor=128, Apache 2.0 license. Mistral’s official blog: “6B active parameters per token (8B including embedding and output layers).” HuggingFace model card: “6.5B activated per token.”
Dense vs. MoE: The Tradeoffs
MoE is not strictly better than dense. Each architecture has advantages and disadvantages, and the right choice depends on the use case.
Advantages of MoE
Higher knowledge capacity per FLOP: An MoE model can store far more knowledge (in its total parameters) than a dense model with the same per-token compute cost. DeepSeek-V3 uses roughly the same compute per token as a 37B dense model but has access to 671B parameters’ worth of knowledge.
Faster inference per token: Because only a fraction of parameters are active, each token requires fewer floating-point operations. This translates to lower latency and higher throughput.
Better scaling: MoE models can scale total parameters without proportionally increasing compute. This makes it economically feasible to build models with hundreds of billions or trillions of parameters.
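Lower training cost: DeepSeek reported that the final training run of DeepSeek-V3 used about 2.788 million H800 GPU-hours, or approximately $5.576 million at an assumed rental price of $2 per GPU-hour, a fraction of what comparable dense models cost. That figure excludes earlier research and ablation experiments. The sparse activation means each training step requires less compute.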
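The compute advantage can be estimated with a common rule of thumb: a forward pass costs roughly 2 FLOPs per active parameter per token. The sketch below applies that approximation to the parameter counts quoted in this chapter. It ignores attention over the context and the router itself, so treat the numbers as order-of-magnitude estimates, not measurements.

```python
def forward_flops_per_token(active_params):
    """Rough forward cost: about 2 FLOPs (multiply + add) per active parameter."""
    return 2 * active_params

# (total parameters, active parameters) as quoted in this chapter
models = {
    "LLaMA 3.1 405B (dense)": (405e9, 405e9),
    "DeepSeek-V3 (MoE)": (671e9, 37e9),
    "LLaMA 4 Maverick (MoE)": (400e9, 17e9),
}

gflops = {}
for name, (total, active) in models.items():
    gflops[name] = forward_flops_per_token(active) / 1e9
    print(f"{name:<24} total={total / 1e9:>4.0f}B  "
          f"active={active / 1e9:>4.0f}B  ~{gflops[name]:>4.0f} GFLOPs/token")
```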
Disadvantages of MoE
Higher memory requirements: Even though only a fraction of parameters are active per token, all parameters must be stored in GPU memory (or at least be quickly accessible). LLaMA 4 Maverick requires memory for all 400B parameters, not just the 17B active ones. At FP16, that is approximately 800 GB for weights alone.
Communication overhead in distributed serving: In multi-GPU deployments, different experts may reside on different GPUs. When a token is routed to an expert on a different GPU, the token’s hidden state must be sent across the network. This inter-GPU communication can become a bottleneck, especially with many experts spread across many GPUs.
Uneven GPU utilization: Different tokens are routed to different experts, which means different GPUs (hosting different experts) have different workloads. Some GPUs may be busy while others are idle, reducing overall efficiency. Load balancing helps but does not eliminate this problem.
Complexity: MoE models are more complex to implement, train, and serve than dense models. The routing mechanism, load balancing, and expert parallelism add engineering challenges.
Expert underutilization: With top-1 routing and 128 experts, each expert is only used for about 1/128 of tokens. Some experts may specialize in rare topics and be activated infrequently, meaning their parameters contribute little to overall model quality despite consuming memory.
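The memory cost is simple arithmetic: total parameters times bytes per parameter. The sketch below computes weight memory at three precisions and the smallest number of 80 GB GPUs that could hold the weights alone. The helper names are illustrative, and the calculation deliberately ignores the KV cache, activations, and quantization scale factors.

```python
import math

BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "4-bit": 0.5}

def weight_memory_gb(total_params, precision):
    """Memory for the weights only, in GB (1e9 bytes)."""
    return total_params * BYTES_PER_PARAM[precision] / 1e9

def min_gpus(total_params, precision, gpu_gb=80):
    """Smallest GPU count whose combined memory holds the weights alone."""
    return math.ceil(weight_memory_gb(total_params, precision) / gpu_gb)

for name, total in [("LLaMA 4 Maverick", 400e9), ("Mistral Small 4", 119e9)]:
    for precision in BYTES_PER_PARAM:
        gb = weight_memory_gb(total, precision)
        gpus = min_gpus(total, precision)
        print(f"{name:<18} {precision:<6} {gb:>6.1f} GB  -> at least {gpus} x 80 GB GPU")
```

Real deployments need headroom beyond these minimums, since the KV cache and activations share the same memory.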
When to Choose Dense vs. MoE
| Scenario | Better Choice | Why |
|---|---|---|
| Maximum quality, unlimited budget | MoE | More total parameters = more knowledge |
| Single-GPU deployment | Dense (small) | MoE models need memory for all params |
| Low-latency serving | MoE | Fewer active params = faster per token |
| Simple deployment | Dense | No routing complexity |
| Training on limited budget | MoE | Lower compute per training step |
| Edge/mobile deployment | Dense (tiny) | MoE overhead not worth it at small scale |
In practice, the industry has largely converged on MoE for frontier models (the biggest, most capable models) and dense for smaller models (7B-14B range). The 7B-14B dense models remain popular because they are simple to deploy and run efficiently on a single GPU, while frontier MoE models offer the best quality for API-based serving where the infrastructure complexity is handled by the provider.
Hands-On: Implementing a Simple MoE Layer
Let’s implement a Mixture of Experts layer from scratch to make the concept completely concrete. This implementation includes the router, multiple expert FFNs, top-K selection, and the weighted combination of expert outputs.
import numpy as np

def swish(x):
    """Swish/SiLU activation function."""
    return x * (1 / (1 + np.exp(-np.clip(x, -500, 500))))

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def swiglu_expert(x, W_gate, W_up, W_down):
    """A single SwiGLU expert FFN (Chapter 9)."""
    return (swish(x @ W_gate) * (x @ W_up)) @ W_down

def moe_layer(x, experts, W_router, top_k=2, shared_expert=None):
    """Mixture of Experts layer.
    x: input, shape [seq_len, hidden_size]
    experts: list of dicts, each with W_gate, W_up, W_down
    W_router: router weight matrix, shape [hidden_size, num_experts]
    top_k: number of experts to select per token
    shared_expert: optional dict with W_gate, W_up, W_down (always active)
    """
    seq_len, hidden_size = x.shape
    num_experts = len(experts)
    output = np.zeros_like(x)
    for t in range(seq_len):
        token = x[t]  # shape [hidden_size]
        # Step 1: Compute router scores
        scores = token @ W_router  # shape [num_experts]
        # Step 2: Select top-K experts
        top_indices = np.argsort(scores)[-top_k:][::-1]
        # Step 3: Compute gating weights (softmax over selected experts)
        top_scores = scores[top_indices]
        weights = softmax(top_scores)
        # Step 4: Run selected experts and combine
        combined = np.zeros(hidden_size)
        for i, idx in enumerate(top_indices):
            e = experts[idx]
            expert_out = swiglu_expert(token, e['W_gate'], e['W_up'], e['W_down'])
            combined += weights[i] * expert_out
        # Step 5: Add shared expert if present
        if shared_expert is not None:
            shared_out = swiglu_expert(
                token, shared_expert['W_gate'],
                shared_expert['W_up'], shared_expert['W_down']
            )
            combined += shared_out
        output[t] = combined
    return output

# Build a small MoE layer
np.random.seed(42)
hidden_size = 64
num_experts = 8
top_k = 2
expert_intermediate = 128  # each expert's FFN intermediate size
# Initialize expert weights
scale = (2 / hidden_size) ** 0.5
experts = []
for i in range(num_experts):
    experts.append({
        'W_gate': np.random.randn(hidden_size, expert_intermediate) * scale,
        'W_up': np.random.randn(hidden_size, expert_intermediate) * scale,
        'W_down': np.random.randn(expert_intermediate, hidden_size) * (2 / expert_intermediate) ** 0.5,
    })
# Initialize router
W_router = np.random.randn(hidden_size, num_experts) * 0.01
# Initialize shared expert
shared = {
    'W_gate': np.random.randn(hidden_size, expert_intermediate) * scale,
    'W_up': np.random.randn(hidden_size, expert_intermediate) * scale,
    'W_down': np.random.randn(expert_intermediate, hidden_size) * (2 / expert_intermediate) ** 0.5,
}
# Create input
seq_len = 6
x = np.random.randn(seq_len, hidden_size) * 0.5
# Run MoE layer
output = moe_layer(x, experts, W_router, top_k=top_k, shared_expert=shared)
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Shapes match: {x.shape == output.shape}")
# Show which experts were selected for each token
print(f"\nRouting decisions (top-{top_k} of {num_experts} experts):")
for t in range(seq_len):
    scores = x[t] @ W_router
    top_indices = np.argsort(scores)[-top_k:][::-1]
    top_scores = scores[top_indices]
    weights = softmax(top_scores)
    print(f" Token {t}: experts {top_indices} with weights [{weights[0]:.3f}, {weights[1]:.3f}]")
# Count parameters
router_params = hidden_size * num_experts
expert_params = num_experts * 3 * hidden_size * expert_intermediate
shared_params = 3 * hidden_size * expert_intermediate
total_params = router_params + expert_params + shared_params
active_params = router_params + top_k * 3 * hidden_size * expert_intermediate + shared_params
print(f"\nParameter counts:")
print(f" Router: {router_params:>10,}")
print(f" {num_experts} routed experts: {expert_params:>10,}")
print(f" 1 shared expert: {shared_params:>10,}")
print(f" Total: {total_params:>10,}")
print(f" Active per token: {active_params:>10,}")
print(f" Ratio: {total_params/active_params:.1f}x")
When you run this code, you will see that different tokens are routed to different experts. The routing decisions depend on the token’s hidden state vector: tokens with similar representations tend to be routed to the same experts, while tokens with different representations are routed to different experts. This is the mechanism by which MoE models specialize: over the course of training, each expert learns to handle the types of tokens that are routed to it.
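The per-token loop is easy to read but runs each expert on one token at a time. A common alternative is to group tokens by expert so that each expert processes all of its tokens in one matrix multiplication. The sketch below is one way to do that in NumPy, with a small per-token reference included to check that both versions agree. The function names are illustrative, the shared expert is left out for brevity, and this is not presented as how any particular framework implements dispatch.

```python
import numpy as np

def silu(x):
    """SiLU/Swish activation."""
    return x / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def expert_ffn(x, e):
    """SwiGLU expert applied to one token or a batch of tokens."""
    return (silu(x @ e["W_gate"]) * (x @ e["W_up"])) @ e["W_down"]

def moe_per_token(x, experts, W_router, top_k):
    """Reference version: loop over tokens, then over their selected experts."""
    out = np.zeros_like(x)
    for t, token in enumerate(x):
        scores = token @ W_router
        idx = np.argsort(scores)[-top_k:][::-1]
        w = softmax(scores[idx])
        for wi, i in zip(w, idx):
            out[t] += wi * expert_ffn(token, experts[i])
    return out

def moe_grouped(x, experts, W_router, top_k):
    """Grouped version: each expert runs once on all tokens routed to it."""
    scores = x @ W_router                                       # [seq_len, num_experts]
    top_idx = np.argsort(scores, axis=-1)[:, -top_k:][:, ::-1]  # [seq_len, top_k]
    weights = softmax(np.take_along_axis(scores, top_idx, axis=-1))
    out = np.zeros_like(x)
    for e_id, e in enumerate(experts):
        # Rows = tokens that picked this expert, slots = which top-k position
        token_ids, slots = np.nonzero(top_idx == e_id)
        if token_ids.size == 0:
            continue  # no token was routed to this expert
        gate = weights[token_ids, slots][:, None]
        out[token_ids] += gate * expert_ffn(x[token_ids], e)
    return out

rng = np.random.default_rng(0)
hidden, inter, num_experts, top_k, seq_len = 32, 64, 8, 2, 16
experts = [{
    "W_gate": rng.normal(size=(hidden, inter)) * 0.1,
    "W_up": rng.normal(size=(hidden, inter)) * 0.1,
    "W_down": rng.normal(size=(inter, hidden)) * 0.1,
} for _ in range(num_experts)]
W_router = rng.normal(size=(hidden, num_experts)) * 0.1
x = rng.normal(size=(seq_len, hidden))

looped = moe_per_token(x, experts, W_router, top_k)
grouped = moe_grouped(x, experts, W_router, top_k)
print("Max abs difference:", np.max(np.abs(looped - grouped)))
print("Outputs match:", np.allclose(looped, grouped))
```

Grouping by expert is also the natural shape for expert parallelism: the tokens gathered for one expert are exactly what would be sent to the GPU hosting it.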
Hands-On: Comparing Dense vs. MoE Parameter Efficiency
Let’s build a direct comparison between a dense FFN and several MoE layers, counting total and active parameters for each, to see how MoE achieves higher total capacity:
def count_dense_ffn(hidden, intermediate):
    """Count parameters in a dense SwiGLU FFN."""
    return 3 * hidden * intermediate

def count_moe_layer(hidden, expert_intermediate, num_experts, top_k,
                    num_shared=0):
    """Count total and active parameters in an MoE layer."""
    router = hidden * num_experts
    routed = num_experts * 3 * hidden * expert_intermediate
    shared = num_shared * 3 * hidden * expert_intermediate
    total = router + routed + shared
    active = router + top_k * 3 * hidden * expert_intermediate + shared
    return total, active

# Compare total vs. active parameters for a dense FFN and three MoE layers
hidden = 5120  # LLaMA 4 Maverick's hidden_size
print("Comparison: Dense FFN vs. MoE (total vs. active params per layer)")
print("=" * 70)
# Dense: uses LLaMA 3.1 405B's intermediate size (53,248) for illustration
dense_inter = 53248
dense_params = count_dense_ffn(hidden, dense_inter)
print(f"\nDense FFN (intermediate={dense_inter:,}):")
print(f" Total params: {dense_params:>14,} ({dense_params/1e6:.1f}M)")
print(f" Active params: {dense_params:>14,} ({dense_params/1e6:.1f}M)")
print(f" Ratio: 1.0x")
# MoE: Maverick style (128 experts, top-1, + 1 shared)
moe_inter = 8192
total, active = count_moe_layer(hidden, moe_inter, 128, 1, num_shared=1)
print(f"\nMoE (128 experts x {moe_inter:,}, top-1, +1 shared):")
print(f" Total params: {total:>14,} ({total/1e6:.1f}M)")
print(f" Active params: {active:>14,} ({active/1e6:.1f}M)")
print(f" Ratio: {total/active:.1f}x")
# MoE: DeepSeek-V3 style (256 experts, top-8, + 1 shared)
ds_hidden = 7168
ds_moe_inter = 2048
total_ds, active_ds = count_moe_layer(ds_hidden, ds_moe_inter, 256, 8, num_shared=1)
print(f"\nMoE DeepSeek-V3 style (256 experts x {ds_moe_inter:,}, top-8, +1 shared):")
print(f" Total params: {total_ds:>14,} ({total_ds/1e6:.1f}M)")
print(f" Active params: {active_ds:>14,} ({active_ds/1e6:.1f}M)")
print(f" Ratio: {total_ds/active_ds:.1f}x")
# MoE: Qwen 3.5 style (512 experts, top-10, + 1 shared)
q_hidden = 4096
q_moe_inter = 1024
total_q, active_q = count_moe_layer(q_hidden, q_moe_inter, 512, 10, num_shared=1)
print(f"\nMoE Qwen 3.5 style (512 experts x {q_moe_inter:,}, top-10, +1 shared):")
print(f" Total params: {total_q:>14,} ({total_q/1e6:.1f}M)")
print(f" Active params: {active_q:>14,} ({active_q/1e6:.1f}M)")
print(f" Ratio: {total_q/active_q:.1f}x")
# Report the range actually computed above instead of a hardcoded figure
ratios = [total / active, total_ds / active_ds, total_q / active_q]
print("\n" + "=" * 70)
print(f"Key insight: these MoE layers hold {min(ratios):.0f}-{max(ratios):.0f}x more total parameters")
print("than active parameters, giving them massive knowledge capacity")
print("while keeping per-token compute comparable to smaller dense models.")
This comparison reveals the core value proposition of MoE: by having many experts but activating only a few, you can build models with enormous total parameter counts (and thus knowledge capacity) while keeping the per-token computational cost manageable. Note that these are per-layer FFN ratios; whole-model ratios are lower because attention and embedding parameters are always active.
What Do Experts Specialize In?
A natural question: do the experts actually specialize in different topics or tasks? The answer is nuanced.
Evidence for Specialization
The routing analysis in the Mixtral 8x7B paper shows that routing is not random, but it also found no obvious specialization by topic: the distribution of expert assignments looked very similar for arXiv papers, biology abstracts, and philosophy texts, with only a marginally different distribution for mathematics. What the analysis did find is structured, largely syntactic behavior, such as the token “self” in Python code and indentation tokens often being routed to the same expert.
In other words, the specialization is not as clean as “expert 1 handles Python, expert 2 handles history.” It is more subtle and operates at the token level, not the topic level. Within a single sentence about Python programming, different tokens may be routed to different experts. The word “def” might consistently go to one expert, while the variable name might go to another.
The Shared Expert’s Role
In models with a shared expert (DeepSeek-V3, LLaMA 4), the shared expert handles “common knowledge” that is useful for all tokens regardless of their content. This includes basic language patterns, common syntactic structures, and frequently used vocabulary. The routed experts then provide specialized knowledge on top of this common foundation.
This division of labor is analogous to how a team might work: one generalist handles the routine tasks that come up for every project, while specialists are called in for specific technical challenges. The shared expert is the generalist; the routed experts are the specialists.
Practical Implications
The specialization of experts has practical implications for model deployment. If you know that your use case primarily involves code generation, you could potentially prune or compress the experts that specialize in other domains (like creative writing or medical text) without significantly affecting code generation quality. This is an active area of research as of March 2026, though production deployments typically keep all experts intact.
The MoE Transformer Block
Let’s put MoE into the context of the full Transformer block from Chapter 10. In a dense Transformer, the block is:
Input: x
Step 1: h = x + Attention(RMSNorm(x)) # attention sub-block
Step 2: output = h + FFN(RMSNorm(h)) # FFN sub-block
Output: output
In an MoE Transformer, the only change is in Step 2:
Input: x
Step 1: h = x + Attention(RMSNorm(x)) # attention sub-block (unchanged)
Step 2: output = h + MoE(RMSNorm(h)) # MoE replaces dense FFN
Output: output
The attention mechanism is identical in dense and MoE models. The normalization and residual connections are identical. The only difference is that the dense FFN is replaced by the MoE layer (router + experts). This modularity is one reason MoE has been so successful: it is a targeted change to one component of the Transformer, not a wholesale redesign.
For models with alternating dense and MoE layers (like LLaMA 4 Maverick), even-numbered layers use the dense FFN and odd-numbered layers use the MoE layer. The attention mechanism is the same in both types of layers.
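Which layers get an MoE FFN can be expressed as a small schedule function. The sketch below uses a hypothetical helper, layer_uses_moe, with two illustrative knobs: first_k_dense (keep the first k layers dense) and moe_every (make every n-th layer MoE). The names and the 0-indexed convention are assumptions for this example, not any model's actual config semantics.

```python
def layer_uses_moe(layer_idx, first_k_dense=0, moe_every=1):
    """Decide whether a 0-indexed layer uses an MoE FFN (hypothetical helper).

    first_k_dense: the first k layers always keep a dense FFN
    moe_every:     1 = every layer is MoE, 2 = alternate dense/MoE, ...
    """
    if layer_idx < first_k_dense:
        return False
    return (layer_idx + 1) % moe_every == 0

def describe(num_layers, **kwargs):
    """Render a schedule as a string such as 'D M D M'."""
    return " ".join("M" if layer_uses_moe(i, **kwargs) else "D"
                    for i in range(num_layers))

all_moe = describe(8)                      # every layer MoE
alternating = describe(8, moe_every=2)     # dense on even, MoE on odd layers
dense_prefix = describe(8, first_k_dense=3)

print("All MoE:        ", all_moe)
print("Alternating:    ", alternating)
print("First 3 dense:  ", dense_prefix)
```

In a block implementation, this flag would simply select between FFN(RMSNorm(h)) and MoE(RMSNorm(h)) in Step 2; the attention sub-block is untouched.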
Hands-On: Complete MoE Transformer Block
Let’s implement a complete Transformer block with MoE, combining everything from Chapters 8-12:
from collections import Counter

import numpy as np

def rms_norm(x, gamma, eps=1e-5):
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * (x / rms)

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def swish(x):
    return x * (1 / (1 + np.exp(-np.clip(x, -500, 500))))

def attention(Q, K, V, mask):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = np.where(mask, -1e9, scores)
    return softmax(scores) @ V

def moe_ffn(x, experts, W_router, top_k, shared_expert=None):
    """MoE FFN: route each token to top-K experts."""
    seq_len, hidden = x.shape
    output = np.zeros_like(x)
    routing_log = []
    for t in range(seq_len):
        token = x[t]
        scores = token @ W_router
        top_idx = np.argsort(scores)[-top_k:][::-1]
        weights = softmax(scores[top_idx])
        combined = np.zeros(hidden)
        for i, idx in enumerate(top_idx):
            e = experts[idx]
            out = (swish(token @ e['Wg']) * (token @ e['Wu'])) @ e['Wd']
            combined += weights[i] * out
        if shared_expert is not None:
            s = shared_expert
            combined += (swish(token @ s['Wg']) * (token @ s['Wu'])) @ s['Wd']
        output[t] = combined
        routing_log.append(top_idx.tolist())
    return output, routing_log

def moe_transformer_block(x, params, mask):
    """Complete Transformer block with MoE FFN."""
    seq_len, hidden = x.shape
    n_q, n_kv, hd = params['n_q'], params['n_kv'], params['hd']
    gs = n_q // n_kv  # query heads per KV head (grouped-query attention)
    # Norm -> Attention -> Residual
    xn = rms_norm(x, params['g1'])
    Q = (xn @ params['WQ']).reshape(seq_len, n_q, hd)
    K = (xn @ params['WK']).reshape(seq_len, n_kv, hd)
    V = (xn @ params['WV']).reshape(seq_len, n_kv, hd)
    heads = [attention(Q[:, q], K[:, q // gs], V[:, q // gs], mask)
             for q in range(n_q)]
    h = x + np.concatenate(heads, axis=-1) @ params['WO']
    # Norm -> MoE FFN -> Residual
    hn = rms_norm(h, params['g2'])
    moe_out, routing = moe_ffn(
        hn, params['experts'], params['W_router'],
        params['top_k'], params.get('shared_expert')
    )
    output = h + moe_out
    return output, routing

# Build a small MoE Transformer block
np.random.seed(42)
hidden = 64
n_q, n_kv, hd = 4, 2, 16
num_experts = 8
top_k = 2
expert_inter = 96
seq_len = 8
s = (2 / hidden) ** 0.5
se = (2 / expert_inter) ** 0.5
# Expert weights
experts = []
for _ in range(num_experts):
    experts.append({
        'Wg': np.random.randn(hidden, expert_inter) * s,
        'Wu': np.random.randn(hidden, expert_inter) * s,
        'Wd': np.random.randn(expert_inter, hidden) * se,
    })
shared = {
    'Wg': np.random.randn(hidden, expert_inter) * s,
    'Wu': np.random.randn(hidden, expert_inter) * s,
    'Wd': np.random.randn(expert_inter, hidden) * se,
}
params = {
    'g1': np.ones(hidden), 'g2': np.ones(hidden),
    'WQ': np.random.randn(hidden, n_q * hd) * s,
    'WK': np.random.randn(hidden, n_kv * hd) * s,
    'WV': np.random.randn(hidden, n_kv * hd) * s,
    'WO': np.random.randn(n_q * hd, hidden) * s,
    'experts': experts,
    'shared_expert': shared,
    'W_router': np.random.randn(hidden, num_experts) * 0.01,
    'top_k': top_k,
    'n_q': n_q, 'n_kv': n_kv, 'hd': hd,
}
# Run
x = np.random.randn(seq_len, hidden) * 0.5
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
output, routing = moe_transformer_block(x, params, mask)
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"\nRouting decisions (top-{top_k} of {num_experts} experts per token):")
for t, r in enumerate(routing):
    print(f" Token {t}: experts {r}")
# Count expert usage
all_experts = [e for r in routing for e in r]
usage = Counter(all_experts)
print(f"\nExpert usage across {seq_len} tokens:")
for e in range(num_experts):
    count = usage.get(e, 0)
    bar = "#" * count
    print(f" Expert {e}: {count} tokens {bar}")
# Parameter comparison
attn_params = (hidden * n_q * hd + 2 * hidden * n_kv * hd + n_q * hd * hidden)
router_params = hidden * num_experts
expert_total = num_experts * 3 * hidden * expert_inter
shared_params = 3 * hidden * expert_inter
norm_params = 2 * hidden
total = attn_params + router_params + expert_total + shared_params + norm_params
active = attn_params + router_params + top_k * 3 * hidden * expert_inter + shared_params + norm_params
print(f"\nBlock parameters:")
print(f" Total: {total:>10,} (all experts)")
print(f" Active: {active:>10,} (top-{top_k} experts + shared)")
print(f" Ratio: {total/active:.1f}x")
This code implements a complete MoE Transformer block and shows the routing decisions for each token. You can see which experts are selected, how the expert usage is distributed, and the parameter efficiency ratio between total and active parameters.
Visualizing MoE Routing Patterns
Let’s create a visualization that shows how different tokens are routed to different experts across multiple layers:
import numpy as np
import matplotlib.pyplot as plt

def simulate_moe_routing(seq_len, num_experts, top_k, num_layers, hidden=64):
    """Simulate MoE routing across multiple layers."""
    np.random.seed(42)
    x = np.random.randn(seq_len, hidden) * 0.5
    all_routing = []
    for layer in range(num_layers):
        W_router = np.random.randn(hidden, num_experts) * 0.1
        layer_routing = []
        for t in range(seq_len):
            scores = x[t] @ W_router
            top_idx = np.argsort(scores)[-top_k:][::-1]
            layer_routing.append(top_idx)
        all_routing.append(layer_routing)
        # Simulate layer transformation
        W = np.random.randn(hidden, hidden) * (2 / hidden) ** 0.5
        x = x + np.tanh(x @ W) * 0.3
    return all_routing

# Simulate routing for 10 tokens across 6 MoE layers with 16 experts
seq_len = 10
num_experts = 16
top_k = 2
num_layers = 6
routing = simulate_moe_routing(seq_len, num_experts, top_k, num_layers)
# Create routing heatmap
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
# Plot 1: Routing decisions as a heatmap
routing_matrix = np.zeros((num_layers, seq_len, num_experts))
for layer in range(num_layers):
    for t in range(seq_len):
        for e in routing[layer][t]:
            routing_matrix[layer, t, e] = 1
# Flatten to (layers * tokens) x experts for visualization
flat = routing_matrix.reshape(num_layers * seq_len, num_experts)
im = axes[0].imshow(flat.T, aspect='auto', cmap='Blues', interpolation='nearest')
axes[0].set_xlabel('Token (across layers)')
axes[0].set_ylabel('Expert ID')
axes[0].set_title(f'Expert Activation Pattern\n({num_layers} layers, {seq_len} tokens, top-{top_k} of {num_experts})')
# Add layer boundaries
for i in range(1, num_layers):
    axes[0].axvline(x=i * seq_len - 0.5, color='red', linewidth=0.5, alpha=0.5)
# Label layers
for i in range(num_layers):
    axes[0].text(i * seq_len + seq_len / 2, -1.5, f'L{i}',
                 ha='center', fontsize=8, color='red')
# Plot 2: Expert load distribution
expert_loads = np.zeros(num_experts)
for layer in range(num_layers):
    for t in range(seq_len):
        for e in routing[layer][t]:
            expert_loads[e] += 1
ideal_load = num_layers * seq_len * top_k / num_experts
colors = ['#e74c3c' if load > ideal_load * 1.5 else
          '#f39c12' if load > ideal_load * 1.2 else
          '#3498db' for load in expert_loads]
axes[1].bar(range(num_experts), expert_loads, color=colors, alpha=0.8)
axes[1].axhline(y=ideal_load, color='green', linestyle='--',
                label=f'Ideal load ({ideal_load:.1f})')
axes[1].set_xlabel('Expert ID')
axes[1].set_ylabel('Number of token assignments')
axes[1].set_title('Expert Load Distribution\n(Red = overloaded, Blue = balanced)')
axes[1].legend()
axes[1].set_xticks(range(num_experts))
plt.tight_layout()
plt.savefig('moe_routing.png', dpi=150, bbox_inches='tight')
plt.show()
print("Plot saved to moe_routing.png")
The left plot shows which experts are activated (blue cells) for each token at each layer. You can see that different tokens activate different experts, and the routing patterns change across layers. The right plot shows the total load on each expert across all layers and tokens. In a well-balanced system, all experts should have roughly equal load (the green dashed line). Experts that are significantly above this line are overloaded, which is exactly the problem that load balancing mechanisms (auxiliary loss or DeepSeek-V3’s bias-based approach) aim to prevent.
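To show what an auxiliary loss measures, here is a minimal sketch of a Switch-Transformer-style load-balancing term, generalized to top-k routing by counting routing slots. For each expert it multiplies the fraction of slots it receives (f_i) by its mean router probability (P_i), sums over experts, and scales by the number of experts, so that perfectly uniform routing gives exactly the coefficient. The default coefficient reuses the router_aux_loss_coef value of 0.001 quoted earlier for Qwen 3.5; the function name and the test data are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def load_balancing_loss(router_logits, top_k, coef=0.001):
    """Auxiliary loss = coef * num_experts * sum_i f_i * P_i.

    f_i: fraction of routing slots assigned to expert i
    P_i: mean router probability given to expert i
    """
    num_tokens, num_experts = router_logits.shape
    probs = softmax(router_logits, axis=-1)
    top_idx = np.argsort(router_logits, axis=-1)[:, -top_k:]
    counts = np.bincount(top_idx.ravel(), minlength=num_experts)
    f = counts / (num_tokens * top_k)
    P = probs.mean(axis=0)
    return coef * num_experts * np.sum(f * P)

rng = np.random.default_rng(0)
balanced_logits = rng.normal(size=(512, 16)) * 0.1
skewed_logits = balanced_logits.copy()
skewed_logits[:, 0] += 3.0  # expert 0 now wins a slot for almost every token

loss_balanced = load_balancing_loss(balanced_logits, top_k=2)
loss_skewed = load_balancing_loss(skewed_logits, top_k=2)
print(f"Balanced routing loss: {loss_balanced:.5f}")
print(f"Skewed routing loss:   {loss_skewed:.5f}")
```

During training this term would be added to the language modeling loss, so gradients push the router toward spreading tokens more evenly.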
The Memory vs. Compute Tradeoff in Numbers
Let’s quantify the memory and compute characteristics of real MoE models compared to dense models:
models = [
    # (name, total_B, active_B, type)
    ("LLaMA 3 8B", 8.0, 8.0, "Dense"),
    ("LLaMA 3 70B", 70.6, 70.6, "Dense"),
    ("LLaMA 3.1 405B", 405.0, 405.0, "Dense"),
    ("Mixtral 8x7B", 46.7, 12.9, "MoE"),
    ("LLaMA 4 Scout", 109.0, 17.0, "MoE"),
    ("Qwen 3 235B", 235.0, 22.0, "MoE"),
    ("LLaMA 4 Maverick", 400.0, 17.0, "MoE"),
    ("Qwen 3.5 397B", 397.0, 17.0, "MoE"),
    ("DeepSeek-V3", 671.0, 37.0, "MoE"),
]
print(f"{'Model':<20} {'Total':>7} {'Active':>7} {'Memory':>9} {'Compute':>9} {'Ratio':>7}")
print(f"{'':20} {'(B)':>7} {'(B)':>7} {'FP16 GB':>9} {'(rel.)':>9} {'T/A':>7}")
print("-" * 65)
# Use LLaMA 3 8B as compute baseline
baseline_active = 8.0
for name, total, active, arch in models:
    mem_gb = total * 2  # FP16: 2 bytes per param
    compute_rel = active / baseline_active
    ratio = total / active
    marker = "*" if arch == "MoE" else " "
    print(f"{name:<20} {total:>6.1f}B {active:>6.1f}B {mem_gb:>9.1f} {compute_rel:>8.1f}x {ratio:>6.1f}x {marker}")
print(f"\n* = MoE model")
print(f"\nKey observations:")
print(f" LLaMA 4 Maverick: needs {400*2:.0f} GB memory but only {17/8:.1f}x the compute of LLaMA 3 8B")
print(f" DeepSeek-V3: needs {671*2:.0f} GB memory but only {37/8:.1f}x the compute of LLaMA 3 8B")
print(f" LLaMA 3.1 405B: needs {405*2:.0f} GB memory AND {405/8:.1f}x the compute of LLaMA 3 8B")
print(f"\n MoE decouples memory from compute. Dense models cannot do this.")
This table makes the MoE advantage clear. LLaMA 4 Maverick requires 800 GB of memory (for all 400B parameters at FP16), but its per-token compute is only about 2.1x that of LLaMA 3 8B (because only 17B parameters are active). In contrast, LLaMA 3.1 405B requires 810 GB of memory AND 50.6x the compute of LLaMA 3 8B. The dense model pays the full cost in both memory and compute; the MoE model pays the full cost in memory but a fraction of the cost in compute.
This is why MoE has become dominant for frontier models: it allows labs to build models with enormous knowledge capacity (hundreds of billions of total parameters) while keeping inference costs manageable (tens of billions of active parameters). The tradeoff is that you need enough GPU memory to hold all the parameters, even though most of them are idle for any given token.
The Convergence on 400B/17B
One of the most striking patterns in the model landscape as of March 2026 is that multiple labs have independently converged on a similar design point:
| Model | Total Params | Active Params | Lab |
|---|---|---|---|
| LLaMA 4 Maverick | 400B | 17B | Meta |
| Qwen 3.5 397B-A17B | 397B | 17B | Alibaba |
Both models have approximately 400B total parameters and 17B active parameters, despite being developed independently by different companies on different continents. This convergence suggests that the ~400B/17B ratio represents a practical optimum for current hardware and training methods.
The reasoning is likely:
- 17B active parameters provide enough per-token compute for frontier-level quality. This is roughly 2x the active compute of a 7B-8B model, which has proven to be a sweet spot for quality.
- 400B total parameters provide enough knowledge capacity to compete with the largest dense models, while remaining deployable on a reasonable number of GPUs (5-8 H100s at FP8 precision).
- The ~23x ratio of total to active parameters means only about 4% of the weights are used for any given token. This is not the same as per-expert usage, which depends on the expert count and top-K: with balanced routing, each routed expert sees about 1/128 of tokens in Maverick (128 experts, top-1) and about 1/51 in Qwen 3.5 (512 experts, top-10). Both designs keep individual experts specialized without making them too rarely used.
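The total-to-active ratio and the share of tokens an individual expert sees are different quantities, and a few lines of arithmetic keep them apart. The sketch below uses the parameter counts and routing settings quoted in this chapter and assumes perfectly balanced routing for the per-expert share.

```python
models = [
    # (name, total_B, active_B, num_routed_experts, top_k)
    ("LLaMA 4 Maverick", 400, 17, 128, 1),
    ("Qwen 3.5 397B-A17B", 397, 17, 512, 10),
]

summary = {}
for name, total, active, num_experts, top_k in models:
    summary[name] = {
        "ratio": total / active,               # total-to-active parameter ratio
        "active_fraction": active / total,     # share of weights used per token
        "expert_share": top_k / num_experts,   # tokens per expert, if balanced
    }
    s = summary[name]
    print(f"{name:<20} ratio={s['ratio']:.1f}x  "
          f"active={s['active_fraction']:.1%} of weights  "
          f"each expert sees ~1/{1 / s['expert_share']:.0f} of tokens")
```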
This convergence is reminiscent of how the industry converged on the 7B-8B size class for small dense models: once the hardware constraints and quality requirements are understood, the optimal design point becomes clear, and multiple labs arrive at it independently.
A second convergence point is emerging at the smaller end: Mistral Small 4 (119B total, 6.5B active, March 2026) and LLaMA 4 Scout (109B total, 17B active, April 2025) both target the ~100-120B total parameter range, designed to fit on a single high-end GPU at reduced precision. Mistral Small 4 pushes this further by combining MoE with Multi-head Latent Attention (MLA), achieving both sparse activation and compressed KV cache in a single architecture. This suggests the industry is settling on two MoE tiers: a “compact” tier around 100-120B total for single-GPU deployment, and a “frontier” tier around 400B total for multi-GPU serving.
Key Takeaways
Mixture of Experts (MoE) replaces the single dense FFN in each Transformer layer with multiple smaller FFN “experts” and a router that selects which experts to use for each token. The attention mechanism, normalization, and residual connections are unchanged. MoE is a targeted modification to the FFN component only.
The router is a simple linear layer (shape [hidden_size x num_experts]) that computes a score for each expert. The top-K experts with the highest scores are selected, and their outputs are combined as a weighted sum. The router adds negligible parameters to the model.
MoE enables models with massive total parameter counts but modest active parameter counts. DeepSeek-V3 has 671B total parameters but only 37B active per token. LLaMA 4 Maverick has 400B total but only 17B active. This means MoE models have the knowledge capacity of a very large model but the per-token compute cost of a much smaller one.
The MoE concept dates back to Jacobs, Jordan, Nowlan, and Hinton (1991). It was scaled to large neural language models by Shazeer et al. (2017) with the Sparsely-Gated MoE layer, simplified by the Switch Transformer (Fedus et al., 2022) with top-1 routing, and brought to mainstream LLMs by Mixtral 8x7B (December 2023). By March 2026, MoE is the dominant architecture for frontier open-weight models.
Load balancing prevents expert collapse, where a few experts monopolize all tokens while others go unused. The traditional approach uses an auxiliary loss that penalizes uneven routing. DeepSeek-V3 pioneered an auxiliary-loss-free approach using dynamic per-expert bias terms, which achieves load balance without interfering with the language modeling gradient.
Real MoE architectures vary significantly in their design choices. Mixtral uses 8 experts with top-2 routing. DeepSeek-V3 uses 256 experts with top-8 routing and sigmoid gating. LLaMA 4 Maverick uses 128 experts with top-1 routing and alternating dense/MoE layers. Mistral Small 4 uses 128 experts with top-4 routing and combines MoE with Multi-head Latent Attention (MLA) for additional memory efficiency. Qwen 3.5 uses 512 experts with top-10 routing and a hybrid Gated DeltaNet/full attention architecture. The trend is toward more experts with smaller individual sizes, and toward combining MoE with other efficiency innovations (MLA, linear attention).
MoE models require GPU memory for all parameters (total, not active), which is their primary disadvantage. LLaMA 4 Maverick needs approximately 800 GB at FP16 for weights alone, even though only 17B parameters are used per token. This is why MoE models typically require multi-GPU deployments.
Multiple labs have independently converged on the ~400B total / 17B active design point (LLaMA 4 Maverick and Qwen 3.5), suggesting this ratio represents a practical optimum for frontier MoE models. A second convergence point is emerging around ~100-120B total / 6-17B active (LLaMA 4 Scout, Mistral Small 4) for compact, single-GPU MoE deployments.
Dense models remain preferred for small deployments (7B-14B range) where simplicity and single-GPU compatibility matter. MoE dominates at the frontier, where the goal is maximum quality and the infrastructure complexity is acceptable.
What’s Next
You now understand how Mixture of Experts works: the router selects which expert FFNs to activate for each token, enabling models with hundreds of billions of total parameters while keeping per-token compute manageable. In Chapter 13, we will explore the scaling laws that govern how model performance improves with size, data, and compute, including the Chinchilla scaling laws, emergent abilities, and the data wall that is pushing the industry toward synthetic data generation.