What is a mixture of experts?

A mixture-of-experts (MoE) model replaces dense feed-forward layers with many specialized expert subnetworks plus a router that activates only a few of them per token, so the model has huge total capacity but runs only a fraction of it on any given input.

Mixture of experts (MoE) is an architecture that decouples a model's total parameter count from the compute spent per token. Instead of one dense feed-forward block, an MoE layer holds many parallel expert subnetworks plus a lightweight router that, for each token, selects a small number of experts to activate. The model therefore has a very large number of total parameters (high capacity, more knowledge stored) while only a few experts fire per token (low active compute, so inference is faster and cheaper than in a dense model of equivalent total size). Many recent frontier and open-weight models are MoE, including the gpt-oss family, because the architecture scales capability without a proportional increase in per-token compute. The tradeoffs are real: all experts must stay resident in memory even though only a few run per token, so the memory footprint tracks total parameters and serving is more complex; routing can be unbalanced, with some experts receiving far more tokens than others; and MoE models can be harder to fine-tune. For practitioners the takeaway is that an MoE model's total parameter count overstates its per-token compute cost, and the relevant numbers to compare are total parameters (capacity and memory) versus active parameters (cost per token).
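To make the routing step and the total-versus-active distinction concrete, here is a minimal sketch in plain Python. It is illustrative only and not the implementation of any particular model: the names (route_top_k, moe_layer, make_expert, param_counts), the toy experts that just scale their input, and all the sizes are hypothetical. It also assumes one common convention, in which the softmax is taken over the selected top-k router scores; real models differ on this and on how they balance load across experts.

```python
import math
import random

def softmax(xs):
    # Numerically stable softmax over a short list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    # Pick the k experts with the highest router scores, then normalize
    # only those k scores into mixing weights (an assumed convention).
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))

def moe_layer(token, router_weights, experts, k):
    # Router: one logit per expert (dot product of the token with a router row).
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    chosen = route_top_k(logits, k)
    out = [0.0] * len(token)
    # Only the chosen experts are evaluated; the others cost no compute here.
    for idx, weight in chosen:
        expert_out = experts[idx](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out, [idx for idx, _ in chosen]

def make_expert(scale):
    # Hypothetical stand-in for a feed-forward block: scales its input.
    return lambda token: [scale * x for x in token]

def param_counts(num_experts, k, params_per_expert, shared_params):
    # Total parameters set capacity and memory; active parameters set per-token compute.
    total = shared_params + num_experts * params_per_expert
    active = shared_params + k * params_per_expert
    return total, active

# Toy configuration: 8 experts, 2 active per token (made-up sizes).
random.seed(0)
d_model, num_experts, k = 4, 8, 2
router_weights = [[random.gauss(0, 1) for _ in range(d_model)]
                  for _ in range(num_experts)]
experts = [make_expert(i + 1) for i in range(num_experts)]
token = [0.5, -1.0, 0.25, 2.0]

output, used = moe_layer(token, router_weights, experts, k)
total, active = param_counts(num_experts=8, k=2,
                             params_per_expert=100, shared_params=50)
print("experts used for this token:", used)
print("total params:", total, "active params per token:", active)
```

With these made-up sizes, total is 50 + 8 * 100 = 850 parameters while active is 50 + 2 * 100 = 250, which is the gap the comparison above is about: all 850 must be held in memory, but each token only pays compute for 250.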