What is model distillation?

Model distillation trains a smaller, cheaper student model to mimic a larger teacher model's outputs, capturing much of the teacher's capability at a fraction of the inference cost and latency.

Model distillation is a compression technique: you take a large, capable teacher model and use its outputs as training targets for a smaller student model. Those targets are often the teacher's full probability distribution over tokens rather than just its final answer, which gives the student a richer training signal than labels alone. The student learns to reproduce the teacher's behavior on a task and ends up far cheaper and faster to run while retaining a large share of the quality.

Distillation is one of the main ways production teams turn a frontier model's capability into something economical to serve at scale. It pairs naturally with quantization (reducing the numeric precision of the weights) and other efficiency methods to drive down inference cost and latency. A common modern pattern is task-specific distillation: run an expensive model to generate high-quality labeled examples for your exact workload, then distill or fine-tune a small model on them. The result is a specialist that can beat a general-purpose model on your task at a fraction of the price.

The tradeoff is generality: a distilled student excels at what it was trained on but loses headroom outside that distribution. For agent systems, distillation is therefore most useful on narrow, high-volume subtasks (routing, classification, extraction), where a small specialist can handle the bulk of traffic and escalate the hard cases to a larger model.
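To make "the full probability distribution as a training target" concrete, here is a minimal sketch of one common soft-target loss, the temperature-scaled formulation popularized by Hinton et al. It is an illustration under stated assumptions, not a prescribed recipe: the function names, the `temperature` and `alpha` defaults, and the example logits are all hypothetical, and a real training loop would compute this with a deep learning framework over batches of token positions.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-target term (match the teacher) with a hard-label term.

    temperature and alpha are illustrative defaults, not recommended values.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)

    # Soft term: KL(teacher || student) over the softened distributions.
    soft = sum(pt * math.log(pt / ps)
               for pt, ps in zip(p_teacher, p_student) if pt > 0)

    # Hard term: ordinary cross-entropy against the ground-truth class,
    # computed at temperature 1.
    hard = -math.log(softmax(student_logits)[hard_label])

    # Multiplying by temperature^2 keeps the soft term's gradient scale
    # comparable to the hard term's as the temperature changes.
    return alpha * temperature ** 2 * soft + (1 - alpha) * hard


if __name__ == "__main__":
    teacher = [4.0, 1.5, 0.2]   # hypothetical teacher scores for 3 classes
    student = [2.0, 1.0, 0.5]   # hypothetical student scores
    print(distillation_loss(student, teacher, hard_label=0))
```

The soft term is what separates this from plain supervised training: the student is penalized for disagreeing with how the teacher spreads probability across the wrong answers, as well as for missing the right one. Setting `alpha` to 0 reduces the loss to ordinary cross-entropy on labels, which is closer to the "generate labeled examples, then fine-tune" pattern described above.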