Model Distillation
Definition
Model Distillation (or Knowledge Distillation) is a technique where a smaller "student" model is trained to replicate the behavior of a larger "teacher" model. The student learns from the teacher's output probabilities rather than from raw training data, capturing the larger model's knowledge in a more compact and efficient form. Distilled models run faster, cost less, and require less memory while retaining most of the teacher's capability. This technique is crucial for deploying AI on edge devices, mobile apps, and cost-sensitive applications. Many production AI systems use distilled models — for example, running a 7B parameter model distilled from a 70B model to achieve 90% of the quality at 10% of the cost.
How It Works
Instead of training the student on raw data with hard labels (e.g., "this is a cat"), the student learns from the teacher's soft targets: the full probability distribution the teacher produces over the output vocabulary for each token. These soft targets contain richer information than hard labels because they encode the teacher's uncertainty and the relationships between possible outputs (e.g., the teacher might assign 80% probability to "excellent" and 15% to "great," revealing semantic similarity). The training loss typically combines the standard cross-entropy loss on ground-truth labels with a KL-divergence loss between the student's and teacher's output distributions; a temperature parameter softens both distributions so the student can learn from the teacher's low-probability predictions, and a weighting coefficient balances the two terms. Advanced distillation approaches include feature-level distillation (matching intermediate layer representations), attention transfer (matching attention patterns), and progressive distillation (distilling through multiple stages of decreasing model size).
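The combined loss described above can be sketched in plain Python. This is a minimal illustration over a toy 3-class output, not a production implementation: the function names, the temperature of 2.0, and the 50/50 weighting (`alpha`) are illustrative choices, and real systems compute this over full vocabularies with a framework like PyTorch.

```python
import math

def softmax(logits, temperature=1.0):
    # Divide logits by temperature before normalizing;
    # higher temperature produces a softer (flatter) distribution.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    # KL(p || q): how far the student's distribution q is from the teacher's p.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, true_index,
                      temperature=2.0, alpha=0.5):
    # Hard-label term: cross-entropy against the ground-truth class.
    student_probs = softmax(student_logits)
    ce = -math.log(student_probs[true_index])
    # Soft-label term: KL between temperature-softened teacher and student.
    # The T^2 factor is the usual correction that keeps the soft-target
    # gradients on the same scale as the hard-label gradients.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    kd = kl_divergence(teacher_soft, student_soft) * temperature ** 2
    # Blend the two terms: alpha trades off hard labels vs. teacher imitation.
    return alpha * ce + (1 - alpha) * kd
```

If the student's logits exactly match the teacher's, the KL term vanishes and only the cross-entropy term remains, which is the sense in which the student is pulled toward both the ground truth and the teacher's behavior.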
Why It Matters
Distillation lets you deploy AI capabilities at a fraction of the compute cost. Running GPT-4-class models for every API call is expensive; a distilled model that captures 90% of the performance at 10% of the cost changes the economics entirely. This matters for edge deployment (running models on phones or IoT devices), latency-sensitive applications (real-time code completion, voice assistants), and any scenario where inference cost is a primary concern. For developers, distillation is a practical tool in the optimization toolkit—you can often get a specialized small model that outperforms a general large model on your specific task while being dramatically cheaper and faster to serve.
Real-World Examples
OpenAI's GPT-4o-mini is widely understood to be a distilled version of GPT-4o, offering similar capabilities at significantly lower cost. Google's Gemma models are distilled from larger Gemini models. DistilBERT, one of the earliest popular distilled models, runs 60% faster than BERT while retaining 97% of its performance. Microsoft's Phi series demonstrates that carefully distilled small models can punch far above their weight class. Hugging Face hosts hundreds of distilled model variants. On ThePlanetTools.ai, we highlight when AI tools use distilled models under the hood—it often explains why some tools deliver surprisingly good results at low price points compared to competitors using full-size models.
Related Terms
Edge Computing (Infrastructure): Processing data near users instead of centralized data centers.
LLM (AI): AI model trained on massive text to understand and generate human language.
Inference (AI): Running a trained AI model to generate outputs from new inputs.
Fine-tuning (AI): Training a pre-trained model on specialized data for a specific task.