
Knowledge Distillation

A technique for transferring knowledge from a large AI model (the teacher) to a smaller, more efficient model (the student).

How Knowledge Distillation Works

Knowledge distillation is a model compression technique where a large, high-performing model (the teacher) transfers its learned knowledge to a smaller, more efficient model (the student). Instead of training the student on raw data alone, it learns from the teacher's output probability distributions, which contain richer information about relationships between classes and concepts than simple labels.
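The standard loss that trains the student on both sources can be sketched as follows. This is a minimal pure-Python illustration of the common formulation (a weighted blend of a soft-target term and a hard-label cross-entropy); the function names, the temperature value, and the blending weight `alpha` are illustrative choices, not a fixed API.

```python
import math

def softmax(logits, temperature=1.0):
    # Scale logits by temperature; higher temperatures give softer distributions.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """Blend of (a) KL divergence between temperature-softened teacher and
    student distributions and (b) cross-entropy on the true hard label."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    # KL(teacher || student) on the softened distributions
    kl = sum(ti * math.log(ti / si) for ti, si in zip(t, s) if ti > 0)
    # Standard cross-entropy on the hard label at temperature 1
    s_hard = softmax(student_logits)
    ce = -math.log(s_hard[hard_label])
    # The T^2 factor rescales the soft-target gradient, as in the
    # commonly cited formulation of distillation.
    return alpha * (temperature ** 2) * kl + (1 - alpha) * ce
```

A student whose logits already agree with the teacher incurs a much smaller loss than one that disagrees, which is what drives the transfer during training.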

Why Soft Targets Work

The process works because the teacher's soft outputs, such as predicting 80% cat and 15% tiger for an image, encode valuable structural knowledge about similarities between concepts. The student model trained on these soft targets often outperforms an identical architecture trained only on hard labels, because it benefits from the teacher's nuanced understanding.

Beyond Output Matching

Modern distillation goes beyond matching output distributions. Feature-based distillation aligns intermediate representations between teacher and student. Relation-based distillation preserves relationships between data points. For language models, distillation often involves generating synthetic training data from the teacher, allowing the student to learn from diverse, high-quality examples.
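A feature-based variant can be sketched as a simple distance between intermediate activations. This is an illustrative minimal version: the function name is hypothetical, and in practice the student's features usually pass through a learned projection so their dimensionality matches the teacher's.

```python
def feature_distillation_loss(student_features, teacher_features):
    """Mean squared error between intermediate-layer activations of the
    student and teacher. Assumes the two feature vectors already share a
    dimensionality; real systems insert a learned projection when they don't."""
    if len(student_features) != len(teacher_features):
        raise ValueError("feature vectors must have the same length")
    return sum((s - t) ** 2
               for s, t in zip(student_features, teacher_features)) / len(student_features)
```

Minimizing this term pushes the student's internal representations toward the teacher's, complementing the output-level loss rather than replacing it.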