Knowledge Distillation

Knowledge Distillation (Teacher–Student Model Compression) · Also known as: teacher-student distillation, model distillation (Turkish: bilgi damıtma)

Knowledge distillation is a model-compression technique, popularised by Geoffrey Hinton, Oriol Vinyals and Jeff Dean in 2015, that trains a small student model to match the soft-label outputs (the full predicted probability distribution) of a large teacher model. Distilled models such as DistilBERT and TinyBERT are reported to retain roughly 97% of their BERT teacher's benchmark performance while being substantially smaller and faster.

When to use it

Use it when you have a strong but heavy trained teacher model and want a smaller, faster student for deployment on text or continuous-feature tasks. It assumes the teacher is already trained and available and that a task-specific dataset is ready. As a rule of thumb, aim for at least about 500 task-specific observations: between roughly 100 and 500 the student may not distil enough from the teacher and overfitting becomes a risk, and below about 100 distillation is unlikely to pay off, so classic machine learning is the safer choice.

Strengths & limitations

Strengths

Yields a much smaller, faster model that can retain around 97% of the teacher's performance in reported cases such as DistilBERT.
The teacher's soft labels carry richer information than hard labels, improving the student's generalisation.
Assumption-light: does not require normally distributed data.
Reuses an existing trained teacher, so it suits deployment and serving where speed and size matter.

Limitations

Requires an already-trained, available teacher model — distillation cannot start without one.
Needs a task-specific dataset; with too little data the student learns little from the teacher.
On small datasets (n below about 500) the student overfits and fails to absorb the teacher's knowledge.
Adds a training pipeline and tuning burden (temperature, mixing weight) on top of the original model.

Frequently asked

Why use soft labels instead of just the correct answer?

The teacher's full probability distribution reveals how similar the classes look to it and how confident it is. This extra signal — sometimes called dark knowledge — helps the student generalise better than hard labels alone.
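To make the contrast with hard labels concrete, here is a minimal sketch of how a teacher's logits become soft labels. The logits, the class names and the temperature values are hypothetical, chosen only for illustration; dividing the logits by a temperature before the softmax follows the approach described by Hinton et al. (2015).

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for three classes: cat, dog, truck.
teacher_logits = [6.0, 4.0, -2.0]

# A hard label only says "cat" and nothing about the other classes.
hard_label = [1.0, 0.0, 0.0]

# Soft labels at T=1: roughly [0.881, 0.119, 0.0003].
soft_t1 = softmax_with_temperature(teacher_logits, temperature=1.0)

# Soft labels at T=4: roughly [0.574, 0.348, 0.078].
soft_t4 = softmax_with_temperature(teacher_logits, temperature=4.0)

print("hard:", hard_label)
print("T=1 :", [round(p, 4) for p in soft_t1])
print("T=4 :", [round(p, 4) for p in soft_t4])
```

In this toy case the soft labels show that the teacher considers "dog" far more plausible than "truck" for a cat image, which a one-hot label cannot express. Raising the temperature makes that relative similarity more visible to the student.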

How much smaller can the student be?

Distilled models such as DistilBERT and TinyBERT are substantially smaller and faster than their teachers while reaching roughly 97% of the teacher's performance, though the exact trade-off depends on the task and architecture.

What do I need before distilling?

You need a teacher model that is already trained and available, plus a task-specific dataset. With fewer than about 500 observations the student struggles to distil enough from the teacher and may overfit.

What is the alpha in the loss?

Alpha mixes the two parts of the distillation loss: a cross-entropy term on the true labels and a Kullback–Leibler divergence term that aligns the student's outputs with the teacher's. Tuning it balances learning from ground truth versus learning from the teacher.
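The mixing described above can be sketched for a single training example as follows. This is an illustration under stated assumptions, not a reference implementation: the page does not say which term alpha weights, so the sketch assumes alpha weights the hard-label cross-entropy and (1 - alpha) weights the KL term; the T² factor on the KL term follows the suggestion in Hinton et al. (2015); and the function names and example logits are hypothetical.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Log-probabilities from logits, optionally softened by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    log_total = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_total for s in scaled]

def distillation_loss(student_logits, teacher_logits, true_class,
                      alpha=0.5, temperature=2.0):
    """alpha * CE(true label, student) + (1 - alpha) * T^2 * KL(teacher || student)."""
    # Hard-label term: cross-entropy of the student at temperature 1.
    ce = -log_softmax(student_logits)[true_class]

    # Soft-label term: KL divergence between temperature-softened distributions.
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lt) * (lt - ls)
             for lt, ls in zip(log_p_teacher, log_p_student))

    # Assumed convention: alpha weights the ground-truth term.
    return alpha * ce + (1.0 - alpha) * temperature ** 2 * kl

# Hypothetical logits for one example whose true class is index 0.
student_logits = [2.0, 1.0, 0.1]
teacher_logits = [5.0, 3.5, -1.0]
loss = distillation_loss(student_logits, teacher_logits, true_class=0)
print(f"distillation loss: {loss:.4f}")
```

Under this convention, alpha = 1 trains the student on ground truth only and alpha = 0 trains it purely to imitate the teacher. In practice both alpha and the temperature are tuned on held-out data.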

Sources

Hinton, G., Vinyals, O. & Dean, J. (2015). Distilling the Knowledge in a Neural Network. NIPS 2014 Deep Learning Workshop. arXiv:1503.02531.
Sanh, V., Debut, L., Chaumond, J. & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108.

How to cite this page

ScholarGate. (2026, June 1). Knowledge Distillation (Teacher–Student Model Compression). ScholarGate. https://scholargate.app/en/deep-learning/knowledge-distillation

Referenced by

Capsule Network; Ensemble Self-supervised Learning; Federated Learning; MobileNet; Multitask Learning; Neural Architecture Search; Self-supervised Image Classification; Weakly supervised vision transformer

Related reference concepts

Self-Supervised and Representation Learning; Unsupervised Learning; Supervised Learning; Dimensionality Reduction; VC Dimension and Capacity; Variational Inference
