Knowledge Distillation

June 13, 2023 One-minute read

neural-net

Idea: Transfer the performance of a complex model (teacher) to a simpler model (student)

Teacher and Student

The teacher model is trained first using a standard objective function to maximize its accuracy or a similar metric.
The student model, on the other hand, aims to learn transferable knowledge from the teacher by matching the probability distribution of the teacher’s predictions.

Dark Knowledge

Soften the teacher’s distribution with a softmax temperature (T)
As T grows, the distribution flattens and you get more insight into which classes the teacher finds similar to the predicted one:
$$p_i = \frac{\exp(Z_i / T)}{\sum_j \exp(Z_j / T)}$$
Here Z_i is the logit for class i, and T = 1 recovers the standard softmax.
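To make the effect of T concrete, here is a minimal sketch of the temperature softmax in plain Python. The function name `softmax_with_temperature` and the example logits are made up for illustration; they are not from any particular library.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Return softmax probabilities of `logits`, softened by temperature T."""
    if T <= 0:
        raise ValueError("T must be positive")
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for three classes (illustrative values only)
teacher_logits = [5.0, 2.0, 0.5]

sharp = softmax_with_temperature(teacher_logits, T=1.0)  # standard softmax
soft = softmax_with_temperature(teacher_logits, T=4.0)   # flatter distribution

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in soft])
```

With the higher temperature, the top class loses probability mass and the other classes gain it, while the ranking of the classes stays the same. That extra mass on the non-predicted classes is the "dark knowledge" the student can learn from.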

Techniques

Approach 1: Weight the student and teacher objectives and combine them during backpropagation
Approach 2: Compare the distributions of the student’s and teacher’s predictions using KL divergence
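The two approaches can be combined in a single loss. The sketch below is one possible way to do it, in plain Python, and not a definitive implementation. The names `distillation_loss`, `alpha`, and the default values are assumptions chosen for illustration. The `T ** 2` scaling of the soft term is a common convention in the distillation literature and is not stated in these notes.

```python
import math

def softmax(logits, T=1.0):
    """Temperature softmax over a list of logits."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(probs, label):
    """Negative log-likelihood of the true class index `label`."""
    return -math.log(probs[label])

def distillation_loss(student_logits, teacher_logits, label, T=4.0, alpha=0.5):
    """Weighted sum of a hard-label loss and a soft-label (teacher) loss.

    alpha (assumed hyperparameter) weights the student's own objective;
    (1 - alpha) weights the KL term that compares softened distributions.
    """
    hard = cross_entropy(softmax(student_logits), label)
    soft = kl_divergence(softmax(teacher_logits, T), softmax(student_logits, T))
    return alpha * hard + (1 - alpha) * (T ** 2) * soft

# Hypothetical logits for one training example whose true class is 0
teacher = [5.0, 2.0, 0.5]
student = [2.0, 1.5, 1.0]
loss = distillation_loss(student, teacher, label=0)
print(loss)
```

Setting `alpha=1.0` reduces this to ordinary supervised training, and `alpha=0.0` trains the student only to match the teacher. In a real training loop the same quantity would be computed with a framework that provides automatic differentiation, so the combined loss can be backpropagated through the student.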

References