Knowledge Distillation


Idea: Transfer the performance of a complex model (the teacher) to a simpler model (the student)

Teacher and Student

  • The teacher model is trained first using a standard objective function to maximize its accuracy or a similar metric.
  • The student model then learns transferable knowledge from the teacher by matching the probability distribution of the teacher’s predictions (a minimal sketch follows this list).
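
A minimal sketch of the two-stage setup, assuming PyTorch; the toy data, model sizes, and hyperparameters below are illustrative choices, not from the source:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(256, 20)                  # toy inputs
y = torch.randint(0, 5, (256,))           # toy labels, 5 classes

teacher = nn.Sequential(nn.Linear(20, 128), nn.ReLU(), nn.Linear(128, 5))
student = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 5))

# Stage 1: train the teacher with a standard objective (cross-entropy).
opt_t = torch.optim.Adam(teacher.parameters(), lr=1e-3)
for _ in range(100):
    opt_t.zero_grad()
    F.cross_entropy(teacher(x), y).backward()
    opt_t.step()

# Stage 2: train the student to match the teacher's predictive distribution.
teacher.eval()
opt_s = torch.optim.Adam(student.parameters(), lr=1e-3)
for _ in range(100):
    with torch.no_grad():
        p_teacher = F.softmax(teacher(x), dim=-1)       # soft targets
    log_p_student = F.log_softmax(student(x), dim=-1)
    loss = F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    opt_s.zero_grad()
    loss.backward()
    opt_s.step()
```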

Dark Knowledge

  • Soften the teacher’s output distribution with a softmax temperature $T$
  • As $T$ grows, the softened distribution reveals which classes the teacher considers similar to the predicted one (a small numeric illustration follows this list): $$p_i = \frac{\exp(Z_i / T)}{\sum_j \exp(Z_j / T)}$$
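
A small numeric illustration of the temperature-scaled softmax; the logits are toy values chosen for demonstration, not from the source:

```python
import numpy as np

def softmax_with_temperature(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [8.0, 3.0, 1.0]     # e.g. "dog", "cat", "car"
for T in (1, 4, 10):
    print(T, softmax_with_temperature(logits, T))
# At T = 1 almost all mass sits on "dog"; as T grows the distribution
# softens and shows that the teacher rates "cat" as more similar to
# "dog" than "car" is.
```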

Techniques

  • Approach 1: Weight the two objectives (the student’s hard-label loss and the distillation loss against the teacher) and combine them during backprop
  • Approach 2: Compare the predicted distributions of the student and the teacher using KL divergence (a combined-loss sketch follows this list)
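
A hedged sketch combining both approaches, assuming PyTorch; the weight `alpha` and temperature `T` are illustrative defaults, not values given in the source:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Hard-label term: standard cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # T^2 factor (from Hinton et al.) keeps gradients on a comparable scale
    # Weighted combination of the two objectives.
    return alpha * hard + (1.0 - alpha) * soft
```

The weighted sum is what gets backpropagated through the student; the teacher’s logits are treated as fixed targets.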
