Idea: Transfer the performance of a complex model (teacher) to a simpler model (student)
Teacher and Student
- The teacher model is trained first using a standard objective function to maximize its accuracy or a similar metric.
- The student model, on the other hand, aims to learn transferable knowledge from the teacher by matching the probability distribution of the teacher’s predictions.
Dark Knowledge
- Soften the teacher’s distribution with a softmax temperature (T)
- As T grows, the softened probabilities reveal more about which classes the teacher finds similar to the predicted one. For logits Z_i: $$p_i = \frac{\exp(Z_i / T)}{\sum_j \exp(Z_j / T)}$$
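A minimal sketch of the temperature softmax above, using only the Python standard library. The function name `softmax_with_temperature` and the example logits are assumptions made for illustration; they do not come from any particular library.

```python
import math


def softmax_with_temperature(logits, T=1.0):
    """Return p_i = exp(Z_i / T) / sum_j exp(Z_j / T) for a list of logits."""
    if T <= 0:
        raise ValueError("T must be positive")
    scaled = [z / T for z in logits]
    # Subtracting the max does not change the result but avoids overflow.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical teacher logits for three classes (illustrative values only).
teacher_logits = [5.0, 2.0, 0.5]

sharp = softmax_with_temperature(teacher_logits, T=1.0)
soft = softmax_with_temperature(teacher_logits, T=4.0)

print("T=1:", [round(p, 3) for p in sharp])
print("T=4:", [round(p, 3) for p in soft])
```

With the higher temperature, the top class keeps the largest probability but the other classes receive noticeably more mass, which is the extra signal the student can learn from.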
Techniques
- Approach 1: Weight the student and teacher objectives and combine them during backprop
- Approach 2: Compare distributions of the predictions (student and teacher) using KL divergence
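The two approaches can be sketched together for a single example: a KL-divergence term compares the softened teacher and student distributions, and a weighted sum combines it with the student's ordinary cross-entropy loss. This is a sketch under stated assumptions, not a definitive implementation: the names `distillation_loss`, `alpha`, and the default values `T=4.0` and `alpha=0.5` are chosen for illustration, and the `T ** 2` factor is a common convention for keeping the soft-target gradients on a comparable scale.

```python
import math


def softmax(logits, T=1.0):
    """Temperature softmax over a list of logits."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cross_entropy(probs, label, eps=1e-12):
    """Negative log-probability of the true class index `label`."""
    return -math.log(probs[label] + eps)


def distillation_loss(student_logits, teacher_logits, label, T=4.0, alpha=0.5):
    # Student objective: standard cross-entropy against the hard label (T = 1).
    hard = cross_entropy(softmax(student_logits), label)
    # Teacher objective: KL between softened teacher and student distributions.
    soft = kl_divergence(softmax(teacher_logits, T), softmax(student_logits, T))
    # Weighted combination; alpha and the T**2 scaling are assumed conventions.
    return alpha * hard + (1 - alpha) * (T ** 2) * soft


# Hypothetical logits for one example whose true class is index 0.
teacher = [5.0, 2.0, 0.5]
matching_student = [5.0, 2.0, 0.5]
poor_student = [0.5, 2.0, 5.0]

loss_match = distillation_loss(matching_student, teacher, label=0)
loss_poor = distillation_loss(poor_student, teacher, label=0)
print(loss_match, loss_poor)
```

When the student's logits equal the teacher's, the KL term vanishes and only the weighted hard-label loss remains; a student that disagrees with the teacher is penalized by both terms.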