Distilling the Knowledge in a Neural Network. Geoffrey Hinton, Oriol Vinyals, Jeff Dean. arXiv preprint arXiv:1503.02531, 2015. DOI: 10.48550/arXiv.1503.02531 - The foundational paper introducing knowledge distillation, focusing on training a student on the teacher's softened output probabilities (soft targets) produced via temperature scaling.
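As a quick illustration of the soft-target objective this paper introduces, below is a minimal PyTorch sketch. The function name `distillation_loss` and the defaults `T=4.0` and `alpha=0.5` are illustrative choices, not values prescribed by the paper.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.5):
    """Hinton-style distillation: KL divergence between temperature-softened
    teacher and student distributions, blended with standard cross-entropy
    on the hard labels. T and alpha are illustrative hyperparameters."""
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    # Multiplying by T^2 keeps the gradient magnitude of the soft term
    # comparable to the hard-label term, as noted in the paper.
    kd_term = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce_term = F.cross_entropy(student_logits, targets)
    return alpha * kd_term + (1.0 - alpha) * ce_term
```

Raising the temperature flattens the teacher's distribution, exposing the relative probabilities it assigns to incorrect classes ("dark knowledge") that a one-hot label discards.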
FitNets: Hints for Thin Deep Nets. Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, Yoshua Bengio. International Conference on Learning Representations (ICLR), 2015. DOI: 10.48550/arXiv.1412.6550 - Introduces feature-based knowledge distillation, where the student model learns from the teacher's intermediate representations.
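A minimal sketch of the hint-based objective described in this paper, assuming a PyTorch setup with convolutional feature maps. The class name `HintLoss` is illustrative; the paper does use a learned regressor to map the thinner student's features into the teacher's feature space, implemented here as a 1x1 convolution for simplicity.

```python
import torch.nn as nn
import torch.nn.functional as F

class HintLoss(nn.Module):
    """FitNets-style hint loss: a learned regressor projects the student's
    intermediate feature map to the teacher's channel width, then an L2 loss
    pulls the two representations together. Assumes the two feature maps
    share the same spatial resolution."""
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # 1x1 conv regressor (an illustrative simplification of the
        # paper's regressor) matching channel dimensions.
        self.regressor = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        return F.mse_loss(self.regressor(student_feat), teacher_feat)
```

In the paper's two-stage scheme, this hint loss first pretrains the student up to a chosen "guided" layer, after which the whole student is trained with the standard soft-target distillation loss.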