

Poster

A Good Teacher Adapts Their Knowledge for Distillation

Chengyao Qian · Trung Le · Mehrtash Harandi


Abstract:

Knowledge distillation (KD) is an effective method for enhancing a small model, named the student, by training it under the supervision of a larger teacher model. However, existing studies indicate that a substantial capacity gap between the student and teacher can lead to poor learning for the student model. This capacity gap problem limits the applicability of KD and necessitates careful selection of the teacher's size. Despite its importance, the underlying cause of the capacity gap problem remains underexplored. In this paper, we reveal that a substantial disparity in the output distributions of teacher and student models is a key factor behind this issue. To demonstrate this, we decompose the KD loss into two components, class-wise similarity and inner-class distribution, and analyze the contribution of each term. Our analysis shows that a large distributional mismatch can lead to poor student learning. Inspired by this observation, we propose the Adapted Inner-class Distribution (AID) method, wherein the teacher model is fine-tuned to optimize its inner-class distribution to better align with the student's capacity prior to knowledge distillation. This approach effectively bridges the capacity gap between teacher and student models and consistently achieves state-of-the-art performance across a diverse range of architectures.
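The abstract does not spell out the exact form of the decomposition. The sketch below is a minimal PyTorch illustration, assuming the split follows the common target/non-target factorization of the KL-based KD loss (as in decoupled KD): a binary term over target vs. non-target probability mass, which we label as the "class-wise similarity" term, and a KL term over the renormalized non-target classes, which we label as the "inner-class distribution" term. The function name and term names are hypothetical and chosen to match the abstract's wording, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def kd_loss_decomposed(student_logits, teacher_logits, target, T=4.0):
    """Illustrative split of the temperature-scaled KL distillation loss into
    a target/non-target ("class-wise similarity") term and a non-target
    ("inner-class distribution") term. Assumed decomposition, not the paper's
    official implementation."""
    p_s = F.softmax(student_logits / T, dim=1)
    p_t = F.softmax(teacher_logits / T, dim=1)

    # Probability mass on the ground-truth class vs. everything else.
    mask = F.one_hot(target, num_classes=p_s.size(1)).bool()
    pt_s = p_s[mask]                          # student target-class probability
    pt_t = p_t[mask]                          # teacher target-class probability
    b_s = torch.stack([pt_s, 1 - pt_s], dim=1)
    b_t = torch.stack([pt_t, 1 - pt_t], dim=1)

    # "Class-wise similarity": binary KL on target vs. non-target mass.
    class_sim = (b_t * (b_t.clamp_min(1e-8).log()
                        - b_s.clamp_min(1e-8).log())).sum(dim=1)

    # "Inner-class distribution": KL over the renormalized non-target classes.
    nt_s = p_s.masked_fill(mask, 0)
    nt_t = p_t.masked_fill(mask, 0)
    nt_s = nt_s / nt_s.sum(dim=1, keepdim=True)
    nt_t = nt_t / nt_t.sum(dim=1, keepdim=True)
    inner = (nt_t * (nt_t.clamp_min(1e-8).log()
                     - nt_s.clamp_min(1e-8).log())).sum(dim=1)

    # Identity: KL(p_t || p_s) = class_sim + (1 - pt_t) * inner.
    return (class_sim + (1 - pt_t) * inner).mean() * T * T
```

Under this split, a large teacher-student mismatch shows up mainly in the inner-class term, which is consistent with the abstract's motivation for adapting the teacher's inner-class distribution before distillation.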
