Fuse Before Transfer: Knowledge Fusion for Heterogeneous Distillation
Guopeng Li · Qiang Wang · Ke Yan · Shouhong Ding · Yuan Gao · Gui-Song Xia
Most knowledge distillation (KD) methods focus on teacher-student pairs with similar architectures, such as both being CNNs. Expanding KD to Cross-Architecture KD (CAKD), where knowledge from homogeneous or heterogeneous teachers can be transferred selectively to a given student, greatly improves its potential and flexibility. However, CAKD is extremely challenging because of the substantial feature gaps between heterogeneous models (e.g., a ViT teacher and a CNN student), which originate from differences in their inherent inductive biases and module functions. To this end, we fuse heterogeneous knowledge before transferring it from teacher to student. This fusion combines the advantages of cross-architecture inductive biases and module functions by directly merging different combinations of convolution, attention, and MLP modules drawn from both the student and the teacher. Furthermore, we observe that heterogeneous features exhibit diverse spatial distributions, which hinders the effectiveness of the conventional pixel-wise MSE loss. We therefore apply a spatial-agnostic InfoNCE loss that aligns features after spatial smoothing. Our method is evaluated on various homogeneous pairs and arbitrary heterogeneous combinations of CNNs, ViTs, and MLPs, yielding promising results for the distilled models, with a maximum gain of 11.47% on CIFAR-100 and 3.67% on ImageNet-1K. Our code will be released.
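To make the two ideas in the abstract concrete, here is a minimal PyTorch sketch of a fusion block that mixes a convolutional branch (CNN-style local inductive bias), a self-attention branch (ViT-style global bias), and an MLP branch over concatenated teacher and student feature maps. This is an illustrative assumption of how such a fusion could look, not the authors' released design; the channel widths, the 1x1 reduction, and the residual layout are all hypothetical choices.

```python
import torch
import torch.nn as nn

class HeteroFusionBlock(nn.Module):
    """Hypothetical fusion of conv, attention, and MLP module functions."""
    def __init__(self, s_dim, t_dim, dim=256, heads=4):
        super().__init__()
        # 1x1 conv reduces the concatenated student+teacher channels.
        self.reduce = nn.Conv2d(s_dim + t_dim, dim, kernel_size=1)
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1)   # local bias
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # global bias
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(),
                                 nn.Linear(dim * 2, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, f_s, f_t):
        # f_s: (B, C_s, H, W) student map; f_t: (B, C_t, H, W) teacher map,
        # assumed already resized to the same spatial resolution.
        x = self.reduce(torch.cat([f_s, f_t], dim=1))    # (B, D, H, W)
        x = x + self.conv(x)                             # convolution branch
        b, d, h, w = x.shape
        tokens = self.norm(x.flatten(2).transpose(1, 2)) # (B, HW, D)
        attn_out, _ = self.attn(tokens, tokens, tokens)  # attention branch
        tokens = tokens + attn_out
        tokens = tokens + self.mlp(self.norm(tokens))    # MLP branch
        return tokens.transpose(1, 2).reshape(b, d, h, w)
```

Likewise, a plausible reading of the spatial-agnostic InfoNCE loss is sketched below: feature maps are blurred with average pooling (the "spatial smoothing"), globally pooled to discard location, projected into a shared embedding space, and contrasted across the batch, with the same image's teacher feature as the positive. The pooling kernel, projection width, and temperature are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmoothedInfoNCE(nn.Module):
    """Sketch of a spatial-agnostic InfoNCE alignment loss."""
    def __init__(self, s_dim, t_dim, embed_dim=256, kernel=3, temperature=0.07):
        super().__init__()
        self.proj_s = nn.Linear(s_dim, embed_dim)  # student projection
        self.proj_t = nn.Linear(t_dim, embed_dim)  # teacher projection
        self.kernel = kernel
        self.temperature = temperature

    def forward(self, f_s, f_t):
        # f_s: (B, C_s, H, W); f_t: (B, C_t, H', W').
        # Average pooling smooths away pixel-level spatial detail.
        f_s = F.avg_pool2d(f_s, self.kernel, stride=1, padding=self.kernel // 2)
        f_t = F.avg_pool2d(f_t, self.kernel, stride=1, padding=self.kernel // 2)
        # Global pooling removes the remaining spatial dependence.
        z_s = F.normalize(self.proj_s(f_s.mean(dim=(2, 3))), dim=1)  # (B, D)
        z_t = F.normalize(self.proj_t(f_t.mean(dim=(2, 3))), dim=1)  # (B, D)
        logits = z_s @ z_t.t() / self.temperature  # (B, B) similarities
        targets = torch.arange(z_s.size(0), device=z_s.device)
        # Diagonal entries (same image) are the positive pairs.
        return F.cross_entropy(logits, targets)
```

In a training loop, such a loss would presumably be added to the student's task loss with a weighting coefficient, alongside any logit-level distillation term.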