PaperHub
Average rating: 5.0/10 (withdrawn, 4 reviewers) · Min 3 · Max 6 · Std 1.2
Individual ratings: 6, 3, 6, 5 · Average confidence: 4.5
ICLR 2024

Improving Knowledge Distillation via Regularizing Feature Direction and Norm

OpenReview · PDF
Submitted: 2023-09-15 · Updated: 2024-03-26

Abstract

Keywords
knowledge distillation, feature norm, feature direction, network compression, feature regularization, student training

Reviews and Discussion

Official Review
Rating: 6

Here is a summary of the key points from the paper:

  • The paper proposes a method to improve knowledge distillation (KD) by regularizing student features to align direction with teacher class-means and have sufficiently large norms.

  • Current KD methods such as logit or feature distillation align the student with the teacher but do not directly optimize the student's task performance.

  • The paper shows that regularizing the feature direction, via cosine similarity to teacher class-means, helps improve student accuracy.

  • It also finds that student models tend to produce smaller-norm features, so encouraging larger norms improves performance.

  • A simple combined loss called dino-loss is proposed to simultaneously regularize student feature direction and norm using teacher class means.

  • Experiments on CIFAR and ImageNet classification, and COCO detection show dino-loss consistently improves various KD methods like KD, ReviewKD, DKD.

  • Dino-loss achieves new state-of-the-art results among KD techniques on classification and detection benchmarks.

  • The method is model-agnostic, simple to implement, adds minimal overhead, and benefits from larger teacher models.

In summary, the key contributions are a way to improve KD by regularizing student features toward better direction alignment and larger norms, together with a simple and effective dino-loss that achieves both jointly. The results demonstrate consistent gains across tasks and benchmarks.
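To make the summarized mechanism concrete, here is a minimal PyTorch sketch of such a combined direction-and-norm regularizer. This is an editorial illustration, not the authors' code: the function name nd_loss and the hyperparameters margin and lambda_norm are assumptions, and the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def nd_loss(student_feat, teacher_class_means, labels, margin=1.0, lambda_norm=1.0):
    """Hypothetical direction-and-norm regularizer.

    student_feat: (B, D) penultimate-layer features of the student.
    teacher_class_means: (C, D) per-class mean features of the teacher.
    labels: (B,) ground-truth class indices.
    """
    target_means = teacher_class_means[labels]  # (B, D) class-mean for each sample
    # Direction term: push each student feature toward its teacher class-mean direction.
    direction = 1.0 - F.cosine_similarity(student_feat, target_means, dim=1)
    # Norm term: hinge penalty on small feature norms, encouraging large-norm features.
    norm_penalty = F.relu(margin - student_feat.norm(p=2, dim=1))
    return (direction + lambda_norm * norm_penalty).mean()
```

In a training loop, a term of this kind would be added, with a weighting coefficient, to whatever task and distillation losses the base KD method already uses.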

Strengths

The paper presents an original and significant approach to improving KD via thoughtful feature regularization. The method is intuitive and supported by high-quality experiments, and the gains are shown to be significant across tasks. The presentation and discussion are clear:

  • The method and dino-loss are clearly explained with illustrations and equations. Results are well-presented in tables and figures. Limitations are properly discussed.
  • Improving KD is an important practical problem. The consistent gains are significant, and the method sets new state-of-the-art results on ImageNet classification and COCO detection.
  • The model-agnostic nature allows wide applicability to various KD methods and models. As a simple extension, it can benefit the community more than complex techniques.

Weaknesses

  • The paper should address the lack of novelty by acknowledging that feature normalization techniques have already been widely employed in knowledge distillation. For example, PKD (NeurIPS-2023) specifically incorporates channel alignment for detectors, SKD (Guo Jia) explores normalization techniques on predictions, and Feature Normalized Knowledge Distillation for Image Classification (ECCV2022) also presents feature norms. Furthermore, it is worth investigating whether the proposed method has already been considered in work on distiller search, as exemplified by KD-Zero: Evolving Knowledge Distiller for Any Teacher-Student Pairs (NeurIPS-2023).

  • In addition, the paper should incorporate a thorough discussion of relevant KD-related studies, including Self-Regulated Feature Learning via Teacher-free Feature Distillation (ECCV2022), NORM: Knowledge Distillation via N-to-One Representation Matching (ICLR2023), Shadow Knowledge Distillation: Bridging Offline and Online Knowledge Transfer (NIPS2022), DisWOT: Student Architecture Search for Distillation Without Training (CVPR2023), and Automated Knowledge Distillation via Monte Carlo Tree Search (ICCV2023). These discussions will provide valuable insights into the existing literature, establish connections with previous research, and potentially highlight points of comparison and contrast.

Questions

My only concern is the novelty of the work, and I hope the authors can discuss some of the related work I mentioned in the revised version.

Details of Ethics Concerns

no

Official Review
Rating: 3

The paper proposes a simple yet efficient feature direction distillation loss. Experiments show that this significantly improves KD performance.

Strengths

  1. Improving KD via feature norm and direction is reasonable and effective.
  2. Experiments on standard benchmarks demonstrate that adopting $\mathcal{L}_{dino}$ remarkably improves existing KD methods.

Weaknesses

  1. The contributions seem a little limited.
  2. There is a lack of theoretical analysis of the DINO loss. The paper is not good enough to be published at ICLR.

Questions

  1. How are the features aligned between heterogeneous architectures?
  2. Could you please provide more theoretical analysis?
  3. What about extending it to a multi-layer version of feature distillation?
  4. How should the proposed method be applied to existing KD methods, e.g. ReviewKD, DKD, DIST? Is the DINO loss simply added to the total loss? If so, I think adding other losses, such as a contrastive distillation loss or RKD, may also yield an improvement. (A sketch of this composition follows below.)
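For reference, the composition asked about in question 4 could look like the following hypothetical sketch, which simply adds a weighted DINO/ND regularizer (the nd_loss sketched under the first review above) to an existing objective; base_kd_loss, alpha, and beta are placeholders, not values prescribed by the paper.

```python
import torch.nn.functional as F

def total_loss(logits_s, feat_s, logits_t, teacher_class_means, labels,
               base_kd_loss, alpha=1.0, beta=1.0):
    ce = F.cross_entropy(logits_s, labels)             # standard task loss
    kd = base_kd_loss(logits_s, logits_t)              # existing distiller's term (e.g. KD / DKD / DIST)
    nd = nd_loss(feat_s, teacher_class_means, labels)  # DINO/ND regularizer sketched above
    return ce + alpha * kd + beta * nd
```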

Details of Ethics Concerns

None

Official Review
Rating: 6

This paper studies Knowledge Distillation (KD). A simple loss term, namely the ND loss, is proposed to enhance distillation performance. It encourages the student to produce large-norm features and aligns the direction of student features with teacher class-means. The ND loss helps not only logit-based distillation methods but also feature-based distillation methods.
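As one way to read the two components described above (an assumed instantiation added for illustration, not the paper's exact formulation; $\lambda$ and $\delta$ denote a hypothetical weight and norm margin):

$$
\mathcal{L}_{\mathrm{ND}} = \bigl(1 - \cos(f_s,\, \mu^{t}_{y})\bigr) + \lambda\,\max\bigl(0,\ \delta - \lVert f_s \rVert_2\bigr),
$$

where $f_s$ is the student feature and $\mu^{t}_{y}$ is the teacher class-mean of the ground-truth class $y$.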

Strengths

  1. The proposed method is simple but effective. Encouraging larger feature norms for the student is novel in the field of KD.
  2. The experimental results are strong. The authors also conduct experiments on object detection, and the proposed loss improves existing methods on both image classification and object detection.
  3. The whole paper is well organized and well written.

Weaknesses

Decoupling features into magnitude and direction is not novel; previous works [1][2] have already studied this point. [1] uses the teacher's classifier to project both teacher and student features into the same space and then aligns them. [2] proposes a loss term to align the direction of two features. Compared to existing works, this paper proposes enlarging the feature norm and utilizing the class-mean feature. The authors should check more existing papers and discuss the differences.

[1] Yang, Jing, et al. "Knowledge distillation via softmax regression representation learning." International Conference on Learning Representations (ICLR), 2021.

[2] Wang, Guo-Hua, Yifan Ge, and Jianxin Wu. "Distilling knowledge by mimicking features." IEEE Transactions on Pattern Analysis and Machine Intelligence 44.11 (2021): 8183-8195.

Questions

None

Official Review
Rating: 5

This paper proposes to use the teacher's class-means to align the student's feature direction and to encourage the student to produce large-norm features, improving the performance of KD.

Strengths

The paper is generally well-written, and the methodology is well-motivated.

Weaknesses

  1. I would expect comparisons with and discussion of similarity-preserving KD, e.g., [1], which represents a large family of feature distillation methods and shows some relation to the proposed method.
  2. Meanwhile, comparisons with and discussion of explainability-based KD, e.g., [2], are needed to see whether those methods can benefit from the proposed method.

[1] Tung, Fred, and Greg Mori. “Similarity-Preserving Knowledge Distillation.” ICCV 2019.

[2] Guo, Ziyao, et al. "Class Attention Transfer Based Knowledge Distillation." CVPR 2023.

Questions

Please see the weaknesses.