Enhancing Logits Distillation with Plug&Play Kendall's $\tau$ Ranking Loss
Plug & Play Ranking Loss for Logits Distillation
Abstract
Reviews and Discussion
This paper addresses the problem of knowledge distillation by highlighting the limitations of traditional KL divergence. The proposed method introduces an auxiliary loss based on Kendall’s τ Coefficient, which enhances the learning of inter-class relationships and low-probability channels. Experiments conducted on three image classification datasets demonstrate the effectiveness of this approach.
Strengths
- The proposed method is straightforward and can be seamlessly integrated with logits-based knowledge distillation techniques.
- Experiments are conducted using both CNNs and ViTs across three different datasets. The ablation studies offer valuable insights into the proposed method.
Weaknesses
- Some claims lack adequate justification. For instance, it remains unclear how the proposed method resolves the suboptimal problem depicted in Figure 1. Including visual comparisons of logits with and without the ranking loss could enhance clarity and understanding.
- The proposed method includes multiple hyperparameters; however, the observed performance improvements are limited. Furthermore, the proposed method is evaluated against several straightforward baseline methods for knowledge distillation.
Questions
See weakness.
We sincerely thank you for the in-depth feedback! We highly value each of your comments, and all your concerns are addressed point by point:
Weakness1. Some Claims Lack Adequate Justification:
- In fact, Figure 1 in the paper is intended to illustrate that small-valued channels make up a large proportion of the logits in logit distillation; the suboptimal problem is actually shown in Figure 2 in the paper. We add a visualization of teacher-student ranking alignment, presented in Figure 8 in the paper, to show that our method helps align rankings across all channels. The bottom-left part of the figure illustrates that our method helps the top-ranked channels find their correct positions, which helps resolve the suboptimal problem in Figure 2.
- We include visual comparisons of logits in Figure 10 in the paper. In addition, Figure 3 in the paper already shows that the channels output by our method have a smaller KL loss, indicating that the channels are more closely aligned, and the newly added Figure 8 also supports this point to some extent.
Weakness2. Other Limitations:
- We only introduce two hyperparameters: our method introduces only the weight and the steepness coefficient; the remaining hyperparameters are baseline settings, which we do not modify. In addition, the weight of our method is inherited from the baselines in our experiments to ensure generalization and robustness.
- Our method exhibits consistent improvements: it is plug-and-play and can be combined with various methods (e.g., KD, CTKD, DKD, MLKD) to improve performance, as shown in Tables 1 and 2 in the paper. In comparative experiments, our method outperforms MLKD+LSKD, a highlight of CVPR 2024, demonstrating its effectiveness.
- We conduct experiments with multiple SOTA baselines: in Tables 1 and 2 in the paper, we combine our method with KD, CTKD, DKD, and MLKD, where KD is the classical logit distillation method, DKD is the SOTA of 2022, and MLKD is the SOTA of 2023.
- Moreover, we conduct experiments with a method that does not use KL divergence and improve its performance, as shown in Table R2.1, to further illustrate the generalization of our method.
Table R2.1: Experiments on Combining Other Methods.
| Teacher -> Student | ResNet32×4 -> WRN-16-2 | ResNet32×4 -> WRN-40-2 | WRN-40-2 -> SHN-V1 | VGG13 -> VGG8 | WRN-40-2 -> WRN-40-1 |
|---|---|---|---|---|---|
| DIST[1] | 75.58 | 78.02 | 76.00 | 73.80 | 74.73 |
| DIST[1]+Ours | 75.85(+0.27) | 78.54(+0.52) | 76.23(+0.23) | 74.06(+0.26) | 74.86(+0.13) |
[1] Knowledge distillation from a stronger teacher, NeurIPS 2022
Dear Reviewer n8Cd,
Thank you very much again for your time and effort! We would greatly appreciate it if you could take a little time to check our response. We are willing to further address any remaining concerns.
Sincerely,
Authors
Dear reviewer n8Cd,
We hope our responses have adequately addressed your previous concerns. The discussion period is approaching the end in 48 hours. If you have any additional comments or questions, please feel free to share them. Your feedback is greatly appreciated.
Sincerely,
Authors
This study presents an auxiliary ranking loss based on Kendall's τ Coefficient to improve knowledge distillation. The proposed ranking loss addresses the issue of KL divergence's neglect of low-probability channels by incorporating inter-class relationship information and enhancing focus on low-probability channels. It can be integrated into any logit-based distillation method and demonstrates consistent optimization objectives with KL divergence. Experiments on three datasets across various CNN and ViT teacher-student combinations show that the ranking loss effectively improves performance across multiple baselines.
Strengths
- This paper proposes a plug-and-play ranking loss to address the suboptimization issues in knowledge distillation optimization.
- This paper demonstrates that Kullback-Leibler divergence is influenced by channel scale.
Weaknesses
- The paper claims that the proposed ranking loss primarily addresses KL divergence's tendency to overlook low-probability channels. However, based on the proposed formula, the main objective appears to be enforcing ranking consistency between the teacher and student models, with no clear indication of increased emphasis on information from smaller channels. It is recommended that the author explain this aspect.
- In the experimental section, it is recommended to include visualization experiments to highlight the primary contribution—improved attention to low-probability channels.
- Since LSKD shows superior performance in Tables 1 and 2, further explanation of this result is advised.
Questions
Please refer to the Strengths and Weaknesses.
We sincerely thank you for the in-depth feedback! We highly value each of your comments, and all your concerns are addressed point by point:
Weakness1. More Clear Indication of Increased Emphasis on Smaller Channels:
- We add a new figure showing the gradients obtained at different channel probabilities, presented in Figure 7 in the Appendix of the paper, to confirm our claim. The figure shows that the gradient of the KL divergence is significantly affected by the channel scale, providing a smaller gradient at smaller channels. In contrast, the gradient provided by the ranking loss is relatively independent of the channel scale, giving smaller channels a gradient comparable to that of larger channels. This illustrates the increased emphasis on low-probability channels in our method; a minimal numerical sketch of this effect is given below.
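To make the gradient argument concrete, here is a minimal, self-contained sketch. The soft Kendall surrogate below (tanh of pairwise logit differences with a steepness coefficient `k`) is an assumed form used only for illustration, not necessarily the exact loss in the paper; the tensor names and sizes are likewise illustrative. It compares per-channel gradient magnitudes of the KL-based KD term and of the ranking surrogate on low- versus high-probability channels:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
C = 100                                      # number of logit channels
teacher = torch.randn(C) * 3                 # hypothetical teacher logits
student = (teacher + 0.5 * torch.randn(C)).requires_grad_(True)  # partially trained student

# Standard KD term: KL between softened distributions (temperature 1 for simplicity)
kl = F.kl_div(F.log_softmax(student, dim=0),
              F.softmax(teacher, dim=0), reduction="sum")
kl.backward()
grad_kl = student.grad.abs().clone()
student.grad = None

# Soft Kendall-tau-style surrogate: tanh of pairwise logit differences (assumed form)
k = 1.0
d_s = student.unsqueeze(0) - student.unsqueeze(1)   # C x C pairwise differences
d_t = teacher.unsqueeze(0) - teacher.unsqueeze(1)
rank_loss = -(torch.tanh(k * d_s) * torch.tanh(k * d_t)).triu(1).mean()
rank_loss.backward()
grad_rank = student.grad.abs().clone()

# Compare average gradient magnitudes on low- vs high-probability channels
p_t = F.softmax(teacher, dim=0)
low, high = p_t < p_t.median(), p_t >= p_t.median()
print("KL grad   (low / high):", grad_kl[low].mean().item(), grad_kl[high].mean().item())
print("Rank grad (low / high):", grad_rank[low].mean().item(), grad_rank[high].mean().item())
```

In this toy setting the KL gradients on the low-probability half of the channels are typically orders of magnitude smaller than on the high-probability half, whereas the ranking-surrogate gradients are of a similar order in both groups; Figure 7 in the paper shows the corresponding behavior for the actual loss.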
Weakness2. More Visualizations:
- As mentioned in the previous paragraph, we add to the paper a visual analysis of the gradients provided by the ranking loss and the KL divergence for different logits, presented in Figure 7. In addition, we include a comparative visualization of the alignment between student and teacher rankings, shown in Figure 8. The gradient visualization shows that the proposed method increases the attention paid to smaller channels, while the ranking alignment shows that the proposed method aligns larger and smaller channels in a balanced way.
Weakness3. Explanation of LSKD:
- LSKD is a method that improves the temperature parameters of students and teachers. By modifying the temperature parameter in the KD loss, LSKD alleviates the problem that the KD loss is too strict on students, which is a different idea from ours of providing auxiliary information to the KD loss to help learning.
- Our method outperforms LSKD in most experiments. Both LSKD and our method are plug-and-play, and the LSKD result reported in the paper is actually MLKD+LSKD. Despite the excellent performance of LSKD, our method still outperforms it on most student-teacher pairs (7 vs. 2) in Tables 1 and 2 in the paper.
Dear Reviewer eu6q,
Thank you very much again for your time and effort! We would greatly appreciate it if you could take a little time to check our response. We are willing to further address any remaining concerns.
Sincerely,
Authors
Dear reviewer eu6q,
We hope our responses have adequately addressed your previous concerns. The discussion period is approaching the end in 48 hours. If you have any additional comments or questions, please feel free to share them. Your feedback is greatly appreciated.
Sincerely,
Authors
This paper highlights the issues with using KL divergence in knowledge distillation and introduces a ranking loss based on Kendall's τ, which can be integrated into existing methods, enhances low-probability channel focus, and maintains inter-class relationships. Experimental results across various datasets and model architectures demonstrate that this approach consistently enhances performance.
Strengths
- The proposed method is designed for straightforward integration into existing logit-based distillation frameworks, increasing its relevance and utility.
- Multiple experiments conducted on a variety of datasets and architectures provide evidence of the proposed approach's effectiveness.
Weaknesses
- The KL divergence optimization is a relatively common scheme for the logit distillation task. Could the authors elaborate on the main novelty of this integration?
- More ablation experiments and analysis are required for discussion; please see the Questions.
Questions
- What strategies could be implemented to minimize the computational overhead associated with the proposed ranking loss?
- This article mentioned that the proposed method balances the model’s attention to larger and smaller-valued channels. Could the ranking loss also offer advantages in scenarios with class imbalance?
- Are there any adverse effects when combining the proposed method with others? Could you provide relevant ablation experiments?
We sincerely thank you for the in-depth feedback! We highly value each of your comments, and all your concerns are addressed point by point:
Weakness1. The Main Novelty:
- Our novelty can be summarized in three points:
- We are the first to apply a ranking loss to logit distillation, and we propose a generic plug-and-play loss. It can be combined with various logit-based methods, or with methods that use KL divergence or a modified KL divergence, and improves their performance (a minimal integration sketch is given after this list).
- We present a detailed mathematical analysis of a long-standing problem in logit distillation (KL divergence neglects small-valued channels) and demonstrate, both mathematically and visually, that the ranking loss helps solve this problem.
- We conduct experiments and ablations in various settings, such as ViT architectures and the detection task, to verify the effectiveness and generalization of our method.
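As a concrete illustration of the plug-and-play claim above, below is a minimal sketch of attaching a Kendall-style ranking term to a vanilla KD loss. The surrogate form (tanh of pairwise logit differences with steepness `k`), the temperature `T`, and the loss weights are illustrative assumptions rather than the exact formulation in the paper:

```python
import torch
import torch.nn.functional as F

def soft_kendall_rank_loss(z_s, z_t, k=1.0):
    """Kendall-tau-style agreement between student and teacher logits
    (assumed differentiable surrogate, for illustration only)."""
    d_s = z_s.unsqueeze(-1) - z_s.unsqueeze(-2)      # (B, C, C) pairwise differences
    d_t = z_t.unsqueeze(-1) - z_t.unsqueeze(-2)
    agree = (torch.tanh(k * d_s) * torch.tanh(k * d_t)).triu(1).sum(dim=(-2, -1))
    n_pairs = z_s.size(-1) * (z_s.size(-1) - 1) / 2
    return (1.0 - agree / n_pairs).mean()            # ~0 when rankings fully agree

def kd_plus_rank(z_s, z_t, T=4.0, kd_weight=1.0, rank_weight=1.0):
    """Vanilla KD loss with the ranking term plugged in; weights are placeholders."""
    kd = F.kl_div(F.log_softmax(z_s / T, dim=-1),
                  F.softmax(z_t / T, dim=-1),
                  reduction="batchmean") * T * T
    return kd_weight * kd + rank_weight * soft_kendall_rank_loss(z_s, z_t)

# usage: (batch, num_classes) logits from the student and a frozen teacher
z_s = torch.randn(8, 100, requires_grad=True)
z_t = torch.randn(8, 100)
loss = kd_plus_rank(z_s, z_t)
loss.backward()
```

In the same spirit, the `kd` term above can be swapped for another baseline objective (e.g., DKD, MLKD, or DIST), which is the sense in which the loss is plug-and-play.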
Weakness2. More Ablation Experiments and Analysis:
Question1. Minimize Computational Overhead:
- Since the computational overhead of the ranking loss is highly correlated with the number of channels, the best way to minimize it is to reduce the number of channels involved in the loss calculation. In Figure 5 in the paper, using only the top/min 50% of the channels to compute the ranking loss achieves results similar to using all channels, while the overhead is only about 25% of the full-channel case (the number of channel pairs involved in the calculation drops from $\frac{C(C-1)}{2}$ to $\frac{\frac{C}{2}(\frac{C}{2}-1)}{2}$, roughly a quarter). Even using only 30% of the channels, with about 9% of the computational overhead, still achieves good performance; the short calculation below illustrates these ratios.
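A quick back-of-the-envelope check of these ratios, assuming C = 100 channels as on CIFAR-100 (the concrete channel count is our assumption):

```python
from math import comb

C = 100                        # e.g., CIFAR-100 has 100 logit channels
full = comb(C, 2)              # pairs with all channels:        4950
half = comb(C // 2, 2)         # pairs with 50% of the channels: 1225
small = comb(int(0.3 * C), 2)  # pairs with 30% of the channels:  435

print(f"{half / full:.1%}")    # ~24.7%  -> the "about 25%" overhead
print(f"{small / full:.1%}")   # ~8.8%   -> the "about 9%" overhead
```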
Question2. Class Imbalance Scenarios:
- In our current research, our method primarily addresses the imbalance in the values of channel outputs; its effectiveness is rooted in addressing value-related imbalances rather than imbalances caused by the amount of data. Therefore, the method is not designed for class-imbalance scenarios.
- However, the proposed method has the potential to help in class-imbalance scenarios. By capturing the ranking relationship, our method can provide more inter-class information for imbalanced classes, helping the model learn more from a small amount of data and improve performance.
Question3. Ablation Study on Combining Methods:
- According to our experiments combining with other methods, shown in Tables 1 and 2 in the paper, the proposed ranking loss improves performance when combined with various methods (e.g., KD, DKD, MLKD), and we have not identified any adverse effects of the proposed method so far.
- Our method is a general, simple, plug-and-play method, and as far as we know, it combines well with most methods. We propose auxiliary supervision from a new perspective and bring additional performance improvements. Although the gains may be limited for a few methods, there are still improvements.
- In fact, compared to the linear alignment enforced by the traditional distillation loss, our loss focuses on ranking alignment, which is a relatively softer constraint and therefore more robust. Moreover, the mathematical derivation in Section 4.2.2 in the paper also shows that our method does not have a large adverse effect on logit distillation.
- To further verify the generality of our method, we add the ranking loss to a distillation method that does not use KL divergence and improve its performance; results are presented in Table R2.1.
Table R2.1: Experiments on Combining Other Methods.
| Teacher -> Student | ResNet32×4 -> WRN-16-2 | ResNet32×4 -> WRN-40-2 | WRN-40-2 -> SHN-V1 | VGG13 -> VGG8 | WRN-40-2 -> WRN-40-1 |
|---|---|---|---|---|---|
| DIST[1] | 75.58 | 78.02 | 76.00 | 73.80 | 74.73 |
| DIST[1]+Ours | 75.85(+0.27) | 78.54(+0.52) | 76.23(+0.23) | 74.06(+0.26) | 74.86(+0.13) |
[1] Knowledge distillation from a stronger teacher, NeurIPS 2022
Dear Reviewer 6ebV,
Thank you very much again for your time and effort! We would greatly appreciate it if you could take a little time to check our response. We are willing to further address any remaining concerns.
Sincerely,
Authors
Dear reviewer 6ebV,
We hope our responses have adequately addressed your previous concerns. The discussion period is approaching the end in 48 hours. If you have any additional comments or questions, please feel free to share them. Your feedback is greatly appreciated.
Sincerely,
Authors
The paper proposes an enhancement to the knowledge distillation process by introducing a plug-and-play ranking loss based on Kendall’s τ Coefficient, which aims to mitigate the limitations of Kullback-Leibler (KL) divergence. The proposed ranking loss addresses issues like the neglect of low-probability channels and the inability of KL divergence to fully capture inter-class relationships. Extensive experiments on CIFAR-100, ImageNet, and COCO datasets demonstrate the effectiveness of the approach, showing consistent improvements when applied to various teacher-student architecture combinations in CNN and Vision Transformer (ViT) models.
Strengths
- Novelty and contribution: The use of Kendall’s τ ranking loss in the context of knowledge distillation appears to be novel and provides a promising way to complement traditional KL divergence-based losses. The ranking-based approach helps the student model better capture inter-class relationships.
- Plug-and-Play nature: The ranking loss is designed to be plug-and-play, which increases its practicality. It can be easily integrated into existing logit-based distillation frameworks without modifying the underlying architecture.
- Intensive experiments: The paper provides a wide range of experiments on different datasets and architecture combinations, demonstrating the robustness and generalizability of the proposed ranking loss.
- Addressing suboptimal points: The paper provides convincing arguments about how ranking loss helps in avoiding suboptimal solutions often seen in KL divergence optimization. The experimental results back up these claims, particularly in the analysis of accuracy and loss curves.
Weaknesses
- Limited ablation study on hyperparameters. The authors only discuss the effect of the hyperparameter k in the ranking loss; there is limited analysis of how sensitive the model is to different values of α, β, and γ in the overall loss function.
- The relation with other distillation losses is not clear. The paper gives some explanation of why the ranking loss works through its gradient form. However, since this loss is not used on its own, the authors should discuss its relation with the KD loss. KD constrains the logits after the softmax, whereas the ranking loss applies its constraint before the softmax; is it really necessary? I am not convinced by this.
- Some of the derivations involving the ranking loss (e.g., the differentiable form of Kendall's τ coefficient) are challenging to follow due to their dense notation and lack of intermediate steps. Please consider adding more explanation or a flowchart to increase readability.
Questions
- For the experiments involving different values of k and the comparison of different ranking loss forms, have you considered the effect of different initializations of the student model? The stability and sensitivity of the results with respect to different initial conditions could provide additional insights.
- RKD is another important loss for distillation. Have you tried combining your loss with RKD (relational knowledge distillation)?
We sincerely thank you for the in-depth feedback! We highly value each of your comments, and all your concerns are addressed point by point:
Weakness1. Limited Ablation Study on Hyperparameters:
- We perform sensitivity analyses for α and β; the results are presented in Tables R1.1 and R1.2, and the corresponding plots are added to Figure 9 in the Appendix of the paper. Performance remains stable across the tested values of both coefficients, and our method achieves decent performance in various settings, illustrating its robustness and generalization ability.
- In addition, we have also performed an ablation on the remaining hyperparameter, with results shown in Table 6 of the paper.
Table R1.1: Sensitivity of α. Teacher is WRN-40-2 and student is WRN-40-1.
74.17 73.97 73.74 73.54 73.33
74.25 74.80 74.68 74.28 73.95
74.44 74.43 74.80 74.66 74.31
74.44 74.33 74.29 74.33 74.78
74.15 74.52 74.07 74.71 74.49
Table R1.2: Sensitivity of β. Teacher is WRN-40-2 and student is WRN-40-1.
74.15 74.52 74.07 74.71 74.49
74.38 74.15 74.48 74.76 74.66
74.47 74.84 74.06 74.55 74.29
74.23 74.51 74.41 74.65 74.75
74.11 74.31 74.72 74.41 74.81
Weakness2. Relation with Other Distillation Losses is Not Clear:
- The difference between the ranking loss and the KD loss is not whether they are applied before or after softmax. In fact, both are essentially constraints on the logits, as discussed in Sec. 4.2.2. The role of softmax is mainly to scale the logits uniformly, while our ranking loss does not require uniform scaling to function. We do not apply softmax before the ranking loss because softmax does not alter the ranking of the channels (a minimal check is shown after this list) and may increase the computational overhead of the gradient calculation, which is unnecessary for a ranking loss.
- The KD loss focuses more on intra-channel matching, while the ranking loss emphasizes inter-channel relationship matching. Additionally, the KD loss generally prioritizes large-value channels (as discussed in Eq. 1 of the paper), whereas the ranking loss considers all channels in a balanced manner. We believe that the ranking loss provides rich inter-class knowledge, which complements the KD loss by facilitating a more effective optimization process.
Weakness3. Dense Notation and Lack of Intermediate Steps:
- We appreciate the reminder. We have added intermediate steps and more explanation to the derivations in Section 4.1 and Appendix A.4 in the updated PDF to aid understanding.
Question1. Ablation of Different Initializations:
- In fact, we have conducted an ablation over different initializations, and we find that the choice of teacher-student pair has a greater impact on the optimal value of k than the initialization does. We present ablation experiments with different random initialization seeds in Table R1.3; in general, our method works well when k is greater than 2.
Table R1.3: Ablation on Different Initializations. Teacher is ResNet32×4 and student is ResNet8×4.

| seed=152 | seed=386 | seed=2347 |
|---|---|---|
| 74.01 | 73.80 | 73.54 |
| 74.24 | 74.48 | 74.45 |
| 74.74 | 74.74 | 74.44 |
| 75.35 | 75.37 | 74.98 |
| 75.59 | 75.65 | 75.37 |
| 75.44 | 75.51 | 75.31 |
Question2. Ranking Loss with RKD:
- We combine the ranking loss with the RKD[1] method and perform ablation experiments; the results are shown in Table R1.4, which also illustrates that the improvement brought by our approach is consistent and general.
Table R1.4: Experiments on Combining Other Methods.
| Teacher -> Student | ResNet32×4 -> ResNet8×4 | ResNet50 -> MobileNet-V2 | WRN-40-2 -> WRN-40-1 |
|---|---|---|---|
| RKD | 72.61 | 59.70 | 72.29 |
| RKD+Ours | 73.97 | 69.39 | 73.93 |
[1] Relational knowledge distillation. CVPR 2019
We sincerely appreciate your reply!
We apologize for any misunderstanding that may have occurred in our previous response. The following is a further explanation of Weakness 2, which may address your concerns:
W2.1 The question of applying constraints before and after the softmax function
- We emphasize that although the KL divergence is computed on the distributions obtained after softmax, consistency of these distributions is equivalent to a linear alignment of the logits before softmax, as shown in Eq. 12 in the paper (a short derivation under the standard softmax definition is given below). Therefore, from the perspective of the logits, the post-softmax KL loss can be seen as a linear alignment constraint on the pre-softmax logits. The ranking loss constrains the ranking of the pre-softmax logits across channels and can act as an auxiliary to the KD loss at the logit level.
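For completeness, the equivalence follows directly from the definition of softmax (this is the generic argument; Eq. 12 in the paper is not reproduced here):

$$
\operatorname{softmax}(z^{s})_i = \operatorname{softmax}(z^{t})_i \;\;\forall i
\;\Longleftrightarrow\;
\frac{e^{z^{s}_i}}{\sum_j e^{z^{s}_j}} = \frac{e^{z^{t}_i}}{\sum_j e^{z^{t}_j}} \;\;\forall i
\;\Longleftrightarrow\;
z^{s}_i = z^{t}_i + \log\frac{\sum_j e^{z^{s}_j}}{\sum_j e^{z^{t}_j}} \;\;\forall i,
$$

i.e., matching the post-softmax distributions pins the logits down only up to a common additive shift (a linear alignment), while the ranking loss directly constrains the relative order of the logits.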
W2.2 Whether the ranking loss is redundant
- Based on our understanding, you may be concerned that softmax already encodes inter-channel relationships and therefore overlaps with the ranking loss. To address this concern, we conducted experiments in which the ranking loss is applied after the softmax function, as shown in Table R1.5. The results indicate that the ranking loss applied after softmax still brings significant performance improvements. This suggests that the inter-channel relationships emphasized by the ranking loss are not captured by softmax, i.e., the relational knowledge provided by softmax alone is insufficient. Our method delivers improvements whether applied before or after softmax.
- Regarding your comment that "If the KD loss is small enough, the ranking loss will also be small," we would like to point out that, due to the capacity gap between the teacher and the student, it is challenging for the student to mimic the teacher's outputs completely, so achieving a sufficiently small KD loss is difficult. In Figure 3 of the paper, we present the loss curves for both KD and KD+Ours. Although both methods exhibit small KD losses in the later stages of training (right figure), the ranking loss computed from the outputs of the KD method remains significantly higher than that of KD+Ours (middle figure). This indicates that even towards the end of training, the KD loss in the traditional KD method is still not small enough to guarantee a minimal ranking loss. Figure 8 in the Appendix reinforces this point: the channel ranking of KD is still not aligned after training (left figure). This scenario can also be analyzed with the toy case in Figure 2 of the paper, where student 1 has channel values closer to the teacher, resulting in a smaller KD loss, yet its channel ranking is still incorrect, which leads to misclassification. This illustrates that merely minimizing the KD loss without considering channel ranking may not be sufficient for optimal model performance.
Table R1.5: More Ablation Experiments.
| Teacher -> Student | ResNet32×4 -> ResNet8×4 | ResNet50 -> MobileNet-V2 | WRN-40-2 -> WRN-40-1 | ResNet32×4 -> SHN-V1 | WRN-40-2 -> SHN-V1 |
|---|---|---|---|---|---|
| KD | 73.33 | 67.35 | 73.54 | 74.07 | 74.83 |
| KD+Ours(after softmax) | 74.01 | 69.37 | 74.07 | 75.02 | 75.53 |
Dear Reviewer pmPS,
Thank you very much for raising the score! We greatly appreciate the time and effort you have taken to help us improve our work. We will continue to refine and enhance our work accordingly.
Best regards,
Authors
We sincerely appreciate all the reviewers for their thoughtful feedback and their time. We are encouraged to see that they found the insight in our paper interesting (Rev. pmPS, eu6q) and found the experiments to be well executed (pmPS, 6ebV, n8Cd) with valuable ablations (Rev. n8Cd). We are also delighted to see that our method's effectiveness and generalization ability are noticed by all reviewers. We have made numerous updates to the submission and hope these address your concerns.
One issue shared by multiple reviewers is the visualization of the attention received by smaller channels. We visualize the gradients and ranking alignments of all channels under the different losses to illustrate that smaller channels are emphasized by our method. The visualizations can be seen in Figures 7 and 8 in the paper.
Another widely raised issue is the ablation experiments combining our loss with other methods. We combine it with the feature-based method RKD and with DIST, a logit-based method that does not use KL divergence, and improve their performance, as shown in Tables G1.1 and G1.2, to further demonstrate the effectiveness and generalization ability of our method.
Table G1.1: Experiments on Combining Other Methods.
| Teacher -> Student | ResNet32×4 -> ResNet8×4 | ResNet50 -> MobileNet-V2 | WRN-40-2 -> WRN-40-1 |
|---|---|---|---|
| RKD | 72.61 | 59.70 | 72.29 |
| RKD+Ours | 73.97 | 69.39 | 73.93 |
Table G1.2: Experiments on Combining Other Methods.
| Teacher -> Student | ResNet32×4 -> WRN-16-2 | ResNet32×4 -> WRN-40-2 | WRN-40-2 -> SHN-V1 | VGG13 -> VGG8 | WRN-40-2 -> WRN-40-1 |
|---|---|---|---|---|---|
| DIST[1] | 75.58 | 78.02 | 76.00 | 73.80 | 74.73 |
| DIST[1]+Ours | 75.85(+0.27) | 78.54(+0.52) | 76.23(+0.23) | 74.06(+0.26) | 74.86(+0.13) |
Additionally, we would like to bring to your attention that there are fewer than 4 days until the end of the discussion period. Please feel free to let us know if you have any further questions; we will try our best to address your concerns.
We are looking forward to your reply.
Dear Chairs and Reviewers,
Thank you all for your time and consideration. We truly appreciate your valuable feedback and constructive discussions.
We would like to express our sincere gratitude to Reviewer pmPS for your valuable feedback and for raising your score.
We would also like to kindly remind Reviewers 6ebV, eu6q, and n8Cd that the discussion period is nearing its conclusion (in about 24 hours). We would greatly appreciate your responses. If you have any further questions or concerns, please feel free to let us know.
We are looking forward to your reply.
Warm regards,
The authors
The paper introduces a plug-and-play ranking loss based on Kendall’s τ Coefficient to enhance existing KL divergence-based distillation methods. The authors highlight that gradients provided by KL divergence are influenced by channel scale and often overlook low-probability channels. Experiments conducted on CIFAR-100, ImageNet, and COCO demonstrate consistent improvements across various teacher-student combinations.
This paper received mixed scores. While the experiments demonstrate that the proposed ranking loss improves classification performance on several baseline models, there are significant concerns that need to be addressed.
Firstly, the paper lacks sufficient comparisons and explanations regarding prior methods. For instance, Figure 2 in this paper bears a strong resemblance to LSKD [1], yet there is no comparison or detailed analysis with LSKD on ImageNet. Instead, only results on CIFAR-100 are provided, which is less convincing for a robust evaluation. In addition, the proposed ranking loss shares a similar intent with the intra-class relations introduced in DIST [2]. However, the paper lacks the necessary analysis and experiments to demonstrate whether the ranking loss can further enhance DIST or achieve better performance compared to combining MLKD with DIST. This absence raises concerns about the novelty and the competitive positioning of the proposed method.
[1] Sun, Shangquan, et al. "Logit standardization in knowledge distillation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
[2] Huang, Tao, et al. "Knowledge distillation from a stronger teacher." Advances in Neural Information Processing Systems 35 (2022): 33716-33727.
Secondly, there is an inconsistency between the paper’s claims and its proposed method. The authors critique KL divergence for overlooking low-probability channels. However, the proposed ranking loss focuses on enforcing ranking consistency between the teacher and student models, without clearly demonstrating how it addresses the role or importance of low-probability channels.
Overall, these gaps raise concerns about the clarity of the paper’s contributions.
Additional Comments on Reviewer Discussion
Both Reviewers eu6q and 6ebV raised concerns regarding the core intention of the proposed method. The main objective of the method is to enforce ranking consistency between the teacher and student models. However, there is no clear evidence that it increases the emphasis on information from lower-probability channels. In the rebuttal, the authors provided an additional Figure 7 in the appendix, which illustrates that the gradient of the original KL divergence is affected by channel scale. While this observation is straightforward from Eq. 1, it fails to demonstrate that the proposed ranking loss enhances attention on channels with smaller logits. Furthermore, the explanation of the gradients is unclear.
Additionally, the ACs noted that the paper lacks sufficient comparisons and explanations regarding prior methods. For instance, Figure 2 in this paper strongly resembles Figure 2 in LSKD (https://arxiv.org/pdf/2403.01427). However, no comparison or detailed analysis with LSKD is provided on ImageNet, which undermines the robustness of the evaluation. Without such comparisons, the claims of the proposed method remain less convincing.
Reject