PaperHub

Rating: 5.2/10 · Poster · 5 reviewers (min 4, max 6, std 0.7)
Individual scores: 5, 5, 6, 4, 6
Confidence: 3.8 · Correctness: 2.4 · Contribution: 2.6 · Presentation: 2.4
NeurIPS 2024

Decoupled Kullback-Leibler Divergence Loss

OpenReview · PDF
Submitted: 2024-05-10 · Updated: 2024-12-19

Abstract

In this paper, we delve deeper into the Kullback–Leibler (KL) Divergence loss and mathematically prove that it is equivalent to the Decoupled Kullback-Leibler (DKL) Divergence loss that consists of 1) a weighted Mean Square Error ($\mathbf{w}$MSE) loss and 2) a Cross-Entropy loss incorporating soft labels. Thanks to the decomposed formulation of DKL loss, we have identified two areas for improvement. Firstly, we address the limitation of KL/DKL in scenarios like knowledge distillation by breaking its asymmetric optimization property. This modification ensures that the $\mathbf{w}$MSE component is always effective during training, providing extra constructive cues. Secondly, we introduce class-wise global information into KL/DKL to mitigate bias from individual samples. With these two enhancements, we derive the Improved Kullback–Leibler (IKL) Divergence loss and evaluate its effectiveness by conducting experiments on CIFAR-10/100 and ImageNet datasets, focusing on adversarial training, and knowledge distillation tasks. The proposed approach achieves new state-of-the-art adversarial robustness on the public leaderboard --- RobustBench and competitive performance on knowledge distillation, demonstrating the substantial practical merits. Our code is available at https://github.com/jiequancui/DKL.
Keywords
Adversarial Training; Knowledge Distillation; Kullback-Leibler Divergence Loss; Long-tail Recognition

Reviews and Discussion

Official Review
Rating: 5

The paper proposes the Decoupled Kullback-Leibler Divergence Loss (DKL), mathematically equivalent to the KL divergence. To solve the asymmetry issue of KL/DKL, the authors introduce the Improved KL divergence, which incorporates class-wise global information and a cross-entropy loss with soft labels. The proposed approach achieves new state-of-the-art adversarial robustness on the public leaderboard.

Strengths

  1. The motivation of this paper is clear.
  2. The results on adversarial robustness show the effectiveness of the proposed method.

Weaknesses

The empirical study is not sufficient to show the effectiveness of IKL compared with the KL divergence.

  1. In the context of adversarial robustness, the author's approach of replacing KL in DM-AT and comparing it with a few other baseline methods is insightful. However, to further validate the effectiveness of IKL compared with KL, it would be beneficial to see this approach applied to other methods such as TRADES or MART [1].
  2. The current adversarial attack experiments, which focus solely on the AA attack, are not sufficient. It would be valuable to see more results with different attacks, such as PGD and CW, to better understand the robustness of IKL.
  3. Regarding knowledge distillation, I wonder about its performance in the semi-supervised setting. Could you please add experiments comparing with Mean-Teacher [2] and provide a toy example to show the effectiveness of IKL compared with KL?

[1] Improving Adversarial Robustness Requires Revisiting Misclassified Examples.
[2] Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results.

Questions

  1. The first terms in Eqs. (5), (6), and (7) are slightly different, but the names are the same; could you please clarify this?
  2. In Table 1, the GI seems more important than BA; could you please explain it and add the results for only the use of GI?
  3. The asymmetric issue can be simply solved by using the symmetric KL divergence $\frac{1}{2}(KL(p,q)+KL(q,p))$ or the Jensen–Shannon divergence. Could you explain the advantages of enabling the gradients proposed in IKL?

Limitations

Yes.

Author Response

Thanks for your suggestions and valuable comments. Here we provide our responses to address your concerns.

Q1: Ablation with TRADES and MART.

Thanks for your suggestion. TRADES is our baseline, and we have already listed its results in Table 1 of the main paper. To highlight the significant improvements from our IKL loss, here we present the experimental results with TRADES and MART on CIFAR-100. Equipped with the IKL loss, robustness is enhanced for both methods. We transfer the hyper-parameters from TRADES to MART; carefully tuning them can potentially further improve performance. We will include the comparisons in the new revision.

| Method | Acc | AA |
| --- | --- | --- |
| KL (TRADES) | 62.87 | 30.29 |
| IKL (Ours) | 63.40 (+0.63) | 31.92 (+1.63) |
| IKL (Ours*) | 66.51 (+3.64) | 31.43 (+1.14) |
| MART | 66.10 | 30.65 |
| MART with IKL (Ours) | 64.44 | 31.22 (+0.57) |

Q2: Evaluation under PGD and CW attacks.

As suggested, we also test the robustness of our models under PGD and CW attacks with 10 and 20 iterations. The perturbation size and step size are set to 8/255 and 2/255, respectively. As shown in the table below, when increasing the iterations from 10 to 20, our models show similar robustness, demonstrating that they do not suffer from the obfuscated gradients problem.

| Method | Acc | PGD-10 | PGD-20 | CW-10 | CW-20 | Auto-Attack |
| --- | --- | --- | --- | --- | --- | --- |
| KL (TRADES) | 62.87 | 36.01 | 35.84 | 40.03 | 39.86 | 30.29 |
| IKL (Ours) | 63.40 | 36.78 | 36.55 | 40.72 | 40.47 | 31.92 |

The worst case is in bold.

Q3: Applications of IKL loss on semi-supervised learning.

Please refer to our global rebuttal Q1.

Q4: The first terms in Eqs. (5), (6), and (7) are slightly different, but the names are the same; could you please clarify this?

Eq. (5) is equivalent to the KL loss regarding gradient optimization. In Eq. (6), the stop-gradient operation $\mathcal{S}(\Delta \mathbf{m})$ is applied because the teacher model is fixed and thus its outputs $\mathbf{o}_{m}$ are detached from gradient backpropagation during distillation. With Eq. (6), we know that the $\mathbf{w}$MSE component has no effect on optimization in knowledge distillation. This issue is caused by the asymmetric property of the KL loss. Thus we propose to break the asymmetric gradient property by enabling the gradients of $\Delta \mathbf{n}$ in Eq. (5) and Eq. (6). As a result, we derive Eq. (7).
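To make this concrete, here is a minimal sketch (our illustration, not code from the paper) of an unweighted $\mathbf{w}$MSE-style term on pairwise logit differences: with the teacher detached and a stop-gradient on $\Delta \mathbf{n}$, the term has no gradient path to the student at all, while enabling the gradient of $\Delta \mathbf{n}$ restores its effect.

```python
import torch

torch.manual_seed(0)
C = 10
o_m = torch.randn(1, C)                       # teacher logits: detached in distillation
o_n = torch.randn(1, C, requires_grad=True)   # student logits

dm = o_m.view(-1, C, 1) - o_m.view(-1, 1, C)  # pairwise differences, Delta m
dn = o_n.view(-1, C, 1) - o_n.view(-1, 1, C)  # pairwise differences, Delta n

# Eq. (6)-style: Delta n carries a stop-gradient; with the teacher also fixed,
# this (unweighted, for brevity) wMSE term cannot influence the student.
wmse_stopped = ((dm - dn.detach()) ** 2).mean()
print(wmse_stopped.requires_grad)             # False: no gradient path exists

# Eq. (7)-style: enabling the gradient of Delta n makes the wMSE term effective.
wmse_enabled = ((dm - dn) ** 2).mean()
g, = torch.autograd.grad(wmse_enabled, o_n)
print(g.abs().sum() > 0)                      # tensor(True): the student receives gradients
```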

Q5: In Table 1, the GI seems more important than BA; could you please explain it and add the results for only the use of GI?

Actually, GI and BA work together for adversarial training. Adversarially trained models only achieve around 60% natural accuracy on CIFAR-100. This low accuracy can lead to heavy sample-wise biases for $\mathbf{w}$MSE during training, which GI properly addresses. Additionally, we observe that, without BA, the model with GI is prone to collapse during adversarial training. The BA mechanism makes training stable and efficient.

Q6: Regarding the asymmetric issue, what are the advantages of IKL loss over JSD loss?

That's a good question. With the following JS divergence loss,

$$\mathrm{JSD}(P\|Q)=\frac{1}{2}\mathrm{KL}(P\|M)+\frac{1}{2}\mathrm{KL}(Q\|M),\quad M=\frac{1}{2}P+\frac{1}{2}Q.$$

Suppose $P$ and $Q$ are the probability distributions of the teacher model and the student model, respectively. We calculate its derivatives regarding $\mathbf{o}_{n}$ (the student logits):

$$\frac{\partial \mathcal{L}_{JSD}}{\partial \mathbf{o}_{n}^{i}}=\sum_{j=1}^{C} \mathbf{w}_{n}^{i,j}\left(\Delta \mathbf{n}_{i,j} - \Delta \mathbf{m}'_{i,j}\right), \quad \text{with the constraint} \quad \mathrm{Softmax}(\mathbf{o}_{m'}) = \frac{1}{2}\mathbf{s}_{n} + \frac{1}{2}\mathbf{s}_{m},$$

where $\mathbf{o}_{m}$ is the logits from the teacher model; $\mathbf{o}_{m'}$ is a virtual logit vector satisfying the above constraint; $\mathbf{s}_{m} = \mathrm{Softmax}(\mathbf{o}_{m})$; $\mathbf{s}_{n} = \mathrm{Softmax}(\mathbf{o}_{n})$; $\Delta \mathbf{m}'_{i,j} = \mathbf{o}_{m'}^{i} - \mathbf{o}_{m'}^{j}$; $\Delta \mathbf{n}_{i,j} = \mathbf{o}_{n}^{i} - \mathbf{o}_{n}^{j}$; $\mathbf{w}_{n}^{i,j}=\mathbf{s}_{n}^{i} \cdot \mathbf{s}_{n}^{j}$.

Correspondingly, the derivatives of the IKL loss regarding $\mathbf{o}_{n}$ (the student logits) are:

$$\frac{\partial \mathcal{L}_{IKL}}{\partial \mathbf{o}_{n}^{i}} = \underbrace{\alpha \sum_{j=1}^{C} \bar{\mathbf{w}}_{m}^{i,j}\left(\Delta \mathbf{n}_{i,j} - \Delta \mathbf{m}_{i,j}\right)}_{\text{Effects of wMSE loss}} + \underbrace{\beta \left( \mathbf{s}_{m}^{i}(\mathbf{s}_{n}^{i} - 1) + \mathbf{s}_{n}^{i}(1 - \mathbf{s}_{m}^{i}) \right)}_{\text{Effects of Cross-Entropy loss}}$$

Although JSD can produce gradients on $\mathbf{o}_{n}$, it has lost the property of the KL loss that forces $\mathbf{o}_{n}$ and $\mathbf{o}_{m}$ to be similar absolutely in the Euclidean space and relatively in the Softmax space.

First, the soft labels from the teacher model often embed dark knowledge and facilitate the optimization of the student model. However, the JSD loss provides no cross-entropy effect with these soft labels.

Second, there could be multiple virtual $\mathbf{o}_{m'}$ satisfying the constraint ($\mathrm{Softmax}(\mathbf{o}_{m'}) = \frac{1}{2}\mathbf{s}_{n} + \frac{1}{2}\mathbf{s}_{m}$). Thus, the JSD loss still only makes $\mathbf{o}_{m}$ and $\mathbf{o}_{n}$ similar relatively in the Softmax space and suffers from the same problem as the KL divergence in scenarios like knowledge distillation.

We also empirically validate that IKL loss achieves better performance on the knowledge distillation task. As shown in Table 33 in the rebuttal_tables.pdf, the model trained with IKL loss surpasses the models trained by KL and JSD losses by 1.44% and 2.03% top-1 accuracy respectively.
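For intuition, a small autograd sketch (ours, not from the paper) shows how the two losses drive the student logits differently; the KL gradient reduces to the Cross-Entropy-style term $\mathbf{s}_{n} - \mathbf{s}_{m}$ of Eq. (4), while the JSD gradient does not:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
o_m = torch.randn(1, 5)                       # teacher logits (fixed)
o_n = torch.randn(1, 5, requires_grad=True)   # student logits

p = F.softmax(o_m, dim=1)                     # P (teacher)
q = F.softmax(o_n, dim=1)                     # Q (student)

# KL(P || Q): F.kl_div(input, target) computes sum target * (log target - input)
kl = F.kl_div(q.log(), p, reduction="sum")

# JSD(P || Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M), with M = (P + Q) / 2
m = 0.5 * (p + q)
jsd = 0.5 * F.kl_div(m.log(), p, reduction="sum") \
    + 0.5 * F.kl_div(m.log(), q, reduction="sum")

g_kl, = torch.autograd.grad(kl, o_n, retain_graph=True)
g_jsd, = torch.autograd.grad(jsd, o_n)
print(torch.allclose(g_kl, q - p))            # True: the CE-style gradient of Eq. (4)
print(g_jsd)                                  # a different gradient without that term
```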

Comment

Thank you so much for addressing my concerns and answering my questions. Nevertheless, the paper's presentation should be improved. I will keep my score.

Comment

Thanks for your suggestions and valuable comments on our paper. We promise to carefully consider all the reviewers' suggestions and revise the manuscript accordingly to enhance the overall quality of the paper.

Official Review
Rating: 5

The paper investigates the Kullback-Leibler (KL) Divergence loss and demonstrates its equivalence to the Decoupled Kullback-Leibler (DKL) Divergence loss in a limited setting, which consists of a weighted Mean Square Error (wMSE) and a Cross-Entropy loss with soft labels. By addressing the asymmetric optimization property of KL/DKL loss and incorporating class-wise global information, the authors propose the Improved Kullback-Leibler (IKL) Divergence loss. Experimental evaluations on CIFAR-10/100 and ImageNet datasets highlight the effectiveness of IKL loss in enhancing adversarial robustness and knowledge distillation tasks.

Strengths

  1. The writing is clear and easy to understand.

Weaknesses

  1. It should be explicitly clarified that Theorem 1 and the subsequent analysis are solely based on the assumption that the probability distribution is a categorical distribution.
  2. Since the author claims that the proposed method mainly applies to adversarial training and knowledge distillation, a formal connection between these two methods and the KL divergence should be provided.
  3. The KL divergence can be applied to a much wider research area since it is a convenient measure between distributions. The author provides no relevant works regarding the KL divergence and other statistical distances, raising questions about whether the author has a comprehensive understanding of the KL divergence.
  4. Additionally, the related works section is perplexing. The author spends most of the paper describing the KL divergence without defining adversarial training and knowledge distillation, yet the related works section focuses entirely on these two areas.
  5. Lines 150-152 state, 'The asymmetric optimization can cause the wMSE component to be neglected or overlooked when $o_m$ is detached from gradient backpropagation, which is the case for knowledge distillation, potentially leading to performance degradation.' This should be a central investigation target of the paper. The author should provide either empirical or theoretical evidence to support this claim.

Questions

  1. Regarding the asymmetric problem, what is the relation between the proposed DKL and the Jensen-Shannon (JS) divergence?

Limitations

Yes

Author Response

Thanks for your suggestions and valuable comments. Here we provide our responses to address your concerns.

Q1: It should be explicitly clarified that Theorem 1 and the subsequent analysis are solely based on the assumption that the probability distribution is a categorical distribution.

Thanks for your suggestions. We will explicitly clarify that our analysis is based on the assumption that the probability distribution is a categorical distribution in the new revision.

Q2: The connection between KL loss and related tasks.

Thanks for the suggestion. We have described the connections between the KL loss and both adversarial training and knowledge distillation in Sec. 3.1. For clarity, we will reorganize the related work section and Section 3.1 and include other related applications of the KL divergence loss in the new revision.

Moreover, our paper focuses on the applications of the KL divergence loss from the gradient optimization perspective rather than on its role as a measure between distributions. We delve into how the KL divergence loss optimizes models during training in terms of gradient optimization.

Q3: New applications of our IKL loss on semi-supervised learning, knowledge distillation on imbalanced data, and semantic segmentation knowledge distillation.

Besides the adversarial training task and knowledge distillation on balanced data, we conduct experiments on other tasks, including semi-supervised learning, knowledge distillation on imbalanced data, and semantic segmentation knowledge distillation. Experimental results show that our IKL loss can significantly improve model performance compared with the KL loss. On semi-supervised learning, replacing the consistency loss in the Mean-Teacher algorithm with our IKL loss, we achieve a 2.67% improvement. For knowledge distillation on imbalanced data, replacing the KL loss with our IKL loss, our model outperforms the baseline by 1.44% top-1 accuracy. For semantic segmentation distillation, replacing the KL loss with our IKL loss, we achieve 0.5 mIoU gains on the ADE20K dataset.

For details, please refer to our global rebuttal Q1.

Q4: Why do we need to break the asymmetric property of KL loss in scenarios like knowledge distillation?

Please refer to our global rebuttal Q3.

Q5: Regarding the asymmetric problem, what is the relation between the proposed DKL and the Jensen-Shannon (JS) divergence?

That's a good question. With the following JS divergence loss,

$$\mathrm{JSD}(P\|Q)=\frac{1}{2}\mathrm{KL}(P\|M)+\frac{1}{2}\mathrm{KL}(Q\|M),\quad M=\frac{1}{2}P+\frac{1}{2}Q.$$

Suppose $P$ and $Q$ are the probability distributions of the teacher model and the student model, respectively. We calculate its derivatives regarding $\mathbf{o}_{n}$ (the student logits):

$$\frac{\partial \mathcal{L}_{JSD}}{\partial \mathbf{o}_{n}^{i}}=\sum_{j=1}^{C} \mathbf{w}_{n}^{i,j}\left(\Delta \mathbf{n}_{i,j} - \Delta \mathbf{m}'_{i,j}\right), \quad \text{with the constraint} \quad \mathrm{Softmax}(\mathbf{o}_{m'}) = \frac{1}{2}\mathbf{s}_{n} + \frac{1}{2}\mathbf{s}_{m},$$

where $\mathbf{o}_{m}$ is the logits from the teacher model, $\mathbf{o}_{m'}$ is a virtual logit vector satisfying the constraint, $\mathbf{s}_{m} = \mathrm{Softmax}(\mathbf{o}_{m})$, $\mathbf{s}_{n} = \mathrm{Softmax}(\mathbf{o}_{n})$, $\Delta \mathbf{m}'_{i,j} = \mathbf{o}_{m'}^{i} - \mathbf{o}_{m'}^{j}$, $\Delta \mathbf{n}_{i,j} = \mathbf{o}_{n}^{i} - \mathbf{o}_{n}^{j}$, $\mathbf{w}_{n}^{i,j}=\mathbf{s}_{n}^{i} \cdot \mathbf{s}_{n}^{j}$.

Correspondingly, the derivatives of the IKL loss regarding $\mathbf{o}_{n}$ (the student logits) are:

$$\frac{\partial \mathcal{L}_{IKL}}{\partial \mathbf{o}_{n}^{i}} = \underbrace{\alpha \sum_{j=1}^{C} \bar{\mathbf{w}}_{m}^{i,j}\left(\Delta \mathbf{n}_{i,j} - \Delta \mathbf{m}_{i,j}\right)}_{\text{Effects of wMSE loss}} + \underbrace{\beta \left( \mathbf{s}_{m}^{i}(\mathbf{s}_{n}^{i} - 1) + \mathbf{s}_{n}^{i}(1 - \mathbf{s}_{m}^{i}) \right)}_{\text{Effects of Cross-Entropy loss}}$$

Although JSD can produce gradients on $\mathbf{o}_{n}$, it has lost the property of the KL loss that forces $\mathbf{o}_{n}$ and $\mathbf{o}_{m}$ to be similar absolutely in the Euclidean space and relatively in the Softmax space.

First, the soft labels from the teacher model often embed dark knowledge and facilitate the optimization of the student model. However, the JSD loss provides no cross-entropy effect with these soft labels.

Second, there could be multiple virtual $\mathbf{o}_{m'}$ satisfying the above constraint ($\mathrm{Softmax}(\mathbf{o}_{m'}) = \frac{1}{2}\mathbf{s}_{n} + \frac{1}{2}\mathbf{s}_{m}$). Thus, the JSD loss still only makes $\mathbf{o}_{m}$ and $\mathbf{o}_{n}$ similar relatively in the Softmax space and suffers from the same problem as the KL divergence in scenarios like knowledge distillation.

We also empirically validate that IKL loss achieves better performance on the knowledge distillation task. As shown in Table 33 in the rebuttal_tables.pdf, the model trained with IKL loss surpasses the models trained by KL and JSD losses by 1.44% and 2.03% top-1 accuracy respectively.

Comment

Thank you for the detailed response. However, I believe some issues remain unresolved.

Q1 & Q2 & Q3

According to the author’s current reply, the paper specifically focuses on adversarial training and knowledge distillation, particularly in the context of gradient optimization. However, in the first sentence of the Abstract, the author wrote:

we delve deeper into the Kullback–Leibler (KL) Divergence loss and mathematically prove that it is equivalent to the Decoupled Kullback-Leibler (DKL) Divergence loss that consists of 1) a weighted Mean Square Error (wMSE) loss and 2) a Cross-Entropy loss incorporating soft labels.

In line 33, the author states:

We reveal that the KL loss is mathematically equivalent to a composite of a weighted MSE (wMSE) loss and a Cross-Entropy loss employing soft labels.

These conclusions could be considered overclaims since KL divergence is not limited to classification tasks or (adversarial training and knowledge distillation). This raises concerns about the clarity of the paper’s true objectives.

Q4

What do you mean by absolutely and relatively?

Comment

Thanks for your questions. We provide the following responses to address your concerns.

Q1: From the perspective of gradient optimization, the KL loss is mathematically equivalent to a composite of a weighted MSE ($\mathbf{w}$MSE) loss and a Cross-Entropy loss employing soft labels.

Thanks for the question. The KL divergence loss requires that the inputs $\mathbf{s}_{m}$ and $\mathbf{s}_{n}$ be two probability vectors, i.e., $\sum_{j=1}^{C} \mathbf{s}_{m}^{j}=1$ and $\sum_{j=1}^{C} \mathbf{s}_{n}^{j}=1$, where $C$ is the input dimension. Thus we assume that $\mathbf{s}_{n}$ and $\mathbf{s}_{m}$ are derived with Softmax activation, as stated in line 120 of the main paper.

Besides this requirement, our analysis in Theorem 1 makes no other assumptions. Thus, it holds for any scenario where the scores come from a Softmax activation. For clarity, we will stress this assumption in the new revision.

Q2: What do you mean by absolutely and relatively?

In the scenario of knowledge distillation, we expect the student model to mimic the teacher model's behavior.

Given the same input, if the logit feature from the student is the same as that from the teacher, we say the student mimics the teacher absolutely. If the probability output from the student is the same as that from the teacher, we say the student mimics the teacher relatively.

This is because the same logit feature definitely leads to the same probability output, while the same probability output cannot guarantee the same logit feature.
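A tiny numeric illustration (ours): shifting all logits by a constant leaves the Softmax output unchanged, so matching probability outputs (relative similarity) does not pin down the logits themselves (absolute similarity).

```python
import torch
import torch.nn.functional as F

t = torch.tensor([1.0, 2.0, 3.0])   # teacher logits
s = t + 5.0                          # student logits shifted by a constant

print(torch.allclose(F.softmax(t, dim=0), F.softmax(s, dim=0)))  # True: same probabilities
print(torch.dist(t, s))              # tensor(8.6603): far apart in Euclidean space
```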

If there are other concerns, we are glad to discuss them further.

Comment

Thank you for the further response. Nevertheless, I intend to keep my score in the current manuscript given the current structure of the entire paper. I'll wait for other reviewers and see if their concerns are addressed.

Comment

Many thanks for your suggestions and valuable comments; they help us greatly improve the quality of our paper. If there are other concerns, please let us know.

Comment

Dear Reviewer wpCy,

We sincerely appreciate the time and effort you have dedicated to reviewing our paper. Your expert suggestions have been invaluable in helping us refine our work. We have also addressed the concerns raised by Reviewer RheX and promise to thoroughly incorporate all reviewers' feedback to improve the manuscript further.

Would you kindly consider revising your score?

We look forward to your response.

Best regards,

The Authors

Comment

Dear Authors,

Although, in my opinion, this paper can still be improved, given the effort in the discussion session I'm willing to increase my score to borderline accept.

Sincerely, Reviewer

Comment

Thank you for your positive feedback and recommendation. We appreciate your decision to raise the score to borderline acceptance. We are committed to incorporating all the reviewers' suggestions to further enhance the quality of the paper.

Official Review
Rating: 6

In this paper, the authors analyzed the optimization gradient of the commonly used KL Divergence loss in adversarial training and distillation. The KL loss can be reformulated as a Decoupled KL (DKL) Divergence loss through antiderivative operations; the gradients of both are equivalent when the DKL parameters are set to $\alpha = 1$ and $\beta = 1$. The DKL loss consists of 1) a weighted MSE (wMSE) loss term, which encourages similarity between output logits, and 2) a CE loss term, which encourages accurate predictions. The authors highlighted two potential drawbacks of the KL/DKL formulations: 1) asymmetric optimization of both loss terms, and 2) the sample-wise conditioning term $\mathbf{w}_m$, which induces strong variance into the optimization process. The authors proposed two modifications to the DKL loss: 1) enabling the gradients of $\Delta \mathbf{n}$ instead of $\Delta \mathbf{m}$ in the formulation, thereby overcoming asymmetric optimization, and 2) using the global class-wise conditioning term $\bar{\mathbf{w}}_y$, conditioned on global ground truth, instead of $\mathbf{w}_m$. The novel Improved KL Divergence Loss (IKL) is shown to be quantitatively and qualitatively superior to the classic and decoupled KL divergence terms. The authors also performed thorough adversarial robustness and distillation experiments, achieving improved accuracy and robustness across various natural image datasets (CIFAR-10, CIFAR-100, ImageNet), augmentation strategies, and network backbones, outperforming multiple benchmarks.

Strengths

  1. Experiments: Thorough experiments were performed, incorporating various datasets, network backbones, and augmentation strategies.
  2. Clarity: The intuition behind the main concept is clear and straightforward. Readers can easily follow the thought process and justifications of the authors.
  3. Originality: The paper shows the promising potential of IKL in two different application domains, thus highlighting the IKL metric's relative flexibility. Furthermore, the hyperparameter ablations suggest that one can tune the predictive accuracy and adversarial robustness tradeoff. The competitive RobustBench results further substantiate the authors' claims.

Weaknesses

  1. The theoretical evaluation is not convincing enough. There is no further justification for why the specific anti-derivative formulation expressed in Theorem 1 is chosen as the main decoupled KL (DKL) formulation, as there could be multiple anti-derivative formulations of the same function that also lead to equivalent derivative formulations. The mathematical proof provided in the appendix merely shows that applying the chain rule to the KL/DKL loss terms justifies the equivalence of the derivatives.
  2. While the authors have intuitively highlighted the unbalanced and potentially unstable optimization issues with DKL, they did not delve deep into these issues. Further theoretical proofs or gradient analysis of KL/DKL/IKL optimization would provide stronger justification, especially for where exactly the stop-gradient operations are crucial. The following papers have performed gradient analysis and a thorough study of stop-gradient mechanisms.
  • Zhang, C., Zhang, K., Zhang, C., Pham, T. X., Yoo, C. D., and Kweon, I. S. How does SimSiam avoid collapse without negative samples? A unified understanding with self-supervised contrastive learning. International Conference on Learning Representations, 2022. https://arxiv.org/abs/2203.16262
  • Zhuo, Z., Wang, Y., Ma, J., and Wang, Y. Towards a unified theoretical understanding of non-contrastive learning via rank differential mechanism. International Conference on Learning Representations, 2023. https://arxiv.org/abs/2303.02387
  3. The accuracy and robustness results are generally only marginally better than the benchmarks. The t-SNE visualization of the purported benefits of the IKL formulation does not convey a strong additional qualitative advantage. The authors highlighted the lower computational costs of the approach in some of their experiments; I suggest that the authors pivot more on this to demonstrate that IKL is a competitive, relatively computationally lighter method.

Minor: While the structure of the paper is clear, there are multiple grammatical and expression errors. I would recommend stronger proofreading and reformulation for improvement.

Questions

  1. Does the assumption of DKL/KL equivalence in Theorem 1 still hold when the parameters $\alpha$ and $\beta$ are not set to 1? The authors have experimented with the values of these parameters while assuming that the DKL/IKL formulation still results in equivalent gradient formulations.
  2. Various recent works have highlighted the importance of solving the gradient conflict issue for the accuracy/robustness tradeoff in various settings, and some even propose that anti-symmetric gradients are essential to solving gradient conflicts. How can the authors justify their approach compared to these opposing approaches/perspectives?
  • Zhang, C., Zhang, K., Zhang, C., Pham, T. X., Yoo, C. D., and Kweon, I. S. How does SimSiam avoid collapse without negative samples? A unified understanding with self-supervised contrastive learning. International Conference on Learning Representations, 2022. https://arxiv.org/abs/2203.16262
  • Zhuo, Z., Wang, Y., Ma, J., and Wang, Y. Towards a unified theoretical understanding of non-contrastive learning via rank differential mechanism. International Conference on Learning Representations, 2023. https://arxiv.org/abs/2303.02387
  • Yu, T., Kumar, S., Gupta, A., Levine, S., Hausman, K., and Finn, C. Gradient surgery for multi-task learning. Advances in Neural Information Processing Systems, 33: 5824–5836, 2020. https://arxiv.org/abs/2001.06782
  • Wen, Z. and Li, Y. The mechanism of prediction head in non-contrastive self-supervised learning. Advances in Neural Information Processing Systems, 35: 24794–24809, 2022. https://arxiv.org/abs/2205.06226
  • Waseda, F. et al. Rethinking invariance regularization in adversarial training to improve robustness-accuracy trade-off. https://arxiv.org/abs/2402.14648, 2024.

Limitations

The authors have not addressed the limitations of their work. Please provide some potential limitations that can be improved on for future works.

Author Response

Thanks for your suggestions and valuable comments. Here we provide our responses to address your concerns.

Q1: Theoretical evaluation is not convincing enough. There are no further justifications for why the specific anti-derivative formulation expressed in Theorem 1 is chosen as the main decoupled KL (DKL) formulation, as there could be multiple anti-derivative formulations for the same function that also lead to equivalent derivative formulations.

We decouple the KL divergence loss into a $\mathbf{w}$MSE component and a Cross-Entropy component because MSE and Cross-Entropy are the two most popular objectives for image recognition. There are many loss functions that can be used for image classification, such as Mean Absolute Error and Hinge loss. However, Cross-Entropy loss is the most effective one for gradient optimization. Similarly, MSE loss has been validated as one of the most effective loss functions in many tasks, such as object detection and other regression tasks.

With Theorem 1, we know that the KL divergence loss can be decoupled into a $\mathbf{w}$MSE loss and a Cross-Entropy loss. We can then examine how each component affects model performance, which cannot be done by operating directly on the KL divergence loss.

Q2: While the authors have intuitively highlighted the issues of the unbalanced and potentially unstable optimization issues with DKL, the authors did not delve deep into the aforementioned issues. Further theoretical proofs or gradient analysis of KL/DKL/IKL optimization would provide a stronger justification, especially where exactly the stop-gradient operations are crucial.

The stop-gradient operation is just used to ensure that our DKL loss (Theorem 1) is equivalent to the KL divergence loss from the perspective of gradient optimization, which implies that the KL loss is a special case of our DKL/IKL loss.

By examining each component of DKL loss, we identify the potential issues of DKL loss, i.e., the asymmetric gradient property and possible sample-wise biases from hard examples or outliers. Since the DKL loss is equivalent to the KL loss, we consider that the KL loss also suffers from the two problems. Then we propose the IKL loss to address the potential issues.

Q5: We achieve significant improvements over the baseline and state-of-the-art adversarial robustness on public leaderboard --- RobustBench.

First, with the same experimental settings and only replacing the KL loss with the IKL loss, we significantly outperform our baseline TRADES by 1.63% robustness under auto-attack.

Second, without extra techniques from previous algorithms, we achieve state-of-the-art adversarial robustness on the public leaderboard --- RobustBench, under both settings: with the basic augmentation strategy (random crop and horizontal flip), and with advanced data augmentation or synthesized data.

We only replace the KL loss with our IKL loss using the TRADES pipeline and thus our method is much more computationally efficient than LBGAT and ACAT, saving 33.3% training time. Our results are summarized in Table 35 in the rebuttal_tables.pdf.

Q6: Does the DKL/KL equivalence in Theorem 1 still hold when the parameters $\alpha$ and $\beta$ are not set to 1?

The DKL loss is equivalent to the KL loss only when $\alpha=1$ and $\beta=1$. The significance of Theorem 1 is that the KL loss can be decoupled into a weighted MSE loss and a Cross-Entropy loss. The two terms can then be adjusted independently for efficient optimization, which cannot be achieved by directly tuning the hyper-parameters of the KL loss.

Q7: Various recent works have highlighted the importance of solving the gradient conflict issue for the accuracy/robustness trade-off in various settings. Some works even proposed that anti-symmetric gradients are essential to solving gradient conflicts. How can the authors justify their approach compared to these opposing approaches?

That's an interesting question. First, what matters is how the gradients affect training optimization, rather than whether they are symmetric or asymmetric. In our case, the asymmetric gradient property of the KL divergence loss causes the $\mathbf{w}$MSE loss not to work under scenarios like knowledge distillation, leading to optimization problems during training (please refer to our global rebuttal Q3). Thus, we break the asymmetric property and observe performance improvements on several tasks, including KD, AT, and semi-supervised learning (please refer to our global rebuttal Q1).

Second, as you mentioned, some works like [ref4] propose that anti-symmetric gradients can solve gradient conflicts between the invariance loss and the classification loss and promote model robustness. In that paper, the key is still to remove the harmful gradients; asymmetric gradient optimization is just one way to solve gradient conflicts in this case. Thus, we should always pay attention to how the gradients affect the optimization process rather than to the symmetric or asymmetric properties of gradients. We think this is a promising direction for deep learning optimization, not only in robustness. We will include the relevant discussion in the new revision, along with the mentioned works on stop-gradient for unsupervised representation learning.

[ref4] Rethinking Invariance Regularization in Adversarial Training to Improve Robustness-Accuracy Trade-off. arXiv:2402.14648.

[ref5] How Does SimSiam Avoid Collapse Without Negative Samples? A Unified Understanding with Self-supervised Contrastive Learning. ICLR 2022.

[ref6] Towards a Unified Theoretical Understanding of Non-contrastive Learning via Rank Differential Mechanism. ICLR 2023.

[ref7] The Mechanism of Prediction Head in Non-contrastive Self-supervised Learning. NeurIPS 2022.

[ref8] Gradient surgery for multi-task learning. NeurIPS 2020.

Comment

Dear Reviewer hZqZ,

We sincerely thank you for your precious time and efforts in reviewing our paper. We greatly appreciate your insightful and detailed feedback, and we have carefully addressed the concerns in the rebuttal. Please let us know if any aspects remain unclear, and we are happy to provide further clarification.

We are looking forward to your reply.

Best regards,

The Authors

Comment

Dear Reviewer hZqZ,

We sincerely thank you for your precious time and efforts in reviewing our paper. We believe our responses have addressed your concerns. If you have any additional questions or require further clarification, we would be happy to provide it.

We are looking forward to your reply.

Best,
The Authors

Official Review
Rating: 4

This paper demonstrates that KL divergence can be decoupled into a weighted mean square error (wMSE) loss term and a cross-entropy term with soft labels. Based on this decoupling, the authors propose an improved version of the KL loss. In the context of knowledge distillation, the proposed method addresses issues with the student's wMSE loss. It also adjusts the weight term to avoid being determined by incorrect labels, instead using supervision. The proposed method is tested in two scenarios where KL divergence is used: knowledge distillation and adversarial training, achieving higher performance than existing methods in both cases.

Strengths

  • The paper provides a theoretical proof that the KL divergence can be decoupled into two terms and improves on the original loss, which is a significant contribution.
  • The proposed method achieves higher performance than most of the existing methods in both knowledge distillation and adversarial training scenarios.

Weaknesses

  • The paper lacks comparisons with the latest state-of-the-art methods, particularly in the knowledge distillation benchmark and RobustBench. The performance is also inferior compared to methods like [1].
  • References are not provided for the methods mentioned, making it difficult to identify the specific works being referred to, which significantly hampers readability.
  • The overall readability of the paper is low, with figures and text that are not self-explanatory. For example, Figure 1 is unclear without reading the full explanation, and terms like $o_m$ and $o_n$ are not adequately explained until later in the text. Additionally, the term "Inserting Class-wise Global Information" used for the weight term replacement in Eq. (5) is misleading and lacks a clear explanation.
  • The proof approach for decoupling KL divergence is not straightforward. Instead of deriving DKL from KL divergence, the authors present DKL first and then show its equivalence through gradients, which obscures the understanding of the derivation process.
  • The benefits of and reasons why the $\Delta \mathbf{n}$ term in Eq. (7) should receive gradients are not well explained. The phrase "potentially hurt performance" is vague and misleading.

[1] Multi-level Logit Distillation, Jin et al., CVPR 2023

Questions

  • The overall readability of the paper needs improvement. Figures should clearly illustrate the proposed method and its benefits. The proof method in Eq. 5 should be more understandable.

  • The paper should include comprehensive comparisons with all current state-of-the-art methods, especially showing higher performance than existing logit-based KD methods.

  • The authors should clearly explain the problems arising from the $\Delta \mathbf{n}$ term not receiving gradients in Eq. (7) and the benefits of it receiving gradients.

Limitations

The authors have adequately addressed some limitations of their work. There are no negative societal impacts.

Author Response

Thanks for your suggestions and valuable comments. Here we provide our responses to address your concerns.

Q1: We achieve significant improvements over the baseline and state-of-the-art adversarial robustness on RobustBench

For adversarial robustness

First, with the same experimental settings and only replacing the KL loss with the IKL loss, we significantly outperform our baseline TRADES by 1.63% robustness under auto-attack.

Second, without extra techniques from previous algorithms, we achieve state-of-the-art adversarial robustness on the public leaderboard --- RobustBench, under both settings: with the basic augmentation strategy (random crop and horizontal flip), and with advanced data augmentation or synthesized data.

We only replace the KL loss with our IKL loss in the TRADES pipeline, and thus our method is much more computationally efficient than LBGAT and ACAT, saving 33.3% of training time. Our results are summarized in Table 35 in the rebuttal_tables.pdf.

For knowledge distillation

On knowledge distillation, for the mentioned work "Multi-level Logit Distillation" [ref3], we observe that it uses a weak-augmentation (RandomResizedCrop and random horizontal flip) data batch and a strong-augmentation (AutoAug) data batch in each training iteration. For fair comparison, based on their open-sourced code (https://github.com/Jin-Ying/Multi-Level-Logit-Distillation), we also apply the same pipeline to our IKL loss. The experimental results are summarized in Table 36 in the rebuttal_tables.pdf. Under the same experimental setting, we achieve better performance, outperforming [ref3] by 0.35% (ResNet-34 → ResNet-18) and 0.33% (ResNet-50 → MobileNet) on ImageNet.

[ref3] Multi-level Logit Distillation. CVPR 2023.

Q2: About the references, terms, and Figure 1.

Thanks for your suggestions. We will add citations properly, revise Figure 1 for a clearer explanation, and define the terms systematically in the new version.

Q3: The term "Inserting Class-wise Global Information" used for the weight term replacement in Eq.(5) is misleading and lacks a clear explanation.

In Eq. (5), we don't yet discuss global class-wise information. We first propose Theorem 1 and identify that the DKL loss can suffer from sample-wise biases because the $\mathbf{w}$MSE component depends on sample-wise prediction scores. Especially in adversarial training, natural accuracy is only around 60% on CIFAR-100, so there can be many hard examples or outliers that cause sample biases. The KL loss is equivalent to the DKL loss in terms of gradient optimization, so it also suffers from this problem. We then address this potential issue in Section 3.3 by introducing global class-wise information. For more details, please refer to our analysis in Section 3.4.

Q4: The proof approach for decoupling KL divergence is not straightforward. Instead of deriving DKL from KL divergence, the authors present DKL first and then show its equivalence through gradients, which obscures the understanding of the derivation process.

In Section 3.2, we first analyze the gradient information of the KL divergence loss regarding $o_m$ and $o_n$. Then, based on the structured gradient information, we construct the DKL formulation and propose Theorem 1. Finally, we give our proof of the equivalence between the KL divergence loss and our DKL loss in the Appendix.

Q5: The benefits and reasons for why the term in Eq. (7) should receive gradients are not well explained.

Please refer to our global rebuttal Q3.

Comment

Dear Reviewer nZvw,

We sincerely thank you for your precious time and efforts in reviewing our paper. We greatly appreciate your insightful and detailed feedback, and we have carefully addressed the concerns in the rebuttal. Please let us know if any aspects remain unclear, and we are happy to provide further clarification.

We are looking forward to your reply.

Best regards,

The Authors

Comment

Dear Reviewer nZvw,

We sincerely thank you for your precious time and efforts in reviewing our paper. We believe our responses have addressed your concerns. If you have any additional questions or require further clarification, we would be happy to provide it.

We are looking forward to your reply.

Best,
The Authors

Official Review
Rating: 6

The paper presents an in-depth analysis of the Kullback-Leibler (KL) Divergence loss, a critical component in training deep learning models. The authors mathematically demonstrate that the KL loss is equivalent to a Decoupled Kullback-Leibler (DKL) loss, which is composed of a weighted Mean Square Error (wMSE) loss and a Cross-Entropy loss with soft labels. They identify limitations of the KL/DKL loss, particularly in scenarios like knowledge distillation, and propose improvements by introducing an Improved Kullback-Leibler (IKL) Divergence loss. This IKL loss breaks the asymmetric optimization property of the original KL loss and incorporates class-wise global information to mitigate biases from individual samples. The paper evaluates the IKL loss on CIFAR-10/100 and ImageNet datasets, focusing on adversarial training and knowledge distillation tasks. The results show that the proposed IKL loss achieves state-of-the-art adversarial robustness and competitive performance in knowledge distillation.

Strengths

  1. The paper provides a novel theoretical perspective on the KL loss by proving its equivalence to the DKL loss, offering a new understanding of gradient optimization in deep learning.
  2. The IKL loss addresses specific limitations of the KL loss and demonstrates improved performance in adversarial training and knowledge distillation, which are significant for real-world applications.
  3. Comprehensive experiments on standard datasets like CIFAR-10/100 and ImageNet validate the effectiveness of the proposed IKL loss, enhancing the paper's credibility.

Weaknesses

  1. While the IKL loss shows promising results on the mentioned datasets, it is unclear how well these improvements would generalize to other types of data or tasks outside the scope of the paper.
  2. The introduction of the IKL loss adds complexity to the model training process, which might be a concern for practitioners looking for simpler solutions.
  3. The paper does not provide a detailed analysis of the computational cost associated with the IKL loss, which is important for resource-constrained environments.

Questions

  1. How does the IKL loss perform in comparison to other contemporary loss functions that are designed for specific tasks, such as object detection or semantic segmentation?
  2. What are the implications of the IKL loss for models that are trained with limited data, and how does it handle class imbalances?
  3. Can the authors provide more insight into the selection of the hyperparameters α and β in the IKL loss, and how sensitive are the results to these values?
  4. The paper mentions the use of class-wise global information; how does this interact with models that are trained in a semi-supervised or unsupervised setting?
  5. How does the IKL loss handle adversarial examples that are crafted to be more sophisticated than those generated by the Auto-Attack method used in the paper?

Limitations

See the Weaknesses and Questions sections.

Author Response

Thanks for your suggestions and valuable comments. Here we provide our responses to address your concerns.

Q1: Besides the adversarial training and the knowledge distillation on balanced data, how does the IKL loss perform on other tasks, like semi-supervised learning, knowledge distillation on imbalanced data, and semantic segmentation knowledge distillation?

Thanks for your suggestions. Besides the adversarial training task and knowledge distillation on balanced data, we conduct experiments on other tasks, including semi-supervised learning, knowledge distillation on imbalanced data, and semantic segmentation knowledge distillation. Experimental results show that our IKL loss can significantly improve model performance compared with the KL loss. For details, please refer to our global rebuttal Q1.

Q2: Pseudo code of IKL loss and complexity analysis.

Compared with the KL loss, the IKL loss only needs to additionally update the global class-wise prediction scores $W \in \mathbb{R}^{C \times C}$, where $C$ is the number of classes, during training. This extra computational cost is nearly negligible compared with the model forward and backward passes. Algorithm 1 shows the implementation of our IKL loss in PyTorch style. On dense prediction tasks like semantic segmentation, $\Delta_a$ and $\Delta_b$ can require large GPU memory due to the large number of pixels. Here, we also provide a memory-efficient implementation for the $\mathbf{w}$MSE loss component, listed in Algorithm 2.

Algorithm 1: Pseudocode for the DKL/IKL loss in PyTorch style

Input: $logits_{a}, logits_{b} \in \mathbb{R}^{B \times C}$; one-hot label $Y$; $W \in \mathbb{R}^{C \times C}$; $\alpha$ and $\beta$.

class_scores = Y @ W
# per-sample weights: outer product of global class-wise scores
sample_weights = class_scores.view(-1, C, 1) @ class_scores.view(-1, 1, C)
delta_a = logits_a.view(-1, C, 1) - logits_a.view(-1, 1, C)
delta_b = logits_b.view(-1, C, 1) - logits_b.view(-1, 1, C)
wMSE_loss = (torch.pow(delta_b - delta_a, 2) * sample_weights).sum(dim=(1, 2)).mean() / 4
score_a = F.softmax(logits_a, dim=1).detach()
log_score_b = F.log_softmax(logits_b, dim=-1)
CE_loss = -(score_a * log_score_b).sum(1).mean()
return beta * CE_loss + alpha * wMSE_loss

Algorithm 2: Memory-efficient implementation of wMSE in PyTorch style

Input: $logits_{a}, logits_{b} \in \mathbb{R}^{B \times C}$; one-hot label $Y$; $W \in \mathbb{R}^{C \times C}$.

class_scores = Y @ W
loss_a = (class_scores * logits_a * logits_a).sum(dim=1) * 2 - torch.pow((class_scores * logits_a).sum(dim=1), 2) * 2
loss_b = (class_scores * logits_b * logits_b).sum(dim=1) * 2 - torch.pow((class_scores * logits_b).sum(dim=1), 2) * 2
loss_ex = (class_scores * logits_a * logits_b).sum(dim=1) * 4 - (class_scores * logits_a).sum(dim=1) * (class_scores * logits_b).sum(dim=1) * 4
wMSE_loss = (loss_a + loss_b - loss_ex).mean() / 4
return wMSE_loss
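As a quick sanity check (our own snippet, not part of the rebuttal), the two implementations agree whenever each row of class_scores is a probability vector summing to 1, which holds in practice since the rows of $W$ store averaged Softmax scores:

```python
import torch

torch.manual_seed(0)
B, C = 4, 10
logits_a, logits_b = torch.randn(B, C), torch.randn(B, C)
# stand-in for Y @ W: each row is a probability vector, as rows of W are
class_scores = torch.softmax(torch.randn(B, C), dim=1)

# Algorithm 1 form (materializes B x C x C tensors)
w = class_scores.view(-1, C, 1) @ class_scores.view(-1, 1, C)
da = logits_a.view(-1, C, 1) - logits_a.view(-1, 1, C)
db = logits_b.view(-1, C, 1) - logits_b.view(-1, 1, C)
loss1 = (torch.pow(db - da, 2) * w).sum(dim=(1, 2)).mean() / 4

# Algorithm 2 form (only B x C tensors)
la = (class_scores * logits_a ** 2).sum(1) * 2 - (class_scores * logits_a).sum(1) ** 2 * 2
lb = (class_scores * logits_b ** 2).sum(1) * 2 - (class_scores * logits_b).sum(1) ** 2 * 2
lx = (class_scores * logits_a * logits_b).sum(1) * 4 \
   - (class_scores * logits_a).sum(1) * (class_scores * logits_b).sum(1) * 4
loss2 = ((la + lb - lx) / 4).mean()

print(torch.allclose(loss1, loss2, atol=1e-5))  # True
```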

Q3: Can the authors provide more insight into the selection of the hyperparameters $\alpha$ and $\beta$ in the IKL loss, and how sensitive are the results to these values?

For tasks using the KL divergence loss, like knowledge distillation, adversarial training, and semi-supervised learning, we have priors for setting the loss weight $\gamma$ of the KL divergence loss. When replacing the KL divergence loss with our IKL loss, we first set $\beta = \gamma$ for the Cross-Entropy component. Then, we perform a grid search over $\frac{\alpha}{4} \in \{1, 2, 3, 4, 5\}$. After determining the value of $\frac{\alpha}{4}$, we adjust $\beta$ again to achieve the best performance on validation data if needed. As shown in Table 15 of the Appendix, within a reasonable range $\frac{\alpha}{4} \in \{3,4,5,6\}$ and $\beta \in \{2,3,4,5\}$, when increasing $\frac{\alpha}{4}$ and $\beta$, the performance first increases steadily and then decreases.

Q4: How does the class-wise global information interact with models that are trained in a semi-supervised or unsupervised setting?

In semi-supervised or unsupervised settings, we can often obtain pseudo labels for the data. Equipped with a threshold $\gamma$, we filter out the samples whose maximum confidence score is lower than $\gamma$, as in FixMatch [ref2]. The remaining samples can then be used to update the global class-wise prediction scores, as sketched below.

[ref2] FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence. NeurIPS 2020.
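A hypothetical sketch of this update (the exact rule below — a confidence threshold plus an EMA — is our assumption; the rebuttal only specifies the FixMatch-style filtering):

```python
import torch
import torch.nn.functional as F

def update_class_scores(W, logits, gamma=0.95, momentum=0.99):
    """Hypothetical EMA update of the global class-wise score matrix W (C x C):
    row c tracks the mean prediction score of samples pseudo-labeled as class c."""
    probs = F.softmax(logits, dim=1)
    conf, pseudo = probs.max(dim=1)
    keep = conf > gamma                       # FixMatch-style confidence filtering
    for c in pseudo[keep].unique():
        mean_c = probs[keep][pseudo[keep] == c].mean(dim=0)
        W[c] = momentum * W[c] + (1 - momentum) * mean_c
    return W
```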

Q5: How does the IKL loss handle adversarial examples that are crafted to be more sophisticated than those generated by the Auto-Attack method used in the paper?

The Auto-Attack is one of the strongest attack methods; it is an ensemble of several adversarial attacks, including APGD-CE, APGD-DLR, FAB, and Square Attack. Here we also test the robustness of our models under PGD and CW attacks with 10 and 20 iterations. The perturbation size and step size are set to 8/255 and 2/255, respectively. As shown in the table below, when increasing the iterations from 10 to 20, our models show similar robustness, demonstrating that they do not suffer from the obfuscated gradients problem.

| Method | Acc | PGD-10 | PGD-20 | CW-10 | CW-20 | Auto-Attack |
| --- | --- | --- | --- | --- | --- | --- |
| KL (TRADES) | 62.87 | 36.01 | 35.84 | 40.03 | 39.86 | 30.29 |
| IKL (Ours) | 63.40 | 36.78 | 36.55 | 40.72 | 40.47 | 31.92 |

The worst case is in bold.

Comment

Dear Reviewer JUtg,

We sincerely thank you for your precious time and efforts in reviewing our paper. We greatly appreciate your insightful and detailed feedback, and we have carefully addressed the concerns in the rebuttal. Please let us know if any aspects remain unclear, and we are happy to provide further clarification.

We are looking forward to your reply.

Best regards,

The Authors

Author Response (Global Rebuttal)

We thank all the reviewers for their efforts and valuable comments on our paper. Here we respond to the questions of greatest common concern.

Q1. New applications of our IKL loss in semi-supervised learning, knowledge distillation on imbalanced data, and semantic segmentation knowledge distillation.

Thanks for the suggestions. Besides the adversarial training task and knowledge distillation on balanced data, we conduct experiments on other tasks, including semi-supervised learning, knowledge distillation on imbalanced data, and semantic segmentation knowledge distillation.

Semi-supervised learning. We use the open-sourced code from https://github.com/microsoft/Semi-supervised-learning and conduct semi-supervised experiments on CIFAR-100 with the FixMatch and Mean-Teacher methods. Specifically, each class has 2 labeled images and 500 unlabeled images. All default training hyper-parameters are used for fair comparison. We only replace the consistency loss with our IKL loss. As shown in Table 32 in the rebuttal_tables.pdf, with our IKL loss, the Mean-Teacher method even surpasses FixMatch, outperforming the baseline by 2.67%.

Imbalanced data. For long-tailed recognition on ImageNet-LT, we train models for 90 epochs with the cross-entropy loss. We only preprocess images with RandomResizedCrop and random horizontal flip. All training settings are the same for fair comparison. As shown in Table 33 in the rebuttal_tables.pdf, models trained with our loss function achieve much better performance than with the original KL loss, surpassing the baseline by 1.44%.

Semantic segmentation distillation. As suggested, we conduct an ablation on the semantic segmentation distillation task. We use APD [ref1] as our baseline because of its open-sourced code. All default hyper-parameters are adopted. We only replace the original KL loss with our IKL loss. As shown in Table 34 in the rebuttal_tables.pdf, we achieve 0.5 mIoU performance gains with the IKL loss, demonstrating that the IKL loss can be complementary to other techniques in semantic segmentation distillation.

[ref1] Adaptive Perspective Distillation for Semantic Segmentation. IEEE TPAMI 2022.

Q2: Significant improvements from IKL loss and state-of-the-art robustness on the public leaderboard---RobustBench.

First, with the same experimental settings and only replacing the KL loss with the IKL loss, we significantly outperform our baseline TRADES by 1.63% robustness under auto-attack.

Second, without extra techniques from previous algorithms, we achieve state-of-the-art adversarial robustness on the public leaderboard --- RobustBench, under both settings: with the basic augmentation strategy (random crop and horizontal flip), and with advanced data augmentation or synthesized data.

We only replace the KL loss with our IKL loss in the TRADES pipeline, and thus our method is much more computationally efficient than LBGAT and ACAT, saving 33.3% of training time. Our results are summarized in Table 35 in the rebuttal_tables.pdf.

Q3: Why do we need to break the asymmetric property of KL loss in scenarios like knowledge distillation?

First, the cross-entropy loss is invariant to a mean value shift. Suppose $\mathbf{v} \in \mathbb{R}^{C}$ and $\gamma \in \mathbb{R}$; then the following equation holds:

$$\mathcal{L}=-\log\frac{e^{v_{y}}}{\sum_{i=1}^{C} e^{v_{i}}} = -\log\frac{e^{v_{y}+\gamma}}{\sum_{i=1}^{C} e^{v_{i}+\gamma}},$$

where $v_{y}$ is the $y$-th value of $\mathbf{v}$.
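A one-line numeric check of this identity (our snippet): cross-entropy on logits shifted by any constant $\gamma$ gives exactly the same loss.

```python
import torch
import torch.nn.functional as F

v = torch.tensor([[0.3, 1.7, -0.5]])   # logits
y = torch.tensor([1])                   # label
print(F.cross_entropy(v, y), F.cross_entropy(v + 10.0, y))  # identical values
```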

Second, for tasks like adversarial training, the KL loss forces the two outputs to be similar both absolutely and relatively. Recall the derivatives of the KL loss regarding $\mathbf{o}_{m}$ and $\mathbf{o}_{n}$ in Eq. (3) and Eq. (4) of the main paper:

$$\frac{\partial \mathcal{L}_{KL}}{\partial \mathbf{o}_{m}^{j}} = \sum_{k=1}^{C} \left( (\Delta \mathbf{m}_{j,k} - \Delta \mathbf{n}_{j,k}) \cdot \mathbf{s}_{m}^{k}\, \mathbf{s}_{m}^{j} \right) \quad \text{Eq. (3)},$$

$$\frac{\partial \mathcal{L}_{KL}}{\partial \mathbf{o}_{n}^{j}} = \mathbf{s}_{m}^{j}\,(\mathbf{s}_{n}^{j} - 1) + \mathbf{s}_{n}^{j}\,(1 - \mathbf{s}_{m}^{j}) \quad \text{Eq. (4)}.$$

Eq. (3) indicates that the KL loss encourages $\mathbf{o}_{m}$ to be similar to $\mathbf{o}_{n}$ absolutely in the Euclidean space. Eq. (4) indicates that the KL loss encourages $\mathbf{o}_{n}$ to be similar to $\mathbf{o}_{m}$ relatively in the Softmax space. The two terms work collaboratively and drive the optimization together.

Third, for knowledge distillation, the teacher model is well-trained and fixed during the distillation process. Thus, $\mathbf{o}_{m}$ is detached from gradient backpropagation, and Eq. (3) takes no effect during training. As a result, the KL divergence loss only forces $\mathbf{o}_{m}$ and $\mathbf{o}_{n}$ to be similar relatively in the Softmax space.

This optimization can cause a problem. For a two-class classification task, suppose that, after knowledge distillation with an ideal teacher model, we obtain the logits $a_{1}=[1,2]$, $a_{2}=[0.5,1]$, and $a_{3}=[0.3,0.2]$ for inputs $x_{1}$, $x_{2}$, and $x_{3}$ respectively. From these logits, we know that $x_{1}$ and $x_{2}$ belong to class 2 while $x_{3}$ is in class 1. However, in Euclidean space, the distance between $a_{2}$ and $a_{3}$ is even smaller than the distance between $a_{1}$ and $a_{2}$, which contradicts the intuition that images of the same class should be closer than images of different classes.

Our IKL loss can address this problem by enabling the gradient of $\Delta \mathbf{n}$ in the DKL loss, thus forcing the output of the student model to be similar to that of the teacher model absolutely. This operation breaks the asymmetric gradient property of the KL loss while keeping its good property of optimizing the two outputs to be similar absolutely in the Euclidean space and relatively in the Softmax space.
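The two-class example above can be verified directly (our snippet): $a_2$ sits closer to $a_3$ than to $a_1$ in Euclidean space, even though $a_1$ and $a_2$ share a class.

```python
import torch

a1 = torch.tensor([1.0, 2.0])   # class 2
a2 = torch.tensor([0.5, 1.0])   # class 2
a3 = torch.tensor([0.3, 0.2])   # class 1

print(torch.dist(a1, a2))       # tensor(1.1180)
print(torch.dist(a2, a3))       # tensor(0.8246): smaller, contradicting class structure
```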

Final Decision

Five experts in the field reviewed this paper. Their recommendations are 2x Weak Accept, 2x Borderline Accept, and 1x Borderline Reject. Overall, the reviewers appreciated the paper because it demonstrates that the KL divergence loss can be decoupled into a weighted MSE term and a CE term with soft labels, and the authors propose an Improved KL (IKL) loss based on this decoupling. The writing and motivation of the paper are clear. The paper also provides comprehensive experimental validation on the CIFAR-10/100 and ImageNet datasets, showing that the proposed IKL loss achieves state-of-the-art adversarial robustness and knowledge distillation performance.

The reviewers initially raised significant concerns regarding the paper, but the author provided a detailed and insightful response that addressed most of these concerns. I recommend accepting this paper based on the reviewers’ feedback and the authors’ satisfactory rebuttal. Overall, the method is novel, and the experimental validation is convincing. However, the reviewers have raised some concerns that should be addressed in the final camera-ready version of the paper. In particular, the authors should improve their paper to:

  • Clarify the scope of their work. While the IKL loss shows promising empirical results on CIFAR-10/100 and ImageNet datasets, can these improvements generalize to other types of data or tasks?
  • Explain the underlying principles that make the proposed approach effective in knowledge distillation.
  • Clarify that Theorem 1 and the subsequent analysis are solely based on the assumption that the probability distribution is a categorical distribution.
  • Explain how the proposed method for improvement by aligning the student model with the teacher model absolutely in the Euclidean space differs from simply regressing the absolute logit values using L1/L2 loss for KD, and from: Sun et al., Logit Standardization in Knowledge Distillation, CVPR 2024.
  • Provide an analysis of computational complexity associated with the IKL loss, which is important for resource-constrained environments.

The authors are encouraged to make the necessary changes to the best of their ability. We congratulate the authors on the acceptance of their paper!