PaperHub
Score: 6.1/10 · Poster · 4 reviewers
Ratings: 4, 3, 3, 3 (min 3, max 4, std 0.4)
ICML 2025

Beyond One-Hot Labels: Semantic Mixing for Model Calibration

OpenReview · PDF
Submitted: 2025-01-22 · Updated: 2025-07-24
TL;DR

We introduce Calibration-aware Semantic Mixing (CSM), a data augmentation framework that uses diffusion models to generate training samples with ground-truth confidence, improving model calibration.

Keywords

model calibration, semantic mixing, diffusion models

Reviews and Discussion

Review (Rating: 4)

This paper proposes Calibration-aware Semantic Mixing (CSM), a model calibration approach using diffusion-based data augmentation, akin to a "semantic mixup". Unlike traditional one-hot labeling, CSM generates mixed samples with soft labels derived from CLIP. The authors introduce a reannotation technique using CLIP features and investigate the influence of loss functions, showing that the L2 loss is well suited to enhancing calibration.
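To make the described pipeline concrete, here is a minimal, heavily hedged sketch of what such a training step could look like. The `semantic_mix` and `clip_soft_label` helpers are hypothetical placeholders standing in for the paper's diffusion-based mixing and CLIP-based reannotation; this is our illustration, not the authors' code.

```python
# Hypothetical sketch of a CSM-style training step (not the authors' code).
import torch
import torch.nn.functional as F

def semantic_mix(x1, x2):
    # Placeholder: the paper generates a semantically mixed image with a
    # diffusion model; a pixel average stands in here for illustration only.
    return 0.5 * (x1 + x2)

def clip_soft_label(y1, y2, num_classes, conf=0.6):
    # Placeholder: the paper reannotates confidence with CLIP features; here
    # we simply split probability mass between the two source classes.
    q = torch.zeros(num_classes)
    q[y1], q[y2] = conf, 1.0 - conf
    return q

def csm_step(model, optimizer, x1, y1, x2, y2, num_classes):
    x_mix = semantic_mix(x1, x2)
    q_mix = clip_soft_label(y1, y2, num_classes)

    logits = model(torch.stack([x1, x2, x_mix]))
    p = F.softmax(logits, dim=-1)
    q = torch.stack([
        F.one_hot(torch.tensor(y1), num_classes).float(),
        F.one_hot(torch.tensor(y2), num_classes).float(),
        q_mix,
    ])

    # L2 (squared-error) loss on probabilities: the objective the review says
    # the paper argues is better balanced than CE or focal loss.
    loss = ((p - q) ** 2).sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```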

Questions for Authors

  1. How does the choice of diffusion model affect the performance of CSM? What about using other generative models? Some generative models, such as GANs, support interpolation; would they help?
  2. How does CSM compare to Mixup in terms of training efficiency and memory usage?

Claims and Evidence

  1. CSM improves model calibration by introducing semantically meaningful augmentations, validated through ECE and AECE reductions.
  2. L2 loss leads to balanced learning, demonstrated by both theoretical insights and empirical results.

Methods and Evaluation Criteria

  1. The paper introduces a diffusion-based augmentation method using an L2 loss, supported by theoretical evidence.
  2. The method is evaluated with ECE and AECE (sketched below) on CIFAR-10, CIFAR-100, and Tiny-ImageNet, using ResNet-50/101, Wide-ResNet-26-10, and DenseNet-121.
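For reference, ECE bins predictions by confidence and averages the accuracy-confidence gap per bin, while AECE uses adaptive bins holding roughly equal numbers of samples. A minimal NumPy sketch of the standard definitions (illustrative, not the paper's evaluation code):

```python
import numpy as np

def ece(conf, correct, n_bins=15):
    """Expected Calibration Error with equal-width confidence bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total, err = len(conf), 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            err += (mask.sum() / total) * abs(correct[mask].mean() - conf[mask].mean())
    return err

def aece(conf, correct, n_bins=15):
    """Adaptive ECE: bins chosen so each holds roughly equal sample counts."""
    order = np.argsort(conf)
    total, err = len(conf), 0.0
    for chunk in np.array_split(order, n_bins):
        err += (len(chunk) / total) * abs(correct[chunk].mean() - conf[chunk].mean())
    return err
```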

Theoretical Claims

The authors provide a theoretical justification for their reannotation strategy and choice of loss functions.

Experimental Design and Analysis

  1. Extensive comparisons with other train-time calibration techniques.
  2. AUROC (%) for robustness evaluation under distribution shifts.
  3. Calibration performance with post-hoc calibration methods.
  4. Reliability diagrams and ablation studies.

Supplementary Material

The supplementary material provides proofs for the equations and propositions, along with other reproduction details.

Relation to Prior Work

This paper is closely related to the calibration literature, with connections to test-time augmentation and label smoothing.

Missing Important References

N/A

Other Strengths and Weaknesses

Weaknesses:

  1. Limited discussion of computational efficiency.
  2. Hyperparameter sensitivity analysis is not well explored.
  3. No Transformer model is included for comparison.

Other Comments or Suggestions

  1. Provide comparisons of computational cost with existing calibration techniques.
  2. Include failure cases where CSM does not improve calibration performance.

Author Response

Response to Reviewer jJxJ

Thanks for your helpful suggestions! Here’s our response:

Q4-1: Training computational cost and memory usage compared to existing methods.

A4-1: As also analyzed in A2-3, we compare computational efficiency in terms of training time in Table B of A2-3:

We can conclude that the number of augmented samples per batch is the major factor in training time. Our CSM maintains reasonable efficiency considering its calibration effectiveness. When compared under the EQ-DATA setting in the main paper's Table 2, CSM achieves competitive calibration results with an equalized training time.

This factor also determines memory usage. CSM uses roughly $\frac{3}{2}\times$ the memory of Mixup/RegMixup, while using roughly equal memory compared to RankMixup (depending on the setup).

Regarding resource consumption, we generate augmented samples and train our model on a single A4000 GPU. It is worth noting that CSM needs no re-generation when switching models, objectives, or re-annotation methods, making it more efficient for decoupled studies of these modules.

We will include the key information in the main paper.

Q4-2: Hyperparameter sensitivity.

A4-2: We have analyzed the hyperparameter $s$ and the number of augmented samples in Appendix C. We provide more analysis of the number of augmented samples per training sample (denoted $N_{aug}$) here. Table C:

| $N_{aug}$ | 1 | 2 | 3 |
| --- | --- | --- | --- |
| ECE: CIFAR-10 | 0.83 | 0.54 | 0.39 |
| ECE: CIFAR-100 | 2.07 | 1.29 | 1.74 |

It can be observed that increasing the number of augmentations accompanying each dataset sample generally improves the final calibration performance. This is because a larger $N_{aug}$ samples more proximal data for training, better filling the domain space and providing more accurate confidence estimation.

For the sensitivity analysis, we draw the following key observations from the existing results in Appendix C and the experiments above:

  1. Our proposed CSM is not sensitive to the total number of augmented samples. A relatively small quantity can still make CSM effective.
  2. CSM is relatively sensitive to the scaling factor $s$, as it is related to the temperature. Nevertheless, within an appropriate range of $s$, the method performs consistently well.

Q4-3: Evaluation on the transformer architecture.

A4-3: Evaluations on the Swin-Transformer architecture verify our method's equal or stronger effectiveness. Please refer to A1-3 Table A.

Q4-4: Failure cases of CSM.

A4-4: While our method performs well on all the evaluated datasets, we admit there are cases in which CSM fails. For instance, although CSM surpasses the compared methods on ECE/AECE, some post-temperature results on the CIFAR-100 dataset are not competitive with the SOTA methods. Specifically, they exhibit balanced pre- and post-temperature results (searched $T = 1$) with remarkable pre-temperature ECE values but slightly larger calibration errors after temperature scaling. This phenomenon is also noted in the calibration literature [1], which found that a balance exists between the two results. Considering both results together, our CSM still achieves satisfactory model calibration.
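For context, the "searched T" above refers to post-hoc temperature scaling, which fits a single scalar on held-out data. A minimal sketch under the usual NLL-minimization formulation (our illustration, not the authors' code):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def search_temperature(logits, labels, grid=np.linspace(0.5, 3.0, 251)):
    """Grid-search the scalar T minimizing validation NLL.
    A searched T > 1 softens an over-confident model; T < 1 sharpens an
    under-confident one; T = 1 means the model is already balanced."""
    best_t, best_nll = 1.0, np.inf
    for t in grid:
        p = softmax(logits / t)
        nll = -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()
        if nll < best_nll:
            best_t, best_nll = t, nll
    return best_t
```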

Q4-5: Will the choice of generative models affect the performance?

A4-5: The choice of generative model does influence the final prediction results. An existing study [2] has already shown that adopting a better generative model can improve a classification model's robustness. We anticipate that a better generative backbone would achieve superior confidence calibration with our CSM. This aligns with our empirical experience that with the ordinary Stable Diffusion architecture we achieve only suboptimal results, as shown in the table:

| Gen. Model | ACC | ECE↓ | AECE↓ |
| --- | --- | --- | --- |
| SD [3] | 76.87 | 1.93 | 1.78 |
| EDM [4] (in our CSM) | 78.84 | 1.29 | 1.63 |

Therefore, we anticipate that a typical GAN with fewer parameters or lower fidelity than diffusion models would yield worse results.

References

[1] Wang, D. B., Feng, L., & Zhang, M. L. (2021). Rethinking calibration of deep neural networks: Do not be afraid of overconfidence. Advances in Neural Information Processing Systems, 34, 11809-11820.

[2] Wang, Z., Pang, T., Du, C., Lin, M., Liu, W., & Yan, S. (2023, July). Better diffusion models further improve adversarial training. In International conference on machine learning (pp. 36246-36263). PMLR.

[3] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10684-10695).

[4] Karras, T., Aittala, M., Aila, T., & Laine, S. (2022). Elucidating the design space of diffusion-based generative models. Advances in neural information processing systems, 35, 26565-26577.

Review (Rating: 3)

This paper introduces a novel framework, Calibration-aware Semantic Mixing (CSM), designed to improve model calibration. The key contribution lies in addressing the limitations of one-hot labeled datasets by proposing a data augmentation technique that leverages semantic mixing to generate diverse samples via diffusion models. The paper also introduces reannotation techniques to enhance confidence annotation accuracy and explores different loss functions to achieve confidence-balanced learning. Experimental results demonstrate that CSM surpasses existing calibration methods, delivering superior performance across multiple benchmarks and tasks.

Update after rebuttal

My concerns have been addressed, and I would like to recommend acceptance.

Questions for Authors

  • Could you discuss or evaluate simple baselines for calibrated reannotation, such as directly adopting CLIP outputs as labels?
  • Could you discuss or evaluate simple baselines for calibration-aware data augmentation, such as using generated images without semantic mixing?
  • Could you clarify Proposition 3.4, particularly regarding why L2 loss outperforms cross-entropy and focal loss?

Claims and Evidence

The claims presented in the paper are well-supported by empirical evidence from the experiments.

Methods and Evaluation Criteria

The proposed method is both novel and well-motivated, offering fresh insights into model calibration. The evaluation follows standard practice in this field, employing accuracy, Expected Calibration Error (ECE), and post-temperature scaling as key metrics.

Theoretical Claims

One concern arises in Proposition 3.4, where the paper claims that L2 loss outperforms cross-entropy (CE) and focal loss. The reasoning behind this claim is unclear and requires further clarification.

Experimental Design and Analysis

The overall experimental design and analysis are well-structured and reasonable. The authors compare the proposed calibration technique against several widely-used calibration algorithms on diverse models and datasets. Ablation studies effectively highlight the contributions of different components. However, I have two major concerns:

  • Calibrated Reannotation – The authors utilize CLIP’s visual encoder for reannotation, but they do not discuss or compare it with a simple baseline that directly adopts CLIP outputs as labels. Evaluating this baseline would help assess the added benefit of the proposed reannotation approach.
  • Calibration-aware Data Augmentation – The study proposes a semantic mixing strategy for generating calibrated samples using diffusion models. However, a crucial baseline is missing: directly using generated images from diffusion models without semantic mixing. Evaluating this approach would provide a clearer understanding of semantic mixing’s contribution.

Supplementary Material

The supplementary material appropriately includes proofs of propositions, detailed descriptions of the experimental setup, and additional results.

Relation to Prior Work

Addressing model calibration from a data-driven perspective is an interesting and promising direction. The results across different models and datasets suggest strong potential for real-world applications. The proposed algorithm could contribute significantly to the field of trustworthy AI, enhancing model reliability and confidence estimation in diverse applications.

Missing Important References

None.

Other Strengths and Weaknesses

Overall, the paper is well-written and easy to follow. The proposed semantic mixing framework is conceptually sound and coherently presented. The evaluation is comprehensive, and the method demonstrates superior calibration performance compared to existing baselines. However, as noted earlier, some concerns remain regarding the theoretical claims (Proposition 3.4) and the experimental design (missing baselines for reannotation and augmentation methods).

Other Comments or Suggestions

None.

Author Response

Response to Reviewer WPRF

Thank you for your encouraging feedback on the clarity, soundness, and comprehensive evaluation of our work. We truly appreciate your thoughtful suggestions for clarity and comprehensive validation. Here are our responses to the suggestions:

Q3-1: Clarify Proposition 3.4 (L2 loss vs. CE/FL).

A3-1: Thank you for commenting on the clarity issue. We need to clarify that there exists a typo in Proposition 3.4 that potentially hindered understanding. Proposition 3.4 should have been presented as

  • $\forall \delta \ge 0$, we have $\beta(p^{L2}_1, p^{L2}_2) = 0$,

which is an equation rather than an inequality for the $\mathcal{L}_2$ loss's balance function, meaning that when two similar samples exceed the model's discriminability, the $\mathcal{L}_2$ loss tends to balance the learned labels of the harder and softer instances, instead of tending to fit a specific one of them. Note that the proof of Proposition 3.4 we provided in Appendix A.3 does prove that $\beta(p^{L2}_1, p^{L2}_2) = 0$.

As proved in Appendix A, easier samples with $\delta \ge \|q^{L2}_1, q^{L2}_2\|$ would generally have $\|p^{L2}_i, q^{L2}_i\| = 0$, $i = 1, 2$. Smaller values of $\delta$ indicate difficulty for the learned model in separating the outputs of the two samples, hence introducing a balancing problem. A theoretically non-zero balancing score $\beta$ means that one of $\|p^{L2}_i, q^{L2}_i\|$, $i = 1, 2$, is minimized more completely, indicating the imbalanced nature of the objective.

Among the three losses, only the L2 loss theoretically zeros the balancing score $\beta$, indicating its ability to balance over- and under-confidence on difficult soft-labeled sample pairs, and hence making it a superior loss for calibration with our augmented samples. In practice, the soft-label distribution may disturb this balance, which we leave for future study (a small numerical illustration follows below).

We will correct this typo in the revised main paper.
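To give a concrete feel for the balancing claim, one can compare the per-class gradients of the three losses with respect to the predicted probabilities under a soft target. The snippet below is our own numerical illustration (with the focal parameter fixed to $\gamma = 1.0$, matching the proof assumption mentioned in A2-1), not the paper's definition of $\beta$.

```python
import numpy as np

q = np.array([0.5, 0.5])   # soft target of an ambiguous mixed sample
gamma = 1.0                # focal parameter, matching the proof's assumption

def grads(p):
    ce = -q / p                                        # grad of -sum q log p
    l2 = 2.0 * (p - q)                                 # grad of sum (p - q)^2
    fl = q * (gamma * (1 - p) ** (gamma - 1) * np.log(p)
              - (1 - p) ** gamma / p)                  # grad of -sum q (1-p)^g log p
    return ce, l2, fl

for p in (np.array([0.3, 0.7]), np.array([0.7, 0.3])):
    ce, l2, fl = grads(p)
    print(p, "CE:", ce.round(2), "L2:", l2.round(2), "FL:", fl.round(2))

# The L2 gradient is antisymmetric around the target (here +/-0.4) whether the
# model over- or under-shoots, while CE and FL pull much harder on whichever
# class is under-estimated, illustrating the over-/under-confidence asymmetry.
```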

Q3-2: Baseline for directly adopting CLIP labels. (Var. 1)

Q3-3: Baseline for diffusion-based sample augmentation without semantic mixing. (Var. 2)

A3-2, A3-3: We evaluate these two variants and compare them with our proposed CSM as follows. CIFAR-10:

| Variant | ACC | ECE↓ | AECE↓ |
| --- | --- | --- | --- |
| Var. 1 | 92.13 | 2.68 (0.92) | 2.67 (0.88) |
| Var. 2 | 96.12 | 2.45 (0.96) | 2.44 (1.13) |
| CSM (Ours) | 95.79 | 0.54 (0.54) | 0.33 (0.33) |

CIFAR-100:

| Variant | ACC | ECE↓ | AECE↓ |
| --- | --- | --- | --- |
| Var. 1 | 66.60 | 52.78 (1.36) | 52.78 (1.11) |
| Var. 2 | 79.24 | 10.84 (2.48) | 10.84 (2.41) |
| CSM (Ours) | 78.84 | 1.29 (1.29) | 1.63 (1.63) |

From these results, we make the following observations:

  1. The vanilla CLIP annotation method yields the worst ACC and pre-temperature calibration errors, primarily due to the noisy information introduced by annotating over all classes. The degradation is significant on CIFAR-100, which has more classes and therefore more severe noise.
  2. Directly adopting class-conditioned augmentations from the diffusion model slightly raises prediction accuracy, as also evidenced by the literature on generative-model-augmented classification. However, since it contains no soft-labeled samples, Var. 2 fails to improve model calibration.

Therefore, we can conclude that models are effectively calibrated only when adopting the proper data and re-annotation scheme (contrasted in the sketch below).
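As a concrete contrast between Var. 1 and the restricted reannotation, a hedged sketch of CLIP-based soft labeling over only the two mixed classes might look as follows. It uses the open-source `clip` package; the prompt format, the scale $s$, and the helper name are our assumptions, not the authors' implementation.

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def reannotate(pil_image, y1, y2, class_names, s=100.0, num_classes=100):
    # Restrict the CLIP softmax to the two source classes of the mixed image;
    # Var. 1 would instead take a softmax over all class prompts, which the
    # rebuttal argues injects label noise, especially on CIFAR-100.
    texts = clip.tokenize([f"a photo of a {class_names[y1]}",
                           f"a photo of a {class_names[y2]}"]).to(device)
    with torch.no_grad():
        img = model.encode_image(preprocess(pil_image).unsqueeze(0).to(device))
        txt = model.encode_text(texts)
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    w = (s * img @ txt.T).softmax(dim=-1).squeeze(0)  # s acts like a temperature
    q = torch.zeros(num_classes)
    q[y1], q[y2] = w[0], w[1]
    return q  # soft label concentrated on the two mixed classes
```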

Review (Rating: 3)

This paper presents Calibration-aware Semantic Mixing (CSM), a novel approach to improving model calibration by generating high-quality augmented data with soft labels. Unlike traditional augmentation methods that rely on one-hot labels, CSM leverages diffusion models to create semantically mixed images with confidence scores. The authors introduce a reannotation strategy based on CLIP features and explore different loss functions, demonstrating that L2 loss leads to better calibration. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet show that CSM surpasses existing calibration techniques.

Questions for Authors

Since semantic mix augmentation effectively fills sparse regions in the data space and improves local data proximity, could this approach be integrated with [1] to enhance not only calibration but also model robustness? Exploring this synergy could yield further improvements in generalization and uncertainty estimation.

Reference: [1] Proximity-Informed Calibration for Deep Neural Networks

Claims and Evidence

CSM enhances calibration by generating realistic semantically mixed samples, as evidenced by Figure 1. Reannotating confidence scores improves performance, which is validated through ablation studies. L2 loss provides a better balance in learning, leading to improved calibration, supported by both theoretical analysis and empirical results.

Methods and Evaluation Criteria

The paper evaluates model calibration using standard metrics, including Expected Calibration Error (ECE) and Adaptive ECE (AECE), across multiple datasets (CIFAR-10, CIFAR-100, Tiny-ImageNet). Additionally, reliability diagrams and ablation studies are conducted for further comparison.

Theoretical Claims

The authors provide theoretical justifications for their reannotation strategy and choice of loss function. While the analysis appears rigorous, additional details and proofs in the supplementary material could further strengthen their claims.

Experimental Design and Analysis

The experiments primarily focus on ResNet-based models, and additional evaluations on other architectures (e.g., Transformer-based models) would be beneficial to confirm the generalizability of CSM. Computational overhead is not explicitly discussed—more details on efficiency and resource consumption would enhance the paper.

Supplementary Material

The paper provides sufficient methodological details, but a more thorough review of supplementary material would be helpful to assess the depth of theoretical and experimental justifications.

Relation to Prior Work

The work builds on existing research in model calibration and data augmentation, presenting a novel approach by incorporating diffusion models for calibration-aware augmentations. However, the discussion could benefit from additional comparisons to post-hoc calibration methods.

Missing Important References

The paper should consider discussing and comparing its approach with existing post-hoc calibration methods, particularly [1] Test Time Augmentation Meets Post-hoc Calibration, which is closely related to data augmentation for calibration.

Other Strengths and Weaknesses

Strengths:

  • Novel method integrating diffusion models for calibration.
  • Strong empirical results demonstrating superior performance over existing techniques.
  • Comprehensive evaluation using standard calibration metrics.

Weaknesses:

  • Limited hyperparameter analysis: the sensitivity of CSM to different configurations is not well explored.
  • Unclear computational cost: the efficiency trade-offs of using diffusion models for augmentation should be discussed.

Other Comments or Suggestions

N/A

Author Response

Response to Reviewer xs6o

Thank you for your positive and insightful feedback! Here are our responses:

Q2-1: Additional details and proofs in the supplementary material.

A2-1: Thank you for the kind comments on the theoretical soundness. The claims made in our paper (including the derived results in Eq. (6)-(10) and Prop. 3.2-3.4) are fully proved in Appendix A. In detail:

  • Eq. (6)-(7) correspond to Eq. (14)-(15) with
    • Assumption 1: The classification-optimal classifier $\operatorname{E}(\cdot)$ ensures the likelihood ratio of different classes;
    • Assumption 2: Features are regarded as elements of the affine set of class prototypes with an orthogonal deviation.
  • Eq. (8)-(9) are proved by Eq. (15)-(19).
  • The result of Eq. (10) is obtained from Eq. (9) with the class-factor-invariance assumption.
  • Proposition 3.4 is first proved through Eq. (22)-(26) with the assumptions given in Definition 3.1.
  • To prove Propositions 3.2 and 3.3, we first give a general analysis of the problem in Lines 712-791 (or Eq. (27)-(37)), then prove them by Lemma A.1 and Lemma A.2, respectively. Note that the proof for CE is unconditional, while FL is proved under the assumption that $\gamma_{FL} = 1.0$; we empirically find FL more imbalanced with larger $\gamma_{FL}$.

These detailed descriptions illustrate the overall framework of the theoretical analysis. We will include connective details and key deductions in the main paper.

Q2-2: Additional evaluations on other architectures.

A2-2: Our method performs equally or more effectively than others with the Swin-Transformer architecture. Please refer to A1-3, Table A.

Q2-3: More details on efficiency and resource consumption.

A2-3: Thank you for your suggestion. For the efficiency analysis, we compare explicitly in terms of training time as follows:

Table B:

| Methods | CE | MbLS | Mixup | RegMixup | RankMixup | Ours | Ours (EQ-DATA) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Training Time | 2.63h | 2.65h | 3.48h | 3.50h | 4.30h | 4.28h | 2.64h |

One can see that the number of augmented samples per batch is the major factor in training time. CSM outperforms the others in ECE/AECE while maintaining reasonable speed. Even with equalized training samples (EQ-DATA, Table 2), it achieves competitive calibration. CSM runs on a single A4000. Augmented samples need no re-generation across model/loss/annotation changes, enabling efficient modular study.

Key details will be added to the main paper.

Q2-4: Comparison with [1], a test-time augmentation (TTA) post-hoc calibration method.

A2-4: We compare with [1] by evaluating CSM + [1] as follows:

| Metrics | ECE | AECE |
| --- | --- | --- |
| Ours | 1.29 | 1.63 |
| Ours+[1] | 1.39 | 1.53 |

Our integrated method balances ECE and AECE, achieving an optimized AECE of 1.53 on CIFAR-100. Compared to [1], which uses test-time sample-wise scaling, CSM employs training-time augmentation with inter-sample augmentations to expand the proximity space, enhancing calibration robustness. We will cite [1] and provide full comparisons in the main paper.
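For readers unfamiliar with TTA-based post-hoc calibration, a hedged sketch in the spirit of [1] (not its exact algorithm) is shown below: predictions are averaged over augmented test-time views, then scaled with a temperature fitted on validation data. The function and parameter names are our own.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def tta_then_scale(model_logits_fn, x, augment_fns, temperature=1.0):
    # Average logits over test-time augmented views, then temperature-scale
    # the averaged prediction (a sketch of the general TTA + post-hoc idea).
    views = [x] + [aug(x) for aug in augment_fns]
    mean_logits = np.mean([model_logits_fn(v) for v in views], axis=0)
    return softmax(mean_logits / temperature)
```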

Q2-5: Computational overhead of diffusion-based augmentation.

A2-5: Our analysis in Appendix C shows that CSM requires few augmented samples to launch effectively, ensuring low computational costs. Generating augmented sets takes ~4 hours (CIFAR-10/100) or ~9 hours (Tiny-ImageNet) on an RTX4090 GPU, comparable to typical training times. Crucially, CSM eliminates re-generation when model architectures/parameters change, further enhancing efficiency through its decoupled design. This validates CSM's computational efficiency.

Q2-6: Hyperparameter sensitivity analysis.

A2-6: We have analyzed the parameter $s$ and the number of augmented samples in Appendix C. We further analyze $N_{aug}$ (see A4-2, Table C) and check sensitivity in A4-2. A larger $N_{aug}$ generally enhances our method, while within appropriate ranges of the other parameters it yields stable results.

Q2-7: Possibility of integrating our CSM with [2] to enhance calibration and robustness.

A2-7: Thank you for this insightful question. We conduct post-hoc experiments to integrate our method's outputs with [2], acquiring the following result:

| Errors↓ | ECE | MCE | AECE | PIECE |
| --- | --- | --- | --- | --- |
| Ours | 1.29 | 0.21 | 1.62 | 3.16 |
| Ours+[2] | 1.89 | 0.73 | 1.82 | 3.11 |

Due to limited time, we simply integrate CSM with [2] without further adjustment. Although this simple combination does not yield superior ECE/AECE results, the proximity-informed metric PIECE improves, which validates the proximity-related robustness gains from the integration. We will cite [2] in our analysis.

References

[1] Hekler, A., Brinker, T. J., & Buettner, F. (2023, June). Test time augmentation meets post-hoc calibration: uncertainty quantification under real-world conditions. AAAI.

[2] Xiong, M., Deng, A., Koh, P. W. W., Wu, J., Li, S., Xu, J., & Hooi, B. (2023). Proximity-informed calibration for deep neural networks. NeurIPS.

Review (Rating: 3)

Model calibration typically assumes full certainty in datasets with one-hot labels, limiting accurate uncertainty estimation. To address this, the paper introduces Calibration-aware Semantic Mixing (CSM), a data augmentation framework that synthetically generates diverse training samples annotated with explicit confidence scores using diffusion models. Additionally, the authors propose a calibrated reannotation method and explore suitable loss functions for this new data paradigm. Experimental results show CSM significantly improves model calibration over existing state-of-the-art methods.

Update after rebuttal

Thank you for the author rebuttal. The major concerns regarding clarifications and additional experiments have been addressed. I will maintain my current rating.

Questions for Authors

N/A

Claims and Evidence

  • The motivation and necessity of semantic mixing from the perspective of network calibration are well articulated. Additionally, the drawbacks of existing data-driven methods (mixup-based approaches) are clearly defined.

Methods and Evaluation Criteria

  • Leveraging conditional diffusion models, specifically via a pre-trained diffusion network, to generate semantically mixed images is technically novel within the context of network calibration.
  • Further innovation is demonstrated through the identification and resolution of limitations associated with generated labels by introducing a calibration-oriented reannotation process.

Theoretical Claims

Further clarification and verification are needed regarding the balanced loss section. Specifically, more clarification on why the proposed L2 loss functions as a balanced loss would be helpful.

Experimental Design and Analysis

  • The experimental results across various networks and datasets demonstrate superior performance compared to existing state-of-the-art methods.
  • Although generalization capability is emphasized, experimental validation on larger datasets such as ImageNet and on different network architectures such as Transformers appears insufficient. Additionally, comparisons with recent state-of-the-art methods such as CALS (CVPR 2023) and ACLS (ICCV 2023) are missing.
  • The experiment described in Table 2, which compares training times under identical conditions, is commendable, considering the potential increase in training duration due to the diffusion network. The necessity and effectiveness of reannotation are well-demonstrated in Figure 3.
  • In the ablation study (lines 382–384), please confirm whether the explanations regarding CE and FL are reversed, particularly concerning temperature.
  • It would also be beneficial to provide a comparison illustrating the degree of confidence-balancing achieved by using CE, FL, and L2 losses.

Supplementary Material

I have thoroughly reviewed the code, including both the sample mixing and reannotation modules.

Relation to Prior Work

This approach effectively improves not only the network’s calibration capability but also enhances its interpretability and accuracy.

Missing Important References

Recent state-of-the-art methods, such as ACLS: Adaptive and Conditional Label Smoothing for Network Calibration (ICCV 2023) and Class Adaptive Network Calibration (CVPR 2023), have not been included in the reference list. These works offer significant contributions to network calibration and should be considered for inclusion to provide a more comprehensive and up-to-date overview of current methodologies.

Other Strengths and Weaknesses

N/A

Other Comments or Suggestions

N/A

Author Response

Response to Reviewer TsGd

Thank you for your kind suggestions on clarity and experimental thoroughness. Below are our responses:

Q1-1: Clarification of why the proposed L2 loss is a balanced loss.

A1-1: Thank you for raising this concern. We need to clarify that there exists a typo in Proposition 3.4 which makes the conclusion confusing. Proposition 3.4 should have been presented as

  • $\forall \delta \ge 0$, we have $\beta(p^{L2}_1, p^{L2}_2) = 0$,

which is an equation rather than an inequality for the $\mathcal{L}_2$ loss's balance function, meaning that when two similar samples exceed the model's discriminability, the $\mathcal{L}_2$ loss tends to balance the learned labels of the harder and softer instances, instead of tending to fit a specific one of them.

Note that the proof of Proposition 3.4 provided in Appendix A.3 does prove that $\beta(p^{L2}_1, p^{L2}_2) = 0$. With this theoretical justification, we also present empirical evidence in A1-5 regarding the confidence balance score.

Also refer to A3-1. We will correct this typo in the revised main paper.

Q1-2: Missing comparisons with ACLS [1] and CALS [2].

A1-2: We compare our results with theirs in the following tables. The results demonstrate the competitive or superior performance of our method compared to the state of the art. We will cite these compared methods and include these results in the final paper.

ResNet-50, CIFAR-10 | Tiny-ImageNet:

| Metrics | ACC | ECE↓ | AECE↓ | ACC | ECE↓ | AECE↓ |
| --- | --- | --- | --- | --- | --- | --- |
| ACLS | 95.40 | 1.12 | 2.87 | 64.84 | 1.05 | 1.03 |
| Ours | 95.79 | 0.54 | 0.33 | 66.99 | 1.29 | 1.19 |

ResNet-50, Tiny-ImageNet | ImageNet:

| Metrics | ACC | ECE↓ | AECE↓ | ACC | ECE↓ | AECE↓ |
| --- | --- | --- | --- | --- | --- | --- |
| CALS | 65.03 | 1.54 | 1.38 | 76.44 | 1.46 | 1.32 |
| Ours | 66.99 | 1.29 | 1.19 | 79.87 | 1.32 | 1.35 |

Q1-3: Insufficient validation on ImageNet and Transformers.

A1-3: We compare our results with representative methods on ImageNet with the ResNet-50 and Swin-Transformer architectures. Our method performs on par with or better than these methods, especially the mixup-based ones. We will include these results in the final paper.

ResNet-50, ImageNet:

| Metrics | ACC | ECE↓ | AECE↓ |
| --- | --- | --- | --- |
| CE | 73.96 | 9.10 | 9.24 |
| Mixup | 75.84 | 7.07 | 7.09 |
| CRL | 73.83 | 8.47 | 8.47 |
| MbLS | 75.39 | 4.07 | 4.14 |
| RegMixup | 75.64 | 5.34 | 5.42 |
| RankMixup | 74.86 | 3.93 | 3.92 |
| CALS | 76.44 | 1.46 | 1.32 |
| Ours | 79.87 | 1.32 | 1.35 |

Table A:

Swin-Transformer-V2, ImageNet:

| Metrics | ACC | ECE↓ | AECE↓ |
| --- | --- | --- | --- |
| CE | 75.60 | 9.95 | 9.94 |
| LS | 75.42 | 7.32 | 7.33 |
| FL | 75.60 | 3.19 | 3.18 |
| FLSD | 74.70 | 2.44 | 2.37 |
| MbLS | 77.18 | 1.95 | 1.73 |
| CALS | 77.10 | 1.61 | 1.69 |
| Ours | 81.08 | 1.49 | 1.86 |

Q1-4: Potential reversal of the following CE/FL explanations in the ablation (lines 382-384).

"In contrast, CE and FL losses often require temperature adjustments, with CE favoring sharper labels and FL for softer ones, aligning with our theoretical expectations from Section 3.3."

A1-4: The analysis corresponds to the searched temperature values of Mixup and CSM in Table 4, where CE results sometimes involve a searched temperature larger than 1.0 when using Mixup (Mixup (CE): $T = 1.3$), while FL results produce a searched temperature of $T = 0.9 < 1$ with our CSM (CSM (FL)). These two specific results highlight the nature of the CE and FL losses.

As studied in existing works, a higher post-hoc temperature indicates the pre-temperature model's over-confidence, while a lower one suggests its under-confidence. Therefore, with soft labels during training, these phenomena indicate a preference/bias of the adopted losses toward fitting different samples, i.e., harder labels (e.g., close to one-hot) vs. softer labels (e.g., mixup labels with $\lambda = 0.5$).

Q1-5: Confidence-balancing comparison across CE, FL, and L2.

A1-5: We explicitly compute the average balance scores to illustrate the confidence-balancing results:

| Loss Objectives | CE | Our Loss | FL |
| --- | --- | --- | --- |
| CIFAR-100 | -0.1438 | -0.1330 | -0.0393 |
| Relative Value | -0.0108 | 0.0000 | +0.0937 |

Our loss shows a clear confidence balance between CE and FL, confirming its effectiveness, though empirical values are typically negative due to the rarity of indistinguishable pairs and the easier learning of high-confidence samples in practical experiments.

References

[1] Park, H., Noh, J., Oh, Y., Baek, D., & Ham, B. (2023). ACLS: Adaptive and conditional label smoothing for network calibration. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 3936-3945).

[2] Liu, B., Rony, J., Galdran, A., Dolz, J., & Ben Ayed, I. (2023). Class adaptive network calibration. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 16070-16079).

Final Decision

The paper received three weak accepts and one accept. Among the important concerns raised by reviewers in the pre-rebuttal stage were:

  • limited hyperparameter analyses
  • comparison with some recent calibration methods
  • clarification on a theoretical claim
  • limited discussion on computational efficiency
  • missing experiments with transformer model

All reviewers acknowledged the rebuttal and decided to keep their original (pre-rebuttal) ratings of either weak accept or accept. Given that the rebuttal provides adequate responses to the reviewers' comments and no major concerns remain, the decision is to recommend acceptance of the paper. The authors are encouraged to incorporate the important reviewer comments in the final version.