PaperHub
Rating: 6.1/10
Poster · 4 reviewers
Scores: 3, 4, 4, 2 (min 2, max 4, std. dev. 0.8)
ICML 2025

Calibrated Language Models and How to Find Them with Label Smoothing

OpenReview · PDF
Submitted: 2025-01-22 · Updated: 2025-07-24
TL;DR

Label smoothing helps calibration but is less effective for large-vocabulary LLMs; we identify the root causes and provide a custom kernel for better training with label smoothing.

Abstract

Keywords
model calibration · label smoothing · large vocabulary large language models

Reviews and Discussion

Review (Rating: 3)

The paper studies calibration in LLMs after the SFT stage. The authors note that SFT leads to significant degradation in LLM calibration due to a lack of diversity in the learned feature embeddings. To address this, they propose using label smoothing during the SFT stage for better calibration. They provide theoretical justification for label smoothing by relating it to minimizing logit distances, which is roughly equivalent to MAP estimation of the softmax predictions. They also provide ablations on the effectiveness of label smoothing and point out that label smoothing might not work for models with a large vocabulary and small hidden dimension, due to limited capacity and a lack of "concentration" behavior. To make label smoothing practical for models with large vocabularies, they also introduce a novel, memory-efficient GPU kernel implemented in Triton, with optimized memory usage and improved throughput.

Questions for Authors

NA

Claims and Evidence

The claims made in the paper are generally well-supported. However, I found that the argument for some claims could be significantly improved. For example, in the section "Why does Instruction Tuning Lead to Mis-calibration?", it views the SFT stage as solving an OOD problem. While this might be supported by some literature, it is stated as a fact without any self-contained justification here. Furthermore, the calibration issue of the SFT stage is not explained clearly; it simply mentions that SFT hurts the diversity of the features, which is related to LLM calibration. I feel this part could be strengthened by adding more discussion or making it more self-contained.

Methods and Evaluation Criteria

Label smoothing is a well-known technique for adding regularization and improving model calibration. Given the observation that SFT leads to overconfidence in LLMs, applying label smoothing does make sense. Another novelty comes from the development of the memory-efficient kernel; especially for models with a large vocabulary, this is nice as it makes the use of label smoothing more scalable.

The paper evaluates on three widely used benchmarks (MMLU, HellaSwag, and ARC-Easy), which seems to be a diverse set of tasks. I am not sure whether there exists any dataset specifically tailored for evaluating calibration; for instance, I am not sure whether TruthfulQA makes sense here. It would be good to add some comments/discussion on this.

Theoretical Claims

I checked the claims in the main paper; most of them look reasonable [Lemma 4.1, Theorem 4.2]. For Proposition 3.3, it might be worth adding the prior distribution when you argue for MAP.

Experimental Design and Analysis

The experiment setup does make sense. Some suggestions:

  • It might be good to add more ablations on the smoothing factor β: a rule of thumb for its selection, and how calibration performance varies across a wide range of β. Furthermore, is there any adaptive label smoothing that adjusts the smoothing factor based on hidden size, vocabulary size, or training progress?
  • I am not sure whether it is within the scope of the paper, but it would be nice to understand the calibration issue after SFT under different training mixtures/datasets, and how they are correlated.
  • Actually, it seems some baselines are missing, such as temperature scaling; how do they perform?

Supplementary Material

No

Relation to Existing Literature

The paper builds upon the well-known label smoothing method and adapts it to LLMs with large vocabulary sizes. The findings should be relevant to the community studying the robustness/safety of LLMs, as well as the broader community studying calibration of neural networks.

Missing Important References

NA

Other Strengths and Weaknesses

Strengths:

  • The paper studies an important problem about calibration in LLMs with large vocab size, and the proposed method of using label smoothing does make sense.
  • The development of a custom GPU kernel is nice; it makes the use of label smoothing more memory efficient and thus more practical.
  • The paper is generally well-structured, and easy to read.

Weaknesses:

  • The motivation for using label smoothing is not entirely clear. There are simpler methods for model calibration, such as temperature scaling; why focus on label smoothing? It would be great to add more discussion on why label smoothing is chosen over other model calibration techniques.
  • As noted earlier, the explanation in the section "Why does Instruction Tuning Lead to Miscalibration?" could be more self-contained and provide a clearer, more detailed explanation of the underlying mechanisms.

Other Comments or Suggestions

See the points mentioned in other threads.

Author Response

We are immensely grateful for the thorough review that the reviewer has provided for our work. First, we would like to express our appreciation for the comments regarding the well-supported claims, the usefulness of our custom kernel, the appropriateness of the theoretical claims as well as the broader relationship of our work with robustness/safety of LLMs. We also appreciate the comments raised about improving our work, which we hope the responses below can address.


The calibration issue of the SFT stage is not explained clearly.

We appreciate this comment from the reviewer; we will add further discussion relating to existing literature that has explored how LLM representations (the representation of inputs within the feature space) change after tuning [1, 2]. We agree that this merits a more self-contained paragraph or section that better details these claims, and will provide it within an updated manuscript. [3] also discusses this calibration issue in terms of how tuning in general (not limited to SFT) leads to feature diversity issues, which we will also discuss in further detail.

Does TruthfulQA make sense for evaluating calibration? It would be good to add some comments/discussion on this.

Please see our responses to Reviewers C57F and LLNm, who ask whether evaluation on other tasks is possible. Results on TruthfulQA and Winogrande are provided here (https://anonymous.4open.science/r/Anonymous-0F38/README.md).

For Proposition 3.3, it might be worth adding the prior distribution here when you argue for MAP.

We thank you for your suggestion and will include the following discussion in the final version. The conditional prior $p(z \mid x) = \mathrm{Dir}(\alpha_x)$ is a Dirichlet distribution with an instance-specific parameter $\alpha_x$.
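As an illustrative sketch of the correspondence (assuming, for simplicity, a symmetric parameter rather than the instance-specific $\alpha_x$ above), MAP estimation of the softmax output $z$ under this prior maximizes

$$\log p(y \mid z) + \log p(z \mid x) \;=\; \log z_{y} + \sum_{k=1}^{V} (\alpha_{k} - 1)\,\log z_{k},$$

and choosing $\alpha_k = 1 + \beta / \big(V(1-\beta)\big)$ for all $k$ recovers, up to a positive rescaling, the label-smoothed objective $(1-\beta)\log z_y + (\beta/V)\sum_k \log z_k$.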

It might be good to add more ablations on the smoothing factor β. Furthermore, is there any adaptive label smoothing that adjusts the smoothing factor based on hidden size, vocab size, or during training progress?

Thank you again for this question. Our response is two-fold. First, we test β across a fairly wide range (0.0 to 0.5, in increments of 0.1). Generally, smoothing is always beneficial for calibration, but too much smoothing can be harmful to accuracy. From our empirical results, a choice of 0.1 or 0.2 is generally most useful.

There is no adaptive label-smoothing method that we are aware of, but we have considered this as a follow-up direction. Furthermore, we are attempting to derive a potential relationship between β and the model; however, this is likely out of scope for the current work.
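For concreteness, a minimal (unfused) PyTorch-style sketch of the smoothed objective that β is swept over; this is a reference implementation for illustration only, not the fused Triton kernel described in the paper:

```python
import torch
import torch.nn.functional as F

def smoothed_cross_entropy(logits: torch.Tensor, targets: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    """Label-smoothed cross entropy over a vocabulary of size V.

    logits:  (N, V) unnormalized scores
    targets: (N,)   gold token ids
    beta:    mass spread uniformly over the vocabulary (beta = 0.0 recovers standard CE)
    """
    log_probs = F.log_softmax(logits, dim=-1)                           # (N, V)
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)      # -log p(y)
    uniform = -log_probs.mean(dim=-1)                                   # -(1/V) * sum_k log p(k)
    return ((1.0 - beta) * nll + beta * uniform).sum()                  # summing reduction, as in our sweep
```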

It would be nice to understand the calibration issue after SFT under different training mixtures/datasets?

We appreciate the comment. We point the reviewer to Table 1, which shows consistent results across three different SFT datasets. We have further tested a mixture of the datasets and observed insignificant differences, thus we did not include it due to a lack of novel insights. Nevertheless, we are happy to provide it within the Appendix as an ablation worth discussing.

Actually, it seems some baselines are missing, such as temperature scaling?

We appreciate this mention. We have included a temperature scaling baseline here (https://anonymous.4open.science/r/Anonymous-0F38/README.md). As can be observed, temperature scaling varies in effectiveness across datasets, which highlights the usefulness of label smoothing.

Why do we need to focus on label smoothing [...] rather than other model calibration techniques.

We thank the reviewer for their comment. Label smoothing is particularly attractive as it is interpretable in terms of the objective and why it can benefit confidence calibration (see Sections 2 and 3), while also being simple to incorporate within an inherent part of the learning regime, namely the training of models. On the other hand, temperature scaling requires a held-out dataset to determine an appropriate temperature.
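For completeness, temperature scaling in its standard form rescales logits post hoc by a scalar $T$ fit on a held-out calibration set (generic formulation, not specific to our experiments):

$$\hat{p} = \operatorname{softmax}(z / T), \qquad T^{\star} = \arg\min_{T > 0} \sum_{(x, y) \in \mathcal{D}_{\mathrm{val}}} -\log \operatorname{softmax}\big(z(x)/T\big)_{y}.$$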


We appreciate the extensive feedback provided and the multitude of comments highlighting the strengths of our work as well as those asking for additional details to further highlight them. We hope that this rebuttal provides enough detail to address these needs for additional discussion, as well as better present the strengths of our work. If the reviewer shares the same opinion, we would be extremely grateful for an increased score. We are also willing to further engage in discussion to address any remaining questions and comments.


References

[1] Murthy et al. One fish, two fish, but not the whole sea: Alignment reduces language models' conceptual diversity. arXiv, 2024.

[2] Wang et al. Preference Optimization with Multi-Sample Comparisons. arXiv, 2024.

[3] Oh et al. Towards Calibrated Robust Fine-Tuning of Vision-Language Models. NeurIPS, 2024.

Review (Rating: 4)

The authors proposed using label smoothing to improve language model calibration for supervised fine-tuning. They demonstrate through theoretical analysis and experiments that label smoothing is effective, though computation is heavy for large vocabulary LLMs. They also propose a memory-efficient algorithm without sacrificing much accuracy.

Questions for Authors

  1. Please discuss if your method is limited to solving classification tasks in the paper.
  2. Please include implementation details (e.g., instruction templates for each task) in the paper or supplementary material.

Claims and Evidence

Both the theoretical analysis and experiment results demonstrate the effectiveness of the label smoothing approach.

Methods and Evaluation Criteria

The proposed method is reasonable and convincing, as label smoothing keeps the model from over-estimating a specific class. For evaluation, the authors apply accuracy, ECE, and RMS metrics to balance model performance and reliability. They also include various modern models, including Alpaca, Tulu3, and OpenHermes.

Theoretical Claims

Looks good to me.

Experimental Design and Analysis

The experimental design is standard and reasonable. Though they mainly focus on a classification setting, they evaluate on accuracy, ECE, and RMS metrics to balance model performance and reliability. They also include various modern models, including Alpaca, Tulu3, and OpenHermes, and demonstrate that their method is more time- and memory-efficient than baselines. One thing is that the authors need to include the prompts and templates used in experiments in their appendix.

Supplementary Material

Yes. More implementation details need to be revealed, such as training epochs, instruction templates, etc.

Relation to Existing Literature

The paper focuses on improving language model calibration in supervised fine-tuning, which is an emerging topic in LLM reliability. It is interesting to see that label smoothing helps improve language model calibration, as previous improvements mostly happened on verbalized confidence. I am not sure if label smoothing is restricted to logit-based confidence and classification tasks; please further describe this in the paper.

Missing Important References

Some references on LLM calibration have been outdated. Below are a few works you may discuss:

scaling-based: Improving model reliability via inter-model latent agreement. https://arxiv.org/abs/2305.01481
prompting-based: Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. https://arxiv.org/abs/2305.14975
SFT-based: Teaching models to express their uncertainty in words. https://arxiv.org/abs/2205.14334
RLHF-based: Taming Overconfidence in LLMs: Reward Calibration in RLHF. https://arxiv.org/abs/2410.09724

Other Strengths and Weaknesses

Please use less LLM polishing for your paper. There are many usages uncommon in academic papers, examples including but not limited to: (1) "settings remain where ...", (2) "Seeking a practical solution, we...", (3) "Growing vocabulary sizes ... consumed to materialize ..., making training difficult." For (3), why not just say "As the vocabulary size becomes larger, there will be increasing computational costs of... and training will become more expensive"?

Other Comments or Suggestions

N/A

Author Response

We are very grateful and would like to thank the reviewer for their positive assessment of our work. We are happy that they mention the suitability of the theory, experiments, methods, and evaluation. To clarify their remaining questions and potential areas of concern, we hope the following discussion provides the necessary details.


One thing is that the authors need to include their prompts and templates used in experiments in their Appendix [...] or supplementary material.

We thank the reviewer for the comment and will include all specific details in the Appendix. Some of these details are already provided in Section 3.1. We can provide the following experimental details:

Training: We conducted a learning rate sweep over learning rates [5e-6, 2e-5, 5e-5, 2e-4] with a summing reduction. We further tested label smoothing hyperparameters [0.0, 0.1, 0.2, 0.3, 0.4, 0.5] where 0.0 is no smoothing. We used the open-instruct repository (https://github.com/allenai/open-instruct) at commit e363290 for our training setup, with modifications to account for our kernel as well as specific experimental hyper-parameter settings and baselines.

Evaluation: Our implementation is based on the MMLU official repository (https://github.com/hendrycks/test). We first evaluate our models on MMLU and then further modify the files here to directly adapt the evaluation dataset to the other tasks at hand. We follow MMLU and use the following prompt for all tasks: "The following are multiple choice questions (with answers) about {}.\n\n".format(query).

We did not tune the template, for fair comparison; for models that required a chat template, we used tokenizer.apply_chat_template (supported by HuggingFace models) directly on the sequence.
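To illustrate, a minimal sketch of how the evaluation prompt and chat template fit together, assuming a HuggingFace tokenizer; the model id and question strings below are illustrative placeholders:

```python
from transformers import AutoTokenizer

# Illustrative model id; any HuggingFace chat model works the same way.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

subject = "college mathematics"  # illustrative subject name
question = "Q: 2 + 2 = ?\nA. 3\nB. 4\nC. 5\nD. 6\nAnswer:"  # illustrative query block

# Same MMLU-style header for all tasks, followed by the (few-shot) question block.
prompt = "The following are multiple choice questions (with answers) about {}.\n\n".format(subject) + question

# For chat models, the untuned prompt is wrapped in the model's own chat template.
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```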

It is interesting to see that label smoothing is helpful to improve language model calibration, as previous improvements mostly happen on verbalized confidence. Not sure if label smoothing is only restricted to logit-based confidence, and classification tasks. [...] Please discuss if your method is limited to solving classification tasks in the paper.

Thank you for this question; it is certainly worth further discussion for clarifying inherent limitations in confidence calibration research. Implicitly, it is true that calibration is measurable only on classification-style tasks, or at the very least on tasks with a set of reference answers that admit an ordering of which answer is more correct than the others. Label smoothing is also generally only applicable to logit-based classification tasks, given that it is used within classification-based losses. However, similar concepts have been introduced in other domains, such as DPO preference-based losses (Chowdhury et al., 2024), which we are currently investigating as a possible follow-up direction of interest.
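As an illustration of how a smoothing-style idea transfers to preference losses, a generic "conservative" smoothed-DPO objective with smoothing rate $\epsilon$ takes a form like the following (a common formulation for illustration; not necessarily the exact objective of Chowdhury et al., and $\beta_{\mathrm{DPO}}$ here is the DPO temperature, distinct from the smoothing factor above):

$$\mathcal{L}(\theta) = -(1-\epsilon)\,\log \sigma\!\big(\beta_{\mathrm{DPO}}\,\Delta_{\theta}\big) - \epsilon\,\log \sigma\!\big(-\beta_{\mathrm{DPO}}\,\Delta_{\theta}\big), \qquad \Delta_{\theta} = \log\frac{\pi_{\theta}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \log\frac{\pi_{\theta}(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}.$$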

For further discussion, we invite the reviewer to read our responses to reviewers C57F and Sf9v, who pose similar questions.

Some references on LLM calibration have been outdated. Below are a few works you may discuss.

We sincerely appreciate the references and will include them in our related works and discussion. We agree with the relevance of these works in the broader scope of research on LLM calibration and will keep these in mind for future work as well. If there are further requirements in which the reviewer would like to see these works discussed, we are happy to provide further details in how we plan to incorporate them.

Please use less LLM polishing for your paper.

Thank you for your comment. We would like to note that we actually did not use an LLM to polish our phrasing, but we agree strongly with the sentiment from the reviewer. If the phrasing resulted in any part of our work becoming less clear, we are happy to adjust such portions of the paper.


Once again, we would like to express our sincerest gratitude to the reviewer for their extensive comments on our manuscript. We again appreciate their positive comments towards our methodology, experiments, and results. We also hope they can read our response regarding areas in which they felt improvements could be made, and are hopeful that this response has properly addressed those areas. If this sentiment is shared, we would be highly thankful if it could be reflected in an improved scoring of our work.


References

[1] Chowdhury et al. Provably Robust DPO: Aligning Language Models with Noisy Feedback. ICML, 2024.

Review (Rating: 4)

In this work the authors examine the effect of instruction tuning on LLM calibration, i.e., when the model says it is 70% sure about a prediction, does it actually get it right ~70% of the time? They find that the majority of available LLMs are reasonably well calibrated, but once instruction tuning is applied, they tend to become miscalibrated and end up overconfident in their predictions.

Previous work has used label smoothing during instruction tuning as a way to reduce miscalibration. This work aims to explain why and identify settings, like when a large vocabulary is used, where label smoothing is not expected to help.

They include theoretical results, such as how things like model and vocabulary size establish lower bounds on entropy and how that relates to calibration.

Additionally, they develop and benchmark a new method for calculating logits and loss for large vocabularies while supporting label smoothing.

Questions for Authors

N/A

Claims and Evidence

Their claims seem well founded.

Their insights into model size and its effect on entropy are validated in their experiments, which show that label smoothing is less effective at calibrating smaller models. And their finding that LS helps calibration in most settings is in line with their claims and previous work.

Additionally, their evaluation of the new LS kernel supports their claims about it being faster and more memory efficient.

Methods and Evaluation Criteria

Evaluation of calibration is done using just 3 datasets; it would be more convincing if there were more, especially ones where more text had to be generated and there was some notion of confidence in the entire answer instead of just the first token.

Theoretical Claims

I did not check the correctness of the proofs

Experimental Design and Analysis

Their experimental design seemed reasonable.

Supplementary Material

I did not review the supplementary material

Relation to Existing Literature

They mention a lot of previous work, including the other works that first found that LS was helpful in terms of calibration. Additionally, they include multiple common metrics for calibration.

Missing Important References

Not that I know of.

Other Strengths and Weaknesses

N/A

Other Comments or Suggestions

Figure 5 should probably be at the top of the column

Author Response

First, we thank the reviewer for their review of our work and the enthusiasm they express regarding it. We appreciate that they find the claims well founded and the insights supportive of these claims, as well as their agreement regarding our new kernel that incorporates label smoothing.


Evaluations of calibration is done using just 3 datasets, it would be more convincing if there were more.

To address the comment raised regarding the use of different datasets, we are happy to provide additional results on different datasets here (https://anonymous.4open.science/r/Anonymous-0F38/README.md). We provide additional results on two datasets, WinoGrande and TruthfulQA, which demonstrate similar results compared to the three shown in our manuscript.

It would be more convincing if there were more [...] and there was some notion of confidence in the entire answer instead of just the first token.

With respect to whether there is some notion of confidence in the entire answer instead of just the first token, we note that while such a concept exists, there may be limitations that are fundamental to the field of confidence calibration. Take for example two sequences A and B. We can measure a level of confidence on the two as the normalized model perplexity over the two options. The fundamental limitation, however, is the need for these sequences as references for the model: otherwise, the normalizing constant would have to range over the infinitely many potential generations from the model.
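To make the reference-set notion concrete, a minimal sketch (a hypothetical helper; it assumes per-token log-probabilities for each candidate answer are already available):

```python
import math

def sequence_confidence(option_logprobs):
    """Confidence over a fixed set of reference answers.

    option_logprobs: list of per-option lists of token log-probabilities
                     under the model, e.g. [[-0.2, -1.3, ...], [-0.7, ...]].
    Returns one probability per option, normalized over the given set only.
    """
    # Length-normalized log-likelihood (log of inverse perplexity) per option.
    scores = [sum(lp) / len(lp) for lp in option_logprobs]
    # Softmax over the reference set; without such a set the normalizer is intractable.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```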

We will specifically mention this fundamental limitation of confidence/calibration measurements in an updated version of our manuscript, in order to make this point transparent. Further discussion on similar questions can be found in our response to reviewers LLNm and Sf9v.

Figure 5 should probably be at the top of the column

We also appreciate the mention of the placement of figures, which we will adjust.


Again, we would like to show our appreciation to the reviewer for their positive response to our work and for highlighting the fundamental soundness of our methodology. We also acknowledge some minor comments regarding how to discuss specific points of the overarching problem being studied, which we believe have been addressed within this rebuttal. We are hopeful that the reviewer feels similarly, and if any additional questions or comments remain, we stand ready to provide additional responses to clarify them.

Review (Rating: 2)

This work focuses on mitigating mis-calibration of LLM during SFT by incorporating label smoothing. It argues that label smoothing helps reduce model overconfidence by promoting equality among logits while enhancing learned feature embeddings. Through empirical experiments, this work demonstrates the effectiveness of label smoothing. Additionally, it designs custom kernels to enable label smoothing with significant memory and throughput improvements, without compromising performance.

Questions for Authors

NA

Claims and Evidence

  1. This work claims that label smoothing can mitigate mis-calibration by encouraging "diversity of feature embeddings". However, in the subsequent argument, this work only shows that label smoothing encourages equal logits, and it is unclear how this is connected with "diversity of feature embeddings" (which itself is not clearly defined in this work).

  2. This work claims that label smoothing is less effective in improving calibration for "large vocabulary LLMs" because "large vocabulary LLMs" do not have the ability to be overconfident. However, this cannot explain why label smoothing yields a huge improvement on Gemma2-2B (and relatively small improvements on Llama3-3B/1B). Additionally, to support the claim, the authors should show that the calibration errors of Llama3-3B/1B are indeed smaller (or less over-confident) compared to Llama3-8B. However, these numbers are not provided. If one looks at Figure 6 in the appendix, it appears that Llama3-3B and 1B are both still over-confident.

  3. In the introduction section, authors state that "We further show how alternative techniques, such as temperature scaling and logit capping, explicitly act as a mechanism to steer models toward overconfidence, allowing the benefits of label smoothing to once again emerge." However, no actual experiments/statistics are conducted to support this statement.

Methods and Evaluation Criteria

The proposed methods are well-suited to the problem, and the evaluation criteria appropriately align with the task.

Theoretical Claims

See claims and evidence section.

Experimental Design and Analysis

The empirical experiments conducted in this work are methodologically sound and effectively demonstrate: (1) the impact of label smoothing in mitigating model overconfidence in certain LLMs, and (2) the effectiveness of the custom kernels in enhancing memory efficiency and throughput.

Supplementary Material

NA.

Relation to Existing Literature

NA.

Missing Important References

NA

Other Strengths and Weaknesses

  • The paper has several typos in its math symbols. E.g., in Eq. (1), the index i should be n, and "L is the length of a discrete input sequence" on page 3 should be N.
  • The paper would benefit from a more logical argument, as well as improved clarity and structure in its writing.

Other Comments or Suggestions

NA

Author Response

We would like to thank the reviewer for providing a thorough evaluation of our work. We are delighted that the reviewer finds the work to have been positioned and set up in a manner that aligns with supporting our claims. We are hopeful that the below responses can sufficiently address their remaining concerns regarding our work as well as provide the necessary details to clarify possible misunderstandings resulting from their absence.


However, in subsequent arguments, this work only shows that label smoothing encourages equal logits, and it's unclear how this is connected with "diversity of feature embeddings" (which itself is not clearly defined in this work).

Indeed, this section could be better detailed, which we may have overlooked as a result of the page limit in the initial submission.

Oh et al. (2024) provide a theorem (Theorem 3.1 of their work) which decomposes error into two components, a classification error (i.e., accuracy) and a calibration error. They show both errors depend on the reciprocal of the smallest singular value of the covariance matrix of the input x, with a smaller value leading to greater error. Mathematically, this means the directions of variance within the input features should be diverse and independent.

Next, our Proposition 3.3 shows label smoothing to be a MAP estimation problem. Prior works have shown both empirically (Batmanghelich et al., 2014) and theoretically (Rajkowski, 2019) how this can lead to the learning of more diverse feature sets. First, Chi et al. (2024) show that the logits prior to the LM head are distributed as a Gaussian. Rajkowski (2019) proves that, under the assumption that samples are generated from a Gaussian process, MAP estimation divides the data into clusters whose convex hulls are disjoint (Proposition 1 in their work) and into clusters of limited size (Proposition 2), leading to distinct, limited-sized clusters that make the features diverse and easy to distinguish (Corollary 3). Thus the model learns more diverse features within its representations.

In light of these details, we are happy to provide additional reference and the associated lemmas and remarks within an updated manuscript.

If one looks at Figure 6 in the Appendix, it appears that Llama3-3B and 1B both are still over-confident.

We agree that this could be clarified. Our analysis in Sections 3/4 establishes that smaller models have a higher lower bound on entropy, which means their predictions are bound to be less concentrated than those of a larger model. However, this does not limit them to being under-confident or perfectly calibrated, as the lower bound can still leave room for the model to become overconfident. Thus we can see that as model size increases, so does overconfidence, a signal that the entropy bound could in fact be influencing predictions.

Authors state that "alternative techniques explicitly steer models toward overconfidence, allowing the benefits of label smoothing to once again emerge." However, no actual experiments/statistics are conducted to support this statement.

We provide clarification here. We can note the effect of soft-capping directly from comparisons between Gemma and Gemma2. Figure 8 in Appendix B.2 shows that Gemma is naturally less prone to overconfidence compared to Gemma2, while the base models are both well calibrated. Neither model is publicly transparent about its raw pre-training data; however, the fact that both are well calibrated prior to fine-tuning indicates that both start at a roughly equivalent point. After SFT, Gemma2 models are naturally more overconfident than Gemma models, highlighting how the main difference between the two, the logit soft-capping, can naturally lead to overconfidence, which in turn enables label smoothing to be more effective.
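For reference, logit soft-capping (as used in Gemma2) bounds logit magnitudes through a tanh; in its standard form, with a model-specific cap $c$,

$$\tilde{z} = c \cdot \tanh\!\left(\frac{z}{c}\right), \qquad \lvert\tilde{z}\rvert < c,$$

so logits are squashed into $(-c, c)$ before the softmax.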


We extend our gratitude towards the reviewer for their engagement with our work. We appreciate their feedback and are hopeful that the provided response above clarifies any remaining uncertainties to highlight the strengths of our work. We hope that the reviewer shares this impression and would be appreciative if such an opinion could be reflected through an improved assessment. We also remain ready to provide any additional details for questions that remain.


References

[1] Oh et al. Towards Calibrated Robust Fine-Tuning of Vision-Language Models. In NeurIPS, 2024.

[2] Batmanghelich et al. Diversifying Sparsity Using Variational Determinantal Point Processes. arXiv, 2014.

[3] Rajkowski, Łukasz. Analysis of the maximal posterior partition in the Dirichlet Process Gaussian Mixture Model. In Bayesian Analysis, 2019

[4] Chi et al. Attention Alignment and Flexible Positional Embeddings Improve Transformer Length Extrapolation. In Findings of the Association for Computational Linguistics: NAACL, 2024.

Final Decision

The submission studies calibration in the context of supervised fine-tuning. In the context of LLMs, label smoothing is expensive, and the submission contributes an efficient label smoothing procedure. The submission is supported by reasonable empirical evidence and by some theoretical proofs. The reviewers were generally positive, with three out of four recommending acceptance. Reviewer fRKR felt that the submission was just below the threshold, with some specific concerns about the use of language in the claims of the paper. The authors are encouraged to take these concerns seriously and consider refining the claims in the submission to better match the empirical and theoretical evidence.