PaperHub
7.0/10 (Poster; 4 reviewers; lowest 6, highest 8, std. dev. 0.7)
Ratings: 7, 6, 7, 8
Confidence: 3.8
COLM 2025

Always Tell Me The Odds: Fine-grained Conditional Probability Estimation

OpenReview | PDF
Submitted: 2025-03-21 | Updated: 2025-08-26
TL;DR

We present a state-of-the-art model for fine-grained probability estimation of textual outcomes conditioned on context.

Abstract

Keywords
Large Language Model, Probabilistic Reasoning, Semantics, Calibration

Reviews and Discussion

Review
Rating: 7

Probabilistic inference from LLMs can be useful when used in mission-critical applications to account for ambiguity. However, LLMs are known to be calibrated poorly when compared to true distributions. This work addresses this problem by proposing a new pipeline that uses ideas from methods for decoder-based regression, synthetic data generation, and rank-consistency training.

The proposed pipeline of fine-tuning open-weight decoder LLMs for fine-grained probability estimation includes three parts: 1) discretizing scalar ground truth estimates into quantized bins, and using this discrete distribution for supervised fine-tuning using the KL loss. The model is trained to predict for tokens corresponding to these bins, 2) supervised training using synthetic data generated with a combination of LLM-ensemble and judge framework. Specifically, probability estimates are prompted for an instance from multiple LLMs. The discrepancy between these estimates is compared, and high discrepancy instances are sent to an LLM judge for confidence of each probability estimate along with its reasoning chain. The synthetic estimate is then computed using the Expected Label Score technique, and 3) indirect supervision from rank consistency, which enforces the same order between predicted and true probability estimates between two instances.
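
To make ingredients 1) and 2) concrete, a minimal sketch of the quantization step, the KL fine-tuning target, and the Expected Label Score decoding is given below. The bin count, the soft-quantization scheme, and all names are illustrative assumptions, not necessarily the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

K = 10  # number of quantized probability bins (an illustrative choice)
bin_edges = torch.linspace(0.0, 1.0, K + 1)
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2  # scalar value attached to each bin token

def quantize_label(y: float, temperature: float = 0.05) -> torch.Tensor:
    """Turn a scalar ground-truth probability y in [0, 1] into a soft
    categorical target over the K bin tokens (nearby bins get more mass)."""
    scores = -torch.abs(bin_centers - y) / temperature
    return F.softmax(scores, dim=-1)

def kl_fine_tuning_loss(bin_logits: torch.Tensor, y: float) -> torch.Tensor:
    """KL divergence between the quantized target and the model's predicted
    distribution over the bin tokens (the supervised fine-tuning loss)."""
    return F.kl_div(F.log_softmax(bin_logits, dim=-1), quantize_label(y), reduction="sum")

def expected_label_score(bin_logits: torch.Tensor) -> torch.Tensor:
    """Expected Label Score: decode a scalar estimate as the expectation of
    the bin centers under the predicted categorical distribution."""
    return (F.softmax(bin_logits, dim=-1) * bin_centers).sum()
```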

The authors compare their fine-tuning approach with other zero-shot LLMs (Deepseek-R1-Distill-Qwen-32B and GPT-4o) and a fine-tuned encoder-based regression model (RoBERTa-large). The authors evaluate the models across three aspects: a) Intrinsic, which evaluates the quality of the estimates, b) Comparison, which evaluates the plausibility ranking among estimates, and c) Structural, which evaluates the estimates in a structured reasoning setting. The results show that the introduction of both synthetic data and rank-consistency training significantly improves over the baseline models and the models fine-tuned with human supervision.

Reasons to Accept

  1. The proposed method is largely cogent and useful in scenarios with limited human-annotated data. Experimental choices are driven by insights from literature.
  2. The multi-pronged evaluation setup is commendable. In addition, it is nice to observe the clear individual improvements from the synthetic data and the rank-consistency training.

Reasons to Reject

  1. The setup of the baseline encoder model is not fully clear. My assumption is that the model is a fully fine-tuned RoBERTa with a regression head (l.176). If that is the case, this model would be an unfair baseline since it compares a categorical model with a regression model. An encoder model trained to predict a categorical distribution over the quantized bins would be a fairer baseline comparison.
  2. While I am happy with the process taken for synthetic data creation, it would have been useful to test the importance of the LLM-judge confidences and the corresponding normalization. Does a simple aggregation of model estimates provide a good-enough distribution?
  3. I would have been interested to see how the pipeline proposed works without any ground truth data. That is, in the equation below line 229, can the first loss term be removed? This setting is essential to contextualize the role human supervision plays in the gains, especially when some of the baselines are considered in the zero-shot setting.
  4. The writing at certain places (esp. Section 3) can be improved. For example, I am not completely sure how the distribution gets aggregated when an instance passes the discrepancy filter. Figure 1 suggests that only the instances that do not pass the discrepancy test are sent to the LLM judge. However, lines 212-220 only explain how the distribution of these instances is computed. Similarly, Figure 1 suggests that the expected label score for the rank-consistency loss comes from the synthetic data, although my understanding is that this is the score from the model that is fine-tuned. Several important details are not explicitly stated (e.g., what is the judge model?).

Questions for Authors

Questions:

  1. L.167: “summarize the rationale to extract key steps”. Is there a chance that this step introduces errors that propagate into subsequent steps?
  2. L.193: Can you provide details on what the function f might correspond to in practice? My intuition suggests that this should be the mean of the bin.
  3. L.219: How are the mixture distributions Q_i(t|k) computed? Are these the quantized distributions of the estimates? If so, they should be denoted by small q's to be consistent with the notation.

Suggestions:

  • L.162: “high-discrepancy …further review.” I would encourage the authors to mention what “further review” means here.
  • L.261-264: “removing modality, …., reasoning challenge”. I am unsure what this means. Can you provide examples in the appendix?

Typos:

  • L.28: “depends” -> “depend”
  • L.30: “involves” -> “involve”
  • L.113: “setupsevaluation” -> “setup, evaluation”
  • L.187: (0,1) -> [0,1]
  • Equation below l.229: “(D)_synthetic” -> (D_synthetic)
  • L.330: “taskssuch” -> “tasks such”
Comment

We thank the reviewer for recognizing the general usefulness of our task formulation and for commending our evaluation design. Below we clarify the remaining points of confusion about the experimental design.

1. Encoder Baseline and Regression Setup

RoBERTa-Large baseline: This serves as a strong encoder-only baseline. Trained solely on UNLI, it processes the concatenation of the premise and hypothesis and maps the [CLS] embedding through a linear layer to produce a scalar probability prediction.

Although our decoder models predict a categorical distribution over quantized bins, we ultimately compute the expected value under this distribution (via the Expected Label Scoring Rule) to obtain a scalar score. This makes it a regression model capable of predicting fine-grained probability scores, ensuring a fair comparison against encoder-only models with a regression head.

2. Does Simple Aggregation Suffice?

While rerunning all experiments with mean aggregation would be expensive, we can share the observations that motivated our choice to include an LLM judge. Two of the main authors manually annotated 3-way preferences (win-loss-tie) on a small WANLI subset (50 examples). We selected WANLI because its examples are shorter, contain more relevant premise-hypothesis pairs, and are easier for humans to understand. On this subset, the simple average and the LLM-as-a-judge methods produced large discrepancies in their probability estimates (greater than 0.3). For each example, the two probabilities were randomly shuffled to avoid positional bias. We found a slight but consistent preference for LLM-as-a-judge over the simple average (win rate of 56%). Qualitatively, on these examples LLM-as-a-judge made the distribution sharper and sometimes effectively down-weighted obviously incorrect responses.

3. Training Without Human-Annotated Ground Truth

We thank the reviewer for this thoughtful suggestion. As discussed in Section 5, there appears to be a distribution mismatch between human and LLM-generated labels. Building on prior work that focuses on aligning model uncertainty with human subjective uncertainty [1,2], we consider it a key objective for our model’s uncertainty estimates to reflect human belief. As noted in our general response, this alignment is essential for downstream tasks such as decision making and for building agents that understand the world in ways similar to humans. We view high-quality human labels (please see our detailed discussion of label quality in the response to reviewer YjWc) as a critical resource for achieving this goal.

4. Concern about errors introduced during the summarization process

We thank the reviewer for this suggestion and propose to evaluate the consistency of final scores before and after summarization. We obtain a Spearman correlation of 0.984 on QwQ reasoning chains and a Spearman correlation of nearly 1 on DeepSeek’s rationales, demonstrating high alignment between the original and summarized reasoning chains.
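
Concretely, this check is a rank correlation between the final scores produced from the original and the summarized reasoning chains; a minimal sketch with hypothetical score lists:

```python
from scipy.stats import spearmanr

# hypothetical final probability estimates from the original vs. summarized chains
scores_original = [0.12, 0.45, 0.80, 0.33, 0.91]
scores_summarized = [0.10, 0.50, 0.78, 0.30, 0.93]

rho, _ = spearmanr(scores_original, scores_summarized)
print(f"Spearman correlation before vs. after summarization: {rho:.3f}")
```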

5. Function of f(.)

Your understanding is correct: we use the center of each bin to convert the bins into scalars.

6. Meaning of mixture distribution

Thank you for pointing this out. We agree with the reviewer that q(t|x) might be a more consistent notation. We’ll make sure this gets updated in the next version, together with other writing improvements as discussed by the reviewer.

7. “Further Review”

Thank you for pointing this out. By "further review" we mean that the LLM judge is used to compute confidence scores for these instances. We will make this clear in the revision. We would also like to clarify that if an instance passes the discrepancy filter, all scores are aggregated uniformly.

8. “Removing Modality“

EntailmentBank was originally proposed as a dataset of explanation trees in which each inference step can be fully grounded. We extract probabilistic inference steps by removing modalities (possibly, necessarily, likely), quantifiers (any, exists), etc. We will provide some examples in the appendix.

References

[1] Zhengping Jiang, Anqi Liu, and Benjamin Van Durme. Addressing the binning problem in calibration assessment through scalar annotations. Transactions of the Association for Computational Linguistics, 12:120–136, 2024a.

[2] Tongfei Chen, Zhengping Jiang, Adam Poliak, Keisuke Sakaguchi, and Benjamin Van Durme. Uncertain natural language inference. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8772–8779, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.774. URL https://aclanthology.org/2020.acl-main.774.

Comment

I thank the authors for their response.

Regarding Encoder Model Setup. I might not have been clear in my review. Since the encoder model with a regression head outputs a finer prediction, training such a model is harder as compared to learning a categorical distribution, such as the one proposed in the paper (the prediction can be within some ϵ of the true value). Ideally, I would have liked to see the encoder model trained with a similar categorical head, in addition to the regression baseline. I understand that the final comparison is based on EV, and that comparison is fair; however, it is the difference in how you compute the EV that is somewhat unfair.

Simple aggregation. I thank the authors for this quick experiment. I believe this is an essential ablation to justify the setup.

I am happy with the authors' response to my other comments.

Comment

Thank you for acknowledging that you are generally satisfied with our response!

While we primarily tested against a regression head, as it appears to be a more natural way of modeling scalars, we appreciate your insight that modeling a categorical distribution can be easier. We include the following quick verification experiment:

Encoder Model Setup: We further evaluate an encoder-based model with a categorical classification head. Specifically, we train the encoder to match the discretized UNLI label distribution, as described in our paper. At inference time, we apply an Expected Label Scoring rule to convert the predicted categorical distribution into a fine-grained probability estimate.

We report Spearman correlation from the Intrinsic Evaluation, utilizing both 10 and 100 categories. As shown in the table below, these approaches only outperform LLama-7B with a single regression head on the circa dataset; in all other evaluations, neither variant achieves competitive performance compared to ours:

Dataset    UNLI   circa   GNLI   EntailmentBank   e-CARE
unli-10    .748   .490    .560   .498             .735
unli-100   .750   .513    .550   .515             .742
Comment

Thank you for the response. I am happy to increase my score. I hope that the authors include the discussion on both the categorical baseline and the simple aggregation ablation in their final version. I would encourage the author to address the writing of the paper (esp. the points mentioned in the original review).

Review
Rating: 6

This paper proposes a suite of training methods that improve a model's probability estimation capabilities. At its core, the paper proposes a synthetic data generation method that curates probability labels from multiple LLMs with careful consideration of their confidence and quality. Following the data, the authors propose a discretized probabilistic training scheme with multiple objectives (i.e., direct prediction and ranking consistency) to utilize the synthetic data, and show that the resulting model outperforms vanilla LLMs across several probabilistic benchmarks and evaluation settings.

Reasons to Accept

This paper discusses an important aspect of LLM usage: probability and uncertainty estimations. The proposed methods are effective to some extent, especially on in-domain data.

Reasons to Reject

  1. The contribution is limited; a major part of the proposed methods (i.e., discretized probabilistic training and ranking-consistency training) has been extensively discussed in existing works that use ML models for probability estimation. As I see it, the main novelty is the synthetic data generation part. However, this part itself is flawed because of the next point.
  2. The experiments are insufficient for readers to tell exactly how and why the synthetic data is helpful. The authors claim that the improvement using the synthetic data is because of its high-quality and high-confidence probability estimations from existing LLMs. However, there could be many other reasons for models to improve, such as some data leakage (as the authors are using in-domain training data to create these synthetic labels). Furthermore, it is hard to tell exactly how the synthetic data are generated based only on the description of the main sections of the paper, and the entire approach seems more like a distillation process rather than synthetic data generation process.
  3. The improvements from the synthetic data alone are compared with the in-domain supervised baselines, which further diminishes the contribution of this paper.
Comment

We thank the reviewer for their thoughtful feedback. However, we feel there is some misalignment between what the main contributions of our work are and how our experiments support our main research claims. We would like to kindly clarify these two points.

  1. Novelty and Contribution:

We respectfully clarify that while some of the methods in the paper are indeed unique (e.g., synthetic data creation), we consider the main contribution to extend beyond methodology. A major contribution of the work is pinpointing the important but understudied problem of localized conditional probability inference and presenting an artifact that pushes performance significantly beyond widely adopted solutions (ask-for-calibration, T/F logits, as shown in our additional experiments). We hope that, through careful discussion of a highly effective training recipe and a comprehensive evaluation suite, we will motivate future research on realizing and studying the potential of LLMs as general probability estimators, which we believe can be a critical step in LLM reasoning. This work is not simply about proposing new methods. It is a task-driven study, anchored in practical needs for modeling uncertainty, and it contributes tools, data, and benchmarks that the community can build upon.

  2. Effectiveness of Synthetic Data:

We respectfully disagree with the reviewer’s assessment that synthetic data is helpful solely because it facilitates in-domain training. We carefully designed our evaluation suite to test generalization. The synthetic data is derived from ANLI and WANLI, totaling ~26k examples, which we specifically chose to be distinct from the test data in our evaluation suite. In fact, many of the tasks in our evaluation suite have very different data and label sources and formulations from traditional NLI datasets, and the fact that our model can be applied effectively to very different tasks such as comparison and graph-based probabilistic belief inference is strong evidence of the wide applicability of our problem formulation, as well as the generalizability of our model.

We acknowledge the ambiguity between "distillation" and "synthetic generation" in terminology. Our approach sits between both: we reuse existing input queries (like premises/hypotheses) but prompt LLMs for step-by-step reasoning on conditional probability. This process then allows us to synthesize final probabilistic labels that differ significantly from original categorical labels, enabling richer supervision and demonstrating that our method isn't merely a distillation process.

To further elaborate on our synthetic data construction and its motivation, we first illustrate the quality of initial LLM estimations. Here's the Spearman correlation with human annotations on the UNLI test split for the four models we used:

Model         ds       qwen   qwq      llama
Valid lines   3026     3040   2986     3040
Correlation   0.7459   0.73   0.7532   0.7242

Although individual model estimations do not fully align with human annotation, we observed that aggregating the estimates of models that agree with each other increases performance. This insight directly motivated our agreement-based filtering strategy:

Discrepancy   Num. examples   Spearman
0.5           2802            .805
0.3           2110            .847
0.2           1662            .869

Subsequently, we apply a confidence-aware aggregation pipeline to assess the probability labels. We found that the LLM-as-a-judge approach sharpens the distribution and effectively down-weights obviously incorrect responses. For more details on this, please refer to our second discussion with Reviewer eG4a.

Comment

Yes, I agree. Here, the novelty is more about "why the authors are doing this". If all the "whys" are discussed somewhere, the authors can no longer make "why" their main contribution. If the "whys" are off the table, the authors have to explain the "hows" to show some contribution. However, as I mentioned above, the only "how" I see is that larger-model supervision improves smaller models' performance. I wouldn't agree that this counts as a contribution if no task-specific insights can be provided here.

Comment

Thanks to the authors for their comments.

I will improve my rating a little because of the authors' clarification on the paper's position. However, even the position on "pinpointing this important but understudied problem of localized conditional probability" is not novel by any means, as many previous works, such as those cited by this paper, discuss similar problems, too, just not in exactly the same task setup. I do not think combining two non-novel points will automatically lead to any meaningful contribution, and the authors can do better in justifying why we need this new task formulation.

Regarding the authors saying, "We respectfully disagree with the reviewer’s assessment that synthetic data is helpful solely because it facilitates in-domain training," if the authors read my comments carefully, they should realize that I have never expressed that I believe in-domain training is the sole reason for improvement. I merely suggested that finding out what kind of training leakage the proposed method will produce will help us understand the improvements. In addition, judging by the authors' comments, terminologies aside, the proposed method is indeed using external (larger) models' knowledge and capabilities to supervise smaller models, and I do not see what the contribution of the synthetic data "method" is, clearly without additional ablation studies. Overall, I believe the empirical experiment part of this paper is insufficient to support the main claims that the paper is trying to make. To be more specific, the main empirical takeaway of this paper is that "larger LMs can produce supervision signals to improve smaller models", and I see no specific insights into how this conclusion can be tied to the conditional probability estimation task, and why models are doing better.

Comment

I just want to add that novelty is overrated. Actual novel work in proceedings is rare.

Comment

Thank you for taking the time to re-evaluate the paper and for improving your rating! While we respect your decision to hold firm in your assessment of the novelty, we would like to take this opportunity to summarize the “why” that we believe makes this work necessary.

We agree that prior work has discussed motivations for having models output confidence scores that align with human-perceived conditional probabilities. However, despite these discussions, the field has largely continued to rely on categorical labels (e.g., "entailment", "neutral") as the dominant supervision signal. One of our goals is to challenge this status quo by demonstrating that scalar confidence scores, which are closer to how humans actually reason about uncertainty, can be both meaningful and tractable in practice.

We understand your skepticism regarding the novelty of the task formulation. While individual components (e.g., LLM supervision or scalar prediction) are not entirely new, our contribution lies in combining these elements within a concrete, general-purpose framework for conditional probability estimation, along with releasing models and data that demonstrate this setup can be effective. We believe the artifacts serve as a practical and reproducible benchmark that could encourage further research in this direction.

We agree that data leakage and attribution of gains are crucial. However, we believe the risk of data leakage is low, primarily because the +Syn model is an ablation designed to evaluate the impact of synthetic data. For this particular study, all evaluations, with the exception of the UNLI correlation, are conducted on out-of-domain datasets. Furthermore, we have introduced an additional ablation, +Syn+R. This model adds rank-consistency supervision on top of +Syn, demonstrating further improvements in decision-making across a variety of tasks.

Review
Rating: 7

This paper focuses on fine-grained probability estimation, where LLMs are trained to produce the probability of a given context being true. The core contribution of this paper lies in the data construction process and the model fine-tuning method. The authors first present their data construction pipeline, in which LLMs are prompted to generate reasoning chains and probabilities of the context being true given the input context. An LLM-as-a-judge then re-evaluates those examples where the generated probabilities are discrepant with each other.

After the data construction, the model is fine-tuned on these data (context + probability). The authors additionally propose a rank-consistency training method to mitigate the data scarcity issue. Experimental results on a large number of benchmarks and across multiple models demonstrate the effectiveness of the proposed method.
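
To make the ranking objective concrete: one common way to enforce order agreement between a pair of predictions and their reference scores is a pairwise hinge on the decoded scalar estimates, roughly as below. The margin, the pairing strategy, and the names are illustrative assumptions, not necessarily the paper's exact formulation.

```python
import torch

def rank_consistency_loss(pred_i: torch.Tensor, pred_j: torch.Tensor,
                          ref_i: float, ref_j: float,
                          margin: float = 0.0) -> torch.Tensor:
    """Penalize a pair of predicted probability estimates whose ordering
    disagrees with the ordering of the reference scores; equivalent to
    torch.nn.MarginRankingLoss applied to the decoded scalar predictions."""
    sign = 1.0 if ref_i >= ref_j else -1.0
    return torch.clamp(margin - sign * (pred_i - pred_j), min=0.0)
```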

Reasons to Accept

  1. The data construction process is rigorous and reasonable. The method of model fine-tuning is also effective.
  2. Strong performance across multiple benchmarks and models. Compared to the small-scale RoBERTa model, the model delivered in this paper significantly improves performance. The ablation studies (+Syn and +Syn+R) further demonstrate the effectiveness of the proposed fine-tuning method.

Reasons to Reject

The reviewer does not find significant issues with this paper. Please refer to the following minor issues:

  1. Some writing is unclear, especially for the data construction process:

    • Reasoning-Augmented Prompting. It would be better to show the prompt the authors used to elicit such probabilities; otherwise readers cannot know what the model is actually generating. Also, how many LLMs are used here? Just one? The Agreement-Based Filtering implies that multiple LLMs generate their reasoning and probabilities individually.
    • Agreement-Based Filtering: Is this performed at the example level? Given a specific input example (context + generation), multiple LLMs generate reasoning chains and final probabilities. The authors also mention that if the highest probability minus the lowest probability exceeds a threshold, the example, along with all generated chains, is flagged for further review. However, what "further review" means is not explained. According to the context, this should be the LLM judgment, but the writing is confusing. A reader may interpret "further review" as human evaluation, while the examples that are not filtered will be judged again by the LLM.
  2. Missing discussion with model calibration methods. To predict the probability of the context being true using an LLM, a more straightforward way is to first elicit a probability from the LLM (e.g., get the next token prediction probability of True and False) then calibrate such confidence scores. In this line of work, temperature scaling [2] and other calibration methods [1] can achieve the same goal as this paper. These methods can be a strong baseline but are missing in the comparison.

Typos:

Line 113: availablee -> available

[1] Tian, Katherine, et al. "Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback." arXiv preprint arXiv:2305.14975 (2023).

[2] Shen, Maohao, et al. "Thermometer: Towards Universal Calibration for Large Language Models." arXiv preprint arXiv:2403.08819 (2024).

Comment

Thanks for your invaluable suggestions! Your feedback has been crucial in improving the quality of our paper. Let me address your concerns below.

1. Clarify the details of Synthetic data construction

Here are some key clarifications regarding our synthetic data construction:

  • Prompts used to elicit probabilities from LLMs are included in Appendix C.
  • For each example, we query four LLMs, two instruction-tuned models and two reasoning-augmented models, to independently generate probability estimates (see lines 298–300).
  • Agreement-Based Filtering is performed at the example level. For a given input, if the range of predicted probabilities exceeds a predefined threshold, the example and all of its associated reasoning chains are flagged for further review.
  • Further review involves re-evaluation by a separate reasoning-enhanced LLM, which acts as a judge model. We choose DeepSeek-R1-Distill-Qwen-32B for its strong correlation with human judgments on UNLI and its instruction-following ability. This step ensures we select high-confidence pseudo-labels with high-quality rationales, and no human annotation is involved in this process (a schematic sketch of this routing follows the list).
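
The routing described above amounts to the following schematic; the threshold value and the confidence-weighted aggregation here are illustrative placeholders rather than our exact settings:

```python
from statistics import mean

DISCREPANCY_THRESHOLD = 0.3  # placeholder value, not necessarily the actual setting

def aggregate_example(estimates, judge_confidences=None):
    """Example-level agreement-based filtering.

    `estimates` holds the scalar probabilities from the prompted LLMs.
    Low-disagreement examples are averaged uniformly; high-disagreement
    examples are routed to the LLM judge, whose per-estimate confidences
    are normalized and used to weight the aggregation."""
    discrepancy = max(estimates) - min(estimates)
    if discrepancy <= DISCREPANCY_THRESHOLD:
        return mean(estimates)
    assert judge_confidences is not None, "flagged examples need judge confidences"
    total = sum(judge_confidences)
    return sum(c / total * e for c, e in zip(judge_confidences, estimates))
```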

2. Missing discussion with model calibration methods

We implemented the True/False probing [1] and the Thermometer calibration method [2]. The general idea is to decode the True and False logits and then calibrate them with temperature scaling. The results show that this does not always provide high correlation with humans, especially in the comparison evaluation. Furthermore, since temperature scaling applies a monotonic change to the output logits, it offers no benefit for ranking-based metrics such as Spearman correlation. Please refer to Overall Response Experiment (1) for more information.
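
For concreteness, the baseline reduces to something like the sketch below; the prompting details and Thermometer's per-task temperature prediction are omitted, and the function name is ours:

```python
import torch

def true_false_probability(logit_true: torch.Tensor,
                           logit_false: torch.Tensor,
                           temperature: float = 1.0) -> torch.Tensor:
    """Convert the next-token logits of the 'True' / 'False' continuations
    into a probability of 'True', with temperature scaling applied.

    Dividing both logits by the same temperature is monotonic, so the
    ranking of examples (and hence Spearman correlation) is unchanged;
    only the absolute probability values move."""
    scaled = torch.stack([logit_true, logit_false]) / temperature
    return torch.softmax(scaled, dim=0)[0]

# example usage with hypothetical logits
p_true = true_false_probability(torch.tensor(2.3), torch.tensor(0.4), temperature=1.8)
```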

References

[1] Tian, Katherine, et al. "Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback." arXiv preprint arXiv:2305.14975 (2023).

[2] Shen, Maohao, et al. "Thermometer: Towards Universal Calibration for Large Language Models." arXiv preprint arXiv:2403.08819 (2024).

Comment

I thank the authors for the detailed feedback. The response makes sense to me. I will keep my positive rating on this paper.

Review
Rating: 8

The paper presents an approach for training LLMs to predict probability estimates. The paper uses LLMs to modify existing datasets, relying on LLMs to generate probability estimates for examples in a suite of NLI-related datasets. The paper then fine-tunes LLMs to perform regression, specifically using a loss function that is a linear combination of the loss of the model's prediction on the original example, the loss on the modified probabilistic label, and a ranking loss. The experiments indicate that fine-tuning the largest available model with this loss typically outperforms RoBERTa-Large and non-fine-tuned models.

Reasons to Accept

The paper tackles the important problem of using LLMs for regression problems. The paper is thorough and includes convincing experiments and results.

Reasons to Reject

I started reading this paper thinking that the paper was about how to get better calibrated probabilistic predictions from LLMs, especially based on this sentence in the abstract: "LLMs continue to struggle with making … well-calibrated probabilistic predictions under uncertainty or partial information". I was very interested in this paper's approach to creating well-calibrated probabilistic predictions.

However, it became clear that the paper is about using LLMs for regression problems that are inspired by probabilistic reasoning rather than actually making well-calibrated probabilistic estimates. Building models that make well-calibrated probabilistic estimates is a very important problem. I wish the paper had either tackled this head-on or framed the paper as using LLMs for regression (which is valid, and worthy work too).

Questions for Authors

Are "human-annotated conditional probabilities” actually “well-calibrated probabilistic predictions”? The intrinsic evaluation, where the model's estimates are compared to human or LLM assigned probabilities, might not reflect well-calibrated probabilistic predictions, since human assigned probabilities might not be well-calibrated (maybe I am just getting hung up on “well-calibrated probabilistic predictions”).

“We advocate evaluating models based on their effectiveness in real-world decision-making, where probabilistic reasoning aids belief modeling and evidence integration.” I was hoping to see some “real-world” task where the quality of the probability predicted by the models was evaluated, rather than on NLI, which is not a real-world task.

In 3.1, mentioning what tasks will be used would be helpful. In line 128, textual outcomes are mentioned, but what are these outcomes? Until line 157 I wanted to know what tasks are being used. Knowing the task would make the paper clearer. For example, in line 213, what is the universe of tokens when converting “any target scalar label y into a distribution over tokens”?

139: missing space: “LLMyielding"

Table 1: Outside of Pearson's correlation for the intrinsic tasks, what are these metrics? What do bold and underline indicate? Is bold first place and underline second? At a minimum, this information should be mentioned in the caption.

I like the approach of "eliciting direct probability estimates from LLMs,” but it would be nice to see how “well-calibrated” these probability estimates are. Using discrepancy makes sense for agreement-based filtering, but it would be nice to see some statistics or analysis about the agreement filtering.

Comment

Thank you for your thoughtful and constructive review. We appreciate your engagement with our work and address each of your concerns below:

1. Connection between LLM Regression and Uncertainty Calibration

As we discussed in our general response, the specific regression task that we formulate here is closely related to instance-wise calibration. We want to reiterate that our formulation is more general than calibrated classification. Additional Experiment (1), Table 1, clearly shows that simple post-hoc calibration is not sufficient for our task, while Experiment (2) shows that our approach directly induces better-calibrated classifiers for unseen tasks. Our formulation empowers LLMs to directly reason about uncertainty in continuous outcome spaces, leveraging their commonsense and linguistic knowledge.

2. On the Use of "Well-Calibrated Probabilistic Predictions"

We acknowledge that achieving truly calibrated probability estimates, especially in subjective reasoning tasks, is a significant challenge. Nevertheless, previous studies offer compelling evidence supporting the quality of UNLI annotations:

  • UNLI annotations have been carefully curated with strict qualification criteria and detailed hyperparameter tuning. [2]
  • UNLI labels have been shown to correspond very well to more traditional classification calibration notions in ChaosNLI [1]
  • Subjective probability annotation can be significantly improved through annotator quality control [1], and careful interface design [5]

3. "Real-World" Application of Our Model

We appreciate the desire to see more directly applied decision-making tasks. While our benchmark tasks are drawn from standard datasets, they are structured to reflect real-world belief modeling workflows:

  • Intrinsic Evaluation involves estimating probabilities for uncertain claims.
  • Comparison Evaluation simulates defeasible reasoning—deciding between competing hypotheses.
  • Structural Evaluation involves multi-step probabilistic reasoning and is designed to emulate real-world decision-making workflows, such as multi-hop question answering and scientific claim verification. This serves as a downstream application for our model by introducing uncertainty during the reasoning process. For instance, we assess commonsense and temporal reasoning capabilities on the Bayesian inference framework BIRD [4].

4. Conditional Probability Estimation Task

We emphasize that our framework is general-purpose and not limited to NLI. As noted in Section 3.1, conditional probability estimation is broadly applicable, for example, in information-seeking, risk assessment, or legal reasoning. Prior work often limits regression to function approximation (e.g., curve fitting) [3]; our work extends it to reasoning under uncertainty in open-domain texts.

5. Suggestion for Writing

  • The “textual outcomes” in line 128 refer to individual claims or propositions over which probabilities are estimated.
  • In line 213, “token” refers to the special bin-associated tokens used for label discretization during training.
  • Typos such as “LLMyielding” will be corrected in the final revision.
  • We will revise Table 1’s caption to explain the metric types, and the meaning of bold and underline (first and second place, respectively). We compute Spearman Correlation for intrinsic tasks and classification metrics for Comparison and Structural Reasoning tasks [lines 270, 295].

6. Statistic of Synthetic Data

We generated synthetic data on the UNLI test split (3040 examples) and found a high correlation with human annotations. Our method even surpassed the average performance of five GPT-4o runs using the same prompting. We also observed that the correlation of aggregated results on the UNLI test set increases as the discrepancy among LLMs decreases. For further verification on how "LLM-as-a-judge" improves over simple score averaging, please refer to our response to Reviewer eG4a.

Method                 Confidence expectation   GPT-4o 5-run average
Spearman correlation   0.7855                   0.7713

Discrepancy   Num. examples   Spearman
0.5           2802            .805
0.3           2110            .847
0.2           1662            .869
Comment

References

[1] Zhengping Jiang, Anqi Liu, and Benjamin Van Durme. Addressing the binning problem in calibration assessment through scalar annotations. Transactions of the Association for Computational Linguistics, 12:120–136, 2024a.

[2] Tongfei Chen, Zhengping Jiang, Adam Poliak, Keisuke Sakaguchi, and Benjamin Van Durme. Uncertain natural language inference. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8772–8779, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.774. URL https://aclanthology.org/2020.acl-main.774.

[3] Xingyou Song and Dara Bahri. Decoding-based regression. arXiv preprint arXiv:2501.19383, 2025.

[4] Yu Feng, Ben Zhou, Weidong Lin, and Dan Roth. Bird: A trustworthy bayesian inference framework for large language models, 2024. URL https://arxiv.org/abs/2404.12494.

[5] Han, X., Yu, F., Sedoc, J., and Van Durme, B. Baby Bear: Seeking a Just Right Rating Scale for Scalar Annotations. arXiv preprint arXiv:2408.09765, 2024.

Comment

I appreciate that this is referring to instance level calibration. However, I’m not convinced that UNLI instances are actually instance level calibration. Probabilities in UNLI were derived by asking annotators to estimate how likely a situation described in the hypothesis is based on the situation in the premise. For the predictions to be instance level calibrations, I would expect that if a model predicted a probability of say 80%, then when 100 annotators are asked whether hypothesis is “entailed” by premise, about 80 of them would say yes.

Overall, I like this work and am generally happy with the authors responses. My major gripe is the notion of well calibrated probability predictions. I’ve updated my overall score from a 6 to a 8. In my mind this is a clear accept.

Comment

Thank you for raising the scores and for your clearly stated support!

While we chose to build upon human subjective probability estimation as a reliable source, as suggested by prior research, we agree with the reviewer that achieving perfect calibration remains a significant challenge. We too believe that future work focused on discovering and curating resources for more precise distribution matching will be a crucial area of continued research.

Comment

I think the paper would be much stronger if the mentions of calibration, and the motivation based on it, were removed. It is not crucial to the paper's contributions, and the paper does not tackle this problem.

Comment

Overall Response

We greatly appreciate the reviewers’ insightful observation that our approach is connected to prior work in LLM calibration. We would like to take this opportunity to further elaborate on the relationship between our problem formulation and a well-established concept in the calibration literature, instance-level calibration.

If we view predicting an outcome in context as a binary classification task, our regression model estimates the posterior probability of that outcome. Unlike standard calibration methods that rely on clustering and can’t ensure instance-level accuracy, our scalar-targeted approach enables calibration at the instance level. This also extends to classification tasks, offering a weaker but still useful form of calibration.

In other words, our method aims to align model uncertainty with human subjective uncertainty. As discussed in the introduction, this alignment has been shown to be critical for tasks such as world modeling [1] and informed decision-making (e.g., BIRD [2]), yet remains surprisingly underexplored. Our work directly addresses this gap by proposing a targeted formulation and learning strategy that substantially outperforms both zero-shot verbalized uncertainty from LLMs and traditional regression-based uncertainty models. To ensure that our evaluation reflects the intended downstream applications, we carefully designed our pipeline accordingly, an approach that we are pleased to see resonated positively with several reviewers. Nevertheless, to directly address concerns regarding the effectiveness of our method relative to more conventional calibration techniques, we provide two additional sets of experiments:

(1) Limited benefit from temperature scaling: We show that temperature scaling yields only marginal improvements across most of the tasks we consider.

(2) Generalization to unseen data: On a completely held-out QA dataset, our model produces well-calibrated predictions across all candidate answers.

Experiment (1): True and False logits with temperature scaling baseline

Following reviewer oXxq’s suggestion, we use Thermometer [3] to tune a different temperature for each task. Its use of diverse input features makes it a solid calibration method that adapts to unseen tasks. We use the same base model, Qwen-14B-instruct, as for our strongest models.

We further extend the training from human-annotated UNLI to synthetic data on ANLI and WANLI; the results are shown below, where the method is denoted T/F+TS.

Intrinsic   UNLI   circa   GNLI   EntailmentBank   e-CARE
T/F+TS      .681   .553    .843   .558             .856

Compare   δ-SNLI   δ-ATOMIC   COPA   HellaSwag
T/F+TS    81.3     74.7       86.8   74.9

We notice that temperature scaling performs worse than our approach, and is particularly ineffective for comparison tasks, as we observe very little change in the ranking of the likelihood of hypotheses.

Experiment (2): Generalization to unseen QA

We also show that our model induces a calibrated classifier zero-shot on ProtoQA [4], a dataset of general commonsense questions with multiple plausible answers and vote-based frequencies from human annotators. For each question-answer pair (q, a_i), we predict p(a_i | q) to form a probabilistic classifier. The results are shown below.

Metric   ECE      Brier score   JSD
Value    0.0105   0.0159        0.0499

These results indicate that the predictive distribution from our model is well calibrated against the human label distribution.
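
For reference, such metrics against a human answer-frequency distribution can be computed along the following lines; the exact binning and normalization we used may differ, and the helper names are ours:

```python
import numpy as np

def jensen_shannon_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """JSD between the predicted and the human answer distributions of one question."""
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log((a + eps) / (b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def brier_score(pred: np.ndarray, human: np.ndarray) -> float:
    """Mean squared difference between predicted probabilities and human frequencies."""
    return float(np.mean((pred - human) ** 2))

def expected_calibration_error(pred: np.ndarray, human: np.ndarray, n_bins: int = 10) -> float:
    """Bin answers by predicted probability and compare the mean prediction
    to the mean human frequency within each bin, weighted by bin size."""
    bin_ids = np.clip((pred * n_bins).astype(int), 0, n_bins - 1)
    err = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            err += mask.mean() * abs(pred[mask].mean() - human[mask].mean())
    return float(err)
```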

References

[1] Lionel Wong, Gabriel Grand, Alexander K. Lew, Noah D. Goodman, Vikash K. Mansinghka, Jacob Andreas, and Joshua B. Tenenbaum. 2023. From Word Models to World Models: Translating from Natural Language to the Probabilistic Language of Thought. arXiv preprint arXiv:2306.12672.

[2] Yu Feng, Ben Zhou, Weidong Lin, and Dan Roth. Bird: A trustworthy bayesian inference framework for large language models, 2024. URL https://arxiv.org/abs/2404.12494.

[3] Shen, Maohao, et al. "Thermometer: Towards Universal Calibration for Large Language Models." arXiv preprint arXiv:2403.08819 (2024).

[4] Michael Boratko, Xiang Li, Tim O’Gorman, Rajarshi Das, Dan Le, and Andrew McCallum. 2020. ProtoQA: A Question Answering Dataset for Prototypical Common-Sense Reasoning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1122–1136, Online. Association for Computational Linguistics.

Comment

I’d like to kindly remind the reviewers to provide a follow-up in response to the authors’ rebuttal. Your input will help ensure a fair and well-informed decision.

Comment

We thank the Area Chair for the thoughtful reminder. As the end of the rebuttal period approaches, we sincerely encourage the reviewers to share their feedback and insights on both our submission and response.

Comment

We sincerely thank all the reviewers and the meta-reviewer for their constructive and insightful feedback. Your comments have been very helpful in improving the quality and clarity of our paper. In the final version, we will improve the writing, incorporate the most crucial experiments as suggested, and slightly adjust the motivation to better reflect the perspectives raised in the reviews.

Final Decision

The overall evaluation is positive (as reflected in most reviewers' comments, and I agree). Toward acceptance, please make sure to carefully address the reviewers’ questions and concerns, and refine the camera-ready version accordingly.