PaperHub

Score: 6.4 / 10 · Poster · 4 reviewers
Ratings: 4 / 5 / 4 / 3 (min 3, max 5, std 0.7) · Confidence: 3.8
Novelty: 2.0 · Quality: 2.8 · Clarity: 2.5 · Significance: 2.5

NeurIPS 2025

SteerConf: Steering LLMs for Confidence Elicitation

OpenReview · PDF
Submitted: 2025-05-08 · Updated: 2025-10-29
TL;DR

We introduce SteerConf, a method to steer the verbalized confidence of LLMs and calibrate these confidences, enhancing their reliability and trustworthiness in practical applications.

Abstract

Keywords

calibration · uncertainty · prompting · black-box · LLM

Reviews and Discussion

Review (Rating: 4)

This paper proposes SteerConf, a framework to improve confidence calibration in LLMs by steering their confidence scores via prompts. It introduces three components: (1) a steering prompt strategy; (2) a steered confidence consistency measure; and (3) a steered confidence calibration method. Without additional training or fine-tuning, SteerConf reduces calibration errors and improves failure detection across 7 benchmarks.
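
For orientation, the loop below sketches how such a steer-then-aggregate pipeline could be driven. It is an illustration, not the authors' code: `query_llm`, `select`, and `calibrate` are hypothetical helpers standing in for the elicitation call, the answer selection of Sec. 3.3, and the calibration of Eq. (4), and the prompt texts are placeholders (concrete examples appear in the author responses below).

```python
# Minimal sketch of a steer-then-aggregate loop (illustrative, not the paper's code).
# `query_llm` is a hypothetical helper returning one (answer, confidence) pair.

STEERING_PROMPTS = {
    "very_cautious": "...", "cautious": "...", "vanilla": "...",
    "confident": "...", "very_confident": "...",
}

def steerconf_query(question, query_llm, select, calibrate):
    """Elicit one (answer, confidence) pair per steering prompt, then aggregate."""
    pairs = [query_llm(prompt, question) for prompt in STEERING_PROMPTS.values()]
    answers = [a for a, _ in pairs]
    confs = [c for _, c in pairs]
    final_answer = select(answers, confs)                  # answer selection (Sec. 3.3)
    final_conf = calibrate(answers, confs, final_answer)   # calibration (Eq. (4))
    return final_answer, final_conf
```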

Strengths and Weaknesses

Strengths

  1. The idea of steering LLM confidence through carefully designed prompts is simple yet underexplored, and the SteerConf method requires no training or fine-tuning, making it applicable to black-box commercial LLMs.
  2. The paper presents thorough experiments and evaluations across 7 diverse benchmarks (covering commonsense, math, ethics, etc.) with multiple LLMs (GPT-3.5, GPT-4, LLaMA 3), demonstrating the broad applicability of the framework.

Weaknesses

  1. The proposed method requires multiple LLM calls per query, which is costly and makes it hard to extend to complex multi-turn reasoning tasks.
  2. The steering prompts are manually designed. Small changes in these prompts might affect the final performance, but no robustness analysis is provided.
  3. The answer selection process in Sec 3.3 is not intuitive. Since the paper earlier claims that confidence scores elicited from LLMs are often overestimated, matching the calibrated confidence c(x) with the steered confidence scores to select the final answer does not seem well justified.

问题

  1. How much impact does the setting of l in the steering level have on the results, or at least how large does l need to be to achieve good results?
  2. What does the argument in Lines 186-187 mean, which says "however overlooks the consistency of the verbalized confidences returned by LLMs, which may contain rich information of model's interior certainty"?
  3. When using different prompts to guide the LLM's confidence elicitation, have the authors considered the influence of the LLM's instruction following ability on these results?
  4. The tested LLMs in this paper all have large parameter counts. What would the performance of LLMs with relatively fewer parameters (like 7B, 32B, etc.) be like? (As it is time-consuming to conduct experiments during the rebuttal, the authors can first discuss the relevant results.)

Limitations

yes

Justification for Final Rating

The authors' rebuttal has addressed most of my previous concerns. However, as the authors stated that "our primary focus is on widely-used LLMs with reasonably good capabilities for black-box confidence calibration (e.g., GPT-3.5, LLaMA3, GPT-4)", it implies that SteerConf is a method relying on the strong reasoning capability of large language models, which limits its application to relatively small models (like 7B, 14B, etc.).

Overall, considering the novelty of the paper, the experimental results, and the authors' rebuttal, I think it is a borderline paper, so I rate it 4.

Formatting Concerns

No formatting concerns on this paper.

Author Response

Response to reviewer BqFQ

We thank the reviewer for the insightful comments. We have addressed them with clarifications and experiments.

W1: multiple LLM calls per query; complex multi-turn reasoning tasks.

Response: We clarify that multiple-query-based calibration is a common practice in the literature [31,37]. In our setting, since we do not have access to the internal states of LLMs, the additional cost of multiple calls represents a necessary and reasonable trade-off to alleviate this information gap. For complex multi-turn reasoning tasks, our multi-query methodology can be viewed as a form of test-time scaling: investing more computational resources during inference to obtain better calibration and consequently more reliable reasoning traces.

W2: robustness of prompts

Response: We provide experiments to demonstrate the robustness of our method to prompt variations. We modified the expression of steering prompts to be more concise while preserving the core steering intent. For example, we simplified the "very_cautious" prompt as follows:

Original (SteerConf): Read the question, provide your answer and your confidence in this answer. Note: (1) The confidence indicates how likely you think your answer will be true. (2) You are making important decisions, thus you should avoid giving a wrong answer with high confidence. (3) You should be very cautious, and tend to give low confidence on almost all of the answers.

Modified (SteerConf_brief): Read the question, provide your answer and your confidence in this answer. Note: (1) The confidence indicates how likely you think your answer will be true. (2) Be very cautious, tend to give very low confidence on every answer.

We compare the modified and original versions of SteerConf with the GPT-3.5 backbone in the no-CoT setting in the table below. Notably, SteerConf_brief achieves even lower ECE than SteerConf, and both outperform GPT-3.5-vanilla. These results demonstrate the robustness of our proposed method to prompt variations.

| Method | GSM8K ECE | GSM8K AUROC | Law ECE | Law AUROC | DateUnd ECE | DateUnd AUROC | StrategyQA ECE | StrategyQA AUROC | Ethics ECE | Ethics AUROC | Average ECE | Average AUROC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-3.5-vanilla | 62.6 | 55.8 | 43.2 | 51.7 | 60.2 | 56.6 | 29.6 | 53.3 | 26.0 | 54.8 | 44.3 | 54.4 |
| SteerConf_brief | 9.6 | 82.4 | 9.3 | 61.3 | 12.1 | 55.0 | 21.0 | 58.4 | 14.8 | 68.1 | 13.3 | 65.0 |
| SteerConf | 22.8 | 82.9 | 24.0 | 60.6 | 33.0 | 60.5 | 14.9 | 58.6 | 10.8 | 71.0 | 21.1 | 66.7 |
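
For reference, the ECE values in these tables are percentages of the standard binned calibration gap, and AUROC treats the confidence as a score for predicting whether the answer is correct. A minimal ECE sketch (standard definition, not tied to the paper's code):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: bin-share-weighted |accuracy - mean confidence| per bin.
    `confidences` in [0, 1], `correct` as 0/1 labels."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bin_idx = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece  # multiply by 100 to match the percentage scale in the tables
```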

W3: The answer selection process in Sec 3.3 is not intuitive

Response: Our answer selection mechanism is motivated by the principle that answers with higher calibrated confidence should be selected when the model is more certain, and vice versa. Specifically, we compare the relative rank of confidence scores among candidate answers, as formalized in equation (6). Experimental results in Table 5 demonstrate that this selection strategy improves calibration. We will clarify this point in the revised version.
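
As a rough illustration of such rank-based selection (the actual rule is Equation (6) in the paper, which this page does not reproduce), the helper below is one hypothetical reading: rank candidates by how strongly the steered runs support them, breaking ties by mean steered confidence, so the answer backed by higher confidence wins when the model is more certain.

```python
from collections import defaultdict

def select_answer(answers, confs):
    """Hypothetical selection sketch: group steered confidences by candidate answer,
    then prefer the candidate with the most votes, ties broken by mean confidence."""
    support = defaultdict(list)
    for ans, conf in zip(answers, confs):
        support[ans].append(conf)
    return max(support, key=lambda a: (len(support[a]), sum(support[a]) / len(support[a])))
```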

Q1: setting of the steering level

Response: To investigate the impact of the steering level l, we vary l from 5 to 3 using LLaMA3-70b-4bit with CoT, and report results in the table below. "SteerConf-no_verys" excludes the very_cautious and very_confident prompts, while "SteerConf-no_mild" excludes the cautious and confident prompts. The results show that l=5 yields better performance than l=3. Increasing l may further improve results but incurs higher query cost and requires designing more fine-grained prompts, which is non-trivial. In practice, we find l=5 offers a good balance. We will add this discussion in the revised version.

| Method | GSM8K ECE | GSM8K AUROC | Law ECE | Law AUROC | DateUnd ECE | DateUnd AUROC | StrategyQA ECE | StrategyQA AUROC | Ethics ECE | Ethics AUROC | Average ECE | Average AUROC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SteerConf-no_verys | 3.0 | 76.8 | 14.7 | 64.4 | 11.2 | 69.4 | 7.5 | 71.6 | 8.0 | 72.8 | 8.9 | 71.0 |
| SteerConf-no_mild | 8.4 | 78.8 | 6.8 | 62.8 | 10.4 | 67.4 | 7.2 | 73.7 | 16.3 | 77.7 | 9.8 | 72.1 |
| SteerConf | 6.1 | 81.2 | 3.3 | 64.7 | 10.0 | 70.7 | 4.3 | 74.5 | 13.6 | 81.2 | 7.5 | 74.5 |

Q2: Clarify the argument in Line 186-187

Response: We clarify that existing study [37] focuses on answer consistency, not confidence score consistency. We argue that a model's internal certainty is reflected in both answer and confidence consistency. Therefore, our method is designed to consider both aspects. Empirical results in the experiments section validate this approach.

Q3: influence of LLM's instruction following ability

Response: Prior verbalized confidence works [31,37] did not address the impact of LLMs' instruction following ability. In our setting, the vanilla verbalized confidence method [31,37] already poses a challenge: if an LLM (e.g., small LLaMA-2 models) has poor instruction following ability, it may fail to produce answer-confidence pairs in the required format. Once an LLM can reliably perform verbalized confidence tasks, its instruction following ability is sufficient for our steering confidence instructions. Figure 4 demonstrates that our steering prompts are successfully followed by current LLM backbones. We will clarify this point in the revised version.
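
To illustrate what "fail to produce answer-confidence pairs in the required format" means in practice, here is a hypothetical tolerant parser for outputs of the form "Answer: ...; Confidence: NN%"; the paper's actual required format may differ, and responses that no such pattern matches are the "collapsed" elicitations discussed for Qwen3-1.7b below.

```python
import re

# Illustrative only: the required output format is defined in the paper.
_PAIR = re.compile(
    r"answer\s*[:\-]\s*(?P<answer>.+?)\s*confidence\s*[:\-]\s*(?P<conf>\d{1,3})\s*%?",
    re.IGNORECASE | re.DOTALL,
)

def parse_answer_confidence(text):
    """Return (answer, confidence in [0, 1]) or None for a collapsed response."""
    m = _PAIR.search(text)
    if m is None:
        return None  # model ignored the format instruction
    conf = min(int(m.group("conf")), 100) / 100.0
    return m.group("answer").strip(), conf
```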

Q4: smaller models

Response: As suggested, we conduct experiments using the small model Qwen3-1.7b under the CoT setting, as shown in the table below. Due to the model's relatively weaker instruction following ability, it frequently produces collapsed verbalized-confidence outputs, rendering results on some datasets (Law and Ethics) unusable; thus, we report results only on datasets with reliable responses. The results demonstrate that our proposed method consistently outperforms the vanilla verbalized confidence approach under the CoT setting, even with a small model backbone.

| Method | GSM8K ECE | GSM8K AUROC | Sport ECE | Sport AUROC | DateUnd ECE | DateUnd AUROC | StrategyQA ECE | StrategyQA AUROC | ObjCnt ECE | ObjCnt AUROC | Average ECE | Average AUROC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-1.7b Vanilla | 25.5 | 54.7 | 37.2 | 56.0 | 45.5 | 62.4 | 37.2 | 55.5 | 32.4 | 65.3 | 35.5 | 58.8 |
| Qwen3-1.7b + SteerConf | 8.3 | 80.4 | 40.1 | 49.8 | 46.5 | 62.3 | 39.2 | 51.1 | 22.0 | 74.5 | 31.2 | 63.6 |

Comment

Thank you for your responses! I appreciate the additional experiments and efforts the authors made during the rebuttal, which have addressed most of my concerns. However, I still have some follow-up questions regarding the authors' responses:

Q2: If I did not misunderstand, Table 5 in Appendix D shows the results of considering only answer consistency (SteerConf Majority) and considering both answer consistency and confidence score consistency (SteerConf). Based on the results, it seems that SteerConf is comparable to SteerConf (Majority) across datasets and metrics. So the authors may need to emphasize the role of confidence calibration in decision-making in the paper. Besides, Appendix D should be referred to in the paper.

Q4: Although the average ECE and AUROC indicate SteerConf is better than Vanilla on small models, detailed results on multiple datasets seem to indicate that this improvement is mainly attributed to the improvement on GSM8K and ObjCnt, and its performance on other datasets is not significant. It may raise concerns about the effectiveness of SteerConf on smaller models.

As a side note, the l in your response to Q1 should be 1 for "SteerConf-no_verys" and "SteerConf-no_mild", since there are 2l+1 levels as defined in the paper.

Comment

We are pleased that our rebuttal has addressed most of your concerns. Thank you!

For Q2, in Table 5 of Appendix D, both SteerConf(Majority) and SteerConf utilize the same calibrated confidence c(x) from Eq. (4), but differ in answer selection: SteerConf(Majority) relies solely on answer consistency, while SteerConf employs the proposed answer selection mechanism (Lines 231–236) that is guided by calibrated confidence. Thus, Table 5 demonstrates the improvement of our answer selection mechanism. As suggested, we will emphasize the role of confidence calibration in decision-making and reference Appendix D in the paper.

For Q4, as noted in your review, it is time-consuming to conduct the experiments during the rebuttal, but we really wanted to provide relevant results for our discussion. Therefore, we ran experiments on a rather small model, Qwen3-1.7b, which, as noted previously, has weaker instruction-following ability and sometimes produces unusable results, leading to performance fluctuations. Nevertheless, as observed, our method achieves better average performance across datasets. In this work, our primary focus is on widely-used LLMs with reasonably good capabilities for black-box confidence calibration (e.g., GPT-3.5, LLaMA3, GPT-4), and our main results show that our method outperforms baselines. We agree that exploring confidence calibration for very small LLMs with unstable capabilities is a promising future direction, which may also help strengthen small LLMs. Besides, l should indeed be 1, and 2l+1 is 3. We will incorporate these fixes into the revised paper.

Thank you for your insightful discussion. Please let us know if you have further feedback.

Comment

Thank you for your further explanation, which mostly resolves my issue with Q2 and Q4. I will raise my score accordingly.

Comment

Thank you for raising your score. We appreciate your great efforts.

Comment

Dear Reviewer BqFQ,

Thank you for your effort in reviewing our paper. We have addressed your comments by conducting experiments on the robustness of prompts, varying the steering levels, and testing a smaller model, all of which further demonstrate the effectiveness of our designs, and we have provided additional clarifications on the design and settings of our work accordingly.

Please have a look, and we look forward to your feedback. Thank you.

Best,
Authors of SteerConf

Review (Rating: 5)

This paper introduces a method to enable large language models to output a calibrated confidence score alongside answers to questions. The method builds upon previous work in this area, where answer consistency has been used as a measure of confidence; here the authors also introduce confidence consistency, based on the consistency of confidence scores when models are steered towards different degrees of confidence. This provides two signals for confidence that together may provide a more reliable measure. The authors demonstrate that the confidence scores output using their approach are better calibrated than other approaches. Additionally, by using their score to select answers, the authors demonstrate an improvement in performance. This work has important implications for high-stakes domains, where confidence may be an important component of decision-making.

Strengths and Weaknesses

Strengths

  • The ability to generate reliable confidence scores with black-box language-models is an important area of research and the authors make a notable contribution to this field.
  • The confidence score is well designed with a clear rationale and appears to offer an improvement compared to other scores used in past research.
  • The score is relatively well validated, comparing their approach to several competing approaches, and demonstrating an improvement across multiple dimensions.

Weaknesses

Major

  • The authors discuss the improvements in confidence calibration offered by their method, however they only consider one metric for calibration (expected calibration error). It might be more informative to also use calibration curves to better understand the nature of the improvement in calibration (see the sketch after this list). See [1] for a description of different types of calibration. I think Figure 4 is less relevant for the main results and could be moved to the appendix to make space for more interrogation of calibration here.
  • It is not clear that these results would hold across many model types. The results only compare 3 models, and 2 of these are from OpenAI, so are likely trained in a similar manner. It would be useful to know whether this approach also works for smaller models, like LLaMA-8B models, or other model types.
  • SteerConf is only compared to the vanilla confidence level, however Figure 1 clearly shows that a cautious prompt gives better results. It would be good to compare this approach to more cautious prompts, as a more difficult baseline, to demonstrate that SteerConf does offer an improvement over just using more cautious prompts.
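
For concreteness, a calibration (reliability) curve of the kind requested in the first bullet plots per-bin accuracy against mean confidence, with the diagonal as perfect calibration. A minimal sketch, assuming NumPy/Matplotlib and per-prediction correctness labels (illustrative, not from the paper):

```python
import numpy as np
import matplotlib.pyplot as plt

def reliability_diagram(confidences, correct, n_bins=10):
    """Plot binned accuracy vs. mean confidence; the diagonal is perfect calibration."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bin_idx = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    xs, ys = [], []
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            xs.append(confidences[mask].mean())  # mean confidence in bin
            ys.append(correct[mask].mean())      # empirical accuracy in bin
    plt.plot([0, 1], [0, 1], "k--", label="perfect calibration")
    plt.plot(xs, ys, "o-", label="model")
    plt.xlabel("Mean confidence"); plt.ylabel("Accuracy"); plt.legend()
    plt.show()
```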

Minor

  • In the introduction (line 66), the authors state “the very cautious prompt improves calibration, achieving a 29% increase in Area Under the Receiver Operating Curve”; however, AUROC does not measure calibration, so please amend this to be clear.
  • The Related Work section needs to mention some of the work that informs the baselines they later use as part of their validation (such as [37])
  • The introduction to the overall method (lines 138-147) could be rephrased to better justify the rationale of the approach. Lines 217-225 very convincingly and clearly explain why the measure was defined as it was, and that material could be moved to the introduction to better introduce the approach.

[1] Van Calster B, Nieboer D, Vergouwe Y, De Cock B, Pencina MJ, Steyerberg EW. A calibration hierarchy for risk models was defined: from utopia to empirical data. J Clin Epidemiol. 2016 Jun;74:167-76. doi: 10.1016/j.jclinepi.2015.12.005. Epub 2016 Jan 6. PMID: 26772608.

Questions

  1. Did the authors experiment with a greater/different number of steering levels in the prompting?
  2. Did the authors consider weighting the confidence and answer consistency terms in their confidence measure to observe how this changed the results?

Limitations

This method only works in the simple closed-form question/answering setting. The authors could be more clear about this limitation, pointing out that this method would not work (or would require adaptation) for cases where a language-model is outputting a free-form longer response to a question.

Justification for Final Rating

I have updated my score to reflect the authors' response to the weaknesses highlighted in my original review.

Overall, I think this paper introduces a novel method for improving confidence scores in language-models which is an important issue deserving attention. The paper is not without its limitations, though the authors address these in the paper and in their response. In the future I hope to see more research build upon the findings presented in this paper to further advance the field of confidence elicitation in language-models.

Formatting Concerns

  • Typo “mannually” on line 318

Author Response

Response to reviewer NUu4

We thank the reviewer for the insightful comments. We have addressed all comments with experiments and clarifications.

W1: calibration curves

Response: We will add the calibration curves and move Figure 4 to the appendix as suggested by the reviewer. As external URLs to diagrams are not permitted in the rebuttal, these updates will be made in the revision.

W2: smaller and different model

Response: As suggested, we conduct experiments using the small model Qwen3-1.7b under the CoT setting, as shown in the table below. Due to its relatively weaker instruction following ability, the model frequently produces collapsed verbalized-confidence outputs, rendering results on some datasets (Law and Ethics) unusable; thus, we report results only on datasets with reliable responses. The results demonstrate that our proposed method consistently outperforms the vanilla verbalized confidence approach under the CoT setting, even with a small model backbone.

| Method | GSM8K ECE | GSM8K AUROC | Sport ECE | Sport AUROC | DateUnd ECE | DateUnd AUROC | StrategyQA ECE | StrategyQA AUROC | ObjCnt ECE | ObjCnt AUROC | Average ECE | Average AUROC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-1.7b Vanilla | 25.5 | 54.7 | 37.2 | 56.0 | 45.5 | 62.4 | 37.2 | 55.5 | 32.4 | 65.3 | 35.5 | 58.8 |
| Qwen3-1.7b + SteerConf | 8.3 | 80.4 | 40.1 | 49.8 | 46.5 | 62.3 | 39.2 | 51.1 | 22.0 | 74.5 | 31.2 | 63.6 |

W3: compare this approach to more cautious prompts

Response: As suggested, we report results using only the very_cautious, cautious, confident, and very_confident prompts, respectively, in the LLaMA3-70b-4bit with CoT setting. SteerConf consistently achieves the best overall performance in terms of average ECE and AUROC, though a single prompt can occasionally outperform it on a specific metric. We will include this comparison in the revised version as recommended by the reviewer.

| Method | GSM8K ECE | GSM8K AUROC | Law ECE | Law AUROC | DateUnd ECE | DateUnd AUROC | StrategyQA ECE | StrategyQA AUROC | Ethics ECE | Ethics AUROC | Average ECE | Average AUROC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| vanilla | 5.0 | 55.1 | 22.8 | 54.3 | 13.7 | 54.3 | 11.8 | 64.6 | 7.8 | 54.2 | 12.2 | 56.5 |
| very_cautious | 9.9 | 62.9 | 8.7 | 57.1 | 10.4 | 59.9 | 11.9 | 69.6 | 19.4 | 73.3 | 12.1 | 64.6 |
| cautious | 5.0 | 55.7 | 20.8 | 55.8 | 13.6 | 57.8 | 10.8 | 68.4 | 7.5 | 67.0 | 11.6 | 60.9 |
| confident | 4.9 | 53.8 | 23.0 | 56.7 | 14.2 | 55.3 | 12.4 | 66.1 | 10.9 | 49.9 | 13.1 | 56.4 |
| very_confident | 4.6 | 52.3 | 24.6 | 56.6 | 14.5 | 55.6 | 12.3 | 66.6 | 11.5 | 50.1 | 13.5 | 56.2 |
| SteerConf | 6.1 | 81.2 | 3.3 | 64.7 | 10.0 | 70.7 | 4.3 | 74.5 | 13.6 | 81.2 | 7.5 | 74.5 |

Minor W4 & Minor W6:

Response: As suggested, we will rephrase the relevant sentence to improve presentation.

Minor W5: discuss a related work

Response: As suggested, we will discuss the suggested related work.

Q1: experiment with a greater/different number of steering levels

Response: As suggested, we vary the number of steering levels from 5 to 3 to assess their impact, using LLaMA3-70b-4bit with CoT. In the table, "SteerConf-no_verys" omits the very_cautious and very_confident prompts, while "SteerConf-no_mild" omits the cautious and confident prompts. The results show that l=5 yields better performance than l=3. In practice, l=5 offers a good balance, as increasing the number of levels incurs additional query cost. We will include this discussion in the revised version.

| Method | GSM8K ECE | GSM8K AUROC | Law ECE | Law AUROC | DateUnd ECE | DateUnd AUROC | StrategyQA ECE | StrategyQA AUROC | Ethics ECE | Ethics AUROC | Average ECE | Average AUROC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SteerConf-no_verys | 3.0 | 76.8 | 14.7 | 64.4 | 11.2 | 69.4 | 7.5 | 71.6 | 8.0 | 72.8 | 8.9 | 71.0 |
| SteerConf-no_mild | 8.4 | 78.8 | 6.8 | 62.8 | 10.4 | 67.4 | 7.2 | 73.7 | 16.3 | 77.7 | 9.8 | 72.1 |
| SteerConf | 6.1 | 81.2 | 3.3 | 64.7 | 10.0 | 70.7 | 4.3 | 74.5 | 13.6 | 81.2 | 7.5 | 74.5 |

Q2: weighting the confidence and answer consistency terms

Response: We clarify that our approach multiplies the mean confidence with the answer and confidence consistency terms, $c(x) = \mu_c \cdot \kappa_{ans} \cdot \kappa_{conf}$, as shown in Equation (4) below line 225. This multiplicative design avoids the need to manually balance the two terms, and it leverages the transferability advantage of multiplication: a weighting parameter inserted into a product would merely rescale the result rather than assign a specific weight to either term.
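
A minimal sketch of this multiplicative aggregation follows. The mean steered confidence $\mu_c$ is unambiguous, but the exact forms of $\kappa_{ans}$ and $\kappa_{conf}$ are defined in the paper, so the agreement-rate and spread-based terms below are illustrative assumptions only.

```python
import numpy as np

def calibrated_confidence(answers, confs, selected):
    """Sketch of Eq. (4): c(x) = mu_c * kappa_ans * kappa_conf.
    The two consistency terms below are assumed forms, not the paper's exact definitions."""
    confs = np.asarray(confs, dtype=float)                 # each in [0, 1]
    mu_c = confs.mean()                                    # average steered confidence
    kappa_ans = np.mean([a == selected for a in answers])  # answer consistency (agreement rate)
    kappa_conf = max(1.0 - confs.std(), 0.0)               # confidence consistency (1 - spread)
    return mu_c * kappa_ans * kappa_conf
```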

Comment

Thank you for response. I would like to acknowledge the time and effort spent producing additional experimental results. You have responded to each of my weaknesses in a satisfactory manner and I have upgraded my rating to 5 accordingly.

Comment

We are pleased to know that we have satisfactorily addressed your comments. Thank you for your effort in reviewing our work and upgrading the rating.

Review (Rating: 4)

This paper introduces SteerConf, a novel black-box framework for improving the confidence calibration of large language models via semantic steering. Instead of relying on internal logits or fine-tuning, SteerConf applies a set of carefully crafted prompts with varying confidence attitudes to elicit multiple predictions and verbalised confidence scores. The method computes a final calibrated confidence by aggregating the average confidence with both answer and confidence consistency metrics. Extensive experiments on seven benchmarks using GPT-3.5, GPT-4, and LLaMA3 demonstrate that SteerConf consistently improves Expected Calibration Error (ECE) and failure prediction metrics compared to vanilla verbalised confidence and other existing baselines.

Strengths and Weaknesses

Strengths:

  1. The framework works in a black-box setting without requiring internal access or model fine-tuning.
  2. The method is conceptually simple and easy to apply to any LLM with API access.

Weaknesses:

  1. Although the proposed confidence consistency metric captures numerical stability across prompts, it does not consider potential semantic differences in the generated answers. In tasks involving more open or ambiguous questions, different prompts may lead to answers with similar confidence scores but different meanings. This aspect is not sufficiently discussed in the paper.
  2. The method requires multiple LLM calls per input but does not analyse whether all prompts are necessary. Without understanding each prompt's contribution, the method may be less efficient than necessary.

Questions

  1. Have the authors conducted any ablation studies to assess the individual impact of each steering prompt?
  2. How does the framework perform when applied to tasks with longer reasoning chains or in the CoT setting? Is there any further analysis in such cases?
  3. How does SteerConf perform when the prompts intentionally guide the model to express overly high or low confidence? Is the method robust to such suggestive or misleading prompt cues?

Limitations

yes

Justification for Final Rating

The authors carefully responded to my previous questions. Based on the explanation and more experimental results, I have decided to raise my score.

Formatting Concerns

None

Author Response

Response to reviewer 1saN

We thank the reviewer for the insightful comments and we have addressed them as follows.

W1: semantic differences in the generated answers, more open or ambiguous questions

Response: We clarify that, following prior work, our current tasks focus on numerical answers (GSM8K) or answers with predefined options (Law), where ambiguity is minimal and answer consistency is both sufficient and effective. We agree that addressing semantic differences in generated answers is important for more open-ended tasks, and will include this discussion as future work in the revised version.

W2: Each prompt’s contribution
Q1: individual impact of each steering prompt

Response: As suggested, to assess the contribution of each steering prompt, we conduct an ablation study by removing each prompt individually using LLaMA3-70b-4bit with CoT as the backbone. For example, "SteerConf w/o very_confident" excludes the very_confident prompt. The results are shown in the table below. Notably, the cautious prompts (cautious, very_cautious) primarily improve AUROC, while the confident prompts (confident, very_confident) mainly enhance ECE.

| Method | GSM8K ECE | GSM8K AUROC | Law ECE | Law AUROC | DateUnd ECE | DateUnd AUROC | StrategyQA ECE | StrategyQA AUROC | Ethics ECE | Ethics AUROC | Average ECE | Average AUROC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SteerConf w/o very_confident | 7.2 | 80.9 | 5.1 | 64.3 | 11.1 | 70.9 | 5.8 | 74.3 | 17.9 | 80.4 | 9.4 | 74.1 |
| SteerConf w/o confident | 7.4 | 80.8 | 5.3 | 64.6 | 11.1 | 69.8 | 5.9 | 74.3 | 16.0 | 81.4 | 9.1 | 74.2 |
| SteerConf w/o vanilla | 6.8 | 79.3 | 4.4 | 64.2 | 11.4 | 68.5 | 5.8 | 74.4 | 12.9 | 78.7 | 8.3 | 73.0 |
| SteerConf w/o cautious | 6.9 | 80.7 | 4.7 | 63.8 | 9.3 | 68.3 | 5.3 | 74.2 | 12.9 | 77.2 | 7.8 | 72.8 |
| SteerConf w/o very_cautious | 3.1 | 78.0 | 13.3 | 65.7 | 11.3 | 69.9 | 7.6 | 72.4 | 6.3 | 78.0 | 8.3 | 72.8 |
| SteerConf | 6.1 | 81.2 | 3.3 | 64.7 | 10.0 | 70.7 | 4.3 | 74.5 | 13.6 | 81.2 | 7.5 | 74.5 |

Q2: Perform with CoT setting

Response: We highlight that Table 2 in our paper (Page 7, above line 260) presents results for the CoT setting across different model backbones. CoT prompting consistently enhances the performance of all LLMs. Importantly, SteerConf continues to outperform the vanilla verbalized confidence approach, confirming the effectiveness of our method regardless of CoT usage.

Q3: How does SteerConf perform when the prompts intentionally guide the model to express overly high or low confidence?

Response: The design principle of our method SteerConf is precisely to intentionally steer the confidence output of LLMs. SteerConf effectively handles cases where prompts intentionally guide the model to express overly high or low confidence. In fact, this intentional steering mechanism is the key design of our SteerConf (Steering Confidence) method as stated in Sections 1 and 3, where "steering" refers to generating confidence scores in a specified direction (e.g., conservative or optimistic) by employing prompts with varying degrees of caution or confidence. Specifically, we apply steering prompts like "You should adopt a very cautious approach and assign low confidence to most of your answers" to guide the LLM to express deliberately calibrated confidence levels. The steered confidence scores and answers are then aggregated to obtain more reliable confidence estimates. We will emphasize this point in the revised version.

Comment

Dear Reviewer 1saN,

Thank you for your effort in reviewing our paper. We have addressed your comments by conducting ablation studies on each steering prompt, referring to the performance in the CoT setting in our paper, and clarifying our design principle.

We look forward to your feedback. Thank you.

Best,
Authors of SteerConf

Comment

Dear Reviewer 1saN,

As the rebuttal discussion period is drawing to a close, this is a kind reminder to join the discussion.

Your comments are insightful and easy to address, and we have addressed them: we conducted ablation studies on each steering prompt to show its effectiveness, we pointed out that our paper already reports experiments in the CoT setting, and we further clarified our design principles.

We appreciate your effort in reviewing our paper.

Best,
Authors of SteerConf

Review (Rating: 3)

This paper proposes a new approach to obtain verbalized confidence scores that are better calibrated. Specifically, the approach prompts the model through different prompts and obtains a list of paired answers and confidence scores. The authors then propose a way to aggregate these pairs to obtain a final confidence score. This approach demonstrates better calibration performance than the vanilla verbalized confidence scores and several other baselines.

Strengths and Weaknesses

Strengths

  1. This paper studies an important problem to calibrate LLMs’ confidence scores, which could be helpful to control hallucination and risks.
  2. The proposed approach is training-free and can be easily adopted.

Weaknesses

  1. This paper proposes a new way to aggregate confidence and answer pairs, and it should compare with stronger baselines such as Avg-Conf and Pair-Rank in [1]; this comparison is necessary also because the evaluation datasets of the two works are similar.
  2. The proposed approach has several components, and I think it should be made clear the effectiveness of each component. For example, (1) how necessary is it to use carefully designed steering prompts? Would just semantically similar prompts suffice? Or maybe we can just use one prompt but sample multiple different responses? (2) Controlling the prompts fixed, how effective of different aggregation approaches?

Questions

NA

Limitations

yes

Justification for Final Rating

This paper is limited in novelty in general: methodologically, it is essentially a new way of aggregating different verbalized confidence scores. Several similar works exist, and the originality is not significant enough. The original submission was not of good quality, I would say; as noted in my review, it lacked some baselines and proper ablations for the design choices, so I gave an initial rating of 2.

The authors added additional baselines and ablation results in the rebuttal, which addressed my concerns on the experiments, but my concern about the originality of the paper persists. I think this paper is very borderline and I don't have a clear preference; I raise my score to 3 because the initial version was not good, but the AC or the authors can interpret my score as 3.5.

Formatting Concerns

NA

Author Response

Response to reviewer ZMK7

We have addressed all your comments below with experiments. Thank you for the constructive comments.

W1: compare with baselines such as Avg-Conf and Pair-Rank

Response:

  1. As noted in Line 267, the "Top-K" baseline already incorporates the Pair-Rank aggregation, which is specifically tailored for the Top-K approach.
  2. For Avg-Conf, we report results using the LLaMA3-70b-4bit backbone with CoT in the table below. The results show that our proposed method outperforms all baselines, including the Self-Random and Misleading variants with Consistency, Pair-Rank, and Avg-Conf aggregation. For instance, SteerConf achieves an ECE of 7.5, notably better than the 14.3 of Self-Random with Avg-Conf.

| Method | GSM8K ECE | GSM8K AUROC | Law ECE | Law AUROC | DateUnd ECE | DateUnd AUROC | StrategyQA ECE | StrategyQA AUROC | Ethics ECE | Ethics AUROC | Average ECE | Average AUROC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Vanilla | 5.0 | 55.1 | 22.8 | 54.3 | 13.7 | 54.3 | 11.8 | 64.6 | 7.8 | 54.2 | 12.2 | 56.5 |
| Misleading+Consistency | 4.4 | 83.8 | 18.9 | 59.7 | 18.4 | 67.8 | 11.7 | 65.3 | 14.0 | 75.0 | 13.5 | 70.3 |
| Self-Random+Consistency | 2.2 | 79.3 | 27.1 | 64.6 | 7.2 | 66.8 | 17.3 | 60.1 | 15.9 | 55.3 | 13.9 | 65.2 |
| Misleading+Avg-Conf | 3.3 | 83.5 | 18.8 | 62.4 | 15.6 | 70.4 | 11.1 | 67.8 | 14.9 | 75.2 | 12.8 | 71.9 |
| Self-Random+Avg-Conf | 2.3 | 79.3 | 27.1 | 64.4 | 8.8 | 68.3 | 17.2 | 59.9 | 16.3 | 55.0 | 14.3 | 65.4 |
| Top-K+Pair-Rank | 10.2 | 61.1 | 23.0 | 49.7 | 27.5 | 52.2 | 21.1 | 55.0 | 11.1 | 52.4 | 18.6 | 54.1 |
| Top-K+Avg-Conf | 56.4 | 60.6 | 15.1 | 55.8 | 40.7 | 55.9 | 36.4 | 52.0 | 5.2 | 67.5 | 30.7 | 58.4 |
| SteerConf | 6.1 | 81.2 | 3.3 | 64.7 | 10.0 | 70.7 | 4.3 | 74.5 | 13.6 | 81.2 | 7.5 | 74.5 |
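
For readers comparing rows, the two aggregation baselines are sketched below in their common forms from the verbalized-confidence literature [37]; the exact variants used in the paper may differ.

```python
import numpy as np

def consistency_conf(answers, selected):
    """Consistency: fraction of sampled answers agreeing with the selected answer."""
    return float(np.mean([a == selected for a in answers]))

def avg_conf(answers, confs, selected):
    """Avg-Conf: verbalized-confidence mass on the selected answer,
    normalized by the total confidence mass across all samples."""
    confs = np.asarray(confs, dtype=float)
    agree = np.array([a == selected for a in answers])
    return float(confs[agree].sum() / max(confs.sum(), 1e-9))
```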

W2: make clear the effectiveness of each component. (1) steering prompts (2) different aggregation approaches

Response: We perform an ablation study to assess the impact of steering prompting and aggregation components, using LLaMA3-70b-4bit with CoT as the backbone. For (1), removing steering prompting yields "SteerConf w/o Steering Prompting". For (2), we replace our proposed aggregation with Consistency and Avg-Conf, denoted as "SteerConf+Consistency" and "SteerConf+Avg-Conf". Ablating both components results in the Self-Random approach with Consistency or Avg-Conf aggregation.
The results are reported in the table below, which demonstrates that each component contributes to improved performance, supporting the effectiveness of our method’s design.

| Method | GSM8K ECE | GSM8K AUROC | Law ECE | Law AUROC | DateUnd ECE | DateUnd AUROC | StrategyQA ECE | StrategyQA AUROC | Ethics ECE | Ethics AUROC | Average ECE | Average AUROC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SteerConf w/o Steering Prompting | 2.3 | 80.0 | 13.2 | 66.7 | 10.0 | 70.2 | 6.1 | 72.5 | 10.2 | 59.0 | 8.4 | 69.7 |
| Self-Random+Consistency | 2.2 | 79.3 | 27.1 | 64.6 | 7.2 | 66.8 | 17.3 | 60.1 | 15.9 | 55.3 | 13.9 | 65.2 |
| SteerConf+Consistency | 2.4 | 71.7 | 25.9 | 63.8 | 7.5 | 68.2 | 17.2 | 61.4 | 14.3 | 68.5 | 13.5 | 66.7 |
| Self-Random+Avg-Conf | 2.3 | 79.3 | 27.1 | 64.4 | 8.8 | 68.3 | 17.2 | 59.9 | 16.3 | 55.0 | 14.3 | 65.4 |
| SteerConf+Avg-Conf | 2.8 | 71.3 | 25.8 | 63.3 | 8.6 | 67.1 | 17.4 | 61.3 | 16.7 | 73.5 | 14.3 | 67.3 |
| SteerConf | 6.1 | 81.2 | 3.3 | 64.7 | 10.0 | 70.7 | 4.3 | 74.5 | 13.6 | 81.2 | 7.5 | 74.5 |

Comment

I appreciate the authors' new results, which address my major concerns on the empirical results, and I am willing to raise my score. After these concerns are addressed, I think this paper is borderline due to novelty in general.

Comment

Thank you for raising your score. As mentioned in your review, our approach is training-free and can be easily adopted; we propose a new way to aggregate confidence and answer pairs; and this approach demonstrates better calibration performance. We believe these points reflect our novel contributions in this work. We appreciate your effort and your recognition of our contributions.

Best,
Authors of SteerConf

Comment

Dear Reviewer ZMK7,

You have raised two points that are constructive and easy to address. We have addressed them with experiments. Please have a look, and we look forward to your feedback. We thank you for your effort in reviewing our paper.

1. Comparison with Pair-Rank and Avg-Conf. In the paper, we compared SteerConf to Top-K with Pair-Rank aggregation, and in this rebuttal, we have included additional results with Avg-Conf. Our SteerConf method outperforms these baselines, demonstrating its effectiveness.

2. Ablation Study. We have conducted ablation studies to show the effectiveness of each component in our method.

Best,
Authors of SteerConf

Final Decision

This paper introduces SteerConf, a novel, training-free framework for improving the confidence calibration of black-box Large Language Models. The core idea is to use a set of carefully crafted prompts to steer the model into generating answers with varying levels of verbalized confidence (e.g., cautious, confident). The final calibrated confidence is then derived by aggregating these outputs, considering both the consistency of the answers and the consistency of the confidence scores themselves. The authors demonstrate through extensive experiments on seven diverse benchmarks that SteerConf significantly outperforms existing methods in calibration and failure detection.

The primary strengths, as highlighted by the reviewers, are that the paper addresses an important and underexplored problem (ZMK7, NUu4, BqFQ), the proposed method is simple, practical, and broadly applicable as it is training-free and works in a black-box setting (1saN, ZMK7), and the claims are supported by consistent and significant empirical improvements (1saN, ZMK7, NUu4, BqFQ). The main weaknesses raised were concerns about the method's novelty, with one reviewer finding it an incremental aggregation technique (ZMK7), its reliance on the instruction-following capabilities of large models, which may limit its effectiveness on smaller models (BqFQ), and the increased computational cost due to multiple queries per input (BqFQ).

The decision to accept is based on the paper's clear and practical contribution, validated by a strong set of experiments. The authors' rebuttal was exemplary, providing substantial new experiments that addressed nearly all reviewer concerns, leading three of the four reviewers to either confirm a high score or raise their initial score.