Learning to Route LLMs with Confidence Tokens
This work introduces Self-REF, a training strategy that teaches large language models to reliably express confidence in their answers, leading to improved performance in downstream tasks like routing and rejection learning.
Abstract
Reviews and Discussion
The paper introduces Self-REF, a lightweight fine-tuning framework designed to teach large language models (LLMs) to express confidence in their answers through confidence tokens. These learned tokens indicate whether the model is confident or uncertain about its prediction, improving reliability and performance in downstream routing and rejection learning tasks. The authors demonstrate that Self-REF outperforms traditional approaches, such as verbalizing confidence or using token probabilities, on multiple public datasets by enabling more accurate routing to stronger models and better rejection of uncertain answers. The method achieves improved system efficiency and calibration while maintaining model performance.
Update after the rebuttal
During the rebuttal, the authors addressed all my concerns. I have decided to increase my score from 3 to 4.
Questions for the Authors
- What data did you use in the setup when you mention you trained on MMLU? I looked at Table 4 in the appendix, but I'm not sure about the numbers given that, from the MMLU paper, I get "The few-shot development set has 5 questions per subject, the validation set may be used for selecting hyperparameters and is made of 1540 questions, and the test set has 14079 questions." In general, I'd move some information about training the confidence tokens from the appendix to the main body of the work, or at least discuss it more thoroughly in the paper.
- In Figure 2 I see a "random route to 70B" baseline that I can't find described anywhere. Can you elaborate on it?
- I am slightly confused by the quantile thresholds in Section 4.2 and their relation to the routing rate in Figure 2. Specifically, how do these two quantities interact? For example, in Figure 2(a), what was the exact threshold value (t) that resulted in a routing rate of 0.4? More generally, it would be helpful to explicitly discuss the relationship between t and the routing rate, as this would give practitioners a clearer understanding of how to tune Self-REF for different trade-offs between accuracy and efficiency.
- Can you clarify what you mean with the following sentence about in-context learning for Llama3: "All experiments utilize Llama3-70B-Instruct with only its strong in-context learning capabilities during instance routing. The routing decisions are determined by the probabilities"
Claims and Evidence
The claims in the submission are supported by clear and convincing evidence. The paper presents quantitative results across four public datasets (MMLU, OpenbookQA, GSM8K, MedQA), showing that Self-REF improves routing efficiency by reducing the number of queries sent to larger models while maintaining accuracy (e.g., Llama3-8B routes only 39% of queries to match Llama3-70B's performance). Additionally, ROC curves demonstrate superior rejection learning, and calibration metrics (ECE, Brier Score, Cross-Entropy) confirm that confidence tokens align well with correctness. However, while the authors discuss trade-offs in routing and rejection, they do not extensively analyze potential failure cases, such as when Self-REF misidentifies confidence leading to incorrect rejections or unnecessary escalations.
Methods and Evaluation Criteria
The proposed method and evaluation criteria are well-aligned with the problem of confidence-based routing and rejection learning in LLMs. The use of confidence tokens is a novel yet intuitive approach that integrates seamlessly into autoregressive models, and the evaluation on standard QA and reasoning benchmarks (MMLU, OpenbookQA, GSM8K, MedQA) allows for fair comparison with prior art. The routing and rejection tasks are practical and relevant, as they reflect real-world scenarios where LLMs need to manage uncertainty efficiently. As stated above, an error analysis of model behavior in confidence-based routing would help gain additional insights into the effectiveness of Self-REF.
Theoretical Claims
N/A
Experimental Design and Analysis
The experimental design is generally strong. The routing experiments are well-structured, using multiple confidence thresholds to analyze trade-offs between accuracy and efficiency. The rejection learning task is also appropriately tested with artificially modified datasets where the correct answer is removed. Also, the calibration analysis (ECE, Brier Score, CE) helps measure how well model confidence aligns with correctness.
On the other hand, a much stronger analysis would include a systematic investigation of failure cases to identify patterns in when and why Self-REF misclassifies confidence. While the paper provides overall accuracy and calibration metrics, it does not explore whether certain types of questions, knowledge domains, or reasoning patterns lead to systematic overconfidence or underconfidence. For instance, if the model consistently misroutes certain types of questions, this could highlight fundamental limitations in its confidence estimation. Identifying such failure patterns would not only improve interpretability but also inform targeted improvements to Self-REF, such as adjusting confidence token fine-tuning strategies or incorporating adversarial training.
Also, an analysis of the relation between the confidence threshold and the routing rate (more below in the questions section) could help provide a better understanding of model behavior when Self-REF is used.
With the analysis above, I would consider this paper to deserve a score of 4 instead of its current 3.
Supplementary Material
N/A
Relationship to Existing Literature
The paper builds on prior work in uncertainty quantification, LLM routing, and rejection learning, but introduces confidence tokens as a mechanism for end-to-end confidence estimation. Previous methods, such as logit-based calibration and verbalized uncertainty prompts, show worse alignment between confidence scores and correctness, whereas Self-REF fine-tunes the LLM itself to embed confidence directly. Unlike external classifiers for routing, Self-REF integrates confidence estimation within the autoregressive decoding process, making routing decisions more adaptive and model-aware. Additionally, it improves on LLM rejection learning by enabling confidence-based abstention without requiring a separate rejection model or new loss functions.
Missing Important References
All the relevant references needed to understand the paper are discussed.
--
I found this paper, which came out after the ICLR submission deadline, but it could be a nice reference to add to the related work section: Dhananjay Ashok, Jonathan May, Language Models Can Predict Their Own Behavior.
Other Strengths and Weaknesses
The paper is generally well written and easy to follow.
Other Comments or Suggestions
Typos:
- line 117: "retarded"? This appears to be a typo.
- line 349: "bed" is a typo.
Thank you for the valuable feedback and reference. We've incorporated them into our related work and made the suggested expository improvements in our revised paper.
[Q-1] More analysis on a systematic investigation of failure cases to identify patterns in when and why Self-REF misclassifies confidence.
[A-1] This is a good suggestion; we previously included some case studies of successful routing in Appendix D and will add more failure cases as well. Analyzing the top 5 categories of correct/incorrect and overconfident/underconfident predictions, we have:
- Predict Correctly+<UN> (underconfident in its predictions): (1) computer_security; (2) high_school_biology; (3) high_school_european_history; (4) human_sexuality; (5) miscellaneous
- Predict Wrongly+<CN> (overconfident in its predictions): (1) college_computer_science; (2) conceptual_physics; (3) high_school_computer_science; (4) high_school_microeconomics; (5) jurisprudence
- Predict Correctly+<CN>: (1) international_law; (2) college_biology; (3) moral_disputes; (4) philosophy; (5) us_foreign_policy
- Predict Wrongly+<UN>: (1) abstract_algebra; (2) anatomy; (3) college_chemistry; (4) college_medicine; (5) econometrics
Analyzing these categories, we note:
- Predict Correctly+<UN> (underconfident in its predictions): The model often hesitates with context-heavy subjects due to their ambiguity and need for broader reasoning. "miscellaneous" reflects general uncertainty in non-standard topics.
- Predict Wrongly+<CN> (overconfident in its predictions): Many of these involve technical, structured domains, where the model shows overconfidence, likely due to familiarity from training. However, it may struggle with edge cases and nuanced reasoning, especially in areas like jurisprudence.
- Predict Correctly+<CN>: These topics rely on broad conceptual knowledge rather than strict calculations, and the model appears well-calibrated, likely due to strong training data coverage or clearer signals of correctness.
- Predict Wrongly+<UN>: These highly specialized, detail-heavy subjects require precise recall or deep understanding, and the model’s uncertainty may reflect an awareness of its limitations in recalling detail-heavy content to answer questions.
[Q-2] What data did you use in the setup when you mention you trained on MMLU? In general, i'd move some information about training the confidence tokens ...
[A-2] We use a randomly sampled subset of the official MMLU training set, where ground-truth answer choices are augmented with confidence tokens when fine-tuning Self-REF on MMLU (Algorithm 1, Section 3.2). We will move more details about the training setup from the appendix into Section 4.1.
[Q-3] In figure 2 I see random route to 70B baseline that I can't find it described. Can you elaborate on it?
[A-3] The random routing approach is a naive baseline that uniformly at random routes to the 70B model at the specified rate. When the random routing rate is 0.0, then it gives the performance of the small LM, and when the random routing rate is 1.0, it gives the performance of routing to the 70B model entirely. We will update our paper with these details.
[Q-4] I am slightly confused by the quantile thresholds in Section 4.2 and their relation to the routing rate in Figure 2. Specifically, how do these two quantities interact? For example, in Figure 2(a), ...? More generally, it would be helpful to explicitly discuss the relationship between t and routing rate, as ... trade-offs between accuracy and efficiency.
[A-4] To better analyze the routing performance, we set the thresholds t at 20-quantiles, as described in Section 4.2. For instance, the threshold t corresponding to a routing rate of 0.4 is the 40th percentile of confidence scores across all input queries. Practically, one could observe the 20-quantiles of the distribution, and choose thresholds that should empirically correspond to rough routing rates, and then monitor these thresholds over time. Note that Self-REF fine-tuning does not require committing to a particular tradeoff, as instead of relying directly on the sampled tokens, we instead extract confidence scores from the logits of the confidence tokens, thus giving us the ability to threshold this score and control how often to route.
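To make the threshold-to-routing-rate mapping concrete, here is a minimal hypothetical sketch (not the authors' code; variable names and score values are illustrative) of picking a quantile threshold over confidence scores:

```python
import numpy as np

# Illustrative confidence scores for a held-out set of queries, standing in
# for the scores extracted from the logits of the <CN>/<UN> tokens.
conf_scores = np.random.default_rng(0).uniform(size=1000)

# 20-quantiles of the score distribution: the q-quantile is a threshold t
# below which roughly a fraction q of the queries fall.
quantile_levels = np.arange(0.05, 1.0, 0.05)
thresholds = np.quantile(conf_scores, quantile_levels)

# Routing rule: send a query to the 70B model when its confidence score
# falls below the chosen threshold t.
t = thresholds[7]                          # the 0.40-quantile
routing_rate = (conf_scores < t).mean()    # ~0.40 by construction
print(t, routing_rate)
```

On new traffic the realized routing rate can drift away from the quantile used to set t, which is why monitoring the thresholds over time, as described above, matters.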
[Q-5] Can you clarify what you mean with the following sentence about the in-context learning for llama3: "All experiments utilize Llama3-70B-Instruct with only its strong in-context learning capabilities during instance routing..."
[A-5] This refers to the fact that we did not fine-tune Llama3-70B-Instruct (the larger, more expensive LLM) in the routing setting.
[Q-6] Missing reference, formatting, and typos.
[A-6] Thank you for the valuable comments and references. We will add this to our related work section on rejection learning. The formatting and typo suggestions will be fixed in our next version.
Thanks for your response. Things are clearer now. I don't see any major weakness with this work and have decided to increase my score to 4.
Self-REF is a lightweight method for training a language model to show when its answers are correct or incorrect by using “confidence tokens.” The approach starts with a base model that generates predictions and labels each instance as “confident” or “unconfident” based on the answer’s correctness, creating an augmented dataset. The model is then fine-tuned on these labeled samples and learns both to provide the correct answer and to tag it appropriately. Finally, continuous confidence scores are computed by comparing the probabilities of the “confident” vs. “unconfident” token at the end of each response.
They propose two applications. First, low-confidence queries can be routed to a larger model to cut costs without sacrificing accuracy. Second, when no larger model is available, the system can reject responses it deems untrustworthy. Across MMLU, OpenbookQA, GSM8K, and MedQA, routing only uncertain queries matches the strong model’s accuracy while reducing latency and cost; rejecting low-confidence outputs also helps avoid incorrect claims for questions lacking a valid answer.
Questions for the Authors
NA
Claims and Evidence
- Claim that the model's confidence tokens always align with correctness: The paper states that once fine-tuned, the model should consistently output <CN> whenever it is correct and <UN> otherwise. In practice, even well-calibrated methods can yield discrepancies between predicted confidence and true correctness.
- Claim of consistently "lightweight" overhead: the resulting method yields a cascading system, where we first have to monitor the smaller model's output and then route to the bigger one to get the final answer.
Methods and Evaluation Criteria
Overall, the methods and evaluation criteria do generally make sense given their focus on practical improvements in downstream tasks. However, I have a few concerns:
- The assumption that <CN> tokens represent correctness and <UN> represent incorrectness lacks a clear theoretical grounding. The paper implicitly assumes confidence directly correlates with correctness, though confidence in practice might not always reflect correctness accurately, potentially limiting generalizability.
- The paper partially addresses imbalance by subsampling unconfident tokens with a tunable parameter, but does not elaborate on handling extreme cases where the base model might be consistently correct or incorrect, which could severely skew training data. In fact, the approach depends heavily on the correctness of the base model to annotate confidence tokens. If the base model has substantial weaknesses or biases, these may propagate through fine-tuning, limiting improvements or perpetuating biases.
Theoretical Claims
NA
Experimental Design and Analysis
Yes, I would suggest studying additional baselines:
- I'm curious if the authors explored a baseline where the model generates a confidence token before producing an answer. This could reduce computation by preventing full generation when confidence is low. Also, a learned router baseline would provide insight.
- Explore other methods for labeling confidence:
- Estimating confidence based on consistency across multiple stochastic samples from the model.
- External calibration techniques, e.g. entropy-based uncertainty
Supplementary Material
Yes, A and B
Relationship to Existing Literature
The work addresses an important problem of allowing LLMs to self-assess the confidence of their predictions, and their applications to LLM routing and rejections.
Missing Important References
NA
Other Strengths and Weaknesses
NA
Other Comments or Suggestions
NA
Thank you for the detailed feedback. We respond to each point below, and will update the discussion accordingly:
[Q-1] The assumption that <CN> tokens represent correctness and <UN> represent incorrectness lacks a clear theoretical grounding. The paper implicitly assumes confidence directly correlates with correctness...
[A-1] A core contribution of this paper is utilizing confidence scores for rejection and routing downstream. Unlike prior works that derive uncertainty scores from logits, verbalized outputs, or re-sampling, we train our notion of confidence into the model, teaching it to reflect on its answer after producing it. The notion of confidence is grounded in the probability that a prediction is correct, i.e., P(<CN> | x, ŷ) ≈ P(ŷ = y | x), where <CN> is the confidence token, x is the question, ŷ is the predicted answer, and y is the true answer. This is a more direct form of confidence than other techniques, which often instead reflect consistency in the answer. For example, logits would instead reflect P(ŷ | x), with no relation to y. As noted in our paper (Sections 1 and 5.3), calibrated uncertainty metrics are not necessarily correlated with correctness. In our paper, we assess both the utility of the confidence scores as well as their calibration.
[Q-2] If the base model has substantial weaknesses or biases, these may propagate through fine-tuning, limiting improvements or perpetuating biases.
[A-2] We respond using two perspectives:
- The primary goal of Self-REF is to help the base model identify its existing weaknesses while maintaining performance, thereby enabling more effective routing.
- One possible approach for extreme incorrectness is as follows: first, fine-tune the base model on the downstream task to improve its task-specific capabilities; then, in a second stage, apply the Self-REF framework to teach the model to route effectively based on uncertainty. This is a useful extension of Self-REF we will add to the discussion.
[Q-3] Explore other methods for labeling confidence (consistency across sampling and external calibration).
[A-3] As mentioned in Q-1, the goal of Self-REF is to produce confidence tokens useful for downstream settings such as routing and rejection, boosting correctness of the overall system. Well-calibrated confidence scores do not always correlate with correctness [1, 2] (Section 1 and 5.3). Two toy examples to explain this misalignment intuitively: Assume we have a binary classification task with predicted probabilities (see below). Example 1 achieves lower accuracy but has a lower ECE score, whereas Example 2 achieves higher accuracy but a higher ECE score. This example demonstrates that calibration metrics are not correlated with the correctness of the prediction. Similarly, consistency-based uncertainty is not necessarily aligned with downstream correctness, e.g., when one has highly consistent incorrect answers. This misalignment can degrade routing performance when such signals are used for routing.
Bins = 2
| | predicted prob. of "1" | ground truth | ECE (↓) | accuracy (↑) |
|---|---|---|---|---|
| Example 1 | [0.5, 0.5, 0.5, 0.5] | [0, 0, 1, 1] | 0% | 0% |
| Example 2 | [0.4, 0.6, 0.9, 0.9] | [0, 0, 1, 1] | 20% | 75% |
[1] Huang, et al. "Look before you leap: An exploratory ..." arXiv 2023.
[2] Yona, et al. "Can Large Language ... Uncertainty in Words?." arXiv 2024.
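For concreteness, a generic sketch of computing ECE and accuracy for binary predictions like the toy settings above. This is hypothetical illustration code, not the authors'; other binning and tie-breaking conventions will change the exact numbers, but the qualitative point (a well-calibrated predictor can still be inaccurate, and vice versa) is unaffected.

```python
import numpy as np

def ece_and_accuracy(probs, labels, n_bins=2, decision_threshold=0.5):
    """Expected Calibration Error and accuracy for a binary task.
    Equal-width bins over the predicted probability of class 1; the hard
    prediction is 1 when prob >= decision_threshold."""
    probs, labels = np.asarray(probs, float), np.asarray(labels, int)
    preds = (probs >= decision_threshold).astype(int)
    accuracy = (preds == labels).mean()

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        in_bin = (probs >= lo) & (probs <= hi) if i == 0 else (probs > lo) & (probs <= hi)
        if in_bin.any():
            gap = abs(probs[in_bin].mean() - (preds[in_bin] == labels[in_bin]).mean())
            ece += in_bin.mean() * gap
    return ece, accuracy

# The two toy settings from the answer above (ground truth [0, 0, 1, 1]):
print(ece_and_accuracy([0.5, 0.5, 0.5, 0.5], [0, 0, 1, 1]))
print(ece_and_accuracy([0.4, 0.6, 0.9, 0.9], [0, 0, 1, 1]))
```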
[Q-4] I'm curious if authors explored a baseline where the model generates a confidence token before producing an answer. Also a learned router baseline would provide insight.
[A-4] To the best of our knowledge, there is no existing baseline that generates a confidence token prior to producing the answer. In multiple-choice QA, token generations are relatively short, making it feasible to predict a confidence token after producing the answer. This allows the model to condition its confidence estimation directly on the generated output, potentially resulting in more accurate confidence scores for routing. However, for tasks involving longer responses (e.g. reasoning), an alternative approach to improve efficiency could involve an early-stopping mechanism for confidence tokens. The model might produce a confidence token midway through its answer generation, allowing an earlier routing decision.
We ran additional experiments with a learned confidence-based router baseline, OOD-Probe [1], assessing its routing performance from Mistral-7B to Llama3-70B on the MMLU dataset. The results are shown in the table below. We report the accuracy at different routing ratios and observe that Self-REF outperforms the new baseline.
| Route_Ratio | 0% | 20% | 40% | 60% | 80% | 100% | least_routing_ratio |
|---|---|---|---|---|---|---|---|
| OOD-Probe | 0.50 | 0.61 | 0.66 | 0.68 | 0.74 | 0.74 | 80% |
| Self-REF | 0.55 | 0.64 | 0.68 | 0.72 | 0.74 | 0.74 | 70% |
[1] Mahaut, et al. "Factual confidence of LLMs:..." ACL 2024.
The authors propose a lightweight training strategy to teach LLMs to express, in a reliable manner, confidence in whether their answers are correct. Using this, the authors build a routing algorithm that reduces latency and improves overall QA performance.
Questions for the Authors
Your framework is only able to predict the confidence token at the end of the answer? Is that correct? If so, I see a drawback in terms of routing capabilities, since the model needs to generate the full response before routing it to another, larger LLM. Do the authors see any way to get the confidence tokens before the full answer is generated, using the proposed framework?
Claims and Evidence
The claims are well stated and supported.
Methods and Evaluation Criteria
The benchmarks and datasets for all three research questions (RQ1, RQ2, RQ3) make sense and are well analyzed.
Theoretical Claims
Not applicable
Experimental Design and Analysis
Experimental designs and analyses are valid. In particular, the significant reduction in the latency metric strengthens the validity of the results provided.
Supplementary Material
The supplementary material is good. The authors provide valuable additional insights that enhance the understanding of the main results presented in the paper.
Relationship to Existing Literature
The routing problem and confidence tokens constitute a very novel research area. All related literature is well covered.
Missing Important References
None
Other Strengths and Weaknesses
Please see other sections
Other Comments or Suggestions
Paper is well written.
Thank you for the valuable feedback and ideas. We have updated our discussion accordingly.
[Q-1] Your framework is only able to predict the confidence token at the end of the answer? Is that correct? If so, I see a drawback in terms of routing capabilities, since the model needs to generate the full response before routing it to another, larger LLM. Do the authors see any way to get the confidence tokens before the full answer is generated, using the proposed framework?
[A-1] Thank you for the insightful feedback. In question-answering tasks, the cost of generation is relatively low, and predicting a confidence token after generating the answer allows the model to condition its confidence on the output, potentially leading to more accurate uncertainty estimation for routing. However, in other long-response tasks, such as LLM-based reasoning, a promising direction could be to develop an early-stopping approach for confidence token generation. The model may emit a confidence token partway through answer generation in this setup, enabling an earlier routing decision to stronger models. This strategy could be particularly beneficial in complex reasoning scenarios, and we consider it a valuable direction for future work.
The paper proposes Self-REF, a training strategy that adopts LoRA to fine-tune an LM on a dataset augmented with confidence tokens, based on prediction correctness. Self-REF enhances downstream applications like model routing and answer rejection by leveraging the learned confidence token scores.
Questions for the Authors
- How was the optimal parameter for the <UN> data determined?
- How does the ratio of <CN>:<UN> in dataset construction affect the results?
- In Figure 2: why do different models/baselines/datasets have a different number of data points? For instance, Mistral only has 3 data points for the "Verbalizing uncertainty" baseline curve on MMLU.
Claims and Evidence
Please refer to the "Other Strengths And Weaknesses" section.
Methods and Evaluation Criteria
Please refer to the "Other Strengths And Weaknesses" section.
Theoretical Claims
N/A. No proofs or theoretical claims that require checking.
Experimental Design and Analysis
Please refer to the "Other Strengths And Weaknesses" section.
Supplementary Material
N/A. No additional supplementary material that require reviewing.
Relationship to Existing Literature
Please refer to the "Other Strengths And Weaknesses" section.
Missing Important References
Please refer to the "Other Strengths And Weaknesses" section.
Other Strengths and Weaknesses
Strengths
- The proposed Self-REF method is straightforward yet effective in learning more calibrated confidence scores, utilizing a simple data augmentation strategy.
- Overall, the paper is well-written and easy to follow, with comprehensive and detailed experiments that demonstrate strong empirical results across multiple datasets.
Weaknesses
- Missing reference to R-Tuning [1], which similarly constructs an augmented dataset consisting of certain and uncertain sets based on the correctness of LM predictions, and appends corresponding tokens as supervised signals to fine-tune a more calibrated LM capable of refraining from answering unknown questions. This work is not discussed in the paper, and could be considered as a baseline.
- Self-REF requires a dedicated training/validation set and a fine-tuning stage for different downstream tasks/datasets, which limits its practical usage compared to zero-shot baselines.
- It is unclear how much transferability Self-REF holds beyond similar tasks such as OpenbookQA --> MMLU. It would be very interesting to see how Self-REF performs when fine-tuning in a multi-task setup with a collection of datasets.
[1] Zhang, Hanning, et al. R-Tuning: Instructing large language models to say 'I don't know'. NAACL 2024.
Other Comments or Suggestions
Presentation: The figure size and font size (e.g., in Figure 2) might be too small, potentially affecting readability.
Thank you for the valuable feedback and reference. We've incorporated them into our related work and made the suggested expository improvements in our revised paper.
[Q-1] Self-REF requires a dedicated training/validation set and a fine-tuning stage for different downstream tasks/datasets, which limits its practical usages compared to zero-shot baselines.
[A-1] Our method targets the scenario where one would like to optimize performance for a particular downstream task while (1) being able to fine-tune smaller specialized models, (2) reducing monetary or latency cost, and (3) maintaining comparable performance on the downstream task. Thus, we agree that fine-tuning is necessary in Self-REF. However, we argue that it is often the case that one may want to fine-tune a smaller model for one's specific task of interest. Exploring broader use cases is a valuable direction for future work. An interesting extension would be to explore how Self-REF can extend to multi-task training for general use cases.
[Q-2] It is unclear how much transferability Self-REF holds beyond similar tasks such as OpenbookQA --> MMLU. It would be very interesting to see how Self-REF performs when fine-tuning in a multi-task setup with a collective of datasets.
[A-2] We agree that fine-tuning Self-REF with a mixture of tasks augmented with confidence tokens is a promising extension. However, we would like to emphasize that the nature of uncertainty can vary significantly across tasks. For instance, uncertainty in question-answering may differ from that in code generation or other complex reasoning tasks. While QA tasks often have well-defined ground truth, other tasks may require soft confidence labels (tokens), which could affect the effectiveness of confidence-based routing. We value and acknowledge the potential of this direction and will consider incorporating a broader set of datasets under a multi-task framework in future work.
[Q-3] How was the optimal parameter for the <UN> data determined?
[A-3] The subsampling parameter for the <UN> data is a hyperparameter in our work, and we select its optimal value based on the validation set.
[Q-4] How does the ratio of <CN>:<UN> in dataset construction affect the results?
[A-4] We fixed the overall dataset size and treated the ratio between <CN> and <UN> samples as a hyperparameter to optimize performance for a particular downstream task with fine-tuning. On one hand, including more <CN> samples helped fine-tune the model to better perform the downstream task; on the other hand, a sufficient number of <UN> samples was necessary to teach the model when to express uncertainty. An imbalance—either too many <UN> samples, which could degrade task performance, or too many <CN> samples, which could reduce the quality of uncertainty estimation—can negatively affect the model. To balance this trade-off, we experimented with ratios of 1:1, 1:2, 1:3, 1:4, and 1:5 and selected the optimal setting based on each task’s validation set.
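As an illustration of this data-construction step, a minimal hypothetical helper (not the authors' pipeline) that subsamples the augmented data to a target <CN>:<UN> ratio:

```python
import random

def balance_to_ratio(cn_examples, un_examples, cn_share=1, un_share=1, seed=0):
    """Hypothetical helper: subsample whichever side is over-represented so
    the augmented set has roughly cn_share:un_share <CN>:<UN> samples.
    The rebuttal reports tuning this ratio (1:1 up to 1:5) per task on a
    validation set."""
    rng = random.Random(seed)
    # Largest dataset achievable under the target ratio.
    n_units = min(len(cn_examples) // cn_share, len(un_examples) // un_share)
    cn_subset = rng.sample(cn_examples, n_units * cn_share)
    un_subset = rng.sample(un_examples, n_units * un_share)
    data = cn_subset + un_subset
    rng.shuffle(data)
    return data
```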
[Q-5] In Figure 2: why do different models/baselines/datasets have different number of datapoints? For instance, Mistral only has 3 data points for the "Verbalizing uncertainty" baseline curve on MMLU.
[A-5] The points correspond to 20-quantiles, as described in Section 4.2, which ideally yield 20 distinct and equally spaced routing rates, assuming all confidence scores are unique. However, certain methods for extracting confidence scores, such as verbalizing uncertainty, can suffer from mode collapse, where the scores cluster around a limited set of numbers. As a result, a substantial portion of the confidence scores is identical, leading to fewer than 20 distinct routing rates in practice.
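A small hypothetical example of why mode collapse reduces the number of points on the curve: when most confidence scores coincide, many of the 20-quantiles collapse onto the same threshold.

```python
import numpy as np

# Illustrative verbalized confidence scores that cluster on a few round values.
verbalized_scores = np.array([0.9] * 70 + [0.8] * 25 + [0.5] * 5)
quantile_levels = np.arange(0.05, 1.0, 0.05)
thresholds = np.unique(np.quantile(verbalized_scores, quantile_levels))
# Only a handful of distinct thresholds survive, so only a handful of
# distinct routing rates (curve points) can be realized.
print(len(thresholds))
```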
[Q-6] Missing reference, formatting, and typos.
[A-6] Thank you for the valuable comments and references. We will add this to our related work section on rejection learning. The formatting and typo suggestions will be fixed in our next version.
Paper summary:
This paper proposes a framework to fine-tune an LLM to express its confidence for an input query through a custom confidence token that is annotated in the fine-tuning data set. The framework has three phases:
- Confidence token annotation. If the model gives a correct answer, append <CN> to the ground-truth response. Else, append <UN>.
- LoRA fine-tuning. Fine-tune the model on the augmented data.
- (Inference time) Confidence score extraction. After decoding, the probability of <CN> (appropriately normalized) is a proxy for model confidence.
Two applications are considered: 1. learning to reject, 2. model cascading.
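For concreteness, a compact hypothetical sketch of the annotation and confidence-score-extraction phases summarized above (assumed token names and normalization; not the authors' released implementation):

```python
import torch
import torch.nn.functional as F

CN_TOKEN, UN_TOKEN = "<CN>", "<UN>"  # special tokens added to the vocabulary

def annotate(question, base_model_answer, gold_answer):
    """Phase 1: append <CN> when the base model's answer is correct,
    otherwise <UN> (exact-match correctness is a simplification here)."""
    tag = CN_TOKEN if base_model_answer.strip() == gold_answer.strip() else UN_TOKEN
    return {"prompt": question, "target": f"{gold_answer} {tag}"}

# Phase 2 is standard LoRA fine-tuning on the annotated targets (omitted).

def confidence_score(vocab_logits, cn_id, un_id):
    """Phase 3: normalized <CN>-vs-<UN> probability at the decoding step
    where the confidence token is produced, used as the routing/rejection score."""
    probs = F.softmax(vocab_logits, dim=-1)   # logits over the full vocabulary
    p_cn, p_un = probs[cn_id], probs[un_id]
    return (p_cn / (p_cn + p_un)).item()

# Dummy usage with a tiny 5-token vocabulary, <CN> at id 3 and <UN> at id 4:
dummy_logits = torch.tensor([0.1, 0.2, 0.3, 2.0, 1.0])
print(confidence_score(dummy_logits, cn_id=3, un_id=4))
```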
Reviews
The reviewers noted that the proposed approach is simple yet effective. The paper is well written and easy to follow (kmXF, w1rW). At a high level, there are two main claims: 1. confidence tokens align with correctness, 2. small overhead. These claims are well supported by the provided empirical results (gYcY, JjAL). Experiments are thorough.
Most of the questions raised were clarification questions (e.g., the ratio of <CN> and <UN>), which the rebuttal addressed well. Among the remaining concerns, as may be expected, a number of reviewers (w1rW, gYcY) asked why the special confidence token is appended at the end (as opposed to at the beginning, which can save some decoding time). While the rebuttal states that conditioning on the generated response can help improve uncertainty quantification, no further results were provided. That the work does not extensively analyze potential failure cases (JjAL) is another concern. In the rebuttal, the authors listed a few instances from a few tasks where the method tends to fail, partially addressing this concern. It would be good if the authors could analyze failure cases in more detail and provide general insights (as opposed to a case-by-case analysis). Owing to the extensive and strong empirical results, the AC thinks the merits outweigh the drawbacks.
AC recommendation: accept.