PaperHub
Overall score: 6.4/10 · Poster · 3 reviewers
Ratings: 4, 4, 4 (min 4, max 4, std 0.0) · Mean confidence: 3.3
Novelty: 2.3 · Quality: 2.7 · Clarity: 2.7 · Significance: 2.3
NeurIPS 2025

C3PO: Optimized Large Language Model Cascades with Probabilistic Cost Constraints for Reasoning

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29

Abstract

Keywords

large language model, efficiency, inference, cascade

Reviews and Discussion

Official Review
Rating: 4

The paper introduces C3PO (Cost Controlled Cascaded Prediction Optimization), a framework designed to optimize large language model (LLM) cascades with probabilistic cost constraints. The core idea behind C3PO is to create an efficient cascading inference system where simpler, cheaper models handle easy queries, and more complex models are only invoked for difficult cases. The framework provides a systematic approach to controlling costs while achieving high performance, offering a promising solution for real-world LLM deployment. The approach is particularly beneficial in scenarios where high inference costs are a barrier, ensuring both scalability and reliability.

Strengths and Weaknesses

Strengths:

  1. The paper provides a rigorous theoretical guarantee for the proposed C3PO framework, particularly in terms of cost control and generalization error, which is a significant contribution to the research in this field.
  2. In the context where the computational cost of LLMs is becoming an increasing bottleneck for applications, the method proposed in this paper precisely addresses the core issue of how to significantly reduce inference costs while ensuring high accuracy, with great practical significance and application value.

Weaknesses:

  1. The performance limit of this method is constrained by the teacher model. If the teacher model itself contains errors, the system not only fails to correct them but may also learn and reinforce these errors.
  2. The effectiveness of the system's decisions highly depends on the consistency between the distribution of the training data and the data distribution in the actual application scenario. When the real-world user problem patterns change (i.e., "distribution shift"), the pre-trained model selection strategy may fail or become suboptimal, leading to a decline in system performance.
  3. This method is not compared against the setting of "fine-tuning a cheap small model for a specific task."
  4. Lacks results on more challenging reasoning benchmarks, such as mathematical reasoning and code generation tasks. GSM8K is somewhat outdated; more recent benchmarks like AIME 2024/2025 and LiveCodeBench should be considered.

Questions

  1. It is recommended to use experiments to demonstrate that C3PO has stronger robustness to distribution shift compared to baseline methods.
  2. In the tasks discussed in the paper, conduct a re-experiment by replacing the API call method with a fine-tuned small model.
  3. It is suggested to discuss what the handling approach would be if the teacher model itself provides uncertain answers.

Limitations

Yes, in the supplementary material.

Final Justification

I also read the authors' responses, including the responses to other reviewers' concerns, which are convincing. I will raise my score to 4.

Formatting Concerns

No formatting concerns.

Author Response

We thank the reviewer for recognizing the benefits of C3PO's cost control, theoretical grounding, and efficiency. Below, we address the concerns.

3.1 Evaluate robustness to distribution shift

“It is recommended to use experiments to demonstrate that C3PO has stronger robustness to distribution shift compared to baseline methods.”

We thank the reviewer for this suggestion. We have conducted an experiment to study C3PO’s robustness compared to the baselines under distribution shift. We note that both FrugalGPT and TREACLE will be affected by distribution shift, since these methods require training on labeled data. C3PO is affected as well, since we need to learn the thresholds for early exiting from the agreement between weaker LLMs and the MPM by training on in-distribution data. On the other hand, both MoT and ModelSwitch are heuristic cascades and do not have any learnable parameters. This does not mean they are immune to distribution shift: for a fixed budget, the threshold MoT learns from in-distribution data might not be effective in the out-of-distribution setting, resulting in a loss of accuracy or a gross violation of the budget constraint. ModelSwitch does not have any threshold mechanism, so we cannot control its inference cost.

Specifically, we trained C3PO, FrugalGPT, and TREACLE independently on SVAMP and GSM8K, and evaluated the learned exit policy on the MATH-500 test set. This simulates a realistic setting where cascade policies are optimized using data from supervision domains different from the deployment task.

We obtained the same qualitative trend as in Figs. 2, 3, 7, and 9 in the paper for different algorithms in this setting. C3PO significantly outperforms the baselines in the low-cost regime and achieves superior cost-effectiveness while achieving near-MPM performance as the allowable budget is increased. Specifically, under a budget constraint of 0.0015 USD, when all the LLAMA cascades are trained on GSM8K, C3PO's best accuracy (57.8%) on MATH-500 is significantly higher than that of FrugalGPT (33%) and TREACLE (11%). On the other hand, C3PO achieves a near-MPM accuracy of 61.2% at less than 20% of the inference costs of FrugalGPT and TREACLE, when the allowable budget is increased to 0.005 USD. Similar results are obtained for training on SVAMP as well.

These results demonstrate excellent distribution shift robustness for C3PO in comparison to the baselines. Note that due to response formatting regulations (no images, no links), we are unable to include the figure depicting the cost vs. accuracy curve for this experiment, but we will include the corresponding plots in the camera ready paper. We note that in the cost vs. accuracy curve C3PO clearly dominates the baselines consistently achieving a higher accuracy for the same cost.

Summary

We trained C3PO, FrugalGPT, and TREACLE on SVAMP and GSM8K, then evaluated on the MATH‑500 test set.

  • C3PO: significantly outperforms baselines at low budgets and achieves near‑MPM accuracy at higher budgets (e.g., 57.8% vs. 33% for FrugalGPT, the second-best method, at $0.0015).
  • Baselines: FrugalGPT/TREACLE suffer large drops under shift; MoT/ModelSwitch cannot reliably control cost or adapt thresholds.
    Due to formatting rules, we omit the curves here but will include them in the camera‑ready version.

3.2 Compare against fine‑tuned small models

"In the tasks discussed in the paper, conduct a re-experiment by replacing the API call method with a fine-tuned small model."

We appreciate the reviewer’s suggestion. However, we respectfully disagree with the reviewer on the applicability of finetuned baselines.

In this work, we focus specifically on cascades for the inference-time deployment setting, where both C3PO and all baselines access models via APIs in a zero-shot or few-shot manner without any task-specific fine-tuning. This is motivated by practical considerations. Many real-world deployments of LLMs, especially in commercial and user-facing environments, rely on black-box APIs where model weights are not accessible and fine-tuning is infeasible.

Moreover, the objective of finetuning is to improve a single small/medium-sized, open-weight LM's domain-specific reasoning by either SFT [1] or RL [2]. This is in sharp contrast to the cascade-learning goal, which is to train a lightweight early-exit policy that saves inference cost while maintaining performance as close to the MPM as possible. Additionally, both SFT and RL finetuning require a large amount of labeled data (or a strong verifier and multiple rollouts), and a much larger computational burden than any LLM-cascade training (including C3PO and the baselines we consider). For example, [1] requires 26 minutes to train on 16 NVIDIA H100 GPUs and uses 1000 labeled questions, and [2] requires 2.788M H800 GPU hours for its full RL training. On the other hand, C3PO typically takes approximately 0.01 seconds on a single M3 CPU to train and does not require any labels. In addition, we note that none of the finetuning-based research papers (including [1, 2]) consider LLM cascades as a baseline. Finally, finetuned LLMs can still be employed in a cascade setting as just another model. We leave experiments integrating finetuned LLMs as cascade members, to further improve the cascade's cost-accuracy tradeoff, to future work.

[1] s1: Simple test-time scaling, N. Muennighoff et al., arXiv 2501.19393, 2025

[2] DeepSeek-V3 Technical Report, DeepSeek-AI, 2025

3.3 Propose a strategy for uncertain teacher answers

“It is suggested to discuss what the handling approach would be if the teacher model itself provides uncertain answers.”

This question is similar to Reviewer r17A’s “robustness to weak MPM” question. Due to space limitations, please refer to response 1.1.

3.4 Include harder and newer benchmarks

“Lacks results on more challenging reasoning benchmarks, such as mathematical reasoning and code generation tasks. GSM8K is somewhat outdated; more recent benchmarks like AIME 2024/2025, LiveCodeBench should be considered.”

We thank the reviewer for this helpful suggestion.

Our current evaluation targets logical, arithmetic, and mathematical reasoning tasks, which are central to C3PO’s intended scope. This includes both classical and challenging benchmarks such as MATH-500 and SVAMP, which cover multi-step symbolic and numerical reasoning. We point out that all of the baseline works in cascaded LLM inference exclusively evaluate in the same areas (logical, arithmetic and mathematical reasoning). Additionally, not a single one of our baselines evaluates in code generation tasks.

We agree that newer benchmarks such as AIME 2024/2025 are valuable (we provide results on them in this response), and we plan to include AIME in the camera-ready version. More specifically, we merge AIME 2024 and 2025 into one dataset and train and evaluate all methods using the GPT cascade. We obtain the following results: C3PO has the best accuracy (53%), significantly higher than FrugalGPT (48%), TREACLE (12%), MoT (39%), ModelSwitch (45%), and SC with the MPM (45%), all at a comparable cost of ~0.01 USD per question. The cost-accuracy curve for this experiment is qualitatively equivalent to Figures 3, 5, and 9 in the paper. Due to response formatting limitations, we are not able to post the figure in this response, but we will include results on the AIME dataset in the camera-ready version.

On the other hand, we respectfully argue that code generation (e.g., LivecodeBench) lies outside the intended scope of our work. Code introduces domain-specific correctness criteria (e.g., execution sandboxing, syntax parsing) that are not aligned with C3PO’s formulation, which focuses on optimizing cascades under a general-purpose probabilistic reasoning setup. Additionally, none of the existing baselines we compare against conduct experiments in the code domain.

Furthermore, we would like to emphasize the breadth of our current evaluation. At submission time, we included 16 datasets and 10 LLMs across 3 model families. In response to reviewer feedback, we have added a new hybrid model family that mixes models across families, conducted experiments for the out-of-distribution setting, and now evaluate on 16 datasets and 10 models across 4 families. This brings the total number of unique experimental settings to 16 × 4 = 64 direct comparisons between our model and the baselines, which makes our evaluation substantially more comprehensive than prior work, both in terms of LLM model diversity, task coverage and statistical significance.

We believe this breadth demonstrates the robustness and generality of C3PO across a wide range of settings. We view extensions to code-specific or multimodal domains as exciting future directions.

Summary

We now include AIME 2024/2025 results for the GPT cascade in our response:

  • C3PO has the best accuracy (53%) vs. FrugalGPT (48%), TREACLE (12%), MoT (39%), ModelSwitch (45%), and SC on the MPM (45%) at $0.01.
  • We will add the AIME curve in the camera‑ready version.
  • The 5% improvement over the next best baseline is significant given that the MPM only achieves 45% accuracy.
  • In this setting C3PO manages to outperform the MPM because the second-best model in the cascade correctly answers some questions that the MPM does not. C3PO can effectively leverage this by exiting.

Code generation benchmarks (LiveCodeBench) are outside our current scope—code tasks require syntax/execution evaluation and specialized pipelines not covered by our probabilistic reasoning cascade. Additionally, no baselines we compare against evaluate code generation.

Finally, we note that at submission we had 15 datasets and 10 LLMs across 3 families. In this rebuttal, we added a 4th “mixed” family, yielding 16 datasets and 10 models across 4 families (64 unique cascade comparisons), demonstrating unprecedented evaluation breadth relative to other papers in this area.

Comment

Thank you for your detailed response. You have addressed most of my concerns. I also read the authors' responses to other reviewers' concerns, which are convincing. I will raise my score to 4.

Official Review
Rating: 4

In this paper, the authors introduce a new cascade framework that dynamically decides whether to accept an intermediate model's prediction based on learned thresholds and conformal cost calibration. Theoretical guarantees on both cost control and generalization error are also provided. Experiments conducted on several public LLM families demonstrate the advantages of the proposed method.

Strengths and Weaknesses

Strengths

  1. Compared to other cascade frameworks, the proposed method has higher accuracy and lower overhead.

  2. The paper provides a detailed introduction to the existing methods and other preliminary knowledge.

  3. Theoretical proof has demonstrated the effectiveness of the optimization capability of the proposed algorithm under certain constraints.

Weaknesses

More details are lacking on the design of the ablation experiments and on the analysis of other results:

  1. Selecting calibration data, e.g., how to choose the number of samples in this subset.
  2. The effectiveness of methods in cross-family models.
  3. In different test sets, the improvement achieved by the proposed algorithm varies significantly. Does this indicate limits on the applicability of the algorithm? In addition, in Figure 9, performance decreases with increasing cost in some tests. How can this be explained?

Questions

Please see the weaknesses.

By the way, it seems that the proposed method is not limited to language models. Can its effectiveness be explored in more scenarios, such as multimodality?

Limitations

yes

Final Justification

After reading the response and other comments, I keep my original score as 4.

Formatting Concerns

N/A

Author Response

Response to Reviewer 3Nwb

We thank the reviewer for acknowledging C3PO's superior cost-accuracy tradeoff compared to existing methods, as well as the theoretical guarantees and the detailed review of related work. Below, we respond to the questions raised by the reviewer.

2.1 Clarify calibration data design

“Selecting calibration data, e.g., how to choose the number of samples in this subset.”

We address this question both theoretically and empirically. We will include a summarized discussion of the text below in a revised version.

• Theoretical

The required number of calibration samples can be derived from our theoretical results. In particular, we have this conformal finite-sample guarantee:

Conformal cost control (Theorem 1). Given $N_{\mathrm{cal}}$ calibration examples, we sort their realized total costs

$$C(x_i;\tau),\quad i=1,\dots,N_{\mathrm{cal}},$$

and let $C^*$ be the $(1-\alpha)$-quantile, i.e., the smallest value whose rank is

$$\bigl\lceil (1-\alpha)\,(N_{\mathrm{cal}}+1)\bigr\rceil.$$

Then for any new test point,

$$\Pr\bigl[C(x_{\mathrm{test}};\tau)>C^*\bigr]\le\alpha.$$

While this holds for all $N_{\mathrm{cal}}\ge 1$, the quantile resolution is $1/(N_{\mathrm{cal}}+1)$. To ensure the empirical violation probability upper bound is within $\pm\epsilon$ of $\alpha$, choose

$$\frac{1}{N_{\mathrm{cal}}+1}\approx\epsilon \quad\Longrightarrow\quad N_{\mathrm{cal}}\approx\frac{1}{\epsilon}.$$
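
To make the calibration step concrete, the following is a minimal sketch of the quantile computation described above. The function name and the synthetic cost distribution are our own illustrations, not the authors' released code:

```python
import numpy as np

def conformal_cost_quantile(cal_costs, alpha=0.1):
    """Conformal (1 - alpha)-quantile C* of realized calibration costs,
    so that Pr[C(x_test; tau) > C*] <= alpha under exchangeability
    (as in Theorem 1)."""
    cal_costs = np.sort(np.asarray(cal_costs))
    n = len(cal_costs)
    # Conformal rank: ceil((1 - alpha) * (n + 1)), capped at n when the
    # calibration set is too small to certify this alpha exactly.
    rank = min(int(np.ceil((1 - alpha) * (n + 1))), n)
    return cal_costs[rank - 1]

# Rule of thumb from the derivation above: N_cal ~ 1/epsilon calibration
# examples resolve the violation level alpha to within +/- epsilon.
rng = np.random.default_rng(0)
cal_costs = rng.gamma(shape=2.0, scale=5e-4, size=100)  # synthetic per-query USD costs
print(conformal_cost_quantile(cal_costs, alpha=0.1))
```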

• Empirical

We also evaluate the sensitivity of performance to calibration set size. We fix the target cost at $10^{-3}$ USD per question and set $\alpha=0.1$, thus allowing a 10% violation rate in cost. As shown in the table below for the GSM8K dataset, the realized cost violation rate remains well below 10% for all configurations.

| # Calibration Points | % Observed Cost Violations |
| --- | --- |
| 10 | 0% |
| 25 | 0% |
| 50 | 1.4% |
| 100 | 5.4% |
| 250 | 4.2% |

Table: Effect of calibration size on realized cost on GSM8K. The model under-utilizes the cost constraint, which allows up to 10% of the questions to exceed the cost budget ($\alpha=0.1$).

These results suggest that C3PO is robust to calibration size and performs well even in low-resource calibration settings. We observe that as more calibration points become available, C3PO can be more aggressive in its budget allocation, achieving improved accuracy and efficient model utilization while still respecting the soft budget constraint.


2.2 Test cross‑family model robustness

“The effectiveness of methods in cross‑family models.”

We investigate the effectiveness of cross-family models by creating a "chimera" cascade composed of LLMs from different families. Notably, we include LLaMA 3.2 1B-Instruct, Qwen 2.5 32B-Instruct, and GPT-4o-mini as the models in the cascade. These models are chosen such that we include one model from each family.

After running experiments on the same 16 datasets, our results mirror the right side of Fig. 1 without significant differences. Our model clearly reduces costs and achieves near parity with the MPM at one third of the cost. MoT achieves near parity with the MPM at 80% of the cost, and FrugalGPT and TREACLE achieve parity with the MPM at nearly the same cost (and sometimes higher).

In conclusion, we see that a mixed-family cascade comprising LLaMA 3.2 1B-Instruct, Qwen 2.5 32B-Instruct, and GPT-4o-mini operates with roughly the same performance as cascades of models from a single family, suggesting that the method is robust to LLM family selection.

We will include this additional figure and analysis in the camera ready version of the paper.

Summary

We built a mixed‑family cascade of LLaMA 3.2 1B, Qwen 2.5 32B, and GPT‑4o‑mini and ran the same 16 datasets. C3PO:

  • Dominates baselines in low‑cost regimes (similar to Fig. 1),
  • Matches MPM performance at ~⅓ of the MPM's cost, versus 80% for MoT and nearly full cost for FrugalGPT/TREACLE,
  • Demonstrates performance on par with same‑family cascades.

We will include the corresponding plot in the camera‑ready version.

2.3 Explain non‑monotonic cost‑accuracy trends

“In different test sets, the improvement effect of the proposed algorithm varies significantly. Does this indicate the applicability of the algorithm? In addition, in Figure 9, there was a phenomenon of performance decreasing with increasing cost in some tests. How to explain this?”

The variability in performance improvement across test sets reflects the diversity of model capabilities relative to the tasks, rather than a limitation of C3PO’s applicability. Our method and all other cascade baselines assume that the most expensive model in the cascade (i.e., the MPM) is also the most accurate. This assumption generally holds across tasks, but may break down in some domains. We note that this limitation inherently affects ALL cascade models that don't use labels, not just ours.

For example, in the Movie Recommendation dataset shown in Figure 9, we observe that gpt-4o-mini (a cheaper model) outperforms o3 (a more expensive reasoning model). Note that 4o-mini is used as a general-purpose model in tasks requiring nuanced language understanding as well as math, so it is a more well-rounded model, whereas o3 excels in code understanding and mathematics. As a result, this behavior is expected, as some of our datasets cover non-STEM reasoning. We cannot insert any external link as per the recent NeurIPS instructions, but this issue is well known and is discussed in OpenAI forums online (if interested, please search for comparisons between 4o-mini and o3). Note that all cases where the paradoxical inverse relation between budget and performance occurs in Fig. 9 are non-mathematical reasoning tasks (e.g., Sports, DisambiguationQA, Movie Recommendation). In such cases, the assumption that the MPM represents the best available model is violated, and the system may incorrectly allocate more budget to inferior models, leading to a drop in overall accuracy as cost increases. This explains the counterintuitive phenomenon in Figure 9 where performance degrades at higher budget thresholds.

This behavior is not a failure of the optimization algorithm per se, but a result of model misranking that cannot be detected in a fully label-free setting. Without labeled data, it is impossible to determine whether a more expensive model is genuinely more accurate than its cheaper counterparts.

A practical solution to this limitation is to introduce a small labeled validation set, which could be used to estimate the relative accuracy of the models in the cascade. This would allow us to identify and discard models that are less accurate than cheaper ones, even if they have higher inference costs. In the current label-free setup, we do not have access to such supervision, so we must rely on cost as a proxy for model strength.

Summary

  • All cascade methods assume the MPM is best; when it isn’t (e.g. gpt-4o-mini > o3 on Movie Recommendation), higher budgets hurt performance.
  • This misranking is inherent to any label‑free cascade.
  • Mitigation: small validation set to detect and prune inferior models.

2.4 Discuss generalization to multimodal settings

“It seems that the proposed method is not limited to language models. Can its effectiveness be explored in more scenarios, such as multimodality?”

We thank the reviewer for noting the potentially broader applicability of our framework beyond language models.

In principle, C3PO is modality-agnostic and could be applied to any setting where a cascade of models produces confidence scores and incurs measurable inference costs. The core formulation and optimization procedure make no assumption specific to language models, and the framework could, in theory, be extended to multimodal tasks such as vision-language reasoning or audio-text understanding.

However, we have not yet evaluated C3PO in multimodal scenarios, so we cannot make concrete claims about its performance in those domains. Exploring the effectiveness of C3PO in multimodal cascades is an exciting direction for future work.

We believe that with appropriate modifications, C3PO could serve as a promising building block for cost-aware multimodal inference, and we look forward to validating this in future research.

Comment

Thanks for the response. I will keep my score.

Comment

Thank you for your response, do you have any remaining concerns?

Official Review
Rating: 4

This paper introduces C3PO, a novel framework for optimizing LLM cascades that tackles key limitations of existing methods like reliance on labeled data and lack of cost control. Its primary contributions are a self-supervised optimization process that learns an exit policy without ground-truth labels by minimizing regret against a Most Powerful Model (MPM); a principled probabilistic cost control mechanism using conformal prediction to guarantee budget adherence; and strong theoretical generalization bounds based on PAC-Bayes analysis. Empirically, C3PO achieves state-of-the-art cost-accuracy performance on a wide range of reasoning benchmarks.

Strengths and Weaknesses

Strengths:

  1. The self-supervised nature of C3PO is its most significant practical advantage. By optimizing for agreement with the MPM, it sidesteps the need for costly data annotation, making the framework highly adaptable and scalable to new tasks and domains where labeled data is scarce.

  2. The application of conformal prediction to enforce a probabilistic cost constraint is a major innovation. This moves the field beyond heuristic-based budget tuning and provides a rigorous, theoretically sound method for managing inference costs, which is critical for real-world deployment.

  3. The paper is exceptionally well-grounded in theory. The combination of a conformal guarantee on cost and a PAC-Bayesian bound on generalization error provides a level of rigor and reliability that is often missing in a field dominated by empirical heuristics.

  4. Despite the sophisticated theoretical underpinnings, the core algorithm is an efficient and straightforward grid search over thresholds. As the authors correctly point out, this is highly practical for the typical size of LLM cascades used today.

Weaknesses:

  1. The framework's core assumption is that the MPM serves as a reliable proxy for the ground truth. The objective is to match the MPM's output, not necessarily the correct answer. In scenarios where the MPM is systematically flawed or biased, C3PO would learn to replicate these errors efficiently, which could be a significant drawback.

  2. The performance of the learned thresholds depends on the availability and quality of confidence scores from the LLMs. The paper assumes that these scores are stochastically increasing with the probability of correctness, but the reliability and calibration of such scores can vary significantly across different models and may not always be a robust signal.

  3. As the authors briefly note, conformal prediction can sometimes yield conservative bounds. This could lead the cascade to under-spend its allocated budget, potentially leaving some accuracy gains on the table in an effort to strictly satisfy the probabilistic cost constraint.

Questions

  1. How does C3PO's performance degrade if the MPM itself has relatively low accuracy on a given task? Could you characterize the break-even point where optimizing for agreement with a flawed MPM becomes less effective than a simpler baseline like self-consistency on a cheaper model?

  2. The method relies on model-produced confidence scores. Did you experiment with different ways of deriving these scores (e.g., average token log-probs, self-consistency voting)? How sensitive is C3PO to the choice of this confidence metric?

  3. The analysis of performance across different difficulty levels in MATH-500 is very insightful. Did you observe if C3PO learns to allocate budget as expected—i.e., does it learn lower confidence thresholds for easier problems and higher thresholds for more difficult ones, effectively spending more on harder questions?

  4. The cost model is assumed to be fixed and query-independent. How would the theoretical guarantees, particularly the conformal cost bound, be affected by stochastic costs (e.g., due to variable output lengths which are common in CoT reasoning)?

  5. One interesting point in the paper is that, given the difficulty of problems, the method can skip certain levels of the cascade. Is there a statistic on the number of such cases, and how much does the method benefit from this?

Limitations

yes

Formatting Concerns

NA

Author Response

Response to Reviewer r17A

We thank the reviewer for his or her insightful feedback recognizing the benefits of C3PO's self‑supervised design, conformal cost control, theoretical underpinnings, and efficiency. Below, we address the concerns.

1.1 Analyze robustness to weak or biased MPMs

“How does C3PO's performance degrade if the MPM itself has relatively low accuracy on a given task?”

There are two distinct cases to consider when the Most Powerful Model (MPM) is flawed (i.e., it has low accuracy or high uncertainty):

  1. The MPM is weak in absolute terms but remains the strongest among the available models.
    In this case, although the overall accuracy of all models (including the MPM) is low, the MPM remains the best available predictor in the cascade. This is not problematic: C3PO will still perform reasonably well by optimizing cost allocation while maintaining predictive quality. For instance, in Fig. 3, on the hardest subset of the MATH‑500 dataset, both the cascade and the MPM achieve an accuracy of approximately 40 %, yet C3PO minimizes cost effectively. Note that integrating alternative strategies such as ensembling weaker models or increasing the number of self‑consistency (SC) samples for cheaper models is straightforward. Such strategies can be treated as new candidate models in the cascade, each with its own cost and confidence scores.
  2. The MPM is relatively weak compared to other models.
    Here, the MPM is not the strongest predictor—violating C3PO’s assumption that the MPM is the most accurate model in the cascade. This can lead to degraded or paradoxical outcomes. For example, in Fig. 9 (Movie Recommendation dataset), the cheaper model gpt-4o-mini outperforms the more expensive reasoning model o3, so allocating more budget to the MPM actually reduces performance. Since C3PO operates in a label‑free setting, it cannot detect such ranking violations, which can harm performance.
    This limitation is inherent to all cascade models (e.g., TREACLE, MoT, FrugalGPT) and is unavoidable under a label‑free assumption. We view this as a reasonable limitation of our setting. However, it can be mitigated through:
  • Introducing a small labeled validation set: estimate relative model accuracy and prune expensive models that underperform cheaper ones.
  • Assuming known model rankings: incorporate prior knowledge of model strengths in cascade setup.

Either assumption fully addresses a potentially weak MPM by replacing it with a stronger alternative model from the cascade.

1.2 Break-even Point with Self-Consistency (SC)

“Could you characterize the break-even point where optimizing for agreement with a flawed MPM becomes less effective than a simpler baseline like self-consistency on a cheaper model?”

A break-even scenario arises, for example, when the cascade contains models A, B, and C, where A is very weak (e.g., random guessing) and B and C have similar accuracy but different costs. In such cases, the optimal strategy is to always use model B (i.e., SC on B). C3PO can replicate this strategy: it learns to skip A by setting its threshold above 100% and to always exit at B by setting its threshold to zero. Thus, SC is a special case within C3PO's solution space, and our method can always match or outperform it. (See lines 161–166 for how thresholds above 100% disable models.)
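
As a hedged illustration of this mechanism, here is a minimal sketch of a threshold-based exit policy in which a threshold above 1.0 disables a model and a threshold of 0 forces an exit. The function and its interface are hypothetical, not the paper's implementation:

```python
def cascade_answer(query, models, thresholds):
    """Minimal threshold-based cascade exit policy (illustrative only).

    models:     list of (name, predict_fn) pairs, cheapest first; each
                predict_fn returns (answer, confidence) with confidence in [0, 1].
    thresholds: one exit threshold per model. A threshold > 1.0 disables
                the model (it is skipped entirely), and a threshold of
                0.0 always exits at that model.
    """
    answer = None
    for (name, predict), tau in zip(models, thresholds):
        if tau > 1.0:          # disabled model: skip without querying it
            continue
        answer, confidence = predict(query)
        if confidence >= tau:  # confident enough: exit the cascade here
            return answer
    return answer              # fell through: last queried model's answer

# SC on model B alone is the special case thresholds = [1.1, 0.0, t] for
# any t in a three-model cascade (A, B, C): A is disabled, B always exits.
```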

1.3 Evaluate different confidence metrics

“The method relies on model-produced confidence scores. Did you experiment with different ways of deriving these scores (e.g., average token log-probs, self-consistency voting)? How sensitive is C3PO to the choice of this confidence metric?”

We agree with the reviewer that the choice of confidence metric is a key design decision. For both theoretical and practical reasons, deriving the confidence metric from self-consistency levels is the only straightforward method that applies in our setting: self-consistency (SC) is strongly correlated with correctness and widely adopted in prior work on cascades (e.g., MoT, ModelSwitch).

Alternatives like token log-probs are infeasible for us: GPT models do not expose them, and our API setup (even for open-source models) lacks token-level access. Moreover, prior work [1] shows token log-probs are poor uncertainty estimates, making SC both a practical and robust choice.

[1] Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation, L. Kuhn et al., ICLR 2023
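
For concreteness, here is a minimal sketch of deriving an SC confidence score by majority voting over sampled answers. The sampling interface `sample_fn` is a hypothetical stand-in for an API call with temperature > 0, not part of the paper:

```python
from collections import Counter

def self_consistency_confidence(sample_fn, query, n_samples=8):
    """Draw n answers and use the majority answer's vote share as the
    confidence score (illustrative sketch, not the paper's code).

    sample_fn(query) -> str is assumed to return one sampled answer,
    with enough randomness that samples can disagree.
    """
    answers = [sample_fn(query) for _ in range(n_samples)]
    (majority, count), = Counter(answers).most_common(1)
    return majority, count / n_samples
```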

1.4 Check if the model allocates budget by difficulty

“The analysis of performance across different difficulty levels in MATH-500 is very insightful. Did you observe if C3PO learns to allocate budget as expected—i.e., does it learn lower confidence thresholds for easier problems and higher thresholds for more difficult ones, effectively spending more on harder questions?”

C3PO is trained without access to difficulty labels and does not explicitly learn question-specific thresholds. However, we observe that easier questions tend to exit early, while harder ones proceed further in the cascade.

For example, in the Qwen cascade, ~53% of level-1 (easy) questions exit at the smallest model (1.5B), while 0% reach the MPM (72B). In contrast, for level-5 (hard) questions, only ~5% exit at 1.5B, and ~52% reach the MPM. Similar trends hold for GPT and LLAMA cascades, confirming that C3PO allocates more budget to harder questions, even without explicit difficulty supervision. These results support the cost analyses presented in Figures 3 and 11 by showing how C3PO spends more budget for harder questions.

1.5 Account for stochastic cost settings

“The cost model is assumed to be fixed and query-independent. How would the theoretical guarantees, particularly the conformal cost bound, be affected by stochastic costs (e.g., due to variable output lengths which are common in CoT reasoning)?”

We show that the conformal cost-control guarantee in Theorem 1 can be extended to the stochastic cost setting, where model cost varies per query (e.g., due to variable output lengths in chain-of-thought reasoning). While some statistical efficiency is lost, the form of the guarantee remains intact.

1. Fixed-cost recap (Theorem 1)

In the main setting, each model $M_j$ has a fixed, known cost $c_j$. The total cascade cost for a query $x$ with exit policy $\tau$ is

$$C(x;\tau)=\sum_{k=1}^{z(x,\tau)} c_k,$$

which is deterministic. Sorting $C(x_i;\tau)$ over the calibration set and selecting the $(1-\alpha)$-quantile $C^*$ gives the guarantee

$$\Pr_{x\sim\text{test}}\bigl[C(x;\tau)>C^*\bigr]\le\alpha,$$

under the standard assumption that calibration and test samples are exchangeable.

2. Stochastic costs

Now assume each model $M_j$ on input $x$ has a random cost $C_j(x)$ (e.g., proportional to output length), so that the total cost becomes

$$C(x;\tau)=\sum_{k=1}^{z(x,\tau)} C_k(x),$$

which is itself a random variable. If the joint distribution of $\{(x_i, C_1(x_i), \dots, C_m(x_i))\}_{i=1}^{N}$ and a test sample $(x_{\text{test}}, C_1(x_{\text{test}}), \dots, C_m(x_{\text{test}}))$ is exchangeable, then applying conformal prediction to the realized total costs still yields

$$\Pr\bigl[C(x_{\text{test}};\tau)>C^*\bigr]\le\alpha,$$

where $C^*$ is the empirical $(1-\alpha)$-quantile of the realized $C(x_i;\tau)$ values on the calibration set. Thus, the guarantee is unchanged in form.
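
Under this exchangeability assumption, the calibration procedure itself is unchanged: one conforms over realized (random) total costs. A small simulation sketch, using an arbitrary synthetic lognormal cost model that is not from the paper, illustrates that the empirical violation rate stays near or below $\alpha$:

```python
import numpy as np

rng = np.random.default_rng(1)

def realized_cost(n):
    # Synthetic stochastic per-query cost, e.g., proportional to a variable
    # CoT output length; the lognormal shape is an arbitrary assumption.
    return rng.lognormal(mean=-7.0, sigma=0.5, size=n)

alpha, n_cal, n_test = 0.1, 250, 10_000
cal = np.sort(realized_cost(n_cal))
rank = min(int(np.ceil((1 - alpha) * (n_cal + 1))), n_cal)
c_star = cal[rank - 1]

violations = (realized_cost(n_test) > c_star).mean()
print(f"C* = {c_star:.2e} USD, empirical violation rate = {violations:.3f}")
# Expected: violation rate <= ~0.10, matching the conformal guarantee.
```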

3. Practical considerations

  • Heavier tails: If $C(x)$ has high variance, more calibration samples may be required for a reliable estimate.
  • Upper bounds: Model costs can be bounded deterministically. By truncating CoT length during generation, one can recover a worst-case cost upper bound at some loss in predictive power. Note that most APIs allow users to set a limit on the maximum generation length, so this is practically easy to implement.

1.6 Quantify grid‑skipping behavior

“One interesting point in the paper is that, given the difficulty of problems, the method can skip certain levels of the cascade. Is there a statistic on the number of such cases, and how much does the method benefit from this?”

As the reviewer correctly notes, a key feature of C3PO is its ability to skip models. For example, in Fig. 2 (MATH-500), under the lowest budget ($10^{-5}$ USD/question), C3PO exits immediately at the first model (LLaMA 1B) to respect the constraint. At higher budgets (e.g., $10^{-3}$ USD/question), it learns to skip LLaMA 1B entirely, reducing cost by avoiding weak models. Model skipping is observed in ~20% of examples in Fig. 2, especially at budget extremes, where it is optimal to either exit early or directly escalate to stronger models.

1.7 Conservative nature of conformal bounds

“As the authors briefly note, conformal prediction can sometimes yield conservative bounds. This could lead the cascade to under-spend its allocated budget, potentially leaving some accuracy gains on the table in an effort to strictly satisfy the probabilistic cost constraint.”

We agree with the reviewer’s observation. C3PO can become overly conservative in order to maintain cost guarantees with high probability. This can result in under-utilization of the available budget in some cases. We note that we stated this limitation clearly in lines 341-343. One promising direction to address this conservativeness is to move beyond single-pass parallel inference. Two natural alternatives are:

  1. Batch-level adjustment: Run inference in batches and reallocate unused budget from earlier batches to later ones.
  2. Iterative passes: Use initial thresholds in a first pass, then spend leftover budget on harder queries (e.g., by invoking the MPM) in a second pass.

Both approaches improve budget utilization at the cost of additional inference time.

Comment

Thank you for your response, do you have any remaining concerns?

Comment

Thanks for the explanation, I will keep my score.

Final Decision

This paper introduces C3PO, a novel framework for cascaded LLM inference that tackles key limitations of existing methods, including reliance on labeled data and lack of cost control. C3PO presents a self-supervised optimization process that learns a threshold-based exit policy without ground-truth labels by minimizing regret against a Most Powerful Model (MPM). Reviewers praised the theoretical rigor and the innovative application of conformal prediction to enforce a probabilistic cost constraint, and supported acceptance. Additionally, during the rebuttal the authors provided new results on AIME and on stochastic cost modeling. Overall, I think this paper is a marginal accept with a good contribution in the domain of cascaded LLMs.