PaperHub
Rating: 7.3/10 · Poster · 4 reviewers
Scores: 4, 4, 5, 5 (min 4, max 5, std 0.5)
Confidence: 3.8 · Novelty: 2.5 · Quality: 3.0 · Clarity: 3.3 · Significance: 3.0
NeurIPS 2025

MESS+: Dynamically Learned Inference-Time LLM Routing in Model Zoos with Service Level Guarantees

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29
TL;DR

We propose a new algorithm that introduces guarantees for minimum user satisfaction rates in language model zoos while optimizing for operating cost, which can be practical for inference endpoint services.

Abstract

Open-weight large language model (LLM) zoos provide access to numerous high-quality models, but selecting the appropriate model for specific tasks remains challenging and requires technical expertise. Most users simply want factually correct, safe, and satisfying responses without concerning themselves with model technicalities, while inference service providers prioritize minimizing operating costs. These competing interests are typically mediated through service level agreements (SLAs) that guarantee minimum service quality. We introduce MESS+, a stochastic optimization algorithm for cost-optimal LLM request routing while providing rigorous SLA compliance guarantees. MESS+ learns request satisfaction probabilities of LLMs in real-time as users interact with the system, based on which model selection decisions are made by solving a per-request optimization problem. Our algorithm includes a novel combination of virtual queues and request satisfaction prediction, along with a theoretical analysis of cost optimality and constraint satisfaction. Across a wide range of state-of-the-art LLM benchmarks, MESS+ achieves an average of $2\times$ cost savings compared to existing LLM routing techniques.
Keywords
inference optimization · model routing · stochastic optimization algorithms · convergence analysis

Reviews and Discussion

Review (Rating: 4)

This paper presents MESS+, an algorithm for optimally routing requests across multiple language models (LLM zoo) while maintaining service level guarantees and minimizing operating costs. Here are the key points and contributions:

Main Problem Addressed:

  • How to automatically select the most cost-efficient LLM for each request while ensuring a minimum quality level specified by Service Level Agreements (SLAs)
  • Balancing between user needs (high-quality responses) and provider needs (low operating costs)

Key Components of MESS+:

  1. Online Learning System:
  • Dynamically learns which models can satisfy which types of requests
  • Uses ModernBERT-based predictor to estimate satisfaction probabilities
  • Adapts through exploration/exploitation strategy
  2. Cost Optimization Framework:
  • Uses virtual queues to track SLA compliance
  • Solves per-request optimization problem considering both costs and predicted satisfaction (see the sketch after this list)
  • Provides theoretical guarantees for both cost optimality and SLA compliance
  3. SLA Guarantee Mechanism:
  • Maintains minimum satisfaction rate (α) over time
  • Uses virtual queue system to track and correct violations
  • Allows configuration of trade-off between quick convergence and cost efficiency
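Taken together, these components suggest a drift-plus-penalty style loop. Below is a minimal sketch (our code, not the authors': names such as `costs`, `predict_satisfaction`, and `p_explore` are assumed for illustration; the paper's exact objective is Equation (3)):

```python
import random

# Hedged sketch of the routing loop described above; illustrative only.
def route(request, models, costs, predict_satisfaction, Q, V, p_explore):
    """Pick which model(s) to query for one incoming request."""
    if random.random() < p_explore:
        # Exploration: query every model in the zoo to collect labels
        # for training the satisfaction predictor (expensive, done rarely).
        return list(models)
    # Exploitation: per-request optimization trading operating cost
    # (scaled by V) against predicted satisfaction (weighted by queue Q).
    return [min(models,
                key=lambda m: V * costs[m] - Q * predict_satisfaction(request, m))]

def update_queue(Q, alpha, satisfied):
    # Virtual queue: grows when a request misses the SLA target alpha and
    # drains when satisfied, enforcing avg(satisfied) >= alpha over time.
    return max(Q + alpha - float(satisfied), 0.0)
```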

Results:

  • Achieves 2x cost savings compared to existing routing techniques
  • Successfully maintains SLA requirements across various benchmarks
  • Demonstrates effectiveness across different types of language tasks

The paper's main contribution is a theoretically grounded framework that, for the first time, provides both cost optimization and rigorous service level guarantees for LLM request routing, while learning request satisfaction patterns online without requiring pre-curated routing datasets.

优缺点分析

Strengths:

  1. Theoretical Foundation
  • Provides rigorous theoretical guarantees for both cost optimality and SLA compliance
  • Clear mathematical framework based on Lyapunov optimization
  • Theorems are well-stated with complete proofs referenced in appendix
  2. Novel Approach
  • Introduces online learning component that eliminates need for curated datasets
  • Novel integration of virtual queues with satisfaction prediction
  3. Empirical Validation
  • Comprehensive evaluation across 8 different benchmarks
  • Clear ablation studies showing impact of different parameters

Weaknesses:

  1. Implementation Details
  • Limited details about the ModernBERT predictor architecture
  • Missing specifics about classifier design on top of ModernBERT
  • Could benefit from more details about hyperparameter selection
  2. Practical Limitations
  • Assumes immediate availability of user satisfaction feedback
  • Requires querying all models during exploration phase, which could be expensive
  • May need significant number of requests before convergence
  • The authors acknowledge these limitations but don't propose solutions
  3. Experimental Scope
  • All experiments use same family of models (Llama) and only 3 models in the model zoo
  • Could benefit from testing across different model families and more models
  • Limited discussion of how performance varies with different types of requests
  4. Theoretical Assumptions
  • Some assumptions (like i.i.d. requests) may not hold in practice
  • Convergence guarantees require large number of requests
  • Trade-off between exploration and exploitation could be explored more thoroughly
  5. Comparisons
  • Could include comparison with simpler baselines (e.g., random routing with cost threshold, nearest neighbor routing, etc.)
  • No discussion of failure cases or edge conditions

Questions

  1. Predictor Architecture Details [Good to have]
  Question: Could you provide more specific details about the ModernBERT-based predictor implementation?
  • How exactly is the classifier implemented on top of ModernBERT?
  • Which layers are frozen vs. trained?
  • What is the rationale for these architectural choices?
  Impact: This information is crucial for reproducibility and understanding the method's practical requirements. A detailed response could increase the quality score by demonstrating the robustness of the technical implementation.
  2. Cross-Model Family Performance [Highly Important]
  Question: Have you tested MESS+ across different model families?
  • How does performance vary when mixing models from different providers?
  • Are there specific characteristics that make certain model combinations more effective?
  • What modifications might be needed for heterogeneous model zoos?
  Impact: Demonstrating broader applicability across model families would strengthen the paper's generalizability claims. Positive results could increase both significance and originality scores.
  3. Cost Structure Analysis [Highly Important]
  Question: Could you provide more detailed analysis of:
  • How different cost structures affect routing decisions?
  • The impact of varying cost ratios between models?
  • The relationship between exploration costs and long-term benefits?
  Impact: This would help readers understand the economic implications of deployment. Clear analysis could increase the paper's practical significance score.

Additional Suggestions:

  • Consider including simple baseline comparisons (e.g., random routing, knn router, mlp router; cf. RouterBench). Some of them can be adapted to the online setting, but even comparing to the offline baseline is valuable.
  • Add analysis of failure cases or edge conditions
  • Provide guidelines for selecting hyperparameters (V, c) in practice

Limitations

The authors have partially addressed limitations, but there are several areas that could benefit from more thorough discussion:

  1. Technical Limitations
  The paper discusses some key technical limitations in Section 5:
  • Dependency on readily available user satisfaction labels
  • Challenges with multiple model exploration
  However, it could benefit from addressing:
  • Scalability limitations with very large model zoos
  • Performance degradation scenarios
  • Resource requirements for the prediction system
  2. Societal Impact
  Missing discussions on:
  • Potential misuse of cost optimization at the expense of quality
  • Privacy implications of collecting user satisfaction data
  • Fairness considerations in satisfaction prediction across different user groups
  • Potential bias of the router
  3. Deployment Risks
  Should address:
  • Risks of system manipulation through adversarial requests
  • Failure modes in high-stakes applications
  • Security considerations for the routing system

Suggestion: Add a dedicated section discussing these aspects comprehensively, particularly focusing on the societal and ethical implications of automated decision-making in service provision.

Final Justification

The authors have addressed all my raised points in the rebuttal. I strongly suggest that the authors include these analyses in the paper.

Formatting Issues

N/A

Author Response

Thank you for your thorough review! We will first address your questions based on their impact and then reply to the weaknesses.

Q2 & W3) Our response consists of three parts. Part A addresses mixed-provider model zoos, part B specific characteristics and heterogeneous zoos, and part C weakness 3.

Part A: We include a new model zoo that mixes Llama 3 and Qwen 2 models. MESS+ maintains strong performance compared to our Educated Guessing baseline. We will include these results in our updated paper, along with a full comparison against all baselines, which we could not complete due to time constraints.

Table I: Mixed zoo with Llama 3 and Qwen 2 models.

| Category | Operating Cost | Request Satisfaction | Model Call Ratio (Q32B/L8B/L1B/Q0.5B) |
| --- | --- | --- | --- |
| Qwen2 0.5B | 0.15±0.01 | 54.12±45.15 | 0% / 0% / 0% / 100% |
| Llama 3.2 1B | 0.20±0.01 | 61.13±4.79 | 0% / 0% / 100% / 0% |
| Llama 3.1 8B | 0.50±0.01 | 67.07±4.61 | 0% / 100% / 0% / 0% |
| Qwen2.5 32B | 1.99±0.01 | 70.91±4.48 | 100% / 0% / 0% / 0% |
| Educated Guessing | 1.06±0.01 | 67.69±2.30 | 46% / 14% / 29% / 11% |
| MESS+ (ours) | 0.98±0.01 | 67.47±3.28 | 42% / 27% / 17% / 14% |

Part B: Our approach can be used with any zoo as long as user satisfaction feedback and operating cost measurements are available. Regarding specific model combinations, MESS+ works best with models that have distinct cost characteristics, as our problem setting (Equation 3) optimizes for operating cost.

Part C: We conducted additional experiments on a different, larger model zoo using the Qwen 2 model family (0.5B, 1.5B, 7B, 32B). Our additional results further underpin the performance of MESS+. In Table I (see rebuttal for reviewer 9ALn), we show the average performance across the 8 benchmarks discussed in our paper ($\alpha = 0.67$). We will include the detailed results in our updated paper. MESS+ benefits from distinct models in a zoo, i.e., the more the performance and cost characteristics of models in a zoo vary, the better our routing works. This results from our objective function where we trade off model performance for operating cost.

Q3) We answer this question in two parts. The first part (A) is a combined answer regarding varying cost structures and ratios, and the second (B) outlines the exploration vs. exploitation cost when using different $c$ values for MESS+.

Part A: We reduce the cost spread in the Qwen2 zoo by a factor of 2x, centering the cost closer around the mean cost per request across models. Consequently, this also changes the cost structure in our zoo. This makes it more challenging for MESS+ to route requests. As can be seen in Table II below, MESS+ maintains its performance even with a more homogeneous cost structure. We will include these additional results in our updated paper.

Table II: Experiments on the Qwen 2 model zoo with a 2x squeezed cost ratio around the average inference cost per request

| Category | Operating Cost | Request Satisfaction | Model Call Ratio (Q32B/Q7B/Q1.5B/Q0.5B) |
| --- | --- | --- | --- |
| Qwen2 0.5B | 0.15±0.01 | 54.12±45.15 | 0% / 0% / 0% / 100% |
| Qwen2 1.5B | 0.20±0.01 | 61.13±4.79 | 0% / 0% / 100% / 0% |
| Qwen2 7B | 0.50±0.01 | 67.07±4.61 | 0% / 100% / 0% / 0% |
| Qwen2.5 32B | 1.99±0.01 | 70.91±4.48 | 100% / 0% / 0% / 0% |
| Educated Guessing | 0.59±0.01 | 67.36±2.46 | 45% / 29% / 13% / 13% |
| MESS+ (ours) | 0.33±0.01 | 66.36±3.68 | 33% / 29% / 15% / 24% |

Part B: Using models from the Qwen 2 family, we analyzed costs for exploration (querying the model zoo to train the classifier) versus LLM inference (using the trained predictor). Table III shows average results across 8 benchmarks for the first 500 requests per benchmark. We found that $c=0.01$ causes slow predictor learning, requiring well beyond 500 steps to reach SLA compliance, while $c=1.0$ wastes resources on excessive training. The optimal value $c=0.1$ balances exploration and inference costs. As designed, MESS+ reduces exploration probability over time, making inference costs eventually dominate. These insights will be included in our updated paper.

Table III: Exploration vs. Exploitation cost in the Qwen 2 model zoo

| c | Exploration Cost @ 500 steps (in MJ) | LLM Inference (Exploitation) Cost @ 500 steps (in MJ) | Mean Benchmark Accuracy @ 500 steps ($\alpha = 0.67$) |
| --- | --- | --- | --- |
| 0.01 | 0.09 ± 0.01 | 0.15 ± 0.01 | 0.40 ± 0.01 |
| 0.1 | 0.46 ± 0.01 | 0.11 ± 0.01 | 0.68 ± 0.01 |
| 1.0 | 0.56 ± 0.01 | 0.10 ± 0.01 | 0.70 ± 0.01 |

Q1 & W1) We cited the paper that introduces the ModernBERT architecture as reference [25] in our paper. The classification layer on top of the ModernBERT transformer is described in detail in our supplementary material. The exact hyperparameter choice is based on a hyperparameter sweep and is also described in our supplementary material. We included a link to our code base in the supplementary material (Appendix B at the end). The code base will be made public if accepted.

W2) We divide this response into three parts. The first (A) addresses sparsely available feedback and the second (B) predictor convergence, and the third (C) the limitations of our work.

Part A: MESS+ can be directly extended to work with sparse feedback, i.e., when only a small number of users submit (complete) feedback. We show results when only 20% of users submit feedback (Table I in our response to Reviewer tW9L). We emulate this behavior by sampling from a uniform distribution over $[0, 1]$ with a threshold of 20%. We only update $Q$ whenever feedback is available. We observe that MESS+ shows strong performance even when feedback is only sparsely available. However, it takes longer to satisfy the SLA requirement, as $Q$ captures fewer SLA violations that might occur over time and stabilization of $Q$ takes longer. This also leads MESS+ to prefer cost-efficient models more than in scenarios with abundant feedback.
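A minimal sketch of this sparse-feedback extension (our code and naming; the 20% rate and the max-style queue update are illustrative assumptions mirroring the experiment above):

```python
import random

# Hedged sketch: the virtual queue Q is updated only on the ~20% of
# requests that carry a user satisfaction signal; otherwise it is left
# unchanged, so it simply captures fewer SLA violations over time.
def maybe_update_queue(Q, alpha, satisfied, feedback_rate=0.2):
    if random.random() < feedback_rate:   # user actually left feedback
        return max(Q + alpha - float(satisfied), 0.0)
    return Q                              # no feedback: leave Q unchanged
```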

Part B: Figure 3 in our paper shows that our predictor training quickly converges well within 100 incoming requests (avg. across 8 benchmarks), which, in practice, equals a very small cost. With $c$, we can control how often we want to update our predictor and account for the difficulty of a workload, i.e., the more difficult a workload, the more we want to train our predictor in the beginning.

Part C: Yes, we outline the limitations to offer a genuine avenue for future work. Detailed solutions to address these limitations are beyond the scope of this paper.

W4) We separate our response into three parts. The first (A) addresses the IIDness assumption, the second (B) discusses convergence, and the third (C) the trade-off.

Part A: We conducted additional experiments concatenating multiple benchmarks (ARC Challenge, PiQA, and Winogrande in a single experiment) to create a non-IID scenario ($\alpha = 0.67$). Results show a slightly larger gap between $\alpha$ and the final MESS+ accuracy (1.57% vs. 0.55% for individual benchmarks; see Table I in our response to Reviewer tW9L). The changing data distribution makes predictor training more challenging, causing increased training loss fluctuation (not presented here since we cannot include plots in the rebuttal). However, MESS+ still meets our SLA requirement and outperforms the Educated Guessing baseline. These findings will be included in our updated paper.

Part B: We underpin the effectiveness of MESS+ with our experiments evaluating a diverse set of benchmarks and varying $c$ values. Our experiments show that the predictor learns quickly (even under non-IID settings) and converges within a practical amount of time.

Part C: The experimental results on the tradeoff between exploration and exploitation have been presented and discussed in our response to Q3 above. Theoretically, $\psi$ (defined in Assumption 1) quantifies how long it takes to obtain a "good enough" predictor. As shown in Corollary 1 (first inequality, before big-O), when $\psi$ is large (which occurs when we explore less), it takes a longer time to meet the same level of SLA satisfaction. This is validated by the results in the last column of Table II in our response to Reviewer 9ALn. For the cost optimality, our upper bound on the total cost in Theorem 2 can be extended to $E^\textnormal{OPT} + \mathcal{O}\left(\frac{M\left(\frac{1}{c}+c\right)}{\sqrt[4]{T}} + \frac{1}{V} + F_\mathrm{min}\right)$ by writing out $c$ inside the big-O notation in our proof. Here, the term $\frac{1}{c}$ captures the reduction in exploitation cost when we have a better predictor by exploring more with a larger $c$, and the term $c$ captures the added cost due to exploration. This tradeoff has also been validated in the second and third columns of Table III.

W5 & Additional Suggestions) We have already included baselines that do random routing (“educated guessing”) and MLP-based routing (e.g., RouteLLM) to provide a thorough performance comparison. We consider average SLA satisfaction over time, which smooths out extreme cases. As such, the performance of MESS+ is not affected by individual failures and rare edge cases.

Limitations) We explicitly mention user satisfaction feedback as an avenue for future work to explore partial or delayed signals. This will also address scalability challenges and performance aspects of exploration but is beyond the scope of our work. MESS+ adapts to user preferences through online feedback collected via standard consent-based inference endpoints and voluntary feedback mechanisms. However, learning from a data distribution typically carries some bias and fairness challenges, such as favoring frequent users over infrequent ones. We will clarify these potential bias sources in our updated paper. Deployment risks go beyond the scope of our work.

Comment

I thank the authors for addressing all my raised points. I have increased my rating accordingly.

Comment

Many thanks to Reviewer 28jX for increasing the rating. We really appreciate it.

For the sake of completeness and just in case this comes up in later discussions, after discussing with and getting permission from the area chair, we have added the RouteLLM and RouterDC baseline results to the first two additional experiments that we provided in this rebuttal to Reviewer 28jX, where we were not able to gather the results during the one-week rebuttal period. The updated Tables I and II for our rebuttal to Reviewer 28jX are shown below. These more complete results underpin the effectiveness of MESS+ to minimize operating costs while maintaining SLA compliance over time.

Table I extended: Mixed zoo with Llama 3 and Qwen 2 models (now including RouteLLM and RouterDC)

| Category | Operating Cost | Request Satisfaction | Model Call Ratio (Q32B/L8B/L1B/Q0.5B) |
| --- | --- | --- | --- |
| Qwen2 0.5B | 0.15±0.01 | 54.12±45.15 | 0% / 0% / 0% / 100% |
| Llama 3.2 1B | 0.20±0.01 | 61.13±4.79 | 0% / 0% / 100% / 0% |
| Llama 3.1 8B | 0.50±0.01 | 67.07±4.61 | 0% / 100% / 0% / 0% |
| Qwen2.5 32B | 1.99±0.01 | 70.91±4.48 | 100% / 0% / 0% / 0% |
| Educated Guessing | 1.06±0.01 | 67.69±2.30 | 46% / 14% / 29% / 11% |
| RouteLLM | 1.79±0.01 | 68.93±2.32 | 83% / 0% / 0% / 17% |
| RouterDC | 1.31±0.01 | 69.17±2.25 | 72% / 11% / 10% / 7% |
| MESS+ (ours) | 0.98±0.01 | 67.47±3.28 | 42% / 27% / 17% / 14% |

Table II extended: Experiments on the Qwen 2 model zoo with a 2x squeezed cost ratio around the average inference cost per request (now including RouteLLM and RouterDC)

| Category | Operating Cost | Request Satisfaction | Model Call Ratio (Q32B/Q7B/Q1.5B/Q0.5B) |
| --- | --- | --- | --- |
| Qwen2 0.5B | 0.15±0.01 | 54.12±45.15 | 0% / 0% / 0% / 100% |
| Qwen2 1.5B | 0.20±0.01 | 61.13±4.79 | 0% / 0% / 100% / 0% |
| Qwen2 7B | 0.50±0.01 | 67.07±4.61 | 0% / 100% / 0% / 0% |
| Qwen2.5 32B | 1.99±0.01 | 70.91±4.48 | 100% / 0% / 0% / 0% |
| Educated Guessing | 0.59±0.01 | 67.36±2.46 | 45% / 29% / 13% / 13% |
| RouteLLM | 1.71±0.01 | 69.06±2.32 | 83% / 0% / 0% / 17% |
| RouterDC | 1.05±0.01 | 70.73±2.30 | 71% / 17% / 11% / 0% |
| MESS+ (ours) | 0.33±0.01 | 66.36±3.68 | 33% / 29% / 15% / 24% |

We will include all the additional results in the next version of this paper. Thank you again!

Review (Rating: 4)

As Gen AI models, including LLMs, continue to get commoditized, leveraging a plurality of models helps meet varying user/customer performance needs as well as minimize resource utilization. This paper addresses the problem of how to select the best LLM per user query by casting satisfaction of service-level agreements by model hosts and optimal resource utilization of a given query as competing objectives of an online optimization problem. The authors provide both theoretical guarantees (Section 3) as well as empirical evaluation across 8 benchmark datasets and 3 model sizes of the Llama family. Results show that the proposed approach outperforms related approaches on benchmark datasets and selected models.

Strengths and Weaknesses

Strengths:

  1. A novel approach, supported by a combination of theoretical analysis and experimental results. The theoretical guarantees distinguish this work from related approaches.
  2. The paper writing is good, providing a good explanation of results via separate cost analysis, model call ratio, approach overhead costs, user satisfaction prediction.
  3. Good coverage of competing approaches and their use in comparative evaluation.
  4. Appreciate the clear callout of contributions at the end of section 1: many submissions miss out on this helpful practice that supports clarity.
  5. Similarly, Table 1 provides a good overview, as well as a relative comparison of baseline approaches.
  6. Publicly available code supports reproducibility.

Weaknesses:
  7. Outside of benchmarks, where ground-truth labels support model calibration, it is unclear how to learn the relative ability of the various models in the model zoo.
  8. Solution has been tested with only 3 models from the same model family.
  9. Ablation study, quantifying the relative contributions of satisfaction predictor and the quality of the online optimization, for example, is missing.

Questions

  1. How would the proposed approach deal with non-stationary and open-ended task compilations?
  2. How would the approach fare in the presence of task-specific, fine-tuned models?
  3. Table 2 is too dense with tiny fonts. Is there an alternate way to present these results?
  4. Might it be better to refer to the approach as "inference-time or deployment-time", instead of "test-time?" That would also help with the clarity.

Limitations

NA

Formatting Issues

NA

Author Response

Thank you for your detailed review of our manuscript and your positive feedback! In the following, we will first address the weaknesses (in the original numbering in the review) and then your questions.

W7) It is challenging to obtain ground truth, but our approach relies on a predictor that estimates the probability that a given model can satisfy a user request. In practice, we learn our predictor from online user feedback (a binary signal, such as a thumbs up or down as frequently seen in chat interfaces). These events are collected over time and used to train the user satisfaction predictor that estimates the ability of each model.
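As a concrete illustration of this online learning step, here is a minimal sketch (our code, not the authors'): it fine-tunes a ModernBERT classifier on binary thumbs-up/down labels. The checkpoint name, the model-name prefix, and the single-example update are illustrative assumptions; the authors' exact classifier design is described in their supplementary material.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hedged sketch: learn a satisfaction predictor from binary user feedback.
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
predictor = AutoModelForSequenceClassification.from_pretrained(
    "answerdotai/ModernBERT-base", num_labels=2)  # unsatisfied / satisfied
optimizer = torch.optim.AdamW(predictor.parameters(), lr=2e-5)

def online_update(prompt: str, model_name: str, satisfied: bool) -> None:
    # Condition on which model served the request, e.g. via a text prefix.
    batch = tokenizer(f"[{model_name}] {prompt}",
                      return_tensors="pt", truncation=True)
    labels = torch.tensor([int(satisfied)])
    loss = predictor(**batch, labels=labels).loss  # cross-entropy on the label
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```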

W8) We have conducted additional experiments on a larger model zoo. Our results on the Qwen 2 model family (0.5B, 1.5B, 7B, 32B) also demonstrate the strong performance of MESS+ compared to the baselines in our paper (Table I below). We will include detailed results in our updated paper.

Table I: Additional experiments on a larger zoo with 4 Qwen 2 models - Avg. across 8 benchmarks ($\alpha = 0.67$)

| Category | Operating Cost | Request Satisfaction | Model Call Ratio (Q32B/Q7B/Q1.5B/Q0.5B) |
| --- | --- | --- | --- |
| Qwen2 0.5B | 0.12±0.01 | 54.12±45.15 | 0% / 0% / 0% / 100% |
| Qwen2 1.5B | 0.16±0.01 | 61.13±4.79 | 0% / 0% / 100% / 0% |
| Qwen2 7B | 0.40±0.01 | 67.07±4.61 | 0% / 100% / 0% / 0% |
| Qwen2.5 32B | 1.60±0.01 | 70.91±4.48 | 100% / 0% / 0% / 0% |
| Educated Guessing | 0.99±0.01 | 67.02±2.32 | 53% / 26% / 11% / 10% |
| RouteLLM | 1.37±0.01 | 69.01±2.31 | 83% / 0% / 0% / 17% |
| RouterDC | 1.13±0.01 | 69.17±2.47 | 63% / 17% / 9% / 11% |
| MESS+ (ours) | 0.84±0.01 | 67.55±3.23 | 48% / 26% / 10% / 16% |

W9) MESS+ only works when using both the predictor and the online optimization together. Without the predictor, the online optimization cannot work, because the online optimization objective (3a) requires $\hat{s}_{m,t}$, which is provided by the predictor. If we naively modify (3a) to remove the second term altogether, the algorithm would always choose the smallest model that gives the lowest cost, which obviously will not give satisfactory performance. If we do not do the online optimization, the current algorithm does not use the predictor in any other step, so the predictor would be redundant in that case and no model choice would be made. In other words, the online optimization in (3) is the only step that decides which model to choose, and it requires the predictor output as its input.
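For intuition, the per-request decision described here can be paraphrased as follows (our notation: $e_{m,t}$ is the operating cost of model $m$ at time $t$, $Q_t$ the virtual queue, $V$ the cost weight; the exact formulation is Equation (3) in the paper):

$$\min_{m \in \mathcal{M}} \; V\, e_{m,t} - Q_t\, \hat{s}_{m,t}$$

A large backlog $Q_t$ of accumulated SLA violations shifts the choice toward models with higher predicted satisfaction $\hat{s}_{m,t}$, while a large $V$ shifts it toward cheaper models.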

Q1) We have conducted additional experiments where we concatenate multiple benchmarks in a single experiment (ARC Challenge, PiQA, Winogrande). In doing so, we create a non-IID scenario ($\alpha = 0.67$). The results are in Table II below. We observe a slightly larger gap between $\alpha$ and the final MESS+ accuracy than what we see for the individual benchmarks (1.57% vs. 0.55%). Due to the change in data distribution over time, training the predictor becomes more challenging, and we have observed slightly increased training loss fluctuation compared to the individual benchmarks in our experiments (not presented here since we cannot include plots in the rebuttal), especially at the beginning of each new benchmark. Nonetheless, MESS+ still satisfies our SLA requirement and outperforms our Educated Guessing baseline. We will include these insights in our updated paper.

Table II: Additional experiments with non-IID requests - 3 benchmarks concatenated (ARC Challenge, PiQA, Winogrande) ($\alpha = 0.67$)

| Category | Operating Cost | Request Satisfaction | Model Call Ratio (Q32B/Q7B/Q1.5B/Q0.5B) |
| --- | --- | --- | --- |
| Qwen2 0.5B | 0.26±0.01 | 54.50±49.80 | 0% / 0% / 0% / 100% |
| Qwen2 1.5B | 0.35±0.01 | 62.92±5.37 | 0% / 0% / 100% / 0% |
| Qwen2 7B | 0.93±0.01 | 69.35±5.12 | 0% / 100% / 0% / 0% |
| Qwen2.5 32B | 3.12±0.01 | 71.94±4.99 | 100% / 0% / 0% / 0% |
| Educated Guessing | 2.33±0.01 | 71.44±2.16 | 70% / 11% / 9% / 10% |
| RouteLLM | 2.96±0.01 | 72.38±3.29 | 97% / 0% / 0% / 3% |
| RouterDC | 2.24±0.01 | 72.36±2.46 | 68% / 15% / 4% / 13% |
| MESS+ (ours) | 2.07±0.01 | 68.57±2.28 | 66% / 19% / 8% / 7% |

Q2) MESS+ requires the models in a zoo to have varying cost characteristics and be capable of addressing the same or overlapping tasks. When multiple models have overlapping expertise for specific tasks, regardless of whether they have been fine-tuned or not, MESS+ can effectively learn to identify the most adequate model with minimum operating cost by solving the online decision making problem (3). This is orthogonal to the problem of selecting from disjoint expert models that handle separate, non-overlapping tasks.

Q3) We will make use of the extra page for a potential camera ready version to improve the readability of Table 2.

Q4) We are happy to change the title to “inference-time”.

Review (Rating: 5)

This paper presents MESS+, a novel algorithm designed to automatically and efficiently select the best Large Language Model (LLM) from a "zoo" of different models. The core challenge it addresses is balancing the competing interests of end-users, who want high-quality responses, and service providers, who want to minimize operating costs.

MESS+ formulates this challenge as a stochastic optimization problem. It works by learning in real-time, as users interact with the system, to predict the probability that any given LLM in the zoo can satisfy an incoming request. For each request, the algorithm then chooses the most cost-effective model that is likely to meet a predefined Service Level Agreement (SLA), which guarantees a minimum rate of user satisfaction over time.

The authors provide a theoretical analysis of the algorithm's performance and demonstrate its effectiveness through experiments. Results show that, on average, MESS+ can reduce operating costs by a factor of two compared to existing LLM routing techniques while successfully upholding the required service quality.

Strengths and Weaknesses

Strengths:

  • The routing problem setting with SLA constraint is novel and practical.
  • The proposed approach is grounded with solid theoretical analysis.
  • The main experimental results demonstrate the effectiveness of MESS+.

Weakness:

  • As acknowledged by the authors, the method requires user satisfaction labels on the fly, which is usually not a practical assumption.
  • Please see questions section.

Questions

  • Why is user satisfaction limited to a binary value? For more open-ended queries, a binary value may not capture user feedback precisely.
  • In Table 2, how are the values of $\alpha$ determined? Does MESS+ ensure satisfaction of the SLA at all different values of $\alpha$ beyond these 3 picked points? Also, does MESS+ achieve the lowest cost among all competing methods at every point?
  • As RouteLLM and RouterDC are not designed for SLA constraints, how are their operating points selected given a specific $\alpha$?

Limitations

Yes, limitations are sufficiently discussed in the conclusion section.

Final Justification

The authors sufficiently addressed my concerns, so I raised my score to 5.

W1: Additional experiments make sense. Q1: "We focus on binary feedback in our work because requiring users to score responses on scales such as Likert scales demands more effort and could produce even sparser feedback than simple binary signals" — I agree that binary feedback is a more practical choice, and I believe the framework can be extended to non-binary cases (adding additional experiments is beyond this paper's scope). Q2: Exploration of varying $\alpha$ is in the supplement. Q3: The authors promise to highlight the distinct objectives of RouteLLM, RouterDC, and MESS+ in the next version of their paper.

Formatting Issues

N/A

Author Response

Thank you for your positive feedback and the thorough review of our paper! We will first address the weaknesses and then your questions.

W1) When deploying inference services, monitoring fitness for purpose is an integral part of a reliable and robust pipeline. A large part typically consists of capturing SLA compliance to ensure the client receives the service they are paying for. In applications with such SLA requirements, there needs to be a mechanism to log whether the AI model is able to successfully process user requests. If unsuccessful, the same request would have to be processed by other means (e.g., by a human). Therefore, such satisfaction logs would naturally exist in real systems, so we believe it is reasonable to assume that a user satisfaction indicator is available after we get the response from the model. In addition, our MESS+ algorithm can be extended to scenarios where feedback is only sparsely available, e.g., feedback is collected only on a uniformly randomly sampled subset of requests, and we only update the virtual queue when such feedback is available. Statistically, because such sampling is done in an IID manner, the expected SLA satisfaction would be the same as the true SLA satisfaction over all requests. We conduct an extra experiment to show this (Table I below). We sample from a uniform distribution over $[0, 1]$ and set the threshold to 20% (the maximum relative number of requests in an experiment that will be used for updating $Q$). We only update $Q$ whenever feedback is available. As can be seen, even under sparse feedback, MESS+ is still able to satisfy the SLA requirement but relies more on cost-efficient models. This is because $Q$ captures fewer SLA violations that might occur over time and stabilization of $Q$ takes longer.

Table I: Additional results when receiving sparse user feedback - Mean over 8 benchmarks ($\alpha = 0.67$, 20% user participation)

| Category | Operating Cost | Request Satisfaction | Model Call Ratio (Q32B/Q7B/Q1.5B/Q0.5B) |
| --- | --- | --- | --- |
| Qwen2 0.5B | 0.12±0.01 | 54.12±45.15 | 0% / 0% / 0% / 100% |
| Qwen2 1.5B | 0.16±0.01 | 61.13±4.79 | 0% / 0% / 100% / 0% |
| Qwen2 7B | 0.40±0.01 | 67.07±4.61 | 0% / 100% / 0% / 0% |
| Qwen2.5 32B | 1.60±0.01 | 70.91±4.48 | 100% / 0% / 0% / 0% |
| Educated Guessing | 0.99±0.01 | 67.02±2.32 | 53% / 26% / 11% / 10% |
| RouteLLM | 1.37±0.01 | 69.01±2.31 | 83% / 0% / 0% / 17% |
| RouterDC | 1.13±0.01 | 69.17±2.47 | 63% / 17% / 9% / 11% |
| MESS+ (ours) | 0.82±0.01 | 67.47±3.35 | 49% / 18% / 14% / 19% |

Q1) Binary feedback is the most challenging data to learn from, as the signals are very coarse. Practical applications (e.g., chat applications like ChatGPT) usually have a small thumbs-up/-down button next to each generated response. Our predictor takes such feedback as an input. We focus on binary feedback in our work because requiring users to score responses on scales such as Likert scales demands more effort and could produce even sparser feedback than simple binary signals. However, our MESS+ algorithm can be directly extended to operate on more granular feedback signals, and the procedure remains the same, with the only difference that the user satisfaction $s_{m,t}$ would be a non-binary value.

Q2) The parameter $\alpha$ serves as an input to the MESS+ algorithm. The algorithm guarantees SLA compliance for any $\alpha$ value that falls within the performance range of the available model zoo. In our experimental setup, we selected $\alpha$ values positioned within the model zoo's capabilities. While the selection of $V$ influences the convergence time to SLA compliance, our theoretical analysis demonstrates that the algorithm will eventually achieve the target $\alpha$ regardless of the initial conditions. Additional experimental results exploring different $\alpha$ values can be found in the supplementary material (Figures C.2 and C.3).

Q3) For the RouteLLM baseline, we tuned the decision threshold to minimize the cost required to meet $\alpha$ (Appendix B in the supplementary material, Table 2); for RouterDC that requires training a small encoder model, we performed this additional training step on the training split of our benchmark tasks based on the recommended procedure in the RouterDC paper. Even after these adjustments, we still observe that both RouteLLM and RouterDC have a tendency of preferring the largest model, especially in the beginning, whereas MESS+ takes a more conservative and cost-focused approach. This difference is clearly rooted in the different objectives of routing approaches (max. response quality vs. appropriate response quality under a given SLA). We will better highlight the distinct objectives of RouteLLM, RouterDC, and MESS+ in the next version of our paper. When we choose a smaller $V$, MESS+ becomes more likely to also prefer the largest model in the zoo, and the cost gap towards the routing baselines becomes smaller. Our appendix includes additional experiments on various SLA requirements for a more thorough overview (Figures C.2 and C.3).

Comment

I have read the rebuttal and the authors addressed all my concerns.

Review (Rating: 5)

This paper introduces MESS+ (Model Selection with Cost-optimal Service-level Guarantees), a per-request optimization algorithm to select an LLM from a set of models at inference time, with the objective of minimizing cost while meeting a given service-level agreement. The algorithm dynamically learns the probability of an LLM satisfying a user request in real-time and uses this information to make cost-optimal model selection decisions. MESS+ does so by using virtual queues to track performance over time and then choosing the best model by solving the resulting optimization problem. Experimental results using various LLM benchmarks show that MESS+ can provide an average of 2x cost savings compared to existing LLM routing techniques, achieving the targeted request satisfaction rates with minimal overshoot.

Strengths and Weaknesses

Strengths:

  1. Clear motivation and context: The motivation of the paper is clearly stated and easy to follow. The problem is well-defined, and the authors clearly explain how their approach differs from prior work, especially in terms of the tradeoffs it considers.

  2. Strong theoretical foundation: The algorithmic framework is grounded in stochastic control theory. The mathematical formulation provides a clear rationale for the proposed algorithm. The system offers theoretical guarantees for meeting user-defined constraints or SLAs, making it a practical option in real-world settings where users require a minimum performance level while minimizing cost.

  3. Careful evaluation: The system is evaluated on a diverse set of benchmarks. Results consistently show that it can reduce operating costs while maintaining target performance. Notably, it does so without relying on explicit routing preference datasets, thanks to the use of an online learning algorithm. Furthermore, the authors conduct ablations on key hyperparameters (α and V), providing insight into their influence. They also account for the cost overhead introduced by the router/predictor, which strengthens the evaluation.

Weaknesses:

  1. Strong assumption on model feedback: MESS+ assumes access to feedback from all models in the zoo to train the predictor. Although the authors acknowledge this as a limitation, it remains a significant practical concern. In many real-world scenarios, such feedback may not be available, especially in the absence of a perfect automatic verifier. Despite this, the benchmarks assume full access to high-quality feedback, which may not generalize to more constrained or noisy settings.

  2. Unfair comparison with baselines: MESS+ is benchmarked under fixed SLA constraints, while baselines are allowed to exceed these constraints. This leads to a comparison where baselines achieve higher satisfaction rates, which is interpreted solely as inefficiency. However, this could also be the result of MESS+ choosing a more favorable point in the quality-cost tradeoff space. I understand that arriving at this tradeoff point is actually a strength of MESS+, but still this might be unfair to competing approaches. Furthermore, this makes it difficult to isolate the true benefits of MESS+ under realistic deployment tradeoffs. One possibility would be to pick as target the request satisfaction given by competing approaches and then measure the reduction in cost achieved by MESS+ for this target.

  3. Limited model diversity: The evaluation uses a model zoo of only three models, with significantly different sizes (1B, 8B, 70B). This setup likely makes routing easier, as the models are highly separable in quality and cost. In real-world deployments, model zoos may include many more models with subtler differences. It is unclear how well MESS+ would perform in such cases, where the quality-cost tradeoffs are less clear-cut.

  4. Unclear hyperparameter choices: The paper introduces two hyperparameters that users would have to define / learn for their specific routing use case, and how they might choose them is not immediately clear.

问题

  1. How would the cost and quality results change if MESS+ aimed for a quality level comparable to what baselines actually achieve, rather than a seemingly arbitrary alpha that other systems don’t accept? Would the cost margins shrink significantly?

  2. Initial exploration cost is quite high when all models in a zoo are being queried to learn a predictor model. Could the exploration phase be made more efficient by warm-starting the predictor with benchmark performance data for similar tasks? Alternatively, could a larger or pretrained model with better priors be used to reduce exploration cost and improve early performance?

  3. The current method assumes access to full feedback from all models in the zoo. This assumption is not realistic in many practical settings. How does the system perform when only partial feedback is available during the exploration phase, e.g., querying only a subset of models per input? Can the learning algorithm be adapted to handle such incomplete supervision robustly?

  4. The system is currently tested on a zoo with just three highly distinct model sizes (1B, 8B, 70B). How does performance (cost, routing accuracy, SLA satisfaction) scale as the zoo expands to include more models, especially those that are closer in performance or from different families? Are there diminishing returns or increases in misrouting as model separation becomes less obvious?

  5. How do you choose hyperparameters α and V for real-world deployments?

Limitations

Yes

Formatting Issues

None

Author Response

Thank you for your thorough review of our paper and the positive feedback! We first respond to the weaknesses and then to your questions. We combine answers where appropriate.

W1 & Q3) We appreciate your observation and have explicitly acknowledged this limitation in our paper. The primary goal of our work is to propose a theoretically grounded routing method that can learn from online feedback. It is important to note that we do not assume access to high-quality or rich feedback. Instead, our preference estimation mechanism relies solely on binary signals, such as the thumbs-up or thumbs-down feedback provided by users interacting with LLMs through chat interfaces. Nevertheless, we agree that collecting feedback across multiple models can be impractical in some real-world scenarios. In principle, MESS+ could be adapted to handle partial feedback. Since our objective function requires a signal for the request satisfaction likelihood and the cost for an incoming request, partial feedback could be supported by swapping out our current classifier-based request satisfaction predictor for a technique that can learn from incomplete or delayed feedback. A detailed study on partial feedback is beyond the scope of this paper and thus left for future work, where our approach can serve as a well-founded starting point for such studies.

W2 & Q1) It is true that the baselines are allowed to exceed $\alpha$ (our SLA requirement), but we made sure to choose a configuration that is as close as possible to our problem setup. For the RouteLLM baseline, we tuned the decision threshold to minimize the cost required to meet $\alpha$ (Appendix B in the supplementary material, Table 2); for RouterDC that requires training a small encoder model, we performed this additional training step on the training split of our benchmark tasks based on the recommended procedure in the RouterDC paper. Even after these adjustments, we still observe that both RouteLLM and RouterDC have a tendency of preferring the largest model, especially in the beginning, whereas MESS+ takes a more conservative and cost-focused approach. This difference is clearly rooted in the different objectives of routing approaches (max. response quality vs. appropriate response quality under a given SLA). We will better highlight the distinct objectives of RouteLLM, RouterDC, and MESS+ in the next version of our paper. When we choose a smaller $V$, MESS+ becomes more likely to also prefer the largest model in the zoo, and the cost gap towards the routing baselines becomes smaller. Our appendix includes additional experiments on various SLA requirements for a more thorough overview (Figures C.2 and C.3).

W3) We conducted additional experiments on a larger model zoo with 4 models from the Qwen 2 family, with a smaller parameter range (0.5B, 1.5B, 7B, 32B). Below, we report the average performance of MESS+ across the 8 benchmarks in our paper with $\alpha = 0.67$. In the larger zoo, MESS+ outperforms the baselines (Table I below). This demonstrates that our approach can effectively route larger zoos with models that have less distinct energy characteristics. We will include the detailed additional results in our updated paper.

Table I: Additional experiments on a larger zoo with 4 Qwen 2 models - Avg. across 8 benchmarks ($\alpha = 0.67$)

| Category | Operating Cost | Request Satisfaction | Model Call Ratio (Q32B/Q7B/Q1.5B/Q0.5B) |
| --- | --- | --- | --- |
| Qwen2 0.5B | 0.12±0.01 | 54.12±45.15 | 0% / 0% / 0% / 100% |
| Qwen2 1.5B | 0.16±0.01 | 61.13±4.79 | 0% / 0% / 100% / 0% |
| Qwen2 7B | 0.40±0.01 | 67.07±4.61 | 0% / 100% / 0% / 0% |
| Qwen2.5 32B | 1.60±0.01 | 70.91±4.48 | 100% / 0% / 0% / 0% |
| Educated Guessing | 0.99±0.01 | 67.02±2.32 | 53% / 26% / 11% / 10% |
| RouteLLM | 1.37±0.01 | 69.01±2.31 | 83% / 0% / 0% / 17% |
| RouterDC | 1.13±0.01 | 69.17±2.47 | 63% / 17% / 9% / 11% |
| MESS+ (ours) | 0.84±0.01 | 67.55±3.23 | 48% / 26% / 10% / 16% |

W4 & Q5) In general, $\alpha$ should be between the highest and lowest satisfaction rates of all the models in the zoo. To determine $\alpha$ in practice, one can observe how well individual models serve user requests (e.g., based on chat feedback through a thumbs-up/-down mechanism), which can serve as a basis for estimating the performance of models. From an algorithmic perspective, $\alpha$ is given as an input that specifies the SLA requirement, so we may tune other parameters, such as $V$, to meet $\alpha$ after processing some pre-specified number of requests.

The parameter $V$ serves as a control parameter for scaling (or a way of normalizing) the cost to the same (or similar) order of magnitude as $\alpha$. While having a large $V$ is generally possible, it would take an impractically long time for the average request satisfaction to converge to $\alpha$. A good starting point for tuning $V$ is to set $V = 10^{-m}$, where $m$ is the order of magnitude of the cost per request of the largest model.
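To make that starting-point heuristic concrete, a tiny sketch (our code; assumes the per-request cost is a positive scalar):

```python
import math

# Hedged sketch of the heuristic above: V = 10^(-m), where m is the order
# of magnitude of the per-request cost of the largest model in the zoo.
def initial_V(largest_model_cost: float) -> float:
    m = math.floor(math.log10(largest_model_cost))
    return 10.0 ** (-m)

# Example: a largest-model cost of ~1.99 per request gives m = 0, V = 1.0.
```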

We will make this selection process more clear in our updated paper.

Q2) Yes, pre-training the predictor on data for similar tasks and establishing strong priors can reduce the exploration cost at the start. Note, however, that a benefit of MESS+ is that it can adapt over time, i.e., some exploration over time is desirable to dynamically adapt to changing user requirements.

Final Decision

Paper summary

The paper presents MESS+, a framework to select an appropriate LLM to call for each given input query. MESS+'s formulation, like many existing model routing works, seeks to minimize the average cost, subject to a constraint on the average quality (expressed as a service level agreement, or SLA, in this work). What makes MESS+ unique is that it (Algorithm 1) allows learning to be done entirely in an online fashion, through stochastic optimization.

Reviews

Overall, the paper received positive feedback. The reviewers noted the novelty of MESS+ [9ALn] with strong theoretical guarantees [Cuig, tW9L, 9ALn] and strong empirical results [tW9L]. Sticky points mentioned by multiple reviewers are:

  1. [Cuig, tW9L, 9ALn] Impractical feedback assumption. Specifically, MESS+ requires online feedback on user satisfaction for all models in the zoo to train its predictor.

  2. [Cuig, 9ALn] Limited model diversity. Model sizes are too separate (i.e., 1B, 8B, 70B) and so are easy to route to. Also, in experiments, MESS+ is tested on only three models from the same family.

  3. [Cuig, tW9L] Potentially unfair comparison to baselines. Specifically, baselines do not have SLA guarantees like MESS+ does.

The authors submitted a strong rebuttal that sufficiently addressed the above concerns 2) and 3), convincing some reviewers to adjust ratings more positively in the process. In the AC’s view, concern 1) is already explicitly mentioned as a limitation (in the conclusion section). Also, existing model routing works tend to consider the “batch/offline” setting where one trains a routing model on an offline training set. Each example in the training set consists of an input query, and ground-truth correctness labels from all LLMs in the pool (or at least pair-wise comparison labels from a selected subset of LLMs). So, in terms of the assumption on feedback, existing works also require the same assumption. Thus, concern 1) above is not specific to the present work. A key distinguishing feature of this work is that the proposed MESS+ can learn to route queries in an online fashion. This is one of the few works that considers this online setting.