PaperHub

Score: 6.0/10 · Poster · 4 reviewers (ratings: 3, 4, 3, 5; min 3, max 5, std 0.8)
Confidence: 3.3
Novelty: 2.5 · Quality: 2.5 · Clarity: 3.0 · Significance: 2.5
NeurIPS 2025

Efficient Training-Free Online Routing for High-Volume Multi-LLM Serving

OpenReview · PDF
Submitted: 2025-05-09 · Updated: 2025-10-29
TL;DR

We propose the first efficient, training-free online routing algorithm for high-volume LLM serving under token budget constraints, achieving significant improvements in both routing performance and cost efficiency.

Abstract

Increasing demand for Large Language Model (LLM) services imposes substantial deployment and computation costs on providers. LLM routing offers a cost-efficient solution by directing queries to the optimal LLM based on model and query features. However, existing works primarily focus on offline scenarios and struggle to adapt to online settings with high query volume and constrained token budgets. In this work, we introduce the first training-free algorithm for online routing scenarios. Our algorithm leverages approximate nearest neighbor search to efficiently estimate the features of queries and performs a one-time optimization over a small set of initial queries to learn a set of routing weights that guide future routing. We provide a theoretical guarantee that the algorithm achieves a competitive ratio of $1 - o(1)$ under natural assumptions, which is further validated by extensive experiments across 3 benchmark datasets and 8 baselines, showing an average improvement of 3.55$\times$ in performance, 1.85$\times$ in cost efficiency, and nearly 4.25$\times$ in throughput. Our code is available at https://github.com/fzwark/PORT.
Keywords
Large Language Models, LLM Routing, Cost-Aware Inference, Online Algorithm, Multi-LLM Serving

Reviews and Discussion

Review (Rating: 3)

This paper introduces a training-free routing strategy that leverages a log of historical prompts and their associated generations, together with their quality scores and costs. Specifically, it performs semantic-embedding-based nearest neighbor search to retrieve similar past queries and estimates each model's performance by averaging the recorded measures. It then combines the estimated quality and cost of each model using learned and fixed weights to pick the final model.
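In code, the retrieve-and-score loop described above might look like the following sketch (illustrative only: the index is assumed to expose an hnswlib-style `knn_query`, and the scalar `weight` stands in for the paper's learned routing weights):

```python
import numpy as np

def route_query(query_emb, index, hist_perf, hist_cost, weight, k=10):
    """Illustrative sketch of the retrieve-and-score rule (not the
    paper's exact formulation).

    index     -- an ANN index over historical prompt embeddings
    hist_perf -- (n_hist, n_models) recorded quality scores
    hist_cost -- (n_hist, n_models) recorded generation costs
    weight    -- scalar trading off estimated quality against cost
    """
    # Retrieve the k most similar past queries.
    neighbor_ids, _ = index.knn_query(query_emb, k=k)
    nbrs = neighbor_ids[0]

    # Estimate per-model quality and cost by averaging neighbor records.
    est_perf = hist_perf[nbrs].mean(axis=0)
    est_cost = hist_cost[nbrs].mean(axis=0)

    # Combine the estimates and pick the highest-scoring model.
    return int(np.argmax(est_perf - weight * est_cost))
```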

Strengths and Weaknesses

Strengths:

  • Comprehensive evaluation on three benchmarks.
  • Extensive tests over query volume, arrival order, size of historical data, size of search candidates, etc.
  • Code and data are released for reproducibility.

Weaknesses:

  • The effectiveness of the proposed method depends on reliable quality labels. The experiments mainly use pre-existing ground truth and lack robustness evaluation against noisy quality labels.
  • The baseline methods are somewhat weak. Some comparable works are discussed in the related work section but are not empirically compared in the evaluation section.
  • The theoretical proof relies on strong assumptions, including the existence of at least one close neighbor in the log and that generation quality and cost are smoothly tied to prompt embeddings. Similarly, the evaluation is mainly conducted in an in-distribution setting, and it is unclear whether the proposed method is robust enough to handle out-of-distribution challenges and drifting query patterns in dynamic online settings.

Questions

What is the query-level latency? How does it compare to other routing strategies including trained lightweight routers?

Limitations

yes

Final Justification

While I appreciate the additional results and explanations provided by the authors during the discussion period, I believe the paper needs another round of revision to consolidate the extra supporting experiments and strengthen the associated discussions. Therefore, I’m leaning toward a borderline rating.

Formatting Concerns

no

Author Response

Q1: The experiments mainly use pre-existing ground truth and lack robustness evaluation against noisy quality labels.

A: Thank you for your comments. To demonstrate the robustness of our algorithm under noisy historical labels, we conduct experiments in a noisy setting where two types of strong noise are introduced simultaneously. Specifically, for each performance label $d_{ij}$, we randomly flip its value with a probability of 20%. For each cost label $g_{ij}$, we apply multiplicative, mean-preserving noise consisting of two components: (i) log-normal jitter $\left(\log \mathcal{N}(0, \sigma^2) - \frac{1}{2}\sigma^2\right)$ with $\sigma = 0.25$, and (ii) a spike factor of 3.0 applied with probability 2%.
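A minimal sketch of this noise model, as we read it (function names and the `rng` seed are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def add_label_noise(perf, cost, flip_p=0.20, sigma=0.25, spike=3.0, spike_p=0.02):
    """Sketch of the noise injection described above (our reading).

    perf -- binary performance labels d_ij; flipped with probability flip_p
    cost -- cost labels g_ij; scaled by mean-preserving log-normal jitter
            exp(N(0, sigma^2) - sigma^2 / 2), plus a rare spike factor
    """
    # Flip each performance label with probability flip_p.
    flip = rng.random(perf.shape) < flip_p
    noisy_perf = np.where(flip, 1.0 - perf, perf)

    # Multiplicative jitter with unit mean, plus occasional 3x spikes.
    jitter = np.exp(rng.normal(0.0, sigma, cost.shape) - 0.5 * sigma**2)
    spikes = np.where(rng.random(cost.shape) < spike_p, spike, 1.0)
    noisy_cost = cost * jitter * spikes
    return noisy_perf, noisy_cost
```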

We compare our algorithm against all baseline methods on RouterBench under the main experimental setting. As shown in the results below, our method consistently outperforms all baselines under noisy historical labels. It achieves the highest overall performance, the highest throughput, and the best cost efficiency, demonstrating the strong robustness of our algorithm under noisy conditions.

Table 1. Routing performance under noisy historical data. Comparison between our algorithm and baselines on RouterBench.

| Algorithm | Performance | Cost | Performance per Cost | Throughput |
|---|---|---|---|---|
| Random | 1403.25 | 0.419 | 3350.43 | 3227 |
| Greedy-Perf | 1323.60 | 0.370 | 3573.75 | 2471 |
| Greedy-Cost | 1660.20 | 0.461 | 3604.72 | 4047 |
| KNN-Perf | 1323.95 | 0.369 | 3585.39 | 2472 |
| KNN-Cost | 1660.20 | 0.461 | 3604.72 | 4047 |
| BatchSplit | 1757.25 | 0.456 | 3856.68 | 3974 |
| Roberta-Perf | 83.00 | 0.045 | 1836.33 | 84 |
| Roberta-Cost | 358.00 | 0.075 | 4787.593 | 1616 |
| Ours | 2319.65 | 0.444 | 5221.28 | 4775 |

Q2: The baseline methods are somewhat weak.

A: Thank you for your suggestions. To the best of our knowledge, our work is the first to focus specifically on training-free online multi-LLM routing, and no existing methods directly target the same setting. To provide further meaningful comparisons, we additionally adapt three local routing baselines from two related works: RouteLLM (MF) and RouteLLM (BERT) from [1], and CARROT (RoBERTa) from [2]. [1] is a well-known two-model routing approach, while [2] represents the current state of the art in offline multi-LLM routing, aiming to balance cost and performance. Since [1] only handles two-model routing, we extend it to the multi-model case by training multi-label predictors for win rates and applying a cheapest-feasible-model strategy: routing each query to the cheapest model whose predicted win-rate logit exceeds a threshold $\alpha$. Furthermore, as all three baselines are originally intended for local or offline routing, we adapt them to our online setting. We use 20% of the historical data to search the threshold $\alpha$ (for RouteLLM) and the cost-efficiency coefficient $\mu$ (for CARROT) over the range [0, 1] with a step size of 0.1, maximizing performance under a fixed budget.
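A minimal sketch of this cheapest-feasible-model rule; the fallback branch for when no model clears the threshold is our assumption:

```python
import numpy as np

def cheapest_feasible(win_logits, prices, alpha):
    """Sketch of the cheapest-feasible-model extension described above.

    win_logits -- (n_models,) predicted win-rate logits for the query
    prices     -- (n_models,) price of serving the query on each model
    alpha      -- feasibility threshold tuned on held-out historical data
    """
    feasible = np.flatnonzero(win_logits >= alpha)
    if feasible.size == 0:
        # Fallback (our assumption): no model clears the threshold,
        # so route to the model with the highest predicted win rate.
        return int(np.argmax(win_logits))
    # Among feasible models, pick the cheapest one.
    return int(feasible[np.argmin(prices[feasible])])
```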

We present comparisons on the most representative benchmark, RouterBench, under the main experimental setting. Our algorithm outperforms all three additional baselines across all key metrics while maintaining the lowest routing latency. Notably, it achieves over 1.5× higher throughput than all baselines and a total routing latency of just 0.64 seconds, over 200× faster than the state-of-the-art CARROT method:

Table 2. Comparison results with 3 additional baselines on RouterBench

| Algorithm | Performance | Cost | Performance per Cost | Throughput | Total Routing Latency (s) |
|---|---|---|---|---|---|
| Multi-RouteLLM (MF), $\alpha=0.6$ | 2041 | 0.35 | 5883.61 | 3525 | 3.66 |
| Multi-RouteLLM (BERT), $\alpha=0.4$ | 1764.35 | 0.31 | 5753.82 | 3246 | 135.94 |
| CARROT (RoBERTa), $\mu=0.2$ | 1353.2 | 0.23 | 5987.61 | 2737 | 255.35 |
| Ours | 2718.6 | 0.447 | 6075.58 | 5195 | 0.64 |

Q3: The theoretical proof relies on strong assumptions including the existence of at least one close neighbor in the log and that generation quality and cost are smoothly tied to prompt embeddings. The evaluation is mainly conducted with an in-distribution setting, and it is unclear if the proposed method is robust enough to handle the out-of-distribution challenge and drifting query patterns in dynamic online settings.

A: Thank you for your comments. We believe there may be a misunderstanding, and we would like to clarify that the assumptions we make, discussed in detail in Appendix B.1, are all natural and mild. In particular, Assumption 1 is grounded in the core idea of informative query routing: we assume there exist historical data points that are informative for routing future queries. This aligns with the intuition behind nearly all prior predictive LLM routing methods [1, 2, 3, 4], which assume that some queries in the training or historical data share routing-relevant features with future queries. If no such informative historical data exists, then leveraging training/historical data for predictive routing becomes impossible, and all previous methods, whether they train language models or low-rank models on the historical data, reduce to random routing, as discussed in Appendix B.1. In that case, no meaningful inference can be made without directly querying the LLMs: one must query all LLMs to make informed routing decisions, because in the worst case, under an adversarial query sequence targeting the order in which the system queries LLMs, every LLM except the last may exhibit extremely low performance. This underscores the importance of Assumption 1, which connects historical data to future queries, enabling meaningful routing and improved performance.

To evaluate the robustness of our algorithm, we construct a challenging out-of-distribution setting using RouterBench, which consists of 13 distinct datasets with substantially different distributions (see Table 2 in the Appendix). We use a single dataset (MMLU) in RouterBench as historical data and treat the remaining 12 datasets (including a Chinese benchmark, GSM8K, and others) as the test set. We compare our algorithm against all baselines under the main setting in both the OOD and noisy settings. The detailed results are as follows:

Table 3. Routing performance in the out-of-distribution setting: Comparison between our algorithm and baselines on RouterBench

| Algorithm | Performance | Cost | Performance per Cost | Throughput |
|---|---|---|---|---|
| Random | 1495.35 | 0.518 | 2888.14 | 3308 |
| Greedy-Perf | 807.15 | 0.252 | 3204.76 | 1643 |
| Greedy-Cost | 1677.45 | 0.561 | 2989.30 | 4044 |
| KNN-Perf | 816.10 | 0.254 | 3212.31 | 1658 |
| KNN-Cost | 1679.70 | 0.561 | 2993.23 | 4047 |
| BatchSplit | 1828.55 | 0.526 | 3474.42 | 4283 |
| Roberta-Perf | 9.50 | 0.017 | 559.15 | 10 |
| Roberta-Cost | 1692.65 | 0.562 | 3013.92 | 4037 |
| Multi-RouteLLM (MF) | 924 | 0.247 | 3734.01 | 2227 |
| Multi-RouteLLM (BERT) | 908.60 | 0.248 | 3657.92 | 2239 |
| CARROT (RoBERTa) | 459.40 | 0.121 | 3783.47 | 738 |
| Ours | 2050.8 | 0.532 | 3851.53 | 4644 |

As shown in the table, even under this challenging OOD setting, our algorithm consistently outperforms all baselines. It achieves the highest overall performance, the best performance-per-cost ratio, and the greatest throughput. These results highlight the strong robustness of our method to distribution shift and changing query patterns.

Q4: What is the query-level latency? How does it compare to other routing strategies including trained lightweight routers?

A: Thank you for raising this point. We provide a detailed comparison of query-level routing latency on RouterBench under varying numbers of test queries. All implementations are in Python, and each method is executed using a single CPU thread. For model-based methods that require GPU computation, we follow the same setup as in [1] and use a single NVIDIA L4 GPU (24GB). Across all methods, we fix the processing batch size to 256 for consistency. The detailed latency results are summarized below, where our method achieves the fastest routing among all baselines (except the random routing strategy, which exhibits significantly poorer performance).

Table 4. Per-query routing latency (ms) of different algorithms on RouterBench under varying numbers of queries

| Algorithm | 4000 queries | 6000 queries | 8000 queries | 10000 queries |
|---|---|---|---|---|
| Greedy-Perf | 0.203 | 0.203 | 0.201 | 0.203 |
| Greedy-Cost | 0.233 | 0.230 | 0.226 | 0.233 |
| KNN-Perf | 0.478 | 0.475 | 0.469 | 0.475 |
| KNN-Cost | 0.479 | 0.520 | 0.508 | 0.506 |
| BatchSplit | 0.300 | 0.282 | 0.281 | 0.278 |
| Roberta-Perf | 26.978 | 26.322 | 26.589 | 26.578 |
| Roberta-Cost | 27.000 | 26.373 | 26.566 | 26.591 |
| Multi-RouteLLM (MF) | 0.393 | 0.377 | 0.364 | 0.366 |
| Multi-RouteLLM (BERT) | 13.620 | 13.568 | 13.661 | 13.594 |
| CARROT (RoBERTa) | 26.015 | 25.413 | 25.644 | 25.535 |
| Ours | 0.060 | 0.060 | 0.060 | 0.064 |

These results clearly demonstrate that our algorithm consistently achieves the lowest per-query routing latency across all settings, significantly outperforming all baselines. Notably, compared to model-based routers, our method achieves over 200× speed-up. Note that our current implementation, which is written in Python and runs on a single CPU thread, can be further optimized. Since it requires no GPU operations and handles queries independently, it can easily be reimplemented in a lower-level language such as C++ and parallelized to achieve even greater speed-ups in practice.

[1] Ong, Isaac, et al. "Routellm: Learning to route llms with preference data." arXiv preprint arXiv:2406.18665 (2024).

[2] Somerstep, Seamus, et al. "Carrot: A cost aware rate optimal router." arXiv preprint arXiv:2502.03261 (2025).

[3] Hu, Qitian Jason, et al. "Routerbench: A benchmark for multi-llm routing system." arXiv preprint arXiv:2403.12031 (2024).

[4] Wang, Xinyuan, et al. "Mixllm: Dynamic routing in mixed large language models." arXiv preprint arXiv:2502.18482 (2025).

Comment

Dear Reviewer RvpU,

Did we address all your concerns satisfactorily? If your concerns have not been resolved, could you please let us know which concerns were not sufficiently addressed so that we have a chance to respond?

Many thanks, The authors

Comment

Thanks for your response. While I appreciate the additional results and explanations, I believe the paper needs another round of revision to consolidate the extra supporting experiments and strengthen the associated discussions. Therefore, I’m leaning toward a borderline rating.

Comment

Dear Reviewer RvpU,

We respectfully disagree. Our submission already contains strong empirical results over all the prior state-of-the-art baselines on all the relevant datasets that we are aware of in the literature. In addition, we provide strong theoretical guarantees which were not present in prior works.

The new experiments we added at your request are very much orthogonal to the main contribution of the paper. For example, they focus on a synthetic noisy setting that is not present in any of the actual real-world datasets in the submission. Thus, we believe the new experiments can easily be included in the appendix, since they were created to address the specific concerns raised in your initial review, while the main experiments and contributions are already complete in the current submission.

Many thanks for your considerations, The authors

Review (Rating: 4)

This paper proposes a training-free, online routing algorithm to efficiently handle high-volume, budget-constrained LLM inference requests. Unlike prior methods that rely on heavy offline training or dynamic retraining, the proposed method:

  1. Uses Approximate Nearest Neighbor Search (ANNS) to estimate query-level performance and cost features from historical data.
  2. Learns a single global routing strategy $\gamma^\star$ based on a small subset of initial queries.
  3. Applies this learned strategy for online query routing using a principled dual formulation of a relaxed MILP.

It demonstrates strong theoretical guarantees (a competitive ratio of $1 - o(1)$) and empirical results showing up to 3.55× better performance, 1.85× better cost efficiency, and 4.25× higher throughput over 8 competitive baselines on 3 benchmarks.
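For reference, a budgeted-routing MILP consistent with this description, written in the $d_{ij}$/$g_{ij}$ notation from the author responses (the per-model budgets $B_i$ are our labeling; this is a sketch, not necessarily the paper's exact formulation):

```latex
% Sketch of the routing MILP, inferred from the rebuttal's notation
% (d_{ij}: performance of query j on LLM i, g_{ij}: its cost);
% the per-model token budgets B_i are our labeling.
\begin{align*}
\max_{x}\quad & \sum_{i}\sum_{j} d_{ij}\, x_{ij}
  && \text{(total routed performance)}\\
\text{s.t.}\quad
  & \sum_{j} g_{ij}\, x_{ij} \le B_{i} \quad \forall i
  && \text{(token budget of each LLM)}\\
  & \sum_{i} x_{ij} \le 1 \quad \forall j
  && \text{(each query routed at most once)}\\
  & x_{ij} \in \{0,1\},
  && \text{relaxed to } x_{ij} \in [0,1] \text{ in the LP.}
\end{align*}
```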

Strengths and Weaknesses

Strengths:

  1. This is the first routing method for online multi-LLM serving that is training-free, addressing the scalability bottlenecks of prior methods in realistic deployments.
  2. Solid formulation of routing as a MILP and its LP relaxation.
  3. Compared against 8 baselines across multiple metrics and settings.

Weaknesses:

  1. While the paper claims low-latency routing, it doesn’t provide absolute latency measurements (e.g., milliseconds per query) or breakdowns.

Questions

  1. How sensitive is the routing performance to the quality and size of the historical dataset?
  2. How often should $\gamma^\star$ be re-learned in a real system?
  3. Could reinforcement learning improve the second-stage routing?
  4. Can the proposed method be applied for VLM (vision language model)?

Limitations

yes

Final Justification

My concerns are addressed.

Formatting Concerns

N/A

Author Response

Q1: While the paper claims low-latency routing, it doesn’t provide absolute latency measurements (e.g., milliseconds per query) or breakdowns.

A: Thank you for raising this point. We provide a detailed comparison of per-query routing latency on RouterBench under varying numbers of test queries. All methods are implemented in Python and executed using a single CPU thread. For model-based approaches requiring GPU computation, we follow the same setup as in [1], using a single NVIDIA L4 GPU (24GB). To ensure consistency, we fix the processing batch size to 256 across all methods. The latency results are summarized below, showing that our method achieves the fastest routing among all baselines (except the random routing strategy, which exhibits significantly poorer performance).

Table 1. Per-query routing latency (ms) of different algorithms on RouterBench under varying numbers of queries

| Algorithm | 4000 queries | 6000 queries | 8000 queries | 10000 queries |
|---|---|---|---|---|
| Greedy-Perf | 0.203 | 0.203 | 0.201 | 0.203 |
| Greedy-Cost | 0.233 | 0.230 | 0.226 | 0.233 |
| KNN-Perf | 0.478 | 0.475 | 0.469 | 0.475 |
| KNN-Cost | 0.479 | 0.520 | 0.508 | 0.506 |
| BatchSplit | 0.300 | 0.282 | 0.281 | 0.278 |
| Roberta-Perf | 26.978 | 26.322 | 26.589 | 26.578 |
| Roberta-Cost | 27.000 | 26.373 | 26.566 | 26.591 |
| Ours | 0.060 | 0.060 | 0.060 | 0.064 |

These results clearly demonstrate that our algorithm consistently achieves the lowest per-query routing latency across all settings, significantly outperforming all baselines. Notably, compared to model-based routers, our method achieves over 200× speed-up, highlighting its strong routing efficiency. Furthermore, our current implementation, which is written in Python and executed on a single CPU thread, can be further optimized. Since the algorithm requires no GPU operations, it can easily be reimplemented in a lower-level language such as C++ for further improvement. Additionally, because routing decisions for individual queries are made independently, the algorithm is inherently parallelizable and well-suited for multi-threaded execution. These optimizations can lead to substantial additional speed-ups in real-world deployments.

Q2: How sensitive is the routing performance to the quality and size of the historical dataset?

A: Thanks for your comments. We have already included detailed experiments on the impact of historical data size on routing performance in Appendix C.5 (see Figure 11). The results show that varying the amount of historical data has a negligible effect on our algorithm’s performance. Across all data sizes and settings, our method consistently and significantly outperforms all baselines, demonstrating strong robustness. Notably, even with as few as 5,000 historical records, our algorithm remains highly effective for routing. To assess sensitivity to data quality, we consider two challenging settings: noisy labels and out-of-distribution (OOD) historical data.

In the noisy setting, we introduce two types of strong noise simultaneously. For each performance label $d_{ij}$, we randomly flip its value with a probability of 20%. For each cost label $g_{ij}$, we apply multiplicative, mean-preserving noise consisting of: (i) log-normal jitter $\left(\log \mathcal{N}(0, \sigma^2) - \frac{1}{2}\sigma^2\right)$ with $\sigma = 0.25$, and (ii) a spike factor of 3.0 applied with 2% probability.

In the OOD setting, we construct a distribution shift using RouterBench, which contains 13 diverse datasets (see Table 2 in the Appendix). We use MMLU as the historical dataset and evaluate on the remaining 12 datasets, which differ significantly in data distribution. In both settings, our algorithm consistently outperforms all baselines under the main evaluation protocol. The detailed results are as follows:

Table 2. Routing performance under noisy historical data: our algorithm vs. baselines on RouterBench

| Algorithm | Performance | Cost | Performance per Cost | Throughput |
|---|---|---|---|---|
| Random | 1403.250 | 0.419 | 3350.430 | 3227 |
| Greedy-Perf | 1323.600 | 0.370 | 3573.750 | 2471 |
| Greedy-Cost | 1660.200 | 0.461 | 3604.720 | 4047 |
| KNN-Perf | 1323.950 | 0.369 | 3585.390 | 2472 |
| KNN-Cost | 1660.200 | 0.461 | 3604.720 | 4047 |
| BatchSplit | 1757.250 | 0.456 | 3856.680 | 3974 |
| Roberta-Perf | 83.000 | 0.045 | 1836.330 | 84 |
| Roberta-Cost | 358.000 | 0.075 | 4787.593 | 1616 |
| Ours | 2319.650 | 0.444 | 5221.280 | 4775 |

Table 3. Routing performance in the out-of-distribution setting: Comparison between our algorithm and baselines on RouterBench

| Algorithm | Performance | Cost | Performance per Cost | Throughput |
|---|---|---|---|---|
| Random | 1495.35 | 0.518 | 2888.14 | 3308 |
| Greedy-Perf | 807.15 | 0.252 | 3204.76 | 1643 |
| Greedy-Cost | 1677.45 | 0.561 | 2989.30 | 4044 |
| KNN-Perf | 816.10 | 0.254 | 3212.31 | 1658 |
| KNN-Cost | 1679.70 | 0.561 | 2993.23 | 4047 |
| BatchSplit | 1828.55 | 0.526 | 3474.42 | 4283 |
| Roberta-Perf | 9.50 | 0.017 | 559.15 | 10 |
| Roberta-Cost | 1692.65 | 0.562 | 3013.92 | 4037 |
| Ours | 2050.8 | 0.532 | 3851.53 | 4644 |

As shown in the above tables, our algorithm consistently outperforms all baselines under both noisy historical labels and out-of-distribution settings. In both cases, it achieves the highest overall performance, best performance-per-cost ratio, and greatest throughput, demonstrating strong robustness to label noise and varying distribution shift.

Q3: How often should $\gamma^\star$ be re-learned in a real system?

A: Thank you for the question. In our approach, the routing weights $\gamma^\star$ are learned based on the number of upcoming queries to be routed and the chosen fraction ratio $\epsilon$. For example, with $\epsilon = 0.025$ and a total query batch of 10000, only 250 initial queries are needed to learn the routing weights $\gamma^\star$. In practice, incoming queries can be divided into sequential chunks (e.g., of size 10000), and $\gamma^\star$ is relearned for each chunk: after finishing routing one chunk, we relearn $\gamma^\star$ for the next. The overhead of this learning process is negligible: approximately 0.03 s on 250 samples and 0.05 s on 500 samples, and it can be reduced further. Thus, even in high-throughput scenarios (e.g., 10000 queries per second), the routing weight optimization adds only negligible overhead (0.03 s), making the relearning process both efficient and practical for real systems.
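A sketch of this chunked relearning loop; `learn_gamma` and `route_with` are hypothetical stand-ins for the paper's one-shot dual-LP optimization and its per-query routing rule:

```python
def route_stream(queries, learn_gamma, route_with, chunk=10_000, eps=0.025):
    """Sketch of the chunked relearning scheme described above."""
    decisions = []
    for start in range(0, len(queries), chunk):
        block = queries[start:start + chunk]
        n_init = max(1, int(eps * len(block)))

        # Relearn gamma* on the initial queries of this chunk
        # (~0.03 s for 250 samples, per the rebuttal's measurements).
        gamma = learn_gamma(block[:n_init])

        # Route the whole chunk with the freshly learned weights.
        decisions.extend(route_with(q, gamma) for q in block)
    return decisions
```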

Q4: Could reinforcement learning improve the second-stage routing?

A: Thanks for your suggestion. Using reinforcement learning (RL) for improving the second-stage routing is indeed an interesting direction. In our current design, we learn the routing weights efficiently via a one-shot optimization of the dual LP, which is highly efficient and requires no costly online training process. While RL could offer benefits such as continual adaptation to changing query patterns and integration of richer feedback signals, it also introduces extra complexity and exploration overhead. We consider RL-based routing a promising avenue for future work.

Q5: Can the proposed method be applied for VLM (vision language model)?

A: Thanks for the question. Our routing framework is largely modality-agnostic and only requires that the processed queries can be mapped into a vector space (embedding). In a VLM scenario, one can first encode each image (and its accompanying text, if any) into a joint embedding (e.g., via CLIP or another vision-language encoder). With this joint embedding, our algorithm can be naturally extended to the VLM setting by applying the same procedure: using ANNS to retrieve informative historical data points and performing online cost-efficient routing accordingly. We view this extension as an exciting future direction.
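As an illustration, a joint image-text embedding could be produced with an off-the-shelf CLIP encoder from sentence-transformers; the averaging fusion and all file names below are our assumptions, not part of the paper:

```python
from PIL import Image
from sentence_transformers import SentenceTransformer

# Minimal sketch of the suggested VLM extension, assuming a CLIP
# encoder; the fusion rule (simple averaging in the shared space)
# is our illustration, not the paper's design.
model = SentenceTransformer("clip-ViT-B-32")

image_emb = model.encode(Image.open("query.png"))       # image branch
text_emb = model.encode("Describe the chart above.")    # text branch
joint_emb = (image_emb + text_emb) / 2.0

# joint_emb can now feed the same ANNS retrieval + routing pipeline.
```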

[1] Ong, Isaac, et al. "Routellm: Learning to route llms with preference data." arXiv preprint arXiv:2406.18665 (2024).

Comment

Dear Reviewer toJm,

Did we address all your concerns satisfactorily? If your concerns have not been resolved, could you please let us know which concerns were not sufficiently addressed so that we have a chance to respond?

Many thanks, The authors

Comment

Dear Reviewer toJm,

Thank you for filling out the rebuttal acknowledgement. We believe we have addressed your main weakness of addressing the latency of our algorithm. Please let us know if this concern was not fully addressed.

Many thanks, The authors

Comment

Thanks, the rebuttal addressed my concerns. I would like to keep my rating.

Review (Rating: 3)

Existing works on LLM routing primarily focus on offline scenarios and struggle to adapt to online settings with high query volume and constrained token budgets.

This paper introduces the first training-free algorithm for online routing scenarios. The algorithm leverages approximate nearest neighbor search to efficiently estimate query features and performs a one-time optimization over a small set of initial queries to learn a routing strategy that guides future routing.

The algorithm estimates the performance scores and costs efficiently using a historical dataset, and learns a routing strategy from a small subset of observed queries (of size $\epsilon |Q|$) to guide future query routing.

Experiments on simulated traces demonstrate its performance.

Strengths and Weaknesses

Strengths

  1. Improving the efficiency of LLM routing is an important topic.

  2. The proposed method with approximate nearest neighbor search and online learning is helpful.

  3. Theoretical analysis and experiments are provided.

Weaknesses

  1. Such approximate nearest neighbor methods rely heavily on the training data. For example, RouterBench has 36,497 examples; the paper uses 26,497 queries as historical data to construct the search examples and 10,000 queries for testing.

  2. Stronger model-based methods like BERT can be added. The router running time of these methods can also be compared.

  3. Experiments are only on simulated traces.

Questions

  1. Move Table 1 to the next page.

Limitations

yes

Final Justification

After discussion with authors, the paper needs to improve the experiment setup to provide a fair comparison with baselines and consolidate the extra supporting experiments.

Formatting Concerns

N/A

Author Response

Q1: Such approximate nearest neighbor methods highly rely on the training data. E.g. Routerbench has 36,497 examples. The paper uses 26,497 queries as historical data to construct the search examples and 10,000 queries for testing.

A: Thanks for the comments. We would like to clarify that our algorithm does not rely heavily on historical data. As we already showed in Appendix C.5 (Figure 11), varying the amount of historical data used for ANNS has minimal impact on performance. Our method consistently and significantly outperforms all baselines across all data sizes, demonstrating strong robustness. Notably, even with as few as 5000 historical records, our algorithm remains highly effective for routing.

Q2: Stronger model-based methods like BERT can be added.

A: We appreciate the suggestion. Following the main setting in the paper, we train two separate BERT-based models to predict the generation performance score and cost for each query, resulting in two additional BERT-based routing baselines: (i) BERT-perf-routing, which routes each query to the model with the highest predicted performance; (ii) BERT-cost-routing, which routes each query to the model with the greatest available budget. We compare our algorithm with these two baselines on RouterBench under the main experimental setting. The detailed results are as follows:

Table 1. Comparison of Our Routing Algorithm with BERT‑Based Methods on RouterBench

| Algorithm | Performance | Cost | Performance per Cost | Throughput |
|---|---|---|---|---|
| Bert-Perf | 384.25 | 0.154 | 2480.87 | 452 |
| Bert-Cost | 630.35 | 0.145 | 4340.5 | 1156 |
| Ours | 2718.6 | 0.447 | 6075.58 | 5195 |

As shown in the above results, our method achieves over 4× higher performance and over 3× higher throughput than the strongest BERT-based baseline, while maintaining favorable cost efficiency.

Q3: The router running time of these methods can also be compared.

A: Thank you for raising this point. We provide a detailed comparison of routing latency on RouterBench under varying numbers of test queries. All implementations are in Python, and each method is executed using a single CPU thread. For model-based methods that require GPU computation, we follow the same setup as in [1] and use a single NVIDIA L4 GPU (24GB). Across all methods, we fix the processing batch size to 256 for consistency. The detailed latency results are summarized below, demonstrating the routing efficiency of our algorithm: our method achieves the fastest routing among all baselines (except the random routing strategy, which exhibits significantly poorer performance):

Table 2. Total routing latency (s) comparison of routing algorithms on RouterBench under varying numbers of queries

| Algorithm | 4000 queries | 6000 queries | 8000 queries | 10000 queries |
|---|---|---|---|---|
| Greedy-Perf | 0.81 | 1.22 | 1.61 | 2.03 |
| Greedy-Cost | 0.93 | 1.38 | 1.81 | 2.33 |
| KNN-Perf | 1.91 | 2.85 | 3.75 | 4.75 |
| KNN-Cost | 1.92 | 3.12 | 4.07 | 5.06 |
| BatchSplit | 1.20 | 1.69 | 2.25 | 2.78 |
| Roberta-Perf | 107.91 | 157.93 | 212.71 | 265.78 |
| Roberta-Cost | 108.00 | 158.24 | 212.53 | 265.91 |
| Bert-Perf | 111.77 | 164.98 | 221.64 | 277.28 |
| Bert-Cost | 112.76 | 165.84 | 223.32 | 277.32 |
| Ours | 0.24 | 0.36 | 0.48 | 0.64 |

These results clearly show that our algorithm consistently achieves the lowest routing latency across all settings, outperforming all baseline methods and demonstrating its strong routing efficiency. Furthermore, our current implementation, written in Python on a single CPU thread, can be further optimized. Since the algorithm does not require any GPU resources or operations, it can easily be reimplemented in a faster programming language such as C++. In addition, because routing decisions for individual queries are made independently based on the learned routing weights, the algorithm is naturally parallelizable and well suited for multi-threaded execution. Applying these optimizations can provide substantial further speed-ups in practice.

Q4: Experiments are only on simulated traces.

A: Thank you for your comment. To the best of our knowledge, we have evaluated our method on almost all publicly available multi-LLM routing benchmarks [2, 3, 4] that are widely used in the literature and research community. We are not aware of any additional public datasets in this space. If the reviewer has any pointers to specific labeled traffic traces, we would be happy to incorporate them into further experiments.

Q5: Move Table 1 to the next page

A: Thank you for the suggestion. We will move Table 1 to an appropriate position on the next page in the revised version.

[1] Ong, Isaac, et al. "Routellm: Learning to route llms with preference data." arXiv preprint arXiv:2406.18665 (2024).

[2] Hu, Qitian Jason, et al. "Routerbench: A benchmark for multi-llm routing system." arXiv preprint arXiv:2403.12031 (2024).

[3] Open llm leaderboard v2. https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard, 2024.

[4] Somerstep, Seamus, et al. "Carrot: A cost aware rate optimal router." arXiv preprint arXiv:2502.03261 (2025).

Comment

Dear Reviewer f1oy,

Did we address all your concerns satisfactorily? If your concerns have not been resolved, could you please let us know which concerns were not sufficiently addressed so that we have a chance to respond?

Many thanks, The authors

Comment

Dear Reviewer f1oy,

Thank you for filling out the rebuttal acknowledgement. We would like to note that one of your main concerns (the size of the historical dataset) was already addressed in the submission. In addition, we used all the datasets in the literature that we are aware of. If you have a suggestion, we are happy to incorporate it, but we do not believe we have missed any datasets. Could you let us know if you had other datasets in mind, or engage with us in further discussion?

Many thanks, The authors

Comment

Thank you for the clarification! Could you elaborate on why BERT and RoBERTa perform worse than the random baseline? The proposed ANN approach with a language embedding model appears mechanistically similar to BERT classification, so this result is surprising.

In addition, it is unclear why varying the number of training data points does not affect model performance. Could you provide more explanation or analysis for this observation?

Comment

Q1: why BERT and RoBERTa perform worse than the random baseline?

A: Thank you for the comments. This is due to our online routing setup, as detailed in Appendix A (lines 560–564), where a sequence of queries must be routed to different LLMs within a limited time and budget.

For performance-based methods (e.g., BERT-perf and RoBERTa-perf), routing is based only on predicted performance. In the online setting, these methods tend to over-route queries to strong LLMs early on, quickly exhausting their budgets. Without considering the remaining budget, subsequent queries continue to be routed to those LLMs, resulting in unprocessed queries and reduced overall performance.

For cost-based methods, the true remaining budget is unobservable at routing time, as it depends on the actual incurred costs of prior queries, which are only known after those queries have finished. Hence, methods such as BERT-cost and RoBERTa-cost routing must rely on predicted costs to estimate the remaining budget. Estimation errors in cost prediction mislead routing decisions and reduce performance, and these errors vary across benchmarks. As shown in Table 1 in the paper, the RoBERTa-cost method outperforms random routing on SPROUT and Open LLM Leaderboard v2, but still falls short of our method.

Q2: In addition, it is unclear why varying the number of training data points does not affect model performance.

A: Thank you for the question. Our method does not involve any training phase. Instead, we treat the so-called "training data" as a historical dataset from which we retrieve informative data points using Approximate Nearest Neighbor Search (ANNS) algorithms. These retrieved data points are then used to estimate the features of incoming queries and make online routing decisions. As a result, our approach depends on retrieving representative neighbors rather than learning parameters, making performance reliant on information coverage rather than dataset size. In practice, real queries cluster in embedding space, and around 5000 randomly sampled examples are sufficient to cover these clusters. This explains the stable performance across dataset sizes and highlights a key strength of our method: it eliminates the need for model training.

We would be happy to clarify any additional questions or concerns.

Comment

If BERT can be trained to balance both performance and cost, it should have similar results to the ANN method.

I also have another question on latency: since the ANN method also needs to compute the prompt embedding, why is it much faster than BERT? Is it due to a smaller embedding model?

Around 5000 randomly sampled examples are sufficient to cover these clusters. The authors could show how the results change from 0 to 5000 sampled examples.

Comment

Q1: If BERT can be trained to balance both performance and cost, it should have similar results to the ANN method.

A: Thank you for the comments. While it is theoretically possible to train a BERT-based router to balance performance and cost simultaneously, doing so is highly challenging in practice due to the lack of reliable supervision signals for cost-efficient routing. To the best of our knowledge, no existing BERT-based routers in the literature are trained with this objective. Furthermore, even if such a cost-efficient BERT router could be trained, it would still face the same limitations as other model-based routers: (1) It is difficult to adapt to dynamic online settings, often requiring retraining, and (2) It struggles to make globally cost-efficient routing decisions under changing budget constraints. In contrast, our proposed ANN-based online routing algorithm naturally handles dynamic trade-offs and avoids these limitations without retraining.

Q2: Since the ANN method also needs to compute the prompt embedding, why is it much faster than BERT? Is it due to a smaller embedding model?

A: Thank you for the question. The routing system of all training-free methods in the paper (KNN-Routing, Perf-Routing, BatchSplit, and ours) consists of two modules: (1) an embedding module, and (2) a core online routing module. Our latency evaluation of these training-free methods focuses on the second part, the core online routing module, for two main reasons:

(1) Decoupled pipeline. The embedding step can operate asynchronously and is fully decoupled from the core routing logic.

(2) Module independence. The embedding model is interchangeable (e.g., MiniLM, GTE-small, bge-small), while our core contribution lies in the core routing algorithm itself.

Furthermore, our routing algorithm only solves a lightweight linear program over a small batch of queries in an online fashion, adding negligible overhead.

For completeness, we also report the embedding latency of the bge-base model under the same setting. As shown in the table below, even when including embedding time, our method remains over 5× faster than model-based routers.

Table 1. The latency (s) comparison of our algorithm with model-based routers on RouterBench under varying numbers of queries.

| Algorithm (Module) | 4000 queries | 6000 queries | 8000 queries | 10000 queries |
|---|---|---|---|---|
| Roberta-Perf | 107.91 | 157.93 | 212.71 | 265.78 |
| Roberta-Cost | 108.00 | 158.24 | 212.53 | 265.91 |
| Bert-Perf | 111.77 | 164.98 | 221.64 | 277.28 |
| Bert-Cost | 112.76 | 165.84 | 223.32 | 277.32 |
| Ours (bge-base embedding latency) | 21.75 | 31.11 | 42.87 | 51.91 |
| Ours (core online routing latency) | 0.24 | 0.36 | 0.48 | 0.64 |

Q3: Around 5000 randomly sampled examples are sufficient to cover these clusters. The authors could show how the results change from 0 to 5000 sampled examples.

A: Thanks for your suggestion. We report results with historical data sizes varying from 5 to 5000 in the table below. As shown, even with only 50 samples, our method achieves a performance of ~2273, which is already significantly better than all baselines. Compared to the full-data performance (~2700), the drop is modest, demonstrating strong performance even under limited historical data.

When the number of historical samples drops to just 5, performance degrades more noticeably. This aligns with our theoretical analysis: when there is little or no correlation between historical data and future queries, informative routing becomes provably impossible.

Notably, with around 2000 samples, our method already achieves near-optimal performance, matching the result with all historical data. These results demonstrate the robustness of our algorithm under varying amounts of historical data points.

Table 2. Results when varying the number of historical data points from 5 to 5000

| Size | Performance | Cost | Performance per Cost | Throughput |
|---|---|---|---|---|
| 5 | 491.0 | 0.141 | 3479.42 | 1149 |
| 50 | 2273.35 | 0.456 | 4978.16 | 4815 |
| 100 | 2369.35 | 0.433 | 5469.38 | 4921 |
| 200 | 2294.40 | 0.439 | 5222.18 | 4887 |
| 300 | 2409.50 | 0.416 | 5791.87 | 4812 |
| 400 | 2586.45 | 0.444 | 5828.73 | 5141 |
| 500 | 2640.35 | 0.443 | 5959.35 | 5173 |
| 600 | 2656.60 | 0.448 | 5933.56 | 5255 |
| 800 | 2557.15 | 0.448 | 5712.32 | 4990 |
| 1000 | 2550.85 | 0.449 | 5676.95 | 5052 |
| 1500 | 2637.05 | 0.438 | 6022.88 | 5171 |
| 2000 | 2692.45 | 0.434 | 6201.30 | 5216 |
| 2500 | 2658.45 | 0.427 | 6230.55 | 5070 |
| 3000 | 2792.65 | 0.448 | 6229.25 | 5246 |
| 3500 | 2747.55 | 0.429 | 6405.52 | 5267 |
| 4000 | 2730.15 | 0.439 | 6212.63 | 5204 |
| 4500 | 2731.65 | 0.429 | 6373.63 | 5266 |
| 5000 | 2686.15 | 0.433 | 6206.67 | 5200 |
Comment

Instead of BERT, if the router were trained on the bge-base model, would it have the same latency as the proposed method?

Comment

Q: Instead of BERT, if the router were trained on the bge-base model, would it have the same latency as the proposed method?

A: Thank you for the question. If we fine-tune the same bge model used for embedding as a router, the overall latency would indeed be similar to ours. However, the performance would be much worse, since our method already improves over BERT-based routers. In fact, imagine an extreme case where we make random decisions: the latency is effectively 0, but performance drops significantly (see Table 1 in the submission).

Furthermore, as discussed above, any model-based routers still face several key limitations in online settings (even if we imagine using a hypothetical model that has 0 latency but somehow magical performance):

(1) Lack of adaptability. Model-based routers must be retrained whenever the model pool or budget constraints change, which makes them hard to maintain in dynamic deployment environments.

(2) Limited cost-efficiency. Such routers often optimize for a fixed objective (performance or cost) and struggle to adapt to dynamic query arrivals.

(3) No modular reuse. The model-based router is tightly coupled with the router and cannot be reused for other tasks. In contrast, our embedding module is decoupled, allowing the same embeddings to be shared across routing, retrieval, logging, and other downstream tasks.

Thanks again for your comments, and we hope this clarifies the practical advantages of our training-free approach.

Comment

Thank you for your rebuttal. I will carefully consider your points and discuss them with the other reviewers.

Review (Rating: 5)

The paper introduces a training-free algorithm for optimizing the routing of user queries to multiple LLMs in online settings with constrained token budgets. The algorithm builds an offline ANNS index over historical data for estimating the performance and cost of each deployed LLM, and performs a one-time optimization over a small subset of initial queries to learn routing weights. These weights then guide the routing decisions for online queries, ensuring high performance and cost efficiency.

Strengths and Weaknesses

Advantages:

  1. The performance of the method is good. It achieves an average improvement of 3.55× in overall performance, 1.85× in cost efficiency, and nearly 4.25× in throughput.

  2. The method does not require training or frequent finetuning of existing models.

  3. The method appears not very sensitive to the distribution of user queries. Even if the query distribution shifts in the online setting, it only requires an update to the HNSW index by inserting the new query data, and this process can be performed offline.

  4. The writing is clear.

  5. The evaluations are comprehensive.

What I like most about this paper is that their method is simple but effective. It does not make strong or unrealistic assumptions and is suitable for real-world deployment. My major concern is that the method only considers cost and efficiency as features for model selection. However, in a real-world setting, the traffic of queries is also important. If a high volume of queries arrives at the same time and they are all routed to the same LLM, there may be traffic congestion. For example, if a large number of users ask similar questions in a short amount of time and the HNSW index suggests routing all the queries to the same LLM, inference efficiency would be very low while the other LLMs sit idle. The arrival times and ordering of different queries also matter. I understand that solving this problem isn't straightforward, and linear programming might not be suited to it. However, the problem is practical and important in a real-world setting.

Questions

  1. What parameters have you set for HNSW? I only saw the number of neighbors in the appendix. However, the number of layers of HNSW is not specified. The parameters of HNSW might impact the evaluation results. Could you add evaluations for that?

  2. See weakness.

  3. It should be HNSW, not HWSN, throughout the paper.

Limitations

The method only considers cost and efficiency as features for model selection but does not consider the traffic of queries. The authors mention adding additional constraints to the original MILP to incorporate other routing factors. However, this is not entirely accurate, as incorporating sequences of similar types of queries is not straightforward in a MILP.

Final Justification

The authors have addressed my concern.

Formatting Concerns

N/A

Author Response

Q1: The method only considers cost and efficiency as features for model selection but does not consider the traffic of queries. In a real-world setting, the traffic of queries is also important.

A: Thank you for your comments. We acknowledge that query traffic patterns play an important role in real-world systems. However, the primary goal of our work is to develop a cost-efficient online routing algorithm, a problem that has gained increasing attention in recent literature [1, 2, 3], especially given the high computational cost of LLMs. To isolate this core algorithmic problem, we adopt a standard assumption widely used in the online algorithms literature [4, 5, 6]: that incoming queries arrive in a randomly permuted order. This assumption, also discussed in detail in Appendix B.1, allows us to abstract away complex traffic dynamics and focus on the fundamental cost-efficiency routing problem.

We would like to clarify that, under this assumption, our framework is designed to operate on query-intrinsic features that are independent of time and query order, such as response performance score and cost per query across the models. In contrast, operational-state features, such as current queue lengths or system load, are beyond the scope of our current formulation and are more aligned with a related but orthogonal problem: load balancing under traffic constraints.

Still, we believe our algorithm offers a natural foundation for incorporating such traffic-aware considerations. One potential approach is to discretize time into short windows and track the per-model backlog of unserved workload. Within each time window, a corresponding "time-bucket" MILP can be solved with additional traffic constraints, and any unserved workload is carried over to the next window. For instance, let $\ell_{ij}$ denote the service time of query $j$ on LLM $i$, $b_{it}$ the backlog of unserved workload for LLM $i$ at the start of time window $t$, and $A_{it}$ the available service capacity of LLM $i$ during window $t$. Then, for each window $t$ and LLM $i$, we can add the following constraint to the MILP: $$\sum_j \ell_{ij} x_{ij} \leq \max\{0,\, A_{it} - b_{it}\}, \quad \forall i, t.$$ This "time-bucket" formulation helps smooth bursts and reduce queuing delays.
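As an illustration, the window-level MILP with this traffic constraint could be assembled with an off-the-shelf solver such as PuLP. This is a sketch under the assumptions above; all inputs are placeholders:

```python
import pulp

def window_milp(d, g, l, budgets, A, b, n_models, n_queries):
    """Sketch of the 'time-bucket' MILP for one window.

    d[i][j], g[i][j] -- performance/cost of query j on LLM i
    l[i][j]          -- service time of query j on LLM i
    A[i], b[i]       -- capacity and backlog of LLM i this window
    """
    prob = pulp.LpProblem("time_bucket_routing", pulp.LpMaximize)
    x = pulp.LpVariable.dicts(
        "x", (range(n_models), range(n_queries)), cat="Binary")

    # Objective: total routed performance in this window.
    prob += pulp.lpSum(d[i][j] * x[i][j]
                       for i in range(n_models) for j in range(n_queries))
    for j in range(n_queries):
        prob += pulp.lpSum(x[i][j] for i in range(n_models)) <= 1
    for i in range(n_models):
        prob += pulp.lpSum(g[i][j] * x[i][j]
                           for j in range(n_queries)) <= budgets[i]
        # Traffic constraint from above: served work cannot exceed the
        # leftover capacity after the carried-over backlog.
        prob += pulp.lpSum(l[i][j] * x[i][j]
                           for j in range(n_queries)) <= max(0.0, A[i] - b[i])

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return x
```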

In the revision, we will further explicitly clarify the scope of our current MILP formulation and also discuss such possible extensions of our algorithm.

Q2: What parameters have you set for HNSW? The number of layers of HNSW is not specified. The parameters of HNSW might impact the evaluation results. Could you add evaluations for that?

A: Thank you for pointing this out. We use the standard and recommended HNSW implementation from hnswlib with the following parameter settings in our paper: construction effort factor ef_construction=200, max number of neighbors per node M = 32, and query search breadth ef = 50. Regarding the number of layers in hnswlib, it is not a direct user-configurable hyperparameter. Instead, it is an emergent property of the graph determined during construction, which is primarily controlled by the M parameter. A smaller M tends to produce a taller, multi-layered graph, while a larger M results in a flatter, more interconnected one.
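For reference, these settings correspond to the following standard hnswlib usage (a sketch; the dimension and data below are placeholders):

```python
import hnswlib
import numpy as np

# Placeholder data standing in for historical prompt embeddings.
dim, n_hist = 384, 26_497
embeddings = np.random.rand(n_hist, dim).astype(np.float32)

# Build the index with the parameters reported above.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n_hist, ef_construction=200, M=32)
index.add_items(embeddings, np.arange(n_hist))
index.set_ef(50)  # query-time search breadth

# Retrieve the 10 nearest historical queries for one probe.
labels, distances = index.knn_query(embeddings[:1], k=10)
```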

To evaluate the robustness of our algorithm to these three key parameters, we conduct ablation studies on RouterBench under the main evaluation setting. Specifically, we vary ef_construction from 100 to 1200, M from 8 to 96, and ef from 32 to 384. The ablation results, shown in the following tables, demonstrate that our routing algorithm remains highly robust across a wide range of values for all three parameters:

Table 1. Performance of our algorithm on RouterBench with varying ef_construction

| ef_construction | Performance | Cost | Performance per Cost | Throughput |
|---|---|---|---|---|
| 100 | 2785.40 | 0.46 | 6107.40 | 5305 |
| 200 | 2744.85 | 0.45 | 6147.22 | 5196 |
| 300 | 2738.35 | 0.45 | 6085.40 | 5215 |
| 400 | 2747.90 | 0.45 | 6133.20 | 5217 |
| 600 | 2715.80 | 0.44 | 6105.08 | 5168 |
| 800 | 2726.80 | 0.45 | 6104.13 | 5169 |
| 1200 | 2734.85 | 0.45 | 6125.43 | 5192 |

Table 2. Performance of our algorithm on RouterBench with varying M

| M | Performance | Cost | Performance per Cost | Throughput |
|---|---|---|---|---|
| 8 | 2596.60 | 0.44 | 5923.04 | 5084 |
| 16 | 2764.15 | 0.45 | 6107.89 | 5272 |
| 24 | 2745.45 | 0.45 | 6110.49 | 5211 |
| 32 | 2744.85 | 0.45 | 6147.22 | 5196 |
| 48 | 2729.20 | 0.45 | 6116.41 | 5194 |
| 64 | 2726.40 | 0.45 | 6075.57 | 5201 |
| 96 | 2763.35 | 0.45 | 6104.96 | 5214 |

Table 3. Performance of our algorithm on RouterBench with varying ef

| ef | Performance | Cost | Performance per Cost | Throughput |
|---|---|---|---|---|
| 32 | 2730.35 | 0.45 | 6032.83 | 5195 |
| 50 | 2744.85 | 0.45 | 6147.22 | 5196 |
| 80 | 2594.10 | 0.43 | 5973.32 | 5021 |
| 128 | 2706.90 | 0.44 | 6084.47 | 5170 |
| 192 | 2601.85 | 0.44 | 5914.84 | 5022 |
| 256 | 2710.80 | 0.44 | 6112.35 | 5171 |
| 384 | 2744.20 | 0.45 | 6093.61 | 5223 |

[1] Ong, Isaac, et al. "Routellm: Learning to route llms with preference data." arXiv preprint arXiv:2406.18665 (2024).

[2] Somerstep, Seamus, et al. "Carrot: A cost aware rate optimal router." arXiv preprint arXiv:2502.03261 (2025).

[3] Hu, Qitian Jason, et al. "Routerbench: A benchmark for multi-llm routing system." arXiv preprint arXiv:2403.12031 (2024).

[4] Devanur, Nikhil R., and Thomas P. Hayes. "The adwords problem: online keyword matching with budgeted bidders under random permutations." Proceedings of the 10th ACM Conference on Electronic Commerce, 2009.

[5] Mehta, Aranyak, et al. "Adwords and generalized online matching." Journal of the ACM (JACM) 54.5 (2007): 22-es.

[6] Goel, Gagan, and Aranyak Mehta. "Online budgeted matching in random input models with applications to Adwords." SODA, volume 8 (2008): 982-991.

Comment

Dear Reviewer QUtW,

Did we address all your concerns satisfactorily? If your concerns have not been resolved, could you please let us know which concerns were not sufficiently addressed so that we have a chance to respond?

Many thanks, The authors

Comment

Thank you for your explanation. I have adjusted the score.

Final Decision

This paper proposes the first training-free algorithm for online routing in multi-LLM serving under constrained token budgets. The method leverages approximate nearest neighbor search on historical queries and performs a one-time optimization over a small subset of queries to derive routing weights, which are then used for efficient online routing. The authors provide both theoretical guarantees and extensive empirical results across three benchmarks and eight baselines, showing consistent improvements in performance, cost efficiency, and throughput.

The strengths of the paper lie in its novelty, simplicity, and practicality. The method avoids retraining, is robust to data size and distribution shifts, achieves significant empirical gains, and is supported by strong theoretical analysis. The evaluations are comprehensive, with ablations, robustness tests to noise, and latency measurements, making the work relevant for real-world deployments.

The weaknesses are that some reviewers questioned the reliance on high-quality historical data, limited real-world traces, and whether stronger model-based routers could offer more competitive baselines. There were also concerns about assumptions in the theoretical proof and the lack of traffic-aware constraints in the current formulation.

The main reason for acceptance is that the paper addresses an important and timely challenge in scalable LLM deployment with a simple yet powerful solution that demonstrates both theoretical rigor and empirical robustness. While some reviewers leaned borderline due to experimental setup or scope, the rebuttal convincingly clarified key points, added strong additional baselines (e.g., BERT, CARROT), robustness evaluations (noise, OOD), and latency breakdowns. This addressed the main criticisms from reviewers QUtW, f1oy, RvpU, and toJm, with most acknowledging that their concerns were resolved.