PaperHub

Score: 6.0/10 · Poster · 4 reviewers (ratings: 3, 4, 3, 5; min 3, max 5, std 0.8)
Confidence: 3.3
Novelty: 2.5 · Quality: 2.5 · Clarity: 3.0 · Significance: 2.5
NeurIPS 2025

Efficient Training-Free Online Routing for High-Volume Multi-LLM Serving

OpenReview · PDF
Submitted: 2025-05-09 · Updated: 2025-10-29
TL;DR

We propose the first efficient, training-free online routing algorithm for high-volume LLM serving under token budget constraints, achieving significant improvements in both routing performance and cost efficiency.

Abstract

Increasing demand for Large Language Model (LLM) services imposes substantial deployment and computation costs on providers. LLM routing offers a cost-efficient solution by directing queries to the optimal LLM based on model and query features. However, existing works primarily focus on offline scenarios and struggle to adapt to online settings with high query volume and constrained token budgets. In this work, we introduce the first training-free algorithm for online routing scenarios. Our algorithm leverages approximate nearest neighbor search to efficiently estimate the features of queries and performs a one-time optimization over a small set of initial queries to learn a set of routing weights that guide future routing. We provide a theoretical guarantee that the algorithm achieves a competitive ratio of $1 - o(1)$ under natural assumptions, which is further validated by extensive experiments across 3 benchmark datasets and 8 baselines, showing an average improvement of 3.55$\times$ in performance, 1.85$\times$ in cost efficiency, and nearly 4.25$\times$ in throughput. Our code is available at https://github.com/fzwark/PORT.
Keywords
Large Language Models, LLM Routing, Cost-Aware Inference, Online Algorithm, Multi-LLM Serving

Reviews and Discussion

Review (Rating: 3)

This paper introduces a training-free routing strategy that leverages a log of historical prompts and their associated generations, together with their quality scores and costs. Specifically, it performs semantic-embedding-based nearest neighbor search to retrieve similar past queries and estimates each model's performance by averaging the recorded measures. It then combines the estimated quality and cost of each model using learned and fixed weights to pick the final model.
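In code, the retrieve-and-score loop described above might look like the following sketch (illustrative only: the index is assumed to expose an hnswlib-style `knn_query`, and the scalar `weight` stands in for the paper's learned routing weights):

```python
import numpy as np

def route_query(query_emb, index, hist_perf, hist_cost, weight, k=10):
    """Illustrative sketch of the retrieve-and-score rule (not the
    paper's exact formulation).

    index     -- an ANN index over historical prompt embeddings
    hist_perf -- (n_hist, n_models) recorded quality scores
    hist_cost -- (n_hist, n_models) recorded generation costs
    weight    -- scalar trading off estimated quality against cost
    """
    # Retrieve the k most similar past queries.
    neighbor_ids, _ = index.knn_query(query_emb, k=k)
    nbrs = neighbor_ids[0]

    # Estimate per-model quality and cost by averaging neighbor records.
    est_perf = hist_perf[nbrs].mean(axis=0)
    est_cost = hist_cost[nbrs].mean(axis=0)

    # Combine the estimates and pick the highest-scoring model.
    return int(np.argmax(est_perf - weight * est_cost))
```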

Strengths and Weaknesses

Strengths:

  • Comprehensive evaluation on three benchmarks.
  • Extensive tests over query volume, arrival order, size of historical data, size of search candidates, etc.
  • Code and data are released for reproducibility.

Weaknesses:

  • The effectiveness of the proposed method depends on reliable quality labels. The experiments mainly use pre-existing ground truth and lack robustness evaluation against noisy quality labels.
  • The baseline methods are somewhat weak. Some comparable works are discussed in the related work section but are not empirically compared in the evaluation section.
  • The theoretical proof relies on strong assumptions, including the existence of at least one close neighbor in the log and that generation quality and cost are smoothly tied to prompt embeddings. Similarly, the evaluation is mainly conducted in an in-distribution setting, and it is unclear whether the proposed method is robust enough to handle out-of-distribution challenges and drifting query patterns in dynamic online settings.

Questions

What is the query-level latency? How does it compare to other routing strategies including trained lightweight routers?

Limitations

yes

Final Justification

While I appreciate the additional results and explanations provided by the authors during the discussion period, I believe the paper needs another round of revision to consolidate the extra supporting experiments and strengthen the associated discussions. Therefore, I’m leaning toward a borderline rating.

Formatting Concerns

no

Author Response

Q1: The experiments mainly use pre-existing ground truth and lack robustness evaluation against noisy quality labels.

A: Thank you for your comments. To demonstrate the robustness of our algorithm under noisy historical labels, we conduct experiments in a noisy setting where two types of strong noise are introduced simultaneously. Specifically, for each performance label $d_{ij}$, we randomly flip its value with a probability of 20%. For each cost label $g_{ij}$, we apply multiplicative, mean-preserving noise consisting of two components: (i) log-normal jitter $\left(\log \mathcal{N}(0, \sigma^2) - \frac{1}{2}\sigma^2\right)$ with $\sigma = 0.25$, and (ii) a spike factor of 3.0 applied with probability 2%.
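A minimal sketch of this noise model, as we read it (function names and the `rng` seed are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def add_label_noise(perf, cost, flip_p=0.20, sigma=0.25, spike=3.0, spike_p=0.02):
    """Sketch of the noise injection described above (our reading).

    perf -- binary performance labels d_ij; flipped with probability flip_p
    cost -- cost labels g_ij; scaled by mean-preserving log-normal jitter
            exp(N(0, sigma^2) - sigma^2 / 2), plus a rare spike factor
    """
    # Flip each performance label with probability flip_p.
    flip = rng.random(perf.shape) < flip_p
    noisy_perf = np.where(flip, 1.0 - perf, perf)

    # Multiplicative jitter with unit mean, plus occasional 3x spikes.
    jitter = np.exp(rng.normal(0.0, sigma, cost.shape) - 0.5 * sigma**2)
    spikes = np.where(rng.random(cost.shape) < spike_p, spike, 1.0)
    noisy_cost = cost * jitter * spikes
    return noisy_perf, noisy_cost
```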

We compare our algorithm against all baseline methods on RouterBench under the main experimental setting. As shown in the results below, our method consistently outperforms all baselines under noisy historical labels. It achieves the highest overall performance, the highest throughput, and the best cost efficiency, demonstrating the strong robustness of our algorithm under noisy conditions.

Table 1. Routing performance under noisy historical data. Comparison between our algorithm and baselines on RouterBench.

| Algorithm | Performance | Cost | Performance per Cost | Throughput |
|---|---|---|---|---|
| Random | 1403.25 | 0.419 | 3350.43 | 3227 |
| Greedy-Perf | 1323.60 | 0.370 | 3573.75 | 2471 |
| Greedy-Cost | 1660.20 | 0.461 | 3604.72 | 4047 |
| KNN-Perf | 1323.95 | 0.369 | 3585.39 | 2472 |
| KNN-Cost | 1660.20 | 0.461 | 3604.72 | 4047 |
| BatchSplit | 1757.25 | 0.456 | 3856.68 | 3974 |
| Roberta-Perf | 83.00 | 0.045 | 1836.33 | 84 |
| Roberta-Cost | 358.00 | 0.075 | 4787.593 | 1616 |
| Ours | 2319.65 | 0.444 | 5221.28 | 4775 |

Q2: The baseline methods are somewhat weak.

A: Thank you for your suggestions. To the best of our knowledge, our work is the first to focus specifically on training-free online multi-LLM routing, and no existing methods directly target the same setting. To provide further meaningful comparisons, we additionally adapt three local routing baselines from two related works: RouteLLM (MF) and RouteLLM (BERT) from [1], and CARROT (RoBERTa) from [2]. [1] is a well-known two-model routing approach, while [2] represents the current state of the art in offline multi-LLM routing, aiming to balance cost and performance. Since [1] only handles two-model routing, we extend it to the multi-model case by training multi-label predictors for win rates and applying a cheapest-feasible-model strategy: routing each query to the cheapest model whose predicted win-rate logit exceeds a threshold $\alpha$. Furthermore, as all three baselines are originally intended for local or offline routing, we adapt them to our online setting. We use 20% of the historical data to search the threshold $\alpha$ (for RouteLLM) and the cost-efficiency coefficient $\mu$ (for CARROT) over the range [0, 1] with a step size of 0.1, maximizing performance under a fixed budget.
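A minimal sketch of this cheapest-feasible-model rule; the fallback branch for when no model clears the threshold is our assumption:

```python
import numpy as np

def cheapest_feasible(win_logits, prices, alpha):
    """Sketch of the cheapest-feasible-model extension described above.

    win_logits -- (n_models,) predicted win-rate logits for the query
    prices     -- (n_models,) price of serving the query on each model
    alpha      -- feasibility threshold tuned on held-out historical data
    """
    feasible = np.flatnonzero(win_logits >= alpha)
    if feasible.size == 0:
        # Fallback (our assumption): no model clears the threshold,
        # so route to the model with the highest predicted win rate.
        return int(np.argmax(win_logits))
    # Among feasible models, pick the cheapest one.
    return int(feasible[np.argmin(prices[feasible])])
```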

We present comparisons on the most representative benchmark, RouterBench, under the main experimental setting. Our algorithm outperforms all three additional baselines across all key metrics while maintaining the lowest routing latency. Notably, it achieves over 1.5× higher throughput than all baselines and a total routing latency of just 0.64 seconds, over 200× faster than the state-of-the-art CARROT method:

Table 2. Comparison results with 3 additional baselines on RouterBench

| Algorithm | Performance | Cost | Performance per Cost | Throughput | Total Routing Latency (s) |
|---|---|---|---|---|---|
| Multi-RouteLLM (MF), $\alpha=0.6$ | 2041 | 0.35 | 5883.61 | 3525 | 3.66 |
| Multi-RouteLLM (BERT), $\alpha=0.4$ | 1764.35 | 0.31 | 5753.82 | 3246 | 135.94 |
| CARROT (RoBERTa), $\mu=0.2$ | 1353.2 | 0.23 | 5987.61 | 2737 | 255.35 |
| Ours | 2718.6 | 0.447 | 6075.58 | 5195 | 0.64 |

Q3: The theoretical proof relies on strong assumptions including the existence of at least one close neighbor in the log and that generation quality and cost are smoothly tied to prompt embeddings. The evaluation is mainly conducted with an in-distribution setting, and it is unclear if the proposed method is robust enough to handle the out-of-distribution challenge and drifting query patterns in dynamic online settings.

A: Thank you for your comments. We believe there may be a misunderstanding, and we would like to clarify that the assumptions we make, discussed in detail in Appendix B.1, are all natural and mild. In particular, Assumption 1 is grounded in the core idea of informative query routing: we assume there exist historical data points that are informative for routing future queries. This aligns with the intuition behind nearly all prior predictive LLM routing methods [1, 2, 3, 4], which assume that some queries in the training or historical data share routing-relevant features with future queries. If no such informative historical data exists, then leveraging training/historical data for predictive routing becomes impossible, and all previous methods, whether they train language models or low-rank models on the historical data, reduce to random routing, as discussed in Appendix B.1. In that case, no meaningful inference can be made without directly querying the LLMs: one must query all LLMs to make informed routing decisions, because in the worst case, under an adversarial query sequence targeting the order in which the system queries LLMs, every LLM except the last may exhibit extremely low performance. This underscores the importance of Assumption 1, which connects historical data to future queries, enabling meaningful routing and improved performance.

To evaluate the robustness of our algorithm, we construct a challenging out-of-distribution setting using RouterBench, which consists of 13 distinct datasets with substantially different distributions (see Table 2 in the Appendix). We use a single dataset (MMLU) in RouterBench as historical data and treat the remaining 12 datasets (including a Chinese benchmark, GSM8K, and others) as the test set. We compare our algorithm against all baselines under the main setting in both the OOD and noisy settings. The detailed results are as follows:

Table 3. Routing performance in the out-of-distribution setting: Comparison between our algorithm and baselines on RouterBench

| Algorithm | Performance | Cost | Performance per Cost | Throughput |
|---|---|---|---|---|
| Random | 1495.35 | 0.518 | 2888.14 | 3308 |
| Greedy-Perf | 807.15 | 0.252 | 3204.76 | 1643 |
| Greedy-Cost | 1677.45 | 0.561 | 2989.30 | 4044 |
| KNN-Perf | 816.10 | 0.254 | 3212.31 | 1658 |
| KNN-Cost | 1679.70 | 0.561 | 2993.23 | 4047 |
| BatchSplit | 1828.55 | 0.526 | 3474.42 | 4283 |
| Roberta-Perf | 9.50 | 0.017 | 559.15 | 10 |
| Roberta-Cost | 1692.65 | 0.562 | 3013.92 | 4037 |
| Multi-RouteLLM (MF) | 924 | 0.247 | 3734.01 | 2227 |
| Multi-RouteLLM (BERT) | 908.60 | 0.248 | 3657.92 | 2239 |
| CARROT (RoBERTa) | 459.40 | 0.121 | 3783.47 | 738 |
| Ours | 2050.8 | 0.532 | 3851.53 | 4644 |

As shown in the table, even under this challenging OOD setting, our algorithm consistently outperforms all baselines. It achieves the highest overall performance, the best performance-per-cost ratio, and the greatest throughput. These results highlight the strong robustness of our method to distribution shift and changing query patterns.

Q4: What is the query-level latency? How does it compare to other routing strategies including trained lightweight routers?

A: Thank you for raising this point. We provide a detailed comparison of query-level routing latency on RouterBench under varying numbers of test queries. All implementations are in Python, and each method is executed using a single CPU thread. For model-based methods that require GPU computation, we follow the same setup as in [1] and use a single NVIDIA L4 GPU (24GB). Across all methods, we fix the processing batch size to 256 for consistency. The detailed latency results are summarized below, where our method achieves the fastest routing among all baselines (except the random routing strategy, which exhibits significantly poorer performance).

Table 4. Per-query routing latency (ms) of different algorithms on RouterBench under varying numbers of queries

| Algorithm | 4000 queries | 6000 queries | 8000 queries | 10000 queries |
|---|---|---|---|---|
| Greedy-Perf | 0.203 | 0.203 | 0.201 | 0.203 |
| Greedy-Cost | 0.233 | 0.230 | 0.226 | 0.233 |
| KNN-Perf | 0.478 | 0.475 | 0.469 | 0.475 |
| KNN-Cost | 0.479 | 0.520 | 0.508 | 0.506 |
| BatchSplit | 0.300 | 0.282 | 0.281 | 0.278 |
| Roberta-Perf | 26.978 | 26.322 | 26.589 | 26.578 |
| Roberta-Cost | 27.000 | 26.373 | 26.566 | 26.591 |
| Multi-RouteLLM (MF) | 0.393 | 0.377 | 0.364 | 0.366 |
| Multi-RouteLLM (BERT) | 13.620 | 13.568 | 13.661 | 13.594 |
| CARROT (RoBERTa) | 26.015 | 25.413 | 25.644 | 25.535 |
| Ours | 0.060 | 0.060 | 0.060 | 0.064 |

These results clearly demonstrate that our algorithm consistently achieves the lowest per-query routing latency across all settings, significantly outperforming all baselines. Notably, compared to model-based routers, our method achieves over 200× speed-up. Note that our current implementation, which is written in Python and runs on a single CPU thread, can be further optimized. Since it requires no GPU operations and handles queries independently, it can easily be reimplemented in a lower-level language such as C++ and parallelized to achieve even greater speed-ups in practice.

[1] Ong, Isaac, et al. "Routellm: Learning to route llms with preference data." arXiv preprint arXiv:2406.18665 (2024).

[2] Somerstep, Seamus, et al. "Carrot: A cost aware rate optimal router." arXiv preprint arXiv:2502.03261 (2025).

[3] Hu, Qitian Jason, et al. "Routerbench: A benchmark for multi-llm routing system." arXiv preprint arXiv:2403.12031 (2024).

[4] Wang, Xinyuan, et al. "Mixllm: Dynamic routing in mixed large language models." arXiv preprint arXiv:2502.18482 (2025).

Comment

Dear Reviewer RvpU,

Did we address all your concerns satisfactorily? If your concerns have not been resolved, could you please let us know which concerns were not sufficiently addressed so that we have a chance to respond?

Many thanks, The authors

Comment

Thanks for your response. While I appreciate the additional results and explanations, I believe the paper needs another round of revision to consolidate the extra supporting experiments and strengthen the associated discussions. Therefore, I’m leaning toward a borderline rating.

Comment

Dear Reviewer RvpU,

We respectfully disagree. Our submission already contains strong empirical results over all the prior state-of-the-art baselines on all the relevant datasets that we are aware of in the literature. In addition, we provide strong theoretical guarantees which were not present in prior works.

The new experiments we added at your request are very much orthogonal to the main contribution of the paper. For example, they focus on a synthetic noisy setting that is not present in any of the actual real-world datasets in the submission. Thus, we believe the new experiments can easily be included in the appendix, since they were created to address the specific concerns raised in your initial review, while the main experiments and contributions are already complete in the current submission.

Many thanks for your considerations, The authors

Review (Rating: 4)

This paper proposes a training-free, online routing algorithm to efficiently handle high-volume, budget-constrained LLM inference requests. Unlike prior methods that rely on heavy offline training or dynamic retraining, the proposed method:

  1. Uses Approximate Nearest Neighbor Search (ANNS) to estimate query-level performance and cost features from historical data.
  2. Learns a single global routing strategy $\gamma^\star$ based on a small subset of initial queries.
  3. Applies this learned strategy for online query routing using a principled dual formulation of a relaxed MILP.

It demonstrates strong theoretical guarantees (a competitive ratio of $1 - o(1)$) and empirical results showing up to 3.55× better performance, 1.85× better cost efficiency, and 4.25× higher throughput over 8 competitive baselines on 3 benchmarks.
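For reference, a budgeted-routing MILP consistent with this description, written in the $d_{ij}$/$g_{ij}$ notation from the author responses (the per-model budgets $B_i$ are our labeling; this is a sketch, not necessarily the paper's exact formulation):

```latex
% Sketch of the routing MILP, inferred from the rebuttal's notation
% (d_{ij}: performance of query j on LLM i, g_{ij}: its cost);
% the per-model token budgets B_i are our labeling.
\begin{align*}
\max_{x}\quad & \sum_{i}\sum_{j} d_{ij}\, x_{ij}
  && \text{(total routed performance)}\\
\text{s.t.}\quad
  & \sum_{j} g_{ij}\, x_{ij} \le B_{i} \quad \forall i
  && \text{(token budget of each LLM)}\\
  & \sum_{i} x_{ij} \le 1 \quad \forall j
  && \text{(each query routed at most once)}\\
  & x_{ij} \in \{0,1\},
  && \text{relaxed to } x_{ij} \in [0,1] \text{ in the LP.}
\end{align*}
```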

Strengths and Weaknesses

Strengths:

  1. This is the first routing method for online multi-LLM serving that is training-free, addressing the scalability bottlenecks of prior methods in realistic deployments.
  2. Solid formulation of routing as a MILP and its LP relaxation.
  3. Compared against 8 baselines across multiple metrics and settings.

Weaknesses:

  1. While the paper claims low-latency routing, it doesn’t provide absolute latency measurements (e.g., milliseconds per query) or breakdowns.

Questions

  1. How sensitive is the routing performance to the quality and size of the historical dataset?
  2. How often should $\gamma^\star$ be re-learned in a real system?
  3. Could reinforcement learning improve the second-stage routing?
  4. Can the proposed method be applied for VLM (vision language model)?

Limitations

yes

Final Justification

My concerns are addressed.

Formatting Concerns

N/A

Author Response

Q1: While the paper claims low-latency routing, it doesn’t provide absolute latency measurements (e.g., milliseconds per query) or breakdowns.

A: Thank you for raising this point. We provide a detailed comparison of per-query routing latency on RouterBench under varying numbers of test queries. All methods are implemented in Python and executed using a single CPU thread. For model-based approaches requiring GPU computation, we follow the same setup as in [1], using a single NVIDIA L4 GPU (24GB). To ensure consistency, we fix the processing batch size to 256 across all methods. The latency results are summarized below, showing that our method achieves the fastest routing among all baselines (except the random routing strategy, which exhibits significantly poorer performance).

Table 1. Per-query routing latency (ms) of different algorithms on RouterBench under varying numbers of queries

| Algorithm | 4000 queries | 6000 queries | 8000 queries | 10000 queries |
|---|---|---|---|---|
| Greedy-Perf | 0.203 | 0.203 | 0.201 | 0.203 |
| Greedy-Cost | 0.233 | 0.230 | 0.226 | 0.233 |
| KNN-Perf | 0.478 | 0.475 | 0.469 | 0.475 |
| KNN-Cost | 0.479 | 0.520 | 0.508 | 0.506 |
| BatchSplit | 0.300 | 0.282 | 0.281 | 0.278 |
| Roberta-Perf | 26.978 | 26.322 | 26.589 | 26.578 |
| Roberta-Cost | 27.000 | 26.373 | 26.566 | 26.591 |
| Ours | 0.060 | 0.060 | 0.060 | 0.064 |

These results clearly demonstrate that our algorithm consistently achieves the lowest per-query routing latency across all settings, significantly outperforming all baselines. Notably, compared to model-based routers, our method achieves over 200× speed-up, highlighting its strong routing efficiency. Furthermore, our current implementation, which is written in Python and executed on a single CPU thread, can be further optimized. Since the algorithm requires no GPU operations, it can easily be reimplemented in a lower-level language such as C++ for further improvement. Additionally, because routing decisions for individual queries are made independently, the algorithm is inherently parallelizable and well-suited for multi-threaded execution. These optimizations can lead to substantial additional speed-ups in real-world deployments.

Q2: How sensitive is the routing performance to the quality and size of the historical dataset?

A: Thanks for your comments. We have already included detailed experiments on the impact of historical data size on routing performance in Appendix C.5 (see Figure 11). The results show that varying the amount of historical data has a negligible effect on our algorithm’s performance. Across all data sizes and settings, our method consistently and significantly outperforms all baselines, demonstrating strong robustness. Notably, even with as few as 5,000 historical records, our algorithm remains highly effective for routing. To assess sensitivity to data quality, we consider two challenging settings: noisy labels and out-of-distribution (OOD) historical data.

In the noisy setting, we introduce two types of strong noise simultaneously. For each performance label $d_{ij}$, we randomly flip its value with a probability of 20%. For each cost label $g_{ij}$, we apply multiplicative, mean-preserving noise consisting of: (i) log-normal jitter $\left(\log \mathcal{N}(0, \sigma^2) - \frac{1}{2}\sigma^2\right)$ with $\sigma = 0.25$, and (ii) a spike factor of 3.0 applied with 2% probability.

In the OOD setting, we construct a distribution shift using RouterBench, which contains 13 diverse datasets (see Table 2 in the Appendix). We use MMLU as the historical dataset and evaluate on the remaining 12 datasets, which differ significantly in data distribution. In both settings, our algorithm consistently outperforms all baselines under the main evaluation protocol. The detailed results are as follows:

Table 2. Routing performance under noisy historical data: our algorithm vs. baselines on RouterBench

| Algorithm | Performance | Cost | Performance per Cost | Throughput |
|---|---|---|---|---|
| Random | 1403.250 | 0.419 | 3350.430 | 3227 |
| Greedy-Perf | 1323.600 | 0.370 | 3573.750 | 2471 |
| Greedy-Cost | 1660.200 | 0.461 | 3604.720 | 4047 |
| KNN-Perf | 1323.950 | 0.369 | 3585.390 | 2472 |
| KNN-Cost | 1660.200 | 0.461 | 3604.720 | 4047 |
| BatchSplit | 1757.250 | 0.456 | 3856.680 | 3974 |
| Roberta-Perf | 83.000 | 0.045 | 1836.330 | 84 |
| Roberta-Cost | 358.000 | 0.075 | 4787.593 | 1616 |
| Ours | 2319.650 | 0.444 | 5221.280 | 4775 |

Table 3. Routing performance in the out-of-distribution setting: Comparison between our algorithm and baselines on RouterBench

| Algorithm | Performance | Cost | Performance per Cost | Throughput |
|---|---|---|---|---|
| Random | 1495.35 | 0.518 | 2888.14 | 3308 |
| Greedy-Perf | 807.15 | 0.252 | 3204.76 | 1643 |
| Greedy-Cost | 1677.45 | 0.561 | 2989.30 | 4044 |
| KNN-Perf | 816.10 | 0.254 | 3212.31 | 1658 |
| KNN-Cost | 1679.70 | 0.561 | 2993.23 | 4047 |
| BatchSplit | 1828.55 | 0.526 | 3474.42 | 4283 |
| Roberta-Perf | 9.50 | 0.017 | 559.15 | 10 |
| Roberta-Cost | 1692.65 | 0.562 | 3013.92 | 4037 |
| Ours | 2050.8 | 0.532 | 3851.53 | 4644 |

As shown in the above tables, our algorithm consistently outperforms all baselines under both noisy historical labels and out-of-distribution settings. In both cases, it achieves the highest overall performance, best performance-per-cost ratio, and greatest throughput, demonstrating strong robustness to label noise and varying distribution shift.

Q3: How often should $\gamma^\star$ be re-learned in a real system?

A: Thank you for the question. In our approach, the routing weights $\gamma^\star$ are learned based on the number of upcoming queries to be routed and the chosen fraction ratio $\epsilon$. For example, with $\epsilon = 0.025$ and a total query batch of 10000, only 250 initial queries are needed to learn the routing weights $\gamma^\star$. In practice, incoming queries can be divided into sequential chunks (e.g., of size 10000), and $\gamma^\star$ is relearned for each chunk: after finishing routing one chunk, we relearn $\gamma^\star$ for the next. The overhead of this learning process is negligible: approximately 0.03 s on 250 samples and 0.05 s on 500 samples, and it can be reduced further. Thus, even in high-throughput scenarios (e.g., 10000 queries per second), the routing weight optimization adds only negligible overhead (0.03 s), making the relearning process both efficient and practical for real systems.
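A sketch of this chunked relearning loop; `learn_gamma` and `route_with` are hypothetical stand-ins for the paper's one-shot dual-LP optimization and its per-query routing rule:

```python
def route_stream(queries, learn_gamma, route_with, chunk=10_000, eps=0.025):
    """Sketch of the chunked relearning scheme described above."""
    decisions = []
    for start in range(0, len(queries), chunk):
        block = queries[start:start + chunk]
        n_init = max(1, int(eps * len(block)))

        # Relearn gamma* on the initial queries of this chunk
        # (~0.03 s for 250 samples, per the rebuttal's measurements).
        gamma = learn_gamma(block[:n_init])

        # Route the whole chunk with the freshly learned weights.
        decisions.extend(route_with(q, gamma) for q in block)
    return decisions
```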

Q4: Could reinforcement learning improve the second-stage routing?

A: Thanks for your suggestion. Using reinforcement learning (RL) for improving the second-stage routing is indeed an interesting direction. In our current design, we learn the routing weights efficiently via a one-shot optimization of the dual LP, which is highly efficient and requires no costly online training process. While RL could offer benefits such as continual adaptation to changing query patterns and integration of richer feedback signals, it also introduces extra complexity and exploration overhead. We consider RL-based routing a promising avenue for future work.

Q5: Can the proposed method be applied for VLM (vision language model)?

A: Thanks for the question. Our routing framework is largely modality-agnostic and only requires that the processed queries can be mapped into a vector space (embedding). In a VLM scenario, one can first encode each image (and its accompanying text, if any) into a joint embedding (e.g., via CLIP or another vision-language encoder). With this joint embedding, our algorithm can be naturally extended to the VLM setting by applying the same procedure: using ANNS to retrieve informative historical data points and performing online cost-efficient routing accordingly. We view this extension as an exciting future direction.
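As an illustration, a joint image-text embedding could be produced with an off-the-shelf CLIP encoder from sentence-transformers; the averaging fusion and all file names below are our assumptions, not part of the paper:

```python
from PIL import Image
from sentence_transformers import SentenceTransformer

# Minimal sketch of the suggested VLM extension, assuming a CLIP
# encoder; the fusion rule (simple averaging in the shared space)
# is our illustration, not the paper's design.
model = SentenceTransformer("clip-ViT-B-32")

image_emb = model.encode(Image.open("query.png"))       # image branch
text_emb = model.encode("Describe the chart above.")    # text branch
joint_emb = (image_emb + text_emb) / 2.0

# joint_emb can now feed the same ANNS retrieval + routing pipeline.
```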

[1] Ong, Isaac, et al. "Routellm: Learning to route llms with preference data." arXiv preprint arXiv:2406.18665 (2024).

Comment

Dear Reviewer toJm,

Did we address all your concerns satisfactorily? If your concerns have not been resolved, could you please let us know which concerns were not sufficiently addressed so that we have a chance to respond?

Many thanks, The authors

Comment

Dear Reviewer toJm,

Thank you for filling out the rebuttal acknowledgement. We believe we have addressed your main weakness of addressing the latency of our algorithm. Please let us know if this concern was not fully addressed.

Many thanks, The authors

Comment

Thanks, the rebuttal addressed my concerns. I would like to keep my rating.

Review (Rating: 3)

Existing works on LLM routing primarily focus on offline scenarios and struggle to adapt to online settings with high query volume and constrained token budgets.

This paper introduces the first training-free algorithm for online routing scenarios. The algorithm leverages approximate nearest neighbor search to efficiently estimate query features and performs a one-time optimization over a small set of initial queries to learn a routing strategy that guides future routing.

The algorithm estimates the performance scores and costs efficiently using a historical dataset, and learns a routing strategy from a small subset of observed queries (of size $\epsilon |Q|$) to guide future query routing.

Experiments on simulated traces demonstrate its performance.

Strengths and Weaknesses

Strengths

  1. Improving the efficiency of LLM routing is an important topic.

  2. The proposed method with approximate nearest neighbor search and online learning is helpful.

  3. Theoretical analysis and experiments are provided.

Weaknesses

  1. Such approximate nearest neighbor methods rely heavily on the training data. For example, RouterBench has 36,497 examples; the paper uses 26,497 queries as historical data to construct the search examples and 10,000 queries for testing.

  2. Stronger model-based methods like BERT can be added. The router running time of these methods can also be compared.

  3. Experiments are only on simulated traces.

Questions

  1. Move Table 1 to the next page.

Limitations

yes

Final Justification

After discussion with authors, the paper needs to improve the experiment setup to provide a fair comparison with baselines and consolidate the extra supporting experiments.

Formatting Concerns

N/A

Author Response

Q1: Such approximate nearest neighbor methods highly rely on the training data. E.g. Routerbench has 36,497 examples. The paper uses 26,497 queries as historical data to construct the search examples and 10,000 queries for testing.

A: Thanks for the comments. We would like to clarify that our algorithm does not rely heavily on historical data. As we already showed in Appendix C.5 (Figure 11), varying the amount of historical data used for ANNS has minimal impact on performance. Our method consistently and significantly outperforms all baselines across all data sizes, demonstrating strong robustness. Notably, even with as few as 5000 historical records, our algorithm remains highly effective for routing.

Q2: Stronger model-based methods like BERT can be added.

A: We appreciate the suggestion. Following the main setting in the paper, we train two separate BERT-based models to predict the generation performance score and cost for each query, resulting in two additional BERT-based routing baselines: (i) BERT-perf-routing, which routes each query to the model with the highest predicted performance; (ii) BERT-cost-routing, which routes each query to the model with the greatest available budget. We compare our algorithm with these two baselines on RouterBench under the main experimental setting. The detailed results are as follows:

Table 1. Comparison of Our Routing Algorithm with BERT‑Based Methods on RouterBench

| Algorithm | Performance | Cost | Performance per Cost | Throughput |
|---|---|---|---|---|
| Bert-Perf | 384.25 | 0.154 | 2480.87 | 452 |
| Bert-Cost | 630.35 | 0.145 | 4340.5 | 1156 |
| Ours | 2718.6 | 0.447 | 6075.58 | 5195 |

As shown in the above results, our method achieves over 4× higher performance and over 3× higher throughput than the strongest BERT-based baseline, while maintaining favorable cost efficiency.

Q3: The router running time of these methods can also be compared.

A: Thank you for raising this point. We provide a detailed comparison of routing latency on RouterBench under varying numbers of test queries. All implementations are in Python, and each method is executed using a single CPU thread. For model-based methods that require GPU computation, we follow the same setup as in [1] and use a single NVIDIA L4 GPU (24GB). Across all methods, we fix the processing batch size to 256 for consistency. The detailed latency results are summarized below, demonstrating the routing efficiency of our algorithm: our method achieves the fastest routing among all baselines (except the random routing strategy, which exhibits significantly poorer performance):

Table 2. Total routing latency (s) comparison of routing algorithms on RouterBench under varying numbers of queries

| Algorithm | 4000 queries | 6000 queries | 8000 queries | 10000 queries |
|---|---|---|---|---|
| Greedy-Perf | 0.81 | 1.22 | 1.61 | 2.03 |
| Greedy-Cost | 0.93 | 1.38 | 1.81 | 2.33 |
| KNN-Perf | 1.91 | 2.85 | 3.75 | 4.75 |
| KNN-Cost | 1.92 | 3.12 | 4.07 | 5.06 |
| BatchSplit | 1.20 | 1.69 | 2.25 | 2.78 |
| Roberta-Perf | 107.91 | 157.93 | 212.71 | 265.78 |
| Roberta-Cost | 108.00 | 158.24 | 212.53 | 265.91 |
| Bert-Perf | 111.77 | 164.98 | 221.64 | 277.28 |
| Bert-Cost | 112.76 | 165.84 | 223.32 | 277.32 |
| Ours | 0.24 | 0.36 | 0.48 | 0.64 |

These results clearly show that our algorithm consistently achieves the lowest routing latency across all settings, outperforming all baseline methods and demonstrating its strong routing efficiency. Furthermore, our current implementation, written in Python on a single CPU thread, can be further optimized. Since the algorithm does not require any GPU resources or operations, it can easily be reimplemented in a faster programming language such as C++. In addition, because routing decisions for individual queries are made independently based on the learned routing weights, the algorithm is naturally parallelizable and well suited for multi-threaded execution. Applying these optimizations can provide substantial further speed-ups in practice.

Q4: Experiments are only on simulated traces.

A: Thank you for your comment. To the best of our knowledge, we have evaluated our method on almost all publicly available multi-LLM routing benchmarks [2, 3, 4] that are widely used in the literature and research community. We are not aware of any additional public datasets in this space. If the reviewer has any pointers to specific labeled traffic traces, we would be happy to incorporate them into further experiments.

Q5: Move Table 1 to the next page

A: Thank you for the suggestion. We will move Table 1 to an appropriate position on the next page in the revised version.

[1] Ong, Isaac, et al. "Routellm: Learning to route llms with preference data." arXiv preprint arXiv:2406.18665 (2024).

[2] Hu, Qitian Jason, et al. "Routerbench: A benchmark for multi-llm routing system." arXiv preprint arXiv:2403.12031 (2024).

[3] Open llm leaderboard v2. https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard, 2024.

[4] Somerstep, Seamus, et al. "Carrot: A cost aware rate optimal router." arXiv preprint arXiv:2502.03261 (2025).

Comment

Dear Reviewer f1oy,

Did we address all your concerns satisfactorily? If your concerns have not been resolved, could you please let us know which concerns were not sufficiently addressed so that we have a chance to respond?

Many thanks, The authors

Comment

Dear Reviewer f1oy,

Thank you for filling out the rebuttal acknowledgement. We would like to note that one of your main concerns (the size of the historical dataset) was already addressed in the submission. In addition, we used all the datasets in the literature that we are aware of. If you have a suggestion, we are happy to incorporate it, but we do not believe we have missed any datasets. Could you let us know if you had other datasets in mind, or engage with us in further discussion?

Many thanks, The authors

Comment

Thank you for the clarification! Could you elaborate on why BERT and RoBERTa perform worse than the random baseline? The proposed ANN approach with a language embedding model appears mechanistically similar to BERT classification, so this result is surprising.

In addition, it is unclear why varying the number of training data points does not affect model performance. Could you provide more explanation or analysis for this observation?

Comment

Q1: why BERT and RoBERTa perform worse than the random baseline?

A: Thank you for the comments. This is due to our online routing setup, as detailed in Appendix A (lines 560–564), where a sequence of queries must be routed to different LLMs within a limited time and budget.

For performance-based methods (e.g., BERT-perf and RoBERTa-perf), routing is based only on predicted performance. In the online setting, these methods tend to over-route queries to strong LLMs early on, quickly exhausting their budgets. Without considering the remaining budget, subsequent queries continue to be routed to those LLMs, resulting in unprocessed queries and reduced overall performance.

For cost-based methods, the true remaining budget is unobservable at routing time, as it depends on the actual incurred costs of prior queries, which are only known after those queries have finished. Hence, methods such as BERT-cost and RoBERTa-cost routing must rely on predicted costs to estimate the remaining budget. Estimation errors in cost prediction mislead routing decisions and reduce performance, and these errors vary across benchmarks. As shown in Table 1 in the paper, the RoBERTa-cost method outperforms random routing on SPROUT and Open LLM Leaderboard v2, but still falls short of our method.

Q2: In addition, it is unclear why varying the number of training data points does not affect model performance.

A: Thank you for the question. Our method does not involve any training phase. Instead, we treat the so-called "training data" as a historical dataset from which we retrieve informative data points using Approximate Nearest Neighbor Search (ANNS) algorithms. These retrieved data points are then used to estimate the features of incoming queries and make online routing decisions. As a result, our approach depends on retrieving representative neighbors rather than learning parameters, making performance reliant on information coverage rather than dataset size. In practice, real queries cluster in embedding space, and around 5000 randomly sampled examples are sufficient to cover these clusters. This explains the stable performance across dataset sizes and highlights a key strength of our method: it eliminates the need for model training.

We would be happy to clarify any additional questions or concerns.

Comment

If BERT can be trained to balance both performance and cost, it should have similar results to the ANN method.

I also have another question on latency: since the ANN method also needs to compute the prompt embedding, why is it much faster than BERT? Is it due to a smaller embedding model?

Around 5000 randomly sampled examples are sufficient to cover these clusters. The authors could show how the results change from 0 to 5000 sampled examples.

Comment

Q1: If BERT can be trained to balance both performance and cost, it should have similar results to the ANN method.

A: Thank you for the comments. While it is theoretically possible to train a BERT-based router to balance performance and cost simultaneously, doing so is highly challenging in practice due to the lack of reliable supervision signals for cost-efficient routing. To the best of our knowledge, no existing BERT-based routers in the literature are trained with this objective. Furthermore, even if such a cost-efficient BERT router could be trained, it would still face the same limitations as other model-based routers: (1) It is difficult to adapt to dynamic online settings, often requiring retraining, and (2) It struggles to make globally cost-efficient routing decisions under changing budget constraints. In contrast, our proposed ANN-based online routing algorithm naturally handles dynamic trade-offs and avoids these limitations without retraining.

Q2: Since the ANN method also needs to compute the prompt embedding, why is it much faster than BERT? Is it due to a smaller embedding model?

A: Thank you for the question. The routing system of all training-free methods in the paper (KNN-Routing, Perf-Routing, BatchSplit, and ours) consists of two modules: (1) an embedding module, and (2) a core online routing module. Our latency evaluation of these training-free methods focuses on the second part, the core online routing module, for two main reasons:

(1) Decoupled pipeline. The embedding step can operate asynchronously and is fully decoupled from the core routing logic.

(2) Module independence. The embedding model is interchangeable (e.g., MiniLM, GTE-small, bge-small), while our core contribution lies in the core routing algorithm itself.

Furthermore, our routing algorithm only solves a lightweight linear program over a small batch of queries in an online fashion, adding negligible overhead.

For completeness, we also report the embedding latency of the bge-base model under the same setting. As shown in the table below, even when including embedding time, our method remains over 5× faster than model-based routers.

Table 1. The latency (s) comparison of our algorithm with model-based routers on RouterBench under varying numbers of queries.

| Algorithm (Module) | 4000 queries | 6000 queries | 8000 queries | 10000 queries |
|---|---|---|---|---|
| Roberta-Perf | 107.91 | 157.93 | 212.71 | 265.78 |
| Roberta-Cost | 108.00 | 158.24 | 212.53 | 265.91 |
| Bert-Perf | 111.77 | 164.98 | 221.64 | 277.28 |
| Bert-Cost | 112.76 | 165.84 | 223.32 | 277.32 |
| Ours (bge-base embedding latency) | 21.75 | 31.11 | 42.87 | 51.91 |
| Ours (core online routing latency) | 0.24 | 0.36 | 0.48 | 0.64 |

Q3: Around 5000 randomly sampled examples are sufficient to cover these clusters. The authors could show how the results change from 0 to 5000 sampled examples.

A: Thanks for your suggestion. We report results with historical data sizes varying from 5 to 5000 in the table below. As shown, even with only 50 samples, our method achieves a performance of ~2273, which is already significantly better than all baselines. Compared to the full-data performance (~2700), the drop is modest, demonstrating strong performance even under limited historical data.

When the number of historical samples drops to just 5, performance degrades more noticeably. This aligns with our theoretical analysis: when there is little or no correlation between historical data and future queries, informative routing becomes provably impossible.

Notably, with around 2000 samples, our method already achieves near-optimal performance, matching the result with all historical data. These results demonstrate the robustness of our algorithm under varying amounts of historical data points.

Table 2. Results when varying the number of historical data points from 5 to 5000

| Size | Performance | Cost | Performance per Cost | Throughput |
|---|---|---|---|---|
| 5 | 491.0 | 0.141 | 3479.42 | 1149 |
| 50 | 2273.35 | 0.456 | 4978.16 | 4815 |
| 100 | 2369.35 | 0.433 | 5469.38 | 4921 |
| 200 | 2294.40 | 0.439 | 5222.18 | 4887 |
| 300 | 2409.50 | 0.416 | 5791.87 | 4812 |
| 400 | 2586.45 | 0.444 | 5828.73 | 5141 |
| 500 | 2640.35 | 0.443 | 5959.35 | 5173 |
| 600 | 2656.60 | 0.448 | 5933.56 | 5255 |
| 800 | 2557.15 | 0.448 | 5712.32 | 4990 |
| 1000 | 2550.85 | 0.449 | 5676.95 | 5052 |
| 1500 | 2637.05 | 0.438 | 6022.88 | 5171 |
| 2000 | 2692.45 | 0.434 | 6201.30 | 5216 |
| 2500 | 2658.45 | 0.427 | 6230.55 | 5070 |
| 3000 | 2792.65 | 0.448 | 6229.25 | 5246 |
| 3500 | 2747.55 | 0.429 | 6405.52 | 5267 |
| 4000 | 2730.15 | 0.439 | 6212.63 | 5204 |
| 4500 | 2731.65 | 0.429 | 6373.63 | 5266 |
| 5000 | 2686.15 | 0.433 | 6206.67 | 5200 |
Comment

Instead of BERT, if the router were trained on the bge-base model, would it have the same latency as the proposed method?

Comment

Q: Instead of BERT, if the router were trained on the bge-base model, would it have the same latency as the proposed method?

A: Thank you for the question. If we fine-tune the same bge model used for embedding as a router, the overall latency would indeed be similar to ours. However, the performance would be much worse, since our method already improves over BERT-based routers. In fact, imagine an extreme case where we make random decisions: the latency is effectively 0, but performance drops significantly (see Table 1 in the submission).

Furthermore, as discussed above, any model-based routers still face several key limitations in online settings (even if we imagine using a hypothetical model that has 0 latency but somehow magical performance):

(1) Lack of adaptability. Model-based routers must be retrained whenever the model pool or budget constraints change, which makes them hard to maintain in dynamic deployment environments.

(2) Limited cost-efficiency. Such routers often optimize for a fixed objective (performance or cost) and struggle to adapt to dynamic query arrivals.

(3) No modular reuse. The model-based router is tightly coupled with the router and cannot be reused for other tasks. In contrast, our embedding module is decoupled, allowing the same embeddings to be shared across routing, retrieval, logging, and other downstream tasks.

Thanks again for your comments, and we hope this clarifies the practical advantages of our training-free approach.

Comment

Thank you for your rebuttal. I will carefully consider your points and discuss them with the other reviewers.

Review (Rating: 5)

The paper introduces a training-free algorithm for optimizing the routing of user queries to multiple LLMs in online settings with constrained token budgets. The algorithm builds an offline ANNS index over historical data for estimating the performance and cost of each deployed LLM, and performs a one-time optimization over a small subset of initial queries to learn routing weights. These weights then guide the routing decisions for online queries, ensuring high performance and cost efficiency.

Strengths and Weaknesses

Advantages:

  1. The performance of the method is good. It achieves an average improvement of 3.55× in overall performance, 1.85× in cost efficiency, and nearly 4.25× in throughput.

  2. The method does not require training or frequent finetuning of existing models.

  3. The method appears not very sensitive to the distribution of user queries. Even if the query distribution shifts in the online setting, it only requires an update to the HNSW index by inserting the new query data, and this process can be performed offline.

  4. The writing is clear.

  5. The evaluations are comprehensive.

What I like most about this paper is that their method is simple but effective. It does not make strong or unrealistic assumptions and is suitable for real-world deployment. My major concern is that the method only considers cost and efficiency as features for model selection. However, in a real-world setting, the traffic of queries is also important. If a high volume of queries arrives at the same time and they are all routed to the same LLM, there may be traffic congestion. For example, if a large number of users ask similar questions in a short amount of time and the HNSW index suggests routing all the queries to the same LLM, inference efficiency would be very low while the other LLMs sit idle. The arrival times and ordering of different queries also matter. I understand that solving this problem isn't straightforward, and linear programming might not be suited to it. However, the problem is practical and important in a real-world setting.

Questions

  1. What parameters have you set for HNSW? I only saw the number of neighbors in the appendix. However, the number of layers of HNSW is not specified. The parameters of HNSW might impact the evaluation results. Could you add evaluations for that?

  2. See weakness.

  3. It should be HNSW, not HWSN, throughout the paper.

Limitations

The method only considers cost and efficiency as features for model selection but does not consider the traffic of queries. The authors mention adding additional constraints to the original MILP to incorporate other routing factors. However, this is not entirely accurate, as incorporating sequences of similar types of queries is not straightforward in a MILP.

Final Justification

The authors have addressed my concern.

Formatting Concerns

N/A

Author Response

Q1: The method only considers cost and efficiency as features for model selection but does not consider the traffic of queries. In a real-world setting, the traffic of queries is also important.

A: Thank you for your comments. We acknowledge that query traffic patterns play an important role in real-world systems. However, the primary goal of our work is to develop a cost-efficient online routing algorithm, a problem that has gained increasing attention in recent literature [1, 2, 3], especially given the high computational cost of LLMs. To isolate this core algorithmic problem, we adopt a standard assumption widely used in the online algorithms literature [4, 5, 6]: that incoming queries arrive in a randomly permuted order. This assumption, also discussed in detail in Appendix B.1, allows us to abstract away complex traffic dynamics and focus on the fundamental cost-efficiency routing problem.

We would like to clarify that, under this assumption, our framework is designed to operate on query-intrinsic features that are independent of time and query order, such as response performance score and cost per query across the models. In contrast, operational-state features, such as current queue lengths or system load, are beyond the scope of our current formulation and are more aligned with a related but orthogonal problem: load balancing under traffic constraints.

Still, we believe our algorithm offers a natural foundation for incorporating such traffic-aware considerations. One potential approach is to discretize time into short windows and track the per-model backlog of unserved workload. Within each time window, a corresponding "time-bucket" MILP can be solved with additional traffic constraints, and any unserved workload is carried over to the next window. For instance, let $\ell_{ij}$ denote the service time of query $j$ on LLM $i$, $b_{it}$ the backlog of unserved workload for LLM $i$ at the start of time window $t$, and $A_{it}$ the available service capacity of LLM $i$ during window $t$. Then, for each window $t$ and LLM $i$, we can add the following constraint to the MILP: $$\sum_j \ell_{ij} x_{ij} \leq \max\{0,\, A_{it} - b_{it}\}, \quad \forall i, t.$$ This "time-bucket" formulation helps smooth bursts and reduce queuing delays.
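As an illustration, the window-level MILP with this traffic constraint could be assembled with an off-the-shelf solver such as PuLP. This is a sketch under the assumptions above; all inputs are placeholders:

```python
import pulp

def window_milp(d, g, l, budgets, A, b, n_models, n_queries):
    """Sketch of the 'time-bucket' MILP for one window.

    d[i][j], g[i][j] -- performance/cost of query j on LLM i
    l[i][j]          -- service time of query j on LLM i
    A[i], b[i]       -- capacity and backlog of LLM i this window
    """
    prob = pulp.LpProblem("time_bucket_routing", pulp.LpMaximize)
    x = pulp.LpVariable.dicts(
        "x", (range(n_models), range(n_queries)), cat="Binary")

    # Objective: total routed performance in this window.
    prob += pulp.lpSum(d[i][j] * x[i][j]
                       for i in range(n_models) for j in range(n_queries))
    for j in range(n_queries):
        prob += pulp.lpSum(x[i][j] for i in range(n_models)) <= 1
    for i in range(n_models):
        prob += pulp.lpSum(g[i][j] * x[i][j]
                           for j in range(n_queries)) <= budgets[i]
        # Traffic constraint from above: served work cannot exceed the
        # leftover capacity after the carried-over backlog.
        prob += pulp.lpSum(l[i][j] * x[i][j]
                           for j in range(n_queries)) <= max(0.0, A[i] - b[i])

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return x
```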

In the revision, we will further explicitly clarify the scope of our current MILP formulation and also discuss such possible extensions of our algorithm.

Q2: What parameters have you set for HNSW? The number of layers of HNSW is not specified. The parameters of HNSW might impact the evaluation results. Could you add evaluations for that?

A: Thank you for pointing this out. We use the standard and recommended HNSW implementation from hnswlib with the following parameter settings in our paper: construction effort factor ef_construction=200, max number of neighbors per node M = 32, and query search breadth ef = 50. Regarding the number of layers in hnswlib, it is not a direct user-configurable hyperparameter. Instead, it is an emergent property of the graph determined during construction, which is primarily controlled by the M parameter. A smaller M tends to produce a taller, multi-layered graph, while a larger M results in a flatter, more interconnected one.
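For reference, these settings correspond to the following standard hnswlib usage (a sketch; the dimension and data below are placeholders):

```python
import hnswlib
import numpy as np

# Placeholder data standing in for historical prompt embeddings.
dim, n_hist = 384, 26_497
embeddings = np.random.rand(n_hist, dim).astype(np.float32)

# Build the index with the parameters reported above.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n_hist, ef_construction=200, M=32)
index.add_items(embeddings, np.arange(n_hist))
index.set_ef(50)  # query-time search breadth

# Retrieve the 10 nearest historical queries for one probe.
labels, distances = index.knn_query(embeddings[:1], k=10)
```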

To evaluate the robustness of our algorithm to these three key parameters, we conduct ablation studies on RouterBench under the main evaluation setting. Specifically, we vary ef_construction from 100 to 1200, M from 8 to 96, and ef from 32 to 384. The ablation results, shown in the following tables, demonstrate that our routing algorithm remains highly robust across a wide range of values for all three parameters:

Table 1. Performance of our algorithm on RouterBench with varying ef_construction

| ef_construction | Performance | Cost | Performance per Cost | Throughput |
|---|---|---|---|---|
| 100 | 2785.40 | 0.46 | 6107.40 | 5305 |
| 200 | 2744.85 | 0.45 | 6147.22 | 5196 |
| 300 | 2738.35 | 0.45 | 6085.40 | 5215 |
| 400 | 2747.90 | 0.45 | 6133.20 | 5217 |
| 600 | 2715.80 | 0.44 | 6105.08 | 5168 |
| 800 | 2726.80 | 0.45 | 6104.13 | 5169 |
| 1200 | 2734.85 | 0.45 | 6125.43 | 5192 |

Table 2. Performance of our algorithm on RouterBench with varying M

| M | Performance | Cost | Performance per Cost | Throughput |
|---|---|---|---|---|
| 8 | 2596.60 | 0.44 | 5923.04 | 5084 |
| 16 | 2764.15 | 0.45 | 6107.89 | 5272 |
| 24 | 2745.45 | 0.45 | 6110.49 | 5211 |
| 32 | 2744.85 | 0.45 | 6147.22 | 5196 |
| 48 | 2729.20 | 0.45 | 6116.41 | 5194 |
| 64 | 2726.40 | 0.45 | 6075.57 | 5201 |
| 96 | 2763.35 | 0.45 | 6104.96 | 5214 |

Table 3. Performance of our algorithm on RouterBench with varying ef

| ef | Performance | Cost | Performance per Cost | Throughput |
|---|---|---|---|---|
| 32 | 2730.35 | 0.45 | 6032.83 | 5195 |
| 50 | 2744.85 | 0.45 | 6147.22 | 5196 |
| 80 | 2594.10 | 0.43 | 5973.32 | 5021 |
| 128 | 2706.90 | 0.44 | 6084.47 | 5170 |
| 192 | 2601.85 | 0.44 | 5914.84 | 5022 |
| 256 | 2710.80 | 0.44 | 6112.35 | 5171 |
| 384 | 2744.20 | 0.45 | 6093.61 | 5223 |

[1] Ong, Isaac, et al. "Routellm: Learning to route llms with preference data." arXiv preprint arXiv:2406.18665 (2024).

[2] Somerstep, Seamus, et al. "Carrot: A cost aware rate optimal router." arXiv preprint arXiv:2502.03261 (2025).

[3] Hu, Qitian Jason, et al. "Routerbench: A benchmark for multi-llm routing system." arXiv preprint arXiv:2403.12031 (2024).

[4] Devanur, Nikhil R., and Thomas P. Hayes. "The adwords problem: online keyword matching with budgeted bidders under random permutations." Proceedings of the 10th ACM Conference on Electronic Commerce, 2009.

[5] Mehta, Aranyak, et al. "Adwords and generalized online matching." Journal of the ACM (JACM) 54.5 (2007): 22-es.

[6] Goel, Gagan, and Aranyak Mehta. "Online budgeted matching in random input models with applications to Adwords." SODA, volume 8 (2008): 982-991.

Comment

Dear Reviewer QUtW,

Did we address all your concerns satisfactorily? If your concerns have not been resolved, could you please let us know which concerns were not sufficiently addressed so that we have a chance to respond?

Many thanks, The authors

Comment

Thank you for your explanation. I have adjusted the score.

Final Decision

This paper proposes the first training-free algorithm for online routing in multi-LLM serving under constrained token budgets. The method leverages approximate nearest neighbor search on historical queries and performs a one-time optimization over a small subset of queries to derive routing weights, which are then used for efficient online routing. The authors provide both theoretical guarantees and extensive empirical results across three benchmarks and eight baselines, showing consistent improvements in performance, cost efficiency, and throughput.

The strengths of the paper lie in its novelty, simplicity, and practicality. The method avoids retraining, is robust to data size and distribution shifts, achieves significant empirical gains, and is supported by strong theoretical analysis. The evaluations are comprehensive, with ablations, robustness tests to noise, and latency measurements, making the work relevant for real-world deployments.

The weaknesses are that some reviewers questioned the reliance on high-quality historical data, limited real-world traces, and whether stronger model-based routers could offer more competitive baselines. There were also concerns about assumptions in the theoretical proof and the lack of traffic-aware constraints in the current formulation.

The main reason for acceptance is that the paper addresses an important and timely challenge in scalable LLM deployment with a simple yet powerful solution that demonstrates both theoretical rigor and empirical robustness. While some reviewers leaned borderline due to experimental setup or scope, the rebuttal convincingly clarified key points, added strong additional baselines (e.g., BERT, CARROT), robustness evaluations (noise, OOD), and latency breakdowns. This addressed the main criticisms from reviewers QUtW, f1oy, RvpU, and toJm, with most acknowledging that their concerns were resolved.