Large Language Model Routing with Benchmark Datasets
We reuse benchmark evaluations to learn router models for selection of LLMs on unseen tasks.
Abstract
Reviews and Discussion
This paper proposes a new approach to LLM selection that does not rely on labeled data from the new task. Instead, the authors develop a "router" model trained solely on inputs from various benchmark datasets, which reduces the selection problem to a collection of binary classification tasks. Empirical results show that this method consistently outperforms using a single model for all tasks.
Reasons to Accept
(1) An interesting task setting; (2) A formal approach with extensive experiments
Reasons to Reject
(1) If I understand correctly, you assign a single LLM to solve a specific task rather than using different LLMs within that task? Intuitively, even a capable model can still underperform a weaker model on some instances of a specific task. Thus, it seems more reasonable to route instances, not datasets, to LLMs.
(2) Efficiency can potentially be an issue: since you use many LLMs, how do you effectively control the efficiency cost of maintaining multiple LLMs, as in MoE?
(3) The paper lacks a deep discussion of how this approach compares with MoE, in terms of both advantages and disadvantages.
Thank you for your review. Please find our responses below.
Routing instances vs routing datasets.
We considered both. In our HELM experiments, we chose a single LLM per dataset. We also compared to choosing an LLM per instance (MoE baseline in Table 1 and Figure 2 - note that here MoE is meant as a "classical" mixture of experts, not an MoE LLM like Mixtral), which underperformed due to the difficulty of accurately predicting correctness of LLMs on new tasks. In the MixInstruct experiment (Table 2), we routed per instance. Here our method outperforms or matches recent baselines while being a lot more efficient as it requires generating only with the selected LLM (and a much smaller sentence transformer for obtaining the input embedding).
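To make the distinction concrete, below is a minimal sketch of the two routing granularities, assuming we already have per-LLM predicted-correctness scores for the inputs of a new task (the matrix and its values are purely illustrative, not our actual predictions).

```python
import numpy as np

# Hypothetical predicted-correctness matrix: probs[i, j] is the estimated
# probability that candidate LLM j answers input i of the new task correctly
# (e.g., produced by a per-LLM correctness predictor).
rng = np.random.default_rng(0)
probs = rng.uniform(size=(100, 5))  # 100 inputs, 5 candidate LLMs

# Per-dataset routing: pick one LLM for the whole task by aggregating
# predicted correctness over all of its inputs.
per_dataset_choice = int(np.argmax(probs.mean(axis=0)))

# Per-instance routing: pick a (possibly different) LLM for every input.
per_instance_choices = probs.argmax(axis=1)

print("LLM chosen for the whole dataset:", per_dataset_choice)
print("Distinct LLMs used when routing per instance:", np.unique(per_instance_choices))
```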
How do you effectively control the efficiency cost of maintaining multiple LLMs like MoE?
Our method selects from a pool of LLMs using the inputs and calls only the selected LLM for generation. All other LLMs are unused and do not consume any resources. This is more efficient than an MoE LLM. For example, consider our HELM experiment when choosing a single LLM per dataset. First, the router chooses a single LLM; this could be a 7B-parameter LLM that is then loaded onto the GPU and used for inference. An MoE LLM like Mixtral 8x7B requires storing all of its ~47B parameters on the GPU, even though only 7B are active during inference. The larger GPU memory requirement means a more advanced GPU is needed and less memory is left for the KV-cache, which slows down inference compared to routing to a 7B model with our method.
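As a rough illustration of the memory gap, assuming fp16 weights at 2 bytes per parameter and ignoring KV-cache and activations:

```python
# Back-of-the-envelope GPU memory for model weights alone, assuming fp16
# (2 bytes per parameter). Real deployments also need memory for the
# KV-cache and activations, which only widens the gap.
BYTES_PER_PARAM = 2

def weight_memory_gb(num_params_billion: float) -> float:
    return num_params_billion * 1e9 * BYTES_PER_PARAM / 1e9

print(f"Routed 7B model:     ~{weight_memory_gb(7):.0f} GB")   # ~14 GB
print(f"Mixtral 8x7B (~47B): ~{weight_memory_gb(47):.0f} GB")  # ~94 GB
```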
The paper lacks a deep discussion of how this approach compares with MoE, in terms of both advantages and disadvantages.
In addition to the discussion above, we note that the goal of our approach is somewhat orthogonal to how MoE LLMs work. Our goal is to learn the strengths of candidate LLMs from benchmark evaluations in order to choose an LLM for a new task using its inputs. MoE LLMs, despite the name, do not typically consist of topic-level "experts"; instead, they route each token to different FFN networks within each layer. Quoting the Mixtral authors (see Section 5 in [1]): "Surprisingly, we do not observe obvious patterns in the assignment of experts based on the topic." An MoE LLM should simply be considered part of the pool of candidate LLMs when using our method, as it does not have a well-defined set of expert strengths a priori. We will add a discussion of MoE LLMs to the paper.
[1] Jiang et al. Mixtral of Experts.
I have marginally raised my rating score, and I hope these comments will be incorporated into the revised version.
Thank you. We will make sure to incorporate the comments into the revised version.
This paper focuses on learning to route different LLMs for downstream tasks without using task-related labeled data. The authors propose to utilize the performance of these LLMs on benchmark datasets and train a router model to do binary classification on the LLMs' performance given the task inputs. The prediction scores are used to conduct the LLM selection. The approach leverages embedded inputs and task descriptions from benchmark datasets and employs k-nearest neighbors classifiers to predict LLM correctness, aiming to generalize to unseen tasks. Experimental results on HELM and MixInstruct datasets demonstrate the potential of selecting suitable LLMs to reduce the cost and improve overall performance. However, the authors mentioned that out-of-distribution generalization is the main challenge of LLM routers.
Reasons to Accept
- The motivation is clear.
- Extensive experimental results demonstrate the effectiveness of the proposed methods.
Reasons to Reject
- Lack of explanation of the usage of kNN models. How does the proposed method deal with the out-of-distribution challenge?
- Lack of evaluation on more realistic tasks or instructions. Some instructions require the LLMs to use multiple capabilities, or a combination of capabilities, to solve, which is hard to reflect in the benchmarks. This composition problem needs to be discussed for the proposed method.
Thank you for your review. Please find our responses below.
Lack of explanation of the usage of KNN models.
We used kNN models as they are easy to implement and perform well. Our goal is to illustrate the potential of LLM Routing using benchmark data evaluations and demonstrate the efficacy of the proposed scores. That said, we also present results with MLPs as correctness predictors in Appendix C.
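For concreteness, here is a minimal sketch of a kNN correctness predictor for a single candidate LLM, assuming inputs are embedded with a sentence transformer; the embedding model name, the value of k, and the toy data are illustrative assumptions, not our exact configuration.

```python
# Minimal sketch: one kNN correctness predictor per candidate LLM.
# Embeddings come from a sentence transformer; labels are binary
# correctness outcomes from benchmark evaluations. All names, the toy
# data, and k are illustrative assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import KNeighborsClassifier

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Benchmark inputs and whether this particular LLM answered them correctly.
benchmark_inputs = [
    "What is 2 + 2?",
    "Name the capital of France.",
    "Summarize: The cat sat on the mat.",
    "Translate 'bonjour' into English.",
]
llm_was_correct = [1, 1, 0, 1]  # toy labels standing in for benchmark evaluations

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(embedder.encode(benchmark_inputs), llm_was_correct)

# Predicted probability of correctness on inputs from a new, unseen task.
new_task_inputs = ["Summarize the plot of a short story in one sentence."]
pred_correct = knn.predict_proba(embedder.encode(new_task_inputs))[:, 1]
print(pred_correct)
```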
How does the proposed method deal with the out-of-distribution challenge?
Any correctness predictor will face challenges when generalizing OOD (kNN, MLP, or anything else). To clarify, the OOD challenge is the problem of deteriorating accuracy of the correctness predictors on unseen tasks. For example, the average accuracy of correctness predictors for the results reported in Table 1 is only 59% (quite low for a binary classification task; see the "Reducing the OOD gap" paragraph on page 7). Our key insight is that our method can still select an appropriate LLM for a new task by accounting for the errors of the correctness predictors. To achieve this, we designed S3, a score that models the confidence of the correctness predictors OOD (see "A simple OOD confidence model" at the bottom of page 4 and eq. (4)).
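As a purely generic illustration of this idea (this is not the S3 score or eq. (4) from the paper), one could shrink each predicted correctness toward the LLM's average benchmark accuracy, with the amount of shrinkage governed by a rough confidence estimate such as the distance to the nearest benchmark neighbors; the weighting scheme and all names below are assumptions.

```python
import numpy as np

def confidence_weighted_score(pred_correct, dist_to_benchmark, avg_benchmark_acc, tau=1.0):
    """Generic illustration only; not the paper's S3 score or eq. (4).

    Shrinks kNN-predicted correctness toward the LLM's average benchmark
    accuracy when new inputs are far from the benchmark data (low
    confidence), and trusts the predictor when they are close.
    """
    confidence = np.exp(-np.asarray(dist_to_benchmark) / tau)  # in (0, 1]
    pred_correct = np.asarray(pred_correct)
    return confidence * pred_correct + (1.0 - confidence) * avg_benchmark_acc

# Toy example: two new inputs, one near the benchmark data, one far from it.
print(confidence_weighted_score([0.9, 0.9], [0.1, 5.0], avg_benchmark_acc=0.6))
```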
Lack of evaluation on more realistic tasks or instructions.
We respectfully disagree. We conducted experiments on 29 tasks from HELM, one of the prominent LLM benchmarks designed to cover scenarios representative of realistic LLM use cases. Among the 29 tasks we considered, there are 11 RAFT tasks, which stands for "Real-world Annotated Few-shot Tasks." [1]
We also conducted experiments on MixInstruct, which aggregates multiple datasets designed to test instruction following abilities of LLMs in various contexts (potentially including inputs requiring compositional abilities).
[1] Alex et al. RAFT: A Real-World Few-Shot Text Classification Benchmark.
I have read the rebuttals and changed my rating score accordingly.
Dear Reviewer, as the end of the discussion period is approaching, we'd appreciate it if you could comment on our rebuttal. Thank you.
This paper proposes a method to select the best LLM from a collection of pre-trained models, given a new task. The proposed method does not assume any labeled data from the new task. It learns a "router" model for LLM selection from inputs only. Experimental results show the effectiveness of the proposed method.
Reasons to Accept
- The proposed method is well-founded, and the paper is overall well-written.
- Selecting language models on a new unseen task with low cost is a really interesting and beneficial topic.
Reasons to Reject
- The improvements are limited according to Table 2.
- No human evaluation; I think there should at least be one to generally prove that the proposed method can predict the quality of an LLM's output as expected.
Questions for the Authors
N/A
Dear Reviewer, it appears that the posted review is for a different paper. We do not consider domain-specific human evaluation of LLMs and do not propose any method named "LalaEval".
I am very sorry about this mistake; the review has been edited accordingly, with my apologies.
Thank you for updating the review. Please find our comments below.
The improvements are limited according to Table 2.
We agree that the performance improvements over Llama 2 70B in Table 2 are small. However, the improvements in terms of parameter count are fairly substantial. For example, S3 on average selects an LLM with 0.71× the parameter count of Llama 2 70B, which can reduce inference cost while still providing some performance improvement.
We also note that the results in Table 2 are for the most challenging scenario when new tasks have no labeled data. In Figure 2 we show that the performance gains from selecting LLMs with our scores can be further increased when a few labeled samples from the new task are available.
No human evaluation; I think there should at least be one to generally prove that the proposed method can predict the quality of an LLM's output as expected.
HELM tasks are amenable to automatic evaluation and do not require human evaluation. For the MixInstruct (open-ended instruction following) experiments in Section 5.2, we used the metrics proposed by the dataset's authors [1]. Although they did not perform human evaluations—which we agree could strengthen the empirical findings—they verified that the considered automatic evaluation metrics correlate well with scores obtained using ChatGPT.
[1] Jiang et al. LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion.
This paper aims to select the best LLM from a collection of pretrained models for an unseen task. The problem is useful and well motivated. The paper also quantifies a routing function to predict performance. The authors propose repurposing a collection of benchmark datasets to learn a "router" model for LLM selection from inputs only, reducing the problem to a collection of binary classification tasks.
Reasons to Accept
Addresses a pertinent challenge in the field of natural language processing, i.e., selecting the best LLM for a new task without labeled data.
Introduces an innovative approach by repurposing benchmark datasets for LLM selection, potentially offering a cost-effective solution compared to traditional methods.
An empirical evaluation is conducted to validate the proposed approach, providing practical insights into its effectiveness and feasibility.
Reasons to Reject
The idea and motivation are clear to me. However, the detailed methods in Sections 3 and 4 are confusing, especially with so many new notations.
Language and terminology should be refined for clarity and precision, especially in technical descriptions and explanations. It would be beneficial to include visual aids such as figures or tables to illustrate key concepts or experimental results for better comprehension.
Thank you for your review and encouraging feedback.
Clarity. We are glad that the idea and motivation are clear and will revise the technical part to improve clarity. We acknowledge that the notations could appear a bit dense and will use the extra page allowed for the camera-ready version to add a table summarizing the notations.
This paper focuses on addressing the following problem: given a number of LLMs, pick the best one per input to optimize for quality. The proposed approach relies on training a binary classifier per LLM on existing benchmarks. At inference time, it does not rely on labels and only operates on the inputs and the classifiers: a collection of binary classification jobs is run to predict which LLM will be the most accurate. The idea of using a sentence transformer for embedding the data and using kNNs as correctness predictors for new tasks is interesting. The approach is computationally efficient to implement and fast at inference. Extensive results (e.g., on HELM) show the effectiveness of the approach, considering that it only relies on the inputs, although for truly out-of-distribution tasks the generalization results are understandably less promising.