PaperHub
Average Rating: 4.3 / 10 · Decision: Rejected
Reviewers: 4 · Ratings: 5, 3, 6, 3 (min 3, max 6, std 1.3)
Average Confidence: 3.5
ICLR 2024

Large Language Model Routing with Benchmark Datasets

Submitted: 2023-09-23 · Updated: 2024-02-11
TL;DR

We use benchmark evaluations to learn router models for selection of LLMs on new tasks.

Abstract

Keywords
benchmark datasets, model selection, Large Language Models

Reviews and Discussion

Review (Rating: 5)

The paper poses the question of whether, based on existing benchmark performance, one can select the best LLM for a task associated with a dataset. The authors propose to cast the task entirely in terms of accuracy and train a binary classifier to estimate the task performance of each LLM. More precisely, the authors study three estimators in this work.

Empirically, the authors show that the proposed method outperforms the best model on average (BMA) as well as a per-instance perplexity-based baseline. Training an estimator raises O.O.D. concerns, particularly since the objective is to test on different tasks; the authors have carefully examined and discussed this issue.

Strengths

The paper is quite easy to follow, even though it presents some non-trivial technical details (e.g., Lemma 4.1), thanks to the well-organized presentation.

The paper shows strong empirical results: not only does it outperform the natural baseline BMA, it also outperforms an instance-based approach based on perplexity. The approach is shown to achieve nearly 90% of oracle accuracy (Table 1).

The authors have investigated and discussed the prominent O.O.D. issues thoroughly in the paper. Table 1 reports the oracle accuracy, which makes the gap visible. In the paragraphs on reducing the OOD gap, as well as in the discussion, the authors describe how the phenomenon manifests and how much data might be needed to mitigate the issue.

Weaknesses

The paper seems self-contained; however:

  • It only compares with a relatively straightforward instance-based approach, while the related work section on routing LLMs mentions more approaches without comparing with them.
  • The O.O.D. problem is well investigated and discussed; however, the authors don't compare with other ways of estimating accuracy. One popular approach is G-Eval, which might overcome the O.O.D. issues to some extent.
  • No clear conclusion can be drawn from the paper about which score to use in practice (and, given the O.O.D. issues with the scores, whether one should use a score at all). For example, the NN experiments show that S1 performs best, while in Table 1 S3 performs best. Note that because the unit of analysis is the dataset, there are only 28 data points. This is not to blame the authors, but with this data size and the differing results it seems impossible to draw conclusions.

Questions

For S3, the authors say that "we assign a task descriptor u(d)"; what is this u(d), please? I found this a bit confusing since Appendix A further uses u(d) as a dataset distance.

When using S3, how many tasks end up using BMA, please? Can you comment on the difference between Table 1 and Table 3?

Comment

We thank the reviewer for their questions and comments. Please see our responses below.

The paper only compares with a relatively straightforward instance-based approach, while the related work section on routing LLMs mentions more approaches without comparing with them.

As discussed in the corresponding related work section, all prior methods consider instance-based LLM selection and (1) require a sufficient amount of labeled data from the new task to train a scoring or ranking model and (2) require generating outputs with every candidate LLM. In contrast, we consider a setting where no (or very little) labeled data/references on a new task are available, which we believe is a more practical scenario. The only baseline applicable in such a problem setting is LL (or perplexity), which we compare to in our experiments. We note that LL still suffers from (2), whereas our method avoids this problem and also outperforms this baseline.

One exception is the experiment in Section 5.2, where the train and test data are from the same distribution and thus it is possible to compare to additional baselines mentioned in the related work section, e.g., PairRanker (note that these baselines continue to be inefficient due to (2)).

In Figure 2, we also added an additional baseline, Few-Shot, which uses the small number of labeled samples from the new task to select an LLM.

The O.O.D. problem is well investigated and discussed; however, the authors don't compare with other ways of estimating accuracy. One popular approach is G-Eval, which might overcome the O.O.D. issues to some extent.

In appendix B.1 we report results for an alternative method to estimate accuracy, ATC as proposed by Garg et al., and we show that this method for estimating accuracy does not perform as well compared to the method we used. As we report, the MAE for the kernel smoother is 0.118, and the MAE for the ATC is 0.177.

The suggested G-Eval score relies on GPT-4 and thus is a very expensive and closed-source method for accuracy estimation. One of the motivations for our work is to improve the collective capabilities of open-source models, hence using GPT-4 inside the method is not suitable.

No clear conclusion can be drawn from the paper about which score to use in practice (and, given the O.O.D. issues with the scores, whether one should use a score at all).

Based on the experiments in Figure 2, we recommend using kNN as the correctness predictor together with S3 while labeling ~5 samples from the new task for the best tradeoff between the router performance and labeling efforts. The reason to use kNN over MLP is that it is straightforward to have it take advantage of the few labeled samples from the new task, while MLP would require retraining/fine-tuning with up-weighted new-task labeled samples. When it is possible to obtain more labeled samples from the new task (~25-50), we recommend S1 instead of S3 as the kNN classifier becomes better calibrated, thus reducing the need to model correctness with S3. This recommendation is also supported by the results in Figure 2 for higher α values and is briefly discussed in the last paragraph of Section 5.1.
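
For concreteness, below is a minimal sketch of this kind of kNN-based routing. The embedding model, the value of k, and the simple averaged score shown here are illustrative assumptions, not the exact choices in the paper.

```python
# Minimal sketch of kNN-based routing (illustrative; embedder, k, and the averaged
# score are assumptions, not the paper's exact configuration).
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import KNeighborsClassifier

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def fit_correctness_predictors(benchmark_texts, correctness_by_model, k=10):
    """Train one kNN correctness predictor g_m per candidate LLM on benchmark evaluations."""
    Z = embedder.encode(benchmark_texts)
    return {m: KNeighborsClassifier(n_neighbors=k).fit(Z, y)
            for m, y in correctness_by_model.items()}

def route_task(new_task_texts, predictors):
    """Pick one LLM for the whole new task by averaging predicted correctness over its inputs."""
    Z = embedder.encode(new_task_texts)
    scores = {m: g.predict_proba(Z)[:, 1].mean() for m, g in predictors.items()}
    return max(scores, key=scores.get)

# A handful of labeled new-task samples can simply be appended to the benchmark
# training data before fitting, which is the advantage of kNN mentioned above.
```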

That said, rather than a concrete practical recommendation, the main contribution of our work is the new problem formulation of LLM routing using benchmark datasets that has the potential to improve the collective capabilities of open-source LLMs. The use of benchmark datasets is important as it allows researchers with limited computational resources to contribute to LLM research by reusing publicly available LLM (per-sample) evaluation data (e.g., HELM, Open LLM Leaderboard, evaluation data released as part of our code) instead of running costly evaluations themselves. Thus, our main focus is on bringing forth various aspects of the problem along with methodological approaches to address them.

Comment

For S3, the authors say that "we assign a task descriptor u(d)"; what is this u(d), please? I found this a bit confusing since Appendix A further uses u(d) as a dataset distance.

By "task descriptor" we simply mean some scalar or vector representation of a task. The specific choice of task descriptor u(d) in our paper is described in Appendix A and is based on dataset distance, although other choices (including those not based on distance) are conceptually possible. We have edited Appendix A to clarify this.

When using S3, how many tasks end up using BMA, please? Can you comment on the difference between Table 1 and Table 3?

In both Table 1 and Table 3, the column titled "% BMA" specifies the fraction of times that the BMA was selected by each of the methods. For example, based on Table 1, BMA is chosen 29 × 0.48 ≈ 14 times by S3. The difference between Table 1 and Table 3 is the type of correctness predictor, g_m(x), used. In Table 1, we use non-parametric kNN classifiers, and in Table 3, we train Multi-Layer Perceptrons (MLPs) as the correctness predictors.


Please let us know if you have any further questions or concerns. If we have addressed your concerns, we would appreciate it if you could consider increasing the score.

Comment

Thanks to the authors for providing constructive and clarifying feedback on my questions. I think that the limited comparisons the paper conducts and the missing auto-eval baselines (I am not fully convinced by the GPT-4 argument) make the empirical contribution of this paper marginally below the acceptance criteria.

Comment

Thank you for starting the discussion!

We have extended the related work (Section 2) paragraph on Routing LLMs to clarify the relation to auto-eval methods and cited G-eval along with two other related methods. We summarize the main message below:

To use auto-eval for LLM routing, for each sample in the test set one would need to (1) obtain generations with every candidate LLM and (2) score each generation with another, typically larger/commercial, LLM to decide which generation to output. This is in contrast to our method, which uses the proposed LLM routing score (the score for a sample can be computed with a single call to a sentence transformer, typically a much smaller model) to decide on the most suitable LLM and then produces generations with only the chosen LLM.
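
The contrast can be summarized with the following schematic sketch; the `generate` and `judge_score` calls and the embedder/predictor interfaces are hypothetical placeholders, not an actual API.

```python
# Schematic cost comparison (hypothetical interfaces, not an actual implementation).

def auto_eval_route(task_inputs, candidate_llms, judge_score):
    # Per input: one generation per candidate LLM plus one judge call per generation.
    outputs = []
    for x in task_inputs:
        gens = {m: m.generate(x) for m in candidate_llms}
        best = max(gens, key=lambda m: judge_score(x, gens[m]))
        outputs.append(gens[best])
    return outputs

def score_based_route(task_inputs, candidate_llms, embedder, predictors):
    # Per input: one call to a small sentence transformer; generations are produced
    # only with the single LLM chosen for the task.
    Z = embedder.encode(task_inputs)
    scores = {m: predictors[m].predict_proba(Z)[:, 1].mean() for m in candidate_llms}
    chosen = max(scores, key=scores.get)
    return [chosen.generate(x) for x in task_inputs]
```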

We hope the additional discussion clarifies that the proposed method is a lot more cost-efficient than any of the prior LLM routing/automatic evaluation methods. We are happy to answer any further questions.

Comment

Thanks again for your comments and feedback. Could you please let us know if the additional discussion we added to the paper addresses your concern regarding the relation to auto-eval and if you have any remaining/additional questions?

Review (Rating: 3)

The paper aims at selecting the best LLM for an unseen task for practical usage. By formulating the selection of an LLM as a collection of binary classification tasks, the authors propose three scores for selecting LLMs for a new task using the learned correctness predictors. The results on 29 datasets from HELM demonstrate the effectiveness of the proposed methods.

Strengths

The writing of this paper is commendable as it is well-structured and easily comprehensible. The paper addresses a significant problem: how to select the best model from a multitude of language models for a new task. The authors provide comprehensive experimental evidence of the effectiveness of their approach, particularly on the HELM benchmark.

Weaknesses

I have some concerns that I would like to address in my review of this paper. Firstly, I believe that the application scope of this work may be somewhat limited. The main approach relies heavily on the Large Language Model (LLM) learning from past similar tasks and using that knowledge to measure performance on new tasks. However, I would like to highlight that acquiring the necessary "knowledge" for a slightly larger LLM can be a costly process, requiring evaluation on a large number of benchmarks. Additionally, when dealing with a new dataset, an alternative approach could involve evaluating a selection of promising models on a smaller amount of data to identify the best-performing model. This evaluation process can be time-efficient. Thus, the efficiency of different models could also be included in the comparison.

Another concern I have is related to the adequacy of the baseline comparisons. It seems that a simple baseline approach could involve evaluating all relevant LLMs ( < 100) on a very small dataset and selecting the best-performing model.

Questions

Please see the weaknesses above.

Comment

We thank the reviewer for their questions and comments. Please see our responses below.

I believe that the application scope of this work may be somewhat limited. The main approach relies heavily on the Large Language Model (LLM) learning from past similar tasks and using that knowledge to measure performance on new tasks. However, I would like to highlight that acquiring the necessary "knowledge" for a slightly larger LLM can be a costly process, requiring evaluation on a large number of benchmarks.

While our approach relies on data from evaluating LLMs across tasks, it is crucial to highlight that we propose to reuse benchmark evaluations. The current practice of releasing new LLMs is to accompany them with an extensive evaluation of these LLMs across many tasks to demonstrate their credibility. In other words, there is essentially no additional cost to apply our method provided new LLM releases are accompanied by per-sample benchmark evaluation results. Such data is already readily available for some benchmarks, e.g., HELM provides per-sample evaluation results that can be downloaded from their website and Open LLM Leaderboard provides per-sample evaluations for over 1000 LLMs on their benchmark datasets, which can be downloaded from HuggingFace. Thus, model routing using benchmark evaluations is cost-efficient and practical given current practices, while it also provides a new LLM-related problem to study for researchers with limited computational resources. Please also see the discussion in Section 2 in the "Benchmarking" paragraph.
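
For example, released per-sample evaluation records can be turned directly into training data for the correctness predictors. The file and column names below are illustrative, not the actual HELM or Open LLM Leaderboard schema.

```python
# Illustrative reuse of released per-sample evaluations as correctness-predictor training data
# (file and column names are assumptions, not the actual benchmark schemas).
import pandas as pd

evals = pd.read_csv("per_sample_evals.csv")  # columns: dataset, sample_id, input_text, model, correct
per_model_training_sets = {
    model: df[["input_text", "correct"]].reset_index(drop=True)
    for model, df in evals.groupby("model")
}
```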

When dealing with a new dataset, an alternative approach could involve evaluating a selection of promising models on a smaller amount of data to identify the best-performing model. This evaluation process can be time-efficient. Thus, the efficiency of different models could also be included.

Such an approach would be inefficient as it would require (1) obtaining labels/references for some amount of data from the new task and (2) evaluating every candidate LLM on the data from the new task (i.e., outside of the benchmark evaluation data). Results presented in Table 1 demonstrate that our method is capable of selecting an LLM without any labeled data from the new task and without running inference with every candidate model on any of the data from the new task. Our method also outperforms perplexity (LL)-based LLM selection, which bypasses (1), but still suffers from (2).

Another concern I have is related to the adequacy of the baseline comparisons. It seems that a simple baseline approach could involve evaluating all relevant LLMs ( < 100) on a very small dataset and selecting the best-performing model.

Thank you for the suggestion! While in Table 1 we consider a setting where no labels/references are available for the new task, Figure 2 presents results with our method when a few labeled samples are available. We have added a baseline (Few-Shot) that uses performance only on those labeled samples to select an LLM. We see that our method outperforms such a baseline by a large margin.


Please let us know if you have any further questions or concerns. If we have addressed your concerns, we would appreciate it if you could consider increasing the score.

Comment

Thanks again for your comments and feedback. Could you please let us know if our rebuttal and revisions address your concerns and if you have any remaining/additional questions? Please also note the additional comparison to a MoE baseline that we added to Table 1 and Figure 2. We also added a comparison to the simple baseline (Few-Shot) that you suggested in Figure 2, as discussed in our rebuttal response.

Review (Rating: 6)

The main contribution of this paper is the proposal of a new problem formulation, which involves using benchmark datasets to learn a "router" model for selecting the best LLM. The authors demonstrate that this problem can be simplified into a series of binary classification tasks, and through experiments, they showcase the practicality and limitations of learning model routers from various benchmark datasets.

Strengths

This paper showcases the potential of utilizing benchmarks for routing LLMs and explores three model scores in the context of out-of-distribution (OOD) generalization when assigning LLMs to new tasks. It also outlines potential future directions aimed at enhancing the quality and effectiveness of LLM routers.

    • The authors propose three scores for selecting LLMs for a new task. In particular, the third score accounts for OOD data, because a new task is more likely to differ from the datasets in the benchmarks.
    • The routers depend only on the input x, which is different from prior works. It is more efficient if a router doesn't need to obtain generations from the LLMs.
    • The authors conduct robust experiments and provide compelling evidence demonstrating how an imperfect correctness predictor can still enhance the performance of LLMs.

Weaknesses

    • I'm somewhat confused about whether it's crucial to use an "imperfect" predictor if we have a "perfect" correctness predictor. In other words, why do we opt for an imperfect correctness predictor, such as a non-parametric classifier, instead of a parametric one?
    • From my perspective, this work bears similarities to the Mixture of Experts (MoE) model, where the experts in MoE are replaced with LLMs. So, what distinguishes this work from MoE with LLMs serving as experts? Would the non-parametric method remain efficient if we used it for traditional MoE?
    • This paper doesn't seem to clarify the difference between this method and certain fine-tuning techniques, nor does it address whether the proposed method outperforms current fine-tuning methods. If we fine-tune the selected LLM, would it certainly perform better than an LLM that hasn't been selected? Or should we use the selected LLM directly after the router has chosen it?
    • The results from the candidate LLMs (Table 5) clearly indicate that larger models outperform their smaller counterparts. This might suggest that the optimal strategy is simply to choose the largest model available. However, I believe this perspective may not be entirely accurate. Therefore, I propose showing more detail to challenge and potentially debunk this assumption.

Questions

See the weaknesses above.

Comment

We thank the reviewer for their questions and comments. Please see our responses below.

I'm somewhat confused about whether it's crucial to use an "imperfect" predictor if we have a "perfect" correctness predictor. In other words, why do we opt for an imperfect correctness predictor, such as a non-parametric classifier, instead of a parametric one?

If we have a "perfect" correctness predictor, the problem is fairly trivial and all proposed scores will perform as the oracle. Unfortunately, the perfect correctness predictor is generally unattainable, analogous to perfect accuracy being unattainable in most supervised learning tasks, especially with distribution shift. The "imperfection" is not related to the choice of correctness predictor (nonparametric vs parametric) but to the challenging nature of the binary classification problem corresponding to correctness prediction. In Appendix B.1 we present results using a neural network as the correctness predictor, i.e., a parametric classifier.

From my perspective, this work bears similarities to the Mixture of Experts (MoE) model, where the experts in MoE are replaced with LLMs. So, what distinguishes this work from MoE with LLMs serving as experts? Would the non-parametric method remain efficient if we used it for traditional MoE?

The traditional application of MoE is to decide which expert(s) to use based on the input. In this paper, we mainly consider the problem of selecting a single model for a task, i.e., a collection of inputs. It is perhaps possible to use some variation of MoE to improve the correctness predictors (the g_m's), thus potentially improving the effectiveness of model routing with our scores; however, it is unclear how MoE alone can be used in our problem setting.

One exception is the experiment in section 5.2, where the model is selected per sample and some of the baselines, e.g., PairRanker (which requires inference with all experts unlike our method), can be viewed as variations of MoE.

We have added a brief discussion regarding MoE in the conclusion.

This paper doesn't seem to clarify the difference between this method and certain fine-tuning techniques, nor does it address whether the proposed method outperforms the current fine-tuning methods. If we fine-tune the selected LLM, would it certainly perform better than an LLM that hasn't been selected? Or should we use the selected LLM directly after the router has chosen it?

We consider the problem of selecting an LLM for a new task where no (or very few) labels/references are available, thus fine-tuning is not a viable option. In our experiments, the LLM selected by the router is directly applied to the samples from the new task.

When the new task comes with a sufficient amount of labels/references, choosing a model most suitable for fine-tuning is an interesting problem, but is beyond the scope of this work. In the future work part of the conclusion, we briefly discuss the potential impact of routing models that were already fine-tuned for various domains, i.e., "expert" LLMs, where we anticipate our method to continue being effective.

The results from the candidate LLMs (Table 5) clearly indicate that larger models outperform their smaller counterparts. This might suggest that the optimal strategy is simply to choose the largest model available. However, I believe this perspective may not be entirely accurate. Therefore, I propose showing more detail to challenge and potentially debunk this assumption.

Note that Table 5 presents the average performance of models over all datasets, where indeed larger models outperform smaller ones. However, our results throughout the paper demonstrate that when choosing different models for different benchmark datasets we are able to improve the average performance, compared with a single large model. Specifically, Table 1 demonstrates that our method outperforms llama-2-70b (referred to as Best Model on Average, or BMA, the largest model considered which has the best average performance as shown in Table 5) while reducing inference cost by occasionally choosing smaller models (see "# Params" column). In addition, in Figure 2, we show that the results of our approach can be improved further when using in-distribution data, obtained, for example, by evaluating each LLM on a small subset of the new task, in addition to the benchmark data used for training our predictor.


Please let us know if you have any further questions or concerns. If we have addressed your concerns, we would appreciate it if you could consider increasing the score.

Comment

Thanks again for your comments and feedback. Could you please let us know if our rebuttal and revisions address your concerns and if you have any remaining/additional questions? Please also note the additional comparison to MoE where we used the same non-parametric (kNN) approach to select the best LLM for each input. The additional results are in Table 1 and Figure 2 and are discussed in the revised draft and our general response.

Review (Rating: 3)

This paper concentrates on the objective of selecting a Large Language Model (LLM) from a diverse collection of models for novel tasks. The authors formulate this objective as a series of binary classification problems. The method learns correctness predictors and also defines several scoring metrics to select an LLM given a new task.

Strengths

Strengths:

  1. Formulation: The paper formulates the LLM routing process as a collection of binary classification tasks.
  2. Better Performance: The proposed method achieves better results than a strong single model.

Weaknesses

Weaknesses:

  1. Comparison with Existing Methods: The concept of routing is a prevalent strategy in conventional Mixture-of-Experts (MoE) solutions. More comprehensive discussion and experimental comparisons are encouraged.
  2. The notation needs more clarification.
  3. Results of "S3 true p" need further practical analysis. These results are only achieved when the model has access to the true accuracy of correctness predictors.

Questions

  1. g_m(x) is defined to evaluate the correctness of model m on an input x and gold label y. The lack of y in g_m(x) causes confusion.
  2. Eq. 1 is a little bit confusing. It estimates the loss between g_m(x) and y(x, m) given g_m; g_m seems to be both an independent and a dependent variable.
  3. The problem of OOD in Eq. 3 lacks the necessary discussion. Eq. 3 does not contain the notation P(y|x); it estimates g_m. Although the target of g_m is to estimate the correctness of model m, which can potentially be affected by OOD, there is still no clarification of what P(y|x) represents in Eq. 3.
  4. The relation between the solution of Eq. 4 and OOD is not clear.
  5. There is only one optimization problem, namely Eq. 1, to find the best predictor function. Given the learned g_m, the rest of the method is to use g_m to choose the language model. How to design a better predictor also needs more discussion.
  6. This approach puts a lot of effort into choosing LLMs based on a prediction function. For example, Eq. 3 chooses a language model directly based on the prediction of g_m. Eq. 4 introduces a threshold and can get better generalization results. These tricks are valuable, but some of them are already popular in traditional classification methods.
Comment

We thank the reviewer for their questions and comments. Please see our responses below.

The concept of routing is a prevalent strategy in conventional Mixture-of-Experts (MoE) solutions. More comprehensive discussion and experimental comparisons are encouraged.

The traditional application of MoE is to decide which expert(s) to use based on the input. In this paper, we mainly consider the problem of selecting a single model for a task, i.e., a collection of inputs. It is perhaps possible to use some variation of MoE to improve the correctness predictors (the g_m's), thus potentially improving the effectiveness of model routing with our scores; however, it is unclear how MoE alone can be used as a baseline.

One exception is the experiment in section 5.2, where the model is selected per sample and some of the baselines, e.g., PairRanker (which requires inference with all experts unlike our method), can be viewed as variations of MoE.

Following your suggestion, we have added a comment regarding MoE in the conclusion.

Results of "S3 true p" need further practical analysis.

We present this score to demonstrate the upper bound on the proposed model routing score that takes into account the potential inaccuracies of the correctness predictors. As discussed at the top of page 7, we find it encouraging that even with correctness predictors that on average have accuracy as low as 0.59 on a binary classification task, it is conceptually possible to route LLMs efficiently.

g_m(x) is defined to evaluate the correctness of model m on input x and gold label y. The lack of y in g_m(x) causes confusion.

g_m(x) is a classifier which estimates the probability that an LLM m will generate the correct answer for an input x. The probability is determined entirely from x, so including y in g_m(x) would be incorrect.

Eq. 1 is a little bit confusing. It estimates the loss between g_m(x) and y(x, m) given g_m; g_m seems to be both an independent and a dependent variable.

For our approach, we are training the probabilistic estimator g_m(x) to predict the gold label y(x, m). Eq. 1 states that we are selecting the g_m(x) that minimizes the binary cross-entropy loss between the prediction and the gold label. In other words, we are simply training a standard binary classifier in Eq. 1. We have clarified this in the updated draft.
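
Schematically, and only as a paraphrase of Eq. 1 rather than its verbatim form in the paper, the training step amounts to

```latex
g_m \;=\; \arg\min_{g}\; \frac{1}{N}\sum_{i=1}^{N}
\Big[ -\,y(x_i, m)\,\log g(x_i) \;-\; \big(1 - y(x_i, m)\big)\,\log\big(1 - g(x_i)\big) \Big],
```

i.e., a standard binary cross-entropy fit of g_m to the observed correctness labels y(x_i, m).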

The problem of OOD in Eq. 3 lacks the necessary discussion. Eq. 3 does not contain the notation P(y|x); it estimates g_m. Although the target of g_m is to estimate the correctness of model m, which can potentially be affected by OOD, there is still no clarification of what P(y|x) represents in Eq. 3.

Eq. 3 uses g_m, which is an estimator of P(y(x, m) = 1 | x), as stated after Eq. 1. We are simply stating that a (probabilistic) binary classifier may fail to generalize out-of-distribution, particularly in terms of its calibration quality. We have clarified this in the updated draft.

The relation between the solution of Eq. 4 and OOD is not clear.

Eq. 4 presents one of the model routing scores based on predictions with the g_m's. As a binary classifier, g_m is likely to have lower accuracy when evaluated on OOD data, thus the score based on these predictions may be less effective when selecting an LLM for a new task. We have clarified this in the updated draft.
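
For illustration only (simplified forms written for this discussion; the exact definitions are given in the paper), a prediction-averaging score and its thresholded variant over a new task D might look like

```latex
S_1(m, D) \;=\; \frac{1}{|D|}\sum_{x \in D} g_m(x),
\qquad
S_2(m, D) \;=\; \frac{1}{|D|}\sum_{x \in D} \mathbb{1}\!\left[\, g_m(x) > \tau \,\right],
```

so miscalibration of g_m on OOD inputs can distort such scores, which is part of what motivates additionally modeling correctness in S3.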

There is only one optimization problem, namely Eq. 1, to find the best predictor function. Given the learned g_m, the rest of the method is to use g_m to choose the language model. How to design a better predictor also needs more discussion.

We reiterate that Eq. 1 simply says that we are training a binary classifier. How to train binary classifiers has been extensively studied in the ML literature over the past several decades. In this paper, we considered simple choices such as kNN and MLP and instead focused on the questions that are specific to our problem setting and have not been studied previously, e.g., how to improve model routing with imperfect classifiers.

Comment

This approach puts a lot of effort into choosing LLMs based on a prediction function. For example, Eq. 3 chooses a language model directly based on the prediction of g_m. Eq. 4 introduces a threshold and can get better generalization results. These tricks are valuable, but some of them are already popular in traditional classification methods.

Could you please clarify the concern? If the question is regarding the novelty of the proposed scores, then we agree that S2 (eq. 4) is a standard trick. However, the subsequent discussion on modeling LLM correctness and the construction of S3 is novel to the best of our knowledge. We also want to emphasize that the main contribution of our work is the new problem formulation of LLM routing using benchmark datasets that has the potential to improve the collective capabilities of open-source LLMs. The use of benchmark datasets is important as it allows researchers with limited computational resources to contribute to LLM research by re-using publicly available LLM (per-sample) evaluation data (e.g., HELM, Open LLM Leaderboard, evaluation data released as part of our code) instead of running costly evaluations themselves. Thus, our main focus is on bringing forth various aspects of the problem along with methodological approaches to address them.


Please let us know if you have any further questions or concerns. If we have addressed your concerns, we would appreciate it if you could consider increasing the score.

Comment

Thanks again for your comments and feedback. Could you please let us know if our rebuttal and revisions address your concerns and if you have any remaining/additional questions? Please also note the additional comparison to MoE that we added to Table 1 and Figure 2 based on your suggestion.

Comment

Dear Reviewers and AC,

We have posted our responses, added an additional baseline to Figure 2, and revised the draft to answer your questions and incorporate your feedback. We hope you would take the rebuttal into consideration when evaluating our work and ask you to please let us know your thoughts soon as the discussion period ends on November 22nd.

Thank you,
Authors

Comment

We have added an additional baseline using MoE in Table 1 and Figure 2 following suggestions/requests from the reviewers.

The MoE approach selects an LLM (i.e., an “expert”) for each input individually based on the correctness frequency of candidate LLMs in the neighborhood of this input (see also paragraph “Model routing” in Section 5.1). Selecting an LLM per input (instead of a single LLM per task) has the potential to outperform even the Oracle LLM, however, such an approach is heavily reliant on the accuracy of the expert selector. As a result, the MoE baseline performs quite poorly. In contrast, our proposed scores that account for the challenging nature of predicting correctness perform a lot better.
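
In sketch form, the per-input selection looks as follows; the names and the neighborhood machinery are illustrative assumptions, not the exact implementation of the added baseline.

```python
# Schematic per-input (MoE-style) LLM selection based on neighborhood correctness frequencies
# (illustrative assumptions; not the exact implementation used for the added baseline).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def moe_select(new_input_embeddings, benchmark_embeddings, correctness_by_model, k=10):
    nn = NearestNeighbors(n_neighbors=k).fit(benchmark_embeddings)
    _, neighbor_idx = nn.kneighbors(new_input_embeddings)   # shape: (n_new_inputs, k)
    choices = []
    for neigh in neighbor_idx:
        # fraction of correct benchmark answers among the k neighbors, per candidate LLM
        freq = {m: float(np.mean(y[neigh])) for m, y in correctness_by_model.items()}
        choices.append(max(freq, key=freq.get))              # one LLM chosen per input
    return choices
```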

AC Meta-Review

This paper proposes a new formulation for selecting the best Large Language Model (LLM) for new tasks by learning a "router" model from benchmark datasets and reducing the problem to a collection of binary classification tasks. After careful reading of the reviews, I would not recommend the paper to be accepted in its current form due to several concerns, including insufficient comparison with existing methods such as Mixture-of-Experts (MoE) solutions and unclear notation. Most of the reviewers reached the consensus that the paper can be further improved before it is published at the conference.

Why not a higher score

n/a

Why not a lower score

n/a

Final Decision

Reject