PaperHub
NeurIPS 2024 · Poster · 4 reviewers
Overall rating: 6.3 / 10 (ratings: 5, 8, 5, 7; min 5, max 8, std 1.3)
Confidence: 3.3 · Correctness: 3.0 · Contribution: 2.8 · Presentation: 3.0

Are More LLM Calls All You Need? Towards the Scaling Properties of Compound AI Systems

Submitted: 2024-05-15 · Updated: 2024-11-06
TL;DR

We study the scaling properties of compound inference systems both theoretically and empirically

Abstract

Keywords
Scaling Laws; Compound AI systems; language models

Reviews and Discussion

Review (Rating: 5)

LLM inference systems often generate multiple answers (function calls) to a query and then aggregate the answers with rules like vote and filter-vote. This paper investigates how the number of function calls influences the performance of the compound system. More concretely, the paper finds that there is often a U-shaped overall performance curve as the number of function calls increases: easy queries benefit from more function calls, while hard queries can be harmed by them. Some theoretical results are given for toy settings, and an optimal number of function calls can be derived accordingly. Empirical results are given for simple settings.

Strengths

  • Clarity: The paper is mostly clear and easy to understand.
  • Originality: The problem is less studied in the literature.

Weaknesses

1. The results are not surprising.

The problem is quite simple and straightforward if we drop the LLM background. Basically, we draw samples independently at random from a distribution (function calls) and then apply majority vote to them; with more samples, sampling plus majority vote converges to selecting the answer with the highest probability (if a majority exists). If that majority answer is the correct one, the query is "easy"; otherwise it is "hard". Mixing easy and hard queries naturally leads to a U-shape in the performance curve (see the simulation sketch after these points).

2. The model prediction may not outperform simple hyperparameter-search.

The "optimal" function calls prediction is based on strong assumptions which may not hold true and estimate accurately in practice. Instead, simple hyperparameter search is usually much more robust performance.

Questions

As stated in the weakness.

Limitations

Not applicable.

Author Response

We thank the reviewer for the feedback. We answer the questions as follows.

The results are not surprising: We appreciate your comment, but we would like to respectfully argue that our results are interesting. First, many practitioners and researchers believe that they can consistently improve performance with more LLM calls (e.g., https://arxiv.org/abs/2402.05120). In this context, our finding that more LLM calls can hurt performance in specific settings is unexpected and a useful contribution to the literature. Furthermore, we provide an explanation of this counterintuitive phenomenon based on query difficulty and offer both mathematical and empirical justifications. The prediction based on our theory matches the empirical performance accurately. These results have not been studied in existing work and open the door to understanding and optimizing the design of compound LLM systems.

model prediction is based on strong assumption: Thank you for this question. We would like to clarify that the algebraic formula (K*) shown in Theorem 4 is used for a special case. On real-world datasets, we use the scaling law function empirically fitted on a small dataset for prediction purposes, which does not require strong assumptions. The scaling law function’s predictions are empirically accurate (see Figure 4(c)). We will highlight this in the revised paper.

Comparison with hyper-parameter search algorithms: Thank you very much for pointing this out! We conducted additional experiments to compare our proposed scaling laws with two popular hyper-parameter search algorithms, Bayesian optimization and the tree-structured Parzen estimator, given the same number of performance evaluation budgets. Specifically, we evaluate these hyper-parameter search algorithms on GPQA, AVERITEC, and MMLU PHYSICS, with 5 calls to the performance oracle (the same input to our scaling law method). As shown in the following table, our scaling law achieves the most accurate prediction and the highest overall accuracy. We will add this discussion in the revised version. Notably, both the Parzen and Bayesian optimizers would have recommended using a large number of LLM calls (>100), which would have incurred much more cost while performing worse than our estimator.

Method                           | GPQA LLM Calls | GPQA Acc | AVERITEC LLM Calls | AVERITEC Acc | MMLU PHYSICS LLM Calls | MMLU PHYSICS Acc
Our scaling law                  | 19             | 0.317    | 5                  | 0.367        | 13                     | 0.542
Tree-structured Parzen estimator | 444            | 0.307    | 114                | 0.361        | 418                    | 0.530
Bayesian optimization            | 430            | 0.307    | 377                | 0.360        | 47                     | 0.536
Ground truth                     | 13             | 0.320    | 4                  | 0.368        | 10                     | 0.543
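As a rough sketch of the kind of procedure being compared, one could fit a simple surrogate curve to a handful of (number of calls, accuracy) measurements and pick the call count that maximizes it. The functional form, the measurements, and the search range below are assumptions for illustration, not the paper's analytical scaling model:

```python
import numpy as np

# Hypothetical accuracy measurements at a few call counts (the "performance oracle").
ks = np.array([1, 2, 4, 8, 16], dtype=float)
accs = np.array([0.30, 0.33, 0.345, 0.34, 0.33])

# Assumed surrogate: accuracy as a quadratic in log(k), a simple stand-in
# with a rise-then-fall shape.
coefs = np.polyfit(np.log(ks), accs, deg=2)

candidate_ks = np.arange(1, 501)
predicted = np.polyval(coefs, np.log(candidate_ks))
best_k = int(candidate_ks[np.argmax(predicted)])
print(f"Recommended number of LLM calls: {best_k} "
      f"(predicted accuracy {predicted[best_k - 1]:.3f})")
```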
Comment

I have read the authors' responses and believe that the second and third points addressed my concerns. As for the first one, I am still not very convinced that the theoretical contribution is significant, as it seems to me to rediscover known math (the non-peer-reviewed works are not convincing either). Yet I am willing to increase the score.

Comment

Thank you for your response and increasing the score!

Review (Rating: 8)

This paper studies the relationship between the number of LM calls and task performance for two compound system designs, Vote and Filter-Vote, on multiple-choice selection tasks. The authors conduct a theoretical analysis of the system designs and their scaling behavior by proposing a formal notion of query difficulty and model behavior. They show that adding LM calls does not always improve task performance monotonically: additional LM calls increase performance on easy queries and decrease performance on difficult queries, so increasing the number of LM calls can lead to non-monotone behavior depending on task difficulty. The authors also conduct an empirical analysis on different datasets, including one synthetic dataset, to validate the claims, and observe that the analytical scaling model can accurately predict the performance of Vote and Filter-Vote.

Strengths

  1. The paper offers novel insights into the relationship between task difficulty, LM calls, and model behavior.
  2. The paper is very easy to follow, and its main scope is made clear throughout.
  3. Both the theoretical analysis and empirical experimentation are well conducted to support the main claim.

Weaknesses

Although I found that the authors conducted solid empirical experimentation to support the claim, the empirical results are not well presented. The authors experimented with several different datasets but only showed a few case studies on a single dataset in the main body of the paper. I would at least present the overall results on all datasets in the experiment section if the page limit permits.

Questions

  1. Since all your datasets are multiple-choice tasks, how did you extract the results from the LLM outputs for answer mapping? Perhaps via the token logits? If that is the case, did you check the robustness of the token-logit results compared to the actual textual outputs?
  2. Did you (Or do you think whether it is worthwhile to) conduct additional robustness tests (e.g. by shuffling the option orders, or adding input perturbations) to validate the LLM outputs?

Recent evaluation papers that could be relevant; they more or less discuss how task difficulty can affect model robustness: https://aclanthology.org/2024.naacl-long.295/ https://aclanthology.org/2024.findings-naacl.130/

Limitations

The authors already talk about the Limitations in Appendix B.

Author Response

We thank the reviewer for the helpful feedback and support of our paper! Please see below for our response.

present the overall results of all datasets in the experiment section if the page limit permits: Thanks for your suggestion! We will move more empirical results to the experiment section in the revised paper.

answer extraction: We prompted the LLM to generate its final output in a specific format (i.e., “the answer is (X)”) and then used a regular expression to extract the answer.
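For concreteness, a minimal sketch of this kind of extraction (the exact pattern and the option letters are assumptions, not the authors' code):

```python
import re

def extract_answer(completion: str):
    """Pull the choice letter out of a completion that ends with
    'the answer is (X)'. Returns None if the pattern is absent.
    The pattern below is an illustrative guess at the extraction described above."""
    match = re.search(r"the answer is \(([A-E])\)", completion, flags=re.IGNORECASE)
    return match.group(1).upper() if match else None

print(extract_answer("Let's think step by step... so the answer is (C)."))  # -> C
print(extract_answer("I am not sure about this one."))                       # -> None
```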

Did you (Or do you think whether it is worthwhile to) conduct additional robustness tests (e.g. by shuffling the option orders, or adding input perturbations) to validate the LLM outputs?: Thank you very much for pointing this out! Robustness studies are orthogonal but complementary to the focus of this paper. For example, it would be interesting to study how input perturbation affects query difficulty distribution. We will add a discussion in the revision.

Recent evaluation papers: Thanks for the reference. We will add a discussion to them in the revised paper.

Comment

Dear authors,

Thanks for the response. I think this is overall very interesting work, and my initial score should properly reflect its quality. I would like to see it included in the proceedings.

One final remark: thanks to the authors for pointing out the use of a regex to extract the response. I would suggest that the authors include the details of the extraction process in the appendix, as well as the failure rate (if any). In particular, the template "the answer is (X)" reminds me of the title of a very recent paper from ACL this year that might be more or less relevant. Perhaps also worth reading: https://aclanthology.org/2024.findings-acl.441/

Comment

Thank you for your response and support of our paper! We will make sure to include the answer extraction process in the appendix and add a discussion to the suggested ACL paper.

Review (Rating: 5)

In this paper, the authors study the scaling properties of compound inference systems. Both theoretically and empirically, the authors characterize how multiple LM calls affect performance.

Strengths

  1. Answers several important questions about multiple LM calls, which may benefit compound inference systems.

  2. Provides a heuristic for the optimal number of LM calls.

Weaknesses

  1. The authors only study two simple, natural inference system designs, i.e., Vote and Filter-Vote, which might reduce the significance of the findings.

Questions

  1. Is the optimal number of LM calls model-specific or not?

Limitations

  1. According to Definition 1, a query can only be categorized as easy or difficult according to the difficulty indicator. This is a simplification of real-world scenarios. Besides, the difficulty indicator is not model-agnostic, which may also reduce the significance of the findings.

  2. The optimal number of LM calls depends on query difficulty, while the difficulty level is simply either difficult or easy. A more rational way would be to measure the least number of calls needed for a query to obtain a correct answer.

Author Response

We thank the reviewer for the helpful feedback and support of our paper! Please see below for our response.

Simple System: We indeed focus on simple systems. There are two reasons. First, they both represent real-world systems. For example, the CoT@32 approach by Google Gemini is indeed a simple vote method. Second, the observed phenomenon is quite surprising, and we want to understand it empirically and analytically in the simplest setting where it appears. We believe it is always important to find a minimal working example. Of course, it will be valuable to investigate more challenging settings in future work.

Is the optimal number of LM calls model-specific or not?: Thanks for this great question. We actually predict the optimal number of calls in two ways (K* for a special case and the scaling law function G in general). They are both model-specific. This is an important point and we will emphasize it in the revision.

the least number of calls for a query to get a correct answer: Thank you so much for pointing this out. We think this is a great intuitive way to characterize query difficulty. It is, in fact, very closely connected to the characterization we currently use: it is a kind of soft version of our difficulty level. Indeed, let us take Vote as an example.

$$d_V(x) = \max_a \Pr[G(x, \theta) = a] - \Pr[G(x, \theta) = y]$$

Here, the inverse of $\Pr[G(x,\theta)=y]$ is the expected number of calls to obtain at least one correct answer, which is one formalization of the reviewer's soft difficulty. We focus on the binary difficulty because it offers a natural explanation of the non-monotonic behavior. We appreciate the suggestion of this nice intuition and will add a sentence about it in the revision.
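As a toy numerical illustration of this connection (the answer distribution below is made up, not from the paper):

```python
# Assumed per-call answer distribution for one query x: the correct answer "B"
# receives 0.35 of the probability mass, a competing wrong answer "A" receives 0.45.
answer_probs = {"A": 0.45, "B": 0.35, "C": 0.20}
correct = "B"

# Difficulty as defined above: max_a Pr[G(x) = a] - Pr[G(x) = y].
d_v = max(answer_probs.values()) - answer_probs[correct]
print(f"d_V(x) = {d_v:.2f}")  # 0.10 > 0, so x is a "hard" query for Vote

# The reviewer's "soft" difficulty: the expected number of calls until the first
# correct answer is the inverse of Pr[G(x) = y] (a geometric distribution).
expected_calls = 1 / answer_probs[correct]
print(f"Expected calls to first correct answer: {expected_calls:.1f}")  # ~2.9
```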

Comment

Thanks for the clarification. While my concerns/questions are not fully addressed, I will keep my score.

Comment

Thank you for your response and support of our paper!

Review (Rating: 7)

Recent state-of-the-art results in language tasks have been achieved using compound systems that make multiple calls to Language Models (LMs) and aggregate their responses. However, there is limited understanding of how the number of LM calls affects the performance of these compound systems. This paper studies the scaling properties of compound inference systems, specifically analyzing how the number of LM calls impacts the performance of two simple compound system designs, Vote and Filter-Vote, which use majority voting to aggregate LM responses, with or without applying filters. The analysis, both theoretical and empirical, reveals that the performance of Vote and Filter-Vote systems can initially improve but then decline as the number of LM calls increases. This non-monotonic behavior is attributed to the varying difficulty of queries within a task: more LM calls enhance performance on "easy" queries but reduce it on "hard" queries.

Strengths

The findings of the authors are of definite interest, and the phenomenon they identify is thought-provoking, although mathematically intuitive. I think the paper has potential for good to high impact. The problem itself is conveyed very clearly, and the authors made the right choice in including the 'main contributions' box and Figure 1 on the first page of the paper itself.

The theory and substance of the paper is strong, and worth the status of the conference.

The study also provides a method to determine the optimal number of LM calls that maximizes system performance, based on a small number of samples, and develops an analytical scaling model for both systems. Experiments confirm that the scaling model can accurately predict the performance of Vote and Filter-Vote systems, identifying the optimal number of LM calls.

Weaknesses

One weakness is the insistence on using multiple-choice benchmarks without being more expansive about adding further benchmarks (for example, there are many commonsense benchmarks that are also multiple choice). Another weakness is that I still don't quite buy the way in which difficulty was determined experimentally. However, the question itself is a provocative one, and I'm not sure that any answer is really perfect. The paper does a good job of starting the conversation and showing one way in which all of this can be formalized.

Questions

Are there methods beyond Vote and Filter-Vote that can be applied here? And can this approach generalize if difficulties are probabilistic to begin with?

Limitations

The paper focuses on analyzing and experimenting with the scaling behaviors of two specific instances of compound AI systems, though there are many other types. The experiments are conducted on relatively objective language tasks for ease of evaluation. It remains an open question how performance scales on more subjective tasks, such as generating poems and writing essays. Another open problem is how to predict the difficulty of queries without actually querying the language models.

Author Response

We thank the reviewer for the helpful feedback and support of our paper! We have addressed the questions as follows.

Are there methods beyond Vote and Filter-Vote that can be applied here? Can this approach generalize if difficulties are probabilistic to begin with?: Thanks for bringing up this point. We indeed believe that our methods can apply beyond just Vote and Filter-Vote. We chose to focus on those two systems because … [paste] … We have in mind to do some follow-up work on more complex systems. One example is Ranking-Vote: each LLM call provides a ranking of possible answers, and then we perform rank aggregation (e.g., a Borda count) to produce a final answer. Here it is indeed important to consider probabilistic query difficulty. The effect will be significant because, while in Vote and Filter-Vote a law of large numbers governs the resulting prediction (polarizing difficulty into a binary easy or hard), in more complex systems we expect the full distribution of query difficulty over a dataset to impact model performance.
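For illustration, a minimal sketch of the Ranking-Vote aggregation described above, assuming each call returns a full ranking of the options (the rankings below are hypothetical):

```python
from collections import defaultdict

def borda_aggregate(rankings):
    """Aggregate per-call rankings with a Borda count: an option ranked at
    position i (0-based) among m options earns m - 1 - i points."""
    scores = defaultdict(int)
    for ranking in rankings:
        m = len(ranking)
        for position, option in enumerate(ranking):
            scores[option] += m - 1 - position
    return max(scores, key=scores.get)

# Three hypothetical LM calls, each ranking options A-D from best to worst.
calls = [
    ["B", "A", "C", "D"],
    ["A", "B", "D", "C"],
    ["B", "C", "A", "D"],
]
print(borda_aggregate(calls))  # -> B (8 points) beats A (6 points)
```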

Can this approach generalize if difficulties are probabilistic to begin with?: Yes. For example, one might take a Bayesian view: difficulties can be initially quantified as a prior distribution, and invoking more LLM calls offers a more accurate estimation of the posterior.
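As a sketch of this Bayesian view under assumed numbers (the Beta prior and the simulated call outcomes are illustrative only, not the paper's construction):

```python
# Beta-Binomial sketch of "probabilistic difficulty": start from a prior over the
# per-call probability of a correct answer and sharpen it with observed calls.
alpha, beta = 2.0, 2.0          # assumed prior: difficulty is uncertain, centered at 0.5
observed = [1, 0, 1, 1, 0, 1]   # hypothetical outcomes of 6 LM calls (1 = correct)

alpha += sum(observed)
beta += len(observed) - sum(observed)
posterior_mean = alpha / (alpha + beta)
print(f"Posterior mean of per-call correctness: {posterior_mean:.2f}")  # 0.60
```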

Author Response

We thank all reviewers for their feedback! Please find our answers and clarifications in the individual responses.

Final Decision

Increasingly, systems designed to perform natural language tasks aggregate outputs from multiple calls to LLMs. This work asks an incisive question about such approaches: What is the relationship between the number of LM calls and system performance? Intriguingly, in the straightforward case of multiple-choice QA, this relationship is non-monotonic. In addition to this empirical finding, the authors provide a formalism to analyze LLM calls and offer a heuristic for the optimal number of calls to make for a task (estimated on the basis of a small number of queries).

The biggest weakness of the work is that the empirical evaluation is somewhat limited here, as it is restricted to multiple choice QA datasets. Nonetheless, there was a consensus amongst reviewers that the question being raised and the initial empirical evaluation (as well as formalism) on offer is likely of broad interest to the community, and may provoke discussion and follow-up work.