Speculate, then Collaborate: Fusing Knowledge of Language Models during Decoding
Abstract
Reviews and Discussion
This paper introduces the Collaborative Speculative Decoding (CoSD) algorithm, which enables efficient LLM knowledge fusion during decoding. Building on the idea of speculative decoding, the paper proposes two novel decision-making rules: rule-based and tree-based. The method features 1) efficiency (parallel generation, no additional model training), 2) transferability across different domains and across models with different tokenizers, and 3) interpretability. CoSD improves over baselines by up to 10%.
Strengths
- Interesting and inspiring idea of fusing knowledge at decoding time.
- The algorithm is clearly presented in this paper through both a workflow diagram and mathematical expressions.
- Both Rule-Based verification and Tree-Based verification are well-designed and both make sense to me.
Weaknesses
- I'm not sure if the goal of this algorithm is to A) achieve performance comparable to the assistant model but in a more efficient way, or if it's aimed at B) outperforming both the draft model and the assistant model individually (1+1>2). How do these objectives apply to the four scenarios of knowledge fusion discussed in section 4.1? If the goal is A, since the draft models in complementary knowledge fusion and catastrophic forgetting recovery scenarios are about the same size as the assistant model, and the algorithm involves autoregressive generation of the draft model, I doubt the algorithm improves efficiency. If the goal is B, I can't see improvement based on Table 2.
Questions
- "the algorithm regenerate and re-verify iteratively until all tokens are accepted" How many iterations does it take on average?
- during the training process of the decision tree, if neither the draft model's generation nor the assistant model's generation match the target, you drop the sample and continue the loop with i ← i+1. Any ideas of improvement other than simply dropping these samples?
- typos: line 287, "tree" to "three", "drat" to "draft"
We greatly appreciate the reviewer's recognition of the advantage of CoSD over the existing frameworks. Regarding the weaknesses raised, we address all the concerns in detail below.
W1: I'm not sure if the goal of this algorithm is to A) achieve performance comparable to the assistant model but in a more efficient way, or if it's aimed at B) outperforming both the draft model and the assistant model individually (1+1>2). How do these objectives apply to the four scenarios of knowledge fusion discussed in section 4.1? If the goal is A, since the draft models in complementary knowledge fusion and catastrophic forgetting recovery scenarios are about the same size as the assistant model, and the algorithm involves the autoregressive generation of the draft model, I doubt the algorithm improves efficiency. If the goal is B, I can't see improvement based on Table 2.
We thank the reviewer for clearly pointing out goals A and B. In fact, both A and B are our goals. We expect our algorithm to achieve different objectives when applied to different tasks and models. Specifically:
(1) When the assistant model has a much larger parameter size than the draft model, we expect CoSD to achieve goal A, which achieves performance comparable to the assistant model but in a more efficient way. This is also the goal of speculative decoding, and we expect our CoSD algorithm to retain this functionality while achieving results closer to the assistant model. In our experiments, pair 4 in Table 2 shows the effectiveness of CoSD in this scenario:
| ID | Draft | Assist | Spc.Dec. | Avg. Dec. | CoLLM | CoSD-R | CoSD-T |
|---|---|---|---|---|---|---|---|
| Pair4 | 14.67 | 25.16 | 24.11 | 22.43 | 23.72 | 24.39 | 23.66 |
We display the average score across all 3 benchmarks in this table. It shows that using a 1B draft model and a 7B assistant model in CoSD inference achieves performance similar to the single 7B model (24.39 vs. 25.16) while being about 2 times faster.
(2) When the two models have similar sizes and complementary knowledge, we expect CoSD to achieve higher average performance across all the tasks. Pair 2 and pair 3 in Table 2 show the effectiveness of CoSD in this scenario:
| ID | Draft | Assist | Spc.Dec. | Avg. Dec. | CoLLM | CoSD-R | CoSD-T |
|---|---|---|---|---|---|---|---|
| Pair2 | 41.89 | 44.18 | 35.57 | 42.05 | 43.05 | 44.40 | 43.08 |
| Pair3 | 37.96 | 30.59 | 29.15 | 38.04 | 39.27 | 44.49 | 40.29 |
CoSD-Rule outperforms both the draft and the assistant model in these two pairs and shows a significant improvement when the areas of expertise of the two models are entirely distinct (pair 3).
(3) When the two models are of similar size but one is significantly stronger overall, CoSD can achieve the performance level of the stronger model but cannot save computational costs. However, considering that in real-world applications, we cannot predict the performance of different models across all tasks, we believe it is still worthwhile to attempt fusing the knowledge of two similarly sized models to ensure that the combined performance is close to the better model across various domains. For instance, pair 1 in Table 2 shows this scenario:
| ID | Draft | Assist | Spc.Dec. | Avg. Dec. | CoLLM | CoSD-R | CoSD-T |
|---|---|---|---|---|---|---|---|
| Pair1 | 38.65 | 48.98 | 45.36 | 44.87 | 44.51 | 47.26 | 45.49 |
CoSD still outperforms all the baselines.
Here, we want to emphasize that the superiority of our approach lies in its ability to consistently achieve excellent cross-domain performance, regardless of the type or performance of the given models. By using the CoSD algorithm, we eliminate the need for users to evaluate and select models or consider goals like A and B. Instead, the system can automatically adapt to either efficient inference tasks or knowledge fusion tasks. This represents a significant advantage and contribution compared to previous works focused solely on single objectives like A: efficient inference (e.g., speculative decoding) or B: knowledge integration (e.g., CoLLM).
We answer all the questions below.
Q1: "the algorithm regenerate and re-verify iteratively until all tokens are accepted" How many iterations does it take on average?
The number of iterations depends on the maximum output length of the model. Here we measure the average number of iterations on the GSM8K dataset when the max length is 128, 256, and 512:
| Max Length | CoSD-Rule | CoSD-Tree | Spec. Dec. |
|---|---|---|---|
| 128 | 11.41 | 13.58 | 9.77 |
| 256 | 15.29 | 16.01 | 14.20 |
| 512 | 21.23 | 21.95 | 18.51 |
Note that an iteration count of around 11 does not mean that the generation time is 11 times that of a single model. As the number of accepted tokens increases, the number of tokens that the draft model needs to regenerate decreases significantly. For instance, in our experiment with a max length of 128, the number of iterations is about 11, and the draft model's final average total generation length is around 300 tokens. We will add this discussion to the paper soon.
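For concreteness, below is a minimal sketch of this iterative draft-and-verify loop. The model and verifier interfaces (`generate`, `score`, `verifier`) are hypothetical placeholders for illustration and not our exact implementation; the point of the sketch is that each iteration only regenerates the suffix after the last accepted token.

```python
# Minimal sketch of the iterative draft-and-verify loop described above.
# `draft_model`, `assistant_model`, and `verifier` are hypothetical interfaces
# used for illustration only; this is not the paper's exact implementation.
def cosd_generate(prompt, draft_model, assistant_model, verifier, max_length=128):
    accepted = []        # tokens accepted so far
    iterations = 0
    while len(accepted) < max_length:
        iterations += 1
        # The draft model autoregressively proposes only the remaining suffix,
        # so later iterations regenerate fewer and fewer tokens.
        draft_tokens, draft_probs = draft_model.generate(
            prompt, prefix=accepted, max_new_tokens=max_length - len(accepted))
        # The assistant model scores the same positions in one parallel pass.
        assist_tokens, assist_probs = assistant_model.score(
            prompt, prefix=accepted, candidates=draft_tokens)
        # The verifier (rule- or tree-based) accepts a prefix of the draft and,
        # at the first rejected position, substitutes the assistant token.
        kept, substitute = verifier(draft_tokens, draft_probs,
                                    assist_tokens, assist_probs)
        accepted.extend(kept)
        if substitute is None:
            break                      # the whole remaining draft was accepted
        accepted.append(substitute)    # regeneration resumes after this token
    return accepted, iterations
```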
Q2: During the training process of the decision tree, if neither the draft model's generation nor the assistant model's generation match the target, you drop the sample and continue the loop with i ← i+1. Any ideas of improvement other than simply dropping these samples?
We thank the reviewer for a very good suggestion to improve our paper. A possible idea is to involve more collaborating models to make sure that at least some of them can predict the correct token during decision tree training. We also discussed the possibility of involving more models in CoSD with reviewer hdS6. Here are some results when we let 3 models collaborate via CoSD:
| Benchmark | Draft | Assist. 1 | Assist. 2 | CoSD-Rule | CoSD-Tree |
|---|---|---|---|---|---|
| MMLU | 32.13 | 47.65 | 35.62 | 44.14 | 46.48 |
| GSM8K | 3.36 | 15.63 | 8.33 | 15.85 | 14.02 |
where Draft model = TinyLlama, Assist. 1 = Llama 2 Chat 7b, Assist. 2 = Llama-7b.
We find that CoSD is still useful with more models and can be extended to more than 3 models. We believe that with enough LLMs involved, we can make better use of the training data.
In addition, we clarify that for an instruction-output pair with x output tokens, we can generate up to x samples for training the decision tree. Considering that decision trees require a small amount of training data, we can use very little data to generate the decision tree training set even after dropping some tokens. For instance, we only use 10 samples in MMLU and 3 for the other datasets as training samples in Table 4. So there is no concern that the training data is insufficient for the decision tree.
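For illustration only, the sketch below shows how such per-token training samples could be assembled and used to fit the decision tree. The two-feature design (draft probability, assistant probability) and the `predict_next` helper are simplifying assumptions for this reply, not the exact feature set used in the paper.

```python
# Illustrative sketch: one instruction-output pair with x output tokens yields
# up to x decision-tree training samples; positions where neither model predicts
# the target token are dropped. Feature set and model interface are assumptions.
from sklearn.tree import DecisionTreeClassifier

def build_tree_training_set(pairs, draft_model, assistant_model):
    features, labels = [], []
    for instruction, target_tokens in pairs:
        for i in range(len(target_tokens)):            # up to x samples per pair
            prefix = target_tokens[:i]
            d_tok, d_prob = draft_model.predict_next(instruction, prefix)
            a_tok, a_prob = assistant_model.predict_next(instruction, prefix)
            if d_tok == target_tokens[i]:
                label = 0                               # keep the draft token
            elif a_tok == target_tokens[i]:
                label = 1                               # take the assistant token
            else:
                continue                                # neither is correct: drop
            features.append([d_prob, a_prob])
            labels.append(label)
    return features, labels

# A shallow tree suffices because only a handful of training pairs are needed.
verifier_tree = DecisionTreeClassifier(max_depth=4)
# verifier_tree.fit(*build_tree_training_set(train_pairs, draft_model, assistant_model))
```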
Q3: typos: line 287, "tree" to "three", "drat" to "draft"
Thank you very much to the reviewer for pointing out the typos. We will carefully double-check the typos in the paper and correct all typos before submitting the updated version.
Thanks for clarifying these points. I'd suggest directly highlighting them in the next version. I will keep my score.
Thank you for your valuable feedback. Following your suggestions, we have added the clarifications and additional experiments from the rebuttal to the paper and updated the experiments, limitations, and appendix. We highlight the goal of our algorithm, discuss its limitations in the paper, and have extended the paper to 10 pages.
We would greatly appreciate it if you could review these additions and consider raising your score based on the improvements. Please don’t hesitate to let us know if there are any other suggestions or areas where we could further enhance the paper.
This paper introduces a novel collaborative speculative decoding algorithm which can efficiently fuse the knowledge from different LLMs during inference. The experiment setting is quite interesting and includes different types: complementary knowledge fusion, catastrophic forgetting recovery, capacity imbalance and different tokenizers. The results are better than different baselines.
Strengths
- This paper provides an interesting perspective to fuse knowledge between LLMs using speculative decoding, which leverages the strengths of different LLMs while still keeping the efficiency.
- The experiment setting is interesting, which tries complementary knowledge fusion, catastrophic forgetting recovery, capacity imbalance and different tokenizers.
Weaknesses
- The paper only does the experiment in each pair of the LLMs. It would be interesting to see more LLMs collaboratively fuse knowledge.
- It would be better to show more details about the limitations of the proposed method and show some error analysis.
Questions
- Is the proposed algorithm suitable for collaboration among multiple LLMs? What will be the potential challenges?
- Can you explain more about the limitations of the current method? I'm curious when it doesn't work well.
We greatly appreciate the reviewer's recognition of the advantage of CoSD over the existing frameworks. Regarding the weaknesses raised, we address all the concerns in detail below.
W1: The paper only does the experiment in each pair of the LLMs. It would be interesting to see more LLMs collaboratively fuse knowledge.
Thank you for your valuable comment! Our proposed algorithm indeed supports collaboration among multiple LLMs, including scenarios involving three or more models. As an example, in a three-model CoSD, one draft model generates the draft and two assistant models verify it. The rule-based verification follows the same conditions as in the two-model case, except that the assistant-side condition is evaluated using the maximum of the two assistant models' token probabilities; the assistant token with the higher probability replaces the draft token if all three conditions are met.
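As a rough illustration (the threshold names `alpha` and `beta` and the exact comparison forms below are assumptions made for this reply, not the precise rule from the paper), the three-model check could look like:

```python
# Hedged sketch of rule-based verification with one draft and two assistant
# models. Threshold names (alpha, beta) and exact comparisons are illustrative
# assumptions; only the overall structure follows the description above.
def rule_based_verify_3(d_tok, d_prob, a1_tok, a1_prob, a2_tok, a2_prob,
                        alpha=0.5, beta=1.0):
    best_assist_prob = max(a1_prob, a2_prob)
    # Condition 1: the draft model is not confident in its own token.
    # Condition 2: the best assistant probability exceeds the draft's by a margin.
    if d_prob < alpha and best_assist_prob > beta * d_prob:
        # Condition 3: take whichever assistant token has the higher probability.
        return a1_tok if a1_prob >= a2_prob else a2_tok
    return d_tok  # otherwise keep the draft token
```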
For the tree-based CoSD with n LLMs, we extend the decision tree to n-class classification and select the model that predicts the next token with the highest probability as the ground-truth label, with all other settings remaining the same as in the two-LLM setting. We have conducted experiments with this three-model CoSD to validate the algorithm's capability of fusing knowledge from multiple LLMs. The results are shown in the table below:
| Benchmark | Draft | Assist. 1 | Assist. 2 | CoSD-Rule | CoSD-Tree |
|---|---|---|---|---|---|
| MMLU | 32.13 | 47.65 | 35.62 | 44.14 | 46.48 |
| GSM8K | 3.36 | 15.63 | 8.33 | 15.85 | 14.02 |
where Draft model = TinyLlama, Assist. 1 = Llama 2 Chat 7b, Assist. 2 = Llama-7b.
The current experimental results of this model group demonstrate that when three models collaborate and one significantly outperforms the other two, the final system achieves performance close to that of the best model. This indicates that our algorithm is effective when applied to more than two models. We are conducting collaborative experiments with other model groups and will keep you updated if we obtain new results and conclusions. Once the experiments for all model groups are completed, we will incorporate this section into the main body of the paper.
W2: It would be better to show more details about the limitations of the proposed method and show some error analysis.
A potential limitation of our approach is that, in some cases where the two collaborating models have similar parameter sizes but one is significantly more powerful than the other, it is not necessary to use CoSD since directly using the more powerful model is enough. For example, in Table 2 pair 1, the assistant 8B model has much better overall performance than the draft 8B model; CoSD can only achieve performance similar to the assistant model and cannot save computation cost since the models have the same parameter size.
Regarding error analysis, CoSD cannot guarantee that the assistant token is better than the draft token it replaces. Sometimes it may replace good draft tokens with worse assistant tokens when the draft model is not confident enough. Here is an example question from MMLU:
Rowena can paint a room in hours, while Ruby can paint it in hours. If Rowena paints for hours and Ruby paints for hours, they will finish half of the painting, while if Rowena paints for hours and Ruby paints for hours they will paint the whole room. Find the ordered pair . A. B. C. D. (1,1)
The draft model gave the correct answer C with 0.32 probability but was rejected by the answer D from the assistant model with 0.51 probability. This example illustrates that low probability does not necessarily indicate an incorrect token. Therefore, replacing tokens with higher-confidence alternatives is merely a heuristic algorithm and is subject to error. After all, during the inference stage and model deployment, the ground truth is unknown.
We answer the questions below.
Q1: Is the proposed algorithm suitable for collaboration among multiple LLMs? What will be the potential challenges?
The experiments in W1 show that CoSD is suitable for collaboration among multiple LLMs, especially CoSD-Tree. For the CoSD-Rule, the potential challenge is how to define the rule of replacement when the LLMs predict multiple different tokens. For the CoSD-Tree, the potential challenge is how to reduce the number of substitute tokens to improve efficiency. We will add all the experiments and related discussions to the paper. Please let us know if you have more valuable suggestions.
Q2: Can you explain more about the limitations of the current method? I'm curious when it doesn't work well.
We discuss the limitations and the scenarios when CoSD doesn't work well in W1 and W2. Simply put, we summarize it into the following two points:
(1) When the two collaborating models are of similar size and one significantly outperforms the other, CoSD offers no advantage over using only the better model. Naturally, this limitation exists in any work involving model collaboration.
(2) The CoSD algorithm cannot theoretically guarantee that the replaced token is always better than the discarded one. We can only select the output of the more confident model to maximize the likelihood of choosing a better token.
We will add a limitation section to the paper to thoroughly discuss these potential limitations.
Thank you for your feedback so far. We have provided a detailed rebuttal and an updated draft addressing your comments with additional results. Please feel free to share any further questions or thoughts before the discussion period ends.
This paper proposes Collaborative Speculative Decoding (CoSD) that fuses LLM knowledge at test time. The algorithm employs a draft model to generate initial response sequences and a rule-based or decision tree to decide when to leverage an assistant model to improve the drafts. The authors have conducted experiments using different pairs of LLMs and under various experimental setups.
Strengths
- The authors have put effort in experimenting their proposed framework under different setups, including various draft and assistant models, different simulated scenarios, etc.
- The proposed framework gains some advantage over the existing framework of Co-LLM in certain scenarios.
Weaknesses
- In Table 2, I notice that in most cases, the fused model underperforms the draft model and the assistant model. For instance, for Pair 1, none of the fusion methods outperform both the draft and assistant model for GSM8K, HumanEval; for Pair 2, none of the fusion methods consistently outperform both the draft and assistant model for GSM8K and MMLU. Then I wonder what is the point of fusing knowledge in these cases if we can simply adopt one model instead of the other?
- It seems that for Pair 3, CoSD-Rule performs exceptionally well on GSM8K, yielding 45.47 while the draft and assistant models yield 25.01 and 35.43, which is very different from the performance patterns for this same pair on other datasets such as MMLU, and also from other pairs. Could you give more insights into such a result? Could you present some examples that CoSD-Rule excels at under this situation that cannot be addressed by either the draft or the assistant model?
Questions
- In 3.2 Verification, for Tree-Based Verification, you claim to use benchmark datasets such as GSM8K to train the classifier, but then in your test, you incorporate the GSM8K dataset as well. Is there any information leakage, in the sense that you are training your verifier on the test set so that it gains an advantage over other models?
We greatly appreciate the reviewer's recognition of the advantage of CoSD over the existing frameworks. Regarding the weaknesses raised, we address all the concerns in detail below.
W1: In Table 2, I notice that in most cases, the fused model underperforms the draft model and the assistant model. For instance, ... Then I wonder what is the point of fusing knowledge in these cases if we can simply adopt one model instead of the other.
The reviewer mentioned that in most cases the fused model underperforms the draft model and the assistant model; we respectfully disagree with this conclusion. Here we list the average accuracy across all 3 benchmarks of Table 2:
| ID | Draft | Assist | Spc.Dec. | Avg. Dec. | CoLLM | CoSD-R | CoSD-T |
|---|---|---|---|---|---|---|---|
| Pair1 | 38.65 | 48.98 | 45.36 | 44.87 | 44.51 | 47.26 | 45.49 |
| Pair2 | 41.89 | 44.18 | 35.57 | 42.05 | 43.05 | 44.40 | 43.08 |
| Pair3 | 37.96 | 30.59 | 29.15 | 38.04 | 39.27 | 44.49 | 40.29 |
| Pair4 | 14.67 | 25.16 | 24.11 | 22.43 | 23.72 | 24.39 | 23.66 |
For pair 2 and pair 3, CoSD outperforms both the draft model and the assistant model, and all the baselines. It means that applying CoSD can achieve better and more balanced performance across multi-domain tasks.
For pair 4, CoSD achieves comparable performance to the assistant model and outperforms all baselines. Considering that the assistant model is much larger than the draft model, CoSD can significantly improve the inference speed by 2 to 3 times.
For pair 1, CoSD has comparable performance to the assistant model. This is because the assistant model has much better performance across domains than the draft model. In this case, we agree with the reviewer that we can simply adopt the assistant model. However, in practice, it is hard to know in advance which model will perform better. CoSD alleviates the need to choose a model, providing performance superior or comparable to the best model.
We further clarify that model fusion cannot guarantee consistently achieving better performance than the best individual model in the fusion across all benchmarks. Instead, we aim to:
(1) Improve the average accuracy across all the benchmarks if the merged models have complementary knowledge.
(2) If one model performs significantly better than the other across different benchmarks, the fusion aims to approximate the performance of the better-performing model as closely as possible.
W2: It seems that for Pair 3, CoSD-Rule performs exceptionally well on GSM8K, ... Could you give more insights into such a result? Could you present some examples that CoSD-Rule excel at under this situation that cannot be addressed by either the draft nor the assistant model?
Here is an example of the generation of the draft model, the assistant model, and the CoSD-Rule for the GSM8K dataset in pair 3:
Draft:
The cost of the subs is 2 x $7.50 = <<2*7.5=15>>15. The cost of the chips is 2 x $1.50 = <<2*1.5=3>>3. The cost of the cookies is 2 x $1.00 = $<<2*1=2>>2. The total cost of the food items is $15 + $3 + $2 = $<<15+3+2=20>>20. The cost of the delivery fee is 20% of the total cost of the order, so 20% of 20 = <<20% of 20=10>>10. The total cost of the food items and delivery fee is $20 + $10 = $<<20+10=30>>30. The cost of the tip is $5.00. The total cost of the order is 15 + 3 + 2 + 10 + 5 = <<15+3+2+10+5=45>>45.
Assistant:
The cost of the subs is 2 x $7.50 = $<<2*7.5=15>>15. The cost of the chips is 2 x $1.50 = $<<2*1.5=3>>3. The cost of the cookies is 2 x $1.00 = $<<2*1=2>>2. The total cost of her order without any fees is $15 + $3 + $2 = $<<15+3+2=20>>20. The delivery fee is 20% of the total order, so $20 * 0.20 = $<<20*0.20=4>>4. The tip is an additional $5. Therefore, her delivery order will cost $15 + $3 + $2 + $4 + $5 = $<<15+3+2+4+5=35>>35.
CoSD-Rule:
The cost of the subs is 2 x $7.50 = $<<2*7.5=15>>15. The cost of the chips is 2 x $1.50 = $<<2*1.5=3>>3. The cost of the cookies is 2 x $1.00 = $<<2*1=2>>2. The total cost of the food items is $15 + $3 + $2 = $<<15+3+2=20>>20. The delivery fee is 20% of $20, so the delivery fee is $20 x 0.20 = $<<20*0.2=4>>4. The total cost of the food items and delivery fee is $20 + $4 = $<<20+4=24>>24. The tip is $5.00, added to the total cost of the food items and delivery fee, so the total cost of the order is $24 + $5 = $<<24+5=29>>29.
The correct answer is 29, and only CoSD-Rule gets it right. We found that in this case, the draft model helps establish a better chain of thought (e.g., clearly calculating the total cost first, followed by the delivery fee and tip). However, the draft model's mathematical computation ability is weak, making errors in multiple steps within the chain of thought. At this point, the assistant model is involved to correct the computational mistakes, resulting in an excellent final outcome. We believe this situation occurs when the two models have highly complementary capabilities in different domains (e.g., pair 3 in the table of W1).
We answer the question here:
Q1: In 3.2 Verification, for Tree-Based Verification, you claim to use benchmark datasets such as GSM8K to train the classifier, but then in your test, you incorporate the GSM8K dataset as well. Is there any information leakage, in the sense that you are training your verifier on the test set so that it gains an advantage over other models?
We clarify that we did not use GSM8K for training the decision tree in the main experiments. As mentioned in lines 297-298, we use three samples from the AlpacaEval dataset to train the decision tree, which does not overlap with the benchmarks. We provide the experiment results on how the training data of the verifier can affect the final results in Table 4, which shows that there are some advantages if we train the decision tree on the same dataset as the benchmark. However, this concern does not arise in our main experiments in Table 2.
In addition, for an instruction-output pair with x output tokens, we can generate up to x samples for training the decision tree. Considering that decision trees require a small amount of training data, we can use very little data to generate the decision tree training set. For instance, we only use 10 samples in MMLU and 3 for other datasets as training samples in Table 4.
Thanks for the authors' response.
For the first weakness point, I disagree with the authors' claim. First, I think averaging scores is a biased practice, as it clearly hides the fact that in Table 2, for Pair 1, GSM8K, CoSD-Rule and CoSD-Tree achieve 45.72 and 41.89, respectively, lagging behind the Assistant model (51.02) by around 6 and 10 points. This clearly fails your aim to approximate the performance of the better-performing model as closely as possible.
In addition, in the paper, for Table 5, I do not see a significant token latency improvement over speculative decoding. Your claim that CoSD can "significantly improve the inference speed by 2 to 3 times" seems ungrounded; what baselines are you comparing to, and what are the results?
Thanks for providing the example, I think the example itself is very interesting. I suggest when the authors iterate the paper, you can focus more on studying how models with complementary capabilities help each other rather than focusing on the performance improvement, which seems to be not guaranteed and in certain cases (the example I gave), is opposite to the authors' expectation.
Thank you for pointing this out. Our reply addresses the concerns below:
For the first weakness point, I disagree with the authors' claim. First, I think averaging scores is a biased practice, as it clearly hides the fact that in Table 2, for Pair 1, GSM8K, CoSD-Rule and CoSD-Tree achieve 45.72 and 41.89, respectively, lagging behind the Assistant model (51.02) by around 6 and 10 points. This clearly fails your aim to approximate the performance of the better-performing model as closely as possible.
Averaging scores is a widely used metric for evaluating model knowledge fusion and overall performance. It is adopted by many benchmark suites to evaluate comprehensive model capabilities, such as the Hugging Face Open LLM Leaderboard.
We clarify that the strength of our approach lies in its consistent cross-domain performance without any prior knowledge of the fused models' performance. Considering that users do not always know the performance of their LLMs and LLM APIs on all benchmarks, by applying the CoSD algorithm they no longer need to evaluate or select models manually based on all benchmark results. For instance, in Table 2 pair 1, users may not know in advance that the assistant model is much better than the draft model on all benchmarks, but by applying CoSD-Rule they obtain an improved MMLU score, much higher GSM8K and HumanEval scores than the draft model, and an average score comparable to the assistant model.
Of course, we acknowledge that our approach cannot achieve the ideal scenario of completely matching the accuracy of the better-performing model across all benchmarks; it can only get closer to it than the baselines do. This is a limitation of our method. We will soon add a limitation section to the paper to discuss this and will let you know once we update the paper.
In addition, in the paper, for Table 5, I do not see significant token latency improvement over speculative decoding. Your claim that significantly improve the inference speed by 2 to 3 times seems to be ungrounded, what baselines are you comparing to? And what are the results?
Here, the inference speed-up is compared to only using the 7B assistant model for inference, which is also the baseline in speculative decoding works. According to Table 5, speculative decoding achieves inference efficiency similar to our method's. However, as shown in Table 2, its knowledge fusion performance is inferior to ours. In fact, SD itself lacks strong complementary knowledge fusion capabilities and can only make the system perform approximately like the assistant model (as evidenced by Table 2, Pair 3).
Thanks for providing the example, I think the example itself is very interesting. I suggest when the authors iterate the paper, you can focus more on studying how models with complementary capabilities help each other rather than focusing on the performance improvement, which seems to be not guaranteed and in certain cases (the example I gave), is opposite to the authors' expectation.
We are glad that the reviewer finds our example interesting. Based on your and other reviewers' suggestions, we are preparing a figure containing several samples generated by CoSD, including the example we provided above. These samples intuitively illustrate how CoSD makes the models cooperate by annotating the draft tokens, the replaced draft tokens, and the assistant tokens. This will help readers better understand how the assistant model helps polish the draft. As the reviewer suggested, we also prepared two "bad examples". One of them we also shared with reviewer hdS6; it shows that sometimes CoSD will reject good draft tokens and replace them with bad assistant tokens. We will update this part in the paper as soon as possible.
Here is one example instruction from the MMLU dataset where CoSD drops the correct answer:
Rowena can paint a room in hours, while Ruby can paint it in hours. If Rowena paints for hours and Ruby paints for hours, they will finish half of the painting, while if Rowena paints for hours and Ruby paints for hours they will paint the whole room. Find the ordered pair . A. B. C. D.
The draft model gave the correct answer C with 0.32 probability but was rejected by the answer D from the assistant model with 0.51 probability. This example illustrates that low probability does not necessarily indicate an incorrect token. Therefore, replacing tokens with higher-confidence alternatives is merely a heuristic algorithm and is subject to error. After all, during the inference stage and model deployment, the ground truth is unknown. This example can help show that sometimes the performance improvement is not guaranteed. We will add this sample to the figure and let you know once we update the paper.
Thanks for the response.
-
Averaging scores is a widely used metric for evaluating model knowledge fusion and overall performance.
This only applies when you have lots more datasets and want to show the general trend. Since you have conducted experiments on three datasets, averaging them clearly hides many critical details, just like the detail I pointed out in my earlier response that your method lags behind the baseline by 6 and 10 points.
In other words, taking the average is clearly not robust when there are few datasets (even if there are many, it is still not encouraged). Say you include a fourth dataset and your method significantly outperforms the baseline on it; when taking the average, you would conclude that your method is much superior to the baseline. If there were then a fifth on which your method underperforms the baseline, the conclusion would flip immediately when you only consider the average.
I wish the authors could understand that such a practice should not be encouraged. Even if people conduct such practices, especially people from industry, we should not lower our standards as researchers. Please think twice before adopting a "widely used metric" and common practice.
- Regarding the inference speed,
Here, the inference speed-up is compared to only using the 7B assistant model for inference, which is also the baseline in speculative decoding works. According to Table 5, speculative decoding achieves inference efficiency similar to our method's.
Here, you acknowledge that your method has similar inference efficiency to speculative decoding, which seems to contradict your earlier claim that CoSD can significantly improve the inference speed by 2 to 3 times.
Thank you for the comments!
About the average performance.
We believe we are in agreement regarding the limitations of the "average performance" metric. Indeed, we do not include it in our paper and instead provide more fine-grained per-task results. The reason we brought up average performance was to respond to one of the weaknesses you indicated in your initial review, i.e., that our method does not outperform the best of the draft/assistant models in all cases. Broadly speaking, no method is typically the best in all cases. However, a good method tends to be the best choice more often than not, making it a robust option in practice, especially when the performance of all candidate methods is not known beforehand. We simply tried to point out that, while acknowledging that our method is not always the best, it is a good method as it performs well more often than not. Average performance was simply a way to quantify "more often than not". If you recommend other metrics or experiments that could help address your concern, please let us know.
About the inference speed.
We apologize for the confusion. To clarify, the inference speed gains are measured with respect to the vanilla decoding (simply using the assistant model). Both our method and speculative decoding improve the inference speed by 2 to 3 times in comparison to vanilla decoding. The inference speed of our method and speculative decoding are approximately the same. However, our method outperforms speculative decoding in terms of performance.
Thank you for your valuable feedback. Following your valuable suggestions, we have incorporated three detailed case study samples into the paper, complete with annotations and in-depth analysis ('case studies' in the experiment section and two big tables). We have expanded the paper to provide a more comprehensive presentation of our work.
We would greatly appreciate it if you could review these additions and consider raising your score based on the improvements. Please don’t hesitate to let us know if there are any other suggestions or areas where we could further enhance the paper.
I appreciate the author's effort throughout the discussion period.
I am still not fully convinced by the performance gains demonstrated in the existing experiments, especially since only three benchmarks are involved. I am curious to see what happens if more general benchmarks get involved (e.g., Math, etc.). Currently, we have seen that for a single pair, in one out of three cases the performance drops significantly, while in another case the performance goes up; therefore, we do not know in general whether your method would help or not. I think you would have much stronger support if, say, you tested eleven benchmarks (you do not need to stick to this number exactly) and ten out of the eleven witnessed a performance boost while only one suffered a performance decline.
Therefore, I would keep my score for now.
Thank you for your comments!
Following the reviewer's suggestion, we have conducted experiments for pair 1 on all the benchmarks in tinyBenchmarks except AlpacaEval (which we use to train the decision tree). Here are the new results:
Pair 1:
| Benchmarks | Draft | Assist. | Spc. Dec. | Avg. Dec. | CoLLM | CoSD-R | CoSD-T |
|---|---|---|---|---|---|---|---|
| MMLU | 54.81 | 52.02 | 53.20 | 52.31 | 55.25 | 56.97 | 58.37 |
| GSM8K | 39.79 | 51.02 | 43.85 | 43.89 | 41.04 | 45.72 | 41.89 |
| HumanEval | 21.34 | 43.90 | 39.02 | 38.41 | 37.25 | 39.10 | 36.22 |
| HellaSwag | 87.17 | 81.99 | 82.52 | 86.39 | 85.64 | 86.96 | 86.84 |
| TruthfulQA | 64.92 | 59.51 | 60.61 | 64.11 | 63.85 | 65.40 | 64.14 |
| WinoGrande | 80.20 | 78.02 | 79.77 | 80.42 | 80.31 | 80.52 | 80.69 |
| Avg. | 58.04 | 61.08 | 59.83 | 60.92 | 60.56 | 62.45 | 61.36 |
It can be seen that CoSD-R and CoSD-T consistently outperform the baseline on all newly added benchmarks and achieve the best results in the average score. This was achieved even on pair 1, where CoSD did not perform as well, so we believe this can also be realized in other pairs. We are currently conducting experiments on other pairs, and we will update the experimental table in the final version of the paper to include all the benchmarks.
Besides, we also ran new experiments in which we switch the draft model and the assistant model; here are the results for pair 1:
| Benchmarks | Draft | Assist. | Spc. Dec. | CoLLM | CoSD-R | CoSD-T |
|---|---|---|---|---|---|---|
| MMLU | 52.02 | 54.81 | 54.23 | 53.92 | 54.17 | 56.18 |
| GSM8K | 51.02 | 39.79 | 41.28 | 45.75 | 49.52 | 48.88 |
| HumanEval | 43.90 | 21.34 | 25.17 | 36.90 | 42.31 | 43.62 |
| Avg | 48.98 | 38.65 | 40.23 | 45.52 | 48.67 | 49.56 |
We find that if we use the better model as the draft model, the CoSD performance will be better. This provides valuable guidance for the application of CoSD.
While more experiments are always nice to have, we believe we have already presented a reasonable amount of empirical evidence supporting our key claims: our method consistently outperforms other LLM fusion baselines such as Speculative Decoding, Average Decoding, and CoLLM, which are recent works published in NeurIPS 2024, ICML 2024, and ACL 2024; thus we believe our method will be of interest to the ICLR community. We sincerely hope that our detailed responses, additional experiments, and clarified claims have provided the reviewers with a better understanding of our work. We would be truly grateful if these efforts could be reflected in a more favorable assessment and higher scores. Thank you for your thoughtful consideration.
Dear ICLR Reviewers,
The author discussion phase is ending soon. Please promptly review and respond to author rebuttals for your assigned papers. Your engagement is critical for the decision-making process.
Deadlines:
- November 26: Last day for reviewers to ask questions to authors.
- November 27: Last day for authors to respond to reviewers.
- November 28 - December 10: Reviewer and area chair discussion phase.
Thank you for your timely attention to this matter.
The paper proposes Collaborative Speculative Decoding (CoSD), a method for efficient knowledge fusion between LLMs during inference using rule-based and tree-based verification strategies. The approach aims to both improve efficiency when combining models of different sizes and enhance performance through complementary knowledge fusion.
The reviewers value the comprehensive experimental analysis across different model pairs and scenarios. Through extensive discussion, the authors clarified several of the reviewers' questions. However, some concerns remain: 1) performance improvements are not consistent across all benchmarks, and with only three test datasets it is difficult to draw robust conclusions, 2) efficiency claims compared to baselines need better substantiation, and 3) averaging scores across benchmarks may mask important per-task performance variations.
While the authors have engaged constructively with reviewer feedback and provided additional experiments and analysis, I recommend rejection as more comprehensive evaluation across a broader set of benchmarks is needed to convincingly demonstrate the method's effectiveness and generality.
Additional Comments on Reviewer Discussion
Check above.
Reject