PaperHub
Rating: 3.5 / 10 (withdrawn; 4 reviewers)
Individual ratings: 3, 3, 3, 5 (min 3, max 5, std 0.9)
Confidence: 3.8
Soundness: 2.0 · Contribution: 2.3 · Presentation: 2.5
ICLR 2025

In-batch Ensemble Drafting: Toward Fast and Robust Speculative Decoding for Multimodal Language Models

OpenReview · PDF
Submitted: 2024-09-17 · Updated: 2024-11-14
TL;DR

Systematically explore speculative decoding for MLLMs and propose new method

Abstract

Keywords

Speculative decoding · Large language model · Vision language model · Inference Acceleration

Reviews and Discussion

Review
Rating: 3

The paper adapts speculative decoding to vision and language models (VLMs); the technique was previously widely adopted for language models (LLMs) but not for VLMs. It compares four speculative decoding strategies (multimodal drafting (M), text-only drafting (T), caption drafting (C), and pooled multimodal drafting (P)) and finds no clear superiority of one strategy over the others. In consequence, the paper proposes In-batch Ensemble Drafting (IbED), which applies all four strategies simultaneously and combines their probability distributions during speculative decoding. IbED shows more consistent decoding speedups than any single strategy (M, T, C, or P).

Strengths

S1. Efficiency gains: IbED shows more consistent decoding speedups than single strategies (M, T, C, or P). If it were validated on multiple models (which it is not, see W2), IbED would also eliminate the need to decide between M, T, C, or P, reducing ablation costs when selecting the optimal method and ultimately enhancing accessibility and ease of deployment.

S2. The paper provides many analyses of strategy variation (M, T, C, or P) and covers image-text datasets well in its experiments. However, this variety does not extend to model selection, as the study is limited to a single model (LLaVA 1.5).

S3. The paper’s language is clear. All in all, this is a well-written paper, with only hard-to-find typos (see the “Questions” section).

Weaknesses

TLDR: The paper would benefit from a more thorough evaluation of model output quality (not just speed) and from testing whether the experimental findings with LLaVA 1.5 generalize to at least two other models.

W1. Overstated naming & terminology: This paper uses the term “multimodal large language models (MLLMs)” but focuses solely on an image-and-text model. Framing the paper as covering "multimodality" seems overstated when other modalities, such as speech-text or video-text, are not addressed. This should be toned down to “image-text models” or “vision and language models”.

W2. Unclear whether the findings generalize: This paper conducts all experiments with a single model (LLaVA 1.5), limiting the generalizability of its findings. What if the outcomes of the experiments are due to special quirks of this one model (or its tokenizer approach)? What if the superiority of M, T, C, or P is unclear only for this model? That would make IbED unnecessary for all other models. The paper does not falsify alternative hypotheses like these. To be more convincing, the results would need to be validated on at least two more models.

W3. Limited novelty: This paper applies an established method for LLMs (speculative decoding) to VLMs, which reduces novelty. Prior work, particularly [1], has already demonstrated the benefits of speculative decoding for multimodal models and shown that language-only draft models can achieve acceleration, underscoring the phenomenon of "unimodal collapse" in VLMs [4]. Unfortunately, this study adds no further innovation regarding language-only draft models beyond what [1] established.

The paper’s main contribution is In-batch Ensemble Drafting (IbED), a reasonable extension, though not especially novel. The need for such a method can still be debated, as discussed in W2.

Note: There have already been at least three other works on speculative decoding in the multimodal domain [1, 2, 3]. The work [1] was publicly posted before the ICLR deadline and is indeed cited by the authors (well done). The works [2, 3] were submitted to ICLR (judging by the paper template) and deal with autoregressive generation, but for text-to-image rather than image-to-text as this paper does, so this paper still has enough uniqueness to it.

W4. Missing output quality evaluation: The paper’s results only show measured block efficiency, while the accuracy of the model-generated outputs / answers is completely neglected. What if the verification model accepts wrong tokens from the draft model?

[1] “On Speculative Decoding for Multimodal Large Language Models”, Mukul Gagrani, Raghavv Goel, Wonseok Jeon, Junyoung Park, Mingu Lee, Christopher Lott., 2024.04.
[2] “LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding”, Doohyuk Jang, Sihwan Park, June Yong Yang, Yeonsung Jung, Jihun Yun, Souvik Kundu, Sung-Yub Kim, Eunho Yang., 2024.10.
[3] “Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding” Yao Teng, Han Shi, Xian Liu, Xuefei Ning, Guohao Dai, Yu Wang, Zhenguo Li, Xihui Liu. , 2024.10.
[4] “Do Vision & Language Decoders use Images and Text equally? How Self-consistent are their Explanations?” Parcalabescu & Frank, 2024

Questions

Question: Could the method be generalized to incorporate additional data modalities, such as audio or video? This would broaden its appeal in real-world applications.

Suggestions and Typos:

  • Page 7, line 458: “mcuh less than” should be corrected to “much less than.”
  • The use of “draftings” is slightly awkward. Consider “drafting strategies” or “draft methods” for clarity.
Review
Rating: 3

In this work the authors explore speculative decoding in the realm of multi-modal language models. Primarily, they focus on how different input representations to a draft model impact block efficiency. Specifically, they compare:

  • "Multimodal drafting": draft model consumes the image the same way as the large multimodal LLM
  • "Pooled drafting": image tokens are further average pooled to shorten context length, compared to the representation consumed by the large multimodal LLM.
  • "Text-only drafting": the draft model only consumes the text input, no images.
  • "Caption drafting": the draft model consumes a caption representing the image, generated by an external captioning model.

They report that different methods lead to different block efficiency across different tasks, which motivates them to introduce "In-batch Ensemble Drafting", a method where they run multiple draft models in parallel and select the next (draft) token based on a uniform average of the prediction probabilities of the ensemble members. They show that in their setting this strategy can improve block efficiency compared to the individual ensemble members.
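For concreteness, a minimal sketch of how such a uniform-average ensembling step could look (this is my own illustration with assumed names and shapes, not the authors' implementation):

    import torch

    def ensemble_draft_step(draft_logits: torch.Tensor) -> torch.Tensor:
        # draft_logits: (num_variants, vocab_size), one row per drafting
        # variant (M, T, C, P) produced by a single in-batch forward pass.
        probs = torch.softmax(draft_logits, dim=-1)   # per-variant next-token distributions
        avg_probs = probs.mean(dim=0)                 # uniform average over ensemble members
        return torch.multinomial(avg_probs, num_samples=1)  # sample the next draft token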

Strengths

The paper presents a clear framing of the problem and motivates its central contributions well. It is clearly written and easy to follow. Furthermore, the authors notably design their evaluation setup to include tasks with one, two, and five images per sample, which offers a more comprehensive view of speculative decoding in typical multimodal settings.

Weaknesses

  • The purpose of speculative decoding is to improve inference latency by reducing the need for token-by-token auto-regressive decoding with a much larger model. The authors clearly motivate that, due to the significantly smaller draft model (and associated compute need), they squarely focus their study on block efficiency. However, as described in Appendix E, their central method uses another model as part of the drafting strategy: Florence 2 Large FT. This is a 0.77B model in its own right, more than ten times the size of the proposed draft model in the paper. This compute is also not re-used at the verification stage (unlike in the multimodal drafting mode), so it is additional latency that, given the design of the in-batch ensembling method, cannot be parallelized. Considering this, the slight improvement in block efficiency reported when going from MT to MTC (1-image case: +0.02, 2-image case: neutral, 5-image case: +0.15) does not seem practical. Similarly, as mentioned in Gagrani et al., 2024, text-only drafting has the notable upside that it can be parallelized with image encoding, which can further limit the practically achievable performance improvements obtained from slightly higher block efficiency. It would be great if the paper could discuss some of these practical considerations, in addition to the strong focus on block efficiency as the target metric. Specifically, I would suggest considering a metric that incorporates the time spent on captioning (when used), such as the overall speed-up (including captioning) of the proposed method; a rough sketch of such a metric is given after this list. This may also directly motivate even smaller / more efficient captioners.
  • The "pooled" drafting strategy is essentially just a slightly different instance of multimodal drafting. Such pooling is also a popular choice for large multimodal LLMs themselves (i.e., not draft models). It is a valid choice, but perhaps less novel / different than the terminology may suggest (see, for example, McKinzie et al., 2024).
  • Another result the authors discuss is that, in their setting, multimodal drafting (i.e., not pooled) performs poorly in the n = 5 image setting, which could be a result of the relatively small size of the drafting model. At only 68M parameters, it is significantly smaller than the 115M-parameter draft model proposed in Gagrani et al., 2024. It would have been great to see results with different draft model sizes to verify the notable drop in block efficiency in the multi-image settings.
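A back-of-the-envelope sketch of the end-to-end metric suggested in the first point above (the cost model and all quantities are placeholder assumptions of mine, not measurements from the paper):

    def overall_speedup(n_tokens, t_target_step, t_draft_step,
                        block_efficiency, t_caption=0.0):
        # Rough cost model: the baseline decodes every token with the target model;
        # speculative decoding pays per-token drafting cost, one target verification
        # pass per accepted block, plus a one-off captioning cost when caption
        # drafting (e.g., Florence 2) is part of the ensemble.
        t_baseline = n_tokens * t_target_step
        n_target_calls = n_tokens / block_efficiency
        t_spec = n_tokens * t_draft_step + n_target_calls * t_target_step + t_caption
        return t_baseline / t_spec

    # hypothetical usage: overall_speedup(256, 30e-3, 2e-3, 2.5, t_caption=0.2)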

Questions

  • Given the additional complexity, and practical latency, of the caption-based drafting approach, have you considered just an MTP ensemble? Since it is cheap to create, perhaps also different pooling targets in one ensemble?
  • Have you considered different draft model sizes? Perhaps comparing your current size of 68M to the one proposed in Gagrani et al., 2024 (115M), or even larger?
  • By selecting evaluation benchmarks that ask for simple, direct answers, such as VQAv2, and then changing the prompt to elicit more verbose responses, could it be that the setup presents a particularly favorable setting for speculative decoding? If the question is something as simple as "What color is the truck?", a long-form written response may not be particularly information dense.
Review
Rating: 3

This paper investigates the effectiveness of speculative decoding for multimodal large language models. The paper first studies the time spent in vision encoding, prefill, and decoding to understand the bottleneck. It then observes that the time fractions remain almost constant across different context lengths, so the speedup depends solely on the block efficiency. The paper thus proposes different drafting methods and an ensemble to maximize the efficiency.

Strengths

  • The paper conducts a comprehensive study and shows enough preliminary results to demonstrate its findings.
  • The paper is among the first few papers to work on MLLMs specifically.

Weaknesses

  • The paper lacks a principled contribution in terms of either algorithm or data.
  • The discoveries in the paper have largely been seen in the prior literature.
  • The methods proposed by the paper, such as pooling or ensembling, lack significant contribution.

Questions

Table 12 seems to be the only one reporting accuracy; however, the numbers are very low and the drop is quite significant. Some of the benchmarks are also very simple. I would suggest the authors report more results on difficult ones like MMMU.

Review
Rating: 5

This paper is an extension of https://arxiv.org/abs/2404.08856, providing more thorough experimental analysis, and proposes a novel method, namely, In-Batch Ensemble Drafting.

For additional experimental analysis, this paper finds that the bottleneck of multimodal speculative decoding lies in the block efficiency, and that the key to improving this factor is to improve the drafting method. Moreover, through a comparison of four drafting methods, namely multimodal (M), text-only (T), caption (C), and pooled multimodal (P) drafting, the paper observes that, although in general C > M > P > T (C > P > T > M when the number of images reaches 5), no single drafting method covers all the tokens correctly predicted by the others.

To remedy this issue, the authors propose In-Batch Ensemble Drafting, which integrates all four drafting methods with minimal memory overhead thanks to a single small drafting model. The ensembling is implemented by sampling from the averaged distribution.
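In symbols, the described ensembling over the K = 4 drafting variants amounts to (my paraphrase of the above, not a formula from the paper):

    q_{\mathrm{ens}}(x_t \mid x_{<t}) = \frac{1}{K} \sum_{k \in \{M,T,C,P\}} q_k(x_t \mid x_{<t}), \qquad x_t \sim q_{\mathrm{ens}}(\cdot \mid x_{<t})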

Strengths

  • The experimental findings are interesting and useful.
  • The ensembling method is effective and insightful.
  • It is interesting to see that multimodal drafting is not better than caption drafting, and that, as the number of images increases, multimodal drafting degrades very fast.

Weaknesses

  • The presentation may be a bit poor.
    • For example, it would be better to put the discussion of the four drafting methods together. Instead, the paper compares M vs. T in “Section 4: Analysis of Speculative Decoding for MLLMs” and compares C vs. P vs. M vs. T in “Section 5: Exploring Drafting Methods for MLLMs”. Section 4 could then focus on the analysis of speculative decoding (such as the time analysis in 4.2) instead of drafting methods.
    • The section titles are not straightforward. For example, when seeing “Section 5.1: How Necessary is the Image Modality for Drafting”, I originally thought that this section mainly discussed M vs. T, or at least vs. C. However, it actually discusses M versus P, where the image modality still exists. Worse, “Section 5.2: Can We Replace Image Modality with Another One for Drafting” discusses caption drafting, yet text-only drafting is also a modality other than the image modality.
  • C > M > P > T for fewer images, but C > P > T > M for more images.
    • Can the LLaVA 1.5 models support n > 5 images as inputs? (I’m afraid LLaVA 1.5 is not trained on that many images.) If not, the performance degradation may not be caused by the drafting methods but by the model itself.
    • The performance of caption drafting is too high, implying that the draft models are sub-optimal. In fact, current SOTA models never leverage captions as image features, since captions suffer from information loss compared with image encoders.

Questions

See weaknesses.

Withdrawal Notice

I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.

Comment

We'd like to thank the reviewers for their valuable insights. After careful consideration, we have decided to withdraw our paper. We will try to improve our research based on your feedback.