PaperHub
Overall: 7.3/10 · Oral · 4 reviewers
Ratings: 4, 4, 5, 5 (min 4, max 5, std 0.5) · Confidence: 3.3
Novelty: 2.8 · Quality: 3.0 · Clarity: 3.0 · Significance: 3.3
NeurIPS 2025

ElasticMM: Efficient Multimodal LLMs Serving with Elastic Multimodal Parallelism

OpenReview · PDF
Submitted: 2025-05-09 · Updated: 2025-10-29

Abstract

Multimodal large language models (MLLMs) extend LLMs to handle images, videos, and audio by incorporating feature extractors and projection modules. However, these additional components—combined with complex inference pipelines and heterogeneous workloads—introduce significant inference overhead. Therefore, efficiently serving MLLMs remains a major challenge. Current tightly coupled serving architectures struggle to distinguish between mixed request types or adapt parallelism strategies to different inference stages, leading to increased time-to-first-token (TTFT) and poor resource utilization. To address this, we introduce Elastic Multimodal Parallelism (EMP), a new serving paradigm that elastically adapts to resource heterogeneity across request types and inference stages. Building upon EMP, we develop ElasticMM, an MLLM serving system that (1) separates requests into independent modality groups with dynamic resource allocation via a modality-aware load balancer; (2) decouples inference stages and enables parallelism adjustment and adaptive scaling via elastic partition scheduling; and (3) improves inference efficiency through unified multimodal prefix caching and non-blocking encoding. Experiments on diverse real-world datasets show that ElasticMM outperforms state-of-the-art (SOTA) serving systems, reducing TTFT by up to 4.2$\times$ and achieving 3.2–4.5$\times$ higher throughput while meeting service-level objectives (SLOs).
Keywords
Hardware and Systems · Efficient Inference Methods · Distributed Inference · Visual Question Answering

Reviews and Discussion

Review (Rating: 4)

This paper proposes a serving framework for multimodal large language models (MLLMs) that elastically adapts to resource heterogeneity across different request types and inference stages. The proposed framework, ElasticMM, enables dynamic resource allocation via a modality-aware load balancer, parallelism adjustment and adaptive scaling through elastic partition scheduling, and inference optimization using unified multimodal prefix caching and non-blocking encoding. By evaluating the framework on the VisualWebInstruct and ShareGPT-4o datasets with LLaMA3.2-Vision-11B and Qwen2.5-VL-7B models, the authors demonstrate significant performance improvements, achieving a 4.2× reduction in time-to-first-token and approximately 4× higher throughput.

Strengths and Weaknesses

Strengths

  • The paper achieves impressive improvements in inference efficiency, substantially reducing latency and enhancing throughput for MLLMs.
  • The paper provides a comprehensive review of existing literature, clearly outlining the motivation and necessity for the proposed approach.
  • The method is thoroughly analyzed, with detailed ablation studies validating the contributions of each component.

Weaknesses

  • Some figures lack clear explanations, making it difficult to fully understand the presented concepts (see detailed questions below).
  • The architectures of the target inference models are described only in text, which makes it challenging to grasp the inference pipeline and model architecture. An overview diagram would significantly aid comprehension of Lines 72–94.
  • The evaluation focuses only on 7B and 11B models. It would be valuable to assess performance on larger model sizes for a more comprehensive understanding of scalability.

Minor

  • In Figure 4, the letter "s" in “Images” appears to have a smaller font size.
  • Section 3.3 (Inference Optimization) may not logically belong under Section 3 (Elastic Multimodal Parallelism). I recommend either renaming Section 3 or moving Section 3.3 to a new standalone section.

Questions

  • (Figure 1) What datasets are used in subfigures (b) and (c)?
  • (Figure 2) Why does the image-text model exhibit dynamic request distribution?
  • (Figure 4) What do the dashed lines within the rectangles signify? What does the GPU icon specifically represent? How could existing methods be visualized within this figure for comparison?
  • Would it be possible to conduct ablation studies for each method proposed in Sections 3.1 and 3.2?

Limitations

The authors briefly mention future work in a single sentence. A more in-depth discussion on current limitations and potential future directions would enhance the clarity.

Final Justification

The paper addresses an important and useful topic. However, the current manuscript lacks comprehensive diagrams that would facilitate an overall understanding of the proposed approach. Additionally, practical serving experiments using real-world large-scale models are insufficiently demonstrated. For these reasons, I assigned a "weak accept". Nevertheless, I would like to note that I am not an expert in LLM serving.

Formatting Issues

.

Author Response

Weaknesses

W1: The architectures of the target inference models are described only in text, which makes it challenging to grasp the inference pipeline and model architecture. An overview diagram would significantly aid comprehension of Lines 72–94.

A: Thank you for your great suggestion. Adding an architectural diagram of the MLLM inference pipeline would indeed help readers better understand the processing workflow. We will include this figure in the revised version.

W2: The evaluation focuses only on 7B and 11B models. It would be valuable to assess performance on larger model sizes for a more comprehensive understanding of scalability.

A: Supporting larger models brings our system closer to real-world deployment scenarios and provides a stronger test of its scalability. Serving such models requires substantially more GPU resources. For example, a 70B model typically needs two or even four GPUs for inference. In our decoupled-stage inference design, effectively leveraging ElasticMM’s elastic scheduling for such models necessitates at least two nodes. Therefore, deploying ElasticMM on larger clusters to support larger models is a key direction for our future work.

However, to preliminarily validate the scalability of our system, we conducted a multi-node experiment: we deployed two instances running the Llama3.2-Vision 11B model on two machines interconnected with 200GB/s InfiniBand. We migrated requests of varying sequence lengths from one instance to another and recorded the downtime during the migration, as shown in the table below.

Table 1. Downtime and overhead of KV cache migration on multi-node

Overhead         1K      2K      4K      8K      10K
Migration        21ms    23ms    28ms    30ms    33ms
Recompute        255ms   476ms   812ms   1750ms  2136ms
Decode Latency   46ms    42ms    41ms    44ms    45ms

The results show that even in inter-node scenarios, the overhead of KV cache migration remains low. This is not only far less than recomputation, but also lower than decode latency. This preliminary experiment demonstrates that ElasticMM maintains its performance benefits under common multi-node deployment settings, validating the scalability of our elastic scheduling strategy.

Minor:

  • In Figure 4, the letter "s" in “Images” appears to have a smaller font size.
  • Section 3.3 (Inference Optimization) may not logically belong under Section 3 (Elastic Multimodal Parallelism). I recommend either renaming Section 3 or moving Section 3.3 to a new standalone section.

A: Thank you for pointing this out. We acknowledge the inconsistency in the font in Figure 4 and will correct it in the revised version. Additionally, we agree that Section 3.3 does not logically align well with the rest of Section 3. To improve the structural clarity of the paper, we will reorganize the content and move it into a separate section in the revision.

Questions

Q1: (Figure 1) What datasets are used in subfigures (b) and (c)?

A: The data shown in the figures are sourced from the publicly available ShareGPT-4o dataset. (dataset link) We will make sure to clearly indicate this in the revised version.

Q2: (Figure 2) Why does the image-text model exhibit dynamic request distribution?

A: This observation is based on traces from real-world production workloads. Today, many large model service providers use unified multimodal models to handle both text and multimodal requests. In practical user sessions, users often begin with pure text input and gradually introduce multimodal content such as images or PDFs, frequently followed by additional textual queries or descriptions. As a result, text-only requests tend to be more stable over time, whereas multimodal requests exhibit much greater temporal variability. This variability includes both clear periodic patterns (e.g., day-night cycles) and short-term spikes. Together, these behavior patterns lead to a highly dynamic request distribution over time for vision-language models.

Q3: (Figure 4) What do the dashed lines within the rectangles signify? What does the GPU icon specifically represent? How could existing methods be visualized within this figure for comparison?

A: In Figure 4, the dashed lines within each stage represent relative batch sizes. The decode stage shows more dashed lines than the prefill stage, indicating a larger batch size. This is because decode is primarily memory-bound, and increasing its batch size moderately does not significantly affect latency but improves overall throughput. Each GPU icon represents a single GPU instance. We agree that adding a comparison with existing methods would better highlight system differences. Due to a new policy introduced by the committee, we are currently unable to provide additional figures. However, we will include such a comparison diagram in the revised version of the paper.

Q4: Would it be possible to conduct ablation studies for each method proposed in Sections 3.1 and 3.2?

A: It is difficult to conduct separate, standard ablation studies for the methods proposed in Sections 3.1 and 3.2, because these methods are designed to function as a hierarchical and cooperative scheduling framework, rather than as independent components. After resource adjustment across modality groups via the strategies in Section 3.1, intra-group resources must be further scheduled dynamically using the methods in Section 3.2. These two components are tightly coupled—for instance, intra-group resource scarcity may trigger inter-group migration or preemption of GPU instances. Therefore, the two techniques are executed jointly and influence each other during runtime, making it difficult to perform standard ablation studies by simply "turning off" one of them in isolation.
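
To make this coupling concrete, the following is a minimal, illustrative sketch of how the two scheduling levels might interact within a single control-loop iteration. It is not the paper's implementation; the names (`ModalityGroup`, `rebalance_groups`, `adjust_partitions`) and the load heuristics are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class ModalityGroup:
    """Illustrative modality group: a pool of GPU instances plus per-stage load."""
    name: str
    gpus: int
    stage_load: dict = field(default_factory=lambda: {"encode": 0.0, "prefill": 0.0, "decode": 0.0})

    def pressure(self) -> float:
        # Aggregate load per GPU; a real system would use queueing delay / SLO slack.
        return sum(self.stage_load.values()) / max(self.gpus, 1)

def rebalance_groups(groups, threshold=1.5):
    """Inter-group step (Sec. 3.1 analogue): move a GPU from the least to the most loaded group."""
    hot = max(groups, key=lambda g: g.pressure())
    cold = min(groups, key=lambda g: g.pressure())
    if cold.gpus > 1 and hot.pressure() > threshold * cold.pressure():
        cold.gpus -= 1
        hot.gpus += 1  # intra-group scarcity in `hot` triggered inter-group migration

def adjust_partitions(group):
    """Intra-group step (Sec. 3.2 analogue): identify the stage that needs more capacity."""
    # A real scheduler would resize encode/prefill/decode partitions based on a
    # gain-cost model; here we only report the current bottleneck stage.
    return max(group.stage_load, key=group.stage_load.get)

# One scheduling tick: the two levels run jointly, which is why isolating one of
# them for a standard ablation is difficult.
groups = [ModalityGroup("text", gpus=4),
          ModalityGroup("multimodal", gpus=4,
                        stage_load={"encode": 3.0, "prefill": 4.0, "decode": 2.0})]
rebalance_groups(groups)
for g in groups:
    print(g.name, g.gpus, "bottleneck:", adjust_partitions(g))
```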

Comment

Thank you for the detailed clarification. As mentioned, the manuscript would indeed become even more valuable with comprehensive figures and these additional experiments. I maintain my current positive score.

Comment

We'll ensure to incorporate these additional experimental results in the final version. Thank you for your support!

Review (Rating: 4)

The paper introduces Elastic Multimodal Parallelism, a new serving paradigm for efficiently serving multimodal large language models. The proposed system, ElasticMM, addresses key challenges in multimodal inference pipelines, including significant overhead due to additional model components and diverse computational requirements across different modalities. Experimental evaluations demonstrate that ElasticMM reduces TTFT by up to 4.2× and increases throughput by 3.2–4.5× compared to SOTA methods.

Strengths and Weaknesses

Strengths:

  • Quality: The system ElasticMM is well-designed, leveraging detailed insights about workload characteristics (burstiness, computational bottlenecks) to drive performance.
  • Significance: Substantial improvements demonstrated experimentally highlight the practical significance, addressing critical performance bottlenecks in multimodal LLM deployment.

Weaknesses:

  • Lack of Baseline: This system is only compared with vLLM and a self-implemented vLLM-Decouple. However, in December 2024, a work related to MLLM serving was posted on arXiv (Efficiently Serving Large Multimodal Models Using EPD Disaggregation), which also addresses issues such as the separation of different components. I suggest that the authors include this work as a baseline to verify the system's advancement.
  • Scalability Validation: The paper's experiments are limited to a single-node environment, potentially constraining insights about performance in real-world large-scale deployments. Although the authors state that multi-machine experiments are left for future work, for a serving system I believe a single machine is not sufficient, especially since the work involves techniques like elastic scaling that clearly require multi-machine validation. Considering that the authors may have limited machine resources, different models, traces, and machines could be used for small-scale experiments.
  • Clarity: Some statements in the paper are not very clear, for example, where does the workload in Fig. 2 come from? Is it an open-source trace or from the company's actual production trace? Please at least indicate the source.

Questions

Please see Weaknesses.

It should be noted that due to possible resource or permission limitations, the author may not be able to fully address the above issues. However, I hope the author can at least add relevant baselines and their comparisons. I will adjust my score based on the extent of the author's responses.

Limitations

yes

Final Justification

The author answered my questions well and also provided the corresponding experiments, so I raised my score.

Formatting Issues

no

Author Response

Weaknesses

W1: Lack of baseline. This system only compares with vLLM and self-implemented vLLM-Decouple. However, in December 2024, a work related to MLLM serving was posted on arXiv (Efficiently Serving Large Multimodal Models Using EPD Disaggregation), which also involves issues such as the separation of different components. I suggest that the author include this work in the baseline to verify the advancement of the system.

A: Thank you for your suggestion. We agree that the original baseline was limited. We have incorporated two new baselines for a more comprehensive evaluation: EPDServe [1], as you suggested, and DistServe (OSDI 2024) [2].

  • EPDServe is a recent work built on top of vLLM that introduces component separation and elastic scheduling optimizations specifically for multimodal LLMs. To ensure a fair comparison, we upgraded its vLLM version to match ours.
  • DistServe, another crucial baseline, proposes prefill/decode disaggregation and is known for its high throughput on general LLM workloads. We adapted DistServe to support multimodal LLMs by integrating a vision encoder module.

We evaluated all of these systems on input latency and maximum throughput using the Llama3.2-Vision 11B model and the ShareGPT-4o dataset. The results are as follows:

Table 1. Normalized input latency (s/token) at different request rates

System       2 req/sec   4 req/sec   6 req/sec   8 req/sec   10 req/sec
vLLM         0.11        0.30        0.45        0.66        1.33
DistServe    0.08        0.21        0.39        0.53        1.24
EPDServe     0.04        0.07        0.23        0.46        0.67
ElasticMM    0.03        0.05        0.15        0.25        0.49

Table 2. Maximum throughput (req/min) meeting SLO

System       1 SLO scale   2 SLO scales   3 SLO scales   4 SLO scales   5 SLO scales
vLLM         225           258            264            309            402
DistServe    434           489            532            587            694
EPDServe     597           628            731            845            963
ElasticMM    720           792            1030           1216           1640

We can observe that ElasticMM significantly outperforms existing baselines by fundamentally addressing their architectural and scheduling limitations:

  • Compared to DistServe, ElasticMM achieves 2.6× lower input latency and 2.3× higher throughput. DistServe adopts a decoupled architecture, but relies on static resource allocation, making it less adaptive to dynamic workloads and thus limiting throughput. In addition, it does not differentiate between text-only and multimodal requests, which further constrains efficiency under mixed-load scenarios.
  • Compared to EPDServe, ElasticMM achieves 1.8× lower input latency and 1.4× higher throughput. EPDServe introduces elastic scheduling on top of separated inference stages, but still does not differentiate between request types. Specifically, it mixes text-only and multimodal requests in the same batch for backend LLM inference. For encoder-decoder models such as Llama3.2-Vision, this mixed batching strategy significantly increases inference latency due to computation graph heterogeneity, especially from cross-attention layers.

We will add these results and discussions into the revised paper.

W2: Scalability Validation. The paper's experiments are limited to a single-node environment, potentially constraining insights about performance in real-world large-scale deployments. Although the author stated that multi-machine experiments will be left for future work, as a serving task, I believe a single machine is not sufficient, especially since the author has also involved tech like elastic scaling that clearly require multi-machine validation. Considering that the author may have limitations in machine resources, different model, trace and machines can be used for small-scale experiments.

A: Thank you for pointing this out. To address it, we conducted preliminary experiments to evaluate the scalability of our system in a multi-node environment. Under the EMP paradigm, our architecture involves relatively frequent KV cache migrations, making inter-node communication latency a potential bottleneck. To assess this, we deployed two inference instances of the LLaMA-Vision 11B model on two machines connected via 200 GB/s InfiniBand, and migrated requests with varying sequence lengths between them. We measured the downtime during migration and compared it with both recomputation latency and single-step decode latency, as shown in the table below.

Table 3. Downtime and overhead of KV cache migration on multi-node

Overhead         1K      2K      4K      8K      10K
Migration        21ms    23ms    28ms    30ms    33ms
Recompute        255ms   476ms   812ms   1750ms  2136ms
Decode Latency   46ms    42ms    41ms    44ms    45ms

These results show that even in inter-node scenarios, the overhead of KV cache migration remains low. This is not only far less than recomputation, but also lower than a single decode step. In contrast, the recomputation cost increases significantly with sequence length, reaching up to 64× the migration time. For instance, recomputing the KV cache of a 10k token sequence takes 2.1 seconds, equivalent to stalling 47 decode iterations.
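
As a quick sanity check, the ratios quoted above can be reproduced directly from the 10K-token column of Table 3; the snippet below is only a worked recalculation of the reported numbers.

```python
# Values from the 10K column of Table 3 above, in milliseconds.
migration_ms = 33
recompute_ms = 2136
decode_step_ms = 45

print(f"recompute vs. migration: {recompute_ms / migration_ms:.1f}x")   # ~64.7x, the "up to 64x" cited
print(f"stalled decode steps:    {recompute_ms / decode_step_ms:.0f}")  # ~47 iterations
```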

This preliminary experiment demonstrates that ElasticMM maintains its performance benefits under common multi-node deployment settings, validating the scalability of our elastic scheduling strategy. Extending our system to a larger number of nodes will be a crucial focus of our future work.

W3: Some statements in the paper are not very clear, for example, where does the workload in Fig. 2 come from? Is it an open-source trace or from the company's actual production trace? Please at least indicate the source.

A: The data for this figure is derived from the company's production traces, covering a 24-hour interval. We will supplement the source of this workload in the paper.

Reference:

[1]: Efficiently Serving Large Multimodal Models Using EPD Disaggregation. arxiv link

[2]: DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. arxiv link

Comment

Thanks authors for your rebuttal. My questions have been resolved, I will raise my score accordingly.

Comment

Thank you for your support, we'll incorporate these additional results in the final version.

Review (Rating: 5)

The paper targets the high-latency, low-throughput pain points that arise when existing LLM serving systems are asked to handle multimodal workloads. The authors observe that current pipelines run every stage—image preprocessing/encoding, LLM prefill, and decode—on the same GPU instances and batch both text-only and multimodal requests together, leading to severe resource contention and SLO violations.

To address this, they propose Elastic Multimodal Parallelism (EMP) and implement it in a system called ElasticMM. The key ideas are:

  • Modality-aware decoupling: Route text-only and multimodal requests to separate modality groups, each with its own GPU pool, then let a load balancer proactively allocate—and reactively pre-empt—GPUs across groups as traffic shifts.
  • Stage-level elasticity: Within each group, further disaggregate the pipeline so that encoding, prefill, and decode can scale up or shrink independently via an elastic partition scheduler driven by a gain-cost model.
  • MLLM-specific optimizations: (i) a unified multimodal prefix cache that reuses both image and text prefixes, and (ii) non-blocking encoding that runs vision encoding asynchronously, eliminating its interference with prefill and decode.
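
As a rough illustration of point (i), a unified multimodal prefix cache can be thought of as keying cached KV blocks on a hash computed over the interleaved text-and-image prefix, so that requests sharing the same leading text and images reuse prior computation. The sketch below is a hypothetical simplification, not ElasticMM's actual implementation; `prefix_key` and `image_digest` are invented names.

```python
import hashlib

def image_digest(image_bytes: bytes) -> str:
    """Hypothetical: identify an image by a content hash so identical images hit the cache."""
    return hashlib.sha256(image_bytes).hexdigest()[:16]

def prefix_key(items) -> str:
    """Build a cache key over an interleaved (text, image) prefix.

    `items` is a list like [("text", "Describe"), ("image", b"...bytes...")].
    Requests sharing the same leading items can reuse the cached KV blocks.
    """
    h = hashlib.sha256()
    for kind, payload in items:
        if kind == "text":
            h.update(b"T:" + payload.encode())
        else:  # image
            h.update(b"I:" + image_digest(payload).encode())
    return h.hexdigest()

cache = {}  # prefix key -> placeholder for cached KV blocks

key = prefix_key([("text", "What is in this picture?"), ("image", b"\x89PNG...")])
cache.setdefault(key, "cached KV blocks for this multimodal prefix")
print(key[:12], cache[key])
```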

The prototype is built atop vLLM and evaluated on an 8×A800 workstation with two representative MLLMs—LLaMA-3.2-Vision-11B (enc-dec) and Qwen-2.5-VL-7B (dec-only)—under two mixed workloads, VisualWebInstruct and ShareGPT-4o. Compared with vLLM, ElasticMM cuts time-to-first-token by up to 4.2× and boosts end-to-end throughput by 3.2–4.5× while meeting service-level objectives; ablations show that each component of EMP contributes materially to these gains.

Strengths and Weaknesses

Strengths

  1. Robust quality and engineering rigor. An end-to-end implementation atop vLLM is benchmarked on two real-world workloads and two representative MLLMs, showing up to 4.2× TTFT reduction and 3.2–4.5× higher throughput, with detailed ablations that isolate each component's contribution.

  2. Novel decoupled-and-elastic paradigm for MLLM serving (Originality & Significance). Elastic Multimodal Parallelism (EMP) introduces (i) modality-aware load balancing, (ii) stage-level elastic partition scheduling, plus (iii) unified multimodal prefix caching and non-blocking encoding—an architectural combination not present in prior systems.

Weaknesses

  1. Single-node evaluation limits external validity. All results come from one 8 × A800 workstation; multi-node or heterogeneous-cluster behavior is left to “future work,” so real-world scalability remains unproven.

  2. Baseline coverage is too narrow. Only vLLM and an in-house static variant are compared; other state-of-the-art disaggregated or high-throughput frameworks (SGLang, DeepSpeed-FastGen, DistServe, etc.) are excluded, leaving the true performance gap uncertain. In addition, the number of evaluated models is limited.

Questions

1. Multi-node scalability. Your experiments are limited to a single 8 × A800 box. Could you either (a) add preliminary multi-node results or (b) provide an analytic model/trace-driven simulation that shows EMP still improves TTFT and throughput when network latency and cross-node KV migration are present?

2. Baseline coverage. Please justify excluding SGLang, DeepSpeed-FastGen, Splitwise, DistServe and Mooncake, or add at least one of them to Table 3. If ElasticMM still shows ≥2× improvement under identical settings, that would materially strengthen the claim of significance.

Limitations

Environmental impact. Running eight A800s at high utilization is energy intensive; explicit reporting or mitigation discussion is absent.

Security / privacy risk. Unified prefix caching potentially exposes vision embeddings across requests. A short note on isolation mechanisms would be valuable.

Final Justification

This is a very solid paper on inference in the multimodal domain. I recommend that it be accepted.

Formatting Issues

No major violations observed. The submission appears to follow the NeurIPS 2025 template (≤ 9 pages main text, correct margins, 10-pt font, one-column references, no author names).

Author Response

Question/Weakness

Q1: Multi-node scalability. Your experiments are limited to a single 8 × A800 box. Could you either (a) add preliminary multi-node results or (b) provide an analytic model/trace-driven simulation that shows EMP still improves TTFT and throughput when network latency and cross-node KV migration are present?

A: This is a good question regarding the scalability verification of ElasticMM. Firstly, our core scheduling algorithm, EMP, operates at the granularity of individual GPU instances, meaning there is no fundamental difference at the algorithmic level whether it is deployed across multiple nodes. However, the primary challenge with multi-node deployment is the additional latency caused by inter-node network bandwidth.

Therefore, we added a preliminary multi-node experiment: we deployed two instances running the Llama3.2-Vision 11B model on two machines interconnected with 200GB/s InfiniBand. We migrated requests of varying sequence lengths from one instance to another and recorded the downtime during the migration. We then compared this migration time with both recomputation time and single-decode latency.

Table 1. Downtime and overhead of KV cache migration on multi-node

Overhead         1K      2K      4K      8K      10K
Migration        21ms    23ms    28ms    30ms    33ms
Recompute        255ms   476ms   812ms   1750ms  2136ms
Decode Latency   46ms    42ms    41ms    44ms    45ms

The results show that even in inter-node scenarios, the overhead of KV cache migration remains low. This is not only far less than recomputation, but also lower than decode latency. In contrast, the recomputation cost increases significantly with sequence length, reaching up to 64× the migration time. For instance, recomputing the KV cache of a 10k token sequence takes 2.1 seconds, equivalent to stalling 47 decode iterations. This preliminary experiment demonstrates that ElasticMM maintains its performance benefits under common multi-node deployment settings, validating the scalability of our elastic scheduling strategy.

Q2: Baseline coverage. Please justify excluding SGLang, DeepSpeed-FastGen, Splitwise, DistServe and Mooncake, or add at least one of them to Table 3. If ElasticMM still shows ≥2 × improvement under identical settings, that would materially strengthen the claim of significance.

A: Thank you for your suggestion. Our original baselines were indeed limited. We have adopted your advice by adding a comparative experiment with DistServe [1]. Additionally, we also include EPDServe [2] (ICML 2025) as another newer baseline for comparison.

  • DistServe serves as a crucial baseline because it adopts prefill/decode disaggregation and is recognized for its high throughput in general LLM workloads. To support evaluation on multimodal LLMs, we extended DistServe by adding a vision encoder module.
  • EPDServe, a recent work, is built on top of vLLM and uniquely optimizes multimodal LLM inference by introducing component separation and elastic scheduling. To ensure fairness in comparison, we have upgraded its vLLM version to match ours.

We evaluated all systems on input latency and maximum throughput using the LLaMA-Vision 11B model and the ShareGPT-4o dataset. The results are as follows:

Table 2. Normalized input latency (s/token) with increased request rates

System       2 req/s   4 req/s   6 req/s   8 req/s   10 req/s
vLLM         0.11      0.30      0.45      0.66      1.33
DistServe    0.08      0.21      0.39      0.53      1.24
EPDServe     0.04      0.07      0.23      0.46      0.67
ElasticMM    0.03      0.05      0.15      0.25      0.49

Table 3. Maximum throughput (req/min) meeting SLO scales

System       1 SLO scale   2 SLO scales   3 SLO scales   4 SLO scales   5 SLO scales
vLLM         225           258            264            309            402
DistServe    434           489            532            587            694
EPDServe     597           628            731            845            963
ElasticMM    720           792            1030           1216           1640

ElasticMM still outperforms the new baselines by addressing their architectural and scheduling limitations:

  • Compared to DistServe, ElasticMM achieves 2.6× lower input latency and 2.3× higher throughput. DistServe adopts a decoupled architecture, but relies on static resource allocation, making it less adaptive to dynamic workloads and thus limiting throughput. Furthermore, it does not differentiate between text-only and multimodal requests, which further constrains efficiency under mixed-load scenarios.
  • Compared to EPDServe, ElasticMM achieves 1.8× lower input latency and 1.4× higher throughput. EPDServe introduces elastic scheduling on top of separated inference stages, but still does not differentiate between request types. Specifically, it mixes text-only and multimodal requests in the same batch for backend LLM inference. For encoder-decoder models such as LLaMA-Vision, this mixed batching strategy significantly increases inference latency due to computation graph heterogeneity, especially from cross-attention layers.

These results clearly demonstrate the superiority of our proposed method. We will include these results and the corresponding discussion in the revised paper.

Reference:

[1]: DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. arxiv link

[2]: Efficiently Serving Large Multimodal Models Using EPD Disaggregation. arxiv link

Comment

Thank you for conducting the new experiments; my concerns have now been fully addressed. This is a very solid piece of work, and I will be raising my score accordingly.

Comment

Thank you for your insightful review and valuable suggestions, which have been very helpful in improving our work.

Review (Rating: 5)

The authors introduce ElasticMM, a multimodal LLM serving framework designed to optimize time-to-first-token (TTFT) and throughput. The authors note the importance of separating text-only and multimodal requests and the need for an elastic serving system to handle dynamic multimodal requests. Modality-aware load balancing, elastic instance allocation, and an elastic auto-scaling mechanism are proposed to achieve these goals. Extensive experiments on real-world workloads validate the effectiveness of the proposed method.

Strengths and Weaknesses

Strengths:

  • This paper studies an important problem, i.e., multi-modal LLM serving, which differs from text-only LLM serving but has not received much attention. The paper is novel and well motivated. The proposed elastic instance allocation and elastic auto-scaling mechanisms do work.
  • This paper is well-written and easy to follow.

Weaknesses:

  • As the authors discuss in the Limitations section, the proposed method is only implemented and tested in a single-node environment. While multi-node model serving is not studied, the contribution of this work is non-trivial. It would be great if the paper briefly discussed, in a short future work section, how the proposed method could be extended to a multi-node model serving environment.

Questions

  • In Figure 2, the average time per output token is shown. For the proposed approach, it increases at a very low rate as the request rate increases, while for vLLM the average time per output token increases sharply. I am wondering what the reason is for this sharp increase when vLLM is used?
  • If model parallel is needed in order to serve the model, will things change? For example, if a very large vision encoder is used and a GPU only holds the vision encoder (pipeline parallel rank 0 only holds the vision encoder). Will the performance gain of the proposed method become less obvious?

Limitations

Yes

Final Justification

Although MM model serving is not a very "hot" topic in academia, it has high practical value. The proposed method works well. The authors addressed my questions, e.g., generalization to a multi-node setting. Hence, my recommendation is "accept".

Formatting Issues

N/A

Author Response

Weaknesses

W1: As the authors discuss in the Limitations section, the proposed method is only implemented and tested in a single-node environment. While multi-node model serving is not studied, the contribution of this work is non-trivial. It would be great if the paper briefly discussed, in a short future work section, how the proposed method could be extended to a multi-node model serving environment.

A: Thank you for the suggestion. We supplemented a preliminary experiment to demonstrate the capability of our proposed method in multi-node environment. We deployed two instances of the Llama3.2-Vision 11B model on two machines connected via 200 GB/s InfiniBand, and migrated requests with varying sequence lengths between them. We measured the downtime during migration and compared it with both recomputation latency and single-step decode latency, as shown in Table 1.

Table 1. Downtime and overhead of migration on multi-node

Overhead         1K      2K      4K      8K      10K
Migration        21ms    23ms    28ms    30ms    33ms
Recompute        255ms   476ms   812ms   1750ms  2136ms
Decode Latency   46ms    42ms    41ms    44ms    45ms

These results show that even in inter-node scenarios, the overhead of KV cache migration remains low. It is not only significantly less than recomputation, but also lower than a single decode step. This preliminary experiment demonstrates that ElasticMM maintains its performance benefits under common multi-node deployment settings, validating the scalability of our elastic scheduling strategy.

As suggested, we will further add a discussion on multi-node scalability in the future work section. Specifically, we would like to briefly discuss the challenges and potential solutions related to extending our system to multi-node environments. Under our EMP paradigm, the architecture involves relatively frequent KV cache migrations, which may make inter-node network latency a potential bottleneck. To mitigate this, we plan to use high-throughput communication technologies such as InfiniBand to maximize the benefits of elastic scheduling. Additionally, the scheduler can incorporate placement constraints to reduce latency, for example by co-locating GPU instances handling the same inference stage on the same node. Furthermore, techniques such as asynchronous KV cache migration can help minimize the impact of migration on the inference process.

Questions

Q1: In Figure 2, the average time per output token is shown. For the proposed approach, it increases at a very low rate as the request rate increases, while for vLLM the average time per output token increases sharply. I am wondering what the reason is for this sharp increase when vLLM is used?

A: The significant difference in performance between vLLM and ElasticMM stems from their distinct architectural designs and scheduling strategies. Specifically,

  1. vLLM adopts a tightly coupled architecture, where the inference pipeline stages—encode, prefill, and decode—are executed serially on the same hardware. This design leads to contention for compute resources and, more critically, causes the time-consuming encode stage to block subsequent inference steps. As a result of its coupled design, output latency increases sharply.

  2. In contrast, ElasticMM employs elastic partitioned scheduling, which dynamically adjusts resource allocation between the prefill and decode stages. When the decode stage becomes resource-constrained due to large batch sizes, ElasticMM reallocates additional instances to this stage. Moreover, because different stages are isolated in execution, the decode stage is not interfered with by others. Consequently, as the request rate increases, output latency remains consistently low.
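
To illustrate the kind of stage-level reallocation described in point 2, here is a minimal, hypothetical sketch; the function, its inputs, and the simple gain-versus-cost heuristic are illustrative assumptions rather than the paper's actual gain-cost model.

```python
def reallocate(prefill_gpus: int, decode_gpus: int,
               prefill_delay: float, decode_delay: float,
               migration_cost: float = 0.03) -> tuple[int, int]:
    """Shift one GPU instance toward the more congested stage when the estimated
    gain in queueing delay (seconds) outweighs an assumed migration cost."""
    if decode_delay > prefill_delay and prefill_gpus > 1:
        gain = decode_delay / (decode_gpus + 1)  # crude estimate: delay scales ~1/instances
        if gain > migration_cost:
            return prefill_gpus - 1, decode_gpus + 1
    elif prefill_delay > decode_delay and decode_gpus > 1:
        gain = prefill_delay / (prefill_gpus + 1)
        if gain > migration_cost:
            return prefill_gpus + 1, decode_gpus - 1
    return prefill_gpus, decode_gpus

# Example: decode becomes the bottleneck as batch sizes grow at high request rates,
# so one instance is moved from prefill to decode.
print(reallocate(prefill_gpus=3, decode_gpus=3,
                 prefill_delay=0.05, decode_delay=0.40))  # -> (2, 4)
```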

Q2: If model parallel is needed in order to serve the model, will things change? For example, if a very large vision encoder is used and a GPU only holds the vision encoder (pipeline parallel rank 0 only holds the vision encoder). Will the performance gain of the proposed method become less obvious?

A: In the scenario you describe, where the vision encoder is served on a dedicated GPU (or multiple GPUs), the performance benefit of our method remains unaffected. In our system, we also isolate the encode stage onto separate GPU instances, even when it is relatively lightweight. This is because the encode stage tends to be computationally intensive, and isolation prevents it from blocking the subsequent inference stages. This is one of the key optimizations proposed in our work. As shown in Figure 9, the ablation study confirms that this design significantly reduces input latency.

However, larger models do introduce additional complexity to our elastic scheduling algorithm. When an LLM backend needs to be deployed across multiple GPUs using TP or other parallelization methods, additional constraints arise; for example, we may only be able to use an even number of GPUs to execute an inference stage. Furthermore, larger models often require multi-node deployment, which introduces new system-level challenges. Maintaining the performance gains of our elastic scheduling with large models on large-scale GPU clusters remains an important direction for our future work.

Comment

Thank you for responding to my questions. I do not have additional questions or concerns. I believe this is a solid paper and choose to keep my initial recommendation unchanged.

Final Decision

In this paper, the authors introduce ElasticMM, a framework for serving multimodal models that optimizes time-to-first-token (TTFT) and throughput. Diverse computational requirements across modalities result in several high-latency, low-throughput pain points. The authors observe that current pipelines run every stage—image preprocessing/encoding, LLM prefill, and decode—on the same GPU instances and batch both text-only and multimodal requests together, leading to severe resource contention and SLO violations.

To address this, they introduce Elastic Multimodal Parallelism (EMP). Reviewer 3qBq has a nice summary of the main workings i.e.

"The key ideas are:

  • Modality-aware decoupling: Route text-only and multimodal requests to separate modality groups, each with its own GPU pool, then let a load balancer proactively allocate—and reactively pre-empt—GPUs across groups as traffic shifts.
  • Stage-level elasticity: Within each group, further disaggregate the pipeline so that encoding, prefill, and decode can scale up or shrink independently via an elastic partition scheduler driven by a gain-cost model.
  • MLLM-specific optimizations: (i) a unified multimodal prefix cache that reuses both image and text prefixes, and (ii) non-blocking encoding that runs vision encoding asynchronously, eliminating its interference with prefill and decode.

The prototype is built atop vLLM and evaluated on an 8×A800 workstation with two representative MLLMs—LLaMA-3.2-Vision-11B (enc-dec) and Qwen-2.5-VL-7B (dec-only)—under two mixed workloads, VisualWebInstruct and ShareGPT-4o. Compared with vLLM, ElasticMM cuts time-to-first-token by up to 4.2× and boosts end-to-end throughput by 3.2–4.5× while meeting service-level objectives; ablations show that each component of EMP contributes materially to these gains."

The authors demonstrate results on real-world workloads.

Strengths

  1. All reviewers think the paper is solidly executed. The proposed elastic instance allocation and elastic auto-scaling mechanisms do work. The system is well designed, leveraging detailed insights about workload characteristics (burstiness, computational bottlenecks) to drive performance. The substantial improvements demonstrated experimentally highlight its practical significance, addressing critical performance bottlenecks in multimodal LLM deployment. The method is thoroughly analyzed, with detailed ablation studies validating the contributions of each component.
  2. All reviewers also find the contributions novel and effective. They also appreciate the high quality of the manuscript, which provides a comprehensive review of existing literature and clearly outlines the motivation and necessity of the proposed approach.

Weaknesses

  • All weaknesses were favorably addressed by the authors in the rebuttal, for example:
    • Adding more recent extra baselines of EPDServe and DistServe
    • Showing initial results on multi-node deployments, which was earlier left to future work

All 4 reviewers have chosen to accept this paper. The paper represents a solid contribution to the area of serving multimodal LLMs, with all reviewer concerns addressed.