HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location
Colocating online and offline LLM inference requests in the same inference engine.
Abstract
Reviews and Discussion
This paper proposes HyGen, an efficient serving system for Large Language Models that addresses the issue of low resource utilization in current LLM deployments through elastic online-offline request co-location. Its core methodology involves precise performance control using a latency predictor and an SLO-aware profiler, coupled with SLO-aware offline scheduling policies designed to maximize throughput and ensure fairness.
Strengths and Weaknesses
Strengths:
- The paper focuses on an interesting and important research direction: how to efficiently co-locate online and offline LLM workloads.
- The experimental evaluation is relatively comprehensive, providing a thorough investigation into the effectiveness of the proposed method.
Weaknesses:
- Limited Evaluation Scope and Generalizability
- The paper claims evaluation on "production workloads" but doesn't specify the diversity of LLM models tested. Does HyGen work equally well across different model architectures (GPT, LLaMA, etc.) and sizes? The performance characteristics may vary significantly.
- No discussion of how the system performs under different hardware configurations (different GPU types, memory constraints, etc.)
- Overhead and Complexity Analysis
- The paper introduces a latency predictor and SLO-aware profiler but doesn't appear to quantify their computational overhead. These components themselves consume resources and may impact the very metrics they're trying to optimize.
- Missing analysis of the system's bootstrapping phase - how long does it take for the predictor to become accurate? What happens during this warm-up period?
- Workload Assumptions and Limitations
- The system appears to assume predictable workload patterns. How does it handle sudden spikes or highly variable request patterns?
- No discussion of handling failures or degraded performance scenarios. What happens when predictions are wrong?
- Comparison Methodology Concerns
- Lacks a baseline fairness analysis. Were the baselines properly optimized?
- Missing comparison with other state-of-the-art hybrid serving systems beyond "online and hybrid serving baselines"
- Practical Deployment Challenges
- No discussion of deployment complexity or operational overhead. How difficult is it to tune the system parameters in practice?
- Missing analysis of how the system handles model updates or version changes, which are common in production environments
Questions
- Regarding Figure 1: How did the authors obtain the request rate data for Microsoft Azure's LLM service?
- The paper's organizational structure and presentation need significant optimization. Specifically, Section 3, which covers background and motivation, could be condensed or moved to the appendix. In contrast, the explanation of the proposed framework and Algorithm 1 in Section 4 is overly vague and confusing, urgently requiring clearer elaboration. More crucially, the content from Appendix A.1 and A.2 should be moved into the main body of the paper, as they constitute essential components of the core methodology.
- The architectural diagram (Figure 2) in the paper requires more detailed explanations. For instance, the diagram mentions "check violation" under "Section 4.2," but there is no corresponding explanation or detailed description of this mechanism in the main text, leaving readers confused about its meaning.
- What are the specific novelties or differences of the "Prefix Sharing Maximization Strategy" presented in this paper compared to prior work? These distinctions should be clearly identified and elaborated upon.
- Please provide a specific and clear explanation for the phrase "at a specific offline QPS" in Line 240.
Limitations
The paper seems to lack formal guarantees or theoretical analysis of the scheduling policies. Under what conditions might the SLO guarantees fail?
Justification for Final Rating
Thank you for your rebuttal and the additional experiments provided. While I appreciate the efforts to address some concerns, several critical issues remain unaddressed or insufficiently resolved, which are essential to evaluating the robustness and practicality of HyGen.
First, regarding the evaluation scope and generalizability: the inclusion of more models (e.g., Sheared-LLaMA-2.7B) and an additional GPU (A5000) is a step forward, but questions about performance across broader architectures (e.g., GPT variants) and detailed hardware constraints (e.g., memory limitations under varying loads) still lack concrete analysis.
More importantly, key concerns about overhead and complexity—specifically the computational costs of the latency predictor and SLO-aware profiler, as well as the bootstrapping/warm-up phase of the predictor—remain entirely unaddressed. These are critical to understanding whether the system’s components introduce bottlenecks that undermine its claimed efficiency.
Additionally, there is no discussion of how HyGen handles sudden workload spikes, variable request patterns, or failure scenarios (e.g., when latency predictions are inaccurate). Such scenarios are common in production, and their absence weakens claims of practical applicability.
The comparison methodology also remains problematic: there is no clarification on whether baselines were properly optimized, nor additional comparisons with state-of-the-art hybrid serving systems beyond the mentioned "online and hybrid baselines."
Finally, practical deployment challenges—including parameter tuning difficulty and handling model updates—are still unaddressed, leaving uncertainties about the system’s real-world feasibility.
In summary, while some clarifications (e.g., "check violation" in Figure 2, the "Prefix Sharing Maximization" novelties) are helpful, the core issues related to overhead, workload robustness, comparison rigor, and deployment practicality need more thorough analysis to justify the work’s validity.
So I will keep my original score.
Formatting Issues
See questions.
We thank the reviewer for their valuable feedback and for recognizing our work as an "interesting and important research direction" with a "relatively comprehensive" experimental evaluation. We believe the concerns raised are primarily due to misunderstandings, which we have addressed point-by-point below. We have also included new experimental results on additional hardware to further strengthen our claims of generalizability.
Concern 1: Limited Evaluation Scope and Generalizability
Does HyGen work equally well across different model architectures... and sizes?
No discussion of how the system performs under different hardware configurations...
HyGen is generalizable across diverse models and hardware, as demonstrated in the paper and confirmed with new experiments.
- Paper's Evidence: Our evaluation already included diverse models (Llama2-7B, Qwen-14B, Yi-34B, Mistral-7B) on two distinct hardware platforms (4x A100 & 4x A40 GPUs), as detailed in Sec. 5.1 & 5.4. The consistent results in our figures validate HyGen's robustness.
- New Results: We conducted new tests on an NVIDIA A5000 GPU (24 GB VRAM) with a Sheared-LLaMA-2.7B model. The results below confirm HyGen achieves significant throughput gains (up to 2.18× offline and 1.30× total), reinforcing its adaptability.
| Mean TBT Tolerance | Offline TPS HyGen* | Total TPS HyGen* | Offline TPS HyGen | Total TPS HyGen | Offline Throughput Gain | Total Throughput Gain |
|---|---|---|---|---|---|---|
| 5% | 214.84 | 1969.84 | 468.59 | 2223.59 | 2.18× | 1.13× |
| 10% | 366.35 | 2121.35 | 678.55 | 2433.55 | 1.85× | 1.15× |
| 15% | 519.48 | 2274.48 | 987.93 | 2742.93 | 1.90× | 1.21× |
| 20% | 667.59 | 2422.59 | 1247.69 | 3002.69 | 1.87× | 1.24× |
| 25% | 811.96 | 2566.96 | 1432.88 | 3187.88 | 1.76× | 1.24× |
| 30% | 909.68 | 2664.68 | 1600.49 | 3355.49 | 1.76× | 1.26× |
| 35% | 1009.28 | 2764.28 | 1831.82 | 3586.82 | 1.81× | 1.30× |
Concern 2: Overhead and Complexity Analysis
The paper introduces a latency predictor and SLO-aware profiler but doesn't appear to quantify their computational overhead.
...how long does it take for the predictor to become accurate? What happens during this warm-up period?
HyGen has a low-cost, one-time bootstrapping and negligible runtime overhead.
- Bootstrapping: There is no "warm-up" period. The process is a one-time, pre-deployment step that is extremely lightweight, requiring only ~15ms on a CPU for over 80,000 samples (Line 299).
- Runtime Overhead: We measured the prediction overhead to be just ~80μs per scheduling iteration on a CPU, three orders of magnitude smaller than a typical chunked batch step. This overhead can be fully hidden by asynchronous scheduling (e.g., in recent vLLM/SGLang versions), where scheduler logic runs on the CPU concurrently with GPU execution. We will clarify this in the final version. We also include an algorithmic complexity analysis in Appendix A.4 demonstrating its negligible overhead.
No discussion of deployment complexity or operational overhead. How difficult is it to tune the system parameters in practice?
Missing analysis of how the system handles model updates or version changes, which are common in production environments
HyGen is designed for simple deployment and minimal operational overhead.
- Automatic Tuning & Trivial Updates: Deployment is straightforward. The key parameter (latency budget) is set automatically by a one-time profiler (Sec. 4.2), requiring no manual tuning. For model updates, this lightweight profiling is simply re-run. Training on 80k+ samples takes only ~15 ms on a CPU (Line 299), making adaptation cost trivial.
- Zero-Overhead Runtime Adaptation: HyGen adapts to workload shifts instantly. The switch from offline to online processing completes "within a single decoding step" (Lines 628–632), due to our asynchronous design. Fig. 14 demonstrates the system's robustness to prediction errors and workload variations, maintaining SLOs without reconfiguration.
Concern 3: Workload Assumptions and Limitations
The system appears to assume predictable workload patterns. How does it handle sudden spikes or highly variable request patterns?
HyGen is expressly designed for unpredictable, bursty workloads.
- We evaluate our system using two public, bursty production traces: the Microsoft Azure trace (Patel et al.; Sec. 3.2, Fig. 1) and the Kimi Mooncake trace (App. B, Fig. 13 & 16).
- Fig. 8 shows HyGen dynamically throttling offline throughput in direct response to unpredictable online bursts.
- The preemption mechanism is built precisely for this purpose (Sec. 4, App. D).
No discussion of handling failures or degraded performance scenarios. What happens when predictions are wrong?
HyGen is inherently robust to prediction errors and system degradation via its closed-loop design.
- Prediction Errors: The system is not brittle. The profiler establishes a conservative latency budget (Alg. 1, Lines 15–16). Fig. 14 shows that even with predictor error (MAPE) >20%, HyGen maintains high throughput while strictly meeting the P99 SLO.
- System Degradation: For issues like thermal throttling, HyGen adapts automatically without any tuning. If system performance degrades, batch latencies increase, automatically reducing the available latency budget for the next scheduling iteration and ensuring SLOs are always met.
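For concreteness, the sketch below illustrates, in simplified and hypothetical form (not our exact implementation; the function and variable names are illustrative only), how observed batch latencies automatically shrink the slack handed to the offline scheduler:

```python
def remaining_budget_ms(slo_target_ms, recent_batch_latencies_ms):
    """Illustrative sketch: compute the slack available for offline work in the
    next scheduling iteration. If the system degrades (e.g., thermal throttling),
    observed batch latencies rise and the offline budget shrinks automatically."""
    if not recent_batch_latencies_ms:
        return slo_target_ms
    observed = sum(recent_batch_latencies_ms) / len(recent_batch_latencies_ms)
    return max(0.0, slo_target_ms - observed)
```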
Concern 4: Comparison Methodology Concerns
Lack baseline fairness. Were the baselines properly optimized?
Missing comparison with other state-of-the-art hybrid serving systems beyond "online and hybrid serving baselines"
Our baselines are SOTA and meticulously designed for a rigorous and fair evaluation. While other hybrid systems (e.g., MuxServe, Punica) solve orthogonal problems, our comparison focuses on single-model, online/offline co-location:
- SOTA Foundation: We chose Sarathi (OSDI '24) as our foundation. Sarathi's chunked prefill is used to support numerous recent LLM advancements (e.g., EMNLP '24 Memorize Step by Step, ASPLOS '25 POD-Attention, SwiftKV, PrefillOnly), making it a widely-used SOTA system.
- Optimized "Ceiling" Baseline: We compared against a fully-optimized pure-offline system (
Sarathi-offline) representing max throughput (Fig. 4). HyGen reaches 84.3% of this theoretical ceiling while serving an online workload under strict SLOs. - Fair Isolation Baselines: To ensure a fair, apples-to-apples comparison, we enhanced Sarathi with our own online-first preemption mechanism, thus isolating the benefit of our SLO-aware scheduler. To ablate our dynamic scheduling algorithm, we created
HyGen*that uses a static offline rate. The large performance gap between HyGen andHyGen*(Fig. 4 & 9) cleanly quantifies the gains from our dynamic, SLO-aware approach alone.
Specific Questions
How did the authors obtain the request rate data for Microsoft Azure's LLM service?
As cited in Line 106, this trace is publicly available (Patel et al., ISCA 2024). Our use of this trace is detailed in Sec. 5.1 (Line 222).
The paper's organizational structure and presentation need significant optimization... the explanation of the proposed framework and Algorithm 1 in Section 4 is overly vague... content from Appendix A.1 and A.2 should be moved into the main body.
Thank you for the feedback. We agree Section 4 can be clearer. As moving the full pseudocode would exceed the page limit, in the final version, we will enhance Sec. 4 by integrating the key architectural details from Appendix A.1. We will add a description of our asynchronous, dual-queue scheduling workflow to the main text, clarifying how Alg. 1 is invoked and providing necessary system-level context.
The architectural diagram (Figure 2) requires more detailed explanations. For instance, the diagram mentions "check violation" under "Section 4.2," but there is no corresponding explanation.
The "check violation" block refers to our core SLO compliance logic. The system predicts a potential batch's latency and checks if it violates the pre-set budget. This is the central idea of Sec. 4.2. We will add this explanation to the main text and Fig. 2's caption.
What are the specific novelties or differences of the "Prefix Sharing Maximization Strategy"...?
We clarify that our novelty is not the Prefix Sharing Maximization (PSM) concept itself, which we cited as prior work (Line 189). Our contributions are the SLO-aware application and extension of PSM, which again highlights the effectiveness of our solution in augmenting existing deployments:
- Our system integrates PSM to maximize throughput for offline tasks using only the residual GPU capacity available after satisfying strict online SLOs.
- Our extended policy (Lines 207-213) introduces a fairness mechanism to mitigate request starvation, a critical issue for production deployments that is unaddressed by naive PSM.
Please provide a specific and clear explanation for the phrase "at a specific offline QPS" in Line 240.
This describes our HyGen* baseline. We profile the max fixed, constant offline QPS the system sustains without violating online SLOs. HyGen* serves requests at this static rate, contrasting with the full HyGen system, which dynamically adjusts its offline rate. We will rephrase for clarity.
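As a simplified illustration of this profiling step (with hypothetical harness functions run_trial and meets_online_slo, not our actual tooling), the maximum sustainable static offline QPS can be found by a search such as:

```python
def max_static_offline_qps(run_trial, meets_online_slo, lo=0.0, hi=100.0, tol=0.1):
    """Illustrative sketch: binary-search the largest constant offline QPS that
    keeps the online workload within its latency SLO. run_trial(qps) replays the
    workload at that fixed offline rate and returns latency statistics."""
    best = lo
    while hi - lo > tol:
        mid = (lo + hi) / 2
        stats = run_trial(mid)
        if meets_online_slo(stats):
            best, lo = mid, mid   # SLO held: try a higher offline rate
        else:
            hi = mid              # SLO violated: back off
    return best
```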
Thank you for acknowledging the rebuttal. If any parts of our response remain unclear or raise further questions, we would be more than happy to elaborate.
We greatly appreciate your thoughtful review and consideration.
We thank the reviewer for their continued engagement. However, we must respectfully disagree with the final assessment, as it does not reflect the detailed explanations and new experimental evidence provided in our rebuttal. This response is intended for the reviewer and the Area Chair to demonstrate that a significant misunderstanding persists and that all critical issues have, in fact, been thoroughly resolved.
1. On Generalizability
"questions about performance across broader architectures... lack concrete analysis."
This is inaccurate. HyGen's generalizability is by design, not coincidence. Our latency predictor (Sec. 4.2) models the universal prefill/decode paradigm of all modern decoder-only Transformers (e.g., GPT, LLaMA). This principle was validated on a diverse model set (Llama2-7B, Qwen-14B, Yi-34B, Mistral-7B) and further confirmed with new results on an A5000 GPU in our rebuttal.
2. On Overhead and Complexity
"key concerns about overhead and complexity... remain entirely unaddressed."
The overhead is negligible and architecturally hidden. The one-time profiling cost is a mere ~15ms (Line 299). The runtime overhead is ~80μs, which is three orders of magnitude smaller than a decode step and fully masked by standard asynchronous scheduling, resulting in zero effective overhead.
3. On Workload Robustness
"no discussion of how HyGen handles sudden workload spikes... or failure scenarios..."
Robustness is a core, evaluated feature. Our system handles bursts via an online-first preemption mechanism (Fig. 8, App. D). It is robust to prediction errors via a closed-loop design that self-corrects, maintaining SLOs even with >20% predictor error (Fig. 14). Both were validated on real-world, bursty production traces.
4. On Comparison Methodology
"The comparison methodology also remains problematic..."
Our baselines were meticulously optimized for a rigorous, fair comparison. Sarathi-offline established a performance ceiling, while HyGen* served as a careful ablation to isolate our dynamic scheduler's gains (Fig. 4). As the first work to solve this specific problem, comparison to systems solving orthogonal problems (e.g., multi-model serving) would be inappropriate.
5. On Practical Deployment
"practical deployment challenges... are still unaddressed..."
Deployment is straightforward. Parameters are set automatically by our profiler (Sec. 4.2). Model updates require only re-running a negligible, 15ms profiling step, which is standard practice for high-performance systems.
Summary
In conclusion, every concern raised has been thoroughly addressed by the extensive evidence presented in our paper and rebuttal—including theoretical principles, algorithmic details, comprehensive evaluations, and new experimental data. We believe a misunderstanding of this material may persist and therefore respectfully ask the reviewer and the Area Chair to evaluate our work based on the full evidence provided.
While the authors have provided additional explanations, several key concerns remain unaddressed or insufficiently clarified:
1. Generalizability to Emerging Architectures: The authors emphasize HyGen’s design around the prefill/decode paradigm of decoder-only Transformers, but its adaptability to emerging hybrid models (e.g., Mamba-Transformer hybrids like Hunyuan-TurboS) and sparse architectures (e.g., MoE models) is untested. These models exhibit distinct computational patterns—such as Mamba’s linear sequence processing or MoE’s dynamic expert activation—that diverge from dense Transformers. It is unclear whether HyGen’s latency predictor, scheduling policies, or resource management can account for these unique characteristics, or if they would introduce inefficiencies or SLO violations. Additionally, while the expanded model set (Llama2, Qwen, etc.) is noted, the breadth of real-world LLM diversity—including variants with architecture-specific optimizations—still lacks comprehensive validation.
2. Overhead and Bootstrapping Details: Though runtime and profiling overheads are quantified (~80μs, ~15ms), critical gaps remain. The bootstrapping phase of the latency predictor is underexplained: How long does it take for predictions to stabilize? What performance degradation or SLO breaches occur during this warm-up period? These details are critical for understanding real-world deployment reliability.
3. Workload Robustness Quantification: Mechanisms like "online-first preemption" and "closed-loop correction" are mentioned, but their efficacy under extreme conditions lacks granularity. For instance, what magnitude or duration of sudden traffic spikes can HyGen handle without SLO violations? When prediction errors exceed 20%, what is the actual rate of SLO adherence? Concrete metrics here are necessary to validate robustness claims.
4. Comparison Methodology Transparency: Claims of "meticulously optimized baselines" (e.g., Sarathi-offline) lack supporting details. What specific parameters were tuned to ensure baselines performed at their peak? Without transparency into this process, it is hard to confirm the fairness of comparisons. Additionally, the exclusion of other hybrid serving systems—even if framed as "orthogonal"—requires clearer justification of why they are irrelevant to evaluating HyGen’s relative advancement.
5. Practical Deployment Nuances: While automatic parameter tuning and 15ms model update profiling are noted, critical details are missing. How robust is the automated tuning across varying hardware/memory constraints? Does model updating cause transient downtime or performance drops? These factors directly impact real-world operational feasibility.
These gaps persist despite the rebuttal and are essential to assessing HyGen’s broader applicability and reliability.
This work proposes HyGen, an interference-aware LLM serving system that enables efficient co-location of online and offline workloads while preserving service-level objectives. The two major contributions are:
- performance control mechanisms: a latency predictor to estimate the batch execution time and a profiler to estimate latency interference
- SLO-aware offline scheduling policies: maximize throughput and prevent starvation
The author starts with characterizing the different demands of online serving (low and stable response time) and offline serving (high throughput and better resource utilization). The author states that previous systems often segregate online and offline serving into different clusters. The author argues that resource efficiency can be optimized by co-locating online and offline workloads on the same GPU cluster, because the real-world LLM workloads demonstrate up to 3x variation in request rates within minutes, leading to the need to provide more GPU resources for peak demand, but underutilize the resources during off-peak hours.
Strengths and Weaknesses
Strengths
- The author addresses an important question in LLM serving for contemporary AI infrastructure, tackling the underutilization of GPU resources by intelligently combining online and offline serving; this line of research could have a big impact for LLM service providers.
- The author provides a clear and comprehensive analysis of the challenges of building a hybrid serving system, summarizing the specific requirements needed to meet the demands of current use cases.
Weaknesses
- The key to jointly handling online and offline requests lies in accurately estimating the latency of each request, but there is no experimental evaluation of the effectiveness of the linear regression model for latency prediction, nor of whether more advanced modeling could improve its robustness. Since it plays an important role in the proposed pipeline, more experimental details are necessary.
Questions
- In section 4.2, when calculating the batch execution time, why do you add the quadratic scaling term, when the decode stage shows linear scaling?
- Is the prefix sharing maximization strategy similar to the one used in the SGLang framework’s radix tree structure for prefix caching? Have you compared using block-level prefix caching, similar to vLLM? It would be interesting to have some ablation studies of using different prefix caching strategies.
- In the experiments, I am wondering why HyGen can improve over the baselines in pure online and pure offline settings? What are the key designs that achieve that?
Limitations
- The work has the goal of improving the serving of LLMs in large clusters, but the scale of the experiments seems quite limited (only two nodes with 4 GPUs each).
- In general, I find this work well-written and the motivation clear and well-conveyed; the solutions seem conceptually simple, but I do not doubt that a lot of engineering and optimization was needed to make them work.
Justification for Final Rating
The author clarified the points I mentioned in the review further, and I will maintain my initial assessment.
Formatting Issues
I did not notice any major formatting concerns.
We sincerely thank the reviewer for their positive assessment and insightful questions. We are encouraged that the reviewer finds our work "well-written", our motivation "clear and well-conveyed," and recognizes the potential impact of our research. Below we address the questions and concerns point-by-point.
Concern 1: Latency Predictor Effectiveness
The key to jointly handling online and offline requests lie in accurately estimating the latency for each request, but there is no experimental evaluation of the effectiveness linear regression model for latency prediction, also whether more advanced modelling can improve its robustness.
Our latency predictor is highly accurate, as proven in our evaluation, and the system is robust even to significant prediction errors.
- High Accuracy Proven: As evaluated in Fig. 5 (Sec. 5.3), our predictor is highly effective. The results show a mean absolute percentage error (MAPE) of only 1.78% and 1.07% on a mixed production workload.
- Robustness to Inaccuracy: While more complex models are possible, our system is not brittle. As shown in Fig. 14, HyGen strictly meets P99 SLOs even when we artificially increase the predictor's error to over 20%. This demonstrates that extreme accuracy is not required, making our lightweight linear model a practical and sufficient choice under runtime dynamics.
Concern 2: Modeling of Batch Execution Time
In section 4.2, when calculating the batch execution time, why do you add the quadratic scaling term, when the decode stage shows linear scaling?
The quadratic term is essential for accurately modeling mixed batches containing requests in the prefill stage, which has quadratic complexity. While the decode stage is linear, a batch often contains a mix of requests. As stated in Sec 4.2 (Lines 156-158) and established in prior work [1,2,3], the prefill stage's attention mechanism has quadratic complexity w.r.t. input length. Our model (Eq. 1) includes the quadratic term S_p^2 to correctly account for these prefill requests.
[1] Attention Is All You Need. NeurIPS '17. [2] On The Computational Complexity of Self-Attention. ALT '23. [3] MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse. NeurIPS '24.
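For illustration only, a lightweight predictor of this form can be fit with ordinary least squares on profiled batches. The sketch below uses the feature form T_batch = f(S_p, S_d, S_p^2, S_d^2, N_p, N_d) discussed in this thread, with hypothetical helper names rather than our actual code:

```python
import numpy as np

def make_features(s_p, s_d, n_p, n_d):
    """Feature vector [S_p, S_d, S_p^2, S_d^2, N_p, N_d, 1] for one batch:
    prefill/decode token counts, their squares, request counts, and a bias term."""
    return np.array([s_p, s_d, s_p**2, s_d**2, n_p, n_d, 1.0])

def fit_latency_model(batch_stats, measured_latency_ms):
    """Least-squares fit of measured batch latencies against batch features.
    batch_stats: iterable of (S_p, S_d, N_p, N_d) tuples from profiling runs."""
    X = np.stack([make_features(*s) for s in batch_stats])
    y = np.asarray(measured_latency_ms, dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def predict_latency_ms(coef, s_p, s_d, n_p, n_d):
    """Predict the execution time of a batch from its aggregate token/request counts."""
    return float(make_features(s_p, s_d, n_p, n_d) @ coef)
```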
Concern 3: Novelty of Prefix Sharing Strategy
Is the prefix sharing maximization strategy similar to the one used in the SGLang framework’s radix tree structure for prefix caching? Have you compared using block-level prefix caching, similar to vLLM? It would be interesting to have some ablation studies of using different prefix caching strategies.
Our novelty is not the caching mechanism itself, but its SLO-aware application and fairness-aware extension in a hybrid serving context. Our scheduling policy is orthogonal and complementary to low-level caching architectures like SGLang's RadixAttention or vLLM's Automated Prefix Caching. Our contributions are:
- SLO-Aware Co-location: We are the first to apply a PSM-style strategy to opportunistically schedule offline requests using only the residual capacity from a primary, SLO-bound online workload.
- Fairness Extension: We introduce a practical extension (Lines 207-213) that incorporates request freshness to mitigate the starvation problem inherent in naive PSM, making it deployable in production.
We agree that an ablation study on the interplay between our policy and different caching implementations would be a fascinating follow-up.
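To make the fairness extension concrete, here is a heavily simplified, hypothetical sketch (our own illustration, not the paper's exact policy) of scoring queued offline requests by shared-prefix length while crediting waiting time, so that requests with rare prefixes are not starved:

```python
import time

def pick_next_offline_request(offline_queue, cached_prefixes, alpha=1.0, beta=0.01):
    """Illustrative sketch: balance KV-cache reuse (longest shared prefix with
    already-cached token sequences) against freshness (time spent waiting).
    offline_queue: list of dicts with 'tokens' and 'enqueue_time' keys."""
    def shared_prefix_len(tokens):
        best = 0
        for cached in cached_prefixes:
            n = 0
            for a, b in zip(tokens, cached):
                if a != b:
                    break
                n += 1
            best = max(best, n)
        return best

    now = time.time()

    def score(req):
        reuse = shared_prefix_len(req["tokens"])
        waited = now - req["enqueue_time"]
        return alpha * reuse + beta * waited  # waiting time counteracts starvation

    return max(offline_queue, key=score)
```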
Concern 4: Clarification on Performance Gains
In the experiments, I am wondering why HyGen can improve over the baselines in pure online and pure offline settings? What are the key designs that achieve that?
To clarify, HyGen's significant throughput gain is achieved in a hybrid setting over a pure online baseline, demonstrating the value of co-location, not by outperforming specialized pure-mode systems. You are correct that a specialized system is optimal for a single task.
- The Sarathi (pure online) and Sarathi-offline baselines in Fig. 4 represent the performance upper bounds for their respective tasks.
- HyGen's goal is to productively utilize the idle resources of the pure online system. The reported gain (3.87-5.84×) is achieved by our hybrid system over the Sarathi pure-online baseline, demonstrating the value of our approach.
Concern 5: Scalability and Experiment Design
The work has the goal of improving the serving of LLMs in large clusters, but the scale of the experiments seem quite limited (with only two nodes with 4 GPUs each)
Our experimental scale is appropriate because HyGen is designed as an instance-level scheduler, a standard architecture for large-scale serving, and our results already prove its effectiveness in complex, distributed settings.
- Scalable by Design (Architecture): HyGen operates as a modular, instance-level scheduler, intended to run on each node of a large cluster. As detailed in Sec. 5.1 (Line 227), a system-level router (e.g., Preble, as cited) distributes requests across these independent instances. This standard, decentralized architecture has no central bottleneck and scales horizontally simply by adding more nodes. Our experiments accurately model the workload and deployment complexity any single instance would face in such a deployment.
- Proven Effectiveness (Evidence): Our evaluation already provides strong evidence for this scalability. The experiments using Tensor and Pipeline Parallelism (Sec. 5.4, Fig. 9) demonstrate that HyGen effectively manages the complex scheduling logic required in multi-GPU environments. This capability is foundational for and directly applicable to multi-replica production scenarios.
Thanks for the detailed explanation. I have no further concerns. Best,
Thank you for your thoughtful review and acknowledgment. Your insightful comments have significantly improved the clarity and presentation of our manuscript, strengthening its overall impact.
The paper introduces HyGen, an efficient LLM serving system that co-locates online and offline workloads on the same GPU cluster. The authors investigate different features of the two workloads, i.e., latency-sensitive for online queries and throughput-oriented for offline workloads, and design different mechanism for managing the workloads. Specifically, the method leverages a control mechanism to predict and quantify latency, as well as an offline scheduling policy that maximize throughput.
Strengths and Weaknesses
Strengths:
- The proposed method integrates online and offline requests while maintaining strict latency guarantees. It can adapt to diverse latency requirements.
- The authors propose a novel control mechanism that estimates batch execution time and quantifies interference to ensure latency guarantees.
- The authors design an adaptive scheduler that maximizes offline throughput. It leverages a Prefix Sharing Maximization Strategy that transforms offline requests into a prefix tree to capture their prefix-sharing characteristics, thus reusing KV caches and improving efficiency.
- The approach achieves 3.87–5.84× throughput gains over baselines while maintaining strict latency SLOs.
- The evaluations are comprehensive.
Weaknesses:
- This paper does not consider the overhead of shifting between different workloads.
- The method is designed to optimize workloads for a single model instance. However, real-world deployments often serve multiple model replicas simultaneously.
- Typo: line 52: "while guaranteeing strict SLO compliance and ."
Suggestions:
The authors might strengthen their evaluation by assessing the impact of workload shifting on throughput.
Questions
I understand that you might consider sharing the code after the paper is accepted. However, since ML sys papers are typically hard to evaluate without checking implementation details, would you be able to share the source code?
Limitations
yes
Formatting Issues
NA
We sincerely thank the reviewer for recognizing our method as novel, our results as strong, and our evaluation as comprehensive. We appreciate the insightful questions and address them below.
Concern 1: Overhead of Workload Shifting
This paper does not consider the overhead of shift between different workloads.
The authors might strengthen their evaluation by evaluating the impact of workload shifting on throughput.
HyGen is designed for minimal overhead during workload shifts, at both the micro (task-switching) and macro (workload-adaptation) levels.
- Zero Runtime Task-Switching Overhead: The switch from offline to online processing is nearly instantaneous, completing "within a single iteration step" (Lines 628-632). Our scheduling algorithm's low computational complexity, formally analyzed in Appendix A.4, ensures this preemption incurs negligible overhead.
- Minimal Macro-Level Adaptation Overhead: HyGen adapts to new or changing workloads with extreme efficiency:
- Inherent Robustness: The system is inherently robust to workload variations. Fig. 14 shows that HyGen's performance remains stable even when predictor accuracy varies, proving its resilience to distributional shifts without needing reconfiguration.
- Trivial Retraining: For entirely new models or tasks, retraining our lightweight latency predictor is trivial. As stated on Line 299, training on over 80,000 samples takes only ~15ms on a CPU. This makes the one-time cost of adapting to a new workload practically zero.
Concern 2: Applicability to Multi-Replica Deployments
The method is designed to optimize workloads for a single model instance. However, real-world deployments often serve multiple model replicas simultaneously.
HyGen scales seamlessly to multi-model-replica deployments as it functions as a modular, instance-level scheduler managed by a system-level router.
- Scalable by Design (Architecture): HyGen operates as a modular, instance-level scheduler, intended to run on each node of a large cluster. As detailed in Sec. 5.1 (Line 227), a system-level router (e.g., Preble, as cited) distributes requests across these independent instances. This standard, decentralized architecture has no central bottleneck and scales horizontally simply by adding more nodes. Our experiments accurately model the workload and deployment complexity any single instance would face in such a deployment.
- Proven Effectiveness (Evidence): Our evaluation already provides strong evidence for this scalability. The experiments using Tensor and Pipeline Parallelism (Sec. 5.4, Fig. 9) demonstrate that HyGen effectively manages the complex scheduling logic required in multi-GPU environments. This capability is foundational for and directly applicable to multi-replica production scenarios.
Concern 3: Typo Correction
Typo: line 52: "while guaranteeing strict SLO compliance and ."
Thank you for catching this. We will correct it in the final version.
Concern 4: Code Availability
I understand that you might consider sharing the code after the paper is accepted. However, since ML sys papers are typically hard to evaluate without checking implementation details, would you be able to share the source code?
We agree on the importance of implementation details and have provided the AC with a private, anonymized link to our development codebase, per NeurIPS policy. We are enthusiastic about contributing our work to the community upon acceptance and believe a public release accompanying a peer-reviewed publication is the best way to ensure the work is presented completely and accurately. The released code will be thoroughly documented and well-structured, and we are actively working towards a public release. We appreciate your understanding.
Thank you for your explanation. I'll keep my rating.
This paper proposes HyGen, an efficient serving system for large language models using online-offline request co-location. It uses a latency predictor and an SLO-aware profiler, coupled with SLO-aware offline scheduling policies designed to maximize throughput and ensure fairness.
Strengths:
- This paper introduces a new and important problem, i.e., efficient co-location of online and offline workloads.
Weaknesses:
- As suggested by cK9i, there are many questions regarding how this new framework applies to different models and compute resource constraints and how this new framework addresses some extreme edge cases.
These gaps persist despite the rebuttal and are essential to assessing HyGen’s broader applicability and reliability.
However, I believe this paper marks a significant step forward in this particular research direction.
For the remaining concerns raised by cK9i, see the author's final remarks.
P1. Our data-driven latency predictor, T_batch = f(S_p, S_d, S_p^2, S_d^2, N_p, N_d), adapts flexibly to new architectures. For hybrid models (e.g., Mamba-Transformer), it learns a reduced S_p^2 coefficient (Transformer cost) while capturing Mamba’s linear cost via S_p, consistent with reports in the Flash-Linear-Attention GitHub repo. For MoE models, due to a fixed activated expert size per token, MoE-FFN costs are captured by S_p and S_d. This again highlights the flexibility and effectiveness of our design.
P2. The “warm-up” concern reflects a misunderstanding. The latency predictor and budget are fixed via one-time, pre-deployment profiling; there is no runtime stabilization phase and no SLO risk at startup. The system is immediately ready.
P3. HyGen uses a multi-layer strategy for traffic bursts. First, its latency budget is robust to in-distribution bursts, as it is derived from realistic traces containing such bursts. During spikes, online queues fill and offline requests are preempted or blocked (Fig. 8). Our evaluations on realistic workloads (Fig. 4), with large fluctuations (e.g., request load varies up to 8x within minutes), show HyGen outperforms prior methods. Second, it remains resilient to out-of-distribution events, maintaining SLOs even with 20% MAPE (Fig. 14). Third, for lasting workload shifts, it adapts using feedback from the online deployment, asynchronously with serving. Predictor retraining is highly efficient (80k samples in 15 ms).
P4. Baselines were transparently optimized. The Sarathi-offline ceiling baseline tunes chunk size via hyperparameter search (Line 252), achieving 12% throughput improvement over default.
P5. Our problem scope is orthogonal to the cited hybrid systems. Multi-LoRA methods (e.g., Punica) batch requests from different LoRAs, while multi-model methods (e.g., MuxServe) multiplex models across space. Instead, HyGen batches online and offline requests, addressing distinct optimization problems that are not interchangeable.
P6. The profiler’s practicality comes from its data-driven design: it measures performance on target hardware under the actual online request distribution, capturing hardware and memory constraints. Our evaluations show that HyGen efficiently adjusts latency budgets when model, hardware, workload, or SLO changes (Figures 4, 8, 14, and our A5000 experiment replied in Concern 1), as noted in P3.