Synergistic Tensor and Pipeline Parallelism
A framework for tensor and pipeline parallelism to reduce TP bubbles.
Abstract
Reviews and Discussion
This paper proposes a novel hybrid model parallelism strategy that jointly optimizes tensor parallelism (TP) and pipeline parallelism (PP) for efficient large-scale training of LLMs and MLLMs. The authors introduce a “braided” scheduling technique that interleaves fine-grained forward and backward computation units to overlap TP communication with PP execution. In addition, they develop a “V”-shaped pipeline schedule that balances memory footprint across stages while minimizing pipeline bubbles. The proposed method demonstrates up to 12% and 16% throughput improvement on LLMs and MLLMs, respectively, compared to existing methods like 1F1B-I and Zero Bubble. The method is implemented atop Megatron-Core and evaluated across multiple model scales and hardware settings.
Strengths and Weaknesses
Strengths:
- This paper is well-motivated and targets practical training efficiency issues: pipeline bubbles and GPU idleness during TP communication.
- The scheduling scheme is a thoughtful and non-trivial approach. It includes a detailed breakdown of computational units, a V-shaped scheduling strategy, and a memory-aware enhancement, all supported by theoretical analysis of memory footprint and bubble rates.
- The implementation is based on Megatron-Core and intended to be released, which supports reproducibility.
Weaknesses:
- Although the evaluation includes several pipeline-overlapping baselines, the paper omits a comparison with some recent TP-overlapping frameworks (Megatron-LM, Flux). It remains unclear whether a combination of TP overlapping and ZB-V would outperform the proposed work.
- Meanwhile, the throughput axes in Figs. 7 and 8 begin at a high baseline, which visually exaggerates the perceived speedup.
- The method increases peak activation memory compared to some baselines, which may limit applicability in memory-constrained environments if offloading is not viable.
- How well does the scheduling scheme integrate with other parallel overlapping schemes such as data parallelism and context parallelism? It would be valuable to understand the generality and composability of the approach.
Questions
Please see the weaknesses above.
Limitations
yes
Formatting Issues
None
Thank you for your support! We sincerely thank the reviewer for the valuable feedback and suggestions. Below, we address each point raised in your review.
Although the evaluation includes several pipeline-overlapping baselines, the paper omits a comparison with some recent TP-overlapping frameworks (Megatron-LM, Flux). It remains unclear whether a combination of TP overlapping and ZB-V would outperform the proposed work.
Thank you for the valuable question. Megatron-LM and Flux improve TP communication primarily at the kernel level by fusing computation and communication. In contrast, our approach is implemented entirely at the software level, without relying on custom kernels, making it more flexible and broadly applicable.
Moreover, in Megatron-LM, TP overlapping cannot be enabled independently; it requires coupling with sequence parallelism (SP), which introduces an additional parallel strategy. Similarly, Flux implements only a few fused kernels (all-gather+GEMM and GEMM+reduce-scatter), which require an additional parallelism dimension to fit these predefined patterns. In contrast, our method is synergistic with the pipeline and tensor parallelism strategies widely adopted in large-scale model training and requires no kernel fusion, custom CUDA extensions, or modifications to core operators, enabling easier deployment across diverse architectures. Due to these constraints, we did not combine these frameworks with pipeline parallelism in our main experiments.
Here, we conducted additional experiments combining TP overlap in Megatron-LM with pipeline schedules. The throughput results are summarized below, where * indicates TP overlap kernels are enabled.
| Methods | Throughput (samples/s) |
|---|---|
| 1F1B-I* | 7.13 |
| ZB-V* | 7.06 |
| Ours | 7.27 |
In the Qwen2-12B configuration with a sequence length of 4k, our schedule achieves a throughput of 7.27 samples/s, outperforming both 1F1B-I and ZB-V. This represents an improvement of approximately 0.14 samples/s over the best baseline. The results indicate that our synergistic schedule enables more effective cooperation between TP and PP, achieving superior performance compared to simply combining conventional TP overlap techniques with standard pipeline schedules.
Meanwhile, the throughput axes in Figs. 7 and 8 begin at a high baseline, which visually exaggerates the perceived speedup.
We appreciate the reviewer’s valuable feedback. We agree that starting the y-axis at a high baseline may exaggerate the visual perception of speedup. To address this, we will adjust the y-axis scale in Figures 7 and 8, ensuring a more accurate and fair visual comparison of throughput improvements.
The method increases peak activation memory compared to some baselines, which may limit applicability in memory-constrained environments if offloading is not viable.
We appreciate the reviewer’s observation. We acknowledge that our method increases activation memory compared to the baselines, as it prioritizes efficiency over memory usage. To assess this trade-off, we conducted an experiment (Table 5 in Appendix) that maximizes memory utilization under a fixed batch size and sequence length. The results show that our schedule achieves the highest throughput (2.74 samples/s), outperforming 1F1B-I (2.72 samples/s) and ZB-V (2.61 samples/s). This suggests that despite a higher memory footprint, our method remains more efficient in practical settings.
How well does the scheduling scheme integrate with other parallel overlapping schemes such as data parallelism and context parallelism? It would be valuable to understand the generality and composability of the approach.
Thank you for the thoughtful question. Our method is compatible with data parallelism and context parallelism, and we have evaluated its performance under these settings. Specifically, we conduct experiments using the Qwen2 12.1B model and report the throughput (samples per second) under different parallelism configurations:
| Config | 1F1B-I | ZB-V | Ours |
|---|---|---|---|
| context parallel: seq=12k, TP=2, PP=4, CP=2, bs=128 | 2.64 | 2.61 | 2.71 |
| data parallel: seq=4k, TP=2, PP=4, DP=2, bs=256 | 9.22 | 9.16 | 9.40 |
These results demonstrate that our approach integrates well with context parallelism and data parallelism, and provides improved throughput over the baselines in both settings. This supports the generality and composability of our scheduling scheme across multiple parallelism strategies.
Thanks for your reply. In practice, SP is a more commonly adopted variant of tensor parallelism in many training infrastructure settings. The throughput improvements reported in the tables do not appear significant enough to justify the added complexity of the proposed scheme. Given that, I will maintain my current score.
Thank you for your continued feedback. We agree that SP has been widely adopted in large-scale training infrastructures. However, integrating such kernel-level optimizations typically requires significant engineering effort, including implementing and integrating fused CUDA kernels, which are often tightly coupled with specific GPU architectures and communication libraries.
In contrast, our schedule introduces only minimal changes (limited to the pipeline schedule and a lightweight adjustment to the forward function). This design choice enables our method to remain easy to adopt and maintainable in real-world training pipelines. Furthermore, our approach is fully compatible with widely-used training techniques such as activation checkpointing and context parallelism, demonstrating its versatility.
While the throughput improvement appears moderate, it is consistent and achieved without introducing substantial system complexity. We view this as a promising direction and are actively pursuing further enhancements in both scheduling scheme and implementation to maximize its performance benefits.
Once again, we sincerely thank you for your support and for your invaluable feedback.
This paper introduces a pipeline parallel schedule that incorporates communication computation overlap to reduce tensor parallel bubbles by "fusing" backwards and forwards tasks. The schedule is introduced after some relevant background. An alternative schedule with offloading to reduce memory consumption is also introduced. Evaluations against some prior baselines are provided.
Strengths and Weaknesses
I would first like to thank the authors for submitting their work to NeurIPS 2025.
Pros
The background treatment of PP and TP is fairly comprehensive overall, covering GPipe, 1F1B, Interleaved, Zero Bubble, Flux, etc. In addition, the authors correctly describe the current paradigm of distributed model training with a combination of TP and PP.
The idea of "braiding" to overlap the TP communication with computation seems sound. So does the pipeline schedule itself (i.e., no overlapped critical-path tasks).
Section 4.2 is written well, and provides some well-needed intuition for the schedule and its design. The theoretical analysis in section 4.3 seems to check out and offers a good comparison with some well known prior works.
Offloading is also a nice extension to the basic schedule as well.
The evaluation section contains a lot of experimental results against 2 solid baselines, and the results seem to be promising.
Cons
In Section 3, it is nowhere stated what the Pre-Attn, Attn, Pre-MLP, and MLP units consist of. Figure 2 has Attn and MLP annotated, but there should be a pointer to this, and Pre-Attn and Pre-MLP should be clearly annotated.
Equation 1 is not clear for the most part; it should be made clear that this is for a single TP rank. Also, I'm not convinced that this formula is correct as shown. Where does the extra term come from? What exact operations does this actually perform?
Figure 4 is likewise not motivated. It’s clear for someone that’s aware of interleaved pipeline parallelism, but there’s no annotation of what the colours represent, or any annotation of device, or any markings for the memory.
In Figure 5, I don’t see the concept of TP bubbles illustrated clearly. My understanding is the overlapped computations should have less latency overall. Perhaps an illustration of the time difference compared to a pipeline schedule without any TP bubble reduction would make this clearer, in the style of Figure 4 from [1].
Table 1 could benefit from including parameter memory as well.
In lines 202-203, it is stated that activation offloading offers a better E2E performance trade-off compared to checkpointing. This is a strong statement that is not strictly true in all situations to the best of my knowledge. In cases where the device-to-host bandwidth is poor, checkpointing might actually enable larger batch sizes and more efficient execution overall.
In terms of the evaluation, I have four issues.
- Although TP overlap methods are discussed in line 35, there’s no comparison done, for example against Flux. I see that as one of the most appropriate baselines given one of the main topics of this paper is reducing TP bubbles as well.
- I see there is an evaluation of the memory footprint of offloading compared to the default method. But given that Figure 9 shows that the other baselines take less memory, would they not be more efficient by being able to support a larger microbatch size?
- What about long context experiments? At longer contexts, the all-reduce latency may no longer be overlapped with compute causing the bubble to reappear.
- Why are only 1F1B and ZB the baselines? Some other newer works that are cited earlier as inspiration (line 75) are not compared to at all.
I also don't see any discussion of asynchronous vs. synchronous pipeline parallelism, which is one of the main topics covered in Zero Bubble Pipeline Parallelism [2]. It seems that the proposed schedule is synchronous, but it is ultimately compared against asynchronous schedules like ZB-V [2]; this should at least be discussed.
Overall, while this paper introduces an interesting new schedule for pipeline parallelism which incorporates "TP codesign", introducing yet another pipeline schedule is of limited novelty from a research perspective. The main insight here is that backward and forward tasks can be fused to provide compute/communication overlap. I am not convinced that the evaluation is sufficient to prove that the method is significant. There are also clarity and overall polish/quality issues.
Other Comments
Nits:
Figure 1 already includes results in the introduction. I feel this is more appropriate for the results section where the experimental setup can be explained.
In Figure 3, state that green is forward, and blue is backward.
Equation 1: Use the full form of LayerNorm.
Figure 5: It’s hard to see what’s an overlapped F & B vs F followed by B.
Table 1: Activation Memory -> Peak Activation Memory
Table 3: Include “throughput” and units here for clarity.
[1] Narayanan et al., Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM, SC 2021, https://dl.acm.org/doi/10.1145/3458817.3476209
[2] Qi et al., Zero Bubble Pipeline Parallelism, Arxiv, https://arxiv.org/pdf/2401.10241
Questions
Major
Can you clarify equation 1?
Can you clarify whether Figure 5 should actually have smaller boxes for the combined F & B parts?
Why not consider long context baselines?
Why not evaluate against previous TP overlap work?
Why not consider an apples-to-apples memory comparison?
Why not evaluate against more recent work already cited as inspiration, for example, Hanayo?
Minor
Did you try any experiments without 2 virtual stages per device? Any difference in performance for the baselines?
Any idea how this performs with context parallelism?
Limitations
Some limitations (such as performance at PP = 8) are discussed in lines 272-273. The authors are encouraged to make the limitations clearer.
Final Justification
The authors addressed a lot of my concerns. Overall, the paper's results are good, and the idea itself is interesting and novel. It is an interesting addition to the area of pipeline-parallel training.
However, I still believe there are some issues with the clarity and quality of the work with respect to figures and equations. The other deciding factor is that I am not sure whether a new pipeline parallelism schedule is significant enough.
Formatting Issues
None.
We thank the reviewer for the valuable feedback. Below, we provide responses to each point.
Equation 1 is not clear ... Where does the extra term come from? ... What exact operations does this actually perform?
Thanks for your suggestion. We would like to note that Equation 1 is for a single TP rank.
The standard process (LayerNorm + Attention) before the MLP is formulated as
$$y = \mathrm{AllReduce}\big(\mathrm{Attn}_i(\mathrm{LayerNorm}(x))\big) + x,$$
where $\mathrm{Attn}_i(\cdot)$ calculates the attention output for a single TP rank $i$, and the AllReduce aggregates the partial outputs across all TP ranks.
Because the residual addition executes after the AllReduce, it introduces a data dependency on the LayerNorm input, increasing engineering complexity for the backward pass. Therefore, we remove this dependency by detaching $x$, which then no longer receives gradients through the residual path during backward.
The term $\mathrm{detach}(x)/t$ arises from this modification: the residual is moved to before the AllReduce so that it does not stall waiting for AllReduce completion,
$$y = \mathrm{AllReduce}\big(\mathrm{Attn}_i(\mathrm{LayerNorm}(x)) + \mathrm{detach}(x)/t\big),$$
where $t$ is the number of TP ranks. The forward pass remains computationally equivalent, since the AllReduce sums $x/t$ across the $t$ ranks and thus recovers $x$. In the backward pass, we add the gradient for $x$ manually to compensate for the detach of the residual.
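To make this concrete, below is a minimal PyTorch-style sketch of the idea under our own naming (`attn_partial`, `tp_group`); the actual implementation wraps the AllReduce in a custom autograd function, which is omitted here.

```python
import torch
import torch.distributed as dist

def attn_block_forward(x, layer_norm, attn_partial, tp_group):
    """Sketch of the modified forward on one TP rank.

    attn_partial computes this rank's partial attention output; summing the
    partial outputs over all TP ranks yields the full attention output.
    """
    t = dist.get_world_size(tp_group)      # number of TP ranks
    h = attn_partial(layer_norm(x))        # partial attention output on this rank

    # Fold the residual into the AllReduce input: detach() cuts the residual's
    # autograd dependency on x, and dividing by t keeps the forward result
    # unchanged because the AllReduce sums x/t over the t ranks, recovering x.
    h = h + x.detach() / t
    dist.all_reduce(h, group=tp_group)     # in practice wrapped in a custom
                                           # autograd Function for backward
    return h

# Backward compensation (conceptually): since x was detached on the residual
# path, the gradient flowing to x through that path is added back manually,
# i.e. grad_x += grad_of_block_output.
```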
In Figure 5, ... TP bubbles illustrated clearly ...
Figure 5 is intended to illustrate the PP structure without explicitly visualizing TP bubbles. In our schedule, TP bubbles are implicitly hidden within the overlapped F&W blocks, which indeed have smaller latency than the sum of the individual F and W blocks. We chose to align all blocks in Figure 5 for visual clarity, as misaligned blocks, caused by shrinking the overlapped F&W blocks, could make it harder for readers to trace the communication patterns across stages. That said, we appreciate your feedback, and we will explore improved visualizations that more clearly reflect TP bubble reduction while maintaining readability.
Table 1 ... parameter memory ...
The methods listed in Table 1 are not bidirectional schedules, so they have the same parameter memory when the embedding and head layers are not considered. Thanks for the suggestion; we will update Table 1 in the final version.
In lines 202-203 where it is stated that activation offloading ...
Thank you for pointing this out. We will revise the sentence to make it more precise and reflective of the conditions under which the statement holds. We also appreciate your suggestion and will carefully review the rest of the paper to ensure similar statements are appropriately qualified.
... there’s no comparison done, for example against Flux ...
There are several works designed to improve TP communication, including Megatron-LM and Flux. Their implementations are primarily at the kernel level, fusing computation and communication. In contrast, our approach is implemented entirely at the software level, without relying on custom kernels, making it more flexible and easier to integrate into various training pipelines.
Moreover, in Megatron-LM, TP overlapping cannot be enabled independently; it requires coupling with sequence parallelism (SP), which introduces an additional parallel strategy. Similarly, Flux implements only a few fused kernels (all-gather+GEMM and GEMM+reduce-scatter), which require an additional parallelism dimension to fit these predefined patterns. In contrast, our method is synergistic with the pipeline and tensor parallelism strategies widely adopted in large-scale model training and requires no kernel fusion, custom CUDA extensions, or modifications to core operators, enabling easier deployment across diverse architectures. Due to these constraints, we did not combine these works with the baselines in our main experiments.
Due to insurmountable technical challenges in integrating Flux kernels into our experimental environment despite extensive efforts, we conducted experiments combining TP overlap in Megatron-LM with pipeline schedules, which embodies an overlap similar to that of Flux. The results are listed below, where * indicates that TP overlap kernels are enabled.
| Methods | Throughput (samples/s) |
|---|---|
| 1F1B-I* | 7.13 |
| ZB-V* | 7.06 |
| Ours | 7.27 |
In the Qwen2-12B configuration with a sequence length of 4k, our schedule achieves a throughput of 7.27 samples/s, outperforming 1F1B-I and ZB-V. This represents an improvement of approximately 0.14 samples/s over the best baseline. The results indicate that our synergistic schedule enables more effective cooperation between TP and PP, achieving superior performance compared to simply combining conventional TP overlap techniques with standard pipeline schedules.
... consider a memory apples to apples comparison?
Since different schedules have varying memory footprints, we ensured that all schedules could be executed under the same configuration to enable a fair comparison. This constraint inevitably led to suboptimal memory utilization for some schedules. The primary goal of these experiments was to compare the efficiency of different schedules, rather than to assess their memory usage.
We were also fully aware of this and conducted a related experiment (Table 5) in the appendix. This experiment explored the optimal throughput achievable by each schedule when memory utilization is maximized under a given batch size of 192 and sequence length of 8k. The results show that our schedule achieves the best throughput (2.74 samples/s), better than 1F1B-I (2.72 samples/s) and ZB-V (2.61 samples/s).
What about long context experiments?
We increased the sequence length to 16k and reduced the model dimension and layer count to simulate this scenario. The detailed results are as follows:
| Methods | Throughput (samples/s) |
|---|---|
| 1F1B-I | 4.24 |
| ZB-V | 4.21 |
| Ours | 4.36 |
As shown in the table, our schedule outperforms the other baselines in this setup of longer context and smaller model scale, even though the AR may not be fully overlapped. This indicates that our schedule is robust to varying sequence lengths.
Why are only 1F1B and ZB the baselines? ... are not compared to at all, like Hanayo.
We chose 1F1B-I and ZB-V as baselines because 1F1B-I is a classical and effective method introducing virtual stages, while ZB-V is a more recent method published at NeurIPS 2024. Other methods such as Hanayo are indeed promising; however, they are not open-sourced, and we cannot ensure faithful reproduction, especially regarding communication optimizations they incorporate.
discussion of asynchronous vs synchronous pipeline parallelism ...
We would like to correct a mistake here: the ZB-V cited in our paper is [1], rather than the reference you mentioned, and asynchrony is not a topic of [1]. Therefore, this issue is not relevant to our comparison. Moreover, in practice, synchronous schedules are more popular, given the potential convergence issues of asynchronous schedules. Therefore, evaluating all methods in a synchronous setting is reasonable.
[1] Qi et al. Pipeline Parallelism with Controllable Memory. NeurIPS 2024.
the position of Fig.1; full form of LayerNorm in Eq. 1; visualization of overlapped F & B in Fig. 5; Activation Memory -> Peak Activation Memory in Tab. 1; throughput's units in Tab. 3; color annotations in Fig. 3; clear discussions in line 272-273; a pointer to Pre-Attn and Pre-MLP in Fig. 2; annotation in Fig. 4
Thanks for your valuable comments! We will revise our paper in the final version according to your suggestions.
... experiments without 2 virtual stages per device?
No. When virtual stages are disabled, our schedule is not applicable, as we leverage the virtual stages to realize the overlap.
... perform with context parallelism
Our method is compatible with context parallelism. Here, we conduct a corresponding experiment as follows:
| Qwen2 12.1B | 1F1B-I | ZB-V | Ours |
|---|---|---|---|
| seq=12k, TP=2, PP=4, CP=2, bs=128 | 2.64 | 2.61 | 2.71 |
As shown, our method achieves a throughput of 2.71 samples/s, outperforming both 1F1B-I (2.64 samples/s) and ZB-V (2.61 samples/s). This demonstrates that our scheduling strategy remains effective and superior even when integrated with context parallelism to handle longer sequences.
Please refer to the rebuttals for Reviewers EjFt and MuRP to get more results when it is integrated with more parallelism and strategies.
... pipeline schedule is of limited novelty from a research perspective ...
While overlapping communication and computation is common in LLM training, our method combines TP and PP without any custom CUDA kernels, making them work synergistically. All experiments in the paper and appendix consistently demonstrate that our schedule is effective.
Furthermore, the value of our approach has been acknowledged by multiple reviewers. Reviewer EjFt noted that "The proposed scheduling design is practical," Reviewer P8vN highlighted its "solid contributions," stating that "the modification to the interleaved pipelining schedule is simple but effective, and the idea to overlap compute and communication for different microbatches makes sense," and Reviewer MuRP described the scheme as a "thoughtful and non-trivial approach." These comments affirm that our work offers a meaningful and well-justified advancement in the design of efficient training schedules.
Thanks to the authors for their response and clarifications. I have considered their rebuttal and the responses to the other reviewers.
I am satisfied with the response, and as a result, I am increasing my score and recommend the paper for publication overall. Where I think the paper can still improve is overall polish and clarity.
It will be good if the final paper cleans up the equations and figures to be clearer. For example, I still don't believe it's clean to include detach in Equation 1, but this is minor while my comment about Figure 4 still stands.
It will also be helpful to include some of the long context and Table 5 results from the appendix in the main paper.
We sincerely thank the reviewer for carefully considering our rebuttal and for the constructive feedback throughout the review process. We are truly grateful for your positive assessment and for increasing the score.
We fully agree with your suggestions. In the final version, we will carefully revise the equations and figures to enhance their readability and precision, and we will incorporate those results into the main paper to better support our claims.
Thank you again for your thoughtful comments and continued support.
Dear Reviewer nDzM
The author-reviewer discussion period will end on August 6. Please read the rebuttal, post any remaining questions or comments, and confirm the Mandatory Acknowledgement.
Best regards, Your AC
Model training at scale often uses tensor and pipeline parallelism. Tensor parallelism can suffer from high communication overhead. This paper offers a way to overlap communication with computation to improve training speed.
Strengths and Weaknesses
Strengths:
- Important problem (improving the efficiency of LLM and MLLM training at scale).
- Solid contributions: the modification to the interleaved pipelining schedule is simple but effective, and the idea to overlap compute and communication for different microbatches makes sense.
- Results are good.
Weaknesses:
- Some of the writing can be made a bit more precise. For example: a) I don't think TP communication significantly grows with TP size as long as you're able to run collectives at the same bandwidth; of course compute-communication ratio decreases. b) What does "at the kernel or hardware levels" mean when discussing communication overlapping with computation? c) Clarify hardware setup in statements like "where limited inter-device bandwidth can severely degrade training efficiency".
- I'm not sure how "synergistic" the modification to the interleaved pipelining schedule is, can't the proposed memory improvements be used for pipeline parallelism in isolation?
Questions
- Figure 3 doesn't make full sense to me. For example, in Figure 3a, is the green AR in the communication stream for a different microbatch than the green "Attn (F)"? If they are for the same microbatch, how does "Attn (F)" complete before its communication?
- How much is computation slowed down by the overlapped communication (since communication probably uses some SMs that can't be used by GEMMs)?
- Is there a way to apply these ideas to PP=1?
- What are the throughputs in the figures in TFLOP/s/GPU? That is, how fast on an absolute scale are the baselines and "Ours"?
Limitations
NA
Final Justification
The paper studies an important problem, provides novel contributions, and presents good results. The authors addressed weaknesses and questions I raised during the review process. As a result, I rate it 5 and keep the paper as accept.
Formatting Issues
NA
Thank you for your support! We sincerely thank the reviewer for the valuable feedback and suggestions for improved clarity. Below, we address each point raised in your review.
Some of the writing can be made a bit more precise. For example: a) I don't think TP communication significantly grows with TP size as long as you're able to run collectives at the same bandwidth; of course compute-communication ratio decreases.
We apologize for the confusion. The original statement in our paper was "The proportion of TP communications grows significantly with the increased TP size". Here, "proportion of TP communications" refers to the ratio of TP communication time to the total time, not the absolute TP communication time.
We agree that if the bandwidth remains constant, the absolute communication time may not grow significantly. However, as the TP size increases, the compute-to-communication ratio decreases, so TP communication becomes a larger fraction of the overall runtime, which is what we intended to highlight.
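As a rough back-of-the-envelope model of this effect (our own notation and simplifications, not taken from the paper), assuming a ring AllReduce and per-rank link bandwidth $W$:

```latex
% Rough per-layer model on one TP rank: hidden size h, sequence length s,
% TP size t, per-rank link bandwidth W (our notation, not the paper's).
\[
  T_{\mathrm{comp}}(t) \;\propto\; \frac{s\,h^{2}}{t},
  \qquad
  T_{\mathrm{comm}}(t) \;\approx\; \frac{2\,(t-1)}{t}\cdot\frac{s\,h}{W}.
\]
% The GEMM work is sharded across ranks, while the ring-AllReduce volume per
% rank is nearly constant in t, so the communication fraction
\[
  \frac{T_{\mathrm{comm}}(t)}{T_{\mathrm{comp}}(t)+T_{\mathrm{comm}}(t)}
\]
% grows as t increases, even though T_comm itself barely changes.
```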
To avoid misunderstanding, we will revise the sentence to clarify this point in the final version.
b) What does "at the kernel or hardware levels" mean when discussing communication overlapping with computation?
Thanks for your question. The phrase "at the kernel or hardware levels" means that the overlapped communication is implemented either with carefully designed and complex CUDA kernels, such as the all-gather+GEMM kernel of [1] (kernel level), or with specific hardware support, such as the track-and-trigger mechanism in [2] (hardware level).
Thanks for your feedback! We will revise this sentence in our final version to make it clearer.
[1] Chang, Li-Wen, et al. Flux: Fast software-based communication overlap on GPUs through kernel fusion. arXiv.
[2] Pati, Suchita, et al. T3: Transparent tracking & triggering for fine-grained overlap of compute & collectives. ASPLOS 2024.
c) Clarify hardware setup in statements like "where limited inter-device bandwidth can severely degrade training efficiency".
Thank you for the insightful suggestion. We clarify the sentence as follows:
"where limited inter-device bandwidth (e.g., systems without NVLink) can severely degrade training efficiency."
I'm not sure how "synergistic" the modification to the interleaved pipelining schedule is, can't the proposed memory improvements be used for pipeline parallelism in isolation?
Yes, the proposed memory improvements can be used for pipeline parallelism in isolation; they are not themselves part of the synergy with the interleaved pipelining schedule. The memory modification is designed only to cope with limited-memory scenarios. "Synergistic" refers to the execution timeline of the overlapped F&B blocks in the schedule, where we interleave the forward and backward passes within the same PP stage to overlap the TP communication of the forward and backward passes.
Figure 3 doesn't make full sense to me. For example, in Figure 3a, is the green AR in the communication stream for a different microbatch than the green "Attn (F)"? If they are for the same microbatch, how does "Attn (F)" complete before its communication?
Sorry for the confusion here. In Figure 3a, the green "Attn (F)" and the green AR both correspond to the same microbatch. "Attn (F)" refers solely to the computation of the attention forward pass, while the AllReduce (AR) represents the associated communication operation; together they constitute a complete attention operation. The green AR is initiated only after the corresponding computation (Attn (F)) is completed. Therefore, the forward pass of the attention module is only considered fully complete once both the computation and the communication have finished.
How much is computation slowed down by the overlapped communication (since communication probably uses some SMs that can't be used by GEMMs)?
Thank you for your insightful question. We understand your concern that overlapped communication might occupy certain GPU resources (e.g., SMs), potentially slowing down GEMM computations. To examine this, we measured GEMM performance with and without an overlapped AllReduce (AR) operation. In Experiment 1, the GEMM execution time is set to be larger than that of the AllReduce; in Experiment 2, the opposite holds. The execution times are reported as follows:
| Operation | Experiment 1 | Experiment 2 |
|---|---|---|
| GEMM | 8.605ms | 0.334ms |
| AllReduce | 3.364ms | 1.643ms |
| GEMM+AR | 11.969ms | 1.977ms |
| GEMM with overlapped AR | 9.251ms | 1.685ms |
In Experiment 1, the communication is effectively hidden behind computation, resulting in minimal overhead (approximately 7.5%) for GEMM and substantial time savings compared to sequential execution (GEMM+AR, 11.969ms).
In Experiment 2, although the GEMM operation completes earlier, leaving a portion of the communication time unhidden, the overall execution time remains close to the theoretical lower bound (i.e., the communication durations) and is significantly shorter than sequential execution (1.977 ms).
These results demonstrate that overlap is beneficial in both scenarios, with minimal interference and improved overall efficiency.
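For reference, a micro-benchmark in this spirit can be written with stock PyTorch as below; the tensor sizes, process-group setup, and timing loop are illustrative rather than the exact configuration we used.

```python
import torch
import torch.distributed as dist

def gemm_with_overlapped_allreduce(m=8192, k=8192, n=8192,
                                   ar_elems=64 * 1024 * 1024, iters=20):
    """Time a GEMM whose execution overlaps an independent AllReduce.

    Launch with torchrun so that init_process_group("nccl") has peers.
    Returns the average time per iteration in milliseconds.
    """
    device = torch.device("cuda", torch.cuda.current_device())
    a = torch.randn(m, k, device=device, dtype=torch.bfloat16)
    b = torch.randn(k, n, device=device, dtype=torch.bfloat16)
    buf = torch.randn(ar_elems, device=device)       # AR payload, independent data

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        work = dist.all_reduce(buf, async_op=True)   # NCCL runs on its own stream
        torch.matmul(a, b)                           # GEMM overlaps with the AR
        work.wait()                                  # compute stream waits for AR
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

if __name__ == "__main__":
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    print(f"rank {dist.get_rank()}: {gemm_with_overlapped_allreduce():.3f} ms/iter")
```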
Is there a way to apply these ideas to PP=1?
The idea of leveraging backward passes to overlap TP communication remains valid to a certain extent. Specifically, when PP=1, our method degenerates into pure TP training with gradient accumulation. In this case, with gradient accumulation enabled, the backward pass of microbatch 1 can be interleaved with the forward pass of microbatch 2. Therefore, the overlap remains valid for the majority of microbatches.
What are the throughputs in Figure in TFLOP/s/GPU? That is, how fast on an absolute scale are the baselines and "Ours"?
Thank you for the question. Since the figures do not report absolute numbers, we present some of the results of Figures 7 and 8 in TFLOP/s/GPU.
| Config (TFLOP/s/GPU) | 1F1B-I | ZB-V | Ours |
|---|---|---|---|
| TP=4, Seq=6144, PP=4, bs=192 | 158.8 | 155.7 | 167.5 |
| TP=8, Seq=6144, PP=2, bs=192 | 107.8 | 108.5 | 121.1 |
| TP=4, Seq=4096, PP=8, bs=256 | 147.3 | 148.0 | 152.7 |
| TP=8, Seq=4096, PP=4, bs=256 | 99.7 | 100.7 | 110.8 |
As shown in the table, in the TP=4, Seq=6144, PP=4, bs=192 setup, our approach achieves an absolute improvement of 8.7 TFLOP/s/GPU over 1F1B-I. Furthermore, in the TP=8, Seq=4096, PP=4, bs=256 configuration, we achieve an absolute improvement of 10.1 TFLOP/s/GPU compared to ZB-V. These results underscore the consistent and significant gains in computational efficiency delivered by our scheduling method across diverse parallelism and workload settings.
Dear Reviewer P8vN,
I notice that you have confirmed "Mandatory Acknowledgement" and reviewed the rebuttal. Could you please engage in a discussion with the authors?
Best regards, Your AC
I thank the authors for the rebuttal. They address weaknesses and questions I raised.
Thank you very much for your constructive feedback and for acknowledging our responses. We truly appreciate your time and insights.
Dear Reviewer P8vN
The author-reviewer discussion period will end on August 6. Please read the rebuttal, post any remaining questions or comments, and confirm the Mandatory Acknowledgement.
Best regards, Your AC
This paper proposes a novel approach to large-scale Transformer training by better coordinating tensor and pipeline parallelism. It aims to reduce the "tensor parallel bubble"—the communication overhead—by overlapping it with the backward computation of the pipeline. The experiments demonstrate improved throughput, albeit with some memory overhead.
Strengths and Weaknesses
Strengths
- The paper provides a clear analysis of how tensor and pipeline parallelism interact and where current methods fall short.
- The proposed scheduling design is practical and accounts for the potential memory challenges introduced by this method.
- The implementation considers real hardware constraints, such as memory usage and communication overhead, making the method more applicable to real-world training setups. Also the experiments are well-executed across multiple large models (including LLMs and multimodal models), providing solid evidence of scalability and general applicability.
Weaknesses
- I am not very familiar with pipeline parallelism, so I found some figures (e.g. Figure 3) a bit challenging to fully understand.
Questions
- How would the proposed scheduling handle highly imbalanced stages, such as those caused by MoE layers or custom modules?
- Is the system compatible with activation checkpointing strategies commonly used in large-scale models?
Limitations
N/A
Formatting Issues
N/A
Thank you for your support! We sincerely thank the reviewer for the valuable feedback. Below, we address each point raised in your review.
I am not very familiar with pipeline parallelism, so I found some figures (e.g. Figure 3) a bit challenging to fully understand.
We apologize for the confusion. Figure 3 presents the execution timelines of the computation units; "F" and "B" denote the forward and backward passes of these units. Taking the Attn unit as an example, it completes its computation in the computation stream and then immediately launches the AllReduce (AR) operation on the communication stream. The forward computation of the MLP unit must wait for the completion of that AllReduce due to the inherent data dependency. Therefore, in the naive implementation, the computation stream remains idle during the AR execution.
In our schedule, we interleave the forward and backward passes of the units to reduce this idle time, so the AR operation is naturally overlapped with the computation of backward passes in Figure 3(a). Figure 3(b) presents the timeline in which the weight-gradient computation is separated from the full backward pass. We will revise Figure 3 to make it easier to understand in the final version.
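In pseudocode, the interleaving in Figure 3(a) amounts to issuing the forward AR asynchronously and filling its latency with backward work from another microbatch. Below is a simplified sketch with our own helper names (not the Megatron-Core implementation, and with autograd handling omitted):

```python
import torch.distributed as dist

def attn_unit_with_overlap(x, attn_fwd, mlp_fwd, pending_backward, tp_group):
    """One unit's forward with its TP AllReduce hidden behind backward work.

    attn_fwd / mlp_fwd:  this rank's partial forward computations.
    pending_backward:    a callable that runs backward compute of an earlier
                         microbatch and has no dependency on the in-flight AR.
    """
    h = attn_fwd(x)                                           # Attn (F), compute stream
    work = dist.all_reduce(h, group=tp_group, async_op=True)  # AR of the same microbatch

    pending_backward()                                        # backward of another microbatch
                                                              # fills the would-be TP bubble
    work.wait()                                               # MLP (F) needs the AR result
    return mlp_fwd(h)
```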
How would the proposed scheduling handle highly imbalanced stages, such as those caused by MoE layers or custom modules?
Thanks for your valuable question. Imbalance among stages indeed negatively affects performance, especially for multimodal models, custom modules, and so on. In this work, we do try to address this issue, although it is not emphasized in the paper, as our focus is primarily on the implementation of the proposed PP and TP schedules.
Specifically, we first profile the execution time of the forward and backward passes of each unit, as well as its memory footprint. Then, we use a simple algorithm to split the consecutive unit sequence into balanced PP stages, ensuring that the total forward and backward time per stage is approximately equal and that the memory footprint of each PP stage stays within the memory limit.
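A simplified greedy version of such a splitting pass might look like the sketch below; the cost model, memory check, and function name are placeholders rather than the algorithm used in the paper.

```python
from typing import List

def split_units_into_stages(unit_time: List[float], unit_mem: List[float],
                            num_stages: int, mem_limit: float) -> List[int]:
    """Greedily split a consecutive unit sequence into num_stages PP stages.

    unit_time: profiled forward+backward time of each unit.
    unit_mem:  profiled activation memory of each unit.
    Returns the start index of each stage.
    """
    target = sum(unit_time) / num_stages          # ideal per-stage time
    starts, acc_time, acc_mem = [0], 0.0, 0.0
    for i, (t, m) in enumerate(zip(unit_time, unit_mem)):
        stages_left = num_stages - len(starts)    # stage boundaries still to place
        units_left = len(unit_time) - i           # units from i to the end
        # open a new stage once the current one reaches the time target
        # (keeping at least one unit per remaining stage) or would exceed memory
        if stages_left > 0 and i > starts[-1] and units_left >= stages_left and (
            acc_time >= target or acc_mem + m > mem_limit
        ):
            starts.append(i)
            acc_time, acc_mem = 0.0, 0.0
        acc_time += t
        acc_mem += m
        if acc_mem > mem_limit:
            raise ValueError("a single stage exceeds the memory limit")
    return starts
```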
Is the system compatible with activation checkpointing strategies commonly used in large-scale models?
Yes! Our schedule is compatible with activation checkpointing strategies. We conduct an experiment on Qwen2-12.1B using activation checkpointing (AC) with a batch size of 128 and a sequence length of 6k, and report the results in the following table:
| Config | Throughput (samples/s) | Peak Memory (GB) |
|---|---|---|
| AC disabled | 4.79 | 56.0 |
| AC enabled in Ours w/ MLP | 4.19 | 44.7 |
| AC enabled in Ours w/ Attn+MLP | 3.94 | 41.5 |
| AC enabled in Ours w/ Attn+MLP+Norm | 3.75 | 36.3 |
As shown, disabling AC yields the highest throughput of 4.79 samples per second but requires a peak memory footprint of 56.0 GB. Enabling AC selectively on the MLP modules reduces peak memory by 20.2% to 44.7 GB, with a corresponding throughput decrease of 12.5% to 4.19 samples/s. Extending AC to both Attention and MLP modules further reduces peak memory to 41.5 GB (a 25.9% reduction), while throughput declines to 3.94 samples/s. The most aggressive configuration, applying AC to all modules, achieves the greatest memory savings, reducing peak memory by 35.2% to 36.3 GB, at the cost of a 21.7% reduction in throughput, resulting in 3.75 samples/s. These results confirm that our scheduling framework is fully compatible with activation checkpointing strategies.
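For reference, selective checkpointing of only the MLP sub-module can be expressed with stock PyTorch as sketched below; `TransformerLayer` and its sub-modules here are generic placeholders, not the classes used in our implementation.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class TransformerLayer(nn.Module):
    """Generic layer with selective activation checkpointing on the MLP only."""

    def __init__(self, attn: nn.Module, mlp: nn.Module, hidden: int,
                 checkpoint_mlp: bool = True):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden)
        self.norm2 = nn.LayerNorm(hidden)
        self.attn = attn
        self.mlp = mlp
        self.checkpoint_mlp = checkpoint_mlp

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.norm1(x))          # attention activations are kept
        h = self.norm2(x)
        if self.checkpoint_mlp and self.training:
            # MLP activations are dropped in forward and recomputed in backward,
            # trading extra compute for lower peak memory
            y = checkpoint(self.mlp, h, use_reentrant=False)
        else:
            y = self.mlp(h)
        return x + y
```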
Dear Reviewer EjFt,
The author-reviewer discussion period will end on August 6. Please read the rebuttal, post any remaining questions or comments, and confirm the Mandatory Acknowledgement.
Best regards, Your AC
Dear Reviewer EjFt,
Could you please review the rebuttal and engage in a discussion with the authors?
Best regards, Your AC
The authors propose a novel approach to large-scale Transformer training by better coordinating tensor and pipeline parallelism, which aims to reduce the "tensor parallel bubble" (the communication overhead) by overlapping it with the backward computation of the pipeline.
The reviewers (through review and rebuttal) recognize the significance of the work, specifically the scheduling strategy in efficiently overlapping communication and computation with tensor and pipeline parallelism.
During the rebuttal, the authors addressed the reviewers' detailed comments, such as the imbalanced stages in MoE layers, clarification of the formulations, evaluations against previous TP overlap work, memory-throughput trade-offs, and so on. Additional results demonstrate the generality of the approach.
Most of the reviewers state that their comments have been clearly addressed and express a positive attitude towards acceptance of the paper. To me, the idea is interesting and sufficiently novel. The results demonstrate the superiority of the proposed method. I would recommend acceptance of the paper. The authors are encouraged to incorporate the reviewers' comments into the final version of the manuscript.