PaperHub
Average rating: 5.7 / 10 (Poster; 3 reviewers; min 5, max 7, std. dev. 0.9)
Individual ratings: 5, 5, 7
Confidence: 3.3 · Correctness: 2.7 · Contribution: 3.0 · Presentation: 3.0
NeurIPS 2024

Rethinking Memory and Communication Costs for Efficient Data Parallel Training of Large Language Models

OpenReview · PDF
Submitted: 2024-05-14 · Updated: 2025-01-12

Abstract

Keywords
Large Language Model · Distributed Training · Communication Topology

Reviews and Discussion

Official Review
Rating: 5

The paper proposes a unified space to enhance ZeRO-based partitioning strategies, providing a better trade-off between memory and communication. It also introduces a more efficient (than ring-based) collective communication method. The core motivation of this paper is to fully leverage efficient intra-group communication by introducing intra-group partitioning as an extra option (PaRO-DP) and making the collective communication group-based (PaRO-CC).

The paper is well written and the core insight is delivered clearly. However, the novelty is limited. For example,

  • MiCS also improves the efficiency of ZeRO and collective communication based on a similar idea.
  • The partitioning space is similar to Alpa, where the devices are placed into a 2D mesh, with intra-node devices as inner dimension.
  • Megatron-LM suggests tensor parallelism should be within the same node generally.

Strengths

  • The paper provides a systematic view of ZeRO-based partitioning strategies.
  • The paper enlarges the option space of ZeRO strategies.
  • The proposed PaRO-CC can improve communication efficiency in practice.
  • The problem is well formulated and easy to follow.
  • The paper considers various scenarios, including full/partial-parameter training, which helps to understand the motivations.
  • The experiments contain a large number of baselines, which makes the results more convincing.

Weaknesses

I have some concerns about the title and claims.

  • "Rethinking" is too ambitious for this paper, given that it only focuses on ZeRO-based strategies and doesn't combine with any other strategies like TP and PP.
  • "for Efficient Large Language Model Training" seems unsuitable. It lacks evidence to prove it's efficient for LLMs; as mentioned in Limitations, 3D parallelism + ZeRO-1 is usually used in LLM training. Actually, it's not necessary to couple with LLMs, because the proposed methods are general for most models.
  • It's kind of overclaiming for "266% that of SOTA basic DP strategies", because it compares with ZeRO-3 in Figure 3(d), while MiCS has a similar performance.

For the experiments,

  • ZeRO-1 is missing, which makes the experiments incomplete and less convincing. ZeRO-1 is a commonly used strategy, especially when there are many microbatches. It would be better if ZeRO-1 were also included for some scenarios.
  • It would be better to compare PaRO-CC with the Hierarchical Communication Strategy in MiCS.

Figure 1 is a bit complicated.

Questions

  • What's the difference between partial-parameters training and PEFT?
  • If the scope of this paper is LLMs, as in the title, could you please add more related work to elaborate on the current state of LLM training?
  • Can you elaborate on the difference between PaRO-CC and the Hierarchical Communication Strategy in MiCS (or other previous work, if there is any)?

Limitations

How to effectively integrate with current 3D parallelism (or other model parallel techniques) in LLM is a big concern, which may limit its application. In 3D parallelism,

  • ZeRO-1 is usually used to avoid per-microbatch communication, because PP requires a large number of microbatches to reduce pipeline bubbles.
  • tensor parallelism also leverages the high-speed intra-node communication, which will be in conflict with the proposed methods.
Author Response

Thanks for your valuable and constructive feedback. We address your concerns point by point.

Weaknesses:

Q1: concerns about the title and claims

A1: Thanks again for your insightful advice. We adjust the title to "Rethinking Memory and Communication Costs for Efficient Large Language Model Training in Data Parallelism" to better fit the study of this paper. As stated in the global rebuttal (Author Rebuttal), PaRO can be used with other n-D parallel strategies to accelerate the training of large models.

We will update the description of performance improvements with more specific and accurate language, like "PaRO improves the training speed of LLMs by up to 266% that of ZeRO-3 as basic DP strategy".

Q2: For the experiments:

a. ZeRO-1 is missing

b. It would be better to compare PaRO-CC with the Hierarchical Communication Strategy in MiCS.

A2:

a. We conducted a set of experiments comparing PaRO(NII) with ZeRO-1, using the 7B model on 32 GPUs. With the same effective batch size, the throughput of NII was found to be 48.7% higher than ZeRO-1. For detailed data, please refer to Table 1 in the PDF.

b. The difference between PaRO-CC and MiCS's Hierarchical Communication Strategy (MiCS-HCS) is that we overlap inter-group and intra-group communication, and we do not need to rearrange data blocks, avoiding unnecessary memory operations. Details can be compared between Figure 2 in the PDF and the MiCS paper. The collective communication times with 128 GPUs for PaRO-CC, MiCS-HCS, and Ring are 162, 183, and 288 ms, respectively.

Q3: Figure 1 is a bit complicated.

A3: We have redrawn the figure, please refer to Figure 1 in the PDF.

Questions:

Q1: What's the difference between partial-parameters training and PEFT?

A1: To trade off training efficiency against statistical performance, model training may have diverse requirements on the number of trainable parameters in different scenarios. The trainable-to-total parameter ratios for partial-parameter training and PEFT are 1/16 and 3/1000, respectively. This difference results in significant differences in memory usage and communication data volume, and the most suitable strategy also differs. Please refer to Section 3.1.1 in our manuscript and Table 9 in the Appendix for more information.
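As a rough illustration (assuming gradient and optimizer-state memory scale roughly linearly with the number of trainable parameters), the gap between the two settings is about

$$\frac{r_{\text{partial}}}{r_{\text{PEFT}}} = \frac{1/16}{3/1000} \approx 21,$$

i.e., partial-parameter training keeps roughly 21x more trainable state (gradients and optimizer moments) per replica than PEFT, which is why the preferred partitioning strategy differs between the two settings.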

Q2: If the scope of this paper is LLMs, as in the title, could you please add more related work to elaborate on the current state of LLM training?

A2: As mentioned in Weaknesses A1 and global rebuttal (Author Rebuttal), we will adjust the title and give more description of PaRO usage in LLM training.

Q3: Can you elaborate on the difference between PaRO-CC and the Hierarchical Communication Strategy in MiCS?

A3: Same as Weaknesses A2.b.

Limitations:

Q1: How to effectively integrate with current 3D parallelism (or other model parallel techniques) in LLM is a big concern, which may limit its application.

A1: As future work, we are currently working on further utilising PaRO in n-D parallel training for larger-scale LLM training. We find it is effective in two scenarios in the latest n-D parallel training:

  1. Scenario 1 is when there is a difference in bandwidth between intra-group and inter-group communication, which is common in GPU clusters. n-D hybrid parallelism usually performs intra-node TP and inter-node PP in a subgroup of nodes, and nests DP or PaRO in the outer groups. However, the grouping criterion is no longer one machine per group, but rather based on the network topology of the cluster. For example, in a multi-layer switch scenario, grouping can be based on the lower-layer switches. As another example, when training with multiple Nvidia SuperPods or similar infrastructure, one SuperPod can be used as a group.

  2. Scenario 2. In scenarios with a vast number of GPUs, n-D parallelism usually uses TP + PP + DP for efficient training on over ten thousand GPUs. Following the same principles as discussed earlier, PaRO could provide a more flexible strategy as an alternative to DP in such cases.

Please refer to global rebuttal (Author Rebuttal) for more detail.

Thank you again for taking the time to review our paper. We hope we have addressed the reviewer's questions and the reviewer is willing to raise their score. Please let us know if we can provide any additional clarification or information.

Comment

Thanks for the response.

Most of my concerns are well addressed, except for the part of integrating with 3D parallelism (the statement itself already verifies its application is limited to specific scenarios). I'm not saying it's useless in 3D parallelism, and I do believe there must be some scenarios it can help. However, it's still a big concern how to combine with current methods because intra-node communication, which is the most common setting PaRO helps, is already well utilized.

Another concern is the novelty (refer to my summary). Although I appreciate that the paper provides a systematic implementation for various DP strategies, for me PaRO-DP doesn't provide many new insights. Many previous works leverage intra-group communication to accelerate training/inference. For the systematic part, PaRO-DP is kind of like a subset of Alpa. Actually, for me the novelty mainly comes from PaRO-CC (although most of the text describes PaRO-DP), but I'm not quite familiar with the literature in this direction. Could you please clarify more about the novelty/originality?

Comment

Thank you for your timely questions, which are closely related to the latest research on LLM training.

1. Regarding the first concern related to n-D parallelism, we present four reasons why our PaRO strategy is more effective:

Reason 1: A DP-based method (e.g. ZeRO, PaRO) is preferred over a TP-based method when the model fits into memory.

We agree that TP is necessary for partitioning weights in extremely large models. However, for moderately large models like the 65B model, we argue that the DP-based method is preferable (these models are more commonplace in practical business applications of LLMs, due to the cost of extremely large models, and they also require efficient training over terabytes of data). The reasons include:

a) The computational efficiency of TP degrades due to the subdivision of matrix multiplications; b) it is challenging to overlap communication and computation; and c) TP requires significant changes to the model implementation compared to the DP-based method, making it difficult to use.

Reason 2: In a scenario where a TP+DP-based method is used, PaRO offers more scalability.

PaRO can be directly used as an alternative to DP in TP + DP-based methods, as demonstrated in Reason 4. From another angle, however, TP+DP typically uses DP (e.g. ZeRO) for inter-node communication while utilizing TP for intra-node communication. In contrast, PaRO uses ZeRO for inter-group communication and still ZeRO for intra-group communication. The main difference lies in whether TP or ZeRO is used for intra-group communication and in the scope of the group. TP has higher intra-group network requirements than DP-based methods, due to the much higher communication volume of activations. This necessitates that the group be intra-node to take advantage of high-speed NVLink in TP. PaRO, on the other hand, allows the group to extend across nodes as well, thanks to its lower communication cost. This flexibility is particularly beneficial for longer sequences with larger activations.

Reason 3: PaRO can be orthogonally integrated with Sequence Parallelism (SP), unlike TP.

Current SP methods can be categorized based on whether they partition over the head dimension (e.g. DeepSpeed-Ulysses) or the sequence-length dimension (e.g. Ring-SelfAttention / context parallelism). However, combining TP with a DeepSpeed-Ulysses-like method is challenging because both partition over the head dimension. On the other hand, a PaRO-like DP method can be integrated orthogonally with both of these sequence parallelism approaches. For example, under PaRO, DeepSpeed-Ulysses-style head parallelism can be leveraged for intra-node parallelism, which requires all-to-all operations and places significant demands on the network topology, while Ring-Attention-style context parallelism can be applied for inter-node parallelism.

Reason 4: PaRO can still enhance training efficiency in 4D parallelism when training extremely large LLMs.

When training extremely large models, PaRO still improves training efficiency. For example, in Llama 3.1, 4D parallelism is employed in the order of [TP, CP, PP, DP], where PaRO can be directly used as an alternative to DP. This improvement in efficiency is achieved in two ways. Firstly, it reduces the number of participants in a communication collective by grouping, even though expensive communication is still required over 128 nodes in Llama 3.1 (DP=128). Secondly, it can leverage the heterogeneous network in a multi-layer switch scenario as discussed.

2. Regarding the second concern related to Alpa, we contend that the comparison is misplaced; our solution cannot be identified through Alpa's automatic parallelism methodology.

Firstly, Alpa primarily emphasizes searching for an optimal hybrid parallel strategy under various existing parallelism techniques including Intra-Operator Parallelism (e.g., DP, TP & ZeRO) and Inter-Operator Parallelism (e.g., PP). It does not propose a new parallelism strategy but leverages the existing parallelism techniques, where new parallelism techniques like PaRO or Sequence parallelism are not included. In contrast, PaRO proposes a new parallelism strategy, which primarily focuses on optimizing DP across different training parameter scenarios. The grouping optimization scheme employed by PaRO is not encompassed within an automatic parallelism framework, such as Alpa. Consequently, PaRO can serve as a complementary component to Alpa.

Comment

Reason 3 is reasonable to me, because Deepspeed-ulysses is another widely used technique, where PaRO can be applied (but the specific strategy is still not clear).

"It does not propose a new parallelism strategy but leverages the existing parallelism techniques, where new parallelism techniques like PaRO or Sequence parallelism are not included"

I don't think this claim is correct. Although Alpa doesn't propose any new parallelism strategy explicitly, most existing strategies including PaRO and Sequence parallelism lie in the solution space of Alpa. The main difference is, Alpa uses an op-level space while PaRO uses a model-level space (only sharding batch dimension), which means PaRO has a much smaller space so that all solutions can be enumerated while Alpa needs an ILP to search for the optimal solution. There may be some tiny differences in the details, but conceptually, PaRO is a subset of Alpa.

Given my concerns about integrating with other methods are partially addressed and the systematic implementation should be helpful to the community, I'm willing to increase my score from 4 to 5. But I still think the paper needs to be improved regarding the above discussions. More experiments about integrating with other methods to prove its efficiency would be appreciated.

Comment

Thanks for your response. Alpa stands out as one of the most significant advancements in recent years for automatic parallelism training. It uses a fine-grained, operator-level optimizer on heterogeneous networks, but struggles to optimize for the structure of transformer-based models [cf. H3T (NeurIPS 2023)]. As you mentioned, if we don't consider those differences in the details, PaRO or recent sequence parallelism works may exist in Alpa's search space. However, we argue that even with this consideration, Alpa's objective function remains simplified, ultimately leading to a suboptimal solution due to the omission of several key components.

  1. Alpa does not explicitly incorporate memory constraints into its optimization objective, even though these constraints are crucial for training large language models (LLMs) and are challenging to model within the objective function [cf. MiCS (VLDB 2022), which corresponds to the III strategy in PaRO]. For example, while MiCS (III) and PaRO-IIG have comparable communication costs, IIG stands out for its lower memory requirements. This advantage enables IIG to support a larger batch size, thereby increasing throughput, as demonstrated in Table 8 of the appendix of the manuscript. In contrast, the Alpa method fails to account for these differences. Additionally, there are several similar scenarios when trading off memory and communication costs, as discussed in our manuscript.

  2. Alpa does not explicitly consider computation and communication overlapping in its optimization objective function. Overlapping is mainly used by different hand-crafted parallelism techniques to improve training efficiency. For example, Ring-SelfAttention saves memory for key-value (KV) pairs but incurs extra communication overhead, which runs counter to the objective function of Alpa. Meanwhile, it optimizes efficiency by overlapping communication and computation using a ring-style peer-to-peer approach, alongside an incremental softmax inspired by FlashAttention.

Overall, mainstream LLM training uses hand-crafted parallelism (e.g. 4D-parallelism in Llama 3.1) rather than automatic parallelism such as Alpa.

Thank you again for your insightful suggestions; we will refine our manuscript based on our discussions. The integration of PaRO with other parallelism strategies is part of our ongoing future work and requires a significant workload.

Comment

Thank you for your response. To clarify, we are stating that Alpa does not "explicitly" incorporate memory constraints into its optimization objective. Instead, Alpa estimates memory consumption primarily through profiling. However, this approach can be imprecise for LLM training unless Alpa can enumerate and profile every partition at each stage, and there are several hyperparameters involved.

Official Review
Rating: 5

This paper introduces the Partial Redundancy Optimizer (PaRO) to improve the efficiency of training large language models (LLMs) by optimizing the trade-off between memory and communication costs. PaRO includes two main strategies: PaRO Data Parallelism (PaRO-DP), which refines model state partitioning and training procedures, and PaRO Collective Communications (PaRO-CC), which rearranges the communication topology for faster collective operations. The proposed strategies demonstrate significant improvements in training speed, up to 266% over state-of-the-art (SOTA) data parallel strategies, and 17% when applied independently to model parallel strategies like Megatron.

Strengths

  • PaRO enables a much more fine-grained parallelism strategy compared to ZeRO. This facilitates much more optimized training performance across a broad range of resource setups.
  • Beyond the partitioning strategy, the paper also proposes PaRO-CC, an intra-group-aware collective communication operation.
  • The paper provides extensive experiment results, including the training convergence.

Weaknesses

  • In full-parameter training, a sequence length of 512 is too small. Do you have results for much longer sequence lengths, for example, 4K or 8K?
  • The latest Megatron supports more efficient training schemes. It would be helpful to show if PaRO can still achieve significantly higher training performance even when considering these methods.

Questions

  • How does PaRO determine the group size (G)? Is it a hyper-parameter?
  • In the experiments, are other well-known computation optimizations (e.g., FlashAttention) applied?
  • How can users enable PaRO? Could you provide an example of the programming interface?

Limitations

No additional limitations exist.

Author Response

Thanks for your positive and constructive feedback. We address your concerns point by point.

Weaknesses:

Q1: Do you have results for much longer sequence lengths, for example, 4K or 8K?

A1: We are currently conducting experiments with longer sequence lengths, but we have not had time to obtain the experiment results. Theoretically, longer sequence lengths should not significantly increase communication volume (only parameters and gradients are communicated in DP strategy) and are anticipated to result in similar performance improvements.

Q2: The latest Megatron supports more efficient training schemes. It would be helpful to show if PaRO can still achieve significantly higher training performance even when considering these methods.

A2: PaRO could be used as an alternative to DP in the latest n-D parallel training and improve training efficiency. Please refer to the global rebuttal (Author Rebuttal).

Questions:

Q1: How does PaRO determine the group size (G)? Is it a hyper-parameter?

A1: By default, PaRO automatically partitions groups according to the nodes, making full use of the communication advantages within each node. It also supports other group sizes, such as using the lower-layer switch as the grouping basis in multi-layer switch networks; custom group sizes can also be defined.
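As a minimal sketch of this default node-based grouping (illustrative only; the function name and defaults below are not PaRO's actual interface), assuming torch.distributed is already initialized and ranks are laid out node by node:

```python
# Illustrative sketch only; not PaRO's actual grouping code.
import torch
import torch.distributed as dist

def build_intra_groups(group_size=None):
    """Create one process group per node (or per custom group size) and
    return the group that the calling rank belongs to."""
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    if group_size is None:
        # Default: one group per node, using the local GPU count as the group size.
        group_size = torch.cuda.device_count()
    assert world_size % group_size == 0, "world size must be divisible by group size"

    my_group = None
    for start in range(0, world_size, group_size):
        ranks = list(range(start, start + group_size))
        # new_group must be called by every rank for every group, in the same order.
        group = dist.new_group(ranks=ranks)
        if rank in ranks:
            my_group = group
    return my_group
```

A switch-based or SuperPod-based grouping would only change how the rank lists are constructed.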

Q2: In the experiments, are other well-known computation optimizations (e.g., FlashAttention) applied?

A2: To avoid introducing more factors, we did not use optimization schemes such as FlashAttention or quantization. However, for the LLaMA-65B model, we enabled activation checkpointing to ensure successful training.

Q3: How can users enable PaRO? Could you provide an example of the programming interface?

A3: We provide an open-source release of our code in the paper. To enable PaRO, you just need to add "paro_strategy": "NIG" to the zero_optimization dict in the original ds_config.json and start training the same way as with DeepSpeed. Configuration examples can be found in the README.md file in the code repository.
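A minimal sketch of what this could look like when the config is passed as a Python dict (equivalent to ds_config.json); only the "paro_strategy" key comes from the reply above, while the remaining settings and the model are placeholders:

```python
# Illustrative sketch; see the repository README.md for the authoritative configuration.
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)  # placeholder model

ds_config = {
    "train_micro_batch_size_per_gpu": 4,   # placeholder value
    "zero_optimization": {
        "stage": 3,                        # placeholder value
        "paro_strategy": "NIG",            # the key quoted in the reply above
    },
}

# Training then starts the same way as with vanilla DeepSpeed.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```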

Thank you again for taking the time to review our paper. We hope we have addressed the reviewer's questions and the reviewer is willing to raise their score. Please let us know if we can provide any additional clarification or information.

Official Review
Rating: 7

This paper recasts basic known distributed training strategies (such as ZeRO-1/2/3, MiCS, and FSDP) in a unified framework that takes into account the trade-off between memory consumption and communication. By exhibiting natural levels of granularity for the partitioning of different parts of the model and optimizer, this paper makes an exhaustive list of all 27 possible strategies for the distributed training of LLMs (including previously known ones). Then, it proposes a simple model to predict the throughput and feasibility of all strategies based on physical measurements of the cluster used and the task at hand, allowing one to choose the fastest feasible distributed training strategy. This model is shown to effectively predict the actual performance of LLM training, bringing significant speedups over previous "go-to" methods in some cases.

Strengths

  • Exhaustive enumeration of possible distributed training strategies in a unified framework: Three levels of granularity are proposed for the partitioning of some "vector", which seem natural in the context of a GPU cluster ("no partition = copy on all workers", partitioned inside a cluster node, or partitioned across all workers). Three different model+optimizer parts are considered (model parameters, gradients, and optimizer state), as these are treated differently depending on the task and the use of mixed precision. With the 3 levels of granularity and 3 types of vectors, this leads to $3^3 = 27$ possible distributed strategies.

  • Simple model to predict the throughput of a given strategy: Based on simple physical characteristics of the GPU cluster used (e.g., intra and inter node communication bandwidth, GPU memory) and the task (pretraining, parameter-efficient finetuning), a very simple model is proposed to predict what would be the fastest feasible distributed training strategy.

  • Model is shown to highly correlate to actual implementation throughput: In part 4.2, through extensive experimentation, the model is shown to highly correlate to actual time measurements when implementing the different strategies.

  • Significant speedup can be observed in practice by using the recommended strategy: Thanks to this model, the best feasible strategy can be chosen in advance, which can be a completely new one compared to previously known ones (such as ZeRO). In the experiments performed, this can lead to significant speedups compared to standard methods.

Weaknesses

  • Figures a bit hard to understand: I found Fig.1 & 2 hard to read and to understand.
  • Experiments in Part 4.2 do not seem exhaustive: in line 287, 3 configurations out of the 14 considered are dismissed, leading to 11 remaining possible experiments. Yet, only 9 are reported in Figure 3, why is this the case?
  • Comparison with standard collective communication strategies other than "the ring topology" is not done: The only collective communication strategy investigated (other than the one proposed by the authors) is the one based on the ring topology. However, this is not the only standard implemented in the industry. For instance, NCCL, in addition to "Ring all-reduce", also proposes "Tree all-reduce", which can allow significant speedup at scale compared to the ring option (NCCL chooses automatically which to use depending on the network/task, but it is possible to force a particular strategy). Moreover, other strategies such as Butterfly All-Reduce [Patarasuk et al., 2009] or Moshpit All-Reduce [Ryabinin et al., 2021] are not considered either.

Questions

  • line 228: "Assuming that communication and computation can be fully overlapped, $T$ can be approximately regarded as the maximum of $t_{comm.}$ and $t_{comp.}$" Could you be more specific about when this assumption is actually valid in practice?
  • line 237: "the formula of $T$" Do you mean $\max\{t_{comp.}, t_{comm.}\}$?
  • line 287: (throughput indicator) It is not clear at first that this is the "TPS indicator" (otherwise not defined) in the subsequent figures; I would advise indicating it at this stage.
  • Fig 7: Small variations between the losses for the different methods are observed, although they should be mathematically identical; why is that?

Suggestion:

In the paragraph at line 232, I would advise stating that the formulas (based on physical characteristics of the cluster/task) used to estimate each $t_{\times\times}$ are provided in Appendix A3.

Limitations

.

Author Response

Thanks for your positive and constructive feedback. We address your concerns point by point.

Weaknesses:

Q1: Figures 1 & 2 are a bit hard to understand.

A1: We have redrawn the figures, please refer to Figure 1 & 2 in the PDF.

Q2: Experiments in Part 4.2 do not seem exhaustive.

A2: We choose the best strategy for specific training scenarios based on the guidelines and verify it in our experiments.

Q3: Comparison with standard collective communication strategies other than "the ring topology" is not done.

A3: I appreciate you bringing this up.

The Butterfly topology sends and receives large data blocks each time, which makes it difficult to fully utilize the bandwidth and prone to delay jitter. Its performance is not as good as the Ring topology.

Compared to the Ring topology, the Tree topology in NCCL mainly speeds up the communication primitive locally. We focus on the huge difference in communication bandwidth between inter-group and intra-group interactions. Group communication is utilized to minimize inter-group communication and maximize intra-group communication. The communication can utilize either a ring or a tree structure.

The Moshpit topology is a group communication strategy for allreduce that solves cluster instability and node failure problems through an iterative averaging protocol. Our algorithm combines multi-level sharding of large models and fully utilizes bandwidth between and within machines through a grouping strategy to improve communication.
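For context, under a standard alpha-beta cost model (a generic estimate, not a result from our paper), with $p$ participants, message size $n$, per-step latency $\alpha$ and bandwidth $B$, ring and tree all-reduce behave roughly as

$$t_{\text{ring}} \approx 2(p-1)\,\alpha + \frac{2(p-1)}{p}\cdot\frac{n}{B}, \qquad t_{\text{tree}} \approx 2\log_2 p \cdot \alpha + \frac{2n}{B},$$

so the tree mainly shrinks the latency term at large $p$, whereas PaRO-CC instead reduces how much data must traverse the slower inter-group links in the first place.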

Questions:

Q1: line 228: "Assuming that communication and computation can be fully overlapped, $T$ can be approximately regarded as the maximum of $t_{comm.}$ and $t_{comp.}$" Could you be more specific about when this assumption is actually valid in practice?

A1: When training transformer architecture models, the next layer can prefetch parameters while the previous layer is computing, allowing computation and communication to happen simultaneously. Here we simplify and assume perfect overlap, meaning the training time for a single step is determined by the longer of the two, computation or communication.

Q2: line 237: "the formula of $T$" Do you mean $\max\{t_{comp.}, t_{comm.}\}$?

A2: Yes. We will make it a numbered formula in our manuscript for better clarification.
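Displayed, the formula reads

$$T \approx \max\{\, t_{\text{comp.}},\ t_{\text{comm.}} \,\}.$$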

Q3: line 287: (throughput indicator) It is not clear at first that this is the "TPS indicator" (otherwise not defined) in the subsequent figures; I would advise indicating it at this stage.

A3: Yes, the throughput indicator (log(1/T)) is the "TPS indicator" in the figures. We will state this explicitly in the subsequent version.

Q4: Fig 7: Small variations between the losses for the different methods are observed, although they should be mathematically identical; why is that?

A4: Thanks for pointing this out. We have verified the values of the model parameter, gradient, and optimizer state before and after an update step, and the errors of different strategies remain within the normal machine precision range (1e-6). Small variations between the losses for different methods, despite their mathematical equivalence, can be attributed to floating-point representation errors in computers. These truncation errors accumulate during different steps of the calculations. As a result, the outputs may show slight discrepancies. However, these factors do not impact the statistical performance during training or the consistency in convergence behavior. The phenomenon is also consistent across ZeRO-1/2/3.
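As a tiny, generic illustration of the effect (not the authors' code): floating-point addition is not associative, so differently partitioned reductions need not be bit-identical even though they compute the same mathematical sum.

```python
# Floating-point addition is not associative, so different reduction orders
# (e.g. different sharding strategies) can give slightly different results.
import numpy as np

a, b, c = np.float32(1e8), np.float32(-1e8), np.float32(0.1)
print((a + b) + c)   # 0.1  -- the small term survives
print(a + (b + c))   # 0.0  -- the small term is absorbed by the large one

grads = np.random.default_rng(0).standard_normal(1_000_000).astype(np.float32)
full = np.sum(grads)                                    # one reduction order
sharded = sum(np.sum(s) for s in np.split(grads, 8))    # another order (8 "shards")
print(abs(full - sharded))  # typically a small nonzero difference at machine-precision scale
```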

Thank you again for taking the time to review our paper. We hope we have addressed the reviewer's questions and the reviewer is willing to raise their score. Please let us know if we can provide any additional clarification or information.

Comment

I thank the authors for their efforts in their rebuttal, which answered some of my concerns. Given that I think this work could appeal to the community and spark interesting discussions, I raise my score by one point.

Author Rebuttal

We thank the reviewers for their insightful comments and constructive advice. We offer a general response here and respond to each reviewer individually.

The proposed PaRO can be used as a standalone DP strategy or combined with other parallel strategies for n-D parallel training. PaRO-CC communication optimization can also be widely applied to distributed training strategies with global collective communication operations. Next, we'll introduce the practical applications of PaRO.

Firstly, PaRO is not only effective in scenarios where there is a difference between intra-group and inter-group communication costs, but it is also particularly useful when there are a large number of nodes and GPUs (e.g., thousands of GPUs or more). This is because it becomes difficult to maintain a linear speedup ratio for collective communication operators when the number of GPUs is particularly large. The PyTorch community has recently proposed Hybrid Sharding Data Parallel (HSDP) as a 2D strategy that performs FSDP within a host and DDP across hosts to reduce the number of nodes involved in collective communication. While HSDP aims to solve the same problem as PaRO, it was not as mature at the time of drafting this article and therefore was not compared. Additionally, due to hardware limitations, scenarios with more than thousands of GPUs were not tested. Lastly, as demonstrated in our manuscript, PaRO provides more flexibility to more complicated machine learning systems, such as distributed RLHF systems.

As future work, we are currently working on further utilising PaRO in n-D parallel training for larger-scale LLM training. We find it is effective in two scenarios in the latest n-D parallel training:

  • Scenario 1 is when there is a difference in bandwidth between intra-group and inter-group communication, which is common in GPU clusters. n-D hybrid parallelism usually performs intra-node TP and inter-node PP in a subgroup of nodes, and nests DP or PaRO in the outer groups. However, the grouping criterion is no longer one machine per group, but rather based on the network topology of the cluster. For example, in a multi-layer switch scenario, grouping can be based on the lower-layer switches. As another example, when training with multiple Nvidia SuperPods or similar infrastructure, one SuperPod can be used as a group.
  • Scenario 2. In scenarios with a vast number of GPUs, n-D parallelism usually uses TP + PP + DP for efficient training on over ten thousand GPUs. Following the same principles as discussed earlier, PaRO could provide a more flexible strategy as an alternative to DP in such cases.

It is worth noting that the reason why this article uses "intra-group" and "inter-group" instead of "intra-machine" and "inter-machine" to describe network topology is precisely because the grouping criteria for network topology can be diverse, as is the case when using n-D parallelism.

It is also worth emphasizing again that PaRO is a non-intrusive distributed training solution for LLM training, like ZeRO. We believe it is one of the few feasible non-intrusive acceleration solutions for more than thousands of GPUs.

Comment

Dear reviewers:

As the discussion period is going to end soon, please try to actively engage with the authors about the paper. Thanks a lot for your help and dedication.

Your AC.

Final Decision

The paper introduces the Partial Redundancy Optimizer (PaRO), a set of basic strategies designed to enhance the training speed of large language models (LLMs) by optimizing memory and communication costs. PaRO Data Parallelism (PaRO-DP) improves training speed through refined model state partitioning, while PaRO Collective Communications (PaRO-CC) enhances communication efficiency by rearranging the topology. Experiments show that PaRO increases training speed significantly compared with SOTA. Overall, the results shown in the paper are interesting, and the reviewers all agree that this is interesting work.