Capturing and Mitigating Gradient Aggregation Errors for Fault-Tolerant Distributed Training
This work theoretically and empirically studies and mitigates the convergence problem with the gradient aggregation error, which is caused by silent hardware or software issues.
Abstract
Reviews and Discussion
This paper considers hardware failures that can happen during gradient aggregation. These failures can cause different clients to obtain different noisy versions of aggregated gradients and, thus, different model weights. Such noise can accumulate over rounds and cause performance degradation and even divergence of the training process.
To account for such aggregation errors, the paper proposes PAFT, an algorithm that performs occasional model weight synchronization. PAFT has two versions: one where the model weight updates occur periodically and, another where the noise is estimated, and synchronization is invoked based on the estimated noise.
The paper also provides convergence rate analysis with noisy gradients and evaluates whether PAFT successfully mitigated the problem over different scenarios.
Strengths
- The paper is well-written and structured
- The paper performs convergence analysis that takes the modeled aggregation errors into account
- The experimental study considers different setups and scenarios
Weaknesses
- The failure model is vague, and it is unclear what exactly fails in real systems that cause aggregation errors. This is at the heart of the paper and should be made precise. For example, both in TCP and InfiniBand (with and without RDMA), there are data corruption checks (e.g., CRC checksums) that invalidate packets that experienced data corruption (e.g., link-level and e2e).
- Weight synchronization is not a new idea, and critical related work is missing (e.g., see [1][2][3])
- The theoretical results are fairly standard and do not seem to provide new insight or techniques (e.g., see results in [1][2][3])
[1] McMahan, Brendan, et al. "Communication-efficient learning of deep networks from decentralized data." Artificial intelligence and statistics. PMLR, 2017.
[2] Zhang, Sixin, Anna E. Choromanska, and Yann LeCun. "Deep learning with elastic averaging SGD." Advances in neural information processing systems 28 (2015).
[3] Koloskova, Anastasia, et al. "Decentralized Deep Learning with Arbitrary Communication Compression." International Conference on Learning Representations (2020).
Questions
- How can a gradient corruption happen without the packet being discarded by the transport protocol, which must ensure reliable data delivery by design? Also, why is the noise assumed to be unbiased? An HW error can nullify a bit, for example.
- As previous works have already suggested, if one is concerned with the modeled error, why not use weight synchronization instead of gradient synchronization to eliminate the problem altogether and without introducing overhead?
We would like to express our sincere gratitude to the reviewer for their time and insightful comments on our paper. We appreciate the recognition of the strengths of our work, including its well-structured presentation, the convergence analysis that accounts for modeled aggregation errors, and the comprehensive experimental study that considers various setups and scenarios.
Q1. Vague failure model issues.
The failure model is vague, and it is unclear what exactly fails in real systems that cause aggregation errors. This is at the heart of the paper and should be made precise. For example, both in TCP and InfiniBand (with and without RDMA), there are data corruption checks (e.g., CRC checksums) that invalidate packets that experienced data corruption (e.g., link-level and e2e).
Ans for Q1: Thank you for your valuable feedback. We acknowledge that the failure model may appear vague; however, it is important to note that current interconnects such as NVLink, commonly used in data centers, do not provide reliable data-corruption checks. This is a critical issue in emerging hardware connections designed for fast AI training, as highlighted in Tables 4, 5, and 6 of our appendix. Although protocols like TCP and InfiniBand do include such checks, they are not always applicable in our context, where high-speed communication is prioritized.
Q2. Novelty issues.
Weight synchronization is not a new idea, and critical related work is missing (e.g., see [1][2][3])
Ans for Q2: We appreciate your comment regarding weight synchronization. While model averaging has been proposed in federated learning (FL) scenarios [1,4,5] and local SGD-based distributed training [2,3], our work specifically addresses the silent data corruption problem in large-scale distributed training, which differs significantly from FL and local SGD approaches. We believe we are the first to theoretically analyze this silent data corruption issue in the gradient aggregation of large-scale distributed training. Although model averaging has been explored in other scenarios, our findings demonstrate that it can be a simple and effective solution to this SDC problem in gradient aggregation.
Q3. Standard theoretical results.
The theoretical results are fairly standard and do not seem to provide new insight or techniques (e.g., see results in [1][2][3])
Ans for Q3: Thank you for your observations. While we recognize that some theoretical results may seem standard, our contribution lies in identifying that model divergence accumulates over time due to gradient inconsistency. This accumulation significantly impacts convergence, which has not been thoroughly addressed in previous works. Our analysis provides a new perspective on how these errors affect training dynamics.
Q4. Ensuring reliable data delivery by design.
How can a gradient corruption happen without the packet being discarded by the transport protocol, which must ensure reliable data delivery by design? Also, why is the noise assumed to be unbiased? An HW error can nullify a bit, for example.
Ans for Q4: Thank you for your question. The communication protocols used between high-speed GPU cards in data centers indeed lack reliable data-corruption checks, which is a significant issue for fast communication: implementing reliable data delivery and fault-tolerant data checking would sacrifice communication efficiency.
To generalize across different error types in our analysis, we assume the noise follows a Gaussian distribution. Since we have no prior knowledge of the noise direction, we model the error as unbiased (zero-mean).
It is true that a hardware error can nullify a bit, leading to substantial gradient errors. In such cases, workers can utilize gradient clipping techniques to mitigate the impact of these large errors.
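As an illustration of this mitigation, here is a minimal PyTorch sketch of standard global-norm gradient clipping (the toy model, data, and `max_norm` value are placeholders for illustration, not our experimental setup):

```python
import torch

# Toy model and optimizer purely for illustration.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loss = model(torch.randn(4, 10)).pow(2).mean()
loss.backward()

# Clip the global gradient norm so that a single corrupted element
# (e.g., a flipped or nullified bit producing a huge value) cannot
# dominate the update direction.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```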
Q5. Using weight synchronization instead of gradient.
As previous works have already suggested, if one is concerned with the modeled error, why not use weight synchronization instead of gradient synchronization to eliminate the problem altogether and without introducing overhead?
Ans for Q5: Thanks for your insightful question. Weight synchronization can reproduce the training dynamics of SGD exactly as gradient synchronization does, but it fails to guarantee the convergence of other, more advanced optimizers. Specifically, when weight synchronization is used instead of gradient synchronization, workers never hold the globally averaged gradient. Optimizers such as SGD with momentum, Adam, and Adagrad, which require the global gradient to perform gradient-correction operations, therefore cannot obtain the accurate global gradient direction, and their convergence is no longer guaranteed to match that of their original versions.
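To make this concrete, here is a brief sketch in our notation (the symbols below are illustrative, not taken directly from the paper's equations): with gradient synchronization every worker holds the averaged gradient and maintains an identical momentum buffer, whereas with weight-only synchronization each worker's buffer is driven by its local gradient.

$$
\text{gradient sync: } m_t = \beta m_{t-1} + \bar g_t,\quad \bar g_t = \tfrac{1}{n}\sum_{i=1}^{n} g_t^{(i)};
\qquad
\text{weight-only sync: } m_t^{(i)} = \beta m_{t-1}^{(i)} + g_t^{(i)}.
$$

Since $m_t^{(i)}$ differs across workers, the momentum (and, similarly, Adam's preconditioner built from the squared global gradient) no longer matches that of the original synchronous optimizer.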
Reference
[1] McMahan, Brendan, et al. "Communication-efficient learning of deep networks from decentralized data." Artificial intelligence and statistics. PMLR, 2017.
[2] Zhang, Sixin, Anna E. Choromanska, and Yann LeCun. "Deep learning with elastic averaging SGD." Advances in neural information processing systems 28 (2015).
[3] Koloskova, Anastasia, et al. "Decentralized Deep Learning with Arbitrary Communication Compression." International Conference on Learning Representations (2020).
[4] Advances and Open Problems in Federated Learning. 2019.
[5] Local SGD Converges Fast and Communicates Little. ICLR 2018.
I acknowledge the authors' rebuttal and will keep my score.
This paper discusses an interesting research question: how silent data corruption in gradient aggregation impacts distributed training and how to capture and mitigate it. The paper theoretically analyzes how silent data corruption causes accumulated model divergence and failed convergence. It then designs a fault-tolerant distributed training system that provably eliminates model divergence and ensures convergence, while reducing the incurred communication overhead.
Strengths
- This paper discusses how silent data corruption (SDC) influences training convergence for the first time
- This paper theoretically analyzes how gradient inconsistency caused by SDC leads to failed convergence and empirically shows that even a small noise can already harm training convergence and accuracy
- This paper proposes a simple yet effective synchronization scheme to eliminate model divergence. It also reduces the incurred communication overhead by overlapping synchronization with training.
- This paper demonstrates its effectiveness with four models, different noise degrees, and different training scales.
Weaknesses
- Lack of real-world traces. It is true that SDC has negative impacts on training convergence, but how bad it could be depends on the frequency of SDC during training. Is it possible to provide any real-world traces to connect the noise degree with SDC frequency?
- Test accuracy is still impacted in case of high noise degrees. As listed in Table 2, the test accuracy is dramatically lower than the DSGD baseline when the noise degree is 0.01 or 0.1. It appears that PAFT cannot avoid accuracy dropping under all scenarios and doesn’t completely address the problem.
- Only CIFAR-10 is used for the evaluations of ResNet. Could you also include ImageNet in the evaluation to check whether these conclusions still hold for a large image dataset?
Questions
N/A
Thank you for your time and valuable comments on our paper. We appreciate your recognition of the strengths of our work, particularly: the novel exploration of silent data corruption (SDC) in training convergence, the theoretical analysis linking gradient inconsistency to failed convergence, the proposed synchronization scheme to eliminate model divergence, and the empirical validation across multiple models and noise degrees.
Q1. Real-world trace issues.
Lack of real-world trace. It is true that SGC has negative impacts on training convergence, but how bad it could be depends on the frequency of SGC during training. Is it possible to provide any real-world traces to connect the noise degree with SGC frequency?
Ans for Q1: Thank you for your insightful comment. We have provided some real-world SDC error distribution in the Appendix. However, we currently do not have access to the real-world SDC error distribution during training. We will make efforts to collect this data in the future and release it to enhance the understanding of the connection between noise degree and SDC frequency.
Q2. Test accuracy gap when noise is large.
Test accuracy is still impacted in case of high noise degrees. As listed in Table 2, the test accuracy is dramatically lower than the DSGD baseline when the noise degree is 0.01 or 0.1. It appears that PAFT cannot avoid accuracy dropping under all scenarios and doesn’t completely address the problem.
Ans for Q2: Thank you for pointing this out. Yes, for larger noise degrees, the test accuracy is indeed impacted. This is due to the fact that the gradient information itself is affected by the noise, which causes the gradient direction to deviate from the true gradient direction, leading to a drop in accuracy.
Q3. Imagenet Experiments.
Only CIFAR-10 is used for the evaluations of ResNet. Could you also include ImageNet in the evaluation to check whether these conclusions still hold for a large image dataset?
Ans for Q3: Thank you for your suggestion. We have also included evaluations of GPT-2 and LLaMA-2 training with Alpaca and OpenWebText in our experiments. We will consider adding ImageNet for experiments in the future to further validate our conclusions on larger image datasets.
I acknowledge the author rebuttals and will keep my scores.
Thanks for your recognition and reply. Silent data corruption is an important problem in large-scale infrastructure. We will make efforts to collect real-world SDC errors in the future and release them to enhance the understanding of the connection between noise degree and SDC frequency.
The submission claims to introduce a technique for capturing and mitigating the errors in gradient accumulation in distributed SGD due to silent data corruption (SDC).
To this end, the submission presents the "PAFT" algorithm. The main feature of this algorithm is that it periodically synchronizes the models, in a way quite similar to standard local SGD algorithms. The base variant performs fully synchronous periodic model averaging in addition to the gradient communication after every gradient update step; the claim is that this variant suffers from high communication cost, which motivates a second variant where the averaging frequency is determined by a model divergence measure.
The algorithms are evaluated on ResNet-18 with CIFAR-10, ResNet-50 with CIFAR-100 for 120 epochs, and GPT-2 with Open WebText for 3000 iterations. Additionally, pre-trained LLaMA2 and GPT-2 are trained for an epoch on the Alpaca dataset using the Low-Rank Adaptation scheme. In all these experiments, SDC is simulated as white noise.
The submission also includes convergence discussion of the presented algorithms.
Strengths
The motivation of the approach is straightforward and relevant.
Weaknesses
The idea of this work is poorly conceived. In particular,
- It is unclear why the discussion did not cite any of the local SGD papers. For example, a comparison with "Don't Use Large Mini-Batches, Use Local SGD, Lin et al. 2018" will be a relevant approach, where for multiple first epochs, distributed SGD is applied, and after that, periodic averaging is performed.
- The periodic model averaging on top of gradient communication after every computation is an overkill. Can't the generated/simulated SDC vector be sent in the next round added to the gradient, much similar to the error feedback method? See "Elastic Consistency: A Practical Consistency Model for Distributed Stochastic Gradient Descent, Nadiradje et al. 2022". Once it is modeled with error feedback, it then automatically fits the Elastic Consistency framework for analysis.
- The simulated SDCs do not seem to be "capturing" it. Can the authors elaborate on some real-life cases of SDC that may be modeled as the standard N(0,1) distribution?
- The convergence results in the main body of the paper do not even state the convexity assumptions on the objective, which are mentioned only in the appendix. Thus, the result statements are incomplete, which distracts a reader even when reading the derivations.
- Today's Distributed SGD methods invariably include communication compression -- quantization, sparsification, etc. A full gradient communication approach is dated. See "Elastic Consistency: A Practical Consistency Model for Distributed Stochastic Gradient Descent, Nadiradje et al. 2022".
- Training ResNet18/50 models for 120 epochs on CIFAR 10/100 data is not standard. A more standard benchmark is training these models for 300 epochs. Is there any specific reason for using 120 epochs only? It looks more like training to subpar accuracy, an area where different training methods behave starkly differently but, after 200 or so epochs, are close to the known best results. Similarly, for other models.
Questions
- How about trying out model averaging after every local gradient update? Very likely, the motivated mitigation will be achieved. If not, can the authors clarify why so? After all, the above-suggested scheme involves much less communication than the proposed one.
- Can you please address other comments included above in the "Weakness" assessment?
Thank you for your time and valuable comments on our paper. We appreciate your recognition of the strengths of our work, particularly the straightforward and relevant motivation of our approach.
Q1. Related Work issues.
It is unclear why the discussion did not cite any of the local SGD papers. For example, a comparison with "Don't Use Large Mini-Batches, Use Local SGD, Lin et al. 2018" will be a relevant approach, where for multiple first epochs, distributed SGD is applied, and after that, periodic averaging is performed.
Ans for Q1: We have cited the local SGD papers [1,2,3,4] in the revision. While model averaging is indeed proposed in federated learning (FL) scenarios [2,3] and local SGD-based distributed training [1,4], our paper specifically addresses the silent data corruption problem in large-scale distributed training, which differs from FL and local SGD. Moreover, our setting is distributed SGD, which aggregates gradients at every step, rather than local SGD; the model averaging in our approach is designed to synchronize models to eliminate divergence.
Q2. Overkill design issues.
The periodic model averaging on top of gradient communication after every computation is an overkill. Can't the generated/simulated SDC vector be sent in the next round added to the gradient, much similar to the error feedback method? See "Elastic Consistency: A Practical Consistency Model for Distributed Stochastic Gradient Descent, Nadiradje et al. 2022". Once it is modeled with error feedback, it then automatically fits the Elastic Consistency framework for analysis.
Ans for Q2: Thank you for highlighting this related work. However, we note that in [4], there is a maximum inconsistency bound for various distributed algorithms, which significantly impacts convergence. In fault-tolerant distributed training, gradient inconsistency can lead to increasing model divergence during training. Therefore, periodic model averaging is essential for reducing this inconsistency bound, as evidenced by our experimental results.
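For concreteness, here is a minimal sketch of what periodic model averaging on top of standard per-step gradient aggregation looks like in PyTorch (the function and variable names, and the fixed interval `sync_every`, are illustrative placeholders rather than our actual implementation, which chooses the interval adaptively and overlaps the communication with back propagation):

```python
import torch
import torch.distributed as dist

def dsgd_step_with_periodic_averaging(model, optimizer, loss, step, sync_every=100):
    """One data-parallel step: per-step gradient aggregation plus occasional
    model-weight averaging to wash out divergence accumulated from corrupted
    gradient aggregation."""
    world_size = dist.get_world_size()

    optimizer.zero_grad()
    loss.backward()

    # Standard gradient aggregation (the step where silent corruption may occur).
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(world_size)
    optimizer.step()

    # Periodic weight averaging eliminates the accumulated model divergence.
    if step % sync_every == 0:
        with torch.no_grad():
            for p in model.parameters():
                dist.all_reduce(p.data, op=dist.ReduceOp.SUM)
                p.data.div_(world_size)
```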
Q3. Real-life SDCs.
The simulated SDCs do not seem to be "capturing" it. Can the authors elaborate on some real-life cases of SDC that may be modeled as the standard N(0,1) distribution?
Ans for Q3: We have provided examples of real-world SDC errors in Tables 4, 5, and 6 of our Appendix. We assume the noise follows a N(0,1) distribution for generality, as real-world SDC errors occur with a much lower probability. For larger gradient noise, workers can employ gradient clipping techniques to mitigate its effects.
Q4. convexity of objective function.
The convergence results in the main body of the paper do not even include the convexity nature of the objective, which is mentioned only in the appendix. Thus, the results statements still need to be completed. It distracts a reader even when reading the derivations.
Ans for Q4: We have revised our paper to include the convexity nature of the objective in the main body, ensuring that the results are clearly presented and accessible to readers.
Q5. Gradient compression.
Today's Distributed SGD methods invariably include communication compression -- quantization, sparsification, etc. A full gradient communication approach is dated. See "Elastic Consistency: A Practical Consistency Model for Distributed Stochastic Gradient Descent, Nadiradje et al. 2022".
Ans for Q5: While many gradient compression techniques have been proposed, many industrial large-scale training systems still utilize full-precision gradients or only 8-bit or 16-bit gradients. Additionally, gradient compression techniques are orthogonal to our proposed method: our analysis aims to understand gradient inconsistency, and it also applies when gradient compression is used.
Q6. Training settings.
Training ResNet18/50 models for 120 epochs on CIFAR 10/100 data is not standard. A more standard benchmark is training these models for 300 epochs. Is there any specific reason for using 120 epochs only? It looks more like training to subpar accuracy, an area where different training methods behave starkly differently but, after 200 or so epochs, are close to the known best results. Similarly, for other models.
Ans for Q6: Training ResNet-18/50 for 120 epochs achieves the same performance as training for 300 epochs with an appropriate learning-rate decay schedule. Besides, our focus is on evaluating the convergence behavior of different methods rather than maximizing model performance.
Q7. Trying model averaging.
How about trying out model averaging after every local gradient update? Very likely, the motivated mitigation will be achieved. If not, can the authors clarify why so? After all, the above-suggested scheme involves much less communication than the proposed one.
Ans for Q7: Thank you for this insightful question. The proposed 'model averaging after every local gradient update' refers to local SGD. While local SGD is a feasible way to mitigate gradient inconsistency, the following reasons explain why local SGD might not be a suitable choice and why we instead conduct model averaging on top of distributed SGD:
- Local SGD is not widely adopted in practical distributed training within data centers, which typically have high communication bandwidth; the additional communication overhead of distributed SGD is therefore relatively small.
- There is no guarantee that local SGD matches the model performance of DSGD. Current LLM training systems therefore still rely on distributed SGD/Adam, and switching the optimizer from DSGD/Adam to local SGD might introduce implicit convergence problems.
Q8. Question2.
Can you please address other comments included above in the "Weakness" assessment?
Ans for Q8: We have provided above responses to address all the comments included in the 'Weakness' assessment.
Reference
[1] Don't Use Large Mini-Batches, Use Local SGD, Lin et al. 2018
[2] McMahan, Brendan, et al. "Communication-efficient learning of deep networks from decentralized data." Artificial Intelligence and Statistics. PMLR, 2017. https://arxiv.org/pdf/1602.05629
[3] Advances and Open Problems in Federated Learning. 2019.
[4] Local SGD Converges Fast and Communicates Little. ICLR 2018.
Thank you for your response.
"There is a maximum inconsistency bound for various distributed algorithms, which significantly impacts convergence." This is not a substantiated argument. Please substantiate.
Please answer this question that I asked:
How about trying out model averaging after every local gradient update? Very likely, the motivated mitigation will be achieved. If not, can the authors clarify why so? After all, the above-suggested scheme involves much less communication than the proposed one.
For now, my score is unchanged.
Dear Reviewer #jNJ1,
Thanks for your timely reply and follow-up questions.
Follow-up Q1: Maximum inconsistency bound:
Answer for Follow-up Q1: Thank you for your follow-up question; let us clarify this in more detail. The maximum inconsistency bound refers to the difference in gradient views caused by the maximum delay [5]. In [5], the divergence of the gradients is assumed to be bounded. Additionally, the model parameters at different time steps on different workers may differ, but only by a bounded amount; without such a bound, the final model parameters across workers would diverge significantly. Here is why such an assumption is reasonable in some scenarios [5]:
- Model parameters are directly updated in asynchronous SGD: in this setting, the gradients are computed based on different model parameters at different steps, but after updating, each worker's parameters are brought in line with the most recent version, so the parameters remain consistent.
- Model parameters are synchronized in gradient-compressed SGD: while gradient compression can introduce divergence in gradient views, workers still update their parameters in the same direction, ensuring a consensus among the model parameters.
However, in synchronous SGD, communication noise during the gradient broadcast can lead to the continuous accumulation of divergence between different updating directions. Without a maximum inconsistency bound, we cannot prove the convergence of these distributed algorithms. Simply relying on error feedback is insufficient to address this problem. Workers need a way to keep their model parameters close to each other. This is why direct communication and parameter updates are necessary in algorithms like Asynchronous SGD [6] and local SGD [4].
Follow-up Q2: Model averaging after every local gradient update.
Answer for Follow-up Q2: We assume that the solution you're referring to is local SGD. In this approach, different workers directly update their model parameters based on their individual gradients, and then, after one or more steps, the workers perform model averaging (please correct us if we misunderstand your approach).
In this case, we agree that model averaging can achieve the motivated mitigation. However, there are two main weaknesses of local SGD in large-scale LLM training:
- There is no guarantee of similar model performance compared to distributed SGD when using local SGD. Thus, current LLM training systems still rely on distributed SGD/Adam for training [9]. Adapting local SGD to LLM training is a recent and emerging direction, particularly in geo-distributed settings with low-bandwidth conditions [10,11]. However, we would like to emphasize that our focus is on data-center scenarios, where distributed SGD/Adam is widely used.
- Modifying the optimizer from DSGD/Adam to local SGD may introduce implicit convergence problems. Specifically, distributed SGD/Adam ensures that workers have a consistent global gradient, estimated from all sampled batch data across all workers. This guarantees consistency in the momentum/preconditioner of SGD momentum/Adam, which aligns with large-batch optimization. In contrast, local SGD cannot achieve this consistency in the momentum/preconditioner.
- Data centers typically have sufficient communication bandwidth. The use of local SGD is primarily suited to low-bandwidth environments, such as federated learning, where there is a trade-off between model performance and communication costs [8]. In data centers with higher communication bandwidth, the communication cost of sending gradients after each update is manageable, and thus distributed SGD/Adam is preferred for better performance.
Please correct us if we misunderstand the new algorithm you are proposing.
Reference
[4] Local SGD Converges Fast and Communicates Little. ICLR 2018.
[5] Elastic Consistency: A Practical Consistency Model for Distributed Stochastic Gradient Descent, Nadiradje et al. 2022.
[6] Jeff Dean et al. Large Scale Distributed Deep Networks. In NeurIPS 2012.
[7] Sebastian U. Stich et al. Sparsified SGD with Memory. In NeurIPS 2018.
[8] Adaptive Communication Strategies to Achieve the Best Error-Runtime Trade-off in Local-Update SGD. In MLsys 2019.
[9] LLaMA: Open and Efficient Foundation Language Models. In Arxiv 2023.
[10] DiLoCo: Distributed Low-Communication Training of Language Models. Technical Report 2024.
[11] Exploring Scaling Laws for Local SGD in Large Language Model Training. In Arxiv 2024.
Dear Authors,
Thank you for your response. Assuming that every node starts from the same initial model state, in terms of optimization dynamics (trajectory of the model), local SGD with a single gradient update step followed by model averaging across the nodes is the same as distributed SGD where the model is updated with the average of the gradients. This will be better than letting the SDC errors accumulate on the clients and only averaging them out after several updates.
The proposed scheme can be substantially improved and better resubmitted to a subsequent venue.
Thanks for your suggestions.
We agree that for SGD, in terms of optimization dynamics (trajectory of the model), local SGD with a single gradient update step followed by model averaging across the nodes is the same as distributed SGD where the model is updated with the average of the gradients.
However, local SGD with a single gradient update step does not let workers correctly update the momentum in SGD or the preconditioner in Adam, which is a significant concern in optimization. Specifically, local SGD updates parameters before obtaining the global gradient (using local gradients); then workers average their model parameters, which is an unbiased estimate of the global model and matches distributed SGD. However, when there is a momentum term or a preconditioner as in Adam, we fail to estimate them accurately because workers never hold the global gradient in local SGD. One could use the difference between the newly averaged model and the old model parameters as a pseudo global gradient to update them, but this is an inaccurate estimate.
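As a sketch in our notation ($n$ workers, learning rate $\eta$, all workers starting from the same $x_t$; symbols illustrative): a single local step followed by averaging gives

$$
\frac{1}{n}\sum_{i=1}^{n}\bigl(x_t - \eta\, g_t^{(i)}\bigr) \;=\; x_t - \eta\,\bar g_t,
\qquad \bar g_t = \frac{1}{n}\sum_{i=1}^{n} g_t^{(i)},
$$

which is exactly one DSGD step with the averaged gradient. However, each worker's momentum buffer is updated with its local $g_t^{(i)}$ rather than $\bar g_t$, so the optimizer state diverges from that of DSGD with momentum or Adam.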
So this is not really a concern, given the existing explorations of global momentum schemes, for example, "SlowMo: Improving Communication-Efficient Distributed SGD with Slow Momentum, Wang et al.," etc. You want the averaged gradient to update the local momentum, preconditioner, etc.; for that, you can very well use the local gradient. Momentum and other augmented terms for the global updates are now quite standard and have shown promise over local-only momentum -- see the references in "Ordered Momentum for Asynchronous SGD, Shi et al. NeurIPS 2024".
The bottom line is that the simulated error in gradient updates on clients looks like a stretched problem to solve with the heavy machinery proposed in your work.
I encourage the authors to explore the developments in this area. It will be a worthy contribution to the community if you explore addressing potential SDCs in the popular works in this domain that update augmented tensors such as local and global momentum, precondition, and the like.
Best wishes.
Thanks for the provided references. We agree that these methods offer potentially useful solutions for updating the momentum and preconditioner within local or asynchronous SGD. Nevertheless, they are still inaccurate estimates of the original global momentum and preconditioner, which are updated using the globally averaged gradient. Moreover, real-world LLM training does not exploit these methods in optimization, because there is no guarantee of obtaining models as good as those optimized with the original SGD momentum or Adam. There are some initial trials in training LLMs using local SGD [10,11]. We agree that this would be a very promising direction to explore in future work, because it can reduce communication costs significantly.
We would like to emphasize that the major contribution of this work is identifying how SDC errors in gradient aggregation cause failed convergence, and finding that model averaging is a simple and effective method to mitigate it.
Thank you again for providing the thoughts about using model averaging to replace gradient averaging. We would consider it in future works.
Reference
[10] DiLoCo: Distributed Low-Communication Training of Language Models. Technical Report 2024.
[11] Exploring Scaling Laws for Local SGD in Large Language Model Training. In Arxiv 2024.
This paper is focused on the reliability problem of distributed training. Specifically, it aims to mitigate the statistical divergence during training caused by unreliable gradient aggregation, i.e., the silent data corruption problem. The method proposed by this paper is to periodically synchronize the model weights of workers. To reduce the extra overhead of model sync-up, the authors come up with a dynamic way to decide the sync-up frequency and overlap the communication with back propagation.
Strengths
- The research problem is quite important for large-scale distributed training. How to deal with the potential unreliability of data communication is crucial to model training.
- The idea of dynamically adjusting the sync-up frequency is interesting since the silent data corruption may happen at a random rate.
Weaknesses
- My biggest concern is that, if we assume the silent data corruption could randomly happen during the gradient aggregation, then it could also happen during the model synchronization. Therefore, how can we synchronize the model weights without any error?
- The novelty of this paper is also quite limited. From many previous studies, we already know that in data-parallel distributed training, if local models apply different local gradients for some steps and then average the model weights globally, the global model can still converge [1]. So the theoretical contribution of this paper is limited. Besides, overlapping the communication with the back propagation is also a widely used technique in both research papers and real-world systems like PyTorch.
Questions
Please refer to the weakness
Thank you for your time and insightful comments on our paper. We appreciate your recognition of the strengths of our work, particularly the importance of the research problem regarding unreliability in large-scale distributed training and the interesting idea of dynamically adjusting the synchronization frequency.
Q1. Further technical challenges.
If we assume the silent data corruption could randomly happen during the gradient aggregation, then it could also happen during the model synchronization. Therefore, how can we synchronize the model weights without any error?
Ans for Q1: We acknowledge your concern regarding the potential for silent data corruption during model synchronization. However, we would like to clarify that
- 1. Weight synchronization errors have less influence than gradient errors. Specifically, a one-time weight synchronization reduces the errors accumulated over H steps, whose variance grows on the order of $H\sigma^2$ (where $\sigma^2$ denotes the per-occurrence corruption variance), whereas the weight synchronization error occurs only once, contributing a variance on the order of $\sigma^2$, which is much smaller than the accumulated model divergence. In other words, a successful weight synchronization completely eliminates all previous divergence, and even a noisy weight synchronization still reduces the divergence, though not completely; without synchronization, the divergence keeps accumulating.
- 2. Using a fault-tolerant communication protocol for the weight synchronization. Additionally, we can implement an error-detection protocol on the communication channel used for weight synchronization, although this incurs higher cost and lower efficiency than current protocols (the high-speed protocols between GPU cards in data centers do not guarantee reliable data delivery, for efficiency reasons). Since weight synchronization occurs far less frequently than gradient aggregation, it can be effectively overlapped with back propagation.
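For point 2, here is a hypothetical sketch of what checksum-verified weight synchronization could look like in PyTorch (the hashing scheme, retry logic, and function name are our own illustration, not the protocol used in the paper):

```python
import hashlib
import torch
import torch.distributed as dist

def broadcast_with_checksum(tensor, src=0, max_retries=3):
    """Broadcast one weight tensor and verify it against a checksum computed
    on the source rank; all ranks retry together if any rank detects corruption.
    Because weight synchronization is infrequent, the extra checksum traffic is
    negligible compared to per-step gradient aggregation."""
    for _ in range(max_retries):
        dist.broadcast(tensor, src=src)

        # Every rank hashes what it received; the source's hash is the reference.
        digest = hashlib.sha256(tensor.detach().cpu().numpy().tobytes()).hexdigest()
        ref = [digest if dist.get_rank() == src else None]
        dist.broadcast_object_list(ref, src=src)

        # Agree globally on success so all ranks retry (or return) together.
        ok = torch.tensor([1.0 if digest == ref[0] else 0.0])
        dist.all_reduce(ok, op=dist.ReduceOp.MIN)
        if ok.item() == 1.0:
            return tensor
    raise RuntimeError("weight synchronization failed checksum verification")
```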
Q2. Novelty of model averaging.
From many previous studies, we already know that in data parallel distributed training, if local models apply different local gradients for some steps, and then average the model weights globally, the global model can still converge [1]. So the theoretical contribution of this paper is limited.
Ans for Q2: We appreciate your point regarding the novelty of model averaging. While it is true that model averaging has been explored in federated learning (FL) [1,2] and local SGD [3] scenarios, our work specifically addresses the silent data corruption problem in large-scale distributed training, which is distinct from FL and local SGD. We are the first to theoretically analyze the silent data corruption issue in gradient aggregation, and while model averaging has been proposed elsewhere, we uniquely demonstrate its feasibility as a solution to this problem. And using parameter synchronization to address this problem is simple and effective.
Q3. Novelty of overlapping communication with back propagation.
Overlapping the communication with the back propagation is also a widely used technique in both research papers and real-world systems like pytorch.
Ans for Q3: We understand your perspective on the novelty of overlapping communication with back propagation. While this technique is indeed widely used in research and practical systems like PyTorch, we emphasize that we are the first to apply it specifically to address the silent data corruption problem in large-scale distributed training. Our approach is tailored to the unique challenges posed by SDC errors, making our contribution significant in this area.
Reference
[1] McMahan, Brendan, et al. "Communication-efficient learning of deep networks from decentralized data." Artificial Intelligence and Statistics. PMLR, 2017. https://arxiv.org/pdf/1602.05629
[2] Advances and Open Problems in Federated Learning. 2019.
[3] Local SGD Converges Fast and Communicates Little. ICLR 2018.
Dear authors,
Thanks for your response. But I am afraid I am still not convinced. Regarding the novelty of model averaging and communication overlapping: if we represent the silent data corruption as a noise term, it is actually quite similar to the communication quantization works that also aim to deal with the biased or unbiased noise caused by quantization. Therefore, I would consider this a minor extension of a well-studied area.
Thanks for your follow-up questions. Yes, communication quantization can also be represented as a noise term. However, there is a significant difference between the noise introduced by quantization error and the noise introduced by the broadcast error. Specifically, the noised gradients in [1,2,3] are averaged across all workers; with gradient compression, workers then use a consistent gradient to update model parameters, which means the update directions across workers are the same.
However, with SDC errors in gradient broadcasting, the noise causes different workers to obtain different gradients. Thus, model parameters are updated in different directions, and the divergence between model parameters accumulates during training. Without model averaging, this accumulated divergence causes convergence to fail.
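To spell this out in our notation (symbols illustrative): with compression, every worker applies the same perturbed gradient $\tilde g_t$, so the replicas stay identical; with a broadcast SDC error, worker $i$ applies $\bar g_t + \xi_t^{(i)}$, and the gap between any two replicas evolves as

$$
x_{t+1}^{(i)} - x_{t+1}^{(j)} \;=\; \bigl(x_t^{(i)} - x_t^{(j)}\bigr) \;-\; \eta\bigl(\xi_t^{(i)} - \xi_t^{(j)}\bigr),
$$

a random walk whose variance grows with the number of steps unless the parameters are periodically averaged.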
We agree that model averaging itself is not a novel technique, especially in local SGD and federated learning. However, the major contribution of this work is identifying how SDC errors in gradient aggregation cause failed convergence, and finding that model averaging is a simple and effective method to mitigate it.
References:
[1] Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training. In ICLR 2018.
[2] TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning. In NeurIPS 2017.
[3] Sparsified SGD with Memory. In NeurIPS 2018.
The paper proposes that silent hardware errors are a potential cause of failed convergence in model training, and provides theoretical analysis and a system for mitigating these errors.
The reviewers agreed that the problem of silent errors is important to address. However, the paper did not adequately cite related literature on convergence under noisy gradients, and reviewers pointed out that the theoretical and methodological contributions may not be novel in light of the literature. One reviewer also pointed out that the full gradient communication approach is outdated in light of the increasing use of quantization and other compression methods in foundation model training. Reviewers also pointed out that the experiment scale (CIFAR-10) is limited.
Additional Comments from Reviewer Discussion
Reviewers pointed out that the theoretical and methodological contributions may not be novel in light of the literature. One reviewer also pointed out that the full gradient communication approach is outdated in light of the increasing use of quantization and other compression methods in foundation model training. Reviewers also pointed out that the experiment scale (CIFAR-10) is limited.
While the experiment scale issue was partly addressed with GPT-2 and LLaMa-2 experiments, the novelty concerns were not convincingly addressed.
Reject