PaperHub
Rating: 6.0 / 10 (Poster; 4 reviewers, min 5, max 7, std 0.7)
Individual ratings: 6, 6, 7, 5
Confidence: 3.5

TL;DR

Distributed training where only a subset of the outer gradients is communicated

Abstract

Keywords
distributed training, large-scale, llm

Reviews and Discussion

Review (Rating: 6)

Streaming DiLoCo improves on DiLoCo in a few aspects: model subset synchronization, overlapped computation and communication, and quantization of the outer gradients.
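For intuition, here is a toy sketch of the fragment-staggered synchronization idea (my own illustration, not the authors' code; the quadratic objective, the fragment schedule, and plain outer SGD are simplifying assumptions — the paper uses an outer Nesterov optimizer):

```python
import numpy as np

def streaming_sync_demo(n_workers=2, dim=8, n_frags=4, H=8, steps=64, lr=0.1):
    """Toy quadratic objective: each worker minimizes ||x||^2 / 2 locally,
    while parameter fragments synchronize on staggered schedules."""
    rng = np.random.default_rng(0)
    global_params = rng.normal(size=dim)
    local = [global_params.copy() for _ in range(n_workers)]
    snap = [global_params.copy() for _ in range(n_workers)]  # value at last sync
    frags = np.array_split(np.arange(dim), n_frags)
    for t in range(1, steps + 1):
        for w in range(n_workers):
            local[w] -= lr * local[w]  # inner step: gradient of ||x||^2 / 2 is x
        for k, idx in enumerate(frags):
            # Fragment k syncs every H steps, offset by k * H / n_frags, so at
            # most one fragment is communicated at a time (lower peak bandwidth).
            if t % H == (k * H // n_frags) % H:
                outer_grad = np.mean(
                    [snap[w][idx] - local[w][idx] for w in range(n_workers)],
                    axis=0)
                global_params[idx] -= outer_grad  # outer SGD with outer lr = 1
                for w in range(n_workers):
                    local[w][idx] = global_params[idx]
                    snap[w][idx] = global_params[idx]
    return global_params

print(streaming_sync_demo())  # converges toward the optimum at zero
```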

Reasons to Accept

  • Achieves performance similar to DiLoCo on multiple eval benchmarks, while reducing network bandwidth usage by orders of magnitude and improving GPU utilization through overlapping techniques.
  • A reasonable simulation of the estimated gains for model training at different scales. Although the authors do not have the resources for such runs, it helps clarify how much gain is expected and how to analyze the training time.

Reasons to Reject

  • The set of model architectures is limited. It remains to be seen whether such techniques work across other LLM architecture variants such as MoE models. Both model size and architecture may affect the scale of the gradients during training, so the quantization technique may require further experimentation to show its effectiveness.
  • It is unclear how sensitive the hyperparameter selection is when Streaming DiLoCo is used. Ideally it should be a robust algorithm that does not need much extra tuning.

Questions to Authors

Comment

Thank you for your review.

  1. Indeed, we only experimented on a Chinchilla-style decoder. Other publications such as OpenDiLoCo have also reproduced DiLoCo (not streaming) on a Llama architecture. However, there is no reason to believe it won't work on MoE or other transformer-based designs, as model merging methods are usually compatible across them too.
  2. We ran the new Streaming DiLoCo with the exact same hyperparameters as the original DiLoCo and found them to transfer well out of the box, showing the robustness of our method. This gives us confidence that we can also reuse the scaling laws found in separate work of ours: https://arxiv.org/abs/2503.09799
Review (Rating: 6)

This paper proposes "Streaming DiLoCo," an enhanced distributed training algorithm for LLMs building upon DiLoCo/FedOpt. The goal is to reduce the high peak and total bandwidth requirements and worker blocking associated with synchronization in distributed training, especially between non-co-located workers. The authors introduce three main improvements: (1) synchronizing only subsets (fragments) of parameters sequentially instead of all at once, (2) overlapping the communication of these fragments with ongoing worker computation, and (3) applying low-precision quantization (4-bit E3M0) to the communicated gradients (outer gradients/model deltas). Through simulations and scaling experiments up to 4B parameters, they demonstrate that this approach can maintain comparable model quality to Data Parallelism and standard DiLoCo while drastically reducing required bandwidth (by up to two orders of magnitude) and improving compute utilization under bandwidth constraints.
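As a side note on the quantization component, here is a minimal sketch of what 4-bit E3M0 rounding looks like (1 sign + 3 exponent + 0 mantissa bits, i.e., signed powers of two). The exponent range below and the absence of a per-tensor scale are my assumptions; the review summary does not pin these details down:

```python
import jax.numpy as jnp

def quantize_e3m0(x, e_min=-6, e_max=1):
    # Round each entry to the nearest power of two (in log space). With 3
    # exponent bits there are 8 exponent values; e_min/e_max encode an assumed
    # bias. Real implementations typically also apply a per-tensor scale.
    sign = jnp.sign(x)
    mag = jnp.maximum(jnp.abs(x), 2.0 ** (e_min - 1))  # avoid log2(0)
    e = jnp.clip(jnp.round(jnp.log2(mag)), e_min, e_max)
    return sign * 2.0 ** e
```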

Reasons to Accept

  • Addresses a critical and practical challenge in scaling LLM training: the high cost and complexity of communication infrastructure.
  • Proposes a synergistic combination of three well-motivated techniques (parameter streaming, overlap, quantization) to tackle the problem.
  • Provides both simulation results for compute utilization analysis under varying bandwidths and empirical LLM training results to validate model quality.
  • Achieves significant claimed reductions in bandwidth requirements (peak and total) while maintaining performance, potentially enabling training across less ideal network conditions.
  • Builds logically on recent work (DiLoCo, FedOpt) and offers clear advantages over the baseline DiLoCo.

Reasons to Reject

  • The combination of streaming, overlapping, and quantization introduces significant system complexity; the practical implementation overheads and tuning difficulties might be understated.
  • Sensitivity to hyperparameters like H (sync frequency), τ (overlap steps), fragment size/pattern, and outer learning rate seems potentially high and may require careful tuning per setup.
  • While comparisons are made to Data Parallelism and DiLoCo, a broader comparison to other advanced gradient compression or communication scheduling techniques could strengthen the claims.
  • The compute utilization results rely on simulation, which makes simplifying assumptions (acknowledged by authors) and may not perfectly reflect real-world network dynamics and system bottlenecks.
  • Lacks citation of and comparison with similar works such as CO2 and SlowMo.

Questions to Authors

  • Could you elaborate on the practical implementation challenges and potential system overheads (e.g., scheduling complexity, memory management for buffering overlapped gradients) introduced by combining all three proposed techniques?
  • How sensitive is the final model quality to the choice of quantization format (E3M0)? Were other low-bit formats explored, and how did they perform?
  • Figure 5 suggests robustness to overlap (τ) up to 10 steps. How does this interact with H? Is there an optimal ratio, and does the optimal τ change significantly with model scale or fragment size?
  • The term "distributed free lunch" is evocative but potentially strong. Could you clarify the trade-offs? Is the "cost" primarily in implementation complexity and hyperparameter tuning rather than pure ML performance or bandwidth?
Comment

Thank you for your review.

A. About Reasons to Reject

1a. Our three components (streaming, overlap, quantization) do add complexity in some sense, but they can be used independently of each other, and each brings significant benefits.

2a. The overlapping step (τ) is fixed to 1-5. The fragment structure is rather robust and does not need to be adapted as the model gets larger. H and the outer learning rate do require some tuning, but we found the best hyperparameters to be the same as in the original DiLoCo work.

3a. Hardware numbers can sometimes be misleading due to differences in implementation quality, but we agree that they would strengthen our work.

4a. Thanks for suggesting CO2; it is a relevant related work and we will add it. While CO2 also performs overlapping, it is limited compared to our proposed overlapping method: they work on parameters that are stale w.r.t. the gradients, while we don't. This is critical for best performance, particularly at a large number of overlapping steps: we show negligible loss of performance under ~10 steps while they only perform 1 step. Moreover, overlapping a subset of the parameters as we do, instead of the whole set as they do, further improves resiliency to overlapping.


B. Questions To Authors:

1b. We work in Jax and rely heavily on its compiler to properly schedule all operations. The overlapping is a bit trickier, because if you compile a whole step you cannot easily "communicate" information from one step to the next. One way to do it is to send the communication to the CPU on another thread, running another Jax program. Also, because of Jax's rigidity, it can be simpler to package the outer gradients into fixed-size buffers, meaning that the communication of the embeddings, in its own fragment, needs to be padded with zeros. Host offloading is relatively easy with https://docs.jax.dev/en/latest/notebooks/host-offloading.html
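A rough sketch of the fixed-size buffer idea described above (illustrative only: `pack_fragment` and the buffer size are hypothetical, and the memory-kind usage follows the linked host-offloading docs, so treat it as an assumption rather than the authors' actual pipeline):

```python
import jax
import jax.numpy as jnp

def pack_fragment(outer_grad: jax.Array, buffer_size: int) -> jax.Array:
    # Flatten a fragment's outer gradient and zero-pad it so that every
    # fragment (including the embeddings) fills the same fixed-size buffer,
    # keeping the compiled communication program shape-stable.
    flat = outer_grad.reshape(-1)
    return jnp.pad(flat, (0, buffer_size - flat.shape[0]))

# Offload the packed buffer to pinned host memory; a separate thread running
# another Jax program can then communicate it while inner steps continue.
host = jax.sharding.SingleDeviceSharding(jax.devices()[0],
                                         memory_kind="pinned_host")
buffer = jax.device_put(pack_fragment(jnp.ones((1024, 512)), 1 << 20), host)
```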

2b. We didn't explore other quantization schemes, and leave it for future work.

3b. We found H and τ to be orthogonal. The optimal τ is hard to define as it is a tradeoff between compute utilization and ML performance, but it is stable across scales and fragment compositions.

4b. The main drawback of our method is of course the additional complexity over data parallelism, which is extremely simple. However, we show similar ML performance on both IID and OOD downstream evals, and superior compute utilization. The hyperparameter tuning of Streaming DiLoCo is similar to DiLoCo's and, honestly, not very difficult, as we found the best hyperparameters to be robust across model scales and datasets.

Comment

Thanks for the responses. The difference between the proposed method and existing methods like CO2 should at least be discussed in the paper, if a direct comparison is not feasible. A paragraph of discussion on this would be good.

Comment

We fully agree that CO2 should be discussed, and ideally also compared in the paper. We'll include this baseline in the final paper.

Comment

I will raise the score to 6.

Review (Rating: 7)

DiLoCo is a distributed training algorithm inspired by federated learning that reduces communication overhead and relaxes all-to-all high-bandwidth connectivity requirements. It enables training across isolated device clusters ("islands") by performing local updates on data shards and infrequent inter-island synchronization via an outer optimizer.

Streaming DiLoCo further optimizes communication efficiency by reducing peak bandwidth demands, consequently minimizing latency. It achieves this by asynchronously synchronizing only subsets of model parameters, rather than the entire set at once, effectively overlapping computation with communication. Additionally, outer gradients are quantized to further minimize data transfer.

The optimizations used are reminiscent of asynchronous pipeline + data-parallel training methods and gradient compression algorithms, applied here in the new context of DiLoCo.

Evaluation shows that peak bandwidth demand is reduced by orders of magnitude without loss in quality of ML models.
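As a rough sanity check of that claim (my own back-of-the-envelope numbers, not the paper's): splitting the model into $k$ fragments so that only one is in flight at a time cuts the peak payload by $k$, and sending 4-bit instead of 32-bit outer gradients cuts it by another $8\times$:

\[
\frac{\text{peak}_{\text{DiLoCo}}}{\text{peak}_{\text{Streaming}}}
= \frac{|\theta| \cdot 32}{(|\theta|/k) \cdot 4} = 8k,
\]

so around $k \approx 12$ fragments already yields the roughly $100\times$ (two orders of magnitude) peak-bandwidth reduction reported.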

Reasons to Accept

  • Paper is well written.
  • Strong ablation studies.
  • Discussion on strided fragment patterns for partial updates is novel and interesting.

Reasons to Reject

  • The proposed optimizations seem largely derivative of prior work on asynchronous data-parallel training and gradient compression.
  • Evaluation with pure data parallelism (instead of pipeline + data parallel, with and without DiLoCo) is unrealistic, especially for large (>~1B parameter) models, and doesn't give an accurate picture of the potential gains of Streaming DiLoCo.
  • Comparison of wall-clock time for DiLoCo and Streaming DiLoCo would have been useful. Given that DiLoCo already achieves a 500x reduction in communication by synchronizing outer gradients only once every H=500 steps, it is difficult to appreciate the impact of Streaming DiLoCo.
  • The paper lacks a clear explanation for why model quality does not degrade, despite the reduced synchronization and additional approximations.

Nits:

  • Please improve the quality of the figures, especially Figures 2 and 3.

Questions to Authors

  • Evaluation of training efficiency with only a data-parallel approach is unrealistic. With large models (billions, or even hundreds of millions, of parameters), one almost never uses pure data parallelism; pipeline-parallel training is very common. From section 3.1, Streaming DiLoCo "reduces the latency by splitting the communication of outer gradients across fragments". This is, in fact, what pipeline parallelism achieves: parameters are split across partitions, and each data-parallel set needs to synchronize only a subset of gradients. How do compute utilization, bandwidth requirements, and latency for training models with Streaming DiLoCo compare against pipeline + data-parallel approaches with and without DiLoCo?
  • DiLoCo already claims to reduce communication overhead by 500x without loss in model accuracy. Since outer gradients are only synchronized once every 500 steps, what is the impact of proposed optimizations on compute utilization and training latency (in comparison to DiLoCo)?
  • The proposed approach is similar in essence to asynchronous data-parallel training. Such approaches are known to degrade model quality in large language models, which is why they have not been adopted in mainstream ML development. What makes the Streaming DiLoCo techniques different from these approaches? Would the same observations translate to models other than those shown in the paper?

Nits:

  • How was the latency of data transfer calculated in Figure 2? I.e., what bandwidth was provisioned?
Comment

Thank you for your review.

A. About the reasons to reject:

1a. While there is related prior work, we present a novel view of how to express these techniques and show, with careful execution, that they can work at large scale. Moreover, our overlapping method differs from what the literature proposes and composes nicely with our streaming synchronization.

2a. Data parallelism is still very competitive at large scale, mainly because it doesn't suffer from pipelining's bubble time and doesn't require an extremely large batch size to reduce bandwidth. However, we will consider a pipelined model in our simulator (Figure 3) for completeness.

3a. While DiLoCo can show competitive results at H=500, it usually requires a lower H (e.g., H=100) to match a DDP baseline, as shown in the OpenDiLoCo paper. In the appendix we ablate Streaming DiLoCo vs. DiLoCo as H grows, and show the former performs comparably if not better, while additionally reducing peak bandwidth and overlapping communication.

4a. We believe the reduced synchronization is possible because partial syncs spread information across the network more frequently. A more theoretical explanation of why it works would be useful; here we only show that it works.


B. About Questions To Authors:

1b. See 2a.

2b. Across all our experiments and most of our ablations we compare Streaming DiLoCo vs. DiLoCo at equal values of H. In that setting the former wins in terms of compute utilization thanks to our overlapping method, and in terms of loss we show equal or better performance. Ablation Figure 7 shows the performance change from DiLoCo to Streaming DiLoCo (without overlap), and ablation Figure 5 shows the performance change from no overlap to overlap. What else would you like to see?

3b. Async DP uses, to some extent, stale gradients, which results in poor performance, particularly when the number of stale async steps is large. In our case, despite performing a kind of async communication, the optimization is never stale, and the communication is in parameter space, which tolerates staleness even better.

4b. Figure 2 is only there for illustrative purposes. Refer to Figure 3 for various levels of provisioned bandwidth.

Comment

Thank you for your response!

2b. I am particularly interested in end-to-end wall-clock times for training the models to similar accuracy (with different bandwidth profiles). Even an estimate, based on the average time to train H steps using data parallelism, DiLoCo, and Streaming DiLoCo, would be very helpful. Readers will be able to appreciate the benefits of Streaming DiLoCo better if you can quantify the GPU-hours saved; this is unclear from compute utilization alone.

2a. I would greatly appreciate it if you could share simulator results (Figure 3) that include a comparison with a pipeline-parallel + DiLoCo approach.

Comment

Due to limited time, we can only answer point 2b during this rebuttal period:

For a 1B model spending 0.1 s per step over 186,250 steps, the estimated training time under a given bandwidth:

  • 1 GiB/s: DP 21 days; DiLoCo 12 h; Streaming DiLoCo (overlapped FP4 comm.) 6 h
  • 10 GiB/s: DP 2 days; DiLoCo 6 h; Streaming DiLoCo (overlapped FP4 comm.) 5 h
  • 100 GiB/s: DP 10 h; DiLoCo 5 h; Streaming DiLoCo (overlapped FP4 comm.) 5 h
  • 1000 GiB/s: DP 5 h; DiLoCo 5 h; Streaming DiLoCo (overlapped FP4 comm.) 5 h
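A simplified cost model roughly reproduces these orders of magnitude (my own sketch, not the authors' simulator; it assumes fp32 gradients, H=100, full overlap, and ignores all-reduce topology factors, so the exact figures differ):

```python
# Total time ≈ compute + non-hidden communication; payload sizes are assumed.
GiB = 2 ** 30

def hours(steps=186_250, step_s=0.1, params=1e9, bw_gib_s=1.0,
          bits=32, sync_every=1, overlapped=False):
    payload_bytes = params * bits / 8            # bytes exchanged per sync
    comm_s = payload_bytes / (bw_gib_s * GiB)    # time per sync at this bandwidth
    visible = 0.0 if overlapped else comm_s      # overlap hides comm behind compute
    total_s = steps * step_s + (steps / sync_every) * visible
    return total_s / 3600

for bw in (1, 10, 100, 1000):
    dp = hours(bw_gib_s=bw)
    diloco = hours(bw_gib_s=bw, sync_every=100)
    streaming = hours(bw_gib_s=bw, sync_every=100, bits=4, overlapped=True)
    print(f"{bw:>5} GiB/s | DP {dp:7.1f} h | DiLoCo {diloco:6.1f} h"
          f" | Streaming {streaming:5.1f} h")
```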

We acknowledge your remark 2a and will provide that comparison in the final paper.

Review (Rating: 5)

This paper presents Streaming DiLoCo, a novel enhancement of the DiLoCo framework for distributed training of large language models (LLMs). It addresses key bottlenecks in cross-worker communication by introducing three main contributions: (1) Streaming synchronization of parameter subsets instead of full models, (2) Overlapping communication and computation, and (3) Quantization of exchanged gradients to four-bit representations. These techniques collectively reduce peak bandwidth, wall-clock training time, and overall communication cost, while maintaining training quality. Experimental evaluations, including scaling laws and overtraining on large datasets, demonstrate that Streaming DiLoCo achieves comparable performance to standard data-parallel training with significantly less bandwidth.

Reasons to Accept

Comprehensive experiments: The paper includes scaling studies, ablations, bandwidth simulations, and performance on downstream tasks such as HellaSwag, PiQA, and ARC-Easy.

Strong empirical results: The proposed method demonstrates a 400× reduction in communication cost without sacrificing accuracy, making it highly relevant for scenarios with constrained interconnect bandwidth.

Solid engineering implications: The method is compatible with existing frameworks like DiLoCo, FSDP, and federated setups, making adoption easier in practice.

Reasons to Reject

  1. Comparison with CO2 (Communication-Computation Overlap): While the paper makes a strong case for overlapping communication and computation, it omits discussion of "CO2: Efficient Distributed Training with Full Communication-Computation Overlap" (ICLR 2024). A direct comparison, or at least a discussion, would significantly strengthen the paper's positioning and highlight its specific advantages.

  2. Bandwidth and Hardware Profiling: While the paper includes simulation-based compute utilization plots, real-world hardware experiments (e.g., GPU/TPU wall-clock comparisons) would make the results more tangible. I would suggest including latency and throughput numbers measured on actual hardware, especially under variable bandwidth settings, to validate the simulated results.

  3. Impact on convergence in highly asynchronous regimes: The experiments primarily use 2 replicas. It's unclear how the method scales with increasing asynchrony or heterogeneity. I would suggest providing additional experiments or discussion on how Streaming DiLoCo behaves with larger numbers of non-synchronous workers, or when using heterogeneous devices.

Comment

Thank you for your review.

  1. Thanks for suggesting CO2; it is a relevant related work and we will add it. While CO2 also performs overlapping, it is limited compared to our proposed overlapping method: they work on parameters that are stale w.r.t. the gradients, while we don't. This is critical for best performance, particularly at a large number of overlapping steps: we show negligible loss of performance under ~10 steps while they only perform 1 step. Moreover, overlapping a subset of the parameters as we do, instead of the whole set as they do, further improves resiliency to overlapping.
  2. Hardware numbers can sometimes be misleading due to differences in implementation quality, but we agree that they would strengthen our work.
  3. Our overlapping method is a kind of "async", here "async w.r.t. the step". Comparing with heterogeneous devices would be "async w.r.t. the other replicas", which requires other methods (such as https://arxiv.org/abs/2401.09135v1) and was not the focus of this work. It would, however, be interesting to see how to combine both methods.
Comment

I am not pleased with the response I received. I anticipated a comparison of the experimental results. None of the results I requested were included. Therefore, I want to lower my rating to 5.

Comment

We're sorry that you're not pleased with our answer, but please note:

We answered your point 1) and will provide a discussion.

About point 2): We provided to reviewer Aw6r an estimate of the wall-clock time under different bandwidth regimes. We will provide real hardware results in the final paper; we lacked the time/compute to produce them right now.

About point 3): You are asking about a completely different setting (async between replicas, heterogeneous hardware), which was explicitly out of scope for this paper and would moreover require extensive infrastructure changes on our side; it is unrealistic to expect this during a rebuttal.

Final Decision

This paper presents Streaming DiLoCo, an enhancement to the DiLoCo distributed training framework that reduces communication overhead through three key modifications: streaming parameter synchronization, computation-communication overlap, and gradient quantization. The work addresses a critical practical challenge in decentralized LLM training by dramatically reducing peak bandwidth requirements (up to 100×) while maintaining model quality comparable to standard approaches. The proposed method represents a well-engineered combination of existing techniques applied thoughtfully to the DiLoCo context, with strong empirical validation showing significant communication cost reductions without accuracy degradation. The comprehensive experimental evaluation includes scaling studies, ablation analyses, and downstream task performance assessment, demonstrating the practical value of the approach for bandwidth-constrained distributed training scenarios.

Pros: Simple improvements on DiLoCo that work well; addresses critical bandwidth bottlenecks in distributed training; achieves substantial communication cost reductions (up to 400×) while preserving model quality; comprehensive experimental evaluation with strong ablation studies.

Cons: Evaluation limited to relatively small models (≤4B parameters) and primarily 2-worker setups; lacks comparison with related communication-optimization methods. The work could be improved by analyzing scalability to larger numbers of asynchronous workers and detailing the practical implementation.