PaperHub
Overall: 7.8/10
Poster · 4 reviewers
Ratings: 5, 5, 4, 5 (min 4, max 5, std 0.4)
Confidence: 3.5
Novelty: 2.8 · Quality: 3.3 · Clarity: 3.0 · Significance: 3.0
NeurIPS 2025

Understanding outer learning rates in Local SGD

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29

Abstract

Modern machine learning often requires training with large batch size, distributed data, and massively parallel compute hardware (like mobile and other edge devices or distributed data centers). Communication becomes a major bottleneck in such settings, but methods like Local Stochastic Gradient Descent (Local SGD) show great promise in reducing the global communication need. Local SGD consists of three parts: a local optimization process, an aggregation mechanism, and an outer optimizer that uses the aggregated updates from the nodes to produce a new model. While there exists an extensive literature on understanding the impact of hyperparameters in the local optimization process, the choice of outer optimizer and its hyperparameters is less clear. We study the role of the outer learning rate in Local SGD, and prove new convergence guarantees for the algorithm. In particular, we show that tuning the outer learning rate allows us to (a) trade off between optimization error and stochastic gradient noise variance, and (b) make up for ill-tuning of the inner learning rate. Our theory suggests that the outer learning rate should sometimes be set to values greater than $1$. We extend our results to the case where momentum is used in the outer optimizer, and also introduce a novel data-dependent analysis of Local SGD that yields further insights on outer learning rate tuning. We conduct comprehensive experiments with standard language models and various outer optimizers to validate our theory.
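To make the algorithm structure described above concrete, here is a minimal sketch of generalized Local SGD with an explicit outer learning rate. The quadratic test objective, the gradient oracle `grad_fn`, and all hyperparameter values are illustrative assumptions for this sketch, not the paper's exact setup.

```python
import numpy as np

def generalized_local_sgd(grad_fn, x0, M=4, R=20, H=50, eta=0.1, gamma=1.5, seed=0):
    """Sketch of Local SGD with inner learning rate eta and outer learning rate gamma.

    grad_fn(x, rng) returns a stochastic gradient at x. Each round, M workers run
    H local SGD steps from the current global model; the averaged update is then
    applied with the outer learning rate gamma (gamma = 1 recovers plain averaging).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(R):                              # R communication rounds
        local_models = []
        for _ in range(M):                          # M workers (sequential here for clarity)
            z = x.copy()
            for _ in range(H):                      # H local SGD steps
                z -= eta * grad_fn(z, rng)
            local_models.append(z)
        delta = np.mean(local_models, axis=0) - x   # aggregated "outer gradient"
        x = x + gamma * delta                       # outer SGD step
    return x

# Toy usage on a noisy quadratic f(x) = 0.5 * ||x||^2.
noisy_grad = lambda x, rng: x + 0.1 * rng.standard_normal(x.shape)
print(generalized_local_sgd(noisy_grad, np.full(5, 10.0)))
```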
Keywords

communication-efficient optimization, local sgd, stochastic optimization, distributed optimization

Reviews and Discussion

Review (Rating: 5)

This work studies the impact of outer learning rate in Local SGD through a theoretical lens of convergence bounds. The theorems show the inverse relation between optimization error and gradient variance with different regimes of outer learning rates. The analysis is also extended to the momentum-based outer updates.

Strengths and Weaknesses

Strengths

a. The work is well-motivated and the manuscript is well-written. The objective of the paper and the theoretical treatment were easy to follow.

b. The authors have done a good job of building the related work section; it would be extremely helpful for beginners in this area.

c. The theorems are followed by insightful discussion which captures the key points well.


Weaknesses

a. The theoretical analysis is limited to convex objective functions.

b. There's a lack of clarity in the empirical results in terms of the relation between local and global learning rates. E.g., for Section 4.1, is the local learning rate $\eta$ supposed to be in the "higher" or "lower" range according to the upper bounds set through Theorem 3.3? Some more questions are detailed below.

Questions

a. What changes would be needed for a non-convex analysis of outer learning rates?

b. Are there empirical results on end perplexity for a grid of local and global learning rates? I wanted to see how the relation between the local and the global learning rates (of Theorem 3.3) translates to convex and non-convex objectives in practice.

c. For clarification, does Theorem 3.3 imply that a larger $\eta$ (local learning rate) might result in an accumulated gradient with large variance, which needs to be counterbalanced by a smaller $\gamma$? Would we achieve the same effect with a larger $H$ (number of local steps)?

d. A follow-up to the above question: under what setting of local and global learning rates can we minimize the FLOPs?

e. For Figure 9 in the Appendix, why are we seeing high gradient similarity near the end of the training for the left plots and why are we seeing low gradient similarity for the right plots? Furthermore, line 283 states "... noise-dominated regime, which we may expect to observe towards the end of the training process." Why is that so?

f. For Figure 1b, why does the loss start high for the low-variance setting of $\sigma = 0.001$?

g. For Table 1, SF-SGD case, why are outer learning rates with magnitudes >2 detrimental?

Limitations

Yes

Final Justification

The authors have addressed my questions about non-convex analysis satisfactorily and clarified the empirical results, hence I would like to keep my score at "Accept".

Formatting Issues

None

Author Response

Thank you so much for your review and positive evaluation of our work.

  1. (Non-convex analysis) Our theorem mainly relies on two lemmas, Lemma B.6 and Lemma B.7; Lemma B.6 holds regardless of convexity, but Lemma B.7 does not. We would need to relax Lemma B.7 to handle non-convex objectives, but this lemma in turn relies on the fact that gradient descent is contractive (Lemma B.1), which is not true for arbitrary non-convex functions. We believe it is possible to use similar tools as in [10, Theorem 4.1], but this requires adding the bounded-gradients assumption and using a different approach for the theorem derivation that starts with the smoothness inequality. Alternatively, we can use the approach of [12] of bounding this sequence by a sum of gradients at the cost of an extra $H$ factor in the Local SGD variance term.

  2. (LR grids) The local learning rate in our experiments is always fixed to $\frac{\eta_{\mathrm{baseline}}}{\sqrt{M}}$, where $\eta_{\mathrm{baseline}}$ is the local learning rate tuned on the minibatch SGD baseline for each model scale and $M$ is the number of nodes (see Table 1 in the main paper). We only varied the outer learning rate otherwise, since what matters for our analysis is the ratio of the outer to the local learning rate. We have included a table with all the experiments we've conducted varying the outer learning rate in our response to Reviewer advP.

  3. (Implications of Thm. 3.3) Theorem 3.3 shows there are two sources of noise for Local SGD: the second term in equation (3) scales with $\eta$ and $\max(0, \gamma-1)$ but not $H$, while the third term scales with $\eta^2$ and also with $H$. In addition, we also have to satisfy the constraint $\eta L (1 + (\gamma-1)_+ H) \leq 1$. For tuning the outer learning rate, if either $\eta$ or $H$ is larger, we are forced to use a smaller $\gamma$ in the same way. However, the convergence of Local SGD will suffer more under a larger $\eta$ than a larger $H$, all else being equal, because a larger $\eta$ increases both noise terms while a larger $H$ only increases the latter.
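To make the constraint above concrete, here is a small illustrative calculation of the largest admissible $\gamma$ implied by $\eta L (1 + (\gamma-1)_+ H) \leq 1$, assuming $L$ denotes the smoothness constant in the bound. The numeric values of $\eta$, $L$, and $H$ are arbitrary choices, used only to show how a larger inner learning rate or more local steps shrink the room for $\gamma > 1$.

```python
def max_outer_lr(eta, L, H):
    """Largest gamma satisfying eta * L * (1 + max(gamma - 1, 0) * H) <= 1.

    For gamma <= 1 the constraint reduces to eta * L <= 1, so if that fails
    there is no admissible gamma at all; otherwise gamma may exceed 1 by
    (1 / (eta * L) - 1) / H.
    """
    if eta * L > 1:
        return None
    return 1.0 + (1.0 / (eta * L) - 1.0) / H

L = 1.0  # smoothness constant (illustrative)
for eta in (0.01, 0.05, 0.1):
    for H in (10, 50, 200):
        print(f"eta={eta:<5} H={H:<4} gamma_max={max_outer_lr(eta, L, H):.2f}")
```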

  4. (Minimizing FLOPs) To minimize iteration complexity given a fixed number of clients $M$ and local steps $H$, we ought to choose the outer learning rate to make up for a small inner learning rate and to vary with the number of local steps. Depending on the magnitude of the gradient variance, choosing an outer learning rate larger than 1 can also be useful. We found that outer learning rates can be tuned at a smaller scale and transfer as-is to larger models. For example, with the 150M model at $H=50$, $M=4$, vanilla Local SGD achieves a perplexity of 17.75, while Nesterov with lr=0.7 reaches 17.25 and SF-SGD with lr=2.0 achieves 16.88. These represent improvements of 2.8% and 4.9% respectively compared to the baseline with no outer learning rate. The performance gains transfer and become even more pronounced at larger model scales: for the 1B model, vanilla SGD achieves 13.67 perplexity while Nesterov (lr=0.7) and SF-SGD (lr=2.0) reach 12.51 and 12.40 respectively, representing improvements of 8.5% and 9.3%. Of course, in practice, we control the number of clients $M$ and also have to consider FLOP efficiency from a systems perspective; e.g., normally increasing $M$ comes at the cost of decreasing the local batch size or, alternatively, increasing $\sigma^2$. This complicates the trade-off; Figure 4 shows the result of varying the number of clients/replicas in practice on FLOP efficiency. A study of FLOP efficiency would have to take into account the specific hardware we have available, and what the minimum batch size needed to saturate the GPU computation is.

  5. “For Figure 9 in the Appendix, why are we seeing high gradient similarity near the end of the training for the left plots and why are we seeing low gradient similarity for the right plots?” For all of them except SF-SGD in Figure 9 (c), we do observe low similarity towards the end of training (lower than 0.1). For SF-SGD in Figure 9 (c), we do observe a spike in similarity towards the end but are not sure why this is the case.

  6. “Furthermore, line 283 states "... noise-dominated regime, which we may expect to observe towards the end of the training process." Why is that so?” This is an empirical observation made in prior work [11], and the intuition is that towards the end of training the gradient norms are smaller so the relative magnitude of the noise is higher compared to the gradient norms.

  7. “f. For Figure 1b, why does the loss start high for the low-variance setting of $\sigma = 0.001$?” All of the losses are reported after one communication round; the figure should start at r=1, not at r=0. They would all start at the same value if we started at r=0. Since we cannot upload PDFs or figures, here are the trajectories in tabular form:

Figure 1 (b) tabular

| $\sigma$ | Round 1 | Round 2 | Round 3 | Round 4 | Round 5 |
|---|---|---|---|---|---|
| 0.001 | 1.421e+07 | 3.965e-02 | 3.050e-03 | 7.399e-04 | 5.203e-04 |
| 0.01 | 1.421e+07 | 6.859e-02 | 6.751e-03 | 1.370e-03 | 7.368e-04 |
| 0.1 | 1.421e+07 | 1.919e-01 | 2.858e-02 | 1.052e-02 | 6.916e-03 |
| 0.5 | 1.421e+07 | 1.051e+00 | 2.031e-01 | 1.059e-01 | 7.660e-02 |
| 1.0 | 1.421e+07 | 1.220e+00 | 3.663e-01 | 2.674e-01 | 2.948e-01 |
| 10.0 | 1.421e+07 | 2.333e+01 | 8.727e+00 | 6.208e+00 | 5.305e+00 |
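For reference, here is a rough simulation sketch that reproduces the qualitative pattern in the table above (fast decay of the loss for small $\sigma$, slower decay and a higher noise floor for larger $\sigma$). The quadratic objective, dimension, initialization, and step sizes are our own illustrative assumptions, not the exact configuration behind Figure 1(b).

```python
import numpy as np

def local_sgd_losses(sigma, d=100, M=4, R=5, H=20, eta=0.05, gamma=1.0, seed=0):
    """Generalized Local SGD on f(x) = 0.5 * ||x||^2 with additive Gaussian
    gradient noise of standard deviation sigma; returns the loss after each round."""
    rng = np.random.default_rng(seed)
    x = np.full(d, 500.0)                 # large initialization -> large initial loss
    losses = []
    for _ in range(R):
        local_models = []
        for _ in range(M):
            z = x.copy()
            for _ in range(H):
                z -= eta * (z + sigma * rng.standard_normal(d))
            local_models.append(z)
        x = x + gamma * (np.mean(local_models, axis=0) - x)
        losses.append(0.5 * float(x @ x))
    return losses

for sigma in (0.001, 0.01, 0.1, 1.0, 10.0):
    print(sigma, [f"{l:.2e}" for l in local_sgd_losses(sigma)])
```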
  8. “g. For Table 1, SF-SGD case, why are outer learning rates with magnitudes >2 detrimental?” They weren’t really detrimental; they just did not help very much. Here are the results of our grid search on a 150M parameter model with $H=50$ and $M=4$. Very small outer learning rates were a lot more detrimental, as they did not allow for enough progress.

Additional Learning Rate Sweeps (150M, H=50, M=4)

| Algorithm | Learning Rate | Perplexity |
|---|---|---|
| SF-SGD | 0.1 | 30 |
| SF-SGD | 0.5 | 22.89 |
| SF-SGD | 1.0 | 19.42 |
| SF-SGD | 1.5 | 18.32 |
| SF-SGD | 2.0 | 17.98 |
| SF-SGD | 3.0 | 17.96 |
| SF-SGD | 4.0 | 18.09 |
| SF-SGD | 5.0 | 17.51 |

[10] Glasgow, M. R., Yuan, H., & Ma, T. (2022). Sharp bounds for federated averaging (Local SGD) and continuous perspective. In International Conference on Artificial Intelligence and Statistics (pp. 9050-9090). PMLR.
[11] Faghri, F., Duvenaud, D., Fleet, D. J., & Ba, J. (2020). A study of gradient variance in deep learning. arXiv preprint arXiv:2007.04532.

Comment

Apologies for having missed replying to the authors. The authors have addressed my questions about non-convex analysis satisfactorily and clarified the empirical results. I would like to keep my score at "Accept".

Review (Rating: 5)

The paper seeks to better understand the role of the outer optimizer and its hyperparameters in the i.i.d. Local SGD setting. Specifically, the main results of the paper are (1) a convergence theorem for generalized Local SGD with arbitrary inner and outer learning rates with SGD being the inner and outer optimizer, (2) an extension of the first theorem but using SGD with momentum as the outer optimizer, and (3) a high probability guarantee for the iterates of generalized Local SGD. The authors provide analysis of each result, illustrate how their high-probability theory is borne out in practice, and position their results within the existing literature. Finally, the authors provide empirical results on a toy convex problem that is illustrative of their theoretical analysis and provide empirical results on language model pre-training of decoder-only transformers following the DiLoCo setting.

Strengths and Weaknesses

Strengths:

  • The paper is well written, the work is effectively placed within the context of the existing literature, and the connection to and demarcation between new and existing results is clearly stated.
  • To the extent of my limited understanding of convergence analysis proofs (I am not intimately familiar with the proofs from existing work, but I did read through the math in the main text), the authors' theorems seem sound and, to the best of my knowledge, they support what is mentioned in the implications sections/contributions.
  • The empirical evaluations correspond with the theory, and the LLM experiments are conducted at a reasonably large scale.

Weaknesses:

  • Reproducibility: Having a single table with all hyperparameters swept would be beneficial for reproducibility. Currently, the ranges presented in Table 1 leave some guesswork to the reader, and I’m not sure the actual values used can be found from an attentive read of the appendix. Moreover, it would be beneficial to provide the final performance of all different hyperparameters swept in the appendix. This will help further improve the reproducibility of the paper and will be of use to practitioners reading the paper and wondering about the performance of different outer learning rates. I will raise my score if the above concerns are adequately addressed (e.g., provide in line the tables of values you will report in the appendix and the final performance of HPs swept).

  • Impact: The practical guidance that can be gleaned from the current theory seems limited. I think the theoretical results are interesting and will certainly help improve our current understanding of the outer learning rate’s role in Local SGD and could potentially lead to more practical improvements down the line (why I recommend accepting). However, from my understanding, the current theory does not suggest practical improvements to the Local SGD algorithm beyond: 1) tuning the outer learning rate is important and 2) noisy outer gradients require a lower outer learning rate. Could the authors comment on the above?

  • Presentation: Most of the captions could be more self-contained.

Questions

  • Is the use of a schedule-free outer optimizer instead of SGD with a schedule due to the theory not allowing for scheduled outer learning rates? If so, mentioning this could be helpful for the reader.
  • How is perplexity calculated in Figure 2? (e.g., what data, what batch size, etc)
  • Why does Theorem 3.5 not use an EMA for the momentum?
  • For LLM experiments, were the hyperparameters swept at each scale or were they swept only at a smaller scale?

Limitations

Yes

Final Justification

I have increased my score. I recommend accepting the paper.

Issues resolved:

  • Missing hyperparameters and final loss values for each.
  • Misunderstanding of the momentum used in the proofs.
  • Clarification of scheduled learning rates for the outer optimizer.
  • Clarification of the use of a schedule-free outer-optimizer.

Formatting Issues

N/A

Author Response

Thank you so much for your review and positive evaluation of our work.

  1. (Reproducibility) We did not report all of the hyperparameter sweep results, in line with prior work, which reported only the best hyperparameters. We are happy to report all the experimental results we have conducted. Here are the results of the hyperparameter sweeps we will include in the appendix:
| H | M | Algorithm | Learning Rate | Perplexity | Model Size |
|---|---|---|---|---|---|
| - | 1 | Data-Parallel | - | 18.07 | 150M |
| - | 1 | Data-Parallel | 4x BS | 16.89 | 150M |
| - | 1 | Data-Parallel | - | 15.28 | 400M |
| - | 1 | Data-Parallel | 4x BS | 13.21 | 400M |
| - | 1 | Data-Parallel | - | 13.38 | 1B |
| - | 1 | Data-Parallel | 4x BS | 11.34 | 1B |
| 50 | 4 | SGD | 1.0 | 17.75 | 150M |
| 50 | 4 | Nesterov | 0.7 | 17.25 | 150M |
| 50 | 4 | Nesterov | 1.0 | 16.38 | 150M |
| 50 | 4 | SF-SGD | 2.0 (b=0.2) | 16.88 | 150M |
| 50 | 4 | SGD | 1.0 | 14.90 | 400M |
| 50 | 4 | Nesterov | 0.7 | 13.71 | 400M |
| 50 | 4 | Nesterov | 1.0 | 30 | 400M |
| 50 | 4 | SF-SGD | 2.0 (b=0.2) | 13.95 | 400M |
| 50 | 4 | SGD | 1.0 | 13.67 | 1B |
| 50 | 4 | Nesterov | 0.7 | 12.51 | 1B |
| 50 | 4 | SF-SGD | 2.0 (b=0.2) | 12.40 | 1B |
| 150 | 4 | SGD | 1.0 | 17.58 | 150M |
| 150 | 4 | Nesterov | 0.7 | 17.90 | 150M |
| 150 | 4 | Nesterov | 1.0 | 16.79 | 150M |
| 150 | 4 | SF-SGD | 2.0 (b=0.2) | 16.96 | 150M |
| 250 | 4 | SGD | 1.0 | 18.20 | 150M |
| 250 | 4 | Nesterov | 0.7 | 18.09 | 150M |
| 250 | 4 | Nesterov | 1.0 | 17.12 | 150M |
| 250 | 4 | SF-SGD | 2.0 (b=0.2) | 16.97 | 150M |
| 500 | 4 | SGD | 1.0 | 18.44 | 150M |
| 500 | 4 | Nesterov | 0.7 | 17.95 | 150M |
| 500 | 4 | Nesterov | 1.0 | 18.15 | 150M |
| 500 | 4 | SF-SGD | 2.0 (b=0.2) | 17.18 | 150M |
| 1000 | 4 | SGD | 1.0 | 18.18 | 150M |
| 1000 | 4 | Nesterov | 0.7 | 18.16 | 150M |
| 1000 | 4 | Nesterov | 1.0 | 18.75 | 150M |
| 1000 | 4 | SF-SGD | 2.0 (b=0.2) | 17.29 | 150M |
| 2000 | 4 | SGD | 1.0 | 18.11 | 150M |
| 2000 | 4 | Nesterov | 0.7 | 18.40 | 150M |
| 2000 | 4 | Nesterov | 1.0 | 18.36 | 150M |
| 2000 | 4 | SF-SGD | 2.0 (b=0.2) | 17.59 | 150M |
| 50 | 2 | SGD | 1.0 | 18.64 | 150M |
| 50 | 2 | Nesterov | 1.0 | 16.81 | 150M |
| 50 | 2 | SF-SGD | 2.0 (b=0.2) | 17.13 | 150M |
| 50 | 8 | SGD | 1.0 | 18.38 | 150M |
| 50 | 8 | Nesterov | 1.0 | 16.27 | 150M |
| 50 | 8 | SF-SGD | 2.0 (b=0.2) | 16.92 | 150M |
| 50 | 16 | SGD | 1.0 | 19.86 | 150M |
| 50 | 16 | Nesterov | 1.0 | 16.25 | 150M |
| 50 | 16 | SF-SGD | 2.0 (b=0.2) | 16.75 | 150M |

Additional Learning Rate Sweeps (150M, H=50, M=4)

| Algorithm | Learning Rate | Perplexity |
|---|---|---|
| SF-SGD | 0.1 | 30 |
| SF-SGD | 0.5 | 22.89 |
| SF-SGD | 1.0 | 19.42 |
| SF-SGD | 1.5 | 18.32 |
| SF-SGD | 2.0 | 17.98 |
| SF-SGD | 3.0 | 17.96 |
| SF-SGD | 4.0 | 18.09 |
| SF-SGD | 5.0 | 17.51 |
| Nesterov | 0.3 (cosine) | 17.16 |
| Nesterov | 0.5 (cosine) | 17.06 |
| Nesterov | 0.7 (cosine) | 16.93 |
| Nesterov | 0.9 (cosine) | 17.19 |
| Nesterov | 1.1 (cosine) | 17.56 |
| SGD | 0.3 (fixed) | 21.04 |
| SGD | 0.3 (cosine) | 17.68 |
| SGD | 0.5 (cosine) | 16.63 |
| SGD | 0.7 (cosine) | 18.84 |
| SGD | 1.0 (cosine) | 19.21 |

SF-SGD Beta (b) Parameter Sweep (150M, H=50, M=4, lr=2.0)

| b value | Perplexity |
|---|---|
| 0.0 | 30 |
| 0.05 | 16.88 |
| 0.1 | 16.78 |
| 0.2 | 16.89 |
| 0.4 | 17.15 |
| 0.5 | 17.35 |
| 0.7 | 17.93 |
| 0.9 | 19.07 |
| 0.95 | 19.65 |
| 0.99 | 20.51 |
  2. (Impact) Yes, it is true that our theory as it stands provides limited practical prescriptions. We are primarily concerned with understanding the role of the outer learning rate and how it should be tuned, but have not directly provided a theory-based schedule for it; this task we leave to future work. One crucial takeaway from our theory, which is not common in practice, is that it sometimes pays off to try outer learning rates greater than 1.0. To the best of our knowledge, the current literature (e.g. [7, 8]) did not sweep such learning rates, going up to 1.0 and no higher. Our work shows that for Schedule-Free SGD as the outer optimizer, a learning rate of 2.0 performed best. We will highlight this recommendation further.

  3. (Captions) We will expand the captions to be more self-contained in the revised manuscript. Thank you for pointing this out.

  3. "Is the use of a schedule-free outer optimizer instead of SGD with a schedule is due to the theory not allowing for scheduled outer learning rates? If so, mentioning this could be helpful for the reader." The main reason we used it was to remove the variable of (outer) learning rate scheduling. The theory can be extended to scheduled outer learning rates, but in its current form it does not allow for them. We will mention this.

  4. "How is perplexity calculated in Figure 2? (e.g., what data, what batch size, etc)" Perplexity is calculated on the C4 validation set, with sequence length 1024, and batch size 512 for all model scales. The tokenizer is SentencePiece.

  5. "Why does Theorem 3.5 not use an EMA for the momentum?" Our way of writing the momentum update is equivalent to using EMA for momentum. [9, Lemma 7.2] shows it is equivalent to the update mt=μmt1+gtm_{t} = \mu m_{t-1} + g_t and xt+1=xtηmtx_{t+1} = x_t - \eta m_t. By unrolling, this corresponds to xt+1=xtηk=0tμkgtkx_{t+1} = x_t - \eta \sum_{k=0}^t \mu^k g_{t-k}. For classical EMA written as mt=(1α)mt1+αgtm_t = (1-\alpha) m_{t-1} + \alpha g_t and xt+1=xtγmtx_{t+1} = x_t - \gamma m_t, unrolling we have xt+1=xtγαk=0t(1α)kgtkx_{t+1} = x_t - \gamma \alpha \sum_{k=0}^t (1-\alpha)^k g_{t-k}. Thus putting μ=1α\mu = 1-\alpha and γ(1μ)=η\gamma (1-\mu) = \eta we can see that they are equivalent up to a rescaling of the learning rate.

  6. "For LLM experiments, were the hyperparameters swept at each scale or were they swept only at a smaller scale?" We swept them at the 150M scale, with a small number of additional trials at higher scale.

Thank you so much for your detailed comments.

[7] Douillard, Arthur, et al. "Diloco: Distributed low-communication training of language models." arXiv preprint arXiv:2311.08105 (2023).
[8] Liu, B., Chhaparia, R., Douillard, A., Kale, S., Rusu, A. A., Shen, J., ... & Ranzato, M. A. (2024). Asynchronous local-sgd training for language modeling. arXiv preprint arXiv:2401.09135.
[9] Garrigos, G., & Gower, R. M. (2023). Handbook of convergence theorems for (stochastic) gradient methods. arXiv preprint arXiv:2301.11235.

Comment

Thank you for your reply. I have no further questions.

Review (Rating: 4)

This paper analyzes the role of the outer learning rate in Local SGD, proving its impact on convergence and showing it can compensate for suboptimal inner learning rates, sometimes requiring values exceeding 1. The theoretical framework includes extensions to momentum-based outer optimizers and introduces an analysis for tuning insights. Experiments on language models validate these findings.

Strengths and Weaknesses

  1. The paper analyzes local SGD methods in the homogeneous setting. However, the heterogeneous setting is more common in convergence analyses for similar local SGD methods (e.g., [1]). Could the authors discuss the challenges of extending their analysis to the heterogeneous setting?

  2. Since this work does not propose a new method, I would like to assess its contributions along the following dimensions:

a) Theoretical Novelty:

Are there any new frameworks or techniques in the theoretical analysis compared to prior works like [1][2]? If so, the authors should highlight these differences more explicitly.

b) Practical Impact:

For optimization methods, it’s not recommended to introduce new hyperparameters if not necessary, due to the cost of hyperparameter tuning. Gen-loc-SGD has more learning rates that need to be tuned compared with vanilla local SGD. Are there some dramatic improvements due to the introduction of the outer learning rates?

  3. Some suggestions for writing:

a) Clarity of Notation (Lines 54–55 & Algorithm 1): The phrase “for what the ideal learning rate pair $(\eta, \gamma)$ should be” is unclear. The roles of $\eta$ and $\gamma$ should be defined earlier, ideally in Algorithm 1, rather than later in Section 3 (Lines 116–118).

b) Mathematical Notation: Vectors (e.g., parameters x and gradients g) should be consistently boldfaced for better readability.

[1] Hao Yu, et al. On the Linear Speedup Analysis of Communication Efficient Momentum SGD for Distributed Non-Convex Optimization. ICML 2019.
[2] Stich, Sebastian U. Local SGD Converges Fast and Communicates Little. ICLR 2019.

Questions

Please see Strengths And Weaknesses

Limitations

Please see Strengths And Weaknesses

Final Justification

Given that my concerns are addressed, I'd like to raise my score.

Formatting Issues

Please see Strengths And Weaknesses

Author Response

Thank you so much for your review.

  1. "Could the authors discuss the challenges of extending their analysis to the heterogeneous setting?" Our primary setting is large language model training, as in DiLoCo and similar works, and there the training data is i.i.d. Additionally, we would like to point out that in the heterogeneous setting, the guarantees for Local SGD are very pessimistic; that is, small degrees of heterogeneity result in much worse theoretical guarantees for Local SGD compared to Minibatch SGD, as discussed in [3, p. 5-6], and it is not clear under which assumptions the algorithm should be analyzed (see [12]). Because we wanted to focus on when Local SGD is superior to the alternatives, this is one more reason we focused on the i.i.d. setting.

That said, we believe extending our analysis to the heterogeneous setting is possible. Lemma B.6 still holds, but Lemma B.7 should be adjusted to handle function heterogeneity. If we make the same assumption as [1, Assumption 1.2 (3)], then this is possible at the cost of an extra $H$ factor. Modifying the rest of the proof of Theorem 3.3 with the revised Lemma B.7 is straightforward. The biggest drawback is that the conclusion will be too pessimistic compared to Minibatch SGD. We believe it might be possible to use assumptions such as second-order similarity and second-order smoothness, as in [4], to get around this, but this requires significant modifications not just to Lemma B.7 but to the main proof strategy of Theorem 3.3, and we leave it to future work.

  1. "Are there any new frameworks or techniques in the theoretical analysis compared to prior works like [1][2]? If so, the authors should highlight these differences more explicitly." Yes, there are. First, our algorithmic framework is significantly different; Neither [1] nor [2] consider the use of outer learning rates or outer momentum, with both using simple averaging (equivalent to fixing the outer learning rate at 11) and [1] considering local momentum. Our framework enables us to study the conditions under which using outer learning rates and momentum is useful. This difference in framework is the primary reason why we compare against [5, 6] in the discussion after Theorem 3.3 and not [1, 2]. Second, the analysis of Local SGD always involves bounding the so-called "client drift", or how far the local iterates stray from each other between synchronizations. In [1] this is controlled in Lemma 4 and in [2] this is controlled in Lemma 3.3. Both of these control the drift by bounding the maximum squared norm. We control the drift using a combination of the local regret against the starting point (Lemma B.6) and the maximum squared norm (Lemma B.7). This more fine-grained control allows us to show the benefit of using a large outer learning rate. To the best of our knowledge, a regret guarantee like Lemma B.6. has not been used in the literature before. In order to use this lemma, we also have to conduct the analysis per communication round rather than per training step as in [1, 2]. We will revise the next version of the manuscript to highlight these novel contributions.

  2. "For optimization methods, it’s not recommended to introduce new hyperparameters if not necessary, due to the cost of hyperparameter tuning. Gen-loc-SGD has more learning rates that need to be tuned compared with vanilla local SGD. Are there some dramatic improvements due to the introduction of the outer learning rates?" Yes, the performance degradation when not using an outer learning rate is significant. In our experiments, vanilla Local SGD consistently underperforms compared to using outer optimizers. For example, with the 150M model at H=50, M=4, it achieves a perplexity of 17.75, while Nesterov with lr=0.7 reaches 17.25 and SF-SGD with lr=2.0 achieves 16.88. These represent improvements of 2.8% and 4.9% respectively. The performance gains become even more pronounced at larger model scales: for the 1B model, vanilla SGD achieves 13.67 perplexity while Nesterov (lr=0.7) and SF-SGD (lr=2.0) reach 12.51 and 12.40 respectively, representing improvements of 8.5% and 9.3%. Because we can tune the outer learning rate at a smaller scale (this is what we did here) and therefore at a much smaller cost, we believe the additional hyperparameter is justified. There are very few reliable ways of squeezing out an additional 8%-10% performance at the 1B scale at the cost of one additional hyperparameter that can be tuned at a smaller scale.

  4. Thank you for your comments on improving clarity. We will define the learning rates $\gamma$ and $\eta$ in Algorithm 1 and use boldface for the vectors.

We thank you for your review and hope our discussion above sufficiently addresses your concerns.

[1] Hao Yu, et al. On the Linear Speedup Analysis of Communication Efficient Momentum SGD for Distributed Non-Convex Optimization. ICML 2019.
[2] Stich, Sebastian U. Local SGD Converges Fast and Communicates Little. ICLR 2019.
[3] Woodworth, B. E., Patel, K. K., & Srebro, N. (2020). Minibatch vs local sgd for heterogeneous distributed learning. Advances in Neural Information Processing Systems, 33, 6281-6292.
[4] Zindari, A., Luo, R., & Stich, S. U. (2023). On the convergence of local sgd under third-order smoothness and hessian similarity. In OPT 2023: Optimization for Machine Learning.
[5] Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S., Stich, S., & Suresh, A. T. (2020). Scaffold: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning (pp. 5132-5143). PMLR.
[6] Jhunjhunwala, D., Wang, S., & Joshi, G. (2023). Fedexp: Speeding up federated averaging via extrapolation. arXiv preprint arXiv:2301.09604.
[12] Patel, Kumar Kshitij, et al. "The limits and potentials of local sgd for distributed heterogeneous learning with intermittent communication." The Thirty Seventh Annual Conference on Learning Theory. PMLR, 2024.

Comment

Thanks for your response. I have no further questions currently. Given that my concerns are addressed, I'd like to raise my score.

Review (Rating: 5)

This paper presents a tighter convergence analysis of Generalized Local SGD in convex settings, highlighting the critical role of the outer learning rate. In particular, the authors show that appropriately setting the outer learning rate can: (1) interpolate between vanilla Local SGD and standard SGD, achieving the better of the two convergence rates; and (2) compensate for an overly small inner learning rate. Additionally, the paper provides (a) convergence guarantees when the outer optimizer uses momentum, and (b) the first high-probability convergence bound for Generalized Local SGD. Finally, the authors validate their theoretical findings through experiments on both convex optimization tasks and transformer models.

Strengths and Weaknesses

Strengths:

  1. This is a nice theory paper, with solid technical contributions.

  2. The theoretical insights are valuable and provide practical guidance for setting the outer learning rate.

  3. The experiments align with the theoretical findings well. In particular, when properly tuned, Generalized Local SGD outperforms vanilla parallel AdamW within a fixed number of training steps.

Weaknesses:

  1. The writing could be improved. In particular, the presentation of Theorems 3.5 and 3.6 is overly technical. The authors should better highlight the key takeaways and practical implications.

  2. The experimental section lacks clarity. For example, in Figure 2, it is unclear what the numbers in the legend refer to. Do they refer to the wall-clock time? I could not find an explanation in the main text or figure caption. Additionally, the authors should clearly introduce all algorithm abbreviations used in the figures in the captions.

Questions

Why is the high-probability bound in Section 3.3 referred to as an "adaptive convergence result"? The term "adaptive" might be somewhat misleading, as it typically reminds people of adaptive gradient methods such as Adam.

Limitations

Yes.

Final Justification

All my concerns and questions have been addressed. I would like to keep my positive rating.

Formatting Issues

NA

Author Response

Thank you so much for your review and positive evaluation of our work.

  1. We agree that our use of "adaptive" in Section 3.3 can be misleading, as we do not mean adaptivity in the same sense as Adam or AdaGrad (i.e. algorithmic adaptivity), but rather in the sense of obtaining data-dependent guarantees. We will change the term "adaptive" to "data-dependent" and thank the reviewer for pointing this out.
  2. We will reorganize the presentation of Theorems 3.5 and 3.6 to highlight the key takeaways and defer the more technical discussions. The core observation from Theorem 3.5 is that momentum allows us to use an effectively larger stepsize but does not alter the outer-stepsize tradeoff shown by Theorem 3.3. Similarly, Theorem 3.6 shows that when the noise magnitude dominates the optimization error (e.g. in later training stages), using $\gamma < 1$ helps maintain convergence. Conversely, when the optimization term is larger than the noise, $\gamma > 1$ acts like momentum to accelerate convergence. When both terms are comparable, $\gamma = 1$ (simple averaging) is optimal.
  3. We thank the reviewer for pointing this out. Figure 2 shows the total number of training steps (i.e. $R \times H$, where $R$ is the number of communication rounds and $H$ is the number of local steps). We will clarify this and add a short table or glossary with all the abbreviations we use.
Comment

Thank you for your response. Replies 1 and 2 have addressed my concerns. Particularly, your explanation of the role of the outer learning rate in balancing the optimization and noise terms is both clear and intuitive.

However, regarding Figure 2, my question was about the numbers shown in the legend (e.g., Data-parallel: 18.7), rather than the meaning of the x-axis. Could the authors clarify what these numbers represent? Do they indicate wall-clock time or something else?

Comment

Thank you for your response and for pointing out the ambiguity in the figure. The numbers in the legend of Figure 2 represent the final perplexity reached by each method. We included them in the legend for ease of comparison; we will make it clear in the caption that they represent the final perplexity.

Comment

Thank you for your response. All my concerns and questions have been addressed. I will keep my positive rating!

Final Decision

This paper provides a clean theoretical result about the optimal "outer learning rate" of Local SGD (a distributed version of SGD consisting of local gradient updates, aggregation, and an outer update stage). The reviewers all agree that the contribution of the paper is solid and useful in the optimization practice of distributed training.

The concerns seem to be minor, and the authors addressed them well during the rebuttal stage.