Faster Stochastic Optimization with Arbitrary Delays via Adaptive Asynchronous Mini-Batching
Improved rates for asynchronous stochastic optimization with better delay adaptivity via asynchronous mini-batching
Abstract
Reviews and Discussion
The paper introduces a framework for asynchronous stochastic optimization that leverages quantile delays instead of traditional average delay measures. It presents a black-box conversion that transforms any standard stochastic first-order method into an asynchronous version, requiring only simple analyses of the classical methods. It also provides an adaptive procedure for asynchronous non-convex smooth and convex Lipschitz problems that automatically tunes to the best quantile without needing prior knowledge of the delay distribution, yielding accelerated rates.
update after rebuttal
While this is a theoretical paper, numerical validation could have been included in this submission to further demonstrate the usability and robustness of the framework as a 'black-box' conversion for any first-order stochastic method. Therefore, although the analysis of quantile delays sounds new to me, I will keep my current rating of 3 due to the lack of experiments.
Questions for the Authors
- Do the authors plan to include empirical validations, and can you share any preliminary results?
- Does the method in this paper cover heavy-tailed or non-stationary delay distributions?
Claims and Evidence
The submission provides rigorous theoretical proofs for convergence rates based on a chosen quantile of delays and accelerated convergence rates for convex smooth problems. In addition, the adaptive variant in Algorithm 2 does not require pre-specified delay bounds and automatically adapts to the best quantile in hindsight.
While the theoretical evidence is solid within its assumptions, the paper could further strengthen its theoretical findings with empirical validations or simulation studies for practical benefits.
Methods and Evaluation Criteria
The theoretical results in this paper build heavily upon theorems for classical stochastic methods from existing works (Lan, 2012, 2020), but it makes sense to adapt these existing results to asynchronous optimization under arbitrary delays.
Theoretical Claims
I checked the proofs of the main theorems in the main body and didn't observe any technical errors.
Experimental Design and Analysis
N/A
Supplementary Material
I briefly scanned the appendices.
Relation to Prior Literature
The paper makes the following key contributions to the asynchronous optimization literature:
- Replace average delay measures with quantile delays for more robust convergence guarantees.
- Propose a black-box conversion that adapts standard stochastic methods to asynchronous settings.
- Introduce an adaptive mechanism that tunes automatically to the best quantile delay without prior knowledge.
- Achieve accelerated convergence rates in convex smooth settings, addressing gaps left by prior works such as Cohen et al. (2021) and Mishchenko et al. (2022), and relating to recent adaptive methods, e.g., Tyurin (2024).
Essential References Not Discussed
Some recent related works in stochastic optimization with delayed updates are missing, e.g., [1,2].
[1] Adibi, A., Dal Fabbro, N., Schenato, L., Kulkarni, S., Poor, H. V., Pappas, G. J., Hassani, H., and Mitra, A. Stochastic approximation with delayed updates: Finite-time rates under Markovian sampling. In International Conference on Artificial Intelligence and Statistics (pp. 2746-2754), 2024.
[2] Dal Fabbro, N., Adibi, A., Poor, H. V., Kulkarni, S. R., Mitra, A., and Pappas, G. J. DASA: Delay-adaptive multi-agent stochastic approximation. In IEEE Conference on Decision and Control (pp. 3889-3896), 2024.
Other Strengths and Weaknesses
Strengths:
- The paper offers a novel perspective by replacing average delay measures with quantile delays, which provides a more robust framework for asynchronous optimization.
- It presents detailed and rigorous convergence proofs, extending classical results to handle arbitrary delays and even achieving accelerated rates in convex smooth settings.
- The adaptive procedure that automatically tunes to the best quantile without prior knowledge is innovative and has potential for broad applicability in distributed systems.
Weaknesses:
- This work is entirely theoretical; the lack of empirical evaluation limits the practical effectiveness and robustness of Algorithms 1 and 2 in real-world scenarios.
Other Comments or Suggestions
It would be beneficial to discuss potential extensions of the framework to relaxed smoothness conditions (such as the one in [1]), which have wide applications in deep neural network training. In addition, addressing Markovian noise, which leads to biased gradient oracles and often arises in reinforcement learning, would be valuable.
[1] Zhang, J., He, T., Sra, S., and Jadbabaie, A. Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity. In International Conference on Learning Representations, 2020.
Thanks for the feedback—please see below our responses to the main points you raised.
“This work is entirely theoretical; the lack of empirical evaluation limits the practical effectiveness and robustness of Algorithms 1 and 2 in real-world scenarios.”
Our work is primarily theoretical, focusing on providing improved convergence guarantees for asynchronous stochastic optimization and improving our understanding of this fundamental problem. That said, we agree with the reviewer that a numerical evaluation comparing our methods to classical asynchronous algorithms can further strengthen our results. Since the rebuttal period this year is extremely short, we will only be able to complete this for the final version. Specifically, our plan is to implement a simulated environment of delays, and to conduct a small-scale experiment with synthetic data, allowing for a larger number of simulated machines, as well as a larger experiment (with a deep NN on a standard benchmark) with a smaller number of simulated machines.
“Do the authors plan to include empirical validations, and can you share any preliminary results?”
See comment above.
“Does the method in this paper cover heavy-tailed or non-stationary delay distributions?”
Note that we actually consider arbitrary sequences of delays that need not come from a probability distribution. Therefore, we support any kind of distribution (as a special case), including heavy-tailed and/or non-stationary distributions.
Thanks for pointing out additional references and further suggestions.
This paper studies the convergence of asynchronous stochastic methods. The key differences from the literature include 1) a relatively general algorithmic framework; 2) a new model/characterization of the delays which, compared to delay models based on the maximum or average delay, characterizes the delays more faithfully. They derive convergence results in terms of their delay model and show the advantage through comparison with the literature.
Questions for the Authors
Does Theorem 1 allow for the same parameter range of the underlying algorithm? I ask this question because many methods, such as delayed gradient descent, require the step size to be inversely proportional to the maximum delay to guarantee convergence.
Claims and Evidence
Yes, the claims are supported by their theory.
Methods and Evaluation Criteria
Didn't propose any new method.
Theoretical Claims
No.
Experimental Design and Analysis
No experiments.
Supplementary Material
No.
Relation to Prior Literature
Limited.
Essential References Not Discussed
No.
Other Strengths and Weaknesses
- Strength: They consider general asynchronous stochastic methods under a new delay model which is much better than the average/maximum delay. They derive strong results, especially Theorem 3, which shows that the algorithm can adapt to the best quantile in the q-quantile delay model to achieve faster convergence. This is impressive.
- Weaknesses:
  - I would expect a theoretical bound on the q-quantile delay. This is because the q-quantile delay is used to determine the batch size in Algorithm 1, while the batch size may in turn affect the observed delays. Therefore, without a theoretical bound, the q-quantile delay may be difficult to determine.
  - The adaptivity result in Theorem 3 is impressive, but I still don't understand how this adaptivity is achieved. I would expect the authors to explain more.
  - The literature review is slightly narrow. Specifically, many works consider the setting where each agent has a private local cost function, but these are not surveyed. I understand that the focus of this paper does not include that setting, but since this paper considers "asynchronous" methods, I would encourage the authors to survey papers in that category. To list a few:
[A] Mishchenko K, Iutzeler F, Malick J, et al, A delay-tolerant proximal-gradient algorithm for distributed learning, International conference on machine learning. PMLR, pp. 3587-3595, 2018.
[B] Soori S, Mishchenko K, Mokhtari A, et al, DAve-QN: A distributed averaged quasi-Newton method with local superlinear convergence rate, International conference on artificial intelligence and statistics. PMLR, pp. 1965-1976, 2020.
[C] Wu X, Magnusson S, Feyzmahdavian H R, et al, Delay-adaptive step-sizes for asynchronous learning, International Conference on Machine Learning, pp. 24093-24113, 2022.
Also, there are several papers that allow for arbitrary delays or attempt to adapt to the real delay patterns, such as [C] and
[D] Hannah, Robert, Fei Feng, and Wotao Yin. "A2BCD: Asynchronous acceleration with optimal complexity." International Conference on Learning Representations. 2019.
Other Comments or Suggestions
I was not aware of some of the issues pointed out by Reviewer A1zA. Now that I have read them, I agree with those concerns.
However, the q-quantile delay model used in this paper is new to me and is a better delay model than the "upper bound on the delay" used in most existing works. Convergence analysis under the new model helps reflect the effect of the delay distribution rather than just the maximum delay. Therefore, this paper may be of general interest to the asynchronous optimization community.
Thanks for the review and strong support! Please see below our responses to the main points you raised.
“I would expect a theoretical bound on the q-quantile delay. This is because the q-quantile delay is used to determine the batch size in Algorithm 1, while the batch size may in turn affect the observed delays. Therefore, without a theoretical bound, the q-quantile delay may be difficult to determine.”
The q-quantile delay may range from zero up to the total number of rounds, depending on the adversary determining the delays, as we do not impose any assumptions on their distribution. Consequently, Algorithm 1 requires external knowledge (an upper bound on the quantile delay) about the observed delays. This is analogous to the use of the average delay (or an upper bound on it) in previous work, except that quantiles are less sensitive to outliers.
“The adaptivity result in Theorem 3 is impressive, but I still don't understand how this adaptivity is achieved. I would expect the authors to explain more.”
The somewhat surprising adaptivity stems from two key components. First, (synchronous) SGD can be used with a large batch size (determined by the problem parameters) with minimal degradation in performance, even though the number of update steps is reduced, as long as the noise term dominates the error. Second, the sweep over batch sizes bypasses the discrepancy between the number of asynchronous updates that are not filtered (which is unknown) and the number of updates fed to the black-box algorithm. This doubling trick, at the cost of a small constant, enables us to “simulate” knowledge of the number of updates the black-box algorithm will receive.
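To make the first point concrete, here is a standard mini-batch SGD bound for smooth non-convex problems (in the style of Ghadimi and Lan); the notation (Δ for initial suboptimality, L for smoothness, σ² for gradient variance, B for batch size, N for the total number of stochastic gradients) is generic and not necessarily the paper's:

```latex
% Mini-batch SGD with batch size B, run for T = N/B steps with a tuned step size:
\[
  \min_{t \le T} \; \mathbb{E}\,\|\nabla f(x_t)\|^2
  \;=\; O\!\left( \frac{L\Delta}{T} + \sqrt{\frac{L\Delta\,\sigma^2}{T B}} \right)
  \;=\; O\!\left( \frac{L\Delta\, B}{N} + \sigma\sqrt{\frac{L\Delta}{N}} \right).
\]
% The noise term depends only on the total budget N, so increasing B costs nothing
% until the optimization term catches up, i.e., until B exceeds roughly \sigma\sqrt{N/(L\Delta)}.
```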
“Does Theorem 1 allow for the same parameter range of the underlying algorithm? I ask this question because many methods, such as delayed gradient descent, require the step size to be inversely proportional to the maximum delay to guarantee convergence.”
This is a good distinction. There are no constraints on the parameter range of the underlying algorithm that stem from the asynchronous nature of the problem. The only effect of the asynchronous gradients on the algorithm is the number of update steps, which is unknown in advance and thus requires either a bound or a batch-size sweep.
Thank you for pointing out additional relevant works. We will expand the literature review to include these settings in the final version.
N/A
Questions for the Authors
N/A
Claims and Evidence
N/A
Methods and Evaluation Criteria
N/A
Theoretical Claims
N/A
Experimental Design and Analysis
N/A
Supplementary Material
N/A
Relation to Prior Literature
N/A
Essential References Not Discussed
N/A
Other Strengths and Weaknesses
I do not believe I am sufficiently qualified to review this paper. However, I can make two observations:
- The algorithms are poorly written, to the point of being nearly incomprehensible.
- It seems quite difficult to claim an acceleration in convergence without demonstrating it through a simulation study.
Concerning the rate, since I am not competent, I'll follow the other reviewers' opinions.
Other Comments or Suggestions
N/A
Thank you for your feedback and for your transparency regarding your familiarity with the field in relation to the review.
“The algorithms are poorly written, to the point of being nearly incomprehensible.”
The algorithms aggregate gradients for mini-batching while filtering stale ones based on the delay. Algorithm 2 employs an additional outer loop to determine the optimal batch size, which is unknown in advance, using the standard doubling technique.
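For intuition, here is a minimal sketch of the aggregate-and-filter loop described above; it is not the paper's exact pseudocode, and `oracle`, `base_step`, and `max_staleness` are hypothetical placeholders:

```python
# Minimal sketch of the aggregate-and-filter idea (hypothetical interfaces, not the
# paper's pseudocode). The oracle returns a gradient (a list of floats) together
# with the round at which its computation was requested.

def async_minibatch_run(oracle, base_step, x0, batch_size, num_updates, max_staleness):
    x, buffer, t, updates = x0, [], 0, 0
    while updates < num_updates:
        grad, requested_at = oracle(x, t)       # an asynchronously arriving gradient
        t += 1                                  # one asynchronous round has elapsed
        if t - requested_at <= max_staleness:   # filter: drop gradients that are too stale
            buffer.append(grad)
        if len(buffer) == batch_size:           # a full mini-batch of fresh gradients
            avg = [sum(coords) / batch_size for coords in zip(*buffer)]
            x = base_step(x, avg)               # one step of the black-box base method
            buffer, updates = [], updates + 1
    return x
```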
“It seems quite difficult to claim an acceleration in convergence without demonstrating it through a simulation study.”
Accelerated SGD and accelerated rates are standard terms associated with Nesterov's seminal accelerated gradient method and its stochastic variants, which achieve convergence rates of order O(1/T^2) in the deterministic smooth convex setting, without necessarily any connection to empirical performance.
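For reference, the accelerated stochastic rate alluded to here takes the following standard form (e.g., Lan, 2012), where L is the smoothness constant, D the initial distance to an optimum, σ² the gradient variance, and T the number of iterations (generic notation, not necessarily the paper's):

```latex
\[
  \mathbb{E}\bigl[f(x_T)\bigr] - f(x^\star)
  \;=\; O\!\left( \frac{L D^2}{T^2} + \frac{\sigma D}{\sqrt{T}} \right),
\]
% which recovers Nesterov's deterministic O(1/T^2) rate when the noise level \sigma vanishes.
```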
The authors consider the problem of stochastic optimization with delays. Concretely, they minimize an objective function over a convex set, with access to an unbiased stochastic gradient oracle with bounded variance. Additionally, at each round, the gradient oracle provides a stochastic gradient evaluated at an earlier iterate, where the delay sequence can be arbitrary. The goal is to obtain optimization algorithms with the best possible convergence rates, after a given number of rounds, in terms of the delay distribution.
The authors consider arbitrary delay sequences, with no assumptions on the delays, while existing works make several simplifying assumptions on the delay distribution (fixed delays, bounded delays, a fixed delay per machine, or knowledge of the delay distribution).
Under arbitrary delays, the authors provide a method (Algorithm 1) where, given a black-box algorithm for stochastic optimization and an upper bound on the chosen quantile delay, they select the corresponding batch size and number of iterations for the black-box algorithm.
Using this, they run the black-box algorithm to obtain the best possible convergence rates in terms of the quantile delay for smooth non-convex, smooth convex, and non-smooth convex objectives; see Table 1 for the exact rates, stated in terms of the chosen quantile and the corresponding delay quantile value. Crucially, these are much better than existing works, which can only do this for the maximum or the average delay.
Further, to eliminate the dependence on knowledge of the upper bound on the quantile delay, they use a doubling trick in Algorithm 2 to run several copies of Algorithm 1, with a geometrically increasing number of rounds in each copy and increasing batch sizes determined by the corresponding function class. Applying this to all the function classes above, they obtain convergence rates that adapt to the best quantile in hindsight (Theorem 3 and Table 1).
Finally, in Theorem 4 they show that vanilla SGD with a fixed step size, without modifications or additional assumptions, will always incur a dependence on the maximum delay.
Questions for the Authors
- Lower bound: I see that (Tyurin 2024) has a lower bound for delay distributions in their Theorem 1. Does the lower bound hold here, as this does not seem to be a zero-respecting algorithm? Also, I understand that it is concurrent work, but I request the authors to provide a more detailed comparison.
- Delay distributions independent of stochastic gradients: In Lines 188-189, the authors assume that the delay distributions are independent of the stochastic gradients. Are there cases in distributed optimization where this is violated, for instance where machines with bad delays also have bad gradient noise, but the average gradient noise over all machines is small? Also, do existing works like (Tyurin & Richtarik 2023) handle these cases?
Claims and Evidence
- Core idea for the black-box conversion: The core idea behind Algorithm 1 is to construct a mini-batch of gradients in an asynchronous fashion. They run the black-box algorithm for a prescribed number of steps with a prescribed batch size. In each step, they wait until a full mini-batch of sufficiently fresh stochastic gradients has been obtained before updating the model. By choosing the batch size carefully, they can control both the number of steps and the delays within steps in terms of the chosen quantile delay. They state this in Lemma 1 and provide the proof just after it.
- Core idea for the anytime algorithm: To extend Algorithm 1 to Algorithm 2, they use the doubling trick on the number of rounds for each instantiation of Algorithm 1. This allows them to obtain a similar bound on the batch size, the number of steps of the black-box algorithm, the number of rounds, and the delay quantile for any quantile in Lemma 2 (see the sketch after this list). The proof is simple but insightful. To go from here to the convergence rates for any function class, they plug the batch size into any convergence rate, then find its optimal value to recover the best possible rate for the function class in terms of the quantile delay.
- Lower bound and additional results: The lower bound, as well as the suboptimality of the average delay versus the best-case quantile, is shown by constructing arbitrarily bad delay sequences.
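As referenced in the second bullet, here is a minimal sketch of the doubling structure; `run_algorithm_1` and `batch_for` are hypothetical stand-ins for the conversion described in the first bullet and for the function-class-specific batch-size schedule:

```python
# Sketch of the doubling/epoch structure used to avoid knowing the delay bound in
# advance (hypothetical helpers, not the paper's exact pseudocode).

def doubling_run(run_algorithm_1, batch_for, x0, total_rounds):
    x, spent, epoch = x0, 0, 0
    while spent < total_rounds:
        rounds = min(2 ** epoch, total_rounds - spent)  # geometrically growing epochs
        batch = batch_for(rounds)                       # batch size tied to the epoch length
        x = run_algorithm_1(x, rounds, batch)           # fresh copy of Algorithm 1 per epoch
        spent += rounds
        epoch += 1
    return x  # up to constants, the guarantee of the last (largest) epoch dominates
```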
Methods and Evaluation Criteria
The authors provided no experiments. As the goal of this paper is to propose an algorithm, I request the authors to provide some experiments for their algorithm in practice. Ideally, one synthetic example (quadratics with Gaussian noise) and one real-world neural network example on a bad delay distribution with large delays would do.
One issue with several stochastic optimization algorithms, even those in (Lan 2020), is that the optimal step sizes depend on problem parameters like the smoothness constant or the suboptimality gap. So, when implementing algorithms in practice, especially on NNs, one has to resort to tuning. In the case of Algorithm 2 here, the batch sizes also depend on problem parameters, so the authors should show whether a tuning procedure along with some growth condition on the batch sizes is sufficient in practice. For instance, for non-convex smooth objectives, the authors can choose the batch-size growth by tuning.
Theoretical Claims
- All the theoretical claims are correct and in fact have simple proofs that the authors have explained very clearly.
- I would like to add that both Algorithms 1 and 2 are quite simple and can be applied to any black-box algorithm, which makes the method quite novel and insightful.
Experimental Design and Analysis
See Methods and Evaluation Criteria.
Supplementary Material
I went through the whole appendix.
Relation to Prior Literature
The key contribution is to provide a generic recipe for converting arbitrary optimization algorithms to handle arbitrary delays via mini-batching. The insight is quite novel, and it turns a hard problem, which people could previously solve only for specific delay distributions using complicated algorithms, into a simple one.
Essential References Not Discussed
I think the authors might want to motivate the doubling trick, as it has been used previously in online learning for anytime algorithms, e.g., (van Erven et al., 2011). Apart from that, the literature review is pretty thorough.
References
- (van Erven et al., 2011) Adaptive Hedge. NeurIPS 2011.
Other Strengths and Weaknesses
None.
Other Comments or Suggestions
- Presentation: I think certain parts of the main paper can be moved to the appendix so that an experiments section and a conclusion can be included. For instance, Lines 376-382 (right column) state a standard property of smoothness, and just a reference for it, such as Nesterov's book, could be provided instead. Also, the proof of the lower bound can be moved to the appendix. Apart from this, footnote 2 on page 4 is incomplete, there are two "the" in the first line of page 8, and on the same page "Lemma 3" is applied but it has not been stated what this lemma says.
Thanks for the feedback—please see below our responses to the main points you raised.
“The authors provided no experiments. As the goal of this paper is to propose an algorithm, I request the authors to provide some experiments for their algorithm in practice.”
Our work is primarily theoretical, focusing on providing improved convergence guarantees for asynchronous stochastic optimization and improving our understanding of this fundamental problem. That said, we agree with the reviewer that a numerical evaluation comparing our methods to classical asynchronous algorithms can further strengthen our results. Since the rebuttal period this year is extremely short, we will only be able to complete this for the final version. Specifically, in agreement with your suggestions, our plan is to implement a simulated environment of delays, and to conduct a small-scale experiment with synthetic data, allowing for a larger number of simulated machines, as well as a larger experiment (with a deep NN on a standard benchmark) with a smaller number of simulated machines.
“One issue with several stochastic optimization algorithms, even those in (Lan 2020), is that the optimal step sizes depend on problem parameters like the smoothness constant or the suboptimality gap. So, when implementing algorithms in practice, especially on NNs, one has to resort to tuning. In the case of Algorithm 2 here, the batch sizes also depend on problem parameters, so the authors should show whether a tuning procedure along with some growth condition on the batch sizes is sufficient in practice.”
Compared to Algorithm 1, Algorithm 2 performs an outer sweep over different batch sizes. This addition is made to identify the optimal batch size, as the number of non-stale updates is unknown without prior knowledge of the delays. In a tuning scenario, one can directly tune the batch size, resulting in a simpler asynchronous algorithm that aggregates gradients for a mini-batch synchronous algorithm (essentially Algorithm 1 with a tuned batch size). We will include your suggestion in the experiments we will complete for the final version.
“Lower bound: I see that (Tyurin 2024) has a lower bound for delay distributions in their Theorem 1. Does the lower bound hold here, as this does not seem to be a zero-respecting algorithm? Also, I understand that it is concurrent work, but I request the authors to provide a more detailed comparison.”
The lower bound from the concurrent work (Tyurin 2024) applies, in their specific delay model, to the algorithms we discuss. The main distinction between our paper and (Tyurin 2024) lies in the computational model: they introduce a framework that deviates significantly from the arbitrary delay models considered in prior work (e.g., Cohen et al., 2021; Mishchenko et al., 2022; Koloskova et al., 2022). While their model is general, it is also somewhat abstract, resulting in somewhat more complex guarantees compared to the (arguably) more intuitive quantile-based convergence guarantees we provide. On the other hand, (Tyurin 2024) also establishes the optimality of the stale-gradient filtering technique with a well-tuned batch size by proving a lower bound for zero-respecting first-order algorithms. We will include a more detailed comparison in the revision.
“Delay distributions independent of stochastic gradients: In Lines 188-189, the authors assume that the delay distributions are independent of the stochastic gradients. Are there cases in distributed optimization where this is violated, for instance where machines with bad delays also have bad gradient noise, but the average gradient noise over all machines is small? Also, do existing works like (Tyurin & Richtarik 2023) handle these cases?”
The scenario you refer to falls within the setting of asynchronous optimization with heterogeneous data, where different workers have access to different samples. Some works address this setting but require additional assumptions, such as fixed delays per machine (Theorem A.4 of Tyurin & Richtarik, 2023) or certain distributional assumptions (Assumption 2 of Mishchenko et al., 2022). Supporting this case without additional assumptions is not feasible, as the data on slower machines must be accessed sufficiently many times for effective minimization. This could indeed be a valuable extension, but one that probably deserves a separate study.
Thanks for pointing out typos and further suggestions.
Thanks for the detailed response. I do not have any other questions.
This paper proposes an asynchronous mini-batch black-box algorithm that aggregates asynchronously computed stochastic gradients as input to any stochastic-gradient-type optimization algorithm. In contrast to performing biased updates using stale stochastic gradients, the proposed algorithm adaptively aggregates delayed gradients according to the delay length. Analysis is provided to demonstrate the effect of the batch size, and of its dependence on the delay quantile, on the convergence rate for multiple kinds of objective functions. Furthermore, another variant of the algorithm that adopts an increasing batch size is proposed, suggesting that the increasing batch size achieves a tighter convergence bound that is optimal over the delay quantile.
Questions for the Authors
No.
Claims and Evidence
As discussed below under "Theoretical Claims".
Methods and Evaluation Criteria
Regarding Algorithm 1
- What does the "Play" step in Algorithm 1 mean?
- At each iteration, the black-box algorithm always receives a mini-batch of stochastic gradients evaluated at the current iterate. For simplicity, consider the case where the black-box algorithm is SGD: it then seems that Algorithm 1 is only a synchronous SGD that waits for mini-batches from multiple workers at every iteration. I found this confusing, as it differs from the usual asynchronous gradient paradigm. It would be clearer if the authors could expand the discussion of how the asynchrony in the proposed algorithm achieves acceleration over synchronous SGD, e.g., in view of the larger batch size used in Algorithm 1.
Theoretical Claims
Regarding the proof of Lemma 1
- It is not intuitive to see how the relation in line 310 is obtained.
- Similarly, it is not clear how the argument in line 314, "since there are at least ... rounds with ...", is obtained.
- From the notation, I assume the black-box algorithm queried and received the stated number of stochastic gradients, as stated in Algorithm 1, because the iteration counter increases only after line 266. Therefore, it is not clear why another count is introduced in line 300 and what is the meaning behind the assumption relating the two.
- It is not explained how the second inequality in line 324 is obtained.
Explanation regarding the above ambiguities shall be included in the proof.
Regarding Theorem 4
- The proof of Theorem 4 considered the mini-batching algorithm, which is not the vanilla asynchronous SGD algorithm [Stich & Karimireddy, 2020] [Arjevani et al., 2020] as claimed in the first paragraph of Section 5.
- I suggest the authors to provide references / proof for the case of smaller , i.e., when , as claimed in the paragraph of line 390.
Experimental Design and Analysis
- This paper lacks numerical experiments.
- It is not shown how to choose the step size and batch-size sequence in practice. For instance, in order to achieve the optimal upper bound suggested in Theorem 3, one must know the optimal quantile and the corresponding quantile delay in order to choose parameters that satisfy the conditions required in Theorem 3.
Supplementary Material
No
Relation to Prior Literature
The proposed algorithm is orthogonal to classical asynchronous gradient algorithms such as [Arjevani et al., 2020], in the sense that the proposed algorithm avoids applying biased gradient steps by filtering and buffering the delayed gradients.
Essential References Not Discussed
No
Other Strengths and Weaknesses
As discussed below.
Other Comments or Suggestions
- The claim of this paper would be more convincing if the authors can present numerical experiments with extremely large delays and show the benefit of Algorithm 1 against classical asynchronous algorithms such as [Arjevani et al., 2020].
- According to Table 1, in the non-convex setting, for a horizon large enough that the variance term dominates the error bound, the proposed convergence rate would be slower than the existing rate based on the average delay.
Thanks for the feedback. Due to space constraints, we provide our responses to the main points below.
“What does the "Play" step in Algorithm 1 mean?”
In the asynchronous paradigm we consider, at each step the algorithm plays a point (essentially the current model at that time). This line simply specifies which point the algorithm selects at that step.
“At each iteration, the black-box algorithm always receives a mini-batch of stochastic gradients evaluated at the current iterate…how the asynchrony in the proposed algorithm achieves acceleration over synchronous SGD…”
Consider a scenario with several workers, where one is significantly slower while the remaining workers are fast (e.g., three times as fast). Synchronous mini-batch SGD with a batch size equal to the number of workers would wait for the slow worker at each iteration. In contrast, our algorithm with the same batch size leverages the fast workers to compute the one missing gradient, achieving a 1.5x speedup.
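To complement this verbal argument, here is a small throughput simulation with hypothetical timings (one worker taking 3 seconds per gradient, three workers taking 1 second each); it only measures how fast mini-batches are collected, ignores staleness filtering, and the speedup it reports depends on these modeling choices rather than reproducing the per-step figure above:

```python
# Throughput comparison between synchronous mini-batching (wait for every worker)
# and asynchronous aggregation (wait for any batch_size gradients). Worker speeds
# are hypothetical illustration values.
import heapq

def synchronous_time(worker_times, num_steps):
    """Each step waits for one gradient from every worker."""
    return num_steps * max(worker_times)

def asynchronous_time(worker_times, batch_size, num_steps):
    """Each step waits until batch_size gradients (from any workers) have arrived."""
    events = [(t, w) for w, t in enumerate(worker_times)]  # (finish_time, worker)
    heapq.heapify(events)
    collected, steps_done, now = 0, 0, 0.0
    while steps_done < num_steps:
        now, w = heapq.heappop(events)                      # next gradient to arrive
        heapq.heappush(events, (now + worker_times[w], w))  # that worker starts a new one
        collected += 1
        if collected == batch_size:
            collected, steps_done = 0, steps_done + 1
    return now

times = [3.0, 1.0, 1.0, 1.0]  # one slow worker, three fast ones (hypothetical)
steps = 1000
sync = synchronous_time(times, steps)
asyn = asynchronous_time(times, batch_size=len(times), num_steps=steps)
print(f"synchronous: {sync:.0f}s, asynchronous: {asyn:.0f}s, speedup ~{sync / asyn:.2f}x")
```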
“Explanation regarding the above ambiguities shall be included…”
We will provide a more detailed version of the proof of Lemma 1 in the final version. Throughout the proofs, we treated the quantile delays as integers, which can be done without loss of generality because (A) the delays are all integers, so rounding a q-quantile down still yields a q-quantile, and (B) a smaller delay bound is only better. So, we can always use the rounded-down quantile instead of the original one. We will state this more clearly in the final version; thanks for pointing it out. Below are more details on the specific transitions the reviewer mentioned.
“...how the relation in line 310 is obtained.”
As the integer sequence is increasing, . As , (these are integers), so .
“...how the argument in line 314, "since there are at least ... rounds with ...", is obtained.”
We assumed by contradiction that , where is the number of rounds with delay less or equal . Working with integers, . So the last of these will have indices .
“From the notation, I assume the black-box algorithm queried and received the stated number of stochastic gradients…why another count is introduced…”
We differentiate between the number of updates the black-box algorithm needs to produce an output and the actual number of updates it receives. The latter cannot be larger than the former due to the loop structure, but we need to prove that the prescribed number of asynchronous steps is enough to produce the required number of updates to the black-box algorithm, which is ensured by proving the inequality in question.
“...how the second inequality in line 324 is obtained.”
The second inequality is part of the statement of Lemma 1: what we prove in the lemma is that if this inequality holds, then the claimed conclusion follows.
“I suggest the authors to provide…proof for the case of smaller ...as claimed in the paragraph of line 390.”
We will include the proof for this case and are happy to provide further elaboration if requested by the reviewer.
“The claim of this paper would be more convincing if the authors can present numerical experiments…”
Our work is primarily theoretical, focusing on providing improved convergence guarantees for asynchronous stochastic optimization and improving our understanding of this fundamental problem. That said, we agree with the reviewer that a numerical evaluation comparing our methods to classical asynchronous algorithms can further strengthen our results. Since the rebuttal period this year is extremely short, we will only be able to complete this for the final version. Specifically, our plan is to implement a simulated environment of delays, and to conduct a small-scale experiment with synthetic data, allowing for a larger number of simulated machines, as well as a larger experiment (with a deep NN on a standard benchmark) with a smaller number of simulated machines.
“The proof of Theorem 4 considered the mini-batching algorithm, which is not the vanilla asynchronous SGD algorithm [Stich & Karimireddy, 2020] [Arjevani et al., 2020] as claimed in the first paragraph of Section 5.”
Vanilla asynchronous SGD is SGD that performs its update steps in an asynchronous manner. The asynchronous SGD algorithm presented in [Stich & Karimireddy, 2020] is an instance of vanilla asynchronous SGD with constant delay. To accommodate arbitrary delays, previous works (e.g., Mishchenko et al., 2022) extended asynchronous SGD to general delays. Theorem 4 follows the definition of asynchronous SGD with arbitrary delays.
“According to Table 1…for a horizon large enough…would be slower than the existing rate…”
As we remark in lines 111-113 and in Section C, for q = 1/2, our new rates are of the same order as, or better than, those of previous work. The reason is that the median delay is at most twice the average delay. As the median is more robust to outliers than the average, our bounds are preferable.
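For completeness, the median-versus-average comparison invoked here follows from a Markov-type argument over the empirical delay distribution; the notation (d_1, ..., d_T for the delays, τ_avg for their average, τ_{1/2} for a median) is generic and may differ from the paper's:

```latex
% Assuming \tau_{\mathrm{avg}} > 0 (if all delays are zero the claim is trivial),
% at most half of the delays can exceed twice their average:
\[
  \frac{1}{T}\,\bigl|\{\, t : d_t > 2\tau_{\mathrm{avg}} \,\}\bigr|
  \;\le\; \frac{\tau_{\mathrm{avg}}}{2\tau_{\mathrm{avg}}}
  \;=\; \frac{1}{2}
  \qquad\Longrightarrow\qquad
  \tau_{1/2} \;\le\; 2\,\tau_{\mathrm{avg}},
\]
% so a bound stated in terms of the median delay is at most a constant factor worse
% than the corresponding average-delay bound, while being far more robust to outliers.
```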
Thank you for your response, especially your explanation of the proof details, which will improve the readability of this paper when included in the main text. Below are my comments addressing your response.
Consider a scenario with several workers, where one is significantly slower while the remaining workers are fast (e.g., three times as fast). ... fast workers to compute the one missing gradient, achieving a 1.5x speedup.
Your explanation adds a lot of value toward understanding Algorithm 1. I suggest this idea be presented in the main paper. In this sense, the actual wall-clock time spent running the asynchronous rounds in Algorithm 1 is the same as for asynchronous SGD in prior works, and potentially faster convergence is claimed by running Algorithm 1.
As we remark in lines 111-113 and in Section C, for q = 1/2, our new rates are ... our bounds are preferable.
I respectfully disagree with the authors that the proposed new rates are of the same order or better than those of previous work. Here is my argument:
- I agree with the authors that, when comparing the coefficients of the higher-order term (i.e., the optimization error term), the proposed quantile-delay coefficient is better than the prior SOTA average-delay coefficient.
- However, when an optimization algorithm is run for long enough, the slower-converging error term, e.g., the variance error term, dominates the convergence bound and the effect of the higher-order error term vanishes.
- When comparing the lower-order term, the proposed rate is slower than the prior SOTA term. For example, for one choice of the quantile, there exists a small enough target accuracy such that the proposed algorithm is 2x slower than SOTA; for another choice, the proposed lower-order term matches SOTA, but the proposed higher-order term is larger than the SOTA higher-order term, therefore yielding a weaker error bound.
We appreciate the reviewer for the thoughtful engagement and additional valuable comments. We will of course incorporate these important discussions into the final version of the paper.
Regarding the comparison of our bounds to SOTA: we believe we are actually in agreement with the reviewer. When we stated that our bounds are of the “same order or better than those of previous work”, we were referring to a comparison of rates up to numerical constants. It is indeed possible that our bound (for q = 1/2) is up to a factor of 2 weaker than the SOTA bound in the regime the reviewer pointed out.
We apologize for the lack of clarity in our earlier response and hope this properly addresses the reviewer’s remaining concern. We will ensure this point is clearly explained in the final version of the paper.
This paper studies the theoretical convergence rates of a mini-batching asynchronous SGD algorithm via a black-box model. Using a computation model that is similar to that of Rennala SGD, the so-called "Fixed Compute Model", the paper first studies the effect of different delay quantiles on the convergence rate using an algorithm similar to Rennala SGD; then an adaptive method based on an exponentially increasing "epoch" size is studied, which achieves the convergence rate with the optimal quantile without knowing it in advance.
The five reviewers were split in their opinions at the initial review. A few reviewers found the proposed analysis to be new and interesting. On the other hand, reviewers also raised issues with the writing quality and with the significance of the new results in practice, especially in the absence of any numerical experiments to substantiate the authors' claims. Although the authors promised to include numerical experiments in the next revision, it is too early to judge whether the latter will be successful. As such, opinions remained split after the rebuttal.
The paper has thus generated a lot of discussion during the AC-Reviewer discussion phase, which cleared up some of the confusion from the review phase. In my opinion, while I appreciate the theoretical contribution of proposing a simple scheme to achieve the best rate robust to different quantiles of delay, I found the paper's positioning and presentation to be suboptimal. As it stands, the main contribution of the paper is a theoretical one. The current presentation may mislead (some of) its readers into looking for its value from the more practical perspective, yet this is one of its main weaknesses. At the moment, due to the lack of numerical experiments, it cannot be judged whether the proposed algorithms are practical.
As such, I am recommending a "Weak Accept" for the paper, where it can be included in the proceedings if there is sufficient space in the program.