PaperHub
ICLR 2024 · Oral
Average rating: 7.0/10 (4 reviewers; lowest 6, highest 8, standard deviation 1.0)
Individual ratings: 8, 6, 8, 6
Average confidence: 3.8

Accelerating Distributed Stochastic Optimization via Self-Repellent Random Walks

OpenReview · PDF
Submitted: 2023-09-20 · Updated: 2024-03-06
TL;DR

In distributed learning, we present the SA-SRRW algorithm, which prioritizes lesser-visited nodes while discouraging frequently visited ones, and show its performance improvement.

Abstract

We study a family of distributed stochastic optimization algorithms where gradients are sampled by a token traversing a network of agents in random-walk fashion. Typically, these random walks are chosen to be Markov chains that asymptotically sample from a desired target distribution, and play a critical role in the convergence of the optimization iterates. In this paper, we take a novel approach by replacing the standard *linear* Markovian token by one which follows a *non-linear* Markov chain - namely the Self-Repellent Random Walk (SRRW). Defined for any given 'base' Markov chain, the SRRW, parameterized by a positive scalar $\alpha$, is less likely to transition to states that were highly visited in the past, thus the name. In the context of MCMC sampling on a graph, a recent breakthrough in Doshi et al. (2023) shows that the SRRW achieves an $O(1/\alpha)$ decrease in the asymptotic variance for sampling. We propose the use of a 'generalized' version of the SRRW to drive token algorithms for distributed stochastic optimization in the form of stochastic approximation, termed SA-SRRW. We prove that the optimization iterate errors of the resulting SA-SRRW converge to zero almost surely and prove a central limit theorem, deriving the explicit form of the resulting asymptotic covariance matrix corresponding to iterate errors. This asymptotic covariance is always smaller than that of an algorithm driven by the base Markov chain and decreases at rate $O(1/\alpha^2)$ - the performance benefit of using SRRW thereby *amplified* in the stochastic optimization context. Empirical results support our theoretical findings.
Keywords

Distributed Learning · Self-Repellent Random Walk · Token Algorithm · Central Limit Theorem · Asymptotic Analysis

Reviews and Discussion

Review
Rating: 8

This paper studied a family of distributed stochastic optimization algorithms where gradients are sampled by a token traversing a network of agents using the Self-Repellent Random Walk (SRRW).

This paper generalized the previous SRRW by introducing an additional iterate to update the empirical distribution, so that the resulting algorithm shares a similar pattern with two-time-scale stochastic approximation (SA).

The author then provides some theoretical understanding of the proposed algorithm, including

  • (1) almost sure convergence of the two iterates
  • (2) a central limit theorem (CLT) for the parameter $\theta_n$
  • (3) an order analysis of the asymptotic variance matrices for three different time-scale choices.

Strengths

The paper is well-written and solid. I like reading the paper.

The paper contains rich theoretical analysis, which investigates the properties of the proposed methods in depth. The acceleration effect of the proposed method is indeed new and interesting (at least to me).

Most of the proofs in the appendix seem to be correct, but I didn't check very carefully.

The simulation validates the theoretical findings in the paper.

Weaknesses

A concern in my mind is that the paper assumes $X$ takes values in a finite state space. Decentralized optimization where only a single agent is active at each time is a particular example. Are there any other examples where $X$ takes only finitely many values?

In my opinion, for a general SA method, the randomness or noise can take various forms. Note that the Poisson method used in the proof is not limited to discrete randomness. Is it possible to extend the analysis to a more general setting where $X$ could be a continuous random variable? If not, where is the main difficulty?

Questions

  1. In the paragraph below (3), the authors try to explain why SRRW gets its name. More specifically, they say ''if node $j$ has been visited more often, so far, the entry $x_j$ becomes larger (than target value $\mu_j$)''. I didn't understand the logic behind it. If node $j$ has been visited more often, I think $x_j$, as a coordinate of a distribution vector, should be close to $\mu_j$. I can't tell whether $x_j$ is larger than $\mu_j$ or not.

However, I acknowledge that the nonlinear kernel in (3) would encourage the algorithm to visit the states that are less visited, based on its expression.

  2. The newly introduced iterate in (4b) generalizes the original algorithm and allows us to treat the resulting algorithm as a two-time-scale SA. However, I didn’t see much motivation to introduce the new iterate. Could the author elaborate more on the motivation?

  3. If we set $\alpha=\infty$, from (6), the variance matrix would be zero. Could I say, in this case, that $x_n$ actually converges to $\mu$ at a faster rate than $\gamma_n$, so that the CLT degenerates to zero? How to explain the intuition behind it?

  4. From the simulation, it seems the larger $\alpha$, the faster the convergence. Is there any suggestion for a practical choice of $\alpha$?

  5. What if $\alpha$ is also updated? For example, we use $\alpha_t$ at iterate $t$, increasing at a carefully selected rate chosen in advance (or perhaps updated using a third iterate). Can we accelerate the convergence rate of $\theta_n$ so that its mean square error decays faster than $\beta_n$?

I will increase my score to 8 if most of my questions are addressed.

------------------------------------ Post rebuttal --------------

Thanks for the authors' clarification. Most of my concerns and questions are well addressed, so I increase my score to 8. However, there are still two points on which I don't agree with the authors.

One point I disagree with is that I think when $\alpha \to \infty$, the kernel (3) is still well defined. In this case, I think $K_{ij}(x) = 1$ if and only if $x_j/\mu_j = \min_{l} x_l/\mu_l$. As you can see, the kernel reduces to a deterministic transition. I guess this reduction makes the system no longer random, so that the CLT degenerates. One property of this degeneration is that the variance matrix would converge to zero. This is actually a good property from my perspective, because it means that one could get a convergence rate faster than the step size.

Let's do a simpler thought experiment. Let $X_t \sim \sqrt{\beta_t} \cdot N(0, c_t)$. If the variance $c_t$ is non-zero (so that $c_t \to 1$ w.l.o.g.), then the MSE of $X_t$ is dominated by the variance and is of order $\beta_t$. If the variance is (nearly) zero (say $c_t \to 0$), then the MSE is dominated by a higher-order term, and $E|X_t|^2$ converges faster than $\beta_t$.
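Written out, with $X_t = \sqrt{\beta_t}\, Z_t$ and $Z_t \sim N(0, c_t)$, the claim is just

$$\mathbb{E}|X_t|^2 = \beta_t\,\mathbb{E}[Z_t^2] = \beta_t c_t = \begin{cases} \Theta(\beta_t), & c_t \to 1, \\ o(\beta_t), & c_t \to 0. \end{cases}$$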

Back to this paper, I guess the counterpart of $c_t$ in this paper is $1/\alpha$. So, if we let $\alpha \to \infty$, we would have $c_t \to 0$. This is the reason why I asked whether increasing $\alpha$ would speed up the convergence. The authors' response works only when the optimal $\alpha^{\star}$ exists and is finite. However, the particular case I am interested in is when $\alpha^{\star} = \infty$. I think the case $\alpha^{\star} = \infty$ is worth a more detailed examination. It is fine not to discuss this in this paper.

Comment

(Q4). From the simulation, it seems the larger $\alpha$, the faster the convergence. Is there any suggestion for a practical choice of $\alpha$?

Answer: In practice, we recommend using moderate values of $\alpha$, e.g., $\alpha=5$. Since the covariance in case (i) decreases at a rate of $O(1/\alpha^2)$, even moderate values of $\alpha$ are sufficient to achieve small errors, e.g., $\alpha=5$ in Figure 2(c). Moreover, it's important to consider the potential drawbacks of setting $\alpha$ too large. As illustrated in Figure 2(b), larger $\alpha$ values could slow down convergence in the early stage of the training process. For example, when $\alpha$ is set to $10$ or $20$, performance is slightly inferior to $\alpha=5$ until $n\approx 5\times 10^4$.

(Q5). What if $\alpha$ is also updated? For example, we use $\alpha_n$ at iterate $n$, increasing at a carefully selected rate chosen in advance (or perhaps updated using a third iterate). Can we accelerate the convergence rate of $\theta_n$ so that its mean square error decays faster than $\beta_n$?

Answer: Assuming $\alpha_n$ is updated through a third iterate and converges to some finite value $\alpha^*$, we speculate that the asymptotic rate of the MSE cannot be faster than $\beta_n$. Our current analysis, as detailed in Theorem 3.3, addresses scenarios with a fixed $\alpha$ value. Here, the influence of the SRRW sequence $\{X_n\}$ is encapsulated within the matrix $\mathbf{U}$, which contributes to the asymptotic covariance $\mathbf{V}_{\theta}$ of $\theta_n$. Introducing a time-varying $\alpha_n$ adds a new dimension to the system, potentially transforming the SA-SRRW algorithm, which we analyze via the two-timescale SA setting, into a three-timescale SA. The CLT analysis of such a complex scenario remains unexplored and presents an interesting avenue for future research. Intuitively speaking, in the asymptotic regime, as $\alpha_n$ already approaches $\alpha^*$, the behavior of the sequence $\{X_n\}$ would likely be similar to that driven by SRRW with constant $\alpha^*$. This means that while varying $\alpha_n$ impacts the asymptotic covariance $\mathbf{V}_{\theta}$ through the matrix $\mathbf{U}$, the convergence rate of $\theta_n$, determined by $\beta_n^{-1/2}$, would remain unchanged. Hence, even with a dynamic $\alpha_n$, the MSE of $\theta_n-\theta^*$ is still expected to converge at a rate tied to $\beta_n$, according to our current theoretical framework.

(Q6). Are there any other examples where $X$ takes finitely many values? In my opinion, for a general SA method, the randomness or noise can take various forms. Note that the Poisson method used in the proof is not limited to discrete randomness. Is it possible to extend the analysis to a more general setting where $X$ could be a continuous random variable? If not, where is the main difficulty?

Answer: Our SA-SRRW algorithm is versatile and not limited to decentralized optimization employing the token algorithm only. It is equally applicable to scenarios in stochastic optimization where the entire dataset is fully accessible. This application is particularly prevalent in the machine learning domain, such as in training neural networks using SGD or other accelerated algorithms. In these instances, the dataset forms a complete graph with self-loops, so that $X$ can be independently and identically sampled from the dataset with marginal distribution $\mu$, implying that these instances can also be transformed into decentralized optimization in the broader sense. Here, the baseline Markov chain $\mathbf{P}$, as indicated in the SRRW kernel (3), becomes a rank-1 matrix $\mathbf{1}\boldsymbol{\mu}^T$.

Meanwhile, $X$ cannot be a continuous random variable. The limitation here is not related to the Poisson method used in our proofs but is inherent to the structure of the SRRW employed in the SA method. Specifically, the concept of self-repellence in SRRW requires counting the visits to each state $X$ to form the empirical measure $\mathbf{x}_n$ at time $n$. This count is crucial for designing the SRRW kernel $\mathbf{K}[\mathbf{x}_n]$. In a continuous state space, such a counting mechanism becomes infeasible since the chance of sampling the same $X$ in a continuous state space is measure zero, posing a significant challenge to the design and implementation of a continuous version of SRRW in our algorithm.

Comment

We appreciate the reviewer for the detailed comments. Please find our replies to your questions below.

(Q1). In the paragraph below (3), the authors try to explain why SRRW gets its name. More specifically, they say ''if node $j$ has been visited more often so far, the entry $x_j$ becomes larger (than target value $\mu_j$)''. I didn't understand the logic behind it. If node $j$ has been visited more often, I think $x_j$, as a coordinate of a distribution vector, should be close to $\mu_j$. I can't tell whether $x_j$ is larger than $\mu_j$ or not. However, I acknowledge that the nonlinear kernel in (3) would encourage the algorithm to visit the states that are less visited, based on its expression.

Answer: Your interpretation ''$x_j$ should be close to $\mu_j$'' is indeed accurate in the context of long-term behavior, where the empirical distribution $\mathbf{x}$ converges towards the target distribution $\mu$ almost surely. However, the point of confusion seems to arise from our description 'if node $j$ has been visited more often so far'. To clarify, when saying 'node $j$ has been visited more often', we mean that node $j$ has been visited more often than other nodes (and we have improved this explanation in the revision). In this context, according to the definition of the empirical distribution, the entry $[\mathbf{x}_n]_j$ - representing the proportion of visits to node $j$ at time $n$ - can be (temporarily) larger than its target value $\mu_j$. In this case, node $j$ will be chosen with smaller probability at time $n+1$ by the SRRW kernel with $\mathbf{x}_n$ in place of $\mathbf{x}$ in (3). Note that, eventually, from the SRRW convergence result, all these $[\mathbf{x}_n]_j$ will converge to the target $\mu_j$ as $n$ increases, as you pointed out already. In some sense, SRRW tries to exploit such temporal fluctuation (while keeping the stationary behavior the same) to 'force' the random walker to reduce such temporal fluctuation (deviation of $\mathbf{x}$ at time $n$ from the target $\mu$), which translates to smaller sampling variance.
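For concreteness, a single SRRW transition can be sketched in a few lines, assuming the kernel form of (3), $K_{ij}(\mathbf{x}) \propto P_{ij}\,(x_j/\mu_j)^{-\alpha}$; the variable names below are illustrative:

```python
import numpy as np

def srrw_step(i, x, P, mu, alpha, rng):
    """One SRRW transition from node i given the empirical measure x.

    Assumes the kernel form of (3): K_ij(x) is proportional to
    P_ij * (x_j / mu_j)^(-alpha), so a node j whose visit fraction x_j
    temporarily exceeds its target mu_j is chosen with smaller probability.
    x must lie in the interior of the simplex (all entries positive).
    """
    weights = P[i] * (x / mu) ** (-alpha)  # unnormalized row K_i.[x]
    return rng.choice(len(mu), p=weights / weights.sum())
```

With $\alpha=0$ this reduces to the base chain $\mathbf{P}$; larger $\alpha$ repels the walker more strongly from nodes whose visit counts run ahead of $\mu$.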

(Q2). The newly introduced iterate in (4b) generalizes the original algorithm and allows us to treat the resulting algorithm as a two-time-scale SA. However, I didn’t see much motivation to introduce the new iterate. Could the author elaborate more on the motivation?

Answer: We thank the reviewer for commenting on the SRRW iterate (4b). This SRRW iterate generalizes the step size $\gamma_n$ from $1/(n+1)$ to a general form $1/(n+1)^a$ for $a\in(0.5,1]$, which is crucial for enhancing the algorithm's flexibility and performance. This generalization allows for the exploration of step sizes $\beta_n=1/(n+1)^b = o(\gamma_n)$ (case (i)), which is suggested as the best algorithmic choice among all three cases in simulation. Without this generalization, i.e., if we stick to the original step size of $\gamma_n=1/(n+1)$ as in [Doshi et al. 2023], case (i) becomes vacuous, as it is impossible to choose a step size $\beta_n=o(1/n)$ (such a step size becomes summable, fundamentally violating the condition for the step size in any stochastic approximation setting). To reflect this response, we have added a new footnote #7 in the revision to elaborate on the motivation for the generalized step size in the SRRW iterates.

Doshi, V., Hu, J., & Eun, D. Y. (2023). Self-Repellent Random Walks on General Graphs--Achieving Minimal Sampling Variance via Nonlinear Markov Chains. International Conference on Machine Learning. PMLR, 2023.
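For concreteness, a minimal sketch of the full SA-SRRW loop for case (i) might look as follows; the kernel row follows (3), while the function names, gradient oracle, and exponents $a=0.8$, $b=0.9$ are illustrative choices rather than the paper's implementation:

```python
import numpy as np

def sa_srrw(grad, theta0, P, mu, alpha, n_iters, a=0.8, b=0.9, seed=0):
    """Sketch of SA-SRRW: the token X_n is drawn from the SRRW kernel K[x_n],
    the empirical measure x_n is updated on the faster timescale
    gamma_n = 1/(n+1)^a (the SRRW iterate (4b)), and the optimization iterate
    theta_n on the slower timescale beta_n = 1/(n+1)^b; b > a gives case (i),
    i.e., beta_n = o(gamma_n). grad(theta, i) is agent i's local gradient.
    """
    rng = np.random.default_rng(seed)
    N = len(mu)
    x = np.full(N, 1.0 / N)              # start in the interior of the simplex
    theta = np.asarray(theta0, dtype=float).copy()
    X = int(rng.integers(N))
    for n in range(1, n_iters + 1):
        gamma, beta = (n + 1.0) ** -a, (n + 1.0) ** -b
        w = P[X] * (x / mu) ** (-alpha)  # SRRW kernel row K_X.[x_n] from (3)
        X = int(rng.choice(N, p=w / w.sum()))
        e_X = np.zeros(N)
        e_X[X] = 1.0                     # delta measure at the token position
        x += gamma * (e_X - x)           # SRRW iterate (4b)
        theta -= beta * grad(theta, X)   # SA iterate, with H(theta, i) = -grad(theta, i)
    return theta, x
```

Since each update of $x$ is a convex combination with weight $\gamma_n < 1$, the iterate stays in the interior of the simplex, which the truncation-free sketch above implicitly relies on.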

(Q3). If we set $\alpha=\infty$, from (6), the variance matrix would be zero. Could I say, in this case, that $\mathbf{x}_n$ actually converges to $\mu$ at a faster rate than $\gamma_n$, so that the CLT degenerates to zero? How to explain the intuition behind it?

Answer: It is crucial to clarify that our CLT results, as presented in Theorem 3.3, are specifically tailored to any 'given' $\alpha < \infty$, where $\mathbf{x}_n$ converges to $\mu$ at the exact asymptotic rate $\gamma_n$ and the resulting covariance tends to zero for larger $\alpha$ without actually reaching it. It's important to understand that our analysis does not extend to the case of $\alpha=\infty$, since the SRRW kernel in (3) is then undefined. Consequently, the behavior of $\mathbf{x}_n$ in such a setting, including its convergence rate to $\mu$ and the resulting asymptotic covariance, falls outside the scope of our theoretical framework.

Review
Rating: 6

This paper deals with distributed stochastic optimization, with tokens sampled by a self-repelling Markov chain. The authors prove the law of large numbers and a central limit theorem for the optimization iterate errors, which are refinements of the recent work of Doshi et al. The proposed algorithm achieves a $1/\alpha^2$ rate, and the theoretical results are corroborated by the empirical study.

Strengths

The paper rigorously studies distributed stochastic optimization using self-repelling random walks. The paper is well-written, and I enjoyed reading it. I have also checked most theoretical results, and they appear to be correct. The rate $1/\alpha^2$ also appears to be new.

Weaknesses

The paper is mostly motivated by the recent progress of Doshi et al., and it is not clear what the "main" novelty of this paper is compared to the previous work (fundamentally). It seems to me that most results are refinements of Doshi et al. (while I agree that the setting is slightly different). The authors may add a paragraph to highlight the main "technical" novelty of this paper.

Also, the parameter $\alpha$ measures the "heaviness" of the self-repelling random walk, which is basically a hyper-parameter. It seems a bit strange to quantify the errors using $\mathrm{Poly}(1/\alpha)$ (provided that one sends $\alpha \to \infty$, and I do have concerns about the real application in this regime). Also, the idea of using self-repelling random walks to accelerate optimization is not new; see https://arxiv.org/abs/2005.04507 (in which it was shown to outperform many existing algorithms). The authors may think of citing the work and some references therein.

Questions

See the Weakness.

Details of Ethics Concerns

Not available.

Comment

Thank you for your comments and for the time and effort you put into reading, understanding and evaluating our paper. We now answer the question posed by the reviewer.

(Q1). The paper is mostly motivated by the recent progress of Doshi et al., and it is not clear what the "main" novelty of this paper is compared to the previous work (fundamentally). It seems to me that most results are refinements of Doshi et al. (while I agree that the setting is slightly different). The authors may add a paragraph to highlight the main "technical" novelty of this paper.

Answer: While Doshi et al.'s analysis is foundational, it primarily utilizes existing CLT results for single-timescale SA with controlled Markov noise. Our work, in contrast, provides a first-of-its-kind CLT result for the more complex two-timescale SA with controlled Markov noise (step-size cases (i) and (iii) fall into this category), marking a significant advancement. This CLT result is obtained via our original analysis detailed in Appendices D.2 and D.3, and does not build upon the analysis in Doshi et al. Without this, we could not theoretically show the asymptotic covariance of the SA iterates $\theta_n$ and prove that case (i) achieves an $O(1/\alpha^2)$ rate for its covariance and outperforms case (iii). The significance of this analysis is underscored right after Theorem 3.3 in our original paper, where we outline the technical challenges encountered when dealing with controlled Markov noise. In response to your valuable feedback, we have revised Contribution #2 in the introduction to delineate our technical innovations more clearly.

(Q2). Also, the parameter $\alpha$ measures the "heaviness" of the self-repelling random walk, which is basically a hyper-parameter. It seems a bit strange to quantify the errors using $\mathrm{Poly}(1/\alpha)$ (provided that one sends $\alpha\to\infty$, and I do have concerns about the real application in this regime).

Answer: Thank you for highlighting the concerns regarding the hyper-parameter $\alpha$. We would like to clarify its role and implications in our analysis in the following three aspects:

  • Role of $\alpha$ in Asymptotic Covariance: Our CLT results in Theorem 3.3 enable us to quantify how the asymptotic covariance of the error varies with $\alpha\geq 0$, for any given finite $\alpha$. It is important to note that the CLT analysis does not extend to the case of $\alpha\to\infty$, even though the form of the asymptotic covariance matrix allows one to evaluate it at $\alpha = \infty$. This is because setting $\alpha$ to infinity would disrupt the continuity of the SRRW kernel $\mathbf{K}[\mathbf{x}]$ as stated in (3), and it would no longer be well-defined for any $\mathbf{x}\in\text{Int}(\Sigma)$. The continuity of $\mathbf{K}[\mathbf{x}]$ with respect to $\mathbf{x}$ is essential in our CLT analysis.

  • Practical Implications of $\alpha$: In real-world applications, using large values of $\alpha$ is not necessary. Our findings show that the asymptotic covariance in case (i) decreases at a rate of $O(1/\alpha^2)$. This suggests that even moderate values of $\alpha$ are sufficient to achieve small errors. For instance, in our empirical results shown in Figure 2(c), setting $\alpha$ to $5$ already yields satisfactory performance. Increasing $\alpha$ from $5$ to $20$ does not result in significant improvements in mean square error (MSE), as the MSE was already small enough for $\alpha=5$.

  • Drawbacks of Excessively Large $\alpha$: It's also important to consider the potential drawbacks of setting $\alpha$ too large. As illustrated in Figure 2(b), larger $\alpha$ values can actually slow down convergence in the initial stages of the training process. For example, when $\alpha$ is set to $10$ or $20$, performance is slightly inferior to $\alpha=5$ until $n\approx 5\times 10^4$.

Comment

I thank the authors for the detailed response. The score is increased to 6.

Comment

(Q3). Also, the idea of using self-repelling random walks to accelerate optimization is not new; see https://arxiv.org/abs/2005.04507 (in which it was shown to outperform many existing algorithms). The authors may think of citing the work and some references therein.

Answer: We appreciate the reviewer directing our attention to Guo et al.'s work using a 'self-repelling random walk' to accelerate optimization. We have included this reference in footnote #3 of our revised manuscript to acknowledge its relevance to our research domain.

Our study never claims to be the first paper to incorporate the concept of 'self-repellence'. Additionally, our study diverges significantly from Guo et al.'s work in its application and theoretical framework. The novelty of our work lies in the strategic control of the noise sequence $\{X_n\}$ in the token algorithm (4c) within the context of distributed learning adhering to arbitrary graphical constraints (general graphs), which is not a consideration in Guo et al.'s study, where the underlying graph is more of a 1-D nature (increasing vs. decreasing, instead of choosing one of the neighbors in an arbitrary graph). Our approach also leverages SRRW from the Markov chain Monte Carlo (MCMC) literature in controlling the noise sequence $\{X_n\}$ in the function $H(\theta_n,X_{n+1})$ as outlined in (4c). This is markedly different from Guo et al.'s approach, which integrates self-repellence into the gradient descent iterates themselves for escaping saddle points.

Finally, it's important to note that our version of self-repellence and that of Guo et al. are not directly comparable, as they serve different purposes within their respective algorithms; e.g., one is about perturbing $\theta_n$, and the other is about the choice of $X_{n+1}$ in the function $H(\theta_n,X_{n+1})$. However, an intriguing possibility is the combination of both versions of self-repellence to improve algorithmic performance.

Review
Rating: 8

The present paper considers a family of stochastic algorithms of the following form. There exists a finite space $\mathcal{N}$ and some kind of "random walk" $\{X_n\}_{n\in\mathbb{N}}$ over this set. One then takes $\theta_0\in\mathbb{R}^d$ and a function $H:\mathbb{R}^d\times \mathcal{N}\to \mathbb{R}^d$ and sets

$$\theta_{n+1} = \theta_n + \beta_n\,H(\theta_n,X_n).$$

One particular example is when $X_n\sim \mu$ and $H(\theta,i) = -\nabla f(\theta,i)$ for some $f$. Then the above is basically a form of stochastic gradient descent for the function

$$F(\theta) := \sum_i \mu_i\,f(\theta,i).$$

The authors say that $X_n$ may be thought of as a "token", so they are considering token distributed optimization methods.
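A minimal sketch of this baseline token iteration (illustrative names; $P$ is the base chain with stationary distribution $\mu$):

```python
import numpy as np

def token_sgd(grad_f, theta0, P, n_iters, seed=0):
    """Token algorithm: X_n walks on the base Markov chain P (stationary
    distribution mu) and theta_n is updated with H(theta, i) = -grad f(theta, i),
    i.e., SGD on F(theta) = sum_i mu_i * f(theta, i)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    X = 0
    for n in range(n_iters):
        X = int(rng.choice(P.shape[0], p=P[X]))  # linear Markovian token move
        theta -= grad_f(theta, X) / (n + 1)      # beta_n = 1/(n+1)
    return theta
```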

The authors are especially interested in the case where $X_n$ is a stochastic process called a "self-repellent random walk". Basically, this kind of RW has asymptotic distribution $\mu$, but it is designed to be less likely to jump to states that have already been visited many times. A recent paper by Doshi et al. showed that such random walks lead to smaller-variance Monte Carlo estimates of expectations with respect to $\mu$. In fact, the variance in this case decreases like $\alpha^{-1}$, where $\alpha>0$ measures the amount of self-repulsion.

The present paper builds on this to show a remarkable result. Namely, under suitable conditions, $\theta_n$ converges to a fixed point $\theta_*$ (in expectation) of the iteration, and the variance of $\theta_n - \theta_*$ decreases as $\alpha$ grows. In fact, the variance goes down like $1/\alpha^2$ for large $\alpha$. In particular, this implies that self-repellent walks are "even better" for optimization or fixed point computations than for Monte Carlo integration. However, this main result requires tuning $\beta_n$ to be smaller than a parameter describing the rate of change of the "self-repulsion measure". In fact, the authors show that, when this is not the case, it may be that a large $\alpha$ will not help at all. The paper's proofs build on Doshi et al. and rely on stochastic approximation methods.

Strengths

The paper describes a surprising phenomenon -- even considering the paper by Doshi et al. It gives precise asymptotic results. The small simulation study observes gains in finite time. The paper is very clear and convincing.

Weaknesses

All results are asymptotic. It is not clear (theoretically) whether there are gains in finite time, before, e.g., the token can visit most elements of $\mathcal{N}$. The analysis requires some a priori assumptions about the ODE invoked for stochastic approximation. The analysis is heavily indebted to Doshi et al.

Questions

I only have one very open-ended question: is there any hope of getting finite sample bounds?

Comment

Our sincere thanks for the in-depth review of our paper. We now answer the question posed by the reviewer.

(Q1). I only have one very open-ended question: is there any hope of getting finite sample bounds?

Answer: We think it's very unlikely that a finite sample bound can be obtained for our SA-SRRW algorithm. In the new Appendix B in the revision, we briefly demonstrate that the state-of-the-art non-asymptotic analyses all require a globally Lipschitz mean field function for both iterates to derive finite sample bounds. However, for our SA-SRRW algorithm (4), the mean field function $\boldsymbol{\pi}(\mathbf{x})-\mathbf{x}$ in the SRRW iterates (4b) is only locally Lipschitz continuous in $\mathbf{x}\in\text{Int}(\Sigma)$ (i.e., the interior of the probability simplex), and we provide a detailed explanation of this local Lipschitzness in Appendix B. There does not exist a uniformly bounded Lipschitz constant for $\boldsymbol{\pi}(\mathbf{x})-\mathbf{x}$ over $\mathbf{x}\in\text{Int}(\Sigma)$. Thus, in this context, our SA-SRRW algorithm is unlikely to admit finite sample bounds. Indeed, we believe that deriving any usable finite sample bound in this general area of two-timescale SA with controlled Markov noise for non-globally-Lipschitz kernels would pose an important future direction in the literature, and is clearly much beyond the scope of our paper.

Review
Rating: 6

This paper introduces a new distributed stochastic optimization algorithm, where the random walk guiding the sampling of gradients across the network uses the Self-Repellent Random Walk (SRRW), a recently introduced family of nonlinear Markov chains. For MCMC sampling on a graph, SRRW has been shown to achieve an $O(1/\alpha)$ decrease in the asymptotic variance [Doshi et al., ICML 2023], where $\alpha$ is a hyperparameter of the chain; roughly speaking, given any base Markov chain, SRRW works by preferring transitions to states that were less visited in the past, and the larger the $\alpha$, the stronger this preference. Building on this insight, the authors in this paper use SRRW as the "token" gradient sampler in a distributed stochastic optimization setting.

Strengths

The main contributions are as follows. Very interestingly, the authors show that given the "right" step-size setting (i.e., when the step size $\beta_n$ for the optimization parameter $\theta_n$ is $o(\gamma_n)$, where $\gamma_n$ is the step size for the update of the weighted empirical distribution $x_n$), the asymptotic variance of $\theta_n$ (defined via $\frac{\theta_n - \theta^*}{\sqrt{\beta_n}}$) actually decays at the rate $O(1/\alpha^2)$, which is better than the $O(1/\alpha)$ decay rate for the (un)weighted empirical distribution $x_n$ in [Doshi et al., 2023]. An important result the authors show is that the asymptotic step-size ratio between $\beta_n$ and $\gamma_n$ is key; when $\gamma_n = \beta_n$ or when $\gamma_n = o(\beta_n)$, the asymptotic variance is only $O(1/\alpha)$. The key technical tool that achieves this is a novel analysis of two-timescale SA with controlled Markov noise. The theoretical results are also well-supported by simulation results.

Overall, this is a well-written paper, and the idea of adding SRRW to distributed stochastic optimization is nice. In addition, the theoretical result about the influence of the step-size ratio on the asymptotic covariance decay rate is also very interesting.

Weaknesses

See the questions below.

Questions

There are some questions that I hope the authors can address.

  1. Could the authors comment a little bit more on the stability of the iterates? This is assumed to always hold, but as the authors point out, practical tricks are required to ensure this, in which case the theoretical analysis might also need tweaks.
  2. Are there any tradeoffs in picking $\alpha$? Should $\alpha$ always be as large as possible, or will that affect other aspects of the algorithm, e.g. stability?
  3. While the analysis is on asymptotic covariance, could the authors comment on the likely finite-time convergence rate behavior of their SRRW-SA algorithm, and whether improvements (due to the SRRW) can also be shown in terms of finite-time convergence rates?
  4. The authors suggest that $\beta_n = o(\gamma_n)$ is the best step-size choice if an $O(1/\alpha^2)$ decay rate in the asymptotic covariance is desired. However, if faster convergence is desired, a larger learning rate (i.e. larger $\beta_n$) should always be picked, i.e. we would want to pick the maximal possible step size allowed by the assumptions in the paper, i.e. $\beta_n = O(1/n^{0.5+\epsilon})$ for some $\epsilon > 0$ as close to 0 as possible, in which case $\gamma_n$ can at best match the step size of $\beta_n$, meaning that we can only get $O(1/\alpha)$ in the asymptotic covariance decay again. Could the authors comment on this possible issue?
  5. It would be nice if the authors provided some intuition as to why the $\beta_n = o(\gamma_n)$ step-size ratio can yield the $O(1/\alpha^2)$ rate. Moreover, it would be nice if the authors spelled out precisely the dependence on the difference between the step-size exponents in their asymptotic rates (i.e. if $\beta_n = n^{-b}$ and $\gamma_n = n^{-a}$, how the rate actually depends on the difference between $a$ and $b$). If this difference does not affect the asymptotic rate, it would be nice to provide insight into how this difference might play out in practice.
Comment

(Q5). It would be nice if the authors provided some intuition as to why the $\beta_n=o(\gamma_n)$ step-size ratio can yield the $O(1/\alpha^2)$ rate. Moreover, it would be nice if the authors spelled out precisely the dependence on the difference between the step-size exponents in their asymptotic rates (i.e. if $\beta_n=n^{-b}$ and $\gamma_n=n^{-a}$, how the rate actually depends on the difference between $a$ and $b$). If this difference does not affect the asymptotic rate, it would be nice to provide insight into how this difference might play out in practice.

Answer: For $\beta_n=o(\gamma_n)$ in case (i), the impact of the SRRW iterates on the SA iterates is primarily reflected in the correlation terms $\mathbf{J}_{12}(\alpha)$ and $\mathbf{J}_{22}(\alpha)$, as detailed in equation (55) and Lemma D.3 in Appendix D.2.2. These terms play a pivotal role in shaping the matrix $\mathbf{U}_{\theta}(\alpha)$ as follows:

$$\mathbf{U}_{\theta}(\alpha)=\mathbf{U}_{22}-\mathbf{U}_{21}(\mathbf{J}_{12}(\alpha)\mathbf{J}_{22}(\alpha)^{-1})^T-\mathbf{J}_{12}(\alpha)\mathbf{J}_{22}(\alpha)^{-1}\mathbf{U}_{12}+\mathbf{J}_{12}(\alpha)\mathbf{J}_{22}(\alpha)^{-1}\mathbf{U}_{11}(\mathbf{J}_{12}(\alpha)\mathbf{J}_{22}(\alpha)^{-1})^T,$$

where $\mathbf{U}_{ij}$ is defined in (9) for $i,j\in\{1,2\}$. Through algebraic computations, we derive the $O(1/\alpha^2)$ rate for the matrix $\mathbf{U}_{\theta}(\alpha)$ and in turn for the asymptotic covariance $\mathbf{V}_{\theta}^{(1)}(\alpha)$.

Regarding the difference in step-size exponents ($a$ and $b$ in $\gamma_n=n^{-a}$ and $\beta_n=n^{-b}$), our main CLT results in Theorem 3.3 indicate that for all three cases, the asymptotic covariances remain the same irrespective of the variations in $a$ and $b$. The key change lies in the asymptotic rates $\beta_n^{1/2}$ and $\gamma_n^{1/2}$ of the iterates. In practice, we indeed observe that for a fixed value of $b$, the convergence speed in the initial training phase is affected by the value of $a$. Take $b=0.9$ as an example. Choosing $a$ close to $0.5$ results in the slowest convergence, since the SRRW iterates $\mathbf{x}_n$ vary drastically in the initial period, potentially deviating from the target distribution $\boldsymbol{\mu}$ and introducing a large bias into the SA iterates $\theta_n$. As $a$ increases to $0.8$, we note the fastest convergence in case (i) in our simulation setup. However, further increases in $a$ (beyond $0.8$) gradually slow down the convergence, since such choices of $a$ get closer to cases (ii) and (iii). This empirical observation suggests that in practice, as $a$ increases in the range $(0.5,b]$, the convergence becomes faster but then decelerates beyond a certain point close to $b$ in the initial period. This necessitates careful fine-tuning of the step size in case (i) to achieve optimal performance.

Comment

(Q3). While the analysis is on asymptotic covariance, could the authors comment on the likely finite-time convergence rate behavior of their SRRW-SA algorithm, and whether improvements (due to the SRRW) can also be shown in terms of finite-time convergence rates?

Answer: The current non-asymptotic analysis of two-timescale SA with controlled Markov noise cannot be applied to our SA-SRRW algorithm. Specifically, all the state-of-the-art non-asymptotic analyses require that the mean field functions of both iterates are globally Lipschitz [7-9]. This global Lipschitzness condition is in place even for finite-time bounds in single-timescale SA [10-12]. However, the mean field function $\boldsymbol{\pi}(\mathbf{x})-\mathbf{x}$ of the SRRW iterates (4b) exhibits only local Lipschitz continuity, where $\mathbf{x}$ is in the interior of the probability simplex. This is due to the polynomial form $(x_i/\mu_i)^{-\alpha}$ inside $\boldsymbol{\pi}(\mathbf{x})$, which was elaborated on in Appendix D of [2] as a necessary condition to ensure scale invariance and local information reliance. This distinction is critical, as it places our SA-SRRW algorithm outside the existing non-asymptotic analysis. To address this, we include additional insights in the paragraph following (4) in our revised manuscript and provide a detailed discussion in the new Appendix B.
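To spell out the obstruction: the entries of $\boldsymbol{\pi}(\mathbf{x})$ involve the map $x_i \mapsto (x_i/\mu_i)^{-\alpha}$, and a direct computation gives

$$\frac{d}{dx_i}\left(\frac{x_i}{\mu_i}\right)^{-\alpha} = -\frac{\alpha}{\mu_i}\left(\frac{x_i}{\mu_i}\right)^{-\alpha-1},$$

which is unbounded as $x_i \downarrow 0$. Hence the mean field is Lipschitz on every compact subset of $\text{Int}(\Sigma)$, but no uniform Lipschitz constant exists over all of $\text{Int}(\Sigma)$.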

[7]. Doan, T. T. Finite-time convergence rates of nonlinear two-time-scale stochastic approximation under markovian noise. arXiv preprint arXiv:2104.01627, 2021.

[8]. Zeng, S., Doan, T. T., and Romberg, J. A two-time-scale stochastic optimization framework with applications in control and reinforcement learning. arXiv preprint arXiv:2109.14756, 2021.

[9]. Doan, T. T. Nonlinear two-time-scale stochastic approximation convergence and finite-time performance. IEEE Transactions on Automatic Control, 2022.

[10]. Chen, Z., Maguluri, S. T., Shakkottai, S., & Shanmugam, K. Finite-sample analysis of stochastic approximation using smooth convex envelopes. arXiv preprint arXiv:2002.00874, 2020.

[11]. Sun, T., Sun, Y., & Yin, W. On markov chain gradient descent. Advances in neural information processing systems, 31, 2018.

[12]. Even, M. Stochastic gradient descent under markovian sampling schemes. International Conference on Machine Learning, 2023.

(Q4). The authors suggest that $\beta_n=o(\gamma_n)$ is the best step-size choice if an $O(1/\alpha^2)$ decay rate in the asymptotic covariance is desired. However, if faster convergence is desired, a larger learning rate (i.e. larger $\beta_n$) should always be picked, i.e. we would want to pick the maximal possible step size allowed by the assumptions in the paper, i.e. $\beta_n=O(1/n^{0.5+\epsilon})$ for some $\epsilon>0$ as close to 0 as possible, in which case $\gamma_n$ can at best match the step size of $\beta_n$, meaning that we can only get $O(1/\alpha)$ in the asymptotic covariance decay again. Could the authors comment on this possible issue?

Answer: In our paper, $\beta_n=o(\gamma_n)$ is shown to be the best step-size choice in terms of achieving smaller asymptotic error (captured via the asymptotic covariance). These errors are smallest for choices of $\beta_n=O(1/n)$, and not $O(1/n^{0.5+\epsilon})$ for some close-to-zero $\epsilon$, since the step sizes are also the scaling factors under which the finite asymptotic variance is characterized. Thus, larger learning rates actually lead to larger asymptotic variances as well; even though the initial convergence (to a level set around the solution) could be quicker, larger learning rates result in larger variances (larger level sets) for large enough time. This is also demonstrated by the extreme case of a constant step size, where, without additional averaging, the SGD iterates may converge quickly, but only to a level set (neighborhood) around the solution, and not to the solution itself; it is well known that the size of such a neighborhood around the solution critically depends on the (asymptotic) variance of the iterates. Therefore, selecting the largest possible learning rate is not necessarily the most desirable.

Comment

We thank the reviewer for their detailed comments. Following are our detailed responses to the questions posed.

(Q1). Could the authors comment a little bit more on the stability of the iterates? This is assumed to always hold, but as the authors point out, practical tricks are required to ensure this, in which case the theoretical analysis might also need tweaks.

Answer: We appreciate the reviewer's question regarding the stability of the iterates in our SA-SRRW algorithm, and we recognize the potential impact of practical tricks on our theoretical analysis. The stability of the SRRW iterates $\mathbf{x}_n$ in (4b) is always ensured through the truncation method in [1], where the detailed steps have been shown in Appendix E of [2] with $\gamma_n=1/(n+1)$. This analysis can be extended to the generalized step size $\gamma_n = 1/(n+1)^a$ for $a \in (0.5,1]$ as well, since it satisfies the step size assumption (A1) in [1].

However, the stability of the SA iterates $\theta_n$ presents a more complex challenge. This is due to their dependency on the SRRW iterates $\mathbf{x}_n$. Currently, there is a lack of comprehensive stability analysis for $\theta_n$ within the domain of two-timescale SA with controlled Markov noise. Even the latest result in this area, which provides only almost sure convergence (no CLT result therein), still requires the stability assumption [3]. This situation mirrors the challenges faced in the single-timescale SA literature, where establishing stability conditions often requires significant analytical effort, e.g., Chapter 4 in [4], [1], [5], [6].

Therefore, while we ensure the stability of the SRRW iterates using established methods, the stability of $\theta_n$ under the two-timescale framework remains an open question. Addressing this would not only require modifying our theoretical analysis to accommodate practical algorithmic adjustments but would also contribute significantly to the broader field of two-timescale SA, which we acknowledge as an important aspect of future work.

[1]. Andrieu, C., Moulines, É., & Priouret, P. Stability of stochastic approximation under verifiable conditions. SIAM Journal on control and optimization, 44(1), 283-312, 2005.

[2]. Doshi, V., Hu, J., & Eun, D. Y. Self-Repellent Random Walks on General Graphs--Achieving Minimal Sampling Variance via Nonlinear Markov Chains. International Conference on Machine Learning. PMLR, 2023.

[3]. Yaji, V. G., & Bhatnagar, S. Stochastic recursive inclusions in two timescales with nonadditive iterate-dependent Markov noise. Mathematics of Operations Research, 45(4), 1405-1444, 2020.

[4]. Borkar, V. Stochastic Approximation: A Dynamical Systems Viewpoint: Second Edition. Texts and Readings in Mathematics. Hindustan Book Agency, 2022.

[5]. Fort, G., Moulines, E., Schreck, A., and Vihola, M. Convergence of markovian stochastic approximation with discontinuous dynamics. SIAM Journal on Control and Optimization, 54(2):866–893, 2016.

[6]. Yaji, V. G., & Bhatnagar, S. Analysis of stochastic approximation schemes with set valued maps in the absence of a stability guarantee and their stabilization. IEEE Transactions on Automatic Control,65(3), 1100–1115, 2020.

(Q2). Are there any tradeoffs in picking $\alpha$? Should $\alpha$ always be as large as possible, or will that affect other aspects of the algorithm, e.g. stability?

Answer: While a larger $\alpha$ does indeed favor a smaller asymptotic covariance as per our CLT, there are practical considerations that suggest a more balanced approach to choosing $\alpha$. In practice, a moderate value of $\alpha$ often strikes the right balance between reducing asymptotic covariance and maintaining efficient convergence in the initial phases of training. For instance, as observed in our simulations (Figure 2(b)), we noticed that when $\alpha$ is set to larger values like 10 or 20, the performance initially lags behind that of $\alpha=5$ until about $n\approx 5\times 10^4$ iterations. In addition, the asymptotic covariance in case (i) decreases at a rate of $O(1/\alpha^2)$, implying that even moderate values of $\alpha$ are sufficient to achieve small errors, e.g., $\alpha=5$ in Figure 2(c).

As explained in the previous response, the stability of the SRRW iterates can always be guaranteed through the truncation method in [1] for any $\alpha<\infty$. Thus, although we do not have the stability of $\theta_n$, due to the robust behavior of the SRRW iterates, we speculate that a large $\alpha$ is unlikely to compromise the stability of $\theta_n$.

Comment

Dear Reviewers,

We would like to remind you that the rebuttal period will close in one day. During this phase, we have responded to the reviews and comments/concerns you have provided. Your engagement is crucial in this part of the process. Your insights and expertise are invaluable to ensuring the improvement of our paper. We greatly appreciate your commitment and time dedicated to this important phase.

Thank you once again for your valuable reviews.

Best regards,

Submission #2003 Authors

AC Meta-Review

The paper describes an interesting and surprising phenomenon that appears in the context of an important problem. The consensus of the reviewers is that it should be accepted.

Why Not a Higher Score

N/A

Why Not a Lower Score

The contribution is nice and the phenomenon described (acceleration of distributed optimization with repelling random walks) is something the broader ML community should be aware of.

Final Decision

Accept (oral)