PaperHub
Score: 6.8/10 · Poster · 4 reviewers (lowest 6, highest 7, standard deviation 0.4)
Ratings: 7, 7, 6, 7
Confidence: 4.0 · Correctness: 3.3 · Contribution: 3.3 · Presentation: 3.3
NeurIPS 2024

Efficient Sign-Based Optimization: Accelerating Convergence via Variance Reduction

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-12-19

Abstract

Sign stochastic gradient descent (signSGD) is a communication-efficient method that transmits only the sign of stochastic gradients for parameter updating. Existing literature has demonstrated that signSGD can achieve a convergence rate of $\mathcal{O}(d^{1/2}T^{-1/4})$, where $d$ represents the dimension and $T$ is the iteration number. In this paper, we improve this convergence rate to $\mathcal{O}(d^{1/2}T^{-1/3})$ by introducing the Sign-based Stochastic Variance Reduction (SSVR) method, which employs variance reduction estimators to track gradients and leverages their signs to update. For finite-sum problems, our method can be further enhanced to achieve a convergence rate of $\mathcal{O}(m^{1/4}d^{1/2}T^{-1/2})$, where $m$ denotes the number of component functions. Furthermore, we investigate the heterogeneous majority vote in distributed settings and introduce two novel algorithms that attain improved convergence rates of $\mathcal{O}(d^{1/2}T^{-1/2} + dn^{-1/2})$ and $\mathcal{O}(d^{1/4}T^{-1/4})$ respectively, outperforming the previous results of $\mathcal{O}(dT^{-1/4} + dn^{-1/2})$ and $\mathcal{O}(d^{3/8}T^{-1/8})$, where $n$ represents the number of nodes. Numerical experiments across different tasks validate the effectiveness of our proposed methods.
Keywords
signSGD · majority vote · sign-based method · variance reduction

Reviews and Discussion

Review
Rating: 7

This paper introduces the Sign-based Stochastic Variance Reduction (SSVR) algorithm, which enhances the convergence rate of the traditional signSGD method. By incorporating variance reduction techniques with sign-based updates, the authors achieve a convergence rate of $\mathcal{O}(d^{1/2}T^{-1/3})$ for general stochastic optimization and $\mathcal{O}(m^{1/4}d^{1/2}T^{-1/2})$ for finite-sum problems. Additionally, the paper proposes novel algorithms for distributed environments by introducing the unbiased sign operation, resulting in superior convergence rates for heterogeneous data. Numerical experiments further validate the effectiveness of the proposed methods.

Strengths

  1. The proposed methods improve upon the convergence rates of traditional signSGD methods and their variants, achieving faster convergence for both general non-convex optimization and finite-sum optimization.
  2. The SSVR-MV algorithm developed in the paper is communication-efficient and well-suited for distributed settings. The obtained convergence rates significantly enhance previous results in heterogeneous settings.
  3. Numerical experiments validate the effectiveness of the proposed methods, demonstrating superior performance compared to existing sign-based optimization methods in terms of convergence speed and accuracy.

Weaknesses

  1. Given that signSGD methods typically require large batch sizes to ensure convergence, the authors should include more experimental results to demonstrate the dependency on batch size for the proposed methods. This would clarify whether the proposed method also necessitates large batches for convergence in practice.
  2. The authors introduce the stochastic unbiased sign operation in Definition 1, which differs from the traditional sign operation. It would be beneficial to provide a more detailed explanation of their differences and specify when the sign operation is preferred.
  3. There are some typos in the paper, as listed below:
    • Page 2, Lines 44 and 45: $T^{1/2}$ should be $T^{-1/2}$.
    • Page 3, Line 98: "signed-based" should be "sign-based".
    • Page 5, Algorithm 2, Step 4: "set $t=\tau$" should be "set $\tau=t$".
    • Page 9, Figure 2: "SSVR" should be "SSVR-MV".
    • Page 28, Line 482: $f(x_t;\xi_t^j)$ should be $\nabla f(x_t;\xi_t^j)$.

Questions

  1. Can you provide a more detailed discussion about the design of equation (3), especially the error correction term? The authors claim that "This gap can be effectively mitigated by the error correction term we introduced in equation (3)," but it remains unclear how it works based solely on the content in the main body.
  2. Could you provide additional experimental results to demonstrate the performance of the proposed SSVR method with different batch sizes?

Limitations

The paper is theoretical and does not present potential negative societal impacts.

Author Response

Thank you very much for your constructive comments!


Q1: The authors should include more experimental results to demonstrate the dependency on batch size for the proposed methods. This would clarify whether the proposed method also necessitates large batches for convergence in practice.

A1: As requested, we have included additional experimental results for different batch sizes {64, 128, 256, 512}, and the results can be found in the Global Response. As can be seen, the proposed method does not necessitate large batches for convergence in practice.


Q2: The authors introduce the stochastic unbiased sign operation in Definition 1, which differs from the traditional sign operation. It would be beneficial to provide a more detailed explanation of their differences and specify when the sign operation is preferred.

A2: The main difference is that the traditional sign operation produces a biased estimation of the original input, whereas the stochastic sign operation in Definition 1 provides an unbiased estimation. The disadvantage of the unbiased sign operation is that it requires the bounded gradient assumption to ensure the input of this operation is bounded. In the centralized setting, where the traditional sign operation already achieves improved convergence rates, we simply use it to avoid introducing additional assumptions. In heterogeneous distributed settings, where we need to use the sign operation twice, the traditional sign operation can lead to large bias and worse convergence rates. Therefore, we use the stochastic sign operation to obtain better convergence rates in this setting.
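For concreteness, here is a minimal NumPy sketch of a stochastic unbiased sign operator of the kind discussed above. Definition 1 itself is not reproduced in this thread, so the construction below is an assumption: each coordinate is set to $+1$ with probability $(1+v_i/R)/2$, which makes the output an unbiased estimate of $\mathbf{v}/R$ (consistent with the $S(\mathbf{v}_t^j)-\mathbf{v}_t^j/R$ term used later in this discussion) and requires $\|\mathbf{v}\|_\infty \le R$.

```python
import numpy as np

def stochastic_sign(v: np.ndarray, R: float, rng: np.random.Generator) -> np.ndarray:
    """Unbiased stochastic sign: returns a {-1, +1} vector s with E[s] = v / R.

    Requires ||v||_inf <= R so that the probabilities below lie in [0, 1].
    """
    assert np.max(np.abs(v)) <= R, "input must satisfy ||v||_inf <= R"
    p_plus = 0.5 * (1.0 + v / R)          # P(s_i = +1)
    u = rng.random(v.shape)
    return np.where(u < p_plus, 1.0, -1.0)

def deterministic_sign(v: np.ndarray) -> np.ndarray:
    """Traditional sign operation (a biased estimator of v in general)."""
    return np.sign(v)

# Quick Monte-Carlo check of unbiasedness:
rng = np.random.default_rng(0)
v, R = np.array([0.3, -0.8, 0.0]), 1.0
avg = np.mean([stochastic_sign(v, R, rng) for _ in range(20000)], axis=0)
print(avg, "should be close to", v / R)
```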


Q3: There are some typos in the paper.

A3: Thank you for catching the typos. We will correct them in the revised version.


Q4: Can you provide a more detailed discussion about the design of equation (3), especially the error correction term? The authors claim that "This gap can be effectively mitigated by the error correction term we introduced in equation (3)," but it remains unclear how it works based solely on the content in the main body.

A4: According to the analysis from Lines 379 to 380, we can bound the gradient estimation error as follows:

$$
\mathbb{E}\left[\|\nabla f(\mathbf{x}_{t+1}) - \mathbf{v}_{t+1}\|^2\right] \leq (1-\beta)\,\mathbb{E}\left[\|\mathbf{v}_t - \nabla f(\mathbf{x}_t)\|^2\right] + 2L^2\,\mathbb{E}\left[\|\mathbf{x}_{t+1} - \mathbf{x}_t\|^2\right] + 2\beta^2\,\mathbb{E}\left[\|\nabla f_{i_{t+1}}(\mathbf{x}_{t+1}) - \nabla f(\mathbf{x}_{t+1}) + \nabla f(\mathbf{x}_{\tau}) - \nabla f_{i_{t+1}}(\mathbf{x}_{\tau})\|^2\right].
$$

Note that in the last term, $\nabla f(\mathbf{x}_{\tau}) - \nabla f_{i_{t+1}}(\mathbf{x}_{\tau})$ is the error correction term. Without this term, we would need to bound $\mathbb{E}\left[\|\nabla f_{i_{t+1}}(\mathbf{x}_{t+1}) - \nabla f(\mathbf{x}_{t+1})\|^2\right]$, which is hard to deal with without assuming $\mathbb{E}\left[\|\nabla f_{i}(\mathbf{x}) - \nabla f(\mathbf{x})\|^2\right] \leq \sigma^2$ for each function $f_i$. However, with the error correction term, we can effectively bound it as:

$$
\mathbb{E}\left[\|\nabla f_{i_{t+1}}(\mathbf{x}_{t+1}) - \nabla f_{i_{t+1}}(\mathbf{x}_{\tau}) - \nabla f(\mathbf{x}_{t+1}) + \nabla f(\mathbf{x}_{\tau})\|^2\right]
\leq \mathbb{E}\left[\|\nabla f_{i_{t+1}}(\mathbf{x}_{t+1}) - \nabla f_{i_{t+1}}(\mathbf{x}_{\tau})\|^2\right]
\leq L^2\,\mathbb{E}\left[\|\mathbf{x}_{t+1} - \mathbf{x}_{\tau}\|^2\right].
$$

As long as the learning rate is small enough and $\tau$ is updated periodically, we can easily bound the above term.
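To make that last step concrete (an illustrative sketch rather than a quotation from the paper; $q$ denotes an assumed upper bound on the number of iterations between refreshes of $\tau$, and we assume the sign-based update $\mathbf{x}_{s+1}=\mathbf{x}_s-\eta\,\mathrm{sign}(\mathbf{v}_s)$, which moves every coordinate by exactly $\eta$):

$$
\|\mathbf{x}_{t+1}-\mathbf{x}_{\tau}\| \le \sum_{s=\tau}^{t}\|\mathbf{x}_{s+1}-\mathbf{x}_{s}\| \le q\,\eta\sqrt{d}
\quad\Longrightarrow\quad
L^2\,\mathbb{E}\left[\|\mathbf{x}_{t+1}-\mathbf{x}_{\tau}\|^2\right] \le L^2 q^2 \eta^2 d,
$$

which can be driven as small as needed by choosing $\eta$ sufficiently small relative to the refresh period.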

Comment

Thank you for the detailed responses. They have satisfactorily addressed all my concerns, and I am inclined to recommend acceptance.

Review
Rating: 7

The paper introduces an enhanced sign-based method called SSVR, designed to improve the convergence rate of signSGD with variance reduction techniques. SSVR achieves an improved convergence rate of $\mathcal{O}(d^{1/2}T^{-1/3})$ for stochastic non-convex functions. For non-convex finite-sum optimization problems, the SSVR-FS method demonstrates an improved convergence rate of $\mathcal{O}(m^{1/4}d^{1/2}T^{-1/2})$. To address heterogeneous data in distributed settings, the authors present novel algorithms that outperform existing methods in terms of convergence rates. The effectiveness of these proposed methods is validated through numerical experiments.

Strengths

  1. The proposed methods improve upon existing algorithms in terms of complexity. Sign-based algorithms have broad applications in the ML community and may be of great interest to the field.
  2. The proposed sign-based variance-reduced estimator and stochastic unbiased sign operation are novel and efficient. They effectively reduce the gradient estimation error and are communication efficient, particularly for heterogeneous data in distributed settings.
  3. The experimental results on various datasets validate the effectiveness of the proposed methods.
  4. The paper is well-written, with clear problem settings and contributions. The proofs are also easy to follow.

Weaknesses

  1. The authors introduce the bounded gradient assumption (Assumption 4) for distributed settings. Why is this assumption necessary? Is it commonly used in previous literature for this setting?
  2. The description of Algorithm 2 is not very rigorous. The variable $\tau$ in step 4 and the $\tau$ in step 10 seem to be different but use the same notation.
  3. Some typos:
    • Line 98: "signed-based" --> "sign-based."
    • Line 247: "Assumption 5′" --> "Assumption 5" or "Assumption 4′."
    • Line 487: "$S(v_t^j)$" --> "$S_{R}(v_t^j)$."

Questions

  1. Does previous work also require similar bounded gradient assumptions for the majority vote in heterogeneous settings?
  2. Is the performance of the proposed method sensitive to variations in batch size?

Limitations

This paper does not present negative societal impacts.

Author Response

Thank you very much for your constructive comments and suggestions!


Q1: The authors introduce the bounded gradient assumption (Assumption 4) for distributed settings. Why is this assumption necessary? Is it commonly used in previous literature for this setting?

A1: This assumption is necessary because the unbiased sign operation $\operatorname{S_R}(\cdot)$ defined in Definition 1 requires its input to be bounded such that $\|\mathbf{v}\|_{\infty} \leq R$. To meet this requirement, we assume the gradient is bounded so that the norm of our gradient estimator $\mathbf{v}_t^j$ is also bounded. Similar bounded (stochastic) gradient assumptions are also used in previous literature [Jin et al., 2023, Sun et al., 2023] for this heterogeneous distributed setting.


Q2: The description of Algorithm 2 is not very rigorous. The variable $\tau$ in step 4 and the $\tau$ in step 10 seem to be different but use the same notation.

A2: Sorry for the misleading notation. We will clarify this by replacing the $\tau$ in step 10 with $\varphi$ in the revised version.


Q3: Some typos.

A3: Thank you for pointing out the typos. We will correct them in the revised paper.


Q4: Does previous work also require similar bounded gradient assumptions for the majority vote in heterogeneous settings?

A4: Yes, as mentioned in A1, previous literature [Jin et al., 2023, Sun et al., 2023] also requires similar bounded gradient assumptions for this setting.


Q5: Is the performance of the proposed method sensitive to variations in batch size?

A5: To answer this question, we conduct experiments with different batch sizes {64, 128, 256, 512}, and the results can be found in the Global Response. As can be seen, the proposed algorithm is not sensitive to variations in batch size.


References:

Jin et al. Stochastic-Sign SGD for federated learning with theoretical guarantees. 2023.

Sun et al. Momentum ensures convergence of SIGNSGD under weaker assumptions. ICML, 2023.

Comment

Thank the authors for the reply. It addresses all my questions and concerns. I'm willing to keep my score.

Review
Rating: 6

This paper considers the application of variance reduction to sign-based optimization, i.e., a setting where the algorithm can only access $\mathrm{sign}(\nabla f(x^t; \xi^t))$ at step $t$, in the smooth and nonconvex setting. Because $\{\pm 1\}$-valued vectors can be transmitted more efficiently than real-valued vectors, sign-based optimization is considered to be useful especially in distributed optimization. While a naive SGD-type algorithm previously achieved the convergence rate of $O(d^{1/2}T^{-1/4})$, the proposed variance-reduced algorithm yields the rate of $O(d^{1/2}T^{-1/3})$ (measured by the $L_1$-norm of the gradient). Next, they applied the algorithm to finite-sum optimization (with $m$ components) and obtained the convergence rate of $O(m^{1/4}d^{1/2}T^{-1/2})$. Finally, they applied these results to distributed optimization (the server receives $\{\pm 1\}$-valued gradient estimators from clients), with an additional technique called majority vote to obtain unbiased gradient estimators, and obtained improved communication complexities.

Strengths

The problem is well-motivated

Sign-based optimization is an important technique to reduce computational / communication complexities in large-scale optimization. I think it is a natural attempt to apply variance reduction, which is a common technique to accelerate optimization, to achieve more efficient sign-based optimization.

Improved convergence rate

Overall, the proposed approach improved the convergence rates of sign-based optimization, which I think is a solid contribution to the theory of optimization.

  • For stochastic non-convex optimization, the proposed algorithm achieves $\min_{1\leq t \leq T}\mathbb{E}[\|\nabla f(x^t)\|]\lesssim d^{1/2} T^{-1/3}$, which is an improvement from the $d^{1/2} T^{-1/4}$ rate of SignSGD (Bernstein et al., 2018).

  • For finite-sum optimization, the proposed algorithm achieves $\min_{1\leq t \leq T}\mathbb{E}[\|\nabla f(x^t)\|]\lesssim m^{1/4} d^{1/2} T^{-1/2}$, whereas the previous SignSVRG algorithm achieves the rate of $O(m^{1/2} d^{1/2} T^{-1/2})$ (Chzhen and Schechtman, 2023). (I am a bit confused because it looks like there is no benefit of variance reduction for SignSVRG in terms of the dependency on $m$.)

  • For distributed optimization, the proposed algorithm achieves $\min_{1\leq t \leq T}\mathbb{E}[\|\nabla f(x^t)\|]\lesssim d^{1/2} T^{-1/2} + d n^{-1/2}$ or $\lesssim d^{1/4} T^{-1/4}$. The previous bound was $d T^{-1/4} + d n^{-1/2}$ (Sun et al., 2023) or $d^{3/8} T^{-1/8}$ (Jin et al., 2023, in $L^2$-norm). This is achieved by a trick called majority vote, which stochastically maps the real-valued gradient vector (or $\nabla f_i(x)$) into $\{\pm 1\}$-valued vectors transmitted to the server.

Experimental results

The authors provide a sufficient amount of experiments to verify the validity of the proposed methods (as a theory paper).

Weaknesses

Technical novelty

Variance reduction itself is quite common in the optimization field, so I am afraid that this paper might be a bit incremental. Section 3.3 (Results under weaker assumptions) looks redundant given Theorem 2, as there is no technical difficulty in weakening the assumption as described.

Hyperparameter choice

Although this is a common criticism, most variance reduction algorithms require very careful hyperparameter tuning. It looks like the theoretical guarantees of the proposed algorithms also require a specific set of hyperparameters, and I am not confident in the robustness of the proposed algorithms to the hyperparameters.

Questions

On additional terms

  • Why is the term of $md/T$ required in Theorem 4?

  • Why is the term of $dn^{-1/2}$ required in Theorem 5?

On the difficulties to further improve the convergence rates

  • In Theorem 6, why can you not improve the convergence rate to $\mathrm{poly}(d)\, T^{-1/3}$?

Assumptions

  • Is it possible to weaken Assumption 3 to averaged smoothness, i.e., $\frac{1}{m}\sum_{i=1}^m \|\nabla f_i(x)-\nabla f_i(y)\|^2 \leq L^2 \|x-y\|^2$?

Limitations

N/A

Author Response

Thanks for your constructive comments! We will revise accordingly.


Q1: Variance reduction is quite common in the optimization field, so I am afraid this paper might be a bit incremental. Section 3.3 looks redundant given Theorem 2, as there is no technical difficulty in weakening the assumption as described.

A1: First, we want to emphasize that our Algorithm 1 is the first sign-based variance reduction method to improve the convergence rate from $\mathcal{O}(T^{-1/4})$ to $\mathcal{O}(T^{-1/3})$ (Theorem 1). Second, to deal with the finite-sum problem, we introduce a novel error correction term into our gradient estimator in Algorithm 2, making our approach distinct from existing methods. Furthermore, the corresponding guarantee (Theorem 2) surpasses the existing variance reduction method SignSVRG. Finally, the proposed SSVR-MV algorithm and its corresponding analysis (Theorems 3 & 4) are also new, where we utilize unbiased sign operations and the projection operation in the SSVR-MV method. As a result, we believe our paper provides significant contributions beyond being merely incremental to the existing literature.

Regarding Section 3.3, we agree it is not the main contribution of our paper; our intention is to broaden the applicability of our methods to a wider range of functions by relaxing the assumptions. Moreover, there actually exist technical challenges in the analysis. For instance, in the previous analysis, we can bound $\mathbb{E}[\|\nabla f(\mathbf{x}_{t+1};\xi_{t+1})-\nabla f(\mathbf{x}_t;\xi_{t+1})\|^2]\le L^2\|\mathbf{x}_{t+1}-\mathbf{x}_t\|^2\le L^2 \eta^2 d$ with small $\eta$. Under the weaker assumptions, we can only ensure $\mathbb{E}[\|\nabla f(\mathbf{x}_{t+1};\xi_{t+1})-\nabla f(\mathbf{x}_t;\xi_{t+1})\|^2]\le \|\mathbf{x}_{t+1}-\mathbf{x}_t\|^2(K_0+K_1\mathbb{E}_{\xi_{t+1}}\|\nabla f(\mathbf{x}_t;\xi_{t+1})\|)$, which cannot be bounded directly since $\|\nabla f(\mathbf{x}_t;\xi_{t+1})\|$ is unbounded. To address this, we employed a different analysis using mathematical induction techniques (see the proofs of Lemmas 3 and 4). Therefore, the analyses for Section 3.3 are more challenging. However, following your suggestion, we will consider moving Section 3.3 to the appendix.


Q2: I am not confident in the robustness of the proposed algorithms to hyper-parameters.

A2: Following your suggestion, we have examined the behavior of our algorithm with different values of the learning rate and the parameter $\beta$. First, we fix $\beta=0.5$ and vary the learning rate within the set {5e-3, 1e-3, 5e-4, 1e-4, 5e-5}. Then, we fix the learning rate at 1e-3 and vary $\beta$ within {0.3, 0.5, 0.7, 0.9, 0.99}. The results, which are reported in the Global Response, indicate that our method is not sensitive to the choice of hyper-parameters.


Q3: Why is the term of $md/T$ required in Theorem 4?

A3: Compared to Assumption 3 used in Theorem 2, we use the weaker Assumption 3$'$ in Theorem 4, which includes an additional $\|\nabla f(\mathbf{x}_t)\|$ term. Consequently, this extra term appears in the analysis. In the final step, we have $\mathbb{E}[\frac{1}{T}\sum_{t=1}^T\|\nabla f(\mathbf{x}_t)\|_1]\le\mathcal{O}(\frac{1}{\eta T}+\eta d+d\sqrt{m}\eta)+\mathcal{O}(d\eta m)\,\mathbb{E}[\frac{1}{T}\sum_{t=1}^T\|\nabla f(\mathbf{x}_t)\|_1]$. To cancel the last term, we set $\eta\leq\mathcal{O}(\frac{1}{md})$. As a result, we have $\mathcal{O}(\frac{1}{\eta T})=\mathcal{O}(\frac{md}{T})$ in the convergence rate.
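As a hedged worked step (the constant $c$ below is introduced only for illustration and does not appear in the original reply): choosing $\eta$ small enough that the coefficient of the last term is at most $1/2$ allows it to be absorbed into the left-hand side,

$$
\mathcal{O}(d\eta m)\le \tfrac12 \quad\text{when}\quad \eta\le\frac{c}{md},
\qquad\text{so that}\qquad
\tfrac12\,\mathbb{E}\Big[\tfrac1T\sum_{t=1}^T\|\nabla f(\mathbf{x}_t)\|_1\Big]
\le \mathcal{O}\Big(\tfrac{1}{\eta T}+\eta d+d\sqrt{m}\,\eta\Big),
$$

and when the cap $\eta \le c/(md)$ is the binding choice of step size, the first term on the right becomes $\mathcal{O}(1/(\eta T)) = \mathcal{O}(md/T)$, which is the extra term in Theorem 4.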


Q4: Why is the term of $dn^{-1/2}$ required in Theorem 5?

A4: According to equation (15) on Page 29, we need to bound $R\sqrt{d}\,\|\frac{\nabla f(\mathbf{x}_t)}{R}-\frac{1}{n}\sum_{j=1}^n S(\mathbf{v}_t^j)\|$. To this end, we separate it into two terms: $R\sqrt{d}\,\|\frac{\nabla f(\mathbf{x}_t)}{R}-\frac{1}{nR}\sum_{j=1}^n\mathbf{v}_t^j\|$ and $R\sqrt{d}\,\|\frac{1}{n}\sum_{j=1}^n(S(\mathbf{v}_t^j)-\frac{\mathbf{v}_t^j}{R})\|$. The first term represents the estimation error of the estimator $\mathbf{v}_t^j$ and can be bounded using a technique similar to that of Theorem 1. The second term leads to the $\mathcal{O}(dn^{-1/2})$ term, since we bound it as:

$$
R\sqrt{d}\,\mathbb{E}\left[\Big\|\frac{1}{n}\sum_{j=1}^n \Big(S(\mathbf{v}_t^j)-\frac{\mathbf{v}_t^j}{R}\Big)\Big\|\right]
\le R\sqrt{d}\sqrt{\mathbb{E}\left[\Big\|\frac{1}{n}\sum_{j=1}^n\Big(S(\mathbf{v}_t^j)-\frac{\mathbf{v}_t^j}{R}\Big)\Big\|^2\right]}
\le R\sqrt{d}\sqrt{\frac{1}{n^2}\sum_{j=1}^n\mathbb{E}\Big\|S(\mathbf{v}_t^j)-\frac{\mathbf{v}_t^j}{R}\Big\|^2}
\le R\sqrt{d}\sqrt{\frac{1}{n^2}\sum_{j=1}^n\mathbb{E}\left[\|S(\mathbf{v}_t^j)\|^2\right]}
\le \frac{dR}{\sqrt{n}},
$$

where the last step uses $\|S(\mathbf{v}_t^j)\|^2 = d$, since each coordinate of $S(\mathbf{v}_t^j)$ is $\pm 1$.


Q5: In Theorem 6, why can you not improve the convergence rate to $\mathcal{O}(T^{-1/3})$?

A5: For Theorem 1, we obtain the convergence rate of $\mathcal{O}(T^{-1/3})$ with the help of Assumptions 1 and 2. In Theorem 6, although the proposed unbiased sign operation provides an unbiased gradient estimation, the obtained gradient cannot satisfy the average smoothness condition (Assumption 1), which is crucial for achieving the $\mathcal{O}(T^{-1/3})$ convergence rate. This is the main reason why we cannot further improve the convergence rate.


Q6: Is it possible to weaken Assumption 3 to averaged smoothness, i.e., $\frac{1}{m}\sum_{i=1}^m\|\nabla f_i(x)-\nabla f_i(y)\|^2\le L^2\|x-y\|^2$?

A6: Yes, it is possible. In the analysis, Assumption 3 is used to ensure $\mathbb{E}_{i_{t+1}}[\|\nabla f_{i_{t+1}}(\mathbf{x}_{t+1})-\nabla f_{i_{t+1}}(\mathbf{x}_t)\|^2]\le L^2\|\mathbf{x}_{t+1}-\mathbf{x}_t\|^2$. With averaged smoothness, we can achieve a similar guarantee. Since $i_{t+1}$ is randomly sampled from $\{1,\cdots,m\}$, we have:

$$
\mathbb{E}_{i_{t+1}}\left[\|\nabla f_{i_{t+1}}(\mathbf{x}_{t+1})-\nabla f_{i_{t+1}}(\mathbf{x}_t)\|^2\right]=\frac{1}{m}\sum_{i=1}^{m}\|\nabla f_i(\mathbf{x}_{t+1})-\nabla f_{i}(\mathbf{x}_t)\|^2\le L^2 \|\mathbf{x}_{t+1}-\mathbf{x}_t\|^2,
$$

which indicates the proof holds with averaged smoothness.

Comment

Thank you very much for your detailed response. I would like to keep my score.

Best, reviewer.

Review
Rating: 7

This paper investigates the stochastic optimization problem and a special case of finite-sum optimization. They propose the Sign-based Stochastic Variance Reduction (SSVR) method, leveraging variance reduction techniques, which improves the convergence rate of the sign stochastic gradient descent (signSGD) algorithm. Further, they also study the heterogeneous majority vote in distributed settings and introduce two novel algorithms, whose convergence rates improve upon the state of the art. Numerical experiments demonstrate the efficiency of their algorithms and support their theoretical guarantees.

Strengths

The paper is clearly written, with detailed discussion and comparison.

Weaknesses

Some places in the paper need further clarification.

Questions

  1. How did you get the convergence rate of $O(m^{1/2}d^{1/2}T^{-1/2})$ for SignSVRG? It seems that in their paper they claim an $O(T^{-1/2})$ convergence rate without $m$ and $d$ in the numerator. Did you calculate this result by yourself? Please provide discussions.

  2. There is another paper studying sign-based optimization (Qin et al., 2023); the setting studied in that paper is the same as your problem (2). Can you also discuss this paper and compare the results?

References:

  1. E. Chzhen and S. Schechtman. SignSVRG: fixing SignSGD via variance reduction. arXiv preprint arXiv:2305.13187, 2023.
  2. Qin, Z., Liu, Z., & Xu, P. (2023). Convergence of Sign-based Random Reshuffling Algorithms for Nonconvex Optimization. arXiv preprint arXiv:2310.15976.

Limitations

Yes.

Author Response

Many thanks for the constructive feedback! We will revise our paper accordingly.


Q1: How did you get the convergence rate of $\mathcal{O}(m^{1/2}d^{1/2}T^{-1/2})$ for SignSVRG? It seems that in their paper they claim an $\mathcal{O}(T^{-1/2})$ convergence rate without $m$ and $d$ in the numerator. Did you calculate this result by yourself? Please provide discussions.

A1: Yes, the original paper does not explicitly state the dependence on $m$ and $d$, so we derived these dependencies ourselves. In Remark 2 of their paper, they indicate that $\mathbb{E}\left[\|\nabla f(\bar{x}_{T'})\|_{1}\right] \lesssim \sqrt{\frac{d}{T'}} \max\left((f(x_1)-f(x_{*})+1)/d, P\right)$, where $T'$ is the iteration number. Consequently, to ensure $\mathbb{E}\left[\|\nabla f(\bar{x}_{T'})\|_{1}\right] \leq \epsilon$, we require at least $T'=\mathcal{O}\left(dP^2 \epsilon^{-2}\right)$. As mentioned in Lemma 4 of their paper, the algorithm may need to recompute the full gradient ($m$ samples) every $P$ rounds, resulting in a sample complexity on the order of $\mathcal{O}\left(dm\epsilon^{-2}\right)$. This leads to the overall convergence rate of $\mathcal{O}\left(d^{1/2} m^{1/2} T^{-1/2}\right)$.
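For completeness, here is the conversion from the sample complexity above to a rate in $T$, under the assumption that $T$ counts total stochastic gradient evaluations: setting $T = \mathcal{O}(dm\epsilon^{-2})$ and solving for $\epsilon$ gives

$$
\epsilon = \mathcal{O}\!\left(\sqrt{\frac{dm}{T}}\right) = \mathcal{O}\!\left(d^{1/2} m^{1/2} T^{-1/2}\right),
$$

which is the rate we report for SignSVRG.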


Q2: There is another paper studying sign-based optimization [Qin et al., 2023]; the setting studied in that paper is the same as your problem (2). Can you also discuss this paper and compare the results?

A2: Thank you for bringing this related work to our attention. We will cite this paper and provide a detailed discussion in the revised version of our paper. Specifically, the mentioned paper [Qin et al., 2023] focuses on the finite-sum optimization problem and investigates signSGD with random reshuffling, achieving a convergence rate of $\mathcal{O}\left(\log(mT)/\sqrt{mT}+\|\sigma\|_1\right)$, where $m$ is the number of component functions, $T$ is the number of epochs of data passes, and $\sigma$ is the variance bound of the stochastic gradients. By leveraging variance-reduced gradients and momentum updates, they further propose the SignRVR and SignRVM methods, both achieving a convergence rate of $\mathcal{O}\left(\log(mT)\sqrt{m}/\sqrt{T}\right)$. In contrast, our method obtains a convergence rate of $\mathcal{O}\left(m^{1/4}/\sqrt{T}\right)$, which enjoys a better dependence on $m$ than their methods.


References:

Qin et al. Convergence of Sign-based Random Reshuffling Algorithms for Nonconvex Optimization. 2023.

Comment

Thanks for the rebuttal. It addressed all my concerns. I will increase my score.

Comment

Thank you very much for your kind reply! We will revise our paper according to the constructive reviews.

Best regards,

Authors

Author Response

Global Response


In response to the reviewers' requests, we provide additional experimental results in this part.

Figure 1 & Figure 2: Following the suggestion of Reviewer 2Gd9, we provide the performance of our SSVR method with different hyper-parameters. First, we fix the value of $\beta$ at 0.5 and try different learning rates from the set {5e-3, 1e-3, 5e-4, 1e-4, 5e-5}. Then, we fix the learning rate at 1e-3 and enumerate $\beta$ over the set {0.3, 0.5, 0.7, 0.9, 0.99}. The results show that our method is insensitive to the choice of hyper-parameters within a certain range.

Figure 3: As requested by Reviewer Huhp and Reviewer Xovb, we include experiments on the SSVR method with different batch sizes {64, 128, 256, 512}. The results indicate that our algorithm does not necessitate large batches for convergence and is not sensitive to variations in batch size.

Comment

Dear Reviewers,

Thank you for your hard work during the review process. The authors have responded to your initial reviews. If you haven’t already done so, please take the time to carefully read their responses. It is crucial for the authors to receive acknowledgment of your review by the deadline for the author-reviewer discussion period, which is August 13, 2024 (11:59 PM, Anywhere on Earth). Please address any points of disagreement with the authors as early as possible.

Best,

Your AC

Final Decision

This paper investigates the convergence of sign-SGD algorithms, with a focus on enhancing their performance through variance reduction techniques. The authors successfully improve the convergence rate of sign-SGD for stochastic optimization from $O(d^{1/2}T^{-1/4})$ to $O(d^{1/2}T^{-1/3})$. For finite-sum problems, the proposed approach further accelerates convergence by a factor of $m^{1/4}$, where $m$ represents the number of individual functions. Additionally, the paper demonstrates that this technique can be effectively applied to distributed settings, improving upon existing results in the literature.

The paper is generally well-written, presenting a solid contribution in the form of improved convergence rates over established methods. Following an in-depth discussion between the authors and reviewers, consensus has been reached that the paper is suitable for publication. I am also recommending acceptance of this paper. However, I have some additional comments that I believe could enhance the paper’s clarity and presentation, which I encourage the authors to consider in their final version:

  1. Clarification of Convergence Rate Definitions:
    The paper would benefit from a more detailed description of the convergence rate definitions used. Specifically, it would be helpful if the authors could clearly delineate the differences between this work and others in terms of convergence rate definitions: whether in terms of the norm of the gradient or the squared norm, and whether considering $l_1$ or $l_2$ norms. This clarification will aid readers in making direct comparisons between the results in this paper and those in related works.

  2. Discussion on Upper vs. Lower Bounds:
    In the introduction, the authors discuss lower bounds for stochastic gradient methods and finite-sum optimization. However, the upper bounds provided in the theoretical section are primarily for the $l_1$ norm, while the literature often discusses upper bounds in terms of the $l_2$ norm. The authors should add a discussion that addresses this discrepancy, clarifying that it is acceptable that there is no exact lower bound for the sign-based optimization setting.

  3. Comprehensive Discussion of Variance Reduction Techniques:
    Since SVRG plays a pivotal role in achieving the improved convergence results in this paper, the related work section would benefit from a more thorough discussion of key variance reduction algorithms such as SCSG [1], SDCA [2, 4], nonconvex SVRG [3], SNVRG [5], etc.

[1] Lei, Lihua, and Michael Jordan. "Less than a single pass: Stochastically controlled stochastic gradient." Artificial Intelligence and Statistics. PMLR, 2017.

[2] Shalev-Shwartz, Shai, and Tong Zhang. "Stochastic dual coordinate ascent methods for regularized loss minimization." Journal of Machine Learning Research 14.1 (2013).

[3] Reddi, Sashank J., et al. "Stochastic variance reduction for nonconvex optimization." International conference on machine learning. PMLR, 2016.

[4] Richtárik, Peter, and Martin Takáč. "Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function." Mathematical Programming 144.1 (2014): 1-38.

[5] Zhou, Dongruo, et al. "Stochastic nested variance reduction for nonconvex optimization." Journal of machine learning research 21.103 (2020): 1-63.