PaperHub
Overall rating: 5.0/10 (Poster; 3 reviewers; lowest 3, highest 6, std 1.4)
Individual ratings: 3, 6, 6
Average confidence: 3.0
ICLR 2024

Det-CGD: Compressed Gradient Descent with Matrix Stepsizes for Non-Convex Optimization

OpenReview | PDF
Submitted: 2023-09-22 | Updated: 2024-03-10
TL;DR

We propose two matrix stepsized sketch gradient descent algorithms for minimizing matrix-smooth non-convex objectives.

Abstract

Keywords
Optimization, First-order optimization, Non-convex optimization, Distributed optimization

Reviews and Discussion

Review (Rating: 3)

The paper investigates matrix step sizes of the compressed gradient descent method and provides convergence analysis.

Strengths

The paper investigates matrix step sizes of compressed gradient descent and provides convergence analysis.

Weaknesses

See questions.

Questions

  1. The paper claims that the matrix step size improves performance but introduces problems. The presence of the neighboring term in Theorem 3 creates a significant neighborhood when $\Delta^{\inf}$ is large.
  2. This paper uses the compressed gradient as shown in (3), which is a direct compression on the full gradient. Given the prevalence of error feedback techniques in handling compression errors, it is pertinent to consider whether the proposed matrix step size can be extended to incorporate error feedback methods.
  3. The paper does not consider the use of SGD, local update, and partial participation techniques in distributed settings.
  4. The paper introduces det-CGD1, but it is unclear why this algorithm, specifically in a distributed setting, can obtain Algorithm 1.

Details of Ethics Concerns

None.

Comment

Questions

  • The paper claims that the matrix step size improves performance but introduces problems. The presence of the neighboring term in Theorem 3 creates a significant neighborhood when $\Delta^{inf}$ is large.

    Indeed, if $\Delta^{\inf}$ is large, then the error in our analysis becomes large. However, we have the freedom to choose $\lambda_D / n$ small enough, such that the effect of this second term is cancelled. On the other hand, this error term is expected, as it is a result of the stochasticity of the DCGD method. Furthermore, it is also present in the scalar case (see Khirirat et al. 2018). In order to remove this term and obtain a better dependence on the parameters of the algorithm, one should consider variance-reduced federated learning mechanisms, like MARINA (Gorbunov et al. 2021), DASHA (Tyurin et al. 2022), or, as the reviewer mentions below, error feedback algorithms. However, this is not the focus of the current paper and is left for future investigation.
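
    To make the shape of the guarantee concrete, bounds of this kind can be written schematically as
    $$\min_{0 \le k < K} \mathbb{E}\,\|\nabla f(x^k)\|^2 \;\lesssim\; \frac{A}{K} \;+\; c\,\frac{\lambda_D}{n}\,\Delta^{\inf},$$
    where $A$ and $c$ stand in for problem-dependent constants. This is only a hedged sketch of the general form, not the exact statement of Theorem 3: shrinking $\lambda_D / n$ shrinks the neighborhood term, typically at the cost of a slower-decaying first term.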

  • This paper uses the compressed gradient as shown in (3), which is a direct compression on the full gradient. Given the prevalence of error feedback techniques in handling compression errors, it is pertinent to consider whether the proposed matrix step size can be extended to incorporate error feedback methods.

    Indeed, error feedback methods have been shown to be effective in reducing the variance caused by gradient compression in the distributed setting. However, in this paper we aim at developing the first matrix-stepsized distributed algorithm, following the simpler structure of DCGD (Khirirat et al. 2018), which simply compresses the gradients locally and averages them at the server. We leave the development of more advanced methods in this setting for future studies. In particular, to leverage the benefits of error feedback, one must first extend our theory to biased gradient compressors, which falls beyond the scope of the current manuscript.

  • The paper does not consider the use of SGD, local update, and partial participation techniques in distributed settings.

    Similar to the previous comment, integrating more advanced techniques, such as local updates and partial participation, is left for future work. We believe that our paper is self-contained in its current form. Here, we mostly focus on the DCGD framework, which already improves on relevant existing methods.

  • The paper introduces det-CGD1, but it is unclear why this algorithm, specifically in a distributed setting, can obtain Algorithm 1.

     For Algorithm 1, consider the case when there is only one client; the server and the client become a simple "formality". In particular, assume $n = 1$. The latter means $f(x) = f_1(x)$. Then, let us denote $S^k := S_1^k$ for all $k \in \mathbb{N}$. This means $g^k = D S^k \nabla f_1(x^k) = D S^k \nabla f(x^k)$. If we insert this into the update rule (line 9 of Algorithm 1), then we recover det-CGD1. Thus, Algorithm 1 is indeed the distributed version of det-CGD1.
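
     To make the single-node reduction concrete, below is a minimal Python sketch of the resulting det-CGD1 update $x^{k+1} = x^k - D S^k \nabla f(x^k)$. The rand-$k$ diagonal sketch and the $d/k$ scaling are illustrative assumptions, not taken verbatim from the paper.

```python
import numpy as np

def rand_k_sketch(d, k, rng):
    """Diagonal rand-k sketch: keep k random coordinates, scaled by d/k
    (a common unbiasedness convention; an assumption of this sketch)."""
    s = np.zeros(d)
    idx = rng.choice(d, size=k, replace=False)
    s[idx] = d / k
    return np.diag(s)

def det_cgd1(grad_f, x0, D, k, n_iters=1000, seed=0):
    """Single-node det-CGD1-style update: x <- x - D S grad_f(x)."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    d = x.size
    for _ in range(n_iters):
        S = rand_k_sketch(d, k, rng)
        x = x - D @ S @ grad_f(x)
    return x

# Toy usage on a quadratic f(x) = 0.5 * x^T L x with a diagonal matrix stepsize.
L = np.diag([10.0, 1.0, 0.1])
D = np.linalg.inv(L) / 3.0  # illustrative stepsize choice, not the paper's optimal one
x_final = det_cgd1(lambda x: L @ x, x0=np.ones(3), D=D, k=2)
```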

Review (Rating: 6)

This paper introduces two new matrix stepsize sketch Compressed Gradient Descent (CGD) algorithms for minimizing matrix-smooth non-convex functions. The theoretical analysis of these algorithms extends to both single-node and distributed settings. The newly introduced matrix stepsize strategy captures the underlying structure of the objective function and leads to faster convergence. Leveraging the block-diagonal structure within neural networks, the proposed Det-CGD algorithm outperforms classical methods.

Strengths

S1. The authors introduce two novel matrix stepsize sketch CGD algorithms, and offer convergence guarantees for these algorithms in both single-node and distributed scenarios.

S2. Taking into account the layer-wise structure of neural networks, the authors devise efficient compression mechanisms.

S3. The authors derive the expression for the optimal step size for the CGD algorithms, offering an important guideline for achieving rapid convergence.

Weaknesses

W1. Assumption 2 is stringent and may not align well with the characteristics of neural network problems. Note that Inequality (4) in Assumption 2 is assumed to apply to the entire dataset rather than just a mini-batch.

-- Neural network problems can be viewed as high-order polynomial fitting problems that are inherently nonconvex. Inequality (4) should include not only second-order terms but also third-order and higher-order terms. Utilizing a varying matrix D^t is more effective than using a constant stepsize D to capture the curvature information of the nonconvex objective function.

-- Even though Assumption 2 holds, the authors employ a global convex majorization function in Inequality (4), which may lead to slow convergence due to its potential looseness.

W2. The authors claim that their proposed methods take advantage of the layer-wise structure of neural networks, but in the experiments, they only focus on the logistic regression problem with a non-convex regularizer. Additionally, the neural network architecture is not defined.

W3. This stepsize is the minimizer of the right-hand side of the normalization upper bound in (10) under certain conditions. It is not clear why this results in the "optimal stepsize".

W4. Solving the log-det minimization problem under the constraints specified in (21) can be challenging and impractical, particularly when the dimension $d$ is large.

Questions

See above.

Comment

We did not copy all the comments in their entirety to meet the character limit.

Weaknesses:

  • Assumption 2 is stringent and may not align well with the characteristics of neural network problems. Note that Inequality (4) in Assumption 2 is assumed to apply to the entire dataset rather than just a mini-batch.   Assumption 2 constitutes a generalized form of the commonly employed scalar smoothness assumption, a standard in theoretical optimization. Furthermore, it is important to notice that, in the distributed setting, where we explicitly assume the finite-sum structure of the objective, each component function $f_i$ is assumed to adhere to the matrix smoothness assumption (refer to Assumption 4). This implies that the condition holds not only for the objective $f$ but also for the component functions $f_i$ (and therefore mini-batches as well).

  • Neural network problems can be viewed as high-order polynomial fitting problems... Indeed, having more information about the regularity might significantly improve the quality of the algorithms. Evidence of this phenomenon is Newton's method, where one inverts the Hessian at every iteration; it is known to be faster than "ordinary" gradient descent algorithms. Nevertheless, these matrix inversion operations are significantly more expensive to compute. To exploit higher-order regularity, or to estimate it, one would need methods that are more complicated and/or more expensive to implement. Thus, we believe that this argument should be seen not as a weakness of our method, but rather as a useful observation.    We agree with the reviewer regarding the potential benefits of using a varying matrix stepsize. However, the update rule for such a stepsize must be carefully considered and meticulously designed to make the stepsize adaptive without causing additional overhead. We leave this as future work.

  • Even though Assumption 2 holds, the authors employ a global convex majorization function in Inequality (4), which may lead to slow convergence due to its potential looseness.    We agree with the reviewer. It is indeed a fact that an overestimation of the matrix $L$ can result in slow convergence of the algorithm. However, it is worth noting that a similar challenge exists with the scalar smoothness assumption. Despite this, numerous studies based on the scalar smoothness assumption have been published in prestigious conferences such as ICLR, ICML, and NeurIPS. In addition, we want to emphasize that Inequality (4) represents a generalization of the standard smoothness assumption, offering improvements over the scalar smoothness assumption. While it is feasible to explore matrix stepsizes without relying on such an assumption, we must clarify that such an investigation falls beyond the scope of this paper.

  • The authors claim that their proposed methods take advantage of the layer-wise structure... In our experimental design, we chose to concentrate on the logistic regression problem with a non-convex regularizer as a representative scenario to showcase the effectiveness of our proposed methods. We believe that for the purpose of this work, this simpler experimental setting suffices to show the advantages of our method. Nevertheless, we are mindful of the broader context, including considerations related to neural networks. We will undoubtedly incorporate your suggestions into our future work, aiming to bridge the gap between theoretical advancements and practical applications.

  • This stepsize is the minimizer of the right-hand side of the normalization upper bound in (10) under certain conditions. It is not clear why this results in the "optimal stepsize".   The only parameter that we can choose for det-CGD1 and det-CGD2 on the right-hand side of (10) is the stepsize matrix $D$, and the only term that is affected by this choice is the $\det(D)$ in the denominator. Thus, to make this fraction small, we need to choose a matrix stepsize with the largest possible determinant that satisfies the conditions of the theorem. Hence, in this context, (12) and (13) define the optimal stepsizes for the algorithms.

  • Solving the log-det minimization problem under the constraints specified in (21) can be challenging and impractical, particularly when the dimension $d$ is large.   We agree with the reviewer that solving (21) could be challenging in general. However, to take advantage of the matrix stepsize setting, we do not necessarily need an exact solution. For instance, we can solve it over diagonal matrices instead of the general case. Our experiments show that even with these suboptimal stepsize choices, distributed det-CGD1 and distributed det-CGD2 consistently perform better than the DCGD algorithm with a scalar stepsize; a small illustrative sketch of the diagonal restriction is given below.
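
  As a concrete illustration of the diagonal restriction mentioned above, here is a minimal CVXPY sketch that maximizes $\log\det(D)$ over diagonal stepsizes under a constraint of the form $D \preceq L^{-1}$. This constraint is an assumed stand-in for illustration; the actual constraint set (21) in the paper may differ.

```python
import cvxpy as cp
import numpy as np

# Toy smoothness matrix L (a stand-in for the paper's L, for illustration only).
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
L = A @ A.T + np.eye(5)
L_inv = np.linalg.inv(L)

# Restrict D to diagonal matrices: D = diag(d_vec) with positive entries.
d_vec = cp.Variable(5, pos=True)
D = cp.diag(d_vec)

# Maximize log det(D) = sum(log d_i) subject to D <= L^{-1} in the PSD order.
objective = cp.Maximize(cp.sum(cp.log(d_vec)))
constraints = [L_inv - D >> 0]
cp.Problem(objective, constraints).solve(solver=cp.SCS)

D_star = np.diag(d_vec.value)  # a suboptimal (diagonal) matrix stepsize
```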

Review (Rating: 6)

In this paper, the authors present two novel matrix stepsize sketch-type compressed GD algorithms, i.e., optimization algorithms based on sketch-based compressed gradient computation. The authors utilize the layer-wise structure of the network while designing the sketches.

Strengths

  • Matrix smoothness is leveraged to reduce distribution communication complexity.

  • The layer-wise structure of the neural nets is utilized while designing the sketches, and block-diagonal smoothness is used.

  • The number of bits broadcast at each iteration is reduced without losing in the total communication complexity (free compression).

Weaknesses

  • Determinant normalization is not a very common approach. Hence, I would appreciate some further references and a supporting statement on why the normalized determinant is used.

Questions

  • Why did the authors first present the stochastic gradient and then present the compressed gradient as a replacement for the stochastic gradient? Is it stochastic?
  • Did the authors try some further algebraic tricks on equation 15? What is the complexity of those operations?
  • What is the take-home message of Figure 1 for the comparison of det-CGD1 and det-CGD2? Did the authors experiment with a scenario where these two methods perform differently?

Details of Ethics Concerns

This paper does not require an ethics review.

Comment

Weaknesses

  • Determinant normalization is not a very common approach. Hence, I would appreciate some further references and a supporting statement on why the normalized determinant is used.

  The determinant normalization is not an approach, but rather a simple algebraic trick. Its purpose is to make the weighted gradient norms comparable to each other. Indeed, if the stepsize $D$ is not normalized, then the unit ball $\{x : \|x\|_D < 1\}$ can have an arbitrarily large volume, rendering the comparison of different algorithms in this norm meaningless. Instead, by normalizing with the determinant, we ensure the volume is equal to 1. A more thorough discussion is available in the paragraph preceding Section 3.1.
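
  For reference, assuming $\|x\|_D$ denotes the weighted Euclidean norm $\sqrt{x^\top D x}$ (an assumption of this note, not a definition quoted from the paper), the set $\{x : \|x\|_D < 1\}$ is an ellipsoid of volume $V_d / \sqrt{\det(D)}$, where $V_d$ is the volume of the $d$-dimensional Euclidean unit ball. The volume depends on $D$ only through $\det(D)$, so fixing the determinant fixes the volume and makes the weighted gradient norms comparable across stepsizes.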

Questions:

  • Why did the authors first present the stochastic gradient and then present the compressed gradient as a replacement for the stochastic gradient? Is it stochastic?

  Indeed, the compression operators are stochastic in most cases. Thus, the compressed gradient acts not as a replacement for the stochastic gradient but rather as a particular case of it. In the setting of det-CGD, the stochasticity comes from the sketch matrix $S^k$.

  • Did the authors try some further algebraic tricks on equation 15? What is the complexity of those operations?

Further algebraic tricks could be employed if we specify the type of sketch we are using. Specifically, when using a rand-$k$ sketch, the calculation of the diagonal matrix $A_i = \mathbb{E}[S_i^k W_i L_i W_i S_i^k]$ simplifies to performing multiplications of the diagonal elements. The latter takes $\mathcal{O}(d_i)$ operations for each layer, where $d_i$ is the dimension of that layer. Moreover, since the eigenvalues of $W_i^{-1/2} A_i W_i^{-1/2}$ are identical to those of $A_i^{1/2} W_i^{-1} A_i^{1/2}$ (which is a diagonal matrix), we can easily obtain all the largest eigenvalues in $\mathcal{O}(d)$. For general sketches, however, there is no universal technique.
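
  As a quick numerical check of the similarity fact used above (that $W^{-1/2} A W^{-1/2}$ and $A^{1/2} W^{-1} A^{1/2}$ share the same spectrum), here is a minimal NumPy snippet with toy diagonal matrices; the values are made up for illustration only.

```python
import numpy as np

# Toy positive-definite diagonal matrices (illustrative values only).
A = np.diag([4.0, 1.0, 0.25])
W = np.diag([2.0, 5.0, 0.5])

W_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(W)))
A_sqrt = np.diag(np.sqrt(np.diag(A)))

M1 = W_inv_sqrt @ A @ W_inv_sqrt          # W^{-1/2} A W^{-1/2}
M2 = A_sqrt @ np.linalg.inv(W) @ A_sqrt   # A^{1/2} W^{-1} A^{1/2} (diagonal here)

# Both spectra coincide; for diagonal matrices the largest eigenvalue is just
# the maximum elementwise ratio a_i / w_i, an O(d) computation.
print(np.linalg.eigvalsh(M1))             # [0.2, 0.5, 2.0]
print(np.linalg.eigvalsh(M2))             # [0.2, 0.5, 2.0]
print(np.max(np.diag(A) / np.diag(W)))    # 2.0
```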

  • What is the take-home message of Figure 1 for the comparison of det-CGD1 and det-CGD2? Did the authors experiment with a scenario where these two methods perform differently?

  The aim of Figure 1 is to show that even in the simplest setting, both distributed det-CGD1 and det-CGD2 outperform DCGD and the DCGD variant with matrix smoothness by Safaryan et al. 2021. The simple setting of these experiments does not allow us to distinguish the performance of our two algorithms. However, we believe that broadcasting the matrix stepsize at the beginning of det-CGD2 might be less practical in terms of communication complexity and local memory as opposed to det-CGD1. Therefore, we do not conduct further experiments to highlight their differences in the distributed setting.

Comment

We thank all the reviewers for their time and effort.

The reviewers have pointed out key strengths of our paper, including leveraging matrix smoothness for reduced communication complexity, utilizing the layer-wise structure of neural networks in sketch design, and achieving free compression by reducing the number of broadcast bits. They particularly highlighted the investigation of matrix step sizes and the introduction of two novel matrix stepsize sketch CGD algorithms with convergence guarantees. The efficiency of our compression mechanisms, which take the neural network structure into account, is acknowledged, along with the derivation of optimal step sizes for rapid convergence. We extend our appreciation to the reviewers for their valuable insights and constructive feedback.

The reviewers also had questions and concerns, which we addressed in the individual responses.

AC Meta-Review

The paper introduces a new method for addressing matrix-smooth nonconvex minimization problems. The method uses matrix-valued step sizes (in place of traditional scalar step sizes) and as such is generally better at adapting to the local problem geometry. The selected steps are based on random sketches that act as gradient compression operators. The paper provides rigorous convergence results for the studied algorithms, which are supported by numerical experiments on deep neural nets, where layer-wise structure is leveraged.

The paper is well-written, the results are clean and easy to understand and follow, and the approach is natural, well-motivated, and appears promising from an empirical perspective.

Why Not a Higher Score

The paper is a clear accept from my perspective, but it is not at the level of a spotlight.

Why Not a Lower Score

I have justified this in the "Additional Comments."

Final Decision

Accept (poster)