PaperHub

ICLR 2025 · Withdrawn
Overall rating: 3.5/10 from 4 reviewers (scores 3, 3, 3, 5; min 3, max 5, std 0.9)
Confidence: 3.3 · Correctness: 1.8 · Contribution: 1.8 · Presentation: 1.3

HOME-3: HIGH-ORDER MOMENTUM ESTIMATOR USING THIRD-POWER GRADIENT FOR CONVEX, SMOOTH NONCONVEX, AND NONSMOOTH NONCONVEX OPTIMIZATION

OpenReview · PDF
Submitted: 2024-09-18 · Updated: 2024-11-13
TL;DR

In this work, we denote a momentum built on a high-power first-order gradient as a high-order momentum to further advance the performance of gradient-based optimizers.

Abstract

Keywords
Gradient Descent, Momentum, High-Power Gradient, Nonconvex, Coordinate Randomization

Reviews and Discussion

Review 1
Rating: 3

This paper proposes to leverage third-order momentum of gradients in optimization algorithms. HOME-3 is proposed and its convergence is proved under different settings, including convex, smooth nonconvex, and nonsmooth nonconvex ones. Numerical results are also provided for dictionary learning, deep matrix factorization, and some neural networks.

Strengths

The idea of employing third-order momentum is interesting, and as far as I know, it has not been explored in the literature.

Weaknesses

W1. The paper is not well written, sometimes to an extent that makes it difficult to evaluate the correctness of the results.

  • It is not clear what Assumption 1 refers to in Theorem 4.1; there is no assumption indexed as 1.
  • Similarly, what exactly is Assumption 2? It is hard to decipher from the context.
  • The major theorems are stated in a way that even the authors do not seem confident about. For example, in Theorem 1 the authors write "...(some term)... can reach ...", which is a very ambiguous statement -- it holds only in the best case. Similarly, "... should be equal to ..." in Theorem 4 is also very confusing.

W2. The theoretical results need to be further discussed.

  • Assumption 2.4 is too strong: clearly $g_t$ depends on $g_{t-1}$, i.e., their correlation is not 0.
  • The convergence criterion in this paper is $\|f(x_t) - f(x_T)\|$, which is quite different from the commonly adopted suboptimality error or gradient norm. More importantly, it is not clear whether $\|f(x_t) - f(x_T)\|$ makes sense. This term can be $0$ if we do not update at all, i.e., set $x_0 = x_1 = \cdots = x_T$. In other words, even if this term is small, it is not straightforward to say anything about optimality. Lastly, when $\|f(x_0) - f(x_T)\|$ is small, it simply means that the algorithm does not make much progress -- if $f(x_0)$ is large, then $f(x_T)$ is also large. (A minimal worked instance is sketched after this list.)
  • The results here are not directly comparable to, e.g., Adam or STORM, because of the different metrics.
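A minimal worked instance of the second point above, using a hypothetical one-dimensional quadratic chosen purely for illustration:

```latex
\text{Let } f(x) = \tfrac{1}{2}x^2 \text{ and } x_0 = x_1 = \cdots = x_T = 10.
\quad\text{Then } \|f(x_t) - f(x_T)\| = 0 \text{ for every } t,
\quad\text{while } \|\nabla f(x_T)\| = 10,
```

so the criterion being zero says nothing about optimality or stationarity.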

W3. The algorithm is not intuitively clear. See my questions below.

Questions

Q1. Can the authors help me with an intuitive understanding of HOME-3? In particular, suppose we ignore the momentum in S, M, V (by setting the betas to 0) and consider the following settings respectively:

  • Large-gradient regime, where $g_t$ is large entrywise. In this case we have $g_t^3 \gg g_t$, which means that $M_t - S_t \approx -g_t^3$. Note that $[g_t]^i \cdot [g_t^3]^i > 0$; in other words, the negative of $M_t - S_t$ is not a descent direction!
  • Small-gradient regime, where $g_t$ is small entrywise. In this case we have $g_t^3 \ll g_t$, so $M_t - S_t \approx g_t$, and the algorithm reduces to Adam. This does not explain the large performance gap between HOME-3 and Adam in the numerical results. (A small numerical sketch of both regimes follows this list.)
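A small numerical sketch of the two regimes above, under the stated simplification that with all betas set to 0 the update numerator reduces elementwise to $M_t - S_t = g_t - g_t^3$; this is only the reviewer's simplification, not the paper's full Eq. (3).

```python
import numpy as np

# With betas = 0, assume the HOME-3 numerator reduces elementwise to
# M_t - S_t = g_t - g_t**3 (the reviewer's simplification, not Eq. (3)).

def numerator(g):
    """M_t - S_t with betas = 0, assuming elementwise powers."""
    return g - g**3

g_large = np.array([5.0, -4.0, 3.0])     # large-gradient regime, |g_i| >> 1
g_small = np.array([0.1, -0.05, 0.02])   # small-gradient regime, |g_i| << 1

for g in (g_large, g_small):
    step = -numerator(g)                 # candidate step -(M_t - S_t)
    # sign of <step, g>: positive => ascent direction, negative => descent
    print(g, "  <step, g> =", float(step @ g))
```

For the large-gradient vector the inner product is positive, i.e., $-(M_t - S_t)$ points uphill, while for the small-gradient vector it is negative and the step behaves like Adam's, matching the two bullets above.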
Review 2
Rating: 3

This paper presents the HOME-3 (High-Order Momentum Estimator using third-power gradient) optimizer for convex/nonconvex optimization together with its convergence analyses. It also provides some numerical results indicating that HOME-3 performs better than existing optimizers.

Strengths

The strength of this paper is presenting the HOME-3 (High-Order Momentum Estimator using third-power gradient) optimizer for convex/nonconvex optimization. Its originality lies in applying third-power gradients to solve convex/nonconvex optimization problems. Moreover, its significance lies in providing numerical results showing that HOME-3 performs well.

Weaknesses

The weaknesses of this paper are:

  • the mathematical preliminaries are poor.
  • the proofs of the convergence analyses are incorrect.

Questions

The following are my questions and suggestions:

  • L103: $X \in \mathbb{R}^D \to X \subset \mathbb{R}^D$ (since $X$ is a set).
  • L104: Is $D < \infty$ (i.e., $D \in \mathbb{R}$) correct? I think that $D \in \mathbb{N}$ is correct (since $D$ should not be negative).
  • L112: $(\nabla f(x))^n$ is not defined. I guess $\nabla f(x) := (\frac{\partial f(x)}{\partial x_1}, \cdots, \frac{\partial f(x)}{\partial x_D})^\top$ and $(\nabla f(x))^n := ((\frac{\partial f(x)}{\partial x_1})^n, \cdots, (\frac{\partial f(x)}{\partial x_D})^n)^\top$. I think that $n < \infty$ should be replaced with $n \in \mathbb{N}$ (since $n$ should not be negative).
  • L115: $\nabla^k f(x)$ is not defined.
  • L120: suppose $t\ (\forall t \in \mathbb{N})$ $\to$ suppose $t \in \mathbb{N}$?
  • Definition 2.4: I do not think that the definition is well-defined. Do you mean that $\mathcal{R}[x_1, \cdots, x_D]^\top = [\hat{x}_1, \cdots, \hat{x}_D]^\top$? $\hat{x}_i$ $(i = 1, 2, \cdots, D)$ is not defined.
  • Definition 2.5: I do not think that the definition is well-defined. From $\mathcal{G} \colon \mathbb{R} \to \mathbb{R}^D$ and $f(x) \in \mathbb{R}$, $\mathcal{G}(f(x)) \in \mathbb{R}^D$ is well-defined. However, $\mathcal{G}^2(f(x)) := \mathcal{G}[\mathcal{G}(f(x))]$ is not defined, since the domain of $\mathcal{G}$ is $\mathbb{R}$.
  • Definition 2.6: I do not think $x_T$ is just a stationary point. Do you mean that $x_T$ is an approximate stationary point?
  • Definition 2.7: I do not think that the definition is well-defined, due to the issue with Definition 2.5.
  • Assumption 2.1: $\forall x, y \in \mathbb{R}^D,\ f(y) \geq \ldots$ is correct.
  • Assumption 2.2: $\exists L > 0\ \forall x, y \in \mathbb{R}^D,\ f(y) \leq \ldots$ is correct.
  • Assumption 2.4: The definition indicates that $\forall T\ \forall n \in \mathbb{N}\ \forall \epsilon > 0$ ($\forall t \in [1, T]$, $\exists [k_1, \cdots, k_T]^\top \in \mathbb{R}^T$ such that (2) holds). I do not know why Assumption 2.2 implies Assumption 2.4 (L155-L156). Could you prove that Assumption 2.2 implies Assumption 2.4?
  • (3): Please add more explanation of why (3) is useful for solving optimization problems.
  • (3): I think that a definition of $\sqrt{\hat{V}_{t-1}}$ (the square root of a vector) is needed.
  • L206-L215: Does "Eq. 3 equal to 0" imply that $x_t = x_{t-1}$? I cannot understand why (5) and steps 10-12 in Algorithm 1 are needed. In particular, the authors should explain why using (5) and steps 10-12 in Algorithm 1 escapes potential stationary points (should "potential" be "poor"?).
  • Table 1: $M_0, V_0, S_0$ are not defined. From page 14, I guess that $M_0 = V_0 = S_0 = 0$.
  • Theorem 4.1: There is no Assumption 1 in this paper. Write all of the assumptions and conditions used in the theorem. The appendix uses $\alpha_t = \alpha$ and $\epsilon_1 = 0$, while Table 1 uses $\alpha_t$ and $\epsilon_1 > 0$. $\Vert \sum \cdots \Vert$ should be replaced with $| \cdots |$.
  • L265: I do not understand what $\frac{L}{\sqrt{T}} \to 0,\ \forall x, y \in X$ means.

Proof of Theorem 4.1:

  • (A2) does not hold, since $x_{t+1} \in \mathbb{R}^D$. Please use $\Vert x_{t+1} - x_T \Vert^2 = \Vert x_t - x_T \Vert^2 - \cdots$ (the standard squared-norm expansion presumably intended here is sketched at the end of this review).
  • It is not obvious that (A3) holds. To show (A3), the authors seem to use Assumption 2.4. Since Assumption 2.4 is an $\epsilon$-approximation, Assumption 2.4 does not lead to (A3). Please provide the proof of (A3).
  • L724: It is not guaranteed that $\hat{S}_t$ is bounded. Also, (A4) does not imply that $\hat{S}_t$ is bounded, since it is possible that $\Vert g_T \Vert \to \infty$ as $T \to \infty$. Please prove the boundedness of $\hat{S}_t$.
  • L728, L729: Please also prove the boundedness of $\hat{M}_t$ and $\hat{V}_t$.
  • (A5): The authors use $\hat{M}_t = \frac{M_t}{1 - \beta_1^t} = 0$ to show (A5). This is incorrect. Even if the authors could use $\hat{M}_t = 0$, the following case analysis would be needed.

Case 2): Suppose that $\forall t,\ \Vert g_t \Vert < 1$ (see page 15).

Case 1): Suppose that $\lnot(\forall t,\ \Vert g_t \Vert < 1)$, i.e., $\exists t_0$ such that $\Vert g_{t_0} \Vert \geq 1$ (see also L731; L731 is not correct).

  • Case 1-1): Suppose that $\hat{M}_{t_0} = 0$.
  • Case 1-2): Suppose that $\hat{M}_{t_0} \neq 0$.

Therefore, I strongly believe that the current proof of Theorem 4.1 is incorrect.

  • Theorem 4.2: The proof of Theorem 4.2 is based on Theorem 4.1. Moreover, Theorem 4.4 is based on $\mathcal{G}^t$, which is not well-defined (see my comment on Definition 2.5). Therefore, Theorems 4.2 and 4.3 are also mathematically incorrect.
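For reference, the standard identity that the comment on (A2) presumably has in mind, written for a generic update $x_{t+1} = x_t - \alpha_t d_t$ (the direction symbol $d_t$ is introduced here only for illustration and is not the paper's notation):

```latex
\|x_{t+1} - x_T\|^2
  = \|x_t - \alpha_t d_t - x_T\|^2
  = \|x_t - x_T\|^2 - 2\alpha_t \langle d_t,\ x_t - x_T \rangle + \alpha_t^2 \|d_t\|^2 .
```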
Review 3
Rating: 3

This paper explores advancements in gradient-based optimization by introducing a novel High-Order Momentum Estimator (HOME) optimizer, focusing on the use of third-order momentum to improve convergence rates. The key idea is to extend traditional momentum techniques, which primarily use first- and second-order gradients, to incorporate higher-order gradients and accelerate convergence, especially in challenging optimization landscapes like nonsmooth nonconvex problems.

The proposed HOME-3 optimizer is accompanied by faster convergence rates for both convex and smooth nonconvex problems. The authors also empirically validate the effectiveness of HOME-3 on nonsmooth nonconvex problems, including deep neural networks, where it reportedly outperforms existing momentum-based optimizers like Adam and STORM.

Strengths

Incorporating higher-order momentum into existing optimization algorithms is an interesting direction that can lead to accelerated rates.

Weaknesses

  • Section 2 is somewhat difficult to read without accompanying intuition or explanatory text. Adding a brief development or context for these definitions could improve clarity. In addition, some definitions are really unclear; e.g., in Definition 2.4, what does it mean for $\mathcal{R}$ to be a "coordinate randomization"?

  • It is hard to follow the formulas when the terms have not yet been defined. For example, in the core algorithmic update of Equation (3), $\hat{M}_t, \hat{V}_t, \hat{S}_t$ are the "first-order, second-order, and third-order momentum" and the authors refer to Definition 2.1, but the latter never defines these. The definitions are only presented later, within Algorithm 1.

  • Many of the formulas lack clarity. For example, in Equation (3), there is a division by $\hat{V}_t$, which is supposedly a vector.

  • Quantities like $\epsilon_1$ should be explicitly specified, rather than just "defined the same as in Adam".

Due to significant ambiguities in the definitions and the lack of clarity for key components of the algorithm, as mentioned above, I found it difficult to fully understand the proposed method. As a result, it’s challenging to assess the validity or implications of the theoretical and experimental results presented in later sections.

Questions

Aside from the weaknesses above,

  • The paper makes multiple uses of quantities like $(\nabla f(x))^n$ (e.g., Definition 2.1). Given that $\nabla f(x)$ is a vector, what exactly does this mean?

  • From the descriptions, it seems that $\hat{V}_t$ is a vector. Then, what does it mean to divide by it in Equation (3)? How about adding the scalar $\epsilon_1$ to it? (An elementwise reading in the style of Adam is sketched after this list.)
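A minimal sketch of the elementwise reading that the questions above presuppose, patterned on Adam; the coefficient names, the bias corrections, and the extra third-power term are assumptions based on the paper's description, not its exact update:

```python
import numpy as np

def home3_like_step(x, g, M, V, S, t, alpha=1e-3,
                    beta1=0.9, beta2=0.999, beta3=0.9, eps1=1e-8):
    """One Adam-style step with an assumed extra third-power momentum term.

    All vector operations (powers, square root, division) are elementwise,
    and the scalar eps1 is broadcast to every coordinate -- this is the
    interpretation being asked about, not a confirmed definition.
    """
    M = beta1 * M + (1 - beta1) * g        # first-order momentum
    V = beta2 * V + (1 - beta2) * g**2     # second-order momentum
    S = beta3 * S + (1 - beta3) * g**3     # assumed third-power momentum
    M_hat = M / (1 - beta1**t)             # Adam-style bias corrections
    V_hat = V / (1 - beta2**t)
    S_hat = S / (1 - beta3**t)
    x = x - alpha * (M_hat - S_hat) / (np.sqrt(V_hat) + eps1)
    return x, M, V, S

# Example usage on a toy quadratic f(x) = 0.5 * ||x||^2, so grad f(x) = x.
x = np.array([1.0, -2.0])
M, V, S = np.zeros_like(x), np.zeros_like(x), np.zeros_like(x)
for t in range(1, 6):
    x, M, V, S = home3_like_step(x, x.copy(), M, V, S, t)
```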

Review 4
Rating: 5

This paper proposes a deterministic first-order algorithm called HOME-3 for convex, smooth nonconvex, and nonsmooth nonconvex optimization, leveraging the idea of Adam while introducing an extra momentum term built on the third power of the gradient. The authors provide a convergence analysis for convex and smooth nonconvex optimization and claim that the algorithm achieves a convergence rate of $O(1/T^{5/6})$ for finding an approximate stationary point. For nonsmooth nonconvex optimization, the authors conduct some experiments to show the effectiveness of the proposed algorithm.

Strengths

This paper proposes a deterministic first-order algorithm called HOME-3 for convex, smooth nonconvex, and nonsmooth nonconvex optimization, leveraging the idea of Adam while introducing an extra momentum term built on the third power of the gradient. I think the main strength is that the proposed HOME-3 algorithm introduces an extra momentum term beyond the two momentum terms of Adam.

Weaknesses

  1. The overall presentation is not clear. Specifically, in Section 2, where some definitions and assumptions are given as preliminaries, and in Section 4, where the convergence results of the proposed algorithm are analyzed, the description is very dense and difficult to parse.

  2. About Definition 2.2: the authors give an unusual definition in which every $k$-th order derivative of the objective function $f$, for all $k \in \mathbb{N}$, is Lipschitz continuous with the same fixed Lipschitz constant $L$, without further justification, which does not make sense.

  3. Another major weak point of the presentation is the lack of a clear comparison with prior work. The authors mention a few previous algorithms, such as Adam, STORM, and STORM+, and their convergence rates. However, it is not clear whether these results are comparable, since I am not sure whether they use the same definition of smoothness and rely on Assumptions 2.3 and 2.4.

  4. Although the paper claims that the proposed HOME-3 achieves a convergence rate of $O(1/T^{5/6})$ for finding an approximate stationary point, I am not sure whether this result is reasonable. According to Carmon et al. [1], for the class of nonconvex functions with $L_q$-Lipschitz continuous $q$-th derivatives for all $q \in \{1, \ldots, p\}$, where $p \in \mathbb{N}$, there does not exist a deterministic first-order algorithm with iteration complexity better than $O(\epsilon^{-8/5})$, which contradicts the result claimed in this paper. (A brief rate-to-complexity conversion is sketched after reference [1] below.)

  5. In Section 3 the authors give a stopping criterion $\|\hat{M}_t - \hat{S}_t\| < \epsilon$ but do not analyze the relationship between this criterion and the common definition of stationarity in nonconvex optimization.

  6. Although the paper leverages the idea of Adam, the proposed algorithm is only analyzed in the deterministic setting. Why not provide a further convergence analysis for stochastic smooth convex/nonconvex optimization?

  7. Regarding the experiments, the authors conduct several small-scale tasks with 100 or 200 iterations. Additional large-scale experiments should be conducted, given that the analysis of the proposed algorithm in the smooth nonconvex case assumes that $T$ is sufficiently large (Section 4.2).

[1] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Lower bounds for finding stationary points II: first-order methods. Mathematical Programming, 185(1):315-355, 2021.
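For concreteness, the complexity comparison behind point 4, assuming the claimed $O(1/T^{5/6})$ rate were stated for a standard stationarity measure such as the minimum gradient norm; this is a hedged reading, since the paper's own criterion differs:

```latex
\min_{1 \le t \le T} \|\nabla f(x_t)\| \le O\!\left(T^{-5/6}\right)
\quad\Longrightarrow\quad
T = O\!\left(\epsilon^{-6/5}\right) \text{ iterations reach an } \epsilon\text{-stationary point},
```

and $\epsilon^{-6/5}$ grows more slowly than the $\Omega(\epsilon^{-8/5})$ lower bound of Carmon et al. [1] for deterministic first-order methods, which is the apparent contradiction.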

Questions

Same as the weaknesses listed above.

Withdrawal Notice

I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.