PaperHub
Overall: 7.3 / 10 · Spotlight · 4 reviewers
Ratings: 4, 5, 4, 5 (min 4, max 5, std 0.5)
Confidence: 3.3
Novelty: 3.0 · Quality: 3.0 · Clarity: 2.8 · Significance: 2.8
NeurIPS 2025

Escaping saddle points without Lipschitz smoothness: the power of nonlinear preconditioning

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29

Abstract

We study generalized smoothness in nonconvex optimization, focusing on $(L_0, L_1)$-smoothness and anisotropic smoothness. The former was empirically derived from practical neural network training examples, while the latter arises naturally in the analysis of nonlinearly preconditioned gradient methods. We introduce a new sufficient condition that encompasses both notions, reveals their close connection, and holds in key applications such as phase retrieval and matrix factorization. Leveraging tools from dynamical systems theory, we then show that nonlinear preconditioning—including gradient clipping—preserves the saddle point avoidance property of classical gradient descent. Crucially, the assumptions required for this analysis are actually satisfied in these applications, unlike in classical results that rely on restrictive Lipschitz smoothness conditions. We further analyze a perturbed variant that efficiently attains second-order stationarity with only logarithmic dependence on dimension, matching similar guarantees of classical gradient methods.
Keywords

Nonconvex optimization · generalized smoothness · saddle point avoidance

Reviews and Discussion

Review
Rating: 4

This paper focuses on generalized smoothness conditions in nonconvex optimization. It first introduces a common sufficient condition for $(L_0, L_1)$-smoothness and anisotropic smoothness, and shows that it is satisfied in a few key applications. Secondly, it studies the saddle-point avoidance property of a few algorithms under relaxed smoothness conditions.

Strengths and Weaknesses

Strengths: The paper is well-written. It has a good summary of the literature and clear statements of the main results.

Weaknesses:

  1. The paper introduces a new common sufficient condition for the two generalized smoothness conditions in the literature, but does not discuss how 'tight' this sufficient condition is. Without such a discussion, it is hard to judge whether it is indeed an insightful sufficient condition, or just a convenient intermediate step to include a few applications.

  2. The paper shows that a few classical ML nonconvex optimization problems satisfy the generalized smoothness condition. The applications in Section 2.3 have specialized algorithms that are quite efficient. It is not clear whether the nonlinearly preconditioned gradient method is a good choice for those problems.

  3. There are no numerical experiments that support the findings in the paper.

Questions

See weakness.

Limitations

See weakness.

Final Justification

The authors have addressed most of my concerns about the tightness of conditions and the lack of numerical examples. I am happy to increase my score to 4 in the final justification.

Formatting Issues

No

Author Response

Dear Reviewer,

Thank you for your review.


Comment: A discussion regarding the `tightness' of the sufficient condition (Assumption 2.8) is missing.

Response: Assumption 2.8 is a sufficient condition that is significantly easier to verify than the second-order characterization of anisotropic smoothness (3), which in turn is equivalent to a matrix inequality involving $\nabla^2 f$ [32, Lemma 2.5]. The merits of this condition are two-fold: on the one hand, it is practical, and we used it to establish generalized smoothness of the examples in Section 2.3. On the other hand, it reveals the close connection between anisotropic smoothness and $(L_0, L_1)$-smoothness.

For $(L_0, L_1)$-smoothness, some tightness is lost by requiring global polynomial upper and lower bounds to the Hessian and gradient norms, respectively. In principle, a bounded ratio of these two quantities suffices, in which case the definition of $(L_0, L_1)$-smoothness can be used directly.
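To make this concrete, here is a schematic version of that argument, with purely illustrative constants $c_1, c_2, C$ and radius $R$ (not the exact statement of Assumption 2.8): if $\Vert \nabla^2 f(x) \Vert \leq c_1 (1 + \Vert x \Vert^{r})$ and $\Vert \nabla f(x) \Vert \geq c_2 \Vert x \Vert^{r+1}$ for $\Vert x \Vert \geq R$, then

$$\frac{\Vert \nabla^2 f(x) \Vert}{1 + \Vert \nabla f(x) \Vert} \leq \frac{c_1 (1 + \Vert x \Vert^{r})}{1 + c_2 \Vert x \Vert^{r+1}} \leq C \quad \text{for all } \Vert x \Vert \geq R,$$

so that $\Vert \nabla^2 f(x) \Vert \leq L_0 + L_1 \Vert \nabla f(x) \Vert$ holds globally for suitable $L_0, L_1$, with continuity of $\nabla^2 f$ supplying the bound on the compact ball $\Vert x \Vert \leq R$.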

For anisotropic smoothness, more tightness is lost: the matrix inequality [32, Lemma 2.5]

$$\forall x \in \mathbb{R}^n: \quad \nabla^2 f(x) \prec L \bar L\,[\nabla^2 \phi^*(\bar L^{-1}\nabla f(x))]^{-1}$$

only imposes an upper bound to the Hessian's eigenvalues, whereas a bound on $\Vert\nabla^2 f(x)\Vert$ -- as in Assumption 2.8 or in $(L_0, L_1)$-smoothness -- imposes both a lower and an upper bound to the Hessian's eigenvalues. This makes anisotropic smoothness (at least for certain reference functions $\phi$) more general than $(L_0, L_1)$-smoothness (see also Example 2.6), but significantly harder to verify.

As we have illustrated through the applications in Section 2.3, Assumption 2.8 entails a practically verifiable sufficient condition, whereas, e.g., the matrix inequality [32, Lemma 2.5] would be far more difficult to establish for specific applications.

We thank the reviewer for posing this question, and agree that a discussion regarding the tightness of Assumption 2.8 would be valuable. We will include it after introducing Assumption 2.8.


Comment: The applications in section 2.3 have specialized algorithms that are efficient. It is not clear why a nonlinearly preconditioned gradient method is a good choice for those problems.

Response: Section 2.3 does not claim that nonlinearly preconditioned gradient methods are efficient methods for those applications -- although [32, Fig 3] indicates that they perform very well for phase retrieval, and this also agrees with our personal experience. Rather, we wanted to illustrate that this theoretical framework for generalized smoothness is truly relevant by showing that it applies to certain key applications where classical Lipschitz smoothness fails. As acknowledged by one of the other reviewers, this is a crucial step which many theoretical papers fail to take.


Comment: There are no numerical experiments.

Response: To address this comment, we have conducted two experiments:

(i) For the symmetric matrix factorization problem (6) with $n = 2$ and $r = 1$, we have created a 2D visualization of the level curves of the objective, along with the iterates of both classical gradient descent (GD) and nonlinearly preconditioned gradient descent (P-GD) with $\phi(x) = \cosh(\Vert x \Vert) - 1$. As suggested by Reviewer M3Bk, this provides the reader with some intuition regarding the (P-GD) behavior. In particular, it reveals that unless (GD) is initialized close to a stationary point, the stepsize must be chosen (very) small to prevent the iterates from diverging -- as expected, because the objective is not Lipschitz smooth. In contrast, the (P-GD) iterations take the form (for $\bar L = 1$)

$$x^+ = x - \gamma\, \frac{\sinh^{-1}(\Vert \nabla f(x) \Vert)}{\Vert \nabla f(x) \Vert}\, \nabla f(x).$$

In this case, large gradients are damped -- recall the close resemblance to clipping methods, cf. Fig 1b -- resulting in the convergence of (P-GD) for stepsizes $\gamma$ that are often orders of magnitude larger than the maximum stepsize of (GD). In turn, this causes (P-GD) to require significantly fewer iterations, and to overall outperform (GD) for a fixed stepsize.
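For reference, a minimal numerical sketch of this comparison (illustrative only, not the code used to produce the figure; the objective is assumed to take the common form $f(x) = \frac{1}{4}\Vert x x^\top - A \Vert_F^2$, and the matrix $A$, stepsize, and initialization below are hypothetical placeholders):

```python
import numpy as np

# Illustrative sketch: P-GD with phi(x) = cosh(||x||) - 1 (L_bar = 1) versus classical GD
# on a 2-D symmetric factorization objective, assuming f(x) = 1/4 * ||x x^T - A||_F^2.
# A, the stepsize and the initialization are placeholders, not the reported settings.
np.seterr(over='ignore', invalid='ignore')       # let the diverging GD run overflow quietly

A = np.array([[1.0, 0.5],
              [0.5, 0.75]])                      # hypothetical symmetric target

def grad(x):
    return (np.outer(x, x) - A) @ x              # gradient of 1/4 ||x x^T - A||_F^2

def gd_step(x, gamma):
    return x - gamma * grad(x)

def pgd_step(x, gamma):
    g = grad(x)
    ng = np.linalg.norm(g)
    if ng == 0.0:
        return x
    # nonlinear preconditioning: step length ~ gamma * asinh(||g||), damping large gradients
    return x - gamma * (np.arcsinh(ng) / ng) * g

gamma = 0.3
x0 = np.array([4.0, -3.0])                       # initialized far from any stationary point
x_gd, x_pgd = x0.copy(), x0.copy()
for _ in range(300):
    x_gd = gd_step(x_gd, gamma)                  # expected to diverge: f is not Lipschitz smooth
    x_pgd = pgd_step(x_pgd, gamma)               # expected to converge at the same stepsize

print("GD:  ", x_gd)                             # typically inf/nan
print("P-GD:", x_pgd, " f =", 0.25 * np.linalg.norm(np.outer(x_pgd, x_pgd) - A, 'fro')**2)
```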

(ii) We illustrate the fast escape of saddle points by Algorithm 1 by reproducing the experiment of [11, Section 5] for Algorithm 1. In particular, we consider the `octopus' objective from [11], which was precisely constructed to ensure that (GD) takes exponential time to find a local minimizer. We also select all hyperparameters as in the experiment of [11], and set the only additional hyperparameter $\bar L = 1$. We compare against classical GD, and vary the constant $L \in \{1, 1.5, 2, 3\}$ and the dimension $n \in \{5, 10\}$, thus recreating the plots [11, Figs 3 and 4] for Algorithm 1 and classical GD. The resulting plots look very similar to [11, Figs 3 and 4] and validate Theorem 3.6. For example, they confirm that the number of iterations scales proportionally with $L$. In the same way, we also confirmed that the number of iterations scales proportionally with $\bar L^{-1}$.

We believe these experiments significantly improve our manuscript and therefore plan to include them.


We hope that we have adequately addressed any concerns that you may have had, and that you are willing to re-evaluate your rating of our work.

Review
Rating: 5

This paper presents a significant theoretical contribution to the field of nonconvex optimization. It tackles the analysis of first-order methods under generalized smoothness conditions that are more realistic for many machine learning problems than the standard Lipschitz smoothness of the gradient. The authors focus on two such conditions: $(L_0, L_1)$-smoothness and anisotropic smoothness.

The core contributions are:

  1. The paper introduces a novel sufficient condition (Assumption 2.8) that elegantly unifies $(L_0, L_1)$-smoothness and anisotropic smoothness, revealing a deep structural connection between these two frameworks.

  2. It rigorously proves that this new condition holds for several key, non-Lipschitz-smooth problems in machine learning, including phase retrieval, symmetric and asymmetric matrix factorization, and Burer-Monteiro factorizations for SDPs. This provides the first formal verification of these generalized smoothness properties for such problems.

  3. Saddle-Point Avoidance: The paper extends the classical saddle-point avoidance results for gradient descent to the nonlinearly preconditioned gradient method (P-GD). It shows that under anisotropic smoothness, P-GD asymptotically avoids strict saddle points. The analysis is also extended to handle non-smooth preconditioners like hard gradient clipping.

  4. Efficient Complexity Guarantees: For a perturbed version of P-GD, the authors provide a complexity analysis, showing that the algorithm finds an approximate second-order stationary point in time that depends only polylogarithmically on the dimension, matching the state-of-the-art results for standard gradient descent under Lipschitz smoothness.

This is a purely theoretical paper with no experimental results.

Strengths and Weaknesses

Strengths:

  1. The primary strength of this work is its success in building a unified theoretical bridge between $(L_0, L_1)$-smoothness (which is often empirically motivated) and anisotropic smoothness (which arises naturally from preconditioning). Assumption 2.8, which connects polynomial bounds on the Hessian and gradient norms, is novel and powerful. This provides a much-needed foundational understanding of the geometry of loss landscapes that do not satisfy standard assumptions.
  2. A major achievement of this paper is demonstrating that its theoretical framework is not vacuous. By proving that problems like phase retrieval and matrix factorization satisfy Assumption 2.8 (and thus both generalized smoothness conditions), the authors make their abstract theory immediately relevant. This is a crucial step that many theoretical papers fail to take; it validates the entire enterprise by showing the assumptions are met in important, practical settings where classical theory fails.
  3. The paper successfully generalizes the seminal saddle-point avoidance results of Lee et al. (asymptotic) and Jin et al. (efficient) to the broader setting of anisotropic smoothness. This is a non-trivial technical feat. The authors correctly identify the key challenges, such as the difficulty of bounding the function value after a perturbation step without the convenience of quadratic bounds from Lipschitz smoothness, and the fact that the mapped Hessian $H_\lambda(x)$ is not necessarily normal. Overcoming these challenges is impressive and extends our understanding of why first-order methods work so well in practice.
  4. The paper is well-written and organized. The motivation is laid out clearly with two central research questions. The contributions are explicitly listed. The mathematical development flows logically from definitions to applications and finally to the main analysis of saddle-point dynamics.

Weaknesses:

  1. The paper is entirely theoretical. While this is acceptable for a strong theoretical contribution, the work would have been enhanced by even simple numerical experiments. For instance, a 2D visualization of the P-GD dynamics on a function satisfying the assumptions (e.g., the polynomial from Example 2.7) could provide valuable intuition. Similarly, illustrating the convergence of the perturbed algorithm and comparing it to standard GD could help ground the complex theoretical bounds.
  2. The complexity analysis in Section 3.2 relies on Assumption 3.3, the Lipschitz continuity of the mapped Hessian $H_\lambda(x)$. While the paper provides an example where this holds, the assumption itself is quite technical. It is not immediately clear how restrictive this assumption is or how one would verify it for a general problem and preconditioner. A more detailed discussion on the scope and interpretation of this assumption would be beneficial. For instance, how does the Lipschitz constant $\rho$ scale with problem parameters for the applications discussed? The final complexity bound in Theorem 3.6 is given in Big-O notation, $\tilde{O}(\dots)$, which hides polylogarithmic factors in $n$ and $\epsilon$. The text also notes it hides a factor $\chi^4$, where $\chi$ is polylogarithmic. However, the bound also depends on problem-specific constants like $L$, $\bar{L}$, and the new Lipschitz constant $\rho$. It would be valuable for the authors to discuss how these constants might scale for the applications in Section 2.3. If $\rho$ or other constants hide a polynomial dependency on $n$, the overall efficiency of the algorithm could be misleading.

Questions

  1. Regarding Assumption 3.3 ($H_\lambda(x)$ is $\rho$-Lipschitz): Could you provide more intuition on this condition? How does the choice of the reference function $\varphi$ affect whether this assumption holds? Are there natural choices of $\varphi$ for which it fails, and what would be the consequences?
  2. In Theorem 3.6, how do the constants $L$, $\bar{L}$, and especially $\rho$ scale with the dimension $n$ and other problem parameters for the key applications (e.g., matrix factorization)? Is the overall complexity still truly polylogarithmic in $n$ once these constants are accounted for?
  3. The analysis for hard clipping in Theorem 3.2 requires the set $U \coloneqq \{x \mid \|\nabla f(x)\| \neq L\}$ to have measure one. This is used to ensure differentiability of the iteration map almost everywhere. Can you provide a justification or an example where this assumption holds for the problems of interest? It seems plausible but is not immediately obvious.
  4. Could you elaborate on the significance of the regularization term ($\kappa > 0$) in the asymmetric matrix factorization case (Theorem 2.14)? The explanation provided is clear, but does this imply that the unregularized problem ($\kappa = 0$) does not satisfy $(L_0, L_1)$-smoothness, and if so, does this have implications for the analysis of unregularized matrix factorization?

Limitations

Please refer to the section on Weaknesses.

Final Justification

The authors have addressed my concerns. I will stick to my original score.

Formatting Issues

None.

Author Response

Dear Reviewer,

We thank you for your thorough and encouraging feedback.


Comment: A visualization of the P-GD dynamics and an illustration of the convergence of Algorithm 1 in comparison to standard GD could provide valuable intuition.

Response: Thank you for this suggestion! We have conducted the following two experiments:

(i) For the symmetric matrix factorization problem (6) with $n = 2$ and $r = 1$, we have created a 2D visualization of the level curves of the objective, along with the iterates of both classical gradient descent (GD) and nonlinearly preconditioned gradient descent (P-GD) with $\phi(x) = \cosh(\Vert x \Vert) - 1$. As you suggested, this provides the reader with some intuition regarding the (P-GD) behavior. In particular, it reveals that unless (GD) is initialized close to a stationary point, the stepsize must be chosen (very) small to prevent the iterates from diverging -- as expected, because the objective is not Lipschitz smooth. In contrast, the (P-GD) iterations take the form (for $\bar L = 1$)

$$x^+ = x - \gamma\, \frac{\sinh^{-1}(\Vert \nabla f(x) \Vert)}{\Vert \nabla f(x) \Vert}\, \nabla f(x).$$

In this case, large gradients are damped -- recall the close resemblance to clipping methods, cf. Fig 1b -- resulting in the convergence of (P-GD) for stepsizes $\gamma$ that are often orders of magnitude larger than the maximum stepsize of (GD). In turn, this causes (P-GD) to require significantly fewer iterations, and to overall outperform (GD) for a fixed stepsize.

(ii) We illustrate the fast escape of saddle points by Algorithm 1 by reproducing the experiment of [11, Section 5] for Algorithm 1. In particular, we consider the `octopus' objective from [11], which was precisely constructed to ensure that (GD) takes exponential time to find a local minimizer. We also select all hyperparameters as in the experiment of [11], and set the only additional hyperparameter $\bar L = 1$. We compare against classical GD, and vary the constant $L \in \{1, 1.5, 2, 3\}$ and the dimension $n \in \{5, 10\}$, thus recreating the plots [11, Figs 3 and 4] for Algorithm 1 and classical GD. The resulting plots look very similar to [11, Figs 3 and 4] and validate Theorem 3.6. For example, they confirm that the number of iterations scales proportionally with $L$. In the same way, we also confirmed that the number of iterations scales proportionally with $\bar L^{-1}$.

We believe these experiments significantly improve our manuscript and therefore plan to include them.


Question: Provide more intuition regarding Assumption 3.3. How does the choice of $\phi$ affect whether this assumption holds? Are there natural choices of $\phi$ for which the assumption fails, and what would be the consequences?

Response: Thank you for raising this interesting question.

We believe Assumption 3.3 constitutes a meaningful generalization of the Hessian Lipschitz continuity assumption that is often used in complexity analyses. In an attempt to provide some more intuition, we highlight that under the conditions in Assumption 2.10, $\Vert \nabla^2 \phi^*(y) \Vert \to 0$ as $\Vert y \Vert \to \infty$. The rough idea is that this counteracts the growth of $\nabla^2 f(x)$ as $\Vert x \Vert \to \infty$ in a way that ensures Lipschitz continuity of $H_\lambda(x) = \nabla^2 \phi^*(\lambda \nabla f(x)) \nabla^2 f(x)$.
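As a quick univariate sanity check (our own computation, not from the paper): for $\phi = \cosh(\cdot) - 1$ one has ${\phi^*}'(y) = \operatorname{arcsinh}(y)$, and hence

$${\phi^*}''(y) = \frac{1}{\sqrt{1 + y^2}} \to 0 \quad \text{as } \vert y \vert \to \infty,$$

so the preconditioner indeed flattens exactly where the gradient becomes large.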

We stress that even if Assumption 3.3 is violated, the asymptotic result of Theorem 3.1 still holds; we then merely lose the formal guarantee on the required number of iterations.

Finally, we refer to our answer to Question 3 of Reviewer ZsPD for a discussion about sufficient conditions on $f$ and $\phi$ for Assumption 3.3 to hold.


Question: In Theorem 3.6, how do $L, \bar L$ and $\rho$ scale with the dimension $n$ and other problem parameters for key applications? Is the overall complexity truly polylogarithmic?

Response: This is an interesting question. From a theoretical point of view, we mention that the proofs in Section 2.3 do not explicitly construct the constants $L, \bar L$, so that we cannot formally rule out a dependence on the dimension. However, we strongly believe this is not the case, for the following reasons:

(i) By Example 2.6, $(L_0, L_1)$-smoothness implies anisotropic smoothness with $L = L_1$ and $\bar L = L_0 / L_1$. Thus, if $L_0, L_1$ are independent of $n$, then so are $L, \bar L$. Judging from the experiments in [38], the empirical estimates for $L_0, L_1$ remain very moderate, even for high-dimensional neural networks.

(ii) This is in line with our own experiments on, e.g., matrix factorization, in which the method converges for large step sizes, even when the dimension is increased.

(iii) Likewise, we mention that for $L_H$-Lipschitz continuous Hessians $\nabla^2 f$, the constant $\rho$ reduces to $L_H$ when $\phi = \frac{1}{2}\Vert \cdot \Vert^2$. To further rule out hidden dependencies on $n$, we carefully investigated the scaling of the number of iterations of Algorithm 1 with the dimension $n$ (cf. Comment 1: Experiment (ii)). We found the scaling to closely match the one reported in [11, Figs 3 and 4], thus further validating the complexity of Theorem 3.6.


Question: The analysis for hard clipping in Theorem 3.2 requires the set $U := \{x \mid \Vert \nabla f(x) \Vert \neq \bar L\}$ to have measure one. Can you provide a justification or an example where this assumption holds for the problems of interest?

Response: Let us define the function $F(x) := \Vert \nabla f(x) \Vert^2$, which is $\mathcal{C}^1$ whenever $f \in \mathcal{C}^2$.

If $\bar L^2$ is a regular value of $F$, i.e., if there exists no $x \in \mathbb{R}^n$ for which $F(x) = \bar L^2$ with $\nabla F(x) = 0$, then by the preimage theorem the level set $F^{-1}(\bar L^2) \equiv \{x \mid F(x) = \bar L^2\} \equiv \mathbb{R}^n \setminus U$ is a smooth manifold of dimension $n - 1$, and therefore has measure zero in $\mathbb{R}^n$.

On the other hand, the set of critical values ($\bar L^2$ is a critical value of $F$ if it is not a regular value of $F$) is small for generic functions. To be more precise, by Sard's theorem the set of critical values has measure $0$ if $F$ is $k$ times continuously differentiable for some $k$ large enough ($k$ depends on $n$).

In conclusion, if $f$ is $k$ times continuously differentiable for some $k$ large enough, then for almost every $\bar L$ the set $U$ has measure one.
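As a toy illustration (not one of the paper's applications): take $f(x) = \frac{1}{4} \Vert x \Vert^4$, so that $\nabla f(x) = \Vert x \Vert^2 x$ and $F(x) = \Vert x \Vert^6$. Then $\nabla F(x) = 6 \Vert x \Vert^4 x \neq 0$ whenever $F(x) = \bar L^2 > 0$, so $\bar L^2$ is a regular value of $F$ for every $\bar L > 0$, and

$$\mathbb{R}^n \setminus U = \{x \mid \Vert x \Vert = \bar L^{1/3}\}$$

is a sphere of measure zero, i.e., $U$ has measure one.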


Question: Could you elaborate on the significance of the regularization term ($\kappa > 0$) in the asymmetric matrix factorization case (Theorem 2.14)? Does this imply that the unregularized problem ($\kappa = 0$) does not satisfy $(L_0, L_1)$-smoothness, and if so, does this have implications for the analysis of unregularized matrix factorization?

Response: Indeed, if $\kappa = 0$, then asymmetric matrix factorization does not satisfy $(L_0, L_1)$-smoothness. Let $m = n = r = 1$ such that $f(w, h) = \frac{1}{2} (w h - y)^2$ is minimized w.r.t. $w, h \in \mathbb{R}$ for some given $y \in \mathbb{R}$. Clearly, any $w, h$ on the curve $h(w) = \frac{y}{w}$ are (global) minimizers. The Hessian for global minimizers reduces to

$$\nabla^2 f(w, h(w)) = \begin{pmatrix} \frac{y^2}{w^2} & y \\ y & w^2 \end{pmatrix}.$$

This matrix has eigenvalues $0$ and $\frac{y^2}{w^2} + w^2$, and hence $\Vert \nabla^2 f(w, h(w)) \Vert = \frac{y^2}{w^2} + w^2$. Clearly, as $w \to \infty$, this Hessian norm grows unbounded, whereas the gradient norm $\Vert \nabla f(w, h(w)) \Vert = 0$ remains zero. This clearly violates $(L_0, L_1)$-smoothness.
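The computation above can also be checked mechanically; a minimal sketch (illustrative only, using SymPy):

```python
import sympy as sp

# Symbolic sanity check (illustrative only, not taken from the paper) of the 1-D
# counterexample above: f(w, h) = 1/2 (w h - y)^2, evaluated on the curve of
# global minimizers h = y / w.
w, h, y = sp.symbols('w h y', positive=True)
f = sp.Rational(1, 2) * (w * h - y) ** 2

grad = sp.Matrix([sp.diff(f, w), sp.diff(f, h)])
hess = sp.hessian(f, (w, h))

grad_min = grad.subs(h, y / w).applyfunc(sp.simplify)   # -> [0, 0]
hess_min = hess.subs(h, y / w).applyfunc(sp.simplify)   # -> [[y**2/w**2, y], [y, w**2]]

print(grad_min.T)                       # gradient norm stays 0 along the minimizers
print(hess_min)                         # matches the matrix displayed above
print(sp.simplify(hess_min.det()))      # 0  -> one eigenvalue is 0
print(sp.simplify(hess_min.trace()))    # w**2 + y**2/w**2 -> the other eigenvalue,
                                        # unbounded as w -> oo, so the Hessian norm blows up
```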

Consequently, Assumption 2.8 is also violated. However, it remains unclear whether anisotropic smoothness may nevertheless hold for some reference function $\phi$; this is a question we are further exploring. In this regard, we also highlight that anisotropic smoothness generalizes $(L_0, L_1)$-smoothness, since the equivalent matrix inequality [32, Lemma 2.5] only imposes an upper bound to the eigenvalues of $\nabla^2 f(x)$. We refer to our answer to Reviewer F5tt's first comment for more details.

A possible implication for unregularized matrix factorization is that regularization actually helps satisfy (generalized) smoothness assumptions, which in turn may make the optimizer more efficient. We believe that this hypothesis deserves further investigation.


We thank you for your valuable suggestions, and hope that we have accurately addressed your questions.

Comment

Thank you for the detailed clarification. I will keep my rating.

Review
Rating: 4

The authors propose a novel sufficient condition (Assumption 2.8) bridging $(L_0, L_1)$-smoothness and anisotropic smoothness, enabling saddle-point avoidance and convergence guarantees for nonlinearly preconditioned GD (e.g., gradient clipping) beyond Lipschitz assumptions. The key innovations of this article lie in two aspects: 1. proposing a novel sufficient condition for generalized smoothness; 2. establishing theoretical guarantees for saddle-point avoidance under nonlinear preconditioning.

Strengths and Weaknesses

Strengths:
1. Proposes Assumption 2.8 as a novel sufficient condition bridging two distinct smoothness frameworks;
2. Extends saddle-point avoidance properties to non-Lipschitz settings;
3. Demonstrates applicability in key nonconvex problems where Lipschitz smoothness fails;
4. The structure of this paper is clear, and the language is well-articulated, which greatly aids readers in understanding this manuscript.

Weaknesses:
1. Lack of numerical experiments to validate the findings;
2. Lack of analysis of the parameter $r$ in Assumption 2.8.

Questions

1. Would requiring the gradient growth ($r+1$) to strictly exceed the Hessian growth ($r$) limit application scenarios?
2. What relationship exists between the polynomial degree $r$ and the function $f$ in Algorithm 1? Is there any theoretical or empirical analysis on how to select an appropriate $r$ for different functions $f$?
3. Assumption 3.3 plays a central role in the complexity analysis of the perturbed preconditioned gradient method. However, it is not entirely clear under what practical conditions this assumption is guaranteed to hold. Are there natural or verifiable sufficient conditions on the function $f$ and the reference function $\Phi$ that ensure Assumption 3.3 holds?

Limitations

Yes

Final Justification

After carefully considering the feedback and further discussion, I maintain my original rating.

Formatting Issues

Not available

Author Response

Dear Reviewer,

We thank you for your time and effort in reviewing our work.


Comment: Lack of numerical experiments.

Response: To address this comment, we have conducted two experiments:

(i) For the symmetric matrix factorization problem (6) with $n = 2$ and $r = 1$, we have created a 2D visualization of the level curves of the objective, along with the iterates of both classical gradient descent (GD) and nonlinearly preconditioned gradient descent (P-GD) with $\phi(x) = \cosh(\Vert x \Vert) - 1$. As suggested by Reviewer M3Bk, this provides the reader with some intuition regarding the (P-GD) behavior. In particular, it reveals that unless (GD) is initialized close to a stationary point, the stepsize must be chosen (very) small to prevent the iterates from diverging -- as expected, because the objective is not Lipschitz smooth. In contrast, the (P-GD) iterations take the form (for $\bar L = 1$)

$$x^+ = x - \gamma\, \frac{\sinh^{-1}(\Vert \nabla f(x) \Vert)}{\Vert \nabla f(x) \Vert}\, \nabla f(x).$$

In this case, large gradients are damped -- recall the close resemblance to clipping methods, cf. Fig 1b -- resulting in the convergence of (P-GD) for stepsizes $\gamma$ that are often orders of magnitude larger than the maximum stepsize of (GD). In turn, this causes (P-GD) to often require significantly fewer iterations, and to overall outperform (GD) for a fixed stepsize.

(ii) We illustrate the fast escape of saddle points by Algorithm 1 by reproducing the experiment of [11, Section 5] for Algorithm 1. In particular, we consider the `octopus' objective from [11], which was precisely constructed to ensure that (GD) takes exponential time to find a local minimizer. We also select all hyperparameters as in the experiment of [11], and set the only additional hyperparameter $\bar L = 1$. We compare against classical GD, and vary the constant $L \in \{1, 1.5, 2, 3\}$ and the dimension $n \in \{5, 10\}$, thus recreating the plots [11, Figs 3 and 4] for Algorithm 1 and classical GD. The resulting plots look very similar to [11, Figs 3 and 4] and validate Theorem 3.6. For example, they confirm that the number of iterations scales proportionally with $L$. In the same way, we also confirmed that the number of iterations scales proportionally with $\bar L^{-1}$.

We believe these experiments significantly improve our manuscript and therefore plan to include them.


Comment: Lack of analysis for the parameter $r$ in Assumption 2.8.

Response: The meaning of this comment is not entirely clear to us, so please let us know if it is not properly addressed by the statements below.

In case you are wondering how to obtain $r$ in Assumption 2.8 for specific applications: the degree of the polynomial bound on the Hessian norm is typically quite clear, especially for polynomial objectives. Given this polynomial upper bound of degree $r$, the main difficulty in showing that Assumption 2.8 holds is establishing the polynomial lower bound on the gradient norm.

In case this comment relates to us using the same symbol $r$ for the polynomial degree in Assumption 2.8 and the perturbation radius $r$ in Algorithm 1 (which are unrelated, see also Question 2), we apologize. We will update the notation.


Question: Does requiring the gradient growth $r+1$ to strictly exceed the Hessian growth $r$ limit application scenarios?

Response: We are not aware of practical applications in which this is the limiting factor, although toy examples can of course be constructed.

Nevertheless, we highlight that in cases where the gradient norm is lower bounded by a polynomial of degree $r$ rather than $r+1$, Theorem 2.9 still holds, as noted in the footnote on page 4. Likewise, the proof of Theorem 2.11 still holds for reference functions $\phi(x) = h(\Vert x \Vert)$ for which ${h^*}'(y)$ is uniformly lower and upper bounded. This is the case, e.g., for $h(x) = -\vert x \vert - \ln(1 - \vert x \vert)$, but not, e.g., for $h(x) = \cosh(x) - 1$.
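For concreteness, the conjugate derivatives of the two kernels just mentioned can be written out for $y \geq 0$ (a quick check of ours, not a statement from the paper): $h(x) = -x - \ln(1 - x)$ has $h'(x) = \frac{x}{1-x}$ and hence ${h^*}'(y) = \frac{y}{1+y} \in [0, 1)$, which stays bounded, whereas $h(x) = \cosh(x) - 1$ gives ${h^*}'(y) = \operatorname{arcsinh}(y)$, which grows without bound.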

For simplicity of exposition, and because we believe this entails no significant limitation, we formulated Assumption 2.8 with the gradient growth strictly exceeding the Hessian growth.


Question: What relationship exists between the polynomial degree $r$ and the function $f$ in Algorithm 1? Is there any theoretical or empirical analysis on how to select the appropriate $r$ for different functions $f$?

Response: The polynomial degree $r$ in Assumption 2.8 is unrelated to the perturbation radius $r$ in Algorithm 1: the former bounds the gradient and Hessian growth of $f$, whereas the latter is a hyperparameter of Algorithm 1, which we fix (cf. (13)) in the analysis of Theorem 3.6.

We acknowledge that using the same symbol for both is confusing, and we will update the notation to use separate symbols; our apologies for any confusion this may have caused.


Question: It is not entirely clear under what practical conditions Assumption 3.3 is guaranteed to hold. Are there natural or verifiable sufficient conditions on the function $f$ and the reference function $\phi$ that ensure Assumption 3.3 holds?

Response: Thank you for raising this interesting question. First, we would like to point out that the reasoning in Example 3.4 with $\phi = \cosh(\vert \cdot \vert) - 1$ actually holds for any univariate polynomial, regardless of its degree. We will also explicitly mention this in the manuscript. This result provides a strong indication that Assumption 3.3 is relatively mild, although verifiable sufficient conditions on $f$ and $\phi$ remain difficult to obtain.

Nevertheless, it appears that polynomial bounds can also be used as a sufficient condition in this case. We further explored this idea under the conditions of Theorem 2.11 and the additional assumption that $h^* \in \mathcal{C}^3$, which, we note, holds for $h = \cosh(\cdot) - 1$ but not for the other kernel functions in (2).

Lipschitz continuity of $H_\lambda(x)$ can be established through boundedness of $\Vert \nabla H_\lambda(x) \Vert$, which in turn can be bounded by an expression of the form

$$\Vert \nabla H_\lambda(x) \Vert \leq \Vert \nabla^3 \phi^*(\nabla f(x)) \Vert\, \Vert \nabla^2 f(x) \Vert^2 + \Vert \nabla^2 \phi^*(\nabla f(x)) \Vert\, \Vert \nabla^3 f(x) \Vert.$$

After careful examination of the structure of $\Vert \nabla^3 \phi^*(\nabla f(x)) \Vert$, a similar reasoning as in the proof of Theorem 2.11 indicates that the following two conditions on $f$ and $h$ would be sufficient:

(i) $\Vert \nabla^3 f(x) \Vert$ is upper bounded by a polynomial of degree $r$ in the variable $\Vert x \Vert$ (similar to the Hessian bound in Assumption 2.8).

(ii) $\vert {h^*}'''(y) \vert$ decreases fast enough as $y \to \infty$, i.e.,

$$\lim_{y \to +\infty} \vert y^2\, {h^*}'''(y) \vert < \infty.$$

We note that this condition holds for $h = \cosh(\cdot) - 1$.
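Indeed, a direct univariate computation (ours, not from the paper) verifies this: using ${h^*}'(y) = \operatorname{arcsinh}(y)$,

$${h^*}''(y) = \frac{1}{\sqrt{1 + y^2}}, \qquad {h^*}'''(y) = \frac{-y}{(1 + y^2)^{3/2}}, \qquad \vert y^2\, {h^*}'''(y) \vert = \frac{y^3}{(1 + y^2)^{3/2}} \to 1 < \infty.$$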

It would be interesting to further investigate the approach outlined above. In particular, we believe that the assumption $h^* \in \mathcal{C}^3$ can be relaxed to accommodate both $h(x) = \exp(\vert x \vert) - \vert x \vert - 1$ and $h(x) = -\vert x \vert - \ln(1 - \vert x \vert)$, in which case the corresponding ${h^*}''(y)$ is differentiable everywhere except at $0$.


We hope that we have accurately addressed any concerns that you may have had.

Comment

Thank you for your response. I appreciate the discussion, and I stand by my positive assessment.

Review
Rating: 5

The paper studies nonconvex optimization under a generalized smoothness condition, called $(L,\bar{L})$-anisotropic smoothness, introduced in [32]. This condition generalizes both the standard Lipschitz smoothness and the $(L_0,L_1)$-smoothness proposed in [38]. The latter has been empirically validated for LSTMs in [38], and later for transformers in [36], where gradient clipping was shown to be effective.

Building on this, the paper takes an important step forward by theoretically showing that several well-known standard nonconvex problems, such as phase retrieval and matrix factorization, satisfy both the $(L_0,L_1)$-smoothness and the $(L,\bar{L})$-anisotropic smoothness conditions. This provides additional support for their practical relevance. Furthermore, the paper is the first to show that the (perturbed) nonlinearly preconditioned gradient method, including gradient clipping, can (efficiently) escape saddle points under these relaxed smoothness assumptions.

优缺点分析

  1. The paper presents that several well-known nonconvex problems satisfy the $(L,\bar{L})$-anisotropic smoothness condition, by first identifying a sufficient condition (Assumption 2.8), formulated in terms of polynomials, and then verifying that these examples satisfy this assumption. This offers a valuable theoretical justification for analyzing the problems under this relaxed condition, beyond the traditional Lipschitz smoothness, which was lacking in prior work. I would like to point out one relevant reference that appears to be missing, which also explores a relaxed smoothness condition for the phase retrieval problem:
  • Bolte, Sabach, Teboulle and Vaisbourd, First Order Methods Beyond Convexity and Lipschitz Gradient Continuity with Applications to Quadratic Inverse Problems, SIAM Journal on Optimization, 2018.
  2. As the authors note, extending saddle point escape guarantees from the standard Lipschitz smoothness setting to the more general $(L,\bar{L})$-anisotropic smoothness is nontrivial, which in particular required using a recent extension of the stable-center manifold theorem [8]. This extension strengthens the theoretical foundation for the empirical success of gradient clipping in modern neural network training, where such relaxed smoothness conditions may hold, and represents a meaningful contribution.
  • Although the authors provide careful explanations and examples of the $(L,\bar{L})$-anisotropic smoothness condition, the geometric intuition behind the definition and properties, particularly in Definition 2.3 and Proposition 2.4, remains difficult to grasp due to their complexity. While I understand this may be hard to elaborate further within the page limit, I recommend that the authors guide interested readers to [32] for more geometric intuition, and consider adding a brief pointer in the revised version.

Questions

Theorem 3.2: Can you provide an insight into why the set $U$ needs to have measure one? I first need some explanation of why $\Vert\nabla f(x)\Vert$ and $\bar{L}$ are related here.

Limitations

The paper does not explicitly state its limitations, but I did not observe any major ones.

Formatting Issues

.

Author Response

Dear Reviewer,

We thank you for your encouraging feedback.


Suggestion: Another relevant reference.

Response: Thank you for bringing this relevant reference to our attention. We will include it accordingly.


Suggestion: Guide readers to [32] for a geometric intuition behind $(L, \bar L)$-anisotropic smoothness.

Response: Thank you for this suggestion. We will add an explicit pointer to [32, p2], which links anisotropic smoothness to the related concept of $\Phi$-convexity and provides a visualization of the latter. For a deeper geometric intuition, we believe it is also worth pointing the interested reader to the related works [23, 31] on $\Phi$-convexity.


Question: Theorem 3.2: Why does $U$ need to have measure one, and why are $\Vert \nabla f(x) \Vert$ and $\bar L$ related?

Response: The recent extension of the stable-center manifold theorem to nonsmooth iteration maps [8, Prop. 2.5] requires the iteration map $T_{\gamma, \lambda}$ to be continuously differentiable on a set of measure one. This is the case if $\nabla \phi^*(\bar L^{-1}\nabla f(\cdot))$ is continuously differentiable almost everywhere.

For $\phi = h(\Vert \cdot \Vert)$ with $h = \frac{1}{2} \Vert \cdot \Vert^2 + \delta_{[-1, 1]}$ we have that

$$\nabla \phi^*(\bar L^{-1} \nabla f(x)) = \min \left( \frac{1}{\Vert \bar L^{-1} \nabla f(x) \Vert}, 1 \right) \bar L^{-1} \nabla f(x),$$

which is potentially non-differentiable at points $x$ for which $\Vert \bar L^{-1} \nabla f(x) \Vert = 1$. If $U$ has measure one, then $\nabla \phi^*(\bar L^{-1}\nabla f(\cdot))$ is continuously differentiable almost everywhere, as required by [8, Prop. 2.5]. We also refer to Question 3 of Reviewer M3Bk for more details on when this assumption holds.
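For illustration, a minimal code sketch of the resulting hard-clipping step (our own illustrative code, not from the paper):

```python
import numpy as np

# Illustrative sketch of the hard-clipping preconditioned step discussed above:
#   x+ = x - gamma * nabla phi^*(grad f(x) / L_bar),
# with nabla phi^*(y) = min(1/||y||, 1) * y. The update direction is
# non-differentiable exactly where ||grad f(x)|| = L_bar, which is why the set U
# is assumed to have measure one.
def hard_clip_step(x, grad_f, gamma, L_bar=1.0):
    y = grad_f(x) / L_bar
    ny = np.linalg.norm(y)
    direction = y if ny <= 1.0 else y / ny   # equals min(1/||y||, 1) * y
    return x - gamma * direction

# Hypothetical usage with f(x) = 1/2 ||x||^2 (so grad f(x) = x):
x_next = hard_clip_step(np.array([3.0, 4.0]), lambda z: z, gamma=0.1)
```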


We thank you again for your suggestions and hope this response adequately addresses your question.

Final Decision

The paper studies the nonlinearly preconditioned gradient method, which generates its iterate updates by pushing the gradient through a nonlinear map, under an anisotropic smoothness condition. This framework encompasses both standard gradient descent and methods that use clipping or truncation, variants of which have been proposed for studying the training of transformers and other models, as well as for statistical problems with heavy-tailed costs. The paper studies the saddle-escaping property of this family of methods under anisotropic smoothness. The former is important for guaranteeing high-quality solutions under structured nonconvexity (e.g., phase retrieval and matrix factorization), while the latter is important for nonlinear problems with high-order polynomial costs, where the global Lipschitz property breaks down.

The paper proves $(L,\bar{L})$-anisotropic smoothness of a number of problems with structured nonconvexity, including matrix factorization, phase retrieval, and Burer-Monteiro factorization. It then proves that the nonlinearly preconditioned gradient method has the saddle-avoiding property, i.e., from random initialization it escapes strict saddles with probability one, and, in its perturbed variant, it efficiently escapes saddles.

Reviewers positively evaluate the paper's novelty and technical contributions, noting that relaxed Lipschitz conditions are necessary in problems with higher-order costs. The paper provides both a probability-one guarantee of saddle avoidance using the stable manifold theorem and a complexity analysis for the perturbed version. The paper is very clearly organized and well-written. The only limitation noted by the reviewers is the lack of numerical experiments. This was partially addressed during the discussion period; in any case, the theoretical contributions of the paper are strong enough to merit acceptance.