PaperHub
5.3 / 10
Poster · 3 reviewers
Ratings: 7 / 4 / 5 (min 4, max 7, std 1.2)
Confidence: 3.7
Correctness: 3.3
Contribution: 2.7
Presentation: 2.7
NeurIPS 2024

Globally Q-linear Gauss-Newton Method for Overparameterized Non-convex Matrix Sensing

OpenReview · PDF
Submitted: 2024-05-11 · Updated: 2025-01-13

Abstract

Keywords
low-rank matrix sensing, non-convex optimization, Gauss-Newton method

Reviews and Discussion

Official Review
Rating: 7

This paper introduces a custom Gauss-Newton method dubbed AGN to solve general over-parameterized matrix sensing problems, and demonstrates that 1) this new method is a descent method under benign assumptions and 2) it achieves Q-linear convergence under restrictive RIP assumptions.

Strengths

  1. Introduces a new algorithm beyond GD-based ones to solve this hard non-convex problem with some theoretical guarantees, which is nice. The problem framework is relatively general and is not tailored to very specific instances of matrix sensing.

  2. Shows theoretically that AGN will not be trapped at saddle points.

  3. The authors prove the fast convergence of this algorithm under a small RIP constant.

  4. The structure of this paper is easy to follow, addresses appropriate prior works, and offers appropriate level of details. Overall, this paper is very well-written and a pleasure to read.

Weaknesses

  1. I still find the computational cost and the difficulty of obtaining a good AGN update hard to assess. It is well known that GN methods perform better than GD in terms of landscape and convergence rate, but at the cost of computation. Therefore, readers need to know where this tradeoff stands in this scenario.

  2. Section 5 offers the analysis under a very restrictive setting, namely an extremely small RIP constant. Such constants are hard to find in real life, and even GD methods perform very well under this setting, some achieving linear convergence when not too far from the ground truth. Readers need to know how AGN stands out in this easy setting.

  3. Theorem 2 offers the convergence results in terms of the function value, but this is a tricky choice because we don't know whether the distance between $X_t$ and the ground truth also shrinks at a linear rate, which is what we are after. Since Theorem 2 uses a small RIP constant, I imagine the difference in function value can be transformed into matrix distances with a minor twist, but I still find it intriguing why the authors made this choice.

Questions

See some above.

Limitations

N/A

Author Response

We greatly appreciate the reviewer's positive evaluation of our work, and we also thank you for your valuable and constructive suggestions. We provide detailed responses to your concerns as follows.

1. Question: The computational cost of AGN over GD.
Answer: In general, the computational cost of the Gauss-Newton method is higher than that of GD. However, in our low-rank matrix recovery problem, the proposed AGN benefits from the low-rank structure, making the computational cost of solving equation (8) not much higher than that of GD, given that the dimension $d$ is much smaller than $n$. Specifically, equations (33) and (34) indicate that the main computational overhead of AGN compared to GD is the inversion of a matrix of size $d \times d$, which is manageable for small $d$.
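As a rough, hypothetical illustration of this point (the sizes and matrices below are made up and the snippet does not reproduce equations (33) and (34)), when $d \ll n$ the extra $d \times d$ inversion costs little compared with the $n \times d$ matrix products that a gradient step already performs:

```python
import numpy as np
from time import perf_counter

# Hypothetical sizes for a cost illustration only (not the paper's equations).
n, d = 20000, 10
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))
G = rng.standard_normal((n, d))          # stand-in for a gradient-sized matrix

t0 = perf_counter()
_ = X @ (X.T @ G)                        # O(n d^2): product a GD step already pays for
t1 = perf_counter()
_ = np.linalg.inv(X.T @ X + 1e-8 * np.eye(d))   # d x d inversion: O(n d^2) + O(d^3)
t2 = perf_counter()
print(f"GD-like product: {t1 - t0:.4f}s, d x d inversion: {t2 - t1:.4f}s")
```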

2. Question: About the RIP constant and the advantage of AGN over GD under special initialization.
Answer: We aim to show that the proposed AGN method can effectively avoid saddle points with linear convergence of the function value. As a result, we may not need to improve the RIP constant, though this could be a valuable direction for our future research. With this small RIP constant, even with good initialization, GD converges sub-linearly for the over-parameterized low-rank matrix sensing model, as noted by [15]. Additionally, the convergence of GD highly depends on the condition number of the target matrix, as indicated by [26]. In contrast, AGN guarantees linear convergence that is independent of the condition number of the target matrix.

3. Question: The convergence in terms of the function value vs. the variables X.
Answer: Our primary objective is to demonstrate that the AGN method avoids getting trapped in saddle points, where the function value is high but the gradient norm is low. We aim to illustrate the decrease in function value (as adopted by [15]), based on the premise that if an optimization method is hindered by saddle points, its function value will decrease very slowly near these points, as shown for GD in Fig. 1 of the main paper. The linear convergence of the AGN function value at a constant rate clearly demonstrates that AGN does not get trapped in saddle points.
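As a toy illustration of this premise (our own example, not from the paper), plain gradient descent on the scalar factorization loss $f(u,v) = \frac{1}{2}(uv-1)^2$ started near the saddle point $(0,0)$ barely reduces the loss for dozens of iterations before escaping:

```python
# Toy example (not from the paper): near the saddle (0, 0) of
# f(u, v) = 0.5 * (u*v - 1)^2 the gradient is tiny, so GD makes slow progress.
def f(u, v):
    return 0.5 * (u * v - 1.0) ** 2

u, v = 1e-3, 1e-3   # small initialization close to the saddle
step = 0.1
for it in range(101):
    if it % 10 == 0:
        print(f"iter {it:3d}  f = {f(u, v):.6f}")
    r = u * v - 1.0                         # residual
    u, v = u - step * r * v, v - step * r * u
```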

Moreover, with a small RIP constant, it is straightforward to derive the convergence of the variable $X$ from the linear convergence of the function value and the distance metric ${\rm dist}(X_t, X^*) = \min_Q \frac{1}{2}\|U_t Q - U^*\|_F^2 + \frac{1}{2}\|V_t Q - V^*\|_F^2$ over orthogonal matrices $Q$, where $X_t = \begin{bmatrix} U_t^{\top} & V_t^{\top} \end{bmatrix}^{\top}$ and $X^* = \begin{bmatrix} U^{*\top} & V^{*\top} \end{bmatrix}^{\top}$ is the global optimal solution. We thank the reviewer for the helpful comment. We will include the convergence of ${\rm dist}(X_t, X^*)$ in our revision.
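For concreteness, a minimal sketch of how this aligned distance can be evaluated (the helper name and shapes are assumptions for illustration, not code from the paper), using the classical orthogonal Procrustes solution for the inner minimization over $Q$:

```python
import numpy as np

def dist_to_optimum(Xt, Xstar):
    """Evaluate 0.5 * min_Q ||Xt @ Q - Xstar||_F^2 over orthogonal Q.

    Xt and Xstar are the stacked factors [U; V] and [U*; V*], assumed to have
    the same number of columns (hypothetical helper, not the paper's code).
    """
    # Orthogonal Procrustes: the minimizer is Q = A @ Bt, from the SVD Xt.T @ Xstar = A S Bt.
    A, _, Bt = np.linalg.svd(Xt.T @ Xstar)
    Q = A @ Bt
    return 0.5 * np.linalg.norm(Xt @ Q - Xstar, 'fro') ** 2
```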

Comment

I thank the authors for the explanation, which is helpful in general. I still think favorably of this work and would like to keep my original rating.

Comment

We sincerely appreciate your time and efforts in providing us with your response. Your support has made a significant difference in our work, and we are confident that it will lead to a stronger final product.

Official Review
Rating: 4

This paper focuses on the optimization of overparameterized, non-convex low-rank matrix sensing (LRMS), an essential component in contemporary statistics and machine learning. This paper introduces an approximated Gauss-Newton (AGN) method for tackling the non-convex LRMS problem. Notably, AGN incurs a computational cost comparable to gradient descent per iteration but converges much faster without being slowed down by saddle points. This paper proves that, despite the non-convexity of the objective function, AGN achieves Q-linear convergence from random initialization to the global optimal solution. Moreover, under certain conditions on the sensing operator, AGN demonstrates a super-linear convergence rate. The global Q-linear convergence of AGN represents a substantial enhancement over the convergence of the existing methods for the overparameterized non-convex LRMS.

Problem:

There are some severe typos that will affect the readability of this paper.

(1) Line 134, it should be $J(x_t) = \phi'(x_t)$ instead of $J(x_t) = \psi'(x_t)$. (2) Line 155, $\mathcal{B}(X,X) = \mathcal{A}(PXX^\top Q)$; so, what is $\mathcal{B}(\Delta, X_t)$ in Eq. (7)? (3) Why does Eq. (7) relate to Gauss-Newton? I know $\phi(X) = \mathcal{B}(X,X) - b$, but what is its Jacobian? What is $J\Delta$?

Due to these problems, I think this paper is hard to read.

Strengths

This paper focuses on the optimization of overparameterized, non-convex low-rank matrix sensing (LRMS), an essential component in contemporary statistics and machine learning. This paper introduces an approximated Gauss-Newton (AGN) method for tackling the non-convex LRMS problem. Notably, AGN incurs a computational cost comparable to gradient descent per iteration but converges much faster without being slowed down by saddle points. This paper proves that, despite the non-convexity of the objective function, AGN achieves Q-linear convergence from random initialization to the global optimal solution. Moreover, under certain conditions on the sensing operator, AGN demonstrates a super-linear convergence rate. The global Q-linear convergence of AGN represents a substantial enhancement over the convergence of the existing methods for the overparameterized non-convex LRMS.

Weaknesses

No

Questions

No

Limitations

No

Author Response

We would like to express our sincere gratitude for your valuable comments on our work. We give detailed responses to your concerns below, which we hope will help you fully understand our work. Please feel free to let us know if you have any further concerns. We would deeply appreciate it if you could raise your score should our responses resolve your concerns.

1. Question: About the typos.
Answer: We apologize for the typo here and will thoroughly review the entire manuscript to eliminate any remaining errors.

2. Question: On the expression $\mathcal{B}(X,X) = \mathcal{A}(PXX^\top Q)$.
Answer: We will restate the bilinear function as $\mathcal{B}(X,Y) = \mathcal{A}(PXY^\top Q)$ to ensure clarity, so that $\mathcal{B}(\Delta, X) = \mathcal{A}(P\Delta X^\top Q)$ is unambiguous.

3. Question: About the Jacobian and the first-order approximation of $\mathcal{B}(X,X) - b$.
Answer: It is straightforward to find that $J\Delta = \mathcal{B}(X, \Delta) + \mathcal{B}(\Delta, X) = \mathcal{A}(P\Delta X^\top Q) + \mathcal{A}(PX\Delta^\top Q)$. Note that $\phi(X+\Delta) = \mathcal{A}(P(X+\Delta)(X+\Delta)^\top Q) - b = \mathcal{A}(PXX^\top Q) - b + \mathcal{A}(P\Delta X^\top Q) + \mathcal{A}(PX\Delta^\top Q) + \mathcal{O}(\|\Delta\|_F^2)$. Therefore the first-order Taylor expansion is $\phi(X) + \mathcal{A}(P\Delta X^\top Q) + \mathcal{A}(PX\Delta^\top Q) = \phi(X) + J\Delta$.
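As a quick sanity check of this expansion (a self-contained toy with a random sensing operator and made-up dimensions, not the setup of the paper), the remainder $\phi(X+\Delta) - \phi(X) - J\Delta$ shrinks quadratically in the size of the perturbation:

```python
import numpy as np

# Toy verification of the first-order expansion: phi(X) = A(P X X^T Q) - b,
# with A(M)_i = <A_i, M> for random sensing matrices A_i (illustrative sizes).
n, d, m = 8, 3, 20
rng = np.random.default_rng(0)
A_mats = rng.standard_normal((m, n, n))
P = rng.standard_normal((n, n))
Q = rng.standard_normal((n, n))
b = rng.standard_normal(m)

def A_op(M):
    # Linear sensing operator: i-th entry is the Frobenius inner product <A_i, M>.
    return np.einsum('ijk,jk->i', A_mats, M)

def phi(X):
    return A_op(P @ X @ X.T @ Q) - b

def J_apply(X, Delta):
    # J Delta = A(P Delta X^T Q) + A(P X Delta^T Q)
    return A_op(P @ Delta @ X.T @ Q) + A_op(P @ X @ Delta.T @ Q)

X = rng.standard_normal((n, d))
Delta = rng.standard_normal((n, d))
for t in (1e-1, 1e-2, 1e-3):
    rem = np.linalg.norm(phi(X + t * Delta) - phi(X) - t * J_apply(X, Delta))
    print(f"t = {t:.0e}  remainder = {rem:.3e}")   # decreases roughly like t^2
```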

Comment

Dear reviewer L5qb,

We appreciate the time and effort you’ve dedicated to reviewing. As the discussion period nears its end, please feel free to reach out with any additional concerns. We would be happy to address them. If our responses have resolved your concerns, we would greatly appreciate it if you could consider raising your score.

Thank you very much!

Comment

Dear reviewer,

Since the end of the discussion period is approaching, if you have any further concerns please feel free to let us know and we are pleased to discuss with you. Thank you very much!

Official Review
Rating: 5

In this submission, the authors proposed an approximated Gauss-Newton (AGN) method for the overparameterized non-convex low-rank matrix sensing problem. The authors presented the corresponding theoretical analysis and partially explained why the proposed AGN method achieves fast convergence rates.

Strengths

In this submission, the authors proposed a new AGN method. The main idea is natural, and the authors provide both theoretical analysis and numerical experiments whose empirical results are consistent with the theoretical results. The main structure of this submission is clear and the presentation is of high quality.

Weaknesses

There are two major drawbacks of this paper.

  1. The authors should put all the empirical results from the numerical experiments into one single section. The authors should provide more numerical experiments to compare the proposed method with other state-of-the-art algorithms and put all the figures and tables in one numerical-experiments section.

  2. The authors didn't present or prove the superlinear convergence rate of the proposed AGN algorithm.

Questions

The authors claim that the proposed method has a superlinear convergence rate. Where is the theorem about the superlinear convergence rate?

Limitations

The authors have presented the limitations in the last section.

Author Response

We deeply appreciate the reviewer's careful review, constructive suggestions, and positive feedback. The following are our responses to your concerns. We would greatly appreciate it if you could raise your score should our responses resolve your concerns.

1. Question: About the experiments.
Answer: Thank you for the suggestion. We have added comparisons of the iteration complexity of the proposed AGN with GD, PrecGD, and ScaledGD($\lambda$), as shown in the following table:

| reference | algorithm | iteration complexity |
| --- | --- | --- |
| [19] | GD | $\kappa^8 + \kappa^6\log(\kappa n / \epsilon)$ |
| [15] | PrecGD | $\kappa^8 + \log(1/\epsilon)$ |
| [27] | ScaledGD($\lambda$) | $\log \kappa \cdot \log(\kappa n) + \log(1/\epsilon)$ |
| ours | AGN | $\log(1/\epsilon)$ |

where $\kappa$ is the condition number of the target matrix and $n$ is the matrix dimension. These results are also presented in Table 1 (in the rebuttal PDF file). Additionally, we provide comparisons of the computational time of these competing methods across different dimensions in Table 2 (in the rebuttal PDF file). Both theoretical and experimental results demonstrate the superiority of the proposed AGN over the competing methods. We will include details of the experimental settings and the above comparisons in the revision.

2. Question: About the superlinear convergence.
Answer: The superlinear convergence is indicated by $c_q = 0$ in Theorem 2, as noted at the end of Theorem 2. We will make this clearer in the revision.

Comment

Dear reviewer KSmV,

We understand that reviewing is a time-consuming task and we want to express our gratitude for your dedication. We value your expertise and opinion, and hope that you can take time to have discussions with us if you have any further concerns. We would greatly appreciate it if you could raise your score if our responses have addressed your concerns.

Thank you very much!

Comment

Dear reviewer,

Since the end of the discussion period is approaching, if you have any further concerns please feel free to let us know and we are pleased to discuss with you. Thank you very much!

Comment

Thanks for the response of the authors.

After carefully reading all the rebuttals, I change my score to 5 for this submission.

Comment

We sincerely appreciate your time and efforts in providing us with your response. Your support has made a significant difference in our work, and we are confident that it will lead to a stronger final product.

Author Response

Dear ACs and reviewers,

Thank you very much for your valuable comments. We truly appreciate the time and effort you've taken to review our work. We're glad the reviewers found our work valuable and provided positive feedback. Your feedback is important to us, and in accordance with the reviewers' comments, we have carefully revised the manuscript and provided detailed responses to all your concerns (please refer to our rebuttal for each reviewer for more detailed responses).

Saddle points can significantly slow the convergence of gradient descent methods in non-convex optimization problems. Previous works have shown that over-parameterization further reduces the convergence rate of gradient descent from linear to sublinear for low-rank matrix sensing. This paper demonstrates that by employing an approximation of the Gauss-Newton method, AGN can effectively and efficiently escape all saddle points with a linear convergence rate, significantly improving upon the results of gradient descent. We include in the one-page rebuttal PDF comparisons of iteration complexity (Table 1) and computational time (Table 2) of AGN with most recent state-of-the-art methods.

We greatly appreciate the reviewer’s comments on the experiment section. In our revision, we will provide more detailed settings and analyses for the figures and tables in the experiment section. We are also grateful to the reviewers for pointing out the typos and their concerns, which are invaluable for improving our work.

Once again, thank you for taking the time to review our work and provide valuable insights. We will take all the reviewers' suggestions into consideration for future improvements. We are also open to further discussions with the reviewers if there are any additional concerns.

Regards, Authors of #4376

Final Decision

This paper demonstrates that AGN achieves Q-linear convergence from random initialization to the global optimal solution despite the non-convexity of the objective function, and under certain conditions on the sensing operator, it further exhibits a super-linear convergence rate, representing a significant improvement over existing methods for overparameterized non-convex LRMS.