PaperHub

NeurIPS 2024 · Poster · 4 reviewers
Rating: 6.3/10 (individual ratings 7, 4, 7, 7; min 4, max 7, std 1.3)
Confidence: 3.3 · Correctness: 3.0 · Contribution: 2.8 · Presentation: 2.8

Provable Acceleration of Nesterov's Accelerated Gradient for Asymmetric Matrix Factorization and Linear Neural Networks

OpenReview · PDF
Submitted: 2024-05-14 · Updated: 2024-11-06

Abstract

We study the convergence rate of first-order methods for rectangular matrix factorization, which is a canonical nonconvex optimization problem. Specifically, given a rank-$r$ matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$, we prove that gradient descent (GD) can find a pair of $\epsilon$-optimal solutions $\mathbf{X}_T\in\mathbb{R}^{m\times d}$ and $\mathbf{Y}_T\in\mathbb{R}^{n\times d}$, where $d\geq r$, satisfying $\lVert\mathbf{X}_T\mathbf{Y}_T^\top-\mathbf{A}\rVert_F\leq\epsilon\lVert\mathbf{A}\rVert_F$ in $T=O(\kappa^2\log\frac{1}{\epsilon})$ iterations with high probability, where $\kappa$ denotes the condition number of $\mathbf{A}$. Furthermore, we prove that Nesterov's accelerated gradient (NAG) attains an iteration complexity of $O(\kappa\log\frac{1}{\epsilon})$, which is the best-known bound of first-order methods for rectangular matrix factorization. Different from small balanced random initialization in the existing literature, we adopt an unbalanced initialization, where $\mathbf{X}_0$ is large and $\mathbf{Y}_0$ is $0$. Moreover, our initialization and analysis can be further extended to linear neural networks, where we prove that NAG can also attain an accelerated linear convergence rate. In particular, we only require the width of the network to be greater than or equal to the rank of the output label matrix. In contrast, previous results achieving the same rate require excessive widths that additionally depend on the condition number and the rank of the input data matrix.
Keywords
matrix factorization · nonconvex optimization · Nesterov's acceleration · linear neural networks
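To make the setting concrete, below is a minimal NumPy sketch of the objective $\frac{1}{2}\lVert XY^\top-A\rVert_F^2$ with the unbalanced initialization described in the abstract ($X_0$ a scaled Gaussian sketch of $A$, $Y_0=0$), run with plain GD or a constant-momentum variant of NAG. The scaling constant `c`, the step size, the momentum factor `beta`, and the toy dimensions are illustrative assumptions, not the paper's exact choices.

```python
import numpy as np

def factorize(A, d, T=1000, nag=True, c=5.0, beta=0.9, seed=0):
    """Sketch: GD / NAG on 0.5 * ||X @ Y.T - A||_F^2 with unbalanced init.

    X0 = c * A @ Phi (a Gaussian sketch of A) and Y0 = 0 follow the abstract;
    the values of c, beta, and the step size below are illustrative guesses.
    """
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    Phi = rng.normal(0.0, np.sqrt(1.0 / d), size=(n, d))
    X, Y = c * (A @ Phi), np.zeros((n, d))           # unbalanced: X0 large, Y0 = 0
    eta = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)    # rough 1/L guess from ||X0||_2
    Xp, Yp = X.copy(), Y.copy()                      # previous iterates (for momentum)
    mom = beta if nag else 0.0
    for _ in range(T):
        # NAG extrapolation step (reduces to plain GD when mom = 0)
        Zx, Zy = X + mom * (X - Xp), Y + mom * (Y - Yp)
        R = Zx @ Zy.T - A                            # residual at the extrapolated point
        Xp, Yp = X, Y
        X, Y = Zx - eta * (R @ Zy), Zy - eta * (R.T @ Zx)
    return X, Y

# Toy usage: rank-5 target, overparameterized width d = 10 >= r.
rng = np.random.default_rng(1)
A = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 80))
X, Y = factorize(A, d=10)
print(np.linalg.norm(X @ Y.T - A) / np.linalg.norm(A))   # relative error
```

Note that with $Y_0=0$ the gradient with respect to $X$ vanishes at the first step, so the first update moves only $Y$; this is the unbalanced regime that the reviews and rebuttals below discuss.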

Reviews and Discussion

Review
Rating: 7

In this study, the authors established the convergence rates of the gradient descent and Nesterov's accelerated gradient descent methods for asymmetric matrix factorization and the 2-layer linear network. The authors proved that an unbalanced initialization can lead to linear convergence of both methods, and that Nesterov's acceleration results in a faster convergence rate. Numerical experiments are conducted to support the theory.

Strengths

This paper is clearly written and easy to follow in general. The results are novel and inspiring to audiences in the non-convex optimization field. I did not check the proofs in the appendix due to the time limit, but the part in the main manuscript looks correct to me.

Weaknesses

I do not see major weaknesses. The authors could consider discussing more related literature and include more intermediate steps in the main manuscript. Also, I think the authors can consider illustrating the effect of an unbalanced initialization in numerical experiments.

Questions

  • I think the following two papers also discussed the asymmetric low-rank matrix optimization problem. It would be better if the authors could discuss them and the references therein:

[1] Zhang, H., Bi, Y., & Lavaei, J. (2021). General low-rank matrix optimization: Geometric analysis and sharper bounds. Advances in Neural Information Processing Systems, 34, 27369-27380.

[2] Bi, Y., Zhang, H., & Lavaei, J. (2022, June). Local and global linear convergence of general low-rank matrix recovery problems. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 36, No. 9, pp. 10129-10137).

  • Following the above comment, I wonder if the authors can briefly discuss the potential and challenges of extending the results to more general low-rank matrix optimization problems.

  • Lines 79-80: I am not sure why the method in [Stöger and Soltanolkotabi, 2021] is considered a preconditioned method.

  • Theorem 1: In my understanding, the value $\|R_t\|_F$ should be proportional to the scale of $A$. However, in the bound of $\|R_t\|_F$, the right-hand side grows with the size of $A$. Maybe there is a typo?

  • The above comment also applies to Theorem 2.

  • It would be better to mention in Theorems 1-2 that the bound $T$ is derived using the bound in Proposition 1?

  • Line 182: I think the shrinkage rate $\theta \in (0, \rho]$?

  • I wonder if the authors can compare the performance of GD/NAG with different values of $c$?

Limitations

See my comments in the Questions section.

Author Response

Thank you very much for your positive feedback and constructive comments. Our responses to each of the comments are listed below.

Q1: I think the following two papers also discussed the asymmetric low-rank matrix optimization problem.

We thank the reviewer for pointing out these related works, and we will cite [1] and [2] and discuss them in the next version.

Q2: I wonder if the authors can briefly discuss the potential and challenges of extending the results to more general low-rank matrix optimization problems.

Generalizing the results to the general low-rank matrix optimization problem $\min_{X,Y}f(XY^\top)$ is challenging. One of the challenges is the possible non-linearity of the gradient w.r.t. $X$. In our proof, a key step is to show the residual and the error terms in all iterations are in the contraction subspace of the dynamics, which requires the gradient of $f$ to preserve the column space of $X$. Unfortunately, such a property does not hold for general loss functions. Nevertheless, generalization to neural networks with non-linear activations, $\min_{X,Y}\|L-X\sigma(Y^\top D)\|_F^2$, will not have this issue, and we believe it is an interesting topic that is worth future investigation.

Q3: Lines 79-80: I am not sure why the method in [Stöger and Soltanolkotabi, 2021] is considered a preconditioned method.

Thank you for pointing out this misplacement. The method in [Stöger and Soltanolkotabi, 2021] is not a preconditioned one. We cite it to support the former claim "Overparameterization may heavily slow down convergence", as it shows $O(\kappa^2\log\frac{1}{\epsilon})$ for the exact parameterization case and $O(\kappa^6\log\frac{\kappa}{\epsilon})$ for the overparameterization case. To avoid confusion, we will move it before the comma in the next version.

Q4 & Q5: Theorems 1 & 2: In my understanding, the value $\|R_t\|_F$ should be proportional to the scale of $A$. However, in the bound of $\|R_t\|_F$, the right-hand side grows with the size of $A$. Maybe there is a typo?

There is no typo here. The bound on $\|R_t\|_F$ implicitly depends on $\|A\|_F$ through the choice of $c$. The Theorems require $c^2\geq\underline{c}^2=O(\|A\|_F)$, and the RHS grows with $c^2$, so $\|R_t\|_F$ implicitly depends on $\|A\|_F$. When $c$ is larger than $\underline{c}$, $\|A\|_F$ is dominated by $c^2$ and our results still hold. When $c$ is too small, our proof does not work.

Moreover, the RHS does not explicitly depend on the size of $A$. The size might affect the bound through $\|A\|_F$ and $\sigma_i(A)$, but there is no explicit dependence on the dimensions $m$ and $n$. The bound only depends on the rank $r$ and the overparameterization $d$, neither of which is directly related to the size of $A$.

Q6: It would be better to mention in Theorems 1-2 that the bound $T$ is derived using the bound in Proposition 1.

Thank you for the suggestion. We will mention the use of quantitative results in Proposition 1 in the next version.

Q7: Line 182: I think the shrinkage rate $\theta\in(0,\rho]$?

It is not a typo; the error shrinkage rate $\theta$ is larger than the "ideal shrinkage rate" $\rho$ given by the condition number of the linear part of the dynamics. Intuitively, the existence of nonlinear error terms will slow down convergence, so $\|R_t\|_F$ will shrink at a rate $\theta$ slower than $\rho$, namely $\theta>\rho$. Then by Lemma 3, all error terms can be controlled by sequences with shrinkage rate $\theta$. The auxiliary Lemma 6 provided in Appendix B.1 then guarantees that the final residual $\|R_t\|_F$ is controlled by $O(\theta^t)$.
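To illustrate the mechanism schematically (a simplified caricature, not the paper's exact Lemma 6): suppose the residual satisfies a perturbed contraction $\|R_{t+1}\|_F\le\rho\|R_t\|_F+E_t$ with error terms bounded as $E_t\le C\theta^t$ for some $\rho<\theta<1$. Unrolling the recursion gives

$$\|R_t\|_F \;\le\; \rho^t\|R_0\|_F+\sum_{s=0}^{t-1}\rho^{t-1-s}E_s \;\le\; \Big(\|R_0\|_F+\frac{C}{\theta-\rho}\Big)\theta^t,$$

so the residual still converges linearly, but only at the slower rate $\theta$ dictated by the error terms.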

Q8: I wonder if the authors can compare the performance of GD/NAG with different values of $c$.

We add additional experiments on GD and NAG with different values of $c$. The results are provided in Figure 5 in the PDF file in "Author Rebuttal". As illustrated, when $c$ is sufficiently large, increasing $c$ further has little effect on the convergence rate, which is consistent with our theory.

[1] Zhang, H., Bi, Y., & Lavaei, J. (2021). General low-rank matrix optimization: Geometric analysis and sharper bounds. Advances in Neural Information Processing Systems, 34, 27369-27380.

[2] Bi, Y., Zhang, H., & Lavaei, J. (2022, June). Local and global linear convergence of general low-rank matrix recovery problems. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 36, No. 9, pp. 10129-10137).

Comment

Dear Reviewer,

Thank you for your time in reviewing our manuscript and offering valuable comments and suggestions. We've addressed the questions raised in your review with our author rebuttals. We'd like to know if these responses have sufficiently addressed your concerns. If any points require further clarification or discussion, please let us know.

Thank you again for your effort.

Best wishes,

Authors

Review
Rating: 4

This paper considers the convergence of first-order optimization methods, including gradient descent and Nesterov's accelerated gradient, for matrix factorization and the linear neural network.

In Section 2.1, they analyze the gradient descent algorithm on the matrix factorization (c.f. Thm. 1). In Section 2.2, they analyze the NAG (c.f. Thm 2). Section 3 illustrates the proof strategy. Section 4 extends the analysis to the linear neural network (c.f. Thm. 3) and Section 5 presents the numerical experiments.

Strengths

This paper is generally well-written and is quite easy to follow. Also, the analysis seems to be valid.

Weaknesses

The major concern is the potential impact of this paper. Generally speaking, this paper studies a well-studied problem with quite standard techniques.

Also, there are some small typos that are quite annoying. For example, the $\|A\|_F$ in Thm. 1 and Thm. 2 can cancel out. In Thm. 2 (line 14), the GD should be NAG. Please do another round of proof-reading.

Questions

  1. Why is the convergence rate independent of $\|A\|_F$?
  2. Can you explain line 807? Why can the eigenvalues $\lambda_i(T_{GD})$ ($1\leq i \leq (m-r)n$) be ignored? The previous line suggests these eigenvalues are one.

Limitations

NA

Author Response

Thank you for your time in reviewing the paper and helping us improve. Our responses to each point of the Weaknesses and Questions are listed below.

W1: The major concern is the potential impact of this paper. Generally speaking, this paper studies a well-studied problem with quite standard techniques.

Impact is a little subjective, but we hope the reviewer could consider the following regarding whether this is a well-studied problem and whether our technique is standard:

Quantitative theoretical guarantees for the optimization of general nonconvex functions without Lipschitz gradients, even after decades of remarkable progress, remain largely an open problem. We managed to obtain such guarantees for a specific type of problem, namely matrix factorization, which is still a nonconvex problem without Lipschitz gradient. Moreover, it carries a lot of insight for a better understanding of deep learning, as it is closely related to linear neural networks, which we also analyzed. Therefore, the problem is extensively studied, but not yet well understood, as many questions remain open.

To illustrate this more precisely, consider [1], which is very recent progress (NeurIPS'23): the motivation was to understand Gradient Descent (not gradient flow, which is easier) for this problem, but due to technical difficulty this was not accomplished; instead, the authors managed to work out a variant, namely Alternating Gradient Descent, and obtained a convergence rate. In our work, we not only work out Gradient Descent but also its momentum version, which is more complicated to analyze but beneficial, because we quantitatively prove that Nesterov's momentum indeed accelerates convergence. This requires new analysis techniques (as discussed in Remark 1) and rather precise (if not tight) error bounds, and we are glad it works out.

[1] Ward, Rachel, and Tamara Kolda. "Convergence of alternating gradient descent for matrix factorization." Advances in Neural Information Processing Systems 36 (2023): 22369-22382.

W2: There are some small typos that are quite annoying.

Thank you very much for helping us catch them. Although we intentionally put one $\|A\|_F$ in the denominator of the prefactor so that the other $\|A\|_F$ aligns with the definition of relative error (lines 124 and 140), the expression indeed becomes more confusing. We will cancel out $\|A\|_F$ in Theorems 1 and 2 and $\|LD^\top\|$ in Theorem 3 to avoid confusion.

The GD on line 140 is indeed a typo. We will proofread and fix it along with other typos in the next version.

Q1: Why is the convergence rate independent of $\|A\|_F$?

While the convergence rate does not explicitly depend on $\|A\|_F$, $\|A\|_F$ will still affect the convergence rate through $c$ defined in Eq. (2). Theorems 1 and 2 require $c^2\geq\underline{c}^2=O(\|A\|_F)$, and the results show that $\|R_t\|_F$ depends on $c^2$. When we choose $c=\underline{c}$, the bound will have an explicit dependence on $\|A\|_F$. When we choose $c>\underline{c}$, $\|A\|_F$ is dominated by the factor $c^2$, hence it is implicitly contained in our results in lines 123-124 and 139-140.

Moreover, when considering the iteration complexity, we adopt relative error as the metric, i.e., $\|R_t\|_F/\|A\|_F\leq\epsilon$. Therefore, the right-most $\|A\|_F$ in the line between 123-124 (and the line between 139-140) does not appear in the iteration complexity.

Q2: Can you explain line 807? Why can the eigenvalues $\lambda_i(T_{GD})$ ($1\leq i\leq (m-r)n$) be ignored?

Thanks for the opportunity to correct a critical misunderstanding. It is not because the eigenvalues are ignored, but because $\langle v,v_i\rangle=0$. By Lemma 1 (line 188), $\mathcal{H}$ is the eigen-subspace corresponding to positive eigenvalues of $H_0$, which is orthogonal to the kernel subspace of $H_0$. Through the derivation from line 804 to 807, we know that $\{v_1,\dots,v_{(m-r)n}\}$ exactly spans the kernel subspace of $H_0$ (where $\lambda_{mn-i}(H_0)=0$). Given the condition that $v\in\mathcal{H}$, we get $\langle v,v_i\rangle=0$ for $i=1,\dots,(m-r)n$, hence the first $(m-r)n$ terms vanish. Thanks to the reviewer, we realize this was under-explained, and we will add an explanation about this in line 808 in the next version.
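As a schematic one-line illustration of this argument (using the notation of the response above, not the paper's exact derivation): expanding $v$ in the eigenbasis $\{v_1,\dots,v_{mn}\}$ of $H_0$,

$$v=\sum_{i=1}^{mn}\langle v,v_i\rangle v_i=\sum_{i=(m-r)n+1}^{mn}\langle v,v_i\rangle v_i,$$

because $v\in\mathcal{H}$ is orthogonal to $\ker H_0=\mathrm{span}\{v_1,\dots,v_{(m-r)n}\}$; the corresponding eigenvalues therefore never enter the bound rather than being "ignored".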

Comment

Dear Reviewer,

Thank you for your time in reviewing our manuscript and offering valuable comments. We've addressed the weaknesses and questions raised in your review with our author rebuttals. We'd like to know if these responses have sufficiently addressed your concerns. If any points require further clarification or discussion, please let us know.

Thank you again for your effort.

Best wishes,

Authors

Comment

Thank you for the response. I will stick to my original score.

I appreciate that the authors agree with my suggestions on the presentation improvements and promise to make the revisions accordingly.

Review
Rating: 7

The paper calculates convergence rates of the gradient descent and Nesterov's accelerated gradient descent algorithms for the factorization of rectangular matrices, a nonconvex optimization problem. Their analysis applies when the factor matrices are initialized as follows: one matrix is initialized as the original matrix multiplied by a Gaussian random matrix with appropriate scaling, and the other is a zero matrix. They show convergence rates with quadratic dependence on the condition number of the data matrix for GD and linear dependence for NAG. They also extend their analysis to linear neural networks and back their theoretical findings with empirical results.

Strengths

  1. Gradient descent and its variants are used almost everywhere in machine learning. Deep neural networks have also made nonconvex optimization very frequent in ML, and matrix factorization is a basic such problem. As such, a work providing theoretical guarantees for this problem is very relevant.

  2. Although the paper is difficult to read, it is structured in a very nice manner so that a reader can follow the high-level ideas quite clearly. The related works and the placement of this work among existing works is discussed very clearly.

  3. I could not check the proofs in detail, but enough intuition, proof sketch and moderate level of details are provided in the main paper itself. The ideas appear solid and sound.

Weaknesses

  1. The experimentation is done on very small matrices. In practice much larger rectangular matrices are often encountered. It would give more insights if the experiments were also performed with moderate and large size matrices.

Questions

See Weaknesses

Limitations

See Weaknesses

Author Response

Thank you very much for your positive feedback and constructive comments.

W1: The experimentation is done on very small matrices. In practice much larger rectangular matrices are often encountered. It would give more insights if the experiments were also performed with moderate and large size matrices.

Thank you for your suggestion. We add additional experiments on large-scale matrices. In particular, we set $(m,n)=(1200,1000)$ for matrix factorization and $(m,n,N)=(500,400,600)$ for linear neural networks in the additional experiment. We compare the performances of GD and NAG while keeping other settings the same as in Figure 2 in the paper. The results are plotted in Figure 4 in the PDF file attached to "Author Rebuttal". As illustrated in Figure 4, our conclusion in the paper that (1) NAG performs better than GD and (2) overparameterization accelerates convergence remains valid for large matrices.

Comment

I have read the other reviews and the rebuttal. I will keep my score.

Review
Rating: 7

In this paper, the authors analyze the convergence of the Nesterov Accelerated Gradient algorithm for a) rectangular matrix factorisation and b) linear neural networks. By using imbalanced initialisation, the authors obtain linear rates of convergence, improving upon the state of the art regarding the dependence of the rates on the condition numbers of the sought matrices.

Strengths

  • The paper is well written and easy-to-follow.
  • The analysis of convergence of NAG on rectangular matrix factorisation and linear neural networks is an interesting topic of research.
  • The authors improved upon state-of-the-art results using novel technical approaches and elegant proof techniques.

Weaknesses

  • The authors didn't mention relevant works studying imbalance effect on the similar problems e.g. [1].
  • The results address the linear neural network case but it's not obvious how the analysis could be extended to account for non-linearities or non-smooth activation functions.

[1] Min, Hancheng, et al. "On the explicit role of initialization on the convergence and implicit bias of overparametrized linear networks." International Conference on Machine Learning. PMLR, 2021.

Questions

  • In [1] the authors provided rates of convergence of gradient flows in the imbalanced initialisation regime. Could these results provide some insights on how to generalise the derived rate for different amounts of imbalance?
  • Could the authors provide some insights on whether the theoretical results on linear neural network could be extended to non-linear and possibly non-smooth activation functions?

Minor: Eq. under line 132: Should $x_t$ be replaced by $z_t$?

Limitations

Yes

Author Response

Thank you very much for your positive feedback and constructive comments. Our responses to each of the comments are listed below.

W1 & Q1: The authors didn't mention relevant works studying the imbalance effect on similar problems, e.g. [1]. In [1] the authors provided rates of convergence of gradient flows in the imbalanced initialisation regime. Could these results provide some insights on how to generalise the derived rate for different amounts of imbalance?

We thank the reviewer for pointing out the related work, and we will cite [1] and discuss it in the next version. In particular, our settings and proof techniques differ from [1]. Consequently, their results cannot directly translate into our case.

  1. We consider GD and NAG with step size $O(1/L)$, while [1] considers GF with infinitesimal step size. Without carefully characterizing the discretization error, the result in [1] for GF cannot be applied to GD.

  2. In our proof, the imbalanced initialization guarantees the induction steps to be valid. The amount of imbalance will affect the constant factors but will not affect the convergence rate $(1-\frac{\mu}{L})^t$ (or $(1-\sqrt{\frac{\mu}{L}})^t$). To be more explicit, suppose we initialize $X_0=c_1A\Phi_1\in\mathbb{R}^{m\times d}$, $Y_0=c_2\Phi_2\in\mathbb{R}^{n\times d}$; then by replacing $H_0$ in Proposition 2 with $H_0^\prime=I\otimes (X_0X_0^\top)$, we can generalize the proof in lines 814-815 and 821-832. The induction requires a sufficiently large $c_1$ and a relatively small $c_2\leq O(c_1)$. Meanwhile, by Proposition 4.3.2 in [2], we have $\frac{1}{1-O(c_1c_2)}\|A_0\|_F\leq\|R_0\|_F\leq (1+O(c_1c_2))\|A_0\|_F$ with high probability. Therefore, when $c_1$ is fixed, a smaller $c_2$ yields a smaller initial loss, resulting in a smaller constant factor. However, the convergence rate remains the same, as it depends on the extreme non-zero eigenvalues of $H_0^\prime$, i.e., $\mu$ and $L$, which only relate to $c_1$ but not $c_2$. We conduct additional experiments on GD/NAG with different values of $c_2$, and the numerical results support our claim. Please find the experiment details and results in "Author Rebuttal" and Figure 6 in the PDF file.

W2 & Q2: The results address the linear neural network case, but it's not obvious how the analysis could be extended to account for non-linearities or non-smooth activation functions. Could the authors provide some insights on whether the theoretical results on linear neural networks could be extended to non-linear and possibly non-smooth activation functions?

Extending our results to neural networks with non-linear (and possibly non-smooth) activations is an interesting topic that we want to investigate in the future. While non-linearity complicates the analysis, we believe some of our techniques still apply. Suppose we apply a non-linear activation $\sigma(\cdot)$ and the problem becomes $\min_{X,Y}\|L-X\sigma(Y^\top D)\|_F^2$. In our analysis, one of the key steps is to show that the residual and error terms are in the contraction subspace of the dynamics. Since the second layer is linear, by the sketching initialization $X_0=cL\Phi$ we can still get an $X_0$ that shares the same column space with $L$. By checking the dynamics of GD/NAG, we can verify that this space contains the column space of the residuals and errors of all later iterations. However, due to the non-linear activation, the analysis of the linear part of the system and the error terms becomes complicated. It requires further effort to verify whether our results can successfully generalize to the non-linear activation setting.

Q3: Eq. under line 132: Should $x_t$ be replaced by $z_t$?

Thank you for pointing this out. This is indeed a typo and we will replace $x_t$ with $z_t$ in the next version.

[1] Min, Hancheng, et al. "On the explicit role of initialization on the convergence and implicit bias of overparametrized linear networks." International Conference on Machine Learning. PMLR, 2021.

[2] Ward, Rachel, and Tamara Kolda. "Convergence of alternating gradient descent for matrix factorization." Advances in Neural Information Processing Systems 36 (2023): 22369-22382.

Comment

Thank you for your rebuttal to my comments and for the additional experiments you conducted to explore how imbalanced initialization affects the rate of convergence. I will keep my score as it is.

Author Response

We thank all the reviewers for dedicating their time to reviewing our paper and providing valuable feedback. In response to the comments involving the size of the matrices (by reviewer t3Co), the performance of GD/NAG with different values of $c$ (by reviewer u7PK), and the amount of imbalance (by reviewer SNTq), we conduct additional numerical experiments and show the results (Figures 4 to 6) in the attached PDF file. We discuss each of them below:

W1 by t3Co: The size of the matrices is small.

Our original submission uses $(m,n)=(100,80)$ for matrix factorization and $(m,n,N)=(100,80,120)$ for linear neural networks. In this rebuttal, where we investigate larger-sized problems, we set $(m,n)=(1200,1000)$ for matrix factorization and $(m,n,N)=(500,400,600)$ for linear neural networks. We compare the performances of GD and NAG under the same setting as in Figure 2 with moderate/large matrices of these sizes. The results are provided in Figure 4. As illustrated, the conclusion that (1) NAG performs better than GD and (2) overparameterization accelerates convergence remains valid for large matrices.

Q8 by u7PK: The performance of GD/NAG with different values of $c$.

We conduct additional experiments on GD and NAG with different values of $c$ and plot the results in Figure 5. As illustrated, when $c$ is sufficiently large, increasing $c$ further has little effect on the convergence rate, which is consistent with our theory.

Q1 by SNTq: The effect of the amount of imbalance at initialization.

We conduct additional experiments on GD and NAG with initialization $X_0=c_1A\Phi_1\in\mathbb{R}^{m\times d}$, $Y_0=c_2\Phi_2\in\mathbb{R}^{n\times d}$, where $[\Phi_1]_{i,j}\sim\mathcal{N}(0,1/d)$ and $[\Phi_2]_{i,j}\sim\mathcal{N}(0,1/n)$. We keep $c_1=50$ and set different values of $c_2$. The results are plotted in Figure 6. As illustrated, changing $c_2$ within a range will not significantly affect the convergence rate (slope), but will change the initial loss (intercept). This result supports our claim in the individual response to reviewer SNTq.
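For concreteness, here is a minimal NumPy sketch of this imbalanced initialization and the resulting initial relative residual for a few values of $c_2$; the target matrix, the dimensions, and the grid of $c_2$ values are illustrative, not the exact experimental configuration.

```python
import numpy as np

# Sketch of the imbalanced initialization X0 = c1 * A @ Phi1, Y0 = c2 * Phi2;
# the rank-r target A, the dimensions, and the c2 grid are illustrative.
rng = np.random.default_rng(0)
m, n, d, r = 100, 80, 10, 5
A = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))      # rank-r target

c1 = 50.0
for c2 in [0.0, 0.001, 0.01, 0.1]:
    Phi1 = rng.normal(0.0, np.sqrt(1.0 / d), size=(n, d))  # [Phi1]_ij ~ N(0, 1/d)
    Phi2 = rng.normal(0.0, np.sqrt(1.0 / n), size=(n, d))  # [Phi2]_ij ~ N(0, 1/n)
    X0, Y0 = c1 * (A @ Phi1), c2 * Phi2                    # large X0, small Y0
    rel = np.linalg.norm(A - X0 @ Y0.T) / np.linalg.norm(A)
    print(f"c2 = {c2:6.3f}   ||R0||_F / ||A||_F = {rel:.3f}")
```

Consistent with the response above, only the intercept of the loss curve should move with $c_2$, while the slope is governed by the nonzero spectrum of $X_0X_0^\top$, which depends on $c_1$ alone.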

In the next version, we will add these results and discussions in an "Additional Experiments" section in the appendix. For responses to other comments, please refer to our responses to reviewers.

Final Decision

The paper studies the problem of factorization of rectangular matrices. The authors prove the convergence rate of gradient descent with quadratic dependence on the condition number of the data matrix, and the convergence rate of Nesterov's accelerated gradient descent with linear dependence on the condition number. They also extend their analysis to linear neural networks. Three reviewers are unanimously positive, and one review with borderline reject is short and less informative. I decide to accept the paper.