Invariant subspaces and PCA in nearly matrix multiplication time
Invariant subspaces and PCA embeddings can be provably approximated in nearly matrix multiplication time in finite precision
Abstract
Reviews and Discussion
The paper analyses the complexity of computing generalized symmetric eigenvalue problems.
Strengths
- Solid analysis of the computations involved in generalized eigenvalue problems.
- Excellent survey of related works.
- Relevant applications in ML (PCA).
Weaknesses
- The assumption of H being symmetric needs to be in the Abstract as well, as it is quite significant.
- Missing citation to kernel PCA [1]
- While the analysis is strong, the computational aspect is not as emphasized. For example, Algorithm 1 is shown only in the Appendix and not in the main body, and the paper has no experimental evaluation.
[1] Schölkopf, Bernhard, Alexander Smola, and Klaus-Robert Müller. "Nonlinear component analysis as a kernel eigenvalue problem." Neural computation 10.5 (1998): 1299-1319.
Questions
My suggestion is to better emphasize the computational aspect and make the paper more suitable for a conference. For example,
- Move the main algorithms to the main body, as this gives more visual impact, which is of higher importance for a conference paper.
- Use tables to compare accuracy and complexity of different algorithms, such as baselines like Lanczos, Randomized SVD, etc.
- It would be helpful if the authors could show numerical results w.r.t. accuracy, complexity, or relevant quantities to support their theoretical analysis.
Limitations
Addressed.
We thank the reviewer for their useful feedback and for their suggestions to improve the readability of our manuscript. Below we provide replies to the specific questions that were raised:
-
Question: The assumption of H being symmetric needs to be in the Abstract as well, as it is quite significant.
Answer: We will fix the missing mention in the abstract that H must be Hermitian and S must be Hermitian and positive-definite. Indeed, these are crucial assumptions that must be emphasized.
-
Question: Move the main algorithms to the main body, as this gives more visual impact, which is of higher importance for a conference paper.
Answer: We will move the most relevant algorithms from the appendices into the main body of the manuscript, taking advantage of the additional page of content that is allowed in the final version.
-
Question: It would be helpful if the authors could show numerical results w.r.t. accuracy, complexity, or relevant quantities to support their theoretical analysis.
Answer: Due to the limited time, we will not be able to incorporate numerical results in our manuscript, but we will provide indicative examples in the final version.
-
Question: Missing citation to kernel PCA [1] Schölkopf, Bernhard, Alexander Smola, and Klaus-Robert Müller. "Nonlinear component analysis as a kernel eigenvalue problem." Neural computation 10.5 (1998): 1299-1319.
Answer: We are aware of the paper by Schölkopf et al. and cited it twice in our manuscript (Ref. [116]). If other relevant references are missing, we would gladly add them.
We hope that our responses clarify the concerns raised by the reviewer. If (other) major concerns remain, we will address them to the best of our capabilities. Again, we would like to thank the reviewer for their useful recommendations.
As the deadline of the reviewer-author discussion period is approaching, we were wondering if the reviewer had the chance to consider our responses, and if we can provide further clarifications for any remaining concerns.
Thanks for the response. I suggest that the authors include the planned revisions in the final version. I will increase my score.
This paper considers the following problem: given a Hermitian H and a Hermitian positive-definite S, find matrices X and Λ such that H X = S X Λ, where the columns of X are the eigenvectors and Λ is the diagonal matrix of eigenvalues. The important application is when the eigenvectors of interest form an invariant subspace, which covers applications such as DFT and PCA. So the goal is to compute a rank-k projection onto the top or bottom k eigenvectors. The main contribution of this paper is a stable algorithm that computes an approximate projector that is ε-close (in norm) to the true projector Π onto the top/bottom-k principal components. The algorithm runs in the current matrix multiplication time up to log factors in n, 1/ε, and the gap between the k-th and (k+1)-th eigenvalues. It uses a number of bits of precision that is polylogarithmic in the above parameters. It also provides error bounds for the Cholesky factorization beyond the classical analysis. As corollaries, the algorithm leads to improvements for DFT, PCA, and Block-PCA.
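For readers less familiar with the setup, the following minimal dense reference computation (plain numpy/scipy, entirely separate from the paper's algorithm; the matrix names H, S, the rank k, and the random inputs are illustrative assumptions) shows the definite GEP and the spectral projector onto the top-k invariant subspace that the paper approximates:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n, k = 8, 3

# Illustrative inputs: Hermitian H, Hermitian positive-definite S.
A = rng.standard_normal((n, n))
H = (A + A.T) / 2
B = rng.standard_normal((n, n))
S = B @ B.T + n * np.eye(n)

# Dense reference solver for the definite pencil (H, S):
# eigenvalues in ascending order, eigenvectors S-orthonormal (X.T @ S @ X = I).
lam, X = eigh(H, S)

# Spectral projector onto the top-k invariant subspace of the pencil.
# With S-orthonormal eigenvectors, Pi = Xk Xk^T S is idempotent.
Xk = X[:, -k:]
Pi = Xk @ Xk.T @ S

print(np.allclose(Pi @ Pi, Pi))          # True: Pi is a projector
print(np.allclose(Pi @ Xk, Xk))          # True: fixes the top-k eigenvectors
print(np.allclose(Pi @ X[:, :-k], 0.0))  # True: annihilates the rest
```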
Strengths
This paper designs a novel algorithm for invariant subspace projection and PCA in the current matrix multiplication time. The algorithm is a good combination of existing techniques (such as approximating the sign function via inverse Newton) and new insights, such as reducing the computation of the gap and midpoint of the eigenvalues to eigenvalue threshold counting, and solving the counting via smoothed analysis.
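To make that reduction concrete, here is a naive bisection sketch (illustrative only: the counting oracle below is a dense computation, whereas the paper replaces it with a fast, finite-precision counting routine and a smoothed analysis; the function names and the bracketing interval [lo, hi] are assumptions of this sketch):

```python
import numpy as np
from scipy.linalg import eigh

def count_below(H, S, mu):
    """Number of eigenvalues of the pencil (H, S) strictly below mu.
    Naive dense oracle, used here only to illustrate the reduction."""
    return int(np.sum(eigh(H, S, eigvals_only=True) < mu))

def gap_and_midpoint(H, S, k, lo, hi, iters=60):
    """Bisection sketch: with eigenvalues in ascending order (1-indexed),
    locate lam_k and lam_{k+1} as the thresholds where the count jumps to
    k and to k+1, then return (gap_k, midpoint).  [lo, hi] must bracket
    the whole spectrum."""
    def smallest_mu_with_count_at_least(m):
        a, b = lo, hi
        for _ in range(iters):
            mid = (a + b) / 2
            if count_below(H, S, mid) >= m:
                b = mid
            else:
                a = mid
        return (a + b) / 2
    lam_k = smallest_mu_with_count_at_least(k)
    lam_k1 = smallest_mu_with_count_at_least(k + 1)
    return lam_k1 - lam_k, (lam_k + lam_k1) / 2
```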
The paper is well-written and the results are presented clearly in the main body; some intuitions and proof sketches are provided for a better understanding of the algorithms and analysis. Overall, I think this is a good paper with solid results.
Weaknesses
The only complaint I have with this paper (it might be a bit unfair) is that, due to the sheer amount of content and its very numerical-linear-algebra nature, it might be better suited for conferences such as ISSAC or journals such as the SIAM Journal on Matrix Analysis and Applications. Otherwise, I think this paper is definitely strong enough to be published in NeurIPS.
Questions
For your open question on rectangular FMM: is the current runtime bottleneck due to block square FMM? I didn't dig too deep into the algorithm details, but I would imagine the bottleneck is due to certain linear algebraic operations on matrices, instead of a bunch of small rectangular FMMs?
Limitations
Authors have discussed limitations.
We would like to thank the reviewer for their constructive feedback and for taking the time to review our manuscript. The reviewer is absolutely right about the rectangular FMM. Actually, our comment about rectangular FMM only applies to the analysis of the Block-Krylov method in Section 4.2 and in Appendix G, where we used blocked-square FMM to perform square-times-rectangular matrix multiplication. Using rectangular FMM instead of blocked-square FMM might provide minor (but not substantial) improvements for specific regimes of the input parameters.
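To make the blocked-square approach concrete, here is an illustrative sketch (not the paper's implementation; plain numpy matmul stands in for a fast square multiplication routine): an n x n times n x k product is assembled from (n/k)^2 square k x k multiplications, which is what drives the stated complexity when only square fast matrix multiplication is assumed.

```python
import numpy as np

def square_times_rectangular(A, B, square_mm=np.matmul):
    """Multiply an (n x n) matrix A by an (n x k) matrix B using only
    k x k square multiplications (assumes k divides n for simplicity).
    `square_mm` is a placeholder for a fast square matrix-multiplication
    routine; each call costs O(k^omega), and there are (n/k)^2 calls."""
    n, k = B.shape
    assert A.shape == (n, n) and n % k == 0
    m = n // k
    C = np.zeros((n, k), dtype=np.result_type(A, B))
    for i in range(m):            # block-row of A / C
        for j in range(m):        # block-column of A, block-row of B
            Aij = A[i*k:(i+1)*k, j*k:(j+1)*k]
            Bj = B[j*k:(j+1)*k, :]
            C[i*k:(i+1)*k, :] += square_mm(Aij, Bj)
    return C

# Quick check against the dense product.
rng = np.random.default_rng(1)
A = rng.standard_normal((12, 12))
B = rng.standard_normal((12, 3))
print(np.allclose(square_times_rectangular(A, B), A @ B))
```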
If more information is needed, we would be pleased to further expand on the rectangular FMM or other topics.
I thank the authors for answering my questions on rectangular FMM. Just a small follow-up question: what parameters of rectangular FMM are you looking for? Could you specify them in terms of n and k? Thanks!
Good point: the aforementioned parameter would be k, which determines the block sizes. A general rectangular matrix-multiplication exponent in terms of both n and k would indeed make it more general, or similarly one in terms of the number of nonzeros, to cover also the case where the input is sparse (though different mat-mul algorithms have different stability properties, so it might need some care). It might indeed be an overkill to mention it as an open problem. We will adapt that part to the best of our capabilities in the final version. Thank you for the discussion!
The paper tackles the fundamental problem of GEP subspace approximation (with forward error approximation). The problem is very relevant to different areas of Machine learning. This paper improves the time complexity of this approximation from cubic in n to matrix multiplication time, which is a significant improvement.
I have read the first nine pages and Appendix B; the idea seems reasonable, and the paper is well explained. However, given the length of the paper (which exceeds 50 pages), I have not been able to verify the correctness of the solution. Therefore, I will not comment on the strengths and weaknesses of the paper and give a low confidence score.
In my opinion, a more theoretical venue such as STOC/FOCS would be more appropriate for this paper (both because of the emphasis on theory and because the reviewers have significantly lower paper load). I wish the authors the best.
Strengths
Weaknesses
Questions
-
Could the authors please elaborate on the difference between their result and the forward error approximation in Theorem 5.2 of [1] (which is the [37] citation of the authors)? The authors of [1] write that their Theorem 5.2 is restrictive as it needs S to be invertible and H to have a simple eigenspectrum (no repeated eigenvalues). It seems that Theorem 1.1 by the authors also requires S to be invertible but can allow a non-simple spectrum.
-
Is a general dependence on the term gap_k necessary for analysis like this?
Limitations
The authors have adequately addressed the limitations of this work.
We would like to thank the reviewer for their comments and for carefully examining our manuscript as well as Appendix B. Our replies to their comments can be found below:
-
Question: Could the authors please elaborate on the difference between their result and the forward error approximation in Theorem 5.2 of [1] (which is the [37] citation of the authors)? The authors of [1] write that their Theorem 5.2 is restrictive as it needs S to be invertible and H to have a simple eigenspectrum (no repeated eigenvalues). It seems that Theorem 1.1 by the authors also requires S to be invertible but can allow a non-simple spectrum.
Answer: This is indeed a very interesting point that deserves further investigation. We have been in close contact with the authors of [37] over the past months. After discussing and comparing our results in detail, we concluded that our works complement each other. The major difference comes from the fact that our Theorem 1.1 is stated specifically for Hermitian-definite pencils: H must be Hermitian, while S must be Hermitian and positive-definite, which is stricter than just invertible. However, we do not need simplicity of the spectrum because of Weyl's inequality (or the slightly more general Kahan bound): for Hermitian matrices, a backward error directly translates into a forward error in the eigenvalues (see also our Lemma B.1 in Appendix B). This property does not hold for general matrices, such as the ones assumed in Theorem 5.2 of [37]. In that case, one would resort to either the Bauer-Fike theorem or to the properties of the pseudospectrum in order to obtain forward-error bounds from the backward error. If the reviewer is interested, we can also refer to the thesis of the last author of [37] for additional remarks.
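For reference, the standard statement of Weyl's inequality invoked here (for Hermitian H and a Hermitian perturbation E, with eigenvalues sorted in the same order) is

$$\max_{1 \le i \le n} \left| \lambda_i(H+E) - \lambda_i(H) \right| \;\le\; \lVert E \rVert_2,$$

i.e., a backward error of size ε moves every eigenvalue by at most ε.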
-
Question: Is a general dependence on the term gap_k necessary for analysis like this?
Answer: We have thoroughly discussed the dependence of the bounds on the gap with other authors and experts in the field. It appears to be unavoidable. Intuitively, if we want to distinguish between two invariant subspaces, the error must be smaller than the gap. Consider the following simple example: for H = diag(1 + ε, 1) the gap is equal to ε, and the spectral projector onto the top-1 invariant subspace is e_1 e_1^T. Now consider the following ε-perturbation of H: H' = diag(1, 1 + ε). The spectral projector onto the top-1 invariant subspace of H' is completely different than before. In fact, it is orthogonal to the first one (it equals e_2 e_2^T).
We hope that these responses clarify the reviewer's concerns. Let us know if the answers are not sufficient so we can elaborate further.
Dear authors, Thank you for the explanations. Please find a followup question below.
In the paper, you mentioned that obtaining forward approximation error is a significantly more challenging task than obtaining backward approximation error,
Yet, here you mention that if the matrices are Hermitian (as in the case of H), the backward approximation error can be directly translated to a forward approximation error, which seems to contradict the hardness of obtaining the forward approximation error. It would be great if you could help me understand this point.
Thanks a lot.
This is a subtle point, but in this case the statements that appear to be contradictory refer to different quantities. As mentioned in our previous comment and in the paper, it is possible to obtain forward errors for the eigenvalues from the backward error using known inequalities, even in the non-Hermitian case (see e.g. Ref. [24]). Obtaining forward-error bounds for invariant subspaces is not easy.
Consider again the example in the previous comment. The eigenvalues of the perturbed matrix H' perfectly approximate those of H (in that example they even coincide). But the invariant subspaces are completely different. Numerical analysis often refers to this fact as "eigenvector sensitivity" (see e.g. Ref. [38]). The eigenvalues and the invariant subspaces of matrix pencils are even more sensitive than those of single matrices, yet our analysis still applies for Hermitian-definite pencils (Appendix C serves as the backbone).
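A small numerical check of this sensitivity, using the diagonal example from the earlier comment (plain numpy, independent of the paper's algorithms):

```python
import numpy as np

eps = 1e-8
H  = np.diag([1.0 + eps, 1.0])   # gap = eps, top eigenvector e_1
Hp = np.diag([1.0, 1.0 + eps])   # an eps-sized perturbation of H

# The eigenvalues coincide exactly as sets ...
print(np.allclose(np.sort(np.linalg.eigvalsh(H)),
                  np.sort(np.linalg.eigvalsh(Hp))))   # True

# ... but the top-1 spectral projectors are orthogonal to each other.
def top1_projector(A):
    w, V = np.linalg.eigh(A)
    v = V[:, -1]                  # eigenvector of the largest eigenvalue
    return np.outer(v, v)

P, Pp = top1_projector(H), top1_projector(Hp)
print(P)                          # e_1 e_1^T
print(Pp)                         # e_2 e_2^T
print(np.linalg.norm(P @ Pp))     # ~0: the subspaces are orthogonal
```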
We hope that this clarifies the question and we are pleased to elaborate more if required. If the reviewer thinks that these details are not clear in the article, we will modify it accordingly. In that case, it would be helpful to point us to specific pages/lines that need clarification.
The paper presents a novel approach to approximating invariant subspaces of generalized eigenvalue problems (GEPs), which are fundamental in many applications such as Principal Component Analysis (PCA) and Density Functional Theory (DFT). The authors introduce an algorithm that computes a spectral projector with a forward-error approximation guarantee in nearly matrix multiplication time O(n^ω), where ω is the matrix multiplication exponent. The approach advances a new analysis for Cholesky factorization and a smoothed analysis for computing spectral gaps, which are key innovations applied to obtain the desired bound with high probability.
The paper's technical claims are well-supported by rigorous mathematical proofs and thorough analysis. The use of Cholesky factorization and smoothed analysis for spectral gaps is innovative and effectively addresses the computational challenges. The proposed algorithm’s performance is theoretically grounded. However, some sections could benefit from additional explanations, empirical computations and examples to enhance understanding, especially for readers less familiar with the theoretical work developed in the paper, and potentially reaching a larger audience.
Strengths
- Originality: The paper introduces a novel approach to solving GEPs and PCA with nearly matrix multiplication time complexity, which is a significant advancement over classical methods. The novel analysis for Cholesky factorization and a smoothed analysis for computing spectral gaps could have a wider application too.
- Quality: The mathematical rigor and comprehensive analysis in terms of bit complexity ensure that the proposed methods are both theoretically sound and practically applicable.
- Clarity: The paper is generally well-written, with detailed explanations of the methods and thorough proofs.
- Significance: The results have broad implications for many applications in machine learning and scientific computing, providing a more efficient computational framework.
Weaknesses
- Complexity of Implementation: The proposed methods, while theoretically sound, may be complex to implement in practice. Detailed implementation guidelines for Algorithms 1, 2, 3, and 4 would help. A standard implementation would also be of value, showing how some of the factors hidden in the complexity could play a significant role in practice. For example, as it stands, it remains open how significant the factor defining the bound on the approximation error (which appears inside a logarithmic expression) is when working with varied matrix sizes and approximation errors. Theorem 1 should also include some comments about this issue.
- Hyperparameter Sensitivity: The algorithm's performance depends on several parameters (e.g., spectral gap, condition number). A more detailed discussion on selecting these parameters and their impact on performance would be helpful, along with empirical implementations and discussion.
Questions
- Parameter Selection: How should the hyperparameters (e.g., spectral gap, condition number) be chosen in practice? Are there heuristic methods or guidelines for selecting these parameters to ensure optimal performance?
- Extensions to Sparse Matrices: How does the proposed method handle sparse matrices common in large-scale applications? Are there any modifications or optimizations specific to sparse matrix structures?
Limitations
The presented paper discusses in great depth the theoretical underpinning of the proposed algorithm, and we understand this already has a significant impact. Nevertheless, the lack of practical implementation, empirical validation, and analysis of some aspects of the algorithms hinders the potentially higher impact of the proposed method. Even a limited empirical section, including investigations into the role of the hyperparameters and covering varied structures of matrices, would highly improve the paper's potential impact and readership.
We would like to thank the reviewer for their positive feedback and for carefully reviewing our manuscript! Below we provide answers to the specific questions that were asked:
-
Question: Parameter Selection: How should the hyperparameters (e.g., spectral gap, condition number) be chosen in practice? Are there heuristic methods or guidelines for selecting these parameters to ensure optimal performance?
Answer: We would like to stress that all required parameters (gaps, condition numbers, norms, etc.) are determined by the algorithm, i.e., the user does not need to provide them. The only inputs are the matrices to be handled, their size n, and the desired accuracy ε. A reduced set of input parameters is an additional contribution compared to previous algorithms, where methods to determine the required parameters are typically not provided. In our case, the complexity of determining the parameters is included in the main theorems.
-
Question: Extensions to Sparse Matrices: How does the proposed method handle sparse matrices common in large-scale applications? Are there any modifications or optimizations specific to sparse matrix structures?
Answer: Extending these methods to sparse matrices is a very intriguing, but rather difficult, open problem. To date, for large-scale sparse problems, a Krylov-based method is probably the algorithm of choice: it takes advantage of matrix-vector products and its complexity scales with the sparsity of the matrix. We thoroughly described the bit-complexity analysis of the celebrated Block-Krylov PCA method (see Section 4 and Ref. [100]). The problem with such methods is that k (the target rank) must be very small for them to work well. As mentioned in Section 5 (iii), large-scale banded problems can probably be solved with so-called ``superfast eigensolvers'' (e.g., Refs. [61, 131]), but we are not aware of any end-to-end complexity/stability analysis of these algorithms. For general (non-banded) sparse matrices and large k, we believe that matrix multiplication time is probably the best that can be achieved.
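As a rough illustration of the kind of Krylov-based method meant here, the sketch below is a textbook-style randomized block-Krylov iteration for the top-k eigenpairs of a sparse symmetric matrix; the block count q, the density, and all other parameters are arbitrary illustrative choices, and it is not the analyzed algorithm of Ref. [100] nor the paper's method.

```python
import numpy as np
import scipy.sparse as sp

def block_krylov_topk(A, k, q=8, seed=0):
    """Approximate the top-k eigenpairs of a sparse symmetric matrix A
    using a randomized block-Krylov subspace built from q blocks.
    The cost is dominated by q sparse matrix-times-block products."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    X = rng.standard_normal((n, k))
    blocks = []
    for _ in range(q):
        X, _ = np.linalg.qr(X)                   # re-orthonormalize for stability
        blocks.append(X)
        X = A @ X                                # sparse mat-times-block product
    K, _ = np.linalg.qr(np.hstack(blocks))       # orthonormal Krylov basis
    T = K.T @ (A @ K)                            # Rayleigh-Ritz projection
    w, Y = np.linalg.eigh(T)
    return w[-k:], K @ Y[:, -k:]                 # approximate top-k eigenpairs

# Example on a random sparse symmetric matrix.
n = 2000
M = sp.random(n, n, density=1e-3, random_state=0)
A = ((M + M.T) * 0.5).tocsr()
vals, V = block_krylov_topk(A, k=5)
print(vals)
```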
-
Question: Complexity of Implementation.
Answer: We expect that the implementation of the proposed methods does not involve substantial difficulties (although this remains to be demonstrated). The main steps are the diagonal perturbations, the Newton iteration, and the bisection-like algorithm to compute the spectral gap. The level of difficulty should definitely not exceed that of, for example, Refs. [11, 37] (for which open-source implementations exist).
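As a point of reference for what the Newton-iteration step looks like, here is the textbook (unscaled) Newton iteration for the matrix sign function applied to a single symmetric matrix, together with the resulting spectral projector; the paper's version operates on the pencil, uses scaling, and is analyzed in finite precision, none of which is reflected in this sketch.

```python
import numpy as np

def sign_newton(A, iters=50, tol=1e-12):
    """Matrix sign function via the Newton iteration X <- (X + X^{-1}) / 2.
    Converges when A has no eigenvalue at (or too close to) zero."""
    X = A.copy()
    for _ in range(iters):
        Xnew = 0.5 * (X + np.linalg.inv(X))
        if np.linalg.norm(Xnew - X) <= tol * np.linalg.norm(Xnew):
            return Xnew
        X = Xnew
    return X

def projector_below(H, mu):
    """Spectral projector onto the eigenvectors of a symmetric H with
    eigenvalues below mu (mu must lie inside a spectral gap)."""
    n = H.shape[0]
    return 0.5 * (np.eye(n) - sign_newton(H - mu * np.eye(n)))
```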
To conclude, we would like to thank again the reviewer for the detailed feedback and for their interest, they are both highly appreciated! Let us know if further clarifications of our replies are required.
This paper studies the computational cost of invariant subspaces for generalized eigenvalue problems (GEPs), which is a fundamental computational problem with applications in machine learning and scientific computing. The authors propose a novel method that approximates the spectral projectors of GEPs, and give the first bit-complexity result for forward-error approximation in the floating point model, improving upon previous results. Based on this result, the authors also give new matrix multiplication-type upper bounds for PCA problems in terms of bit complexity.
Strengths
- This paper is well-written with very solid theoretical results and proofs. The problem, the model, the results are stated clearly. The existing works and previous results are discussed thoroughly.
- The idea of reducing the problem to approximating the average and the gap of two adjacent eigenvalues is interesting. The result of approximating these two quantities (Theorem 3.1) can potentially have other applications.
- This paper gives a new stability analysis of the Cholesky factorization under floating point model and improves upon the previous result, which is of independent interest (especially in the TCS community).
Weaknesses
The main concern is that this paper may be too dense with technical details and rigorous proofs for a NeurIPS submission (as opposed to a TCS conference). It might be better if the authors consider presenting this paper in a way that is easier to follow (e.g., deferring more proof details to the appendix, while adding some examples of what the results look like for specific regimes of spectral decay, and a more detailed discussion of the potential applications / examples of GEPs), so that this problem can be brought to more people in this community.
Questions
- In the statement of Theorem 1.1 and Proposition 2.1, is the normalization assumption on H and S a proper assumption?
- Also in Theorem 1.1, what if gap_k is very small for the given k?
- How does the result of Theorem 3.1 compare with previous works (if any)? How hard is it to approximate the gap and the midpoint for a given k? Is there any previous work that also achieves a bit-complexity result?
- Can the authors provide a more detailed discussion of how the proposed algorithm can (or cannot) be applied to real world problems such as solving large scale PCA tasks?
Limitations
The authors have adequately discussed the limitations of their work.
We would like to thank the reviewer for their positive feedback. Below we elaborate on the specific questions that were raised:
-
Question: In the statement of Theorem 1.1 and Proposition 2.1, is the normalization assumption on H and S a proper assumption?
Answer: Such a normalization is typically a standard assumption in eigenvalue algorithms. For Theorem 1.1, in particular, the norm of H can be approximated up to a constant factor in floating point using Lanczos, e.g., by applying Theorem 18 in the arXiv version of Ref. [101]. We can then scale H by the approximated norm. Similarly with S, we can use the algorithm in Appendix E to compute the minimum singular value of S, and scale accordingly.
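A minimal sketch of this kind of preprocessing, using standard library routines as stand-ins for the cited Lanczos result and the Appendix E routine (the function name, the exact scaling convention, and the use of eigsh are illustrative assumptions, not the paper's procedure):

```python
from scipy.sparse.linalg import eigsh

def normalize_pencil(H, S):
    """Illustrative preprocessing only: estimate ||H||_2 and lambda_min(S)
    with a few Lanczos iterations (eigsh stands in for the constant-factor
    estimators cited above), then rescale.  After rescaling, ||H||_2 ~ 1 and
    lambda_min(S) ~ 1; the pencil eigenvalues get multiplied by the known
    factor smin_S / norm_H."""
    norm_H = abs(eigsh(H, k=1, which='LM', return_eigenvectors=False)[0])
    smin_S = eigsh(S, k=1, which='SA', return_eigenvectors=False)[0]
    return H / norm_H, S / smin_S, smin_S / norm_H
```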
-
Question: Also in Theorem 1.1, what if gap_k is very small for the given k?
Answer: This is an excellent question that we discussed with other authors and experts in the field. A dependence on the gap should be unavoidable. Otherwise, it is not possible to distinguish between the two subspaces and obtain forward-error bounds (backward-error bounds are easier). If the gap is zero (or nearly so), then the problem is rather ill-posed to begin with, but the algorithm will still terminate in polynomial time.
-
Question: How does the result of Theorem 3.1 compare with previous works (if any)? How hard is it to approximate the gap and the midpoint for a given k? Is there any previous work that also achieves a bit-complexity result?
Answer: For general k we are not aware of any other algorithms, except for Banks et al. (Ref. [11]). We briefly described in Proposition B.2 how to use [11] to compute the gap. However, [11] leads to a slightly slower algorithm than our main algorithm, which is diagonalization-free. We are also not aware of any lower bounds for computing a single spectral gap. We believe that it should be at least as hard as matrix multiplication, but we have no formal proof for that.
-
Question: Can the authors provide a more detailed discussion of how the proposed algorithm can (or cannot) be applied to real world problems such as solving large scale PCA tasks?
Answer: When it comes to practical applications, the main benefit of the proposed ``matrix multiplication reduction'' is not the asymptotic complexity of the resulting algorithm, but rather the fact that we can exploit matrix multiplication in the computations. It is fully parallelizable and lends itself to cache-optimization, which can provide significant speedups (up to orders of magnitude) in practice. We also think that random perturbations (diagonal, Ginibre, ...) might actually become very useful in the near future, due to their eigenvalue repulsion properties, but this field is currently under-explored.
Thank you again for the feedback and for the interest! We are happy to discuss these points more in depth, if needed.
I appreciate that the authors have addressed all my concerns and questions. I have also read all other reviews as well as the corresponding comments by the authors. I think we all agree that this paper provides a strong result with a solid proof, while for a conference like NeurIPS, I would suggest that the authors find a way to present this paper so that it could be understood more easily (relatively speaking). Thanks, I would like to keep my rating.
Thank you once more for the feedback and for the discussion, we will try to accommodate all the suggestions from all the reviewers accordingly in the final version.
The reviewers generally agreed that this paper provides strong theoretical results for the central problem of solving generalized eigenvalue problems in finite precision. The paper is well-written, appears sound, and while it is more of a numerical analysis paper than an ML paper, should be of interest to some in the broader NeurIPS community.
Beyond the various helpful suggestions made by the reviewers, I'd like to strongly suggest to the authors that in the final version they should tone down/clarify some of the claims about what the work contributes to the special case of the Hermitian eigenvalue problem.
Cor 1.7 of https://arxiv.org/pdf/1912.08805 gives a forward stable solution to this problem in matrix multiplication time. A subtle difference is that it returns a factorization V D V^{-1} rather than V D V^T, although given that V will be nearly orthogonal due to the accuracy guarantee, I suspect that this is easily resolved. See Remark 6.1 of the same paper for some discussion.
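One way this remark could be made concrete (an illustrative sketch, not taken from either paper): given the nearly orthogonal V from an accurate V D V^{-1} factorization, replace V by its polar factor and re-symmetrize.

```python
import numpy as np

def symmetrize_factorization(V, d):
    """Given a nearly orthogonal V and eigenvalue estimates d from an accurate
    V @ diag(d) @ inv(V) factorization of a symmetric matrix, replace V by its
    polar (closest orthogonal) factor Q and return the symmetric reconstruction
    Q @ diag(d) @ Q.T."""
    U, _, Wt = np.linalg.svd(V)
    Q = U @ Wt                      # polar factor: nearest orthogonal matrix to V
    return Q, (Q * d) @ Q.T         # Q diag(d) Q^T without forming diag(d)
```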
Thus I feel that the claim in the intro of the submitted paper: "Still, to the best of our knowledge, the state-of-the-art algorithms providing an end-to-end, provably accurate analysis (in the sense of Eq. (2)) for arbitrary invariant subspaces of definite GEPs or even regular Hermitian eigenproblems, in any finite precision model, remain the classic aforementioned O(n^3) algorithms" is either incorrect or at least very misleading/overstated. It should be more carefully discussed how the authors' solution improves on or buttons up prior work in the Hermitian case.
I apologize for not bringing this issue up during the response period. I don't feel that it detracts from the main contribution of the paper on generalized eigenvalue problems enough to change my recommendation. If it is helpful and possible to discuss with the authors post-decision, I would be happy to.