Approximating the Top Eigenvector in Random Order Streams
Reviews and Discussion
The paper studies the problem of finding the top eigenvector of a matrix in the streaming model. The problem is: given a sequence of d-dimensional rows a_1, a_2, ..., a_n of a matrix A, approximately compute the top eigenvector v_1 of A^TA. The algorithm gets a single pass over the stream of input rows, must use subquadratic (in d) space, and should output an approximation \hat v_1 to v_1.
Existing works on this problem have the following natural idea at their core: initialize a vector z \in R^d (typically to a random unit vector), and then, upon receiving a row a_i, set z \gets z + \eta (a_i^T z) a_i for some learning rate \eta. One can either keep a fixed learning rate or use a schedule of varying rates: the latter is done in Price's algorithm from a few years ago, which computes an approximation \hat v_1 that is at least 1-O(log d/R) correlated with v_1, where R is the eigenvalue gap \sigma_1(A^TA)/\sigma_2(A^TA). Price also showed that a 1-O(1/R^2) correlation requires space almost quadratic in d.
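For concreteness, here is a minimal Python sketch of this update with a fixed learning rate (the learning rate, initialization, and per-step normalization are illustrative choices, not the paper's algorithm):

```python
import numpy as np

def oja_fixed_rate(rows, eta=0.01, seed=0):
    """Single-pass Oja update z <- z + eta * (a^T z) * a, renormalized each step."""
    rng = np.random.default_rng(seed)
    d = rows.shape[1]
    z = rng.standard_normal(d)
    z /= np.linalg.norm(z)
    for a in rows:                       # one pass over the stream
        z = z + eta * (a @ z) * a        # rank-one Oja update
        z /= np.linalg.norm(z)           # keep the iterate a unit vector
    return z

# Toy usage: rows with a planted dominant direction v.
rng = np.random.default_rng(1)
v = rng.standard_normal(50)
v /= np.linalg.norm(v)
A = rng.standard_normal((2000, 50)) + 3.0 * rng.standard_normal((2000, 1)) * v
print("correlation with planted direction:", abs(oja_fixed_rate(A) @ v))
```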
The authors' result is that in random order streams, the correlation bound can be improved to 1-O(1/\sqrt{R}). They complement their algorithmic result with a lower bound for Price's algorithm -- this is essentially an adaptation of Price's lower bound to random order streams. They also provide certain impossibility results for Oja's algorithm with a fixed learning rate.
The main algorithmic technique is clean. The key point is that random ordering of the stream implies that its substreams contain good spectral approximations to the original matrix -- this follows from standard spectral concentration bounds (after removing heavy rows). Due to the spectral gap assumption, the top eigenvectors of these rescaled submatrices are close to v_1. One can therefore run the power method on the substreams, of which there are quite a few, and obtain good convergence.
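A minimal sketch of this idea (the block count is an arbitrary illustrative choice, and the heavy-row handling and reweighting from the paper are omitted): split the randomly ordered stream into contiguous blocks and apply each block's Gram matrix to the current iterate as one power-method step.

```python
import numpy as np

def block_power_random_order(rows, num_blocks=20, seed=0):
    """Use each contiguous block of a randomly ordered stream as one
    (approximate) power-method step; the blocks' scaling does not matter
    since the iterate is renormalized after every step."""
    rng = np.random.default_rng(seed)
    n, d = rows.shape
    z = rng.standard_normal(d)
    z /= np.linalg.norm(z)
    block_size = max(1, n // num_blocks)
    for b in range(num_blocks):
        block = rows[b * block_size:(b + 1) * block_size]
        z_new = np.zeros(d)
        for a in block:                  # one pass over the block, O(d) extra space
            z_new += (a @ z) * a         # accumulates (B_b^T B_b) z without storing B_b^T B_b
        z = z_new / np.linalg.norm(z_new)
    return z
```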
Strengths
S1. The authors provide a clean analysis of a very natural algorithm for top eigenvector computation in data streams. This is definitely worth publishing.
Weaknesses
W1. I am not sure how novel the idea of using substreams that concentrate around A as steps in the power iteration is. If it is novel, then this is a very nice paper. Otherwise it can be seen more as a sharpening of Price's result.
W2. The authors' overview of related work on random order streams is not up to date. From their list of citations it seems that there was no work done on random order streams between 2009 and 2023. See, for example, https://arxiv.org/pdf/2110.10091 and references therein.
W3. I find it strange that the authors use an unpublished result (Magdon-Ismail, 2010) from 14 years ago for the concentration bound that their analysis relies on. I recommend looking at Joel Tropp's survey on matrix concentration for a more formal reference.
Questions
None.
Limitations
None.
We thank the reviewer for suggestions on citations for the random order model. We will add more references in the final version of the paper.
Using substreams for quadratic approximation
When each row of the stream is independently sampled from a distribution, Hardt and Price analyze a similar block power method. In that case, it is much easier to argue that the substreams concentrate around the covariance matrix. In our case, however, we assume that the rows of the matrix are fixed and the only randomness is in the order in which the rows are presented to the streaming algorithm. This makes the problem quite different from the one considered by Hardt and Price, and hence the techniques used in our proofs are different as well.
Using Magdon-Ismail’s result
We used the manuscript of Magdon-Ismail as the main reference since it states the result directly in the form we require. But we agree that citing Tropp's work would be useful for the readers. We will include a citation in the final version of the paper.
The paper studies the problem of approximating the top eigenvector of A^TA when the rows of an n x d matrix A are given in a stream. The authors consider worst-case inputs but assume the rows are presented in a uniformly random order. They show that when the gap parameter R = \sigma_1(A^TA)/\sigma_2(A^TA) is at least a sufficiently large constant, there is a randomized algorithm using roughly h*d (times polylogarithmic factors) bits of space that outputs a unit vector with 1-O(1/\sqrt{R}) correlation with the top eigenvector v_1. Here, h denotes the number of "heavy rows" in the matrix, defined as rows whose squared Euclidean norm is a large fraction of ||A||_F^2. The authors provide a lower bound showing that any algorithm using significantly fewer than h*d bits of space can obtain only a weaker correlation with the top eigenvector. Their results improve upon the gap requirement in a recent work by Price. The authors' algorithm is based on a variant of the block power method. They also show improvements to the gap requirements in Price's analysis for both arbitrary order and random order streams. The techniques involve row sampling, concentration inequalities, and a careful analysis of the power method with approximate quadratic forms.
Strengths
The paper introduces a novel algorithm to approximate the top eigenvector in random order streams, a significant advancement over previous methods that required multiple passes over the data or had more stringent conditions. One of the key strengths is its method for handling heavy rows in the matrix. By storing the rows with large Euclidean norms separately and processing the remaining rows, the algorithm ensures an accurate approximation of the top eigenvector. The algorithm achieves near-optimal space complexity, using space roughly proportional to h*d, where h is the number of heavy rows, and this is shown to be nearly tight. Moreover, the paper improves upon the gap requirements needed for the algorithm to work: it relaxes the gap requirement of Price's analysis for arbitrary order streams and reduces it to a constant for random order streams. The use of row norm sampling as a general technique to reduce the number of rows while preserving the top eigenvector is a versatile approach.
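As an illustration, here is a generic sketch of length-squared row sampling (not necessarily the paper's exact sampling distribution or reweighting), under which the sampled matrix S satisfies E[S^T S] = A^T A:

```python
import numpy as np

def row_norm_sample(A, m, seed=0):
    """Sample m rows with probability proportional to squared row norms and
    reweight so that E[S^T S] = A^T A (standard length-squared sampling)."""
    rng = np.random.default_rng(seed)
    norms_sq = np.einsum('ij,ij->i', A, A)       # squared Euclidean norm of each row
    probs = norms_sq / norms_sq.sum()
    idx = rng.choice(A.shape[0], size=m, replace=True, p=probs)
    # Each sampled row a_i is rescaled by 1/sqrt(m * p_i).
    return A[idx] / np.sqrt(m * probs[idx])[:, None]

# Sanity check: S^T S concentrates around A^T A when m is large enough.
rng = np.random.default_rng(2)
A = rng.standard_normal((5000, 30))
S = row_norm_sample(A, m=1500)
err = np.linalg.norm(S.T @ S - A.T @ A, 2) / np.linalg.norm(A.T @ A, 2)
print("relative spectral error:", err)
```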
Weaknesses
The main weakness of this work is the lack of empirical validation. Given the simplicity of their algorithm and the ease of generating synthetic matrices with a given sequence of singular values, it should be easy to perform experiments to compare it with Price's approach and provide insight into its practical effectiveness.
Questions
How does your algorithm compare with Price's algorithm in practice?
Typos
- line 123, centralized equation, should be .
- line 124, should be .
- line 144, should be .
- line 4 of Algorithm 1, should be .
- line 10 of Algorithm 1, should denote the rows in the range , right? should be , then.
Limitations
Yes.
We thank the reviewer for their comments.
Separation in Practice
In Section 5 of the paper, we give a synthetically constructed instance for which Price's algorithm does not give a good approximation to the top singular vector. For instances observed in practice, it seems harder to show such a separation. In many practical instances, the observations, i.e., the rows, are sampled from a nice enough distribution for which earlier analyses of Oja's algorithm already show very good approximability of the top eigenvector.
I thank the authors for their rebuttal and have no further questions. As an aside (not affecting my score) I think the paper would benefit from including, even if only briefly and informally, a discussion of the empirical side, e.g. what you mention regarding the many practical scenarios in which the distribution is nice enough and Oja's algorithm yields a good approximation. This is related to reviewer yHif's concern that your paper could better motivate the problem.
Thanks for the response. We will include our observations from experiments in the final version of the paper.
This paper is concerned with estimating a direction that is well-correlated with the top eigenvector v_1 of A^TA, where the rows of the matrix A are streamed to us in uniformly random order. To make this well-defined, a gap assumption is required, i.e., that \sigma_1(A^TA) is sufficiently larger than \sigma_2(A^TA). A successful algorithm is one that returns an estimate vector that is well correlated with v_1 while using space subquadratic in the dimension d. Note that if we are allowed to use O(d^2) space, then the problem is trivial, as we can simply maintain A^TA = \sum_i a_i a_i^T. Thus, one of the main challenges is to figure out how to estimate v_1 when we are not allowed to remember too many of the rows we have seen so far.
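For contrast, here is a short sketch of the trivial quadratic-space baseline that the subquadratic requirement rules out: maintain the Gram matrix exactly and diagonalize it at the end.

```python
import numpy as np

def top_eigvec_quadratic_space(rows):
    """Exact one-pass solution using Theta(d^2) space: maintain A^T A."""
    d = rows.shape[1]
    M = np.zeros((d, d))
    for a in rows:
        M += np.outer(a, a)              # d x d update per row
    # Top eigenvector of the symmetric PSD matrix A^T A.
    eigvals, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, -1]
```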
The main result is an algorithm that uses space roughly proportional to h*d and outputs a unit vector v whose correlation with v_1 is at least 1 - C/\sqrt{R} for some universal constant C, where h denotes the number of "heavy" rows (those whose squared norm is at least a certain fraction of ||A||_F^2). This dependence on h is essentially optimal, as the authors give a lower bound saying that any algorithm that outputs a vector with a comparable correlation must use space that grows linearly with h.
The main technical contribution is to observe that if the rows of A are randomly permuted and streamed to us, then, in some sense, we can see spectral approximations to A^TA in disjoint contiguous windows of the stream. Thus, the power method can be simulated by applying the approximation from each window to the current estimate in each round. To actually form the spectral approximations and implement this with low space, the authors use a row sampling result which says that one only needs to choose and reweight a number of rows of A proportional (up to logarithmic factors) to the stable rank of A to form a decent spectral approximation to A^TA. So, the authors use the row sampling result to form many spectral approximations to A^TA and then simulate the power method on these approximations. The final result follows from showing that the top eigenvector of the spectral approximations is reasonably well correlated with the top eigenvector of A^TA, which itself follows from standard eigenvector perturbation tools and the gap assumption.
Although this does not immediately work for heavy rows (e.g., consider a more difficult scenario where there is one row orthogonal to all the other rows in the matrix, it has large norm, and it alone witnesses the top eigenvector; then we must remember it), the simple fix is to just identify heavy rows on the fly and store those exactly. As mentioned earlier, this linear dependence on the number of heavy rows is necessary.
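A hedged sketch of this fix; the threshold rule below is a placeholder rather than the paper's definition of heavy rows:

```python
import numpy as np

def split_heavy_rows(rows, tau=0.01):
    """Set aside rows whose squared norm is at least a tau fraction of the
    squared Frobenius mass seen so far; the remaining 'light' rows would be fed
    to the sampling-based part of the algorithm. The rule is only a placeholder
    (early rows tend to be flagged heavy, since little mass has been seen)."""
    heavy, light = [], []
    frob_sq = 0.0
    for a in rows:
        norm_sq = float(a @ a)
        frob_sq += norm_sq
        if norm_sq >= tau * frob_sq:     # heavy relative to the mass seen so far
            heavy.append(np.array(a))    # stored exactly, O(d) space per heavy row
        else:
            light.append(np.array(a))
    return heavy, light
```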
Strengths
The problem is a very natural one and the random order stream of a worst-case matrix is a nice assumption that allows one to get more optimistic results while not trivializing the question.
The algorithm is really simple and easy to interpret. The solution/analysis are very elegant. The writing is also clear. Finally, the idea and execution are pretty insightful.
Weaknesses
In order to apply the row sampling result, we need an estimate of the operator norm of A. The authors fix this by assuming that the coordinates of the inputs are bounded within a fixed range and then maintaining several copies of the algorithm in parallel. This is not too big of a deal, but I wish this assumption were made more explicit towards the beginning of the paper.
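For concreteness, a sketch of the generic norm-guessing trick that such parallel copies typically implement (the single-guess routine is a hypothetical placeholder, and this may differ from the authors' exact instantiation):

```python
import numpy as np

def run_with_norm_guesses(rows, run_one_copy, max_norm):
    """Generic guessing trick: run one copy of a norm-dependent streaming
    algorithm per geometric guess of the operator norm in [1, max_norm].
    'run_one_copy(rows, guess)' is a hypothetical single-guess routine."""
    num_guesses = int(np.ceil(np.log2(max_norm))) + 1
    guesses = [2.0 ** k for k in range(num_guesses)]
    # In a true streaming implementation the copies run in parallel over one pass,
    # and the copy whose guess turns out to be (approximately) correct is kept.
    return {g: run_one_copy(rows, g) for g in guesses}
```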
Questions
None
Limitations
Yes
Given a stream of data points a_1, ..., a_n \in R^d, this paper studies the problem of approximating the top eigenvector of the matrix A^TA at the end of the stream, where A is the data matrix whose rows correspond to the data points. This is the same problem as approximating the first principal component in PCA. Prior work on this problem by Price (2023) gave an algorithm that works with some worst-case guarantees, i.e., both the data points and the order of the data points can be chosen by the adversary. This paper is the first to study the setting where, although the data points can be chosen arbitrarily by the adversary, the order of arrival of the data points is chosen uniformly at random. For this setting, the paper provides an algorithm with improved guarantees over the algorithm by Price. Additionally, they also provide a lower bound for streaming algorithms in the random-order arrival model.
Strengths
Originality: First paper to study the problem of approximating the top eigenvector in the random-order arrival setting.
Quality and Clarity: Overall well-written paper.
Significance: The paper can certainly do a better job of motivating the problem. This paper is primarily a theory paper but since the problem studied is closely related to PCA, which is obviously very important in practical applications, it is easy to imagine scenarios where the algorithm described in this paper can be implemented. For example, in the case of PCA for population genetics, where the dimensionality of the data (number of genetic markers) can exceed 10 million, and the number of samples in standard datasets like UK Biobank is of the order of hundreds of thousands, you could imagine running this algorithm in one-pass to approximate the top principal component, running it on a second pass to approximate the second principal component, and then using a third pass to map all the points using these two components (most PCA studies in population genetics papers just use 2 components as they are easy to visualize in papers). Even otherwise, as multiple papers on the closely-related "Streaming PCA" problem have appeared in past iterations of NeurIPS and other related conferences, this paper is clearly relevant to NeurIPS.
Weaknesses
Paper needs to do a better job motivating the problem. This is a submission to NeurIPS, not COLT. Clearly, the paper has interesting technical contributions but I think this might be a major weakness for the paper. Some suggestions for the same:
It would be nice for the introduction of the paper to have at least a few lines motivating the case for high-dimensional d and why the distinction between O(d) and O(d^2) space is something which matters in real-world applications. For example, see the paper “Memory Limited, Streaming PCA” by Mitliagkas, Caramanis, and Jain, which appeared in NeurIPS 2013: “In certain high-dimensional applications, where data points are high resolution photographs, biometrics, video, etc., d often is of the order of 10^5 - 10^6, making the need for d^2 memory prohibitive. At many computing scales, manipulating vectors of length O(d) is possible, when storage of O(d^2) is not.”
A great reference to add on line 80 would be the book chapter by Gupta and Singla titled “Random-Order Models”, which appeared in the book “Beyond the Worst-Case Analysis of Algorithms” edited by Tim Roughgarden.
Questions
How do the guarantees in this paper extend to the setting where you need to approximate not just the top eigenvector, but the top k principal components, for some arbitrary value of k? I understand that this might not be straightforward to answer, but at the very least, it would be nice to have a conclusion or future directions section at the end of the paper, with a brief discussion on this.
Minor editing comment: For the paper “Syamantak Kumar and Purnamrita Sarkar. Streaming PCA for Markovian data.”, please use the BibTeX citation from the official NeurIPS website instead of Google Scholar. The current citation reads as NeurIPS 2024, but it actually appeared in NeurIPS 2023.
Limitations
Primarily a theory paper.
Thanks for the reference suggestions. We will add them to the introduction in the final version.
How do the guarantees in this paper extend to the setting where you need to approximate not just the top eigenvector, but the top k principal components, for some arbitrary value of k?
In practice, one can do k passes over the data and, by projecting away the vectors computed in the previous rounds, obtain approximations to the top k singular vectors. But we do not have any guarantees on how good such approximations are. The main reason is that the approximation guarantee, i.e., the 1-O(1/\sqrt{R}) correlation that we obtain in the first pass, is not sufficient to ensure that the gap between the top two singular values of the “residual stream”, i.e., of the projected vectors (I - \hat v_1 \hat v_1^T) a_i, is large enough for the algorithm to work.
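A sketch of this multi-pass deflation heuristic (the one-pass top-eigenvector routine is left as a placeholder; as noted, there are no guarantees for this approach):

```python
import numpy as np

def deflated_passes(rows, top_eigvec_one_pass, k):
    """Heuristic k-pass deflation: in pass j, feed the one-pass routine the rows
    projected away from the directions found in earlier passes.
    'top_eigvec_one_pass' is a placeholder for a single-pass top-eigenvector
    algorithm that accepts an iterable of rows."""
    def project_away(a, V):
        for v in V:                      # remove components along found directions
            a = a - (v @ a) * v
        return a

    V = []
    for _ in range(k):
        residual_stream = (project_away(a, V) for a in rows)
        v_hat = top_eigvec_one_pass(residual_stream)
        v_hat = v_hat / np.linalg.norm(v_hat)
        V.append(v_hat)
    return np.array(V)
```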
It is an interesting open question to extend our algorithm, or obtain an algorithm using a different approach, for k-SVD when worst-case input matrices are presented to the algorithm in a uniformly random order. We note that the work of Price also only applies to the top eigenvector, and the same open question applies there as well.
This paper studies the problem of approximating the top eigenvector of the matrix A^TA in the random order data stream model. Both upper and lower bounds are provided.
All reviewers agreed that the problem studied is natural, and that the random order stream is a reasonable model for obtaining more optimistic results. It is also very nice that the space complexity of the proposed algorithm is nearly optimal.