Optimal Sketching for Residual Error Estimation for Matrix and Vector Norms
Abstract
Reviews and Discussion
This paper considers the problem of estimating the Frobenius-norm residual error of the best rank-$k$ matrix approximation and the $\ell_p$-norm residual error of the best $k$-sparse vector approximation. The main result, for the matrix case, is that the best rank-$k$ approximation error can be approximated to $(1+\epsilon)$ relative error using a bilinear compression $SAT$ of the input matrix $A$ by two projection-cost-preserving sketches $S$ and $T$. The new result improves the dependence on $\epsilon$ over (Andoni and Nguyen, 2013); a new lower bound establishes the optimality of the current method.
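For concreteness, the estimator under discussion has roughly the following shape. This is an illustrative sketch only, not the authors' implementation: the sketch dimension `m`, the CountSketch construction, and the synthetic test matrix are placeholders, and the estimate is taken to be the rank-$k$ residual of the compressed matrix.

```python
# Minimal illustration of a bilinear residual estimate of the kind described
# above. NOT the authors' code; sketch sizes and the test matrix are assumptions.
import numpy as np

def countsketch(m, n, rng):
    """m x n CountSketch: one random +/-1 entry per column."""
    S = np.zeros((m, n))
    rows = rng.integers(0, m, size=n)
    signs = rng.choice([-1.0, 1.0], size=n)
    S[rows, np.arange(n)] = signs
    return S

def rank_k_residual(M, k):
    """Frobenius-norm error of the best rank-k approximation of M."""
    s = np.linalg.svd(M, compute_uv=False)
    return np.sqrt(np.sum(s[k:] ** 2))

rng = np.random.default_rng(0)
n, d, k, m = 500, 400, 10, 60          # m = sketch dimension (assumed)
A = rng.standard_normal((n, k)) @ rng.standard_normal((k, d)) \
    + 0.1 * rng.standard_normal((n, d))

S = countsketch(m, n, rng)             # m x n
T = countsketch(m, d, rng).T           # d x m
estimate = rank_k_residual(S @ A @ T, k)
truth = rank_k_residual(A, k)
print(f"estimate = {estimate:.3f}, true residual = {truth:.3f}")
```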
Strengths
This is a nice work of theoretical computer science. The writing is good and reasonably clear. The fact that the projection-cost-preserving sketch technology allows for an improvement of the Andoni–Nguyen result is very interesting and, to the best of my knowledge, novel. The lower bound is nice as well, and it establishes the optimality of the proposed approach (though I have my concerns about the correctness of Lemma 3.2).
Weaknesses
As a work of theoretical computer science, this paper is perfectly adequate. As a work of machine learning, it has a few deficiencies.
Use Case
When is the proposed method useful? The story the authors tell feels a bit far-fetched to me:
- Using a generalized Nyström approximation and OSNAP sketches, a rank-$k$ approximation can be computed in a single pass over the matrix. For a dense matrix stored on disk, the time cost of doing so is within a logarithmic factor of the cost of the proposed residual estimate. The difference between these costs is not very significant for realistic parameter settings.
- The authors suggest data streams as a potential application. Indeed, the proposed approach does give a way of estimating the residual error from a stream in low space. The authors then say one can use the residual estimate to decide whether or not low-rank approximation is worth it. But this raises two objections: 1) If one is going to compute a low-rank approximation anyway, then one will have to eat the space cost of storing such an approximation eventually. The proposed low-space diagnostic doesn't seem useful if, conditional on a positive outcome, actually computing the approximation needs more space. 2) Often in the streaming setting, we assume we get only a single pass over the data. The authors' approach requires two passes: one to compute the diagnostic and a second to compute a low-rank approximation.
There may be some use case where the proposed approach is natural, but it seems to me that the proposed method probably has fairly limited usefulness in practice.
CountSketch
I find the authors' use of CountSketch questionable, particularly in the numerical experiments. In my experience, an OSNAP (even with just 2 entries of sparsity per column) is dramatically safer and more effective than CountSketch in practice.
To demonstrate this, I ran my own experiment. I constructed a test matrix $A$ and, following Table 1, used the same values of $k$ and the sketch dimension. With CountSketch, the estimate is a factor of 58 smaller than the true residual error $\|A - A_k\|_F$. By contrast, OSNAP with just 2 nonzeros per column only underestimates the residual error by a factor of 2! For the proposed use case of screening whether or not significant compression is possible by low-rank approximation, an estimate within a factor of two is probably useful; one that is wrong by more than an order of magnitude is not.
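For reference, a comparison of this flavor can be reproduced in spirit with a few lines of NumPy. The construction below is only an illustrative reconstruction (my adversarial test matrix is not reproduced here, and the dimensions are placeholders); $s=1$ plays the role of CountSketch and $s=2$ the role of OSNAP.

```python
# Sketch of a CountSketch-vs-OSNAP comparison (illustrative only; the
# adversarial matrix from my experiment is not reproduced here).
import numpy as np

def osnap(m, n, s, rng):
    """m x n OSNAP: s nonzero entries per column, values +/- 1/sqrt(s)."""
    S = np.zeros((m, n))
    for j in range(n):
        rows = rng.choice(m, size=s, replace=False)
        S[rows, j] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)
    return S

def residual(M, k):
    s = np.linalg.svd(M, compute_uv=False)
    return np.sqrt(np.sum(s[k:] ** 2))

rng = np.random.default_rng(1)
n, d, k, m = 400, 400, 5, 50            # assumed sizes, not the ones in Table 1
A = np.diag(np.linspace(10.0, 0.1, min(n, d)))  # placeholder input matrix

for s in (1, 2):                        # s = 1 is CountSketch, s = 2 is OSNAP
    S, T = osnap(m, n, s, rng), osnap(m, d, s, rng).T
    print(f"s = {s}: estimate = {residual(S @ A @ T, k):.3f}")
print(f"true residual = {residual(A, k):.3f}")
```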
This is a paper about theoretical computer science: rigorous analysis of computational methods so that they work in practice even on worst-case inputs. The use of a plain CountSketch in the numerical experiments, now demonstrated to be able to severely underestimate the residual error in practice, should be removed in favor of OSNAP or, perhaps, a composite CountSketch–Gaussian sketch.
As a side note, the introduction claims that the paper will prove that the method works if $S$ and $T$ are chosen to be OSNAPs. But this is omitted from the main text. It should be included, because OSNAPs are often the most effective sketching matrices in practice.
Numerical Evaluation
The runtime of a full, dense SVD is not the appropriate comparison in Table 1. Rather, the appropriate comparison is a randomized low-rank approximation such as the randomized SVD or generalized Nyström (with OSNAPs, if the latter).
Accuracy
As I mention above, and as the authors tacitly admit with their numerical experiments, an error estimate which is within a factor of two is often sufficient, particularly if the proposed use case is deciding whether or not to run a low-rank approximation algorithm. Viewed in this light, the improvement in the sketch dimension's dependence on $\epsilon$ in the present work is not that meaningful in practice.
Sparse Approximation
The residual error for $k$-sparse approximation of vectors in $\ell_p$ norms for $1 \le p \le 2$ is very much a question of interest to theoretical computer science. I fail to see any interest for practical problems in machine learning.
Conclusion
Ultimately, I feel like this paper would be more at home at a dedicated TCS publication venue. I'm not sure when I would actually run the authors' proposed method in practice, and the improved $\epsilon$-dependence has limited practical implications as well. I have significant concerns with the use of plain CountSketch in the numerical experiments. If the numerical experiments were fixed to avoid the use of plain CountSketch, I would be inclined to give a soft recommendation for acceptance.
Questions
Is Lemma 3.2 correct?
The proof of Lemma 3.2 is not convincing to me in several ways. First, it is not clear to me why the two stated hypotheses imply the stated conclusion. I believe the ultimate result is probably true, but I don't follow the line of reasoning.
I'm not sure data processing and linearity of the sketch are enough to conclude the claimed equivalence. The perturbation is applied to both quantities, and it is dependent on the latter of them. I'm not sure exactly what the authors' argument is. It appears that the authors are making a claim of the form:
If two random variables are close in distribution, then they remain close after adding a common perturbation that may depend on them.
This claim is certainly not true. Take a random variable which is never zero and add its own negation as the perturbation: the perturbed variable is identically zero while the original is not.
I'm not sure how exactly the authors purport to conclude the proof of Lemma 3.2. The authors should revise this proof to make the correctness of this result unassailable.
We thank the reviewer for their thoughtful and detailed comments. We have uploaded a new version of our submission, where we added OSNAP in Lemma 3.5 and replaced the CountSketch matrix with OSNAP in our experiments section.
- The proof of Lemma 3.2: We thank the reviewer for pointing this out. We have added more details to the proof of this lemma; please check the new version of the submission. At a high level, we have that one quantity is small and wish to prove that a perturbed quantity is small, where the perturbation is independent of one of the two sketches but not of the other. In fact, we can show that the perturbed quantity is small conditioned on a fixed value of the first sketch (the distribution of the perturbation depends on this choice), for an overwhelming fraction of choices of that sketch. This is enough to conclude that the unconditioned value is small. (A schematic version of this conditioning step is given after this list.)
- The experiments: the reason we used a plain CountSketch matrix is that we found it has good performance on the two real-world datasets. We agree with the reviewer's point that OSNAP has a better worst-case guarantee. Hence, we have replaced CountSketch with OSNAP in our new version of the experiments. (Note that the new experiments are conducted on a different device, so the runtimes are not comparable with those of the previous version.)
- On use cases: there are natural settings in which we do not necessarily need to compute a low-rank approximation of all of the input matrices. Instead, we only need to compute one for an $\alpha$-fraction of them, based on which of them have a good low-rank approximation. In the case when $\alpha$ is small (such as $0.1$) and computing an actual low-rank approximation is much more expensive than estimating its cost, our algorithm will achieve a factor of roughly $1/\alpha$ savings in its resource requirements. In the streaming model, the space we need for our algorithm depends only on $k$ and $\epsilon$, while if we want to compute a rank-$k$ approximation in a stream, the space we need grows with the matrix dimensions, which is larger whenever the matrix is much bigger than the sketch. Also, the overall runtime of our rank-$k$ residual estimation algorithm is proportional to the number of nonzeros of the matrix plus a low-order term depending only on $k$ and $\epsilon$, while a rank-$k$ approximation incurs an additional term that scales with the matrix dimensions; hence, if the matrix is sparse enough, our algorithm saves this additional cost. For the sparse recovery problem, when $p < 2$, our algorithm is able to find heavy hitters that are even "less heavy" than those for $p = 2$: e.g., a single coordinate can be heavy relative to the $\ell_p$ norm of the remaining coordinates without being heavy relative to their $\ell_2$ norm, and such a coordinate must be found by an $\ell_p$ sparse recovery algorithm but not by an $\ell_2$ sparse recovery algorithm.
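Schematically, the conditioning step mentioned in the first bullet above can be summarized as follows (generic notation, not the notation of the paper):

```latex
% Generic form of the conditioning argument: if the bad event B is unlikely
% conditioned on every "good" choice of the sketch S, and good choices are
% overwhelmingly likely, then B is unlikely unconditionally.
\[
  \Pr[B] \;=\; \mathbb{E}_{S}\bigl[\Pr[B \mid S]\bigr]
  \;\le\; \Pr[S \notin \mathcal{G}] \;+\; \max_{S \in \mathcal{G}} \Pr[B \mid S]
  \;\le\; \delta \;+\; \delta',
\]
\[
  \text{where } \mathcal{G} = \{\, S : \Pr[B \mid S] \le \delta' \,\}
  \text{ and } \Pr[S \in \mathcal{G}] \ge 1 - \delta .
\]
```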
Thanks for clearing up the proof of Lemma 3.2; it's much clearer now. The integration of OSNAP into the text and the numerical experiments is an important improvement, in my eyes.
Unfortunately, the more I look at the numerical experiments and the proposed use cases for the method, the less convinced I become that this will be a useful algorithmic technique in practice (see below).
I understand the authors are coming from a theoretical computer science perspective. As a self-contained work of TCS and divorced from questions of whether the algorithmic problem being solved has practical use, I have no complaints about this work. To me, however, publication of a linear algebraic algorithm in ICLR requires the algorithm to be interesting and useful to practitioners with experiments that appropriately demonstrate its usefulness. My concerns about practical usefulness of the method have only grown the more I've looked into the method and played around with implementing it myself, so I have elected to lower my score.
Numerical experiments
In my original review, I encouraged the authors to use a randomized low-rank approximation algorithm—rather than a full SVD—as the baseline for the numerical comparisons in Table 1. As they have not adopted this suggestion in the most recent revision, I did some of my own experiments.
As a test, I generated a random matrix of size 3430 by 6906 and ran four algorithms: the authors' method (with OSNAP, $s=2$), Andoni–Nguyen, the Halko–Martinsson–Tropp (HMT) randomized SVD, and a full SVD. I set $k$ and the sketch size to match the authors' experiments. Here are the results:
- 0.043 seconds for the author's proposal.
- 0.071 seconds for Andoni–Nguyen.
- 0.046 seconds for HMT.
- 7.22 seconds for a full SVD.
But the authors mention that their matrix is sparse, possessing only 353,160 nonzeros. Using this sparsity and representing the matrix in a sparse format, I get the following timings:
- 0.008 seconds for the author's proposal.
- 0.012 seconds for Andoni–Nguyen.
- 0.003 seconds for HMT.
- 7.31 seconds for a full SVD.
Actually computing a full low-rank approximation, and the error of this low-rank approximation exactly using the standard formula for the error of the HMT low-rank approximation, is over twice as fast as the proposal on a matrix with the same size and sparsity as in the authors' own tests.
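For reference, the HMT baseline in these timings is essentially the textbook randomized range finder together with the Pythagorean identity for the residual. A reconstruction is sketched below; the rank, oversampling, and random test matrix are assumptions, not my exact script.

```python
# Sketch of an HMT randomized-SVD baseline of the kind used in the timings
# above (illustration only; rank k, oversampling p, and the random test matrix
# are assumptions, not the exact experiment).
import numpy as np
import scipy.sparse as sp

def hmt_residual(A, k, p=10, rng=None):
    """Randomized range finder + exact residual of the resulting approximation."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = A.shape
    Y = A @ rng.standard_normal((d, k + p))        # sample the range of A
    Q, _ = np.linalg.qr(Y)                         # orthonormal basis, n x (k+p)
    B = (A.T @ Q).T                                # equals Q^T A, (k+p) x d
    # ||A - Q Q^T A||_F^2 = ||A||_F^2 - ||Q^T A||_F^2 since Q Q^T is a projection
    normA2 = np.linalg.norm(A.data) ** 2 if sp.issparse(A) else np.linalg.norm(A) ** 2
    return np.sqrt(max(normA2 - np.linalg.norm(B) ** 2, 0.0))

A = sp.random(3430, 6906, density=353160 / (3430 * 6906),
              random_state=0, format="csr")        # same size/sparsity as above
print("HMT residual estimate:", hmt_residual(A, k=50, rng=np.random.default_rng(0)))
```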
On the basis of these timings, I believe the authors' numerical experiments—as currently designed—are misleading about the purported benefits of the proposed methodology.
Use Cases
There are natural settings in which we do not necessarily need to compute a low-rank approximation of all of the input matrices. Instead, we only need to compute one for an $\alpha$-fraction of them, based on which of them have a good low-rank approximation. In the case when $\alpha$ is small (such as 0.1) and computing an actual low-rank approximation is much more expensive than estimating its cost, our algorithm will achieve a factor of roughly $1/\alpha$ savings in its resource requirements.
I have never encountered or heard of such a scenario occurring in practice. Do the authors have a reference for such a scenario in an applied context?
In the streaming model, the space we need for our algorithm is...
I understand that your algorithm has lower space than computing a low-rank approximation. My question is: what use is a low-space algorithm for testing the residual error if, should the low-rank approximation error be small, you'll need more space to actually compute the low-rank approximation? If I only have a small amount of space available to me, your algorithm isn't helpful to me. Sure, I can run your algorithm, but I lack enough space to actually do anything useful (i.e., compute a low-rank approximation) after a positive result from your algorithm.
The runtime benefits might be a different story, but, as I demonstrate above, your existing experiments don't support a big speedup over just running a randomized low-rank approximation algorithm.
For the sparse recovery problem...
As a matter of mathematics, I agree with the authors that an $\ell_p$ heavy hitters algorithm can detect lighter heavy hitters than an $\ell_2$ algorithm does. (This is a point that would be worth making in the paper itself, to communicate the purported benefit of the proposed approach to a broad machine learning audience!) But noting that $\ell_p$ heavy hitters can detect lighter heavy hitters, by itself, does not provide justification that the residual estimation problem is of interest to machine learning. Does the proposed algorithm lead to concrete benefits on real data sets? Is the super-constant space cost of the method acceptable in practice? To publish an algorithm for this problem in this venue, I think these "applied" questions deserve answers.
We thank the reviewer for their detailed comments. We have uploaded a new version of our submission.
The Numerical Experiments.
We note that we mentioned the time of the SVD just as a reference time; we did not treat it as a baseline to beat. Indeed, the main point we wanted to illustrate with the previous version of the experiments is that the sparse sketch we use has similar accuracy while having a much faster running time than the dense sketch. Following the reviewer's advice, we compare the following two approaches in our new version of the experiments:
- Randomized SVD: here we use the implementation in the sklearn library in Python. Note that this method does not support the streaming setting and has a larger space requirement.
- The algorithm in [CW13], which supports the streaming setting and computes a low-rank approximation directly.
Taking one setting of the parameters $k$ and the sketch size for the KOS dataset as an example, we have:
- 0.120s for our approach
- 0.230s for randomized SVD
- 0.660s for the streaming LRA
Hence, we believe that our algorithm still has a runtime benefit when compared to more baselines for low-rank approximation. (The results we report are slightly inconsistent with the reviewer's. One possible reason is that if the reviewer is using Matlab, the built-in methods are already heavily optimized, which could result in a smaller runtime gap.)
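For concreteness, the randomized SVD baseline refers to usage along the following lines; this is an illustrative configuration (the matrix density and rank below are placeholders, not our exact experimental setup).

```python
# Illustrative use of the sklearn randomized SVD baseline mentioned above.
# The matrix density and rank k are placeholders, not our exact configuration.
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.utils.extmath import randomized_svd

A = sparse_random(3430, 6906, density=0.015, format="csr", random_state=0)
k = 10                                              # placeholder rank

U, s, Vt = randomized_svd(A, n_components=k, random_state=0)
# Residual of the rank-k approximation U diag(s) Vt via the Pythagorean identity,
# valid because this approximation is an orthogonal projection of A onto col(U).
residual = np.sqrt(max(np.linalg.norm(A.data) ** 2 - np.sum(s ** 2), 0.0))
print("rank-k residual of the randomized SVD baseline:", residual)
```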
On Use Cases.
“I understand that your algorithm has lower space than computing a low-rank approximation. My question is: what use is a low-space algorithm for testing the residual error if, should the low-rank approximation error be small, you'll need more space to actually compute the low-rank approximation?”
One of the reviewer's points seems to be that since the downstream task needs more space than the test itself, the memory savings is not very useful, as we would ultimately need the larger amount of memory anyway. However, note that data analysis in practice is often conducted simultaneously across different groups of data. Hence, the smaller space requirement in the first stage allows us to perform a larger amount of testing across different groups of data during the same period, which also results in a time savings.
I have never encountered or heard of such a scenario occurring in practice. Do the authors have a reference for such a scenario in an applied context?
A typical application of low-rank approximation is to recommendation systems (for more details, see the textbook [Agg16]). In this model, we consider a preference matrix $A$, where $A_{i,j}$ denotes user $i$'s preference for item $j$. Given a preference matrix with a subset of known entries, the goal is to predict the large entries of $A$, which represent good recommendations. Methods based on low-rank matrix completion work well in practice for this problem. For this task, it is reasonable to first check whether the current collection of items and users indeed has a good low-rank representation and use it as part of the data in future predictions, which we believe should result in a runtime savings. The low space usage of the algorithm is also beneficial here, as usually the prediction is performed simultaneously across different user/item groups, and hence a smaller working space allows for computing more groups at the same time.
Another potential use of our algorithm is as a subroutine of more complicated optimization problems. Consider the low-rank tensor decomposition problems in [SGL+20]. At a high level, for each mode of the tensor, we compute a low-rank decomposition of the corresponding flattening; however, if the tail error of a flattening is close to its total norm, then we could just use an arbitrary orthonormal basis, which will only increase the error by a small factor, and we then gain some savings in runtime/memory by not having to explicitly compute a low-rank approximation of most flattenings of the input tensor.
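A hedged sketch of what such a screening subroutine could look like is below. Here `residual_estimate` is a stand-in for a sketch-based residual estimator (for illustration it computes the exact residual), and the threshold is an assumption.

```python
# Hypothetical screening step for a Tucker-style decomposition, as described
# above. `residual_estimate` is a placeholder for a sketch-based estimator; it
# is not an existing API.
import numpy as np

def residual_estimate(M, k):
    # Placeholder: compute the exact rank-k residual for illustration only.
    s = np.linalg.svd(M, compute_uv=False)
    return np.sqrt(np.sum(s[k:] ** 2))

def mode_unfold(T, mode):
    """Flatten tensor T along `mode` into a matrix."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def choose_factor(T, mode, k, slack=0.9):
    """Use a cheap residual estimate to decide whether a real factor is needed."""
    M = mode_unfold(T, mode)
    if residual_estimate(M, k) >= slack * np.linalg.norm(M):
        # Tail is nearly the whole mass: an arbitrary orthonormal basis suffices.
        return np.eye(M.shape[0], k)
    U, _, _ = np.linalg.svd(M, full_matrices=False)  # otherwise pay for a real factor
    return U[:, :k]

T = np.random.default_rng(0).standard_normal((20, 25, 30))
factors = [choose_factor(T, m, k=5) for m in range(T.ndim)]
print([F.shape for F in factors])
```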
“I agree with the authors that an $\ell_p$ heavy hitters algorithm can detect lighter heavy hitters than an $\ell_2$ algorithm does. (This is a point that would be worth making in the paper itself, to communicate the purported benefit of the proposed approach to a broad machine learning audience!)”
We have added some discussion in the introduction, and we thank the reviewer for pointing this out.
“But noting that $\ell_p$ heavy hitters can detect lighter heavy hitters, by itself, does not provide justification that the residual estimation problem is of interest to machine learning. Is the super-constant space cost of the method acceptable in practice?”
Note that our algorithm indeed also outputs a $k$-sparse vector which satisfies the requirements of the sparse recovery problem, which was previously unknown for this range of $p$. We also provide a matching lower bound for the recovery problem. As mentioned, our algorithm for $p < 2$ is able to find heavy hitters that are even "less heavy" than those for $p = 2$, which is beneficial to, e.g., the standard Gaussian white noise model in practice, for which the underlying vector is a sparse vector plus Gaussian noise (see, e.g., [WHXF11]), where now we can handle a smaller variance on the Gaussians. The ability to detect the existence of a large signal in compressed sensing is also useful, as we can decide to use more measurements only if we notice the existence of such a signal; otherwise we can use fewer. This would make the sensing adaptive, but this is also a common model in compressed sensing (see, e.g., [IPW12, MN13]).
[CW13]: Kenneth L. Clarkson, David P. Woodruff. Low Rank Approximation and Regression in Input Sparsity Time.
[Agg16]: Charu C. Aggarwal. Recommender Systems: The Textbook.
[SGL+20]: Yiming Sun et al. Low-Rank Tucker Approximation of a Tensor from Streaming Data.
[WHXF11]: Ning Wan-zheng et al. The Analysis of Noise Reduction Performance in Compressed Sensing.
[IPW12]: Piotr Indyk, Eric Price, David P. Woodruff. On the Power of Adaptivity in Sparse Recovery.
[MN13]: Matthew L. Malloy, Robert D. Nowak. Near-Optimal Adaptive Compressed Sensing.
Thanks to the authors for their detailed replies.
I remain quite skeptical that either of the proposed methods will be useful in practice (at least for the use cases suggested by the authors), and, as such, I find ICLR a bit of an odd publication venue for this work.
That said, I appreciate the paper for its mathematical and TCS contributions. I appreciate the authors for taking the time to thoughtfully respond to my (rather lengthy) reviews, and I believe that the changes made will make the method more robust if it sees use in practice (OSNAP rather than CountSketch) and will be more informative to practitioners about the speedups they can expect (randomized low-rank approximations as baselines).
I have raised my score to its original level and wish the authors all the best.
This paper studies the residual error estimation problem. To be specific, for a matrix $A$, one estimates $\|A - A_k\|_F$ and judges whether it is worthwhile to compute a low-rank approximation $B$ such that $\|A - B\|_F \le (1+\epsilon)\|A - A_k\|_F$, where $A_k$ is the best rank-$k$ approximation to $A$; for a vector $x$, one estimates $\|x - x_k\|_p$ and determines whether to compute a $k$-sparse vector $x'$ such that $\|x - x'\|_p \le (1+\epsilon)\|x - x_k\|_p$, where $x_k$ is the best $k$-sparse approximation to $x$.
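In symbols, one standard way to write the two estimation tasks described above is the following (a reconstruction in generic notation):

```latex
% The two residual-estimation tasks in (1+eps)-relative-error form.
\[
  \text{Matrix case: given } A \in \mathbb{R}^{n \times d}, \text{ output } Z
  \text{ with } \bigl|\, Z - \|A - A_k\|_F \,\bigr| \le \epsilon \, \|A - A_k\|_F ,
\]
\[
  \text{Vector case: given } x \in \mathbb{R}^{n} \text{ and } 1 \le p \le 2,
  \text{ output } Z \text{ with }
  \bigl|\, Z - \|x - x_k\|_p \,\bigr| \le \epsilon \, \|x - x_k\|_p ,
\]
where $A_k$ is the best rank-$k$ approximation to $A$ and $x_k$ keeps the $k$ largest-magnitude entries of $x$.
```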
Given a matrix $A$, for the estimation of the residual error $\|A - A_k\|_F$, this paper obtains matching lower and upper bounds on the dimensionality of the bilinear sketch, improving the upper bound in [Andoni and Nguyen'13].
Given a vector $x$, for the estimation of $\|x - x_k\|_p$ with $1 \le p \le 2$, this paper gives an upper bound on the space. This result can be extended to the sparse recovery problem, in which the upper bound almost matches the obtained lower bound when taking $\epsilon$ as a constant.
Strengths
(1) For low-rank residual error estimation, this paper gives a lower bound on the dimensionality of the sketching matrices $S$ and $T$, and a matching upper bound, which improves the upper bound in [Andoni and Nguyen'13].
(2) For residual error estimation of vector norms with $1 \le p \le 2$, this paper gives an upper bound on the dimensionality of the sketching matrix. Also, this upper bound can be extended to the sparse recovery problem and almost matches the obtained lower bound when $\epsilon$ is a constant.
(3) The experimental results indicate that the proposed algorithm has comparable error but is at least ten times faster than the existing method [Andoni and Nguyen'13].
(4) This paper is well-written and easy to understand.
Weaknesses
In the experiments, this paper only demonstrates the efficiency of the proposed algorithm for low-rank residual error estimation. Could the authors supplement the experiments with ones for the residual error estimation of vector norms?
Questions
See Weaknesses.
We thank the reviewer for their thoughtful comments. We can provide additional experiments in the next version of our paper.
Thank you for your feedback. I will retain my score.
This paper studies the problem of estimating the residual error using linear sketching. Specifically, it focuses on the following two problems:
- For any matrix $A$, the task is to estimate $\|A - A_k\|_F$ within a $(1+\epsilon)$-factor, where $A_k$ is the best rank-$k$ approximation to $A$. For this problem, it is shown that a bilinear sketch of the form $SAT$ suffices, where the sketching matrices $S$ and $T$ can be sparse CountSketch matrices with a single nonzero in each of their columns. This gives a sketch whose total size is polynomial in $k$ and $1/\epsilon$. A lower bound is given which shows that $S$ and $T$ must have a matching number of rows and columns, respectively, for any algorithm which takes $SAT$ as input and outputs an estimate of the rank-$k$ residual error. This shows the upper bound is tight for bilinear sketches.
- The next problem is estimating the residual error $\|x - x_k\|_p$ up to a $(1+\epsilon)$-factor for any vector $x$ which is updated in the streaming model, for any $\ell_p$ norm with $1 \le p \le 2$, where $x_k$ is the vector containing the top-$k$ elements of $x$. The algorithm uses CountSketch to approximately find the indices of the top-$k$ elements of $x$ and, in parallel, uses a known frequency estimation procedure for these coordinates, subtracting their approximate values to get the residual error (an illustrative sketch of this structure is given after this list). This yields an upper bound on the sketch size and also gives a $k$-sparse approximation of $x$. A lower bound is also shown for the $k$-sparse recovery problem.
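As referenced in the second bullet, a small offline illustration of this estimator structure is sketched below. All parameters are placeholders; the real algorithm performs every step with small sketches in a stream, whereas this version keeps the vector in memory purely for readability.

```python
# Offline illustration: approximately locate the top-k coordinates with a
# CountSketch, subtract the estimated values, and measure the l_p norm of
# what remains. (Parameters are placeholders; this is not the paper's code.)
import numpy as np

def countsketch_estimates(x, rows, reps, rng):
    """Median-of-estimates CountSketch point queries for every coordinate."""
    n = len(x)
    est = np.empty((reps, n))
    for r in range(reps):
        h = rng.integers(0, rows, size=n)          # bucket per coordinate
        g = rng.choice([-1.0, 1.0], size=n)        # random sign per coordinate
        buckets = np.zeros(rows)
        np.add.at(buckets, h, g * x)
        est[r] = g * buckets[h]
    return np.median(est, axis=0)

rng = np.random.default_rng(0)
n, k, p = 10_000, 10, 1.5
x = rng.standard_normal(n)
x[:k] += 50 * rng.choice([-1.0, 1.0], size=k)      # plant k heavy coordinates

est = countsketch_estimates(x, rows=40 * k, reps=5, rng=rng)
top = np.argsort(-np.abs(est))[:k]                 # approximate top-k indices
resid = x.copy()
resid[top] -= est[top]                             # subtract estimated values
print("estimated residual:", np.sum(np.abs(resid) ** p) ** (1 / p))
print("true residual     :", np.sum(np.sort(np.abs(x))[:-k] ** p) ** (1 / p))
```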
Some experiments are performed with the sparse sketching matrices.
Strengths
- For the residual matrix norm estimation problem, the paper presents bilinear sketches which are optimal in size. Moreover, since the sketching matrices can be sparse, the sketch can be computed quickly. Previous results for this problem used dense sketching matrices and had a larger upper bound on the size of the sketch. The proof for the sketching algorithm is fairly straightforward from results on projection-cost-preserving sketches. The lower bound proof follows via minimum singular value separation results for random Gaussian matrices.
- For the residual vector norm recovery problem, the algorithm also outputs a $k$-sparse approximation in the $\ell_p$ norm which is optimal up to polylogarithmic factors (though the dependence on $\epsilon$ is unclear). The lower bound proof follows from reducing the gap-$\ell_\infty$ problem to the sparse recovery problem.
- Though some of the results are pretty straightforward from prior results (for example, the bilinear sketch upper bound follows directly from the PCP sketch properties), they are an interesting addition to the sketching literature.
- The paper is mostly well written and easy to follow, apart from some typos etc. (see weaknesses and questions).
Weaknesses
- It is unclear how good the upper bound for the residual norm estimation problem is compared to the optimal one. The given lower bound applies to the sparse recovery problem, but this is a harder problem than the norm residual error estimation problem. Some discussion of the actual lower bound for the norm estimation problem might make things clearer for the reader. Also, though the upper bound for the sparse recovery problem is optimal in terms of $n$ and $k$, it is unclear what the correct dependence on $\epsilon$ is.
- There are some typos etc. in the paper which should be corrected:
- There is a typo in Definition 3.4.
- The proof of Theorem 3.7 should refer to Lemma 3.5 instead of 3.4.
Questions
- Can anything be said regarding how good $(SAT)_k$ is as compared to $A_k$, the best rank-$k$ approximation to $A$?
- I'm confused about the proof of Lemma 4.3. It seems we should have an inequality instead of an exact equality, as a coordinate could fall between two of the magnitude levels and hence not belong to any of the sets in the decomposition. It seems that instead of decomposing $I$, decomposing $J$ into sets each containing the elements of $J$ not in $I$ whose magnitudes lie in the given interval, and then doing the analysis, should give the same result?
Remark: I have not checked whether Lemma 5.2 of Price and Woodruff, 2011 indeed generalizes to p>2. This is used in Lemma 4.10.
We thank the reviewer for their thoughtful and detailed comments. We have fixed the typos mentioned and uploaded a new version of the paper. We agree that there is still a bit of a gap between the bounds for the sparse recovery and residual norm estimation problems we study in this paper, and we treat this as an interesting future direction.
- Question 1: Note that the size of $(SAT)_k$ is different from that of $A_k$, so they are not comparable as matrices. We have shown, though, in Theorem 3.7 that the residual error of $(SAT)_k$ is a good approximation to the residual error obtained by using $A_k$ for $A$. Note also that $SAT$ is a much smaller sketch of $A$, and so gives a low-memory way of approximating the residual that one could have found by using $A_k$. Also, in the low-rank approximation problem, using sketching matrices we can more efficiently (than computing $A_k$ directly) compute two matrices $U$ and $V$ such that $\|A - UV\|_F \le (1+\epsilon)\|A - A_k\|_F$. (See the survey [1] for more details.)
- Question 2: We thank the reviewer for pointing this out. In fact, one can show that the claimed containment holds for all coordinates: defining a new magnitude level for the coordinates that fall between the two relevant thresholds is enough, and the remainder of the proof is almost the same. The updated proof has been included in the new version which we have uploaded.
[1] David P. Woodruff. Sketching as a Tool for Numerical Linear Algebra.
Thanks for your response and for addressing my concerns in the new version. My original score remains unchanged.
This paper studies the (approximate) estimation of the residual error of the rank-k approximation of a matrix and of the k-sparse approximation of a vector using linear sketches. For the matrix case (Frobenius norm), the main result is a bilinear sketch obtained by combining CountSketch and a JL sketch; this improves upon the previous best result [Andoni-Nguyen, SODA'13]. The authors also provide a matching lower bound. For $k$-sparse recovery under the $\ell_p$ norm for $1 \le p \le 2$, the authors show an upper bound on the sketch size. This is done via two sketches, where one is used to find heavy hitters and another is used to estimate the norm of the residual.
Strengths
I think this paper is technically solid and solves an interesting open problem. The high-level explanations provided in the paper are helpful to the reader.
Weaknesses
- For the matrix case, the improvement compared to Andoni and Nguyen seems a bit incremental.
- While I think it makes sense to estimate the residual error of matrix approximations, I think there is little use in estimating the residual error of sparse recovery under the $\ell_p$ norm.
Questions
See Weaknesses.
We thank the reviewer for their thoughtful comments.
- On our improvement over Andoni and Nguyen: we improve the $\epsilon$-dependence of the upper bound of Andoni and Nguyen. Note that the bound in their work was also only shown to hold for Gaussian sketching matrices. In our work, we give a very different analysis than that of Andoni and Nguyen, and instead directly show that any sketching matrix with the so-called projection-cost-preserving (PCP) property suffices, which allows us to directly use extremely sparse sketching matrices with a much faster running time, and also allows us to improve the bound using advances in PCPs. We also show a matching lower bound, which shows that our upper bound is optimal up to a constant factor.
- For the sparse recovery problem, when $p < 2$, our algorithm is able to find heavy hitters that are even "less heavy" than those for $p = 2$: e.g., a single coordinate can be heavy relative to the $\ell_p$ norm of the remaining coordinates without being heavy relative to their $\ell_2$ norm, and such a coordinate must be found by an $\ell_p$ sparse recovery algorithm but not by an $\ell_2$ sparse recovery algorithm.
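To make this concrete, here is a small numeric illustration; the values and the heaviness threshold below are chosen purely for illustration.

```python
# Small numeric illustration of the point above: a coordinate can be heavy
# relative to the l_p mass of the remaining coordinates (p < 2) without being
# heavy relative to their l_2 mass. Numbers are illustrative only.
import numpy as np

p, phi = 1.0, 0.25                         # norm and heaviness threshold (assumed)
tail = np.array([10.0] + [0.001] * 1000)   # one dominant tail coordinate
candidate = 4.0

ratio_p = candidate ** p / np.sum(np.abs(tail) ** p)
ratio_2 = candidate ** 2 / np.sum(tail ** 2)
print(f"l_{p:g}-heaviness: {ratio_p:.3f} (heavy: {ratio_p >= phi})")
print(f"l_2-heaviness: {ratio_2:.3f} (heavy: {ratio_2 >= phi})")
```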
The paper considers the problem of estimating the residual error when using a low rank approximation for a matrix, or using a sparse approximation for a vector. The paper significantly improves the previous results for residual error estimation in both the memory bound and the choices of the sketching matrix, getting tight bound for this case, and it develops new algorithms for the vector case. All reviewers are supportive of the theoretical contribution of the paper.
On the other hand, one reviewer is concerned that the runtime/memory improvement as part of a larger application might not be significant, which limits the potential scenarios where it is worthwhile to perform the estimation before computing the approximation. In particular, the cost of estimation is within a logarithmic factor of computing the actual factorization in theory, and in practice the difference seems negligible in some experiments done by the reviewer. Other reviewers also note that some proof techniques are straightforward, and that the lower bound for the vector case is for a different and harder problem, so it is unclear whether it is good evidence for the hardness of error estimation.
Why not a higher score
Given two high scores, it can be accepted as a spotlight if there is room. The contribution is definitely more theoretical than immediately practical so the decision here depends on the fit with the goal of the conference.
Why not a lower score
All reviewers are supportive of the theoretical contribution of the paper, with several high scores so the paper definitely merits acceptance.
Accept (poster)