PaperHub
5.7
/10
Rejected3 位审稿人
最低3最高8标准差2.1
6
3
8
3.0
置信度
ICLR 2024

High-Dimensional Geometric Streaming for Nearly Low Rank Data

OpenReviewPDF
提交: 2023-09-23更新: 2024-02-11

摘要

We study streaming algorithms for the outer $(d-k)$-radius estimation of a set of points $a_1, \ldots ,a_n \in \mathbb{R}^d$. The problem asks to compute the minimum over all $k$-dimensional flats $F$ of $\max_i d(a_i, F)$, where $d(u, F)$ denotes the distance of a point $u$ from the flat $F$. This problem has been extensively studied in earlier works (Varadarajan et al., SIAM J. Comput. 2006) over a wide range of values of $d$, $k$ and $d-k$. The earlier algorithms are based on SDP relaxations of the problem and are not applicable in the streaming setting where we do not have space to store all the rows that we see. We give an efficient streaming coreset algorithm that selects $\text{poly}(k, \log n)$ rows and at the end outputs a $\text{poly}(k, \log n)$ approximation to the outer $(d-k)$-radius. The algorithm only uses $d \cdot \text{poly}(k, \log n)$ bits of space and runs in an overall time of $O(\text{nnz}(A) \cdot \log n + \text{poly}(d, \log n))$, where $\text{nnz}(A)$ denotes the number of nonzero entries in the $n \times d$ matrix $A$ with rows given by $a_1, \ldots, a_n \in \mathbb{R}^d$. In a recent work, Woodruff and Yasuda (FOCS 2022), give streaming algorithms for a number of high-dimensional geometric problems such as width estimation, convex hull estimation, volume estimation etc. Their algorithms require $\Omega(d^2)$ bits of space and have an $\Omega(\sqrt{d})$ multiplicative approximation factor even when the rows $a_1,\ldots, a_n$ are “almost” spanned by a $k$ dimensional subspace. We show that when the rows are $a_1,\ldots,a_n$ are “almost” spanned by a $k$ dimensional space, our streaming coreset construction algorithm can be used to obtain algorithms that use only $O(d \cdot \text{poly}(k, \log n))$ bits of space and have a multiplicative error of $O(\text{poly}(k, \log n))$. When $d$ is large and $k$ is much smaller than $d$, our algorithms use a much smaller amount of space while guaranteeing a better approximation. We pay an additive error depending on how close the rows $a_1,\ldots,a_n$ to being spanned by a rank $k$ subspace. As another application of our algorithm, we show that our streaming coreset can also be used to obtain approximations to the $\ell_p$ subspace approximation problem using exponential random variables to embed the $\ell_p$ subspace approximation problem into an instance of the $\ell_{\infty}$ subspace approximation problem.
关键词
CoresetsSubspace ApproximationStreaming Algorithms

评审与讨论

审稿意见
6

This paper studies the outer (d - k)-radius estimation problem. This problem may be viewed as a clustering problem, where the goal is to find a k-flat F, such that the max distance of every data point to F is minimized. This is a generalization of the 1-center clustering, where k = 0. This paper gives a (strong) coreset for the problem with approximation ratio poly(k log n) and size poly(k logn). Here, A subset of points S is an \alpha-coreset, if for every k-flat F, the objective evaluated on S is in between [OPT, \alpha OPT]. This eventually implies a streaming algorithm that can achieve the same (order) of ratio.

Other extensions are also considered, and various similar results are obtained. Specifically, the main bound can be improved a little bit (in terms of the dependent in k and log n), by assuming a bounded rank-k condition number. Another extension is to consider an \ell_p aggregation function, which means the objective is changed to the sum of p-th power of distance of a data point to F. For this setting, a similar streaming bound can be obtained, but with a weaker error bound.

优点

  • The study is well-motivated. In particular, the problem is related to clustering and subspace approximation which are fundamental ML/data analysis tasks, and the streaming setting addresses the computational issues of ML in the big data era.

  • The result can also be applied to improve a recent paper [17] in a certain case, which is a nice application that shows the theoretical relevance of the paper

  • The paper also provides experiments, which indicate that the seemingly complicated steps can actually be implemented and have the potential to be used in practice

缺点

  • The paper is quite technical and is not easy to understand especially for general audience. In addition, too many results are squeezed into the 9 pages. In my point of view, the author could focus on the main result Theorem 1.1, and this itself should already fit the volume of an ICLR paper (considering the 9 pages of the main text).

  • I don't see a related work section. Since your main technique is coreset, it might make sense to mention works related to coreset.

  • In fact, the discussion of the coreset literature is almost completely missing. Since your problem may be viewed as 1-center projective clustering, it's important to compare it with the relevant coreset literature. For example, this paper seems relevant: "New Coresets for Projective Clustering and Applications. Tukan et al. AISTATS 2022". From what I read, they gave O(1)-error coreset, but the size is k^k.

  • It is not discussed if the coreset is tight or can be improved, in terms of the error bound

问题

  • Does JL work here? In particular, can one reduce d to O(log n)? Or maybe subspace JL that reduce to O(k) dimension? I didn't find this discussed/mentioned. This is important as I guess otherwise your approach may not improve WS22? It may be useful to have a brief discussion of this in the paper.

  • It seems Theorem 3.3 and Theorem 3.6 are in different models of streaming algorithms? Also, does the streaming algorithm in Theorem 3.3 work if deletions are allowed? Please clarify in the paper.

  • In both Theorem 3.3 and Theorem 3.6 it is mentioned that the coordiantes are integers. Is this necessary even for the offline algorithm, or it is only a matter of storage model? Please clarify in the paper.

评论
  • Reorganizing the paper: We will think about better ways to reorganize the paper to make it more understandable to the general audience. We think the applications to width estimation, convex hull approximation are important and should be present in the main paper and we will look for ways to make the presentation a bit less dense.

  • Related work: We will add a discussion about the existing coreset literature in the streaming/non-streaming settings for subspace approximation and related problems. We note that ours is the first strong coreset construction, with only poly(k, log n) points, for \ell_{\infty} subspace approximation in the streaming setting and as we show the strong coreset property is important to obtain our applications.

  • Projecting with JL: It is not clear how to obtain strong coreset properties by projecting to lower dimensions using JL transforms in the streaming setting. Usually, projections with JL are used to obtain approximate sensitivities and then a strong coreset is constructed. But this cannot be implemented in the streaming setting with one pass. Kerber and Raghavendra (CCCG 2015) show that when we project to poly(k, log n) dimensions and construct a coreset, we can preserve the costs and extract a solution but we cannot approximate the cost of every subspace and obtain our application such as width estimation, convex hull etc.

  • Streaming Model: All the results in our paper are in the row-arrival model where we see one point after another. Our results do not hold when points can later be deleted.

  • Integer Points: It is not necessary for the points to have integer coordinates. We use bounded integer assumption to only obtain results that are independent of the condition number of the data. However, the algorithms will work even for arbitrary values and the size of the coreset constructed will depend on a specific condition number of the dataset.

评论

A reminder that the discussion period ends tomorrow. Please ask us if you have any questions about the paper and the rebuttal.

审稿意见
3

In this paper, the authors consider the problem of designing a streaming algorithm for the (dk)(d-k)-dimensional radius approximation, i.e., find a kk-dimensional flat that minimizes the maximum distance of any input point to this flat. When k=0k=0, the problem is the well-known minimum enclosing ball problem, i.e., the smallest d-dimensional ball that encloses all input points. They also consider the p\ell_p-sapce approximation problem. Overall, there are two major results that the authors claim:

  1. A single-pass streaming algorithm for (dk)(d-k)-radius estimation that has poly(k,logn)(k,\log n) approximation using poly(k,logn)(k, \log n) space. The algorithm maintains a coreset (a small subset of points) that approximates the (dk)(d-k)-radius in the streaming setting.
  2. The authors then extend this algorithm to the p\ell_p-subspace approximation problem and design a streaming algorithm for this setting. The space requirement and the approximation ratios are poly(k,logn)(k,\log n)

优点

The authors present a rather simple streaming algorithm for the p\ell_p-subspace approximation problem and show its practical value via experiments.

缺点

The authors do not present the state-of-the-art for \ell_\infty-subspace approximation. For instance, when k is 0 or 1, there are O(1)-approximations (see for instance, Chan and Pathak CGTA 2014, Agarwal and Sharathkumar, (SODA 2010, Algorithmica, 2015) and some other followup work. Although for restricted k, they achieve significantly better approximation ratios. Does your algorithm achieve similar ratios when k is small? These algorithms are equally simple: Does your algorithm have a better empirical performance for instance for the MEB problem?

For width approximation, there are no randomized algorithm can achieve better than d1/3d^{1/3}-approximation while using ed1/3e^{d^{1/3}} space, this was shown in Agarwal and Sharathkumar, (SODA 2010, Algorithmica 2015). I think it is important to place your result in the context of this lower bound.

Overall, I had difficulty in understanding the impact and importance of this work and this, I believe, is mainly due to lack of a good comparison with existing work.

问题

Apart from the questions/concerns above, can you compare your work with the results in Kerber and Raghvendra (CCCG 2015): https://research.cs.queensu.ca/cccg2015/CCCG15-papers/16.pdf In particular, can one first apply JL-projection to d= O(poly{k, \log n}) dimensional space and then run a standard streaming algorithm for this space that gives O(d) =O(poly{k,\log n})-approximation in O(poly{k,\log n}) space?

评论

We will recap our main contributions here:

  • A deterministic coreset construction algorithm for \ell_{\infty} subspace approximation with any rank parameter k. We are not aware of any previous works with streaming algorithms that work for any given rank parameter k.
  • A black-box randomized reduction from an p\ell_p subspace approximation coreset to an \ell_{\infty} subspace approximation coreset for any rank k using scaled exponential random variables. This reduction is also new and may have additional applications since any improvement in an \ell_{\infty} coreset construction would directly improve the p\ell_p coreset construction as well.
  • Deterministic Streaming Algorithms for other problems such as convex hull estimation, volume estimation, and width estimation in any direction, for almost low rank data with better guarantees than previous work of Woodruff and Yasuda (FOCS 2022)

Our main motivation was to obtain streaming algorithms that work for values of k that are not just small constants as well. This particular problem seems to have not been studied in the streaming setting. We will add references to the line of work studying streaming algorithms for k=0,1 and add a discussion to place our results in context.

  • For k=0 and 1, we do not expect our algorithms to be better than the algorithms that are specifically constructed for those values of k. For MEB (k=0), our algorithm is essentially exactly the simple 2 approximation algorithm: pick an arbitrary point (in our case we pick the first) and essentially compute the distance of the farthest point from the first point. So, we do not expect to do better than the 2-approximation using our algorithm for k=0.

  • The width estimation problem, as studied in Agarwal and Sharathkumar (SODA 2014), corresponds to the case of k = d-1 for the \ell_{\infty} subspace approximation problem. The hardness result of Agarwal and Sharathkumar is, as you stated, for approximating to a better factor than d^{1/3} in small space. Our streaming coreset construction gives only multiplicative approximation up to a factor of d^{3/2} in a bounded integer points case and a d^{1/2} factor for well-conditioned instances in this setting (k=d-1) and hence makes a poly(d) space streaming algorithm possible.

Comparison with Project-and-Construct Coreset:

For \ell_{\infty} subspace approximation, ours is a strong coreset construction which can be used to approximate the subspace approximation cost of any k-dimensional subspace. As we show this strong coreset property has applications to width estimation in arbitrary directions, volume estimation etc.

The random projection method in Kerber and Raghvendra (CCCG 2015) only preserves the optimal projective clustering cost up to 1+eps approximation but cannot be used to approximate the projection cost of arbitrary subspaces. For p\ell_p subspace approximation though, our coreset construction can also not approximate costs for arbitrary inputs.

评论

A reminder that the discussion period ends tomorrow. Please ask us if you have any questions about our rebuttal.

审稿意见
8

The contributions of the paper are several-fold and it is indeed a bit hard to describe them succinctly... I will not go over all of them in this summary but give a basic flavor of what the paper does.

The paper considers several (somewhat) related problems in high-dimensional geometry and provides streaming algorithms for solving these problems. For some problems, like the "outer (dk)(d-k)-radius estimation problem", this paper provides the first streaming algorithms which use space that is poly(k,d,log(n))(k,d, \log (n)) with a distortion factor also being poly(k,d,log(n))(k,d, \log (n)). Here, we are given nn points in Rd\mathbb{R}^d, along with a parameter k<dk < d. Their algorithm works by constructing coresets using online ridge leverage scores.

For some other problems studied in the paper, prior work by Woodruff and Yasuda [FOCS 2022] already provided the first streaming algorithms using space poly(d,log(n))(d, \log (n)) with a distortion factor also being poly(d,log(n))(d, \log (n)). Woodruff and Yasuda also construct coresets but by using online leverage scores. In this submission, they show that assuming that the data points all approximately come from a low-rank subspace (which seems reasonable), their coreset algorithm can be used to provide streaming algorithms which are more efficient than the ones proposed by Woodruff and Yasuda.

优点

Originality: First paper to provide streaming algorithms for the outer (dk)(d-k)-radius estimation problem. Their coreset algorithm is new and as they show has several applications. Would be of future interest to researchers working on other related problems.

Quality and Clarity: Paper is very well-written, easy to understand. I have not checked all the technical details in the proofs but they are very well-explained and it is unlikely that there are any major issues.

Significance: The paper's key contribution is providing a coreset construction algorithm, which they prove works for the problems which they consider, but also can be used independently (although without guarantees) for training other ML models.

缺点

Not any that I can see right now.

问题

None for now.

AC 元评审

The paper proposes a new coreset algorithms for 1-center projective clustering with a wide range of parameters that can be applied to the streaming setting. In this problem, we are given a dataset in d dimensions and we would like to find a small subset (a coreset) such that for any given k-dimensional subspace, the maximum distance between the subspace and the subset is approximately the maximum distance between the subspace and the original data. The new algorithm also works in the streaming setting where we have low memory and can only scan through the dataset once. The paper also extends the result to the case where the clustering cost is an l_p norm of the vector whose entries are the distances as opposed to l_infinity (the maximum).

The reviewers all appreciate the importance of the particular clustering problem considered in the paper. They also note that the algorithm can be implemented and give reasonable results in the experiments.

On the downside, two reviewers note the lack of comparison with previous works even for special cases to show the significance of the improved guarantee. For example, when k=0 or 1, which are two common settings, the existing algorithms have much stronger guarantees. One reviewer also questions whether the bounds are close to tight, which is related to the previous point because the paper does not connect with previous lower bounds for the same problem.

为何不给更高分

The lack of comparison with the existing literature is concerning as it seems there are a lot of previous works at least for the important special cases and the new work seems to give worse results for a more general setting and not a strict generalization. The positive reviewer suggests that the new algorithm is the first for the problem and can bring additional attention to the task but it seems that there is already significant related literature that was not thoroughly covered by the paper.

为何不给更低分

N/A

最终决定

Reject