Coresets for Clustering with Noisy Data
We study the problem of data reduction for clustering in the presence of stochastic noise, propose a measure that better captures the quality of a coreset for this setting and show its effectiveness theoretically and empirically.
Abstract
Reviews and Discussion
The paper studies coresets for k-means clustering in the noisy setting. Given a dataset and a set of centers, the cost is the sum, over all data points, of the squared distance to the nearest center. The k-means clustering problem is to find the center set that minimizes this cost for a given dataset. When the dataset is huge, a common technique to reduce storage and computation is to extract a small weighted subset (a coreset) and minimize the k-means cost over the subset instead. Previous results showed that one can construct such a subset that is small and whose k-means cost is close to that of the full dataset for every candidate center set. In reality, data is often noisy, so we would like to construct a coreset for the underlying dataset from its noisy observation. The authors show that the standard notion of a coreset for k-means is too strong for noisy datasets and define a new notion for the noisy setting. They then show that, under some mild assumptions on the noise, a good coreset for the noisy dataset is also a good coreset for the underlying dataset. The main idea is to first show that the noisy dataset itself is a good coreset of the underlying dataset, even though its size is the same; then, using the composition property of coresets, the authors prove the main theorem. The authors also provide some experimental results.
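For readers unfamiliar with the setup, here is a minimal sketch (my own illustration, not from the paper) of the quantity a coreset must approximate; the uniform-sampling "coreset" used here is only for exposition and is much weaker than the constructions the paper builds on.

```python
import numpy as np

def kmeans_cost(points, centers, weights=None):
    """Weighted k-means cost: sum_i w_i * min_c ||x_i - c||^2."""
    if weights is None:
        weights = np.ones(len(points))
    # Squared distance from every point to every center; take the min per point.
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(weights @ d2.min(axis=1))

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 5))                    # full dataset
idx = rng.choice(len(X), size=500, replace=False)  # naive uniform-sampling "coreset"
S, w = X[idx], np.full(500, len(X) / 500)          # reweight so total mass matches

C = rng.normal(size=(3, 5))                        # an arbitrary center set
full, core = kmeans_cost(X, C), kmeans_cost(S, C, w)
print("relative error for this center set:", abs(core - full) / full)
```

A traditional coreset must make this relative error small simultaneously for every center set; the paper's point is that with noisy data such a uniform guarantee becomes too strong, and it suffices to control it on near-optimal center sets.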
Strengths
- The problem is well-motivated. Most of the previous results did not consider the noisy version of the problem, while real-world data is often noisy. Hence, I believe it is a good starting point to investigate this line of work.
- The paper is well-written. Readers of all levels of expertise should be able to understand this paper.
Weaknesses
- Most techniques are straightforward calculations. I am not sure if there are any fundamentally new techniques introduced in this paper.
Questions
N/A
Thanks for appreciating the motivation of our problem and the writing. As you noted, one of the main contributions of our paper is to initiate the study of coresets in the noisy setting.
We think that establishing the separation between the two error measures (the standard coreset error and the approximation-ratio error), both theoretically and empirically, is also interesting.
Moreover, in formulating responses to other reviewers we have presented additional features/generalizations of our results that may be of interest.
- In responses to Reviewers gA35 and QbwN, we provide an example to illustrate that, without assuming well-separated clusters, the two error measures may coincide, showing the necessity of the assumption of well-separated clusters.
- Addressing the concern of Reviewer QbwN about the assumption of independent noise across dimensions, we illustrate how our techniques can be extended to handle non-independent noise across attributes of data points.
- Based on the suggestion of Reviewer QbwN, we evaluate our results on the Census1990 dataset whose dimension is 68. We use the same setup as in Section 4 and the new results can be found in the following file in the supplementary material (https://openreview.net/attachment?id=D96juYQ2NW&name=supplementary_material). These empirical findings validate our theoretical results on datasets with a large number of attributes.
Thanks for the response. I will take it into consideration during the AC discussion.
This paper evaluates, on the original data, the performance of the optimal solution computed from noisy data. The analysis is based on the coreset technique, but the traditional measure of coreset quality is too strong when the data is noisy. To address this issue, the authors propose the AR-coreset, which restricts attention to a local sublevel region of the solution space and considers the cost ratio. This new measure allows the authors to obtain refined estimates of the approximation quality. The authors demonstrate that the noisy dataset is a coreset of the original dataset with an error term that grows with n and d, and an AR-coreset with an error term that grows with k and d, meaning that the optimal solution of the noisy dataset is an approximate solution of the original dataset with the corresponding approximation factors.
Strengths
1. The motivation is clear. It is instructive to consider the local sub-optimal region of the solution space in the AR-coreset.
2. Utilizing the coreset technique to analyze the approximation of the solution is an interesting approach.
3. The lower bound of the approximation is also discussed.
Weaknesses
- The failure of 'Err' seems natural since it considers the worst case over the entire solution space.
- This paper does not provide a detailed comparison of the proposed method with other existing analysis methods for clustering with noisy data.
- This paper considers independent noise across dimensions, but real noise might be correlated. The experimental part is too simple: it considers only two datasets, and the dimensions are low (6 and 10).
- The noise model and the assumptions make the result relatively narrow.
Questions
1. The authors say ‘Intuitively, how good a center set we can obtain from the noisy data is affected by the number of sign changes.’ Please provide more thorough interpretations.
2. Besides determining the quality of a coreset in the presence of noise, are there other potential applications of the proposed AR-coreset?
3. Can we conclude that the (near) optimal solution(s) are robust to noise for other problems?
4. In the definition of AR-coreset, why should we consider the ratio of the cost for a given center set to the minimum cost? What would happen if we only considered the denominator (the minimum cost)?
Thanks for appreciating our motivation and our results on the lower bound. We address your concerns and questions below.
- The failure of 'Err' seems natural since it considers the worst case over the entire solution space.
Thanks for raising this point. Even in the noise-free setting, Err considers the worst case over the entire solution space, but, as the example below shows, it may not fail and can still capture the quality of the optimal center set of a coreset. The non-trivial contribution is in identifying a suitable measure and presenting a quantitative separation between the two error measures in the presence of noise.
Consider the worst-case instance constructed in [Cohen-Addad et al., 2022], where the dataset consists of unit basis vectors in a high-dimensional space. Let S be a (weighted) subset of size below the known coreset lower bound. The optimal center set of S must lie in the subspace spanned by S, and hence cannot be a good approximate center set of the full dataset. This geometric observation implies that S is neither a good coreset nor a good approximation-ratio coreset. Consequently, the tight sizes of both a coreset and an approximation-ratio coreset of such a dataset coincide, which implies that the two error measures behave the same on this dataset.
We will add this example in the final version.
- This paper does not provide a detailed comparison of the proposed method with other existing analysis methods for clustering with noisy data.
We have included a discussion of the literature on clustering with noisy data in Appendix A, along with a comparison of their analysis methods with our work. If you have any preferences regarding additional content or the placement of this section in the final version, please feel free to let us know -- we would be happy to do so.
- This paper considers independent noise across dimensions, but real noise might be correlated.
Thanks. We acknowledge that real noise may exhibit correlations across dimensions. As mentioned in the introduction, the attributes of the data may demonstrate weak correlations or interactions. For this type of real data, it is common to rely on the assumption of independent noise across dimensions, as observed in studies by [Zhu & Wu, 2004; Freitas, 2001; Langley et al., 1992].
A positive aspect of our techniques is that they can be extended to handle non-independent noise across dimensions. Consider a scenario where each noise vector has a general covariance matrix (which reduces to a diagonal covariance under noise model II). Our proof of Theorem 3.1 relies on certain concentration properties of the noise-dependent terms in the cost. By an argument similar to the proof of Claim 3.6, one can show that these terms concentrate around quantities determined by the trace of the covariance matrix. Consequently, a coreset of the noisy dataset is still a coreset and an approximation-ratio coreset of the underlying dataset; the only difference with Theorem 3.1 is that the original variance term is replaced by the trace of the covariance matrix.
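To illustrate the concentration numerically, here is a minimal sketch (our own toy illustration, not one of the experiments in Section 4): for zero-mean noise with covariance Sigma, the k-means cost of the noisy data at a fixed center set concentrates around the clean cost plus n times the trace of Sigma.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20000, 4

# Clean data: two well-separated clusters.
X = np.concatenate([rng.normal(-5, 0.1, (n // 2, d)),
                    rng.normal(+5, 0.1, (n // 2, d))])

# Correlated zero-mean noise with a non-diagonal covariance Sigma.
A = rng.normal(size=(d, d))
Sigma = A @ A.T / d
Y = X + rng.multivariate_normal(np.zeros(d), Sigma, size=n)

C = np.array([[-5.0] * d, [5.0] * d])   # a fixed (near-optimal) center set

def cost(P, C):
    d2 = ((P[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

print("cost(Y, C)                    :", cost(Y, C))
print("cost(X, C) + n * trace(Sigma) :", cost(X, C) + n * np.trace(Sigma))
```

Up to sampling fluctuations, the two printed quantities agree, matching the claim that the only change relative to the independent-noise case is replacing the per-coordinate variance term by the trace of the covariance matrix.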
We will add a remark on how our techniques can be extended to the non-independent noise case in the final version.
Thanks for your response, and I will consider it in the next discussion stage.
- The authors say ‘Intuitively, how good a center set we can obtain from the noisy data is affected by the number of sign changes.’ Please provide more thorough interpretations.
Thanks for your question. Below we explain how the quality of the optimal center set of the observed dataset may be affected by the number of sign changes.
Consider the case of 1-Means. Let X be an underlying dataset, and let Y1 and Y2 be observed datasets drawn from X under noise model II with a smaller and a larger noise parameter, respectively. Recall that the optimal centers of X, Y1, and Y2 are their respective mean points. For any two centers lying on the line segment between the optimal center of X and the optimal center of an observed dataset, the signs of the corresponding cost differences must be opposite: the center that is closer to the optimal center of X has a smaller cost for X but a larger cost for the observed dataset when compared to the other center. Hence, the number of sign changes between the optimal center of X and the optimal center of an observed dataset is proportional to the distance between them. Since Y1 has the smaller noise parameter, its optimal center is closer to that of X, so the number of sign changes for Y1 is smaller than for Y2. Meanwhile, the quality (i.e., the cost on X) of the optimal center of Y1 is better than that of Y2.
Hence, as the number of sign changes increases, the quality of the optimal center for the observed dataset deteriorates. We will add this example in the final version.
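As a quick numerical companion to this intuition (our own sketch, not from the paper): in 1-Means the optimal center is the mean, so a larger noise level moves the optimal center of the observed data further from the optimal center of the underlying data and degrades its cost on the underlying data.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 3
X = rng.normal(size=(n, d))              # underlying dataset
c_X = X.mean(axis=0)                     # optimal 1-means center of X

def cost(P, c):
    return ((P - c) ** 2).sum()

for sigma in [0.1, 1.0, 5.0]:            # increasing noise level
    Y = X + rng.normal(scale=sigma, size=(n, d))   # observed dataset
    c_Y = Y.mean(axis=0)                 # optimal 1-means center of Y
    shift = np.linalg.norm(c_Y - c_X)    # how far the noisy optimum moved
    ratio = cost(X, c_Y) / cost(X, c_X)  # quality of c_Y evaluated on X
    print(f"sigma={sigma:>4}: center shift={shift:.3f}, cost ratio on X={ratio:.4f}")
```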
- Besides determining the quality of a coreset in the presence of noise, are there other potential applications of the proposed AR-coreset?
Thanks for asking this. Yes, the notion of AR-coreset may have potential applications in (noise-free cases of) machine learning tasks where one only wants to preserve near-optimal solutions, e.g., in regression. The notion of AR-coreset may be useful to further reduce the size of coresets. This is an interesting future direction and we will mention it in Section 5.
- Can we conclude that the (near) optimal solution(s) are robust to noise for other problems?
This is a great question. We have listed it as a future direction in Section 5. At present, the existing results seem insufficient to conclude the robustness of (near) optimal solutions to noise for other problems -- the answer is likely to be problem dependent.
- In the definition of AR-coreset, why should we consider the ratio of the cost for a given center set to the minimum cost? What would happen if we only considered the denominator (the minimum cost)?
Considering only the denominator corresponds to a specific instance of the definition of AR-coreset (Definition 2.2) with the approximation parameter fixed to a particular value, and hence using it would only provide an AR-coreset guarantee for that single value. In contrast, considering the ratio of the cost of a given center set to the minimum cost covers general values of the parameter and enables a quantitative analysis of the effect of noise on center sets as we vary it.
- The experimental part is too simple: it considers only two datasets, and the dimensions are low (6 and 10).
Based on your comment, we evaluated our results on the Census1990 dataset, whose dimension is 68. We use the same setup as in Section 4, and the new results can be found in the following file in the supplementary material (https://openreview.net/attachment?id=D96juYQ2NW&name=supplementary_material). These empirical findings validate our theoretical results on datasets with a large number of attributes, including 1) a numerical separation between the two error measures; 2) both error measures initially decrease and then stabilize as the coreset size increases; 3) the variance of the noise (or noise level) is the main parameter determining both measures. Specifically, under noise model II with Gaussian noise, the measured error decreases from 0.282 to 0.270 as the coreset size increases from 500 to 2000, and then remains around 0.270 as the size continues to increase to 5000. We will incorporate these results in the final version.
- The noise model and the assumptions make the result relatively narrow.
We have addressed your concern about the noise model in our response 3 above.
Regarding the assumptions, in addition to the discussion in Section 3 explaining why both balancedness and well-separated clusters are reasonable and standard assumptions in the clustering literature, we provide an example below to illustrate that, without assuming well-separated clusters, the two error measures may coincide.
Consider a 3-Means problem on the real line, i.e., d = 1 and k = 3. Let the dataset consist of an equal number of points at each of the four locations -1.01, -0.99, 0.99 and 1.01. A simple calculation shows that the optimal center set is either {-1.01, -0.99, 1} or {-1, 0.99, 1.01}, and the optimal cost is the same in both cases.
First note that, because two of the clusters are close to each other in both optimal center sets, the dataset does not satisfy the well-separateness assumption. (The distance between the nearby clusters is 0.02, while our assumption requires a much larger separation.)
Next, let the observed dataset be drawn from this dataset under noise model II with a large noise variance. Since the noise is large relative to the separation of the four locations, the optimal partition of the observed dataset is likely to consist of three groups whose mean points are roughly -2, 0, and 2, so its optimal center set is likely to be approximately {-2, 0, 2}, whose cost on the underlying dataset is far larger than the optimal cost. By Theorem E.1, the observed dataset is still a coreset of the underlying dataset with the stated guarantee, and consequently the two error measures essentially coincide on this instance.
We will add this in the final version.
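A minimal sketch verifying the noise-free part of this example (our own check, assuming an equal number m of points per location): the optimal 3-means clustering merges one of the two nearby pairs, giving a tiny optimal cost, while a center set like {-2, 0, 2} has a far larger cost on the underlying data.

```python
from itertools import combinations
import numpy as np

m = 10
X = np.repeat([-1.01, -0.99, 0.99, 1.01], m)     # underlying 1-D dataset (sorted)

def cost(points, centers):
    centers = np.asarray(centers, dtype=float)
    return np.min((points[:, None] - centers[None, :]) ** 2, axis=1).sum()

# Brute-force the best split into three contiguous groups (optimal in 1-D).
best = min(
    (cost(X, [np.mean(X[:i]), np.mean(X[i:j]), np.mean(X[j:])]), i, j)
    for i, j in combinations(range(1, len(X)), 2)
)
print("optimal 3-means cost:", best[0])                   # ~= 2 * m * 0.01**2
print("cost of {-1, 0.99, 1.01}:", cost(X, [-1, 0.99, 1.01]))
print("cost of {-2, 0, 2} on X:", cost(X, [-2, 0, 2]))    # far larger
```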
The paper studies the problem of computing a coreset for a dataset from its noisy perturbation. The paper considers the most natural approach: compute a coreset for the perturbed version and use it directly as a coreset for the original dataset. The paper shows the following results:
- The coreset from the noisy dataset can be very bad for the original dataset, in terms of the relative error Err commonly used to measure a coreset's approximation quality.
- The authors notice that the traditional measure (the relative error mentioned in the first point) is too strong because it is a supremum over all possible center sets, while in practice people are more interested in how well the coreset can approximate costs for "not so bad" center sets. This motivates the authors to design a new relative error (which they call "approximate error ratio") that takes the supremum only over center sets that approximate the optimal solution.
- The authors show that this new definition of relative error can give a tighter approximation-ratio estimate for a center set computed on the coreset (obtained from the noisy dataset). In particular, a coreset with large Err can have a much smaller approximation-ratio error, which means a good approximate solution on the coreset will also have a small cost on the original dataset, whereas using Err for the estimate yields a much looser bound.
Strengths
I think the definition of the new measure is quite neat. The authors show that there is a strong separation between the traditional relative error and their new measure. I like this separation result in particular.
Weaknesses
Although the authors claim the two assumptions (i.e., balancedness and well-separation) are "mild" and only for "overcoming technical difficulty", I feel they are quite strong. Also, one possibility is that the separation between the two error measures is actually a result of these two assumptions. It would be great if the authors could show a quantitative analysis of the dependence between the separation and the two assumptions. For example, would it be possible that when the data become less balanced / well-separated, the two measures converge to each other? (My gut feeling is that the degree of well-separation could likely have a non-trivial effect on the gap between the two measures.) If that's the case, I feel this would somewhat strengthen the paper's result, since the new measure can be viewed as a unifying measure that's tighter in extreme parameter ranges.
Questions
There is no space between "-Clustering" and the text following it. I guess the authors should add an \xspace after their macro for "-Clustering".
Thanks for appreciating the novelty of our AR-notion and our results. We address your specific questions below.
The assumptions are quite strong ... one possibility is that the separation between the two error measures is actually a result of these two assumptions. It would be great if the authors could show a quantitative analysis of the dependence between the separation and the two assumptions...
Thank you for suggesting to investigate this. Below we give an example illustrating that, if one does not assume that the clusters are well-separated, the two error measures may coincide.
Consider a 3-Means problem on the real line, i.e., d = 1 and k = 3. Let the dataset consist of an equal number of points at each of the four locations -1.01, -0.99, 0.99 and 1.01. A simple calculation shows that the optimal center set is either {-1.01, -0.99, 1} or {-1, 0.99, 1.01}, and the optimal cost is the same in both cases.
First note that, because two of the clusters are close to each other in both optimal center sets, the dataset does not satisfy the well-separateness assumption. (The distance between the nearby clusters is 0.02, while our assumption requires a much larger separation.)
Next, let the observed dataset be drawn from this dataset under noise model II with a large noise variance. Since the noise is large relative to the separation of the four locations, the optimal partition of the observed dataset is likely to consist of three groups whose mean points are roughly -2, 0, and 2, so its optimal center set is likely to be approximately {-2, 0, 2}, whose cost on the underlying dataset is far larger than the optimal cost. By Theorem E.1, the observed dataset is still a coreset of the underlying dataset with the stated guarantee, and consequently the two error measures essentially coincide on this instance.
We will explain this formally in the final version.
There is no space between "-Clustering" and the text following it. I guess the authors should add an \xspace after their macro for "-Clustering".
Thanks, we will fix it.
The paper shows how to construct coresets for clustering with noisy data.
Strengths
- The paper introduces new measures that are used to construct coresets that guarantee eps-approximations.
- The paper also presents a lower bound for 1-mean with noisy data.
- The theory is backed by a good number of empirical evaluations on real datasets.
Weaknesses
- Some intuition about, and the relations between, the claims in Section 2 would be helpful.
- A formal algorithm, even in the appendix, will increase its impact.
Questions
- Please give an example of how the number of sign changes affects the goodness/quality of centers.
- The coreset size being inversely proportional to n and d in the coreset error measure is counter-intuitive. Comments?
The coreset size being inversely proportional to n and d in the coreset error measure is counter-intuitive. Comments?
Thanks for your question. Below we explain why this is not counter-intuitive.
First, the coreset size suggested by Lemma 3.4 contains the optimal cost of the underlying dataset in its noise-dependent factor. The optimal cost typically increases linearly with the number of points (as in the example in Appendix B), so this factor typically does not blow up with the size of the dataset.
Second, by Theorem 3.3, the error measure has a lower bound that increases with the noise level. Thus, Lemma 3.4 implies that we cannot achieve an ε-coreset of the underlying dataset for ε below this noise-induced floor. This is different from the noise-free setting, where we can achieve an ε-coreset for any ε. Since a coreset algorithm usually constructs a coreset of size proportional to an inverse power of ε, an ε-coreset of the underlying dataset is typically of size inversely related to this noise floor.
For instance, applying the algorithm of [Cohen-Addad et al., 2022], one can construct an ε-coreset of the underlying dataset whose size scales with an inverse power of ε, for any ε above the noise floor. We will explain this in the final version.
Thanks for appreciating our new measure and theoretical results. We address your questions below.
Comment: Some intuition and relation between the claims in Section 2 will be helpful.
Thanks for the comment. The intuition behind Claim 2.3 is as follows: by definition, an approximately optimal center set of a coreset must also be an approximately optimal center set of the dataset the coreset summarizes, with a slightly worse approximation factor. This allows us to find a near-optimal center set of the original dataset from the coreset.
This transitivity implies Claim 2.4, because we can obtain a near-optimal center set from a coreset (or an approximation-ratio coreset) of another coreset of the original dataset. A bit more formally, Claim 2.3 ensures that an approximately optimal center set for the inner coreset is approximately optimal for the outer coreset; since the outer coreset is a coreset of the original dataset, the same center set is approximately optimal for the original dataset, with the approximation factors composing.
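For concreteness, here is the standard transfer calculation underlying this kind of composition, written with the generic ε-coreset definition (the exact constants in Claims 2.3-2.4 may differ). If S satisfies cost(S, C) ∈ [(1-ε)cost(X, C), (1+ε)cost(X, C)] for every center set C, and C' is an α-approximate solution on S, then

```latex
\mathrm{cost}(X, C')
  \;\le\; \frac{1}{1-\varepsilon}\,\mathrm{cost}(S, C')
  \;\le\; \frac{\alpha}{1-\varepsilon}\,\min_{C}\mathrm{cost}(S, C)
  \;\le\; \frac{\alpha}{1-\varepsilon}\,\mathrm{cost}(S, C^{*}_{X})
  \;\le\; \frac{\alpha(1+\varepsilon)}{1-\varepsilon}\,\mathrm{cost}(X, C^{*}_{X}),
```

where C*_X denotes an optimal center set of X. Composing two such steps multiplies the factors, which is the shape of the argument behind Claim 2.4.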
We will explain this in Section 2 in the final version.
Comment: A formal algorithm, even in the appendix, will increase its impact.
It seems like there is some confusion. We do not propose a new coreset algorithm; instead, we show how any coreset algorithm can be used to construct a coreset in the presence of noise. In particular, Lemma 3.4 asserts that, whenever the target error ε is above the noise-induced floor, one can apply a given coreset algorithm to construct an ε-coreset of the underlying dataset, given the noise parameter.
Specifically, when provided with the noise parameter, the noisy dataset, and the target error, one can simply return the coreset obtained by running the given coreset algorithm on the noisy dataset with suitably adjusted parameters.
The lemma also implies that when the target error is below the noise-induced floor, it is not possible to construct an ε-coreset of the underlying dataset.
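To make the pipeline concrete, here is a minimal wrapper sketch; the names coreset_alg and eps_floor are placeholders, and the parameter adjustment below is only illustrative, not the exact setting of Lemma 3.4.

```python
def build_noisy_coreset(Y, k, eps, coreset_alg, eps_floor):
    """Construct a coreset of the underlying dataset from its noisy observation Y.

    coreset_alg(data, k, eps) -> (points, weights) can be any off-the-shelf
    coreset construction.  eps_floor stands for the noise-dependent lower
    bound on the achievable error (its precise form is given by Theorem 3.3 /
    Lemma 3.4 and is not reproduced here).
    """
    if eps <= eps_floor:
        raise ValueError("target error below the noise floor; no such coreset exists")
    # Placeholder budget split -- the exact parameter setting is specified in Lemma 3.4.
    return coreset_alg(Y, k, eps - eps_floor)
```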
We will clarify this in the final version.
Please give an example of how the number of sign changes affects the goodness/quality of centers.
Thanks for bringing this up. Below we explain how the quality of the optimal center set of the observed dataset may be affected by the number of sign changes.
Consider the case of 1-Means. Let X be an underlying dataset, and let Y1 and Y2 be observed datasets drawn from X under noise model II with a smaller and a larger noise parameter, respectively. Recall that the optimal centers of X, Y1, and Y2 are their respective mean points. For any two centers lying on the line segment between the optimal center of X and the optimal center of an observed dataset, the signs of the corresponding cost differences must be opposite: the center that is closer to the optimal center of X has a smaller cost for X but a larger cost for the observed dataset when compared to the other center. Hence, the number of sign changes between the optimal center of X and the optimal center of an observed dataset is proportional to the distance between them. Since Y1 has the smaller noise parameter, its optimal center is closer to that of X, so the number of sign changes for Y1 is smaller than for Y2. Meanwhile, the quality (i.e., the cost on X) of the optimal center of Y1 is better than that of Y2.
Hence, as the number of sign changes increases, the quality of the optimal center for the observed dataset deteriorates. We will add this example in the final version.
The paper develops a new notion of error that focuses only on the approximation factor of nearly optimal solutions, instead of the error in strong coresets, where one cares about the approximation factor for all solutions. It shows that in the case of noisy datasets the traditional error for strong coresets is too strict, leading to huge and pessimistic approximation factors, while the new notion remains meaningful. The paper shows that for datasets with clusters that are well-separated and balanced, any coreset for the noisy dataset is still a "good" coreset for the original dataset, where the goodness depends on the amount of noise.
On the other hand, the main result on the strength of the new notion, which tightens the guarantee relative to strong coresets, relies on two assumptions, well-separation and balance. These assumptions are strong and might not hold for real datasets. Additionally, one reviewer notes that no new coreset algorithm is given for the noisy setting, making it less clear whether new algorithms could do better than applying an existing coreset construction. One reviewer also notes that the noise model is somewhat stylized and restrictive (independent noise across coordinates with bounded moments).
Why not a higher score
The results are good to know but seem to be somewhat restrictive. They only apply to well-separated and balanced clusters. Furthermore, the results are generic for all coreset constructions; they do not seem to rule out specific algorithms or other weaker notions such as weak coresets.
Why not a lower score
N/A
Reject