PaperHub
Rating: 5.5/10 · Rejected · 4 reviewers
Scores: 5, 5, 6, 6 (min 5, max 6, std 0.5)
Average confidence: 3.8
ICLR 2024

Coresets for Clustering with Noisy Data

OpenReview · PDF
Submitted: 2023-09-23 · Updated: 2024-02-11
TL;DR

We study the problem of data reduction for clustering in the presence of stochastic noise, propose a measure that better captures the quality of a coreset for this setting and show its effectiveness theoretically and empirically.

Abstract

We study the problem of data reduction for clustering when the input dataset $\widehat{P}$ is a noisy version of the true dataset $P$. Motivation for this problem derives from settings where data is obtained from inherently noisy measurements or noise is added to data for privacy or robustness reasons. In the noise-free setting, coresets have been proposed as a solution to this data reduction problem -- a coreset is a subset $S$ of $P$ that comes with a guarantee that the maximum difference, over all center sets, in cost of the center set for $S$ versus that of $P$ is small. We find that this well-studied measure which determines the quality of a coreset is too strong when the data is noisy because the change in the cost of the optimal center set in the case $S=\widehat{P}$ when compared to that of $P$ can be much smaller than other center sets. To bypass this, we consider a modification of this measure by 1) restricting only to approximately optimal center sets and 2) considering the *ratio* of the cost of $S$ for a given center set to the minimum cost of $S$ over all approximately optimal center sets. This new measure allows us to get refined estimates on the quality of the optimal center set of a coreset as a function of the noise level. Our results apply to a wide class of noise models satisfying certain bounded-moment conditions that include Gaussian and Laplace distributions. Our results are not algorithm-dependent and can be used to derive estimates on the quality of a coreset produced by any algorithm in the noisy setting. Empirically, we present results on the performance of coresets obtained from noisy versions of real-world datasets, verifying our theoretical findings and implying that the variance of noise is the main characterization of the coreset performances.
Keywords
clustering, noise, coreset

Reviews and Discussion

Review (Rating: 5)

The paper studies the problem of coresets for $k$-means clustering in the noisy setting. Given a dataset $P \subset \mathbb{R}^d$ and a set $C$ of $k$ centers, one can define the cost to be $\sum_{x\in P} \min_{c\in C}\| x-c \|^2$. The $k$-means clustering problem is to find the set $C$ that minimizes this cost for a given dataset $P$. However, when $P$ is huge, one common technique to reduce storage and computation is to extract a smaller subset $S$ of $P$ and find the set $C$ that minimizes the $k$-means cost for $S$. Several previous results showed that one can construct such a subset $S$ so that the size of $S$ is small and the difference between the $k$-means costs for $P$ and $S$ is also small. In reality, the data is often noisy, so we would like to construct a coreset from the noisy dataset $\widehat{P}$. The authors show that the standard notion of a coreset for $k$-means is too strong for noisy datasets and define a new notion for the noisy setting. They then show that if $S$ is a good coreset for $\widehat{P}$, under some mild assumptions on the noise, then $S$ is also a good coreset for $P$. The main idea is to first show that $\widehat{P}$ itself is a good coreset of $P$ even though it has the same size. Then, using the composition property, the authors prove the main theorem. The authors also provide some experimental results.
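
To make the quantities in this summary concrete, here is a minimal NumPy sketch (ours, not from the paper) of the $k$-means cost and of the relative coreset error for a single candidate center set; an $\varepsilon$-coreset must keep this error below $\varepsilon$ simultaneously for all center sets.

```python
import numpy as np

def kmeans_cost(P, C, w=None):
    """Sum over points of the squared distance to the nearest center.
    P: (n, d) points, C: (k, d) centers, w: optional per-point weights."""
    d2 = ((P[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)  # (n, k) squared distances
    nearest = d2.min(axis=1)
    return float(nearest.sum()) if w is None else float((w * nearest).sum())

def coreset_error(P, S, w_S, C):
    """Relative error |cost(S, C) - cost(P, C)| / cost(P, C) for one center set C.
    An eps-coreset keeps this below eps for every center set C."""
    cP = kmeans_cost(P, C)
    return abs(kmeans_cost(S, C, w_S) - cP) / cP

# Toy usage: a uniform sample of P with weights n/|S| as a (heuristic) coreset.
rng = np.random.default_rng(0)
P = rng.normal(size=(1000, 2))
idx = rng.choice(len(P), size=100, replace=False)
S, w_S = P[idx], np.full(100, len(P) / 100)
C = rng.normal(size=(3, 2))  # an arbitrary set of k = 3 centers
print(coreset_error(P, S, w_S, C))
```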

Strengths

  • The problem is well-motivated. Most of the previous results did not consider the noisy version of the problem, while real-world data is often noisy. Hence, I believe it is a good starting point to investigate this line of work.

  • The paper is well-written. Readers of all levels of expertise should be able to understand this paper.

Weaknesses

  • Most techniques are straightforward calculations. I am not sure if there are any fundamentally new techniques introduced in this paper.

Questions

na

Comment

Thanks for appreciating the motivation of our problem and the writing. As you noted, one of the main contributions of our paper is to initiate the study of coresets in the noisy setting.

We think that establishing the separation between the two error measures $\mathrm{Err}$ and $\mathrm{Err}_{AR}$, both theoretically and empirically, is also interesting.

Moreover, in formulating responses to other reviewers we have presented additional features/generalizations of our results that may be of interest.

  • In responses to Reviewers gA35 and QbwN, we provide an example to illustrate that, without assuming well-separated clusters, the two error measures $\mathrm{Err}$ and $\mathrm{Err}_{AR}$ may be the same, showing the necessity of the well-separated clusters assumption.
  • Addressing the concern of Reviewer QbwN about the assumption of independent noise across dimensions, we illustrate how our techniques can be extended to handle non-independent noise across attributes of data points.
  • Based on the suggestion of Reviewer QbwN, we evaluate our results on the Census1990 dataset whose dimension is 68. We use the same setup as in Section 4 and the new results can be found in the following file in the supplementary material (https://openreview.net/attachment?id=D96juYQ2NW&name=supplementary_material). These empirical findings validate our theoretical results on datasets with a large number of attributes.
Comment

Thanks for the response. I will take it into consideration during the AC discussion.

Review (Rating: 5)

This paper evaluates the performance, on the original data $P$, of the optimal solution $\hat{C}$ computed from noisy data $\hat{P}$. The analysis is based on the coreset technique, but the traditional measure of coreset quality is too strong when the data is noisy. To address this issue, the authors propose the AR-coreset, which restricts attention to a local sublevel region and considers the cost ratio. This new measure allows the authors to obtain refined estimates for $\hat{P}$. The authors demonstrate that $\hat{P}$ is a $(1+nd)$-coreset and a $(1+kd)$-AR-coreset of $P$, meaning that $\hat{C}$ is a $(1+nd)$-approximation and a $(1+kd)$-approximation solution, respectively.

Strengths

1. The motivation is clear. It is instructive to consider the local sub-optimal region of the solution space in the AR-coreset.

2. Utilizing the coreset technique to analyze the approximation of the solution is an interesting approach.

3. The lower bound of the approximation is also discussed.

Weaknesses

  1. The failure of $\mathrm{Err}(\cdot)$ seems natural since it considers the worst case in the entire solution space.
  2. This paper does not provide a detailed comparison of the proposed method with other existing analysis methods for clustering with noisy data.
  3. This paper considers independent noise across dimensions, but real noise might be correlated. The experiment part is too simple: it considers only two datasets, and the dimensions are low (6 and 10).
  4. The noise model and the assumptions make the result relatively narrow.

Questions

1. The authors say: 'Intuitively, how good a center set we can obtain from $\hat{P}$ is affected by the number of sign changes.' Please provide more thorough interpretations.

2. Besides determining the quality of a coreset in the presence of noise, are there other potential applications of the proposed AR-coreset?

3. Can we conclude that the (near) optimal solution(s) are robust to noise for other problems?

4. In the definition of AR-coreset, why should we consider the ratio of the cost for a given center set to the minimum cost? What would happen if we only consider the denominator of $r_P(\widehat{c})=\frac{\operatorname{cost}(P, \widehat{c})}{\mathrm{OPT}}$?

Comment

Thanks for appreciating our motivation and our results on the lower bound. We address your concerns and questions below.

  1. The failure of $\mathrm{Err}(\cdot)$ seems natural since it considers the worst case in the entire solution space.

Thanks for raising this point. Even in the noise-free setting, $\mathrm{Err}(\cdot)$ considers the worst case over the entire solution space, but, as the example below shows, it need not fail to capture the quality of the optimal center set of a coreset. The non-trivial contribution is in identifying the measure $\mathrm{Err}_{AR}(\cdot)$ and presenting a quantitative separation between the two error measures $\mathrm{Err}(\cdot)$ and $\mathrm{Err}_{AR}(\cdot)$ in the presence of noise.

Consider the worst-case instance constructed in [Cohen-Addad et al., 2022] where $P=\{ e_1, e_2,\ldots, e_d \}$ consists of the unit basis vectors in $\mathbb{R}^{d}$ with $d=2k \varepsilon^{-2}$. Let $S\subset P$ be a (weighted) subset of size at most $k \varepsilon^{-2}$. The optimal center set of $S$ must lie in the subspace spanned by $S$, and hence cannot be a $(1+\varepsilon)$-approximate center set of $P$. This geometric observation implies that $S$ is neither an $\varepsilon$-coreset nor a $(0,\varepsilon)$-approximation-ratio coreset. Hence the tight sizes of both an $\varepsilon$-coreset and a $(0,\varepsilon)$-approximation-ratio coreset of such a $P$ are $\Theta(k \varepsilon^{-2})$, which implies that $\mathrm{Err}(\cdot)$ and $\mathrm{Err}_{AR}(\cdot)$ are the same on this dataset $P$.

We will add this example in the final version.

  2. This paper does not provide a detailed comparison of the proposed method with other existing analysis methods for clustering with noisy data.

We have included a discussion of the literature on clustering with noisy data in Appendix A, along with a comparison of their analysis methods with our work. If you have any preferences regarding additional content or the placement of this section in the final version, please feel free to let us know -- we would be happy to do so.

  3. This paper considers independent noise across dimensions, but real noise might be correlated.

Thanks. We acknowledge that real noise may exhibit correlations across dimensions. As mentioned in the introduction, the attributes of the data may demonstrate weak correlations or interactions. For this type of real data, it is common to rely on the assumption of independent noise across dimensions, as observed in studies by [Zhu & Wu, 2004; Freitas, 2001; Langley et al., 1992].

A positive aspect of our techniques is that they can be extended to handle non-independent noise across dimensions. Consider a scenario where the covariance matrix of each noise vector $\xi_p$ is $\Sigma\in \mathbb{R}^{d\times d}$ (with $\Sigma = \theta \cdot I_d$ when each $D_j = N(0,\theta)$ under the noise model II). Our proof of Theorem 3.1 relies on certain concentration properties of the terms $\sum_{p\in P} \|\xi_p\|_2^2$ and $\sum_{p\in P}\langle \xi_p, p-c\rangle$. Note that $\mathbb{E} \|\xi_p\|_2^2 = \mathrm{trace}(\Sigma)$. Hence, by a similar argument as in the proof of Claim 3.6, one can show that $\sum_{p\in P} \|\xi_p\|_2^2$ concentrates around $n \cdot \mathrm{trace}(\Sigma)$ and $\sum_{p\in P}\langle \xi_p, p-c\rangle \leq \|c-c^\star\|_2\cdot O(\sqrt{n\cdot \mathrm{trace}(\Sigma)})$. Consequently, an $\varepsilon$-coreset $S$ of $\widehat{P}$ is an $O(\varepsilon + \frac{n \cdot \mathrm{trace}(\Sigma)}{\mathrm{OPT}} + \sqrt{\frac{n \cdot \mathrm{trace}(\Sigma)}{\mathrm{OPT}}})$-coreset of $P$ and a $(0, O(\varepsilon + \frac{k \cdot \mathrm{trace}(\Sigma)}{\mathrm{OPT}}))$-approximation-ratio coreset of $P$. The only difference with Theorem 3.1 is that the original variance term $\theta d$ is replaced by $\mathrm{trace}(\Sigma)$.
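
As a quick numerical sanity check of the concentration claim above (our own sketch, not from the paper), one can verify that $\sum_{p} \|\xi_p\|_2^2$ is close to $n \cdot \mathrm{trace}(\Sigma)$ for a correlated Gaussian noise model:

```python
import numpy as np

# Sanity check (ours): for correlated noise xi_p ~ N(0, Sigma), the sum of
# squared norms over n points concentrates around n * trace(Sigma).
rng = np.random.default_rng(1)
n, d = 100_000, 5

A = rng.normal(size=(d, d))
Sigma = A @ A.T / d                      # an arbitrary positive semidefinite covariance
xi = rng.multivariate_normal(np.zeros(d), Sigma, size=n)

empirical = (xi ** 2).sum()              # sum_p ||xi_p||_2^2
predicted = n * np.trace(Sigma)          # n * trace(Sigma)
print(empirical / predicted)             # close to 1 for large n
```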

We will add a remark on how our techniques can be extended to the non-independent noise case in the final version.

Comment

Thanks for your response, and I will consider it in the next discussion stage.

Comment
  1. The authors say: 'Intuitively, how good a center set we can obtain from $\hat{P}$ is affected by the number of sign changes.' Please provide more thorough interpretations.

Thanks for your question. Below we explain how the quality of the optimal center set of $\hat P$ may be affected by the number of sign changes.

Consider the case of 1-Means. Let $P \subset \mathbb{R}^d$ be an underlying dataset. Let $\widehat{P}_1$ and $\widehat{P}_2$ be observed datasets drawn from $P$ under the noise model II with noise parameters $\theta_1$ and $\theta_2$ respectively. We assume that $\theta_1 < \theta_2$. Recall that the optimal centers of $P$, $\widehat{P}_1$, and $\widehat{P}_2$ are denoted by $\mu(P)$, $\mu(\widehat{P}_1)$, and $\mu(\widehat{P}_2)$ respectively. We note that for any two centers $c$ and $c'$ lying on the line segment between $\mu(P)$ and $\mu(\widehat{P}_1)$, the signs of $\mathrm{cost}_2(P,c) - \mathrm{cost}_2(P,c')$ and $\mathrm{cost}_2(\widehat{P}_1,c) - \mathrm{cost}_2(\widehat{P}_1,c')$ must differ. This is because the center that is closer to $\mu(P)$ has a smaller cost for $P$ but a larger cost for $\widehat{P}_1$ when compared to the other center. Hence, the number of sign changes from $P$ to $\widehat{P}_1$ is proportional to the distance $d(\mu(P), \mu(\widehat{P}_1))$. The same observation holds for $\widehat{P}_2$. Also, note that $d(\mu(P), \mu(\widehat{P}_1)) < d(\mu(P), \mu(\widehat{P}_2))$ since $\theta_1 < \theta_2$. Consequently, the number of sign changes from $P$ to $\widehat{P}_1$ is smaller than the number of sign changes from $P$ to $\widehat{P}_2$. Meanwhile, the quality of the optimal center $\mu(\widehat{P}_1)$ is $\mathrm{cost}_2(P, \mu(\widehat{P}_1)) = \mathrm{cost}_2(P, \mu(P)) + |P|\cdot d^2(\mu(P), \mu(\widehat{P}_1))$, which is better than the quality of $\mu(\widehat{P}_2)$ since $\mathrm{cost}_2(P, \mu(\widehat{P}_1)) < \mathrm{cost}_2(P, \mu(\widehat{P}_2))$.

Hence, as the number of sign changes increases, the quality of the optimal center for the observed dataset deteriorates. We will add this example in the final version.
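
A small simulation (ours) of the 1-Means setting above, using the identity $\mathrm{cost}_2(P, c) = \mathrm{cost}_2(P, \mu(P)) + |P|\cdot d^2(\mu(P), c)$, shows the two sides agreeing and the excess cost growing with the noise level $\theta$:

```python
import numpy as np

# Illustration (ours): larger noise theta shifts the empirical mean mu(P_hat)
# farther from mu(P), and the excess cost of mu(P_hat) on P equals
# |P| * d^2(mu(P), mu(P_hat)).
rng = np.random.default_rng(2)
n, d = 10_000, 10
P = rng.uniform(-1, 1, size=(n, d))
mu_P = P.mean(axis=0)

def cost2(points, center):
    return float(((points - center) ** 2).sum())

for theta in [0.1, 1.0, 10.0]:
    P_hat = P + rng.normal(scale=np.sqrt(theta), size=(n, d))  # coordinate noise with variance theta
    mu_hat = P_hat.mean(axis=0)
    shift2 = float(((mu_hat - mu_P) ** 2).sum())
    # Identity: cost2(P, mu_hat) = cost2(P, mu_P) + n * ||mu_hat - mu_P||^2
    print(theta, cost2(P, mu_hat), cost2(P, mu_P) + n * shift2)
```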

  2. Besides determining the quality of a coreset in the presence of noise, are there other potential applications of the proposed AR-coreset?

Thanks for asking this. Yes, the notion of AR-coreset may have potential applications in (noise-free) machine learning tasks where one only wants to preserve near-optimal solutions, e.g., in regression. The notion of AR-coreset may be useful to further reduce the size of the coreset. This is an interesting future direction and we will mention it in Section 5.

  3. Can we conclude that the (near) optimal solution(s) are robust to noise for other problems?

This is a great question. We have listed it as a future direction in Section 5. At present, the existing results seem insufficient to conclude the robustness of (near) optimal solutions to noise for other problems -- the answer is likely to be problem dependent.

  4. In the definition of AR-coreset, why should we consider the ratio of the cost for a given center set to the minimum cost? What would happen if we only consider the denominator $r_P(\widehat{C})$?

Since considering the denominator $r_P(\widehat{C})$ corresponds to a specific instance of the definition of an AR-coreset (Definition 2.2) with $\alpha = 1$, using $r_P(\widehat{C})$ will only provide a guarantee that $S$ is a $(0, O(\varepsilon + \frac{\theta k d}{\mathrm{OPT}}))$-AR-coreset of $P$. In contrast, considering the ratio $r_S(C)$ of the cost for a given center set $C$ corresponds to the case $\alpha \geq 1$, and enables a quantitative analysis of the effect of noise on center sets $C$ as we vary $\alpha$, for a given $S$.

Comment
  3. The experiment part is too simple: it considers only two datasets, and the dimensions are low (6 and 10).

Based on your comment, we evaluated our results on the Census1990 dataset, whose dimension is 68. We used the same setup as in Section 4, and the new results can be found in the following file in the supplementary material (https://openreview.net/attachment?id=D96juYQ2NW&name=supplementary_material). These empirical findings validate our theoretical results on datasets with a large number of attributes, including: 1) a numerical separation between $\mathrm{Err}$ and $\mathrm{Err}_{AR}$; 2) both $\mathrm{Err}$ and $\mathrm{Err}_{AR}$ initially decrease and then stabilize as the coreset size increases; 3) the variance of noise (or noise level) is the main parameter that determines both $\mathrm{Err}$ and $\mathrm{Err}_{AR}$. Specifically, under the noise model II with Gaussian noise, $\mathrm{Err}(S)$ decreases from 0.282 to 0.270 as $|S|$ increases from 500 to 2000, and then remains around 0.270 as $|S|$ continues to increase to 5000. We will incorporate these results in the final version.

  4. The noise model and the assumptions make the result relatively narrow.

We have addressed your concern about the noise model in our response 3 above.

Regarding the assumptions, in addition to the discussion in Section 3 explaining why both balancedness and well-separated clusters are reasonable and standard assumptions in the clustering literature, we provide an example below to illustrate that, without assuming well-separated clusters, the two error measures $\mathrm{Err}$ and $\mathrm{Err}_{AR}$ may be the same.

Consider a 3-Means problem over $\mathbb{R}$, i.e., $k=3$ and $d=1$. Let $P\subset \mathbb{R}$ consist of $\frac{n}{4}$ points each located at $-1.01$, $-0.99$, $0.99$, and $1.01$. A simple calculation shows that the optimal center set $C^\star$ is either $\{-1.01, -0.99, 1\}$ or $\{-1, 0.99, 1.01\}$, and $\mathrm{OPT} = 0.0005n$ (the same in both cases).

First, note that because there exist two clusters of points that are close to each other in both optimal center sets, $P$ does not satisfy the well-separatedness assumption. (The distance between the nearby clusters is 0.02, while our assumption requires it to be at least $\sqrt{3}$.)

Next, let $\widehat{P} \subset \mathbb{R}$ be an observed dataset drawn from $P$ under the noise model II with each $D_j = N(0,1)$. Since $\mathbb{E}_{x\sim N(0,1)}[|x| \mid x<0] = \mathbb{E}_{x\sim N(0,1)}[|x| \mid x\geq 0] = 1$, the optimal partition of $\widehat{P}$ is likely to be $\{\widehat{p}: \widehat{p}\leq -1 \}$, $\{\widehat{p}: -1\leq \widehat{p}\leq 1 \}$, $\{\widehat{p}: \widehat{p}\geq 1\}$, and the mean points of these three parts are roughly $-2$, $0$, and $2$ respectively. In this case, the optimal center set of $\widehat{P}$ is likely to be $\widehat{C} = \{-2,0,2\}$. Note that $\mathrm{cost}_2(P, \widehat{C}) \approx n \gg \mathrm{OPT}$. By Theorem E.1, $\widehat{P}$ is also an $\frac{n}{\mathrm{OPT}}$-coreset of $P$. Thus, $\mathrm{Err}(\widehat{P})\approx \mathrm{Err}_{AR}(\widehat{P})\approx \frac{n}{\mathrm{OPT}}$.

We will add this in the final version.
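
For completeness, a small simulation of this one-dimensional instance (our own sketch, using scikit-learn's KMeans; not part of the submission) exhibits the behaviour described above: the centers fitted on $\widehat{P}$ spread far from the optimal centers of $P$, and their cost on $P$ is orders of magnitude larger than $\mathrm{OPT}$.

```python
import numpy as np
from sklearn.cluster import KMeans

# Simulation (ours) of the 1-D instance above: n/4 points at each of
# -1.01, -0.99, 0.99, 1.01, with N(0, 1) noise added to every point.
rng = np.random.default_rng(0)
n = 40_000
P = np.repeat([-1.01, -0.99, 0.99, 1.01], n // 4).reshape(-1, 1)
P_hat = P + rng.normal(size=P.shape)

def cost2(points, centers):
    # k-means cost: squared distance of every point to its nearest center
    return float(((points - centers.reshape(1, -1)) ** 2).min(axis=1).sum())

C_hat = KMeans(n_clusters=3, n_init=10, random_state=0).fit(P_hat).cluster_centers_
print(np.sort(C_hat.ravel()))   # three well-spread centers, far from the optimal centers of P
print(cost2(P, C_hat) / n)      # Omega(1) per point, i.e. cost_2(P, C_hat) >> OPT = 0.0005 n
```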

Review (Rating: 6)

The paper studies the problem of computing a coreset for a dataset from its noisy perturbation. The paper considers the most apparent approach: compute a coreset for the perturbed version and use it directly as a coreset for the original dataset. The paper showed the following results:

  1. The coreset from the noisy dataset can be very bad for the original dataset, in terms of the relative error $\mathrm{Err}$ commonly used to measure a coreset's approximation quality.
  2. The authors notice that the traditional measure (the relative error mentioned in 1) is too strong because it is the supremum over all possible center sets, while in practice people are more interested in how well the coresets can approximate costs for "not so bad" center sets. This motivates the authors to design a new relative error $\mathrm{Err}_{AR}$ (which they call the "approximate error ratio") that takes the supremum only over center sets that approximate the optimal solution.
  3. The authors show that this new definition of relative error can help give a tighter approximation-ratio estimate for a center set computed on the coreset (obtained from the noisy dataset). In particular, a coreset $S$ with large $\mathrm{Err}$ can have much smaller $\mathrm{Err}_{AR}$, which means a good approximate solution on $S$ will also have a small cost on the original dataset, whereas if we use $\mathrm{Err}$ for estimation, the bound obtained is much looser.

Strengths

I think the definition of $\mathrm{Err}_{AR}$ is quite neat. The authors show that there is a strong separation between the traditional relative error $\mathrm{Err}$ and their new measure $\mathrm{Err}_{AR}$. I like this separation result in particular.

Weaknesses

Although the authors claim the two assumptions (i.e., $O(1)$-balancedness and well-separation) are "mild" and only for "overcoming technical difficulty", I feel they are quite strong. Also, one possibility is that the separation between $\mathrm{Err}$ and $\mathrm{Err}_{AR}$ is actually a result of these two assumptions. It would be great if the authors could show a quantitative analysis of the dependence between the separation and the two assumptions. For example, would it be possible that when the data become less balanced / well-separated, the two measures $\mathrm{Err}$ and $\mathrm{Err}_{AR}$ converge to each other? (My gut feeling is that the degree of well-separation could likely have a non-trivial effect on $\mathrm{Err}$ / $\mathrm{Err}_{AR}$.) If that's the case, I feel this would somewhat strengthen the paper's result, since $\mathrm{Err}_{AR}$ can be viewed as a unifying measure that is tighter in extreme parameter ranges.

Questions

There is no space between "$(k,z)$-Clustering" and the text following it. I guess the authors should add a \xspace after their macro for "$(k,z)$-Clustering".

Comment

Thanks for appreciating the novelty of our AR-notion and our results. We address your specific questions below.

The assumptions are quite strong ... one possibility is that the separation between $\mathrm{Err}$ and $\mathrm{Err}_{AR}$ is actually a result of these two assumptions. It would be great if the authors could show a quantitative analysis of the dependence between the separation and the two assumptions...

Thank you for suggesting that we investigate this. Below we give an example illustrating that, if one does not assume that the clusters are well-separated, the two error measures $\mathrm{Err}$ and $\mathrm{Err}_{AR}$ may be the same.

Consider a 3-Means problem over $\mathbb{R}$, i.e., $k=3$ and $d=1$. Let $P\subset \mathbb{R}$ consist of $\frac{n}{4}$ points each located at $-1.01$, $-0.99$, $0.99$, and $1.01$. A simple calculation shows that the optimal center set $C^\star$ is either $\{-1.01, -0.99, 1\}$ or $\{-1, 0.99, 1.01\}$, and $\mathrm{OPT} = 0.0005n$ (the same in both cases).

First, note that because there exist two clusters of points that are close to each other in both optimal center sets, $P$ does not satisfy the well-separatedness assumption. (The distance between the nearby clusters is 0.02, while our assumption requires it to be at least $\sqrt{3}$.)

Next, let $\widehat{P} \subset \mathbb{R}$ be an observed dataset drawn from $P$ under the noise model II with each $D_j = N(0,1)$. Since $\mathbb{E}_{x\sim N(0,1)}[|x| \mid x<0] = \mathbb{E}_{x\sim N(0,1)}[|x| \mid x\geq 0] = 1$, the optimal partition of $\widehat{P}$ is likely to be $\{\widehat{p}: \widehat{p}\leq -1 \}$, $\{\widehat{p}: -1\leq \widehat{p}\leq 1 \}$, $\{\widehat{p}: \widehat{p}\geq 1\}$, and the mean points of these three parts are roughly $-2$, $0$, and $2$ respectively. In this case, the optimal center set of $\widehat{P}$ is likely to be $\widehat{C} = \{-2,0,2\}$. Note that $\mathrm{cost}_2(P, \widehat{C}) \approx n \gg \mathrm{OPT}$. By Theorem E.1, $\widehat{P}$ is also an $\frac{n}{\mathrm{OPT}}$-coreset of $P$. Thus, $\mathrm{Err}(\widehat{P})\approx \mathrm{Err}_{AR}(\widehat{P})\approx \frac{n}{\mathrm{OPT}}$.

We will explain this formally in the final version.

There is no space between "$(k,z)$-Clustering" and the text following it. I guess the authors should add a \xspace after their macro for "$(k,z)$-Clustering".

Thanks, we will fix it.

Review (Rating: 6)

The paper shows how to construct a coreset for clustering with noisy data.

Strengths

  • The paper introduces new measures that are used to construct coresets that guarantee $\varepsilon$-approximations.

  • The paper also presents a lower bound for 1-Means with noisy data.

  • The theory is backed by a good number of empirical evaluations on real datasets.

Weaknesses

  • Some intuition and relation between the claims in Section 2 will be helpful.

  • A formal algorithm, even in the appendix, will increase its impact.

Questions

  • Please give an example of how the number of sign changes affects the goodness/quality of centers.

  • A coreset size that is inversely proportional to $n$ and $d$ in the coreset measure is counter-intuitive. Comments?

Comment

A coreset size that is inversely proportional to $n$ and $d$ in the coreset measure is counter-intuitive. Comments?

Thanks for your question. Below we explain why this is not counter-intuitive.

First, the coreset size suggested by Lemma 3.4 contains an $\mathrm{OPT}^2$ term in the factor $\frac{\mathrm{OPT}^2}{n^2 d^2}$. $\mathrm{OPT}$ typically increases linearly with $n$; e.g., in the example in Appendix B, $\mathrm{OPT} = n$. Thus, $\frac{\mathrm{OPT}^2}{n^2 d^2}$ typically scales as $\frac{1}{d^2}$.

Second, by Theorem 3.3, the error measure satisfies $\mathrm{Err}(\widehat{P}) \geq \Omega(\frac{\theta nd}{\mathrm{OPT}})$, which increases as $d$ increases. Thus, Lemma 3.4 implies that we cannot achieve an $\varepsilon$-coreset of $P$ for $\varepsilon \leq \frac{\theta nd}{\mathrm{OPT}}$. This is different from the noise-free setting, where we can achieve an $\varepsilon$-coreset for any $\varepsilon > 0$. Since a coreset algorithm usually constructs a coreset of size proportional to $\varepsilon^{-2}$, an $\varepsilon$-coreset of $P$ with $\varepsilon = \frac{\theta nd}{\mathrm{OPT}}$ is typically of size inversely proportional to $d^2$.

For instance, applying the algorithm of [Cohen-Addad et al., 2022], one can construct a $\frac{\theta nd}{\mathrm{OPT}}$-coreset of $P$ of size $\tilde{O}(\frac{k^{1.5}}{\theta^2 d^2})$ when $\mathrm{OPT} \approx n$. We will explain this in the final version.
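
Spelling out the substitution in the last step (taking the $\tilde{O}(k^{1.5}\varepsilon^{-2})$ size bound quoted above for the algorithm of [Cohen-Addad et al., 2022] at face value, and $\mathrm{OPT} \approx n$):

$$
\varepsilon = \frac{\theta n d}{\mathrm{OPT}} \approx \theta d
\quad\Longrightarrow\quad
|S| = \tilde{O}\!\left(\frac{k^{1.5}}{\varepsilon^{2}}\right) = \tilde{O}\!\left(\frac{k^{1.5}}{\theta^{2} d^{2}}\right).
$$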

Comment

Thanks for appreciating our new measure and theoretical results. We address your questions below.

Comment: Some intuition and relation between the claims in Section 2 will be helpful.

Thanks for the comment. The intuition behind Claim 2.3 is as follows: by definition, a $\beta$-approximate center set $C$ of $S$, for some $\beta \leq \alpha$, must be a $\beta(1+\varepsilon)$-approximate center set of $P$. This allows us to find a near-optimal center set of $P$ from $S$.

This transitivity implies Claim 2.4 because we can obtain a near-optimal center set from a coreset or an approximation-ratio coreset $S$ of another coreset $S'$ of $P$. A bit more formally, Claim 2.3 ensures that an $\frac{\alpha}{1+O(\varepsilon)}$-approximation $C$ for $S$ is an $\alpha$-approximation for $S'$. Since $S'$ is an $\varepsilon'$-coreset of $P$, we conclude that $C$ is an $\alpha(1+\varepsilon')$-approximation for $P$.
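
A compact (and slightly informal) way to write the chain, assuming the standard multiplicative guarantee $\mathrm{cost}(S, C) \in (1\pm\varepsilon)\,\mathrm{cost}(S', C)$ for every center set $C$: if $C$ is an $\frac{\alpha}{1+O(\varepsilon)}$-approximation for $S$, then

$$
\mathrm{cost}(S', C) \;\le\; \frac{\mathrm{cost}(S, C)}{1-\varepsilon}
\;\le\; \frac{\alpha \cdot \mathrm{OPT}(S)}{(1-\varepsilon)\,(1+O(\varepsilon))}
\;\le\; \frac{(1+\varepsilon)\,\alpha \cdot \mathrm{OPT}(S')}{(1-\varepsilon)\,(1+O(\varepsilon))}
\;\le\; \alpha \cdot \mathrm{OPT}(S'),
$$

where the last step uses a large enough constant inside the $O(\cdot)$; applying the $\varepsilon'$-coreset guarantee of $S'$ with respect to $P$ once more then gives $\mathrm{cost}(P, C) \le \alpha(1+O(\varepsilon'))\cdot \mathrm{OPT}(P)$.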

We will explain this in Section 2 in the final version.

Comment: A formal algorithm, even in the appendix, will increase its impact.

It seems there is some confusion. We do not propose a new coreset algorithm; instead, we show how any coreset algorithm can be used to construct a coreset in the presence of noise. In particular, Lemma 3.4 asserts that when $\varepsilon > \frac{\theta n d}{\mathrm{OPT}} + \sqrt{\frac{\theta n d}{\mathrm{OPT}}}$, one can apply a given coreset algorithm $A$ to construct an $\varepsilon$-coreset of $P$ given the noise parameter $\theta$.

Specifically, when provided with the noise parameter $\theta$, the noisy dataset $\widehat{P}$, $k$, and $\varepsilon > \frac{\theta n d}{\mathrm{OPT}} + \sqrt{\frac{\theta n d}{\mathrm{OPT}}}$, one can simply return the coreset obtained from coreset algorithm $A$, where we set $P' = \widehat{P}$, $k' = k$, and $\varepsilon' = \varepsilon - \frac{\theta n d}{\mathrm{OPT}} - \sqrt{\frac{\theta n d}{\mathrm{OPT}}}$.

The lemma also implies that when $\varepsilon \leq \frac{\theta n d}{\mathrm{OPT}} + \sqrt{\frac{\theta n d}{\mathrm{OPT}}}$, it is not possible to construct an $\varepsilon$-coreset of $P$.
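
A schematic sketch (ours) of this reduction; `coreset_algorithm_A` and `opt_estimate` are placeholders rather than interfaces from the paper, and the error split follows the description above.

```python
import math

def noisy_coreset(P_hat, k, eps, theta, opt_estimate, coreset_algorithm_A):
    """Schematic reduction (placeholder names): build an eps-coreset of the clean
    dataset P using only the noisy dataset P_hat, by calling an off-the-shelf
    coreset algorithm A on P_hat with a reduced error budget."""
    n, d = len(P_hat), len(P_hat[0])
    noise_term = theta * n * d / opt_estimate
    if eps <= noise_term + math.sqrt(noise_term):
        raise ValueError("eps is below the noise floor: no eps-coreset of P is possible")
    # Spend the remaining error budget in the noise-free construction on P_hat.
    eps_prime = eps - noise_term - math.sqrt(noise_term)
    return coreset_algorithm_A(P_hat, k, eps_prime)
```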

We will clarify this in the final version.

Please give an example of how the number of sign changes affects the goodness/quality of centers.

Thanks for bringing this up. Below we explain how the quality of the optimal center set of $\hat P$ may be affected by the number of sign changes.

Consider the case of 1-Means. Let $P \subset \mathbb{R}^d$ be an underlying dataset. Let $\widehat{P}_1$ and $\widehat{P}_2$ be observed datasets drawn from $P$ under the noise model II with noise parameters $\theta_1$ and $\theta_2$ respectively. We assume that $\theta_1 < \theta_2$. Recall that the optimal centers of $P$, $\widehat{P}_1$, and $\widehat{P}_2$ are denoted by $\mu(P)$, $\mu(\widehat{P}_1)$, and $\mu(\widehat{P}_2)$ respectively. We note that for any two centers $c$ and $c'$ lying on the line segment between $\mu(P)$ and $\mu(\widehat{P}_1)$, the signs of $\mathrm{cost}_2(P,c) - \mathrm{cost}_2(P,c')$ and $\mathrm{cost}_2(\widehat{P}_1,c) - \mathrm{cost}_2(\widehat{P}_1,c')$ must differ. This is because the center that is closer to $\mu(P)$ has a smaller cost for $P$ but a larger cost for $\widehat{P}_1$ when compared to the other center. Hence, the number of sign changes from $P$ to $\widehat{P}_1$ is proportional to the distance $d(\mu(P), \mu(\widehat{P}_1))$. The same observation holds for $\widehat{P}_2$. Also, note that $d(\mu(P), \mu(\widehat{P}_1)) < d(\mu(P), \mu(\widehat{P}_2))$ since $\theta_1 < \theta_2$. Consequently, the number of sign changes from $P$ to $\widehat{P}_1$ is smaller than the number of sign changes from $P$ to $\widehat{P}_2$. Meanwhile, the quality of the optimal center $\mu(\widehat{P}_1)$ is $\mathrm{cost}_2(P, \mu(\widehat{P}_1)) = \mathrm{cost}_2(P, \mu(P)) + |P|\cdot d^2(\mu(P), \mu(\widehat{P}_1))$, which is better than the quality of $\mu(\widehat{P}_2)$ since $\mathrm{cost}_2(P, \mu(\widehat{P}_1)) < \mathrm{cost}_2(P, \mu(\widehat{P}_2))$.

Hence, as the number of sign changes increases, the quality of the optimal center for the observed dataset deteriorates. We will add this example in the final version.

AC Meta-Review

The paper develops a new notion of error that focuses only on the approximation factor of nearly optimal solutions instead of the error in strong coresets where one cares about the approximation factor for all solutions. It shows that in the case of noisy datasets, the traditional error for strong coresets is too strict, leading to huge and pessimistic approximation factors while the new notion remains meaningful. The paper shows that for datasets with clusters that are well-separated and balanced, any coreset for the noisy dataset is still a "good" coreset for the original dataset where the goodness depends on the amount of noise.

On the other hand, the main result on the strength of the new notion, tightening from strong coresets, relies on two strong assumptions, well-separation and balance. These assumptions are strong and might not hold for real datasets. Additionally, one reviewer suggests that there is no new coreset algorithm for the noisy setting, making it less clear whether some new algorithms can do better than applying any existing coreset construction. One reviewer suggests that the noise model is somewhat stylized and restrictive (independent noise across coordinates with bounded moments).

Why not a higher score

The results are good to know but seem to be somewhat restrictive. They only apply to well-separated and balanced clusters. Furthermore, the results are generic for all coreset constructions. They do not seem to rule out specific algorithms or other weaker notions such as weak coresets.

Why not a lower score

N/A

Final Decision

Reject