PaperHub
Overall rating: 7.8/10 · Poster · 4 reviewers
Individual ratings: 4, 5, 5, 5 (min 4, max 5, std 0.4) · Average confidence: 3.8
Novelty: 3.3 · Quality: 3.3 · Clarity: 2.8 · Significance: 3.0
NeurIPS 2025

Coresets for Clustering Under Stochastic Noise

Submitted: 2025-05-10 · Updated: 2025-10-29
TL;DR

We propose a new error metric for constructing coresets in $(k,z)$-clustering with noisy data, leading to smaller coresets, stronger theoretical guarantees, and improved empirical performance compared to classical methods.

Abstract

We study the problem of constructing coresets for $(k, z)$-clustering when the input dataset is corrupted by stochastic noise drawn from a known distribution. In this setting, evaluating the quality of a coreset is inherently challenging, as the true underlying dataset is unobserved. To address this, we investigate coreset construction using surrogate error metrics that are tractable and provably related to the true clustering cost. We analyze a traditional metric from prior work and introduce a new error metric that more closely aligns with the true cost. Although our metric is defined independently of the noise distribution, it enables approximation guarantees that scale with the noise level. We design a coreset construction algorithm based on this metric and show that, under mild assumptions on the data and noise, enforcing an $\varepsilon$-bound under our metric yields smaller coresets and tighter guarantees on the true clustering cost than those obtained via classical metrics. In particular, we prove that the coreset size can improve by a factor of up to $\mathrm{poly}(k)$, where $n$ is the dataset size. Experiments on real-world datasets support our theoretical findings and demonstrate the practical advantages of our approach.
Keywords
coreset, clustering, stochastic noise, error metrics

Reviews and Discussion

Official Review
Rating: 4

The paper studies coresets for k-means clustering when the input dataset is corrupted by random noise. Specifically, the paper models noise by random perturbation of each data point. It then presents a surrogate error metric, which is basically an upper bound on the error of the coreset but relative to the unperturbed dataset. The main result is then two algorithms for constructing a coreset from the perturbed dataset. The crux is that the algorithm's accuracy/quality is measured wrt the unperturbed points, even though the algorithm cannot see them. The algorithms are analyzed theoretically, showing smaller coreset size, and they are validated via a few experiments.

Strengths and Weaknesses

I like the paper's research direction of studying noisy data instead of worst-case data. As a side comment, such models, where the input is a combination of adversarial + random choices, are called semi-random inputs or smoothed analysis in the related literature: David Arthur, Bodo Manthey, Heiko Röglin: Smoothed Analysis of the k-Means Method. J. ACM 58(5): 19:1-19:31 (2011); Jonathan A. Kelner, Jerry Li, Allen Liu, Aaron Sidford, Kevin Tian: Semi-Random Matrix Completion via Flow-Based Adaptive Reweighting. NeurIPS 2024.

However, I find the paper's methodology unconvincing. First, the main error measure $\mathrm{Err}(S,P)$ is only an upper bound on the error (the paper calls it a surrogate), and I doubt it is the correct measure that we would like to minimize. Second, the two algorithms do not bring new insights or methods to the area; in fact the first one (theorem 3.1) simply says to apply any known coreset algorithm, and thus has no new component (the contribution is analysis of the error wrt the unperturbed dataset). The second algorithm (for theorem 3.3) follows a rather standard recipe, basically computing an $O(1)$-approximate clustering and sampling from each cluster a carefully chosen number of points. I don't see how or why these algorithms are tailored for a perturbation model. Overall, I don't see a significant technical novelty in the algorithms, and the analysis might be involved but seems (from its overview) to use standard tools.

Similarly, the experiments do not convince me, primarily because I am not fond of the error measure $\mathrm{Err}(S,P)$, but also because they do not evaluate the quality of the coreset directly, i.e., solving the clustering problem directly for the input $P$ vs. solving for the coreset $S$ and using this center set.

I also have many comments about specific issues in the paper, but they are minor. For example, line 217 demonstrates the applicability but works for a rather small $\epsilon = 1/\mathrm{poly}(k)$ and large stability parameter $\gamma$. In addition, theorems 3.1 and 3.3 should probably include WHP somewhere in the statement. In addition, the $r_P$ measure makes a lot of sense for worst-case inputs, so why use it here also in the noisy setting? Also, the bounds in the two theorems involve many terms and the big picture is lost. Finally, it seems that you use k-means++ experimentally both for the initial $O(1)$-approximate solution and as a heuristic for finding the optimal solution (because it's NP-hard), which might cause inconsistency in the experimental results (the algorithm "starts" at the optimal solution).

Questions

none

Limitations

yes

Final Rating Justification

I raised my score based on the rebuttal and other reviews, which helped me understand better the goal and contribution of this work.

Formatting Issues

none

Author Response

We thank the reviewer for taking the time to evaluate our submission and for the opportunity to clarify key aspects of our work. Below we address your comments.

...semi-random inputs or smoothed analysis in related literature...

Thank you for pointing us to the literature on semi-random (or smoothed-analysis) models. As you note, these frameworks also consider data formed by adding random perturbations to a clean dataset, $\widehat{P} \leftarrow P + \text{noise}$.

Similarity:
Our setting shares this structural model: $\widehat{P}$ is formed by applying i.i.d. additive noise to $P$. For example, [Arthur et al., 2011] use Gaussian perturbations for $k$-means, and [Kelner et al., 2024] apply entrywise noise in matrix completion—both analogous to our sub-exponential model.

Difference:
The goal is different. Smoothed analysis asks: Can a standard algorithm efficiently solve the problem on $\widehat{P}$? For instance, [Arthur et al., 2011] show $k$-means converges quickly on the noisy data.

In contrast, we aim to construct a coreset $S$ from $\widehat{P}$ that approximates the clustering cost on the unobserved dataset $P$. Here, noise is the main challenge—not a tool for tractability. In this sense, our goal aligns more with [Kelner et al., 2024], who also seek to recover latent structure from noisy observations.

We will clarify this distinction in the related work section.

the main error measure $\mathrm{Err}(S,P)$ is only an upper bound on the error (the paper calls it a surrogate), and I doubt it is the correct measure that we would like to minimize

The main measure we use is $\mathsf{Err}_\alpha(S, P)$, not $\mathsf{Err}(S, P)$. The latter—widely used in the noise-free setting (e.g., [20, 21, 23, 35, 54])—is used in Theorem 3.1 to extend existing coreset guarantees to the noisy setting.

Our primary contribution is Theorem 3.3, which introduces a new algorithm and relies on the new error metric $\mathsf{Err}_\alpha(S, P)$.

the $r_P$ measure makes a lot of sense for worst-case inputs, so why use it here also in the noisy setting?

Because we still aim to construct a coreset for the original dataset $P$, even though we only observe $\widehat{P}$. Thus, bounding $r_P$ remains the goal.

...I don't see a significant technical novelty in the algorithms, and the analysis...use standard tools

Thank you for this opportunity to clarify the technical contributions of both our analysis and algorithm.

Novelty in Theorem 3.1

Theorem 3.1 does not propose a new algorithm, but it makes a novel analytical contribution: it quantifies the quality of a coreset $S$ (constructed from noisy data $\widehat{P}$) with respect to the true dataset $P$, using the traditional error measure $\mathsf{Err}(S, P)$. This analysis reveals that $\mathsf{Err}(S, P)$ can be an overly conservative estimate of the true quality—especially when noise uniformly inflates the clustering cost (see Lines 184–186 and Section B.3).

To our knowledge, this is the first result to study how coreset performance degrades under noise. It also motivates our introduction of the new metric $\mathsf{Err}_\alpha(S, P)$, which we analyze in Theorem 3.3 and show to be a tighter surrogate for the ideal measure $r_P$.

Novelty in Theorem 3.3 and Algorithm 1

In contrast to Theorem 3.1, Theorem 3.3 introduces a new coreset construction algorithm (Algorithm 1) specifically designed to be robust to additive noise. It also provides a sharper error guarantee for $r_P$, using our new metric $\mathsf{Err}_\alpha(S, P)$ under mild assumptions (Assumption 3.2). As discussed in Lines 216–226, this algorithm achieves up to a $\mathrm{poly}(k)$ improvement in coreset size over noise-agnostic baselines.

Algorithmic Novelty

Algorithm 1 consists of the following steps:

  1. Partition $\widehat{P}$ into $k$ clusters $\widehat{P}_i$ using an $O(1)$-approximate center set $\widehat{C}$ (Line 1).
  2. Construct a ball $B_i$ around each cluster with a carefully chosen radius $\widehat{r}_i$ (Lines 2–3).
  3. Uniformly sample from $P_i' = \widehat{P}_i \cap B_i$ (Lines 4–5), and return the union of samples.

The key novelty lies in Step 2, which does not appear in prior noise-free coreset methods. Existing approaches sample directly from $\widehat{P}_i$, which works in the absence of noise but fails to account for assignment drift caused by high-noise points. These noisy points can mislead clustering structure and distort center locations.

By constructing and sampling from $B_i$, our method explicitly excludes high-noise points and focuses the coreset on points whose assignments are reliable. This noise-aware sampling step is critical to preserving geometric stability. An example in Appendix E.1 illustrates the significance of assignment shifts in noisy settings. Lines 250–265 provide further analysis of this step.
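For intuition only, here is a minimal Python sketch of this partition-ball-sample recipe. It is not the paper's exact Algorithm 1: the scikit-learn call, the median-based radius heuristic, and the per-cluster sample size are illustrative assumptions standing in for the carefully chosen radius and sampling rates.

```python
import numpy as np
from sklearn.cluster import KMeans

def noise_aware_coreset(P_hat, k, m_per_cluster, radius_scale=2.0, seed=0):
    """Schematic noise-aware sampling: partition the noisy data, keep a ball
    around each cluster, and sample uniformly from inside the ball."""
    rng = np.random.default_rng(seed)
    # Step 1: O(1)-approximate center set via k-means++ with few Lloyd iterations.
    km = KMeans(n_clusters=k, init="k-means++", n_init=1, max_iter=5,
                random_state=seed).fit(P_hat)
    samples, weights = [], []
    for i in range(k):
        P_i = P_hat[km.labels_ == i]
        if len(P_i) == 0:
            continue
        dists = np.linalg.norm(P_i - km.cluster_centers_[i], axis=1)
        # Step 2: ball B_i around the cluster; a median-based radius is used here
        # as a stand-in for the paper's carefully chosen radius.
        P_i_prime = P_i[dists <= radius_scale * np.median(dists)]
        if len(P_i_prime) == 0:
            continue
        # Step 3: uniform sample from P_i', reweighted to preserve total mass.
        m = min(m_per_cluster, len(P_i_prime))
        idx = rng.choice(len(P_i_prime), size=m, replace=False)
        samples.append(P_i_prime[idx])
        weights.append(np.full(m, len(P_i_prime) / m))
    return np.vstack(samples), np.concatenate(weights)
```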

Analytical Novelty

Our analysis introduces several new components:

  • Step 1: We prove that every $C \in \mathcal{C}_\alpha(S)$ contains a center in each $B_i$ (Lemma D.3), establishing consistent cluster structure through stability.

  • Step 2: We bound the deviation between centers of $P_i'$ and its sample $S_i$ (Lemma D.4).

  • Step 3: Crucially, we bound the shift between centers $\mu(P_i')$ and $\mu(P_i)$, separating the effects of internal noise and the excluded high-noise points (Lemma D.5). This step is novel, and essential to achieve a tighter bound for the coreset quality $r_P$ compared to $\mathsf{Err}$.

  • Step 4: We conclude via a standard composition argument, bounding $\mathsf{Err}_\alpha(S, P)$ in Theorem 3.3.

This is, to our knowledge, the first analysis to disentangle the impact of noise on clustering cost (scaling as $\theta n d / \mathrm{OPT}_P$) from its effect on the location of optimal centers (scaling as $\theta d / \mathrm{OPT}_P$) (Lines 266–279).

We will reflect these clarifications in the final version.

... rather small $\varepsilon = 1/\mathrm{poly}(k)$ and large stability parameter $\gamma$

Thank you for raising this point. While Theorem 3.3 depends on the cost-stability parameter $\gamma$, the result applies for any $\varepsilon \in (0,1)$. To illustrate:

  • Let $\alpha = 1 + \varepsilon$ and $\theta = \frac{\mathrm{OPT}_P}{n d \cdot \mathrm{poly}(k)}$.
  • Then Theorem 3.3 yields a coreset size $\widetilde{O}(k/\varepsilon)$, improving over $\widetilde{O}(\min\{k^{1.5}/\varepsilon^2, k/\varepsilon^4\})$ by a factor of $\sqrt{k}/\varepsilon$ (with a similar analysis as in Lines 221–222).
  • The error bound also improves over Theorem 3.1 by a $\mathrm{poly}(k)$ factor when $n \gg \mathrm{poly}(k)$ (with a similar analysis as in Lines 223–226).

We will include this example in the final version.

the bounds in the two theorems involve many terms and the big picture is lost

Thanks for pointing this out; each term in the bounds of Theorems 3.1 and 3.3 plays a distinct role. Below we clarify this for (the harder case of) Theorem 3.3. In the final version, we will explain the importance of each term in these theorems.

  • $O(\frac{\theta kd}{\mathsf{OPT}_P})$: captures center drift due to noise. Intuitively, each center $c$ in the optimal center set $\mathsf{C}(P)$ shifts a distance of $\theta d$ due to noise. Thus, the total movement distance from $\mathsf{C}(P)$ to $\mathsf{C}(\widehat{P})$ is bounded by $\theta k d$. This implies that $\mathsf{C}(\widehat{P})$ is a $(1+\frac{\theta k d}{\mathsf{OPT}_P})$-approximate center set on $P$, introducing the term $\frac{\theta k d}{\mathsf{OPT}_P}$.

  • $O(\frac{\sqrt{\alpha-1}}{\alpha}\cdot \frac{\sqrt{\theta k d\,\mathrm{OPT}_P}+\theta n d}{\mathrm{OPT}_P})$: accounts for using $\alpha$-approximate centers. This term controls the additional error introduced by an $\alpha$-approximate center set instead of the optimal one.

The role of each term in Theorem 3.1 can also be found in our response to Reviewer bg4J (link: https://openreview.net/forum?id=ycCi4SkzPH&noteId=dWrAWSnvVv).

the experiments...do not evaluate the quality of the coreset directly

We believe this is a misunderstanding. The metric $\widetilde{r}_S$ (Line 293) does exactly what you suggest.

  • Recall that

    $$\widetilde{r}_S := \frac{\mathrm{cost}(P, C_S)}{\mathrm{cost}(P, C_P)},$$

    where $C_S$ is the center set obtained by running $k$-means++ on the coreset $S$, and $C_P$ is the center set obtained by running $k$-means++ on the full dataset $P$.

  • The denominator, $\mathrm{cost}(P, C_P)$, corresponds to solving the clustering problem directly on the input $P$.

  • The numerator, $\mathrm{cost}(P, C_S)$, corresponds to solving the problem on the coreset $S$ and then evaluating the resulting solution on $P$—i.e., using the coreset to approximate clustering on the original data.

Thus, $\widetilde{r}_S$ measures how well the coreset solution approximates the true clustering cost on $P$, precisely as you suggested.

We will clarify this interpretation of $\widetilde{r}_S$ in the final version.
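For concreteness, a minimal sketch of how $\widetilde{r}_S$ can be computed. The use of scikit-learn's KMeans as the $k$-means++ solver and the helper names are our illustrative assumptions; the paper's experimental setup may differ in its details.

```python
import numpy as np
from sklearn.cluster import KMeans

def clustering_cost(P, centers, weights=None):
    """(Weighted) k-means cost of P against a fixed center set."""
    d2 = ((P[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(axis=1)
    return d2.sum() if weights is None else d2 @ weights

def r_tilde(P, S, S_weights, k, seed=0):
    # C_S: k-means++ on the (weighted) coreset S.
    C_S = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(
        S, sample_weight=S_weights).cluster_centers_
    # C_P: k-means++ on the full dataset P.
    C_P = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(P).cluster_centers_
    # Both center sets are evaluated on the original data P.
    return clustering_cost(P, C_S) / clustering_cost(P, C_P)
```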

k-means++...might cause inconsistency in the experimental results

Thank you—this was due to a missing detail. Here is how $k$-means++ is used in two distinct roles:

  • For initialization on $\widehat{P}$:
    We run $k$-means++ with `max_iter = 5` for a fast $O(1)$-approximate solution (runtime: 0.128s on `Census1990`).

  • For evaluation on $S$:
    We run $k$-means++ 10 times (default settings, varied seeds) on $S$ and select the best solution (runtime: 0.181s on `Census1990`).

We will clarify this setup in the final version.
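Schematically, the two roles could look as follows in scikit-learn. This is a hedged sketch: `P_hat`, `S`, `w`, and `k` are assumed to be defined, and any settings beyond `max_iter = 5` and the 10 repeated runs are our assumptions rather than the authors' exact configuration.

```python
from sklearn.cluster import KMeans

# Role 1: fast O(1)-approximate centers on the noisy data, used only to drive
# the coreset construction (k-means++ seeding, at most 5 Lloyd iterations).
C_init = KMeans(n_clusters=k, init="k-means++", n_init=1, max_iter=5,
                random_state=0).fit(P_hat).cluster_centers_

# Role 2: evaluation on the coreset S -- 10 independent runs with varied seeds,
# keeping the solution with the lowest cost on S.
best = min((KMeans(n_clusters=k, init="k-means++", random_state=s).fit(S, sample_weight=w)
            for s in range(10)), key=lambda km: km.inertia_)
C_eval = best.cluster_centers_
```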

theorems 3.1 and 3.3 should probably include WHP

Thanks, yes, we will fix this.

We thank you again for your time, and hope our responses convey the novelty and applicability of our work.

Comment

Thanks for the clarifications. I now understand the results better, and will raise my score. I nevertheless feel the introduction does not explain the goal and contribution of this work well enough, but perhaps it's a matter of background, e.g., on noisy data, as it seems clear to other reviewers but not to me.

Comment

We sincerely thank you for your thoughtful engagement with our paper and for taking the time to re-evaluate your score. We're glad the clarifications were helpful, and we deeply appreciate your careful reading and constructive questions, which have helped sharpen both the presentation and framing of our contributions. We will incorporate all your suggestions into the final version.

Official Review
Rating: 5

Clustering is a fundamental machine learning task. For large datasets, coresets -- small, weighted subsets -- are used to approximate the clustering cost while reducing computational requirements. However, most coreset constructions assume noise-free data, which is rarely the case in practice. The paper addresses the construction of coresets for (k, z)-clustering (which includes k-means and k-median) when the observed dataset is corrupted by stochastic (random, additive) noise with a known distribution -- a common scenario in privacy, sensor data, and robustness applications.

The paper analyzes the traditional surrogate error metric (Err), used to measure coreset quality in noise-free settings, and demonstrates that it may significantly overestimate error when data is noisy due to uniform cost inflation. Next, the authors introduce a new approximation-ratio-based metric, which more closely aligns with the true (unobservable) clustering quality under noise. Then, they show that under certain assumptions on the dataset and certain noise models, their proposed sampling procedure produces a smaller -- by a poly(k) factor -- coreset. The authors complement their theoretical results with empirical evaluations.

Strengths and Weaknesses

Coreset construction in the presence of noise is highly relevant for practical machine learning yet underexplored. This paper fills an important gap. The theoretical guarantees under the two natural noise models provided in this paper are quite interesting. However, relying on certain assumptions about the dataset makes the results slightly weak, especially since such an assumption is necessary. Further, I am not sure whether the additive terms in Err and r_P (in Theorem 3.1) are unavoidable or not. Overall, I think this paper shows some nice, interesting results on coreset constructions for noisy datasets.

Questions

Could you please comment on whether the additive factors on Err and r_P in Theorem 3.1 are unavoidable or not? Are those additive terms optimal? Are there any natural interpretations of those terms or do they appear solely because of technical reasons?

Limitations

yes

Final Rating Justification

I thank the authors for their explanation. I did not have any major concerns, and I am positive about this paper; thus, I am keeping the score as it is.

Formatting Issues

None

Author Response

We thank you for your encouraging review. We sincerely appreciate your recognition of the novelty and contribution of our work. We address your question below.

Could you please comment on whether the additive factors on Err and r_P in Theorem 3.1 are unavoidable or not? Are those additive terms optimal? Are there any natural interpretations of those terms or do they appear solely because of technical reasons?

Thank you for the insightful question, which motivates us to strengthen our technical contributions. The two additive factors on $\mathrm{Err}$ and $r_P$ in Theorem 3.1 have different natures: $O\left(\frac{\theta n d}{\mathsf{OPT}_P}\right)$ is an unavoidable consequence of the problem setting, while $O\left(\sqrt{\frac{\theta n d}{\mathsf{OPT}_P}}\right)$ is an artifact of our proof technique.

The unavoidable term: $O\left(\frac{\theta n d}{\mathsf{OPT}_P}\right)$. This term is worst-case optimal and thus unavoidable.

  • Natural interpretation: Intuitively, this term arises from the inherent "cost" of the data perturbation. Each of the $\theta n$ perturbed points is expected to move by $\|\xi_p\|^2 = d$. This creates a minimum total displacement cost of $\Omega(\theta n d)$ that must be paid. When viewed as a relative error against the original clustering cost $\mathsf{OPT}_P$, this naturally induces the additive error term $\frac{\theta n d}{\mathsf{OPT}_P}$.

  • Worst-case example justification: We formalize this intuition with a worst-case example. Let $\theta = 0.1$, $n = 10000$, and $k = n-1$. Let $P = \{p_i = 100 n e_i : i \in [n]\} \subset \mathbb{R}^n$, where $e_i$ is the $i$-th unit basis vector in $\mathbb{R}^n$. An optimal solution is $\mathsf{C}(P) = \{p_1, \ldots, p_{n-2}, \frac{p_{n-1} + p_{n}}{2}\}$, and hence $\mathsf{OPT}_P = 10000n^2$. Note that with probability at least 0.8, the following events hold: $\sum_{p\in P} \|\xi_p\|_2^2 = \Theta(\theta n d)$ and, for every $p\in P$, $\|\xi_p\|_2^2 \leq 10d \log n$. Conditioned on these events, the optimal solution $\mathsf{C}(\widehat{P})$ of $\widehat{P}$ must consist of $n-2$ points $\widehat{p}\in \widehat{P}$ and the average of the remaining two points in $\widehat{P}$. W.l.o.g., assume $\mathsf{C}(\widehat{P}) = \{\widehat{p}_1,\ldots, \widehat{p}_{n-2}, \frac{\widehat{p}_{n-1} + \widehat{p}_{n}}{2}\}$. Then we have $\mathrm{cost}(P,\mathsf{C}(\widehat{P})) \geq (1 + \Omega(\frac{\theta n d}{\mathsf{OPT}_P})) \cdot \mathsf{OPT}_P$, implying that $\mathsf{Err}(\widehat{P},P) = \Omega(\frac{\theta n d}{\mathsf{OPT}_P})$.

The technical term: $O\left(\sqrt{\frac{\theta n d}{\mathsf{OPT}_P}}\right)$. This term arises from a technical step in our proof, and it's unclear if it is optimal. This term is used to bound the supremum of the total error contributed from all points over all center sets $C$. The error induced by moving a single point $p$ w.r.t. a center set $C$ can be bounded by $d^2(p+\xi_p,C) - d^2(p,C) \in \pm\|\xi_p\|_2\cdot d(p,C)$. Summing these bounds across all points leads to the term $\sum_p \|\xi_p\|_2\cdot d(p,C) \leq \sqrt{(\sum_p \|\xi_p\|_2^2) \cdot (\sum_p d^2(p,C))} \le \sqrt{\theta nd \cdot \mathsf{OPT}_P}$.

However, this analysis assumes the worst-case scenario where all individual errors $d^2(p+\xi_p,C) - d^2(p,C)$ compound. It's possible that these errors (which can be positive or negative) partially cancel each other out for every $C$. A more refined analysis that avoids this loose bounding step might eliminate or reduce the term $O\left(\sqrt{\frac{\theta n d}{\mathsf{OPT}_P}}\right)$.

We will add this entire discussion to the appendix of the final version to give readers a clearer interpretation of the bounds in Theorem 3.1.

Comment

Thanks for your explanation. I do not have any other concerns, and thus I am confirming my original score.

Official Review
Rating: 5

The core challenge addressed in this paper is the construction of coresets for $(k,z)$-clustering in the $d$-dimensional Euclidean space, where input points are perturbed by noise. The authors consider a main noise model where each point remains untouched with some probability, and with the complementary probability each of its coordinates is perturbed by noise drawn from a coordinate-specific centered and unit-variance distribution. They then extend this model to a setting where the distribution is allowed to have larger variance $\sigma^2$, and to noise models where coordinate noises are no longer independent. This perturbation means the algorithm operates without ever seeing the true, original point set. As a result, the traditional error metric, which gauges how much a coreset solution's cost differs from that of the complete data (for every solution), might not be the most appropriate measure when noise is present.

This paper’s main results are the following:

  • First, they show that, in a noisy setting, the canonical error metric used to evaluate coreset quality falls short of being the right one, as it may significantly overestimate coreset error. They, thus, define a new distribution-independent quality measure that does not suffer from the brittleness of measuring absolute cost deviation.

  • The second result of this paper consists of showing that, with respect to the canonical error metric, a coreset of state-of-the-art size suffices to guarantee an error of $\varepsilon$ plus an additive error polynomial in $\frac{\theta nd}{\mathrm{OPT}_P}$.

  • The third and main result is an analog of the second with respect to the new error metric: It shows that, under a cost-stability assumption, the new metric allows for a (up to $\mathrm{poly}(k)$) smaller coreset size, and tighter control on the approximation ratio.

For the second result, its analysis relies on composition bounds to estimate the clustering cost. This involves controlling the deviation of the clustering cost on the true point set from its estimate, using concentration and Bernstein's moment conditions.

The third result is substantially more challenging to obtain. For instance, the natural composition that used to hold with respect to the canonical metric no longer holds here, and a composition to an inflated $(1+\varepsilon)$ version has to be considered. The analysis of the third result first demonstrates that the designed coreset algorithm has a small error on the perturbed point set. Second, it establishes the error gap between the true point set and its noisy counterpart. The algorithm initially constructs a coreset in a noise-free setting by partitioning points into well-separated clusters with uniform sampling. This method is then extended to noisy environments using geometric properties ensured by cost-stability. Noise-induced challenges, such as increased cluster diameter and weakened closeness guarantees, are addressed through a "cleaned up" phase for noisy points. This process results in a negligible effect from removed points and achieves the desired error bound.

Strengths and Weaknesses

This paper presents a well-structured extension of coreset construction for clustering problems, specifically when handling noisy input. The exposition is generally clear. I found the simple example comparing the noise-free and noise-aware metrics particularly effective; it plainly shows why the proposed surrogate measure is appropriate. The additional discussion in the appendix is also insightful, demonstrating that the standard error measure can be brittle even without noise.

While the analysis for the second result is more standard, it plays a nice role by providing a benchmark against which the third result—which achieves sharper bounds—can be compared. For this third and main result, a key strength is the development of new techniques for the newly defined error metric, better suited for noisy environments. The "Key Ideas" paragraphs in the main body are effective in helping readers build intuition and grasp the problem. Despite its technical depth, the treatment maintains conceptual clarity.

Overall, the contributions and presentation are solid. The paper introduces the noisy coreset construction in $d$-dimensional Euclidean space, illustrating the challenges it poses, and providing nice results on the coreset size for this setting.

Questions

  • Reading the paper, I was wondering whether the mergeability property of coresets would be preserved under the new error metric, and I am glad you present this as a challenge in the last section. Do you have some intuition on which kind of instances mergeability might cease to hold?

  • Your noise models are currently defined only for Euclidean space. Have you considered extending this framework to general metric spaces, for example, by allowing point perturbations within a ball of a certain radius, where the radius itself is drawn from a distance distribution?

Limitations

Yes

Final Rating Justification

Thank you for the response to all of my questions and comments. They explain everything I asked clearly and made me understand even better where the additional challenges in future work lie. I, thus, confirm my score.

Formatting Issues

No formatting issues noticed.

Author Response

We thank you for your detailed and encouraging review. We especially appreciate your recognition of our technical contributions. Below we provide detailed responses to your questions.

Questions

Reading the paper, I was wondering whether the mergeability property of coresets would be preserved under the new error metric, and I am glad you present this as a challenge in the last section. Do you have some intuition on which kind of instances mergeability might cease to hold?

Thank you for this valuable question. In the ideal case, mergeability means that if two coresets $S_1, S_2$ for disjoint datasets $P_1, P_2$ each satisfy $\mathrm{Err}_\alpha(S_\ell, P_\ell) \leq \varepsilon$, then their union also satisfies

$$\mathrm{Err}_\alpha(S_1 \cup S_2,\; P_1 \cup P_2) \leq \varepsilon.$$

With our new metric $\mathrm{Err}_\alpha$, this guarantee can fail—particularly when $S_1$ and $S_2$ summarize datasets with significantly different cluster structures. In such cases, the optimal center set for $S_1 \cup S_2$ may differ substantially from the union of the individual optima, and the combined coreset may not encode enough information to capture this emergent structure.

When does mergeability still hold?

Mergeability remains plausible when the two coresets are structurally similar. Let us define
$\mathcal{C}_{\beta}(Q) = \{C : \mathrm{cost}(Q, C) \leq \beta \cdot \mathrm{OPT}_Q\}$. If there exists $1 \leq \alpha' \leq \alpha$ such that

$$\mathcal{C}_{\alpha'}(S_1) \subseteq \mathcal{C}_\alpha(S_2) \quad \text{and} \quad \mathcal{C}_{\alpha'}(S_2) \subseteq \mathcal{C}_\alpha(S_1),$$

then $S_1$ and $S_2$ approximately share the same set of near-optimal solutions. This overlap enables a weaker form of mergeability under $\mathrm{Err}_\alpha$.

Claim (Weak mergeability under $\mathrm{Err}_\alpha$):
Suppose $\mathrm{Err}_\alpha(S_\ell, P_\ell) \leq \varepsilon$ for $\ell = 1,2$, and the overlap condition above holds. Define:

$$\kappa = \frac{\min\{ \mathrm{OPT}_{S_1}, \mathrm{OPT}_{S_2} \}}{\mathrm{OPT}_{S_1 \cup S_2}}, \quad \tau = \max\left\{ \frac{\mathrm{OPT}_{S_1}/\mathrm{OPT}_{P_1}}{\mathrm{OPT}_{S_2}/\mathrm{OPT}_{P_2}}, \frac{\mathrm{OPT}_{S_2}/\mathrm{OPT}_{P_2}}{\mathrm{OPT}_{S_1}/\mathrm{OPT}_{P_1}} \right\}.$$

Then:

$$\mathrm{Err}_{1 + (\alpha - 1)\kappa}(S_1 \cup S_2,\; P_1 \cup P_2) < \alpha' \cdot \tau \cdot (1 + \varepsilon) - 1.$$

In the limiting case where $\alpha', \alpha \to 1$ and $\tau \to 1$, this bound recovers the ideal mergeability condition:

$$\mathrm{Err}_\alpha(S_1 \cup S_2,\; P_1 \cup P_2) \leq \varepsilon.$$

We will include this intuition, the formal claim, and a sketch of the proof in the appendix of the final version.

Your noise models are currently defined only for Euclidean space. Have you considered extending this framework to general metric spaces, for example, by allowing point perturbations within a ball of a certain radius, where the radius itself is drawn from a distance distribution?

Thank you for this insightful question. Extending our framework to general metric spaces is a fascinating direction. As you note, directly perturbing points is often ill-defined in a general metric space $M = (\mathcal{X}, \mathrm{dist})$ due to the absence of a coordinate system. A more natural alternative is to introduce noise at the level of the distance function itself.

We outline a potential model below.

Let $\theta \in [0,1]$ be a noise parameter, and let $D$ be a distribution on $\mathbb{R}$ with mean 0 and variance 1. For each pair $(x, y) \in \mathcal{X} \times \mathcal{X}$, define the noisy pairwise distance $\widehat{\mathrm{dist}}(x, y)$ as follows:

  • With probability $\theta$, set $\widehat{\mathrm{dist}}(x, y) = \mathrm{dist}(x, y) + \xi_{(x,y)}$, where $\xi_{(x,y)} \sim D$ i.i.d.;
  • With probability $1 - \theta$, set $\widehat{\mathrm{dist}}(x, y) = \mathrm{dist}(x, y)$.

This model reflects additive noise in pairwise distances, analogous to our perturbation model in Euclidean settings.
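To make the model concrete, here is a minimal sketch of generating such noisy pairwise distances from a given distance matrix. The function name and the choice of $N(0,1)$ noise are illustrative; any mean-0, variance-1 distribution $D$ fits the model above.

```python
import numpy as np

def noisy_distance_matrix(D, theta, rng=None):
    """Perturb each unordered pair's distance with probability theta by additive
    N(0, 1) noise. The output is symmetric with zero diagonal, but it need not
    satisfy the triangle inequality (entries may even become negative)."""
    rng = np.random.default_rng() if rng is None else rng
    n = D.shape[0]
    iu = np.triu_indices(n, k=1)                  # one entry per unordered pair
    perturb = rng.random(len(iu[0])) < theta      # which pairs receive noise
    xi = rng.standard_normal(len(iu[0]))          # xi_{(x,y)} drawn from N(0, 1)
    D_hat = np.zeros_like(D, dtype=float)
    D_hat[iu] = D[iu] + perturb * xi
    return D_hat + D_hat.T
```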

However, a significant challenge arises: the perturbed function $\widehat{\mathrm{dist}}$ may no longer satisfy the triangle inequality, and thus may not define a valid metric. This opens up a rich and challenging extension—coreset construction for noisy semimetric spaces.

Analyzing how our framework and theoretical guarantees extend under this relaxed setting (e.g., when only approximate triangle inequalities hold) is a non-trivial but important direction for future work.

In the final version, we will add a section to the appendix discussing this potential generalization, including the proposed distance-noise model and the resulting difficulties. We will also list this as a key open direction in the conclusion.

Comment

Thank you for addressing all of my questions and comments. Your responses were clear and helped deepen my understanding, particularly regarding the challenges that remain for future work. I therefore would like to confirm my original score.

Official Review
Rating: 5

The paper investigates how to adapt a coreset produced from a noise-corrupted dataset for the underlying clean data in $(k,z)$-clustering. It shows that the classical strong-coreset error $\mathrm{Err}$ can become vacuous once noise is added. To address this issue, this paper then proposes a relaxed metric $\mathrm{Err}_{\alpha}$ that considers only center sets whose cost on the coreset is within an $\alpha$-factor of the optimum. A separation construction illustrates that $\mathrm{Err}_{\alpha}$ can be exponentially smaller than $\mathrm{Err}$. Under mild assumptions on cost-stability, balanced clusters and bounded radius, the authors prove that any standard sensitivity-sampling coreset built on the noisy data remains a good coreset for the clean data when evaluated with $\mathrm{Err}_{\alpha}$, maintaining an $\tilde O(k\log k/\varepsilon)$ coreset size. Experiments on several public datasets with Gaussian noise confirm that the new metric allows 20–35% smaller coresets while preserving clustering quality.

Strengths and Weaknesses

One of the key strengths is the conceptual clarity and practical relevance of the new error notion, which focuses evaluation on solutions practitioners actually search for rather than on all possible centre sets. The theoretical analysis is careful, technically sound and matches empirical observations; the resulting algorithm is simple to implement and inherits the best-known size guarantees. Presentation is clear, with motivating examples, lucid proofs and reproducible experiments that cover multiple noise models and real data sets.

The main weaknesses stem from the assumptions behind the theory. Cost-stability, well separation and balance may be violated in highly skewed or overlapping clusters, and the noise model is i.i.d. additive with sub-exponential tails, leaving correlated or heavy-tailed settings unaddressed. In addition, the need for a constant-factor estimate of the clean optimal cost is glossed over.

Questions

I would appreciate clarification on the following issues:

  1. How does the guarantee degrade when the cost-stability constant $\gamma$ becomes small and clusters are less well separated?

  2. What is the practical impact if the estimate of the clean optimal cost is off by a factor of two or more?

Limitations

Since this is mainly a theoretical result, there are no limitations or potential negative societal impacts.

Final Rating Justification

The rebuttal addresses all concerns with solid theoretical arguments and well-supported empirical explanations. The clarification on cost-stability effectively explains the trade-off between worst-case guarantees and practical performance, reinforcing confidence in the method’s robustness under challenging conditions. The discussion of how approximate clean-cost estimates influence the analysis is technically sound, demonstrating that while theoretical bounds may loosen, the practical coreset quality remains consistently high. Overall, this is a well-motivated, theoretically solid, and interesting paper, and I recommend acceptance while keeping my initial score.

Formatting Issues

There are no major formatting issues for this paper.

Author Response

Thank you for your thoughtful and constructive feedback. We are grateful for your recognition of the novelty, clarity, and practical relevance of our proposed error measure. Below, we address your questions and comments:

Response to questions

How does the guarantee degrade when the cost-stability constant $\gamma$ becomes small and clusters are less well separated?

This is an excellent question. While the theoretical guarantees indeed weaken as γ\gamma decreases, we find that the practical performance of our method remains robust even in such regimes.

Theoretical degradation.
As shown in Appendix E.1, in the worst case, our new error measure $\mathsf{Err}_\alpha$ can exceed the classical measure $\mathsf{Err}$ when $\gamma$ is small (e.g., $\gamma = 1$). This illustrates why the cost-stability assumption is essential for our formal guarantees—without it, the superiority of $\mathsf{Err}_\alpha$ cannot be established in theory.

Empirical robustness.
Despite this, our algorithm $\mathrm{CN}_\alpha$, which uses $\mathsf{Err}_\alpha$, consistently outperforms the $\mathsf{Err}$-based baseline in practice—even on datasets with significant cluster overlap. For example, on Adult and Census1990, where the estimated cost-stability is as low as $\gamma \leq 0.07$, our method still achieves smaller coresets while maintaining clustering quality (see Table 2).

We will incorporate a brief discussion of this tradeoff in the theoretical section of the final version to clarify the contrast between worst-case bounds and empirical behavior.

What is the practical impact if the estimate of the clean optimal cost is off by a factor of two or more?

Thank you for the insightful question. A $c$-estimate ($c \geq 2$) of the optimal cost $\mathrm{OPT}_P$ on $P$ worsens the error guarantee for $\mathrm{CN}$ (using the $\mathrm{Err}$ measure) and hardens the data assumption for $\mathrm{CN}_\alpha$ (using the $\mathrm{Err}_\alpha$ measure). Below we illustrate these points.

Suppose we are given a center set $\widehat{C}$ for coreset construction, whose clustering cost is $\mathrm{cost}(\widehat{P},\widehat{C}) = c \cdot \mathrm{OPT}_P$ for some $c > 1$ ($c$-estimate), where $P$ is the underlying dataset and $\widehat{P}$ is a given noisy dataset.

Impact on $\mathrm{CN}$: Let $S$ be a coreset derived from $\mathrm{CN}$. Employing a $c$-approximate estimate $\widehat{C}$ can weaken the quality bound of $S$ by up to a factor of $c$ in the following sense: With an exact estimate of $\mathrm{OPT}_P$, Theorem 3.1 provides a bound on the worst-case quality of an $\alpha$-approximate center set of $S$ as

$$r_P(S,\alpha) \leq \left(1+\varepsilon+O\left(\frac{\theta n d}{\mathrm{OPT}_P}+\sqrt{\frac{\theta n d}{\mathrm{OPT}_P}}\right)\right)^2 \cdot \alpha,$$

where $\varepsilon$ is an error parameter, $\theta$ is the noise level, and $d$ is the dimension. In contrast, using $\mathrm{CN}$ with the approximate estimate $\widehat{C}$ yields the following bound for the derived coreset $S$:

$$r_P(S,\alpha) \leq \left(1+c\cdot\varepsilon+O\left(\frac{\theta n d}{\mathrm{OPT}_P}+\sqrt{\frac{\theta n d}{\mathrm{OPT}_P}}\right)\right)^2 \cdot \alpha.$$

Compared to the exact estimation case, this bound increases the error term from $\varepsilon$ to $c\cdot \varepsilon$. This arises because the importance sampling procedure from [9], used in $\mathrm{CN}$, introduces an additional error term $\varepsilon\cdot\frac{\mathrm{cost}(P,\widehat{C})}{\mathrm{OPT}_P} = c\cdot \varepsilon$ in its analysis, whereas the term $O\left(\frac{\theta n d}{\mathrm{OPT}_P}+\sqrt{\frac{\theta n d}{\mathrm{OPT}_P}}\right)$ results from $\mathrm{Err}(\widehat{P},P)$ and remains unaffected by the quality of the estimation.

Impact on $\mathrm{CN}_\alpha$: Let $S$ be a coreset derived from $\mathrm{CN}_\alpha$. Using a $c$-approximate estimate $\widehat{C}$ weakens Assumption 3.2 by a factor of $c$; specifically, the required cost-stability constant $\gamma$ increases by this factor. Consequently, the range of datasets satisfying the theoretical bounds of $\mathrm{CN}_\alpha$ (as stated in Theorem 3.3) shrinks. This occurs because the performance guarantee of $\mathrm{CN}_\alpha$ relies on correctly identifying $k$ separated clusters, a task complicated by the approximation error in $\widehat{C}$. Increasing $\gamma$ by a factor of $c$ ensures these clusters are accurately identified, restoring the desired guarantees.

Additionally, we would like to note that the above analysis considers a worst-case guarantee. In practice, using a $c$-estimate may still yield a coreset of quality comparable to that obtained with an exact estimate.

We will add the above analysis to the appendix in the final version.

Response to comments

the noise model is i.i.d. additive with sub-exponential tails, leaving correlated or heavy-tailed settings unaddressed

Thank you for raising this point. We agree that extending the framework to handle heavy-tailed and correlated noise is an interesting and important direction for future work. Our paper already makes some progress in this direction by considering noise that is non-independent across dimensions, where each point $p$ is independently appended with a noise vector $\xi_p \in \mathbb{R}^d$ whose covariance matrix is $\Sigma \in \mathbb{R}^{d\times d}$. Note that under noise model II, $\Sigma = \sigma^2 \cdot I_d$ when each coordinate-noise distribution is $D_j = N(0,\sigma^2)$ (Lines 1309–1312).
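As an illustration, a minimal sketch of sampling a noisy dataset under this correlated-noise variant. We carry over the probability-$\theta$ perturbation from the main model and use multivariate Gaussian noise with covariance $\Sigma$; both choices, and the function name, are assumptions for the sketch.

```python
import numpy as np

def perturb_correlated(P, theta, Sigma, rng=None):
    """Perturb each point independently with probability theta by a noise vector
    xi ~ N(0, Sigma); Sigma = sigma^2 * I recovers the isotropic Gaussian case."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = P.shape
    hit = rng.random(n) < theta                                # which points are perturbed
    xi = rng.multivariate_normal(np.zeros(d), Sigma, size=n)   # one noise draw per point
    return P + hit[:, None] * xi
```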

Theoretical extension: As shown in Appendix F.1, we formally extend our analysis to this noise model and achieve a bound for $r_P$.

Empirical robustness: Appendix G.3 further includes experiments under non-independent Gaussian noise, demonstrating that our algorithm remains robust in such noisy settings.

We will add the extension to correlated or heavy-tailed noise models as future directions in the final version.

Comment

The rebuttal effectively addresses all raised concerns with clear theoretical justifications and empirical evidence. The response on cost-stability clarifies the trade-off between worst-case guarantees and practical performance. This strengthens confidence in the method’s robustness even under challenging conditions.

The explanation of how approximate estimates of the clean cost affect the analysis is technically sound. The authors show that while the theoretical bounds weaken, practical coreset quality remains high. This is a good paper and I will keep my initial score.

Final Decision

This paper introduces a new error notion for coreset construction under noise, supported by careful theoretical analysis and strong empirical validation. Reviewers appreciated the conceptual clarity, practical motivation, and technically sound proofs, as well as the fact that the method is simple to implement and reproduces best-known guarantees while adapting to noisy settings. The paper is also well-presented, with clear examples, intuitive explanations, and experiments across multiple datasets and noise models. While some reviewers noted limitations in the assumptions (e.g., stability, noise model) and raised questions about the surrogate error measure, these concerns were largely addressed by the rebuttal, which clarified trade-offs and provided convincing justifications. Overall, the consensus is that this work makes a meaningful and timely contribution to the clustering/coreset literature, addressing a practically important but underexplored problem.