PaperHub
Overall rating: 6.0 / 10
Decision: Poster · 4 reviewers
Ratings: 3, 3, 4, 5 (min 3, max 5, std 0.8)
Average confidence: 2.8
Novelty: 3.0 · Quality: 2.5 · Clarity: 2.8 · Significance: 2.5
NeurIPS 2025

A Beyond-Worst-Case Analysis of Greedy k-means++

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29

Abstract

$k$-means++ and the related greedy $k$-means++ algorithm are celebrated algorithms that efficiently compute seeds for Lloyd's algorithm. Greedy $k$-means++ is a generalization of $k$-means++ in which, in each iteration, a new seed is greedily chosen among $\ell \geq 2$ sampled points, as opposed to the single seed sampled in $k$-means++. While empirical studies consistently show the superior performance of greedy $k$-means++, making it the preferred method in practice, a discrepancy exists between theory and practice: no theoretical justification currently explains this improved performance. Indeed, the prevailing theory suggests that greedy $k$-means++ exhibits worse performance than $k$-means++ in worst-case scenarios. This paper presents an analysis demonstrating that the greedy algorithm outperforms $k$-means++ for a natural class of well-separated instances with exponentially decaying distributions, such as Gaussians, specifically when $\ell = \Theta(\log k)$, a common parameter setting in practical applications.
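To make the procedure concrete, here is a minimal numpy sketch of greedy $k$-means++ seeding (our illustration, not the authors' code): each round draws $\ell$ candidates by $D^2$ sampling and greedily keeps the one that minimizes the resulting potential; $\ell = 1$ recovers standard $k$-means++.

```python
import numpy as np

def greedy_kmeans_pp(X, k, ell, rng=None):
    """Seed k centers from X; ell = 1 recovers standard k-means++."""
    rng = rng or np.random.default_rng()
    n = X.shape[0]
    idx = [int(rng.integers(n))]                    # first seed: uniform
    d2 = ((X - X[idx[0]]) ** 2).sum(axis=1)         # squared dist to nearest seed
    for _ in range(k - 1):
        cand = rng.choice(n, size=ell, p=d2 / d2.sum())   # D^2 sampling
        # potential if each candidate were added as the next center
        cand_d2 = np.minimum(d2[:, None],
                             ((X[:, None, :] - X[cand][None, :, :]) ** 2).sum(-1))
        best = int(np.argmin(cand_d2.sum(axis=0)))  # greedy: largest cost drop
        idx.append(int(cand[best]))
        d2 = cand_d2[:, best]
    return X[idx]
```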
Keywords
k-means · clustering · approximation algorithms

Reviews and Discussion

Official Review (Rating: 3)

The paper focuses on settings in which it can be proven that greedy k-means++ outperforms k-means++. It was previously shown that greedy k-means++ has worse performance than k-means++ in the general setting (without any assumptions on the input).

The main result is the definition of a natural class of instances on which greedy k-means++ is better. The authors propose instances where the clusters have a similar number of points (within a constant factor), the distances between clusters are a factor $k^\Delta$ larger than the distances inside each cluster, and the points of each cluster follow an exponentially decaying distribution.

Strengths and Weaknesses

The family of instances proposed in this work is similar to the well-clusterable / separable instances introduced in previous works, which makes it interesting. The paper claims that this setting is common in practical applications without providing a strong reason; in most practical instances this is not the case. This raises the concern that the results in this work do not justify why greedy k-means++ is superior to k-means++, which is the main goal of this paper.

The main result is a clear separation between k-means++ and greedy k-means++, which follows from Theorems 1 and 2. This is very interesting and, to the best of my knowledge, the only known case where greedy k-means++ outperforms. The main disadvantage is that it works only for the case where $\ln k + O(1)$ extra points are sampled, which is a special case of greedy k-means++. The authors mention in line 156 that this is a common assumption for greedy k-means++ without providing any citations or reasons. This is a very important assumption for this work and needs to be explained.

Questions

Please address the weaknesses above.

Limitations

The write-up of the paper needs to be improved. The limitation of the main result was not clear to me when reading the introduction and some of the theorems; it was only mentioned as a common assumption, which, to the best of my knowledge, is not the case.

Final Justification

I think the paper is overall below the expectation for this conference.

Formatting Concerns

No.

Author Response

We thank the reviewer for the helpful and constructive comments. We greatly appreciate the time and effort you put into this review. Below, we first summarize our contributions before addressing the reviewer's specific questions.

The k-means algorithm, by its nature of minimizing squared Euclidean distances to centroids, is most effective and meaningful for data organized into roughly spherical, well-separated clusters of similar sizes. For instances with non-spherical or arbitrary structures, other clustering methods (e.g., GMMs) are typically preferred, as k-means is known to perform poorly. Therefore, the most relevant question is: which seeding algorithm performs best on these "k-means-friendly" instances? This long-standing question has been open since the k-means++ algorithm was introduced by Arthur and Vassilvitskii (2007), a paper with about 13,000 citations as of today. In this seminal paper introducing the theoretical guarantees for k-means++, they additionally posed the open question of analyzing the greedy variant, mentioning its superior empirical performance.

Our EWW model is designed to formalize precisely this class of problems. It captures well-separated clusters with points concentrated around their centers (a property of sub-exponential distributions like Gaussians). The central puzzle, which our work addresses, is that even on these ideal instances for k-means, the standard k-means++ seeding algorithm can be provably suboptimal, with an $\Omega(\log k)$ approximation ratio.

Our main contribution is to show that the greedy k-means++ algorithm overcomes this deficiency. We prove it achieves a significantly improved $O((\ln \ln k)^2)$ approximation on EWW instances. In short, our work demonstrates the superiority of greedy k-means++ on exactly the class of problems where k-means is the appropriate tool in the first place. This provides the first rigorous theoretical explanation for the algorithm's widespread empirical success and its adoption in major libraries. For instance, our use of $\ell = \ln k + \Theta(1)$ candidates is not just a theoretical convenience; it directly reflects the default implementation in Scikit-learn ($\ell = 2 + \lfloor \ln(k) \rfloor$), as documented in Pedregosa et al. (JMLR 2011).

Another advantage of studying EWW instances is that it clearly reveals the difference between the two algorithms: k-means++ fails to select points from all clusters with a non-trivial probability. In contrast, we show that the greedy algorithm selects one point as a center from each cluster with a high probability. In essence, our work uses the assumptions that are implicitly employed to justify the use of k-means clustering and reveals where the greedy algorithm surpasses k-means++.

Finally, we remark that the properties we assume are naturally satisfied by the following random construction. Sample k centers within a hypercube $[0,1]^d$ uniformly at random, where $k^{\delta} \leq d \leq k$ for some constant $\delta$. Each center induces a cluster with an expected radius of $\Theta(1)$. Clusters can be heterogeneous; they only need to follow sub-exponential distributions and have sufficient mass of points away from their centers. The numbers of points in the clusters can differ by any constant factor. Therefore, these properties are naturally satisfied, with high probability, by a simple random construction of instances.
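As a concrete illustration of this construction (a minimal sketch under the stated assumptions; the function and parameter names are ours, not the paper's), one could sample such an instance as follows:

```python
import numpy as np

def random_eww_like_instance(k, d, min_size=50, rng=None):
    """Sample k uniform centers in [0,1]^d with Gaussian (sub-exponential)
    clusters of expected radius Theta(1); cluster sizes differ by at most
    a constant factor. Intended for the regime k^delta <= d <= k."""
    rng = rng or np.random.default_rng()
    centers = rng.uniform(size=(k, d))                    # centers in [0,1]^d
    sizes = rng.integers(min_size, 2 * min_size, size=k)  # within a factor 2
    X = np.concatenate([c + rng.standard_normal((m, d)) / np.sqrt(d)
                        for c, m in zip(centers, sizes)])
    return X, centers
```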

Q: Concern about whether the number of candidates is $\Theta(\log k)$ in reality.

A: We remark that the number of candidates for the greedy k-means++ algorithm used in the Python library (Scikit-learn) is 2 + int(np.log(n_clusters)). The following paper is the reference for the greedy k-means++ algorithm in this library.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

(This is mentioned in Lines 73-74 and 122-125 of the paper.)

In the worst case, it has already been proven that the greedy k-means++ algorithm is worse than the k-means++ algorithm when a large number of candidates is sampled. This, however, contradicts the empirical evidence. This paper aims to bridge this gap from the theoretical view, and we chose $\ell = \ln k + \Theta(1)$ following the parameter choice used in practice.
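For reference, scikit-learn exposes this seeding through sklearn.cluster.kmeans_plusplus (available since version 0.24), whose n_local_trials parameter is exactly the number of candidates $\ell$ and defaults to 2 + int(np.log(n_clusters)); setting it to 1 recovers plain k-means++. A minimal usage sketch on placeholder data:

```python
import numpy as np
from sklearn.cluster import kmeans_plusplus

X = np.random.default_rng(0).standard_normal((10_000, 16))  # placeholder data
k = 100

# Greedy seeding: n_local_trials=None uses the default 2 + int(log(k)).
greedy_centers, _ = kmeans_plusplus(X, n_clusters=k, random_state=0)

# Plain k-means++: a single candidate per round.
plain_centers, _ = kmeans_plusplus(X, n_clusters=k, random_state=0,
                                   n_local_trials=1)
```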

Comment

I want to thank the authors for the rebuttal.

I disagree with the authors about the sentence: "Therefore, the most relevant question is: which seeding algorithm performs best on these 'k-means-friendly' instances?". From the theoretical point of view, the main interesting problem is k-means in the general setting, not k-means-friendly instances. From a practical point of view, the literature still suggests that the general setting is more interesting than a special case. Moreover, Arthur and Vassilvitskii (2007) study the general case.

There are parts of the rebuttal that are not related to my review, and I think they were pasted here accidentally.

Overall, I do not think the rebuttal addresses my concerns, and I keep my score.

Comment

Thank you for your continued engagement.

We appreciate the point that understanding k-means in the general setting is of broad theoretical interest. However, as the original Arthur and Vassilvitskii (2007) paper notes, worst-case inputs are often not representative of real-world data, particularly for k-means, which tends to perform well empirically despite poor worst-case guarantees. Our work is motivated by this gap: the greedy variant of k-means++ is known to perform worse in general instances (as you note), yet empirically it often outperforms k-means++. This suggests that the instances encountered in practice are not adversarial but instead exhibit some form of structure.

We therefore aim to understand when and why greedy k-means++ performs well, identifying a broad and realistic class of instances (EWW) where it provably improves over k-means++. In doing so, we provide the first theoretical separation in favor of greedy k-means++, using assumptions that are standard in beyond-worst-case clustering (e.g., separation and sub-exponential decay). These assumptions are not exotic as they reflect precisely the structure under which k-means is a good problem framework.

We agree that general settings are important, but due to known hardness and lower bounds, such settings cannot yield positive results for greedy k-means++. Thus, we believe the most fruitful direction is to analyze meaningful structured instances, which our work contributes to.

We hope this clarifies our motivation and the positioning of the paper.

Official Review (Rating: 3)

k-means++ is an important algorithm for picking the initial centroids for the k-means algorithm, and greedy k-means++ is an improved variant of it. This paper develops the theoretical foundation for why greedy k-means++ is better than the standard k-means++ algorithm at picking seeds for k-means.

Strengths and Weaknesses

The paper makes strong assumptions on the dataset, combining two assumptions from previous papers with a new one. I feel the third assumption is not intuitive; I am not sure why the underlying clusters' centroids would have this behavior.

The result on the approximation ratio is very promising, lowering the bound from $\ln k$ to $(\ln \ln k)^2$. Another question I feel is important is why you care about the parameter $k$, instead of the total number of vectors or the vector dimension. Which parameter is the leading bottleneck in realistic settings?

I think what this paper is lacking is a demonstration that realistic datasets actually meet the assumptions made in the theoretical claim. Currently, the evaluation is based on synthetic datasets only, with artificial distributions. I hope the paper can study whether the approximation bound still applies on real datasets.

Questions

From a proof-technique point of view, which particular technique do you believe is most novel compared to the literature?

Will the assumptions hold on realistic datasets?

Limitations

Yes.

Final Justification

The response is clear. It would further strengthen the paper if the authors could add a section demonstrating that realistic datasets satisfy the new assumptions introduced in the paper, particularly Assumption 3.

Formatting Concerns

No concern.

Author Response

We thank the reviewer for the helpful and constructive comments. We greatly appreciate the time and effort you put into this review.

Q1: Will the assumptions hold on realistic datasets?

A: The k-means algorithm, by its nature of minimizing squared Euclidean distances to centroids, is most effective and meaningful for data organized into roughly spherical, well-separated clusters of similar sizes. For instances with non-spherical or arbitrary structures, other clustering methods (e.g., GMMs) are typically preferred, as k-means is known to perform poorly. Therefore, the most relevant question is: which seeding algorithm performs best on these "k-means-friendly" instances? This long-standing question has been open since the k-means++ algorithm was introduced by Arthur and Vassilvitskii (2007), a paper with about 13,000 citations as of today. In this seminal paper introducing the theoretical guarantees for k-means++, they additionally posed the open question of analyzing the greedy variant, mentioning its superior empirical performance.

Our EWW model is designed to formalize precisely this class of problems. It captures well-separated clusters with points concentrated around their centers (a property of sub-exponential distributions like Gaussians). The central puzzle, which our work addresses, is that even on these ideal instances for k-means, the standard k-means++ seeding algorithm can be provably suboptimal, with an $\Omega(\log k)$ approximation ratio.

Our main contribution is to show that the greedy k-means++ algorithm overcomes this deficiency. We prove it achieves a significantly improved $O((\ln \ln k)^2)$ approximation on EWW instances. In short, our work demonstrates the superiority of greedy k-means++ on exactly the class of problems where k-means is the appropriate tool in the first place. This provides the first rigorous theoretical explanation for the algorithm's widespread empirical success and its adoption in major libraries. For instance, our use of $\ell = \Theta(\ln k)$ candidates is not just a theoretical convenience; it directly reflects the default implementation in Scikit-learn ($\ell = 2 + \lfloor \ln(k) \rfloor$), as documented in Pedregosa et al. (JMLR 2011).

Another advantage of studying EWW instances is that it clearly reveals the difference between the two algorithms: k-means++ fails to select points from all clusters with a non-trivial probability. In contrast, we show that the greedy algorithm selects one point as a center from each cluster with a high probability. In essence, our work uses the assumptions that are implicitly employed to justify the use of k-means clustering and reveals where the greedy algorithm surpasses k-means++.

Finally, we remark that the properties we assume are naturally satisfied by the following random construction. Sample k centers within a hypercube $[0,1]^d$ uniformly at random, where $k^{\delta} \leq d \leq k$ for some constant $\delta$. Each center induces a cluster with an expected radius of $\Theta(1)$. Clusters can be heterogeneous; they only need to follow sub-exponential distributions and have sufficient mass of points away from their centers. The numbers of points in the clusters can differ by any constant factor. Therefore, these properties are naturally satisfied, with high probability, by a simple random construction of instances.

Q2: From a proof-technique point of view, which particular technique do you believe is most novel compared to the literature?

A: The challenge in providing a theoretical understanding of why greedy k-means++ outperforms k-means++ in practice is that the former algorithm is provably worse than the latter even on well-separated instances. We overcome this challenge by introducing a natural instance class called EWW instances, which captures instances where each cluster follows a sub-exponential distribution, such as a Gaussian.

Alongside seeking a suitable instance definition, analyzing the approximation ratio is also not easy. One of the main barriers to analyzing greedy k-means++ is that, during the execution of the algorithm, the probability of each point being selected in a later iteration depends on the decisions made in earlier iterations. As a result, there exist complex correlations between different iterations of the algorithm. To overcome this challenge, we prove that the correlations between iterations of the greedy k-means++ algorithm can be approximately treated as independent by carefully setting the parameter values used in the analysis. This proof relies crucially on the exponential decay property of regular instances and the use of concentration inequalities. We hope this idea is useful in future related works.

Comment

Thanks for the author response. The response is clear. I think it would be great if the paper could add a section showing that realistic datasets fit the new assumptions the paper makes, especially Assumption 3.

Official Review (Rating: 4)

This submission considers the gap between the strong empirical performance of the greedy k-means++ seeding algorithm and its not-so-strong worst-case theoretical guarantees. Previous theoretical work has shown that it can actually perform worse in the worst case. The authors propose a new analysis framework that goes beyond worst-case scenarios by focusing on a natural and meaningful class of instances, which they call EWW (Exponentially distributed, Well-separated, Well-spread).

Strengths and Weaknesses

Strengths: This paper studies an interesting problem. It introduces a well-motivated beyond-worst-case framework using the EWW instance model. The paper establishes an asymptotic separation between the two algorithms on EWW instances, offering some improvement in approximation quality.

Weaknesses: While the paper offers strong theoretical insights, its applicability is limited by the restrictive nature of the EWW model, which assumes exponential tails, specific cluster separation, and balanced sizes, conditions that may not hold in many real-world datasets where greedy k-means++ still performs well. The key result, an improved approximation ratio, is only proven for a specific range of the separation parameter (when the separation exponent is at most 1/2). For more widely separated clusters, the authors only show that greedy k-means++ covers all clusters with high probability, without translating this into a better k-means objective value. The empirical evaluation is entirely on synthetic data constructed to match the EWW assumptions. While this aligns with the theory, it does little to demonstrate whether the proposed model explains the algorithm's success on real-world data. The paper would be substantially stronger if it included experiments on benchmark datasets.

Questions

Can the authors provide a more in-depth discussion or empirical analysis of how realistic the EWW assumptions are in practice? For example, are there real-world datasets where the distributional properties (e.g., exponential tails, separation, balance) approximately hold?

Limitations

The experimental evaluation is limited to synthetic data tailored to the theoretical assumptions, without validation on real-world benchmarks or analysis of how well those datasets align with the EWW conditions.

Final Justification

While the EWW model's assumptions remain restrictive and real-world validation is lacking, the authors' rebuttal significantly strengthened the theoretical contribution by clarifying that the improved approximation ratio for greedy k-means++ holds for all separation parameters. I maintain my borderline acceptance assessment.

Formatting Concerns

N/A

Author Response

We thank the reviewer for the helpful and constructive comments. We greatly appreciate the time and effort you put into this review.

Q1: Regarding the assumptions of the EWW instances.

A: The k-means algorithm, by its nature of minimizing squared Euclidean distances to centroids, is most effective and meaningful for data organized into roughly spherical, well-separated clusters of similar sizes. For instances with non-spherical or arbitrary structures, other clustering methods (e.g., GMMs) are typically preferred, as k-means is known to perform poorly. Therefore, the most relevant question is: which seeding algorithm performs best on these "k-means-friendly" instances? This long-standing question has been open since the k-means++ algorithm was introduced by Arthur and Vassilvitskii (2007), a paper with about 13,000 citations as of today. In this seminal paper introducing the theoretical guarantees for k-means++, they additionally posed the open question of analyzing the greedy variant, mentioning its superior empirical performance.

Our EWW model is designed to formalize precisely this class of problems. It captures well-separated clusters with points concentrated around their centers (a property of sub-exponential distributions like Gaussians). The central puzzle, which our work addresses, is that even on these ideal instances for k-means, the standard k-means++ seeding algorithm can be provably suboptimal, with an $\Omega(\log k)$ approximation ratio.

Our main contribution is to show that the greedy k-means++ algorithm overcomes this deficiency. We prove it achieves a significantly improved $O((\ln \ln k)^2)$ approximation on EWW instances. In short, our work demonstrates the superiority of greedy k-means++ on exactly the class of problems where k-means is the appropriate tool in the first place. This provides the first rigorous theoretical explanation for the algorithm's widespread empirical success and its adoption in major libraries. For instance, our use of $\ell = \Theta(\ln k)$ candidates is not just a theoretical convenience; it directly reflects the default implementation in Scikit-learn ($\ell = 2 + \lfloor \ln(k) \rfloor$), as documented in Pedregosa et al. (JMLR 2011).

Another advantage of studying EWW instances is that it clearly reveals the difference between the two algorithms: k-means++ fails to select points from all clusters with a non-trivial probability. In contrast, we show that the greedy algorithm selects one point as a center from each cluster with a high probability. In essence, our work uses the assumptions that are implicitly employed to justify the use of k-means clustering and reveals where the greedy algorithm surpasses k-means++.

Finally, we remark that the properties we assume are naturally satisfied by the following random construction. Sample k centers within a hypercube $[0,1]^d$ uniformly at random, where $k^{\delta} \leq d \leq k$ for some constant $\delta$. Each center induces a cluster with an expected radius of $\Theta(1)$. Clusters can be heterogeneous; they only need to follow sub-exponential distributions and have sufficient mass of points away from their centers. The numbers of points in the clusters can differ by any constant factor. Therefore, these properties are naturally satisfied, with high probability, by a simple random construction of instances.

Q2: Regarding the range of the separation exponent $\theta$.

A: We remark that the approximation ratio of $O((\ln \ln k)^2)$ for the greedy k-means++ algorithm holds for all $\theta \in (0,1]$. We proved that the k-means++ algorithm achieves an $\Omega(\ln k)$-approximation for any EWW point set with $\theta \in (0,1/2]$. This implies that the k-means++ algorithm is also an $\Omega(\ln k)$-approximation for the general EWW point set, because the approximation ratio only cares about the worst case. Thus, our analysis separates these two algorithms on the general EWW point set in terms of the approximation ratio, independently of the value of $\theta$.

When $\theta > 1/2$, the pairwise distances among clusters are very large, and thus k-means++ yields an almost optimum seeding. So, in that case, there is not much room for improving k-means++. In this case, we additionally provide an indirect explanation for why the greedy approach is effective, based on the concept of coverage probability.

Comment

Thank you to the authors for the clarifications. Having reviewed the author responses, I note that my initial score was already slightly positive, and I will maintain it.

Official Review (Rating: 5)

This paper examines the theoretical foundations of the greedy k-means++ algorithm, a practical seeding method for k-means clustering that has been shown to outperform the widely-used k-means++ initialization in many empirical cases. While k-means++ provides a provable $O(\log k)$ approximation guarantee in expectation, recent findings indicate that the greedy variant can perform worse in the worst-case scenario, despite its superior empirical performance. This paper seeks to resolve this contradiction by presenting a beyond-worst-case analysis for a specific class of realistic clustering instances.

The authors focus on a category of inputs they refer to as EWW instances (Exponentially Decaying, Well-Separated, and Well-Spread). These instances represent many natural distributions, such as Gaussian distributions, and reflect real-world clustering conditions: clusters of similar sizes, separated centers, and a rapid decay from cluster centers. In this context, the paper rigorously demonstrates that greedy k-means++ with $\ell = \ln k + \Theta(1)$ achieves an approximation ratio of $O((\ln \ln k)^2)$, surpassing the $\Omega(\log k)$ bound of standard k-means++.

The central idea of the analysis is that greedy sampling increases the likelihood of discovering new clusters during each iteration, especially in the early phases where covering all clusters is crucial for achieving quality results. The proof is based on concentration bounds for exponentially decaying distributions and a structural assumption that the clusters are well-separated but not excessively far apart (i.e., the separation scales as $k^\theta$, where $\theta \in (0, 1/2]$). Furthermore, the authors extend their analysis to the case where $\theta > 1/2$, demonstrating that greedy k-means++ still has a greater likelihood of covering all clusters, if not a better objective.

The experimental section is closely aligned with the theoretical framework, utilizing synthetic data from exponential, half-normal, and Lomax distributions. The metrics considered include the objective value and cluster coverage over iterations. The experiments validate the theoretical findings: greedy k-means++ reduces costs more rapidly and covers clusters more efficiently than standard k-means++.

Strengths and Weaknesses

Strengths

This paper offers a robust mathematical framework and rigorous proofs. It is the first to provide a beyond-worst-case guarantee for greedy k-means++ within realistic instance families. Additionally, it establishes an asymptotically tighter bound than k-means++ under natural assumptions.

Weaknesses

The positive results are confined to EWW instances, which represent only a small fraction of practical scenarios. The theory does not extend to the more common case where $\ell = O(1)$, which could also perform well in practice. Furthermore, the experiments rely solely on synthetic data, and the performance on real datasets has not been explored.

Questions

Could the authors investigate the behavior of greedy k-means++ when $\ell = O(1)$? Is there a gradual decline in performance, or does a phase transition occur?

How robust is the $O((\ln \ln k)^2)$ bound to violations of the EWW assumptions, such as unbalanced clusters or overlapping distributions?

Since the greedy algorithm selects the best candidate out of $\ell$, does this introduce a computational trade-off? How significant is this trade-off compared to the performance gains achieved?

Limitations

Yes.

Final Justification

The authors' explanation clearly outlines why they focus on the EWW assumption and the case $\ell = \Theta(\ln k)$ rather than $\ell = \Theta(1)$. It convinced me that the assumption they used is sufficiently universal and supports the contributions of their paper well. However, it would be even better if the authors could provide some experiments using real-world datasets in addition to the synthetic ones. Despite the lack of experiments with practical datasets, their response addressed my questions and persuaded me of the high theoretical value of their paper. Therefore, I would like to raise my rating to 5.

Formatting Concerns

No.

Author Response

We thank the reviewer for the helpful and constructive comments. We greatly appreciate the time and effort you put into this review. Below, we first summarize our contributions before addressing the reviewer's specific questions.

The k-means algorithm, by its nature of minimizing squared Euclidean distances to centroids, is most effective and meaningful for data organized into roughly spherical, well-separated clusters of similar sizes. For instances with non-spherical or arbitrary structures, other clustering methods (e.g., GMMs) are typically preferred, as k-means is known to perform poorly. Therefore, the most relevant question is: which seeding algorithm performs best on these "k-means-friendly" instances? This long-standing question has been open since the k-means++ algorithm was introduced by Arthur and Vassilvitskii (2007), a paper with about 13,000 citations as of today. In this seminal paper introducing the theoretical guarantees for k-means++, they additionally posed the open question of analyzing the greedy variant, mentioning its superior empirical performance.

Our EWW model is designed to formalize precisely this class of problems. It captures well-separated clusters with points concentrated around their centers (a property of sub-exponential distributions like Gaussians). The central puzzle, which our work addresses, is that even on these ideal instances for k-means, the standard k-means++ seeding algorithm can be provably suboptimal, with an $\Omega(\log k)$ approximation ratio.

Our main contribution is to show that the greedy k-means++ algorithm overcomes this deficiency. We prove it achieves a significantly improved $O((\ln \ln k)^2)$ approximation on EWW instances. In short, our work demonstrates the superiority of greedy k-means++ on exactly the class of problems where k-means is the appropriate tool in the first place. This provides the first rigorous theoretical explanation for the algorithm's widespread empirical success and its adoption in major libraries. For instance, our use of $\ell = \Theta(\ln k)$ candidates is not just a theoretical convenience; it directly reflects the default implementation in Scikit-learn ($\ell = 2 + \lfloor \ln(k) \rfloor$), as documented in Pedregosa et al. (JMLR 2011).

Another advantage of studying EWW instances is that it clearly reveals the difference between the two algorithms: k-means++ fails to select points from all clusters with a non-trivial probability. In contrast, we show that the greedy algorithm selects one point as a center from each cluster with a high probability. In essence, our work uses the assumptions that are implicitly employed to justify the use of k-means clustering and reveals where the greedy algorithm surpasses k-means++.

Finally, we remark that the properties we assume are naturally satisfied by the following random construction. Sample k centers within a hypercube $[0,1]^d$ uniformly at random, where $k^{\delta} \leq d \leq k$ for some constant $\delta$. Each center induces a cluster with an expected radius of $\Theta(1)$. Clusters can be heterogeneous; they only need to follow sub-exponential distributions and have sufficient mass of points away from their centers. The numbers of points in the clusters can differ by any constant factor. Therefore, these properties are naturally satisfied, with high probability, by a simple random construction of instances.

Q1: Could the authors investigate the behavior of greedy k-means++ when $\ell = O(1)$? Is there a gradual decline in performance, or does a phase transition occur?

A: We believe that it is possible to show some positive results when $\ell = O(1)$, but it would require a more complicated analysis. We focused on $\ell = \ln k + \Theta(1)$ as it is the parameter used in practice (e.g., the Scikit-learn library). It would be interesting to consider $\ell = O(1)$ in future work.

Q2: How robust is the $O((\ln \ln k)^2)$ bound to violations of the EWW assumptions, such as unbalanced clusters or overlapping distributions?

A: In the paper, we prove that the $O((\ln \ln k)^2)$ bound for EWW instances is robust under constant scaling. Namely, our analysis still works when the distances between points and the number of points inside each cluster are scaled by any constant. We believe it would also be interesting to consider more unbalanced clusters and overlapping distributions in the future, though a potential drawback would be the need to handle several parameters, such as distances between clusters, cluster sizes, cluster radii, and distributions, and to investigate various combinations of them, which could obfuscate the insights revealed by our analysis.

Q3: Since the greedy algorithm selects the best candidate out of $\ell$, does this introduce a computational trade-off? How significant is this trade-off compared to the performance gains achieved?

A: There indeed exists a computational trade-off between the runtime and the number of candidates $\ell$: increasing the number of candidates improves initialization quality but also increases runtime. In practice, people usually use $\Theta(\log k)$ candidates (e.g., the Scikit-learn library). Intuitively, with $\Theta(\log k)$ samples, we have a high probability of including a "good" candidate (near the optimal choice) in each iteration. Specifically, this allows us to select a point from a cluster from which no points were previously selected; further, the selected point is not far from the cluster center. This follows from probabilistic arguments and concentration bounds.
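As a back-of-the-envelope illustration of this intuition (our own sketch, not an argument from the paper): if each D²-sampled candidate independently lands on a "good" point with probability p, then $\ell = 2 + \lfloor \ln k \rfloor$ candidates all miss only with probability $(1-p)^{\ell}$, which shrinks quickly as k grows:

```python
import math

def hit_probability(p, k):
    """P(at least one of ell = 2 + floor(ln k) candidates is 'good'),
    assuming candidates are good independently with probability p."""
    ell = 2 + math.floor(math.log(k))
    return 1 - (1 - p) ** ell

for k in (10, 100, 1000):
    print(k, round(hit_probability(0.5, k), 4))
# k=10 (ell=4): 0.9375; k=100 (ell=6): 0.9844; k=1000 (ell=8): 0.9961
```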

Comment

I sincerely thank you for your response. Your explanation clearly outlines why you focus on the EWW assumption and the case $\ell = \Theta(\ln k)$ rather than $\ell = \Theta(1)$. It convinced me that the assumption you used is sufficiently universal and supports the contributions of your paper well. However, it would be even better if the authors could provide some experiments using real-world datasets in addition to the synthetic ones. Despite the lack of experiments with practical datasets, your response addressed my questions and persuaded me of the high theoretical value of your paper. Therefore, I would like to raise my rating to 5.

Comment

We truly appreciate you taking the time to re-evaluate our work and for your thoughtful feedback during the discussion phase. It means a great deal to us that you found our responses clarified the motivation for the EWW setting and the value of our theoretical results.

We understand that you mentioned raising your rating to 5, and while we can no longer view the current score, we just wanted to express our gratitude again for any update you make on our behalf.

Final Decision

The paper defines a property (called EWW, for exponentially distributed, well-separated, and well-spread) of point sets that enables the popular greedy k-means++ algorithm to cluster them better than k-means++. There are some shortcomings: a) the property is quite strong, and no evidence is given that it is satisfied on typical data sets, and b) while the approximation bound improves, it does not go down to O(1) even for these tailored input point sets. [It is not really clear what the goal of the experimental section is, since it is already documented that greedy k-means++ outperforms k-means++ on many data sets, and this just shows that it does so on tailored instances for which it was indeed shown to do so (that does not seem surprising).]

However, the paper still provides the first attempt at explaining the puzzling situation for greedy k-means++: while many papers (including the original AV paper) report that it outperforms k-means++, the theoretical analysis says it performs much worse. And as the authors also note, greedy k-means++ even performs badly on fairly simply structured instances. The paper at hand gives a tailored definition of very-nice-to-cluster point sets, but it also shows that k-means++ actually does not solve them well. This is very interesting, and the paper will probably spark further work in the area of beyond-worst-case analysis of (greedy) k-means++.