PaperHub
Rating: 7.3/10 (Oral; 3 reviewers; lowest 6, highest 8, standard deviation 0.9)
Individual ratings: 8, 8, 6
Confidence: 4.0 · Correctness: 4.0 · Contribution: 3.3 · Presentation: 3.3
NeurIPS 2024

Achieving Optimal Clustering in Gaussian Mixture Models with Anisotropic Covariance Structures

OpenReview · PDF
Submitted: 2024-05-12 · Updated: 2024-11-06
TL;DR

We study clustering in anisotropic Gaussian Mixture Models by establishing minimax bounds and introducing a variant of Lloyd's algorithm that provably achieves minimax optimality.

Abstract

Keywords
Minimax rates · Mixture model · Lloyd's algorithm · Clustering

Reviews and Discussion

Review
Rating: 8

This paper presents a hard-association EM algorithm for Gaussian mixture estimation, where the Gaussian components can be different and anisotropic. The authors also state theoretical bounds on the misclustering error rate.

Strengths

Originality: The theoretical association bound for different and anisotropic clustering is novel.

Quality: Contribution is technically sound. Claims are very well supported by theoretical analysis and also some experimental results. Methods used are appropriate. It is a complete piece of work.

Clarity: Submission is clearly written and well organized. It adequately informs the reader and states the proposed algorithms in concise form to be easily reproduced.

Significance: The key contributions to the community are the theorems and the very detailed and elaborate proofs about the error rate that can theoretically be achieved in anisotropic clustering. Overcoming the constraint of isotropic components (or anisotropic but identical components) is a major step.

Weaknesses

Originality: A related work is "Frisch, Hanebeck, Gaussian Mixture Estimation from Weighted Samples" (not telling you to include this, just check if it's relevant; its soft-EM algorithm may be similar to your hard-EM).

Suggestions:

  • in line 108, give some intuition about the loss function, e.g. "ratio of mis-associations".
  • mention that Model 1 is equivalent to isotropic components in a linearly transformed space with the root of the covariance (if that's correct)

Typos:
  • line 102 displayed as follows
  • line 136: "Figure 1" should be on the same line; use a non-breaking space (~) or the cleveref package
  • line 150 computationally feasible
  • Theorem (and elsewhere): too much whitespace in "exp (". Use "exp\!(" or \mathopen{}, \mathclose{} instead.
  • SNR has a non-math font in formulas and is sometimes italic and sometimes not
  • References: check capitalization, e.g. [11] PCM etc. I recommend: \usepackage[style=numeric-verb, maxbibnames=20, sorting=none, eprint=false]{biblatex}

Questions

What about individual weights (mixing proportions) of the Gaussian components? E.g. from line 271, must the 30 clusters each have 1200/30=40 samples?

Limitations

Limitations are stated but maybe should be summarized in a dedicated section.

  • "decent" initial guess required. What does that mean? Were you able to reliably get the rate-optimal result without taking into account prior knowledge about the ground truth? At least state the qualitative dependencies, like "decent" initial guesses are more restricted for lower SNR / overlapping components, fewer samples, higher dimensions, more components etc.
Author Response

We thank you for your valuable comments and remarks.

Originality: A related work is

Thank you for pointing this out. Our hard-EM algorithm indeed shares similarities with the soft one. Specifically, our algorithm modifies the E-step of the soft-EM by implementing a maximization step that hard assigns data points to clusters instead of calculating probabilities. We will make this clearer in the final version of the paper.
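
For illustration, here is a minimal sketch of the difference between the soft E-step and the hard assignment step described above, assuming known component parameters (this is only a schematic comparison, not the exact procedure in our paper):

```python
import numpy as np
from scipy.stats import multivariate_normal

def component_log_scores(X, means, covs, weights):
    # log( w_a * N(x_i; mu_a, Sigma_a) ) for every point i and component a
    return np.stack([
        np.log(w) + multivariate_normal(m, S).logpdf(X)
        for m, S, w in zip(means, covs, weights)
    ], axis=1)

def soft_e_step(X, means, covs, weights):
    # Soft EM: responsibilities r[i, a] = P(z_i = a | x_i)
    logp = component_log_scores(X, means, covs, weights)
    logp -= logp.max(axis=1, keepdims=True)   # numerical stability
    r = np.exp(logp)
    return r / r.sum(axis=1, keepdims=True)

def hard_e_step(X, means, covs, weights):
    # Hard EM: assign each point to its most likely component instead of keeping probabilities
    return component_log_scores(X, means, covs, weights).argmax(axis=1)
```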

Suggestions:

Thank you for pointing these out. We will surely address them in the final version.

What about individual weights (mixing proportions) of the Gaussian components?

In our numerical study, we initially opted for equal cluster sizes for simplicity, but our model and analysis are fully capable of accommodating variable cluster sizes. To better demonstrate the flexibility and applicability of our approach, we will have a variety of cluster sizes in the numerical settings in the final version of the paper.

"decent" initial guess required

Thank you for the question. In our manuscript, the term 'decent initial guess' refers to the need for the initial cluster centers to be sufficiently close to the ground truth so that our algorithm achieves the rate-optimal result. This is because our theoretical analysis requires the initialization to be within a specific proximity to the true parameters. In the final version of our paper, following your suggestion, we will explicitly detail these dependencies to provide a clearer understanding of when and how our algorithm can be expected to perform optimally.

Review
Rating: 8

This paper analyzes the minimax error rate in clustering anisotropic Gaussian mixture models. The authors establish a universal lower bound under two different models: different means with the same covariance matrix for every cluster, and different means with a different covariance matrix for every cluster. For both models, they prove that a simple iterative algorithm (a variant of Lloyd's) attains the minimax rate.

Strengths

Strong paper. The technical proofs are impressive and the results are quite general, with a minimal amount of assumptions (clusters all not too small, dimension $O(\sqrt{n})$, covariances well conditioned; the only strong assumption is on the initialisation).

Weaknesses

  • The paper misses some high-level intuition, and several proofs are tough to parse. For example, the error decomposition with the terms F, G, H etc. appears mysterious and a bit out of nowhere. While the ideal error has a natural explanation (at least when delta = 0), I did not find it properly mentioned. For example, line 435 embodies exactly my feeling while reading the proofs: "We next define a quantity that refers to as the ideal error"; yes, this quantity is important, yet I do not say explicitly why, and I leave the reader to work out for themselves why this is indeed an ideal error.

  • The interpretation of the minimax rate for Model 2 is quite complex. Because it involves hypothesis testing, one should naively expect to find the Chernoff information. Quick computations show that the Chernoff information between $N(\theta_1, \Sigma)$ and $N(\theta_2, \Sigma)$ equals $\frac{1}{8}\|\theta_1 - \theta_2\|^2_{\Sigma}$ and indeed recovers the minimax rate of Model 1. The expression is more complex for Model 2. That could help interpret the quantity SNR' (which is not truly a signal-to-noise ratio, as it is hard from the expression of $\mathrm{SNR}'_{ab}$ to find where the signal, the noise, and the ratio are...).

  • The assumption on the loss of the initialisation is quite strong ($\hat z^{(0)}$ should already be a consistent estimator). Especially, the authors consider spectral clustering, which does not scale as well as iterative methods such as Lloyd's. Moreover, spectral clustering is rate-optimal in an isotropic mixture model. A natural question is: what happens if one takes the best over 10 runs of random initialisations $\hat z^{(0)}$ (or something more clever such as k-means++)? Numerical simulations could show whether a strong condition on the initialisation is indeed needed or not. In particular, Paper [1] had weaker conditions on the initialisation (but stronger conditions on everything else).

  • In large dimensions (say d >> n), the lower bound derived here cannot be attained by algorithms agnostic to the model parameters. While this is obviously out of the scope of the paper, the authors should mention it more clearly (key reference 14 is a bit diluted in the intro; other recent work such as [2], albeit posted after the authors' submission, could also appear).

[1] Lu, Y., & Zhou, H. H. (2016). Statistical and computational guarantees of Lloyd's algorithm and its variants. arXiv preprint arXiv:1612.02099.

[2] Even, B., Giraud, C., & Verzelen, N. (2024). Computation-information gap in high-dimensional clustering. arXiv preprint arXiv:2402.18378.

Minor comments:

  • line 377: "where we use k = O(1)". No, you use SNR / log k >> 1.
  • Typo in line 189: the "with probability $1-\exp(-0.08)$" should be $1-\exp(-0.08n)$ according to [12, Proposition D.1]. This typo appears several times (Corollaries 2.1 and 3.1 at least) and is important to correct.
  • line 290: I assume the log is in base 10.
  • Line 641: typo in the ref.
  • Line 661: I don't think SE(a,b) is properly defined (namely, a proper definition explaining what the a and b in SE(a,b) exactly stand for).

Questions

  • Is the assumption $\mathrm{SNR} / \log k \gg 1$ needed? It seems to appear in the lower bound because of the choice of the space $\mathcal{Z}$. Would a more refined analysis remove this assumption? (By this comment, I am specifically referring to the paragraph "If our goal is only to obtain a lower bound ... there exists a much simpler way" in [1].) I am not sure if this assumption is needed in the upper bound (for example, Lemma A.5 for the ideal error does not seem to require it?).

  • Line 42, the sentence: "From a minimax perspective, the least favorable scenario among all sub-Gaussian distributions with variance proxy $\sigma^2$—and thus the most challenging for clustering—is when the errors are distributed as $N(0, \sigma^2 I)$." Does it mean that the SNR obtained in Model 2 is always larger than the SNR of Model 1 (with the same centres $\theta_a$ but all covariances equal to, say, $\Sigma_1$)?

  • I am not sure how to obtain the second-to-last inequality of the inequations starting at line 473. I get that Cauchy-Schwarz is applied, but it leads to something slightly different (more precisely, I don't know how to obtain the $\|\sum_{a} 1(z_j = a) \epsilon_j \epsilon_j^T\|$, as Cauchy-Schwarz gives $\|\epsilon_j\|$). I may be missing a simple one-liner showing that $\sum_{a} 1(z_j = a) \|\epsilon_j\| = \|\sum_{a} 1(z_j = a) \epsilon_j \epsilon_j^T\|$.

  • I do not understand the last statement "Future work will explore broader covariance structures and refine these methods to further bridge theoretical robustness with computational feasibility in complex clustering scenarios." For the first part of the sentence, I believe Model 2 already encompasses the broadest covariance structure (besides some assumptions of \Sigma^* being well-conditioned). For the second part, if the authors mean robustness to noise, they should cite recent work addressing this in isotropic Gaussian mixture models (for example [1,2]).

Ref:

[1] Anderson Y. Zhang. Harrison H. Zhou. "Minimax rates of community detection in stochastic block models." Ann. Statist. 44 (5) 2252 - 2280, October 2016. https://doi.org/10.1214/15-AOS1428

[2] Liu, Allen, and Ankur Moitra. "Robustly learning general mixtures of Gaussians." Journal of the ACM 70.3 (2023): 1-53.

[3] Patel, D., Shen, H., Bhamidi, S., Liu, Y., & Pipiras, V. (2023). Consistency of Lloyd's Algorithm Under Perturbations. arXiv preprint arXiv:2309.00578.

Limitations

  • Albeit this is already a long and dense paper, I would have enjoyed a section motivating anisotropic mixture models in real-data examples. For example: standard Lloyd versus Algorithm 1 versus Algorithm 2. I believe Algorithm 2 obtains a higher accuracy in large data sets, but may not in smaller data sets (where the estimation of the covariance may lead to less reliable predictions). Having (even heuristic) guidelines on which algorithm one should use depending on the data set size is important.

  • While I went quite far down the proofs, I found them hard to read. Appendix C is just a dump of technical lemmas over 17 (!!) pages with no structure. Come on, this is not really serious...

Author Response

We thank you for your valuable comments and remarks.

The paper misses some high-level intuition, and several proofs are tough to parse.

Thanks for pointing this out. In the final version, we will enhance our discussion to better articulate the derivation and significance of key terms. The "ideal error" is defined as the error remaining after one algorithm iteration with known ground truth $z^*$. It represents the minimum error under ideal conditions. The actual error emerges when iterating from an estimation $z$, with the difference between actual and ideal errors expressed through the terms $F$, $G$, and $H$: $F$ includes terms related to noise $\epsilon_j$, illustrating the impact of measurement noise. $G$ covers estimation errors for cluster centers ($\hat{\theta}_a(z) - \hat{\theta}_a(z^*)$) and covariance matrices ($\hat{\Sigma}(z) - \hat{\Sigma}(z^*)$), showing the effect of parameter estimation inaccuracies. $H$ contains all other terms from additional error sources.
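
Schematically (this is only an informal reading of the description above, not the precise statement from the appendix; $\ell$ and $\xi_{\mathrm{ideal}}$ are shorthand introduced here), the decomposition can be read as

$$ \ell\big(\hat z^{(t+1)}\big) \;\lesssim\; \underbrace{\xi_{\mathrm{ideal}}}_{\text{one iteration from } z^*} \;+\; \underbrace{F}_{\text{noise } \epsilon_j} \;+\; \underbrace{G}_{\text{parameter estimation error}} \;+\; \underbrace{H}_{\text{remaining terms}}, $$

where $\ell(\hat z^{(t+1)})$ denotes the misclustering error after one further iteration from the current estimate.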

The interpretation of the minimax rate for model 2 is quite complex.

Thanks for the suggestion. We will include connections with Chernoff information in the final version to clarify the understanding of the minimax rate. Regarding SNR', although it diverges from the traditional signal-to-noise ratio, we use this term as it extends the classical definition $\frac{\|\theta_1^* - \theta_2^*\|^2}{\sigma^2}$ in the context of isotropic Gaussian noise.
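
For reference, the standard identity behind the reviewer's computation, for two Gaussians sharing a covariance matrix $\Sigma$, is

$$ C\big(N(\theta_1,\Sigma),\,N(\theta_2,\Sigma)\big) \;=\; \tfrac{1}{8}\,(\theta_1-\theta_2)^{\top}\Sigma^{-1}(\theta_1-\theta_2), $$

which reduces to $\frac{1}{8}\cdot\frac{\|\theta_1-\theta_2\|^2}{\sigma^2}$ when $\Sigma=\sigma^2 I_d$, i.e. the classical isotropic definition above up to the constant $1/8$.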

The assumption on the loss of the initialization is quite strong

We acknowledge that this assumption appears strong; it is primarily driven by technical challenges encountered in our analytical framework.

In large dimensions (say d >> n), the lower bound derived here cannot be attained

You are correct that our paper does not cover the large dimensional case. In the final version, we will add more comments and references to make it clearer.

Minor comments

Thank you and we will correct them in the final version.

Is the assumption SNR / log k \gg 1 needed?

Thank you for the question. On one hand, this condition helps simplify the complexity for establishing the lower bound. As noted in the literature you referenced, a more refined analysis might allow us to relax this assumption to simply $\text{SNR} \gg 1$. On the other hand, this condition is essential to establish a matching upper bound. The ideal error analysis does not explicitly require this assumption; however, it becomes necessary when considering the aggregate impact of errors $F$, $G$, and $H$. These components introduce multiple factors of $k$ that appear before the desired exponential term, and the assumption ensures that these factors remain negligible.

Line 42 the sentence:

Thank you for the question. To address it, consider a simpler case with two Gaussian distributions: whether the SNR calculated from $N(\theta^*_1, \Sigma^*_1)$ and $N(\theta^*_2, \Sigma^*_2)$ is always larger than that from $N(\theta^*_1, \Sigma^*_1)$ and $N(\theta^*_2, \Sigma^*_1)$. In this case, the answer to your question is not necessarily yes, as demonstrated by the following counterexample. When $\theta^*_1=(0,0)$, $\theta^*_2=(1,0)$, $\Sigma^*_1 = I_2$ and $\Sigma^*_2$ is a diagonal matrix with 2 in its (1,1) entry and 1 in its (2,2) entry, one can verify SNR is not larger. In fact, the effect of replacing $\Sigma^*_2$ with $\Sigma^*_1$ on SNR depends on the shapes of $\Sigma^*_1$ and $\Sigma^*_2$ and the direction of $\theta^*_2 - \theta^*_1$. This is different from the sub-Gaussian setting. The rationale why $N(0, \sigma^2 I_d)$ leads to the smallest SNR among sub-Gaussian distributions with variance proxy $\sigma^2$ is that it is flatter in all directions compared to any other sub-Gaussian distribution.
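
As an illustration of this counterexample, the following short script compares the two pairs of Gaussians using the Bhattacharyya/Chernoff-type separation as a rough proxy for the SNR (this proxy is used only for illustration and is not the exact SNR' defined in our paper):

```python
import numpy as np

def bhattacharyya(mu1, S1, mu2, S2):
    """Bhattacharyya distance between N(mu1, S1) and N(mu2, S2)."""
    S = (S1 + S2) / 2
    d = mu2 - mu1
    term1 = d @ np.linalg.solve(S, d) / 8
    term2 = 0.5 * np.log(np.linalg.det(S) / np.sqrt(np.linalg.det(S1) * np.linalg.det(S2)))
    return term1 + term2

theta1, theta2 = np.zeros(2), np.array([1.0, 0.0])
Sigma1 = np.eye(2)
Sigma2 = np.diag([2.0, 1.0])

print(bhattacharyya(theta1, Sigma1, theta2, Sigma1))  # equal covariances: 1/8 = 0.125
print(bhattacharyya(theta1, Sigma1, theta2, Sigma2))  # anisotropic pair: about 0.113, i.e. not larger
```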

I am not sure how to obtain the second to last inequality of the inequations starting line 473.

The inequality does not result directly from Cauchy-Schwarz. Instead, it first formulates a quadratic form and then upper bounds it by the operator norm of the matrix. Let $x$ be a vector. The inequality is essentially about showing $\sum_j |\epsilon_j^T x|^2$ can be upper bounded by $\|x\|^2 \|\sum_j \epsilon_j \epsilon_j^T\|$. This can be proved by $\sum_j |\epsilon_j^T x|^2 = \sum_j (x^T\epsilon_j)(\epsilon_j^T x) = \sum_j x^T(\epsilon_j \epsilon_j^T)x = x^T(\sum_j \epsilon_j \epsilon_j^T)x \leq \|x\|^2 \|\sum_j \epsilon_j \epsilon_j^T\|$. In the final version, we will add this intermediate argument.
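
For completeness, a short numerical sanity check of this quadratic-form bound (random data and dimensions chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
eps = rng.normal(size=(50, 5))                 # rows play the role of the epsilon_j
x = rng.normal(size=5)

lhs = np.sum((eps @ x) ** 2)                   # sum_j |eps_j^T x|^2
M = eps.T @ eps                                # sum_j eps_j eps_j^T
rhs = np.dot(x, x) * np.linalg.norm(M, 2)      # ||x||^2 times the operator norm
assert lhs <= rhs + 1e-9
```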

I do not understand the last statement

Thanks for the feedback. We agree it needs clarification. Our intent was to indicate a potential relaxation of the well-conditioned assumption on the covariance matrices' condition numbers. In the final version we will make it clearer.

Albeit this is already a long and dense paper, I would have enjoyed a section

Thank you for your suggestion. In the final version of the manuscript, we will incorporate a new section dedicated to evaluating the performance of the algorithms on real datasets. This section will also include practical guidelines for practitioners on selecting the most suitable algorithm.

While I went quite far down the proofs, I found them hard to read.

Thanks for the comment. In the final version, we will add more details to make the proof more accessible. For example, in Appendix C, we will give an overview of lemmas to be proved.

Comment

Thank you for your answers!

I updated my grade as I am quite sure this paper is a strong accept. In particular, the addition of some numerical experiments on real data sets makes it valuable for the NeurIPS community. With small cleaning, the proofs in the Appendix can be made easier to read (but even if it takes time, I do encourage you to add some details at key points in the proof, as it makes a difference from a reader's perspective).

Reading your answer about SNRs, another observation came to my mind. If we consider the problem of clustering an instance of an anisotropic GMM, Algo 2 is rate-optimal, while vanilla Lloyd's should be sub-optimal. In contrast, studying the problem in a worst-case scenario (minimax over all sub-Gaussian mixture models), vanilla Lloyd's is rate-optimal (albeit I believe Algo 2 should also be rate-optimal).

Comment

Thank you!

Your observation about SNR is correct. We will follow your suggestions to enhance our paper in the final version. Thank you once again for your constructive and detailed feedback.

Review
Rating: 6

The paper provides a minimax lower bound on the misclustering error rate for clustering under the anisotropic GMM. Then, the paper designs an adjusted Lloyd's algorithm which attains this lower bound within $\log(n)$ iterations. The paper also conducts some experiments to show the performance of the proposed method.

Strengths

  1. The paper tackles a more difficult setting of GMM, i.e., anisotropic GMM. The paper provides the bound of the misclustering error rate in this setting and designs a rate-optimal algorithm, the adjusted Lloyd's algorithm.

  2. The paper is theoretically sound and solid.

Weaknesses

  1. My major concern is about its experiments. The paper only conducts experiments on synthetic data sets. In the machine learning community, we often care more about performance on real-world data sets.

  2. In Algorithm 2, one needs to compute the inverse and determinant of the covariance matrices, which seems time-consuming. Could you give a time complexity analysis and experiments on how much the overhead increases compared to the vanilla Lloyd's algorithm?

Questions

Please see above.

Limitations

The paper has stated its limitations.

Author Response

We thank you for your valuable comments and remarks.

My major concern is about its experiments.

Thank you for your comment. We plan to include a new experiment in the final version of our paper using a real dataset from the Fashion-MNIST collection. This experiment will focus on the clustering of two classes, T-shirt/top and Trouser, each comprising 6000 images. Numerical results show that Algorithm 2 achieves a misclustering error rate of 5.7%, which outperforms the vanilla Lloyd’s algorithm (8%). This addition will provide a more comprehensive evaluation of our methods, highlighting their effectiveness in practical scenarios.

In Algorithm 2, it needs to compute the inverse and determinant of the covariance matrices

Thanks for the comment. You are correct in noting that it incurs additional computational overhead due to the necessity of computing the inverse and determinant of covariance matrices. The time complexity of Algorithm 2 is $O(nkd^3T)$, where $n$ is the number of points, $d$ is the dimensionality of each data point, $k$ is the number of clusters, and $T$ is the number of iterations. This contrasts with the vanilla Lloyd's algorithm, which has a lower time complexity of $O(nkdT)$. The increased complexity is primarily due to the matrix operations in $d$ dimensions, which scale as $O(d^3)$ for matrix inversion and determinant computation. To provide a clearer perspective on the performance impact, our experiments show that at a dimensionality of 5, Algorithm 2 requires approximately twice the computation time of the vanilla Lloyd's algorithm. This ratio increases to approximately 14 when the dimensionality is increased to 100. In the final version of the paper, we will include a detailed time complexity analysis and experimental results to illustrate how the overhead scales with increased dimensionality.
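
To make the source of the extra $O(d^3)$ cost concrete, here is a minimal sketch of a hard assignment step with cluster-specific covariances (for illustration only, not the exact implementation of Algorithm 2); the matrix inversion and log-determinant of each estimated covariance are the $O(d^3)$ operations:

```python
import numpy as np

def assign_step(X, means, covs):
    """One hard assignment step; X is (n, d), means is a list of (d,) arrays, covs a list of (d, d) arrays."""
    n, d = X.shape
    scores = np.empty((n, len(means)))
    for a, (mu, S) in enumerate(zip(means, covs)):
        S_inv = np.linalg.inv(S)                            # O(d^3) per cluster
        _, logdet = np.linalg.slogdet(S)                    # O(d^3) per cluster
        diff = X - mu
        maha = np.einsum('ij,jk,ik->i', diff, S_inv, diff)  # squared Mahalanobis distances
        scores[:, a] = -0.5 * (maha + logdet)               # Gaussian log-likelihood up to constants
    return scores.argmax(axis=1)
```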

Comment

Thanks for the authors' responses which addressed my concerns. I've no more comments.

Author Response

We sincerely thank all reviewers for their detailed and insightful comments. Each reviewer has provided valuable feedback from different perspectives. In recognition of this, we respond to each reviewer's comments individually to ensure that all concerns and suggestions are thoroughly addressed.

Final Decision

The paper presents a detailed theoretical analysis of the performance of clustering solutions arising from a modified variant of Lloyd's algorithm for data arising from Gaussian Mixture Models (GMMs) with anisotropic and potentially differing covariance matrices. Such results are scarce in the clustering literature, and the reviewers unanimously appreciate the theoretical contributions. Limitations identified by the reviewers include a lack of experimental results on real data sets and a lack of intuitive discussion of the theory, but the authors have committed to including experiments on real data in a revised version and to enhancing the readability/accessibility of the theoretical content.