PaperHub
7.8 / 10 · Poster · 4 reviewers (scores 4, 5, 5, 5; min 4, max 5, std 0.4)
Confidence: 3.0 · Novelty: 2.8 · Quality: 3.3 · Clarity: 2.5 · Significance: 2.8
NeurIPS 2025

Understanding Contrastive Learning via Gaussian Mixture Models

Submitted: 2025-05-11 · Updated: 2025-10-29

Abstract

Keywords
contrastive learning; self-supervised learning; gaussian mixture models; linear dimensionality reduction

Reviews and Discussion

Review (Rating: 4)

This paper sets up a framework for analyzing contrastive learning through the lens of GMMs, projecting the unlabeled augmentation pairs used in InfoNCE onto the Fisher-optimal subspace. It discusses the ability of InfoNCE to recover the optimal shared-covariance projections, showing that minimizing the CLIP InfoNCE loss yields subspaces for each modality that lie within their respective Fisher-optimal subspaces. The authors then validate their theoretical findings on synthetic data and CIFAR-100.

Strengths and Weaknesses

For strengths across the dimensions of originality, quality, clarity, and significance:

Quality:

The theory is solid, proving that the InfoNCE and SimSiam objectives recover the Fisher subspace under a shared-covariance GMM and showing how contrastive and non-contrastive losses align with supervised LDA. The paper also extends the analysis to a CLIP-style multi-modal GMM and shows that the contrastive loss learns discriminative subspaces in each modality without explicit class information.

Clarity:

Introduces a clean, analytically tractable model (Augmentation-enabled GMM) that captures "noisy" augmentations and paired multi-modal data. The paper clearly describes the difference between the collapse scenario and the well-separated scenario.

Significance:

A few papers have tried to explain why contrastive learning works so well. This paper provides an interesting perspective, showing that InfoNCE and related losses recover the Fisher-optimal subspace, matching supervised LDA. The authors extend this to multi-modal settings, which could guide the future design of self-supervised and vision-language models.

Originality:

The main points of this paper, including casting contrastive learning as linear dimensionality reduction on GMMs, defining "noisy" augmentations via the AeD, and proving that minimizing InfoNCE recovers the Fisher-optimal subspace, are original to me.

Weakness:

W1: In this paper, feature extraction through the matrix A is a linear mapping. The work provides a clear analysis, but I would like to know how these Fisher-subspace results extend to real encoder architectures or to more non-linear setups.

W2: Validation is restricted to small synthetic mixtures and a toy CIFAR-100 clustering setup (with grayscale, downsampled features and known K for K-Means). There is no demonstration on large-scale vision or vision-language benchmarks to show that the theory remains useful in CLIP-like settings.

W3: Natural images and language lie on highly curved, non-linear manifolds. Such complex, non-linear, multi-modal manifolds may not be well captured by any finite mixture of Gaussians with shared covariance. Moreover, all theoretical guarantees assume access to the true data distribution (infinite samples). Without a convergence-rate analysis, it remains unclear how many real examples or negatives are needed for InfoNCE to work in practice.

Questions

Q1: Your contributions show that InfoNCE recovers the Fisher-optimal subspace under GMMs and extend to multi-modal CLIP losses, but could you clarify what concrete practical benefits or downstream applications we can expect from these theoretical guarantees? This part seems ambiguous to me.

Q2: In Theorem 4.1, you show that InfoNCE (and SimSiam) recover the full Fisher subspace only when the augmentation bias δ = 1. Could you clarify how sensitive the model is to this hyperparameter in practice when trying to ensure near-optimal projection recovery?

Q3: For the multi-modal CLIP-GMM result (Theorem 5.2), you prove that the CLIP InfoNCE loss learns only a subset of each modality's Fisher subspace. Could you suggest how to identify or recover the "missing" directions, and what downstream impact this subset selection has on retrieval or classification performance?

Q4: The current scope of experiments is somewhat narrow: only linear mappings, with no experiments on multi-modal CLIP-style data and no validation on deep, non-linear embeddings or real CLIP retrieval benchmarks. Could you comment on whether and how your CLIP-GMM theory holds up when applied to actual CLIP embeddings (e.g., ResNet + Transformer) on image-text retrieval tasks?

Q5: Table 1 shows LDA achieving the highest ARI while InfoNCE leads in AMI. Could you explain which properties of the learned projections drive this divergence, and what this implies for choosing one method over the other?

Minor:

  1. Missing citation in line 63: "will be generated by separate GMMs. For instance, in the CLIP []".

  2. There is extra whitespace in Definition 3.1.1.

  3. In Figure 1, why is (a) well-separated while (b) is the mode collapse? From the figure, (b) looks more separated.

  4. Why is the SVD subspace italicized in line 187 but not in line 189?

Limitations

Yes

Final Justification

The authors have basically addressed my concerns.

Formatting Issues

No

Author Response

Remark 1 : How do the results (and Fisher optimality) extend to real encoder architectures with a more non-linear setup

We show that in the linear setting, the InfoNCE loss learns the Fisher subspace, which we motivated as the optimal subspace for clustering. We do not know of an equivalent definition of optimality for representations under non-linear mappings, and such a definition seems prohibitively difficult to come up with without assumptions on the mappings.

Remark 2 : Lack of demonstrations for large scale vision / vision-language benchmarks.

We recognise the toy nature of our setup and have therefore extended our experiments to the ImageNet dataset. This setup lets us argue about distributions more general than a shared-covariance GMM. Specifically, we take images corresponding to 20 random classes from ImageNet and use a pretrained ResNet-50 model to obtain non-linear embeddings for the images. We then learn a linear map from the ResNet embedding space (2048-dimensional) to our target space. We present the results below (as ARI / AMI).

| Method | 5 dim | 10 dim | 15 dim | 19 dim | 30 dim |
| --- | --- | --- | --- | --- | --- |
| Random | 0.03026 / 0.08296 | 0.07231 / 0.14005 | 0.13140 / 0.23209 | 0.14405 / 0.23363 | 0.26177 / 0.38685 |
| PCA | 0.23408 / 0.46447 | 0.39843 / 0.59117 | 0.49903 / 0.65898 | 0.50954 / 0.66159 | 0.56355 / 0.70163 |
| Ambient | 0.48641 / 0.67233 | 0.48641 / 0.67233 | 0.48641 / 0.67233 | 0.48641 / 0.67233 | 0.48641 / 0.67234 |
| SimSiam | 0.37581 / 0.58233 | 0.56084 / 0.70246 | 0.58705 / 0.72227 | 0.60259 / 0.73159 | 0.62970 / 0.74476 |
| InfoNCE | 0.84451 / 0.88831 | 0.98182 / 0.98115 | 0.92963 / 0.97390 | 0.99621 / 0.99579 | 0.93317 / 0.97721 |
| LDA | 0.47112 / 0.69828 | 0.66967 / 0.83895 | 0.82550 / 0.89103 | 0.94744 / 0.94990 | 0.94745 / 0.94990 |
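For readers who want to reproduce this style of evaluation, a minimal sketch of the protocol (K-Means on linearly projected features, scored with ARI/AMI) might look as follows. The synthetic Gaussian blobs stand in for the ResNet-50 embeddings, and all variable names are illustrative rather than taken from the authors' code:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

rng = np.random.default_rng(0)

# Stand-in for ResNet-50 embeddings of 20 ImageNet classes:
# 20 well-separated Gaussian blobs in 2048 dimensions.
K, n_per, d = 20, 50, 2048
means = rng.normal(scale=5.0, size=(K, d))
X = np.vstack([m + rng.normal(size=(n_per, d)) for m in means])
y = np.repeat(np.arange(K), n_per)

def evaluate(Z, y, k=K, seed=0):
    """Cluster projected features with K-Means and score against labels."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Z)
    return adjusted_rand_score(y, labels), adjusted_mutual_info_score(y, labels)

# A learned linear map A (r x d) would come from InfoNCE / SimSiam / LDA
# training; here a random r-dimensional projection plays the "Random" row.
r = 19
A = rng.normal(size=(r, d)) / np.sqrt(d)
ari, ami = evaluate(X @ A.T, y)
print(f"ARI = {ari:.5f}, AMI = {ami:.5f}")
```

In the table above, each non-random row would replace `A` with the projection learned by the corresponding method.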

Remark 3 : All theoretical guarantees assume access to the true data distribution (infinite samples). It remains unclear how many real examples or negatives are needed

We acknowledge that our analysis is designed for the infinite-sample setting and there does not seem to be a trivial extension to finite samples. The main technical difficulty lies with the LogSumExp term in the InfoNCE loss. We believe the reviewer would agree that such an analysis is still non-trivial even in the linear setup we propose. That said, we are exploring directions that would let us make meaningful statements about establishing bounds on the number of samples needed to learn subspaces comparable to the Fisher subspace. Our goal in this paper is to quantify the absolute limits of contrastive learning in the infinite-sample limit.

Finite-sample analysis, with sample complexities and convergence rates, would also require careful design of an optimization algorithm for the InfoNCE or other losses, which is non-trivial and beyond the scope of our infinite-sample analysis. We leave the finite-sample analysis to future work, which we believe is of independent interest to the community.

Remark 4 : Practical benefits following from our results.

Our analysis suggests that while both the InfoNCE and CLIP losses are effective at filtering out the noise directions, the InfoNCE loss (under perfect augmentations) can provably recover all discriminative directions (i.e., it learns the complete Fisher subspace). We prove that the CLIP loss (at least in the linear setting) is not able to capture the complete subspace, and hence its representations are possibly less suited for image-only tasks like clustering.

Remark 5 : How sensitive is the model to the augmentation noise parameter in practice?

We believe the model is robust to the choice of the noise parameter, and we verify this empirically with the synthetic shared-covariance GMM experiments. In Figure 2(a) of the paper, we plot the clustering performance (AMI/ARI) for different augmentation noise levels δ. We can see that InfoNCE achieves almost perfect clustering even at relatively high noise levels (δ = 0.4). Please let us know if you would be interested in further experiments on this for the discussion period.
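One concrete reading of the δ-biased augmentation-enabled distribution (our interpretation of the AeD; all names hypothetical) is that with probability δ the augmentation shares the anchor's mixture component, and otherwise its component is resampled from the mixture:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy shared-covariance GMM: K components in d dimensions.
K, d = 3, 2
weights = np.ones(K) / K
means = rng.normal(scale=3.0, size=(K, d))
cov_chol = np.eye(d)  # Cholesky factor of the shared covariance (identity here)

def sample_pair(delta, rng):
    """Draw (x, x_aug): with probability delta the augmentation shares x's
    component; otherwise its component is resampled from the mixture."""
    k = rng.choice(K, p=weights)
    x = means[k] + cov_chol @ rng.normal(size=d)
    k_aug = k if rng.random() < delta else rng.choice(K, p=weights)
    x_aug = means[k_aug] + cov_chol @ rng.normal(size=d)
    return x, x_aug, k, k_aug

# Empirical component agreement: roughly delta + (1 - delta) / K.
pairs = [sample_pair(0.9, rng) for _ in range(2000)]
agree = float(np.mean([k == k_aug for _, _, k, k_aug in pairs]))
print(f"component agreement at delta = 0.9: {agree:.3f}")
```

At δ = 1 every pair shares a component (the "perfect augmentation" regime of Theorem 4.1); at δ = 0 components agree only by chance.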

Remark 5 : For multi-modal CLIP, what impact does not recovering the full Fisher subspace have on downstream performance? How can we identify or recover the missing directions?

Our results suggest that for a given model capacity, not recovering the full Fisher subspace can lead to collapse. Since the Fisher subspace ensures maximal separability (not necessarily perfect separability), there might exist a subspace in which certain components are almost indistinguishable, leading to poor performance.

One possible solution for recovering the missing directions might be to first train image and text encoders separately, and then fine-tune them for CLIP loss. Then, training the image encoder with the InfoNCE loss would help with learning discriminative representations.

Remark 6 : How the CLIP-GMM theory holds up when applied to actual CLIP embeddings on image-text retrieval datasets

Due to the size of the CLIP dataset, we were not able to complete experiments on it. We are working on a subset of the dataset and will try to have results ready for the discussion period. Moreover, our theory assumes a linear map, and our findings could stem from the limited capacity of such linear maps. It is unclear whether they would transfer to representations learnt by highly non-linear networks.

Remark 7 : Discrepancy between the ARI and AMI scores for LDA and InfoNCE

A core difference between ARI and AMI is that ARI emphasizes pairwise clustering performance: it measures how many pairs of points originally belonging to the same cluster end up together. ARI can be smaller than AMI when the original cluster sizes are imbalanced or the final clusters have a size imbalance.

Remark 8 : Minor comments

We will address them as pointed out by the reviewer. Regarding the comment on Figure 1: in this figure we wanted to show the separability of representations after projection onto their SVD subspace versus their Fisher subspace. The GMMs are well separated in the ambient space in both Fig. 1(a) and 1(b). In Fig. 1(b), however, projection onto the SVD subspace (y-axis) leads to a mode collapse, while projection onto the Fisher subspace (x-axis) ensures that the underlying components remain well separated.
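The mode-collapse mechanism described here is easy to reproduce with a two-component "pancake" mixture. In the following sketch (illustrative parameters, not the paper's), the top SVD direction of the data aligns with the high-variance noise axis and collapses the two modes, while the mean-difference (Fisher) direction preserves separation:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two-component GMM in 2D: means differ along the x-axis (Fisher direction),
# shared covariance has large variance along the y-axis (noise direction).
n = 2000
y = rng.integers(0, 2, size=n)
means = np.array([[-3.0, 0.0], [3.0, 0.0]])
noise = rng.normal(size=(n, 2)) * np.array([1.0, 10.0])  # "pancake" covariance
X = means[y] + noise

# The top singular direction of the centered data picks the high-variance
# noise axis, so projecting onto it collapses the two modes.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_dir = Vt[0]                    # ~ [0, 1]: the noise axis
fisher_dir = np.array([1.0, 0.0])  # mean-difference (Fisher) direction

def separation(direction):
    """Gap between the two class means along `direction`, in std units."""
    z = X @ direction
    return abs(z[y == 0].mean() - z[y == 1].mean()) / z.std()

print("SVD direction separation:   ", separation(svd_dir))
print("Fisher direction separation:", separation(fisher_dir))
```

The SVD projection yields near-zero separation (mode collapse), while the Fisher direction keeps the components apart, mirroring the two panels of Figure 1.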

Comment

I thank the authors for the rebuttal. The authors have basically addressed my concerns. Therefore, I will increase my score to 4.

Review (Rating: 5)

The paper studies contrastive learning through the lens of linear dimensionality‑reduction for Gaussian‑mixture models (GMMs). By introducing an "augmentation‑enabled distribution" that biases each augmentation toward the same mixture component as its anchor, the authors prove that (i) InfoNCE and a simplified SimSiam objective recover the Fisher–optimal subspace for shared‑covariance GMMs, thereby matching the performance of fully‑supervised LDA, and (ii) a CLIP‑style multi‑modal InfoNCE learns a subset of the Fisher subspaces for each modality.

Experiments on synthetic mixtures and CIFAR‑100 corroborate the theory, showing that InfoNCE outperforms PCA and approaches (sometimes exceeds) LDA on clustering metrics.

Strengths and Weaknesses

Strengths

  • [theory] Sharp, closed-form characterization of when InfoNCE is equivalent to LDA. Introduces a clean, tunable notion of "noisy" augmentations (δ-biased draws).

  • [theory] Through comparison to spectral methods (SVD) and supervised LDA, the work offers valuable insights into when and why contrastive losses can match or exceed traditional methods in the linear setting.

  • [empirical validation] A systematic synthetic study isolates the effects of (i) augmentation noise, (ii) anisotropy, and (iii) projection rank. The real-data CIFAR-100 experiment shows gains over PCA and parity with LDA.

Weaknesses

  • Linear Mapping Assumption: The analysis is restricted to linear projectors, whereas modern contrastive methods use deep nonlinear encoders. It remains unclear how these results extend to non‑linear settings.

  • Simplified SimSiam Analysis: The treatment of SimSiam employs a greatly simplified loss (no StopGrad or prediction head dynamics), raising questions about applicability to actual implementations.

  • The result for 0 < δ < 1 is only conjectured (stated but unproven, l. 256-260).

Questions

  • [minor] Nonlinear Extensions: Do the authors have insights or conjectures on how their linear‑theory might generalize to deep nonlinear encoders? Could kernel methods bridge this gap?

  • [minor] Finite‑Sample Bounds: Can the authors extend their analysis to finite datasets? What sample complexity is required for the empirical InfoNCE solution to approach the Fisher subspace?

  • [minor] Role of Temperature and Batch Size: InfoNCE performance in practice depends on hyperparameters like temperature and negative sampling. How do these factors interact with the theoretical model?

Limitations

NA

Final Justification

I would like to thank the authors for their rebuttal, which addresses most of my concerns. I agree that the non-linear setting is non-trivial, and I find the paper’s theoretical results both interesting and valuable for advancing our understanding of contrastive learning. Overall, I am inclined toward recommending acceptance.

Formatting Issues

NA

Author Response

Remark 1 : Restriction of analysis to Linear Mapping Assumption

We show that in the linear setting, the InfoNCE loss learns the Fisher subspace, which we motivated as the optimal subspace for clustering. We do not know of an equivalent definition of optimality for representations under non-linear mappings, and such a definition seems prohibitively difficult to come up with without assumptions on the mappings.

Remark 2 : Simplified SimSiam analysis

The reviewer raises a valid point. An analysis including the stop-grad and the prediction head introduces new challenges that we cannot tackle without loosening the analysis. Our paper deals with the fixed-point analysis of these loss functions, and such fixed-point analyses are not affected by the stop-grad and prediction-head components. Though we agree these components are essential to the success of SimSiam, for our results (i.e., that the learnt representations lie in the Fisher subspace) their presence does not change the conclusions. Studying the contribution of the stop-grad and the prediction head would require a dynamical-systems approach to parameter learning, which is more complex than our setting, and it is unclear whether a well-defined notion of optimality would exist there.

Remark 3 : Proof for noise augmentations is only conjectured.

In our analysis, we try to show that for any direction in the Fisher subspace not included in the optimal solution, the gradient of the loss function is negative. This leads us to upper bound the gradient of the loss by $-\delta \sum_k w_k a_k^2 + \sum_k w_k a_k^2$. This upper bound comes from the Perron-Frobenius theorem (at the end of Supplementary B). The gradient is indeed non-positive if $\delta = 1$. When $\delta < 1$, our analysis cannot guarantee this, but we believe the same holds even for relatively large values of $\delta$. However, this would require a more fine-grained analysis that takes into account the properties of the mean subspace and the covariance matrix.
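For completeness, the sign argument behind this bound can be spelled out in one line: since the mixture weights $w_k$ are non-negative,

\[
-\delta \sum_k w_k a_k^2 + \sum_k w_k a_k^2 \;=\; (1-\delta)\sum_k w_k a_k^2 \;\ge\; 0 \quad \text{for } \delta < 1,
\]

so the upper bound is guaranteed non-positive only in the perfect-augmentation case $\delta = 1$; for $\delta < 1$ the bound is non-negative and therefore uninformative about the sign of the gradient.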

Additionally, we study the effect of noise empirically in our synthetic experiments. We believe that introducing noise does not change the optimal subspace in the infinite-sample limit, though we expect it to adversely affect the number of samples required. This ties into the finite-sample analysis question, which is on our list of future work.

Remark 4 : Non-linear extensions

We show that in the linear setting, the InfoNCE loss learns the Fisher subspace, which we motivated as the optimal subspace for clustering. We do not know of an equivalent definition of optimality for representations under non-linear mappings, and such a definition seems prohibitively difficult to come up with without assumptions on the mappings.

Remark 5 : Finite-sample bounds

We acknowledge that our analysis is designed for the infinite-sample setting and there does not seem to be a trivial extension to finite samples. The main technical difficulty lies with the LogSumExp term in the InfoNCE loss. We believe the reviewer would agree that such an analysis is still non-trivial even in the linear setup we propose. That said, we are exploring directions that would let us make meaningful statements about establishing bounds on the number of samples needed to learn subspaces comparable to the Fisher subspace. Our goal in this paper is to quantify the absolute limits of contrastive learning in the infinite-sample limit, and we defer the finite-sample study to future work.

Finite-sample analysis, with sample complexities and convergence rates, would also require careful design of an optimization algorithm for the InfoNCE or other losses, which is non-trivial and beyond the scope of our infinite-sample analysis. We leave the finite-sample analysis to future work, which we believe is of independent interest to the community.

Remark 6 : Role of Temperature and Batch Size

This is indeed a great question, and it is relevant to the finite-sample analysis and the convergence rate of the optimization process. Batch size is not applicable to our analysis framework, as we focus on the limiting behavior. Although temperature is more relevant, it should not affect our conclusion, beyond constants in the calculations, that InfoNCE learns the Fisher subspace under perfect augmentations.

Comment

Dear Reviewer 7TSE,

With the Author-Reviewer discussion ending in less than a day, could you please let the authors know if you are satisfied with their response?

Sincerely,

AC

Review (Rating: 5)

This paper presents a new theoretical perspective on self-supervised learning via contrastive learning, aiming to explain why the simple strategy of contrasting or relating pairs is so effective. The authors formulate the representation learning problem as a form of dimensionality reduction for Gaussian Mixture Models (GMMs) with a shared covariance matrix, referred to as sharedGMMs. Their central finding is that contrastive methods learn the Fisher-optimal subspace—surpassing traditional unsupervised methods like SVD and approaching the performance of supervised techniques such as Linear Discriminant Analysis (LDA). To allow for analytical tractability, the analysis is restricted to linear mappings, and the optimality of representations learned via contrastive learning is studied under the assumption of perfect augmentations for the InfoNCE loss. The authors also distinguish between two categories of contrastive learning objectives: uni-modal and multi-modal GMMs, the latter modeled after CLIP-like architectures. In both scenarios, they demonstrate that contrastive learning with InfoNCE recovers a subset of the Fisher-optimal subspace. For empirical validation, the paper evaluates various linear dimensionality reduction methods on both synthetic sharedGMM datasets and real-world data (CIFAR-100), showing strong alignment between theoretical predictions and experimental results.

Strengths and Weaknesses

Strengths

• The proposed analytical approach to studying contrastive learning through the InfoNCE loss, even in the simplified setting of linear mappings, is both insightful and well-motivated. The theoretical connection to supervised LDA is particularly compelling.

• The analysis of optimal projection learning for multi-modal data is valuable. Demonstrating that multi-modal InfoNCE learns a subset of the Fisher-optimal subspace contributes meaningfully to the understanding of contrastive objectives in complex settings.

• The introduction of the Augmentation-enabled Distribution (AeD), which includes a hyperparameter to control the correlation between a sample and its augmentation, is novel and provides a more principled framework for modeling augmentations compared to traditional distance-based approaches.

• The manuscript is fairly well written and well-structured, making the theoretical contributions easy to follow.

Weaknesses

• Theorem 4.1 is restricted to the ideal case of perfect augmentation (δ = 1). However, such a scenario rarely holds in practice. The analysis does not account for more realistic conditions, such as limited numbers of positive or negative samples, or imperfect augmentations.

• There is no discussion of the number of positive and negative samples required for effective contrastive learning. This is a crucial factor, especially when sample balance and augmentation noise can significantly impact learning outcomes.

• The experimental analysis is quite limited. The use of CIFAR-100 as the primary dataset may not be well aligned with the assumptions of the theoretical framework. As shown in Table 1, the reported performance across all methods is quite low. A more suitable approach might involve first learning a nonlinear embedding for the image data that better satisfies the sharedGMM assumption, followed by applying linear mappings. This could also be extended to non-image datasets to further validate the theory.

• While the paper touches on the role of augmentation and the distinction between perfect and noisy augmentations, a more quantitative analysis would be valuable. Specifically, measuring the effect of augmentation noise on the ratio of inter-component to intra-component variance would strengthen the empirical insights.

• The experimental details, particularly for the synthetic studies, are not quite clear and need to be clarified. More transparency in how the synthetic data is generated and evaluated would improve reproducibility and understanding.

• Although the paper provides theoretical analysis for the multi-modal contrastive learning case, it lacks corresponding experimental validation. The authors could consider using a subset of the datasets from the CLIP paper to demonstrate their findings in practice.

Questions

• The authors state, "Although we do not provide a proof for S_Info = S_F when 0 < δ < 1, we conjecture that it is true, ..." (lines 256-257). Doesn't this result depend on the noise level introduced by the augmentation process? Is it reasonable to assume that the learned subspace remains optimal under imperfect augmentations?

• The authors say that "both contrastive and non-contrastive objectives learn the same subspace" (lines 261-262). However, Theorem 4.1 suggests that S_Siam recovers the Fisher subspace even without perfect augmentations (δ < 1), while the results for InfoNCE are tied specifically to the assumption of perfect augmentations. This appears inconsistent with the general conclusion (without making use of negative samples). Could the authors clarify this discrepancy?

• In practice, augmentation techniques do not rely on independently sampled transformations. Instead, each augmentation is typically generated conditionally, based on the original sample (e.g., type-preserving augmentations). Would it not be more realistic to model the augmentation process as a conditional distribution p(x̂ | x)? How would this change the theoretical conclusions, especially under statistical dependencies introduced by type-based augmentations?

• Theorem 4.1 assumes r ≥ K, but in all experiments the projection dimension r appears to be smaller than the number of clusters. Could the authors explain this mismatch between theory and experimental setup?

• The synthetic experiments are not clearly described. What is the data distribution? What is the ambient space dimension? Why does the "optimal" method perform worse? Is the data generated from a sharedGMM? Why is r = K − 1 used in the numerical experiments? What is the rationale behind the design of the cluster means?

• The authors write that “InfoNCE loss learns to scale within the subspace leading to better clustering performance” (lines 336–337). Could the authors elaborate on what is meant by "scaling within the subspace" and how this contributes to improved clustering?

• In Definition 2.2, in the equation for the AeD, should K ≠ K' be satisfied?

• The paper defines a "good projection" as one that performs well in classification tasks. Is this definition sufficient for evaluating the quality of representations more broadly, especially in unsupervised or transfer learning scenarios?

• Section G.1 in the supplementary material appears to be incomplete.

• In line 63, the citation is missing.

Limitations

I think the main limitations of the proposed analytical framework for studying contrastive learning are as follows:

• It relies on the assumption of shared covariance in the GMM model, which is a strong condition and may not hold in many practical scenarios.

• It lacks an explicit analysis or quantification of the effects of augmentation noise and the overlap between components in the GMM, both of which are critical factors in real-world applications.

Final Justification

The authors provided a detailed rebuttal and addressed many of my questions. They provided a clearer explanation of the assumptions and limitations of the theoretical work. Additionally, they provided a new set of results on the ImageNet dataset.

Formatting Issues

NA.

Author Response

Remark 1 : Theorem 4.1 is restricted to the ideal case of perfect augmentation. Analysis doesn’t account for more realistic scenarios.

In our analysis, we try to show that for any direction in the Fisher subspace not included in the optimal solution, the gradient of the loss function is negative. This leads us to upper bound the gradient of the loss by $-\delta \sum_k w_k a_k^2 + \sum_k w_k a_k^2$. This upper bound comes from the Perron-Frobenius theorem (at the end of Supplementary B). The gradient is indeed non-positive if $\delta = 1$. When $\delta < 1$, our analysis cannot guarantee this, but we believe the same holds even for relatively large values of $\delta$. However, this would require a more fine-grained analysis that takes into account the properties of the mean subspace and the covariance matrix.

Additionally, we study the effect of noise empirically in our synthetic experiments. We believe that introducing noise does not change the optimal subspace in the infinite-sample limit, though we expect it to adversely affect the number of samples required. This ties into the finite-sample analysis question, which is on our list of future work.

Remark 2 : Reliance on shared covariance assumptions

We wanted to go beyond the typical spherical (identity) covariance assumption and consider a general covariance matrix shared among components. The assumption is, however, necessary for the theoretical analysis. One of the steps in our proof requires an affine transformation to make each component's covariance isotropic, and the shared (non-spherical) covariance assumption is the most general setting we could come up with (please see Supplementary C). Moreover, the Bayes optimality of the Fisher subspace holds only for shared-covariance GMMs, and we do not have a reliable measure of optimality beyond that.

Remark 3 : Lack of experimental validation for multi-modal scenario

Owing to the lack of a good small-scale multi-modal dataset and a time crunch, we unfortunately could not complete this for the rebuttal period. A possible experimental setup would be to take ImageNet images and generate synthetic text captions for them using an LLM. We believe this is a toy scenario and were unsure of its value in the reviewer's eyes. If you would be willing to see such results, we will conduct a small-scale experiment in the manner described above and have it ready for the discussion period. Alternatively, we could provide an analysis of (a subset of) real CLIP embeddings, which we were not able to complete due to the limited time allocated for rebuttal preparation. We will be working on this in the meantime.

Remark 4 : Analysis doesn’t account for more realistic scenarios like a limited number of positives and negatives. No discussion on number of positives and negatives required for effective contrastive learning

We acknowledge that our analysis is designed for the infinite-sample setting and there does not seem to be a trivial extension to finite samples. The main technical difficulty lies with the LogSumExp term in the InfoNCE loss. We believe the reviewer would agree that such an analysis is still non-trivial even in the linear setup we propose. That said, we are exploring directions that would let us make meaningful statements about establishing bounds on the number of samples needed to learn subspaces comparable to the Fisher subspace. Our goal in this paper is to quantify the absolute limits of contrastive learning in the infinite-sample limit, and we defer the finite-sample study to future work.

Finite-sample analysis, with sample complexities and convergence rates, would also require careful design of an optimization algorithm for the InfoNCE or other losses, which is non-trivial and beyond the scope of our infinite-sample analysis. We leave the finite-sample analysis to future work, which we believe is of independent interest to the community.

Remark 5 : Limited experimental analysis

We agree with your comment on our CIFAR-100 experiments. Following your suggestion, we conducted another set of experiments using the ImageNet dataset. Due to character limits, please refer to Remark 2 for Reviewer nsJs for the new result.

Remark 6 : Experimental details are not quite clear

We apologize for the lack of clarity and detail, and thank you for your constructive comment. We will resolve the ambiguities and fill in any incomplete details in the supplementary sections as pointed out. To summarize the details: for the synthetic experiments we consider a 10-component shared-covariance GMM with an ambient dimension of 100. The covariance matrix is diagonal. To simulate a parallel-pancake-like setup, we increase the variance in directions orthogonal to the mean subspace, which leads to poor clustering performance with methods like PCA. We chose this setup deliberately, as unsupervised methods struggle with this kind of data.
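Under the stated parameters (10 components, ambient dimension 100, diagonal shared covariance with inflated variance off the mean subspace), the data generation can be sketched as follows; the specific scale values are our guesses, not the authors':

```python
import numpy as np

rng = np.random.default_rng(3)

# Parallel-pancake synthetic setup as described: K = 10 components in
# ambient dimension d = 100, shared diagonal covariance with inflated
# variance in directions orthogonal to the mean subspace.
K, d, n_per = 10, 100, 200

# Place the component means in the span of the first K - 1 coordinates.
means = np.zeros((K, d))
means[:, : K - 1] = rng.normal(scale=5.0, size=(K, K - 1))

# Shared diagonal covariance: unit variance on the mean subspace,
# large variance on the remaining "noise" coordinates.
variances = np.ones(d)
variances[K - 1 :] = 25.0

X = np.vstack([
    mu + rng.normal(size=(n_per, d)) * np.sqrt(variances) for mu in means
])
labels = np.repeat(np.arange(K), n_per)
```

Because the largest variances lie off the mean subspace, variance-seeking methods like PCA pick the noise directions first, which is exactly why this setup is hard for unsupervised baselines.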

Remark 7 : Would it not be more realistic to model the augmentation process as a conditional distribution. How would this change the theoretical conclusions, especially under statistical dependencies introduced by type-based augmentations?

Conditionally generated samples, as studied in [HaoChen et al., 2021], serve as an alternative framework for analyzing self-supervised learning. Prior work exploring these directions often deals with theory-friendly approximations of the InfoNCE loss, i.e., replacing LogSumExp with a squared term, and does not relate to a Bayes-optimal notion of representation of the kind our work deals with. Probabilistic modelling is a more realistic approach; however, its conclusions would not be enough to explain the possible advantages of contrastive learning over, for instance, fully supervised learning.

Regarding the effect of modelling augmentations as conditional distributions in our setting: we also tried a naive conditional distribution $\hat{x} \sim \mathcal{N}(x, \sigma)$, i.e., the augmented sample is a "noisy" version of the point $x$. Under this distribution, we were not able to recover the Fisher subspace with the InfoNCE loss. We posit that an ideal conditional distribution should be more informative than the original sample itself, and hence should provide some information about the cluster it belongs to. A stronger conditional distribution (our AeD), which assumes the augmented sample is biased towards the underlying component of the original sample, is informative enough to learn the Fisher subspace. Identifying a less restrictive conditional distribution (compared to AeD) that can still learn the Fisher subspace is indeed an interesting question.
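To make the contrast concrete, here is a minimal sketch (our own illustration, not the paper's code; names and parameters are hypothetical, and we interpret $\mathcal{N}(x, \sigma)$ as isotropic noise) of the two augmentation schemes discussed above:

```python
import numpy as np

rng = np.random.default_rng(1)

def naive_augment(x, sigma=0.5):
    # "Noisy copy" augmentation: x_hat = x + sigma * noise. It carries no
    # extra information about x's cluster; per the discussion above, this
    # variant did not recover the Fisher subspace under InfoNCE.
    return x + sigma * rng.standard_normal(x.shape)

def aed_augment(z, means, cov_sqrt):
    # AeD-style augmentation: draw a fresh sample from the component z that
    # generated the original point, so the positive pair shares the
    # underlying cluster rather than just being a perturbed copy.
    d = means.shape[1]
    return means[z] + cov_sqrt @ rng.standard_normal(d)

means = np.array([[0.0, 0.0], [5.0, 5.0]])   # two toy components
cov_sqrt = 0.3 * np.eye(2)
x = aed_augment(0, means, cov_sqrt)           # original sample from component 0
x_pos = aed_augment(0, means, cov_sqrt)       # its AeD positive pair
x_noisy = naive_augment(x)                    # naive noisy copy of x
```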

We admittedly study a much simpler setup, but this helps us prove stronger results, i.e., showing equivalence to fully supervised methods, and our findings are partially recovered by the synthetic experiments and surprisingly validated by the real-data examples (both the earlier CIFAR-100 experiments and the new, non-linear ImageNet results).

HaoChen et al., 2021. Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss.

Remark 8 : Projection dimension r>K.

We tried to study a more difficult problem with fewer embedding dimensions so as to make a stronger empirical conclusion, and we unfortunately did not have experiments with $r > K$. Following this remark, we ran additional experiments with r = 30 for the 20-class dataset. We refer the reviewer to the last column of our ImageNet results in Remark 5.

Remark 9 : What does scaling within the subspace mean

Thanks for pointing this out. Our terminology was inaccurate and admittedly loose; let us be more precise. By scaling within the subspace, we mean the following: methods such as LDA learn just a projection matrix (i.e., a projection onto a subspace), while methods like InfoNCE learn an affine transformation (a projection onto a subspace plus an affine transformation within it), which leads to representations more suitable for clustering. We interpret this as a possible reason InfoNCE does better than LDA (optimal in the synthetic experiments), which only projects onto the optimal subspace. We will make this clear in the camera-ready version.
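Schematically, with illustrative notation (W spanning the Fisher subspace; symbols are ours, not the paper's), the distinction is:

```latex
% LDA: an orthogonal projection onto the Fisher subspace only,
f_{\mathrm{LDA}}(x) = W^\top x,
  \qquad W \in \mathbb{R}^{d \times r},\; W^\top W = I_r

% InfoNCE: the same subspace composed with an invertible map inside it,
% which can rescale directions within the subspace and thereby improve
% the cluster geometry seen by k-means-style evaluation:
f_{\mathrm{NCE}}(x) = A\,W^\top x,
  \qquad A \in \mathbb{R}^{r \times r} \text{ invertible}
```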

Remark 10 : Definition of a good projection and its sufficiency

We believe the reviewer refers to our definition of a "favorable" mapping as one that maximizes the Fisher discriminant. Our definition indeed depends on learning discriminative representations for each cluster. When each cluster corresponds to a broad class like "dogs", these representations might be sub-optimal for certain transfer-learning tasks (such as distinguishing between dog breeds). But when the clusters are finely defined (via augmentations), we get "good" representations that are discriminative enough for the sub-structures in the data. Naturally, the quality of the representations is tied to how fine-grained and/or noisy the augmentations are (which is the only information through which we could discover "classes"). Under this scenario, we believe our definition is reasonable; however, we would be delighted to hear more if the reviewer has better suggestions.

Comment

I thank the authors for their detailed rebuttal and for addressing many of my questions. The response has clarified several aspects of the work and better explained the limitations of the current theoretical scope. I understand that running entirely new experiments in a limited time is not always feasible, and I appreciate the effort to provide new results for the ImageNet dataset.

I believe this study has significant potential and am open to increasing my score. However, for the manuscript to be reconsidered, it is important that the authors incorporate the clear points made in the rebuttal into the main text, specifically along two main lines:

Expanding on Experimental Scope and Limitations: The manuscript would be significantly improved by a more thorough description of the synthetic experiments and a more comprehensive limitations section. It is important that the authors clearly describe the theoretical boundaries of their contribution. The final discussion section of the paper currently offers a very brief treatment of the study's limitations. This section should be comprehensively expanded to include the insightful points about the limitations discussed in the rebuttal (perfect augmentation, dimensionality of r vs. K, infinite sampling, covariance structure, etc.).

Multi-modal Experiment: Since an important part of the paper involves applying contrastive learning in a multi-modal setting, it is essential to include a corresponding experiment. The paper's claims would be much more convincing with at least one experiment on multi-modal data. To make this easier without requiring extensive additional training, I suggest the authors use embeddings from a pre-trained model such as CLIP, along with an associated image-text dataset.

Review
5

The authors investigate a Gaussian Mixture Model variant of contrastive learning. They show that a noisy-label contrastive setup can converge to the same subspace as that discovered by LDA (which needs class labels). They perform quantitative experiments on small-scale datasets to demonstrate the theory and show how it varies with key parameters such as the noise fraction.

Strengths and Weaknesses

Strengths:

  • Good attention to detail in the mathematics and notation
  • Interesting theorem connecting contrastive learning with LDA and Fisher subspaces
  • Reasonable analysis of results and experiments

Weaknesses:

  • The work is good but comes off a bit dense at times; perhaps include an overview figure that captures the main plot, making it easier for readers to quickly glean the important aspects of the work.
  • Minor typos
  • Smaller-scale datasets for evaluation

Questions

  • Line 63: missing CLIP citation.
  • Can you help me understand the impact of the shared covariance on the model? Is this an assumption that all clusters have the same shape? If so, this seems a little restrictive, and I wonder whether your work could be applied to the general GMM case.
  • (For the authors' curiosity) One paper that seems relevant to your work is "I-Con: A Unifying Framework for Representation Learning"; from the looks of it, your theorem might represent a new row or column in their periodic table of representation learning algorithms.

Limitations

yes

Final Justification

A reasonable paper, keeping my score at accept

Formatting Issues

no

Author Response

Remark 1 : Including an overview figure for the paper

Thanks for the suggestion. We will try to include one in the camera-ready version of the paper. We plan to add a figure capturing the data model for shared-covariance GMMs and display the properties of representations learnt by different self-supervised loss functions. We will also include a table that explains the main theoretical results for contrastive losses and the properties of the subspaces learned in comparison to optimal Fisher subspace.

Remark 2 : Comment on the shared covariance assumption

You are correct in your assessment: shared covariance means all components have the same covariance, and the data samples per component are distributed over the ambient space in the "same shape". We wanted to go beyond the typical spherical (identity) covariance assumption and consider a general covariance matrix shared among components. The assumption, though, is necessary for the theoretical analysis. One of the steps in our proof requires an affine transformation that makes each component's covariance isotropic, and the shared (non-spherical) covariance assumption is the most general we could come up with (please see Supplementary C). Moreover, the Bayes optimality of the Fisher subspace holds only for shared-covariance GMMs, and we do not have a reliable measure of optimality beyond that. That said, we are exploring more general extensions of our analysis with appropriate optimality metrics.
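The whitening step mentioned above can be sketched as follows (a minimal numpy illustration of the standard transform, not the paper's proof): with shared covariance Sigma, the map x -> Sigma^{-1/2} x turns every component N(mu_k, Sigma) into N(Sigma^{-1/2} mu_k, I).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6

# A shared (non-spherical) symmetric positive-definite covariance Sigma.
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)

# Whitening map W = Sigma^{-1/2} via the eigendecomposition of Sigma.
evals, evecs = np.linalg.eigh(Sigma)
W = evecs @ np.diag(evals ** -0.5) @ evecs.T

# After x -> W x, each component N(mu_k, Sigma) has covariance
# W Sigma W^T = I, i.e., all components become isotropic.
print(np.allclose(W @ Sigma @ W.T, np.eye(d)))  # -> True
```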

In practice, however, we still get good numbers beyond shared covariances. For general Gaussians, we empirically show that InfoNCE learns representations that match (or even outperform) LDA on clustering metrics. We also include a new set of results on ImageNet. Specifically, we take images corresponding to 20 random classes from ImageNet-1K and use a pretrained ResNet-50 model to get the non-linear mappings. We then learn a linear map from the ResNet embedding space (2048 dimensional) to our target space. We present the results below (presented as ARI/AMI).

| Method | 5 dim | 10 dim | 15 dim | 19 dim | 30 dim |
| --- | --- | --- | --- | --- | --- |
| Random | 0.03026 / 0.08296 | 0.07231 / 0.14005 | 0.13140 / 0.23209 | 0.14405 / 0.23363 | 0.26177 / 0.38685 |
| PCA | 0.23408 / 0.46447 | 0.39843 / 0.59117 | 0.49903 / 0.65898 | 0.50954 / 0.66159 | 0.56355 / 0.70163 |
| Ambient | 0.48641 / 0.67233 | 0.48641 / 0.67233 | 0.48641 / 0.67233 | 0.48641 / 0.67233 | 0.48641 / 0.67234 |
| SimSiam | 0.37581 / 0.58233 | 0.56084 / 0.70246 | 0.58705 / 0.72227 | 0.60259 / 0.73159 | 0.62970 / 0.74476 |
| InfoNCE | 0.84451 / 0.88831 | 0.98182 / 0.98115 | 0.92963 / 0.97390 | 0.99621 / 0.99579 | 0.93317 / 0.97721 |
| LDA | 0.47112 / 0.69828 | 0.66967 / 0.83895 | 0.82550 / 0.89103 | 0.94744 / 0.94990 | 0.94745 / 0.94990 |
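For readers who want to reproduce clustering metrics of this kind, the standard tools are sklearn's `adjusted_rand_score` and `adjusted_mutual_info_score` applied to k-means assignments of the learned embeddings. As a self-contained illustration (our own sketch, not the authors' evaluation script), here is a numpy-only ARI:

```python
import numpy as np

def comb2(x):
    # Number of unordered pairs, n choose 2, applied elementwise.
    return x * (x - 1) / 2.0

def adjusted_rand_index(y_true, y_pred):
    """Numpy-only ARI, following the same definition as sklearn's
    adjusted_rand_score: (Index - Expected) / (Max - Expected)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    _, ti = np.unique(y_true, return_inverse=True)
    _, pi = np.unique(y_pred, return_inverse=True)
    # Contingency table n_ij between true classes and predicted clusters.
    n_ij = np.zeros((ti.max() + 1, pi.max() + 1))
    np.add.at(n_ij, (ti, pi), 1)
    sum_comb = comb2(n_ij).sum()
    sum_a = comb2(n_ij.sum(axis=1)).sum()
    sum_b = comb2(n_ij.sum(axis=0)).sum()
    expected = sum_a * sum_b / comb2(len(y_true))
    max_index = 0.5 * (sum_a + sum_b)
    return (sum_comb - expected) / (max_index - expected)

# Perfect agreement up to relabeling gives ARI = 1.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # -> 1.0
```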

Remark 3 : Relevant work “I-Con: A Unifying Framework for Representation Learning”

We appreciate the reviewer pointing out other relevant work. Our method occupies the same block as InfoNCE, i.e., a supervisory signal uniform over positive pairs with Gaussian learnt representations, though for us the positive pairs are defined as points belonging to the same cluster rather than a given set of positive pairs.

Comment

Thanks for considering my suggestions and for explaining more details about my questions. I will keep my score at accept

Comment

Dear Reviewers,

Please read the authors' rebuttals if you have not done so yet, and respond to them as soon as possible to allow sufficient time for follow-up exchanges. The author-reviewer discussion is crucial to a constructive reviewing process, to which your reactivity and engagement are indispensable.

Best regards,

AC

Final Decision

This work theoretically investigates the benefit of contrastive learning under Gaussian mixture models (GMMs). As a key contribution, the paper proves that the standard contrastive loss, InfoNCE, recovers the optimal linear mapping in terms of the separability of Gaussian clusters from an infinitely large number of augmented data pairs in which both data points, the original and the augmented, are generated from the same Gaussian component. The same optimal linear mapping can also be obtained by LDA in a fully supervised manner, demonstrating the competitiveness of contrastive learning with respect to the supervised approach. Empirical results on synthetic and CIFAR-100 data also support the comparability between contrastive and supervised learning, in a less restrictive (and noisier) setting where the augmented data are generated, with various probabilities, from clusters different from those of the original samples. Multi-modal learning is also studied by considering separate GMMs with matching components, for which the optimal linear mapping is only partially recovered except under certain special configurations of the model parameters.

The provided analysis is well motivated and carefully laid out. Its relevance and novelty were unanimously recognized by Reviewers. Several restrictions of the theoretical results were pointed out, such as linear mapping, infinite sample regime, and the conditional distribution of augmented data that depends only on the cluster index of original ones. Removing them would provide deeper insight into contrastive learning but require non-trivial extensions beyond the scope of the current analysis, as argued in Authors’ response. While they do not undermine the main messages of this article, these limitations should be better outlined and discussed in the manuscript, as per the suggestion of Reviewer 79Vv. Another criticism shared by several reviewers is the narrow scope of the experimentation, which is partly addressed by the new results on ImageNet provided during rebuttal, and can be further strengthened with experiments on multi-modal learning.