PaperHub
6.4 / 10
Poster · 4 reviewers
Ratings: 4, 4, 4, 4 (min 4, max 4, std 0.0)
Confidence: 2.8
Novelty 2.8 · Quality 3.0 · Clarity 2.3 · Significance 3.0
NeurIPS 2025

A Statistical Theory of Contrastive Learning via Approximate Sufficient Statistics

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29

Abstract

Contrastive learning---a modern approach to extract useful representations from unlabeled data by training models to distinguish similar samples from dissimilar ones---has driven significant progress in foundation models. In this work, we develop a new theoretical framework for analyzing data augmentation-based contrastive learning, with a focus on SimCLR as a representative example. Our approach is based on the concept of approximate sufficient statistics, which we extend beyond its original definition in~\cite{oko2025statistical} for contrastive language-image pretraining (CLIP) using KL-divergence. We generalize it to equivalent forms and general $f$-divergences, and show that minimizing SimCLR and other contrastive losses yields encoders that are approximately sufficient. Furthermore, we demonstrate that these near-sufficient encoders can be effectively adapted to downstream regression and classification tasks, with performance depending on their sufficiency and the error induced by data augmentation in contrastive learning. Concrete examples in linear regression and topic classification are provided to illustrate the broad applicability of our results.
Keywords
Contrastive learning · data augmentation · SimCLR · approximate sufficient statistics

Reviews and Discussion

Review
Rating: 4

The paper develops a statistical framework for augmentation-based contrastive learning, such as SimCLR, through the concept of approximate sufficient statistics. Three equivalent definitions of approximate sufficient statistics for general f-divergences are proposed. The authors show that minimizing a finite-batch InfoNCE loss yields encoders whose sufficiency gap is controlled by the excess empirical risk. They then derive excess risk bounds for downstream regression and classification that depend on the encoder's sufficiency and the augmentation error.

Strengths and Weaknesses

Strengths

  1. Solid theoretical analysis of self-supervised contrastive learning using data augmentation.
  2. Establishes a clear link between downstream risk and approximate sufficient statistics, and provides learning-theoretic bounds for those statistics.

Weaknesses

  1. Their second claimed contribution (line 44) sounds vague. They assert that random transformations in SimCLR "introduce additional challenges for theoretical analysis," but they do not identify the challenges of applying Oko et al. 2025 to contrastive learning with single-modality data. At the same time, their first claimed contribution (line 39), extending approximate sufficient statistics from the KL divergence to any f-divergence and proving three equivalent forms, remains a modest generalization of Oko et al. 2025.

  2. SimCLR defines specific data augmentation schemes that are central to its success, yet the paper did not derive the resulting augmentation bias $\epsilon_{\mathcal{G}}$ for any of these transforms.

Questions

I would consider increasing the score after the following questions are at least partly addressed.

  1. Could you explain the technical gap or challenges relative to Oko et al. 2025 when dealing with self-supervised contrastive learning for single-modality data?
  2. The paper introduces the augmentation error term $\epsilon_{\mathcal{G}}$. Do you have any upper bounds for any real augmentation such as random cropping, color jitter, or blur?
  3. As a curiosity, how does the temperature parameter in the InfoNCE loss affect convergence rate?

Limitations

yes

Final Justification

I am changing my score to Borderline Accept (4), as the rebuttal meets the criteria I laid out in my Questions. In particular, it resolves my novelty concern of the difference from Oko et al. (2025).

On $\epsilon_{\mathcal{G}}$ for real augmentations, I understand that general bounds are difficult, but there is still no concrete bound even for a simplified version of a canonical augmentation (e.g., random cropping). This remaining gap is why my recommendation is borderline rather than a clear accept.

Formatting Issues

no major formatting issues found

Author Response

We are grateful to Reviewer DTT6 for their thoughtful review of our submission. Below, we respond to each question raised in their review.


Q1. Could you explain the technical gap or challenges relative to Oko et al. 2025 when dealing with self-supervised contrastive learning for single-modality data?

A1. A main technical challenge in extending Oko et al. (2025) [1] to single-modality data (especially to SimCLR) is handling the error induced by random augmentation. Note that random augmentation is not performed in the CLIP setting considered in [1], and therefore the downstream error can be controlled directly via the sufficiency of the encoder (see, e.g., Propositions 2, 3, and 4 in [1]). In our work, we introduce a novel augmentation error term $\epsilon_{\mathcal{G}}$ (or $\epsilon^{\mathsf{cls}}_{\mathcal{G}}$), and use it to establish downstream error bounds for any encoder $f$, by applying triangle inequalities for the $\ell_2$ error in regression (Theorem 2) and for the KL divergence in classification (Theorem 3). Moreover, in the regression case (Theorem 2 and the subsequent discussion), we show that the same bound holds for the minimum error $\widetilde\epsilon_{\mathcal{G}}$, suggesting the tightness of our bounds for encoders $f$ with low sufficiency.


Q2. The paper introduces the augmentation error term $\epsilon_{\mathcal{G}}$. Do you have any upper bounds for any real augmentation such as random cropping, color jitter, or blur?

A2. We thank the reviewer for this practical question. In general, it is challenging to derive upper bounds for real-world augmentations such as cropping, color jitter, or blur, as such bounds depend on concrete assumptions about the augmentations and oracle knowledge of the data. For example, in image classification tasks, the augmentation error of random cropping depends on the proportion of pixels dropped and the true label probabilities of the cropped images, which are typically unknown. Nevertheless, in practice, one may use the predicted label probabilities of the original and cropped images from a highly accurate image classifier as a surrogate for the true label probabilities, and use them to empirically estimate the augmentation error $\epsilon^{\mathsf{cls}}_{\mathcal{G}}$.
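
To make the surrogate procedure above concrete, here is a minimal sketch (ours, not from the paper) of such an empirical estimate; the `classifier` and `augment` callables and the averaged KL form are assumptions for illustration, since the paper's exact definition of $\epsilon^{\mathsf{cls}}_{\mathcal{G}}$ is not reproduced here.

```python
import torch

@torch.no_grad()
def estimate_aug_error_cls(classifier, images, augment, n_aug=8, eps=1e-12):
    """Surrogate estimate of the classification augmentation error.

    Uses a highly accurate classifier's predicted label probabilities on
    original vs. augmented images in place of the unknown true label
    probabilities; the averaged KL divergence below is one illustrative
    choice, not necessarily the paper's exact definition.
    """
    p_orig = classifier(images).softmax(dim=-1)               # (N, num_classes)
    kls = []
    for _ in range(n_aug):                                    # average over random augmentations g
        p_aug = classifier(augment(images)).softmax(dim=-1)
        kl = (p_orig * (p_orig.clamp_min(eps).log() - p_aug.clamp_min(eps).log())).sum(-1)
        kls.append(kl.mean())
    return torch.stack(kls).mean()
```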

Theoretically, we emphasize that our main results hold for arbitrary data augmentation $g$, with the downstream error bounds depending on $g$ only through the augmentation error $\epsilon_{\mathcal{G}}$ (or $\epsilon^{\mathsf{cls}}_{\mathcal{G}}$). Explicit bounds on $\epsilon_{\mathcal{G}}$ (or $\epsilon^{\mathsf{cls}}_{\mathcal{G}}$) require additional assumptions on the augmentation. For example, in the context of Theorem 6, we have $\epsilon_{\mathcal{G}} = 0$ if $g$ is the projection onto some random subspace that contains $\theta_\star$.


Q3. As a curiosity, how does the temperature parameter in the InfoNCE loss affect convergence rate?

A3. In our work, the temperature parameter (denoted by $t$) enters the InfoNCE loss through the link function $\tau(x)$ in Eq. (2) when defining $\tau(x) := x/t$. Theoretically, the temperature affects our bound on sufficiency (Theorem 1) by affecting the upper bound on the score $B_{\mathsf{S}}$. Qualitatively, a smaller $t$ results in a larger constant $C$ in the bound in Eq. (4), which implies a slower convergence speed (up to constant factors).
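
For illustration, a minimal PyTorch sketch (not taken from the paper's code) of a batch InfoNCE loss with the linear link $\tau(x) = x/t$, showing where the temperature enters; with normalized features, as in SimCLR, the score is bounded by $1/t$.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, t=0.5):
    """Batch InfoNCE with linear link tau(x) = x / t.

    z1, z2: (K, d) encoder outputs for the two augmented views of a batch.
    With normalized features the score is bounded by 1/t, so a smaller
    temperature t enlarges the score bound B_S entering the constant C.
    """
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    scores = z1 @ z2.T / t                     # (K, K) pairwise scores
    labels = torch.arange(z1.size(0))          # positives sit on the diagonal
    return F.cross_entropy(scores, labels)

# e.g. loss = info_nce(encoder(aug(x)), encoder(aug(x)), t=0.1)
```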


[1]. Oko, K., Lin, L., Cai, Y., & Mei, S. (2025). A statistical theory of contrastive pre-training and multimodal generative AI. arXiv preprint arXiv:2501.04641.

Comment

Thank you for the detailed response.

Your clarification on the SimCLR-specific challenge and the resulting augmentation error decomposition addressed my main concern about novelty. Accordingly, I am raising my score.

While I appreciate that deriving explicit bounds for the augmentation error $\epsilon_{\mathcal{G}}$ is difficult in general, I strongly recommend including at least one concrete form or upper bound in the camera-ready, even under simplified assumptions.

Comment

Thanks for your helpful comments and for raising the score! We will make sure to include concrete examples on the augmentation error $\epsilon_{\mathcal{G}}$ in the paper.

Review
Rating: 4

This paper develops a new theoretical framework for data augmentation-based contrastive learning, focusing on SimCLR. It generalizes the concept of approximate sufficient statistics from prior work on CLIP to general f-divergences, and shows that minimizing contrastive losses (like InfoNCE) yields encoders that are approximately sufficient, and that the sufficiency of these encoders, together with the error induced by data augmentation, governs their adaptability and performance on downstream tasks. The paper provides concrete examples in linear regression and topic classification to illustrate the applicability of the theory.

Strengths and Weaknesses

Strengths:

  • The paper significantly extends the notion of approximate sufficiency from its prior, more limited form (KL-divergence for CLIP) to a general framework encompassing various f-divergences and equivalent mathematical formulations. This generalization is both non-trivial and valuable for the broader contrastive learning community.
  • The definitions of sufficiency (information loss, variational form, conditional Bregman) are carefully presented and shown to be equivalent, providing a solid mathematical foundation.
  • In Sec. 3.1, the paper establishes a sufficiency bound for the ERM estimator and shows that when f minimizes the empirical risk, the bound is a function of a generalization error (which goes to 0 as the number of samples tends to infinity) and a constant approximation error.
  • Then, in Sec. 3.2, the paper establishes that the downstream performance of learned encoders on linear regression and topic classification depends on both their sufficiency and the augmentation-induced error, providing a theoretical explanation for previous work on how data augmentations affect the performance of SimCLR's encoder.
  • In Sec. 3.3, the authors generalize beyond the SimCLR loss, extending their results to other choices of f via general f-sufficiency. In Eq. 9, they notably extend to chi-squared sufficiency. This generalization is appreciated and is a strength of the paper.

Weaknesses:

  • The paper is primarily theoretical; while it provides illustrative examples, it lacks empirical experiments that would demonstrate the practical impact of the theoretical bounds and insights in real-world settings. Notably, it would have been interesting to study the use of various f on real datasets, such as the chi-squared contrastive loss for example. Besides, we would have expected an experimental comparison between the empirical risk and the theoretical risk, using either a real-world dataset (if applicable) or a synthetic one.
  • Overall, the paper is hard to read, and the use of the same symbol to refer both to the encoder (the neural network) and to the f of the f-MI makes it confusing.
  • While the paper discusses how the sufficiency of the encoder impacts downstream performance, it overlooks past work on identifiability (Self-Supervised Learning with Data Augmentations Provably Isolates Content from Style, von Kügelgen et al., NeurIPS 2021, https://openreview.net/pdf?id=4pf_pOo0Dt), even though the connection between identifiability and encoder sufficiency appears intuitively close.
  • While the theoretical bounds are elegant, they rely on assumptions such as boundedness of the score function and invertibility/Lipschitz continuity of the link function. In practical neural networks, these conditions may not strictly hold. Discussing these assumptions would be beneficial for the paper. Moreover, the theory abstracts away from architectural details (e.g., depth, parameter count), which are known to affect the performance of learned representations. A discussion on how architecture choices influence sufficiency and generalization would strengthen the practical relevance of the results.

Questions

  • Is it straightforward to extend your theoretical results to multimodal contrastive learning settings (e.g., image-text), in line with identifiability analyses such as Daunhawer et al., "Identifiability Results for Multimodal Contrastive Learning," ICLR 2023, https://arxiv.org/pdf/2303.09166?
  • How does your approximate sufficiency framework encompass or relate to alternative contrastive/self-supervised objectives such as VICReg, Barlow Twins, SwAV, BYOL, and DINO, where some of them do not explicitly use negative samples and yet surpass SimCLR in most natural-image experiments?
  • Could you clarify the practical validity of your assumptions (e.g., bounded score functions, invertibility and Lipschitz continuity of the link function), particularly in the context of deep neural network encoders?
  • Could you clarify the impact of the encoder architecture on your generalization bounds? Does it impact the approximation error?
  • Do you have experimental results comparing the empirical performance of encoders trained with different f-divergences (e.g., KL vs. chi-squared contrastive losses) on standard datasets?
  • Could you discuss how your present work connects/relates to identifiability?

Limitations

Yes.

Final Justification

The authors have broadly responded to my points, but I wish to keep my score as it is.

Formatting Issues

Nothing to report.

Author Response

We sincerely appreciate Reviewer RiyN for their thorough review and constructive feedback on our work. Below, we hope to address all your comments.

Q1 and Q6. Generalization to multimodal contrastive learning and connection with identifiability [1,2]

A1. We thank the reviewer for this insightful question and for pointing out the connection with the identifiability results.

Generalization to multimodal contrastive learning. Yes, it is straightforward to generalize our theoretical results to multimodal contrastive learning. The main modification is to choose $X, Y$ as a pair of samples $(x, y)$ from two modalities (e.g., an image and its associated text) instead of two augmented views in the definition of sufficiency (see line 99). As a consequence, downstream bounds similar to those in [4] can be derived based on the generalized $\mathsf{f}$-sufficiency.

Connection with identifiability. The identifiability results in contrastive learning [1,2] consider a setting where paired views (either from augmentations or across modalities) share some common latent factors, and show that a representation that minimizes the expected alignment loss can identify these latent factors. Under their assumptions, it can be shown that the learned representation is equivalent to an encoder with zero sufficiency in our work. In contrast, we derive general downstream error bounds for encoders with arbitrary sufficiency, even though the encoders cannot exactly identify the hidden factors when the sufficiency measure is strictly positive.


Q2. How does your approximate sufficiency framework encompass or relate to alternative contrastive/self-supervised objectives such as VICReg, Barlow Twins, SwAV, and BYOL, DINO?

A2. In our work, the approximate sufficiency framework applies to any encoder $f$, including those obtained from the alternative contrastive/self-supervised objectives mentioned by the reviewer, with the downstream error bounds depending on the sufficiency of $f$. However, while the excess risk provides a direct upper bound on the sufficiency for the InfoNCE loss (line 166), for the alternative contrastive/self-supervised objectives mentioned above, it is not yet clear how their respective excess losses relate to our sufficiency measure. We leave establishing the connections between these excess losses and the sufficiency measure for future exploration.


Q3. Could you clarify the practical validity of your assumptions (e.g. bounded score functions, invertibility and Lipschitz continuity of the link function), particularly in the context of deep neural network encoders?

A3. In practice, the link function $\tau(x)$ is often chosen as a linear function, i.e., $\tau(x) = x/t$, where $t > 0$ is the temperature parameter. Thus, the invertibility and Lipschitz continuity of the link function are naturally satisfied. In SimCLR [3], the encoder $f$ is normalized before computing the score; therefore, the score is bounded by $1/t$, where the temperature $t \in \{0.1, 0.5, 1\}$ in their experiments.


Q4. Could you clarify the impact of the encoder architecture on your generalization bounds? Does it impact the approximation error?

A4. The encoder architecture affects the generalization bounds by influencing the covering number in the second term of the generalization error (see line 236). By a standard parameter-counting argument (e.g., Theorem 6 in [4]), the covering number can be upper bounded by $\mathcal{O}(d)$ for transformer-based models. The approximation error, on the other hand, depends on the encoder class $\mathcal{F}$. A richer and more complex encoder class (or architecture) has a smaller approximation error but potentially larger generalization error.


Q5. Do you have experimental results comparing the empirical performance of encoders trained with different f-divergences (e.g. KL vs. chi-squared contrastive losses) on standard datasets?

A5. We conduct synthetic experiments to learn data representations via contrastive learning with a two-layer neural network, and evaluate them on downstream linear regression.

In the contrastive learning phase, we generate $n$ i.i.d. samples $x_i \sim \mathcal{N}(0, I_d)$. The augmentation $g$ adds i.i.d. $\mathcal{N}(0, \sigma_1^2)$ noise to the first $s < d$ coordinates of $x_i$, and replaces the remaining coordinates with i.i.d. $\mathcal{N}(0, 1)$ noise. We apply KL and $\chi^2$-contrastive learning (Eq. 3 and 10) with link function $\tau(x) = x$, and encoder $f(x)$ being a two-layer ReLU neural network mapping $\mathbb{R}^d$ to $\mathbb{R}^s$. We set $s = 10$, $d = 100$, $n = 500$, hidden dimension 64, and batch size $K = 64$. The encoder is trained using Adam (learning rate 0.001) for 300 epochs, after which the training loss is observed to converge.

For downstream regression, we generate $m$ i.i.d. samples $(x_i, y_i)_{i=1}^m$, where $x_i \sim \mathcal{N}(0, I_d)$ and $y_i = \langle x_i, \theta_\star \rangle + \epsilon_i$, with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ independent of $x_i$. We choose $\theta_\star = (\mathbf{1}^\top_s/\sqrt{s}, \mathbf{0}^\top_{d-s})^\top$ and $\sigma = 1$. Using the learned representation $\widehat{f}(x_i) \in \mathbb{R}^s$ from KL (or $\chi^2$)-contrastive learning, we fit a downstream linear model to predict $y_i$. We define the excess risk of any predictor $h$ as $\mathbb{E}[(y_i - h(x_i))^2] - \sigma^2$, and evaluate the excess risk of the linear model trained on $(\widehat{f}(x_i), y_i)_{i=1}^m$ using 50000 test samples. For comparison, we also report the excess risk of a linear model trained directly on the original samples $(x_i, y_i)_{i=1}^m$. Results for various downstream sample sizes $m$ and the standard deviation over 10 runs are shown below.

Table: Excess risk for various downstream sample sizes m

| m | InfoNCE | Chi-squared | Direct LR |
| --- | --- | --- | --- |
| 150 | 0.106 ± 0.040 | 0.120 ± 0.028 | 2.066 ± 0.594 |
| 500 | 0.060 ± 0.015 | 0.070 ± 0.013 | 0.243 ± 0.032 |
| 1000 | 0.052 ± 0.012 | 0.063 ± 0.012 | 0.114 ± 0.021 |
| 2000 | 0.046 ± 0.011 | 0.058 ± 0.013 | 0.055 ± 0.012 |
| 5000 | 0.042 ± 0.011 | 0.055 ± 0.013 | 0.021 ± 0.005 |
| 10000 | 0.040 ± 0.012 | 0.054 ± 0.012 | 0.011 ± 0.004 |

From the table, we observe that InfoNCE and Chi-squared achieve comparable excess risks (differences of about one standard deviation), and both are substantially lower than that of direct linear regression when the sample size $m$ is relatively small (e.g., $m = 150, 500$). This suggests that both KL and $\chi^2$-contrastive learning can learn a “good” low-dimensional representation for the downstream task. As the sample size increases, the excess risk of direct linear regression converges to zero, while those of InfoNCE and Chi-squared converge to non-zero constants. This is consistent with our theoretical results, which attribute the excess risk to the non-zero sufficiency of $\widehat{f}$ and the augmentation error $\epsilon_{\mathcal{G}}$.

Due to limited computational resources, we did not evaluate the performance of $\chi^2$-contrastive learning on large-scale real-world datasets and leave this for future investigation.
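
For readers who wish to reproduce the toy study, below is a compact sketch of the InfoNCE branch of the setup described above; the augmentation noise level `sigma1`, the use of a single random batch per epoch, and other unstated details are our assumptions, and the $\chi^2$-contrastive variant (Eq. 10) is omitted since its exact loss form is defined in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
s, d, n, hidden, K = 10, 100, 500, 64, 64
sigma1 = 0.5                                     # augmentation noise level (not stated above)

def augment(x):
    """Add N(0, sigma1^2) noise to the first s coordinates, resample the rest from N(0, 1)."""
    out = x.clone()
    out[:, :s] += sigma1 * torch.randn(x.size(0), s)
    out[:, s:] = torch.randn(x.size(0), d - s)
    return out

# contrastive pretraining with the KL / InfoNCE loss and link tau(x) = x
encoder = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, s))
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
x_pre = torch.randn(n, d)
for epoch in range(300):
    idx = torch.randperm(n)[:K]                  # one random batch per epoch (simplification)
    scores = encoder(augment(x_pre[idx])) @ encoder(augment(x_pre[idx])).T
    loss = F.cross_entropy(scores, torch.arange(K))
    opt.zero_grad(); loss.backward(); opt.step()

# downstream linear regression on the frozen representation
m = 500
theta = torch.cat([torch.ones(s) / s ** 0.5, torch.zeros(d - s)])
x_dn, x_te = torch.randn(m, d), torch.randn(50_000, d)
y_dn = x_dn @ theta + torch.randn(m)             # sigma = 1 label noise
with torch.no_grad():
    phi = lambda x: torch.cat([encoder(x), torch.ones(x.size(0), 1)], dim=1)
    w = torch.linalg.lstsq(phi(x_dn), y_dn.unsqueeze(1)).solution
    # excess risk E[(y - h(x))^2] - sigma^2 equals the squared error against the noiseless target
    excess = ((x_te @ theta).unsqueeze(1) - phi(x_te) @ w).pow(2).mean()
print(f"excess risk with the InfoNCE representation (m={m}): {excess.item():.3f}")
```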


Other comments: use of $f$ for both the encoder and the $\mathsf{f}$-divergence.

We currently use different fonts to distinguish the encoder $f$ from the $\mathsf{f}$-divergence. We will clarify this distinction more explicitly in the paper to avoid confusion.


[1]. Von Kügelgen, Julius, et al. "Self-supervised learning with data augmentations provably isolates content from style." Advances in neural information processing systems 34 (2021): 16451-16467.

[2]. Daunhawer, Imant, et al. "Identifiability results for multimodal contrastive learning." arXiv preprint arXiv:2303.09166 (2023).

[3]. Chen, Ting, et al. "A simple framework for contrastive learning of visual representations." International conference on machine learning. PMLR, 2020.

[4]. Oko, K., Lin, L., Cai, Y., & Mei, S. (2025). A statistical theory of contrastive pre-training and multimodal generative AI. arXiv preprint arXiv:2501.04641.

Comment

I thank the authors for the detailed responses to my questions. My score remains unchanged.

Review
Rating: 4

This paper proposes a theoretical framework for analyzing data augmentation-based contrastive learning, particularly SimCLR, by means of approximate sufficient statistics, which extends a prior work using KL-divergence to general f-divergences. Specifically, this work shows that minimizing a contrastive loss produces encoders approximately sufficient, and they can be adapted to downstream regression and classification tasks, where downstream performance depends on their sufficiency and the error induced by data augmentation in contrastive learning.

Strengths and Weaknesses

Strengths

  • The idea of analyzing contrastive learning via approximate sufficient statistics sounds interesting.

  • The claim is supported by comprehensive theoretical derivation.

  • Concrete examples on regression and classification are provided, though they are limited to linear models.

Weaknesses

  • There are no experimental results, raising a concern about the practicality of the proposed idea. At least the prior work by Oko et al. [27] conducted some experiments with CLIP. If the theoretical results deviate from practical nonlinear models, then their contribution is limited accordingly. To address this concern, the authors could first conduct toy experiments with linear models as in the examples in Section 4, and then extend them to nonlinear models and see whether their theoretical results empirically hold, or how much they deviate from practical models.

  • L252: It is not clear how "end-to-end theoretical guarantees for the downstream performance of encoders obtained by minimizing general f-contrastive losses" is drawn by combining the results from Sections 3.3.1 and 3.3.2.

  • There are many "some constants" throughout the paper. How are they determined, at least in the examples? How tight are the bounds?

  • The error on the downstream task induced by data augmentation, $\epsilon_{\mathcal{G}}$, is an important variable in their analysis, but it is not specified well. For example, it would be useful to know how we can measure $\epsilon_{\mathcal{G}}$ in practice, how much its value varies across different augmentation strategies, and so on.

  • Formatting issue: generally speaking, citations should not appear in the Abstract.

  • Typo: f-sufficieny in the title of Section 3.3.1

Questions

Please address concerns in Weaknesses above.

Limitations

Contrary to what is stated in the checklist, limitations are not provided.

Final Justification

The authors successfully addressed my concerns. They also reported an additional experiment in a toy setting in their rebuttal, and further added another one in the CLIP setting around the end of the rebuttal period, though the experimental result in the CLIP setting seems somewhat weak.

Formatting Issues

nothing special

Author Response

We thank Reviewer sYsE for their careful review and valuable feedback. We hope to address all questions below.


Q1. There are no experimental results, raising a concern about the practicality of the proposed idea. At least the prior work by Oko et al. [27] conducted some experiments with CLIP. If the theoretical results deviate from practical nonlinear models, then their contribution is limited accordingly. To address this concern, the authors could first conduct toy experiments with linear models as in the examples in Section 4, and then extend them to nonlinear models and see whether their theoretical results empirically hold, or how much they deviate from practical models.

A1. We conduct synthetic experiments to learn data representations via contrastive learning with a two-layer neural network, and evaluate them on downstream linear regression.

In the contrastive learning phase, we generate $n$ i.i.d. samples $x_i \sim \mathcal{N}(0, I_d)$. The augmentation $g$ adds i.i.d. $\mathcal{N}(0, \sigma_1^2)$ noise to the first $s < d$ coordinates of $x_i$, and replaces the remaining coordinates with i.i.d. $\mathcal{N}(0, 1)$ noise. We apply KL and $\chi^2$-contrastive learning (Eq. 3 and 10) with link function $\tau(x) = x$, and encoder $f(x)$ being a two-layer ReLU neural network mapping $\mathbb{R}^d$ to $\mathbb{R}^s$. We set $s = 10$, $d = 100$, $n = 500$, hidden dimension 64, and batch size $K = 64$. The encoder is trained using Adam (learning rate 0.001) for 300 epochs, after which the training loss is observed to converge.

For downstream regression, we generate $m$ i.i.d. samples $(x_i, y_i)_{i=1}^m$, where $x_i \sim \mathcal{N}(0, I_d)$ and $y_i = \langle x_i, \theta_\star \rangle + \epsilon_i$, with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ independent of $x_i$. We choose $\theta_\star = (\mathbf{1}^\top_s/\sqrt{s}, \mathbf{0}^\top_{d-s})^\top$ and $\sigma = 1$. Using the learned representation $\widehat{f}(x_i) \in \mathbb{R}^s$ from KL (or $\chi^2$)-contrastive learning, we fit a downstream linear model to predict $y_i$. We define the excess risk of any predictor $h$ as $\mathbb{E}[(y_i - h(x_i))^2] - \sigma^2$, and evaluate the excess risk of the linear model trained on $(\widehat{f}(x_i), y_i)_{i=1}^m$ using 50000 test samples. For comparison, we also report the excess risk of a linear model trained directly on the original samples $(x_i, y_i)_{i=1}^m$. Results for various downstream sample sizes $m$ and the standard deviation over 10 runs are shown below.

Table: Excess risk for various downstream sample sizes m

| m | InfoNCE | Chi-squared | Direct LR |
| --- | --- | --- | --- |
| 150 | 0.106 ± 0.040 | 0.120 ± 0.028 | 2.066 ± 0.594 |
| 500 | 0.060 ± 0.015 | 0.070 ± 0.013 | 0.243 ± 0.032 |
| 1000 | 0.052 ± 0.012 | 0.063 ± 0.012 | 0.114 ± 0.021 |
| 2000 | 0.046 ± 0.011 | 0.058 ± 0.013 | 0.055 ± 0.012 |
| 5000 | 0.042 ± 0.011 | 0.055 ± 0.013 | 0.021 ± 0.005 |
| 10000 | 0.040 ± 0.012 | 0.054 ± 0.012 | 0.011 ± 0.004 |

From the table, we observe that InfoNCE and Chi-squared achieve comparable excess risks (differences of about one standard deviation), and both are substantially lower than that of direct linear regression when the sample size $m$ is relatively small (e.g., $m = 150, 500$). This suggests that both KL and $\chi^2$-contrastive learning can learn a “good” low-dimensional representation for the downstream task. As the sample size increases, the excess risk of direct linear regression converges to zero, while those of InfoNCE and Chi-squared converge to non-zero constants. This is consistent with our theoretical results, which attribute the excess risk to the non-zero sufficiency of $\widehat{f}$ and the augmentation error $\epsilon_{\mathcal{G}}$.

Due to limited computational resources, we did not evaluate the performance of $\chi^2$-contrastive learning on large-scale real-world datasets and leave this for future investigation.


Q2. L252: It is not clear how "end-to-end theoretical guarantees for the downstream performance of encoders obtained by minimizing general f-contrastive losses" is drawn by combining the results from Sections 3.3.1 and 3.3.2.

A2. In Section 3.3.1, we show that the sufficiency of encoders can be bounded by the excess risk of general f-contrastive losses (line 218), with concrete calculations for $\chi^2$-sufficiency in Theorem 4. In Section 3.3.2, Proposition 5 shows that the downstream guarantees in Theorems 2 and 3 remain valid when Eq. (13) is satisfied. Together, these results imply (at least for $\chi^2$-contrastive learning) that an encoder $\widehat{f}$ minimizing the empirical $\mathsf{f}$-contrastive loss can achieve small downstream errors, provided the pretraining sample size $n$ is sufficiently large and the augmentation error $\epsilon_{\mathcal{G}}$ is small. Nevertheless, we agree the statement may be somewhat overstated, as we focus our analysis on $\chi^2$-contrastive learning, leaving other contrastive losses for future investigation (lines 224–227). We will revise the statement in the paper.


Q3. There are many "some constants" throughout the paper. How are they determined, at least in the examples? How tight are the bounds?

A3. Throughout the paper, we use $c > 0$ to denote absolute constants independent of any parameters (e.g., $c = 1, 2, 10$). We use $C > 0$ to denote constants that depend polynomially on certain parameters (e.g., $B_{\mathsf{S}}$ in Assumption 1), and we explicitly state the dependencies when introducing the constants. However, tracking the exact polynomial dependence (e.g., $C = \mathcal{O}(B_{\mathsf{S}}^2)$) can be cumbersome in the analysis, so we use $C$ to simplify the presentation and improve clarity without losing too much in the bounds. Please let us know if this addresses your question.


Q4. The error on the downstream task induced by data augmentation, $\epsilon_{\mathcal{G}}$, is an important variable in their analysis, but it is not specified well. For example, it would be useful to know how we can measure $\epsilon_{\mathcal{G}}$ in practice, how much its value varies across different augmentation strategies, and so on.

A4. We thank the reviewer for this practical question. In general, it is challenging to measure $\epsilon_{\mathcal{G}}$ (or $\epsilon^{\mathsf{cls}}_{\mathcal{G}}$) for real-world augmentations, as their values depend on concrete assumptions about the augmentations and oracle knowledge of the data. For example, in image classification tasks, the augmentation error $\epsilon^{\mathsf{cls}}_{\mathcal{G}}$ of random cropping depends on the proportion of pixels dropped and the true label probabilities of the cropped images, which are typically unknown. Nevertheless, in practice, one may use the predicted label probabilities of the original and cropped images from a highly accurate image classifier as a surrogate for the true label probabilities, and use them to empirically estimate the augmentation error $\epsilon^{\mathsf{cls}}_{\mathcal{G}}$.


Q5 and 6: formatting issue and typo.

A5. Thanks for the suggestions. We will fix them in the paper.

Comment

Thank you for your response. While I still believe that the experiment presented during the rebuttal period is somewhat limited, and that extending the experiment to a more realistic setting, e.g., using CLIP, would further strengthen the contribution, I do not consider this a critical reason to reject the paper. I have no significant concerns at this point.

Comment

Thanks for the comments and suggestions. Over the past week, we have conducted small-scale experiments in the CLIP setting (language-image pretraining). Namely, we trained the CLIP model (RN50-quickgelu, which consists of a ResNet-50 image encoder and a 12-layer Transformer text encoder) on a 100K subsample of the cc3m-wds dataset using both the InfoNCE and $\chi^2$-contrastive losses. The original dataset contains about 3.3M image-text pairs, but due to limited compute, we trained on the 100K subsample for 32 epochs.

We evaluated the models based on their zero-shot classification performance on the ImageNet-1k validation set (1000 classes, 500 images per class). For InfoNCE and the $\chi^2$-contrastive loss, we set the link functions $\tau(x)$ to be $x/t$ and $e^{x/t}$, respectively, with the trainable temperature $t$ initialized to 1. We chose the batch size $K = 128$ and used the AdamW optimizer with weight decay 0.02. The learning rate was selected via grid search over $\{3\mathrm{e}{-5}, 1\mathrm{e}{-4}, 3\mathrm{e}{-4}, 1\mathrm{e}{-3}\}$. For both losses, the selected optimal learning rate was $3\mathrm{e}{-4}$.

We repeated the experiments three times and report the top-5 accuracy (%) on the ImageNet-1k validation set:

| InfoNCE | Chi-squared |
| --- | --- |
| 7.53 ± 0.25 | 9.43 ± 0.11 |

We can see that in this small-scale experiment, the model trained with the $\chi^2$-contrastive loss achieves comparable zero-shot performance to that trained with InfoNCE. We do not claim that the $\chi^2$-contrastive loss is better than InfoNCE, as both methods could benefit from further hyperparameter tuning (e.g., initial temperature) or larger datasets. However, we do believe that the $\chi^2$-contrastive loss can learn useful representations of the data, aligning with our theoretical findings. We hope this addresses your question regarding the practical applicability of our method.
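
For completeness, the zero-shot evaluation in this CLIP setting typically follows the generic recipe sketched below; the `image_encoder`, `text_encoder`, and tokenized `class_prompts` are placeholders for illustration, not the authors' actual evaluation code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_top5(image_encoder, text_encoder, images, labels, class_prompts):
    """Generic CLIP-style zero-shot top-5 accuracy (illustrative placeholder code).

    class_prompts: one tokenized prompt per class, e.g. "a photo of a {class}".
    labels: (N,) ground-truth class indices for the images.
    """
    txt = F.normalize(text_encoder(class_prompts), dim=-1)   # (C, d) class embeddings
    img = F.normalize(image_encoder(images), dim=-1)         # (N, d) image embeddings
    scores = img @ txt.T                                      # cosine similarities
    top5 = scores.topk(5, dim=-1).indices                     # (N, 5) best classes per image
    return (top5 == labels.unsqueeze(-1)).any(dim=-1).float().mean()
```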

Review
Rating: 4

This paper provides a statistical theory of contrastive learning via approximate sufficient statistics, following~\cite{oko2025statistical} for contrastive language-image pretraining (CLIP) using KL-divergence. Specifically, the authors generalize the concept of approximate sufficient statistics to equivalent forms and general $\mathsf{f}$-divergences, and illustrate that minimizing SimCLR and other contrastive losses yields encoders that are approximately sufficient. The key factors influencing these near-sufficient encoders are their sufficiency and the error induced by data augmentation in contrastive learning. Examples, including linear regression and topic classification, are given to illustrate the applicability of the results.

Strengths and Weaknesses

Strengths

  1. Overall, this paper is well-written. The related work and comparison with this work are discussed, especially the work~\cite{oko2025statistical}.
  2. The claims about the contributions are supported. Specifically, (1) it generalizes the concept of the approximate sufficient statistics in~\cite{oko2025statistical} for contrastive language-image pretraining (CLIP) using KL-divergence; (2) it provides the theoretical analysis of data augmentation-based contrastive learning following the SimCLR framework, demonstrating the downstream performance of the learned encoder depends on its sufficiency and the error induced by the random transformation.
  3. The proofs seem right, although I have not checked the proof details line by line.

Weaknesses

  1. Technically, the novelty seems a little limited due to its similarity to the work~\cite{oko2025statistical}. More details about the additional challenges should be provided.
  2. The tightness of the bounds in Eq. (4) in Theorem 1, Eqs. (7a) and (7b) in Theorem 2, and Eq. (8) in Theorem 3 is not discussed. Besides, for the classification problem in Theorem 3, why use KL divergence instead of common metrics (e.g., 0-1 loss) to evaluate the classification performance?
  3. There are no experimental results to support the theoretical results. Are the provided bounds non-vacuous or meaningful? Synthetic experiments on the examples of linear regression or classification can be conducted to illustrate the validity of the theoretical results.

Questions

  1. Are the provided bounds in Eq. (4) in Theorem 1, Eqs. (7a) and (7b) in Theorem 2, and Eq. (8) in Theorem 3 tight? Please give more discussion.
  2. Please give experimental results to support the validity of the theoretical results in Theorem 6.

Limitations

Please see the weaknesses section.

Final Justification

The authors have addressed most of my concerns, including the tightness of the bounds, the additional challenges compared with~\cite{oko2025statistical}, and experimental results to support the validity of the theoretical results in Theorem 6. The remaining concern is why KL divergence is used instead of common metrics (e.g., 0-1 loss) to evaluate the classification performance. Thus, I keep the score.

Formatting Issues

None

Author Response

We appreciate the helpful comments and constructive feedback from Reviewer 8N6b. Below, we respond to the questions and comments in a point-by-point manner.


Q1. Are the provided bounds in Eq. (4) in Theorem 1, Eqs. (7a) and (7b) in Theorem 2, and Eq. (8) in Theorem 3 tight? Please give more discussion.

A1. For Eq. (4), we expect that the polynomial dependence of the constant $C$ on $B_{\mathsf{S}}$ is necessary to obtain a non-vacuous bound for arbitrary batch size $K > 2$. However, we suspect that the constant $C$ in Eq. (5a) may be improved to depend polynomially on $\log(B_{\mathsf{S}})$. We hope to derive a more precise characterization of the (log-)polynomial dependence in Eqs. (4) and (5a) in future work.

For Eq. (7a), as discussed after Theorem 2 (lines 195-197), the error $\epsilon_{\mathcal{G}}$ on the right-hand side can be replaced by the minimum error $\widetilde\epsilon_{\mathcal{G}}$, and is therefore tight (up to constant factors) when the sufficiency is sufficiently small.

For Eqs. (7b) and (8b), the current results are not tight due to the gap between $\epsilon_{\mathcal{G}}$ and the minimum error $\widetilde\epsilon_{\mathcal{G}}$. We leave the derivation of matching lower bounds to future work.


W1: Technically, the novelty seems a little limited due to its similarity to the work~\cite{oko2025statistical}. More details about the additional challenges are suggested to be provided.

A2. A main technical challenge in extending Oko et al. (2025) [1] to the SimCLR setting (or single-modality contrastive learning) is handling the error induced by random augmentation. Note that random augmentation is not performed in the CLIP setting considered in [1], and therefore the downstream error can be controlled directly via the sufficiency of the encoder (see, e.g., Propositions 2, 3, and 4 in [1]). In our work, we introduce a novel augmentation error term $\epsilon_{\mathcal{G}}$ (or $\epsilon^{\mathsf{cls}}_{\mathcal{G}}$), and use it to establish downstream error bounds for any encoder $f$, by applying triangle inequalities for the $\ell_2$ error in regression (Theorem 2) and for the KL divergence in classification (Theorem 3).


W2. Besides, for the classification problem of Theorem 3, why use KL divergence instead of common metrics (e.g., 0-1 loss) to evaluate the classification performance?

A3. We use KL divergence instead of 0-1 loss in the multi-class classification setting (Theorem 3) because it is a standard and more informative metric for quantifying the difference between the predicted and true label distributions.


Q2 and W3. Please give experimental results to support the validity of the theoretical results in Theorem 6.

A4. We conduct synthetic experiments to learn data representations via contrastive learning with a two-layer neural network, and evaluate them on downstream linear regression.

In the contrastive learning phase, we generate $n$ i.i.d. samples $x_i \sim \mathcal{N}(0, I_d)$. The augmentation $g$ adds i.i.d. $\mathcal{N}(0, \sigma_1^2)$ noise to the first $s < d$ coordinates of $x_i$, and replaces the remaining coordinates with i.i.d. $\mathcal{N}(0, 1)$ noise. We apply KL and $\chi^2$-contrastive learning (Eq. 3 and 10) with link function $\tau(x) = x$, and encoder $f(x)$ being a two-layer ReLU neural network mapping $\mathbb{R}^d$ to $\mathbb{R}^s$. We set $s = 10$, $d = 100$, $n = 500$, hidden dimension 64, and batch size $K = 64$. The encoder is trained using Adam (learning rate 0.001) for 300 epochs, after which the training loss is observed to converge.

For downstream regression, we generate $m$ i.i.d. samples $(x_i, y_i)_{i=1}^m$, where $x_i \sim \mathcal{N}(0, I_d)$ and $y_i = \langle x_i, \theta_\star \rangle + \epsilon_i$, with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ independent of $x_i$. We choose $\theta_\star = (\mathbf{1}^\top_s/\sqrt{s}, \mathbf{0}^\top_{d-s})^\top$ and $\sigma = 1$. Using the learned representation $\widehat{f}(x_i) \in \mathbb{R}^s$ from KL (or $\chi^2$)-contrastive learning, we fit a downstream linear model to predict $y_i$. We define the excess risk of any predictor $h$ as $\mathbb{E}[(y_i - h(x_i))^2] - \sigma^2$, and evaluate the excess risk of the linear model trained on $(\widehat{f}(x_i), y_i)_{i=1}^m$ using 50000 test samples. For comparison, we also report the excess risk of a linear model trained directly on the original samples $(x_i, y_i)_{i=1}^m$. Results for various downstream sample sizes $m$ and the standard deviation over 10 runs are shown below.

Table: Excess risk for various downstream sample sizes m

| m | InfoNCE | Chi-squared | Direct LR |
| --- | --- | --- | --- |
| 150 | 0.106 ± 0.040 | 0.120 ± 0.028 | 2.066 ± 0.594 |
| 500 | 0.060 ± 0.015 | 0.070 ± 0.013 | 0.243 ± 0.032 |
| 1000 | 0.052 ± 0.012 | 0.063 ± 0.012 | 0.114 ± 0.021 |
| 2000 | 0.046 ± 0.011 | 0.058 ± 0.013 | 0.055 ± 0.012 |
| 5000 | 0.042 ± 0.011 | 0.055 ± 0.013 | 0.021 ± 0.005 |
| 10000 | 0.040 ± 0.012 | 0.054 ± 0.012 | 0.011 ± 0.004 |

From the table, we observe that InfoNCE and Chi-squared achieve comparable excess risks (within one standard deviation except for $m = 10000$), and both are substantially lower than that of direct linear regression when the sample size $m$ is relatively small (e.g., $m = 150, 500$). This suggests that both KL and $\chi^2$-contrastive learning can learn a “good” low-dimensional representation for the downstream task. As the sample size increases, the excess risk of direct linear regression converges to zero, while those of InfoNCE and Chi-squared converge to non-zero constants. This is consistent with our theoretical results, which attribute the excess risk to the non-zero sufficiency of $\widehat{f}$ and the augmentation error $\epsilon_{\mathcal{G}}$.

Due to limited computational resources, we did not evaluate the performance of $\chi^2$-contrastive learning on large-scale real-world datasets and leave this for future investigation.


[1]. Oko, K., Lin, L., Cai, Y., & Mei, S. (2025). A statistical theory of contrastive pre-training and multimodal generative AI. arXiv preprint arXiv:2501.04641.

Comment

Thank you for the detailed response. It has addressed most of my concerns, and I will keep the score.

Comment

Dear reviewers,

Author-reviewer discussion period ends soon. Please check the rebuttals and take an appropriate action.

AC

Final Decision

The paper develops a statistical framework for contrastive learning with data augmentation (e.g., SimCLR), extending prior work on CLIP with KL-divergence to general f-divergences. It formalizes the concept of approximate sufficient statistics, and shows that minimizing contrastive losses yields encoders that are approximately sufficient.

Reviewers noted strengths such as the theoretical contribution, mathematical rigor, and the timely topic. While three reviewers were initially concerned about the lack of experiments, they responded positively after a successful rebuttal. In the end, all reviewers unanimously provided a review score of 4. Given the theoretical nature of the contribution, acceptance is recommended.