PaperHub
Average Rating: 5.8 / 10 (min 5, max 8, std 1.3)
Decision: Rejected (4 reviewers)
Ratings: 5, 8, 5, 5
Confidence: 3.8 · Correctness: 2.5 · Contribution: 2.8 · Presentation: 3.0
ICLR 2025

Understanding Self-supervised Learning as an Approximation of Supervised Learning

Submitted: 2024-09-26 · Updated: 2025-02-05
TL;DR

We provide a theoretical framework that conceptualizes self-supervised learning as an approximation of supervised learning.

Abstract

Keywords

representation learning, self-supervised learning, contrastive learning, theoretical framework

Reviews & Discussion

Official Review
Rating: 5

This paper focuses on understanding self-supervised learning, where the authors theoretically formulate the self-supervised learning problem as an approximation of a supervised learning problem.

Strengths

  1. The paper is well-organized, with clear subsections that logically flow from theoretical foundations to empirical validations.
  2. The mathematical derivations and proofs in this paper seem appropriate.

Weaknesses

  1. Although the authors claim to propose a new perspective for understanding SSL, it seems to me that there is significant overlap with [a]. Unfortunately, the authors do not provide a detailed analysis of the similarities and differences between the two.
  2. The correlation between theoretical analysis and insights in this article is weak: In the theoretical analysis section, only the proof of two upper bounds is provided. My question is: What is the relationship between these two upper bounds and the supervised learning paradigm? In other words, what is the significance of their insights for the paper?
  3. In Section 6, the proposed new SSL format does not seem to have significant advantages. In addition, the authors also do not conduct sufficient verification experiments.
  4. In fact, there are many related works (like [b-c]) on the understanding of SSL, unfortunately, this paper does not provide a detailed discussion and analysis of their differences.

References:
[a] Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning, pp. 9929–9939. PMLR, 2020.
[b] Yuandong Tian, Xinlei Chen, and Surya Ganguli. Understanding self-supervised learning dynamics without contrastive pairs. In International Conference on Machine Learning, pp. 10268–10278. PMLR, 2021.
[c] Senthil Purushwalkam and Abhinav Gupta. Demystifying contrastive self-supervised learning: Invariances, augmentations and dataset biases. In Advances in Neural Information Processing Systems, 33:3407–3418, 2020.

Questions

  1. In the theoretical analysis section, only the proof of two upper bounds is provided. My question is: What is the relationship between these two upper bounds and the supervised learning paradigm? In other words, what is the significance of their insights for the paper?
  2. What is the significance of self-supervised learning from this perspective? At a high level, beyond some conclusions that are very similar to [a], it is difficult to capture additional information.

Reference: [a] Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International conference on machine learning, pp. 9929–9939. PMLR, 2020.

Comment

Dear Reviewer tUzQ,

We sincerely appreciate the valuable time and effort you have dedicated to reviewing our manuscript. We address each of your questions and concerns individually below. Please let us know if there are any comments or concerns we have not adequately addressed.


[W1 & Q2] it seems to me that there is significant overlap with [a]

In Section 2, we address [a]. [a] and our paper differ in terms of approach, objective, and theoretical setting:

In terms of approach, [a] builds on an established contrastive loss (Equation (1) in [a]) that attracts or repels other 'samples'. In contrast, our paper starts from scratch with a formulated supervised objective (Equation (5) in our paper) that attracts or repels 'pseudo-labels' (prototype representations).

In terms of objective, [a] explores how attracting and repelling other samples relates to alignment and uniformity. In contrast, our paper demonstrates how attracting and repelling pseudo-labels can manifest as attracting and repelling other samples. Thus, while [a] focuses on the properties of contrastive loss, we focus on the foundation of contrastive loss as an approximation of supervised learning.

Additionally, in terms of theoretical setting, [a] primarily examines the limiting behavior in an asymptotic setting (Theorem 1 in [a]) where the number of negative samples approaches infinity. On the other hand, our approach addresses a more general setting (Theorem 1 and 2 in our paper), offering broader applicability.


[W2 & Q1] What is the relationship between these two upper bounds and the supervised learning paradigm?

The loss in self-supervised learning is shown to serve as an upper bound for a supervised learning objective. Specifically, starting from the problem we formulated and assuming common practices in self-supervised learning, interestingly, a generalized form of the InfoNCE loss—commonly used in self-supervised learning—emerges as an upper bound. This provides a perspective that views self-supervised learning as an approximation of supervised learning. It offers insights into the type of optimization problem that self-supervised learning is solving. The concepts of prototype representation bias and balanced contrastive loss that arise in this process can be valuable for understanding and enhancing self-supervised learning.


[W3] the proposed new SSL format does not seem to have significant advantages

The accuracy of the standard NT-Xent loss (a special case of the generalized NT-Xent loss with $\lambda = 1$) is 65.98%, while the accuracy of the balanced contrastive loss is 67.40%, a gap of 1.42 percentage points. Considering that the chance-level accuracy for ImageNet, which consists of 1,000 classes, is merely 0.1%, achieving this level of improvement solely through proper balancing is significant.

The purpose of the experiments in this section is to check the validity of the theory. To this end, we focus on examining whether a more balanced contrastive loss, as predicted by the theory, effectively enhances performance. To ensure thoroughness, we conduct a comprehensive set of experiments across a parameter grid.
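For concreteness, here is a minimal NumPy sketch of how a balancing factor might enter an NT-Xent-style loss, with $\lambda = 1$ recovering the standard form. The exact generalized loss is defined by Equation (12) in the paper; every implementation detail below is an illustrative assumption, not the paper's code.

```python
import numpy as np

def balanced_nt_xent(z1, z2, lam=1.0, tau=0.5):
    """NT-Xent-style loss with a balancing factor `lam` on the repulsion term.

    lam=1.0 recovers the standard NT-Xent form; lam != 1 reweights the
    log-sum-exp (repulsion) term relative to the positive-pair attraction.
    This is only a sketch of the idea, not the paper's Equation (12).
    """
    z = np.concatenate([z1, z2], axis=0)                 # 2N x d, two views
    z = z / np.linalg.norm(z, axis=1, keepdims=True)     # unit norm -> cosine sim
    n2 = z.shape[0]
    n = n2 // 2
    sim = z @ z.T / tau                                  # temperature-scaled similarities
    pos_idx = (np.arange(n2) + n) % n2                   # index of each sample's other view
    pos = sim[np.arange(n2), pos_idx]                    # attraction term
    np.fill_diagonal(sim, -np.inf)                       # exclude self-similarity
    lse = np.log(np.exp(sim).sum(axis=1))                # repulsion (log-sum-exp)
    return float(np.mean(-pos + lam * lse))
```

With `lam=1.0` this is the usual SimCLR-style objective; sweeping `lam` over a grid mirrors the kind of balancing experiment described above.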


[W4] there are many related works (like [b-c]) on the understanding of SSL

Given the extensive body of work related to self-supervised learning, we had to selectively discuss relevant studies. As mentioned in Section 2, our work falls into the category of contrastive learning. However, [b] addresses non-contrastive learning. Therefore, non-contrastive learning is discussed in Section 5.1, focusing on major algorithms with references. [c] proposes a practical method that leverages video datasets to learn invariance when the dataset is not object-centric. However, we aimed to keep the setting streamlined to avoid complicating the theoretical analysis. In the revised manuscript, we have included the papers in appropriate sections.

Comment

Dear Reviewer tUzQ,

Thank you again for your time and efforts in reviewing our paper.

As the discussion period draws to a close, we kindly remind you that two days remain for further comments or questions.
We would appreciate the opportunity to address any additional concerns you may have before the discussion phase ends.

Thank you very much.

Best regards,
Authors.

Comment

Thanks for the authors' efforts during the rebuttal period, but I still believe that the validation experiments related to this paper are insufficient. After reading the other reviewers' comments and the author's rebuttal, I maintain my original score.

Comment

Dear Reviewer tUzQ,

We sincerely appreciate your response and thoughtful feedback. We understand your concerns regarding the sufficiency of the validation experiments and would like to provide some clarification.

The transformation of the attraction/repulsion mechanism from pseudo-labels to samples has been fundamentally validated mathematically. To support this with experimental validation, we have conducted the following:

  1. Prototype Representation Bias: We performed experiments to investigate the biases that emerge during the transition from supervised to self-supervised learning problems.
  2. Balanced Contrastive Loss: We conducted extensive experiments to evaluate its effectiveness, including pre-training and end-to-end evaluations across all parameter pairs.
  3. Key Assumptions: We ran additional experiments to assess and validate the importance of the core assumptions.

If there are specific additional validation experiments or directions you believe we should explore, we would greatly value your suggestions.

Thank you once again for your time, effort, and valuable feedback during the review process.

Best regards,
Authors.

Official Review
Rating: 8

This paper theoretically models self-supervised learning as an approximation to supervised learning. The authors derive a self-supervised loss related to contrastive losses, including InfoNCE, while introducing concepts like prototype representation bias and balanced contrastive loss. They apply the framework to analyze components of self-supervised learning, notably SimCLR, and explore the effects of balancing attraction and repulsion forces.

Strengths

1- The paper's primary strength lies in its establishment of a rigorous theoretical framework that bridges supervised and self-supervised learning, addressing a significant gap in the literature by grounding widely-used contrastive losses in theory. This approach contributes to the field by potentially enhancing the interpretability and rationale behind existing self-supervised methods.

2- The authors derive a self-supervised learning loss function from first principles, aligning it with established methods like the NT-Xent loss in SimCLR. This derivation offers the self-supervised learning community a deeper understanding of why and how particular loss functions, often implemented heuristically, are effective.

3- Introducing the concept of prototype representation bias, the paper reveals how self-supervised learning can be systematically evaluated and potentially optimized by minimizing this bias through data augmentation strategies. This is an innovative step that contextualizes the role of representation clustering within the self-supervised paradigm.

Weaknesses

1- The authors define a surrogate prototype representation based on transformations (augmentations) of the same data point, but this choice may vary substantially across datasets and problem domains. Since many real-world applications use domain-specific augmentations (e.g., color transformations for medical images), the theoretical guarantees provided may not hold uniformly. A sensitivity analysis or empirical study on the effects of diverse augmentation choices would strengthen the validity of the surrogate representation assumptions.

2- The paper introduces parameters (e.g., balancing factors like $\alpha$ and $\lambda$ in Equation 12) that govern the relative strengths of attraction and repulsion forces in the derived loss function. However, it provides limited insight into how these parameters impact performance across diverse datasets and tasks. A deeper empirical analysis or sensitivity study on these parameters would make the findings more robust and practically usable. Additionally, discussing guidelines for optimal parameter selection based on dataset characteristics would improve the utility of the paper for practitioners.

3- The paper situates itself within the context of contrastive learning methods, particularly NT-Xent and InfoNCE losses. However, there are alternative frameworks in self-supervised learning, such as clustering-based approaches (e.g., DeepCluster, SwAV) and bootstrapping methods (e.g., BYOL). While the authors mention these methods briefly, they do not provide a clear comparison or discussion of how their theoretical framework might align or diverge from these alternative approaches. Providing such a comparison could position the framework more effectively within the larger self-supervised landscape.

4- The paper’s findings, especially around balancing attraction and repulsion forces, suggest potential for optimization. Yet, there is minimal exploration of how the theoretical insights could inspire specific algorithmic modifications or optimizations for contrastive learning. For example, insights into prototype bias could be used to dynamically adjust the loss during training. Discussing these possibilities would improve the paper's impact by suggesting actionable ways to leverage its contributions.

Questions

1- How would the assumptions, such as balanced data and the specific choice of cosine similarity, affect the generalizability of this framework to domains with significant class imbalance or non-standard data representations?

2- To what extent can prototype representation bias be quantitatively minimized through practical data augmentation strategies? Would further analysis on this bias's impact across different datasets yield consistent trends?

3- Could the theoretical framework be adapted or extended to include asymmetrical architectures, given their prominence in modern self-supervised learning algorithms? What additional assumptions might be required?

Comment

Dear Reviewer Cr2E,

We deeply appreciate your time and effort to review our manuscript. We address each of your questions and concerns individually below. Please let us know if there are any comments or concerns we have not adequately addressed.


[W1] this choice may vary substantially across datasets and problem domains

We agree that the choice of data augmentation should vary depending on the application. The fundamental spirit of self-supervised learning lies in leveraging domain knowledge about the application when labels are not available. Specifically, it involves utilizing knowledge about what kinds of transformation invariance the desired representation should possess for a given application. Our work focuses on grounding contrastive losses under the assumption that such data augmentation is already provided. In reality, controlling prototype representations using only domain knowledge is challenging. Exploring which data augmentation techniques should be tailored to each application will be an interesting future direction.


[W2] it provides limited insights into how these parameters impact performance across diverse datasets

Our work focuses on theoretical understanding, and the experiments were designed to align with this purpose. The purpose of the experiments is to check the validity of our theory by demonstrating the potential performance improvements predicted by the theory through the derived balanced contrastive loss. Therefore, we selected canonical datasets such as ImageNet and CIFAR-10 and focused on providing complete results across a grid of parameters for $\alpha$ and $\lambda$.


[W3] there are alternative frameworks in self-supervised learning

Those algorithms are not theoretical frameworks, so their ideas are expressed intuitively, but they can be discussed as follows: Comparing self-supervised learning to bootstrapping stems from the process of constructing pseudo-labels using only representations without external input. This aspect can be connected to our framework, where surrogate prototype representations are generated using representations and treated as pseudo-labels. The predictor used here can be viewed as an additional module designed to predict these pseudo-labels. This algorithm falls under non-contrastive learning, which is discussed in Section 5.1.

Additionally, in clustering algorithms, the cluster assignments of transformed images are made consistent. This can be interpreted within our framework as guiding the representations of transformed images to converge toward a single prototype representation, thereby assigning them to the same cluster. Following the suggestion, we added a discussion section to the paper to address this topic. Thank you very much for your suggestion.


[W4] The paper’s findings, especially around balancing attraction and repulsion forces, suggest potential for optimization.

For example, one could consider an algorithm that adjusts $\alpha$ and $\lambda$ dynamically rather than treating them as fixed hyperparameters. Our framework enhances the understanding of the roles of $\alpha$ and $\lambda$: $\alpha$ reflects hedging the risk with multiple negative samples, while $\lambda$ adjusts the relative magnitudes of the attracting and repelling forces. By leveraging this intuition, dynamically adjusting $\alpha$ and $\lambda$ based on the overall distribution of representations could potentially improve the learning process.


[Q1] How would the assumptions, such as balanced data and the specific choice of cosine similarity, affect the generalizability of this framework to domains

In proving our theorems, we rely on certain assumptions that are common practices in self-supervised learning, such as balanced datasets and cosine similarity. As a result, generalization becomes less straightforward in scenarios where these assumptions are violated. However, by understanding the role these assumptions play at different stages of the proof, we may pave the way for the development of more generalized algorithms in the future.


[Q2] To what extent can prototype representation bias be quantitatively minimized through practical data augmentation strategies?

According to our framework, the following idea can be considered: data augmentation methods that merely apply color distortions or Gaussian blur may struggle to adequately cover images with the same label in the augmented image space (Figure 2 in our paper). Data augmentation leveraging generative AI may offer an alternative.


[Q3] Could the theoretical framework be adapted or extended to include asymmetrical architectures, given their prominence in modern self-supervised learning algorithms?

We discuss the asymmetric architecture in Section 5.1.

Comment

Dear Reviewer Cr2E,

Thank you again for your time and efforts in reviewing our paper.

As the discussion period draws to a close, we kindly remind you that two days remain for further comments or questions.
We would appreciate the opportunity to address any additional concerns you may have before the discussion phase ends.

Thank you very much.

Best regards,
Authors.

Comment

The reviewer thanks the authors for the response -- they addressed the concerns and the reviewer maintains the positive rating.

Comment

Dear Reviewer Cr2E,

We are pleased to hear that our rebuttal addressed your concerns.
We also sincerely appreciate your support for our work.
Your constructive feedback has been invaluable in helping us improve the paper.

Thank you once again.

Best regards,
Authors.

Official Review
Rating: 5

The submission proposes a derivation of the self-supervised learning problem as an approximation of supervised learning. To this end, in the supervised learning formulation the authors replace the labels with prototype representations given by an oracle. These prototype representations can then be modelled via the expected representation of objects sampled from a conditional distribution ($X$ conditioned on label $y$) and across augmentations, i.e. $\mathbb{E}_{t, X \vert y}\, f(t(x))$. Learning under this formulation can be achieved via a triplet loss, i.e. attracting positive samples (samples from the same class) and repelling negative samples (samples from different classes).

In self-supervised learning, however, one has no access to labels, which renders prototype representations unavailable. Instead, the authors use surrogate prototypes, i.e. the expected representation of a sample across its augmentations, $\mathbb{E}_{t}\, f(t(x))$. The authors then provide an upper bound on the loss, which yields an objective called the balanced contrastive loss, and show its connection to the NT-Xent loss used in SimCLR. One may measure the bias introduced by the surrogate by taking the expectation of the difference between the true and surrogate representations, called the prototype representation bias. This bias is shown to correlate with downstream performance.
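The surrogate-prototype construction summarized above can be illustrated with a toy Monte Carlo sketch. The encoder, augmentation, and class distribution below are all hypothetical stand-ins, and the gap computed here is only one plausible reading of the prototype representation bias; the paper defines it formally.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))              # weights of a toy encoder (illustrative)

def f(x):
    # toy "encoder": a fixed linear map followed by l2 normalization
    z = x @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def augment(x, n_aug, sigma=0.1):
    # toy augmentation t(x): additive Gaussian noise
    return x[None, :] + sigma * rng.normal(size=(n_aug, x.shape[0]))

# samples from one class, X | y (a Gaussian blob around a class mean)
class_mean = rng.normal(size=8)
xs = class_mean + 0.5 * rng.normal(size=(256, 8))

# prototype representation E_{t, X|y} f(t(X)), estimated by Monte Carlo
mu_true = np.mean([f(augment(x, 32)).mean(axis=0) for x in xs], axis=0)

# surrogate prototype E_t f(t(x)) for a single sample x
mu_surrogate = f(augment(xs[0], 32)).mean(axis=0)

# one plausible reading of prototype representation bias: the gap between them
bias = float(np.linalg.norm(mu_true - mu_surrogate))
```

Broader augmentations that cover more of the class-conditional distribution shrink this gap, which matches the summary's point that the bias correlates with downstream performance.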

Strengths

It is safe to say that theoretical understanding of self-supervised learning methods is relatively lacking despite increasing interest and effort. Thus, the submission addresses an important topic and provides a clear connection to the supervised counterpart. It is generally well structured which makes it easy to follow, and provides a clear intuition of the approach. The approach covers typical components of self-supervised methods like Siamese networks, data augmentation and contrastive loss.

Weaknesses

While the submission addresses an interesting connection to supervised learning, this connection has been addressed in previous literature [1]; for example, attraction/repulsion and normalization have been brought up in [2]. Further, the consequences of the proposed framework seem to provide limited insight. While they provide supporting arguments for the Siamese architecture, data augmentation, and the InfoNCE loss from a supervised perspective, the framework seems to conflict with the use of a projection head and aggressive data augmentation. Let me elaborate on this further.

If SSL is an approximation of supervised learning, then on a downstream task the use of the output of the projection head should be more beneficial than the pre-projection features. However, this is not what one faces in practice. Interpreting SSL via supervised learning may inhibit understanding the use of the projection head. Highlighting the mismatch between pretext and downstream tasks is important if we are to gain practical consequences for designing SSL methods. The proposed interpretation, on the contrary, seems to sweep this distinction under the rug.

Furthermore, SSL methods use more aggressive augmentation strategies than those used in supervised learning, while aggressive augmentation negatively impacts supervised learning [3]. This also seems to be out of tune with the proposed approach.

SimCLR-type losses are well understood from many perspectives, including spectral and information-theoretic [4,5], so it is not fair to render them as only intuitively and experimentally supported.

The authors introduce assumptions on the choice of similarity measure and the use of normalization to derive the proposed loss, which is shown to generalize the NT-Xent loss used in SimCLR. The assumptions are needed for the derivation, but I don't think one would need to additionally show their significance empirically, especially when this is already an established practice and has been ablated multiple times in the literature. A similar issue arises with the experiments on balanced datasets. These seem to eat up space and don't reveal anything new about SSL methods.

Returning to the generalization of supervised learning problem from predicting labels to predicting prototype representations, this step is important but receives limited discussion in the submission. Since there are multiple target tasks for supervised training, the ideal prototype representations are as well target-specific here. How does this affect the overall framework?

[1] Saunshi, Nikunj, et al. "A theoretical analysis of contrastive unsupervised representation learning." International Conference on Machine Learning. PMLR, 2019.

[2] Wang, Tongzhou, and Phillip Isola. "Understanding contrastive representation learning through alignment and uniformity on the hypersphere." In International conference on machine learning, pp. 9929-9939. PMLR, 2020.

[3] Chen, Ting, et al. "A simple framework for contrastive learning of visual representations." International conference on machine learning. PMLR, 2020.

[4] Balestriero, Randall, and Yann LeCun. "Contrastive and non-contrastive self-supervised learning recover global and local spectral embedding methods." Advances in Neural Information Processing Systems 35 (2022): 26671-26685.

[5] Oord, Aaron van den, Yazhe Li, and Oriol Vinyals. "Representation learning with contrastive predictive coding." arXiv preprint arXiv:1807.03748 (2018).

Questions

In the experimental section with balancing parameters, how are the values in the Figure 4 obtained? Is this the single run or averaged across n?

Can you please elaborate more on the $\nu$ used in the proposed total loss? It seems to not be available and has no analogous term in the NT-Xent loss.

Comment

Dear Reviewer JRCA,

We sincerely appreciate your valuable time and effort spent reviewing our manuscript. We address each of your questions and concerns individually below. Please let us know if there are any comments or concerns we have not adequately addressed.


[W1] the connection has been addressed in previous literature [1], for example attraction/repelling and normalization has been brought up in [2].

Since papers on contrastive learning inherently include attracting positive pairs and repelling negative pairs, they may seemingly appear similar at first glance. We clarify the differences below: [1] is basically a paper on unsupervised learning, which is different from self-supervised learning. Both are in a situation where the label is unknown, but unlike unsupervised learning, self-supervised learning has an additional aspect of generating pseudo-labels from data. Therefore, in our paper, we formulate the problem using pseudo-labels (prototype representations) generated from data. In addition, the contrastive loss discussed in [1] is a variation of the triplet loss (Definition 2.3 in [1]) that operates on three samples without hard negative mining. The loss we address is a loss between a sample and prototype representations that incorporates hard negative mining. We demonstrate how this relates to an InfoNCE-type loss. Therefore, while [1] is about “classical contrastive loss in the context of unsupervised learning,” our paper is about “InfoNCE-type loss in the context of self-supervised learning.”

[2] explores how alignment and uniformity relate to contrastive loss (sample attraction/repulsion). In contrast, we demonstrate how our formulated supervised objective (pseudo-label attraction/repulsion) translates into contrastive loss. Thus, while [2] focuses on the properties of contrastive loss, we focus on the foundation of contrastive loss as an approximation of supervised learning. Through this process, common practices (like normalization) are unified within a single theoretical framework. Additionally, [2] focuses on an asymptotic setting (Theorem 1 in [2]) where the number of negative samples approaches infinity, whereas we address a more general setting (Theorems 1 and 2 in our paper), ensuring broader applicability.


[W2] the framework seem to conflict with the use of projection head

Extracting features before the projector boosts performance, but it does not mean that self-supervised learning algorithms fail to work when features are extracted after the projector. Therefore, the improvement in performance from features before the projector is not in conflict with the proposed framework but rather can be interpreted within it.

We interpret this as follows: During the process where contrastive loss directly manipulates features after the projector, the features before the projector are indirectly pushed closer or farther apart, encouraging the learning of more generalized representations. This process leads to the acquisition of noise-reduced and more robust high-level features. It can also be seen as a form of regularization leveraging the information bottleneck effect.


[W3] aggressive augmentation negatively impacts supervised learning [3]

The supervised setting in [3] involves training with a cross-entropy loss (Subsection B.8.1 in [3]), which is conceptually different from our supervised setting (Equation (2) in our paper) as the loss. Thus, it is challenging to directly apply their results to our interpretation.

However, we can discuss the following: Theoretically, if we know that the target representation must be invariant to certain transformations, we can enforce transformation invariance by aligning the representations of transformed data. We assume knowledge of such transformations to develop the theory.

In practice, however, identifying these transformations is challenging, leading to some reliance on domain knowledge. This domain knowledge is not perfect, and since data augmentation inherently modifies the data, it has the potential to introduce negative effects.

For instance, from a human perspective, color distortion does not alter the semantic meaning of a dog in a photo, so the representation should ideally be invariant to color distortion. However, from the model's perspective, some level of color information might contribute to identifying the object as a dog.

In a supervised setting, where access to the label is already available, excessive data augmentation might be unnecessary. Conversely, in self-supervised learning, where robust pseudo-labels must be constructed solely from transformed data, more aggressive data augmentation is often required.

Comment

[W4] it is not fair to render them as only intuitively and experimentally supported.

We mention in Section 2 that self-supervised learning losses have been studied from various perspectives, such as covariance-based learning and maximizing mutual information. Our intention was solely to establish their foundation by deriving the losses from a formulated problem. Many studies accept the losses as given and investigate their effects or characteristics. However, we think that the phrasing could lead to misunderstanding. Therefore, we have removed the expression in question in the revised manuscript. Additionally, [4] has been added to Section 2 ([5] is already cited).


[W5] I don't think one would need to additionally show their significance empirically

In the case of similarity measures, results that align with our setting are not readily available in the literature. For instance, Table 5 of [3] does not provide an apples-to-apples comparison, as multiple components ($\ell_2$-normalization and temperature $\tau$) are adjusted simultaneously.

In the case of balanced datasets, we provide experimental results under our setting for the completeness of the paper. However, agreeing that it is less critical, we have moved it to the appendix.


[W6] the ideal prototype representations are as well target-specific here. How does this affect the overall framework?

We cannot know what an ideal prototype representation truly is. However, what we can realistically do is enforce transformation-invariance by ensuring that the representations under specific transformations converge to a single point. These transformations may vary depending on the task, but once we assume they are given, the theory develops straightforwardly. Naturally, we consider the centroid (expectation) of the representations of the available transformed data as their shared target. From there, the remaining parts can be mathematically proven. In conclusion, these transformations are provided as a form of domain knowledge for a specific task, and this serves as supervision (Footnote 3 in our paper) derived from domain knowledge. We develop the theory under the setting where this is given.


[Q1] Is this the single run or averaged across n?

We calculated the accuracy by taking the average over 5 independent runs. For the scale of variability, please refer to Section A.4.2.


[Q2] Can you please elaborate more on ν used in the proposed total loss?

In the proof of Theorem 4.6, we mention the ideal case. To provide an intuitive understanding of the value $\| \mathbb{E}_{T', X' \vert y'} f_{\theta}(T'(X')) \|$, let us consider a simple example. Suppose the embeddings $f_{\theta}(t_1(x_1))$ and $f_{\theta}(t_2(x_2))$ lie on a circle. These embeddings can then be represented as $(\cos \theta_1, \sin \theta_1)$ and $(\cos \theta_2, \sin \theta_2)$, respectively. The average of these embeddings is given by $\left(\frac{\cos \theta_1 + \cos \theta_2}{2}, \frac{\sin \theta_1 + \sin \theta_2}{2}\right)$, which corresponds to the midpoint of the chord connecting the two embeddings. Calculating the squared norm of this midpoint and simplifying the expression yields $\frac{1}{2} + \frac{\cos(\theta_2 - \theta_1)}{2}$. This value approaches 1 as $\theta_1$ and $\theta_2$ become closer, i.e., as the two embeddings move closer to each other.
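The circle example can be verified numerically; note that the closed form $\frac{1}{2} + \frac{\cos(\theta_2 - \theta_1)}{2}$ is the squared norm of the chord midpoint (the angle values below are arbitrary).

```python
import numpy as np

theta1, theta2 = 0.3, 1.1                      # two points on the unit circle
e1 = np.array([np.cos(theta1), np.sin(theta1)])
e2 = np.array([np.cos(theta2), np.sin(theta2)])

mid = (e1 + e2) / 2                            # midpoint of the chord
sq_norm = float(mid @ mid)                     # squared norm of the midpoint
closed_form = 0.5 + np.cos(theta2 - theta1) / 2
```

As $\theta_2 - \theta_1 \to 0$ the closed form tends to 1, matching the claim that the norm approaches 1 as the embeddings coincide.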

Comment

Dear Reviewer JRCA,

Thank you again for your time and efforts in reviewing our paper.

As the discussion period draws to a close, we kindly remind you that two days remain for further comments or questions.
We would appreciate the opportunity to address any additional concerns you may have before the discussion phase ends.

Thank you very much.

Best regards,
Authors.

Comment

Dear authors, thank you for answering my questions. I still have my reservations about the proposed interpretation which are not resolved. The major one being limited significance of the consequences that could be gained from a connection to supervised learning, e.g. in terms of theoretical guarantees. The submission covers insights that are already established in practice. Thus while the proposed connection is itself interesting, the consequences specified have limited significance. After reading other reviews and to reflect authors efforts during discussion period, I slightly raised my score.

Comment

Dear Reviewer JRCA,

Thank you for your thoughtful feedback and for taking the time to reconsider your evaluation during the discussion period. We deeply appreciate your engagement.

We would like to address your concerns regarding the significance of the consequences derived from our proposed connection to supervised learning.

  1. Theoretical contribution: Self-supervised learning implies the idea of constructing pseudo-labels from samples. However, when we see self-supervised learning losses composed solely of samples, it is not immediately apparent how they relate to pseudo-labels. In this work, what we mathematically showed is that pseudo-label attraction/repulsion (Equation (7)), i.e.,

$$-s\left(f_{\theta}(t(x)), \hat{\mu}_{y}\right) + \lambda \max_{y' \neq y} s\left(f_{\theta}(t(x)), \hat{\mu}_{y'}\right),$$

where $\hat{\mu}_{y} := \mathbb{E}_{T} f_{\theta}(T(x))$ and $\hat{\mu}_{y'} := \mathbb{E}_{T', X' \vert y'} f_{\theta}(T'(X'))$, can be optimized as sample attraction/repulsion (Equation (13)), i.e.,

$$-\log \frac{\exp\left(\alpha s\left(f_{\theta}(t(x)), f_{\theta}(t'(x))\right)\right)}{\left( \sum_{x' \in \hat{\mathcal{X}}} \exp\left(\alpha s\left(f_{\theta}(t(x)), f_{\theta}(t'(x'))\right)\right) \right)^{\lambda / \nu}},$$

which generalizes the widely-used InfoNCE-type losses. This connection contributes to the firm foundation of self-supervised learning by addressing a crucial gap in the literature. To clarify this, we have updated the contents of Section 3.2, temporarily highlighted in "blue". We believe that technology built on a shaky foundation can be difficult to trust and may encounter limitations in long-term development.

  2. Practical contribution: While it is true that our submission builds upon some established practices, our contribution lies in unifying those practices into a cohesive framework. In addition, a good theory should exhibit not only internal consistency but also a reasonable degree of predictive power. In this regard, we provide evidence that leveraging the prototype representation bias and the balanced contrastive loss emerging from our theoretical development can lead to performance improvements. This can guide practitioners toward lines of research on how to reduce the bias in prototype representations or how to effectively balance contrastive losses.
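The sample attraction/repulsion loss of Equation (13) can be sketched in a few lines. This is our own minimal NumPy illustration: the choice of cosine similarity for $s$, the stand-in embeddings, and all variable names are assumptions for the sketch, not the paper's implementation. With $\lambda/\nu = 1$ it reduces to the familiar InfoNCE shape.

```python
import numpy as np

def cos_sim(a, b):
    # Similarity s: cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def generalized_contrastive_loss(z_t, z_tp, z_others, alpha=1.0, lam=1.0, nu=1.0):
    # -log[ exp(alpha*s(z_t, z_tp)) / (sum_{x'} exp(alpha*s(z_t, z')))^(lam/nu) ]
    num = np.exp(alpha * cos_sim(z_t, z_tp))
    den = sum(np.exp(alpha * cos_sim(z_t, z)) for z in z_others) ** (lam / nu)
    return float(-np.log(num / den))

rng = np.random.default_rng(1)
batch = rng.standard_normal((8, 4))                   # stand-in f_theta(t'(x'))
anchor = batch[0]                                     # f_theta(t(x))
positive = batch[0] + 0.01 * rng.standard_normal(4)   # f_theta(t'(x)), a close view
loss = generalized_contrastive_loss(anchor, positive, batch, alpha=2.0)
print(loss)
```

The exponent $\lambda/\nu$ rebalances attraction against repulsion: values below 1 down-weight the repelling denominator relative to the attracting numerator, and vice versa.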

In summary, we believe that our work will benefit the self-supervised learning community by serving as a basis and providing guidance for research.

We hope this clarifies the broader importance of our work. We value your constructive feedback, which encourages us to continue refining and communicating the implications of our findings effectively. Thank you once again for your time and consideration.

Sincerely,
Authors.

Review
5

This paper formulates self-supervised learning (SSL) as an approximation of supervised learning (SL), deriving a loss function related to contrastive losses. The authors introduce the concepts of prototype representation bias and balanced contrastive loss, providing some insights into SSL, and conduct experiments to validate their theoretical results.

Strengths

  1. They propose a novel theoretical framework that connects supervised and self-supervised learning.
  2. They introduce the concepts of prototype representation bias and balanced contrastive loss, which play important roles in this connection.
  3. They offer practical insights based on their framework.

Weaknesses

I think the main issue with this paper is its insufficient theoretical contribution. Using the prototype representation bias, as defined by the authors, to represent the gap between SSL and SL is overly simplistic. In reality, no practical augmentation can achieve a very low prototype representation bias unless label information is available. Moreover, augmentations with the same prototype representation bias might exhibit vastly different downstream performance, depending on the finer relationship between the augmentation and the data—a topic the authors have not addressed.

Questions

Please see the weakness.

Comment

Dear Reviewer NNEK,

We sincerely appreciate the time you have taken to review our manuscript. We address your concern below. Please let us know if there are any comments or concerns we have not adequately addressed.


[W1] Using the prototype representation bias, as defined by the authors, to represent the gap between SSL and SL is overly simplistic. In reality, no practical augmentation can achieve a very low prototype representation bias unless label information is available.

The purpose of this paper is to provide a theoretical understanding rather than focusing on practical applications. In this context, the concept of prototype representation bias is introduced to better understand the connection between supervised and self-supervised learning. In reality, there are bound to be some limitations in controlling prototype representation bias through data augmentation. This is an inherent limitation of self-supervised learning itself, which must utilize data augmentation derived from domain knowledge because there are no labels. However, assessing the practicality of a newly proposed concept is not straightforward. Ideas inspired by the concept of prototype representation bias may lead to new algorithms in the future.

Additionally, in this framework, all parts other than the step where $\mathbb{E}_{T, X \vert y} f_{\theta}(T(X))$ is approximated by $\mathbb{E}_{T} f_{\theta}(T(x))$ (with available images) are proven rigorously (note that the repelling component does not use even this approximation). In the attracting component, the approximation relies on 1) the intuition that it is natural and 2) the tendency shown in the experiments. With the supervised objective defined this way, the mathematics works seamlessly, and the InfoNCE loss, widely used in self-supervised learning, naturally emerges. Bridging the gap between supervised and self-supervised learning in a principled way is a necessary piece missing from the literature. In machine learning, we believe it is a long-standing tradition to value a theory when it is reasonably solid and its limitations are properly acknowledged.
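The approximation at the heart of this step, replacing the class-conditional mean $\mathbb{E}_{T, X \vert y} f_{\theta}(T(X))$ with the per-sample augmentation mean $\mathbb{E}_{T} f_{\theta}(T(x))$, can be illustrated with a toy Monte Carlo estimate. Everything below (the encoder, the augmentation, the class model, the sample counts) is our own stand-in, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    # Stand-in encoder: tanh nonlinearity followed by L2 normalization.
    z = np.tanh(x)
    return z / np.linalg.norm(z)

def augment(x):
    # Stand-in transformation T: small additive noise.
    return x + 0.1 * rng.standard_normal(x.shape)

# Samples of a single class y, clustered around a class center.
center = np.array([1.0, -0.5, 0.3])
class_samples = [center + 0.2 * rng.standard_normal(3) for _ in range(50)]

# E_{T, X|y} f(T(X)): Monte Carlo over class samples and transformations.
mu_class = np.mean([f(augment(x)) for x in class_samples for _ in range(20)], axis=0)

# E_T f(T(x)): Monte Carlo over transformations of one available sample.
mu_sample = np.mean([f(augment(class_samples[0])) for _ in range(20)], axis=0)

# The gap between the two estimates plays the role of the prototype
# representation bias: the smaller it is, the better the approximation.
bias = float(np.linalg.norm(mu_class - mu_sample))
print(bias)
```

The tighter the class cluster relative to the augmentation spread, the smaller this gap, which is the intuition behind treating the per-sample mean as a proxy for the class prototype.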

Comment

Dear Reviewer NNEK,

Thank you again for your time and efforts in reviewing our paper.

As the discussion period draws to a close, we kindly remind you that two days remain for further comments or questions.
We would appreciate the opportunity to address any additional concerns you may have before the discussion phase ends.

Thank you very much.

Best regards,
Authors.

Comment

Dear reviewers and AC,

We sincerely appreciate your valuable time and effort spent reviewing our manuscript.

As reviewers noted, we present a novel (NNEK), rigorous (Cr2E, tUzQ), and well-structured (JRCA, tUzQ) theoretical framework that addresses an important topic (JRCA, Cr2E) and establishes a clear connection between supervised and self-supervised learning (JRCA, Cr2E).

We appreciate your valuable feedback on our manuscript. In response to the comments, we have carefully revised and enhanced the manuscript, including the following:

  • Clarified the contributions of our work further
  • Added a discussion section
  • Moved the experiments on balanced datasets to the appendix
  • Included additional relevant references

In the revised manuscript, these updates are temporarily highlighted in "blue" for your convenience. We sincerely believe that our theoretical framework will be a valuable contribution to the self-supervised learning community.

Thank you very much,

Authors.

Comment

Dear Reviewers,

We would like to express our gratitude once again for your thoughtful review of our manuscript. Your valuable insights and feedback are greatly appreciated.

As the extended discussion period is nearing its conclusion, we kindly remind you that there are two days remaining to share any additional comments or questions. We would be delighted to address any further concerns you might have before the discussion phase ends.

Thank you for your time and consideration.

Warm regards,
Authors.

AC Meta-Review

This work addresses the largely empirical nature of self-supervised learning (SSL) by offering a principled, theoretical framework. The authors formulate self-supervised learning as an approximation of a supervised learning problem, deriving a loss closely related to contrastive losses, thus providing a theoretical foundation for them. Key concepts such as prototype representation bias and balanced contrastive loss emerge naturally, offering insights into self-supervised learning. The framework is aligned with established practices, particularly focusing on SimCLR, and explores the balance between attracting positive pairs and repelling negative pairs. The reviewers raised several concerns, including: (1) the practicality of the prototype representation bias and the insufficient theoretical contributions, (2) the limited insights gained from the connection between self-supervised learning and supervised learning as presented in this paper, and (3) the lack of sufficient experimental validation for the proposed new SSL method. Despite the authors' rebuttal and subsequent author-reviewer discussions, the paper did not receive enough support. Therefore, I recommend rejection.

Additional Comments from Reviewer Discussion

After the rebuttal, only Reviewer Cr2E confirmed that all concerns were addressed and expressed support for the work.

Reviewer NNEK raised concerns about the practicality of the prototype representation bias, stating, “I think the main issue with this paper is its insufficient theoretical contribution. Using the prototype representation bias, as defined by the authors, to represent the gap between SSL and SL is overly simplistic.” In the authors' response, they emphasized that their work focuses on theoretical understanding and that conclusions about the newly proposed concept may be premature. However, if something is premature, we should exercise caution. I agree with the reviewer's comment, and I believe the authors' response did not adequately address this concern.

Reviewer JRCA mentioned, “The major concern being limited significance of the consequences that could be gained from a connection to supervised learning, e.g. in terms of theoretical guarantees.” Reviewer JRCA slightly increased the score to 5.

Reviewer tUzQ remained concerned that “the proposed new SSL format does not seem to have significant advantages. In addition, the authors also do not conduct sufficient verification experiments.”

Overall, I agree with most of the reviewers' evaluations and believe that this work, in its current form, does not meet the standards for publication.

Final Decision

Reject