Supervised Dimension Contrastive Learning
Abstract
Reviews and Discussion
Motivated by Barlow Twins, this paper proposes a method to constrain the feature diversity in supervised contrastive learning. Experimental results show the effectiveness of the proposed method.
Strengths
From the experimental section, we can see that the proposed method performs well in OOD scenarios. This is interesting and validates, to some extent, that feature diversity does correlate with the generalizability of feature representations.
Weaknesses
- The novelty of this paper is really limited. The paper does not explain the motivation for proposing CLASS CORRELATION and FULL RANK AGGREGATION. At the same time, no intuitive insight or theoretical analysis is given as to why CLASS CORRELATION and FULL RANK AGGREGATION improve supervised contrastive learning performance.
- The related work section is also quite weak. The authors have not done a good job of highlighting the developmental lineage of self-supervised learning and supervised contrastive learning, or the connections and differences between the different algorithms. They also do not highlight the advantages and positioning of the method proposed in this paper.
Questions
- Why can CLASS CORRELATION encourage the embedding dimensions to be not only augmentation-invariant and decorrelated but also discriminative for the supervised classification task? As we can see, CLASS CORRELATION is only a measure of correlation.
- Please give a theoretical analysis or related references for the statement: "To guarantee that all embedding dimensions contribute to class separability, every embedding dimension should play a role in generating the predicted class variables through the aggregate function. It can be achieved by enforcing the aggregate function to be full-rank."
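For concreteness, the kind of soft full-rank constraint presumably meant here is an orthogonality penalty on the aggregation matrix; below is a minimal sketch, assuming a linear aggregation head (which may not match the paper's exact formulation).

```python
import torch

def orthogonal_regularization(W: torch.Tensor) -> torch.Tensor:
    """Soft full-rank penalty: || W W^T - I ||_F^2.

    If W has shape (num_classes, embed_dim) with num_classes <= embed_dim,
    the penalty vanishes only when the rows of W are orthonormal, which in
    particular forces W to have full row rank.
    """
    k = W.shape[0]
    gram = W @ W.T                                       # (k, k) row Gram matrix
    identity = torch.eye(k, device=W.device, dtype=W.dtype)
    return ((gram - identity) ** 2).sum()

# Example: penalize a (hypothetical) linear aggregation head during training.
aggregate = torch.nn.Linear(in_features=128, out_features=10, bias=False)
penalty = orthogonal_regularization(aggregate.weight)    # weight shape: (10, 128)
```

A penalty of this kind only encourages full rank; it does not by itself explain why every dimension then contributes to class separability, which is what the question above asks for.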
For G2
- The sentence "A is full-rank, the transformation is bijective" is problematic. Bijection can only be guaranteed if A is a square matrix and A is invertible (a small numeric check is given after this list).
- The assumption that the labels follow a Gaussian distribution is somewhat difficult to understand.
- The proof of 'Orthogonal Regularization' is built on the assumption that A is a square matrix.
So, I think G2 is unconvincing.
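To make the bijection point concrete, a small numeric check (purely illustrative): a rectangular matrix can have full rank and still induce a non-injective map, so full rank alone does not give a bijection.

```python
import numpy as np

# A is 2x3 with full (row) rank, i.e. rank(A) = 2, but A is not square.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
print(np.linalg.matrix_rank(A))   # 2 -> "full rank"

# The null space is nontrivial, so the map x -> A x is not injective:
x1 = np.array([1.0, 1.0, 0.0])
x2 = np.array([0.0, 0.0, 1.0])    # x1 - x2 = (1, 1, -1) lies in null(A)
print(A @ x1, A @ x2)             # identical outputs: [1. 1.] and [1. 1.]
```

Bijectivity would additionally require A to be square and invertible, which is exactly the concern raised above.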
For CLASS CORRELATION
What I'm trying to say is that correlation does not imply equality. Only the correlation between labels is constrained; discriminability is not. Therefore, the claim in L229-232 is an overstatement.
Overall
Some important concerns have not been addressed well. So, I keep my original rating.
Thank you for providing additional proof. However, there are still some important concerns that have not been addressed.
For Global Response (2-1/4, Revised):
- Why does the Null-Range Decomposition satisfy the law of total entropy?
- Constraining the mapping to be merely one-to-one does not lead to the stated conclusion; this requires a bijection.
- The statement that the term in question is designed to capture all label-relevant information in the representation is a hypothesis; it cannot be used as if it were a definitive conclusion, and therefore the corresponding quantity cannot be identified as claimed.
- Based on 1), 2), and 3), this term cannot be regarded as quantifying "the uncertainty and information loss related to the aggregation process".
- For the diversity term, the authors still assume Y follows a Gaussian distribution; this assumption is meaningful only for regression problems, not for classification problems.
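For reference, the identities at stake in the points above, sketched under explicit assumptions (this is not the authors' derivation): the entropy chain rule for a range/null-space split requires, for example, an orthonormal change of basis, and the Gaussian entropy formula is only meaningful for continuous variables.

```latex
% Chain rule of (differential) entropy for an orthonormal split of Z into a
% row-space component Z_R and a null-space component Z_N of the aggregation map:
\[
  h(Z) \;=\; h(Z_R, Z_N) \;=\; h(Z_R) + h(Z_N \mid Z_R).
\]
% Gaussian differential entropy, which is what a Gaussian assumption buys:
\[
  h(Y) \;=\; \tfrac{1}{2}\,\log\!\bigl((2\pi e)^{d}\,\det\Sigma\bigr)
  \qquad \text{for } Y \sim \mathcal{N}(\mu,\Sigma) \text{ in } \mathbb{R}^{d},
\]
% which is defined for continuous Y; discrete class labels call for the Shannon
% entropy H(Y) instead, which is why the Gaussian assumption fits regression
% targets rather than classification labels.
```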
For Global Response (2-2/4, Revised):
- The authors claim that minimizing this term via correlation maximization can be seen as covariate-shift-invariant distributional alignment. This is puzzling. The covariate shift problem concerns the distribution divergence between the training dataset and the test dataset, whereas this paper only focuses on the training dataset. I also read the whole article carefully: the authors neither analyze which confounders cause the distribution shift nor propose a method to extract such confounders. So the claimed innovation on this point is overstated.
- The formulation is still based on the assumption of a Gaussian distribution. Meanwhile, how can you be sure that the corresponding component contains information that is not relevant to the task?
For Contribution:
- The novelty of this paper is incremental. The claim of solving covariate shift in this paper is problematic.
Thus, I maintain my original rating.
This paper enhances Barlow Twins by using the class-correlation loss and orthogonal regularization loss. The class correlation loss uses supervision information, and the orthogonal regularization loss enforces the aggregate function to be full-rank. These additional losses positively affect the overall performance. Through numerical experiments on in-domain classification and out-domain transfer learning tasks, the proposed SupDCL demonstrated superior performance.
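For context, the redundancy-reduction objective of Barlow Twins that both this summary and the paper build on can be sketched as follows (a minimal sketch; normalization and hyperparameter details are simplified).

```python
import torch

def barlow_twins_loss(z1: torch.Tensor, z2: torch.Tensor, lam: float = 5e-3) -> torch.Tensor:
    """Barlow Twins objective: drive the cross-correlation matrix of two
    batch-normalized embedding views toward the identity.

    z1, z2: embeddings of two augmented views, shape (batch, dim).
    """
    n, d = z1.shape
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
    c = (z1.T @ z2) / n                                     # (dim, dim) cross-correlation
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()          # invariance term
    off_diag = (c - torch.diag_embed(torch.diagonal(c))).pow(2).sum()  # redundancy reduction
    return on_diag + lam * off_diag
```

The class correlation and orthogonal regularization losses discussed in this review are additions on top of this objective.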
Strengths
- The proposed method demonstrated superior performance on in-domain classification and out-domain transfer learning tasks.
Weaknesses
- The class correlation and orthogonal regularization losses have not been investigated in Barlow Twins, but such losses are often investigated in other methods or areas.
- According to Table 4, most of the performance gain of the proposed method compared with Barlow Twins comes from supervision.
Questions
- What is the intuition behind the proposed method's improved performance on out-domain transfer learning tasks?
- Based on Figure 1, the proposed method inherits its superior out-domain performance from Barlow Twins. Other regularization or loss terms added on top of Barlow Twins would likely yield improvements similar to those of the proposed method.
- Are there any theoretical insights about the performance of the proposed method?
- Can we use the proposed method without the class correlation loss as a self-supervised learning method? If so, does this outperform the existing methods?
- Regarding Table 4, is there any discussion of the performance of Barlow Twins plus the orthogonal regularization loss?
- In Table 3(b), the existing methods increase performance as the projector dimensionality increases, but the proposed method decreases after 128. Is there any interpretation for that?
The paper introduces Supervised Dimension Contrastive Learning (SupDCL), which combines supervision with dimension-wise contrastive learning. It targets limitations in supervised contrastive learning (high cross-correlation and limited feature diversity) by reducing redundancy in embedding dimensions while enhancing class discriminability.
Strengths
- The integration of dimension contrastive learning with supervision to enhance feature diversity makes sense. SupDCL's approach of reducing cross-correlation among dimensions to improve generalization, particularly in transfer learning tasks, is well justified.
- The evaluation covers both in-domain and out-domain tasks on different datasets, providing a clear comparison of SupDCL against multiple baselines, including self-supervised and supervised learning methods.
- Extensive ablation studies that analyze each component of SupDCL’s loss function.
- The paper acknowledges the limitations of SupDCL, specifically its reliance on fixed-form class labels. It provides suggestions for handling these limitations through future work.
Weaknesses
- The paper introduces orthogonal regularization and decorrelation strategies to improve feature diversity, but it lacks a theoretical foundation to show why these particular choices are necessary. While the empirical results show improved performance, a lack of formal justification leaves the effectiveness of these strategies unclear.
- While SupDCL outperforms SupCon on out-domain tasks, it underperforms SupCon on in-domain tasks.
- SupDCL borrows heavily from existing redundancy reduction techniques (Barlow Twins) without significantly advancing the state of contrastive learning. The contribution of introducing supervision to dimension contrastive learning is incremental and could be viewed as a straightforward application of existing techniques.
Questions
- Why is feature discriminativeness important for supervised contrastive learning? Figure 1 and the experiments show that SupDCL underperforms SupCon on in-domain tasks.
- What are the symbols in Eq. (7)? They are not explained.
This paper argues that representations learned by supervised contrastive learning methods have limited feature diversity and high cross-correlation between dimensions. The authors propose an approach to learn representations that are both augmentation-invariant and dimension-wise decorrelated, while also being discriminative for classification. They achieve this by extending Barlow Twins, a dimension contrastive learning model, to supervised dimension contrastive learning. Specifically, they introduce a discriminativeness loss instead of the cross-entropy loss, consisting of a class correlation loss and an orthogonal regularization loss. Experimental results show that the proposed SupDCL performs well on both in-domain supervised classification tasks and out-of-domain transfer learning tasks.
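For concreteness, one hypothetical form such a class correlation term could take, written in the spirit of Barlow Twins (the aggregation head and the normalization below are illustrative assumptions, not the paper's definition).

```python
import torch
import torch.nn.functional as F

def class_correlation_loss(embeddings: torch.Tensor,
                           labels: torch.Tensor,
                           aggregate: torch.nn.Linear) -> torch.Tensor:
    """Hypothetical class-correlation objective: correlate the aggregated
    class variables with one-hot labels and push that correlation matrix
    toward the identity, analogously to the Barlow Twins criterion.
    """
    n = embeddings.shape[0]
    preds = aggregate(embeddings)                        # (batch, num_classes)
    onehot = F.one_hot(labels, preds.shape[1]).float()   # (batch, num_classes)
    preds = (preds - preds.mean(0)) / (preds.std(0) + 1e-6)
    onehot = (onehot - onehot.mean(0)) / (onehot.std(0) + 1e-6)
    c = (preds.T @ onehot) / n                           # class cross-correlation
    identity = torch.eye(c.shape[0], device=c.device)
    return ((c - identity) ** 2).sum()
```

This is only meant to make the reviewed design concrete; the paper's exact loss may differ.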
Strengths
The idea of this paper is to combine SupCon (supervised contrastive learning) with Barlow Twins (decorrelated features). The specific approach involves introducing a discriminativeness loss to leverage the supervised labels. The motivation and approach are presented very clearly. The authors also provide experiments, analysis, and evaluation metrics to demonstrate the effectiveness of the proposed method.
Weaknesses
I believe the experiments could be strengthened, and there are some minor points that would enhance the clarity of the paper. These are outlined in more detail in the Questions section.
Questions
- This work aims to extend Barlow Twins to a supervised version. A straightforward extension could be a combination of Barlow Twins and SupCon in which, instead of generating two augmentations from the same image, the second view is selected from another image that shares the same class. In Table 2, I would like to see the results of this model.
- Another straightforward extension is to use a cross-entropy loss function instead of the proposed discriminativeness loss function to leverage the supervised labels. While you have compared this model in Table 3, comparing its results in Table 1 and Table 2 would also be worthwhile.
- In Table 3, you mention the representation space and the embedding space; however, in the methods section and Figure 2, there is no reference to the representation space. You have defined the embedding vector, but it is unclear which vector you refer to as the representation vector.
- Similarly, Section 4.2 does not clarify what the representation vector refers to.
- In lines 371-374, for the setting discussed there, Barlow Twins achieves 1320, which seems to be the lowest count rather than the highest as described. Is there a mistake here?
- Visualization: Could you visualize the embedding space or representation space (with clearer references if possible) of different methods, along with the different dimensions of those representations? This would help to better understand how the proposed method achieves improved decorrelation and discriminative representations.
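For example, a minimal sketch of one such plot, showing the dimension-wise correlation matrix of an embedding matrix (variable names are placeholders).

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_dimension_correlation(embeddings: np.ndarray, title: str) -> None:
    """Visualize |correlation| between embedding dimensions; a near-diagonal
    matrix indicates decorrelated (diverse) dimensions."""
    corr = np.corrcoef(embeddings, rowvar=False)   # (dim, dim) correlation matrix
    plt.imshow(np.abs(corr), vmin=0, vmax=1, cmap="viridis")
    plt.colorbar(label="|correlation|")
    plt.title(title)
    plt.xlabel("embedding dimension")
    plt.ylabel("embedding dimension")
    plt.show()

# e.g. plot_dimension_correlation(supdcl_embeddings, "SupDCL")
#      plot_dimension_correlation(supcon_embeddings, "SupCon")
```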
Minor points:
- It would be better to mark all the best numbers in all tables in bold.
- In Table 2, add arrows to indicate whether larger or smaller values are better, and mark the best value in bold.
- Figure 1 can be split into two separate figures, as the third sub-figure is not on the same scale as the other two.
- In lines 353-358, 645 (SupDCL) vs. 532 (SupCon) is described as outperforming, while 645 (SupDCL) vs. 904 (Barlow Twins) is described as comparable?
Dear Reviewers,
We thank the reviewers for their valuable feedback, which guided us to strengthen the theoretical and empirical justification of our proposed dimension-based framework. Initially, we focused on the motivations behind our approach. Now, we theoretically and empirically demonstrate that each of the three components addresses a specific challenge. Specifically, we decompose the mutual information between class labels and learned representations into two components (refer to Revised G2): diversity and discriminativeness; together, they form a comprehensive framework for robust generalization. All justifications have been revised and reflected in the updated version of the paper.
“the total information capacity of the learned representation ”
“the uncertainty and information loss related to the aggregation process”
- Dimension Decorrelation (maximizing the information capacity of the representation):
- Reduces inter-dimension correlation using Barlow Twins (BT), theoretically shown to increase information capacity of representation (Revised G2 of the global response and Appendix A.2).
- Empirically, it improves generalization performance, as evidenced by the relationship between dimension correlation and whole-domain accuracy (Figure 1) and the effectiveness of BT+CE over CE alone (Table B).
- Class Correlation (minimizing the distance between predicted and true class distributions):
- A novel method for robust distributional alignment, theoretically shown to be invariant to covariate shift, and minimize distance between predicted and true class distributions (Revised G2 of the global response and Appendix A.3.1).
- Empirically, class correlation improves in-domain and out-domain performance, consistently outperforming CE across all settings (Table B).
- Full-rank Aggregation (minimizing the information capacity of the null space):
- Preserves class-relevant information of representation, theoretically shown to minimize information capacity of null space (Revised G2 of the global response and Appendix A.3.2).
- Empirically, comparisons such as BT+CE vs. BT+CE+Orthogonal and SupDCL w/o Orthogonal vs. SupDCL confirm its significant contribution to both in-domain and out-domain performance (Table B).
Combining these components, our method achieves state-of-the-art whole-domain generalization, outperforming all supervised and self-supervised baselines across in-domain and out-domain tasks (Table C).
Table C. In-domain classification and out-domain transfer learning. Linear evaluation performance comparison on 10 downstream datasets, for ResNet-50 pre-trained on ImageNet-1K.
| Method | In-domain (IN1K) | Average out-domain (10 datasets) | Whole-domain (all 11 datasets weighted equally) | Whole-domain (50% in-domain, 50% averaged out-domain) |
|---|---|---|---|---|
| Supervised | 76.1 | 74.5 | 74.6 | 75.3 |
| Self-supervised Representation Learning: | | | | |
| SimCLR | 69.1 | 73.0 | 72.6 | 71.0 |
| Barlow Twins | 73.2 | 79.0 | 78.5 | 76.1 |
| SwAV | 75.3 | 79.0 | 78.7 | 77.2 |
| MoCo v3 | 71.1 | 79.4 | 78.7 | 75.3 |
| DINO | 75.3 | 80.3 | 79.9 | 77.8 |
| Supervised Representation Learning: | | | | |
| SupCon | 77.9 | 77.8 | 77.8 | 77.9 |
| PaCo | 78.7 | 69.4 | 70.2 | 74.1 |
| GPaCo | 79.5 | 68.8 | 69.8 | 74.2 |
| SupDCL-1024 (Ours) | 78.2 | 78.9 | 78.9 | 78.6 |
| SupDCL (Ours) | 77.5 | 80.8 | 80.5 | 79.2 |
Table B. Comparison of SupDCL and straightforward extensions of Barlow Twins with cross-entropy loss
| Method | In-domain (CIFAR-100) | Average Out-domain |
|---|---|---|
| Barlow Twins | 52.6% | 39.1% |
| Barlow Twins + CE | 72.6% | 39.6% |
| Barlow Twins + CE with Orthogonalization | 73.8% | 40.8% |
| SupDCL w/o Orthogonalization | 73.8% | 41.0% |
| SupDCL (Ours) | 74.2% | 42.0% |
I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.