In-Context Symmetries: Self-Supervised Learning through Contextual World Models
Learning general representations in self-supervised learning while retaining the versatility to tailor to task-specific symmetries when given a few examples as context.
Abstract
Reviews and Discussion
This paper proposes ContextSSL, a novel self-supervised learning framework designed to enhance the existing joint embedding architecture by incorporating task-specific context. The main idea is to dynamically adapt symmetries by leveraging context in SSL. Consequently, ContextSSL can adapt to varying task symmetries without requiring parameter updates. The authors demonstrate the efficacy of ContextSSL on 3DIEBench and CIFAR10, showing that ContextSSL can selectively learn invariance or equivariance to transformations while maintaining general representations.
Strengths
[S1] This paper suggests an interesting direction for SSL, proposing that self-supervised representation incorporating context can enable dynamic adaptation to varying task symmetries.
[S2] The overall writing is smooth and easy to follow.
Weaknesses
[W1] It seems possible for invariance-based approaches to make use of context lengths of 0 to 126 by training a linear classifier. Why are these results not reported? More shots could also improve the performance of SimCLR and VICReg.
[W2] Although ContextSSL performs well on the augmentation prediction task, it underperforms compared to other important baselines in linear classification, which is the most common task.
[W3] It seems that ContextSSL can be trained on a single augmentation type, while other equivariance-based approaches benefit from multiple augmentations.
Questions
[Q] Regarding [W2], can ContextSSL benefit from few-shot classification, e.g., ImageNet accuracy of models trained with 1% of labels [1]? I believe few-shot examples can serve as context, and in this setup, ContextSSL might outperform other baselines.
[1] Chen, Ting, et al. "A simple framework for contrastive learning of visual representations." International conference on machine learning. PMLR, 2020.
Limitations
They addressed the limitations.
We thank the reviewer for their time and feedback, which help us improve the work. We believe that a few misunderstandings have resulted in the given rating. We have endeavored to address your concerns as concretely as possible and ask for your careful consideration of our clarifications. All of the discussions below will be added to the paper to improve its clarity.
Reporting results for invariant baselines for context length 0 to 128
The invariant and equivariant baselines do not utilize context in their training, and as a result, their performance remains constant regardless of context length. The reported performance for these baselines in Table 1 corresponds to zero-shot evaluations. We recognize the confusion caused by the centered values in the table. To clarify further, we have included supporting figures and tables in the attached PDF, which will also be added to the revised manuscript.
Lower performance of ContextSSL on linear probe compared to baselines
Core to ContextSSL is its ability to selectively enforce invariance or equivariance based on context. Thus, our experiments test if ContextSSL can do so without sacrificing performance on standard benchmarks like linear probe accuracy. Achieving high linear probe accuracy is not the primary goal of this work.
- On the 3DIEBench dataset, ContextSSL achieves higher scores for rotation (0.744) and color (0.986) compared to all baselines (the highest being 0.671 for rotation and 0.975 for color). It also matches or exceeds other equivariant models in linear probe performance, as shown in Table 1. Unlike equivariant baselines, ContextSSL does this without training separate models for each equivariance. Moreover, in 3DIEBench, augmentations like rotation or color are independent of classification labels, so equivariant models are generally not expected to outperform invariant ones. We will add further clarification on this point in the revised manuscript.
- In datasets where equivariance to transformations is crucial and correlated with labels, ContextSSL achieves superior performance, with an 83% linear probe classification accuracy compared to 72% for baselines like SimCLR (as shown in Table 4).
- As shown in Table 3, ContextSSL achieves an R² of 0.608 on rotation and 0.925 on color, significantly surpassing SimCLR's 0.459 and 0.371. Further, ContextSSL achieves this while also attaining a linear probe accuracy of 88.5%, comparable to SimCLR's 89.1%.
- Additional compelling evidence of ContextSSL's strong performance over baselines is presented in Table 2, Table 3, Table 4, and Figure 5 of the original manuscript.
Please refer to the detailed response about key observations from Table 1 in the consolidated review above.
ContextSSL can be trained on a single augmentation type, while other equivariance-based approaches benefit from multiple augmentations
Similar to the other invariant and equivariant baselines, ContextSSL is indeed trained with multiple augmentations, which are used to generate the positive samples in the context. As shown in Table 1, all equivariant baselines are trained to be equivariant to either (1) rotation, (2) color, or (3) both rotation and color. This requires training a separate model for each setting. ContextSSL, in contrast, trains a single model using two contexts, one corresponding to rotation and the other to color, and thus uses multiple augmentations. Depending on which context is used, the model dynamically enforces either invariance or equivariance to rotation or color. To avoid this confusion around Table 1, we show a different version of Table 1 in the attached rebuttal document (through Table 1 and Figure 1) that is hopefully clearer; it separates context length from the comparison (the baseline methods are independent of context length).
Please refer to the detailed response about key observations from this table in the consolidated review above.
ContextSSL benefits in the few-shot classification setting
We are afraid that there seems to be some misunderstanding. The linear probing metrics in Table 1 show zero-shot results for both ContextSSL and other methods, which indeed form a fair comparison. ContextSSL outperforms both the invariant and equivariant baselines in terms of quantitative equivariance measures, such as R² in Table 1 and Mean Reciprocal Rank (MRR) and Hit Rate in Table 2. While achieving the highest linear probe accuracy is not the goal of this work, ContextSSL still demonstrates competitive performance in Table 1 and surpasses other baselines in Table 4.
To further emphasize this, we compare the linear probe accuracy of the predictor of ContextSSL across different context lengths with that of SimCLR. As shown in Table 4 in the attached rebuttal pdf and Table 19 in the Appendix of our manuscript, ContextSSL outperforms SimCLR in linear probe accuracy across all context lengths. Note that the invariant and equivariant baselines do not operate on context, and, as a result, the performance of SimCLR in this table remains constant regardless of context length.
Dear reviewer iSU1,
As the discussion period is drawing to a close, we wanted to kindly request your feedback on our rebuttal. In our response, we have carefully tried to address all your concerns and also included additional experiments to further demonstrate the strengths of our work. We would greatly appreciate it if you could provide your feedback at your earliest convenience.
Thank you for your time.
Best regards,
Authors
I'm sorry for the late response to the authors' answer.
For W1, it would be better to clarify the evaluation setup in the final manuscript. However, I think the zero-shot performance of SimCLR or VICReg underestimates the value of their representations: they could use the context to train a linear classifier or regressor to predict the rotation/color (e.g., ContextSSL also uses the few labeled contexts, which is not truly zero-shot but rather training-free; if I'm wrong, please correct this statement, and please clarify the unlabeled contexts in the final manuscript). Similarly, for Table 4 of the attached pdf, the performance of SimCLR is also underestimated when justifying the benefit of understanding the context compared to the SSLs.
Overall, I agree with the authors that SSLs can benefit from understanding context, but I still couldn't find the benefit of in-context learning in the experimental justifications. SSLs can utilize linear or simple MLP heads to adapt to new tasks in a few-shot manner (like few-context), and this ability to adapt to the tasks may be even better with recent state-of-the-art SSLs, like DINO (v1 or v2) or MoCo v3.
We thank the reviewer for their response. We are happy to address their remaining concerns below and hope that this revision serves as a basis for a positive review.
However, I think the zero-shot performance of SimCLR or VICReg underestimates the value of their representations: they could use the context to train a linear classifier or regressor to predict the rotation/color (e.g., ContextSSL also uses the few labeled contexts, which is not truly zero-shot but rather training-free; if I'm wrong, please correct this statement, and please clarify the unlabeled contexts in the final manuscript).
The key difference between ContextSSL and SimCLR/VICReg is that ContextSSL leverages knowledge from context to adapt to the “inductive bias” of downstream tasks (e.g., preferring features with rotation invariance). This is the exact purpose of our study: to train a model that can dynamically adapt to a range of downstream tasks according to the feature invariance they require. In contrast, methods like SimCLR produce features with a “fixed inductive bias”, making them less adaptable to downstream tasks.
New comparison and setup. To highlight this difference, we additionally finetune the SimCLR model with 128 examples (same examples used as context for ContextSSL) for rotation prediction and/or color prediction. This matches the total amount of information available to ContextSSL and SimCLR. The results are shown in the table below. SimCLR-ft(128) rot+color is the SimCLR model finetuned with both rotation and color prediction loss. SimCLR-ft(128) rot is finetuned with rotation prediction only, and SimCLR-ft(128) color is finetuned with color prediction only. As with all results in our paper and consistent with [1], rotation prediction is calculated by training an MLP over frozen features, while color prediction and classification accuracy are based on linear probing over frozen features.
Analysis. From the table, we can see that ContextSSL significantly outperforms fine-tuned SimCLR in both rotation and color prediction by a large margin, indicating clear gains of ContextSSL over previous SSL methods with fixed inductive bias.
Table S1
| Setting (desired direction: rotation, color) | Method | Rotation Prediction (R²) | Color Prediction (R²) | Classification (top-1) |
|---|---|---|---|---|
| baseline | SimCLR | 0.506 | 0.148 | 85.3 |
| rot+color (higher, higher) | SimCLR-ft(128) rot+color | 0.546 | 0.319 | 82.1 |
| rot (higher, lower) | SimCLR-ft(128) rot | 0.560 | 0.222 | 83.7 |
| rot (higher, lower) | ContextSSL, rot. context | 0.744 | 0.023 | 80.4 |
| color (lower, higher) | SimCLR-ft(128) color | 0.490 | 0.411 | 76.9 |
| color (lower, higher) | ContextSSL, color context | 0.344 | 0.986 | 80.4 |
[1] Self-supervised learning of split invariant equivariant representations. ICML 2023.
For W1, it would be better to clarify the evaluation setup in the final manuscript.
We thank the reviewer for their suggestion. For better clarity, we have already elaborated this in our rebuttal by adding new tables and figures with captions clearly stating the evaluation configuration for all baselines and ContextSSL (as shown in the attached rebuttal document). We will make sure to add this clarification to our revised manuscript too.
Similarly, for Table 4 of the attached pdf, the performance of SimCLR is also underestimated when justifying the benefit of understanding the context compared to the SSLs.
We would like to highlight that our context merely consists of unlabeled data of the form (sample, transformation, transformed sample) and, thus, does not provide any information about the downstream label y. These unlabeled data pairs are generated with data augmentations of the training data and thus do not provide extra supervision to the model. Therefore, it is indeed a fair comparison with SimCLR/VICReg, since both representations are obtained with only unlabeled data. We apologize for this confusion and will definitely make it clearer throughout the paper.
I agree with the authors that SSLs can benefit from understanding context, but I still couldn't find the benefit of in-context learning in the experimental justifications. SSLs can utilize linear or simple MLP heads to adapt to new tasks in a few-shot manner (like few-context), and this ability to adapt to the tasks may be even better with recent state-of-the-art SSLs, like DINO (v1 or v2) or MoCo v3.
In our evaluation, we trained a linear head for color prediction and an MLP head for rotation prediction on SimCLR features, following [1]. As shown in Table S1, even with fine-tuning on the same few-shot samples, SimCLR falls short in performance on equivariant tasks of color and rotation prediction compared to ContextSSL.
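For readers unfamiliar with this protocol, here is a minimal sketch of frozen-feature probing (our own illustrative code, not the script used in the paper or in [1]; layer sizes, optimizer, and training schedule are assumptions):

```python
import torch
import torch.nn as nn

def make_probe(feat_dim, out_dim, mlp=False):
    # Rotation prediction uses a small MLP head; color prediction and
    # classification use a linear head. Sizes here are illustrative.
    if mlp:
        return nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, out_dim))
    return nn.Linear(feat_dim, out_dim)

def train_probe(probe, frozen_feats, targets, epochs=100, lr=1e-3):
    # The encoder stays frozen; only the probe is optimized. R^2 is then
    # reported on held-out frozen features (a classification probe would
    # use a cross-entropy loss instead of MSE).
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(frozen_feats), targets)
        loss.backward()
        opt.step()
    return probe
```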
As discussed in the paper, ContextSSL fundamentally differs from previous SSL paradigms (SimCLR, DINO, MoCo): previous SSL methods learn features with a given set of augmentations, which requires the model to ignore certain augmentations, like color and rotation. Instead, ContextSSL does not drop any information during pretraining but models the feature symmetry as dependent on the context. The experiments above clearly demonstrate this: ContextSSL significantly outperforms SimCLR in preserving color and rotation information. We believe that this comparison provides strong evidence that our method has clear gains over existing SSL paradigms.
We thank the reviewer again for their constructive feedback, which has helped us clarify and improve our work. If the reviewer finds our revisions satisfactory, we hope that they would kindly consider re-evaluating our work.
Thank you for your response. I have no additional questions. I will carefully consider these issues and make a final rating.
We sincerely appreciate the reviewer’s continued engagement with our work and their thoughtful review. Their feedback has been very valuable in refining our paper. We also thank them for their decision to raise their score.
This work proposes to employ context modules to learn general representations such that invariance and equivariance to specific augmentations do not bias the representations. The method utilizes a module that learns to be either invariant or equivariant based on the context of the input augmentations, thus producing highly general features from learnt symmetries that can preserve or disregard a variety of transformations for downstream tasks. This leads to improved downstream evaluations in both invariant and equivariant settings, exceeding the state of the art in some cases.
Strengths
⁃ The paper is generally well presented and written, describing a clear and rational problem statement supported by appropriate examples.
⁃ The resulting framework is original and highly significant in the field of SSL. Notably, the addition of the context module could allow for a significant shift in the real-world application of SSL.
⁃ Empirical results show significant improvement in equivariant downstream tasks, further justifying this work's significance in the field. Additionally, an extensive ablation and sensitivity analysis is performed, guiding the reader toward a greater understanding of the behavior of the method and the rationale behind implementation decisions.
⁃ Extensive details to support replication are provided.
Weaknesses
⁃ How does the method handle more complex augmentation strategies? The proof of concept in the setting of 3DIEBench and CIFAR demonstrates strong performance, yet these transformations, especially the equivariant ones, are highly controlled and unique to these datasets. It therefore would have been good to see more general augmentations, including multiple combinations in the downstream context, that better reflect real-world settings and thus support the generalization claim of the work.
⁃ Doesn't the choice of context length used to extract representations at inference require significant supervision? Such an implementation requires the practitioner to construct augmentations per context and then provide this information for the downstream task. Please correct my understanding if incorrect.
⁃ Following on from that point, this method claims not to hard-code symmetries. However, from my understanding the method is still hard-coding symmetries to some extent, as the practitioner is selecting which augmentations to use. In this case the context is enabling separability between representations that belong to each group symmetry. They are still hand-coded, just conditioned on the learnt context for determining whether to be invariant or equivariant.
Minor:
⁃ Figures 1 and 2 could perhaps be made clearer or positioned differently in the paper. They are not overly clear or useful in supporting the written explanation.
⁃ Context length should have a header in the tables.
Questions
⁃ I’m intrigued to understand how “out-of-context” augmentations for downstream tasks impact performance. Given the automatically learnt symmetries, I assume that applying a full context would result in better performance in downstream cases where the context is vastly different from the training data.
⁃ See weaknesses
Limitations
The limitations are appropriately addressed
We appreciate the reviewer's thorough and insightful review, along with their positive feedback regarding the significance of our work for the SSL community, extensive evaluations and ablations, and our writing. In response to their review, we have endeavored to address their concerns as concretely as possible.
Extension to complex augmentation strategies
We thank all the reviewers for raising this excellent question. It is indeed critical to evaluate our approach beyond self-supervised datasets, which use synthetic augmentations to enforce priors on the representation. To address this, we show that ContextSSL extends to naturally occurring symmetries and sensitive features in fairness and physiological datasets such as the MIMIC III [1] and UCI Adult [2]. To demonstrate this, we train ContextSSL to be selectively equivariant or invariant to gender by merely attending to different contexts. This is crucial; for instance, equivariance is needed for gender-specific medical diagnoses where different medicine dosages are required, while invariance is essential for fairness in tasks such as predicting hospital stay duration or medical cost. We present these results in Table 2 and Table 3 of the attached rebuttal document, with details in the caption. From Table 2, we can observe that ContextSSL learns equivariance to gender in one context, improving gender and medical diagnosis prediction for MIMIC-III. In another context, ContextSSL achieves higher invariance to gender, resulting in superior performance on fairness metrics like equalized odds (EO) and equality of opportunity (EOPP) for hospital stay (LOS) prediction. We observe similar results for fairness of income prediction in the UCI Adult dataset, as shown in Table 3 of the attached document.
[1] Johnson, A., T. Pollard, and R. Mark III. "MIMIC-III Clinical Database (version 1.4). PhysioNet. 2016." (2016).
[2] Arthur Asuncion and David Newman. UCI machine learning repository, 2007.
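For reference, here is a minimal sketch of how the two fairness metrics mentioned above (equalized odds and equality of opportunity) can be computed for a binary prediction task and a binary sensitive attribute. This is our own illustrative code; the exact definitions used for the rebuttal tables may differ, e.g., in how the per-group gaps are aggregated.

```python
import numpy as np

def fairness_gaps(y_true, y_pred, group):
    """Equalized odds (EO) and equality of opportunity (EOPP) gaps for binary
    labels/predictions and a binary sensitive attribute (e.g., gender).
    Smaller gaps indicate a fairer (more invariant) predictor."""
    def rates(g):
        m = group == g
        tpr = y_pred[m & (y_true == 1)].mean()  # true positive rate in group g
        fpr = y_pred[m & (y_true == 0)].mean()  # false positive rate in group g
        return tpr, fpr

    (tpr0, fpr0), (tpr1, fpr1) = rates(0), rates(1)
    eopp_gap = abs(tpr0 - tpr1)                       # EOPP: TPR parity only
    eo_gap = max(abs(tpr0 - tpr1), abs(fpr0 - fpr1))  # EO: TPR and FPR parity
    return eo_gap, eopp_gap
```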
Does the choice of context length require supervision at inference?
This is indeed a critical question. As shown in Table 1 and Table 4 of the attached rebuttal document, ContextSSL is robust to varying context lengths and generalizes well to longer contexts, eliminating the need for explicit supervision during inference. During training, we use random masking and subsequently test without masking, which enhances the model's robustness to varying context lengths. For example, although trained with an average context length of 9 under 90% data masking, the model extrapolates well to context lengths up to 128 during testing, as demonstrated in Table 1.
Furthermore, depending on the useful priors of different downstream tasks, one only needs to construct the corresponding context and use the maximum context length. The degree of equivariance or invariance in ContextSSL increases with context length and is highest at the maximum context length, as observed empirically through all our experiments.
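For illustration, here is a minimal sketch of the random context masking described above (our own code; the function name, tensor layout, and the choice of dropping entries rather than using attention masks are assumptions):

```python
import torch

def sample_training_context(context_pairs, mask_prob=0.9):
    # context_pairs: (L, ...) tensor of transformation pairs forming the context.
    # During training, each entry is dropped with probability `mask_prob`, so the
    # model sees short contexts (about 10% of the buffer on average) yet can be
    # evaluated without masking on much longer contexts at test time.
    keep = torch.rand(context_pairs.shape[0]) > mask_prob
    return context_pairs[keep]
```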
Does our work still hard-code symmetries to some extent?
Indeed, we still need to know the set of symmetries at training time. However, our work is the first to move beyond these fixed symmetries, training a representation that can dynamically adapt to be invariant or equivariant to a subset of these transformations. This enables learning a single representation that performs well across various downstream tasks, eliminating the need to retrain a new model for each task. So far, we have tested this on environments like rotation and color or crop and blur. We believe that it is an important stepping stone towards learning from diverse contexts. In practice, ContextSSL could be trained to handle a larger set of transformations, covering the entire set of commonly used augmentations.
To demonstrate how ContextSSL extends beyond these synthetic transformations to naturally occurring symmetries, we conduct experiments on the fairness and physiological datasets such as the MIMIC III [1] and UCI Adult [2]. We present these results in Table 2 and Table 3 of the attached rebuttal document, with details in the caption.
Please refer to the detailed response about the extension to naturally occurring features and symmetries in the consolidated review above and to Table 2 and Table 3 in the attached document.
[1] Johnson, A., T. Pollard, and R. Mark III. "MIMIC-III Clinical Database (version 1.4). PhysioNet. 2016." (2016).
[2] Arthur Asuncion and David Newman. UCI machine learning repository, 2007.
Minor issues and typographical errors
We thank the reviewer for their attention to detail. We will make the following corrections in our revised manuscript:
- Positioning of Figure 1: We will add more discussion on Figure 1 to highlight our approach clearly.
- Header for context length in Table 1: We agree with the reviewer and have improved Table 1 with supporting plots and tables, as shown in the attached rebuttal document (Table 1 and Figure 1).
Many thanks for the detailed response that addresses many of the weaknesses identified and questions raised. I appreciate the addition of naturally occurring symmetries, and the rebuttal results demonstrate the capability of the method to adapt to such settings. It would also have been beneficial to see more visual benchmarks, given these had been the main focus of the paper; however, I understand that time restrictions do not permit this. For future revisions, comparisons on benchmark datasets against methods such as EquiMod, E-SSL, and CARE would improve the findings. Additionally, the clarification on the context length is a useful addition.
I emphasise that all clarifications made during this rebuttal should be made in any revised manuscript to improve clarity of the work.
Given my already positive review, I for now maintain my score.
Thank you for the prompt response and recognition of our new experiments and discussions. We will incorporate all additional experiments and clarifications into our revised manuscript. The important concerns raised by you have been very valuable in enhancing the clarity of our work. We fully concur with the suggestion to include more vision benchmarks and are currently testing our approach on them. We will ensure that they are included in the revised manuscript.
This paper focuses on the problem of symmetry discovery in self-supervised learning. In particular, the goal is to learn models that are either sensitive to certain features like rotations and lightning or invariant to them, depending on the task. The authors propose to learn a world model that models transformations of the input images as a sequence of state, action, next state tuples. The major contribution is learning to adapt the representation of the world model based on the provided context of the task.
Strengths
- The authors present a novel combination of in-context learning and symmetry discovery. Their method successfully adapts its representation to be equivariant or invariant to different transformations.
- The experimental evaluation uses a large number of strong baselines in self-supervised representation learning and learning of symmetric representations. The proposed method is superior in its ability to be sensitive to or invariant to certain features based on the context of the task.
- The paper contains an extensive ablation study to justify all components of the method.
Weaknesses
- I am not sure if it is meaningful to call this property equivariance: “if H(A|Z) is relatively small, the representation Z is nearly equivariant to the augmentation A”. Equivariance is specifically defined as the transformation of the input to the model having a predictable effect on the transformation of the output. This is different from simply having features that are predictive of a particular feature (hence low entropy H(A|Z)). Would it be better to call this property something like sensitivity to a transformation?
- Section 4.1 does not make a strong case for ContextSSL outperforming the baselines. The results in Table 1 are somewhat mixed. Table 2 is not explained well.
- The paper does not make a clear case for the application of the proposed method outside of synthetic tasks. The 3DIEBench is artificially created to test equivariance and invariance to specific properties and the CIFAR-10 experiments do not actually demonstrate an improvement in classification accuracy. It is unclear how this method could be applied to more general and practical visual pre-training settings, such as CLIP [1] or DINO [2] self-supervised pre-training.
References:
[1] https://arxiv.org/abs/2103.00020
[2] https://arxiv.org/abs/2304.07193
Comments:
- Table 1 is difficult to read. It is not immediately clear why ContextSSL is missing from the Rotation + Color section and why the context length != 14 fields for other methods are empty. Moreover, depending on the place in the table, a high or a low R^2 score could be the best result. That is very non-intuitive.
- Clipped sentence: “We further test this at For all our equivariant baselines on 3DIEBench”.
- The text “Contextual Self-Supervised Learning” in Figure 1 should be rotated by 180 degrees.
Questions
What is the path towards making this method discover or adapt to naturally occurring symmetries?
Limitations
Despite stating “Limitations of our work are discussed in Section 5” in the paper checklist, the discussion of the limitations in Section 5 is insufficient.
We are grateful to the reviewer for the time they put in to review our work. We are glad to see that they recognize several strengths in our work, including the novelty of our approach, comprehensive empirical evaluation using many baselines, and conducting thorough ablations. Below, we share our thoughts on the questions asked.
H(A|Z) as a definition of equivariance
We fully concur with the reviewer's observations. As noted in the footnote of Section 2.1, we use the term "equivariance" in a somewhat relaxed sense to denote that learned features are sensitive to data augmentations. However, it is common practice in equivariant self-supervised learning [1, 2, 3, 4] to use this definition to enforce equivariance.
[1] Dangovski, Rumen, et al. "Equivariant contrastive learning." arXiv preprint arXiv:2111.00899 (2021).
[2] Lee, Hankook, et al. "Improving transferability of representations via augmentation-aware self-supervision." NeurIPS (2021): 17710-17722.
[3] Xie, Yuyang, et al. "What should be equivariant in self-supervised learning." CVPR. 2022.
[4] Scherr, Franz, Qinghai Guo, and Timoleon Moraitis. "Self-supervised learning through efference copies." NeurIPS (2022): 4543-4557.
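To make the distinction concrete (notation here is ours, not the paper's), the strict and the relaxed notions can be written as:

```latex
% Strict equivariance: the group action T_g on the input induces a
% predictable action \rho(g) on the representation.
f(T_g x) = \rho(g)\, f(x) \qquad \forall g \in G
% Relaxed notion used in the paper: Z = f(X) is "nearly equivariant"
% (sensitive) to the augmentation variable A when
H(A \mid Z) \approx 0,
% i.e., A is predictable from Z; invariance instead corresponds to
% H(A \mid Z) \approx H(A).
```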
Confusion regarding Table 1 and explanation about the benefits of ContextSSL
We acknowledge Table 1 may be confusing as presented in the paper, and will improve the presentation. Table 1 compares ContextSSL with baselines and shows the effect of context length. We believe the empirical success of ContextSSL is significant, and to show that, we show a different version of Table 1 in the attached rebuttal document (through Table 1 and Figure 1), that is hopefully clearer; it separates context length from the comparison (the baseline methods are independent of context length).
Please refer to the detailed response about key observations from this table in the consolidated review above.
- On the 3DIEBench dataset, ContextSSL achieves higher scores for rotation (0.744) and color (0.986) compared to all baselines (the highest being 0.671 for rotation and 0.975 for color). It also matches or exceeds other equivariant models in linear probe performance. Unlike equivariant baselines, ContextSSL does this without training separate models for each equivariance.
- ContextSSL seamlessly enforces equivariance or invariance to rotation or color by merely paying attention to different contexts, as shown in Figure 1 of the attached document. Thus one model can align the learned representation to priors that are beneficial for different downstream tasks.
- Additional compelling evidence of ContextSSL's strong performance over baselines is presented in Table 2, Table 3, Table 4, and Figure 5 of the original manuscript.
Confusion regarding Table 2
We provide more details regarding Table 2 here and will add the corresponding discussion in the paper.
Table 2 shows that ContextSSL outperforms baseline approaches on two key metrics for equivariance: Mean Reciprocal Rank (MRR) and Hit Rate at k (H@k) [1]. ContextSSL's performance on these metrics consistently improves with increasing context length, demonstrating adaptation to rotation-specific features. To put these numbers into perspective, a H@1 score of 0.29 for ContextSSL signifies that the first nearest neighbor is the target embedding 29% of the time. In contrast, this occurs only 5% of the time for EquiMod and SEN, which is marginally better than the 2% expected by random chance. Notably, ContextSSL surpasses the baseline performances even with zero context, demonstrating its ability to learn equivariance without any contextual information.
[1] Garrido, Quentin, Laurent Najman, and Yann Lecun. "Self-supervised learning of split invariant equivariant representations." arXiv preprint arXiv:2302.10283 (2023).
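To make these retrieval metrics concrete, here is a minimal sketch of how MRR and H@k can be computed from predicted and target embeddings (our own illustrative code; the distance metric and function name are assumptions, not the exact evaluation script of [1]):

```python
import numpy as np

def mrr_and_hit_rate(pred, gallery, target_idx, k=1):
    # pred:       (N, D) predicted embeddings (e.g., predictor outputs)
    # gallery:    (M, D) candidate target embeddings to retrieve from
    # target_idx: (N,)   index in `gallery` of the true target of each prediction
    dists = np.linalg.norm(pred[:, None, :] - gallery[None, :, :], axis=-1)  # (N, M)
    order = np.argsort(dists, axis=1)                             # nearest first
    ranks = np.argmax(order == target_idx[:, None], axis=1) + 1   # 1-indexed rank of target
    mrr = float(np.mean(1.0 / ranks))                             # Mean Reciprocal Rank
    hit_at_k = float(np.mean(ranks <= k))                         # Hit Rate @ k (H@k)
    return mrr, hit_at_k
```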
Extension beyond synthetic augmentations and towards adapting to naturally occurring symmetries
We thank the reviewer for raising this excellent question. It is indeed critical to evaluate our approach beyond self-supervised datasets, which use synthetic augmentations to enforce priors on the representation. To address this, we show that ContextSSL extends to naturally occurring symmetries and sensitive features in fairness and physiological datasets such as the MIMIC III and UCI Adult. To demonstrate this, we train ContextSSL to be selectively equivariant or invariant to gender by merely attending to different contexts. This is crucial; for instance, equivariance is needed for gender-specific medical diagnoses where different medicine dosages are required, while invariance is essential for fairness in tasks such as predicting hospital stay duration or medical cost. We present these results in Table 2 and Table 3 of the attached rebuttal document, with details in the caption.
Please refer to the detailed response about the extension to naturally occurring features and symmetries in the consolidated review above and to Table 2 and Table 3 in the attached document.
Additional Limitations
We would like to highlight some additional limitations of our work.
- So far, ContextSSL has been evaluated on medium-sized datasets such as 3DIEBench and CIFAR. How it scales to massive datasets and more diverse environments is left to be explored in the future with more available compute.
- Using the transformer network to learn a contextual world model increases training and memory costs, though these are relatively small compared to the encoding process.
- So far, we have tested ContextSSL on settings with two transformations, such as rotation and color or crop and blur. As future work, we aim to expand our testing to continuous environments, moving beyond the constraints of finite settings.
Minor errors
We thank the reviewer for their attention to detail. We will make these corrections in our revised manuscript.
Dear reviewer WVFD,
As the discussion period is drawing to a close, we wanted to kindly request your feedback on our rebuttal. In our response, we have carefully tried to address all your concerns and also included additional experiments to further demonstrate the strengths of our work. We would greatly appreciate it if you could provide your feedback at your earliest convenience.
Thank you for your time.
Best regards,
Authors
Dear reviewer WVFD,
Understanding that you may be busy, and with the author-reviewer discussion period coming to a close, we would like to take this last opportunity to summarize the major updates we made during the rebuttal to address your concerns.
- In response to your concern about the applicability of ContextSSL beyond synthetic augmentations, we demonstrated that ContextSSL extends to naturally occurring symmetries and sensitive features in fairness and physiological datasets such as MIMIC III [1] and UCI Adult [2].
- Based on your concerns regarding Table 1, we replaced it with a version featuring clearer annotations and captions, highlighting key strengths of our approach, ContextSSL.
- We clarified the definition of equivariance and its connection to H(A|Z).
- We provided additional clarification around Table 2 and added other limitations and future directions of our work.
We hope these revisions address your concerns and would be happy to answer any further questions. If you find our revisions satisfactory, we hope that you would kindly consider re-evaluating our work.
Best,
Authors
We thank all the reviewers for their time and expertise in evaluating our paper. Their perceptive remarks and constructive feedback have been valuable in improving our work. In response, we have made several key revisions to address their concerns and have conducted additional experiments to enhance the support for our claims. Below is a brief summary of the key revisions:
Testing ContextSSL beyond synthetic augmentations and towards naturally occurring symmetries
We thank all the reviewers for raising this excellent question. It is indeed critical to evaluate our approach beyond self-supervised datasets, which use synthetic augmentations to enforce priors on the representation. To address this, we show that ContextSSL extends to naturally occurring symmetries and sensitive features in fairness and physiological datasets such as the MIMIC III [1] and UCI Adult [2]. To demonstrate this, we train ContextSSL to be selectively equivariant or invariant to gender by merely attending to different contexts. This is crucial; for instance, equivariance is needed for gender-specific medical diagnoses where different medicine dosages are required, while invariance is essential for fairness in tasks such as predicting hospital stay duration or medical cost. We present these results in Table 2 and Table 3 of the attached rebuttal document, with details in the caption. From Table 2, we can observe that ContextSSL learns equivariance to gender in one context, improving gender and medical diagnosis prediction for MIMIC-III. In another context, ContextSSL achieves higher invariance to gender, resulting in superior performance on fairness metrics like equalized odds (EO) and equality of opportunity (EOPP) for hospital stay (LOS) prediction. We observe similar results for fairness of income prediction in the UCI Adult dataset, as shown in Table 3 of the attached document.
[1] Johnson, A., T. Pollard, and R. Mark III. "MIMIC-III Clinical Database (version 1.4). PhysioNet. 2016." (2016).
[2] Arthur Asuncion and David Newman. UCI machine learning repository, 2007.
Clarifications about key results of ContextSSL and Table 1 of the manuscript
We acknowledge that in the initial manuscript, there was some confusion surrounding Table 1, which may not have effectively communicated the strengths of our approach. To address this, we present an improved version of Table 1 in the attached rebuttal document (through Table 1 and Figure 1). The new table demonstrates the empirical success of ContextSSL over invariant and equivariant baselines, and the new figure highlights ContextSSL's dynamic adaptability by paying attention to different contexts. Key observations from the Table are as follows:
- With context corresponding to rotation and color, respectively, ContextSSL achieves higher scores for rotation (0.744) and color (0.986) compared to all baselines (the highest being 0.671 for rotation and 0.975 for color). This indicates that it enforces equivariance to rotation and color in their respective contexts. ContextSSL also matches or exceeds other equivariant models in linear probe performance. Unlike equivariant baselines, ContextSSL does this without training separate models for each augmentation group.
- With contexts of rotation and color, ContextSSL achieves invariance to the other transformation, i.e., color (R² of 0.023) and rotation (R² of 0.344), respectively, comparable to SIE's values of 0.011 for color and 0.304 for rotation. However, ContextSSL achieves higher linear probe classification accuracy while training a single model, unlike SIE and other equivariant baselines that require two trained models, one for rotation and one for color.
- ContextSSL seamlessly enforces equivariance or invariance to rotation or color by merely paying attention to different contexts, as shown in Figure 1 of the attached document. Thus one model can align the learned representation to priors that are beneficial for different downstream tasks.
- Additional compelling evidence of ContextSSL's strong performance over baselines is presented in Table 2, Table 3, Table 4, and Figure 5 of the original manuscript.
This paper focuses on the transformations that self-supervised learning algorithms learn to be invariant or equivariant to. The Authors are concerned with the inductive biases brought by this paradigm, and propose to adapt invariance (or equivariance) of the learned representation to the context of the current task.
After the rebuttal, all Reviews are leaning towards acceptance (BA, WA, and A), and they praised the main contribution of this paper, its writing, and the experimental analysis. I recommend acceptance, and encourage the Authors to improve their paper by taking into account the feedback and discussions with Reviewers.