Self-supervised Transformation Learning for Equivariant Representations
Self-Supervised Transformation Learning (STL) uses transformation representations from image pairs instead of labels to learn equivariant transformations, enhancing performance efficiently.
Abstract
Reviews and Discussion
This paper is concerned with the learning of expressive representations in a self-supervised fashion. In particular, it aims at learning representations that are equivariant to the transformations applied to the input.
Strengths
- Presentation: The paper is clearly written, well-structured, easy to follow, and overall is in a good state
- Proposed approach: the idea proposed by the authors is not ground-breaking but simple, makes sense, and is efficient. It applies the same reasoning used in invariant self-supervised learning to equivariant representation learning.
- Evaluation pipeline: the authors evaluate their method quite thoroughly, evaluate semantic classification across a large number of datasets, evaluate object detection, and check the sensitivity to hyper-parameters and transformation prediction.
- Experimental results: results show that the approach effectively learns representations that perform well on both semantic and non-semantic information in comparison to baselines. Sensitivity to hyperparameters seems limited.
Weaknesses
- My main concern regards the baselines chosen for evaluation. The authors explicitly mention that SIE and SEN are excluded from the evaluation because these methods require access to transformation labels, which STL does not. While I see how these methods require more "supervision" than the proposed method, STL still requires weak supervision (i.e., access to pairs of images that have undergone the same transformation). Cases where one would have access to such weak supervision but not to transformation labels are quite rare (e.g., AugMix as mentioned by the authors); as a consequence, I think the evaluation should include these baselines, and the results should be analyzed considering this slight difference in supervision level. My score will be increased if this concern is addressed.
- Experimental results are computed for a single seed, for object detection results, the performance gap between EquiMod and STL is narrow hence additional seeds would make the evaluation more robust.
- In Table 1, STL+AugMix can be compared to SimCLR with AugMix augmentations which I do not believe is reported in this table.
Minor:
- line 40: typo
Questions
No questions, but a suggestion for improvement: I find the first part of Section 4.3 intriguing, and it is not clear to me why equivariance and transformation learning are complementary. Discussing the intuition behind these results further would, I think, improve the paper.
Limitations
Limitations are discussed; in particular, the main limitation I see in this work (i.e., the framework requires access to pairs of images that have undergone the same transformation, a weak form of supervision that is not always accessible in the wild) is mentioned.
Thank you for taking the time to thoroughly review our manuscript.
In response to your detailed feedback, we have gone to great lengths to address and accommodate every single one of your comments.
We would greatly appreciate it if you could review our responses to your comments and the submitted rebuttal PDF.
Sincere thanks in advance for your time and efforts.
[W1] Comparison to SEN and SIE
In response to the reviewer's comments on the absence of experimental comparisons with existing equivariant self-supervised learning (SSL) methods, we have conducted comparisons with the SIE[1] and SEN[2] methods.
As in the SIE paper, our implementation of the SEN method adopts the InfoNCE loss instead of the triplet loss to minimize the need for extensive hyperparameter tuning. Additionally, SIE employs a strategy where the representation dimension is divided to separately address invariant and equivariant representations. For consistency, all methods, including SEN and SIE, were trained using SimCLR as the base model.
Table A1 in the rebuttal PDF presents the results, demonstrating that our method performs competitively against both SEN and SIE, supporting the robustness and effectiveness of our approach.
[W2] Various Seeds
We appreciate the reviewer’s feedback and have addressed the concern regarding the robustness of our results. In response, we conducted additional experiments using different random seeds for pretraining to ensure that the observed improvements were statistically significant. The results of these additional experiments are presented in Table A1 in the rebuttal PDF, which illustrates that our method, STL, consistently outperforms baseline approaches across various downstream tasks, achieving improvements ranging from an average of 3% to nearly 10%. These results demonstrate that STL significantly enhances performance compared to existing methods, confirming the robustness and generalizability of our approach.
[W3] STL with AugMix
We have indeed conducted experiments applying AugMix to SimCLR, and the results are documented in Table A1 in the rebuttal PDF. Our findings indicate that while incorporating AugMix with SimCLR yields a performance improvement compared to using standard augmentations, the model using STL with AugMix demonstrates superior performance. This suggests that our STL approach effectively leverages the AugMix augmentations to enhance representation learning beyond the capabilities of SimCLR with AugMix.
[W4] Detailed Comments
Thank you for bringing the typo on line 40 to our attention. We will correct the error in the final manuscript. We appreciate your meticulous review.
[Q] Complementary Relation
Table 5 in the main manuscript illustrates the performance of three models: the Only Equivariance model (employing the invariance and equivariance losses), the Only Transformation model (employing the invariance and transformation losses), and the STL model (incorporating the invariance, equivariance, and transformation losses). The results include the average accuracy in downstream classification tasks, and the regression and classification outcomes for transformation prediction tasks.
The Only Equivariance model significantly improves downstream task performance over SimCLR, but its transformation prediction capabilities are limited. Conversely, the Only Transformation model excels in transformation prediction compared to SimCLR but shows limited improvement in downstream tasks. In contrast, the STL model, which integrates both the equivariance and transformation losses, demonstrates enhanced performance in both downstream tasks and transformation prediction, empirically validating their complementary nature.
The mechanism of this complementarity lies in the ability of the transformation representations, learned through the transformation loss, to facilitate the effective learning of equivariant transformations in the representation space. Specifically, the transformation representations allow the equivariant transformation module to ensure that transformations in image space correspond accurately to transformations in representation space. This correspondence is demonstrated by the improved performance metrics, such as MRR, H@k, and PRE, for STL's equivariant transformations compared to the Only Equivariance model, as evidenced by the results in Table A2 in the rebuttal PDF.
References
- Garrido, Quentin, Laurent Najman, and Yann LeCun. "Self-supervised learning of split invariant equivariant representations.", ICML, 2023.
- Park, Jung Yeon, et al. "Learning symmetric embeddings for equivariant world models.", ICML, 2022.
Thank you to the authors for their efforts in answering my concerns.
The additional experimental validation provided by the authors addresses my concerns as I believe they strengthen the validation of the proposed method. I have adjusted my score accordingly.
We sincerely appreciate your thoughtful review and are grateful for recognizing the additional experimental validations we provided. Your feedback has been instrumental in refining our work. We are glad to hear that our revisions addressed your concerns and strengthened our method's validation. We welcome any further suggestions you may have and thank you for adjusting your score.
The paper introduces STL, a method for learning self-supervised equivariant representations. It suggests replacing transformation labels with representations derived from data pairs. The proposed pretext task promotes learning invariant and equivariant representations alongside transformation-related information. The method's effectiveness is showcased through various classification and object detection tasks.
Strengths
- The paper argues that existing methods treat each transformation independently as they require transformation labels, which disregards interdependency among transformations. The main strength of the paper lies in their method, which does not require transformation labels.
- The complexity of transformations does not constrain the proposed method STL, as the authors demonstrate learning equivariant representations with complex transformations like AugMix.
- Authors show empirically that the learned representations are competitive for classification and detection. They also show that STL learns equivariant representations with relational information between transformations.
Weaknesses
- Experimental comparisons with some key existing equivariant SSL methods [1, 2] and datasets (3DIEBench) are missing.
- Missing several related works [3-6]. The authors should thoroughly review the relevant literature.
- The approach seems computationally demanding since there are multiple (three) InfoNCE losses for each iteration. Additionally, the authors haven't provided results/discussions on computational costs.
- Information for reproducibility is limited, especially when extending STL to BarlowTwins, BYOL, and SimSiam.
[1] Garrido, Quentin, Laurent Najman, and Yann Lecun. "Self-supervised learning of split invariant equivariant representations." arXiv preprint arXiv:2302.10283 (2023).
[2] Park, Jung Yeon, et al. "Learning symmetric embeddings for equivariant world models." arXiv preprint arXiv:2204.11371 (2022).
[3] Gupta, Sharut, et al. "Structuring representation geometry with rotationally equivariant contrastive learning." arXiv preprint arXiv:2306.13924 (2023).
[4] Xie, Yuyang, et al. "What should be equivariant in self-supervised learning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
[5] Guo, Xifeng, et al. "Affine Equivariant Autoencoder." IJCAI. 2019.
[6] Gupta, Sharut, et al. "Learning Structured Representations with Equivariant Contrastive Learning."
Questions
- It’s known that InfoNCE loss can be prone to dimensional collapse [7]. Having multiple InfoNCE losses (especially when invariance and equivariance losses can be contradictory) in your method, have you observed dimensional collapse happening?
- What influence does the size of the hypernetwork have on the downstream tasks?
- How do different hyperparameter configurations of the invariance, equivariance, and transformation loss weights affect the rate of convergence during pre-training?
- How do you extend STL to asymmetric methods like BYOL and SimSiam? Are the InfoNCE criteria still kept for equivariance and transformation-related losses?
- Any insights into how much the equivariance and transformation losses contribute to learning the relational information between transformations?
[7] Jing, Li, et al. "Understanding dimensional collapse in contrastive self-supervised learning." arXiv preprint arXiv:2110.09348 (2021).
Limitations
The authors have provided some limitations. It would be beneficial to include a discussion on computational costs.
Thank you for taking the time to thoroughly review our manuscript.
In response to your detailed feedback, we have gone to great lengths to address and accommodate every single one of your comments.
We would greatly appreciate it if you could review our responses to your comments and the submitted rebuttal PDF.
Sincere thanks in advance for your time and efforts.
[W1-1] Comparison to SIE and SEN
In response to the reviewer's comments on the lack of comparisons with existing equivariant SSL methods, we compared our approach with SIE [1] and SEN [2]. Following the SIE paper, we used InfoNCE loss for SEN to reduce hyperparameter tuning. SIE splits representation dimensions for invariant and equivariant features. All methods, including SEN and SIE, used SimCLR as the base model. Table A1 in the rebuttal PDF shows our method is competitive with SEN and SIE, demonstrating its robustness and effectiveness.
[W1-2] Evaluation on 3DIEBench
The 3DIEBench dataset, while valuable for equivariance learning because transformations are pre-applied together with transformation labels, poses a challenge for evaluating STL due to its static nature. Since transformations in 3DIEBench are pre-defined and fixed, we cannot re-apply the same transformation to different samples during training, whereas STL requires dynamic application of transformations during training to learn their interdependencies. To utilize 3DIEBench effectively, we would need not only full access to the simulator used for generating the dataset but also real-time generation to apply pair-wise transformations during training.
[W2] Related Work Recommendation
Thank you for highlighting the missing references. We will incorporate these references in the final version.
[W3] Computational Costs
To empirically evaluate computational costs, we conducted an experiment using ResNet-50 with a batch size of 256. We utilized a single NVIDIA 3090 GPU, measuring the average training time per iteration over 1000 iterations following a 1000-iteration warm-up phase. As presented in Table A6 in the rebuttal PDF, the additional computational cost due to our network design is approximately 1.1 times that of SimCLR. In comparison to EquiMod, the computational cost remains nearly equivalent.
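For reference, a minimal sketch of the timing protocol described above is given below; `train_step` and `loader` are placeholders for a generic PyTorch training step and data loader, not our actual code.

```python
import itertools
import time
import torch

def time_per_iteration(train_step, loader, warmup=1000, measure=1000):
    """Average wall-clock time per training iteration after a warm-up phase."""
    batches = itertools.cycle(loader)
    for _ in range(warmup):          # warm-up: stabilize the data pipeline and cuDNN autotuning
        train_step(next(batches))
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(measure):
        train_step(next(batches))
    torch.cuda.synchronize()         # flush queued GPU work before stopping the clock
    return (time.perf_counter() - start) / measure
```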
[W4, Q4] STL Extension on Various Base Models
STL extension on BYOL
In our STL extension to BYOL, we utilize the dissimilarity loss function intrinsic to BYOL to define the invariant, equivariant, and transformation losses. The dissimilarity loss in BYOL is:
$$d(u, v) = \left\lVert \frac{u}{\lVert u\rVert_2} - \frac{v}{\lVert v\rVert_2} \right\rVert_2^2 = 2 - 2\,\frac{\langle u, v\rangle}{\lVert u\rVert_2\,\lVert v\rVert_2}.$$
STL extension on SimSiam
For the SimSiam model, the dissimilarity loss is defined as follows:
$$d(u, v) = -\frac{u}{\lVert u\rVert_2} \cdot \frac{\operatorname{sg}(v)}{\lVert \operatorname{sg}(v)\rVert_2},$$
where $\operatorname{sg}(\cdot)$ denotes the stop-gradient operation.
STL extension on BarlowTwins
In the case of BarlowTwins, the dissimilarity loss is given by:
$$d(u, v) = \sum_i \left(1 - \mathcal{C}_{ii}\right)^2 + \lambda \sum_i \sum_{j \neq i} \mathcal{C}_{ij}^2,$$
where $\mathcal{C}_{ij} = \frac{\sum_b u_{b,i}\, v_{b,j}}{\sqrt{\sum_b u_{b,i}^2}\,\sqrt{\sum_b v_{b,j}^2}}$ is the cross-correlation matrix computed over the batch and $u_{b,i}$ is the $i$-th dimension of $u_b$.
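For completeness, a minimal PyTorch sketch of these three standard dissimilarity losses is given below; the function names and the exact way they are substituted into STL's invariant, equivariant, and transformation terms are illustrative rather than our exact implementation.

```python
import torch
import torch.nn.functional as F

def byol_loss(p, z):
    """BYOL: squared L2 distance between the L2-normalized online prediction p
    and the (detached) target projection z."""
    p, z = F.normalize(p, dim=-1), F.normalize(z.detach(), dim=-1)
    return (2 - 2 * (p * z).sum(dim=-1)).mean()

def simsiam_loss(p, z):
    """SimSiam: negative cosine similarity with stop-gradient on the target branch."""
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

def barlow_twins_loss(z1, z2, lam=5e-3):
    """Barlow Twins: drive the cross-correlation matrix of the two views toward identity."""
    n, _ = z1.shape
    z1 = (z1 - z1.mean(0)) / z1.std(0)
    z2 = (z2 - z2.mean(0)) / z2.std(0)
    c = (z1.T @ z2) / n                                   # d x d cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
    return on_diag + lam * off_diag
```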
[Q1] Dimensional collapse caused by multiple InfoNCE losses
Throughout our experiments, we did not observe dimensional collapse, likely due to the design and implementation of STL. Each loss function is crafted to enhance discrimination between samples, reducing collapse risk. Additionally, using a non-linear projector and appropriate batch size helped maintain dimensionality in learned representations. To ensure proper batch size for transformations, we used the aligned batch configuration proposed in this paper, preserving batch complexity by matching the number of transformation types to sample types.
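As a concrete check, one can inspect the singular value spectrum of the embedding covariance following Jing et al. [7]; below is a minimal sketch of such a check (the helper name is illustrative), where a long tail of near-zero singular values would indicate dimensional collapse.

```python
import torch

def embedding_spectrum(z):
    """Log singular-value spectrum of the embedding covariance computed from a
    matrix z of shape (num_samples, dim); near-zero tail => dimensional collapse."""
    z = z - z.mean(dim=0, keepdim=True)
    cov = (z.T @ z) / (z.shape[0] - 1)
    return torch.log(torch.linalg.svdvals(cov) + 1e-12)
```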
[Q2] Ablation study on auxiliary transformation backbone network
In Table A5 in the rebuttal PDF, we present empirical results that demonstrate the impact of varying the number of layers in the auxiliary transformation backbone on the average accuracy across different downstream classification tasks. Our findings indicate that changes in the number of layers of the auxiliary transformation backbone result in only marginal variations in performance. This suggests that the size of the hypernetwork is not a primary factor influencing accuracy in these tasks.
[Q3] Convergence Speed
In Figure A2 in the rebuttal PDF, we have provided the linear evaluation results across different equivariance-to-transformation loss weight ratios while fixing the invariance loss weight to 1. Our observations indicate that there is no significant difference in convergence speed attributable to the varying ratios.
[Q5] Transformation representation
Table A4 in the rebuttal PDF shows the results of downstream classification and transformation prediction tasks on STL, based on different weights of equivariant and transformation losses. The best performance for both tasks is observed with weight ratios of 1:0.2 or 1:0.5. When the transformation loss ratio is reduced to 1:0.1, performance in the transformation prediction task decreases, indicating insufficient learning of transformation representations. Increasing the ratio to 1:1 prioritizes transformation learning but disrupts the balance with other components, leading to performance drops in both tasks. At a higher ratio of 1:2, transformation prediction improves significantly, but classification performance declines sharply. While a higher ratio benefits transformation learning, maintaining a lower ratio enhances the generalizability of image representations.
Dear Reviewer gZBh,
We sincerely appreciate your valuable feedback, which has greatly contributed to enhancing our manuscript. We have submitted our responses to your insightful comments and would be grateful to hear your thoughts on whether our replies have addressed your concerns.
We welcome any further comments or questions.
Thank you
The authors propose Self-supervised Transformation Learning (STL) to learn equivariant representations. The core of the method is to not use augmentation information but instead use transformation representations obtained from pairs of data. In this sense, instead of knowing the transformation information, it is necessary to have pairs of data with the same transformation applied to them. This allows STL to leverage more complex augmentation schemes such as AugMix to reach higher performance on classification and detection benchmarks.
Strengths
The non-reliance on knowledge of the group elements leads to a method with different assumptions than previous works, which can thus be used in different scenarios. Assuming knowledge of pairs of data with the same transformation can be a weaker or stronger assumption than knowledge of the group element depending on the domain, which makes STL complementary to existing works.
Assuming that images are sampled randomly, using a pair of images to compute the transformation representation of another one ensures that information does not leak between the semantics of the image and the transformation representation. At inference time, applying the transformation does not require knowledge of pairs.
The performance gained using STL over existing methods is convincingly shown across a wide range of tasks. We can however notice that aiming for equivariance over all augmentations does not lead to the best performance (Table 6), reinforcing previous findings that the benefits of invariance/equivariance can be task dependent.
Weaknesses
1) An important reference to [1] is missing from the paper. It was proposed in this paper to use pairs of data transformed by the same (but unknown) group element (see equation 4 for example) to learn equivariant representation. Although the considered experimental setting is different, and the way to leverage the pairs differs, it remains a very related method.
2) With how the loss is designed, the representations are aimed at being both invariant and equivariant at the same time (albeit with a projector in between the loss and the representations). As invariance is a perfectly valid solution when aiming for equivariant solutions, it is possible that the learned representations end up invariant to augmentations, as illustrated in Figure S2 of [2] for example. Currently, the metric defined in Equation 15 doesn't provide enough information as to whether or not the predictor indeed applies transformations well (and thus whether or not the representations are equivariant).
Intuitively (correct me if my understanding is wrong), the proposed metric looks at how much closer the predicted representation is to the target compared to its starting point. A value higher than 1 thus means that the prediction is closer to its target than its starting point. But here the values are barely around 1.1 which would suggest that while the predictor is indeed doing something positive, it's far from applying the transformations perfectly.
Completing this analysis with other common metrics such as the ones used in EquiMod[3] or metrics such as Hit at rank K, MRR, PRE[2,4,5] for example would provide more compelling evidence of the equivariance or not of the representations.
3) Performance seems to not be reported on the pretraining dataset but only on other downstream tasks. It is important to report those numbers to understand the behaviour of the models both in and out of domain.
[1] Shakerinava, Mehran, Arnab Kumar Mondal, and Siamak Ravanbakhsh. "Structuring representations using group invariants." Advances in Neural Information Processing Systems 35 (2022): 34162-34174.
[2] Garrido, Quentin, Laurent Najman, and Yann Lecun. "Self-supervised learning of split invariant equivariant representations." arXiv preprint arXiv:2302.10283 (2023).
[3] Devillers, Alexandre, and Mathieu Lefort. "Equimod: An equivariance module to improve self-supervised learning." arXiv preprint arXiv:2211.01244 (2022).
[4] Kipf, Thomas, Elise Van der Pol, and Max Welling. "Contrastive learning of structured world models." arXiv preprint arXiv:1911.12247 (2019).
[5]Park, Jung Yeon, et al. "Learning symmetric embeddings for equivariant world models." arXiv preprint arXiv:2204.11371 (2022).
Questions
Line 42-43 "with a transformation label, each transformation is treated independently, disregarding interdependency [...] each component in color jitter is treated distinctively although they are related to each other." I am not sure I see the point here, as this argument can be made for any image augmentation. Giving all transformation parameters (as well as their order if it isn't fixed) gives full information about the transformation. The transformations are fully independent of each other and at best they do not commute.
Line 47-48 "After all, the reliance on transformation labels limits the performance gain in equivariant representation learning". This claim seems unsubstantiated, are there any references or concrete evidence to support this claim ?
In Equation 2 the notation lacks preciseness. If the loss is an InfoNCE criterion, it cannot consider samples in isolation as written in Equation 2, but instead needs knowledge of the full batch to be computed.
When using Color Jitter for AugSelf and Equimod, is the order of the transformations fixed or randomly selected for each call, as is default in torchvision for example ? If it is not then the same labels can be associated with different transformations which will hinder their learning.
When considering a large set of transformations, it is possible that the transformation representations only represent some of the transformations and not all of them, or that the predictor only uses partial information (e.g. applies hue change but not brightness change). Did you perform analyses on individual transformations to study the equivariance properties on individual transformations ?
Limitations
The authors address the limitations in section 6, notably how using pairs of transformed data is not a panacea, and can have issues to be extended to even larger transformation sets than considered with STL.
Thank you for taking the time to thoroughly review our manuscript. In response to your detailed feedback, we have gone to great lengths to address and accommodate every single one of your comments. We would greatly appreciate it if you could review our responses to your comments and the submitted PDF. Sincere thanks in advance for your time and efforts.
[W1] Related Work Recommendation
As mentioned in the review, the suggested related work [1] is similar to STL in that it, too, utilizes differences between pairs of images, and we will update the reference accordingly. The main difference between [1] and STL lies in the use of transformation group information: [1] relies on it to select the appropriate loss term for training. While [1] suggests a way to bypass this requirement, STL requires neither the transformation group nor transformation labels, and can therefore be considered a more general methodology for learning representations from pairs of images.
[W2] Transformation Equivariance
You are correct in noting that our metric is designed to measure the proximity of the transformed representation to the target representation relative to the original. The intention was to assess how well the equivariant transformation aligns with its image space counterpart compared to an identity transformation. However, we recognize that comparing solely to the identity transformation does not fully capture the nuances of equivariance across a variety of transformations. Therefore, we employed the suggested metrics (MRR, H@k, and PRE) using a pretrained STL-10 model, extending our analysis with metrics that evaluate the relative alignment of transformations in the image space. We tested on STL-10 test data subjected to individual transformations such as cropping and color jitter, as well as the standard combination of transformations used during training. Our results, detailed in Table A2 in the rebuttal PDF, show that our method surpasses existing techniques in most metrics, except for crop H@5 and PRE. This improvement suggests that the equivariant transformations learned by our approach more accurately reflect actual transformations in the image space compared to prior methods.
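For completeness, a minimal sketch of how MRR and H@k can be computed from predicted and target representations is given below (ranking by cosine similarity over the evaluation pool; function and variable names are illustrative, and PRE is omitted).

```python
import torch
import torch.nn.functional as F

def mrr_and_hits(pred, target, ks=(1, 5)):
    """pred[i] is the predicted representation of the i-th transformed sample,
    target[i] its actual representation; targets are ranked by cosine similarity."""
    sim = F.normalize(pred, dim=-1) @ F.normalize(target, dim=-1).T   # (N, N)
    order = sim.argsort(dim=-1, descending=True)
    # rank of the true target for each prediction (1 = retrieved first)
    match = order == torch.arange(len(pred), device=pred.device).unsqueeze(1)
    ranks = match.float().argmax(dim=-1) + 1
    mrr = (1.0 / ranks).mean().item()
    hits = {k: (ranks <= k).float().mean().item() for k in ks}
    return mrr, hits
```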
[W3] In-domain Evaluation on STL-10 and ImageNet100
To address the concern regarding the in-domain performance evaluation, we have included the results of the linear evaluation conducted on the pretraining dataset in Table A3 in the rebuttal PDF. Our findings demonstrate that, except for the SEN method, which focuses solely on equivariant learning, all other approaches, including STL, effectively maintain the performance level of the base invariant learning model within the in-domain setting. This evidence underscores the robustness of our approach in preserving performance both in-domain and out-of-domain.
[Q1] Interdependencies between Transformations
While it is true that individual augmentations can be applied independently in the image space, STL focuses on the interdependencies between transformations in the representation space, particularly concerning equivariant transformations. Due to the length limitation, please refer to the global response for further discussion.
[Q2] Leveraging Complex Augmentation
From "[...] transformation labels limits the performance gain [...]", we highlight that while complex augmentation like AugMix significantly enhance performance (shown in Table 1 in the main manuscript denoted as STL with AugMix), prior equivariance learning methods cannot leverage such complex augmentations due to inaccessbility of the corresonding transformation labels thereby bounding their performance gain.
[Q3] InfoNCE with Batch Information
Equation 2 was simplified to focus on the core concept. However, as we acknowledge that the InfoNCE loss requires consideration of the full batch, we will update Equation 10 to incorporate batch interactions as follows:
This reflects the necessary batch-level computation. Similarly, Equations 11-13 should be conditioned on the batch samples, aligning the formulation with the InfoNCE criterion's requirements; this will be updated accordingly in the final version.
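For concreteness, a generic batch-level InfoNCE of the kind intended for the updated equations is sketched below; the anchor/positive pairing used by each specific loss term follows Equations 10-13, and the code is an illustrative sketch rather than our exact implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE over the whole batch: the i-th positive is the match for the
    i-th anchor, and all other batch entries act as negatives."""
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.T / temperature                       # (N, N) similarity matrix
    labels = torch.arange(len(a), device=a.device)       # positives on the diagonal
    return F.cross_entropy(logits, labels)
```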
[Q4] The order of the transformations
In our implementation for AugSelf and Equimod, the order of transformations, including Color Jitter, is fixed rather than randomized. As mentioned in your review, this consistency must hold to prevent any association of different transformations with the same labels, thereby supporting effective learning and representation alignment.
[Q5] Analysis on individual transformations
We have conducted a detailed analysis of individual transformations to examine their equivariance properties in Table 5 in the main manuscript. Furthermore, we analyzed the similarity between equivariant transformations and their corresponding transformations using metrics such as MRR, Hit@k, and PRE. For this analysis, we employed an STL-10 pre-trained model, transforming the STL-10 test data using crop, color jitter, and combinations of augmentations. Table A2 in the rebuttal PDF shows that our method surpasses existing approaches in all metrics except for H@5 and PRE in the crop transformation. This superiority demonstrates that equivariant transformations learned through STL effectively capture and reflect the actual transformations applied.
References
- Shakerinava, Mehran, Arnab Kumar Mondal, and Siamak Ravanbakhsh. "Structuring representations using group invariants.", NeurIPS, 2022.
- Garrido, Quentin, Laurent Najman, and Yann LeCun. "Self-supervised learning of split invariant equivariant representations.", ICML, 2023.
Thank you for the detailed answer, clarifying my previous questions. The added equivariance measures are helpful to understand the exact behaviour of STL.
The comparisons to previous work added are convincing (Table A2), where STL outperforms previous approaches, but with the caveat that the overall performance seems a bit low. For example, the highest MRR for color jitter is 0.33 when recent works achieve around 0.8 (https://arxiv.org/pdf/2403.00504). This comparison is not perfect due to data and model size differences, but it still raises the question of whether or not the considered setup is too hard for every considered method (including STL).
A small question on the equivariance metrics computation (MRR,PRE,H@k), how many different transformed images did you use for the computation of the metrics ? MRR can be very sensitive to this value (SIE used 50 for example).
Thank you for carefully reviewing our detailed response and for providing additional comments. We appreciate your thoroughness and valuable insights.
Regarding your comment on the MRR for color jittering being 0.33 compared to the 0.8 reported by Image World Models (IWM) in [1], we acknowledge that several factors make a direct comparison difficult: network architecture, dataset used, and base invariant learning models. In particular, looking at Table S2 in [1], we note that the depth of the architecture seems to play a critical role, as the average MRR drops by 0.378 when using a 12-layer predictor rather than an 18-layer predictor across various color jittering settings. As our model utilizes a hypernetwork with two layers, the reduced depth might limit the model's ability to capture intricate transformations as effectively as deeper architectures, indicating that our setting is particularly challenging given the differences in network depth.
While comparisons with IWM are difficult due to the aforementioned architectural differences, we find it more meaningful to compare our work with SIE, as the SIE paper [2] uses the same ResNet-18 and predictor network as our paper and forms the basis for IWM's equivariant learning approach. However, unlike STL, SIE was evaluated on the 3DIEBench dataset using a combination of transformations such as 3D rotation and color changes, with MRR calculated over 50 transformations per sample (whereas STL's evaluation involves 60 transformations per sample). As outlined in the table below, the MRR for SIE drops from 0.41 in its original setting (Table 2 of [2]) to 0.275 in our setting (Table A2 in the rebuttal PDF), indicating the increased difficulty. In contrast, STL achieves an MRR of 0.4708 for crop and color combinations, a significant improvement over SIE's 0.275 in the same experimental setting.
Finally, we want to point out that STL achieves an H@1 of 0.35 and an H@5 of 0.6080 for crop and color combinations (Table A2 in the rebuttal PDF). These metrics indicate a 35% probability that the actual corresponding transformation is ranked first and a 60% probability that it ranks within the top five. This suggests that our predictor effectively captures and reflects the actual transformations to a substantial extent.
We hope this explanation clarifies the differences and supports the validity of our results despite the inherent challenges in making direct comparisons across different methodologies.
| Method | MRR on SIE setting | MRR on STL setting |
|---|---|---|
| SIE | 0.41 | 0.2750 |
| STL | - | 0.4708 |
References
- Garrido, Quentin, et al. "Learning and leveraging world models in visual representation learning." arXiv preprint arXiv:2403.00504 (2024). (https://arxiv.org/pdf/2403.00504)
- Garrido, Quentin, Laurent Najman, and Yann Lecun. "Self-supervised learning of Split Invariant Equivariant representations." International Conference on Machine Learning. PMLR, 2023. (https://arxiv.org/pdf/2302.10283)
Thank you for the additional clarifications.
Indeed, translating existing equivariant methods to a new setup can be difficult as they are always surfing on a fine line to avoid collapsing to invariance. This reflects more on the brittleness of the methods themselves. I believe that the provided baselines are applied well and that the methods had issues generalizing to this new setup. They still display non random performance (which would be an MRR of 0.08) which is reassuring in this judgement.
Considering that my concerns have been addressed, I have adjusted my score accordingly.
We sincerely appreciate your thoughtful feedback and thank you for recognizing the complexities involved in adapting existing equivariant methods to new setups. We agree that addressing the issue of collapsing to invariance is crucial in equivariant learning. We recognize the importance of developing methods to tackle this challenge and believe that further research in this area is essential for advancing the field.
Thank you for adjusting your score based on our clarifications. We warmly invite you to share any further suggestions or insights, as they are invaluable to the ongoing improvement of our paper.
This paper proposes a new way to learn equivariant representations by directly learns the transformation representation. They enforce that different transformations have their own input-agnostic representations. To obtain this, they learn an encoder that takes pairwise representations of the same image to extract the transformation representation. To avoid trivial solutions, they use different examples in transformation representation extraction and later transformation module, which helps disentangle the sample and transformation features. They further add an regularization on the sample consistency to encourage it to be sample-invariant. They combine it with invariant learning loss for training, and demonstrates its benefits on several tasks.
Strengths
- I generally like the idea of disentangling sample and transformation representations in equivariant learning, and the paper proposes a clever patching-based objective to enforce it.
- And the transformation representation they obtain in Figure 1 does make sense intuitively since it reflects the relative distance between different augmentations.
- The evaluation demonstrates its advantages over previous works on linear probing and object detection tasks. The proposed method is particularly superior on the transformation prediction task. I think it's because they successfully disentangle the features.
Weaknesses
- The evaluation is not complete. I didn't find linear probing results (often the most important ones) on ImageNet-100 or STL-10.
- Lacks comparison to a few relevant E-SSL baselines, such as [1, 2].
- Lacks a more in-depth analysis of the learned transformation representations. I have some doubts about it. Since there are many variations in each augmentation (e.g., cropping with different positions and ratios), what is the relationship of their representations in the latent space? Is there an arithmetic relationship?
[1] Equivariant Contrastive Learning. ICLR 2022.
[2] Residual Relaxation for Multi-view Representation Learning. NeurIPS 2021.
Questions
no
Limitations
yes
Thank you for taking the time to thoroughly review our manuscript.
In response to your detailed feedback, we have gone to great lengths to address and accommodate every single one of your comments.
We would greatly appreciate it if you could review our responses to your comments and the submitted rebuttal PDF.
Sincere thanks in advance for your time and efforts.
[W1] In-domain Evaluation on STL-10 and ImageNet100
To address the concern regarding the in-domain performance evaluation, we have included the results of the linear evaluation conducted on the pretraining dataset in Table A3 in the rebuttal PDF. Our findings demonstrate that, except for the SEN method which focuses solely on equivariant learning, all other approaches including the proposed method (STL) effectively maintain the performance level of the base invariant learning model within the in-domain setting. This evidence underscores the robustness of our approach in preserving performance both in-domain and out-of-domain.
[W2] Comparison to Equivariant Contrastive Learning (E-SSL)
From the suggested baselines, E-SSL [1] and Prelax [2], we conducted experiments on E-SSL, as its code is publicly available while Prelax's is not. For the comparison with E-SSL, we re-implemented it by applying crop and color transformations to the original image and predicting their parameters. The results including E-SSL can be seen in Table A1 in the rebuttal PDF.
[W3] Intra-relationship between Transformations
Figure 1 (c), in the main manuscript, demonstrates the inter-relationship between different transformation types. To further explore the intra-relationship within transformation representations, we applied various augmentations, specifically focusing on different aspects of cropping and color adjustments. For cropping, we considered variations in the box's center position and scale. For color transformations, we varied parameters such as brightness, contrast, saturation, and hue.
We employed UMAP visualizations to represent these transformations in the transformation representation space, as shown in Figure A3 in the rebuttal PDF. The results reveal that in the space, crops align according to the movement of the center position of the box along the x-axis, while color transformations align according to the degree of parameter adjustment. This arrangement suggests that the representations are sensitive to the specific parameters of the transformations.
We believe these findings support the view that transformation representations are organized in a manner that reflects their inherent properties, thereby capturing both inter- and intra-relationships effectively. This structured representation allows for a more nuanced understanding of transformation effects, supporting our hypothesis on transformation sensitivity and contributing to the broader discourse on representation learning.
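For reproducibility of this analysis, a minimal sketch of the visualization step is shown below; `t_reps` and `params` stand for the extracted transformation representations and the swept transformation parameter values described above, and the names are placeholders rather than our actual code.

```python
import umap                     # umap-learn
import matplotlib.pyplot as plt

def plot_transformation_space(t_reps, params):
    """2-D UMAP of transformation representations, colored by the swept parameter
    (e.g., crop-center x-position, or the magnitude of a color adjustment)."""
    emb = umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(t_reps)
    plt.scatter(emb[:, 0], emb[:, 1], c=params, cmap="viridis", s=8)
    plt.colorbar(label="transformation parameter")
    plt.show()
```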
References
- Dangovski, Rumen, et al. "Equivariant contrastive learning.", ICLR, 2022.
- Wang, Yifei, et al. "Residual relaxation for multi-view representation learning.", NeurIPS, 2021.
Thanks to the authors for the rebuttal. I find the new results on linear probing, the reproduced E-SSL baseline, and the intra-relationship analysis to be promising and to help strengthen this work. Therefore, I would like to increase my score to 6.
We sincerely thank you for recognizing the efforts we made in addressing your concerns and for raising your rating. Your review has been instrumental in refining our paper, and we will ensure that all relevant clarifications and insights are fully incorporated into the revision. We warmly invite you to share any further suggestions; your insights are invaluable to the continuous improvement of our paper.
Dear Reviewers and Area Chairs,
We thank the reviewers for their constructive feedback. We are glad to incorporate the various helpful reviewer comments to clarify and complete our work. Reviewers agreed on the originality, motivation, soundness, and significance of the paper. Here, we briefly recap the goal of our paper and the proposed method, Self-supervised Transformation Learning (STL).
Previous methods using transformation labels are divided into explicit equivariant learning, which learns equivariant transformations, and implicit equivariant learning, which focuses on transformation prediction, as shown in Figure A1 in the rebuttal PDF. Our approach differs by learning transformation representations without transformation labels, enabling explicit equivariant learning. This allows STL to effectively capture transformation-sensitive information without relying on predefined labels.
The primary goal of STL is to learn better equivariant representations that capture transformation-sensitive information. Transformations that induce similar semantic changes in the image space should correspondingly yield similar changes in the representation space. For instance, transformations affecting color information differ in their semantic impact from spatial transformations like cropping, which alter relative spatial information and proportions within an image.
Figures 1(a) and (b) of our paper visualize the learned equivariant transformations' functional weights for the corresponding transformations. In these visualizations, previous approaches, such as EquiMod, treat transformations like crop and color changes as independent mappings. In contrast, STL demonstrates that color-related equivariant transformations, which share similar semantic changes, are learned with similar mappings in the functional space. This highlights STL's ability to capture the nuanced interdependencies between transformations more effectively. Together, the experimental results firmly demonstrate the representation ability of STL, which outperforms existing methods in 7 out of 11 classification tasks and shows superior average performance.
Our approach underscores the importance of recognizing these interdependencies to enhance semantic understanding within the representation space, thus improving the effectiveness of equivariant representation learning. Additionally, STL effectively learns intra-relationships within transformations (Figure A3 in the rebuttal PDF), and experiments show that the equivariant transformations learned by STL reflect actual transformations more accurately than existing methods, as demonstrated by equivariance metric measurements (Table 5 of our paper and Table A4 in the rebuttal PDF). Unlike previous methods, STL can leverage complex transformations such as AugMix that were previously inaccessible due to the lack of transformation labels (Table 1 of our paper).
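To make the overall objective concrete, a heavily simplified pseudo-structure of one STL training step is sketched below. All names (`encoder`, `t_encoder`, `predictor`), the loss weights, and the exact pairing of samples and transformations are illustrative assumptions for exposition, not our implementation; the precise formulation is given by the equations in the paper.

```python
import torch
import torch.nn.functional as F

def info_nce(a, p, temperature=0.1):
    """Batch-level InfoNCE with positives on the diagonal."""
    logits = F.normalize(a, dim=-1) @ F.normalize(p, dim=-1).T / temperature
    return F.cross_entropy(logits, torch.arange(len(a), device=a.device))

def stl_step(encoder, t_encoder, predictor, x_a, x_b, t, lam_e=1.0, lam_t=0.2):
    """One conceptual STL step: the transformation representation is extracted
    from pair A and used to predict how the *other* sample B moves under the
    same transformation t (illustrative sketch only)."""
    za, za_t = encoder(x_a), encoder(t(x_a))           # clean / transformed views of A
    zb, zb_t = encoder(x_b), encoder(t(x_b))           # the same transformation applied to B
    r_ab = t_encoder(za, za_t)                         # transformation representation from pair A
    loss_inv = info_nce(za, za_t)                      # invariance term of the base SSL model (simplified)
    loss_equi = info_nce(predictor(zb, r_ab), zb_t)    # equivariance: move B's representation with r_ab
    loss_trans = info_nce(r_ab, t_encoder(zb, zb_t))   # same transformation -> similar transformation rep
    return loss_inv + lam_e * loss_equi + lam_t * loss_trans
```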
This paper received mixed reviews of 4,4,6,6. Overall the paper proposes a self-supervised representation learning technique for learning equivariant representations that can be applied to various settings such as BYOL, SimCLR, Barlow Twins and SimSiam. The method obtains strong performances and is presented clearly. While at the beginning of the discussion phase reviewers pointed out missing experiments, and discussions of related works, the authors provided these to a satisfying degree such that the engaged reviewers upgraded their scores. Overall, this work is of interest to a large part of the vision community of NeurIPS and the AC recommends acceptance.