PaperHub
Overall rating: 6.3/10 (Poster; 4 reviewers; lowest 5, highest 8, standard deviation 1.1)
Individual ratings: 6, 6, 5, 8
Confidence: 3.0 · Correctness: 2.8 · Contribution: 2.8 · Presentation: 2.8
NeurIPS 2024

Harnessing small projectors and multiple views for efficient vision pretraining

OpenReview | PDF
Submitted: 2024-05-16 · Updated: 2025-01-12
TL;DR

We characterize the implicit biases imposed by learning dynamics along with architecture and loss function for self-supervised representation learning, leading to practical recommendations for improving its sample and compute efficiency.

Abstract

Recent progress in self-supervised (SSL) visual representation learning has led to the development of several different proposed frameworks that rely on augmentations of images but use different loss functions. However, there are few theoretically grounded principles to guide practice, so practical implementation of each SSL framework requires several heuristics to achieve competitive performance. In this work, we build on recent analytical results to design practical recommendations for competitive and efficient SSL that are grounded in theory. Specifically, recent theory tells us that existing SSL frameworks are actually minimizing the same idealized loss, which is to learn features that best match the data similarity kernel defined by the augmentations used. We show how this idealized loss can be reformulated to a functionally equivalent loss that is more efficient to compute. We study the implicit bias of using gradient descent to minimize our reformulated loss function, and find that using a stronger orthogonalization constraint with a reduced projector dimensionality should yield good representations. Furthermore, the theory tells us that approximating the reformulated loss should be improved by increasing the number of augmentations, and as such using multiple augmentations should lead to improved convergence. We empirically verify our findings on CIFAR, STL and Imagenet datasets, wherein we demonstrate an improved linear readout performance when training a ResNet-backbone using our theoretically grounded recommendations. Remarkably, we also demonstrate that by leveraging these insights, we can reduce the pretraining dataset size by up to 2$\times$ while maintaining downstream accuracy simply by using more data augmentations. Taken together, our work provides theoretically grounded recommendations that can be used to improve SSL convergence and efficiency.
Keywords
representation learning, self-supervised learning, data-augmentation, learning dynamics, sample efficient SSL, compute efficient SSL

Reviews and Discussion

Review (Rating: 6)

This paper investigates a theoretical formulation of the contrastive loss used in self-supervised pretraining of vision models. (One of the assertions of the paper is that all SSL (self-supervised learning) losses are variants of the same loss.) The authors propose a more compute-efficient version of the loss objective, and use the theoretical derivations of their analysis of gradient descent dynamics to propose two changes to SSL protocols in practice: the use of stronger orthogonalization regularization coupled with a lower-dimensional projection, and using more augmentations per sample. These recommendations are supported by experiments showing that a lower-dimensional projection coupled with a higher orthogonalization constant leads to higher top-1 accuracy, and that using more augmentations per sample leads to higher downstream classifier accuracy and faster SSL convergence.

Strengths

Unfortunately the presentation in the paper was quite difficult to understand, which made it difficult to evaluate the paper's strong points (please see 'Weaknesses' and 'Questions' below). However, the following strengths still seem to be present.

  • There is a fairly strong coupling between the theoretical conclusions and the experimental results, making both more credible.
  • The use of a computationally simplified loss objective based on a sample estimate (Thm 3.1) is a helpful contribution that may help improve SSL training in general.
  • The experimental results are quite good in the context of the baselines, showing higher accuracy at the same data size and similar accuracy at reduced data size when the paper's recommendations are followed.
  • The authors make it clear where they are building on past results and where the novelty of their contribution lies.

Weaknesses

The paper is very difficult to read for several reasons:

  • Key terms are not defined inside the paper, making it necessary to reference multiple additional works in-depth just to get a basic understanding of this one. These include:
    • SimCLR
    • Barlow Twins
    • VicREG
    • the backward data augmentation covariance kernel $k^{DAB}$
    • Infinite Mercer Features and Mercer Theorem
  • Theorem 3.2 is only stated informally in the text, and is not restated formally in the appendix (though theorem B.2 seems like a restatement of the informal version of the theorem). Perhaps because of this, it is not clear at all from the statement of the theorem why SGD chooses redundant directions.
  • Typos around terminology: for instance, in Figure 2 it seems that $\lambda$ should actually be $\beta$. Also, the scale for $\lambda$ in Panel B seems very different from that in Panel E.
  • The experiments use the new, proposed loss objective (derived from Theorem 3.1), which should agree with the more computationally expensive ones in the limit. However, the discrepancy between using the two is not ablated, and no baselines are used that use the original loss formula, even where it might be computationally feasible (where $m$ is lower).

Questions

  • What is the impact of using the new, proposed loss objective, as opposed to using the traditional one on the accuracies of the downstream models?
  • Why does theorem 3.2 imply the loss of orthogonality when using SGD?
  • What are the specific data augmentations used for the experiments?

Limitations

The limitations were adequately addressed.

Author Response

Thank you for your thorough review and valuable suggestions for improving our paper. We appreciate the time and effort you've invested in providing this feedback. We acknowledge the shortcomings in our initial presentation, particularly the absence of a comprehensive overview of existing self-supervised learning (SSL) methods such as SimCLR, BarlowTwins, and VICReg. We are committed to rectifying these issues to ensure our paper is informative and comprehensible to a broad audience. Below, we address each of the reviewer's specific concerns in detail:

  1. Lack of Defined Key Terms: We apologize for the oversight in not defining some key terms within the text. To rectify this, we will incorporate a comprehensive preliminary section that outlines the foundational SSL methods referenced in our study. Additionally, we will append a glossary of formal terms, including $k_{DAB}$, in the Appendix and expand Section A to include a clear definition and intuitive explanation of Mercer features and Mercer's Theorem.

  2. Formal Restatement of Theorem 3.2: We appreciate the reviewer's attention to the formal restatement of Theorem 3.2. We acknowledge the confusion caused by not linking Theorem 3.2 to its formal counterpart in the Appendix. We will amend the main text to explicitly state that Theorem 3.2 is formally restated and proven as Theorem B.4 in the Appendix. Furthermore, we will explain the theorem's implications regarding the selection of redundant directions by SGD in linear networks.

Theorem B.4 (Formal): Let $\Gamma = V \Lambda V^T$ represent the eigendecomposition of $\Gamma$, and define $z$ as the projection of the weight vectors in $W$ onto the singular vectors of $\Gamma$, i.e. $z = WV$. Assuming small initialization (as in Simon et al. (2023)), i.e. $\| z_{pi}(0) \| \ll 1$ for all $p, i$, we can derive the following conclusions:

  1. $\mathrm{sign}\left(\frac{\Delta z_{pi}(t)}{z_{pi}(t)}\right) = \mathrm{sign}(\lambda_i)$
  2. For all $\lambda_i, \lambda_j > 0$, $\frac{z_{pi}(t)}{z_{pi}(0)} = \left(\frac{z_{pj}(t)}{z_{pj}(0)}\right)^{\lambda_i / \lambda_j}$, where $\lambda_i$ denotes the $i^{th}$ singular value, i.e. the $i^{th}$ element of the diagonal matrix $\Lambda$.

As noted, the proof of the above theorem is presented in Appendix B (Pg 16-17). The insights derived from this theorem are as follows:

  1. If $\lambda_i = 0$, then $z_{pi}(t) = z_{pi}(0)$; and if $\lambda_i < 0$, then $\lim_{t \to \infty} z_{pi}(t) = 0$.
  2. The alignment between a weight vector, say $W_p$, and a singular vector of $\Gamma$, say $V_i$, i.e. $z_{pi}(t)$, depends on the corresponding singular value $\lambda_i$ and on $z_{pi}(0)$.

Theorem B.4 presents the implicit bias of gradient descent as an optimizer. We show that those eigenfunctions of $\Gamma$ that are aligned with the feature space of the network at initialization and correspond to high singular values are more likely to be learned. Therefore, under weak orthogonalization constraints, the learned representation space will over-represent the strong singular vectors, thereby losing orthogonality. We will further discuss these insights and implications in Appendix B (Pg 18).
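To make this implicit bias concrete, here is a small numerical sketch (an illustration only, not the authors' code) that assumes the linearized early-phase gradient-flow dynamics $\dot z_{pi} = \lambda_i z_{pi}$ implied by Theorem B.4 at small initialization; it shows that directions with larger positive singular values grow fastest, which is the over-representation effect described above.

```python
import numpy as np

# Toy sketch of linearized early-phase dynamics (assumption: dz_pi/dt = lambda_i * z_pi).
rng = np.random.default_rng(0)
lambdas = np.array([2.0, 0.5, 0.0, -1.0])          # illustrative singular values of Gamma
z0 = rng.uniform(1e-3, 2e-3, size=lambdas.shape)   # small initialization, |z_pi(0)| << 1

for t in (0.0, 1.0, 3.0):
    z_t = z0 * np.exp(lambdas * t)                 # closed-form solution of the linear ODE
    print(f"t={t}: z(t)/z(0) =", np.round(z_t / z0, 4))

# Growth ratios obey the power-law coupling of Theorem B.4:
# z_i(t)/z_i(0) == (z_j(t)/z_j(0)) ** (lambda_i / lambda_j) for lambda_i, lambda_j > 0.
t = 3.0
r1, r2 = np.exp(lambdas[0] * t), np.exp(lambdas[1] * t)
assert np.isclose(r1, r2 ** (lambdas[0] / lambdas[1]))

# Takeaway: directions with large positive singular values dominate early learning,
# lambda_i = 0 directions stay at initialization, and lambda_i < 0 directions decay,
# so under a weak orthogonalization constraint multiple weight vectors can collapse
# onto the strongest directions and orthogonality is lost.
```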

  3. Typographical Errors: We thank the reviewer for identifying these errors and will diligently correct them to enhance the manuscript's clarity.

  4. Impact of the Proposed Loss Objective: The reviewer is correct that the new loss formulation should agree with the less computationally efficient older formulations in the limit of long pretraining. The performance gap between the new and original formulations, as identified by the reviewer, arises from the faster convergence of the efficient formulation under a finite training budget (100 epochs in most of our experiments). To verify this, we conducted extended experiments (400 epochs) confirming that both formulations converge to similar levels of accuracy. This finding is detailed in the rebuttal document and Appendix E, Figure 18. We will update the main text to emphasize this equivalence.

We would like to thank the reviewer again for their thoughtful suggestions and excellent questions. We will carefully consider your points as we revise the presentation of our results and make the proposed changes. Please let us know if you have any other feedback or concerns.

Best regards,
The Authors

Comment

Thank you to the authors for the explanations and for committing to the improvements to the presentation of the paper. Accordingly, I am increasing my score.

Comment

We sincerely appreciate your thoughtful reconsideration and the time you've taken to review our responses.

We appreciate your willingness to engage with our explanations and acknowledge the value of our work. We also want to express our gratitude for your constructive feedback throughout this process. We are committed to implementing these improvements to ensure our paper communicates its contributions as clearly and effectively as possible.

Again, thank you for your time, expertise, and fair consideration of our work.

Sincerely,
The Authors

Review (Rating: 6)

The paper conducts a theoretical study of the training dynamics of self-supervised learning (SSL) methods using an idealized loss function that resembles the Barlow Twins/VICReg loss. The idealized loss function is built on the idea of the forward data augmentation (DAF) graph kernel. The paper shows how this loss function may be recast in an equivalent form that is computationally friendly. The training dynamics of this loss are then used to build the hypothesis that representation learning can be improved with a stronger orthogonality constraint and smaller-dimensional projectors, and that multiple augmentations improve performance. The paper provides empirical evidence to support the claims made in the paper.

Strengths

  • The analytical study conducted in the paper on training dynamics is very interesting to this reader and should be of interest to the community
  • The idea of using an equivalent loss to DAF that is simpler to compute may also be of interest to the community as it has connections to other recently proposed SSL methods
  • The practical takeaways may be relevant to SSL practitioners

Weaknesses

  • A main weakness that I see is that I am unable to relate the faster convergence claim/proof to improved downstream accuracy. This may be a matter of presentation, or perhaps a gap in the reviewer's knowledge of how the alignment argument for the pretext task implies better downstream performance. Would the authors be able to clear this question up during the rebuttal?

  • The experimental setup considered in the paper is on the smaller-scale side, for understandable reasons. However, this may make the statements made in the paper less interesting to empirical scientists who work in SSL. Experiments using ImageNet-100 may help show that the claims hold in setups used in practice. However, I do not see any experiments with ImageNet-100 in Section 4.1 (low-dimensional projection).

Questions

I look forward to getting clarifications during rebuttal for the questions above.

Limitations

The authors have clearly listed their limitations. In addition, the empirical setup may not be the one used by empirical scientists in SSL so making a note may be helpful (but not necessary)

Author Response

We thank the reviewer for their thorough and insightful comments. We are pleased that the reviewer found our analysis of the SSL loss formulation and the training dynamics exciting and relevant to the community. Below, we address the questions raised by the reviewer with enhanced clarity and additional insights:

  1. Understanding Downstream Performance Claims: The reviewer's question is important and touches on the core of representation learning in SSL. Our framework focuses on accelerating convergence of the SSL loss itself, which inherently influences downstream performance. By optimizing the SSL loss more efficiently, we ensure that semantically similar images are mapped closely in the representation space, facilitating easier classification through methods like k-nearest neighbors or linear decoding (a minimal sketch of this readout protocol is given after this list). This is why reducing the SSL loss more rapidly leads to more rapid improvements in downstream accuracy. Our empirical results confirm this (e.g., see Figure 3). We will emphasize this critical link in the revised manuscript to ensure clarity and robustness in our claims.

  2. Results with Barlow Twins and VICReg on ImageNet-100: We acknowledge the oversight in not including results from ImageNet-100. In the rebuttal, we have added plots for ResNet-18 pretrained with BarlowTwins on ImageNet-100. These results reinforce our recommendation in Section 4.1 that smaller projector dimensions can yield effective representations even in larger-scale pretraining scenarios. This addition aims to fully address the reviewer's concern regarding the scalability and applicability of our findings.

  3. Connections to Maximum Manifold Capacity Representations (MMCR): We thank the reviewer for bringing the MMCR paper by Yerxa et al. to our attention, which is highly relevant. Yerxa et al. introduce a non-contrastive SSL objective that leverages multiple augmentations to estimate the manifold of object centroids. By maximizing the nuclear norm of the covariance matrix of these centroid representations, they effectively enhance the representation space's suitability for linear decoding of semantic categories. Although this approach is rooted in a statistical mechanical perspective, Schaeffer et al. [3] have convincingly linked it to the information-theoretic viewpoints prevalent in the SSL literature. Consequently, the MMCR loss formulation can be viewed as a functionally equivalent form to the loss formulations discussed in our work.
    While Yerxa et al. provide compelling empirical evidence, intuitive justifications, and posthoc analyses to underscore the benefits of multiple augmentations in improving semantic categorization, our work offers a complementary, theoretically grounded perspective. We elucidate how these augmentations not only contribute to a more efficient optimization process but also facilitate the early learning of superior features. This dual benefit—accelerated convergence and enhanced feature quality—underscores the efficacy of the MMCR framework in learning robust representations.

  4. Contrasting results in RankMe and LiDAR: This is an excellent point to clarify; thank you for highlighting the connection. Our focus differs from RankMe and LiDAR, which analyze the dimensionality of the learned representation space in SSL algorithms. Our study, on the other hand, investigates the design space configurations underlying these algorithms, specifically the number of units in the projector head and the orthogonalization constraint hyperparameter in the loss function. We believe there's no inherent contradiction between our findings and RankMe/LiDAR, and we complement the previous studies by providing a theoretical basis for how these design choices impact SSL representation learning.
    To elaborate on the relationship between the design configurations and the dimensionality of the features, Agrawal et al. [4] demonstrate that the same number of units in the projector can yield different representation space dimensionalities depending on the orthogonalization constraint. One can reduce the number of neurons in the projector head while maintaining high dimensionality in the representation space, as long as one does not reduce it beyond a threshold. Therefore, our results do not contradict the findings of RankMe, LiDAR, and $\alpha$-ReQ. Instead, our work presents a complementary, theoretically grounded explanation of how these design factors affect representation learning in SSL frameworks and under what conditions good representations can be learned using projectors with fewer units. However, in line with the reviewer's question, RankMe and LiDAR do indicate why one would not want to reduce the projector dimensionality down to extremely low values (e.g. 16 units). We will clarify this point in the updated manuscript.
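Referring back to point 1 above, the following is a minimal, illustrative linear-probe/k-NN readout sketch (not the authors' evaluation code); `train_feats` and `test_feats` are assumed to be feature arrays extracted from the frozen pretrained encoder, and the hyperparameters are placeholders.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

def evaluate_frozen_features(train_feats, train_labels, test_feats, test_labels):
    """Linear and k-NN readout on frozen SSL features (illustrative sketch only).

    train_feats / test_feats: (N, d) arrays produced by the frozen pretrained encoder.
    If semantically similar images are mapped close together, both readouts improve.
    """
    linear = LogisticRegression(max_iter=1000).fit(train_feats, train_labels)
    knn = KNeighborsClassifier(n_neighbors=20).fit(train_feats, train_labels)
    return {
        "linear_top1": linear.score(test_feats, test_labels),
        "knn_top1": knn.score(test_feats, test_labels),
    }
```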

We would like to thank the reviewer again for their thoughtful comments and excellent questions. Please let us know if you have any other feedback or concerns.

Best regards,
The Authors

[1] Garrido et al. On the duality between contrastive and non-contrastive self-supervised learning. ICLR, 2023.

[2] Zhai et al. Understanding augmentation-based self-supervised representation learning via rkhs approximation and regression. ICLR, 2024.

[3] Schaeffer et al. Towards an Improved Understanding and Utilization of Maximum Manifold Capacity Representations. arXiv, 2024.

[4] Agrawal et al. $\alpha$-ReQ: Assessing representation quality in self-supervised learning by measuring eigenspectrum decay. NeurIPS, 2022.

Comment

Understanding Downstream Performance Claims: The reviewer's question is important and touches on the core of representation learning in SSL. Our framework focuses on accelerating convergence of the SSL loss itself, which inherently influences downstream performance. By optimizing the SSL loss more efficiently, we ensure that semantically similar images are mapped closely in the representation space, facilitating easier classification through methods like k-nearest neighbors or linear decoding. This is why reducing the SSL loss more rapidly leads to more rapid improvements in downstream accuracy. Our empirical results confirm this (e.g. see Figure 3). We will emphasize this critical link in the revised manuscript to ensure clarity and robustness in our claims.

  • I thank the authors for the above response. Figure 3 does indeed show improved accuracy over time; however, I do not see the corresponding loss values during optimization.

  • In general, I understand the thinking above, but I do not believe a smaller SSL loss always implies good representations (and better downstream accuracy). Trivial representations can provide zero loss, but we want to avoid these situations. Empirically, recent SSL models (see I-JEPA, especially the official logs released by the authors: https://github.com/facebookresearch/ijepa) show non-monotonic loss behavior, which raises interesting questions about when an SSL user can conclude that training is done.

  • As an SSL user, I have been curious about the question being considered by the authors: what are SSL losses really optimizing, given that training losses are not always informative? In any case, a careful argument in the paper building on the rebuttal above would make the paper very useful to readers.

I request the authors to make all other updates that they have committed to during the rebuttal. I am raising my score as I am satisfied with the rebuttal to my questions, as well as after reading the other reviews and responses.

Comment

Thank you for your thoughtful feedback and for raising these important points. We appreciate your engagement with our work and willingness to reconsider your score based on our rebuttal.

  • You are correct that Figure 3 does not include the corresponding loss values during optimization. We acknowledge this oversight and will add this information to provide a complete picture of the training process.
  • We agree with your insightful observation that lower SSL loss does not always imply good representations or better downstream accuracy. Your point about trivial representations potentially providing zero loss while being undesirable is well-taken. The reference to I-JEPA and its non-monotonic loss behavior is particularly relevant; we thank you for bringing this to our attention. You raise an excellent question about when SSL users can conclude that training is complete, especially given this non-monotonic behavior. This is indeed a complex issue that warrants further investigation, and we will discuss this in the final version of the paper.
  • We appreciate your interest in what SSL losses are optimizing, given that training losses are not always informative. This is a crucial question in the field. As you suggested, we will expand our discussion in the paper to address this point more thoroughly, incorporating the arguments from our rebuttal.

Thank you for your valuable feedback and for helping us improve our work. We look forward to incorporating these changes and presenting a more comprehensive paper.

Sincerely,
The Authors

Review (Rating: 5)

This paper identifies the implicit bias of non-contrastive SSL loss and optimization, and proposes two ingredients to improve SSL learning: 1) low-dimensional projectors can yield good representations; 2) multiple augmentations improve kernel approximation. Further, the authors propose that in a low-data regime, using diverse and multiple augmentations can be as effective as acquiring more unique samples. All of these insights are validated in experiments with ResNet-50-based models.

Strengths

  1. The paper is well motivated theoretically, based on the formulation of NC-SSL, by casting the problem into the equivalent formulation of a data augmentation invariance loss plus an orthogonality constraint.

  2. The empirical evaluation fairly meets the expectations from the alternative formulation of NC-SSL, which validates the proposed insights.

Weaknesses

  1. I am not sure about the definition of weak/strong orthogonality constraints. How did the authors define what is a strong or weak constraint? In other words, what value of beta is regarded as a strong constraint?

  2. All the experiments are conducted on ResNet-50. Can authors do experiments on other architectures like Vision Transformers (ViTs) or other CNNs, in order to show the conclusions also transfer to other models?

  3. The insights are empirically validated and reasonable. However, these are things that a deep learning practitioner could expect or imagine, especially that data augmentation can help in the low-data regime and that multiple augmentations improve performance. These insights also hold in the supervised learning setting. In other words, there is no big "surprise" from the paper; every conclusion could be predicted even without knowledge of SSL.

  4. For the conclusion that low-dimensional projectors can yield good representations, the experiments are only conducted on CIFAR-10 and STL-10, where the dimension size 8192 is considerably larger than the usual setting and even 1024 can be considered a large projector. Can the authors do experiments on ImageNet-100? I am also curious to see results for step-wise increasing dimensions (256, 512, 1024, 2048, 4096) instead of the big jumps in Table 1.

Questions

Please refer to the questions.

Limitations

The paper is basically an investigation into the insights revealed by the alternative formulation of NC-SSL loss. The insights get empirically validated and the overall story makes sense.

However, I do not see any "surprise" from the paper beyond some predictable conclusions that generally hold in deep learning. I am also not really convinced by the results in Table 1, which has big jumps in dimension sizes and is only validated on CIFAR-10/STL-10. These issues prevent me from giving a higher score.

Author Response

We would like to thank the reviewer for their thorough and insightful comments. We are also glad the reviewer found our proposal theoretically motivated and the empirical validation satisfactory. Below, we address the questions raised by the reviewer:

  1. Definitions of weak/strong orthogonality constraints: The reviewer is correct in pointing out that we did not provide explicit definitions in the original manuscript as to what constitutes "weak" vs. "strong" orthogonality constraints. In the extremes, the weakest possible orthogonality constraint is $\beta = 0$, and the higher the value of $\beta$, the stronger it is. Thus, we generally use these terms to describe the relative values of the hyperparameter $\beta$, i.e., to indicate differences between models, but we do not have a specific threshold that universally defines "weak" vs. "strong". Based on reference [31], we generally consider $\beta = 1$ a "strong" constraint. We will clarify this usage in the updated text and make it clear that there is no explicit threshold (a short loss sketch showing where $\beta$ enters is given after this list).

  2. Experiments on other architectures like Vision Transformers (ViTs) or other CNNs: We thank the reviewer for this suggestion. In this work, we train Resnet-50 and Resnet-18 (Imagenet experiments) following previous literature in non-contrastive SSL. To our knowledge, self-supervised ViTs are generally trained using either the MAE or self-distillation objectives (JEPA-style). While the insight regarding projector dimensionality and orthogonalization constraint is not directly transferable to such settings, the loss formulation involving multiple augmentations can improve convergence in self-distillation settings. We can, however, test on other CNN architectures if the reviewer feels this would strengthen the empirical results. We will do this over the discussion period, but we are confident we will see similar results for other architectures.

  3. Core contributions of our work: We are glad that the reviewer found our work to be theoretically well motivated with sufficient empirical evaluation. The reviewer's point is well taken that it is not surprising that additional augmentations can help pretraining. However, we would like to highlight that while existing works and heuristics help improve generalization performance, they still require long pretraining, usually 800-1000 epochs for Imagenet. Instead, our work focuses on improving convergence speed while maintaining performance. Moreover, we provide theoretical justification for how our recommendations impact SSL pretraining, and such theoretical grounding helps move machine learning towards more principled practice.

  4. Low-dimensional projector experiments beyond CIFAR-10 and STL-10: We apologize for the oversight of not presenting results on Imagenet-100 in Table 1. In the rebuttal document, we have presented the results for Resnet-18 pretrained using BarlowTwins on Imagenet-100. We will also modify Table 1 to include all intermediate dimension sizes (see the rebuttal document for the updated Table 1 to be included in the revised manuscript). We hope these fleshed-out results will be more convincing for the reviewer.
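Referring back to point 1 above, the following minimal Barlow Twins-style sketch (an illustration with assumed variable names, not the authors' implementation) shows where the orthogonalization weight $\beta$ enters: it scales the off-diagonal, redundancy-reduction term of the cross-correlation matrix, so $\beta = 0$ removes the orthogonalization pressure entirely and larger $\beta$ enforces it more strongly.

```python
import torch

def nc_ssl_loss(z1: torch.Tensor, z2: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """Barlow Twins-style loss with an explicit orthogonalization weight beta (sketch).

    z1, z2: (batch, p_dim) projector outputs for two augmented views of the same batch.
    beta = 0 is the weakest possible constraint; larger beta is "stronger".
    """
    # Standardize each feature dimension across the batch.
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)

    c = (z1.T @ z2) / z1.shape[0]          # cross-correlation matrix, (p_dim, p_dim)
    on_diag = torch.diagonal(c)
    invariance = (on_diag - 1).pow(2).sum()              # pull diagonal toward 1
    redundancy = (c - torch.diag(on_diag)).pow(2).sum()  # push off-diagonal toward 0
    return invariance + beta * redundancy
```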

We thank the reviewer again for their thoughtful comments and excellent questions. We will carefully consider your points as we revise and extend this work. Please let us know if you have any other feedback or concerns.

Best regards,

The Authors

Comment

Thanks for the response!

My concerns have been addressed. Given that my original score is positive, I will keep it, as I do not see this paper presenting an important contribution to the SSL community (the main theory of the paper is based on past results, and the paper conducts an empirical validation).

Comment

We sincerely appreciate your thorough review and engagement during the rebuttal phase. While we are pleased that our responses address most of your concerns, we would respectfully like to explain how our paper makes an important contribution to the SSL community. While the core ideas explored may sound intuitive, we would argue that their technical implementations and implications offer significant value to the field. We want to highlight two key technical insights that our work proposes:

  1. Implications for efficient SSL pretraining in resource-constrained environments by leveraging parameter-efficient projector head design: Our work provides a theoretically grounded recommendation for substantially reducing the parameter count of the projector head used in non-contrastive SSL pipelines while maintaining downstream task performance. This is useful in resource-constrained environments, where these recommendations can help the user determine the optimal training regime in a practical setting, especially given many epochs (e.g., 1000 for Imagenet pretraining).
    For instance, consider the architecture of typical non-contrastive learning algorithms like Barlow Twins. The original paper pairs a ResNet-50 backbone (~23M parameters) with a 3-layer MLP projector with 8192-dimensional outputs per layer. This projector alone accounts for ~83.9M parameters (2048*8192 + 2*8192^2). In contrast, our theoretically grounded recommendation would suggest using a significantly lower number of units in the projector network, e.g., 512. This reduces the parameter count to ~1.3M (2048*512 + 2*512^2), a reduction of over 98% in projector size while maintaining competitive performance. Below, we present a version of Table 1 from the rebuttal document including the parameter count for the projector network.
| p_dim | Projector params (approx.) | Barlow Twins (base β) | Barlow Twins (optimal β*) | VICReg (base β) | VICReg (optimal β*) |
|---|---|---|---|---|---|
| 64 | 135k | 73.6 ± 0.9 | 82.1 ± 0.2 | 68.9 ± 0.2 | 81.9 ± 0.1 |
| 128 | 278k | 74.7 ± 1.4 | 83.0 ± 1.1 | 70.6 ± 0.3 | 82.3 ± 0.4 |
| 256 | 589k | 75.9 ± 0.7 | 83.4 ± 0.4 | 75.3 ± 0.2 | 81.9 ± 0.3 |
| 512 | 1.3M | 79.2 ± 0.8 | 82.8 ± 0.5 | 79.3 ± 0.4 | 82.1 ± 0.6 |
| 1024 | 3.1M | 81.3 ± 1.0 | 82.9 ± 0.3 | 79.2 ± 0.9 | 82.5 ± 0.9 |
| 2048 | 8.3M | 81.0 ± 0.9 | 82.3 ± 0.5 | 80.6 ± 0.0 | 81.9 ± 1.2 |
| 4096 | 25.2M | 82.3 ± 0.4 | 82.3 ± 0.4 | 80.5 ± 0.3 | 81.0 ± 0.4 |
| 8192 | 83.9M | 82.2 ± 0.4 | 82.2 ± 0.4 | 80.4 ± 1.5 | 80.4 ± 1.5 |
  2. Leveraging multiple augmentations for better feature learning within the same compute budget: Our work demonstrates the effect of multiple augmentations in improving SSL pretraining convergence speed, i.e., using multiple augmentations during pretraining yields better features earlier in training. Leveraging this insight, we demonstrate that it is possible to get better features within the same training budget by using more augmentations per sample instead of more unique samples in the dataset. We present an improved Pareto frontier (Fig. 5) for the tradeoff between SSL pretraining runtime and classification error on the downstream task. Specifically, we show that at low compute budgets (low runtime), using multiple, diverse augmentations is better than using more unique samples (a minimal multi-view loss sketch is given after this list).
    Notably, this empirical result contradicts the intuition-based claims of previous works (see [1], Section 6.2), which advocate increasing the number of unique samples instead of augmentations per sample during SSL pretraining. This finding is particularly crucial in domains where acquiring additional unique samples is costly or impractical. Our empirical results demonstrate that this approach can significantly improve model performance in such scenarios without additional raw data collection.
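Referring back to point 2 above, the following sketch shows one simple way to use m > 2 augmentations per sample: average a two-view non-contrastive loss over all pairs of views. The `pair_loss`, `model`, and `augment` names are placeholders, and the authors' exact multi-augmentation estimator may differ; this is only an assumed generalization of the two-view objective.

```python
import itertools
from typing import Callable, List
import torch

def multi_view_loss(
    views: List[torch.Tensor],
    pair_loss: Callable[[torch.Tensor, torch.Tensor], torch.Tensor],
) -> torch.Tensor:
    """Average a two-view SSL objective over all pairs of m augmented views (sketch).

    views: projector outputs for m augmentations of the same batch, each (batch, p_dim).
    pair_loss: any two-view non-contrastive loss (e.g. a Barlow Twins / VICReg-style loss).
    With m = 2 this reduces to the usual two-view objective.
    """
    pairs = list(itertools.combinations(range(len(views)), 2))
    total = sum(pair_loss(views[i], views[j]) for i, j in pairs)
    return total / len(pairs)

# Usage sketch with hypothetical `model` and `augment` callables:
#   views = [model(augment(images)) for _ in range(m)]
#   loss = multi_view_loss(views, pair_loss=my_two_view_loss)
```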

Thank you again for your valuable feedback! We would appreciate your input for specific suggestions for experiments or analyses that further illuminate our contributions. We are eager to incorporate your expertise to enhance the impact and clarity of our work.

Best regards,
The Authors

[1] V. Cabannes, B. Kiani, R. Balestriero, Y. LeCun, and A. Bietti. The SSL interplay: Augmentations, inductive bias, and generalization. ICML, 2023.

Review (Rating: 8)

This work introduces a new loss for self-supervised pretraining that is functionally equivalent to existing methods but is computationally more efficient. The method identifies a key equivalence between the features of existing SSL objectives (VICReg and BarlowTwins) when employing an augmentation kernel. Furthermore, the authors demonstrate that under their new loss formulation far fewer projector dimensions are needed, and in turn the efficiency is improved (time to convergence). Additionally, the authors show that multiple augmentations improve the kernel approximation, improving accuracy and convergence speed.

Strengths

⁃ The paper is very well written and presented, with all contributions clearly defined and given a rationale within the problem setting. Figures are simple yet informative, leading to a high-quality piece of work that is understandable to a wide audience.

⁃ The proposed methodology is novel and highly original, addressing key and ongoing concerns in the field of self-supervised learning. Moreover, the method is theoretically and empirically supported with extensive proof and experimentation, thus addressing key concerns with correctness and confidence.

⁃ The authors present a variety of recommendations that go beyond the simple proposition of a new methodology, thus contributing to the impact and significance of the work.

⁃ The empirical analysis is extensive and supports all claims made; the result is faster convergence with fewer projector dimensions.

Weaknesses

I have only two minor concerns/weaknesses:

⁃ You enforce an orthogonalization constraint on the projector, yet how is collinearity of the features avoided in practice for earlier representations, specifically after the projector is dropped? Is this even a problem?

⁃ Alternative measures of efficiency could be presented alongside convergence speed. I would expect, at the very least, measures of both iterations per second and memory consumption to be included if making claims of improved efficiency. Additionally, how much of the speedup can be attributed to each component, i.e. the reformulation vs. the orthogonality constraint vs. the reduced projector?

Questions

⁃ Does the proposed formulation lead to improved overall performance, by which I mean improved best-case downstream performance compared to the original formulations?

⁃ Does the number of layers or the dimensionality of the intermediate layers of the projector have a significant impact on the observed performance?

Limitations

The limitations are appropriately addressed

Author Response

Thank you for your thorough and insightful review of our work. We greatly appreciate your time and effort in assessing our paper and providing constructive feedback. We are delighted you found the paper well-written, clearly presented, and understandable to a broad audience. It's delightful to hear that you find our approach new, original, and well-supported, theoretically and empirically. Your acknowledgment of the range and importance of our suggestions is very encouraging.

  1. Avoiding collinearity of the encoder features: You raise an intriguing point about avoiding feature collinearity after dropping the projector. In practice, enforcing orthogonality among the projector output features implicitly avoids collinearity in the feature space of the encoder. An intuitive explanation of this phenomenon is as follows. Since the projector output is a piecewise nonlinear function of the encoder output (owing to the fully connected layers with ReLU in the projector network), collinearity in the encoder output space would imply collinearity in the projector output space. Therefore, by enforcing an orthogonality constraint on the projector output space, we are able to avoid collapse or collinearity in the encoder feature space. An in-depth analysis of the learning dynamics is, however, an open research problem, with some recent progress being made in simplified settings [1].

  2. Metrics of efficiency: The suggestion to include additional efficiency measures such as iterations per second and memory consumption is well-taken. We will incorporate these metrics in a revision to provide a more comprehensive picture of the efficiency improvements. Disentangling the contributions of each component (reformulation, orthogonality constraint, reduced projector) to the overall speedup is also a valuable direction for further analysis.

  3. Other questions:

  • Regarding downstream performance, our proposed formulation achieves comparable best-case results to the original formulations while being more efficient. We believe there is potential for further performance gains with additional tuning and optimization.
  • The depth and width of the projector do impact the observed performance. We found that relatively shallow projectors (1-2 layers) with moderate width (e.g., 128-512 dimensions) worked well, but the optimal configuration likely depends on the specific dataset and backbone architecture. More systematic ablations on projector design would be informative.

Thank you again for your excellent suggestions and questions. We will carefully consider your points as we revise and extend this work. Please let us know if you have any other feedback or concerns.

Best regards,

The Authors

[1] Y. Xue, E. Gan, J. Ni, S. Joshi, and B. Mirzasoleiman, Investigating the Benefits of Projection Head for Representation Learning. ICLR 2024.

Comment

Many thanks for answering the questions raised during the rebuttal and addressing the identified weaknesses. I appreciate the clarification of my misunderstanding of some points. I would emphasise the inclusion of efficiency metrics and clearer comparative baselines in any revised manuscript.

Given that my original rating is positive, I am maintaining my score for now.

Comment

Thank you for your thoughtful feedback and for taking the time to review our responses to the questions raised during the rebuttal phase. We greatly appreciate your careful consideration of our work and the clarifications we provided.

We're pleased that our responses have helped address your concerns and clear any misunderstandings. Your input has been invaluable in helping us identify areas for improvement in the presentation and content of our manuscript. The importance of including efficiency metrics and clearer comparative baselines is well taken. We fully agree that these elements are crucial for providing a comprehensive evaluation of our work, and we are committed to incorporating them.

Thank you once again for your insightful comments and continued support of our research.

Sincerely,
The Authors

Author Response

We would like to thank all the reviewers for their thorough and insightful comments and suggestions. Below, we summarize the major points addressed in individual responses to the reviewers' comments.

  1. Inclusion of Results for ImageNet-100: We have added results for ResNet-18 pretrained using Barlow Twins on ImageNet-100 (please see attached rebuttal pdf). This addresses the reviewers’ request for additional empirical evidence for the utility of low-dimensional projectors. We are currently running experiments for Imagenet-1k, which we will add in the final version of the paper.

  2. Extended version of Table 1: Reviewer rDTK pointed out that there are major jumps in the projector dimensionality values presented in Table 1. Therefore, we present an extended version of Table 1 which indicates the performance for projector dimensionality of 64, 128, 256, 512, 1024, 2048, 4096 and 8192 (please see attached rebuttal pdf). We will update the Table in the main text for the final version of our paper.

  3. Clarification on Lower Dimensional Projections: We clarified the apparent discrepancy between our results and empirical findings from RankMe and LiDAR, as raised by reviewer RyLx. Specifically, we study the effect of design space characteristics of the SSL pipeline on the learning dynamics, whereas the aforementioned works characterize the dimensionality of the learned representation space of pretrained networks. We highlighted that the properties of the learned representation space and downstream performance depend on the interplay between projector dimensionality and the orthogonalization constraint $\beta$. Our explanation demonstrates that our results are complementary to existing works like RankMe and LiDAR, providing a deeper understanding of representation learning in SSL frameworks.

  4. Explanation of Downstream Performance Claims: We clarified the relationship between faster convergence on the SSL pretraining loss and downstream performance. We detailed how the formal version of the SSL desiderata (invariance to semantic information-preserving augmentations) influences good feature learning. By explaining that a feature space with the desired properties facilitates easier kNN or linear readout, we highlighted how this leads to better downstream task performance. We believe that our work now more clearly shows the theoretical grounding of the framework's claims about improved convergence and representation learning.

  5. Acknowledgment of Typos and Miscellaneous Comments: We apologize for the confusion in the notation of the orthogonalization constraint $\beta$. We have used the updated notation in the results presented in the attached rebuttal pdf, and will update the figures in the final version of our paper. We have also addressed other presentation issues pointed out by reviewers, including missing definitions and formal versions of theorems, which will be updated in the main text of our paper. We hope that these corrections will improve the clarity and readability of the paper, making our work more accessible to a wider audience.

Best regards,
The Authors

Final Decision

The paper introduces a novel, computationally efficient loss function for self-supervised learning (SSL) that outperforms existing methods. By requiring fewer projector dimensions, the new loss function accelerates convergence, making it highly valuable for large-scale SSL applications. The authors provide novel theoretical insights by identifying a key equivalence between SSL objectives when using an augmentation kernel, deepening the understanding of SSL methodologies. These insights lead to practical recommendations, such as employing lower-dimensional projectors and multiple augmentations, which have been empirically validated to improve both accuracy and convergence speed. The authors made a great effort in the clarity of presentation making the work accessible to a broad audience. This combination of novel theoretical insights, practical recommendations, and robust empirical evidence positions the paper as a significant contribution to the field of SSL.