One Leaf Reveals the Season: Occlusion-Based Contrastive Learning with Semantic-Aware Views for Efficient Visual Representation
Abstract
We propose Occluded Image Contrastive Learning (OCL), a scalable method that contrasts masked image patches to learn semantic concepts efficiently without hand-crafted augmentations.
Reviews & Discussion
The work proposes occlusion-based contrastive learning built on a masked image modeling approach. It compares against iBOT, MAE, I-JEPA, and other relevant SSL methods, and achieves competitive results with less training time.
Questions for Authors
How much time does the occlusion step take in this training setup compared to training on non-occluded images?
Claims and Evidence
Clearly stated and confirmed by the results.
One of the claims is the use of occlusion in MIM training, which is designed for the MIM framework rather than relying on manual augmentation selection. OCL is able to extract better high-level concepts.
Methods and Evaluation Criteria
Correct and appropriate.
Methods based on contrastive learning and MIM are commonly used; however, using occlusion for training is novel.
Standard benchmarks such as ImageNet and architectures such as ViT-L/16 are used. Additionally, tasks such as linear probing and fine-tuning are reported, and further benchmarks on COCO and ADE20K are included.
Theoretical Claims
NA
Experimental Design and Analysis
Follows good practices and other works.
Common benchmarks such as ImageNet, ADE20K, and COCO are used, together with tasks such as fine-tuning and linear probing. Additional ablations on the time needed to train are provided and look promising. Generalization capabilities are also tested on different variants of ImageNet.
Supplementary Material
No.
Relation to Prior Literature
Provided extensively, including SimCLR, iBOT, I-JEPA, and MIM methods.
Missing Important References
Related work is discussed properly.
Other Strengths and Weaknesses
This is an interesting and well-written work, with a clear presentation of the claims and a methodology that is easy to follow.
Other Comments or Suggestions
Please add a discussion of how much time is needed for occlusion in this setup. Not the whole pre-training, but I would like to see what fraction of pre-training time is spent on occlusion compared to standard augmentations.
Thank you very much for your sincere review, especially for summarizing the strengths of our work, including 1) an interesting and well-written work, 2) a clear presentation of the claims that is confirmed by the results, 3) a novel occlusion-based training method that is easy to follow, and 4) solid experiments on common benchmarks with promising additional ablations and generalization validations. In response to your concerns, we have provided a detailed explanation below and revised our manuscript accordingly.
-
(Comment & Question) occlusion time. Thanks so much for your sincere and constructive suggestions. Different from the standard augmentations used in traditional contrastive pre-training, we implement the occlusion operations with torch modules under CUDA acceleration. Quantitatively, our whole model requires 12.03 G floating-point operations (FLOPs), while the occlusion operation accounts for only 0.12 G FLOPs, i.e., about 1%. This confirms that the occlusion operations consume very little computation and runtime while bringing substantial efficiency improvements.
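For illustration, below is a simplified sketch (not our exact implementation) of how such GPU-side occlusion can be realized with plain torch operations; the whole step amounts to a random sort and a gather, which is negligible next to the ViT blocks.

```python
import torch

def random_occlusion(patch_tokens: torch.Tensor, mask_ratio: float = 0.4):
    """Randomly drop a fraction of patch tokens directly on the GPU.

    patch_tokens: (B, N, D) patch embeddings already on the GPU.
    Returns the kept tokens and the indices of the kept patches.
    """
    B, N, D = patch_tokens.shape
    num_keep = int(N * (1.0 - mask_ratio))
    # A random score per patch plus an argsort gives a random permutation;
    # keeping the first num_keep indices implements random occlusion.
    scores = torch.rand(B, N, device=patch_tokens.device)
    keep_idx = scores.argsort(dim=1)[:, :num_keep]                 # (B, num_keep)
    kept = torch.gather(patch_tokens, 1,
                        keep_idx.unsqueeze(-1).expand(-1, -1, D))  # (B, num_keep, D)
    return kept, keep_idx
```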
Besides, to better demonstrate the efficiency of our model, we construct the table below following MixMAE to compare with previous methods on ImageNet-1K classification with the ViT-B model. All methods are evaluated by Aug., epochs (Ep.), FLOPs (G), parameters (Param., M), linear probing (LIN), and fine-tuning (FT). The image resolution is fixed to 224×224. Aug. indicates whether handcrafted view data augmentation is used during pre-training. FLOPs (G) reflects the runtime and computational resources of pre-training. Param. (M) is calculated for the encoder of the pre-training model, following MixMAE; some baseline results are copied from MixMAE. Top-1 accuracy (Acc) is used as the metric.
| Method | Aug. | Ep. | FLOPs (G) | Param. (M) | LIN | FT |
|--|--|--|--|--|--|--|
| Masked Image Modeling | | | | | | |
| BEiT | w/o | 800 | 17.6 | 87 | - | 83.2 |
| MAE | w/o | 1,600 | 17.5 | 86 | 68.0 | 83.6 |
| CAE | w/o | 1,600 | 17.5 | 86 | 70.4 | 83.9 |
| I-JEPA | w/o | 600 | 17.5 | 86 | 72.9 | - |
| Contrastive Learning | | | | | | |
| DINO | w/ | 1,600 | 74.7 | 171 | 78.2 | 82.8 |
| MoCo v3 | w/ | 600 | 74.7 | 171 | 76.7 | 83.2 |
| Masked Image Modeling with Contrastive Learning | | | | | | |
| SiameseIM | w/ | 1,600 | 16.3 | 88 | 78.0 | 84.1 |
| ccMIM | w/o | 800 | 39.1 | 86 | 68.9 | 84.2 |
| ConMIM | w/ | 800 | 17.5 | 86 | - | 85.3 |
| MixMAE | w/o | 600 | 15.6 | 88 | 61.2 | 84.6 |
| iBOT | w/ | 1,600 | 17.5 | 86 | 79.5 | - |
| OCL | w/o | 800 | 12.0 | 86 | 74.2 | 83.4 |

From the table, our method requires only 12.0 G FLOPs, lower than the second-best method (MixMAE at 15.6 G), reducing computational costs by approximately 23%. This verifies that our occlusion operation brings a considerable efficiency gain. We have revised our paper and added a more detailed description of the occlusion time. Thanks for your sincere suggestion.
Thank you once more for generously dedicating your time to provide a thoughtful review. Your feedback is tremendously valuable, and we are open to hearing from you at any time. If you find our response satisfactory, we would greatly appreciate your assistance in improving our rating score.
The paper introduces Occluded Image Contrastive Learning (OCL), a novel self-supervised learning (SSL) paradigm for efficient visual representation. OCL combines the strengths of Masked Image Modeling (MIM) and Contrastive Learning (CL) by using random masking to create diverse views within an image and contrasting them within a mini-batch. The key innovation lies in generating fine-grained semantic differences through masking, which reduces conceptual redundancy and avoids the need for hand-crafted data augmentations or auxiliary modules. The authors demonstrate that OCL is highly scalable, achieving competitive results on downstream tasks like ImageNet classification, object detection, and segmentation, while significantly reducing pre-training time and computational resources.
Questions for Authors
No.
Claims and Evidence
Yes.
Methods and Evaluation Criteria
The proposed method is well-motivated.
Theoretical Claims
No theoretical claims.
Experimental Design and Analysis
I believe the experiments are not convincing enough.
Supplementary Material
No Supplementary Material.
Relation to Prior Literature
The paper builds on existing work in Masked Image Modeling (MIM) (e.g., MAE, BEiT) and Contrastive Learning (CL) (e.g., SimCLR, MoCo v3), but distinguishes itself by combining the strengths of both paradigms without relying on hand-crafted augmentations or auxiliary modules.
Missing Important References
This paper has considered most of the relevant works as far as I know.
Other Strengths and Weaknesses
Strength:
-
The paper is well-written and easy to follow.
-
The ablation studies are comprehensive.
Weaknesses:
-
In line 134, only a low masking ratio is adopted, and the contrastive learning requires two branches, which may incur much higher computational costs. However, no effective training epochs or running time are provided in Table 4 for a fair comparison.
-
Although more computational cost is consumed, both the fine-tuning and linear-probing results are only comparable to previous SOTA contrastive learning and MIM methods.
-
The detection and segmentation results on the COCO and ADE20K datasets are also lower than those of previous SOTA methods.
Other Comments or Suggestions
It would be better to provide the training wall-clock time or computational cost of the proposed method.
Thank you very much for your review, especially for summarizing the strengths of our work, including 1) a well-motivated method, 2) a well-written paper that is easy to follow, and 3) comprehensive ablation studies. In response to your concerns, we have provided a detailed explanation below and revised our manuscript accordingly.
-
(Weakness 1) computational costs. Thanks very much for your detailed review. Traditional contrastive paradigms such as MoCo v3 and DINO use student-teacher dual networks as two branches to process distinct views, leading to almost double the computation cost for one image. In contrast, we do not use two networks as two branches; instead, we use two different parts of one image as the two branches. Thus, the two branches of contrastive learning in our method do not introduce additional computational cost. Combined with the masking strategy, we further reduce the amount of computation per image. Moreover, our model does not depend on an additional transformer decoder to reconstruct the image, which leads to less computation. In summary, these design choices significantly reduce the computational costs of the pre-training paradigm and improve efficiency.
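To make this concrete, here is a simplified sketch (with a hypothetical encoder argument, not our actual MiCLAutoencoderViT code) of how two nonoverlapping views from one image can be processed by a single shared encoder:

```python
import torch
import torch.nn as nn

def two_views_one_encoder(patch_tokens: torch.Tensor, encoder: nn.Module,
                          mask_ratio: float = 0.4):
    """Form two nonoverlapping views from the visible patches of each image and
    encode both with the SAME encoder (no teacher/student network duplication).

    patch_tokens: (B, N, D) patch embeddings of the full images.
    encoder: any module mapping (B, L, D) token sequences to features.
    """
    B, N, D = patch_tokens.shape
    num_visible = int(N * (1.0 - mask_ratio))
    per_view = num_visible // 2

    # One random permutation per image; two disjoint slices of the visible
    # indices form the two views.
    perm = torch.rand(B, N, device=patch_tokens.device).argsort(dim=1)
    idx_a, idx_b = perm[:, :per_view], perm[:, per_view:2 * per_view]

    def take(idx):
        return torch.gather(patch_tokens, 1, idx.unsqueeze(-1).expand(-1, -1, D))

    # Both "branches" share the same weights, and together they cover fewer
    # tokens than the full image, so no extra network cost is introduced.
    z_a = encoder(take(idx_a))
    z_b = encoder(take(idx_b))
    return z_a, z_b
```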
-
To better demonstrate the efficiency of our model, we construct the table following MixMAE.
https://s2.loli.net/2025/03/31/RfKjqYMtAxoVQHC.png (Same Table as in Reviewer 2yVR, sorry for limited chars.)
From the table, our method requires only 12.0 G FLOPs, lower than the second-best method (MixMAE at 15.6 G), reducing computational costs by approximately 23%.
-
Moreover, we have presented Figure 3 in our manuscript to illustrate the efficiency and scaling ability of our model, and discussed the computational costs in line 248. OCL is highly scalable compared to previous methods, requiring fewer computational resources while achieving comparable and competitive results, without relying on handcrafted data augmentations. https://s2.loli.net/2025/03/31/VqHwXUrMgpcGTt1.png
-
Furthermore, we have also conducted ablation experiments on the mask ratio and discussed them in Section 3.2.1 (Masked Ratio) of our manuscript. The results demonstrate that a lower masking ratio (0.4) optimally balances computational efficiency and visual representation quality. https://s2.loli.net/2025/03/31/VfcqPNk7SXuoIjO.jpg
-
Besides, we also provide the runtime with FLOPs and parameters for Table 4 to better illustrate the ablation of the MLP head from the efficiency perspective.

| MLP Head | FLOPs (G) | Param. (M) | Pre-training Hours | Eff. Bsz. | LIN | FT |
|--|--|--|--|--|--|--|
| w/o | 42.3 | 303 | 559 | 1,800 | 77.4 | 85.7 |
| 2-layer | 43.6 | 305 | 600 | 1,800 | 76.7 | 85.6 |
| 3-layer | 43.7 | 306 | 611 | 1,800 | 76.6 | 85.6 |
We have added a discussion of the computational costs of the two branches in our contrastive paradigm, added a table of pre-training details for the ViT-B model to show the runtime and effective pre-training epochs, and revised Table 4 to include FLOPs and parameters. Thanks for your constructive suggestions.
-
(Weakness 2) fine-tuning and linear-probing results. Many thanks for your sincere review. We apologize for any misunderstanding of the theme of this paper. Our goal is to provide a new pre-training paradigm for efficient visual representation with affordable training time and reasonable computational cost. In fact, our method reduces computational costs while maintaining performance. From the FLOPs comparison in the table above, our 12.0 G FLOPs are significantly lower than those of the second-best method, MixMAE (15.6 G), improving efficiency by about 23%. With such a significant improvement in efficiency, our method achieves fine-tuning and linear-probing results that are comparable and competitive with previous SOTA contrastive learning and MIM methods. We have added a more detailed discussion of the fine-tuning and linear-probing results to clarify this point. Thanks for your sincere review.
-
(Weakness 3) detection and segmentation results. Thanks for your comments. This concern is quite similar to Weakness 2. We provide a variety of downstream tasks to show that our pre-trained model has generalization ability competitive with previous SOTA methods. However, our core contribution lies in proposing a simple pre-training paradigm that balances efficiency and performance, with no bells and whistles. We hope it helps train larger vision models and replicate the success of LLMs. We have also added a discussion of the relationship between downstream tasks and our theme. Thanks for your sincere review.
We appreciate your time and thorough review. Your feedback is highly valuable to us, and we welcome further communication from you. If you are satisfied with our response, we would be grateful for your support in enhancing our rating score.
This work proposes a contrastive self-supervised training method that relies on masking. After a global masking step, two nonoverlapping sets of patches are selected to create two views which are then aligned to each other and contrasted to all views of other samples in the minibatch. Noticeably, no contrastive MLP-head is used. Despite its simplicity regarding architecture and augmentations, the proposed OCL method performs reasonably well both in fine-tuning and linear probing evaluations. Additionally, the authors report a significant decrease in training duration compared to multiple other efficient methods, e.g. MAE, I-JEPA. The method applies a loss (T-distributed spherical metric) that introduces an additional concentration parameter.
Update after rebuttal: the authors have addressed some of my concerns and answered the questions. Therefore, I will keep the positive score.
Questions for Authors
Why do you apply this global mask, and why is it applied to all samples in a mini-batch? Is this for efficiency reasons, or does it benefit learning?
Looking at the masking ratios and the number of epochs, I cannot spot the source of the speedup compared to MAE. Compared to I-JEPA it is even harder to find the edge of OCL. You mention the simplicity of the architecture and the absence of an MLP head; is an effective epoch (samples times views) of OCL that much faster to train than an epoch of MAE or I-JEPA?
There are studies about the concentration parameter, but no ablation of the loss itself. How are the loss and the ability to forgo MLP heads related?
Claims and Evidence
Claims are that OCL is a highly efficient SSL method that performs nearly on-par with SOTA methods. Evidence is provided in various benchmark evaluations.
Methods and Evaluation Criteria
Imagenet classification evaluation is the common way to evaluate self-supervised methods. Additionally, semantic segmentation and object detection performance is evaluated on ADE20k and COCO, respectively. Robustness is evaluated on a few ImageNet-like benchmarks (A,R and S). Overall, a solid set of evaluation experiments.
Ablation and sensitivity studies are mainly performed on ImageNet1k.
Theoretical Claims
No theoretical claims.
Experimental Design and Analysis
I checked the performance and ablation experiments and did not find any issues.
Supplementary Material
Yes. All three pages.
Relation to Prior Literature
OCL is another small step towards improved SSL training efficiency and less reliance on hand-crafted augmentations.
Missing Important References
None.
Other Strengths and Weaknesses
Strengths:
- Efficient and simple. The result without an MLP head is quite astounding.
- Only needs very basic augmentations, like MAE. Masking is performed randomly, which is a strength compared to methods that rely on specific masking strategies tailored to ImageNet, e.g. I-JEPA.
- Like other joint embedding methods, it is shown that the learned features already have a higher level of semantic abstraction than standard MIM methods.
Weaknesses:
- Results are good, but not groundbreaking. The main argument for OCL is its reported efficiency. The number of reported epochs varies between 800 (line 380) and 1,600 (Table 6). Unless the authors already take the two generated views into account and report effective epochs, this does not add up. Furthermore, there is not a single table that contains both runtime and training epochs.
- Given the similarity in effective masking ratio to MAE and the need to create two views, the source of the speedup is not easy to explain. See questions.
- Using the strategy from MoCo v3 to freeze the patch layer reduces the number of learnable parameters and has shown benefits with regard to overfitting. On the other hand, it suffers from the same problem as, e.g., max-pooling: it often works well but sometimes fails. No study of the effect of randomly initialized, untrainable patch creation is performed.
Other Comments or Suggestions
None.
We sincerely appreciate your thoughtful and detailed review. Your positive feedback serves as a great source of encouragement for our team, especially the recognition of the strengths: 1) an efficient and simple framework, 2) no reliance on specific masking strategies, and 3) a higher level of semantic abstraction than standard MIM methods. Furthermore, we have diligently responded to each of your questions and incorporated your feedback into our manuscript revisions:
-
(Weakness 1) efficiency regarding epochs and view generation. Thanks very much for the detailed and thoughtful review. We sincerely apologize for the typo in Table 6; we actually used 800 epochs for pre-training, which is consistent with Table 7, Table 8, and the description in line 380 of the paper. We have corrected this typo and double-checked the entire manuscript to make sure there are no further typos.
Besides, we apologize for the confusion regarding the efficiency of view generation. We do take into account the time it takes to generate the two views. As illustrated in Algorithm 1, the two views are generated inside the torch module MiCLAutoencoderViT (code in models_mae.py) with CUDA acceleration, rather than via CPU-side image pre-processing as in DINO and MoCo v3. Thus, the cost of view generation is already included in our efficiency measurements. We have clarified this point in our revised manuscript.
-
(Weakness 2) runtime and training epochs. Many thanks for your kind suggestions. To better demonstrate the efficiency of our model, we construct the table following MixMAE.
https://s2.loli.net/2025/03/31/RfKjqYMtAxoVQHC.png (Same Table as in Reviewer 2yVR, sorry for limited chars.)
From the table, our 12.0 G FLOPs are significantly lower than those of the second-best method, MixMAE (15.6 G), improving efficiency by about 23%. As for DINO and MoCo v3, they rely on student-teacher dual networks for pre-training, leading to higher FLOPs and parameter counts for the pre-training encoder. We have revised the paper and added this table to illustrate the efficiency of our method more clearly.
-
(Weakness 3 & Question 2) the sources of the speedup. We sincerely appreciate your kind and insightful suggestions. The acceleration of our method primarily stems from two key factors:
- First, the masking strategy significantly reduces the number of tokens processed by the ViT for a single image. With a ratio of 0.4, only a fraction of the tokens participates in pre-training, dramatically decreasing the computational load required by the ViT.
- Second, the contrastive framework eliminates the need for additional modules to reconstruct the image. Traditional MIM methods and most MIM-CL hybrid approaches employ a transformer decoder for image reconstruction, which incurs substantial computational overhead. In contrast, our OCL requires neither a decoder nor an MLP head, thereby reducing pre-training computations and enhancing efficiency.
As shown in the table in Answer 2, our method requires significantly less training time per epoch than MAE and I-JEPA. Specifically, OCL requires 12.0 G FLOPs, about 31% less than MAE's 17.5 G, by eliminating the decoder. Moreover, I-JEPA's higher FLOPs (29.8 G) result from its dual encoders (context and target) used to generate different views. We have revised the paper to provide a more detailed explanation of these acceleration mechanisms.
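For illustration, here is a minimal sketch of such a decoder-free, head-free training step. For clarity it uses a standard symmetric InfoNCE objective with cosine similarity and a hypothetical temperature parameter, whereas our actual objective uses the T-SP metric described in the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_step(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.2):
    """Minimal symmetric contrastive step on pooled view features.

    z_a, z_b: (B, D) pooled encoder outputs of the two masked views of each
    image, used directly -- no reconstruction decoder and no MLP head.
    """
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature              # (B, B) cross-view similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Matching views of the same image are positives; all other images in the
    # mini-batch serve as negatives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```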
-
(Weakness 4) frozen patch layer. Sincere thanks for your thoughtful and helpful suggestions. We did experience occasional training failures. Since we were more focused on the efficiency of the model's pre-training, we simply changed the random seed and retrained. We will continue to explore solutions to such training failures in future research.
-
(Question 1) global masking strategy. Thank you for your valuable suggestions. The global masking strategy is designed to enhance representation learning by addressing semantic redundancy in images. As highlighted by MAE and discussed in line 32 of our paper, images inherently carry redundant semantics. Global masking prunes unnecessary patches, forcing the model to abstract high-level semantic patterns. We have revised the paper to provide a more detailed explanation of the global masking strategy.
-
(Question 3) the relationship between MLP head and loss. Thanks for your valuable suggestions. Though distinct in mechanism, both the T-SP loss and the MLP head enhance visual representations. We have added the ablation experiments shown below and a related discussion to validate the relationship between the MLP head and the T-SP loss.
| T-SP Loss | MLP Head | LIN | FT |
|--|--|--|--|
| √ | √ | 77.0 | 85.5 |
| × | √ | 72.4 | 85.1 |
| √ | × | 77.9 | 85.8 |
| × | × | 61.3 | 82.6 |
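For readers curious about how a concentration parameter enters such a spherical similarity, below is an illustrative sketch of one well-known heavy-tailed form, the t-vMF similarity; it is only a stand-in, not necessarily the exact definition of our T-SP metric, which is given in the paper.

```python
import torch
import torch.nn.functional as F

def t_vmf_similarity(z_a: torch.Tensor, z_b: torch.Tensor, kappa: float = 16.0):
    """Heavy-tailed spherical similarity with a concentration parameter kappa
    (t-vMF form): kappa = 0 recovers plain cosine similarity, and larger kappa
    makes the similarity decay faster as features become less aligned.
    """
    cos = F.normalize(z_a, dim=-1) @ F.normalize(z_b, dim=-1).t()   # (B, B) cosines
    return (1.0 + cos) / (1.0 + kappa * (1.0 - cos)) - 1.0
```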
Thanks again for your valuable time and careful review. Your feedback is immensely valuable to us. If you find our response satisfactory, could you please consider helping us improve our rating score?
This paper presented a method for visual representation learning with occlusion-based contrastive learning. A new view selection method was proposed, and experiments on downstream tasks show the effectiveness and efficiency of the proposed method.
This paper received a consistent recommendation, with three Weak Accept ratings from the reviewers. The reviewers acknowledged the simplicity of the proposed idea and its efficiency, and raised concerns about experimental details, limited performance gains over previous methods, and other technical issues. The authors provided a rebuttal, which addressed the major concerns raised by the reviewers.
Overall, this paper presented an interesting idea, with contributions that could be of interest to the audience at ICML. The AC is happy to recommend it being accepted to present at the conference.