PaperHub

ICLR 2024 · Decision: Rejected
Average rating: 4.8 / 10 (4 reviewers; individual ratings 3, 3, 5, 8; min 3, max 8, std 2.0)
Average confidence: 4.5

Conditional MAE: An Empirical Study of Multiple Masking in Masked Autoencoder

OpenReview · PDF
Submitted: 2023-09-19 · Updated: 2024-02-11

Abstract

Keywords
masked autoencoder, multiple masking

Reviews and Discussion

Official Review
Rating: 3

The paper proposes a multiple masking strategy for MAE to enhance the local perception ability of Masked Image Modeling. In addition, the paper also summarizes several takeaways from the observation. Downstream experiments on classification, object detection and semantic segmentation show the effectiveness of the proposed method.

Strengths

The paper proposes an incremental masking strategy based on MAE, which may provide some experience for the community.

The experiments show the performance gains on multiple downstream tasks.

Weaknesses

Actually, the contribution is limited. The proposed method is more of an empirical technique. I do understand the authors have done lots of tuning experiments, however, most of them are not that significant, or even well-known for the community (e.g., the one-shot masking).

In addition, the three-shot masking is significantly worse than the baseline as shown in Table 5, further implying its limitation.

Questions

Please see the weaknesses.

Typos: fine-grind --> fine-grained

Comment

Firstly, we sincerely thank the reviewer for acknowledging the effort we have made to conduct extensive experiments.

However, we do not agree with the reviewer's comment that the contribution is limited. In this paper, we are the first to present an in-depth analysis of masking, which is important for the masked autoencoder. Our study gives a comprehensive empirical analysis and sheds light on how multiple masking affects the optimization in training and the performance of pretrained models. More importantly, we observe that multiple masking is capable of introducing a locality bias to models, i.e., making them attend more locally. Our contributions are also acknowledged by Reviewers JZAj and GAuN.

We also do not agree with the reviewer that the performance gains are not significant. In our experiments, our two-shot masking has the potential to significantly improve the baseline (MAE); please see Figure 3(b). For example, our two-shot masking L(0, 10) outperforms the baseline by roughly 2% on ImageNet100. We also empirically demonstrate that multiple masking has potential in fine-grained classification, outperforming the baseline (MAE) on three widely used fine-grained datasets by over 2%; please see Table 2. Additionally, our best-performing setting with ViT-B outperforms the baseline by 0.3% on ImageNet1k. The reviewer comments that the three-shot masking is significantly worse than the baseline as shown in Table 5. As shown in Table 5, our three-shot masking is about 0.6% lower than the baseline. If the 0.6% performance gap is significant, how about the over-2% enhancement of our two-shot masking over the baseline?

We acknowledge that the three-shot masking is worse than the baseline. However, the reviewer's comment diverges from the original goal of our three-shot masking experiments. To give a comprehensive analysis for subsequent researchers, it is our responsibility to explore the boundary of improvement, hoping to inspire future work. Additionally, though three-shot masking is worse than the baseline in general classification, we find that the three-shot model outperforms the baseline on three fine-grained datasets by around 0.5% (Table 9 in the Appendix). These results demonstrate that the three-shot masking is also meaningful.

Comment

Dear Reviewer xPJ6:

As the rebuttal period is ending soon, we wonder if our response answers your questions and addresses your concerns.

Thanks again for your very constructive and insightful feedback! Looking forward to your post-rebuttal reply!

Comment

Dear Reviewer xPJ6:

As the discussion is going to end, we sincerely wonder if our response answers your questions and addresses your concerns.

Thanks again! Looking forward to your reply!

Sincerely,

The authors

Official Review
Rating: 3

The paper studies masking strategies in masked autoencoders. Moreover, the paper proposes a multi-shot masking strategy for MAE, and shows the results of downstream tasks, such as image classification, object detection, and semantic segmentation. However, the experiments mainly focus on ImageNet-100 and several downstream tasks to demonstrate the effectiveness. Meanwhile, I am confused about the motivation of this paper.

Strengths

This paper explores the application of masking in MAE training and validates its effectiveness through a series of comprehensive experiments. The problem they aim to address seems reasonable to me. The writing in the paper is clear and comprehensible.

Weaknesses

  1. Readers are confronted with a perplexing dilemma due to the lack of support for motivation in the given context. The second paragraph in the introduction is particularly confusing. The reasons behind the failure of these methods and the evidence supporting the interior robustness of MAE are uncertain. The original MAE paper indicates a more comprehensive evaluation of robustness. These statements appear to be fabricated.

  2. In fact, VideoMAE v2 is the first to propose to mask MAE. Therefore, this paper is not the first research to explore the impact of multiple shots masking in MAE.

  3. The experimental setting of this paper does not convince me of its results. The dataset selection has several issues. For the main experiments, the author selected the less commonly used benchmark, ImageNet-100. Since the scale and diversity of the data can significantly influence the experimental results, the validity of comparing with MAE is uncertain. Furthermore, the author also failed to choose common datasets for the downstream tasks. Moreover, the article makes adjustments under the original experimental setup conditions of MAE, thus rendering the conclusions drawn from the experiments inaccurate.

  4. The proposed method appears to lack technical contributions, and I fail to discern any tangible practical benefits in terms of performance or speed.

Questions

Implementing more experiments on the benchmark ImageNet-1k dataset with the same experimental setting is necessary.

Details of Ethics Concerns

No ethics review needed.

Comment

Q3: The experimental setting of this paper does not convince me of its results. The dataset selection has several issues. For the main experiments, the author selected the less commonly used benchmark, ImageNet-100. Since the scale and diversity of the data can significantly influence the experimental results, the validity of comparing with MAE is uncertain. Furthermore, the author also failed to choose common datasets for the downstream tasks. Moreover, the article makes adjustments under the original experimental setup conditions of MAE, thus rendering the conclusions drawn from the experiments inaccurate.

A3: We do not agree with such biased and ridiculous comments from the reviewer.

Firstly, the subset of ImageNet1k, i.e., ImageNet-100, is widely used for analysis in papers [1] [2] [3] [4] [5] [6] [7] [8] published in top conferences and journals, e.g., ICLR, NeurIPS, ICML, JMLR, etc. Besides, we also conduct experiments on the large ImageNet1k to verify the effectiveness and scalability; please see Section 3.4.

Secondly, our selected datasets for downstream tasks are widely used in the field of self-supervised learning. For example, we choose ADE20K for semantic segmentation and MSCOCO for object detection. These two datasets are also used in MAE [9], BEiT [10], CAE [11], iBOT [12], and other methods published in top conferences and journals, e.g., CVPR, ICLR, IJCV, etc. For transfer learning, we use CIFAR100 and CIFAR10, which are also widely used in iBOT [12], MoCo v3 [13], DINO [14], and SimCLR [15].

Finally, parameter tuning is common and reasonable in deep learning because different methods may have different fitting abilities and optimization spaces. With parameter tuning, we can find the best configuration and better unleash model capabilities. For example, consider iBOT [12] and DINO [14]: the code of iBOT is based on DINO, yet for ViT-B, iBOT sets the local crop scale to (0.05, 0.32) while DINO uses (0.05, 0.25). Both of them perform parameter tuning for better performance.

Q4: The proposed method appears to lack technical contributions, and I fail to discern any tangible practical benefits in terms of performance or speed.

A4: As we have mentioned in the last paragraph of the Introduction, our goal is not to propose a state-of-the-art method, but to enhance both the understanding and performance of MAE by exploring the potential of masking and to inspire future research. Our study contributes in two aspects:

From the aspect of understanding MAE, our study shows that multiple masking introduces more locality to models and summarizes several takeaways from our findings; please see our Introduction. Reviewers GAuN and JZAj also acknowledge our findings and contribution.

From the aspect of enhancing the performance of MAE, we also empirically demonstrate that multiple masking has the potential to further improve the performance of MAE, as shown in Table 6 and Figure 8. More importantly, we find that multiple masking has significant superiority over MAE on fine-grained classification, which is missing in previous studies.

[1] Whitening for Self-Supervised Representation Learning. ICML 2021.

[2] Adversarial Masking for Self-Supervised Learning. ICML 2022.

[3] Mosaic Representation Learning for Self-supervised Visual Pre-training. ICLR 2023 (Spotlight).

[4] Self-Supervised Learning with an Information Maximization Criterion. NeurIPS 2022.

[5] solo-learn: A Library of Self-supervised Methods for Visual Representation Learning. JMLR 2022.

[6] Improving Transferability of Representations via Augmentation-Aware Self-Supervision. NeurIPS 2021.

[7] A Simple Data Mixing Prior for Improving Self-Supervised Learning. CVPR 2022.

[8] DLME: Deep Local-flatness Manifold Embedding. ECCV 2022.

[9] Masked Autoencoders Are Scalable Vision Learners. CVPR 2022.

[10] BEiT: BERT Pre-Training of Image Transformers. ICLR 2022 (Oral).

[11] Context Autoencoder for Self-Supervised Representation Learning. IJCV 2023.

[12] iBOT: Image BERT Pre-Training with Online Tokenizer. ICLR 2022.

[13] An Empirical Study of Training Self-Supervised Vision Transformers. ICCV 2021.

[14] Emerging Properties in Self-Supervised Vision Transformers. ICCV 2021.

[15] A Simple Framework for Contrastive Learning of Visual Representations. ICML 2020.

Comment

Q1: Readers are confronted with a perplexing dilemma due to the lack of support for motivation in the given context. The second paragraph in the introduction is particularly confusing. The reasons behind the failure of these methods and the evidence supporting the interior robustness of MAE are uncertain. The original MAE paper indicates a more comprehensive evaluation of robustness. These statements appear to be fabricated.

A1: We do not agree with the reviewer. In the second paragraph, we do not mention the robustness of MAE or other methods. How did the reviewer obtain the content about "robustness" out of thin air from the second paragraph?

In fact, the logic of our motivation is as follows: Firstly, we emphasize the importance of the masking operation. Then we describe how other methods, including MAE, perform masking and select a suitable mask ratio. Afterward, in light of the fact that masking is an important and flexible operation that can be performed at different stages, we present our concern and pose a question about a different masking strategy, i.e., multiple masking, and ask how it affects the optimization in training and performance.

Reviewers GAuN and JZAj also acknowledge that our article presents a well-organized and well-argued thought process with clear logic and structure.

We are sorry that the reviewer is confused. We have further revised the second paragraph carefully for better clarity, as below:

  • A crucial component of the masked autoencoder is the mask ratio, which directly impacts the model's performance. For instance, in MAE, the fine-tuning accuracy may vary by up to 2% with different mask ratios. However, current methods, including MAE, mostly ablate the mask ratio only on the input image: they mask the input image with various ratios and select the best-performing ratio after training those model variants. Considering that masking is an important and flexible operation that can be performed at different stages (e.g., the input image and different levels of representations) and with different ratios, these approaches may fail to fully exploit the potential of the autoencoder. Hence, a question naturally arises: Can the masked autoencoder handle multiple rounds of masking at different levels, and how does multiple masking affect its optimization in training and performance?

Q2: In fact, VideoMAE v2 is the first to propose to mask MAE. Therefore, this paper is not the first research to explore the impact of multiple shots masking in MAE.

A2: By carefully going through VideoMAE v2, especially Section 4.2 (Main results, Results on dual masking, https://arxiv.org/pdf/2303.16727.pdf), we agree that VideoMAE v2 is the first to propose to mask MAE. But VideoMAE v2 is not the first work to dive deep into the impact of multiple-shot masking in MAE. We list three discrepancies between VideoMAE v2 and our work:

Firstly, the application is different. As the name suggests, VideoMAE v2 is designed for videos while our work is for images.

Secondly, the motivation is different. VideoMAE v2 aims to improve the overall efficiency of computation and memory so as to scale up the model and data. As one can see in Table 2 on Page 7 of VideoMAE v2, it primarily tells the story about the benefits in computational cost, memory consumption, etc. However, an in-depth analysis of the potential impact of dual masking on the encoder is absent in VideoMAE v2. In contrast, our work dives deep into it, reveals what multiple shots of masking bring to the encoder, and shows how they affect the encoder's optimization in training. In other words, VideoMAE v2 focuses on application while our work focuses more on analysis. This is the biggest difference between VideoMAE v2 and our work.

Thirdly, in terms of the specific implementation, the masking places and the number of maskings are different. VideoMAE v2 only performs masking twice and employs it on the encoder and decoder respectively. Differing from VideoMAE v2, our method involves up to three rounds of masking that are all performed on the encoder.
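To make the masking-place difference concrete, below is a minimal sketch of how additional masking on the encoder's hidden tokens could be applied on top of MAE's input masking. This is not the authors' released code; the helper names, the ViT-style block interface, and the example schedule are all hypothetical.

```python
import torch

def random_keep(tokens, mask_ratio):
    """MAE-style random masking: keep a (1 - mask_ratio) fraction of tokens.

    tokens: (B, N, D) patch tokens; returns the kept tokens and their indices.
    """
    B, N, D = tokens.shape
    n_keep = max(1, int(N * (1.0 - mask_ratio)))
    noise = torch.rand(B, N, device=tokens.device)       # per-sample random scores
    keep_idx = noise.argsort(dim=1)[:, :n_keep]           # indices of tokens to keep
    kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return kept, keep_idx

def multi_shot_encode(blocks, patch_tokens, schedule):
    """Run a ViT encoder while masking again before chosen blocks.

    `schedule` maps block index -> mask ratio applied *before* that block,
    e.g. {0: 0.75, 9: 0.10} for a two-shot variant: 75% of the input tokens
    are dropped, then a further 10% of the survivors before block 9.
    """
    x = patch_tokens
    for i, blk in enumerate(blocks):
        if i in schedule:
            x, _ = random_keep(x, schedule[i])
        x = blk(x)
    return x
```

In this sketch, a three-shot variant would simply add a third entry to the schedule, and the decoder side is left unchanged from MAE.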

Finally, we express profound respect for all the contributors to VideoMAE v2, recognizing their efforts in scaling models and datasets within the video domain. Simultaneously, we hope that the reviewer can objectively assess the distinctions between our work and VideoMAE v2 and realize the contribution we have made in the in-depth analysis of multiple masking.

We also thank the reviewer for pointing out this problem. To avoid misunderstanding, we have cited VideoMAE v2, added a discussion in the related work, and revised the description of our contribution as follows to distinguish it from VideoMAE v2:

  • Building on our proposed flexible framework, i.e., Conditional MAE, we are the first to make an in-depth analysis of multiple masking and reveal its impact on the masked autoencoder's optimization in training and performance.
Comment

Dear Reviewer MagS:

As the rebuttal period is ending soon, we wonder if our response answers your questions and addresses your concerns.

Thanks again for your very constructive and insightful feedback! Looking forward to your post-rebuttal reply!

Comment

Thanks for the response from the authors.

After carefully reviewing the response and the other reviewers' questions, I think the author's response fails to convince me for the following reasons:

1. I continue to question the author's motivation, such as the introduction of the multi-shot masking in Video MAE V2. While the proposal aims to save computational resources, the author's strategy of employing multiple masks lacks a clear motivation and appears to address similar issues. Moreover, the rationale behind this approach is not well articulated.

2. Concerning the experiments conducted on ImageNet-100: the articles cited by the authors either present original methods or possess a clear motivation with proposed solutions. However, the author's work does not align with these standards.

  3. Additionally, the majority of methods validate their effectiveness on ImageNet-1K. Many self-supervised methods that are effective in small data scenarios show contradictory results when the dataset size increases. The author's experimentation on MAE, given that the original MAE paper conducted experiments on ImageNet-1K, raises concerns. While it may be acceptable for an original self-supervised method to be tested solely on ImageNet-100, incremental work or findings based on existing methods necessitate a comparison under the same experimental conditions. This is essential for a convincing conclusion, and the current conclusions based on small-scale experiments are difficult to trust.

  4. Similarities exist between Video MAE v2 and the author's experiments. However, while Video MAE v2 aims to reduce computational load in large-scale data settings at the expense of model performance, the model's performance in this article purportedly improves. This inconsistency raises doubts about the conclusions drawn in this paper.

In summary, the author's rebuttal lacks sufficient support to address my concerns about the motivation, experimental design, and conclusions presented in the paper. Further clarification and additional evidence are needed for the arguments to be more compelling in the context of academic discourse. Therefore, I keep my score.

Comment

We thank the reviewer for the response.

1. For the first question, even though multiple-shot masking indeed has such benefits in saving computational resources, this is not the main focus of our paper; we do not even use any table to show it. As we said in the second point of A2, our motivation is that, given that masking can be performed at different stages (e.g., the input image and different levels of representations) and with different ratios, we aim to reveal what multiple shots of masking at different stages bring to the encoder and how they affect the encoder's behavior. We are extremely puzzled as to why the reviewer cannot follow such natural and straightforward logic and keeps conflating our work with VideoMAE v2. We have clearly presented three differences from VideoMAE v2 in A2. We hope the reviewer does not hold a deep-seated bias against our work.

2. We cannot agree with the reviewer's comments. In fact, conducting experiments on other datasets is also acceptable for analysis or findings; such papers have been published in top conferences and journals, e.g., ICLR, ICML, CVPR, and TMLR. For example, [1] conducts experiments only on MNIST, CIFAR10, and CIFAR100 and compares the results with those of SimCLR and VICReg. [2] uses CIFAR10 to visualize kNN results in the projection space and the embedding space. [3] conducts experiments on CIFAR10, CIFAR100, SVHN, and ImageNet100 using InstDisc. [4] also uses STL-10 and ImageNet100 for analysis. [5] leverages CIFAR-10 and SimCLR. This evidence strongly supports our claims, and we hope the reviewer takes our perspective seriously. Besides, we may need to emphasize that we also conduct experiments on ImageNet1K to verify the scaling ability and compare the performance with MAE; please see Figure 8.

3. As for the inconsistency, it is easy to explain. Firstly, the application fields are different: our work focuses on images while VideoMAE v2 focuses on videos, and the additionally introduced temporal dimension may influence the optimization of the encoder. Secondly, the masking places are different: VideoMAE v2 employs two rounds of masking on the encoder and decoder respectively, while our method performs all masking on the encoder. It is widely known in computer vision that slight differences, e.g., block versus random masking strategies, may cause large performance discrepancies.

We also cannot agree with the reviewer's hasty conclusion in point 4 of the reply. In fact, win-win situations often happen in deep learning, e.g., in model compression. TinyViT [6] can reach 86.5% accuracy, slightly better than Swin-L while using only 11% of the parameters. EfficientViT-M5 [7] surpasses MobileNetV3-Large by 1.9% in accuracy while achieving 40.4% and 45.2% higher throughput. MiniViT [8] can reduce the size of the pre-trained Swin-B transformer by 48% while achieving an increase of 1.0% in Top-1 accuracy on ImageNet.

Finally, we sincerely hope that the reviewer does not hold a deep-seated bias against our work.

[1] Minimalistic Unsupervised Learning with the Sparse Manifold Transform. ICLR 2023 (Spotlight).

[2] Intra-Instance VICReg: Bag of Self-Supervised Image Patch Embedding. TMLR 2023.

[3] Understanding the Behaviour of Contrastive Loss. CVPR 2021.

[4] Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere. ICML 2020.

[5] Understanding Contrastive Learning Requires Incorporating Inductive Biases. ICML 2022.

[6] TinyViT: Fast Pretraining Distillation for Small Vision Transformers. ECCV 2022.

[7] EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention. CVPR 2023.

[8] MiniViT: Compressing Vision Transformers with Weight Multiplexing. CVPR 2022.

Comment

Thanks for the response from the authors. Firstly, I would like to clarify that I indeed do not hold any bias against this paper. The following is my response:

It seems that the author has consistently misunderstood my points and evaded addressing critical questions.

  1. Regarding question 1, my emphasis was on the existing motivation failing to convince me. I hope the authors can identify a perspective that convinces me, demonstrating how the method can address a certain issue, akin to videomae v2's resolution of computational load concerns. Presently, the current explanation lacks inspiration and noteworthy aspects.

  2. Concerning the experiments, while it's true that a small amount of work has published papers using small-scale datasets, almost all research papers use ImageNet-1K as a benchmark due to the greater persuasiveness of results derived from such datasets. Papers conducting experiments on small datasets require stronger innovation, a quality lacking in this article. Furthermore, based on the author's ImageNet experiments, I pointed out that the current method is likely to be ineffective on a large-scale dataset. The author's experiments indicate a very marginal improvement in finetuning (a mere 0.1% increase with an increase in model size and epochs consistent with the MAE benchmark). I believe this marginal improvement is more likely due to random seed fluctuations. Meanwhile, the author's results lack universally accepted experiments like linear probing, which better reflect model performance and require significantly less evaluation time than finetuning.

  3. Lastly, I did not claim that a win-win scenario is impossible. I merely highlighted the conflict between the conclusion and existing works. The author's method introduces "mask findings" for MAE, while videomae v2 also utilizes MAE. The latter demonstrates that multi-shot masks can degrade MAE performance, conflicting with the conclusions drawn in this paper. Both works utilize visual data, and videomae v2 does not specifically process the temporal dimension. I believe that the conflicting conclusions cannot be explained merely by differences in data and would prefer to see more compelling experimental results.

Comment

We thank the reviewer for the reply.

However, we do not agree with the reviewer's clarification. In our view, the reviewer is completely fixated on VideoMAE v2, its story, its method, etc., and mentions VideoMAE v2 in every reply. We beg the reviewer to let go of this obsession with VideoMAE v2 and hold a fair attitude. We may need to emphasize again that our motivation is completely different from VideoMAE v2: VideoMAE v2 aims to save computational resources (please note), while ours is inspired by the high flexibility of multiple masking and aims to understand the potential influence of multiple masking on the encoder and give a comprehensive analysis. We are not trying to solve an issue like VideoMAE v2, because ours is an analysis work. We are extremely confused why the reviewer requires our work to strictly follow the story of VideoMAE v2; it is unreasonable. Do other reviewers doubt our motivation? No. As the saying goes, "The person involved is confused; onlookers see clearly." We hope the reviewer recognizes this unconscious yet severe bias against our work, sets VideoMAE v2 aside, and holds a fair attitude toward our work.

Secondly, we also cannot agree with the logic behind comment 2, i.e., the sentence "it's true that a small amount of work has published papers using small-scale datasets, ....... Papers conducting experiments on small datasets require stronger innovation, a quality lacking in this article." Because the given papers have already been published, the reviewer has to acknowledge their innovations and forgive their absence of large-dataset experiments; in contrast, because our work is not yet published, the reviewer severely doubts our innovation. Additionally, we have to point out that published papers using small-scale datasets are not rare, especially for analysis papers. As for the improvement, as we have said, the main goal is to verify the scalability of multiple masking. Moreover, a 0.1% improvement is not marginal. If the reviewer confidently thinks it is marginal, how about the 0.1% performance drop between VideoMAE v2-H and VideoMAE v2-g in Table 6 (a), and how about the 0.2% improvement between VideoMAE v2-H and VideoMAE v2-g in Table 6 (c)?

Finally, regarding the conflict, as we have explained previously, our work and VideoMAE v2 have different application fields, i.e., images and videos. It is questionable whether it is reasonable to use a conclusion drawn in one field to argue against a conclusion drawn in another, not to mention that our work and VideoMAE v2 perform masking at different places (VideoMAE v2 employs two rounds of masking on the encoder and decoder respectively, while our method performs all masking on the encoder). It is widely known in computer vision that slight differences, e.g., block versus random masking strategies, may cause large performance discrepancies. We hope the reviewer will be more rigorous.

In a nutshell, we would sincerely appreciate it if the reviewer recognized this unconscious yet severe bias, let go of the obsession with VideoMAE v2, and held a fair attitude toward our work. Thanks.

Official Review
Rating: 5

This article focuses on the mask strategy issue in MAE, adopting a multi-stage mask approach instead of a fixed mask method. The optimal mask ratio and layer are selected at different stages to mask both the input image and the feature layers. The author believes that this method effectively enhances the mask's attention to locality, and the effectiveness of this method is verified through a series of experiments.

Strengths

Originality: Although this article proposes new improvements based on the masking strategy of MAE, the problem addressed is a commonly overlooked issue. Through a series of experiments, the author reflects on the mask strategy and verifies the specific role of the mask in SSL.

Quality: The article presents a well-organized and well-argued thought process, and the effectiveness of condition-MAE is demonstrated through a series of comparative experiments.

Clarity: The article is highly readable, with clear logic and structure.

Significance: The further exploration of the role of the mask and the unexpected findings in the experiments are also worth attention.

Weaknesses

  1. The essence of this article is based on MAE and explores the mask strategy. However, this setting lacks novelty as there have been many studies on mask strategies (e.g., [1-3]), and the related work mentioned is not sufficient. It is hoped that the author can conduct further comparisons and supplements to evaluate performance.

  2. Some findings mentioned in the article, such as the non-positive correlation between linear probing and finetune performance, are relatively straightforward. Linear probing mainly evaluates the model's discriminative generalization ability, while finetune focuses on the model's fitting ability to the data. However, there is still a positive correlation, mainly depending on the model's scale. The larger the model, the less obvious the positive correlation.

  3. The article proposes that the second-stage mask introduces more locality. However, I am concerned that the feature-level mask may have caused this result, as the input image is low-level, while the feature is high-level. Therefore, the combination of the two naturally makes SSL focus on global and local representations.

[1] Choe, J., Shim, H.: Attention-based dropout layer for weakly supervised object localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2219–2228 (2019)

[2] Shi, Y., Siddharth, N., Torr, P.H., Kosiorek, A.R.: Adversarial masking for self-supervised learning. arXiv preprint arXiv:2201.13100 (2022)

[3] Wei, C., Fan, H., Xie, S., Wu, C.Y., Yuille, A., Feichtenhofer, C.: Masked feature prediction for self-supervised visual pre-training. arXiv preprint arXiv:2112.09133 (2021)

Questions

  1. The mask ratio in the first stage of the article follows the MAE setting, while the best ratio in the second stage is selected manually. However, it is unclear whether this setting will be effective if a different backbone is used instead of ViTs. Would it still require manual tuning?

  2. There are some minor changes that need to be noted. For example, the sentence "The dashed line denotes the one-shot baseline with masking ratios of 0.75 and 0.9 respectively" is unclear in terms of the baseline used in the graph. Additionally, the experiment with a mask ratio of 0.9 is missing, and if it is included in the appendix, it should be mentioned to improve readability. In the "Potential Application" section, "fine-grained" should be used instead of "fine-grind", and there are inconsistencies in the use of nouns throughout the article.

  3. In Figure 4, it can be seen that the difference between layer 0-3 and layer 0-6 is significant, and according to your statement, there should be an improvement. However, the experimental results show a decrease from 84.1 to 84.0. On the other hand, the improvement from layer0-6 to layer0-9 is significant, but the graph shows that the difference from the baseline is not as significant. This has caused some confusion, and it would be helpful if the author could explain this discrepancy.

Comment

Q4: There are some minor changes that need to be noted. For example, the sentence "The dashed line denotes the one-shot baseline with masking ratios of 0.75 and 0.9 respectively" is unclear in terms of the baseline used in the graph. Additionally, the experiment with a mask ratio of 0.9 is missing, and if it is included in the appendix, it should be mentioned to improve readability. In the "Potential Application" section, "fine-grained" should be used instead of "fine-grind", and there are inconsistencies in the use of nouns throughout the article.

A4: We feel sorry for this confusion. We have highlighted the dashed line, gone through the whole article, and revised similar issues of inconsistencies. Thank the reviewer for the kind suggestion.

Q5: In Figure 4, it can be seen that the difference between layer 0-3 and layer 0-6 is significant, and according to your statement, there should be an improvement. However, the experimental results show a decrease from 84.1 to 84.0. On the other hand, the improvement from layer0-6 to layer0-9 is significant, but the graph shows that the difference from the baseline is not as significant. This has caused some confusion, and it would be helpful if the author could explain this discrepancy.

A5: We sincerely thank the reviewer for this constructive suggestion. This is easy to explain. First of all, we may need to emphasize that the disparity in the heatmap does not necessarily imply whether the learned representation is advantageous or detrimental; it only reflects how the representation learned by our two-shot masking model varies from that of the baseline. Hence, it would be unreasonable to use the significance of the heatmap to assess the performance after fine-tuning. For L0-6, we think it is an awkward transitional state. It does not resemble L0-3, which learns significantly superior discriminative ability (2.2% better than L0-6) as the initial state for fine-tuning. It also differs from L0-9 and L0-10, which exhibit stronger fitting capabilities to the data. As a result, L0-6 is relatively inferior to the others. As shown in Figure 18 in the Appendix, we compare the attention distance of the two-shot model variants L(0;3/6/9/10/11;0.75,0.1), and the results indicate that the adjustment of L0-6 is relatively inconspicuous.
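For readers less familiar with the attention distance metric used above, the following is a minimal sketch of mean attention distance (the attention-weighted spatial distance between query and key patches); the grid size, tensor layout, and removal of the CLS token are illustrative assumptions, not the paper's code.

```python
import torch

def mean_attention_distance(attn, grid_size):
    """Mean attention distance for a single head.

    attn: (N, N) attention weights over N = grid_size**2 patch tokens
          (rows sum to 1; the CLS token is assumed to be removed).
    Returns the attention-weighted average Euclidean distance, in patch units,
    between each query patch and the key patches it attends to.
    """
    ys, xs = torch.meshgrid(torch.arange(grid_size), torch.arange(grid_size), indexing="ij")
    coords = torch.stack([ys, xs], dim=-1).reshape(-1, 2).float()   # (N, 2) patch coordinates
    dist = torch.cdist(coords, coords)                              # (N, N) pairwise distances
    return (attn * dist).sum(dim=-1).mean()                         # average over query patches

# Example: uniform attention over a 14x14 grid (e.g., ViT-S/16 on 224x224 inputs)
uniform = torch.full((196, 196), 1.0 / 196)
print(mean_attention_distance(uniform, grid_size=14))
```

A smaller value indicates that attention concentrates on nearby patches, i.e., more locality.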

We thank the reviewer again and have added this discussion to our paper to avoid confusion.

[1] Choe, J., Shim, H.: Attention-based dropout layer for weakly supervised object localization. CVPR 2019.

[2] Shi, Y., Siddharth, N., Torr, P.H., Kosiorek, A.R.: Adversarial masking for self-supervised learning. arXiv 2022.

[3] Wei, C., Fan, H., Xie, S., Wu, C.Y., Yuille, A., Feichtenhofer, C.: Masked feature prediction for self-supervised visual pre-training. arXiv 2021.

Comment

Q1: The essence of this article is based on MAE and explores the mask strategy. However, this setting lacks novelty as there have been many studies on mask strategies (e.g., [1-3]), and the related work mentioned is not sufficient. It is hoped that the author can conduct further comparisons and supplements to evaluate performance.

A1: We partially agree with the reviewer's comment that there have been many studies on mask strategies, but our work is still meaningful, because these studies primarily focus on how to further improve performance, whereas, as we mentioned in the last paragraph of the Introduction, our work aims to reveal how multiple masking affects the masked autoencoder's behavior, e.g., by introducing more locality, and to further enhance the understanding of the masked autoencoder. For example, [1] leverages the self-attention mechanism to hide the most discriminative part and highlight the informative region to improve the accuracy of weakly supervised object localization (WSOL). [2] uses an adversarial objective to consistently improve state-of-the-art self-supervised learning (SSL) methods. [3] uses Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, as the reconstruction target and verifies its effectiveness on video recognition.

We thank the reviewer for the suggestion and have cited and discussed these works in our related work section to show the differences.

Q2: The article proposes that the second-stage mask introduces more locality. However, I am concerned that the feature-level mask may have caused this result, as the input image is low-level, while the feature is high-level. Therefore, the combination of the two naturally makes SSL focus on global and local representations.

A2: Thank you for the comment, but we cannot fully agree with it. We may need to remind the reviewer that the reconstruction target in MAE is the normalized pixel value in RGB space. Even though we perform feature-level masking, the model still needs to focus on recovering image pixels. Consequently, the difference in masking type is not the essence of the introduced locality. In fact, the locality is introduced because the presence of the second masking necessitates that patches that interacted in previous layers must recover their corresponding masked neighbors in the forward pass. As a result, the model needs to dedicate a portion of its capacity to learning how to infer local neighbors.
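Since the argument hinges on the reconstruction target staying in pixel space regardless of where masking happens, here is a minimal sketch of MAE-style per-patch normalized pixel targets and the loss computed only on masked patches; the variable names and shapes are our own assumptions, not the authors' code.

```python
import torch

def masked_pixel_loss(pred, target_patches, mask, eps=1e-6):
    """MAE-style reconstruction loss on per-patch normalized pixel targets.

    pred:           (B, N, P) predicted pixel values per patch (P = patch_size**2 * 3)
    target_patches: (B, N, P) ground-truth pixels per patch
    mask:           (B, N) with 1 for masked patches and 0 for visible ones
    """
    mean = target_patches.mean(dim=-1, keepdim=True)
    var = target_patches.var(dim=-1, keepdim=True)
    target = (target_patches - mean) / (var + eps).sqrt()   # per-patch pixel normalization
    loss = ((pred - target) ** 2).mean(dim=-1)               # (B, N) per-patch MSE
    return (loss * mask).sum() / mask.sum()                  # average over masked patches only
```

In this sketch, whether a patch is dropped at the input or at an intermediate layer, its target remains the normalized pixels, which is why the surviving tokens must learn to infer their masked neighbors.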

We have added this discussion to our paper to better clarify how the locality is introduced.

Q3: The mask ratio in the first stage of the article follows the MAE setting, while the best ratio in the second stage is selected manually. However, it is unclear whether this setting will be effective if a different backbone is used instead of ViTs. Would it still require manual tuning?

A3: Considering that the masking ratio is a hyperparameter, similar to other hyperparameters, it may not always be compatible with various backbones, e.g., ConvNeXt V2 (0.6) and MAE (0.75), and various masking algorithms, e.g., block (SimMIM, 0.5) and random (MAE, 0.75), even without our strategy. Hence, we are afraid that it would possibly require some manual tuning. But since the essence of masked image modeling is invariant, this process would be significantly shortened with the help of the takeaways summarized in our paper; please see the Introduction.

Comment

Dear Reviewer GAuN:

As the rebuttal period is ending soon, we wonder if our response answers your questions and addresses your concerns.

Thanks again for your very constructive and insightful feedback! Looking forward to your post-rebuttal reply!

Comment

Dear Reviewer GAuN:

As the discussion is going to end, it is quite important for us to obtain your feedback. We sincerely wonder if our response answers your questions and addresses your concerns.

Thanks again! Looking forward to your reply!

Sincerely,

The authors

Official Review
Rating: 8

This work is a systematic empirical study of multi-shot masking in MAE, applying additional masking to hidden representations at chosen layers of the encoder on top of the input masking in standard MAE. The masking ratios (different ratios used at different layers) and how many of these additional maskings ("shots") to apply are extensively studied. There are several interesting insights from these studies, as clearly stated in the introduction by the authors: masking at the beginning is always beneficial for task performance, with 75% being optimal; building on one-shot masking, increasing the interval of two-shot masking with a large first ratio and a small second ratio is helpful; a small third ratio is helpful for three-shot masking; and more. Extensive and crucial analyses have been performed, such as linear probing / fine-tuning on models with different numbers of masking shots, different masking ratios, and masking levels, Centered Kernel Alignment, attention distance and entropy, visualization, robustness analysis, classification on fine-grained datasets, and transfer learning.

Given the strong, comprehensive empirical analysis and the great presentation of this submission, the reviewer recommends an acceptance.
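As a reference for the Centered Kernel Alignment analysis mentioned above, linear CKA between two layer representations can be computed as in the sketch below. This follows the standard definition (Kornblith et al.) rather than the authors' code, and the feature shapes are assumptions.

```python
import torch

def linear_cka(X, Y):
    """Linear CKA similarity between two representations of the same n examples.

    X: (n, d1), Y: (n, d2) feature matrices, e.g., pooled tokens from two layers
    or from two differently-masked models. Returns a scalar in [0, 1].
    """
    X = X - X.mean(dim=0, keepdim=True)        # center features
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic = (Y.T @ X).norm(p="fro") ** 2         # ||Y^T X||_F^2
    norm_x = (X.T @ X).norm(p="fro")
    norm_y = (Y.T @ Y).norm(p="fro")
    return hsic / (norm_x * norm_y)
```

Values close to 1 indicate that the two layers encode highly similar representations.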

Strengths

Originality: as the author rightfully cited and compared, there have been numerous works analyzing and studying masked image modeling. However, to the reviewer’s knowledge, this is the first paper to study the effect of multiple masking ratios across different layers in the MAE encoder. The work is, in this sense, original.

Quality: the quality of the paper is high in terms of experimental results and analysis. First, four layers at equal intervals, five masking ratios, and two model sizes (ViT-S/16 and ViT-B/16) are considered, and the representations of one-shot, two-shot vs. three-shot maskings are carefully compared via linear probing/fine-tuning, Centered Kernel Alignment of layer representation, attention distance and entropy, and fine-grained locality oriented datasets such as Flower102, Stanford Dog and CUB-200. The transfer learning analysis is also done on COCO for detection and ADE20K for semantic segmentation. There are further studies on robustness measured by classification results after occlusion and shuffling perturbation and the scalability of performance under different model sizes.

Clarity: this paper is very well written, with key messages clearly presented, results from each section nicely summarized, and key figures carefully designed.

Significance: given that masking image modeling is a popular topic, the research is beneficial to the community by creating new potentials to improve existing approaches with multi-shot masking.

Weaknesses

Post rebuttal update: the authors successfully clarified the differences between this work and previous methods and provided extensive discussions. Therefore, the rating was updated.


Originality: progressive masking has been extensively studied in generative modeling or the combination of SSL and generative modeling [1-3]; adaptive masking strategies on images or languages have also been proposed [4-5]. However, the authors did not cite or discuss them.

Quality: Considering the nature of ICLR, the biggest weakness of this paper is not having any theoretical results, insights, or even discussions about the proposed approach. The insights are not theoretically backed, reducing the quality of the work.

The scalability discussion in Sec. 3.4 is weaker because there are only two model sizes. Also, the pre-training is done using ImageNet-100, creating a gap between the conclusions of the submission and the hypothetical behaviors of MAE trained on ImageNet-1K using multi-shot masking.

[1] Chang, Huiwen, et al. "Maskgit: Masked generative image transformer." CVPR 2022.

[2] Chang, Huiwen, et al. "Muse: Text-to-image generation via masked generative transformers." 2023.

[3] Li, Tianhong, et al. "Mage: Masked generative encoder to unify representation learning and image synthesis." CVPR 2023.

[4] Bandara, Wele Gedara Chaminda, et al. "AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders." CVPR 2023.

[5] Xiao, Yisheng, et al. "AMOM: adaptive masking over masking for conditional masked language model." AAAI 2023.

Questions

  1. The authors may not need to perform empirical experiments on this, but what would the authors conjecture about using a low masking ratio (e.g., 0.1) on the beginning position for one-shot masking?

  2. "We reconstruct it primarily conditioned on the 'borrowed' information through such interaction." The authors may rephrase this sentence to be more precise.

  3. Minor grammar errors: "We need reconstruct two targets", and "pretaining loss" in Figure 7.

Comment

We sincerely thank the reviewer for acknowledging the contribution of our work and the kind and constructive suggestions.

Q1: Progressive masking has been extensively studied in generative modeling or the combination of SSL and generative modeling [1-3]; adaptive masking strategies on images or languages have also been proposed [4-5]. However, the authors did not cite or discuss them.

A1: We thank the reviewer for this kind suggestion. The discussion is below:

Chang et al. introduce MaskGIT [1], which employs a bidirectional transformer decoder and learns to predict randomly masked tokens by attending to tokens in all directions during training. At inference, MaskGIT first generates all tokens of an image and then refines the generated image iteratively based on the previous generation. Recently, Chang et al. proposed Muse [2], which is trained to predict randomly masked image tokens given text embeddings extracted from a pre-trained large language model (LLM). Leveraging an LLM enables Muse to understand fine-grained language and translate it into high-fidelity image generation. Moreover, Muse directly enables inpainting, outpainting, and mask-free editing without the need to fine-tune or invert the model. Li et al. [3] propose to use semantic tokens learned by a vector-quantized GAN at the inputs and outputs and combine this with masking to unify representation learning and image generation. Bandara et al. propose an adaptive masking strategy called AdaMAE [4], which samples visible tokens based on the semantic context using an auxiliary sampling network and empirically demonstrates its efficacy. Xiao et al. introduce a simple yet effective adaptive masking-over-masking strategy called AMOM [5] to enhance the refinement capability of the decoder and make the encoder optimization easier.

We have cited these works in our paper and added this discussion in our related work section. We sincerely hope to obtain support from the reviewer.

Q2: The scalability discussion in Sec. 3.4 is weaker because there are only two model sizes. Also, the pre-training is done using ImageNet-100, creating a gap between the conclusions of the submission and the hypothetical behaviors of MAE trained on ImageNet-1K using multi-shot masking.

A2: We acknowledge that the scalability analysis is slightly weak. We may need to explain that, due to the limitation of computational resources (as we have pointed out in Section 6, Limitations), we could not conduct experiments on larger models, e.g., ViT-Huge, to further verify the scaling capability. But it is worth mentioning that our scaling experiments are all conducted on the large ImageNet1K instead of ImageNet100, which verifies to some extent the effectiveness of scaling the model size with multi-shot masking on large datasets.

Q3: The authors may not need to perform empirical experiments on this, but what would the authors conjecture about using a low masking ratio (e.g., 0.1) on the beginning position for one-shot masking?

A3: It is a very interesting question. When using a low masking ratio, e.g., 0.1, more information is left and can be used to infer the masked patches. According to our empirical experience, this would lead to an easier reconstruction task, failing to make the encoder learn sufficient inference knowledge (or capability). The masked autoencoder would then be inclined to degenerate into (or resemble) a vanilla autoencoder (masking ratio of 0). The above is our speculation before performing experiments.

Since this question is quite interesting, we are willing to perform such an experiment and share the result with the reviewer. We perform it on ImageNet100 using ViT-S/16. In this experiment, we find that after 300-epoch pretraining, the model is inferior to that with a 0.75 masking ratio (31.7 vs. 45.0 for linear probing and 80.2 vs. 82.5 for fine-tuning), supporting our speculation above.

Q4: Sentence issue and minor grammar errors.

A4: We thank the reviewer for pointing out these issues. We have fixed the grammar issue and rephrased the sentence in our paper.

[1] Chang, Huiwen, et al. "Maskgit: Masked generative image transformer." CVPR 2022.

[2] Chang, Huiwen, et al. "Muse: Text-to-image generation via masked generative transformers." 2023.

[3] Li, Tianhong, et al. "Mage: Masked generative encoder to unify representation learning and image synthesis." CVPR 2023.

[4] Bandara, Wele Gedara Chaminda, et al. "AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders." CVPR 2023.

[5] Xiao, Yisheng, et al. "AMOM: adaptive masking over masking for conditional masked language model." AAAI 2023.

Comment

The reviewer sincerely appreciates the authors' response. The reviewer is satisfied with the response, and the revision made the paper stronger, addressing multiple concerns from different reviewers. Therefore, the reviewer raises the score to 8.

Comment

We sincerely thank the reviewer for acknowledging our contribution! We thank the reviewer again for the effort and time!

Comment

We sincerely appreciate all reviewers' time and effort in reviewing our paper. We also thank all reviewers for their insightful and constructive suggestions, which help a lot in further improving our paper. According to the reviewers' suggestions, we have revised our paper. Below are the main modifications (highlighted in blue in our updated paper):

  • Added more related work and discussion mentioned by Reviewers JZAj and GAuN
  • Rephrased sentences and fixed typos for better clarity

Finally, we thank all reviewers for their efforts again and hope our pointwise responses below clarify each reviewer's concerns.

Comment

Dear Reviewers JZAj, GAuN, MagS, xPJ6, and AC:

We pen down this letter with a nervous disposition.

As this discussion is coming to a close, we sincerely hope that the reviewers and AC will take into consideration the contributions of our method, which is the first in-depth analysis of how multiple masking affects the masked autoencoder's behavior, supported by extensive experiments.

We are also very grateful to the reviewers for their constructive comments and suggestions and hope to receive feedback from the reviewers.

Best,

The authors

AC Meta-Review

The paper studies the effects of multi-shot masking in Masked Autoencoders (MAE).

The paper received mixed reviews (8, 5, 3, 3). Some reviewers appreciated the comprehensive empirical study and the in-depth analysis of the effect of multiple masking. However, significant concerns were raised about the novelty and the quality of the work. Specifically, reviewers questioned the motivation behind the multiple masking strategy and noted that similar concepts have been addressed in previous works. Some reviewers also found the improvements from the proposed method to be marginal and questioned the scalability and generalizability of the findings.

The authors responded to the concerns by clarifying their motivation, emphasizing the differences between their work and related studies, and providing additional details about their experimental setup. However, some major concerns still remain about the generalization of the proposed method, e.g. findings on ImageNet-100 from this paper may not extend well to ImageNet-1k and other larger-scale datasets.

After a discussion considering the reviews, rebuttals, the authors' letter, and the revised submission, the AC and reviewers acknowledged the authors' efforts but concurred that the paper is not ready for acceptance in its present state.

Why Not a Higher Score

The paper received mixed reviews with scores of (8, 5, 3, 3). The AC has thoroughly reviewed all the feedback and believes that the concerns and comments raised by the reviewers require more comprehensive addressing. Since reviewers xPJ6 and GAuN have not actively participated in the discussion, the AC plans to seek additional input from them to ensure a more informed final decision.

Why Not a Lower Score

N/A

Final Decision

Reject