PaperHub
Overall score: 5.4 / 10 (Poster; 5 reviewers)
Ratings: 5, 6, 5, 6, 5 (min 5, max 6, std 0.5)
Confidence: 3.8 · Correctness: 2.6 · Contribution: 2.6 · Presentation: 2.6
NeurIPS 2024

Self-Guided Masked Autoencoder

OpenReview · PDF
Submitted: 2024-05-13 · Updated: 2024-12-19

Abstract

Keywords
self-supervised learning · self-supervision · SSL · representation learning

Reviews and Discussion

Official Review
Rating: 5

This paper aims to enhance the Masked Autoencoder (MAE) approach for self-supervised learning in computer vision. The authors discovered that MAE intrinsically learns pattern-based patch-level clustering from the early stages of pre-training. This finding led them to propose a self-guided masked autoencoder, which internally generates informed masks based on the learned patch clustering rather than relying on random masking. The informed masks, derived from the early-stage patch clustering learned by MAE, cover the main objects in an image more effectively than random masks, accelerating the training process and yielding clearer, more refined feature embeddings. The method improves learning efficiency without external models or supplementary information, maintaining the self-supervised nature of MAE while enhancing performance on various computer vision tasks.

Strengths

  1. The self-guided masked autoencoder is innovative, as it generates informed masks internally based on the model’s own learning progress. This method is also backed with sound motivations.
  2. The paper provides an in-depth analysis of the MAE’s learning process, revealing that pattern-based patch-level clustering occurs from the early stages of pre-training. This understanding contributes insights into the workings of MAE and can inform future research.
  3. The paper also presents a comprehensive analysis of the learned feature space, offering multiple perspectives for understanding the method.

Weaknesses

  1. The success of the self-guided masked autoencoder hinges on the accuracy of the initial patch clustering. If the initial clusters are not well-formed, the informed masks generated could be suboptimal, leading to less effective learning and reduced performance gains.
  2. The process of bi-partitioning and generating informed masks adds significant computational overhead. This extra complexity can increase the overall training time and resource consumption. It would be better if the authors could include the training time or FLOPs for their method.
  3. Lack of baselines. The authors mention multiple informed masking strategies (e.g., CL-MAE, ADIOS), which should be included in this paper for comparison. Currently, only MAE and AMT serve as baseline methods.

Questions

  1. Does this work perform the best under the default mask ratio (0.75) similar to other works?
  2. Have you considered using stochastic S_i (weighted sampling) in the hinted method? In addition, I think sampling hint patches from different clusters might be useful.

Limitations

The authors have addressed the limitations in this paper.

Author Response

We appreciate the reviewers' positive comments and constructive feedback. We have made efforts to address each of the concerns raised.

[W1] Quality of informed masks

We acknowledge the concern that initial clusters may not be well-formed for some images. Although we have shown that MAE properly clusters patches through the observations in Sec. 3, it is true that the clustering quality may sometimes be suboptimal. To address this issue, we proposed a 'relevance score' that measures the relationship of the entire token set to the bi-partitioned cluster and then generates masks based on this score (Sec. 4, L222). As a result, our approach yields robust informed masks even when the clustering and bi-partitioning are imperfect, as illustrated in Fig. II in Appendix A.
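As a rough illustration of what such bi-partition-based informed masking can look like, here is a toy sketch. The helper name `informed_mask`, the spectral cut, and the mean-similarity "relevance" are simplified stand-ins chosen for this illustration, not the paper's exact formulation:

```python
import numpy as np

def informed_mask(sim, mask_ratio=0.75):
    """Toy sketch of bi-partition-based informed masking.

    sim: (N, N) symmetric patch-patch similarity matrix, e.g. derived
    from an encoder self-attention map. Returns a boolean mask with
    round(N * mask_ratio) entries set to True (patch masked out).
    """
    n = sim.shape[0]
    # Bi-partition the patches with a spectral cut: the sign pattern of
    # the graph Laplacian's second eigenvector (Fiedler vector) splits
    # the tokens into two clusters.
    lap = np.diag(sim.sum(axis=1)) - sim
    _, vecs = np.linalg.eigh(lap)
    side = vecs[:, 1] > 0
    # Take the smaller side as the (object-like) foreground cluster.
    fg = side if side.sum() <= n - side.sum() else ~side
    # Simplified "relevance score": mean similarity of every token to the
    # foreground cluster; the most relevant tokens are masked first, so
    # the mask concentrates on the main object rather than the background.
    relevance = sim[:, fg].mean(axis=1)
    mask = np.zeros(n, dtype=bool)
    mask[np.argsort(-relevance)[: round(n * mask_ratio)]] = True
    return mask
```

In practice the similarity matrix would come from the MAE's own attention at epoch T; scoring all tokens against the foreground cluster (rather than masking the cluster verbatim) is what keeps the mask sensible when the bi-partition itself is imperfect.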

[W2] Training cost

Our method requires only one additional step to generate masks, which empirically increases the pre-training time by about 0.25% (for 400-epoch pre-training). We will make this clearer in the camera-ready version.

[W3] Additional baselines

We thank the reviewer for this constructive feedback. As AMT is the only model relying solely on the MAE architecture itself, we chose it as the baseline. We exclude CL-MAE [1] and ADIOS [2] from our baselines since both utilize additional modules, such as a curriculum masking module or an occlusion module, which are parameterized and must be trained in addition to MAE. The figures reported in their original papers are not comparable, since CL-MAE experimented only on a subset (20%, randomly selected 200 classes) of ImageNet-1K (IN1K), which does not align with our main purpose of self-supervision on large datasets. Also, since CL-MAE empirically takes about 6x the training time, we could not run it within the short rebuttal period and will add it in the camera-ready version. ADIOS was likewise evaluated in its original paper on a downsized version (224→96) of a subset (10%, ImageNet-100) of IN1K and is not applied to the MAE architecture. Acknowledging their relevance, we will introduce CL-MAE and ADIOS in the related work section.

Reflecting the reviewer's point, we compare with two other models (SemMAE [3] and HPM [4]) that use an additional module or external model in Table B of the rebuttal PDF. Since these baselines incorporate additional parameters (external models or extra modules attached to MAE), they slightly outperform ours. However, these models incur significantly larger training costs for the additional parameters, while our method requires only a constant additional time (about 0.25% of the training cost of MAE) for mask generation, relying solely on MAE itself without introducing any additional resources.

[Q1] Ablation on masking ratio

We provide an ablation study on the masking ratio in Table C of the rebuttal PDF. A masking ratio of 0.6 shows slightly better performance (+0.1%), while a masking ratio of 0.9 shows degraded performance, a trend similar to the results in [5]. We had to conduct this experiment with 200 pre-training epochs due to the short rebuttal period, but will add a similar table with the regular 400 epochs of pre-training in the camera-ready version.

[Q2] Hinting strategy

Both ideas are very interesting, and we explored them in ablation studies. S_i-based weighted sampling is examined in the ablation studies (Ln 273-277, Table 3) of the main paper, where it exhibited slightly degraded performance. Sampling hint tokens from different clusters is presented in Table I in the supplementary material. We sampled hint tokens from two major clusters alternately throughout the training epochs, which also resulted in slightly lower performance compared to our default setting.

[1] Madan, Neelu, et al. "CL-MAE: Curriculum-Learned Masked Autoencoders." Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2024.

[2] Shi, Yuge, et al. "Adversarial masking for self-supervised learning." International Conference on Machine Learning. PMLR, 2022.

[3] Li, Gang, et al. "Semmae: Semantic-guided masking for learning masked autoencoders." Advances in Neural Information Processing Systems 35 (2022): 14290-14302.

[4] Wang, Haochen, et al. "Hard patches mining for masked image modeling." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

[5] He, Kaiming, et al. "Masked autoencoders are scalable vision learners." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.

Comment

I appreciate the authors' detailed response. I encourage the authors to explicitly include these additional results in the next version of the paper. Additionally, I suggest that the authors explore the potential benefits of increasing the masking ratio in both linear probing and fine-tuning, as this may improve the efficiency of the approach. After considering the response, I will maintain my original score.

Comment

We thank the reviewer for the reply.

As the reviewer recommended, we will definitely include Table B (comparison with additional baselines) and Table C (ablation study on masking ratio) from the rebuttal PDF to the main paper. Additionally, we have included the results of fine-tuning with various masking ratios (0.6, 0.75, 0.9) in the table below.

| Masking Ratio  | 0.6  | 0.75 | 0.9  |
| -------------- | ---- | ---- | ---- |
| Linear Probing | 54.5 | 54.4 | 53.2 |
| Fine-Tuning    | 83.3 | 83.2 | 82.5 |

Our method accelerates MAE training by leveraging its intrinsic property, i.e., patch clustering, without relying on external information. Consequently, our method mirrors the behavior of MAE with respect to the masking ratio, showing that a 0.75 masking ratio is optimal for both MAE and our method in terms of both efficiency and performance. Please note that the ablation study presented above was conducted with 200 epochs due to the limited time available during the rebuttal period. These results will be updated to 400 epochs in the final version.

If there are any further questions or points of clarification needed, we would be grateful for the opportunity to provide additional information. Thank you for your thoughtful feedback and consideration.

Official Review
Rating: 6

This paper proposes a novel masking strategy, Self-Guided Informed Masking for pre-training MAE. The authors found that MAE learns patch-level clustering of visual patterns at the early stages of training, which is demonstrated through an in-depth analysis. Based on this insight, the authors suggest bi-partitioning the image to identify and mask entire objects and then reconstruct them using a small number of hint tokens. Comprehensive experiments across various downstream tasks validate that the proposed strategy consistently improves the performance of MAE.

Strengths

[S1] The paper provides a novel and interesting observation to understand the learning mechanism of MAE, along with a comprehensive analysis to justify the observation.

[S2] The paper shows that the proposed approach consistently and significantly improves MAE’s performance on various downstream tasks.

[S3] The overall writing is smooth and easy to follow.

Weaknesses

[W1] Lack of baselines.
[W1-1] Though they chose only MAE and AMT to show the effectiveness of the approach, this is too limited to fully demonstrate their claim. Specifically, guiding the mask with a pre-trained network should also be considered a baseline if the network is not trained in a supervised manner, e.g., AttMask [1].
[W1-2] In addition, HPM [2] should be included as a baseline, since it also proposes an automatic masking strategy that does not use any external data or pre-trained model.

[W2] Scalability.
Even though other MIMs [3-4] have already become SOTA in the image domain, I believe the core advantage of MAE over other SSLs is scalability [5]: e.g., it performs significantly better with more pre-training epochs, larger models, or more data. The authors should demonstrate this scalability, e.g., by providing an accuracy curve and results on ViT-L/16.

[1] Kakogeorgiou, Ioannis, et al. "What to hide from your students: Attention-guided masked image modeling." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022.
[2] Wang, Haochen, et al. "Hard patches mining for masked image modeling." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
[3] Assran, Mahmoud, et al. "Self-supervised learning from images with a joint-embedding predictive architecture." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
[4] Wang, Haochen, et al. "Droppos: Pre-training vision transformers by reconstructing dropped positions." Advances in Neural Information Processing Systems 36 (2024).
[5] Singh, Mannat, et al. "The effectiveness of MAE pre-pretraining for billion-scale pretraining." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

Questions

[Q1] Can this approach be applied to other domains or modalities too? It would be more impactful if it could handle other domains, since MAE has been shown to be a modality-agnostic self-supervised learning method [1-2].

[1] Jang, Huiwon, et al. "Modality-agnostic self-supervised learning with meta-learned masked auto-encoder." Advances in Neural Information Processing Systems 36 (2024).
[2] Xie, Johnathan, et al. "Self-Guided Masked Autoencoders for Domain-Agnostic Self-Supervised Learning." arXiv preprint arXiv:2402.14789 (2024).

Limitations

They addressed the limitations.

Author Response

Thank you for your valuable feedback. We have tried to alleviate all the concerns raised.

[W1] Additional baselines

We agree that it would be valuable to compare our method with others that rely on external pre-trained models or additional modules, as long as the comparison is fair. We include a table comparing our method to these models, including HPM [1], in the rebuttal PDF. (We will update the table with a unified 400 pre-training epochs in the camera-ready version.)

Upon a thorough examination of AttnMask [2], the model suggested by the reviewer, we conclude that this approach DOES supervise the model with labels, making it no longer comparable with our self-supervised method. For this reason, we exclude it to keep the comparison fair. We highlight that our main contribution is achieving complete independence from any extra resources while successfully expediting the learning process of MAE, as demonstrated in our analysis.

[W2] Scalability

We thank the reviewer for the feedback on scalability. Scalability with respect to pre-training epochs can be found in the supplementary materials (up to ~1600 epochs). Although we would like to include the suggested experiments now, it is challenging to conduct experiments on larger model sizes and datasets during this short rebuttal period. We will add experiments on ViT-L/16 with 400 pre-training epochs, compared against MAE, in the camera-ready version.

[Q1] Applicability across various domains and modalities

Since our method accelerates the training process of MAE without altering MAE's inherent learning (i.e., patch clustering), we believe it is applicable to any domain or modality where MAE itself is suitable.

[1] Wang, Haochen, et al. "Hard patches mining for masked image modeling." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

[2] Kakogeorgiou, Ioannis, et al. "What to hide from your students: Attention-guided masked image modeling." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022.

Comment

Thank you for your response. Please add scalability experiments in the final draft.

Nevertheless, while I again appreciate the in-depth analysis of MAE, my concern from [W1] and [Q1] is about the contribution of the masking strategy. Practically, we can use other strategies that utilize additional architecture or a pre-trained network in an unsupervised manner (furthermore, they can perform better with a better pre-trained network), and the proposed approach seems to perform worse than SemMAE or HPM.

Comment

Thanks for the reply.

Although SemMAE and HPM demonstrate superior performance compared to our method, we respectfully request that the training time be taken into consideration. As detailed in our rebuttal PDF, SemMAE and HPM incur 3.15x and 1.5x the training cost, respectively, while our approach requires only a marginal increase of 1.0025x compared to the vanilla MAE. For a fair comparison, SemMAE / HPM / ours should be pre-trained for 400 / 800 / 1200 epochs, respectively.

We fully recognize that enhancing the performance of MAE with additional resources is an important area of research. However, we kindly ask the reviewer to consider our work from a different perspective. Our research is centered on understanding the internal mechanisms of MAE and autonomously improving its performance while maintaining its intrinsic property of patch clustering. For instance, given that SemMAE could benefit from a better pre-trained network, our self-guided MAE could be utilized as an external pre-trained model to further enhance SemMAE's performance. This suggests that SemMAE and our approach serve different purposes, could potentially work synergistically, and should therefore be evaluated accordingly, recognizing their distinct contributions.

Comment

Thanks for your rebuttal addressing my questions! I have raised the rating to 6.

I would like to note that clarifications about the comparisons with other works should be included in the final manuscript to improve the clarity of the work.

Comment

Thank you for your thoughtful consideration and for recognizing the contributions of our work.

We will make sure to incorporate these comparisons in the final manuscript to provide a clearer understanding of our contributions relative to existing methods.

Official Review
Rating: 5

This paper demonstrates that Masked Autoencoders (MAEs) learn pattern-based, patch-level clustering and that they obtain this ability in the early stages of pre-training. Based on this finding, it proposes a self-guided Masked Autoencoder method, which utilizes self-attention map information to mask out meaningful foreground information. The experiments show that the proposed method outperforms the vanilla method as well as AMT, another self-attention guided MAE method.

Strengths

  • I enjoyed reading this paper, as it begins by analyzing the properties of MAEs to identify the drawbacks of the methods and then proposes a novel method based on these analyses. I believe such an attempt to motivate the method via empirical analyses should be encouraged. Additionally, this paper includes post-hoc analyses of the proposed method, which are helpful in understanding the method.
  • The direction of introducing a novel masking strategy in this study seems promising and could potentially lead to significant advancements in the field.
  • The proposed method outperforms both the vanilla MAE and AMT, not only in image classification tasks but also in dense prediction tasks.

Weaknesses

I believe the main weaknesses of this paper are two-fold: soundness and novelty.

  • Soundness: I am not fully convinced by the claims in this paper. The following are examples, though not exhaustive.
    • In L99, dissimilarities between the main objects and the mean values of tokens might not directly imply "the MAE encoder has learned patch clustering based on visual patterns," and it is insufficient to conclude that MAE can clearly distinguish between backgrounds and the object based on the results in Fig 2b. Alternatively, it might be possible to simplify or even skip Section 3.2, as [40] addresses a similar claim.
    • Since MAE does not train a CLS token, I am not convinced that the results of Fig 2c lead to meaningful findings.
    • In L126, even if it is true that "the decoder learns a similar but less clear pattern than the encoder," the main objective of MAE is training the encoder, not the decoder. Therefore, the implications of this statement are not clear.
    • In L195, the drawbacks of random masking appear somewhat suddenly. I couldn’t find direct links between the analyses in Section 3 and the inefficiency of the random masking strategy.
  • Novelty: The core takeaways from the analyses and the proposed method are not groundbreaking.
    • L33: Even though Figure III in the appendix elegantly shows the token-wise clustering properties of MAE, the claim regarding the "MAE’s property that learns pattern-based patch-level clustering" is already demonstrated in [40]. Furthermore, according to [40], this property is not unique to MAE but is general to MIM methods. Additionally, it might be an interesting finding that this property emerges in the extremely early stages of training, but I wouldn’t say the 40th or 100th epochs constitute an extremely early stage. It might not be an apples-to-apples comparison, but [22] demonstrates that 100 epochs are a sufficient setting to achieve reasonable performance.
    • L37: As this paper also cited, the proposed method is not the only method that utilizes a self-attention mask to introduce an efficient masking strategy. Similarly, prior works, such as AttnMask [28], offer comparable insights.
  • Writing: Although the writing and organization are not the major weaknesses of this paper, as I understand it, there is room for improvement.
    • This paper occasionally refers to figures from later sections at the beginning, causing me to go back and forth to understand the context fully.
    • Should the paper introduce the (accumulated) exploitation rate instead of using NMI or self-attention distance?
    • Although the proposed method is straightforward and easy to understand, including a diagram to overview the method would enhance its readability.
    • I sometimes struggled to find the definitions of symbols and hyperparameters, which could be more clearly indexed or referenced.
    • In Section 5.4, explicitly stating the implications of the experimental results would significantly improve the paper.
  • Method: Likewise, I believe there is room for improvement in the proposed method.
    • In L218, this method utilizes bi-partitioning and assumes that the dominant tokens represent background parts. However, its effectiveness might depend on the ImageNet dataset. Images with many objects or those containing large objects could be considered; in such cases, this assumption might no longer hold.
    • In L238, the method employs the second-to-last encoder layer to utilize the self-attention map. However, it would also make sense to use the last self-attention layer for the sake of simplicity, since I assume there would only be minor performance degradation.
    • For a fair comparison, the experiments neglect methods that utilize external information or models, but it would be great if this paper could compare the method and emphasize the pros and cons alongside those methods.

Overall, I am on the fence but lean towards acceptance since the strengths outweigh the weaknesses. However, I believe there is room for improvement, and a major revision could significantly strengthen this paper.

Questions

Please refer to the weaknesses section.

Limitations

I don't find any ethical issues. Please refer to the weaknesses section.

Author Response

Thank you for your thorough review and constructive feedback; we have carefully considered your comments and made efforts to address each of the concerns raised. Due to length limitations, we have had to use references from the main paper. We apologize for any inconvenience this may cause.

[W1] Does Fig. 2 indicate that MAE clusters patches based on visual patterns?

We understand that Fig. 2 may not be sufficient to claim visual pattern-based patch clustering. To support this, we refer to Figure III in Appendix, which qualitatively shows token clustering based on their visual pattern. Following your suggestion, we will refer to [40], which demonstrates that 1) MIM models can distinguish tokens and 2) MIM models focus on high-frequency information.

[W2] Attention map with CLS token

Correct. Since the CLS token is not updated during training, it does not contain particularly meaningful information. The main point of Fig. 2c is that if patch vectors are well clustered, they can be distinguished by any random vector. We will clarify Fig. 2c in camera-ready.

[W3] Unclear statement (L126)

We agree that MAE mainly aims to obtain a well-trained encoder. However, we discuss both the encoder and decoder to show that the entire architecture of MAE aims to learn patch clustering, as our analysis targets the whole MAE. Also, we apologize for the misuse of the term 'pattern' in L126, which could be confused with 'visual pattern'; we will replace it with 'trend'.

[W4] Drawbacks of random masking (L195)

Thanks for this clarifying question. In Sec. 3, we discover that MAE is already well trained to separate the image into major feature clusters in early epochs. After the initial T epochs, MAE becomes sufficiently trained to distinguish tokens into major clusters. Random masking, however, keeps producing easily separable training examples, without taking advantage of the knowledge learned so far. We propose a method to exploit this to expedite the learning process in Sec. 4. In this context, we describe it as 'delaying the learning of patch clustering' and identified this as a drawback of random masking. We will clarify this in the paper.

[W5-1] Difference from [40] (L33)

To clarify, the demonstration in [40] about MIM models is confined to 'token distinguishment', distinct from our observation of 'pattern-based patch clustering'. We claim that merely distinguishing tokens is not sufficient to establish our method, as it does not indicate that MAE can generate meaningful patch clusters for informed masking.

[W5-2] T = 40 epochs is not early

It may be subjective whether 40 epochs is extremely early or not, but our contribution lies in the discovery that MAE trained for fewer than T epochs is insufficiently trained and would be inadequate for downstream tasks. As this is the empirical minimum, any pre-training method with T or more epochs will benefit from our method, and in this context we argue that T epochs constitutes an 'early' stage.

[W6] Difference from prior works (L37)

While our masking strategy shares some common aspects with them, the fundamental difference is that ours relies solely on MAE itself, established through our original analysis of MAE's internal operations. AttnMask [28] relies on supervised training with external labels, unlike our complete self-supervision. We highlight that our main contribution is achieving complete independence from any extra resources while successfully expediting the learning process of MAE.

[W7,9,10,11] Suggestions for writing quality

We thank the reviewer for the suggestions to improve our manuscript. We will incorporate them (overview diagram, notation summary, and summary of experimental implications) in the camera-ready version.

[W8] Why do we need exploitation rate?

Employing NMI or attention distance might seem feasible for deciding whether the MAE is sufficiently trained to provide informed masks, but they pose the challenge of setting a proper threshold, since both are unbounded. Our exploitation rate quantifies the shared information within the mask tokens compared to the visible tokens, providing a bounded score and thereby easing the thresholding.
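As a toy illustration of why such a score is naturally bounded (this is our reading of the description above, not the paper's exact formula): attention rows sum to one, so the mass that queries place on mask-token keys is automatically confined to [0, 1].

```python
import numpy as np

def exploitation_rate(attn, is_mask):
    # attn: (N, N) row-stochastic decoder self-attention map;
    # is_mask: (N,) boolean flags marking mask tokens.
    # Each row of attn sums to 1, so the attention mass a query places on
    # mask-token keys lies in [0, 1]; averaging over queries keeps the
    # score bounded, which makes picking a fixed threshold easy, unlike
    # unbounded quantities such as NMI or attention distance.
    mass_on_mask = attn[:, is_mask].sum(axis=1)
    return float(mass_on_mask.mean())

# With uniform attention, mask tokens receive mass equal to their share
# of the sequence (here 2 of 4 tokens).
attn = np.full((4, 4), 0.25)
rate = exploitation_rate(attn, np.array([True, True, False, False]))
```

A rate well above this uniform baseline would indicate that mask tokens are being exploited as heavily as visible ones, i.e., that they carry usable clustering information.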

[W12] Effectiveness of bi-partitioning in diverse image contexts

We acknowledge the concern regarding images with large objects or multiple objects.

For an image with a large object, only sub-parts of the object will be masked, because the object occupies more of the image than the masking ratio covers. This scenario still benefits from our method, as intensive masking on the main object still leads to better representations (see Fig. II in the Appendix).

As you pointed out, images with many objects might present a challenge for applying informed masking effectively. As there will be many small clusters, masking specific clusters with informed masking may yield similar masks to random masking. However, we emphasize that this limitation is not unique to our method but is a common issue across various informed masking strategies. We will explicitly note this limitation in the paper to provide a clearer understanding of the scope and applicability of our method.

[W13] Using the last encoder layer for informed masking (L238)

We agree with your insight. We simply follow the analysis results and employ the second-to-last layer, but as pointed out, using the last layer would have only a minor impact on performance. We will note this in the camera-ready version.

[W14] Additional baselines

We provide Table B in the rebuttal PDF to compare with other methods that use external pre-trained models or additional parameterized modules. Owing to these extra components, they slightly outperform our method. However, we emphasize that our method is entirely independent of any extra resources, nearly maintaining the original cost of MAE (+0.25% additional cost), in contrast to other models that require extra cost for training external models and incorporating additional parameters into MAE.

Comment

Thank you, authors, for the clarifications. In particular, I believe the claims from W1 to W5-1 address my concerns. I encourage the authors to include such points more explicitly in the text. Although I am still not fully convinced of the strong links between the analysis and the method proposal sections, I understand it might be challenging to articulate such claims. Regarding W5-2, describing T=40 as "extremely early stages of training" seems somewhat overstated.

Overall, I still believe there are useful takeaways from this paper, and I would like to maintain my original score. If the writing is improved and stronger links are established between the analysis and the method sections, I could consider rating this paper as a solid 6.

Comment

Thanks for your positive reply. We are pleased that our rebuttal has addressed most of your concerns. Our responses to your new comments are as follows:

1. Links between the analysis in Sec. 3 and our method in Sec. 4

In Sec. 3.2 we first show that the MAE encoder learns patch clustering after training through qualitative and quantitative analyses comparing MAE to MoCo and ViT. Then in Sec. 3.3, we show that the MAE encoder learns patch clustering from an early stage through bi-partitioning and KL divergence analyses. Bi-partitioning is sufficient to distinguish tokens into two major token clusters and construct informed masks by masking out one of them. This result indicates that we can generate informed masks with MAE itself relatively early in the pre-training phase and use these informed masks for the remainder of the training.

Then, the next question is when exactly the MAE can properly cluster the patches. In other words, when exactly can MAE generate informed masks by itself and start to employ them? We answer this in Sec. 3.4 via exploitation rate analysis following the reasoning below.

Once the encoder has effectively learned to cluster patches, its outputs embody this information. These outputs are then used to constitute mask tokens within the decoder. As a result, the mask tokens carry the patch clustering information and become heavily utilized in reconstructing the masked-out patches. By reversing this reasoning, a high exploitation rate of mask tokens in the decoder suggests that they have successfully inherited proper patch clustering information from the encoder, indicating the encoder's proficiency in clustering patches. We validate this through the exploitation rate analysis, which demonstrates that mask tokens begin to be exploited as extensively as visible tokens from an early epoch, i.e., exactly from epoch T.

In summary, 1) we conduct bi-partitioning and KL divergence analyses to show that MAE learns patch clustering from an early stage, and 2) we propose the "exploitation rate" to determine the precise point at which MAE can independently generate informed masks. This finding allows us to confidently generate informed masks from epoch T onward, ultimately leading to the design of our method.

As suggested, we will clarify the links between analysis section and our method in the main paper as above.

We have made every effort to clarify the explanation. However, if any aspects remain unclear or if there are additional questions, we would appreciate the reviewer reaching out for further clarification.

2. T = 40 epochs is not an "extremely early stage"

Considering that training for 100 epochs shows sufficient performance in [22], we agree that the expression "extremely early" is somewhat overstated and needs to be toned down. We will revise it to "relatively early phase of training".

If you still find this expression strong or have any other suggestions, please let us know.

Comment

To provide further clarity and detail on our work, we have included additional comments.

First, let us explain why informed masking has led to performance gains in prior works—a point that, to the best of our knowledge, has not been clearly articulated before our study. Traditional random masking masks patches randomly across the entire image. According to our analysis in Sec. 3.2, this approach leads the MAE to learn a broad and less focused patch clustering across the whole image. In contrast, informed masking concentrates the masking on object-centric regions, which helps the MAE learn stronger and more meaningful patch clustering specifically within the object-related areas. By narrowing the masking focus, we can guide MAE to concentrate on learning patch clustering within the object regions, i.e., the loss affects only the object-related parts, thereby accelerating the learning of patch clustering in those regions.

In previous works, informed masks were generated at the cost of extra resources, which is a straightforward approach. However, our goal is to develop a method that is independent of external models or additional parameters. We specifically aimed to explore whether MAE can autonomously perform informed masking without relying on such external aids.

To achieve this, we conduct an in-depth analysis of MAE’s intrinsic learning properties to understand how it naturally clusters patches. Our analyses include:

  1. Section 3.2: We demonstrate that MAE learns patch clustering after training through qualitative and quantitative comparisons with MoCo and ViT.

  2. Section 3.3: We show that MAE begins to learn patch clustering from an early stage using bi-partitioning and KL divergence analyses. These findings indicate that MAE is capable of generating informed masks by itself at an early phase of training.

  3. Section 3.4: To determine when MAE can effectively cluster patches and generate informed masks, we introduce the "exploitation rate" analysis. This analysis reveals that once the encoder has learned to cluster patches, this information is passed to the mask tokens in the decoder. A high exploitation rate of these tokens indicates successful patch clustering, and we identify a specific epoch, T, when MAE is ready to generate informed masks independently.

By confirming that MAE can autonomously perform informed masking, we avoid the need for external models and additional parameters. Based on these analyses, we propose using object-centric informed masking to enhance MAE’s performance. Instead of randomly masking the entire image, we concentrate the masking on object-related regions, allowing the MAE to learn stronger patch clustering in these areas, thereby improving its overall efficiency and effectiveness.
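The object-centric masking described above depends on bi-partitioning the patches using the similarity structure the encoder has already learned. One standard way to realize such a two-way split—offered here only as a hedged sketch, not as the authors' exact procedure—is a spectral cut: build a cosine-similarity graph over patch embeddings and threshold the Fiedler vector of its Laplacian. All names below are illustrative.

```python
import numpy as np

def bipartition_patches(feats):
    """Two-way spectral cut over patch embeddings (illustrative sketch).

    feats: (N, D) patch embeddings from the encoder.
    Returns a boolean array assigning each patch to one of two clusters.
    Deciding which side holds the main object (e.g., by attention mass)
    is a separate step, not shown here.
    """
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    W = np.clip(f @ f.T, 0.0, None)      # non-negative affinity graph
    L = np.diag(W.sum(axis=1)) - W       # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    fiedler = vecs[:, 1]                 # second-smallest eigenvector
    return fiedler > np.median(fiedler)

# Two visually distinct groups of patches should land on opposite sides.
group_a = np.tile([1.0, 0.1], (4, 1))
group_b = np.tile([0.1, 1.0], (4, 1))
part = bipartition_patches(np.vstack([group_a, group_b]))
```

Once partitioned, masking the object-side cluster (minus a few hint tokens) yields the informed mask discussed above.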

Based on this elaboration, we will revise our paper as follows:

  1. At the beginning of Section 3, we will explicitly state our motivation for the analyses in Sections 3.2 and 3.3.
  • To achieve independence from external models and additional parameters, we first need to uncover the learning properties of MAE.
  2. At the start of Section 3.4, we will explain the reason behind the exploitation rate analysis.
  • In brief: When the encoder begins clustering patches, this information is transferred to the mask tokens in the decoder, leading to higher exploitation of these tokens by the decoder. By reversing this logic, we infer that a high exploitation rate of mask tokens indicates that the encoder can effectively cluster patches. This allows us to determine the exact point when the encoder can generate informed masks. To validate this, we propose the exploitation rate analysis.
  3. We will add an explanation on informed masking in the object-centric masking part of Section 4.
  • In brief: By narrowing the masking focus from the entire image to the object-related region (where the loss affects only this specific part), we can build stronger patch clustering within the object-related region.
审稿意见
6

Masked autoencoding (MAE) based pre-training has been widely adopted recently. However, it has not been fully uncovered what and how MAE exactly learns. In this paper, the authors provide an in-depth analysis of MAE and discover that it learns pattern-based patch-level clustering from the early stages of pre-training. Based on this finding, the authors propose a self-guided masking strategy that generates informed masks in an unsupervised manner without relying on external models or labels. Experiments on various downstream tasks show the effectiveness of the masking strategy in MAE pre-training.

优点

  • The paper provides an in-depth analysis of what and how MAE learns during pre-training.
  • The authors use attention scores and cosine similarity scores to analyze the pairwise relationships of tokens using the last layer of the ViT-B encoder and the decoder. The MAE-based qualitative analysis shows that patches are clearly clustered compared to MoCo and ViT. Quantitatively, the feature variance and similarity variance are high for the MAE encoder and decoder, which shows that it clusters patches based on their patterns. The analysis provides another insight into when MAE starts to learn patch clustering.
  • The exploitation rates of mask tokens and visible tokens show that every decoder layer holds a substantial amount of shared information estimated by the encoder, which means MAE is trained to efficiently cluster the patches.
  • Based on these findings, the authors propose a new method called self-guided informed masking.
  • Experiments on various datasets (IN1k, COCO, iNat19, etc.) show the effectiveness of the proposed masking strategy.

缺点

  • Although the authors show the performance of the masking strategy with 1600 epochs, is there any reason to use a model trained for 400 epochs for analysis? What is the motivation behind it? Did the authors use only 10% of the training data during pre-training for the analysis?
  • Given this patch clustering and estimated shared information, does it mean we need only a few tokens for downstream tasks? Any analysis with linear-layer fine-tuning and fewer visible tokens would be interesting.
  • It is still not clear whether the authors pre-train the model in two stages. Did they pre-train the model with random masking for T epochs and then continue pre-training it with informed masking? Any remarks on that would be helpful.
  • Given that the decoder has enough shared information to reconstruct the masked patches and the ablation table shows a performance boost, can we directly use the features extracted from the decoder for downstream tasks?
  • Can the authors shed light on the computation and memory costs and compare the approach with random masking? Given that the embeddings are extracted from either the encoder or the decoder for informed masking with bi-partitioning, would that increase the total pre-training time?
  • Any ablation on the masking ratio? How much hint is added in the informed mask?
  • Any learning comparison in terms of reconstruction loss, with and without hints? The reconstruction task becomes harder, and we know that the generalization of pre-training can be affected if the proxy task is too challenging.
  • The masking strategy looks like a combination of normalized cut, [1], and [2]. Can the authors please comment on the differences between self-guided masking and these methods?

[1] Self-supervised transformers for unsupervised object discovery using normalized cut.

[2] SemMAE: Semantic-guided masking for learning masked autoencoders.

问题

Please look at the weakness section for the questions and suggestions.

局限性

Yes, the authors has adequately addressed the limitations.

作者回复

We appreciate the reviewer's positive comments and constructive feedback. We have tried to address each of the concerns raised.

[W1] Reason for analyzing MAE with 400 pre-training epochs

To use a 1600-epoch pre-trained MAE for analysis, we would need to train MAE, MoCo, and ViT for 1600 epochs for a fair comparison. Due to the excessive time and resources required for this setting, we set the training time to 400 epochs following the literature: MAE [1], MoCo [2], and ViT [3]. These seminal works conducted their analyses with 400 or fewer epochs, reflecting the fact that 400 epochs is sufficient to obtain a well-trained embedding space for comparison. Nevertheless, we have shown the results with 1600 epochs to demonstrate the scalability and effectiveness of our method. Also, all the analyses have been conducted with models trained on the whole IN1K training set.

[W2] Using fewer tokens for downstream tasks

We think this is an interesting idea. Because MAE distinguishes features solely based on their visual patterns, if we can extract specific clusters of tokens tailored to a given downstream task, it would be feasible to use only these subsets of tokens for those tasks. Although this is beyond the scope of this paper, it would be a great future research topic.

[W3] Training process of our method

The model is trained in a single stage; at epoch T, we begin generating informed masks and continue the training process without interruption. We will state this point more clearly in the revised paper.

[W4] Using decoder features for downstream tasks

It is an interesting suggestion to use the decoder layers for downstream tasks once the MAE is sufficiently trained. Using decoder features might offer a slight benefit for tasks requiring pixel-level details, considering that decoder features are closer to the pixel-level space. However, using decoder layers would be less efficient since it requires additional inference time. This would be an interesting future direction for this line of research.

[W5] Training cost

Our method requires only one more step to generate masks, which empirically increases the pre-training time by about 0.25% (for 400 epochs of pre-training). We will make this clearer in the camera-ready.

[W6] Ablation on masking ratio

We provide an ablation study on the masking ratio in Table C in the rebuttal PDF. A masking ratio of 0.6 shows slightly better performance (+0.1%), while a masking ratio of 0.9 shows degraded performance, following a similar trend to the results in [1]. We had to conduct this experiment with 200 pre-training epochs due to the short rebuttal period, but we will add a similar table with the regular 400 epochs of pre-training in the camera-ready.

For hint tokens, we start by providing 10% of the masked tokens and gradually decrease the ratio to 5% by the end. We will include this information as well.
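The hint schedule above (10% of masked tokens revealed at the start, decaying to 5% by the end) can be sketched as a simple decay over epochs. The linear form below is an assumption for illustration; the exact decay curve is not specified here.

```python
def hint_ratio(epoch, total_epochs, start=0.10, end=0.05):
    """Fraction of masked-out tokens revealed as hints at a given epoch.

    Linear decay from `start` to `end` over training (assumed schedule);
    the ratio is clamped so out-of-range epochs stay within [end, start].
    """
    t = min(max(epoch / max(total_epochs - 1, 1), 0.0), 1.0)
    return start + (end - start) * t

# At epoch 0 the ratio is 0.10; by the final epoch it has decayed to 0.05.
first = hint_ratio(0, 400)
last = hint_ratio(399, 400)
```

In practice the resulting ratio would determine how many tokens from the masked (object-centric) cluster are kept visible to keep the reconstruction task tractable.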

[W7] Reconstruction loss with and without hints

As shown in the ablation studies in Table 3 of the main paper, the learning process can become too challenging with informed masking, making the reconstruction task difficult and ultimately resulting in degraded performance. The reconstruction losses (MSE) after 400 epochs of training for vanilla MAE, ours (with hints), and ours (without hints) are 0.41, 0.56, and 0.64, respectively, as shown in Table D in the rebuttal PDF.

Although a model can fail to learn generalized features when the task is too difficult, we successfully addressed this issue by providing hint tokens, and the consistent performance improvements across various downstream tasks verify our claim.

[W8] Difference from SemMAE [4]

Both masking strategies aim to mask out important regions of the image. However, the fundamental difference between SemMAE and ours is that SemMAE is 'supervised' by an external pre-trained model which requires extra training cost for pre-training this model, while ours is completely 'self-supervised'. As SemMAE is supervised by external model, it is not guaranteed that SemMAE still learns what MAE actually learns, i.e., pattern-based patch clustering. The training process of SemMAE can be interpreted as indirect feature distillation via reconstruction task, since it heavily depends on the quality of the features extracted from the external model, e.g., iBoT [5] features containing semantic segmentation information. On the other hand, our work identifies the property of MAE (i.e., patch clustering) emerging during the training process and utilizes this observation to accelerate MAE's learning of this property. In this context, we refer to our method as providing 'acceleration' rather than 'performance improvement' because it speeds up MAE's ability to learn its inherent features, rather than enhancing its feature space with external resources. Since our method relies solely on MAE itself, it is completely free from relying on the attributes of feature representation from external models, and this is the key difference between our method and SemMAE.

[1] He, Kaiming, et al. "Masked autoencoders are scalable vision learners." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.

[2] He, Kaiming, et al. "Momentum contrast for unsupervised visual representation learning." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020.

[3] Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).

[4] Li, Gang, et al. "Semmae: Semantic-guided masking for learning masked autoencoders." Advances in Neural Information Processing Systems 35 (2022): 14290-14302.

[5] Zhou, Jinghao, et al. "ibot: Image bert pre-training with online tokenizer." arXiv preprint arXiv:2111.07832 (2021).

评论

I thank the authors for the clarifications.

After reading my fellow reviewers' reviews and the authors' rebuttal, I still have some concerns, which are given below:

  • I'm still not convinced about using 400 epochs for analysis. I still believe at least 800 epochs should be used for an in-depth analysis.
  • Most of the comparison is done with MAE vanilla, can the authors please compare the performance with SemMAE? I believe SemMAE achieves 68.7 in linear probing on IN1k, while this approach achieves 65.9 with 800 epochs of pre-training. When the model is pre-trained using 1600 epochs, it then achieves 68.7.
  • The authors did not clearly state the training time overhead compared to vanilla MAE for 800 and 1600 epochs. Providing the numbers in terms of hours or days would be greatly appreciated. Furthermore, given the results of SemMAE and the need for an additional 800 epochs to achieve comparable performance, it would be helpful to include in the manuscript the amount of additional training time required.
  • I think the experiments related to using fewer tokens and decoder features for downstream tasks should be included in the manuscript. For fewer tokens, a simple experiment would be to sample let's say 75-90% of the high-quality informed tokens and do linear probing experiments.
  • The authors mentioned "On the other hand, our work identifies the property of MAE (i.e., patch clustering) emerging during the training process and utilizes this observation to accelerate MAE's learning of this property. In this context, we refer to our method as providing 'acceleration' rather than 'performance improvement' because it speeds up MAE's ability to learn its inherent features, rather than enhancing its feature space with external resources". I think these are overstated. What does acceleration really mean here? Is it accelerating the pre-training, or is it accelerating the learning? Given the bi-partitioning graph, it actually increases the pre-training overhead. If the learning is accelerated, does it mean the downstream task performance can be significantly improved with fewer pre-training epochs?

Overall, I still believe there are useful points to takeaway from this paper, but given my above comments, I will maintain my original score of 5.

评论

Thanks for the reply. Our responses for the additional concerns are as follows:

1. Analysis with 400 epochs

We appreciate the suggestion of using 800 epochs for analysis and fully understand its importance. However, we kindly ask the reviewer to consider that 400 epochs is a reasonable and sufficient duration for the following reasons:

As shown in the table below, measuring the feature variance and similarity variance of the MAE encoder over the course of training shows that MAE consistently exhibits properties related to patch clustering. MAE shows a steady increase in both variances, indicating that it effectively learns patch clustering from the early epochs and maintains this trend throughout training. Our analyses in Sec. 3.3, including bi-partitioning and KL divergence, as well as the observation that the exploitation rate of mask tokens at 400 epochs closely mirrors that at 800 epochs (Fig. 4 in Sec. 3.4), further support that MAE at 400 epochs reflects the same properties as MAE at 800 epochs.

Also, regarding MoCo and ViT, MoCo was trained for 200 epochs and ViT for 300 epochs on ImageNet-1K in their respective works, demonstrating that strong performance can be achieved within this range. We believe this is why neither MoCo nor ViT was experimented with beyond 300 epochs: it is sufficient to analyze the performance gains and limitations. Additionally, MAE itself shows considerable performance at 400 epochs as well. We believe it is important to maintain consistency throughout the paper, ensuring that the training epochs in the analysis (Sec. 3), experiments (Sec. 4), and further analysis (Sec. 5) align.

With 8 A6000 GPUs, training for 400 epochs takes 2.5 days. It is prohibitively time-consuming to rerun all experiments with 800 epochs of pre-training. We plan to report 800-epoch results in the camera-ready, but this is beyond what we can do within the one-week discussion period, since we would need to train MoCo and ViT for 800 epochs each.

We hope this clarification helps the reviewer understand that MAE demonstrates consistent patch clustering characteristics across these training durations, and the rationale and validity behind choosing 400 epochs for our analysis.

| Epoch | 1 | 200 | 400 | 800 |
|---|---|---|---|---|
| Feature variance | 0.031 | 0.068 | 0.074 | 0.082 |
| Similarity variance | 0.047 | 0.057 | 0.068 | 0.075 |
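The two metrics tracked above can be computed, in spirit, as the spread of (normalized) patch embeddings and the spread of their pairwise cosine similarities. The exact normalization used by the authors is not given here, so the functions below are an assumed formulation for illustration.

```python
import numpy as np

def feature_variance(feats):
    """Mean per-dimension variance of L2-normalized patch embeddings.
    Higher values indicate more diversified (less collapsed) features."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return f.var(axis=0).mean()

def similarity_variance(feats):
    """Variance of pairwise cosine similarities between patches.
    Higher values indicate sharper separation between patch clusters."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sims = f @ f.T
    iu = np.triu_indices(len(feats), k=1)  # distinct pairs only
    return sims[iu].var()

# Collapsed embeddings give zero on both metrics; clustered ones do not.
flat = np.tile([1.0, 0.0, 0.0], (8, 1))
mixed = np.vstack([np.tile([1.0, 0.0, 0.0], (4, 1)),
                   np.tile([0.0, 1.0, 0.0], (4, 1))])
```

Under this reading, the steady rise of both numbers in the table is consistent with patch clusters becoming progressively finer over training.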

2. Comparison with SemMAE

As presented in Table B of the rebuttal PDF, SemMAE requires approximately 3.15x the training time of the original MAE, while ours requires only 1.0025x. To ensure a fair comparison, SemMAE should be pre-trained for only 133 epochs to match the computational cost of training our method for 400 epochs. Considering that training MAE for 200 epochs yields 53.9% accuracy in linear probing, and assuming SemMAE maintains its 4.9% performance gap over MAE (at 800 epochs, which is their reported setting) with only 200 epochs, SemMAE would achieve at most 58.8% accuracy at 200 epochs. This means that even under the optimistic assumption that SemMAE is trained for 200 epochs rather than 133, our approach (62.9%) would still surpass SemMAE at the same training cost.

Throughout this discussion, we recognize that enhancing MAE's performance with additional resources, such as those used in SemMAE, is an important and valuable line of research. However, we respectfully request the reviewer to view our work from an alternative perspective. Our study is centered on understanding the internal mechanisms of MAE and improving its performance independently from any other additional resources, while maintaining its fundamental feature of patch clustering.

3. Training time

Training our model takes almost the same time (1.0025x) compared to training the vanilla MAE. Training our method using 8 NVidia A6000 GPUs (48GB) roughly takes 2.5 days for 400 epochs, 5 days for 800 epochs, and 10 days for 1600 epochs.

评论

(cont'd)

4. Using fewer tokens for downstream tasks

Following your advice, we performed linear probing on ImageNet-1K using MAE and our method, both pre-trained for 400 epochs. We selected the top 75% of tokens based on their scores (as defined at L222 in our manuscript), which indicate their relevance to the object, and used only these tokens for global average pooling. The results are presented in the table below. Both MAE and our method show slightly degraded performance compared to using all tokens, while ours still shows superior performance.

| | Entire tokens | Top 75% tokens |
|---|---|---|
| MAE | 61.4% | 59.2% |
| Ours | 62.9% | 60.6% |
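The probing experiment above amounts to ranking tokens by an object-relevance score and average-pooling only the top fraction before the linear classifier. A minimal sketch follows; the scoring function itself (the one defined in the manuscript) is treated as a given input, and all names are illustrative.

```python
import numpy as np

def pool_top_tokens(tokens, scores, keep=0.75):
    """Average-pool only the `keep` fraction of highest-scoring tokens.

    tokens: (N, D) token embeddings from the encoder.
    scores: (N,) object-relevance scores (assumed to come from the
            paper's scoring rule; any ranking signal works here).
    """
    k = max(1, int(round(len(tokens) * keep)))
    top = np.argsort(scores)[::-1][:k]   # indices of the k best tokens
    return tokens[top].mean(axis=0)      # pooled feature for the probe

# Four tokens with scores 0..3: keeping 75% pools the top three.
tokens = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
scores = np.array([0.0, 1.0, 2.0, 3.0])
pooled = pool_top_tokens(tokens, scores, keep=0.75)
```

The pooled vector would then feed the frozen linear classifier, exactly as with standard global-average-pooling linear probing.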

While it is clear that using fewer tokens selected by our method yields less favorable results, we agree that properly selected tokens could potentially lead to similar or even better performance with greater efficiency. Specifically, since our method focuses on robustly selecting tokens to be masked out, the selection process can be relatively simple rather than strictly optimized, e.g., a precise importance ranking of tokens. If we were to accurately select the most important tokens, we believe your idea could indeed be validated. Additionally, we will include this result in the camera-ready for further insight, as we agree with the reviewer that it provides valuable support for our masking method. In detail, the slight degradation in performance suggests that our method effectively masks object-centric tokens, since linear probing focuses on recognizing the main object. However, we kindly ask the reviewer to consider that further boosting downstream tasks or improving performance via token selection extends beyond the scope of our current work. We appreciate again the reviewer's insight.

5. Definition of "acceleration" used in our work

We would like to clarify that when we refer to "accelerated pre-training" as our main contribution, we mean shortening the pre-training time required to achieve the same quality of feature representation. In this context, the improvement in downstream tasks with the same number of pre-training epochs compared to MAE serves as evidence of the accelerated pre-training process.

Furthermore, to demonstrate that our method indeed enhances patch clustering—thereby directly accelerating MAE’s learning process—we conducted a thorough investigation of the embedding space of MAE and our method using various analysis tools, as detailed in Sec. 5.4. In summary, our approach significantly increases the exploitation of pattern information and produces more diverse features. Also, as shown in Table A of the rebuttal PDF, our method results in higher feature variance, which suggests finer patch clustering. We would also like to highlight that the additional cost of our method is only +0.25% (1.0025x) compared to the original MAE.

We sincerely appreciate again the reviewer's time and constructive feedback to improve our paper. We hope our responses address the reviewer's concerns and highlight the key contributions of our work effectively.

评论

I sincerely thank the authors for further clarifying my concerns. After reading the comments, I'm happy to increase my score to 6. I would like the authors to please add all of these discussions to the final revision.

评论

We sincerely thank you for your careful evaluation and for acknowledging the significance of our contributions.

We will ensure that all results derived from the reviewer's insights and suggestions are included in our final manuscript.

审稿意见
5

This paper investigates the properties of MAE and introduces a better masking strategy for MAE. Analyses using similarity scores and attention show that MAE learns to cluster the patches at the early epochs. On the other hand, analyzing exploitation rates of various layers implies that the encoder is trained sufficiently to cluster the patches after the early training. Based on the observation, Self-guided Informed Masking is introduced to accelerate MAE, which searches for the foreground tokens and masks them except a few tokens as a hint. Quantitative comparison in the experiments section verifies its effectiveness on MAE.

优点

  • The paper provides various analyses
  • The proposed Self-guided Informed Masking improves the baseline

缺点

  • There is a missing link or jump between the analyses in Section 3 and the proposed method in Section 4.

    • If the proposed Self-guided Informed Masking is based on the observations in Section 3, then the authors should verify whether the metrics reported in Section 3 are modified through the proposed masking strategy. However, the impacts of the proposed method are not investigated through the analyzing tools in Section 3.
    • What is the connection between the exploitation rate analysis and the design choices made in Self-guided Informed Masking?
  • Similar work to Self-guided Informed masking

    • SemMAE [1] also masks foreground tokens of the given image while leaving a few of them as cues for autoencoding. The authors should clarify how Self-guided Informed Masking differs from and improves upon SemMAE's approach
  • Lack of explanation

    • Key terms are not defined early enough in the paper. This paper considers visual patterns as a key aspect of the MAE pre-training. However, I cannot find any definition or explanation of the term 'pattern' until page 4 (line 116), which hinders understanding the main arguments in the early sections.
    • The term 'Hint strategy' should be explained in or near the caption of Table 3.
  • Inconsistent notation

    • M is used to denote both the set of mask tokens in Section 3 and the cluster containing the main object in Section 4. If there's no specific link between these definitions, such duplicate notation may confuse readers.
  • Performance improvements are marginal in some benchmarks (e.g., ADE20K segmentation, COCO detection, iNat2019)

[1] Li et al., "SemMAE: Semantic-Guided Masking for Learning Masked Autoencoders," NeurIPS, 2022.

问题

Please refer to the weaknesses part above

局限性

The authors provide a limitation at the end of the paper, but it does not seem to be a limitation of their method so much as a shortcoming of MAE.

作者回复

We appreciate the detailed feedback and valuable insights provided by the reviewer, and we have tried to address all concerns and suggestions.

[W1-2] Connection between the exploitation rate analysis in Sec. 3 and our method

We thank the reviewer for this constructive feedback. This will be explicitly stated at the beginning of Sec. 4 in the main paper as detailed below.

In Sec. 3, we show that the MAE encoder learns patch clustering from an early stage through bi-partitioning and KL divergence analyses. As bi-partitioning is sufficient to distinguish tokens into two major token clusters, we can bi-partition the image and mask out one of them. This result indicates that we can generate informed masks with MAE itself early in the pre-training phase and use these informed masks for the remainder of the training. Then, the next question is when exactly the MAE can properly cluster the patches. When the encoder is sufficiently trained to cluster patches, the encoder outputs reflect this information. Then, they are utilized to constitute mask tokens in the decoder. This means the mask tokens possess this patch clustering information and start to be highly exploited for reconstructing masked-out patches. By reversing this order, it can be inferred that a high exploitation rate of mask tokens in the decoder indicates that mask tokens have proper patch clustering information conveyed from the encoder, implying that the encoder can cluster the patches. We verified this via exploitation rate analysis, showing that mask tokens start to be exploited as much as visible tokens from an early epoch (epoch T). This finding allows us to confidently generate informed masks at epoch T, ultimately leading to the design of our method.

[W1-1] Analyzing tool in Sec. 3

Following your suggestion, we verify our method using the analysis tools from Sec. 3 in Table A of the rebuttal PDF. The results indicate that our method yields a more diversified feature space, with higher feature variance and similarity variance, aligning well with the analysis in Sec. 5.

[W2] Difference between SemMAE [1] and our method

Both masking strategies aim to mask out important regions of the image. However, the fundamental difference between SemMAE and ours is that SemMAE is 'supervised' by an external pre-trained model which requires extra training cost for pre-training this model, while ours is completely 'self-supervised'. As SemMAE is supervised by external model, it is not guaranteed that SemMAE still learns what MAE actually learns, i.e., pattern-based patch clustering. The training process of SemMAE can be interpreted as indirect feature distillation via reconstruction task, since it heavily depends on the quality of the features extracted from the external model, e.g., iBoT [2] features containing semantic segmentation information. On the other hand, our work identifies the property of MAE (i.e., patch clustering) emerging during the training process and utilizes this observation to accelerate MAE's learning of this property. In this context, we refer to our method as providing 'acceleration' rather than 'performance improvement' because it speeds up MAE's ability to learn its inherent features, rather than enhancing its feature space with external resources. Since our method relies solely on MAE itself, it is completely free from relying on the attributes of feature representation from external models, and this is the key difference between our method and SemMAE.

[W3, W4] Lack of explanation and inconsistent notation

Thanks for your thorough review. As suggested, we will define the term 'visual pattern' earlier in the paper and explain the hint strategy in the section pertaining to Table 3. Also, the notation M in Sec. 4 will be replaced with an alternative notation.

[W5] Performance improvements

We admit that the degree of improvement varies across various tasks, as the reviewer pointed out. We emphasize, however, that our method consistently improves performance across various tasks with different nature, e.g., requiring global understanding (classification) and pixel-level details (segmentation).

Additional comment on limitations

Our method may show less significant improvement when training with excessively fragmented images, especially for the segmentation task. In detail, since there would be numerous clusters within each image, masking specific clusters with informed masking may yield similar masks to random masking. We will note this as a limitation in the paper.

[1] Li, Gang, et al. "Semmae: Semantic-guided masking for learning masked autoencoders." Advances in Neural Information Processing Systems 35 (2022): 14290-14302.

[2] Zhou, Jinghao, et al. "ibot: Image bert pre-training with online tokenizer." arXiv preprint arXiv:2111.07832 (2021).

评论

Thank you for your positive feedback. We are pleased that our responses have addressed most of your concerns. We would also like to address your last concern regarding the effectiveness of our method compared to SemMAE.

As shown in Table B, SemMAE requires about 3.15x the training time due to the additional cost of the external model, whereas our approach requires approximately 1.0025x the training time. In this context, we respectfully suggest that for a fair comparison, SemMAE should be evaluated after 133 epochs, while our method would be evaluated after 400 epochs. Even if we optimistically assume that SemMAE maintains the 4.9% performance gap (as reported at 800 epochs) compared to the original MAE at 200 epochs, it may achieve at most around 58.8% accuracy in linear probing at 200 epochs. This indicates that even under this optimistic assumption, where SemMAE is trained for 200 epochs instead of 133, our approach (which achieves 62.9%) would still outperform SemMAE when comparing under the same training cost.

As suggested, to clearly convey the effectiveness of our method especially compared to SemMAE, we will update the performance comparison with SemMAE in Table B to reflect a unified training time in camera-ready.

Regarding the overall presentation, we will clearly explain the connection between the analysis in Section 3 and our method, addressing the initial concerns raised. Additionally, we will include the evaluation of our method using the analysis tools discussed in Section 3 in Section 5.4 for further verification of our method. We will also ensure that the comparison of our contributions relative to SemMAE is explicitly stated in the revised manuscript.

Again, we sincerely appreciate the reviewer’s thoughtful feedback and the effort invested in improving our paper.

Comment

I appreciate the author's clarification. The author's response partially addressed my concerns (W3, W4, and partial aspects of other questions). I hope the authors resolve the lack of explanation, inconsistent notation, and limitations of their approach during the revision.

[W1-1] Analyzing tool in Sec. 3

The additional experiments demonstrated the proposed method's improvements in both feature variance and similarity. If possible, during the revision, I hope the authors also show the relation between "feature variance and similarity" and quantitative performance, demonstrating that enhancing these metrics leads to better performance on downstream tasks.

[W2] Difference between SemMAE and our method

I understand the design difference between SemMAE and the proposed method. However, the comparison between them raises questions about whether the proposed method outperforms the masks generated by SemMAE in enhancing MAE's ability to learn its inherent features. It would be valuable to explain how the masks generated by each method differently impact MAE pre-training and to determine which approach is more effective.

[W1-2] Connection between the exploitation rate analysis in Sec. 3 and our method

Thanks for the detailed explanation, which has clarified the background of the proposed method. However, it seems that all the analyses in Section 3 only explain 'what and how MAE exactly learns' and 'when the clustering is sufficiently done'. A justification for "masking the cluster containing the main object" should be provided to bridge the gap between the background in Section 3 and the design choice of the proposed method. Could the authors provide more insight into this?

Comment

Thank you for your feedback. We provide our responses to the additional questions below.

[W1-1] Relation between "feature variance and similarity variance" and performance

To demonstrate the relationship between our analysis metrics and quantitative performance, we measure these metrics across the training epochs and present them alongside the performance of MAE and our method. As shown in the tables below, both MAE and our approach exhibit increasing feature variance and similarity variance over time, which directly indicates that patch clusters become finer during training, in line with the performance improvements. (The 1-epoch entry is omitted for our method because it starts informed masking only from epoch T > 0.) We would also like to highlight that our method consistently shows higher values, corresponding to its consistent performance improvements across epochs, as detailed in Table II in the Appendix.

Feature variance

| Epoch | 1 | 200 | 400 | 800 |
|---|---|---|---|---|
| MAE | 0.031 | 0.068 | 0.074 | 0.082 |
| Ours | - | 0.070 | 0.083 | 0.096 |

Similarity variance

| Epoch | 1 | 200 | 400 | 800 |
|---|---|---|---|---|
| MAE | 0.047 | 0.057 | 0.068 | 0.075 |
| Ours | - | 0.058 | 0.071 | 0.079 |

[W2-1] Does our method actually enhance MAE's ability to learn patch clustering?

To verify that our method effectively accelerates the learning of patch clustering, we conducted a detailed investigation of the embedding space using various metrics, including attention distance, Fourier analysis, and mask token variance, as outlined in Sec. 5.4. In summary, the higher exploitation of pattern information and increased mask token variance directly indicate that patch clusters are indeed more diversified. Furthermore, as shown in the table above in our response to [W1-1], our method demonstrates higher feature variance and similarity variance throughout the training epochs compared to the original MAE, which aligns with the development of finer patch clustering. Accelerated patch clustering can also be visually confirmed through qualitative analysis, as shown by the finer patch clusters in Figure VII in the Appendix.
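Of the probes listed above, attention distance is the easiest to state precisely. A minimal sketch follows; the paper's exact formulation may differ.

```python
import numpy as np

def mean_attention_distance(attn, grid_side):
    # Average spatial distance (in patch units) between query and key
    # positions, weighted by attention probabilities. Larger values mean
    # heads attend more globally. Assumes attn is [n, n] over patch
    # tokens only (no CLS token), with rows summing to 1.
    n = grid_side * grid_side
    ys, xs = np.divmod(np.arange(n), grid_side)
    dist = np.sqrt((ys[:, None] - ys[None]) ** 2 + (xs[:, None] - xs[None]) ** 2)
    return (attn * dist).sum(axis=1).mean()

# Identity attention (each patch attends only to itself) gives distance 0.
print(mean_attention_distance(np.eye(196), 14))  # 0.0
```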

[W2-2] Comparison to SemMAE regarding the generated masks

SemMAE leverages the semantic segmentation knowledge of a pre-trained iBoT to segment the image and explicitly masks out the segmented parts, guiding MAE to learn semantic segmentation. Thus, SemMAE aligns the segmentation of image features with that of iBoT rather than with MAE itself. In contrast, our method generates masks based on pattern-based patch clusters, aiming to accelerate MAE's learning of patch clustering, which MAE does naturally. Broadly speaking, semantic segmentation (as learned by SemMAE) and pattern-based patch clustering (as learned by our method and MAE) may capture mostly similar but subtly different patterns, and we believe the two will complement each other because of this difference in detail. We emphasize again that our method purely leverages MAE's intrinsic property, in that it is guaranteed to preserve MAE's original characteristics without external intervention.

We fully acknowledge that using additional resources to enhance MAE's performance is valuable. However, we kindly ask the reviewer to consider our work from the perspective of understanding and accelerating MAE's internal operations while preserving its inherent properties. We believe our method can be easily applied across various domains and modalities beyond images, with limited training data. This is because it ensures that the features consistently retain the inherent properties of the original MAE, without the risk of incorporating external properties that might compromise the original characteristics of MAE.

[W1-2] Connection between Sec.3 analysis and "object-centric masking" in Sec.4

We appreciate this clarifying question. Our approach stems from the observation that masking over the entire image leads MAE to learn patch clustering across the entire image, i.e., the loss affects the whole image. To refine this process, we restrict masking to object-centric regions. By narrowing the masking focus, we guide MAE to concentrate on learning patch clustering within the object regions, i.e., the loss affects only the object-related parts, thereby accelerating patch clustering in the object region.
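To make this design choice concrete, here is a hypothetical sketch of object-centric informed masking: patches are bipartitioned by a tiny k-means over their embeddings, the cluster lying closest to the image center stands in for the main object, and the mask budget is spent on that cluster first. Both the k-means step and the center heuristic are illustrative stand-ins, not the paper's actual procedure, which derives clusters from MAE's own learned patch clustering.

```python
import numpy as np

def informed_mask(patch_feats, mask_ratio=0.75, n_clusters=2, n_iter=10, seed=0):
    # Hypothetical informed masking: cluster patches, treat the cluster
    # nearest the image center as the object, and mask it first.
    n, _ = patch_feats.shape
    side = int(np.sqrt(n))
    rng = np.random.default_rng(seed)

    # Tiny k-means on patch embeddings.
    centers = patch_feats[rng.choice(n, n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((patch_feats[:, None] - centers[None]) ** 2).sum(-1)
        assign = d2.argmin(1)
        for k in range(n_clusters):
            if (assign == k).any():
                centers[k] = patch_feats[assign == k].mean(0)

    # Heuristic: the cluster whose members lie closest to the image
    # center is assumed to contain the main object.
    ys, xs = np.divmod(np.arange(n), side)
    center_dist = (ys - (side - 1) / 2) ** 2 + (xs - (side - 1) / 2) ** 2
    valid = [k for k in range(n_clusters) if (assign == k).any()]
    obj = min(valid, key=lambda k: center_dist[assign == k].mean())

    # Spend the mask budget on the object cluster first, then fill the
    # remainder with random background patches.
    budget = int(mask_ratio * n)
    obj_idx = np.flatnonzero(assign == obj)
    bg_idx = np.flatnonzero(assign != obj)
    masked = obj_idx[:budget]
    if len(masked) < budget:
        extra = rng.choice(bg_idx, budget - len(masked), replace=False)
        masked = np.concatenate([masked, extra])
    mask = np.zeros(n, dtype=bool)
    mask[masked] = True
    return mask
```

With a 14x14 patch grid and a 75% mask ratio, this masks 147 of 196 patches, concentrated on the presumed object region.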

We appreciate the reviewer’s insights and will explicitly address this point in the main paper as suggested. We hope our responses address the reviewer’s concerns and provide clarity on our contributions.

Comment

I appreciate your quick responses.

  • The additional feedback addresses the remaining concerns on W1-1, W2-1, and W1-2. Specifically, the supplementary results demonstrating the correlation between the performance of MAE-based models and the metrics (feature variance & similarity), resolve my concern on W1-1.

  • However, some concerns still exist regarding W2-2. While the proposed mask generation method is grounded in MAE's intrinsic properties, its effectiveness is not obvious if it does not outperform other mask generation methods that don't exploit these properties (e.g., SemMAE). The effectiveness of the proposed method needs to be more clearly demonstrated.

  • Honestly, in my perspective, the current version of the paper falls short of the expected bar. The authors' claims are not clearly conveyed throughout the paper, necessitating further explanation. The overall presentation of the paper requires improvement. Nevertheless, given that the analytical background and motivation provided during the rebuttal period are reasonable, and the authors clarified most of my concerns, I decided to raise my rating to 5. I recommend revising the paper overall.

Author Response

Dear all reviewers,

Thank you for your time and effort in reviewing our paper. We have carefully addressed all the concerns raised and incorporated the valuable suggestions. We kindly request that you re-evaluate our paper in light of these rebuttals.

Final Decision

This paper explores the properties of MAE and introduces an improved masking strategy for MAE. Specifically, it proposes a self-guided masked autoencoder that internally generates informed masks by leveraging its progress in patch clustering. All reviewers agree that the analysis is interesting and that the results are robust compared to baseline methods. Concerns regarding a more detailed analysis and differentiation from previous work have largely been addressed during the rebuttal. Final ratings ranged from Borderline Accept to Weak Accept. Given the intriguing findings, clear methodology, and significant results, the Area Chair recommends accepting the paper.