PaperHub

ICLR 2025 · Withdrawn

Average rating: 4.4/10 (5 reviewers; ratings 3, 6, 5, 3, 5; min 3, max 6, std. dev. 1.2)
Average confidence: 4.6
Correctness: 2.0 · Contribution: 2.6 · Presentation: 2.4

Downstream Task Guided Masking Learning in Masked Autoencoders Using Multi-Level Optimization

OpenReview · PDF
Submitted: 2024-09-26 · Updated: 2024-11-28
TL;DR

The Multi-level Optimized Mask Autoencoder (MLO-MAE) improves visual representation learning by using feedback from downstream tasks to optimize the masking strategy during pretraining, leading to better performance across various datasets and tasks.

Abstract

Keywords

Multi-level Optimization, Mask Autoencoder, Self-Supervised Learning, Image Masking Strategies, Representation Learning, Vision Transformers

Reviews and Discussion

Official Review (Rating: 3)

Existing Masked Image Modeling (MIM) methods mask patches without considering feedback from downstream tasks. This study introduces a method to utilize such feedback for learning an optimal masking strategy. The authors propose a framework called Multi-level Optimized Mask Autoencoder (MLO-MAE), which leverages downstream task feedback to determine an effective masking strategy during pretraining. Specifically, MLO-MAE operates in three stages: pretraining the image encoder, training the classification head, and updating the masking network. The proposed masking strategy, guided by the feedback from the classification task, tends to mask patches on foreground objects directly related to class labels. Empirical results on eight benchmarks demonstrate the effectiveness of MLO-MAE, showing it outperforms the MAE baseline.

Strengths

Extensive experiments across eight benchmarks and three different tasks (i.e., classification, object detection, and semantic segmentation) demonstrate the effectiveness of the proposed MLO-MAE, surpassing the performance of the MAE baseline.

Weaknesses

Deviation from the Motivation: The paper's motivation is that "previous MIM methods mask patches without incorporating feedback from downstream tasks (line 053)," and this study seeks to "use feedback from downstream tasks to learn the optimal masking strategy (line 062)." However, the current implementation only utilizes feedback from the classification task to guide the masking strategy, deviating from the original motivation. If downstream tasks were object detection or semantic segmentation, relying solely on a classification task in Stage 2 would likely result in a suboptimal masking strategy.

Unfair Experimental Comparisons: The proposed MLO-MAE method leverages additional knowledge from classification labels in ImageNet-1K, making a direct comparison with the MAE baseline unfair.

Unconvincing Conclusions:

  • It remains unclear why the masking strategy in MLO-MAE effectively "masks patches on foreground objects directly relevant to the class labels" (line 1113), and further discussion is needed to clarify this aspect.
  • The ablation study on the number of unrolling steps was conducted solely on the CIFAR-10 dataset, which shows that ''an increase in the unrolling steps leads to gradual improvement in accuracy'' (line 474). However, given the limited scale of CIFAR-10, the same study should also be performed on ImageNet-1K for a more convincing conclusion.
  • The ablation study for patch sizes was restricted to small images (32x32) on CIFAR-10. To strengthen the conclusion, similar studies should be conducted on larger images, such as those in ImageNet-1K.

Writing Issues:

  • In Stage 3, the method for updating the masking network T remains unclear, even after thoroughly reviewing the method section (lines 222-228) and Figure 1.
  • There is no explanation provided for "unrolling steps."

Questions

  • Q1: Why do the authors choose to use the classification task in Stage 2 for all downstream tasks, instead of tailoring different tasks in Stage 2 to match each specific downstream task?

  • Q2: How does the masking strategy in MLO-MAE manage to "mask patches on foreground objects directly relevant to the class labels" (line 1113)?

  • Q3: What makes "masking patches on foreground objects directly relevant to the class labels" an optimal masking strategy? If all patches of the foreground objects are masked, does it still qualify as an optimal strategy?

  • Q4: Could you elaborate on the implementation of Stage 3? Specifically, how is the masking network updated using the validation loss?

  • Q5: Could you provide a more detailed explanation of the unrolling steps?

Ethics Concerns

No ethical concerns have been identified.

Comment

Q7: Confusion on optimal masking strategy

We apologize for the confusion. MLO-MAE determines the masking regions by directly leveraging the classification loss, incorporating the optimization of the masking network in Stage III. This approach eliminates the need for human-designed heuristics, relying instead on a task-driven strategy. The term "optimal strategy" refers to the masking strategy used in Stage I, which is optimized using the downstream classification loss defined in Stage III. For example, if the masking network in MLO-MAE determines that the masking region for a specific image should cover all foreground objects, this implies that reconstructing the entire masked foreground is beneficial for downstream classification performance. Such a strategy remains desirable.

Q8: Elaborate on the implementation of Stage 3

We thank the reviewer for this comment. In Stage III, we optimize the trainable masking network by minimizing the validation loss, leveraging the optimized classification head and encoder backbone obtained from Stage I and II. The validation loss is computed as the cross-entropy classification loss on the validation set, using the frozen vision encoder and classification head. It is important to note that this validation set is created from the ImageNet training set, while the original ImageNet validation set is reserved as the test set to evaluate the overall performance of our method. In the MLO-MAE framework, the vision encoder, classification head, and masking network are implicitly interconnected. As described in Equation (8), we utilize the chain rule to propagate gradients, with the vision encoder and classification head treated as approximate optimal solutions from Stages I and II, respectively. Finally, the masking network is updated using one-step gradient descent.
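To make the gradient path concrete: since the masking network T determines the pretrained encoder E^*(T), which in turn determines the classification head C^*, the chain rule gives a hypergradient of the following schematic form. This is a reconstruction from the description above (with step size η), not the paper's verbatim Equation (8).

```latex
% Schematic hypergradient for the masking network T (reconstruction; notation from the rebuttal).
\nabla_{T}\,\mathcal{L}_{\mathrm{val}}
  \;=\;
  \frac{\partial E^{*}(T)}{\partial T}
  \left(
    \frac{\partial \mathcal{L}_{\mathrm{val}}}{\partial E^{*}}
    \;+\;
    \frac{\partial C^{*}(E^{*})}{\partial E^{*}}\,
    \frac{\partial \mathcal{L}_{\mathrm{val}}}{\partial C^{*}}
  \right),
  \qquad
  T \;\leftarrow\; T - \eta\,\nabla_{T}\,\mathcal{L}_{\mathrm{val}}
```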

Comment

We appreciate your constructive feedback and provide our response to your questions as follows.

Q1: Deviation from the motivation

We thank the reviewer for pointing this out. Our transfer learning experiments in Section 4.4.2 demonstrate that MLO-MAE, even with Stage II focused on a classification task, is still capable of learning effective and generalizable representations. These representations lead to superior performance on semantic segmentation and object detection tasks, highlighting the versatility of our approach. Additional object detection experiments can also be found in Appendix D.1. Our ultimate goal is to enhance the quality of representations learned by the backbone network, ensuring they generalize well across a wide range of downstream tasks.

Q2: Unfair comparison

We would like to clarify that our method does not unfairly use more labeled data than the baseline approaches. The labeled data used in Stages II and III of our framework is identical to that employed during the fine-tuning phase of the baselines, ensuring a fair comparison. In MLO-MAE, we directly leverage the downstream objective to guide and optimize the mask-reconstruction process during the MAE pretraining stage. It is also important to note that labels are utilized during downstream evaluation for all methods, maintaining fairness in the comparison. Moreover, the effectiveness of MLO-MAE is further demonstrated through transfer learning experiments, where neither the data nor the labels from the downstream tasks are used during the pretraining of either MLO-MAE or the baselines. The superior performance of MLO-MAE in these settings empirically highlights its strong generalizability.

Q3: Clarification on MLO-MAE’s masking strategy

We apologize for any confusion caused. MLO-MAE determines the masking regions by directly leveraging the classification loss, incorporating the optimization of the masking network in Stage III. It integrates three interconnected optimization objectives: the MAE pretraining objective, the optimization of the classification head on the classification task, and the optimization of the masking network on the same task. These three levels of optimization are mutually dependent, as formalized in Equation (4). To address this complex problem, we employ an efficient hypergradient-based method to solve Equation (4), enabling gradient descent updates for the masking network. This formulation allows the masking network, which is optimized on the cross-entropy loss for class label prediction in Stage III, to generate downstream-informed masking patterns in Stage I. Consequently, the resulting masking patches are directly aligned with the class labels, ensuring their relevance to the downstream tasks.
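Schematically, the three mutually dependent levels described here can be written as follows, using the D_tr/D_val split from Section 4.1. This is a reconstruction from the stage descriptions, not the paper's verbatim Equation (4), which may differ in detail.

```latex
% Trilevel sketch of MLO-MAE (reconstruction from the stage descriptions).
\begin{aligned}
\min_{T}\quad & \mathcal{L}_{\mathrm{cls}}\!\left(E^{*}(T),\, C^{*}(T);\, \mathcal{D}_{\mathrm{val}}\right)
  && \text{Stage III: masking network}\\
\text{s.t.}\quad & C^{*}(T) = \arg\min_{C}\; \mathcal{L}_{\mathrm{cls}}\!\left(E^{*}(T),\, C;\, \mathcal{D}_{\mathrm{tr}}\right)
  && \text{Stage II: classification head}\\
 & E^{*}(T) = \arg\min_{E}\; \mathcal{L}_{\mathrm{MAE}}\!\left(E,\, T;\, \mathcal{D}_{\mathrm{tr}}\right)
  && \text{Stage I: masked reconstruction}
\end{aligned}
```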

Q4: Ablation on ImageNet

We thank the reviewer for this suggestion. Our primary classification results in Table 1 and Table 2 show the effectiveness of MLO-MAE on both the ImageNet and CIFAR datasets. Due to resource limits, we conducted the ablation study only on CIFAR.

Q5: Unclear of the update of masking network and unrolling steps

We apologize for the confusion. In Stage III, we optimize the masking network by minimizing the validation loss, leveraging the optimized classification head and encoder backbone obtained from Stages I and II. This process is carried out using gradient-based methods. The detailed computation of implicit gradients is outlined in Appendix A.1, and the pseudo-code for the integrated approach is provided in Algorithm 1.

The unrolling steps serve as hyperparameters to approximate the optimal weights in Equations (1) and (2). In Equation (1), E^*(T) represents the optimal weights of the encoder. However, finding the exact optimal solution for E^* is computationally infeasible. To address this, we approximate E^* using a fixed number of gradient descent steps, referred to as unroll steps. In Equation (2), C^* is approximated using the same approach. In most of our experiments, we set the unroll steps to 2 for E^* and 1 for C^*, as this configuration empirically provides a good tradeoff between training time and performance. Additional details on the impact of different unroll step values can be found in our ablation studies in Section 4.6. We hope this clarifies our approach and the rationale behind our chosen hyperparameters.
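To illustrate what unrolling means in practice, the toy sketch below replaces the encoder, classification head, and masking network with simple tensors and surrogate quadratic losses (all hypothetical stand-ins), approximates E^* with 2 differentiable gradient steps and C^* with 1, and then backpropagates a surrogate validation loss to the masking parameters T:

```python
import torch

def unroll(params, loss_fn, steps, lr):
    """Approximate argmin_p loss_fn(p) with a few differentiable SGD steps.

    create_graph=True keeps the unrolled updates in the graph so the outer
    gradient can flow back to variables the inner loss depends on (here, T)."""
    for _ in range(steps):
        (grad,) = torch.autograd.grad(loss_fn(params), params, create_graph=True)
        params = params - lr * grad
    return params

# Hypothetical stand-ins: T ~ masking parameters, E ~ encoder weights, C ~ head weights.
T = torch.zeros(4, requires_grad=True)
E0 = torch.ones(4, requires_grad=True)
C0 = torch.ones(4, requires_grad=True)

recon_loss = lambda E: ((E - torch.sigmoid(T)) ** 2).sum()   # Stage I surrogate (depends on T)
cls_loss = lambda C, E: ((C - E) ** 2).sum()                 # Stage II surrogate (depends on E)

E_star = unroll(E0, recon_loss, steps=2, lr=0.1)                     # 2 unroll steps for E*
C_star = unroll(C0, lambda C: cls_loss(C, E_star), steps=1, lr=0.1)  # 1 unroll step for C*

val_loss = ((C_star * E_star - 1.0) ** 2).sum()   # Stage III surrogate validation loss
val_loss.backward()                               # gradient reaches T through the unrolled steps

with torch.no_grad():                             # one-step update of the masking parameters
    T -= 0.1 * T.grad
```

In the actual framework the inner problems are the MAE reconstruction and classification-head training, and the outer gradient is computed with implicit differentiation rather than this naive unrolled backpropagation; the sketch only illustrates how a small, fixed number of inner steps stands in for the exact argmin.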

Q6: Choosing classification as Stage II

Image classification is a fundamental computer vision task. Our transfer learning experiments on fine-grained classification, semantic segmentation, and object detection show the effectiveness of the classification guided masking network in generating the mask patterns that benefit other downstream tasks as well.

Official Review (Rating: 6)

In this submission, the authors aim to improve visual representation learning in self-supervised pretraining. Their motivation is that traditional Masked Autoencoders (MAE) overlook the varying informativeness across different patches. They hope to leverage feedback from downstream tasks to guide the learning of masking strategies during pretraining. The proposed framework, named Multi-level Optimized Mask Autoencoder (MLO-MAE), consists of three interdependent stages: pretraining the image encoder, training the classification head, and updating the masking network. These stages are executed end-to-end through a multi-level optimization (MLO) strategy. Finally, their experiments show performance improvements of MLO-MAE across multiple datasets and tasks.

Strengths

The experiments involve multiple benchmark datasets and also validate the transfer learning ability on downstream tasks such as fine-grained classification and semantic segmentation. Additionally, various visualization analyses are provided.

Weaknesses

  1. Considering the additional computational overhead introduced by the multi-level optimization, it would be helpful to add comparisons with more baselines, such as ColorMAE.
  2. In Section 4.1, the authors mention that they divide the original training set of each dataset into two subsets, D_tr and D_val. From my understanding, this approach is essentially similar to dividing datasets into training, validation, and test sets in machine learning. The validation set is used to monitor the training process. The difference here is that instead of manual adjustments by developers, the authors use multi-level optimization (MLO) to automate the adjustment of model parameters. As I am not very familiar with this field, I am curious about the transferability and applicability of this automated approach.

Questions

Please see Weaknesses.

Comment

We appreciate your constructive feedback and provide our response to your questions as follows.

Q1: Comparison with ColorMAE

We thank the reviewer for this helpful suggestion. MLO-MAE achieves superior performance to ColorMAE on ImageNet1K classification, semantic segmentation on ADE20K, and object detection on COCO, as shown in the table below.

| Model    | ImageNet1K (Top-1 Acc) | ADE20K (mIoU) | COCO (AP_bbox) |
|----------|------------------------|---------------|----------------|
| ColorMAE | 83.9                   | 49.3          | 50.1           |
| MLO-MAE  | 84.8                   | 49.8          | 50.1           |

Q2: Transferability and applicability of MLO-MAE

We appreciate the reviewer for this insightful comment. As described in Section 4.1, we use the original ImageNet1K validation set as the test set and split the training set into a new training set and a new validation set. The masking network is then optimized based on the performance on the newly created validation set, following Equation (4). Our transfer learning experiments in Section 4.4, conducted on fine-grained classification tasks (iNaturalist 19, CUB, and Stanford Cars), semantic segmentation (ADE20K), and object detection (MS-COCO), demonstrate the transferability of the visual representations learned by MLO-MAE across a diverse range of downstream tasks, beyond classification alone. Additionally, we tested the MLO-MAE approach in the medical domain using a continued pretraining setup, as detailed in Section 4.5. Empirical results on two medical datasets, PDDB and PAD-UFES, further highlight the effectiveness of MLO-MAE as a continued pretraining technique for domain-specific datasets.

Official Review (Rating: 5)

This paper introduces a variant of the masked autoencoder (MAE) designed to explicitly enhance the performance of downstream tasks. The framework consists of three main stages: (1) pretraining an image encoder using the MAE approach based on the mask generator network, (2) training a classification head with the frozen MAE-pretrained encoder on a downstream task, and (3) training a masking network with the trained classification head to minimize the loss to tune the masks. Experimental results demonstrate the effectiveness of the proposed method.

Strengths

  • The idea of explicitly introducing these three steps to improve downstream task performance is intuitive and reasonable.
  • The performance improvements over the baseline look impressive.

Weaknesses

  • While the idea appears practical and well-founded, the rationale behind including specific steps and how each contributes to overall performance are missing. Furthermore, it is unclear why the proposed method would outperform SemMAE and AutoMAE, which also use adaptive mask generation. The authors are encouraged to provide some insights into why their approach may offer an advantage over these similar methods.
  • The multi-level optimization presented appears to depend heavily on the previous steps. While the problem formulation in Eq. (4) integrates all sub-objectives, each optimization step must wait for the completion of the previous one. This approach seems a bit cumbersome, potentially leading to cache and memory issues (which may not be fully resolved by standard packages) and slowing down the process due to the sequential nature of the steps.
  • There are no analyses of the generated masks during or after training, nor comparisons with the outputs of other mask generators (e.g., SemMAE's and AutoMAE's). The reviewer believes presenting this analysis is crucial, as it could provide insights into training dynamics, specifically which parts of images should be visible or reconstructed to impact performance.
  • In training, stages 2 and 3 utilize downstream task data, which seems to be used more extensively than in standard practice, where downstream data is only introduced during the fine-tuning stage. This incurs an unfair evaluation.

Questions

  • The reviewer still wonders about some results, for example, why ImageNet pre-training/fine-tuning requires only 50 epochs to surpass other methods, especially since all steps use the same ImageNet dataset and the pre-training stage even trains on a smaller set of images (80%). Could the authors provide some insights?
  • The proposed method appears to adapt specifically to ImageNet when pre-training the classification head on ImageNet in step 2. This may raise issues, such as difficulty handling discrepancies in patch size between pre-training and fine-tuning stages. How can a model trained on ImageNet in this way be effectively applied to downstream tasks like CIFAR?
  • The dataset is split into training and validation data for the three-step training proposed in this paper. Can the authors ensure consistent performance when the data split varies due to changes in random seeds or other factors?

Comment

We appreciate your constructive feedback and provide our response to your questions as follows.

Q1: Rationale behind MLO-MAE

We thank the reviewer for this thoughtful suggestion and deeply appreciate the opportunity to elaborate on the key distinctions between MLO-MAE and existing adaptive mask generation methods, such as SemMAE and AutoMAE. These differences can be summarized as follows:

  1. Learnable Masking Network with Downstream Guidance: MLO-MAE employs a learnable masking network that is directly optimized based on downstream performance, specifically guided by the cross-entropy loss, rather than relying on heuristic or standalone approaches.

  2. Hierarchical Optimization Framework: MLO-MAE introduces a hierarchical design of mutually dependent optimization problems, solved using implicit differentiation, to address the complexity of this framework effectively.

Unlike other methods, MLO-MAE simultaneously solves three optimization problems: optimizing the weights of the vision encoder (Stage I), the classification head (Stage II), and the masking network (Stage III). This structure ensures that the optimization processes are interdependent—for example, optimizing the vision backbone depends on the masking network, and optimizing the masking network requires the vision backbone and classification head to be optimized. This mutual dependency allows MLO-MAE to integrate downstream guidance directly into the mask selection during the mask-reconstruction process (Stage I). In contrast, SemMAE employs a standalone semantic encoding module to generate masking patterns, while AutoMAE relies on an external ViT to extract attention maps indicative of semantic regions. These methods depend heavily on external semantic modules, making them potentially sensitive to different data distributions. MLO-MAE, by design, avoids this limitation, ensuring greater robustness and adaptability across different datasets.

Q2: Potential cache and memory issues

The cache and memory challenges can be effectively addressed using gradient checkpointing, a feature supported by most standard deep learning libraries. While MLO-MAE introduces additional computational complexity due to the calculation of the best-response Jacobian, this overhead is offset by the reduced number of training epochs required to achieve comparable or superior performance on ImageNet benchmarks, as demonstrated in Table 9.
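For reference, a minimal PyTorch illustration of gradient checkpointing is sketched below; the block and its dimensions are arbitrary stand-ins, not the MLO-MAE encoder.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Hypothetical stand-in block; in practice this would wrap transformer blocks of the encoder.
block = nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 256))

x = torch.randn(8, 256, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)  # activations inside `block` are recomputed in backward
y.sum().backward()                             # saves memory at the cost of extra forward compute
```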

Q3: Analysis of the generated masks

We thank the reviewer for pointing this out. We have provided a detailed analysis of the masking patterns learned by MLO-MAE in Appendix C.1, alongside a comparison with AutoMAE and SemMAE. Unlike MAE, SemMAE, and AutoMAE, which predominantly mask background regions that are irrelevant to image class labels, MLO-MAE strategically guides the encoder network to focus on learning effective representations for objects. These findings highlight the advantages of MLO-MAE in prioritizing meaningful features during pretraining.

Q3: Unfair evaluation

We would like to clarify that our method does not unfairly use more labeled data than the baseline approaches. The labeled data used in Stages II and III of our framework is identical to that employed during the fine-tuning phase of the baselines, ensuring a fair comparison. In MLO-MAE, we directly leverage the downstream objective to guide and optimize the mask-reconstruction process during the MAE pretraining stage. It is also important to note that labels are utilized during downstream evaluation for all methods, maintaining fairness in the comparison. Moreover, the effectiveness of MLO-MAE is further demonstrated through transfer learning experiments, where neither the data nor the labels from the downstream tasks are used during the pretraining of either MLO-MAE or the baselines. The superior performance of MLO-MAE in these settings empirically highlights its strong generalizability.

Q4: Insights on the smaller number of epochs for MLO-MAE

We appreciate the reviewer for this keen observation. The training process for MLO-MAE involves jointly solving three interconnected optimization problems using implicit differentiation, which introduces additional computational complexity compared to the standard MAE optimization. In practice, we approximate E* and C* by unrolling a limited number of gradient steps, which further increases the computational complexity per iteration. It is important to note, however, that the number of training epochs for MLO-MAE is not directly comparable to that of MAE, as MAE does not involve implicit differentiation or gradient unrolling.

Comment

Q5: Difficulty handling discrepancies in patch size between pre-training and fine-tuning

We thank the reviewer for this helpful comment. We primarily adhere to the downstream evaluation protocol used by MAE, in which the classification head from MLO-MAE is not utilized for downstream tasks. Instead, the classification head is independently initialized during the fine-tuning phase. Additionally, to maintain consistency with ImageNet-pretrained encoders, we can resize CIFAR images to a resolution of 224x224 and thereby effectively apply the ImageNet-pretrained MLO-MAE to the CIFAR dataset.
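As an illustration, the resizing step can be done with standard torchvision transforms; the normalization statistics and data root below are assumed choices, not taken from the paper.

```python
from torchvision import datasets, transforms

# Resize 32x32 CIFAR images to 224x224 so an ImageNet-pretrained ViT can be applied unchanged.
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics (assumed choice)
                         std=[0.229, 0.224, 0.225]),
])
train_set = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
```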

Q6: Data split used by MLO-MAE

We thank the reviewer for pointing this out. The new data split was performed using an 8:2 random split within each class of the ImageNet training set. While we did not specify a random seed, the large size of the dataset ensures that the performance of our method remains consistent when following the procedures described in Section 4.1 and Appendix B.3. To ensure a fair comparison with MAE on datasets where predefined train, validation, and test sets exist, MLO-MAE uses the original validation set in Stage III. Meanwhile, MAE is trained on the combined train and validation sets. Both methods are then evaluated on the test set to maintain consistency and fairness in the comparison.
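A class-stratified 8:2 split of this kind can be reproduced, for example, with scikit-learn; the file list, labels, and fixed seed below are illustrative assumptions (the authors note they did not fix a seed).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the ImageNet training file list and class labels.
paths = np.array([f"img_{i:07d}.jpg" for i in range(1000)])
labels = np.repeat(np.arange(10), 100)

train_paths, val_paths, train_labels, val_labels = train_test_split(
    paths, labels,
    test_size=0.2,       # 8:2 split
    stratify=labels,     # preserve per-class proportions
    random_state=0,      # illustrative seed only
)
```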

Comment

Thank you for your time and effort in providing detailed responses. However, while some of my concerns were addressed, others were not.

Q2: Potential cache and memory issues

The cache and memory challenges can be effectively addressed using gradient checkpointing, a feature supported by most standard deep learning libraries. While MLO-MAE introduces additional computational complexity due to the calculation of the best-response Jacobian, this overhead is offset by the reduced number of training epochs required to achieve comparable or superior performance on ImageNet benchmarks, as demonstrated in Table 9.

The authors provide the computational costs of the Jacobian computation; however, the reviewer's concerns were rather centered on the overall memory issue, which could be a practical problem when adjusting the batch size and so on during training.

Q3: Analysis of the generated masks

We thank the reviewer for pointing this out. We have provided a detailed analysis of the masking patterns learned by MLO-MAE in Appendix C.1, alongside a comparison with AutoMAE and SemMAE. Unlike MAE, SemMAE, and AutoMAE, which predominantly mask background regions that are irrelevant to image class labels, MLO-MAE strategically guides the encoder network to focus on learning effective representations for objects. These findings highlight the advantages of MLO-MAE in prioritizing meaningful features during pretraining.

This reviewer appreciates the authors' effort in including the visualizations. However, after reviewing both the manuscript and the visualizations, this reviewer fails to identify a clear trend supporting the authors' claims, such as "masks background regions that are irrelevant to image class labels." Instead, the patterns seem rather random. Can the authors clarify this? This reviewer speculates that the patterns might become more evident with higher masking ratios.

Q3: Unfair evaluation

We would like to clarify that our method does not unfairly use more labeled data than the baseline approaches. The labeled data used in Stages II and III of our framework is identical to that employed during the fine-tuning phase of the baselines, ensuring a fair comparison. In MLO-MAE, we directly leverage the downstream objective to guide and optimize the mask-reconstruction process during the MAE pretraining stage. It is also important to note that labels are utilized during downstream evaluation for all methods, maintaining fairness in the comparison. Moreover, the effectiveness of MLO-MAE is further demonstrated through transfer learning experiments, where neither the data nor the labels from the downstream tasks are used during the pretraining of either MLO-MAE or the baselines. The superior performance of MLO-MAE in these settings empirically highlights its strong generalizability.

This reviewer considers that this concern was/is a major issue in the work, as Reviewer 5ZVw similarly pointed out. The standard practice for "pre-training" is that the pre-training phase does not involve the downstream task's data. This ensures that the pre-training method's capability of generalization is evaluated independently of the downstream data's characteristics. While this reviewer acknowledges that leveraging downstream data can indeed improve fine-tuning performance, the concern lies in the process itself—specifically, using downstream data during pre-training and then reusing it during fine-tuning. Rather than this process, this reviewer encourages the authors to train and evaluate the proposed method on other downstream datasets unrelated to the fine-tuning dataset.

Q6: Data split used by MLO-MAE

We thank the reviewer for pointing this out. The new data split was performed using an 8:2 random split within each class of the ImageNet training set. While we did not specify a random seed, the large size of the dataset ensures that the performance of our method remains consistent when following the procedures described in Section 4.1 and Appendix B.3. To ensure a fair comparison with MAE on datasets where predefined train, validation, and test sets exist, MLO-MAE uses the original validation set in Stage III. Meanwhile, MAE is trained on the combined train and validation sets. Both methods are then evaluated on the test set to maintain consistency and fairness in the comparison.

My concern was how the randomness of the data split influences performance. Since the dataset is large enough, random seeds typically result in performance variations of about ±0.3–0.4% even with the full data; however, using only 80% of the data could introduce greater fluctuations. In particular, given the mislabeled training samples in the original ImageNet dataset, results would be affected more by random seeds. I encourage the authors to investigate this issue through some experiments. If immediate experiments are not feasible, it would be valuable to outline a clear plan for addressing this limitation in future work.

Official Review (Rating: 3)

The paper proposes a new MAE training framework, Multi-Level Optimization MAE (MLO-MAE). The authors add two additional learning objectives to conventional MAE training: linear probing and mask selector optimization. The linear probing loss trains a classifier using class labels, and the mask selector learns a mask that maximizes classification performance. The mask selector is also used for MAE training, which is effective for improving MAE's fine-tuning performance. The three objectives are optimized with the betty-ml library's multi-level optimization algorithm.

Strengths

  • Learning the mask with multi-level optimization is an interesting approach.
  • ImageNet performance in Table 1 is impressive. It is hard to achieve 84.8% accuracy with the MAE framework.

Weaknesses

  • The paper lacks motivation and insight explaining the improvement of MLO-MAE. Although MLO-MAE achieves impressive performance, I believe it is hard to contribute to MAE researchers without proper theoretical background or analysis.

  • MLO-MAE is not a self-supervised learning (SSL) method. It utilizes image class labels as a multi-objective training method. Since target task-agnostic pre-training is an important characteristic of SSL, MLO-MAE's contribution to conventional SSL is significantly limited.

  • The presentation is not good. The main algorithm is hard to understand from the formulas in Section 3.2. It would be better to revise the formulas to improve readability.

  • The experiment only covers a single model size (ViT-B), which is not enough to validate MAE-variant performance. Validation on various model sizes in ImageNet is required to argue for an improved MAE framework in general.

Questions

  • MLO-MAE utilizes learnable masking that is trained to maximize linear probing performance. What is the theoretical background and insight behind this design? Why does this masking improve MAE fine-tuning performance?

  • MLO-MAE makes classification-friendly random masking, which can be considered as an easy masking strategy. Can it achieve performance improvement in large models? How about small models like ViT-S?

  • In Table 3, improvement in downstream tasks looks smaller than in ImageNet. Does this mean MLO-MAE's improvement is biased toward the pretraining dataset? How about using the downstream dataset as an objective in Stage 2 and Stage 3?

  • Even though other learnable masking methods (SemMAE and AutoMAE) are not effective on benchmarks, MLO-MAE achieves a completely different pattern with a learnable mask. What makes the difference? What are the minimal changes needed to make learnable masks beneficial?

  • Table 5 omits AP_{mask}. Does MLO-MAE also improve the AP_{mask} of Mask R-CNN on COCO?

  • It is a surprising result that BLO-MAE has no positive effect on performance. Why is it so different from MLO-MAE? Does BLO-MAE fail to converge at any of the stages?

Comment

Q6: Different masking patterns comparing to baselines and minimal changes to make learnable masks beneficial

We thank the reviewer for this insightful comment. It is important to note that the MAE ImageNet result reported in Table 1 is based on a model pretrained for 1600 epochs, whereas an MAE pretrained for 800 epochs achieves an accuracy of 83.2% during fine-tuning [1]. This suggests that the learnable masking strategies employed by SemMAE and AutoMAE are still somewhat effective for improving downstream performance. However, the semantic masking approach used by SemMAE and the adversarially trained masking strategy of AutoMAE yield only marginal improvements (~0.1%) in downstream ImageNet fine-tuning, raising questions about the efficacy of directly learning a masking strategy guided by downstream tasks. SemMAE employs pretrained attention maps to adaptively mask both within and across semantic parts, while AutoMAE utilizes an adversarially trained generator and a joint optimization technique to update the mask generator. Notably, both methods rely on a standalone pretrained vision encoder (iBoT for SemMAE and a pretrained ViT for AutoMAE) to extract semantic information. In contrast, MLO-MAE integrates the learning of masking patterns directly into a unified optimization framework using multi-level optimization, as detailed in Appendix A. This approach eliminates the need for a separate pretrained vision encoder for semantic extraction.

In Section 4.6, we explored a simplified variant of MLO-MAE, referred to as BLO-MAE, where the optimization problem is reformulated as shown in Equation (5). While BLO-MAE achieves 2.8% lower accuracy compared to MLO-MAE, it still outperforms the MAE baseline (81.5% vs. 73.5%). This demonstrates that even minimal modifications, such as the BLO-MAE setup, which combines Stage I and Stage II, can effectively implement a downstream-guided masking strategy.

[1] SemMAE: Semantic-Guided Masking for Learning Masked Autoencoders, NeurIPS 2022.

Q7: Effectiveness and the convergence of BLO-MAE

We thank the reviewer for this astute observation. While BLO-MAE has lower CIFAR accuracy than MLO-MAE, it still achieves better performance than the MAE baseline (73.5%, according to Table 2) and shows convergence across all stages, indicating that BLO-MAE, compared to MAE, has a positive effect on performance. It is not surprising that MLO-MAE achieves better performance than BLO-MAE, because the lower level of the BLO-MAE approach involves optimizing a weighted sum of losses from two distinct tasks (C^* and E^*). This scenario often leads to task competition, and balancing the competing losses requires meticulous adjustment of the tradeoff parameter, which is challenging.
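For reference, the bilevel variant can be sketched as follows, with λ denoting the tradeoff parameter mentioned above. This is a reconstruction consistent with this description, not the paper's verbatim Equation (5).

```latex
% BLO-MAE sketch: Stages I and II merged into a single lower level with tradeoff weight \lambda.
\begin{aligned}
\min_{T}\quad & \mathcal{L}_{\mathrm{val}}\!\left(E^{*}(T),\, C^{*}(T)\right)\\
\text{s.t.}\quad & \left(E^{*}(T),\, C^{*}(T)\right)
  = \arg\min_{E,\,C}\; \mathcal{L}_{\mathrm{MAE}}(E,\, T) + \lambda\, \mathcal{L}_{\mathrm{cls}}(E,\, C)
\end{aligned}
```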

Q8: Additional experiment on different size of ViT model

We thank the reviewer for this helpful suggestion. We are currently running MLO-MAE on a different model size. However, due to resource constraints and the time limitation of the rebuttal period, we are unable to finish the experiment. We will include the results in the paper once the experiment finishes.

Comment

Thank you for your responses. While some questions have been addressed, they are not sufficient to resolve all the major issues.

Here is my feedback

  • A1 & A2: Theoretical background and insight on MLO-MAE's improvement

I know that MLO-MAE's masking strategy differs from SemMAE and AutoMAE and performs better than others. What I want to know is why it performs better. The paper significantly lacks an explanation of this. It looks like the authors randomly tried different masking strategies and, luckily, one works better than the others. Random exploration can be a contribution, but I'm sure it contributes less than a method proposed with a strong theoretical or analytical background for its performance improvement.

Masking object regions rather than the background can be a hint for why. However, there is still no direct connection between object masking and the performance of MAE. Also, I think MLO might outperform simple object region masking MAE. MLO-MAE needs solid reasons for improvement to inspire MAE researchers and make a genuine contribution.

  • A3: MLO-MAE is not a self-supervised learning.

My point is that MLO-MAE is not self-supervised learning because it utilizes image labels for pre-training. Whether it is multi-objective training or not is irrelevant. I believe MLO-MAE can't be considered SSL, which degrades its contribution and impacts. Please respond if you disagree with this opinion.

  • A5: small improvement on the downstream dataset.

The response addressed my concern about it. The small improvement on downstream tasks originates from the nature of masking strategies. While it seems natural, it can be considered a weakness or limitation of research on masking strategies.

  • A6 & A7: MLO vs BLO

Thank you for pointing out MAE performance to clarify the message of this experiment. However, it would be better to add MAE's performance to Figure 2. As a reader, it is difficult to search for MAE's performance from Table 2 and compare it with Figure 2. BLO's performance can help to understand where the improvement came from.

  • A8: Model sizes

I believe evaluating performance across various model sizes is important in MAE-like benchmarks. Thus, it is hard to give a high rating to a method validated on only one size, even though the authors are trying their best to cover it.


Overall, I believe the method, MLO-MAE, has potential and can make a valuable contribution to the field. However, the current paper's quality, analysis, and evaluation are not enough to meet my acceptance threshold. I will keep my initial rating.

Comment

We appreciate your constructive feedback and provide our response to your questions as follows.

Q1: Insights of design behind learnable masking in MLO-MAE

Our work is inspired by the idea of directly incorporating downstream information to guide masking decisions through multi-level optimization. In Appendix C.1, we provide a comprehensive analysis of the masking patterns learned by MLO-MAE, comparing them to those produced by MAE, SemMAE, and AutoMAE. Unlike these approaches, which predominantly mask background regions irrelevant to image class labels, MLO-MAE strategically focuses masking on areas that encourage the encoder network to learn effective representations for objects rather than background regions. This highlights the ability of MLO-MAE to prioritize meaningful features in the learning process.

Q2: Motivation and insight explaining the improvement of MLO-MAE

We thank the reviewer for this helpful comment. Our MLO-MAE explores the effectiveness of introducing downstream insight to guide the pretraining task by integrating a learnable masking network through a hypergradient-based multi-level optimization (MLO) framework. Through our transfer learning and continued pretraining experiments, MLO-MAE has shown its effectiveness and generalizability across diverse domains and various vision tasks. Due to the page limit, we have included a detailed discussion of our implicit differentiation setup for the MLO gradient computation and of the differentiability of the MLO-MAE framework in the Appendix.

Q3: MLO-MAE utilizes image class labels as a multi-objective training method

We appreciate the reviewer’s keen observation on the mechanism of our MLO-MAE. However, we would like to point out that MLO-MAE is not a multi-objective optimization (MOO) method. MOO aims to optimize multiple objectives simultaneously and disregards the inherent hierarchical dependencies that lie within the definition of the optimization problem. The conventional pretrain-finetune (PT-FT) pipeline explicitly defines a hierarchical relationship between the pretraining and fine-tuning objectives, since fine-tuning directly depends on pretraining. MLO-MAE leverages this explicit hierarchical dependency and introduces a learnable masking network on top of the existing dependency to form a multi-level optimization objective. The design of MLO has multiple benefits over MOO. Firstly, MLO explicitly models the dependency between levels, ensuring that the solution at one level supports and enhances the objectives of the next level. Secondly, MLO avoids explicitly balancing competing objectives within a single level.
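Schematically, the contrast is between a single-level weighted sum and a nested formulation. The notation below is illustrative; the weights w_i do not appear in the paper.

```latex
% Illustrative contrast between MOO and MLO (not taken verbatim from the paper).
\text{MOO:}\;\; \min_{T,\,E,\,C}\; w_{1}\,\mathcal{L}_{\mathrm{MAE}}(E,T) + w_{2}\,\mathcal{L}_{\mathrm{cls}}(E,C)
\qquad
\text{MLO:}\;\; \min_{T}\; \mathcal{L}_{\mathrm{val}}
\;\;\text{s.t.}\;\; C^{*}=\arg\min_{C}\mathcal{L}_{\mathrm{cls}},\;\; E^{*}=\arg\min_{E}\mathcal{L}_{\mathrm{MAE}}
```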

Our transfer learning experiments on additional dense prediction tasks, as well as the continued pretraining setting, demonstrate that MLO-MAE achieves superior performance compared to baseline models. This highlights the model's ability to learn more robust representations and exhibit greater generalizability. It is worth mentioning that, under the transfer learning setting, where neither the dataset nor the labels used for evaluation are present during the pretraining of MLO-MAE or the baselines, the comparison remains fair. This ensures that MLO-MAE's enhanced performance does not stem from leveraging additional information unavailable to the baseline. In addition, we explored MLO-MAE as a continued pretraining paradigm against baseline methods on domain-specific datasets in Section 4.5, and our empirical results show that MLO-MAE is effective in the continued pretraining setting.

Q4: Presentation of formulas in Section 3.2

We apologize for any confusion. Due to the page limit, we have included more detailed explanations of formulas from Section 3.2 in Appendix A.1 and A.3.

Q5: Improvement in downstream tasks over ImageNet is small

We thank the reviewer for this observation. Table 3 contains transfer learning experiments on iNat 19, CUB, and Stanford Cars. Similar marginal improvements are also reported in SemMAE, which also tends to have larger improvements on ImageNet than on the fine-grained datasets (a 1.7% improvement on ImageNet, and 0.3%, 0.2%, and 0.6% on iNat 19, CUB, and Stanford Cars, respectively) under the transfer learning setting. The experimental results of MLO-MAE on ADE20K semantic segmentation (Table 4) and MS-COCO object detection (Table 5) show a non-trivial performance boost, indicating that the improvement is not limited to the pretraining dataset.

Official Review (Rating: 5)

The paper proposes an interesting pretraining framework, MLO-MAE, which guides the masking process in masked image modeling by leveraging labels from downstream tasks to learn representations that meet task-specific requirements. The pretraining of MLO-MAE involves three main stages: pretraining the image encoder, training the classification head, and updating the masking network. MLO-MAE optimizes these three stages through a multi-level optimization algorithm.

Strengths

  1. This paper addresses a valuable and important problem. Given the inherent differences between images and text, using random masking as in NLP may not be optimal for masked image modeling.
  2. The proposed method is concise and effective, achieving strong performance across multiple benchmarks.
  3. The paper is well-organized, making it easy to reproduce and follow.

Weaknesses

  1. The pretraining of MLO-MAE relies on labels from downstream tasks, which does not align with the self-supervised learning paradigm of learning general representations from the data itself. MLO-MAE resembles a supervised pretraining approach, so the experimental setup that mainly compares it with SSL methods is somewhat inappropriate.
  2. The main weakness of this paper lies in the experiments:
    • Although MLO-MAE demonstrates improved fine-tuning and linear probing performance (Sec 4.3), it already leverages class information during pretraining, making comparisons with self-supervised baselines unfair and insufficient to verify MLO-MAE's effectiveness.
    • I notice that in the transfer learning experiments (Sec 4.4), the pretrained model undergoes additional supervised fine-tuning on ImageNet, which diverges from the typical transfer learning paradigm. Is this fine-tuning necessary? Can MLO-MAE be transferred to downstream tasks without it? Does the use of different labeled data during pretraining affect the performance of transfer learning?
    • The paper also lacks an analysis of the learned masking patterns. How do the masking patterns learned by MLO-MAE differ from those of other methods? Why do the learned masking patterns help in learning better representations? A more in-depth analysis is needed to support the main claims of this work.

Questions

See Weaknesses.

Comment

We appreciate your constructive feedback and provide our response to your questions as follows.

Q1: Unfair comparison with baseline methods

We thank the reviewer for this helpful comment. It is important to emphasize that our method does not utilize more labeled data than the baseline approaches. The labeled data used in Stages II and III of our framework is identical to that employed during the fine-tuning phase of the baselines, ensuring a fair comparison. In MLO-MAE, we directly leverage the downstream objective to guide and optimize the mask-reconstruction process within the MAE pretraining procedure. Additionally, during downstream evaluation, labels are used consistently across all methods, maintaining the fairness of the comparison. The effectiveness of MLO-MAE is further highlighted in transfer learning experiments, where neither the data nor the labels from downstream tasks are incorporated during the pretraining of MLO-MAE or the baselines.

In addition, we conducted transfer learning experiments on a fine-grained dataset using ImageNet fine-tuned checkpoints to facilitate a fair comparison. Results are shown in the table below. In this setting, both the baseline method and the MLO-MAE vision backbone have access to the same ImageNet training data and labels, ensuring a fair evaluation. Notably, MLO-MAE consistently outperformed the baseline methods across all three datasets, demonstrating its robustness and effectiveness.

Table 1: Performance of ImageNet fine-tuned MAE, SemMAE, and MLO-MAE across fine-grained classification.

| Model   | iNaturalist 19 | CUB   | Stanford Cars |
|---------|----------------|-------|---------------|
| MAE     | 80.5           | 87.33 | 93.32         |
| SemMAE  | 80.3           | 87.28 | 93.26         |
| MLO-MAE | 80.5           | 87.52 | 93.51         |

Q2: Confusions on transfer learning experiments

We apologize for the confusion. To clarify, the experimental settings in Section 4.4 did not involve a separate fine-tuning process. To ensure a fair comparison and incorporate ImageNet label information into the baseline experiments, we also conducted transfer learning using ImageNet fine-tuned baseline checkpoints. The results, provided in the table above, demonstrate that starting from ImageNet fine-tuned checkpoints, MLO-MAE consistently achieves comparable or superior performance on fine-grained classification tasks.

Q3: Masking patterns of MLO-MAE

We appreciate the reviewer for this helpful comment. We have provided a detailed analysis of the masking patterns learned by MLO-MAE in Appendix C.1, alongside a comparison with the patterns produced by MAE, SemMAE, and AutoMAE. Unlike MAE, SemMAE, and AutoMAE, which predominantly mask background regions that are irrelevant to image class labels, MLO-MAE strategically guides the encoder network to focus on learning effective representations for objects. These results highlight the advantages of our approach in prioritizing meaningful features during pretraining.

Withdrawal Notice

I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.