Masked VAE: Distributionally-Informed Self-Supervised Vision Learning
Abstract

We present a new self-supervised learning method for vision that respects the inherent multi-mode nature of masked autoencoding tasks.

Reviews and Discussion
This paper introduces the Masked Variational Autoencoder (Masked VAE), a self-supervised approach designed to learn contextually-aware representations of images. Specifically, the method employs a deterministic encoder and a variational encoder to process masked and visible image tokens, respectively. The combined features from both encoders are then used by a decoder to reconstruct the image. To assess the effectiveness of the proposed method, the authors establish a Context-Completion benchmark. Experimental results show that Masked VAE outperforms the baseline MAE on this benchmark.
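For readers unfamiliar with the setup, the data flow the summary describes can be sketched in a few lines. This is a hypothetical minimal numpy sketch, not the authors' implementation: the linear maps, dimensions, and the assignment of encoders to token groups (taken from the review's wording) are all illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def det_encoder(tokens):
    # Deterministic encoder: a fixed linear map (stand-in for a ViT-style encoder).
    W = np.full((tokens.shape[-1], 8), 0.1)
    return tokens @ W

def var_encoder(tokens):
    # Variational encoder: predict a Gaussian per token and sample
    # via the reparameterization trick.
    W_mu = np.full((tokens.shape[-1], 8), 0.1)
    W_logvar = np.zeros((tokens.shape[-1], 8))  # unit variance for the sketch
    mu = tokens @ W_mu
    logvar = tokens @ W_logvar
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decoder(features):
    # Lightweight decoder mapping fused features back to pixel tokens.
    W = np.full((features.shape[-1], 16), 0.05)
    return features @ W

# Toy image: 4 tokens of dimension 16, half of them masked out.
tokens = rng.standard_normal((4, 16))
mask = np.array([True, True, False, False])  # True = masked position

# Which encoder handles which token group follows the review's wording.
z_masked = det_encoder(tokens[mask])     # deterministic path for masked tokens
z_visible = var_encoder(tokens[~mask])   # stochastic path for visible tokens

# Fuse both feature sets and reconstruct all tokens.
features = np.concatenate([z_masked, z_visible], axis=0)
recon = decoder(features)
assert recon.shape == tokens.shape
```

Because the variational path samples a latent, repeated forward passes yield different reconstructions for the same input, which is the distributional awareness the reviews discuss.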
Strengths
- The paper is well-organized and easy to follow.
- Proposing contextually-aware representations introduces a novel perspective in visual representation learning. The methodology is intuitive and well-articulated.
Weaknesses
- The comparison between Masked VAE and the baseline MAE in the proposed Context-Completion benchmark appears somewhat unfair. MAE is primarily designed to encode general image representations with a lightweight decoder, which may not be suitable for image inpainting tasks. A more relevant comparison might be against specialized models designed for image completion, such as popular diffusion models.
- The metrics used in the proposed benchmark seem outdated.
- The demonstrated image completion results do not seem to match the performance of currently popular models in this domain.
- The results presented in Section 5 suggest that the proposed method does not offer any improvements over the baseline MAE. This outcome is somewhat disappointing, especially for a model that builds upon the MAE framework.
- The motivation behind the paper is somewhat unclear, and the practical applications of the proposed method are not well defined. It would enhance the paper if the authors could clarify the intended use cases and potential impact of their approach within the field of self-supervised learning.
Questions
N/A
Details of Ethics Concerns
N/A
This paper proposes MaskedVAE, a self-supervised learning method that builds upon MAE by introducing distributional awareness. The paper also introduces a new benchmark, Context-Completion, for evaluating context-aware representation learning by measuring the ability to inpaint objects given a partial image. The authors show that MaskedVAE is competitive with MAE on representations for classification, while outperforming MAE on context-completion tasks, demonstrating the effectiveness of distributional awareness.
Strengths
- The motivation for introducing stochasticity into the masked prediction task is clear: the task is under-constrained, yet existing works treat it as deterministic.
- The paper is easy to follow.
- MaskedVAE shows performance competitive with MAE on ImageNet fine-tuning, which indicates that introducing stochasticity into MAE does not harm the classification representation.
Weaknesses
- Although masked prediction pretext tasks do not involve stochasticity, their representations are quite good at understanding context; e.g., MAE achieves strong performance on dense-prediction tasks such as object segmentation. What is the benefit of the context-completion task over context understanding in representation space?
- Similar to point 1: do we really need a good representation to inpaint the exact object if it does not improve the representation for any discriminative task? I think using a much more powerful decoder, as MAE did (they used a GAN loss to fine-tune the model for more realistic generation), or adding diffusion models, as in I-JEPA, may be enough to inpaint the masked part.
Questions
- Please address the points raised in the Weaknesses section.
This paper proposes Masked VAE, which combines a VAE with MAE to address the multiple-hypotheses problem in masked image modeling. The proposed method shows promising results on Context-Completion, a benchmark also introduced in this paper, which evaluates contextual modeling ability using object inpainting as a metric.
Strengths
This paper reveals a problem in masked image modeling and proposes using the probabilistic modeling of a VAE to solve it. The proposed benchmark demonstrates the superiority of this approach.
Weaknesses
Although this paper presents a good problem and solution, the point I least understand is that solving this problem does not seem to bring any benefit to self-supervised learning: downstream tasks are only on par with MAE. Masked image modeling is just a proxy task, so if fixing its flaws does not benefit downstream tasks, is the flaw worth fixing? And if the goal is only image inpainting, why not use diffusion methods instead? The positioning of this paper is therefore not very clear.
Questions
See Weaknesses.
I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.