Pre-training with Random Orthogonal Projection Image Modeling
Pre-training with Random Orthogonal Projection Image Modeling
摘要
评审与讨论
This paper focuses on the task of MIM, by designing a new corruption process via the random orthogonal projection. Such a ROP strategy results in a more efficient and effective pre-training method. Another advantage of ROP is the guaranteed bound on the noise variance during the corruption process.
优点
- This work is based on the sound theory of random orthogonal projection.
- ROPIM is able to achieve more superior performance in a shorter pre-training time.
- The decoder only contains one linear layer, being more slight comparing with MIM.
- The experiments verify its effectiveness on several downstream tasks, including classification and segmentation.
缺点
- It is a little difficult to understand why does the proposed method is more superior than MIM. According to my understanding, the ROP strategy randomly discards some local (not global) patterns during corruption as shown in Fig. 9. This is very similar to MIM. So, I can't intuitively catch what results in the superiority of ROP in the field of MIM. It would be better to have a discussion in the paper.
问题
-
Proposition 1 gives a complicated way to generate a projection matrix that contains only three elements, namely {-1, 0, 1}. This procedure relies on two auxiliary variables of and . Is it possible to directly sample from {-1, 0, 1} for each entry of ?
-
Following question 1, the projection matrix is composed of {-1, 0, 1}. Generally speaking, "0" denotes discarding some embed patch in projection. It is hard to understand the function of "-1".
-
In proposition 1, I guess: should be ?
-
In proposition 2, the notation for is missing.
Response to Rev. 3 (Reviewer zDzQ)
Firstly, we thank the reviewer for the constructive review and valuable questions.
1. Little difficult to understand why the proposed method is better than MIM. The ROP strategy randomly discards some local (not global) patterns during corruption as shown in Fig. 9. This is very similar to MIM.
Thank you. The simplified answer is that we achieve better results due to different profile/type of noise we implicitly inject via sketching compared to binary masking:
- masking introduces binary patterns (keep token or completely mask it) resulting in a limited number of masking patterns
- Fig. 8 and 9 show that rather than just removing token, our ROPIM has this more continuous effect compared to binary masking, which creates more patterns of randomness (Figure 4 compares the effect of binary masking versus our strategy).
For simplicity, imagine a masking problem with just 2 tokens:
- for standard binary masking, in total there are unique masking patterns, and assuming 50% masking ratio, that just limits patterns to mere 2.
- in contrast, if masking has more 'continuous' nature, say we can get \\{0\\%, 25\\%, 50\\%, 75\\%, 100\\%\\} of original energy preserved per token, that gives unique masking patterns, and under 50% masking ratio (assuming it equals to 50% lost information), that gives us (0,100),(25,75), (50,50), (75,25), (100,0) patterns (5 in total). While this is a simplification, ROPIM yields a similar effect but (i) in addition we get complementary `unmasking' pattern and known variance bound of the implicitly injected noise as per Prop. 1-5.
Importantly, in Figure 7a we show the actual real-data comparison between our ROPIM strategy and binary masking. Figure 7a illustrates numerically what we argue above, i.e., for the same `masking ratio':
- binary masking makes many tokens at the input to be completely unmodified (see the first histogram bean in Fig. 7a)
- in contrast, our ROPIM leaves fewer tokens `untouched' by some level of noise - creating richer noise patterns than binary masking
- consequently, Figure 7d also shows that our `unmasking' step works much harder
Thus, ROPIM leads to greater variety of input patterns and `unmasking' patterns, which exposes network to greater diversity of patterns to help learn the implicit manifold of the data (think e.g. DAE).
Note that suitable data augmentation strategies during pre-training are an open question:
[I]. Masked auto-encoders are scalable vision learner, He et al., CVPR 2022
[II]. Mixed auto-encoder for self-supervised visual representation learning, Chen et al., CVPR 2023
[III]. Corrupted image modeling for self-supervised visual pre-training. Fang et al., CVPR 2023.
E.g., color jittering degrades the transfer learning performance of MAE in [III], implying that MIM exhibits a preference for specific kinds of data augmentation (not all are good).
2. Proposition 1 gives a complicated way to generate a projection matrix with elements: -1, 0, 1. This procedure relies on h and s. Is it possible to directly sample from -1, 0, 1 for each entry of P ?
Thank you. We appreciate the process may look complicated but this is the standard definition of count sketching. Both variables are necessary because hashing function h (producing ) controls unique locations of non-zero entries per row of and thus promotes each row vector of to be orthogonal to other row vectors. This is essential in order to ensure we deal with subspaces (and that Prop. 2-5 hold).
3. The projection matrix P is composed of -1, 0, 1. ”0” denotes discarding some embed patch in projection. It is hard to understand the function of ”-1”.
Attributing roles to -1 and 1 at the coefficient level is hard as is a projection matrix for matrix-vector multiplication, not element-wise tool.
The meaning of elements 1 and -1 is that they are an output of a hash function ( is specific result from hashing function s).
For detailed explanation of count-sketching, the following article is a sensible introduction to this advanced topic: https://en.wikipedia.org/wiki/Count_sketch
Kindly notice what matters to ROPIM is the resulting properties in Propositions 2-5 of our work:
- controllable/bounded variance of approximation
- easy `unsketching' operation (project to subspace and retract back, notice we use on input)
- easy complement operation for unmasking (output of decoder) that tries recover exactly what was removed on input
- fast generation and computation (generating is fast, multiplying with too).
4. In prop. 1, should be ?
Thank you. We have now fixed it.
5. Notation missing.
Our apology: should be not . and reads as `take any two vectors'.
Thanks for your response. This response addressed all of my concerns. I remain my postive rating.
I suggest to add the discussion in Q1 to the final version.
Than you. Absolutely, we will add such a discussion. If there is anything more we can improve or answer meantime, kindly let us know.
Best regards,
Authors
The work proposes to use random projections in the place of masked images for pre-training the ViTs. The work is shown to lead to better performance for classification tasks compared to the MIM methods. The work seems novel enough where they replace binary mask with floating point mask.
优点
-
The use of linear algebraic projection technique in both the method and the loss function for reconstruction.
-
The results are achieved better with a considerably smaller number of epochs.
-
The work is well-motivated from the basics and seems reproducible.
-
The transfer learning results are added value to the work as such.
-
The work should be useful as a pre-trainer for several ViT based applications.
缺点
- I'm not sure if all the recent works on MIM have been compared with. Authors are requested to comment on this.
问题
Are there any recent works which have been exempted from comparison?
Apart from classification and semantic segmentation, do the authors have results on any other applications?
Response to Rev. 1 (Reviewer 5QU4)
Firstly, we would like to sincerely thank the reviewer for the constructive review and valuable questions. Below are point-by-point responses.
1. Are there any recent works which have been exempted from comparison? I am not sure if all the recent works on MIM have been compared with.
- Self-supervised learning is an active research area. We have tried our best to cover all related works. Researchers from different fields have been creative in using MIM techniques for various applications such as video:
[I]. Videomae v2: Scaling video masked autoencoders with dual masking, Wang et al., CVPR 2023
and point clouds:
[II]. Geomae: Masked geometric target prediction for self- supervised point cloud pre-training, TIan et al., CVPR 2023
However, they are not directly related to our work as they research other modality inputs.
-
Another stream of research studies merging various pre-training strategies, i.e., combining contrastive learning with MIM, using teacher-student networks or utilizing networks pre-trained on large multi-modal datasets:
[III]. Towards all-in-one pre-training via maximizing multi-modal mutual information, Su et al., CVPR 2023
Such ideas can be potentially integrated with ROPIM but they are orthogonal to our work, and training each such a pipeline takes several days if not weeks. Nonetheless, we have equipped the most promising in our modest view such a framework, called LGP [IV] (in Fig. 1 we called it GPL due to typo), with ROPIM and showed substantial improvement in our Fig. 1.
[IV]. Bridging contrastive learning and masked image modeling for label-efficient representations
Another very recent pipeline, called Correlational Image Modeling (CorIM) [V], proposes a self-supervised pre-training task leveraging a cropping strategy, a bootstrap encoder, and a correlation decoder. However, performance of CorIM (83.1% top 1 acc. for ViT-B) is short of the performance of pure ROPIM's 83.5% with 300 pre-training epochs. Thus, as the most promising new strategy, we investigated GPL with our ROPIM (GPL+ROPIM in Fig. 1 of main submission).
[V]. Correlational image modeling for self-supervised visual pre-training, Li et al., CVPR 2023
We also reported recent CIM models [VII] which also perform worse than ours despite using a mix of contrastive and masking strategies (e.g., 81.6 vs. ours 82 in Table 1).
2. Apart from classification and semantic segmentation, do the authors have results on any other applications?
Following the very recent works [IV, V, VI, VII], we have reported results on classification and segmentation downstream tasks as each downstream task is time-consuming (e.g., 4 days for segmentation).
[VI]. Masked frequency modeling for self-supervised visual pre-training, Xie et al., ICLR 2023
[VII]. Corrupted image modeling for self- supervised visual pre-training, Fang et al., ICLR 2023
Nonetheless, we have also initiated less often used object detection on COCO for ViT-B. After 5 days of running the code we have completed 50/100 epochs with 50.01% bbox/AP while MAE completed 50/100 epochs at 49.19%. We anticipate to reach 100 epochs within another 5 days on our 8 GPU server. We will update our final manuscript accordingly.
I am satisfied with the response by the authors. Retain my positive rating.
Thank you,
If there is anything meantime we can add, change or improve, kindly do let us know.
Best regards,
Authors
This paper considers the self-supervised learning problem, which has attracted much attention recently. While the masked image modeling, e.g. masked autoencoder, has recently shown very promising performance in self-supervised learning for visual pre-training, the authors aim to improve it by proposing a new image modeling. Specifically, unlike the masked image modeling applying random crops to the input and learning to recover the masked inputs with an encoder-decoder network, this paper considers a random orthogonal projection modeling which uses a random subspace for the projection and then learns to recover the complement of that subspace. Provided experiments show that the proposed random orthogonal projection can yield better performance than the crop-based masking.
优点
Though simple and straightforward, to my knowledge the proposed random orthogonal projection modeling for self-supervised learning is novel. Provided experimental results have demonstrated better performance of the proposed random orthogonal projection method in comparison with the masked image modelling using crop-based masking.
缺点
The proposed method is somewhat heuristic.
问题
Some of the Propositions in Section 3.1 are rather straightforward and would be better not be expressed as Proposition.
loss is used in the reconstruction loss, would the MSE loss yield worse performance?
Response to Rev. 2 (Reviewer xoq6)
Firstly, we thank the reviewer for the constructive review and valuable questions.
1. The proposed method is somewhat heuristic.
Thank you. The very standard Masked Auto Encoder (MAE) and related works perform basic binary masking which we agree may be viewed as heuristic. However, MAE can be inspired by the mathematics of Denoising Auto-encoders:
What regularized auto-encoders learn from the data-generating distribution., Alain and Bengio, JMLR 2014
The above study explains how the variance of noise injected into DAE is directly connected with the gradient regularization and manifold learning property, e.g., consider Reconstruction Contractive Auto-encoder (RCAE) (their Eq. 6 and Section 3.1)
From that point of view, transformer-based MAE and related works may be harder to analyze but if we can quantify the variance of the noise we inject, we can lean on the theory of DAE and RCAE to understand how the noise variance governs their data manifold learning property.
Thus, our Proposition 2 we quantify the upper bound on variance implicitly introduced by our projection in ROPIM. From that point of view, our ROPIM enjoys better foundations than existing masked auto-encoders as we are able to quantify the injected variance while previous works do not seem to notice the importance of noise variance, and they do not provide any guarantee or indications in that sense at all.
We will ensure to explain that theoretical connection better in our final draft.
2. Some of the Propositions in Section 3.1 are rather straightforward and would be better not be expressed as Proposition.
Thank you. We are more than happy to simply rephrase propositions 2-5 as a list of properties of ROPIM with sketch-based projections. We are more than keen to read further advice from the reviewer.
While simple, we include Property 2 as it explains the variance of implicitly injected noise (important due to the conenction to DAE), Propositions 3 and 4 explain how to use such projections in practical terms (e.g., unsketchign is needed to keep tokens in one common space), Proposition 5 is important as it gives `the complement of sketching' operation required to guide the output of our approach, and ensures it `unmasks' exactly the information that was removed at the input.
3. The loss is used in the reconstruction loss, would the MSE loss yield worse performance?
Thank you for spotting this. We have now conducted experiments with MSE for ViT-T and ImageNet100, and the MSE loss performs slightly better than the loss. We will include such additional ablations.
| Loss | Top 1 acc. (with 300 PT epochs) | Top 1 acc. (with 800 PT epochs) |
|---|---|---|
| ROPIM with | 82.98 | 86.43 |
| ROPIM with MSE | 83.60 | 86.70 |
Esteemed Reviewers,
We want to thank for your constructive comments which help us refine our work.
We just wanted to let you know that we are working on the rebuttal right now to answer all your questions. We hope to post detailed answers around weekend.
Kind regards,
Authors
The paper propose an Image Modeling framework based on random orthogonal projection instead of binary masking as in MIM. The work is well-motivated from the basics and seems reproducible. Provided experimental results have demonstrated the superiority of the proposed random orthogonal projection method.
The reviewer has mentioned that It’s a little difficult to understand why does the proposed method is more superior than MIM. Perhaps adding a little discussion in the paper would be better.
为何不给更高分
The acceptance of the paper has been voted on.
为何不给更低分
The idea is novel and based on the sound theory.
Accept (spotlight)