Depth-Guided Self-Supervised Learning: Seeing the World in 3D
Depth signals and 3D views improve self-supervised learning
Abstract
Reviews and Discussion
This paper proposes a new self-supervised learning (SSL) approach to representation learning that incorporates novel-view-synthesis data augmentation and estimated depth maps as input. DPT is used for depth estimation and AdaMPI for novel view synthesis. By incorporating depth and novel views during training, the authors found their method to be more accurate when learning from little data and more robust to noisy test inputs. The authors claim this is the first method to use estimated depth as input for self-supervised representation learning.
Strengths
- Biologically inspired, with a correspondingly well-written introduction.
- Good correlation with biological elements in the visual system.
- Interesting reasoning on why novel views should improve SSL: as the mutual information between depth and input images is high, the effect of depth is negligible on an infinite dataset; novel 3D views, however, introduce new information.
- Simple and clear method; seems reproducible.
- Clear ablation studies.
Weaknesses
- Figure 1 lacks details. For instance, it is unclear what "2D augmentation" is being performed in PixDepth. It would also be good to visually represent your SSL objective.
- Depth is dropped from the input to encourage the network not to over-rely on it. However, depth is used as input when computing the metrics in the results tables. In this case, the comparison could be considered unfair, as the previous methods do not have depth as input. As depth is obtained from a supervised network, this method is not a pure SSL method.
- Most improvements come from adding the depth channel, which I cannot consider an important contribution.
- Why does the text claim improvements for SwAV in Table 1 when no results for SwAV are provided?
- For the results on ImageNet-100, no improvement is achieved; it is actually the opposite. The text glosses over this in a somewhat misleading way. Why are there no results with the combination method (depth + 3D)?
- I am afraid that the added robustness to corruptions in ImageNet-C and ImageNet-3DCC comes from the robustness of the depth estimation network (trained with depth GTs on a considerable amount of data).
- Some minor typos: "a approach", "an conceptually".
Questions
- Don't you think the phrase "we take seriously two insights" is too casual?
- Clear metrics comparing against the SOTA would be helpful in the introduction.
Details of Ethics Concerns
NA
Figure 1 lacks details. For instance, it is unclear what "2D augmentation" is being performed in PixDepth. It would also be good to visually represent your SSL objective.
We thank the reviewer for these suggestions. 2D augmentation refers to the standard set of augmentations applied to the 2D image, e.g., random crop, flip, and color jitter. We do not show the SSL objective explicitly because we use several standard SSL methods, and the SSL algorithm itself is not the focus of our work. Our focus is on extending the input with a depth map and on using augmentations that correspond to different (inferred) 3D views.
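For concreteness, below is a minimal sketch of how standard 2D augmentations can be applied consistently to an RGB image and its estimated depth map before the two are concatenated (illustrative code only, not our actual implementation; the function name and parameter values are hypothetical):

```python
import random

import torch
import torchvision.transforms.functional as TF
from torchvision.transforms import RandomResizedCrop

def augment_rgbd(rgb, depth, size=224):
    """Apply matched 2D augmentations to an RGB image and its depth map.

    rgb:   float tensor of shape (3, H, W)
    depth: float tensor of shape (1, H, W)
    Geometric transforms (crop, flip) share the same sampled parameters
    for both inputs; photometric jitter is applied to RGB only.
    """
    # Sample crop parameters once and apply them to both modalities.
    i, j, h, w = RandomResizedCrop.get_params(rgb, scale=(0.2, 1.0),
                                              ratio=(3 / 4, 4 / 3))
    rgb = TF.resized_crop(rgb, i, j, h, w, [size, size])
    depth = TF.resized_crop(depth, i, j, h, w, [size, size])

    # Shared horizontal flip.
    if random.random() < 0.5:
        rgb, depth = TF.hflip(rgb), TF.hflip(depth)

    # Color jitter on RGB only; depth is not a photometric signal.
    rgb = TF.adjust_brightness(rgb, 0.8 + 0.4 * random.random())

    # Concatenate depth as a fourth input channel.
    return torch.cat([rgb, depth], dim=0)  # (4, size, size)
```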
Depth is dropped from the input to encourage the network not to over-rely on it. However, depth is used as input when computing the metrics in the results tables. In this case, the comparison could be considered unfair, as the previous methods do not have depth as input. As depth is obtained from a supervised network, this method is not a pure SSL method.
We thank the reviewer for this interesting question. We did run an ablation to analyze the performance of the model when depth is not available during inference. An excerpt from the paper (Section 4.5) follows:
“Table 6 reports these results on ImageNette dataset with BYOL. Interestingly, we find that even with the absence of depth information, the accuracy of the model is higher than the baseline BYOL. This indicates that the model has implicitly learned some depth signal and captured better representations. It can also be seen that the performance on IN-3DCC is 1.5% higher than BYOL.”
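For concreteness, depth-channel dropout of this kind can be sketched as follows (an illustrative sketch, not our actual code; the function name and the dropout probability are hypothetical):

```python
import torch

def drop_depth_channel(x, p=0.2):
    """Randomly zero the depth channel of RGB-D inputs during training.

    x: batch of shape (B, 4, H, W), where channel 3 is depth.
    With probability p per example, the depth channel is zeroed so the
    network cannot over-rely on depth and must also learn depth cues
    from the RGB channels.
    """
    keep = (torch.rand(x.size(0), device=x.device) >= p).float()  # (B,)
    x = x.clone()
    x[:, 3] = x[:, 3] * keep.view(-1, 1, 1)
    return x
```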
Most improvements come from adding the depth channel, which I cannot consider an important contribution.
We note that 3D Views and the ensemble of Depth + 3D Views also improve performance over state-of-the-art baseline SSL methods, allowing us to argue for the broader claim that depth signals are valuable early in the vision pipeline. Perhaps the reviewer considered the depth channel an unimportant contribution because it is such a minor change to the architecture. We agree it is a minor architectural tweak, but we wish to emphasize that it is a significant conceptual change: very little work in the field considers computations that extract depth from 2D images as preceding object recognition.
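To illustrate how small the architectural tweak is, here is one way to widen a ResNet stem to accept a four-channel RGB-D input (a sketch under common conventions, not our exact code; the mean-filter initialization of the depth channel is a standard heuristic, not something the paper prescribes):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

model = resnet50()
old = model.conv1  # Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)

# Widen the stem to take a fourth (depth) channel; everything else is unchanged.
model.conv1 = nn.Conv2d(4, old.out_channels, kernel_size=old.kernel_size,
                        stride=old.stride, padding=old.padding, bias=False)

with torch.no_grad():
    model.conv1.weight[:, :3] = old.weight                            # reuse RGB filters
    model.conv1.weight[:, 3:] = old.weight.mean(dim=1, keepdim=True)  # init depth channel
```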
Why does the text claim improvements for SwAV in Table 1 when no results for SwAV are provided?
We do not show results with SwAV + 3D Views because the multi-crop setting used in SwAV makes it non-trivial to adapt to 3D Views. We mention this in the paper, but we will update the text to make it clearer.
For the results on ImageNet-100, no improvement is achieved; it is actually the opposite. The text glosses over this in a somewhat misleading way. Why are there no results with the combination method (depth + 3D)?
We are not sure we understand the question correctly. We report results with Depth + 3D on ImageNet-100 in Table 2. We do not report Depth + 3D results on ImageNet-1k due to compute constraints; apologies, we should have mentioned this in the article text.
I am afraid that the added robustness to corruptions in ImageNet-C and ImageNet-3DCC comes from the robustness of the depth estimation network (trained with depth GTs on a considerable amount of data).
We begin by emphasizing that the depth-estimation network need not be used during inference. As Section 4.5 (Table 6) indicates, when depth is unavailable at inference, the proposed method still outperforms the corresponding baseline SSL method. Regarding the reviewer's point that the depth-estimation network confers some robustness benefit on the learned embeddings, we certainly agree, although the benefit is indirect, arriving through the SSL method. Our investigation focused on the empirical question of whether such an indirect benefit would indeed be obtained.
- Even though the augmentations are clear in the text, the two images in PixDepth look almost the same. Depicting a more perceptually significant transformation in the figure would be better. I still think that, to better convey your idea, it would be good to represent your SSL objective (or objectives) pictorially.
- Can you point out where the table is that contains the baseline BYOL, which is 1.5% less accurate than your model with no disp input at test time?
- Thanks for your answer. As the authors mentioned, this is a straightforward architectural change. In terms of making depth an additional input, I think it would be necessary to explore further ways in which depth could be used as input. For example, surface normals and 3D point clouds can be extracted from the depth map.
- Thanks for the answer.
- Thanks for your answer. What would be necessary to make the network with depth inputs dropped perform as well as the network with depth inputs during testing?
We thank the reviewer for continued engagement.
Even though the augmentations are clear in the text, the two images in PixDepth look almost the same. Depicting a more perceptually significant transformation in the figure would be better. I still think that, to better convey your idea, it would be good to represent your SSL objective (or objectives) pictorially.
Thanks for the suggestions. We will update the figure accordingly.
Can you point out where the table is that contains the baseline BYOL, which is 1.5% less accurate than your model with no disp input at test time?
Table 6 (row 1) reports the baseline BYOL at 85.27%, and Table 6 (row 3), the model with no depth input during inference, achieves 86.80% on the ImageNette dataset. The difference is 1.53%, as stated in Section 4.5.
As the authors mentioned, this is a straightforward architectural change. In terms of making depth an additional input, I think it would be necessary to explore further ways in which depth could be used as input. For example, surface normals and 3D point clouds can be extracted from the depth map.
We believe you are making the sensible point that many alternative representations of depth could be used as input to models. We focused on representing depth directly, as a distance encoded in a channel of the input. We agree that other representations, such as surface normals or point clouds, are potentially interesting. However, exploring alternative representations does not seem essential to our fundamental research question of whether including depth information (in some form) enhances representation learning.
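As an aside, the reviewer is right that such representations are cheap to derive; for instance, approximate surface normals can be computed from a depth map with a few lines of code (a rough image-space sketch that ignores camera intrinsics; entirely illustrative):

```python
import torch
import torch.nn.functional as F

def depth_to_normals(depth):
    """Estimate per-pixel surface normals from a depth map by finite
    differences (a simple image-space approximation).

    depth: tensor of shape (1, 1, H, W); returns (1, 3, H, W) unit normals.
    """
    # Horizontal and vertical depth gradients.
    dzdx = depth[..., :, 1:] - depth[..., :, :-1]
    dzdy = depth[..., 1:, :] - depth[..., :-1, :]
    dzdx = F.pad(dzdx, (0, 1, 0, 0))  # pad back to (H, W)
    dzdy = F.pad(dzdy, (0, 0, 0, 1))

    # A normal perpendicular to both tangent directions: (-dz/dx, -dz/dy, 1).
    ones = torch.ones_like(depth)
    n = torch.cat([-dzdx, -dzdy, ones], dim=1)
    return n / n.norm(dim=1, keepdim=True)
```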
What would be necessary to make the network with depth inputs dropped perform as well as the network with depth inputs during testing?
This is an interesting question. The brain is remarkably good at extracting depth that is implicitly encoded in 2D retinal images. Dropping depth from the input simply means that depth must be inferred by early stages, analogous to the manner in which people extract depth. In this case, if depth knowledge is sufficiently encoded in the model's early stages, then depth inputs during inference may be redundant.
This work proposes to incorporate depth signals into the self-supervised learning (SSL) framework. Specifically, two baselines are provided: the first directly concatenates RGB and depth signals as the input to SSL, and the second augments SSL with novel views generated according to the depth signal.
Strengths
- This work investigates the influence of including depth signals in the SSL framework.
- The experiments show that, with the introduction of depth signals, existing SOTA SSL methods yield better performance.
Weaknesses
- Using depth signals as augmentation is not new in SSL. Previous works, e.g., DepthContrast, have explored it thoroughly.
- The proposed method lacks generalizability. Though it can be adapted to any SSL framework, the adopted depth estimation model is trained with supervision on several datasets. The performance of depth estimation cannot be guaranteed in scenarios that have a large domain gap relative to the training datasets.
- Also, due to the use of a supervised depth-estimation model, it is questionable to claim the proposed method as an SSL framework.
- Experiments are all conducted on subsets or modifications of ImageNet. Results on more datasets are expected.
Questions
Please see Weaknesses.
Using depth signals as augmentation is not new in SSL. Previous works, e.g., DepthContrast, have explored it thoroughly.
We thank the reviewer for pointing us to DepthContrast [1]. As we understand it, DepthContrast requires ground-truth depth information in the form of point-cloud data and relies on format-specific encoders (such as PointNet++ for point clouds and UNet for voxels). In contrast, our method takes a standard 2D RGB image as input and assumes only that depth maps can be estimated by a separate module, allowing our method to be seamlessly integrated into standard SSL pipelines. More importantly, the motivations behind DepthContrast and our method are very different.
We discuss comparisons with other related work in Appendix A.
The proposed method lacks generalizability. Though it can be adapted to any SSL framework, the adopted depth estimation model is trained with supervision on several datasets. The performance of depth estimation cannot be guaranteed in scenarios that have a large domain gap relative to the training datasets.
The Dense Prediction Transformer (DPT) [2] is robust across a variety of natural-image datasets, as shown in its paper via zero-shot cross-dataset transfer. It is for these reasons that various works in 3D reconstruction [3, 4] and view synthesis [5, 6] use DPT to estimate depth maps.
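For reference, DPT depth maps can be obtained via the publicly documented MiDaS torch.hub recipe, sketched below; our exact preprocessing may differ:

```python
import cv2
import torch

# Load DPT-Large and its preprocessing transform from the MiDaS hub repo
# (the model's weights were trained with depth supervision on a mix of datasets).
midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large")
midas.eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").dpt_transform

img = cv2.cvtColor(cv2.imread("image.jpg"), cv2.COLOR_BGR2RGB)

with torch.no_grad():
    prediction = midas(transform(img))      # (1, h', w') inverse relative depth
    depth = torch.nn.functional.interpolate(
        prediction.unsqueeze(1), size=img.shape[:2],
        mode="bicubic", align_corners=False,
    ).squeeze()                             # resampled to the input resolution
```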
Also, due to the use of a supervised depth-estimation model, it is questionable to claim the proposed method as an SSL framework.
While the reviewer is technically correct, we urge the reviewer to appreciate our underlying premise. We argue that biological agents obtain depth information for free through continuous-time interaction with their environments, and that a complete, continually learning AI agent should as well. Rather than developing a complex multi-module AI system, however, we start from the premise that depth maps are available and then ask how they can benefit self-supervised learning. The means by which depth is obtained is not central to our investigation; what is key is that depth is extracted from a 2D image (as the human brain does). This work provides insight into how to utilize this depth information effectively and into the benefits of doing so.
Experiments are all conducted on subsets or modifications of ImageNet. Results on more datasets are expected.
We have reported results on the ImageNet-1k dataset (Table 3), which consists of 1.2 million real-world images. It is common in the SSL literature to report pretraining results only on ImageNet-1k [7, 8, 9, 10]. We also emphasize that SSL in general is computationally expensive.
[1] Zhang, Zaiwei, et al. "Self-supervised pretraining of 3D features on any point-cloud." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
[2] Ranftl, René, Alexey Bochkovskiy, and Vladlen Koltun. "Vision transformers for dense prediction." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
[3] Wu, Chao-Yuan, et al. "Multiview compressive coding for 3D reconstruction." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
[4] Liu, Ruoshi, et al. "Zero-1-to-3: Zero-shot one image to 3D object." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
[5] Han, Yuxuan, Ruicheng Wang, and Jiaolong Yang. "Single-view view synthesis in the wild with learned adaptive multiplane images." ACM SIGGRAPH 2022 Conference Proceedings. 2022.
[6] Jiang, Yutao, et al. "Diffuse3D: Wide-angle 3D photography via bilateral diffusion." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
[7] Chen, Ting, et al. "A simple framework for contrastive learning of visual representations." International Conference on Machine Learning. PMLR, 2020.
[8] Yeh, Chun-Hsiao, et al. "Decoupled contrastive learning." European Conference on Computer Vision. Springer, 2022.
[9] Shi, Yuge, et al. "Adversarial masking for self-supervised learning." International Conference on Machine Learning. PMLR, 2022.
[10] Dwibedi, Debidatta, et al. "With a little help from my friends: Nearest-neighbor contrastive learning of visual representations." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
This paper provides a new representation learning method that utilizes an estimated depth map to learn a geometry-aware representation. As mentioned in the abstract, the goal of SSL is to learn representations useful for "downstream tasks." However, the scale of the conducted experiments is insufficient to claim the effectiveness of the proposed method.
- The proposed method is evaluated only in small-scale datasets (e.g., ImageNet-100, ImageNet-1k).
- The proposed method is evaluated only in small-scale models (e.g., ResNet-18, 50).
- The proposed method is evaluated only in a classification task.
The proposed method needs to demonstrate its effectiveness and scalability on large-scale datasets, various backbone models (e.g., CNN and Transformer variants), and diverse downstream tasks (e.g., 2D/3D detection, 2D/3D segmentation, 3D reconstruction, 3D view generation, etc.).
Strengths
This paper provides a new representation learning method that utilizes an estimated depth map to learn a geometry-aware representation. To train an RGB-D backbone network, the method generates 3D views from an image and its estimated depth map and uses them with previous SSL methods.
Weaknesses
[Quality & Significance] As mentioned in the abstract, the goal of SSL is to learn useful representation for "downstream tasks." However, the scale of the conducted experiments is insufficient to claim the effectiveness of the proposed method.
- The proposed method is evaluated only in small-scale datasets (e.g., ImageNet-100, ImageNet-1k).
- The proposed method is evaluated only in small-scale models (e.g., ResNet-18, 50).
- The proposed method is evaluated only in a classification task.
The proposed method needs to demonstrate its effectiveness and scalability on large-scale datasets, various backbone models (e.g., CNN and Transformer variants), and diverse downstream tasks (e.g., 2D/3D detection, 2D/3D segmentation, 3D reconstruction, 3D view generation, etc.).
[Clarity] I recommend the authors narrow down the scope of the proposed method from general SSL to a specific SSL method. The current backbone takes RGB-Depth images as inputs, so targeting downstream tasks with RGB-D inputs (e.g., RGB-D segmentation, 3D reconstruction, 3D view synthesis, human pose estimation) is a more reasonable choice to claim the effectiveness of the proposed method. The current claim is too general and insufficiently supported.
Questions
Please see the Weaknesses section.
Details of Ethics Concerns
NA
The proposed method is evaluated only in small-scale datasets (e.g., ImageNet-100, ImageNet-1k). The proposed method is evaluated only in small-scale models (e.g., ResNet-18, 50).
We have reported results on ImageNet-1k, which consists of 1.2 million real-world images. It is a bit surprising to see the reviewer classify this dataset as small-scale. Even in recent SSL papers [1, 2], ImageNet-1k is the only dataset used for evaluation, with a single backbone (ResNet-50).
[Clarity] I recommend the authors narrow down the scope of the proposed method from general SSL to a specific SSL method. The current backbone takes RGB-Depth images as inputs, so targeting downstream tasks with RGB-D inputs (e.g., RGB-D segmentation, 3D reconstruction, 3D view synthesis, human pose estimation) is a more reasonable choice to claim the effectiveness of the proposed method. The current claim is too general and insufficiently supported.
We respectfully disagree with the reviewer here. As we pointed out to other reviewers, the goal of SSL is to learn representations that are useful for a variety of downstream tasks, and it is common in the SSL literature to primarily report linear classification accuracy. In our paper, we focused more on understanding the various components of our method (including depth-channel dropout and the quality of 3D views) than on evaluating a wide range of downstream tasks.
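For clarity, the linear-classification protocol we refer to is the standard linear probe: freeze the pretrained encoder and train only a linear classifier on its features. A minimal sketch (function name and hyperparameters are illustrative, not ours):

```python
import torch
import torch.nn as nn

def linear_probe(encoder, train_loader, num_classes, feat_dim, epochs=10):
    """Standard linear-evaluation protocol: freeze the pretrained encoder
    and train a linear classifier on top of its (fixed) features.

    encoder is assumed to output feature vectors of size feat_dim, e.g. a
    ResNet with its classification head removed.
    """
    encoder.eval()
    for p in encoder.parameters():
        p.requires_grad = False

    clf = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.SGD(clf.parameters(), lr=0.1, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for x, y in train_loader:
            with torch.no_grad():
                feats = encoder(x)          # (B, feat_dim), no gradients
            loss = loss_fn(clf(feats), y)   # gradients flow into clf only
            opt.zero_grad()
            loss.backward()
            opt.step()
    return clf
```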
[1] Bardes, Adrien, Jean Ponce, and Yann LeCun. "VICRegL: Self-supervised learning of local visual features." Advances in Neural Information Processing Systems 35 (2022): 8799-8810.
[2] Yeh, Chun-Hsiao, et al. "Decoupled contrastive learning." European Conference on Computer Vision. Springer, 2022.
The paper aims to learn geometry-aware representations in a self-supervised manner. The paper received uniformly negative scores from the reviewers.
The main issues raised in the reviews are:
- limited evaluation: small-scale datasets, models, and tasks (only classification)
- the mono-depth models are supervised
- unclear whether the method can be applied to other domains (relies on the generalizability of DPT)
During the discussions, the authors clarified several points. However, most reviewers remained unconvinced. Given the consensus among the reviewers, the AC finds no grounds to accept.
Why Not a Higher Score
Limited evaluation, the dependency on mono-depth models, and unclear applicability to other SSL methods.
Why Not a Lower Score
N/A
Reject