DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation
We propose DFormer, an innovative RGB-D pretraining framework for enhancing RGB-D segmentation tasks, achieving new state-of-the-art performance with less than half the computational cost of current best methods.
Abstract
Reviews and Discussion
This paper presents an RGB-D scene understanding framework with RGB-D pretrained weights. Two tasks are considered, including RGB-D semantic segmentation and salient object detection. A global awareness attention module and a local enhancement attention module are designed. RGB-D pretraining is performed on ImageNet-1K with estimated depth data. The proposed model achieves state-of-the-art performance and maintains good efficiency compared to existing works. As shown in Table 3, the benefit of using RGB-D pretraining is significant. Extensive ablation studies and parameter studies are conducted.
Strengths
- This work is one of the first to consider RGB-D pretraining to enhance RGB-D scene understanding. The results show that the benefit of RGB-D pretraining is significant.
- The proposed model is highly efficient compared to existing works.
- Table 1 presents a nice way of comparing with works whose code is publicly available.
- The paper is overall well-written and nicely structured.
Weaknesses
- MultiMAE also uses RGB-D pretraining. However, in this work, a different depth estimation model is used. Would it be nice to provide a fairer comparison by using the same depth estimation model as MultiMAE to produce the ImageNet depth data?
- Again regarding fairness, the RGB-D pretraining is based on ImageNet RGB-D data, and the depth estimation leverages important knowledge learned on other datasets. However, this knowledge is not used by existing RGB-D segmentation works like CMX. This can be discussed.
- Will the pretraining weights be released? Would the ImageNet depth data be released? This could be discussed.
- In the introduction, it was argued: "the interactions are densely performed between the RGB branch and depth branch during finetuning, which may destroy the representation distribution". Do you have any observations to support this argument? E.g., some destroyed distributions or feature maps could be analyzed.
- There are still some writing mistakes. E.g., "we conduct a depth-wise convolution" should be "We conduct a depth-wise convolution". "Our DFormer perform better segmentation accuracy than the current state-of-the-art" should be "Our DFormer produces higher segmentation accuracy than the current state-of-the-art".
- ACNet (ICIP 2019) should be added to Table 1.
- How does the method scale to other modalities like RGB-Thermal, RGB-LiDAR, X-Y-Z data, or even more modalities and datasets? This is not well discussed in the future work section. Unlike depth data, which can be produced by robust depth estimation models, it is harder to obtain large-scale thermal and LiDAR datasets for pretraining. This can be better discussed.
- As the main contribution lies in the study of RGB-D pretraining, more recent and advanced pretraining strategies could be compared and discussed. The main technical design lies in the fusion blocks, but there are no specific pretraining designs. Please discuss this and assess more pretraining choices.
Sincerely,
Questions
The proposed model is highly efficient and it has large gains thanks to the RGB-D pretraining. If the RGB-D pretraining strategy is applied to heavier state-of-the-art RGB-D segmentation models like CMX and CMNeXt, how much gain can be achieved? If possible, this could be assessed and would help provide a fairer comparison.
Fig. 11 shows that the proposed module is sophisticated. Would it be nice to provide more detailed ablations to study other design choices based on this module architecture?
Sincerely,
Thank you for your effort in reviewing our paper and for the insightful suggestions. We hope the following responses address your concerns.
Q1. Discussion about different depth estimation methods.
In Sec. A of our appendix, we have conducted experiments to explore the effect of depth maps generated by different methods. Specifically, we adopt AdaBins (trained on 26K pairs from KITTI) and the more advanced Omnidata (trained on 4M RGB-D pairs and adopted in MultiMAE), where Omnidata uses more training data and attains stronger zero-shot capability. The estimated depth maps of the two methods are shown in Fig. 10. After finetuning, DFormer-B pretrained with either set of depth maps achieves the same performance, i.e., 55.6 mIoU. This indicates that the quality of depth maps generated by recent depth estimation methods has little effect on the performance of DFormer. Besides, we will also explore replacing depth with other modalities and analyze the influence of different estimation methods on our model in the future.
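For completeness, a minimal sketch of the pseudo-depth generation step is given below. It is illustrative only: `depth_model` is a placeholder standing in for AdaBins or Omnidata, and the real preprocessing pipelines of those models differ from the generic resize-and-normalize shown here.

```python
import torch
from PIL import Image
from torchvision import transforms

@torch.no_grad()
def estimate_depth(depth_model, image_path, size=224):
    """Generate a normalized pseudo-depth map for one pretraining image.

    `depth_model` is an assumed callable returning a 1x1xHxW depth tensor;
    it stands in for an off-the-shelf monocular estimator such as AdaBins.
    """
    tf = transforms.Compose([
        transforms.Resize((size, size)),
        transforms.ToTensor(),
    ])
    rgb = tf(Image.open(image_path).convert("RGB")).unsqueeze(0)  # 1x3xHxW
    depth = depth_model(rgb)
    # Normalize to [0, 1] so depth maps from different estimators are comparable.
    depth = (depth - depth.min()) / (depth.max() - depth.min() + 1e-6)
    return depth
```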
Q2. About the open source of DFormer.
In the supplementary materials, we have provided the source code for the finetuning. Moreover, all estimated depth maps, pretrained weights, and source code will be made publicly available.
Q3. Observation to support the argument within the introduction.
As stated in [1], the mean and variance of a BN layer reflect the distribution center and the distribution breadth of its input data, respectively. Fusing RGB-D features within an RGB-pretrained model brings drastic changes to the feature distribution, which makes the previous batch normalization statistics incompatible with the input features. To verify whether this phenomenon exists, we visualize the batch normalization statistics of a random layer of DFormer in Fig. 11 of the new revision. For the RGB-D-pretrained DFormer, the running mean and variance of the BN layer change only slightly after finetuning, illustrating that the learned RGB-D representation is transferable to RGB-D segmentation tasks. In contrast, for the RGB-pretrained DFormer, the BN statistics change drastically after finetuning, indicating that the encoding is mismatched. This mismatch makes it difficult to optimize the RGB-pretrained weights for RGB-D scenes. Following your suggestion, we also visualize some features of DFormer under RGB and RGB-D pretraining in Fig. 12 of the new revision. A minimal sketch of this BN-statistics comparison is given after the reference below.
[1] An empirical study of adder neural networks for object detection. In NeurIPS, 2021.
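The BN-statistics comparison mentioned above can be reproduced with a check like the sketch below. It assumes two copies of the same backbone, one holding the pretrained weights and one the finetuned weights; the function name and the reduction to a mean absolute shift are our own illustrative choices, not code from the paper.

```python
import torch

def bn_shift(backbone_before, backbone_after):
    """Measure how much BatchNorm running statistics moved during finetuning."""
    shifts = {}
    mods_before = dict(backbone_before.named_modules())
    mods_after = dict(backbone_after.named_modules())
    for name, m in mods_before.items():
        if isinstance(m, torch.nn.BatchNorm2d):
            a = mods_after[name]  # same architecture, so module names match
            mean_shift = (a.running_mean - m.running_mean).abs().mean().item()
            var_shift = (a.running_var - m.running_var).abs().mean().item()
            shifts[name] = (mean_shift, var_shift)
    return shifts
```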
Q4. Comparison to ACNet.
Thanks for your suggestion. We have added this method to Tab. 1 of the new revision.
Q5. Discussion on the generalization of DFormer.
In this paper, DFormer is endowed with the capacity to fuse RGB and depth features during pretraining. To verify whether this interaction still works when depth is replaced with another modality, we apply our DFormer to benchmarks with other modalities, i.e., RGB-T on MFNet (IROS, 2017) and RGB-L on KITTI-360 (TPAMI, 2022), as shown in the table below. As can be seen, RGB-D pretraining still improves performance on these modality combinations, but the improvement is limited compared to that on RGB-D scenes. To address this issue, a foreseeable solution is to further scale DFormer to other modalities, and two directions are worth trying: synthesizing pseudo data for the new modality, and separately pretraining on a single-modality dataset. For the former, several generation methods already exist, e.g., Pseudo-LiDAR [2] generates pseudo LiDAR data from depth maps, and N-ImageNet [3] provides event data for ImageNet; collecting data and training generators for more modalities is also worth exploring. For the latter, we can separately pretrain the part of the model that processes the supplementary modality and then combine it with the RGB model. We will attempt these methods to bring more significant improvements for DFormer on different multimodal scenes.
| Methods | Param | Flops | MFNet (RGB-T) | KITTI-360 (RGB-L) |
|---|---|---|---|---|
| CMX (MiT-B2) | 66.6M | 67.6G | 58.2 | 64.3 |
| CMX (MiT-B4) | 139.9M | 134.3G | 59.7 | 65.5* |
| CMNeXt (MiT-B2) | 66.6M | 65.5G | 58.4* | 65.3 |
| CMNeXt (MiT-B4) | 135.6M | 132.6G | 59.9 | 65.6* |
| DFormer-L (RGB-Pretrained) | 39.0M | 65.7G | 59.5 | 65.2 |
| DFormer-L (RGBD-Pretrained) | 39.0M | 65.7G | 60.3 | 66.1 |
'*' means the result is not reported in the original paper and we obtained it via their official code.
[2] Pseudo-LiDAR from visual depth estimation: Bridging the gap in 3D object detection for autonomous driving. CVPR, 2019. [3] N-ImageNet: Towards robust, fine-grained object recognition with event cameras. CVPR, 2021.
Thanks for appreciating our paper and raising your rating. Following your suggestions, we have added the analyses of applying RGB-D pretraining to other methods and the results on MFNet/KITTI-360 to the main paper of the new revision.
Q6. Apply the RGB-D pretraining to other methods.
Following your suggestion, to make the comparison fairer, we pretrain CMX (MiT-B2) on RGB-D ImageNet, which yields an improvement of about 1.4% mIoU. We also report RGB-only pretrained DFormers to provide more insight. Under RGB-only pretraining, DFormer-L still outperforms CMX (MiT-B2) by about 1.1% with less computation cost. With RGB-D pretraining, the margin of DFormer-L over CMX (MiT-B2) is further enlarged from 1.1% to 1.5%, which we attribute to the pretrained fusion weights within DFormer achieving better and more efficient fusion of the RGB-D data.
| Methods | Pretrained weight | Param | Flops | NYU DepthV2 | SUNRGBD |
|---|---|---|---|---|---|
| CMX (MiT-B2) | RGB-only | 66.6M | 67.6G | 54.4 | 49.7 |
| DFormer-B | RGB-only | 29.5M | 41.9G | 53.3 | 49.5 |
| DFormer-L | RGB-only | 39.0M | 65.7G | 55.4 | 50.6 |
| CMX (MiT-B2) | RGB-D | 66.6M | 67.6G | 55.8 | 51.1 |
| DFormer-B | RGB-D | 29.5M | 41.9G | 55.6 | 51.2 |
| DFormer-L | RGB-D | 39.0M | 65.7G | 57.2 | 52.5 |
Q7. More detailed ablation of the building block.
In Tabs. 5-8, we have provided ablation experiments on the components of our RGB-D block, the pooling size of our GAA, and the fusion manners in GAA and LEA. We notice that the modules that only encode RGB features may not have been ablated thoroughly, so we provide more results for them here. Due to limited time and computation resources, we use a short pretraining duration of 100 epochs on DFormer-S. We will add this ablation with the full pretraining duration to the main paper. Note that the table below only ablates the structure within the modules (gray part in Fig. 11) that process the RGB features; a minimal sketch of the three fusion operations is given after the table.
| DWConv Setting | Attention Operation | Param | Flops | NYUDepthV2 |
|---|---|---|---|---|
| DWConv | Hadamard Product | 18.7M | 25.6G | 51.9 |
| DWConv | Hadamard Product | 18.7M | 23.9G | 51.6 |
| DWConv | Hadamard Product | 18.7M | 27.1G | 51.9 |
| DWConv | Addition | 18.7M | 25.0G | 51.3 |
| DWConv | Concatenation | 19.3M | 26.9G | 51.7 |
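For clarity, the sketch below illustrates the three attention operations ablated in the table: Hadamard product, addition, and concatenation followed by a 1x1 projection. It is a simplified stand-in rather than the exact DFormer block; the `FusionVariant` class, the 3x3 depth-wise convolution, and the projection layer are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusionVariant(nn.Module):
    """Combine a feature map with an attention map via one of three operations."""

    def __init__(self, dim, mode="hadamard"):
        super().__init__()
        self.mode = mode
        self.dwconv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # depth-wise conv
        self.proj = nn.Conv2d(2 * dim, dim, 1) if mode == "concat" else nn.Identity()

    def forward(self, feat, attn):
        feat = self.dwconv(feat)
        if self.mode == "hadamard":
            return feat * attn
        if self.mode == "add":
            return feat + attn
        return self.proj(torch.cat([feat, attn], dim=1))  # "concat"
```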
The reviewer would like to thank the authors for adding these analyses and more detailed ablations and experiments, which help solve many concerns. The observations in Fig. 11 and Fig. 12 are very interesting. The results on MFNet and KITTI360 also provide insights that the method could boost RGB-T and RGB-L segmentation.
These analyses should be added to the final version. In particular, the analyses of the RGB-D pretraining to other methods and the results on MFNet/KITTI360 are important and could be added to the main paper.
The reviewer would like to elevate the rating accordingly.
Sincerely,
The paper introduces an RGB-D pretraining framework for learning transferable representations for RGB-D segmentation tasks. In the proposed method, DFormer, the RGB-depth backbone is pretrained using RGB-D pairs derived from ImageNet-1K, with the aim of enabling effective encoding of RGB-D information. It incorporates a sequence of RGB-D blocks designed for optimal representation of both RGB and depth data.
Strengths
The proposed RGB-D pretraining framework can be used to address the representation distribution shift between RGB and depth information and to improve the quality of the RGB-D representation.
A building block is proposed to perform RGB-depth feature interaction early, in the pretraining stage, which makes it possible to reduce the interaction outside the backbone in the finetuning stage.
Weaknesses
The comparison of applying RGB-D pretraining to previous works is missing. Most of the improvement seems to come from the joint pretraining with additional depth information compared to previous methods.
The analysis of the depth generation is limited. Only one depth estimation method is used to generate the depth images for ImageNet.
There is a generalization limitation in combining two modalities for pretraining. The performance of pretraining or finetuning on downstream tasks seems to be highly dependent on the generation or estimation of the modality besides RGB.
Questions
What is the effect of using different depth estimation models? How much does the accuracy of depth estimation matter for RGB-D model pretraining, and will errors accumulate?
How does the fusion building block compare with the fusion modules proposed in previous methods, such as CMX? Also, did the authors try RGB-D pretraining for other methods, so as to provide a more comparable setting?
How does DFormer perform with only RGB pretraining + depth (D) initialization for finetuning?
What is the effect of, and how much improvement comes from, the lightweight Hamburger decoder in the proposed model? Did the authors try other decoders?
Why is feature interaction between RGB and depth information performed in the last two stages?
Reviewer zqyp has raised an interesting question about the setting using RGB pretrain + D initialization for DFormer. I am also curious to know that.
Besides, I agree that it would be interesting to investigate the effects of using different depth estimation methods under various measurement noises.
Thank you for your effort in reviewing our paper and for the kind suggestions. We are happy to see your positive comments on the writing, experiments, and method. We hope the following responses address your concerns.
Q1. Discussion about different depth estimation methods.
In Sec. A of our appendix, we have conducted experiments to explore the effect of depth maps generated by different methods. Specifically, we adopt AdaBins [1] (trained on 26K pairs from KITTI [2]) and the more advanced Omnidata [3] (trained on 4M RGB-D pairs and adopted in MultiMAE), where Omnidata uses more training data and attains stronger zero-shot capability. The estimated depth maps of the two methods are shown in Fig. 10. After finetuning, DFormer-B pretrained with either set of depth maps achieves the same performance, i.e., 55.6 mIoU. This indicates that the quality of depth maps generated by recent depth estimation methods has little effect on the performance of DFormer. Besides, we will also explore replacing depth with other modalities and analyze the influence of different estimation methods on our model in the future.
[1] AdaBins: Depth estimation using adaptive bins. In CVPR, 2021.
[2] Vision meets robotics: The KITTI dataset. IJRR, 2013.
[3] Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3D scans. In ICCV, 2021.
Q3. Pretraining the CMX.
Thanks for your suggestion. To make the experiments more comprehensive, we use the pretraining manner of DFormer to pretrain CMX (MiT-B2) and then finetune the model via its official code. As shown below, this pretraining brings about a 1.4% improvement for CMX (MiT-B2), but our DFormer-L still outperforms the CMX (MiT-B2) that adopts the same pretraining by a large margin. We attribute this to the well-designed building block and the efficient encoding design of our DFormer.
| Methods | Pretraining | Param | Flops | NYU DepthV2 | SUNRGBD |
|---|---|---|---|---|---|
| CMX (MiT-B2) | Only RGB | 66.6M | 67.6G | 54.4 | 49.7 |
| CMX (MiT-B2) | RGB-D | 66.6M | 67.6G | 55.8 | 51.1 |
| DFormer-B | RGB-D | 29.5M | 41.9G | 55.6 | 51.2 |
| DFormer-L | RGB-D | 39.0M | 65.7G | 57.2 | 52.5 |
Q4. DFormer with RGB pretraining + randomly initialized depth branch
To provide more insights, we pretrain the RGB part via RGB-only pretraining and randomly initialize the depth part of DFormer; the results are shown in the table below. Under RGB-only pretraining, DFormer-L still outperforms CMX (MiT-B2) by about 1.1% with less computation cost. With RGB-D pretraining, the margin of DFormer-L over CMX (MiT-B2) is further enlarged from 1.1% to 1.5%, which we attribute to the pretrained fusion weights within DFormer achieving better fusion of the RGB-D data. A minimal sketch of this weight-loading setting follows the table.
| Methods | Pretrained weight | Param | Flops | NYU DepthV2 | SUNRGBD |
|---|---|---|---|---|---|
| CMX (MiT-B2) | RGB-only | 66.6M | 67.6G | 54.4 | 49.7 |
| DFormer-B | RGB-only | 29.5M | 41.9G | 53.3 | 49.5 |
| DFormer-L | RGB-only | 39.0M | 65.7G | 55.4 | 50.6 |
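The RGB-only setting above can be implemented roughly as in the sketch below: the RGB-pretrained weights are loaded into the RGB/shared parameters while the depth-branch parameters keep their random initialization. The `depth` name prefix and the function name are assumptions for illustration, not the released code.

```python
import torch

def load_rgb_only_pretrain(model, rgb_ckpt_path):
    """Load RGB-only pretrained weights, leaving the depth branch randomly initialized."""
    state = torch.load(rgb_ckpt_path, map_location="cpu")
    # Keep only parameters that do not belong to the (assumed) depth branch.
    filtered = {k: v for k, v in state.items() if not k.startswith("depth")}
    missing, unexpected = model.load_state_dict(filtered, strict=False)
    # Depth-branch keys are expected to appear in `missing` and stay random.
    print(f"kept random: {len(missing)} keys, unexpected: {len(unexpected)} keys")
    return model
```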
Q5. DFormer with different decoders.
Considering that the RGB-D interaction has already been achieved within each block of our DFormer encoder, there is no need to design extra fusion modules in the decoder. Thus, we follow the RGB segmentation method SegNeXt (NeurIPS, 2022) and use the Ham head as the decoder. To better understand the effect of our DFormer, we further provide results of DFormer-B with different decoder heads in the table below; a minimal sketch of such a lightweight head follows the table.
| Decoder heads | Param | Flops | NYU DepthV2 | SUNRGBD |
|---|---|---|---|---|
| Ham head | 29.5M | 41.9G | 55.6 | 51.2 |
| MLPs head | 29.1M | 43.5G | 55.3 | 50.9 |
| Non-local head | 32.8M | 49.4G | 56.0 | 51.5 |
| CMX decoder | 35.0M | 48.3G | 55.4 | 51.0 |
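As an illustration of how light such heads can be, the sketch below shows a minimal MLP-style decoder that projects multi-stage encoder features to a common width, upsamples, and classifies. The channel sizes and class count are placeholders, not DFormer's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMLPDecoder(nn.Module):
    """Fuse multi-stage features with 1x1 projections and predict per-pixel classes."""

    def __init__(self, in_dims=(128, 256, 512), embed_dim=256, num_classes=40):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(d, embed_dim, 1) for d in in_dims)
        self.fuse = nn.Conv2d(embed_dim * len(in_dims), embed_dim, 1)
        self.cls = nn.Conv2d(embed_dim, num_classes, 1)

    def forward(self, feats):
        # feats: encoder features ordered from highest to lowest resolution.
        target = feats[0].shape[2:]  # upsample everything to the largest map
        ups = [F.interpolate(p(f), size=target, mode="bilinear", align_corners=False)
               for p, f in zip(self.proj, feats)]
        return self.cls(self.fuse(torch.cat(ups, dim=1)))
```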
Q6. About the feature fusion.
In our DFormer, the interaction between RGB and depth features is performed within each building block of the encoder. As stated in SegNeXt (NeurIPS, 2022), the features from Stage 1 contain too much low-level information and do not bring performance improvement, as shown in the table below. Therefore, we only use the features from the last three stages for prediction.
| Decoder Features | Param | Flops | NYU DepthV2 |
|---|---|---|---|
| last three stages | 18.7M | 25.6G | 53.6 |
| all the four stages | 20.1M | 29.5G | 53.6 |
Thanks for the detailed information provided by the authors. The response has addressed most of my concerns. After reading other post-rebuttal discussions, I chose to keep my final rating.
In this paper, the authors propose an RGB-D pretraining framework for RGB-D semantic segmentation and salient object detection (SOD). First, they use an off-the-shelf depth estimator to generate depth maps for ImageNet-1K. Then, they use the image-depth pairs from ImageNet-1K to pretrain the backbone. Next, they insert an existing head on the backbone and then finetune the model on the RGB-D semantic segmentation and salient object detection datasets.
Strengths
- To improve the model performance, the authors pretrained the backbone on ImageNet-1K with image-depth pairs.
- The authors conducted experiments on two RGB-D segmentation tasks.
Weaknesses
- The novelty and contributions are too limited. First, the proposed RGB-D block slightly modifies popular techniques, i.e., the self-attention mechanism (Vaswani et al., 2017), depth-wise convolution, and attention weights (Hou et al., 2022), and combines them to fuse RGB and depth features. Second, the design of the RGB-D block follows the widely used idea in SOD, i.e., global and local information fusion. Third, the decoder directly uses the existing head from SegNeXt (Guo et al., 2022a) without any novel design. Thus, the contribution only comes from the pretraining idea, which is limited.
- The authors missed some related methods [1-4] for comparison.
[1] Visual Saliency Transformer. ICCV 2021.
[2] 3-d convolutional neural networks for rgb-d salient object detection and beyond. TNNLS 2022.
[3] Bi-Directional Progressive Guidance Network for RGB-D Salient Object Detection. TCSVT 2022.
[4] UCTNet: Uncertainty-aware cross-modal transformer network for indoor RGB-D semantic segmentation. ECCV 2022.
- To demonstrate the effectiveness of the pretrained backbone, the authors should replace the previous backbone in the compared methods with the proposed one to see whether improvements can be achieved.
- The authors ignore existing pre-training methods [5, 6] for discussion and comparison.
[5] RGB-based Semantic Segmentation Using Self-Supervised Depth Pre-Training
[6] Self-Supervised Pretraining for RGB-D Salient Object Detection. AAAI 2022.
- Some widely used RGB-D SOD benchmark datasets [7-9] are also ignored.
[7] Depth-induced multi-scale recurrent attention network for saliency detection. ICCV 2019.
[8] Learning selective mutual attention and contrast for rgb-d saliency detection. TPAMI 2021.
[9] Saliency detection on light field. CVPR 2014.
Questions
Please see weaknesses.
In Table 1, the authors compared the proposed method against methods with open code implementations.
Regarding [4] UCTNet, while it reaches high accuracy using uncertainty-aware self-attention, it does not release its source code and models. I think it is fine to only compare against methods with open code. The authors are suggested to discuss UCTNet in the related work section.
Thank you for your effort in reviewing our paper and for the kind suggestions. We hope the following responses address your concerns.
Q1. About the architectural design
The main goal of this paper is to provide a new scheme for acquiring transferable representations for RGB-D segmentation tasks. Thus, the focus of our architectural design is on achieving such RGB-D pretraining in an efficient way, rather than building complicated modules. To this end, we integrate the RGB-D interaction into the attention mechanism to achieve efficient fusion within each building block. Besides, we observe that the depth information needs only a small portion of channels to be encoded, rather than a whole RGB-pretrained backbone as used in previous works (a minimal sketch of this channel allocation is given below). Based on these designs, even with a simple decoder head, e.g., the Ham head or an MLP head, our DFormer achieves state-of-the-art performance, as shown in Q3.
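The channel-allocation idea mentioned above, giving the depth branch only a small fraction of the channels used by the RGB branch, can be sketched as follows. The ratio and the stem layout are illustrative assumptions, not DFormer's exact design.

```python
import torch.nn as nn

class RGBDStem(nn.Module):
    """Encode RGB with full width and depth with a much narrower branch."""

    def __init__(self, rgb_dim=64, depth_ratio=0.25):
        super().__init__()
        depth_dim = int(rgb_dim * depth_ratio)  # depth uses only a small portion of channels
        self.rgb_stem = nn.Conv2d(3, rgb_dim, 3, stride=2, padding=1)
        self.depth_stem = nn.Conv2d(1, depth_dim, 3, stride=2, padding=1)

    def forward(self, rgb, depth):
        return self.rgb_stem(rgb), self.depth_stem(depth)
```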
Q2. Comparison with more methods
In the main paper, we have conducted comparative experiments with 13 RGB-D semantic segmentation methods and 11 RGB-D salient object detection methods to verify the effectiveness of our RGB-D pretraining manner. Thanks to your suggestions, we have added the three mentioned RGB-D SOD methods, namely VST, RD3D, and BPGNet, to Tab. 2 of the new revision. Our DFormer still outperforms these methods by a large margin. Note that we cannot compare with UCTNet (ECCV, 2022) because its source code is not released, which is also pointed out by reviewer KRMX.
Q3. Why not replace the backbone in previous methods with DFormer
The previous works mainly focus on interacting RGB-D data within the encoder and decoder, e.g., CMX interacts the RGB and depth features at each stage of the encoder and further fuses them in the decoder. In contrast, our DFormer only performs efficient interaction between the RGB and depth modalities in each building block of the encoder, which not only reduces the computation burden but also achieves better information fusion. We report DFormer-B with different decoder heads in the table below. We also apply our DFormer-B to the CMX decoder, improving the performance by about 1.1% mIoU while reducing nearly half of the computation cost. The MLP head achieves performance comparable to the CMX decoder, further showing that there is no need to fuse RGB-D features with specially designed decoder fusion modules.
| Decoder | Param | Flops | NYUDepthV2 | SUNRGBD |
|---|---|---|---|---|
| Ham head | 29.5M | 41.9G | 55.6 | 51.2 |
| MLPs head | 29.1M | 43.5G | 55.3 | 50.9 |
| Non-local head | 32.8M | 49.4G | 56.0 | 51.5 |
| CMX decoder | 35.0M | 48.3G | 55.4 | 51.0 |
Q4. Discussion on pretraining methods
This paper focuses on RGB-D pretraining in a supervised manner, so a discussion of the two self-supervised works you mentioned is outside the scope of this paper. Besides, their performance is far behind state-of-the-art supervised pretrained methods like CMX. As a result, we did not take these two methods into account.
Q5. More RGB-D SOD benchmarks
This paper focuses on acquiring transferable representations for RGB-D segmentation tasks, rather than refreshing state-of-the-art performance on all RGB-D benchmarks. To this end, we conduct experiments on several popular RGB-D segmentation and RGB-D SOD benchmarks following the settings of recent works such as HiDANet (TIP, 2023) [1], SPSN (ECCV, 2022) [2], and SPNet (ICCV, 2021) [3].
Thanks to your suggestions, we further conduct experiments on the mentioned RGB-D SOD datasets in Tab. 19 of the revised appendix. We report brief results in the following table, and our DFormer still outperforms the other methods significantly.
'-' means the results are not provided in the corresponding paper. We will reproduce these missing results in the next revision to make the comparison more comprehensive.
| Dataset | Param | Flops | DUTLF-Depth [4] | | | | ReDWeb-S [5] | | | | LFSD [6] | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Metric | M | G | M | F | S | E | M | F | S | E | M | F | S | E |
| DSNet | 172.4 | 141.2 | - | - | - | - | .133 | .676 | .716 | .736 | .068 | .849 | .867 | .890 |
| BIANet | 49.6 | 59.9 | - | - | - | - | - | - | - | - | - | - | - | - |
| SPNet | 150.3 | 68.1 | - | - | - | - | - | - | - | - | - | - | - | - |
| VST | 83.3 | 31.0 | .024 | .948 | .943 | .969 | .113 | .763 | .759 | .826 | .061 | .889 | .882 | .921 |
| RD3D+ | 28.9 | 43.3 | .030 | .945 | .936 | .964 | .130 | .697 | .718 | .786 | - | - | - | - |
| BPGNet | 84.3 | 138.6 | .031 | .938 | .930 | .958 | - | - | - | - | .066 | .875 | .874 | .908 |
| C2DFNet | 47.5 | 21.0 | .025 | .934 | .931 | .958 | - | - | - | - | .065 | .863 | .880 | .883 |
| MVSalNet | - | - | - | - | - | - | - | - | - | - | .072 | .880 | .856 | .906 |
| HiDANet | 130.6 | 71.5 | - | - | - | - | - | - | - | - | - | - | - | - |
| DFormer-L | 38.8 | 26.2 | .023 | .952 | .945 | .970 | .106 | .813 | .760 | .826 | .056 | .903 | .885 | .925 |
[1] HiDANet: RGB-D salient object detection via hierarchical depth awareness. TIP, 2023.
[2] SPSN: Superpixel prototype sampling network for RGB-D salient object detection. ECCV, 2022.
[3] Specificity-preserving RGB-D saliency detection. ICCV, 2021.
[4] Depth-induced multi-scale recurrent attention network for saliency detection. ICCV, 2019.
[5] Learning selective mutual attention and contrast for RGB-D saliency detection. TPAMI, 2021.
[6] Saliency detection on light field. CVPR, 2014.
Dear Reviewer HRuY,
Thank you very much for your valuable feedback. We hope that we have addressed your concerns through more comprehensive comparisons with more RGB-D SOD methods and benchmarks, the application of our DFormer to different decoders such as the CMX decoder, and the revision of the paper according to your suggestions.
The discussion period is soon coming to an end. If you still have any further reservations or suggestions, please don't hesitate to share them. Your insights are invaluable to us, and we're keen to address any remaining issues.
Best regards!
Authors
Dear Authors,
Thanks for your efforts in conducting experiments as per my comments. I believe your method has good results, however, I still think it lacks profound insights and the technical novelty is below the ICLR standard. Hence, I keep my rating as reject.
Best,
Dear Reviewers and AC:
We sincerely appreciate your valuable time and constructive comments.
We've uploaded a revised draft incorporating reviewer feedback. Most modifications are marked in red. Below is a summary of the main changes:
- Added the discussion about the generalization of DFormer and the experiments of applying RGB-D pretraining to CMX to the main paper, i.e., the bottom of Sec. 3.3.
- Added the observations on the distribution shift and a more detailed ablation of DFormer to the appendix, i.e., Sec. B.
- Added more RGB-D SOD methods for a more comprehensive comparison in Tab. 2.
- Added results on more RGB-D SOD datasets in Tab. 19 of the appendix.
- Moved the implementation details, inference time analysis, and qualitative result comparisons to the appendix.
We hope our responses and revisions address all reviewers' concerns, and we sincerely believe that these updates help better deliver the benefits of the proposed work to the ICLR community. Thank you very much,
Authors.
Paper summary
- The paper proposes supervised RGB-D pretraining on ImageNet-1K to obtain an improved backbone network for downstream RGB-D tasks. Since ImageNet does not have depth information, the work proposes to use a monocular depth network (AdaBins) to provide depth images. A lightweight decoder is added for task-specific decoding, and the entire model is finetuned. Experiments on RGB-D semantic segmentation and salient object detection show that the proposed strategy can outperform prior work.
Strengths
- The proposed method is simple and effective and does not appear to be explored in prior work
Weaknesses
- The paper is missing discussion of some relevant prior work (see comments by reviewer HRuY)
Review summary
- Two of the three reviewers are positive about this work (KRMX, zqyp), while the third (HRuY) felt the technical contribution of the work is too limited. The AC agrees that the ideas are simple, but since they do not appear to have been investigated by prior work, they are of sufficient interest and novelty and, most importantly, useful to the community. Thus the AC recommends acceptance as a poster.
Suggested updates to the paper
- Add discussion of relevant work on self-supervised pretraining with RGB-D (see [5] and [6] from reviewer HRuY). While these works differ from the pretraining proposed in this work, at this point in time they should be discussed.
- Improve the main text to point to the appropriate sections of the appendix, such as Appendix A for the effect of the choice of depth estimation model used to generate the depth images.
- Minor points about wording and typos. A proofreading pass is recommended.
- Section 3.2 - "Dataset&" => "Dataset &"
- Supplement A. "MORE ANALYSIS TOWARDS DFORMER" => "MORE ANALYSIS OF DFORMER"
- Supplement B. "SUPPLEMENTARY DETAILS OF OUR DFORMER" => "DETAILS OF DFORMER"
- Supplement B. - Add space after "Structure."
- Supplement B. "detalied" => "detailed"
- Please reserve the word "prove" for when you mathematically prove something, and not when you show something experimentally.
Why not a higher score
The reviewers noted limited technical novelty.
Why not a lower score
The proposed method is simple but effective and does not seem to have been investigated by prior work.
Accept (poster)