PaperHub
Score: 5.8 / 10
Poster · 4 reviewers
Ratings: 6, 7, 5, 5 (min 5, max 7, std 0.8)
Confidence: 4.3 · Correctness: 3.3 · Contribution: 2.5 · Presentation: 3.0
NeurIPS 2024

DCDepth: Progressive Monocular Depth Estimation in Discrete Cosine Domain

OpenReview · PDF
Submitted: 2024-05-05 · Updated: 2024-11-06

Abstract

Keywords
Monocular depth estimation · Discrete cosine transform · Deep learning

Reviews and Discussion

Review (Rating: 6)

The paper presents a novel framework for the long-standing monocular depth estimation task. The task is first formulated as a progressive regression in the discrete cosine domain. The authors propose two modules: the PPH module, which progressively estimates higher-frequency coefficients based on previous predictions, and the PFF module, which incorporates a DCT-based downsampling technique to mitigate information loss and ensure effective integration of multi-scale features.

Strengths

  1. The paper is well-organized and the idea is easy to understand.
  2. The results of the method are presented clearly.

Weaknesses

  1. The authors claim that the global-to-local (coarse-to-fine) depth estimation is a contribution, but this idea is common and adopted by other works [1] [2].
    [1] Liu C, Yang G, Zuo W, et al. Dpdformer: a coarse-to-fine model for monocular depth estimation[J]. ACM Transactions on Multimedia Computing, Communications and Applications, 2024, 20(5): 1-21.
    [2] Li Y, Luo F, Xiao C. Self-supervised coarse-to-fine monocular depth estimation using a lightweight attention module[J]. Computational Visual Media, 2022, 8(4): 631-647.
  2. In Section 3.2, some descriptions are confusing. “This grouping strategy ensures that lower-frequency groups contain fewer components necessitating more prediction steps, while higher-frequency groups encompass a larger number of components requiring fewer steps”, but in Fig. 2, local areas (higher frequency) require more steps. It is recommended to clarify this apparently contradictory description.
  3. The authors claim that the DCT-based downsampling technique tends to mitigate information loss, but the design of this module is not adequately explained.
  4. The experiments are somewhat lacking in terms of including the latest works that achieve state-of-the-art performance. Although the authors compare their results with some previous works, other MDE methods [3, 4, 5] could provide valuable additional comparisons. When evaluated against these newer methods, the proposed method does not demonstrate superior performance.
    [3] Ning J, Li C, Zhang Z, et al. All in tokens: Unifying output space of visual tasks via soft token[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023: 19900-19910.
    [4] Yang L, Kang B, Huang Z, et al. Depth anything: Unleashing the power of large-scale unlabeled data[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 10371-10381.
    [5] Saxena S, Kar A, Norouzi M, et al. Monocular depth estimation using diffusion models[J]. arXiv preprint arXiv:2302.14816, 2023.
  5. “MDE is extensively applied across various fields such as autonomous driving, robotics, and 3D modeling [45, 48, 9, 42]” recommend changing to “autonomous driving[…], robotics[…], and 3D modeling[…]”.

Questions

See the Weaknesses.

Limitations

The contributions are not substantial enough. The authors have not clearly articulated their design for DCT-based downsampling. Additionally, the proposed method does not achieve state-of-the-art results.

Author Response
  1. Clarification about Contribution: Thank you for your comment. We believe there may be a misunderstanding regarding our contributions for the following reasons:

    1. As stated in lines 53-54, our main contribution is that we are the first to formulate monocular depth estimation as a progressive regression task in the discrete cosine domain, achieving state-of-the-art performance. All other reviewers consider our idea a novel insight (Reviewer HHm4), quite interesting (Reviewer fZWS), and significantly different from previous methods (Reviewer tNUP).
    2. The global-to-local or coarse-to-fine concept is broad, and researchers can realize it in various ways. The two approaches you mentioned employ tandem networks to perform pixel-wise coarse depth estimation followed by final depth refinement in the spatial domain. In contrast, our method leverages the discrete cosine transform to segregate depth information into various frequency components and progressively predicts them from low-frequency to high-frequency components via an iterative mechanism. Our approach is therefore fundamentally different from the two approaches you mentioned.
    3. Besides the novel formulation for monocular depth estimation, we also proposed two innovative modules: the progressive prediction head and the pyramid feature fusion module. We demonstrated their effectiveness through ablation studies.
  2. Clarification about Grouping Strategy: Thank you for your comment. The lower-frequency groups contain fewer frequency components to be predicted, while the higher-frequency groups contain more frequency components. Each group of components is predicted in one iteration; thus, the average number of prediction steps for higher-frequency components is lower than that for lower-frequency components (a minimal code sketch of this grouping is given after this response). Due to space limitations, we only report part of the evolution results in Figure 2. Please refer to Figure 1 of the attached PDF for a detailed illustration.

  3. Clarification about DCT-based Downsampling: Thank you for your comment. We believe you may have missed some of the explanations of the DCT-based downsampling in our main text. In Section 3.1, we review the Discrete Cosine Transform (DCT) and introduce its energy compaction property. In lines 154-162, we elaborate on the workflow of the DCT-based downsampling strategy, explaining that the key information of the feature maps is preserved during downsampling by leveraging the energy compaction property of the DCT. Additionally, we illustrate the workflow of DCT-based downsampling in the bottom-left corner of Figure 3 (see also the sketch after this response).

  4. Clarification about Experimental Comparison: Thank you for your comment. Compared with our method, the approaches you mentioned have significant advantages in experimental configurations. We summarize them as follows:

    | Method | Backbone | Pretraining | Training Set Size |
    | --- | --- | --- | --- |
    | AiT [3] | Swin-Large V2 | SimMIM | Same as ours |
    | Depth Anything [4] | ViT-Large | DINO V2 | Over 63.5 million |
    | DepthGen [5] | Efficient U-Net | Image-to-image self-supervised pretraining & supervised image-to-depth pretraining | Supervised pretraining: 8.5 million; NYU-Depth-V2 finetuning: 50,000 |
    | Ours | Swin-Large V1 | ImageNet-22k classification | NYU-Depth-V2: 24,231 |

    We did not include the comparison with these methods due to the large gap in experimental configurations. To further demonstrate the competitiveness of our method, we integrated our method into Depth Anything and finetuned it on NYU-Depth-V2 following the settings in the Depth Anything paper. We denote the fine-tuned Depth Anything model as Depth Anything ft. and our fine-tuned method as Our ft. DA. We also compared our method with DepthGen on both the NYU-Depth-V2 and KITTI Eigen datasets. The results are reported as follows:

    NYU-Depth-V2:

    | Method | Backbone | Abs Rel | RMSE | log10 | δ<1.25 | δ<1.25² | δ<1.25³ |
    | --- | --- | --- | --- | --- | --- | --- |
    | Depth Anything ft. | ViT-Large | 0.056 | 0.206 | 0.024 | 0.984 | 0.998 | 1.000 |
    | DepthGen | Efficient U-Net | 0.074 | 0.314 | 0.032 | 0.946 | 0.987 | 0.996 |
    | Ours | Swin-Large | 0.085 | 0.304 | 0.037 | 0.940 | 0.992 | 0.998 |
    | Our ft. DA | ViT-Large | 0.055 | 0.204 | 0.024 | 0.985 | 0.998 | 1.000 |

    KITTI Eigen:

    | Method | Backbone | Abs Rel | RMSE | RMSE_log | δ<1.25 | δ<1.25² | δ<1.25³ |
    | --- | --- | --- | --- | --- | --- | --- |
    | DepthGen | Efficient U-Net | 0.064 | 2.985 | 0.100 | 0.953 | 0.991 | 0.998 |
    | Ours | Swin-Large | 0.051 | 2.044 | 0.076 | 0.977 | 0.997 | 0.999 |

    Our finetuned method outperforms its Depth Anything counterpart, which demonstrates the superiority of our method. Furthermore, our method significantly outperforms DepthGen on all metrics of KITTI Eigen and on several metrics of NYU-Depth-V2, despite DepthGen employing a large amount of data for supervised pretraining.

  5. About References: Thank you for your advice; we will revise our paper according to your suggestion.
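
To make the grouping strategy in point 2 and the DCT-based downsampling in point 3 more concrete, here is a minimal NumPy/SciPy sketch. The 8×8 patch size, the anti-diagonal ("zigzag") group boundaries, and the use of ground-truth coefficients in place of network predictions are illustrative assumptions, not the exact DCDepth implementation.

```python
import numpy as np
from scipy.fft import dctn, idctn

P = 8  # assumed patch size

def frequency_groups(p=P, boundaries=(0, 1, 2, 4, 7, 2 * P - 1)):
    """Group coefficient indices (u, v) by anti-diagonal u + v, merging the later
    (higher-frequency) diagonals into larger groups; the boundaries are assumed."""
    diags = {}
    for u in range(p):
        for v in range(p):
            diags.setdefault(u + v, []).append((u, v))
    return [[ij for d in range(lo, hi) for ij in diags.get(d, [])]
            for lo, hi in zip(boundaries[:-1], boundaries[1:])]

def progressive_reconstruction(depth_patch):
    """Reveal one frequency group per iteration and inverse-transform the partial
    spectrum, going from a coarse (low-frequency) to a detailed estimate."""
    spectrum = dctn(depth_patch, norm="ortho")   # full DCT of the patch (stand-in for predictions)
    partial = np.zeros_like(spectrum)
    estimates = []
    for group in frequency_groups():
        for (u, v) in group:                     # in DCDepth these coefficients are predicted
            partial[u, v] = spectrum[u, v]
        estimates.append(idctn(partial, norm="ortho"))
    return estimates

def dct_downsample(feature_map, factor=2):
    """Downsample by keeping only the low-frequency corner of the DCT spectrum
    (energy compaction), then inverse-transforming at the smaller size."""
    H, W = feature_map.shape
    h, w = H // factor, W // factor
    low = dctn(feature_map, norm="ortho")[:h, :w]
    return idctn(low, norm="ortho") * np.sqrt((h * w) / (H * W))  # keep the value range comparable

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patch = rng.uniform(1.0, 5.0, size=(P, P))   # toy "depth" patch
    errors = [float(np.abs(e - patch).mean()) for e in progressive_reconstruction(patch)]
    print("mean abs error after each iteration:", [round(e, 3) for e in errors])
    print("downsampled feature map shape:", dct_downsample(rng.standard_normal((32, 32))).shape)
```

With an orthonormal DCT, retaining the low-frequency corner preserves most of the signal energy, which is the energy-compaction argument the response relies on; the reconstruction error shrinks to zero once all frequency groups have been revealed.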
Comment

First, thanks for your effort in responding to my questions.

Then, I read all the responses, and the numerous experiments and detailed answers have convinced me that this paper has sufficient contributions. However, the authors should discuss limitations and corner cases.

Finally, I'd like to raise my rating.

Comment

Thank you for your feedback and for improving the rating. Due to limited space, we discuss the limitations of our method in Section E of the supplementary material. We will move this discussion to the main text in the revised version to enhance the completeness of our paper.

Specifically, our model is supervised by comparing its predictions with the ground truth in the spatial domain. However, sparse ground truth may provide only weak supervision for the estimation of the frequency coefficients. While we have evaluated our method on the KITTI dataset with sparse ground truth and achieved state-of-the-art performance, further exploration is needed to evaluate its performance on even sparser datasets. Please see Section E for a detailed discussion.
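
The spatial-domain supervision with sparse ground truth described above can be illustrated with a minimal sketch: predicted per-patch DCT spectra are inverse-transformed to a depth map, and the loss is evaluated only at pixels with valid ground truth. The patch size, the plain L1 loss, and the tensor layout are illustrative assumptions, not the paper's actual training objective.

```python
import numpy as np
from scipy.fft import idctn

P = 8  # assumed patch size

def coeffs_to_depth(coeffs):
    """coeffs: (H//P, W//P, P, P) predicted DCT spectra -> (H, W) depth map."""
    gh, gw, _, _ = coeffs.shape
    depth = np.zeros((gh * P, gw * P))
    for i in range(gh):
        for j in range(gw):
            depth[i*P:(i+1)*P, j*P:(j+1)*P] = idctn(coeffs[i, j], norm="ortho")
    return depth

def masked_l1_loss(pred_coeffs, gt_depth):
    """Supervise in the spatial domain; ignore pixels without ground truth (<= 0)."""
    pred_depth = coeffs_to_depth(pred_coeffs)
    valid = gt_depth > 0                    # sparse LiDAR-style validity mask
    if valid.sum() == 0:
        return 0.0
    return float(np.abs(pred_depth[valid] - gt_depth[valid]).mean())
```

The sparser the validity mask, the fewer spatial-domain constraints each patch spectrum receives, which is the concern raised in the limitations above.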

Review (Rating: 7)

The paper introduces a frequency domain-based method for monocular depth estimation. The proposed method begins with the prediction of low-frequency components to establish a global scene context, followed by successive refinement of local details through the prediction of higher-frequency components. The proposed method is validated on the NYU-Depth-V2, TOFDC and KITTI datasets.

Strengths

To the best of my knowledge, the discrete cosine domain-based progressive design is significantly different from previous methods and gives a promising paradigm in depth estimation.

The proposed method is interpretable, using discrete cosine transformation to segregate the depth information into various frequency components and enriching details in a progressive manner.

The proposed method outperforms prior work across multiple datasets with comparable or fewer parameters.

The detailed ablation study indicates the effectiveness of each key component.

The paper is clearly structured and well presented.

Weaknesses

As shown in Table 2, the proposed method significantly outperforms NewCRFs, PixelFormer, and IEBins on the TOFDC dataset compared to other datasets. It would be beneficial to provide a detailed discussion regarding this to better understand the strengths of the proposed method.

It would also be nice to visualize the evolution of the depth predictions in the frequency domain.

In Fig. 3, the three depth predictions at the bottom look the same. They probably need to be replaced with the actual experimental results.

Questions

Please see the weakness section.

Limitations

It would be nice to try using completed ground-truth depth maps, generated by existing depth completion frameworks, as supervision on the KITTI dataset.

Author Response
  1. Discussion on TOFDC Result: Thanks for your valuable suggestion. The TOFDC dataset is challenging due to its small number of training images with limited diversity. NewCRFs, PixelFormer, and IEBins independently predict pixel-wise depth without modeling the correlations among pixels. We believe this makes it difficult to learn general depth estimation knowledge from a small-scale dataset, causing these large models to overfit the training set and resulting in degraded depth estimation performance. In contrast, models with fewer parameters, such as BTS and AdaBins, achieve better depth estimation results.

    Unlike the large models mentioned above, our proposed method models local depth correlations by making patch-wise predictions in the frequency domain. This approach allows our model to better exploit general depth estimation knowledge. Additionally, we include two regularizations in the training loss to encourage the model to output smoother depths, helping to avoid overfitting and achieve better performance (a generic sketch of such a smoothness term is given after this response).

  2. Depth Prediction Evolution Visualization: Thank you for your professional suggestion. We have included the visualization of frequency prediction evolution in Figure 1 of the attached PDF. We will incorporate this result into our revised paper.

  3. Depth Prediction Visualization in Figure 3: Thank you for your suggestion. The depth results at the bottom of Figure 3 are produced by our model. However, they appear indistinguishable due to the small resolution. We will enhance the presentation in our revised version to ensure the differences are more visible.

  4. Supervision with Complete Depth Map: Thank you for your constructive advice. We will explore using the dense depth map generated by the depth completion model as supervision to improve the performance of our method in future work.
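
One common form of the kind of smoothness regularization mentioned in point 1 is an edge-aware first-order penalty on depth gradients. The sketch below is a generic example of such a term, shown only to make the idea concrete; the two regularizers actually used in the paper may differ.

```python
import numpy as np

def edge_aware_smoothness(depth, image):
    """Penalise depth gradients except where the image itself has strong edges.
    depth: (H, W) predicted depth, image: (H, W, 3) RGB input."""
    ddx = np.abs(np.diff(depth, axis=1))                    # horizontal depth gradients
    ddy = np.abs(np.diff(depth, axis=0))                    # vertical depth gradients
    idx = np.mean(np.abs(np.diff(image, axis=1)), axis=-1)  # image gradients (mean over RGB)
    idy = np.mean(np.abs(np.diff(image, axis=0)), axis=-1)
    # Down-weight the penalty where the image has edges, so depth discontinuities are allowed there.
    return float((ddx * np.exp(-idx)).mean() + (ddy * np.exp(-idy)).mean())
```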

Comment

Thank you for your responses. The proposed method is promising and the authors have addressed my concerns.

Comment

Thank you for your feedback and for recognizing our work.

Review (Rating: 5)

This paper introduces DCDepth, a framework designed to tackle the long-standing monocular depth estimation task. In general, the proposed methods are quite interesting, the motivations for this work are clear, the experiments conducted are comprehensive, and the paper is well-structured.

Strengths

This paper introduces DCDepth, a framework designed to tackle the long-standing monocular depth estimation task. In general, the proposed methods are quite interesting, the motivations for this work are clear, the experiments conducted are comprehensive, and the paper is well-structured.

Weaknesses

  1. The comparisons with SoTA methods in Table 1 and Table 2 are not very clear. The authors should consider reporting more decimal places and only highlighting the best-performing method for each metric.

  2. The tables should include the publication year and the venue of the compared methods, as this will help readers better understand the context of the comparisons.

Questions

Please see weaknesses.

Limitations

N/A

Author Response
  1. Table Result Presentation: Thank you for your valuable feedback! Based on your suggestions, we plan to update our tables in the following two aspects:

    1. Updating the accuracy metrics in Tables 1 and 2 to percentages and multiplying the error metrics in Table 1 by 100, keeping two decimal places for reporting quantitative results.
    2. Highlighting only the best-performing methods for each metric.

    Below are the updated comparison results of VA-DepthNet, which is closest to our method in performance on the NYU-Depth-V2 and TOFDC datasets:

    NYU-Depth-V2:

    | Method | Reference | Backbone | Abs Rel | Sq Rel | RMSE | log10 | δ<1.25 | δ<1.25² | δ<1.25³ |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | VA-DepthNet | ICLR-2023 | Swin-Large | 8.56 | 3.88 | 30.44 | 3.68 | 93.68 | 99.20 | 99.87 |
    | Ours | --- | Swin-Large | 8.51 | 3.87 | 30.43 | 3.66 | 93.95 | 99.20 | 99.84 |

    TOFDC:

    | Method | Reference | Backbone | Abs Rel | Sq Rel | RMSE | RMSE_log | δ<1.25 | δ<1.25² | δ<1.25³ |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | VA-DepthNet | ICLR-2023 | Swin-Large | 0.234 | 0.029 | 0.619 | 0.373 | 99.550 | 99.890 | 99.969 |
    | Ours | --- | Swin-Large | 0.188 | 0.027 | 0.565 | 0.352 | 99.490 | 99.891 | 99.970 |

    In the revised version, we will update our tables according to these principles (a short sketch of the metric definitions and this rescaling is given after this response).

  2. Including Publication Year and Venue Information for Comparison Methods: Thank you for your valuable advice! We will include the publication year and venue information of the compared methods in our tables to help readers better understand the context of the comparisons.
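
For reference, the sketch below spells out the standard Eigen-style depth metrics used in Tables 1 and 2 and the rescaling described in point 1 (accuracies reported as percentages, error metrics multiplied by 100, two decimal places). The helper is illustrative, not code from the paper.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular depth metrics, computed over pixels with valid ground truth."""
    valid = gt > 0
    p, g = pred[valid], gt[valid]
    ratio = np.maximum(p / g, g / p)
    return {
        "abs_rel": np.mean(np.abs(p - g) / g),
        "sq_rel":  np.mean((p - g) ** 2 / g),
        "rmse":    np.sqrt(np.mean((p - g) ** 2)),
        "log10":   np.mean(np.abs(np.log10(p) - np.log10(g))),
        "d1":      np.mean(ratio < 1.25),
        "d2":      np.mean(ratio < 1.25 ** 2),
        "d3":      np.mean(ratio < 1.25 ** 3),
    }

def scaled_for_table(metrics):
    """Report errors x100 and accuracies as percentages, keeping two decimal places."""
    return {k: round(100.0 * float(v), 2) for k, v in metrics.items()}
```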

Comment

Dear Reviewer fZWS,

Thank you for your comments on our paper. As the deadline for the discussion period is approaching, we wanted to kindly remind you that your feedback is very important to us. We have submitted our responses to your comments and would greatly appreciate any additional questions or feedback you may have.

Thank you for your time and consideration.

Best regards,

The Authors

Comment

First, I thank the authors for the detailed explanation. I will keep my original rating.

Comment

Thank you for your feedback and support!

Review (Rating: 5)

The authors propose DCDepth, which aims to predict a depth map from a monocular image. They introduce a novel technique that estimates depth by predicting frequency coefficients in the discrete cosine domain, enabling the modeling of local depth correlations. Experiments are conducted on the NYU-Depth-V2, TOFDC, and KITTI datasets.

Strengths

  1. Predicting depth from the frequency domain is a relatively novel insight.
  2. The presentation is clear. Figure 1 clearly illustrates the progressive estimation scheme.

Weaknesses

Lack of comparison: Why not compare with methods like Depth Anything [Yang et al., CVPR 2024], Metric3D [Yin et al., ICCV 2023], etc.? The methods currently compared are outdated. Therefore, it cannot be said that a new state-of-the-art performance has been achieved (L14).

Questions

Please consider answering the questions in the weaknesses section during the rebuttal. Besides, in L27-28 the authors state that current depth estimation is performed on a per-pixel basis. I am not sure this statement is accurate, as recent MDE methods such as Depth Anything are transformer-based and involve patch tokenization, so "per-pixel basis" may not be a precise way of putting it.

Limitations

yes.

Author Response
  1. Clarification of Experimental Comparison: Thank you for your comment. We would like to clarify the following points:

    1. Our research goal is different from that of Depth Anything and Metric3D. We focus on novel network and algorithm design for the monocular depth estimation task, while Depth Anything and Metric3D focus on improving the generalization of depth estimation models. This difference leads to very different experimental setups for the two kinds of methods, which we summarize as follows:

      | Method | Pretraining | Training Set Size | Ratio to Our Training Set |
      | --- | --- | --- | --- |
      | Depth Anything | DINO-V2 | Over 63.5 million | 2621 / 6350 / 2742 |
      | Metric3D | ImageNet-22K | Over 8 million | 330 / 800 / 345 |
      | Ours | ImageNet-22K | NYU-Depth-V2: 24,231 / TOFDC: 10,000 / KITTI Eigen: 23,158 | 1 / 1 / 1 |

      Due to the significant differences in the experimental setup, it is unfair to directly compare our method with these two methods, thus we do not include these comparison results in our paper.

    2. We disagree with your comment that “the methods currently compared are outdated,” as we have included recent approaches such as IEBins [1], BinsFormer [2], and VA-DepthNet [3] in our comparison, as shown in Tables 1, 2, and 3 of the main text and Table 7 of the supplementary material.

    3. Under the same experimental settings, our method surpasses existing methods on three datasets, achieving state-of-the-art performance. To further demonstrate the effectiveness of our method, we additionally designed the following comparative experiments:

      • We use the weights of Depth Anything as initialization, transplant the proposed progressive prediction head into the network of Depth Anything, and fine-tune it on NYU-Depth-V2 following the settings in the Depth Anything paper. We denote the fine-tuned Depth Anything model as Depth Anything ft., and our fine-tuned method as Our ft. DA.
      • We use the weights from Metric3D encoder as initialization and randomly initialize the decoder. Then we fine-tune Metric3D and our model on the NYU-Depth-V2 dataset for 10 epochs, denoted as Metric3D ft. and Our ft. M3D, respectively.

      The experimental results are reported as follows:

      | Method | Reference | Backbone | Abs Rel | RMSE | log10 | δ<1.25 | δ<1.25² | δ<1.25³ |
      | --- | --- | --- | --- | --- | --- | --- | --- |
      | Depth Anything ft. | CVPR-2024 | ViT-Large | 0.056 | 0.206 | 0.024 | 0.984 | 0.998 | 1.000 |
      | Metric3D ft. | ICCV-2023 | ConvNeXt-Large | 0.065 | 0.232 | 0.028 | 0.971 | 0.996 | 0.999 |
      | Our ft. DA | --- | ViT-Large | 0.055 | 0.204 | 0.024 | 0.985 | 0.998 | 1.000 |
      | Our ft. M3D | --- | ConvNeXt-Large | 0.062 | 0.229 | 0.027 | 0.970 | 0.996 | 0.999 |

      Our finetuned methods outperform their counterparts, demonstrating the superiority of our approach.

    [1] Shao et al. “IEBins: Iterative Elastic Bins for Monocular Depth Estimation", NeurIPS 2023.

    [2] Li et al. "BinsFormer: Revisiting Adaptive Bins for Monocular Depth Estimation", TIP 2024.

    [3] Liu et al. "VA-DepthNet: A Variational Approach to Single Image Depth Prediction", ICLR 2023.

  2. Clarification on Per-Pixel Basis Statement: Thank you for your feedback. On lines 27-28, we stated that “to predict depth on a per-pixel basis within the spatial domain,” which indicates that the mentioned approaches independently predict depth for each pixel in the spatial domain. In contrast, our method models local depth correlations by predicting the frequency spectrum in the discrete cosine domain. The term “per-pixel basis” refers to the output, not the input. We will enhance the expression in the revised version to clarify this distinction.

Comment

Dear Reviewer HHm4 and fZWS,

You still need to share your feedback about the rebuttal; please post your comments as soon as possible.

Thank you

Comment

Dear Reviewer HHm4,

Thank you for your comments on our paper. As the deadline for the discussion period is approaching, we wanted to kindly remind you that your feedback is very important to us. We have submitted our responses to your comments and would greatly appreciate any additional questions or feedback you may have.

Thank you for your time and consideration.

Best regards, The Authors

Author Response

Dear Reviewers and Chairs,

Thank you for your constructive feedback and efforts.

We have individually responded to each reviewer’s comments. In the submitted PDF, we have visualized the prediction evolution of our method in both the spatial and frequency domains.

If you have any questions, please feel free to discuss them with us. If you are satisfied with our responses, we hope you will consider improving your rating.

Best regards,

The Authors

Comment

Dear reviewers,

The authors have put significant effort into providing detailed answers to the questions raised in your evaluation reports. In particular, they discussed concerns about novelty and comparison to SOTA, which are very relevant. Please participate in the discussion by sharing your comments about the rebuttal.

Thanks AC

Final Decision

As acknowledged by the reviewers, tackling monocular depth estimation by exploiting the frequency domain is original, and the experimental results demonstrate the effectiveness of this paradigm. The concerns raised in the initial reviews mainly involved the missing comparison with networks like Depth Anything and Metric3D, which were adequately addressed in the rebuttal and the following discussion with the authors. At the end of the review process, all reviewers recommended acceptance, acknowledging the relevance of the proposed paradigm. The AC is on the same page and does the same.