BetterDepth: Plug-and-Play Diffusion Refiner for Zero-Shot Monocular Depth Estimation
We present BetterDepth to efficiently achieve robust affine-invariant monocular depth estimation with fine-grained details.
Abstract
Reviews and Discussion
This paper proposes a plug-and-play diffusion refiner for pre-trained zero-shot feed-forward MDE (monocular depth estimation) models, so that these generalized models can capture fine-grained details. In this paper, the coarse depth maps output by a pre-trained feed-forward MDE model are used as additional conditioning for the proposed diffusion refiner, which can be instantiated from an existing diffusion-based MDE model. In addition, a global pre-alignment strategy and a local patch masking strategy are proposed to ensure the faithfulness of the predicted depths.
Strengths
(1) The paper is well-organized.
(2) The proposed diffusion refiner is simple and clear, and could be easily combined with different existing feed-forward MDE methods.
(3) Experimental results on several datasets demonstrate the effectiveness of the proposed method in some cases.
Weaknesses
(1) The motivation of this work is a bit far-fetched: As stated in the introduction, the low accuracy of the existing feed-forward MDE methods is caused by noisy and incomplete depth labels collected in real-world scenarios, while the poor generalizability of the diffusion-based MDE methods is caused by less diverse synthetic training samples. However, if only training data results in limited MDE performance for these existing methods, it seems that a straightforward motivation for alleviating the above problems is to enrich the training datasets, rather than to modify the model architecture as done in this work.
(2) Some statements in Table 1 are confusing: (i) To the reviewer’s knowledge, many existing feed-forward methods only use real datasets for training, rather than using real and synthetic datasets together. Why do the authors indicate that feed-forward methods use both real and synthetic datasets for training? (ii) Some existing diffusion-based methods (e.g., "Monocular depth estimation using diffusion models. arXiv preprint arXiv:2302.14816, 2023.") also use real datasets for training. Why do the authors indicate that diffusion-based methods use only synthetic datasets for training in Table 1? (iii) The authors state that feed-forward methods have better generalizability than diffusion-based methods. But to the reviewer’s knowledge, in many zero-shot learning tasks (e.g., zero-shot recognition), generative models (e.g., GANs and diffusion models) generally show better generalizability than their feed-forward counterparts. Could the authors explain why they make the above statement?
(3) Comparative evaluation: As seen from Tables 2 and 3, when the proposed BetterDepth is used together with two early methods MiDaS (published in 2020) and DPT (published in 2021), it could bring some improvements. However, when the proposed BetterDepth is used together with Depth Anything [43] (a SOTA method), it brings only a slight improvement on most of the evaluated datasets (particularly NYUv2 and ScanNet). So it seems that the proposed BetterDepth has a limited effect in boosting the performance of SOTA methods.
Questions
(1) As for the global pre-alignment, since the estimated depth values are derived from their corresponding depth labels, what is the performance of only using the depth labels in the local patch masking? And for the local patch masking, the authors simply discard regions that differ significantly between the estimated depths and the real depths; will this lead to a loss of visual world information?
(2) In lines 128-132, the authors' conclusion is that large-scale datasets lead to the generalizability of the feed-forward MDE methods. Therefore, in order to improve the generalizability of diffusion-based MDE methods, why not simply train them with large-scale datasets following the feed-forward MDE methods? Why is this still a challenge?
(3) In lines 133-137, the authors' conclusion is that in the diffusion-based MDE methods, the high-quality labels in the synthetic datasets lead to the ability to capture fine-grained details, but some feed-forward MDE methods also use these synthetic (and real) datasets for training, as indicated in Table 1. Why are these feed-forward MDE methods unable to capture fine-grained details?
(4) Is Fig. 3 a schematic diagram or a direct visualization of the output distribution? What is the difference between the output distributions X(M_DM,D_syn) and \hat{X}? Moreover, for the middle sketch, why does the global pre-alignment not only affect \hat{X} but also X(M_DM,D_syn)?
(5) As indicated in Fig. A2, lower η generally means stricter filtering, but if η is too small, too many local regions would be discarded, which would lead to severe information loss. To this end, I’m wondering how many local regions are discarded in the current version of BetterDepth, and how to balance the strictness of the filter and the loss of information.
(6) Is there any diffusion-based MDE method trained on real datasets?
Limitations
The proposed method might be improved in two ways: (i) The offsets from the real depths to the estimated depths could be modeled in an implicit learnable manner; (ii) The pairs of significantly different local regions in two depths may not be harmful, but contain useful information for monocular depth estimation.
Thank you for the thoughtful comments. We kindly ask the Reviewer to read the top-level global response first. Our detailed responses to the comments in the weaknesses (denoted as W) and questions (denoted as Q) sections are listed below.
W-1: Motivation
Both the training data and the model architecture are important for performance. The very recent method Depth Anything V2 [R4] uses synthetic data for better details. Although promising improvements are achieved in Depth Anything V2, BetterDepth still shows better performance, as discussed in the detail evaluation section of the global response, thanks to the iterative refinement of diffusion models.
Enriching the training dataset helps to boost MDE performance, but (i) obtaining high-quality labels for real datasets is difficult due to depth sensor limitations, (ii) synthetic datasets offer high-quality labels but are costly to generate at scale, and (iii) training on large datasets is both time-consuming and resource-intensive.
BetterDepth efficiently combines the strengths of feed-forward and diffusion-based MDE models, achieving robust performance with fine-grained details with minimal training effort.
W-2: Table 1
(i) Feed-forward models can easily use large-scale datasets (both synthetic and real ones) to gain robust performance, e.g., Depth Anything V2 [R4], so we include both in Tab. 1 for generality. (ii) Recent works [9,11,14] validate the effectiveness of capturing fine details by training diffusion models on synthetic datasets. While techniques like depth infilling [34] enable training on real datasets, the resulting depth maps tend to be smoother due to the sparse and noisy labels. Thus, we focus on the recent diffusion-based MDE methods with synthetic data. We will explicitly indicate this to avoid confusion. (iii) Diffusion-based models could generalize better in tasks with diverse, accurately annotated datasets, e.g., recognition, but the sparse/noisy labels of real datasets and the limited diversity of synthetic datasets in the MDE task make it challenging. Tab. 2 also supports the significant performance gap between the state-of-the-art diffusion and feed-forward MDE approaches, e.g., Marigold v.s. Depth Anything.
W-3: SOTA improvements
BetterDepth aims to achieve robust MDE performance with fine-grained details. Although BetterDepth achieves state-of-the-art performance, Tab. 2 cannot show its full advantages (especially detail extraction) due to the sparse and noisy depth labels (Fig. A6-A15), which is also observed in [R4]. Thus, we provide the detail evaluation experiment in the global response, and the results in Tab. T3 verify our significant improvements over the state-of-the-art model.
Q-1: Training strategies
To test patch masking with only depth labels, we use a smoothed version of the depth label as conditioning to imitate the estimation of pre-trained MDE models (a small illustrative sketch of this smoothing follows the table). As shown in the table below, patch masking with only depth labels (denoted as Label Only) shows inferior performance due to the significant distribution gap of the depth conditioning between the training and inference stages.
| Method | NYUv2 | KITTI | ETH3D | ScanNet | DIODE |
|---|---|---|---|---|---|
| Label Only | 4.4/98.0 | 8.0/94.0 | 7.8/97.8 | 4.4/98.0 | 23.0/75.4 |
| BetterDepth | 4.2/98.0 | 7.5/95.2 | 4.7/98.1 | 4.3/98.1 | 22.6/75.5 |
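For reference, a minimal illustrative sketch of how such a smoothed conditioning signal could be produced; the Gaussian blur and its strength are assumptions for illustration, not necessarily the exact setup used:

```python
# Illustrative sketch only: blur a clean synthetic depth label so that it
# imitates the smoother output of a pre-trained feed-forward MDE model,
# then use it as the conditioning input of the diffusion refiner.
import numpy as np
from scipy.ndimage import gaussian_filter

def smoothed_label_condition(depth_label: np.ndarray, sigma: float = 3.0) -> np.ndarray:
    """Return a blurred, normalized copy of the depth label as coarse conditioning."""
    cond = gaussian_filter(depth_label.astype(np.float32), sigma=sigma)
    d_min, d_max = cond.min(), cond.max()
    # Normalize to [-1, 1], a common range for latent-diffusion conditioning inputs.
    return 2.0 * (cond - d_min) / (d_max - d_min + 1e-8) - 1.0
```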
For patch masking, discarding patches will indeed result in loss of visual information. However, since BetterDepth utilizes the rich geometric prior from the pre-trained model, training on limited visual information already yields promising results, e.g., BetterDepth-2K is comparable with our full model in Tab. 2.
Q-2: Large datasets
BetterDepth aims to achieve both robust performance and fine details. Although large-scale real datasets can be employed to gain better generalizability, the sparse and noisy labels hinder models from extracting fine details. Synthetic datasets provide high-quality labels for fine detail extraction, but their low diversity limits the learned geometric priors. Similar discussions can be also found in Sec. 2-3 of [R4].
Q-3: Fine details
Training with synthetic datasets could help improve detail extraction, but the model architecture is also important. The recent Depth Anything V2 [R4] employs synthetic training data for better details, and we provide a comparison in the detail evaluation section of the global response. Thanks to the iterative refinement scheme, BetterDepth shows the best performance on detail extraction (Tab. T3).
Q-4: Fig. 3
Fig. 3 is a schematic diagram where X_(M_FFD, {D_syn, D_real}) and X_(M_DM, D_syn) are fixed distributions representing the characteristics of different methods (Tab. 1). By contrast, \hat{X} indicates the learned output distribution of BetterDepth and we mainly analyze its change under different training strategies. Thus, only the \hat{X} is affected by global pre-alignment in the middle sketch. We will clarify this to avoid confusion.
Q-5: Filtering
The percentage of discarded patches on the training dataset is 36.6% in BetterDepth. Although a small η will lead to fewer valid patches, BetterDepth works well with small-scale training datasets even under the information loss (as discussed in Q-1). We empirically find the optimal η to achieve the best performance balance (Fig. A2).
Q-6: Diffusion MDE + real data
Both [34] and [R5] employ depth infilling techniques to train diffusion-based MDE models on real datasets and achieve promising results. However, these works primarily show the feasibility of applying diffusion models on MDE without exploring fine detail extraction like recent methods, e.g., Marigold. Besides, they focus on in-domain testing instead of zero-shot evaluation. We will add [R5] to the related work section.
Limitation
Thank you. We will explore metric depth and improve information utilization in future works.
Dear reviewer,
The discussion period is coming to a close soon. Please do your best to engage with the authors.
Thank you, Your AC
Dear authors,
Thanks for the rebuttal. The rebuttal has cleared some of my concerns. However, although the evaluation on an extra dataset has been added, one main concern still remains: the proposed BetterDepth has a limited effect for boosting the performance of SOTA methods. Additionally, comparing Table T2 in the rebuttal and Table A2 in the submitted text, it is noted that some results of “the proposed method + Depth Anything” become better. Which results should the readers believe? Hence, I would keep my initial rating.
Dear Reviewer EiFe,
Thank you for your responses and comments. We'd like to provide further clarifications for your remaining concerns:
Improvements over SOTA
The proposed BetterDepth aims to achieve robust MDE performance with fine details. Our experiments verify the superiority of BetterDepth in both zero-shot performance (Tab. T2) and fine-grained detail extraction (Tab. T3). Extensive visual results also support the improvement of BetterDepth over SOTA methods (e.g., Fig. 1, 5, A4, A5).
Tab. A2 and Tab. T2
Thanks for the comments. Tab. T2 uses the same settings as Tab. 2 in the main paper, where BetterDepth is trained with Depth Anything. For Tab. A2, we train an additional BetterDepth variant with DPT [25], and we explicitly indicate the experimental settings in the caption of Tab. A2 and Sec. E.
We hope this addresses your concerns. Please feel free to let us know of any additional comments and suggestions. Thank you.
Best,
Authors
The paper proposes a simple approach to improve and refine current monocular depth estimation (MDE) methods. Leveraging the strong geometric prior from a state-of-the-art discriminative depth estimation method, and the strong image prior from a generative model, the authors set a new state-of-the-art in MDE. They condition a pre-trained latent diffusion model on an image and a corresponding depth map from a pre-trained MDE model and fine-tune the diffusion part to obtain higher-fidelity depth maps. Additional loss masking and alignment of affine-invariant depth to the ground truth are found to be crucial for performance.
Strengths
The paper is well-organized and well-written, with a clear common thread. The authors present a comprehensive set of experiments, validating the effectiveness of their method, and clearly ablating their contributions. The idea is simple and achieves very good results on a broad range of depth estimation benchmark datasets with minimal additional training effort, utilizing strong image and depth foundation models.
Weaknesses
- Since the idea is that simple, the general impact or contribution might only be moderate in my opinion. The main contribution is the injection of additional conditioning information into the diffusion model. Since this prior already gives state-of-the-art results, adding a full diffusion model on top with a strong image prior to learn the affine-invariant residual between ground truth and prediction is an obvious way to improve benchmark metrics. This also explains why only little data is needed, and why the error bar in appendix D is much lower than Marigold's. The task for Marigold is much harder compared to the refinement of a very good depth map.
- Further contribution claims incorporate the global pre-alignment and local patch masking. The first, global pre-alignment, is simply a rephrasing of scaling and shifting the ground-truth depth map to the predicted depth map of the MDE method, which is a common approach for testing affine-invariant depth estimation methods. Since the least squares fit cannot satisfy all pixels to match perfectly, discarding some portion of them, which the authors termed local patch masking, is also just a small trick to further boost performance. Hence, these two contributions lack novelty in my opinion, and the only real novelty is the incorporation of a strong depth prior into the diffusion model.
Besides the concern about contribution impact, I think the idea is presented very clearly and nicely, and the paper is very nicely polished.
Questions
- Figure 1: I was wondering why the surface normals still show some wobbles for flat surfaces. Since the strong geometric prior from DepthAnything does not show these artefacts, it seems like the diffusion model inserts these. Do you have any explanation for that? Have you evaluated whether this stems from the first stage?
- Line 196: Maybe I am missing something, but is there a reason why you choose max-pooling? As far as I understand, that means that if at least one patch in a certain region has a small distance between ground truth depth and scaled and shifted depth, the full region is included for training. Do you have an estimate of how often a full region is rejected? And for which semantic regions in an image is that usually the case (e.g., only for the sky)?
- Furthermore, since you exclude non-matching regions via the local patch masking, wouldn't it be better to directly use an outlier-aware method (e.g., RANSAC)? This would directly align the matching parts better, and non-matching parts could be filtered out more easily.
- Since depth refinement alone is in my opinion only a moderate contribution, I was wondering whether you could extend your method to transfer affine-invariant depth to metric depth, since this might be a more challenging and thus more interesting task?
- Line 305: Does this time include the MDE model? Maybe I was missing it, but can you give a clear separation of how much time goes into which step and how many ensemble members you use for this inference-time measurement?
- If possible I would appreciate further clarification on appendix C. Is the model without any geometric prior just Marigold? And did you train the diffusion model without image prior completely from scratch?
- Lastly, one of your claims is to capture more "fine-grained scene details" compared to other methods. I was wondering whether you know of a quantitative way to show that?
Limitations
The authors have adequately addressed and discussed the limitation of their method.
Thank you for your thoughtful comments. We kindly ask the Reviewer to read the top-level global response first. Our detailed responses to the comments in the weaknesses (denoted as W) and questions (denoted as Q) sections are listed below.
W-1: Contribution
Apart from the depth-conditioned diffusion refiner, a key contribution of BetterDepth is the proposed training strategies to achieve both robust performance and fine details. While Depth Anything gives state-of-the-art results, naively conditioning on it without our training strategies only yields inferior results as shown in Tab. T1 (Naive Conditioning v.s. BetterDepth) and discussed in the contribution section of the global response. In addition, the advantages of BetterDepth, e.g., lower error bar, also come from the proposed training strategies. We compare the standard deviation (std) based on the settings in Sec. D, and the results below show that the naive conditioning model is even more unstable than Marigold. This is because the injection of additional conditioning information makes it harder to determine which prior to follow, and the performance of BetterDepth further highlights the importance of our training methods.
| Methods | AbsRel std | δ1 std |
|---|---|---|
| Marigold | 0.66 | 0.99 |
| Naive Conditioning | 0.81 | 1.06 |
| BetterDepth | 0.28 | 0.28 |
W-2: Training strategies
To achieve robust MDE performance with fine details, the key challenge is how to ensure conditioning strength while enabling the learning of detail refinement. We agree that our training strategies are not difficult to implement, but the motivation to use them during training is more important.
- To ensure conditioning strength, we propose to narrow the distance between depth conditioning and labels in a global-to-local manner. The pre-alignment first eliminates the global differences caused by unknown scale and shift, and the patch masking further addresses the local estimation bias in depth conditioning.
- For detail refinement, global pre-alignment and local patch masking together contribute to fine-grained detail extraction. Although significantly different regions are excluded to ensure conditioning strength, the combination of pre-alignment and patch masking still enables the learning of detail refinement as shown in Fig. S1 of the attached PDF, e.g., the basket.
Thus, the proposed training strategies are critical for better MDE performance and fine-grained details, which is also supported by the results/analyses in W-1.
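For concreteness, a minimal illustrative sketch of these two strategies is given below; the exact distance measure, normalization, patch size, and threshold used in BetterDepth may differ, and the function names are illustrative only. Here, `eta` plays the role of the filtering threshold η discussed elsewhere in this thread.

```python
# Illustrative sketch of the two training strategies; the exact distance
# measure, patch size, and threshold used in the paper may differ.
import numpy as np

def global_pre_align(label: np.ndarray, cond: np.ndarray) -> np.ndarray:
    """Least-squares scale/shift so that s * label + t best matches the conditioning depth."""
    A = np.stack([label.ravel(), np.ones(label.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, cond.ravel(), rcond=None)
    return s * label + t

def local_patch_mask(aligned_label: np.ndarray, cond: np.ndarray,
                     patch: int = 8, eta: float = 0.1) -> np.ndarray:
    """Keep only patches whose mean absolute difference to the conditioning is below eta."""
    h, w = cond.shape
    mask = np.zeros((h, w), dtype=bool)
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            diff = np.abs(aligned_label[y:y+patch, x:x+patch] - cond[y:y+patch, x:x+patch])
            if diff.mean() < eta:
                mask[y:y+patch, x:x+patch] = True
    return mask  # the diffusion training loss is then applied only on kept pixels
```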
Q-1: Wobbles
Diffusion-based MDE methods tend to introduce subtle variation due to the random noise in the diffusion process. This can be fixed by using the mean instead of the median in test-time ensembling (Fig. S2 in the attachment), which also achieves better performance as follows.
| Method | NYUv2 | KITTI | ETH3D | ScanNet | DIODE |
|---|---|---|---|---|---|
| Median | 4.2/98.0 | 7.5/95.2 | 4.7/98.1 | 4.3/98.1 | 22.6/75.5 |
| Mean | 4.2/98.1 | 7.4/95.3 | 4.6/98.1 | 4.3/98.1 | 22.5/75.5 |
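To make the aggregation change above concrete, a minimal illustrative sketch is given below; the per-sample affine alignment performed before aggregation in Marigold-style ensembling is omitted, and names are illustrative:

```python
# Illustrative sketch: aggregate N diffusion samples of the same image into one
# depth map; the per-sample affine alignment done before aggregation is omitted.
import numpy as np

def ensemble_depth(samples: list[np.ndarray], use_mean: bool = True) -> np.ndarray:
    stack = np.stack(samples, axis=0)  # (N, H, W)
    return stack.mean(axis=0) if use_mean else np.median(stack, axis=0)
```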
Q-2: Max-Pooling
Max-pooling is used to convert pixel-level masks to latent space. We follow Marigold to employ 8x8 max-pooling for mask downsampling as the VAE encoder in our diffusion model performs 8x downsampling for pixel-to-latent conversion. Since the patch size is also set to 8x8, max-pooling only happens within each patch without affecting a larger region. We will clarify this in revision.
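For reference, a minimal illustrative sketch of this mask downsampling step, assuming the pixel-level mask is stored as a float tensor of shape (B, 1, H, W); names are illustrative:

```python
# Illustrative sketch: downsample a pixel-level validity mask to latent
# resolution with 8x8 max-pooling, matching the 8x downsampling of the VAE.
import torch
import torch.nn.functional as F

def mask_to_latent(mask_pixel: torch.Tensor) -> torch.Tensor:
    """mask_pixel: (B, 1, H, W) float tensor in {0, 1}; returns (B, 1, H/8, W/8)."""
    return F.max_pool2d(mask_pixel, kernel_size=8, stride=8)
```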
Q-3: Outlier-aware method
Better alignment, e.g., with outlier-aware methods, could indeed allow more patches to survive, leading to better performance. Although the data efficiency experiments (Tab. 2) show that fewer valid training patches lead to lower model performance, the models trained with small datasets, e.g., BetterDepth-2K, already achieve comparable results to our full model, indicating that limited patches can be sufficient. Nevertheless, better alignment methods could improve patch preservation to further benefit efficient training.
Q-4: Metric depth
Transferring affine-invariant depth to metric depth is a promising direction to benefit practical use but poses challenges, e.g., scale/shift ambiguity and diverse depth ranges. Nonetheless, BetterDepth shows potential to boost metric depth estimation. We employ the metric Depth Anything model and apply BetterDepth in a plug-and-play manner on the zero-shot datasets iBims-1 [R6] and SUN RGB-D [R7]. Due to the unknown scale and shift in the current BetterDepth, we align the outputs of Depth Anything and BetterDepth to the depth labels and compare the quality of the depth maps. The table below (metrics are AbsRel / δ1) and Fig. S3 of the attachment verify the superiority of BetterDepth; a minimal sketch of this alignment-based evaluation follows the table. One promising next step is to model the scale/shift in an implicit learnable manner (as suggested by Reviewer EiFe), and we will further study this in future works.
| Method | iBims-1 | SUN RGB-D |
|---|---|---|
| Depth Anything | 5.1/97.7 | 13.5/87.7 |
| BetterDepth | 4.5/98.2 | 12.7/88.8 |
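For reference, a minimal illustrative sketch of this alignment-based evaluation, using the standard AbsRel and δ1 definitions; the clamping of non-positive depths after alignment and the choice of valid-pixel mask are simplifying assumptions:

```python
# Illustrative sketch of the alignment-based comparison: least-squares align a
# prediction to the metric labels, then compute AbsRel and delta1 on valid pixels.
import numpy as np

def absrel_delta1(pred: np.ndarray, gt: np.ndarray, valid: np.ndarray):
    p, g = pred[valid], gt[valid]
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)  # align prediction to the labels
    p = np.clip(s * p + t, 1e-6, None)              # avoid non-positive depths
    absrel = np.mean(np.abs(p - g) / g)
    delta1 = np.mean(np.maximum(p / g, g / p) < 1.25)
    return absrel, delta1
```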
Q-5: Inference
The inference time includes the pre-trained MDE model, i.e., Depth Anything. To measure the time spent at each step, we reproduce the experiment, and the inference time of the pre-trained model and diffusion model are 0.02 and 0.38 seconds per sample, respectively. The ensemble size is set to 1.
Q-6: Appendix C
The model without geometric prior uses the same network and fine-tuning method as Marigold but estimates inverse depth (following Depth Anything) instead of relative depth. For the model without image prior, we follow Stable Diffusion to train the latent UNet from scratch and keep the pre-trained VAE unchanged.
Q-7: Detail evaluation
Thanks for the suggestions. We conduct a quantitative evaluation for detail extraction in the detail evaluation section of the global response, and the results in Tab. T3 demonstrate the state-of-the-art performance of BetterDepth in detail extraction.
Thank you for your response and the detailed answers to my questions, which addressed most but not all of my concerns.
I still have doubts about the contribution of the method. As mentioned earlier, providing a diffusion model with a very strong depth prior, which is already state-of-the-art, is highly likely to improve metrics. The proposed training strategies - global pre-alignment and local patch masking - are common techniques in the MDE literature. The global pre-alignment seems to be a rephrasing of the well-known approach of scaling and shifting the ground truth depth map to match the predicted depth map, which is standard practice for testing affine-invariant depth estimation methods. The local patch masking, which involves discarding certain portions to boost performance, appears to rather be a trick to circumvent the issues of non-matching scale and shift parameters.
In my view, the real novelty lies in incorporating a strong depth prior into the diffusion model, but the other contributions seem to lack originality or novelty.
Nevertheless, I will maintain my score, since the small incremental step of adding a strong prior is well ablated and justified.
Dear Reviewer Qdvk,
Thank you very much for your responses and comments. We'd like to provide further clarifications in the hope of addressing your remaining concerns:
A diffusion model + a very strong depth prior is likely to improve metrics
Simply providing a diffusion model with a strong depth prior does not outperform the pre-trained MDE model itself (see Depth Anything vs. Naive Conditioning in Tab. T1 of the global response). As such, it does not immediately improve metrics, and these counter-intuitive results reveal the challenges discussed in the contribution part of the global response. BetterDepth efficiently addresses these challenges and achieves state-of-the-art performance.
Global pre-alignment + local patch masking
Thanks for the comments. The global pre-alignment uses the same least squares fitting as the affine-invariant MDE evaluation protocol [26], and we explicitly point this out on Lines 170-171 of the main paper. We consider the simplicity of our approach an advantage rather than a drawback, and we additionally provide the motivation behind each design choice as well as a comprehensive analysis on Lines 162-223 of the main paper.
At last, we would like to emphasize that we respect your perspective and sincerely appreciate your comments and suggestions, which significantly improve the quality of our paper. We are also grateful for your support in continuing to recommend the acceptance of our paper. Thank you.
Best,
Authors
This paper presents a plug-and-play monocular depth estimator with the diffusion model. In the proposed method, the authors first employ a pretrained monocular depth estimation (MDE) model to estimate a coarse depth map as the condition of the diffusion model. Then, a modified diffusion refiner is used to obtain a fine-grained depth map of the scene. Extensive experiments are conducted on benchmark datasets to demonstrate the effectiveness of the proposed method.
Strengths
- The proposed method is well-motivated and easy to follow.
- The proposed method is a plug-and-play module and easy to use in different backbone models.
Weaknesses
- The novelty of the proposed method is somewhat limited. The proposed method is basically a conditional diffusion refiner, which combines a zero-shot MDE model such as Depth Anything with a diffusion-based MDE model such as Marigold. In my opinion, the depth map predicted by Depth Anything can provide a “good” initialization to the diffusion model. Although the global pre-alignment and local patch masking modules are developed in the proposed method, I still think it is an incremental contribution.
- In the proposed local patch masking module, why can the patch masking strategy obtain more refined details in the depth map? How are the parameters that control the mask ratio determined? In the ablation study of the supplementary material, the range of the mask ratio is [0.05, 0.3]. Can the range be set to a larger interval?
- In the experiment section, different numbers of training samples are used for the proposed method. Are these samples real data or synthetic data? For Marigold, the training samples are synthetic data without real data. Therefore, I suggest the authors highlight whether real or synthetic data is used. In addition, the authors could provide the reasons why the proposed method is slightly inferior to the existing methods on some datasets.
Questions
Please refer to the weakness section. I hope these issues can be addressed in the rebuttal.
Limitations
The authors discussed the limitations in the supplementary material.
Thank you for your thoughtful comments. We kindly ask the Reviewer to read our top-level global response first. Our detailed responses to the comments in the weaknesses (denoted as W) section are listed below.
W-1: Novelty
One of our key contributions is the proposed training strategies that combine the merits of feed-forward and diffusion-based MDE methods in an efficient manner. We agree that building a conditional diffusion model to combine two components is straightforward, but properly leveraging the advantages of both is challenging. Although the depth map estimated by Depth Anything provides good initialization, directly building a diffusion refiner on top of it without our training strategies (denoted as Naive Conditioning) yields inferior results as shown below (metrics are AbsRel / δ1).
| Avg. Rank | Methods | NYUv2 | KITTI | ETH3D | ScanNet | DIODE |
|---|---|---|---|---|---|---|
| 1.8 | Depth Anything | 4.3/98.0 | 8.0/94.6 | 6.2/98.0 | 4.3/98.1 | 26.0/75.9 |
| 2.7 | Naive Conditioning | 5.2/97.0 | 8.6/92.2 | 5.4/96.9 | 5.6/96.2 | 22.4/74.7 |
| 1.2 | BetterDepth | 4.2/98.0 | 7.5/95.2 | 4.7/98.1 | 4.3/98.1 | 22.6/75.5 |
This is because the naive conditioning overfits the small training datasets and thus underutilizes the prior knowledge in the pre-trained MDE model, resulting in degraded zero-shot performance (Lines 276-279 of the main paper). With the proposed training strategies, BetterDepth learns to utilize the geometric prior in pre-trained models for zero-shot transfer and the image prior in diffusion models for detail refinement, efficiently achieving robust performance with fine-grained details (as discussed in the contribution section of the global response).
W-2: Local patch masking
To achieve better details without disregarding the geometric prior learned in pre-trained MDE models, the key challenge is to ensure depth conditioning strength while simultaneously enabling the learning of detail refinement. Thus, we design the patch masking mechanism to:
- improve depth conditioning strength at local regions. By excluding significantly different patches, BetterDepth learns to follow the depth conditioning and better utilizes the geometric prior for robust estimation, which is important for zero-shot performance, as shown in the comparison in W-1 (Naive Conditioning vs. BetterDepth).
- learn detail refinement within a reasonable range. With the filtered patches, BetterDepth learns detail refinement without overfitting the training data distribution, achieving better detail extraction while maintaining robust MDE performance. A visual example is provided in Fig. S1 of the attached PDF, where the model can learn to improve the detail of the basket (Fig. S1c) according to the depth label (Fig. S1b).
However, higher conditioning strength limits the capacity for detail refinement, and lower conditioning strength results in degraded performance, e.g., the naive conditioning model in W-1. Therefore, the threshold η is used to balance these two contradicting properties. We conduct experiments with larger η (0.5 and 1) and combine the results in Fig. A2 as follows (metrics are AbsRel / δ1). A too large η often leads to worse performance as the geometric prior is not well utilized, and we choose the optimal η in BetterDepth.
| η | NYUv2 | KITTI |
|---|---|---|
| 0.05 | 4.21/98.06 | 7.70/95.03 |
| 0.1 | 4.18/98.04 | 7.47/95.22 |
| 0.3 | 4.43/97.89 | 7.75/94.83 |
| 0.5 | 4.37/97.91 | 7.92/94.50 |
| 1 | 4.50/97.72 | 8.14/94.20 |
W-3: Training samples and method performance
All the training samples are synthetic data in BetterDepth (Lines 249-253 of the main paper). For the two smaller training sets, we randomly select 400 and 2K samples from the full 74K synthetic dataset (composed of Hypersim and Virtual KITTI, which is the same synthetic dataset used in Marigold), and our experiments (Tab. 2) show promising performance of BetterDepth even with such small-scale training sets.
The effectiveness of quantitative evaluation heavily relies on the quality of the depth labels in the benchmark. However, due to the limitations of depth sensors, the commonly adopted test benchmarks often contain incomplete and noisy depth labels. For example, Fig. A14-A15 illustrate significantly incorrect annotations and noise in the ground truth of the DIODE dataset, and such label noise makes the reported metrics not fully reliable on a single dataset (similar discussions can be found in Tab. 2 and Sec. 6.1 of Depth Anything V2 [R4]). A more reliable evaluation is to compare the performance across diverse datasets, so we additionally provide the average ranking over the five benchmarks in Tab. 2, and our BetterDepth achieves the overall best performance. In addition, we further validate the superiority of BetterDepth on fine-grained detail extraction, as discussed in the detail evaluation section of the global response.
Dear reviewer,
The discussion period is coming to a close soon. Please do your best to engage with the authors.
Thank you, Your AC
Thanks for the efforts on the rebuttal. The authors have addressed some of my concerns. Nonetheless, I still have some concerns about the contribution of the proposed method, i.e., the training strategy of introducing the depth prior into the diffusion model. Therefore, I keep my original rating.
Dear Reviewer LNRC,
We kindly ask the Reviewer to provide specific remaining concerns about the contribution so we can respond with more details.
Best,
Authors
We sincerely thank all reviewers and area chairs for their valuable time and comments. We will incorporate all suggestions to improve the revised paper. After providing more results/analyses, we would like to give an overall response and re-emphasize the contribution and performance of BetterDepth.
Contribution
BetterDepth aims to efficiently combine the beneficial characteristics of feed-forward and diffusion-based monocular depth estimation (MDE) methods to achieve robust MDE performance with fine-grained details. Although it might seem straightforward to gain better performance by combining two models, naive conditioning without our proposed training strategies only results in inferior performance, as shown in Tab. T1 below. Even with good depth maps from the pre-trained model as initialization, the naive conditioning model struggles to balance the contribution of different priors and does not yield an improvement. Thus, to efficiently achieve our goal, challenges still exist:
- Performance Trade-off. One solution to better utilize the initial depth map is to improve the conditioning strength. However, stronger depth conditioning limits the capacity for detail refinement, and lower conditioning strength results in degraded performance, e.g., naive conditioning in Tab. T1. Therefore, how to properly utilize the merits of different priors and balance the performance trade-off is non-trivial.
- Resource Efficiency. It might be possible to achieve improvements by training on diverse large-scale datasets with high-quality labels, but (i) obtaining high-quality labels for real datasets is difficult due to the imperfection of depth sensors, (ii) synthetic datasets offer high-quality labels but are costly to generate at scale, and (iii) training on large datasets is both time-consuming and resource-intensive.
To this end, we propose global pre-alignment and local patch masking to balance the performance trade-off, ensuring conditioning strength while enabling detail refinement. As a result, BetterDepth efficiently combines the strengths of feed-forward and diffusion-based MDE models, achieving robust performance and fine-grained details with minimal training effort.
Table T1. Comparisons with the naive conditioning model (AbsRel / δ1).
| Avg. Rank | Methods | NYUv2 | KITTI | ETH3D | ScanNet | DIODE |
|---|---|---|---|---|---|---|
| 1.8 | Depth Anything | 4.3/98.0 | 8.0/94.6 | 6.2/98.0 | 4.3/98.1 | 26.0/75.9 |
| 2.7 | Naive Conditioning | 5.2/97.0 | 8.6/92.2 | 5.4/96.9 | 5.6/96.2 | 22.4/74.7 |
| 1.2 | BetterDepth | 4.2/98.0 | 7.5/95.2 | 4.7/98.1 | 4.3/98.1 | 22.6/75.5 |
Zero-Shot Performance
With the proposed architecture and training strategies, BetterDepth achieves state-of-the-art performance on the widely adopted zero-shot datasets as shown in Tab. T2 below, where the very recent approach Depth Anything V2 [R4] is also included for comparisons.
Table T2. Zero-Shot MDE performance (AbsRel / δ1).
| Avg. Rank | Methods | NYUv2 | KITTI | ETH3D | ScanNet | DIODE |
|---|---|---|---|---|---|---|
| 3.7 | Marigold | 5.5/96.4 | 9.9/91.6 | 6.5/96.0 | 6.4/95.1 | 30.8/77.3 |
| 1.9 | Depth Anything | 4.3/98.0 | 8.0/94.6 | 6.2/98.0 | 4.3/98.1 | 26.0/75.9 |
| 2.6 | Depth Anything V2 | 4.4/97.8 | 8.3/93.9 | 6.2/98.2 | 4.2/97.8 | 26.4/75.4 |
| 1.4 | BetterDepth | 4.2/98.0 | 7.5/95.2 | 4.7/98.1 | 4.3/98.1 | 22.6/75.5 |
Detail Evaluation
Following the suggestions of Reviewer Qdvk, we further provide a quantitative evaluation of fine-grained detail extraction. Since depth labels in the commonly adopted benchmarks are generally sparse or noisy, as discussed in [R4] and shown in Fig. A6-A15, which makes them less reliable for detail evaluation, we conduct experiments on the high-resolution RGB-D dataset Middlebury 2014 [R1]. Four additional edge-based metrics are employed to exclude the influence of non-edge regions: the completeness and accuracy of depth boundary errors (denoted as DBE_comp and DBE_acc) [R2] and the edge precision and edge recall (denoted as EP and ER) [R3]; a rough illustrative sketch of the edge metrics is given after Tab. T3. As shown in Tab. T3 below, although the recent Depth Anything V2 achieves promising improvements in detail extraction by leveraging high-quality synthetic training data, our BetterDepth still exhibits better performance even with much less synthetic training data (595K in Depth Anything V2 vs. 74K in BetterDepth), thanks to the iterative refinement of the diffusion model. In addition, BetterDepth also captures better details like the cat's hair in Fig. S5 of the attached PDF, validating its overall best performance.
Table T3. Evaluation of detail extraction.
| Avg. Rank | Method | AbsRel | δ1 | DBE_comp | DBE_acc | EP (%) | ER (%) |
|---|---|---|---|---|---|---|---|
| 3.7 | Marigold | 7.57 | 93.24 | 5.60 | 3.09 | 16.65 | 23.75 |
| 3.2 | Depth Anything | 3.14 | 99.44 | 6.35 | 2.66 | 24.73 | 16.12 |
| 2.2 | Depth Anything V2 | 3.06 | 99.38 | 4.19 | 2.23 | 26.74 | 35.89 |
| 1 | BetterDepth | 2.95 | 99.52 | 3.61 | 2.09 | 28.49 | 50.35 |
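For readers unfamiliar with the edge-based metrics, a rough illustrative sketch of the edge precision/recall idea is given below; the exact edge extraction, thresholds, and matching tolerance follow [R3] and may differ from this simplified version:

```python
# Rough illustrative sketch of edge precision / recall on depth maps; the exact
# edge extraction and matching tolerance in [R3] may differ.
import numpy as np
from scipy import ndimage

def depth_edges(depth: np.ndarray, thr: float) -> np.ndarray:
    gx, gy = ndimage.sobel(depth, axis=1), ndimage.sobel(depth, axis=0)
    return np.hypot(gx, gy) > thr

def edge_precision_recall(pred: np.ndarray, gt: np.ndarray,
                          thr: float = 0.1, tol_px: int = 1):
    e_pred, e_gt = depth_edges(pred, thr), depth_edges(gt, thr)
    # Tolerate small localization errors by dilating the reference edge maps.
    gt_dil = ndimage.binary_dilation(e_gt, iterations=tol_px)
    pred_dil = ndimage.binary_dilation(e_pred, iterations=tol_px)
    precision = (e_pred & gt_dil).sum() / max(e_pred.sum(), 1)
    recall = (e_gt & pred_dil).sum() / max(e_gt.sum(), 1)
    return precision, recall
```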
We hope we have addressed your concerns and kindly ask you to consider updating your rating based on our additional explanations and evaluations. Please don’t hesitate to let us know of any additional comments or questions.
[Reference]
[R1] High-Resolution Stereo Datasets with Subpixel-Accurate Ground Truth. GCPR 2014.
[R2] Evaluation of CNN-based Single-Image Depth Estimation Methods. ECCVW 2018.
[R3] Revisiting Single Image Depth Estimation: Toward Higher Resolution Maps with Accurate Object Boundaries. WACV 2019.
[R4] Depth Anything V2. arXiv preprint arXiv:2406.09414, 2024.
[R5] Monocular depth estimation using diffusion models. arXiv preprint arXiv:2302.14816, 2023.
[R6] Evaluation of CNN-based Single-Image Depth Estimation Methods. ECCVW 2018.
[R7] SUN RGB-D: A RGB-D Scene Understanding Benchmark Suite. CVPR 2015.
I am a bit confused: In the last row of Table T2, which method is embedded into the proposed BetterDepth?
Dear Reviewer EiFe,
Thanks for the comments. BetterDepth uses Depth Anything (not V2) by default, consistent with the setting in the main paper.
Best,
Authors
Dear Reviewers,
We appreciate your valuable comments and suggestions. In the rebuttal, we have responded to all the questions raised and highlighted the contribution and performance of BetterDepth:
- We explain the contribution and novelty of BetterDepth. Specifically, BetterDepth efficiently addresses the two main challenges (performance trade-off and resource efficiency) and achieves robust performance with fine details.
- We demonstrate the state-of-the-art MDE performance of BetterDepth on zero-shot datasets, where the very recent Depth Anything V2 [R4] is also included for comparisons.
- We provide both quantitative and qualitative evaluations for fine-grained detail extraction, further validating the superiority of BetterDepth.
We kindly invite the reviewers to check our response and give us a chance to clarify the remaining concerns. Please let us know of any unresolved or additional concerns before Aug 13, so that we have enough time to provide further feedback. Thanks.
Best,
Authors
The paper received 2 borderline rejects and 1 weak accept. Reviewers approved of the simplicity of the method and the strong quantitative performance. Reviewers questioned the problem setting (is depth refinement or improvement a big enough task?), motivation of the method, and technical novelty.
After reading all the reviews, rebuttal, and the paper myself, I recommend a borderline Accept as Poster. I share Reviewer EiFe's view that the correct, principled approach to the problem would be to directly deal with the data issue or develop new training strategies; however, with that said, I also agree with the authors that there is value in identifying and clearly stating the problem as well as exploring immediate fixes. Given that MDE is a very crucial task that affects a large number of vision subfields (3d, generative, etc.), there's value in having better depth numbers right now, even at the cost of what is essentially an extra network in a diffusion cascade. The quantitative results, particularly the added evaluation in the rebuttal period, are convincing enough to me to suggest that there is value in the work as is.
I would strongly encourage the authors to integrate the rebuttal information into their paper and also to very seriously consider the discussion points mentioned by Reviewer EiFe.