A Simple yet Universal Framework for Depth Completion
We propose a universal few-shot learner for depth completion with arbitrary sensors.
Abstract
Reviews and Discussion
This paper proposes a novel depth completion framework that leverages a large model for monocular depth estimation (i.e., MiDaS, also called a Visual Foundation Model -- VFM) to achieve depth completion across a wide range of scenes and sensors (defined here as the "UniDC problem"). The UniDC baseline architecture also exploits hyperbolic geometry to fine-tune the model to the destination domain with a limited amount of labeled data (even only one RGB + sparse depth + GT frame -- 1-shot learning -- or a few RGB + sparse depth frames -- few-shot learning without dense GT). The UniDC architecture is mainly compared to another VFM-based depth completion framework [19] and to VFM-less depth completion models, showing good performance.
Strengths
Good performance of the framework: the proposal surpasses previous VFM-less works and the VFM-based [19] by a large margin in the 1-shot, 10-shot, 100-shot, and 1-sequence configurations. The ablation study confirms that the hyperbolic space is useful for this kind of experiment. Moreover, the authors tried to fine-tune UniDC in a self-supervised manner, showing interesting results w.r.t. networks [19] and [75] trained in the same way.
Novel solution for a known problem: depth completion methods that generalize well to other domains and/or sensors are a recently emerging topic in the literature [32] [33] [19] (P1). The proposal contributes to broadening the deployment of depth completion solutions to real-world scenarios.
The reading was smooth: the authors wrote the paper in a linear and clear way. They expose the problem to the reader and, following logical steps, arrive at the proposed solution. The figures are clear and help the reader understand the proposal and the results. There are some minor imperfections that can be fixed (see the Questions paragraph).
(P1) Bartolomei, L., Poggi, M., Conti, A., Tosi, F., & Mattoccia, S. (2024, March). Revisiting depth completion from a stereo matching perspective for cross-domain generalization. In 2024 International Conference on 3D Vision (3DV) (pp. 1360-1370). IEEE.
Weaknesses
The comparison of the proposal with VFM-less methods seems unfair: VFM features are a powerful source of extra knowledge (Tab. 7) that is not available to other networks. For example, in Tab. 1 all VFM-less competitors have access to only the few training samples, while UniDC and [19] have (indirect) access to extra data through the VFM knowledge. The authors could have injected the extra VFM knowledge, for example, by replacing the sparse depth input using the rescaling solution proposed in Eq. 4 of paper [19].
Related literature is missing: (P1) (published before [19]) tries to solve the UniDC problem in a zero-shot manner by exploiting a stereo network pre-trained on synthetic data only. It defines a novel benchmark (already used in other works (P2)) to evaluate cross-domain generalization (indoor -> outdoor and vice versa) without labeled data from the target domain. In my opinion, this is a major omission in the related work and experiments chapters.
Not SOTA performance on 1-shot experiments (minor weakness): (P1) shows better performance (Tab. 3 of the cited paper) on KITTI and NYU (except RMSE), even in a zero-shot manner, w.r.t. the proposal in the 1-shot setting. The two results are comparable because the proposal leverages MiDaS, which is trained on millions of frames, while (P1) is trained on thousands of frames (SceneFlow). Even so, it is worth noticing that with more "shots" (e.g., the 10-shot column) the proposal achieves better results.
Some experiments and competitors are missing: this paper lacks three kinds of experiments: E1) full in-domain performance -- i.e., the same domain for training and test sets, with the full training set used; E2) cross-domain performance -- i.e., a network trained on one dataset and tested on another; E3) varying-density performance -- e.g., to simulate deployment on different LiDAR setups. Furthermore, related domain/density-generalized depth completion solutions are missing from the experiments [32] [33] (P1). I am aware that the code for [32] [33] is unavailable (to compare them in a few-shot manner); however, the (P1) code is available, and it also reports results for [32] regarding the aforementioned (E1), (E2), and (E3) experiments.
(P2) Zuo, Y., & Deng, J. (2024). OGNI-DC: Robust Depth Completion with Optimization-Guided Neural Iterations. arXiv preprint arXiv:2406.11711.
Questions
Before the questions, I summarize here the motivations behind my overall rating: the proposed hyperbolic space is beneficial for the few-shot experiments and allows achieving better results w.r.t. [19]. However, the paper lacks other important experiments -- i.e., the aforementioned (E2) and (E3); generalized depth completion competitors in the experiments; and fairness with VFM-less proposals.
Minor comments: 1) there is a typo at row 274: Eg -> Eq; 2) there is a typo in Table 7: 0.13930 -> 1.393; 3) Fig. 1 is a little ambiguous for [19]: the initial relative depth is rescaled using Eq. 4 of paper [19]; 4) I suggest summarizing the claims in a list.
After the Reviewer-Authors discussion
After carefully reading the Authors' rebuttal, I decided to change my score from 3 to 5, based on these two main points:
- Response to W1: one of my major concerns was about the fairness with VFM-less methods. Table G demonstrates that, even when paired with additional knowledge, VFM-less methods struggle in few-shot experiments. This suggests that the integration of a VFM is not trivial.
- Response to W2: Authors agree that (P1) is an important piece of the related work, and they will update the literature review and how it compares to the proposal.
I want to highlight that this score change will be confirmed only if the camera-ready paper has all the edits discussed during the Reviewer-Authors period.
Limitations
The authors addressed almost all limitations. I suggest the authors also investigate the limitations introduced by the VFM, as done in [19].
[bXoG] W.3 Not SOTA Performance on 1-shot Experiments
VPP4DC [P1], which utilizes a stereo matching network for depth completion, builds a model robust to cross-domain generalization for this task. As the reviewer mentioned, P1 achieves accurate performance by pretraining the model on thousands of synthetic frames, slightly outperforming our method in the 1-shot setting. P1 estimates depth by predicting disparity and converting it with the baseline and camera parameters, which might lead to slight differences in the evaluation protocol. Additionally, it seems that P1 only measures performance on the left camera (Camera_02) for KITTI (50% of official validation), which could cause metric protocol discrepancies. To fairly compare that model and ours, we conduct experiments under the UniDC problem we proposed. Table.I demonstrates that, while P1 performs similarly to our method in the 1-shot setting, it does not show significant improvement in the 10- or 100-shot settings. This suggests that VPP4DC, pretrained on a synthetic dataset, struggles to learn new representations. This is consistent with the in-domain experiments in the paper, where VPP4DC shows weaknesses due to insufficient learning of the scene appearance in NYU and of the sparse data obtained from new sensors.
[Table.I] Performance Comparison between VPP4DC and Ours
| NYU | 1-shot (RMSE / MAE / DELTA1) | 10-shot (RMSE / MAE / DELTA1) | 100-shot (RMSE / MAE / DELTA1) |
|---|---|---|---|
| VPP4DC | 0.2357 / 0.1259 / 0.9623 | 0.2379 / 0.1109 / 0.9721 | 0.2287 / 0.0934 / 0.9712 |
| Ours | 0.2099 / 0.1075 / 0.9752 | 0.1657 / 0.0794 / 0.9849 | 0.1473 / 0.0669 / 0.9885 |

| KITTI | 1-shot (RMSE / MAE / DELTA1) | 10-shot (RMSE / MAE / DELTA1) | 100-shot (RMSE / MAE / DELTA1) |
|---|---|---|---|
| VPP4DC | 1.549 / 0.5602 / 0.9896 | 1.453 / 0.4619 / 0.9856 | 1.408 / 0.406 / 0.9895 |
| Ours | 1.6840 / 0.5217 / 0.9826 | 1.3850 / 0.4073 / 0.9903 | 1.2238 / 0.3386 / 0.9927 |
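Since part of the discussion above concerns evaluation-protocol details, here is a minimal, self-contained sketch of the three metrics reported in these tables (RMSE, MAE, and the DELTA1 = δ < 1.25 inlier ratio), assuming predictions and ground truth are NumPy arrays in meters with zeros marking missing ground truth; this is a generic illustration, not the authors' evaluation code.

```python
import numpy as np

def depth_metrics(pred, gt):
    """RMSE, MAE, and delta < 1.25 inlier ratio over pixels with valid ground truth."""
    mask = gt > 0                      # convention: zero marks missing ground truth
    p, g = pred[mask], gt[mask]
    rmse = float(np.sqrt(np.mean((p - g) ** 2)))
    mae = float(np.mean(np.abs(p - g)))
    delta1 = float(np.mean(np.maximum(p / g, g / p) < 1.25))
    return rmse, mae, delta1
```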
[bXoG] Minor Comments & Limitations
- [1,2] We will fix the typos in the revised version. Thanks!
- [3] Fig. 1, initial relative depth of (b) DepthPrompting: we implement the DepthPrompting [19] method with MiDaS. We report the initial depth at metric scale, as inferred by a pretrained model in the 1-shot environment. We agree that this depth map may confuse readers, so we will replace it with the output of the pretrained depth foundation model at relative scale.
- Limitation of the VFM and the rescaling solution of [19]: we think the major concern is how to convert the relative-scale depth to metric scale with the guidance of the sparse depth from the depth sensor. The pioneering work [19] successfully utilizes the output of the monocular depth model, but it shows poor adaptation performance.
[Table.G] Experiments on VFM-less Methods with VFM (RMSE/MAE/DELTA1.25)
| Model | 1-shot | 10-shot | 100-shot |
|---|---|---|---|
| CSPN | 1.483 / 1.206 / 0.346 | 0.317 / 0.196 / 0.711 | 0.285 / 0.131 / 0.975 |
| CSPN + "VFM" | - / - / - | 0.569 / 0.438 / 0.756 | 0.533 / 0.408 / 0.787 |
| BPNet | 0.357 / 0.208 / 0.948 | 0.239 / 0.112 / 0.974 | 0.176 / 0.079 / 0.983 |
| BPNet + "VFM" | - / - / - | - / - / - | - / - / - |
| OGNIDC [P2] | 0.365 / 0.200 / 0.921 | 0.312 / 0.160 / 0.957 | 0.207 / 0.095 / 0.974 |
| OGNIDC + "VFM" | 0.695 / 0.323 / 0.888 | 0.372 / 0.189 / 0.932 | 0.248 / 0.148 / 0.958 |
| DP [19] | 0.358 / 0.207 / 0.910 | 0.220 / 0.101 / 0.973 | 0.210 / 0.101 / 0.974 |
| Ours | 0.210 / 0.108 / 0.975 | 0.166 / 0.079 / 0.985 | 0.147 / 0.067 / 0.988 |
[Table.H] Varying Density Experiment (RMSE/MAE/DELTA1.25)
| [NYU-100] | 1-shot | 10-shot | 100-shot |
|---|---|---|---|
| BPNet | 0.737 / 0.436 / 0.876 | 0.319 / 0.177 / 0.942 | 0.276 / 0.149 / 0.955 |
| LRRU | - | 0.512 / 0.344 / 0.849 | 0.453 / 0.184 / 0.927 |
| OGNIDC | 0.439 / 0.274 / 0.884 | 0.394 / 0.176 / 0.933 | 0.287 / 0.154 / 0.951 |
| Ours | 0.326 / 0.196 / 0.936 | 0.261 / 0.151 / 0.962 | 0.227 / 0.132 / 0.971 |
| [NYU-32] | 1-shot | 10-shot | 100-shot |
|---|---|---|---|
| BPNet | 0.676 / 0.486 / 0.763 | 0.492 / 0.326 / 0.851 | 0.403 / 0.258 / 0.888 |
| LRRU | - | 0.735 / 0.547 / 0.688 | 0.678 / 0.496 / 0.719 |
| Ours | 0.486 / 0.325 / 0.852 | 0.380 / 0.244 / 0.893 | 0.312 / 0.190 / 0.935 |
| [KITTI 16-Line] | 1-shot | 10-shot | 100-shot |
|---|---|---|---|
| BPNet | 3.387 / 1.203 / 0.954 | 3.063 / 1.086 / 0.964 | 2.305 / 0.800 / 0.975 |
| DFU | 4.357 / 2.139 / 0.862 | 3.935 / 1.885 / 0.911 | 2.990 / 1.428 / 0.950 |
| OGNIDC | 5.590 / 2.540 / 0.797 | 2.570 / 0.898 / 0.965 | 2.413 / 0.832 / 0.969 |
| Ours | 2.827 / 1.020 / 0.964 | 2.319 / 0.845 / 0.979 | 2.215 / 0.745 / 0.975 |
| [KITTI 4-Line] | 1-shot | 10-shot | 100-shot |
|---|---|---|---|
| BPNet | 5.568 / 2.886 / 0.775 | 5.332 / 2.384 / 0.863 | 4.471 / 1.844 / 0.906 |
| DFU | - | 5.558 / 3.017 / 0.682 | 4.872 / 2.569 / 0.793 |
| Ours | 4.790 / 2.224 / 0.872 | 4.153 / 1.918 / 0.895 | 4.084 / 1.659 / 0.926 |
Experiments that fail to optimize or diverge are marked with "-".
[bXoG] W.1 Comparison with VFM-less Methods
We acknowledge the reviewer's concern regarding the fairness of comparing our method with VFM-less approaches. We recognize that Visual Foundation Model (VFM) features provide a significant advantage due to their extensive pre-training on large datasets, which VFM-less methods lack. To address this concern, we carry out additional experiments that inject VFM knowledge into VFM-less baselines by replacing the sparse depth input with the depth rescaled via Eq. 4 of [19].
In this experiment, we found that directly applying the VFM approach, as suggested, sometimes yields unsatisfactory performance compared to the baseline (Table.G). This underperformance can be attributed to optimization issues: the sparse depth provides exact metric depth information, whereas fitting the relative-scale depth from the VFM using Eq. 4 of [19] does not reach this precision level. The fitting process minimizes $\|Ax - B\|$, i.e., a linear fit with the available data (the sparse depth). In [19], the authors employ global linear fitting to predict the depth scale using the scalar values $A$ and $B$, initially converting relative depth to metric scale. However, this approach often fits disproportionately to regions with rich information, leading to inaccuracies in areas with sparse depth information. Consequently, using metric sparse depth as input can cause inaccuracies, making optimization difficult and resulting in suboptimal performance.
We agree that there is a significant gap in using VFM directly for depth completion. Instead of directly using relative-scale depth, we chose to leverage intermediate features to indirectly utilize foundation knowledge. This approach allows us to benefit from VFM while avoiding the direct application of relative-scale depth, thereby mitigating some of the challenges observed in this experiment.
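To make the global fit discussed above concrete, below is a minimal sketch of aligning a relative-scale VFM depth map to metric scale with a single least-squares scale and shift computed at the valid sparse-depth pixels. It illustrates the general idea only; the exact formulation of Eq. 4 in [19] may differ, and the function name is ours.

```python
import numpy as np

def align_relative_to_metric(rel_depth, sparse_depth):
    """Fit sparse ≈ A * relative + B at measured pixels, then rescale the whole map."""
    mask = sparse_depth > 0
    x = rel_depth[mask].reshape(-1, 1)
    y = sparse_depth[mask]
    X = np.hstack([x, np.ones_like(x)])          # design matrix [relative, 1]
    sol, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares scale A and shift B
    A, B = sol.ravel()
    # Note: a single global (A, B) can be dominated by regions with many samples,
    # which is the failure mode discussed in the response above.
    return A * rel_depth + B
```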
[bXoG] W.2&4 Related literature, experiments, and competitors are missing
We appreciate the reviewer's suggestion to include the works [P1] and [P2], which address the UniDC problem in a zero-shot manner. We agree that this is an important piece of the related work, and we will update our literature review and discuss how it compares to our method. Additionally, we will expand our experiments to include their benchmark, as outlined in their work, to better evaluate various sensors with few labeled data in other domains.
Our model is not specifically designed for zero-shot generalization; rather, it addresses task-specific domain shifts by integrating relative-scale 3D knowledge learned by the VFM into a sensor fusion model, i.e., depth completion model. Our primary objective is to quickly and effectively adapt to new sensors, leveraging the generalization capability of the VFM model. While cross-domain experiments (source RGBD dataset to target RGBD dataset), as shown by P1 and P2, are important for understanding model generalization, challenges such as catastrophic forgetting or significant covariate shifts (e.g., with new sensors like 4D radar) can render existing models ineffective.
Our solution focuses on fast adaptation with minimal data, enabling strong performance even without dense labeled data, as demonstrated in Table 3 of the paper. Thanks to the reviewer's suggestion, we can showcase the superiority and applicability of our model across various environments by comparing it with advanced methods, demonstrating its feasibility for various sensors and environments.
[Full In-domain Performance]: In response to the request for full in-domain experiments, we carry out an extensive evaluation of the KITTI DC and NYU datasets. To explore our method under various configurations, we develop four variants by adjusting the number of channels, similar to the recent state-of-the-art methods like LRRU and CompletionFormer. As shown in Table.C, the results demonstrate that our method is effective in a full training setting and performs competitively against traditional methods that require extensive labeled data, confirming our approach's robustness and versatility across different training scenarios. Moreover, Table.A shows that our variants achieve competitive performance compared to well-established methods concerning specialization within a single domain. We compare those advanced methods, including their variants (Mini to Base), with ours in few-shot experiments and verify the effectiveness of our proposed method regardless of the number of model parameters.
[Varying-density Performance]: We simulate different LiDAR setups by varying the density of the input data. For the NYU indoor dataset, we randomly sample 100 and 32 sparse depth points, while for the KITTI outdoor dataset, we use 16-Line and 4-Line configurations. These experiments test the robustness and adaptability of our method to changes in input data quality and quantity. As shown in Table.H, our method achieves superior performance across the different sensor configurations, while most comparison methods exhibit a decline in performance when adapting to new sensor configurations, as also observed for DepthPrompting [19].
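As an illustration of how such NYU-style varying-density inputs can be produced, the sketch below randomly keeps a fixed number of valid depth points from a denser map. It is an assumed, simplified stand-in; the actual sampling and the 16/4-line LiDAR subsampling for KITTI follow the respective dataset tooling.

```python
import numpy as np

def sample_sparse_depth(depth, num_points, seed=0):
    """Keep `num_points` randomly chosen valid depth values; zero out the rest."""
    rng = np.random.default_rng(seed)
    valid = np.argwhere(depth > 0)                 # (M, 2) coordinates of valid pixels
    keep = rng.choice(len(valid), size=min(num_points, len(valid)), replace=False)
    sparse = np.zeros_like(depth)
    r, c = valid[keep, 0], valid[keep, 1]
    sparse[r, c] = depth[r, c]
    return sparse

# NYU-100 / NYU-32 style inputs:
# sparse_100 = sample_sparse_depth(gt_depth, 100)
# sparse_32  = sample_sparse_depth(gt_depth, 32)
```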
Dear Authors,
Thanks for your large effort in the response. After carefully reading your rebuttal, this is my reply:
Just to clarify: I cited the after-deadline work (P2) only to prove that (P1) benchmark is already used in the field. I apologize if (P2) was seen as a request for comparison with your proposal.
[bXoG] W.1 Comparison with VFM-less Methods
Authors: "Consequently, using metric sparse depth as input can cause inaccuracies, making optimization difficult and resulting in suboptimal performance." I agree.
[Table.G] Experiments on VFM-less Methods with VFM (RMSE/MAE/DELTA1.25)
Thanks for this important experiment about one of my main concerns. From your results, it seems that VFM-less methods cannot exploit the implicit knowledge of a VFM in the few-shot tests using my suggestion, while the usage of hyperbolic space confirms rows 103-104 of your paper. In my opinion, this is a very important outcome.
[bXoG] W.2&4 Related literature, experiments, and competitors are missing
Authors: "Our model is not specifically designed for zero-shot generalization; rather, it addresses task-specific domain shifts by integrating relative-scale 3D knowledge learned by the VFM into a sensor fusion model, i.e., depth completion model. Our primary objective is to quickly and effectively adapt to new sensors, leveraging the generalization capability of the VFM model." I agree: (P1) does not explore adaptation using few-shot target data. However, in my mind, it shares the same goal -- i.e., a model that works effectively when paired with new sensors. So I appreciate the authors' effort to update the paper's literature.
- I cannot see Table.C and Table.A. Did you forget to add them to the rebuttal?
[Table.H] Varying Density Experiment (RMSE/MAE/DELTA1.25)
Thanks for the experiment. I think that the results in this table are a consequence of the outcome of [Table.G].
[bXoG] W.3 Not SOTA Performance on 1-shot Experiments
Authors: "Additionally, it seems that P1 only measures performance on the left camera (Camera_02) for KITTI (50% of official validation), which could cause metric protocol discrepancies." Are you using the val_selection_cropped split? (I cannot tell from your supplementary code.) If yes, I checked the (P1) code available online and they used all 1000 frames (there are no "camera_02" or "camera_03" folders inside val_selection_cropped, only the "image" folder). If not, I apologize for the wrong assumption (please specify your KITTI test split).
[bXoG] Minor Comments & Limitations
Thanks for your answer. Which version of MiDaS did you use in your experiments?
Best Regards,
Reviewer bXoG
Dear bXoG,
Thank you for your detailed and thoughtful responses to our rebuttal. We greatly appreciate your insights and are glad to address your comments and queries.
Clarifications on Citations and Comparisons
We appreciate you letting us know about the recent paper; it was interesting to compare our method with it.
W.1 & Table.G
We appreciate your agreement with our assessment that using metric sparse depth as input can lead to inaccuracies and optimization challenges. We are also pleased that the experiment addresses your concerns.
[bXoG] W.2&4 Related Literature, Experiments, and Competitors
Thanks for your understanding of our design philosophy, which focuses on addressing sensor-specific domain shifts by integrating relative-scale 3D knowledge from the VFM into a sensor fusion model. We will include this rebuttal content, including the shared goal of effective model performance with new sensors, in the revised version of our paper.
Missing Tables
We have included these tables in our "global" response, as well as in the attached PDF file. Due to a rendering issue in OpenReview, the tables may not be visible, so we provide them here for your convenience. Please see Table A and Table C below:
[Table.A] Experiment on Advanced Methods (RMSE/MAE/DELTA1.25)
| Model | 1-shot | 10-shot | 100-shot |
|---|---|---|---|
| LRRU_Mini | 0.704 / 0.505 / 0.738 | 0.989 / 0.677 / 0.642 | 0.551 / 0.392 / 0.797 |
| LRRU_Tiny | 0.842 / 0.633 / 0.574 | 0.771 / 0.549 / 0.707 | 0.565 / 0.373 / 0.836 |
| LRRU_Small | 0.589 / 0.388 / 0.826 | 0.404 / 0.246 / 0.919 | 0.442 / 0.306 / 0.887 |
| LRRU_Base | 0.447 / 0.278 / 0.899 | 0.424 / 0.252 / 0.922 | 0.316 / 0.189 / 0.949 |
| DFU | 0.754 / 0.589 / 0.637 | 0.590 / 0.464 / 0.719 | 0.467 / 0.344 / 0.868 |
| OGNI-DC | 0.365 / 0.200 / 0.921 | 0.312 / 0.160 / 0.957 | 0.207 / 0.095 / 0.974 |
| Ours_Tiny | 0.243 / 0.131 / 0.969 | 0.186 / 0.088 / 0.982 | 0.151 / 0.068 / 0.988 |
| Ours | 0.210 / 0.108 / 0.975 | 0.166 / 0.079 / 0.985 | 0.147 / 0.067 / 0.988 |
| Ours_Small | 0.255 / 0.136 / 0.968 | 0.181 / 0.089 / 0.983 | 0.149 / 0.067 / 0.988 |
| Ours_Base | 0.247 / 0.138 / 0.969 | 0.190 / 0.093 / 0.983 | 0.148 / 0.066 / 0.988 |
[Table.C] Full Dataset Training Benchmark (NYU & KITTI)
| # of Learnable Params. (M) | Model | NYU RMSE | NYU MAE | KITTI RMSE | KITTI MAE |
|---|---|---|---|---|---|
| 41.5M | Cformer_Tiny | 0.091 | 0.035 | - | - |
| 82.6M | Cformer_Small | 0.090 | 0.035 | 0.739 | 0.196 |
| 142.4M | Cformer_Base | 0.090 | 0.035 | 0.709 | 0.203 |
| 0.3M | LRRU_Mini | 0.101 | - | 0.800 | 0.219 |
| 1.3M | LRRU_Tiny | 0.096 | - | 0.762 | 0.208 |
| 5.2M | LRRU_Small | 0.093 | - | 0.741 | 0.202 |
| 21M | LRRU_Base | 0.091 | - | 0.728 | 0.198 |
| 1.2M | Ours_Tiny | 0.107 | 0.042 | 0.907 | 0.231 |
| 4.6M | Ours | 0.098 | 0.038 | 0.867 | 0.224 |
| 36.9M | Ours_Small | 0.095 | 0.038 | 0.824 | 0.209 |
| 63.2M | Ours_Base | 0.093 | 0.036 | - | - |
[bXoG] W.3 Not SOTA Performance on 1-shot Experiments
Regarding your query about the KITTI test split, we are indeed using the val_selection_cropped split. We apologize for any confusion caused by our supplementary code. (P1) utilizes all 1000 frames without distinguishing between "camera_02" and "camera_03." We noticed some ambiguity in the publicly available code, where the dataloader defines two variables:
- image2_list = sorted(glob(osp.join(datapath, 'image/*.png')))
- image3_list = sorted(glob(osp.join(datapath, 'image/*.png')))
When we initially reproduced VPP4DC, we considered these lists to differentiate between the left and right images for stereo matching. Actually, "val_selection_cropped" consists of 500 "image_02" and 500 "image_03" frames. However, based on the public code, (P1) utilizes the entire 1000 frames in the val_selection_cropped split without distinguishing between "camera_02" and "camera_03". Thank you for giving us the chance to clarify the evaluation protocol. We note that we do not use this code for Table I, where we train VPP4DC with our dataloader, fully utilizing the 1000 frames in the val_selection_cropped split.
[bXoG] Minor Comments & Limitations
For the MiDaS version used in our experiments, we utilized v2.1 Small. This version was chosen for its compatibility and strong performance with our model, and its lightweight nature makes it especially practical. Once again, we appreciate your engagement and valuable feedback. Please let us know if there are any further questions.
Best regards,
Authors
Dear Authors,
Thanks for your response.
Thanks for tables A and C. Now, I can see also the global pdf.
Best regards,
Reviewer bXoG
Dear bXoG,
Thank you for your constructive comments and the updated score. We are pleased to hear that tables A and C, as well as the PDF files, have been helpful for your review. We assure you that the contents of this rebuttal will be included in the revised version of the paper.
If you have any further questions or requests for clarification, please feel free to ask us.
Best regards,
Authors
This paper proposes Universal Depth Completion as a new task and provides a solution that tackles the problem. The proposed approach utilizes a monocular depth foundation model for its general understanding of 3D scenes and then completes depth information from different sensors with a learned affinity map. The learned features are embedded into a hyperbolic space to build hierarchical 3D structures.
Strengths
The authors identified the problem that most depth completion solutions are tailored to specific settings (indoor/outdoor) and sensors, and have trouble generalizing to different configurations. The paper then claims to define a new problem called Universal Depth Completion, which is an even more general setup than the sensor-agnostic depth completion task from the DepthPrompting paper.
The proposed baseline is novel, it uses a depth foundation model for its generalizable knowledge of 3D scenes and leverages hyperbolic geometry for learning hierarchical features.
The usage of hyperbolic geometry is a notable contribution. While embedding features into a hyperbolic space has already been proposed in prior works, this work observes the limitation of a fixed, predetermined curvature and proposes to learn a suitable curvature based on the fused features. Furthermore, the pre-processing step (sparse-to-dense conversion) and post-processing step (spatial propagation) are both adapted to hyperbolic space.
The paper also includes few-shot experiments comparing the proposed method to other SOTA ones. The proposed approach outperforms all other methods in various few-shot setups.
Weaknesses
The authors mentioned that only KITTI and NYU are predominantly utilized in the depth completion research field, which I agree with. Yet in the paper, they still use only these two datasets. I don't think testing on these two datasets is enough. There are many other RGBD datasets that could be used here to make the claim of generalizability stronger.
Questions
Why do the 100-shot results of NYU look so much worse compared to KITTI?
Limitations
Yes.
[AS7d] W.1 Experiments on other RGBD datasets
We appreciate your feedback regarding the limited set of datasets in our evaluation. In response to concerns about unseen domain generalization, we have expanded our experimental evaluation to include an additional dataset, SUN RGB-D, which contains RGB-D images from four different sensors (Intel RealSense 3D Camera, Asus Xtion LIVE PRO, Microsoft Kinect V1 and V2), offering a diverse range of scenes and sensors. The 1000 scenes, where each sequence is roughly 20 seconds long and annotated every 0.5 seconds, are officially split into train/val/test sets of 700/150/150 scenes. According to the results in Table.B, our method consistently demonstrates improved performance, outperforming state-of-the-art methods in new domain scenarios. These results highlight our framework's strong generalization capabilities and will be included in the revised version of this paper.
[AS7d] Q.1 Comparison of qualitative results
The difference in qualitative results observed between the NYU and KITTI datasets can be primarily attributed to the sensor characteristics. KITTI employs a 64-line Velodyne LiDAR, which provides consistent sparse depth information at fixed locations, creating an inductive bias that simplifies optimization: the model can more easily learn from the consistent depth cues provided by the LiDAR. In contrast, NYU relies on synthetic random sampling for sparse depth, leading to varying locations of depth seeds across samples. This variability increases the difficulty of optimization, particularly in few-shot setups, as the model struggles to learn effectively from constantly changing sparse depth locations.
Dear AS7d,
Thank you for your insightful feedback on our submission. We have addressed your comments in our response. As the rebuttal deadline nears, we welcome any further questions or concerns and are ready to provide additional clarification if needed.
Best regards, Authors
The paper introduces a universal depth completion framework that aims to resolve two challenges: "generalizable knowledge" of unseen scenes and "adaptation ability" to arbitrary depth sensors. Unlike the previous method relying on extensive pixel-wise labeled data, the proposed method 1) utilizes a foundation monocular depth estimation model for unseen environments, 2) generates a pixel-wise affinity map for fast adaptation, and 3) utilizes hyperbolic space to build implicit hierarchical structures of 3D data from fewer examples. As a result, the proposed depth completion framework can be generalized to the unseen domain and rapidly adaptable with few samples for diverse depth sensors. The method shows effectiveness and fast adaptation ability on KITTI DC and NYU v2 datasets.
Strengths
[Originality] The utilization of hyperbolic embedding to capture the implicit hierarchical structure of 3D data from a few samples seems novel and effective.
[Relation to prior work] The paper clearly discusses the difference between previous related works.
[Clarity] The limitations of traditional depth completion methods are well described, and the method is clearly motivated by the limitations.
[Significance] The method shows effectiveness and fast adaptation ability on KITTI DC and NYU v2 datasets.
Weaknesses
W1. [Significance] The paper describes that the method aims to address unseen domain generalization in L8 and introduction. However, the experiments about unseen domain generalization are a bit insufficient and do not effectively show its superiority to claim that the method fully addresses the generalization problem. It would be great to see further extensive unseen domain generalization experiments with other comparison methods to highlight the proposed method's superiority.
Questions
Please resolve the issue in the weakness part. Additional questions are as follows:
Q1. [Full sequence training] The paper only demonstrates a few-shot experiment and 1 sequence training results. It would be great to see the proposed method still valid even following a typical training process (e.g., full epoch training, such as 100 epoch training on KITTI DC dataset).
Q2. [Clarity] In ablation study 5.3, "Probe for Hyperbolic Embedding," it would be great to describe what "Euclidean" stands for. I guess this means that all operations of the proposed method are done in Euclidean space. But would you please explain the implementation in more detail?
Q3. Following the references [67-70], hyperbolic space is effective in low-shot problems. It would be great to see how much performance difference exists between Euclidean and hyperbolic spaces in a few-shot setting. Would you provide the Tab. 5 experiment in few-shot settings to support the claim that it is valid for depth completion tasks as well?
Q4. [Limitation] Would you describe in more detail the pros/cons/limitations of hyperbolic geometry in depth estimation or depth completion tasks?
Limitations
The paper aims to resolve two challenges: "generalizable knowledge" of unseen scenes and "adaptation ability" to arbitrary depth sensors. The current paper demonstrates that the method works well in few-shot settings on KITTI and NYU v2 datasets. It can also adapt to common LiDAR and RGB-D sensors. I guess through the rebuttal, the paper is also possible to show its generalization performance in unseen domains.
[Table.F] Ablation of Hyperbolic Operation (RMSE/MAE/DELTA1)
| NYU | 1-shot | 10-shot | 100-shot |
|---|---|---|---|
| Ours (Euclidean) | 0.217 / 0.112 / 0.971 | 0.172 / 0.081 / 0.984 | 0.149 / 0.069 / 0.989 |
| Ours (Hyperbolic) | 0.210 / 0.108 / 0.975 | 0.166 / 0.079 / 0.985 | 0.147 / 0.067 / 0.988 |
| KITTI | 1-shot | 10-shot | 100-shot |
|---|---|---|---|
| Ours (Euclidean) | 1.745 / 0.578 / 0.982 | 1.397 / 0.417 / 0.990 | 1.291 / 0.342 / 0.992 |
| Ours (Hyperbolic) | 1.684 / 0.522 / 0.983 | 1.385 / 0.407 / 0.990 | 1.224 / 0.339 / 0.993 |
[1] LRRU: Long-short Range Recurrent Updating Networks for Depth Completion (ICCV 2023)
[2] Improving Depth Completion via Depth Feature Upsampling (CVPR 2024)
[3] OGNI-DC: Robust Depth Completion with Optimization-Guided Neural Iterations (ECCV 2024)
[4] Revisiting Depth Completion from a Stereo Matching Perspective for Cross-domain Generalization (3DV 2024)
[5] Hyperbolic Neural Networks (NeurIPS 2018)
[6] Hyperbolic Neural Networks++ (ICLR 2021)
[7] Rethinking the Compositionality of Point Clouds through Regularization in the Hyperbolic Space (NeurIPS 2022)
[8] Capturing Implicit Hierarchical Structure in 3D Biomedical Images with Self-Supervised Hyperbolic Representations (NeurIPS 2021)
[9] HypLiLoc: Towards Effective LiDAR Pose Regression with Hyperbolic Fusion (CVPR 2023)
[10] On Hyperbolic Embeddings in Object Detection (DAGM 2022)
[11] Fully Hyperbolic Convolutional Neural Networks for Computer Vision (ICLR 2024)
[12] The Numerical Stability of Hyperbolic Representation Learning (ICML 2023)
[13] Clipped Hyperbolic Classifiers Are Super-Hyperbolic Classifiers (CVPR 2022)
[14] HypLiLoc: Towards Effective LiDAR Pose Regression with Hyperbolic Fusion (CVPR 2023)
[15] Hyperbolic Contrastive Learning for Visual Representations beyond Objects (CVPR 2023)
Thank you for responding to my questions. I have also checked the other reviews and their responses. The response resolves my concerns about unseen domain generalization (W1), full sequence training (Q1), the clarity of ablation study 5.3 (Q2), the few-shot comparison between Euclidean and hyperbolic spaces (Q3), and the limitations (Q4). I keep my initial rating.
Dear g2HF,
Thank you for your review and thoughtful feedback on our work. We appreciate your positive evaluation and are pleased that we could address your concerns. Should you have any additional questions or wish to discuss further, we welcome the opportunity. Thank you once again for your valuable insights.
Best regards, Authors
[g2HF] W/L. Comparison with other methods and experiments about unseen domain generalization.
As requested by the reviewer, we conduct additional experiments on a dataset beyond KITTI DC and NYU v2. We include tests on SUN RGB-D, which covers different environmental conditions and various sensor types, e.g., Intel RealSense 3D Camera, Asus Xtion LIVE PRO, Microsoft Kinect V1 and V2. According to Table.B, our method achieves consistent performance improvements, outperforming the state-of-the-art methods in unseen domain scenarios. In addition, we compare with other advanced methods (e.g., LRRU [1], DFU [2], OGNI-DC [3], VPP4DC [4]) to highlight the proposed method's superiority (please refer to Table.A, [tAUP] W.2, and [bXoG] W.3).
[g2HF] Q.1 Full sequence training
In response to the request for full sequence training experiments, we carry out an extended evaluation of the KITTI DC dataset. The results, shown in Table.C, indicate that our method not only maintains its effectiveness in a full training regime but also achieves competitive performance over traditional methods that rely on extensive labeled data. This supports our claim that the proposed approach is robust and versatile across different training settings.
[g2HF] Q.2 Implementation of hyperbolic operation
As you noted, "Euclidean" refers to a space in which the conventional operations such as convolution are performed. This means that all computations, including embedding and affinity computations, are carried out with typical neural network operations in n-dimensional Euclidean space. Starting with the foundational work on Hyperbolic Neural Networks [5,6], there has been a significant shift towards leveraging hyperbolic operations across various downstream tasks. Typically, tensors processed in neural networks are treated as living in Euclidean space. However, by embedding these tensors into hyperbolic space, we can exploit the advantages of hyperbolic geometry. This embedding enables us to perform operations directly in hyperbolic space, capturing hierarchical relationships more effectively than in Euclidean space [5,6]. Below is a pseudo-code overview of the hyperbolic operations used in our method:
Pseudo-Code
Input: the image features, the initial sparse depth, the learned curvature, each pixel coordinate, and its neighboring pixel coordinates.
Operations:
- Transform function from hyperbolic to Euclidean space (Eq. 4)
- Distance function on hyperbolic features (Eq. 8)
- Convolution layer on hyperbolic features (Eq. 10)
For each pixel coordinate, these operations are applied over its neighborhood.
Output: the hyperbolic affinity map and the initial depth.
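For readers unfamiliar with these operations, below is a minimal, self-contained sketch of the standard Poincaré-ball primitives (exponential map at the origin, its inverse, Möbius addition, and the induced geodesic distance) on which such hyperbolic embedding and distance layers are typically built. The function names and curvature handling are illustrative, not the authors' exact implementation of Eqs. 4, 8, and 10.

```python
import torch

def expmap0(v, c):
    # Map a Euclidean (tangent) vector v onto the Poincare ball of curvature -c.
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def logmap0(x, c):
    # Inverse of expmap0: maps a ball point back to Euclidean (tangent) space,
    # playing the role of a hyperbolic-to-Euclidean transform.
    sqrt_c = c ** 0.5
    norm = x.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    return torch.atanh((sqrt_c * norm).clamp(max=1 - 1e-5)) * x / (sqrt_c * norm)

def mobius_add(x, y, c):
    # Moebius addition on the Poincare ball.
    xy = (x * y).sum(dim=-1, keepdim=True)
    x2 = (x * x).sum(dim=-1, keepdim=True)
    y2 = (y * y).sum(dim=-1, keepdim=True)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c ** 2 * x2 * y2
    return num / den.clamp_min(1e-8)

def hyperbolic_distance(x, y, c):
    # Geodesic distance between two points on the Poincare ball.
    sqrt_c = c ** 0.5
    diff = mobius_add(-x, y, c).norm(dim=-1).clamp(max=(1 - 1e-5) / sqrt_c)
    return 2.0 / sqrt_c * torch.atanh(sqrt_c * diff)

# Example: embed two pixel features and measure their hyperbolic distance.
c = 1.0                      # curvature magnitude (learned in the paper; fixed here)
f_j = torch.randn(8) * 0.1   # feature of pixel j (illustrative)
f_k = torch.randn(8) * 0.1   # feature of a neighboring pixel k
d = hyperbolic_distance(expmap0(f_j, c), expmap0(f_k, c), c)
```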
[g2HF] Q.4 Pros/cons/limitations of hyperbolic geometry for depth perception
Hyperbolic geometry offers substantial advantages in modeling hierarchical relationships within sparse data, providing new cues for understanding 3D structures in depth perception tasks [7,8,9,10]. By embedding data into hyperbolic space, we can more effectively capture complex structures, enabling a richer representation of 3D environments. This approach enhances the performance of depth completion tasks, particularly in few-shot settings, and improves generalization and adaptability in unseen domains.
Most depth completion tasks use encoder-decoder structures to extract multi-scale features and propagate sparse depth information through affinity maps. Hyperbolic space naturally accommodates exponentially growing hierarchies and tree-like structures, allowing robust affinity construction with low distortion. This enhances the distinction between unrelated pixel features and reduces bleeding errors by effectively capturing boundary information, even in ambiguous regions.
However, hyperbolic geometry still poses certain challenges. One major limitation is the increased computational complexity, which can impact both training and inference times [11,12,13], even with a similar number of parameters, as highlighted in [tAUP] Q.2. Additionally, the performance sometimes depends on the hyperparameter setting for the curvature value. Despite these challenges, the overall benefits of using hyperbolic geometry, such as improved generalization and performance in diverse and unseen environments, often outweigh these drawbacks, making it a valuable tool for depth estimation and completion tasks.
Considering these advantages and limitations, a strategy that combines Euclidean and hyperbolic operations could be beneficial, as done in Eq. 8 and in [14,15]. By leveraging the strengths of both geometries, it is possible to balance computational efficiency and the ability to capture complex hierarchical relationships in depth perception tasks; we will mention this in the revised version of the paper.
[g2HF] Q.3 Few shot comparisons for Euclidean and hyperbolic space
As suggested, we conduct additional experiments under the few-shot regime. The results, presented in Table.F, show a noticeable improvement when using hyperbolic space, with a performance gain of 5% on average, compared to Euclidean space. This validates the effectiveness of hyperbolic geometry in depth completion tasks, especially when dealing with limited data samples.
This paper proposes a universal depth completion method (UniDC) to address generalization issues in unknown scenes and arbitrary depth sensors. The method utilizes depth information extracted from a pre-trained monocular depth estimation model to generate pixel-wise affinity maps, which adjust sparse depth. Additionally, this paper develops a hyperbolic propagation method for generating and refining dense depth.
Strengths
1. This paper introduces a pre-trained depth estimation model for depth completion tasks and proposes a universal depth completion framework that effectively addresses depth completion challenges across various scenes and different depth sensors.
2. The proposed method performs 1-shot, 10-shot, and 100-shot experiments on the NYU and KITTI DC datasets, demonstrating that UniDC achieves outstanding performance and generalization ability.
Weaknesses
- The proposed method does not provide comparative results on the online KITTI depth completion benchmark, and these results need to be further presented.
- Some advanced and state-of-the-art depth completion methods need to be discussed and compared:
[1] Improving Depth Completion via Depth Feature Upsampling. CVPR 2024.
[2] Tri-Perspective View Decomposition for Geometry-Aware Depth Completion. CVPR 2024.
[3] LRRU: Long-short Range Recurrent Updating Networks for Depth Completion. ICCV 2023.
[4] BEVDC: Bird's-Eye View Assisted Training for Depth Completion. CVPR 2023.
[5] RigNet: Repetitive Image Guided Network for Depth Completion. ECCV 2022.
Questions
Please refer to weaknesses. Moreover,
- Compared to previous spatial propagation networks (SPN, CSPN, DySPN, etc.), what advantages does the calculation of the affinity map A_{j,k} in hyperbolic space, proposed by Equation (11), offer, and why can it better refine depth maps?
- Incorporating a pre-trained depth estimation model into the network results in additional computational costs, particularly for large models like DepthAnything. How do you mitigate this issue?
Limitations
The limitations are discussed in detail in the paper. Estimating the uncertainty of noise will better enhance the model's generalization capability.
[tAUP] W.1 Online KITTI depth completion benchmark.
We thank the reviewer for the comments about the outstanding performance and generalization ability of our method. Following your comment, we report the performance of our work on the KITTI benchmark in Figure.A of the uploaded PDF file. To analyze our method under various configurations (Table.C), similar to recent state-of-the-art methods such as LRRU and CompletionFormer, we design four variants by adjusting the number of channels. Notably, both our method and these methods closely follow the scaling behavior of deep learning models [1, 2]. Our variants achieve competitive results compared to the state-of-the-art methods, especially in setups with fewer labels, where our approach demonstrates superior performance.
[tAUP] W.2 Comparison with recent SoTA methods
We have provided a detailed comparison with recent state-of-the-art methods in Table.A. This comparison highlights the advantages of our approach across different experimental setups.
In the case of the LRRU family [3], we observe that in the KITTI dataset, smaller models tend to perform better, whereas in the NYU dataset, larger models achieve better performance. This behavior is attributed to the IP-Basic algorithm [4] used for preprocessing, which is biased towards the KITTI dataset. Unlike LRRU, our approach leverages foundation model knowledge, enabling consistent and rapid adaptation to both indoor and outdoor data.
Additionally, the DFU [5] model addresses the issue of sparse decoder features in encoder-decoder networks by using a depth feature upsampling method. While this approach seems to aid adaptation in few-shot setups, it appears vulnerable in indoor scenarios with varying sparse depth configurations.
OGNI-DC [6] iteratively refines the depth gradient field (depth differences between neighboring pixels) and integrates this information to produce a dense depth map. This method has shown excellent performance in various zero-shot generalization experiments and satisfactory results in few-shot experiments, ranking second-best among the compared methods. Unlike other models, which rely heavily on RGB features, OGNI-DC focuses on refining 3D-aware information through the depth gradient field, making it less sensitive to RGB appearance changes; a similar property is seen in CostDCNet [7], which shows superior performance over the other methods in Table 1 of the manuscript. This results in strong generalization and few-shot adaptation capabilities. Nevertheless, our methodology demonstrates the ability to learn feature representations effectively in hyperbolic space from smaller datasets, leading to faster adaptation in harsh conditions compared to recent SoTA methods.
[tAUP] Q.1 Advantages of hyperbolic space for calculation of the pixel affinity map.
We appreciate the opportunity to explain why depth completion methods benefit from hyperbolic geometry. Most spatial propagation networks (e.g., CSPN, NLSPN, and DySPN) adopt encoder-decoder structures to extract multi-scale features w.r.t. structural and photometric similarities. Then, initial seeds (i.e., sparse depth) are propagated based on affinity maps computed from the learned features in an iterative manner. Therefore, an accurate affinity map must capture boundary information, which corresponds to the most ambiguous regions for a pixel-wise prediction task. However, object boundary ambiguities, caused by noise or smooth intensity changes, can lead to bleeding errors [8].
To address these issues, traditional studies (i.e., optimization-based approaches) for affinity construction have utilized hierarchical structures [9-13]. Such tree-like, non-local propagation has shown superiority over the local propagation that CSPN, NLSPN, and DySPN follow. By embedding pixel features into hyperbolic space [14,15], we formulate these hierarchical relations in a continuous and differentiable manner. Hyperbolic space naturally accommodates exponentially growing hierarchies and tree-like structures, allowing robust affinity construction with low distortion. This enhances the distinction between unrelated pixel features (i.e., low affinity), while semantically close pixels benefit from the hierarchical structure, reducing bleeding errors.
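To ground the propagation scheme described above, here is a minimal sketch of one affinity-weighted spatial-propagation update in the spirit of CSPN-style networks, where each pixel's depth becomes a normalized, affinity-weighted combination of its neighbors while measured pixels stay anchored. It is a generic Euclidean illustration, not the paper's hyperbolic formulation, and the function names are ours.

```python
import torch
import torch.nn.functional as F

def propagation_step(depth, affinity, kernel_size=3):
    """One affinity-weighted update over each pixel's k x k neighborhood.
    depth:    (B, 1, H, W) current depth estimate
    affinity: (B, k*k, H, W) learned affinity toward each neighbor
    """
    b, _, h, w = depth.shape
    pad = kernel_size // 2
    # Gather the k*k neighborhood of every pixel: (B, k*k, H, W)
    neighbors = F.unfold(depth, kernel_size, padding=pad).view(b, kernel_size ** 2, h, w)
    # Normalize affinities so the update is a convex combination (stable propagation).
    weights = torch.softmax(affinity, dim=1)
    return (weights * neighbors).sum(dim=1, keepdim=True)

def refine(init_depth, sparse_depth, affinity, iters=6):
    """Iterative refinement that re-anchors on the valid sparse measurements."""
    mask = (sparse_depth > 0).float()
    d = init_depth
    for _ in range(iters):
        d = propagation_step(d, affinity)
        d = mask * sparse_depth + (1.0 - mask) * d  # keep measured pixels fixed
    return d
```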
We conducted a toy example to verify the effectiveness of hyperbolic geometry in various propagation schemes, including CSPN (convolutional), NLSPN (non-local), and DySPN (dynamic attention). Using the same backbone (ResNet-34) and loss functions (L1 and L2) across all schemes ensures a fair comparison. As shown in Table.D, hyperbolic operations significantly improve performance in various few-shot setups: compared to their Euclidean counterparts, hyperbolic structures improve pixel distinction under challenging conditions. We also refer you to section [g2HF] Q.3 for an ablation study of hyperbolic operations within our architecture.
[tAUP] Q.2 Probe for computational costs of depth foundation model
As mentioned, depth foundation models are typically large and computationally expensive due to training on extensive datasets. However, recent models offer various variants, allowing flexibility in computational demands. As shown in Table 7 of the manuscript, we conduct ablations on multiple models and observe comparable performance across them. As shown in Table.E, MiDaS [16] and Depth Anything [17] have significantly fewer parameters than other depth completion models, suggesting that leveraging a pre-trained foundation model's knowledge does not necessarily entail high computational costs; the lightweight models achieve sufficient generalization performance as well. Note that we use the publicly available official code for MiDaS [16] (v2.1 Small), Depth Anything v1 [17] (ViT-S), and UniDepth [18] (ViT-L). We will mention this in the revised version of this paper.
Thank you for the detailed responses. I now have a better understanding of the benefits of the proposed Hpy.
However, the performance of the proposed method on the online KITTI benchmark appears to be subpar. The Root Mean Square Error (RMSE) stands at 804.33mm, whereas the current state-of-the-art methods (without utilizing large-scale models) range around 685mm. This discrepancy is quite significant. Notably, the 800mm mark was attained approximately five years ago. This raises the question: what justifies the development of such a large-model prompt? Given its notably inferior performance on widely employed benchmarks, its potential contribution seems constrained.
Additionally, this study seems to overlook the discourse and references to numerous prior state-of-the-art (SOTA) methodologies and recent advancements. For a top-tier research paper, it is imperative to include discussions on and citations of the latest and high-performance related works. This aspect holds significant importance in showcasing the relevance and impact of the research being conducted.
Consequently, I am inclined to assign a lower score.
[Table.D] SPNs (CSPN, NLSPN, DySPN) with Hyperbolic Geometry (RMSE/MAE/DELTA1)
| Model | 1-shot | 10-shot | 100-shot |
|---|---|---|---|
| CSPN | 1.483 / 1.213 / 0.266 | 0.470 / 0.330 / 0.839 | 0.222 / 0.106 / 0.973 |
| CSPN + Hyperbolic | 1.188 / 0.950 / 0.398 | 0.429 / 0.271 / 0.866 | 0.186 / 0.101 / 0.982 |
| NLSPN | 1.396 / 1.136 / 0.290 | 0.925 / 0.719 / 0.489 | 0.283 / 0.192 / 0.952 |
| NLSPN + Hyperbolic | 1.338 / 1.079 / 0.328 | 0.353 / 0.208 / 0.934 | 0.211 / 0.133 / 0.978 |
| DySPN | 1.499 / 1.210 / 0.283 | 0.567 / 0.422 / 0.742 | 0.243 / 0.117 / 0.972 |
| DySPN + Hyperbolic | 1.303 / 1.044 / 0.035 | 0.428 / 0.304 / 0.807 | 0.216 / 0.103 / 0.978 |
[Table.E] Computational Cost of Depth Foundation Models
| Model | Total Param. | Learnable Param. | Inference Time (s) | GPU Memory (MiB) |
|---|---|---|---|---|
| BPNet | 89.874M | 89.874M | 0.072 | 4792 |
| LRRU | 20.843M | 20.843M | 0.038 | 3650 |
| CompletionFormer | 83.574M | 83.574M | 0.060 | 4206 |
| Ours_MiDaS | 21.279M | 4.685M | 0.056 | 5980 |
| Ours_DepthAnything | 24.731M | 2.981M | 0.035 | 4729 |
| Ours_UniDepth | 238.607M | 5.852M | 0.116 | 5829 |
References
[1] Scaling Laws for Neural Language Models (arXiv 2020)
[2] Scaling Vision Transformers (arXiv 2020)
[3] LRRU: Long-short Range Recurrent Updating Networks for Depth Completion (ICCV 2023)
[4] In Defense of Classical Image Processing: Fast Depth Completion on the CPU (CRV 2018)
[5] Improving Depth Completion via Depth Feature Upsampling (CVPR 2024)
[6] OGNI-DC: Robust Depth Completion with Optimization-Guided Neural Iterations (ECCV 2024)
[7] CostDCNet: Cost Volume Based Depth Completion for a Single RGB-D Image (ECCV 2022)
[8] Learning Affinity with Hyperbolic Representation for Spatial Propagation (ICML 2023)
[9] Tree Filtering: Efficient Structure-Preserving Smoothing with a Minimum Spanning Tree (TIP 2013)
[10] Fully Connected Guided Image Filtering (CVPR 2015)
[11] A Non-Local Cost Aggregation Method for Stereo Matching (CVPR 2014)
[12] Stereo Matching Using Tree Filtering (TPAMI 2014)
[13] Real-Time Salient Object Detection with a Minimum Spanning Tree (CVPR 2016)
[14] Representation Tradeoffs for Hyperbolic Embeddings (ICML 2018)
[15] Poincaré Embeddings for Learning Hierarchical Representations (NeurIPS 2017)
[16] Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer (TPAMI 2022)
[17] Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data (CVPR 2024)
[18] UniDepth: Universal Monocular Metric Depth Estimation (CVPR 2024)
We appreciate your insights and the opportunity to address your concerns.
Regarding the performance on the online KITTI benchmark, we acknowledge that the error is higher compared to the current state-of-the-art methods. However, the primary objective of our work is to explore the potential of large-model prompts in enhancing the adaptation ability to arbitrary sensors even with limited labeled data, rather than solely outperforming previous works focused only on the KITTI dataset. Additionally, due to the limited time during the rebuttal period, we could not fully tune the performance. We now invite you to review the benchmark performance of our Ours_Base model, as presented in Table A and Table C. This model, listed as "UniDC Base" on the official KITTI benchmark homepage, achieves an RMSE of 736.0mm and an MAE of 202.4mm. As shown in the table below, our method outperforms the relevant works [1,2,3], which were recently published and have almost the same purpose as ours. While these results may not match the very latest SoTA, we hope you will consider the broader motivation behind our work.
| KITTI | Validation Set | Online Benchmark |
|---|---|---|
| Model | RMSE/MAE | RMSE/MAE |
| [1] | 818.1 / 205.7 | - / - |
| [2] | 835.7 / 218.5 | - / - |
| [3] | - / - | 754.5 / 206.1 |
| Ours_Base | 758.9 / 203.8 | 736.0 / 202.4 |
Results for models [1] and [2] are sourced from their respective papers, while [3] represents the KITTI online leaderboard benchmark.
[1] Flexible Depth Completion for Sparse and Varying Point Densities (CVPR 2024)
[2] Masked Spatial Propagation Network for Sparsity-Adaptive Depth Refinement (CVPR 2024)
[3] Depth Prompting for Sensor-Agnostic Depth Estimation (CVPR 2024)
We want to emphasize that, over the past decade, numerous depth completion papers have focused on in-domain experiments on NYU and KITTI. However, there has been a recent trend towards addressing out-of-domain challenges in depth completion research [1-9]. This new direction aims to develop models that can handle variations in new sensor configurations [1,2,3,4,5], unseen environmental conditions [6], and training schemes without dense GT [7,8,9], rather than being restricted to specific sensors or environments. This trend is gaining traction in top-tier conferences (e.g., CVPR [1,2,3,7]) and journals (e.g., TPAMI [5]), highlighting the importance of adaptability and generalization in depth completion models. Our research aligns with this direction and shares similar goals. We kindly note that most of those works do not consider the KITTI benchmark, which is an in-domain experiment with a 64-line LiDAR sensor. While we agree that top-tier papers should demonstrate a certain level of performance, we also believe that research focusing on generalization and adaptability for arbitrary sensors and environments is valuable and deserves recognition.
In response to the lack of references and discussions on recent SOTA methodologies, during the rebuttal we have tried to compare and discuss as many advanced models as possible. Unfortunately, only two of the five suggested papers provide publicly available code, namely LRRU (ICCV 2023) and DFU (CVPR 2024), which we use in our experiments. Another reviewer (bXoG) also mentioned comparisons with recent models, and we have conducted additional experiments with the VPP4DC (3DV 2024) and OGNI-DC (ECCV 2024) papers (see Table.C & Table.I). We will revise the manuscript to include a comprehensive review of the latest high-performance works.
We hope that these clarifications with the updated performance on the online KITTI benchmark address your concerns.
[4] Sparsity Agnostic Depth Completion (WACV 2023)
[5] G2-MonoDepth: A General Framework of Generalized Depth Inference from Monocular RGB+X Data (TPAMI 2024)
[6] All-day Depth Completion (arXiv 2024)
[7] Test-Time Adaptation for Depth Completion (CVPR 2024)
[8] Monitored Distillation for Positive Congruent Depth Completion (ECCV 2022)
[9] Unsupervised Depth Completion with Calibrated Backprojection Layers (ICCV 2021)
Dear Reviewer tAUP,
Please share your thoughts about the authors' response to your message; the points raised are relevant.
All, please provide your feedback about the rebuttal and other posts.
Thank you
We thank all reviewers for their helpful comments, which help us polish our paper. If reviewers would like to see additional results and analysis, please let us know; we are always happy to discuss.
For additional experiments requested by multiple reviewers, we report them here.
- [Table.A] Comparison of Recent SoTA Methods: Reviewers tAUP, g2HF, and bXoG
- [Table.B] Additional RGB-D Dataset with Various Sensors: Reviewers g2HF and AS7d
- [Table.C] Full Dataset Training Benchmark: Reviewers tAUP, g2HF, and bXoG
For all experimental results for this rebuttal, we have put them all together in the uploaded PDF file.
[Table.A] Experiment on Advanced Methods (RMSE/MAE/DELTA1.25)
| [NYU] | 1-shot | 10-shot | 100-shot |
|---|---|---|---|
| LRRU_Mini | 0.704 / 0.505 / 0.738 | 0.989 / 0.677 / 0.642 | 0.551 / 0.392 / 0.797 |
| LRRU_Tiny | 0.842 / 0.633 / 0.574 | 0.771 / 0.549 / 0.707 | 0.565 / 0.373 / 0.836 |
| LRRU_Small | 0.589 / 0.388 / 0.826 | 0.404 / 0.246 / 0.919 | 0.442 / 0.306 / 0.887 |
| LRRU_Base | 0.447 / 0.278 / 0.899 | 0.424 / 0.252 / 0.922 | 0.316 / 0.189 / 0.949 |
| DFU | 0.754 / 0.589 / 0.637 | 0.590 / 0.464 / 0.719 | 0.467 / 0.344 / 0.868 |
| OGNI-DC | 0.365 / 0.200 / 0.921 | 0.312 / 0.160 / 0.957 | 0.207 / 0.095 / 0.974 |
| Ours_Tiny | 0.243 / 0.131 / 0.969 | 0.186 / 0.088 / 0.982 | 0.151 / 0.068 / 0.988 |
| Ours | 0.210 / 0.108 / 0.975 | 0.166 / 0.079 / 0.985 | 0.147 / 0.067 / 0.988 |
| Ours_Small | 0.255 / 0.136 / 0.968 | 0.181 / 0.089 / 0.983 | 0.149 / 0.067 / 0.988 |
| Ours_Base | 0.247 / 0.138 / 0.969 | 0.190 / 0.093 / 0.983 | 0.148 / 0.066 / 0.988 |
| [KITTI] | 1-shot | 10-shot | 100-shot |
|---|---|---|---|
| LRRU_Mini | 6.719 / 3.068 / 0.792 | 5.608 / 2.787 / 0.811 | 3.576 / 2.020 / 0.841 |
| LRRU_Tiny | 7.961 / 4.049 / 0.698 | 6.253 / 3.022 / 0.788 | 4.201 / 2.394 / 0.796 |
| LRRU_Small | 16.162 / 8.008 / 0.516 | 5.930 / 2.905 / 0.800 | 5.934 / 3.602 / 0.612 |
| LRRU_Base | 14.889 / 7.454 / 0.526 | 13.078 / 6.904 / 0.587 | 9.736 / 6.090 / 0.420 |
| DFU | 3.652 / 2.020 / 0.853 | 1.889 / 0.966 / 0.973 | 1.808 / 0.897 / 0.986 |
| OGNI-DC | 2.618 / 0.816 / 0.962 | 1.516 / 0.421 / 0.985 | 1.514 / 0.430 / 0.984 |
| Ours_Tiny | 2.002 / 0.725 / 0.951 | 1.457 / 0.448 / 0.987 | 1.251 / 0.353 / 0.991 |
| Ours | 1.684 / 0.522 / 0.983 | 1.385 / 0.407 / 0.990 | 1.224 / 0.339 / 0.993 |
| Ours_Small | 1.865 / 0.590 / 0.975 | 1.465 / 0.436 / 0.988 | 1.283 / 0.388 / 0.991 |
| Ours_Base | 1.716 / 0.607 / 0.979 | 1.423 / 0.428 / 0.988 | 1.246 / 0.345 / 0.992 |
[Table.B] Experiment on SUN RGB-D (RMSE/MAE/DELTA1.25)
| Model | 1-shot | 10-shot | 100-shot |
|---|---|---|---|
| BPNet | - | 0.497 / 0.244 / 0.870 | 0.342 / 0.164 / 0.900 |
| DP | 0.706 / 0.534 / 0.537 | 0.683 / 0.512 / 0.558 | 0.700 / 0.527 / 0.545 |
| LRRU | 0.912 / 0.743 / 0.347 | 0.507 / 0.300 / 0.785 | 0.476 / 0.313 / 0.779 |
| DFU | - | 0.890 / 0.696 / 0.314 | 0.552 / 0.444 / 0.438 |
| OGNI-DC | - | 0.466 / 0.270 / 0.817 | 0.382 / 0.188 / 0.881 |
| Ours | 0.529 / 0.285 / 0.830 | 0.418 / 0.188 / 0.895 | 0.345 / 0.166 / 0.901 |
[Table.C] Full Dataset Training Benchmark (NYU & KITTI)
| # of Params. (M) | Models | NYU (RMSE) | NYU (MAE) | KITTI (RMSE) | KITTI (MAE) |
|---|---|---|---|---|---|
| 41.5M | Cformer_Tiny | 0.091 | 0.035 | - | - |
| 82.6M | Cformer_Small | 0.090 | 0.035 | 0.739 | 0.196 |
| 142.4M | Cformer_Base | 0.090 | 0.035 | 0.709 | 0.203 |
| 0.3M | LRRU_Mini | 0.101 | - | 0.800 | 0.219 |
| 1.3M | LRRU_Tiny | 0.096 | - | 0.762 | 0.208 |
| 5.2M | LRRU_Small | 0.093 | - | 0.741 | 0.202 |
| 21M | LRRU_Base | 0.091 | - | 0.728 | 0.198 |
| 1.2M | Ours_Tiny | 0.107 | 0.042 | 0.907 | 0.231 |
| 4.6M | Ours | 0.098 | 0.038 | 0.867 | 0.224 |
| 36.9M | Ours_Small | 0.095 | 0.038 | 0.824 | 0.209 |
| 63.2M | Ours_Base | 0.093 | 0.036 | - | - |
Experiments that fail to optimize or diverge are marked with "-".
The paper deals with the relevant problem of improving the generalization ability of depth completion across domains and sensors with a few labelled data by exploiting the knowledge provided by a foundation model for monocular depth estimation. The features obtained by the foundation model are embedded into hyperbolic space for depth propagation and refinement. As witnessed by the recent literature, this topic has gained much attention, and the paper proposes an innovative solution that achieves excellent results for this task. Nonetheless, reviewers raised concerns about methods not considered in the initial submission and additional experimental results. Moreover, the reviewers requested clarifications about the motivation and effectiveness of using hyperbolic space. In the rebuttal, the authors clarified most concerns. They provided a significant amount of additional experimental results as requested by reviewers, persuading most of them during the discussion to lean towards acceptance, with reviewers bXoG and g2HF championing the paper and reviewer AS7d setting a final borderline acceptance score. In contrast, reviewer tAUP considers the performance of the proposed architecture trained in the KITTI domain unsatisfactory compared to the state-of-the-art for in-domain depth completion and proposes rejecting the paper. The AC acknowledges that the proposal is not competitive with most recent works tackling in-domain depth completion, but this task is different from the purpose of this work. On the other hand, the proposed method has excellent performance, outperforming state-of-the-art methods that deal with a few labelled data for training, raising the bar in the field. Nonetheless, the paper needs a significant revision to integrate the additional experimental results and clarifications requested by the reviewers.