PaperHub
Overall rating: 5.8 / 10 (Poster; 4 reviewers; min 5, max 6, std 0.4)
Individual ratings: 6, 6, 5, 6
Confidence: 4.5
Correctness: 3.0 · Contribution: 3.0 · Presentation: 2.8
NeurIPS 2024

MonoMAE: Enhancing Monocular 3D Detection through Depth-Aware Masked Autoencoders

OpenReview · PDF
Submitted: 2024-05-13 · Updated: 2024-11-06
TL;DR

This paper presents MonoMAE, a monocular 3D detector inspired by Masked Autoencoders that addresses the object occlusion issue in monocular 3D object detection by masking and reconstructing objects in the feature space.

Abstract

Keywords
Monocular 3D Object Detection · Masked Autoencoders

Reviews and Discussion

Review (Rating: 6)

This paper proposes a monocular 3D detection framework inspired by Masked Autoencoders (MAE), designed to address the challenge of object occlusions in 3D object detection. It utilizes a unique depth-aware masking module that simulates occlusions by adaptively masking non-occluded object features based on depth information, coupled with a lightweight completion network that reconstructs these masked features to learn occlusion-tolerant representations. It generates training pairs of non-occluded and occluded object representations directly, enhancing its capability to handle occlusions effectively. The framework is optimized for low computational overhead during inference, as it does not require object masking at this stage.

Strengths

  1. The proposed method outperforms conventional methods across various datasets such as KITTI and nuScenes, demonstrating its effectiveness. Moreover, the proposed method achieves real-time inference.
  2. An extensive ablation study demonstrates the impact of each proposed module.
  3. The idea is simple yet effective.

Weaknesses

  1. The performance improvement is marginal, especially in the cross-dataset validation in Table 6.
  2. Missing evaluation on the Waymo dataset.

Questions

  1. Many recent 3D object detection studies have utilized the Waymo dataset for evaluation. Could you explain why your experiments were limited to KITTI and nuScenes?
  2. There appears to be a performance drop in the nuScenes dataset at distances beyond 40 meters. Could you provide insights into what causes this decline?
  3. There is a slight difference in inference time between 'Ours*' (36ms) and 'Ours' (38ms), with significant performance differences noted in Table 2. Could you elaborate on the role of the completion network (CN) given these differences?
  4. The mask ratio r varies with scale parameters and maximum depth. How sensitive is your method to changes in the mask ratio?

Limitations

See the Weaknesses and Questions sections.

Author Response

We thank the reviewer for the valuable comments and insightful suggestions, and we hope our detailed responses below can address your concerns.

Weakness 1: Performance Improvement

We would clarify that the proposed MonoMAE achieves clear performance improvements over the state of the art. As shown in Table 1 of the submitted manuscript, MonoMAE outperforms the SOTA method MonoCD [1] by clear margins, especially for objects of the Moderate and Hard categories. The scale of the improvement is also competitive compared with several recently published works listed in Table 1.

The experiments in Table 6 aim to validate the generalization ability of the proposed MonoMAE. We can observe that MonoMAE achieves the best or second-best performance consistently across all metrics. Besides, it is hard to obtain significant gains on the MAE metric used in Table 6. For example, MonoUNI [2] obtains only slight performance gains, and even performance drops, on the nuScenes dataset.

[1] Yan, Longfei, et al. MonoCD: Monocular 3D Object Detection with Complementary Depths. CVPR, 2024.

[2] Jinrang, Jia, et al. MonoUNI: A unified vehicle and infrastructure-side monocular 3d object detection network with sufficient depth clues. NeurIPS, 2023.

Weakness 2 & Question 1: Experiments on Waymo

Thanks for your suggestion. We extend the experiments by examining generalization on a new task, KITTI3D->Waymo, as shown in the table below, using $AP(IoU=0.5)$ as the metric. We can observe that our method generalizes to the Waymo dataset and even outperforms some methods trained on Waymo.

| Method  | PatchNet* [1] | M3D-RPN* [2] | Ours |
| ------- | ------------- | ------------ | ---- |
| Level_1 | 2.92          | 3.79         | 4.53 |
| Level_2 | 2.42          | 3.61         | 4.17 |

* denotes this method is trained on Waymo.

[1] Ma, Xinzhu, et al. Rethinking pseudo-lidar representation. ECCV, 2020.

[2] Brazil, Garrick, et al. M3d-rpn: Monocular 3d region proposal network for object detection. ICCV, 2019.

Question 2: Performance Drop at Far Distances on the nuScenes dataset

The performance drop beyond 40 meters could be due to:

  1. Limited visual information (in image resolution, etc.) is captured for objects beyond 40 meters;
  2. Adverse lighting and weather conditions could have a larger impact on the imaging quality of distant objects;
  3. The annotation accuracy could degrade for distant objects due to small object sizes, leading to degraded network training.

As shown in Table 6 of the submitted manuscript, existing SOTA methods all suffer a performance drop on the nuScenes dataset at distances beyond 40 meters. The proposed MonoMAE achieves the second-best performance at far distances.

Question 3: The Role of the Completion Network

The Completion Network learns to complete occluded object queries, improving detection performance on occluded objects. It is a lightweight network that introduces only 2 ms of additional inference time yet achieves effective completion in the feature space. During training, it learns to reconstruct and complete the masked object queries (which simulate real object occlusions). During inference, it reconstructs occluded queries into completed ones, which clearly improves monocular 3D detection performance, as shown in Table 2.
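
For illustration only, a minimal sketch of what such a lightweight completion module over one-dimensional query vectors might look like; the architecture, dimensions, and names are assumptions for exposition rather than the paper's actual design:

```python
import torch.nn as nn

class CompletionNetwork(nn.Module):
    """Hypothetical lightweight MLP that maps a masked (or occluded)
    query vector back to a completed one. The real architecture and
    dimensions are not specified on this page and are assumed here."""
    def __init__(self, dim: int = 256, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, dim),
        )

    def forward(self, queries):  # (N, dim) -> (N, dim)
        return self.net(queries)
```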

We show the impact of the Completion Network by plotting the losses with and without it in Figure 7 of the submitted Appendix. The graph shows that the training loss with the Completion Network (the orange line) drops, while the training loss without the Completion Network (the blue line) remains at a high level, showing that the Completion Network helps acquire occlusion-tolerant representations by learning to reconstruct the masked queries.

Question 4: Sensitivity of Our Method to the Mask Ratio

As defined in Equation (3) of the submission, the mask ratio $r$ of each query is determined by $r = 1.0 - \frac{d_i}{D_{max}}$, where $d_i$ is the predicted depth of the $i$-th query and $D_{max}$ is the maximum depth in the dataset, which can be manually adjusted to affect the mask ratio $r$.
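
For concreteness, a minimal sketch of this depth-aware masking, assuming query features of shape (N, C) and masking realized by zeroing a random subset of feature channels; the exact masking mechanism and the function name `depth_aware_mask` are assumptions:

```python
import torch

def depth_aware_mask(queries: torch.Tensor, depths: torch.Tensor,
                     d_max: float) -> torch.Tensor:
    """Mask query features with ratio r = 1 - d_i / D_max (cf. Eq. (3)).
    queries: (N, C) non-occluded query features
    depths:  (N,)  predicted depth of each query
    Closer objects receive higher mask ratios; queries whose depth
    exceeds d_max get a ratio of 0 and are left unmasked."""
    n, c = queries.shape
    ratios = (1.0 - depths / d_max).clamp(min=0.0, max=1.0)
    masked = queries.clone()
    for i in range(n):
        k = int(ratios[i].item() * c)      # number of channels to drop
        idx = torch.randperm(c)[:k]        # random channel positions
        masked[i, idx] = 0.0               # zero-out acts as the "mask"
    return masked
```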

The table below shows the performance of our method when changing the maximum depth $D_{max}$. In each cell of the table, the performance is listed as $AP_{3D}(IoU=0.7)$ / $AP_{BEV}(IoU=0.7)$.

| Maximum Depth  | Easy          | Moderate      | Hard          |
| -------------- | ------------- | ------------- | ------------- |
| $0.5 D_{max}$  | 28.17 / 38.21 | 19.83 / 26.62 | 17.06 / 21.24 |
| $0.75 D_{max}$ | 29.63 / 39.78 | 20.31 / 26.85 | 17.42 / 22.70 |
| $D_{max}$      | 30.29 / 40.26 | 20.90 / 27.08 | 17.61 / 23.14 |
| $1.5 D_{max}$  | 29.04 / 39.23 | 19.92 / 26.43 | 17.15 / 22.39 |
| $2.0 D_{max}$  | 27.70 / 37.18 | 19.45 / 26.12 | 16.84 / 20.52 |

We can observe that the best performance is achieved when the maximum depth is set to the original $D_{max}$. When the maximum depth is smaller than $D_{max}$, the performance drops. This is because objects whose depth exceeds the reduced maximum depth are not masked at all, leading to masking failures and hindering the training of the completion network.

When we set the maximum depth larger than $D_{max}$, the performance also drops. This is because, according to Equation (3), a larger maximum depth raises the mask ratios of all objects. A high mask ratio poses challenges for the Completion Network by reducing the available information, since distant objects normally have few pixels, hindering the reconstruction of occluded object regions.

Nevertheless, we can observe that setting the maximum depth between $0.75 D_{max}$ and $2.0 D_{max}$ achieves similar 3D detection performance, indicating the robustness of our method with respect to this parameter.

Comment

Thank you for providing the additional experiments and detailed explanations in your rebuttal. The new information is convincing and addresses my initial concerns effectively. Based on this, I have increased my initial rating of your submission.

Comment

Thanks for your positive feedback! We are glad that we have addressed your initial concerns.

Review (Rating: 6)

This paper introduces MonoMAE, a novel monocular 3D object detection framework designed to improve detection performance in the presence of object occlusions. MonoMAE leverages the concept of Masked Autoencoders, treating object occlusions as natural masking and training the network to complete occluded regions. This innovative approach addresses the pervasive issue of object occlusions in monocular 3D detection, leading to superior detection performance. Extensive experiments on datasets like KITTI 3D and nuScenes show that MonoMAE outperforms state-of-the-art methods in both qualitative and quantitative measures.

Strengths

  1. The introduction of depth-aware masking to simulate occlusions and the use of a lightweight query completion network are innovative and address a significant challenge in monocular 3D detection.
  2. MonoMAE improves detection performance without the need for additional training data or annotations, making it a practical solution for real-world applications like autonomous driving and robotics.
  3. The framework demonstrates superior performance on benchmark datasets (KITTI 3D and nuScenes), outperforming existing state-of-the-art methods in both occluded and non-occluded scenarios.
  4. MonoMAE shows strong generalization capabilities to new domains, which is critical for deploying models in diverse environments.

Weaknesses

  1. In many datasets and methods, objects are not merely labeled as "occluded" or "non-occluded." Instead, they may be assigned occlusion levels or degrees that quantify the extent to which an object is occluded. These levels provide more granularity and can influence how models are trained and evaluated. It would be beneficial to specify how occlusion levels are defined and used. Clarifying whether discrete or continuous levels are employed and how these influence the labeling, training, and evaluation processes will provide a clearer understanding of the methodology and its robustness in handling occlusions.
  2. The paper does not provide explicit details about the accuracy of the occlusion classification network or how this accuracy influences the overall 3D object detection network. This information appears to be missing.
  3. The paper does not explicitly report the performance or accuracy of the query completion network. Including a report on the performance of this network, such as quantitative results or visualization of the reconstructed queries, would be valuable. It would demonstrate whether the query completion network is learning meaningful features and contributing effectively to the overall 3D object detection performance.

Questions

  1. What is the accuracy of the occlusion classification network? How does the accuracy influence the whole 3D object detection network?
  2. What is the accuracy of the query completion network?

Limitations

The authors discussed some failure cases in their paper, as well as the gap between the generated occlusion pattern and natural occlusion patterns.

Author Response

We thank the reviewer for the valuable comments and insightful suggestions, and we hope our detailed responses below can address your concerns.

Weakness 1: Occlusion Levels: Definition and Usage

We appreciate the reviewer's insightful comments regarding the use of occlusion levels. Our method uses only binary "occluded" and "non-occluded" labels. During training (Figure 2 of the submitted manuscript), the binary occlusion labels are used to train the Non-Occluded Query Grouping module, which classifies the non-occluded queries that are then masked and completed. During inference (Figure 5 of the submitted manuscript), the Non-Occluded Query Grouping module is used to classify the queries.

To this end, we only need binary occlusion labels indicating whether objects are occluded or not, without using fine-grained occlusion labels, which could complicate the task. In this paper, we transform the ground-truth occlusion degrees $o \in \{0, 1, 2, 3\}$ provided in the KITTI 3D dataset into binary ground truths. The transformation treats $o = 0$ as non-occlusion and $o \in \{1, 2, 3\}$ as occlusion, leading to binary ground-truth occlusion conditions $o^{gt} \in \{0, 1\}$, with 0 indicating non-occlusion and 1 indicating occlusion.
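
The transformation is simple enough to state directly; a one-line sketch:

```python
def to_binary_occlusion(o: int) -> int:
    """Map a KITTI occlusion degree o in {0, 1, 2, 3} to a binary label:
    0 -> 0 (non-occluded); 1, 2, 3 -> 1 (occluded)."""
    return 0 if o == 0 else 1
```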

Weakness 2 & Question 1: Accuracy and Impact of the Occlusion Classification Network

Thank you for your suggestion. The accuracy of the occlusion classification network is provided in Section F.3 of the submitted Appendix: it is 96.46%, indicating that most object queries are correctly classified as occluded or non-occluded.

Moreover, to validate the influence of the classification accuracy on the overall 3D object detection network, we add experiments using a trained network with fixed weights. During each inference run, the occlusion classification accuracy is manually adjusted to 50% and 70% with the help of ground-truth occlusion labels. The experimental results are shown in the table below, together with the original accuracy (96.46%) of the occlusion classification network. The metrics $AP_{3D}(IoU=0.7)$ and $AP_{BEV}(IoU=0.7)$ with the Easy, Moderate, and Hard categories are used. In each cell of the table, the performance is listed as $AP_{3D}(IoU=0.7)$ / $AP_{BEV}(IoU=0.7)$.

| Occlusion Classification Accuracy | Easy          | Moderate      | Hard          |
| --------------------------------- | ------------- | ------------- | ------------- |
| 50%                               | 27.12 / 36.41 | 18.05 / 24.67 | 15.21 / 19.74 |
| 70%                               | 28.47 / 37.82 | 19.51 / 26.90 | 16.02 / 21.85 |
| 96.46%                            | 30.29 / 40.26 | 20.90 / 27.08 | 17.61 / 23.14 |

From this table, we can observe that higher occlusion classification accuracy contributes to better 3D detection performance, validating the importance of classification accuracy for the overall 3D object detection network.

We will add these results and analyses in the revised version.
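
The exact adjustment procedure is not described here; one plausible way to emulate a classifier with a given accuracy, sketched under that assumption, is to agree with the ground-truth labels on the target fraction of queries and flip the rest:

```python
import torch

def degrade_to_accuracy(gt_labels: torch.Tensor, target_acc: float) -> torch.Tensor:
    """Hypothetical helper: produce binary occlusion 'predictions' that
    match the ground truth on a target_acc fraction of queries and are
    flipped on the remainder."""
    n = gt_labels.numel()
    n_wrong = int(round((1.0 - target_acc) * n))
    preds = gt_labels.clone()
    flip = torch.randperm(n)[:n_wrong]   # queries to misclassify
    preds[flip] = 1 - preds[flip]        # labels are in {0, 1}
    return preds
```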

Weakness 3 & Question 2: Performance of the Query Completion Network

Thank you for your suggestion! The performance of the Completion Network is measured through the similarity between the non-occluded queries before masking and the queries completed by the Completion Network. This similarity is measured by the Smooth L1 loss, as defined in Equation (6) of the submitted manuscript.
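
In PyTorch terms, this amounts to a standard Smooth L1 objective over query vectors; a minimal sketch (tensor shapes are assumed):

```python
import torch.nn.functional as F

def completion_loss(completed_queries, original_queries):
    """Smooth L1 loss between the queries completed by the Completion
    Network and the non-occluded queries before masking (cf. Eq. (6));
    both tensors are assumed to have shape (N, dim)."""
    return F.smooth_l1_loss(completed_queries, original_queries)
```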

With the help of the Smooth L1 loss, we visualize the training losses with and without the Completion Network in Figure 7 of the submitted Appendix. As Figure 7 shows, the training loss with the Completion Network (the orange line) drops, while the training loss without it (the blue line) remains at a high level, demonstrating the effectiveness of the Completion Network in acquiring occlusion-tolerant representations by learning to reconstruct the masked queries.

The visualization of masked and reconstructed queries is not provided, since the queries are one-dimensional vectors and cannot be visualized in a meaningful way.

Moreover, Table 2 in the submitted manuscript further validates the effectiveness of the Completion Network quantitatively. Comparing Rows 2 and 6, and Rows 4 and 7, using the Completion Network effectively improves the 3D detection performance, showing that the masked queries are properly reconstructed.

Comment

Thanks for your reply. Most of my concerns have been addressed. And I will keep my rating.

Comment

Thank you for your positive feedback! We are glad that we have addressed your concerns.

Review (Rating: 5)

This paper applies Masked Autoencoder to 3D object detection. It distinguishes object queries into occluded and non-occluded categories, and during training, it applies depth-aware masking to the non-occluded queries and learns by completing them. At test time, the completion is applied to the occluded queries.

Strengths

  • It achieved state-of-the-art performance on the KITTI 3D dataset.
  • The idea of interpreting occluded queries as masked queries to solve the problem is interesting.
  • The training and test times are illustrated clearly in figures.

Weaknesses

  • As stated in the limitations section, occlusion at the image level and masking at the feature level of object queries are not the same. Further analysis is needed to understand the actual implications of masking in object queries.
  • If masking serves the role of occlusion at the image level, there should be no reason for the mask ratio to vary with depth, yet depth-aware masking is highly beneficial. An analysis is needed to understand why depth-aware masking works well compared to random masking.
  • In my opinion, the performance of the Non-Occluded Query Grouping classification is crucial for the framework to function properly. Although classification accuracy is provided in the supplementary material, it would be helpful to include various metrics such as precision, recall, and F1-score. If the results of the Non-Occluded Query Grouping classification are biased, it might be interesting to apply completion not only to the occluded queries but also to the non-occluded queries at test time.

Questions

Please refer to the weaknesses.

Limitations

Limitations are included in the main text.

Author Response

We thank the reviewer for the valuable comments and insightful suggestions, and we hope our detailed responses below can address your concerns.

Weakness 1: Feature-Level Masking vs. Image-Level Masking

We would clarify that the strategy of masking and completion aims to generate pairs of non-occluded and occluded (via masking) data for learning occlusion-tolerant object representations. The strategy could be implemented in either image space or feature space. We chose to mask and reconstruct query features, as masking and reconstructing images is complicated and computationally intensive due to noisy image data and super-rich occlusion patterns. As shown in Table 3 of the submitted manuscript, masking and completing query features performs clearly better than masking and completing images. Nevertheless, we understand that the simulated occlusion features differ from those of natural occlusions (as briefly discussed in the Limitations section), and we noted that this issue could be mitigated by introducing generative networks to learn the distribution of natural occlusions.

Weakness 2: Depth-Aware Masking vs. Random Masking

We would clarify that the proposed depth-aware masking is implemented in the feature space. We examined how different masking strategies affect monocular 3D detection by testing three strategies: 1) random masking in the image space; 2) random masking in the feature space; and 3) depth-aware masking in the feature space, as shown in Rows 1-3 of Table 3 in the submitted manuscript. We can observe that depth-aware masking clearly performs best. As discussed in Section A.1 of the submitted Appendix, depth-aware masking modulates the mask ratio by lowering it for distant objects. This effectively retains more visual information for distant objects, which are usually small and have limited pixels and visual information. In addition, it facilitates the ensuing completion task, as completing heavily masked small objects is challenging and liable to failure.

Weakness 3: Non-Occluded Query Grouping

Thanks for your valuable suggestion! Below is the detailed performance of the Non-Occluded Query Grouping classification module with your suggested metrics.

| Metric      | Precision | Recall  | F1-Score | Accuracy |
| ----------- | --------- | ------- | -------- | -------- |
| Performance | 98.35 %   | 94.47 % | 96.38 %  | 96.46 %  |

To evaluate the performance when applying completion to all queries, both occluded and non-occluded, at test time, we conduct experiments and show the results in the table below. The metrics $AP_{3D}(IoU=0.7)$ and $AP_{BEV}(IoU=0.7)$ with the Easy, Moderate, and Hard categories are used. In each cell of the table, the performance is listed as $AP_{3D}(IoU=0.7)$ / $AP_{BEV}(IoU=0.7)$.

| Setting                | Easy          | Moderate      | Hard          |
| ---------------------- | ------------- | ------------- | ------------- |
| Completing All Queries | 28.46 / 37.17 | 18.40 / 25.95 | 15.23 / 21.82 |
| Original               | 30.29 / 40.26 | 20.90 / 27.08 | 17.61 / 23.14 |

The results indicate a performance drop when applying completion to all queries, compared to completing the occluded queries classified by the occlusion classification network. This is because the Completion Network is designed to reconstruct the occluded queries to learn occlusion-tolerant visual representations. Consequently, applying completion to non-occluded queries introduces confusion, degrading overall 3D detection performance.
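
For reference, the inference-time routing described above could be sketched as follows, where only queries classified as occluded pass through the Completion Network (all names and shapes are hypothetical):

```python
import torch

def route_queries(queries, occ_logits, completion_net):
    """Complete only the queries classified as occluded; non-occluded
    queries are left untouched. queries: (N, dim); occ_logits: (N, 2)
    with class 1 meaning 'occluded'."""
    occluded = occ_logits.argmax(dim=-1) == 1   # boolean mask over queries
    out = queries.clone()
    out[occluded] = completion_net(queries[occluded])
    return out
```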

Comment

Thank you for answering the questions. Most of the explanations are based on good final performance, and although there is still a lack of theoretical explanation of why query masking should be better than natural image masking, I understand that this is difficult to explain perfectly in theory. Focusing on the opinions of the other reviewers and the performance improvement of the proposed method, I will raise my initial rating. However, if additional theoretical justification were provided, the paper could become even better.

Comment

Thank you for your positive feedback and valuable suggestions! We agree that providing a theoretical explanation would be challenging but can strengthen the paper. We will further study relevant literature (e.g., pros-and-cons of image augmentation vs visual feature augmentation in prior studies) and discuss this issue in the revised manuscript.

Review (Rating: 6)

This paper introduces a novel framework for improving monocular 3D object detection, particularly in handling object occlusions. The proposed MonoMAE leverages depth-aware masking to simulate occlusions in the feature space and employs a lightweight completion network to reconstruct occluded object regions, thereby learning occlusion-tolerant representations. Experiments show that this learning strategy helps to improve the performance of monocular 3D object detection.

Strengths

  1. This paper is well-structured, with a clear problem statement, methodology, experiments, and ablation studies that substantiate the contributions and effectiveness of MonoMAE.
  2. This paper addresses a significant challenge in monocular 3D object detection, object occlusion, with a novel approach using depth-aware masked autoencoders.

Weaknesses

  1. The reliance on depth-aware masking to simulate occlusions may not perfectly replicate natural occlusion patterns, potentially affecting the model's reconstruction accuracy. The gap between synthetically masked and naturally occluded object queries could limit the model's robustness in real-world scenarios.
  2. While this paper claims generalizability, the lack of extensive cross-dataset validation leaves the true scope of its generalization capability somewhat unproven.

Questions

  1. All the experimental results presented in this paper are about vehicle detection. Does MonoMAE also work for more difficult cases like pedestrian and cyclist detection?
  2. The paper suggests investigating generative approaches for simulating natural occlusion patterns. Can you elaborate on what this might entail and how it could further improve monocular 3D detection?

Limitations

  1. The paper could provide a more detailed analysis of the computational efficiency, including speed and resource usage, to fully assess the practicality of MonoMAE for real-time applications.
  2. This paper only presents results for vehicle detection. The performance on small-size objects is unknown.

Author Response

We thank the reviewer for the valuable comments and insightful suggestions, and we hope our detailed responses below can address your concerns.

Weakness 1: Depth-Aware Masking for Occlusion Simulation

We acknowledge that the proposed depth-aware masking may not perfectly replicate natural occlusion patterns, which could potentially affect the model's reconstruction accuracy.

To fill the gap between synthetic and natural occlusions, we have considered alternative methods, such as directly overlaying images of objects (e.g., using a car image to occlude another car). However, this approach poses significant challenges, particularly in obtaining accurate training labels (3D boxes, orientations) for the overlaying objects, making it impractical for our current implementation.

As shown in Section D.1 of the submitted Appendix, one possible solution is introducing Generative Adversarial Networks (GANs) that learn distributions from extensive real-world data for generating occlusion patterns that are very similar to natural occlusions. Specifically, a Generator could be employed to generate occluded queries based on non-occluded queries, while a Discriminator could be used to discriminate between these generated occluded queries and natural occluded queries. Through adversarial training, the Generator could produce occlusions that are very similar to natural occlusions.

Weakness 2: Lacking Extensive Cross-Dataset Validation

We would clarify that most existing monocular 3D detection methods [1, 2, 3] validate their generalization ability on the task KITTI3D->nuScenes only. We followed these prior studies to facilitate benchmarking. As suggested, we extend the experiments by examining generalization on a new task, KITTI3D->Waymo, as shown in the table below, using $AP(IoU=0.5)$ as the metric. We can observe that our method generalizes to the Waymo dataset and even outperforms some methods trained on Waymo.

| Method  | PatchNet* [4] | M3D-RPN* [5] | Ours |
| ------- | ------------- | ------------ | ---- |
| Level_1 | 2.92          | 3.79         | 4.53 |
| Level_2 | 2.42          | 3.61         | 4.17 |

* denotes this method is trained on Waymo.

[1] Shi, Xuepeng, et al. Geometry-based distance decomposition for monocular 3d object detection. ICCV, 2021.

[2] Kumar, Abhinav, et al. Deviant: Depth equivariant network for monocular 3d object detection. ECCV, 2022.

[3] Jinrang, Jia, et al. MonoUNI: A unified vehicle and infrastructure-side monocular 3d object detection network with sufficient depth clues. NeurIPS, 2023.

[4] Ma, Xinzhu, et al. Rethinking pseudo-lidar representation. ECCV, 2020.

[5] Brazil, Garrick, et al. M3d-rpn: Monocular 3d region proposal network for object detection. ICCV, 2019.

Question 1 & Limitation 2: Performance on Small Size Objects (Pedestrian and Cyclist)

The proposed MonoMAE can handle challenging cases with competitive detection performance. We conducted new experiments on the suggested pedestrian and cyclist categories, which often have smaller scales and are more challenging to detect. The two tables below show the experimental results with the metric $AP_{3D}(IoU=0.7)$.

Performance for pedestrian detection.

| Method   | Easy  | Moderate | Hard |
| -------- | ----- | -------- | ---- |
| DID-M3D  | 11.78 | 7.44     | 6.08 |
| MonoNeRD | 13.20 | 8.26     | 7.02 |
| Ours     | 13.37 | 8.41     | 7.10 |

Performance for cyclist detection.

| Method   | Easy | Moderate | Hard |
| -------- | ---- | -------- | ---- |
| DID-M3D  | 7.82 | 3.95     | 3.37 |
| MonoNeRD | 4.79 | 2.48     | 2.16 |
| Ours     | 8.05 | 4.16     | 3.54 |

We can observe from the above two tables that the proposed MonoMAE performs well for the pedestrian and cyclist categories for monocular 3D object detection.

Question 2: Generative Approaches for Simulating Natural Occlusion Patterns

As briefly shared in Section D.1 of the submitted Appendix, the proposed MonoMAE could be improved by employing generative networks such as GANs to learn distributions of real-world data. The trained model will then generate occlusion patterns that are more similar to natural occlusions than our proposed feature masking, leading to better monocular 3D object detection. Take GAN as an example. The GAN generator will learn to generate occluded queries (with many non-occluded queries as reference), while the GAN discriminator will learn to discriminate between the generated occluded queries and naturally occluded queries. Through adversarial learning, the trained generator could generate more realistic occlusions, which further leads to better occlusion completion and monocular 3D object detection.
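
Purely as an illustration of this direction (nothing below is part of the paper's implementation; the dimensions, architectures, and names are assumptions), an adversarial setup over query vectors might look like:

```python
import torch
import torch.nn as nn

DIM = 256  # hypothetical query dimension

# Speculative generator: non-occluded query -> synthetically occluded query.
generator = nn.Sequential(nn.Linear(DIM, DIM), nn.ReLU(), nn.Linear(DIM, DIM))
# Speculative discriminator: query -> logit for "naturally occluded".
discriminator = nn.Sequential(nn.Linear(DIM, DIM), nn.ReLU(), nn.Linear(DIM, 1))

def gan_step(non_occ, nat_occ, opt_g, opt_d):
    """One adversarial step: the generator learns to turn non-occluded
    queries into queries the discriminator cannot tell apart from
    naturally occluded ones."""
    bce = nn.BCEWithLogitsLoss()
    fake = generator(non_occ)
    # Discriminator update: natural occlusions are "real", generated are "fake".
    d_loss = (bce(discriminator(nat_occ), torch.ones(len(nat_occ), 1))
              + bce(discriminator(fake.detach()), torch.zeros(len(fake), 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator update: fool the discriminator.
    g_loss = bce(discriminator(fake), torch.ones(len(fake), 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```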

Limitation 1: Computational Efficiency

We would clarify that we have provided the inference time of the proposed MonoMAE and several state-of-the-art monocular 3D detection methods in Table 5 of the submitted manuscript. With an inference time of 38 ms per frame, MonoMAE achieves above 26.3 FPS (frames per second), demonstrating its great potential for real-time tasks. For resource usage, all our experiments were conducted on a computer with Intel Xeon Gold 6134 CPUs and 128 GB RAM running CentOS, and a single NVIDIA V100 GPU with 32 GB of memory. We will highlight this information in the updated manuscript.

Comment

Thanks for providing additional information in your rebuttal. Your reply has addressed part of my concerns. Regarding the part "Question 1 & Limitation 2: Performance on Small Size Objects (Pedestrian and Cyclist)", providing more qualitative or visualization results would be more convincing.

Comment

Thank you for your valuable suggestion! Unfortunately, the OpenReview system does not allow uploading a PDF for visualization. Additionally, the NeurIPS discussion guidelines prohibit including external links. We will include the suggested visualization in the revised manuscript/appendix later.

Author Response

General Response

We thank all the reviewers for your time, insightful suggestions, and valuable comments. We are highly encouraged by the reviewers' acknowledgment of our method's innovative idea and novel design (xT7k, 8dEd, g4DX), superior performance (xT7k, 8dEd), extensive experiments (g4DX, bA8P), strong generalization capability (xT7k), and good presentation (8dEd, g4DX). Most reviewers concurred that the studied occlusion problem is a critical and significant challenge for monocular 3D detection.

The reviewers also shared several concerns, mainly focusing on:

  • More detailed information on the usage of occlusion levels.
  • Further insights regarding the gap between synthetic and natural object occlusions.
  • Additional experiments regarding network generalization.

We respond to each reviewer's comments in detail below. We again thank the reviewers for your valuable suggestions, which we believe will greatly strengthen our paper.

Final Decision

The paper receives positive ratings from all the reviewers. They liked the idea of using depth-aware masked autoencoders for monocular 3D object detection. The technical contribution is solid, and the experimental results are strong. The AC agrees with the positive comments from the reviewers and recommends 'accept as poster' for this paper.