PaperHub

Overall score: 5.3 / 10 · Poster · 4 reviewers
Ratings: 4, 5, 5, 7 (min 4, max 7, std 1.1)
Confidence: 3.5 · Correctness: 2.5 · Contribution: 2.3 · Presentation: 2.8

NeurIPS 2024

Unified Domain Generalization and Adaptation for Multi-View 3D Object Detection

OpenReview · PDF
Submitted: 2024-05-13 · Updated: 2024-12-19
TL;DR

Label-Efficient Domain Adaptation for Multi-view 3D Object Detection

Abstract

Recent advances in 3D object detection leveraging multi-view cameras have demonstrated their practical and economical value in various challenging vision tasks. However, typical supervised learning approaches face challenges in achieving satisfactory adaptation toward unseen and unlabeled target datasets (i.e., direct transfer) due to the inevitable geometric misalignment between the source and target domains. In practice, we also encounter constraints on resources for training models and collecting annotations for the successful deployment of 3D object detectors. In this paper, we propose Unified Domain Generalization and Adaptation (UDGA), a practical solution to mitigate those drawbacks. We first propose a Multi-view Overlap Depth Constraint that leverages the strong association between multiple views, significantly alleviating geometric gaps due to perspective view changes. Then, we present a Label-Efficient Domain Adaptation approach to handle unfamiliar targets with significantly fewer labels (i.e., 1% and 5%), while preserving well-defined source knowledge for training efficiency. Overall, the UDGA framework enables stable detection performance in both source and target domains, effectively bridging inevitable domain gaps while demanding fewer annotations. We demonstrate the robustness of UDGA with large-scale benchmarks: nuScenes, Lyft, and Waymo, where our framework outperforms the current state-of-the-art methods.
Keywords
Domain Generalization, Domain Adaptation, Multi-view 3D Object Detection, Autonomous Driving

Reviews and Discussion

Review (Rating: 4)

The paper proposes a Unified Domain Generalization and Adaptation (UDGA) scheme for multi-view 3D object detection. The main components of the proposed method are (1) depth-inconsistency-based constraints across multiple views and (2) an efficient domain adaptation scheme (LEDA). The multi-view depth inconsistency constraints reduce the domain gap caused by geometric misalignment under perspective view changes. LEDA enables adaptation with fewer labels while preserving knowledge from the source domain.

Strengths

The paper clearly addresses which problem it focuses on, and each proposed module has solid goals to address existing limitations. The experimental results from multiple datasets also seem convincing.

Weaknesses

I have a few concerns about the demonstration of the proposed method, as well as about how existing works are addressed.

(1) The method part needs to be improved. For instance, specifying every input and output tensor dimension (i.e., for the depth estimation network, etc.) would significantly help readers understand more quickly what each module is doing.

(2) Using the depth inconsistency seems similar to DETR3D's [1] inconsistency constraint on RGB features. DETR3D even mentions that using RGB features is better than having explicit depth estimation. Based on this, the depth estimation network may introduce additional parameters for optimization while not being very helpful. I cannot see experiments or comparisons with DETR3D.

(3) More related works need to be addressed. Currently, the paper focuses on multi-view-based domain generalization for 3D detection. However, there is also another active line of research on LiDAR-based domain adaptation for 3D detection, pioneered by ST3D [2]. Clearly addressing the difference between these lines of research will help readers recognize which line of research the paper focuses on.

(4) The effectiveness of LEDA is not well demonstrated relative to existing approaches, contrary to what is claimed in the paper.

[1] Wang, Yue, et al. "DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries." Conference on Robot Learning (CoRL), 2022.

[2] Yang, Jihan, et al. "ST3D: Self-training for Unsupervised Domain Adaptation on 3D Object Detection." CVPR, 2021.

Questions

I have a few questions that correspond to the weakness that I mentioned.

(1) Is there a justifiable reason why using the depth inconsistency constraint is superior to using deep RGB-feature-based inconsistency (as in DETR3D)?

(2) As one of the main claimed contributions, a detailed analysis of LEDA seems missing. How much does LEDA improve the proposed system compared to existing approaches in terms of efficiency and accuracy?

Addressing those two questions properly will improve the quality of the paper and, subsequently, the reviewer's rating.

Limitations

The limitations are addressed.

Author Response

(1) Discussion with RGB feature-based methods

Thank you for your constructive suggestion. As reported by DETR3D [4], CVT [5], and BEVFormer [6], RGB feature-based methods are considerably robust against calibration noise. However, these methodologies experience significant performance degradation in dynamic view-shift environments, such as cross-domain scenarios where the calibration changes dramatically, as shown in the table below (Lyft → nuScenes).

| Method | Backbone | Neck | Source (NDS* / mAP) | Target (NDS* / mAP) | # Params (M) |
|---|---|---|---|---|---|
| BEVDepth | ResNet50 | LSS | 0.684 / 0.602 | 0.213 / 0.102 | 51.7 |
| CVT | ResNet50 | Transformer | 0.658 / 0.563 | 0.231 / 0.066 | 51.2 |
| DETR3D | ResNet50 | Transformer | 0.650 / 0.552 | 0.179 / 0.087 | 32.3 |
| BEVFormer | ResNet50 | ST Transformer | 0.624 / 0.506 | 0.138 / 0.008 | 33.5 |
| Ours | ResNet50 | LSS | 0.702 / 0.630 | 0.421 / 0.281 | 51.7 |

Additionally, we explore how domain shift hinders stable 3D recognition in main text Sec. 3.2, Appendix C, and Table 7. Although fair comparisons are challenging due to architectural differences, we validate that our proposed modules quantitatively mitigate dramatic view changes in the cross-domain setting (up to a 20.9% NDS gain).

(2) In-depth explanation of LEDA

We hope that the additional explanation of LEDA in the global response aids understanding. To advance the development of 3D object detection, we design the Label-Efficient Domain Adaptation (LEDA) framework, which successfully adapts to novel targets with only a small amount of data and few extra parameters, introducing a practical solution for real-world autonomous driving. To this end, we conduct structural experiments (Rebuttal PDF Tables 2 and 3) to explore an architecture that is both effective and efficient. We also validate that our depth constraint method encourages transfer from pre-trained source knowledge to novel targets, as shown in Rebuttal PDF Table 1. In particular, by comparing our methodology with existing state-of-the-art models and evaluating NDS against efficiency across various setups (Rebuttal PDF Figure 1), we quantitatively show that our proposed methods bridge the gap between the source and a novel target with a limited training budget (i.e., trainable parameters and data). Notably, our proposed methodology outperforms existing works by 1% on novel targets while using less than 20% extra parameters. We will present and analyze these results in the revised version.

(3) Status of LiDAR-based cross-domain object detection

LiDAR-based object detection research also addresses performance degradation caused by domain shifts. Wang et al. [1] proposed statistical normalization to mitigate differences in object size distribution across datasets. ST3D [2] leveraged domain knowledge through random object-scale augmentation, and its self-training pipeline refined the pseudo-labels. STAL3D [3] extended ST3D by incorporating adversarial learning. These methods were all proposed under the assumption of target-aware conditions, which presents the limitation that they cannot be applied without prior access to the target domain. The table below (Waymo → KITTI) extracts a subset of experimental results from STAL3D. LiDAR-based methods were less affected by domain shifts than camera-based ones, though their performance did not reach the oracle level.

| Method | BEV AP | 3D AP |
|---|---|---|
| Oracle | 83.29 | 73.45 |
| Direct Transfer | 67.64 | 27.48 |
| SN [1] | 78.96 | 59.20 |
| ST3D [2] | 82.19 | 61.83 |
| STAL3D [3] | 82.26 | 69.78 |

Unlike in the LiDAR modality, such approaches fail to address the domain shift problem in image-based 3D detection because inaccurate geometric information yields poor-quality pseudo-labels. To address domain generalization for camera-based systems, we propose a depth constraint module and introduce the LEDA module to achieve oracle-level performance with label efficiency in the target domain.

In addition, we will clearly address weaknesses (1), more details of our methods, and (3), more related works, in the revised version.

Thank you.


[1] Wang, Yan, et al. "Train in Germany, Test in the USA: Making 3D Object Detectors Generalize." CVPR, 2020.
[2] Yang, Jihan, et al. "ST3D: Self-training for Unsupervised Domain Adaptation on 3D Object Detection." CVPR, 2021.
[3] Zhang, Zhou, et al. "STAL3D: Unsupervised Domain Adaptation for 3D Object Detection via Collaborating Self-Training and Adversarial Learning." IEEE Transactions on Intelligent Vehicles, 2024.
[4] Wang, Yue, et al. "DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries." Conference on Robot Learning (CoRL). PMLR, 2022.
[5] Zhou, Brady, and Philipp Krähenbühl. "Cross-view Transformers for Real-time Map-view Semantic Segmentation." CVPR, 2022.
[6] Li, Zhiqi, et al. "BEVFormer: Learning Bird's-Eye-View Representation from Multi-camera Images via Spatiotemporal Transformers." ECCV, 2022.

Comment

I appreciate the authors' effort to answer my questions. I am convinced by the authors' responses about "RGB-features" and "In-depth explanation of LEDA".

However, the authors' response to "Status of LiDAR-based cross-domain object detection" made me sceptical about what the authors meant by "(1) These methods were all proposed under the assumption of target-aware conditions, which presents the limitation that they cannot be applied without prior access to the target domain." and (2) "LiDAR-based methods were less affected by domain shifts compared to those of cameras, though their performance did not reach the oracle level." in the rebuttal.

Regarding (1), did the authors mean that LiDAR-based self-training methods require target statistics, such as average object size, and self-training requires access to the target domain?

From the reviewer's understanding, the proposed method LEDA even requires labels directly from the target domain (although a comparably small amount, i.e., 1%–5%), whereas the LiDAR-based self-training methods do not require such direct labels.

In the context that LEDA requires target labels and LiDAR-based methods require self-training, it seems to me that they both require prior access to the target domain.

Regarding (2), if the authors meant that LiDAR-based methods have not met oracle performance on the Waymo-to-KITTI adaptation task, I must say that this is false. For instance, DTS [1] and others have already outperformed the oracle, with justifiable reasons, on the Waymo-to-KITTI adaptation task.

I understand that multi-view-based and LiDAR-based domain adaptation/generalization methods have different characteristics and are not directly comparable. However, as a similar branch of work, LiDAR-based domain-adaptive 3D detection also needs to be addressed with a proper description so that readers can understand the related lines of work. Specifically, it is not very convincing to advertise that multi-view-based domain generalization is more challenging than LiDAR-based domain generalization, given the above-mentioned reasons (1) and (2). I personally think that simply mentioning LiDAR-based domain-adaptive/generalization 3D detection works as relevant work, without comparing which one is more challenging, would make it clearer for readers what the relevant lines of research are.

[1] Density-Insensitive Unsupervised Domain Adaption on 3D Object Detection, Hu et al. CVPR 2023

Comment

Dear Reviewer 2JqY,

We sincerely appreciate your valuable efforts and professional feedback, which have indeed improved the quality of the final version of our manuscript. Above, we've prepared answers to your remaining concerns regarding the clear description of the relationship between LiDAR- and camera-based domain adaptation literature.

If it appears to be reasonable, could you please update your rating of our work, as you indicated in your initial review? As the authors-reviewers discussion period is nearing its end, we would like to politely request your final verdict on this matter.

Best regards,

The Authors

Comment

We genuinely appreciate the reviewer's thoughtful and detailed feedback. Our proposed generalization technique, Multi-view Overlap Depth Constraint, does not require prior access to the target, unlike Statistical Normalization (SN) [1].
However, due to its limited performance, we designed LEDA to efficiently learn novel target knowledge, and we agree with your observation that LEDA indeed falls under the category of methods requiring direct labels from the target domain. Furthermore, we acknowledge that our lack of insight into UDA for LiDAR 3D object detection led to an insufficient comparison in answer (3), 'Status of LiDAR-based Cross-Domain Object Detection'. To address these issues, we will revise rebuttal phrases (1) and (2) as follows:
Regarding (1): "These methods were all proposed under the assumption of target-aware conditions, which presents the limitation that they cannot be applied without prior access to the target domain." → "These methods adopt 'prior access to target' approaches (e.g., SN [1], self-training [2, 3]) to effectively align the domain gaps between the source and target."
Regarding (2): "LiDAR-based methods were less affected by domain shifts compared to those of cameras, though their performance did not reach the oracle level." → "LiDAR-based methods were less affected by domain shifts compared to those of cameras, and recent approaches (e.g., DTS [3]) have even surpassed oracle-level performance."

More importantly, we observe that the challenges of LiDAR-based methods also align with our problems in a certain aspect (i.e., label-efficient learning). LiDAR-based approaches utilize self-training strategies with pseudo-labeling; our UDGA framework instead introduces a method for fine-tuning LEDA using a small subset of novel target labels, and applying LEDA to LiDAR-based 3D object detection represents a promising direction for future work. We will clearly include this discussion in the final manuscript and agree that it will indeed enhance the readability of our paper.

Thanks for your thorough and professional comments on this comparison.


[1] Wang, Yan, et al. "Train in Germany, Test in the USA: Making 3D Object Detectors Generalize." CVPR, 2020.
[2] Yang, Jihan, et al. "ST3D: Self-training for Unsupervised Domain Adaptation on 3D Object Detection." CVPR, 2021.
[3] Hu, Qianjiang, Daizong Liu, and Wei Hu. "Density-insensitive Unsupervised Domain Adaption on 3D Object Detection." CVPR, 2023.

Review (Rating: 5)

This paper is about Unified Domain Generalization and Adaptation for outdoor 3D object detection. To address the geometric misalignment between the source and target domains, the authors propose a Multi-view Overlap Depth Constraint that leverages the strong association between multiple views, together with Label-Efficient Domain Adaptation. Experiments are conducted under several cross-dataset settings.

Strengths

  1. Domain generalization and adaptation are important for 3D object detection.
  2. The writing is easy to follow.

Weaknesses

  1. In past works, comparative methods such as DG-BEV and PD-BEV reported performance on both source and target domains. Yet the metrics reported in Table 1 do not seem to align with the results reported in those two papers. What is the reason for these changes?
  2. It is noteworthy that NDS is specific to nuScenes and differs from the metrics of other datasets. It is important to make such differences explicit.
  3. The authors claim that this work is toward universal 3D detection, yet there are other works toward universal 3D detection, such as [1][2]. It would be better to illustrate the differences from these methods, both in discussion and in experimental comparison. Besides, other works devoted to multi-dataset training for 3D detection and generalization are also missing from the paper. Can the proposed method be applied to multi-source training and multi-target testing?
  4. The Domain Adapter design seems simple, yet the motivation and potential of this design are missing. This matters since it is one of the two main components of this work.

[1] Towards Universal LiDAR-based 3D Object Detection by Multi-domain Knowledge Transfer. ICCV 2023.
[2] Cross-Dataset Sensor Alignment: Making Visual 3D Object Detector Generalizable. PMLR 2023.

Questions

The experimental comparison, and the novelty and importance of the work.

Limitations

The limitation part has been addressed.

Author Response

(1) Explanation of Table 1

Thank you for your insightful review. We understand that there was some confusion regarding the misaligned results in Table 1 due to insufficient explanation. Since previous methods did not adopt consistent experimental protocols, a fair comparison is only possible on the target domain (i.e., they did not evaluate the same model on both the source and target domains). To address this issue, we adopt their highest reported scores for comparison, as detailed in Appendix Table 5. We will make the necessary revisions to clarify this information and enhance the overall clarity of our manuscript.

(2) Discussion with other metrics

Note that, to ensure a fair comparison, we adopt the unified evaluation metric NDS, as reported in DG-BEV and PD-BEV.
Waymo and Lyft calculate 3D AP by directly measuring the intersection over union (IoU) between predicted and ground-truth bounding boxes. In contrast, NDS measures differences between predicted and ground-truth objects using various indicators such as translation error, size error, orientation error, and BEV AP. This approach captures plausible failure cases (e.g., translation or orientation failures) and allows a practical analysis of domain shift issues, as shown in main paper Sec. 3.2 L#126-130, Sec. 4.2 L#207-209, Sec. 4.3 L#244-245 and L#250-251, and Appendix C L#505-513.
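For reference, the nuScenes benchmark defines NDS as a weighted combination of mAP and five true-positive (TP) error terms (translation, scale, orientation, velocity, and attribute):

$$\mathrm{NDS} = \frac{1}{10}\Big[5 \cdot \mathrm{mAP} + \sum_{\mathrm{mTP} \in \mathbb{TP}} \big(1 - \min(1, \mathrm{mTP})\big)\Big], \quad \mathbb{TP} = \{\mathrm{mATE}, \mathrm{mASE}, \mathrm{mAOE}, \mathrm{mAVE}, \mathrm{mAAE}\}.$$

The starred NDS* used in the cross-domain tables follows prior work in dropping TP terms that are not annotated in every dataset (e.g., velocity and attribute).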

(3) Discussion with the 3D universal detection

Thank you for the meaningful discussion. [1] and [2] explore multi-domain learning and generalization to improve 3D object detection. We present a brief overview below.

| Method | Modality | View | Target-aware | Approach | Multi-source | Multi-target |
|---|---|---|---|---|---|---|
| [1] | LiDAR | multi | X | out-domain | O | O |
| [2] | Camera | single | O | out-domain | O | O |
| Ours | Camera | multi | X | in-domain | O (a few extra targets) | O |

The universal 3D object detection task is quite similar to our framework, and our proposed methods show considerable potential in this setting (multi-source training and multi-target testing), as shown in main paper Table 2. However, while universal training methods generalize across external domain knowledge, we leverage internal knowledge, which does not rely on large-scale and diverse datasets. Furthermore, our framework can generalize without prior access to the target, making it more feasible to develop and deploy 3D detection in real-world scenarios.

(4) More details of Label-Efficient Domain Adaptation

We hope that this rebuttal is helpful to you. We provide more details of LEDA in the global response, clarifying the motivation and potential of our proposed method. Although existing methods aim to generalize across the domain shift between the source and a novel target, they often fail to provide a practical solution, mainly due to unsatisfactory generalization performance, and leave room for improvement. Additionally, costly resources exacerbate these issues and hinder the expansion of multi-view 3D object detection. To tackle these issues, we design a practical framework leveraging an efficient and effective learning strategy, LEDA, for multi-view 3D object detection.

First, we show that our depth constraint method smoothly handles sensor misalignment between the source and target domains and effectively boosts the adaptation capability of LEDA (as shown in Rebuttal PDF Table 1). Our LEDA also successfully transfers pre-trained potential to novel targets without forgetting previously learned knowledge, as shown in main text Table 5. As a result, our proposed framework substantially alleviates costly resource requirements and unstable performance, as shown in Rebuttal PDF Figure 1, and presents a practical solution for real-world autonomous driving.

We sincerely appreciate your constructive review and will address the points discussed above in the revised manuscript.

Thank you.


[1] Wu, Guile, et al. "Towards universal LiDAR-based 3D object detection by multi-domain knowledge transfer." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
[2] Zheng, Liangtao, et al. "Cross-Dataset Sensor Alignment: Making Visual 3D Object Detector Generalizable." Conference on Robot Learning. PMLR, 2023.

Comment

Dear Reviewer gahV,

We greatly appreciate your valuable efforts and professional feedback, which have indeed improved the quality of the final version of our manuscript. We’ve provided answers to your remaining concerns above, and it would be great to hear your feedback on our rebuttal so that we can further improve the final version.

Although the authors-reviewers discussion period is nearing its end, we are fully prepared to address any further questions you may have.

Best regards,

The Authors

Review (Rating: 5)

The paper presents an adaptation of 3D object detectors to varying target environments using two major strategies. The proposed multi-view overlap depth constraint leverages associations across views to learn view-invariant features. Additionally, a LoRA-like structure is designed for parameter-efficient adaptation, accommodating scenarios with limited target data. Experiments on three benchmark datasets demonstrate the effectiveness of the proposed approach, with minimal modification of parameters.

Strengths

  • The paper addresses a significant and practical issue in 3D object detection, aiding in the development of robust models for dynamically changing testing environments.
  • The flowchart clearly illustrates the core components, making it easy for readers to grasp the main idea.
  • Extensive experiments have been conducted, and the proposed strategy significantly outperforms the pre-trained source model and even surpasses full fine-tuning strategies.

Weaknesses

  • The two proposed strategies appear somewhat disconnected. While the multi-view idea is interesting, the technical details are confusing. Equation (5) includes three different objectives, but their importance and sensitivity are not discussed. The latter adaptation strategy resembles existing works and lacks a design specific to the 3D detection task; it is also not tied to the view transformation. The isolated adaptation modules seem more like a stacking of existing techniques than a cohesive addition.
  • More discussion and comparisons with previous multi-view augmentation strategies would help clarify the merits and innovations of the proposed approach. The current claim that existing strategies generalize poorly is somewhat unconvincing.

Questions

Please refer to the weakness section.

Limitations

The authors have not provided a discussion on the limitations of their work.

Author Response

(1) More details of LEDA.

Disconnected strategies
We hope that the further explanation of LEDA in the global response enhances your understanding. We empirically observe that direct fine-tuning approaches (with a small fraction of data) often fail to align the source and target, mainly due to dynamic perspective shifts. To mitigate this issue, our proposed depth constraint method softly addresses perspective gaps and enhances adaptation capacity to novel targets. In Rebuttal PDF Table 1, LEDA (w/o $\mathcal{L}_{ov}$ and $\mathcal{L}_{p}$) struggles to bridge to novel target knowledge and shows only a 20.4% / 24.4% NDS / mAP gain. However, LEDA (w/ $\mathcal{L}_{ov}$ and $\mathcal{L}_{p}$) significantly transfers pre-trained knowledge to novel targets (34.2% / 42.5% NDS / mAP gain). As a result, we note that $\mathcal{L}_{ov}$ and $\mathcal{L}_{p}$ enable efficient learning by encouraging stable convergence even with tiny datasets.

Ablation of objectives
Additionally, we study the importance of our optimization objectives ($\mathcal{L}_{ov}$ and $\mathcal{L}_{p}$). Although our depth constraint method effectively addresses domain gaps, it suffers from narrow overlap regions between adjacent views. To tackle this drawback, $\mathcal{L}_{p}$ effectively boosts stable recognition during both the pre-training and fine-tuning phases. Specifically, $\mathcal{L}_{p}$ yields up to a 1.9% / 1.9% NDS / mAP gain during pre-training and up to a 2.1% / 1.6% NDS / mAP gain during fine-tuning. As a result, we demonstrate that our two proposed methods synergistically enhance UDGA performance.

In-depth explanation of LEDA.
In this paper, we advocate that depth consistency plays a pivotal role in bridging domain gaps. In particular, we show that UDGA stably resolves geometric differences arising from perspective view changes and encourages optimal BEV representation learning, as shown in Rebuttal PDF Table 3. Precisely, UDGA achieves remarkable performance improvements of up to 10.0% NDS and 9.5% mAP at the view transformation and BEV encoder layers, demonstrating its effectiveness.

(2) Comparison with Augmentation strategies.

We also measured the generalizability of previous multi-view augmentation strategies, as shown in Rebuttal PDF Table 4. We adopt conventional augmentation strategies for multi-view 3D object detection as follows:

  • GT sampling effectively addresses unbalanced labels by sampling ground-truth objects in 3D object detection.
  • 2D aug. directly augments multi-view inputs (i.e., image resize, crop-and-paste, contrast and brightness distortion).
  • 3D aug. globally rotates, re-scales, and translates multi-view inputs and ground truths.
  • Extrinsic aug. adopts a global yaw rotation in a random direction.
  • CBGS re-balances classes to address unbalanced ground truths.

These methods significantly enhance geometric understanding against input noise. However, under dynamic view changes (i.e., cross-domain), they still suffer from geometric inconsistency and show poor generalization capability. Moreover, various 2D approaches do not guarantee geometric alignment between 2D images and 3D ground truths, and relevant studies have not been explored well, as reported in DG-BEV and [1]. To tackle these issues, we present a novel Multi-view Overlap Depth Constraint that effectively mitigates dynamic view changes in the cross-domain setting. We also applied the top 3 methods (2D, 3D, and extrinsic augmentation) to our method and achieved state-of-the-art performance on both source and target. Unfortunately, we were unable to find additional comparable papers on generalized multi-view 3D object detection. If you could provide relevant papers, we would be happy to analyze them together.
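As a generic illustration of the extrinsic augmentation listed above, the sketch below applies a global random yaw to both the camera extrinsics and the 3D ground-truth boxes. The camera-to-ego convention, the [x, y, z, w, l, h, yaw] box layout, and all names are illustrative assumptions, not the exact implementation used in our experiments.

```python
import numpy as np

def yaw_matrix(theta: float) -> np.ndarray:
    """3x3 rotation about the ego z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def extrinsic_yaw_aug(cam2ego: np.ndarray, boxes: np.ndarray, max_deg: float = 22.5):
    """Globally rotate the scene by a random yaw (extrinsic augmentation).

    cam2ego: (4, 4) camera-to-ego transform.
    boxes:   (N, 7) ground truths as [x, y, z, w, l, h, yaw] in the ego frame.
    """
    theta = np.deg2rad(np.random.uniform(-max_deg, max_deg))
    R = yaw_matrix(theta)
    T = np.eye(4)
    T[:3, :3] = R
    cam2ego_aug = T @ cam2ego                  # rotate the camera pose
    boxes_aug = boxes.copy()
    boxes_aug[:, :3] = boxes_aug[:, :3] @ R.T  # rotate box centers
    boxes_aug[:, 6] += theta                   # rotate box headings
    return cam2ego_aug, boxes_aug
```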

We sincerely appreciate your thoughtful review and will address above discussion in the revised version.

Thank you.


[1] Zhao, Yunhan, Shu Kong, and Charless Fowlkes. "Camera pose matters: Improving depth prediction by mitigating pose distribution bias." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
[2] Zheng, Liangtao, et al. "Cross-Dataset Sensor Alignment: Making Visual 3D Object Detector Generalizable." Conference on Robot Learning. PMLR, 2023.

Comment

I would like to thank the authors for the detailed response; most of my questions have been addressed. I agree with the other reviewers that the proposed multi-view approach does not seem specifically designed for the DG problem. However, based on the impressive improvement over commonly seen augmentation strategies, I think the proposed module has merit to facilitate future work. Thus, I would like to increase my score to borderline accept.

Comment

We are delighted that our responses to your questions have been well received and led to a positive evaluation of our work (4 → 5). As you acknowledged, our practically motivated framework achieves state-of-the-art performance. We also hope that our work provides new perspectives on the unified view of domain generalization and adaptation for future research.

Review (Rating: 7)

This paper focuses on multi-view 3D object detection. The authors propose a unified domain generalization and adaptation-based detection method. To enhance the detection model on unseen datasets and address the geometric misalignment problem, the authors propose a multi-view overlap depth constraint module and a label-efficient domain adaptation technique. Comprehensive experiments on large-scale datasets, including nuScenes, Lyft, and Waymo, demonstrate the effectiveness of the proposed method.

Strengths

  • The storyline is clear; the authors provide a detailed motivation analysis in the introduction and present their innovations clearly.
  • The performance is strong. The method shows SOTA performance on various benchmarks with fast speed.
  • Most of the figures are clear. The paper is also well written.

Weaknesses

  • There is no source code for review.

Questions

This paper seems well-shaped to me. However, I'm not an expert in this field, so I'm open to discussion with other reviewers if they hold opposing opinions.

Limitations

Please refer to the above.

Author Response

Code release

We plan to release the source code upon acceptance.

Additionally, in this rebuttal, we provide further explanations, analyses, and results (please refer to global response and Rebuttal PDF). We hope that these materials will be helpful to you.

If you have any questions or need further discussion, please feel free to ask.

Thank you.

Author Response (Global)

We sincerely appreciate your effort in reviewing our work. We have carefully read and considered all of the comments and suggestions provided by the reviewers. To assist with your understanding, we provide detailed analyses and additional experiments on Label-Efficient Domain Adaptation (LEDA) in this global response. Furthermore, we have addressed each reviewer's comments accordingly, with detailed responses and references for each comment. We hope that this rebuttal will help clarify any misunderstandings or concerns and will contribute to the overall evaluation of our work.

Thank you again for your time and consideration. We look forward to hearing your feedback.


The motivation of LEDA

There exist practical challenges in developing and deploying multi-view 3D object detectors for safety-critical self-driving vehicles. Each vehicle and each sensor requires its own model that can operate in dynamic weather, location, and time conditions. Furthermore, collecting large-scale labels in diverse environments is extremely expensive and inefficient. Among those, we are particularly motivated to address the following issues:

  1. Stable performance
  2. Efficiency of training
  3. Preventing catastrophic forgetting
  4. Minimizing labeling cost

To satisfy these practical requirements, we carefully design an efficient and effective learning strategy, Label-Efficient Domain Adaptation (LEDA).

In Rebuttal PDF Figure 1, we evaluate the efficiency of LEDA compared to existing methods, particularly in terms of domain adaptation (DA) performance. LEDA achieves the highest accuracy with low parameters and data cost, demonstrating its practicality and effectiveness in real-world applications.

Technical details of LEDA

Label-Efficient Domain Adaptation is a novel strategy to seamlessly bridge domain gaps leveraging a small amount of target data. To this end, we add extra parameters $\mathcal{A}$ consisting of bottleneck structures (i.e., projection-down $\phi_{down}$ and projection-up $\phi_{up}$ layers):

$$\mathcal{A}(x) = \phi_{up}(\sigma(\phi_{down}(\mathrm{BN}(x)))),$$

where $\sigma$ and $\mathrm{BN}$ denote the activation function and batch normalization, respectively. We build $\mathcal{A}$ in parallel with the pre-trained operation blocks $\mathcal{B}$ (e.g., convolution and linear blocks), as in main paper Figure 3 (ii) and the equation below:

$$y = \mathcal{B}(x) + \mathcal{A}(x).$$

First, we feed $x$ into $\phi_{down}$ to compress its spatial shape to $[H/r, W/r]$, where $r$ is the rescale ratio, and then use $\phi_{up}$ to restore it to $[H, W]$. Second, we fuse the outputs of $\mathcal{B}$ and the adapter via skip connections that directly link the downsampling and upsampling paths. These extensible modules capture high-resolution spatial detail while reducing network and computational complexity. It is noteworthy that they are initialized as a near-identity function to preserve previously updated weights. Finally, our LEDA leads to stable recognition in both source and target domains, incrementally adapting without forgetting pre-trained knowledge.
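For concreteness, below is a minimal PyTorch sketch of such an adapter, assuming 2D feature maps of shape [C, H, W] and a shape-preserving base block $\mathcal{B}$. The class names, the ReLU choice for $\sigma$, and the transposed-convolution up-projection are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """A(x) = phi_up(sigma(phi_down(BN(x)))) over [C, H, W] feature maps."""
    def __init__(self, channels: int, r: int = 2):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels)
        # phi_down: compress spatial resolution [H, W] -> [H/r, W/r]
        self.down = nn.Conv2d(channels, channels, kernel_size=r, stride=r)
        self.act = nn.ReLU(inplace=True)  # sigma
        # phi_up: restore resolution [H/r, W/r] -> [H, W]
        self.up = nn.ConvTranspose2d(channels, channels, kernel_size=r, stride=r)
        # Near-identity initialization: zeroing the up-projection makes
        # A(x) = 0 at the start of fine-tuning, so y = B(x) + A(x) initially
        # reproduces the pre-trained block exactly.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.act(self.down(self.bn(x))))

class AdaptedBlock(nn.Module):
    """y = B(x) + A(x): a frozen pre-trained block B with a trainable adapter A."""
    def __init__(self, pretrained_block: nn.Module, channels: int, r: int = 2):
        super().__init__()
        self.block = pretrained_block
        for p in self.block.parameters():
            p.requires_grad = False  # keep source knowledge intact
        self.adapter = BottleneckAdapter(channels, r)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x) + self.adapter(x)
```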

To analyze the impact of different architectural choices on the performance of our model, we conduct two ablation studies on the structure of adapters.

  • In Rebuttal PDF Table 2, we compare variations of the projection up and down layers.
  • In Rebuttal PDF Table 3, we compare different locations where the adapter can be attached.

This allows us to understand how the structure of the adapter affects the model's performance.

Optimization Objective

We optimize our proposed framework UDGA using the total loss function $\mathcal{L}_{total}$ during both phases (i.e., generalization and adaptation):

$$\mathcal{L}_{total} = \lambda_{det}\mathcal{L}_{det} + \lambda_{ov}\mathcal{L}_{ov} + \lambda_{p}\mathcal{L}_{p},$$

where we grid-search $\lambda_{det}$, $\lambda_{ov}$, and $\lambda_{p}$ to harmonize $\mathcal{L}_{det}$, $\mathcal{L}_{ov}$, and $\mathcal{L}_{p}$. Specifically, $\mathcal{L}_{total}$ supervises $\mathcal{B}$ during pre-training and $\mathcal{A}$ during fine-tuning, respectively. In Rebuttal PDF Table 1, we highlight the importance and sensitivity of each objective and demonstrate the validity and interconnection of both methods.
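To make the two-phase supervision concrete, here is a minimal sketch assuming PyTorch and assuming adapter parameters are identifiable by the substring "adapter" in their names (an illustrative convention, not the actual code); the lambda defaults are placeholders to be grid-searched.

```python
import torch.nn as nn

def configure_phase(model: nn.Module, phase: str) -> None:
    """Freeze/unfreeze parameters for the two UDGA phases.

    'pretrain' -> update the base blocks B (adapters frozen);
    'finetune' -> update only the adapters A (base blocks frozen).
    """
    assert phase in ("pretrain", "finetune")
    for name, param in model.named_parameters():
        is_adapter = "adapter" in name
        param.requires_grad = is_adapter if phase == "finetune" else not is_adapter

def total_loss(l_det, l_ov, l_p, lam_det=1.0, lam_ov=1.0, lam_p=1.0):
    # Weighted sum of the three objectives (L_det, L_ov, L_p).
    return lam_det * l_det + lam_ov * l_ov + lam_p * l_p
```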

Additionally, we present further comparisons with existing multi-view augmentations in Rebuttal PDF Table 4. Among the 5 dominant methods, we applied the top 3 (2D, 3D, and extrinsic augmentation) to our method.

Limitations

There are two limitations of our proposed UDGA for the multi-view 3D object detection task:

  1. Our proposed method for calculating the depth transformation between multi-view images relies on the presence of overlapping regions between images. The width of the overlap region impacts the accuracy of depth estimation, as demonstrated in Appendix C, Fig. 6.
  2. It is structurally challenging to apply depth constraint techniques to query-based networks. Cross-attention networks, including BEVFormer and DETR3D, often project from 2D to 3D without passing through a depth net, making it difficult to obtain depth features and compute correlations.
Final Decision

This paper initially received mixed scores, including two borderline rejects, one borderline accept, and one accept. Following a rebuttal by the authors, Reviewer LaL5 revised their score from borderline reject to borderline accept. While most concerns raised have been addressed, two issues remain unresolved: 1) the proposed multi-view approach's specific applicability to the Domain Generalization (DG) problem, and 2) inaccuracies in certain statements.

The first concern was raised by Reviewer LaL5. Despite this, the impressive performance improvements demonstrated by the proposed method lend credibility to its contribution to the field, and the AC finds this contribution acceptable.

The second concern involves incorrect statements highlighted by Reviewer 2JqY, including the assertion that only LiDAR-based methods require prior access to the target and that only the proposed method is susceptible to challenges posed by weather changes or sensor deployment. The AC suggests that these statements be revised in the final manuscript, noting that these revisions will not detract from the paper's primary contributions, such as its novelty and high performance.

Considering the above factors, the AC believes the paper possesses merits for the field of multi-view 3D object detection and thus suggests borderline acceptance. The AC strongly recommends that the authors address the specific issues raised by Reviewer 2JqY in the final version of the paper.