Generalizable Multi-Camera 3D Object Detection from a Single Source via Fourier Cross-View Learning
Abstract
Reviews and Discussion
This paper proposes a Fourier Cross-View Learning (FCVL) framework, which augments data in the frequency domain and includes a contrastive-style semantic consistency loss to improve the model's generalization ability when trained on a single source.
Questions for Authors
- Regarding the semantic consistency loss, it appears to work well for small vehicles. But what happens when the vehicle is large? Will the rear part of a large vehicle be treated as a negative sample of the front part?
- Besides large vehicles, what if there are other vehicles in the background of the image? Will they also be regarded as negative samples?
- Could you also show some examples where the proposed model still cannot detect correctly? How could the model be further improved in the future?
- The idea of augmenting data in the frequency domain is not new and has been tried by previous researchers. What are the advantages of the proposed frequency-domain augmentation?
Claims and Evidence
Yes.
Methods and Evaluation Criteria
Yes.
Theoretical Claims
The proof looks good, but I didn't check it carefully.
Experimental Design and Analysis
Yes. I did not find any issues.
Supplementary Material
Yes. I reviewed all materials.
Relation to Prior Literature
This method is applicable to different kinds of detection networks.
Essential References Not Discussed
None.
Other Strengths and Weaknesses
Pros:
- This paper augments the data in the frequency domain by jittering both amplitude and phase, which diversifies the dataset. The phase term typically captures high-frequency features that may be more transferable.
- The designed contrastive loss, which utilizes adjacent image regions, is also a good idea.
- The proposed method is adaptable to different approaches and achieved SOTA results with various baseline models.
- The t-SNE visualization and other visualization results validate that the learned features are domain-invariant and the images become more diverse.
Cons:
- There may be some false negative samples in the contrastive loss design, since some large vehicles may be distributed across several adjacent views.
Other Comments or Suggestions
Please check my questions.
Thanks for your positive and constructive feedback! We have addressed all the comments and incorporated additional experimental results to further validate our approach.
Q1 and W1: Our approach includes a cross-view instance binding mechanism in which identical instance labels are assigned to the cross-view instances of the same object. This ensures that they are consistently treated as positive pairs in contrastive learning and effectively prevents the rear part of a large vehicle from being misclassified as a negative sample of its front part. Visualization analysis in the figure (https://drive.google.com/file/d/1X04hoOqohT-O3SmlxuWx843Ffm-LS_8N/view?usp=sharing) demonstrates that the front and rear components of large vehicles produce consistent activation responses in the feature maps. In addition, we list the results (mAP) for large-vehicle categories below. We observe consistent and significant improvements on large vehicles; in particular, our approach achieves a 10.8% improvement for the trailer class.
| Method | truck (mAP) | trailer (mAP) | bus (mAP) |
|---|---|---|---|
| BEVDet | 0.128 | 0.038 | 0.222 |
| +FCVL | 0.208 (+8.0%) | 0.146 (+10.8%) | 0.293 (+7.1%) |
Q2: Thank you for raising this question. As mentioned above, in our approach identical instance labels are assigned to the cross-view instances of the same object. When taking a target object in one view as the anchor, we search for objects with the same instance label in adjacent views as positive samples, because these objects inherently represent the same target observed from different angles and exhibit strong correlations. For negative sample selection, we do not treat objects in the anchor's background as negatives. Instead, we choose samples of different categories from other views. This guarantees that negative samples are categorically distinct from the anchor and enhances the model's ability to differentiate features across categories. Through this design, we effectively leverage cross-view consistency to improve model performance.
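To make the pair-selection rule above concrete, the following is a minimal sketch in PyTorch. It is our own illustration, not the authors' implementation; the tensor layout and the use of all non-anchor views (rather than strictly adjacent ones) are simplifying assumptions.

```python
import torch

def build_pairs(instance_ids, categories, view_ids):
    """Sketch of cross-view pair selection with instance binding.

    instance_ids: (N,) global instance label shared by all views of one object
    categories:   (N,) class index of each instance
    view_ids:     (N,) index of the camera view each instance comes from
    Returns boolean (N, N) masks of positive / negative pairs (anchor i, candidate j).
    """
    inst = instance_ids.unsqueeze(0)
    cat = categories.unsqueeze(0)
    view = view_ids.unsqueeze(0)

    same_instance = inst == inst.t()   # same object observed in another view
    other_view = view != view.t()      # simplification: any other view stands in for "adjacent"
    diff_category = cat != cat.t()     # negatives must belong to a different category

    positives = same_instance & other_view   # front/rear parts of one truck remain positives
    negatives = diff_category & other_view   # same-category background objects are ignored,
                                             # never used as negatives
    return positives, negatives
```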
Q3: As highlighted in our discussion of limitations, there is still room for improvement under extreme weather and low-light conditions. For example, targets occluded by fog, or small distant targets that are inherently difficult to detect, become even more challenging under extreme weather. Some examples are provided here: https://drive.google.com/file/d/1d-hDTCbvTj3SOVffPHnlBlEnJIeShM-6/view?usp=sharing. These limitations can be addressed from two directions: (1) weather-specific data augmentation combined with a multi-scale strategy, and (2) multi-modal fusion that integrates LiDAR with cameras to cope with low light. In addition, LiDAR performance may itself degrade under adverse weather, so an adaptive cross-modal fusion scheme should be designed to dynamically fuse the different modalities and maximize sensor effectiveness across diverse scenarios.
Q4: We would like to further clarify the advantages of our method over other frequency-domain approaches [1, 2]. Compared with these methods, our method combines superior performance, high efficiency, and extensibility.
Firstly, in the single-source setting, our proposed method enhances the generalization ability of detectors by a large margin. FACT [1] needs to mix data from different domains in the frequency domain to achieve strong OOD performance. When training on only a single domain, FACT can merely mix samples within that domain; this slightly improves performance on the in-domain clean set, but the improvement on the OOD sets is very slim. Different from FACT, we first propose Frequency Jitter at the image level to create diverse samples. Then, at the feature level, we introduce a novel method, Amplitude Transfer, to obtain fine-grained styles without content distortion. Via uncertainty estimation, Amplitude Transfer obtains diverse feature statistics, which gradually shift the features toward more diverse domains over the course of training.
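As a rough illustration of image-level frequency jittering, the sketch below perturbs the amplitude multiplicatively and the phase additively in the 2D Fourier domain. It is a minimal stand-in, not the paper's exact recipe; the jitter magnitudes `amp_std` and `phase_std` are hypothetical.

```python
import torch

def frequency_jitter(img, amp_std=0.1, phase_std=0.05):
    """Jitter an image's amplitude and phase spectra, then transform back.

    img: (C, H, W) float tensor with values in [0, 1].
    """
    spec = torch.fft.fft2(img)                               # complex spectrum per channel
    amp, phase = spec.abs(), spec.angle()

    amp = amp * (1.0 + amp_std * torch.randn_like(amp))      # multiplicative amplitude jitter
    phase = phase + phase_std * torch.randn_like(phase)      # additive phase jitter

    jittered = torch.polar(amp.clamp_min(0.0), phase)        # recombine into a complex spectrum
    return torch.fft.ifft2(jittered).real.clamp(0.0, 1.0)
```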
Secondly, given the high complexity of BEV-based 3D object detection models, our plug-and-play data augmentation achieves better generalization more efficiently. AGFA [2] trains the classifier and an amplitude generator adversarially to synthesize the worst-case domain for adaptation. Compared with this method, ours is more stable and effective, without introducing sophisticated extra modules or special training recipes to stabilize performance. This also makes our method easier to extend to other frameworks.
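The feature-level Amplitude Transfer could be wrapped as a plug-and-play module along the following lines. This is a DSU-style approximation based on the description above (Gaussian-resampled amplitude statistics with phase left untouched), not the authors' exact module.

```python
import torch
import torch.nn as nn

class AmplitudeTransferSketch(nn.Module):
    """Resample the amplitude statistics of intermediate features while preserving phase.
    The uncertainty of the statistics is estimated across the batch (hypothetical design)."""

    def __init__(self, eps: float = 1e-6):
        super().__init__()
        self.eps = eps

    def forward(self, feat):                             # feat: (B, C, H, W)
        if not self.training:
            return feat
        spec = torch.fft.fft2(feat)
        amp, phase = spec.abs(), spec.angle()

        mu = amp.mean(dim=(2, 3), keepdim=True)           # per-sample, per-channel statistics
        sig = amp.std(dim=(2, 3), keepdim=True) + self.eps

        mu_unc = mu.std(dim=0, keepdim=True) + self.eps    # how much the statistics vary
        sig_unc = sig.std(dim=0, keepdim=True) + self.eps  # across the batch

        new_mu = mu + torch.randn_like(mu) * mu_unc        # sample new "style" statistics
        new_sig = sig + torch.randn_like(sig) * sig_unc

        amp = (amp - mu) / sig * new_sig + new_mu          # restyle amplitude only
        return torch.fft.ifft2(torch.polar(amp.clamp_min(0.0), phase)).real
```

Such a module would simply be inserted after a chosen backbone stage, e.g. `feat = amplitude_transfer(feat)`, which is what makes the augmentation plug-and-play.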
In summary, the proposed method balances both performance and efficiency, and addresses real-world challenges in autonomous driving, underscoring its practical value.
[1] Xu, Qinwei, et al. A Fourier-based framework for domain generalization. CVPR, 2021.
[2] Kim, Minyoung, et al. Domain generalisation via domain adaptation: An adversarial Fourier amplitude approach. 2023.
The authors propose the Fourier Cross-View Learning (FCVL) framework, which includes Fourier Hierarchical Augmentation (FHiAug), an augmentation strategy in the frequency domain that boosts domain diversity, and a Fourier Cross-View Semantic Consistency Loss that helps the model learn more domain-invariant features from adjacent perspectives. According to the authors, this is the first study to explore generalizable multi-camera 3D object detection with a single source.
Questions for Authors
N/A
Claims and Evidence
Yes, the authors provide extensive experimental results to demonstrate the claims.
Methods and Evaluation Criteria
Yes.
Theoretical Claims
Yes, I have checked the correctness of the proofs in the supplementary material.
Experimental Design and Analysis
Yes. For issues with the experiments, please refer to the weaknesses below.
Supplementary Material
Yes, I reviewed all the supplementary material.
Relation to Prior Literature
The authors propose a new problem setting, which can be a good contribution to the literature. However, I have doubts about the relationship between the authors' method and the problem setting.
Essential References Not Discussed
The authors have discussed most of the related work.
Other Strengths and Weaknesses
Strengths: This paper introduces a new problem setting: generalizable multi-camera 3D object detection with a single source, which I believe is highly important. Moreover, the author's approach of applying augmentation in the frequency domain is quite novel. The experimental results also show performance improvements. Overall, I am inclined to accept this paper.
Weakness:
- I find the FHiAug method quite novel. However, I believe it is a relatively general technique for RGB images. In contrast, the authors claim to be working on a new task: generalizable multi-camera 3D object detection. In my view, FHiAug has little to do with multi-camera or 3D detection specifically; rather, it is a more general method applicable to RGB images. Therefore, I do not see a clear connection between the proposed method and the novelty of the task itself. The authors should: (1) establish the relevance of their method to the multi-camera 3D detection task and (2) conduct experiments on tasks like 2D detection to demonstrate its broader applicability.
- The Fourier Cross-View Semantic Consistency Loss also does not seem to have a clear connection to the Fourier space; it appears to be a loss function applicable to various augmentation methods. I believe the authors should similarly (1) establish the relationship between this consistency loss and FHiAug and (2) apply the loss to other augmentation methods to validate its effectiveness.
- Although the experimental results show some improvements, the gains over the current state-of-the-art methods seem quite limited. For example, when using BEVFormer, the improvement is less than 1%. Given that the main contribution of the paper is FHiAug, I believe the authors should also provide results without the consistency loss, using only FHiAug, and compare them with existing methods to better demonstrate the effectiveness and improvements brought by the proposed approach.
- Although NDS is indeed a very important metric on nuScenes, I believe AP remains crucial for the detection task. I hope the authors can also provide AP metrics to specifically evaluate the model's capability in 3D bounding box prediction.
Other Comments or Suggestions
No, please refer to the above weakness part.
Thanks for your acknowledgment of our approach, which is truly encouraging! We have addressed all the comments and incorporated additional experimental results to further validate our approach. We sincerely appreciate your contributions to help elevate the quality of this submission. All the tables are put here (https://drive.google.com/file/d/19yp9tYUu7XV-R4V69FW-Nzix8lZ8Kg5s/view?usp=sharing). Zoom in for better viewing.
W1: I find the FHiAug method quite novel.
We are deeply grateful for your acknowledgment of the novelty of FHiAug.
(1) establish the relevance of the proposed method to the multi-camera 3D detection task:
Relevance 1: The proposed FCVL framework leverages the cross-view consistency inherent in multi-camera 3D detection input to enhance generalization. By introducing the cross-view consistency loss, the model is forced to learn domain-invariant features that preserve semantic alignment across camera perspectives. However, its effectiveness is limited under the single-domain setting due to restricted feature diversity. We therefore propose FHiAug to alleviate the bias in single-domain representations. As shown in Figure 2 of the paper, FHiAug expands domain diversity, forcing the model to learn from different feature distributions, and it also expands the quantity and diversity of cross-view sample pairs, enabling the consistency loss to explore semantic alignment between adjacent perspectives more effectively. Together, these components allow the FCVL framework to achieve generalizable multi-camera 3D object detection from a single source.
Relevance 2: Compared to traditional 2D tasks, acquiring and annotating multi-camera 3D detection datasets is quite expensive. This method achieves remarkable improvements for 3D detection models at a relatively low cost without relying on large-scale annotated data. This solution offers a computationally efficient approach to address real-world challenges in autonomous driving, underscoring its practical value.
(2) Validating broader applicability: We additionally conduct experiments on the 2D detection task. Following the same paradigm of generalizing from a single domain to multiple domains, we train on the daytime-sunny set and test on the other four domains with different weather conditions at different times of day. As shown in Table 1 in the link, our method also effectively improves generalization for 2D detection.
W2:
(1) Connection to the Fourier space: Since semantic information is carried by the phase components, our semantic consistency loss is computed by extracting semantic information from the phase. We evaluate the semantic consistency of samples based on their phase components; please kindly refer to Equations 11-13 in our manuscript (an illustrative sketch is also given at the end of this response).
(2) Relationship between this consistency loss and FHiAug: The proposed FCVL framework follows a "domain diversity first, then domain invariance" paradigm. FHiAug expands the data distribution at both the image and feature levels, while the semantic consistency regularization enables the model to learn more domain-invariant representations.
(3) Applying the loss to other augmentation methods: To validate this, we combine the consistency loss with DSU; the results are detailed in Table 2 in the link. The loss, when combined with other augmentation methods, further enhances generalization. Under the same consistency-loss constraint, our FCVL framework still demonstrates clear advantages.
In conclusion, the proposed framework accommodates the specificity of multi-camera 3D detection while remaining extensible to other vision tasks. We leave extending FCVL to general vision tasks for future work.
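The sketch below illustrates how a phase-based consistency term could be computed for one anchor, one cross-view positive, and a set of negatives. It is a hedged stand-in for Equations 11-13 (which we do not reproduce here), using an InfoNCE-style formulation over phase-derived descriptors; the descriptor choice and temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def phase_consistency_loss(anchor_feat, pos_feat, neg_feats, tau=0.1):
    """Contrast phase-derived descriptors of cross-view instances.

    anchor_feat, pos_feat: (C, H, W) features of the same instance in adjacent views
    neg_feats: (K, C, H, W) features of different-category instances from other views
    """
    def phase_descriptor(x):
        ph = torch.fft.fft2(x).angle()          # phase components carry the semantics
        return F.normalize(ph.flatten(), dim=0)

    a = phase_descriptor(anchor_feat)
    p = phase_descriptor(pos_feat)
    n = torch.stack([phase_descriptor(f) for f in neg_feats])   # (K, D)

    pos_logit = (a @ p).unsqueeze(0) / tau       # similarity to the cross-view positive
    neg_logits = (n @ a) / tau                   # similarities to the negatives
    logits = torch.cat([pos_logit, neg_logits]).unsqueeze(0)
    target = torch.zeros(1, dtype=torch.long)    # the positive should win the softmax
    return F.cross_entropy(logits, target)
```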
W3 & W4: Thank you for the comments. The proposed FCVL achieves SOTA results with average performance improvements of 0.86%-2.47% over eight domains compared with five other methods, across two distinct datasets and four different frameworks. The consistently significant improvements across multiple experimental setups demonstrate the strong adaptability of our method: it seamlessly adapts to diverse scenarios and maintains superior, stable performance. Autonomous driving systems, as safety-critical systems, require consistent performance across diverse scenarios. We further list the results for large vehicles. Because training data for large vehicles is scarce, detecting them is more challenging than detecting common cars. We observe consistently significant improvements on large vehicles; in particular, our approach achieves +10.8% for trailer, +8% for truck, and +7.1% for bus. Please refer to Table 3 in the link.
Furthermore, we report mAP results in Table 4 in the link. Our method maintains superior performance across multiple frameworks when using FHiAug alone, and the full FCVL still achieves SOTA results on the mAP metric.
The author rebuttal has addressed my concerns, and I will keep my original weak accept rating.
Thanks again for the time and effort you have dedicated to reviewing our manuscript! Your insightful feedback has been valuable in enhancing the quality of our work!
The authors propose a novel generalizable multi-camera 3D object detection framework using Fourier Cross-View Learning. Via the proposed Fourier Hierarchical Augmentation and Semantic Consistency Loss across views, this work consistently improves the generalization ability of previous methods over multiple datasets. The extensive experiments support the authors' claims, and the real-world demonstration shows the robustness of the proposed method in autonomous driving scenes.
Questions for Authors
- Since the method can improve the detector's generalization ability, how does a detector trained with the proposed framework on nuScenes/AV2 perform when tested on AV2/nuScenes? This would further strengthen the method for domain generalization across datasets (scenes and cameras).
Claims and Evidence
The claims of this work are Fourier Hierarchical Augmentation (FHiAug) and Fourier Consistency Loss. Among them, FHiAug can be further broken into Frequency Jittering (Amplitude and Phase) and Amplitude Transfer.
- Frequency Jittering (Amplitude & Phase): It is supported by the ablation results (Tab. 4) and visualization (Fig. 7 & 9), which shows it can change the input image appearance and help the overall performance.
- Amplitude Transfer: It is supported by the ablation results (Tab. 4) and visualization (Fig. 8 & 9), which shows it can further change the input image appearance through extracted image features and keep improving the overall performance.
- Consistency loss: It is supported by ablation results (Tab. 4) showing its benefit.
Methods and Evaluation Criteria
In general, the method is very novel and interesting. FHiAug successfully generates more diverse training samples. With the cross-view consistency loss, the model manages to learn more generalizable features for multi-camera 3D object detection.
For the benchmarks, nuScenes-C is widely used and can successfully test the generalization ability of the methods. However, the results for Argoverse 2 (AV2), i.e., Table 3, are missing an explanation of how the City and Cloudy settings are defined.
Theoretical Claims
All the proofs and theoretical claims look fine except for the second paragraph of the introduction, where the authors claim, "however, directly applying these approaches to BEV-based tasks introduces several challenges. First, BEV representations are generated by projecting multi-view 2D features using real-world physical constraints, which limits the use of strong geometric transformations, such as 270-degree rotations, as they would disrupt the spatial consistency of the BEV space."
This seems to me to be a logical error. The augmentations, including the proposed FHiAug, are all applied to the "images," not the BEV features. The claim here therefore does not convince me that previous methods are unsuitable for multi-camera 3D object detection.
Experimental Design and Analysis
The experimental design is well structured, and the experiments are extensive, covering five methods across two challenging datasets.
The analyses seem too short in the main paper. The efficiency analysis could be moved to the supplementary section, while the authors should focus more on the ablation studies. Tab. 4 also needs to be structured better so that readers can easily understand which rows to compare and what the takeaway messages are.
Supplementary Material
The supplementary materials are good and cover the theoretical analysis, algorithms, and an introduction to 2D data augmentation (which should be moved to the related work in the main paper), as well as additional quantitative and qualitative results.
Relation to Prior Literature
I believe this work has a broader scientific impact, given its novel idea of generating diverse training data through the FFT (Fig. 2) and the real-world demo (Fig. 5).
Essential References Not Discussed
No. This paper does not have essential references that are not discussed.
Other Strengths and Weaknesses
Strengths:
- The idea of generating diverse training samples via FFT is novel and interesting.
- Cross-view consistency is intuitive and effective.
Weakness:
- The related work is placed in Sec. 4, which is not a common writing order, and the 2D data augmentation discussion is placed in the supplementary material when it should be part of the related work.
- The proposed domain generalization augmentation is only applied to perspective images/features. This weakens the motivation that traditional 2D data augmentation won't work on BEV features.
- The explanation of how the City and Cloudy settings for AV2 are generated is missing.
- The ablation studies (Tab. 4) are hard to understand. It should be revised.
Other Comments or Suggestions
In general, I love the idea and the proposed framework. Yet the unusual writing order and the structure of some tables make the paper look less professional; this should be improved.
We are pleased that the reviewer found our paper novel, interesting and effective. Thanks very much for your acknowledgment, which is truly encouraging! We have addressed all the comments and further improved the manuscript. We are deeply grateful for your contributions to help elevate the quality of this submission.
W1: To enhance readability, we have reorganized the paper by moving the related work to Sec. 2. Besides, we have added a subsection in Sec. 2 to systematically review existing 2D augmentation techniques.
W2: The proposed domain generalization augmentation is only applied to perspective images/features. This weakens the motivation that traditional 2D data augmentation won't work on BEV features.
We apologize for any confusion.
The BEV representation is constructed by mapping 2D features from the surrounding camera views into 3D space through physics-aware operations such as depth estimation. Although traditional 2D data augmentations are applied to images, when these augmented images are projected into BEV space, the artifacts introduced by the augmentation degrade the quality of the BEV features.
Firstly, strong geometric transformations (e.g., large-angle rotations, translations) can no longer be freely applied. Such transformations on 2D images would violate the spatial consistency between adjacent cameras, distorting the target's position or orientation in BEV space and degrading the perception system's reliability. This exposes the limitation of common geometric augmentation in 3D perception: geometric transformations must adhere to the physical constraints imposed by multi-view geometry. We have conducted experiments to demonstrate this point. As shown in the table, naively applying strong geometric augmentations does not improve, and can even hurt, performance (a toy numerical illustration of the projection mismatch follows the table).
Secondly, style transfer techniques replace the original image statistics with those of the target style, which causes interference between style and content and distorts content features. If the 2D features are impaired, the projected BEV features are also affected, ultimately hurting 3D detection performance. Our method, in contrast, decouples style manipulation from content preservation, effectively avoiding this limitation.
Compared with these common 2D augmentations, the key advantage of our method is its ability to maximize sample diversity under physical constraints while preserving content integrity.
| Model | Clean | OOD Avg. |
|---|---|---|
| BEVDet | 0.3880 | 0.2017 |
| +strong geo | 0.3530 | 0.1749 |
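The toy example below (our own illustration, with made-up intrinsics and a single 3D point) shows why an in-plane image rotation breaks the projection geometry that BEV lifting relies on: back-projecting the rotated pixel with the original camera model no longer recovers the original 3D point.

```python
import numpy as np

# Hypothetical pinhole camera: intrinsics K and a 3D point in camera coordinates (metres).
K = np.array([[1000.0, 0.0, 800.0],
              [0.0, 1000.0, 450.0],
              [0.0, 0.0, 1.0]])
X_cam = np.array([2.0, 1.0, 20.0])

# Project the point with the original camera model.
uvw = K @ X_cam
u, v = uvw[:2] / uvw[2]

# Apply a "strong" 2D augmentation: rotate the image by 30 degrees about the principal point.
theta = np.deg2rad(30.0)
R2d = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
c = K[:2, 2]
u_rot, v_rot = R2d @ (np.array([u, v]) - c) + c

# A BEV lifter that still uses the ORIGINAL intrinsics back-projects the rotated pixel
# along the wrong ray, so the recovered point no longer matches X_cam.
depth = X_cam[2]
X_rec = depth * np.linalg.inv(K) @ np.array([u_rot, v_rot, 1.0])
print("original 3D point:             ", X_cam)
print("recovered after naive rotation:", X_rec.round(2))   # roughly [1.23, 1.87, 20.0]
```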
W3: Missing explanation of the City and Cloudy settings.
Thank you for pointing this out; we have added more details to the manuscript. Argoverse 2 contains driving scenarios from six major U.S. cities (Miami, Washington D.C., and so on), under various weather conditions such as sunny and cloudy days. To adhere to the single-domain-to-multi-domain generalization paradigm, we take sunny-day data from Miami as the single-domain training set, sunny-day data from the other cities (with diverse urban road structures) as the first OOD test set (City), and cloudy (dim-lighting) data from the other cities as the second OOD test set (Cloudy).
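For readers who want a concrete picture of the split, the snippet below sketches how such a partition could be expressed. The metadata file and its `city`/`weather` columns are hypothetical placeholders for whatever per-log annotations were actually used; AV2 does not ship a ready-made weather label.

```python
import pandas as pd

# Hypothetical per-log metadata: columns log_id, city, weather.
logs = pd.read_csv("av2_log_metadata.csv")

train       = logs[(logs.city == "MIA") & (logs.weather == "sunny")]   # single-source training set
test_city   = logs[(logs.city != "MIA") & (logs.weather == "sunny")]   # "City" OOD test set
test_cloudy = logs[(logs.city != "MIA") & (logs.weather == "cloudy")]  # "Cloudy" OOD test set
```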
W4: The ablation studies (Tab. 4) are hard to understand. It should be revised.
Thanks for your suggestion! More ablation studies and analysis have been moved into the body of the paper. The ablation studies include (1) the effects of the different components of FCVL and (2) the effects of different insertion positions of the feature-level Amplitude Transfer (this subsection has been moved from Appendix D.4 to the main body of the paper).
We have also revised Table 4 to improve its readability. The new table is available here (https://drive.google.com/file/d/1BYMOE_trRM3vPfBfyxjt1dbW75p13Igv/view?usp=sharing). We hope the revised table helps readers better understand the effect of each module in FCVL.
Q: Since the method can improve the detector's generalization ability, how does a detector trained with the proposed framework on nuScenes/AV2 perform when tested on AV2/nuScenes?
Thanks for your constructive suggestion, which further strengthens our validation. As the number of surround-view cameras differs (nuScenes: 6 cameras, AV2: 7 cameras), we keep six of AV2's cameras (ring_front_center, ring_front_left, ring_front_right, ring_side_left, ring_side_right, ring_rear_left) and align the Argoverse coordinate system to nuScenes. As the labeled categories of the two datasets are quite different, we focus on two common categories (car and pedestrian) and report mAP. The results below demonstrate that our method achieves clear improvements in cross-dataset generalization.
| Model | mAP (train: nuScenes, test: AV2) |
|---|---|
| BEVDet | 0.1020 |
| +FCVL | 0.1246 (+2.26%) |
Thanks to the authors' effort in the rebuttal. The authors address all my concerns, and I am willing to increase the rating to 4. The experiments of cross-dataset results are valuable, and the reorganization of tables and writing are necessary for the final version of the manuscript.
Thanks for raising the score! Your insightful comments are valuable in enhancing the quality of our work. Thanks again for the time and effort you have dedicated to reviewing our manuscript.
Aiming to improve generalization when only single-source data is available for training, this paper proposes the Fourier Cross-View Learning (FCVL) framework. FCVL leverages the Fourier transform to separate high-level and low-level information within the image and then modifies this information appropriately to improve generalization. Overall, the framework outperforms the other models compared in the paper, demonstrating its generalization capability.
Questions for Authors
NA
Claims and Evidence
Yes
Methods and Evaluation Criteria
This paper uses the nuScenes dataset as the training set and the NuScenes-C dataset as the test sets. Why not use the NuScenes-C dataset as the training set and transfer to a simpler dataset? Are there any relevant experiments?
In addition, parameters d1 and d2 in Formula 1 are not introduced.
Theoretical Claims
NA
Experimental Design and Analysis
The proposed Fourier Hierarchical Augmentation is similar to image style transfer with content restrictions or to diffusion-based methods. Why not use an existing style-transfer model, instead of using fixed parameters, to change the style of the image?
Supplementary Material
No
Relation to Prior Literature
The paper is related to multi-view 3D object detection. The proposed model leverages the Fourier transform to separate high-level and low-level information within the image, which achieves better generalization ability.
Essential References Not Discussed
No
Other Strengths and Weaknesses
Strengths: (1) Sufficient experiments have been conducted to demonstrate the performance of the method. (2) The proposed approach achieves better performance than the compared methods in the experiments.
Other Comments or Suggestions
The overall structure of the paper is rather confusing, and the writing needs to be improved. The Related Work should be introduced before the methodology, and the Related Work section is not detailed enough.
Thanks for your positive and constructive feedback! We have addressed all the comments and incorporated additional experimental results to further validate our approach.
Q1: This paper uses the nuScenes dataset as the training set and the NuScenes-C dataset as the test sets. Why not use the NuScenes-C dataset as the training set and transfer to a simpler dataset? Are there any relevant experiments?
The objective of this paper is to generalize models trained on a single domain (e.g., nuScenes) to multiple diverse application scenarios (e.g., NuScenes-C, which includes eight OOD test scenarios). Such cross-domain generalization is more challenging and allows us to rigorously validate the performance of the proposed algorithm under significant domain shifts.
Besides, using the NuScenes-C dataset as the training set and transferring to a simpler dataset falls within the research paradigm of generalizing from multiple domains to an unknown target domain. We conducted experiments under this paradigm; as shown in the table, our approach still enhances generalization performance in this setting.
| Model | NDS (train: NuScenes-C, test: NuScenes) |
|---|---|
| BEVDet | 0.1830 |
| +FCVL | 0.1993 (+1.63%) |
Q2: d1 and d2 denote the height and width of the image, respectively. Thank you for bringing this detail to our attention; we have revised the manuscript accordingly.
Q3: The proposed augmentation is similar to style transfer with content restrictions or to diffusion-based methods. Why not use an existing style-transfer model, instead of fixed parameters, to change the style of the image?
Compared with existing methods, FHiAug has advantages in three aspects: superior content integrity preservation, style diversity flexibility, and high efficiency.
Firstly, FHiAug better preserves content integrity. Style transfer techniques replace the original image statistics with those of the target style in the pixel domain, which blurs the boundary between style and content and distorts important features; during training, much of the data is unintentionally simplified, leading to worse performance. Conversely, FHiAug operates in the frequency domain and decouples style manipulation from content preservation, effectively avoiding the interference between style and content that occurs in the spatial domain. Both experimental comparisons against other style transfer methods (Table 1 in the paper) and theoretical analysis validate the superiority of our approach.
Secondly, FHiAug has more flexibility in expanding style diversity. The hyperparameters of FHiAug are not entirely fixed: during each iteration, new style statistics are randomly sampled from a Gaussian distribution (please kindly refer to Equations 6-9 in the paper), which ensures style diversity throughout training. In contrast, diffusion-based techniques require conditional control, specific training data, or tailored architectures to achieve style generation for autonomous driving scenarios.
Thirdly, FHiAug exhibits better efficiency and extensibility. Diffusion-based techniques demand substantial generation time and additional storage space for the generated data. If an online generation approach is adopted, frequent calls to the generative model during training would significantly increase the computational overhead, making such methods impractical for training complex 3D detection models. FHiAug, in contrast, is a plug-and-play online augmentation approach that can be extended to other frameworks with high flexibility.
To further validate the advantages of FHiAug, we add an efficiency and performance analysis of a diffusion-based method [1]. At the same resolution, the generation method takes over ten times longer than FHiAug. We also generate synthetic data following [1], incorporate it into the original dataset, and train BEVDet on all the data. As can be seen from the generated images (https://drive.google.com/file/d/1M4gzNi_wVNPHvtWy_l0ddS5qnBqxpFLR/view?usp=sharing), some objects are distorted and the diversity of the synthetic data is quite limited. Compared with the diffusion-based method, FHiAug achieves superior generalization performance (+4.19%) far more efficiently.
| Model | Resolution | Time consumed (s) | OOD Avg. |
|---|---|---|---|
| BEVDet | - | - | 0.2017 |
| FHiAug only | 256×704 | 0.107 | 0.2579 (+4.19%) |
| MagicDrive [1] | 256×704 | 1.5 | 0.2160 |
[1] Gao, Ruiyuan, et al. MagicDrive: Street View Generation with Diverse 3D Geometry Control. ICLR 2024.
Q4: The overall structure of the paper is confusing. The Related Work should be introduced before the methodology, and the Related Work is not detailed enough.
We have further modified the structure of the paper and placed the related work section before the methodology. We also provide an additional related work section in the Appendix with more detailed descriptions and broader coverage of related work.
Thanks for the authors' rebuttal. I will keep the rating.
Thanks again for the time and effort you have dedicated to reviewing our manuscript. Your constructive feedback has been valuable in enhancing the quality of our work.
This work proposes a strategy to boost domain diversity in the frequency domain, using Fourier Hierarchical Augmentation (FHiAug) and Fourier Consistency Loss for generalization of multi-camera 3D object detection trained on a single source domain to multiple domains. It received final scores of weak accept, accept, accept and weak accept from four reviewers. The reviewers appreciated the novelty of the proposed approach, its robust performance in improving cross-domain generalization under different settings and the comprehensive experiments conducted. The reviewers' concerns were addressed by the authors' responses. The AC concurs with the reviewers' consensus and recommends acceptance. The authors are encouraged to make the changes that they have promised in the rebuttal in the final camera ready version of the paper and its supplement.