PaperHub
Overall score: 5.8 / 10
Decision: Poster · 4 reviewers
Ratings: 5, 7, 5, 6 (min 5, max 7, std 0.8)
Confidence: 4.3
Correctness: 3.3 · Contribution: 3.0 · Presentation: 2.5
NeurIPS 2024

Unveiling the Hidden: Online Vectorized HD Map Construction with Clip-Level Token Interaction and Propagation

OpenReview · PDF
Submitted: 2024-05-01 · Updated: 2024-11-06
TL;DR

This paper introduces a novel clip-level pipeline that explicitly unveils invisible map elements.

Abstract

Keywords
vectorized HD map, clip-level pipeline, clip-level token, interaction, propagation

Reviews and Discussion

Official Review
Rating: 5

This paper aims to improve vectorized HD map construction for autonomous driving. Inspired by the global feature association in traditional offline HD mapping, the proposed MapUnveiler processes input frames in a clip-based manner and hopes to resolve occlusions using information from previous frames. Built upon MapTRv2, MapUnveiler introduces clip tokens together with the Inter-clip and Intra-clip Unveiler modules to update the map queries with temporal information. Experiments on the nuScenes and Argoverse2 datasets demonstrate the superior performance of the proposed method, especially on highly-occluded scenes.

Strengths

  1. The idea of incorporating and aggregating clip-level information for online vectorized HD mapping is reasonable and is more akin to how humans drive. The proposed method has more thoughtful designs than early works such as StreamMapNet to better handle occlusions and incorporate long-range information.

  2. The proposed MapUnveiler obtains state-of-the-art results in various experimental settings. The improvements over previous methods are especially prominent in the large 100mx50m setting and the highly-occluded scenes collected by the authors.

  3. Extensive ablation studies enumerate the choices of almost all hyper-parameters or model components, which helps better understand and break down each element's contributions.

Weaknesses

  1. The clarity of the method description is poor, making it very hard to thoroughly understand the proposed architecture. Details are discussed below:

    • The method explanation is not self-contained: i) The Inter-clip Unveiler section refers to the TTM and directly skips all details. There is no information at all about how the compact memory token is generated from the denser map queries; ii) The "loss" section refers to MapTRv2 and again skips all details. The authors should not assume that the general audience is aware of the concrete details of TTM and MapTRv2. The core formulation of these components should be elaborated in text or equations, while full details can go to the appendix.
    • The definitions of the temporal window T and the stride S are unclear. Based on the text descriptions and the common definition of stride, my understanding of "T=3 and S=2" is that "each clip has 3 frames, and every two consecutive frames have a temporal gap of 1." However, the symbols in L177-178 seem to suggest other meanings of T and S.
    • The description of the inference mechanism is also vague. Is the MapUnveiler executed per frame or per clip? Figure 2 seems to suggest the per-clip inference where the predictions of T frames are obtained together. If this is the case, does it hurt the actual response frequency?

    In short, Section 3 of the paper lacks significant details, and I cannot properly understand MapUnveiler's exact formulation. Given that the authors answer "No" to Question 5 of the Checklist, I have to raise concerns about the paper's reproducibility.

  2. There is no detail on how the pre-training and fine-tuning are conducted. Do you initialize the MapNet by training MapTRv2? If this is the case, how are the training epochs split for the MapNet pre-training and the end-to-end MapUnveiler fine-tuning? If the 24/6 epochs for nuScenes/Argo2 are only for the fine-tuning stage, then the comparisons in the main table are unfair, as other methods in the table have not fully converged.

  3. The main comparison results are incomplete. Most previous papers provide the nuScenes results of both short and long training schedules, but the main table only presents short-schedule results. Considering the last question about the pre-training and fine-tuning, the authors should complement the table with long-schedule results to show that MapUnveiler can obtain consistent performance boosts when all the methods are fully converged. This concern is backed up by the fact that MapUnveiler's improvement is much smaller on Argo2 compared to nuScenes -- based on my empirical experience, previous methods like MapTRv2 and its followups converge faster on Argo2, and training for 6 epochs is close to convergence. This probably suggests that the large performance gaps on nuScenes come from unfair training settings.

  4. Your interpretation of StreamMapNet and SQD-MapNet's Argo2 training epochs is wrong. These two methods employ a different frame sampling strategy at training time compared to MapTRv2, but their effective number of training samples is the same as MapTRv2. Therefore, the claim about the "longer training schedules" in the main table's caption is misleading.

  5. The venues in the main table are not accurate. HIMap[49] and MGMap[24] are accepted by CVPR2024, and the information was already available at the time of NeurIPS submission. Furthermore, a recent HD map construction method, MapTracker[A], also studies temporal modeling and should be very relevant, but it is missing in the discussion and related works.

    [A] MapTracker: Tracking with Strided Memory Fusion for Consistent Vector HD Mapping, arXiv:2403.15951

Questions

The paper studies an important problem (temporal information association) in online HD map construction and proposes a reasonable method. However, the poor clarity and the potentially incomplete/unfair comparison results raise serious concerns about the paper's quality and reproducibility. My current rating is reject, and I will consider changing the score if the main weaknesses are properly addressed.

Limitations

The limitations and broader impacts are adequately discussed in the paper.

Author Response

We deeply appreciate your thorough and insightful feedback. We believe we can enhance the paper's quality and present more concrete results based on your comments. Below are point-by-point responses to your comments, which will be included in the revised paper.


W1-1. Explanation of TTM and MapTRv2

Thanks for your valuable comment. Initially, we tried to avoid overclaiming in terms of model architecture, so we opted to skip the explanation of TTM and MapTRv2. However, we fully understand that this is not good for general readers. Following your advice, we will improve the presentation as follows.

i) Inter-clip Unveiler

In Line #172, we will introduce the following equation:

$$U^{read} = \mathrm{Read}(U_{t-2S:t-S}^{memory}, Q^{map}) = S_{N_c}\big(\,[\,U_{t-2S:t-S}^{memory} \,\|\, Q^{map}\,]\,\big)$$

Here, the notation $[\,U_{t-2S:t-S}^{memory} \,\|\, Q^{map}\,]$ denotes the concatenation of the two elements.

In Line #178, we will introduce the following equation:

$$U_{t-S:t}^{memory} = \mathrm{Write}(U_{L}^{clip}, U_{L}^{map}, U_{t-2S:t-S}^{memory}) = S_{M}\big(\,[\,U_{L}^{clip} \,\|\, U_{L}^{map} \,\|\, U_{t-2S:t-S}^{memory}\,]\,\big)$$

If the tokens within the memory are not re-selected in subsequent steps, they are removed from the memory; the selection mechanism is determined through learning.
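For illustration, a minimal PyTorch-style sketch of these Read/Write operations is given below. This is our own sketch, not the actual implementation: the `TokenSummarizer` module, its importance-weighted pooling, and all dimensions are assumptions loosely following the Token Turing Machine (TTM) summarization idea.

```python
import torch
import torch.nn as nn

class TokenSummarizer(nn.Module):
    """S_k: summarize a variable number of input tokens into k output tokens
    (assumption: importance-weighted pooling in the spirit of TTM)."""
    def __init__(self, k: int, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, k)  # one importance map per output token

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, C) -> weights: (B, N, k), normalized over the N inputs
        w = self.score(tokens).softmax(dim=1)
        # each of the k outputs is a weighted mixture of the N input tokens
        return torch.einsum("bnk,bnc->bkc", w, tokens)

class InterClipMemory(nn.Module):
    """Read/Write over the clip-level memory, mirroring the two equations above."""
    def __init__(self, dim: int, n_read: int, n_mem: int):
        super().__init__()
        self.read_summarizer = TokenSummarizer(n_read, dim)   # S_{N_c}
        self.write_summarizer = TokenSummarizer(n_mem, dim)   # S_M

    def read(self, memory: torch.Tensor, map_queries: torch.Tensor) -> torch.Tensor:
        # U^read = S_{N_c}([ memory || map queries ])
        return self.read_summarizer(torch.cat([memory, map_queries], dim=1))

    def write(self, clip_tokens, map_tokens, memory) -> torch.Tensor:
        # U^memory_{t-S:t} = S_M([ clip tokens || map tokens || old memory ])
        return self.write_summarizer(torch.cat([clip_tokens, map_tokens, memory], dim=1))
```

Tokens that contribute little weight to the summarized output are effectively dropped from the memory, matching the learned selection behavior described above.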

ii) Loss

To provide detailed information about the losses, we will add the following equations:

$$\mathcal{L}_{one2one} = \lambda_c^F \mathcal{L}_{cls}^F + \lambda_p^F \mathcal{L}_{p2p}^F + \lambda_d^F \mathcal{L}_{dir}^F$$

$$\mathcal{L}_{dense} = \alpha_d \mathcal{L}_{depth} + \alpha_b \mathcal{L}_{BEVSeg} + \alpha_p \mathcal{L}_{PVSeg}$$

$$\mathcal{L}_{Frame\text{-}level\,MapNet} = \beta_o \mathcal{L}_{one2one} + \beta_d \mathcal{L}_{dense}$$

$$\mathcal{L}_{MapUnveiler} = \lambda_c^M \mathcal{L}_{cls}^M + \lambda_p^M \mathcal{L}_{p2p}^M + \lambda_d^M \mathcal{L}_{dir}^M$$

where $\mathcal{L}_{one2one}$ is used for the frame-level MapNet with $\lambda_c^F=2$, $\lambda_p^F=5$, $\lambda_d^F=0.005$; $\mathcal{L}_{dense}$ is an auxiliary loss using semantic and geometric information with $\alpha_d=3$, $\alpha_b=1$, $\alpha_p=2$; and $\mathcal{L}_{MapUnveiler}$ is used for MapUnveiler with $\lambda_c^M=2$, $\lambda_p^M=5$, $\lambda_d^M=0.005$.
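For concreteness, the weighted sums above can be assembled as in the following sketch. The individual loss terms are placeholders for the MapTRv2-style classification, point-to-point, direction, depth, and segmentation losses, and the values of $\beta_o$ and $\beta_d$ are our own assumption (set to 1.0 here) since they are not stated in this response.

```python
def frame_level_mapnet_loss(losses: dict, beta_o: float = 1.0, beta_d: float = 1.0) -> float:
    """L_FrameLevelMapNet = beta_o * L_one2one + beta_d * L_dense."""
    # L_one2one with lambda_c^F = 2, lambda_p^F = 5, lambda_d^F = 0.005
    l_one2one = 2.0 * losses["cls_F"] + 5.0 * losses["p2p_F"] + 0.005 * losses["dir_F"]
    # L_dense with alpha_d = 3, alpha_b = 1, alpha_p = 2
    l_dense = 3.0 * losses["depth"] + 1.0 * losses["bev_seg"] + 2.0 * losses["pv_seg"]
    return beta_o * l_one2one + beta_d * l_dense

def map_unveiler_loss(losses: dict) -> float:
    """L_MapUnveiler with lambda_c^M = 2, lambda_p^M = 5, lambda_d^M = 0.005."""
    return 2.0 * losses["cls_M"] + 5.0 * losses["p2p_M"] + 0.005 * losses["dir_M"]
```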

Due to the character limit in the rebuttal, we conceptually show the revised explanation. We will further detail TTM, MapTRv2, and each loss in the Appendix.


W1-2. The definitions of the temporal window T and the stride S

To provide a clear explanation, we will introduce the following equation.

$$C_k = \{ f_t \}_{t=kS+1}^{kS+T}$$

The notation $C_k$ represents the k-th clip, starting from k=0. Each frame is denoted by $f_t$, which represents the t-th frame among the consecutive frames. Therefore, when T=3 and S=2, we obtain clip sets such as $C_0=\{f_1, f_2, f_3\}$, $C_1=\{f_3, f_4, f_5\}$, ..., and $C_k=\{f_{2k+1}, f_{2k+2}, f_{2k+3}\}$.

Thus, the temporal stride S refers to the clip stride, not the frame stride.
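A minimal sketch of this clip construction (our own illustration; frame ids are 1-indexed as in the notation above) reproduces the clip sets listed above for T=3 and S=2:

```python
def build_clips(num_frames: int, T: int = 3, S: int = 2):
    """Return clips C_k = {f_t} for t = k*S + 1, ..., k*S + T."""
    clips, k = [], 0
    while k * S + T <= num_frames:
        clips.append(list(range(k * S + 1, k * S + T + 1)))
        k += 1
    return clips

print(build_clips(7))  # [[1, 2, 3], [3, 4, 5], [5, 6, 7]]
```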

Additionally, it would be interesting to also set a frame stride, as you mentioned. If we set the frame stride to 2 and the clip stride (S) to 1 (to see all frames), MapUnveiler achieves an mAP of 68.8% on nuScenes. This is lower than the model with frame stride=1 and T=1 (69.3%). We conjecture that the temporally nearest frames provide the richest information for unveiling maps because they have large spatially overlapping regions, while temporally distant frames can still be exploited effectively through the inter-clip Unveiler.


W1-3. Inference mechanism

MapUnveiler executes per clip. Thus, it can cause a response delay if the input frame rate is slower than 12.7 FPS (the inference speed of MapUnveiler), because the model waits until it collects T frames. To avoid this problem, we can set the clip stride (S) to 1 (at the cost of a slight performance degradation, as presented in Table 7). Alternatively, we can fill in the intermediate frames' results with the frame-level MapNet: during the response delay, we can directly construct an online map using map queries generated from the frame-level MapNet. In this case, however, the performance drops from 69.8% to 68.0%.
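A simplified sketch of this fill-in option is shown below. It is pseudocode-level: `frame_level_mapnet` and `map_unveiler` are hypothetical stand-ins for the two model stages, and at clip boundaries only the newest frame's prediction is emitted, whereas the actual per-clip inference produces predictions for all T frames.

```python
def online_inference(frame_stream, T: int = 3, S: int = 2):
    """Yield one map prediction per incoming frame: run the clip-level
    MapUnveiler every S-th frame once T frames are buffered, and fill the
    remaining frames with the frame-level MapNet to avoid response delay."""
    buffer, memory, since_clip = [], None, 0
    for frame in frame_stream:
        buffer.append(frame)
        since_clip += 1
        if len(buffer) >= T and since_clip >= S:
            preds, memory = map_unveiler(buffer[-T:], memory)  # clip-level refinement + memory update
            since_clip = 0
            yield preds[-1]
        else:
            yield frame_level_mapnet(frame)  # frame-level fill-in during the delay

# Hypothetical stubs so the sketch is runnable; the real stages return vectorized map elements.
def frame_level_mapnet(frame):
    return ("frame-level", frame)

def map_unveiler(clip, memory):
    return [("clip-level", f) for f in clip], memory

# list(online_inference(range(1, 8))) -> clip-level outputs at frames 3, 5, 7 (clips {1,2,3}, {3,4,5}, {5,6,7})
```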

W1-4. Open access code

We are planning to release source codes upon acceptance.


W2. Details on how the pre-training and fine-tuning are conducted

We end-to-end fine-tune MapUnveiler for 24/6 epochs from a pre-trained MapTRv2, which was itself trained for 24/6 epochs. To address this concern, we present the frozen fine-tuning setting in Table 4, but we understand it cannot fully resolve the concern. We therefore additionally present experimental results where we pre-train MapTRv2 for 12/3 epochs and then fine-tune MapUnveiler for 12/3 epochs, totaling 24/6 epochs. The results are given in G1 of the global response.


W3. Long training schedules

Thanks for your thorough feedback. We trained for 110/30 epochs, and MapUnveiler achieves an mAP of 70.6% and 72.9% on nuScenes and Argoverse2, respectively. As you commented, the performance improvement of our model is marginal in this setting. We think our model converges more quickly because we start training from a pre-trained frame-level MapNet. Even though our model lags behind HIMap (73.7% and 72.7%) in the long-training-schedule setting, we would like to kindly emphasize three points: (1) HIMap is heavy (9.7 FPS) and runs slower than ours (12.7 FPS); (2) HIMap requires more training time to fully converge, which may cause overfitting; (3) HIMap was published on arXiv only two months before we submitted our paper, so it stands as contemporaneous work.


W4. Caption in Table 1

Thanks for the correction. We will remove † in Table 1.


W5. HIMap[49], MGMap[24], and MapTracker[A]

Thanks for the thorough check. We will revise them as CVPR2024 and will add a discussion of MapTracker. We would like to kindly explain the reason for citing them as preprints: the official paper list of CVPR2024 was released on the same day as the deadline for NeurIPS2024 submissions.

Comment

Thanks for the detailed response. The added details clarify the architectural design, especially the clip-based inference mechanism and how T and S work. I believe these will also help other readers better understand your approach.

I appreciate the new results in the general response, which resolve my concern about the fairness of experiments. However, please make sure the results in G4 also follow a fair setting, or you should at least indicate the difference in training schedules in the table. Overclaiming the performance boost with an unfair setting under the hood will hurt the community.

With the above concerns addressed, I will increase my score from 3 to 5 (borderline accept). Reasons for not giving a higher score are mainly two-fold: 1) the clip-wise inference is a bit unnatural, making the output quality temporally inconsistent; 2) the initial writing clarity is very problematic, requiring a massive revision to improve the paper's quality.

If the paper is accepted, I hope the authors can carefully improve the clarity of the method description (W1), fix the discussion/reference of related works (W5), provide fair experimental results and clearly indicate the difference in training settings (W2,3).

Comment

We would like to express our sincere gratitude for your positive feedback and for raising your rating of our work from 3 to 5. We especially appreciate your thorough and professional comments, which have helped us improve the fairness of our experimental setup and enhance our presentation.

As you suggested, we will revise the relevant results presented in the comparison tables (Table 1 and G4) with the fairer setting (12/3 epochs of fine-tuning) in the final version. Additionally, we will revise the final version by incorporating all comments from the rebuttal, including clarifications of our approach (W1), discussions of the suggested related works (W5), and clarifications of the training settings (W2, W3).

Best regards,
The Authors

Official Review
Rating: 7

The authors propose a new approach for constructing vectorized high-definition maps that exploits temporal information across adjacent input frames. The model, which they call MapUnveiler, operates at the clip level and consists of an intra-clip unveiler, which generates vectorized maps for T frames, and an inter-clip unveiler, which uses a memory module to aggregate information between clips. The authors present results on two standard vectorized HD map construction benchmarks (nuScenes and Argoverse2) and demonstrate the model's superior quantitative performance relative to several previously proposed approaches. They also show several qualitative examples of how MapUnveiler can better handle occlusions in the input images.

Strengths

  • The paper is well-written and contextualized well within prior work.
  • The methodology is novel and well-motivated.
  • The results are strong on the two tested datasets, both quantitatively and qualitatively.
  • Many different analyses and ablations were included to justify the design decisions used within MapUnveiler and show its strengths.

Weaknesses

  1. The methods section is dense and a bit hard to read. The architecture figures help but are also a bit difficult to parse. It would be helpful to weave more intuition into the text.
  2. Claiming "-9.8%" is significant but "-6.0%" is comparable in the robustness to occlusion section seems a bit arbitrary (and potentially overstating MapUnveiler's performance, as a 6% drop is still considerable). I suggest the authors rephrase this sentence (and address similar claims in the paper).

There are several typos throughout the paper. I have enumerated some here, but encourage the authors to do a detailed proofread:

  • 127: With there
  • 129: mapnet -> MapNet
  • 161: bev -> BEV
  • 167 parenthesis
  • 192 backwards parenthesis
  • 294: In addition, if we choose too short

Questions

  1. Have the authors tried quantized models to reduce GPU memory? It could be interesting to see if the gains from larger window sizes outweigh the losses from quantization.
  2. The model still seems to struggle with some occlusions (a 6% drop from the standard split). Why do the authors think that is? Are these just very difficult cases or issues with the model?
  3. The one limitation that was discussed seems like it can be tested. How do randomly dropped intermediate frames affect model performance?

Limitations

Only one limitation is included. I encourage the authors to think through other potential limitations.

Author Response

We are particularly encouraged that the reviewer finds our method novel and well-motivated, and we highly appreciate your constructive comments and suggestions. Below are our responses to each of your queries, and we will include them in the revised paper.


W1. Weave more intuition into the text in 3. Method

We will include our motivation in the method explanation to make it easier to follow and understand our method. For this, we will revise the paper as below (we highlighted the new and revised descriptions in italics):

Clip Token Generator (Line #149-150)

... , respectively. To globally gather intra-clip map features, we opt for a naive cross-attention [38]. Through this step, we obtain compact clip-level map representations, enabling efficient intra-clip communication with small-scale features.

BEV Updater (Line #155)

... original BEV features. The main idea of the BEV Updater is to avoid heavy computation in spatio-temporal cross-attention. To achieve this, we do not directly communicate intra-clip BEV features, but instead decouple the spatial BEV features and the temporal clip tokens. We then update the spatial BEV features with the compact temporal clip tokens, effectively communicating spatio-temporal information at a reasonable computational cost. The updated BEV ...

Map Generator. (Line #161)

... the updated BEV features UlBEVU_l^{BEV}. Since the updated BEV features are spatio-temporally communicated, we directly extract map tokens UlmapU_l^{map}. Each map token represents a vectorized map element through a 2-layer Multi-Layer Perceptron (MLP). The map tokens ...
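To accompany these revisions, here is a rough sketch of the intra-clip flow as we would illustrate it (clip token generation via cross-attention, then a BEV update with the compact clip tokens). The module shapes, token count, and use of nn.MultiheadAttention are assumptions, not the actual implementation.

```python
import torch
import torch.nn as nn

class IntraClipUnveilerSketch(nn.Module):
    """Illustrative only: compact clip tokens gather intra-clip map information,
    and the dense BEV features are then updated by attending to those tokens,
    avoiding direct frame-to-frame BEV cross-attention."""
    def __init__(self, dim: int = 256, num_clip_tokens: int = 32, heads: int = 8):
        super().__init__()
        self.clip_tokens = nn.Parameter(torch.randn(num_clip_tokens, dim))
        self.token_gen = nn.MultiheadAttention(dim, heads, batch_first=True)   # Clip Token Generator
        self.bev_update = nn.MultiheadAttention(dim, heads, batch_first=True)  # BEV Updater

    def forward(self, map_queries: torch.Tensor, bev_feats: torch.Tensor):
        # map_queries: (B, T*Q, C) frame-level map queries stacked over the clip
        # bev_feats:   (B, H*W, C) flattened BEV features to be updated
        B = map_queries.size(0)
        clip_tok = self.clip_tokens.unsqueeze(0).expand(B, -1, -1)
        clip_tok, _ = self.token_gen(clip_tok, map_queries, map_queries)  # compact clip-level summary
        bev_upd, _ = self.bev_update(bev_feats, clip_tok, clip_tok)       # spatial BEV x temporal tokens
        # Refined map tokens are then decoded from bev_upd via the Map Generator (2-layer MLP head, omitted).
        return clip_tok, bev_upd
```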


W2. Rephrase some sentences

We are sorry; we understand that it may be an overstatement. We will rephrase Line 261 as follows:

, ~~while MapUnveiler shows comparable performance~~ *MapUnveiler also shows a performance degradation of 69.8%→63.8% (-6.0%), but it demonstrates a smaller performance gap compared to previous studies*.

Following your advice, we also tone down the following sentence (Line 242):

Although MapUnveiler incorporates temporal modules, we achieve a ~~fast~~ *reasonable* inference speed (12.7 FPS) compared to the frame-level MapNet (MapTRv2 [22], 15.6 FPS), surpassing the heavy state-of-the-art approach (HIMap [49], 9.7 FPS) in both performance and speed.


W. Typos

Thank you for your diligent efforts to enhance the quality of our paper. We will rectify the mentioned typos and strive to correct any other potential typos. We will thoroughly proofread the paper from the beginning.


Q1. Model quantization

That's an intriguing idea, and we hope to see the comparison. Unfortunately, we tried to implement the quantized models, but it was quite challenging to implement and evaluate the accuracy, memory usage, and speed of the quantized model. The primary reason behind this is that we developed our MapUnveiler using several custom addition and multiplication operators, which the FakeQuantize function does not support. Implementing it with TensorRT necessitates reconstructing the entire pipeline of our code, hence requiring additional time.

Based on our current knowledge, we can consider extreme quantization (4-bit) and the general solutions used in TensorRT (16-bit and 8-bit). The performance drop due to 4-bit quantization may be marginal for CNN architectures if we use advanced quantization algorithms such as [A]. However, transformer-based architectures currently lose more than 20% accuracy with 4-bit quantization [B]. As an alternative, if we adaptively combine FP16 and INT8 [C], we can accelerate the model by 1.87× with a marginal performance drop (about -1.19%). With this, we could boost MapUnveiler with a large window size (T=5) through quantization. However, the performance gain from a wide window size is marginal (mAP of 70.1% for T=5 vs. 69.8% for T=3), as shown in Table 7 of the main paper. We conjecture that our model effectively leverages long-term dependencies through the inter-clip Unveiler, so a large window size may not be necessary. Still, it would be interesting to see the gains of the main model from quantization.

We plan to implement the quantized model and include the results in the final version.

[A] QDrop, ICLR 2022

[B] RepQ-ViT, ICCV 2023

[C] https://github.com/DerryHub/BEVFormer_tensorrt
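As a much simpler starting point than a full TensorRT pipeline, INT8 dynamic quantization of the linear layers can be sketched with stock PyTorch as below. This is a generic, illustrative example on a toy module; it does not cover the custom operators mentioned above, and the accuracy impact on the real model would still need to be measured.

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer-style block; not the actual MapUnveiler decoder.
model = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 256)).eval()

# Post-training dynamic quantization: int8 weights for nn.Linear, activations quantized on the fly.
qmodel = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 100, 256)
with torch.no_grad():
    out = qmodel(x)
print(out.shape)  # torch.Size([1, 100, 256])
```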


Q2. The reason for struggling with some occlusions

We found that MapUnveiler may be unable to unveil some regions if those regions are occluded in every frame. We visualize an example in Figure 9 of the rebuttal PDF file of the global response. MapUnveiler initially predicts a rough boundary, which we highlighted in green. However, the region remains invisible across consecutive frames, and MapUnveiler eventually predicts it as having no boundary. As such, if the model cannot see a clear region in any frame, MapUnveiler may fail to recognize the occluded map information.


Q3. Randomly dropped intermediate frames

A very constructive suggestion. We evaluated three models with randomly dropped intermediate frames. Frames were dropped by converting the multi-camera images into black images. The experiment was conducted with drop rates of 20%, 10%, and 5%, and the results are given in Table 13 of the rebuttal PDF file of the global response. MapUnveiler is affected by dropped frames, but the performance degradation is reasonable compared to MapTRv2.
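The frame-dropping protocol described above (blacking out all camera images of a randomly chosen frame) can be sketched as follows; applying the drop per frame across all cameras is our reading of the setup.

```python
import random
import torch

def drop_frames(multicam_images, drop_rate: float = 0.1, seed: int = 0):
    """multicam_images: list over frames, each a tensor of shape (N_cams, 3, H, W).
    With probability drop_rate, replace all camera images of a frame with black images."""
    rng = random.Random(seed)
    return [torch.zeros_like(imgs) if rng.random() < drop_rate else imgs
            for imgs in multicam_images]
```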


L. Other potential limitations

One potential limitation is the case discussed in Q2: MapUnveiler is likely to fail to unveil roads that are fully occluded across all frames.

Additionally, MapUnveiler incurs a marginal additional computational cost (15.6 FPS → 12.7 FPS) but requires approximately twice the GPU memory (830.4 MB → 1614.9 MB) and nearly three times the parameters (76.4 MB → 213.9 MB) compared to the frame-level MapTRv2 model. We provide a comparison table in G3 of the global response.

Comment

Thank you for your substantial efforts in the overall rebuttal and in response to my specific comments. The changes you have proposed and the explanations of the results help considerably.

Thanks for explaining the difficulties with implementing a quantized model. I still think it would be valuable to see, so it would be great if you could try to get it working. That being said, I don't think it's critical to the paper.

I have read the other reviews + rebuttal, and maintain my score. However, I encourage the other reviewers to reconsider their assigned scores after the authors' rebuttal.

Comment

We sincerely appreciate your efforts in reviewing our responses. We are delighted to receive your positive feedback and support for our work.

Thanks,
The Authors

Official Review
Rating: 5

This paper proposes a clip-based vectorized HD map construction paradigm for processing long temporal sequences, in which occluded map elements are unveiled explicitly via efficient clip tokens. Through clip token propagation, MapUnveiler effectively utilizes long-term temporal map information by associating inter-clip information, propagating clip tokens rather than dense BEV features. Experiments demonstrate that MapUnveiler boosts performance on public benchmark datasets, including more challenging settings such as long-range perception and heavily occluded driving scenes.

Strengths

  1. This paper is well-written and easy to follow. The figures clearly convey the intended message.
  2. "Unveiling the hidden" and clip token propagation are reasonable and effective strategies for static map element detection, which are practical and alleviate the problem to some extent.
  3. The proposed method demonstrates strong performance on benchmark datasets; comprehensive experiments and ablation studies justify the model design.

Weaknesses

  1. As mentioned at line 227, this work is built on a pretrained frame-level MapTRv2 and then fine-tuned; thus the comparison can be unfair. Results without pretraining are required to verify the effectiveness of the method.
  2. At line 53 and for the BEV Updater at line 151: for occluded features, how are the tokens that are visible in certain frames selected? It seems the tokens within the temporal window are fully utilized for the BEV update via cross-attention; how is it determined whether these tokens contain unblocked information? More explanation is required.

Questions

What are the experimental results for the geometric-based dataset splits mentioned in [1] and [2]? Besides, what is the additional computing cost of injecting the temporal clip tokens?

[1] Yuan, Tianyuan, et al. "StreamMapNet: Streaming mapping network for vectorized online HD map construction." Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2024.
[2] Lilja, Adam, et al. "Localization is all you evaluate: Data leakage in online mapping datasets and how to fix it." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

Limitations

Yes. The authors mentioned the weakness of their approach on the corrupted input.

Author Response

Thank you for providing the insightful and constructive feedback. We appreciate your acknowledgment that the paper is easy to follow and that the proposed approach is effective. Below are our responses to each comment, and we will include all the results and comments in the revised version.


W1. Results without pre-training

Thank you for the valuable feedback. We fully understand that our training pipeline may present an unfair advantage, since previous works trained their models on vectorized HD map datasets for only 24 epochs, whereas we pre-train MapTRv2 for 24 epochs and fine-tune MapUnveiler for 24 epochs, amounting to a total of 48 epochs.

To address this concern, we conducted an experiment where we pre-train MapTRv2 for 12/3 epochs and then fine-tune MapUnveiler for 12/3 epochs on either the nuScenes or Argoverse2 training set. Even though our proposed modules are trained for only 12/3 epochs, this undoubtedly provides a fair comparison, as we train for a total of 24/6 epochs. The results are given in G1 of the global response.

As shown in the table, our method still demonstrates state-of-the-art mAP on both the nuScenes and Argoverse2 validation sets under this fairer setting.

Additionally, we tried to skip the pre-training stage and train MapUnveiler from scratch, but it failed to converge, achieving an mAP of 18.2% on the nuScenes (60x30m) validation set. This indicates that our method requires meaningful frame-level map features to learn map unveiling. We would like to kindly note that this pre-training strategy is not our own invention but is commonly adopted for training networks with temporal information. For instance, StreamMapNet [45] and SQD-MapNet [40] train their initial 4 epochs with single-frame inputs. MapTracker [A] employs a three-stage training process (BEV Encoder → Vector Decoder → All Parameters) to facilitate initial convergence.

[A] Chen, Jiacheng, et al. "Maptracker: Tracking with strided memory fusion for consistent vector hd mapping." arXiv preprint arXiv:2403.15951 (2024).


W2. How to select the tokens that are visible in certain frames

The model automatically selects the visible and valuable BEV regions through the learned cross-attention. As depicted in Figure 1-(c) in the main paper, each clip token within the temporal window is fully utilized for the BEV update. We could limit the clip tokens to manually chosen visible BEV regions, but we believe this might not always provide an accurate solution for constructing a clear HD map (e.g., the model might try to select an easy-to-see region even if there are no lanes). Hence, we opted to learn the selection mechanism in an end-to-end manner, which minimizes the losses of the constructed map. This approach is straightforward yet effective, as illustrated in Figure 1: compared to (a) MapTRv2 and (b) StreamMapNet, our BEV features most clearly represent the map elements.


Q1. Experiment result for geometric-based dataset split

Thank you for your insightful suggestion. We conducted additional experiments on the geometric-based dataset splits you suggested [1]. The results have been moved to the global response due to the character limit. Please find the result in G2 in the global response.


Q2. Additional computing costs

We provided the additional computational costs of injecting temporal clip tokens in Table 3, broken down into two components: the Intra-clip Unveiler and the Inter-clip Unveiler. To provide richer information, we further measured GPU memory consumption during inference and the number of model parameters, and appended them to Table 3. The new Table 3 is presented in G3 of the global response.

Comment

Dear Reviewer JH5U,

We greatly appreciate your valuable efforts and professional feedback, which have indeed improved the quality of the final version of our manuscript. We have provided answers to your remaining concerns above, and it would be great to hear your feedback on our rebuttal so that we can further improve the final version. Although the authors-reviewers discussion period is nearing its end, we are fully prepared to address any further questions you may have.

Best regards,
The Authors

Official Review
Rating: 6

This work presents a method called MapUnveiler, which aims to improve the construction of vectorized HD maps for autonomous driving. MapUnveiler uses a novel clip-level pipeline to unveil occluded map elements by relating dense image representations with efficient clip tokens and propagating inter-clip information. This approach leverages temporal information across adjacent input frames, addressing the limitations of single-frame and streaming inference methods. The model achieves state-of-the-art performance on the nuScenes and Argoverse2 benchmark datasets, demonstrating promising improvements in challenging scenarios with longer perception ranges and heavy occlusions.

Strengths

  1. The introduction of a clip-level pipeline for vectorized HD map construction effectively addresses occlusion issues and leverages temporal information across multiple frames.
  2. The method utilizes clip tokens to propagate map information efficiently, reducing redundant computations and enhancing prediction consistency.
  3. Extensive experiments demonstrate that MapUnveiler achieves state-of-the-art performance on nuScenes and Argoverse2 benchmarks, particularly in challenging scenarios.

Weaknesses

  1. The community has noticed a severe data leakage issue with utilizing nuScenes and Argoverse2 datasets for online mapping evaluation {1, 2}, as these datasets are not intentionally built for online mapping. It might also be necessary to validate the proposed method on geo-disjoint training and validation sets.
  2. It would be good to see an analysis of the added model capacity due to the introduction of the proposed intra-clip unveiler and inter-clip unveiler.
  3. It seems the proposed intra-clip unveiler and inter-clip unveiler are adaptable to any single-frame inference online mapping methods. It would be good to validate the effectiveness of the proposed modules on other baseline methods.
  4. The authors are encouraged to investigate the cross-frame consistency of the HD maps estimated by the proposed method compared to existing methods with "inconsistent and suboptimal prediction results" (mentioned in Line 7).

{1} Augmenting Lane Perception and Topology Understanding with Standard Definition Navigation Maps.
{2} Localization Is All You Evaluate: Data Leakage in Online Mapping Datasets and How to Fix It.

Questions

  1. What do the map queries stand for? Can they be transferred directly to vectorized HD maps?
  2. Is the map decoder adopted from MapTRv2?
  3. Are map tokens generated from the intra-clip unveiler the refined version of map queries?

Limitations

The limitation of dependency on temporally consecutive frames is discussed.

Author Response

We thank the reviewer for providing thorough feedback and interesting suggestions. We are grateful for your acknowledgment that the introduction of a clip-level pipeline for vectorized HD map construction is effective and the proposed clip tokens propagate map information efficiently. Below are our responses to each comment, and we will include all the results and comments in the revised version.


W1. Validation on geo-disjoint dataset

Thank you for your insightful feedback. To address this concern, we conducted additional experiments on a recent geo-disjoint dataset split. Due to the character limit, we attached the full table in the global response. Please find the results in G2 in the global response.


W2. Analysis of added model capacity

That would be a good analysis to provide readers with richer information about our model. In addition to the analysis of the accuracy and computational costs of the proposed intra-clip Unveiler and inter-clip Unveiler (Table 3 in the main paper), we add GPU memory consumption during inference and the model parameters. We present the updated table in G3 of the global response.


W3. Validate the proposed modules on other baseline methods

That's a valuable suggestion. We believe that any single-frame online mapping method based on DETR (which outputs both rasterized BEV features and vectorized query features) can be adapted to our MapUnveiler. Therefore, it would be exciting to additionally claim modularization as a contribution. Consequently, we attempted these experiments with DETR-based models for which code is available online (MapVR [46], MGMap [24], MapTRv1 [21]). Unfortunately, we are unable to present the results at this time due to a lack of resources and time to set up and conduct the experiments within the short rebuttal period. Nevertheless, to address the concern behind this query, we experimented with various backbone networks (ResNet-18 and V2-99 {3}). The result table is presented in G4 of the global response.

As shown in the table, our method is not limited to MapTRv2 with ResNet-50 but can be extended to ResNet-18 and V2-99 {3} backbones. This suggests that our method can also work with various other frame-level features.

We plan to implement our method on other single-frame inference online mapping methods and include the results in the final version.

{3} An energy and GPU-computation efficient backbone network for real-time object detection, CVPRW. 2019.


W4. Consistency of estimated HD Map across frames

Thanks for your thorough feedback. Measuring the consistency of the model would be very helpful and would further elaborate the contribution of our model. Quantitatively measuring consistency requires indicating the track ID of each estimated map element. However, this is challenging in our model because we deliberately propose a straightforward pipeline that requires neither a complicated spatial warping process across time (needed for streaming inference, e.g., StreamMapNet [45]) nor track ID annotations. Thus, we attempted to show consistent results qualitatively in Figures 4, 5, 6, 7, and 8.

We recently found that MapTracker {4} presented a consistency-aware mAP (C-mAP) metric and measured it for baseline methods (e.g., MapTRv2) by predicting track IDs through their tracking algorithm. It seems that we can also measure the C-mAP by applying the tracking algorithm proposed in {4}. To measure it, we have to re-train the model with the refined GT data proposed in {4}, so we are unfortunately unable to show the C-mAP result currently due to a lack of resources and time.

We plan to implement C-mAP and add the results in the final version.

{4} MapTracker: Tracking with Strided Memory Fusion for Consistent Vector HD Mapping, arXiv:2403.15951


Q1. Map queries

That's correct. Map queries are trained with the same objective function as the map tokens, so they can be transferred directly to vectorized HD maps. Thus, MapUnveiler degrades to MapTRv2 if we construct vectorized HD maps using map queries instead of map tokens. An interesting experiment can be conducted based on this idea: we can use MapUnveiler for both frame-level and clip-level scenarios (as we sometimes cannot utilize temporally consecutive frames due to unexpected communication/sensor errors in real-world scenarios). We experimented by replacing map tokens with map queries for every two frames, resulting in a slight performance degradation from 69.8% to 68.0% on nuScenes (60x30m).

To clarify this, we will add the following explanation in Line #130 (we highlighted the new descriptions in italics):

... the map decoder, respectively. BEV features represent rasterized map features, whereas map queries embed vectorized map information; thus, we can directly construct vectorized HD maps using the map queries.


Q2. Map decoder

That's correct. The map decoder is adopted from MapTRv2. To clarify this, we will revise the caption in Figure 2:

... vectorized HD maps. All components in the frame-level MapNet (i.e., Backbone, PV to BEV, Map Decoder) are adopted from MapTRv2 [22]. The frame-level MapNet extracts ...


Q3. Map tokens

That's correct. The map tokens generated from the intra-clip unveiler are the refined version of map queries. To clarify this, we will revise the paper in Line #158:

... in the previous step. The objective of this step is to generate a refined version of frame-level map queries. As illustrated in ...

Comment

Thank you for the thorough and thoughtful rebuttal. The authors have successfully addressed my concerns, so I am increasing my rating from 5 to 6.

Comment

We are so pleased to have received your positive feedback and raised score (5 → 6). We deeply appreciate your efforts in reviewing our responses.

Best regards,
The Authors

Author Response (Global)

We thank all the reviewers for their constructive and thorough comments. We are particularly excited that all reviewers acknowledged the idea of a clip-level pipeline as reasonable, novel, or effective for online vectorized HD mapping. We believe this rebuttal further enhances the paper through the valuable comments provided by the reviewers. We provide detailed point-by-point responses to all queries in each reviewer's thread. Here, we provide the global responses G1 to G4.


G1. A more fair comparison

We conducted an experiment where we pre-train MapTRv2 for 12/3 epochs and then fine-tune MapUnveiler for 12/3 epochs on either the nuScenes or Argoverse2 training set. Even though our proposed modules are trained for only 12/3 epochs, this undoubtedly provides a fair comparison, as we train for a total of 24/6 epochs. The results are given below.

| Method (60x30m) | AP_p | AP_d | AP_b | mAP (nuScenes) | FPS | AP_p | AP_d | AP_b | mAP (Argoverse2) |
|---|---|---|---|---|---|---|---|---|---|
| MapTRv2 [22] | 59.8 | 62.4 | 62.4 | 61.5 | 15.6 | 62.9 | 72.1 | 67.1 | 67.4 |
| SQD-MapNet [40] | 63.0 | 62.5 | 63.3 | 63.9 | - | 64.9 | 60.2 | 64.9 | 63.3 |
| MGMap [24] | 61.8 | 65.0 | 67.5 | 64.8 | 12.3 | - | - | - | - |
| MapQR [26] | 63.4 | 68.0 | 67.7 | 66.4 | 14.2 | 64.3 | 72.3 | 68.1 | 68.2 |
| HIMap [49] | 62.6 | 68.4 | 69.1 | 66.7 | 9.7 | 69.0 | 69.5 | 70.3 | 69.6 |
| Map-Unveiler* (ours) | 67.6 | 67.6 | 68.8 | 68.0 | 12.7 | 68.9 | 73.7 | 68.9 | 70.5 |
| Map-Unveiler (ours) | 69.5 | 69.4 | 70.5 | 69.8 | 12.7 | 69.0 | 74.9 | 69.1 | 71.0 |

| Method (100x50m) | AP_p | AP_d | AP_b | mAP (nuScenes) | FPS | AP_p | AP_d | AP_b | mAP (Argoverse2) |
|---|---|---|---|---|---|---|---|---|---|
| MapTRv2 [22] | 58.1 | 61.0 | 56.6 | 58.6 | 15.6 | 66.2 | 61.4 | 54.1 | 60.6 |
| StreamMapNet [45] | 62.9 | 63.1 | 55.8 | 60.6 | 12.5 | - | - | - | 57.7 |
| SQD-MapNet [40] | 67.0 | 65.5 | 59.5 | 64.0 | - | 66.9 | 54.9 | 56.1 | 59.3 |
| Map-Unveiler* (ours) | 68.0 | 70.0 | 68.2 | 68.7 | 12.7 | 69.7 | 67.1 | 59.3 | 65.4 |
| Map-Unveiler (ours) | 68.4 | 71.2 | 68.3 | 69.3 | 12.7 | 70.4 | 66.8 | 59.3 | 65.5 |

In the table, we mark results with an asterisk (*) when we use the new fair training schedule (pre-train for 12/3 epochs and fine-tune for 12/3 epochs); the other results are directly copied from Table 1 in the main paper. Our method still demonstrates state-of-the-art mAP under this setting.


G2. Experimental results on geo-disjoint dataset splits

We conducted experiments on a recent geo-disjoint dataset split (the StreamMapNet split), where the performance of MapTRv2 drops significantly from 61.5% to 36.6%. The results are given in Table 12 of the rebuttal PDF file. For a fair comparison, our method uses the short training schedule discussed in G1. We also successfully achieve state-of-the-art performance on this geo-disjoint split. We will further evaluate on various geo-disjoint dataset splits in the final version.


G3. Additional computing costs and analysis of added model compacity

To analyze the model capacity added by the proposed modules, we measure GPU memory consumption during inference and the model parameters in Table 3. The updated Table 3 is provided below.

| Method | AP_p | AP_d | AP_b | mAP | FPS | GPU (MB) | Parameters (MB) |
|---|---|---|---|---|---|---|---|
| MapTRv2 [22] | 58.8 | 61.8 | 62.8 | 61.2 | 15.6 | 830.4 | 76.4 |
| + Intra-clip Unveiler | 65.6 | 67.6 | 68.0 | 67.1 | 13.1 | 1552.5 | 144.0 |
| + Inter-clip Unveiler | 69.5 | 69.4 | 70.5 | 69.8 | 12.7 | 1614.9 | 213.9 |

As shown in the table, our method requires marginal additional computational costs (15.6 FPS → 13.1 FPS when clip tokens are used only within the intra-clip Unveiler, and 15.6 FPS → 12.7 FPS when both intra-clip and inter-clip tokens are fully employed). However, our approach requires approximately twice the GPU memory and nearly three times the parameters of the frame-level MapTRv2 model. This could be considered a potential limitation of our method, but the amounts are not large (2 GB graphics cards can be purchased for under $25, and systems commonly have more than 1 GB of spare RAM and disk space).


G4. Experiments with various backbones

Here we additionally present experimental results with various backbones: ResNet-18 and V2-99.

| Method (60x30m) | Backbone | AP_p | AP_d | AP_b | mAP (nuScenes) | AP_p | AP_d | AP_b | mAP (Argoverse2) |
|---|---|---|---|---|---|---|---|---|---|
| MapTRv2 [22] | R18 | 53.3 | 58.5 | 58.5 | 56.8 | 58.8 | 68.5 | 64.0 | 63.8 |
| MapTRv2 [22] | V2-99 | 63.6 | 67.1 | 69.2 | 66.6 | 64.5 | 72.2 | 70.1 | 68.9 |
| Map-Unveiler (ours) | R18 | 65.5 | 68.4 | 68.2 | 67.4 | 66.5 | 71.1 | 67.5 | 68.4 |
| Map-Unveiler (ours) | R50 | 69.5 | 69.4 | 70.5 | 69.8 | 69.0 | 74.9 | 69.1 | 71.0 |
| Map-Unveiler (ours) | V2-99 | 72.1 | 72.9 | 74.9 | 73.3 | 71.4 | 75.1 | 73.0 | 73.2 |

| Method (100x50m) | Backbone | AP_p | AP_d | AP_b | mAP (nuScenes) | AP_p | AP_d | AP_b | mAP (Argoverse2) |
|---|---|---|---|---|---|---|---|---|---|
| MapTRv2 [22] | R18 | 52.7 | 57.3 | 51.5 | 53.8 | 60.3 | 57.6 | 49.6 | 55.8 |
| MapTRv2 [22] | V2-99 | 62.6 | 67.8 | 65.2 | 65.2 | 68.5 | 62.1 | 58.4 | 63.0 |
| Map-Unveiler (ours) | R18 | 67.4 | 69.8 | 68.8 | 68.6 | 68.5 | 64.3 | 57.7 | 63.5 |
| Map-Unveiler (ours) | V2-99 | 71.9 | 75.0 | 75.6 | 74.2 | 73.3 | 69.1 | 63.9 | 68.8 |

As shown in the table, our method is not limited to MapTRv2 with ResNet-50 but can be extended to ResNet-18 and V2-99 backbones. This suggests that our method can also work with various other frame-level features.

Final Decision

This paper initially received mixed reviews. After the rebuttal period, all reviewers recommend acceptance, praising the strong performance and extensive evaluations. There are some lingering concerns about presentation clarity, but the authors have been given clear guidance and they seem easy to address for the camera-ready submission.