CodeMerge: Codebook-Guided Model Merging for Robust Test-Time Adaptation in Autonomous Driving
An efficient model merging approach for online adaptation of LiDAR-based 3D detection and end-to-end autonomous driving systems
Abstract
Reviews and Discussion
This paper introduces CodeMerge, a novel and efficient framework for test-time adaptation (TTA) in the context of 3D perception for autonomous driving. The primary contribution is a method that avoids the high computational cost associated with existing model merging techniques. Instead of loading and performing inference with multiple full model checkpoints, CodeMerge creates a codebook where each checkpoint is represented by a compact, low-dimensional "fingerprint." These fingerprints are derived from the penultimate features of a source model. The framework then uses ridge leverage scores calculated on these fingerprints to determine the importance of each past checkpoint for merging. The authors demonstrate the effectiveness of CodeMerge on both end-to-end autonomous driving systems and modular LiDAR-based object detectors, showing significant performance improvements on challenging corruption and cross-dataset benchmarks, as well as benefits for downstream tasks like mapping and planning.
Strengths and Weaknesses
Strengths
- Significance: The paper addresses a highly significant and practical problem in autonomous driving: robustness to real-world distribution shifts encountered during deployment. The evaluation of TTA within a full end-to-end pipeline (SparseDrive) demonstrates that improvements in the perception module positively impact downstream tasks.
- Originality: The core idea of using low-dimensional fingerprints and ridge leverage scores to guide model merging is novel and insightful. The strong empirical correlation shown between fingerprint and parameter space differences provides good evidence for the validity of this latent-space approach.
- Quality and Clarity: The paper is generally well-organized and is supported by a comprehensive and rigorous set of experiments.
Weaknesses
- As other popular architectures like UniAD and VAD "experience over tenfold performance degradation on nuScenes-C, hindering effective adaptation training" (Lines 272-273), this raises a significant question about the generalizability of CodeMerge. While the results on SparseDrive are strong, the paper lacks evidence that the approach would be effective on other end-to-end models, which currently limits the perceived scope of the contribution.
Questions
- The paper's success is demonstrated compellingly on the SparseDrive architecture. However, the conclusion mentions significant challenges in applying TTA to other architectures like UniAD. Could the authors elaborate on the underlying reasons for the performance degradation on these other models? Does this suggest a limitation in the TTA paradigm for certain architectures, or a property of CodeMerge itself? Answering this would greatly clarify the method's scope and potential for broader application. My score would improve if a convincing argument can be made for its generalizability.
- Section 3.1 states that fingerprints are computed using the initial pretrained feature extractor $\phi_{\Theta^{(0)}}$. For the end-to-end experiments, it is mentioned that only the 3D box regression head is updated (Line 141). Does this "frozen feature extractor" strategy also apply to the LiDAR-only detector experiments on SECOND, or are more parameters adapted and merged in that setting? Clarifying the scope of adapted parameters $\Theta^{(t)}$ for all experimental settings would be helpful.
As I am not very familiar with this research field, if the answers are convincing enough, I would like to increase my scores.
Limitations
Yes
Final Justification
The authors' rebuttal, particularly the inclusion of new experiments on two additional modern architectures, has decisively addressed my primary reservation about the paper's generalizability.
- Generalizability was my most significant concern, and it has been fully resolved. First, they convincingly argued that the failure of models like VAD is due to the brittleness of the base model itself, not a flaw in the TTA paradigm or CodeMerge; their results showed CodeMerge does not degrade performance even in this collapsed regime. More importantly, they provided new experimental results on two recent, state-of-the-art architectures (DiffusionDrive and MomAD). These new results demonstrate clear and consistent performance gains across all downstream tasks, providing strong evidence that CodeMerge is not limited to SparseDrive and is indeed a more generally applicable technique.
- My question about the adaptation strategy for the SECOND detector was answered clearly and concisely. The authors confirmed they adapt the entire network, which is the established protocol in prior work, and cited relevant papers to support their methodology. This resolves the ambiguity.
- There are no remaining unresolved issues from my perspective.
Formatting Concerns
No formatting concerns found.
Response to Reviewer hVAv
We are grateful for your time and insightful comments!
W1 & Q1 - Generalizability beyond SparseDrive: explanation of the underlying causes of degradation, clarification of whether the limitation lies in the TTA paradigm or CodeMerge itself, and empirical evidence or arguments demonstrating broader applicability beyond SparseDrive.
Early end-to-end models under corruption. Our experiments show that some early architectures (e.g., VAD) collapse on corrupted scenes even before any TTA is applied. For example, under heavy snow, VAD attains only 0.0168 mAP for detection and 0.0353 NDS, indicating that the base model's predictions are already unreliable and the model has almost collapsed. In this regime, online self-supervision becomes untrustworthy, so any TTA method has little signal to learn from. Importantly, this reflects a robustness limitation of the base model, not of TTA or CodeMerge. Even so, attaching CodeMerge to VAD yields small but consistent gains across perception and planning metrics and does not degrade VAD's performance, which underscores that the bottleneck is the model's severe brittleness rather than our adaptation mechanism.
| Snow | mAP↑ | NDS↑ | mATE↓ | mASE↓ | mAOE↓ | mAVE↓ | mAAE↓ | AMOTA↑ | AMOTP↓ | RECALL↑ | IDS↓ | APped↑ | APdivider↑ | APboundary↑ | map mAP↑ | minADE↓ | minFDE↓ | MR↓ | EPA↑ | L2-Avg↓ | CL↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VAD | 0.0168 | 0.0353 | 0.9691 | 0.8577 | 0.9713 | 1.1064 | 0.9327 | -- | -- | -- | -- | 0.0006386 | 0.0014679 | 0.0002426 | 0.0007830 | 2.3891099 | 3.7661626 | 0.3022 | 0.0479 | 1.622955 | 0.0132 |
| VAD+CodeMerge | 0.0176 | 0.0356 | 0.9689 | 0.8576 | 0.9719 | 1.1019 | 0.9337 | -- | -- | -- | -- | 0.0005710 | 0.0015413 | 0.0002724 | 0.0007949 | 2.3264942 | 3.6645312 | 0.2974072 | 0.0519021 | 1.605700 | 0.01307 |
Recent architectures with non-collapsed baselines. When the backbone remains minimally competent under corruption, CodeMerge is highly effective and broadly applicable. On DiffusionDrive [A] and MomAD [B] (both CVPR 2025), CodeMerge delivers consistent improvements across detection, tracking, mapping, motion, and planning, and it reduces the collision rate under both Brightness and Snow corruptions. These results show that CodeMerge scales beyond SparseDrive.
| Brightness | mAP↑ | NDS↑ | mATE↓ | mASE↓ | mAOE↓ | mAVE↓ | mAAE↓ | AMOTA↑ | AMOTP↓ | RECALL↑ | IDS↓ | APped↑ | APdivider↑ | APboundary↑ | map mAP↑ | minADE↓ | minFDE↓ | MR↓ | EPA↑ | L2-Avg↓ | CL↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Diffusiondrive | 0.3278 | 0.4588 | 0.6402 | 0.2782 | 0.6473 | 0.3005 | 0.1851 | 0.2818 | 1.4334 | 0.4076 | 787 | 0.3651 | 0.4819 | 0.4794 | 0.4421 | 0.6624 | 1.0282 | 0.1380 | 0.4448 | 0.6001 | 0.089% |
| Diffusiondrive+CodeMerge | 0.3580 | 0.4845 | 0.6210 | 0.2747 | 0.5767 | 0.2811 | 0.1916 | 0.3150 | 1.3529 | 0.4638 | 873 | 0.4125 | 0.5162 | 0.5333 | 0.4873 | 0.6582 | 1.0261 | 0.1380 | 0.4648 | 0.5919 | 0.062% |
| MomAD | 0.3340 | 0.4711 | 0.6276 | 0.2734 | 0.5931 | 0.2900 | 0.1747 | 0.2944 | 1.4241 | 0.4101 | 731 | 0.2490 | 0.1808 | 0.3157 | 0.2485 | 1.3402 | 2.2509 | 0.2212 | 0.3965 | 7.3178 | 8.500% |
| MomAD +CodeMerge | 0.3664 | 0.4952 | 0.6092 | 0.2741 | 0.5496 | 0.2670 | 0.1803 | 0.3341 | 1.3447 | 0.4737 | 824 | 0.2755 | 0.1976 | 0.3365 | 0.2699 | 1.13309 | 2.2389 | 0.2211 | 0.4147 | 7.3171 | 8.438% |
| Snow | mAP↑ | NDS↑ | mATE↓ | mASE↓ | mAOE↓ | mAVE↓ | mAAE↓ | AMOTA↑ | AMOTP↓ | RECALL↑ | IDS↓ | APped↑ | APdivider↑ | APboundary↑ | map mAP↑ | minADE↓ | minFDE↓ | MR↓ | EPA↑ | L2-Avg↓ | CL↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Diffusiondrive | 0.1006 | 0.2373 | 0.7811 | 0.3794 | 0.9105 | 0.7007 | 0.3585 | 0.0503 | 1.8648 | 0.1064 | 305 | 0.0106 | 0.0426 | 0.0512 | 0.0348 | 1.0603 | 1.7077 | 0.1951 | 0.2063 | 0.9300 | 0.447% |
| Diffusiondrive+CodeMerge | 0.1797 | 0.3492 | 0.7389 | 0.2956 | 0.6673 | 0.4621 | 0.2424 | 0.1078 | 1.7171 | 0.1785 | 681 | 0.1010 | 0.1713 | 0.1688 | 0.1471 | 0.8086 | 1.2838 | 0.1730 | 0.3070 | 0.7730 | 0.215% |
[A] Liao et al. Truncated Diffusion Model for End-to-End Autonomous Driving. CVPR 2025
[B] Song et al. Don't Shake the Wheel: Momentum-Aware Planning in End-to-End Autonomous Driving. CVPR 2025
Q2 - Does this "frozen feature extractor" strategy also apply to the LiDAR-only detector experiments on SECOND, or are more parameters adapted and merged in that setting?
For 3D detection, prior works [C, D, E] have shown that adapting all weights of the entire detector during TTA yields the best performance, and we simply adopt that established experimental setup. Therefore, we train entire networks and then adapt and merge all parameters. Please also refer to Reviewer G337 W3.
[C] Chen et al. MOS: Model Synergy for Test-Time Adaptation on LiDAR-Based 3D Object Detection. ICLR 2025
[D] Chen et al. DPO: Dual-Perturbation Optimization for Test-time Adaptation in 3D Object Detection. ACM MM 2024
[E] Yuan et al. Reg-TTA3D: Better Regression Makes Better Test-Time Adaptive 3D Object Detection. ECCV 2024
Thank you for your detailed and convincing rebuttal. These results have substantially strengthened the paper and are the primary reason for my increased score.
Thank you very much for your positive and encouraging feedback. We are delighted that our rebuttal and additional results have addressed your concerns and strengthened the paper. We genuinely appreciate your time and thoughtful consideration during the review process. If you have any further suggestions, we would be more than happy to discuss them.
The authors tackle test-time adaptation for 3D detection by compressing every intermediate fine-tuned checkpoint into a low-dimensional fingerprint. The compressed features are stored in a codebook and ranked by ridge-leverage scores. Then, the top-k weights are linearly merged. The experiments are conducted with both camera-based and LiDAR-based 3D detectors with various types of sensor degradations. Improved perception performance also benefits the downstream task, including mapping, motion prediction, and planning.
Strengths and Weaknesses
Strengths
- The paper is generally well-written and easy to follow.
- The proposed idea of replacing K forward passes with a projection-space distance is clear.
- Experiments cover both camera-based and LiDAR-based 3D detectors with various types of degradation.
Weaknesses
- If I understand correctly, CodeMerge still requires a buffer of previously fine-tuned checkpoints, which means it needs gradient-based updates, and the paper assumes the model can back-propagate every frame (or every N frames) to grow the buffer. This may be impractical in a real scenario. Regarding this, how many gradient steps are executed per frame to produce each checkpoint? Please correct me if I misunderstood.
- Lack of robustness analysis. Once a heavy distribution shift occurs (e.g., fog + LiDAR dropout), the model may fail to rank checkpoints accurately.
- Similar to the above, does CodeMerge work if the backbone itself is adapted?
- It would be beneficial to include more discussion of domain-adaptation methods (e.g., SN, ST3D). From my understanding, ST3D and SN are offline domain-adaptation methods, and CodeMerge is evaluated after every test frame. What are the pros and cons of the two different directions?
- Could EMA be another baseline for merging in Table 5?
Additional Comments
- Line 251: Quantitative Analysis --> Qualitative Analysis
Questions
Please see the above.
Limitations
The authors have adequately addressed the limitations and potential negative societal impact of their work.
Final Justification
I appreciate the author's enthusiastic rebuttal. All of my concerns have been addressed well. I will maintain my positive score, raising my confidence from 2 to 3, as I believe I'm not an active researcher working on this specific task. I am happy for further discussion if needed.
Formatting Concerns
There are no major formatting issues in this paper.
Response to Reviewer G337
Thanks for your constructive and valuable feedback, which will greatly assist in enhancing our work!
W1 - Back-propagate through buffer? How many steps per frame?
No. All stored checkpoints are frozen, and we never back-propagate through past models. At step $t$, we (i) compute merging weights in fingerprint space and form a sign-consistent merged teacher from the top-$k$ buffered checkpoints, and (ii) use the merged teacher only to generate pseudo-labels for the incoming batch. Only the current model is updated, with a single gradient step per incoming batch; the other checkpoints are used for merging without any gradient updates. This avoids the $K$-forward-pass overhead of MOS [A], which is why our runtime and memory are lower (refer to Table 6 in the paper and our response to W1 of reviewer jES4).
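For concreteness, a minimal sketch of one adaptation step under the description above (PyTorch-style; the helper names `build_merged_teacher`, `tta_step`, `self_training_loss`, the softmax weighting, and the MSE placeholder are illustrative assumptions, not the paper's exact implementation):

```python
import copy
import torch
import torch.nn.functional as F

@torch.no_grad()
def build_merged_teacher(fingerprints, ckpt_states, k=5, lam=1e-3):
    """Rank buffered checkpoints by ridge leverage scores computed in the
    low-dimensional fingerprint space and merge the top-k parameter sets.
    No forward pass is made through any past checkpoint."""
    Z = torch.stack(fingerprints)                       # (N, d') codebook keys
    G = Z @ Z.T                                         # Gram matrix
    # Ridge leverage scores via Z (Z^T Z + lam I)^-1 Z^T = G (G + lam I)^-1
    lev = torch.diagonal(torch.linalg.solve(G + lam * torch.eye(len(Z)), G))
    top = torch.topk(lev, min(k, len(Z))).indices
    w = F.softmax(lev[top], dim=0)                      # merging weights (illustrative)

    merged = copy.deepcopy(ckpt_states[top[0].item()])
    for name in merged:
        stack = torch.stack([ckpt_states[i.item()][name].float() for i in top])
        # Weighted average of the selected checkpoints; the paper additionally
        # applies a sign-consistency rule before averaging (omitted here).
        merged[name] = (w.view(-1, *([1] * (stack.dim() - 1))) * stack).sum(0)
    return merged

def self_training_loss(pred, pseudo):
    # Placeholder: the actual objective is the detector's pseudo-label loss.
    return F.mse_loss(pred, pseudo)

def tta_step(model, teacher, batch, optimizer):
    """One online step: the frozen merged teacher produces pseudo-labels,
    and only the current model receives a single gradient update."""
    with torch.no_grad():
        pseudo_labels = teacher(batch)
    loss = self_training_loss(model(batch), pseudo_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```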
W2 - Robustness under fog + LiDAR dropout
We conduct additional experiments adapting SECOND from KITTI to KITTI-C under heavy fog with increasing point-dropout (0%→25%→50%) and report the AP 3D and AP BEV at the moderate level in the table below.
| Dropout Ratio | 0% | 25% | 50% |
|---|---|---|---|
| AP 3D | 75.96 | 73.87 | 71.95 |
| AP BEV | 88.16 | 85.73 | 85.04 |
With increasing point-dropout, performance degrades gradually rather than catastrophically: AP 3D drops from 75.96→73.87→71.95 (−4.01 points, ≈5.3% relative), and AP BEV from 88.16→85.73→85.04 (−3.12 points, ≈3.5% relative). This smooth, monotonic decline suggests that our checkpoint ranking/merging remains reliable even under challenging compounded shifts.
W3 - Does CodeMerge still work when the backbone is adapted?
E2E model: In the paper's end-to-end setting, we froze all components except the 3D box head to isolate the effect of merging and keep TTA lightweight. We conducted additional experiments in which the feature-extraction backbone (ResNet) is also adapted and observe negligible deltas relative to head-only TTA (e.g., NDS −0.0034, mAP −0.0055, AMOTA −0.0028, map mAP −0.0097, L2-Avg −0.0023), indicating that CodeMerge remains effective when the backbone itself is updated. Furthermore, we have evaluated the effectiveness of CodeMerge on different backbone models (see Reviewer hVAv, W1 & Q1), demonstrating the method's broad applicability.
| ColorQuant | mAP↑ | NDS↑ | mATE↓ | mASE↓ | mAOE↓ | mAVE↓ | mAAE↓ | AMOTA↑ | AMOTP↓ | RECALL↑ | IDS↓ | APped↑ | APdivider↑ | APboundary↑ | map mAP↑ | minADE↓ | minFDE↓ | MR↓ | EPA↑ | L2-Avg↓ | CL↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CodeMerge+det head | 0.2742 | 0.4331 | 0.6575 | 0.2764 | 0.5903 | 0.3018 | 0.2137 | 0.2339 | 1.4868 | 0.333 | 490 | 0.260 | 0.3445 | 0.3267 | 0.3104 | 0.7002 | 1.0859 | 0.1454 | 0.384 | 0.6729 | 0.106% |
| CodeMerge+ResNet | 0.2687 | 0.4297 | 0.6574 | 0.2778 | 0.5906 | 0.3092 | 0.2119 | 0.2311 | 1.4799 | 0.3468 | 510 | 0.2407 | 0.3452 | 0.3161 | 0.3007 | 0.7087 | 1.0930 | 0.1453 | 0.3793 | 0.6752 | 0.102% |
3D detection model: We follow prior works [A, B] and adapt the entire detector (encoder, detection head, and all batch-norm layers), as these works have already shown that, for test-time adaptation in 3D detection, adapting the full network leads to superior performance compared to adapting batch-norm layers only.
[A] Chen et al. MOS: Model Synergy for Test-Time Adaptation on LiDAR-Based 3D Object Detection. ICLR 2025
[B] Chen et al. DPO: Dual-Perturbation Optimization for Test-time Adaptation in 3D Object Detection. ACM MM 2024
W4 - Offline UDA (SN, ST3D) vs. online TTA (CodeMerge): pros & cons
Different Assumptions and Settings
UDA (SN, ST3D): has offline access to unlabeled target data and retrains the model for multiple epochs with self‑training / distribution alignment (e.g., ST3D), or target‑size statistics (SN). This typically yields strong target‑specific models when the target distribution is relatively stationary, but incurs heavy computation and latency before deployment.
TTA (CodeMerge): adapts online at inference on a stream of target frames. There is no multi-epoch retraining, so each step is low-cost and fits real-time constraints.
Strengths & trade‑offs
UDA (SN, ST3D):
- Pros: Can fully specialize to the target dataset with extensive training; strong performance when the deployment domain is fixed.
- Cons: Requires offline target data and multi‑epoch computation; not reactive to rapidly changing in‑field dynamic conditions.
TTA (CodeMerge):
- Pros: Immediate, frame‑by‑frame adaptation suited to dynamic shifts; efficient (e.g., vs. MOS, –41.8% runtime and –8% memory on SECOND; even larger savings on SparseDrive). Robust improvements are observed on cross‑dataset and corruption benchmarks, often surpassing UDA baselines (e.g., > ST3D on nuScenes→KITTI).
- Cons: No offline access to target data; improvements depend on on‑the‑fly pseudo‑labels and the diversity/quality of the checkpoint buffer, so substantial shifts may require several steps to accumulate useful checkpoints. (We mitigate this with geometry‑preserving fingerprints and curvature‑aware selection.)
W5 - Merging baseline: EMA
We conducted additional experiments by replacing the merged model with the EMA model. On the same setting in Table 5 of the paper, CodeMerge (k=5, proj‑dim=1024) consistently outperforms EMA across all tasks: Detection mAP/NDS +0.0373/+0.0272; Tracking AMOTA +0.0372 and AMOTP −0.0810 (↓ better); Mapping mAP/APped +0.0355/+0.0300; Motion mADE/mFDE −0.0147/−0.0301; Planning L2‑Avg/CR‑Avg −0.0274/−0.0160. These gains indicate that our fingerprint‑guided merging, which selects complementary checkpoints, offers stronger generalization than the uniform, time‑local smoothing performed by EMA.
| Method | K | Proj.-D | Det. mAP↑ | Det. NDS↑ | Track. AMOTA↑ | Track. AMOTP↓ | Map mAP↑ | Map APped↑ | Motion mADE↓ | Motion mFDE↓ | Plan. L2-Avg↓ | Plan. CR-Avg↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EMA | – | – | 0.2478 | 0.3992 | 0.1869 | 1.6016 | 0.3748 | 0.3413 | 0.7375 | 1.1447 | 0.6778 | 0.125 |
| CodeMerge | 5 | 1024 | 0.2851 | 0.4264 | 0.2241 | 1.5206 | 0.4103 | 0.3713 | 0.7228 | 1.1146 | 0.6504 | 0.109 |
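For clarity, a minimal sketch of the EMA baseline used in this comparison (the decay value `alpha` is an assumed illustration, not the exact value used here); it performs time-local smoothing toward the current model, in contrast to the fingerprint-guided top-k merge sketched in our W1 response:

```python
import torch

@torch.no_grad()
def ema_update(ema_state, current_state, alpha=0.999):
    """EMA baseline: exponential smoothing of the current model's parameters.
    Unlike CodeMerge, it cannot select complementary past checkpoints."""
    for name, param in current_state.items():
        if not torch.is_floating_point(ema_state[name]):
            continue  # skip integer buffers such as BN step counters
        ema_state[name].mul_(alpha).add_(param, alpha=1.0 - alpha)
```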
Thanks again for your helpful comments and questions!
I appreciate the author's enthusiastic rebuttal. All of my concerns have been addressed well. I will maintain my positive score, raising my confidence from 2 to 3, as I believe I'm not an active researcher working on this specific task. I am happy for further discussion if needed.
We greatly appreciate your recognition of our efforts and your willingness to engage constructively. Should you have any additional suggestions or points for further discussion, we would be delighted to address them. Thanks again for your supportive review!
This paper proposes a novel model-merging-based TTA method for LiDAR 3D object detection and E2E AD models. In the proposed CodeMerge approach, previously extracted features and corresponding checkpoints are stored. At each time step, the features of these checkpoints are projected into low-dimensional features, the fingerprints. Ridge leverage scores (RLS) are then computed based on these fingerprints to guide the model merging process. Compared to prior model-merging-based TTA methods such as MOS, this approach demonstrates improved efficiency and performance. The effectiveness of the method is validated across multiple settings, including cross-dataset and clean-to-corruption adaptation.
Strengths and Weaknesses
[ Strengths ]
- The proposed fingerprint-based model merging method resolves the scalability limitations of prior model-merging-based methods, which are computationally expensive.
- The method is evaluated not only for LiDAR-based 3D object detection but also for end-to-end autonomous driving tasks, demonstrating its effectiveness as a TTA method across various application domains. This broadens its potential applicability.
[ Weaknesses ]
- Regarding the efficiency analysis, does the reported runtime refer to a single adaptation step? If so, it would be helpful to specify the inference time for SECOND and the computational burden introduced by each TTA step. Including runtime comparisons with other baselines such as TENT, CoTTA, and DUA would provide further evidence of the practicality of the proposed method.
- The paper employs random projection for generating fingerprints. Is there a rationale behind using random projections? Why are these projections redefined at each time step? Does the inherent randomness introduce performance variance? It would be helpful to clarify whether the method is robust across repeated experiments.
Questions
- Have alternative fingerprint generation strategies been considered?
- Can the proposed TTA method be applied to 3D object detection models other than SECOND?
- Since the fingerprint is generated using features extracted from a fixed encoder given the i-th input, can it reliably serve as a key of the model parameters themselves?
Limitations
Yes
Final Justification
The authors have thoroughly addressed all the concerns raised in the rebuttal. My doubts regarding the efficiency and rationale of the proposed methodology have been fully resolved. Therefore, I am updating my final rating.
Formatting Concerns
There are no major formatting issues in the paper.
Response to Reviewer jES4
W1 - Computational cost of each TTA step and comparison with baselines on SECOND
Thank you for your valuable comments! The runtime in Table 6 of our paper is the total time aggregated over all adaptation steps across the full test set. Below we report GPU memory and per-frame latency with AP 3D for SECOND on nuScenes→KITTI.
| Method | GPU Memory (MiB) | Runtime (Seconds / Frame) | AP 3D |
|---|---|---|---|
| TENT | 10,832 | 0.26 | 18.83 |
| CoTTA | 15,099 | 0.15 | 47.61 |
| MOS | 17,411 | 0.49 | 51.11 |
| CodeMerge | 16,041 | 0.28 | 58.54 |
Speed comparison. Compared with MOS [A], CodeMerge cuts latency by ~43% (0.49→0.28 s) and raises AP 3D by +14.5%. Relative to TENT, +210.9% AP 3D at similar latency (0.26 vs. 0.28 s). Compared with CoTTA, only +0.13 s/frame yet +22.9% AP 3D.
Memory comparison. CodeMerge uses less memory than MOS (16,041 vs. 17,411 MiB) and slightly more than CoTTA (15,099 MiB), while achieving the highest AP 3D. Compared with TENT, it needs +5,209 MiB (~5.1 GiB) but keeps similar runtime and delivers significant AP gain: +210.9%.
Overall, CodeMerge offers the best accuracy–efficiency trade-off on SECOND for nuScenes→KITTI: highest AP 3D with competitive latency and memory, suitable for real-time TTA.
W2.1 - Rationale behind random projections.
Efficiency. Random projection primarily improves memory and runtime. We use a fixed Gaussian projection to compress intermediate features into compact "fingerprints" (262,144→1,024), stored in a codebook and used to compute merge weights. This training-free step keeps memory/compute proportional to the fingerprint dimension (not the full model size) and avoids loading/forwarding K past models (as MOS does), yielding markedly lower latency and memory in practice (e.g., −41.8% runtime and −8% memory vs. MOS on SECOND).
Fidelity. Despite compression, the fingerprint space preserves model-space geometry [B]; pairwise fingerprint differences correlate strongly with parameter differences across architectures/datasets (Pearson $r$ and Kendall $\tau$ typically > 0.7). Our ridge-leverage scoring in fingerprint space links to inverse curvature in parameter space, justifying that these low-dimensional features carry the information needed for reliable checkpoint selection.
Robustness. Ablations (Table 5 of our paper) over the projection dimension $d'$ show small variation, with $d'=1024$ the best accuracy–efficiency trade-off.
In summary, random projections give a simple, training-free, compute-efficient way to summarize checkpoints while retaining the structure needed for stable, effective test-time merging.
[A] Chen et al. MOS: Model Synergy for Test-Time Adaptation on LiDAR-Based 3D Object Detection. ICLR 2025
[B] Gong et al. Random Projections and Pre-trained Models for Continual Learning. NeurIPS 2023
W2.2 - Redefining projections at each time step?
No. We instantiate a fixed Gaussian random projection once before TTA and keep it frozen. It is applied to the pretrained source feature extractor’s intermediate activations $\phi_{\Theta^{(0)}}(x)$ to produce low-dimensional fingerprints $\hat z$ (Eq. 7 of our paper), embedding all checkpoints into a shared space across time. Because both the fingerprinting features (from the frozen extractor) and the projection are fixed, this mapping is training-free and lightweight.
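A minimal sketch of this fingerprinting step (PyTorch-style; the 262,144→1,024 sizes follow the example in W2.1, while the scaling convention, the helper name `make_fingerprint`, and the assumption of a single flattened feature tensor are illustrative, not the paper's exact implementation):

```python
import torch

torch.manual_seed(0)
D_IN, D_OUT = 262_144, 1_024                 # flattened feature dim -> fingerprint dim

# Fixed Gaussian projection: instantiated once before TTA and never updated.
P = torch.randn(D_OUT, D_IN) / D_OUT ** 0.5  # a common Johnson-Lindenstrauss scaling

@torch.no_grad()
def make_fingerprint(frozen_source_encoder, x):
    """Project the frozen source encoder's intermediate activations
    phi_{Theta^(0)}(x) into a low-dimensional fingerprint (codebook key)."""
    feat = frozen_source_encoder(x)          # features from the fixed extractor
    return P @ feat.flatten()                # \hat z, shape (1024,)
```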
W2.3 - Does the inherent randomness introduce performance variance?
We verify robustness by resampling the projection with three seeds (RP1–RP3) and re-running the full pipeline. The results vary slightly: Perception NDS = 0.4920–0.4946, mAP = 0.3630–0.3735; Tracking AMOTA = 0.3252–0.3347; Planning L2-Avg = 0.6209–0.6255. These small ranges indicate minor impact on end-to-end performance.
| Method | mAP↑ | NDS↑ | mATE↓ | mASE↓ | mAOE↓ | mAVE↓ | mAAE↓ | AMOTA↑ | AMOTP↓ | RECALL↑ | IDS↓ | APped↑ | APdivider↑ | APboundary↑ | map mAP↑ | minADE↓ | minFDE↓ | MR↓ | EPA↑ | L2-Avg↓ | CL↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RP1 | 0.3735 | 0.4946 | 0.6137 | 0.2786 | 0.5468 | 0.2935 | 0.1892 | 0.3347 | 1.3359 | 0.4714 | 869 | 0.4305 | 0.5224 | 0.5398 | 0.4976 | 0.6504 | 1.0122 | 0.1392 | 0.4680 | 0.6209 | 0.094% |
| RP2 | 0.3693 | 0.4935 | 0.6149 | 0.2786 | 0.5353 | 0.2925 | 0.1897 | 0.3319 | 1.3372 | 0.4756 | 1256 | 0.4229 | 0.5172 | 0.5349 | 0.4917 | 0.6415 | 0.9964 | 0.1362 | 0.4682 | 0.6219 | 0.11% |
| RP3 | 0.3630 | 0.4920 | 0.6068 | 0.2764 | 0.5315 | 0.2837 | 0.1962 | 0.3252 | 1.3411 | 0.4553 | 892 | 0.4106 | 0.5109 | 0.5198 | 0.4804 | 0.6406 | 0.9948 | 0.1356 | 0.4660 | 0.6255 | 0.125% |
This stability is expected because (i) the projection is a fixed Gaussian linear map instantiated once and kept frozen during TTA, applied to features from the fixed source encoder to produce low-dimensional fingerprints (Eq. 7 of our paper), embedding all checkpoints in a shared space; (ii) the fingerprint space faithfully mirrors parameter-space geometry (Figure 3 of our paper: Pearson $r$ and Kendall $\tau$ typically $>0.7$), so random draws that preserve this embedding yield similar merge decisions.
Overall, repeated runs with different seeds show only minor fluctuations, consistent with our design for a stable, geometry-preserving random projection.
Q1 - Alternative fingerprint generation strategies?
Start with lightweight strategies. We thank you for your constructive suggestions. We have evaluated efficient feature aggregation alternatives: MaxPool [C], AdaptiveAvgPool [D], Lp-Pooling [E], and Fractional-Max Pooling [F]. Although inexpensive, these compress features to coarse statistics and underperform on Waymo→KITTI; our random projection (RP) achieves $AP_{3D}=63.23$ vs. 58.58/60.34/61.40/61.78 for the pooling variants (gains of +4.64, +2.89, +1.82, and +1.44 points). This suggests simple pooling loses the discriminative structure needed for behavior-aware merging.
| Method | RP | MaxPool | AdaptiveAvgPool | Lp-Pooling | Fractional-Max Pooling |
|---|---|---|---|---|---|
| $AP_{3D}$ | 63.2258 | 58.5847 | 60.3375 | 61.4017 | 61.7825 |
[C] Scherer et al. Evaluation of pooling operations in convolutional architectures for object recognition. ICANN 2010
[D] He et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. ECCV 2014
[E] Boureau et al. A theoretical analysis of feature pooling in visual recognition. ICML 2010
[F] Zhai et al. Pooling With Stochastic Spatial Sampling. CVPR 2017
Heavier dimensionality reduction isn’t practical online. Methods like PCA require fitting (or updating) decompositions on streaming features, adding substantial compute at every TTA step, which is incompatible with our single-pass, real-time constraint.
In summary, pooling is fast but loses geometry/accuracy; PCA-like reductions are too costly per step. A fixed random projection preserves the geometry needed for robust merging while meeting real-time TTA efficiency, and empirically delivers the strongest performance in our comparisons.
Q2 - Applicability beyond SECOND
| TTA Methods (w. DSVT) | AP BEV | AP 3D |
|---|---|---|
| No Adapt | 65.06 | 27.14 |
| Tent | 63.94 | 31.07 |
| CoTTA | 66.63 | 34.51 |
| SAR | 66.12 | 37.45 |
| DPO | 75.46 | 45.06 |
| MOS | 77.38 | 57.41 |
| CodeMerge | 79.88 | 61.06 |
To assess generality, we evaluate CodeMerge on the recent 3D detector DSVT [G] for Waymo → KITTI. CodeMerge reaches 79.88 AP BEV / 61.06 AP 3D, outperforming MOS by 3.2% and 6.4%, and surpassing No Adapt by 125% in AP 3D. Because CodeMerge builds on training-free random-projection fingerprints and codebook-guided merging rather than architecture-specific modules, these results indicate our TTA is model-agnostic and transfers to detectors beyond SECOND.
[G] Wang et al. DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets. CVPR 2023
Q3 - Can fingerprints (from a fixed encoder on input $x_i$) reliably key the model parameters?
Yes. Empirically, pairwise distances in fingerprint space track those in weight space across datasets and models (Figure 3), with Pearson $r$ and Kendall $\tau$ typically >0.7, indicating that the low-dimensional geometry preserves parameter-space structure and supports reliable merge decisions.
Theoretically, under a ridge-regression surrogate for the linear detection head (with a fixed encoder), the ridge leverage score of a fingerprint is proportional to the directional inverse curvature $z_i^\top H_w^{-1} z_i$, so computations in fingerprint space provide a curvature-aware proxy for selecting informative parameter directions (Eq. 9–10). This provides a principled basis for using fingerprints as keys to the parameters they represent.
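For reference, the standard identity behind this statement, with $\hat Z$ the stacked fingerprints, $\hat z_i$ its $i$-th row, and $\lambda$ the ridge parameter (notation here is ours for illustration; see Eq. 9-10 of the paper for the exact form):

$$
\tau_i(\lambda) = \hat z_i^{\top}\big(\hat Z^{\top}\hat Z + \lambda I\big)^{-1}\hat z_i,
\qquad
H_w = \nabla_w^{2}\Big[\tfrac{1}{2}\lVert \hat Z w - y\rVert_2^{2} + \tfrac{\lambda}{2}\lVert w\rVert_2^{2}\Big] = \hat Z^{\top}\hat Z + \lambda I,
$$

so $\tau_i(\lambda) = \hat z_i^{\top} H_w^{-1}\hat z_i$, i.e., the leverage score of fingerprint $i$ measures inverse curvature along its direction.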
Finally, resampling the projection with different seeds (refer to W2.3) yields minor performance variation, further confirming that fingerprinting is stable and robust to projection randomness.
Once again, we sincerely appreciate your thoughtful review and constructive feedback! We will carefully incorporate these discussions into the revised manuscript.
Thank you for the detailed response. I’ve read all of your replies carefully, and my concerns have been fully addressed. I will therefore update my rating.
Thank you for carefully reviewing our rebuttal. We’re glad your concerns are resolved and your constructive comments have improved our work, and we will incorporate all the discussions and the additional experiments into the paper. We appreciate the updated rating.
Robust 3D perception under unpredictable conditions is a key challenge for autonomous driving, but existing test-time adaptation methods struggle with instability in high-variance tasks like 3D object detection. While model merging approaches based on linear mode connectivity improve stability, they are computationally heavy due to repeated checkpoint access and multiple forward passes. This paper introduces CodeMerge, a lightweight framework that merges models in a compact latent space using low-dimensional fingerprints and a key-value codebook, enabling efficient model composition with minimal overhead. CodeMerge achieves strong results across benchmarks—improving 3D detection by 14.9% NDS on nuScenes-C and over 7.6% mAP on LiDAR-based detection.
Strengths and Weaknesses
Strengths:
- This paper is well written and easy to follow. The motivation is clear and reasonable.
- The performance is good, improving 3D detection by 14.9% NDS on nuScenes-C and over 7.6% mAP on LiDAR-based detection.
- The experimental results are extensive. Plenty of ablation and sensitivity studies are provided, which are convincing.
Weaknesses:
- The authors claim in the abstract that "Code and pretrained models are released in the supplementary material", but I cannot find them.
Questions
Please refer to the Weaknesses.
Limitations
The authors provide one sentence in the Conclusion, but I don't think it is sufficient.
Final Justification
The quality of the paper is quite good, the authors have addressed my concerns, and I keep my initial rating.
Formatting Concerns
No
Response to Reviewer XHzU
W1 - Code availability and pretrained models.
Code availability. Thank you for pointing this out. The source code was included in the Supplementary Material as a ZIP archive attached to the submission.
Pretrained models. The pretrained and adapted checkpoints for the end‑to‑end models and the 3D detectors are not uploaded with the submission due to file‑size limits. We will release all checkpoints publicly upon acceptance together with a reproducible repository that includes experiment configs, logs, and instructions for all experiments.
This paper introduces CodeMerge, a lightweight model merging framework for robust test-time adaptation (TTA) in autonomous driving 3D perception. It uses low-dimensional fingerprints and a key-value codebook to enable efficient model composition, addressing the computational overhead of prior merging methods. Reviewers praised its clear motivation, strong performance gains (14.9% NDS on nuScenes-C, 7.6% mAP on cross-dataset), and broad evaluation across LiDAR/camera detectors and downstream tasks. Concerns included initial ambiguity about code availability, efficiency analysis gaps, checkpoint buffer practicality, and generalizability beyond SparseDrive. However, the authors resolved these in their rebuttal with clarifications, additional experiments on new architectures (DiffusionDrive, MomAD), and runtime comparisons. Despite minor limitations, CodeMerge makes valuable contributions to TTA for safety-critical autonomous driving, with rigorous validation and practical efficiency. The AC concurs this is a significant advancement and recommends acceptance.