CodeMerge: Codebook-Guided Model Merging for Robust Test-Time Adaptation in Autonomous Driving
An efficient model merging approach for online adaptation of LiDAR-based 3D detection and end-to-end autonomous driving systems
Abstract
Reviews and Discussion
This paper introduces CodeMerge, a novel and efficient framework for test-time adaptation (TTA) in the context of 3D perception for autonomous driving. The primary contribution is a method that avoids the high computational cost associated with existing model merging techniques. Instead of loading and performing inference with multiple full model checkpoints, CodeMerge creates a codebook where each checkpoint is represented by a compact, low-dimensional "fingerprint." These fingerprints are derived from the penultimate features of a source model. The framework then uses ridge leverage scores calculated on these fingerprints to determine the importance of each past checkpoint for merging. The authors demonstrate the effectiveness of CodeMerge on both end-to-end autonomous driving systems and modular LiDAR-based object detectors, showing significant performance improvements on challenging corruption and cross-dataset benchmarks, as well as benefits for downstream tasks like mapping and planning.
Strengths and Weaknesses
Strengths
- Significance: The paper addresses a highly significant and practical problem in autonomous driving: robustness to real-world distribution shifts encountered during deployment. The evaluation of TTA within a full end-to-end pipeline (SparseDrive) demonstrates that improvements in the perception module positively impact downstream tasks.
- Originality: The core idea of using low-dimensional fingerprints and ridge leverage scores to guide model merging is novel and insightful. The strong empirical correlation shown between fingerprint and parameter space differences provides good evidence for the validity of this latent-space approach.
- Quality and Clarity: The paper is generally well-organized and is supported by a comprehensive and rigorous set of experiments.
Weaknesses
- As other popular architectures like UniAD and VAD "experience over tenfold performance degradation on nuScenes-C, hindering effective adaptation training" (Lines 272-273), this raises a significant question about the generalizability of CodeMerge. While the results on SparseDrive are strong, the paper lacks evidence that the approach would be effective on other end-to-end models, which currently limits the perceived scope of the contribution.
Questions
- The paper's success is demonstrated compellingly on the SparseDrive architecture. However, the conclusion mentions significant challenges in applying TTA to other architectures like UniAD. Could the authors elaborate on the underlying reasons for the performance degradation on these other models? Does this suggest a limitation in the TTA paradigm for certain architectures, or a property of CodeMerge itself? Answering this would greatly clarify the method's scope and potential for broader application. My score would improve if a convincing argument can be made for its generalizability.
- Section 3.1 states that fingerprints are computed using the initial pretrained feature extractor $\phi_{\Theta^{(0)}}$. For the end-to-end experiments, it is mentioned that only the 3D box regression head is updated (Line 141). Does this "frozen feature extractor" strategy also apply to the LiDAR-only detector experiments on SECOND, or are more parameters adapted and merged in that setting? Clarifying the scope of adapted parameters $\Theta^{(t)}$ for all experimental settings would be helpful.
As I am not very familiar with this research field, if the answers are convincing enough, I would like to increase my scores.
Limitations
Yes
Final Justification
The authors' rebuttal, particularly the inclusion of new experiments on two additional modern architectures, has decisively addressed my primary reservation about the paper's generalizability.
- Generalizability was my most significant concern, and it has been fully resolved. First, they convincingly argued that the failure of models like VAD is due to the brittleness of the base model itself, not a flaw in the TTA paradigm or CodeMerge; their results showed CodeMerge does not degrade performance even in this collapsed regime. More importantly, they provided new experimental results on two recent, state-of-the-art architectures (DiffusionDrive and MomAD). These new results demonstrate clear and consistent performance gains across all downstream tasks, providing strong evidence that CodeMerge is not limited to SparseDrive and is indeed a more generally applicable technique.
- My question about the adaptation strategy for the SECOND detector was answered clearly and concisely. The authors confirmed they adapt the entire network, which is the established protocol in prior work, and cited relevant papers to support their methodology. This resolves the ambiguity.
- There are no remaining unresolved issues from my perspective.
Formatting Concerns
No formatting concerns found.
Response to Reviewer hVAv
We are grateful for your time and insightful comments!
W1 & Q1 - Generalizability beyond SparseDrive: explanation of the underlying causes of degradation, clarification of whether the limitation lies in the TTA paradigm or CodeMerge itself, and empirical evidence or arguments demonstrating broader applicability beyond SparseDrive.
Early end-to-end models under corruption. Our experiments show that some early architectures (e.g., VAD) collapse on corrupted scenes even before any TTA is applied. For example, under heavy snow, VAD attains only 0.0168 mAP for detection and 0.0353 NDS, indicating that the base model's predictions are already unreliable and the model has almost collapsed. In this regime, online self-supervision becomes untrustworthy, so any TTA method has little signal to learn from. Importantly, this reflects a robustness limitation of the base model, not of TTA or CodeMerge. Even so, attaching CodeMerge to VAD yields small but consistent gains across perception and planning metrics and does not degrade VAD's performance, which underscores that the bottleneck is the model's severe brittleness rather than our adaptation mechanism.
| Snow | mAP↑ | NDS↑ | mATE↓ | mASE↓ | mAOE↓ | mAVE↓ | mAAE↓ | AMOTA↑ | AMOTP↓ | RECALL↑ | IDS↓ | APped↑ | APdivider↑ | APboundary↑ | map mAP↑ | minADE↓ | minFDE↓ | MR↓ | EPA↑ | L2-Avg↓ | CL↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VAD | 0.0168 | 0.0353 | 0.9691 | 0.8577 | 0.9713 | 1.1064 | 0.9327 | -- | -- | -- | -- | 0.0006386 | 0.0014679 | 0.0002426 | 0.0007830 | 2.3891099 | 3.7661626 | 0.3022 | 0.0479 | 1.622955 | 0.0132 |
| VAD+CodeMerge | 0.0176 | 0.0356 | 0.9689 | 0.8576 | 0.9719 | 1.1019 | 0.9337 | -- | -- | -- | -- | 0.0005710 | 0.0015413 | 0.0002724 | 0.0007949 | 2.3264942 | 3.6645312 | 0.2974072 | 0.0519021 | 1.605700 | 0.01307 |
Recent architectures with non-collapsed baselines. When the backbone remains minimally competent under corruption, CodeMerge is highly effective and broadly applicable. On DiffusionDrive [A] and MomAD [B] (both CVPR 2025), CodeMerge delivers consistent improvements across detection, tracking, mapping, motion, and planning, and it reduces the collision rate under both Brightness and Snow corruptions. These results show that CodeMerge scales beyond SparseDrive.
| Brightness | mAP↑ | NDS↑ | mATE↓ | mASE↓ | mAOE↓ | mAVE↓ | mAAE↓ | AMOTA↑ | AMOTP↓ | RECALL↑ | IDS↓ | APped↑ | APdivider↑ | APboundary↑ | map mAP↑ | minADE↓ | minFDE↓ | MR↓ | EPA↑ | L2-Avg↓ | CL↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Diffusiondrive | 0.3278 | 0.4588 | 0.6402 | 0.2782 | 0.6473 | 0.3005 | 0.1851 | 0.2818 | 1.4334 | 0.4076 | 787 | 0.3651 | 0.4819 | 0.4794 | 0.4421 | 0.6624 | 1.0282 | 0.1380 | 0.4448 | 0.6001 | 0.089% |
| Diffusiondrive+CodeMerge | 0.3580 | 0.4845 | 0.6210 | 0.2747 | 0.5767 | 0.2811 | 0.1916 | 0.3150 | 1.3529 | 0.4638 | 873 | 0.4125 | 0.5162 | 0.5333 | 0.4873 | 0.6582 | 1.0261 | 0.1380 | 0.4648 | 0.5919 | 0.062% |
| MomAD | 0.3340 | 0.4711 | 0.6276 | 0.2734 | 0.5931 | 0.2900 | 0.1747 | 0.2944 | 1.4241 | 0.4101 | 731 | 0.2490 | 0.1808 | 0.3157 | 0.2485 | 1.3402 | 2.2509 | 0.2212 | 0.3965 | 7.3178 | 8.500% |
| MomAD +CodeMerge | 0.3664 | 0.4952 | 0.6092 | 0.2741 | 0.5496 | 0.2670 | 0.1803 | 0.3341 | 1.3447 | 0.4737 | 824 | 0.2755 | 0.1976 | 0.3365 | 0.2699 | 1.13309 | 2.2389 | 0.2211 | 0.4147 | 7.3171 | 8.438% |
| Snow | mAP↑ | NDS↑ | mATE↓ | mASE↓ | mAOE↓ | mAVE↓ | mAAE↓ | AMOTA↑ | AMOTP↓ | RECALL↑ | IDS↓ | APped↑ | APdivider↑ | APboundary↑ | map mAP↑ | minADE↓ | minFDE↓ | MR↓ | EPA↑ | L2-Avg↓ | CL↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Diffusiondrive | 0.1006 | 0.2373 | 0.7811 | 0.3794 | 0.9105 | 0.7007 | 0.3585 | 0.0503 | 1.8648 | 0.1064 | 305 | 0.0106 | 0.0426 | 0.0512 | 0.0348 | 1.0603 | 1.7077 | 0.1951 | 0.2063 | 0.9300 | 0.447% |
| Diffusiondrive+CodeMerge | 0.1797 | 0.3492 | 0.7389 | 0.2956 | 0.6673 | 0.4621 | 0.2424 | 0.1078 | 1.7171 | 0.1785 | 681 | 0.1010 | 0.1713 | 0.1688 | 0.1471 | 0.8086 | 1.2838 | 0.1730 | 0.3070 | 0.7730 | 0.215% |
[A] Liao et al. Truncated Diffusion Model for End-to-End Autonomous Driving. CVPR 2025
[B] Song et al. Don't Shake the Wheel: Momentum-Aware Planning in End-to-End Autonomous Driving. CVPR 2025
Q2 - Does this "frozen feature extractor" strategy also apply to the LiDAR-only detector experiments on SECOND, or are more parameters adapted and merged in that setting?
For 3D detection, prior works [C, D, E] have shown that adapting all weights of the entire detector during TTA yields the best performance, and we simply adopt that established experimental setup. Therefore, we train entire networks and then adapt and merge all parameters. Please also refer to Reviewer G337 W3.
[C] Chen et al. MOS: Model Synergy for Test-Time Adaptation on LiDAR-Based 3D Object Detection. ICLR 2025
[D] Chen et al. DPO: Dual-Perturbation Optimization for Test-time Adaptation in 3D Object Detection. ACM MM 2024
[E] Yuan et al. Reg-TTA3D: Better Regression Makes Better Test-Time Adaptive 3D Object Detection. ECCV 2024
Thank you for your detailed and convincing rebuttal. These results have substantially strengthened the paper and are the primary reason for my increased score.
Thank you very much for your positive and encouraging feedback. We are delighted that our rebuttal and additional results have addressed your concerns and strengthened the paper. We genuinely appreciate your time and thoughtful consideration during the review process. If you have any further suggestions, we would be more than happy to discuss them.
The authors tackle test-time adaptation for 3D detection by compressing every intermediate fine-tuned checkpoint into a low-dimensional fingerprint. The compressed features are stored in a codebook and ranked by ridge-leverage scores. Then, the top-k weights are linearly merged. The experiments are conducted with both camera-based and LiDAR-based 3D detectors with various types of sensor degradations. Improved perception performance also benefits the downstream task, including mapping, motion prediction, and planning.
Strengths and Weaknesses
Strengths
- The paper is generally well-written and easy to follow.
- The proposed idea of replacing K forward passes with a projection-space distance is clear.
- Experiments cover both camera-based and LiDAR-based 3D detectors with various types of degradation.
Weaknesses
- If I understand correctly, CodeMerge still requires a buffer of previously fine-tuned checkpoints, which means it needs gradient-based updates, and the paper assumes the model can back-propagate every frame (or every N frames) to grow the buffer. This may be impractical in a real scenario. Regarding this, how many gradient steps are executed per frame to produce each checkpoint? Please correct me if I misunderstood.
- Lack of robustness analysis. Once a heavy distribution shift occurs (e.g., fog + LiDAR dropout), the model may fail to rank checkpoints accurately.
- Similar to the above, does CodeMerge work if the backbone itself is adapted?
- It would be beneficial to include more discussion of domain-adaptation methods (e.g., SN, ST3D). From my understanding, ST3D and SN are offline domain-adaptation methods, and CodeMerge is evaluated after every test frame. What are the pros and cons of the two different directions?
- Could EMA be another baseline for merging in Table 5?
Additional Comments
- Line 251: Quantitative Analysis --> Qualitative Analysis
Questions
Please see the above.
Limitations
The authors have adequately addressed the limitations and potential negative societal impact of their work.
Final Justification
I appreciate the author's enthusiastic rebuttal. All of my concerns have been addressed well. I will maintain my positive score, raising my confidence from 2 to 3, as I believe I'm not an active researcher working on this specific task. I am happy for further discussion if needed.
Formatting Concerns
There are no major formatting issues in this paper.
Response to Reviewer G337
Thanks for your constructive and valuable feedback, which will greatly assist in enhancing our work!
W1 - Back-propagate through buffer? How many steps per frame?
No. All stored checkpoints are frozen, and we never back-propagate through past models. At step $t$, we (i) compute merging weights in fingerprint space and form a sign-consistent merged teacher from the top-$k$ buffered checkpoints, and (ii) use the merged teacher only to generate pseudo-labels for the incoming batch. Only the current model is updated, with a single gradient step per incoming batch; the other checkpoints are used for merging without any gradient updates. This avoids the $K$-forward-pass overhead of MOS [A], which is why our runtime and memory are lower (refer to Table 6 in the paper and our response to W1 of reviewer jES4).
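For concreteness, a minimal sketch of one adaptation step under the description above (PyTorch-style; the helper names `build_merged_teacher`, `tta_step`, `self_training_loss`, the softmax weighting, and the MSE placeholder are illustrative assumptions, not the paper's exact implementation):

```python
import copy
import torch
import torch.nn.functional as F

@torch.no_grad()
def build_merged_teacher(fingerprints, ckpt_states, k=5, lam=1e-3):
    """Rank buffered checkpoints by ridge leverage scores computed in the
    low-dimensional fingerprint space and merge the top-k parameter sets.
    No forward pass is made through any past checkpoint."""
    Z = torch.stack(fingerprints)                       # (N, d') codebook keys
    G = Z @ Z.T                                         # Gram matrix
    # Ridge leverage scores via Z (Z^T Z + lam I)^-1 Z^T = G (G + lam I)^-1
    lev = torch.diagonal(torch.linalg.solve(G + lam * torch.eye(len(Z)), G))
    top = torch.topk(lev, min(k, len(Z))).indices
    w = F.softmax(lev[top], dim=0)                      # merging weights (illustrative)

    merged = copy.deepcopy(ckpt_states[top[0].item()])
    for name in merged:
        stack = torch.stack([ckpt_states[i.item()][name].float() for i in top])
        # Weighted average of the selected checkpoints; the paper additionally
        # applies a sign-consistency rule before averaging (omitted here).
        merged[name] = (w.view(-1, *([1] * (stack.dim() - 1))) * stack).sum(0)
    return merged

def self_training_loss(pred, pseudo):
    # Placeholder: the actual objective is the detector's pseudo-label loss.
    return F.mse_loss(pred, pseudo)

def tta_step(model, teacher, batch, optimizer):
    """One online step: the frozen merged teacher produces pseudo-labels,
    and only the current model receives a single gradient update."""
    with torch.no_grad():
        pseudo_labels = teacher(batch)
    loss = self_training_loss(model(batch), pseudo_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```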
W2 - Robustness under fog + LiDAR dropout
We conduct additional experiments adapting SECOND from KITTI to KITTI-C under heavy fog with increasing point-dropout (0%→25%→50%) and report the AP 3D and AP BEV at the moderate level in the table below.
| Dropout Ratio | 0% | 25% | 50% |
|---|---|---|---|
| AP 3D | 75.96 | 73.87 | 71.95 |
| AP BEV | 88.16 | 85.73 | 85.04 |
With increasing point-dropout, performance degrades gradually rather than catastrophically: AP 3D drops from 75.96→73.87→71.95 (−4.01 points, ≈5.3% relative), and AP BEV from 88.16→85.73→85.04 (−3.12 points, ≈3.5% relative). This smooth, monotonic decline suggests that our checkpoint ranking/merging remains reliable even under challenging compounded shifts.
W3 - Does CodeMerge still work when the backbone is adapted?
E2E model: In the paper's end-to-end setting, we froze all components except the 3D box head to isolate the effect of merging and keep TTA lightweight. We conducted additional experiments in which the feature-extraction backbone (ResNet) is also adapted and observe negligible deltas relative to head-only TTA (e.g., NDS −0.0034, mAP −0.0055, AMOTA −0.0028, map mAP −0.0097, L2-Avg −0.0023), indicating that CodeMerge remains effective when the backbone itself is updated. Furthermore, we have evaluated the effectiveness of CodeMerge on different backbone models (see Reviewer hVAv, W1 & Q1), demonstrating the method's broad applicability.
| ColorQuant | mAP↑ | NDS↑ | mATE↓ | mASE↓ | mAOE↓ | mAVE↓ | mAAE↓ | AMOTA↑ | AMOTP↓ | RECALL↑ | IDS↓ | APped↑ | APdivider↑ | APboundary↑ | map mAP↑ | minADE↓ | minFDE↓ | MR↓ | EPA↑ | L2-Avg↓ | CL↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CodeMerge+det head | 0.2742 | 0.4331 | 0.6575 | 0.2764 | 0.5903 | 0.3018 | 0.2137 | 0.2339 | 1.4868 | 0.333 | 490 | 0.260 | 0.3445 | 0.3267 | 0.3104 | 0.7002 | 1.0859 | 0.1454 | 0.384 | 0.6729 | 0.106% |
| CodeMerge+ResNet | 0.2687 | 0.4297 | 0.6574 | 0.2778 | 0.5906 | 0.3092 | 0.2119 | 0.2311 | 1.4799 | 0.3468 | 510 | 0.2407 | 0.3452 | 0.3161 | 0.3007 | 0.7087 | 1.0930 | 0.1453 | 0.3793 | 0.6752 | 0.102% |
3D detection model: We follow prior works [A, B] and adapt the entire detector (encoder, detection head, and all batch-norm layers), as these works have already shown that, for test-time adaptation in 3D detection, adapting the full network leads to superior performance compared to adapting batch-norm layers only.
[A] Chen et al. MOS: Model Synergy for Test-Time Adaptation on LiDAR-Based 3D Object Detection. ICLR 2025
[B] Chen et al. DPO: Dual-Perturbation Optimization for Test-time Adaptation in 3D Object Detection. ACM MM 2024
W4 - Offline UDA (SN, ST3D) vs. online TTA (CodeMerge): pros & cons
Different Assumptions and Settings
UDA (SN, ST3D): has offline access to unlabeled target data and retrains the model for multiple epochs with self‑training / distribution alignment (e.g., ST3D), or target‑size statistics (SN). This typically yields strong target‑specific models when the target distribution is relatively stationary, but incurs heavy computation and latency before deployment.
TTA (CodeMerge): adapts online at inference on a stream of target frames. There is no multi-epoch retraining, so each step is low-cost and fits real-time constraints.
Strengths & trade‑offs
UDA (SN, ST3D):
- Pros: Can fully specialize to the target dataset with extensive training; strong performance when the deployment domain is fixed.
- Cons: Requires offline target data and multi‑epoch computation; not reactive to rapidly changing in‑field dynamic conditions.
TTA (CodeMerge):
- Pros: Immediate, frame‑by‑frame adaptation suited to dynamic shifts; efficient (e.g., vs. MOS, –41.8% runtime and –8% memory on SECOND; even larger savings on SparseDrive). Robust improvements are observed on cross‑dataset and corruption benchmarks, often surpassing UDA baselines (e.g., > ST3D on nuScenes→KITTI).
- Cons: No offline access to target data; improvements depend on on‑the‑fly pseudo‑labels and the diversity/quality of the checkpoint buffer, so substantial shifts may require several steps to accumulate useful checkpoints. (We mitigate this with geometry‑preserving fingerprints and curvature‑aware selection.)
W5 - Merging baseline: EMA
We conducted additional experiments by replacing the merged model with the EMA model. On the same setting in Table 5 of the paper, CodeMerge (k=5, proj‑dim=1024) consistently outperforms EMA across all tasks: Detection mAP/NDS +0.0373/+0.0272; Tracking AMOTA +0.0372 and AMOTP −0.0810 (↓ better); Mapping mAP/APped +0.0355/+0.0300; Motion mADE/mFDE −0.0147/−0.0301; Planning L2‑Avg/CR‑Avg −0.0274/−0.0160. These gains indicate that our fingerprint‑guided merging, which selects complementary checkpoints, offers stronger generalization than the uniform, time‑local smoothing performed by EMA.
| Method | K | Proj.-D | Det. mAP↑ | Det. NDS↑ | Track. AMOTA↑ | Track. AMOTP↓ | Map mAP↑ | Map APped↑ | Motion mADE↓ | Motion mFDE↓ | Plan. L2-Avg↓ | Plan. CR-Avg↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EMA | – | – | 0.2478 | 0.3992 | 0.1869 | 1.6016 | 0.3748 | 0.3413 | 0.7375 | 1.1447 | 0.6778 | 0.125 |
| CodeMerge | 5 | 1024 | 0.2851 | 0.4264 | 0.2241 | 1.5206 | 0.4103 | 0.3713 | 0.7228 | 1.1146 | 0.6504 | 0.109 |
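For clarity, a minimal sketch of the EMA baseline used in this comparison (the decay value `alpha` is an assumed illustration, not the exact value used here); it performs time-local smoothing toward the current model, in contrast to the fingerprint-guided top-k merge sketched in our W1 response:

```python
import torch

@torch.no_grad()
def ema_update(ema_state, current_state, alpha=0.999):
    """EMA baseline: exponential smoothing of the current model's parameters.
    Unlike CodeMerge, it cannot select complementary past checkpoints."""
    for name, param in current_state.items():
        if not torch.is_floating_point(ema_state[name]):
            continue  # skip integer buffers such as BN step counters
        ema_state[name].mul_(alpha).add_(param, alpha=1.0 - alpha)
```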
Thanks again for your helpful comments and questions!
I appreciate the author's enthusiastic rebuttal. All of my concerns have been addressed well. I will maintain my positive score, raising my confidence from 2 to 3, as I believe I'm not an active researcher working on this specific task. I am happy for further discussion if needed.
We greatly appreciate your recognition of our efforts and your willingness to engage constructively. Should you have any additional suggestions or points for further discussion, we would be delighted to address them. Thanks again for your supportive review!
This paper proposes a novel model-merging-based TTA method for LiDAR 3D object detection and E2E AD models. In the proposed CodeMerge approach, previously extracted features and corresponding checkpoints are stored. At each time step, the features of these checkpoints are projected into low-dimensional features, the fingerprints. Ridge leverage scores (RLS) are then computed based on these fingerprints to guide the model merging process. Compared to prior model-merging-based TTA methods such as MOS, this approach demonstrates improved efficiency and performance. The effectiveness of the method is validated across multiple settings, including cross-dataset and clean-to-corruption adaptation.
Strengths and Weaknesses
[ Strengths ]
- The proposed fingerprint-based model merging method resolves the scalability limitations of prior model-merging-based methods, which are computationally expensive.
- The method is evaluated not only for LiDAR-based 3D object detection but also for end-to-end autonomous driving tasks, demonstrating its effectiveness as a TTA method across various application domains. This broadens its potential applicability.
[ Weaknesses ]
- Regarding the efficiency analysis, does the reported runtime refer to a single adaptation step? If so, it would be helpful to specify the inference time for SECOND and the computational burden introduced by each TTA step. Including runtime comparisons with other baselines such as TENT, CoTTA, and DUA would provide further evidence of the practicality of the proposed method.
- The paper employs random projection for generating fingerprints. Is there a rationale behind using random projections? Why are these projections redefined at each time step? Does the inherent randomness introduce performance variance? It would be helpful to clarify whether the method is robust across repeated experiments.
Questions
- Have alternative fingerprint generation strategies been considered?
- Can the proposed TTA method be applied to 3D object detection models other than SECOND?
- Since the fingerprint is generated using features extracted from a fixed encoder given the i-th input, can it reliably serve as a key of the model parameters themselves?
Limitations
Yes
Final Justification
The authors have thoroughly addressed all the concerns raised in the rebuttal. My doubts regarding the efficiency and rationale of the proposed methodology have been fully resolved. Therefore, I am updating my final rating.
Formatting Concerns
There are no major formatting issues in the paper.
Response to Reviewer jES4
W1 - Computational cost of each TTA step and comparison with baselines on SECOND
Thank you for your valuable comments! The runtime in Table 6 of our paper is the total time aggregated over all adaptation steps across the full test set. Below we report GPU memory and per-frame latency with AP 3D for SECOND on nuScenes→KITTI.
| Method | GPU Memory (MiB) | Runtime (Seconds / Frame) | AP 3D |
|---|---|---|---|
| TENT | 10,832 | 0.26 | 18.83 |
| CoTTA | 15,099 | 0.15 | 47.61 |
| MOS | 17,411 | 0.49 | 51.11 |
| CodeMerge | 16,041 | 0.28 | 58.54 |
Speed comparison. Compared with MOS [A], CodeMerge cuts latency by ~43% (0.49→0.28 s) and raises AP 3D by +14.5%. Relative to TENT, +210.9% AP 3D at similar latency (0.26 vs. 0.28 s). Compared with CoTTA, only +0.13 s/frame yet +22.9% AP 3D.
Memory comparison. CodeMerge uses less memory than MOS (16,041 vs. 17,411 MiB) and slightly more than CoTTA (15,099 MiB), while achieving the highest AP 3D. Compared with TENT, it needs +5,209 MiB (~5.1 GiB) but keeps similar runtime and delivers significant AP gain: +210.9%.
Overall, CodeMerge offers the best accuracy–efficiency trade-off on SECOND for nuScenes→KITTI: highest AP 3D with competitive latency and memory, suitable for real-time TTA.
W2.1 - Rationale behind random projections.
Efficiency. Random projection primarily improves memory and runtime. We use a fixed Gaussian projection to compress intermediate features into compact "fingerprints" (262,144→1,024), stored in a codebook and used to compute merge weights. This training-free step keeps memory/compute proportional to the fingerprint dimension (not the full model size) and avoids loading/forwarding K past models (as MOS does), yielding markedly lower latency and memory in practice (e.g., −41.8% runtime and −8% memory vs. MOS on SECOND).
Fidelity. Despite compression, the fingerprint space preserves model-space geometry [B]; pairwise fingerprint differences correlate strongly with parameter differences across architectures/datasets (Pearson $r$ and Kendall $\tau$ typically > 0.7). Our ridge-leverage scoring in fingerprint space links to inverse curvature in parameter space, justifying that these low-dimensional features carry the information needed for reliable checkpoint selection.
Robustness. Ablations (Table 5 of our paper) over the projection dimension $d'$ show small variation, with $d'=1024$ the best accuracy–efficiency trade-off.
In summary, random projections give a simple, training-free, compute-efficient way to summarize checkpoints while retaining the structure needed for stable, effective test-time merging.
[A] Chen et al. MOS: Model Synergy for Test-Time Adaptation on LiDAR-Based 3D Object Detection. ICLR 2025
[B] Gong et al. Random Projections and Pre-trained Models for Continual Learning. NeurIPS 2023
W2.2 - Redefining projections at each time step?
No. We instantiate a fixed Gaussian random projection once before TTA and keep it frozen. It is applied to the pretrained source feature extractor’s intermediate activations $\phi_{\Theta^{(0)}}(x)$ to produce low-dimensional fingerprints $\hat z$ (Eq. 7 of our paper), embedding all checkpoints into a shared space across time. Because both the fingerprinting features (from the frozen extractor) and the projection are fixed, this mapping is training-free and lightweight.
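A minimal sketch of this fingerprinting step (PyTorch-style; the 262,144→1,024 sizes follow the example in W2.1, while the scaling convention, the helper name `make_fingerprint`, and the assumption of a single flattened feature tensor are illustrative, not the paper's exact implementation):

```python
import torch

torch.manual_seed(0)
D_IN, D_OUT = 262_144, 1_024                 # flattened feature dim -> fingerprint dim

# Fixed Gaussian projection: instantiated once before TTA and never updated.
P = torch.randn(D_OUT, D_IN) / D_OUT ** 0.5  # a common Johnson-Lindenstrauss scaling

@torch.no_grad()
def make_fingerprint(frozen_source_encoder, x):
    """Project the frozen source encoder's intermediate activations
    phi_{Theta^(0)}(x) into a low-dimensional fingerprint (codebook key)."""
    feat = frozen_source_encoder(x)          # features from the fixed extractor
    return P @ feat.flatten()                # \hat z, shape (1024,)
```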
W2.3 - Does the inherent randomness introduce performance variance?
We verify robustness by resampling the projection with three seeds (RP1–RP3) and re-running the full pipeline. The results vary slightly: Perception NDS = 0.4920–0.4946, mAP = 0.3630–0.3735; Tracking AMOTA = 0.3252–0.3347; Planning L2-Avg = 0.6209–0.6255. These small ranges indicate minor impact on end-to-end performance.
| Method | mAP↑ | NDS↑ | mATE↓ | mASE↓ | mAOE↓ | mAVE↓ | mAAE↓ | AMOTA↑ | AMOTP↓ | RECALL↑ | IDS↓ | APped↑ | APdivider↑ | APboundary↑ | map mAP↑ | minADE↓ | minFDE↓ | MR↓ | EPA↑ | L2-Avg↓ | CL↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RP1 | 0.3735 | 0.4946 | 0.6137 | 0.2786 | 0.5468 | 0.2935 | 0.1892 | 0.3347 | 1.3359 | 0.4714 | 869 | 0.4305 | 0.5224 | 0.5398 | 0.4976 | 0.6504 | 1.0122 | 0.1392 | 0.4680 | 0.6209 | 0.094% |
| RP2 | 0.3693 | 0.4935 | 0.6149 | 0.2786 | 0.5353 | 0.2925 | 0.1897 | 0.3319 | 1.3372 | 0.4756 | 1256 | 0.4229 | 0.5172 | 0.5349 | 0.4917 | 0.6415 | 0.9964 | 0.1362 | 0.4682 | 0.6219 | 0.11% |
| RP3 | 0.3630 | 0.4920 | 0.6068 | 0.2764 | 0.5315 | 0.2837 | 0.1962 | 0.3252 | 1.3411 | 0.4553 | 892 | 0.4106 | 0.5109 | 0.5198 | 0.4804 | 0.6406 | 0.9948 | 0.1356 | 0.4660 | 0.6255 | 0.125% |
This stability is expected because (i) the projection is a fixed Gaussian linear map instantiated once and kept frozen during TTA, applied to features from the fixed source encoder to produce low-dimensional fingerprints (Eq. 7 of our paper), embedding all checkpoints in a shared space; (ii) the fingerprint space faithfully mirrors parameter-space geometry (Figure 3 of our paper: Pearson $r$ and Kendall $\tau$ typically $>0.7$), so random draws that preserve this embedding yield similar merge decisions.
Overall, repeated runs with different seeds show only minor fluctuations, consistent with our design for a stable, geometry-preserving random projection.
Q1 - Alternative fingerprint generation strategies?
Start with lightweight strategies. We thank you for your constructive suggestions. We have evaluated efficient feature aggregation alternatives: MaxPool [C], AdaptiveAvgPool [D], Lp-Pooling [E], and Fractional-Max Pooling [F]. Although inexpensive, these compress features to coarse statistics and underperform on Waymo→KITTI; our random projection (RP) achieves $AP_{3D}=63.23$ vs. 58.58/60.34/61.40/61.78 for the pooling variants (gains of +4.64, +2.89, +1.82, and +1.44 points). This suggests simple pooling loses the discriminative structure needed for behavior-aware merging.
| Method | RP | MaxPool | AdaptiveAvgPool | Lp-Pooling | Fractional-Max Pooling |
|---|---|---|---|---|---|
| $AP_{3D}$ | 63.2258 | 58.5847 | 60.3375 | 61.4017 | 61.7825 |
[C] Scherer et al. Evaluation of pooling operations in convolutional architectures for object recognition. ICANN 2010
[D] He et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. ECCV 2014
[E] Boureau et al. A theoretical analysis of feature pooling in visual recognition. ICML 2010
[F] Zhai et al. Pooling With Stochastic Spatial Sampling. CVPR 2017
Heavier dimensionality reduction isn’t practical online. Methods like PCA require fitting (or updating) decompositions on streaming features, adding substantial compute at every TTA step, which is incompatible with our single-pass, real-time constraint.
In summary, pooling is fast but loses geometry/accuracy; PCA-like reductions are too costly per step. A fixed random projection preserves the geometry needed for robust merging while meeting real-time TTA efficiency, and empirically delivers the strongest performance in our comparisons.
Q2 - Applicability beyond SECOND
| TTA Methods (w. DSVT) | AP BEV | AP 3D |
|---|---|---|
| No Adapt | 65.06 | 27.14 |
| Tent | 63.94 | 31.07 |
| CoTTA | 66.63 | 34.51 |
| SAR | 66.12 | 37.45 |
| DPO | 75.46 | 45.06 |
| MOS | 77.38 | 57.41 |
| CodeMerge | 79.88 | 61.06 |
To assess generality, we evaluate CodeMerge on the recent 3D detector DSVT [G] for Waymo → KITTI. CodeMerge reaches 79.88 AP BEV / 61.06 AP 3D, outperforming MOS by 3.2% and 6.4%, and surpassing No Adapt by 125% in AP 3D. Because CodeMerge builds on training-free random-projection fingerprints and codebook-guided merging rather than architecture-specific modules, these results indicate our TTA is model-agnostic and transfers to detectors beyond SECOND.
[G] Wang et al. DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets. CVPR 2023
Q3 - Can fingerprints (from a fixed encoder on input $x_i$) reliably key the model parameters?
Yes. Empirically, pairwise distances in fingerprint space track those in weight space across datasets and models (Figure 3), with Pearson $r$ and Kendall $\tau$ typically >0.7, indicating that the low-dimensional geometry preserves parameter-space structure and supports reliable merge decisions.
Theoretically, under a ridge-regression surrogate for the linear detection head (with a fixed encoder), the ridge leverage score of a fingerprint is proportional to the directional inverse curvature $z_i^\top H_w^{-1} z_i$, so computations in fingerprint space provide a curvature-aware proxy for selecting informative parameter directions (Eq. 9–10). This provides a principled basis for using fingerprints as keys to the parameters they represent.
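For reference, the standard identity behind this statement, with $\hat Z$ the stacked fingerprints, $\hat z_i$ its $i$-th row, and $\lambda$ the ridge parameter (notation here is ours for illustration; see Eq. 9-10 of the paper for the exact form):

$$
\tau_i(\lambda) = \hat z_i^{\top}\big(\hat Z^{\top}\hat Z + \lambda I\big)^{-1}\hat z_i,
\qquad
H_w = \nabla_w^{2}\Big[\tfrac{1}{2}\lVert \hat Z w - y\rVert_2^{2} + \tfrac{\lambda}{2}\lVert w\rVert_2^{2}\Big] = \hat Z^{\top}\hat Z + \lambda I,
$$

so $\tau_i(\lambda) = \hat z_i^{\top} H_w^{-1}\hat z_i$, i.e., the leverage score of fingerprint $i$ measures inverse curvature along its direction.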
Finally, resampling the projection with different seeds (refer to W2.3) yields minor performance variation, further confirming that fingerprinting is stable and robust to projection randomness.
Once again, we sincerely appreciate your thoughtful review and constructive feedback! We will carefully incorporate these discussions into the revised manuscript.
Thank you for the detailed response. I’ve read all of your replies carefully, and my concerns have been fully addressed. I will therefore update my rating.
Thank you for carefully reviewing our rebuttal. We’re glad your concerns are resolved and your constructive comments have improved our work, and we will incorporate all the discussions and the additional experiments into the paper. We appreciate the updated rating.
Robust 3D perception under unpredictable conditions is a key challenge for autonomous driving, but existing test-time adaptation methods struggle with instability in high-variance tasks like 3D object detection. While model merging approaches based on linear mode connectivity improve stability, they are computationally heavy due to repeated checkpoint access and multiple forward passes. This paper introduces CodeMerge, a lightweight framework that merges models in a compact latent space using low-dimensional fingerprints and a key-value codebook, enabling efficient model composition with minimal overhead. CodeMerge achieves strong results across benchmarks—improving 3D detection by 14.9% NDS on nuScenes-C and over 7.6% mAP on LiDAR-based detection.
Strengths and Weaknesses
Strengths:
- This paper is well written and easy to follow. The motivation is clear and reasonable.
- The performance is good, improving 3D detection by 14.9% NDS on nuScenes-C and over 7.6% mAP on LiDAR-based detection.
- The experimental results are extensive. Plenty of ablation and sensitivity studies are provided, which are convincing.
Weaknesses:
- The authors claim in the abstract that "Code and pretrained models are released in the supplementary material", but I cannot find them.
Questions
Please refer to the Weaknesses.
Limitations
The authors provide one sentence in the Conclusion, but I don't think it is sufficient.
Final Justification
The quality of the paper is quite good, the authors have addressed my concerns, and I keep my initial rating.
Formatting Concerns
No
Response to Reviewer XHzU
W1 - Code availability and pretrained models.
Code availability. Thank you for pointing this out. The source code was included in the Supplementary Material as a ZIP archive attached to the submission.
Pretrained models. The pretrained and adapted checkpoints for the end‑to‑end models and the 3D detectors are not uploaded with the submission due to file‑size limits. We will release all checkpoints publicly upon acceptance together with a reproducible repository that includes experiment configs, logs, and instructions for all experiments.
This paper introduces CodeMerge, a lightweight model merging framework for robust test-time adaptation (TTA) in autonomous driving 3D perception. It uses low-dimensional fingerprints and a key-value codebook to enable efficient model composition, addressing the computational overhead of prior merging methods. Reviewers praised its clear motivation, strong performance gains (14.9% NDS on nuScenes-C, 7.6% mAP on cross-dataset), and broad evaluation across LiDAR/camera detectors and downstream tasks. Concerns included initial ambiguity about code availability, efficiency analysis gaps, checkpoint buffer practicality, and generalizability beyond SparseDrive. However, the authors resolved these in their rebuttal with clarifications, additional experiments on new architectures (DiffusionDrive, MomAD), and runtime comparisons. Despite minor limitations, CodeMerge makes valuable contributions to TTA for safety-critical autonomous driving, with rigorous validation and practical efficiency. The AC concurs this is a significant advancement and recommends acceptance.