PaperHub
Overall: 7.3 / 10
Poster · 4 reviewers
Ratings: 4, 5, 5, 4 (min 4, max 5, std 0.5, average 4.5)
Confidence
Novelty: 2.8
Quality: 2.8
Clarity: 2.3
Significance: 2.3
NeurIPS 2025

Fin3R: Fine-tuning Feed-forward 3D Reconstruction Models via Monocular Knowledge Distillation

OpenReview · PDF
Submitted: 2025-04-10 · Updated: 2025-10-29
TL;DR

We propose a unified framework that distills knowledge from monocular geometry models to improve the performance of feed-forward reconstruction networks.

Abstract

Keywords
3D Reconstruction · Knowledge Distillation · Fine-Tuning

Reviews and Discussion

Official Review
Rating: 4

This paper proposes a new fine-tuning strategy to improve the performance of current feed-forward reconstruction models. The authors use MoGe as the teacher network and fine-tune the model on the SA-1B dataset. They also adopt LoRA and rescale the feature norm to satisfy the requirements of the frozen decoders.

Strengths and Weaknesses

Strengths

  1. The authors made a thorough analysis of the limitations of the current feed-forward reconstruction methods.
  2. The authors proposed a new fine-tuning strategy.

Weaknesses

  1. Instead of distilling the knowledge from MoGe, why not directly integrate it into the proposed model? In Table 1, none of the fine-tuned methods outperforms MoGe.
  2. In Lines 132-133, the authors claim, 'the limitations in detail recovery primarily originate from the encoder'. However, I did not see any theoretical or empirical evidence to support this statement. Fine-tuning the decoder with a frozen encoder could also improve the performance.
  3. What motivates the authors to integrate LoRA instead of others?
  4. The improvement in Figures 5 and 6 is tiny; in the 6th row of Figure 4, the fine-tuning seems to degrade the original performance.
  5. In Table 3, why does the proposed method improve the static-scene reconstruction methods but make the dynamic-scene reconstruction method worse?
  6. How about fine-tuning the encoder and decoder on the new dataset without any of the proposed strategies?

Questions

Please refer to the weaknesses.

Limitations

Please refer to weaknesses.

Final Justification

The authors' answers addressed my concerns.

Formatting Issues

None.

Author Response

We sincerely thank the reviewer for their positive and encouraging feedback. We are particularly encouraged that the reviewer recognized the two central contributions of our paper: our detailed analysis of limitations in existing feed-forward reconstruction methods, and the value of our fine-tuning strategy for generating more robust geometric features. We appreciate their thorough evaluation and will now address the points raised for further clarification.

W1: Why not integrate MoGe directly?

We thank the reviewer for this point, as it allows us to clarify the central research question of our work. Our primary goal is not to develop a system that surpasses the top-performing monocular depth estimators like MoGe. Instead, we investigate a different and fundamental question: "How can we leverage a powerful monocular depth model to improve the training of feed-forward 3D reconstruction models?"

Our work focuses on addressing the inherent limitations of these feed-forward models, which often struggle with geometric details and robustness. By using the monocular model as a supervisory signal, we successfully enhance the underlying feature representation. This contribution is validated by the fact that our method achieves promising results even with limited data, a key strength recognized by Reviewer dQ4p.

This distinction is important. While one could likely achieve a higher metric by directly integrating a large model like MoGe, this represents more of an engineering solution to boost performance. Such an approach is not relevant to our scientific inquiry and would introduce additional practical drawbacks:

  • Computational Cost: Integrating MoGe would add 24 ViT layers, drastically increasing the model's parameter count.
  • Inference Overhead: This added complexity would lead to a substantial increase in inference time.

W2/W6: encoder / decoder tuning

We sincerely thank the reviewer for this insightful comment. We agree that providing an empirical justification for freezing the decoder is essential, as this is a central design choice in our methodology.

This was, in fact, a core hypothesis that we tested during our initial exploratory studies. The results from these early experiments confirmed that fine-tuning the decoder with single-view data creates a conflict: it improves per-image depth but degrades the cross-view matching ability required for robust pose estimation. In preparing the final manuscript, we omitted this specific ablation study in the interest of conciseness, choosing to focus on the results of our final proposed model.

We agree with reviewers Hi89 and Kjwd that providing these results now significantly strengthens the paper's narrative. We are grateful for the opportunity to share the data that validates our approach. The following table shows relative pose estimation results on ScanNet.

| Method | AUC @ 5° ↑ | AUC @ 10° ↑ | AUC @ 20° ↑ |
|---|---|---|---|
| VGGT | 28.40 | 47.36 | 61.51 |
| Encoder-only | 32.06 | 52.29 | 68.04 |
| Decoder-only | 28.42 | 51.59 | 67.30 |
| Full Finetune | 26.35 | 45.90 | 60.02 |
| VGGT+Ours | 35.21 | 56.70 | 72.80 |

W3: Why LoRA instead of others?

Our primary objective is to enhance the feature extractor's ability to capture fine-grained geometric details and improve its robustness. This requires a fine-tuning method that can directly modify the internal representations of the model. We selected LoRA because it is exceptionally well-suited for this task. By injecting low-rank adaptation matrices directly into the layers of the encoder, LoRA refines the model's weights, fundamentally improving the feature generation process itself.
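To make this concrete, below is a minimal PyTorch sketch of a LoRA-adapted linear layer with an output re-normalization step, in the spirit of the feature-norm rescaling mentioned in the review summary. The rank, scaling, and the exact norm-matching rule are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ReNormLoRALinear(nn.Module):
    """Sketch (assumptions, not the paper's exact method): a frozen linear layer plus a
    low-rank update, with the output rescaled to the frozen branch's per-token norm so
    the frozen decoder still sees familiar feature statistics."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # pre-trained weights stay frozen
            p.requires_grad = False
        self.lora_A = nn.Linear(base.in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)        # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        frozen = self.base(x)
        adapted = frozen + self.scale * self.lora_B(self.lora_A(x))
        # Rescale each token's feature vector to match the norm of the frozen output.
        target = frozen.norm(dim=-1, keepdim=True)
        return adapted * target / adapted.norm(dim=-1, keepdim=True).clamp_min(1e-6)
```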

To validate our hypothesis, we contrasted this approach with Visual Prompt Tuning. VPT introduces learnable tokens into the input space, which is highly effective for adapting a fixed feature extractor to a new task but does not alter its core feature representation.

The results below confirm that for our goal of improving the features themselves, our LoRA-based approach is superior:

Visual Prompt Tuning

| Method | rel ↓ | δ1 ↑ | AUC @ 5° ↑ | AUC @ 10° ↑ | AUC @ 20° ↑ |
|---|---|---|---|---|---|
| VGGT+Ours | 4.29 | 96.7 | 35.21 | 56.70 | 72.80 |
| VGGT+Visual Prompt Tuning | 4.50 | 96.0 | 33.87 | 54.46 | 70.06 |

W4: Tiny improvement

Tiny improvements in Figure 5: This figure in fact shows consistent and notable improvements in monocular depth estimation (boundary sharpness and robustness) across three different baselines. We will further highlight these improvements in our revision.

Tiny improvements in Figure 6: We appreciate the reviewer's comment. The improvements shown in Figure 6 are targeted and precise, as they are designed to correct specific failure modes of the baseline model. Figure 6 is not about large-scale changes but about critical geometric details: the sharp chair boundaries, the coherent reconstruction of a reflective cabin, the planar surface near the bread machine, and the removal of wall artifacts. These examples are direct evidence that our method enhances fine geometry and robustness where it matters most. For further evidence of these improvements at a scene-wide scale, we direct the reviewer's attention to our supplementary video, which contains fly-throughs that clearly illustrate the superior geometric quality and stability provided by our method.

Degradation in the 6th row of Figure 4: We believe the reviewer is referring to the 6th row of Figure 5 rather than Figure 4, since Figure 4 contains feature-norm heatmaps rather than reconstruction outputs. This example, featuring a banner with complex geometry, was intentionally chosen to showcase how our method enhances model robustness in challenging scenarios. We would like to highlight that the baseline models completely fail to reconstruct this banner, whereas our fine-tuned models successfully capture the banner's overall structure.

W5: Table 3 results

The issue is data contamination in CUT3R's training protocol (as we explain in L217-L218). The original CUT3R training data include the RealEstate10K training set, which boosts its performance on the RE10K test set. This leads to an overfitted baseline; the isolated result is an artifact of the baseline's evaluation, not a limitation of our method. Our fine-tuned DUSt3R and VGGT demonstrate clear improvements in multi-view pose estimation.

Comment

Thank you to the authors for their rebuttal. I acknowledge that the core contribution of this paper is to improve training; however, the goal of improving training is either to improve accuracy or to achieve the same accuracy with less time and data. If the authors fine-tune the model from an existing checkpoint, it is not reasonable to claim that less data and time are needed to reach the same level of performance. Also, I am still not clear how LoRA benefits the training, from theoretical and experimental perspectives respectively. Besides, I understand there might be data contamination in CUT3R's training configuration; however, if the authors fine-tune the model starting from the CUT3R checkpoint and attempt to improve its performance, why can't the fine-tuned model surpass CUT3R? Can you further explain this? Thank you.

Comment

We thank Reviewer Kjwd for the reply. We appreciate the opportunity to provide a more thorough clarification of our work.

Q1: On Claims of Efficiency When Fine-Tuning from a Checkpoint

Our main paper's contribution is demonstrating that a dedicated fine-tuning process (using limited data) can significantly enhance pre-trained models. We do not claim to reduce the cost of the original pre-training itself.

The claim of achieving "similar performance with fewer resources" was made specifically during our rebuttal comparison with the concurrent work, Pi3. This is a direct and fair comparison, as both our method and Pi3 explicitly state that we initialize from the same pre-trained VGGT checkpoint. The Pi3 authors confirm this in their Appendix A.2:

"Our final model is not trained from scratch. Instead, we initialize the weights for the encoder and the alternating attention module from the pre-trained VGGT model."

Given the same starting point, our method's superior efficiency in this head-to-head comparison is a valid claim. Moreover, our method does not contradict Pi3; it could potentially be applied to further improve it.

Q2: Why LoRA benefits the training, in theory and in experiment

Theoretical Justification: Established research has demonstrated that attempting to fully fine-tune a large model on a small amount of data (exactly our use case) risks catastrophic forgetting. LoRA is explicitly designed to mitigate this issue [1][2][3][4]. By freezing the vast majority of the pre-trained weights, it allows us to inject the new monocular knowledge without overwriting the powerful multi-view foundation that the model learned from a much larger dataset.

Experimental Validation: Our ablation study in Table 7 of the main paper provides direct empirical proof. The results demonstrate that our proposed LoRA-based method (Row 4) outperforms both full-parameter encoder fine-tuning (Row 2) and a naive LoRA implementation (Row 3), confirming that our strategy is the most effective for this problem.

Q3: CUT3R performance

This is an important point, and we want to be precise about our claim. Our paper does not claim to improve pose accuracy. Our contribution lies in improving the model's monocular performance, and for the multi-view task, our explicit goal was to maintain the original model's multi-view capability (like pose estimation). This is a critical test, as our fine-tuning does not supervise the pose heads at all.

To verify this, we adopted a more robust evaluation protocol for Table 3. We conducted 10 new evaluation runs using a stricter, randomized frame selection method, replacing the uniform sampling used in the previous version. We report the mean and standard deviation (± std) across the 10 runs:

| Method | RRA@5 | RTA@5 | AUC@30 |
|---|---|---|---|
| CUT3R | 96.66 ± 0.14 | 61.66 ± 0.30 | 78.95 ± 0.23 |
| CUT3R+Ours | 96.99 ± 0.12 | 62.15 ± 0.26 | 79.13 ± 0.20 |
| VGGT | 95.32 ± 0.10 | 52.65 ± 0.28 | 73.65 ± 0.14 |
| VGGT+Ours | 96.44 ± 0.12 | 55.03 ± 0.22 | 74.40 ± 0.12 |

The results above validate our claims. For CUT3R and VGGT, our fine-tuned model performs statistically on-par with the baseline, successfully achieving our goal of preserving its multi-view capabilities.

Notes on the updated results:

  1. The baseline CUT3R scores differ slightly from those in the previous version because these new evaluations use a random frame sampling protocol instead of uniform sampling. This more rigorous method results in a slight degradation of performance.

  2. The standard deviation for the CUT3R baseline is relatively high. This is a known characteristic, as we also discuss in our response to Reviewer Hi89.

  3. The CUT3R baseline outperforms VGGT on RE10K pose estimation because it was trained on the RealEstate10K dataset. Crucially, our method maintains this performance after fine-tuning, achieving results that are on par with the original CUT3R baseline.

We will update Table 3 in the revised manuscript with these more robust results. And we hope our responses address your concern.

References

[1] Hu, Edward J., et al. "LoRA: Low-rank adaptation of large language models." ICLR, 2022.

[2] Dettmers, Tim, et al. "QLoRA: Efficient finetuning of quantized LLMs." Advances in Neural Information Processing Systems 36, 2023.

[3] Kalajdzievski, Damjan. "Scaling laws for forgetting when fine-tuning large language models." arXiv preprint arXiv:2401.05605, 2024.

[4] Liao, Baohao, Yan Meng, and Christof Monz. "Parameter-efficient fine-tuning without introducing new latency." ACL, 2023.

Comment

Thank you for your answers. I will increase my rating accordingly.

Comment

We appreciate reviewer Kjwd's thoughtful feedback and recognize that some of the concerns raised may stem from unclear explanations in our manuscript. We will refine these points in the next version and sincerely thank you for highlighting them.

Official Review
Rating: 5

The authors propose an encoder-only fine-tuning method for the *3R family, where monocular depth information from MoGe (single-view DUST3R) is distilled into the backbone encoder feature. When fine-tuning, they apply re-normalization LoRA on the encoder to address the over-smoothed geometry caused by multi-view pointmap regression.

The goal of this approach is to produce sharper and more robust single-view depth, while performing comparable multi-view reconstruction.

They compare their method to the backbones in *3R family, such as DUSt3R, CUT3R, and VGGT, evaluating on benchmarks for monocular depth, pose estimation and 3D reconstruction.

Strengths and Weaknesses

[Strengths]

This work is insightful, offering valuable guidance for fine-tuning the *3R family to produce fine-grained details. Instead of fine-tuning the decoder or using noisy depth labels, they apply a re-normalized LoRA to fine-tune the encoder using MoGe. This approach can generalize to subsequent *3R methods.

[Major Weaknesses]

  1. Missing video (multi-view) depth evaluation: The goal of the paper is improving monocular depth while preserving multi-view consistency. While the method is evaluated on pose estimation and 3D reconstruction, improved pose can lead to better 3D reconstruction. It is very important to evaluate multi-view depth, because it can help better factor out the impact of this fine-tuning method on multi-view consistency. And these datasets (e.g., DTU, ScanNet, KITTI, ETH3D, T&T, Bonn, TUM Dyn) are also widely used in previous work like CUT3R/VGGT.

[Minor Weaknesses]

  1. Monocular depth distillation in the encoder may harm the inherent multi-view consistency in feature space beyond increasing feature norms. It could be verified whether the features after adaptation still enable fine-grained, multi-view consistent reconstruction using the Feat2GS benchmark. There seems to be an overclaim in L132-135.

  2. CUT3R is not only a feed-forward 3D reconstruction model but also a 4D foundation model. It would be interesting if the authors could show the proposed adaptation’s performance on 4D fine details reconstruction.

  3. L108-111: The motivation for the 'Scale Uncertainty' paragraph is unclear. There is a common issue in *3R models: incorrect depth or scale leads to multi-layered walls in reconstruction. It would be great to demonstrate this improvement by providing visual examples where the baseline predicts layered walls but the proposed method predicts a single-layer wall.

  4. L116-117 lacks evidence: It would be helpful to cite VGGT (Table 3 of the original paper), as it supports this point.

  5. Lack of discussion of LoRA3D and Test3R. Why not use Visual Prompt Tuning instead of LoRA?

Questions

Refer to [Major Weaknesses] 1) and [Minor Weaknesses] 1) and 2). For now, I am not yet convinced by the performance of the method due to the mentioned points around evaluation. However, I would be willing to raise my score if the authors address the concerns raised in my review and provide clarification on the points mentioned above.

Limitations

No, there is a lack of discussion on the limitations of monocular depth distillation's impact on the inherent multi-view consistency of the backbone feature.

Final Justification

The authors provide a comprehensive rebuttal and reply to all of my questions, which is very much appreciated. I also appreciate the authors' responsiveness; they have fully addressed major weakness 1) and minor weakness 1). I strongly recommend incorporating the video depth estimation results into the main paper. I recommend accepting this paper.

Formatting Issues

No

Author Response

We thank the reviewer for their positive feedback and for accurately summarizing the core contributions of our work. We are particularly encouraged that the reviewer found our fine-tuning strategy to be insightful and recognized its generalizability to the broader *3R family of models, as this was a key objective of our research. We now address the concerns raised.

Major: video-depth estimation

We thank the reviewer for raising this point. The primary reason this was not included in our initial submission is that the prominent video benchmarks used in CUT3R (e.g., Sintel, KITTI, Bonn) primarily feature dynamic scenes. This creates a significant domain mismatch for models like VGGT, which are trained exclusively on static images. To provide a fair and rigorous evaluation, we additionally evaluate on ETH3D and Tanks & Temples, two multi-view datasets that are standard for static scenes. (Note: we excluded DUSt3R from this comparison as it requires a post-alignment step for multi-view inputs.)

As shown in the table below, our method consistently improves both the CUT3R and VGGT backbones on these benchmarks. This demonstrates that our fine-tuning strategy enhances multi-view consistent depth accuracy and generalizes across different model architectures.

Video Depth Estimation Result

| Method | ETH3D rel ↓ | ETH3D δ1 ↑ | T&T rel ↓ | T&T δ1 ↑ | KITTI rel ↓ | KITTI δ1 ↑ | Sintel rel ↓ | Sintel δ1 ↑ | Bonn rel ↓ | Bonn δ1 ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| CUT3R | 0.126 | 83.1 | 0.209 | 69.5 | 0.123 | 87.4 | 0.428 | 47.4 | 0.077 | 93.9 |
| CUT3R+Ours | 0.130 | 82.8 | 0.180 | 76.2 | 0.112 | 89.8 | 0.406 | 58.4 | 0.062 | 96.8 |
| VGGT | 0.044 | 97.9 | 0.137 | 85.3 | 0.072 | 96.5 | 0.301 | 68.4 | 0.052 | 97.3 |
| VGGT+Ours | 0.041 | 99.2 | 0.115 | 88.0 | 0.069 | 96.6 | 0.252 | 72.7 | 0.048 | 97.5 |

Furthermore, to reinforce the point about multi-view performance, we encourage the reviewer to see our additional point map estimation results on DTU and ETH3D (provided in our response to Reviewer dQ4p). These results highlight not only improved multi-view consistent geometry but also pose, confirming that our method enhances single-view detail without compromising multi-view consistency.

Minor1: Feat2GS Benchmark

We thank the reviewer for this insightful point. We agree that our method may not address all potential sources of feature shift, and we will revise the claims in our manuscript to more precisely reflect our specific contributions.

However, we provide extensive empirical evidence to demonstrate that our approach successfully maintains multi-view consistency. To further validate this, we now present results on Feat2GS benchmark, which directly evaluates the quality of multi-view features for novel view synthesis.

Due to rebuttal time constraints, and because Feat2GS does not yet include CUT3R, we focused our evaluation on the DUSt3R and VGGT backbones. The results are as follows:

| Method | Geometry PSNR ↑ | Geometry SSIM ↑ | Geometry LPIPS ↓ | Texture PSNR ↑ | Texture SSIM ↑ | Texture LPIPS ↓ | All PSNR ↑ | All SSIM ↑ | All LPIPS ↓ |
|---|---|---|---|---|---|---|---|---|---|
| DUSt3R | 19.56 | 0.6504 | 0.3181 | 18.06 | 0.6006 | 0.3221 | 19.40 | 0.6477 | 0.3700 |
| DUSt3R_ft | 19.60 | 0.6512 | 0.3181 | 18.05 | 0.6015 | 0.3217 | 19.65 | 0.6417 | 0.3669 |
| VGGT_e | 19.66 | 0.6558 | 0.3123 | 18.07 | 0.6003 | 0.3225 | 19.61 | 0.6510 | 0.3788 |
| VGGT_e_ft | 19.70 | 0.6561 | 0.3115 | 18.10 | 0.6008 | 0.3224 | 19.66 | 0.6514 | 0.3781 |

As shown, our method not only preserves but slightly improves multi-view performance, evidenced by the gains in PSNR and LPIPS. It is important to contextualize these numbers: performance on Feat2GS is typically concentrated within a very narrow range (e.g., PSNR often between 19.40 and 19.70). This demonstrates our method successfully improves single-view geometry while preserving the integrity of multi-view features.

Minor2: CUT3R Fine Structure

While the rebuttal format prevents us from including new qualitative figures, we can provide quantitative evidence of the improvement in boundary sharpness. To do this, we use the Boundary F1 score, a metric designed by DepthPro specifically to measure geometric detail at object edges, which has been adopted by recent works like MoGe-v2 for this exact purpose. We evaluated our method on a diverse set of benchmarks, including static real-world scenes (iBims-1, HAMMER) and dynamic synthetic scenes (Sintel, Spring). The results are as follows:

Boundary F1 Score for CUT3R

| Method | iBims-1 | HAMMER | Sintel | Spring |
|---|---|---|---|---|
| CUT3R | 8.34 | 4.01 | 23.79 | 28.88 |
| CUT3R+Ours | 20.63 | 16.66 | 29.78 | 29.35 |

The data clearly shows that our method (CUT3R + Ours) achieves improvements in boundary sharpness over the original CUT3R across both static and dynamic datasets. This quantitatively confirms that our fine-tuning enhances the model's ability to capture fine-grained geometry. For qualitative evidence, we also invite the reviewer to examine the supplementary material of our submission, which contains several examples of in-the-wild depth estimations from our enhanced CUT3R model.
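For readers unfamiliar with boundary-style depth metrics, here is a simplified Python sketch of the general idea; it uses depth-ratio edges and a plain F1 over boundary pixels, and is an illustrative approximation rather than the exact DepthPro definition used in the table above.

```python
import numpy as np

def boundary_f1(pred_depth: np.ndarray, gt_depth: np.ndarray,
                ratio_thresh: float = 1.15) -> float:
    """Simplified boundary-F1-style metric (assumption: not the exact DepthPro metric).
    A pixel counts as a depth boundary when the depth ratio to its right or lower
    neighbour exceeds `ratio_thresh`; F1 compares predicted vs. GT boundary maps."""
    def edges(d: np.ndarray) -> np.ndarray:
        d = np.maximum(d, 1e-6)
        e = np.zeros(d.shape, dtype=bool)
        r = d[:, 1:] / d[:, :-1]              # horizontal neighbour depth ratios
        c = d[1:, :] / d[:-1, :]              # vertical neighbour depth ratios
        e[:, :-1] |= np.maximum(r, 1.0 / r) > ratio_thresh
        e[:-1, :] |= np.maximum(c, 1.0 / c) > ratio_thresh
        return e
    p, g = edges(pred_depth), edges(gt_depth)
    tp = np.logical_and(p, g).sum()
    if tp == 0:
        return 0.0
    precision, recall = tp / p.sum(), tp / g.sum()
    return 2 * precision * recall / (precision + recall)
```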

Minor3: Scale-Uncertainty ambiguity

We apologize for the lack of clarity. Our central argument is that the point map regression head inhibits the capture of fine details.

The Scale-Uncertainty is one of the reasons. It is primarily related to the normalization process when calculating loss; the normalization is fixed to the reference view. This creates reprojection errors for all non-reference views, a phenomenon we prove mathematically in the supplement and show empirically in Figure 2(b). These errors cause high loss at object boundaries, forcing the model to learn blurred edges.
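A rough illustration of this argument in DUSt3R-style notation; this is our own sketch under the stated assumptions, not the paper's exact formulation (the paper's derivation is in its supplement):

```latex
% Assumed notation: X^v_i are predicted points of view v expressed in the reference
% frame, \bar{X}^v_i the ground truth, and the scale is estimated from view 1 only.
\mathcal{L}_{\mathrm{reg}}
  = \sum_{v}\sum_{i\in\mathcal{V}_v}
    \Bigl\lVert \tfrac{1}{z}\,X^{v}_i - \tfrac{1}{\bar{z}}\,\bar{X}^{v}_i \Bigr\rVert,
\qquad
 z = \tfrac{1}{|\mathcal{V}_1|}\sum_{i\in\mathcal{V}_1}\bigl\lVert X^{1}_i\bigr\rVert,
\quad
 \bar{z} = \tfrac{1}{|\mathcal{V}_1|}\sum_{i\in\mathcal{V}_1}\bigl\lVert \bar{X}^{1}_i\bigr\rVert
```

Because the scale z is estimated from the reference view alone, any error in that single estimate rescales the points of every non-reference view, concentrating the residual at depth discontinuities and pushing the model toward blurred edges.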

The recent concurrent work Pi3 independently validates this diagnosis, as it also removes the pointmap head from VGGT when fine-tuning, albeit with a different focus on equivariance.

Regarding the multi-layer wall issue, our method shows improvements, which can be seen in Figure 6. For example, in the top-left kitchen scene, our method produces clean cabinets where the baseline shows multi-layer artifacts, as the zoomed-in region shows. Likewise, in the right image, our method corrects the layered wall at the upper left corner produced by the original VGGT.

While we acknowledge the issue is not entirely solved and can still occur occasionally in close-ups, we believe the fly-through videos in our supplementary material offer a more dynamic and compelling demonstration of the overall reduction of these artifacts. We invite the reviewer to view them.

Minor4: Cite VGGT to support

Thanks for the kind suggestion; we will add this citation in the next version.

Minor5: Comparison with LoRA3D/Test3R; why not Visual Prompt Tuning?

The key distinction lies in the fine-tuning paradigm:

  • Our Method: We perform a one-time, general fine-tuning to create a single model that offers zero-shot generalization at inference.
  • LoRA3D/Test3R: These are per-scene (or per-dataset), test-time optimizations that require fine-tuning for each new scene (or dataset).

In essence, our goal is a universal, zero-shot model, whereas LoRA3D/Test3R are test-time adaptation techniques. We will clarify this distinction in our final manuscript. For the Visual Prompt Tuning comparison, we also implemented visual prompt tuning for the encoder (following Test3R); the table below shows the results.

Visual Prompt Tuning

| Method | rel ↓ | δ1 ↑ | AUC @ 5° ↑ | AUC @ 10° ↑ | AUC @ 20° ↑ |
|---|---|---|---|---|---|
| VGGT+Ours | 4.29 | 96.7 | 35.21 | 56.70 | 72.80 |
| VGGT+Visual Prompt Tuning | 4.50 | 96.0 | 33.87 | 54.46 | 70.06 |

Our method outperforms the VPT baseline. Our analysis suggests this is because VPT primarily adapts a model to a new task by tuning input prompts, leaving the core feature extraction unchanged. Conversely, our approach directly refines the feature extractor itself. This allows the model to generate more robust and detailed features, which is essential for high-quality geometric reconstruction and explains our method's performance advantage.
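For concreteness, a minimal sketch of the VPT-style setup we compare against; the prompt count, initialization, and wrapper interface are assumptions for illustration, not the exact Test3R implementation.

```python
import torch
import torch.nn as nn

class PromptedEncoder(nn.Module):
    """Sketch of Visual Prompt Tuning (assumed setup): learnable prompt tokens are
    prepended to the patch-token sequence of a frozen ViT encoder; only the prompts
    are trained, so the core feature-extraction weights stay untouched."""

    def __init__(self, vit_encoder: nn.Module, embed_dim: int, num_prompts: int = 16):
        super().__init__()
        self.encoder = vit_encoder
        for p in self.encoder.parameters():   # freeze the backbone entirely
            p.requires_grad = False
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, embed_dim) * 0.02)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: [B, N, D] patch embeddings; prepend prompts, run the frozen encoder
        prompts = self.prompts.expand(tokens.size(0), -1, -1)
        return self.encoder(torch.cat([prompts, tokens], dim=1))
```

The contrast with the LoRA sketch above is the point: VPT only adds inputs, while LoRA modifies the weights that generate the features themselves.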

Limitation: Addressing the Concern Regarding Multi-View Consistency

We thank the reviewer for their feedback on this crucial point. To directly address the initial concern about the multi-view consistency of our learned features, we have conducted extensive experiments, including:

  • Video Depth Estimation: Demonstrating strong temporal and view-to-view consistency.
  • Additional Point Map Estimations: Showing geometrically coherent reconstructions across multiple viewpoints.
  • The Feat2GS Benchmark: Providing quantitative validation on a feature-wise benchmark specifically designed to measure feature quality for 3D reconstruction.

We hope that the reviewer’s concerns have been properly addressed with our responses above.

Comment

The authors provide a comprehensive rebuttal and reply to all of my questions, which is very much appreciated. I appreciate the authors' responsiveness; they have fully addressed major weakness 1) and minor weakness 1). I strongly recommend incorporating the video depth estimation results into the main paper.

Comment

We sincerely thank reviewer Cnzb for the encouraging and positive assessment. We are especially grateful for your recognition of our efforts in the rebuttal. As strongly recommended, we will ensure the Video Depth Estimation results are incorporated directly into the main paper to strengthen the next version of our paper.

Official Review
Rating: 5

The authors fine-tune 3D estimators (outputting point clouds) that are trained on videos (using structure from motion to obtain GT/surrogate 3D) with pre-trained single-view estimators that have higher detail fidelity. First attempts that did not work due to feature shift are reported, showing the difficulty of the task. The main contribution is preventing this shift by training only the encoder and applying weight normalization.

Strengths and Weaknesses

Strengths:

  • An effective method to prevent drift by normalizing weights after an update.
  • Keeping the same network architecture and even pre-trained decoder keeps inference efficient and provides a drop-in replacement
  • Section 4.1 Monocular Depth Estimation shows clear and consistent improvement across several backbones

Weaknesses:

  • Results for multi-view pose estimation on RealEstate10K are a bit mixed; fine-tuning sometimes makes things worse.

  • Results in Table 5 are also a bit mixed. As no std across multiple runs is reported, it is hard to judge how significant the improvements are. Could you explain when the method degrades?

  • Figure 6 is only for VGGT, it would be nice to see side views for all methods. In general, depth maps are hard to read, in particular if the scale is not normalized as in Figure 7. It would be much better to show 3D side views for analyzing depth map quality.

  • In Table 7 (ablation), I am missing an experiment on fine-tuning just the encoder vs. the whole model (please state explicitly whether any of the existing rows fine-tune the full model). Also, some of the initial exploratory experiments were (unless I am mistaken) done by fine-tuning the whole model. Does this also hold when only fine-tuning the encoder?

Summary: A simple yet effective method. My main concern is the significance of the improvements on real data for which repeated experiments with std reported would help.

Questions

  • Line 168, "we use 2-view data with distillation supervision applied exclusively to the view1 pointmap head for distillation loss": why not also apply it to the second view? Is it a good thing to use only one? Would drift be prevented if both were used?

  • Line 218 " The only exception is CUT3R—likely due to its training includes the RealEstate10k dataset" What does this dataset bring to the table that prevents improvement?

  • Line 171 "During each epoch, we sample 20,000 images from SA-1B [23], 1,000 from Hypersim [40], and 1,000 from TartainAir [61]" I wonder how this differs from the training of baselines and whether this could also have an effect on the improvements?

  • Was there a validation set used to determine stopping, would there be overfitting if training for more than 10 epochs?

Limitations

  • Limitations are not clearly stated. In particular, given the mixed results on real images, it would be important to discuss and exemplify where things are getting worse.

Final Justification

Questions have been answered in the rebuttal

Formatting Issues

Reads well and is well structured.

Author Response

We thank the reviewer Hi89 for the positive and constructive feedback. We are pleased they recognized our method as simple yet effective, its practical value as an efficient "drop-in replacement", and the "clear and consistent improvement" shown in our monocular depth estimation experiments. We will now address the points raised for further clarification.

W1 & Q2: Multi-view pose estimation, the CUT3R baseline, and the RealEstate10K dataset

We thank the reviewer for this critical question, which allows us to clarify this specific result. The issue is data contamination in the baseline's protocol. The original CUT3R was trained on the RealEstate10K dataset, from which the test set is also drawn. This leads to an overfitted baseline with an artificially inflated score. Therefore, this isolated result is an artifact of the baseline's evaluation, not a limitation of our method. Our fine-tuned DUSt3R and VGGT demonstrate clear improvements in multi-view pose estimation.

W2: Table 5 mixed results

To assess model robustness, we measured the standard deviation (std) of metrics over 10 runs with randomly permuted input image orders. The std results are presented in the following table.

As expected, VGGT and our fine-tuned version (VGGT+Ours) are exceptionally stable (most std < 0.001), a direct result of VGGT's permutation-invariant architecture; only changing the first (reference) image changes the result. In contrast, CUT3R exhibits a higher sensitivity to input order, an inherent characteristic that our fine-tuning does not alter. The large variance of CUT3R is also observed by the concurrent work Pi3. DUSt3R was excluded as its dense pairwise matching design is inherently order-invariant.
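A minimal sketch of this measurement protocol, assuming a hypothetical `evaluate` callable that returns a scalar metric for one ordered image list:

```python
import numpy as np

def eval_with_permutations(evaluate, images, num_runs=10, seed=0):
    """Protocol sketch: repeat the evaluation with randomly permuted input order
    and report the mean and standard deviation of the metric."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(num_runs):
        order = rng.permutation(len(images))
        scores.append(evaluate([images[i] for i in order]))
    scores = np.asarray(scores, dtype=float)
    return scores.mean(), scores.std()
```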

std on pointmap estimation

| Method | 7-Scenes Acc. ↓ (Mean/Med.) | 7-Scenes Comp. ↓ (Mean/Med.) | 7-Scenes N.C. ↑ (Mean/Med.) | NRGBD Acc. ↓ (Mean/Med.) | NRGBD Comp. ↓ (Mean/Med.) | NRGBD N.C. ↑ (Mean/Med.) |
|---|---|---|---|---|---|---|
| CUT3R | 0.005 / 0.002 | 0.006 / 0.002 | 0.001 / 0.001 | 0.014 / 0.009 | 0.005 / 0.004 | 0.001 / <0.001 |
| CUT3R+Ours | 0.006 / 0.002 | 0.006 / 0.002 | 0.001 / <0.001 | 0.012 / 0.010 | 0.006 / 0.004 | 0.001 / <0.001 |
| VGGT | 0.001 / <0.001 | <0.001 / <0.001 | <0.001 / <0.001 | 0.001 / <0.001 | <0.001 / <0.001 | <0.001 / <0.001 |
| VGGT+Ours | <0.001 / <0.001 | <0.001 / <0.001 | <0.001 / <0.001 | 0.001 / <0.001 | <0.001 / <0.001 | <0.001 / <0.001 |

Our post-submission analysis finds that CUT3R's poor performance stems from the base model's inaccurate pose estimation. The resulting large-scale misalignments (multi-layer walls, misaligned tables) dominate the evaluation metrics, thereby obscuring the geometric benefits of our method. Due to this year's rebuttal policy change, we cannot share the visualization results, but this can be easily confirmed by running the official CUT3R Hugging Face demo on images from the 7-Scenes or NRGBD datasets.

VGGT's minor performance dip on NRGBD is due to a ceiling effect, as its synthetic scenes offer limited room for improvement over a strong baseline. Conversely, our method's true efficacy is evident on more demanding benchmarks. We direct the reviewer to our response to Reviewer dQ4p, where we present additional results showing that our fine-tuned VGGT yields significant gains on the real-world ETH3D and high-quality DTU datasets.

W3: 3D visualization

We agree that 3D visualizations would strengthen the paper. While NeurIPS policy prevents us from adding new visuals during the rebuttal, we direct the reviewer to the existing depth visualizations in our supplementary material, where we show monocular depth estimation results on out-of-distribution images. We commit to adding the requested 3D side-view and reconstruction results for all models in the next version to enhance our analysis.

W4: Decoder Tuning

We sincerely thank the reviewer for this insightful comment. We agree that providing an empirical justification for freezing the decoder is essential, as this is a central design choice in our methodology.

This was, in fact, a core hypothesis that we tested during our initial exploratory studies. The results from our early experiments confirmed that fine-tuning the decoder with single-view data creates a conflict: it improves per-image depth but degrades the cross-view matching ability required for robust pose estimation. In preparing the final manuscript, we omitted this specific ablation study in the interest of conciseness, choosing to focus on the results of our final proposed model.

We agree with reviewers Hi89 and Kjwd that providing these results now significantly strengthens the paper's narrative. We are grateful for the opportunity to share the data that validates our approach. The following table shows relative pose estimation results on ScanNet.

| Method | AUC @ 5° ↑ | AUC @ 10° ↑ | AUC @ 20° ↑ |
|---|---|---|---|
| VGGT | 28.40 | 47.36 | 61.51 |
| Encoder-only | 32.06 | 52.29 | 68.04 |
| Decoder-only | 28.42 | 51.59 | 67.30 |
| Full Finetune | 26.35 | 45.90 | 60.02 |
| VGGT+Ours | 35.21 | 56.70 | 72.80 |

Q1: Why no second-view supervision for DUSt3R?

This is due to DUSt3R's architecture: it predicts the pointmap for view 2 in the coordinate frame of view 1, so it does not predict self-view geometry for view 2, which makes direct supervision of view 2 with a monocular depth teacher impossible. Despite this, we found that supervising only the first view is highly effective and works out of the box for both DUSt3R and MASt3R (as noted in our response to Reviewer dQ4p), yielding strong performance.
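A minimal sketch of what such view-1-only distillation could look like, assuming a per-image affine (scale/shift) alignment between the teacher's monocular depth and the z-channel of the view-1 pointmap; the function name and the affine alignment are illustrative assumptions, not the paper's exact loss.

```python
import torch

def view1_distillation_loss(points_v1: torch.Tensor,
                            teacher_depth: torch.Tensor,
                            mask: torch.Tensor) -> torch.Tensor:
    """Sketch: fit a per-image scale/shift aligning the view-1 pointmap depth to the
    teacher's monocular depth, then take an L1 distillation term over valid pixels."""
    z = points_v1[..., 2]                  # depth implied by the view-1 pointmap
    z_v, t_v = z[mask], teacher_depth[mask]
    A = torch.stack([z_v, torch.ones_like(z_v)], dim=-1)      # [N, 2]
    sol = torch.linalg.lstsq(A, t_v.unsqueeze(-1)).solution   # [2, 1]
    s, b = sol[0, 0], sol[1, 0]
    return (s * z_v + b - t_v).abs().mean()
```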

Q3 & Q4: Dataset selection and training duration

Dataset Selection: Our choice of the SA-1B dataset was inspired by its proven success in state-of-the-art models such as Depth Anything V2 and DistillAnyDepth. SA-1B accounts for 1/6 of the Depth Anything training data, and DistillAnyDepth demonstrates that using SA-1B alone can achieve state-of-the-art performance through self-distillation. We selected the synthetic datasets Hypersim and TartanAir specifically for their high-quality, sharp geometric boundaries compared with real-world datasets.

Training Scale & Duration: Regarding the training duration, our experiments reveal that the model's performance scales positively with data volume without showing signs of overfitting. We explicitly tested this by increasing our training data tenfold from 0.2M to 2M images. As the results below show, this significantly improved performance and closed the gap with the state-of-the-art MoGe model.

| Method | rel ↓ | δ1 ↑ |
|---|---|---|
| SA1B (0.2M) | 4.29 | 96.7 |
| SA1B (2M) | 4.18 | 96.9 |
| MoGe | 4.14 | 96.9 |

The key insight from our findings is that recent 3D foundation models (like VGGT) have already gained strong depth estimation ability, and they can be rapidly advanced to near-SOTA performance with a relatively small amount of pseudo-labeled in-the-wild data, highlighting the efficiency and power of our fine-tuning strategy.

Limitations

Regarding the results in Tables 5 and 6, we refer to our clarifications in W1 & Q2 and W2. We also provide experiments on point map estimation (Reviewer dQ4p), video depth estimation (Reviewer Cnzb), and MASt3R fine-tuning (Reviewer dQ4p) that further confirm our method's robust generalization in real-world scenarios. We respectfully hope the reviewer can reconsider their initial assessment of this point.

The primary limitation of our work is that our renormalization LoRA weights are not mergeable like standard LoRA. This introduces a minor but measurable overhead (~5% in computational time and <0.1% in parameters).

We hope that the reviewer’s concerns have been properly addressed with our responses above.

Comment

Thank you for the clarification. It answers all my high-importance questions and hence I recommend accepting.

Comment

We are grateful for Reviewer Hi89's constructive feedback and for confirming that we have resolved your concerns. We will enhance the paper with 3D visualizations of our method, alongside more visualizations from DUSt3R and CUT3R. Furthermore, to quantify model stability, we will report the standard deviation of metrics over multiple runs.

Official Review
Rating: 4

This paper introduces Fin3R, a lightweight and effective fine-tuning method aimed at improving feed-forward 3D reconstruction networks such as DUSt3R, CUT3R, and VGGT. These models predict 3D pointmaps of input images in a single pass but frequently struggle with capturing fine geometric details due to challenges like limited high-fidelity supervision and geometric misalignments across views. Fin3R addresses these issues by fine-tuning only the image encoder, while freezing the decoder of the model. By introducing a monocular teacher model and a custom LoRA adapter, Fin3R distills high-quality geometry from large unlabeled data, enhancing the encoder's ability to capture complex structures and sharp boundaries. Experiments on the DUSt3R, CUT3R, and VGGT validate that Fin3R consistently improves geometry prediction in both single-view and multi-view settings.

Strengths and Weaknesses

Strengths:

  1. This paper is well-written and easy to follow.

  2. The improvement across several tasks with different datasets is promising, especially considering that the proposed method only requires fine-tuning the target model with a small amount of unseen data.

Weaknesses:

  1. The paper lacks experiments on dense multi-view stereo (MVS) estimation using the high-quality DTU dataset, which could effectively reflect both pose and depth improvements. Additionally, according to Sec. 3.3, only monocular distillation loss and pointmap loss are used for fine-tuning. It is unclear why the proposed method improves relative pose estimation for models like VGGT and CUT3R, which directly regress camera parameters.

  2. There is insufficient discussion and comparison with similar methods that also use LoRA to fine-tune feed-forward 3D reconstruction networks, such as LoRA3D (ICLR 2025).

  3. Can the proposed method be used to improve Mast3R, which uses an architecture similar to DUSt3R but regresses metric depth?

  4. Confidence maps are widely used in feed-forward 3D reconstruction methods to filter unreliable predictions. It would be interesting to discuss whether the proposed method can improve the accuracy of pointmaps or depths while increasing confidence in predictions.

  5. Could the use of larger-scale datasets and more diverse scenes further enhance the fine-tuning results? Additionally, what was the rationale behind selecting the SA-1B dataset for fine-tuning? What specific criteria informed this decision?

Questions

  1. What are the criteria for selecting training data for fine-tuning?

  2. Please provide more results on large-scale outdoor scenes, such as the Tanks and Temples dataset.

  3. Can the proposed method be used to fine-tune models for metric depth/pointmap prediction, such as Mast3R?

Limitations

Although the method proposed in this paper improves the performance of most models, it negatively impacts some models that already perform well, which is often undesirable in practical applications.

Final Justification

Most of my concerns have been addressed. The comparisons on MVS, video depth, metric depth estimation (MASt3R), and LoRA-based fine-tuning further demonstrate the effectiveness and robustness of the proposed method.

Formatting Issues

N/A

Author Response

We thank Reviewer dQ4p for the insightful and helpful suggestions. We are very pleased that the reviewer found the paper to be "well-written and easy to follow" and recognized the improvement across several tasks achieved with a small amount of data. We appreciate this positive assessment and will now address the points raised for further clarification.

W1: On MVS Evaluation and the Mechanism for Pose Improvement

1. On MVS Evaluation Protocol and Results

We followed the established evaluation protocol from recent works (e.g., Spann3R, CUT3R) and the concurrent work Pi3 by assessing performance on point map estimation. This approach is standard in the absence of an open-source Multi-View Stereo (MVS) evaluation benchmark for the 3R model series and effectively reflects both geometric accuracy and pose estimation quality.

Our results on the DTU and ETH3D datasets, evaluated using the official Pi3 codebase, are presented below. The table shows that our fine-tuned model (VGGT+Ours) not only surpasses the VGGT baseline but also outperforms the concurrent work Pi3 on 8 of 12 metrics, while being vastly more efficient: our fine-tuning took 1 day on 4 L20 GPUs, versus Pi3's several days on 64 A100 GPUs (Pi3 also initializes from VGGT weights).

| Method | DTU Acc. ↓ (Mean/Med.) | DTU Comp. ↓ (Mean/Med.) | DTU N.C. ↑ (Mean/Med.) | ETH3D Acc. ↓ (Mean/Med.) | ETH3D Comp. ↓ (Mean/Med.) | ETH3D N.C. ↑ (Mean/Med.) |
|---|---|---|---|---|---|---|
| Pi3 | 1.151 / 0.622 | 1.793 / 0.629 | 0.668 / 0.754 | 0.194 / 0.130 | 0.220 / 0.135 | 0.867 / 0.965 |
| VGGT | 1.187 / 0.715 | 2.229 / 1.309 | 0.694 / 0.779 | 0.290 / 0.196 | 0.371 / 0.230 | 0.839 / 0.932 |
| VGGT+Ours | 0.948 / 0.520 | 1.879 / 0.905 | 0.699 / 0.787 | 0.209 / 0.112 | 0.170 / 0.085 | 0.861 / 0.972 |

2. On the Pose Improvement Mechanism:

While direct pose enhancement was not our primary objective, our empirical results demonstrate this benefit. We hypothesize that it stems from the decoder implicitly acting as a feature-matching module: it performs matching across images much like dedicated matchers (e.g., SuperGlue). With improved, more detailed features, the matching results can potentially be improved. This hypothesis is also supported by our MASt3R fine-tuning experiments below and our Feat2GS benchmark analysis (see response to Reviewer Cnzb), which empirically confirm that our method preserves multi-view feature consistency while improving accuracy.

W2: Comparison with LoRA-based Fine-tuning (e.g., LoRA3D)

We thank the reviewer for this point. The key distinction lies in the fine-tuning paradigm:

  • Our Method: We perform a one-time, general fine-tuning to create a single model that offers zero-shot generalization at inference.
  • LoRA3D: This is a per-scene (or per-dataset), test-time optimization that requires fine-tuning for each new scene (or dataset).

In essence, our goal is a universal, zero-shot model, whereas LoRA3D is a test-time adaptation technique. We will clarify this distinction in our final manuscript.

W3 & Q2: Applicability to Metric Depth Models like Mast3r

Yes, our fine-tuning strategy is model-agnostic and effective for metric depth models. To demonstrate this, we applied our method directly to MASt3R, using the same dataset and loss configuration as in our primary experiments and MoGe-V2 as the teacher model.

The results, presented below, confirm the significant benefits of our approach.

Fine-tuning Results on Mast3r

| Method | rel ↓ | δ1 ↑ | AUC @ 5° ↑ | AUC @ 10° ↑ | AUC @ 20° ↑ |
|---|---|---|---|---|---|
| Mast3r | 49.62 | 30.2 | 37.60 | 59.96 | 76.24 |
| Mast3r + Ours | 27.60 | 64.3 | 37.93 | 60.21 | 76.68 |
| MoGe-V2 (SOTA) | 14.82 | 78.6 | - | - | - |

Analysis of Results:

  1. Solid Improvement in Metric Depth: Our method improves MASt3R's metric depth. We nearly halved the relative error (49.62 → 27.60) and more than doubled the δ1 accuracy (30.2 → 64.3), significantly closing the gap to the state-of-the-art MoGe-V2 model.

  2. Consistent Improvement in Pose Estimation: Notably, relative pose estimation also slightly but consistently improves across all AUC metrics. This is particularly insightful because these gains were achieved without fine-tuning the decoder or the feature head. This result provides strong evidence for our hypothesis: enhancing the encoder features alone (with our strategy) directly leads to more robust feature matching and, consequently, better pose solver results.

These gains were achieved with only 10 training epochs. Metric depth potentially needs more data than relative depth, underscoring the potential impact of our fine-tuning approach on metric-depth *3R models.

W4: Are the confidence maps improved?

Yes, our method significantly enhances the reliability of the predicted confidence maps. This improvement is a direct result of fine-tuning on diverse, pseudo-labeled in-the-wild data, which teaches the model to better assess its own prediction quality.

While we cannot include figures in this year's NeurIPS rebuttal, we provide quantitative evidence below. We measured the average predicted confidence during monocular depth estimation across several standard benchmarks. Our fine-tuned model (VGGT+Ours) more than doubles the average confidence score, from 4.69 to 9.58. Notably, datasets where our model achieves larger gains in depth accuracy, such as DIODE and iBims-1, also exhibit more substantial increases in confidence.

Uncertainty Results

| Method | ETH3D | NYUv2 | KITTI | DIODE | HAMMER | iBims-1 | Average |
|---|---|---|---|---|---|---|---|
| VGGT | 3.94 | 8.78 | 2.14 | 6.51 | 4.12 | 4.45 | 4.69 |
| VGGT+Ours | 9.05 | 8.71 | 4.13 | 10.21 | 9.65 | 15.71 | 9.58 |

Besides, we assessed the model's reliability on challenging out-of-distribution data, including paintings, sports games, movies, and images with reflective surfaces or low light. On these OOD samples, our model produced an average confidence score of 8.51, compared to the baseline VGGT's score of 1.87. This demonstrates a markedly improved ability to produce reliable confidence estimates even in novel and difficult scenarios, highlighting the robustness of our fine-tuned model.

W5 & Q1: Dataset Selection

Our choice of the SA-1B dataset was inspired by its proven success in state-of-the-art models like Depth Anything V2 and DistillAnyDepth. SA-1B accounts for 1/6 of the Depth Anything training data, and DistillAnyDepth demonstrates that using SA-1B alone can achieve state-of-the-art performance through self-distillation.

Our experiments confirm that performance scales directly with data volume; increasing our training set 10x (from 0.2M to 2M images) improved results and further closed the gap to the state-of-the-art MoGe model. While a mixed dataset (BDD100K, SA-1B, Google Landmarks v2, Places365) at a smaller scale (0.2M) did not show significant gains over using SA-1B alone, we suspect this is a scale-dependent observation. As suggested by related work (e.g., Depth Anything V2, Appendix B.4), data diversity remains critical for performance at very large scales.

The key insight from our findings is that recent 3D foundation models (like VGGT) have already gained strong depth estimation ability, and they can be rapidly advanced to near-SOTA performance with a relatively small amount of pseudo-labeled in-the-wild data, highlighting the efficiency and power of our fine-tuning strategy.

| Method | rel ↓ | δ1 ↑ |
|---|---|---|
| SA1B (0.2M) | 4.29 | 96.7 |
| Mixed (0.2M) | 4.27 | 96.6 |
| SA1B (2M) | 4.18 | 96.9 |
| MoGe | 4.14 | 96.9 |

Q2: Large-scale results, such as T&T

As suggested by Reviewer Cnzb, we added the video depth estimation benchmark, which includes the T&T dataset. Our method consistently improves depth estimation accuracy while maintaining multi-view consistency.

Video Depth Estimation

| Method | ETH3D rel ↓ | ETH3D δ1 ↑ | T&T rel ↓ | T&T δ1 ↑ | KITTI rel ↓ | KITTI δ1 ↑ | Sintel rel ↓ | Sintel δ1 ↑ | Bonn rel ↓ | Bonn δ1 ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| CUT3R | 0.126 | 83.1 | 0.209 | 69.5 | 0.123 | 87.4 | 0.428 | 47.4 | 0.077 | 93.9 |
| CUT3R+Ours | 0.130 | 82.8 | 0.180 | 76.2 | 0.112 | 89.8 | 0.406 | 58.4 | 0.062 | 96.8 |
| VGGT | 0.044 | 97.9 | 0.137 | 85.3 | 0.072 | 96.5 | 0.301 | 68.4 | 0.052 | 97.3 |
| VGGT+Ours | 0.041 | 99.2 | 0.115 | 88.0 | 0.069 | 96.6 | 0.252 | 72.7 | 0.048 | 97.5 |

Limitation: On the Perceived Negative Impact on Some Models

We thank the reviewer for raising this critical point. Our extensive multi-view experiments, including video depth estimation, point map estimation, MASt3R fine-tuning, and Feat2GS benchmarking (see Reviewer Cnzb), show that our method does not harm multi-view consistency and provides broad benefits.

The specific instances of performance degradation are explainable artifacts:

  1. CUT3R in Table 3: The negative result is due to data contamination in the baseline's protocol (as we explain in L217-L218). The original CUT3R was trained on RealEstate10K, leading to an overfitted baseline. Our model, trained on general data, only fluctuates this overfitted score by a marginal amount (0.12, 0.82, and 0.41 in AUC).

  2. Table 5 Fluctuation: As we demonstrate in our response to Reviewer Hi89, this fluctuation is attributable to inaccurate camera poses provided by the CUT3R baseline, not a failure of our method.

We hope that the reviewer’s concerns have been properly addressed with our responses above.

Comment

Thanks for the detailed rebuttal from the authors. It has addressed most of my concerns, and I suggest that the authors add these results to their revision to further improve the insight of the paper. I will improve my score accordingly.

Comment

We are glad to have addressed reviewer dQ4p 's concerns and sincerely appreciate your valuable feedback. We will incorporate the point cloud estimation and video depth estimation results into the main paper. Additionally, we will integrate as many of the other experiments and discussions from the rebuttal as the page limit allows, with the remainder moved to the supplementary material.

Final Decision

After the rebuttal phase, all four reviewers recommend acceptance.