MLLMs Need 3D-Aware Representation Supervision for Scene Understanding
Improving 3D representation learning for multimodal large language models.
Abstract
Reviews and Discussion
This paper addresses the challenge of improving the 3D scene understanding capabilities of multimodal large language models, which are inherently pretrained on 2D data and thus lack strong 3D geometric awareness. The authors first conduct a study revealing a strong positive correlation between a model's multi-view feature consistency (a proxy for 3D awareness) and its performance on downstream 3D reasoning tasks. Motivated by this finding, they propose a framework called 3D-Aware Representation Supervision (3DRS). 3DRS employs knowledge distillation, using a pretrained 3D foundation model (e.g., VGGT) as a "teacher" to provide rich 3D-aware feature targets. The MLLM is then fine-tuned with an auxiliary loss that encourages its visual features to align with these 3D-aware targets, in addition to the standard text generation loss. The authors demonstrate that this approach consistently and significantly improves performance across three different MLLM architectures (LLaVA-Next-Video, LLaVA-OneVision, Qwen2-VL) on five diverse 3D scene understanding benchmarks.
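For readers unfamiliar with this setup, the training objective described above can be sketched as a next-token loss plus a feature-alignment term toward a frozen 3D teacher. The snippet below is a minimal illustrative sketch; the function names, the weight `lam`, and the assumption that student and teacher features already share a dimension are all placeholders, not the paper's exact implementation (which would typically include a learned projection).

```python
import torch
import torch.nn.functional as F

def train_step(lm_logits, text_targets, student_visual_feats, teacher_visual_feats, lam=0.5):
    """Hypothetical 3DRS-style objective: language-modeling cross-entropy plus a
    cosine alignment term pulling the MLLM's visual features toward the frozen
    3D teacher's (e.g., VGGT-like) features. `lam` is an assumed weight."""
    # Standard text-generation loss over the vocabulary.
    text_loss = F.cross_entropy(
        lm_logits.reshape(-1, lm_logits.size(-1)), text_targets.reshape(-1)
    )
    # Per-visual-token cosine alignment; the teacher is detached (frozen).
    align_loss = 1.0 - F.cosine_similarity(
        student_visual_feats, teacher_visual_feats.detach(), dim=-1
    ).mean()
    return text_loss + lam * align_loss

# Toy shapes: batch 2, 16 visual tokens of dim 1024, 8 text tokens, vocab 32000.
logits = torch.randn(2, 8, 32000, requires_grad=True)
targets = torch.randint(0, 32000, (2, 8))
student = torch.randn(2, 16, 1024, requires_grad=True)
teacher = torch.randn(2, 16, 1024)
loss = train_step(logits, targets, student, teacher)
loss.backward()
```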
Strengths and Weaknesses
Strengths
- The paper is very well-motivated. The initial investigation (Sec 2.1) that establishes a quantitative link between multi-view correspondence scores and downstream task performance is very interesting. This diagnostic-first approach provides a clear and compelling rationale for the proposed method, moving beyond simple intuition.
- The experimental evaluation is good. The authors validate their method on multiple MLLM backbones and a wide range of 3D vision-language tasks (grounding, captioning, QA). The ablation study comparing 2D vs. 3D foundation models as teachers (Table 3) is interesting and strongly supports the central claim. There are consistent improvements across the board (Figure 1b, Table 2), which indicates the effectiveness of the proposed approach.
- The paper is well-written and easy to follow. Figure 1 provides an excellent high-level overview of the key results. The method is explained clearly, and the overall narrative is logical and persuasive.
Weaknesses
- The primary weakness of the paper is its limited conceptual novelty. The core proposal is to distill the features of a multi-view model to an MLLM. I guess the contribution here is to show that this actually leads to improved performance, but because the tasks are spatial in nature this could be considered obvious.
- The paper shows that 3D supervision works by improving a "correspondence score" and downstream metrics, but it does not go deeper into what specific 3D properties are being learned. Is the MLLM learning about object permanence, relative scale, or other geometric properties? VGGT only outputs pointmaps and camera poses so there's not much information about objects, and it's not metric scale.
- The performance of the proposed method is upper-bounded by the quality of the "teacher" 3D foundation model. The paper does not discuss the limitations of the teacher models themselves or analyze how their potential failure modes (e.g., incorrect correspondence or depth estimation) might negatively impact or mislead the student MLLM during training.
Questions
- The paper convincingly shows that 3DRS improves performance. To better understand why, could you provide a more fine-grained analysis? For example, could you design a targeted probe or qualitative analysis to see if the model with 3DRS is better at specific geometric reasoning tasks that are not directly measured by the benchmarks, such as:
- Judging the relative size of two objects at different depths.
- Understanding occlusion relationships in a scene.
- Identifying support surfaces (e.g., items on a single table).
- The performance of 3DRS relies heavily on the quality of the 3D foundation model teacher. Could the authors discuss the failure modes of the teacher (VGGT) and how they might propagate to the student MLLM? For instance, what happens in scenes with reflective surfaces, textureless objects, or dynamic elements where the teacher's correspondence or geometry estimates might be noisy or incorrect?
- The distillation uses a simple cosine similarity loss between the final layer features. Did you explore other knowledge distillation strategies?
Limitations
yes
Final Justification
This paper shows that geometric vision models can be useful for MLLMs in understanding spatial relationships. While the performance improvement isn't very large, the results show it is clearly significant. My final rating is borderline accept.
Formatting Issues
None
Thanks for the constructive feedback. We appreciate the recognition of our motivation, experimental results, and paper writing. We address the concerns below.
Weaknesses
Q1. The primary weakness of the paper is its limited conceptual novelty. The core proposal is to distill the features of a multi-view model to an MLLM. I guess the contribution here is to show that this actually leads to improved performance, but because the tasks are spatial in nature this could be considered obvious.
R: We agree that it is intuitive to consider distilling knowledge from 3D models to enhance 3D awareness in other models. However, in the context of modern MLLMs, it remains unclear whether these models inherently possess 3D spatial understanding due to their massive-scale pretraining. To our knowledge, there has been no systematic investigation of 3D awareness in MLLMs, and this remains an open question.
Our paper first provides a thorough analysis to reveal the lack of 3D awareness in existing MLLMs, which motivates our proposed method. We then introduce a simple and effective framework to distill geometric knowledge from 3D models into MLLMs, and empirically demonstrate that this approach significantly enhances their spatial reasoning abilities. We believe our work provides both an important empirical finding and a practical solution for improving 3D scene understanding in MLLMs.
Q2. The paper shows that 3D supervision works by improving a "correspondence score" and downstream metrics, but it does not go deeper into what specific 3D properties are being learned. Is the MLLM learning about object permanence, relative scale, or other geometric properties? VGGT only outputs pointmaps and camera poses so there's not much information about objects, and it's not metric scale.
R: We begin with correspondence learning, a fundamental cue in 3D vision and multiview geometry, as it directly reflects a model’s spatial reasoning ability [1]. Correspondence estimation serves as a critical probe for evaluating the 3D awareness of MLLMs. However, our analysis shows that existing MLLMs struggle with correspondence estimation and lack robustness across many aspects of 3D understanding.
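As an illustration of what such a probe can look like in practice, the sketch below scores multi-view consistency as the average cosine similarity of visual features at patch locations known to observe the same 3D point. The exact scoring protocol used in the paper may differ; all names and shapes here are assumptions.

```python
import torch
import torch.nn.functional as F

def correspondence_score(feats_view_a, feats_view_b, matches):
    """Average cosine similarity between features of matched patches.
    feats_view_*: (num_patches, dim) visual features from one view.
    matches: (num_pairs, 2) long tensor of (patch_in_a, patch_in_b) pairs
    known (e.g., from depth and camera poses) to see the same 3D point."""
    fa = feats_view_a[matches[:, 0]]
    fb = feats_view_b[matches[:, 1]]
    return F.cosine_similarity(fa, fb, dim=-1).mean().item()

# Toy example: 196 patches per view (14x14 grid), 1024-dim features, 32 matches.
feats_a, feats_b = torch.randn(196, 1024), torch.randn(196, 1024)
matches = torch.randint(0, 196, (32, 2))
print(correspondence_score(feats_a, feats_b, matches))
```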
By distilling knowledge from VGGT, our approach enables models to inherit strong 3D capabilities from extensive training tasks, e.g., pointmap and depth map prediction, camera pose estimation, and point tracking. While VGGT does not provide ideal object-centric or metric-scale supervision, this represents a significant improvement over previous methods. With continued advances in 3D foundation models, we believe that metric-scale spatial awareness is within reach, and our method helps drive the broader development of 3D-aware MLLMs.
To further investigate which specific 3D properties are learned, we evaluate our method on VSI-Bench [2], a challenging benchmark designed for geometric-aware spatial reasoning with video inputs. VSI-Bench covers a diverse range of tasks—object counting, absolute and relative distance estimation, object size, room size, relative direction, route planning, and appearance order—all of which are tightly coupled with core 3D properties such as spatial relationships, relative scale, and geometric understanding. We follow the training protocol of VG-LLM [3] and use LLaVA-Next-Video as our baseline.
As shown in Table 1 below, our method consistently achieves higher scores across these spatially-grounded tasks compared to strong proprietary models (e.g., GPT-4o, Gemini-1.5 Pro) and our baseline. This demonstrates that our approach not only improves correspondence estimation but also significantly enhances the model’s geometric reasoning abilities, including scale estimation, object association, spatial arrangement, and relational understanding—even without explicit metric-scale supervision.
Table 1: Evaluation on VSI-Bench.
| Method | Avg. | Object Count | Absolute Distance | Object Size | Room Size | Relative Distance | Relative Direction | Route Plan | Appearance Order |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | 34.0 | 46.2 | 5.3 | 43.8 | 38.2 | 37.0 | 41.3 | 31.5 | 28.5 |
| Gemini-1.5 Flash | 42.1 | 49.8 | 30.8 | 53.5 | 54.4 | 37.7 | 41.0 | 31.5 | 37.8 |
| Gemini-1.5 Pro | 45.4 | 56.2 | 30.9 | 64.1 | 43.6 | 51.3 | 46.3 | 36.0 | 34.6 |
| --------------------- | ------ | -------------- | --------- | ------------- | ----------- | ------------------- | -------------------- | ------------ | ------------------ |
| Baseline | 42.1 | 62.4 | 34.0 | 52.4 | 53.6 | 39.3 | 41.2 | 21.7 | 29.2 |
| Baseline w/Ours (DINOv2) | 44.2 | 67.1 | 33.8 | 52.6 | 56.1 | 39.0 | 42.7 | 26.3 | 35.6 |
| Baseline w/Ours (VGGT) | 45.9 | 68.7 | 34.8 | 53.6 | 56.6 | 40.9 | 43.2 | 30.4 | 39.2 |
These results confirm that our framework substantially improves the MLLM’s grasp of a wide range of 3D geometric properties—not just low-level correspondence—demonstrating robust and generalizable 3D spatial understanding.
Q3. The performance of the proposed method is upper-bounded by the quality of the "teacher" 3D foundation model. The paper does not discuss the limitations of the teacher models themselves or analyze how their potential failure modes (e.g., incorrect correspondence or depth estimation) might negatively impact or mislead the student MLLM during training.
R: We agree that, due to the distillation-based nature of our approach, the performance of our method is upper-bounded by the quality of the teacher 3D foundation model. Any limitations or failure modes of the teacher—such as inaccurate correspondence, erroneous depth estimation, or incomplete geometric representations—can be propagated to the student MLLM and may potentially mislead it during training.
While our experiments demonstrate consistent improvements over strong baselines, it is possible that errors or biases in the teacher's predictions can negatively impact the downstream 3D reasoning abilities of the student model. We believe that, as 3D foundation models continue to advance rapidly, this limitation will become less pronounced over time.
We appreciate the reviewer’s suggestion and will include a dedicated discussion of this limitation in our revised version.
Questions
Q1. The paper convincingly shows that 3DRS improves performance. To better understand why, could you provide a more fine-grained analysis? For example, could you design a targeted probe or qualitative analysis to see if the model with 3DRS is better at specific geometric reasoning tasks that are not directly measured by the benchmarks.
R: Our experiments on VSI-Bench above (Table 1) provide a fine-grained analysis across diverse geometric reasoning tasks, such as distance estimation, object counting, and spatial relationships. The consistent gains with 3DRS on these sub-tasks demonstrate its effectiveness on specific geometric reasoning abilities. We will include this in the revision.
Q2. The performance of 3DRS relies heavily on the quality of the 3D foundation model teacher. Could the authors discuss the failure modes of the teacher (VGGT) and how they might propagate to the student MLLM? For instance, what happens in scenes with reflective surfaces, textureless objects, or dynamic elements where the teacher's correspondence or geometry estimates might be noisy or incorrect?
R: As discussed above (Q3 in Weaknesses), limitations and errors in the teacher model (VGGT) can propagate to the student MLLM through distillation. This may result in suboptimal geometric reasoning in such challenging scenarios. We will add a discussion on this point.
Q3. The distillation uses a simple cosine similarity loss between the final layer features. Did you explore other knowledge distillation strategies?
R: As shown in the table below, we also tried an L2 loss and a combination of cosine and L2 losses. The results show that all strategies yield similar performance, indicating that the choice of feature distance metric has limited impact in our setting.
Table 2: Performance with different distillation losses.
| Supervision | ScanRefer Acc@0.25 | ScanRefer Acc@0.5 | Multi3DRefer Acc@0.25 | Multi3DRefer Acc@0.5 | Scan2Cap B-4@0.5 | Scan2Cap C@0.5 | ScanQA C | ScanQA EM | SQA3D EM |
|---|---|---|---|---|---|---|---|---|---|
| Cosine loss | 62.9 | 56.1 | 60.4 | 54.9 | 41.6 | 86.1 | 104.8 | 30.3 | 60.6 |
| L2 loss | 62.2 | 55.5 | 60.4 | 55.2 | 41.8 | 85.9 | 104.5 | 30.1 | 60.7 |
| Cosine + L2 | 62.3 | 55.7 | 60.3 | 55.0 | 42.1 | 85.8 | 102.7 | 29.7 | 60.2 |
We will include this comparison in the revised version.
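For reference, a minimal sketch of the three loss variants compared in the table, assuming student and teacher features have already been projected to a common dimension; the combination weight is an assumption, not a value from the paper.

```python
import torch
import torch.nn.functional as F

def cosine_loss(student, teacher):
    # 1 minus the mean per-token cosine similarity.
    return 1.0 - F.cosine_similarity(student, teacher, dim=-1).mean()

def l2_loss(student, teacher):
    # Mean squared error over all feature entries.
    return F.mse_loss(student, teacher)

def cosine_plus_l2(student, teacher, w=1.0):
    # Simple sum of the two terms; the weight w is hypothetical.
    return cosine_loss(student, teacher) + w * l2_loss(student, teacher)

s, t = torch.randn(2, 16, 1024), torch.randn(2, 16, 1024)
print(cosine_loss(s, t).item(), l2_loss(s, t).item(), cosine_plus_l2(s, t).item())
```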
Thanks again for the detailed and thoughtful review.
[1] Hartley, Richard, and Andrew Zisserman. Multiple view geometry in computer vision. Cambridge university press, 2003.
[2] Yang, Jihan, et al. "Thinking in space: How multimodal large language models see, remember, and recall spaces." CVPR 2025.
[3] Zheng, Duo, et al. "Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors." Arxiv 2025.
Thank you for the detailed response. The rebuttal addresses most of my initial concerns. I'll include this in my final rating.
We are delighted to hear that you are satisfied with our rebuttal. Thank you for your valuable time reviewing our paper and for your constructive suggestions.
This paper proposes 3DRS, which enhances multi-modal LLMs with additional 3D supervision for 3D scene understanding. The motivation is that MLLMs inherently lack 3D awareness (especially multi-view geometry). Thus, this paper introduces a simple yet effective supervision method: distillation from VGGT or FLARE. It shows that these 3D teachers outperform 2D foundation model supervision such as DINOv2. Experiments demonstrate consistent gains across the ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D benchmarks.
Strengths and Weaknesses
Pros:
- This paper provides good engineering insights by demonstrating the use of MVS (i.e., view equivariance in L43) for supervision. The method is simple yet effective, distilling 3D-aware knowledge from 3D foundation models like VGGT.
- The experiments are conducted across various MLLMs (LLaVA-Next-Video, LLaVA-OneVision, and Qwen2-VL). Tables 3 and 4 demonstrate the performance gains from VGGT's 3D priors over 2D foundation models and a direct correspondence loss.
- The motivation of 3D-awareness is well illustrated and validated using the correspondence score in Sec. 2.1.3. Fig. 2 intuitively demonstrates how the similarity of visual embeddings influences 3D scene understanding.
Cons:
- The claim in L247 that 2D foundation models lack specific spatial priors doesn't seem convincing. Moreover, VGGT and FLARE are also trained on ScanNet specifically, giving them a domain advantage. I'm not sure this claim holds strong.
- The idea of "distillation from 3D" has been explored in prior 3D scene understanding works. However, related works [1,2,3] are missing, and there is a lack of discussion on other "distillation from 3D" ways.
[1] 3D-to-2D Distillation for Indoor Scene Parsing. CVPR 2021.
[2] Pri3D: Can 3D Priors Help 2D Representation Learning? ICCV 2021.
[3] OpenScene: 3D Scene Understanding with Open Vocabularies. CVPR 2023.
Minor:
- The motivation makes sense to me overall, but VGGT has been trained on ScanNet to reconstruct its 3D geometry, which gives it a strong domain alignment with the benchmark tasks.
- Fig. 2 primarily illustrates 3D correspondence at the voxel level. However, the following 3D tasks rely more on instance-level annotations. The similarity between the same objects is more directly related to 3D scene understanding and could better illustrate the relations.
Questions
See weakness and minor parts.
Limitations
Yes.
Final Justification
The results on VSI-Bench (out-of-domain data) effectively address my concerns. Table 2 shows that the instance-level alignment IoU indicates a positive correlation with object-level semantics.
Overall, these complementary results help strengthen the claims. I am inclined to raise my rating from borderline accept to accept.
Formatting Issues
No.
Thanks for the insightful review and for recognizing our motivation, experimental evaluation, and method design. We address the concerns below.
Weaknesses
Q1. The claim in L247 that 2D foundation models lack specific spatial priors does not seem convincing. Moreover, VGGT and FLARE are also trained on ScanNet specifically, giving them a domain advantage. I am not sure this claim holds strong.
R: We appreciate the thoughtful comment. To further assess the effectiveness of 3D foundation models versus 2D foundation models, we adopt the training data from VG-LLM [1] and use LLaVA-Next-Video as the baseline to conduct experiments on the VSI-Bench dataset. VSI-Bench [2] is a challenging benchmark for spatial reasoning. This dataset includes data from ScanNet++ [3] and ARKitScenes [4], which VGGT did not use for training, thus mitigating the domain advantage. VSI-Bench covers a diverse range of tasks—object counting, absolute and relative distance estimation, object size, room size, relative direction, route planning, and appearance order—all of which are tightly coupled with core 3D properties such as spatial relationships, relative scale, and geometric understanding.
As shown in Table 1 below, using a 3D foundation model as the teacher consistently outperforms the 2D foundation model and other strong baselines on VSI-Bench. This demonstrates that geometric knowledge remains crucial for scene understanding, especially on tasks that demand spatial reasoning:
Table 1: Evaluation on VSI-Bench.
| Method | Avg. | Object Count | Absolute Distance | Object Size | Room Size | Relative Distance | Relative Direction | Route Plan | Appearance Order |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | 34.0 | 46.2 | 5.3 | 43.8 | 38.2 | 37.0 | 41.3 | 31.5 | 28.5 |
| Gemini-1.5 Flash | 42.1 | 49.8 | 30.8 | 53.5 | 54.4 | 37.7 | 41.0 | 31.5 | 37.8 |
| Gemini-1.5 Pro | 45.4 | 56.2 | 30.9 | 64.1 | 43.6 | 51.3 | 46.3 | 36.0 | 34.6 |
| --------------------- | ------ | -------------- | ------------------- | ------------- | ----------- | ------------------- | -------------------- | ------------ | ------------------ |
| Baseline | 42.1 | 62.4 | 34.0 | 52.4 | 53.6 | 39.3 | 41.2 | 21.7 | 29.2 |
| Baseline w/Ours (DINOv2) | 44.2 | 67.1 | 33.8 | 52.6 | 56.1 | 39.0 | 42.7 | 26.3 | 35.6 |
| Baseline w/Ours (VGGT) | 45.9 | 68.7 | 34.8 | 53.6 | 56.6 | 40.9 | 43.2 | 30.4 | 39.2 |
These results indicate that geometric priors provided by 3D foundation models offer advantages in spatial reasoning and scene understanding, even without domain-specific training. We hope this addresses the concern and clarifies the role of geometry-aware models for these challenging tasks.
Q2. The idea of "distillation from 3D" has been explored in prior 3D scene understanding works. However, related works [1,2,3] are missing, and there is a lack of discussion on other "distillation from 3D" ways.
R: Thanks for highlighting this important point. We agree that there exist works exploring "distillation from 3D," which primarily focus on pure visual models or CLIP-style vision-language models. However, there is very limited exploration of leveraging 3D distillation specifically to enhance 3D representations in LLMs, which is the main focus of our work.
We will include the relevant references and a more thorough discussion of related methods in our revision.
Minor Issue
Q1. The motivation makes sense to me overall, but VGGT has been trained on ScanNet to reconstruct its 3D geometry, which gives it a strong domain alignment with the benchmark tasks.
R: Following this suggestion, in Table 1 above, we experiment on the VSI-Bench dataset, which includes out-of-domain data from ScanNet++ [3] and ARKitScenes [4]. The consistent and notable improvements demonstrate that distillation from the 3D foundation model remains effective even without domain-specific training. This highlights the model's robust generalization ability and the critical role of geometric knowledge in scene understanding.
Q2. Fig. 2 primarily illustrates 3D correspondence at the voxel level. However, the following 3D tasks rely more on instance-level annotations. The similarity between the same objects is more directly related to 3D scene understanding and could better illustrate the relations.
R: Thanks for the valuable suggestion. Our use of voxel-level correspondence in Fig. 2 aims to emphasize the model’s understanding of 3D geometric space, as fine-grained spatial relationships effectively reflect the model’s ability to capture geometric features and spatial positioning.
We agree that instance-level correspondence can highlight high-level semantic relationships between objects. Due to rebuttal format limitations, we are unable to include additional figures at this stage. However, we compute instance-level correspondence results based on the LLaVA-Next-Video model and present them in the tables below, which still exhibit the positive relation between downstream performance and 3D feature learning quality.
Table 2: Instance-level correspondence results.
| Dataset | Metric | Q1 | Q2 | Q3 | Q4 |
|---|---|---|---|---|---|
| Multi3D-Refer | IOU_25_ALL | 43.07 | 48.77 | 48.88 | 54.42 |
| Multi3D-Refer | IOU_50_ALL | 39.45 | 44.12 | 45.32 | 50.09 |
| ScanRefer | IOU_25_ALL | 45.98 | 51.88 | 55.26 | 57.09 |
| ScanRefer | IOU_50_ALL | 40.37 | 46.39 | 49.23 | 50.80 |
| Scan2Cap | CIDEr | 85.14 | 100.06 | 104.41 | 103.40 |
| ScanQA | CIDEr | 97.70 | 96.17 | 105.94 | 109.34 |
| SQA3D | All | 56.44 | 57.27 | 61.02 | 59.37 |
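To make the instance-level metric concrete, one simple way to compute it is to mean-pool features over all patches belonging to the same object instance in each view and compare the pooled vectors across views. The sketch below follows that assumption and is not necessarily the exact protocol behind Table 2; all names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def instance_similarity(feats_a, feats_b, inst_map_a, inst_map_b, instance_id):
    """Cosine similarity between mean-pooled features of one object instance
    observed in two views. feats_*: (num_patches, dim); inst_map_*: (num_patches,)
    integer instance labels per patch (the instance must appear in both views)."""
    pooled_a = feats_a[inst_map_a == instance_id].mean(dim=0)
    pooled_b = feats_b[inst_map_b == instance_id].mean(dim=0)
    return F.cosine_similarity(pooled_a, pooled_b, dim=0).item()

# Toy example: 196 patches, 1024-dim features, instance ids in {0, ..., 9}.
fa, fb = torch.randn(196, 1024), torch.randn(196, 1024)
ia, ib = torch.randint(0, 10, (196,)), torch.randint(0, 10, (196,))
print(instance_similarity(fa, fb, ia, ib, instance_id=3))
```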
Thanks again for the valuable feedback.
[1] Zheng, Duo, et al. "Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors." Arxiv 2025.
[2] Yang, Jihan, et al. "Thinking in space: How multimodal large language models see, remember, and recall spaces." CVPR 2025.
[3] Yeshwanth, Chandan, et al. "Scannet++: A high-fidelity dataset of 3d indoor scenes." ICCV 2023.
[4] Baruch, Gilad, et al. "Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data." Arxiv 2021.
Thank you to the authors for the detailed rebuttal that addresses most of my concerns.
The additional results on VSI-Bench (out-of-domain data) effectively support the claims in Q1 and Q2, demonstrating that the 3D foundation model consistently improves performance. Table 2 shows that the instance-level alignment IoU indicates a positive correlation with object-level semantics.
Overall, these complementary results help strengthen the claims. I am inclined to raise my rating to accept.
Thank you for your feedback. We really appreciate your valuable suggestions and recognition of our work, which means a lot to us.
This paper first discovers a strong positive correlation between the 3D-aware representation and downstream 3D task performance. It subsequently proposes learning better 3D-aware representation from 3D foundation model like FLARE and VGGT via knowledge distillation. On five common 3D scene understanding benchmarks, the proposed 3DRS approach shows superior performance.
Strengths and Weaknesses
Strengths
- The multi-view correspondence experiment makes intuitive sense to me. Ideally, a good 3D-aware representation should extract similar features given different views of the same 3D location.
- This paper is relatively well-organized and easy-to-follow. I am not an expert in 3D domain, but I find reading this paper rather easy and joyful.
- The proposed 3DRS shows consistent and non-trivial improvement across multiple 3D scene understanding benchmark.
Weakness
- The authors use the finding of the multi-view correspondence analysis to establish the superiority of 3D-aware representation, but then they turn to 3D foundation model distillation, without solid experimental proof of why such an approach is better. The authors are encouraged to provide further analysis on how distillation from VGGT or FLARE promotes multi-view correspondence as well as other necessary 3D properties.
- It is well-known in the 3D detection area that knowledge distilled from a strong 3D model to a 2D model leads to better representation and improved performance [1][2][3], presumably because the spatial cues captured in the 3D modality are not so well learnt in those 2D models, which reduces the original contribution of this work. Also, this paper does not cite these important related works.
References
[1] Chen, Zehui, et al. "Bevdistill: Cross-modal bev distillation for multi-view 3d object detection." arXiv preprint arXiv:2211.09386 (2022).
[2] Wang, Zeyu, et al. "Distillbev: Boosting multi-camera 3d object detection with cross-modal knowledge distillation." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
[3] Bang, Geonho, et al. "Radardistill: Boosting radar-based object detection performance via knowledge distillation from lidar features." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
Questions
- Why does distillation from 2D foundation models like SigLIP2 or MAE also lead to improved 3D performance, as shown in Table 3? Intuitively, these models should only have learnt text-aligned or low-level representations?
- I wonder if the proposed 3DRS method also leads to improved performance on newer 3D MLLM benchmarks like VSI-Bench [4], which is a video VQA dataset that requires advanced visual-spatial intelligence?
- In Table 5, the authors only ablate with single layer distillation. I wonder, intuitively, is it better to align with multiple layers?
References
[4] Yang, Jihan, et al. "Thinking in space: How multimodal large language models see, remember, and recall spaces." Proceedings of the Computer Vision and Pattern Recognition Conference. 2025.
Limitations
- This paper only studies Qwen2-VL, but not the latest Qwen2.5-VL.
- So far this work only studies multi-view images, which is a relatively narrow scenario. Personally I believe spatial understanding given video input is much more common and potentially impactful.
Final Justification
I am delighted to see the effectiveness of the proposed method on VSI-Bench. I am also satisfied with the explanation of distilling from pure 2D vision models like SigLIP2 or MAE. I am, however, still a bit skeptical about the dramatic change from the finding of the multi-view correspondence analysis to 3D foundation model distillation.
Overall, I agree that this is a decent work and should be accepted. I am inclined to raise my rating to 5.
Formatting Issues
N/A
Thanks for the constructive feedback and for acknowledging our ideas and experimental results. Below, we provide detailed responses to the concerns.
Weaknesses
Q1: The authors use the finding of multi-view correspondence analysis to establish the superiority of 3D-aware representation, but then they turn to 3D foundation model distillation, without solid experimental proof why such an approach is better. The authors are encouraged to provide further analysis on how distillation from VGGT or FLARE promotes multi-view correspondence as well as other necessary 3D properties.
R: We analyze 3D awareness through multi-view correspondences, as they are the fundamental cues for 3D geometric modeling [1]. 3D foundation models such as VGGT are trained on large-scale 3D tasks that go far beyond multi-view correspondence. These models also learn from depth maps, camera poses, point clouds, and other 3D-related tasks. Therefore, when we adopt feature distillation from such models, our method implicitly inherits rich 3D features from VGGT.
Since our model cannot be directly evaluated on these 3D tasks (as doing so would require additional training heads for each specific task), we instead use the VSI-Bench [2] dataset for evaluation. VSI-Bench covers a diverse range of tasks—object counting, absolute and relative distance estimation, object size, room size, relative direction, route planning, and appearance order—all of which are tightly coupled with core 3D properties such as spatial relationships, relative scale, and geometric understanding. We use the training data from VG-LLM [3] and adopt LLaVA-Next-Video as the baseline.
As shown in the table below, our method achieves consistent and notable improvements across both the overall metric and individual sub-tasks, compared to the baseline and a variant using DINOv2 features. This demonstrates that our approach effectively inherits the 3D properties from the foundation model, thus promoting not only multi-view correspondence but also other critical 3D capabilities. We will include these results in our revision.
Table 1: Evaluation on VSI-Bench.
| Method | Avg. | Object Count | Absolute Distance | Object Size | Room Size | Relative Distance | Relative Direction | Route Plan | Appearance Order |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | 34.0 | 46.2 | 5.3 | 43.8 | 38.2 | 37.0 | 41.3 | 31.5 | 28.5 |
| Gemini-1.5 Flash | 42.1 | 49.8 | 30.8 | 53.5 | 54.4 | 37.7 | 41.0 | 31.5 | 37.8 |
| Gemini-1.5 Pro | 45.4 | 56.2 | 30.9 | 64.1 | 43.6 | 51.3 | 46.3 | 36.0 | 34.6 |
| --------------------- | ------ | -------------- | ------------------- | ------------- | ----------- | ------------------- | -------------------- | ------------ | ------------------ |
| Baseline | 42.1 | 62.4 | 34.0 | 52.4 | 53.6 | 39.3 | 41.2 | 21.7 | 29.2 |
| Baseline w/Ours (DINOv2) | 44.2 | 67.1 | 33.8 | 52.6 | 56.1 | 39.0 | 42.7 | 26.3 | 35.6 |
| Baseline w/Ours (VGGT) | 45.9 | 68.7 | 34.8 | 53.6 | 56.6 | 40.9 | 43.2 | 30.4 | 39.2 |
Q2: It is well-known in the 3D detection area that knowledge distilled from a strong 3D model to a 2D model leads to better representation and improved performance [1][2][3], presumably because the spatial cues captured in the 3D modality are not so well learned in those 2D models, leading to less original contribution of this work. Also, this paper didn't cite these important related works.
R: Thank you for the suggestion. Previous works mainly focus on improving pure vision models or CLIP-style VLMs. In contrast, our work targets enhancing the 3D representation abilities of MLLMs, which remains an under-explored problem. We will discuss these relevant papers in our revision.
Questions
Q1: Why does distillation from 2D foundation models like SigLIP2 or MAE also lead to improved 3D performance as shown in Table 3? Intuitively, these models should only have learned text-aligned or low-level representations.
R: Although 2D foundation models like SigLIP2 or MAE are mainly trained on 2D tasks, recent studies [4,5] have shown that they can learn emergent properties beyond low-level features. During training, these models develop object awareness, an understanding of shapes, spatial relationships, and even implicit geometry, as discussed in works like DINOv2 [6] and Dino-ViT-Features [7]. As a result, some 3D-related knowledge—such as depth cues and object correspondences—is also encoded in their features. This explains why distillation from strong 2D models can bring improvements on 3D tasks.
Q2: I wonder if the proposed 3DRS method also leads to improved performance on the new 3D MLLM benchmark like VSI-Bench, which is a video VQA dataset that requires advanced visual-spatial intelligence?
R: Following the suggestion, we evaluated our 3DRS method on the VSI-Bench benchmark. The results are reported in Table 1 above. The results show that our method achieves clear improvements in visual-spatial understanding on this challenging video VQA dataset, further demonstrating the effectiveness of our approach for advanced 3D MLLM benchmarks.
Q3: In Table 5, the authors only ablate with single layer distillation. I wonder, intuitively, is it better to align with multiple layers?
R: We also try supervising multiple layers, as shown below. However, as the number of supervised layers increases, the performance drops. We believe this is because multi-layer supervision over-constrains geometric feature learning and weakens semantic representation learning, leading to worse downstream results. These results suggest that while single-layer supervision is most effective in our setting, future work could further investigate more sophisticated multi-layer strategies to balance both geometric and semantic cues. We will discuss these findings and insights in the revised paper.
Table 2: Distillation on multiple layers.
| Supervision | ScanRefer Acc@0.25 | ScanRefer Acc@0.5 | Multi3DRefer Acc@0.25 | Multi3DRefer Acc@0.5 | Scan2Cap B-4@0.5 | Scan2Cap C@0.5 | ScanQA C | ScanQA EM | SQA3D EM |
|---|---|---|---|---|---|---|---|---|---|
| Last layer | 62.9 | 56.1 | 60.4 | 54.9 | 41.6 | 86.1 | 104.8 | 30.3 | 60.6 |
| Last layer + last 3rd layer | 61.5 | 54.8 | 60.1 | 54.9 | 41.4 | 84.4 | 101.4 | 29.2 | 60.5 |
| Last layer + last 3rd + last 5th layer | 60.5 | 53.9 | 59.0 | 53.8 | 40.0 | 81.1 | 102.9 | 30.0 | 59.6 |
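A minimal sketch of the multi-layer variant ablated above, where the alignment term is accumulated over a chosen set of LLM layers; the averaging scheme and layer indexing are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def multi_layer_alignment(visual_hidden_states, teacher_feats, layer_ids=(-1, -3, -5)):
    """Average the cosine alignment loss over several LLM layers.
    visual_hidden_states: sequence of per-layer visual-token features,
    each of shape (batch, num_tokens, dim); teacher_feats: same shape."""
    losses = [
        1.0 - F.cosine_similarity(visual_hidden_states[i], teacher_feats, dim=-1).mean()
        for i in layer_ids
    ]
    return torch.stack(losses).mean()

# Toy example: 6 layers of hidden states, batch 2, 16 tokens, dim 1024.
hidden = [torch.randn(2, 16, 1024) for _ in range(6)]
teacher = torch.randn(2, 16, 1024)
print(multi_layer_alignment(hidden, teacher).item())
```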
Limitations
Q1: This paper only studies QWen2-VL, but not the latest QWen2.5-VL.
R: We actually study three types of MLLMs in our paper, i.e., LLaVA-Next-Video, LLaVA-OneVision, and Qwen2-VL. Below, we also provide results for Qwen2.5-VL, where our method still consistently and notably improves performance, further demonstrating its effectiveness.
Table 3: Evaluation using Qwen2.5-VL.
| Method | ScanRefer Acc@0.25 | ScanRefer Acc@0.5 | Multi3DRefer Acc@0.25 | Multi3DRefer Acc@0.5 | Scan2Cap C@0.5 | Scan2Cap B-4@0.5 | ScanQA C | ScanQA EM | SQA3D EM |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL 7B | 58.4 | 52.5 | 59.6 | 52.9 | 80.0 | 40.1 | 100.5 | 29.2 | 59.1 |
| Qwen2.5-VL 7B w/Ours | 60.0 | 54.3 | 61.8 | 54.6 | 82.1 | 41.3 | 102.8 | 29.3 | 60.3 |
Q2: So far this work only studies multi-view images, which is a relatively narrow scenario. Personally, I believe spatial understanding given video input is much more common and potentially impactful.
R: As mentioned in our response to Q1 in the weaknesses section, our method also works on VSI-Bench with video inputs, demonstrating its effectiveness beyond posed multi-view images. We will add the corresponding experimental results in the revised paper.
Thanks again for the insightful review.
[1] Hartley, Richard, and Andrew Zisserman. Multiple view geometry in computer vision. Cambridge university press, 2003.
[2] Yang, Jihan, et al. "Thinking in space: How multimodal large language models see, remember, and recall spaces." CVPR 2025.
[3] Zheng, Duo, et al. "Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors." Arxiv 2025.
[4] El Banani, Mohamed, et al. "Probing the 3d awareness of visual foundation models." CVPR 2024.
[5] Man, Yunze, et al. "Lexicon3d: Probing visual foundation models for complex 3d scene understanding." NeurIPS 2024.
[6] Oquab, Maxime, et al. "Dinov2: Learning robust visual features without supervision." Arxiv 2023.
[7] Amir, Shir, et al. "Deep vit features as dense visual descriptors." Arxiv 2021.
I thank the authors for the detailed rebuttal and I recognize the amount of effort and work to prepare the rebuttal in such a short amount of time.
I am delighted to see the effectiveness of the proposed method on VSI-Bench. I am also satisfied with the explanation of distilling from pure 2D vision models like SigLIP2 or MAE. I am, however, still a bit skeptical about the dramatic change from the finding of the multi-view correspondence analysis to 3D foundation model distillation.
Overall, I agree that this is a decent work and should be accepted. I am inclined to raise my rating to 5.
Thank you for the encouraging feedback, constructive suggestions, and recognition of our rebuttal and efforts.
To further clarify, multi-view correspondences play a fundamental and indispensable role in 3D vision. They serve as essential cues for a wide range of core 3D tasks, such as ray retriangulation [1], bundle adjustment [2], and pose estimation [1]. Moreover, correspondences are critical for downstream applications, including instance recognition and retrieval [4,5]. For this reason, we adopt multi-view correspondence analysis as the proxy to evaluate the 3D representations of MLLMs.
3D foundation models (e.g., VGGT) are specifically designed to learn robust and comprehensive 3D representations, and thus possess strong correspondence reasoning abilities. In addition to multi-view correspondence, these models are trained on a diverse set of 3D tasks, such as depth estimation, camera pose estimation, and point cloud reconstruction. As demonstrated in Table 4, feature distillation from 3D foundation models allows our method to inherit richer and more expressive 3D features, leading to significantly improved performance compared to using multi-view correspondence loss alone. We appreciate your suggestions and will incorporate these discussions into our revision to further clarify the connection and fill this gap.
Thank you again for your insightful feedback and support!
References:
[1] Hartley, Richard, and Andrew Zisserman. Multiple view geometry in computer vision. Cambridge university press, 2003.
[2] Triggs, Bill, et al. "Bundle adjustment—a modern synthesis." International workshop on vision algorithms. Berlin, Heidelberg: Springer Berlin Heidelberg, 1999.
[3] Stojanov, Stefan, et al. "Learning dense object descriptors from multiple views for low-shot category generalization." Advances in Neural Information Processing Systems 35 (2022): 12566-12580.
[4] Xu, Yong, et al. "Multi-view 3D shape recognition via correspondence-aware deep learning." IEEE Transactions on Image Processing 30 (2021): 5299-5312.
[5] Wu, Hao, et al. "COM3D: Leveraging Cross-View Correspondence and Cross-Modal Mining for 3D Retrieval." 2024 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2024.
This paper explores the 3D-awareness of Multimodal Large Language Models (MLLMs) by evaluating their consistency across multi-view image features. The authors show a positive correlation between 3D-aware representation quality and downstream scene understanding performance. They first apply a correspondence loss to improve multi-view consistency and then introduce a more powerful distillation loss from pretrained 3D foundation models, enabling MLLMs to better align their visual features with rich 3D representations. The proposed method, termed 3DRS, leads to consistent performance improvements across multiple benchmarks and model backbones without adding runtime or training complexity.
Strengths and Weaknesses
Strengths:
- Clear Motivation and Insightful Evaluation: The paper provides a well-motivated investigation into a key limitation of current MLLMs: limited 3D-awareness due to 2D pretraining. The use of multi-view correspondence as a proxy for 3D consistency is intuitive and effective.
- Simple but Effective Techniques: The proposed correspondence loss and distillation loss from 3D foundation models are easy to implement, yet significantly improve 3D feature learning.
- Strong Empirical Validation: Results show consistent and notable improvements across diverse tasks (3D captioning, grounding, QA) and popular MLLMs, including LLaVA variants and Qwen2-VL.
Weaknesses:
- Limited Evaluation of Generalization Ability: The generalization capability of 3DRS is not fully explored in the paper. Since the training and test data appear to come from the same or similar distributions, it's unclear how well 3DRS performs in out-of-distribution settings. Evaluating the method on unseen benchmarks such as MMScanQA [1] or OpenEQA [2] would provide stronger evidence of its robustness and broader applicability.
- Lack of Discussion on Broader Applicability: While 3DRS is shown to be effective for 3D scene understanding, the paper does not explore or discuss its potential benefits for more general-purpose tasks in vision or multimodal learning, such as image or video understanding. Clarifying whether the proposed 3D supervision could also improve performance in such broader domains would enhance the impact and relevance of the work.
[1] MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations. NeurIPS 2024.
[2] OpenEQA: Embodied Question Answering in the Era of Foundation Models. CVPR 2024.
Questions
- It appears there is a typo in Figure 3 where “XE loss” is likely intended to be “CE loss” (Cross-Entropy). Please clarify or correct this for clarity.
- Table 5 suggests that applying the distillation loss to different final layers of the LLM improves downstream task performance. It would be valuable to explore the effect of applying distillation across multiple layers simultaneously, potentially capturing both low-level and high-level 3D cues. This could provide deeper insights into where 3D supervision is most effective within the model architecture.
Limitations
yes
Final Justification
The proposed method is decent with great generalization ability. I'll keep my accept recommendation.
Formatting Issues
The paper format looks good to me.
Thanks for the valuable feedback and for recognizing our work. We sincerely appreciate the constructive comments and address the concerns below.
Weaknesses
Q1: Limited Evaluation of Generalization Ability: The generalization capability of 3DRS is not fully explored in the paper. Since the training and test data appear to come from the same or similar distributions, it's unclear how well 3DRS performs in out-of-distribution settings. Evaluating the method on unseen benchmarks such as MMScanQA [1] or OpenEQA [2] would provide stronger evidence of its robustness and broader applicability.
R: Thank you for highlighting this important point. Following the suggestion, we conduct additional experiments on the OpenEQA benchmark. We use LLaVA-Next-Video as the baseline and uniformly sample 32 frames per video for testing. Notably, the original OpenEQA evaluation uses the gpt-4-1106-preview model, which is no longer available; therefore, we use the updated gpt-4.1-2025-04-14 model for our experiments.
As shown in Table 1 below, our method continues to deliver performance improvements on this out-of-distribution dataset, which demonstrates its strong generalization capability. Specifically, distillation from VGGT brings a 2.5-point improvement over the baseline.
Table 1: Results on OpenEQA.
| Method | OpenEQA |
|---|---|
| Baseline | 55.1 |
| Baseline w/Ours (DinoV2) | 56.1 |
| Baseline w/Ours (VGGT) | 57.6 |
To further verify the robustness and broader applicability of our approach, we also evaluate our method on another challenging spatial understanding dataset (VSI-Bench [1]), which adopts videos instead of posed multiview images as inputs for evaluation. VSI-Bench covers a diverse range of tasks—object counting, absolute and relative distance estimation, object size, room size, relative direction, route planning, and appearance order—all of which tightly couple with core 3D properties such as spatial relationships, relative scale, and geometric understanding. This dataset includes data from ScanNet++ [2] and ARKitScenes [3], which VGGT did not use for training, thus mitigating the domain advantage. We adopt training data from VG-LLM [4] and use LLaVA-Next-Video as the baseline.
As shown in Table 2, our method consistently achieves higher scores across multiple sub-tasks, which reinforces its effectiveness in diverse scenarios.
Table 2: Evaluation on VSI-Bench.
| Method | Avg. | Object Count | Absolute Distance | Object Size | Room Size | Relative Distance | Relative Direction | Route Plan | Appearance Order |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | 34.0 | 46.2 | 5.3 | 43.8 | 38.2 | 37.0 | 41.3 | 31.5 | 28.5 |
| Gemini-1.5 Flash | 42.1 | 49.8 | 30.8 | 53.5 | 54.4 | 37.7 | 41.0 | 31.5 | 37.8 |
| Gemini-1.5 Pro | 45.4 | 56.2 | 30.9 | 64.1 | 43.6 | 51.3 | 46.3 | 36.0 | 34.6 |
| --------------------- | ------ | -------------- | ------------------- | ------------- | ----------- | ------------------- | -------------------- | ------------ | ------------------ |
| Baseline | 42.1 | 62.4 | 34.0 | 52.4 | 53.6 | 39.3 | 41.2 | 21.7 | 29.2 |
| Baseline w/Ours (DINOv2) | 44.2 | 67.1 | 33.8 | 52.6 | 56.1 | 39.0 | 42.7 | 26.3 | 35.6 |
| Baseline w/Ours (VGGT) | 45.9 | 68.7 | 34.8 | 53.6 | 56.6 | 40.9 | 43.2 | 30.4 | 39.2 |
These results provide strong evidence to show that our method not only generalizes to out-of-domain settings, but also enhances performance on demanding 3D reasoning tasks. We will add these results and further discussion in the revision.
Q2: Lack of Discussion on Broader Applicability: While 3DRS is shown to be effective for 3D scene understanding, the paper does not explore or discuss its potential benefits for more general-purpose tasks in vision or multimodal learning, such as image or video understanding. Clarifying whether the proposed 3D supervision could also improve performance in such broader domains would enhance the impact and relevance of the work.
R: We agree that it is valuable to consider the impact of 3D supervision beyond traditional 3D scene understanding tasks.
We note that our experiments on VSI-Bench demonstrate that 3DRS brings improvements in video understanding scenarios. This result provides evidence that our approach is not limited to posed multi-view scenes but also extends to video-based tasks.
Moreover, we believe that by enhancing a model’s understanding of geometry and spatial relationships, our 3D supervision has the potential to benefit a wide range of image and video understanding tasks, especially those requiring deep spatial reasoning. While our current study focuses on 3D scene understanding, we appreciate the suggestion and plan to further investigate the impact of our approach on broader vision and multimodal domains in future work. We will add the discussion in our revision.
Questions
Q1: It appears there is a typo in Figure 3 where “XE loss” is likely intended to be “CE loss” (Cross-Entropy). Please clarify or correct this for clarity.
R: Thank you for pointing this out. Yes, it should be “CE loss” (Cross-Entropy). We will update this in the revision for clarity.
Q2: Table 5 suggests that applying the distillation loss to different final layers of the LLM improves downstream task performance. It would be valuable to explore the effect of applying distillation across multiple layers simultaneously, potentially capturing both low-level and high-level 3D cues. This could provide deeper insights into where 3D supervision is most effective within the model architecture.
R: Thank you for the insightful suggestion. We conduct additional experiments where we apply supervision to multiple layers simultaneously, as shown in Table 3 below. Interestingly, we observe that as the number of supervised layers increases, the performance decreases. We believe that multi-layer supervision may over-constrain the geometric features while weakening the model's semantic representations, ultimately resulting in lower performance on downstream tasks.
Table 3: Distillation on multiple layers.
| Supervision | ScanRefer Acc@0.25 | ScanRefer Acc@0.5 | Multi3DRefer Acc@0.25 | Multi3DRefer Acc@0.5 | Scan2Cap B-4@0.5 | Scan2Cap C@0.5 | ScanQA C | ScanQA EM | SQA3D EM |
|---|---|---|---|---|---|---|---|---|---|
| Last layer | 62.9 | 56.1 | 60.4 | 54.9 | 41.6 | 86.1 | 104.8 | 30.3 | 60.6 |
| Last layer + last 3rd layer | 61.5 | 54.8 | 60.1 | 54.9 | 41.4 | 84.4 | 101.4 | 29.2 | 60.5 |
| Last layer + last 3rd + last 5th layer | 60.5 | 53.9 | 59.0 | 53.8 | 40.0 | 81.1 | 102.9 | 30.0 | 59.6 |
These results suggest that while single-layer supervision is most effective in our setting, future work can further investigate more sophisticated multi-layer strategies to balance both geometric and semantic cues. We will discuss these findings and insights in the revision.
Thanks again for the valuable feedback.
[1] Yang, Jihan, et al. "Thinking in space: How multimodal large language models see, remember, and recall spaces." CVPR 2025.
[2] Yeshwanth, Chandan, et al. "Scannet++: A high-fidelity dataset of 3d indoor scenes." ICCV 2023.
[3] Baruch, Gilad, et al. "Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data." Arxiv 2021.
[4] Zheng, Duo, et al. "Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors." Arxiv 2025.
Thanks for authors' detailed feedback. My concerns are all addressed and I will keep my accept recommendation.
Thanks for your time reviewing our paper and all the constructive suggestions.
We sincerely thank all reviewers for their time reviewing our paper and providing constructive feedback. We appreciate the reviewers' recognition of our work:
- Our diagnostic-first approach that “establishes a quantitative link between multi-view correspondence scores and downstream task performance” (Ko1x), and “provides a well-motivated investigation into a key limitation of current MLLMs” (AJfQ).
- The "consistent and non-trivial improvement across multiple 3D scene understanding benchmarks" (QnTH) and "strong empirical validation with well-designed ablations and comparisons" (AJfQ).
- The clarity and organization: "well-organized and easy-to-follow" (QnTH), "well-written and easy to follow" (Ko1x).
- The simplicity and effectiveness of our proposed method: "simple but effective" (AJfQ), "method is simple yet effective" (8Yhg).
Below, we summarize the main concerns and our responses.
Common Concerns & Our Responses
- Novelty Explanation & Compare Prior Work on 3D-to-2D Distillation (QnTH, 8Yhg, Ko1x)
- Response: We clarify that our contribution lies in systematically analyzing and improving 3D awareness in MLLMs specifically, which remains under-explored. We will add the missing references and a more thorough discussion of prior works in the revised paper.
- Broaden Evaluation Scope & Out-of-Domain Generalization (QnTH, AJfQ, 8Yhg, Ko1x)
- Response: We have conducted additional experiments on OpenEQA and VSI-Bench, which demonstrate consistent improvements and strong generalization across out-of-domain, challenging, diverse, and video-based spatial reasoning tasks. Our results on video-based VSI-Bench further highlight the potential of our approach for broader vision and multimodal tasks. We will discuss these findings in our revision and plan to explore this direction further in future work.
- Analysis of What 3D Properties Are Learned (QnTH, Ko1x)
- Response: We provide fine-grained evaluation on sub-tasks within VSI-Bench (object counting, distance, room size, spatial relationships, etc.), confirming robust improvements in a wide range of geometric reasoning abilities. We also provide new results using instance-level correspondence metrics.
- Distillation on Multiple Layers & Different Distillation Strategies (QnTH, AJfQ, Ko1x)
- Response: We conducted new experiments with multi-layer supervision and different loss types, finding that single-layer supervision is optimal in our setting and that the choice of loss function makes little difference in performance. Results and discussion will be included in the revision.
Specific Concerns & Our Responses
- Evaluation Qwen2.5-VL (QnTH)
- Response: We have included additional results on Qwen2.5-VL, demonstrating that our method still consistently improves performance on this latest model version, further validating its effectiveness and generality.
- Discussion on Why 2D Foundation Models Offer Some 3D Gains (QnTH)
- Response: Recent works show that strong 2D models capture emergent spatial properties and some implicit geometry, explaining observed improvements when used for distillation.
- Typo in Figure 3 (“XE loss”) (AJfQ)
- Response: We will correct to “CE loss” (Cross-Entropy) in the revision.
- Results of Voxel-Level Correspondence (8Yhg)
- Response: We provide new instance-level correspondence results, further confirming the positive link between 3D feature quality and downstream performance.
- Discussion of Teacher Model & Error Propagation (Ko1x)
- Response: We discuss how failure modes of the teacher 3D model can propagate to the student MLLM. As 3D foundation models improve, this limitation is expected to diminish. We will include this discussion about limitations in our revised paper.
We believe all major issues have been carefully and thoroughly addressed. If there are any further questions or clarifications needed, we welcome additional feedback.
Dear Reviewers,
Thank you very much again for performing this extremely valuable service to the NeurIPS authors and organizers.
As the authors have provided detailed responses, it would be greatly appreciated if you could take a moment to review them and see if your concerns have been addressed. Given that the discussion phase is nearing its end, your prompt feedback would be especially valuable, allowing the authors a final opportunity to offer any additional clarifications if needed.
Cheers,
AC
All reviewers agreed that the problem studied is important, the findings are interesting, the proposed method is effective, and the results are promising. With additional experimental results and clarifications, the rebuttal successfully addressed the reviewers’ concerns, and all reviewers recommended acceptance. The authors are encouraged to further strengthen the camera-ready version by incorporating the reviewers’ suggestions.