Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs
Abstract
Reviews and Discussion
This paper introduces SpatialReasoner, a novel Vision-Language Model (VLM) designed to enhance fine-grained spatial reasoning through two key contributions:
- Fine-Grained Direct Preference Optimization (fDPO), which segments LongCoT responses into descriptive and reasoning components for targeted optimization.
- Multi-Model Monte Carlo Tree Search (M3CTS), a collaborative method to generate high-quality reasoning paths for training data. The model achieves state-of-the-art performance on spatial reasoning benchmarks (e.g., +9.8% accuracy over baselines) while maintaining competitiveness on general vision-language tasks.
Strengths and Weaknesses
Strengths
- The introduction of fDPO is well-motivated, addressing the limitation of uniform optimization in standard DPO.
- M3CTS leverages multiple VLMs to explore diverse reasoning paths, mitigating data scarcity for multi-step spatial tasks.
- Comprehensive experiments on multiple benchmarks demonstrate significant improvements (e.g., +4.1% in spatial quality tasks with fDPO).
- The experiments compare several training methods (SFT/DPO/fDPO) at multiple model sizes (4B and 8B parameters).
Weaknesses
- The paper would benefit from providing more in-depth analysis. I’m particularly curious about the impact of the three different types of rewards on model training—do all of them need to be used, or is one more critical than the others?
- The paper needs to provide more details. I could not find which pretrained model was used as the base model for finetuning. These details are critical for the paper's credibility and reproducibility.
Questions
Could you please provide more details about implementations and experiments?
Limitations
Yes.
Final Justification
I will maintain my original score and assessment.
Formatting Concerns
No.
We thank the reviewer for their constructive review and recognition of our paper’s novel, well-motivated contributions, our thorough experimental design, and strong empirical gains demonstrated on fine-grained spatial reasoning tasks. We are especially encouraged by the reviewer’s acknowledgement that our approach meaningfully addresses limitations in standard DPO and tackles core challenges in multi-step spatial reasoning. We hope our responses below effectively clarify the reviewer’s questions.
Q1: The paper would benefit from providing more in-depth analysis. I’m particularly curious about the impact of the three different types of rewards on model training—do all of them need to be used, or is one more critical than the others?
Thank you for this insightful question. To better understand the contribution of each reward component in our fDPO framework, we conducted a controlled ablation study in which we removed one reward at a time and retrained the model using 200K training samples.
| Method | Below/Above | Left/Right | Big/Small | Tall/Short | Wide/Thin | Behind/Front | Qual Acc |
|---|---|---|---|---|---|---|---|
| fDPO w/o R_vc | 90.00 | 87.62 | 84.91 | 85.71 | 85.58 | 88.18 | 87.06 |
| fDPO w/o R_sp | 83.33 | 83.81 | 79.25 | 80.36 | 82.69 | 84.55 | 82.34 |
| fDPO w/o R_lc | 88.33 | 85.71 | 83.02 | 77.68 | 82.69 | 86.36 | 84.02 |
| fDPO with all | 90.00 | 89.52 | 86.79 | 86.61 | 87.50 | 90.00 | 88.43 |
Removing R_sp led to the most significant drop in qualitative accuracy (−6.09%), confirming that spatial grounding is the most critical supervision signal for the task. This is expected, as spatial relations are central to our benchmark and require explicit reinforcement. In contrast, removing R_vc resulted in the smallest degradation in qualitative accuracy (−1.37%), suggesting that visual description is already relatively well-learned from pretraining and contributes less incremental benefit during optimization. Nevertheless, performance still dropped without R_vc, confirming its complementary role. Removing the logical coherence reward R_lc led to a notable performance decline (−4.41%), especially in categories like Tall/Short and Wide/Thin, where multi-step logical reasoning over relative object properties (size, etc.) is needed. These findings support our design choice to incorporate all three reward types and demonstrate that each contributes complementary benefits to the final model performance.
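For illustration, below is a minimal sketch (hypothetical function names, weights, and scores, not our actual implementation) of how the three reward components can be combined when scoring a response, and how an ablation simply drops one term:

```python
from typing import Dict, Optional

def total_reward(components: Dict[str, float],
                 weights: Optional[Dict[str, float]] = None) -> float:
    """Combine per-response reward components (assumed to lie in [0, 1]):
    R_vc = visual consistency, R_sp = spatial grounding, R_lc = logical coherence."""
    weights = weights or {"R_vc": 1.0, "R_sp": 1.0, "R_lc": 1.0}
    return sum(weights[k] * components[k] for k in components)

# Full reward vs. the "w/o R_sp" ablation for a hypothetical response.
resp = {"R_vc": 0.9, "R_sp": 0.7, "R_lc": 0.8}
full_reward = total_reward(resp)                                            # 2.4
wo_sp_reward = total_reward(resp, {"R_vc": 1.0, "R_sp": 0.0, "R_lc": 1.0})  # 1.7
```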
Q2: The paper needs to provide more details. I could not find which pretrained model was used as the base model for fine-tuning. These details are critical for the paper's credibility and reproducibility.
Could you please provide more details about implementations and experiments?
Thank you for pointing this out. Our SpatialReasoner is built upon Sa2VA, which itself is based on InternVL2.5. To demonstrate the effectiveness of our proposed methods, we benchmark our final model not only against this Sa2VA baseline but also against a standard DPO implementation, as shown in Table 1. We will address this omission in Appendix E (Training Details), where we describe implementation details and training data. We plan to release our code, training data, and pretrained model checkpoints to support reproducibility and future research. We sincerely welcome any additional questions on implementation and experimental details, and would be glad to provide further clarifications during the discussion period.
Thank you for your response. I will maintain my original score and assessment.
Thank you for your review and for considering our responses. We appreciate your feedback and engagement.
The authors propose SpatialReasoner, a vision-language model focused on fine-grained spatial reasoning. They first introduce a high quality dataset constructed using tree-search and reward algorithms for better data diversity and spatial reasoning context. Then an MLLM is trained using fine-grained DPO. Their model is evaluated across benchmarks highlighting the clear and consistent performance improvements.
Strengths and Weaknesses
Strengths
- The paper is well written with great use of figures to illustrate points
- Extensive evaluation establishes the usefulness and strengths of the method
- Novel dataset construction and training mechanism
- Good method interpretability through qualitative results
Weaknesses
- Can any principled method be derived for choosing alpha and lambda values in fDPO? Ablations in Tables 3 & 4 show some sensitivity to these parameters.
- What are the VLMs used for rewards? Highlight this more in main paper.
- Consider discussing possibly related prior work like [1] & [2].
- Also on the architecture design, have the authors explored using textual representations for locations (like in [1,2]) instead of masks with an additional prompt encoder? Consider discussing this.
[1] “Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic.” 2023
[2] “Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs.” CVPR 2024
Questions
See weaknesses
Limitations
Yes
Final Justification
The authors resolve all concerns. Retaining vote of accept.
Formatting Concerns
N/A
We sincerely thank the reviewer for their thoughtful and supportive review, and for recognizing the strengths and impact of our work. We greatly appreciate the reviewer’s recognition of our novel dataset construction and fine-grained training methodology, the clarity and interpretability of our approach through strong visualizations and qualitative examples, and the consistent performance gains across spatial reasoning benchmarks. We are especially grateful for the reviewer’s encouragement and thoughtful questions, which reinforce the value of our novel framework.
Q1: Can any principled method be derived for choosing alpha and lambda values in fDPO?
Thank you for raising this important point! In our method, λ and α regulate how the segment-wise DPO losses are weighted and shaped. While there exists no closed-form solution or analytical method for choosing optimal λ and α values, their impact is interpretable.
λ controls the sensitivity of the weight distribution to reward differences. Given reward deltas for each segment, denoted ΔR_desc and ΔR_reason, we compute segment-wise weights w_desc and w_reason via a softmax-like transformation (Eq. 2 in the main paper):
$$w_s = \frac{\exp(\lambda \cdot \Delta R_s)}{\exp(\lambda \cdot \Delta R_{\text{desc}}) + \exp(\lambda \cdot \Delta R_{\text{reason}})}, \quad s \in \{\text{desc}, \text{reason}\}$$
These weights adaptively prioritize learning from the segments (description or logic) that carry stronger training signals, with λ controlling the prioritization sensitivity. The design is analogous to applying a softmax over the two reward deltas, with λ functioning as an inverse temperature. When λ → 0, both segments receive nearly equal weight regardless of their relative reward. As λ → ∞, all weight is assigned to the segment with the higher reward delta, making the weighting sharper and amplifying reward-dominant segments.
α controls the magnitude of dynamic scaling applied to the segment-wise DPO losses. After computing w_s, each segment's optimization weight is modulated as follows (Eq. 3 in the main paper):
$$\beta_s = \beta \cdot \left[1 + \alpha \cdot (2 w_s - 1)\right]$$
Here, a high α encourages the model to focus more strongly on high-signal segments while attenuating learning from weaker segments. In contrast, a small α keeps the updates more uniform, reducing the variance across segment types.
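For concreteness, here is a minimal Python sketch of how Eqs. 2–3 and the resulting segment-wise DPO terms could be computed; this is illustrative pseudocode under the definitions above, not our actual training code, and the helper names are hypothetical.

```python
import math

def segment_weights(delta_r_desc: float, delta_r_reason: float, lam: float):
    """Softmax over the two segment reward deltas; lam acts as an inverse temperature (Eq. 2)."""
    e_desc = math.exp(lam * delta_r_desc)
    e_reason = math.exp(lam * delta_r_reason)
    z = e_desc + e_reason
    return e_desc / z, e_reason / z

def dynamic_beta(beta: float, w_s: float, alpha: float) -> float:
    """Modulate the base beta for one segment by its weight (Eq. 3)."""
    return beta * (1.0 + alpha * (2.0 * w_s - 1.0))

def segment_dpo_loss(beta_s: float, logp_chosen: float, logp_rejected: float,
                     ref_logp_chosen: float, ref_logp_rejected: float) -> float:
    """Standard DPO loss for a single segment, using the segment-specific beta_s."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta_s * margin)))  # -log(sigmoid)

# Example: the reasoning segment carries the stronger preference signal.
w_desc, w_reason = segment_weights(0.2, 0.8, lam=2.0)   # ~0.23 vs ~0.77
beta_desc = dynamic_beta(0.1, w_desc, alpha=0.5)         # slightly below the base 0.1
beta_reason = dynamic_beta(0.1, w_reason, alpha=0.5)     # slightly above the base 0.1
```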
In practice, we sweep these hyperparameters on a validation split to choose values that yield optimal performance across key spatial reasoning metrics (ablations in Tables 3-4). A more adaptive or meta-learned strategy could be explored in future work.
Q2: What are the VLMs used for rewards? Highlight this more in main paper.
Thank you for pointing this out. Our reward model ensemble, used during the LongCoT preference data generation and fDPO training stages, comprises Gemini 1.5 Pro, Qwen2.5-VL-72B, and Qwen2.5-VL-7B. We chose this ensemble to capture diverse reasoning patterns and mitigate overreliance on a single model’s inductive biases. To verify the value of this design, we conducted two complementary analyses:
- A statistical breakdown over 1,000 sampled reasoning trees showed that reasoning nodes were selected from Gemini (49%), Qwen2.5-VL-72B (42%), and Qwen2.5-VL-7B (9%). This confirms that no single model dominates, and that the ensemble promotes meaningful diversity and calibration across reasoning paths. The inclusion of Qwen2.5-VL-7B ensures reasoning remains aligned with the capacity of our target model, improving generalization.
- A quantitative ablation showed that training on LongCoT data generated by the multi-model M3CTS ensemble significantly outperformed those trained on single-model data (Qwen2.5-VL-72B) by 8.5 points in qualitative accuracy and 6–10% across spatial categories, highlighting the contribution of the multi-VLM setup.
These insights are further discussed in our response to reviewer a5xU. We will ensure the final paper more clearly describes the VLMs used for reward modeling and their motivation.
Q3: Consider discussing possibly related prior work like [1] & [2].
[1] "Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic." 2023
[2] "Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs." CVPR 2024
Thank you for bringing these highly relevant works to our attention. Both are solid contributions to the field.
[1] introduces a compelling approach for referential dialogue by processing numerical coordinates directly within the text stream. The work in [2] is insightful, and we find their method of explicitly fine-tuning the model to associate object names with textual coordinates to be highly effective in improving localization and spatial comprehension. While our work shares the broader goal of enhancing spatial reasoning in VLMs, our focus is orthogonal, as we aim to improve multi-step reasoning over spatial relations with LongCoT supervision, a multi-model reward ensemble, and segment-level optimization via fDPO. We agree these are important references and will ensure both papers are properly cited and discussed in our revised manuscript.
Q4: Also on the architecture design, have the authors explored using textual representations for locations (like in [1, 2]) instead of masks with an additional prompt encoder? Consider discussing this.
That is a great question. We did consider various location representations in our initial architectural design, including textual formats such as points and bounding boxes. However, compared to these coarser representations, we found that fine-grained masks represent objects with higher precision. This precision is particularly crucial in challenging scenarios, where we found masks to be a more effective visual prompt. For instance: (1) in cases of occlusion, a bounding box would inevitably include pixels from the occluding object, introducing noise and ambiguity; (2) for objects that occupy a very small pixel area, masks provide the necessary pixel-level alignment that is essential for accurate semantic understanding.
Thank you for the detailed response. Retaining vote for accept.
Good job on the interesting paper!
Thank you very much for your positive feedback. We sincerely appreciate your careful review, constructive suggestions, and encouragement throughout the process. Your recognition of our work is truly motivating. Thank you again for your support and engagement.
The paper introduces SpatialReasoner, a VLM aimed at improving fine-grained spatial reasoning, especially in tasks requiring multi-step logic and precise spatial alignment. Existing VLMs often rely on shallow, direct-response outputs and struggle with complex spatial queries. SpatialReasoner addresses this by generating interpretable LongCoT directly from 2D images without extra sensor input.
To support training, the authors propose two components: (1) a Multi-LLM MCTS (M3CTS) method to collaboratively generate diverse, high-quality reasoning traces using multiple VLMs, and (2) a Fine-Grained Spatial Rewards mechanism that evaluates responses based on descriptive grounding, spatial precision, and logical coherence. These are used to build a high-quality dataset for preference learning.
The model is trained using a fine-grained Direct Preference Optimization (fDPO) algorithm, which separates learning signals for different reasoning segments—focusing on spatial localization during description and on logic during inference. Experiments show SpatialReasoner outperforms prior methods on spatial reasoning benchmarks, while also maintaining strong general vision-language capabilities.
Strengths and Weaknesses
Strengths:
- The paper presents a novel and effective approach for generating LongCoT data for spatial reasoning by leveraging a Multi-LLM Monte Carlo Tree Search (M3CTS). The idea of using collaborative exploration across multiple vision-language models to enhance the diversity and quality of reasoning paths is interesting.
- The introduction of a fine-grained Direct Preference Optimization (fDPO) method is well-motivated and addresses a clear limitation in existing DPO frameworks. By disentangling descriptive grounding from logical reasoning during training, the proposed method enables more targeted learning and demonstrates notable performance gains on spatial reasoning benchmarks.
- The paper provides sufficient technical detail in both the main text and the supplementary materials. This includes algorithmic descriptions, architectural choices, and training procedures, which contribute to potential reproducibility of the work.
- The experimental evaluation is thorough. The authors benchmark their method on standard datasets and show consistent improvements over prior state-of-the-art baselines. The ablation studies, particularly those analyzing the effects of the λ and α hyperparameters in fDPO, offer valuable insights into the method’s effectiveness.
- The paper includes well-chosen qualitative examples illustrating different components of the proposed pipeline, such as the M3CTS-generated reasoning paths and sample model responses. These examples help to demonstrate how the system performs in practice.
Weaknesses:
- The related work section in the main paper is overly brief and lacks sufficient coverage of key areas relevant to the proposed method. Much of the discussion is deferred to the supplementary material, which gives the main submission an incomplete appearance. A more comprehensive review should be included in the main text to properly situate the contribution within the broader literature.
- While the textual description of the method is generally clear and well-structured, the figures—particularly Figure 1 and Figure 2—are difficult to interpret. These visual aids could be significantly improved in terms of layout clarity, labeling, and flow to better support the reader’s understanding of the overall architecture and training pipeline.
- The contribution of using multiple vision-language models in the M3CTS framework for generating LongCoT data is unclear. It remains unclear how much additional benefit the multi-model setup provides compared to using a single model.
Questions
- What is the base model architecture used for SpatialReasoner?
- Would incorporating a broader range of vision-language models during the LongCoT generation phase further improve performance? Additionally, is there any statistical analysis on which models most frequently contribute to the final selected reasoning paths during the M3CTS process?
Limitations
Yes
Final Justification
I've read the rebuttal by the authors, and my concerns are reasonably addressed. In this case, I'm keeping my original evaluation as borderline accept.
Formatting Concerns
None
We thank the reviewer for their constructive review and recognition of our paper’s novel, well-motivated contributions that address limitations of existing DPO frameworks. We are encouraged that the reviewer recognized our method’s strong performance across spatial reasoning benchmarks, the thoroughness of our ablation studies, the paper’s clear technical exposition, and the utility of our qualitative examples for illustrating the practical impact of our work.
Q1: Related work section in the main paper is overly brief. Much of the discussion is deferred to the supplementary material.
Thank you for this valuable feedback. Due to space constraints, we focused the Related Work section on the most directly comparable areas and provided a broader discussion in Appendix C, including related efforts in spatial reasoning, preference optimization, and MCTS. That said, we agree that a more integrated discussion in the main paper would improve clarity and better situate our contributions. We will revise the camera-ready accordingly.
Q2: While the textual description of the method is generally clear and well-structured, the figures—particularly Figure 1 and Figure 2—are difficult to interpret.
Thank you for the constructive feedback. We have made an effort to identify specific strategies to improve both figures. We clarify below what each figure is intended to convey, as we sincerely would welcome any additional suggestions to improve their design:
Figure 1 illustrates our complete workflow, which comprises the SpatialReasoner architecture (left) and its three-stage training pipeline (right). The architecture itself is a VLM that takes an image, a text query, and a visual prompt as input to generate a reasoned response. Stage 1 employs our offline M3CTS framework to generate a diverse pool of structured LongCoT reasoning paths. In Stage 2, our reward model assesses their quality, curating them into high-quality preference pairs. Finally, these curated preference pairs serve as training data for Stage 3, where the model is fine-tuned using our novel fDPO method with distinct optimization parameters that separately improve the descriptive and logical components of the model's reasoning.
Figure 1 revision plan: We plan to break Figure 1 into two separate figures, one for the model architecture and one for the training pipeline with a better left-to-right flow, add explicit labels to each component to clarify their roles, and use consistent and distinctive color coding across components to visually differentiate between inputs, modules, and outputs.
Figure 2 illustrates the fine-grained reward evaluation (Stage 2 in Figure 1). Given an example question about the kitchen island's height, our reward mechanism decomposes reasoning quality into three independent axes: descriptive accuracy, spatial correctness, and logical coherence, as shown in the "Reward Score Breakdown" box. `Resp2` is selected as the positive sample because it excels in all three: it provides a more detailed description (e.g., mentioning the bar stools), it uses a correct spatial reference (e.g., the stools) for reasoning, and it applies sound logic (an explicit calculation). In contrast, `Resp1` is marked as the negative sample due to its incomplete description, flawed spatial claim ("lower than the counter"), and illogical "half-height" heuristic. This granular, multi-faceted reward design is what allows us to identify specific reasoning failures and construct high-quality preference data for targeted model improvement.
Figure 2 revision plan: We will reduce visual clutter in Figure 2 by simplifying textual descriptions in each response to highlight key contrasts without overwhelming the reader. We plan to also use a clearer layout separation between Resp1 and Resp2, and remove redundant elements such as the repeated spatial query (Question panel).
We welcome any additional suggestions from the reviewer to further improve our figure design.
Q3: The contribution of using multiple VLMs in the M3CTS framework for generating LongCoT data is unclear. It remains unclear how much additional benefit the multi-model setup provides compared to using a single model.
Thank you for raising this important point. The use of multiple VLMs in M3CTS is a deliberate design choice to improve both the diversity and robustness of generated LongCoT reasoning paths.
(1) Enhanced Diversity in Reasoning Paths: A single VLM is prone to getting trapped in homogeneous or low-quality reasoning paths due to its inherent biases and "cognitive blind spots." By contrast, our multi-model approach mitigates this by introducing varied inductive biases and linguistic strategies, which encourage broader exploration of the reasoning space. Each model explores the problem space differently, which prevents the search from converging on a single, potentially flawed, reasoning style and improves coverage of edge cases in spatial reasoning.
(2) Ensemble-based Error Detection: Our multi-model setup enables a robust, ensemble-based evaluation. As described in Equation (7) and Appendix D, each intermediate reasoning state is scored based on a collective judgment from all models on visual, spatial, and logical coherence. This process allows the ensemble to identify errors that a single model might miss, preventing the propagation of flawed logic that may go unnoticed in single-model pipelines.
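To make the ensemble-based evaluation concrete, below is a hedged sketch of the idea behind Equation (7); the `judge.rate` interface and the equal-weight averaging are illustrative assumptions rather than the exact implementation.

```python
from statistics import mean

def ensemble_score(state, judges) -> float:
    """Average the visual/spatial/logical ratings that each judge VLM assigns to an
    intermediate reasoning state; judge.rate(...) is a hypothetical interface that
    returns a dict of scores in [0, 1]."""
    per_judge = []
    for judge in judges:
        scores = judge.rate(state)
        per_judge.append((scores["visual"] + scores["spatial"] + scores["logical"]) / 3.0)
    return mean(per_judge)

def select_expansions(candidates, judges, k: int = 2):
    """Keep the top-k candidate reasoning steps; flawed steps favored by only one
    model tend to be filtered out by the collective judgment."""
    ranked = sorted(candidates, key=lambda c: ensemble_score(c, judges), reverse=True)
    return ranked[:k]
```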
We provide an analysis below to verify the contribution of our multi-model setup. The results show that SFT models trained on M3CTS-generated data consistently outperform those trained with LongCoT data produced by a single model (Qwen2.5-VL-72B) across all spatial reasoning categories, leading to accuracy gains ranging from 6% to over 10%, and raising the overall qualitative accuracy by 8.5 points (from 71.23% to 79.75%). We observe that M3CTS provides the largest improvements on spatial relations that are frame-dependent (e.g., Behind/Front, Left/Right) or visually ambiguous (e.g., Big/Small, Wide/Thin), where single-model reasoning often fails due to fixed biases or limited grounding. The clear and consistent margin across all spatial tasks demonstrates that leveraging multiple VLMs substantially enhances both the diversity and reliability of the generated reasoning paths.
| Training data | Below/Above | Left/Right | Big/Small | Tall/Short | Wide/Thin | Behind/Front | Qual Acc |
|---|---|---|---|---|---|---|---|
| SFT with M3CTS data | 81.66 | 81.90 | 75.47 | 75.89 | 79.80 | 83.63 | 79.75 |
| SFT with single-model data | 74.17 | 72.38 | 68.87 | 67.86 | 71.15 | 72.73 | 71.23 |
Q4: What is the base model architecture used for SpatialReasoner?
Our SpatialReasoner is built upon the Sa2VA architecture, which is based on InternVL2.5. As shown in Table 1, we benchmark our model directly against Sa2VA, as well as against a standard DPO implementation, to clearly isolate and demonstrate the improvements gained from our proposed methods.
Q5: Would incorporating a broader range of vision-language models during the LongCoT generation phase further improve performance? Additionally, is there any statistical analysis on which models most frequently contribute to the final selected reasoning paths during the M3CTS process?
That is a great question. The primary benefit of our multi-model approach is ensuring we capture diverse and robust reasoning paths. Any single model has inherent blind spots and may fail on certain types of queries. Using an ensemble makes it far more likely that at least one model will generate a valid LongCoT response. However, while it might seem appealing to include more VLMs during LongCoT generation, prior work suggests diminishing returns beyond a certain ensemble size. This aligns with our practical findings, where we observed that benefits decline beyond our current 3-model ensemble in terms of performance vs. compute and API usage costs. Our curated setup provides enough reasoning diversity to build a high-quality dataset while keeping the generation process practical.
We have conducted experiments with 1,000 randomly sampled image-question pairs. Among all reasoning nodes, 49% of nodes were selected from Gemini, 42% from Qwen2.5-VL-72B, and 9% from Qwen2.5-VL-7B, indicating that no single model dominates the reasoning process and that the ensemble encourages diversity and calibrated supervision. While Gemini contributes more frequently in the overall reasoning process, given its strong performance on spatial tasks, its responses are not always accurate; therefore, we incorporated Qwen2.5-VL-72B to enhance the overall diversity and reasoning robustness. Although Qwen2.5-VL-7B accounts for the smallest proportion of contributions, we intentionally include it in M3CTS because the model we train is closer in size to 7B. This way, we calibrate the reasoning complexity to the capacity of our target model, improving alignment between data and model capabilities.
We note that this is an excellent question that connects with foundational ideas in ensemble learning theory, hypothesis space expansion, and Bayesian learning, which have explored similar principles. A rigorous theoretical investigation of how ensemble-driven reasoning and inter-model diversity affect downstream learning, while beyond the scope of this paper, would be an impactful direction for future work.
A unified theory of diversity in ensemble learning. JMLR 2023
Diversity and generalization in neural network ensembles. AISTATS 2022
Does knowledge distillation really work? NeurIPS 2021
Dear reviewer a5xU,
Thank you again for your thoughtful and constructive review. We greatly appreciate the time and effort you have dedicated to evaluating our work. We wanted to kindly check if you had any remaining questions after reading our rebuttal.
This paper introduces a reasonable VLM training and architecture pipeline designed for improving fine-grained spatial reasoning capabilities. Their main contributions are: (1) a Fine-grained Direct Preference Optimization (fDPO) technique that decomposes reasoning outputs into descriptive and logical components and optimizes them with segment-specific learning signals, (2) a Multi-Model Monte Carlo Tree Search (M3CTS) to generate diverse, high-quality Chain-of-Thought reasoning paths for spatial tasks, and (3) a Spatial Rewards framework that evaluates outputs along visual consistency, spatial grounding, and logical coherence dimensions. They also provide rich experiments showing good results on the SpatialRGPT benchmark and competitive performance on other vision-language understanding datasets.
Strengths and Weaknesses
Strengths:
- This paper is well-written, and the figures clearly demonstrate the methods.
- They propose some relatively novel and reasonable techniques, such as fDPO, M3CTS, and their reward design, which show good improvement to the LLM/VLM framework.
- The experimental results are good: they present SOTA results on SpatialRGPT and strong results on other tasks and benchmarks. In addition, they provide an extensive ablation study and qualitative results.
Weaknesses:
- It seems like they didn't consider related RL-based tuning methods.
- It is not clear how the failure cases happened, and it lacks analysis of the failure reasoning.
- Considering the complexity of M3CTS and the proposed hyperparameters, it is unclear whether it will overfit to the training dataset.
Questions
- What is the computational overhead of M3CTS compared to standard preference data generation pipelines?
- Could you add examples where SpatialReasoner fails or produces erroneous reasoning, especially in scenes with occlusions or ambiguous spatial relations?
Limitations
- It is better to add a discussion on potential biases in generated reasoning traces due to reliance on multiple LLMs.
- What is the robustness of depth-based spatial rewards under complex scenarios?
Final Justification
This paper is generally well-written, and the proposed method is effective. Their rebuttal addressed my major concerns, although some minor ones remain. I carefully read their rebuttal and other reviewers' comments - I decided to mark it as weak/borderline accept.
Formatting Concerns
No formatting concerns.
We thank the reviewer for their constructive feedback. We appreciate that the reviewer found the paper to be well-written, and recognized the key contributions and novelty of our proposed methods, as well as the strength of our experimental results, including the SOTA performance on SpatialRGPT, comprehensive ablations, and generalization across broader V+L tasks. We are encouraged that the reviewer found our qualitative analyses informative. We hope our responses and additional analyses have meaningfully addressed all remaining concerns.
Q1: It seems like they didn't consider the related methods using RL tuning methods.
We thank the reviewer for the suggestion. We performed additional analysis comparing standard DPO with our proposed fDPO and a recent preference optimization method, mDPO, which extends preference optimization to the multimodal domain by directly modeling preferences conditioned on both images and responses, a capability that is essential to spatial reasoning, where visual grounding is critical. However, spatial reasoning tasks also require more localized, fine-grained perception and step-wise inference to accurately capture spatial relationships between objects in the image (such as up and down, left and right, front and back, etc.), which mDPO does not explicitly model. Our fDPO method addresses this through its fine-grained, segment-specific learning signals that separately supervise spatial description and logical inference, leading to more targeted supervision and improved performance on spatial relation benchmarks. The comparison below demonstrates that our method advances beyond common RL-based DPO tuning.
| Method | Below/Above | Left/Right | Big/Small | Tall/Short | Wide/Thin | Behind/Front | Qual Acc |
|---|---|---|---|---|---|---|---|
| DPO 8B | 94.16 | 93.33 | 89.62 | 90.18 | 88.64 | 92.27 | 91.48 |
| mDPO 8B | 95.00 | 95.24 | 90.57 | 91.96 | 87.50 | 93.64 | 92.39 |
| fDPO 8B | 98.33 | 98.10 | 95.28 | 96.43 | 91.34 | 93.64 | 95.59 |
Wang, Fei, et al. "mDPO: Conditional Preference Optimization for Multimodal Large Language Models." Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). 2024.
Q2: Considering the complexity of M3CTS and the proposed hyperparameters, it is unclear whether it will overfit to the training dataset.
Thank you for the comment. M3CTS is an offline data generation pipeline that produces high-quality and diverse LongCoT reasoning paths across a wide spectrum of spatial problem types (e.g., distance, size, relative position), improving coverage of spatial patterns and edge cases. By avoiding confined problem structures or templates and encouraging exploration of varied reasoning strategies, M3CTS inherently reduces the risk of overfitting during data construction. In parallel, the use of independent reward models and segment-level optimization in fDPO further mitigates overfitting to annotation noise or specific reasoning styles.
The absence of overfitting is empirically supported by our results in Table 1: on SpatialRGPT-Bench, our model shows significant and consistent performance gains across all spatial sub-tasks after fDPO training, demonstrating generalizable spatial reasoning capabilities. Moreover, the model exhibits strong generalization across 8 external vision-language benchmarks (Table 2), further confirming that it does not overfit to the training distribution.
Q3: What is the computational overhead of M3CTS compared to standard preference data generation pipelines?
The computational overhead of the M3CTS offline data generation is comparable to standard preference data generation pipelines. The generation time for a response from one of our sources, Gemini 1.5 Pro, is identical within our pipeline as it would be if generated directly. The only added overhead comes from the selection step among the concurrent responses from our different models, which takes under three seconds per sample and is fully parallelizable. Since our entire pipeline is built for parallel processing, the end-to-end overhead is negligible in practice. We note that M3CTS is an offline data generation process and thus does not affect inference efficiency.
Q4: It is not clear how the failure cases happened, and it lacks analysis of the failure reasoning. Could you add examples where SpatialReasoner fails or produces erroneous reasoning, especially in scenes with occlusions or ambiguous spatial relations?
Thank you for the insightful suggestion! We notice failures occur when the model misapplies prior knowledge instead of relying on the visual evidence presented in the image itself.
For example, in a furnished bedroom scene, Region 1 is the dresser mirror on the right, adjacent to a sleigh‑style bed headboard on the left. Our model’s pipeline for “How tall is Region 1?” proceeds in three steps: 1. The model segments the mirror mask (Region 1) and then searches nearby objects to find a familiar item for scale. 2. It recognizes the bed headboard, assuming the typical headboard height of 1.5 m and estimating the mirror to be 0.5 m taller. 3. It adds 1.5 m and 0.5 m to output ≈2.0 m. However, an actual dresser mirror in this context is closer to 1.5 m tall.
The root cause of this failure is that the model overrelies on default furniture heights rather than grounding its measurement in visual cues in the image, such as the mirror’s height relation to the floor. This is why we incorporate fine‑grained reward signals during training that explicitly reward consistency between predicted measurements and image‑derived evidence to encourage the model to verify its predictions against the actual scene rather than default priors.
As we cannot include images per conference rebuttal guidelines, the qualitative image described above and additional failure case analyses will be included in a new Appendix section (before Appendix H: Limitations) in the camera-ready version. To confirm its inclusion and reference the correct asset, we provide its MD5 checksum: 51b759a689433fb598ebf4a165b91d88. We sincerely appreciate this suggestion.
Q5: It is better to add a discussion on potential biases in generated reasoning traces due to reliance on multiple LLMs.
We thank the reviewer for raising this point. While our original dataset did not explicitly log the source model of each generated response, we have now conducted an additional analysis using 1,000 randomly sampled image-question pairs. Among all reasoning nodes, 49% of nodes were selected from Gemini, 42% from Qwen2.5VL-72B, and 9% from Qwen2.5VL-7B, indicating that no single model dominates the reasoning process and that the ensemble encourages diversity and calibrated supervision.
While Gemini contributes more frequently in the overall reasoning process, given its strong performance on spatial tasks, its responses are not always accurate; therefore, we incorporated Qwen2.5-VL-72B to enhance the overall diversity and reasoning robustness. Although Qwen2.5-VL-7B accounts for the smallest proportion of contributions, we intentionally include it in M3CTS because the model we train is closer in size to 7B. This way, we calibrate the reasoning complexity to the capacity of our target model, improving alignment between data and model capabilities.
Stanton S, Izmailov P, Kirichenko P, Alemi AA, Wilson AG. Does knowledge distillation really work? NeurIPS 2021
Q6: What is the robustness of depth-based spatial rewards under complex scenarios?
Thank you for this excellent question. The robustness of our depth-based spatial reward (R_sp) in complex scenes, e.g., those with occlusions or clutter, is achieved via two mechanisms (Appendix B, Fig. 8):
(1) Uncertainty weighting provides robustness in cases of occlusion. For example, when an object is (partially) hidden behind another, the depth map for the occluded parts, as derived from the image, often shows ambiguous or missing depth cues. The uncertainty weight increases when the model detects such ambiguities in the depth map, e.g., sharp discontinuities or high variance in predicted depth. This ensures that the reward remains stable and does not improperly penalize the model for the inherent uncertainty in inferring occluded 3D structure from a single 2D image.
(2) Context-aware weighting ensures stability in scenes with dense clutter by focusing the reward computation only on spatial relations or entities directly relevant to the query (as parsed by the model), and down-weighting peripheral inaccuracies involving unrelated scene objects. For instance, when asked about the distance between a table and a chair in a crowded scene, this weighting ensures that small errors about background entities (e.g., a plant in the corner) do not dominate the reward signal, enabling the model to learn from more targeted, relevant feedback.
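To illustrate how these two mechanisms could enter the reward computation, here is a minimal sketch; the function name, inputs, and the specific weighting form are hypothetical simplifications of the formulation in Appendix B.

```python
import numpy as np

def depth_spatial_reward(errors, depth_var, relevance) -> float:
    """errors: per-relation prediction errors in [0, 1]; depth_var: per-relation
    depth-map variance (high under occlusion); relevance: ~1 for query-relevant
    relations, ~0 for peripheral ones. All inputs are illustrative placeholders."""
    errors = np.asarray(errors, dtype=float)
    certainty = 1.0 / (1.0 + np.asarray(depth_var, dtype=float))  # uncertainty weighting
    weights = certainty * np.asarray(relevance, dtype=float)      # context-aware weighting
    weights = weights / (weights.sum() + 1e-8)
    return float(1.0 - weights @ errors)

# A cluttered scene: the occluded, query-irrelevant relation barely moves the reward.
reward = depth_spatial_reward(errors=[0.1, 0.9], depth_var=[0.05, 2.0], relevance=[1.0, 0.1])
```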
Thanks for the extra experiments and detailed reply. I will mark it as 3 -> 4.
Thank you very much for your thoughtful review and for considering our responses and additional analyses. We sincerely appreciate the time and effort you have invested in engaging with our submission and discussing our work. We are grateful that you found our clarifications and new experiments helpful. Thank you again for your constructive feedback and for updating your rating.
This paper proposes SpatialReasoner, a VLM enhanced with fine-grained preference optimization and multi-LLM tree search, yielding state-of-the-art performance on various spatial reasoning tasks.
Strength: The approach introduces well-motivated and technically sound designs, including fine-grained rewards and fDPO, supported by comprehensive evaluation.
Weakness: The related work discussion is limited and some aspects of presentation clarity, particularly in figures and contributions, could be improved.
Given the quality of the submission, the rebuttal and all feedback, the ACs recommend acceptance, as this is among the first substantial works on spatial VLMs with RL and offers impactful contributions.