Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning
Abstract
Reviews and Discussion
In this paper, the authors observe that reasoning capabilities can be transferred from language-only to vision-language models, leading to significant gains on visual reasoning tasks. This leads the authors to propose a 3-stage training pipeline for VLMs, encompassing 1) fine-tuning the LM module on language-only reasoning traces, 2) PPO-based RL to refine the LM outputs, and 3) PPO-based tuning of the full model on a curated multimodal reasoning set to improve the cross-domain transfer of reasoning capabilities. This pipeline is used to tune an existing foundational model, Qwen2.5-VL, into a model for visual reasoning dubbed OVR-7b. Empirically, the authors show that this model achieves consistently better results on a range of visual reasoning benchmarks compared to other models of the same size. Analyzing the reasoning traces with GPT-4o-mini, the authors also show that steps (1)-(3) increase the occurrence of “cognitive behaviors” (e.g., re-inspection or tracing) in the reasoning traces of the model. Finally, the authors also present some ablations on different secondary questions, e.g., whether the additional training is detrimental to the language/vision capabilities (unrelated to reasoning) or at what step of the procedure the transfer starts happening.
Strengths and Weaknesses
Strengths
The paper is overall well-written, and the narration is easy to follow. The methodology proposed is sufficiently motivated, and I am not aware of other works that proposed the same methods before. The authors present empirical evidence on the validity of their method. These include benchmarking on standard language-only reasoning (mostly mathematical, AIME/MATH) and general (MMLU) benchmarks, as well as a wide range of visual reasoning benchmarks (again, mostly mathematics-based reasoning). Furthermore, the paper also includes a quantification of the improvement in cognitive behaviors (through an automatic analysis of the reasoning traces), which appear to grow consistently thanks to the 3-stage tuning procedure proposed in the paper.
Weaknesses
The main weakness that I can identify in this paper is the limited novelty of the proposed methodology. The three-stage training pipeline proposed to study the transfer of cognitive behaviors from language to multimodal inference is not characterized by significantly novel ideas compared to the techniques already present in the literature. Possibly, step (3) of the proposed pipeline (i.e., the multimodal RL for cross-modal adaptation) is the only original step in the method (or at least, I am not familiar with any other works doing it). On another note, I believe that the observation that improvements on mathematical and coding reasoning (resulting from RL on language-only data) can also directly benefit multimodal reasoning tasks is interesting but somewhat expected.
Questions
- How do you justify the superior performance of OVR-7b on the language-only benchmarks in Table 1? For instance, why do you think it is the case that OVR-7b improves over the standard Qwen2.5-7b model on MMLU accuracy by 5.4 points?
- The ablation on the emergence of visual-specific cognitive behaviors is only performed on MathVision; would it be possible to extend it to the other visual reasoning datasets that are considered in Table 2 as well?
- Related to the ablation of steps (1)-(3), could you also provide an ablation not just on the presence of cognitive behaviors in the reasoning steps, but also on the actual performance on the different multimodal reasoning tasks? What I am missing right now is a table to quantify the progression from Qwen2.5-VL-7B to OVR-7b in terms of performances on the benchmarks for the different steps of your proposed pipeline.
- Could you reproduce/collect the results on all the datasets in Table 2 for Qwen2.5-VL-7B? This is the base model from which OVR-7b is obtained, and I believe it to be important to have a full picture of its performance.
- I think this missing reference could be quite relevant: VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning, Wang et al, 2025. Could you please comment on the differences between their and your work?
Limitations
Yes.
Final Justification
The study of cross-modal cognitive behavior transfer is principled and supported by an extensive empirical evaluation. The authors provided additional experimental results during the rebuttal that increased my confidence in the validity of their evaluation and claims. Overall, the quality of the submission is high, but some intrinsic factors (e.g., to what extent an increase in the occurrence of cognitive abilities can effectively improve reasoning performance, also mentioned by reviewer SEpa, and the overall methodological novelty compared to previous works) prevent me from recommending a full acceptance of the paper. If there were an option between "borderline accept" and "accept" (as there was, for example, with the "weak accept" score last year), I would have increased my rating to that.
Formatting Issues
The formatting of the paper is okay.
Thank you very much for the thoughtful and constructive review. We are encouraged by the recognition of our paper’s clarity, comprehensive evaluation, and the value of our empirical study. We carefully address each of your concerns below:
Q1: On the Superior Performance on Language Benchmarks
The strong language reasoning performance of OVR-7B can be attributed to the behavior-rich nature of the language data and the structure of our training pipeline:
1. Language Cold Start exposes the model to diverse and high-quality language reasoning traces rich in cognitive behaviors, laying a solid foundation for general reasoning.
2. Language RL serves as a cognitive scaffold, reinforcing these behaviors and aligning them with exact-match reward signals.
3. Multimodal RL does not degrade language reasoning abilities. As shown in Lines 283–287, the model even slightly improves on language benchmarks after Stage 3, suggesting that multimodal adaptation complements, rather than interferes with, the reasoning skills acquired during earlier stages.
Q2: Extending the Behavior Analysis to MathVerse
We have extended the same visual cognitive behavior analysis to MathVerse-MINI, following the methodology used in the main paper. The results show a consistent trend in the emergence and amplification of cognitive behaviors across training stages, supporting the generality of our findings across different tasks.
| | Re-inspection | Decomposition | Cross-checking | Tracing |
|---|---|---|---|---|
| Baseline | 0.00 | 4.33 | 0.00 | 0.00 |
| +Cold Start | 20.33 | 4.00 | 2.00 | 0.33 |
| +Language RL | 25.33 | 4.66 | 2.33 | 1.33 |
| +Multimodal RL | 29.33 | 5.00 | 2.67 | 1.33 |
Q3: Ablation on Multimodal Reasoning Performance Across Stages
Table 5 presents the ablation results across training stages. To further address your concern, we have now extended the evaluation to include more benchmarks. These results further validate the contribution of each stage.
| | MathVerse | DynaMath | WeMath | LogicVista |
|---|---|---|---|---|
| Baseline | 41.1 | 21.8 | 31.2 | 47.9 |
| +Cold start | 52.28 | 31.54 | 33.71 | 52.57 |
| +Language RL | 49.62 | 31.94 | 36.1 | 52.35 |
| +Multimodal RL | 52.9 | 32.7 | 38.2 | 55.0 |
Q4: Qwen2.5-VL-7B Performance in Table 2
The performance of Qwen2.5-VL-7B is reported in Row 6 of Table 2. We have now added its evaluation on MMMU-Pro (38.3). OVR-7B achieves a 7.11 average improvement over the base model (44.45 --> 51.66) across all multimodal benchmarks in Table 2.
Q5: Difference with VL-Rethinker
While both VL-Rethinker and our method leverage RL to enhance reasoning in MLLMs, our OVR is distinct in motivation, analytical focus, performance, and methodology:
1. Cross-modal cognitive behavior transfer as the central goal: VL-Rethinker focuses on reflective fine-tuning within a vision-language setting to enhance multimodal reasoning. In contrast, our work uniquely centers on how cognitive behaviors acquired from language-only training can transfer to visual reasoning, revealing the internal mechanisms and generalization patterns behind advanced reasoning capabilities. This enables a scalable and interpretable pathway for building stronger multimodal reasoning systems.
2. In-depth behavioral analysis: We provide the first quantitative analysis of cognitive behavior emergence and cross-modal transfer, systematically tracking behavior shifts across stages. Notably, we discover that visual-specific cognitive behaviors can emerge in a zero-shot manner, even though they do not appear in the training data. This angle is not investigated in prior works.
3. Strong reasoning across modalities: Unlike prior approaches that focus solely on visual tasks, OVR also achieves SOTA performance across a suite of language reasoning benchmarks, demonstrating the value of modality-agnostic cognitive behaviors for strong reasoning capabilities.
Q6: Addressing Mentioned Weaknesses & Clarifying Contributions
We acknowledge that the “RL with a cold start” pipeline is a well-established paradigm in LLM training. In our work, this pipeline is not positioned as the core novelty, but rather serves as a tool to systematically investigate the cross-modal transfer of cognitive behaviors and to build a strong multimodal reasoning model (lines 37–44).
We further clarify our distinct contributions as follows:
1. Cross-modal cognitive behavior transfer as the central focus: Our work uniquely centers on the unexplored behavior transfer from language to vision, revealing the internal mechanisms and generalization patterns behind advanced reasoning capabilities. This enables a scalable and interpretable pathway for building stronger multimodal reasoning systems.
2. In-depth behavioral analysis: We provide the first quantitative study of behavior emergence across training stages, including the surprising zero-shot emergence of visual-specific cognitive behaviors, transfer patterns in Sec. 6, and the correlation between behavior and reasoning performance (see Q3 response to Reviewer SEpa). This sheds light on how abstract behaviors generalize across modalities and benefit the model's reasoning capabilities.
3. Superior reasoning across modalities: OVR achieves exceptional performance on both text and multimodal reasoning, as well as on general-purpose benchmarks, demonstrating the value of modality-agnostic cognitive behaviors for strong reasoning capabilities. Notably, it establishes a powerful and reproducible baseline for the open-source community, with full data, models, and implementation details to be released publicly.
Thank you for the extensive response to all my questions and for the additional experimental results provided in the rebuttal. I have some additional comments regarding your response:
-
Q4: I appreciate the additional experimental results on MMMU-Pro. I was already familiar with the results for Qwen2.5-VL-7B in Table 2. I guess my question/suggestion was more to replicate yourselves the results which were taken from other papers and to fill in the missing value for MMMU-Pro (now addressed). Replicating the results of the base models for all the settings could be important to make sure that the performance boost actually comes from your proposed methods and not from small changes in your evaluation framework and optimization routine.
-
Q5--Q6: I agree to some extent on the deltas between yours and previous works argued in points 1 and 2 (which could be a single point). However, I am not fully convinced by point 3 in both answers. First of all, OVR does not achieve SOTA. Secondly, other baselines (such as VL-Rethinker) were simply not tested extensively on language reasoning benchmarks.
Thank you very much for the quick response and constructive suggestions. We have added additional experimental results to sincerely address each remaining concern:
-
Q4: Comparison with Reproduced Results of the Baseline
We fully agree on the need for a rigorous and fair comparison and have replicated the results of Qwen2.5-VL-7B on all benchmarks under our evaluation framework. The table below reports our reproduced results and the performance of OVR-7B. OVR-7B achieves a significant +9.47 average improvement over the faithfully reproduced baseline, clearly demonstrating that the gains come from our proposed method rather than from differences in evaluation or optimization.
| | MathVista | MathVision | MathVerse | DynaMath | WeMath | LogicVista | MMMU-Pro | CharXiv-reas | CharXiv-desc | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B (Reproduced) | 69.2 | 25.5 | 44.42 | 19.36 | 31.2 | 42.51 | 37.41 | 36.4 | 67.3 | 41.47 |
| OVR-7B | 72.9 | 50.0 | 52.9 | 32.7 | 38.2 | 55.0 | 37.8 | 43.4 | 75.6 | 50.94 |

-
Q5--Q6 Point3: Comprehensive Evaluation on Language Reasoning Benchmarks
We appreciate your thoughtful suggestions and have conducted comprehensive evaluations of SOTA 7B-scale MLLMs on language reasoning benchmarks. As shown in the table below, OVR-7B consistently outperforms the other 7B models across all tasks by a large margin, achieving the best results.
| | AIME 2024 | AIME 2025 | MATH500 | GPQA Diamond | MMLU | MMLU-Pro |
|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 6.7 | 6.7 | 67.4 | 31.8 | 69.6 | 51.7 |
| VLAA-Thinker-Qwen2.5-7B | 7.5 | 1.25 | 58.4 | 33.84 | 67.98 | 49.96 |
| OpenVLThinker-7B | 6.88 | 0.83 | 66.4 | 32.32 | 69.74 | 49.93 |
| MM-Eureka-Qwen-7B | 5.63 | 1.88 | 67.80 | 39.39 | 72.46 | 51.65 |
| VL-Rethinker-7B | 6.46 | 1.46 | 67.80 | 16.67 | 71.17 | 52.3 |
| OVR-7B | 58.8 | 41.7 | 94.1 | 50.0 | 78.0 | 67.0 |

This further strengthens the response (Point 3) to Q5–Q6. Different from prior works that primarily focus on visual tasks, OVR places equal emphasis on language reasoning, which we consider fundamental to developing robust and general intelligence, and a necessary foundation for effective multimodal reasoning. We also respect the contributions of concurrent efforts such as VL-Rethinker, which explore reasoning from other distinct perspectives.
We thank the reviewer again for the opportunity to further clarify and present our complete results, and apologize that the previous response was compressed due to the experimental workload within the limited time. We sincerely hope these clarifications help strengthen your confidence in our submission. Please feel free to discuss if you have any other questions or suggestions.
Thanks for the reproduced experimental results on the vanilla Qwen2.5-VL-7B model and the more extensive comparison with other baselines on language reasoning benchmarks; this indeed solves my concerns and increases my confidence in the results and empirical setup. I will reflect this by increasing the confidence of my review.
At this point, I have no further questions for the authors; I suggest them to incorporate all the feedback received so far into the submission.
Overall, I sincerely appreciate the effort that the authors made to provide new experiments and clarify their contribution. While my judgment is now leaning more towards acceptance, I would still like to keep my score because of some inherent concerns about the work (e.g., the effective relevance of cognitive abilities in reasoning, also mentioned by reviewer SEpa).
Dear Reviewer bA1c,
Thank you again for your time, efforts, and thoughtful feedback in reviewing our paper. We truly appreciate your recognition of our work and your constructive suggestions.
We would like to share that Reviewer SEpa has confirmed that the concerns regarding the relevance of cognitive behaviors to reasoning have been addressed, and has kindly expressed stronger support for our paper.
Given that your remaining concern was closely aligned with Reviewer SEpa’s, we would be grateful to hear if our clarifications and additional analyses have also helped address your concerns. If you have any remaining suggestions or questions, we’d be more than happy to engage further.
Thank you again for your time and constructive feedback!
Best,
The Authors
(The following response provides an extended discussion beyond the original question, building upon the points outlined above.)
2. Does "more" cognition necessarily lead to better reasoning performance within the stage? (Further discussion inspired by Reviewer SEpa's Q3)
While the stage-wise results exhibit a clear positive correlation between behavior frequency and reasoning performance, we conduct a more rigorous intra-stage evaluation to faithfully address the nuance. Our analysis shows that within each training stage, behavior quantity does not directly correlate with performance gains.
This observation is described as follows:
- In cold start, visual re-inspection frequency rises sharply at first, then plateaus, while performance continues to improve with fluctuations.
- In multimodal RL, behavior frequency drops initially, then rises again, ultimately surpassing prior levels. The reasoning performance increases with fluctuations.
While this initially appears counterintuitive, we provide further analysis to explain the underlying mechanisms:
(1) It’s not just quantity but also effectiveness that matters for behaviors
- Not all cognitive behaviors contribute equally to reasoning improvement. Some are effective, such as insightful re-evaluations that help correct mistakes; others are ineffective, like aimless repetitions.
- For instance, a model might generate statements like “Wait, maybe the answer is wrong…” and re-inspect the image, yet extract nothing useful. This distinction highlights that behavior effectiveness, rather than raw frequency, is the critical factor. This is also consistent with human cognitive intuition.
(2) Extended discussion: RL critically discerns and amplifies effective behaviors
- Cold start introduces a broad range of behaviors, transferring abstract reasoning patterns from language to vision. RL then acts as a filtering and amplifying mechanism, suppressing ineffective behaviors and selectively scaling up the most useful patterns (the crucial tokens analyzed in [2]).
To address this concern seriously, we will revise the paper to:
(1) Extend and clarify the analysis of the correlation between cognitive behaviors and reasoning ability, particularly in the Introduction and Section 6;
(2) Refine the wording regarding the role of cognitive behaviors in reasoning to avoid any vagueness or potential misunderstandings.
We thank the reviewer again for encouraging us to further clarify this point. Additionally, sometimes non-intuitive observations offer the greatest insights and deserve careful exploration. Thank you very much for your time and patience. Please feel free to share any further suggestions if you have.
Thank you for the deep and extensive answer, I really appreciate the time and effort spent trying to address my comments.
However, I am still not completely sure about it. To my understanding, your experiments show correlation, not causation, between cognitive behaviors and reasoning accuracy (and not a completely consistent one, because of the difference between quantity and quality, with which I agree). How can you support the claim that there is a causal relationship between the emergence of cognitive abilities and reasoning performance? Could they not be independent phenomena that have the same causes (cold start + multimodal RL)?
I agree with the reviewer bA1c that one should be careful to claim a causal relationship between cognitive behaviors and reasoning accuracy. I think the clarification about the positive correlation and the difference between quantity and quality are indeed helpful on improving the clarity of the paper.
That said, the authors might want to be careful with the phrasing of statements like "Cognitive behavior transfer from language to vision is necessary to enable stronger multimodal reasoning performance" (response in Part I) to avoid any misunderstanding.
Thank you for sharing your thoughts!
Reviewer SEpa
We truly appreciate that our response helped resolve your concerns. Thank you so much for raising your confidence and we will definitely incorporate your valuable feedback into the revised paper to improve clarity.
Given the limited space and time during the initial rebuttal, some potential misunderstandings may remain regarding the relationship between cognitive behaviors and reasoning ability. As part of our commitment to a rigorous and faithful explanation, we offer a more detailed clarification here.
Please allow us to first clarify that
- Cognitive behavior transfer indeed contributes to stronger reasoning performance. Actually, Reviewer SEpa’s Q3 primarily focused on the role of behavior quantity. This is a valuable perspective, though it's equally important to consider behavior effectiveness.
- Our original response addressed this while also including several extended discussions (which, while not directly asked, we found valuable to share). We guess it's this mix that has caused some confusion.
To ensure clarity, we'd like to break down the original response for Reviewer SEpa Q3 more explicitly into two components:
-
First, we confirmed the positive contribution of cognitive behavior transfer to stronger reasoning performance, as evidenced by clear stage-wise ablations. This explains the relationship between behavior transfer and performance, and also provides a general response to the question focusing on the role of behavior quantity.
-
Second (extended discussion for deeper insight), we offered a more nuanced discussion about:
(1) whether "more" cognitive behaviors necessarily lead to better reasoning within individual training stages (a more precise exploration)
(2) an extended analysis inspired by the question, focusing on intra-stage behavioral trends. While the latter is not directly asked, we believe it provides valuable perspective on the dynamics of behavior emergence and effectiveness during training.
Here is a brief and clear summary of the conclusions for your quick reference:
- (Main) Cognitive behavior transfer from language to vision is necessary to enable stronger multimodal reasoning performance, as demonstrated by the stage-wise improvements in ablations. In OVR, we facilitate this transfer and validate it as a practical and effective approach for achieving superior reasoning performance on both language and vision benchmarks.
- (Extension) However, a higher quantity of cognitive behaviors does not necessarily lead to better reasoning performance. Rather than focusing solely on behavior frequency, we emphasize the importance of behavior effectiveness—i.e., whether the behaviors meaningfully contribute to solving the task.
We elaborate on the main conclusion in detail below:
1. Cognitive behavior transfer can indeed support stronger reasoning performance
Our results in Table 3 and Table 5 clearly demonstrate that cognitive behavior transfer contributes meaningfully to enhanced reasoning abilities:
- Zero-shot emergence: During the Cold Start stage, we observe a substantial improvement in visual reasoning performance (e.g., MathVision: 25.5 → 46.2; MME-R: 608.2 → 685). Simultaneously, visual-specific behaviors such as visual re-inspection emerge spontaneously in a zero-shot manner (from 0.0 to 2.0), indicating that structured reasoning patterns acquired in language can be transferred to the visual modality. This emergence, from absence to presence, highlights how behavior transfer directly facilitates improvements in visual reasoning.
- Progressive improvement across stages: When performance saturates at the current stage, continuing training in the next stage further improves both behavior frequency (0.0 → 2.0 → 2.4) and reasoning performance (MathVision 25.5 → 46.2 → 47.5). This confirms the key point from our original response: visual cognitive behaviors are a necessary condition for achieving strong visual reasoning capabilities.
We are truly grateful to both Reviewer bA1c and Reviewer SEpa for your thoughtful engagement and academic rigor in discussing this relationship.
-
We agree with Reviewer bA1c that our experiments primarily focus on correlation, not strict causation, and we sincerely apologize for any imprecise wording in our previous discussion. Following Reviewer SEpa's kind suggestion, we will be more careful with phrasing (e.g., replacing “necessary to enable” with “consistently associated with”) to avoid any potential misunderstanding.
-
To clarify our planned revisions to the paper:
- The analyses in both the main paper and rebuttal are aimed at offering correlational insights into the relationship between cognitive behaviors and stronger multimodal reasoning—specifically, which types of behavior emerge, when they emerge, and how they correlate with reasoning capabilities.
- This line of analysis offers meaningful visual reasoning patterns to the community and may inspire further research into uncovering other key mechanisms underlying advanced MLLM reasoning.
We genuinely appreciate that reviewers' feedback helps deepen and strengthen the rigor of this important contribution.
Extended Lightweight Discussion on Causality
The reviewers’ feedback has helped us clarify that our analyses focus on correlation. Additionally, we genuinely regard Reviewer bA1c’s comment on causality as a valuable direction for future exploration. Below, we share a few ideas that could help investigate this in future work:
This question can be precisely formulated as a causal inference problem: Is it (1) Behavior → Reasoning, or (2) Behavior ← Training → Reasoning?
To disentangle these possibilities, we find the following explorations promising (a rough sketch of the first idea is given after this list):
- Intervention and Counterfactual Analysis: We can simulate a "do(B)" intervention by manually inserting or removing cognitive behaviors and evaluating whether this leads to a significant change in reasoning performance.
- Behavior Ablation Studies: By training models on de-behaviorized reasoning traces, we can compare them to models trained with full behaviors to assess the necessity of such behaviors for reasoning capabilities. A potential challenge here lies in controlling for linguistic consistency.
- Behavior Induction via Reward Shaping: Additionally, we can attempt to induce behaviors (e.g., reward injection) and measure whether they lead to stronger reasoning, which would further verify their causal role.
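As a rough, purely illustrative sketch of the first idea (our own simplification, not part of the paper), one could simulate a do(B)-style intervention by editing behavior-marker sentences out of a reasoning-trace prefix and letting the model regenerate from the edited prefix; the marker phrases and the evaluation hooks below are hypothetical placeholders.

```python
import re

# Hypothetical surface markers of cognitive behaviors (re-inspection, cross-checking).
BEHAVIOR_MARKERS = ("wait,", "let me look at the image again", "let me double-check")

def remove_behavior_sentences(trace_prefix: str) -> str:
    """Counterfactual prefix: drop sentences that contain behavior markers."""
    sentences = re.split(r"(?<=[.!?])\s+", trace_prefix)
    kept = [s for s in sentences if not any(m in s.lower() for m in BEHAVIOR_MARKERS)]
    return " ".join(kept)

# Intervention loop (pseudo-usage): `generate_from_prefix` and `is_correct` are
# placeholders for the model's continuation and the answer checker. A significant
# accuracy gap between original and edited prefixes, holding everything else fixed,
# would support a causal role for the behaviors beyond mere correlation.
```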
We sincerely thank Reviewer bA1c and SEpa again. We deeply respect your dedication and the time you’ve devoted to offering such valuable suggestions, which help make our work more comprehensive, in-depth, and rigorous.
This discussion became truly meaningful because of your valuable efforts.
This paper presents an interesting observation that the cognition behaviors trained from text-only cold start can transfer to vision reasoning tasks. Based on the finding, the authors design a three-stage training pipeline on cross-modality cognitive behavior training. The resulting model demonstrates strong reasoning ability across modalities. An in-depth analysis is also conducted to improve the understanding of multi-modal reasoning.
优缺点分析
Strengths:
- This paper is clearly written and well-motivated
- The observation about cognitive behavior transfer is interesting
- The resulting model demonstrates strong performance on reasoning problems in both modalities
- An in-depth analysis is provided about the cognition behavior transfer
Weakness:
- The training framework is not well-explained. In Section 4.1, the authors propose a three-stage training, including language-only cold start, language-only RL and multimodal RL. It's unclear why each stage is necessary, as we know that both "language-only cold start" and "language-only RL" incentivize the cognitive behaviors in reasoning.
- Lack of ablations/baselines. Building upon point 1, more ablation studies are needed to understand the training framework. For example, how much is the average gain on vision reasoning tasks if we compare (stage 1 + stage 2 + stage 3) with (stage 2 + stage 3) or stage 2/3 only training? Also, it would be interesting to see from empirical results whether stage 3 could enhance LLM reasoning or not.
- Lack of discussion and analysis. This paper contains a lot of discussion on cognition behaviors but lacks discussion on the relationship between cognition behaviors and reasoning.
Questions
- What will be the performance if we train the model only with stage 2/3 or stage 2+3. Are all the 3 stages necessary?
- In Table 2, many results in the "RL-based methods" section are missing. What's the average gain of the model on vision-reasoning tasks and text-reasoning tasks, respectively?
- What's the relationship between cognition behaviors and reasoning? Does more cognition necessarily lead to better reasoning performance?
Limitations
yes
Final Justification
My concerns have been addressed by the authors' response and I believe this paper will be valuable to the community. Therefore, I'm overall positive to this paper.
Formatting Issues
No
Thank you very much for the constructive and encouraging feedback. We appreciate your recognition of the clarity of our writing, the significance of our cognitive behavior transfer findings, and the value of our analysis. We carefully address each of your concerns as follows.
Q1: Necessity of the Three-stage Training & Supporting Ablations
We conduct detailed ablations on RL training stages. As shown below and in Table 3 and Table 5, the results highlight the critical role of each component. Key observations include:
- Our two-stage RL training pipeline outperforms single-stage or mixed-stage RL settings, achieving the best average results on both language and multimodal reasoning tasks.
- Stage 3 further enhances both language and multimodal reasoning, while also promoting the emergence of more visual cognitive behaviors as shown in Table 3.
| | AIME2024 | MATH500 | MathVista | MathVision |
|---|---|---|---|---|
| Language Cold Start | 54.4 | 93.1 | 69.7 | 46.2 |
| +Language RL | 57.9 | 93.6 | 71.6 | 47.5 |
| +Multimodal RL | 55.0 | 92.9 | 72.0 | 46.5 |
| +Mixed RL | 58.0 | 94.4 | 72.4 | 49.0 |
| +Language RL & Multimodal RL | 58.8 | 94.1 | 72.9 | 50.0 |
1. The experimental results:
- Our proposed training pipeline, incorporating both Stage 2 and Stage 3, outperforms the mixed RL setting, yielding improvements on vision reasoning tasks such as MathVista (+0.5) and MathVision (+1.0).
- Each stage contributes positively across multiple reasoning benchmarks, validating the necessity of our training approach.
- Performing Stage 3 after Stage 2 not only enhances visual reasoning and elicits more visual cognitive behaviors (as shown in Table 3), but also further boosts language reasoning, with gains on AIME2024 (+0.9) and MATH500 (+0.5).
2. Discussion on Two-stage RL vs Mixed RL:
- Beyond performance, separating the stages is also more practical from a training efficiency perspective: language RL involves longer and more complex responses, requiring longer training time; multimodal RL, while not trivial, requires shorter response length and exhibits less scaling complexity (as discussed in line 298-303), resulting in faster convergence. Notably, recent works such as NemoTron-1.1 [1] have adopted similar staged training strategies, achieving promising results.
- Discussion: While mixed RL is a valid alternative, it introduces greater training cost without clear advantages. Further scaling with prolonged RL has the potential to amplify the differences between mixed and separate RL stages and lead to stronger results, though it would require significantly more compute resources.
[1] AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy
Q2: Comparison to RL-Based Methods
We complete the evaluations for representative RL-based baselines in the table below. Our method achieves average gains of 6.82 on vision reasoning tasks over these RL-based baselines and 25.95 on text reasoning tasks over the base model, demonstrating its effectiveness across modalities.
| Model | MathVista | MathVision | MathVerse | DynaMath | WeMath | LogicVista | MMMU-Pro | CharXiv-reas | CharXiv-desc | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| VLAA-Thinker-Qwen2.5-7B | 68.0 | 26.4 | 48.2 | 22.4 | 41.5 | 48.5 | 33.20 | 36.80 | 63.92 | 43.21 |
| OpenVLThinker-7B | 70.2 | 25.3 | 47.9 | 19.36 | 36.86 | 46.09 | 36.97 | 38.30 | 63.45 | 42.71 |
| MM-Eureka-7B | 73.0 | 26.9 | 50.3 | 24.95 | 38.1 | 48.99 | 38.42 | 39.60 | 65.30 | 45.06 |
| OVR-7B | 72.9 | 50.0 | 52.9 | 32.7 | 38.2 | 55.0 | 37.8 | 43.4 | 75.6 | 51.88 |
Q3: Relationship between Behaviors and Reasoning Performance
We analyze both stage-wise and intra-stage correlations between cognitive behavior frequency and reasoning performance to address this question. The key conclusions include:
- Cognitive behavior is a necessary but not sufficient condition for strong reasoning abilities.
- Higher behavior frequency does not necessarily imply better reasoning, as performance also depends on the task alignment and effectiveness of the emerging behaviors.
The detailed observations and analysis are as follows:
1. Strong stage-wise correlation
- Observation: Across training stages, both the visual re-inspection behavior and reasoning performance increase progressively. The behavior frequency grows from 0.0 → 2.0 → 2.4, while MathVision performance improves from 25.5 → 46.2 → 47.5.
- Analysis: This stage-wise trend suggests a strong positive correlation between cognitive behavior frequency and reasoning capabilities. However, correlation does not imply direct causation. We further analyze intra-stage performance as follows.
2. Intra-stage divergence highlights behavior effectiveness
- Observation:
- During cold start, behavior frequency increases rapidly at first and then plateaus with fluctuations (e.g., visual re-inspection behavior: 0.3 → 2.2 → 2.0).
- In Multimodal RL, we observe an initial drop in behavior frequency, followed by a gradual recovery (2.4 → 0.5 → 2.5), yet reasoning performance continues to climb throughout.
- Analysis: This divergence reveals that a greater number of behaviors does not necessarily lead to better reasoning. In early Cold Start, the model generates a wide range of exploratory behaviors, regardless of whether they are redundant or effective. As training progresses, we observe that behavior frequency plateaus in late cold start and drops at the start of RL, yet performance continues to improve. This indicates that gains are driven more by behavior effectiveness than by behavior quantity.
- Further discussion: In the RL stage, the training initially reshapes the model's behaviors to maximize rewards and suppresses ineffective reasoning paths, leading to a temporary drop in behavior frequency. However, the recovery shows RL critically discerns and scales up effective patterns (the crucial tokens analyzed in [2]). This aligns with the insight that SFT memorizes while RL generalizes [3].
We will add the correlation analysis in the revision.
[2] Reasoning with exploration: An entropy perspective
[3] SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
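To make the stage-wise correlation in point 1 concrete, the following sketch computes the Pearson correlation between the reported re-inspection frequencies and MathVision scores over the three stages (an illustrative calculation on three points only, not a substitute for the full analysis).

```python
import numpy as np

# Stage-wise numbers reported above: Baseline -> +Cold Start -> +RL
behavior_freq = np.array([0.0, 2.0, 2.4])      # visual re-inspection frequency
mathvision_acc = np.array([25.5, 46.2, 47.5])  # MathVision accuracy (%)

r = np.corrcoef(behavior_freq, mathvision_acc)[0, 1]
print(f"Pearson r = {r:.3f}")  # ~0.99 on these three stage-wise points
```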
Thank you for the detailed response and extensive experiments.
-
Q1, Q2: I appreciate the authors effort on clarifying the effect of each training stage of the proposed algorithm. The additional experiments are very helpful on understanding the algorithm and have addressed my concerns on this part.
-
Q3: I appreciate the authors' explanation of the relationship between cognition behavior and reasoning ability. In my opinion, the original manuscript has overemphasized the role of cognition behavior in reasoning. While I acknowledge that the observation of cognition behavior transfer is a valuable contribution, it does not necessarily lead to stronger reasoning ability. Therefore, further clarification and elaboration on this point would improve the overall clarity and strengthen the paper.
We sincerely appreciate the reviewer’s constructive suggestions for improving the clarity and precision of our writing, and we are encouraged that our previous response helped address your first two concerns.
Given the limited space and time during the initial rebuttal, some potential misunderstandings may remain regarding the relationship between cognitive behaviors and reasoning ability. As part of our commitment to a rigorous and faithful explanation, we would like to offer a more detailed clarification here.
Please allow us to first clarify that:
- Cognitive behavior transfer indeed contributes to stronger reasoning performance. However, Q3 primarily focused on the role of behavior quantity. This is a valuable perspective, though it's equally important to consider behavior effectiveness.
- Our original response addressed this while also including several extended discussions (which, while not directly asked, we found valuable to share). We guess it's this mix that has caused some confusion.
To ensure clarity, we explain that our original response is actually structured around two key points:
-
First, we confirmed the positive contribution of cognitive behavior transfer to stronger reasoning performance, providing a general response to Q3 regarding the role of behavior quantity.
-
Second, we offered a more nuanced extended discussion:
(1) addressing whether "more" cognitive behaviors necessarily lead to better reasoning within individual training stages (a more precise extended exploration to Q3)
(2) a further analysis inspired by Q3, focusing on intra-stage behavioral trends. While the latter is not directly asked in Q3, we believe it provides valuable perspective on the dynamics of behavior emergence and effectiveness during training.
Here is a brief and clear summary of the conclusions for your quick reference:
- (Main) Cognitive behavior transfer from language to vision is necessary to enable stronger multimodal reasoning performance, as demonstrated by the stage-wise improvements in ablations. In OVR, we facilitate this transfer and validate it as a practical and effective approach for achieving superior reasoning performance on both language and vision benchmarks.
- (Extension) However, a higher quantity of cognitive behaviors does not necessarily lead to better reasoning performance. Rather than focusing solely on behavior frequency, we emphasize the importance of behavior effectiveness—i.e., whether the behaviors meaningfully contribute to solving the task.
We elaborate on the main conclusion in detail below:
1. Cognitive behavior transfer can indeed support stronger reasoning performance
Our results in Table 3 and Table 5 clearly demonstrate that cognitive behavior transfer contributes meaningfully to enhanced reasoning abilities:
- Zero-shot emergence: During the Cold Start stage, we observe a substantial improvement in visual reasoning performance (e.g., MathVision: 25.5 → 46.2; MME-R: 608.2 → 685). Simultaneously, visual-specific behaviors such as visual re-inspection emerge spontaneously in a zero-shot manner (from 0.0 to 2.0), indicating that structured reasoning patterns acquired in language can be transferred to the visual modality. This emergence, from absence to presence, highlights how behavior transfer directly facilitates improvements in visual reasoning.
- Progressive improvement across stages: When performance saturates at the current stage, continuing training in the next stage further improves both behavior frequency (0.0 → 2.0 → 2.4) and reasoning performance (MathVision 25.5 → 46.2 → 47.5). This confirms the key point from our Q3 response: visual cognitive behaviors are a necessary condition for achieving strong visual reasoning capabilities.
(The following response provides an extended discussion related to this question, building upon the points outlined above.)
2. Does "more" cognition necessarily lead to better reasoning performance within the stage? (Further discussion inspired by Q3)
While the stage-wise results exhibit a clear positive correlation between behavior frequency and reasoning performance, we conduct a more rigorous intra-stage evaluation to faithfully address the nuance. Our analysis shows that within each training stage, behavior quantity does not directly correlate with performance gains.
This observation is described as follows:
- In cold start, visual re-inspection frequency rises sharply at first, then plateaus, while performance continues to improve with fluctuations.
- In multimodal RL, behavior frequency drops initially, then rises again, ultimately surpassing prior levels. The reasoning performance increases with fluctuations.
While this initially appears counterintuitive, we provide further analysis to explain the underlying mechanisms:
(1) It’s not just quantity but also effectiveness that matters for behaviors
- Not all cognitive behaviors contribute equally to reasoning improvement. Some are effective, such as insightful re-evaluations that help correct mistakes; others are ineffective, like aimless repetitions.
- For instance, a model might generate statements like “Wait, maybe the answer is wrong…” and re-inspect the image, yet extract nothing useful. This distinction highlights that behavior effectiveness, rather than raw frequency, is the critical factor. This is also consistent with human cognitive intuition.
(2) Extended discussion: RL critically discerns and amplifies effective behaviors
- Cold start introduces a broad range of behaviors, transferring abstract reasoning patterns from language to vision. RL then acts as a filtering and amplifying mechanism, suppressing ineffective behaviors and selectively scaling up the most useful patterns (the crucial tokens analyzed in [2]).
To address your concern seriously, we will revise the paper to:
(1) Extend and clarify the analysis of the correlation between cognitive behaviors and reasoning ability, particularly in the Introduction and Section 6;
(2) Refine the wording regarding the role of cognitive behaviors in reasoning to avoid any vagueness or potential misunderstandings.
We thank the reviewer again for encouraging us to refine this point more precisely. Additionally, sometimes non-intuitive observations offer the greatest insights and deserve careful exploration. We sincerely hope that these clarifications help strengthen your confidence for the acceptance of our work. Please feel free to share any further suggestions.
I appreciate the authors' comprehensive and faithful analysis of the relationship between cognition behavior and reasoning performance, which will strengthen the discussion in Section 6. The explanation has addressed my concerns on Q3 and improved my confidence in the paper. I'll improve my score to 5.
We sincerely thank the reviewer for patiently reading our analysis and kindly raising your score. We’re truly delighted that our response helped address your concerns. We will carefully incorporate the analysis and valuable perspective you provided into the paper. Thank you again for your thoughtful feedback and generous support!
It proposes a 3-stage training pipeline for multi-modal reasoning, where cold start training for linguistic reasoning happens first, followed by RL for linguistic reasoning and finally multi-modal reasoning.
Strengths and Weaknesses
Strengths:
- Comprehensive evaluations
- Strong performance compared to the baselines
- Interesting research angle: it investigates an interesting question, even though the results are not sufficient to support the general claims made in this paper
Weaknesses:
- A big part of this submission seems to be a study contribution, but it lacks comparison to other training pipelines e.g. (1) Language cold start + multi-modal RL, (2) language cold start + multi-modal cold start + multi-modal RL; and additional settings that could be interesting to compare to e.g. (3) multi-modal cold-start + multi-modal RL; (4) language cold-start only with more training data; (5) multi-modal cold-start only with more training data
- The reviewer is curious what the emergence of cognitive behaviors looks like for the settings above?
- According to this paper, most transfer happens after the cold start training – is it because of the relative size of the data, the training method or something else?
- The paper describes MMVet as a perception-centric benchmark, but it also includes many questions, e.g. in the math split, that are reasoning-heavy. As for perception-centric benchmarks, what about performance on CV Bench and BLINK?
- Lack of generalization: Are the findings generalizable to other models, other data sizes (e.g. larger data for stage2 or stage3 RL)?
Minor: In the reward design, using exact matches for reward=1 seems a bit strict? Have the authors considered alternative designs?
Questions
- According to this paper, most transfer happens after the cold start training – is it because of the relative size of the data, the training method or something else?
- The paper describes MMVet as a perception-centric benchmark, but it also includes many questions, e.g. in the math split, that are reasoning-heavy. As for perception-centric benchmarks, what about performance on CV Bench and BLINK?
- Are the findings generalizable to other models, other data sizes (e.g. larger data for stage2 or stage3 RL)?
Limitations
Yes
Final Justification
Thank you authors for updating the results with more ablations and benchmarks! They address most of my questions/concerns, so I'm updating my rating to 4.
However, the following questions remain:
- The authors mentioned that multi-modal cold start data hurts performance due to its lower quality: can the authors provide a more in-depth analysis on what exactly leads to the low quality? Also, it's mentioned that multi-modal and language-only cold start data have conflicting formats -- would the performance improve in this mixed setting if the formats are unified or if these data are used in separate stages?
- The authors mentioned the results hold on a larger Qwen VL model, what about models with a different language (e.g. llama) and vision backbone?
It'd be great if the authors could address these in the final version.
Formatting Issues
N/A
We thank the reviewer for the constructive feedback and recognition of our paper’s strong performance and interesting research angle. We carefully address each of your concerns below.
Q1: More Ablations
We conducted comprehensive ablations along three dimensions: (1) different cold-start data strategies, (2) RL training schedule, and (3) enhanced cold start with self-distilled multimodal data.
The results in the table are summarized as follows:
- Adding open-source multimodal cold-start data or applying mixed RL stages does not outperform our proposed training pipeline.
- Introducing self-distilled data from OVR into cold start leads to improved performance.
| | AIME2024 | MATH500 | MathVista | MathVision |
|---|---|---|---|---|
| Baseline | 6.67 | 67.4 | 69.2 | 25.5 |
| Part I: Cold start | | | | |
| Lang cold start | 54.4 | 93.1 | 69.7 | 46.2 |
| MM cold start | 23.1 | 75.2 | 71.6 | 33.5 |
| Lang cold start + MM cold start | 51.0 | 90.3 | 70.4 | 40.1 |
| Part II: RL on language cold start | | | | |
| MM RL | 55.0 | 92.9 | 72.0 | 46.5 |
| Lang RL | 57.9 | 93.6 | 71.6 | 47.5 |
| Mixed RL | 58.0 | 94.4 | 72.4 | 49.0 |
| Part III: Cold start with extra data | | | | |
| Lang cold start + self-distilled mm data | 59.1 | 93.9 | 72.7 | 50.5 |
| Current OVR | | | | |
| Lang cold start + Lang RL + MM RL | 58.8 | 94.1 | 72.9 | 50.0 |
The ablation details are shown below:
1. Ablations on Cold Start (Part I)
- Settings: We experiment with training on multimodal cold-start data either independently or in combination with language cold-start data. The multimodal cold-start datasets contain publicly available sources such as R1-OneVision [1] and Vision-R1 [2], which are constructed by either synthesizing image-text pairs using image descriptions and language reasoning models or rendering language cold-start data into multimodal form.
- Results & Analysis:
- Training with multimodal cold-start data alone leads to lower performance on both language and multimodal reasoning tasks, particularly in language reasoning (e.g., AIME2024: 23.1 vs. 54.4). This reveals the lack of high-quality multimodal cold-start datasets. The much better performance of the language cold start on MathVision also verifies that language reasoning skills are fundamental for advanced multimodal reasoning, and demonstrates the robustness of our data curation pipeline.
- Adding multimodal cold-start data on top of language cold-start training reduces performance. This arises from conflicting reasoning patterns: multimodal cold-start data has obvious differences in style, length and structure compared to the language-only version. Nevertheless, this challenge can be addressed through large-scale RL and iterative self-distillation.
- Discussion: Constructing effective multimodal cold-start data remains a significant challenge. Unlike the language domain, the multimodal community lacks high-quality, open-source reasoning teachers like DeepSeek-R1, and most closed-source models hide explicit reasoning traces. Our method strategically avoids this limitation by leveraging a strong language cold-start phase followed by multimodal RL, resulting in a robust and scalable multimodal reasoner.
2. Ablations for RL stages (Part II)
- Settings: Based on the language cold start, we ablate three RL strategies: (1) language-only RL, (2) multimodal-only RL, and (3) a one-stage mixed RL that combines both.
- Results & Reasons: The two-stage RL setup consistently yields the best performance across reasoning benchmarks. From a training efficiency perspective, separating the stages is also more practical: language RL involves longer and more complex responses, requiring longer training time; multimodal RL, while not trivial, requires shorter response length and exhibits less scaling complexity (as discussed in line 298-303), resulting in faster convergence. Notably, recent works such as NemoTron-1.1 [3] have adopted similar staged training strategies, achieving promising results.
- Discussion: While mixed RL is a valid alternative, it introduces greater training cost without clear advantages. Further scaling with prolonged RL has the potential to amplify the differences between mixed and separate RL stages and leads to stronger results, though it would require significantly more computation resources.
3. Enhancing Cold Start with Self-Distilled Multimodal Data (Part III)
- Setting: To address the reviewer’s suggestion on “(5) multimodal cold start with more data,” we augment the cold start with self-distilled multimodal data generated by our own OVR model. As discussed in Point 1, the community currently lacks high-quality multimodal cold-start data, and constructing it remains challenging. Thus, we leverage the model’s own capabilities to distill high-quality reasoning traces, and train with a combination of language-only cold-start data and self-distilled multimodal samples.
- Results & Analysis: Incorporating self-distilled data during cold start leads to better performance in both language and multimodal reasoning tasks, faster convergence, and lower training loss. These results demonstrate the effectiveness of self-distillation [4] in bootstrapping multimodal reasoning. Moreover, they point to a promising direction: in the absence of strong open-source multimodal teacher models, iterative RL and self-distillation can serve as a viable and scalable strategy for progressively enhancing model capabilities.
4. Cognitive Behaviors on ablations: Given that behaviors primarily emerge during cold start, we analyze their development across the three cold start variants. Consistent with final performance metrics, the multimodal cold start does not yield an increase in either textual or visual cognitive behaviors. We attribute this to the following factors:
- Multimodal cold-start data still suffers from limited quality. Its synthetic nature fails to introduce new reasoning patterns and can even hurt performance.
- Visual behaviors tend to emerge in the early steps of cold start and plateau thereafter (visual re-inspection: 0.3 → 2.2 → 2.0 from iter 0 → iter 200 → iter 700 in OVR). Since multimodal cold start data is derived from language-only sources, the underlying patterns remain unchanged. As a result, simply adding more data, regardless of modality, does not further enhance the emergence of cognitive behaviors.
[1] R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization
[2] Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
[3] AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy
[4] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Q2: Cause of Transfer after Cold Start
Our analysis suggests that the transfer is primarily driven by the nature of cold start data, rather than data volume.
1. Transfer originates from linguistic “mental imagery”: The linguistic data distilled from DeepSeek-R1 often include internal visualizations (which somewhat reveal its mental imagery) to support mathematical reasoning, often articulated through phrases such as “let me visualize/sketch...”. Once this linguistic scaffolding was introduced into the MLLM, these patterns were rapidly grounded in actual visual input for handling multimodal reasoning tasks.
2. Emergence happens early and is not driven by data size: We track the emergence of visual behaviors across different iterations within the cold start. Specifically, we observe that a significant portion of visual behavior emerges in the early steps and then fluctuates rather than steadily increasing (0.3 → 2.2 → 2.0 from iter 0 → iter 200 → iter 700). This suggests that the effect is not proportional to data size, but instead reflects early inductive bias amplification.
Q3: More Perception-centric Evaluations
We evaluate on more perception benchmarks. OVR achieves consistent improvements over the baseline, as shown in the table below.
| Model | BLINK | HallusionBench | MMbench |
|---|---|---|---|
| Baseline | 53.71 | 49.43 | 85.84 |
| OVR-7B | 54.13 | 54.90 | 86.42 |
Q4: Generalization across Models and Data Sizes
Our extended experiments show that the core finding—cognitive behaviors learned in language can be transferred to vision for advanced visual reasoning—generalizes well across both larger model scales and more data with broader task coverage.
1. Model Scaling: Our training on an in-house MLLM built upon Qwen2.5-32B confirms strong scalability, boosting AIME 2024 performance from 13 → 73 and MathVision from 35 → 60.
2. Task & Data Scaling: Rather than simply increasing data volume, we also emphasized task diversity and data quality. By adding just ~10k high-purity STEM samples to the RL training, we achieve a +2% increase on MathVision.
These results indicate that the benefits of behavior transfer not only generalize but also amplify with larger model capacity and curated data, reinforcing the robustness and scalability of our approach.
Q5: Reward Design
Using exact match as the reward function is well-suited to the nature of the studied tasks and offers robustness to reward hacking.
First, the tasks we focus on include multimodal math, logic, and VQA tasks, which typically have objectively correct answers. Exact match provides the most direct and reliable correctness signal.
Second, from a broader methodological perspective, exact match is a widely adopted reward design in reasoning tasks. It offers an unambiguous reward signal, thereby effectively mitigating reward hacking. This choice is consistent with mainstream methods such as Open-Reasoner-Zero [5].
While we agree that reward design is an important area for future exploration, it is not the focus of this work. In future studies, we plan to investigate hybrid reward strategies that integrate model-based feedback and rule-based signals to further enhance robustness.
[5] Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
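To illustrate this design, below is a minimal sketch of a binary exact-match reward (our own simplification, assuming final answers are wrapped in \boxed{...}; the actual implementation and answer normalization may differ).

```python
import re
from typing import Optional

def extract_final_answer(response: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in a model response, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None

def exact_match_reward(response: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the extracted answer exactly matches the label, else 0.0."""
    answer = extract_final_answer(response)
    if answer is None:
        return 0.0
    # Light normalization only; stricter or looser matching rules are possible.
    return 1.0 if answer.lower() == ground_truth.strip().lower() else 0.0

print(exact_match_reward(r"... so the area is \boxed{42}", "42"))  # 1.0
print(exact_match_reward(r"... the answer is \boxed{41}", "42"))   # 0.0
```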
Dear Reviewer 9Ror,
We sincerely appreciate the time and great efforts you have dedicated to reviewing our work.
If possible, we would be grateful for the opportunity to further engage in discussion and hear your thoughts. We would love to know whether our experiments and explanations have addressed your concerns, or if you have any additional suggestions for improvement. Thank you!
Best,
The Authors
This paper investigates the transference of cognitive behaviors from language-only training for vision language models. The authors propose that reasoning skills learned from language-only training can significantly improve a model's performance on visual reasoning tasks. They introduce a three-stage training pipeline to facilitate and study this transfer. The resulting model, named Open-Vision-Reasoner (OVR), is a 7B parameter model based on Qwen2.5-VL.
Strengths and Weaknesses
Strengths:
- The central hypothesis—that core, abstract reasoning skills are modality-agnostic and can be transferred from language to vision—is compelling and aligns with theories of human cognition. The initial experiment showing that a language-only "cold start" boosts visual reasoning performance (Figure 1) is a strong and surprising finding that provides excellent motivation for the work.
- The proposed three-stage training pipeline is logical and well-structured. It allows for a systematic investigation of the central hypothesis and enables a clear ablative analysis (Table 5) to demonstrate the contribution of each stage. This methodical approach adds credibility to the findings.
- The paper demonstrates impressive results across a wide range of challenging language and visual reasoning benchmarks. OVR significantly outperforms other open-source 7B models on tasks like MathVision (+15.1%) and LogicVista (+6.5%), providing concrete evidence that the proposed training strategy is effective.
- The authors go beyond simply reporting benchmark scores. The analysis of both linguistic (Table 4) and visual-specific (Table 3) cognitive behaviors provides valuable insight into the model's internal changes. The attempt to quantify the "Behavior Transfer Rate" (BTR) is a novel contribution towards understanding the mechanisms of cross-modal generalization.
Weaknesses:
- The paper provides no validation for the reliability or accuracy of GPT-4o-mini as a classifier for these nuanced behaviors. How often does it produce false positives or negatives? Without this, all quantitative claims about behavior frequency (e.g., in Tables 3 and 4) are built on an unverified foundation.
- The analysis cannot distinguish between a model that has genuinely learned a new cognitive process (e.g., verification) and a model that has simply learned to verbalize its outputs in a style that mimics verification, because such patterns were present in its training data (distilled from DeepSeek-R1). The model may be a more sophisticated "stochastic parrot" of reasoning steps rather than an actual reasoner. This is a critical distinction that the current analysis fails to address.
- The proposed three-stage training pipeline (SFT -> Language RL -> Multimodal RL) is more of a training recipe than a novel method, and such a pipeline is well known in the community. The individual components are standard: PPO with GAE is a well-established algorithm, and a minimalist binary reward function for correctness is a common choice in reasoning tasks. While the application to study cognitive transfer is interesting, the work does not introduce a new algorithm, architecture, or fundamental learning principle, which limits its contribution for a top-tier venue.
- The paper's own results (Table 5) show a clear and significant trade-off between reasoning and perception. As the model's reasoning score on MME-R increases, its perception score on MME-P and MMVet consistently decreases from the baseline. The authors acknowledge this but frame it as a temporary issue to be addressed in future work. However, for a paper claiming to create a "general-purpose multimodal reasoning" system, sacrificing a fundamental capability like visual perception is a major flaw.
Questions
See weaknesses
Limitations
None
Formatting Issues
None
Thank you for the detailed and insightful feedback. We appreciate the reviewer's recognition of our central hypothesis as "compelling", our methodology as "logical and well-structured", our results as "impressive", and our behavioral analysis as providing "valuable insight". Below, we carefully address each of your concerns and clarify potential misunderstandings.
Q1: Evaluation Reliability
We conducted a human evaluation to validate the reliability of GPT-4o-mini as the behavior classifier, focusing on visual re-inspection.
- The results show high alignment between the model and human predictions.
- Moreover, the human annotations further support the effectiveness of our behavior transfer.
1. Setup: We randomly sampled 100 queries from the MathVision-mini benchmark and collected model outputs from both Qwen2.5-VL-7B and OVR-7B (200 responses in total). These anonymized samples (image, query, response) were shuffled and evaluated on a dedicated annotation platform by five professional annotators with expertise in CV and NLP. Annotators identified whether a given response exhibited visual re-inspection behavior, based on formal definitions and detailed annotation criteria (formatted as prompts in Appendix S-Figure 1).
2. Metrics:
- We first compute the behavior classification accuracy of GPT-4o-mini against the aggregated human annotations.
- Additionally, we compared the behavior frequencies of the two models, using GPT and human annotations respectively.
3. Results:
- GPT Classification Accuracy: Across all annotated cases, GPT-4o-mini achieved an accuracy of 92.42%, showing strong alignment with human judgments. The model produced only 2 false positives and 13 false negatives, indicating a conservative tendency in behavior detection. This suggests that GPT-4o-mini reliably captures the overall trend of behavior emergence, supporting its use as a practical and scalable proxy for behavior classification in our study.
- Behavior Frequency Comparison: The table below shows the behavior frequency of the two models, as measured by GPT and human annotations respectively. Both sources consistently indicate that OVR demonstrates a substantially higher behavior frequency than the baseline, further validating that the behavior was effectively transferred.

| Model | GPT (%) | Human (%) |
|---|---|---|
| Qwen2.5-VL-7B | 0.0 | 2.0 |
| OVR-7B | 10.1 | 19.2 |
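To make the metrics above concrete, the following is a minimal sketch of how classifier accuracy, false positives/negatives, and per-model behavior frequency could be computed from paired GPT/human labels. The data structure and field names are illustrative assumptions, not the format of our annotation platform.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    model: str         # e.g. "Qwen2.5-VL-7B" or "OVR-7B" (hypothetical field names)
    gpt_label: bool    # GPT-4o-mini: does the response show visual re-inspection?
    human_label: bool  # aggregated human annotation for the same response

def agreement_stats(samples):
    """Accuracy and error counts of the GPT labels against human labels."""
    tp = sum(s.gpt_label and s.human_label for s in samples)
    tn = sum(not s.gpt_label and not s.human_label for s in samples)
    fp = sum(s.gpt_label and not s.human_label for s in samples)
    fn = sum(not s.gpt_label and s.human_label for s in samples)
    accuracy = (tp + tn) / len(samples)
    return accuracy, fp, fn

def behavior_frequency(samples, model_name, source="human_label"):
    """Percentage of a model's responses labeled as showing the behavior."""
    subset = [s for s in samples if s.model == model_name]
    hits = sum(getattr(s, source) for s in subset)
    return 100.0 * hits / len(subset)
```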
Q2: Distinguishing Genuine Behavior from Pattern Mimicry
We appreciate the opportunity to address this possible misunderstanding. We emphasize that the visual-specific cognitive behaviors defined and evaluated in the main paper directly address this concern: these behaviors are not present in the training data. The analysis of their emergence demonstrates a genuinely learned cognitive process rather than superficial mimicry. We further clarify the definition and findings as follows:
1. Visual-specific behaviors defined to capture genuine transfer: To assess whether the model genuinely learns and transfers cognitive behaviors, we examine visual-specific behaviors grounded in visual inputs (defined in Sec. 5.4 and detailed in Appendix C.1). While these behaviors are conceptually related to their linguistic counterparts (explained in S-Table 1), they are fundamentally different in modality, making them well-suited for verifying true cross-modal transfer rather than superficial mimicry.
2. Zero-shot emergence confirms internalization and transfer: These behaviors are entirely absent from all training data. Even during the multimodal RL, the model only receives ground-truth answers, without any intermediate reasoning steps or behavioral scaffolding. Nevertheless, as shown quantitatively in Table 3 and qualitatively in Figure 2, these behaviors consistently emerge. This strongly suggests that the model has internalized and strategically transferred abstract reasoning strategies to the visual domain, rather than mimicking surface-level text artifacts. We consider this one of the most valuable findings of our work.
We will further clarify this in the revision to avoid any ambiguity.
Q3: Contribution Clarification
We acknowledge that “RL with a cold start” is a well-established paradigm in LLM training. In our work, this pipeline is not positioned as the core novelty, but rather serves as a tool to systematically investigate the cross-modal transfer of cognitive behaviors and to build a strong multimodal reasoning model (lines 37–44).
We further clarify our distinct contributions as follows:
- Cross-modal cognitive behavior transfer as the central focus: Our work uniquely centers on the unexplored behavior transfer from language to vision, revealing the internal mechanisms and generalization patterns behind advanced reasoning capabilities. This enables a scalable and interpretable pathway for building stronger multimodal reasoning systems.
- In-depth behavioral analysis: We provide the first quantitative study of behavior emergence across training stages, including the surprising zero-shot emergence of visual-specific cognitive behaviors (see Q2 response), transfer patterns (Sec. 6), and the correlation between behavior and performance (see Q3 response to Reviewer SEpa). This sheds light on how abstract behaviors generalize across modalities and benefit the model's reasoning capabilities.
- Superior reasoning across modalities: OVR achieves exceptional performance on both text and multimodal reasoning, as well as on general-purpose benchmarks, demonstrating the value of modality-agnostic cognitive behaviors for strong reasoning capabilities. Notably, it establishes a powerful and reproducible baseline for the open-source community, with full data, model and implementation details to be released publicly.
Q4: On the Reasoning-Perception Trade-off
We would like to emphasize that the primary goal of this paper is to enhance multimodal reasoning. Within this context, the partial decline in perception metrics is acceptable and solvable, and therefore should not be considered a “major flaw”. We address this concern through four key aspects:
1. No "consistent degradation" in perception
- Contrary to the reviewer's comments, MMVet scores in Table 5 actually increase from 57.3 to 60.8.
- We also evaluate on more perception-centric benchmarks, and the results show that OVR maintains or even improves performance across these tasks.

| Model | BLINK | HallusionBench | MMBench |
|---|---|---|---|
| Qwen2.5-VL-7B | 53.71 | 49.43 | 85.84 |
| OVR-7B | 54.13 | 54.90 | 86.42 |
2. Acceptable trade-off for the reasoning-focused OVR: Given our framework’s emphasis on language and multimodal reasoning, some marginal drop on perception metrics is an acceptable trade-off. While MME-P shows a decline of only 7.3%, reasoning metrics improve more substantially. For instance, AIME24 increases from 6.67 → 58.8 and MathVision from 25.5 → 50.0, demonstrating a strong overall capability gain.
3. Solution through Scaling: As discussed in lines 280–282, we further scale up multimodal RL, leading to a gradual recovery in the perception metric MME-P (1547 → 1600) and an increase in reasoning metrics such as MME-R (713.6 → 725). This suggests that the perception degradation is not fundamental, but solvable through continued training.
4. A well-studied and explainable phenomenon in the community
- Perception-reasoning trade-off is a well-recognized phenomenon in MLLM reasoning [1]. The observed decline in perception performance often arises from reduced reliance on fine-grained visual details, as models develop longer and more abstract reasoning chains that favor linguistic priors.
- We plan to address this in future work via prolonged multimodal RL and self-distillation, aiming to jointly reinforce reasoning and perception abilities.
[1] More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models
Thanks for your effort. You have addressed my concern about Q1 and Q2. But I still have concerns about Q3 and Q4. Q3: I acknowledge the authors' clarification that the training pipeline is intended as a "tool" to study cognitive transfer, not as a novel method in itself. However, this reframing does not fully resolve the issue of the paper's core contribution. Q4: I thank the authors for providing additional benchmark data. However, the defense of the reasoning-perception trade-off is not convincing and, in fact, reinforces my original concern. Therefore, I choose to keep my score.
Q4
To avoid any potential misunderstanding, we would like to clarify that the primary focus of our work is on enhancing MLLM reasoning from the angle of cognitive behavior transfer. We do consider perception an important capability—which is why we faithfully include it as a side discussion in the paper—but it is not the central focus. Beyond this clarification, we add further discussion on this point in point 3 below to address the reviewer's concerns.
The concrete explanations are detailed below:
1. As a paper focused on reasoning, our work presents a clear and distinctive motivation, strong experimental evidence, and in-depth analytical insights.
- Built around the novel perspective of cognitive behavior transfer, we propose a distinct training principle, conduct extensive evaluations, and provide in-depth behavioral analysis. We sincerely appreciate that all reviewers have acknowledged the coherence and value of these efforts for the mainline.
2. The perception-related discussion is an additional point but not the mainline.
- This part in the main paper was actually included to faithfully report on an observed phenomenon. It is not central to the main objective of improving reasoning in MLLMs via cognitive behavior transfer and is discussed to sincerely motivate future community efforts.
3. Further discussion, analysis, and solutions on this point are as follows (though not our main focus):
While this is not our central focus, we agree with the reviewer that the "trade-off" represents an important direction for future research. Therefore, in our original response to Q4, we conducted additional study and observed the following:
- This phenomenon has been well-studied and analyzed in [4,5] for current SOTA visual reasoning models [6,7], which exhibit perception degradation. This is a feature of the problem space, not a "flaw" in our method.
- There is actually no fundamental degradation in perception. On the contrary, we observe improvements on perception-related benchmarks such as BLINK and HallusionBench.
- The practical solution proposed in Point 3 of our original response has been shown effective in addressing the temporary and marginal drop observed solely on MME-P, demonstrating that this issue is addressable and won't be a future concern.
[4] More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models
[5] Are Reasoning Models More Prone to Hallucination?
[6] MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning
[7] OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles
We sincerely thank the reviewer again for your time and thoughtful engagement throughout the rebuttal process. We truly hope these clarifications help resolve any remaining ambiguities or misunderstandings. Please feel free to reach out if you have any further concerns or suggestions.
Extended cold start experiments for supporting OVR's distinct and impactful training principle:
(Supporting Point 1 in the Part I above)
| Model | AIME2024 | MATH500 | MathVista | MathVision |
|---|---|---|---|---|
| Baseline | 6.67 | 67.4 | 69.2 | 25.5 |
| Lang cold start | 54.4 | 93.1 | 69.7 | 46.2 |
| MM cold start | 23.1 | 75.2 | 71.6 | 33.5 |
| Lang cold start + MM cold start | 51.0 | 90.3 | 70.4 | 40.1 |
1. Settings: We experiment with training on multimodal cold-start data either independently or in combination with language cold-start data. The multimodal cold-start datasets contain publicly available sources such as R1-OneVision [2] and Vision-R1 [3].
2. Results: Training with multimodal cold-start data alone leads to lower performance on both language and multimodal reasoning tasks. Adding multimodal cold-start data on top of language cold-start training also reduces performance relative to the language cold start alone. This reveals the lack of high-quality multimodal cold-start datasets.
3. Conclusion: Constructing effective multimodal cold-start data remains a significant challenge. Unlike the language domain, the multimodal community lacks high-quality, open-source reasoning teachers like DeepSeek-R1, and most closed-source models hide explicit reasoning traces.
[2] R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization
[3] Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
We sincerely appreciate that our response helped address your first two concerns. We also thank the reviewer for the opportunity to further clarify our contributions, and avoid any vagueness or potential misunderstanding caused by the limited space in the initial rebuttal.
Q3
To help better understand the core contributions of our work, we would like to offer a clearer clarification of our original response to Q3. The following elaboration is not a repetition, but a faithful extended clarification.
Our paper's contribution is multifaceted, extending beyond novel perspectives to include new conceptual paradigms, new scientific findings, and powerful practical artifacts:
1. A New Conceptual Paradigm: Language-to-vision cognitive transfer, the central focus of our work, is indeed a distinct and impactful training principle for advanced MLLM reasoning
- We fully understand your concern that “RL with a cold start” is often regarded as a common tool for LLMs. However, its direct application to MLLMs fails to yield comparable benefits (as clearly demonstrated by the cold-start experiments in the next box), primarily due to:
- The scarcity of high-quality multimodal cold-start data
- Unlike the language domain, the multimodal community lacks powerful open-source reasoning teachers like DeepSeek-R1, and most closed-source models hide explicit reasoning traces.
- Our cognitive transfer paradigm directly addresses this critical gap from a novel and underexplored perspective. By first cultivating effective reasoning patterns in the data-rich language domain, it provides an effective strategy to leverage existing, high-quality resources to enhance reasoning capabilities across both modalities.
In summary, our work demonstrates a scalable, interpretable, and reproducible training paradigm to build stronger multimodal reasoners. This distinct principle will meaningfully impact the future direction of MLLM reasoning. This is a fundamental shift in how to approach the problem.
2. New Scientific Insights: In-depth behavior transfer analysis reveals patterns essential to reasoning
- As shown in the Q3 response (point 2), we present the first quantitative study on cross-modal behavior transfer, tracking how cognitive behaviors learned in language emerge and evolve in the visual domain. This is a direct and novel contribution to the community's scientific knowledge.
- Beyond validating the transfer itself, our analysis reveals key reasoning-effective patterns within long multimodal reasoning chains. This actually offers a practical principle for constructing stronger multimodal cold-start data, highlighting which behavior patterns are worth preserving or distilling. This also helps address the data scarcity challenge discussed in the first point.
3. A Powerful Community Resource: We developed a strong open-source reasoner for both language and vision
- As shown in the Q3 response (point 3), unlike prior works that primarily focus on visual reasoning, our OVR demonstrates superior reasoning capabilities across both language and vision tasks.
- The resulting OVR is valuable not only from the perspective of cognitive behavior transfer, but also due to the effective data curation pipeline and our robust RL training methodology. Creating a powerful, accessible, and competitive baseline is a significant form of novelty that will directly facilitate future research.
Finally, regarding whether a paper offers sufficient contribution, we borrow a perspective from Novelty in Science [1]:
If you hear a good idea, there is a moment of surprise and then, the better it is, the more obvious it may seem. If it is easy to explain and obvious in hindsight, this in no way diminishes the creativity (and novelty) of the idea.
[1] Black. Novelty in Science. Medium. https://medium.com/@black_51980/novelty-in-science-8f1fd1a0a143
We sincerely thank all reviewers for your thoughtful feedback, which has helped us strengthen the clarity, rigor, and depth of our paper. We are encouraged that reviewers recognize:
- Motivation & Research Perspective: Our core hypothesis of behavior transfer was found to be compelling (WErP), and the overall research question is interesting and well-motivated (SEpa, 9Ror, bA1c).
- Training Design & Methodology: Reviewers appreciated the systematic and well-structured pipeline (WErP, bA1c), and noted the novelty of our well-motivated design (bA1c).
- Performance & Experiments: Our model achieves strong performance across both language and vision reasoning tasks (SEpa), with clear gains over open-source 7B models (WErP), and comprehensive evaluations supporting our claims (9Ror, bA1c).
- Analysis & Insight: The in-depth analysis of cognitive behavior emergence and transfer was seen as valuable and informative (WErP, SEpa, bA1c).
- Writing Quality: Reviewers described the paper as clearly written, easy to follow, and well-structured in its reasoning (SEpa, bA1c).
We have carefully responded to all reviewer concerns with substantial experiments, clarified claims, and proposed future directions where needed. We genuinely appreciate the thoughtful engagement throughout the process, and hope the improvements made during the rebuttal period will support your final evaluation.
This paper explores how cognitive behaviors learned from language-only training can transfer to vision-language models. The authors posit that reasoning skills acquired in the language domain can meaningfully improve performance on visual reasoning tasks. To investigate this, they design a three-stage training pipeline and present Open-Vision-Reasoner (OVR), a 7B-parameter model based on Qwen2.5-VL.
Overall, reviewers are generally positive about the work. Two points, however, deserve attention. First, a potential trade-off exists between reasoning and perception. The authors provide additional experiments indicating that the degradation in perceptual performance is minimal and can largely be neglected. Second, the analysis of correlation versus causation between cognitive behaviors and reasoning remains limited; the authors plan to expand this in the final version.
I suggest the authors further discuss these aspects, particularly the limitations. Overall, I think the paper meets the NeurIPS standard, and addressing these points in the final version would strengthen it further.