PaperHub
Overall: 6.4 / 10 · Poster · 4 reviewers
Ratings: 5, 3, 3, 5 (min 3, max 5, std 1.0)
Confidence: 3.3
Novelty: 2.8 · Quality: 2.8 · Clarity: 2.8 · Significance: 2.8
NeurIPS 2025

Praxis-VLM: Vision-Grounded Decision Making via Text-Driven Reinforcement Learning

OpenReview · PDF
Submitted: 2025-05-06 · Updated: 2025-10-29

Abstract

Keywords
Multimodal Decision Making · Situational Reasoning · Vision-Language Model Reasoning

Reviews and Discussion

Official Review
Rating: 5

This paper investigates vision-grounded decision-making. The central insight is that visual scenes can be better understood by first generating a detailed description of the scene and then reasoning about the decision based on that description. The proposed Praxis-VLM is trained using the GRPO algorithm with an adaptive reward function.

Strengths and Weaknesses

Strengths: see below; Weaknesses: see Questions and Limitations

  1. The empirical discovery is interesting (though not entirely surprising), and it shows that current VLMs, even SOTA ones, fall short in multimodal understanding. It may provide an alternative, more data-efficient approach to reasoning acquisition, bypassing the need for image-text-reasoning pairs.
  2. The reported performance is good.
  3. The paper provides thorough analysis, including the role of reasoning length, error cases, clustering of reasoning type, etc.

Questions

  1. While the approach transfers reasoning from vision to texts, it doesn't directly optimize visual grounding during training: Line 126, “during the training phase, visual inputs are replaced by their textual descriptions, and only the language model components of the VLM are updated. This leverages the scalability of text data for knowledge acquisition. Yet, during the inference phase, the entire trained VLM architecture, including the vision encoder, is used to process the image-text input pair”.
  2. The performance of GRPO may rely heavily on the backbone VLM (Qwen2.5-VL as a strong open-source model).

Limitations

  1. The data generation relies heavily on the backbone VLM (GPT-4o). The quality and diversity of synthetic scenarios are not deeply discussed. There's room to explore how scenario variety or GPT hallucination might affect training.
  2. GRPO, multi-stage training, and adaptive reward design are used to improve reasoning. Although effective, it is an established approach and may focus more on the engineering perspective.

Final Justification

My concerns are mostly addressed. I like the idea of text-only training to address the visual-language gap. I raised my score to 5: Accept.

Formatting Issues

NA

Author Response

We sincerely thank you for the valuable feedback and for recognizing the strengths of our work, including the interesting empirical discovery, strong performance, and thorough analysis. Below we address the raised questions and concerns.

Q1: Not Directly Optimizing Visual Grounding

This is a good question! We would like to first clarify that our approach focuses on transferring reasoning skills learned from text-only training to vision-grounded multimodal inference. Our preliminary analysis (Section 2) shows that reasoning and decision-making can be disentangled from visual perception, allowing us to leverage the scalable text-only data to enhance VLM decision-making abilities.

This design addresses a critical bottleneck in current VLM development: the lack of accessible, high-quality multimodal data for model training, especially for decision-making tasks. By focusing on text-driven training, our method allows for efficient and cost-effective learning of reasoning skills without relying on multimodal supervision.

As such, directly optimizing visual grounding during training falls outside the scope of this work, as it would require a fundamentally different setup involving paired multimodal data, which contrasts with our case where such multimodal training data is unavailable.

As acknowledged in our Limitation Section (Line 550), we agree that further strengthening visual perception is an important next step for improving vision-grounded decision-making. While beyond the scope of this work, we plan to explore perception-focused training in future research.


Q2: Reliance on a Strong Backbone VLM like Qwen2.5-VL

This is a good question and we would like to clarify our choice of Qwen2.5-VL.

First, due to computational constraints, we focus on models at the 3B and 7B scales. Qwen2.5-VL is selected because it is a strong, open-source VLM with competitive performance, offering a reproducible and reliable foundation for our study. Recent work [1,2] suggests that RL mainly elicits and refines knowledge embedded in the base model, rather than learning entirely new capabilities. Therefore, tackling complex decision-making depends critically on the strength of the base model, and Qwen2.5-VL provided the best open-source option at this scale when our work was conducted.

Second, we follow established practice in reasoning VLM research, including OpenVLThinker [3], R1-OneVision [4], and MM-Eureka [5], all of which also build upon Qwen2.5-VL. This ensures comparability and consistency with existing methods.

We agree that extending Praxis-VLM to other architectures is an important future direction. However, our primary goal here is to demonstrate the principle and effectiveness of text-driven reinforcement learning for enhancing reasoning and decision making. We will clarify this motivation and model choice more explicitly in the revision.

[1] Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?

[2] S1: Simple test-time scaling

[3] OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement

[4] R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

[5] MM-Eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning


L1: Regarding Data Generation

Scalable data generation pipeline

This is a good point! While our data generation relies on an off-the-shelf model (GPT-4o), the pipeline is intentionally designed to be both diverse and scalable. We use an automated process: GPT-4o is prompted to generate 10 samples per batch, followed by deduplication based on word-overlap similarity to remove highly similar scenario descriptions within each batch. Beyond this step, we avoid manual filtering or heavy curation, enabling fast, domain-agnostic dataset creation that can scale to large volumes.
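As a rough illustration of the deduplication step described above (this is a minimal sketch, not the actual pipeline code; the Jaccard-style similarity measure, the 0.7 threshold, and the function names are assumptions):

```python
def word_overlap(a: str, b: str) -> float:
    """Word-level Jaccard overlap between two scenario descriptions."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa and wb else 0.0


def deduplicate_batch(scenarios: list[str], threshold: float = 0.7) -> list[str]:
    """Keep a scenario only if it is not too similar to one already kept.

    The 0.7 threshold is illustrative, not the value used in the paper.
    """
    kept: list[str] = []
    for s in scenarios:
        if all(word_overlap(s, k) < threshold for k in kept):
            kept.append(s)
    return kept


# Example on a toy batch: the near-duplicate second scenario is dropped.
batch = [
    "A manager must allocate a limited budget between two urgent projects.",
    "A manager needs to allocate a limited budget between two urgent projects.",
    "A hiker notices a storm approaching while far from any shelter.",
]
print(deduplicate_batch(batch))
```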

Leveraging LLMs for synthetic generation has been widely adopted in previous research [1]. Our approach builds on this by further eliminating the need for costly image-text paired datasets, offering a more practical and data-efficient strategy to enhance VLM reasoning capabilities.

Diversity of synthetic scenarios

To assess the diversity of the generated scenarios, we prompt GPT-4o to cluster the generated textual situations into topical categories. The results indicate a broad coverage across varied scenarios:

  • Workplace Performance and Personal Issues
  • Resource Allocation
  • Project Management
  • Balancing Competing Interests
  • Policy, Rules & Enforcement
  • Ethical Dilemmas
  • Interpersonal Conflict
  • Emergency Handling
  • Navigating Setbacks
  • Event Planning & Logistics
  • Balancing Inclusivity and Majority Preference

Impact of Synthetic Data on Training

While the generated data cover diverse situations, there can still be a domain gap when compared to specific evaluation benchmarks (e.g., PCA-Bench’s embodied robotics or EgoNormia’s egocentric video contexts). However, Praxis-VLM demonstrates consistent improvements and robust generalization across all benchmarks, indicating that our text-driven RL training learns generalized, transferable reasoning skills rather than overfitting to domain-specific patterns. This is also supported by the reasoning dimension analysis in Figure 5, showing that even without deliberate domain-specific data curation, Praxis-VLM successfully acquires adaptable reasoning abilities.

As discussed in the Limitation Section (Line 545), we believe exploring advanced data generation and selection methods for more efficient model training is an important direction. We also appreciate your insightful suggestion regarding how potential GPT hallucinations might affect training, which we plan to investigate in future work. We will revise the Limitation Section to further incorporate your suggestions into our discussions.

[1] Self-Instruct: Aligning Language Models with Self-Generated Instructions


L2: Regarding the Established Approach and the Engineering Focus

Thank you for raising this point. While we acknowledge that components such as GRPO and multi-stage training are established techniques, we respectfully disagree that our work is primarily engineering-focused. The key contribution lies in the novel text-only RL training for VLMs to tackle a fundamental challenge: enhancing the reasoning of VLMs without relying on expensive image-text paired training data.

We propose a text-driven reinforcement learning framework that uses language as the primary training medium to develop transferable reasoning capabilities, which are then effectively applied to multimodal decision-making tasks during inference. This framework enables us to decouple reasoning from perception, allowing for scalable and cost-efficient training while preserving strong performance in vision-grounded scenarios.
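For readers unfamiliar with GRPO, the group-relative advantage computation it relies on can be sketched as follows in its standard form; this is an illustrative sketch, not a claim about the paper's exact implementation, and the reward values are placeholders.

```python
def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standard GRPO advantage: normalize each sampled response's reward
    against the mean and std of its own group (all responses to one prompt)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


# Example: 4 responses sampled for one text-only decision scenario.
# Higher-reward responses get positive advantages and are reinforced.
print(grpo_advantages([1.0, 0.2, 0.2, 0.6]))
```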

By demonstrating that such reasoning skills can be learned purely from text and transferred to complex visual tasks, our work offers a new conceptual perspective on VLM training that goes beyond the use of existing RL methods, and addresses a practical gap in current VLM development.

Comment

Thank you for your response. My concerns are mostly addressed. I like the idea of text-only training to address the visual-language gap. I maintain my initial rating of 4: Borderline accept.

Comment

Dear Reviewer rfEF,

Thank you very much for your thoughtful feedback and for recognizing the value of our text-only training approach. We’re glad that our response addresses most of your concerns, and we will revise the paper to clarify the previously raised points and incorporate your suggestions. We believe this work offers a meaningful step toward scalable and generalizable VLM reasoning, and we kindly hope you might consider raising your score.

If you have any further questions or concerns, we’d be happy to address them.

Best regards,

The Authors

Comment

Thank you for your continued engagement in the rebuttal process. We are encouraged that you like our idea of text-only training to solve the visual-language gap.

To further address your concern raised in the original comments that “the performance of GRPO may rely heavily on the backbone VLM (Qwen2.5-VL as a strong open-source model),” we conduct additional experiments using Qwen2-VL [1], a base model known to have weaker language reasoning abilities [2]. This helps evaluate the robustness and generality of our approach across models with varying capacities.

Specifically, we apply our text-only RL training to Qwen2-VL-7B-Instruct. Due to time and resource constraints, we focus on Stage-2 training only, without the math cold-start phase. The results are as follows:

Task                 | VIVA  | PCA-Bench | EgoNormia-verified
Qwen2-VL             | 77.10 | 47.95     | 61.05
Ours (w/ Stage-2 RL) | 80.81 | 52.37     | 64.21
  • Note: Due to time constraints, we evaluate on EgoNormia-verified, a 200-video high-quality subset provided by the original paper and leaderboard.

As shown, our method consistently improves performance across all benchmarks, even when applied to a base model with more limited reasoning ability. This further supports the effectiveness and generality of our approach.

[1] Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution

[2] Qwen2.5 Technical Report

Comment

Thank you for the additional experiment. Raised my score to 5 in support of the paper.

Comment

Dear Reviewer rfEF,

We sincerely appreciate your recognition of our efforts and the increased score. We are also grateful for your valuable suggestions and will ensure they are incorporated into our revisions.

Best regards,

Authors

Official Review
Rating: 3

The paper proposes the VLM-based reasoning system, Praxis-VLM, that relies on text descriptions of the visual context (an image) in the vision-grounded decision-making process. Praxis-VLM uses reinforcement learning (GRPO method) to learn how to evaluate possible actions and their outcomes. The paper demonstrates that strong visual reasoning capabilities can be accomplished when the visual context (an image) is represented only through a pure textual description.

Strengths and Weaknesses

Strengths:

  1. The idea of using merely text descriptions of visual scenes to impose effective visual reasoning and consequent decision-making is sound and interesting.
  2. The proposed Praxis-VLM shows an upper hand over the baselines (base VLMs) in the experimental evaluation.
  3. The proposed approach can be of value in tasks for which the available (image, text) training pairs are scarce.

Weaknesses:

  1. The experimental evaluation is limited to one VLM model in two size versions (3B and 7B parameters).
  2. It is not clear from the paper how detailed (semantically complete) descriptions are imposed when using GPT-4o.

Questions

  1. It will be interesting to see the results of Praxis-VLM built based on other popular VLMs. It is probable that the results (degree of improvement, if any) might differ from those achieved using Qwen2.5-VL.
  2. How can one be sure, without concept leaking in the prompt formulation, that a textual description of an image generated by GPT-4o is sufficient and accurate in a given decision-making context?
  3. Do you have any suggestion as to why there is no difference between "Major" and "Pass@1" scores of Qwen2.5-VL-7B on the VIVA dataset?

Limitations

Yes

Final Justification

My concerns are mostly addressed. The underlying idea of using solely text descriptions of visual scenes to impose effective visual reasoning is interesting, although I still believe the experimental justification should be expanded. I maintain my score.

Formatting Issues

The paper begins with the figure (Figure 1) located above the abstract, which looks odd.

Author Response

Thank you for the thoughtful feedback. We are encouraged to know that you consider our work sound, interesting, and valuable for scenarios where paired image-text data is scarce. Below we address the weaknesses and questions you raised.

W1 & Q1. Experiment with Qwen2.5-VL

We appreciate your suggestion and would like to clarify our choice of Qwen2.5-VL (3B and 7B).

First, due to computational constraints, we focus on models at the 3B and 7B scales. Qwen2.5-VL is selected because it is a strong, open-source VLM with competitive performance, offering a reproducible and reliable foundation for our study. Recent work [1,2] suggests that RL mainly elicits and refines knowledge embedded in the base model, rather than learning entirely new capabilities. Therefore, tackling complex decision-making depends critically on the strength of the base model, and Qwen2.5-VL models provided the best open-source option at this scale when our work was conducted.

Second, we follow established practice in reasoning VLM research, such as OpenVLThinker [3], R1-OneVision [4], and MM-Eureka [5], all of which also build upon Qwen2.5-VL. This ensures comparability and consistency with existing methods.

We agree that extending Praxis-VLM to larger scales or other architectures is an important future direction. However, our primary goal in this work is to establish the principle and effectiveness of text-driven reinforcement learning for improving reasoning and decision-making capabilities. We demonstrate that the reasoning skills learned through language-only training can be successfully transferred to multimodal inference. We will clarify this motivation and the rationale behind our current model choice more explicitly in the revision.

[1] Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?

[2] S1: Simple test-time scaling

[3] OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement

[4] R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

[5] MM-Eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning


W2 & Q2. Regarding descriptions of images

Thank you for raising this question. We would like to clarify the following points and address any potential misunderstanding:

W2. How are detailed descriptions imposed using GPT-4o?

Praxis-VLM does not use generated textual descriptions of images, either during training or evaluation.

Our approach is motivated by the scarcity of large-scale image-text paired datasets for vision-grounded decision-making in VLM training. To address this, we explicitly disentangle reasoning from visual perception and train Praxis-VLM entirely on a synthetic, text-only dataset, where situations, questions, and potential decisions are fully described in natural language, as shown in Figure 3. These synthetic textual training data are generated by GPT-4o without any reference to benchmark images, ensuring no concept leakage.

During training, only the language model component of the VLM is updated, with the visual encoder disabled. We apply reinforcement learning with adaptive rewards on the synthetic text-only data to develop the model’s multi-step reasoning and decision-making abilities.

At inference time, the model directly processes the actual visual inputs (images or video frames) from benchmark datasets, without relying on any textual description from GPT-4o. The central idea is that the decision-making skills learned from text-only training transfer effectively to multimodal inference, avoiding the need for paired image-text training data or intermediate descriptions.

Q2. How can one be sure that a textual description of an image generated by GPT-4o is sufficient and accurate?

We would like to clarify that GPT-4o generated image descriptions are only used in our preliminary analysis (Section 2) to explore whether VLMs can perform decision-making when visual contexts are replaced by text descriptions. In these experiments, images are replaced with GPT-4o-generated or human-annotated descriptions, and only the language model component of the VLM is used for prediction. These experiments indicate that decision-making and reasoning can be disentangled from visual perception, motivating our text-only RL training approach for improving VLM reasoning purely within the language space.

Since Praxis-VLM always uses actual visual inputs (i.e., real images) during benchmark evaluations, the accuracy or completeness of GPT-4o-generated descriptions is not relevant to the core method. Our framework remains fully grounded in the visual modality during inference, ensuring valid and reliable evaluation in vision-grounded decision-making contexts.

We will revise our paper to better clarify this.


Q3: "Major." vs. "Pass@1" Scores for Qwen2.5-VL-7B

This is an excellent observation! The "Major." score reflects the accuracy of the most frequent answer across 8 sampled outputs, while "Pass@1" measures whether at least one of the 8 outputs is correct.

For the base Qwen2.5-VL-7B model, these scores are close because the model predicts answers in a highly deterministic manner, without explicit reasoning. Its output distribution is sharp and concentrated, and the model tends to repeatedly generate the same answer. As a result, the most frequent answer ("Major.") almost always matches the "Pass@1" result.

In contrast, reasoning VLMs (Reason SFT & Praxis-VLM) make predictions with explicit reasoning, enabling them to produce a broader variety of reasoning trajectories, exploring diverse solution paths, before reaching the final answer. This diversity increases the likelihood that at least one of the 8 samples is correct, even if the most frequent answer might be wrong, thereby creating a larger gap between "Major." and "Pass@1" scores. We will revise our paper to make this clearer.
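To make the two scores described above concrete, here is a minimal sketch of how they could be computed from the 8 sampled answers; the function names and toy answers are illustrative, not from the paper.

```python
from collections import Counter

def majority_correct(samples: list[str], gold: str) -> bool:
    """'Major.': is the most frequent answer among the sampled outputs correct?"""
    top_answer, _ = Counter(samples).most_common(1)[0]
    return top_answer == gold

def any_correct(samples: list[str], gold: str) -> bool:
    """'Pass@1' as used above: is at least one of the sampled outputs correct?"""
    return gold in samples

# A near-deterministic model repeats one answer, so the two scores coincide;
# a reasoning model with diverse trajectories can score higher on the second metric.
deterministic = ["B"] * 8
diverse = ["A", "B", "C", "B", "D", "A", "C", "B"]
print(majority_correct(deterministic, "B"), any_correct(deterministic, "B"))  # True True
print(majority_correct(diverse, "C"), any_correct(diverse, "C"))              # False True
```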


Regarding Figure 1 Location

Thanks for your suggestion. We will revise our paper and relocate Figure 1 accordingly.

Comment

Thank you for the rebuttal. My concerns are mostly addressed. The underlying idea of using solely text descriptions of visual scenes to impose effective visual reasoning is interesting, although I still believe the experimental justification should be expanded.

Comment

Thank you for your continued engagement in the rebuttal process. We appreciate your recognition that our underlying idea of using solely text data to enhance visual reasoning is interesting.

To further address your remaining concern regarding experimental justification, we have conducted additional experiments using Qwen2-VL [1], a base model known to have weaker language reasoning ability compared to Qwen2.5-VL [2]. This allows us to better examine the robustness of our method across models with varying capacities.

Specifically, we apply our text-only RL training to Qwen2-VL-7B-Instruct. Due to time and resource constraints, we focus on Stage-2 training only, without the math cold-start phase. The results are as follows:

Task                 | VIVA  | PCA-Bench | EgoNormia-verified
Qwen2-VL             | 77.10 | 47.95     | 61.05
Ours (w/ Stage-2 RL) | 80.81 | 52.37     | 64.21
  • Note: Due to time constraints, we evaluate on EgoNormia-verified, a 200-video high-quality subset provided by the original paper and leaderboard.

As shown, our method consistently improves performance across all benchmarks, even when applied to a base model with more limited reasoning ability. This further supports the effectiveness and generality of our approach.

[1] Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution

[2] Qwen2.5 Technical Report

Comment

Thank you for the additional clarification. I will take it into account when making my final decision.

Comment

Dear Reviewer x39V,

Thank you for considering our additional experiments and for your thoughtful evaluation throughout the review process. We appreciate your feedback and will make revisions to clarify the points you raised and incorporate your valuable suggestions.

Best regards,

Authors

Official Review
Rating: 3

This paper presents Praxis-VLM, a framework that enhances the decision-making and reasoning abilities of VLMs by training them on text-only decision-making scenarios. The core idea is to first build reasoning capabilities in a purely textual environment and then transfer these skills to multimodal inference. The method leverages GRPO and introduces an adaptive reward system (focusing on format, accuracy, and reasoning length at different training stages). Notably, Praxis-VLM achieves better performance than standard VLMs and SFT baselines.

Strengths and Weaknesses

Strengths

  1. The idea of using text-only reasoning to bootstrap visual decision-making is nice and data-efficient, especially given the scarcity of image-text reasoning datasets.

  2. The multi-stage GRPO training with adaptive rewards is well-motivated and clearly described, focusing on progressively harder reasoning tasks.

  3. Leveraging GPT-4o to synthetically generate a high-quality, diverse text-only dataset reduces the reliance on expensive manual annotations.

  4. Praxis-VLM consistently outperforms vanilla VLMs and SFT baselines, even if the margins are modest (~3-4%).

Weaknesses:

  1. Dataset Generation Not Fully Discussed: The process of creating the synthetic dataset relies heavily on GPT-4o with a small set of seed prompts. It’s unclear how representative, diverse, or potentially biased this generated data is. Also, manual effort in seed question crafting might limit scalability.

  2. Lack of Ablations on Visual Inputs: The paper does not explore what would happen if image-to-text data (captions or descriptions derived from images) were used instead of purely synthetic text-to-text scenarios.

  3. Reward Design: The adaptive reward seems similar to prior GRPO training setups. The novelty here feels incremental, especially since no in-depth reward ablation is provided to justify the significance of each reward component (especially the Rlen reward, whose importance is assumed but not convincingly demonstrated).

  4. Limited Performance Gains: The performance improvement over the baselines is modest (3-4%), raising questions about the practical significance versus the added complexity.

  5. Results: The cold-start initialization and math-focused training did not show significant advantages in some cases, but the paper does not sufficiently explain why.

  6. SFT Underperformance: It's surprising that SFT and reasoning SFT sometimes perform worse than the base model, which raises questions about dataset or training stability but is not addressed.

Questions

  1. Dataset Requirements: How scalable is the dataset generation process? How much manual intervention is required in crafting seeds, and could this limit transfer to other domains?

  2. Visual Context Integration: Why not explore using external tools (e.g., image captioners) to generate text from visual scenes instead of relying solely on synthetic text? Could that have improved the visual-text alignment?

  3. Ablation on Visual vs. Text Reasoning: Can you provide results where image-derived text is used during training instead of purely text-generated scenarios? Would this still achieve the same level of reasoning transfer?

  4. Role of Rlen: You report that increasing reasoning length sometimes decreases accuracy. Could you provide a detailed ablation isolating the impact of Rlen, and show which reward components ultimately matter most in driving performance gains?

  5. Relative Contributions: How much of the final performance improvement comes from text-to-text reasoning versus the GRPO training? Given the small gains, is it possible that most of the improvement is from the RL fine-tuning alone?

  6. Adaptive Reward Novelty: How does your adaptive reward meaningfully differ from prior GRPO reward structures? It feels largely similar to other multi-stage RL schemes.

  7. SFT Anomalies: Why does SFT sometimes perform worse than the base model? Is this due to overfitting on synthetic text or some limitation in the data quality?

Limitations

NA

Final Justification

The paper introduces an interesting idea and the authors addressed some of my questions. However, the improvements presented are not large and the experimental section is still weak, hence I will retain my score.

Formatting Issues

None

Author Response

Thank you for the insightful feedback. We are encouraged that you find our work to be "nice and data-efficient," and appreciate the motivation and performance of the proposed method. Below, we address your questions:

W1 & Q1. Data Generation

Scalability of Data Pipeline

Our dataset pipeline is designed to be scalable and effective to create a challenging and diverse dataset that fosters multi-step reasoning: (1) For diversity, we adopt a batch generation strategy, prompting GPT-4o to produce 10 samples at a time. Each batch undergoes deduplication based on word-overlap similarity to remove highly similar situations; (2) For difficulty, we embed explicit complexity instructions in our prompts (Line 649), guiding GPT-4o to generate questions that require non-trivial deliberation.

Beyond these automated steps, we intentionally avoid manual filtering or curation, enabling a fast and scalable pipeline for dataset creation.

The manual seed effort is minimal: we craft only 10 seed questions (similar to Figure 8) to provide GPT-4o as in-context examples. These seeds serve only to establish task format and structure rather than domain-specific knowledge. Consequently, the manual intervention is minimal, enabling easy adaptation to new domains.

Diversity and Representativeness of Synthetic Data

To assess the diversity of the generated scenarios, we prompt GPT-4o to cluster the generated textual situations into topical categories. The results indicate a broad coverage across varied scenarios:

  • Workplace Performance and Personal Issues
  • Resource Allocation
  • Project Management
  • Balancing Competing Interests
  • Policy, Rules & Enforcement
  • Ethical Dilemmas
  • Interpersonal Conflict
  • Emergency Handling
  • Navigating Setbacks
  • Event Planning & Logistics
  • Balancing Inclusivity and Majority Preference

While the generated situations are closer to the domain of VIVA (human-centered situations), there can be a domain gap compared to other benchmarks (e.g., PCA-Bench’s embodied robotics or EgoNormia’s egocentric video contexts). However, Praxis-VLM demonstrates consistent improvements and generalization across all benchmarks, indicating that our text-driven RL training learns generalized reasoning skills rather than overfitting to domain-specific patterns. This is also supported by the analysis in Figure 5, showing that with our text-only RL training, Praxis-VLM successfully acquires fundamental and adaptable reasoning abilities for decision making.

In future work, we plan to explore advanced data selection approaches to further enhance model training efficiency.


W2 & Q2 & Q3. Regarding Visual Inputs Ablations

We appreciate the question and would like to clarify a possible misunderstanding. Large-scale multimodal (image-text paired) datasets for vision-grounded decision-making are scarce and, to our knowledge, no such open-source training data are currently available. To address this, our approach deliberately disentangles reasoning from visual perception, using text-only training to isolate language as the medium for learning transferable reasoning skills. At multimodal inference, where image-text benchmarks are available, the model applies these reasoning abilities directly to real visual inputs (as shown in Figure 3).

W2 & Q2 & Q3. Why not use image captions?

Because large-scale multimodal training data is not available for this task, we cannot explore replacing our synthetic textual scenarios with image captions. This limitation is precisely why we investigate whether reasoning skills learned from pure text training can transfer effectively to multimodal settings, bypassing the need for expensive multimodal training data.

Q3. Visual vs. Text Reasoning Ablation

We cannot test training with image captions due to the lack of multimodal training datasets from which to obtain such captions. However, our method is inherently compatible: if captions could be generated, they could serve as additional textual situations alongside our synthetic data. Our current results already demonstrate that text-only training transfers effectively, as Praxis-VLM shows consistent improvements across multimodal benchmarks without any image-text multimodal training.


W3 & Q4 - Q6. RL Training and Reward

Q4. Increasing reasoning length does not reduce accuracy

We would like to clarify that our analysis (Sec 5.2) shows that Praxis-VLM tends to generate longer reasoning for more challenging samples. The observed accuracy drop is therefore linked to sample difficulty rather than the reasoning length itself. We will clarify this in the revision.

W3 & Q4. Ablation of R_len

Besides the accuracy reward, we follow DeepSeek-R1 in including the format-related rewards (R_tag, R_format) to ensure outputs follow the explicit reasoning and prediction structure. We incorporate the length reward (R_len) to encourage more comprehensive analysis and reasoning, which is particularly beneficial for complex decision-making tasks.
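A minimal sketch of how such a composite reward could be assembled is shown below; the tag layout, weights, and length target are illustrative assumptions, not the paper's actual design.

```python
import re

def adaptive_reward(output: str, gold_choice: str, target_len: int = 256) -> float:
    """Illustrative composite of the rewards named above (R_tag, R_format,
    accuracy, R_len). Tags, weights, and target_len are assumed values."""
    # R_tag: the response uses the expected <think>...</think><answer>...</answer> layout.
    r_tag = 1.0 if re.search(r"<think>.*</think>\s*<answer>.*</answer>", output, re.S) else 0.0
    answer = re.search(r"<answer>(.*?)</answer>", output, re.S)
    # R_format: a non-empty final answer is present.
    r_format = 1.0 if answer and answer.group(1).strip() else 0.0
    # Accuracy: the extracted choice matches the gold decision.
    r_acc = 1.0 if answer and answer.group(1).strip() == gold_choice else 0.0
    # R_len: reward longer reasoning up to a cap, encouraging deliberate analysis.
    think = re.search(r"<think>(.*?)</think>", output, re.S)
    n_words = len(think.group(1).split()) if think else 0
    r_len = min(n_words / target_len, 1.0)
    return 0.25 * r_tag + 0.25 * r_format + 1.0 * r_acc + 0.5 * r_len


# Example usage on a toy response.
resp = "<think>The scene shows a blocked exit, so option B is unsafe ...</think><answer>C</answer>"
print(adaptive_reward(resp, gold_choice="C"))
```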

To quantify its impact, we conduct an ablation study on Praxis-VLM-7B by removing R_len. The results show that R_len consistently improves performance across benchmarks:

Model         | VIVA  | PCA-Bench | EgoNormia
Praxis-VLM-7B | 84.03 | 60.25     | 54.33
w/o R_len     | 82.90 | 56.78     | 52.15

These results confirm that longer and more deliberate reasoning chains help the model perform deeper analysis, leading to accuracy gains.

Q5. Contributions of text-to-text reasoning vs. GRPO training

We clarify that the text-to-text reasoning and the GRPO training are fundamentally interdependent. Praxis-VLM is trained with the GRPO algorithm on synthetic text-to-text data to develop reasoning abilities, which are then transferred to multimodal inference. Thus, we cannot isolate the effects of text-to-text reasoning and GRPO training independently.

W3 & Q6. Novelty of Method

While our approach shares the general structure of multi-stage RL, the novelty of our adaptive reward lies in its alignment with stage-specific reasoning skills and its application to text-driven training for multimodal decision-making.

First, unlike prior GRPO setups that rely on paired image-text data, we apply this adaptive reward strategy in a text-only RL setup, with the goal of transferring learned reasoning skills to vision-grounded inference.

Second, our reward function is explicitly tailored to target different cognitive competencies progressively: Stage 1 focuses on format adherence and general logical reasoning; Stage 2 shifts the reward signal toward situational reasoning and decision making.

This enables data-efficient learning and effectively equips the model with decision-making capabilities, without requiring costly multimodal supervision.


W4. Limited Performance Gains

We acknowledge your concern and highlight the practical significance of our model from two perspectives.

First, all benchmarks are recently proposed and designed to challenge current VLMs. While the gains on in-domain benchmarks such as VIVA are around 3%, Praxis-VLM achieves +13% over the baseline on PCA-Bench (7B) and +8% on EgoNormia (7B), demonstrating its ability to handle diverse, out-of-domain scenarios. Moreover, Praxis-VLM is trained in a data-efficient manner, without requiring image-text paired data, reinforcing its practical utility.

Second, we further compare Praxis-VLM with OpenVLThinker [1], a recent reason-VLM based on Qwen2.5-VL-7B for general visual reasoning trained with multimodal data.

Model            | VIVA  | PCA-Bench | EgoNormia
Praxis-VLM-7B    | 84.03 | 60.25     | 54.33
OpenVLThinker-7B | 82.34 | 49.21     | 53.13

These results highlight that with text-only training, Praxis-VLM demonstrates superior performance across all tasks.

As Praxis-VLM shares the same parameter count as Qwen2.5-VL and is trained solely with text-only data, it introduces minimal added complexity while delivering practically meaningful gains.


W5. Cold-start Initialization

The math cold-start stage aims to equip the VLM with abstract, generalizable logical reasoning before Stage 2 task-specific training. This initialization proves crucial for cross-domain generalization, particularly on the EgoNormia dataset, where Praxis-VLM achieves notable gains.

Meanwhile, the smaller improvements on VIVA are mainly due to domain alignment: our synthetic text corpus in Stage 2 is similar to VIVA’s human-centered tasks (as discussed in W1 & Q1), so the benefits of math pretraining are less pronounced. We will revise our paper (Line 247) to better clarify this.


W6 & Q7. SFT Performance

The degradation of SFT-based models mainly arises from behavioral cloning and overfitting (Line 235). As discussed in W1 & Q1, there exist domain gaps between the synthetic data and certain benchmarks such as the PCA-Bench and EgoNormia datasets. SFT training tends to enforce pattern memorization from the training distribution, which may hurt performance when the model is tested on benchmarks from divergent domains.

This observation aligns with prior work [1,2] showing that reasoning SFT primarily establishes reasoning structures without encouraging exploration or adaptability. In contrast, RL training allows models to explore beyond static dataset examples, facilitating more fundamental and transferable decision-making skills. This difference explains why our GRPO-trained Praxis-VLM consistently generalizes better than its SFT counterparts.

[1] OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement

[2] SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

Comment

Dear Reviewer J55j,

Thank you again for your valuable feedback and suggestions on our work. We have carefully addressed your comments regarding synthetic data generation, visual input analysis, ablations on RL rewards, model performance, and further clarifications on cold-start initialization and SFT result analysis.

As the discussion period will conclude in two days, we kindly invite you to review our response. We hope it sufficiently addresses your concerns, and we remain open to any further questions you may have.

Best regards,

The Authors

Official Review
Rating: 5

The paper introduces Praxis-VLM, a vision-language model trained to perform vision-grounded decision-making through text-only reinforcement learning. Instead of using paired image-text data, the model learns reasoning from synthetic textual scenarios using the GRPO algorithm and an adaptive reward scheme across two training stages. At inference time, it applies these reasoning skills to real visual inputs. Experiments on three benchmarks show Praxis-VLM outperforms supervised fine-tuning baselines and generalizes well across tasks. The authors argue that foundational decision-making skills can be effectively learned from language alone and transferred to visual contexts.

Strengths and Weaknesses

Strengths:

  • Innovative Use of Text-Only Training: The paper presents a cost-efficient alternative to multimodal training by using synthetic text-only data to develop reasoning skills for vision-language models.

  • Strong Empirical Results and Generalization: Praxis-VLM consistently outperforms strong baselines across diverse tasks, including an out-of-domain benchmark, demonstrating both effectiveness and robustness.

  • Clear Evidence of Reasoning Quality In Experiment: The paper includes thoughtful qualitative and quantitative analysis, such as reasoning type clustering and diverse decoding evaluations, which show the model produces multi-dimensional, structured reasoning that improves decisions.

Weaknesses:

  • Evaluation Focused on Multiple-Choice QA Format: All benchmarks used (VIVA, PCA-Bench, EgoNormIA) are framed as multiple-choice QA tasks. This narrow task format may limit the generality of conclusions. The model’s ability to reason in open-ended or interactive settings (e.g., planning, dialogue-based decision making, robotic control) is not tested.

  • Lack of Analysis on Visual Feature Usage: Although the model uses vision encoders at inference time, it is unclear how much the model actually relies on visual input versus prior learned textual knowledge. There is no attention or attribution analysis showing whether the model grounds its reasoning in visual evidence or simply applies text-trained patterns.

Questions

  • Robustness to Visual Understanding Errors: How does the model perform when image descriptions are imperfect or noisy? Is its performance sensitive to vision encoder quality?

  • Synthetic Data Quality and Coverage: How diverse and representative are the GPT-4-generated scenarios? Could performance drop if the text data lacks variety or introduces bias?

  • Role of the Math Pretraining Stage: What specific benefit does Stage 1 contribute beyond output formatting? Are there measurable improvements in reasoning depth or structure?

Limitations

yes

Final Justification

The authors have addressed most of my concerns and provided a clear explanation of the paper’s contribution to embodied AI, particularly in addressing limited paired image-text data through text-only training and visual grounding at inference. While I remain cautious about its effectiveness in real-world settings with open-ended action spaces, the proposed idea of disentangling text and image during training presents a novel perspective that may be of value to the community.

Formatting Issues

no formatting issue

Author Response

Thank you for your valuable comments. We appreciate your recognition of our work's innovative method, effectiveness and robustness of the model, and thoughtful analysis. Here we address your concerns and questions:

W1: Regarding Multiple-Choice QA Format

This is a good question! We choose multiple-choice question (MCQ) benchmarks for two primary reasons:

  1. Objective and Reproducible Evaluation: The MCQ format allows for a straightforward and consistent evaluation of decision-making with accuracy, which has been widely adopted by prior work on reasoning-based VLMs [1,2]. In our case, it serves as a stable testbed for evaluating final decisions while avoiding the potential bias or ambiguity in scoring open-ended outputs. Evaluating open-ended actions, especially in complex, real-world scenarios, remains a challenging task. Therefore, we focus on MCQ based benchmarks for model evaluation.
  2. Challenging and Diverse Scenarios: The benchmarks we used are recently proposed and specifically designed to evaluate decision-making in a wide range of complex, vision-grounded situations. They span human-centered social contexts, embodied agent tasks (e.g., robotics and autonomous driving), and egocentric video understanding, offering a comprehensive evaluation suite for our methods.

We acknowledge that extending these learned skills to interactive or open-ended environments remains an important and exciting direction for future work. Our current study, however, establishes a critical and reproducible step toward developing models capable of such real-world applications. We will revise our paper to clarify the benchmark choice.

[1] MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

[2] R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization


Q1 & W2: Regarding Analysis on Visual Feature

Q1. Model Performance when image descriptions are imperfect

We first clarify that Praxis-VLM does not use image descriptions during inference. The generated image descriptions are only used in our preliminary analysis (Section 2) to explore whether VLMs can perform decision-making when visual contexts are replaced by text.

To assess the effect of descriptions on VLMs, we conduct an additional experiment where we replace the high-quality image descriptions used in our preliminary analysis (Figure 2, generated by GPT-4o or dataset annotations) with descriptions generated by a weaker model, LLaVA-1.5-7B. We then test the original Qwen2.5-VL-7B model's accuracy.

Task      | Original Text Description (high-quality) | LLaVA-1.5-7B Generated Description
VIVA      | 83.39                                     | 74.84
PCA-Bench | 47.95                                     | 42.59

The results show a significant performance drop when using lower-quality, noisy descriptions, confirming that robust situational understanding is crucial for reliable decision-making on the benchmarks used in our experiments.

These findings indicate that the vision encoder's quality remains a critical factor in the model’s inference-time reliability. However, enhancing visual perception typically requires paired image-text training data, which is unavailable in our target scenarios and beyond the scope of our work. Our approach addresses this by disentangling reasoning from perception, allowing reasoning to be trained efficiently on scalable text-only data.

W2. Does Praxis-VLM Actually Use Visual Features or Relies on Prior Learned Textual Knowledge?

This question is related to our discussion above. First, the significant performance drop when using noisy image descriptions (LLaVA-generated text) indicates that accurate visual perception is essential for VLMs to achieve strong performance on these benchmarks: when situational descriptions degrade, overall performance declines substantially. Praxis-VLM’s strong results on all three benchmarks suggest that the model is not merely applying patterns learned from text-only training, but is actively leveraging visual information during inference.

Second, as detailed in our reasoning analysis (Section 5.4), a key component of the model’s reasoning is Situational Analysis, where the model explicitly interprets and grounds its reasoning in visual content. This is further illustrated in the output examples (Figures 10–12), where the model consistently begins its reasoning by analyzing the visual situation. These findings collectively suggest that Praxis-VLM grounds its reasoning in visual evidence, rather than relying solely on text-trained patterns. We will highlight this conclusion in our revision.

We will revise our paper to better clarify this.


Q2. Regarding Synthetic Data Quality and Coverage

Our synthetic dataset pipeline is designed to be scalable and effective to create a challenging and diverse dataset that fosters multi-step reasoning: (1) For diversity, we adopt a batch generation strategy, prompting GPT-4o to produce 10 samples at a time. Each batch undergoes deduplication based on word-overlap similarity to remove highly similar situations; (2) For difficulty, we embed explicit complexity instructions in our prompts (Line 649), guiding GPT-4o to generate questions that require non-trivial deliberation.

Beyond this, we avoid manual filtering or heavy curation, enabling fast and domain-agnostic data generation.

To assess the diversity of the generated scenarios, we prompt GPT-4o to cluster the generated textual situations into topical categories. The results indicate a broad coverage across varied scenarios:

  • Workplace Performance and Personal Issues
  • Resource Allocation
  • Project Management
  • Balancing Competing Interests
  • Policy, Rules & Enforcement
  • Ethical Dilemmas
  • Interpersonal Conflict
  • Emergency Handling
  • Navigating Setbacks
  • Event Planning & Logistics
  • Balancing Inclusivity and Majority Preference

Notably, while the generated data cover diverse situations, there can still be a domain gap when compared to certain evaluation benchmarks (e.g., PCA-Bench’s embodied robotics or EgoNormia’s egocentric video contexts). However, Praxis-VLM’s consistent improvements and strong generalization across these benchmarks indicate that the text-based training equips the model with generalized and transferable reasoning skills. Figure 5 also shows that Praxis-VLM learns general reasoning dimensions (situational analysis, consequence evaluation, safety, and norm adherence), enabling effective transfer to multimodal scenarios.

However, if the text data lacks variety or introduces bias, we believe the model performance will drop. Unlocking fundamental decision-making skills requires a wide variety of situations and perspectives. Prior work also confirms that dataset diversity is a key factor influencing model performance and alignment [1, 2]. Moving forward, we plan to explore advanced data selection methods to further enhance effective model training.


Q3. Role of Math Pretraining

This is a great and insightful point! Our motivation for incorporating the math cold-start stage is to provide the model with a foundation in abstract, generalizable logical reasoning before introducing task-specific decision-making in Stage 2. The results show that this cold-start initialization measurably enhances the model’s generalization capabilities, particularly on the out-of-domain EgoNormia benchmark (as discussed in line 247). This demonstrates that the math training strengthens the model’s underlying reasoning and improves its adaptability to novel, complex decision-making scenarios.

This finding is also consistent with concurrent work [1], which shows that including mixed-domain training data can improve general reasoning performance. We will revise our paper to better clarify our motivation.

Measurable Improvements in reasoning

This is a good question! Quantifying reasoning depth and structure for comparison remains a challenge due to the lack of standardized metrics. To approximate this, we utilize GPT-4o as a Rhetorical Structure Theory (RST) discourse parser and convert each generated reasoning into a discourse tree. The average RST tree depths are comparable across models (3.93 vs. 3.90 for models with and without math pretraining), indicating similar surface structure.

Therefore, we further conduct a manual inspection of examples where only the pretrained model produces correct predictions. We find that Praxis-VLM with math pretraining integrates more relevant symbolic visual cues into its reasoning, particularly in complex contexts. For instance, in PCA-Bench's autonomous driving scenarios, Praxis-VLM with math pretraining is able to identify and integrate multiple road symbols (e.g., traffic signs, lane markers) into its reasoning, leading to more contextually appropriate action choices, whereas the variant without math pretraining tends to rely on a narrower subset of cues and makes less contextually appropriate choices.

We hypothesize that the symbolic reasoning rigor inherent in math and geometry tasks foster a more systematic reasoning framework. This structured reasoning appears to transfer effectively to multimodal decision-making by promoting careful observation, multi-step analysis, and integration of fine-grained visual signals.

We will revise the paper to clarify this motivation and highlight these empirical insights more explicitly.

[1] LIMA: Less Is More for Alignment

[2] Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective

Comment

Thank you for the thoughtful analysis and the additional experimental results. Most of my questions have been addressed. I appreciate the authors’ clear explanation of the paper’s position within embodied AI—particularly its attempt to address the scarcity of paired image-text training data by training with text only while enabling inference with visual grounding. While I remain concerned about the real-world applicability of the proposed approach in open-ended action spaces, I acknowledge that the idea of disentangling text and image during training may hold some interest for the community, even if its practical impact is still uncertain. Based on these considerations, I will raise my rating.

Comment

Dear Reviewer jthP,

Thank you for your thoughtful feedback and for acknowledging our clarifications and additional results. We sincerely appreciate your recognition of our text-only training approach as a way to address the scarcity of multimodal data.

We agree that the applicability of our approach in open-ended and interactive action spaces is an important direction. We will include a discussion of this and potential future directions in the limitation section.

Thank you again for your constructive comments and for raising your rating. If you have any further questions or suggestions, we would be happy to address them.

Best regards,

The Authors

Comment

Dear Reviewer jthP,

Thank you very much for your thoughtful and constructive feedback, which has been invaluable in helping us improve our work. We have carefully addressed your comments regarding the question format, analysis of visual feature usage, synthetic data generation, and clarifications on cold-start initialization.

As the discussion period concludes in two days, we kindly invite you to review our response. We hope our response adequately addresses your concerns, and we remain happy to clarify any remaining questions you may have.

Best regards,

The Authors

Comment

Dear PCs, ACs, and all Reviewers,

We would like to express our genuine gratitude for your time and efforts in facilitating the discussion regarding our paper. We sincerely thank all reviewers for their valuable and insightful feedback. We are encouraged that the reviewers consider our text-only training approach to be innovative (Reviewer jthP), interesting (Reviewer jthP, x39V, rfEF) and well-motivated (Reviewer J55j), and recognize its practical, data-efficient benefits (Reviewer J55j, x39V, rfEF), strong performance (Reviewer jthP, J55j, x39V, rfEF), and our thorough analysis (Reviewer jthP, rfEF).

During the rebuttal process, we have addressed all the raised questions and concerns. Here, we would like to further summarize several key points addressed in our rebuttal:

  • We clarified that Praxis-VLM does not use image descriptions during training or inference. Our approach addresses the challenge of scarce large-scale image-text paired data by disentangling reasoning from visual perception and training Praxis-VLM entirely on a synthetic, text-only dataset. At multimodal inference, where image-text benchmarks are available, the model applies these reasoning abilities directly to real visual inputs. We also demonstrated that Praxis-VLM actively leverages visual information during inference rather than relying solely on prior textual knowledge.
  • We provided detailed clarifications of our synthetic data construction pipeline, including analysis of its diversity. Our pipeline is designed to be scalable and efficient, requiring minimal manual effort;
  • We conducted additional experiments, including reward component ablations, evaluations with a different base model, and visual input analysis with noisy captions, demonstrating the robustness and generalization capabilities of our method.

Overall, we believe our work offers a practical and generalizable pathway to more capable VLMs by transferring reasoning learned from language to guide complex vision-grounded decision making. We will revise our paper to incorporate the suggestions. We will release our code, models, and data to facilitate future research.

We thank all reviewers once again for their insightful comments and constructive feedback.

Sincerely,

Authors

Final Decision

This paper trains multimodal language models to perform better reasoning for visually-grounded decision-making via reinforcement learning on text-based reasoning over synthetically generated scenario descriptions. Experiments show that models trained this way outperform existing vision-language models, including ones fine-tuned with SFT to perform reasoning, on several vision-based reasoning benchmarks.