Aux-Think: Exploring Reasoning Strategies for Data-Efficient Vision-Language Navigation
Summary
Reviews and Discussion
This paper addresses an important and timely question in vision-and-language navigation: how can embodied agents handle long-horizon tasks more effectively? The authors propose using Chain-of-Thought (CoT) reasoning as a promising direction and provide a thorough analysis of how to incorporate CoT into the learning process. The main contribution is Aux-Think, a method that uses CoT during training but not at inference. The approach is well-motivated and supported by experiments demonstrating its effectiveness and data efficiency.
Strengths and Weaknesses
Strengths:
1. The paper addresses a critical issue in VLN—how to incorporate structured reasoning like CoT into long-horizon embodied tasks. The proposed perspective is promising, valuable, and underexplored, especially given the known benefits of CoT in static tasks.
2. The authors provide a well-articulated conceptual motivation by connecting human dual-process theory to the proposed Aux-Think strategy. The analogy to System 2 (explicit reasoning during training) vs. System 1 (implicit execution during inference) is not only intuitive but also inspiring in the context of robot learning.
3. The paper presents a thorough and clearly defined comparison of four reasoning strategies (No-Think, Pre-Think, Post-Think, Aux-Think). This taxonomy may inspire future work that revisits the role of reasoning in policy learning.
4. Aux-Think achieves state-of-the-art or competitive performance on standard VLN-CE benchmarks, especially under limited data conditions. The success rates on long-horizon tasks further support the claim that reasoning-guided training improves generalization and stability.
Weaknesses:
1. One concern is that while the paper does include real-world experiments, they mostly cover short-range tasks. Since the method is motivated by long-horizon reasoning, it would be helpful to see results on longer and more complex real-world instructions to better assess its applicability outside simulation.
2. In Section 3.3, the authors attribute the failure of Pre-Think and Post-Think to hallucinated or unstable CoT predictions in long trajectories. This is convincing, but it raises the question: could techniques like DAgger or student forcing be used to make the model more robust under distribution shift, instead of avoiding inference-time CoT altogether? Some exploration in this direction would strengthen the analysis.
3. The two auxiliary tasks—Instruction-based Reasoning and Receding-Horizon Action Planning—are introduced without much explanation and discussion. While they improve results (Table 4), the paper does not make clear why these tasks were chosen or what role they play in learning better representations. Moreover, the ablation in Table 4 lacks proper isolation of the components, which makes it difficult to assess the true contribution of each.
4. (Minor point) Since the CoT supervision comes entirely from LLM-generated text (Qwen-VL 2.5), it would be good to say something about quality control. Even a brief mention of filtering or checking would help make the results more convincing.
Questions
I believe the most important addition would be a real-world demonstration on a long-horizon task, as real-world performance is ultimately the key concern in robotics. As for the remaining points, it would be helpful if the authors could offer some further discussion or exploration; including experimental evidence would be even better.
Limitations
The authors provide a discussion of the method's limitations, which I think is reasonable and fair.
Final Justification
This is a fairly good piece of work with no obvious flaws or shortcomings. I believe it meets the acceptance bar for NeurIPS, and therefore I am keeping my original score.
Formatting Issues
None.
Q1. More complex real-world experiments
A1: Thank you for this insightful comment. We agree that evaluating on longer and more complex real-world instructions is important for thoroughly assessing the applicability of our method. We have actively scaled up real-world tests to cover longer-horizon tasks. We will include additional results and videos in the final version to better showcase performance under longer instructions and more challenging scenarios.
Q2: Other methods for avoiding inference-time CoT
A2: Thank you for the thoughtful suggestion and deep understanding of the task. Techniques such as DAgger or student forcing may offer alternative ways to mitigate distribution shift under inference-time reasoning. Their potential to reduce hallucination and improve stability in generated CoT for long-horizon navigation is worth investigating.
After further reflection, we realize that these approaches could be meaningfully integrated with our framework in several ways:
- DAgger. DAgger could be used to iteratively collect trajectories where the CoT-guided policy deviates from optimal behavior, and correct them using ground-truth supervision. This could help reduce hallucinations in Pre-Think and stabilize reasoning across long sequences (see the sketch after this list).
- Student Forcing. Student forcing could potentially be used during Aux-Think training to expose the model to its own predictions during intermediate CoT generation, making the auxiliary supervision more realistic and robust to compounding errors, even though CoT is not generated at inference time in our approach.
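To make this concrete, below is a minimal sketch of how a DAgger-style loop could relabel states visited by the learned policy with oracle actions and fresh CoT traces; `policy`, `env`, `expert_action`, and `generate_cot` are hypothetical placeholders, not part of our released code.

```python
# Hedged sketch of a DAgger-style loop combined with CoT relabeling.
# `policy`, `env`, `expert_action`, and `generate_cot` are hypothetical placeholders.

def dagger_round(policy, env, expert_action, generate_cot, dataset, episodes=100):
    """Let the current policy drive, then relabel the visited states with
    oracle actions (and optionally fresh CoT traces) for the next SFT round."""
    for _ in range(episodes):
        instruction, obs = env.reset()
        history = [obs]
        done = False
        while not done:
            oracle = expert_action(env)                        # ground-truth next step
            cot = generate_cot(instruction, history, oracle)   # re-annotate on-policy states
            dataset.append({"instruction": instruction,
                            "history": list(history),
                            "cot": cot,
                            "action": oracle})
            obs, done = env.step(policy.act(instruction, history))  # learner drives
            history.append(obs)
    return dataset
```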
We appreciate the reviewer's suggestion highlighting these possibilities. These hybrid strategies open up valuable new directions that are complementary to our current design and could be explored in future work.
Q3. Further Explanation for Instruction-based Reasoning and Receding-Horizon Action Planning
A3: Thank you for the detailed review.
- Instruction-based Reasoning. This auxiliary task requires the model to reconstruct the instruction text based on the visual input from a trajectory. It strengthens the grounding between visual observations and the language instruction, helping the agent better associate perception with task-level goals. Such grounding is especially important in VLN, where understanding the full instruction context is critical for accurate long-range navigation.
- Receding-Horizon Action Planning. Inspired by the future-oriented nature of CoT supervision, this task trains the model to predict a sequence of upcoming actions rather than just the immediate next step. This encourages long-term planning behavior and helps the model internalize structured, temporally abstract planning patterns, which are often implicit in CoT traces (a schematic loss sketch is given after the ablation table below).
- New Ablation Results. To further clarify their individual contributions, we have added new ablation experiments isolating each component. Results show that both modules independently contribute to performance improvements, supporting their complementary roles in enhancing reasoning and planning (see the following table).
A: CoT Reasoning, B: Non-CoT Reasoning, C: Receding-Horizon Action Planning.
| A | B | C | NE↓ | OSR↑ | SR↑ | SPL↑ |
|---|---|---|---|---|---|---|
| | | | 7.78 | 43.7 | 35.1 | 30.2 |
| √ | | | 7.08 | 47.6 | 41.3 | 35.8 |
| | √ | | 7.12 | 46.3 | 40.6 | 35.7 |
| | | √ | 7.14 | 47.3 | 37.1 | 32.2 |
| √ | √ | | 6.92 | 49.1 | 44.2 | 38.9 |
| √ | √ | √ | 6.01 | 52.2 | 46.0 | 40.5 |
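For illustration only, the following PyTorch-style sketch shows one way the main action objective and the auxiliary objectives discussed above could be combined during training; the `model(...)` interface, batch keys, and loss weights are assumptions rather than our exact implementation.

```python
import torch.nn.functional as F

def aux_think_loss(model, batch, w_cot=1.0, w_inst=1.0, w_plan=1.0):
    """Hedged sketch of a combined objective: next-action prediction plus three
    auxiliary token-level cross-entropy terms (CoT generation, instruction
    reconstruction, receding-horizon action planning). The model interface,
    batch keys, and weights are illustrative assumptions."""
    def ce(task, input_key, target_key):
        logits = model(batch[input_key], task=task)        # (B, T, vocab)
        return F.cross_entropy(logits.flatten(0, 1),
                               batch[target_key].flatten(),
                               ignore_index=-100)

    loss_action = ce("action", "obs_and_instruction", "action_ids")       # main task (used at inference)
    loss_cot    = ce("cot", "obs_and_instruction", "cot_ids")             # training-only CoT reasoning
    loss_inst   = ce("instruction", "obs_only", "instruction_ids")        # reconstruct the instruction
    loss_plan   = ce("plan", "obs_and_instruction", "plan_ids")           # predict the next n actions
    return loss_action + w_cot * loss_cot + w_inst * loss_inst + w_plan * loss_plan
```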
Q4: The Quality Control of the CoT Data.
A4: Thank you for the question.
- We assessed the quality of the Qwen-VL 2.5 (72B)-generated CoT data using standard human evaluation methods, including checking alignment among instructions, actions, and reasoning steps, as well as assessing the coherence of the reasoning itself. Our qualitative analysis suggests that the reasoning data are generally reliable and suitable for use in navigation tasks. Therefore, we used the Qwen-generated CoT data without any modification.
- While we did not perform extensive filtering or refinement, our experimental results offer indirect evidence of quality: Aux-Think, trained on these unmodified data, achieves performance comparable to state-of-the-art methods and strong data efficiency. This suggests that the Qwen-generated CoT content is well suited for training.
- We will release our dataset to support further community efforts in analyzing, evaluating, and improving CoT data quality for navigation tasks.
The questions I raised were exploratory in nature, and the authors have provided good discussion on them. Since this is a promising idea, I hope the authors can demonstrate real-device results that other methods cannot achieve, as this would be a way to elevate the paper to an outstanding level. At present, this is a commendable work.
Thank you very much for your thoughtful and encouraging feedback. We truly appreciate your recognition of the potential of our work. In fact, we have already conducted additional long-horizon real-world experiments that demonstrate behaviors not easily achievable by previous methods, including multi-turn indoor navigation, complex instructions spanning over 10 landmarks, and outdoor navigation over extended distances. These scenarios highlight the strengths of long-horizon reasoning and pose significant challenges for existing baselines.
Unfortunately, due to NeurIPS policies, we are unable to include new figures, videos, or external links during the rebuttal phase. However, we plan to share these additional results through a public project page in the camera-ready version and supplementary materials. We hope this will help the community better understand the capabilities of our method and inspire further exploration of real-world long-horizon navigation.
This paper explores the integration of reasoning into VLN tasks. The findings reveal that inference-time reasoning harms model performance, and a reasoning-aware co-training strategy is proposed to internalize the reasoning ability and evaluated on the R2R val-unseen split.
Strengths and Weaknesses
Strengths
1. The whole paper is well structured and the layout is clear.
Weaknesses
1. A dataset is listed as a contribution, but the introductory details and statistics about the dataset are insufficient.
2. The experimental results may not be statistically significant.
Questions
1. How are the retrospective thinking data labelled? Do these data contain self-correction information?
2. In Table 1, it is still not very clear how much training data is added. How many tokens do the 600k extra training samples contain? I would suggest using tokens to represent the amount of data when comparing to other methods.
3. How is the quality of the Qwen data checked? Without this, the Pre-Think and Post-Think comparison (Finding 1, Lines 157-158) is not very convincing.
Limitations
Yes
Final Justification
I have read the responses by the authors and the other reviewers' discussions. The authors cleared my concerns about the data amount and quality. The SPL result on the RxR-CE task is still not convincing enough given that Aux-Think now uses twice the data. I hope the authors can provide an explanation of this in their revision. I would like to keep my original rating.
Formatting Issues
N/A
Q1: The details of thinking data construction
A1: Thank you for the question.
- We clarify that the construction process for our retrospective thinking dataset is detailed in Appendix A.2 (see Lines 502–508). Specifically, we prompt the Qwen-VL 2.5 (72B) model with the navigation instruction, historical visual frames, and the ground-truth action sequence, and explicitly ask it to produce a deep reasoning trace for the next navigation steps, including reflecting on past decisions and future plans (a schematic sketch of this prompting loop follows this list).
- Importantly, we observe that the generated thinking traces often include self-corrective signals, such as identifying navigation mistakes and proposing better alternatives—indicating a form of retrospective reflection. This reflective property helps guide the model toward more robust policy learning.
- We agree that this dataset construction process is an important part of our contribution. In the revised version, we will move more details from the appendix to the main paper (Section 3.2) and provide additional statistics and examples to better present its value.
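As an illustration of the construction step described above, here is a hedged sketch of the annotation loop; `query_qwen_vl`, the prompt wording, and the 4-step future window are assumptions, not the exact released pipeline.

```python
# Illustrative sketch of retrospective CoT annotation with a VLM.
# `query_qwen_vl` is a hypothetical wrapper around the Qwen-VL 2.5 (72B) model;
# the prompt wording and future window size are assumptions.

PROMPT_TEMPLATE = (
    "You are assisting a navigation agent.\n"
    "Instruction: {instruction}\n"
    "The attached frames show what the agent has observed so far.\n"
    "The ground-truth next actions are: {gt_actions}.\n"
    "Reflect on the past decisions, point out any mistakes, and explain "
    "step by step why the next actions are appropriate."
)

def annotate_episode(episode, query_qwen_vl, window=4):
    samples = []
    for t in range(len(episode["actions"])):
        prompt = PROMPT_TEMPLATE.format(
            instruction=episode["instruction"],
            gt_actions=episode["actions"][t:t + window],
        )
        cot = query_qwen_vl(images=episode["frames"][:t + 1], prompt=prompt)
        samples.append({
            "instruction": episode["instruction"],
            "frames": episode["frames"][:t + 1],
            "cot": cot,
            "action": episode["actions"][t],
        })
    return samples
```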
Q2. Amount of training data at the token level
A2: Thank you for the detailed suggestion.
- For simulation-based training data, our collection from R2R-CE and RxR-CE follows a similar procedure to UniNavid [1] and NaVILA [2], ensuring comparability in terms of both sample count and token quantity. We agree that reporting the number of training tokens provides a more standardized basis for comparison.
- We have recalculated the total token count using the Qwen tokenizer (see the short sketch after this list). Our additional 600k training samples contain approximately 60M text tokens and 4.2M images in total. In contrast, based on the data sources described in the NaVILA paper, we estimate that NaVILA's training data amounts to approximately 200M text tokens and 9.7M images. NaVILA uses a large amount of general VQA data (1000K samples) and real-world navigation videos (500K samples). The general VQA data in NaVILA introduces many extra tokens and images that are not fully aligned with the navigation task, so we estimate its token count from the original papers of those datasets. In conclusion, the number of training tokens in NaVILA is much larger than ours.
- Despite using significantly fewer training tokens, our method achieves comparable performance, which further supports our contribution of data efficiency through CoT-based auxiliary supervision. We will include the token counts and clarify this comparison in the revised version.
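For reference, token counts of this kind can be computed with the Hugging Face tokenizer as sketched below; the checkpoint name and sample format are assumptions.

```python
from transformers import AutoTokenizer

# Assumed checkpoint name; any Qwen tokenizer with the same vocabulary would do.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct")

def count_text_tokens(samples):
    """Sum text-token counts over a list of training samples.
    Each sample is assumed to be a dict with 'prompt' and 'response' strings."""
    total = 0
    for s in samples:
        total += len(tokenizer(s["prompt"])["input_ids"])
        total += len(tokenizer(s["response"])["input_ids"])
    return total
```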
[1] Zhang, J., Wang, K., Wang, S., Li, M., Liu, H., Wei, S., ... & Wang, H. (2024). Uni-NaVid: A video-based vision-language-action model for unifying embodied navigation tasks. arXiv preprint arXiv:2412.06224.
[2] Cheng, A. C., Ji, Y., Yang, Z., Gongye, Z., Zou, X., Kautz, J., ... & Wang, X. (2024). NaVILA: Legged robot vision-language-action model for navigation. arXiv preprint arXiv:2412.04453.
Q3. The quality of Qwen data
A3: Thank you for the question.
- We assessed the quality of the Qwen-VL 2.5 (72B)-generated CoT data using standard human evaluation methods, including checking alignment among instructions, actions, and reasoning steps, as well as assessing the coherence of the reasoning itself. Our qualitative analysis suggests that the reasoning data are generally reliable and suitable for use in navigation tasks. Therefore, we used the Qwen-generated CoT data without any modification.
- While we did not perform extensive filtering or refinement, our experimental results offer indirect evidence of quality: Aux-Think, trained on these unmodified data, achieves state-of-the-art performance and strong data efficiency. This suggests that the Qwen-generated content is well suited for training.
- We will release our dataset to support further community efforts in analyzing, evaluating, and improving CoT data quality for navigation tasks.
Q4. The experimental results may not be statistically significant.
A4: Thanks for the question. We provide an additional scalability experiment, increasing the total training size from 60M text tokens to 130M text tokens, to address your concern about the experimental results.
Table R1: Updated experimental results on R2R-CE benchmark
| R2R-CE | Version | Training Tokens | Data Size | NE ↓ | OSR ↑ | SR ↑ | SPL ↑ |
|---|---|---|---|---|---|---|---|
| NaVILA | - | 200 Million | 3.1 Million | 5.22 | 62.5 | 54.0 | 49.0 |
| Aux-Think | Initial Submission | 60 Million | 0.9 Million | 5.88 | 54.9 | 49.7 | 41.7 |
| Aux-Think | Update in Rebuttal | 130 Million | 1.9 Million | 5.49 | 62.9 | 55.7 | 48.7 |
Table R2: Updated experimental results on RxR-CE benchmark
| RxR-CE | Version | Training Tokens | Data Size | NE ↓ | OSR ↑ | SR ↑ | SPL ↑ |
|---|---|---|---|---|---|---|---|
| NaVILA | - | 200 Million | 3.1 Million | 6.77 | - | 49.3 | 44.0 |
| Aux-Think | Initial Submission | 60 Million | 0.9 Million | 6.91 | 60.1 | 49.3 | 37.9 |
| Aux-Think | Update in Rebuttal | 130 Million | 1.9 Million | 6.40 | 63.9 | 52.4 | 39.9 |
We observe that Aux-Think, despite using significantly less scale-up data, achieves state-of-the-art performance on the most important metric (Success Rate) and remains comparable to current SOTA models on other metrics. These consistent results across multiple benchmarks demonstrate the significant effectiveness and data efficiency of our method.
Dear Reviewer FdAS,
Thank you again for your time and for your review of our paper.
We have submitted our rebuttal and are hoping to hear your thoughts. We are very eager to know if our responses about the details of our dataset have helped to address the concerns you raised in your initial review.
The discussion period has been extended, and we would be very grateful for the opportunity to answer any follow-up questions you might have before the discussion phase ends.
Thank you once again for your valuable time and consideration.
Dear Reviewer FdAS,
As the rebuttal period is nearing its end (in less than two days), we would greatly appreciate any further thoughts you might have on our response, particularly whether it addresses your concerns adequately.
Thank you again for your valuable time and consideration.
This paper introduces an interesting Aux-Think approach that incorporates two reasoning-based auxiliary tasks during training. Specifically, the authors propose:
- A CoT-based reasoning generation task, in which the model must predict chain-of-thought traces from {instruction, observation history}.
- An inverse instruction reasoning task, in which the model must recover the instruction from visual observations.
They retain a primary task of predicting a sequence of actions based on the instruction, observation, and navigation history. As shown in Table 3, this approach yields improvements in success rates on visual navigation tasks, whereas both pre-think and post-think fail to improve upon the baseline and can even degrade performance. The proposed method is simple and easy to follow; however, the authors should explore whether injecting noise into states and actions during training could address the performance drop observed with both pre-think and post-think in a more principled way.
优缺点分析
Strengths:
- The observation that pre-think and post-think reasoning can hurt performance is compelling, and the analysis offers a plausible explanation.
- However, this analysis is rather general and likely applies to many settings where pre-/post-think fails. The key challenge is bridging the gap between the high-quality, oracle reasoning available during training and the noisier reasoning that occurs at inference time.
Weaknesses:
- Receding-horizon planning gains: According to Table 4, most improvements arise from the receding-horizon action planning component. Predicting the next n actions from the instruction and prior observations may be trivial and adds limited insight. Furthermore, the method is highly sensitive to the chosen number of planning steps.
- Unclear CoT vs. pre-think distinction: A major open question is: What are the fundamental differences between the CoT-based reasoning task and the pre-reasoning (pre-think) scheme?
- Dataset construction details: The authors omit crucial details about how the R2R-CoT-320k dataset was constructed in the main paper, instead hiding them in the supplementary material.
- Lack of alternative RL baselines: The core approach hinges on combining two reasoning-based auxiliary tasks, yet the authors do not report whether they tried recent PPO variants (e.g., GRPO, DAPO, VAPO) to train the model to generate reasoning traces before or after producing the final action sequence.
问题
The primary concern is how the authors train the pre-/post-thinking schemes and whether using more advanced PPO variants (e.g., GRPO, DAPO, VAPO) would help.
Please also address the other concerns raised above under Weaknesses:
- Receding-horizon planning gains: According to Table 4, most improvements arise from the receding-horizon action planning component. Predicting the next n actions from the instruction and prior observations may be trivial and adds limited insight. Furthermore, the method is highly sensitive to the chosen number of planning steps.
- Unclear CoT vs. pre-think distinction: A major open question is: What are the fundamental differences between the CoT-based reasoning task and the pre-reasoning (pre-think) scheme?
- Dataset construction details: The authors omit crucial information about how the R2R-CoT-320k dataset was constructed in the main paper, instead relegating it to the supplementary material.
局限性
No.
最终评判理由
The reviewer believes the core questions about the RL findings remain unaddressed, particularly the lack of experimental evidence:
- Fundamental blockers for RL failure: No ablation studies or diagnostic experiments provided to identify these blockers.
- Reward function design: Missing experimental comparisons with alternative reward formulations beyond simple format/action rewards.
- VLM limitations: No experiments correlating VLM base performance with RL effectiveness, or results comparing RL on stronger vs. weaker base models.
- DAPO/VAPO analysis: Lacks quantitative results, performance numbers, or learning curves comparing these variants.
Without experimental evidence and systematic ablations, readers cannot determine if your negative RL results are a significant finding or just implementation-specific limitations. The authors' strong claims about RL's ineffectiveness need empirical support. The reviewer is willing to reconsider my evaluation if these points are addressed with concrete experimental results.
格式问题
No
Q1: The PRIMARY concern is how the authors train the pre-/post-thinking schemes and whether using more advanced PPO variants (e.g., GRPO, DAPO, VAPO) would help.
A1: Thanks for your comments. We would like to clarify two key points regarding your concerns:
(1) Clarification on our scope of algorithmic contribution.
- There appears to be a misunderstanding about how our models are trained. We use supervised fine-tuning (SFT) (Lines 130, 211, 278), which is the same training paradigm used by other state-of-the-art VLA-based navigation methods. Optimizing reasoning policies with advanced RL variants is orthogonal to our goal. Our contribution is to systematically analyze reasoning strategies in VLN and introduce Aux-Think, a training framework that uses CoT only during training as auxiliary supervision.
- As an exploratory alternative, we also implemented GRPO-based reinforcement learning to train different reasoning strategies (Appendix A.4). This was done to more comprehensively investigate the learning behavior of different reasoning strategies. However, we observed that this reinforcement learning approach underperforms compared to our final method (Line 519), the SFT-based Aux-Think framework.
- We also note that the reviewer suggests replacing "basic PPO" with GRPO or other PPO variants. We clarify that our RL implementation already uses GRPO, not vanilla PPO, as described in Lines 231 and 519, with analysis in Lines 522–544 in our paper. We hope our response clears up the confusion.
(2) Additional GRPO Experimental Details. Although reinforcement learning is not the focus of our paper, we include the relevant experimental details here for completeness. We implemented GRPO with two types of reward signals, following [1]:
- Format reward: Indicates whether the output follows the structured reasoning format (“The next action is [action]”).
- Action reward: Indicates whether the predicted action matches the oracle navigation step.
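For clarity, a minimal sketch of these two reward terms is given below; the regular expression and the 0/1 values are illustrative assumptions.

```python
import re

# Hedged sketch of the two GRPO reward terms named above; the exact regex and
# weighting are assumptions, not the paper's implementation.
ACTION_PATTERN = re.compile(r"The next action is\s+\[?([\w ]+?)\]?\.?\s*$")

def format_reward(output: str) -> float:
    """1.0 if the response ends with the structured phrase, else 0.0."""
    return 1.0 if ACTION_PATTERN.search(output.strip()) else 0.0

def action_reward(output: str, oracle_action: str) -> float:
    """1.0 if the predicted action matches the oracle navigation step."""
    match = ACTION_PATTERN.search(output.strip())
    if match is None:
        return 0.0
    return float(match.group(1).strip().lower() == oracle_action.strip().lower())

def total_reward(output: str, oracle_action: str) -> float:
    # Equal weighting is an assumption; the rebuttal only names the two terms.
    return format_reward(output) + action_reward(output, oracle_action)
```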
We applied GRPO after SFT as a cold start to the No-Think, Pre-Think, Post-Think, and Aux-Think strategies, and report the results below:
| Method | Training Strategy | NE↓ | OSR↑ | SR↑ | SPL↑ |
|---|---|---|---|---|---|
| No-Think | SFT | 7.78 | 43.7 | 35.1 | 30.2 |
| No-Think | GRPO | 7.89 | 45.7 | 34.7 | 30.1 |
| Pre-Think | SFT | 9.23 | 19.3 | 11.4 | 8.6 |
| Pre-Think | GRPO | 9.01 | 21.0 | 9.0 | 8.2 |
| Post-Think | SFT | 8.59 | 35.1 | 29.0 | 23.8 |
| Post-Think | GRPO | 8.89 | 34.8 | 28.4 | 23.5 |
| Aux-Think | SFT | 7.09 | 47.6 | 41.3 | 35.8 |
| Aux-Think | GRPO | 7.23 | 48.6 | 40.1 | 33.4 |
We found that applying GRPO generally led to no improvement or even slight degradation in SR and SPL, which are the most important metrics in VLN. In particular, while the OSR score sometimes improved after GRPO, this increase is not indicative of meaningful navigation success: we observed that GRPO-trained models tend to take random exploratory actions when they are unsure whether to stop, which incidentally leads them to pass through the goal location without correctly stopping, thus inflating OSR without increasing SR.
Since GRPO did not bring consistent benefits and sometimes harmed performance, we chose not to highlight these results in the main paper, nor to claim them as part of our contributions. We instead focused our analysis on the supervised (SFT-only) setting, which more clearly illustrates the advantages of our Aux-Think strategy.
We hope these key points clarify the reviewer's concerns and help accurately position the contribution of our work. We will revise the paper to make these points clearer where necessary.
Reference:
[1] Feng, K., Gong, K., Li, B., Guo, Z., Wang, Y., Peng, T., ... & Yue, X. (2025). Video-r1: Reinforcing video reasoning in mllms. arXiv preprint arXiv:2503.21776.
Q2: Receding-horizon planning gains: According to Table 4, most improvements arise from the receding-horizon action planning component. Predicting the next n actions from the instruction and prior observations may be trivial and adds limited insight. Furthermore, the method is highly sensitive to the chosen number of planning steps.
A2: We thank the reviewer for the comment and address the concerns as follows:
(1) Main Gains Come from CoT Supervision, Not Receding-Horizon Planning.
- Receding-horizon planning is not the main source of improvement. As shown in Table R1 below (the original Table 4 in our paper), CoT-based supervision (Config A) yields a +6.2 SR / +5.6 SPL gain over the baseline, while adding receding-horizon planning (Config C) provides a smaller additional boost (+1.8 SR / +1.6 SPL). This confirms that CoT is the key contributor, with horizon planning playing a complementary role.
Table R1: The original Table 4 in our paper for the ablation study on different components. A: CoT Reasoning, B: Non-CoT Reasoning, C: Receding-Horizon Action Planning.
| A | B | C | NE↓ | OSR↑ | SR↑ | SPL↑ |
|---|---|---|---|---|---|---|
| | | | 7.78 | 43.7 | 35.1 | 30.2 |
| √ | | | 7.08 | 47.6 | 41.3 | 35.8 |
| √ | √ | | 6.92 | 49.1 | 44.2 | 38.9 |
| √ | √ | √ | 6.01 | 52.2 | 46.0 | 40.5 |
- To further clarify their individual contributions, we have added new ablation experiments isolating each component. Results show that both modules independently contribute to performance improvements, supporting their complementary roles in enhancing reasoning and planning (see the following table).
Table R2: Updated ablation study isolating each component. A: CoT Reasoning, B: Non-CoT Reasoning, C: Receding-Horizon Action Planning.
| A | B | C | NE↓ | OSR↑ | SR↑ | SPL↑ |
|---|---|---|---|---|---|---|
| | | | 7.78 | 43.7 | 35.1 | 30.2 |
| √ | | | 7.08 | 47.6 | 41.3 | 35.8 |
| | √ | | 7.12 | 46.3 | 40.6 | 35.7 |
| | | √ | 7.14 | 47.3 | 37.1 | 32.2 |
(2) Multi-Step Prediction Is Non-Trivial and CoT-Aligned. We respectfully argue that predicting the next n actions is a non-trivial task.
- Multi-step prediction leverages the planning capabilities offered by the CoT-based auxiliary supervision (Lines 270-272). CoT reasoning is inherently future-oriented, and multi-step prediction is a natural structural extension. It encourages the model to plan ahead, align with the temporally abstract supervision in the CoT-based auxiliary task, and improve short-term forecasting and consistency (Line 191).
- Empirically, moderate-length horizons yield coherent behaviors that single-step prediction cannot easily capture.
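To make the structure of this auxiliary target concrete, the following small sketch builds next-n action labels from an expert trajectory; the horizon length and pad token are illustrative choices.

```python
def receding_horizon_targets(actions, horizon=4, pad_token="<pad>"):
    """For each step t, the target is the next `horizon` expert actions
    a_t, a_{t+1}, ..., padded at the end of the trajectory.
    The horizon length and pad token are illustrative choices."""
    targets = []
    for t in range(len(actions)):
        window = actions[t:t + horizon]
        window = window + [pad_token] * (horizon - len(window))
        targets.append(window)
    return targets

# Example: for actions ["forward", "left", "forward", "stop"] and horizon=3,
# the step-0 target is ["forward", "left", "forward"].
```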
(3) Horizon Length Is Not a Sensitivity Issue. We do not view performance variation with horizon length as a sign of fragility.
- As shown in Table 5, performance varies smoothly, and a range of reasonable values provide consistent gains.
- We treat horizon length as a standard design choice that balances short-term reactivity with long-term planning, not as a sensitive hyperparameter.
We will clarify this in the revision.
Q3. Unclear CoT vs. pre-think distinction: A major open question is: What are the fundamental differences between the CoT-based reasoning task and the pre-reasoning (pre-think) scheme?
A3: Thanks for the question. The fundamental distinction lies in how and when CoT is used:
Our CoT-based reasoning is used only during training, as an auxiliary task that generates reasoning sequences without actions to enhance navigation reasoning ability. During inference, our model only predicts actions, without CoT reasoning. Pre-Think, in contrast, jointly generates both reasoning and actions during training and inference, and directly uses the actions conditioned on the generated CoT at inference time, which leads to Inference-time Reasoning Collapse.
We further detail the differences from multiple aspects:
- Loss function. For Aux-Think, the training loss function is Equation 5, and the loss function of the CoT-based auxiliary task is Equation 2. For Pre-Think, the loss function is the Pre-Think branch of Equation 1 (a schematic contrast is sketched after this list).
- Training-only vs. Training + Inference. CoT supervision in our method shapes internal representations during training, while inference remains clean and direct. Pre-Think depends on CoT generation at inference time, making it more prone to reasoning collapse and less efficient, as demonstrated in Table 3 in our paper.
- We also provide a quantitative visualization in Figure 2 in our paper that contrasts Pre-Think and the CoT reasoning in Aux-Think.
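Schematically, and only as a hedged paraphrase rather than a verbatim copy of Equations 1, 2, and 5, the contrast can be written as follows, with instruction I, observation history o_{1:t}, CoT trace c, and next action a:

```latex
% Schematic only; not the paper's exact Equations 1, 2, and 5.
% Pre-Think: the CoT is generated at inference time and the action is
% conditioned on it, so errors in c propagate into a.
\mathcal{L}_{\mathrm{Pre}} = -\log \pi_\theta(c \mid I, o_{1:t})
                             - \log \pi_\theta(a \mid I, o_{1:t}, c)

% Aux-Think: the CoT is a training-only auxiliary target (weight \lambda assumed);
% at inference the policy predicts a directly and never generates c.
\mathcal{L}_{\mathrm{Aux}} = -\log \pi_\theta(a \mid I, o_{1:t})
                             - \lambda \, \log \pi_\theta(c \mid I, o_{1:t})
```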
We hope the explanation and accompanying analysis help address the reviewer's concerns.
Q4: Dataset construction details: The authors omit crucial details about how the R2R-CoT-320k dataset was constructed in the main paper, instead hiding them in the supplementary material.
A4: Thanks for the suggestion. We have included key information about dataset construction in Section 3.2 (Lines 133-140), including the simulator, VLM model, CoT format, and dataset scale. Given space limits, we prioritized a concise description of the dataset in Section 3.2, while allocating the remaining space to the Aux-Think framework and empirical analysis. We agree that the dataset is also a key contribution of our work and that this section can be better highlighted; we will revise it accordingly in the final version.
The reviewer appreciates the detailed responses from the authors, and several major concerns are well addressed. However, the current improvements mainly come from the SFT from the high-quality R2R-CoT-320k dataset, and the current conclusion that RL fails to bring consistent benefits but even harms performance is less convincing. The authors are required to explain what are the fundamental blockers that lead to these non-trivial conclusion?
For example, how can we design a better verification reward function beyond the current simple format reward and action reward? Or is the fundamental limitation at that the pre-trained VLM still has relatively low success rates, thus resulting in less effective RL post-training? Can the authors provide more in-depth discussions about how the variations like DAPO and VAPO perform?
Thanks for your comments. We would like to offer the following clarifications and discussions:
1. Core Contribution Lies in Reasoning Strategy, Not Dataset Alone
However, the current improvements mainly come from the SFT from the high-quality R2R-CoT-320k dataset
Our contribution includes both the Aux-Think reasoning strategy and the R2R-CoT-320k dataset. The dataset benefits all methods equally. The consistent improvements of Aux-Think over other methods like Pre-Think, Post-Think, and No-Think under identical data conditions demonstrate that the core gains come from our reasoning design, not the dataset alone.
2. Clarification on the RL Findings
The current conclusion that RL fails to bring consistent benefits but even harms performance is less convincing. The authors are required to explain what are the fundamental blockers that lead to these non-trivial conclusion?
We respectfully disagree with the reviewer’s assessment that our conclusion about RL being less effective is "less convincing". We implemented GRPO using standard architectures and reward designs across all variants. Expecting extensive RL tuning, while SFT performs well without such effort, introduces an unfair comparison rather than a stronger baseline. We believe it is important to report such findings transparently, especially as they reflect the real-world challenges of applying RL in VLN.
This is primarily due to several practical limitations: (1) Reward design misalignment: VLN requires long-horizon reasoning, but GRPO relies on short-term, action-level rewards that fail to capture a global trajectory-level reward; (2) Unstable pre-trained VLM: as the reviewer noted, the pre-trained VLM can produce a low success rate, which introduces challenges during RL training; (3) Lack of diversity during sampling: after SFT, the model tends to generate identical actions across samples, resulting in uniform rewards and ineffective gradient signals. These factors collectively limit the effectiveness of RL in our setting.
Furthermore, RL-based training is out of scope for our current contribution. Neither the introduction nor the conclusion sections present RL as our motivation or contribution, and we do not propose any RL-based methods in this work. Our focus remains on improving VLN under SFT.
3. Scope of RL Exploration and Further Discussion
For example, how can we design a better verification reward function beyond the current simple format reward and action reward? Or is the fundamental limitation at that the pre-trained VLM still has relatively low success rates, thus resulting in less effective RL post-training? Can the authors provide more in-depth discussions about how the variations like DAPO and VAPO perform?
While reinforcement learning is not part of our paper's core contributions or scope (we do not propose any RL-based methods), we still provided additional analysis from the RL perspective to address the reviewer's concerns. If future work aims to improve RL for VLN, three potential directions include: (1) designing long-horizon or trajectory-level reward functions, such as using Success Rate (SR), Navigation Error (NE), or step-wise progress, to better supervise multi-step decision-making and encourage global reasoning (a small sketch follows); (2) continual pretraining of vision-language models on navigation-related data, which may inject structured spatial and planning priors into the model and reduce noise in generated rationales during RL training; (3) additional RL variants: DAPO or VAPO filter out duplicate actions during group sampling and may tackle the lack of action diversity in GRPO training.
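As a small illustration of direction (1), a hedged sketch of a trajectory-level reward built from standard VLN metrics is shown below; the 3 m success radius follows the usual SR definition, while the combination and weights are assumptions.

```python
def trajectory_reward(final_dist_to_goal, path_length, shortest_path_length,
                      success_radius=3.0, w_progress=0.1):
    """Hedged sketch of a trajectory-level reward built from standard VLN
    metrics: a success term gated by the usual 3 m radius, scaled by an
    SPL-style efficiency ratio, plus a small distance-based progress term.
    The exact combination and weights are assumptions."""
    success = 1.0 if final_dist_to_goal <= success_radius else 0.0
    # SPL-style efficiency: shortest path length over the longer of the two paths.
    efficiency = shortest_path_length / max(path_length, shortest_path_length)
    return success * efficiency - w_progress * final_dist_to_goal
```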
We believe that demanding new reward designs and additional RL variants such as DAPO or VAPO sets an open-ended bar for our work, and these directions are outside the scope of our current study. Our goal was not to exhaust all possible RL techniques, but to evaluate whether standard policy optimization offers gains on top of the SFT setting.
We hope our response has constructively addressed the reviewer’s concerns and clarified that RL is not the focus or contribution of our paper.
Dear Reviewer juUc,
Thank you again for your time and for your thoughtful review of our paper.
We have submitted our rebuttal and are hoping to hear your thoughts. We are very eager to know if our responses have helped to address the concerns you raised in your review.
The discussion period has been extended, and we would be very grateful for the opportunity to answer any follow-up questions you might have before the discussion phase ends.
Thank you once again for your valuable time and consideration.
Dear Reviewer juUc,
We sincerely appreciate your earlier comments and the recognition that several of your concerns were addressed in our initial response.
Since the rebuttal phase is drawing to a close, we wanted to kindly follow up to see whether our latest reply addressed your remaining questions. We would be happy to further discuss any remaining points if helpful.
Thank you once again for your time and consideration.
Thank you for your response. However, the core questions about the RL findings remain unaddressed, particularly the lack of experimental evidence:
- Fundamental blockers for RL failure: No ablation studies or diagnostic experiments provided to identify these blockers.
- Reward function design: Missing experimental comparisons with alternative reward formulations beyond simple format/action rewards.
- VLM limitations: No experiments correlating VLM base performance with RL effectiveness, or results comparing RL on stronger vs. weaker base models.
- DAPO/VAPO analysis: Lacks quantitative results, performance numbers, or learning curves comparing these variants.
Without experimental evidence and systematic ablations, readers cannot determine if your negative RL results are a significant finding or just implementation-specific limitations. The authors' strong claims about RL's ineffectiveness need empirical support. The reviewer is willing to reconsider my evaluation if these points are addressed with concrete experimental results.
Dear Reviewer juUc,
Thank you for your comment. Our work focuses on inference-time reasoning design (Aux-Think) under supervised fine-tuning (SFT), not RL. RL has never been claimed as part of our contribution (see Lines 70-80); we never claim any RL finding in our main paper as a significant conclusion; and the only RL-related content is the exploratory GRPO experiments in the appendix, which NeurIPS guidelines permit as supplementary analyses alongside the main experiments. The core contribution of our work is that Aux-Think achieves comparable SOTA performance using substantially less training data, as demonstrated in our experiments and acknowledged by other reviewers.
We respectfully suggest that the RL-related points raised represent promising future research beyond the current work’s scope. We would be glad to discuss them further, but since the reviewer’s earlier core concerns, such as the SFT training strategy, the distinction between Aux-Think and Pre-Think, dataset details, and the role of Receding-Horizon Planning, have been well addressed and acknowledged by the reviewer, we kindly hope the reviewer could first re-evaluate our work based on the scope-related content in our paper and rebuttal. Thanks again.
This paper explores integrating reasoning strategies into Vision-and-Language Navigation (VLN). A surprising finding is that inference-time reasoning degrades performance, which is named by the authors as “Inference-time Reasoning Collapse (IRC)”. Based on this insight, they propose Aux-Think, a framework that uses Chain-of-Thought (CoT) reasoning only during training as auxiliary supervision, while performing inference without explicit reasoning. Experimental results show that Aux-Think outperforms other reasoning strategies and demonstrates promising data efficiency.
优缺点分析
Strengths: The paper provides the first systematic analysis of reasoning strategies for VLN, highlighting a counterintuitive but important finding: inference-time CoT reasoning hurts performance in long-horizon, partially observable environments. This contributes a novel insight to the field.
Weaknesses: The experimental results are not very strong. At present, the performance is not on par with the state-of-the-art baseline NaVILA. It is understandable that CoT data is more difficult to collect, and thus the authors could not scale up training to the same extent. However, this inevitably weakens the paper—both from the perspective of scalability and from that of data efficiency.
The scalability of the method remains unproven, as no large-scale experiment has been conducted to establish whether the approach can match or exceed existing methods when trained on comparable amounts of data.
Evidence for data efficiency is also limited. A more convincing demonstration would involve directly comparing the proposed method against strong baselines—e.g., fine-tuning a pretrained NaVILA model—under identical small-data regimes and in the same unseen environments. Importantly, the comparison should be with a pretrained NaVILA rather than one trained from scratch. Since collecting CoT data is a clear limitation of the proposed approach, it is critical to show that even with a limited amount of CoT supervision, the method can outperform models trained on large-scale non-CoT data, which is significantly easier to obtain.
问题
Please see my comments on weaknesses above.
局限性
The paper does have a limitations section.
最终评判理由
My primary concern is stated in the weaknesses part, regarding the scalability of the method, since it does not outperform the SOTA model. The new experimental results show that increasing the data size does improve performance, and it is now on par with or slightly better than the SOTA. Although further scalability remains unclear, I believe this is enough for me to raise my score.
格式问题
No
Q1: The experimental results and the scalability of our method
A1: Thanks for the good question. We provide an additional scalability experiment, increasing the total training size from 0.9M to 1.9M samples, to address your concern:
Table R1: Updated experimental results on R2R-CE benchmark
| R2R-CE | Version | Data Size | NE ↓ | OSR ↑ | SR ↑ | SPL ↑ |
|---|---|---|---|---|---|---|
| NaVILA | - | 3.1 Million | 5.22 | 62.5 | 54.0 | 49.0 |
| Aux-Think | Initial Submission | 0.9 Million | 5.88 | 54.9 | 49.7 | 41.7 |
| Aux-Think | Update in Rebuttal | 1.9 Million | 5.49 | 62.9 | 55.7 | 48.7 |
Table R2: Updated experimental results on RxR-CE benchmark
| RxR-CE | Version | Data Size | NE ↓ | OSR ↑ | SR ↑ | SPL ↑ |
|---|---|---|---|---|---|---|
| NaVILA | - | 3.1 Million | 6.77 | - | 49.3 | 44.0 |
| Aux-Think | Initial Submission | 0.9 Million | 6.91 | 60.1 | 49.3 | 37.9 |
| Aux-Think | Update in Rebuttal | 1.9 Million | 6.40 | 63.9 | 52.4 | 39.9 |
- We conduct a new large-scale experiment by increasing the total training size from 0.9M to 1.9M samples. The extended training set includes 500K DAgger samples and 500K YouTube navigation samples annotated using the same pipeline as NaVILA.
- Due to the incomplete release of NaVILA’s data, we are only able to utilize approximately 60% of its total training data. Even so, with this training set, our model achieves comparable performance to NaVILA on both the R2R-CE and RxR-CE benchmarks. These results demonstrate that our method continues to improve as more data becomes available, confirming its scalability.
Q2: Evidence for data efficiency and the possibility to finetune NaVILA
A2: Thank you for the insightful comments.
-
We agree that comparing with a pretrained NaVILA under matched small-data settings would further strengthen the evidence for data efficiency. While NaVILA’s full training data and finetuning code were not publicly available at the time of submission (and remain partly unavailable now), we made our best effort to approximate the reviewer's intent using the available data. We designed a new experiment guided by the same motivation: to test whether our method performs comparably with less or reduced data.
-
Specifically, we collected the publicly released non-CoT portion of NaVILA’s training data (~ 50%, 1.58 Million) and combined it with the proposed CoT data (~ 10%, 0.32M), serving as a counterpart to the remaining ~ 50%. This gives a training set roughly 60% (1.9M) the size of NaVILA’s original (3.1M). Under this setup, Aux-Think achieves comparable performance to NaVILA on both R2R-CE and RxR-CE (see the above Table R1 and R2). This provides concrete evidence for the data efficiency of our method.
Hi, thanks for your reply and the additional results! I will raise my score accordingly.
Dear Reviewer gjCb,
Thank you again for your time and for your review of our paper.
We have submitted our rebuttal and are hoping to hear your thoughts. We are very eager to know if our responses or the additional experiments have helped to address the concerns you raised in your initial review.
The discussion period has been extended, and we would be very grateful for the opportunity to answer any follow-up questions you might have before the discussion phase ends.
Thank you once again for your valuable time and consideration.
This paper studies the role of inference-time reasoning in vision-language navigation tasks, where existing approaches for inference-time reasoning actually degrade the performance of VLN systems. To address this, the authors propose to train VLN systems with chain-of-thought at training time as auxiliary supervision, but perform inference without chain-of-thought, and find that it performs better than existing reasoning approaches for VLN.