PaperHub
Overall rating: 4.6/10 (Poster; 4 reviewers, min 3, max 3, std 0.0)
Individual ratings: 3, 3, 3, 3 · Confidence: 3.5
Novelty: 2.5 · Quality: 2.5 · Clarity: 2.3 · Significance: 2.3
NeurIPS 2025

STRIDER: Navigation via Instruction-Aligned Structural Decision Space Optimization

OpenReview · PDF
Submitted: 2025-05-09 · Updated: 2025-10-29
TL;DR

STRIDER is a framework for zero-shot VLN-CE, featuring a Structured Waypoint Generator (SWG) that constrains the action space with spatial structure and a Task-Alignment Regulator (TAR) that adjusts behavior based on task progress, helping agents align actions with spatial layout and task intent.

Abstract

Keywords
Vision-and-Language Navigation, Multi-Modal Reasoning

Reviews and Discussion

Official Review
Rating: 3

This paper addresses the task of Zero-shot Vision-and-Language Navigation in Continuous Environments (VLN-CE), where an agent must follow natural language instructions to navigate unseen 3D spaces without any scene-specific training or fine-tuning. The authors propose STRIDER, a novel framework that structures the agent’s decision space using spatial layout priors—via a Structured Waypoint Generator—and dynamically regulates agent behavior through task progress feedback—via a Task-Alignment Regulator. STRIDER is evaluated on the R2R-CE and RxR-CE benchmarks, achieving substantial improvements over state-of-the-art zero-shot methods, particularly in Success Rate (SR) and Success weighted by Path Length (SPL), thereby demonstrating the effectiveness of spatially-constrained and feedback-driven navigation.

Strengths and Weaknesses

Major strengths:

  1. The paper identifies a key gap in zero-shot VLN-CE: existing agents often drift from the intended instruction due to unstructured decision spaces and lack of integrated feedback, particularly in long-horizon tasks.
  2. STRIDER achieves state-of-the-art zero-shot performance on both R2R-CE and RxR-CE benchmarks.
  3. The ablation studies—especially on the Task-Alignment Regulator—clearly demonstrate its consistent positive impact on navigation metrics, with an insightful discussion of metric trade-offs (e.g., slightly higher Navigation Error (NE) due to more conservative path execution).

Major weaknesses:

  1. The framework’s dependence on a VLM for semantic descriptions and an LLM for action selection may obscure the core contributions; it remains unclear how much of the observed gains stem from STRIDER itself versus from the use of powerful pretrained models. Further ablations with varying backbones would help clarify this.
  2. STRIDER’s success heavily relies on the robustness of the underlying VLM and LLM. If the VLM misinterprets visual scenes (e.g., under poor lighting or challenging viewpoints, as noted in the supplementary), or if the LLM provides inaccurate progress feedback, errors may compound. An analysis quantifying the impact of incorrect feedback on overall performance is needed.

Questions

Along with the queries mentioned in the "Major weaknesses" part, it would be great if the authors could address the following:

  1. What is the per-step latency introduced by skeletonization and LLM reasoning? Can STRIDER operate in real time? How does it compare to Open-Nav with an API call instead of a local deployment?
  2. How sensitive is STRIDER to the choice of VLM/LLM? Would a smaller or domain-specific language model still yield benefits?
  3. Have the authors observed cases where the Skeleton graph omits valid paths or where feedback misleads the agent due to issues in VLM? How are such issues mitigated?

Limitations

No limitation is mentioned explicitly.

Final Justification

Thanks for the rebuttal. While I acknowledge the experiments, one major bottleneck I find is the model's over-reliance on the capabilities of the underlying VLM, which may hinder its more real-time application. Given this, I prefer to stay on my previous rating.

Formatting Issues

NA

Author Response

Weakness 1:

Thank you for the thoughtful comment. To disentangle the contribution of STRIDER’s framework from the underlying pretrained models, we conduct controlled comparisons using the same LLM and VLMs across STRIDER and baseline methods. As shown in Table 1, STRIDER outperforms the other two baselines under identical LLM and VLM settings, confirming that the performance gain stems from our framework design.

Table 1: Comparison Under Identical VLM and LLM

| Method | VLM | LLM | TL | NE↓ | NDTW↑ | OSR↑ | SR↑ | SPL↑ |
|---|---|---|---|---|---|---|---|---|
| DiscussNav | InstructBLIP + RAM | GPT-4 | 6.27 | 7.77 | 42.87 | 15 | 11 | 10.51 |
| STRIDER | InstructBLIP + RAM | GPT-4 | 8.24 | 7.21 | 46.61 | 29 | 23 | 19.05 |
| Open-Nav | Spatial-Bot + RAM | GPT-4 | 7.68 | 6.70 | 45.79 | 23 | 19 | 16.10 |
| STRIDER | Spatial-Bot + RAM | GPT-4 | 8.31 | 6.82 | 49.30 | 31 | 26 | 22.37 |

Weakness 2:

Thank you for raising this important point. We agree that the robustness of STRIDER depends in part on the reliability of its underlying VLM and LLM components. To evaluate the impact of incorrect feedback, we conduct a controlled experiment where we randomly corrupt the progress feedback provided to the LLM during reasoning. As shown in Table 2, this experiment offers a concrete quantification of STRIDER’s tolerance to feedback noise and supports its practical usage even in imperfect settings.

Table 2: Robustness to incorrect feedback

| Method | TL | NE↓ | NDTW↑ | OSR↑ | SR↑ | SPL↑ |
|---|---|---|---|---|---|---|
| STRIDER | 8.21 | 7.24 | 45.62 | 27 | 24 | 20.54 |
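The corruption protocol described above can be sketched as follows; this is a minimal illustration of random feedback noise, and the label set, probabilities, and function name are our assumptions, not the paper's actual implementation.

```python
import random

def corrupt_feedback(feedback: str, p: float, rng: random.Random) -> str:
    """With probability p, replace the true progress feedback with a random
    label. The label set here is illustrative, not the paper's actual one."""
    labels = ["on_track", "off_track", "subtask_done"]
    if rng.random() < p:
        return rng.choice(labels)
    return feedback

rng = random.Random(0)  # fixed seed so the corruption is reproducible
noisy = [corrupt_feedback("on_track", p=0.3, rng=rng) for _ in range(10)]
print(noisy)  # roughly 30% of entries are randomly relabelled
```

Feeding such corrupted labels into the LLM's reasoning step, then re-measuring SR/SPL, gives the tolerance figures reported in Table 2.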

Question 1:

STRIDER introduces a per-step latency of approximately 34.2754 seconds, only slightly higher than Open-Nav’s 32.1105 seconds. In terms of overall runtime, STRIDER takes 3.9074 minutes per episode, while Open-Nav takes 3.6606 minutes. Notably, STRIDER’s waypoint generation is faster, requiring only 0.4001 seconds compared to 0.8892 seconds for Open-Nav. These results demonstrate that STRIDER remains efficient and practical for real-world use.

Question 2:

We clarify that the capabilities of the underlying VLM/LLM are fundamental to the success of STRIDER.

Our method is designed to leverage and amplify the strengths of powerful models like GPT-4o, which helps improve decision robustness. As a result, the method is relatively insensitive to the choice among models of similar capability, e.g., between GPT-4 and GPT-4o.

We present ablation results in Table 3 using different VLMs. The results demonstrate that while stronger models do yield better performance, STRIDER maintains competitive results even with smaller or different models, confirming that the design generalizes well across models of similar capacities and does not rely on any single pretrained model.

Table 3: Ablation on different VLMs

| VLM | TL | NE↓ | NDTW↑ | OSR↑ | SR↑ | SPL↑ |
|---|---|---|---|---|---|---|
| Qwen-VL-Max | 8.13 | 6.91 | 51.87 | 39 | 35 | 30.30 |
| Qwen2.5-VL-72B | 8.30 | 6.78 | 51.99 | 39 | 34 | 29.07 |
| Qwen2.5-VL-32B | 8.56 | 7.12 | 48.02 | 33 | 28 | 24.20 |
| Qwen2.5-VL-7B | 8.92 | 7.46 | 46.35 | 29 | 24 | 21.12 |
| GPT-4o | 8.01 | 6.75 | 50.12 | 39 | 36 | 31.37 |
| Gemini-2.5-Pro | 8.34 | 6.92 | 51.35 | 37 | 34 | 29.85 |
| Gemini-2.5-Flash | 7.68 | 7.08 | 49.87 | 34 | 29 | 25.30 |
| Claude-3.5 | 7.81 | 6.86 | 52.10 | 36 | 33 | 29.40 |
| Claude-4 | 8.22 | 7.14 | 45.25 | 31 | 29 | 26.10 |

Question 3:

We have indeed observed occasional cases where the skeleton graph misses viable paths or where feedback from the VLM introduces minor errors.

To mitigate these issues, we enforce a maximum step-size limit of 1.5 meters during navigation, which ensures fine-grained control and prevents large deviations. As a result, even if an error occurs at one step, the agent does not stray far from the intended path and can often correct itself in subsequent steps after gaining a broader view.

Our robustness evaluation, shown in Table 2 in the supplemental material, further supports this claim: when subjected to mid-trajectory perturbations that displace the agent from its planned waypoint, STRIDER demonstrates better recovery.
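The 1.5 m step-size cap described above amounts to truncating each proposed move; a minimal 2-D sketch (coordinates and function name are our assumptions):

```python
import math

MAX_STEP = 1.5  # metres, the hard cap described in the response

def clamp_step(pos, target, max_step=MAX_STEP):
    """Move from pos toward target, truncating the move to max_step metres."""
    dx, dy = target[0] - pos[0], target[1] - pos[1]
    dist = math.hypot(dx, dy)
    if dist <= max_step:
        return target
    scale = max_step / dist
    return (pos[0] + dx * scale, pos[1] + dy * scale)

# A 5 m jump toward (3, 4) is truncated to 1.5 m in the same direction:
print(clamp_step((0.0, 0.0), (3.0, 4.0)))
```

Because each step is bounded, a single bad waypoint displaces the agent by at most 1.5 m, which is what makes the self-correction described above possible.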

Comment

Thanks for the rebuttal. I have no further questions. But I feel the method is very sensitive to the choice of the VLM.

Comment

Thank you for the comment. We would like to clarify that sensitivity to model capacity is expected and inherent in any zero-shot method based on frozen VLMs and LLMs. This is not unique to STRIDER; it is a general limitation of the paradigm. Our goal is not to eliminate this dependency, but to bring out the strengths of powerful models while ensuring that weaker models still yield usable results.

As demonstrated in Table 1 from Open-Nav[1] and Table 2 from CA-Nav[2], even with an identical architecture, different LLMs can lead to large differences. Our results in Table 3 of the rebuttal further support that changing only the VLM (perception) leads to performance differences. This reflects a well-understood fact in the field: zero-shot methods rely heavily on the quality of the pretrained models.

Table 1: Comparison in Open-Nav[1]

| Method | TL | NE↓ | nDTW↑ | OSR↑ | SR↑ | SPL↑ |
|---|---|---|---|---|---|---|
| Llama3.1-70B | 8.07 | 7.25 | 44.99 | 23 | 16 | 12.90 |
| Qwen2-72B | 7.21 | 8.14 | 43.14 | 23 | 14 | 12.11 |
| Gemma-27B | 8.41 | 6.76 | 40.57 | 16 | 12 | 10.65 |
| Phi3-14B | 8.47 | 8.53 | 33.64 | 8 | 5 | 3.81 |

Table 2: Comparison in CA-Nav[2]

| Method | NE↓ | SR↑ | OSR↑ | SPL↑ |
|---|---|---|---|---|
| GPT-3.5 | 7.66 | 21.1 | 45.0 | 9.4 |
| Claude-3.5 Sonnet | 7.41 | 25.2 | 47.1 | 11.8 |
| GPT-4 | 7.58 | 25.3 | 48.0 | 10.8 |

What we emphasize is that STRIDER remains robust and competitive across model scales. The system remains functional even with smaller models. Under fair comparison across models, the consistent performance gains of STRIDER sufficiently demonstrate the strength of our contribution. We hope this provides further clarity and addresses the reviewer’s concerns.

Reference:

[1] Yanyuan Qiao, Wenqi Lyu, Hui Wang, Zixu Wang, Zerui Li, Yuan Zhang, Mingkui Tan, and Qi Wu. Open-nav: Exploring zero-shot vision-and-language navigation in continuous environment with open-source llms. arXiv preprint arXiv:2409.18794, 2024.

[2] Kehan Chen, Dong An, Yan Huang, Rongtao Xu, Yifei Su, Yonggen Ling, Ian Reid, and Liang Wang. Constraint-aware zero-shot vision-language navigation in continuous environments. arXiv preprint arXiv:2412.10137, 2024.

Official Review
Rating: 3

This paper introduces STRIDER, a framework for zero-shot Vision-and-Language Navigation in Continuous Environments (VLN-CE). STRIDER combines a Structured Waypoint Generator that encodes spatial priors into the action space, with a Task-Alignment Regulator that dynamically adjusts actions based on feedback about task progress. The authors report performance improvements across two VLN-CE benchmarks (R2R-CE and RxR-CE), claiming that STRIDER provides more structured, instruction-aligned navigation without fine-tuning.

Strengths and Weaknesses

Strengths

  • The Structured Waypoint Generator is a clever use of skeletonization and topological abstraction to impose environmental constraints without explicit training.

Weakness

  1. Model Selection Justification

    • The paper uses Qwen-VL-Max as the VLM and GPT-4o as the LLM, but does not justify these choices. Why not use a more recent VLM such as Qwen2-VL-72B, especially since it has shown strong results in VLN tasks?

    • It remains unclear whether STRIDER’s performance gains are architecture-agnostic or dependent on these specific models. This limits the generalizability of the method.

  2. Lack of Baseline Comparison for Waypoint Generation

    • The "original waypoint generator" is frequently mentioned but never clearly referenced in the main text. What are its core mechanics, and how does it compare structurally to the proposed method?

    • There is no direct comparison under a controlled setup where Task-Alignment Regulator is held constant, and only the waypoint generation method varies. Such an experiment is essential to validate the structured waypoint hypothesis.

  3. Missing Prompting System Details

    • Given that the entire method operates in a zero-shot prompting-based setup, the absence of details on the prompt engineering used with GPT-4o is a critical omission.

    • How do these prompts compare to those used in Open-Nav or SmartWay? SmartWay integrates history reflection and allows backtracking — capabilities not clearly discussed in STRIDER. A deeper comparison here would be informative.

  4. Zero-Shot Performance Remains Far from Supervised Upper Bound

    • Although the improvements in zero-shot settings are appreciated, the gap with supervised models remains significant (e.g., SR of 35 vs. 61 in R2R-CE).

    • As zero-shot VLN-CE is still relatively immature, it is debatable whether this research constitutes a paradigm-shifting contribution or a well-executed empirical exploration. The paper could better contextualize its significance within current research trends, where fine-tuning and hybrid systems are dominant.

Questions

Typos:

  • Page 2, Line 77: "An Task-Alignment Regulator" -> "A Task-Alignment Regulator"
  • Page 9, Line 297-298: "navigate complex, unseen environments" -> "navigate in complex, unseen environments"

Suggestions

  • The LLM selects a waypoint from a candidate set, but it's unclear if this includes high-level reasoning (e.g., subgoal decomposition) or just ranking visual directions. The authors could detail the reasoning scope handled by GPT-4o.

Limitations

See Weakness 4.

Final Justification

Thanks for the rebuttal. The added experiment with different VLMs is appreciated. My questions regarding the details of the prompts and the definition of the original waypoint generator are resolved.

My major concern remains that the framework relies heavily on superior VLMs, and it is still unclear how the SWG would influence fine-tuning-based methods. Therefore, I choose to keep my score.

Formatting Issues

/

Author Response

Weakness 1.1:

We chose Qwen-VL-Max because it is the most recent stable release (updated 2025-04). In our experiments, it produces clear, concise descriptions under our prompt design, with performance comparable to other VLMs.

For the LLM, although most baselines adopt GPT-4, we choose GPT-4o for its more consistent formatting, reduced verbosity, and faster, more stable responses under API usage.

Weakness 1.2:

We would like to clarify that STRIDER is fundamentally architecture-agnostic. Its design centers around a structured planning-regulation pipeline that can operate with a variety of models.

We present ablation results in Table 1 using different VLMs. The results demonstrate that while stronger models do yield better performance, STRIDER maintains competitive results even with smaller or different models, confirming that the design generalizes well across models of similar capacities and does not rely on any single pretrained model.

Table 1: Ablation on different VLMs

| VLM | TL | NE↓ | NDTW↑ | OSR↑ | SR↑ | SPL↑ |
|---|---|---|---|---|---|---|
| Qwen-VL-Max | 8.13 | 6.91 | 51.87 | 39 | 35 | 30.30 |
| Qwen2.5-VL-72B | 8.30 | 6.78 | 51.99 | 39 | 34 | 29.07 |
| Qwen2.5-VL-32B | 8.56 | 7.12 | 48.02 | 33 | 28 | 24.20 |
| Qwen2.5-VL-7B | 8.92 | 7.46 | 46.35 | 29 | 24 | 21.12 |
| GPT-4o | 8.01 | 6.75 | 50.12 | 39 | 36 | 31.37 |
| Gemini-2.5-Pro | 8.34 | 6.92 | 51.35 | 37 | 34 | 29.85 |
| Gemini-2.5-Flash | 7.68 | 7.08 | 49.87 | 34 | 29 | 25.30 |
| Claude-3.5 | 7.81 | 6.86 | 52.10 | 36 | 33 | 29.40 |
| Claude-4 | 8.22 | 7.14 | 45.25 | 31 | 29 | 26.10 |

Weakness 2.1:

Thank you for pointing this out. In the paper, our reference to the “original waypoint generator” corresponds to the learning-based waypoint predictor used in prior VLN works. This predictor is typically trained on the R2R dataset using RGB or depth inputs to regress a set of candidate waypoints (angles and distances). It operates as a perception network with no explicit structural grounding, relying entirely on visual cues and learned priors. We use the version from Open-Nav in all comparison experiments in the full paper.

In contrast, STRIDER replaces this component with a geometry-based, layout-consistent skeleton extraction process that requires no training and offers interpretable and spatially constrained waypoint candidates.
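As a rough illustration of extracting degree-1 skeleton endpoints as waypoint candidates (a sketch under our own assumptions: a binary occupancy skeleton, 8-connectivity, and an illustrative function name — not the authors' code):

```python
import numpy as np

def skeleton_endpoints(skel: np.ndarray) -> list:
    """Return (row, col) pixels of a binary skeleton that have exactly one
    8-connected neighbour, i.e. degree-1 endpoints kept as waypoint
    candidates."""
    padded = np.pad(skel.astype(int), 1)
    endpoints = []
    for r in range(1, padded.shape[0] - 1):
        for c in range(1, padded.shape[1] - 1):
            # 3x3 window sum minus the centre pixel = neighbour count
            if padded[r, c] and padded[r - 1:r + 2, c - 1:c + 2].sum() - 1 == 1:
                endpoints.append((r - 1, c - 1))  # undo the padding offset
    return endpoints

# A toy L-shaped skeleton: two branches meeting at a corner, so exactly
# two degree-1 endpoints survive as candidates.
grid = np.zeros((5, 5), dtype=bool)
grid[2, 1:4] = True   # horizontal branch
grid[2:5, 3] = True   # vertical branch
print(skeleton_endpoints(grid))  # -> [(2, 1), (4, 3)]
```

In the full pipeline, the skeleton would come from thinning the free-space map; this sketch only shows the endpoint-selection step.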

Weakness 2.2:

We appreciate the reviewer’s suggestion, and we would like to clarify that Table 3 in the full paper already includes the requested controlled comparison.

Specifically, the first row (marked with "×" under SWG) uses the original learning-based waypoint predictor from Open-Nav, while keeping other components unchanged. In contrast, the remaining rows apply our structured waypoint generator (SWG) under different configurations. As shown, enabling SWG instead of the original waypoint predictor yields consistent improvements across all metrics, confirming the effect of structured waypoint design.

Weakness 3.1:

Thank you for pointing this out. Our prompting setup for GPT-4o follows the prompt templates from Open-Nav. We intentionally adopt these prompts without major modifications to ensure a fair comparison with prior methods and to isolate the contribution of our structural planning architecture. Since our focus is not on prompt engineering for LLMs, we kept this component consistent across experiments.

Weakness 3.2:

STRIDER adopts the same prompt as Open-Nav, which includes historical observations, previous decisions, and prior reasoning steps, along with the current view. While SmartWay supports explicit backtracking when the agent detects failures, STRIDER follows a forward-only design.

Weakness 4.1:

We acknowledge the performance gap between STRIDER and fully supervised models. However, it is important to highlight that STRIDER operates entirely in a zero-shot manner without any task-specific training. This design is particularly valuable for practical deployments, where collecting annotated navigation data is often infeasible.

Despite this constraint, STRIDER achieves strong results across two challenging datasets (R2R-CE and RxR-CE). As shown in Table 1 and Table 2 in the full paper, STRIDER improves SR and SPL significantly (e.g., +20.6% SR and +34.9% SPL over baseline on R2R-CE; +11.5% SR and +52.3% SPL on RxR-CE), demonstrating its robust generalization to diverse environments and languages.

We believe this performance offers a complementary perspective to zero-shot VLN.

Weakness 4.2:

We appreciate the reviewer’s broader reflection on the maturity of zero-shot VLN-CE. While we agree that this research area is still developing, we believe this is what makes STRIDER a timely and meaningful contribution.

STRIDER offers a structural shift in zero-shot VLN by combining skeletonized spatial abstraction and action regulation. It complements existing approaches by offering an interpretable alternative that can generalize across environments without fine-tuning. While not aiming to replace fine-tuning or hybrid systems, STRIDER opens a new direction for zero-shot, interpretable navigation that we believe will grow increasingly relevant as foundation models continue to evolve.

Question 1:

Thank you for pointing out the typos. We appreciate your attention to detail and have carefully proofread the manuscript to fix such issues.

Question 2:

In STRIDER, the LLM’s role extends beyond ranking visual cues. GPT-4o is responsible for interpreting and decomposing the full instruction, reviewing the agent’s historical observations, past decisions, and prior reasoning. In this sense, the LLM performs structured, high-level reasoning grounded in both spatial layout and linguistic intent.

Comment

Thanks for the rebuttal. The added experiment with different VLMs is appreciated. My questions regarding the details of the prompts and the definition of the original waypoint generator are resolved.

My major concern remains that the framework relies heavily on superior VLMs, and it is still unclear how the SWG would influence fine-tuning-based methods. Therefore, I choose to keep my score.

Comment

Thank you for the comment. We would like to clarify that reliance on strong pretrained models is a natural characteristic of zero-shot frameworks, and not unique to STRIDER. This reliance is widely observed across recent works, as shown in Table 1 from Open-Nav, Table 2 from CA-Nav, and Table 1 from our rebuttal, where using different VLMs/LLMs under the same architecture leads to performance variation. Despite this, STRIDER achieves significant gains over prior methods under the same VLM setting, as shown in Table 3, demonstrating that our improvements stem from the structured design rather than relying on larger or better models.

Table 1: Comparison in Open-Nav[1]

| Method | TL | NE↓ | nDTW↑ | OSR↑ | SR↑ | SPL↑ |
|---|---|---|---|---|---|---|
| Llama3.1-70B | 8.07 | 7.25 | 44.99 | 23 | 16 | 12.90 |
| Qwen2-72B | 7.21 | 8.14 | 43.14 | 23 | 14 | 12.11 |
| Gemma-27B | 8.41 | 6.76 | 40.57 | 16 | 12 | 10.65 |
| Phi3-14B | 8.47 | 8.53 | 33.64 | 8 | 5 | 3.81 |

Table 2: Comparison in CA-Nav[2]

| Method | NE↓ | SR↑ | OSR↑ | SPL↑ |
|---|---|---|---|---|
| GPT-3.5 | 7.66 | 21.1 | 45.0 | 9.4 |
| Claude-3.5 Sonnet | 7.41 | 25.2 | 47.1 | 11.8 |
| GPT-4 | 7.58 | 25.3 | 48.0 | 10.8 |

Table 3: Comparison Under Identical VLM and LLM

| Method | VLM | LLM | TL | NE↓ | nDTW↑ | OSR↑ | SR↑ | SPL↑ |
|---|---|---|---|---|---|---|---|---|
| DiscussNav | InstructBLIP + RAM | GPT-4 | 6.27 | 7.77 | 42.87 | 15 | 11 | 10.51 |
| STRIDER | InstructBLIP + RAM | GPT-4 | 8.24 | 7.21 | 46.61 | 29 | 23 | 19.05 |
| Open-Nav | Spatial-Bot + RAM | GPT-4 | 7.68 | 6.70 | 45.79 | 23 | 19 | 16.10 |
| STRIDER | Spatial-Bot + RAM | GPT-4 | 8.31 | 6.82 | 49.30 | 31 | 26 | 22.37 |

Regarding whether SWG can benefit fine-tuned methods, we replace the waypoint predictor with our SWG. As shown in Table 4, even in supervised settings, incorporating SWG leads to additional gains, validating that structured guidance contributes beyond zero-shot setups, which we believe sufficiently supports the strength of our contribution.

Table 4: Finetuning method with SWG

| Method | NE↓ | OSR↑ | SR↑ | SPL↑ |
|---|---|---|---|---|
| BEVBert | 4.57 | 67 | 59 | 50 |
| BEVBert w/ SWG | 4.37 | 70 | 61 | 53 |

We hope this provides further clarity and addresses the reviewer’s concerns.

Reference:

[1] Yanyuan Qiao, Wenqi Lyu, Hui Wang, Zixu Wang, Zerui Li, Yuan Zhang, Mingkui Tan, and Qi Wu. Open-nav: Exploring zero-shot vision-and-language navigation in continuous environment with open-source llms. arXiv preprint arXiv:2409.18794, 2024.

[2] Kehan Chen, Dong An, Yan Huang, Rongtao Xu, Yifei Su, Yonggen Ling, Ian Reid, and Liang Wang. Constraint-aware zero-shot vision-language navigation in continuous environments. arXiv preprint arXiv:2412.10137, 2024.

Official Review
Rating: 3

The paper proposes STRIDER, a zero-shot Vision-and-Language Navigation (VLN-CE) framework that optimizes an agent’s decision space using spatial layout priors and dynamic task feedback. Key innovations include a Structured Waypoint Generator (constraining actions via environmental skeletons) and a Task-Alignment Regulator (adjusting behavior based on task progress). Experiments on R2R-CE and RxR-CE show STRIDER improves Success Rate from 29% to 35% and outperforms SOTA on metrics like SPL and NDTW. The approach addresses long-horizon execution drift by integrating spatial structure and feedback loops.

Strengths and Weaknesses

Strengths:

  1. Novel Framework Design: STRIDER’s dual-module approach (structured waypoints + feedback regulation) effectively combines spatial constraint and semantic alignment, addressing core VLN-CE challenges.

  2. Strong Empirical Validation: Comprehensive experiments on two benchmarks (R2R-CE, RxR-CE) demonstrate significant performance gains over zero-shot baselines, with SR improvements and higher SPL.

Weaknesses:

  1. Computational Overhead: The pipeline involves multiple model calls (VLM for descriptions, LLM for reasoning), which could impact real-time execution.

  2. The authors should indicate in the paper which LLMs or VLMs are used by other zero-shot methods. Additionally, it is important to clarify how much of the performance gain remains when the same LLM or VLM is used across methods.

Questions

  1. Are there limitations in skeleton extraction for complex topologies or non-standard layouts (e.g., open spaces)?

  2. Can the task-alignment regulator handle fine-grained semantic differences in instructions?

  3. Why do some candidate waypoints in Figure 2 have a degree greater than 1?

  4. In the ablation study of Table 3, what is the performance without using TAR?

Limitations

Yes

Final Justification

Due to concerns about the computational overhead and the lack of a deeper analysis of skeleton extraction's limitations for complex topologies or non-standard layouts in simulated environments, I recommend the borderline-reject rating.

Formatting Issues

None

Author Response

Weakness 1:

We understand the reviewer’s concern. While STRIDER’s pipeline involves multiple calls to a VLM and an LLM, our measurements show that the overall inference time per episode remains comparable to the Open-Nav baseline. Specifically, STRIDER requires 3.9074 minutes per episode, compared to 3.6606 minutes for Open-Nav. We believe this minor overhead is acceptable given the substantial gains in navigation performance, and it demonstrates that STRIDER is practical for offline applications.

Weakness 2:

Thank you for the helpful suggestion. To assess how much of STRIDER’s performance gain comes from model design rather than model choice, we conduct controlled experiments where STRIDER and prior methods use the same LLM and VLM. As shown in Table 1, STRIDER outperforms the other two baselines under identical LLM and VLM settings, confirming that the performance gain stems from our framework design.

Table 1: Comparison Under Identical VLM and LLM

| Method | VLM | LLM | TL | NE↓ | NDTW↑ | OSR↑ | SR↑ | SPL↑ |
|---|---|---|---|---|---|---|---|---|
| DiscussNav | InstructBLIP + RAM | GPT-4 | 6.27 | 7.77 | 42.87 | 15 | 11 | 10.51 |
| STRIDER | InstructBLIP + RAM | GPT-4 | 8.24 | 7.21 | 46.61 | 29 | 23 | 19.05 |
| Open-Nav | Spatial-Bot + RAM | GPT-4 | 7.68 | 6.70 | 45.79 | 23 | 19 | 16.10 |
| STRIDER | Spatial-Bot + RAM | GPT-4 | 8.31 | 6.82 | 49.30 | 31 | 26 | 22.37 |

Question 1:

Our method is designed with a focus on indoor navigation scenarios, where environments typically exhibit well-defined spatial constraints and structured topologies (e.g., hallways, rooms, doorways). In these settings, skeleton extraction yields meaningful connectivity graphs that align with human spatial reasoning and support robust, layout-consistent planning.

We agree that in open or non-standard layouts, such as large open spaces without clear boundaries, the extracted skeleton may become overly simplified or lack informative branching. However, for the indoor environments targeted in VLN-CE benchmarks like R2R-CE and RxR-CE, which are exactly our tasks, our abstraction is well-suited and highly effective.

Question 2:

Yes, the Task-Alignment Regulator is explicitly designed to handle fine-grained instruction alignment by generating feedback based on scene changes. At each step, the regulator compares pre- and post-movement observations, and reasons over whether the intended semantic progression (e.g., fully entering a room vs. partially entering) has been achieved.

This allows the agent to detect subtle differences and adjust accordingly in the next decision step. As shown in Fig. 5 in the full paper, the regulator distinguishes between visually similar but semantically distinct states.

Question 3:

We appreciate the reviewer’s attention to detail. The candidate waypoints in Figure 2 are intended for illustrative purposes only, to convey the general idea of skeleton-based waypoint sampling. As shown, the figure includes various keypoints along the skeleton, including some with degree > 1, to better visualize the underlying graph structure. However, as described in Section 3.2 (and shown in Figure 3), our actual method only retains candidate waypoints with degree = 1, i.e., skeleton endpoints.

Question 4:

The performance without using the Task-Alignment Regulator (TAR) is reported in Table 4 in the full paper, specifically in the first row labeled “×”. We intentionally separate this analysis from the Structured Waypoint Generator (SWG) ablation in Table 3 in the full paper to isolate the individual contribution of each component.

Official Review
Rating: 3

This paper proposes a structural framework STRIDER for navigation tasks in a zero-shot manner. Specifically, STRIDER introduces a structural module to generate waypoint candidates and employs a regulator to adjust agent actions throughout action execution, ensuring alignment with the instruction's intent. Experimentally, STRIDER achieves significant performance improvements on both the R2R-CE and RxR-CE benchmarks.

Strengths and Weaknesses

Strengths

  • The paper is well-written and easy to follow.
  • STRIDER addresses an important real-world challenge: long-horizon navigation aligned with human intent. It demonstrates promising improvements on two benchmarks.

Weaknesses

  • While STRIDER improves SR and SPL across both benchmarks, its performance on NE is notably worse than the baselines. The authors discuss this between lines 259–263, but the explanation does not address my questions. In particular, it remains unclear why STRIDER struggles more than other methods with stopping at semantically appropriate locations, an issue that commonly affects zero-shot models.
  • Additionally, STRIDER shows a 12% drop in OSR on R2R-CE compared to baselines. This degradation needs further discussion.

Questions

  • Why are different models (Qwen and GPT4o) used for feedback generation and action selection?
  • Could the authors provide a comparison of inference time to better understand the computational cost of STRIDER?
  • Given that the feedback plays an important role in STRIDER’s performance, it would be valuable to include an ablation study using feedback of varying quality, like from open-source models of different sizes, to assess the impact on performance.
  • What stopping strategy does STRIDER employ? Is the decision to stop explicitly predicted by the model?

Limitations

Yes

Final Justification

Generally, this work introduces multiple VLM calls to perform long-horizon navigation aligned with human intent. My main concerns are the uneven performance gains and the extra cost. The current stop strategy could also be improved. Therefore, I maintain my rating of borderline reject, but acknowledge the potential of this work.

Formatting Issues

No

Author Response

Weakness 1:

While NE (Navigation Error) is a meaningful metric, it is not the most decisive indicator of navigation quality, as shown in recent works such as Smartway, which accepts slightly higher NE (7.01), in exchange for improved SR, SPL, and NDTW. This is because NE measures the Euclidean distance from the agent's final position to the goal, but it does not account for whether the agent successfully followed instructions or generated a coherent, goal-directed path. In contrast, SR (Success Rate) reflects whether the agent stops next to the goal, SPL (Success weighted by Path Length) evaluates path efficiency while accounting for success, and NDTW (Normalized Dynamic Time Warping) captures trajectory fidelity concerning the reference path. These metrics are more indicative of instruction-following accuracy, semantic alignment, and task-level success, especially in instruction-following navigation. To this end, recent works mainly use SR, SPL, and NDTW to evaluate the overall performance of the methods.
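The metric distinction above can be made concrete with a minimal sketch of SR and SPL (standard definitions in the VLN literature; the 3 m success threshold and 2-D positions are our illustrative assumptions):

```python
import math

def success(final_pos, goal, thresh=3.0):
    """SR: did the agent stop within `thresh` metres of the goal?"""
    return math.dist(final_pos, goal) <= thresh

def spl(final_pos, goal, path_len, shortest_len, thresh=3.0):
    """Success weighted by Path Length: 0 on failure, otherwise the ratio
    of the shortest-path length to the length actually travelled."""
    if not success(final_pos, goal, thresh):
        return 0.0
    return shortest_len / max(path_len, shortest_len)

# Stops 1 m from the goal (success) but walked 12 m where 10 m sufficed,
# so SPL = 10/12: success is discounted by path inefficiency.
print(spl((1.0, 0.0), (0.0, 0.0), path_len=12.0, shortest_len=10.0))
```

NE, by contrast, is just `math.dist(final_pos, goal)` and says nothing about whether the path followed the instruction, which is why a slightly higher NE can coexist with better SR/SPL/NDTW.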

Weakness 2:

While STRIDER’s OSR (39) on R2R-CE is lower than SmartWay's (51), we believe this difference is largely attributable to SmartWay’s use of a backtracking mechanism and significantly longer trajectories (TL = 13.09) as shown in Table 1. OSR considers a path successful if any visited point comes within a threshold distance of the goal. Therefore, methods with longer and more exploratory trajectories inherently have higher OSR, even if they do not stop correctly or follow instructions precisely. In contrast, STRIDER produces concise, forward-directed trajectories (TL = 8.47) without backtracking, yet still achieves a strong OSR. This suggests that our agent’s paths are well-aligned with the goal, not by chance, but through accurate perception and structured reasoning. STRIDER also ranks highest in SR and SPL, further indicating its strength in task completion and trajectory quality, rather than relying on incidental proximity.

We see this as a design trade-off: STRIDER avoids unnecessarily long or exploratory paths, and instead focuses on semantically faithful, efficient navigation.

Table 1: Comparison of TL and OSR

| Method | TL | OSR↑ | SR↑ |
|---|---|---|---|
| DiscussNav | 6.27 | 15 | 11 |
| Open-Nav-Llama3.1 | 8.07 | 23 | 16 |
| Open-Nav-GPT4 | 7.68 | 23 | 19 |
| Smartway | 13.09 | 51 | 29 |
| STRIDER | 8.47 | 39 | 35 |
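The OSR behaviour described above (any visited point near the goal counts, regardless of where the agent stops) can be illustrated with a short sketch; the coordinates and 3 m threshold are illustrative assumptions:

```python
import math

def oracle_success(trajectory, goal, thresh=3.0):
    """OSR credits an episode if ANY visited point comes within `thresh`
    metres of the goal, so longer exploratory paths are inherently favoured."""
    return any(math.dist(p, goal) <= thresh for p in trajectory)

# A wandering path that brushes past the goal but stops far away:
path = [(0, 0), (2, 0), (2, 2), (8, 8)]
goal = (3, 2)
print(oracle_success(path, goal))        # True: (2, 2) is 1 m from the goal
print(math.dist(path[-1], goal) <= 3.0)  # False: the final stop misses (no SR)
```

This is why a backtracking method with long trajectories can post a high OSR while its SR stays low.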

Question 1:

We would like to emphasize that our architecture adopts a modular design, where VLMs are used for perception and LLMs for reasoning. This design is not tied to any specific models and is model-agnostic across components of similar capabilities.

To verify this, we conduct an ablation study across multiple VLMs of varying sizes and providers (Table 2). The results demonstrate that while stronger models do yield better performance, STRIDER maintains competitive results even with smaller or different models, confirming that the design generalizes well across models of similar capacities and does not rely on any single pretrained model.

Table 2: Ablation on different VLMs

| VLM | TL | NE↓ | NDTW↑ | OSR↑ | SR↑ | SPL↑ |
|---|---|---|---|---|---|---|
| Qwen-VL-Max | 8.13 | 6.91 | 51.87 | 39 | 35 | 30.30 |
| Qwen2.5-VL-72B | 8.30 | 6.78 | 51.99 | 39 | 34 | 29.07 |
| Qwen2.5-VL-32B | 8.56 | 7.12 | 48.02 | 33 | 28 | 24.20 |
| Qwen2.5-VL-7B | 8.92 | 7.46 | 46.35 | 29 | 24 | 21.12 |
| GPT-4o | 8.01 | 6.75 | 50.12 | 39 | 36 | 31.37 |
| Gemini-2.5-Pro | 8.34 | 6.92 | 51.35 | 37 | 34 | 29.85 |
| Gemini-2.5-Flash | 7.68 | 7.08 | 49.87 | 34 | 29 | 25.30 |
| Claude-3.5 | 7.81 | 6.86 | 52.10 | 36 | 33 | 29.40 |
| Claude-4 | 8.22 | 7.14 | 45.25 | 31 | 29 | 26.10 |

Question 2:

We measure the average inference time per episode for both STRIDER and the Open-Nav baseline. STRIDER takes 3.9074 minutes per episode compared to 3.6606 minutes for Open-Nav. Despite incorporating additional reasoning steps via the Task-Alignment Regulator, STRIDER introduces only a minor runtime overhead. We believe this slight increase is acceptable given the clear performance improvements, and it confirms that STRIDER remains computationally practical for real-world applications.

Question 3:

Thank you for the insightful suggestion. We evaluate the effect of feedback quality by conducting an ablation study using open-source VLMs of varying sizes and capabilities, including Qwen2.5-VL-72B, 32B, and 7B (Table 2). The results confirm that higher-quality feedback (e.g., from larger or more capable models) improves performance. However, STRIDER continues to offer improvements over baselines even when paired with smaller models, suggesting that our structural guidance provides added value regardless of model scale.

Question 4:

STRIDER does not rely on the model generating an explicit “stop” signal.

Specifically, the LLM first decomposes the instruction into a series of subtasks, and the maximum number of execution steps is set equal to the number of subtasks, with a hard minimum of 6 steps to ensure completion. Once this step budget is exhausted, the system stops.

This approach avoids reliance on potentially unreliable stop predictions from the model and ensures stable behavior across tasks.
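The step-budget stopping strategy can be sketched as follows (the interfaces here are illustrative stand-ins, not STRIDER's actual API):

```python
def step_budget(num_subtasks: int, hard_min: int = 6) -> int:
    """Max steps = number of decomposed subtasks, floored at `hard_min`."""
    return max(num_subtasks, hard_min)

def run_episode(subtasks, execute_step, hard_min: int = 6) -> int:
    """Run until the step budget is exhausted, then stop unconditionally.

    `subtasks`: list produced by the LLM's instruction decomposition.
    `execute_step`: callback that selects and moves to the next waypoint.
    Returns the number of steps executed.
    """
    budget = step_budget(len(subtasks), hard_min)
    for _ in range(budget):
        execute_step()
    # The agent stops here regardless of any model-predicted "stop".
    return budget

# Example: a 3-subtask instruction still runs the 6-step minimum.
steps_taken = []
n = run_episode(["exit bedroom", "turn left", "stop at sofa"],
                lambda: steps_taken.append(1))
assert n == 6 and len(steps_taken) == 6
```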

Comment

Thanks for the authors' response. The explanation makes sense, but I maintain my concern about the trade-off between the performance gains and the latency of multiple VLM calls. Also, with VLMs performing high-level planning, perhaps instructing the model to generate an explicit "stop" action would produce a clearer evaluation.

Comment

Thank you for the comment. As shown in Table 1, STRIDER achieves an 84.2% improvement in Success Rate (SR) over Open-Nav (from 19 to 35), while the average latency per episode increases by only 6.7% (from 3.6606 min to 3.9074 min). Importantly, the number of VLM calls in STRIDER is only marginally higher than Open-Nav. Specifically, both methods make approximately 3–5 VLM calls per step, depending on the number of candidate waypoints. STRIDER adds only two extra VLM calls per episode: one for a global observation and one for task alignment. Compared to DiscussNav, which uses 12 VLM calls per step with longer latency but achieves only 11 SR, STRIDER is both faster and significantly more effective.

Table 1: Tradeoff between latency and performance

| Method | Latency per episode (min) | VLM calls per step | SR↑ |
|---|---|---|---|
| DiscussNav | 4.5162 | 12 | 11 |
| Open-Nav | 3.6606 | ~(3–5) | 19 |
| STRIDER | 3.9074 | ~(3–5) + 2 per episode | 35 |

As for the stopping strategy, we adopt the same stopping mechanism as our baselines Open-Nav [1] and DiscussNav [2]. Under this fair comparison, STRIDER achieves consistently strong performance across multiple metrics and model types. This demonstrates that our improvements come from the method's design itself, which we believe sufficiently supports the strength of our contribution.

We hope this provides further clarity and addresses the reviewer’s concerns.

Reference:

[1] Yanyuan Qiao, Wenqi Lyu, Hui Wang, Zixu Wang, Zerui Li, Yuan Zhang, Mingkui Tan, and Qi Wu. Open-nav: Exploring zero-shot vision-and-language navigation in continuous environment with open-source llms. arXiv preprint arXiv:2409.18794, 2024.

[2] Yuxing Long, Xiaoqi Li, Wenzhe Cai, and Hao Dong. Discuss before moving: Visual language navigation via multi-expert discussions. In Proceedings of the International Conference on Robotics and Automation, pages 17380–17387, 2024.

Final Decision

This paper presents a method to perform instruction-guided navigation in real world environments (e.g. VLN-CE). Towards this goal, the authors propose a framework that makes heavy use of VLMs and LLMs to reason about where the agent should navigate to next and propose a structured-waypoint generator to create navigable options.

After the rebuttal, the remaining concerns of the reviewers fall into three categories:

  1. The heavy use of VLMs/LLMs makes the method slow and possibly not applicable to real-time applications.
  2. The method is sensitive to the choice of VLM/LLM.
  3. The method relies heavily on the strength of modern VLMs/LLMs.

The AC disagrees with the reviewers.

  1. While the method is not applicable to real-time applications, its inference time is not meaningfully different from that of existing methods, such as DiscussNav[1] and Open-Nav[2] (Tab 1 in response to NbAg). It is also worth noting that, in the view of the AC, zero-shot VLN-CE has low enough success rates that performance, not inference time, is the current bottleneck for deployment.
  2. The sensitivity of methods to the choice of VLM or LLM is a common issue and present in many other published works (both in VLN-CE and more broadly). As pointed out by the authors in the rebuttal, Open-Nav[2] and CA-Nav[3] show similar sensitivity to the choice of VLM and LLM. Further, the majority of this sensitivity appears to stem from the underlying capabilities of the model rather than over-reliance on a specific model, e.g. the proposed method achieves very similar results with Qwen-VL-Max, Qwen2.5-VL-72B, GPT-4o, and Gemini-2.5-Pro (Tab 2 in response to NbAg).
  3. While the proposed method does heavily rely on the strength of modern VLMs/LLMs, this is only an issue if the improvements compared to existing work come solely from the choice of VLM/LLM. Experiments provided in the rebuttal show that while some of the improvements do come from these models, other significant improvements come from how the authors use these models (Table 1 in response to RNwD and mq4X).

Turning to the strengths of the paper, here the AC agrees with the reviewers. Specifically, reviewers noted the importance of the task (NbAG), the novelty and elegance of the proposed method (mq4X, iwMP, RNwD), the strength of its improvements over existing methods (mq4X, RNwD), and the value of the ablation studies presented (RNwD).

Overall, this paper presents a valuable contribution to the VLN-CE community, especially the more nascent zero-shot sub-community, and the criticisms levied against it have either been rebutted or do not warrant rejection. As such, the AC recommends accepting this paper.

[1] https://arxiv.org/abs/2309.11382, ICRA 2024

[2] https://arxiv.org/abs/2409.18794, ICRA 2025

[3] https://arxiv.org/abs/2412.10137