Test-Time Adaptation for Online Vision-Language Navigation with Feedback-based Reinforcement Learning
We introduce a novel test-time adaptation framework for online vision-language navigation using feedback-based reinforcement learning.
Abstract
Reviews and Discussion
This paper investigates the problem of Vision-Language Navigation (VLN) adaptation during deployment and introduces FEEDTTA, which uses feedback-based reinforcement learning. The main idea is to give the agent simple binary feedback after each navigation attempt (+1 for success or -1 for failure). To enhance learning from this binary feedback, they present Stochastic Gradient Reversion (SGR), a gradient regularization technique that helps maintain a balance between plasticity and stability.
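For concreteness, a minimal sketch of what a binary-feedback policy update of this kind could look like (the REINFORCE-style loss, function name, and arguments are assumptions for illustration, not the paper's exact objective):

```python
import torch

def feedback_tta_update(optimizer, episode_log_probs, success):
    """One adaptation step from a single navigation episode.

    episode_log_probs: list of log-probabilities of the actions taken
    success: binary outcome reported by the feedback oracle
    (Hypothetical helper; the paper's exact objective may differ.)
    """
    reward = 1.0 if success else -1.0                   # binary episodic feedback
    loss = -reward * torch.stack(episode_log_probs).sum()
    optimizer.zero_grad()
    loss.backward()                                     # gradients would then pass through SGR
    optimizer.step()
```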
Questions for Authors
No
Claims and Evidence
The claims in the paper are backed up by evidence from experiments. The authors show that FEEDTTA works better than Test-time Adaptation baselines, while being comparable to Offline-training based methods. The paper provides sufficient evidence.
Methods and Evaluation Criteria
The evaluation focuses on flexibility and interactivity, with standard evaluation protocol and a new metric called Adaptive Success Rate (ASR) to measure how well the agent adapts. This method is tested on three well-known benchmarks, which cover different navigation scenarios and instruction types.
Theoretical Claims
was not reviewed in depth
Experimental Design and Analysis
This paper tests on three benchmarks (REVERIE, R2R, and R2R-CE) and provides clear explanations of their results and why FEEDTTA performs as it does. The experimental design is sound. The comparison with both offline training methods and other test-time adaptation approaches provides a comprehensive evaluation of FEEDTTA's performance.
Supplementary Material
was not reviewed in depth
Relation to Prior Literature
I am not familiar with the literature in this area
Missing Important References
I am not familiar with the literature in this area
Other Strengths and Weaknesses
Strengths: The problem of test-time adaptation in VLN is interesting. The Feedback-based Test-Time Adaptation is simple and aligns well with how humans might interact with navigation agents in real-world scenarios.
Weaknesses: Figures 2, 3, 5, and 7 are not vector graphics and appear blurry.
Other Comments or Suggestions
Improve the quality of figures to enhance readability
Thank you for your positive evaluation of our work. We are committed to enhancing the quality of the figures in the final version upon acceptance. If there are any further questions or suggestions during the discussion period, we would greatly appreciate the opportunity to address them and further refine our work towards clear acceptance.
Thanks for the rebuttal. After reading the other reviews and the rebuttal, I recommend weak acceptance of this paper. I encourage the authors to revise the paper to incorporate the rebuttal, either in the main text or in the supplementary materials.
We greatly appreciate your supportive evaluation and recommendation for acceptance. We will incorporate the feedback from the rebuttal phase into the final version accordingly.
FeedTTA is a test-time adaptation (TTA) framework for online vision-language navigation (VLN). It utilizes an LLM for external interaction, providing binary feedback to the deployed navigation algorithm and establishing a feedback-based online reinforcement learning mechanism. By leveraging binary episodic feedback and gradient regularization, FeedTTA enhances adaptability while balancing plasticity and stability. Experiments demonstrate that FeedTTA outperforms state-of-the-art methods, excelling in unfamiliar environments.
Questions for Authors
See the above comments.
Claims and Evidence
1. I notice that the paper deploys FeedTTA on an A100 GPU, which is a high-end server-grade GPU rather than typical mobile hardware. This raises concerns about its feasibility in real-world navigation scenarios with limited computational resources. It would be helpful to discuss whether FeedTTA increases the computational burden on the base navigation algorithm and compare its hardware requirements and inference speed with standard VLN models to enhance practicality.
Methods and Evaluation Criteria
1. Can we leverage large language models (LLMs) to provide more detailed scene-aware feedback or sub-goal feedback instead of simple binary feedback? This might help mitigate the issue of extreme binary feedback.
2. You mention the issue of extreme binary feedback—does this directly correspond to the sparse reward problem in reinforcement learning?
3. I notice that you propose using SGR to generate counterfactual gradient information to address the extremity of binary feedback. Could LLM-generated counterfactual evaluations replace this approach? If so, would slightly less accurate LLM-based navigation evaluations actually improve performance?
4. The paper suggests that TTA should flexibly handle different navigation outcomes, yet it still relies on an LLM-generated binary feedback mechanism, considering only success or failure. Would a more diversified navigation evaluation be more effective? For example, a failed navigation attempt that still gets close to the goal, or a successful navigation where the agent deviates from the ideal final position.
Theoretical Claims
SGR (Stochastic Gradient Reversion) mitigates the extremity of binary feedback signals by introducing counterfactual gradient updates, which help smooth the learning process. However, its role in preventing catastrophic forgetting needs further clarification.
Experimental Design and Analysis
Can the current method be applied to visual navigation (VN) tasks, where there are no complex textual instructions and test failures stem solely from visual observations and scene layout differences?
Supplementary Material
yes.
Relation to Prior Literature
The paper proposes a novel feedback mechanism utilizing an LLM for TTA training and introduces the SGR module to address the binary extremity issue in feedback. Experimental results in the VLN setting outperform previous works.
Missing Important References
None
Other Strengths and Weaknesses
Strengths: The paper proposes a novel approach using an LLM for TTA training, introducing an innovative feedback mechanism to enhance adaptation.
Weaknesses: Would incorporating an LLM for feedback provide greater flexibility and lead to better performance? Does using an LLM introduce a higher computational burden for fine-tuning? Could modifying the feedback mechanism eliminate the need for the SGR module? These questions require further discussion.
Other Comments or Suggestions
See the above comments.
Thank you for the detailed feedback. Below, we provide our responses to each comment and hope they contribute to a better evaluation of our work.
Q1. Computational feasibility of FeedTTA
A : We first clarify that FeedTTA does not require high-end server-grade GPUs and can be efficiently deployed on practical hardware (e.g., GTX 1080). The trained DUET policy consists of 181.08M parameters, whereas FeedTTA trains only 78.67M parameters for adaptation—just 43% of the total—making adaptation highly efficient. Furthermore, FeedTTA increases memory usage by only 0.67%, requiring 4.42 GB compared to the 4.39 GB used by baseline VLN models during inference. Lastly, while FeedTTA introduces some latency due to backpropagation, its episodic updates do not impact real-time navigation performance, making it a practical and feasible solution for real-world deployment.
Q2. Can we leverage LLMs to provide more detailed feedback instead of simple binary feedback?
A : As discussed in Section 5.3, LLMs achieve at most 73% accuracy in predicting simple navigational outcomes. While this is sufficient to guide baseline policies, errors can accumulate, hindering stable adaptation. In our response to Q3 of reviewer vESv, we showed that dense step-wise rewards, though less efficient than sparse goal-based feedback at test time, still significantly improve performance. As LLMs advance in navigational reasoning, their potential for richer feedback remains an exciting research direction.
Q3. Could LLM-generated counterfactual evaluations replace SGR?
A : The counterfactual reasoning of SGR is a regularization technique applied to a limited number of parameters, meaning that a large portion of the parameters must still be updated based on proper feedback to achieve the intended functionality. Furthermore, while LLMs can indeed reason about counterfactual scenarios, their reliability in predicting navigation outcomes remains a challenge, making them unsuitable as a direct replacement for SGR.
Q4. Can FeedTTA be applied to VN tasks?
A : Yes, FeedTTA can be applied to VN tasks even in the absence of complex language instructions, as it only requires determining success or failure within the navigation system. To identify the dominant modality influencing navigation outcomes, we analyze navigation consistency for each trajectory in the REVERIE dataset, where each trajectory is paired with multiple language instructions. Specifically, we compute the average success rate across different instructions for each trajectory. We then identify trajectories with consistent outcomes—defined as those with a high (> 0.8) or low (< 0.2) average success rate—and calculate their proportion within the validation set. Our experiment yields a ratio of 0.72, suggesting that visual observations are a key factor not only in VN tasks but also in VLN, where they play a more decisive role compared to language variations.
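A minimal sketch of the consistency analysis described above (the data layout, field names, and helper name are assumptions for illustration):

```python
from collections import defaultdict

def consistency_ratio(episodes, high=0.8, low=0.2):
    """episodes: list of dicts with 'trajectory_id' and 'success' (0/1),
    one entry per (trajectory, instruction) pair -- field names assumed."""
    per_traj = defaultdict(list)
    for ep in episodes:
        per_traj[ep["trajectory_id"]].append(ep["success"])
    # a trajectory is 'consistent' if its mean success rate is extreme
    consistent = [
        t for t, outcomes in per_traj.items()
        if (sr := sum(outcomes) / len(outcomes)) > high or sr < low
    ]
    return len(consistent) / len(per_traj)
```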
Q5. How does SGR prevent catastrophic forgetting?
A : The expected absolute value (EAV) of the gradients quantifies the deviation from the case where neither forgetting nor adaptation occurs, and thus indicates the extent of policy forgetting and adaptation. For brevity, we omit the dimension index in the derivations below.
In a standard gradient update, the EAV is simply the expected magnitude of the raw gradient. For a sufficiently small learning rate and reversion probability, a first-order approximation shows that applying SGR scales the EAV by a multiplicative factor, reducing the effective gradient magnitude compared to the standard gradient update. The three regularizers differ in how this scaling factor behaves:
- Gradient Dropout: the scaling factor is fixed at a constant value.
- Gradient Scaling: the scaling factor is controlled by its coefficient, but is only bounded above.
- Gradient Reversion (SGR): the scaling factor is controlled by the reversion probability and is bounded both above and below.
This result demonstrates that reversing a subset of gradients, as proposed in SGR, provides a strategic way to balance plasticity and stability when adapting to unseen environments.
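For intuition, here is a schematic PyTorch sketch of the three regularizers applied element-wise to a gradient tensor. The selection probability and coefficient are placeholder assumptions, and the expectation-preserving scaling factor discussed in the paper is omitted for brevity:

```python
import torch

def gradient_dropout(grad, p=0.1):
    # zero out a randomly selected subset of gradient entries
    keep = (torch.rand_like(grad) >= p).float()
    return grad * keep

def gradient_scaling(grad, p=0.1, lam=0.5):
    # shrink a randomly selected subset of gradient entries by a factor lam
    selected = (torch.rand_like(grad) < p).float()
    return grad * (1.0 - selected * (1.0 - lam))

def gradient_reversion(grad, p=0.1):
    # SGR: flip the sign of a randomly selected subset of gradient entries
    flip = (torch.rand_like(grad) < p).float()
    return grad * (1.0 - 2.0 * flip)
```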
The paper introduces FEEDTTA, a test-time adaptation (TTA) framework for vision-language navigation (VLN) that uses binary episodic feedback to adapt navigation policies in unfamiliar environments. To maintain stability during learning from binary signals, the authors propose Stochastic Gradient Reversion (SGR), a technique that reverses gradient directions for randomly selected parameters. Experiments on REVERIE, R2R, and R2R-CE benchmarks show FEEDTTA outperforms other TTA methods and sometimes even surpasses offline training approaches.
Questions for Authors
- How would different orderings of test examples affect FEEDTTA's performance? Have you experimented with different sequence orderings to quantify this effect?
- The trajectory length often increases after adaptation. Could you provide more evidence that this represents beneficial exploration rather than inefficient navigation?
- How does FEEDTTA compare to approaches that use more informative feedback (beyond binary signals) in terms of adaptation efficiency? Is there a trade-off between feedback simplicity and adaptation speed?
Claims and Evidence
Most claims are well-supported by empirical evidence:
- FEEDTTA's superior performance over other TTA methods is demonstrated across multiple datasets and metrics.
- The effectiveness of SGR compared to gradient dropout and scaling is shown through ablation studies.
- The claim about LLMs as potential feedback oracles is supported, though with acknowledged limitations in reliability.
However, some claims require closer scrutiny:
- The claim of outperforming state-of-the-art offline methods is limited to specific configurations (primarily REVERIE with DUET) rather than being a general finding.
- The interpretation of increased trajectory length as beneficial "exploration" lacks rigorous justification.
Methods and Evaluation Criteria
The methods and evaluation criteria are appropriate for online VLN:
- Binary episodic feedback is a practical choice for real-world deployment scenarios.
- The introduction of ASR effectively measures adaptation capability by examining both preserved and converted success rates.
- The three benchmark datasets (REVERIE, R2R, R2R-CE) represent diverse VLN tasks.
The evaluation could be strengthened by examining how the sequence of navigation episodes affects adaptation performance, as online learning is inherently sequence-dependent.
Theoretical Claims
The theoretical claims are limited and mostly sound. The derivation of the scaling factor in SGR (in supplementary material) ensures consistency in expectation during gradient updates. The analysis of how SGR alleviates non-stationarity is reasonable but not formally proven.
Experimental Design and Analysis
The experimental designs are generally sound:
- Comparisons with other TTA methods (Tent and FSTTA) establish clear baselines.
- Ablation studies on feedback quality and quantity provide useful insights.
- The catastrophic forgetting analysis appropriately measures stability.
One limitation is the lack of analysis regarding sequence effects in online learning: different orderings of test examples might yield different adaptation results.
Supplementary Material
I reviewed the supplementary material, including:
- LLM prompts for feedback oracles
- SGR mathematical derivation
- Hyperparameter sensitivity analyses
- Trajectory visualizations
These materials effectively complement the main paper and substantiate its claims.
Relation to Prior Literature
The paper effectively bridges three research areas:
- VLN: It addresses a gap in online adaptation where previous works focused primarily on offline training.
- TTA: It identifies limitations of entropy minimization approaches for sequential decision-making tasks.
- Feedback-based RL: It adapts concepts from RLHF literature to navigation tasks.
The binary feedback mechanism builds on established sparse reward RL approaches, though the paper positions this in the novel context of test-time adaptation for VLN.
Missing Important References
The paper would benefit from discussing:
- Connections to continual reinforcement learning literature, particularly works addressing non-stationarity in online learning environments.
- Prior work on sample ordering effects in online RL, such as curriculum learning approaches.
- Research on uncertainty-aware navigation that could provide context for understanding where adaptation is most effective.
Other Strengths and Weaknesses
Strengths: Practical approach that requires minimal feedback, making it feasible for real-world deployment.
Weaknesses: Limited novelty in the basic approach, as online RL with sparse rewards is well-established.
Other Comments or Suggestions
N/A
Thank you for your positive evaluation of our work. We hope our response fully addresses all concerns and demonstrates the strength of our contributions.
Q1. How do different sequence orderings affect adaptation?
A : We agree that online learning is sequence-dependent. However, we show that the benefits of FeedTTA are invariant to sequence ordering through the following experiments with three different configurations. We use the 'validation unseen' split of the REVERIE dataset and compare against the DUET policy. For all configurations, the reported numbers for FeedTTA are averaged over 3 different seeds, with standard deviations reported in brackets.
1. General TTA
- In this configuration, all episodes are randomly ordered regardless of scene IDs, which corresponds to the experimental setting reported in Table 1 of our paper.
| Method | SR | SPL | RGSPL |
|---|---|---|---|
| DUET | 46.98 | 33.73 | 23.03 |
| FeedTTA | 65.33 | 42.63 | 28.71 |
2. Per-Scene TTA
- Here, we analyze the effects of random episode orders for each scene ID. Note that in this setting, adaptation is performed per scene, not throughout the entire validation set. The results below are shown as (DUET / +FeedTTA).
| Scene ID | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| SR | 48.63 / 60.78 | 72.22 / 77.16 | 32.65 / 37.51 | 46.85 / 51.75 | 43.34 / 51.09 | 30.30 / 39.46 | 44.84 / 56.76 | 50.60 / 64.57 | 45.89 / 71.54 | 55.67 / 69.47 |
| SPL | 30.74 / 40.80 | 56.61 / 61.51 | 19.88 / 22.79 | 37.27 / 38.98 | 25.55 / 29.31 | 19.81 / 24.73 | 35.21 / 30.53 | 34.36 / 38.59 | 29.54 / 50.29 | 44.91 / 56.42 |
3. Continual TTA
- For this configuration, we fix the episode orders for each scene ID, and set the adaptation sequence based on mixed scene ID orders, evaluating continual adaptation performances across different scenes.
| Method | SR | SPL | RGSPL |
|---|---|---|---|
| DUET | 46.98 | 33.73 | 23.03 |
| FeedTTA | 54.81 | 36.70 | 23.74 |
These experiments confirm that sequence ordering does influence navigation outcomes; however, the benefits of FeedTTA remain consistent, as evidenced by superior performance with low variance across different seeds.
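For reference, a minimal sketch of how the three orderings above could be constructed (the 'scene_id' field and helper name are assumptions for illustration):

```python
import random
from collections import defaultdict

def build_orderings(episodes, seed=0):
    """episodes: list of dicts, each with a 'scene_id' field (field name assumed)."""
    rng = random.Random(seed)

    # 1. General TTA: one stream, all episodes shuffled regardless of scene
    general = rng.sample(episodes, len(episodes))

    # group episodes by scene, preserving the original dataset order
    by_scene = defaultdict(list)
    for ep in episodes:
        by_scene[ep["scene_id"]].append(ep)

    # 2. Per-scene TTA: an independently shuffled stream for each scene
    per_scene = {sid: rng.sample(eps, len(eps)) for sid, eps in by_scene.items()}

    # 3. Continual TTA: fixed within-scene order, shuffled scene order
    scene_order = rng.sample(list(by_scene), len(by_scene))
    continual = [ep for sid in scene_order for ep in by_scene[sid]]

    return general, per_scene, continual
```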
Q2. Does increased trajectory length represent beneficial exploration?
A : We justify that the increased trajectory length (TL) indicates beneficial exploration by empirically testing the hypothesis: “The overall increase in TL primarily results from episodes that would have failed in the original navigation but succeeded after applying FeedTTA.” In the table below, we compare the increase in TL for navigation episodes that succeed after adaptation, categorized by their pre-adaptation outcomes before applying FeedTTA. For this experiment, we use the 'validation unseen' split of the REVERIE dataset with DUET as the base policy. We find that the average TL increase is significantly larger for fail-to-success cases than for success-to-success cases. This clearly highlights the role of FeedTTA in overcoming failure cases through extended exploration in unseen environments.
| | Success -> Success | Fail -> Success |
|---|---|---|
| Increased TL | 3.54 | 10.65 |
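A minimal sketch of the categorized TL analysis described above (the data layout and helper name are illustrative assumptions):

```python
def tl_increase_by_category(before, after):
    """before/after: dicts mapping episode_id -> (success: bool, trajectory_length: float).
    Returns the average TL increase for success->success and fail->success episodes.
    (Data layout is an assumption for illustration.)"""
    buckets = {"success->success": [], "fail->success": []}
    for ep_id, (succ_after, tl_after) in after.items():
        succ_before, tl_before = before[ep_id]
        if succ_after:  # only episodes that succeed after adaptation
            key = "success->success" if succ_before else "fail->success"
            buckets[key].append(tl_after - tl_before)
    return {k: sum(v) / len(v) for k, v in buckets.items() if v}
```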
Q3. How does FeedTTA compare to approaches that use more informative feedback (beyond binary signals)?
A : The rationale behind choosing a simple binary episodic feedback mechanism stems from the practical limitations of the online test-time navigation environment:
- Human involvement should be minimal, as following every navigation step to provide rewards is infeasible in real-world environments.
- Reward systems used in offline learning (e.g., step-wise distance-based rewards) are infeasible at test time, as we assume no access to ground-truth goal positions or pre-defined maps.
We empirically evaluate the efficiency of the feedback system by comparing our method with the step-wise distance-based reward system used in HAMT, where the feedback is defined as the reduction in distance to the target at each step. Additionally, if the agent successfully arrives at the goal position, a reward of +2 is given as a success signal, and -2 otherwise as a penalty. As the table below shows, our binary episodic feedback surpasses the distance-based dense reward system, even without access to ground-truth information. This clearly demonstrates that the proposed feedback mechanism is simple yet efficient and effective in improving navigation performance.
| Feedback Strategy | SR | SPL | RGSPL |
|---|---|---|---|
| Distance-based (Dense) | 63.25 | 42.89 | 28.46 |
| Goal-based (Sparse) | 66.49 | 45.38 | 30.75 |
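For concreteness, a rough sketch contrasting the two feedback strategies compared above (reward magnitudes follow the description; function names are illustrative assumptions):

```python
def dense_distance_reward(dist_before, dist_after, done, success):
    """Step-wise distance-based reward (requires ground-truth goal distance)."""
    reward = dist_before - dist_after           # progress toward the goal at this step
    if done:
        reward += 2.0 if success else -2.0      # terminal success signal / penalty
    return reward

def sparse_goal_reward(done, success):
    """Binary episodic feedback used by FeedTTA: only the final outcome is needed."""
    if not done:
        return 0.0
    return 1.0 if success else -1.0
```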
This paper proposes a method for adapting to unseen environments during the test phase in Vision-Language Navigation (VLN) using Large Language Models (LLMs). The reviewers noted the strengths of this work as its practical approach (vESv), novel methodology (gT5s), and the interesting nature of the problem (P7CW), and the overall evaluation leaned toward acceptance. The authors' rebuttal effectively addressed the reviewers' concerns, leading one reviewer to raise their score, and ultimately all reviewers recommended a weak accept. Therefore, the AC recommends accepting this paper. Incorporating the feedback received during the rebuttal phase into the final version is strongly encouraged.