Active Test-time Vision-Language Navigation
Abstract
Reviews and Discussion
This paper presents a novel test-time adaptation (TTA) method that leverages human (or self-) feedback to update the navigation policy. The method is evaluated on the Vision-Language Navigation (VLN) task using widely adopted benchmarks such as REVERIE, R2R, and R2R-CE. It is grounded in entropy minimization, and the authors further propose a mixture entropy optimization approach that decouples the entropy of the pseudo-expert distribution from that of the action distribution. A key contribution is a novel mechanism for incorporating human or self-feedback to dynamically maximize or minimize entropy based on the predicted outcome (success or failure). The method demonstrates strong performance, outperforming all prior TTA approaches, including those combined with active learning. This work represents a significant advancement in test-time adaptation and human-in-the-loop vision-language navigation.
Strengths and Weaknesses
- The innovative combination of test-time adaptation and human feedback has strong potential for real-world robot deployment with a human in the loop.
- The method is evaluated on multiple benchmarks against strong baselines and demonstrates promising results.
- The method relies on trial-and-error; that is, the navigation episode must either succeed or fail before the model can adapt. In other words, the robot can attempt a task multiple times to complete a given episode. This assumption may be impractical in scenarios where the robot must adapt immediately, for example, to avoid obstacles, since failure could result in a crash.
- The method also assumes that the robot can reset itself to the original starting position. For instance, in the Supplementary Figures, after completing Trial 1 and receiving human feedback, the robot begins Trial 2 from the reset position rather than from where it previously stopped.
- Real-world navigation typically involves a continuous, rather than discrete, action space. As a result, this method may not be directly applicable to real robots such as [NOMAD]. However, I am not aware of any existing TTA work that proposes entropy minimization for continuous action spaces, so this limitation is acceptable given the current state of the field.
[NOMAD] Nomad: Goal masked diffusion policies for navigation and exploration (ICRA'24)
Questions
- If I understand the method correctly, the feedback is binary and based solely on whether the trial succeeds. This feedback is applied to the entire action trajectory. Have you considered cases where the navigation is technically successful but inefficient, for example, when the robot takes an unusually long or suboptimal path?
- Test-time adaptation is typically used for handling out-of-domain inputs, such as visually corrupted observations. However, in your setup, all observations are clean and in-domain. What motivates the use of entropy minimization in this context?
- Have you tried alternative approaches that may not have worked but are worth discussing? For instance, since human feedback is available for the entire navigation trajectory, could reinforcement learning (RL) be applied during test time?
Limitations
Line 333: I don’t see a strong reason why your method cannot be applied to other visual navigation tasks. Most of these tasks share similar input and output formats; the main difference lies in the goal specification. The more critical limitation, as I noted earlier, is the reliance on trial-and-error and the assumption of a discrete action space, both of which may not always hold.
Final Justification
The authors have mitigated most concerns during rebuttal. I would like to maintain a positive score toward acceptance.
Formatting Issues
NA
[W1] Reliance on episodic trial-and-error
We fully agree with the reviewer's concern regarding the importance of immediate, step-by-step adaptation, particularly in scenarios requiring timely obstacle avoidance. However, the primary focus of our work was to emphasize how a simple human-robot interaction through episodic feedback can be utilized to control the prediction entropy and improve test-time adaptation. Therefore, our work addresses the distinct but equally crucial challenge of adaptation, where the goal is to learn from high-level task outcomes, a problem fundamental to language-guided navigation.
Furthermore, as demonstrated in our experiments and discussion of FSTTA, performing intermediate adaptations between navigation steps can sometimes degrade baseline performance, as frequent parameter updates at test time have the potential to erode the general navigation knowledge acquired during training. Also, as shown in Table 8 in the appendix, such adaptation methods significantly increase latency during the navigation rollout, making them computationally impractical compared to episodic methods like ATENA. Nonetheless, we agree this is a valuable direction, and we believe a key challenge for future work is to merge the benefits of our efficient, sparse adaptation with more immediate, step-wise corrections, without the performance degradation or high latency shown in our analysis.
[W2] Resetting the starting position
We would like to respectfully clarify that our method neither assumes nor requires the robot to reset itself to the original position. The scenarios illustrated in the supplementary material were a deliberate setup, designed to clearly show the direct impact of ATENA's adaptation on a single, controlled problem (i.e., before vs. after feedback). For all results in our main tables, we strictly adhere to the standard benchmark protocols, where each distinct test episode is attempted only once, without repetition.
However, we agree with the spirit of the reviewer's concern regarding real-world continuity. To directly address this, we conducted an additional experiment on the more realistic IVLN-CE [1] benchmark. In this setting, episodes are chained together—the end of one path is the start of the next—eliminating arbitrary resets and better simulating a single, continuous deployment. ATENA's episodic framework adapts naturally to this setting by treating each distinct instruction in the sequence as a self-contained episode for feedback and adaptation. Using the MAP-CMA [1] policy as our baseline, we evaluated ATENA under the 'Inferred via RedNet' map source with the 'Iterative' mapping procedure (Please see L14 of Table 4 in [1]), and demonstrated clear improvements across diverse metrics for the 'val_unseen' split:
| Method | SR ↑ | SPL ↑ | nDTW ↑ |
|---|---|---|---|
| MAP-CMA | 35 | 32 | 54 |
| w/ ATENA | 36 | 34 | 56 |
These strong results show that ATENA's adaptation mechanism is robust and effective not only in standard single-episode evaluations but also in more realistic, continuous navigation scenarios.
[1] Krantz, et al., "Iterative vision-and-language navigation.", CVPR 2023.
[W3] Entropy-based TTA in continuous action spaces
As the reviewer correctly noted, adapting to continuous action spaces using entropy-based TTA is a novel and under-explored area. In our work, we evaluated ATENA on the R2R-CE benchmark, which explicitly replaces the discrete action space of R2R with a continuous one to better simulate real-world robotic control. As shown in Table 3 of our paper, applying ATENA to baseline policies like ETPNav and BEVBert in this continuous environment leads to notable performance improvements. Furthermore, thanks to your constructive feedback, we have also demonstrated ATENA's performance on the IVLN-CE benchmark (please see [W2] above), which simulates a more realistic continuous environment. We believe our positive results on both of these challenging benchmarks represent a valuable step toward applying the method to real-world robotic systems.
[Q1] Success cases with suboptimal and inefficient trajectory
During training, the navigation policies explicitly learn to follow the optimal expert trajectories. However, even though the policy learns some general navigation knowledge through training, navigating in unseen environments presents significant challenges due to differences in environment structure and conditions. Therefore, we argue that a certain degree of exploration is necessary and allowed to facilitate successful adaptation in an unseen environment. Accordingly, we classify the task as successful even if the trajectory itself may appear suboptimal or inefficient compared to the optimal path.
To support our claim, we conducted an additional experiment in which we uniformly divided the test episode sequence of the 'val_unseen' split of REVERIE into four splits. Here, we compare the average trajectory length during adaptation against the average ground-truth trajectory length in each split. As seen in the table below, the first split shows a long average trajectory length compared to the ground truth, indicating early-stage exploration. However, as the episodes proceed, the gap narrows and the trajectories generated during adaptation become closer to the optimal path, with increasing success rates. This demonstrates that ATENA enables the agent to progress from initial, inefficient exploration to increasingly optimal navigation paths as it adapts, rather than simply settling for any kind of successful route.
| Scenario A. | Pred Length | GT Length | Pred:GT Ratio | SR ↑ |
|---|---|---|---|---|
| split 1/4 | 23.14 | 9.54 | 2.43 | 70.27 |
| split 2/4 | 21.13 | 9.37 | 2.26 | 86.49 |
| split 3/4 | 19.74 | 9.52 | 2.07 | 86.11 |
| split 4/4 | 13.73 | 8.31 | 1.65 | 91.67 |
| Scenario B. | Pred Length | GT Length | Pred:GT Ratio | SR ↑ |
|---|---|---|---|---|
| split 1/4 | 21.71 | 9.45 | 2.30 | 54.05 |
| split 2/4 | 20.06 | 9.26 | 2.17 | 91.89 |
| split 3/4 | 16.01 | 8.86 | 1.81 | 94.44 |
| split 4/4 | 14.44 | 9.18 | 1.57 | 94.44 |
[Q2] Test-time adaptation and the motivation for using entropy
The datasets we used (REVERIE, R2R, and R2R-CE) explicitly include 'validation_unseen' and 'test_unseen' splits, which evaluate navigation performance in entirely new indoor environments never encountered during training. Therefore, these validation splits are considered 'out-of-domain scenarios' in the literature [1,2], despite having visually clean observations.
Furthermore, test-time adaptation (TTA) methods are not limited to visually corrupted inputs but are broadly applicable to any domain shift encountered during inference [3,4]. In particular, entropy minimization is a prevalent TTA technique for adapting models with stochastic outputs, since it explicitly handles the output distribution in an unsupervised manner.
VLN policies also output a probability distribution over the action space, naturally leading to increased prediction uncertainty and high entropy in out-of-domain scenarios. For example, the baseline policy DUET shows significantly higher average entropy on the out-of-domain validation set (val_unseen: 6.345) than on the in-domain validation set (val_seen: 2.945). Therefore, entropy was a natural choice as a learning signal in our framework, where we further guide its optimization direction through episodic feedback, which clearly improves navigation performance and adaptation.
[1] Zheng, et al., "Towards learning a generalist model for embodied navigation.", CVPR 2024.
[2] Liu, et al., "Vision-language navigation with energy-based policy.", NeurIPS 2024.
[3] Zhang, et al., "DomainAdaptor: A novel approach to test-time adaptation.", ICCV 2023.
[4] Su, et al., "Beware of model collapse! fast and stable test-time adaptation for robust question answering.", EMNLP 2023.
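For concreteness, below is a minimal sketch of one way such an entropy gap could be measured; `policy` and the episode iterator are hypothetical stand-ins, and this is not our exact computation.

```python
import torch

def mean_action_entropy(policy, episodes):
    """Average Shannon entropy of the policy's action distribution over a split.

    `policy` and the (observation, instruction) iterator are illustrative
    placeholders for the agent's actual interfaces.
    """
    entropies = []
    with torch.no_grad():
        for observation, instruction in episodes:
            logits = policy(observation, instruction)       # [num_candidate_actions]
            probs = torch.softmax(logits, dim=-1)
            entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
            entropies.append(entropy.item())
    return sum(entropies) / max(len(entropies), 1)
```

A larger average entropy on val_unseen than on val_seen is the kind of gap that motivates using entropy as the adaptation signal.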
[Q3] Could RL be used as an alternative?
Thank you for bringing up this insightful question. Motivated by this perspective, we conducted additional experiments on the 'val_unseen' split of REVERIE, incorporating the Reinforcement Learning from Human Feedback (RLHF) paradigm into DUET. Specifically, for a fair comparison with our method, we provided episodic rewards of +1 for success and -1 for failure, and updated the policy using the REINFORCE policy gradient algorithm. The results in the table below show DUET + RLHF achieving notable improvements (SR: 60.98, SPL: 36.04), clearly surpassing the original DUET (SR: 46.98, SPL: 33.73) as well as established TTA baselines such as TENT (SR: 47.55, SPL: 33.99) and FSTTA (SR: 54.15, SPL: 36.41). These results strongly validate the effectiveness of leveraging episodic human interaction to enhance navigation adaptation. However, updating the policy with a fixed reward of +1 or -1, regardless of the episode, is limited in handling the stochastic nature of the policy and naturally yields lower performance than ATENA, which dynamically controls the confidence of the policy itself through Mixture Entropy Optimization.
| Methods | DUET | w/ TENT | w/ FSTTA | w/ RLHF | w/ ATENA |
|---|---|---|---|---|---|
| SR ↑ | 46.98 | 47.55 | 54.15 | 60.98 | 68.11 |
| SPL ↑ | 33.73 | 33.99 | 36.41 | 36.04 | 45.82 |
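For reference, here is a minimal sketch (not our exact implementation) of the episodic REINFORCE update described above, assuming the per-step action log-probabilities are collected during the rollout; the function and variable names are illustrative.

```python
import torch

def rlhf_episode_update(optimizer, step_log_probs, success):
    """REINFORCE update from a single binary episodic reward (+1 success, -1 failure).

    `step_log_probs` is a list of log pi(a_t | s_t) tensors collected while
    rolling out one episode.
    """
    reward = 1.0 if success else -1.0
    loss = -reward * torch.stack(step_log_probs).sum()  # gradient ascent on reward * log-prob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```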
[+] Additional Comments
Thank you for your thoughtful interpretation of our limitations. As you have pointed out, we have indeed demonstrated ATENA's generalizability to other visual navigation tasks, and we respectfully direct you to our response to Reviewer 1sz2 (W1) for detailed results. Regarding other limitations, we hope our responses have resolved your concerns about the trial-and-error system and action spaces. Thank you.
The authors address all my concerns, although they were not serious in the first place, as the paper already provided comprehensive experiments and showed good results. Moreover, the application of TTA in robot navigation, especially in a human-in-the-loop setting, is an understudied problem. I will maintain my score and wish you good luck.
[W1] Reliance on episodic trial-and-error
The answer is fair enough. I appreciate the application of test-time adaptation as human-in-the-loop feedback.
[W2] Resetting the starting position
Thank you for clarifying how the test is set and resolving my misunderstanding. I do not have further concerns about this.
[W3] Entropy-based TTA in continuous action spaces
Thank you for clarifying. This concern is resolved.
[Q1] Success cases with suboptimal and inefficient trajectory
I appreciate the authors conducting new experiments.
[Q2] Test-time adaptation and the motivation for using entropy
This answer is sensible. Thank you for clarifying it.
[Q3] Could RL be used as an alternative?
Fair answer. Thank you for conducting this interesting experiment. I would love to see it in the final version.
We are glad that all the concerns have been addressed. We will incorporate this valuable feedback into our final version. Thank you for your support towards acceptance!
This paper proposes ATENA, a test-time active learning framework that enables a practical human-robot interaction on uncertain navigation outcomes, to enhance the VLN performance in unseen test environments. Two core modules are constructed. Firstly, in the mixture entropy optimization module, the entropy is derived from a mixture of the action distribution and the pseudo-expert distribution. Then, this module minimizes entropy for successful navigation and maximizes it for failed ones. Secondly, a self-active learning strategy is designed, where the agent evaluates its navigation outcomes based on confident predictions, which enables the agent to actively engage through all iterations for adaptive decision-making. Experimental results on multiple VLN benchmarks, including REVERIE, R2R, and R2R-CE, demonstrate the superiority of the proposed approach over the compared baseline methods.
Strengths and Weaknesses
Strengths
1. The proposed approach has good generalization ability, and the experimental results show that the proposed approach brings significant performance gain compared with the baselines on multiple VLN benchmarks.
2. The motivation for designing mixture entropy optimization (MEO) and self-active learning (SAL) is described clearly. The paper is well-written and easy to follow. Figures and tables are nicely presented.
3. Supplementary materials are provided with implementation details, computational cost comparison, and trajectory visualizations.
Weaknesses
1. A major problem is that, in Section 3.1, the authors do not define the test setting used in the proposed approach in detail. Since there is an obvious difference between the traditional test setting and the one used here, it is important to describe the test setting concretely (ideally with formulations for clarification), e.g., what the agent's input and output are during testing, what form of information the human/agent oracles provide, and how the proposed approach is integrated into the agent's decision-making process, to clearly demonstrate the difference between the two kinds of test settings.
2. Since the proposed approach introduces prior information about the test data, the direct comparison with traditional baselines such as HAMT, DUET, BEVBert, etc., does not seem fair. Moreover, relying on a human oracle looks more impractical than the traditional test setting in real-world applications. I would like to see more explanation and clarification of these concerns.
3. In Lines 28-31, the authors claim that test-time adaptation is crucial for real-world robot navigation. However, there are no real-world experiments in the manuscript. Moreover, the results on the R2R-CE dataset, which is closer to a real-world setting, show that the proposed approach brings only ~1 point of improvement over the baselines ETPNav and BEVBert on Val Unseen. Such results raise doubts about the effectiveness of the proposed method in real-world settings.
4. In the trajectory visualization in the supplementary material, the authors do not clearly present and explain the intermediate results (e.g., the information provided by the human/agent oracle and the discriminative effect of MEO and SAL) showing how the proposed approach impacts the action decisions. As a result, the action decision process lacks interpretability.
5. In Section 2.1 of the Related Work section, some important references on zero-shot discrete/continuous VLN [1-4] are missing, e.g., MapGPT, Open-Nav, etc.
References
[1] Chen K, An D, Huang Y, et al. Constraint-Aware Zero-Shot Vision-Language Navigation in Continuous Environments[J]. arXiv preprint arXiv:2412.10137, 2024.
[2] Qiao Y, Lyu W, Wang H, et al. Open-Nav: Exploring Zero-Shot Vision-and-Language Navigation in Continuous Environment with Open-Source LLMs[C]//ICRA, 2025.
[3] Chen J, Lin B, Xu R, et al. Mapgpt: Map-guided prompting with adaptive path planning for vision-and-language navigation[C]//ACL, 2024.
[4] Long Y, Cai W, Wang H, et al. Instructnav: Zero-shot system for generic instruction navigation in unexplored environment[C]//CoRL, 2024.
Questions
As mentioned in the Weaknesses section, I would like to see: 1) a detailed definition and pipeline of the proposed test setting, 2) an explanation of the comparison fairness, 3) evidence of the effectiveness of the proposed approach in real-world experiments, and 4) a visual demonstration of the intermediate process of how the proposed approach impacts the action decision.
Limitations
Yes.
Final Justification
The authors have addressed most of my concerns, including the task definition of test-time vision-language navigation, comparison fairness, explanation of the R2R-CE performance, etc. Based on the rebuttal and the other reviewers' comments, I raise my score to 4 (Borderline accept).
Formatting Issues
N/A
[W1] Detailed test setting
We apologize for our lack of detailed explanations of the task setting in Section 3.1. We will revise the section in our final draft as follows:
3.1 Task Description
Traditional Test-Time Vision-Language Navigation
Vision-Language Navigation (VLN) requires an agent to follow a natural language instruction $I$ to navigate through a visual environment. In a standard offline evaluation, a pre-trained policy $\pi_\theta$ is frozen and evaluated on a batch of distinct test episodes $\{e_1, \ldots, e_N\}$. For each episode $e_i$, the agent starts at a given position, receives the instruction $I_i$ and the initial panoramic visual observation $O_0$, and sequentially selects actions. At each timestep $t$, the policy outputs a probability distribution over the action space, and the agent executes the action with the highest probability, $a_t = \arg\max_a \pi_\theta(a \mid I_i, O_t)$. This process is repeated until the agent selects a 'stop' action, generating a trajectory $\tau_i = (a_0, a_1, \ldots, a_T)$. Performance is then calculated by averaging metrics across all episodes. The key aspect of this setting is that the policy does not change during evaluation.
Online Setting with Active Policy Adaptation
Our work introduces an online, active adaptation setting that simulates a real-world deployment scenario. Unlike the traditional batch evaluation, test episodes are executed sequentially (i.e., with a batch size of one), allowing the policy to be updated between episodes. Specifically, after the agent completes a navigation episode, it receives simple, binary success/fail feedback. This episodic outcome is then used to guide an immediate adaptation of the policy parameters, and the updated policy is carried over to the subsequent navigation episode, allowing the agent to learn from its experience on the fly. By doing so, we can study how an agent adapts to overcome distribution shifts using simple, interactive feedback, rather than requiring dense, ground-truth supervision.
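To make the protocol concrete, here is a schematic sketch of this online loop (batch size one, post-hoc binary feedback, parameter update between episodes); the rollout, feedback, and loss functions are passed in as placeholders and the names are illustrative, not our exact implementation.

```python
import torch

def online_active_adaptation(policy, episodes, rollout, get_feedback, adaptation_loss, lr=1e-6):
    """Stream test episodes one at a time and update the policy between episodes."""
    optimizer = torch.optim.AdamW(policy.parameters(), lr=lr)
    outcomes = []
    for episode in episodes:                         # each distinct episode is seen once
        trajectory = rollout(policy, episode)        # policy is fixed within the episode
        success = get_feedback(trajectory, episode)  # post-hoc binary success/fail signal
        loss = adaptation_loss(trajectory, success)  # e.g., an entropy-based objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                             # updated policy carries over to the next episode
        outcomes.append(success)
    return outcomes
```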
[W2] Comparison fairness and practicality
1. Clarification on the comparison fairness
We would like to respectfully clarify the nature of the feedback system and our experimental design.
First, our approach does not use 'prior information' of the test data during adaptation. Rather, as mentioned in our response to [W1] above, the distinct test episodes are streamed only once, without repetition. Furthermore, the feedback is post-hoc, meaning that it is only provided after an episode is finished. This is a highly challenging yet practical setting where the agent must learn to adapt its policy for future, unseen episodes based solely on the high-level outcome of its past performance, without ever revisiting the same task.
Next, our comparison with baselines like DUET or HAMT is not meant to claim that ATENA is a superior model architecture, but to demonstrate that any given offline-trained policy can improve significantly when augmented with our ATENA framework in this online setting. The adaptation performance itself is fairly assessed by comparison against other TTA methods, TENT and FSTTA, applied to the same baseline policy. Our experiments clearly report the performance of the base policy and then the results after applying TTA methods, isolating the gains from the adaptation (e.g., 'DUET' and 'DUET w/ ATENA' in Table 1).
2. Practicality of the human oracle
We agree that relying on a constant human oracle would be impractical. However, our proposed framework was designed specifically to address this challenge and maximize autonomy:
- Minimal expertise and effort: As we have stated in L46, human feedback should require low effort for real-world practicality. In our framework, the feedback is a simple yes or no signal at the end of each episode, which is a highly practical form of human-robot interaction.
- Active learning based adaptation: The agent only requests human feedback for episodes where it is most uncertain (uncertainty-based thresholding in Sec 3.4.1). This active learning strategy reduces the need for human intervention by focusing only on the most informative scenarios.
- Self-Active Learning (SAL): As one of our key contributions, the SAL mechanism enables the agent to autonomously predict the outcome of its own navigation for episodes where it is confident. As shown in Table 5, SAL is highly effective and provides a substantial performance boost. This drastically reduces the reliance on the human oracle and is a core contribution toward practical real-world application.
[W3] Real-world Applicability
1. Lack of real-world experiments
We agree that sim-to-real transfer is a crucial step in evaluating test-time robotic navigation. As we do not possess a physical mobile robot, we are unable to conduct real-world experiments at this point. However, we are actively seeking opportunities to extend our work to the real world, which may open many new and challenging research directions.
Nonetheless, we believe that our simulated experiments provide a strong and necessary foundation for this line of research for several key reasons. First, simulation offers reproducibility and controlled evaluation, which is critical when assessing the impact of an adaptation algorithm. Second, simulation enables validation at a scale and diversity (thousands of varied scenarios in the REVERIE, R2R, and R2R-CE benchmarks) that would be infeasible in the real world. For these reasons, many recent studies emphasize these advantages of simulation, highlighting its reliability as a proxy for real-world evaluation [1,2].
[1] Li, et al. "Evaluating real-world robot manipulation policies in simulation.", CoRL 2024.
[2] Yang, et al., "EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents", ICML 2025.
2. Regarding the performance of R2R-CE
We would like to emphasize that the comparatively smaller improvements on the R2R-CE dataset, relative to the substantial gains on REVERIE, do not reflect a limitation of ATENA's generalization to real-world scenarios. Instead, they result from fundamental differences in task objectives and instruction granularity between the two datasets.
Specifically, the REVERIE dataset is designed as a goal-oriented task, characterized by high-level and coarse instructions. In contrast, R2R and its extension R2R-CE consist of detailed and fine-grained navigation instructions, naturally requiring step-wise guidance for policy training. Therefore, episodic feedback may be too sparse a signal to bring notable performance gains on the R2R variants compared to REVERIE.
In summary, the differential performance observed across datasets does not indicate a limitation in ATENA's ability to generalize to the real world, but rather highlights its particular strength in goal-oriented navigation scenarios.
[W4] Visual Demonstrations of the Intermediate Results
We apologize for the lack of interpretability of our visual demonstrations. Unfortunately, due to the rebuttal policy, we could not present improved visual illustrations during this phase. We will incorporate your feedback and replace the current version with clearer and more detailed figures that include the intermediate processes.
In the meantime, we would like to provide details of the current version and clarify once more our initial intention. To explicitly visualize how ATENA improves the trajectory, we sampled episodes that initially failed before adaptation, and repeated the same episode after processing the failure feedback through ATENA. Here, we observe that the policy selects a different action from the previous trial at major junctions (e.g., corridors connecting rooms or open spaces with multiple branches). This can be interpreted as our Mixture Entropy Optimization (MEO) using the 'failure' signal to penalize the overconfident but incorrect action by maximizing its entropy. This correction at the critical junction enables the agent to select the correct path on its next attempt, leading to a successful trajectory.
[W5] Additional references in related work
Thank you for the suggestion. We will revise the sentence in L92-L94 to incorporate the following discussions on the referred zero-shot navigation methods (bolded reference numbers are from the paper):
To overcome this discrepancy, recent research focuses on enhancing zero-shot navigation by leveraging Large Language Models (LLMs) for sophisticated planning and reasoning [1-4, 30,31,32]. Approaches like Open-Nav [2] and MapGPT [3] aim to equip agents with stronger intrinsic capabilities to interpret complex instructions and navigate unseen environments from the outset, without task-specific fine-tuning. However, their reasoning capabilities without fine-tuning have yet to yield reliable performance compared to fine-tuned policies.
[1] Chen et al., "Constraint-Aware Zero-Shot Vision-Language Navigation in Continuous Environments", arXiv 2024.
[2] Qiao et al., "Open-Nav: Exploring Zero-Shot Vision-and-Language Navigation in Continuous Environment with Open-Source LLMs", ICRA 2025.
[3] Chen et al., "Mapgpt: Map-guided prompting with adaptive path planning for vision-and-language navigation", ACL 2024
[4] Long et al., "Instructnav: Zero-shot system for generic instruction navigation in unexplored environment", CoRL 2024.
The authors have addressed most of my concerns, including the task definition of test-time vision-language navigation, comparison fairness, explanation of the R2R-CE performance, etc. Based on the rebuttal and the other reviewers' comments, I raise my score to 4 (Borderline accept).
We are pleased to hear that our response addressed your concerns and led to a positive evaluation of our work! Thank you very much.
The authors address the problem of domain adaptation in Vision-Language Navigation by proposing to further optimize the policy in the test environment through Mixture Entropy Optimization and Self-Active Learning. The experimental results demonstrate the effectiveness of the proposed method.
Strengths and Weaknesses
Strengths:
- The paper is clearly written and easy to follow.
- It focuses on a timely and important topic, and naturally integrates entropy and active learning techniques.
Weaknesses:
- The motivation for introducing a pseudo-expert distribution in the Mixture Action Distribution is not well explained, lacking an intuitive or theoretical justification. There are more general ways to control the sharpness of the distribution, such as introducing different temperatures in the softmax.
Questions
- In Section 3.3.2, could you provide the gradient update formulas before and after the modification to more formally illustrate that the method behaves as intended?
- Is the purpose of the trained Self-Prediction Head only to determine whether the task is completed, or does the additional training of the self-prediction head during testing also help improve the policy itself? What happens if the agent determines at deployment that the task is not completed? Also, what is the accuracy of the Self-Prediction Head? I did not see relevant data in the experiments.
- Could you comment on how the Self-Prediction Head would behave if it were trained only with human-verified feedback, instead of using its own pseudo-labels? I am concerned that the current self-training scheme may introduce drawbacks such as confirmation bias and error accumulation.
Limitations
Yes.
Final Justification
After reviewing all the current materials, I maintain my original assessment.
Formatting Issues
NA.
[W1,Q1] Motivation & justification for the pseudo-expert distribution
1. Theoretical analysis and formulation of the gradient updates
The core motivation and intuition for introducing the pseudo-expert distribution is to create a larger gradient gap between the selected action and all other non-selected actions. We provide a theoretical analysis of how MEO with the pseudo-expert distribution achieves a larger gradient gap than standard entropy optimization. For simplicity, we write the binary indicator in Eq (3) as $y = +1$ for success and $y = -1$ for failure, and formulate the general entropy objective for a probability distribution $q$ as $\mathcal{L}(q) = -\,y \sum_{a} q(a)\log q(a) = y\,H(q)$.
In standard entropy optimization, with $p_\theta(a)$ denoting the policy's action distribution at timestep $t$, the objective function is:
$$\mathcal{L}^{\text{std}}_t = -\,y \sum_{a} p_\theta(a)\log p_\theta(a).$$
Next, we compute the gradient of this loss w.r.t. the policy parameters $\theta$. Applying the product rule to each term inside the summation, the full gradient for standard entropy optimization is:
$$\nabla_\theta \mathcal{L}^{\text{std}}_t = -\,y \sum_{a} \nabla_\theta p_\theta(a)\,\big(\log p_\theta(a) + 1\big).$$
Leveraging the same structure, the gradient of the MEO loss with the pseudo-expert mixture $q_\theta(a) = (1-\lambda)\,p_\theta(a) + \lambda\,\mathbb{1}[a = \hat a_t]$ from Eq (1) in our paper (with selected action $\hat a_t$ and mixture weight $\lambda$) is given as:
$$\nabla_\theta \mathcal{L}^{\text{MEO}}_t = -\,y \sum_{a} \nabla_\theta q_\theta(a)\,\big(\log q_\theta(a) + 1\big).$$
According to the definition of $q_\theta$ in Eq (1), the parameters $\theta$ influence $q_\theta$ only through $p_\theta$. Therefore, $\nabla_\theta q_\theta(a) = (1-\lambda)\,\nabla_\theta p_\theta(a)$. Substituting this into the equation above, we get:
$$\nabla_\theta \mathcal{L}^{\text{MEO}}_t = -\,y\,(1-\lambda) \sum_{a} \nabla_\theta p_\theta(a)\,\big(\log q_\theta(a) + 1\big).$$
Each gradient can thus be seen as a weighted combination of the individual contributions $\nabla_\theta p_\theta(a)$, where the weighting term ($\log p_\theta(a) + 1$ or $\log q_\theta(a) + 1$) determines how much influence each action has on the final update direction. In other words, a larger gap between the weight on the selected action and the weight on an unselected action indicates a more focused update.
The weight gap of the standard optimization is:
$$\Delta^{\text{std}} = \log p_\theta(\hat a_t) - \log p_\theta(a) = \log \frac{p_\theta(\hat a_t)}{p_\theta(a)}.$$
For MEO, the weight gap is:
$$\Delta^{\text{MEO}} = \log q_\theta(\hat a_t) - \log q_\theta(a) = \log \frac{q_\theta(\hat a_t)}{q_\theta(a)}.$$
The corresponding ratio under the mixture distribution is:
$$\frac{q_\theta(\hat a_t)}{q_\theta(a)} = \frac{(1-\lambda)\,p_\theta(\hat a_t) + \lambda}{(1-\lambda)\,p_\theta(a)} = \frac{p_\theta(\hat a_t)}{p_\theta(a)} + \frac{\lambda}{(1-\lambda)\,p_\theta(a)}.$$
For any $\lambda \in (0, 1)$, this ratio is strictly larger than $p_\theta(\hat a_t)/p_\theta(a)$, and the same strict inequality holds in log-space: $\Delta^{\text{MEO}} > \Delta^{\text{std}}$.
Therefore, this larger gap formally demonstrates that our MEO approach creates a more focused and directional gradient signal than standard entropy optimization. It ensures that the final update to the policy parameters is dominated by the gradient contribution from the selected action, either reinforcing it on success or suppressing it on failure with greater precision.
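To make the objective above concrete, here is a hedged PyTorch-style sketch of the mixture entropy loss as formulated in this derivation (one-hot pseudo-expert, mixture weight $\lambda$); the interface and default values are illustrative and not copied from our code.

```python
import torch

def mixture_entropy_loss(logits, selected_action, success, lam=0.5):
    """L = y * H(q), with q the mixture of the policy and a one-hot pseudo-expert.

    Minimizing this loss sharpens the policy toward the selected action on
    success (y = +1) and flattens it on failure (y = -1).
    """
    p = torch.softmax(logits, dim=-1)
    pseudo_expert = torch.zeros_like(p)
    pseudo_expert[selected_action] = 1.0
    q = (1.0 - lam) * p + lam * pseudo_expert
    entropy = -(q * q.clamp_min(1e-12).log()).sum()
    y = 1.0 if success else -1.0
    return y * entropy
```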
2. Comparison with temperature scaling of softmax
The fundamental difference between our mixture distribution and temperature scaling of the softmax lies in the isolation of a specific action. Temperature scaling divides all logits by a uniform temperature to adjust the sharpness globally across all action pairs, whereas our mixture directly injects probability mass into one specific action, amplifying the gap between the selected action and the rest in a targeted way.
For example, consider a policy that outputs logits $(2, 1, 0)$ for three possible actions $(a_1, a_2, a_3)$, where $a_1$ is the selected action. A standard softmax (temperature $\tau = 1$) gives the distribution $(0.665, 0.245, 0.090)$, so the probability ratios $a_1\!:\!a_2$ and $a_2\!:\!a_3$ are both approximately $2.72$. Sharpening the distribution with $\tau = 0.5$ yields $(0.867, 0.117, 0.016)$, where both ratios grow to approximately $7.39$: the gap between the selected action and the unselected actions grows significantly, but so does the gap between the unselected actions themselves. On the other hand, applying our MEO mixture with weight $\lambda = 0.5$ yields $(0.833, 0.122, 0.045)$, with the two ratios being approximately $6.80$ and $2.72$. Notice that the gap remains exactly the same for the unselected actions, indicating that our approach preserves the policy's initial understanding of the relationships between the unselected alternatives, while directly focusing the signal to reinforce or penalize the selected action.
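A small self-contained check of the numbers above (the logits and mixture weight are illustrative choices, not values from the paper):

```python
import numpy as np

def softmax(z, tau=1.0):
    e = np.exp(np.asarray(z, dtype=float) / tau)
    return e / e.sum()

logits = [2.0, 1.0, 0.0]                               # a1 is the selected action
p = softmax(logits)                                    # standard softmax
p_sharp = softmax(logits, tau=0.5)                     # temperature scaling
lam = 0.5
q = (1 - lam) * p + lam * np.array([1.0, 0.0, 0.0])    # one-hot pseudo-expert mixture

for name, d in [("softmax", p), ("tau=0.5", p_sharp), ("mixture", q)]:
    print(f"{name:8s} {d.round(3)}  a1:a2 = {d[0] / d[1]:.2f}  a2:a3 = {d[1] / d[2]:.2f}")
# Temperature scaling enlarges both ratios; the mixture enlarges only the
# selected-vs-unselected ratio and leaves a2:a3 unchanged at ~2.72.
```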
[Q2,Q3] Questions regarding the Self-Prediction Head (SPH)
Q 2.1: What is the purpose of the trained SPH?
The purpose of the SPH is to build an automatic feedback system that evolves throughout the online iterations and gradually reduces the burden of human feedback. As iterations proceed, its accuracy in predicting the navigation outcome grows by learning from both self-predicted and human-verified feedback. Therefore, once trained to a certain degree, it can be utilized not only for determining task completion but, more importantly, for providing the agent with reliable feedback for adaptation in unseen environments with limited human feedback.
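As a hedged illustration of this feedback routing, the sketch below shows the gating logic in simplified form; the threshold, head interface, and names are our illustrative stand-ins, not the exact formulation in the paper.

```python
import torch

def episode_feedback(sph_logit, uncertainty, threshold, ask_human):
    """Return the binary outcome label used for adaptation after one episode.

    sph_logit   : raw scalar output of the Self-Prediction Head for this episode
    uncertainty : the agent's uncertainty estimate for this episode
    threshold   : uncertainty level above which the human oracle is queried
    ask_human   : callable returning the human's binary success/fail judgment
    """
    if uncertainty > threshold:
        label = ask_human()                            # uncertain episode: query the human
    else:
        label = torch.sigmoid(sph_logit).item() > 0.5  # confident episode: trust the SPH
    return label  # also reused as a pseudo-label to keep training the head
```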
Q 2.2: Does additional training of the head during testing also help improve the policy itself?
Additional training of the head on the policy, without employing Self-Active Learning (SAL), does have some effect on the parameters and performance (SR: 68.11→68.45 on the 'val_unseen' split of REVERIE). However, the improvement is marginal, and we argue that this training must be properly integrated into our proposed SAL framework to fully leverage its adaptive capabilities and realize the intended contribution of our approach.
Q 2.3: What happens if the agent determines at deployment that the task is not completed?
In that case, the agent will use its self-generated signal to update and re-plan a new trajectory. If the agent's determination is correct, it will penalize its past decisions and explore better alternatives. Even if it’s a false signal, it still pushes the agent to find a more efficient path. This process of seeking a more definitive success state simultaneously trains the prediction head to become more robust for future attempts.
Q 2.4: What is the accuracy of SPH?
We have provided the performance of the SPH in our appendix. The confusion matrix in Table 7 demonstrates high accuracy (82.19) and precision (80.78) of the SPH under the supervision of online pseudo-labels, making it a reliable source for self-training.
Q 3.1: How would SPH behave if it were trained only with human-verified feedback? (Concerns on error accumulation)
We agree that self-training via pseudo-labeling inherently suffers from error accumulation. However, pseudo‑labeling maintains widespread use across diverse machine learning fields [1-3]. Importantly, these works do not assume pseudo-labels are flawlessly accurate; instead, they demonstrate that the benefits of using pseudo-labels (e.g., improved generalization, enhanced robustness under limited annotations, and reduced annotation costs) clearly outweigh their drawbacks.
Similarly, our SPH employs pseudo-labeling to continuously guide robust adaptation during test-time navigation with limited human feedback. To address the reviewer's question more directly, we conducted an experiment in which the SPH is trained only with human-verified feedback. In this setting, the accuracy of the SPH is 78.93, lower than the accuracy under the default self-training setting (82.19, please see Q 2.4 above). This clearly indicates that a certain amount of noise and self-correction through pseudo-labels adds robustness to SPH training. Moreover, as demonstrated by our empirical results in Sec 4.5 of our paper, the practical advantages clearly surpass the risks associated with noisy labels, justifying our approach within the active test-time VLN framework.
[1] Wang, et al., "Deep insights into noisy pseudo labeling on graph data.", NeurIPS 2023.
[2] Zhang, et al., "Pseudo label refinery for unsupervised domain adaptation on cross-dataset 3d object detection.", CVPR 2024.
[3] Chen, et al., "SAT: Improving semi-supervised text classification with simple instance-adaptive self-training.", ACL 2022
Thank you for the authors’ response. After reviewing all the current materials, I maintain my original assessment.
We truly appreciate your evaluation and support towards acceptance. Thank you.
This paper introduces ATENA (Active Test-time Navigation Agent), a novel framework for Vision-Language Navigation (VLN) that addresses the challenge of domain shifts between offline training and online test environments. ATENA incorporates active learning at test time, using Mixture Entropy Optimization (MEO) to adjust the agent's confidence based on navigation success or failure. Additionally, it proposes Self-Active Learning (SAL), enabling the agent to autonomously assess its navigation outcomes and reduce reliance on human feedback. Extensive evaluations on benchmark VLN tasks (REVERIE, R2R, and R2R-CE) demonstrate that ATENA outperforms existing test-time adaptation methods, improving performance while maintaining efficiency and minimizing human intervention.
Strengths and Weaknesses
Strengths:
- ATENA introduces a novel approach to test-time adaptation in VLN by combining active learning with entropy optimization. The use of Mixture Entropy Optimization (MEO) and Self-Active Learning (SAL) makes it a significant advancement over existing methods that rely solely on entropy minimization.
- The paper provides thorough experiments on REVERIE, R2R, and R2R-CE, showing that ATENA outperforms baselines like TENT and FSTTA in multiple metrics, such as Success Rate (SR) and Success weighted by Path Length (SPL).
- ATENA is lightweight and efficient, suitable for real-world deployment with minimal human feedback. Its ability to adapt autonomously through Self-Active Learning makes it highly scalable and practical for real-time VLN tasks.
Weaknesses:
- The method is tested primarily on VLN tasks, and its generalization to other types of navigation tasks, like object-goal navigation, remains unexplored. Further evaluation in diverse settings is needed to assess its broader applicability.
- Although ATENA reduces human feedback reliance, it still depends on episodic feedback for uncertain episodes. Ambiguity or inconsistency in human feedback could introduce errors, affecting performance, especially in complex environments.
- The Self-Prediction Head contributes to reducing human feedback, but its accuracy is not flawless. False predictions could accumulate over time, potentially impacting the adaptation process and leading to suboptimal decisions.
- While ATENA is relatively efficient, it still introduces some additional adaptation latency compared to simpler methods like TENT. The minor increase in computational cost may be a limitation in highly time-sensitive environments.
- While Self-Active Learning (SAL) improves performance, its effectiveness in sparse feedback scenarios could be better evaluated. Further analysis on how SAL handles more challenging environments with limited feedback would strengthen the claims.
Questions
Please refer to the Weaknesses.
Limitations
yes
Final Justification
The authors have sufficiently addressed my main concerns. In light of the rebuttal, I have adjusted my score to 4 (Borderline accept).
Formatting Issues
N/A
[W1] Generalization to other navigation tasks
To demonstrate the general applicability of ATENA to other types of navigation tasks, we present additional experimental results on the Image-Goal Navigation task, where the agent is asked to navigate to and locate a given visual target. We use TSGM [1] as the baseline policy and report the adaptation result of ATENA in the 'hard' difficulty environment (please see 4.2 of [1] for details). Also, we use the mixture rate of and the learning rate of 1e-6. The clear increases in SR and SPL in the table below demonstrate ATENA's adaptability to the Image-Goal Navigation task as well, underscoring its potential as a general TTA method for robotic navigation. Furthermore, please refer to our response to [W2] of Reviewer zXCt, where we empirically demonstrate the capability of ATENA on the Iterative-VLN benchmark as well. We will incorporate these new analyses into the final version of the paper.
| Methods | Hard SR ↑ | Hard SPL ↑ |
|---|---|---|
| TSGM [1] | 70.30 | 50.00 |
| w/ ATENA | 72.49 | 53.23 |
[1] Kim et al., “Topological Semantic Graph Memory for Image-Goal Navigation”, CoRL 2022
[W2] Reliability of human feedback
We understand the reviewer’s concern on the reliability of human feedback. To address this, we conducted additional experiments by manually varying the proportion of verified human feedback from 50% to 100%, with incorrect feedback provided to the remaining proportion. This setting closely resembles the real-world scenario where even the human feedback may be noisy. The results on the 'val_unseen' split of REVERIE, averaged over three experiments conducted with different random seeds, are summarized below:
| DUET | w/ TENT | w/ ATENA (50%) | w/ ATENA (60%) | w/ ATENA (70%) | w/ ATENA (80%) | w/ ATENA (90%) | w/ ATENA (100%) | |
|---|---|---|---|---|---|---|---|---|
| SR ↑ |
Although a slight performance decrease is observed at the lowest feedback accuracy level (50%), ATENA consistently improved the DUET baseline at feedback accuracy levels of 60% and above. This clearly indicates that even when feedback includes up to 40% of incorrect signals, ATENA demonstrates robust and effective adaptation performance in complex environments. Moreover, we argue that the 40% noise level tested in this experiment represents a highly conservative estimate of real-world error. Providing a binary success/fail judgment is a trivial and straightforward task for a human user, suggesting that the actual feedback reliability in practice would be very high, placing performance closer to the upper end of our experimental results.
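For clarity, a minimal sketch of how such noisy feedback can be simulated (illustrative code under our reading of the protocol, not the exact evaluation script):

```python
import random

def corrupt_feedback(labels, accuracy, rng=None):
    """Flip a fixed fraction (1 - accuracy) of binary feedback labels, chosen at random."""
    rng = rng or random.Random(0)
    labels = list(labels)
    flip_idx = rng.sample(range(len(labels)), k=round((1.0 - accuracy) * len(labels)))
    for i in flip_idx:
        labels[i] = not labels[i]
    return labels
```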
[W3] Reliability of the Self-Prediction Head
We agree that the Self-Prediction head is not flawless; however, we empirically demonstrate its high reliability in the appendix (Table 7), with an overall accuracy of 82.19%. This functions as a self-training mechanism, a well-established paradigm where a model beneficially learns from its own high-confidence pseudo-labels to improve performance [1-3]. Importantly, our Self-Active Learning (SAL) strategy only uses these self-predictions on episodes with low uncertainty, while querying a human for more challenging cases, which inherently minimizes the risk of learning from incorrect labels. Therefore, this combination of high empirical accuracy and confidence-based filtering ensures the benefits of autonomous adaptation significantly outweigh the minimal risk of error accumulation from imperfect pseudo-labels. We also address a related concern in our response to Reviewer PkQQ (Q 3.1), which further discusses and emphasizes the reliability of using the Self-Prediction Head.
[1] Wang, et al., "Deep insights into noisy pseudo labeling on graph data.", NeurIPS 2023.
[2] Zhang, et al., "Pseudo label refinery for unsupervised domain adaptation on cross-dataset 3d object detection.", CVPR 2024.
[3] Chen, et al., "SAT: Improving semi-supervised text classification with simple instance-adaptive self-training.", ACL 2022
[W4] Minor increase in computational costs
To clearly highlight the efficiency of ATENA relative to TENT, we present an additional experimental analysis on the 'val_unseen' split of REVERIE, where we provide less feedback to the policy by varying the feedback interval (after every 1, 2, 4, 10, 20, and 100 iterations). As the results in the table below show, the total inference time drops proportionally as the feedback interval increases. Although the success rate also decreases with larger intervals, the performance stays superior to TENT at a much lower inference cost. Furthermore, we would like to emphasize that, since the increased latency comes mostly from the parameter updates between episodes, it does not affect the latency of the navigation rollout itself.
| DUET | w/ TENT | w/ ATENA (interval=100) | w/ ATENA (interval=20) | w/ ATENA (interval=10) | w/ ATENA (interval=4) | w/ ATENA (interval=2) | w/ ATENA (interval=1) | |
|---|---|---|---|---|---|---|---|---|
| SR ↑ | ||||||||
| Inf. Time (s) |
[W5] Performances in sparse-feedback scenario
As the reviewer mentioned, although the effectiveness of Self-Active Learning (SAL) is verified, assessing its robustness in sparse-feedback environments is crucial. Therefore, we analyzed this using the experimental setting and results from our response to [W4] above, and additionally report the impact of SAL at each interval. As seen in the table below, even though feedback is provided less frequently, ATENA still improves the baseline policy DUET and surpasses TENT, even in the extremely sparse feedback condition (interval = 100). Furthermore, when we remove SAL from ATENA, we observe performance drops at all intervals, indicating that SAL provides strong benefits regardless of feedback scarcity.
| DUET | w/ TENT | w/ ATENA (interval=100) | w/ ATENA (interval=20) | w/ ATENA (interval=10) | w/ ATENA (interval=4) | w/ ATENA (interval=2) | w/ ATENA (interval=1) | |
|---|---|---|---|---|---|---|---|---|
| SR ↑ | ||||||||
| SR w/o SAL ↑ | N/A | N/A |
The authors have sufficiently addressed my main concerns. In light of the rebuttal, I have adjusted my score to 4 (Borderline accept).
We are glad our response addressed your concerns. Thank you for your positive evaluation!
This paper provides a novel TTA method for improving vision-language navigation policies from sparse (binary) human feedback. The framework, ATENA, is based on two simple techniques: mixture entropy optimization (MEO) and self-active learning (SAL). Experiments in the paper show that ATENA outperforms existing TTA baselines across several VLN benchmarks, and further experiments from the rebuttal show that ATENA also brings gains on an image-goal navigation task and in continuous environments (IVLN-CE).
The rebuttal/discussion period significantly strengthened the paper, addressing all major concerns raised by the reviewers (regarding the generalizability of the method, the role of SAL and how accurate the SAL predictor needs to be, the required accuracy of human feedback, and more theoretical justification for MEO). In the end, the reviewers concluded with a positive impression of the work and its timeliness. I strongly encourage the authors to incorporate all proposed changes from the final remarks into the camera-ready, which will significantly strengthen the quality of the manuscript.