PaperHub
Overall rating: 4.8 / 10 (Poster · 4 reviewers · min 4, max 5, std 0.4)
Individual ratings: 5, 5, 4, 5
Confidence: 3.0 · Correctness: 2.3 · Contribution: 2.0 · Presentation: 2.0
NeurIPS 2024

MO-DDN: A Coarse-to-Fine Attribute-based Exploration Agent for Multi-Object Demand-driven Navigation

OpenReview · PDF
Submitted: 2024-05-10 · Updated: 2024-11-06
TL;DR

We propose a multi-object demand-driven navigation benchmark and train a coarse-to-fine attribute-based exploration agent to solve this task.

Abstract

Keywords
Modular Object Navigation · Demand-driven Navigation · Attribute Learning

Reviews and Discussion

Review
Rating: 5

The paper introduces a new benchmark Multi-object Demand-driven Navigation and trains various models on it.

Strengths

  1. This paper introduces a new benchmark, a type of contribution that is historically undervalued
  2. The ablation section is very clear and well written
  3. There is adequate coverage of baselines to my knowledge, although my knowledge may not be current on this area.
  4. The paper does achieve SOTA performance on many of the tasks relative to the baselines

Weaknesses

  1. Some of the figures are a bit hard to read
  2. Requires pretrained foundation models for the task
  3. The method section can be tough to follow at times
  4. Frequent grammatical errors and typos
  5. There are a variety of CLIP encoders available now, and it would be interesting to compare with newer models or other variants of OpenCLIP

Questions

Typo on line 82: incorrect capitalization. Typo on line 240: missing a. Why was that specific version of CLIP chosen? For the LLM component, why was the specific version chosen?

Limitations

Yes

Author Response

We highly appreciate the time and effort you put into reviewing our paper! We are very grateful to you for appreciating our benchmarks and experiments! We hope the following clarification will ease your concerns, and hope to hear back from you if you have further questions!

Q1: Some of the figures are a bit hard to read

A1: We apologize for the confusion caused by the figures. We will modify these figures to make them easier to understand. Figure 1 shows an example of a MO-DDN task: the robot visits five different locations and finds multiple objects that fulfill the user's demand. Figure 2 shows how we train the attribute features; the ends of the arrows represent the targets for loss computation, and the colors of the arrows represent different ways of computing the loss. Please read Figure 2 together with Section 4.1.3. We have drawn a new figure in the PDF attached to the Common Response that describes the switches between the coarse exploration phase (Figure 3) and the fine exploration phase (Figure 4), which we hope will help clarify the relationship between these two figures. Please see Common Response 1 for Method for a detailed description.

Q2: Requires pretrained foundation models for the task

A2: Thanks for pointing this out. The usage of pre-trained models is common in navigation, e.g. EmbCLIP[1], LM-Nav[2], MOPA[3]. These pre-trained models have good generalization due to training on large-scale datasets.

Q3: The method section can be tough to follow at times

A3: We apologize for the difficulty in understanding the method section. We will add more explanation and section summaries to enhance readability. Please see Common Response 1 for Method for a detailed description, and the Revision Plan at the bottom of the Common Response. We will also reflect these modifications in the video in the supplemental material, which will hopefully provide a better understanding.

Q4: Frequent grammatical errors and typos

A4: We appreciate you pointing out these typos and grammatical errors. We will revise the issues you mentioned and double-check the entire paper to correct grammatical issues and typos.

Q5: There are a variety of CLIP encoders available now, and it would be interesting to compare with newer models or other variants of OpenCLIP. Why was that specific version of CLIP chosen?

A5: Thank you very much for your advice. We use the official model provided by OpenAI, ViT-L/14, which is the most popular and most downloaded model among all of OpenAI's CLIP models on the Hugging Face website. We argue that the shared semantic space of vision and text provided by the CLIP model can effectively transfer attribute features to vision, which is important for the end-to-end model in the fine exploration phase (a similar conclusion is drawn in DDN). We add an experiment named Ours (ViT-H-14 Encoder), which uses OpenCLIP's ViT-H-14 model as you suggested; see the attached PDF in the Common Response for the experimental results. The results show that the larger CLIP model does slightly improve navigation performance, but the time and computational resources required for training also increase. These results still support the conclusion in our paper that attribute features improve navigation performance at both levels of the exploration phase.

Q6: For the LLM component,

There seems to be an unfinished question here, and we look forward to your suggestions about the LLM!

References

[1] Simple but Effective: CLIP Embeddings for Embodied AI

[2] LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action

[3] MOPA: Modular Object Navigation with PointGoal Agents

Comment

After reading the rebuttal and the other reviews, I am maintaining my score.

Comment

We greatly appreciate your response. If you have further questions, we are more than happy to answer them, and hopefully our responses will alleviate your concerns.

Regarding the lack of clarity in the writing of the method you mentioned, we have described it in more detail in the Common Response. We will also add this section to the paper to improve the readability of the paper. We are very grateful for your suggestions on our paper, which largely improved the readability of the paper and the clarity of the description of the method.

Comment

Dear Reviewer AuxK,

As the discussion period is drawing to a close, we would like to kindly invite any further questions or concerns you might have. We are eager to address these and hope that our responses will alleviate any remaining concerns.

We have rewritten the paper revision plan in the Common Response, in the hope that it will alleviate your concerns about the clarity and readability of our method.

Once again, we thank you for the time and effort you have devoted to reviewing our paper. We greatly cherish the opportunity to discuss with you.

Sincerely,

Authors of Submission 3733

Review
Rating: 5

The paper presents Multi-object Demand-driven Navigation (MO-DDN) which extends the DDN task to work with multi-object search and personal preference. DDN task aims to find an object in a navigation setting based on a given demand instruction. The paper proposes a new attribute model for multi-object DDN where the demand instructions and object categories are encoded into the same feature space. The attribute model is a VQ-VAE like model that uses attributes generated from GPT-4 as input. The attribute features are later used with a coarse-to-fine exploration agent for navigation and obtaining the solutions. The proposed method is evaluated on habitat with the HSSD dataset and the quantitative results show that both attribute training and coarse-to-fine exploration improve the performance.

优点

  1. The paper has good motivation for the MO-DDN task. Ideally for robot navigation, the robot should be able to consider all possible solutions and prioritize the most feasible one.
  2. Coarse-to-fine exploration is interesting and seems to be a good solution to the DDN task.

Weaknesses

  1. In line 41, the authors claimed that personal preferences are considered. However, I don't think it was addressed in the paper.
  2. It is still unclear to me how MO-DDN is superior to DDN, other than the solutions being multi-object rather than single-object. If we train DDN with preferred solutions, would it have superior performance on $SR_p$?
  3. Overall, I feel like section 4, especially section 4.2 is a bit disconnected which makes the section hard to follow. Also, Figure 3 and 4 are hard to understand too.
  4. It would be more helpful to show more examples of basic and preferred attribute features, and some corner cases where prioritizing becomes challenging.

Questions

  1. For results in Table 1, are $r_b$ and $r_p$ in Equation 2 adjusted to prioritize basic or preferred solutions? And how are the metrics (success rate for basic/preferred) evaluated, considering the baseline only works on one solution?
  2. The authors proposed attribute loss and matching loss in the attribute training. I am wondering how efficient are they and why is the weight of attribute loss larger than other loss terms.
  3. In line 137, a simpler version of the "Find" action is used in MO-DDN compared to DDN. Does this adjustment contribute to the performance improvement over the DDN method?
  4. Why does the MLP branch have a similar or worse performance compared to MOPA+LLM? And how is the branch chosen between MLP and LLM during navigation?

Limitations

The authors discussed the limitations of their work.

Author Response

We are very grateful for your time and effort in reviewing our paper! We appreciate your kind endorsement of our benchmark and method. We hope the following clarification will ease your concerns, and hope to hear back from you if you have further questions!

Q1: In line 41, the authors claimed that personal preferences are considered.

A1: Thanks for pointing this out. Due to character limitations, please see Common Response 3 for Preference.

Q2: how MO-DDN is superior to DDN other than the solutions are multi-object rather than single-object

A2: Thanks for your advice! A complete comparison between the MO-DDN and DDN benchmarks can be found in Table 6 in the Appendix. Our method can leverage knowledge from external large models to evaluate the small areas where objects are most likely to be found, and use end-to-end models to efficiently and quickly explore within these small areas. DDN, on the other hand, relies on an end-to-end model to explore a large area. What's more, the attribute features used by DDN are one-to-one between instructions and objects, which cannot be applied well to multi-object search. Using preferred solutions slightly improves the $SR_p$ of DDN, but does not exceed our proposed method. Please see Common Response 2 for Experiment for the DDN trained with preferred solutions.

Q3: Overall, I feel like section 4, especially section 4.2 is a bit disconnected which makes the section hard to follow.

A3: Thank you for pointing this out. We apologize for not making the method clear. We'll go into more detail about section 4.2 in the Common Response 1 for Method. We have provided a video in the supplemental material that may hopefully provide a better understanding. We will add summarizing paragraphs in the paper to illustrate the connections between each section.

Q4: It would be more helpful to show more examples.

A4: Thanks for the suggestion. We visualize two point clouds in the attached PDF. Each block is colored by its score, with darker colors representing higher scores. We find that adjusting the values of $r_b$ and $r_p$ changes the scores of the blocks, which in turn affects the behavior of the agent. $r_b$ and $r_p$ are hyperparameters that control the priority of basic and preferred solutions (see A1 and Common Response 3).

Q5: For results in Table 1, ...

A5: Thanks for pointing this out. Table 1 shows the results under $r_b=1$ and $r_p=1$. In a real deployment, these two hyperparameters can be modified by the user to flexibly prioritize the basic and preferred solutions. See Appendix A.1.2 for the calculation of the success rate. We have made some modifications to the baselines to make them trainable and testable on our task; see Appendix A.4.2 for how this was done. Briefly, for fairness, we modify all the baselines' action spaces to MoveForward, RotateLeft, RotateRight, LookUp, LookDown, and Find. Note that the baselines' policies do not output Done; Done is automatically executed only after the number of Find actions reaches the $c_{find}=5$ limit.

Q6: Attribute loss and matching loss in the attribute training.

A6: Thanks for pointing this out. The attribute loss directly trains the mapping of instructions and objects to attribute features, while the matching loss directly guides the alignment of the attribute features of objects and instructions. In attribute feature training, we have two objectives: first, to train two MLPs (the Ins MLP Encoder and the Obj MLP Encoder) that map the CLIP features of instructions and objects to the CLIP features of their attributes (referred to as attribute features); and second, to align the attribute features of instructions and objects in the same feature space. The attribute loss serves the first objective, while all other losses serve the second objective. We argue that the first objective is more important, because the alignment only makes sense when the mapping is correct; therefore, the weight of the attribute loss should be greater than the other losses. Without the attribute loss and matching loss, attribute features would degenerate into CLIP features, since the VQ-VAE loss would only map features into the codebook's feature space, which is initialized from CLIP features. We add an experiment named CLIP Exploration to show that CLIP features are not as good as attribute features in the coarse exploration phase, proving the effectiveness of the two losses.
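
To make the two objectives concrete, below is a minimal PyTorch sketch; the specific loss forms (MSE to the ground-truth attribute features, cosine alignment for matching) and the weights are illustrative assumptions on our part, not the paper's exact definitions.

```python
# Minimal sketch of the two training objectives described above.
# Loss forms and weights are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

CLIP_DIM = 768  # CLIP ViT-L/14 feature dimension

# Ins MLP Encoder / Obj MLP Encoder: map CLIP features to attribute features.
ins_mlp = nn.Sequential(nn.Linear(CLIP_DIM, CLIP_DIM), nn.ReLU(), nn.Linear(CLIP_DIM, CLIP_DIM))
obj_mlp = nn.Sequential(nn.Linear(CLIP_DIM, CLIP_DIM), nn.ReLU(), nn.Linear(CLIP_DIM, CLIP_DIM))

def attribute_loss(ins_clip, obj_clip, gt_ins_attr, gt_obj_attr):
    # Objective 1: predicted attribute features should match the CLIP features
    # of the GPT-4-generated ground-truth attributes.
    return F.mse_loss(ins_mlp(ins_clip), gt_ins_attr) + F.mse_loss(obj_mlp(obj_clip), gt_obj_attr)

def matching_loss(ins_clip, obj_clip):
    # Objective 2: align attribute features of matching instruction/object pairs.
    sim = F.cosine_similarity(ins_mlp(ins_clip), obj_mlp(obj_clip), dim=-1)
    return (1.0 - sim).mean()

def total_loss(ins_clip, obj_clip, gt_ins_attr, gt_obj_attr, vq_loss):
    # Attribute loss is weighted highest, as argued above (weights are placeholders).
    return (2.0 * attribute_loss(ins_clip, obj_clip, gt_ins_attr, gt_obj_attr)
            + 1.0 * matching_loss(ins_clip, obj_clip)
            + 1.0 * vq_loss)
```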

Q7: A simpler version of the "Find" action is used in MO-DDN compared to DDN.

A7: Thanks for pointing this out. This simpler version of the Find action is used in all baselines and our method, including DDN. None of the baselines are required to output the bounding box of the target objects. As we said in A5, we keep the action space fair. Moreover, the simpler Find action just removes the requirement of outputting the bounding box and does not affect the navigation success metrics.

Q8: Why does the MLP branch have a similar or worse performance compared to MOPA+LLM?

A8: Thanks for pointing this out. We argue that the MLP branch is weaker than MOPA+LLM largely because we provide ground-truth semantic labels for MOPA+LLM so that GPT-4 can decide whether to perform Find. The powerful inference capability of GPT-4 and the availability of ground-truth labels greatly enhance the decision correctness. The MLP branch does not use GPT-4 at any time and does not use ground-truth labels for decision making. The MLP branch is designed with the intention of abandoning dependence on external LLM resources and focusing on lightweight, purely local execution. In Table 1, we test the performance of the two branches separately, so the branch chosen within an episode is predetermined (of course, the branches can also be switched freely during navigation, but we did not test this). When deployed in real life, one of the branches can be freely chosen according to compute resources and LLM availability.

Comment

I appreciate the authors' responses. After reading other reviews and rebuttals, I would like to keep my rating.

Comment

We do appreciate you reading the rebuttal. We greatly appreciate your review, which enhances the readability and completeness of our paper. We are always ready to answer your further questions, and we hope that our answers will alleviate your concerns.

Comment

Dear Reviewer xf5s,

As the discussion period is drawing to a close, we would like to kindly invite any further questions or concerns you might have. We are eager to address these and hope that our responses will alleviate any remaining concerns.

We have rewritten the paper revision plan in the Common Response, in the hope that it will alleviate your concerns about the clarity and readability of our method.

Once again, we thank you for the time and effort you have devoted to reviewing our paper. We greatly cherish the opportunity to discuss with you.

Sincerely,

Authors of Submission 3733

Review
Rating: 4

The paper presents "MO-DDN," a novel benchmark and approach for Multi-object Demand-driven Navigation (MO-DDN), where an agent needs to find multiple objects to satisfy complex, user-specific demand instructions. The proposed approach leverages a coarse-to-fine attribute-based exploration strategy. The method includes the training of an attribute model using CLIP features and a VQ-VAE loss, followed by a dual-phase exploration process combining modular and end-to-end techniques. The experimental results demonstrate that this method outperforms several baselines on the HM3D ObjectNav datasets.

Strengths

  1. Novel Benchmark: The introduction of MO-DDN as a benchmark addresses real-life complexities in demand-driven navigation, considering multi-object searches and user preferences.
  2. Coarse-to-Fine Strategy: The paper presents an innovative coarse-to-fine exploration strategy that effectively combines the benefits of modular and end-to-end methods, optimizing both efficiency and performance.
  3. Comprehensive Evaluation: The method is evaluated rigorously against multiple baselines, showing significant improvements in success rates and navigation efficiency.
  4. Attribute Model: The use of attribute features trained with a discretized codebook and VQ-VAE loss is well-motivated and shows promising results in aligning demand instructions and object features.

Weaknesses

  1. Complexity and Implementation: The coarse-to-fine exploration strategy and the dual-phase approach add significant complexity to the implementation. This complexity might hinder the practical deployment and replication of the method in other settings.
  2. Generalization to Unseen Environments: While the method shows good performance on the HM3D dataset, its generalization to other datasets or real-world environments with different characteristics is not extensively evaluated.
  3. Fixed Attribute Numbers: The assumption of fixed attribute numbers (k1 and k2) for instructions and objects simplifies training but may limit the model's flexibility and applicability to real-world scenarios where attributes can vary widely.
  4. Dependency on Pre-trained Models: The method heavily relies on pre-trained models like CLIP and GPT-4 for attribute extraction and task generation, which might pose limitations in environments where these models do not perform well or are not available.
  5. Limited Discussion on Limitations: The paper provides limited discussion on the potential limitations of the proposed method and the challenges that might arise in different applications or under varying conditions.

Questions

  1. Scalability: How does the method scale with increasing complexity of demand instructions and the number of objects required to satisfy them? Is there a performance degradation when the complexity of the scene increases?
  2. Attribute Feature Space: The paper mentions using a discretized codebook for attribute features. How does the choice of the number of vectors (128) and their dimensions (768) affect performance? Would a different configuration yield better results?
  3. Adaptability to Dynamic Environments: How does the approach handle dynamic environments where objects might move or new objects might appear? Is the model capable of real-time adaptation in such scenarios?
  4. Robustness to Noise: How robust is the method to noisy or incomplete demand instructions? For example, how does it handle ambiguous or partially incorrect instructions from users?
  5. Energy and Time Efficiency: Given the dual-phase exploration strategy, what are the energy and time efficiencies of the proposed method compared to the baselines? Are there any trade-offs between computational cost and navigation performance?

Limitations

n/a

Author Response

We greatly appreciate your time and effort in reviewing our paper! We really value your recognizing the novelty of our benchmark and method and the comprehensive evaluation. We hope the following clarification will ease your concerns, and hope to hear back from you if you have further questions!

Q1: Complexity and Implementation

A1: Our modular method is easy to deploy, with multiple models connected to each other by a simple pipeline. Moreover, the computational resource consumption of our proposed method during deployment is low, requiring only a consumer-grade graphics card with 10 GB of memory (such as a GTX 1080 Ti or an RTX 3080) for complete deployment. We can also replace the object detection model with a more lightweight model in computing-resource-constrained situations. Training consumption is also very low throughout the method: only the end-to-end agent in the fine exploration phase needs to be trained, which requires about 12 hours on an RTX 4090 graphics card.

Q2: Generalization to Unseen Environments

A2: Thank you so much for pointing this out. We first need to point out that the scene dataset we use is HSSD, not HM3D. HSSD divides the scenes into training and testing scenes, and we follow this split. We fully evaluate the generalizability of our method, including to unseen environments and unseen task instructions, in Table 1. To test on other scene datasets, we would first need to build the set of task instructions for the corresponding dataset, which is very difficult given the time constraints. Generalization is an issue that needs to be addressed for all navigation tasks. We do not purposely design modules to improve generalizability in this paper; this is left for future research.

Q3: Fixed Attribute Numbers

A3: Thanks for the advice. This is indeed a limitation, and we have discussed it in Section 6 of the paper. We argue that four attributes are sufficiently expressive for common demands and objects. More flexible choices of $k_1$ and $k_2$ are left for future research.

Q4: Dependency on Pre-trained Models

A4: Thank you for pointing this out. The use of pre-trained models is common in navigation, e.g., EmbCLIP [1], LM-Nav [2], MOPA [3]. These pre-trained models generalize well in most scenarios after being trained on large-scale datasets. We also propose the MLP branch in the coarse exploration phase, which can run completely locally when GPT-4 is not available or computational resources are too limited to run an LLM locally.

Q5: Scalability

A5: Thank you for pointing this out. We expect that performance degradation is possible as the complexity of the demand instructions and the number of objects increase. Due to the limited object categories in the dataset, it is difficult for us to construct more complex demand tasks in a short period of time, so verifying this may need to be done on more complex scene datasets in the future.

Q6: Attribute Feature Space

A6: Thank you for pointing this out. We did some simple experiments (not included in the paper) on the number of vectors, testing 64, 128, 256, and 512. The metric chosen is similar to the cosine similarity described in Line 310 of the paper; in the end, 128 achieved the best cosine similarity. 768 was chosen because we wanted this codebook to be a subspace of the CLIP feature space, and therefore chose dimensions consistent with the CLIP ViT-L/14 version. Since the codebook only serves as an indirect alignment between object attributes and instruction attributes, we did not perform complete ablation experiments on different configurations. Our ablation experiments demonstrate that removing the VQ-VAE loss and the codebook initialization degrades the expressiveness of the attribute features, which in turn degrades navigation performance. Please see Common Response 2 for Experiment for OpenCLIP ViT-H-14's results.
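
For illustration, a nearest-neighbor quantization step against a CLIP-initialized codebook of 128 vectors of dimension 768 could look like the sketch below; the straight-through estimator and the commitment weight are standard VQ-VAE choices assumed here, not necessarily the paper's exact implementation.

```python
# Sketch of VQ-VAE-style quantization against a CLIP-initialized codebook (128 x 768).
# Details are assumptions for illustration, not the paper's exact code.
import torch
import torch.nn.functional as F

def quantize(attr_feat, codebook):
    """attr_feat: (B, 768) attribute features; codebook: (128, 768) tensor."""
    dists = torch.cdist(attr_feat, codebook)   # (B, 128) distances to codebook vectors
    idx = dists.argmin(dim=-1)                 # nearest codebook entry per feature
    quantized = codebook[idx]                  # (B, 768)
    # Standard VQ-VAE terms: codebook loss + commitment loss.
    vq_loss = F.mse_loss(quantized, attr_feat.detach()) \
            + 0.25 * F.mse_loss(attr_feat, quantized.detach())
    # Straight-through estimator so gradients still reach the encoder.
    quantized = attr_feat + (quantized - attr_feat).detach()
    return quantized, vq_loss

# The codebook would be initialized from CLIP features so that it stays a
# subspace of the CLIP feature space, e.g.:
# codebook = torch.nn.Parameter(clip_attribute_features[:128].clone())
```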

Q7: Adaptability to Dynamic Environments

A7: Thank you for pointing this out. The research on dynamic environments is necessary and important in the real world but beyond the scope of this paper. We assume, as most Object Navigation tasks do, that the scenes are static during navigation.

Q8: Robustness to Noise

A8: In the LLM branch, the reasoning over instructions is left to LLMs such as GPT-4, whose powerful reasoning capability also handles ambiguous instructions well. Our tasks are designed under the assumption that all demand instructions are correct; otherwise it is difficult to design the correct solutions. To illustrate the robustness of our method to sensor noise, we add the following experiment: we add Gaussian noise to the RGB-D camera (N(0, 3) to RGB and N(0, 0.03 m) to depth). The results show that our method does not suffer much performance degradation under noise, demonstrating robustness. Please see the attached PDF file in the Common Response for results.
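
For reference, the perturbation described above can be sketched as follows, assuming RGB values in [0, 255], depth in meters, and that the stated parameters are standard deviations.

```python
# Sketch of the sensor-noise experiment described above (assumed to be per-pixel
# Gaussian noise with std 3 on RGB and std 0.03 m on depth).
import numpy as np

def add_sensor_noise(rgb, depth, rng=None):
    rng = rng or np.random.default_rng()
    noisy_rgb = np.clip(rgb + rng.normal(0.0, 3.0, rgb.shape), 0, 255).astype(rgb.dtype)
    noisy_depth = np.clip(depth + rng.normal(0.0, 0.03, depth.shape), 0.0, None)
    return noisy_rgb, noisy_depth
```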

Q9: Energy and Time Efficiency

A9: As we said in A1, the model can be deployed on a consumer-grade graphics card. During our testing, inference on an RTX 4090 ran at around 5 FPS, which is acceptable for navigation (even in the real world). Compared to end-to-end baselines (VTN, DDN, ZSON), our method does increase computational consumption, but it also brings large performance gains. Compared to FBE+LLM and MOPA+LLM, the computational consumption is almost the same, but our method performs better.

References

[1] Simple but Effective: CLIP Embeddings for Embodied AI

[2] LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action

[3] MOPA: Modular Object Navigation with PointGoal Agents

Comment

Dear Reviewer 9oyu,

As the discussion period is drawing to a close, we would like to kindly invite any further questions or concerns you might have. We are eager to address these and hope that our responses will alleviate any remaining concerns.

We have rewritten the paper revision plan in the Common Response, in the hope that it will alleviate your concerns about the clarity and readability of our method.

Once again, we thank you for the time and effort you have devoted to reviewing our paper. We greatly cherish the opportunity to discuss with you.

Sincerely,

Authors of Submission 3733

Review
Rating: 5

The paper introduces a new task called Multi-Object Demand-Driven Navigation, where an agent is tasked with searching for multiple objects that satisfy a demand instruction specified by a user. The demand instruction is a natural language instruction which might directly or indirectly ask for a specific object. For example, the instruction "I'm thirsty" requires an agent to search for some object in the environment that can serve as a drink (a soft drink or water) and thus address the user's query. In MO-DDN the authors extend the DDN task to scenarios where an instruction might require the agent to search for multiple objects. In addition, the paper proposes a modular method that builds a coarse-to-fine attribute-based exploration agent by leveraging SLAM and multimodal models like CLIP. The key idea of the method is to finetune CLIP to map a demand instruction to features that capture the relevant attributes (one or more) needed to satisfy the goal. For example, the feature vector for the instruction "Find me a comfortable spot to work on a report" should contain relevant features for the objects required by such an instruction, which could be "desk, lounge chair, laptop/book". To learn such a feature space, the authors propose an attribute-based finetuning scheme that leverages synthetic data generated using GPT-4. Finally, the authors show that this method enables learning effective MO-DDN agents.

Strengths

  1. The paper is well written and easy to follow.
  2. The proposed task of formulating multi-object search as demand-driven navigation, where the objects required for the task are not necessarily explicitly specified, is an interesting task at the intersection of common sense and embodied AI and is valuable to study.
  3. The proposed method of distilling demand-based attributes into the CLIP feature space is intuitive, simple, and effective based on the results shown in Table 1.

Weaknesses

  1. The experiments section is missing one important baseline, where we evaluate how well pretrained CLIP features do zero-shot on this task without any attribute-based finetuning. It is important to establish that baseline to show the effectiveness of the proposed method that leverages CLIP finetuning for the task. I'd appreciate it if the authors could add those results.
  2. The authors mention the flexibility of the approach, which allows switching from the basic solution to the preferred solution for a demand instruction easily. However, the part about basic vs. preferred solutions and when the agent is expected to execute the preferred solution is unclear in the manuscript. It would be good if the authors could clarify whether either the basic or the preferred solution is acceptable for each task in the dataset, or whether there are specific instructions which require the agent to execute the preferred solution. If yes, how is that conveyed to the agent? Is it part of the instruction, or does the agent need to implicitly infer it?
  3. The details about the fine exploration policy are unclear. Is the fine exploration policy only called for low-level navigation action execution when an object with a specific attribute is detected on the map? The authors need to add more details in Section 4.2.2 to clarify the role of this module.

Due to the missing experiment mentioned in 1 and the clarifications needed for 2 & 3, I am currently giving the paper a borderline reject, but I'd be happy to increase my rating if the authors address my concerns.

Questions

Mentioned in weaknesses section

Limitations

Yes

Author Response

We are very grateful for your time and effort in reviewing our paper! We are also very appreciative that you have recognized the writing, the task setting, and the method proposed. We hope the following clarification will ease your concerns, and hope to hear back from you if you have further questions!

Q1: The experiments section is missing one important baseline

A1: Thanks for the valuable advice. We supplement this baseline experiment, named CLIP Exploration. In CLIP Exploration, the CLIP ViT-L/14 text encoder is used to obtain CLIP features for instructions and objects, and the sum of feature similarities is used to select the block in the coarse phase. Please see the Common Response's attached PDF for the results. The experimental results show that computing block scores with CLIP features is not as good as computing them with attribute features.

Q2: The authors mention about flexibility of the approach which allows to switch from basic solution to preferred solution for a demand instruction easily.

A2: Thanks for pointing this out. In our task settings, both basic and preferred demands are given to the robot at the beginning of an episode. GPT-4 then generates the basic and preferred attributes, which are encoded into basic attribute features and preferred attribute features. These two kinds of attribute features are used to calculate the block scores in equation (2) of the paper. The flexibility we mention is that when the robot actually serves a specific user, the user can tell the robot to prioritize basic or preferred demands by easily and flexibly adjusting the two hyperparameters $r_b$ and $r_p$ in equation (2) (please see Table 5 in the paper for their effects). $r_b$ and $r_p$ can be set by users at the beginning of an episode (they are of course adjustable during navigation, but we did not try this in our experiments). The higher the value of $r_p$, the more the robot tends to look for preferred solutions. In Table 1, we set $r_b=r_p=1$ for our main experiment. However, Table 1 and Table 5 show that finding the preferred solution is more difficult than finding the basic solution; this flexible design allows users to adjust the probability of being satisfied according to their own situation. For example, a user who prefers to drink coke can increase $r_p$ to motivate the robot to find the preferred coke. When the same user is very thirsty, they can adjust $r_p$ down and $r_b$ up, so that the robot can find basic solutions (e.g., water) with a higher success rate. Future work can consider letting the robot automatically select appropriate $r_b$ and $r_p$ from the user's instruction content and intonation; for example, when the robot detects from the intonation that the user is anxious, it can automatically increase $r_b$ and decrease $r_p$. Please see Common Response 3 for Preference for a more detailed description.

Q3: Authors need to add more details in section 4.2.2. to clarify the role of this module.

A3: Thank you very much for your advice. The fine exploration phase actually consists of an end-to-end navigation model. Once a target block is selected in the coarse exploration phase based on block scores, a habitat-sim built-in path planner generates a sequence of actions (such as $\mathrm{MoveAhead}$, $\mathrm{RotateRight}$, and $\mathrm{RotateLeft}$) for the robot to reach the selected target block (we make sure that the next action of the path planner stays in the explored region). When the robot reaches the inside of the selected target block, it enters the fine exploration phase. The end-to-end model continuously generates actions based on the current RGB-D image, and the actions are given to the robot to execute. When this end-to-end model outputs $\mathrm{Find}$, the habitat simulator records a list of objects in the RGB image, and the robot then returns to the coarse exploration phase to select the next block to explore. Such switches between the coarse and fine exploration phases do not stop until the $\mathrm{Find}$ count reaches the upper limit $c_{find}=5$ (we also discuss the limitations of doing so in Appendix A.5). Please see Common Response 1 for Method for a more detailed description.

Comment

The authors have demonstrated that the proposed method of finetuning CLIP to capture task-specific attributes indeed performs better than the zero-shot evaluation with the base CLIP model, which was the experiment I asked about. The authors have also clarified how the hyperparameters $r_b$ and $r_p$ are set as part of the task. They have also provided additional details about the fine exploration policy. There are still a number of details that are unclear about how the coarse and fine exploration policies are being used, and whether this is the right instantiation. I went through the original paper and the details added in the rebuttal, and here are a few things that are still unclear about the exploration policies:

  1. Coarse Exploration Policy: In this stage the authors mention that a map is built using RGB-D and pose information as the exploration policy navigates in an environment. One key detail missing/unclear from the description of this policy in the paper is: how is the exploration done in this phase? Is the policy using frontier exploration, or is there a smarter heuristic that leverages the finetuned CLIP feature scores to select waypoints/frontiers? I would appreciate it if the authors add those details.

Another concern I have is that the coarse exploration policy uses the shortest path planner from habitat to navigate to frontiers/waypoints in the coarse exploration phase. The path planner in habitat uses privileged information from the simulator to plan the shortest path. Have the authors tried using a fast marching planner/A-star on the map they built to evaluate performance? I would appreciate it if the authors clearly mention in the description that they use an oracle path planner for navigation if the FMM/A-star planner on the built map performs worse.

  2. Fine Exploration Policy: The paper mentions that the fine exploration policy is an end-to-end policy which is handed control after the coarse exploration policy navigates to a target block. The details regarding the training of this module are quite fuzzy throughout the paper and appendix. It is unclear what horizon of steps the fine exploration policy operates on. The authors need to add details of the dataset used for training this policy, how it was collected, how long the policy was trained for, and some statistics of the dataset used for training this module (e.g., avg. steps).

Another big concern this brings up is: why is the fine exploration policy an end-to-end policy? As the authors are already using object detection + mapping for the coarse navigation phase, wouldn't the agent always end up registering all relevant objects in the map during exploration, which it could then leverage to navigate to using a shortest path planner with information from the map?

Comment

We look forward to receiving your valuable and inspiring feedback!

Comment

We are very happy to receive your inspiring feedback! We hope the following clarification will ease your concerns, and hope to hear back from you if you have further questions!

Q1: how is the exploration done in this phase?

A1: In the coarse exploration phase, the goal of the robot is to build the scene's point cloud map and select a small area (called a block in the paper) that needs to be finely explored. We do not have an explicit exploration policy for efficiently building a point cloud of the scene; rather, we integrate this process into the selection of the target block. Concretely, we build the scene point cloud by setting a simple rule: choose blocks that have never been visited before. When selecting blocks, we prioritize never-visited blocks with high scores (the scores are calculated as described in equation (2) below).

$$s=\sum_{o \in block} \left(r_{p}\times \max_{i=1..k_2,\ j=1..k_1} f_{o}^{i} * f_{pref\ ins}^{j} \;+\; r_{b}\times \max_{i=1..k_2,\ j=1..k_1} f_{o}^{i} * f_{basic\ ins}^{j}\right) \qquad (2)$$

where $f_{o}^{i}$ denotes the $i^{th}$ attribute feature of object $o$ in the block, $f_{basic\ ins}^{j}$ denotes the $j^{th}$ basic attribute feature of the instruction (and likewise for $f_{pref\ ins}^{j}$), $*$ denotes cosine similarity, and $r_{b}$ and $r_{p}$ are adjustable weights for whether to find basic or preferred solutions.

In the above equation (2), the score of a block is determined by the cosine similarity between the attribute features of the instruction and the attribute features of the objects within the block, i.e., as you mentioned, we utilize the finetuned CLIP feature scores. And again, because each selection is the highest-scoring of the available blocks, there is a high probability that demand objects appear there, so the fine exploration phase is also more likely to find the goal. We argue that this method balances exploration and exploitation, and our ablation experiments in Table 2 demonstrate that such an exploration approach that utilizes the finetuned CLIP features is superior to frontier-based exploration (FBE).
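
As an illustration of how equation (2) could be evaluated for the discovered blocks, here is a small sketch; the array shapes and helper names are assumptions, not the paper's code.

```python
# Illustrative computation of the block score in equation (2).
# Shapes and names are assumptions for this sketch.
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def block_score(object_attr_feats, basic_ins_attr, pref_ins_attr, r_b=1.0, r_p=1.0):
    """object_attr_feats: list of (k2, 768) arrays, one per object in the block;
    basic_ins_attr / pref_ins_attr: (k1, 768) instruction attribute features."""
    score = 0.0
    for f_o in object_attr_feats:
        sim_basic = max(cos(f_o[i], basic_ins_attr[j])
                        for i in range(f_o.shape[0]) for j in range(basic_ins_attr.shape[0]))
        sim_pref = max(cos(f_o[i], pref_ins_attr[j])
                       for i in range(f_o.shape[0]) for j in range(pref_ins_attr.shape[0]))
        score += r_p * sim_pref + r_b * sim_basic
    return score

# The coarse phase then picks the highest-scoring block that has not been visited:
# target = max(unvisited_blocks, key=lambda b: block_score(b.objects, f_basic, f_pref))
```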

Q2: Have authors tried using a fast marching planner/A-star on the map they built to evaluate performance?

A2: Thank you very much for pointing this out. In fact, this was a concern when we chose to use the built-in planner. We were worried that the built-in planner would use points from unknown areas when planning paths. However, we found that with our processing, we can ensure that in the vast majority of cases (>98.7%), the built-in planner does not use privileged information from the simulator.

Since the built-in planner only outputs the next action at each timestep (e.g., $\mathrm{MoveAhead}$, $\mathrm{RotateRight}$, $\mathrm{RotateLeft}$), we impose some restrictions on its planning, such as requiring that the next step can only be a location that has already been recorded in the point cloud. We then compared the built-in planner with a planner we wrote ourselves (see the bottom of this QA) and found them to be almost identical in planning the next step (over 98.7% of the actions were identical in our pre-test of the planner). Considering the low planning speed of our own planner and the fact that the built-in planner only uses privileged information from the simulator in a very small number of cases, we ended up choosing the built-in planner as a trade-off between efficiency and realism.

For a fair comparison, we also use the built-in planner for other baseline modular methods such as MOPA and FBE. We will also be more explicit in the paper about the usage of the built-in planner.

Here we briefly describe how our planner works: (1) mark the points in the point cloud map that are obstacle-free above them as navigable points; (2) compress the navigable points into two dimensions (xy) and build an undirected graph according to the robot's action space; (3) find the shortest paths on the undirected graph using BFS.
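
A compact sketch of the three steps just described; the grid representation and 4-connected neighbor rule are assumptions for illustration.

```python
# Sketch of the simple planner described above: navigable points -> 2D grid graph -> BFS.
from collections import deque

def bfs_shortest_path(navigable_cells, start, goal):
    """navigable_cells: set of (x, y) grid cells that are obstacle-free above them;
    start, goal: (x, y) cells. Returns a list of cells or None if unreachable."""
    frontier = deque([start])
    parent = {start: None}
    while frontier:
        cur = frontier.popleft()
        if cur == goal:
            path = []
            while cur is not None:
                path.append(cur)
                cur = parent[cur]
            return path[::-1]
        x, y = cur
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):  # 4-connected moves
            if nxt in navigable_cells and nxt not in parent:
                parent[nxt] = cur
                frontier.append(nxt)
    return None  # goal not reachable on the explored map
```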

Comment

We appreciate your timely and valuable feedback!

Q1: Does this mean that the policy always knows how many "blocks" are in the environment i.e. are the authors assuming that agent already has access to 1 tour of the house where the "blocks" are constructed first?

A1: The robot does not know how many blocks are in the scene at first. These blocks are segmented from the point cloud built so far, based on the xy coordinates of the points. As the robot moves within the scene, newly explored points are continuously added to the scene's point cloud based on the current RGB-D input, and the number of blocks increases. The choice of blocks in the coarse exploration phase is also limited to those that have already been discovered. Therefore, the robot does not have access to a tour of the house to build the point cloud.

We visualize some runtime point cloud and block segmentation examples in Appendix A.3.1. As can be seen from the examples, the point clouds of the scene are incomplete and the blocks being segmented are limited to point clouds that have already been explored.

Thank you for pointing these out, and we'll add those details to the paper. Your valuable and constructive suggestions have nicely enhanced the readability and completeness of our paper.

Comment

Thanks for clarifying the details. My concerns have been addressed so I have increased my rating.

Comment

We are very happy to hear that your concerns have been addressed. We also appreciate you raising your rating.

Comment

Q3: The details regarding training of this module are quite fuzzy throughout the paper and appendix.

A3: Thank you so much for pointing this out. We describe the model architecture and how the trajectories were collected in Appendix A.3.2. We collected about 50,000 trajectories to train the fine exploration module on seen tasks and seen scenes. These trajectories were collected according to the following steps: 1) randomly select a scene and a task; 2) initialize the agent within two meters (i.e., the block size in coarse exploration) of a target object; 3) use habitat-sim's built-in planner to get the next step and let the agent execute it; 4) when the distance to the target object is less than 0.2 meters, turn left/right or look up/down according to the height and position of the object; 5) when the target object is in the field of view, execute $\mathrm{Find}$ and close this trajectory.

The average length of these trajectories is 9.19, with a standard deviation of 5.62. The median trajectory length is 8, the mode is 3, the maximum is 51, and the minimum is 2.

We trained the model on a single RTX 4090 using imitation learning with a cross-entropy loss, i.e., treating action prediction as a classification task, which took about 12 hours.
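
A minimal sketch of this imitation-learning objective, treating action prediction as classification over the planner's actions; the policy and data interfaces are placeholders.

```python
# Sketch of one imitation-learning step: cross-entropy over the discrete action
# taken by the planner at each timestep (interfaces are placeholders).
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

def train_step(policy, optimizer, rgbd_obs, attr_feats, expert_actions):
    """rgbd_obs, attr_feats: batched observations and attribute features;
    expert_actions: (B,) integer action labels from the collected trajectories."""
    logits = policy(rgbd_obs, attr_feats)   # (B, num_actions)
    loss = criterion(logits, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```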

We will also add the above details about the fine exploration phase in A.3.2.

Q4: Why is the fine exploration policy an end-to-end policy? As the authors are already using object detection + mapping for the coarse navigation phase, wouldn't the agent always end up registering all relevant objects in the map during exploration, which it could then leverage to navigate to using a shortest path planner with information from the map?

A4: We design the end-to-end policy for the fine exploration phase for two reasons below.

Reason one: It is possible to use the information about objects recorded in the point cloud and plan a shortest path to reach them, but there are a couple of drawbacks. (1) Demand objects may not be visible or recorded in the point cloud during the coarse exploration phase. (2) In demand-driven navigation, the target objects may be small and placed in complicated ways, such as a book, a small jar, or a pen, so it is difficult and inefficient to design a hard rule for localizing the target objects and recording them into the point cloud when the robot reaches the selected block.

To demonstrate this, we design a simple hard rule to replace the fine exploration phase and run a quick experiment: (1) after reaching the selected block in the coarse exploration phase, the robot rotates in a circle and records the observed objects using the object detection model; (2) the recorded objects are reported to the LLM, which determines whether there exists an object that satisfies the demand; (3) if such an object exists, the robot plans a path to go next to it. Quick experimental results on seen tasks and seen scenes show that such a hard rule only yields $SR_b$ 15.20, $SPL_b$ 3.42, $SR_p$ 9.83, and $SPL_p$ 3.16. We observe that the robot's in-place rotation provides an incomplete observation of the objects inside the block due to occlusions.

Thank you very much for the reminder; we will add this quick experiment as an ablation for Table 3 to demonstrate the need for fine exploration.

Reason two: Since we consider hard rules difficult to design (as also evidenced by the quick experiment above), we instead learn an exploration policy. Attribute-feature-based end-to-end policies have been shown to be effective in DDN, so we also train an attribute-based end-to-end policy in the fine exploration phase. The difference is that we remove the transformer decoder used in DDN and use only the transformer encoder as a feature extractor, to reduce GPU memory consumption and speed up inference, and we use the new attribute features as input.

Comment

Thanks to the authors for addressing my queries. I have another follow-up question on the coarse exploration policy. The authors mentioned:

A1: In the coarse exploration phase, the goal of the robot is to build the scene's point cloud map and select a small area (called a block in the paper) that needs to be finely explored. We do not have an explicit exploration policy for efficiently building a point cloud of the scene, but rather we integrate this process in the selection of the target block. Concretely, we build the scene point cloud well by setting a simple rule: choose blocks that have never been visited before. When selecting blocks, we prioritize the never visited blocks with high scores (the scores are calculated as described in equation (2) below).

Does this mean that the policy always knows how many "blocks" are in the environment i.e. are the authors assuming that agent already has access to 1 tour of the house where the "blocks" are constructed first?

As for my other questions, the response from authors have addressed my major concerns. I'd appreciate if the authors can add these details in the paper and make it explicit in writing so that it's easier for the reader to understand these details.

Author Response

Common Response

We are very grateful to all the reviewers and the AC for their time and effort. We highly thank the reviewers for their appreciation of our writing, benchmark, methods, and experiments: "The paper is well written and easy to follow." "The proposed task ... is an interesting task at the intersection of common sense and embodied AI and is valuable to study." (XjVA) "The paper presents an innovative coarse-to-fine exploration strategy that effectively combines the benefits of modular and end-to-end methods, optimizing both efficiency and performance." (9oyu) "The paper has good motivation for the MO-DDN task." "Coarse-to-fine exploration is interesting and seems to be a good solution to the DDN task." (xf5s) "The ablation section is very clear and well written." "The paper does achieve SOTA performance on many of the tasks relative to the baselines." (AuxK)

(Common Response 1 for Method) However, we notice that some reviewers (XjVA, xf5s, AuxK) have some confusion about our method and the figures. We next summarize our method section at a high level.

In Section 4.1, we train two MLPs (the Ins MLP Encoder and the Obj MLP Encoder in Fig. 2) to map the CLIP features of instructions and objects to the CLIP features of their corresponding attributes (referred to as attribute features). The purpose of the Attribute Loss is to allow the MLPs to correctly map CLIP features to ground-truth attribute features. The purpose of the other four losses is to align the attribute features of instructions and objects, guiding them into the same feature space.

In Section 4.2, we describe the coarse-to-fine exploration. Specifically, the robot labels the object categories in the RGB images and builds a scene point cloud as it walks. The point cloud is then segmented into a number of squares, called blocks, based on the xy coordinates of the points. At the beginning of an episode, the robot queries GPT-4 with the task instruction to obtain the basic and preferred attributes of the instruction, which are then encoded into the basic attribute features and preferred attribute features used in the subsequent computation of block scores. The robot calculates the score of each block using equation (2) and chooses the highest-scoring never-visited block as the target block. We use habitat-sim's built-in path planner to generate a sequence of actions that navigate the robot to the target block (we make sure that the next action of the path planner stays in the explored area). Please see our video in the supplementary material or Appendix A.3.1 for block visualizations. Once the robot reaches the target block, it enters the fine exploration phase, where the robot invokes an end-to-end navigation model (Section 4.2.2) to generate a sequence of actions based on the current RGB-D input. When the robot performs a Find action in the fine exploration phase, the objects in the current RGB image are recorded by the habitat simulator. The robot then switches back to the coarse exploration phase to plan the next target block. The switches between these two phases do not stop until the number of Find executions reaches the upper limit $c_{find}=5$, after which Done is executed and the current episode ends. We draw a new figure in the attached PDF to describe the switches between the coarse and fine exploration phases.
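
The switching logic summarized above can be sketched as the following loop; every callable passed in is a placeholder for the corresponding module in Sections 4.2.1 and 4.2.2, so this is an outline rather than the actual implementation.

```python
# High-level sketch of the coarse-to-fine switching logic described above.
# All callables are placeholders for the modules of Sections 4.2.1 and 4.2.2.
C_FIND = 5  # upper limit on Find actions per episode

def run_episode(get_instruction_attrs, update_map, select_block, go_to_block,
                fine_policy_step, record_objects, execute):
    basic_attr, pref_attr = get_instruction_attrs()    # GPT-4 attributes -> attribute features
    find_count = 0
    while find_count < C_FIND:
        update_map()                                    # coarse phase: extend point cloud / blocks
        block = select_block(basic_attr, pref_attr)     # highest-scoring never-visited block, eq. (2)
        go_to_block(block)                              # built-in planner over the explored area
        while True:                                     # fine phase: end-to-end policy
            action = fine_policy_step(basic_attr, pref_attr)
            if action == "Find":
                record_objects()
                find_count += 1
                break                                   # switch back to the coarse phase
            execute(action)
    execute("Done")
```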

(Common Response 2 for Experiment) We also note that the reviewers (XjVA, 9oyu, xf5s, AuxK) made valuable suggestions about our experiments. We add the following four experiments: (1) DDN (Preferred Trajectories): training DDN using trajectories with the preferred solutions; (2) CLIP Exploration: computing block scores using CLIP features rather than attribute features; (3) Ours (with Noise): adding Gaussian noise to the RGB-D camera (N(0, 3) to RGB and N(0, 0.03 m) to depth); and (4) Ours (ViT-H-14 Encoder): replacing OpenAI's CLIP ViT-L/14 model with OpenCLIP's ViT-H-14 model in the LLM branch of the Method section. See the attached PDF for the results.

(Common Response 3 for Preference) Some reviewers (XjVA, xf5s) have questions about the way we address personal preferences. We apologize for not clearly explaining the roles of the two hyperparameters $r_b$ and $r_p$. In our benchmark, we propose that personal preferences affect the behavior of the robot. In our method, we use the two hyperparameters $r_b$ and $r_p$ to adjust the prioritization of the basic and preferred solutions in equation (2). The two hyperparameters are set at the beginning of an episode. In Table 1, we set $r_b=r_p=1$ for our main experiment. In Table 5, we study the effect of these two hyperparameters on robot behavior. We provide a visualization example illustrating how $r_b$ and $r_p$ affect decision making in the attached PDF. From Tables 1 and 5 we can note that, in general, the success rate of satisfying the basic demands is higher than the success rate of satisfying the preferred demands. When deployed in a real environment, the two hyperparameters can be flexibly adjusted by users, allowing them to decide whether to prioritize basic or preferred demands at the moment. For example, a user who prefers to drink coke can increase $r_p$ to motivate the robot to find the preferred solution. When the same user is very thirsty, they can adjust $r_p$ down and $r_b$ up, so that the robot can find basic solutions (e.g., water) with a higher success rate. Future work can consider letting the robot automatically select suitable $r_b$ and $r_p$ from the user's instruction and intonation.

Revision Plan:

  1. We will add a description of the switches between the coarse and fine exploration phases and an associated pipeline figure in section 4.2. We will explain in more detail how preferences are addressed in our method.

  2. We will again double-check the paper for grammar and typos.

  3. In the experiment section, we will add our supplementary ablation and baseline experiments.

Comment

We would appreciate it if the reviewers and AC would read this comment.

We extend our deepest gratitude to all the reviewers. Their insightful suggestions have been invaluable in enhancing the quality of our paper. We recognize that reviewing a paper requires considerable time and effort, and many reviewers are also authors who must take time to address feedback on their own submissions. This understanding leads us to deeply cherish the opportunity to engage in meaningful discussions with the reviewers.

The reviewers paid close attention to the writing and details of the coarse-to-fine methods in our paper. As the deadline for author and reviewer discussions approaches, we make a plan for revising the paper based on the discussion with the reviewers. We hope that this revision plan will alleviate the reviewers' concerns.

1. We will clearly show the switches between Figure 3 and Figure 4. We have drawn a new figure in the PDF attached to the Common Response to show the specific switching pattern. We will also optimize the writing of the attribute training by highlighting the motivations for using the five loss functions and placing some of the less important engineering details in the appendix.

2. We will clarify the role of $r_b$ and $r_p$ in regulating preferences and the way they take effect. In particular, we will describe the advantage of using $r_b$ and $r_p$, i.e., the flexibility for the user to manually adjust preferences to meet the demands of different scenarios.

3. In Appendix A.3.2, we will add more details on how the end-to-end agent in fine exploration is trained, including the statistics of the dataset used for training and the training time. In the paper, we will also explain why there is a fine exploration phase (for detailed reasons, see our discussion with reviewer XjVA at https://openreview.net/forum?id=MzTdZhMjeC&noteId=wku5OOovow, Further Responses (Part 2/2)).

4. We will include the supplementary experiments in Table 1 to better demonstrate the superiority of our method.

We again thank the reviewers and AC for their time and effort!

Final Decision

This paper received divergent initial opinions with two reviewers in support of acceptance and two reviewers expressing concerns. The reviewer concerns revolved around missing a zero-shot baseline with pretrained CLIP features, lack of clarity about the distinction of basic vs preferred solutions and about the exploration policy, and overall complexity of the approach limiting its practical deployment. The rebuttal responded to reviewer concerns, and one initially negative reviewer found the clarifications to alleviate their concerns, thus raising their opinion to accept. The second initially negative reviewer did not engage in discussion despite repeated prompting by the AC. Moreover, this latter negative reviewer's review appears to be cursory and somewhat contradictory in parts. The AC does not find a basis to overrule the positive overall opinion of all the other reviewers. The AC thus recommends acceptance and strongly encourages the authors to incorporate clarifications and revisions to improve the final manuscript.