PaperHub
Overall rating: 6.0 / 10
Poster | 5 reviewers
Reviewer scores: 6, 6, 8, 6, 4 (min 4, max 8, std 1.3)
Confidence: 4.0
Correctness: 2.4 | Contribution: 2.4 | Presentation: 2.6
NeurIPS 2024

Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks

Submitted: 2024-05-13 | Updated: 2025-01-06
TL;DR

We propose a powerful agent with a hybrid multimodal memory architecture, Optimus-1, in Minecraft.


Keywords
Multimodal Agent, Multimodal Large Language Models, Multimodal In-context Learning

Reviews and Discussion

Review (Rating: 6)

This paper introduces Optimus-1, a retrieval-augmented generation method that enables Minecraft agents to excel in long-horizon tasks. The proposed method is based on a Hybrid Multimodal Memory Module that consists of an Abstracted Multimodal Experience Pool and a Hierarchical Directed Knowledge Graph. These two key components are mainly proposed to address the challenges of insufficient structured knowledge and lack of multimodal experience. Empirical evaluations show that such a memory module can effectively improve the performance of a Minecraft agent on long-horizon planning tasks. Ablative experiments are conducted to validate the effectiveness of each component.

Strengths

This paper addresses an important problem: transforming LLMs into autonomous decision-making agents.

The proposed method achieves a significant improvement over the prior state of the art in Minecraft, particularly excelling at long-horizon tasks such as obtaining diamonds.

Weaknesses

The writing of this paper can be improved. In particular, I would suggest a major rewriting of the method section that 1) follows a top-down organization to first talk about high-level ideas of the method before going to the details such as "1 frame per second" and "a window size of 16", and 2) reduces unnecessary complications of the terminology including Hybrid Multimodal Memory Module, Hierarchical Directed Knowledge Graph, etc.

Major claims in the introduction are not supported by empirical evidence. It is unclear whether GPT-4V and similar models actually lack sufficient structured knowledge, and related works on multi-modal agents do employ multi-modal experiences [1].

It seems that the use of Hierarchical Directed Knowledge Graph is limited to Minecraft where there is a strict Directed Graph relation between different objects. It is unclear whether this method can be helpful in the general setting.

The performances are evaluated on a set of custom benchmarks, and it will be good to have results on prior benchmarks reported by the baseline methods such as Voyager and Jarvis.

Minor: Line 25 "Early research [1, 6, 17] developed simple agents by constructing policy networks." seems unclear what those policy networks are.

Line 30, what long horizon tasks are the authors talking about?

Line 36 has a grammar error: "Insufficient of Structured Knowledge". Also, what is the empirical evidence that existing agents do not have structured knowledge?

Line 70 and line 71, two 30% seems repetitive.

Line 108, where is the subgoal coming from?

Line 115-116, what are those reflection phases?

In line 206, why did the authors construct a new benchmark instead of following the prior benchmarks on Minecraft?

[1] Zhang, C., Yang, Z., Liu, J., Han, Y., Chen, X., Huang, Z., … Yu, G. (2023). AppAgent: Multimodal Agents as Smartphone Users.

[2] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023

[3] Zihao Wang, Shaofei Cai, Anji Liu, Yonggang Jin, Jinbing Hou, Bowei Zhang, Haowei Lin, Zhaofeng He, Zilong Zheng, Yaodong Yang, et al. Jarvis-1: Open-world multi-task agents with memory-augmented multimodal language models. arXiv preprint arXiv:2311.05997, 2023.

Questions

See weaknesses

Limitations

Limitations are not adequately discussed in the paper. One potential limitation is that the performance of RAG methods may be capped by the performance of the base model.

Author Response

Q1: The writing of this paper can be improved.

A1: Thank you for your valuable suggestions. In the method section, we first introduce the core contribution of this paper, Hybrid Multimodal Memory, including its motivation, innovations, and components. Based on it, we developed a novel agent framework, Optimus-1. We follow a top-down organization to introduce Optimus-1’s components. Following your suggestions, we will improve the organization and logic of the paper to minimize potential misunderstandings for readers.

Q2: It is unclear whether GPT4V etc does not have sufficient structured knowledge and related works in multi-modal agents do employ multi-modal experiences [1].

A2: (1) Table 2 in the manuscript shows that in the Minecraft environment, removing structured knowledge from Optimus-1 (based on GPT-4V) leads to a significant decrease in success rates across all task groups. This shows that GPT-4V lacks sufficient knowledge of the Minecraft environment.

(2) AppAgent does not employ multimodal experiences. During its free-exploration stage, it records the effects of actions applied to different UI elements in the text modality only. In contrast, Optimus-1 not only records the tasks, environmental information, agent initial state, and plan in the text modality, but also records abstract visual information in the image modality. Furthermore, it dynamically summarizes long-sequence multimodal information, significantly reducing memory and retrieval costs.

Q3: It is unclear whether this method can be helpful in the general setting.

A3: In the future, we will extend the Hybrid Multimodal Memory to other domains. However, we believe that the current environment and experiments sufficiently demonstrate the contribution and effectiveness of our work. Please refer to the response to reviewer 3HCc, A3.

Q4: The performances are evaluated on a set of custom benchmarks, and it will be good to have results on prior benchmarks reported by the baseline methods such as Voyager and Jarvis.

A4: (1) Our benchmark is extensive and comprehensive, involving the most common long-horizon tasks in Minecraft. Moreover, we add the average steps (AS) and average time (AT) to complete a task as evaluation metrics, to better evaluate the efficiency of the agent. Furthermore, we have constructed a human baseline, which represents a major contribution compared to previous benchmarks.

(2) We have evaluated the performance of Optimus-1 on prior benchmarks reported by DEPS [1], Voyager [2], MP5 [3] (Table 15, Table 16, Figure 8 in the Appendix). Extensive experimental results demonstrate that Optimus-1 outperforms all baselines.

Q5: Minor: Line 25 "Early research [1, 6, 17] developed simple agents by constructing policy networks." seems unclear what those policy networks are.

A5: Policy networks in line 25 refer to models based on the Transformer architecture, trained through reinforcement learning/imitation learning. We will revise the wording to avoid potential misunderstandings by readers.

Q6: Line 30, what long horizon tasks are the authors talking about?

A6: In Minecraft, long-horizon tasks refer to complex tasks that require the agent to continually interact with a complex environment to complete a long sequence of sub-goals. We provide a detailed explanation and examples in Appendix C.3.

Q7: Line 36 grammar error "Insufficient of Structured Knowledge"

A7: We will revise it to 'Insufficient Exploration of Structured Knowledge'.

Q8: Line 70 and line 71, two 30% seems repetitive.

A8: We will remove “Optimus-1 closes the gap between agents and human player performance by 30%”.

Q9: Line 108, where is the subgoal coming from?

A9: As stated in lines 152-153, sub-goals are plans generated by the Knowledge-Guided Planner. We will revise the description of sub-goals in line 108 to avoid misunderstanding by readers.

Q10: Line 115-116, what are those reflection phases?

A10: As stated in line 173, reflection results from Experience-Driven Reflector are categorized as COMPLETE, CONTINUE, and REPLAN. We will revise the description of reflection phases in line 115 to avoid misunderstanding by readers.

Q11: Limitations are not adequately discussed in the paper. One potential limitation is that the performance of RAG methods may be capped by the performance of the base model.

A11: We discuss the limitations in Appendix B. Figure 5 in the manuscript indicates that the proposed hybrid multimodal memory (using RAG technology) is adaptable to GPT-4V and open-source multimodal large language models (MLLMs). Various MLLM-based versions of Optimus-1 have demonstrated performance improvements ranging from 2 to 6 times.

[1] Wang et al. Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents. 2023.

[2] Wang et al. Voyager: An open-ended embodied agent with large language models. 2023.

[3] Qin et al. MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception. 2024.

Comment

Thank the authors for the rebuttal. Thanks for pointing me to the comparisons in prior benchmarks. The rebuttal has clarified some of my concerns and I will adjust my score accordingly. However, there are still two important concerns remaining:

The writing of the paper can be improved.

As discussed above, the writing can be significantly improved. Currently the description of the technical section is very convoluted and can be better organized. Some of the terminologies are overly complicated, e.g. Hybrid Multimodal Memory, what's its difference with just Multimodal Memory? Abstracted Multimodal Experience Pool can be simplified to Multimodal Experience Pool. IMHO, such complication in terminology only makes it harder for readers to understand.

Furthermore, it would be great to ensure the major claims in the paper are scientifically supported by the experiments. In particular, the experiments only show that it is important for the agent to know the rules of Minecraft (e.g. how materials are synthesized); they are not enough to show that GPT-4V does not have structured knowledge.

Significant effort is required for the work to be publishable at a top-tier ML conference.

The application beyond minecraft remains unclear.

In particular, a potential reason for HDKG to work well in Minecraft is that there are very clear game rules in Minecraft (e.g. to make a stone axe we need stone and wooden sticks etc.), and it is very unclear to me whether such designs can generalize to more realistic settings like real-world web navigation agents or robotics. I did read the response to reviewer 3HCc but I am not convinced that a better method for Minecraft is interesting by itself, unless it can be easily generalized to more realistic settings.

Comment

Thank you very much for taking the time to discuss with us despite your busy schedule. Regarding your concerns, we respond as follows:

Q1: Currently the description of the technical section is very convoluted and can be better organized.

A1: We appreciate your valuable and constructive feedback, which will be pivotal in enhancing the quality of our work. To present the methodology with clear logic, we first introduce the proposed hybrid multimodal memory module, followed by a detailed description of the Optimus-1 architecture. The hybrid multimodal memory is coupled with the operation mechanism of Optimus-1, while the current version lacks detailed descriptions of sub-goals, the reflection module, etc. We will add these descriptions in Section 2.1.1 to ensure logical coherence and avoid reader confusion. Additionally, in each subsection, we will organize the content in a top-down manner, following the sequence of motivation, high-level idea of the method, and implementation details. We will revise the manuscript based on your suggestion. We also note that the reviewers found the manuscript well-written and easy to follow (R#397y, R#AfEc), so these revisions will not affect the main contribution of our work.

Q2: Some of the terminologies can be simplified.

A2: (1) These terminologies reflect the characteristics and innovations of the proposed methods. For example, the Hierarchical Directed Knowledge Graph links knowledge at different levels (wooden, stone, diamond) through directed graphs, forming a hierarchical knowledge graph, which differs from previous knowledge graphs. The Abstracted Multimodal Experience Pool abstracts long-sequence multimodal historical information into multimodal experiences, whereas existing multimodal memory mechanisms do not summarize multimodal information.

(2) Although we introduced the Hierarchical Directed Knowledge Graph (HDKG) and the Abstracted Multimodal Experience Pool (AMEP) in Section 1 of the manuscript, we frequently used the full names in subsequent sections out of concern that readers might forget or misunderstand the meanings of HDKG and AMEP. We will consider using the abbreviations throughout the manuscript to make it easier to read.

Q3: Difference between Hybrid Multimodal Memory and just Multimodal Memory.

A3: Hybrid Multimodal Memory is different from existing multimodal memory. As stated in the Introduction section of the manuscript, the hybrid multimodal memory module consists of structured knowledge (graphs) and multimodal experiences (text, image sequences). It stores multiple heterogeneous contents in a mixed manner and dynamically summarizes long-sequence multimodal information.

In contrast, existing agents only store text and images as multimodal memory and do not summarize them. For example, Jarvis-1 [1] stores text and image sequences without summarizing multimodal information.

Moreover, the comparison of memory mechanisms in existing Minecraft agents is shown in Table 7 of the manuscript.

Q4: Ensure the major claims in the paper are scientific as supported by the experiments.

A4: Thank you for your suggestions. We will revise the claim in the Introduction section of the manuscript to: "Existing Multimodal Large Language Models such as GPT-4V lack sufficient knowledge in Minecraft"; the results in Table 2 of the manuscript support this claim.

The rest of the responses are in the next comment.

Comment

This comment connects to the Responses to Reviewer wzp9 (1)

Q5: It is unclear whether such designs can generalize to more realistic settings like real-world web navigation agents or robotics.

A5: (1) To address your concerns about the generalization of our method to real-world scenarios, we applied Optimus-1 to the app agent scenario. We followed the environment and settings of AppAgent [2] and conducted comparative experiments on its benchmark (9 apps with a total of 45 tasks). The experimental results in the table below show that Optimus-1 outperforms the AppAgent and GPT-4 baselines. This reveals that Optimus-1 can generalize to more realistic settings, such as real-world app navigation agents.

Tab 1: Experiments on the benchmark of AppAgent. We report the success rate of the agent in completing 45 tasks.

Method       Success Rate
GPT-4        48.9%
AppAgent     73.3%
Optimus-1    86.7%

(2) The proposed Hybrid Multimodal Memory is a general architecture. As mentioned in lines 31 to 35 of the manuscript, we were inspired by the theory that 'humans benefit from knowledge and experience when performing long-sequence tasks', and propose a novel Hybrid Multimodal Memory structure that incorporates both knowledge and experience into the memory mechanism. We argue that it is general and can be adapted to different environments. In the app-scenario experiments mentioned above, the key step is obtaining the logical relationships between buttons or actions and converting them into a knowledge graph. Once these logical relationships are established, HDKG can easily be adapted to the app environment. As for AMEP, it can be simplified to store the task prompt, images, and actions for each atomic operation.

We hope that the results of these experiments can address your concerns, and we would greatly appreciate it if you could consider giving us a higher rating. If you have any further questions, please feel free to contact us.

[1] Wang et al. Jarvis-1: Open-world multi-task agents with memory-augmented multimodal language models. 2023.

[2] Zhang et al. AppAgent: Multimodal Agents as Smartphone Users. 2023.

Comment

Thank the authors for the additional response on the clarity of the paper and the results on real-world app navigation. I am more convinced that the proposed method is general and will increase my score. I hope that these new results and the rewriting will be reflected in a future version of the paper.

Comment

We are pleased to have addressed your concerns. We will incorporate your valuable suggestions in future revisions and add more experiments in the Appendix. Additionally, we have built an official repository to provide well-structured open-source codes and a project page (to be released upon acceptance).

Review (Rating: 6)

The paper tackles the long-horizon tasks in Minecraft by building a pipeline based on multimodal LLM. Specifically, it proposes to store multimodal memory during agent exploration and a knowledge graph that stores the causal relations between objects and tasks. Additionally, a self-reflection mechanism is used to improve the textual actions. Finally, the textual actions are executed by a pre-trained action controller to output the low-level actions.

Strengths

  • The core idea is intuitive: multimodal experiences can help embodied agents with long-horizon tasks, since they can be retrieved at test time to aid decision making.
  • The paper is overall written clearly and sufficient details are provided in the appendix.

Weaknesses

  • There seem to be too many individual components in the paper, most of which are claimed to be important. This makes it unclear to what extent the claimed contribution generalizes to broader settings beyond Minecraft (or even beyond the evaluated tasks within Minecraft), as it is also highly plausible that the entire pipeline is carefully designed specifically for the evaluated tasks.

Questions

See "weaknesses" section above.

Limitations

The limitations are described in the paper.

Author Response

Q1: There seem to be too many individual components in the paper, most of which are claimed to be important components in the paper, which makes it unclear to what extent the claimed contribution generalizes to broader settings beyond Minecraft (or even just beyond the evaluated tasks in Minecraft), as it is also highly plausible that the entire pipeline is carefully designed specifically for the evaluated tasks.

A1: (1) The core innovative module of Optimus-1 is the Hybrid Multimodal Memory. It includes 1) a novel memory module called the Hierarchical Directed Knowledge Graph, which is highly structured and easily updated, enabling concise representation and storage of complex knowledge; 2) a novel method for constructing the Abstracted Multimodal Experience Pool, which dynamically summarizes long-sequence multimodal information, encompassing both global overviews and local details of multimodal experiences. Building on it, the Knowledge-guided Planner utilizes the HDKG to enhance task planning capabilities, while the Experience-Driven Reflector leverages the AMEP to improve reflection abilities. Both represent improvements over the planner and reflector of existing agents.

(2) Optimus-1 can adapt to different settings and generalize to various Minecraft tasks, consistently outperforming all baselines across multiple benchmarks [1] [2] [3] (Figure 8, Table 15, Table 16 in the Appendix).

(3) Our proposed Hybrid Multimodal Memory is easily adaptable to other domains. Taking the app agent [4] as an example, the key step involves transforming the knowledge structure from Minecraft's object synthesis relationships into logical relationships between buttons or operations. Once these logical relationships are established, HDKG can easily be adapted to the app environment. As for AMEP, it can be simplified to store the task prompt, image, and action for each atomic operation. Adapting our method to other domains remains future work.

[1] Wang et al. Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents. 2023.

[2] Qin et al. MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception. 2024.

[3] Wang et al. Voyager: An open-ended embodied agent with large language models. 2023.

[4] Zhang et al. AppAgent: Multimodal Agents as Smartphone Users. 2023.

Comment

Additional experimental results

To demonstrate that our Hybrid Multimodal Memory is easily adaptable to other domains, we applied Optimus-1 to the app agent scenario. We followed the environment and settings of AppAgent [1] and conducted comparative experiments on its benchmark (9 apps with a total of 45 tasks). The experimental results in the table below show that Optimus-1 outperforms AppAgent and GPT-4 baselines. This reveals that Optimus-1 can generalize to more realistic settings, such as real-world app navigation agents.

Tab 1: Experiments on the benchmark of AppAgent. We report the success rate of the agent in completing 45 tasks.

Method       Success Rate
GPT-4        48.9%
AppAgent     73.3%
Optimus-1    86.7%

We hope that the results of these experiments can address your concerns, and we would greatly appreciate it if you could consider giving us a higher rating.

We welcome the reviewer to continue the discussion with us during the discussion phase. If you have any further questions, please feel free to contact us.

P.S.: We just realized that it was not visible to you due to a setting error. We sincerely apologize for any inconvenience this may have caused and for not responding to your question in a timely manner.

Comment

We would be grateful if you could take time out of your busy schedule to discuss with us. We are very keen to engage in deeper discussions with the reviewers.

We further conducted experiments on AitW [1]. We followed the environment settings of AppAgent and conducted comparative experiments on AitW. As shown in the table below, Optimus-1 outperforms PaLM 2, GPT-4V, and AppAgent. AitW is a popular, general benchmark that can demonstrate an agent's ability to operate apps in real-world scenarios, so the experimental results below are sufficient to demonstrate Optimus-1's generalization to real-world scenarios. We will include these results in the Appendix. Additionally, we have built an official repository to provide well-structured open-source codes and a project page (to be released upon acceptance).

Tab 1: Experiments on the subset of AitW. We report the partial match scores for AitW Standard split.

Method       Match Score
PaLM 2       39.6
GPT-4V       50.5
AppAgent     52.4
Optimus-1    58.3

We hope that the results of these experiments can address your concerns, and we would greatly appreciate it if you could consider giving us a higher rating.

[1] Rawles et al. Android in the Wild: A Large-Scale Dataset for Android Device Control. 2023.

Comment

Thank you for the response, and I appreciate the efforts for the new experiments. These have addressed my concerns and I'm raising my recommendation accordingly.

Comment

Thanks for your reply! We are pleased to have addressed your concerns. We will add these experiments in the Appendix. Additionally, we have built an official repository to provide well-structured open-source codes and a project page (to be released upon acceptance).

Review (Rating: 8)

The paper proposes "Optimus-1", a multi-modal LLM based agent. They evaluate their agent extensively on Minecraft and demonstrate superior performance compared to previous work.

The key components of the approach are:

  1. A memory module consisting of structured memories (DAG) and multi-modal experience replay (with negative and positive examples)
  2. A planning module
  3. A reflection module
  4. A "low-level" execution module

The novelty of the work consists of:

  1. The DAG memory
  2. the positive & negative samples in the replay memory
  3. Putting everything together in this way

Multiple ablation studies are made.

Strengths

Strengths:

  1. A lot of work went into running these experiments and a lot of results are presented (incl. open-source MLLMs)
  2. A lot of work went into designing the evaluation benchmark
  3. A novel agent that performs better than previous work (and it seems it can be applied to more tasks).

Weaknesses

Weakness:

  1. "teacher guidance learning" for HDKG -> It seems that teacher demonstrations are needed to make the agent perform the hard tasks. a. The paper (even incl. appendix) does not explain where these demonstrations come from and what the impact of these is. (Or how costly these demonstrations are to obtain). b. Looking at ablation studies one can see that this Knowledge Module is key and performance drops by 20% (however, it is unclear how much the "teacher" contribution is. It seems that without the teacher contribution the model might perform quite poorly compared to baselines.) Therefore this indicates that expert human demonstrations are needed to actually make the agent work, as opposed to the "LLM" doing the reasoning and work. c. Similarly, all further evaluations (incl. against other benchmarks in the appendix) are therefore somewhat questionable.

  2. The actual learning mechanism of the DAG memory (HDKG) is not described in sufficient detail to replicate the work. It would be good to cite more references or explain this part in more detail.

  3. Positive and Negative examples in AMEP. While the paper argues this is an important contribution, it is nowhere evaluated whether using only positive examples would be more successful.

  4. Computational costs are only marginally mentioned ($5000 for OpenAI, and only 4xA100, but no number of hours). It is not clear how costly the "exploration" / learning phase is (time and money), and what about the teacher phase?

  5. Evaluation outside of Minecraft would be interesting as well.

Questions

Questions:

  1. How are teacher demonstrations obtained? How costly is this? How many human annotators and hours are needed for this?
  2. How does the agent perform without teacher demonstrations?
  3. What is the performance without negative samples in AMEP?
  4. How long does the training / exploration phase take? How costly is it? (How many GPU hours on 4x A100 did it take?) How much is an evaluation?
  5. How long would your method take to setup in a new environment altogether (and what would be the rough steps)? What is another good environment?

Limitations

Limitations:

  1. The authors speak about the limitations of one of the modules (the action generator), however, the teacher phase limitation mentioned above is not discussed. It seems that the method strongly depends on expert demonstrations and these are expensive to obtain in new environments outside of Minecraft.

  2. Evaluations on other envs than Minecraft would have been interesting to discuss.

Author Response

Q1: How are teacher demonstrations obtained? How costly is this? How many human annotators and hours are needed for this?

A1: (1) We obtain teacher demonstrations from Minecraft Wiki. For each task group, we randomly select 5 tasks that are not included in the benchmark. We then create corresponding task plans based on content (synthesis relationship between objects) from the Minecraft Wiki. Taking the task “craft a wooden sword” as an example, we obtain the crafting relationships from the Minecraft Wiki: {1 wooden stick, 2 planks, 1 crafting table → 1 wooden sword}, {1 log → 4 planks}, {2 planks → 4 sticks}, {4 planks → 1 crafting table}. These relationships are transformed into the plan: {1. Get two logs 2. Craft eight planks 3. Craft a crafting table 4. Craft two sticks 5. Craft a wooden sword}. The plan serves as teacher demonstrations for action controller to perform the task.

(2) This method is very low-cost. This process does not require additional human annotators, and it takes about 2 hours to obtain all teacher demonstrations.

(3) We will include these details in the Appendix.

Q2: How does the agent perform without teacher demonstrations? It is unclear how much the teacher demonstrations contribute to HDKG. Teacher demonstrations are needed to actually make the agent work, as opposed to the "LLM" doing the reasoning and work.

A2: (1) Table 2 in the manuscript shows that without teacher demonstrations, the performance of Optimus-1 decreases (e.g., 9.5% -> 1.8% on Diamond Group).

(2) Teacher demonstrations are indispensable for constructing the complete HDKG. For example, in the free exploration phase, Optimus-1 learns methods for crafting/mining basic materials (e.g., crafting sticks and mining diamonds). Without demonstrations (plans), it cannot learn the synthesis methods for advanced items (e.g., a diamond sword is obtained from a stick and two diamonds). This limits Optimus-1's ability to complete challenging long-horizon tasks (such as crafting a diamond sword).

(3) Obtaining teacher demonstrations is efficient and low-cost. Additionally, compared to parameter-update learning, our non-parametric learning method needs only a few demonstrations (plans) as learning data. This allows for the rapid expansion of the HDKG, enabling better reasoning and planning for MLLMs.

Q3: The actual learning mechanism of the DAG memory (HDKG) is not described in sufficient detail to replicate the work.

A3: (1) Due to the diversity and complexity of knowledge in Minecraft, we propose a "free exploration-teacher guidance" approach instead of manually constructing a knowledge graph.

(2) During the free exploration phase, Optimus-1 will randomly initialize the environment, materials, and tasks. It will freely explore basic tasks, such as chop down trees, mine stones with pickaxes, craft stone swords, etc. When the environment feedback indicates that a task is completed, the corresponding relationship (e.g., {1 wooden stick, 2 planks} → {1 wooden sword}) is updated into the HDKG.

(3) During the teacher guidance phase, after Optimus-1 completes a long-horizon task, advanced synthesis relationship (e.g., {1 wooden sticks, 2 diamonds} → {1 diamond sword}) is updated in the HDKG.

(4) We will add a detailed description of “free exploration-teacher guidance" approach in the Appendix.

Q4: What is the performance without negative samples in AMEP?

A4: Table 3 in the manuscript shows the ablation study for AMEP. It demonstrates that removing negative samples results in a decrease in the success rate (e.g., 94% -> 84% on the Stone Group). This reveals that incorporating both success and failure cases into in-context learning significantly enhances the performance of the agent.

Q5: How long does the training / exploration phase take? How costly is it? (How many GPU hours on 4x A100 did it take?) How much is an evaluation?

A5: (1) In the free exploration and teacher guidance phases, there is no need to access OpenAI's API, which keeps costs low. We instantiate multiple Optimus-1 agents in parallel, sharing the same memory, and the learning process takes approximately 16 hours on 4x A100 80G GPUs.

(2) Evaluating Optimus-1 on the benchmark costs approximately $900. We parallelize the evaluation, which takes about 20 hours on 4x A100 80G GPUs. Figure 5 in the manuscript demonstrates that, with the Hybrid Multimodal Memory, the performance of open-source MLLMs approaches that of GPT-4V. This reveals that we can achieve excellent performance with open-source MLLMs at a very low cost.

Q6: How long would your method take to setup in a new environment altogether? What is another good environment?

A6: Our method is highly adaptable to various environments. Taking the app agent as an example, the key step involves transforming the knowledge structure from Minecraft's object synthesis relationships into logical relationships between buttons or operations. Once these logical relationships are established, HDKG can easily be adapted to the app environment. As for AMEP, it can be simplified to store the task prompt, image, and action for each atomic operation. Adapting our method to other domains remains future work.

Q7: The teacher phase limitation mentioned above is not discussed. It seems that the method strongly depends on expert demonstrations and these are expensive to obtain in new environments outside of Minecraft.

A7: As stated in A1 and A2, teacher demonstrations (plans) are easy to collect, efficient, and low-cost. In a new environment, such as an app agent, it is only necessary to collect plans of atomic operations.

Q8: Evaluations on other envs than Minecraft would have been interesting to discuss.

A8: In the future, we will extend the Hybrid Multimodal Memory to other domains. However, we believe that the current environment and experiments sufficiently demonstrate the contribution and effectiveness of our work. Please refer to the response to reviewer 3HCc, A3.

Comment

Thank you for taking the time and effort to address all questions. Things are clearer at this stage. There are a few follow-up questions.

A1:

(1) We obtain teacher demonstrations from Minecraft Wiki. For each task group, we randomly select 5 tasks that are not included in the benchmark. We then create corresponding task plans based on content (synthesis relationship between objects) from the Minecraft Wiki

a.) Could you please describe one simple example of how such a plan is constructed. Is an LLM used for this? Are these constructed by a human?

A3:

(2) During the free exploration phase, Optimus-1 will randomly initialize the environment, materials, and tasks.

a.) Does this happen after the initial expert demonstrations update the HDKG? b.) How does it happen without any initial demonstrations - what is the prompt, or few-shot example used? How is it constructed (human annotated)?


Additional questions (AQ): a.) Do you mean by updating the HDKG that a fact (i.e. a triplet) is added to an actual KG? b.) Could you compare your method in more detail to works such as Voyager, specifically in terms of the manual effort needed for the various phases and a comparison of results with human effort vs. no human effort (especially on the harder long-range tasks)?

Comment

This comment connects to the Responses to Reviewer mEzT (1)

Q4: Could you compare your method in more detail to works such as Voyager, specifically in terms of manual effort needed for the various phases and comparison of results with human effort vs. no human effort (especially on the harder long-range tasks).

A4: (1) Neither Voyager nor our Optimus-1 requires human annotation. As stated in A1, we obtain the plans required for the teacher guidance phase through an automated process. This automated process is simple and easy to implement, and it takes approximately two hours.

(2) As mentioned above, we do not require human effort to obtain the plans. So we cannot provide a comparison of results with human effort versus no human effort.

(3) More detailed comparison between Optimus-1 and Voyager: Firstly, Voyager executes sub-goals in the Mineflayer environment by calling APIs (in the form of code), while Optimus-1 uses an action controller to generate low-level actions like a human, which is more challenging. Secondly, Voyager acquires knowledge through environment feedback during both learning and reasoning processes. For instance, when it fails to execute the task 'craft a diamond sword,' it acquires knowledge such as 'two more diamonds are needed' from the environment feedback. In contrast, Optimus-1 learns whether a task is successfully executed through environment feedback only during the learning phase and requires only a small number of plans to complete the learning process.

Thanks for the discussion with us. If you have any further questions, please feel free to contact us.

Comment

Thank you very much for clarifying your work.

It sounds like excellent work indeed. Perhaps some parts need further clarification and simplification (such as the initial plan acquisition step; at this stage it would be hard to replicate this step from your paper and the discussion).

Follow-up questions:

  1. How could you illustrate the automatic plan acquisition step more clearly? What is required beforehand (e.g. a wiki, and in what format?) and what are the exact steps (i.e. input and output of each component)?
  2. Does Optimus-1 exceed the capabilities of the teacher alone? How meaningful are these new capabilities?
Comment

Thank you very much for taking the time to discuss with us despite your busy schedule. Regarding your concerns, we respond as follows:

Q1: Example of how a plan is constructed during the teacher guidance phase. Is an LLM used for this? Are these constructed by a human?

A1: We obtain the plans required for the teacher guidance phase through an automated process. For each task group, we randomly select 5 tasks that are not included in the benchmark. Taking the task “craft a wooden sword” as an example, we use a script to automatically obtain the crafting relationships for a wooden sword from the Minecraft Wiki: {1 wooden stick, 2 planks, 1 crafting table → 1 wooden sword}, {1 log → 4 planks}, {2 planks → 4 sticks}, {4 planks → 1 crafting table}. These relationships can be represented as a directed acyclic graph. Then, by performing a topological sort, the graph can be converted into tuples of materials and their quantities: (wooden sword, 1), (crafting table, 1), (wooden stick, 1), (planks, 8), (log, 2). Finally, we prompt GPT-4 to construct a plan in order from basic materials to advanced materials: {1. Get two logs 2. Craft eight planks 3. Craft a crafting table 4. Craft a wooden stick 5. Craft a wooden sword}. The entire process is automated, simple, and easy to implement. We only need to construct 5 (tasks/group) x 7 (groups) = 35 plans in the teacher guidance phase, which takes approximately two hours.

Q2: Does the free exploration phase happen after the initial expert demonstrations update the HDKG? How does it happen without any initial demonstrations - what is the prompt, or few-shot example used? How is it constructed (human annotated)?

A2: As described in Section 2.3 in the manuscript, we initialize the Hybrid Multimodal Memory as empty and begin with free exploration to acquire basic knowledge, such as crafting sticks and mining diamonds. We then proceed to teacher guidance phase to learn advanced knowledge, e.g., a diamond sword is obtained by a stick and two diamonds. The entire process does not require additional prompts, few-shot examples, or manual annotations.

In the free exploration phase, we randomly initialize the environment, materials, and tasks. For the task “craft a wooden pickaxe”, we provide initial materials (three planks, two sticks), and then Optimus-1 (with only the action controller activated) attempts to complete the task. If the environment feedback indicates the task is successful, the knowledge {3 planks, 2 sticks → wooden pickaxe} is added to the HDKG. Note that we randomly initialize materials and their quantities, which means that the task may not always succeed. As a result, each free exploration may not acquire the corresponding knowledge, but it can record the relevant experience (whether successful or failed). In the free exploration phase, Optimus-1 learns simple atomic operations, such as crafting sticks in the Wooden Group and mining diamonds in the Diamond Group. This phase is insufficient for Optimus-1 to learn advanced knowledge, such as crafting a diamond sword.

In the teacher guidance phase, Optimus-1 executes each sub-goal sequentially according to the given plan. Once the task is completed, the materials and their corresponding relationships (e.g., {1 wooden stick, 2 diamonds} → {1 diamond sword}) are updated in the HDKG, and the multimodal experience of each sub-goal is stored in the AMEP. Teacher guidance phase allows Optimus-1 to acquire advanced knowledge and learn multimodal experiences through complete long-horizon tasks.

Q3: Do you mean by updating the HDKG, that a fact (i.e. triplet) is added to an actual KG?

A3: Yes, by updating the HDKG we mean that triplets are added to an actual KG. For example, suppose the current KG contains (2 planks, 'craft', 4 sticks), (wooden pickaxe, 'mine', stone). When new knowledge is acquired, such as (2 sticks, 'craft', 1 stone pickaxe), (3 stone, 'craft', 1 stone pickaxe), (1 crafting table, 'needed', stone pickaxe), these 3 triplets are added to the current KG, which is then updated to: (2 planks, 'craft', 4 sticks), (wooden pickaxe, 'mine', stone), (2 sticks, 'craft', 1 stone pickaxe), (3 stone, 'craft', 1 stone pickaxe), (1 crafting table, 'needed', stone pickaxe).
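As a minimal illustrative sketch (not the authors' implementation; the storage layout is an assumption), such triplet updates can be expressed as inserting directed, quantity-annotated edges into a nested mapping:

from collections import defaultdict

# Assumed layout: kg[source][target] = (relation, quantity of source required)
kg = defaultdict(dict)

def add_triplet(kg, source, relation, target, quantity=1):
    """Insert a directed edge such as (2 planks, 'craft', sticks)."""
    kg[source][target] = (relation, quantity)

# existing knowledge
add_triplet(kg, "planks", "craft", "stick", quantity=2)       # 2 planks are needed to craft sticks
add_triplet(kg, "wooden_pickaxe", "mine", "stone")            # a wooden pickaxe can mine stone

# newly acquired knowledge about the stone pickaxe
add_triplet(kg, "stick", "craft", "stone_pickaxe", quantity=2)
add_triplet(kg, "stone", "craft", "stone_pickaxe", quantity=3)
add_triplet(kg, "crafting_table", "needed", "stone_pickaxe")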

The rest of the responses are in the next comment.

Comment

Additional experimental results

As you and some other reviewers raised the feasibility of adapting Optimus-1 to other domains, we added additional experiments to demonstrate the generalization of our method. We applied Optimus-1 to the app agent scenario. We followed the environment and settings of AppAgent [1] and conducted comparative experiments on its benchmark (9 apps with a total of 45 tasks). The experimental results in the table below show that Optimus-1 outperforms the AppAgent and GPT-4 baselines. This reveals that Optimus-1 can generalize to more realistic settings, such as real-world app navigation agents.

Tab 1: Experiments on the benchmark of AppAgent. We report the success rate of the agent in completing 45 tasks.

Method       Success Rate
GPT-4        48.9%
AppAgent     73.3%
Optimus-1    86.7%

We hope that the results of these experiments can address your concerns, and we would greatly appreciate it if you could consider giving us a higher rating.

Comment

Thank you very much for taking the time to discuss with us despite your busy schedule. Regarding your concerns, we respond as follows:

Q1: How could you illustrate the automatic plan acquisition step more clearly? What is required beforehand (eg a Wiki, in what format?) what are the exact steps (i.e. input and output of each component).

A1:

Step 1: We randomly select 5 tasks for each Group (7 groups in total) that are not included in the benchmark.

Step 2: For each selected task, we use a script to automatically obtain the crafting relationships from the Minecraft Wiki. The pseudocode for the script is as follows:

def get_information_from_wiki(item):
    html = get_html_from_wiki(f"https://minecraft.wiki/w/{item}")
    status, recipe = parser_crafting_from_html(html)
    if status == "failed":  # no crafting recipe: the item is an atomic material
        breaking = parser_breaking_from_html(html)
        can_break_tools = parser_tools_from_breaking(breaking)
        save_breaking(item, can_break_tools)
        return
    save_recipe(item, recipe)

    # recurse into every ingredient of the recipe
    for sub_item in recipe:
        get_information_from_wiki(sub_item)


# get the item's knowledge from the wiki
item = "wooden_sword"

get_information_from_wiki(item)

Taking the task “craft a wooden sword” as an example, we use the script to automatically obtain the crafting relationships: {1 wooden stick, 2 planks, 1 crafting table → 1 wooden sword}, {1 log → 4 planks}, {2 planks → 4 sticks}, {4 planks → 1 crafting table}.

Step 3: These relationships are converted into a directed acyclic graph through the script below.

from collections import defaultdict

def get_knowledge_graph(item):
    status, recipe = read_recipe(item)
    if status == "failed":  # atomic item: record which tools can break/mine it
        tools = read_breaking_tools(item)
        for tool in tools:
            tool_graph[tool][item] = True
        return
    for sub_item in recipe:
        craft_graph[sub_item][item] = recipe.number[sub_item]
        get_knowledge_graph(sub_item)


# get the item's knowledge graph
item = "wooden_sword"
craft_graph = defaultdict(dict)  # ingredient -> {product: quantity needed}
tool_graph = defaultdict(dict)   # tool -> {breakable item: True}
get_knowledge_graph(item)

# merge the two graphs into a unified knowledge graph
kg = merge(item, craft_graph, tool_graph)

By performing a topological sort, the graph can be converted into tuples of materials and their quantities: (wooden sword, 1), (crafting table, 1), (wooden stick, 1), (planks, 8), (log, 2).
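For illustration only, a topological sort over craft_graph (assuming the nested layout craft_graph[ingredient][product] = quantity from the script above) could use Kahn's algorithm; the required quantities would then be accumulated while walking this order. This is a hedged sketch, not the authors' exact script:

from collections import defaultdict, deque

def topological_order(craft_graph):
    """Order items from raw materials up to the target item (Kahn's algorithm).
    Quantity accumulation is omitted here for brevity."""
    indegree = defaultdict(int)
    nodes = set(craft_graph)
    for ingredient, products in craft_graph.items():
        for product in products:
            indegree[product] += 1
            nodes.add(product)
    queue = deque(n for n in nodes if indegree[n] == 0)  # raw materials have no ingredients
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for product in craft_graph.get(node, {}):
            indegree[product] -= 1
            if indegree[product] == 0:
                queue.append(product)
    return order

# e.g. ['log', 'planks', 'stick', 'crafting_table', 'wooden_sword'] for the example above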

Step 4: We prompt GPT-4 to construct a plan in order from basic materials to advanced materials:

System Prompt: You are an expert in Minecraft and you can efficiently make plans for me to complete challenging tasks.

User Prompt: I need to complete the task {Task}. Here are the materials needed and their quantities {Materials}. 
Please make a feasible plan for me in the order from basic materials to advanced materials:

Finally, we get the plan: {1. Get two logs 2. Craft eight planks 3. Craft a crafting table 4. Craft a wooden stick 5. Craft a wooden sword}

We have built an official repository to provide well-structured open-source codes and a project page (to be released upon acceptance). And we will add these implementation details in the Appendix.

Q2: Does Optimus-1 exceed the capabilities of the teacher alone? How meaningful are these new capabilities?

A2: Actually, 'teacher' refers to the internal knowledge of the Minecraft environment. In the free exploration phase, Optimus-1 acquires basic knowledge through environmental feedback, while in the teacher guidance phase, Optimus-1 obtains advanced knowledge through the automated process described in A1. This knowledge cannot be directly transformed into capabilities in Minecraft. So we constructed Optimus-1, which includes modules such as the Knowledge-guided Planner, Experience-Driven Reflector, and Action Controller, to transform this knowledge into the capability to execute long-horizon tasks in Minecraft through the reasoning and reflection of a multimodal large language model.

Figure 5(b) in the manuscript shows that Optimus-1's performance continually improves over multiple epochs of the 'free exploration-teacher guidance' learning process. This indicates that Optimus-1 can utilize past memories (knowledge and experience) to gradually enhance its performance on unseen tasks. Table 2 in the manuscript shows that the performance of Optimus-1 significantly decreases after such knowledge is removed (e.g., from 9.5% to 1.8% on the Diamond Group). Therefore, knowledge has a significant impact on Optimus-1's ability to execute long-horizon tasks.

We hope that these explanation can address your concerns, and we would greatly appreciate it if you could consider giving us a higher rating.

Comment

Thank you for this detailed clarification and answer!

The score is raised from 7 to 8.

We would recommend adding all of the above into the paper to make the paper clearer. Specifically:

  1. The comparison against previous work (such as Voyager)
  2. The detailed explanation of the various steps and how they work and what they involve.
  3. Additional discussion above.
Comment

We deeply appreciate the time and effort you invested in the evaluation of our paper.

We will revise the paper based on your and Reviewer wzp9's suggestions, including:

  1. Add the comparison with existing Minecraft agents (e.g., Voyager) in Appendix E.3.
  2. Add a detailed explanation (how to construct and apply) of the HDKG and AMEP in Appendix F.1.
  3. Add experimental results in the app agent scenario in the Appendix.
  4. Revise the methodology section and other parts.
  5. Incorporate clarifications and implementation details discussed with the reviewers into the Appendix.

Additionally, we have built an official repository to provide well-structured open-source codes and a project page (to be released upon acceptance).

Thank you again for your valuable suggestions, which are crucial for improving the quality of our paper.

Review (Rating: 6)

This paper presents Optimus-1, a multimodal agent that focuses on Minecraft tasks. Specifically, Optimus-1 is equipped with a Hybrid Multimodal Memory including: 1) a Hierarchical Directed Knowledge Graph that stores world knowledge acquired through free exploration and teacher guidance; 2) an Abstracted Multimodal Experience Pool that enables Optimus-1 to reason about the current situation by using past experience. Based on the multimodal memory, Optimus-1 adopts a Knowledge-Guided Planner and an Experience-Driven Reflector to generate better plans and reflect periodically in long-horizon tasks. Experimental results illustrate the effectiveness of the multimodal memory of Optimus-1 in long-horizon tasks.

Strengths

  1. The paper introduces a multi-modal memory mechanism that includes a hierarchical world knowledge graph and a multi-modal past experience pool. The memory is later utilized by the multi-modal planner and reflector modules of Optimus-1.
  2. Good experimental results illustrate the effectiveness of Optimus-1 compared to other strong baselines. Sufficient ablation studies cover the effectiveness of the modules in Optimus-1 as well as the necessity of both success and failure cases for reflection.
  3. The paper is well written.

Weaknesses

  1. The details of the construction procedure of AMEP are unclear (e.g. how is the image buffer maintained by computing image similarity? What threshold is used for MineCLIP?)
  2. How Optimus-1 acquires world knowledge through free-exploration is not specified.
  3. There are other Minecraft agents (Jarvis-1, etc.) using multi-modal memory. The paper claims better efficiency of memory storage and retrieval compared to Jarvis-1, but this is not quantitatively evaluated.
  4. It seems that the low-level action controller will not be updated through reflection, which limits the effectiveness of the reflection pipeline.

Questions

  1. In the construction of AMEP, given the video stream, how to adaptively update the abstracted frames in the buffer? How is the threshold of MineCLIP determined?
  2. In the free-exploration phase, what is the efficiency of Optimus-1 to learn an entry of world knowledge such as "a stone sword can be crafted with a wooden stick and two cobblestones" through random exploration?
  3. In the reflection phase, how is past experience retrieved from AMEP? Is the retrieval based on image similarity, the task goal, or a combination of both? How do failure cases contribute to the final success of the task?
  4. In Table 1, how is GPT-4V evaluated? Does GPT-4V have the same world knowledge as Optimus-1?

Limitations

Yes.

Author Response

Q1: In the construction of AMEP, given the video stream, how to adaptively update the abstracted frames in the buffer? How is the threshold of MineCLIP determined?

A1: (1) As described in Section 2.1.1, we sample the video stream at 1-second intervals and store the frames in a variable-length video buffer. These filtered images sequentially enter an image buffer with a window size of 16. When the image buffer reaches its capacity and a new image is added, we calculate the cosine similarity between frames and remove one frame from the pair with the highest similarity. Through this process, we dynamically preserve abstracted frames in the image buffer.

(2) We empirically set the threshold for MineCLIP at 0.7.
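A minimal sketch of the buffer update described in (1), assuming each frame has already been encoded as a feature vector (the encoder and data layout are assumptions for illustration, not the authors' code):

import numpy as np

def update_image_buffer(buffer, new_frame_feat, window_size=16):
    """Keep at most `window_size` frame features; when full, drop one frame
    from the most similar pair so the buffer stays diverse."""
    buffer.append(new_frame_feat)
    if len(buffer) <= window_size:
        return buffer
    # cosine similarity between all pairs of unit-normalized frame features
    feats = np.stack([f / np.linalg.norm(f) for f in buffer])
    sim = feats @ feats.T
    np.fill_diagonal(sim, -1.0)      # ignore self-similarity
    i, j = np.unravel_index(np.argmax(sim), sim.shape)
    buffer.pop(max(i, j))            # remove one frame of the most similar pair
    return buffer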

Q2: How Optimus-1 acquires world knowledge through free-exploration? what is the efficiency of Optimus-1 to learn an entry of world knowledge such as "a stone sword can be crafted with a wooden stick and two cobblestones" through random exploration?

A2: (1) During the free exploration phase, Optimus-1 will randomly initialize the environment, materials, and tasks. It will freely explore basic tasks, such as chop down trees, mine stones with pickaxes, craft stone swords, etc. When the environment feedback indicates that a task is completed, the corresponding relationship (e.g., {1 wooden stick, 2 planks} → {1 wooden sword}) is updated into the HDKG.

(2) We use task decomposition and parallelization to enable Optimus-1 to learn world knowledge efficiently. Given the initial materials, Optimus-1 only needs to perform one sub-goal at a time (chop a tree, mine iron ore, craft a stone sword from a wooden stick and two cobblestones, etc.), which enables it to quickly complete the task and then learn the corresponding knowledge. Furthermore, we initialize multiple instances of Optimus-1 that share the same HDKG and AMEP. This allows Optimus-1 to learn such knowledge efficiently.

Q3: There are other minecraft agents (Jarvis-1, etc.) using multi-modal memory. The paper claims the efficiency of the memory storage and retrieval compared to Jarvis-1, but this is not quantitatively evaluated.

A3: Jarvis-1 [1] stores all images without summarization, while our approach summarizes each sub-goal and only retains 16 images per sub-goal, significantly improving storage efficiency. Take the example of "craft a wooden pickaxe" shown on its demo website: Jarvis-1 executes 1,139 steps, storing 1,139 images. In contrast, we only store 5 sub-goals × 16 images = 80 images, a roughly 14x improvement in storage efficiency. With a smaller memory store, retrieval efficiency is naturally higher as well.

Q4: It seems that the low-level action controller will not be updated through reflection, which limits the effectiveness of the reflection pipeline.

A4: The action controller does not affect the effectiveness of the reflection pipeline. The purpose of reflection is to correct the planner: the reflector evaluates the current situation based on multimodal experience to determine whether the planner needs to replan. If so, the planner is requested to generate a new plan, which is then executed by the action controller; if not, the action controller continues executing the current sub-goal.

Q5: In the reflection phase, how to retrieve past experience from AMEP? Is the retrieval based on the image similartiy, task goal, or a combination of both? How does the fail case contribute to the final success of the task?

A5: (1) During the reflection phase, we use a text-matching method on the task goal to retrieve success and failure cases. When there are multiple similar cases, we select the one with the highest image similarity.
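A minimal sketch of the retrieval described above, with assumed field names (task_goal, obs_feat) that are illustrative rather than the authors' actual data structure:

import numpy as np

def retrieve_experience(task_goal, current_obs_feat, memory):
    """Match the task goal textually, then break ties by image similarity
    between the stored observation feature and the current observation."""
    candidates = [e for e in memory if e["task_goal"] == task_goal]
    if not candidates:
        return None
    def image_sim(entry):
        feat = entry["obs_feat"]
        return float(np.dot(feat, current_obs_feat) /
                     (np.linalg.norm(feat) * np.linalg.norm(current_obs_feat)))
    return max(candidates, key=image_sim)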

(2) Failure cases serve as in-context examples for the Reflector, assisting in evaluating whether the current task goal can be achieved under the present situation. Due to the complexity and diversity of the environment, it’s challenging to determine the success of the current task based solely on success cases. The inclusion of failure cases allows the agent to assess the current state through a diverse comparison. Table 3 in the manuscript reveals that incorporating both success and failure cases into in-context learning significantly enhances the performance on long-horizon tasks.

Q6: In Table 1, how is GPT-4V evaluated? Does GPT-4V have the same world knowledge as Optimus-1?

A6: (1) In Table 1 in the manuscript, GPT-4V is evaluated without integrating hybrid multimodal memory modules. During the planning phase, GPT-4V generates a plan for the action controller based on observation and task. During the reflection phase, it generates reflection results in a zero-shot manner.

(2) Optimus-1 is built upon GPT-4V with the Hybrid Multimodal Memory. Table 2 in the manuscript shows that removing structured knowledge from Optimus-1 (based on GPT-4V) leads to a significant decrease in success rates across all task groups. This shows that GPT-4V lacks sufficient knowledge of the Minecraft environment.

(3) We will add a detailed description of the baseline settings in the Appendix.

[1] Wang et al. Jarvis-1: Open-world multi-task agents with memory-augmented multimodal language models. 2023.

Comment

I have read the rebuttal and other reviews. Most of my concerns have been solved, so I will maintain my original score as 6: Weak Accept.

Review (Rating: 4)

Optimus-1 introduces an innovative Hybrid Multimodal Memory module that combines a Hierarchical Directed Knowledge Graph (HDKG) and an Abstracted Multimodal Experience Pool (AMEP) to address knowledge and experience management in long-horizon tasks.

Strengths

  • Proposes a hybrid multimodal memory module, which includes HDKG and AMEP, offering an innovative approach to managing knowledge and experience in long-horizon tasks.
  • Experimental results demonstrate that Optimus-1 significantly outperforms all existing agents on long-horizon tasks and achieves near-human-level performance in many tasks.
  • The HDKG maps knowledge into a directed graph structure, enabling the agent to efficiently retrieve and utilize knowledge without needing to update parameters.
  • The AMEP summarizes not only successful cases but also failure cases, significantly enhancing the agent's learning effectiveness.

Weaknesses

  • Improving long-horizon task performance through memory modules is common in LLMs; applying this directly to VLMs is not very novel.
  • Reaction time and decision efficiency might be issues; the current experimental results still show a large number of steps.
  • Although it performs well in the Minecraft environment, its performance in real-world applications, such as software manipulation and web navigation, has not been verified.
  • As new knowledge and tasks emerge, effectively updating and maintaining the knowledge (HDKG) remains a challenge.

Questions

See weaknesses

Limitations

Yes

Author Response

Q1: Improving long-horizon task performance through memory modules is common in LLMs; applying this directly to VLMs is not very novel.

A1: Incorporating multimodal memory into VLMs presents significant challenges compared to applying unimodal memory to LLMs. Long-horizon tasks require the model to store and utilize past information, which is particularly crucial in a multimodal environment. The diverse structures and characteristics of information from different modalities make it difficult to effectively preserve multimodal historical information over extended time periods.

To address this, we propose a novel Hybrid Multimodal Memory module that incorporates structured knowledge and multimodal experiences, dynamically summarizing multimodal information to reduce storage and retrieval costs. It includes 1) a novel memory module, the Hierarchical Directed Knowledge Graph, which is highly structured and easily updated, enabling concise representation and storage of complex knowledge; 2) a novel method for constructing the Abstracted Multimodal Experience Pool, which dynamically summarizes long-sequence multimodal information, encompassing both global overviews and local details of multimodal experiences.

Table 7 in the manuscript shows that existing agents do not incorporate memory modules with both knowledge and experience, resulting in inferior performance on long-horizon tasks compared to Optimus-1 (Table 1, Table 15, Table 16 in the manuscript).

Q2: Reaction time and decision efficiency might be issues; the current experimental results still show a large number of steps.

A2: (1) For reaction time, Table 1 in the manuscript shows that Optimus-1's average task completion time (AT) is significantly lower than that of other baselines and approaches human-level performance. For example, on the Wood group, Optimus-1 takes an average of 47 seconds, the human baseline takes 31 seconds, and DEPS [1] takes 85 seconds.

(2) For decision efficiency, unlike existing agents [1] [2] [3] that require multiple interactions with (M)LLMs for task planning, Optimus-1 completes task planning with a single interaction, thus achieving much higher decision efficiency. For example, to plan the task “craft iron pickaxe”, MP5 [3] requires 11 interactions with the MLLM, whereas Optimus-1 completes the planning in just one interaction.

(3) In the MineRL environment, 'steps' refers to the number of interactions between the agent and the environment, which occur at a frequency of 20 times per second. For example, if an agent takes 2 seconds to complete the task “chop a tree”, it interacts with the environment 40 times, resulting in a recorded step count of 40. Table 1 in the manuscript shows that Optimus-1's average number of task completion steps (AS) is significantly lower than that of other baselines.

Q3: Although it performs well in the Minecraft environment, its performance in real-world applications, such as software manipulation and web navigation, has not been verified.

A3: (1) Minecraft is a valuable and representative environment for evaluating long-horizon tasks, offering greater diversity and complexity compared to other environments. Unlike web/app navigation [4] and embodied manipulation [5], Minecraft is an open world with a complex and dynamic environment (79 biomes, including ocean, plains, forest, desert, etc.). To complete long-horizon tasks, agents must achieve multiple sub-goals (e.g., 15 sub-goals to craft a diamond sword), making the construction of a Minecraft agent quite challenging. Many studies [2] [3] [6] have chosen Minecraft as the environment for validating performance on long-horizon tasks. Extensive experimental results (Table 1, Table 15, Table 16, Figure 8 in the manuscript) show that Optimus-1 outperforms all baselines. Therefore, conducting experiments in the Minecraft environment is sufficient to demonstrate the contributions of this paper.

(2) Our proposed Hybrid Multimodal Memory is easily adaptable to other domains. Take the app agent [4] as an example: the key step is transforming the knowledge structure from Minecraft's object synthesis relationships into logical relationships between buttons or operations. Once these logical relationships are established, HDKG can easily be adapted to the app environment. As for AMEP, it can be simplified to store the task prompt, image, and action for each atomic operation, as sketched below. Adapting our method to other domains remains future work.

Q4: As new knowledge and tasks emerge, effectively updating and maintaining the knowledge (HDKG) remains a challenge.

A4: Our HDKG can be efficiently updated and expanded. When new nodes appear, the HDKG is updated by simply merging the new nodes and relationships into the graph. This involves local, linear modifications rather than altering the entire graph, making the process efficient and time-saving. For example, when M new nodes and N new edges are added, the HDKG can be updated with M+N operations. Moreover, an HDKG containing 851 objects (nodes) requires less than 1 MB of memory. Thus, the HDKG can be efficiently updated and maintained.

[1] Wang et al. Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents. 2023.

[2] Wang et al. Voyager: An open-ended embodied agent with large language models. 2023.

[3] Qin et al. MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception. 2024.

[4] Zhang et al. AppAgent: Multimodal Agents as Smartphone Users. 2023.

[5] Jiang et al. VIMA: General Robot Manipulation with Multimodal Prompts. 2023.

[6] Baker et al. Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos. 2022.

Comment

Thanks for the authors' rebuttal. I will change my score to 4. I look forward to seeing some experiments in software manipulation or web navigation based on the Optimus-1 framework and am open to further modifying my score.

Comment

Thank you very much for taking the time to discuss with us despite your busy schedule. Regarding your concerns, we respond as follows:

Q1: I look forward to seeing some experiments in software manipulation or web navigation based on the Optimus-1 framework and am open to further modifying the score.

A1: To address your concerns about the generalization of our method in real-world scenarios, we applied Optimus-1 to the app agent scenario. We followed the environment and settings of AppAgent [1] and conducted comparative experiments on its benchmark (9 apps with a total of 45 tasks). The experimental results in the table below show that Optimus-1 outperforms the AppAgent and GPT-4 baselines. This reveals that Optimus-1 can generalize to more realistic settings, such as real-world app navigation agents.

Tab 1: Experiments on the benchmark of AppAgent. We report the success rate of the agent in completing 45 tasks.

Method       Success Rate
GPT-4        48.9%
AppAgent     73.3%
Optimus-1    86.7%

We hope that the results of these experiments can address your concerns, and we would greatly appreciate it if you could consider giving us a higher rating. If you have any further questions, please feel free to contact us.

[1] Zhang et al. AppAgent: Multimodal Agents as Smartphone Users. 2023.

Comment

Thanks for the additional experiments. Actually, AppAgent can also achieve similar results with "Watching Demos". Would you like to explore the performance of Optimus-1 on AitW [1] and OSWorld [2]? It would be more challenging and persuasive.

[1] Android in the Wild: A Large-Scale Dataset for Android Device Control

[2] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

Comment

Thank you very much for taking the time to discuss with us despite your busy schedule. Regarding your concerns, we respond as follows:

Q: AppAgent can also achieve similar results with "Watching Demos". Would you like to explore the performance of Optimus-1 on AitW [1] and OSWorld [2]? It would be more challenging and persuasive.

A: (1) The experimental results in the table below show that Optimus-1 outperforms AppAgent, even when compared with AppAgent with Watching Demos.

Tab 1: Experiments on the benchmark of AppAgent. We report the success rate of the agent in completing 45 tasks.

Method                        Success Rate
GPT-4                         48.9%
AppAgent-Auto. Exploration    73.3%
AppAgent-Watching Demos       84.4%
Optimus-1                     86.7%

(2) We further conducted experiments on AitW [1], following the environment settings of AppAgent. As shown in the table below, Optimus-1 outperforms PaLM 2, GPT-4V, and AppAgent. It is important to note that our method is training-free, and due to time constraints the experiments were conducted under insufficient "free exploration-teacher guidance" learning conditions, so it would be unfair to compare it against baselines fine-tuned on the AitW dataset. We will include these results in the Appendix.

Tab 2: Experiments on the subset of AitW. We report the partial match scores for AitW Standard split.

Method       Match Scores
PaLM 2       39.6
GPT-4V       50.5
AppAgent     52.4
Optimus-1    58.3

(3) Since conducting experiments on OSWorld [2] requires a virtual machine and the environment configuration is quite complex, we are unable to provide experimental results for Optimus-1 on OSWorld in such a short time. However, AitW is a popular, general benchmark that demonstrates an agent's ability to operate apps in real-world scenarios, so the experimental results above are sufficient to show Optimus-1's generalization to real-world scenarios. We plan to provide experiments in more scenarios in future versions to further demonstrate the generalization of the proposed method. Additionally, we have built an official repository providing well-structured open-source code and a project page (to be released upon acceptance).

We hope that the results of these experiments can address your concerns, and we would greatly appreciate it if you could consider giving us a higher rating.

[1] Rawles et al. Android in the Wild: A Large-Scale Dataset for Android Device Control. 2023.

[2] Xie et al. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. 2024.

Comment

Thanks for the experimental results. Would you like to share the implementation code and comprehensive trajectory records via an anonymous link? I will check them and increase my score if the implementation is correct.

Comment

Thank you very much for taking the time to discuss with us despite your busy schedule. We would like to share the code with you, but we have noticed that the NeurIPS rebuttal guidelines clearly state, 'All the texts you post (rebuttal, discussion, and PDF) should not contain any links to external pages.' Therefore, we will discuss the feasibility of sharing the code link with the area chair, and we hope you can understand our concerns.

Moreover, we would be grateful if you could focus more on our contribution of proposing the hybrid multimodal memory module and constructing the agent Optimus-1, which outperformed all powerful baselines in executing long-horizon tasks in Minecraft. This has been acknowledged by all reviewers (#3HCc, #397y, #mEzT, #AfEc, #wzp9).

To address your concerns, we have done our best to verify the effectiveness of the proposed method on the AppAgent benchmark and supplemented it with experiments on AitW at your request. These results demonstrate the generalization of Optimus-1 to more general scenarios. We sincerely hope you can see the effort and sincerity we have put into addressing your concerns. We have built an official repository providing well-structured open-source code and a project page (to be released upon acceptance).

Additionally, we are pleased to have addressed the concerns of other reviewers (#397y, #mEzT, #wzp9), and they have raised their scores. We would greatly appreciate it if you could take these aspects into consideration and give us a higher score.

Thanks again for the time and effort you invested in the evaluation of our paper.

Comment

We are still in contact with the area chair and have not yet received permission to share the code. We sincerely hope you can recognize the efforts and sincerity we have put into addressing your concerns. We would greatly appreciate it if you could take these aspects into consideration and give us a higher score.

Thanks again for the time and effort you invested in the evaluation of our paper.

Author Response

Response to all Reviewers

We would like to thank all reviewers (#3HCc, #397y, #mEzT, #AfEc, #wzp9) for their time and effort in providing constructive feedback. We are very encouraged that the reviewers found the manuscript well-written and easy to follow (R#397y, R#AfEc), the proposed Hybrid Multimodal Memory novel (R#3HCc, R#AfEc, R#397y, R#mEzT, R#wzp9), and the proposed Optimus-1 superior to prior state-of-the-art (R#3HCc, R#mEzT, R#AfEc, R#wzp9), with comprehensive experiments (R#mEzT, R#AfEc). We have built an official repository providing well-structured open-source code (to be released upon acceptance).

We have responded to your questions and comments inside each individual review. We hope these responses will offer a more thorough understanding of our paper. If your concerns have been resolved, we would greatly appreciate it if you could consider giving us a higher rating.

Final Decision

This paper introduces Optimus-1, designed to handle long-horizon tasks in Minecraft. The core innovation is a Hybrid Multimodal Memory module that manages knowledge and experience in such tasks. Additionally, a self-reflection mechanism enhances textual actions, which are then executed by a pre-trained action controller to produce low-level actions.

Strengths

  • Novel concept of a hybrid multimodal memory module, which facilitates the management of knowledge and experience in long-horizon tasks.

  • Optimus-1 significantly outperforms all existing agents on long-horizon tasks in Minecraft.

Weaknesses/suggestions

  • Limited Domain: Some components of the proposed pipeline are somewhat domain-specific (specialized to Minecraft), which raised concerns among reviewers regarding the generality of the work. The authors addressed these concerns by presenting additional results on mobile device control.

I think the paper makes a nice contribution that the community will find valuable. However, I encourage the authors to think carefully about how to reflect the comments or resolve the questions from reviewers in the camera ready version.