PaperHub
6.8 / 10
Poster · 4 reviewers
Ratings: 4, 5, 5, 3 (min 3, max 5, std 0.8)
Confidence: 3.3
Novelty: 3.0 · Quality: 2.8 · Clarity: 2.5 · Significance: 2.5
NeurIPS 2025

MindForge: Empowering Embodied Agents with Theory of Mind for Lifelong Cultural Learning

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29

Abstract

Embodied agents powered by large language models (LLMs), such as Voyager, promise open-ended competence in worlds such as Minecraft. However, when powered by open-weight LLMs they still falter on elementary tasks after domain-specific fine-tuning. We propose MindForge, a generative-agent framework for cultural lifelong learning through explicit perspective taking. We introduce three key innovations: (1) a structured theory of mind representation linking percepts, beliefs, desires, and actions; (2) natural inter-agent communication; and (3) a multi-component memory system. Following the cultural learning framework, we test MindForge in both instructive and collaborative settings within Minecraft. In an instructive setting with GPT-4, MindForge agents powered by open-weight LLMs significantly outperform their Voyager counterparts in basic tasks yielding $3\times$ more tech-tree milestones and collecting $2.3\times$ more unique items than the Voyager baseline. Furthermore, in fully collaborative settings, we find that the performance of two underachieving agents improves with more communication rounds, echoing the Condorcet Jury Theorem. MindForge agents demonstrate sophisticated behaviors, including expert-novice knowledge transfer, collaborative problem solving, and adaptation to out-of-distribution tasks through accumulated cultural experiences.
Keywords
lifelong learning · theory of mind · LLMs · cultural learning · in-context learning

Reviews and Discussion

Review
Rating: 4

MindForge introduces a framework for cultural lifelong learning in embodied agents through explicit perspective taking. It uses (1) a structured theory of mind representation linking percepts, beliefs, desires, and actions; (2) natural inter-agent communication; and (3) a multi-component memory system. In Minecraft experiments, MindForge agents significantly outperform their Voyager counterparts. In fully collaborative settings, even agents that fail in isolation succeed together, with performance scaling over multiple communication rounds. Through these mechanisms, agents exhibit expert–novice knowledge transfer, collaborative problem solving, and adaptation to out-of-distribution tasks.

Strengths and Weaknesses

Strengths:

  1. MindForge equips agents with structured Belief-Desire-Intention (BDI) models that link percepts, beliefs, desires, and actions, enabling them to form and update mental models of both themselves and their collaborators. This framework allows agents to reason recursively about partner states and generate more targeted, context-appropriate advice during collaboration.
  2. MindForge agents lead to significant gains with open-weight LLMs in experiments, both in the Voyager setting and fully collaborative settings.
  3. The integration of episodic, semantic, and procedural memory subsystems enables MindForge agents to keep and leverage past experiences. Ablation studies show that removing episodic memory leads to drops in task completion rates, and semantic memory facilitates both post-collaboration performance improvements and out-of-distribution generalization.

Weaknesses:

  1. MindForge functions more as an engineering framework than a typical research paper. While it touches on many concepts (lifelong learning, collaborative learning, cultural learning), its core research contributions, particularly how it differentiates from other collaboration frameworks, are not clearly presented.
  2. MindForge agents can initiate dialogue only in a hardcoded fashion—specifically, the weaker agent must fail a task before requesting help, limiting flexibility in dynamic environments and potentially missing opportunities for proactive collaboration.
  3. MindForge introduces additional computational overhead due to multi-round communication. However, the paper does not quantify how much slower or more resource-intensive (e.g., in terms of API calls) MindForge is compared to non-collaborative baselines, nor does it explore different strategies or parameter settings to balance efficiency and performance.

Questions

  1. The current design requires a weaker agent to completely fail a task before initiating dialogue. Can the communication protocol be made more flexible? This would improve efficiency and allow better generalization across different tasks.
  2. Would the framework generalize beyond Minecraft, such as being evaluated in another embodied environment? Since the paper claims it is a general framework for diverse settings, this would demonstrate whether the structured ToM representations, memory subsystems, and communication mechanisms transfer effectively.
  3. Can the compute–performance tradeoff be more precisely quantified? This would provide clearer insights and enhance the framework’s practical utility.

Limitations

Yes.

Final Justification

The authors addressed most of my questions and concerns. Considering that the overall quality is generally acceptable and the novelty of the work is present but limited, I raised the score and recommend a rating of 4: Borderline Accept.

Formatting Issues

None.

Author Response

Thank you for the review, reviewer rsUz; we appreciate your comments and suggestions!

  1. The current design requires a weaker agent to fail a task before initiating dialogue.

To clarify, there is no limiting factor in the design of the MindForge framework itself with respect to the communication protocol as long as communication is done in natural language. Note that we only adopt this approach (where failure is required before initiating dialogue and then performing communication and task actions in an interleaved manner) for our experiments, to be able to compare the effect of communication on the base behavior for task completion.

Flexible communication protocol

To showcase that our MindForge framework is not tied to a specific communication protocol, we perform additional experiments on two MindForge tasks where we allow for a more flexible communication setting. Specifically, we consider the following setup: in an instructive setting, we allow the weak agent, at each turn, to decide whether to start communication (in case it is unsure how to solve the task to begin with) or to try the action itself.

If the agent decides to try the action and does not feel the need to communicate, then, in case of failure to complete the task, we follow the standard protocol considered in the manuscript. Below are the results of this experiment together with some useful statistics.

Mixtral-8x7B

| Mine dirt | Comm. Round 0 | Comm. Round 1 | Comm. Round 2 | Comm. Round 3 |
| --- | --- | --- | --- | --- |
| MindForge w/ flexible communication | 37% | 45% | 67% | 67% |
| MindForge | 29% | 42% | 62% | 67% |

| Mine dirt and wood | Comm. Round 0 | Comm. Round 1 | Comm. Round 2 | Comm. Round 3 |
| --- | --- | --- | --- | --- |
| MindForge w/ flexible communication | 75% | 79% | 79% | 83% |
| MindForge | 75% | 79% | 79% | 83% |

Mistral-7B

| Mine dirt | Comm. Round 0 | Comm. Round 1 | Comm. Round 2 | Comm. Round 3 |
| --- | --- | --- | --- | --- |
| MindForge w/ flexible communication | 37% | 42% | 45% | 54% |
| MindForge | 37% | 42% | 45% | 54% |

| Mine dirt and wood | Comm. Round 0 | Comm. Round 1 | Comm. Round 2 | Comm. Round 3 |
| --- | --- | --- | --- | --- |
| MindForge w/ flexible communication | 41% | 45% | 50% | 50% |
| MindForge | 41% | 45% | 50% | 50% |

Observations in a flexible communication scenario:

  • Mixtral-8x7B correctly assesses that it needs help to mine dirt without needing to fail first, which results in an increase in task success.
  • Mistral-7B is extremely confident in its ability to solve the task without help and never asks for help before failure even if this option is given.
  • When dealing with a task resembling something they solved in the past, both models correctly identify they do not require additional help as they should already possess the required knowledge in semantic and procedural memory.
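For concreteness, a minimal sketch of the flexible setting described above (Python pseudocode; the helper names `wants_help`, `communicate`, and `attempt_task` are illustrative placeholders, not part of the released MindForge code):

```python
# Minimal sketch of the flexible communication protocol (illustrative only).
# `weak_agent` and `expert` are assumed to expose the hypothetical methods used below.

def run_flexible_episode(weak_agent, expert, task, max_rounds=3):
    # The weak agent may ask for help up front if it is unsure about the task.
    if weak_agent.wants_help(task):
        weak_agent.communicate(expert, task)

    # It then attempts the task on its own.
    if weak_agent.attempt_task(task):
        return True

    # On failure, fall back to the standard protocol from the manuscript:
    # interleave communication rounds with renewed task attempts.
    for _ in range(max_rounds):
        weak_agent.communicate(expert, task)
        if weak_agent.attempt_task(task):
            return True
    return False
```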
  2. Since the paper claims it is a general framework for diverse settings...mechanisms transfer effectively. Would the framework generalize beyond Minecraft?

To begin with a clarification, in terms of measuring generalization, different Minecraft biomes already serve as diverse environments for testing, making the tech-tree experiments a significant reflection of continual learning capabilities, similar to Voyager’s experiments (Wang et al., 2023a).

That said, our proposed innovations (perspective taking, natural communication) and the MindForge architecture itself are game agnostic. Generalization beyond Minecraft would only change the base prompt for the individual modules to adapt to the specific environment (or the real world). However, incorporating different games for experimentation is non-trivial on academic budgets; the experiments in Minecraft alone in the current paper cost around 600 GPU hours.

  3. its core research contributions, particularly how it differentiates from other collaboration frameworks, are not clearly presented.

To clarify, the manuscript tackles the following research questions:

  • (i) can we bridge the capabilities between closed-source and open-weight models in embodied settings through a natural language interface that is also compatible with humans

  • (ii) similar to humans, can we use perspective-taking and theory of mind to enhance the communication between multiple agents

  • (iii) what type of memory subsystems are beneficial for embodied agents in a lifelong learning setting

To the best of our knowledge, we are the first to enable agents to perform perspective taking using a structured representation. Moreover, we consider a lifelong cultural learning setting where agents communicate and evolve across a curriculum of tasks inside Minecraft.

Contributions with respect to the Research Questions

  • (i) We show in Figure 1b how our MindForge agents in an instructive setting achieve new milestones compared to the Voyager baseline in a lifelong setting inside Minecraft, bringing the performance closer to closed-source models like GPT-4. Besides the immediate performance improvements, our MindForge framework offers a potential solution to tackle false beliefs through communication in multi-agent settings.

  • (ii) We show through an ablation in Table 5, Appendix B, that using perspective taking and the BigToM representation results in more effective communication between agents. The text snippet in Appendix B shows an example of how perspective taking incentivizes agents to think from the perspective of the other agents, tailoring their advice to their needs.

  • (iii) We add two more memory components compared to Voyager: episodic and semantic memory. We consider semantic memory an important component in MindForge as it represents a store for correct beliefs about Minecraft tasks, gathered through multi-agent interactions. This component functions as a way to override the inherent false beliefs that stem from LLM pretraining. Moreover, **we show how episodic memory enables embodied agents to remember their mistakes and avoid making them again.**

MindForge functions more as an engineering framework than a typical research paper.

The majority of the scientific work on foundation-model-powered embodied agents can be considered engineering frameworks [1][2].

Moreover, Reviewer ADK5 compares our belief and memory systems to a number of single-agent papers (ExpeL, CLIN, SSO, DEPS, ADAPT), highlighting that most of the papers in this space are engineered around foundational models. Please refer to our response for Reviewer ADK5 for a more in-depth breakdown.

However, the goal of MindForge is not to propose an engineering framework to solve Minecraft, but rather to provide a solution towards human-AI and multi-agent collaboration in cultural learning settings purely through natural language and in-context learning. Minecraft simply provides us with the environmental substrate required to assess efficacy, while the core of MindForge remains generally applicable.

  4. Can the compute-performance tradeoff be more precisely quantified?

Focusing on the compute-performance trade-off would erroneously treat MindForge as pursuing an engineering goal (solving the task efficiently) rather than the manuscript's stated scientific goal: to investigate the process of in-situ learning and skill acquisition in an agent that is otherwise incapable, using social collaboration as the mechanism.

Nonetheless, the effect of the compute-performance trade-off is best seen in the context of the Minecraft tech-tree experiments, as shown in Fig. 1b and Table 3.

MindForge agents manage to achieve more milestones than Voyager open-weight alternatives in the same number of actions. Moreover, they reach certain milestones faster (as counted by the number of actions) than the Voyager counterpart.

The efficiency of MindForge agents can be quantified in two separate ways:

  • (1) number of effective tokens generated by the model across communication, action, belief creation and perspective taking

  • (2) number of actions in the Minecraft environment taken by the agent.

We consider the latter a more appropriate way of measuring the compute-performance trade-off since it is agnostic to the underlying model, as opposed to counting the effective number of tokens, which is heavily dependent on the prompting, the type of model (base, instruct, reasoning), and the model provider's training recipe.

Please refer to the last point mentioned in the rebuttal for Reviewer qoax.

References:

[1] Zhang, Hongxin, et al. "COMBO: Compositional world models for embodied multi-agent cooperation." arXiv preprint arXiv:2404.10775 (2024).

[2] Sumers, Theodore, et al. "Cognitive architectures for language agents." Transactions on Machine Learning Research (2023).

Comment

Thank you for responding in depth to all of my questions/concerns. Please add these additional results into the paper, space permitting.

Comment

Thank you for the follow-up and your original comments, reviewer rsUZ! The clarifications and the additional results following your comments help strengthen the paper, and we will absolutely include them in the final version, either in the main text or the appendix depending on space constraints. In any case, the discussion will be reflected appropriately.

Please let us know if there is anything further we can do in the remaining author discussion period to move towards a more favorable assessment (apologies if you have already updated your score, we no longer have visibility of the overall rating).

We appreciate the constructive exchange, thank you again!

Review
Rating: 5

The authors propose MindForge, a method that extends an LLM-powered embodied agent with a structured representation inspired by the theory of mind. This representation enables agents to reason about the beliefs, goals, and capabilities of others. The study focuses on inter-agent communication in two scenarios: when agents have differing capabilities and when they are equally capable. The proposed method is evaluated in the Minecraft environment, with Voyager serving as the main baseline.

Strengths and Weaknesses

Strengths:

  • The paper is well-constructed and clearly written, making the proposed ideas and methodology easy to follow.
  • The ablation studies are detailed and well-structured, providing valuable insights into the contribution of each component.

Weaknesses:

  • Figure 6 is unclear; it is not specified which experimental condition corresponds to each colored line, making interpretation difficult. Clarifying the legend or labeling would improve readability.

Questions

  • During the collaborative setting, does the strong agent (e.g., a human expert or GPT-4) actively interact with the environment? Or is it treated more like an oracle, where the weaker agent queries it without requiring the strong agent to update its own beliefs about the environment?

  • The agent's beliefs are represented using a {question: answer} format. How are these questions generated? What mechanism determines which aspects of the environment are relevant or worth querying about?

  • Have the authors tested scenarios involving more than two agents with similar capability levels? If so, how does the communication and belief sharing scale in multi-agent settings?

  • In the case where two agents of similar ability interact, do their beliefs converge over time, or does each agent maintain its own distinct belief system? For example, if there are two valid strategies to reach the same goal, have the authors observed cases where each agent has a different belief about the strategy to adopt?

  • Have the authors explored settings where GPT-4 interacts with only one of the two agents in a collaborative task? If so, does the uninformed agent benefit indirectly from its partner’s interaction with GPT-4—e.g., through shared communication or updated beliefs?

Limitations

Yes

Final Justification

The reviewer is very satisfied with the responses provided and the additional experiments; the paper is technically and experimentally sound.

Formatting Issues

No

Author Response

Thank you for the review and overall positive assessment of our paper, rCMn! We're grateful that you found the manuscript effectively organized and easy to follow; clear exposition was and is a primary goal, so your comment affirms that effort. We also appreciate your endorsement of the ablation suite: by isolating the Theory-of-Mind graph and the memory, we aimed to give readers a transparent view of where the gains truly arise. Your acknowledgment that these studies are both detailed and well-structured reinforces the value of that analytical depth.

  1. Figure 6 is unclear.

We will fix this for the camera-ready version. We will add a legend of the starting population success for each colored line; currently this is denoted on the curve itself in the figure.

  2. Does the strong agent actively interact with the environment?

Yes, the strong agent is actively interacting with the environment while helping the weaker agent. The complexity and open-endedness of environments like Minecraft require agents to get continuous environment feedback as a way to revalidate or correct their pretraining-induced beliefs. Even strong models like GPT-4 struggle to complete tasks in a zero-shot manner, often requiring additional environment signals, as shown in Voyager [60]. Besides stronger grounding in the environment, the lack of environment interaction would hinder the ability of the other agents to take perspective, potentially making communication less effective, as shown in Appendix B Table 5.

Moreover, requiring both agents to interact with the environment offers interesting perspectives on how they co-evolve (e.g., do their belief structures converge over time?). While we consider this out of scope for this submission, we believe it is interesting to investigate how two or more agents influence each other's curriculum, which becomes even more meaningful in a purely collaborative setting.

Expert agent as an oracle

It is important to note that, while the weak agent is indeed asking for help from the strong agent, the strong agent is not an oracle. While rare in our experiments, the weak agent can influence the beliefs of the strong agent given the way our MindForge agents are constructed.

  3. The agent's beliefs are represented using a {question:answer} format. How are these questions generated? What mechanism determines which aspects of the environment are relevant or worth querying about?

To clarify, we employ the {question:answer} format for task-related beliefs, while for perception beliefs and partner beliefs we use a list-like format, {perception_beliefs: list} and {partner_beliefs: list}.

The questions for the task-related beliefs are generated, similar to the base Voyager [60] setup, by the autocurriculum module. Once the curriculum generator (also an LLM) decides which task to tackle next (e.g., collect a wood block), it generates a question for the beliefs about solving the task (e.g., "How do you collect a wood block?"). This becomes the question corresponding to the {question:answer} format. Note that the answer to the question might change over time as a result of communication with another agent, which results in a belief correction.

A concrete example of task beliefs in the {question:answer} format can be seen in Appendix C.3.

The auto-curriculum generator is also responsible for parsing the environmental state based on the agent's sensory input and the items in the current inventory, deciding which task to pursue next based on resources available in proximity. This is also similar to the base Voyager setup.
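For illustration, a minimal sketch of the belief formats described above (field names and example strings are assumptions for exposition, not the paper's exact schema):

```python
# Illustrative belief structure: task beliefs as {question: answer},
# perception and partner beliefs as lists (assumed layout, for exposition only).
beliefs = {
    "task_beliefs": {
        # Question posed by the autocurriculum module for the current task;
        # the answer may later be revised through communication with another agent.
        "How do you collect a wood block?":
            "Walk to the nearest tree and break a log block.",
    },
    "perception_beliefs": [
        "An oak tree is visible roughly ten blocks to the north.",
        "The inventory is currently empty.",
    ],
    "partner_beliefs": [
        "The partner appears to already know how to craft a wooden pickaxe.",
    ],
}

def correct_task_belief(beliefs, question, new_answer):
    """Belief correction: overwrite the stored answer after a communication round."""
    beliefs["task_beliefs"][question] = new_answer
```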

  4. Have the authors tested scenarios involving more than two agents with similar capability levels?

While we have not specifically tested this setting for this manuscript, we consider it a promising future research direction. The way our framework is structured allows for easy scalability in terms of the number of agents. This would, however, add a layer of complexity on the communication side, and given the principled investigation needed for the two-agent case, we did not have sufficient space to do the three-or-more-agents case adequate justice.

  5. In the case where two agents of similar ability interact, do their beliefs converge over time, or does each agent maintain its own belief system?

Throughout interaction in a lifelong collaborative setting (which refers to the symmetric-ability scenario), we do observe that in most cases their beliefs (especially task-related beliefs) match after a certain number of communication rounds. Note that this is not always beneficial; as we discuss in Sec. 5.5, below a certain individual performance threshold, agents can also reinforce false beliefs (see the discussion on the Jury Theorem in 5.5). However, we observe that when performance improves for specific initial competence starting points (Fig. 6), the false beliefs related to the tasks do get corrected and converge. We will add concrete qualitative examples to the Appendix in the camera-ready version.

  6. Have the authors explored settings where GPT-4 interacts with only one of the two agents in a collaborative task?

We ran this experiment as part of this rebuttal, showing that the uninformed agent indeed benefits indirectly from its partner's interaction with GPT-4. Thanks for this suggestion! We will add this to the paper.

In this experimental setup, we let a Mixtral-8x7B agent collaborate in an instructive setting with GPT-4, as shown in Figure 4, and then pair it with an uninitiated (base) Mixtral-8x7B. As the table below shows, the performance of the base Mixtral improves by 17 percentage points (37% → 54%) if its partner previously communicated with GPT-4.

| Mine dirt | Task Success Rate |
| --- | --- |
| Mixtral-8x7B (base) | 37% |
| Base Mixtral-8x7B after communication with GPT-4 | 67% |
| Base Mixtral-8x7B collaborating with partner Mixtral-8x7B that has had GPT-4 interaction | 54% |

This is an advantage of our structured belief representation as well as memory components, allowing agents to recall correct beliefs from memory and share them further with weaker agents.
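As a rough illustration of this propagation path (hypothetical attribute and method names, not the released implementation):

```python
# Hedged sketch: a belief consolidated into semantic memory after interacting
# with GPT-4 is later shared, in natural language, with an uninitiated partner.

def share_consolidated_belief(informed_agent, uninformed_agent, question):
    # Recall the correct belief stored during the earlier GPT-4 interaction.
    answer = informed_agent.semantic_memory.get(question)
    if answer is None:
        return None
    # Phrase the belief for the partner, who updates its own task beliefs.
    message = f"Regarding '{question}': {answer}"
    uninformed_agent.task_beliefs[question] = answer
    return message
```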

Comment

Dear reviewer rCMn,

Thank you for the review and suggestions! We've addressed your questions and incorporated your suggestions. Since we have less than 2 days in the extended author discussion period, we would like to follow up to see if there is anything further we can do in the remaining time to move towards a more positive assessment.

Best,

The authors

Comment

The authors' responses fully satisfy the reviewer. The paper is very interesting.

Comment

We greatly appreciate your thorough and constructive review, and we're glad our work resonated with you.

As the final decision will be based on the overall average score, every strong recommendation makes a real difference. We're doing everything we can in the remaining discussion period to ensure the paper's merits are fully reflected.

If there's anything else we can do or clarify in the time that's left to help strengthen your rating (in case you haven't already since we don't have visibility of the rating anymore), please don't hesitate to let us know.

Thank you,

The authors

Review
Rating: 5

The authors propose a framework for agent memory and decision making in complex environments.

Strengths and Weaknesses

Strengths:

  • demonstrates the benefit of agent architecture improvements on smaller LLMs
  • runs multiple ablations on architecture to help identify key components
  • architecture performs well empirically on low parameter, high complexity situations

Weaknesses:

  • main results shown in table 3 are difficult to parse
    • reporting success rate and task length side by side confuses which results are better than others
    • I believe that I should be comparing each mindforge result to the open weight voyager results with the corresponding model, but that's not clear when looking at the table
  • some figures are nearly unreadable
    • figure 1
    • figure 3
    • figure 5
  • lack of mentioned related work in LLM memory and belief architectures
    • Expel: Llm agents are experiential learners
    • CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization
    • Skill Set Optimization: Reinforcing Language Model Behavior via Transferable Skills
    • Describe, explain, plan and select: interactive planning with llms enables open-world multi-task agents
    • Adapt: As-needed decomposition and planning with language models

Questions

How do you believe your work compares to other belief and memory architectures such as ExpeL, CLIN, SSO, DEPS, ADAPT?

Limitations

yes

Final Justification

Assuming that the authors improve the presentation of their results and elaborate more on how their method compares to related work, I am happy to recommend this paper be accepted.

Formatting Issues

n/a

Author Response

Thank you for your review! We appreciate your recognition that (i) our agent-centric design enables smaller LLMs to punch above their weight, (ii) the systematic ablations isolate how components such as the Theory-of-Mind graph and memory each contribute to the overall lift, and (iii) the architecture proves its value on challenging, high-complexity settings where parameter count alone typically stalls progress. Your acknowledgment underscores the practical importance of making lightweight models more capable through principled architectural choices.

To clarify, MindForge occupies a distinctive point in the design space for multi-agent interactions because it unifies three capabilities seldom found together.

  1. Explicit Theory-of-Mind graph. Each agent carries a causal belief network with four disjoint buckets—task, perception, interaction, partner—and can inspect or update another agent’s beliefs during dialogue. This makes perspective-taking and mis-belief repair first-class operations rather than emergent side effects.

  2. Tri-partite long-term memory. MindForge separates experience into episodic traces, semantic summaries, and procedural code skills, each with tailored retrieval. A reward-gated consolidation step promotes useful skills and corrects false beliefs across episodes, so knowledge compounds over time instead of remaining in context.

  3. Native multi-agent collaboration. Because both pillars are baked in, agents can teach, coordinate, and critique one another, not merely self-reflect. Collaboration is therefore a built-in driver of learning, not an after-thought.
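To make the combination of these pillars concrete, a schematic sketch follows (hypothetical dataclasses; field names are illustrative rather than the paper's exact interfaces):

```python
from dataclasses import dataclass, field

@dataclass
class BeliefGraph:
    """Structured ToM representation with four disjoint belief buckets."""
    task: dict = field(default_factory=dict)         # {question: answer}
    perception: list = field(default_factory=list)   # grounded observations
    interaction: list = field(default_factory=list)  # beliefs derived from dialogue
    partner: list = field(default_factory=list)      # beliefs about the partner

@dataclass
class LongTermMemory:
    """Tri-partite memory with tailored retrieval per store."""
    episodic: list = field(default_factory=list)     # past trajectories and failures
    semantic: dict = field(default_factory=dict)     # consolidated correct beliefs
    procedural: dict = field(default_factory=dict)   # reusable code skills

def consolidate(memory: LongTermMemory, beliefs: BeliefGraph, reward: float) -> None:
    # Reward-gated consolidation: promote task beliefs only after a successful episode.
    if reward > 0:
        memory.semantic.update(beliefs.task)
```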

To begin, all the methods you mention (ExpeL, CLIN, SSO, DEPS, ADAPT) focus on single-agent setups. Nevertheless, each provides valuable pieces (self-reflection, skill libraries, or causal abstractions) but omits at least one of these three elements. The concrete differences follow.

  • EXPEL
    Targets non-parametric experiential learning, yet its representational substrate differs sharply from MindForge’s. The only enduring artefacts are (i) a vector database of full task trajectories and (ii) a free-text “insight list” distilled in offline batches. At run-time the system simply retrieves the top-k similar trajectories and concatenates them—together with any matching insights—into the current prompt. There is no explicit belief graph (social or otherwise): the chain-of-thought produced by ReAct/Reflexion is discarded once the episode ends, so the agent cannot query or revise another agent’s mental state. In MindForge’s terms, ExpeL instantiates just a two-store design (episodic-like vectors plus semantic-like insights), omitting (a) a procedural skill library, (b) reward-gated consolidation across memory types, and (c) any Theory-of-Mind reasoning. MindForge, by contrast, maintains all three stores and exposes belief-revision operations as first-class actions during dialogue, making collaboration a driver of learning rather than a post-hoc prompt augmentation.

  • CLIN
    CLIN stores experience as sentence-level causal abstractions such as “Action → Effect (may/should).” After every trial a dedicated “memory-generator” LLM rewrites this list, pruning to the most salient rules; a single Controller LLM then retrieves the top-k rules (keyword + embedding match) to pick the next goal, yielding fast within-task adaptation. Yet the system remains strictly single-agent: (i) it builds no explicit belief graph, (ii) the causal rules are not promoted into any long-lived semantic or procedural store, and (iii) there is no recursion over other minds—the agent cannot inspect or repair a partner’s beliefs. Cross-task generalisation is limited to a lightweight “meta-memory” text summary, so CLIN lacks the multi-store memory and Theory-of-Mind mechanisms that drive MindForge’s collaborative learning.

  • SSO
    Skill-Set Optimisation (SSO) is best understood as the procedural-memory slice of MindForge: it keeps a library of short, reward-validated code trajectories and retrieves the top-k matches by embedding similarity at inference time, precisely mirroring our procedural store. Yet SSO goes no further—there are no episodic or semantic layers, so any experience that is not rewritten into a surviving skill vanishes when the context window resets; its consolidation is purely numeric, retaining a skill only while its running return stays above ε, which trims dead weight efficiently but cannot migrate corrected beliefs or factual insights across tasks; and because the skills are treated as opaque black boxes, the framework never models why a collaborator failed or which beliefs need updating, leaving it unable to perform perspective-taking or mis-belief repair. MindForge therefore subsumes SSO’s economical skill caching while augmenting it with episodic and semantic memories and, crucially, an explicit belief graph that makes social teaching and belief revision first-class operations.

  • DEPS
    DEPS excels at in-episode self-repair—each cycle it re-describes the world, explains any failure, replans, and selects the next step—yet this textual Descriptor + Explainer remains an ad-hoc string that vanishes with the context window. Because it never crystallises into a graph-structured belief model or persists beyond the episode, DEPS lacks both long-term memory (episodic, semantic, or procedural) and any representation of a partner’s mind. MindForge, by contrast, preserves experience in multi-store memory and manipulates an explicit Theory-of-Mind graph, enabling agents not only to debug themselves but also to query, revise, and teach one another’s beliefs across tasks.

  • ADAPT
    ADAPT offers a neat “plan-on-failure” loop but remains strictly ephemeral: its executor relies on a fixed prompt of hand-written atomic skills, and each decomposition tree evaporates once the episode ends. Because beliefs live only as transient chain-of-thought text, the agent neither stores episodic traces nor forms a structured Theory-of-Mind, and successful sub-plans are never promoted to new skills. MindForge supplies precisely what ADAPT leaves on the table—an explicit BigToM belief graph for self- and other-modeling, tri-partitioned long-term memory that survives across tasks, and reward-gated consolidation that turns ad-hoc sub-tasks into reusable procedures—thereby shifting recovery from a one-off patch to a cumulative, collaborative learning process.

Summary: Across the spectrum, MindForge is the only system that simultaneously (i) reasons over explicit causal beliefs about self and others, (ii) separates memory into episodic, semantic, and procedural components with tailored retrieval, and (iii) treats collaboration as a primary driver of belief revision. The other frameworks focus on one or two of these dimensions but do not combine all three, which is why we believe MindForge occupies a distinctive place in the current literature.

Please let us know if there are further clarifications or information you require that could lead to an increase in the score!

Comment

Dear reviewer ADK5,

Thank you for the review! We've addressed all the questions you've posed. Since we have less than 2 days in the extended author discussion period, we would like to follow up to see if there is anything further we can do in the remaining time to move towards a more favorable assessment.

Best, The authors

Comment

Thank you for contrasting your work with the papers I suggested. I agree that Mindforge introduces new ideas, although it would have been nice to see more direct comparisons with previous work. I know that is sometimes not possible in which case I encourage you to add a summary of the above related work to the paper. I'm raising my score.

Comment

Thank you for the comments, the overall positive assessment, and for raising your score, reviewer ADK5!

We will indeed add the summary of the discussion to the paper and clarify the positioning of MindForge relative to these and other works discussed in this review process.

Best,

The authors

Review
Rating: 3

The paper proposes MINDFORGE, a multi-module framework that augments open-weight LLM-based embodied agents with a causal Theory-of-Mind template, episodic + semantic + procedural memory, and multi-round natural-language communication. Knowledge is primarily distilled from a single expert agent. During inference, two MINDFORGE agents can collaborate, exchanging beliefs and action plans to refine their policies without gradient updates. Experiments are conducted in Minecraft, where the authors report higher success rates on two basic resource-collection tasks and deeper advancement along the tech tree relative to the open-weight Voyager baseline.

Strengths and Weaknesses

Strengths

  1. Clear architectural decomposition: The paper articulates how ToM, memory subsystems, and communication interact, with helpful diagrams and prompt listings.
  2. Initial evidence that social interaction with an expert can offset smaller model capacity: Even with open-weight LLMs, the distilled agent surpasses Voyager on the reported metrics.

Weaknesses:

  1. Limited novelty in the core learning mechanism: The main performance gains come from distilling reasoning produced by a stronger expert model; the distillation pipeline itself follows standard practice.
  2. Questionable effectiveness under purely collaborative use: When two identical MINDFORGE agents collaborate—without an expert—the paper reports a performance drop after communication, hinting that the proposed belief-exchange may introduce noise rather than useful signal.
  3. Narrow evaluation scope: Quantitative success-rate numbers are provided only for dirt and wood collection, two of the simplest Minecraft tasks, leaving the bulk of the MineDojo/Minecraft benchmark unexplored.
  4. Insufficient ablations for a complex system: MINDFORGE chains many hand-engineered modules (three memory types, BigToM belief graph, skill library, conversation scheduler). The importance of each piece is unclear.

Questions

  1. Broaden the task set: It would be great to include success-rate or reward curves on a wider range of MineDojo goals spanning different difficulties.
  2. Deeper ablation: Provide separate ablations that disable each memory subsystem (episodic, semantic, procedural) and the ToM causal graph to quantify individual contribution.
  3. Compare communication cost to expert cost: Multi-round peer communication may involve more cumulative LLM tokens than a single expert rollout. Please report token counts, latency for (a) expert-only, (b) distilled MINDFORGE collaborative.

Limitations

Yes

Final Justification

As illustrated in my review and response to the authors' rebuttal, I still find the task set and baselines not sufficient to support the authors' claim about the BDI framework.

The original tasks "dirt" and "wood" are the two simplest tasks possible in Minecraft, but most of the experiments and ablation studies are based on them. Although the authors added "craft pickaxe" and "mine iron" in the rebuttal phase, the authors did not include the important corresponding ablation studies. I am also skeptical about the "mine iron" results, as it is supposed to be much harder than "dirt" and "wood", but the difficulty does not negatively affect the performance (dirt 67%, mine iron 62%). (I apologize that I overlooked these results at the beginning of the discussion, which did not leave the authors an opportunity to explain this.)

The authors could not provide a baseline framework that communicates with a stronger model for the fundamental tasks (non-life long settings), which leaves only the comparison between MindForge and Voyager (which does not support the communication).

My rating will remain borderline reject. As I am not an expert of multi-agent communication, I am not sure if the current baseline settings are common in the field. I will lower my confidence score.

Formatting Issues

No

Author Response

Thank you for your review qoax! We believe several of your comments contain factual inaccuracies (such as experiments and ablations we have already included in the paper) and misinterpretations of our contributions, which we clarify below.

  1. Broaden the task set: include success-rate or reward curves on a wider range of MineDojo goals spanning different difficulties.

In fact, our main experiment (Fig. 1b and Table 3) demonstrates MindForge agents' effectiveness in an open-ended setting, for multiple tasks along the Minecraft tech tree with increasing complexity, corresponding to MineDojo goals spanning different difficulties. Specifically, we do not restrict the set of tasks MindForge agents perform during lifelong learning but rather let the curriculum continuously choose diverse tasks in the environment. The success-rate and reward curves presented indicate MindForge agents' effectiveness on these diverse tasks over the Voyager counterparts.

We specifically report detailed results later in the paper on the simplest Minecraft tasks (dirt, wood) since the base Voyager agents failed at even these simple tasks when powered by open-weight LLMs.

Nevertheless, for further clarity on more complex tasks up the tech tree (crafting a pickaxe, mining iron), we ran additional experiments quantifying the effect of communication on task success rate (as in Figure 4) with a MindForge agent (LLaMA 70B):

| Task | Comm. Round 0 | Comm. Round 1 | Comm. Round 2 | Comm. Round 3 |
| --- | --- | --- | --- | --- |
| Craft a pickaxe | 20% | 33% | 41% | 45% |
| Mine iron | 41% | 50% | 54% | 62% |

In the instructive setting, MindForge agents improve task success rate by +25 percentage points for crafting a pickaxe and +21 percentage points for mining iron ore. Together with the existing results in Figure 4, Figure 1b, and Table 3, these demonstrate that MindForge agents can learn through natural language communication in lifelong learning settings, where tasks become increasingly hard.

| Task | Comm. Round 0 | Comm. Round 1 | Comm. Round 2 | Comm. Round 3 |
| --- | --- | --- | --- | --- |
| Craft a pickaxe | 20% | 25% | 33% | 33% |

In the collaborative setting with two MindForge agents, we see that for a moderately complex task like crafting a pickaxe, collaboration leads to an increase from 20% to 33%.

  2. Deeper ablations: Provide separate ablations that disable each memory subsystem (episodic, semantic, procedural) and the ToM causal graph to quantify individual contribution.

In fact, we already provide ablations for our novel contributions (specific memory components and perspective-taking) (L269-72):

Ablations on memory subsystems

  • episodic memory (Appendix C Table 6): MindForge agents perform worse when there is no episodic memory to remember the agent's failures.
  • semantic memory (Appendix A Table 4): positive contribution; the MindForge agent uses semantic knowledge gained through prior collaboration to solve a task either in-distribution or out-of-distribution.
  • procedural memory: we do not repeat the ablation since Voyager [60] already performs this experiment and shows the dramatic decrease in lifelong learning performance when this component is removed.

Perspective Taking

In Table 5, Appendix B we have already performed an ablation of perspective-taking as a whole in the MindForge framework, which together with the structured representations yields a boost in task success as the number of communication rounds increases. A visually intuitive example of how partner perspective evolves with communication is shown in Fig. 5.

Please note that performing an ablation on an individual component within the Belief-Desire-Intention framework by Bratman [R1] is not conceptually viable: a lifelong learning agent needs to have a goal (desire) and intention (action) to perform any task in the environment.

Nevertheless, we have additionally performed an ablation comparing our structured representation of Theory of Mind in MindForge with unstructured perspective-taking through a simple 2-step prompt as in Think Twice [64]. Results below indicate that structured representations provide a meaningful advantage.

| Craft a pickaxe | Comm. Round 0 | Comm. Round 1 | Comm. Round 2 | Comm. Round 3 |
| --- | --- | --- | --- | --- |
| MindForge w/o structured ToM | 20% | 29% | 33% | 41% |
| MindForge | 20% | 33% | 41% | 45% |

| Mine iron | Comm. Round 0 | Comm. Round 1 | Comm. Round 2 | Comm. Round 3 |
| --- | --- | --- | --- | --- |
| MindForge w/o structured ToM | 41% | 50% | 54% | 58% |
| MindForge | 41% | 50% | 54% | 62% |

  3. Questionable effectiveness under purely collaborative use. The paper reports a performance drop after communication.

Potential misunderstanding about the goal: we are not looking to provide agents for 'use' towards the best improvement in the Minecraft tasks. Rather, we are investigating social interactions (perspective-taking specifically) as a means for continual improvement of agents that fail in isolation.

In Fig. 6, we show that in the collaborative setting with an extended communication regime, MindForge agents actually improve their task success rate depending on the initial expertise. This is still a positive scientific result characterizing the conditions under which collaboration with symmetric expertise can work over the instructive setting. As we discuss in Sec. 5.5, the "blind leading the blind" scenario, where agents below a certain individual performance threshold can reinforce mistakes, is a well-known consequence of the Jury theorem [3], and our results should be interpreted within this context.
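For reference, the textbook statement of the Condorcet Jury Theorem underlying this discussion (a standard formulation, not a result from the paper): with an odd number $n$ of independent agents, each correct with probability $p$, the probability that a simple majority is correct is

$$P_{\text{maj}}(n, p) = \sum_{k=(n+1)/2}^{n} \binom{n}{k}\, p^{k} (1-p)^{n-k},$$

which tends to 1 as $n$ grows when $p > 1/2$ and to 0 when $p < 1/2$; hence agents below the individual competence threshold can reinforce, rather than correct, each other's mistakes.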

Solving Minecraft tasks requires two core capabilities:

  • i: factual knowledge about the game

  • ii: the ability to translate that knowledge into appropriate actions.

In a collaborative setting, two agents of similar level cannot obtain new factual knowledge (unless there is an external environment signal), raising the question: can two agents of similar level cooperate to ground their code-generation capabilities in the right facts, reducing hallucination?

Intuitively, both communication and belief systems are only as powerful as the capabilities of the underlying models allow. While a purely collaborative scenario does not work out of the box with this set of open-weight models, Figure 6 shows that collaborative MindForge agents can thrive given better initial expertise.

We show both quantitatively (see Figure 4) and qualitatively (see Figure 5) that the proposed system allows for a gradual increase of task performance as more communication is allowed between agents.

  4. Limited novelty. The main performance gains come from distilling reasoning produced by a stronger expert model; the distillation pipeline itself follows standard practice.

In fact, our incorporation of the Cultural Learning framework [56] for multi-agent embodied interactions has, to the best of our knowledge, never been attempted in the literature. We also tackle two modes of Cultural Learning throughout the manuscript: instructive and collaborative learning.

Further, our mode of knowledge distillation doesn't follow standard practice since we employ structured theory of mind representations along with memory. In the current literature, there are two main approaches to distilling knowledge between different models: knowledge distillation by matching the distribution of the expert [22], and generating task-specific data with the teacher model and performing supervised fine-tuning [R2]. We specifically tackle the latter case as a baseline (see Table 1) and find that distillation through supervised fine-tuning only marginally improves the performance of failing Voyager [60] agents on the simplest tasks inside Minecraft: collecting dirt and wood blocks.

The approach we propose in MindForge uses the in-context learning abilities of the models to learn more efficiently in terms of computation, avoiding expensive and data intensive weight updates. Additionally, owing to natural language communication, MindForge works in human-agent settings too. We have also tested this scenario (Table 2), where interacting with a human results in improved task success rate (Mixtral-8x7B goes from 29.15% to 87% task success).

  5. Compare communication cost to expert cost: Multi-round peer communication may involve more cumulative LLM tokens than a single expert rollout. Please report token counts, latency...

Reporting the absolute token count would not be a meaningful way of measuring cost since token count is dependent on the prompt, model variant, type of model (base, instruct, reasoning), and API provider, to all of which our framework is agnostic. Instead, we consider the number of actions needed to complete an objective as the go-to measure when it comes to performance and trade-off, which we have reported in Fig. 1b.

Potential misinterpretation of goal: we are not seeking to reduce cost compared to expert rollouts in an engineering sense, but rather to get rid of the reliance on large, pretrained experts in favor of in-deployment skill acquisition.

This means that if we look at things like token counts, we must also account for the computational cost of pretraining the experts for a fair comparison overall. Instead, as we show in Fig. 1b, MindForge agents take fewer actions to reach a tech-tree milestone and reach new milestones that were previously unattainable for the agent.

References:

[R1] Bratman, M. E. (1987). Intention, Plans, and Practical Reason.

[R2] Radford, Alec, et al. "Improving language understanding by generative pre-training." (2018)

Comment

Thank you for your response. I appreciate the comprehensive ablation studies provided, which help clarify the internal design choices of your method. However, I still have several concerns regarding the fairness of comparisons and the claimed novelty around natural language communication.

On the Fairness of Comparison with Voyager:

Your method introduces a more capable model into the decision-making loop, which inherently gives it an advantage over Voyager. This raises a concern about the fairness of the comparison. In the extreme case, inserting an oracle into the pipeline would naturally lead to better performance, but such a setup wouldn’t offer meaningful insights into the core contributions of the approach. For a fairer evaluation, it would be helpful to include comparisons against methods that also leverage a similarly capable model.

On the Claimed Novelty of Natural Language Collaboration:

If the central contribution lies in enabling collaboration through natural language, I would expect comparisons with other multi-agent communication or coordination methods—especially those that also use natural language as the medium. Evaluating only against Voyager, which involves no agent collaboration, does not demonstrate whether your communication strategy is more effective or efficient than existing alternatives.

While I do not consider myself an expert in cultural learning, language-based communication among multiple agents appears to be a fairly common paradigm in the literature. For example, works such as [1] also explore natural language as a medium for collaboration. Given this, the idea of multi-agent communication via natural language does not seem particularly novel on its own.

In conclusion, if your setup includes both a weaker and a stronger model, comparing only to Voyager (which uses only the weaker one) does not provide a complete picture. A fair evaluation should consider baselines that also involve access to a similarly capable model. Moreover, while the ablations are appreciated, the use of natural language for agent communication and skill acquisition does not seem to be a novel contribution itself, given the current landscape of related work.

[1] Building Cooperative Embodied Agents Modularly with Large Language Models. https://arxiv.org/abs/2307.02485

Comment

Thank you for the follow up, qoax, and acknowledging our existing and new ablations which further clarify our design choices. The points you raise now were not in the review so we couldn't address them earlier. We believe that these points still have potential underlying misunderstandings which we clarify first before addressing them.

Clarifications of potential misunderstandings

C1.

"central contribution lies in enabling collaboration through natural language"

Not quite. Our central contributions (Abstract L4-8 and L52-56) actually are in enabling lifelong learning in agents by introducing (i) structured representations of self and others' beliefs, desires, and intentions (the BDI framework; see [R1] in our rebuttal) for taking partner perspectives, (ii) the application of learned skills through episodic and semantic memory, and (iii) in-context multi-agent collaboration in natural language (which you note).

"For example, works such as [1] also explore natural language as a medium for collaboration"

We have already discussed the work you refer to in our manuscript (reference [67] L79-88, L48). Please note that this work, COELA, does not incorporate structured BDI representations, and does not enable lifelong open-ended task exploration through an autocurriculum generator.

C2.

"Your method introduces a more capable model into the decision-making loop, which inherently gives it an advantage over Voyager."

We interpret this comment to mean: pairing a weaker MindForge agent with a stronger MindForge agent provides an unfair advantage in comparison to a weak (open-weight) Voyager. Please let us know if we misinterpret it. To be clear, partnering with a strong agent only takes place in the instructive setting. Crucially, we in fact provide a direct comparison between open-weight Voyager and MindForge while controlling for the underlying model capability. The results from Table 1 and Table 2 are a direct comparison between open-weight Voyager and open-weight MindForge without a partner. For dirt tasks, we see an improvement over Voyager from 27% task success rate to 29% task success rate, while for wood we see a +23% improvement over open-weight Voyager (Table 2, Column 2).

C3. Potential misunderstanding about goals and how this affects comparisons

The point is not to ask if there is a single agent that has been pre-trained to achieve a performance comparable to what two weak agents can achieve. GPT-4 has near-perfect performance on simple Minecraft tasks (see Tab. 1 & 3). Crucially, we address this point made by the Voyager authors (see L35-36): "Voyager requires the quantum leap in code generation quality from GPT-4 which GPT-3.5 and open-source LLMs cannot provide". The key question we tackle is whether social interactions in-context, especially structured perspective taking, can serve as a promising means for agents to continually evolve beyond single-agent starting performance. Our experiments and ablations demonstrate this to be true.

C4.

"inserting an oracle would naturally lead to better performance"

Access to an oracle does not directly lead to better performance (see rebuttal for rCMn). The weak agent needs to figure out how to query the oracle to get the information it wants. Conversely, the oracle must reliably provide relevant information. Our in-context knowledge distillation through explicit perspective-taking with structured BDI representations is novel and useful, as established by our comprehensive experimental results and ablations.


Responding to the newly raised points

| Capability | Voyager | COELA | MindForge | Comment |
| --- | --- | --- | --- | --- |
| Structured BDI | ❌ None | ❌ None | ✅ Explicit BDI with causal ToM graph | Only MindForge models beliefs, desires, and intentions structurally |
| Autocurriculum generator (lifelong learning) | ✅ Present | ❌ Not present | ✅ Present | Voyager and MindForge use open-ended task generation |

  1. Fairness in comparison against Voyager

Without experimentation, a claim that our core contributions (BDI, episodic and semantic memory) are useful (see C1) is only a hypothesis. As we build upon the Voyager agent, comparison against Voyager is required at minimum for empirical rigor. Moreover, we do have fair comparisons in the sense that you are implying (see C2), even though that is not the purpose of the comparison (see C3).

  2. "Comparisons with other multi-agent communication" (like COELA)

Please see C1 and the table above for the actual core contributions, which are novel, and the component differences with COELA. We argue that comparing against agents like COELA is actually unfair given that they are not equipped to learn open-ended tasks sequentially. Instead, our ablation removing explicit BDI is effectively similar to comparing against COELA (which is also based on the SOAR cognitive architecture), but with an autocurriculum generator. Here, we find that our novel contribution of structured perspective taking actually helps (see Tab. 5).

Comment

Thank you for the detailed clarification and for expanding on the design choices behind MindForge. The additional context is helpful, and I recognize the effort invested in the ablations. After considering your responses, I still have four remaining concerns.

1 Novelty Clarification

You note that MindForge combines structured BDI reasoning, episodic memory, semantic memory, and natural-language collaboration. As we have converged on, language-based communication as well as episodic / semantic memory stores are now standard features in many recent agentic systems. Consequently, the BDI formulation appears to be the singular differentiating element of the framework. It would strengthen the paper to foreground this more explicitly and to calibrate the novelty claim around BDI rather than around “natural-language multi-agent communication” in general.

2 Voyager-versus-MindForge Clarity

Tables 1 and 2 already show that, when the underlying model capacity is held constant, open-weight MindForge outperforms open-weight Voyager (this is also not contributed by the BDI framework as there is no communication at all). However, these numbers are split across two separate tables. Placing the controlled Voyager ↔ MindForge (no collaboration) scores side-by-side in a single table would make the fairness of the comparison far clearer to the reader.

3 Missing baselines that involve communication

Because the main intellectual contribution is a communication-centric BDI framework, it seems insufficient to benchmark primarily against Voyager, which contains no inter-agent communication at all. To validate the effectiveness of your proposed communication protocol, please consider adding direct comparisons with:

  1. any recent method that enables LLM-LLM or human-LLM collaboration via free-form language,
  2. or an ablated MindForge that replaces BDI with a simpler, well-known coordination scheme (simply removing perspective-taking seems insufficient as "the BDI framework without perspective-taking" may not be a valid coordination scheme). Such baselines would allow readers to attribute performance gains specifically to the structured BDI reasoning, rather than to the mere presence of communication.

4 Generality of the empirical support

The current experiments focus on simple Minecraft tasks and use open-weight models that are far below state of the art. From this narrow slice it is difficult to conclude that the BDI protocol would transfer to more practical and challenging settings—for example, a human collaborator working with GPT-4-level agents on complex, long-horizon missions. Broadening the evaluation to harder tasks or stronger models would provide more convincing evidence of BDI’s practical utility.

Your clarifications have resolved several earlier ambiguities, and I will raise my score to 3. Nonetheless, the concerns above keep me from recommending acceptance in the current form.

Comment

We are glad that the additional context is helpful, and thank you for acknowledging the effort in the experiments. For the remaining concerns, we find that (i) several suggestions are already implemented in the current paper, and (ii) for the rest, we genuinely wish to incorporate your feedback but believe a few crucial misunderstandings are preventing us from doing so, even for a future version. We clarify these below.


1. Novelty centered on BDI

Language-based communication as well as episodic/semantic memory stores are now standard features.

We wonder if you are overlooking the critical context of agents capable of lifelong learning after deployment within open-ended environments (abstract - L5, Introduction L62, Sec. 5.4)—these features are far from standard in this space. As detailed in our response to Reviewer ADK5, comparable frameworks are largely single-agent focused: they may include one or two of MindForge’s components but do not combine (i) explicit causal beliefs about self and others, (ii) a tri-partite memory with tailored retrieval, and (iii) collaboration as a primary driver of belief revision. This integration is not a modular "add-on"; it requires tightly coupled interfaces between reasoning, memory, and interaction. For this reason, we believe assessing novelty solely through the lens of a single component risks missing the broader contribution of the framework.
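To make the "tightly coupled interfaces" point concrete, here is a minimal, purely illustrative sketch of how structured beliefs, a tri-partite memory, and partner messages might interface. All class and function names below are our own hypothetical placeholders, not the paper's actual API.

```python
# Illustrative sketch only: hypothetical names, not MindForge's actual interfaces.
from dataclasses import dataclass, field

@dataclass
class MemoryStores:
    episodic: list = field(default_factory=list)    # past episodes, e.g. {"task": ..., "outcome": ..., "transcript": ...}
    semantic: list = field(default_factory=list)    # distilled facts about the world / partners
    procedural: dict = field(default_factory=dict)  # skill name -> executable skill code

    def retrieve(self, task: str) -> dict:
        """Tailored retrieval: each store is queried differently for the same task."""
        return {
            "episodes": [e for e in self.episodic if task in e.get("task", "")],
            "facts": [f for f in self.semantic if task in f],
            "skill": self.procedural.get(task),
        }

def revise_beliefs(beliefs: dict, partner_message: str, memory: MemoryStores, task: str) -> dict:
    """Collaboration as a driver of belief revision: partner input and retrieved
    memories are folded into the belief state before the next action is chosen."""
    updated = dict(beliefs)
    updated.setdefault("evidence", []).append(partner_message)
    updated["retrieved_context"] = memory.retrieve(task)
    return updated
```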

Calibrate the novelty claim around BDI rather than "natural-language multi-agent communication."

We believe we have done exactly this! Theory of Mind appears in the title, the abstract, and as the first element of the Methods section, while the quoted text does not appear at all. If you can point to places where the framing suggests otherwise, we will gladly revise the language.


2. Voyager-versus-MindForge clarity

Not contributed by the BDI framework as there is no communication at all.

This is not quite accurate: even in single-agent mode, the sensory input is processed into structured BDI; only the partner belief recursion is unused (Fig. 8). This internal structuring still drives performance improvements, much as it provides a basis for self-representation in humans and enables practical reasoning in isolation [R3]. That said, we agree that placing the controlled Voyager ↔ MindForge (no collaboration) results side-by-side will improve clarity, and this can be done within this review cycle.
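To illustrate the single-agent point, here is a minimal sketch of a structured BDI record in which the partner-belief slot is simply left unused outside collaboration. All names are hypothetical and not taken from the paper.

```python
# Minimal, hypothetical sketch: percepts are always parsed into a structured BDI
# record; only the partner-belief recursion is skipped in single-agent mode.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class BDIState:
    percepts: List[str]                           # e.g. ["oak_log in view", "inventory empty"]
    beliefs: List[str]                            # e.g. ["an axe speeds up wood collection"]
    desires: List[str]                            # e.g. ["craft a wooden pickaxe"]
    intended_action: str                          # the next committed action
    partner_beliefs: Optional["BDIState"] = None  # filled only in collaborative mode

def structure_observation(raw_obs: str, collaborative: bool) -> BDIState:
    # In the real system an LLM would populate these fields from raw_obs;
    # here we only show the shape of the representation.
    state = BDIState(
        percepts=[raw_obs],
        beliefs=["<beliefs parsed from the observation>"],
        desires=["<current goal>"],
        intended_action="<next action>",
    )
    if collaborative:
        state.partner_beliefs = BDIState([], ["<partner's presumed beliefs>"], [], "")
    return state
```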


3. Missing baselines that involve communication

(a) any recent method that enables LLM-LLM collaboration...

We reiterate that directly comparing to non-lifelong-learning systems risks a strawman: if they fail to acquire new skills, is it due to the absence of BDI or because they are not designed for open-ended evolution? If you have a specific recent method that you believe is well matched in scope and objectives, we will make every effort to add it within the time remaining.

(b) Simpler coordination schemes
BDI is an agent-level reasoning architecture, not merely a coordination protocol. Nonetheless, we have already included an ablation in our rebuttal above (in addition to the one in Appendix B, Table 5 of the paper) that removes the structured ToM representation and replaces it with the 2-step reasoning from Think Twice [64], showing the benefit of structured ToM. For completeness:

Craft a pickaxe             R0     R1     R2     R3   (communication rounds)
MindForge (Think Twice)     20%    29%    33%    41%
MindForge                   20%    33%    41%    45%

Mine iron                   R0     R1     R2     R3   (communication rounds)
MindForge (Think Twice)     41%    50%    54%    58%
MindForge                   41%    50%    54%    62%
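For readers unfamiliar with the two conditions, the sketch below contrasts a generic two-pass (reconsider-then-answer) prompting scheme with one that injects explicit partner-belief fields. The wording and function names are ours alone; we do not claim they reproduce either MindForge's or Think Twice's actual prompts.

```python
# Purely illustrative contrast; prompts and names are hypothetical.
from typing import List

def two_step_prompts(task: str, partner_msg: str) -> List[str]:
    """Generic two-pass scheme: draft advice, then reconsider it once."""
    return [
        f"Task: {task}. Your partner said: {partner_msg}. Draft your advice.",
        "Reconsider your draft from the partner's point of view and give final advice.",
    ]

def structured_tom_prompt(task: str, partner_msg: str, partner_bdi: dict) -> str:
    """Structured-ToM scheme: partner percepts/beliefs/desires are made explicit."""
    return (
        f"Task: {task}. Your partner said: {partner_msg}.\n"
        f"Partner percepts: {partner_bdi['percepts']}\n"
        f"Partner beliefs (possibly false): {partner_bdi['beliefs']}\n"
        f"Partner desires: {partner_bdi['desires']}\n"
        "Identify any false belief, then give advice tailored to the partner's state."
    )
```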

4. Generality of the empirical support

Task complexity. Our setting follows the full Minecraft tech tree (Fig. 1b, Table 3), not just isolated toy tasks. Beyond generalizing to unseen tasks, the framework is also tested across a diverse range of biomes in which the agent may spawn. In some cases, the agent begins in a biome with no natural resources relevant to its current goal and must discover alternative strategies to complete the task. During our experiments, the agent encountered the following biomes: forest, plains, savanna, birch forest, jungle, snowy plains, badlands, flower forest, and ocean. The per-task evaluations are provided solely to contextualize the starting point at which open-weight SOTA agents fail.

Use of non–SOTA models. This is deliberate: our goal is to demonstrate social interaction as an alternative to pretraining, a setting directly relevant to domains where training an agent from scratch with base capabilities is infeasible (e.g., privacy-critical healthcare) or resources are constrained (e.g., global south institutions). In such cases, starting from a weak base model is not a limitation—it is the reality. MindForge’s significance lies in showing that even such agents can continue to evolve after deployment.

[R3] Rao, Anand S., and Michael P. Georgeff. "BDI agents: From theory to practice." Icmas. Vol. 95. 1995.

Comment

Thank you for the detailed response. It resolved part of my concerns, but some of them still remain.

Although your method and experiments may focus on lifelong learning in an open-ended environment, the fundamental tasks "getting a wood block" and "getting a dirt block" are still a major part of your experiments, especially since those tasks are the testbed for almost all of the important ablation studies.

Missing baselines

We reiterate that directly comparing to non-lifelong-learning systems risks a strawman: if they fail to acquire new skills, is it due to the absence of BDI or because they are not designed for open-ended evolution?

I agree this concern applies to the lifelong setting. However, a substantial portion of the results and many of your key ablations hinge on the fundamental tasks "get dirt" and "get wood." For these tasks, it is both feasible and informative to include additional baselines that do not require lifelong capabilities. Doing so would give readers a clearer sense of which performance gains are attributable to your design choices, relative to other designs, before moving to the lifelong regime.

Task Complexity

Our setting follows the full Minecraft tech tree (Fig. 1b, Table 3), not just isolated toy tasks.

As discussed above, other than the tech tree, most of your experiments are based on "dirt" and "wood." I personally believe "dirt" and "wood" are very simple tasks in the Minecraft setting. A more diverse task set seems to be necessary (e.g., killing a sheep or chicken, crafting a wooden axe, ...) to draw conclusions for both the main results and the ablation studies. This is orthogonal to your claim about diverse biomes.

Use of weak models.

This is deliberate: our goal is to demonstrate social interaction as an alternative to pretraining

I understand that this is a deliberate design choice. However, limiting evaluation to weak models on very simple tasks undermines broader claims. For example, as an extreme case, if a recent method claims that "our method significantly improves the performance of vgg-16 and ResNet", people will challenge whether the method is practical enough to support broader claims. I believe that, combined with the choice of simple tasks, the use of weak models prevents the authors from making broader claims about their work, and this remains a limitation.

Comment

Thank you for the continued engagement, qoax; we appreciate the discussion. Since we are at the end of the discussion phase, may we assume that our earlier clarifications have addressed your earlier concerns on:

(i) fairness of Voyager comparison,

(ii) novelty claims and foregrounding of BDI,

(iii) computation cost comparison, and

(iv) effectiveness in the collaborative setting?

For these remaining points that mainly deal with task complexity and task-specific comparisons, we:

  1. Clarify a potential missed point about task complexity and our claims there
  2. Outline what the proposed comparisons would entail, to show that (i) we have considered them in good faith, (ii) they are not trivial to operationalize, and (iii) they do not alter the valid basis for our core contribution claims.

Task complexity

A more diverse task set seems to be necessary (e.g., killing a sheep or chicken, crafting a wooden axe, ...)

Please note that we have already run and reported detailed results for you above in this rebuttal on crafting a pickaxe and on another, more complex task, mining iron.

Moreover, in Fig. 1b our agents actually reach all the way up to crafting a chainmail helmet, which requires succeeding at all the earlier dependent tasks (get coal, craft a stone pickaxe, etc.), while the corresponding Voyager does not manage to go beyond the wooden pickaxe.

Killing a chicken is not fundamentally complex: a wooden axe and a single hit are enough. In contrast, our ladder of tasks is more compositional.

I personally believe "dirt" and "wood" are very simple tasks in the Minecraft setting

As humans, we agree with you, but these tasks remain non-trivial for open-weight agents even after fine-tuning on the Minecraft Wiki and GPT-4 logs (L236–241). This is why we need to report them in greater detail: they are genuine stumbling blocks in our evaluation. To further emphasize the point, we describe below what an experiment of the nature you are suggesting would require.

On adding non-lifelong baselines

A fair per-task comparison of the kind you suggest would involve the following steps (see the sketch after this list):

  1. Manually ordering tasks by prerequisite dependencies.
  2. Ensuring feasibility: controlling the spawn environment so each task is possible.
  3. Evaluate each task for both agents, starting with the simplest:
    • 3a (success): record performance and move to the next task.
    • 3b (failure): "bootstrap" missing skills without lifelong learning (e.g., grant prerequisite items, inject a single demo into memory, or fine-tune briefly), then rerun.
  4. Stop when a task cannot be mastered even after bootstrapping.
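A minimal sketch of such a loop is given below. Every helper is a trivial placeholder of our own; it is not an existing harness or any API from the paper.

```python
# Hypothetical sketch of the per-task comparison loop (steps 1-4 above).
import random

def spawn_feasible_environment(task):       # step 2: placeholder "environment" setup
    return {"task": task}

def run_episode(agent, env, task):          # placeholder rollout; a real run would call Minecraft
    return random.random() < agent["skill"]

def bootstrap_missing_skills(agent, task):  # step 3b: e.g. grant items or inject a single demo
    agent["skill"] = min(1.0, agent["skill"] + 0.2)

def compare_agents(tasks_in_dependency_order, agents):
    results = {}
    for task in tasks_in_dependency_order:  # step 1: tasks pre-ordered by prerequisites
        env = spawn_feasible_environment(task)
        for agent in agents:                # step 3: evaluate, simplest task first
            ok = run_episode(agent, env, task)
            if not ok:                      # step 3b: bootstrap without lifelong learning, rerun
                bootstrap_missing_skills(agent, task)
                ok = run_episode(agent, env, task)
            results[(agent["name"], task)] = ok
        if not any(results[(a["name"], task)] for a in agents):
            break                           # step 4: stop when no agent can master the task
    return results

# Example usage with toy agents:
# compare_agents(["get wood", "craft wooden axe"],
#                [{"name": "A", "skill": 0.3}, {"name": "B", "skill": 0.6}])
```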

Please note that step 3b is quite non-trivial: Minecraft skills are compositional. Failing get wood but attempting craft wooden axe is like trying to play a song on the piano without first learning the notes; it requires manual "cognitive surgery" on multiple subsystems, which is not always feasible.

For 3a, our BDI ablation already approximates this branch and reflects what a comparison against CoELA would likely look like for such a baseline.

Crucially, please consider that running full 3a/3b chains for all tasks would require substantial extra engineering and compute (our current experiments already exceed 600 GPU-hours).

On “weak” models

our method significantly improves the performance of vgg-16 and ResNet

A more accurate analogy to our claim would be "our method significantly improves the performance of vgg-16 and ResNet without any pretraining, directly at test time". This would still be significant for the community because it

  • shifts the research question and the contribution away from absolute performance gains toward improvement at test time under constraints.
  • Of course, the core methodological components (e.g., explicit BDI through the prompts we have provided in this case) can still be integrated into other, more powerful models. Our method actually provides increasing performance returns as the base in-context capabilities of these open-weight LLMs increase. Please see this phenomenon demonstrated in our experiments using Mistral-7b and LlaMa-70b.
  • Please specify the broader claim you are referring to. We make no case that this is the best way to "solve" Minecraft, which your analogy involving vgg and ResNet seems to allude to; that is not the claim we make.
Comment

Dear Area Chair/s,

Since the discussion phase is coming to a close, we thought of summarizing the reviews and our rebuttal for the benefit of the ACs and the reviewers. Please note that this global summary does not change the content of the rebuttal discussions, and is only meant as a summary.

Core contributions

  1. Novel lifelong cultural-learning agent framework. MindForge is a framework that endows embodied agents with these abilities: (i) explicit, multi-agent perspective-taking via structured BDI beliefs; (ii) episodic, semantic, and procedural memory for applying and consolidating skills; (iii) natural-language collaboration among agents.

Crucially, these are tightly coupled—not plug-in modules—so evaluating novelty at the level of any single component risks missing the overarching contribution.

  2. Lifelong Learning Evaluation on Open-Ended Tasks: Within instructive and collaborative cultural-learning settings, MindForge enables open-weight agents to acquire skills through social interaction, addressing two persistent failure modes (false beliefs and faulty execution) where standard distillation or supervised fine-tuning underperform (see Table 1).

Key strengths noted by reviewers

  1. Clarity and Structure: The paper is well-constructed, clearly written, and the proposed methodology is easy to follow (rCMn)
  2. Novel Architecture: Explicit BDI for perspective-taking plus tri-partite memory and natural-language interaction (rsUZ)
  3. Strong Empirical Results: The architecture demonstrates significant performance gains, particularly in enabling smaller, open-weight LLMs to succeed in complex situations where they would otherwise fail. (ADK5, rCMn, rsUZ)
  4. Insightful Ablation Studies isolating ToM and memory contributions (rCMn, ADK5, rsUZ)

Key concerns and Our responses

  1. "MindForge functions more like an engineering framework" (Reviewer rsUZ)

Response. To our knowledge, MindForge is the first to operationalize structured perspective-taking within a cultural-learning curriculum, yielding insights into how collaborating agents' beliefs evolve over a course of tasks in Minecraft.

Outcome: Reviewer acknowledged our detailed clarifications; we will incorporate the new results in the final version.

  2. Related works in LLM memory and belief architecture (Reviewer ADK5)

The reviewer asked about positioning our own contributions with respect to specific other papers.

Response: We provided a point-by-point comparison to ExpeL, CLIN, SSO, DEPS, and ADAPT (suggested by the reviewer). While these single-agent systems share components, MindForge is distinctive in synthesizing (i) explicit multi-agent ToM, (ii) structured memory, and (iii) collaboration as a driver of learning.

Outcome: Reviewer is satisfied and raised their score; we will add this summary to the paper.

  3. Unquantified performance vs. communication cost trade-off (Reviewers rsUZ and qoax)

Both Reviewers rsUZ and qoax mention quantifying the cost or overhead of communication in the MindForge framework.

Response: We quantify the trade-off (Figs. 4, 6 and rebuttal to rsUZ) and argue that environmental actions are a more decision-relevant cost metric than token counts; if tokens are compared, pretraining cost for expert single-agent baselines should also be considered. Our aim is not to “solve” Minecraft at minimal cost, but to study how perspective-taking and memory enable on-deployment skill acquisition through social interaction.

Outcome. rsUZ and qoax thanked us for the in-depth clarification and qoax moved to new and different concerns.

  4. Further clarifications to qoax

The original review had factual inaccuracies; here are the remaining clarifications:

  • Novelty centered on BDI. Response. BDI is already foregrounded; the novelty lies in lifelong, open-ended learning while integrating explicit multi-agent ToM, a tri-partite memory with tailored retrieval, and collaboration as a driver of belief revision. We will tighten the wording wherever this emphasis is not clear.
  • Voyager vs. MindForge clarity. Response. Even without partner communication, inputs are structured via BDI; only partner recursion is unused. We will place the controlled Voyager↔MindForge (no-collab) results side-by-side in one table.
  • Communication baselines. Response. We have already included the ablation replacing structured ToM with "Think Twice" (pickaxe R3: 45% vs. 41%; iron R3: 62% vs. 58%), attributing gains to structured ToM rather than to the mere presence of communication. We welcome a concrete suggestion for a comparison against a lifelong-learning system.
  • Generality. Response. We evaluate across the full tech tree and diverse biomes; using non-SOTA models is deliberate to study post-deployment social learning where heavy pretraining is infeasible.

We hope our rebuttal will drive constructive discussions towards evaluating and improving MindForge.

Final Decision

This paper proposes MindForge, a method that promotes lifelong cultural learning among LLM-based agents through Theory of Mind, specifically, perspective-taking. Compared to prior LLM-based agent methods, it mainly introduces a structured BDI representation. Evaluation on Minecraft shows that MindForge agents powered by open-weight LLMs significantly outperform a strong baseline, Voyager.

During the author-reviewer discussion, the most discussed issues were the fairness of comparison with Voyager and the novelty (e.g., in comparison to COELA). Based on the discussion, the converged opinion is that the main novelty lies in the BDI component.

Given the rebuttal and the discussion, AC thinks that the contribution of this paper is significant enough for acceptance. Having said that, AC would urge the authors to revise the claims about the three innovations. Based on the converged conclusion after the discussions, the more accurate claim appears to be introducing BDI representations to enable lifelong cultural learning.