PaperHub
Score: 6.8 / 10
Poster · 4 reviewers
Ratings: 4, 4, 5, 4 (min 4, max 5, std 0.4)
Confidence: 3.5
Novelty: 2.8 · Quality: 2.5 · Clarity: 3.0 · Significance: 2.5
NeurIPS 2025

WALL-E: World Alignment by NeuroSymbolic Learning improves World Model-based LLM Agents

OpenReview · PDF
Submitted: 2025-05-05 · Updated: 2025-10-29

Abstract

Keywords
world model, embodied agent, large language model, neurosymbolic learning

Reviews and Discussion

Review (Rating: 4)

In the Mars and ALFWorld environments, this paper proposes a training-free method that learns an environment's symbolic knowledge complementary to large language models (LLMs). Moreover, this paper proposes an RL-free, model-based agent, "WALL-E", built on the model-predictive control (MPC) framework. Experimental results show substantial improvements, with reported increases in success rates and scores.

Strengths and Weaknesses

Strengths

  1. The paper provides a thorough description of the proposed method, making it easy to follow.
  2. The paper conducts experiments across diverse environments (Mars and ALFWorld), which validates the effectiveness and generalizability of the proposed method. In addition, the paper includes comprehensive ablation studies that help to isolate and assess the contributions of individual components.

Weaknesses

  1. The approach of using LLMs to summarize rules and knowledge is intuitive and has been explored in other domains.
  2. Translating symbolic knowledge into executable code limits the generalizability of the proposed approach. Although the paper demonstrates its effectiveness in a 2D environment, the reliance on API calling (i.e., interacting with the environment through executable code) diverges from human-like behavioral patterns. This interaction paradigm poses significant challenges for adaptation to 3D environments or real-world settings, thereby constraining the contribution of the work to virtual 2D agent scenarios.

Questions

  1. When translating symbolic knowledge into executable code, how is the correctness of the generated code verified (syntax, logical consistency, and intended functionality)?
  2. How long does the symbolic knowledge learning phase take, and how many trajectories are required for effective learning? Additionally, what is the storage overhead associated with maintaining the symbolic knowledge base?

Limitations

yes

Final Justification

I have read the author's rebuttal. Most of my concerns have been addressed, and I will keep my score.

Formatting Concerns

NA

Author Response

Thank you for your valuable and detailed review. We're glad you engaged deeply with our work, and we address your concerns point by point below.

1. Novelty Beyond LLM-Based Rule/Knowledge Summarization: Here are the distinctive novelties of WALL-E: (1) a training-free world alignment framework that aligns pretrained LLMs with a specific environment's dynamics, which leads to a precise neurosymbolic world model; (2) an effective symbolic knowledge learning scheme (using the LLM as an inductive reasoner); (3) a world model-grounded agent with efficient model-predictive control (MPC, using the LLM as the optimizer).

These novelties set WALL-E apart from existing world modeling strategies: we only need to learn the complementary symbolic, verifiable knowledge that pretrained LLMs do not already cover, which is far more efficient than training a world model from scratch. Our method does not require any parameter training. We also replace the low-level optimization in classical MPC with high-level reasoning by LLMs under symbolic constraints, which makes planning more precise for challenging long-horizon tasks.

2. The Reliance of Code Rules on API Calling Limits Generalization Beyond 2D Environments: Executable code via API calls is a strength rather than a limitation—and it is increasingly becoming a popular design in recent 3D-embodied robotic systems, especially for hierarchical agents where high-level LLM planners call low-level controllers via API calls [1–4].

Our neurosymbolic MPC agent operates at a high level of abstraction (e.g., code rules execution), agnostic to the environment's dimensionality; the API layer serves to map these abstractions to environment-specific low-level controllers. This setting is important for maximizing the potential of pretrained LLMs as they are good at high-level inductive reasoning.

As a result, adapting our method from 2D to 3D or real-world scenarios only requires updating the API/controller bindings, without changing the neurosymbolic learning or planning mechanism.
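
For illustration, a minimal sketch of this idea (hypothetical class and method names, not the actual interface used in the paper) shows how the same symbolic precondition check can be bound to different low-level controllers:

```python
# Illustrative sketch only: the symbolic layer calls an abstract action API,
# and only the concrete bindings differ between a 2D simulator and a 3D stack.

class PickUpAPI:
    """Abstract high-level action that the symbolic rules reason about."""
    def pick_up(self, obj_id):
        raise NotImplementedError

class GridWorld2D(PickUpAPI):
    def pick_up(self, obj_id):
        # would forward to the 2D simulator's command interface
        return f"2D env: pickup({obj_id})"

class RobotController3D(PickUpAPI):
    def pick_up(self, obj_id):
        # would forward to a low-level grasp controller in a 3D or real setting
        return f"3D controller: grasp({obj_id})"

def apply_pickup_rule(state, obj_id, api):
    # The precondition check is dimension-agnostic; only `api` changes.
    if state.get("hand_empty", True) and obj_id in state.get("reachable", []):
        return api.pick_up(obj_id)
    return "rejected: precondition not met"
```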

3. Verification of Code Rule Correctness (Syntax, Logic, Functionality): We verify correctness by discarding any rule that contradicts ground-truth transitions from real trajectories $\tau^{\text{real}}$. For syntax, we filter out any rule that raises an execution error during any run. For logical consistency and intended functionality, we do not manually inspect rules; instead, we rely on empirical validation: only rules that are correct, executable, and successfully cover/correct many LLM-mispredicted transitions are retained—these form the output of "Stage 4: Code Rules Pruning". In practice, this pruning process consistently yields logically sound and functional rules (see the learned rules in Appendix D). We will further clarify this in "Stage 4: Code Rules Pruning" and the Appendix.
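
As a rough sketch of this filtering step (hypothetical names; the actual implementation follows Stage 4 in the paper), candidate rules can be validated against recorded transitions as follows:

```python
# Sketch: keep a candidate rule only if it never raises an error and never
# contradicts a ground-truth transition from the real trajectories.

def filter_rules(candidate_rules, real_transitions):
    """candidate_rules: callables rule(state, action) -> predicted outcome or None.
    real_transitions: list of (state, action, outcome) tuples."""
    kept = []
    for rule in candidate_rules:
        valid = True
        for state, action, outcome in real_transitions:
            try:
                pred = rule(state, action)
            except Exception:
                valid = False   # execution error -> discard (syntax check)
                break
            if pred is not None and pred != outcome:
                valid = False   # contradicts ground truth -> discard
                break
        if valid:
            kept.append(rule)
    return kept
```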

4. Efficiency and Storage of Symbolic Knowledge Learning: Symbolic knowledge learning is fast (<1h30min in Mars; <30min in ALFWorld), data-efficient (5 trajectories for Mars; 30 for ALFWorld), and lightweight—requiring <50kB of storage per environment (JSON format). Mars features high degrees of freedom and task complexity (≈240 steps/trajectory), so 5 trajectories already reveal diverse transitions. In contrast, ALFWorld tasks are simpler and shorter (≈18 steps/trajectory) and are categorized into different types, so we sample 5 trajectories per type across 6 types to ensure broad coverage.

[1]. Liang, et al. "Code as Policies: Language Model Programs for Embodied Control." ICRA 2023.
[2]. Mao, et al. "Robomatrix: A skill-centric hierarchical framework for scalable robot task planning and execution in open-world." arXiv 2024.
[3]. Belkhale, et al. "RT-H: Action Hierarchies using Language." RSS 2024.
[4]. Shi, et al. "Hi robot: Open-ended instruction following with hierarchical vision-language-action models." ICML 2025.

Comment

Thanks for your clarification. Most of my concerns have been addressed, and I will keep my score.

Comment

We sincerely thank you for your response and are pleased that most of your concerns have been resolved. Your thoughtful review and positive assessment are greatly appreciated.

Review (Rating: 4)

This paper proposes WALL-E, a training-free framework to achieve "world alignment" of LLM agents. WALL-E learns environment-specific symbolic knowledge complementary to the LLM's prior knowledge. The symbolic knowledge covers action rules, knowledge graphs, and scene graphs, which are extracted by comparing agent-explored trajectories with model-predicted ones. The symbolic knowledge can be converted into executable code to regulate LLM agents' policies through an efficient look-ahead optimization process. Experiments show that WALL-E significantly outperforms existing methods in open-world challenges like Mars and ALFWorld.

Strengths and Weaknesses

Strengths

  • This paper proposes a training-free framework, WALL-E, to achieve world alignment of an LLM through a meticulously designed neurosymbolic learning approach.

  • The proposed method significantly outperforms existing baselines on challenging benchmarks like Mars and ALFWorld.

Weaknesses

  • The writing needs to be improved. There are too many symbols involved in describing this framework. It would be better if the number of symbols could be reduced, or if these symbols were marked at the corresponding positions in the presented figures for reference.

  • The code rule pruning method is hard to understand.

  • The whole framework is quite complex, with LLMs playing different roles in summarizing action rules, constructing the knowledge graph, translating symbolic knowledge into code rules, and other mechanisms for scene graph construction and code rule set pruning.

Questions

  • The code pruning method is quite confusing. Can you provide a more intuitive explanation of this method?

  • How do you ensure that the extracted code rules are correct and executable?

Limitations

yes

Final Justification

  • This paper proposes a training-free framework, WALL-E, to achieve world alignment of an LLM through a meticulously designed neurosymbolic learning approach, though the method is somewhat complex to understand and implement.

  • The proposed method significantly outperforms existing baselines on challenging benchmarks like Mars and ALFWorld.

  • The experiments are extensive and solid.

Formatting Concerns

There are no paper formatting concerns.

Author Response

Thank you for taking the time to evaluate our paper and provide thoughtful feedback. We value and appreciate your comments! In the following, we provide detailed responses and clarifications to each of your concerns.

1. Symbol Overload and Readability: We will remove non-essential symbols and annotate key ones directly in the figures to improve the clarity and readability.

2. Intuitive Explanation of Code Rule Pruning: Intuitively, Code Rules Pruning selects a small subset of extracted rules that covers and corrects most of the mispredictions made by the LLM alone. We formulate it as a classic maximum coverage problem (Eq. 5 on Page 5) that can be addressed by a greedy algorithm (Alg. 3 on Page 30). Specifically, we first collect all transitions where the LLM-only predictions failed ($\mathcal{D}^{\text{inc}}$)—these mark the LLM's misalignments with the real environment. Then, at each step, we select the rule that covers/corrects the most remaining mispredicted transitions, add it to the subset, remove its covered transitions from future consideration, and repeat until the number of selected rules reaches the budget size $l$. This yields a compact yet high-impact rule set that best aligns the LLM's priors with the environment dynamics.
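
For intuition, a minimal sketch of this greedy selection (hypothetical names; the exact procedure is Alg. 3 on Page 30) could look like:

```python
# Greedy maximum-coverage pruning: repeatedly pick the rule that covers the
# most still-uncovered mispredicted transitions, up to the budget size l.

def prune_rules(rule_coverage, mispredicted, budget):
    """rule_coverage: dict rule_id -> set of mispredicted transition ids it corrects.
    mispredicted: set of transition ids where the LLM-only prediction failed.
    budget: maximum number of rules to keep."""
    selected = []
    remaining = set(mispredicted)
    while rule_coverage and remaining and len(selected) < budget:
        best = max(rule_coverage, key=lambda r: len(rule_coverage[r] & remaining))
        gain = rule_coverage[best] & remaining
        if not gain:          # no rule corrects any remaining misprediction
            break
        selected.append(best)
        remaining -= gain     # these transitions are now covered
    return selected
```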

3. WALL-E's Design is Complex: WALL-E's architecture is modular by design—it is principled and general. In WALL-E, the world model is an LLM augmented by a set of symbolic knowledge extracted from previous experiences. Agent planning is achieved by model-predictive control, in which an LLM agent interacts with the world model for multiple rounds to determine the best action(s) to take in the following step(s).

The idea is to use LLMs for inductive reasoning of symbolic knowledge and proposing actions, while lightweight program modules encode symbolic knowledge and ensure transparent, verifiable execution of the rules. This follows a growing best practice in agentic AI: leveraging LLMs for generalization and abstraction, while using structured programs to ground and constrain LLMs' outputs, reducing hallucinations and uncertainty of LLM-only predictions. Compared to LLM-only or rigid rule-based systems, this neuro-symbolic approach offers greater robustness, efficiency, interpretability, and reliability, by combining the complementary strengths of LLMs and symbolic systems. Moreover, this is achieved without any parameter training or intensive human supervision.
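
A highly simplified sketch of this interaction (illustrative names only; not our exact planner) is:

```python
# MPC-style loop: the LLM proposes candidate plans, the neurosymbolic world
# model (LLM + code rules) rolls them out, and only the first action of the
# best plan is executed before replanning.

def mpc_plan(llm_agent, world_model, state, horizon=5, n_rounds=3):
    best_plan, best_score = None, float("-inf")
    for _ in range(n_rounds):
        plan = llm_agent.propose_actions(state, horizon)       # LLM as optimizer
        sim_state, score = state, 0.0
        for action in plan:
            sim_state, feedback = world_model.predict(sim_state, action)
            if feedback == "rule_violation":                   # code rules veto
                score -= 1.0
                break
            score += world_model.estimate_progress(sim_state)
        if score > best_score:
            best_plan, best_score = plan, score
    return best_plan[0] if best_plan else None
```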

4. Verification of Code Rule Correctness and Executability: We ensure correctness by verifying each candidate code rule on ground-truth transitions from the real trajectory $\tau^{\text{real}}$ and discarding any rule that leads to wrong predictions. For executability, any rule that raises an execution error during any run is also removed. This filtering mechanism is simple but rigorous. We will further clarify this in "Stage 4: Code Rules Pruning."

Comment

Thanks for your clarification. Most of my concerns have been addressed, and I will raise my score.

Comment

Thank you for your response; we're very glad to hear that your concerns have been addressed. We sincerely appreciate your thoughtful feedback throughout the process and your decision to raise the score.

Review (Rating: 5)

The authors introduce WALL-E, an LLM agent that is augmented with symbolic knowledge (in the form of action rules, and knowledge/scene graphs). The LLM uses its existing knowledge to plan actions, which are run in an MPC loop using both LLM knowledge and code rules derived from the symbolic knowledge. The agent is tested in two challenging environments, where it achieves state-of-the-art (SotA) performance.

Strengths and Weaknesses

This is a clearly-written paper that introduces a novel LLM agent framework that achieves SotA performance on challenging domains. Although there are many components to the neurosymbolic agent, these are explained well, and are explored in the experiments/ablations. The method achieves strong performance, and is sample-efficient. LLM agents may be able to reason about these environments in a zero-shot fashion, but the ability to override this via neurosymbolic learning is very useful for generalisation.

As the authors admit, one drawback is the inflexibility of the symbolic rule structure. One aspect of the paper's results/claims that is a little unclear is whether WALL-E is truly SotA. The authors acknowledge that "DreamerV3 achieves higher absolute rewards" - but with significantly more environment interactions. Table 1 is supposed to highlight the best results in bold, but WALL-E's results are bolded despite being worse than DreamerV3's. The authors could do a better job of addressing the different evaluation regimes.

Questions

The claim is that WALL-E's "focus is not on absolute performance", but do you have any comments on what is preventing it from reaching DreamerV3's scores? Is it merely an issue of environment interactions, or are there some limitations to the neurosymbolic framework?

Limitations

Yes.

Final Justification

I have read the other reviews and the authors' rebuttals and believe that this paper still deserves a score of 5. In particular, I do not share as many concerns about the nature of the system and common criticisms against neurosymbolic methods - the results are strong, and it is up to future work to show simpler methods that can perform as well, or to extend this system to address shortcomings such as probabilistic dynamics.

Formatting Concerns

None.

Author Response

Thank you for your thoughtful review and constructive comments. We're pleased that you recognized the potential of our approach, and we address your questions and suggestions in detail below.

1. Inflexibility of Symbolic Rule Structure: Our symbolic rule structure is highly flexible, as the rules are automatically derived by an LLM from previous experiences without relying on any domain-specific engineering. As noted in the main text (Lines 69–74, 100–103) and Appendix (Lines 455–459), WALL-E performs inductive reasoning by an LLM to extract symbolic rules directly from experiences—without handcrafted prompts, task-specific modeling, or any fine-tuning—enabling fast and scalable adaptation across diverse environments.

2. Clarify WALL-E's SotA Claim in Table 1 Given DreamerV3's Higher Score: WALL-E is the SOTA training-free method that does not require any finetuning of LLM parameters. While DreamerV3 achieves higher absolute rewards, it is trained for 1 million environment steps. Our bolding in Table 1 reflects WALL-E's leading performance within the training-free (LLM-based) category. DreamerV3's results are included for reference only, to highlight the contrast in sample efficiency and adaptation speed. We will revise Table 1's caption to make the evaluation and comparison setting clearer.

3. Why WALL-E Underperforms DreamerV3: This is because WALL-E does not finetune any model parameters and focuses on high-level planning, while DreamerV3 fully trains a policy for 1M steps to produce low-level actions. These gaps in training and modeling lead to the difference in performance. This gap also motivates hierarchical planning/control in recent works (LLM planner + trained controller) [1,2]. However, in our submission, we keep the same setup as most LLM agent baselines for fair comparison, where the LLM is not finetuned for specific tasks and takes full control of the agent.

[1]. Shi, et al. "Hi robot: Open-ended instruction following with hierarchical vision-language-action models." ICML 2025.
[2]. Wang, et al. "Describe, explain, plan and select: interactive planning with large language models enables open-world multi-task agents." NeurIPS 2023.

Comment

Thank you for your responses. I have read the other reviews and responses and will be keeping my score at 5.

Comment

Thank you for taking the time to review the discussion and for your thoughtful evaluation. We sincerely appreciate your decision to maintain your score of 5.

Review (Rating: 4)

This paper proposes a neural-symbolic framework for building world-model-based LLM agents. Specifically, the authors prompt the LLM to extract symbolic knowledge from collected trajectories by inferring deterministic rules and constructing knowledge graphs. Also, the LLM attempts to reconstruct complete observations using scene graphs. This symbolic knowledge is then summarized into executable code functions, which are deduplicated and pruned to maximize coverage of both successful and failed transitions. The resulting world model supports the agent’s decision-making via model predictive control, with inference and rule learning executed in alternating steps.

Experiments conducted on Mars and ALFWorld demonstrate that the proposed method outperforms RL-based approaches like DreamerV3, as well as LLM-based rule-learning methods such as IfR.

Strengths and Weaknesses

Strengths

  • Compared to existing LLM-based rule learning methods, this work introduces several key innovations by integrating deterministic rules, knowledge graphs, and scenario graphs into executable code functions.

  • The experimental results are promising, and the ablation studies are thorough.

  • The paper is well-written and easy to follow.

Weaknesses

  • As noted in the limitations section, this work only considers rules for deterministic transitions (success or failure). However, real-world environments are typically dynamic and governed by probabilistic rules.

  • The authors claim that the proposed code-based rules offer stricter enforcement than natural language rules in previous work. While this is true, code-based rules are also more brittle and less generalizable to unseen scenarios. For instance, the model learns that an agent cannot pick up an item unless the resource is on a "table." When encountering a similar situation involving a "desk," the rule fails to apply. But a natural language rule might still help the LLM generalize, thanks to commonsense knowledge that desks and tables serve similar functions.

  • If I understand correctly, the collected rules are all about successful transitions, which is indeed an important part of the world dynamics. However, a compact world model should capture a more general abstraction of transition dynamics, enabling the agent not only to predict whether an action will change the state but also to evaluate how likely the action is to achieve a goal (i.e., which action has the highest success rate).

Questions

In addition to the questions raised above, I’m curious why Table 2 doesn’t include IfR.

Limitations

yes

Final Justification

After reading the rebuttal and the other reviews, I'm still leaning towards accepting the paper because it has reasonable novelty, good empirical results, and no major flaws. That said, I haven't raised my score because I'm still not fully convinced that code-based abstraction is much more effective than natural language, even in complex, highly open domains. So, I'll be keeping my original score.

Formatting Concerns

no

Author Response

Thank you for your careful assessment and valuable comments. We are pleased to respond to your concerns and further clarify the design and impact of our approach below.

1. Deterministic Rule Limitation vs. Probabilistic Dynamics: Our rule learning focuses on the deterministic part of dynamics because: (1) it is the backbone of many environments and complementary to the stochastic part that may have already been captured by LLMs; (2) it enables stable and verifiable modeling of dynamics; and (3) many seemingly probabilistic outcomes collapse to deterministic ones when conditioned on the LLM's reasoning over trajectories. For example, crafting an iron pickaxe may seem uncertain until one realizes that a furnace is a necessary precondition.
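
As a toy illustration (not one of the learned rules; those are listed in Appendix D), such a deterministic precondition rule could be expressed as:

```python
# Toy example: once the furnace precondition is made explicit, the outcome of
# the crafting action becomes deterministic (state fields are hypothetical).

def rule_craft_iron_pickaxe(state, action):
    if action != "craft iron_pickaxe":
        return None                      # rule does not apply to this action
    has_materials = state.get("iron", 0) >= 1 and state.get("wood", 0) >= 1
    near_furnace = state.get("near_furnace", False)
    return "success" if (has_materials and near_furnace) else "failure"
```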

As noted in §Limitation and Future Work, extending our framework to handle explicit probabilistic transitions by allowing light-weight LLM finetuning is an important direction for our future work.

2. Code-Based vs. Natural Language Rules: Compared to natural language rules, code-based rules avoid possible ambiguity and lead to more precise and verifiable generalization. This is essential for stable and controllable decision-making. Moreover, they are also easier for LLMs to follow.

Unlike natural language rules, our approach leverages LLM-induced symbolic patterns—e.g., abstracting both "table" and "desk" as functionally equivalent—to generalize safely without ambiguity. In contrast, natural language rules may lead to vague interpretation and reasoning that cause unsafe or inconsistent behaviors—an unacceptable risk in decision-making systems.
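
A hedged sketch of what such an LLM-induced abstraction might look like in code (illustrative names; not a rule learned in our experiments):

```python
# The functional category is made explicit, so "desk" and "table" are handled
# uniformly instead of relying on fuzzy natural-language matching.

SUPPORT_SURFACES = {"table", "desk", "countertop", "shelf"}   # LLM-induced category

def rule_pick_up_from_surface(state, action):
    if not action.startswith("pick up"):
        return None
    on_surface = state.get("target_location") in SUPPORT_SURFACES
    hand_empty = state.get("hand_empty", True)
    return "success" if (on_surface and hand_empty) else "failure"
```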

3. Extending the Method to Capture More General Abstractions of Transition Dynamics: We start with binary success/failure rules because they provide compact, straightforward, and verifiable instructions that various LLMs can easily understand and follow without further finetuning. Their effectiveness is reflected by the strong long-horizon results in Tables 1–2. As noted in §Limitation and Future Work, we plan to extend the framework to richer types of abstractions.

4. Missing IfR in ALFWorld Comparison (Table 2): IfR [1] is not included in the ALFWorld comparison (Table 2) because it was developed and evaluated specifically for the Mars environment and has not been applied to any other environment. A direct comparison would be unfair and potentially misleading.

[1]. Tang, et al. "Mars: Situated inductive reasoning in an open-world environment." NeurIPS 2024.

Comment

Thanks for your responses and answers. After reading the rebuttal and the other reviews, I'm still leaning towards accepting the paper and will keep my original score.

Comment

Thank you for your follow-up and for leaning toward acceptance. We sincerely appreciate your careful review and engagement during the process.

Final Decision

This paper proposes a framework in which LLMs extract rules from trajectory data, assemble them into a world model, and then use that world model in an MPC style to produce a policy. It achieves SOTA performance on a couple of agent benchmarks, Mars and ALFWorld.

The strength of the paper is its interesting and novel architecture and the corresponding SOTA results it gets using that architecture. Although it is complicated, the paper explains it clearly enough to see the idea.

Additional weaknesses beyond the complexity are the potential brittleness and inapplicability to stochastic systems. I think the authors fairly state that this is a starting point in an interesting direction and that those weaknesses can be tackled in future work, and I thus recommend the paper be accepted.