General agents need world models
We prove that any agent capable of flexible goal-directed behaviour must have learned a world model, and provide algorithms for recovering this world model from the agent.
Abstract
Reviews and Discussion
- This paper proves a bound on an agent's ability to achieve zero-shot generalization.
- It studies a fully observable controlled Markov process, with standard simplifying environment assumptions.
- They find a bound on the regret of an agent, with a key term being the L1 distance between the true and estimated transition probabilities.
- This leads the paper to conclude that to do zero-shot generalization, agents must learn world models.
- There is lengthy discussion on implications of this and how it connects to results in related areas such as causality and safety.
Questions For Authors
See review.
Claims And Evidence
- While I don't disagree with the discovered bound, I'm unclear whether this by itself definitively supports some claims such as 'Theorem 1 shows that any bounded agent has learned a world model'. The implicit vs explicit distinction seems like an important detail to maintain in such claims.
- Another key contribution of the paper is 'we can recover an approximation of the environment transition function (a world model) from the agents policy alone', but I found the explanation of how to do this (at least in the main paper) vague.
Methods And Evaluation Criteria
No experiments.
Theoretical Claims
Assessing these proofs is a little outside my comfort area. I can take a deeper dive if needed, depending on other reviewers' expertise. For the first pass, I read the main paper carefully but only spot-checked the proofs in the appendix.
So far the technical details appear well written and correct.
Experimental Designs Or Analyses
No experiments.
Supplementary Material
I skimmed the proofs in the appendix.
Relation To Broader Scientific Literature
I am not particularly familiar with the related work, so I withhold judgement.
Essential References Not Discussed
Fine.
Other Strengths And Weaknesses
- From one angle, the paper seems to be saying something quite obvious in a very complicated way -- that an agent must have an internal estimate of how its environment works in order to be good at reaching states within it. The better its internal estimates, the better the agent. I'm not familiar with the thought history on this, and I get lost in some of the nuances of the discussion, but to me this seems quite intuitive. From this angle I'm unclear how much impact the paper's result has.
- I'm unfamiliar with the structure the paper uses -- lengthy set up, followed by a very short summary of the main theoretical result, and then long discussion of connections with other work. No experiments.
- The paper takes just 7 pages, including a fair bit of repetition of text and definitions. Overall it feels a little empty. The heart of the contribution is section 3, which takes up less than a page.
- The paper could use this spare space to do a much better job of giving insight into theorem 1's proof and the procedure for extracting the transition probabilities.
Other Comments Or Suggestions
- In the abstract, why is it specifically a generative model that is learned?
- 'three temporal operators', but seemed to only define two (is the later or operation included in this?)
- Should eq 2 on rhs be ?
- What is in theorem 1? Did I miss something?
- 'While model-based agents explicitly learn world models (typically transformers (Brooks et al., 2024) or diffusion models (Janner et al., 2022))'. Why cite Brooks et al. here? They do not do model-based learning; they are doing video generation. (They also use diffusion.)
- Line 206, right-hand column: I don't see what this probability ratio refers to.
- The first few pages of the appendix repeat most of the main paper again. If there are important details in the appendix, move them to the main paper.
Thank you for your detailed and thoughtful review and helpful comments. We hope to address your main concerns about the core claims of our paper, which we believe stem from a misunderstanding of theorem 1, and have implemented your suggestions for improving the paper.
What do we actually show? Your review notes that we prove a bound on the agent's ability to achieve zero-shot generalization, and from this conclude that to do zero-shot generalization, agents must learn world models. We agree that if this were our result, it would be highly questionable.
What we actually do is assume the agent satisfies this regret bound (rather than prove it does). I.e. we assume the agent has some minimal degree of competence at zero-shot learning. We then formally prove (as verified by reviewer Xhe1) that for any agent satisfying this assumption, a world model is encoded in the agent's policy. We derive an (unsupervised) algorithm for recovering this world model from the policy (similar to [4]), and prove an error bound on the accuracy of the world model recovered, which depends on the agent's regret.
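Schematically, the logical structure is as follows (the norm and the exact form of the error function f are those given in Theorem 1, and are not restated here):

$$\underbrace{\operatorname{regret}(\pi, \psi) \le \delta \ \text{ for all goals } \psi \text{ of depth} \le n}_{\text{assumption: a minimally competent goal-conditioned agent}} \;\;\Longrightarrow\;\; \underbrace{\exists\, \hat{P} \text{ recoverable from } \pi \ \text{ with } \ \|\hat{P} - P\| \le f(\delta, n)}_{\text{conclusion: a world model is encoded in the policy}}$$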
We agree that this could be made clearer in the paper, and have done so by specifying the algorithm that recovers the world model outside of the proof of Theorem 1 (see Algorithm 1 in response to reviewer aY71) and re-writing the results section to improve clarity and discussion of our results.
Is this obvious? The question of whether or not AI systems have or need world models is hotly debated (see for example [1]), and is the subject of significant empirical research [2,3]. A similar result to ours [4] was recently proven for domain generalization (rather than zero-shot learning) and received an award at ICLR 2024.
While we agree it feels intuitive that agents should have a world model, there are many ways agents can reach goal states without one; e.g. through numerous heuristics (schemas, similarity-based reasoning, ...), online learning, etc. Humans can switch between using model-based reasoning or heuristics to generalize to new tasks, depending on the situation. And many biological agents are thought to be purely model-free (`stimulus-response' agents [5]). It is unclear if general AI systems like LLMs have world models, or if they can generalize purely via heuristics. Before now, there was no formal result showing that world models are necessary for generalization. And indeed we show that for myopic tasks, where the agent is optimizing for immediate outcomes, world models are not necessary.
We tackle this question by tying world models to a key capability---zero-shot learning. This has consequences for how we design agents (model-based vs. model-free) and reveals fundamental limitations on agent capabilities. Further, we show that this world model can be extracted, which has consequences for safety and interpretability.
How have we improved the paper based on your feedback?
- We agree with your point that the paper lacks insights into theorem 1 and the procedure for extracting world models. We now give an explicit algorithm for recovering world models (see response to reviewer aY71), and have extended the discussion of theorem 1 and other results to 2 pages and reduced repetition.
- We have included experiments demonstrating that our algorithm can be applied to real-world agents (see response to reviewer aY71).
- We have introduced a new theorem which proves that learning a world model is not necessary for myopic agents, which optimize for immediate outcomes of their actions (depth-1 goals). This relates to your question on whether the need for a world model is obvious or not. The theorem and proof can be seen here: https://imgur.com/a/QzrXt0W
Reviewer questions
- Why is it a generative model? The world model we recover can be used to simulate environment trajectories. This is in contrast to the purely state-based world models often studied (e.g. in [2]).
- Three temporal operators... Typo corrected. We were referring to the trivial (Now) operator, but have removed this.
- Should eq. 2 rhs be We are taking the max over , which is equivalent.
- What is in theorem 1? Typo corrected.
- Comments on Brooks et al. 2024: we have removed this citation.
- Line 206 rh column, I don't see what this prob ratio refers to. This is the ratio of the estimated transition probability to the true value, i.e. the relative error. It is obtained by dividing the inequality in theorem 1 by the true transition probability, as sketched below.
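Schematically, writing ε(δ, n) as a placeholder for the right-hand side of the inequality in theorem 1 (its exact form is not restated here), this step is:

$$\big|\hat{P}(s' \mid s, a) - P(s' \mid s, a)\big| \le \varepsilon(\delta, n) \;\;\Longrightarrow\;\; \Big|\frac{\hat{P}(s' \mid s, a)}{P(s' \mid s, a)} - 1\Big| \le \frac{\varepsilon(\delta, n)}{P(s' \mid s, a)},$$

i.e. a bound on the relative error of the recovered transition probability.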
[1] https://x.com/ylecun/status/1667947166764023808
[2] Li et al. "Emergent world representations: Exploring a sequence model trained on a synthetic task." ICLR 2023 (oral).
[3] Gurnee et al. "Language Models Represent Space and Time." ICLR 2024.
[4] Richens et al. "Robust agents learn causal world models." ICLR 2024 (oral).
[5] Tomasello. The evolution of agency: Behavioral organization from lizards to humans. MIT Press, 2022.
Thank you for this rebuttal. I have nudged my score up slightly to reflect my misunderstanding of the proof.
This paper presents the insight that any agent capable of performing zero-shot generalization must have learned an accurate generative model as a world model of its environment, and provides a comprehensive theoretical analysis to support this claim.
Update after rebuttal
Thanks to the authors for providing the rebuttal. I've read the author's response and comments from other reviewers. I have no further questions at this time. I will keep my original positive rating.
Questions For Authors
Please refer to the issues raised above.
Claims And Evidence
This paper provides detailed theoretical proofs to support the claims.
Methods And Evaluation Criteria
This paper only provides a theoretical framework but does not propose a new method and does not conduct experiments.
Theoretical Claims
I have checked the main theoretical claims, but not in every detail.
Experimental Designs Or Analyses
This paper does not provide any experiment.
Supplementary Material
I have read the appendix, but not all of the details.
Relation To Broader Scientific Literature
This paper provides a theoretical framework arguing that a world model is essential for zero-shot generalization.
Essential References Not Discussed
To the best of my knowledge, the references are sufficiently covered.
Other Strengths And Weaknesses
While this paper focuses on theory, some experiments, even in simple environments like Atari, would help readers connect the claims of this paper to real-world RL or robotics applications. Are there any tasks (e.g., robot navigation, manipulation) to which this paper's claims can be applied?
Other Comments Or Suggestions
Please refer to the issues raised above.
Thank you for your helpful comments. As noted by Reviewer Xhe1, our paper does propose a new method for eliciting world models from agents. However, this was quite unclear in the submitted draft, and we have included an explicit algorithm (below) in the manuscript to clarify this.
Following your recommendations, we have made the following changes to the paper:
- Explicit algorithms for recovering world models from agents (below)
- New experiments, validating these algorithms on real agents (details below).
- Discussion of real-world tasks where our results can be applied (for example [1] recently developed goal conditioned agents that can generalize zero-shot to arbitrary linear temporal logic goals).
[1] Jackermeier et al. "DeepLTL: Learning to Efficiently Satisfy Complex LTL Specifications for Multi-Task RL." ICLR 2025 (oral)
New experiments
Motivation: can our algorithm for recovering an agent's world model (below) be applied to real-world agents that perhaps maximally violate our theoretical assumptions, namely the strict regret bound we assume in Theorem 1?
Experimental setup
Our experiment involves extracting a world model from a model-based language agent in 120 randomly generated cMP environments using our algorithm (below). We show that the agent strongly violates our assumptions, but our algorithm can nonetheless recover the agent's world model.
- 120 randomly generated environments described by cMPs, with between 5 and 40 states and between 3 and 20 actions.
- Our goal-conditioned agent is an LLM (Gemini Flash 2.0), with an explicit, private world model.
- We then attempt to learn this private world model using Algorithm 1 (below) given only the agent's policy.
A figure of our results can be viewed at the following URL: https://imgur.com/a/1gNe15c
We also note that our paper is a theory paper which, as pointed out by reviewer Xhe1, significantly extends important recent theory work. We hope it can be judged on these merits, without requiring experiments that extend the current state-of-the-art empirical work (for example, [1] trained LTL-conditioned agents, was an oral at ICLR 2025, and used environments simpler than Atari).
Algorithm 1: Estimate Transition Probability from Policy
Input:
- Goal-conditioned policy π
- Choice of state s, action a, outcome s'
- Precision parameter n (related to the maximum goal depth)
- An alternative action a'
Function: EstimateTransitionProbability(π, s, a, s', n, a')
- Initialize
- For r = 1 to n:
  - Define base LTL components:
    - φ_a (Take action a)
    - φ_a' (Take action a')
    - φ_reach (Transition eventually to state s and take action a)
    - φ_succ (Transition Next to state s')
    - φ_fail (Transition Next to any state other than s')
  - Define the composite goal ψ_r from these components:
    - ψ_Fail (Sequential goal labelled Fail)
    - ψ_Success (Sequential goal labelled Success)
  - Query the goal-conditioned policy, conditioned on ψ_r, for its first action
  - If the first action indicates the agent prefers the Success goal:
    - break (Found the smallest r such that the agent prefers the goal involving r successes)
- Estimate P̂(s' | s, a) from this smallest r
- Return P̂(s' | s, a)
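For illustration, a minimal Python sketch of this loop is given below. The `CompositeGoal` container, the `policy(goal, state)` interface, and the final `r/n` estimator are placeholders introduced only for this sketch; the actual goal construction and estimator are those defined in the proof of Theorem 1.

```python
from dataclasses import dataclass
from typing import Callable, Hashable

# Placeholder for the composite LTL goal psi_r above: it records which transition
# (s, a, s') is being probed, the required number of successes r, and the depth n.
# A real implementation would construct the actual LTL formula from the base
# components phi_a, phi_a', phi_succ, phi_fail.
@dataclass(frozen=True)
class CompositeGoal:
    state: Hashable
    action: Hashable
    outcome: Hashable
    successes: int  # r
    depth: int      # n

# The agent is abstracted as a goal-conditioned policy: given a composite goal and
# the current state, it returns the first action it would take.
Policy = Callable[[CompositeGoal, Hashable], Hashable]


def estimate_transition_probability(
    policy: Policy, s: Hashable, a: Hashable, s_prime: Hashable, n: int, a_alt: Hashable
) -> float:
    """Scan r = 1..n for the smallest r at which the agent's chosen first action
    switches to a_alt (taken here, for illustration, to signal a preference for the
    Success goal), then turn that threshold into a probability estimate."""
    r_threshold = n  # default if the agent never switches within the budget n
    for r in range(1, n + 1):
        goal = CompositeGoal(state=s, action=a, outcome=s_prime, successes=r, depth=n)
        if policy(goal, s) == a_alt:
            r_threshold = r  # smallest r at which the preference flips
            break
    # Illustrative estimator only; the exact estimator and its error bound are the
    # ones stated in Theorem 1, which we do not reproduce here.
    return r_threshold / n
```

On a toy cMP, `policy` can be instantiated with a planner that acts optimally against a known transition matrix, which lets one sanity-check that the returned estimate tracks the true P(s' | s, a).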
Thanks to the authors for providing the rebuttal. I've read the author's response and comments from other reviewers. I have no further questions at this time. I will keep my original positive rating.
The authors establish that an agent capable of generalizing across a sufficiently large number of goal-conditioned tasks within an environment must have learned an accurate approximation of the environment’s transition model. As a consequence of this result, their proof provides a method for extracting the transition model directly from the agent's policy.
Questions For Authors
N/A
Claims And Evidence
The main claim is supported by the proof of Theorem 1, which appears to be correct to the best of my knowledge.
However, the authors also claim that an agent trained on a small set of 'universal' goal-directed tasks can generalize to solve significantly more complex tasks. This claim raises concerns, as the proof requires the agent to successfully solve all composite goals up to a given maximum depth. It is unclear whether this set remains small relative to the set of all possible finite-time trajectories. Additionally, if the maximum depth constraint is reduced and model error increases, it is not evident how this would affect generalization performance beyond the training set.
Methods And Evaluation Criteria
N/A
Theoretical Claims
I have carefully reviewed the proofs of the lemmas and the theorem, and to the best of my knowledge, they appear to be correct.
Experimental Designs Or Analyses
N/A
Supplementary Material
I have reviewed the proofs.
Relation To Broader Scientific Literature
This paper extends the important findings of Richens (2024) to the sequential setting, albeit within a more restricted domain. Richens (2024) demonstrated that a robust agent must have learned a causal model, and this work builds upon that insight by considering goal-conditioned agents in sequential decision-making tasks. This line of research is particularly timely, as large goal-conditioned models are increasingly being deployed in real-world robotic applications.
Essential References Not Discussed
N/A
Other Strengths And Weaknesses
Strengths: The proofs are easy to read. Both the discussion and related work sections are insightful. Weakness: It is not obvious that the set of 'universal' goal-directed tasks is small compared to the set of finite trajectories. I would be happy to raise my score if the authors provide insights on this aspect.
Other Comments Or Suggestions
p.5 second sentence "We compare two goals, the first ψ1(r, n) which is satisfied if the outcome state is S = s at most r times" -> Shouldn't the outcome state be s'?
Thank you for thoroughly reviewing our paper and for your helpful comments, in particular the following comment on the need to clarify the size of the `universal' set of goals compared to the set of finite-horizon trajectories.
Reviewer: It is not obvious that the set of 'universal' goal-directed tasks is small compared to the set of finite trajectories. I would be happy to raise my score if the authors provide insights on this aspect.
You correctly point out that we assume the agent can generalize to any composite goal of depth n. What we failed to make clear in the paper is that the proof of Theorem 1 actually requires only that the agent can generalize to a very small subset of these goals. This is what led us to comment on the existence of small `universal' goal sets, which are sufficient for learning a world model and hence generalizing to more complex goals.
Precisely, the set of goals we require the agent to generalize to is
where the constituent goals are described at the start of the proofs of Lemma 5 and Theorem 1. The cardinality of this set is n, which is much smaller than the full set of depth-n composite goals, and in general much smaller than the number of finite-time trajectories, which scales exponentially with the horizon. We can interpret the number of trajectories as growing exponentially with the goal depth n as well, since satisfying our sequential goals of depth n requires trajectories of length at least proportional to n.
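As an illustrative back-of-the-envelope comparison (the specific numbers below are chosen only for intuition and are not from the paper), take |S| = 10 states, |A| = 5 actions and horizon h = 20:

$$\underbrace{n = 20}_{\text{goals required (cardinality of the set above)}} \quad \text{vs.} \quad \underbrace{|S|^{h} = 10^{20}}_{\text{state trajectories}} \quad \text{vs.} \quad \underbrace{(|S|\,|A|)^{h} = 50^{20} \approx 10^{34}}_{\text{state-action trajectories}}$$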
We have made this all clear in the paper, and thank the reviewer for pointing it out.
Typo corrected on p.5, second sentence.
The paper establishes a theoretical result indicating that a reinforcement learning agent capable of generalising across a sufficiently large number of goal-conditioned tasks (where goals are LTL specifications) within an environment must necessarily have learned an accurate transition model of the environment.
The paper's result appears sound and is relevant to discussions on whether AI agents must or do learn "world models". The result here indicates that, at least in this context, learning a world model is a necessary condition if an agent is to be considered general. The paper is purely theoretical, establishing its main result and then discussing the implications; however, reviewers were split on whether the result is indeed valuable and useful to the RL community, so a better discussion here would improve the paper. During the rebuttal, the authors provided clarifications, a practical algorithm for extracting the transition model and experimental results that align with the theory, although perhaps even better would have been a simpler experiment on a small cMP demonstrating how the error in the transition function changes as an agent has learned to achieve more goals.
Despite these shortcomings, the reviewers generally agree on acceptance, and authors are encouraged to incorporate the additional information provided during the rebuttal phase.