Predictive Coding Enhances Meta-RL To Achieve Interpretable Bayes-Optimal Belief Representation Under Partial Observability
Meta-RL with self-supervised predictive coding modules can learn interpretable, task-relevant representations that better approximate Bayes-optimal belief states than black-box meta-RL models across diverse partially observable environments.
Abstract
Reviews and Discussion
I am not sure what this paper is about. The title contains several terms:
- Predictive coding
- Meta-RL
- Bayes-optimal belief representation
- Partial observability
and even after reading the abstract I am still confused about what the main problems solved in this work are.
Strengths and Weaknesses
- The work addresses the partial observability problem, which is largely unsolved in the RL literature
- The writing style makes it difficult to follow:
  - What is the problem
  - Why the problem is important, with example cases (beyond a few refs)
  - What has previously been done (state of the art)
  - Own contributions
- Comparing results to a 9-year-old paper is not sufficient, especially as it is difficult to see any differences
- The motivation for using the selected environments is poor (it is unclear why these are fundamentally good examples of partially observable and realistic cases)
Questions
Predictive coding (PC) is an alternative to other optimization methods, such as backpropagation (BP), used in neural networks. Why and how does PC provide something fundamentally better than BP using gradient descent? (Please do not answer by saying to read all the cited references, as this should be clearly explained in the paper.)
Why is meta-RL not explained in this work (the method section is half a page and refers to appendices - the main parts should be in the paper)?
How can I confirm that there is any statistical significance in the results in Figures 2 and 3?
Limitations
Not discussed (no explicit section)
Final Justification
After engaging in discussion with the authors and after reading the reviews of the other reviewers, I was finally able to understand the position of this work in the continuum of works from Wang et al. (2016) [idea of meta-RL] to Ortega et al. (2019) [claim of meta-RL leading to Bayes optimal estimator] to Mikulik et al. (2020) [Bayes optimal verified]. The current work shows that predictive coding helps to provide representations that match the Bayes optimal estimator. What the consequences of that are remains to be shown, but the work shows strong incremental progress and therefore I update my grade to the positive side.
Formatting Issues
Formatting was not the problem
We thank the reviewer for your review and the opportunity to address your concerns. Below we clarify the main problem that we solve in our work and then address your specific questions. In particular, we believe there is confusion about the framework of predictive coding, and we clarify this aspect below and in our revision.
What problem are we solving and why is it important?
In the real world, agents (AI systems, humans) often have to make decisions without complete information. For example:
- A doctor diagnosing a patient can only observe symptoms, not the underlying disease
- A robot navigating a building can only see with its camera, not through walls
As you pointed out, this is known in RL as partial observability. In POMDPs, the optimal way to handle this is to maintain a belief (probability distribution) over the true state, updating it as new information arrives. This is the Bayes-optimal belief representation. Learning Bayes-optimal belief representations is important not only for performance but also for interpretability and generalization.
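To make "maintaining a belief" concrete, the following minimal sketch shows the standard discrete Bayes filter that a Bayes-optimal agent would apply at every step (an illustrative example, not code from the paper; the 2-state problem and the 85% observation accuracy are assumptions):

```python
import numpy as np

def belief_update(belief, action, observation, T, O):
    """One step of the Bayes filter for a discrete POMDP.

    belief: (S,) current belief over hidden states
    T:      (A, S, S) transition model, T[a, s, s'] = P(s' | s, a)
    O:      (A, S, O) observation model, O[a, s', o] = P(o | s', a)
    Returns the posterior belief after taking `action` and seeing `observation`.
    """
    predicted = T[action].T @ belief                    # predict the next hidden state
    posterior = O[action][:, observation] * predicted   # weight by the observation likelihood
    return posterior / posterior.sum()                  # normalize

# Toy example: 2 hidden states, static dynamics, 85%-accurate observations, uniform prior.
T = np.array([np.eye(2)])                        # one action; the hidden state does not change
O = np.array([[[0.85, 0.15], [0.15, 0.85]]])     # observation matches the hidden state 85% of the time
b = belief_update(np.array([0.5, 0.5]), action=0, observation=0, T=T, O=O)
print(b)  # [0.85 0.15] -- belief shifts toward state 0
```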
Why previous methods fail
Memory-based meta-RL (RL²) has been shown to be a powerful method for learning near-optimal policies. However, RL² fails to learn compact, interpretable representations that match Bayes-optimal belief states, which was the main negative result from Mikulik et al 2020. Imagine learning to diagnose diseases by memorizing every symptom combination rather than understanding the underlying disease mechanism. This causes several fundamental problems:
- Lack of interpretability (we can’t explain the symptoms)
- Ineffective exploration (without disease mechanisms, we don’t know where to look for potential new treatment)
- Inadequate generalization (we don’t know whether the symptom combination will transfer to new patients).
This is exactly the problem we attempt to address: how can we design meta-RL agents that can learn Bayes-optimal belief representations to support interpretability, exploration, and generalization?
Our contribution
Motivated by the predictive coding principle in neuroscience (pioneered by Rao & Ballard, 1999), we show that by adding predictive objectives (predicting future observations) to meta-RL, the learned internal representations become much closer to the theoretically optimal Bayesian beliefs. To use an analogy, this is like the difference between:
- Memorizing that "fever + cough often means flu" (what RL² does, a verbose representation)
- Understanding the underlying disease mechanism that "flu causes fever 80% of the time and cough 70% of the time" (what our model achieves, an interpretable, task-relevant optimal representation)
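As a worked instance of this analogy (the 10% flu prior and the 10% fever rate without flu are illustrative assumptions, not numbers from the paper), knowing the mechanism lets a Bayes-optimal agent turn a symptom into a belief via Bayes' rule:

$$
P(\text{flu}\mid\text{fever})
= \frac{P(\text{fever}\mid\text{flu})\,P(\text{flu})}{P(\text{fever}\mid\text{flu})\,P(\text{flu}) + P(\text{fever}\mid\neg\text{flu})\,P(\neg\text{flu})}
= \frac{0.8 \times 0.1}{0.8 \times 0.1 + 0.1 \times 0.9} \approx 0.47
$$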
To this end, our work goes beyond performance comparisons, which lack interpretability, to rigorously evaluate whether the learned internal representations are truly Bayes-optimal beliefs. Our work presents a thorough model interpretability analysis: incorporating predictive objectives leads to better representations that match Bayes-optimal beliefs, which is crucial for generalization and more interpretable AI systems.
We hope this clarifies our motivations and contributions. We're happy to provide additional clarification. Below we address specific questions.
Q1: Clarifying the confusion about predictive coding
This appears to be a misunderstanding. In our work, predictive coding is not an alternative optimization method to backpropagation. Rather, predictive coding here refers to the future predictive objective that shapes representation learning. This predictive objective is:
- Inspired by neuroscience theories of how the brain processes information
- Implemented using standard backpropagation
- Similar to successful representation learning methods in ML (e.g. future predictive loss, contrastive predictive coding)
We’ll clarify this distinction in the revision to prevent misunderstanding.
Q2: Why these environments?
We would like to emphasize that our goal is not to compare with SOTA performance, but rather to learn interpretable representations. Consequently, we carefully select POMDP tasks where we can analytically compute Bayes-optimal solutions to enable rigorous comparison of learned representations. We find this a necessary methodological choice because:
- Optimal solutions in complex POMDPs are known to be intractable. Without optimal solutions, we cannot directly verify whether the learned representations are optimal
- In tractable POMDPs, we can quantitatively measure how close our learned representations are to Bayes-optimal solutions, not merely comparing performance
- Our selected tasks capture core, fundamental challenges in POMDPs, including exploration-exploitation tradeoff, information gathering, and dynamic uncertainty
To be specific, our tasks encompass core partial observability challenges with real-world implications.
- Bandits: exploration-exploitation tradeoff → Clinical trial design, A/B testing
- Tiger: classic POMDP benchmark → Scenarios requiring costly information gathering before high-stakes decisions
- Dynamic bandits/Tiger: non-stationary tasks requiring dynamic belief tracking → Markets, weather, or any dynamic domains
- Oracle bandit: active information seeking → Active, exploration-based information gathering
We acknowledge that these tasks aren't meant for realistic applications, but rather diagnostic environments where we can evaluate representation quality against ground truth, enabling our core contribution of model interpretability.
Q3: Comparison baselines
Our primary goal is model interpretability and understanding why predictive objectives work. The state machine simulation method from Mikulik et al 2020 that we employ in our work provides the most rigorous analysis to directly compare learned representations with Bayes-optimal beliefs—it's an analysis tool for model interpretability, beyond performance baseline.
To contextualize our model, many modern deep RL methods have converged on incorporating predictive objectives:
- VariBAD (Zintgraf et al 2021) uses variational inference with future prediction
- SOLAR (Zhang et al 2019) and Dreamer (Hafner et al 2025) learn world models through observation prediction
- DynaMITE-RL (Liang et al 2024) employs a dynamic predictive model
These methods show improved performance on various tasks using predictive objectives. This convergence suggests predictive learning is a fundamental principle. However, none of these works systematically analyze whether the learned representations actually approximate Bayes-optimal beliefs. Our contribution is demonstrating why predictive objectives work: they guide neural networks toward Bayes-optimal representations. We chose RL² as our baseline because:
- It's the canonical meta-RL method, representing a foundational and widely-used baseline.
- Mikulik et al 2020 specifically examined RL², allowing direct comparison.
- Recent meta-RL methods build upon RL².
- This allows us to isolate the effect of predictive learning on representation quality.
By showing that predictive modules lead to interpretable, Bayes-optimal representations while RL² doesn't, we provide theoretical insights for why the field has empirically converged on predictive objectives. This understanding is crucial for designing interpretable meta-RL systems.
Q4: How can I confirm that there is any statistical significance in the results in Figure 2, 3?
We acknowledge that the interpretation of state machine simulation results is not as straightforward as comparing performance. We have introduced the analysis in Section 4 to address model comparison and its statistical significance. In brief, for two representations A and B to be compared, the state machine simulation analysis evaluates four metrics: the state dissimilarity for A mapped onto B (D_s, A→B) and vice versa (D_s, B→A), and the output dissimilarity for A mapped onto B (D_o, A→B) and vice versa (D_o, B→A). For interpretation, we need all four metrics to be small for A and B to be said to be equivalent.
Therefore, to interpret our results, we'd like to draw your attention to Fig. 2C, focusing on the results for the bottleneck layers in RL² and our model. We report S.E. across 5+ seeds, and asterisks mark statistical significance (p<0.05, t-test) in all state machine simulation plots. Our results show that, for three of the four metrics, RL² and our model attain comparably low values. However, for the remaining metric, the state dissimilarity in the Bayes→meta-RL direction, our model has a significantly lower value than RL², indicating that overall our model learns a representation that is significantly closer to the Bayes-optimal beliefs (again noting that all four metrics must be small for equivalence).
We show this consistent pattern across all tasks we considered, which provides strong evidence for better representation learning induced by predictive coding.
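For concreteness, the sketch below illustrates the spirit of one direction of this comparison (a simplified rendering for illustration, not the exact procedure of Mikulik et al.; the linear mapping and all names are assumptions): fit a map from one agent's recorded states to the other's, then measure how far the mapped states and the resulting outputs deviate.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def simulation_dissimilarity(states_A, states_B, outputs_B, readout_B):
    """Rough sketch of one direction (A -> B) of the state machine simulation test.

    states_A, states_B: internal states recorded on the same trajectories, shapes (T, d_A), (T, d_B)
    outputs_B:          agent B's outputs (e.g. action probabilities) on those steps, shape (T, k)
    readout_B:          function mapping a batch of B-states to B's outputs
    Returns (D_s, D_o); both must be small, in both directions, for A and B to be equivalent.
    """
    f = LinearRegression().fit(states_A, states_B)   # map A's states into B's state space
    mapped = f.predict(states_A)
    D_s = np.mean(np.linalg.norm(mapped - states_B, axis=1))               # do mapped states track B's states?
    D_o = np.mean(np.linalg.norm(readout_B(mapped) - outputs_B, axis=1))   # do they reproduce B's outputs?
    return D_s, D_o
```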
Q5: Why meta-RL is not explained
We have introduced the background and methodology of meta-RL in the "Memory-based meta-RL" section under Section 2. Our meta-RL model, including the model architecture, training objectives, and regularization, is described in Section 3 and Figure 1B & C. As our model components are standard, e.g. RNN, VAE, we refer to the appendix for implementation details.
Q6: Limitations not discussed
We have discussed the limitations of our current work in the second paragraph of Section 6. To make this stand out more clearly, we will add a header to highlight the paragraph where we discuss limitations.
We hope this clarifies our contributions and addresses confusion. The core insight—that predictive objectives guide meta-RL toward Bayes-optimal representations—has important implications for building interpretable, generalizable agents. We will revise our writing to clearly communicate this message.
I thank the authors for their response letter. That motivated me to study the prior works in more detail to understand the claims, and I think I am now more aware. However, I still cannot understand what the contributions are that advance the prior knowledge about meta-RL. In the following, I try to explain my "path of thinking" and why I am still perplexed.
[meta-RL] Wang et al. (2016) introduced (deep) meta-RL, which is based on an RNN encoder that receives the past observations, rewards, and actions. The encoder learns to produce a 'state estimate' which is the input to the policy network.
This makes sense since past observations help to collect information from the environment and therefore adapt to parameters that are changing during training episodes (e.g. surface friction for mobile robots).
Arguably, the encoder learns to approximate the true state in some internal form.
[Bayes optimality] Ortega et al. 2019 further investigated meta-RL and claimed that it is (nearly) Bayes optimal estimator of the true state. This is an interesting claim as the meta-RL technique does not explicitly enforce Bayes principles.
[Verification of Bayes optimality] Mikulik et al. (2020) design an experimental setting to investigate whether meta-RL behaves identically to Bayes-optimal estimator. They show that indeed meta-RL asymptotically approximates Bayes optimal estimator since in simulations they behave identically. After convergence, meta-RL - they claim - becomes 'indistinguishable' from Bayes-optimal agents.
[Current submission] Assuming that all above is known, the submission title claims that the proposed PC 'enhances' meta-RL to achieve Bayes-optimal representation under partial observability.
Questions:
- Did not Mikulik et al. already verify that meta-RL achieves Bayes-optimal representation - what is new here?
- If the representation is 'indistinguishable' from Bayes-optimal agents, then how can it be substantially enhanced any further?
- Why do you emphasize 'partial observability', since aren't all environments in all previous work partially observable? Having a state estimator for a fully observable environment wouldn't make any sense, right?
- Your rebuttal lists the following 'enhancements':
  - More interpretable - how can we measure that they are substantially and repeatably more interpretable than the Mikulik et al. representations?
  - Compact - in what sense are your representations more compact than those used in previous works? (memory-wise?)
  - Did I miss something important?
- Where does the Mikulik et al. work discuss that "RL² fails to learn compact, interpretable representations that match Bayes-optimal belief states, which was the main negative result from Mikulik et al 2020", as you write in your rebuttal?
- Mikulik et al. seem to define their own meta-RL algorithm that perhaps is similar to RL² (they do not cite the actual work), but it is not clear to me that your method is identical?
I am not a meta-RL specialist, but I do work on RL. The two other reviewers were much more positive than me, and likely more distinguished in the field, and the meta-reviewer will hopefully put more emphasis on their reviews. However, I cannot raise my rating if I still cannot understand what the main point of the work is. Your writing makes it difficult to follow the red thread.
To me your work seems like the Mikulik et al. method with an additional training component using predictive coding. This is a good incremental step, but then it should be clearly stated and analyzed what kind of improvements it brings to the meta-RL framework and why. Does it provide faster convergence or what?
Thank you for your engagement with our response and the prior literature. We appreciate the opportunity to further clarify the key distinctions and contributions. Your questions highlight an important conceptual point that we would like to communicate more clearly.
The crucial distinction: Behavioral equivalence ≠ Representational equivalence
The core challenge of interpreting state machine simulation results comes from conflating two distinct types of equivalence:
- Behavioral equivalence: Whether meta-RL agent’s policy matches the Bayes-optimal policy
- Representational equivalence: Whether meta-RL agent’s internal states match Bayes-optimal belief states
Q1 & Q5: Mikulik et al. showed behavioral equivalence, not representational equivalence
To clarify what Mikulik et al found: They found RL² learns to behave indistinguishably from a Bayes agent, not that its representation is indistinguishable from a Bayes agent.
Mikulik et al. found that RL² achieves near Bayes-optimal behavior but fails to learn Bayes-optimal representations. This is exemplified by their results of non-negligible dissimilarities (blue bars) in bandit tasks (bottom) in Fig. 4. To quote their result section: "In the reverse direction however, we observe that the meta-learned solutions do not always simulate the Bayes-optimal with high accuracy, as seen by the non-negligible output dissimilarity Do. We believe that this is because the sufficient statistics learned by the meta-learners are not minimal."
They went on to discuss in section 5: “meta-learners are well simulated by Bayes-optimal agents, but not necessarily vice versa. This failure of simulation is most likely a failure of injectivity: if a single state in one agent must be mapped to two distinct states in another then simulation is impossible. This occurs when two trajectories lead to the same state in one agent but not another (for instance if exchangeability has not been fully learned). We suspect that RNN meta-learners represent non-minimal sufficient statistics as a result of training.”
This "failure of simulation" and "non-minimal sufficient statistics" is precisely what we mean by RL² failing to learn compact, interpretable representations. The key insight: an agent can achieve optimal behavior while using inefficient/ verbose, non-interpretable internal representations, which is captured as high dissimilarities under state machine simulation.
For example, in a Bernoulli prediction task (e.g. inferring the probability of a biased coin), input sequences head-tail-head and tail-head-head lead to precisely the same belief in the Bayes-optimal solution, but might lead to different states in RL² if the RNN uses a verbose history representation and fails to learn that these two sequences (HTH & THH) should be mapped onto the same representation in this task.
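A minimal check of this point (standard Beta-Bernoulli updating, not code from the paper): both orderings yield exactly the same posterior, so a Bayes-optimal agent only needs the counts, not the order.

```python
from scipy.stats import beta

def posterior_params(observations, prior=(1, 1)):
    """Beta posterior over a coin's head-probability after a string of 'H'/'T' observations."""
    return prior[0] + observations.count("H"), prior[1] + observations.count("T")

print(posterior_params("HTH"))  # (3, 2)
print(posterior_params("THH"))  # (3, 2) -- identical belief despite different histories
print(beta(3, 2).mean())        # posterior mean head-probability: 0.6
```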
Q2 & Q4: Our contribution is showing that with predictive modules, meta-RL achieves representational equivalence
We would like to highlight that for a learned representation to be:
- Compact: we mean whether the learned representation is as efficient as possible; in other words, whether the learned representation is the minimal sufficient statistics as used by a Bayes-optimal agent.
- Interpretable: we mean whether the representation can be understood/ interpreted as Bayes-optimal belief states.
As explained above, Mikulik et al. found RL² learns non-minimal sufficient statistics, indicating that it fails to learn Bayes-optimal belief representation, therefore its internal state is not compact (not minimally sufficient) and cannot be directly understood as Bayes-optimal belief states.
In our work, we showed that by adding predictive coding modules, meta-RL agents can learn the minimally-sufficient Bayes-optimal belief representations across diverse tasks. The representational improvements are quantitatively measured by state machine simulation:
- RL²: low D_o (good behavior), but high D_s in the Bayes→meta-RL direction (poor representation)
- Ours: low D_o AND low D_s in both directions (good behavior AND representation)
For example, in two-armed bandits (Fig. 2C), our method achieves significantly lower state dissimilarity than RL² in the Bayes→meta-RL direction, indicating our representations better approximate the minimal sufficient statistics used by the Bayes-optimal agent. This pattern is consistent across all tasks, and strongly suggests that the learned representations in our model are more compact (minimally sufficient) and interpretable (can be directly interpreted as Bayes-optimal belief) compared to RL².
Q3: Why do you emphasize 'partial observability' since aren't all environments in all previous work partially observable. Having a state estimator for a fully observable environment wouldn't make any sense, right?
Your observation is correct—our work and previous work are all partially observable tasks. This is because meta-RL (training across a task distribution) can be naturally re-formulated as POMDPs. We emphasize partial observability because we're specifically addressing representation learning for partially observable settings where learning the Bayes-optimal belief states matter, not fully observable RL settings where state estimation would be unnecessary.
Q6: Mikulik et al. seem to define their own meta-RL algorithm that perhaps is similar to RL2 (they do not cite the actual work), but it is not clear to me that your method is identical?
The method RL² (Wang et al. 2016) is cited as reference 6 in Mikulik et al. The core architecture of RL² is using RNN to process history dependency to learn history-dependent policy across the task distribution. Wang et al, Mikulik et al, and our implementation all follow this architecture and parametrize the RL² agent using RNN. Each work might use different network sizes/ layers but the overall architecture is the same.
We hope this addresses your questions and further clarifies our contributions. We're happy to provide additional explanation.
Hi Reviewer e9yF,
We’d like to follow up to see if we have addressed your questions thoroughly. Please feel free to let us know if you have follow up questions and we’ll be happy to discuss further. Thank you very much.
I thank you for answering all my concerns, which finally helped me to understand the work. I can now position this work in the continuum of works from Wang et al. (2016) [idea of meta-RL] to Ortega et al. (2019) [claim of meta-RL leading to Bayes optimal estimator] to Mikulik et al. (2020) [Bayes optimal verified]. The current work shows that predictive coding helps to provide representations that match the Bayes optimal estimator. What the consequences of that are remains to be shown, but the work shows strong incremental progress and therefore I update my grade to the positive side.
Thank you for your feedback and review, which helped us better articulate our contribution within the broader meta-RL literature. We appreciate that you highlighted how our work advances the research in the context of Wang et al. → Ortega et al. → Mikulik et al. by demonstrating that predictive coding can bridge the gap between Bayes-optimal policies and Bayes-optimal representations. We are also excited by the future prospects of our work and how it may lead to novel and testable insights into neural computation, and thank you again for your positive reassessment of our work.
This manuscript investigates how to improve internal state representations for meta-reinforcement learning (meta-RL) agents in partially observable environments. The authors propose integrating a self-supervised predictive module, inspired by the predictive coding principle in neuroscience, into a meta-RL agent. This module is trained to predict future observations and rewards, which in turn shapes a latent representation intended to approximate a Bayes-optimal belief state. Using state machine simulations, the authors find that the predictive module enables the learning of more interpretable and structurally equivalent representations compared to standard meta-RL (RL²). In tasks requiring active information seeking, this improved version allows the agent to achieve optimal policies where conventional meta-RL fails.
Strengths and Weaknesses
Strengths
- Clear theoretical bridge and a tight conceptual link to the predictive coding principle in neuroscience.
- Rigorous evaluation methods are adopted so that we can truly know if the learned representations are Bayes-optimal or not. This goes beyond previous methods that may have adopted predictive coding but lack interpretation.
Weaknesses
- The experiments are limited to simple POMDPs and thus the scalability of this method (in terms of performance and interpretability ) is unknown when facing high-dimensional complex tasks.
- The code is not available at the time of reviewing.
Questions
no
Limitations
no
Final Justification
I have read the rebuttal and found it partially addresses my concerns. I recommend accepting this paper.
Formatting Issues
no
We thank Reviewer dd6m for your positive assessment and for recognizing our theoretical contributions and rigorous evaluation methodology. We appreciate your positive feedback on our “tight conceptual connection to predictive coding principles in neuroscience” and our systematic approach to verifying Bayes-optimality of learned representations (“This goes beyond previous methods that may have adopted predictive coding but lack interpretation”). We're delighted you found our work provides a “clear theoretical bridge” with “excellent clarity.” Below we provided point-by-point responses to your comments and questions.
Q1: The experiments are limited to simple POMDPs and thus the scalability of this method (in terms of performance and interpretability ) is unknown when facing high-dimensional complex tasks.
We acknowledge this important question and appreciate the reviewer raising this constraint. Our focus on simple POMDPs was a necessary methodological choice that enables our core contribution: validating that predictive coding induces representations which structurally match Bayes-optimal beliefs—a claim that can only be rigorously evaluated when ground-truth Bayes-optimal solutions exist.
- Rigorous evaluation requires ground truth: Our primary contribution is demonstrating that predictive modules enable learning Bayes-optimal representations. This requires ground-truth Bayes-optimal POMDP solutions for comparison, which are available only in tractable domains. In high-dimensional tasks where ground-truth solutions are not available, we cannot rigorously verify whether learned representations truly approximate Bayes-optimal belief states, and thus are reduced to black-box comparison using performance. To this end, this choice of tasks is a necessary foundation for understanding why predictive objectives work in meta-RL.
- Systematic coverage of POMDP challenges: Despite using tractable tasks, we would like to highlight that we have significantly expanded evaluation beyond prior work investigating representation equivalence for meta-RL. Compared to Mikulik et al. (2020), where only stationary Bernoulli bandits are evaluated, we significantly expanded and selected exemplar tasks aiming to capture more diverse POMDP challenges, including exploration-exploitation tradeoff, information gathering, and dynamic uncertainty:
- Bandits: Uncertainty and exploration-exploitation tradeoff
- Tiger: Classic POMDP benchmark of full sequential decision-making problem
- Dynamic bandits/Tiger: Non-stationary hidden states where dynamic belief tracking is required
- Oracle bandit: Strategic information gathering
To our knowledge, this is the first systematic evaluation of meta-RL representations across diverse POMDP structures beyond stationary bandits using ground-truth comparison.
- Distinguish model scalability from evaluation scalability: While full state machine simulation analysis becomes intractable for high-dimensional tasks, there is nothing in principle preventing our model architecture from scaling to high-dimensional tasks. First, the modular architecture (separate predictive and policy networks) naturally scales to complex tasks. Second, the predictive objective remains well-defined for more complex tasks. We believe that our work establishes the principles—validating that predictive learning approximates Bayes-optimal computation provides the theoretical foundation for understanding how these methods scale to complex domains where direct verification isn't possible.
- Empirical scalability from related work: To this end, we would like to point to empirical evidence from related work where predictive objectives have demonstrated impressive scalability in practice. For instance, VariBAD (Zintgraf et al., 2021), utilizing a similar architecture of modular predictive and policy networks, succeeds on higher-dimensional and continuous control tasks in MuJoCo. Dreamer (Hafner et al., 2019-2025), a world model approach where world models are learned with predictive objectives and actor-critic networks are trained on imaginary model rollouts, achieves SOTA on challenging Atari and control tasks with image observations. Our work provides the interpretability foundation for understanding this family of methods:
- Existing work (VariBAD, Dreamer): Demonstrates that predictive objectives improve performance
- Our work: Explains why they work by revealing predictive representations approximate Bayes-optimal beliefs
We believe that this understanding is valuable precisely because it applies broadly. By establishing that predictive coding induces Bayes-optimal structure in tractable and the clearest settings, we provide a principled foundation for understanding the success of these predictive objective methods in deep RL. This understanding is crucial for designing robust models for complex tasks where we cannot verify optimality directly.
We will add a paragraph in the Discussion section explicitly addressing the above evaluation methodological choice and scalability considerations, and outlining these pathways for understanding deep RL models with predictive objectives on complex tasks.
Q2: The code is not available at the time of reviewing
We apologize for the code not being available previously during review. We have now followed the procedure to submit our anonymized code via Official Comment to the AC for distribution.
Upon acceptance, we will release our complete codebase, including:
- Full implementation of meta-RL with predictive modules
- All baseline implementations (RL²)
- Complete experimental setup
- State machine simulation analysis tools
- Scripts to reproduce results
The code will be released under an open-source license to facilitate reproducibility and enable the community to build upon our work. We will update the manuscript to include the link explicitly.
We hope this addresses your questions and clarifies our contribution to the intersection of neuroscience, meta-RL, and interpretability, and its potential for broader impact. Thank you again for your positive and constructive feedback.
I thank the authors for clarifying my questions. I agree that minimal yet sufficient evaluations with ground truth are the first step towards a full picture. But a necessary next step will be to expand the range of the experiments to show the computational superiority over other methods. I acknowledge the value of increasing our understanding of how predictive coding enhances representation quality, which, however, to some extent is a well-acknowledged belief in the neuroscience community over many years. If RL is added into the story, I'd like to see more of a performance leap rather than pure interpretation-level analysis. This actually leaves us a dilemma: if one increases the dimensionality, one loses the tractability (thus no rigorous Bayes-related analysis anymore). But if not, it's hard to see the advantage, because RL and Bayes-Optimal representations are conceptually two distinct topics. That's why I still think the lack of further experiments is a limitation of this work, because otherwise I can't see a very clear meaning of doing so.
As for code, I'm not very sure if it's a problem on my side: I still can't see the code in my reviewer dashboard. I'll leave this to the chair to decide.
Therefore, I will keep my score and opinion the same as the original.
Thank you for the thoughtful feedback. We appreciate your acknowledgment of the value in our interpretability analysis and understand the desire to see a broader demonstration. Our oracle bandit results show that representation quality directly impacts policy performance, whereas RL² struggles with both, providing evidence that good representation learning can lead to a performance leap. We see our approach using ground-truth comparisons as the foundation that enables broader experiments, and believe that the analytical framework we employ is crucial for understanding the representation-learning effect of predictive objectives in larger-scale settings where Bayes-optimal comparisons aren't possible.
For code access, we have re-sent the anonymous link to our code to the AC and will work with the AC to resolve the code access issue.
The authors tackle the twin challenges of interpretable belief representations and efficient exploration by integrating a predictive-coding module into a memory-based meta-reinforcement learning (meta-RL) agent. Concretely, a variational RNN encoder ingests (o_t, r_t, a_{t-1}) at each step and compresses them into a latent belief b_t; two lightweight decoders are trained, under an ELBO with a KL bottleneck, to predict the next observation and reward. A separate actor-critic head then conditions its policy solely on b_t, with no policy-gradient leakage back into the encoder.
The authors hypothesise that this compact latent representation more closely matches the Bayes-optimal belief. They test the claim using a state-machine simulation metric on five POMDP families whose Bayes-optimal solutions are analytically tractable, providing a ground truth: static and dynamic Bernoulli bandits, stationary and dynamic Tiger, and an 11-arm “oracle” bandit that demands strategic exploration.
Results show that the encoder's bottleneck reduces the State and Return dissimilarities D_s, D_o relative to RL², and does so without enlarging network capacity. Both agents reach optimal return on the easier tasks, but on the hardest Dynamic-Tiger and Oracle settings only the predictive-module agent attains Bayes-optimal reward and uncovers the correct exploration policy. Ablations confirm that (i) the KL information bottleneck and (ii) gradient isolation are each necessary; removing either degrades both belief alignment and performance, demonstrating that predictive coding alone, properly regularised, drives the improvement.
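A minimal PyTorch-style sketch of the design as summarized above (module sizes, the MSE reconstruction losses, and all names are illustrative assumptions, not the authors' implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PredictiveBeliefAgent(nn.Module):
    """Sketch of the described design: a variational RNN belief encoder trained only with
    predictive (ELBO) losses, plus an actor-critic head that reads the belief but does not
    backpropagate into the encoder."""

    def __init__(self, obs_dim, act_dim, hidden=64, belief_dim=8):
        super().__init__()
        self.rnn = nn.GRU(obs_dim + 1 + act_dim, hidden, batch_first=True)  # ingests (o_t, r_t, a_{t-1})
        self.to_mu = nn.Linear(hidden, belief_dim)
        self.to_logvar = nn.Linear(hidden, belief_dim)
        self.obs_decoder = nn.Linear(belief_dim + act_dim, obs_dim)  # predicts next observation
        self.rew_decoder = nn.Linear(belief_dim + act_dim, 1)        # predicts next reward
        self.policy = nn.Sequential(nn.Linear(belief_dim, hidden), nn.Tanh(), nn.Linear(hidden, act_dim))
        self.value = nn.Sequential(nn.Linear(belief_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, obs, rew, prev_act, hx=None):
        h, hx = self.rnn(torch.cat([obs, rew, prev_act], dim=-1), hx)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        belief = mu + torch.randn_like(mu) * (0.5 * logvar).exp()          # reparameterized latent belief b_t
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(-1).mean()  # KL bottleneck term of the ELBO
        # Gradient isolation: detach so policy/value gradients never reach the encoder.
        logits = self.policy(belief.detach())
        value = self.value(belief.detach())
        return belief, kl, logits, value, hx

    def predictive_loss(self, belief, act, next_obs, next_rew):
        """Self-supervised predictive objective that shapes the belief representation."""
        z = torch.cat([belief, act], dim=-1)
        return F.mse_loss(self.obs_decoder(z), next_obs) + F.mse_loss(self.rew_decoder(z), next_rew)
```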
Overall, the study provides strong but small-scale evidence that self-supervised prediction is a powerful inductive bias for belief learning in meta-RL.
Strengths and Weaknesses
This is a high-quality empirical study with good experimental controls, a well-motivated design, and a sensible evaluation methodology. The use of state-machine simulation to measure representational adequacy, both from agent-to-Bayes and Bayes-to-agent, is a strong choice that moves beyond the typical reliance on downstream Return as a proxy. The ablation study is appropriately scoped and supports the central claims. That said, the experimental domain is limited to low-dimensional, fully known POMDPs, which constrains generalisability.
The paper is clearly written, with a logical flow from motivation to architecture, experiments, and discussion. Key assumptions are made explicit, the diagrams are effective, and notation is standard. A minor weakness is that the state-machine simulation metric, while central to the evaluation, could benefit from a clearer standalone description for readers unfamiliar with Mikulik et al.
The significance is primarily conceptual rather than empirical: the paper reframes ideas from predictive coding as an inductive bias for Bayes-optimal belief representation in meta-RL. While the domains are small-scale, the implications, particularly for training interpretable and robust meta-RL agents, are notable. If the method generalises to continuous control or high-dimensional inputs, it would have substantial impact. It's very important to note that not only are the environments used small scale, but they are from within a very specific domain of analytically tractable Bayes-optimal solvable settings. This is a major weakness of the paper.
This work is original in its use of predictive coding as a direct mechanism for belief representation in meta-RL. While prior work has explored predictive objectives for generalisation or sample efficiency, this paper explicitly shows that self-supervised future prediction can induce latent states that closely approximate Bayes-optimal beliefs, validated through a principled, bi-directional state-machine simulation test.
A second key novelty is the architectural decoupling of inference and control: the belief encoder is trained solely via predictive losses, with no policy gradient leakage, yet still supports Bayes-optimal behaviour. This differs from both conventional meta-RL (e.g., RL^2) and world-model agents that require planning in latent space. There is a limitation here however in terms of the limited comparison to only a single type of meta-RL approach. Comparison to the likes of VariBAD would be an important improvement.
Questions
How does this approach compare to other meta-RL methods that incorporate predictive objectives (VariBAD, world model approaches)? Why did you choose RL² as the sole baseline?
You claim that "no policy-gradient leakage" is crucial, but the ablation shows joint training doesn't hurt all that much. Can you provide more analysis of when/why gradient isolation matters?
Your tasks have 2-11 discrete actions and analytically tractable belief states. How do you expect the approach to scale to continuous control or high-dimensional observation spaces where Bayes-optimal solutions are unknown? This is probably the biggest weakness of the study.
The oracle bandit success is definitely interesting, but it's also a very specific exploration structure. How does the approach work on more diverse exploration challenges?
What's the computational overhead of the predictive modules? In particular, how does training time compare to RL²?
How sensitive is the method to the choice of the belief state bottleneck dimension?
You mention OOD generalization as future work, but can you show results on task distributions that differ from training (e.g., different bandit arm counts, different Tiger observation accuracies)? This would seem like a relatively simple test to carry out.
Limitations
Yes
Final Justification
I thank the authors for both exceptionally clear responses, as well as for the inclusion of additional results. I think that these answers improve my understanding, and with the new results will indeed strengthen the paper significantly. For this reason I am upgrading my score from a 3 to a 5.
Formatting Issues
N/A
We thank the reviewer for your thorough, insightful, and constructive review. We are encouraged that you found the paper to be a "clearly-written" "high-quality empirical study with good experimental controls, a well-motivated design, and a sensible evaluation methodology." We appreciate your positive assessment that our evaluation "is a strong choice that moves beyond the typical reliance on Return" and that our work provides "a principled […] test" with "conceptual significance" for "training interpretable and robust meta-RL agents." Below we respond point-by-point and report additional analyses.
Main Concerns
Q1: Limited experimental domain
Our focus on tractable tasks is a necessary methodological choice essential for validating that predictive coding induces representations approximating Bayes-optimal beliefs—which can be rigorously assessed only when ground-truth solutions exist.
Tractable tasks are necessary: In complex POMDPs (e.g., Atari, robotics), planning on Bayes-optimal beliefs is intractable. Without ground-truth solutions, we would be unable to apply state machine simulation and reduced to comparing only returns. Reduction to performance-only evaluation has critical limitations for understanding the learned representations, as performance doesn't reveal representational efficiency, structure, and interpretability. To move beyond and attain interpretability insights for representation, tractable tasks are thus necessary.
Expanded task selection to cover diverse POMDP challenges: Despite using tractable tasks, we'd like to highlight that we have significantly expanded evaluation beyond prior work. Compared to Mikulik et al 2020 where only stationary bandits are evaluated, we selected exemplar tasks to capture diverse POMDP challenges: explore-exploit tradeoff (bandits), sequential decision-making (Tiger), dynamic belief tracking (dynamic bandits/ Tiger), and information gathering (oracle bandit). To our knowledge, this is the first systematic evaluation of meta-RL representations across diverse POMDP structures beyond stationary bandits using ground-truth comparison.
Q2. ... compare to … VariBAD, world model approaches? Why … RL² as the sole baseline?
Our comparison was a deliberate choice to isolate and understand the effect of predictive objectives on representation learning. We aim to 1) Isolate the effect of predictive learning 2) Establish principles without being limited to specific implementation, and 3) Avoid interpretation confounds from specific design.
To this end, we consider our model, VariBAD, and world model approaches as belonging to the same family—meta-RL with predictive objectives. While comparing them may be interesting, we believe a direct comparison to RL² would offer much clearer and more interpretable results.
- Why not VariBAD: VariBAD is designed for stationary tasks and inapplicable to dynamic POMDPs. On stationary tasks where it applies, its architecture is similar to ours—both decouple belief inference from control. Comparing would essentially be comparing identical models.
- Why not world model approaches: Methods like Dreamer learn a world model and use rollouts for planning. They learn through planning, not belief representation. This would introduce extra complexity orthogonal to our focus on analyzing representations, as we'd be unable to determine whether differences stem from predictive learning, planning, or model details.
Our complementary contribution on interpretability: These models integrate predictive modules in various ways and show improved performance. Our work thus complements their success and provides the interpretability foundation for understanding why they work. This follows the fruitful tradition in ML of understanding principles in tractable settings that explains performance on complex tasks (e.g. theory on linear networks/ convex optimization→ deep learning).
Our work lays the groundwork for future analysis. We’ll expand our discussion section to clarify these relations and our complementary contribution.
Q3. How … scale to continuous or high-dim spaces?
- Additional evidence from continuous control: To address this concern, we designed LatentGoalCart, a continuous-control task similar to HalfCheetahDir in MuJoCo but with tractable belief inference, where the agent receives a continuous observation (position) and reward (distance from the hidden target) and controls a continuous action (velocity) to reach an unknown target that must be inferred (a minimal environment sketch is given after this list). Even on this continuous task, our model yields representations that are significantly closer to the Bayes-optimal beliefs compared to RL² (as we cannot upload the figure, we report the result below and will add it in the revision).
| model | D_s (meta-RL→Bayes) | D_o (meta-RL→Bayes) | D_s (Bayes→meta-RL) | D_o (Bayes→meta-RL) |
|---|---|---|---|---|
| Ours | 0.699±0.270 | 0.017±0.004 | 0.528±0.098 | 0.175±0.049 |
| RL² | 1.523±0.518 | 0.066±0.109 | 1.743±0.520 | 0.371±0.230 |

(bold: p<0.05)
This provides evidence that our findings extend beyond discrete, low-dim tasks.
- Model vs evaluation scalability: Your question raises two types of scalability. While the evaluation method's scalability may be limited for complex tasks, our model architecture scales naturally to high-dim tasks: 1) predictive objectives remain well-defined and computable regardless of dimensionality; 2) the modular architecture scales up: VAE predictive modules can handle high-dim inputs (images, continuous states) and Gaussian policies can handle continuous actions.
- Empirical evidence from related work: We'd like to point to empirical evidence from related work where predictive learning shows impressive scalability: VariBAD succeeds on high-dim and continuous tasks in MuJoCo; PlaNet and Dreamer (Hafner 2019-25) achieve SOTA on challenging Atari and control tasks with image observations.
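For concreteness, a minimal gym-style sketch of how the LatentGoalCart task described above could be implemented (the class interface, target range, noise-free position observation, and horizon are assumptions, not the authors' environment):

```python
import numpy as np

class LatentGoalCart:
    """1-D continuous-control sketch: the agent observes its position and a reward equal to the
    negative distance to a hidden target, and must infer the target from rewards alone."""

    def __init__(self, horizon=50, max_speed=1.0):
        self.horizon, self.max_speed = horizon, max_speed

    def reset(self, rng=np.random):
        self.target = rng.uniform(-5.0, 5.0)   # hidden task parameter, resampled every episode
        self.pos, self.t = 0.0, 0
        return np.array([self.pos])            # the agent never observes the target directly

    def step(self, action):
        velocity = float(np.clip(action, -self.max_speed, self.max_speed))
        self.pos += velocity
        self.t += 1
        reward = -abs(self.pos - self.target)  # informative about the target, but only indirectly
        done = self.t >= self.horizon
        return np.array([self.pos]), reward, done, {}
```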
Specific questions
Q4: You claim that "no policy-gradient leakage" is crucial, but the ablation … when/why gradient isolation matters?
We'd like to clarify we did not intend to imply that gradient isolation is crucial; rather, it was a design choice to cleanly isolate and understand the representation learning effect of predictive objectives. By blocking policy gradients, we ensure representation learning purely comes from predictive coding without confound. We show that policy gradients are not necessary for learning Bayes-optimal representations, suggesting that representation learning can be attributed to predictive coding. We apologize for confusion in our presentation and will improve the clarity of this point in the paper.
Connection to existing architectures: Successful predictive methods, e.g. VariBAD and Dreamer, also decouple inference and control. Our findings may explain why this is desirable: predictive objectives alone suffice for good representations, and mixing in policy gradients does not add much.
Q5: Oracle bandit success … diverse exploration challenges?
Oracle bandit captures the core structure of many exploration challenges: locating information sources, paying costs to access information, and using the information effectively. This pattern is common across diverse exploration benchmarks including TreasureMountain (Zintgraf 2021), Sphinx (Wang 2023), Map, and Overcooked (Xie 2024).
Our agent shows effective exploration—strategically sampling the oracle, correctly interpreting information, and consistently exploiting it—going well beyond random exploration or Thompson sampling. State machine analysis reveals why: predictive coding creates belief representations that track information and drive information-directed exploration. While additional benchmarks would strengthen this claim, the insight should transfer broadly to exploration tasks sharing the core structure.
Q6: … training time compare to RL²?
Overhead comes from the 2 decoders. Compared to RL²: Training time increases by ~30-40%; Inference time is comparable. We will add these metrics in the revision.
Q7: How sensitive is … bottleneck dimension?
We evaluate different bottleneck dimensions (ranging from 1 to 32). Results show that when:
- Dimension > task complexity (e.g. > 2 for the oracle bandit): models achieve similarly low dissimilarities, indicating that regularization helps to maintain an interpretable belief representation when the bottleneck is overly expressive.
- Dimension < task complexity: both performance (suboptimal return) and representation (high dissimilarity) significantly degrade, indicating that the bottleneck dimension needs to surpass the task complexity to solve the task.
We will include the results in the revision.
Q8: ... OOD generalization … on task distributions that differ from training?
Thank you for this excellent suggestion. We fully agree that OOD generalization is crucial, and while comprehensive OOD evaluation requires careful design of training and test distributions, we have two promising additional results:
- Different Tiger accuracies (zero-shot): For models trained on acc=0.7 and tested on acc=0.8 environments, our models show near-optimal return (-15.94±3.82, p<0.05), whereas RL² has significantly lower test return (-25.56±1.80), indicating our models support zero-shot generalization by capturing the underlying belief updates shared across similar tasks.
- Oracle bandit generalization (transfer learning): We train models on task distributions that only contain arms 1-5, before transferring them to task distributions that contain arms 6-10. Our models showed significantly faster transfer learning as compared to RL² (return at 2e4 updates: ours 20.20±0.82 vs RL² 14.68±1.51, p<0.05), indicating our models learn the underlying task logic rather than memorizing specific arm configurations.
We’ll include these results in the revision.
Q9: state machine simulation ... unfamiliar with Mikulik et al
We'll expand the description in the main text.
I thank the authors for both exceptionally clear responses, as well as for the inclusion of additional results. I think that these answers improve my understanding, and with the new results will indeed strengthen the paper significantly. For this reason I am upgrading my score from a 3 to a 5.
Thank you very much for your very thorough and constructive review that strengthen our paper. We’re glad to know our response addressed your questions. We’re encouraged by your positive assessment of our work, and appreciate you highlighting the value of our work and for raising the score.
This manuscript investigates strategies for improving internal state representations in meta-reinforcement learning (meta-RL) agents in partially observable environments. The authors propose integrating a self-supervised predictive module, inspired by the predictive coding principle in neuroscience, into a meta-RL framework. The work establishes a clear theoretical foundation and maintains a strong conceptual link to predictive coding in neuroscience. Furthermore, the evaluation is conducted with methodological rigor, enabling a robust assessment of whether the learned representations approximate Bayes-optimality. As most of the reviewers' comments have been sufficiently addressed, this paper appears suitable for publication.