PaperHub
7.0 / 10
Poster · 3 reviewers
Ratings: 3, 4, 4 (min 3, max 4, std 0.5)
ICML 2025

Neurosymbolic World Models for Sequential Decision Making

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

We synthesize state machines from low-level continuous observations and apply them to policy optimization

Abstract

Keywords
world models · finite state machines · structure learning · model-based reinforcement learning · neurosymbolic

Reviews and Discussion

Review (Rating: 3)

The paper presents SWMPO, a framework for learning neurosymbolic Finite State Machines (FSMs) to model environmental structures for policy optimization. The key contributions are an unsupervised learning algorithm for training modular world-model primitives from low-level continuous observations, a state-machine synthesis algorithm for constructing environment-specific FSMs, and evaluations showing these FSMs effectively support model-based Reinforcement Learning (RL). The framework was tested in environments like PointMass, LiDAR-Racing, Salamander, and BipedalWalkerHardcore, demonstrating accurate FSM synthesis and competitive RL performance.

Questions for Authors

  1. The framework relies on assumptions about the identifiability and minimality of modes. How would the method perform when these assumptions are violated, e.g., in environments with overlapping mode dynamics or non-minimal modes? Can you provide strategies to mitigate such issues?

  2. The evaluations are conducted in simulated environments. What challenges do you foresee when applying SWMPO to real-world tasks (e.g., robotics, autonomous systems)? Have you considered case studies or initial experiments with real data?

  3. How does the framework’s computational complexity scale with the number of modes or environment complexity? Are there strategies to optimize efficiency for high-dimensional state spaces?

  4. The pruning approach removes spurious transitions. How sensitive is the pruned FSM’s performance to the choice of the error tolerance factor ε? Can you provide an analysis or guidelines for selecting ε?

  5. How does SWMPO compare to hybrid neuro-symbolic models that explicitly encode domain knowledge (e.g., physics-based constraints)? Could combining such knowledge with SWMPO further improve performance?

Claims and Evidence

The claims in the submission are generally supported by clear and convincing evidence. The authors provide a detailed description of their framework, SWMPO, and its components, including the unsupervised learning algorithm for training modular world-model primitives and the state-machine synthesis algorithm for constructing environment-specific FSMs. They also present evaluations in various simulated environments, such as PointMass, LiDAR-Racing, Salamander, and BipedalWalkerHardcore, demonstrating the effectiveness of their approach in synthesizing accurate FSMs and its competitive performance in model-based RL tasks. However, some claims could be considered problematic due to the limitations of the framework. For instance, the assumption that the latent categorical variable $M_t$ can be characterized by a function $m: (o_{t-1}, a_{t-1}, o_t) \to m_t$ might be too restrictive for more complex environments where the mode variable is not easily identifiable from a single transition. Additionally, the pruning mechanism, while helpful in simplifying the state machine, might lead to removing transitions that are important for accurate modeling in specific scenarios.

Methods and Evaluation Criteria

The proposed methods and evaluation criteria are well-suited for the problem of synthesizing neurosymbolic world models for sequential decision making. The framework’s ability to learn structured world models from low-level observations and use them for efficient policy optimization is demonstrated through appropriate methods and comprehensive evaluations. The choice of benchmark environments and metrics effectively supports the claims made in the paper.

Theoretical Claims

There is no theoretical analysis in this paper.

Experimental Design and Analysis

Yes.

Supplementary Material

Yes, all parts.

Relation to Prior Literature

The concept of neural world models has been explored in previous works, such as the recurrent world models by Ha and Schmidhuber (2018). These models use neural networks to predict future states and facilitate policy evolution. The proposed SWMPO framework extends this by incorporating a structured representation through FSMs, allowing for more interpretable and modular world models.

The use of structured models in RL has been explored in various contexts, such as hierarchical RL (Xu & Fekri, 2021; Botvinick, 2012) and modular RL (Simpkins & Isbell, 2019). These approaches aim to improve policy learning by encoding structure into the policy architecture. The SWMPO framework differs by focusing on the synthesis of a structured world model rather than directly encoding structure into the policy.

Missing Important References

No.

Other Strengths and Weaknesses

Strengths

  1. The paper proposes a creative combination of neurosymbolic approaches, integrating neural networks with finite-state machines to model complex environments. This hybrid approach is innovative and addresses the limitations of purely neural or symbolic models.
  2. The paper is well-structured and clearly presents the proposed methods, experimental designs, and results.

Weaknesses

  1. The paper relies on several assumptions, such as the identifiability of the latent categorical variable $M_t$ and the minimality of modes. These assumptions might be too restrictive for more complex environments where the mode variable is not easily identifiable from a single transition.

  2. The assumption that all POMDPs share the same set of modes but have different mode-transition dynamics might limit the framework's applicability to environments with highly diverse dynamics.

  3. While the benchmark environments are relevant, they might not be sufficiently challenging to fully test the framework's capabilities. More complex, real-world scenarios could better assess the framework's performance.

  4. The framework involves multiple components, including neural primitives, state-machine synthesis, and transition predicate synthesis. Its complexity might make it challenging to implement and optimize, especially for less experienced researchers or practitioners.

Other Comments or Suggestions

See questions.

Author Response

We thank Reviewer pyge for your thoughtful comments and are committed to incorporating your feedback into the manuscript.

Note: Since our submission, we have demonstrated stronger RL performance (see response to B4Kd).

Please excuse our brevity due to the character limit.

Mode Identifiability Assumption

Please refer to the response to Reviewer bZQ6 ("Mode Identifiability Assumption").

Mode Sharing and Diverse Environments

In our original submission, we stated a stronger assumption than needed. Assumption 4.5 should instead be:

“The modes in the active POMDP must be a subset of the modes in the offline POMDPs.”

E.g., consider two offline POMDPs: one with modes A and B and one with modes C and D; the active POMDP could consist of modes A and C, which we argue is not very restrictive. We will update our manuscript to reflect this correction.

We also note that the Salamander environment already exhibits high diversity, as it is a 3D simulation of a real robotic platform [5, 6] with computational fluid dynamics and rigid-body contacts.

Real-World Applicability

Please see response to bZQ6 ("Real-World Applicability").

Framework Implementation Challenges

To help the community build on our results:

  • We open-source our source code
  • The pseudocode typeset in the manuscript closely follows our implementation
  • As mentioned, we leverage off-the-shelf, well-established implementations of key algorithms

Computational Complexity

Let $n$ be the number of experiences $(o_t, a_t, o_{t+1})$ of dimension $d$, and $m$ the number of modes. The computational complexity of training is driven by these steps:

  1. Solving Eq. 1: To solve Eq. 1 we use SGD, which is $O(G_1 n d K_1)$, where $G_1$ is the number of SGD steps and $K_1$ accounts for the architecture of the model (see [9] for discussion). The variables corresponding to the amount of data, the number of iterations, and the model complexity needed for training are not independent; e.g., a complex system (e.g., one with a high number of modes) might need more data. However, a monolithic model would also require more data because it too must implicitly learn the same complex dynamics. Thus, we do not expect this step to present significantly more overhead than a monolithic approach.

  2. Training the $m$ neural primitives: We assume datasets will have in expectation $O(\frac{n}{m})$ elements. If models are trained sequentially, this is $O(m G_2 \frac{n}{m} d K_2)$, where $G_2$ and $K_2$ account for the number of SGD steps and the architecture of the primitives. The caveats in step 1 apply, but each step is at most as expensive as training a monolithic model. Primitives can be trained in parallel if wall-clock performance is critical.

  3. Transition Predicate Synthesis: This step trains $2m^2$ decision trees with CART [11]. For the average case, we assume each dataset will be of size $O(\frac{n}{m})$. Therefore, the overall cost in expectation is $O(m^2 d \frac{n}{m} \log^2(\frac{n}{m})) = O(n m d \log^2(\frac{n}{m}))$.

Applying dimensionality reduction techniques may help with high-dimensional spaces. In some high-dimensional domains (e.g., visual domains, which are not the focus of our work), we would expect a different class of models (e.g., CNNs) to be more effective.

Simplifying, the average complexity of training a SWMPO model is

$O(G n m d K + n m d + n m d \log^2(\frac{n}{m}))$,

where $G = \max(G_1, G_2)$ and $K = \max(K_1, K_2)$. Thus, the expected overhead of training in SWMPO compared to a monolithic model is only a linear factor in $m$.

Then, during each model-based RL step, SWMPO evaluates the predicates of the current mode, which are small functions that could be evaluated in parallel, and the active mode's neural network. Thus, the expected runtime is only a small constant factor slower than that of a traditional monolithic world model.
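To make this runtime argument concrete, here is a minimal sketch of one SWMPO forward step; the class name and the predicate/primitive interfaces are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

import numpy as np

@dataclass
class FSMWorldModel:
    # predicates[m] lists (target_mode, predicate) pairs leaving mode m
    predicates: Dict[int, List[Tuple[int, Callable[[np.ndarray, np.ndarray], bool]]]]
    # primitives[m] is the neural dynamics model specialized to mode m
    primitives: Dict[int, Callable[[np.ndarray, np.ndarray], np.ndarray]]
    mode: int = 0

    def step(self, o: np.ndarray, a: np.ndarray) -> np.ndarray:
        # 1) Evaluate only the current mode's outgoing predicates (small functions).
        for target, predicate in self.predicates.get(self.mode, []):
            if predicate(o, a):
                self.mode = target
                break
        # 2) One forward pass through the active mode's primitive.
        return self.primitives[self.mode](o, a)

# Illustrative usage with stand-in primitives and a threshold predicate:
wm = FSMWorldModel(
    predicates={0: [(1, lambda o, a: o[0] > 1.0)]},
    primitives={0: lambda o, a: o + 0.1 * a, 1: lambda o, a: o - 0.1 * a},
)
o_next = wm.step(np.zeros(2), np.ones(2))
```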

Pruning Guidelines

We manually tuned $\epsilon$ by inspecting error plots (e.g., Figure 4), but did not find the state machines overly sensitive to it.

As a guideline, we suggest (1) inspecting model error plots to see whether there are transitions that should be pruned in the first place, and (2) using the error between models as a starting value for $\epsilon$.

Additionally, hyperparameter tuning techniques could automate the process.
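As a hypothetical illustration of guideline (2) (this helper and its gap statistic are illustrative assumptions, not the paper's procedure), $\epsilon$ could be seeded from the typical gap between the best and second-best primitive errors on held-out segments:

```python
import numpy as np

def epsilon_from_model_gaps(errors: np.ndarray) -> float:
    """errors: array of shape (n_segments, n_modes) holding each primitive's
    prediction error on held-out trajectory segments. Returns the median gap
    between the best and second-best primitive as a starting value for epsilon."""
    sorted_err = np.sort(errors, axis=1)
    return float(np.median(sorted_err[:, 1] - sorted_err[:, 0]))

# e.g., 5 segments x 3 primitives of mean-squared prediction error
eps0 = epsilon_from_model_gaps(np.random.default_rng(0).random((5, 3)))
```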

Domain Knowledge

While out of scope for this manuscript, we consider incorporating domain knowledge explicitly a promising future direction. For instance:

  1. Replace models in Eq. 1 with physics-informed neural networks or other models that enforce physical constraints
  2. Integrate syntactic/semantic constraints into the predicate synthesis using systems like Sketch [12]

We hypothesize that each approach would improve the local models and the FSM transition dynamics, respectively.

We note that the assumption that systems can be decomposed into modes is a form of domain knowledge in itself.

References

Please refer to the response to Reviewer bZQ6.

Review (Rating: 4)

In the setting of POMDPs, the paper introduces SWMPO, a framework based on a Markov decision model in which each transition is characterized by a mode (FSM). That is, at each $t$, the transition occurs by $(o_t, a_t) \mapsto o_{t+1} = f(o_t, a_t \mid M_t)$, where $M_t$ is the mode at time $t$. The mode $M_t$ changes based on $(o_t, a_t)$ through what they call a "transition predicate".

They train this model in two steps: (1) a first step in which the proxy $m_{\theta_2}$ is included as "soft" modes, so that a function of the form $f_{\theta_1}(m_{\theta_2}(o_{t-1}, o_t, a_t), o_t, a_t)$ can predict the next state well while $m_{\theta_2}$ is not correlated with the future.
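As an illustration of this first step, a minimal PyTorch-style sketch follows; the network sizes, the softmax relaxation of the modes, and the particular decorrelation penalty are assumptions for illustration, not the paper's exact Eq. 1:

```python
import torch
import torch.nn as nn

obs_dim, act_dim, n_modes = 8, 2, 4  # illustrative sizes

# m_theta2: soft-mode network embedding a transition (o_{t-1}, o_t, a_t)
mode_net = nn.Sequential(
    nn.Linear(2 * obs_dim + act_dim, 64), nn.ReLU(),
    nn.Linear(64, n_modes), nn.Softmax(dim=-1),
)
# f_theta1: next-observation predictor conditioned on the soft mode
pred_net = nn.Sequential(
    nn.Linear(n_modes + obs_dim + act_dim, 64), nn.ReLU(),
    nn.Linear(64, obs_dim),
)
opt = torch.optim.Adam([*mode_net.parameters(), *pred_net.parameters()], lr=1e-3)

def loss_fn(o_prev, o_t, a_t, o_next, lam=0.1):
    mode = mode_net(torch.cat([o_prev, o_t, a_t], dim=-1))
    pred = pred_net(torch.cat([mode, o_t, a_t], dim=-1))
    pred_loss = ((pred - o_next) ** 2).mean()
    # Assumed decorrelation penalty: discourage the soft mode from correlating
    # with the prediction residual (one plausible reading of "not correlated
    # with the future"; the paper's exact term may differ).
    resid = (o_next - pred).detach()
    m_c, r_c = mode - mode.mean(0), resid - resid.mean(0)
    decorr = (m_c.T @ r_c / len(mode)).pow(2).sum()
    return pred_loss + lam * decorr

# one illustrative SGD step on random data
B = 32
o_prev, o_t, o_next = (torch.randn(B, obs_dim) for _ in range(3))
a_t = torch.randn(B, act_dim)
opt.zero_grad(); loss_fn(o_prev, o_t, a_t, o_next).backward(); opt.step()
```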

Based on this $m_{\theta_2}$, clustering is conducted so that any $(o_{t-1}, o_t, a_t)$ can be assigned to a particular mode. Then, for each mode $m$, $f(o_t, a_t \mid m)$ is trained to predict the observation whenever $(o_{t-1}, o_t, a_t)$ is of mode $m$.

The transition predicate is trained so that the model transitions to the best-predicting state at any given $(o, a)$.

At each round, SWMPO collects a small number of new environment rollouts and then uses the FSM newly synthesized from that data.

Their method is compared against monolithic neural model-based RL as well as model-free RL, and it is shown to outperform them on several benchmark examples.

Questions for Authors

I think I have asked questions in the comments/suggestions section.

Claims and Evidence

While it is true that their approach extends the FSM-based approach to the continuous domain, it is unfortunately a hard call to accept the claim that they demonstrate an advantage over model-based RL that uses a single monolithic neural network with no structure.

Methods and Evaluation Criteria

The evaluation criteria of the work is based on mode-label accuracy and the sheer policy performance. I believe that these criteria are fair, with the first one validating the claim that they can nail down the mode when there is one.

Theoretical Claims

There seems to be limited theoretical discussion.

Experimental Design and Analysis

The experimental design seems conventional, and I believe it is sound.

Supplementary Material

There seems to be no major extra experiment or deep theoretical claims in the supplementary material.

Relation to Prior Literature

N/A.

Missing Important References

Nothing in particular comes to mind, but PlaNet (Hafner et al., 2019), Dreamer (Hafner et al., 2021), and SimPLe (Kaiser et al., 2020) might be worth mentioning as members of the family that uses a future-prediction model in a rather "meta" way.

Also, Neural Fourier Transform (Koyama et al.) and Unsupervised Learning of Equivariant Structure from Sequences (Miyato et al.) use a two-stage framework consisting of a first step that builds a "prediction model" with a similar loss and a second "block-diagonalizing" step that decomposes the prediction into disentangled features. They might be somewhat related as well.

Other Strengths and Weaknesses

Strengths

  • They extended FSMs and their application to RL to tasks in the continuous domain, and brought the approach up to a competitive level.
  • They have shown an inspiring framework that includes "mode/gear changes" in the prediction, and realized it at a level where RL can be performed competitively.

Weaknesses

  • It was sincerely my hope as the reviewer that this approach would strongly outperform the blackbox monolithic version, but that was not the case.
  • On a similar note, the paper is not too convincing that the introduction of "modes" and the building of separate models is a beneficial approach.

Other Comments or Suggestions

It is hard to believe that, when the inductive bias of the "neurosymbolic world model" is truly valid at the level of data generation, an approach like this cannot strongly outperform its blackbox counterpart either in training speed or in sheer reward-based performance. Has there been any effort to build an "extreme case" scenario?

Clearly, the efficacy of the approach in terms of labeling accuracy / Levenshtein distance is proven to be solid. I wonder whether there is a reward that more directly depends on these labels.

I value that you highlight the fact that SWMPO did not train any neural model online. I wonder if there is a more quantitative way to highlight SWMPO's gray-box approach.

Author Response

We thank Reviewer B4Kd for your thoughtful comments and appreciate your recognition of SWMPO’s extension of FSM-based modeling to non-linear continuous domains and of its practical competitiveness. We are committed to incorporating your feedback into the final version of the manuscript.

Note: Since the results in our initial submission, we have further demonstrated a stronger performance of our approach over baseline methods in the RL experiment (see below).

Performance over Monolithic Models

We understand your interest in a stronger empirical performance gap between SWMPO and the monolithic model-based RL baseline. We submitted our manuscript with early results, but since the submission we have made some small improvements to the algorithm that have in turn resulted in a significant improvement in performance. We also spent comparable time improving the performance of the baseline.

Specifically, this is what we have done for both SWMPO and the monolithic model-based RL baseline:

  1. We ran the RL training 64 times, instead of the 16 times in our original experiment.
  2. We further tuned the hyperparameters of all algorithms.
  3. We applied a simple modification to the model-based rollout logic for both SWMPO and the baseline: instead of relying on the models for long trajectories (150 timesteps), which can lead to compounding errors, we sampled a random observation from an environment trajectory and performed a shorter (30-step) model rollout (see the sketch below). This reduces reliance on the model for long-horizon prediction, improving sample efficiency and stability for both methods. Note that the performance of the low-level neural models over long horizons is fully orthogonal to our claims, and the model-based RL community has mechanisms to scale neural forecasting performance. We will add this information to the final manuscript, along with our updated results.
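A minimal sketch of this rollout scheme, with `model`, `policy`, and `env_buffer` as illustrative stand-ins rather than names from the paper:

```python
import random

def short_model_rollout(model, policy, env_buffer, horizon=30):
    """Branch a short imagined rollout (30 steps instead of 150) from a
    randomly sampled real observation, limiting compounding model error."""
    o = random.choice(env_buffer)          # observation from a real trajectory
    imagined = []
    for _ in range(horizon):
        a = policy(o)
        o_next = model.step(o, a)          # one-step model prediction
        imagined.append((o, a, o_next))
        o = o_next
    return imagined

# Illustrative usage with trivial stand-ins:
class _StubModel:
    def step(self, o, a):
        return o + a                        # placeholder dynamics

rollout = short_model_rollout(_StubModel(), policy=lambda o: 0.1, env_buffer=[0.0, 1.0])
```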

These adjustments increase the performance of both SWMPO and the model-based baseline. However, SWMPO sees a greater gain, resulting in a larger performance gap between SWMPO and the monolithic model of approximately 40% (~21 mean reward for SWMPO vs. ~15 mean reward for vanilla model-based RL by timestep 75,000), a significant improvement over the much smaller gap in our original submission. Furthermore, SWMPO now reaches this performance level within 75,000 timesteps, compared to 200,000 steps in our original submission. We stopped at 75,000 timesteps due to rebuttal time and compute constraints; however, we will include the experiment up to 200,000 steps in our revised manuscript.

We encourage reviewers to check the updated results here (corresponding to an updated Figure 7): https://anonymous.5open.science/r/anonymousfig-30AA/improvedrewards.png

We summarize these updated results in the table below:

| Iteration | Baseline RL (Mean Reward) | Model-based RL (Mean Reward) | SWMPO (Mean Reward) |
| --- | --- | --- | --- |
| 20,000 | ~-1 | ~4.8 | ~5.2 |
| 40,000 | ~0.5 | ~9 | ~14 |
| 60,000 | ~0.5 | ~9 | ~17.5 |
| 75,000 | ~4 | ~15 | ~21 |

Extreme Case Scenario

We hope that our new results above, demonstrating improved performance over monolithic models, render the construction of an “extreme case” where SWMPO would outperform unnecessary. Our terrain-mass environment is already a scenario that closely matches the assumptions of the framework. We did not construct a more extreme synthetic case, as the updated results already show a significant performance gap in our original setting, without needing to simplify the task.

Additional References

We appreciate the reviewer's suggestion of additional references, and will include a discussion of all of them in the final manuscript.

Review (Rating: 4)

The paper presents Structured World Modeling for Policy Optimization (SWMPO), a framework for unsupervised learning of neurosymbolic Finite State Machines (FSM) that capture environmental structure for policy optimization. The method operates in two main stages: (1) learning local “world-model primitives” that specialize in modeling different “modes” of a partially observed system, and (2) assembling those primitives into an FSM that captures transitions among those modes. These local world-model primitives are trained in an unsupervised fashion from offline data. Then, with limited new data from the current task, the system stitches the primitives into a new FSM that is specialized to the particular environment. Finally, the authors leverage this FSM representation for model-based policy optimization.

In terms of experiments, the paper reports results on four main environments: (1) PointMass, (2) LiDAR-Racing, (3) Salamander, and (4) BipedalWalkerHardcore. The empirical results suggest that (a) the approach can learn to recover latent modes with reasonable accuracy, sometimes outperforming classical switching system baselines like HMMs and switching linear dynamical systems, and (b) the resulting FSM-based world models can be used effectively for model-based RL, matching or marginally outperforming a comparable monolithic neural model in some test settings.

Questions for Authors

Thank you for the paper submission. Overall, it presents a promising neurosymbolic approach that merges mode discovery and world modeling.

Claims and Evidence

(1) The paper claims that modeling the environment with a finite set of local dynamics (modes) can lead to better structural world models.

Through experiments (e.g., Figure 5 and related discussion), the authors show that the learned finite-state structure captures mode switching more accurately than certain baselines like HMM or switching linear models, especially in some of the environments (PointMass, LiDAR-Racing, Salamander).

(2) The paper argues that offline training of local dynamics primitives, followed by environment-specific stitching, can improve efficiency in policy optimization.

The authors’ experimental results (especially in PointMass) demonstrate that their approach, SWMPO, can use a short amount of new data in the active environment to synthesize an FSM that is then used for policy optimization. They compare a purely online-learned forward model vs. offline-learned primitives combined with minimal environment interactions, showing they achieve roughly similar or slightly better performance with fewer interactions.

Methods and Evaluation Criteria

Methods

(1) Neural Primitives: A neural network is trained to embed each observed transition into a continuous latent vector that captures the local mode. Then, an additional network predicts the next observation from the latent mode.

(2) Clustering and Pruning: Those embeddings are then clustered (k-means), and the resulting assignment to clusters is used to label each transition with a symbolic “mode.” A pruning step removes spurious transitions among modes.

(3) FSM Predicate Synthesis: Decision trees are used to learn conditions under which the FSM transitions from one mode to another, given the environment’s (observation, action) pairs.

(4) Model-Based RL: The final FSM is used in a model-based policy optimization loop (Soft Actor-Critic).
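For concreteness, a minimal sketch of steps (2) and (3) using off-the-shelf scikit-learn components; the array names, tree depth, and featurization are illustrative assumptions rather than the paper's exact pipeline:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Illustrative stand-ins: latent embeddings z_t from the trained mode network,
# plus the raw (o_t, a_t) features for each transition, in temporal order.
n, d_z, d_oa, n_modes = 1000, 4, 10, 3
Z = rng.normal(size=(n, d_z))       # embeddings of (o_{t-1}, o_t, a_t)
OA = rng.normal(size=(n, d_oa))     # concatenated (o_t, a_t) features

# (2) Clustering: assign each transition a symbolic mode label via k-means.
modes = KMeans(n_clusters=n_modes, n_init=10, random_state=0).fit_predict(Z)

# (3) Predicate synthesis: for each ordered mode pair (i, j), fit a small
# decision tree predicting whether the FSM should jump from i to j, using
# the (o, a) features of timesteps observed while in mode i.
predicates = {}
for i in range(n_modes):
    idx = np.where(modes[:-1] == i)[0]          # timesteps currently in mode i
    if len(idx) == 0:
        continue
    for j in range(n_modes):
        if i == j:
            continue
        y = (modes[idx + 1] == j).astype(int)   # did the next step switch to j?
        if y.sum() == 0:
            continue                            # no observed i -> j transitions
        predicates[(i, j)] = DecisionTreeClassifier(max_depth=3).fit(OA[idx], y)
```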

Evaluation Criteria

(1) Mode Accuracy: Measured via the Levenshtein distance between the predicted mode sequence and ground-truth mode labels for new episodes (a minimal implementation sketch follows below).

(2) Policy Performance: Cumulative rewards achieved by the learned policy.
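For reference, the Levenshtein distance between a predicted and a ground-truth mode sequence is the standard edit-distance dynamic program; a minimal sketch:

```python
def levenshtein(a, b):
    """Edit distance between two mode-label sequences (standard DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

# e.g., predicted vs. ground-truth mode sequences for one episode
print(levenshtein("AABBC", "AABCC"))  # -> 1
```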

Overall, the method and evaluation are comprehensive.

Theoretical Claims

No detailed proofs (e.g., in an appendix) of correctness or identifiability are provided beyond references to known standard assumptions (like the existence of a unique minimal partition).

Experimental Design and Analysis

The experimental design is centered around four environments of increasing complexity, each with a known “ground-truth” notion of mode (e.g., land vs. water, track type, or different obstacle types). This design is appropriate.

Supplementary Material

The supplementary material provides only the notation and the partition-pruning details.

Relation to Prior Literature

The references provided do give an overall sense that the authors are aware of classical baselines in this area.

Missing Important References

The references included do not appear incomplete for an initial demonstration.

Other Strengths and Weaknesses

Strengths:

  1. Structured Representation: The approach explicitly partitions the environment’s dynamics into interpretable “modes,” making it easier to reason about or visualize transitions.

  2. Modular Reuse: Potentially very useful if the same local dynamics (e.g., “land locomotion,” “water locomotion,” “curved track,” etc.) reappear across multiple tasks.

Weaknesses:

  1. Scalability: If there are many different modes, or if the environment is very complex and high-dimensional, the overhead of separate networks plus a large transition graph may become expensive and potentially complicated.

  2. Partial Observability: The method’s success hinges on the assumption that a single time step (previous observation, action, new observation) is enough to identify the latent mode. This assumption may fail in more complex partially observed tasks.

  3. More analysis is needed on extending to complex environments and real-world data. The modes presented in the paper are too simple and might not be practical for more complex settings.

  4. Writing and figures. (1) The authors should give examples of primitives at a very early stage for better readability. (2) Modes should be presented in the teaser figure, as they are key to this paper.

Other Comments or Suggestions

Analyzing or providing examples of failure cases could help future readers understand the method’s limits.

Author Response

We sincerely thank Reviewer bZQ6 for their thoughtful comments, and we are committed to incorporating your feedback into the final manuscript.

Note: Since our initial submission, we have demonstrated stronger RL performance of SWMPO over baselines (see response to B4Kd).

Please excuse our brevity due to the character limit.

Scalability

We understand your concern about the scalability of SWMPO in environments with a high number of modes or high dimensionality.

The computational complexity of SWMPO is driven by three steps: (1) solving Eq. 1, (2) training of neural primitives, (3) synthesizing predicates.

It can be shown that, in expectation, the overhead of training a SWMPO model is a multiplicative factor that depends on the number of modes (training the end-to-end model while solving Eq. 1 is asymptotically equivalent to training a standard monolithic model). Additionally, this can be amortized by training the neural primitives in parallel if wall-clock performance is critical. We argue that this overhead is not prohibitive.

For more details, please refer to our response to pyge (”Computational Complexity").

Additionally, the “forward pass” of the SWMPO model consists only of evaluating the transition predicates of the current mode (small functions that can be evaluated in parallel if needed) and the active mode's neural network. Thus, the runtime is only a small constant factor slower than that of a standard monolithic world model.

Mode Identifiability Assumption

As part of our follow-up work, we are working on generalizing the framework to cases where mode inference may require longer temporal context or probabilistic reasoning. However, we believe that for many useful applications this assumption holds. Indeed, many state-of-the-art systems continue to assume fully Markovian dynamics (e.g., Bhatt et al., ICLR 2024 [7]; Kuznetsov et al., ICML 2020 [8]). We argue that this assumption does not preclude the application of our method to useful problems.

Real-World Applicability

Our evaluation focuses on challenging but controlled simulated environments, which allow for systematic study of the components of SWMPO. These environments are aligned with recent work in the field, which is often benchmarked on systems of similar or lower complexity: e.g., simulated Cartpole in [0] (ICML 2020), simulated low-dimensional three-mode systems in [3] (NeurIPS 2021), and simulated grid-world environments in [4] (NeurIPS 2024). We thus argue that simulation is a valuable setting for benchmarking novel structure learning frameworks.

Moreover, we note that the Salamander environment already features high complexity, including 3D locomotion of a simulated real robotic platform [5, 6], rigid-body dynamics, and computational fluid dynamics to simulate water. We believe this makes it a strong intermediate testbed bridging controlled and real-world scenarios.

Nonetheless, we acknowledge the limitations of not including physical robotic platforms. While real-world deployment introduces significant challenges, we believe it is important to first validate the core elements of the framework in simulation. Furthermore, there is precedent for transferring small learned automata to hardware: Liu et al. (preprint, 2025) [1] demonstrate a three-mode FSM controlling a quadruped robot. We do not claim that our current results are directly applicable to real-world systems, but we strongly believe that SWMPO is a promising foundation for future real-world deployment.

Writing and Figure

We will incorporate your feedback into the final manuscript by modifying the teaser figure to highlight environment modes. Primitives are presented early in the introduction (e.g., paragraphs 2–5), but we will add more detail to their description.

References

[0] Zhang et al., Invariant Causal Prediction for Block MDPs, ICML 2020
[1] Liu et al., Discrete-Time Hybrid Automata Learning: Legged Locomotion Meets Skateboarding, arXiv preprint, 2025
[3] Poli et al., Neural Hybrid Automata: Learning Dynamics with Multiple Modes and Stochastic Transitions, NeurIPS 2021
[4] WorldCoder: a Model-Based LLM Agent, NeurIPS 2024
[5] https://www.epfl.ch/labs/biorob/research/amphibious/salamandra/
[6] https://www.cyberbotics.com/doc/guide/salamander?version=R2021a#salamander-wbt
[7] Bhatt et al., CrossQ: Batch Normalization in Deep Reinforcement Learning for Greater Sample Efficiency and Simplicity, ICLR 2024
[8] Kuznetsov et al., Controlling Overestimation Bias with Truncated Mixture of Continuous Distributional Quantile Critics, ICML 2020
[9] Goodfellow, Bengio, and Courville, Deep Learning, MIT Press, 2016
[10] Arthur and Vassilvitskii, How Slow is the k-means Method?, SCG 2006
[11] Klusowski & Tian, Large Scale Prediction with Decision Trees, Journal of the American Statistical Association, 2023
[12] Solar-Lezama, The Sketching Approach to Program Synthesis, APLAS 2009

Final Decision

This work proposes an interesting idea: learning neurosymbolic finite state machines (FSMs) that model environmental dynamics in continuous spaces for policy optimization. The authors provide compelling evidence for SWMPO’s effectiveness through experiments in four simulated environments (PointMass, LiDAR-Racing, Salamander, BipedalWalkerHardcore), demonstrating competitive model-based RL performance. The rebuttal addresses reviewer concerns, particularly those regarding mode identifiability, scalability, and real-world applicability, by clarifying assumptions, providing a detailed computational complexity analysis, and open-sourcing code to enhance accessibility. Significant improvements in RL performance (~40% over baselines) further strengthen the paper’s claims. Therefore, the work makes a meaningful contribution to ICML.