Off-policy Reinforcement Learning with Model-based Exploration Augmentation
We propose MoGE, which enhances off-policy RL exploration by generating critical experiences, leading to significant improvements in sample efficiency and performance ceilings across various tasks.
Abstract
Reviews and Discussion
The paper proposes an exploration algorithm that uses a diffusion model to generate useful states for exploration (critical states). A world model is learned to augment these states into transition tuples, which are then integrated into off-policy updates. The algorithm exhibits strong performance on deep RL benchmarks.
Strengths and Weaknesses
Strengths:
- The paper is well-written.
- The problem studied in the paper is an important one.
- The algorithm exhibits strong performance against the compared baselines.
Weaknesses:
- The algorithm itself seems much more computationally expensive, since it has to learn a diffusion model for critical state generation and a latent dynamics model. Hence, I believe that compared to standard off-policy RL methods, this would be more computationally expensive to run. This is not at all discussed in the paper. Furthermore, I think baselines such as [1-4] should also be considered in the paper.
- Dreamer and TD-MPC learn a world-model for POMDPs, where the dynamics are learned from observations. This requires stacking the observations together and compressing them in a relevant latent space. The encoder and decoder in this paper only compress the current state into the latent space. For POMDPs this would be incorrect. For MDPs, I am not quite sure why a latent model has to be learned instead of just learning a forward dynamics model, which maps the current state and action to the next state.
- Intrinsic exploration baselines, such as P2E [5], are not at all discussed in the paper. Moreover, methods such as [6,7,8] which combine intrinsic exploration objectives with the extrinsic rewards, are not discussed or ablated. Given that these methods are much simpler than what is proposed, I think the authors should compare against them to make a convincing case.
- Finally, I am not quite sure about the relevance/key novelty or insight of Theorem 1. Especially, given that the authors already assume convergence of the algorithm: "Assuming that the policy converges to the optimal policy in finite steps".
References:
[1] Chen, Xinyue, et al. "Randomized ensembled double q-learning: Learning fast without a model." arXiv preprint arXiv:2101.05982 (2021).
[2] Hiraoka, Takuya, et al. "Dropout q-functions for doubly efficient reinforcement learning." arXiv preprint arXiv:2110.02034 (2021).
[3] Nauman, Michal, et al. "Bigger, regularized, optimistic: scaling for compute and sample-efficient continuous control." arXiv preprint arXiv:2405.16158 (2024).
[4] Lee, Hojoon, et al. "Simba: Simplicity bias for scaling up parameters in deep reinforcement learning." arXiv preprint arXiv:2410.09754 (2024).
[5] Sekar, Ramanan, et al. "Planning to explore via self-supervised world models." International Conference on Machine Learning. PMLR, 2020.
[6] Chen, Eric, et al. "Redeeming intrinsic rewards via constrained optimization." Advances in Neural Information Processing Systems 35 (2022): 4996-5008.
[7] Sukhija, Bhavya, et al. "MaxInfoRL: Boosting exploration in reinforcement learning through information gain maximization." arXiv preprint arXiv:2412.12098 (2024).
[8] Sukhija, Bhavya, et al. "Optimism via intrinsic rewards: Scalable and principled exploration for model-based reinforcement learning." 7th Robot Learning Workshop: Towards Robots with Human-Level Abilities, 2025.
Questions
See Strengths And Weaknesses
Limitations
yes
Final Justification
The authors implemented most of the suggestions made during the rebuttal, and therefore I have increased my score. However, the submission now differs significantly from the initial submission, and therefore I have decided not to increase the score beyond 4.
Formatting Issues
No formatting issues
We sincerely thank you for the thoughtful and constructive feedback. Please find our detailed responses to the comments below.
1. Supplementary experiments: computational cost and additional baselines
Thank you for raising this important point. To address your concern, we have conducted additional wall-time experiments (measured in seconds) to assess the computational cost of MPGE. The results are provided below:
| Env | MPGE | PGR | SER | DSAC |
|---|---|---|---|---|
| Humanoid-run | 151657±395 | 207813±1464 | 210321±788 | 128153±122 |
The results show that MPGE incurs moderate overhead compared to standard off-policy methods due to the added diffusion and world model components. However, this overhead is effectively controlled through two mechanisms:
- We perform MPGE training only once every 10 environment steps, which significantly reduces the overall compute.
- Unlike PGR-style methods, which require retraining the generative model to convergence before use, MPGE supports online training with incremental updates, which simplifies training and improves efficiency.
We will include this discussion and the corresponding results in the final version of the paper.
Additionally, we appreciate your suggestion regarding baseline selection. We agree that including recent off-policy methods such as [1-4] would provide a more comprehensive evaluation. We conducted these experiments on the DMC Humanoid tasks, which are among the hardest in the suite, using 3 seeds; the results are as follows:
| Env | MPGE | Simba | BRO | RedQ | DroQ |
|---|---|---|---|---|---|
| Humanoid-run | 489 ± 9 | 268 ± 40 | 417 ± 15 | 187 ± 12 | 164 ± 21 |
| Humanoid-walk | 892 ± 19 | 801 ± 13 | 881 ± 25 | 665 ± 5 | 682 ± 14 |
| Humanoid-stand | 907 ± 7 | 920 ± 14 | 905 ± 3 | 902 ± 4 | 896 ± 6 |
As shown, except for the simplest case (Humanoid-stand), MPGE achieves the best performance on the more challenging tasks.
2. Discussion of the Encoder-decoder structure in MPGE
Thank you for this insightful question. We would like to clarify that MPGE is developed under the MDP setting. Our choice to model dynamics in a latent space, rather than directly learning a forward mapping from the current state and action to the next state, is motivated by several practical considerations:
- Learning a compact latent representation helps capture abstract, task-relevant features of the environment, improving generalization and training efficiency even under MDPs. Moreover, different state dimensions may contribute unevenly to dynamics modeling; using an encoder to transform the raw state allows the model to extract the most relevant components, which facilitates more accurate transition prediction.
- Latent dynamics are often smoother and easier to predict than raw state transitions, especially in high-dimensional environments, so fitting dynamics in a latent space typically improves the accuracy of transition modeling.
- Decoupling representation learning from dynamics prediction allows better reuse and transfer of components. In fact, in MPGE, our policy network is built on top of the same encoder, enabling a unified state representation that facilitates more effective policy learning (a schematic of this shared-encoder layout is sketched below).
To further support our claim, we conducted a simple experiment across three environments (each with 3 random seeds), using the same diffusion and baseline algorithm setup, to compare the total average return (TAR) between vanilla dynamics (using a Transformer as the predictor of next states and rewards) and latent dynamics. The results are as follows:
| Env | MPGE with latent dynamics | MPGE with vanilla dynamics |
|---|---|---|
| Humanoid-run | 489 ± 9 | 408 ± 35 |
| Humanoid-walk | 892 ± 19 | 684 ± 71 |
| Humanoid-stand | 907 ± 7 | 785 ± 88 |
The results show that, compared to directly mapping the current state and action to the next state, introducing a latent space leads to better learning of both environment transitions and state representations in the policy network.
3. Comparison with different exploration baselines
Thank you very much for the valuable suggestion. In response, we have conducted additional experiments to study the effect of different exploration strategies on the DSAC algorithm via ablation comparisons. Due to the development and implementation effort involved, and time constraints, we focus on three representative environments—the Humanoid tasks in DMC—for this evaluation.
Before presenting the results, we provide the following clarifications regarding the methods considered in this comparison:
- Plan2Explore: Due to the limited compatibility of the original codebase, we re-implemented this method within our own framework. Specifically, we replaced the world model with Dreamer, and trained five independent predictors to construct a latent disagreement signal. An exploration policy was then trained using this signal to collect additional data for the main task policy.
- EIPO: All published experiments with EIPO are conducted on on-policy algorithms, specifically PPO. Although EIPO is theoretically algorithm-agnostic, applying it to off-policy methods such as DSAC poses practical challenges. Its core mechanism relies on alternating optimization between two policies and dynamically adjusting the intrinsic reward via Lagrangian updates. However, off-policy training uses replay buffers, making it difficult to accurately evaluate both policies' extrinsic returns for updating the multiplier. Additionally, the assumption of policy closeness often fails in off-policy settings, reducing the reliability of the surrogate objective. Due to these challenges and time constraints, we were unable to include EIPO in our experiments. We appreciate your understanding.
- MaxInfoRL: The method is highly transferable, and we directly integrated its core components into our own codebase. We also attempted to build DSAC on top of the original implementation, but the performance was not as good as our reimplementation. Therefore, we report results based on the version implemented in our framework.
- OMBRL: Since the original implementation is in JAX, we reimplemented the method in our own codebase by replacing the SAC component in MBPO with DSAC. We use deep ensembles to compute the uncertainty-based intrinsic signal, with each network sharing the same architecture as the predictor used in Plan2Explore.
The final results are as follows:
| Env | MPGE | Plan2Explore | MaxInfoRL | OMBRL |
|---|---|---|---|---|
| Humanoid-run | 489 ± 9 | 311 ± 12 | 197± 4 | 262±13 |
| Humanoid-walk | 892 ± 19 | 588 ± 7 | 481 ± 7 | 678± 5 |
| Humanoid-stand | 907 ± 7 | 801± 14 | 844±8 | 769±11 |
The results show that MPGE still offers a significant advantage compared to these methods. Intrinsic exploration methods encourage visiting novel or unpredictable states, but their signals (e.g., prediction error, uncertainty) are often task-agnostic and may lead to uninformative or misaligned exploration. In contrast, MPGE leverages task-aware utility functions (e.g., TD-error, entropy) to generate states that are directly aligned with policy improvement objectives. Besides, unlike intrinsic reward methods that require careful reward balancing and affect the policy's optimization objective, MPGE decouples exploration from reward design, enabling more stable training without interfering with the task-specific learning signal.
4. Clarification of Theorem 1
We fully understand your concern regarding this statement.
In fact, the purpose of this theorem is to ensure that the generated states, which are learned based on the state distribution in the replay buffer, can gradually follow a steady-state distribution. This is motivated by existing studies [4-5] showing that unstable or non-convergent training distributions in generative models often lead to degraded or uncontrolled generation quality. Our theoretical analysis aims to provide a principled guarantee that the generated states remain within the environment's reachable state space, thereby ensuring that they can effectively support policy exploration and improvement.
As stated in line 532 of the paper, we also provide conditions under which the conclusion still holds even if the policy does not satisfy finite-step convergence. The actual assumption made here is that policy changes diminish over time as optimization proceeds. Importantly, the final stationary policy distribution does not need to correspond to an optimal policy; our derivation does not rely on any properties of the optimal policy, such as its return being the highest. Lastly, we would like to reiterate that finite-step convergence is merely an idealized assumption: our theoretical result is not tied to convergence to the optimal policy, but only to a diminishing rate of policy updates over time.
We acknowledge that our previous wording may have caused confusion, and we will clarify it accordingly in the final version. We are sincerely grateful for your thoughtful question and suggestion.
Conclusion
We sincerely thank you for your thoughtful and constructive feedback. In response, we have conducted substantial additional experiments and analyses to directly address your concerns. We hope these efforts clarify the points in question, and we would be truly grateful if you would take them into account when reconsidering your score.
I thank the authors for their effort during the rebuttal. Most of my concerns are addressed and I have increased my score to a 4. I will keep my score at 4 since I feel the authors have added considerable new results during the rebuttal, which has significantly changed the paper from the initial submission.
Once again, thank you very much for your time, efforts, and insightful comments. Your advice and re-evaluation have been invaluable in motivating us to further improve the clarity, rigor, and completeness of our work!
Dear Reviewer DNyW,
Please provide your feedback to the authors as early as possible after reading the rebuttal and other reviews, so there is enough time for back-and-forth discussion. Thank you!
Best, Area Chair
This paper introduces Model-based Prioritized Generative Exploration (MPGE), a novel approach to enhance exploration in reinforcement learning, particularly within off-policy settings. MPGE addresses the limitations of traditional passive exploration by generating under-explored, critical states and synthesizing dynamics-consistent experiences. The method comprises two main components: a diffusion generator guided by policy entropy and temporal difference error to identify critical states, and a one-step imagination world model to construct valuable transitions for learning. Experimental results on OpenAI Gym and DeepMind Control Suite benchmarks show that MPGE improves exploration efficiency and overall performance in continuous control tasks.
Strengths and Weaknesses
Strengths:
- The exploration of off-policy scenarios is both important and interesting. The proposed method appears to be novel.
- The introduction is well-written, and the literature review is comprehensive.
- Experimental results on OpenAI Gym and DeepMind Control Suite benchmarks show that MPGE improves exploration efficiency and overall performance in continuous control tasks.
Weaknesses:
The methodology section is somewhat confusing. I have several questions regarding the methods:
- The explanation of Equation 5 is unclear; it is not obvious how it is derived or how the variable is introduced.
- The two paragraphs on "Policy Entropy" and "Temporal Difference Error" appear abruptly in Section 3.1. Are these intended to be part of the method? Concepts like TD error might be better placed in the Preliminaries section.
- In Equation 10, the labels for reward and reconstruction seem to be swapped.
- In Equation 10, the use of sg() appears to bring the encoder output and the dynamics prediction closer together. Does learning both the representation and the dynamics simultaneously lead to unstable training?
- The experiments are conducted on continuous control tasks. Can the method generalize to discrete-action space cases?
Questions
- In Theorem 1, why is it assumed that the policy converges to the optimal policy in a finite number of steps? The convergence of the training policy depends on your algorithm; why make this assumption at the outset?
- How does the proposed method compare to on-policy methods in terms of performance?
- Can the approach be generalized to discrete action spaces, such as those in Atari environments?
Limitations
yes
Final Justification
The authors addressed my concerns in the rebuttal. I raise my score to 4.
Formatting Issues
None
We sincerely appreciate your detailed and constructive feedback. Below are our responses to your concerns.
1. Explanation of Equation 5
Equation (5) follows the standard gradient decomposition for classifier-guidance diffusion models [1]. It is used to guide the generation of states toward those favored by the target function. The guidance scale coefficient controls the strength of this guidance, balancing between following the unconditional diffusion prior and aligning with the target utility function.
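For reference, the standard classifier-guidance decomposition of [1], written here in generic notation (the symbols in our Equation (5) may be arranged differently): $s_t$ is the noised state, $y$ the guidance target given by the utility function, and $\omega$ the guidance scale.

```latex
% Classifier guidance (Dhariwal & Nichol, 2021), generic notation.
\nabla_{s_t} \log p_\omega(s_t \mid y)
  = \nabla_{s_t} \log p(s_t) + \omega\, \nabla_{s_t} \log p_\phi(y \mid s_t),
\qquad
\hat{\epsilon}_\theta(s_t, t)
  = \epsilon_\theta(s_t, t) - \omega \sqrt{1-\bar{\alpha}_t}\; \nabla_{s_t} \log p_\phi(y \mid s_t)
```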
2. Description of the two utility functions
We appreciate the reviewer’s suggestion and agree that these concepts are better introduced earlier. In the revised version, we will move the descriptions of policy entropy and TD error to the Preliminaries section for clarity. We also clarify that these are example utility functions within our framework, and other choices are equally applicable to support the generation of diverse and informative critical states.
3. Swapped label
Thank you very much for pointing this out! We confirm the labels are indeed swapped and will correct this in the final version.
4. About Equation 10
Thank you for the insightful question. We follow the design principle introduced in the Dreamer series [2-3], where stop-gradient (sg) is applied to avoid representational collapse and to stabilize joint training of the encoder and the dynamics model. Specifically, the sg operator ensures that the encoder output is treated as a fixed target when training the dynamics model. Conversely, the dynamics prediction is treated as fixed when training the encoder, so that the encoder is not optimized to merely match the dynamics model output, but instead learns representations that are useful for downstream tasks such as value prediction or policy learning.
In summary, this design is implemented to prevent trivial solutions where both modules co-adapt to minimize the loss without learning meaningful structure. Moreover, Dreamer demonstrates that using such asymmetric losses with sg leads to more stable and scalable representation learning in model-based RL. Empirically, we observe that this design promotes stable convergence across tasks, consistent with prior works.
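As a minimal sketch of this asymmetric objective (the function names, the squared-error form, and the weighting `beta` are illustrative simplifications of the DreamerV3-style dyn/rep split, not our exact Equation 10):

```python
import torch

def latent_consistency_loss(z_post, z_prior, beta=0.1):
    """DreamerV3-style asymmetric objective: train the dynamics model toward a
    frozen encoder target, and (more weakly) the encoder toward a frozen
    dynamics target. Squared error stands in for the actual distance used.

    z_post : latent from the encoder (posterior), shape [B, D]
    z_prior: latent predicted by the dynamics model (prior), shape [B, D]
    """
    dyn_loss = ((z_post.detach() - z_prior) ** 2).mean()   # updates dynamics only
    rep_loss = ((z_post - z_prior.detach()) ** 2).mean()   # updates encoder only
    return dyn_loss + beta * rep_loss
```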
5. Extension
Yes, our method can be extended to discrete action spaces. As long as a differentiable mapping from states to the chosen utility function can be established, the gradient guidance needed to generate critical states remains available. While our current experiments focus on continuous control tasks, we plan to explore discrete-action environments in future work.
Q&A1: Clarification of Theorem 1
We fully understand your concern regarding this statement.
In fact, the purpose of this theorem is to ensure that the generated states, which are learned based on the state distribution in the replay buffer, can gradually follow a steady-state distribution. This is motivated by existing studies [4-5] showing that unstable or non-convergent training distributions in generative models often lead to degraded or uncontrolled generation quality. Our theoretical analysis aims to provide a principled guarantee that the generated states remain within the environment's reachable state space, thereby ensuring that they can effectively support policy exploration and improvement.
As stated in line 532 of the paper, we also provide conditions under which the conclusion still holds even if the policy does not satisfy finite-step convergence. The actual assumption made here is that policy changes diminish over time as optimization proceeds. Importantly, the final stationary policy distribution does not need to correspond to an optimal policy; our derivation does not rely on any properties of the optimal policy, such as its return being the highest. Lastly, we would like to reiterate that finite-step convergence is merely an idealized assumption: our theoretical result is not tied to convergence to the optimal policy, but only to a diminishing rate of policy updates over time.
We acknowledge that our previous wording may have caused confusion, and we will clarify it accordingly in the final version. We are sincerely grateful for your thoughtful question and suggestion.
Q&A2: Comparison with on-policy methods
We have reported comparisons with two representative on-policy methods, PPO and TRPO, in Appendix C.2. Please refer to that section for detailed results.
Q&A3: Discrete action spaces generalization
Yes, the approach can be generalized to discrete action spaces. Please refer to our earlier response above for a detailed explanation.
Conclusion
We truly appreciate your thoughtful and constructive feedback. We hope that our detailed responses and additional efforts have addressed your concerns, and we kindly ask that you take them into consideration during your re-evaluation.
[1] Dhariwal P, Nichol A. Diffusion models beat GANs on image synthesis[J]. Advances in neural information processing systems, 2021, 34: 8780-8794.
[2] D. Hafner et al. Learning latent dynamics for planning from pixels. ICML, 2019.
[3] D. Hafner et al. Mastering diverse control tasks through world models. Nature, 2025.
[4] Argenson A, Dulac-Arnold G. Model-based offline planning[J]. arXiv preprint arXiv:2008.05556, 2020.
[5] Luo C. Understanding diffusion models: A unified perspective[J]. arXiv preprint arXiv:2208.11970, 2022.
Thanks for the rebuttal. The authors have addressed my concerns, and I hope these details can be added to the revised version. By the way, will the code be released?
We will make sure to include all supporting analyses and clarifications in the final version of the paper. The code will also be open-sourced upon acceptance, after proper organization and documentation.
Once again, we kindly ask for your consideration of a higher score. We sincerely hope that your recognition will help this work get accepted and seen by a wider audience. Thank you once again for your support.
Thanks for addressing my concerns. I have raised my score.
We sincerely appreciate your recognition. We believe that your support and endorsement will help this paper reach a broader audience and, in turn, contribute to the RL community.
Dear Reviewer ZXF4,
Please provide your feedback to the authors as early as possible after reading the rebuttal and other reviews, so there is enough time for back-and-forth discussion. Thank you!
Best, Area Chair
The authors propose MPGE, a novel generative augmentation framework for off-policy RL algorithms. MPGE consists of three modules: a generator, a classifier that guides the generator toward under-explored states, and a world model that predicts dynamically plausible next states and rewards. By using the augmented transitions, MPGE integrates seamlessly with off-policy RL algorithms and achieves higher sample efficiency.
Strengths and Weaknesses
Strengths
- Idea is straightforward and the paper is easy to follow.
- Conducts extensive experiments, including ablation studies on each design component.
- Code is publicly available
Weakness
- It seems that the main motivation of this work is closely related to PGR, as both utilize generative models with guidance to generate novel but dynamically plausible transitions. While the authors claim that PGR generates data close to the original data distribution, this is not true when we check the analysis conducted by PGR. I'm also not sure that MPGE really generates novel but dynamically plausible transitions, as there are only plots with the reward metric and a lack of analysis. I strongly recommend that the authors conduct an analysis of the novelty and dynamic plausibility of the transitions generated by their method.
- While I acknowledge that the authors conduct extensive experiments and obtain superior results, I'm not completely persuaded why it should be better than PGR. It seems that the world model is also uncertain in unknown regions during training, so we cannot ensure that the world model's predictions are dynamically plausible. Again, it would be nice to have some additional analysis on why MPGE performs better than other methods.
Questions
Here are some questions I want to ask:
- It seems that the method is sensitive to the guidance scale. Is there any rule for deciding such a hyperparameter?
Limitations
Here are some comments I want to suggest:
- As mentioned above, off-policy RL algorithms are very sensitive to hyperparameters. MPGE achieves better performance on various benchmarks compared to SER and PGR, but the complexity of the proposed method is higher than those baselines. It is hard for me to be confident that MPGE can generalize to different tasks without extensive hyperparameter tuning.
Final Justification
The authors additionally provided an analysis of the novelty and dynamic plausibility of the generated transitions, which is promising. Therefore, I lean towards acceptance.
Formatting Issues
I do not notice any formatting issues.
We sincerely thank you for the thoughtful and constructive feedback. Please find our detailed responses to the comments below.
1. Qualitative Analysis of MPGE
We will address your concerns through theoretical analysis from two perspectives: novelty and dynamic compliance. Specifically, we aim to clarify the following questions:
- Why do the actions or transitions generated by MPGE exhibit novelty, and how does this benefit policy optimization and evaluation?
- How does MPGE ensure compliance with the environment dynamics when generating transitions?
- Why are the transitions generated by MPGE more effective than those produced by methods like PGR, ultimately leading to improved policy performance?
1.1 The novelty of transitions
The joint distribution of transitions generated by MPGE factorizes into the guided state generator, the action-selection policy, and the learned world model, as sketched below.
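Schematically, and using generic notation rather than the paper's exact symbols (the action term is assumed to come from the current policy):

```latex
% Schematic factorization of MPGE-generated transitions (generic notation):
% a utility-guided state generator, the current policy, and the learned world model.
p_{\mathrm{MPGE}}(s, a, r, s') \;=\;
    \underbrace{p_\theta(s)}_{\text{guided diffusion}}\,
    \underbrace{\pi(a \mid s)}_{\text{current policy}}\,
    \underbrace{p_\psi(r, s' \mid s, a)}_{\text{world model}}
```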
Based on the structure of MPGE, we can further decompose the source of its novelty into three main components:
(i) State-Level Generation Instead of Transition Resampling
MPGE generates high-utility states directly and then uses the world model to simulate plausible transitions. This contrasts with prior methods such as PGR, which often resample or perturb transitions from the buffer. By targeting utility-defined regions, MPGE explores more structurally diverse transitions.
(ii) Dynamically Shifting Utility Landscape During Training
The utility signal in MPGE evolves as training progresses, continuously shifting the generation target. This dynamic conditioning promotes continual novelty, as the generator focuses on states that are currently difficult for the policy, rather than statically defined regions.
(iii) Policy-Centric Guidance Instead of Buffer Imitation
Crucially, MPGE does not generate data based on the static behavior policy distribution in the buffer. Instead, the generation is guided by utility functions evaluated under the current policy, such as current-policy TD error or entropy. This ensures that the generated transitions align with the evolving learning needs of the agent, rather than being confined to outdated or suboptimal trajectories. The world model further enables rollouts from generated states, allowing MPGE to synthesize transitions that the current policy is likely to visit but has not yet explored, thus supporting meaningful off-policy generalization.
Summary:
MPGE leverages utility functions conditioned on the current policy to generate reachable but under-visited states with long-tail characteristics. Rather than producing out-of-distribution samples, it rebalances in-distribution critical states to align with the training objective.
As a result, transitions generated by MPGE are more novel and task-relevant compared to methods like PGR, which imitate the buffer and remain constrained by the behavior policy distribution.
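To make this pipeline concrete, the sketch below shows one augmentation round; the interfaces (generator.sample_guided, policy.sample, world_model.step, buffer.add_synthetic) are illustrative placeholders rather than our released implementation, and actions are assumed to be drawn from the current policy.

```python
def mpge_augment(generator, utility_fn, policy, world_model, buffer,
                 n_states=256, guidance_scale=1.0):
    """One augmentation round: generate critical states, assemble transitions
    with the learned world model, and push them into the replay buffer."""
    # 1) Utility-guided diffusion sampling of reachable but under-visited states.
    s = generator.sample_guided(n_states, guidance=utility_fn,
                                scale=guidance_scale)
    # 2) Actions from the *current* policy, so the synthetic data tracks the
    #    agent's evolving behavior rather than the buffer's history.
    a = policy.sample(s)
    # 3) One-step imagination keeps the transitions dynamics-consistent.
    s_next, r = world_model.step(s, a)
    buffer.add_synthetic(s, a, r, s_next)
```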
1.2 The dynamical feasibility of transitions
(i) Analysis of dynamical feasibility in MPGE
In our setting, the world model is pretrained using data collected from a random policy, ensuring broad coverage of the state-action space and capturing generalizable environment dynamics rather than overfitting to narrow trajectories. This is a standard practice in offline model-based RL to ensure that rollouts remain dynamically plausible.
Importantly, MPGE generates only states, and transitions are simulated through the learned dynamics model. As such, all transitions remain on-manifold with respect to the model, provided it is well-trained. This follows the same principle as in Dreamer [1] and related work, where stochastic data and latent dynamics promote reliable predictions.
(ii) Comparison with methods like PGR
In fact, compared to MPGE—which explicitly assembles transitions using the environment dynamics and ensures that the resulting buffer complies with the true transition function—methods that imitate and resample transitions directly from the replay buffer are more likely to violate dynamics consistency.
Take PGR as an example: PGR models the joint distribution over transition tuples by learning directly from the tuples stored in the buffer. However, this paradigm does not enforce any structural constraint that the transitions must satisfy the environment dynamics. This lack of constraint makes the generative model prone to overfitting, potentially capturing spurious correlations between variables rather than learning the true mapping from the state-action pair to the next state. In other words, although both approaches model a joint distribution over transitions, PGR may end up modeling a factorization that does not reflect the causal or functional dependencies present in the environment, which may result in poor policy evaluation.
This observation is further supported by our experiments below, which demonstrate that MPGE is more capable of generating transitions that are dynamically consistent with the environment, in contrast to methods that rely solely on joint distribution modeling without enforcing transition constraints.
2. Quantitative Analysis of MPGE (Supplementary Experiments)
To validate our analysis, we conducted an experiment to assess the dynamic compliance of transitions generated by MPGE and PGR. We trained a dynamics discriminator that classifies whether a given (s, a, s') tuple aligns with the true environment dynamics.
The discriminator was trained on positive samples from the replay buffer and a random policy—both guaranteed to be valid. To ensure fairness, MPGE and PGR were trained independently, each generating an equal number of negative samples, while their learning was not influenced by the discriminator’s feedback.
Structure of the discriminator:
To mitigate potential overfitting as much as possible, we introduced an information bottleneck into the design of the dynamics discriminator. Specifically, the input pair (s, a) is first processed through a multi-layer MLP, whose output is then concatenated with the encoded representation of s'. This combined representation is subsequently passed through a lower-dimensional MLP before producing a softmax-based binary classification output.
```python
def forward(self, s, a, s_prime):
    # Encode the (s, a) pair and the candidate next state separately.
    sa_feat = self.mlp1(torch.cat([s, a], dim=-1))
    sp_feat = self.mlp2(s_prime)
    # Fuse both low-dimensional embeddings and classify real vs. generated.
    fused = torch.cat([sa_feat, sp_feat], dim=-1)
    out = self.fusion_discriminator(fused)
    return out
```
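For completeness, a constructor consistent with the forward pass above could look as follows; the layer widths and bottleneck size here are illustrative, not the exact settings used in our experiments:

```python
import torch
import torch.nn as nn

class DynamicsDiscriminator(nn.Module):
    """Hypothetical constructor matching the forward pass above: separate
    encoders for (s, a) and s', fused through a narrow bottleneck before a
    softmax-based real/generated classification."""
    def __init__(self, obs_dim, act_dim, hidden=256, bottleneck=32):
        super().__init__()
        self.mlp1 = nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, bottleneck))
        self.mlp2 = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, bottleneck))
        self.fusion_discriminator = nn.Sequential(
            nn.Linear(2 * bottleneck, bottleneck), nn.ReLU(),
            nn.Linear(bottleneck, 2))  # two logits: real vs. generated
```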
Results:
We conducted a joint training experiment on Humanoid-run with 3 seeds, where the dynamics discriminator, PGR, and MPGE were trained simultaneously. The table below summarizes each method's accuracy (the ratio of generated samples classified as real to the total number of test samples) over time:
| Method | Accuracy (100k) | Accuracy (500k) | Accuracy (1M) | Accuracy (1.5M) |
|---|---|---|---|---|
| MPGE | 10.7%±5.8% | 66.2%±15.1% | 75.3%±11.4% | 92.1%±4.5% |
| PGR | 13.1%±3.2% | 43.4%±7.9% | 63.4%±9.3% | 76.8%±10.6% |
The results show that as training progresses, MPGE consistently generates a larger proportion of transitions that do not violate the environment dynamics constraints, thereby demonstrating the higher quality and plausibility of its generated data.
Q&A1: Discussion on guidance scale
The guidance scale plays a critical role in balancing sample diversity and target alignment. In our current implementation, we select it through cross-validation on a held-out set (or via a grid search over a small range of candidate values) and apply the same value across all environments for consistency.
In our ablations, a moderate guidance scale performs best under standard training steps. However, with extended training (e.g., 3M steps), smaller scales also achieve strong performance, differing mainly in convergence speed, while overly large scales tend to distort the generated data. For simple tasks a small scale is sufficient, whereas complex tasks may benefit from adaptive schemes such as annealing (a sketch of such a schedule follows the table below):
| Metric | Scale setting 1 | Scale setting 2 | Scale setting 3 | Scale setting 4 |
|---|---|---|---|---|
| TAR (3M) | 608.2±21.3 | 616.2±24.7 | 624±12.5 | 402.3±9.8 |
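For concreteness, a hypothetical linear annealing schedule of the kind mentioned above (the endpoint values are illustrative, not tuned settings):

```python
def annealed_guidance_scale(step, total_steps, scale_start=2.0, scale_end=0.5):
    """Linearly decay the guidance scale over training (endpoints illustrative)."""
    frac = min(step / total_steps, 1.0)
    return scale_start + frac * (scale_end - scale_start)
```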
Q&A2: Discussion on hyperparameters
We fully understand the reviewer’s concern regarding generalization and hyperparameter sensitivity. To address this:
- We used the same set of hyperparameters across all tasks on both benchmarks (DMC and Gym MuJoCo), without any task-specific tuning for MPGE.
- The world model parameters are directly adopted from existing implementations of DreamerV3 [1] and STORM [2], and the diffusion model settings follow standard DDPM [3] configurations; no further tuning was applied.
- For our method-specific hyperparameters, we conducted ablation studies and then applied the same values across all tasks.
These results demonstrate that MPGE maintains strong performance without extensive tuning, and we believe this reflects its robustness across diverse environments.
Conclusion
We sincerely appreciate your detailed and constructive feedback. We hope our responses address your concerns and kindly request your consideration for a higher score.
[1] D. Hafner et al. Mastering diverse control tasks through world models. Nature, 2025.
[2] Zhang W, Wang G, Sun J, et al. Storm: Efficient stochastic transformer-based world models for reinforcement learning[J]. Advances in Neural Information Processing Systems, 2023, 36: 27147-27166.
[3] Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models[J]. Advances in neural information processing systems, 2020, 33: 6840-6851.
Thanks for the authors' response. While many of my concerns have been resolved, I am still concerned about the novelty of the generated samples. While the authors provide some theoretical analysis of why MPGE generates novel transitions compared to PGR, it would be better to directly compute the novelty and dynamic plausibility scores of the samples generated by MPGE and PGR, as done in PGR or SynthER [1].
[1] Lu, Cong, et al. "Synthetic experience replay." Advances in Neural Information Processing Systems 36 (2023): 46323-46344.
We sincerely thank you for your valuable suggestion! Building upon our original experiments, we conducted additional evaluations based on the relevant metrics proposed in [1] and [2]. The results are organized into the following two parts:
1. Novelty score:
To quantitatively assess the diversity of the generated transitions, we follow the evaluation protocol introduced in [1]: for each generated sample, we measure its minimum L2 distance to the original dataset, which shows how far the upsampled data is from the original data (a sketch of this computation follows the table below). Taking a reduced 15% subset of the replay buffer from the Humanoid tasks, the distances are as follows:
| Method | Humanoid-run | Humanoid-walk | Humanoid-stand |
|---|---|---|---|
| SER | 3.46 | 3.51 | 0.93 |
| PGR | 2.38 | 3.46 | 0.97 |
| MPGE | 4.69 | 4.37 | 1.21 |
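A sketch of this distance computation, assuming generated samples and the original dataset are stacked as flattened transition vectors (the chunking is only to bound memory):

```python
import torch

def novelty_scores(generated, dataset, chunk=1024):
    """Minimum L2 distance of each generated sample to the original dataset.

    generated: [N, D] tensor of generated transition vectors
    dataset:   [M, D] tensor of real transition vectors from the buffer
    Returns a length-N tensor; larger values indicate more novel samples.
    """
    mins = []
    for start in range(0, generated.shape[0], chunk):
        d = torch.cdist(generated[start:start + chunk], dataset)  # [chunk, M]
        mins.append(d.min(dim=1).values)
    return torch.cat(mins)
```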
2. Dynamic plausibility score:
At the same time, we compute the approximate mean-squared error (MSE) of the dynamics over 10K generated transitions for MPGE, SER, and PGR (a sketch of this estimate follows the table below). The results are as follows:
| Method | Humanoid-run | Humanoid-walk | Humanoid-stand |
|---|---|---|---|
| SER | 5.89 | 6.23 | 5.94 |
| PGR | 5.14 | 6.75 | 5.13 |
| MPGE | 1.72 | 1.63 | 1.92 |
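A sketch of how such a dynamics error can be estimated; env.set_state and the reset-to-state protocol are hypothetical hooks for illustration, not part of the standard environment API:

```python
import numpy as np

def dynamics_mse(env, transitions):
    """Approximate dynamics MSE of generated transitions: reset the simulator
    to each generated state, apply the generated action, and compare the
    simulator's next state with the generated next state.

    transitions: iterable of (s, a, s_prime_generated) numpy arrays
    """
    errors = []
    for s, a, s_prime_gen in transitions:
        env.set_state(s)                 # hypothetical reset-to-state hook
        s_prime_true = env.step(a)[0]    # first element assumed to be the obs
        errors.append(np.mean((s_prime_true - s_prime_gen) ** 2))
    return float(np.mean(errors))
```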
Results:
Upon further analysis, we observe that MPGE is capable of exploring broader regions of the state space while generating transitions that more accurately adhere to the environment dynamics. The empirical results of PGR and SER (SynthER) are consistent with the behavior described in prior works: PGR tends to densify subspaces of the buffer to find local policy improvement opportunities, whereas MPGE actively searches for under-explored areas of the state space to discover novel and informative transitions. Moreover, the incorporation of a world model enables MPGE to synthesize transitions that are more consistent with the true environment dynamics, which is further supported by our supplementary experiment involving a trained transition discriminator.
Finally, we would like to once again express our sincere gratitude for your valuable comments and suggestions. As a novel exploration-enhancing framework, MPGE is designed to serve as both a practical baseline and a source of inspiration for future research in the field. If our responses have addressed your concerns, we would be deeply grateful if you could kindly consider reevaluating our submission. Your recognition and support would mean a great deal to us, and we sincerely hope it can help bring our work to a broader audience. Thank you once again for your time and thoughtful review.
[1] Lu, Cong, et al. "Synthetic experience replay." Advances in Neural Information Processing Systems 36 (2023): 46323-46344.
[2] Wang R, Frans K, Abbeel P, et al. Prioritized generative replay[J]. arXiv preprint arXiv:2410.18082, 2024.
Thanks for conducting the additional experiments. I noticed that we could not include visualizations during the rebuttal phase, so it would be better to add density plots for the novelty and dynamic plausibility scores in the future manuscript. I will adjust the score to 4. Best.
We sincerely appreciate your support and recognition, which are of great importance to us. We will carefully incorporate all of your comments into the final version of our work. We genuinely hope that your recommendation and endorsement will help this paper gain broader visibility and make a meaningful contribution to the RL community.
Dear Reviewer W4cH,
Please provide your feedback to the authors as early as possible after reading the rebuttal and other reviews, so there is enough time for back-and-forth discussion. Thank you!
Best, Area Chair
To speed up RL, the authors propose to generate synthetic transitions with high exploration potential using conditional diffusion. Their method, MPGE, first trains an unconditional diffusion model on the replay buffer. Then imaginary states are generated using classifier guidance, where the utility function is either policy entropy or TD-error. A world model is learned to generate the rest of the synthetic transition. MPGE can be added onto any off-policy RL algorithm, and they show improvements over baselines in 10 continuous control tasks.
Strengths and Weaknesses
Strengths
- an interesting idea of augmenting the replay buffer with imaginary transitions high in learning potential, by using a conditional diffusion model
- paper is generally well written
Weaknesses
- Usage of policy entropy is questionable as a utility function, especially in robotics tasks with optimal multi-modal policy distributions (see Diffusion Policy)
- there seems to be a lot of moving parts and hyperparameters (guidance weight, mixture weights, choice of utility function)
- gains over baselines are somewhat marginal in 5/10 tasks, with little analysis of why MPGE does better or worse than baselines. The experiments are run with only 3 seeds, which makes me think that in many of the tasks where MPGE only does marginally better, the gap could be due to statistical error.
- It would have been great to see some qualitative analysis of what imaginary states MPGE generates, and why this leads to better performance in some tasks.
- No discussion of the compute costs, such as walltime comparison with other methods.
Minor:
- I don't think figure 2 is necessary, could be moved to appendix to make space for qualitative analysis
Questions
See weaknesses
Limitations
yes.
Final Justification
Overall, I am fairly positive on this paper. The authors build on top of prior generative replay buffer work, by proposing to condition the generation on various task-relevant metrics instead of unconditional generation. Most of my concerns were addressed, such as concerns on hyperparameter tuning and seed variance.
What would have moved the needle more for me would have been more qualitative analysis, which the authors provided in text form in the rebuttal. If they had included more analysis in the initial submission, I would have increased my score. As such, I will maintain my rating of borderline accept.
Formatting Issues
none.
We sincerely appreciate your detailed and constructive feedback. Below are our responses to your concerns.
1. Discussion of the utility function with Policy entropy
We sincerely appreciate your insightful comment. Indeed, policy entropy is only one possible choice of utility function within the MPGE framework. In tasks with inherently multi-modal optimal policies, such as those observed in robotics, we approximate entropy using a Gaussian Mixture Model (GMM). Prior works (e.g., [1]) have demonstrated the feasibility of this approach in the context of diffusion-based policies.
More generally, since entropy merely quantifies stochasticity, it can be replaced by other metrics to capture utility, such as the distance between a given state and the empirical state distribution in the buffer, or other differentiable surrogate signals. The key requirement is that the utility function must be differentiable with respect to the state, so that it can guide generation. MPGE is fully compatible with a broad family of such utility functions, and we view entropy as one heuristic example rather than a universal choice.
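As a minimal sketch of the GMM-based entropy approximation mentioned above (the interface and the Monte-Carlo estimator are illustrative; the exact estimator used in our experiments may differ):

```python
import torch
from torch.distributions import Categorical, Independent, MixtureSameFamily, Normal

def gmm_entropy_estimate(logits, means, stds, n_samples=256):
    """Monte-Carlo estimate of the entropy of a diagonal-Gaussian mixture
    (mixture entropy has no closed form).

    logits: [B, K]    unnormalized mixture weights
    means:  [B, K, D] component means
    stds:   [B, K, D] component standard deviations
    """
    mix = Categorical(logits=logits)
    comp = Independent(Normal(means, stds), 1)
    gmm = MixtureSameFamily(mix, comp)
    samples = gmm.sample((n_samples,))           # [n_samples, B, D]
    return -gmm.log_prob(samples).mean(dim=0)    # [B] entropy estimates
```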
2. Discussion of the hyperparameters
We fully understand your concern regarding hyperparameter sensitivity. Below, we explain the meaning and selection strategy for each of the following components: the utility function, the guidance scale, and the trade-off parameter, and describe how each is chosen in practice.
- First, regarding the choice of utility functions: as mentioned in the paper, a wide variety of utility functions exist for reinforcement learning tasks. Our goal is to provide a general architecture that can accommodate and benefit from better-designed or more informative utility signals in the future. Depending on the specific task, different utility functions (e.g., Q-values or intrinsic curiosity as used in PGR) can be adopted to guide the generation of task-specific critical states. Our current choices, policy entropy and temporal-difference (TD) error, are meant to serve as illustrative examples: entropy naturally appears in the policy optimization objective, and TD error is central to value estimation. These choices are heuristic rather than exhaustive, and we encourage practitioners to adapt and extend this framework with other utility formulations suitable for their own downstream tasks.
- Second, regarding the guidance scale: in our ablations, a moderate scale performs best under standard training steps. However, with extended training (e.g., 3M steps), smaller scales also achieve strong performance, differing mainly in convergence speed, while larger scales tend to distort the generated data. For simple tasks a small scale is sufficient, whereas complex tasks may benefit from adaptive schemes such as annealing:

| Metric | Scale setting 1 | Scale setting 2 | Scale setting 3 | Scale setting 4 |
|---|---|---|---|---|
| TAR (3M) | 598.2±21.3 | 616.2±24.7 | 624±12.5 | 402.3±9.8 |

- Third, as for the trade-off parameter, we have discussed its setting in Appendix A. A large value can lead to a significant distributional shift, which reduces the effectiveness of the prioritized imitation mechanism (PIM). Our ablation studies also confirm this observation. Therefore, this parameter is actually not freely tunable; it must remain within a small range to be effective, similar in spirit to the clipping threshold in PPO.
3. Supplementary experiments with additional seeds
We fully understand your concern. To address it, we have re-run the experiments on the Gym tasks using five random seeds with the original setup. The updated results are presented below:
| Env | MPGE | PGR | DSAC |
|---|---|---|---|
| Walker2d-v3 | 7354±158 | 6621±237 | 6348±121 |
| Humanoid-v3 | 12091 ± 146 | 11395±379 | 10814±219 |
| Ant-v3 | 8294.6 ± 184 | 7735.2 ± 454 | 7121±176 |
| Halfcheetah-v3 | 18124 ± 395 | 17362± 128 | 17110 ± 14 |
| Swimmer-v3 | 144 ± 2 | 145±4 | 131 ± 6 |
The results show that our method consistently achieves state-of-the-art performance across all five seeds compared to the baseline algorithms. As further discussed in the supplementary experiments (Appendix C), we note that the baseline algorithm DSAC already achieves near-optimal performance on these Gym tasks, leaving limited room for improvement. The fact that both PGR and MPGE still manage to outperform DSAC highlights the importance of passive, utility-guided exploration, which contributes to further policy enhancement even in highly optimized settings.
4. Quality analysis of transitions generated by MPGE
MPGE generates imaginary transitions by first sampling critical states under utility guidance, and then assembling transitions via a learned world model. The generated transitions exhibit two important properties, novelty and dynamical feasibility, which together contribute to improved policy performance.
4.1 Novelty of generated states
MPGE generates states that are reachable but under-visited, conditioned on utility signals (e.g., TD error or entropy) that reflect the current policy’s learning needs. This design introduces the following advantages:
(i) State-level generation allows MPGE to focus on critical regions beyond the static buffer coverage, rather than simply perturbing stored transitions as done in PGR.
(ii) Dynamic utility evolution enables continual novelty as the utility function shifts with policy updates, steering generation toward states where the policy is uncertain or under-trained.
(iii) Policy-centric guidance means MPGE is not constrained by the historical behavior policy; it rebalances the frequency of critical states under the current policy distribution, leading to better task relevance.
Together, these properties result in generated transitions that are more diverse and better aligned with the evolving policy, which supports exploration and evaluation in otherwise poorly covered areas of the environment.
4.2 Dynamical feasibility of generated states
MPGE ensures that all transitions respect the environment dynamics by simulating them through a pretrained world model, following standard practice in model-based RL. The world model is trained on diverse data (e.g., from a random policy), covering broad regions of the state-action space and avoiding overfitting to narrow behavioral trajectories. In contrast, methods like PGR directly learn a joint distribution over transition tuples from the buffer. This paradigm does not enforce the structural constraints of the real dynamics (i.e., the mapping from the state-action pair to the next state). As a result, PGR may capture spurious statistical correlations rather than true functional dependencies, potentially modeling unstructured factorizations. Such dynamics-inconsistent samples may degrade value estimates and hinder policy improvement.
4.3 Summary
In summary, MPGE generates imaginary transitions that are:
- Novel: they target critical, under-visited states tailored to the current policy;
- Valid: they are assembled via a learned model trained to respect the environment dynamics.
These properties enable MPGE to produce high-quality synthetic data that complements the replay buffer and improves learning efficiency. Our additional experiments further confirm that transitions generated by MPGE are more dynamically consistent and task-relevant, contributing to its superior performance over PGR in several benchmark tasks. For further quantitative analysis, please refer to our rebuttal to reviewer W4cH, which we omit here due to space limitations.
5. Compute costs in MPGE
Thank you for the suggestion. We have added the relevant comparison. The wall-time results (in seconds) are included below:
| Env | MPGE | PGR | SER | DSAC |
|---|---|---|---|---|
| Humanoid-run | 151657±395 | 207813±1464 | 210321±788 | 128153±122 |
The results show that MPGE incurs a slightly higher wall time compared to the base algorithm DSAC. Notably, to reduce computational overhead, we perform MPGE training only once every 10 environment interaction steps (please refer to the codebase), which significantly controls the overall compute cost.
Moreover, unlike PGR, which requires retraining the generative model on the buffer until convergence before use, MPGE continuously updates its generator during training. This makes MPGE’s diffusion-based training more efficient and lightweight in practice.
Minor
We sincerely appreciate and acknowledge your valuable suggestion. We will carefully consider this point and incorporate the corresponding clarification or discussion in the revised version.
Conclusion
We sincerely appreciate your thoughtful and constructive feedback. We hope our responses have effectively addressed your concerns and respectfully ask that you consider a more favorable evaluation.
[1] Wang Y, Wang L, Jiang Y, et al. Diffusion actor-critic with entropy regulator[J]. Advances in Neural Information Processing Systems, 2024, 37: 54183-54204.
Thanks for the response, it clears up some of my concerns.
One concern I still have is:
qualitative analysis of what imaginary states MPGE generates, and why this leads to better performance in some tasks.
It would have been nice to see some visualizations or some additional analysis on what MPGE is doing, like seeing what imaginary states it generates for a task, to build more intuition on why MPGE is working. If possible, could the authors perform some analysis for some tasks?
I know it may be hard to report the results of this because of the restrictions on image / video / external links of the rebuttal, but this is still my most major concern.
We completely understand your concerns regarding what kind of transitions MPGE generates. In response, we have added two additional experiments and analyses to explore the nature of the generated data in a specific task, Humanoid-run of DMC. Due to this year’s restrictions on visualizations, we regret that we are unable to provide the corresponding images (as they are accessible to AC as well). Nevertheless, we will do our best to address your questions through detailed descriptions. Thank you again for your understanding and support.
1. Data Distribution Analysis
MPGE improves learning by covering a broader region of the state-action space, which enables the retrieval of data more relevant to the task and thus facilitates more effective exploration and targeted training. Similar to the analysis presented in Figure 2 of PGR, we project the generations of both MPGE and PGR onto the same t-SNE plot and find that:
A: At step 10k, the data distributions generated by MPGE and PGR largely overlap, with no significant distinction between them.
B: Around the performance inflection point (approximately iteration 60k), MPGE begins to exhibit a wider coverage of the space than PGR, reaching into under-explored regions.
C: By the end of training (1.45M steps), MPGE maintains broader coverage of the state-action space, with its distribution more dispersed than that of PGR. This indicates MPGE's ability to continuously generate more diverse and potentially high-value critical states.
2. The nature of the generated transitions
The agent's full ground-truth state is maintained internally by the simulator's physics instance, while the generated and observed obs vectors provide only partial access to this information. Therefore, we can only infer the agent's actual state from this partial observation. In the Humanoid-Run task, each obs vector contains 67 dimensions, including:
| Feature Name | Dim | Description |
|---|---|---|
| joint_angles | 21 | All joint angles |
| head_height | 1 | Z-coordinate of the head |
| extremities | 12 | Positions of limb extremities relative to torso |
| torso_vertical | 3 | Projection of torso orientation onto vertical |
| com_velocity | 3 | Velocity of the center of mass |
| velocity | 27 | Torso linear/angular velocity + joint velocities |
We also perform a per-dimension analysis of the states generated by MPGE at different time steps. Since we cannot infer the full joint-level kinematics, we focus on the center-of-mass velocity and torso height as proxies to determine whether the agent is falling or continuing to run.
Specifically, in the early stages, MPGE tends to produce states near critical failure boundaries, for example just before the humanoid falls. These obs often exhibit consistent or patterned values in certain dimensions (e.g., head_height of the current state > 0.7m and head_height of the next state < 0.7m), crossings that typically precede a fall. Such samples are scarce in the original dataset: since the episode length in DMC is fixed, even if the agent falls early in an episode, it will continue struggling on the ground until the maximum number of steps is reached, which produces a large amount of data that does not help the agent learn how to maintain balance. The agent therefore lacks sufficient knowledge to distinguish actions that would lead to recovery from those that lead to failure, and the Q-value estimation error in these regions is particularly large. MPGE helps fill this gap by supplying rare but highly informative samples, which are far more valuable than the overrepresented low-mobility "crawling" states.
In contrast, during later stages of training (usually after 0.6M steps), MPGE shifts toward generating states with higher forward velocity (com_velocity of the state > 10m/s), indicating a focus on balance and control under high-speed locomotion. These states are crucial for pushing the performance ceiling of the policy. In summary, MPGE adapts its generative focus to the evolving needs of the policy, consistently producing states that are critical for driving learning progress across training phases. A sketch of this per-dimension tagging is given below.
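As an illustration of this per-dimension inspection, the sketch below tags generated transitions using the two proxies discussed above; the dimension indices follow the ordering of the observation table and the thresholds are the values quoted in this thread, both of which are assumptions about the exact observation packing:

```python
import numpy as np

# Assumed index layout, following the table above: joint_angles 0-20,
# head_height 21, extremities 22-33, torso_vertical 34-36, com_velocity 37-39.
HEAD_HEIGHT_IDX = 21
COM_VEL_SLICE = slice(37, 40)

def tag_generated_states(s, s_next, fall_height=1.05, run_speed=10.0):
    """Label generated transitions [N, 67] as near-fall or high-speed."""
    near_fall = ((s[:, HEAD_HEIGHT_IDX] > fall_height)
                 & (s_next[:, HEAD_HEIGHT_IDX] < fall_height))
    high_speed = np.linalg.norm(s[:, COM_VEL_SLICE], axis=-1) > run_speed
    return near_fall, high_speed
```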
We sincerely apologize once again for our inability to provide direct visualizations due to submission restrictions. Nonetheless, we hope that the above analyses and descriptions help clarify how MPGE operates. We also believe that these interesting findings deserve to be showcased in the final version of our paper. Therefore, we kindly ask for your reconsideration for a more favorable score. If you have any further questions or concerns, please feel free to reach out — we will do our best to address them. Thank you again for your time and thoughtful review!
To facilitate your understanding, we provide the following additional clarifications:
- The properties of these dimensions were observed empirically through extensive sampling during our experiments.
- To help you better understand the context, we include the official reward function of the Humanoid-Run task. This demonstrates that these two observed dimensions directly influence the reward calculation, which in turn affects the learning and convergence of the value function.
- Clarification of the threshold: head_height of the current state > 1.05m and head_height of the next state < 1.05m.
```python
def get_reward(self, physics):
  """Returns a reward to the agent."""
  standing = rewards.tolerance(physics.head_height(),
                               bounds=(_STAND_HEIGHT, float('inf')),
                               margin=_STAND_HEIGHT/4)
  upright = rewards.tolerance(physics.torso_upright(),
                              bounds=(0.9, float('inf')), sigmoid='linear',
                              margin=1.9, value_at_margin=0)
  stand_reward = standing * upright
  small_control = rewards.tolerance(physics.control(), margin=1,
                                    value_at_margin=0,
                                    sigmoid='quadratic').mean()
  small_control = (4 + small_control) / 5
  if self._move_speed == 0:
    horizontal_velocity = physics.center_of_mass_velocity()[[0, 1]]
    dont_move = rewards.tolerance(horizontal_velocity, margin=2).mean()
    return small_control * stand_reward * dont_move
  else:
    com_velocity = np.linalg.norm(physics.center_of_mass_velocity()[[0, 1]])
    move = rewards.tolerance(com_velocity,
                             bounds=(self._move_speed, float('inf')),
                             margin=self._move_speed, value_at_margin=0,
                             sigmoid='linear')
    move = (5*move + 1) / 6
    return small_control * stand_reward * move
```
I thank the authors for addressing my concerns, and have submitted the final recommendation. I think they should incorporate all the improvements suggested from myself and other reviewers into the next version of the paper.
We're glad the clarifications were helpful. We genuinely hope to contribute MPGE to the RL community, and we are working to prepare a well-polished revision, including open-sourcing and the improvements suggested by all reviewers, to support and encourage future research. We sincerely hope that your recommendation will help our work be seen and appreciated by a wider audience in the community!
This paper aims to improve exploration in off-policy RL. The authors propose Model-based Prioritized Generative Exploration (MPGE). The method consists of two parts: a diffusion model that generates critical and under-explored states, guided by metrics such as policy entropy, and a world model that learns the transition dynamics. While the idea is related to existing methods such as Prioritized Generative Replay (PGR), subtle but critical differences make the proposed method outperform them in certain environments.
In the reviewer-author discussion phase, the authors provided additional evidence on novelty as well as qualitative analysis, which makes the proposed method more convincing. All reviewers' concerns are well addressed. Overall, the problem studied is important, and the proposed idea is interesting, general, and could definitely benefit the community.
The authors have to incorporate the new experiments in the discussion phase into the final version of the paper.