Improving Model-Based Reinforcement Learning by Converging to Flatter Minima
Encouraging model in model-based reinforcement learning to converge to flatter minima in the loss landscape will result in better downstream policies
摘要
评审与讨论
This paper considers the setting of model-based reinforcement learning, and argues that it is important that the environment model converges to a relatively flat minima. They support this by proving two theoretical results showing that the first-order sharpness of the model loss appears additively in upper bounds on both the value function estimation error of the approximate model (), as well as the performance gap between optimal policies learnt in the true environment and approximate model. Following this, they perform experiments where they use sharpness-aware minimization (SAM) on the model loss of TD-MPC2, and evaluate this across 11 tasks on the Humanoid-Bench suite.
优缺点分析
Strengths
- The paper is very clearly written.
- The empirical gains are significant, improving in almost all environments considered and notably increasing the average performance.
- The approach itself is straightforward and can be adapted by practitioners with minimal overhead.
- Theorems 1 and 2 nicely complement the empirical approach.
- I like the overall idea and work, and I think it nicely fits in to the narrative of "RL practitioners caring more about deep learning" that has evolved over recent years.
Weaknesses
- The experimental setup is rather limited: they only build on one baseline (TD-MPC2), and only evaluate on one suite of environments, and only use four seeds. A wider range of these would convince the reader that these results generalize.
- The largest weakness, in my opinion, is the relative lack of novelty: a pessimistic reader may think that the paper is simply applying SAM to model learning and reaping its benefits, which may be expected as SAM is well-known to improve generalization in supervised learning. I appreciate that the authors provided theoretical results (Theorems 1 and 2) in order to combat this, however a similar concern remains for me. Model-based RL as in TD-MPC2 can be seen as a joint optimization of the model loss, value loss, and policy loss. SAM is known to improve performance in supervised learning, why not simply apply it to all losses used? If the argument is that the model loss is most susceptible/affected by sharpness, I would be more convinced by experiments comparing these. If on the other hand SAM helps all losses independently and this work is focused on the model loss, I think a wider empirical evaluation would
Minor nits:
- The LaTeX
thm-restatepackage is very useful for restating theorems in the Appendix cleanly (Theorem 1 would stay Theorem 1 instead of becoming 3). - I believe that the which appears in equations (1), (2), and (3) should instead be a as I don't believe the max is guaranteed to exist.
问题
- Is the model loss particularly sensitive/susceptible to sharp minima? If you instead did standard SGD/ADAM on the model loss and SAM on the value/policy loss, are the same gains seen?
- Are there other baseline algorithms/environments which may be worth evaluating on (and are sufficiently diverse from TD-MPC2 / HumanoidBench)?
- How sensitive is the method to the neighbourhood radius ?
局限性
Yes
最终评判理由
Based on their rebuttal to my concerns and all other reviews, I believe the paper has been significantly strengthened and I increase my score.
格式问题
None
We sincerely thank you for your time in reviewing our work and your appreciation for it. We address your concerns as follows.
The experimental setup is rather limited.... A wider range of these would convince the reader that these results generalize..... TD-MPC2 / HumanoidBench?
We thank the reviewer for their constructive feedback. They raised a valid concern regarding the initial experimental scope, which prompted us to conduct a significant set of additional experiments to verify the generalizability of our claims. We integrated and reproduced SAM, with TWISTER [1], a strong, transformer-based MBRL agent. We evaluated it on the Atari100k benchmark, a diverse suite of image-based discrete control environments. As shown in Tab. 1, augmenting TWISTER with SAM leads to performance gains on the non-trivial games presented, often by a significant margin. These results showcase that SAM’s benefits extend to transformer-based world models with pixel-based inputs and a different MBRL algorithm.
| Game | Random | Human | TWISTER | TWISTER w/ SAM |
|---|---|---|---|---|
| Alien | 228 | 7128 | 823 | 947 |
| Amidar | 6 | 1720 | 172 | 172 |
| Assault | 222 | 742 | 777 | 1102 |
| Asterix | 210 | 8503 | 1132 | 1030 |
| Bank Heist | 14 | 753 | 673 | 886 |
| Battle Zone | 2360 | 37188 | 5452 | 10230 |
| Boxing | 0 | 12 | 80 | 86 |
| Demon Attack | 152 | 1971 | 286 | 293 |
| Freeway | 0 | 30 | 0 | 0 |
| Frostbite | 65 | 4335 | 388 | 1610 |
| Gopher | 258 | 2412 | 2078 | 4099 |
| Hero | 1027 | 30826 | 9836 | 12320 |
| James Bond | 29 | 303 | 354 | 426 |
| Kangaroo | 52 | 3035 | 1349 | 1555 |
| Ms Pacman | 307 | 6952 | 2319 | 2409 |
| Pong | -21 | 15 | 20 | 20 |
| Road Runner | 12 | 7845 | 9811 | 15532 |
| Seaquest | 68 | 42055 | 434 | 426 |
| Up N Down | 533 | 11693 | 4761 | 6857 |
Tab. 1: Performance comparison across 20 Atari100k games. TWISTER w/ SAM consistently improves performance in challenging environments. seeds, for all environments. The better performance is shown in bold
We expanded our evaluation on challenging continuous control tasks by adding more high-dof environments from DMC suite. In Tab. 2, we show that TD-MPC2 with SAM consistently outperforms the original baseline. The improvements are notable with lower variance.
| Method | humanoid_run | humanoid_walk | dog_run | dog_walk | dog-trot |
|---|---|---|---|---|---|
| TD-MPC2 | 301 11 | 883 13 | 428 39 | 887 46 | 891 21 |
| TD-MPC2 w/ SAM | 484 19 | 901 4 | 552 17 | 957 6 | 920 14 |
Tab. 2: Performance comparison on high-DoF DMC environments at 2M environment steps. seeds, .
We evaluated each atari100k env. on seeds as is common practice. The DMC experiments were run on four seeds, while the original TD-MPC2 was run on 3 seeds. To ensure and prove statistical significance we also provide one-tailed t-tests on the HumanoidBench. We observed the overall t-stat and the p-value are which shows a statistical significance of our results.
The LaTeX thm-restate package is.... becoming 3.
We thank the reviewer for this helpful suggestion. We will use the thm-restate package to ensure consistent theorem numbering.
I believe that the \max which.... to exist.
We respectfully clarify that the use of max is mathematically justified in this context. The maximization in Eq. (1), (2), and (3) is performed over the set of perturbations satisfying . This defines a closed and bounded ball in the parameter space, which is a compact set. By the Extreme Value Theorem, a continuous function (in this case, the loss function) is guaranteed to attain its maximum value on a compact set. Our usage is also consistent with the original SAM paper and subsequent literature. To ensure this is clear to all readers, we would be happy to add a brief footnote clarifying this point.
The largest weakness, in my opinion, is the relative lack of novelty: a pessimistic reader..... Is the model loss particularly sensitive/susceptible to sharp minima?.... are the same gains seen?
We hypothesize that the world model's loss landscape is indeed particularly susceptible to sharp minima in the context of MBRL. A model that overfits to a sharp minimum may memorize specific transitions from the replay buffer but fail to generalize to slightly novel states encountered during planning. By seeking a flat minimum using SAM, we encourage the world model to be more robust to minor weight perturbations, leading to more reliable long-horizon planning and, consequently, better final performance.
To test this hypothesis and answer the reviewer's question directly, we performed a rigorous ablation study where SAM is applied within different components of TD-MPC2. We tested three configurations: (i) SAM on the transition mode (ours), (ii) SAM on the policy optimization, and (iii) SAM on the reward and value predictors. The results, summarized in Tab. 5, are illuminating.
| Method | Episode Reward |
|---|---|
| TD-MPC2 | 301 11 |
| w/ SAM on Transition Dynamics (ours) | 484 19 |
| w/ SAM on Reward and Value Prediction | 10 3 |
| w/ SAM on Policy Learning | 463 49 |
Tab. 5: Applying SAM to different components TD-MPC2 for the humanoid_run environment. seeds.
Applying SAM to the transition model, as proposed in our paper, yields the largest and most stable performance gain. This strongly supports our core hypothesis that a robust world model is the most critical component for improving downstream task performance.
Applying SAM to the policy learning step also provides a significant boost. This aligns with findings in prior work [6], which suggests that policy optimization also benefits from flatter policy loss landscape.
Crucially, applying SAM to the reward and value predictors led to training instability and catastrophic failure, with the agent unable to learn a meaningful policy. We theorize that forcing the value function into a flat region may prevent it from accurately representing sharp but important value changes (e.g., near cliffs or goal states), leading to misguided policy updates.
To provide direct evidence for the underlying mechanism—that our method indeed finds flatter minima—we analyzed the Hessian of the model loss. We observe that the model trained using SAM converges to a flatter minima as shown in Tab. 6.
| Eigenvalue | TD-MPC2 w/ SAM | TD-MPC2 |
|---|---|---|
| 99.5 | 141.6 | |
| 82.3 | 92.1 | |
| 46.5 | 80.3 | |
| 40.1 | 69.5 | |
| 31.6 | 53.8 |
Tab. 6: Eigenvalue distribution upon applying SAM to TD-MPC2 for the humanoid_run env. @ 2M env. steps. Smaller the flatter the minima.
How sensitive is the method to the neighbourhood radius ?
We have conducted a thorough sensitivity analysis, which we will include in the final paper. We performed detailed ablation studies on ρ for both the DMC suite (humanoid_run) and the Atari100k benchmark (Bank Heist). The results are presented in Tab. 3, 4.
| Episode Reward | |
|---|---|
| 348 28 | |
| 445 56 | |
| 492 34 | |
| 454 17 | |
| 439 21 | |
| 414 21 |
Tab. 3: Ablation on rho for DMC humanoid_run for 2M env. steps. seeds.
| Episode Reward | |
|---|---|
| 281 277 | |
| 714 311 | |
| 886 445 | |
| 785 277 | |
| 781 213 |
Tab. 4: Ablation on rho for atari100k env. Bank Heist for 100k env. steps. seeds.
The performance is robust within an order of magnitude of the optimal value, avoiding prohibitive fine-tuning. Our paper provides the first evidence of sharpness-aware optimization for MBRL world models. A promising next step, which builds on our contribution, would be to integrate recent adaptive methods like ASAM[3], CR-SAM[4], and GAM[5] to further enhance performance and reduce this sensitivity.
We are more than happy to address any remaining concerns, if you feel we have addressed your concerns to your satisfaction, please do consider increasing our scores.
References
[1] Maxime Burchi and Radu Timofte. "Learning Transformer-based World Models with Contrastive Predictive Coding", ICLR 2025
[2] Yuval Tassa et al. "DeepMind Control Suite", 2018
[3] Kwon, Jungmin et al. "ASAM: Adaptive Sharpness-Aware Minimization for Scale-Invariant Learning of Deep Neural Networks", ICML 2021
[4] Wu, T. et al. CR-SAM: Curvature Regularized Sharpness-Aware Minimization. AAAI 2024
[5] Zhang, Xingxuan et al, "Gradient Norm Aware Minimization Seeks First-Order Flatness and Improves Generalization", CVPR 2023
[6] Hyun Kyu Lee and Sung Whan Yoon, "Flat Reward in Policy Parameter Space Implies Robust Reinforcement Learning", ICLR 2025 Oral
Your review of our work has helped us improve the draft. We have put in a great amount of effort address your concerns. We kindly request you to acknowledge/respond to our rebuttal.
I thank the authors for their strong rebuttal and for performing the experimental suite provided. I am quite happy with their ablation on applying SAM to different components of TD-MPC2, and I hope that in the final version of their paper they add a discussion regarding this as I believe it will be an important takeaway for readers. I will improve my score.
Dear Reviewer
We appreciate the time you invested in reviewing our submission and your appreciation of our work. Your insights directly shaped our revisions and, we believe, enhanced the paper’s value to the NeurIPS community.
Kind regards, Authors, Sub. #8454
The paper applies Sharpness-Aware Minimization (SAM) to the dynamics model learning within the TD-MPC2 framework, arguing that flatter minima yields models that generalize better under environmental noise and dynamics shifts. Theoretically, the paper uses a PAC-Bayesian analysis to show that flat minima with reduced sharpness tightens upper bounds on the model’s value-prediction error and the performance gap between policies evaluated in the true versus learned dynamics. Empirically, this SAM-augmented approach leads to an 89.1% improvement on HumanoidBench Locomotion Suite benchmark over vanilla TD-MPC2.
优缺点分析
Strengths:
- It is the first to apply Sharpness-Aware Minimization to MBRL setting for learning dynamics model
- It presents a PAC-Bayes analysis, that links first-order sharpness to the value gap and policy sub-optimality of the learned model.
- Empirically it demonstrates that it outperforms the baseline TD-MPC2 method on most tasks in HumanoidBench Locomotion Suite.
- It provides a drop-in replacement for the standard model-learning component.
Weaknesses:
- Several claims are not supported by experiments:
- The first contribution listed in the Introduction claims to have shown that standard model learning in MBRL consistently converges to the sharp minima, however, I did not find where this was shown in the paper.
- The empirical results jump straight to comparing SAM-augmented TD-MPC2 with the baseline in terms of the average return, but there’s no measurement of sharpness. Hence it is unclear whether the proposed method or the baseline TD-MPC2 finds a sharp minimum or a flat one.
- The claim that SAM or flat minima mitigate rollout compounding error is not tested: the rollout error was not plotted or verified.
Limited evaluation:
- Short planning horizon: I assume that the proposed method uses a model horizon of 3 steps, following the hyper-parameter in TD-MPC2. This is a relatively short planning horizon. If SAM truly mitigates compounding error, we would expect that the performance remain strong under longer horizons; this is left untested.
- No radius ablation: it is described to be a sensitive parameter, yet no sensitivity analysis is provided.
- Single benchmark: The baseline, TD-MPC2, was originally tested in a variety of different tasks whereas the proposed method is evaluated only on HumanoidBench. Results on additional tasks would strengthen the empirical case.
Unrealistic theoretical assumption:
The theoretical analysis in Section 4 assumes i.i.d. samples, which amounts to a direct adaptation of existing SAM results from supervised learning. But RL data are temporally correlated. So the practical relevance of the analysis is limited.
Missing Details
- The proposed method seems to incur 2x gradient computation. However, computational overhead is not discussed.
Typos
- "minimas" in the title should be "minima"
- L302: duplicative "sit_simple"
问题
- Please refer to the comments in the weakness section for the list of items to address.
- Also, could the authors clarify how the reported 89.1% improvement was calculated?
局限性
The limitations are only alluded to instead of explicitly detailed. A dedicated paragraph about limitations would clarify the constraints.
最终评判理由
The authors have provided extensive results during the rebuttal, addressing most of my concerns, except for the one regarding the short planning horizon.
格式问题
NA
We sincerely thank reviewer YGnF for your detailed review and your efforts to serve our community. We have put a thorough effort to address your concerns as detailed below.
The first contribution listed in the Introduction claims to have shown that standard model learning in MBRL consistently converges to the sharp minima ...... Hence it is unclear whether the proposed method or the baseline TD-MPC2 finds a sharp minimum or a flat one.
We thank the reviewer for pointing this out. We have computed the eigenvalue spectrogram of the hessian for TD-MPC2 trained using the base optimizer and SAM on the training buffer. Larger the eigenvalues of the spectrogram, sharper curvature of the minima the model has converged to. We observe that the model trained using SAM converges to a flatter minima as shown in Tab. 6.
| Eigenvalue | TD-MPC2 w/ SAM | TD-MPC2 |
|---|---|---|
| 99.5 | 141.6 | |
| 82.3 | 92.1 | |
| 46.5 | 80.3 | |
| 40.1 | 69.5 | |
| 31.6 | 53.8 |
Tab. 6: Eigenvalue distribution upon applying SAM to TD-MPC2 for the humanoid_run env. @ 2M env. steps.
The claim that SAM or flat minima mitigate rollout compounding error is not tested: the rollout error was not plotted or verified.
We have calculated the model estimate of returns vs Monte-Carlo estimate on TD-MPC2 on DMC dog_run, dog_trot env. We noticed that the model trained using SAM's predicted returns is closer to the actual return experimentally supporting Thm. 1.
| Metric | TD-MPC2 | TD-MPC2 w/ SAM |
|---|---|---|
| dog_trot | 84.3 | 50.1 |
| dog_run | 21.58 | 13.51 |
Tab. 7: Comparison of value prediction error on the DMC environment.
it is described to be a sensitive parameter, yet no sensitivity analysis is provided.
We thank the reviewer for this important question. To address this, we have conducted a thorough sensitivity analysis for the neighborhood radius, (), presented in Tab. 3 & 4.
We agree that this sensitivity is a known characteristic of the original SAM algorithm. We see our work as providing the foundational evidence that sharpness-aware optimization is highly effective for MBRL world models. An exciting avenue for future work, which builds directly on our contribution, would be to integrate more recent adaptive methods (e.g., ASAM [2], CR-SAM [3], GAM[4]) that are designed to reduce this hyperparameter sensitivity.
| Episode Reward | |
|---|---|
| 348 28 | |
| 445 56 | |
| 492 34 | |
| 454 17 | |
| 439 21 | |
| 414 21 |
Tab. 3: Ablation on rho for DMC humanoid_run for 2M env. steps. seeds.
| Episode Reward | |
|---|---|
| 281 277 | |
| 714 311 | |
| 886 445 | |
| 785 277 | |
| 781 213 |
Tab. 4: Ablation on rho for atari100k env. Bank Heist for 100k env. steps. seeds.
Also, could the authors clarify how the reported 89.1% improvement was calculated?
The reported improvement percentage is the average of performance improvement across the tested HumanoidBench environments. Improvement percentage = 100 . We provide both the average improvement percentage and the average scores of TD-MPC2 w/ SAM and the baseline.
The baseline, TD-MPC2, was originally tested in a variety of different tasks .... Results on additional tasks would strengthen the empirical case.
We thank the reviewer for this crucial feedback. We agree that demonstrating our method’s generalizability is vital and, prompted by their suggestion, have conducted extensive new experiments that substantially strengthen our paper.
-
Generalization to New Algorithms, State-space model architecture and Image-Based Inputs: To show our method works beyond a single algorithm and state-based control, we reproduced and integrated it with TWISTER[1] (based on Dreamer) is a strong, transformer-based agent which takes in image based inputs. We evaluated this on 20 challenging Atari100k games which is a discrete control based setting. The results in Tab. 1 show that our approach yields significant improvements across the majority of these pixel-based environments.
-
Generalization to harder Environments: We also expanded our evaluation to harder tasks in the DeepMind Control Suite. As shown in Tab. 2, our method again achieves statistically significant gains over the TD-MPC2 baseline on widely recognized difficult benchmarks like humanoid_run and dog_run.
| Game | Random | Human | TWISTER | TWISTER w/ SAM |
|---|---|---|---|---|
| Alien | 228 | 7128 | 823 | 947 |
| Amidar | 6 | 1720 | 172 | 172 |
| Assault | 222 | 742 | 777 | 1102 |
| Asterix | 210 | 8503 | 1132 | 1030 |
| Bank Heist | 14 | 753 | 673 | 886 |
| Battle Zone | 2360 | 37188 | 5452 | 10230 |
| Boxing | 0 | 12 | 80 | 86 |
| Demon Attack | 152 | 1971 | 286 | 293 |
| Freeway | 0 | 30 | 0 | 0 |
| Frostbite | 65 | 4335 | 388 | 1610 |
| Gopher | 258 | 2412 | 2078 | 4099 |
| Hero | 1027 | 30826 | 9836 | 12320 |
| James Bond | 29 | 303 | 354 | 426 |
| Kangaroo | 52 | 3035 | 1349 | 1555 |
| Ms Pacman | 307 | 6952 | 2319 | 2409 |
| Pong | -21 | 15 | 20 | 20 |
| Road Runner | 12 | 7845 | 9811 | 15532 |
| Seaquest | 68 | 42055 | 434 | 426 |
| Up N Down | 533 | 11693 | 4761 | 6857 |
Tab. 1: Performance comparison across 20 Atari100k games. TWISTER w/ SAM consistently improves performance in challenging environments. seeds, for all environments. The better performance is shown in bold
| Method | humanoid_run | humanoid_walk | dog_run | dog_walk | dog-trot |
|---|---|---|---|---|---|
| TD-MPC2 | 301 11 | 883 13 | 428 39 | 887 46 | 891 21 |
| TD-MPC2 w/ SAM | 484 19 | 901 4 | 552 17 | 957 6 | 920 14 |
Tab. 2: Performance comparison on high-DoF DMC environments at 2M environment steps. seeds, .
The theoretical analysis in Section 4 assumes i.i.d. samples .... So the practical relevance of the analysis is limited.
There is a slight nuance to out assumption of i.i.d which we missed to express in detail in our original draft. Our analysis does not require successive time-steps to be independent; it only assumes that each mini-batch sampled from the replay buffer is approximately independent of the next. In practice, uniform-with-replacement sampling from a sufficiently large buffer—standard in modern MBRL—satisfies this weak-dependence condition, much as computer-vision work treats whole images as i.i.d. despite strong pixel-level correlations. We will add a brief clarification and citation in Sec. 4.1 to make this nuance explicit. We appreciate the opportunity to strengthen the manuscript.
The proposed method seems to incur 2x gradient computation. However, computational overhead is not discussed.
Our primary goal in this work was to provide the first strong empirical and theoretical evidence that targeting flatter minima in the world model's loss landscape yields significant performance gains while not increasing sample complexity.. To establish this principle clearly, we used the original, most established SAM algorithm. Vanilla SAM, in our implementation incurs a total 1.7X cost of standard training
We agree that computational overhead is a key consideration. This trade-off is well-understood, and a vibrant line of recent work e.g. SAF [5] has developed highly efficient variants that deliver similar benefits with negligible extra cost. Our paper provides the foundational proof-of-concept, paving the way for these more efficient techniques to be integrated into future MBRL agents.
minimas" in the ... L302: duplicative "sit_simple"
We sincerely apologise for this minor error and have rectified this in the updated draft.
References
[1] Maxime Burchi and Radu Timofte. "Learning Transformer-based World Models with Contrastive Predictive Coding", ICLR 2025
[2] Kwon, Jungmin et al. "ASAM: Adaptive Sharpness-Aware Minimization for Scale-Invariant Learning of Deep Neural Networks", ICML 2021
[3] Wu, T., Luo, T., & Wunsch II, D. C. (2024). CR-SAM: Curvature Regularized Sharpness-Aware Minimization. Proceedings of the AAAI Conference on Artificial Intelligence, 38(6), 6144-6152.
[4] Zhang, Xingxuan et al, "Gradient Norm Aware Minimization Seeks First-Order Flatness and Improves Generalization", CVPR 2023
[5] Du, Jiawei et al, "Sharpness-aware training for free", NeurIPS 2022
[6] Mohri & Rostamizadeh. "Stability Bounds for Stationary φ-mixing and β-mixing Processes." NeurIPS 2008.
[7] Riemer et al. "Balancing Context Length and Mixing Times for Reinforcement Learning at Scale." NeurIPS 2024.
Your review of our work has helped us improve the draft. We have put in a great amount of effort address your concerns. We kindly request you to acknowledge/respond to our rebuttal.
Dear Reviewer YGnF,
Thank you again for your detailed feedback and your service to the community. In response to your review we have added:
- Direct sharpness measurements
- Plotted and quantified model error
- Supplied a full ρ-sensitivity study
- Compute overhead
- Justification for applicability of theoretical assumption
- Extended the evaluation to Atari-100k and additional high-DoF DMC tasks
- Showed the efficacy of SAM on other architectures and other MBRL algorithms.
We hope these revisions resolve your concerns. If anything remains unclear, we would be happy to elaborate. If we have addressed your concerns, we kindly request you to update your scores. It was a great pleasure reading your review and acting on its insights.
Authors, Sub. #8454
I thank the authors for the clarifications and the additional experiments.
A minor point: while I appreciate the statistical tests, their significance is limited due to the small number of seeds (4 or 5). That said, I understand the computational constraints, and since this setting is consistent with the baselines, it does not affect my rating.
I am not fully convinced by the compounding error results in Table 7, as they are limited to two environments and do not directly address my concern regarding the short planning horizon.
Nonetheless, most of my concerns have been addressed, and I am happy to increase my score accordingly.
Respected Reviewer We sincerely thank you for your effort and time in reviewing our work and raising important concerns. The process of addressing them resulted in an improvement of our manuscript which we shall incorporate in the final draft. Your service to the community is much appreciated. With warm regards Authors Sub. #8454
Sharpness-Aware Minimization (SAM) seeks to improve model generalization by optimizing a model towards flatter minima of the loss landscape. SAM has previously been used in supervised learning and reinforcement learning (RL). This paper proposes applying SAM to train a more robust world model for model-based RL. The paper presents theoretical results showing that using SAM will bound the value estimation error and suboptimality of the learned policy, i.e., a robust world model will make it more likely to find a good policy. Experiments on HumanoidBench show that the use of SAM does improve performance across a wide range of tasks.
优缺点分析
Strengths:
- There is clear motivation and novelty for introducing SAM into model-based RL. I quite enjoyed reading the introduction because of its clarity. I also appreciated that the paper made it possible to understand SAM without having previously studied it.
- The theoretical results help provide good justification for why SAM can help in model-based RL.
Weaknesses:
- The experimental analysis is limited to a comparison of the final performance of each algorithm. Since the paper focuses on training a more robust world model, I would have hoped to see an analysis of the world models trained with and without SAM. For example, is the model trained with SAM actually more robust? How can we measure that directly, instead of depending on the final performance measures? Since the experiments only consider one algorithm, I think there is space for this type of in-depth empirical analysis of the algorithm components.
- The evaluation was quite comprehensive in terms of the number of domains
used, and multiple trials were run, but the results are not presented well.
- For instance, quantitative results must be provided in a table; it is not sufficient to say "Estimate from plot" (line 289) -- this is essentially "eyeballing the results"
- In a similar vein, there was no statistical testing to support the claims made in the results. Pairwise comparisons such as t-tests must be used to show that differences in performance are statistically significant; it is not sufficient to compare mean scores. In particular, line 296 says that using SAM improved performance on 10 of 11 domains, but some of the domains look quite close and may not have statistically significant differences (Stair, Sit Simple).
问题
My primary concerns are with the experimental setup as described in the Strengths and Weaknesses above. I am happy to raise my Quality score and potentially my overall score if those concerns can be addressed.
- What does the "Avg. Performance" plot refer to in Figure 2? Is it averaged across all environments? Does there need to be any normalization since not all environments seem to have the same reward scale?
- Is SAM likely to work with other MBRL algorithms besides TD-MPC2?
Some minor comments:
- Line 37: There should be a description of what Shaprness-Aware Minimization is before mentioning it.
- Please carefully check references. For example, references 15 and 16 are repeated.
局限性
Limitations are discussed in Sections 5 and 6. There is no discussion of potential negative societal impacts; I encourage including a brief discussion in the conclusion.
最终评判理由
The authors addressed my concerns in the rebuttal, and I think the rigor of their final paper will be increased substantially. Thus I am happy to recommend acceptance.
格式问题
N/A
We sincerely thank reviewer Xi16 for their positive feedback. We added robustness metrics, component ablations, broader benchmarks (20 Atari, 5 DMC) and statistical tests, fully addressing every point as follows:
The experimental analysis is limited to a comparison of the final performance.... I think there is space for this type of in-depth empirical analysis of the algorithm components.
This is an excellent point and to address this we examine the value prediction error of TD-MPC2 and TD-MPC2 w/ SAM on DMC dog_run. We compared the estimated returns of the model by computing the expectation of w.r.t a given policy by and the true discounted returns . We notice that the error in value estimation is much lower when SAM is applied as compared to the base optimizer Tab. 7
| Metric | TD-MPC2 | TD-MPC2 w/ SAM |
|---|---|---|
| dog_trot | 84.3 | 50.1 |
| dog_run | 21.58 | 13.51 |
Tab. 7: Comparison of value prediction error on the DMC environment.
To further investigate the efficacy of SAM to different components of TD-MPC2, Tab. 5, we applied SAM to the Value and reward loss as well as the policy loss as well while maintaining the original training procedure of TD-MPC2. We made an interesting observation that:
- SAM significantly helps in policy learning while it negatively affects the downstream policy when applied on value loss and reward loss to such an extreme extent that the policy does not improve at all. The former was slightly hinted at SAM+PPO [2] where the authors claim that a flatter reward surface results in better policy optimization.
- Applying SAM to the policy learning smoothens the reward landscape dictated by Q function, making policy learning easier.
| Method | Episode Reward |
|---|---|
| TD-MPC2 | 301 11 |
| w/ SAM on Transition Dynamics (ours) | 484 19 |
| w/ SAM on Reward and Value Prediction | 10 3 |
| w/ SAM on Policy Learning | 463 49 |
Tab. 5: Applying SAM to different components TD-MPC2 for the humanoid_run environment. seeds.
The evaluation was quite comprehensive in terms of the number of domains used, and multiple trials were run, but the results are not presented well.
We accept the reviewers concerns and we promise to provide training logs and a table in the supplementary detailing the evaluation scores and std. dev. for each environment in humanoid bench. We have made the changes in our draft and hope to update our submission as soon as the oppotunity is availed to the authors.
Pairwise comparisons such as t-tests must be used to show ... statistically significant differences (Stair, Sit Simple).
To prove the statistical significance of our results we provide the results of t-tests and provide the p-values in Tab. 8. The overall t-stat and the p-value are -value which shows a statistical significance of our results.
| Metric | balance_hard | balance_simple | stair | walk | run | stand | sit_simple | sit_hard | crawl | pole | hurdle |
|---|---|---|---|---|---|---|---|---|---|---|---|
| TD-MPC2 Episode Reward | 48 6 | 28 8 | 66 6 | 644 281 | 67 8 | 639 240 | 515 187 | 508 298 | 896 53 | 207 35 | 51 12 |
| TD-MPC2 w/ SAM Episode Reward | 82 10 | 145 27 | 77 12 | 885 44 | 302 124 | 870 58 | 773 223 | 843 64 | 846 26 | 273 22 | 71 36 |
| t-statistic | -5.83 | -8.31 | -1.64 | -1.69 | -3.78 | -1.871 | -1.773 | -2.198 | 1.694 | -3.19 | -1.05 |
| p-value | 0.002 | 0.002 | 0.17 | 0.184 | 0.032 | 0.148 | 0.128 | 0.108 | 0.16 | 0.024 | 0.356 |
Tab. 8: Statistical significance from a one-tailed t-test comparing the performance of TD-MPC2 w/ SAM against the baseline TD-MPC2 on HumanoidBench environments. The Paired t-test and p-value across the 11 tasks is -value
What does the "Avg. Performance" plot refer to in Figure 2? ... Does there need to be any normalization since not all environments seem to have the same reward scale?
Thank you for pointing this out—our description was not sufficiently clear.
What the plot shows. Avg. Performance is the arithmetic mean of the episode returns across the 11 HumanoidBench tasks.
Each HumanoidBench task is designed with a common reward range of roughly ; scores approaching 1000 correspond to near-expert behaviour. We will add this clarification to the caption in the revised draft, and we appreciate the opportunity to make the presentation better.
Is SAM likely to work with other MBRL algorithms besides TD-MPC2?
Recognizing the value of the question, we expanded our experimental evaluation. While we initially focused on HumanoidBench as one of the most challenging commonly-used benchmarks, we agree that a more diverse experimental setup strengthens our contributions.
We have included additional results on TWISTER [1], another MBRL algorithm that processes image-based inputs through a transformer-based state-space model built on the Dreamer V3 algorithm. Our evaluation on 20 out of 26 Atari100k environments (Table 1) demonstrates that applying SAM to the model learning component yields significant improvements across the majority of environments, as well as overall performance gains. All hyperparameters were kept consistent across games to ensure fair comparison.
Furthermore, we present results on challenging DMC environments including humanoid_run, humanoid_walk, dog_walk, dog_run, and dog_trot (Table 2). These environments feature high degrees of freedom (DoF > 20) and are widely recognized as difficult benchmarks. Our results show statistically significant improvements over the TD-MPC2 baseline, demonstrating SAM's adaptability and efficacy across diverse environmental settings.
These expanded experiments collectively demonstrate that our approach generalizes effectively beyond the original benchmark scope, reinforcing the robustness of our method across varied domains and complexity levels.
| Game | Random | Human | TWISTER | TWISTER w/ SAM |
|---|---|---|---|---|
| Alien | 228 | 7128 | 823 | 947 |
| Amidar | 6 | 1720 | 172 | 172 |
| Assault | 222 | 742 | 777 | 1102 |
| Asterix | 210 | 8503 | 1132 | 1030 |
| Bank Heist | 14 | 753 | 673 | 886 |
| Battle Zone | 2360 | 37188 | 5452 | 10230 |
| Boxing | 0 | 12 | 80 | 86 |
| Demon Attack | 152 | 1971 | 286 | 293 |
| Freeway | 0 | 30 | 0 | 0 |
| Frostbite | 65 | 4335 | 388 | 1610 |
| Gopher | 258 | 2412 | 2078 | 4099 |
| Hero | 1027 | 30826 | 9836 | 12320 |
| James Bond | 29 | 303 | 354 | 426 |
| Kangaroo | 52 | 3035 | 1349 | 1555 |
| Ms Pacman | 307 | 6952 | 2319 | 2409 |
| Pong | -21 | 15 | 20 | 20 |
| Road Runner | 12 | 7845 | 9811 | 15532 |
| Seaquest | 68 | 42055 | 434 | 426 |
| Up N Down | 533 | 11693 | 4761 | 6857 |
Tab. 1: Performance comparison across 20 Atari100k games. TWISTER w/ SAM consistently improves performance in challenging environments. seeds, for all environments. The better performance is shown in bold
| Method | humanoid_run | humanoid_walk | dog_run | dog_walk | dog-trot |
|---|---|---|---|---|---|
| TD-MPC2 | 301 11 | 883 13 | 428 39 | 887 46 | 891 21 |
| TD-MPC2 w/ SAM | 484 19 | 901 4 | 552 17 | 957 6 | 920 14 |
Tab. 2: Performance comparison on high-DoF DMC environments at 2M environment steps. seeds, .
Line 37: There should be a description of what Sharpness-Aware Minimization is before mentioning it.
We agree that introducing SAM earlier would improve the paper’s readability, and we will relocate the explanation to the Preliminaries section so readers encounter it before the main results. We appreciate this helpful suggestion.
Please carefully check references. For example, references 15 and 16 are repeated.
We extend our apologies for this drafting error on our behalf, We have rectified this mistake in our draft and we promise to reupload it as soon as the opportunity is availed to us.
Limitations are discussed in Sections 5 and 6....
We agree with the reviewer that there should be a more dedicated sec. on limitations of our work. Some of the limitations of our work include:
-
Domain coverage. Experiments cover high-DoF continuous control and Atari-100k, all in simulation. Real-robot deployment, partial observability and severe dynamics shift were not investigated.
-
Scope of the intervention. We applied SAM only to the world-model loss; exploring sharper/flatter minima for value and policy
References
[1] Maxime Burchi and Radu Timofte. "Learning Transformer-based World Models with Contrastive Predictive Coding", ICLR 2025
[2] Hyun Kyu Lee and Sung Whan Yoon, "Flat Reward in Policy Parameter Space Implies Robust Reinforcement Learning", ICLR 2025 Oral
[3] Nicklas Hansen and Xiaolong Wang and Hao Su, "{TD}-{MPC}2: Scalable, Robust World Models for Continuous Control", ICLR 2024
Your review of our work has helped us improve the draft. We have put in a great amount of effort address your concerns. We kindly request you to acknowledge/respond to our rebuttal.
Thank you to the authors for addressing my concerns. I am impressed with the new experiments and statistical results. Based on the results presented here, I am confident that the revised paper will be much stronger. I will raise my score accordingly.
PS: I appreciate the reminder, but please note that many of us reviewers are authors just like yourselves and are also quite busy :)
Dear Reviewer,
Thank you for your thoughtful review and for expressing appreciation of our work. Your feedback has helped strengthen the paper and benefits the NeurIPS community.
Best regards, Authors, Sub. #8454
This paper integrates Sharpness-Aware Minimization (SAM) into the world model training of TDMPC. It presents a theorem characterizing the performance gap between the optimal policy in the true environment and the policy that is optimal with respect to the learned model, demonstrating that reducing sharpness can mitigate value estimation error and thus improve policy performance in the real world. Experimental results on the HumanoidBench tasks show that the proposed method achieves higher mean returns compared to TDMPC.
优缺点分析
Strengths:
- The paper quantifies how decomposing the performance gap into prediction error, sharpness, and complexity yields theoretical guarantees. Sharpness reflects the model’s accuracy and robustness. A small sharpness value indicates that the model's evaluation of the policy is stable and less likely to be inflated by sharp artifacts in the model error landscape, thereby increasing the likelihood that the policy will perform well in the real environment. I have carefully reviewed the proof and did not identify any major issues.
- It can be integrated as a plugin into existing world models such as TDMPC, making it extremely convenient and practical.
Weaknesses:
-
The experimental evaluation is insufficient. The paper simply integrates SAM into TDMPC2 and reports results using only four random seeds. Given that the core methodological change is limited to swapping the optimizer, a more rigorous and comprehensive evaluation is essential. For the main results in Figure 2, please run the experiments with at least 8 seeds. Additionally, all experiments are limited to HumanoidBench tasks. To strengthen the paper, more diverse environments should be included, and both the code and full experimental results (e.g., via Weights & Biases) should be made publicly available. Without these improvements, I cannot recommend acceptance.
-
Since Sharpness-Aware Minimization (SAM) is not a novel contribution of this work, its description should be moved to the Preliminaries section.
-
No experiments have been conducted on other model-based RL algorithms, such as MBPO and DreamerV3.
问题
-
How do you determine the critical neighborhood radius? Why are there no experiments evaluating how variations in this parameter affect performance?
-
What are the original sources of Equation (4) in the Simulation Lemma and Equation (6) for the Bound on Population Model Error? Please provide appropriate citations. If no citation is available, a complete derivation is required. Additionally, after introducing predicted rewards, how is Equation (5) derived?
-
What guarantees that this method provides a tighter bound compared to a standard world model training?
局限性
You mention in the paper: “Regarding hyperparameters, the neighborhood radius is paramount and learning rates might require adjustment due to changes in effective gradient magnitudes, making sensitivity analysis beneficial.” This implies you conducted a detailed hyperparameter search for your method—e.g., tuning learning rates. However, you did not perform a similarly thorough search for TDMPC, such as trying several different learning rates. Since in your experiments the performance gap between your method and TDMPC is small, I speculate that exploring more learning rates for TDMPC could yield comparable results.
最终评判理由
Weaknesses 2 and 3, as well as Questions 1 and 2 and the limitation, have been addressed in the rebuttal. However, Weakness 1 remains unaddressed.
格式问题
No
We thank the reviewer for the detailed feedback and address each point below.
The experimental evaluation is insufficient. The paper simply integrates SAM into TDMPC2 and reports results using only four random seeds. Given that the core methodological change is limited to swapping the optimizer, a more rigorous and comprehensive evaluation is essential. For the main results in Figure 2, please run the experiments with at least 8 seeds.
We were limited by the available compute and time during the rebuttal given HumanoidBench runs are compute-intensive and require 100+ hrs/seed on nvidia-V100, making running larger number of seeds prohibitively expensive. We tried to match the reporting methodology of [1] and performed atleast/more seeds as done by the original baseline. To prove the statistical significance of our results we provide the results of t-tests and provide the p-values in Tab. 8. The overall t-stat and the p-value are -value which shows a statistical significance of our results.
| Metric | balance_hard | balance_simple | stair | walk | run | stand | sit_simple | sit_hard | crawl | pole | hurdle |
|---|---|---|---|---|---|---|---|---|---|---|---|
| TD-MPC2 Episode Reward | 48 6 | 28 8 | 66 6 | 644 281 | 67 8 | 639 240 | 515 187 | 508 298 | 896 53 | 207 35 | 51 12 |
| TD-MPC2 w/ SAM Episode Reward | 82 10 | 145 27 | 77 12 | 885 44 | 302 124 | 870 58 | 773 223 | 843 64 | 846 26 | 273 22 | 71 36 |
| t-statistic | -5.83 | -8.31 | -1.64 | -1.69 | -3.78 | -1.871 | -1.773 | -2.198 | 1.694 | -3.19 | -1.05 |
| p-value | 0.002 | 0.002 | 0.17 | 0.184 | 0.032 | 0.148 | 0.128 | 0.108 | 0.16 | 0.024 | 0.356 |
Tab. 8: Statistical significance from a one-tailed t-test comparing the performance of TD-MPC2 w/ SAM against the baseline TD-MPC2 on HumnaoidBench environments. The Paired t-test and p-value across the 11 tasks is -value
Additionally, all experiments are limited to HumanoidBench tasks..... No experiments have been conducted on other model-based RL algorithms, such as MBPO and DreamerV3.
-
Broader Benchmarks : We added two new, demanding benchmarks to rigorously test the generality of our approach; the full results will be included in the updated manuscript.
-
Diverse Architectures : To confirm that SAM is not confined to state-based control or a single agent design, we integrated it with TWISTER [2], a transformer-based world model.
-
Comprehensive Atari100k Evaluation : The revised study now covers 20 Atari100k games, providing a wide spectrum of pixel-based, discrete-action tasks.
-
Substantial Performance Gains : Augmenting TWISTER with SAM yields large improvements in most titles—especially Battle Zone, Frostbite, Gopher, and Road Runner—as summarised in Tab. 1.
-
Evidence of Robust Generalisation : These findings demonstrate that our core idea transfers across input modalities (pixels vs. states) and model families, underscoring the wider applicability of SAM in model-based RL.
| Game | Random | Human | TWISTER | TWISTER w/ SAM |
|---|---|---|---|---|
| Alien | 228 | 7128 | 823 | 947 |
| Amidar | 6 | 1720 | 172 | 172 |
| Assault | 222 | 742 | 777 | 1102 |
| Asterix | 210 | 8503 | 1132 | 1030 |
| Bank Heist | 14 | 753 | 673 | 886 |
| Battle Zone | 2360 | 37188 | 5452 | 10230 |
| Boxing | 0 | 12 | 80 | 86 |
| Demon Attack | 152 | 1971 | 286 | 293 |
| Freeway | 0 | 30 | 0 | 0 |
| Frostbite | 65 | 4335 | 388 | 1610 |
| Gopher | 258 | 2412 | 2078 | 4099 |
| Hero | 1027 | 30826 | 9836 | 12320 |
| James Bond | 29 | 303 | 354 | 426 |
| Kangaroo | 52 | 3035 | 1349 | 1555 |
| Ms Pacman | 307 | 6952 | 2319 | 2409 |
| Pong | -21 | 15 | 20 | 20 |
| Road Runner | 12 | 7845 | 9811 | 15532 |
| Seaquest | 68 | 42055 | 434 | 426 |
| Up N Down | 533 | 11693 | 4761 | 6857 |
Tab. 1: Performance comparison across 20 Atari100k games. TWISTER w/ SAM consistently improves performance in challenging environments. seeds, for all environments. The better performance is shown in bold
To further demonstrate robustness beyond HumanoidBench, we expanded our evaluation to other high-degree-of-freedom (DoF) environments in the DeepMind Control Suite. Our results (Tab. 2) show that TD-MPC2 w/ SAM consistently and significantly outperforms the baseline on tasks like humanoid_run, dog_run, and dog_walk, which are widely recognized as difficult benchmarks.
| Method | humanoid_run | humanoid_walk | dog_run | dog_walk | dog-trot |
|---|---|---|---|---|---|
| TD-MPC2 | 301 11 | 883 13 | 428 39 | 887 46 | 891 21 |
| TD-MPC2 w/ SAM | 484 19 | 901 4 | 552 17 | 957 6 | 920 14 |
Tab. 2: Performance comparison on high-DoF DMC environments at 2M environment steps. seeds, .
... and both the code and full experimental results (e.g., via Weights & Biases) should be made publicly available. Without these improvements, I cannot recommend acceptance.
We agree with the reviewer that reproducibility is paramount. To that end, we commit to making our complete codebase and full Weights & Biases experimental logs publicly available upon acceptance. This will allow our work to be fully verified and provide access to all training curves and raw evaluation scores. We hope these commitments, along with our new experiments, have addressed the reviewer's concerns.
Since Sharpness-Aware Minimization (SAM) is not a novel contribution of this work, its description should be moved to the Preliminaries section.
We completely agree. Since Sharpness-Aware Minimization (SAM) is a foundational method that our work builds upon, rather than a novel contribution of this paper, its description properly belongs in the Preliminaries section. In the camera-ready version, we will move the explanation of SAM to the Preliminaries. We believe this change will significantly improve the paper's narrative flow by first introducing all necessary background concepts before detailing our specific contributions. This will provide a clearer and more logical reading experience for the community.We appreciate this constructive feedback, which will help us improve the final manuscript.
You mention in the paper: “Regarding hyperparameters, the neighborhood radius is paramount and ..., I speculate that exploring more learning rates for TDMPC could yield comparable results.
Re-optimising the baseline for our study would break comparability with prior and future work built on those well-vetted defaults. TD-MPC2 already comes with a single, task-agnostic hyper-parameter set obtained by an extensive search; using it is standard practice. This protocol ensures a fair, controlled comparison and confirms that the observed gains stem from flatter minima, not from asymmetric tuning.
We used exactly the published TD-MPC2 settings for both methods; the only additional parameter we touched is the SAM radius ρ. This choice isolates the effect of sharpness-aware minimisation—tuning LR for one side only would confound the attribution of gains.
How do you determine the critical neighborhood radius? Why are there no experiments evaluating how variations in this parameter affect performance?
We have conducted a thorough sensitivity analysis for the neighborhood radius, ρ (Tables 5 & 6), which reveals our key finding: the optimal ρ is directly proportional to the magnitude of the model's loss function. For instance, the optimal ρ for TWISTER's high-magnitude consistency loss was ~1e-3, while for TD-MPC2's lower-magnitude dynamics loss, it was ~5e-7. This insight provides a practical heuristic for tuning ρ in new environments. While more recent adaptive methods (e.g., ASAM) can mitigate this sensitivity, our work serves as a foundational proof of concept, demonstrating the benefits of sharpness-aware optimization for MBRL and paving the way for future improvements.
| Episode Reward | |
|---|---|
| 348 28 | |
| 445 56 | |
| 492 34 | |
| 454 17 | |
| 439 21 | |
| 414 21 |
Tab. 3: Ablation on rho for DMC humanoid_run for 2M env. steps. seeds.
| Episode Reward | |
|---|---|
| 281 277 | |
| 714 311 | |
| 886 445 | |
| 785 277 | |
| 781 213 |
Tab. 4: Ablation on rho for atari100k env. Bank Heist for 100k env. steps. seeds.
What are the original sources of Equation (4) in the Simulation Lemma and Equation (6) ... Additionally, after introducing predicted rewards, how is Equation (5) derived?
Thank you for pointing this out. We have rectified in the updated manuscript. We use simulation lemma from [1], [2] and Eq. 5 is as directly stated in [1]. We promise to upload the corrected manuscript with these corrections.
[1] Wen Sun, "Notes on simulation lemma", https://wensun.github.io/CS4789_data/simulation_lemma.pdf [2] Sam Lobel and Ronald Parr, "An Optimal Tightness Bound for the Simulation Lemma", RLJ 2024
Your review of our work has helped us improve the draft. We have put in a great amount of effort address your concerns. We kindly request you to acknowledge/respond to our rebuttal.
Dear reviewer sVin
Thank you for the time and care you invested in your review. Your feedback materially strengthened the work and, by extension, the NeurIPS community.
During the discussion period we undertook substantial additional experiments and revisions aimed squarely at your concerns: broader benchmarks (20 Atari100k games + high-DoF DMC), ρ-sensitivity sweeps, other MBRL algorithms and an open-code commitment for full reproducibility. We hope these additions resolve the issues you raised.
If any questions or doubts remain, please let us know, we are eager to clarify every point. Conversely, if the new evidence satisfies your concerns, we would be grateful if you could consider adjusting your score accordingly.
Thank you again for your thoughtful service to the community.
Thank you for your response.
-
Why did you choose to conduct experiments using TWISTER instead of Dreamer? TWISTER is a relatively new baseline and has not yet been widely adopted by the community. What is the rationale behind this choice?
-
How many seeds did you use when running TWISTER on Atari? The reported TWISTER results differ significantly from those in the original paper. For instance, in the Road Runner task, the original paper reports a score of 17,832, whereas your result is only 9,811.
-
Why are the values of the optimal ρ and the loss on the same order of magnitude? Is there any specific explanation for this observation?
"Why did you choose to conduct experiments using TWISTER instead... choice"
- Public PyTorch implementation: TWISTER (ICLR 2025) [1] provides an official, well-maintained PyTorch codebase.
- Architecture diversity: TWISTER replaces Dreamer’s CNN/RNN world model with a Transformer-based world model. Demonstrating SAM on both MLP- and Transformer-style dynamics broadens evidence that applying SAM to MBRL is architecture-agnostic.
- Algorithmic diversity: Using a different MBRL algorithm than TD-MPC2 strengthens the claim that SAM’s benefits are not implementation-specific.
- Pixel-based input regime: TWISTER operates directly on visual Atari observations; including it shows efficacy beyond the state-based HumanoidBench setting.
- Lack of an “official” Dreamer V3 PyTorch repo: Dreamer’s reference implementation is in JAX. Porting and validating it would blur experimental focus, whereas TWISTER offers an official PyTorch pipeline.
In short, TWISTER is “Dreamer + Transformer”. As the community gravitates toward Transformer world models, TWISTER is a forward-looking baseline.
"How many seeds did you use when running TWISTER on Atari ... 9,811."
- We followed the standard Atari-100k protocol of five random seeds per game, using the exact environment settings in the TWISTER repository.
- The baseline numbers were regenerated from scratch to guarantee an apples-to-apples comparison, we did not want any factor to influence the improvement in performance except the merits of our proposal.
- Example: for Road Runner we obtain 9 811 (5 seeds) vs. the original 17 832. We speculate the gap stems from different GPU determinism flags introduced after the initial release; we are cross-checking with the TWISTER authors.
"Why are the values of the optimal...or this observation?"
This is an excellent question!! and actually touches one subtle point that we tried to make. Sharpness-Aware Minimization(SAM) is a widely adopted algorithm for converging to flat-minima but not the best one and has its minor flaws. SAM uses a fixed spherical radius around the current weights.
- Scale coupling. The best is tied to the scale of the parameters encountered during training. Rescaling the weights (which also rescales gradients and the loss surface) changes the effective perturbation size, so the “ideal” tracks the magnitude of the loss, this has also been discussed in Sec. 4 [2].
- Follow-up remedies.
- ASAM [2] rescales perturbations by local weight norms, yielding a scale-invariant objective and decoupling from absolute loss values.
- CR-SAM [3] regularizes a normalized Hessian trace, further mitigating sensitivity in highly non-linear regions.
Hence observing comparable magnitudes for the tuned and the loss is expected
You raised an excellent point and we are glad we were able to address your concern regarding this subtle yet beautiful point. We see our work as not trying to optimize for the application of SAM to MBRL along different aspects such as, optimal performance/speed/hyper-parameter sensitivity etc. We position our work as the first work that shows evidence of efficacy of SAM in Model Based RL laying the foundation for future works to further develop this niche
We sincerely thank sVin for their insightful points. We are glad we were able to address them. If the reviewer has any other concerns we are more than happy to address them. We thank them for their service.
References
[1] Maxime Burchi and Radu Timofte. "Learning Transformer-based World Models with Contrastive Predictive Coding", ICLR 2025, spotlight
[2] Kwon, Jungmin et al. "ASAM: Adaptive Sharpness-Aware Minimization for Scale-Invariant Learning of Deep Neural Networks", ICML 2021
[3] Wu, T. et al. CR-SAM: Curvature Regularized Sharpness-Aware Minimization. AAAI 2024
Thanks for your response. Most of my concerns have been addressed, and I am willing to raise my score.
Dear Reviewer,
Thank you for your thoughtful comments and for acknowledging our contribution. Your feedback has significantly improved the paper and, we hope, added value to the NeurIPS community.
Best regards, Authors, Sub. #8454
Following a robust discussion with the authors, the reviewers agree that the work is a sound, novel, and potentially impactful application of an existing optimization idea in the MBRL context. They appreciate the strength of both the theoretical and empirical evaluation of the core contribution. The authors should pay close attention to the points that tripped up the reviewers so the final submission can anticipate and address those questions for other readers.