PaperHub
Overall: 6.8 / 10 · Poster · 4 reviewers
Ratings: 4, 4, 5, 4 (min 4, max 5, std 0.4)
Confidence: 4.3
Novelty: 2.5 · Quality: 2.8 · Clarity: 3.0 · Significance: 2.8
NeurIPS 2025

Bootstrap Off-policy with World Model

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29
TL;DR

We propose BOOM, a model-based RL method that uses a soft value-weighted, likelihood-free alignment loss to bootstrap the policy from a non-parametric planner with a world model, achieving state-of-the-art performance.

Abstract

Keywords
Model-based Reinforcement Learning, Online Planning

Reviews and Discussion

Review
Rating: 4

The paper introduces BOOM, a model-based RL method that improves the integration of planning and off-policy learning. BOOM uses "bootstrap alignment" by leveraging non-parametric planner actions for both environment interaction and bootstrapping policy behavior via a Q-weighted likelihood-free alignment loss. This approach mitigates actor divergence, leading to enhanced training stability, and better final performance. Experimental results on various high-dimensional locomotion tasks demonstrate that BOOM consistently outperforms existing planning-driven and imagination-driven baselines.

Strengths and Weaknesses

Strengths:

(1) The paper is well-written, with clear explanations that make the methodology and contributions easy to follow. It also provides solid theoretical justification, which enhances the credibility of the approach.

(2) The proposed method is conceptually simple and easy to implement. It demonstrates superior Total Average Return (TAR) across 7 tasks from the DeepMind Control Suite and 7 tasks from the Humanoid Benchmark. Compared to strong baselines such as DreamerV3 and TD-MPC2, the method consistently shows competitive or even superior performance, highlighting its practical effectiveness.

Weaknesses:

(1) Limited novelty: The key approach of imitating planner behavior to address actor divergence (Eq.2), as presented in this paper, appears to mirror similar methodologies already employed in BMPC (Eq.6 in BMPC paper). From my perspective, the novelty of the proposed method in this regard is not immediately apparent. The authors should delineate the core contributions of this work that distinguish it from the existing BMPC.

(2) Limited Statistical Validation: Experimental results (Fig. 2, Table 1) based on only three random seeds are insufficient for robust validation. Ablation experiments are confined to a single task, limiting generalizability.

Questions

Q1: In Eq.(2), action a is extracted from a "planning-augmented behavior policy," yet in Eq.(3) and Algorithm 1, it appears to be sampled from the replay buffer. These methods are distinct, especially as MPPI planning actions evolve throughout training. The author needs to clarify the difference between these two methods.

Q2: Following Q1, due to the model bias in the initial world model training, the interaction data stored in the replay buffer is poor. Therefore, I have concerns about the effectiveness of using actions from the replay buffer for alignment. Why not adopt the re-planning approach used in BMPC?

Limitations

No, see weaknesses.

Final Justification

After reviewing the authors’ rebuttal, and conducting my own reproduction experiments, I am maintaining my overall positive assessment with the following justifications:

Time Efficiency – Partially Addressed

The authors clarified that while replanning introduces non-negligible overhead, the use of compiler-level optimizations (e.g., torch.compile) substantially mitigates this cost. As highlighted in my original comments, the training time was reduced from 23.6h to 9.9h (with replanning) and 6.8h (without), demonstrating that the method remains practical even when replanning is used. Although replanning remains an overhead, its impact is significantly reduced in the presence of modern compilation tools — making the cost more manageable in practice.

Simplicity and Generality — Resolved

I appreciate the authors' agreement with the clarification and their intention to revise the wording accordingly. This improves the clarity and general appeal of the method.

Performance Evaluation — Partially Resolved

I acknowledge the concern raised regarding reproduction of BMPC results. That said, I’d like to highlight that even under a reduced training budget (1M steps), the method achieves strong performance on challenging tasks such as Dog-run (731.3) and Humanoid-run (570), which are comparable to or better than those reported in the BMPC paper. While some variation in reproduction is to be expected, I encourage the authors to include more seeds for the BMPC baseline in the revised version to support a fair and robust comparison.

Novelty Compared to BMPC — Clarified and Accepted

The authors clearly articulated their contribution in retaining max-Q for policy improvement, which distinguishes their method from BMPC. While the distinction is somewhat subtle, it is technically sound and worth emphasizing more prominently in the final version.

Remaining Issues:

No major concerns remain unresolved. The few that persist (e.g., number of seeds, clarity of novelty) are relatively minor and do not significantly undermine the contribution.

Formatting Issues

None

Author Response

We sincerely appreciate your detailed and constructive feedback. Below are our responses to your concerns.

1. Comparison with BMPC (Weakness 1)

Limited novelty: The key approach of imitating planner behavior to address actor divergence (Eq.2), as presented in this paper, appears to mirror similar methodologies already employed in BMPC (Eq.6 in BMPC paper). From my perspective, the novelty of the proposed method in this regard is not immediately apparent. The authors should delineate the core contributions of this work that distinguish it from the existing BMPC.

We appreciate the opportunity to clarify the differences between our work and BMPC. While both methods leverage planner-generated actions for imitation in MBRL, our approach is conceptually distinct and practically more efficient in several key aspects:

1. Time efficiency

  • BMPC relies on frequent re-planning during training to obtain up-to-date planner actions for behavior cloning, which introduces significant computational overhead.
  • In contrast, our method employs a bootstrap alignment mechanism that aligns the actor with the stored planner actions, eliminating the need for explicit re-planning. This design shows a clear advantage in time efficiency (see Table 6 below: training time with re-planning).

2. Simplicity and generality

  • BMPC incorporates several auxiliary designs beyond its core planning framework. For instance, it periodically alters max_policy_std to control exploration levels. While such heuristics can be effective in specific tasks, they require manual tuning and are sensitive to domain choices, limiting generality and robustness.

  • In contrast, our approach focuses on a minimal yet effective alignment mechanism that improves policy quality without introducing domain-specific tricks or altering core components. This simplicity not only makes our method easier to implement and integrate, but also more stable and broadly applicable across tasks.

3. Performance
Empirically, our method consistently outperforms BMPC across various tasks, particularly in high-dimensional settings like Dog-run and Humanoid-run. While BMPC is prone to early convergence and unstable updates, our method maintains steady progress and achieves higher final returns with greater learning efficiency.

Summary

  • BMPC reflects a transitional stage in the development of planning-driven MBRL—a design that gestures toward the right direction but remains incomplete. It relies on several loosely motivated heuristics, such as periodically adjusting max_policy_std and fresh planner actions in replay buffer, to promote exploration and maintain training stability. These handcrafted tweaks suggest that BMPC is still evolving and lacks the robustness needed for broader applicability.

  • In contrast, our BOOM offers a fully-formed, lightweight solution that seamlessly integrates into existing MBRL frameworks. Without introducing new components or tuning-sensitive mechanisms, we achieve consistent gains in stability and performance. We believe this kind of alignment-based strategy not only improves practical usability but also opens up a promising direction for more systematic advancement in planner-policy integration.

2. Improved statistical validation (Weakness 2)

Limited Statistical Validation: Experimental results (Fig. 2, Table 1) based on only three random seeds are insufficient for robust validation.

We have increased the number of random seeds from 3 to 5 for our method on all benchmark tasks. Due to time constraints, we were only able to extend the seed count for our method rather than all baselines. The updated results are shown in the following table, and our method still consistently outperforms baselines with slightly reduced variance.

Table 1: Benchmark with increased number of seeds

Seeds | Humanoid-stand | Humanoid-walk | Humanoid-run | Dog-stand | Dog-walk | Dog-trot | Dog-run | AVG. (DMC Suite)
3 | 962.1 ± 10.7 | 936.1 ± 3.3 | 582.8 ± 26.0 | 986.8 ± 1.8 | 965.4 ± 0.3 | 947.9 ± 4.7 | 821.5 ± 3.8 | 876.7 ± 8.4
5 | 963.5 ± 9.2 | 938.3 ± 2.8 | 578.6 ± 20.4 | 988.2 ± 1.4 | 964.1 ± 0.2 | 949.2 ± 3.8 | 819.7 ± 2.9 | 877.4 ± 6.0

Seeds | H1hand-stand | H1hand-walk | H1hand-run | H1hand-sit | H1hand-slide | H1hand-pole | H1hand-hurdle | AVG. (H-Bench)
3 | 926.1 ± 19.2 | 935.4 ± 7.3 | 682.2 ± 120.6 | 918.1 ± 4.2 | 926.1 ± 8.0 | 930.5 ± 18.9 | 435.6 ± 29.8 | 820.6 ± 27.7
5 | 928.2 ± 15.8 | 936.8 ± 6.1 | 684.4 ± 95.2 | 919.3 ± 3.5 | 927.4 ± 6.3 | 932.1 ± 15.6 | 417.4 ± 32.7 | 818.9 ± 26.1

Notably, the only task where performance dropped was the hurdle, which requires the humanoid to continuously jump over multiple hurdles. We observed that some seeds resulted in the agent missing 1–2 jumps and falling prematurely, leading to relatively low scores around 350–400, while successful seeds achieved 450–500. The overall average across 5 seeds is 417, which, despite some instability, still significantly outperforms prior baselines—BMPC achieves only 197.1 ± 12.1, and Dreamer (10M) only 135.7 ± 6.1.

3. Add ablation study (Weakness 2)

Ablation experiments are confined to a single task, limiting generalizability.

We conduct additional ablation experiments on the challenging H1hand-slide task from Humanoid-Bench, which complements the original ablation on Dog-run.

Table 2: Ablation on the alignment metric

Alignment metric | Dog-run | H1hand-slide
Forward (default) | 819.7 ± 2.9 | 927.4 ± 6.3
Reverse | 631.1 ± 29.4 | 860.0 ± 24.1

Table 3: Ablation on the Q-weight mechanism

Variant | Dog-run | H1hand-slide
W/ Q-weight (default) | 819.7 ± 2.9 | 927.4 ± 6.3
W/O Q-weight | 738.2 ± 17.0 | 840.0 ± 32.5

Table 4: Ablation on the alignment coefficient

Coefficient | Dog-run | H1hand-slide
×1 (default) | 819.7 ± 2.9 | 927.4 ± 6.3
×0.1 | 760.6 ± 31.3 | 875.0 ± 17.2
×10 | 728.7 ± 44.1 | 932.5 ± 5.0

As shown, the H1hand-slide task involves a higher-dimensional action space and thus requires a relatively larger alignment coefficient. Overall, the trend still aligns well with our original claim about the effectiveness of the alignment mechanism. We believe these additional ablation studies strengthen the empirical robustness of our claims.

4. Clarify the action in Eq.(2) and Eq.(3) (Q1)

In Eq.(2), action a is extracted from a "planning-augmented behavior policy," yet in Eq.(3) and Algorithm 1, it appears to be sampled from the replay buffer. These methods are distinct, especially as MPPI planning actions evolve throughout training. The author needs to clarify the difference between these two methods.

To clarify: Eq.(2) reflects our theoretical design — in an ideal setting, every training sample from the replay buffer would be paired with a freshly re-planned action generated using the current world model and policy. This corresponds to what we refer to as the "planning-augmented behavior policy."

However, in practice, continuously re-generating planner actions for all replay samples during every policy update introduces significant computational overhead. To address this, we adopt a practical compromise: when collecting experience, we store the MPPI-generated planner action into the replay buffer alongside the transition. As shown in Eq.(3), during training, we directly imitate these stored actions instead of re-planning on the fly.

While this may lead to some degradation due to action staleness, we mitigate it by applying a soft Q-weight mechanism. This allows the policy to focus more on high-quality historical behaviors, reducing the impact of suboptimal actions.
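
For concreteness, the sketch below illustrates this practical variant, assuming generic env, planner, buffer, policy, and world_model interfaces and a squared-error surrogate for the likelihood-free alignment term; it is illustrative only and not the exact implementation.

import torch

def collect_step(env, obs, planner, buffer):
    # The MPPI planner refines the policy proposal; its action is both
    # executed in the environment and stored alongside the transition.
    action = planner.plan(obs)
    next_obs, reward, done, info = env.step(action)
    buffer.add(obs, action, reward, next_obs, done)
    return next_obs, done

def soft_q_weighted_alignment(policy, world_model, batch, tau=1.0):
    # Imitate the stored planner actions, weighted by their Q-values,
    # so that high-quality historical behaviors dominate the loss.
    z = world_model.encode(batch.obs)
    with torch.no_grad():
        q = world_model.Q(z, batch.action)               # Q of stored planner actions
        w = torch.softmax(q / tau, dim=0) * q.shape[0]   # soft Q-weights, mean ~ 1
    pred = policy(z).mean                                # policy's predicted action
    return (w * ((pred - batch.action) ** 2).sum(dim=-1)).mean()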

We apologize for the confusion and will add a dedicated “Practical Implementation Details” section in the revision to make this distinction explicit.

5. More experiments and discussions on the re-planning mechanism (Q2)

Following Q1, due to the model bias in the initial world model training, the interaction data stored in the replay buffer is poor. Therefore, I have concerns about the effectiveness of using actions from the replay buffer for alignment. Why not adopt the re-planning approach used in BMPC?

We fully agree that actions stored in the replay buffer may be suboptimal, particularly due to model bias in the early stages. To evaluate the benefit of using up-to-date planner actions, we incorporated the re-planning mechanism from BMPC into our BOOM framework. Specifically, every k = 10 policy updates, we select b = 20 samples from each training batch (batch size 256) and regenerate their corresponding planner actions using the current policy and world model. The re-planned actions then replace the old actions in the buffer.
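
A minimal sketch of this re-planning schedule is shown below, assuming generic planner and buffer interfaces; the values of k and b follow the setting described above, and everything else is illustrative.

import torch

def maybe_replan(update_step, batch, planner, buffer, k=10, b=20):
    # Every k policy updates, refresh the planner actions of b samples from
    # the current batch using the up-to-date world model and policy, and
    # overwrite the stale actions stored in the replay buffer.
    if update_step % k != 0:
        return
    idx = torch.randperm(batch.obs.shape[0])[:b]
    for i in idx.tolist():
        fresh_action = planner.plan(batch.obs[i])
        buffer.overwrite_action(batch.indices[i], fresh_action)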

Table 5: TAR with re-planning

Method | Dog-run | H1hand-slide
ours | 819.7 ± 2.9 | 927.4 ± 6.3
w/ re-plan | 860.1 ± 2.5 (+4.5%) | 939.5 ± 5.4 (+1.3%)

Table 6: Training time with re-planning

Method | Dog-run | H1hand-slide
ours | 37.2h | 45.6h
w/ re-plan | 56.3h (+51.3%) | 71.9h (+57.7%)

Introducing re-planning led to a consistent performance improvement (from 819.7 to 860.1 TAR on the Dog-run task), suggesting that aligning with fresh actions from the current model-planner pair can indeed enhance training. However, we also observed a notable trade-off: the overall training time increased by approximately 50% due to additional rollouts and planning.

We appreciate your suggestion, as it helped us identify a meaningful performance gain. However, we believe the increased time cost may limit the practicality of frequent re-planning in larger-scale or real-time applications. Given the relatively modest improvement and the added computational cost, we believe re-planning is not strictly necessary. That said, if performance is the top priority, it remains a reliable trick worth considering.

6. Conclusion

After extensive effort, we have added the requested experiments with more seeds, clarified the differences from BMPC, and added experiments employing the re-planning mechanism from BMPC. We hope our responses address your concerns and kindly request your consideration for a higher score.

Comment

1. Time efficiency

The authors suggest that BMPC incurs significant computational overhead due to frequent replanning and use this as a key point to highlight the time efficiency of their own method. However, this claim is not clearly supported by evidence in the original BMPC paper.

In fact, the BMPC paper demonstrates that straightforward optimizations, specifically using ‘torch.compile’, can substantially accelerate training without altering the original algorithm's performance. In my reproduction of the Dog-run task with 1M environment steps, I observed the following training times: BMPC with replanning and no acceleration took 23.6 hours, BMPC with replanning and acceleration using torch.compile reduced this to 9.9 hours, and BMPC with acceleration but without replanning further reduced the time to 6.8 hours—a 31.3% reduction, which is notably smaller than the 51.3% reported by the authors in Table 4. All experiments were conducted using the official BMPC implementation from https://github.com/wertyuilife2/bmpc on a single RTX 3090 GPU.

These results suggest that, with proper engineering optimizations such as torch.compile, the computational cost associated with replanning in BMPC becomes quite manageable and does not constitute a major overhead. Therefore, the assertion that BMPC introduces “significant computational overhead” may be somewhat overstated given the available experimental evidence.

2. Simplicity and generality

The authors raise concerns about BMPC’s reliance on heuristics such as 'max_policy_std', suggesting that it limits generalization and robustness. However, this concern may be somewhat overstated. In practice, 'max_policy_std' serves as a simple and general mechanism for encouraging exploration, rather than a complex or task-specific heuristic. With appropriate initialization, this parameter has been found to perform robustly across a range of tasks, and the associated engineering overhead is minimal. As such, the presence of this parameter alone does not provide sufficient grounds to conclude that BMPC lacks generalization or robustness.

3. Performance

The authors claim that their method significantly outperforms BMPC on complex, high-dimensional tasks such as Dog-run and Humanoid-run. However, my independent reproduction results (Dog-run: 810.3 ± 45.4, Humanoid-run: 636.0 ± 41.1, using 2M environment steps and random seeds 0, 1, 2) are highly consistent with those reported in the original BMPC paper, where the average returns are approximately 700 for Dog-run and 600 for Humanoid-run. These results are substantially higher than the performance reported by the authors, whose paper (Figure 2) shows BMPC converging to around 500 on Dog-run and approximately 400 on Humanoid-run after 2M environment steps. This notable discrepancy from both the original paper and independently reproduced results suggests that the BMPC baseline may not have been correctly or fairly implemented in their evaluation.

4. Summary

In the summary, the authors describe BMPC’s use of “fresh planner actions in the replay buffer” as “loosely motivated heuristics.” However, the core motivation behind incorporating the replanning mechanism in BMPC is to effectively mitigate actor divergence—ensuring that the policy remains aligned with high-quality, up-to-date planner behaviors under the current world model. This design choice plays a key role in improving both generalization and stability.

In contrast, the authors’ approach, which relies on imitating historical planner actions, can be viewed as a trade-off that prioritizes efficiency over alignment. While this may simplify implementation, its effectiveness and theoretical grounding appear weaker compared to the original BMPC formulation. Notably, the authors themselves acknowledge the issue of stale actions in their method, which further suggests that their approach may be more of a pragmatic workaround than a principled solution.

Moreover, while the authors highlight the training cost of BMPC as a central drawback to emphasize the efficiency of their own method, both the BMPC paper and my reproduction demonstrate that, with simple engineering optimizations—such as the use of ‘torch.compile’—the additional overhead introduced by replanning remains manageable. In this context, the benefits of maintaining strong policy-planner alignment through replanning can outweigh the modest computational cost. As such, reducing training time at the expense of effective policy-planner alignment may not represent a fundamental methodological advancement.

Comment

Dear Reviewer H2KF:

Thank you very much for taking the time out of your busy schedule to review our paper. We have made a concerted effort to address all concerns in detail and sincerely hope that our responses are satisfactory.

We genuinely hope that our work, BOOM, can serve as a meaningful contribution to the RL community—encouraging not only the pursuit of similarly simple yet impactful ideas, but also the development of more sophisticated and powerful planning-driven MBRL solutions.

We look forward to continuing thoughtful and engaging academic discussions with you. If you have any further questions or concerns, please feel free to let us know—we would be happy to conduct additional experiments or analyses to provide further clarification.

Comment

Thank you for your detailed and thoughtful comments. We’re sorry that some parts of our previous response may have led to misunderstandings, and we genuinely appreciate the opportunity to clarify them below.

1. Efficiency

We appreciate your insight regarding torch.compile. We acknowledge that we did not apply this optimization, as our work was built upon the TD-MPC2 codebase. However, as your results indicate, even with torch.compile, replanning introduces a 45.6% training time overhead ((9.9 – 6.8)/6.8), which is closely consistent with our reported 51.3% ((56.3 – 37.2)/37.2). This suggests that replanning still constitutes a non-negligible computational cost and impacts training efficiency even when torch.compile is employed.

2. Simplicity and generality

What we actually meant were the additional parameters introduced by BMPC: expl_log_std_min and expl_log_std_max (lines 26 and 27 in bmpc/bmpc/config.yaml). Periodically, the policy's max_policy_std and min_policy_std are replaced by expl_log_std_max and expl_log_std_min, which appears to be a heuristic trick rather than a principled design. We will revise our wording to reflect this more precisely.

See Line 145-151 in bmpc/bmpc/common/world_model.py for the heuristic trick

# Gaussian policy prior
mean, log_std = self._pi(z).chunk(2, dim=-1)
if expl:
    log_std = math.log_std(log_std, self.expl_log_std_min, self.expl_log_std_dif)
else:
    log_std = math.log_std(log_std, self.log_std_min, self.log_std_dif)
eps = torch.randn_like(mean)

Given the log_std and random eps, the action is calculated by: Line 167-173 in bmpc/bmpc/common/world_model.py

# Reparameterization trick
if self.cfg.bmpc:
    mean = torch.tanh(mean)
    action = (mean + eps * log_std.exp()).clamp(-1,1)
else:
    action = mean + eps * log_std.exp()
    mean, action, log_prob = math.squash(mean, action, log_prob)

3. Performance

We believe the performance discrepancy stems from two key factors:

(1) In the BMPC config (Line 30 in bmpc/bmpc/config.yaml), if steps was set to 2_000_000, this may have led to 4M environment steps due to action_repeat=2. The correct setting should be 1_000_000 to ensure 2M environment steps. This may explain why your reproduction results are inconsistent with the results reported in the BMPC paper.

Please see Line 53-59 in bmpc/bmpc/envs/dmcontrol.py for action repeat

def step(self, action):
    reward = 0
    action = action.astype(self.action_spec_dtype)
    for _ in range(2):
        step = self.env.step(action)
        reward += step.reward
    return self._obs_to_array(step.observation), reward, False, defaultdict(float)

(2) BMPC sets log_std_min: -3 and log_std_max: 1 (Line 59 and 60 in bmpc/bmpc/config.yaml), instead of TD-MPC2's default setting of -10 and 2. Since this setting controls the randomness range of the policy and may contribute to performance differences, we use the TD-MPC2 setting throughout for a fair comparison. This may also have some impact. Notably, even when strictly following the BMPC setting, their reported performance is still lower than or comparable to ours.

4. Summary

We appreciate your feedback and agree that our wording may have been overly critical. We recognize BMPC's solid contribution in mitigating actor divergence via pure imitation with replanning, and we will revise our text to reflect this.

At the same time, our method achieves stronger time-efficiency and performance.

  • Time-efficiency: Both your reproduction results and our experiments indicate that the replanning mechanism used by BMPC incurs a 40-50% increase in computational time compared to the variant without replanning.

  • Performance: After clarifying the environment step issue, we find that even under strict adherence to BMPC’s parameter settings, the reported performance remains below 700 on Dog-run and below 600 on Humanoid-run, which is either lower than or comparable to ours.

Comment

5. Clarify the novelty compared to BMPC

Regarding theoretical grounding, our method is also theoretically well-grounded. In fact, our framework also favors re-planning, as reflected in Eq.(3) and supported by our experiments with re-planning (just as you suggested), which improved performance on the Dog-run task from 819.7 to 860.1. We also highlight that our method maintains strong performance both with and without re-planning, which supports its robustness.

Compared to BMPC, our major novelty lies in explicitly retaining max-Q for policy improvement. The additional bootstrap alignment we propose is mainly designed to enhance the accuracy of Q-value learning, and it also alleviates the reliance on re-planning.

Our central claim is that retaining max-Q enables higher performance than pure imitation as in BMPC. This is because Q-values provide a direct signal of action quality, and even the latest MPPI-generated actions are not strictly optimal (MPPI is a sample-based optimization method and cannot obtain the exact optimum of its objective, i.e., the cumulative return). This suggests that our potential performance ceiling is inherently higher.

6. Conclusion

We hope this clarifies the key points and helps resolve your concerns. We sincerely appreciate your review and would be grateful if you would consider re-evaluating our work. Please feel free to reach out with any further questions or suggestions.

Comment

1. Efficiency

I would like to clarify that my intention was not to deny the computational overhead introduced by replanning. Rather, my point is that the speedup brought by ‘torch.compile’ can make this overhead more manageable in practice. As shown in results, training time was reduced from 23.6h to 9.9h with replanning and 6.8h without replanning. The absolute training time remains much lower than the unoptimized baseline. This demonstrates that compiler-level optimizations like ‘torch.compile’ can significantly mitigate the efficiency concerns associated with replanning.

In other words, while replanning does remain a non-negligible cost, it becomes less of a bottleneck when such optimizations are applied.

2. Simplicity and generality

Thanks for the clarification. I agree with your interpretation and appreciate the more precise wording.

3. Performance & 4. Summary

Thanks for the clarification regarding the reproduction details of BMPC. I’d like to point out that even with steps set to 1M, the performance on the Dog-run (731.3) and Humanoid-run (570) tasks remains strong, which are either comparable to or better than the results reported in the authors’ paper. I acknowledge that some variation in reproduction is expected. To help ensure a fair and robust comparison, I encourage including more seeds for the BMPC baseline in the revised version.

5. Clarify the novelty compared to BMPC

Thank you for the detailed explanation of explicitly retaining max-Q for policy improvement. To further strengthen your claims, it may be helpful to highlight this distinction more clearly in the revised version.


The author has addressed most of my concerns, and I appreciate the clarifications and additional insights. I will update my score accordingly.

Comment

Thank you for the updated score and helpful suggestion. We will include more seeds for the BMPC baseline to ensure a fair and robust comparison, and highlight the distinction of explicitly retaining max-Q for policy improvement more clearly in the revised version.

We appreciate your thoughtful feedback and recognition of our clarifications.

Best

Review
Rating: 4

This paper seeks to address the problem of actor divergence in online planning with off-policy RL, which arises because the data for the value-function update is collected by a behavior policy obtained by modifying the control policy with an MPPI-type planner. To this end, the authors propose a method called BOOM that jointly trains a world model and the policy. The main design component is a policy regularization term that uses a likelihood-free KL divergence for policy alignment, together with a soft Q-weighted mechanism that shifts the alignment weights toward high-value actions. Both theoretical and empirical results are provided to justify the effectiveness of the proposed approach.

Strengths and Weaknesses

Strengths:

  1. The idea of using world model for online planning is interesting.
  2. Both theory and experiments are provided.
  3. The presentation is clear.

Weaknesses: My main concern about this paper lies in its limited technical novelty, which can be articulated from the following perspectives:

  1. The world model is an important component proposed in this paper, as can be told from the title. However, it does not receive much attention in the main body of the paper, and the way it is used seems like a direct plug-in to replace the standard dynamics model in model-based planning.
  2. The entire actor divergence problem is similar to the problem of distribution shift in offline RL. The proposed design is basically a policy regularization by following a very similar way as in offline RL where one tries to regularize the policy update to be closer to the behavior policy for the offline dataset. In particular, the likelihood-free alignment metric in eq.3 has been widely used in offline RL to directly use the offline data without modeling the offline behavior policy, see
  • Rasool Fakoor, Jonas Mueller, Pratik Chaudhari, and Alexander J Smola. Continuous doubly constrained batch reinforcement learning. NeurIPS 2021.

The soft Q-weighted mechanism also shares a similar idea advantage weighted policy regularization in offline RL.

Questions

Please see the weaknesses above. While the performance looks promising, it is hard to identify new technical contributions from this paper to the research community. I would consider changing my score if the novelty can be well justified.

Limitations

Yes

Final Justification

The rebuttal has largely addressed my concerns. While I still have some reservations regarding the significance of the technical modifications, they are an improvement over direct applications of existing techniques. Combined with the contributions in identifying actor divergence and proposing a valid solution, I have updated my rating to 'borderline accept'.

Formatting Issues

NA

Author Response

We sincerely appreciate your detailed and constructive feedback. Below are our responses to your concerns.

1. How world model learning is improved in BOOM (Weakness 1)

The world model is an important component proposed in this paper, as can be told from the title. However, it does not receive much attention in the main body of the paper, and the way it is used seems like a direct plug-in to replace the standard dynamics model in model-based planning.

We appreciate your comment and would like to clarify the motivation, structure, and the improvements to the world model in our framework.

Motivation: Our work investigates the challenges and solutions in integrating off-policy RL with planning-driven world-model frameworks. One key challenge is actor divergence: the mismatch between the policy and the planner causes the pair (s, a) (from the planner) and the pair (s', a') (from the policy, since a' = π(s')) in the TD error to follow different distributions, which easily leads to value over-estimation and impairs policy performance (note that the value function is a core component of our world model). To mitigate this issue, our proposed BOOM framework leverages a policy-planner alignment mechanism, which not only stabilizes policy learning but also improves the learning of the world model itself.

Structure: Concretely, our world model consists of four components:

  • A latent encoder that encodes observations into a compact representation space

  • A dynamics model that predicts latent transitions

  • A reward predictor that provides rewards used for planning

  • A Q-value estimator that enables policy improvement and provides terminal value for planning.

Improvements: Our framework improves the learning of these components for the following reasons:

(1) Q-value learning is improved due to policy-planner consistency. By aligning the policy with the planner, we reduce the discrepancy between the collected data and the actual policy behavior. This improves the distributional matching during training, enabling the value estimator to learn from more consistent and policy-relevant trajectories. As a result, the predicted values used in planning become more accurate and reliable.

(2) Encoder, reward, and dynamics learning are all improved due to a unified TD loss. We adopt a TD-style learning objective that jointly optimizes the value function along with the other components of the world model.

TD-style world model loss = latent dynamics loss + reward loss + Q-value loss

As the value predictions become more accurate due to better distributional match, the gradients flowing into the encoder, dynamics model, and reward predictor become more informative, leading to improved overall model quality.
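
A minimal sketch of such a TD-style composite objective is given below, assuming generic encode/dynamics/reward/Q/Q_target/pi components and illustrative loss coefficients; it is not the exact loss used in the paper.

import torch
import torch.nn.functional as F

def world_model_loss(model, batch, gamma=0.99, c_dyn=1.0, c_rew=1.0, c_q=1.0):
    z = model.encode(batch.obs)
    z_next_target = model.encode(batch.next_obs).detach()
    z_next_pred = model.dynamics(z, batch.action)

    dyn_loss = F.mse_loss(z_next_pred, z_next_target)                      # latent dynamics
    rew_loss = F.mse_loss(model.reward(z, batch.action), batch.reward)     # reward prediction

    with torch.no_grad():
        a_next = model.pi(z_next_target)
        td_target = batch.reward + gamma * (1 - batch.done) * model.Q_target(z_next_target, a_next)
    q_loss = F.mse_loss(model.Q(z, batch.action), td_target)               # TD value loss

    return c_dyn * dyn_loss + c_rew * rew_loss + c_q * q_loss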

In summary, although our world model is built from standard components, our framework improves their learning through the proposed bootstrap alignment mechanism. These enhancements contribute directly to the improved performance and stability observed in BOOM.

2. Comparison with offline RL (Weakness 2)

The entire actor divergence problem is similar to the problem of distribution shift in offline RL. The proposed design is basically a policy regularization by following a very similar way as in offline RL where one tries to regularize the policy update to be closer to the behavior policy for the offline dataset. In particular, the likelihood-free alignment metric in eq.3 has been widely used in offline RL to directly use the offline data without modeling the offline behavior policy

We thank you for raising this important point regarding the similarity between the actor divergence problem in planning-driven MBRL and the distribution shift challenge in offline RL. While there are some surface-level resemblances—such as the use of policy regularization to keep the updated policy close to a behavior policy—the motivations and practical implications differ fundamentally between the two settings.

(1) Focus of algorithm design

  • Offline RL centers on a fixed offline dataset, relying heavily on behavior cloning (BC) to restrict the policy within the dataset’s support. To improve policy performance, offline RL cautiously relaxes BC for poor actions, but always aims to keep the policy close to known data to avoid extrapolation error.

  • BOOM centers on the policy and seeks to maximize Q-values and align with planner-generated actions, both aimed at improving policy performance. Ultimately, the policy and planner co-adapt and converge to the optimal solution through online interaction.

(2) Relationship between two policy objectives (max Q and min KL)

  • In offline RL, maximizing Q-values and BC often conflict: Q-values outside the dataset distribution cannot be accurately estimated, so maximizing Q risks pushing the policy toward unsupported actions; BC pulls the policy back toward known actions. This tension forces a conservative balance.

  • In BOOM, the two objectives are largely complementary: the planner generally produces higher-quality actions than the policy. Policy alignment with the planner improves performance and strengthens policy-planner consistency, which in turn leads to more accurate Q estimates. More accurate Q-values then enable better maximization by the policy and allow the planner to generate even higher-quality actions. This positive feedback loop bootstraps policy and planner consistently toward faster convergence to the optimal solution.

This distinction highlights that BOOM is not an offline method regularized around a given dataset, but an active, planner-supervised online RL algorithm that enables consistent and progressive improvement. This is also supported by our ablation experiments on the alignment coefficient: performance remains good and stable across a wide range of coefficient values.

Concise summary for quick understanding

Offline RL
1. (Fixed dataset) Static data collected from a behavior policy.
2. (Policy regularization) Constrain the policy to stay close to the dataset, which leaves little room for improvement by maximizing Q-values.

BOOM (Online RL)
1. (Policy) Align with the planner's higher-quality actions for better behavior than max-Q alone.
2. (World model) Better distribution matching improves accuracy.
3. (Planner) The improved model yields higher-quality actions for sampling.
4. (Policy) High-quality data enhances policy training and completes the bootstrap loop.

3. Comparison with advantage weighted policy regularization in offline RL (Weakness 3)

The soft Q-weighted mechanism also shares a similar idea advantage weighted policy regularization in offline RL.

For the advantage-weighting mechanism and the offline RL algorithm AWAC, we have supplemented comparative experiments, an ablation analysis, and a discussion in our response to Reviewer HgMu (the second reviewer). Results indicate that advantage-weighting is slightly less effective, mainly due to over-selectivity. AWAC performs significantly worse, likely because it relies solely on a weighted imitation loss, whereas our BOOM benefits from combining the Q-value loss with the alignment loss, which is better suited for online RL.

4. Clarify contributions (Q1)

While the performance looks promising, it is hard to identify new technical contributions from this paper to the research community. I would consider changing my score if the novelty can be well justified.

Thank you for your candid and constructive feedback. While our method builds upon existing techniques in MBRL and offline RL, we believe the primary contribution of our work lies in clearly identifying and effectively addressing a critical yet under-explored issue in planning-driven MBRL — the divergence between the actor and the planner. This problem, though fundamental, has not received adequate attention in prior research.

Our contributions are justified as follows:

(1) Bottleneck identification in planning-based MBRL. We highlight actor divergence as an under-explored challenge in planning-based MBRL. While similar to the distributional shift issue in offline RL, this problem arises even in fully online and model-based contexts, and has distinct implications for model and policy learning.

(2) Simple yet effective solution. We propose a bootstrap mechanism that is easy to implement and integrates seamlessly with existing components, effectively aligning the policy with planner actions to improve stability and performance.

(3) Shift in focus. Rather than improving the structure and training loss of world models or adopting a stronger planner, we focus on closing the gap between policy and planner while still maximizing Q-values. This complements existing directions in MBRL and offers a new angle for further performance improvement.

(4) Strong empirical evidence. We validate BOOM on multiple challenging benchmarks, where it consistently delivers stable performance gains. These results support our hypothesis that alignment is key to reliable planning-based learning.

Summary: A minimalist yet powerful approach for planning-driven MBRL. Our method aims to improve planning-driven MBRL through a simple bootstrap alignment mechanism that requires minimal changes to existing components. This straightforward and easy-to-implement approach consistently delivers strong and stable performance gains without introducing complex modifications or additional computational overhead.

We believe this minimalistic yet effective strategy represents a promising direction for further research. We genuinely hope this work can serve as a valuable contribution to the community, inspiring others to explore not only similarly simple yet impactful ideas, but also more sophisticated and powerful solutions in planning-driven MBRL.

5. Conclusion

We have clarified how model learning benefits from our method, elaborated on the differences from offline RL, and highlighted our contributions. We hope our responses address your concerns and kindly request your consideration for a higher score.

Comment

Dear Reviewer HqWU:

Thank you very much for taking the time out of your busy schedule to review our paper. We have made a concerted effort to address all concerns in detail and sincerely hope that our responses are satisfactory.

We genuinely hope that our work, BOOM, can serve as a meaningful contribution to the RL community—encouraging not only the pursuit of similarly simple yet impactful ideas, but also the development of more sophisticated and powerful planning-driven MBRL solutions.

We look forward to continuing thoughtful and engaging academic discussions with you. If you have any further questions or concerns, please feel free to let us know—we would be happy to conduct additional experiments or analyses to provide further clarification.

Comment

Reviewer HqWU,

The authors have provided detailed responses to your concerns. Could you please confirm whether your concerns have been fully addressed? Your participation in the discussion is important to ensure a fair and thorough evaluation.

AC

Comment

Thank the authors for rebuttal and clarifications. While I appreciate the identification of the actor divergence problem, I am still not convinced by the technical novelty. I totally agree that using the policy regularization in this paper has different motivations and implications compared to that in offline RL, because the setups are entirely different. However, the techniques including the reverse KL based policy regularization and weighted Q based coefficients are not novel per se.

Moreover, directly comparing with the offline algorithm AWAC, which is specifically designed for offline RL, in the experiments does not make sense to me, as the current paper studies online planning and BOOM is an online algorithm. It is expected that AWAC will not work well under the setup in this paper, not to mention the additional Q-value loss used in BOOM as pointed out by the authors.

Instead, a more reasonable way to justify the technical novelty of the key alignment loss is to directly compare the policy regularization term, i.e., to modify BOOM by replacing the soft Q-weighted policy regularization with the advantage-weighted policy regularization from AWAC. If this variant is significantly outperformed by the original BOOM, explain the underlying reasons in theory.

Comment

Thank you for your detailed and thoughtful comments. We’re sorry that some parts of our previous response may have led to misunderstandings, and we genuinely appreciate the opportunity to clarify them below.

1. Technical novelty

Thank you for your thoughtful feedback and for recognizing our identification of the actor divergence issue. We fully agree with your points regarding the differing motivations and implications of policy regularization in our work compared to that in offline RL. Indeed, although reverse KL-based regularization and Q-weighted coefficients are not new individually, our contribution lies in identifying their specific utility in mitigating actor divergence and enhancing the accuracy of Q-value learning in planning-driven MBRL, which, to our knowledge, has not been systematically addressed in prior work.

2. Comparison with AWAC

We also appreciate your concern regarding the inclusion of AWAC in our experiments. We agree that AWAC was originally designed for offline RL, and that its performing poorly in an online planning context is expected. While AWAC is not a fully online method, we use it to illustrate how alternative regularization strategies influence performance, and to clarify the distinction between our method and AWAC.

3. Ablation on the weighting mechanism

More importantly, as you rightly pointed out, a more appropriate way to evaluate the effectiveness of our proposed alignment loss is to perform an ablation on the weighting: modifying BOOM by replacing our soft Q-weighted policy regularization with the advantage-weighted policy regularization used in AWAC.

In fact, we have already conducted this exact experiment (in the response to Reviewer HgMu: 3. Ablation on the Advantage-weighting mechanism (Q1)). Due to space limitations in the initial rebuttal, we did not elaborate further. We apologize for the oversight and any confusion it may have caused.

For this suggested ablation, below we present the complete results and insights:

We have explored an alternative weighting scheme based on the advantage function A(s,a) = Q(s,a) - V(s), which is commonly used in actor-critic methods to provide more stable gradients.

Specifically, we implemented an advantage-weighted variant of our method, where V(s) is estimated by averaging multiple samples from the current policy following AWAC. (The Python code is included for clarity):

# Approximation of advantage
# Number of policy samples per state to estimate V more accurately
num_pi_samples = 10

with torch.no_grad():
    qs = self.model.Q(zs, actions_beta)

    vs_samples = []
    for _ in range(num_pi_samples):
        sampled_actions_pi = self.model.pi(zs)
        vs_samples.append(self.model.Q(zs, sampled_actions_pi))
    vs = torch.stack(vs_samples, dim=0).mean(dim=0)

    adv = (qs - vs).detach()

Method | Dog-run TAR
ours (Q-weight) | 819.7 ± 2.9
Advantage-weight | 783.5 ± 4.2

While advantage-weighted imitation can offer more stable gradients, we observed a slight ~5% performance drop. We hypothesize that this is because advantage emphasizes the relative quality of planner actions compared to current policy actions, leading to a more selective imitation signal. However, our objective is not only to imitate actions that are strictly better, but to promote consistent alignment between the policy and the planner over time.

As online RL progresses, both the policy and planner co-evolve toward optimality. Overly selective weighting can hinder this mutual alignment, especially in early phases where the planner is not perfect. This observation is consistent with our earlier ablation on the temperature (τ) of the soft Q-weight (see the response to Weakness 1 of Reviewer HgMu), where a smaller τ also led to degraded performance due to hard-max style weighting.

To conclude, while advantage-weighted imitation offers a more selective filter, our results suggest that such over-selectivity can be counterproductive. It is also worth noting that additional policy sampling is required for advantage estimation. In contrast, the soft Q-weight mechanism provides a better balance between stability and performance. Given the observed performance degradation and the additional computational burden, we find advantage-based weighting to be unsuitable for our setting.

This ablation supports our core claim that the proposed alignment loss—not just the choice of likelihood-free KL, but its soft-Q weighting formulation—is critical for stabilizing policy and improving performance in BOOM.

4. Conclusion

Thank you again for the valuable suggestions. We hope that our clarifications and additional experiments have addressed your concerns, and we would greatly appreciate your reconsideration for a higher score.

Comment

Thank you for the clarifications, and I will update my score. However, adapting a technique to a specific context is far more meaningful than merely applying it or pointing out its usage. In the final revision, I believe it is important to revise Section 3.2 by: (1) appropriately referencing reverse KL and policy regularization, as they are not entirely new; and (2) clearly motivating and justifying the design of the soft Q-weighted strategy in relation to the previously used advantage-based strategy.

Comment

Thank you for the updated score and constructive feedback. We will revise Section 3.2 as you suggested, ensuring proper references and clearer motivation. We greatly appreciate your time and valuable input in improving our work.

Best

Review
Rating: 5

This paper introduces BOOM (Bootstrap Off-policy with World Model), a framework designed to resolve the "actor divergence" problem in planning-driven model-based reinforcement learning (MBRL). The core issue is that the online planner used for data collection acts as a different, more powerful actor than the learned policy, leading to a distributional shift that degrades both value and policy learning. BOOM addresses this through a tight bootstrap loop where the policy initializes the planner, and the planner's refined actions are used to guide the policy via behavior alignment.

Strengths and Weaknesses

Strengths:

  • The paper provides a clear diagnosis of the actor divergence problem inherent in using online planners for data collection. The proposed solution, which utilizes a likelihood-free forward KL loss to align the policy with the planner's actions, is an elegant and effective way to bridge this gap without requiring the modeling of the complex, non-parametric planner distribution.
  • BOOM demonstrates significant performance gains over strong and recent baselines, including TD-MPC2, BMPC, and DreamerV3.

Weaknesses:

  • The soft Q-weighting mechanism introduces a temperature hyperparameter τ. In similar softmax-based weighting schemes, this parameter can be highly influential, controlling the "hardness" of the weighting. The paper lacks an ablation study on τ, leaving its sensitivity unevaluated.

Questions

  1. In Figure 4(b), the authors investigated the importance of soft Q-weighting mechanism. Have the authors considered making ablation studies on alternative weighting schemes, such as using the advantage A(s, a), which might provide more stable learning signals?

  2. In equation 7, the bound of the expected Q-value depends on L_Q. Have the authors considered explicitly or implicitly controlling this value, such as through gradient clipping or a regularization term?

  3. Equation (13) in AWAC (https://arxiv.org/abs/2006.09359) proposes a weighted behavior cloning objective that appears similar to Equation (4) in this paper. Could the authors clarify whether these objectives are equivalent or highlight the key differences between them?

Limitations

Yes, it's in Appendix D.

Final Justification

My major concerns about not using the advantage function, but rather soft Q, and the difference to AWAC have been addressed.

Formatting Issues

Formatting is good.

Author Response

We sincerely appreciate your detailed and constructive feedback. Below are our responses to your concerns.

1. Ablation on the soft Q-weight temperature τ (Weakness 1)

The soft Q-weighting mechanism introduces a temperature hyperparameter τ. In similar softmax-based weighting schemes, this parameter can be highly influential, controlling the "hardness" of the weighting. The paper lacks an ablation study on τ, leaving its sensitivity unevaluated.

We thank you for highlighting the importance of the temperature parameter τ, which controls the sharpness of the softmax distribution over Q-values. A smaller τ results in a distribution closer to hard-max, while a larger τ leads to more uniform weighting.

We conducted an ablation study on the Dog-run task by varying τ from 0.1 to 10.0:

Temperature | TAR
τ = 0.1 | 781.2 ± 8.1
τ = 0.5 | 806.3 ± 5.2
τ = 1.0 (default) | 819.7 ± 2.9
τ = 2.0 | 823.4 ± 2.7
τ = 10.0 | 799.4 ± 4.6

The results show that performance remains stable across a broad range of τ values. Notably, a slightly higher τ (e.g., 2.0) can lead to marginal performance improvements. In contrast, setting τ too low results in performance degradation due to overly selective weighting.
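
For clarity, a minimal sketch of one way to compute such a temperature-controlled soft Q-weighting is shown below; the normalization choices are assumptions for illustration only.

import torch

def soft_q_weights(q_values, tau=1.0):
    # Map Q estimates of planner actions to normalized weights.
    # Small tau approaches a hard max over the batch; large tau is near-uniform.
    q_norm = (q_values - q_values.mean()) / (q_values.std() + 1e-6)
    weights = torch.softmax(q_norm / tau, dim=0)
    return weights * q_values.shape[0]   # rescale so the mean weight is ~1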

2. Ablation on the gradient clipping norm (Q2)

In equation 7, the bound of the expected Q-value depends on L_Q. Have the authors considered explicitly or implicitly controlling this value, such as through gradient clipping or a regularization term?

We agree that gradient clipping is a practical and effective strategy to mitigate instability during Q-function updates. As suggested, we explicitly investigated its role in our setting. Following TD-MPC2, we initially adopted a clipping norm of 20, as listed in Table 2 of Appendix B.2.

To evaluate the sensitivity of our method to this choice, we conducted an ablation study on the Dog-run task by varying gradient clipping norm from 2 to 200:

Gradient clipping norm | TAR
20 (default) | 819.7 ± 2.9
2 | 813.4 ± 4.7
200 | 805.6 ± 3.8

These results reveal the following:

  • A moderate clipping threshold of 20 yields the best performance and aligns with the default setting in TD-MPC2.
  • A smaller norm (2) still maintains reasonable stability, suggesting some robustness to tighter control.
  • A large norm (200) weakens the effect of clipping, allowing larger gradients that potentially destabilize training, leading to performance degradation.

Overall, this confirms that clipping is an effective and necessary component, and the default choice inherited from TD-MPC2 is already close to optimal.

3. Ablation on the Advantage-weighting mechanism (Q1)

In Figure 4(b), the authors investigated the importance of soft Q-weighting mechanism. Have the authors considered making ablation studies on alternative weighting schemes, such as using the advantage A(s, a), which might provide more stable learning signals?

Yes, we have explored an alternative weighting scheme based on the advantage function A(s,a) = Q(s,a) - V(s), which is commonly used in actor-critic methods to provide more stable gradients.

Specifically, we implemented an advantage-weighted variant of our method, where V(s) is estimated by averaging multiple samples from the current policy (Python code included for clarity):

# Approximation of advantage
# Number of policy samples per state to estimate V more accurately
num_pi_samples = 10

with torch.no_grad():
    qs = self.model.Q(zs, actions_beta)

    vs_samples = []
    for _ in range(num_pi_samples):
        sampled_actions_pi = self.model.pi(zs)
        vs_samples.append(self.model.Q(zs, sampled_actions_pi))
    vs = torch.stack(vs_samples, dim=0).mean(dim=0)

    adv = (qs - vs).detach()

Method | TAR | Training time
ours (Q-weight) | 819.7 ± 2.9 | 45.6h
Advantage-weight | 783.5 ± 4.2 | 46.7h

While advantage-weighted imitation can offer more stable gradients, we observed a slight ~5% performance drop and a ~2% increase in training time due to additional policy sampling. We hypothesize that this is because advantage emphasizes the relative quality of planner actions compared to current policy actions, leading to a more selective imitation signal. However, our objective is not only to imitate actions that are strictly better, but to promote consistent alignment between the policy and the planner over time.

As online RL progresses, both the policy and planner co-evolve toward optimality. Overly selective weighting can hinder this mutual alignment, especially in early phases where the planner and value are not perfect. This observation is consistent with our earlier ablation on the temperature (τ) of the soft Q-weight, where a smaller τ also led to degraded performance due to more hard-max style weighting.

In summary, while advantage-weighted imitation offers a more selective filter, our results suggest that such over-selectivity can be counterproductive, leading to degraded performance. In contrast, the soft Q-weight mechanism provides a better balance between stability and performance. Given its effectiveness and lower overhead, it serves as a practical and sufficient choice, making advantage-based filtering unnecessary in our setting.

4. Comparison with Offline RL algorithm AWAC (Q3)

Equation (13) in AWAC (https://arxiv.org/abs/2006.09359) proposes a weighted behavior cloning objective that appears similar to Equation (4) in this paper. Could the authors clarify whether these objectives are equivalent or highlight the key differences between them?

(1) Conceptual differences:
AWAC is mainly designed for offline RL, aiming to balance behavioral cloning with performance optimization under the constraint of a fixed dataset. By applying Lagrangian relaxation analysis, AWAC derives a theoretical result stating that the optimal policy satisfies \pi^*(a|s) \propto \pi_\beta(a|s)\exp\left(\frac{1}{\lambda}A(s,a)\right). Consequently, at the algorithmic level, AWAC directly relaxes the imitation objective using advantage-based weighting. This approach constrains the learned policy within the dataset coverage while allowing relaxation of imitation on poor actions, enabling slightly better performance than pure behavioral cloning.

L_\text{AWAC} = \sum w_A \,\text{KL}(\beta \,\|\, \pi)

In contrast, BOOM operates in an online RL setting where fresh planner rollouts are continuously available. Both the policy and planner are jointly optimized toward the optimal solution over time, enabling BOOM to focus more on maintaining alignment, rather than solely on selectively imitating only the highest-advantage actions.

$L_\text{BOOM} = Q(s, a_\pi) + \lambda_\text{align} \sum w_Q\,\mathrm{KL}(\beta \,\|\, \pi)$
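For illustration only, the two objectives can be contrasted in code. The sketch below assembles a BOOM-style loss by reusing the conventions of the snippets in this thread (`self.model.Q`, `self.model.pi`, `self.model.log_pi_action`); `tau` and `lambda_align` are hypothetical config names, and the alignment term is written as a generic weighted log-likelihood surrogate rather than the exact likelihood-free loss used in BOOM:

# Sketch of a BOOM-style policy loss: keep the Q-maximization term and add a
# soft Q-weighted alignment term toward planner actions (illustrative only).
with torch.no_grad():
    qs_beta = self.model.Q(zs, actions_beta)                 # values of planner actions
    w_q = torch.softmax(qs_beta / self.cfg.tau, dim=0)       # soft Q-weights

q_term = -self.model.Q(zs, self.model.pi(zs)).mean()         # maximize Q(s, a_pi)
log_pis_action = self.model.log_pi_action(zs, actions_beta)  # log pi(a_beta | s)
align_term = -(w_q * log_pis_action).mean()                  # weighted imitation surrogate
pi_loss = q_term + self.cfg.lambda_align * align_term

The AWAC-style loss shown later in this response keeps only the (exponentiated-advantage-)weighted imitation term, which is the main structural difference.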

(2) Empirical comparison:
Since AWAC fundamentally employs an imitation-style loss similar to BMPC, and BMPC incorporates a re-planning technique that refreshes planner actions in the replay buffer, we extended the comparison by adding re-planning variants for both AWAC and our BOOM (denoted as AWAC + replan and BOOM + replan). This ensures a fair comparison regarding the use of up-to-date planner data for policy alignment.

The Python code is provided for clarity, with the awac_lambda value set to 1.0, following the original paper’s setting.

# AWAC-style weighted imitation loss
# `adv` is computed as in the advantage snippet above; awac_lambda is set to 1.0
weights = torch.exp(adv / self.cfg.awac_lambda)
log_pis_action = self.model.log_pi_action(zs, actions_beta)   # log pi(a_beta | s)
pi_loss = (- weights * log_pis_action).mean()
| Method | TAR | Training time |
| --- | --- | --- |
| BOOM (ours) | 819.7 ± 2.9 | 37.2h |
| BOOM + replan | 860.1 ± 2.5 | 56.3h (+51.3%) |
| AWAC | 382.4 ± 37.8 | 39.6h |
| AWAC + replan | 497.3 ± 27.5 | 59.1h (+49.2%) |

(3) Results analysis: As shown in the results, AWAC performs significantly worse than BOOM (the performance gap is so large that we did not run further ablations on the hyper-parameter awac_lambda). Although both methods benefit from re-planning, with AWAC exhibiting a larger relative improvement, AWAC still lags far behind BOOM in absolute performance. It is important to note that re-planning incurs approximately 50% additional training time due to the overhead of repeated planning. Remarkably, BOOM achieves strong performance even without re-planning. This advantage likely stems from BOOM's combined use of the Q-function loss and the imitation loss, which better supports learning in the online RL setting than AWAC's sole reliance on a weighted imitation loss.

In summary, although AWAC and BOOM share conceptual similarities, differences in their practical settings and algorithmic designs make AWAC less suitable for our online planning-driven MBRL framework. Notably, BOOM retains the original Q-function loss alongside its purpose-built bootstrap alignment, which enhances learning effectiveness. Therefore, BOOM offers a more effective and computationally efficient approach for maintaining policy-planner consistency and achieving superior performance.

We appreciate your insightful question and will clarify these distinctions in the revised manuscript.

5. Conclusion

We sincerely appreciate your detailed and constructive feedback. We have added the requested experiments on the soft Q-weight temperature, the gradient clipping norm, advantage-based weighting, and the comparison with AWAC. We hope our responses address your concerns and kindly request your consideration for a higher score.

Comment

Thanks for the detailed response and additional experiments. The ablations on $\tau$, gradient clipping, and, especially, advantage-weighted imitation address my concerns and clarify the robustness of your design choices. The comparison to AWAC also makes the distinctions clear in both motivation and empirical performance. Overall, I will increase my rating to 'accept'.

Comment

Thank you very much for your detailed review and positive feedback. Your encouragement means a lot to us. We’re glad that the additional experiments and clarifications addressed your concerns and helped clarify the robustness of our design. We are actively preparing a revised version with open-sourced code and documentation to encourage future research and wider adoption. Thanks again for your support!

Review
4

This paper studies off-policy learning. It uses a model-based approach to address the off-policy problem, where the sample distribution of the behaviour policy does not match that of the target policy. The world model is a latent dynamics model, learned jointly with policy optimization.

However, this class of planning-driven model-based RL algorithms inevitably suffers from a fundamental issue known as actor divergence: the data used for learning is collected by the planner, which acts as a different actor from the policy network [37].

This is just the off-policy learning problem [Sutton & Barto, 1999]. Although not proposed by this paper, why do you need a new name for it? Actor divergence is a confusing name.

This issue leads to two problems: (1) Distribution shift in value learning: again, this is the classical problem of off-policy learning; see [GTD, TDC, TDRC].

Inevitable Actor Divergence When Off-policy RL Meets Online Planning: this is a confusing subsection title.

Section 3.1: this latent dynamics model started from earlier research, e.g., ACE. There is no literature survey (not even a discussion) of world models. This is one shortcoming of the paper.

Another weakness of the paper is that it only compares with DreamerV3, TD-MPC and one of its variants.

Strengths and Weaknesses

Results on DMC (14 tasks): some domains are very strong, including Humanoid-run, Dog-run, H1hand-walk, etc. However, in some domains the advantages are small or tied, such as Dog-walk, Dog-trot, and Dog-stand. Continuous Control with Tree Search is a good baseline to compare against too.

It's surprising that an off-policy learning paper missed gradient TD methods, which discuss the sample-distribution mismatch and importance sampling from the early days of RL. GTD: https://scholar.google.ca/citations?view_op=view_citation&hl=en&user=BXUTo4AAAAAJ&citation_for_view=BXUTo4AAAAAJ:qjMakFHDy7sC

TDC:

https://scholar.google.ca/citations?view_op=view_citation&hl=en&user=BXUTo4AAAAAJ&citation_for_view=BXUTo4AAAAAJ:u-x6o8ySG0sC

TDRC: https://arxiv.org/abs/2007.00611

ACE: https://arxiv.org/pdf/1811.02696 (ACE: An Actor Ensemble Algorithm for Continuous Control with Tree Search)

Questions

Have you surveyed other model-based DRL algorithms?

Limitations

n/a

Formatting Issues

n/a

Author Response

We sincerely appreciate your detailed and constructive feedback. Below are our responses to your concerns.

1. Clarify confusing names

This is just the off-policy learning problem [Sutton & Barto, 1999]. Although not proposed by this paper, why do you need a new name for it? Actor divergence is a confusing name. This issue leads to two problems: (1) Distribution shift in value learning: again, this is the classical problem of off-policy learning; see [GTD, TDC, TDRC]. (2) Inevitable Actor Divergence When Off-policy RL Meets Online Planning: this is a confusing subsection title.

Regarding "actor divergence," we understand that for experienced RL researchers this term may appear redundant with known off-policy challenges. However, in our planning-driven MBRL setting, we face a specific form of divergence between the policy and the planner: both serve as "actors" but generate data from different distributions. Our goal was to make this divergence intuitive to readers less familiar with the classical theory. We will explicitly link this terminology to the traditional off-policy learning problem to avoid confusion.

For the "Distribution shift in value learning," we fully agree that this is a classical off-policy learning problem, extensively studied in foundational works such as GTD, TDC, and TDRC. We will cite the recommended literature to clarify the connection.

Finally, regarding the subsection title "Inevitable Actor Divergence When Off-policy RL Meets Online Planning," our goal was to highlight the intrinsic tension that arises when off-policy RL is combined with model-based planning. We also agree that the title could be clearer—something like “Distribution Shift in Planning-Driven Model-Based RL” may better reflect the core challenge being addressed.

2. Discussion of GTD methods

It's surprising that an off-policy learning paper missed gradient TD methods, which discuss the sample-distribution mismatch and importance sampling from the early days of RL.

We acknowledge the importance of GTD, TDC, and TDRC for stable off-policy learning with linear function approximation. While these are foundational, our method uses deep RL with nonlinear networks, where accurate TD learning typically relies on empirical tricks such as target networks. We will include citations and briefly discuss these classical approaches for completeness and historical context.
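For completeness, the linear TDC update we have in mind takes the following form (as we recall it, with features $\phi_t$, TD error $\delta_t$, value weights $\theta$, auxiliary weights $w$, and step sizes $\alpha, \beta$; off-policy variants additionally scale the update by the importance ratio $\rho_t$):

$\theta_{t+1} = \theta_t + \alpha\big[\delta_t \phi_t - \gamma\,\phi_{t+1}(\phi_t^\top w_t)\big], \qquad w_{t+1} = w_t + \beta\big(\delta_t - \phi_t^\top w_t\big)\phi_t$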

3. Survey on world models

Section 3.1: this latent dynamics model started from earlier research, e.g., ACE. There is no literature survey (not even a discussion) of world models. This is one shortcoming of the paper.

We appreciate this point and will expand the related work. Below, we categorize world models into three major types based on their objectives and roles in policy learning:

1. Information-preserving latent models. These methods aim to learn a fully informative latent representation of the environment by combining the dynamics objective with auxiliary reconstruction losses. This ensures that the latent space retains all relevant features from high-dimensional observations, supporting accurate imagination. Notably, PlaNet [1] and Dreamer [2] use stochastic recurrent latent dynamics models (e.g., RSSMs), which are trained with an additional decoder to reconstruct the original observations. Although this decoder is discarded at test time, the reconstruction loss regularizes the latent space, improving model stability and generalization. These models are typically used in imagination-driven MBRL, where the policy and value functions are trained entirely on imagined rollouts.

2. Task-aligned latent models trained via one unified TD loss. The latent space in these methods is not required to be fully reconstructive; instead, it only needs to encode task-aligned information critical for decision-making, enabling greater efficiency and stronger performance in high-dimensional control. Early examples include ACE [3], VPN [4] and TreeQN [5], which perform latent tree search over value predictions without decoding observations. More recent approaches such as TD-MPC [6] adopt a unified TD loss, where the encoder, dynamics, reward, and value functions are jointly optimized (a minimal sketch of this unified objective is given below). This family of methods is well-suited for planning-driven MBRL, where latent rollouts are used for online planning.

3. Trajectory-level sequence models for offline RL. A third category treats RL as a sequence modeling problem, without explicitly learning a policy. Instead, models such as Decision Transformer [7] and Trajectory Transformer [8] are trained offline to autoregressively predict future tokens (states, actions, rewards) conditioned on past trajectories and return-to-go. These methods rely on powerful sequence models (e.g., Transformers) to capture behavioral patterns and temporal dependencies. They do not require a separately defined policy network—the Transformer itself acts as a conditional policy, directly outputting the next action given the desired return and current observations. Extending this trend to the pixel level, video generation models like Sora [9] predict future frames directly from visual inputs using large-scale spatiotemporal Transformers. While not trained as agents, they serve as high-capacity world simulators and open new avenues for embodied planning in visually complex domains.

Our approach follows the second category: we adopt the latent model from TD-MPC [6] and improve its learning by introducing a bootstrap alignment mechanism that explicitly encourages consistency between the policy and the planner. This consistency suppresses distributional shift and leads to more accurate model learning.
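To make the second category concrete, below is a minimal sketch of such a unified objective, loosely in the spirit of TD-MPC; the module names (`encoder`, `dynamics`, `reward_head`, `Q`, `Q_target`, `pi`), the loss weights, and the tensor shapes are illustrative assumptions rather than the actual implementation:

import torch
import torch.nn.functional as F

def unified_model_loss(model, obs, actions, rewards, next_obs,
                       gamma=0.99, c_cons=1.0, c_rew=1.0, c_val=1.0):
    # obs: [H+1, B, ...]; actions, rewards, next_obs: [H, B, ...] (shapes are illustrative).
    H = actions.shape[0]
    z = model.encoder(obs[0])
    cons_loss = rew_loss = val_loss = 0.0
    for t in range(H):
        z_pred = model.dynamics(z, actions[t])           # predicted next latent
        r_pred = model.reward_head(z, actions[t])        # predicted reward
        q_pred = model.Q(z, actions[t])                  # predicted action value
        with torch.no_grad():
            z_target = model.encoder(next_obs[t])        # latent-consistency target
            a_next = model.pi(z_target)
            td_target = rewards[t] + gamma * model.Q_target(z_target, a_next)
        cons_loss = cons_loss + F.mse_loss(z_pred, z_target)
        rew_loss = rew_loss + F.mse_loss(r_pred, rewards[t])
        val_loss = val_loss + F.mse_loss(q_pred, td_target)
        z = z_pred                                       # roll forward in latent space
    return (c_cons * cons_loss + c_rew * rew_loss + c_val * val_loss) / H

Because the encoder, dynamics, reward, and value heads all receive gradients from this single objective, the latent space only needs to retain task-relevant information rather than everything required for reconstruction.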

4. Add tree-search baseline

Another weakness of the paper is that it only compares with DreamerV3, TD-MPC and one of its variants.

Continuous Control with Tree Search: this is a good baseline to compare against too.

Following ACE [3], we implement a multi-actor tree-search variant as a new baseline. Specifically, we roll out multiple candidate trajectories by alternately using multiple actors in a tree-unfolding manner and select the best one before refinement with MPPI. Results are shown below:

| Method | Dog-run |
| --- | --- |
| BOOM | 819.7 ± 2.9 |
| BOOM + Tree-search | 817.4 ± 7.4 |

Note: we now use 5 seeds, so the result of BOOM has been updated accordingly.

As shown, tree search does not bring significant improvements, likely for two main reasons: (1) modern deep RL often uses stochastic policies (e.g., diagonal Gaussian), so multiple actors produce behaviors with limited diversity; (2) even when minor differences exist among actors, the MPPI planner further refines and smooths the actions, resulting in similar outputs. Therefore, tree search yields minimal performance gain while adding computational overhead. For these reasons, we adopt a single shared policy for simplicity and efficiency.
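For reference, a simplified sketch of the multi-actor tree-search variant evaluated above; the branching scheme, helper names (`reward_head`, `dynamics`, `Q`, `pi`), and the batch-free shapes are illustrative assumptions, and in our experiments the selected candidate is subsequently refined with MPPI:

import torch

def tree_search_first_action(model, actors, z0, depth=3, gamma=0.99):
    # Each branch is (current latent, first action of the branch, discounted return so far).
    branches = [(z0, None, 0.0)]
    for d in range(depth):
        expanded = []
        for z, first_a, ret in branches:
            for actor in actors:                          # branch on each actor's proposal
                a = actor(z)
                r = model.reward_head(z, a)
                z_next = model.dynamics(z, a)
                expanded.append((z_next,
                                 a if first_a is None else first_a,
                                 ret + (gamma ** d) * float(r)))
        branches = expanded
    # Bootstrap each leaf with the learned value and return the best branch's first action.
    best = max(branches,
               key=lambda b: b[2] + (gamma ** depth) * float(model.Q(b[0], model.pi(b[0]))))
    return best[1]

Since all actors here are samples from similar stochastic policies, the branches quickly become near-duplicates, which is consistent with the limited gains reported above.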

5. Clarify small advantages on some domains

Some domains are very strong, including Humanoid-run, Dog-run, H1hand-walk, etc. However, in some domains the advantages are small or tied, such as Dog-walk, Dog-trot, and Dog-stand.

Indeed, simpler tasks such as Dog-stand, Dog-walk, and Dog-trot have lower difficulty (specifically, lower speed requirements) and smaller performance gaps across methods. Baselines already achieve near-optimal returns (>900), leaving little headroom for improvement.

Nonetheless, our method maintains performance advantages in these tasks while providing more stable training. Importantly, on the most challenging high-speed run tasks, our approach yields significant gains, highlighting its robustness under harder control demands.

6. Survey on other model-based RL methods

Have you surveyed other model-based DRL algorithms?

Beyond imagination-based and planning-based methods, another emerging paradigm involves synthesizing training data. These methods generate additional trajectories or transitions to support policy learning. For example, Diffuser [10] uses a diffusion model to iteratively denoise trajectories for planning. Synthetic Experience Replay (SER) [11] generates additional replay data via diffusion-based upsampling. Prioritized Generative Replay (PGR) [12] improves data efficiency by generating high-value transitions independent of the current policy. However, these approaches often rely on training large generative models and have mostly been applied to low-dimensional tasks (e.g., Cheetah, Reacher), without showing consistent gains over standard planning-based methods. For these reasons, we did not include them in our comparison, though we consider them complementary and promising directions for future work.

[1] D. Hafner et al. Learning latent dynamics for planning from pixels. ICML, 2019.

[2] D. Hafner et al. Mastering diverse control tasks through world models. Nature, 2025.

[3] S. Zhang et al. ACE: An Actor Ensemble Algorithm for Continuous Control with Tree Search. AAAI, 2019.

[4] J. Oh et al. Value prediction network. NeurIPS, 2017.

[5] G. Farquhar et al. TreeQN and ATreeC: Differentiable Tree-Structured Models for Deep RL. ICLR, 2018.

[6] N. Hansen et al. TD-MPC2: Scalable, Robust World Models for Continuous Control. ICLR, 2024.

[7] L. Chen et al. Decision Transformer: Reinforcement Learning via Sequence Modeling. arXiv, 2021.

[8] M. Janner et al. Offline Reinforcement Learning as One Big Sequence Modeling Problem. NeurIPS, 2021.

[9] T. Brooks et al. Video generation models as world simulators. OpenAI Blog, 2024.

[10] M. Janner et al. Planning with diffusion for flexible behavior synthesis. ICML, 2022.

[11] C. Lu et al. Synthetic experience replay. NeurIPS, 2023.

[12] R. Wang et al. Prioritized generative replay. ICLR, 2025.

7. Conclusion

We have clarified the confusing names, added the requested tree-search baseline, and expanded the literature review on world models and other model-based methods. We hope our responses address your concerns and kindly request your consideration for a higher score.

Comment

The authors did a good job in their rebuttal. A few issues are clarified and improved (such as the discussions of the literature on off-policy learning and world models). I can increase the score a bit. The performance of the algorithms may still be improved, though.

So I would recommend this paper be "accept" or "weak accept". I've read most of the other reviews, and it seems "weak accept" is the most common vote.

Comment

Thank you for the positive comments. We're glad the clarifications were helpful. We genuinely hope to contribute BOOM to the RL community, and we are working to prepare a well-polished revision, including open-sourcing the code and documentation, to support and encourage future research. We would be truly grateful if you would consider raising the score!

Final Decision

The paper introduces BOOM (Bootstrap Off-policy with World Model), a model-based reinforcement learning (MBRL) framework designed to mitigate the actor divergence problem in planning-driven RL. Actor divergence refers to the mismatch between the behavior data collected using a planner (often more powerful than the learned policy) and the policy itself, leading to distribution shift and degraded policy/value learning. BOOM addresses this by (i) bootstrapping the planner from the current policy, and (ii) aligning the policy toward the planner through a likelihood-free KL regularization, with additional soft Q-weighting that prioritizes high-value actions. The authors provide theoretical analysis supporting their formulation and demonstrate empirical results across tasks from DeepMind Control Suite and Humanoid benchmarks. The method shows competitive or superior performance against strong baselines such as DreamerV3, TD-MPC2, and BMPC.

Strengths: The reviewers agree that the paper provides a clear articulation of the mismatch between planning and policy learning, a central issue in planning-driven MBRL. The proposed bootstrap alignment mechanism (initializing the planner from the policy and then aligning the policy back toward planner actions) is conceptually simple, easy to implement, and empirically effective in improving stability and performance. In addition, the paper includes a theoretical analysis of the policy regularization design, which strengthens the overall contribution.

Weaknesses: Several reviewers noted that the approach shows strong parallels to offline RL methods, particularly regularized policy learning with likelihood-free metrics and advantage weighting, as well as to BMPC, which also aligns the policy toward planner behavior. This raises questions about the degree of incremental novelty. The submission also lacks sufficient discussion of connections to the advantage-weighted mechanism in AWAC. Experimental evaluation was limited in scope, with comparisons restricted to only a few baselines (DreamerV3, TD-MPC2, BMPC) while omitting other relevant algorithms such as continuous control with tree search. Additional concerns included the use of only three random seeds, the absence of hyperparameter sensitivity analysis, and the design choice of whether planner actions are sampled during training or drawn from the replay buffer for alignment.

The authors addressed many of these concerns during the rebuttal. They clarified the distinction between their use of soft Q-weighting and advantage functions, as well as the differences from AWAC. They also added comparisons with a multi-actor tree-search baseline and explained why BOOM outperforms it. The number of random seeds was increased from 3 to 5 across all benchmark tasks to validate the statistical significance of the results. Additional ablations were provided, including an analysis of the temperature parameter $\tau$ in soft Q-weighting and a comparison of using replay buffer actions versus replanning for alignment, which alleviated the concerns about the novelty claim relative to BMPC.

There is broad consensus that the paper is technically solid, clearly written, and supported by strong empirical results. For the revision, the authors are encouraged to better situate the contribution relative to prior work (particularly BMPC and AWAC) and expand experimental validation.