Reflect-then-Plan: Offline Model-Based Planning through a Doubly Bayesian Lens
Abstract
Reviews and Discussion
This paper incorporates Bayesian uncertainty estimation into offline model-based planning to improve the adaptivity and generalization ability of offline-trained policies. Empirical results are shown to demonstrate the claimed advantages of the proposed method.
Strengths
- The incorporation of uncertainty estimation into the offline model-based planning framework is well done, mathematically. As far as I know, this is the first work to do so under the variational inference framework.
- Accounting for changes during deployment in the real environment is important in practice. In this sense, this work is well motivated.
- The empirical performance is promising and well verifies the adaptivity of the proposed method.
Weaknesses
- APE-V (Ghosh et al., 2022) seems like a valid baseline for adaptive offline algorithms; however, the paper does not compare with it, making the evaluation potentially incomplete and less convincing.
- It seems like the hyperparameters need to be carefully tuned for each task, which might limit the usability of the proposed method.
Reference
(Ghosh et al., 2022) "Offline RL Policies Should be Trained to be Adaptive", ICML 2022.
Questions
- Section 5.1: How exactly do you test the prior policy in states sampled from the R dataset?
- Model-based methods are usually computationally expensive in training, and planning is costly when executing actions compared to sampling from a policy network. It seems like RefPlan needs to use an additional VAE network, which may further increase the computational burden. So I wonder what the computational efficiency of the proposed RefPlan method is, in training and in execution, respectively?
- Figure 4: I wonder what the performance will be like when you use the full dataset for LOOP and RefPlan. Maybe continuing the lines in the plots to 1M would help readers see how much and how rapidly the performance degrades when reducing the dataset size.
We thank the reviewer for their thoughtful feedback and for highlighting the strengths of our work, including the principled incorporation of Bayesian uncertainty estimation into offline model-based planning and its motivation for real-world scenarios. Below, we address your questions and concerns in detail.
”APE-V (Ghosh et al., 2022) seems like a valid baseline for adaptive offline algorithms, however the paper does not compare with it”
Thank you for suggesting this baseline. Unfortunately, APE-V does not have an official codebase, making direct comparisons challenging. However, MAPLE, included in our experiments, also claims to learn an adaptive policy in an offline manner using a model-based approach. As shown in Table 1, RefPlan significantly improves the test-time performance of MAPLE, demonstrating its ability to enhance prior policies.
”the hyperparameters need to be carefully tuned for each task, which might limit the usability of the proposed method.”
Thank you for raising this important point. As detailed in Appendix D.2, we tuned five key hyperparameters: the horizon, noise scale, inverse temperature, value penalty, and the number of latent samples. Using grid search, we originally required 240 iterations for hyperparameter tuning.
To evaluate the practicality of this tuning process, we employed Bayesian optimization (BayesOpt) using Weights & Biases. Figure 11 in the appendix compares the number of BayesOpt iterations needed to achieve or exceed the best performance obtained via grid search. Using CQL as the prior policy on MR datasets from Hopper, HalfCheetah, and Walker2d, we observed that as few as 5 to 20 iterations were sufficient to match or surpass the performance of the grid search.
Notably, even the first BayesOpt iteration achieved performance levels comparable to or better than LOOP and the original prior policies. This demonstrates that the computational cost of hyperparameter tuning for RefPlan is manageable in practice.
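For concreteness, below is a minimal sketch of how such a Bayesian-optimization sweep can be configured with Weights & Biases. The hyperparameter names, value ranges, and the `evaluate_refplan` stub are illustrative placeholders, not the actual search space or code used in the paper.

```python
import wandb


def evaluate_refplan(horizon, noise_scale, inverse_temperature,
                     value_penalty, num_latent_samples):
    """Hypothetical stand-in for the actual offline evaluation of RefPlan.

    In practice this would run planning with the given hyperparameters and
    return the D4RL-normalized score; here it only returns a dummy value.
    """
    return 0.0


# Illustrative search space for the five planning hyperparameters.
sweep_config = {
    "method": "bayes",  # Bayesian optimization over the sweep parameters
    "metric": {"name": "eval/normalized_score", "goal": "maximize"},
    "parameters": {
        "horizon": {"values": [5, 10, 20]},
        "noise_scale": {"min": 0.01, "max": 0.5},
        "inverse_temperature": {"min": 0.1, "max": 10.0},
        "value_penalty": {"min": 0.0, "max": 5.0},
        "num_latent_samples": {"values": [1, 4, 8, 16]},
    },
}


def run_trial():
    run = wandb.init()
    cfg = run.config
    score = evaluate_refplan(cfg.horizon, cfg.noise_scale,
                             cfg.inverse_temperature, cfg.value_penalty,
                             cfg.num_latent_samples)
    wandb.log({"eval/normalized_score": score})
    run.finish()


sweep_id = wandb.sweep(sweep_config, project="refplan-tuning")
wandb.agent(sweep_id, function=run_trial, count=20)  # ~5-20 trials sufficed
```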
”Section 5.1: How exactly do you test the prior policy in states sampled from the R dataset?”
Thank you for this question. To evaluate RQ1, we sampled a state randomly from the R dataset. During evaluation, this sampled state was used to override the internal state of the MuJoCo simulator when resetting the environment, allowing us to start the evaluation from this state. For a fair comparison, we used the same random seeds across the compared methods to ensure they are evaluated from the same initial states.
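To make the protocol concrete, here is a minimal sketch (for Hopper, assuming d4rl and gym with mujoco-py) of how a simulator reset can be overridden with a state sampled from the random ("R") dataset. Setting the root x-position to zero and the velocity reconstruction are assumptions of this sketch, not a description of our exact implementation.

```python
import gym
import numpy as np
import d4rl  # noqa: F401  # registers the D4RL environments

env = gym.make("hopper-random-v2")
dataset = env.get_dataset()

rng = np.random.default_rng(seed=0)  # same seed shared across compared methods
obs = dataset["observations"][rng.integers(len(dataset["observations"]))]

env.reset()
nq = env.unwrapped.model.nq  # 6 for Hopper
# Hopper observations are [qpos[1:], qvel]; the root x-position is excluded,
# so we set it to 0 (velocities are clipped in the observation, so this
# reconstruction is approximate).
qpos = np.concatenate([[0.0], obs[: nq - 1]])
qvel = obs[nq - 1:]
env.unwrapped.set_state(qpos, qvel)  # override the simulator's internal state

start_obs = env.unwrapped._get_obs()  # evaluation proceeds from this state
```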
”how is the computational efficiency of the proposed RefPlan method, in training and in executing, respectively?”
We appreciate your question about computational efficiency. The training phase of RefPlan involves offline pre-training of a VAE dynamics model, which comprises two stages: encoder pre-training and decoder fine-tuning, as detailed in Appendix C.3. These stages rely entirely on supervised learning, making them computationally efficient. In our experiments, we performed 200 epochs of VAE pretraining and an additional 500 epochs of decoder fine-tuning.
The prior policies were trained using various model-free and model-based offline policy learning algorithms, a process orthogonal to RefPlan itself. At test time, compared to LOOP, RefPlan adds computational overhead due to the marginalization over latent variables in Equation 13. However, by maximizing GPU parallelization, the runtime efficiency of RefPlan is comparable to LOOP.
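As an illustration of the test-time marginalization and the GPU parallelization mentioned above, below is a minimal PyTorch sketch that scores N candidate action sequences under K latent samples in a single batched rollout. The `encoder`, `dynamics`, and `reward_fn` callables are hypothetical stand-ins for the VAE components, and the actual RefPlan objective additionally involves the prior policy and a value/penalty term.

```python
import torch


@torch.no_grad()
def score_action_sequences(encoder, dynamics, reward_fn, history, s0,
                           actions, num_latent_samples=8, gamma=0.99):
    """Monte Carlo estimate of H-step returns, marginalized over the latent belief.

    actions: tensor of shape (N, H, action_dim) holding candidate sequences.
    """
    N, H, _ = actions.shape
    K = num_latent_samples

    mu, log_std = encoder(history)  # approximate posterior q(z | h)
    z = mu + log_std.exp() * torch.randn(K, mu.shape[-1], device=mu.device)

    # Replicate states/actions across latent samples -> one large GPU batch.
    s = s0.expand(K, N, -1).reshape(K * N, -1)
    z_rep = z.repeat_interleave(N, dim=0)
    a = actions.unsqueeze(0).expand(K, N, H, -1).reshape(K * N, H, -1)

    returns = torch.zeros(K * N, device=s.device)
    for t in range(H):
        returns += (gamma ** t) * reward_fn(s, a[:, t], z_rep)
        s = dynamics(s, a[:, t], z_rep)  # latent-conditioned transition model

    return returns.reshape(K, N).mean(dim=0)  # average over latent samples
```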
”Figure 4: I wonder what the performance will be like when you use the full dataset for LOOP and RefPlan. Maybe continuing the lines in the plots to 1M would help readers”
Thank you for this insightful suggestion. We have updated Figure 4 to include the performance of all compared methods when using the full dataset (1M transitions) for training.
Thank you for the response and additional results. Some of my concerns are solved.
About APE-V
I understand that sometimes it'd be hard to reproduce the results of some baselines when there is no official codebase, yet I still encourage the authors to ask the baseline authors for the code. This will make the comparison more complete.
About how exactly you test the prior policy in states sampled from the R dataset
This does surprise me. I didn't know there's such a way to do that. Thank you for letting me know.
About the computational efficiency
Thanks for the explanation. It'd still be nice if the comparison of efficiency can be quantified and shown in the paper, though.
Dear Reviewer,
Thank you for your thoughtful feedback and for considering our previous response. We greatly appreciate your engagement and the opportunity to clarify further.
About APE-V:
We reached out to the APE-V authors but did not receive a response. More importantly, a direct comparison with APE-V is less relevant to our research questions. APE-V focuses on training adaptive model-free policies via value ensembles, while RefPlan augments static offline-trained prior policies with test-time adaptive planning. This difference makes comparisons with methods like LOOP, which also enhance prior policies, more appropriate. Evaluating adaptive offline RL methods like APE-V could be an exciting direction for future work, but we believe our current comparisons better highlight RefPlan’s contributions.
About computational cost analysis:
We have added quantifications of computational costs in Appendix D.3 of the revised manuscript, providing further clarity on RefPlan’s runtime overhead and efficient GPU parallelization.
Code release:
We recognize the importance of reproducibility and transparency in scientific research. To this end, we will publish our code upon acceptance of the paper, allowing the community to build on and verify our work.
Thank you again for your thoughtful comments. We hope these clarifications, updates to the manuscript, and our commitment to releasing the code address your concerns, and we kindly request your reconsideration of the score.
Best regards,
Authors
Dear Reviewer F97j,
The discussion phase is about to conclude, and we kindly ask if you could review our responses and updated revisions. We’d greatly appreciate your feedback or reconsideration of your evaluation if our updates have addressed your concerns.
Thank you again for your time and thoughtful review.
This paper proposes a principled approach based on Bayesian inference for offline MBRL. The proposed formulation uses the Bayes-adaptive MDP approach, in which uncertainty over the estimated model is captured through a belief representation, as in POMDPs, while planning under uncertainty selects actions that account for unknowns beyond the agent's limited knowledge. Overall, the doubly Bayesian view is applied to both model learning and policy optimization. The writing is easy to follow. The final results are encouraging.
Strengths
Pros:
- Clear and principled proposal for offline MBRL: The model distribution is learnt via variational inference, while the policy planning step takes this model uncertainty into account to plan for optimal actions under epistemic uncertainty.
- The proposed formulation is sound.
- Clear writing: It's easy to follow and understand both the technical part, e.g. the maths behind it, and the main proposal, e.g. Fig. 2 has a clear depiction of the proposal.
Weaknesses
Cons:
- It's questionable that RefPlan only fine-tunes a baseline policy. Either i) why can't the learnt model be used to optimize a new policy, or ii) it's a bit unfair in terms of the extra computation needed in comparison to the baselines, both model-free and model-based. Especially for the latter, the model learnt by the model-based baseline will be discarded or left unnecessarily unused in the fine-tuning stage of RefPlan. So a more light-weight fine-tuning approach for test-time planning would be expected.
- Performance is mixed: RefPlan performs well on some tasks, showing clear improvements. However, on some tasks, e.g. Hopper, the improvements are not obvious even in comparison to model-free methods. E.g., in Fig. 3, CQL can still perform well on Hopper even in the OOD setting, without explicitly modeling epistemic uncertainty and without doing extra training and planning at test time. In addition, the gap to the existing SOTA, LOOP, which also does some extra computation, is not significant.
Questions
See the above two main questions in Cons.
Other major comments:
- Fig. 1: Misleading, as it could be understood that a graphical model is the input to the encoder. There are observed nodes used as data, but the graph-like connections are also shown?
- Conceptually, how does it compare to the deterministic and stochastic paths in the PlaNet and Dreamer models? The plan is also solved using sampling; however, RefPlan can have a higher variance due to the outer sampling w.r.t. the random variable "model m".
- Some ablations are needed to understand the effect of the whole sampling step, e.g. the trade-off between variance, performance, and the needed computation.
- "In the offline setting, we aim to enhance the prior policy πp via MB planning at test time by inferring the posterior over": Policy and model are decoupled. Can it be revised to compute optimal policies directly via model-based policy optimization, or at least can a comparison to this "baseline" be included?
- Experiment in Section 5.1: The results are encouraging for claiming that RefPlan acts in the face of epistemic uncertainty; however, it is hard to understand the effect of each component, e.g. via a visualization of the uncertain region, or to understand how the policy is selected in such situations.
Minor comments:
- It would be better in Related work to discuss and include this work: "Arthur Guez, David Silver, Peter Dayan: Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search. NIPS 2012"
We thank the reviewer for their thoughtful feedback and for highlighting the strengths of our work, including the principled and clear formulation of RefPlan, its soundness, and the clarity of our writing. Below, we address each of your comments and concerns in detail.
”It's questionable that RefPlan only fine-tunes a baseline policy. Either i) why the learnt model can’t be used to optimize a new policy”
Thank you for this suggestion. To investigate this, we conducted additional experiments using the learned VAE dynamics model (VariBAD’s model) for offline policy optimization. The results, included in Table 2 in Appendix B.3, compare the following setups:
- Original prior policy learning and evaluation (Orig).
- Using VariBAD’s model during offline policy training (NM (Train)).
- Applying RefPlan to policies trained with VariBAD’s model (NM (Train) + RefPlan).
- Applying RefPlan to original prior policies (RefPlan).
The results are also summarized below:
| MOPO | Dataset | Orig | NM (Train) | NM (Train) + RefPlan | RefPlan |
|---|---|---|---|---|---|
| Hopper | M | 66.9 | - | - | 67.7 |
| | MR | 90.3 | 93.2 | 98.18 | 94.5 |
| | ME | 91.3 | - | - | 96.5 |
| HalfCheetah | M | 42.8 | 40.6 | 66.45 | 59.8 |
| | MR | 70.6 | 53.2 | 72.46 | 73.8 |
| | ME | 73.5 | 71.6 | 100.34 | 96.6 |
| Walker2d | M | 82.0 | 60.6 | 72.73 | 85.9 |
| | MR | 81.7 | 53.3 | 79.75 | 88.3 |
| | ME | 51.9 | 42.4 | 64.59 | 68.1 |
| COMBO | Dataset | Orig | NM (Train) | NM (Train) + RefPlan | RefPlan |
|---|---|---|---|---|---|
| Hopper | M | 60.9 | 52.2 | 62.30 | 77.2 |
| | MR | 101.1 | 44.9 | 61.90 | 101.8 |
| | ME | 105.6 | 27.3 | 39.23 | 107.8 |
| HalfCheetah | M | 67.2 | 30.3 | 41.61 | 77.4 |
| | MR | 73.0 | 47.6 | 59.54 | 75.0 |
| | ME | 97.6 | 93.5 | 109.25 | 110.3 |
| Walker2d | M | 71.2 | 79.1 | 89.43 | 87.4 |
| | MR | 88.0 | 80.4 | 91.01 | 93.3 |
| | ME | 108.3 | 36.7 | 38.47 | 112.7 |
These experiments demonstrate that RefPlan consistently outperforms using the VAE model during offline training (NM (Train)). RefPlan dynamically adapts to epistemic uncertainty during test time using real-time history, which is not captured as effectively in offline training alone. Furthermore, applying RefPlan to NM (Train) policies significantly improves their performance, highlighting its ability to recover from suboptimal offline training.
”it's a bit unfair in terms of extra computation needed in comparisons to the baselines, including both model-free and model-base… the model learnt using model-based methods in the baseline will be discarded or unnecessarily unused in the fine-tuning stage of RefPlan.”
We appreciate this concern. To ensure fairness, we used LOOP as a primary baseline, as it also performs model-based planning during test time, introducing extra computation. The “Orig” column in Table 1 reflects the performance of prior policies without additional test-time planning, while LOOP and RefPlan include the cost of planning. While RefPlan incurs more computational overhead due to sampling latent variables and marginalizing via Monte Carlo, we leverage parallelized computation on a GPU to keep the runtime overhead sublinear and comparable to LOOP.
”Performance is mixed… at some tasks e.g. Hopper the improvements are not obvious even in comparison to model-free methods”
Thank you for this observation. While RefPlan does not outperform baselines in each and every task, its overall improvement is clear. For example, in Figure 3, though CQL performs well under OOD initialization in Hopper, there is still a noticeable drop in performance compared to the original performance shown as a dotted line. In contrast, RefPlan maintains higher resilience under this challenging setup. Furthermore, in other environments (HalfCheetah and Walker2d), both MB planning methods demonstrate clearer benefits, with RefPlan consistently providing better robustness. Figures 7 and 8 in Appendix B.2 further corroborate these findings.
”the gap to existing SOTA, LOOP which also did some extra computation, is not significant”
To provide a more rigorous comparison, we used RLiable [1], a robust evaluation framework for reinforcement learning algorithms. RLiable computes aggregate metrics (e.g., median, interquartile mean, and optimality gap) and uses stratified bootstrapping to estimate confidence intervals. This ensures comparisons are statistically sound and not skewed by noise or outliers.
As shown in Appendix B.1 (Figure 6), RLiable consistently indicates that RefPlan outperforms LOOP across all metrics. Importantly, RefPlan demonstrates statistically significant improvements, with non-overlapping confidence intervals, highlighting its effectiveness in leveraging epistemic uncertainty during planning.
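For reference, a minimal sketch of how these aggregate metrics and stratified-bootstrap confidence intervals are typically computed with the rliable library; the score matrices below are random placeholders standing in for the per-task normalized returns of each method.

```python
import numpy as np
from rliable import library as rly
from rliable import metrics

# Placeholder score matrices of shape (num_runs, num_tasks); in practice these
# hold the D4RL-normalized returns of each method.
rng = np.random.default_rng(0)
score_dict = {
    "LOOP": rng.uniform(0.4, 0.9, size=(3, 15)),
    "RefPlan": rng.uniform(0.5, 1.0, size=(3, 15)),
}


def aggregate(scores):
    return np.array([
        metrics.aggregate_median(scores),
        metrics.aggregate_iqm(scores),
        metrics.aggregate_mean(scores),
        metrics.aggregate_optimality_gap(scores),
    ])


# Stratified bootstrap over runs to obtain interval estimates for each metric.
point_estimates, interval_estimates = rly.get_interval_estimates(
    score_dict, aggregate, reps=50000)
```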
”Fig. 1: Misleading as could be understood that a GM is input to the encoder”
Thank you for noting this potential ambiguity. We have revised the caption of Figure 1 to clarify that the encoder takes trajectories as input, and the graphical structure is shown only for illustrative purposes.
”comparison to the deterministic path and stochastic path in PlaNet and Dreamer model?”
Thank you for raising this point. PlaNet and Dreamer treat image-based environments as POMDPs, where planning is conducted in a learned latent space. In contrast, RefPlan frames offline RL as an epistemic POMDP, where partial observability arises from epistemic uncertainty about the environment due to limited dataset coverage. Unlike PlaNet and Dreamer, RefPlan addresses the challenges of offline RL by explicitly modeling and inferring epistemic uncertainty at test time within a unified probabilistic framework. This distinction sets our contributions apart.
”trade-off between variance and performance and needed computation”
This is an excellent question. Regarding computation, as noted earlier, we utilize GPU parallelization to minimize runtime overhead when increasing the number of latent samples. Regarding variance, as the number of latent samples increases, we expect the sample variance of planned actions (Equation 13) to decrease. We validated this empirically in Figure 9 (Appendix B.4), where increasing the number of latent samples generally leads to improved performance and reduced variance. While the number of latent samples impacts both performance and computation, tuning other RefPlan hyperparameters may provide better performance gains within a given budget.
”The results are encouraging to say RefPlan is acting in the face of epistemic uncertainty, however it's hard to understand the effect of which component, e.g. visualization of uncertain region, or understand how policy is selected in such situation.”
Thank you for this insightful comment. To address this, we are working on including a navigation task as a didactic example to visualize how RefPlan handles epistemic uncertainty and selects policies under uncertainty. We aim to include this in the final version if accepted.
”in Related work to discuss and include this work: "Arthur Guez, David Silver, Peter Dayan: Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search. NIPS 2012"”
Thank you for the suggestion. We have added this reference to the Related Work section.
[1] Deep reinforcement learning at the edge of the statistical precipice, Agarwal et al., NeurIPS’21.
Thanks for the new results. However, from the table, RefPlan does not look like it consistently outperforms the policy trained offline with the VAE model.
Thank you for your prompt response and for reviewing the new results. We wanted to provide a concise summary of the results from the new table to clarify our findings:
- Performance of NM (Train): The policies trained offline with the VAE model (NM (Train)) achieved an average score of 56.68 across the tasks, representing a 40% drop from the original prior policy's performance (average of 79.11). This indicates that using the VAE model for offline policy optimization significantly degraded performance rather than improving it.
- RefPlan with NM (Train): When RefPlan was applied on top of NM (Train) policies, the average performance increased to 71.70, showing a 26.5% improvement over NM (Train) alone. This highlights RefPlan's ability to recover some of the performance loss incurred during offline training.
- RefPlan with original prior policies: RefPlan applied to the original prior policies achieved the highest average score of 88.1, surpassing both NM (Train) and NM (Train) + RefPlan. Importantly, RefPlan consistently outperformed NM (Train) across all tasks, demonstrating that leveraging the VAE dynamics model for test-time planning is far more effective than using it for offline training.
These results underscore that RefPlan not only outperforms NM (Train) by a significant margin but also achieves consistent improvements across all tasks. We hope this summary clarifies any potential misunderstandings regarding the results presented in the table.
Thank you again for your thoughtful feedback and consideration.
Dear Reviewer sAVu,
Thank you again for your feedback. We hope our response clarifying RefPlan’s consistent improvements over the baseline has addressed your concerns. If anything remains unclear, we’d be happy to provide further clarification.
We kindly ask you to consider revisiting your evaluation in light of the clarified results.
To effectively incorporate uncertainty into planning, this paper proposes Reflect-then-Plan (RefPlan), a doubly Bayesian approach for offline MB planning that enhances offline-learned policies for improved adaptivity and generalization. The performance is validated on three standard benchmarks (Hopper, HalfCheetah, and Walker2d). However, it is not clear that the agent has learned a near Bayes-optimal policy. It would be better to add theoretical support and/or to include a navigation task, along with visualizations of the agent's behavior.
Strengths
The Reflect-then-Plan (RefPlan) framework combines Bayesian modeling of epistemic uncertainty with model-based planning in a unified probabilistic approach.
Weaknesses
- The paper uses VariBAD's VAE structure to learn environment dynamics but lacks strong evidence that the agent has learned a near Bayes-optimal policy. It is recommended to add theoretical support or to include a navigation task, along with visualizations of the agent's behavior.
- The paper lacks innovation; the approach of offline model-based planning as probabilistic inference is common (see [1]). Furthermore, the proposed algorithm is merely a minor modification of VariBAD, lacking novelty. In addition, related work in the field of offline meta-RL has shown adaptation and generalization across multiple tasks, which is more valuable than the single-task generalization problem addressed here (see [2][3]).
- The paper evaluates the algorithm on only three tasks, which is insufficiently persuasive, and the experimental results show only marginal improvements over LOOP.
[1] Janner et al., 2022, Planning with Diffusion for Flexible Behavior Synthesis.
[2] Yuan et al., 2022, Robust Task Representations for Offline Meta-Reinforcement Learning via Contrastive Learning.
[3] Ni et al., 2023, MetaDiffuser: Diffusion Model as Conditional Planner for Offline Meta-RL.
Questions
See the previous section.
We thank the reviewer for their feedback. Below, we address your comments and concerns in detail.
"lacks strong evidence that the agent has learned a near Bayes-optimal policy. It is recommended to add theoretical support or to include a navigation task, along with visualizations of the agent's behavior."
Thank you for raising this concern. RefPlan builds on the theoretical foundations of [1] (Proposition 5.1), which shows that the Bayesian offline RL objective is maximized by a policy conditioned on the agent's belief over MDPs inferred from its history. While RefPlan employs approximate belief updates through variational inference, which may lead to suboptimal behaviors, it remains grounded in this theoretical principle.
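For clarity, the objective referred to above can be written informally as follows; this is our shorthand in standard notation, not necessarily the exact formulation used in the paper or in [1].

```latex
% Bayesian offline RL objective over the posterior of MDPs given the dataset D
% (informal restatement in standard notation):
J(\pi) \;=\; \mathbb{E}_{M \sim P(M \mid \mathcal{D})}\;
       \mathbb{E}_{\tau \sim (\pi, M)}\!\left[\sum_{t \ge 0} \gamma^{t} r_t\right],
% which is maximized by a policy conditioned on the belief induced by the
% interaction history h_t:
\pi^{*}(a_t \mid b_t), \qquad b_t(M) \;=\; P(M \mid \mathcal{D}, h_t).
```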
We appreciate the suggestion to include a didactic navigation task and behavioral visualizations. These additions could help illustrate the agent’s performance and are planned for inclusion in the final version of the paper.
"the approach of offline model-based planning as probabilistic inference is common (see [1])"
Thank you for raising this point. The control-as-inference framework has indeed been used in various reinforcement learning contexts, including offline planning, as demonstrated in [1]. However, [1] does not account for leveraging prior action sampling distributions derived from offline-learned policies. In contrast, RefPlan seamlessly integrates model-free and model-based methods within a unified probabilistic framework, enabling prior policies learned through offline model-free algorithms to guide action sampling during test-time planning.
Moreover, RefPlan advances the field by adopting an epistemic POMDP perspective on offline RL, explicitly modeling and addressing the agent’s epistemic uncertainty during planning. This treatment of uncertainty allows RefPlan to handle out-of-distribution states more effectively, leveraging real-time history to adapt during deployment. To the best of our knowledge, this approach has not been considered in prior work, including [1].
By introducing the concept of prior policies into the control-as-inference framework and explicitly addressing epistemic uncertainty, RefPlan not only bridges model-free and model-based methods but also provides a novel and practical solution for robust offline RL.
"the proposed algorithm is merely a minor modification of VariBAD, lacking novelty"
We respectfully disagree with this assessment. While RefPlan builds on VariBAD’s VAE structure, its contributions extend beyond simple modification. Unlike VariBAD, which tackles meta-RL in multi-task settings, RefPlan focuses on single-task offline RL and introduces a unified probabilistic framework that explicitly considers epistemic uncertainty during test-time planning. This framework allows RefPlan to combine the strengths of offline model-free and model-based approaches: utilizing prior policies from offline RL and enhancing them through test-time planning with real-time uncertainty handling.
We also believe that connecting ideas across domains to tackle new problems constitutes meaningful and impactful research. RefPlan’s application of Bayesian modeling and offline planning to single-task offline RL offers a novel and practical solution to a challenging problem in reinforcement learning.
"related work in the field of offline meta-RL has shown adaptation and generalization across multiple tasks, which is more valuable than the single-task generalization problem addressed here"
We appreciate the reviewer’s emphasis on multi-task settings. However, we respectfully disagree with the assertion that single-task generalization is less valuable. Single-task offline RL is an active research area with significant practical relevance, as highlighted by numerous references cited in our work. Challenges such as epistemic uncertainty due to limited dataset coverage are critical to real-world applications where multi-task setups may not be feasible. RefPlan directly addresses these challenges by providing a robust and adaptive solution to single-task offline RL.
"The paper evaluates the algorithm on only three tasks"
While we focus on three environments (Hopper, HalfCheetah, and Walker2d), our experiments span five distinct dataset configurations per environment (random, medium, medium-replay, medium-expert, and full-replay), resulting in 15 tasks. These configurations were designed to systematically evaluate RefPlan’s performance under varying offline dataset qualities. Additionally, we address key research questions (RQ1, RQ3, RQ4) to assess RefPlan’s robustness to high epistemic uncertainty from different sources, complementing the main benchmark comparisons in RQ2.
"the experimental results show only marginal improvements over LOOP"
Thank you for this comment. To provide a statistically rigorous comparison, we used RLiable [2], a robust evaluation framework for reinforcement learning algorithms. RLiable computes aggregate metrics (e.g., median, interquartile mean, and optimality gap) and uses stratified bootstrapping to estimate confidence intervals, ensuring comparisons are not skewed by noise or outliers.
As detailed in Appendix B.1 (Figure 6), RLiable’s metrics consistently show that RefPlan outperforms LOOP across all metrics, with statistically significant improvements and non-overlapping confidence intervals.
[1] Offline RL Policies Should be Trained to be Adaptive, Ghosh et al., ICML’22.
[2] Deep reinforcement learning at the edge of the statistical precipice, Agarwal et al., NeurIPS’21.
Dear Reviewer NMjJ,
The discussion phase is nearing its conclusion, and we would greatly appreciate it if you could review our responses and let us know if they have adequately addressed your concerns. And we hope our clarifications and additional results will assist in reconsidering your evaluation.
Thank you once again for your time and thoughtful review.
This work addresses the problem of offline reinforcement learning and proposes a Bayesian-inspired model-based solution based on VariBad and control-as-inference.
Varibad is a Bayesian solution for meta-learning: given data on a set of tasks, how to quickly identify the task during testing (and do well in it). This is done through variational inference, where an encoder is trained to capture distribution over the task (in the form of a latent variable) given a trajectory (paired with a decoder that is trained to reproduce the trajectories). Control-as-inference models decision making as a probabilistic problem, "probability of policy given optimality", using the expected return as a likelihood measurement.
The performance is compared to typical offline model-free and model-based methods on the D4RL benchmark, where the method shows performance comparable to "LOOP".
Strengths
S1: The work is tackling a relevant (offline RL) problem that should be of interest to a non-trivial section of the ICLR community.
S2: The English is easy to understand and the math is (as far as I can tell) sound.
S3: I believe the proposed solution - the combination of control-as-inference and Bayesian inference over dynamics - is non-trivial and novel. In particular, it attempts to leverage uncertainty in the (offline model-free) policy and the (offline model-based) dynamics in a computationally feasible way. Though I am not familiar with the offline RL community/literature, the question of how to combine model-free and model-based methods is a long-standing and important one in RL. The way it is proposed here "makes sense": a pre-trained policy should be considered a prior, and fine-tuning it online in a Bayesian fashion as new data comes in is (only in hindsight) an obvious choice.
Weaknesses
W1: One area of improvement, especially for someone lacking the background, is the accessibility of the background material. First, certain (seemingly?) important concepts were not clearly defined (e.g. "epistemic POMDPs", "BA-MDPs"). Second, some concepts were clearly background (e.g. "control-as-inference") but were not introduced. As far as I can tell, they were explained as part of the method description, which made it excessively hard to infer what was novel (and should be credited as well as scrutinized) and what was known in the literature.
W2: A major concern, in my opinion, is the lack of experiments for online learning. Conceptually, planning is useful if (1) it saves us computation time (plan for current states, not whole state space) or (2) we gain more information over time (improve learned model and thus our planning). As far as I understand, the experiments here are the initial performance, which begs the question (Q2) whether this performance could have been trained/reached offline instead.
W3: VariBad tackles meta-learning: it is assumed (and exploited) that the training data set is generated from different tasks. In particular, it is optimized to capture the task characteristics from different tasks, capture this in latent variables, and infer them online. As far as I understand, the experiments do not include this setting. In particular, it is not clear whether the "Bayesian" argument holds here: Varibad's encoder might just collapse - as all trajectories come from the same environment - and there should be no (latent) information to capture. As a result, while it is supposed to be "double Bayesian", the proposed method does not seem to have the Bayesian trait of doing optimal actions with respect to the uncertainty.
Questions
Q1: Are Sections 4.1 and 4.2 background (control as inference & varibad), or are there particular extensions / modifications hidden in there?
Q2: Given the concerns of W3 (offline RL vs meta-learning), rather little information is learned until "much data" is gathered. As a result, it feels as if any additional performance from online planning could have been obtained offline instead: refine the policy by doing control-as-inference offline on Varibad's model. Do you have any idea how well that could or would perform?
Q3: Did you consider comparing with VariBad? How about an ablation study where you replace VariBad with another model-based approach (that does not do meta-learning)?
”Q2: Given the concerns of W3 (offline RL vs meta-learning), rather little information is learned until "much data" is gathered. As a result, it feels as if any additional performance from online planning could have been obtained offline instead: refine the policy by doing control-as-inference offline on Varibad's model. Do you have any idea how well that could or would perform?”
Thank you for this insightful question. The control-as-inference framework can indeed be applied in various ways for policy learning, as explored in [5, 6]. Extending this framework to offline RL with VariBAD’s model could potentially involve novel algorithmic design choices. While this is an interesting direction, developing such an offline RL algorithm would constitute a separate study beyond the scope of this work.
To address your concern more directly, we investigated a related baseline where VariBAD’s model is used during offline training for policy optimization. Specifically, we compared the following setups:
- Original prior policy learning and evaluation (Orig).
- Using VariBAD’s model during offline policy training (NM (Train)).
- Applying RefPlan to policies trained with VariBAD’s model (NM (Train) + RefPlan).
- Applying RefPlan to original prior policies (RefPlan).
| MOPO | Dataset | Orig | NM (Train) | NM (Train) + RefPlan | RefPlan |
|---|---|---|---|---|---|
| Hopper | M | 66.9 | - | - | 67.7 |
| | MR | 90.3 | 93.2 | 98.18 | 94.5 |
| | ME | 91.3 | - | - | 96.5 |
| HalfCheetah | M | 42.8 | 40.6 | 66.45 | 59.8 |
| | MR | 70.6 | 53.2 | 72.46 | 73.8 |
| | ME | 73.5 | 71.6 | 100.34 | 96.6 |
| Walker2d | M | 82.0 | 60.6 | 72.73 | 85.9 |
| | MR | 81.7 | 53.3 | 79.75 | 88.3 |
| | ME | 51.9 | 42.4 | 64.59 | 68.1 |
| COMBO | Dataset | Orig | NM (Train) | NM (Train) + RefPlan | RefPlan |
|---|---|---|---|---|---|
| Hopper | M | 60.9 | 52.2 | 62.30 | 77.2 |
| | MR | 101.1 | 44.9 | 61.90 | 101.8 |
| | ME | 105.6 | 27.3 | 39.23 | 107.8 |
| HalfCheetah | M | 67.2 | 30.3 | 41.61 | 77.4 |
| | MR | 73.0 | 47.6 | 59.54 | 75.0 |
| | ME | 97.6 | 93.5 | 109.25 | 110.3 |
| Walker2d | M | 71.2 | 79.1 | 89.43 | 87.4 |
| | MR | 88.0 | 80.4 | 91.01 | 93.3 |
| | ME | 108.3 | 36.7 | 38.47 | 112.7 |
These experiments, detailed in Appendix B.3 (Table 2), indicate that planning with VariBAD’s model at test time (as in RefPlan) consistently outperforms using it during offline training. RefPlan leverages real-time history to dynamically adapt to epistemic uncertainty during deployment, which is not captured as effectively when the model is used exclusively in offline training. Moreover, combining RefPlan with NM (Train) policies leads to significant performance improvements, underscoring its ability to recover from limitations in offline training.
”Q3: Did you consider comparing with VariBad? How about an ablation study where you replace VariBad with other model-based approach (that does not do meta-learning)?”
Thank you for these questions. First, as you correctly noted in W3, VariBAD is specifically designed for meta-RL, where the training process involves interacting with multiple tasks and leveraging online interactions to train its VAE model. This setup differs fundamentally from RefPlan, which focuses on single-task offline RL. In VariBAD, epistemic uncertainty arises from task variation, whereas in RefPlan, it stems from incomplete state-action coverage in the offline dataset, as captured by the epistemic POMDP formulation [1]. Due to these differences, a direct comparison with VariBAD is not applicable.
To address the second part of your question, we agree that ablation studies with simpler model-based approaches are essential to evaluate the benefits of our method. LOOP, included in our experiments, serves precisely this purpose. LOOP uses a standard Markovian dynamics model learned from the offline dataset and combines it with a prior policy for test-time planning, without explicitly modeling epistemic uncertainty. By comparing RefPlan with LOOP, we can ablate the advantages of modeling epistemic uncertainty and using a unified probabilistic framework for planning.
Our results, reinforced by RLiable [7] statistical tests, show that RefPlan significantly outperforms LOOP across environments and dataset configurations. As shown in Figure 6 (Appendix B.1), RefPlan’s explicit uncertainty modeling consistently leads to better performance and reliability. These results demonstrate that incorporating epistemic uncertainty into test-time planning provides measurable advantages over simpler model-based approaches that do not account for such uncertainty.
[1] Why Generalization in RL is Difficult: Epistemic POMDPs and Implicit Partial Observability, Ghosh et al., NeurIPS’21.
[2] Offline-to-Online Reinforcement Learning via Balanced Replay and Pessimistic Q-Ensemble, Lee et al., CoRL’21.
[3] AWAC: Accelerating Online Reinforcement Learning with Offline Datasets, Nair et al., (2021) (https://arxiv.org/pdf/2006.09359).
[4] Offline RL Policies Should be Trained to be Adaptive, Ghosh et al., ICML’22.
[5] Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review, Levine, arXiv’18.
[6] Maximum a Posteriori Policy Optimisation, Abdolmaleki et al., ICLR’18.
[7] Deep reinforcement learning at the edge of the statistical precipice, Agarwal et al., NeurIPS’21.
We sincerely thank the reviewer for their thoughtful feedback and for highlighting the relevance of our work, its non-trivial contributions, and its strengths. Below, we address each weakness and question in detail.
”W1: One room of improvement, especially for someone with lack of background, is the accessibility of the background.”
Thank you for pointing this out. We have revised the main text to include clear explanations of epistemic POMDPs, BAMDPs, and control-as-inference in Section 3. Additionally, we have expanded on BAMDP in Appendix A.1. These additions aim to make the background more accessible and help readers distinguish the novel contributions of our work from prior methods.
”W2: A major concern, in my opinion, is the lack of experiments for online learning.”
We appreciate this concern and would like to clarify our problem setup. Our work focuses on offline RL, where a policy is trained using an offline dataset and subsequently deployed in a testing environment. The core goal is to address the agent’s epistemic uncertainty during deployment to achieve robust performance, rather than improving online learning efficiency.
As introduced by [1], epistemic POMDPs explicitly model the train-test split in RL. While BAMDPs emphasize online learning, epistemic POMDPs focus on single-episode evaluation at test time, prioritizing the agent’s adaptivity during deployment. Our experiments are consistent with this framework and intentionally focus on the deployment phase to isolate and evaluate the benefits of Bayesian modeling for epistemic uncertainty.
While our work focuses on the deployment phase, we acknowledge that the Bayesian framework introduced in RefPlan has the potential to inform offline-to-online RL methods. By explicitly modeling the agent’s epistemic uncertainty, it could help improve efficiency and adaptivity in online learning. This aligns with directions explored in the offline-to-online RL literature, such as [2, 3], and represents an important avenue for future research beyond the scope of this work.
”W3: VariBad tackles meta-learning: it is assumed (and exploited) that the training data set is generated from different tasks. … not clear whether the "Bayesian" argument holds here: Varibad's encoder might just collapse - as all trajectories come from the same environment”
Thank you for this insightful comment. Unlike VariBAD, which addresses meta-RL across multiple tasks, our work targets offline RL with a single task. Here, the source of epistemic uncertainty arises from incomplete state-action coverage in the offline dataset, as discussed in [1] and [4]. This aligns offline RL with the epistemic POMDP framework, where uncertainty is due to unexplored parts of the environment rather than task variability.
To investigate whether the latent distribution collapses to a deterministic one, we examined the impact of the number of latent samples on evaluation performance. Specifically, for CQL as a prior policy and with the FR dataset, we fixed all other hyperparameters and varied the number of latent samples, measuring the performance across three random seeds for each configuration. The resulting correlations between the number of latent samples and performance in each environment were as follows:
| Environment | Correlation (Performance & Number of Latent Samples) |
|---|---|
| Hopper | 0.767368 |
| HalfCheetah | 0.83648 |
| Walker2d | 0.481960 |
These non-zero correlations suggest that the latent distribution does not collapse; instead, it retains uncertainty information that improves performance as the number of latent samples increases.
Additionally, as shown in Figure 9 in the Appendix, the variance of the optimized actions averaged across an episode decreases as the number of latent samples increases, and the performance generally improves. This provides further evidence that the latent distribution is capturing meaningful uncertainty, enabling RefPlan to leverage this information effectively during planning.
”Q1: Are Sections 4.1 and 4.2 background (control as inference & varibad), or are there particular extensions / modifications hidden in there?”
Section 4.1 adapts control-as-inference to offline MB planning, specifically incorporating a prior policy and emphasizing deployment. This distinction is highlighted just above Equation 3. Section 4.2 leverages the ideas from VariBAD but modifies its use to focus on single-task offline planning. We discuss these differences at the end of Section 4.2. To further clarify, we added additional preliminary content in the main text and appendix to help readers distinguish between prior work and our contributions.
Dear Reviewer ezET,
The discussion phase is nearing its conclusion, and we would greatly appreciate it if you could review our responses and let us know if they have adequately addressed your concerns. And we hope our clarifications and additional results will assist in reconsidering your evaluation.
Thank you once again for your time and thoughtful review.
When data coverage for offline RL algorithms is incomplete, this can lead to high epistemic uncertainty. The authors aim to improve performance in such settings at deployment time by incorporating a Bayesian-based approach. Specifically, their approach, called RefPlan, integrates model-based planning and uncertainty modeling. An empirical evaluation on standard offline RL benchmark domains considers the performance of RefPlan in environments where dynamics change or where data availability is limited.
Strengths
- The problem motivating this work is important, and to the best of my knowledge this algorithm seems like a novel contribution
- Many parts of the paper are nicely written, including the motivation outlined in Section 1 and the discussion in Section 3.
- The math discussed in the work, e.g., Sections 4.1 and 4.2, did not seem to have errors.
- The number / types of environments seems adequate to provide rankings between algorithms (in aggregation)
Weaknesses
Background / Improving Clarity: The paper could be improved with more clarity on a lot of the background. While many algorithms / ideas were mentioned then cited, having a fuller description of these works in the paper (main body or appendix) would be beneficial, especially when these are used in the main algorithm or often referenced. E.g., BAMDP, control-as-inference framework, quantifying epistemic uncertainty.
Experiments:
RQ1. Further explanation connecting the environment settings chosen and resulting epistemic uncertainty would improve the flow.
RQ3 & RQ4. Q3 seems to be comparing performance under epistemic uncertainty but when that uncertainty is produced through limited data as opposed to RQ1? Improved clarity between these RQs would be beneficial. RQ1 seems to be a superset of RQ3&4.
Small Confusions / Errors
- the last comment in alg2 refers to line 5 in alg1, but there are no line numbers; line 5 specifically is the beginning of a loop
- what are the error bars used in the experiments?
- bold vs underline meaning in Table 1?
- H-step is mentioned without being defined.
Questions
RQ3 and RQ4 seem to be a superset of RQ1. Could the authors clarify this? Instead, is it the case that these RQs each consider different causes of uncertainty?
Details of Ethics Concerns
None
We sincerely thank the reviewer for taking the time to provide valuable feedback and constructive suggestions. We appreciate your recognition of the importance of our problem, the novelty of our contribution, and the strengths of our work. Below, we address each of your concerns in detail:
”Background / Improving Clarity: The paper could be improved with more clarity on a lot of the background…”
Thank you for highlighting this point. We have added a concise discussion of the control-as-inference framework, epistemic POMDP, and BAMDP in Section 2 to enhance the reader's understanding. Additionally, we expanded on BAMDP in Appendix A.1.
”RQ3 and RQ4 seem to be a superset of RQ1.”
Thank you for pointing out the need for clarification. Each RQ focuses on high epistemic uncertainty arising from different causes:
- RQ1 evaluates RefPlan’s ability to handle uncertainty caused by OOD initialization. Specifically, the agent is offline-trained using a medium-expert dataset with limited state-action coverage and is then initialized in a state sampled from the random dataset, creating significant epistemic uncertainty due to minimal overlap between the datasets.
- RQ3 addresses the epistemic uncertainty resulting from limited data availability during offline training. By subsampling the full-replay dataset, we vary dataset sizes, with smaller datasets leading to greater epistemic uncertainty at test time about the environment’s dynamics.
- RQ4 evaluates the scenario with high epistemic uncertainty due to changing environment dynamics at test time.
Although RQ1, RQ3, and RQ4 all assess performance under high epistemic uncertainty, they do so in different settings. To clarify this, we have revised the text to explicitly define the sources of epistemic uncertainty in each RQ and ensure that their scopes are distinct.
”the last comment in alg2 refers to line 5 in alg1, but there are no line numbers, line 5 specifically is the beginning of a loop”
Thank you for catching this oversight. We have added line numbers to Algorithm 1 and corrected the comment in Algorithm 2 to accurately describe the referenced portion of Algorithm 1.
”what are the error bars used in the experiments?”
The error bars in Figure 4 represent the standard error computed over three random seeds. All experiments presented in Section 5 use three random seeds. We have updated the figure caption to clarify this.
”bold vs underline meaning in Table 1?”
We appreciate the opportunity to clarify. Bold numbers indicate the best performance for each prior policy learning algorithm. Underlined numbers indicate the top two results when their confidence intervals overlap significantly. Since RefPlan is designed to enhance the performance of offline-learned prior policies during test time, the comparison is made per prior policy. For instance, in the Hopper environment with a medium dataset, RefPlan boosts the performance of the CQL prior policy from 66.9 to 85.1. We have revised the text to make this clearer.
”H-step is mentioned without being defined.”
Thank you for noting this potential source of confusion. We define the H-step return in Equation 1 as the discounted sum of model-predicted rewards over H steps, and we consistently use H to denote the prediction horizon. If there are additional areas where this definition is unclear, we would appreciate further clarification and will make the necessary revisions.
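For completeness, the H-step return we refer to can be sketched as below; the notation here is ours, and the optional terminal value term mentioned in the comment reflects common planning practice rather than a restatement of Equation 1.

```latex
% H-step return: discounted sum of model-predicted rewards over horizon H.
R_H(s_0, a_{0:H-1}) \;=\; \sum_{t=0}^{H-1} \gamma^{t}\, \hat{r}(s_t, a_t),
\qquad s_{t+1} \sim \hat{p}(\,\cdot \mid s_t, a_t).
% (Some planners additionally append a terminal value estimate \gamma^{H} V(s_H).)
```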
Dear Reviewer sNjS,
The discussion phase is nearing its conclusion, and we would greatly appreciate it if you could review our responses and let us know if they have adequately addressed your concerns. And we hope our clarifications and additional results will assist in reconsidering your evaluation.
Thank you once again for your time and thoughtful review.
We sincerely thank all the reviewers for their thoughtful feedback, constructive suggestions, and time spent reviewing our work. Your comments have been invaluable in helping us improve the clarity, presentation, and evaluation of our paper. Below, we summarize the key changes made in response to your feedback:
Paper Edits
- Background and Related Work: We added additional discussion on background materials and related work in the main text and appendix to make the paper more accessible to readers unfamiliar with specific concepts or methods.
- Figure and Table Captions: We updated captions throughout the paper to include more detailed explanations of the presented results.
- Rephrased Research Questions: In Section 5, we rephrased the research questions to better clarify the purpose of each experiment and its design.
- Updated Figure 4: We extended Figure 4 to show performance for all methods at the full dataset size, providing a clearer understanding of how performance scales with data availability.
Additional Experiments and Analyses
- RLiable Comparison: In Appendix B.1, we used the RLiable [1] framework to compare RefPlan and LOOP across multiple performance metrics. The results show that RefPlan outperforms LOOP consistently, with non-overlapping confidence intervals indicating statistical significance.
- New Baseline for Offline Policy Optimization: Appendix B.3 includes a new baseline that uses the VAE dynamics model for offline policy optimization. The comparison clearly demonstrates that planning with RefPlan using the VAE model provides better performance than using the model for offline policy learning.
- Effect of Latent Samples on Performance: In Appendix B.4, we analyzed how the number of latent samples affects the sample variance of planned actions and evaluation performance. The results show a general trend of improved performance and reduced variance with more samples.
- Hyperparameter Tuning Analysis: Appendix D.2 includes a new analysis (Figure 11) showing how Bayesian optimization can efficiently tune RefPlan’s hyperparameters. The results suggest that good performance can be achieved with a manageable number of iterations, significantly reducing the computational cost compared to a full grid search.
We hope that these revisions address your concerns and demonstrate the contributions and robustness of our work. Thank you again for your valuable feedback and for helping us improve our paper.
[1] Agarwal et al., "Deep reinforcement learning at the edge of the statistical precipice," NeurIPS’21.
This paper proposes a new method incorporating Bayesian uncertainty about the transition dynamics into a latent state space. The reviewers brought up questions about novelty in that it was not clear how this work was accurately positioned relative to the other works. Part of this is due to the background section not being accessible to most RL audiences despite having 4.5 pages to cover the intro and preliminaries. Reviewers also brought up issues with the experiments, noting that comparisons may not be fair due to hyperparameter tuning and lack of sufficient statistical support. Several reviewers did not respond after the author's response. However, in my own evaluation of the paper, I also noticed significant methodological errors. First, the randomness of the hyperparameter tuning process is not considered in the comparison. Second, the confidence intervals are not provided, issues with multiple comparisons are ignored, and confidence intervals are based on three seeds, which is known to be unreliable for the bootstrapping method. Furthermore, since the performance difference is small it is even more likely that the confidence intervals are not valid. This means identifying meaningful conclusions from the results is not possible.
Based on the errors above, I do not recommend the paper for acceptance. I suggest the authors consider the above points and the reviewers' comments in a future revision of the paper.
Additional Comments on Reviewer Discussion
Only one reviewer responded to the authors, and this did not increase the score.
Reject