Behavior-agnostic Task Inference for Robust Offline In-context Reinforcement Learning
A model-based task inference method that is robust to changes in context behavior.
Summary
Reviews and Discussion
This work analyzes the shortcomings of existing In-context Reinforcement Learning (ICRL) methods, pointing out their inability to handle context shift scenarios. The authors theoretically analyze the necessity of maximizing the true mutual information between context representation and task indices. Building on this foundation, they propose Behavior-agnostic Task Inference (BATI), which ensures that the context representation focuses solely on the environmental dynamics. Finally, experiments conducted in environments with noisy dynamics demonstrate the effectiveness of BATI.
Update After Rebuttal
Thanks for the authors' reply; I have no further questions.
Questions for Authors
From my perspective, methods like Algorithm Distillation handle both task inference and context-conditioned policy within a single model, such as a Transformer. Compared to these methods, what are the advantages and disadvantages of the authors' approach, which uses two separate models for learning?
Claims and Evidence
Yes
Methods and Evaluation Criteria
Yes
Theoretical Claims
I have reviewed Theorem 3.1 proposed by the authors, but I cannot guarantee that the proof is entirely correct.
Experimental Designs or Analyses
The experimental design is basically reasonable.
Supplementary Material
I reviewed all supplementary material.
Relation to Broader Scientific Literature
The authors' discussion on context shift in ICRL is quite intriguing and provides new insights for future work on generalization in ICRL.
Essential References Not Discussed
No
Other Strengths and Weaknesses
Strengths
- The authors' analysis of context shift appears to be insightful. Particularly, the example in Figure 1 aptly illustrates their motivation.
- The analysis of the UNICORN and CSRO works theoretically explains the motivation behind BATI's focus on environmental dynamics.
- The experimental results demonstrate that BATI's performance is competitive, especially in scenarios where the environmental dynamics noise increases.
Weaknesses
- Although BATI avoids the impact of behavioral policies on context shift by predicting environmental dynamics, it seems to also limit the information capacity of the task inference encoder.
- Although the experimental results are promising, the types of experimental environments are somewhat limited. In addition, the comparisons do not include baselines like Algorithm Distillation, which use a Transformer backbone, thereby reducing the persuasiveness of BATI.
Other Comments or Suggestions
No other comments.
Thank you for your review! We answer your questions below.
Q1. Types of environments.
R1: We use a set of MuJoCo environments that is standard in the field of meta-RL and consistent with our baselines. To further demonstrate the generality of BATI, we have additionally conducted a preliminary multi-agent experiment. We choose Kuhn Poker, a two-player card game with discrete state and action spaces, differing from the continuous MuJoCo environments used in our paper. We generate different player-2 (opponent) policies as "tasks" and learn an adaptive policy for player-1 over 10 episodes (20 steps) of contexts. As shown in the table below, BATI maintains superior performance over all baselines, further showcasing its capabilities and generalization.
| Method | Oracle (pre-IQL) | BATI | CSRO | FOCAL | Recon | UNICORN |
|---|---|---|---|---|---|---|
| Episodic Return | | | | | | |
Q2. Comparison with single-model, transformer-based approaches.
R2: We note that Algorithm Distillation imposes additional requirements on the training data (i.e. generated by the learning histories of an RL algorithm), making it less general than BATI and unsuitable for our evaluation setup. In general, we remark that the family of single-model approaches performs Bayesian inference to deduce the posterior given a context trajectory, and is thus similarly flawed as the context encoders of our baselines, as argued in Sec. 3 and 4 in our paper.
To support this argument, we add a new, recent baseline DPT that also makes use of a transformer to encode the context and predict the optimal action, outperforming Algorithm Distillation. Due to the computational requirements of training transformers and the tight rebuttal schedule, we implement DPT and evaluate it in AntDir and HalfCheetahVel, as shown below. It can be seen that DPT is outperformed in both environments.
| Environments | Oracle (pre-IQL) | BATI | DPT | CSRO | FOCAL | Recon | UNICORN |
|---|---|---|---|---|---|---|---|
| AntDir | | | | | | | |
| HalfCheetahVel | | | | | | | |
Q3. Design and capacity of task embeddings.
R3: We demonstrate the expressivity of our task embedding design with extensive experiments in our paper, showing that BATI outperforms the baselines across the board. To further support the capabilities of BATI, we conduct a new OOD task generalization experiment in AntDir, which is even more demanding for the task embedding, requiring generalization to both OOD contexts and tasks. We sample training goal directions from and testing directions from to build disjoint distributions of training and testing tasks. The results are shown in the table below. It can be seen that while the performance decreases compared to the in-distribution case, BATI still outperforms baselines by a large margin and is the only method to achieve positive returns. This further validates the generalization capability of BATI and its task embedding design.
| Method | BATI | CSRO | FOCAL | Recon | UNICORN |
|---|---|---|---|---|---|
| Episodic Return | | | | | |
Thanks for the authors' reply; I have no further questions.
Thank you again for your time and effort! We are glad to see that our response addresses your concerns and would greatly appreciate it if you would consider revising your score.
The authors propose the Behavior-agnostic Task Inference (BATI) approach for meta RL problems, which is claimed to be more robust to noisy dynamics than previous methods like UNICORN or CSRO while performing at the same level in noise-free cases.
Questions for Authors
All of my questions were asked in previous sections of the review.
Claims and Evidence
Claims are supported by the evidence, but the experimental design is not clear to me, which makes the evidence questionable. E.g. CSRO, which is the strongest baseline, was not designed for evaluation under adversarial contexts, and it is not clear whether it is tested under the same protocol as BATI or not. More on this in the following sections. Replacing BRAC with IQL in the baselines without demonstrating the difference is not reliable. Also, I haven't found details on hyperparameter tuning for BATI and the baselines, which is crucial for RL.
Methods and Evaluation Criteria
The evaluation criteria and the benchmarks used (meta MuJoCo environments) make sense and are common in the field.
But I would like to suggest an additional set of experiments on the task side. How would performance change if we had different distributions for train and test tasks? E.g. in AntDir, what if train goals are sampled from and test goals from ? UNICORN and CSRO should be able to handle this, as demonstrated by the corresponding works, while it is not clear for BATI because of its design (the finite codebook of latent representations for tasks Z).
Theoretical Claims
I did not carefully check the theoretical claims and did not observe issues during reading. I did not read the theoretical part in the Appendix.
Experimental Designs or Analyses
The authors provide a valid experimental design to support the claims that their method is robust to noisy dynamics and irrelevant contexts. However, it is not clear whether the comparison against CSRO is correct. I haven't seen the test protocol for CSRO, which is designed to infer the task by collecting the context on its own. Was it provided with the same context as BATI? It is not supposed to work with adversarial contexts and might perform better if the context is collected from scratch.
Based on the previous remark, I think it is important to check how the algorithms would behave without adversarial contexts, i.e. collecting the entire context needed for task inference on their own (zero-shot).
If I understood BATI correctly, there is a finite "code-book" (table) of latent representations of the tasks built from the training instances. First, I did not understand whether its size is equal to the number of training tasks or not. Either way, intuitively, BATI might depend heavily on this table size, and there is no experiment that demonstrates this sensitivity. If the table size matches the number of training tasks, it is important to compare BATI and the baselines under various training dataset sizes in terms of the number of tasks.
There is also the set of experiments about OOD environment instances that I mentioned in "Methods and Evaluation Criteria", which I find lacking.
The authors replaced BRAC, which was used in CSRO and UNICORN as the offline RL algorithm, with IQL. While the Appendix states in text that IQL is more stable, I consider this a major change to the baselines and wish to see the exact performance difference between the approaches backed by BRAC and by IQL. As it stands, I can imagine that BRAC was not working for BATI while working for the other baselines. It would also be nice to reference that IQL and BRAC-like algorithms demonstrate competitive performance (https://arxiv.org/abs/2210.07105).
I did not find any details on hyperparameter tuning for BATI or the baselines. Given that the authors changed the offline RL algorithm from BRAC to IQL and collected their own datasets, it is important to show that BATI and the baselines (or at least the most competitive one, CSRO) were tuned with a similar hyperparameter tuning budget.
It would also be good to see BATI's sensitivity to its hyperparameters, e.g. the number of latent samples N or the task latent dimensionality.
Supplementary Material
I’ve checked all supplementary except Appendix A.
I found hyperparameter tuning details and some dataset collection information (e.g. the number of training instances, the number of rollouts and their sizes) to be missing.
Relation to Broader Scientific Literature
The authors present an approach that is more robust to noisy dynamics and adversarial contexts -- two aspects which seem essential for meta RL. They demonstrate that prior powerful offline meta RL methods are not able to handle such shifts easily.
Essential References Not Discussed
While mentioning Algorithm Distillation, the authors do not discuss follow-up works based on its findings or similar approaches based on scalable Transformer architectures. I believe those works should at least be mentioned in the context of this work and ideally compared against (but I do not require that now as an important change due to the computational costs). Transformers are known for being able to adapt to completely novel tasks after a certain scale. Here is a list of works I would recommend discussing:
- Prompt DT (https://arxiv.org/abs/2206.13499), which solves offline meta RL tasks in-context by training a transformer to predict the next action given the context of interactions with the environment.
- https://arxiv.org/abs/2312.03801, where the authors demonstrate that transformers are able to adapt to completely novel environments in-context after particular offline pre-training.
- Headless AD (https://arxiv.org/abs/2312.13327), which proposes an AD modification that is able to adapt to a completely novel action space, i.e. novel environment dynamics.
Other Strengths and Weaknesses
The latent Z table might be a strong limitation when compared to previous methods which are based on context encoder models. The experiments I've asked for in this regard should reveal whether this is true.
Other Comments or Suggestions
I did not understand where the training objective comes from. While the numerator seems intuitive, I do not understand why there is a denominator. Wouldn't it just try to produce large positive numbers to reduce the loss? Why is it there at all?
Thank you for your efforts in reviewing our paper and for the detailed review! We address your concerns below. Due to space limits, all the figures and tables referred to below are available anonymously on this website.
Q1. Experiment protocol of BATI and baselines.
R1: We clarify that all methods, including CSRO, are evaluated on the same context distribution during test time, as described in Section 5.1. We design our testing protocol to be as challenging as possible since this aligns better with real-world scenarios where dynamics are noisy and the policy may not have control over the context (e.g. provided by a human operator). Other baselines like UNICORN and FOCAL also use a similar protocol. To address your concerns more thoroughly, we evaluate CSRO on HalfCheetahDir using their self-collection protocol and obtain a performance of . While the stronger protocol improves performance, CSRO still lags behind BATI by a lot.
Q2. Choice of base offline RL algorithm and hyperparameter tuning.
R2: We switched from BRAC to IQL because we observed in some cases that BRAC simply failed to learn; see website Fig. 2 for results in WalkerRandParams. We also performed little hyperparameter tuning and directly adopted hyperparameters from previous works. The IQL- and CSRO-related ones are borrowed from their respective papers (CSRO CLUB weights are divided by 10 since their implementation has a weight of 10 for the FOCAL loss while ours has 1) while the rest comes from UNICORN, including those of data-collection runs (with learning steps and dataset sizes slightly adjusted to accelerate training).
Furthermore, we would like to remark that as argued in Sec. 3 and 4, the core problem with baselines lies in task inference and not policy learning; the performances of baselines may even worsen with hyperparameter tuning, since the meta-policy would then get better at executing a wrong policy.
Q3. OOD task experiments.
R3: OOD generalization for tasks is extremely challenging, and to the best of our knowledge, the primary experiments of UNICORN and CSRO focus on generalization to OOD contexts on in-distribution tasks, same as BATI. Only UNICORN attempted an OOD task setting with a model-based RL approach, which is out of scope for our paper.
Nevertheless, we acknowledge the importance of OOD generalization and conduct the experiment you requested in AntDir with training goal directions sampled from and testing directions from . The results are shown on the website (Tab. 1). It can be seen that while the performance decreases compared to the in-distribution case, BATI still outperforms baselines by a large margin and is the only method to achieve positive returns. This further validates the generalization capability of BATI and its task embedding designs.
Q4. Task embedding table size; sensitivity to hyperparameters.
R4: The size of the task embedding table is the same as the number of training tasks, as described in Section 4.2. To demonstrate the robustness of BATI with respect to the number of training tasks, we conduct an ablation in AntDir that splits the 40 tasks into different train/eval splits. The results are shown on the website (Tab. 2). It can be seen that BATI outperforms baselines in all settings and is highly stable.
We conduct another ablation in WalkerRandParams to validate the robustness of BATI with respect to task embedding size. Compared with the main result of with embedding size 32, size 16 and 64 yield performances of and , respectively, demonstrating the robustness of BATI.
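To make the data structure concrete, below is a minimal sketch of the task embedding table; the module choice (a PyTorch `nn.Embedding`) and the sizes are illustrative placeholders, not our exact implementation.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: one learnable latent per training task, indexed by task id.
# The sizes are placeholders (e.g., 40 training tasks, embedding dimension 32).
num_train_tasks, embed_dim = 40, 32
z_table = nn.Embedding(num_embeddings=num_train_tasks, embedding_dim=embed_dim)

z_k = z_table(torch.tensor(3))  # latent of the training task with id 3, shape [embed_dim]
```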
Q5. Discussion and comparison with transformer-based methods.
R5: Thank you for the additional related works! We remark that all three methods belong to the same paradigm, where a transformer is trained to predict actions conditioned on the context, an act of Bayesian inference that suffers from the same flaws as the context encoders of our baselines.
To support this argument, we add a new baseline DPT that also uses a transformer to encode the context and predict the optimal action, outperforming AD. Due to limited time, we implement DPT and evaluate it in AntDir and HalfCheetahVel, as shown on the website (Tab. 3). DPT is inferior to BATI in both environments due to spurious correlations.
Q6. Dynamics model training objective .
R6: Thank you for pointing this out! We apologize for the confusion; the correct objective for Eq. 7 should be . We design to be a Gaussian distribution with mean and (log) variance parameterized by neural networks g and h, and L_recon is its negative log likelihood (up to a constant). This will be fixed in the revision.
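For clarity, below is a minimal PyTorch sketch of how such a diagonal-Gaussian dynamics head and its negative-log-likelihood reconstruction loss could look; the architecture, layer sizes, and names are illustrative assumptions rather than our exact implementation.

```python
import torch
import torch.nn as nn

class GaussianDynamics(nn.Module):
    """Sketch of p(s' | s, a, z) as a diagonal Gaussian whose mean and log-variance
    come from two small MLPs (the g and h mentioned above). Sizes are placeholders."""
    def __init__(self, state_dim, action_dim, latent_dim, hidden=256):
        super().__init__()
        in_dim = state_dim + action_dim + latent_dim
        self.g = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, state_dim))  # mean
        self.h = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, state_dim))  # log-variance

    def recon_loss(self, s, a, z, s_next):
        x = torch.cat([s, a, z], dim=-1)
        mean, log_var = self.g(x), self.h(x)
        # Gaussian negative log-likelihood, dropping the additive log(2*pi) constant.
        return 0.5 * (((s_next - mean) ** 2) * torch.exp(-log_var) + log_var).sum(-1).mean()
```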
The paper introduces Behavior-Agnostic Task Inference (BATI) to improve offline in-context reinforcement learning (ICRL) under distribution shifts. BATI, a model-based maximum-likelihood approach, infers task representations robustly by focusing on environmental dynamics. Results show BATI outperforms existing methods, especially with context shifts and noise.
Questions for Authors
Please address the above comments.
Claims and Evidence
Yes
Methods and Evaluation Criteria
The proposed method, BATI, makes sense for addressing the identified problem of distribution shifts in offline ICRL. It effectively shifts the focus from Bayesian posterior inference to a maximum-likelihood estimation of environmental dynamics, thus being more robust to context shifts. The evaluation criteria, using MuJoCo environments and varying noise levels, are appropriate for assessing BATI's performance and robustness in different scenarios.
Theoretical Claims
The theoretical result is standard.
Experimental Designs or Analyses
The experimental designs and analyses are generally sound, using MuJoCo environments, relevant baselines, ablation studies, and appropriate evaluation metrics.
Supplementary Material
No
Relation to Broader Scientific Literature
The paper improves offline in-context reinforcement learning (ICRL) under distribution shifts.
Essential References Not Discussed
No
Other Strengths and Weaknesses
Strengths
The paper is generally well-written and easy to follow. The authors clearly explain the problem, the proposed solution, and the experimental setup.
The paper addresses a critical challenge in ICRL, which is the vulnerability to distribution shifts. Overcoming this limitation is essential for the broader applicability of ICRL in real-world scenarios.
Weaknesses
The experiments are primarily conducted in MuJoCo environments. While these are standard benchmarks, it would be beneficial to see results in other domains, such as MetaWorld, to further demonstrate the generalizability of BATI.
The theoretical analysis in Section 3 and Section 4.1 seems to have little connection to the proposed method. Why can the proposed method address the theoretical issues of the previous works shown in Section 3 and Section 4.1? Is there any theoretical analysis for the proposed method?
Other Comments or Suggestions
The paper spends several pages explaining the issues of previous methods, such as the spurious correlation of the context learner. The paper could be strengthened by an analysis of how the proposed method addresses these issues.
Thank you for your thoughtful and constructive review. We're encouraged by the overall positive assessment that our paper "is generally well-written" and "addresses a critical challenge in ICRL". Below, we address the main concerns raised regarding the generalizability of our method beyond MuJoCo environments and the theoretical connections between our analysis and the proposed BATI approach. As demonstrated in our preliminary multi-agent experiments and theoretical explanations, BATI effectively addresses the identified limitations of previous methods while maintaining strong performance across diverse domains. The following are our detailed responses:
Q1. Results in other domains.
R1: Due to the tight rebuttal schedule, we have conducted a preliminary multi-agent experiment to demonstrate the general applicability of BATI in other domains. We choose Kuhn Poker, a two-player card game with discrete state and action spaces, differing from the continuous MuJoCo environments used in our paper. We generate different player-2 (opponent) policies as "tasks" and learn an adaptive policy for player-1 over 10 episodes (20 steps) of contexts. As shown in the table below, BATI maintains superior performance over all baselines, further showcasing its capabilities and generalization.
| Method | Oracle (pre-IQL) | BATI | CSRO | FOCAL | Recon | UNICORN |
|---|---|---|---|---|---|---|
| Episodic Return | | | | | | |
Q2. Theoretical analysis of our method.
R2: We apologize for any potential confusion. We establish the preliminaries and analyze the prior methods in Section 3. Building on these insights, Section 4.1 is primarily dedicated to our method that circumvents the prior failure modes. In Section 4.1, we provide two perspectives on the reason why BATI works: 1) the graphical model perspective (Fig. 2) where BATI achieves robustness by blocking ; 2) the robust likelihood perspective (page 4 right, line 179-219) where the core objective of BATI can be understood as a robust version of the full likelihood that does not depend on through derivations. We will communicate these points more clearly in the revision and greatly appreciate any suggestions to further improve the writing.
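To make the second perspective concrete, here is a rough sketch in our rebuttal's own notation (not the paper's exact derivation), assuming a Markovian context and treating the behavior policy as external to the task:

```latex
% Illustrative sketch; notation is ours, not copied from the paper.
% For a context \tau = (s_1, a_1, r_1, s_2, \dots) collected by a behavior policy \pi_\beta:
\log p(\tau \mid z)
  \;=\; \underbrace{\sum_t \log p(s_{t+1}, r_t \mid s_t, a_t, z)}_{\text{dynamics term}}
  \;+\; \underbrace{\sum_t \log \pi_\beta(a_t \mid s_t)}_{\text{behavior term}}.
% Under the behavior-agnostic view, the behavior term carries no information about z,
% so ranking candidate task latents by the dynamics term alone is unaffected by which
% policy collected the context.
```

The precise statement and derivation are given in Section 4.1 of the paper (page 4 right, lines 179-219).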
The authors propose a modification to the way offline context-based meta-RL methods supervise task identification, which they call BATI. The core idea is to remove the correlation between the task estimate and the behavior of the policy collecting the context. That way, the test-time context policy can be significantly different from the policy used to collect training context, and the method will still accurately select a task representation that leads to high returns when used as an augmented state feature for a standard offline-RL policy. The authors evaluate BATI in the standard meta-RL extensions of gym locomotion tasks under a setup that highlights robustness to stochastic dynamics and adversarial test-time context.
Update After Rebuttal
I appreciate the authors' reply and have no further concerns. I maintain my positive score.
Questions for Authors
- How would the authors expect results to change when the train-time context data is generated by policies that are adapting over some distribution of tasks (rather than single-task experts trained by SAC)?
- What would be the challenges involved in scaling BATI to domains with extremely large task spaces?
Claims and Evidence
Yes, the main claims are well studied by the experiments.
Methods and Evaluation Criteria
The authors evaluate their method in a set of mujoco locomotion benchmarks with a small set of randomized objectives and dynamics that have been staples since the first wave of deep meta-RL papers. These tasks play to the strengths of methods that can learn precise task estimates from short context over a narrow range of behavior. These benchmarks are quite saturated at this point. However, diverse and affordable alternatives are still starting to emerge, and this choice isn’t meaningfully impacting my review. The authors go to great lengths to try and make these tasks more interesting by adding noise to their dynamics and creating adversarial context sequences at test-time (mainly consisting of actions chosen from the expert data collection policy for the least similar task). Still, I wonder whether the method would scale to diverse training distributions that might require large M (the size of the task set and number of embeddings).
Theoretical Claims
I reviewed the appendix, including the argument of Theorem 3.1.
Experimental Designs or Analyses
The experiments seem sound and implement all the baselines fairly on top of the same codebase. Transferable improvements (such as a change in the base offline RL algorithm) are applied across multiple baselines. The authors are relying on a nearly worst-case-scenario setup to evaluate their method on the tasks chosen, but this may reflect real applications.
It may be useful to report some oracle (task-conditioned or single-task expert) references scores for the experiments where the dynamics of the locomotion envs (like AntDir) are given significant noise.
The low seed counts are a bit of an issue given the very thin margins between most of the baseline results.
Supplementary Material
I read the Appendix.
Relation to Broader Scientific Literature
The paper’s use of the term “in-context RL” (ICRL) is confusing — if this term even means anything unique at this point (and maybe it doesn’t). To review, “in-context learning” is primarily used to describe sequence models’ (almost always Transformers/LLMs) ability to understand a task at test-time based on a small sample of input/output pairs. This ability is often attributed to implicit Bayesian inference of the task inside the activations of the model, though there are some other explanations of this effect in NLP.
Modern Meta-RL and Meta-IL have begun to borrow the term when using Transformers to implicitly learn the ability to improve based on their input sequence. It is broadly used to rebrand traditional subsets of black-box meta-learning by using the connection to LLMs to highlight the flexibility of implicit task inference outside of standard few-shot meta-RL benchmarks (e.g., to few-shot prompting, opponent modeling). Or, in the case of Algorithm Distillation, to the implicit ability to learn the RL gradient update itself. BATI's explicit task inference by conditioning the online policy on one of M task embeddings that minimizes a reconstruction objective on data provided by another behavior policy fits in neither category. There is no task estimate emerging implicitly in the context length of a sequence model. I had never seen the term "in-context RL" used this way. The authors write "We identify a critical limitation in existing offline ICRL methods" and then cite several sources, but the only one of these sources that actually calls it "in-context RL" is the most recent UNICORN (Li et al., 2024), which I assume is how this came to be. This might be a losing battle, as the vocab and taxonomy of meta-RL have been messy for quite some time, but I really don't think the term "in-context" applies here at all.
Essential References Not Discussed
The literature is well covered, but I am interested in the authors’ thoughts on the connection to Dorfman et al., 2022. My understanding is that the evaluation setting in BATI is a similar worst-case scenario to what is discussed there, in that the context dataset is collected by an expert in a single task and therefore its behavior heavily implies the task identity.
Other Strengths and Weaknesses
The main argument of behavior-agnostic task identification from context sets generated by single-task experts is made clearly and given plenty of motivation.
Other Comments or Suggestions
I think the paper would benefit from some more details in the Appendix related to the network architecture and overall implementation. Currently, the authors are relying on UNICORN for much of this information, which prevents the paper from standing alone. That’s a shame because the core argument for the change in inference objective is given enough space in the main text that someone would not necessarily have to read the baseline papers to follow it.
Thank you for your kind words and insightful review! We are happy to answer your questions below.
Q1. Training contexts generated by adaptive policies.
R1: The result would likely depend on the exact behavior of the adaptive policy. For example, if the adaptive policy is bad, it may randomly walk around the state space in any task and reveal little information about the true task identity, in which case all approaches would fail due to the uninformative nature of the context. At the other end of the spectrum, if the adaptive policy is good and finds the optimal policy quickly, its behavior may diverge significantly between tasks, resembling the single-task expert case where BATI works and baselines don't. In general, we expect BATI to outperform or at least remain competitive with the baselines for all cases, and the baselines to work only when the adaptive policy behaves similarly in all tasks yet still reveals a lot of task-related information.
Q2. Challenges brought by extremely large task spaces.
R2: A good question! Indeed, this is one of the potential future directions we'd like to explore. The task inference procedure of BATI has a time complexity of , where is the number of samples (lower-bounded by the number of training tasks) and is the length of the context, and it can be fully vectorized. This allows the inference procedure to run reasonably fast. However, if the task space is extremely large and cannot practically be enumerated, more sophisticated techniques could be employed to perform the inference. For example, we can use diffusion posterior sampling to sample from the conditional posterior , where is the task latent, is , and is . Under this formulation, the task prior is represented by a diffusion model instead of a table of embeddings, and the (robust) likelihood is provided by the dynamics model, whose gradient serves as a form of classifier guidance. Leveraging the powerful expressivity of diffusion sampling procedures, we can efficiently and robustly sample from a much larger task space. However, demonstrating the advantage of such an approach requires a benchmark with a sufficiently complicated task space, which is still emerging, as pointed out in your review; we will explore this further in our future work.
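For concreteness, here is a simplified sketch of the fully vectorized enumeration; it enumerates only the candidate task latents in the table and assumes a `dynamics.log_prob` interface, so it illustrates the complexity argument rather than our actual code (the real procedure may score additional sampled latents as well).

```python
import torch

@torch.no_grad()
def infer_task(dynamics, z_table, s, a, s_next):
    """Score every candidate task latent by the dynamics log-likelihood of the context
    and return the best one. z_table: [M, latent_dim]; s, a, s_next: [T, ...]."""
    M, T = z_table.shape[0], s.shape[0]
    # Broadcast all (transition, candidate-latent) pairs: M * T dynamics evaluations,
    # computed in a single batched forward pass.
    s_b = s.unsqueeze(0).expand(M, -1, -1)
    a_b = a.unsqueeze(0).expand(M, -1, -1)
    sn_b = s_next.unsqueeze(0).expand(M, -1, -1)
    z_b = z_table.unsqueeze(1).expand(-1, T, -1)
    log_lik = dynamics.log_prob(s_b, a_b, z_b, sn_b).sum(dim=1)  # [M]
    return z_table[log_lik.argmax()]
```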
Q3. Connection with BOReL (Dorfman et al., 2022).
R3: Thank you for bringing our attention to this related work! We note that the training data of BOReL are actually augmented via their reward relabelling (RR) procedure, in effect rolling out all context collection policies in all tasks and making and independent. This avoids the spurious correlation issue during training, but requires additional properties (e.g. tasks differing only by rewards) and ground-truth reward functions. Indeed, as noted in BOReL, "With reward relabelling ablated ... the agent believes the reward is at the point it first visited on the semi-circle" (Page 8, lower left), which is a consequence of the spurious correlation phenomenon. Our BATI can be applied to BOReL to replace RR without any additional requirements.
Q4. Oracle references and number of seeds.
R4: We have updated the main results to include oracle references by single-task experts, available anonymously here (Fig. 1 and Tab. 4). BATI is able to match or even slightly outperform the expert data in most cases due to optimizations performed by IQL. Due to limited time and computational resources, we have run two additional seeds for the main experiments, bringing the total to 5 seeds; the changes are also reflected in the figure and table above.
Q5. Usage of "in-context RL".
R5: Thank you for the detailed and insightful comments! In BATI, we treat ICRL as mostly synonymous with context-based meta-RL, i.e. learning policies that can efficiently adapt to the context and improve on-the-fly without laborious fine-tuning, regardless of transformer or implicitness. We argue that this "functional" definition better captures the essence of ICRL. For example, there are state-space model approaches that produce a fixed-length representation of the context without transformers, and methods like DPT that are trained to perform explicit Bayesian inference of optimal actions. However, we stress that we greatly value inputs from the community about terminological choices, and are open to changes if deemed necessary.
Q6. Additional hyperparameter and implementation details.
R6: Thank you for your advice! As the ICML 2025 reviewing policy prevents authors from changing the PDF during rebuttal, we'll add relevant details to the appendix for a more self-contained reading experience when we get to update the paper.
This work proposes a new method for task identification in offline context-based meta RL, enhancing the agent's robustness to distribution shift at test time. Experiments are conducted in MuJoCo domains, demonstrating reasonable performance improvements. Theoretical analyses are provided to complement the algorithmic design.
3 out of 4 reviewers appreciate the contribution of the work and are in favour of acceptance. The only negative review is relatively short and the reviewer (bfNC) did not engage well in author reviewer discussion. I therefore recommend acceptance. That being said, I do agree with bfNC that the work can benefit from experiments in more challenging domains such as Meta World. The author response does provide extra experiments and I hope the authors could provide more in the camera ready version.