PaperHub
Overall rating: 5.7 / 10 (Poster; 3 reviewers; min 4, max 7, std 1.2)
Individual ratings: 6, 7, 4
Confidence: 3.7 · Correctness: 3.0 · Contribution: 2.7 · Presentation: 3.3
NeurIPS 2024

Model-Based Transfer Learning for Contextual Reinforcement Learning

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2025-01-16

Abstract

Keywords
Deep Reinforcement Learning, Zero-Shot Transfer, Generalization, Bayesian Optimization

Reviews and Discussion

Review (Rating: 6)

The paper proposes Model-Based Transfer Learning (MBTL) in Contextual RL, which solves multiple related tasks and enhances generalization across different tasks. MBTL strategically selects a set of source tasks to maximize overall performance and minimize training costs. The paper theoretically demonstrates that the method exhibits regret that is sublinear in the number of training tasks. MBTL achieves greater performance than other baselines (e.g., exhaustive training, multi-task training, and random selection) on urban traffic and control benchmarks.

Strengths

  • The paper is well-written and organized, and includes thorough illustrative diagrams and examples.
  • The problem formulation is well done, with clear mathematical representation. The analysis of Bayesian Optimization appears to be thorough.
  • The authors provide code for training and evaluating their proposed method. This significantly facilitates reproducing the experimental results and extending the introduced method.

Weaknesses

  • The assumptions are too tight, particularly Assumption 3, which models the generalization gap using linear constraints. This approach is unsuitable for complex environments and therefore lacks generalizability.
  • The experimental environments (urban traffic and control benchmarks) are overly simplistic, consisting solely of vector environments with low-dimensional state spaces. The study lacks comparisons with complex tasks, such as those in the CARL benchmark, including games and a real-world application of RNA design.
  • The ablations on DRL algorithms (DQN, PPO, A2C) utilize outdated methods. Why not use more recent RL baselines?

Questions

Below, I have a few questions and feedback for the authors:

  • How does the computational time consumption compare between MBTL and other baselines (exhaustive training, multi-task training, and random selection)?
  • I am curious to see experimental results in complex environments, such as visual environments.

Limitations

N/A

Comment
  • [1] C. Benjamins et al., “Contextualize Me -- The Case for Context in Reinforcement Learning,” Transactions on Machine Learning Research, Jun. 2023.
  • [2] https://automl.github.io/CARL/main/source/environments/environment_families/rna.html
  • [3] Q. Li, Z. Peng, L. Feng, Q. Zhang, Z. Xue, and B. Zhou, “MetaDrive: Composing Diverse Driving Scenarios for Generalizable Reinforcement Learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 3, pp. 3461–3475, Mar. 2023.
  • [4] V. Jayawardana, C. Tang, S. Li, D. Suo, and C. Wu, “The Impact of Task Underspecification in Evaluating Deep Reinforcement Learning,” in Advances in Neural Information Processing Systems, 2022.
Comment

Thank you again for your review. Please let us know if you have any further questions or comments. If you feel that your questions were sufficiently addressed, we would deeply appreciate it if you could consider raising the score.

Author Response

The authors truly appreciate the reviewer’s positive feedback on our work. We invite the reviewer to also take a look at our general comments, which include additional experiments on (1) concerns with the linear generalization gap assumption, (2) the application of MBTL to tasks with high-dimensional visual inputs, and (3) multi-task baselines. We hope that the new results will satisfactorily address the concerns raised, potentially leading to a reconsideration of the score.

The assumptions are too tight, particularly Assumption 3, which models the generalization gap using linear constraints. This approach is unsuitable for complex environments and therefore lacks generalizability.

We acknowledge your concern about Assumption 3. However, this simple assumption is in fact a key strength of our MBTL algorithm. We kindly request the reviewer to see General Response 1 (GR1).

The experimental environments (urban traffic and control benchmarks) are overly simplistic, consisting solely of vector environments with low-dimensional state spaces. The study lacks comparisons with complex tasks, such as those in the CARL benchmark, including games and a real-world application of RNA design.

We appreciate your suggestion, and we understand the importance of evaluating our method on more complex tasks with high-dimensional state spaces. In response, we conducted additional experiments on high-dimensional environments. Unfortunately, the CARL benchmark [1-2] no longer supports the suggested RNA design application, so we focused on vision-control experiments instead. These experiments are based on the MetaDrive benchmark [3], which supports visually generated observations for driving scenarios. We kindly request the reviewer to look at General Response 2 (GR2) for visualizations of our results on high-dimensional state spaces. The preliminary results show that MBTL algorithms, ranging from simple strategies to the GP-based algorithm, also work in complex, high-dimensional state spaces. These experiments demonstrate the scalability and robustness of MBTL in more challenging and complex settings.

The ablations on DRL algorithms (DQN, PPO, A2C) utilize outdated methods. Why not use more recent RL baselines?

We apologize for any confusion caused by using the term "ablation." To clarify, we intended this section as a "sensitivity analysis" to demonstrate that the MBTL selection process is robust across different types of RL algorithms, whether they are value-based or policy-based, and that these methods may or may not significantly affect the generalization gap. Our intention was to show that the effectiveness of MBTL is not heavily dependent on the specific RL algorithm used. The primary goal was to highlight that both value-based methods (like DQN) and policy-based methods (like PPO and A2C) are compatible with our framework, indicating the versatility of MBTL. Additionally, we aimed to maintain consistency with the RL algorithms used in the original papers from which the traffic experiments were derived. For instance, the eco-driving control paper [4] utilized PPO variants, and we adopted similar algorithms to ensure a fair comparison and reproducibility of results.

How does the computational time consumption compare between MBTL and other baselines (exhaustive training, multi-task training, and random selection)?

Thank you for your question on the computation time. If we assume the same training time for each model, the number of trained models presented in Table 1 and Table 2 indicates the order-of-magnitude difference in computational time across methods. For instance, Exhaustive Training or Oracle Transfer requires training on all tasks, which needs $N\,(=|X|)$ models, while MBTL requires training only $k$ models. The calculation for multi-task reinforcement learning deviates from this: although it trains a single model, its cost depends on the batch size. Our results (Tables 1 and 2 in the main text) indicate that MBTL significantly reduces training time compared to exhaustive training and multi-task training while achieving comparable or better performance. This efficiency is primarily due to the strategic selection of training tasks, which minimizes redundant training. Below, we also include a comparison of the computational time required for MBTL source task selection in our experiments; MBTL-GP requires more computation than the simpler methods.

| Environment | MBTL-ES | MBTL-GS | MBTL-GP |
| --- | --- | --- | --- |
| Pendulum | 4.2518E-05 | 0.00018432 | 1.61098456 |
| Cartpole | 3.28488E-05 | 0.00026944 | 1.6663856 |
| BipedalWalker | 3.25309E-05 | 0.00015042 | 1.64290924 |
| HalfCheetah | 3.26369E-05 | 0.00014559 | 1.63489845 |
| Traffic Signal | 3.49164E-05 | 0.000156 | 0.5571901 |
| Eco-Driving | 3.16037E-05 | 0.00014793 | 0.62795639 |
| Advisory Autonomy | 3.17097E-05 | 0.00014257 | 0.69461881 |

Specifically, the running time for MBTL algorithms is relatively short compared to the actual computation time required for training RL models. When comparing the computation time for the SSTS process alone, simple strategies such as MBTL-ES (Equidistant Strategy) and MBTL-GS (Greedy Strategy, previously MBTL-PS) require almost negligible computation time. In contrast, MBTL-GP (Gaussian Process) requires additional computation time for the Bayesian optimization process.

Overall, the strategic task selection in MBTL results in a substantial reduction in the number of models trained, which in turn reduces the overall computational burden. We will provide a detailed analysis of these findings in the updated results section of the revised manuscript.

I am curious to see experimental results in complex environments, such as visual environments.

Thank you for your comments. We refer the reviewer to General Response [GR2] for visualizations and interpretations of our results on visual environments.

Comment

Thank you for your detailed response to my review. The response has addressed most of my questions.

However, I agree with reviewer 21Qs regarding the concern about the theoretical justification of the linear generalization gap.

I am inclined to keep my original score.

Review (Rating: 7)

The paper proposes a new framework to estimate the expected generalization performance across different tasks where the differences have an explicit, model-based structure. To improve the expected generalization performance via training on selected tasks, the paper proposes both naive and Bayesian-optimization-based methods to effectively explore the task space and find a policy with optimal zero-shot performance given a selected task. The experiments demonstrate that the proposed MBTL method outperforms other baseline methods and approaches the performance of the oracle transfer method that assumes full knowledge.

Strengths

  • By and large, the paper is written well. I especially appreciated the detailed discussion of the SSTS problem formulation and its relation to robust RL training.
  • The idea of using Bayesian Optimization to search training tasks via generalized performance estimation, where task differences are “model-based” or have an explicit structure, is novel and useful.
  • The section on the analysis of BO and its comparison with ES and PS is also interesting, and highlights the sublinear-regret theoretical results.
  • The theoretical results appear to be correct.

Weaknesses

  • The empirical evaluation is by and large only on low dimensional systems. It would have been interesting to see how the method would scale with more challenging, high-dimensional tasks, such as those commonly found in vision control tasks in robotics.
  • It would have been interesting to see how well this method would work with policies/controllers not parametrized with neural networks (i.e. kernel machines).
  • Some additional commentary on how to use the GP to search for a trained policy given a selected task would be helpful.

Questions

  • Could you comment on the challenges of applying MBTL-GP on vision-control tasks?

Limitations

Yes, the author has discussed the limitations.

Author Response

We appreciate the reviewer's thoughtful comments and suggestions. We encourage the reviewer to review our general comments, which include additional experiments addressing (1) concerns with the linear generalization gap assumption, (2) the application of MBTL to tasks involving high-dimensional visual inputs, and (3) multi-task baselines. We hope that the new results will satisfactorily address the concerns raised, potentially leading to a reconsideration of the score.

The empirical evaluation is by and large only on low dimensional systems. It would have been interesting to see how the method would scale with more challenging, high-dimensional tasks, such as those commonly found in vision control tasks in robotics.

We appreciate the constructive comments on the scalability of our method to high-dimensional tasks. Following your suggestion, we conducted experiments with vision-control tasks. We refer the reviewer to General Response [GR2] for visualizations and interpretations of our learning results.

It would have been interesting to see how well this method would work with policies/controllers not parametrized with neural networks (i.e. kernel machines).

Thank you for your valuable suggestion. One of the strengths of our work is its flexibility and extensibility, allowing the use of various methods beyond reinforcement learning. Our method can be applied to other approaches such as kernel methods, radial basis functions (RBF), model predictive control (MPC), and optimal control. In this paper, we were motivated by traffic examples and focused on developing a simple and practical algorithm to efficiently solve a wide range of CMDPs. While we have primarily concentrated on deep reinforcement learning algorithms using neural network parameterizations, exploring the applicability of our method with alternative approaches is indeed an interesting direction for future research. In response to your comments, we conducted preliminary experiments with support vector machines (SVMs), one of the most popular kernel machines, to solve the CartPole CMDPs. Though SVMs are not inherently designed for sequential decision-making, we used them in a supervised learning context, training the SVM on the best actions given certain states. After training the SVM model, we transferred it to other CMDPs and collected the rewards. The preliminary results in Figure R.5 show that an SVM-based controller trained on a default configuration can indeed solve other tasks, suggesting that our method shows promise when applied to kernel machines and can potentially be extended to other non-neural-network-based methods. We appreciate your suggestion and will consider including more detailed experiments and discussions in future work. A rough sketch of this kind of SVM transfer probe is given below.
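For illustration only, here is a minimal sketch of such a probe; the environment IDs, the heuristic demonstration policy, and all hyperparameters below are assumptions for illustration, not our exact setup: fit an SVM classifier on (state, action) pairs collected on the default CartPole, then roll it out zero-shot on a variant task and record the return.

```python
# Illustrative sketch only (not the authors' code): SVM "controller" transfer on CartPole.
import gymnasium as gym
import numpy as np
from sklearn.svm import SVC

def collect_demonstrations(env, policy, episodes=20):
    """Collect (state, action) pairs from a demonstration policy (a stand-in for a trained RL policy)."""
    states, actions = [], []
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            a = policy(s)
            states.append(s)
            actions.append(a)
            s, _, term, trunc, _ = env.step(a)
            done = term or trunc
    return np.array(states), np.array(actions)

def evaluate(env, controller, episodes=10):
    """Roll out the SVM controller and return the average episode return."""
    returns = []
    for _ in range(episodes):
        s, _ = env.reset()
        total, done = 0.0, False
        while not done:
            a = int(controller.predict(s[None, :])[0])
            s, r, term, trunc, _ = env.step(a)
            total += r
            done = term or trunc
        returns.append(total)
    return float(np.mean(returns))

# Train on a default context; the simple angle-based policy is a placeholder demonstration source.
source_env = gym.make("CartPole-v1")
states, actions = collect_demonstrations(source_env, policy=lambda s: int(s[2] > 0))
svm = SVC(kernel="rbf").fit(states, actions)          # kernel-machine "controller"

# Zero-shot transfer target; in practice this would be a variant context (e.g., modified pole length).
target_env = gym.make("CartPole-v1")
print("mean return on target task:", evaluate(target_env, svm))
```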

Some additional commentary on how to use the GP to search for a trained policy given a selected task would be helpful.

Thank you for your insightful comment. In our MBTL-GP method, the GP is utilized to predict the performance of a policy trained on a source task when applied to a target task. The modeled generalization gap helps predict the transferability of that source task. Using the GP model, we apply Bayesian Optimization (BO) to select the next training task. The acquisition function in BO balances exploration and exploitation by considering both the predicted mean performance and the uncertainty (variance) associated with the prediction. Specifically, our proposed modified UCB acquisition function is used. This allows us to strategically choose tasks that are likely to improve overall performance while reducing uncertainty. Once a new task is selected, the policy is trained on this task, and the resulting performance data is used to update the GP model. This iterative process continues, progressively refining the GP model and improving task selection. This approach allows the MBTL framework to efficiently search for trained policies by leveraging the predictive power of GP and the strategic task selection of BO. For a better understanding, we added Figure R.1 to illustrate the MBTL process.
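For illustration only, the selection loop described above could look roughly like the sketch below. The library choices, the UCB form, the linear-gap slope `lam`, and the dummy training function are all assumptions for illustration, not our exact implementation.

```python
# Illustrative sketch (not the authors' code) of a GP + UCB source-task selection loop:
# a GP models training performance over a 1-D context space, and an acquisition score
# based on expected marginal improvement under a linear generalization gap picks the next task.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def train_and_evaluate(x):
    """Placeholder: train an RL policy on context x and return its training performance."""
    return np.exp(-3.0 * (x - 0.4) ** 2)              # dummy performance curve

contexts = np.linspace(0.0, 1.0, 100)                 # candidate source-task contexts
lam, beta, k = 0.8, 2.0, 10                           # assumed gap slope, UCB weight, training budget
X_train, y_train = [], []
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=1e-4)

best_so_far = np.zeros_like(contexts)                 # best transferred performance per target so far
for step in range(k):
    if X_train:
        gp.fit(np.array(X_train)[:, None], np.array(y_train))
        mu, sigma = gp.predict(contexts[:, None], return_std=True)
    else:
        mu, sigma = np.zeros_like(contexts), np.ones_like(contexts)
    ucb = mu + beta * sigma                           # optimistic estimate of training performance
    # Predicted transfer from candidate x to every target x' under the linear gap model:
    #   perf(x -> x') ~= ucb(x) - lam * |x - x'|; score each candidate by its mean marginal gain.
    gains = [np.mean(np.maximum(best_so_far,
                                ucb[i] - lam * np.abs(contexts - x)) - best_so_far)
             for i, x in enumerate(contexts)]
    x_next = contexts[int(np.argmax(gains))]          # next source task to train on
    perf = train_and_evaluate(x_next)                 # train a policy on the chosen task
    X_train.append(x_next)
    y_train.append(perf)
    best_so_far = np.maximum(best_so_far, perf - lam * np.abs(contexts - x_next))
```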

Q1. Could you comment on the challenges of applying MBTL-GP on vision-control tasks?

We appreciate your suggestion, and we understand the importance of evaluating our method on more complex tasks with high-dimensional state spaces. In response, we conducted additional experiments on high-dimensional environments. Unfortunately, the CARL benchmark no longer supports the RNA design application, so we decided to focus on vision-control experiments instead. The experiments are based on the benchmark [1], which supports visual generalization for reinforcement learning. We ran preliminary experiments with this benchmark using the CartPole task, specifically examining contextual MDPs of the CartPole environment with different frame-skip parameters. This setup is similar to the task variant of advisory autonomy in the traffic tasks. We kindly request the reviewer to refer to General Response 2 [GR2]. The preliminary results in Figure R.2 show that MBTL algorithms, ranging from simple strategies to the GP-based algorithm, also work in complex, high-dimensional state spaces. These experiments demonstrate the scalability and robustness of MBTL in more challenging and complex settings.

References

  • [1] Q. Li, Z. Peng, L. Feng, Q. Zhang, Z. Xue, and B. Zhou, “MetaDrive: Composing Diverse Driving Scenarios for Generalizable Reinforcement Learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 3, pp. 3461–3475, Mar. 2023.
Comment

I want to first thank the authors for their detailed response. The response has addressed most of my questions, and I would like to keep my original score and recommend accepting this paper.

Comment

Thank you for your feedback and suggestions during the rebuttal period. It really helped us improve our work.

Review (Rating: 4)

The paper introduces a framework called Model-Based Transfer Learning (MBTL) for solving contextual reinforcement learning problems. By modelling the performance loss as a simple linear function of task context similarity, the authors leverage Bayesian optimization techniques, provide a theoretical analysis showing that MBTL exhibits sublinear regret in the number of training tasks, and discuss conditions to further tighten the regret bounds. MBTL is validated on a range of simulated traffic management scenarios and standard control benchmarks, demonstrating superior performance compared to baselines like exhaustive training, multi-task training, and random task selection.

Strengths

  1. In general, the paper is easy to follow and well-motivated. The figures are helpful for readers to quickly grasp the key concepts and problem settings.

  2. Active and transfer learning for contextual RL is an interesting and practical problem, especially given that exhaustive/multi-task training on RL tasks can be computationally expensive.

  3. The experiments are conducted on a relatively broad range of benchmarks and context variations.

Weaknesses

I do think there are several major concerns regarding the current form of the paper:

  1. The paper (especially the theoretical analysis) is based on too many assumptions: continuity of task space and performance function, linear generalization gap and deterministic MDP transitions, which oversimplifies the problem and significantly limits the applicability range of the proposed method. The most concerning is Assumption 3 (Linear generalization gap), for which I have no clue why and when this can hold. Assuming Lipschitz continuity of the generalization gap seems much more reasonable to me, which makes assumption 3 an inequality instead of an (approximate) equality.

  2. In line 96-98, the paper assumes that a vector representation/context feature $x$ of each task is given a priori, based on which the continuity assumptions 1-2 are made. However, as in line 333-335, context features are often not visible. Assuming prior knowledge of such information further limits the applicability range of the proposed method.

  3. (2 continued) In fact, no matter whether context features are given a priori, there are many effective methods which perform transfer/meta-RL by learning a conditioned policy [1][2][3], which should be significantly better than the "multi-task" baseline adopted in the paper. However, there is currently no empirical evidence on how well the proposed method compares to these stronger baselines.

  4. Even though Theorem 2 and Corollaries 2.1 and 2.2 give quantitative regret bounds assuming the search space can be narrowed down at each training step, there is no theoretical guarantee on how the proposed method, especially MBTL-GP, can effectively realize such a restricted search space. Hence the current analysis is not complete in terms of proving the effectiveness of the proposed method.

[1] Rakelly, Kate, et al. "Efficient off-policy meta-reinforcement learning via probabilistic context variables." International conference on machine learning. PMLR, 2019.

[2] Shagun Sodhani, Amy Zhang, and Joelle Pineau. Multi-task reinforcement learning with context-based representations. In International Conference on Machine Learning, pages 9767–9779. PMLR, 2021.

[3] Li, Lanqing, Rui Yang, and Dijun Luo. "Focal: Efficient fully-offline meta-reinforcement learning via distance metric learning and behavior regularization." ICLR 2021.

Questions

What's the fundamental difference between "exhaustive training" and "multi-task RL"? Based on the brief description in line 257-258, they seem similar to me. Also why is "multi-task RL" even worse than "random" most of the time (in Table 1, 2)?

Limitations

N/A

Comment

[1] L. Sun, H. Zhang, W. Xu, and M. Tomizuka, “PaCo: Parameter-Compositional Multi-Task Reinforcement Learning,” Neural Information Processing Systems., 2022.

Author Response

We appreciate the reviewer's constructive and insightful feedback. We kindly request the reviewer to also refer to our general comments, which address (1) concerns with the linear generalization gap assumption, (2) the application of MBTL to tasks with high-dimensional visual inputs, and (3) multi-task baselines. Should our response meet the reviewer's expectations, we would be grateful if they could consider increasing the score.

The paper (especially the theoretical analysis) is based on too many assumptions: continuity of task space and performance function, linear generalization gap and deterministic MDP transitions, which oversimplifies the problem and significantly limits the applicability range of the proposed method. The most concerning is Assumption 3 (Linear generalization gap), for which I have no clue why and when this can hold. Assuming Lipschitz continuity of the generalization gap seems much more reasonable to me, which makes assumption 3 an inequality instead of an (approximate) equality.

Thank you for your valuable comments. We refer to General Response [GR1], where we detail interpretations of the linear generalization gap. Moreover, while assuming Lipschitz continuity of the generalization gap could bound the performance range and provide a more flexible and realistic generalization gap estimate, incorporating Lipschitz continuity into our MBTL algorithm involves additional considerations on how to effectively utilize this bound to optimize task performance. We acknowledge that further investigation is needed to understand its potential benefits.

In line 96-98, the paper assumes that a vector representation/context feature $x$ of each task is given a priori, based on which the continuity assumptions 1-2 are made. However, as in line 333-335, context features are often not visible. Assuming prior knowledge of such information further limits the applicability range of the proposed method.

We appreciate the reviewer's insightful comments. This paper specifically addresses the SSTS problem settings where context features are visible. Many real-world applications for solving multiple related tasks with visible contexts, such as robotics manipulations, recommendation systems, autonomous driving scenarios, and personalized healthcare, often have accessible context features (e.g., robot configurations, user preferences, traffic, environmental conditions, and patient health records). These contexts allow our method to be broadly applicable in these domains.

We acknowledge that there are also scenarios where context features might not be visible or available. Extending MBTL to handle such hidden context scenarios is a promising direction for future work. This could involve developing techniques to infer or estimate hidden contexts from observable data. We appreciate your suggestion and look forward to exploring this avenue in our future research.

(2 continued) In fact, no matter whether context features are given a priori, there are many effective methods which perform transfer/meta-RL by learning a conditioned policy [1][2][3], which should be significantly better than the "multi-task" baseline adopted in the paper. However, there is currently no empirical evidence on how well the proposed method compares to these stronger baselines.

We appreciate this feedback and agree that including comparisons with stronger baselines would strengthen our evaluation. We refer the reviewer to the General Response [GR3] and Figure R.3 about the multi-task baseline. Additionally, our method has an advantage over PaCo [1] because it does not require learning additional parameters during training, thanks to the simple modeling of performance loss in transfer.

Even though Theorem 2 and Corollaries 2.1 and 2.2 give quantitative regret bounds assuming the search space can be narrowed down at each training step, there is no theoretical guarantee on how the proposed method, especially MBTL-GP, can effectively realize such a restricted search space. Hence the current analysis is not complete in terms of proving the effectiveness of the proposed method.

Thank you for your constructive comments. To address your concerns, we offer a more detailed discussion on the practical realization of a restricted search space and include empirical evidence to support our theoretical claims. The graph in Figure R.4 illustrates the impact of search space elimination on the performance of different strategies over multiple transfer steps. It compares the empirical search space of MBTL-GP in tasks provided in this paper with the examples given in Corollaries 2.1 and 2.2. Corollary 2.2, representing a greedy strategy (MBTL-PS), demonstrates a more aggressive reduction in search space, leading to the tightest regret bounds and superior performance. Corollary 2.1 also shows a rapid reduction in max space and improved performance. Although we cannot guarantee the theoretical performance, our empirical results indicate that MBTL-GP can achieve competitive performance compared to MBTL methods using simpler strategies.

What's the fundamental difference between "exhaustive training" and "multi-task RL"? Based on the brief description in line 257-258, they seem similar to me. Also why is "multi-task RL" even worse than "random" most of the time (in Table 1, 2)?

Thank you for your feedback, and we apologize for any confusion in differentiating the two. Exhaustive training involves training separate models for all different cMDP tasks, while multi-task RL trains a single universal model that can generalize across different cMDP tasks. As shown in our transferability heatmap, some environments have low transferability to other tasks, making it challenging for multi-task RL to derive a single model that can effectively solve a wide range of task variants. Random selection of models in SSTS can sometimes cover a broader context range better than a single multi-task RL model.

Comment

Thank you for the rebuttal, I appreciate the efforts for providing further experiments. After carefully reading through the rebuttal and general response, unfortunately, I feel some of my major concerns still remain. Most importantly, given the current form of the paper, I would expect more in-depth analysis and insight regarding the following subjects:

  1. **Theoretical justification of the linear generalization gap**. This is claimed by the authors as a "key strength" of the paper. However, unlike a loss function with a well-defined form, the generalization gap is a complex function of model parameters, architectures, dataset sizes and distributions, as well as context features, *without* a closed-form expression. Simply taking a linear assumption makes the subsequent theoretical treatment easier, but without any convincing insight or theoretical justification, it would significantly limit the impact of the paper.

My suggestion: First of all, I appreciate the additional result that the proposed algorithms perform reasonably well in the presence of a non-linear (e.g., quadratic) generalization gap function. However, to be more general, maybe consider using a separate network to approximate this function, which in principle can be any function due to the universal approximation theorem, and then demonstrate the effectiveness of your methods. If you still want to hold on to the linear form to support your theoretical development, consider the Lipschitz constraint instead, which seems much more realistic.

  2. **Regarding MBTL-GP realizing restricted search space**, the additional empirical evidence is nice. But since Theorem 2 and Corollaries 2.1 and 2.2 are meant to theoretically ground the effectiveness of the proposed methods, I think more rigorous proofs and insights instead of just empirical observations are still necessary to complete the whole argument. If you find it extremely hard or even impossible to bound/model the reduction rate in search space, at least state it clearly as an assumption, which of course limits the impact of the paper.

Additional Concerns

Given the authors' explanation about "exhaustive training" and "multi-task RL", I realize that I previously misinterpreted the SSTS problem as a "continuous learning" + "active learning" setting, where you train a *single* model on a sequence of tasks, in an attempt to select the optimal next task to maximize generalization. However, now, if I understand correctly, the proposed methods actually follow the "exhaustive training" paradigm, where for each new task, a new model is trained. This formulation of SSTS seems unconventional (or "novel" on the bright side), for which I have two major concerns:

  1. According to Eqn 3, the task selection requires evaluating all existing models ($1$ to $k$) on every single target task $x'$, which may severely increase the total computational cost of SSTS. Suppose we have $N'$ target tasks; the proposed task selection then requires $O(N'k)$ computation. If we want to select $N$ source tasks in total sequentially, it will end up with $O(N'N^2)$ complexity. Even though performing evaluation/inference is less computationally demanding than training, this additional cost can be non-negligible, especially when $N$ and $N'$ are large. This potentially makes the main motivation that "Our work has the potential to reduce the computational effort needed to solve complex real-world problems" much weaker.

  2. Training a new model whenever encountering a new task is not scalable. In an idealized scenario, we would like to have a "universal model", like the human brain, which can continuously learn by reusing prior knowledge to solve new tasks that are similar to tasks encountered before, and only make significant updates when the new task is completely beyond the current skill set (aka "out-of-distribution" in statistical words). This is the fundamental motivation of continual learning or life-long learning. The current setting of SSTS, if I understand correctly, seems less realistic, or at least requires significant justification for its practical impact.

Comment

We appreciate your valuable and insightful feedback.

Theoretical justification of the linear generalization gap

We appreciate your concern regarding the linear assumption of the generalization gap. As discussed in our general response [GR1], the linear model was chosen for its simplicity and to streamline the algorithm design process, though various complex factors influence the generalization gap. We are currently exploring the possibility of extending the proposed methods using neural networks to approximate the generalization gap. But again, the key strength of our simple algorithm is that we don’t necessarily need pre-training of those parameters.

Regarding MBTL-GP realizing restricted search space

We understand the importance of rigorous theoretical analysis and regret bounds. We agree that the paper would benefit from a more detailed analysis of how the MBTL-GP algorithm can systematically reduce the search space. While we offer examples of simpler algorithms like MBTL-GS to illustrate potential strategies, we face challenges in bounding the reduction rate for MBTL-GP. Therefore, we have provided empirical insights into the rate at which the search space for MBTL-GP can be reduced. In the revised paper, we clearly state this approach, along with formally defined assumptions and detailed explanations.

According to Eqn 3, the task selection requires evaluation of all existing models (1-k) on ...

We appreciate your observation regarding the difference between our proposed method and conventional approaches such as multi-task training, where a single universal model is trained across multiple tasks, or independent (exhaustive) training, where separate models are trained for each task. As you mentioned, one of the most important motivations for our method is that evaluation is much cheaper than training RL models in terms of computational cost. We think that this paradigm of training multiple models and applying zero-shot generalization (or fine-tuning) has been little studied and is promising.

The strength of our approach lies in its ability to achieve near-oracle performance with a significantly smaller number of trained models. For example, in our experiments on standard control benchmarks with 100 tasks, we achieved performance close to the oracle level with only 10-15 trained policies. While exhaustive evaluation is required to ensure the selection of the best model among all possible task-specific models, we believe that the computational cost of evaluation is relatively low compared to training the remaining RL policies.

Moreover, once the generalized performance of the models is obtained, further evaluations are not necessary in subsequent steps. This reduces the evaluation complexity to $O(N')$ per step, resulting in a total complexity of $O(NN')$ when training $N$ tasks, as opposed to the $O(N'N^2)$ complexity you mentioned. Additionally, in practical scenarios, we typically consider cases where $k \ll N$, further reducing the complexity to $O(N'k)$, which is significantly smaller than $O(NN')$. This highlights the computational efficiency of our approach in scenarios with many tasks.
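For illustration, here is a toy sketch of this bookkeeping; the greedy selection rule and the dummy training/evaluation functions below are placeholders, not our actual algorithm:

```python
# Toy sketch (placeholders only): each step trains ONE model and evaluates it once on all N'
# targets; earlier evaluations are cached in `best`, so the per-step evaluation cost is O(N')
# and the total is O(kN'), not O(N'N^2).
import numpy as np

def sequential_selection(contexts, train_policy, evaluate, k):
    best = np.full(len(contexts), -np.inf)        # cached best transferred return per target
    models = []
    for _ in range(k):
        # Pick the currently worst-covered context (a simple stand-in for the MBTL-GP rule).
        x = contexts[int(np.argmin(best))]
        pi = train_policy(x)
        scores = np.array([evaluate(pi, xt) for xt in contexts])   # N' evaluations, done once
        best = np.maximum(best, scores)           # older models are never re-evaluated
        models.append((x, pi))
    return models, best

# Dummy usage: the "policy" is just its training context and transfer follows a linear gap.
ctx = np.linspace(0, 1, 50)
models, best = sequential_selection(
    ctx,
    train_policy=lambda x: x,
    evaluate=lambda pi, xt: 1.0 - 0.8 * abs(pi - xt),
    k=5,
)
```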

Training a new model whenever encountering a new task is not scalable. In an idealized...

Thank you for your insightful comments. We recognize the importance of developing a "universal model" that can continuously learn and adapt to new tasks by reusing prior knowledge, as emphasized in recent research on continual and life-long learning. However, we believe that there are specific scenarios where task-specific training policies, rather than a universal model, are necessary and practical, especially when side information is available. For instance, in the design of traffic signal phases at 4-way intersections, it is crucial to train distinct reinforcement learning policies for different intersection configurations. Considering that there are approximately 16 million different intersections in the United States, training a universal model or separate 16 million models for all configurations would be highly expensive and almost impossible. However, by leveraging context information such as the number of lanes, lane lengths, expected traffic inflows, and speed limits, we can intelligently devise a training procedure that trains multiple models, while still achieving good generalization across the broader distribution. Empirically, we have observed that in such scenarios, our method proves to be computationally more efficient and performs effectively. This approach allows for the targeted training of models that are specialized but still generalize well to similar tasks, thus addressing practical constraints in real-world applications. We hope this conveys the core idea and motivation of our work. We also acknowledge the potential of context-free learning for scalable, universal models, as you mentioned, and consider it an important direction for future research.

Comment

Thank you for the detailed response. My concern regarding the computational complexity is largely resolved. However, some of the major concerns remain, which I think the authors also agree upon.

In summary, this is a novel paper which introduces the concept of "model-based transfer learning" based on "sequential source task selection", a setting I have not seen in exactly this form before, to the best of my knowledge. The authors propose to solve this problem with Bayesian optimization techniques, which are empirically shown to be effective, but the methodologies are not new. By assuming a linear generalization gap as well as a bounded reduction of the search space, the authors arrive at theoretical guarantees for sublinear regret. By design, the proposed method can achieve better computational efficiency compared to conventional multi-task learning in certain scenarios. I will give credit for all contributions above.

However, as we discussed during the rebuttal, the paper still falls short in providing several key pieces of the story, such as the theoretical justification of the linear generalization gap (which also needs to be extended to a more general, realistic form as we discussed), a theoretical guarantee for the bounded search space, and its reliance on the existence of continuous context features, which significantly restrict the applicability and practical impact of the framework. Also, for the problem setting of SSTS, the authors provided the example of "16 million different intersection configurations in the United States" to justify the need for training multiple models, which I find reasonable but *not convincing enough*. A more promising and general approach to me for solving the same problem is to leverage the power of pretrained large models. Specifically, one can use a pretrained context encoder model (e.g., a large vision model for visual observation) to extract the context feature from the raw input, and use it to condition the downstream RL policy, instead of training a new model for each task and carefully selecting the "optimal" next task, which is far less scalable.

To conclude, I believe there are novelties in this work and appreciate the authors' efforts in the rebuttal, but I remain conservative about its practical impact to the community. More importantly, I sincerely hope the authors conduct further investigations to fix the issues we agreed upon to make the paper stronger, whether or not the paper is accepted. Given the reasoning above, I will keep my score but am open to further discussions.

Comment

Thank you for your thoughtful and comprehensive feedback. We appreciate your recognition of the novel aspects of our work.

First of all, we see the reviewer's concern about the generalization gap assumption. In situations where the true function is difficult to analyze, approximation methods are commonly used. Our MBTL algorithms approximate the generalization gap with a simple linear function of task context similarity. We show that even with the linear function, our MBTL framework works in various settings ranging from standard control tasks to complex real-world traffic applications. To further address your concern, we have also evaluated MBTL-GP performance using non-linear approximations, including quadratic, cubic, $x^5$, and $x^{10}$ models, along with the RMSE against the actual generalization gap. While higher-order approximation functions generally result in lower RMSE, we observed that the overall performance of MBTL on the SSTS problem does not consistently improve with more complex non-linear approximations. For example, the 10th-order polynomial approximation, despite its low RMSE, does not always perform best on the CMDP tasks. This suggests that the simplicity and interpretability of the linear model can provide significant advantages without compromising effectiveness.

| Environment | Linear (Perf. / RMSE) | Quadratic (Perf. / RMSE) | Cubic (Perf. / RMSE) | $x^5$ (Perf. / RMSE) | $x^{10}$ (Perf. / RMSE) |
| --- | --- | --- | --- | --- | --- |
| Pendulum | 0.7555 / 0.1070 | 0.7423 / 0.0965 | 0.7615 / 0.0862 | 0.7494 / 0.0705 | 0.7558 / 0.0192 |
| CartPole | 0.8896 / 0.2571 | 0.8102 / 0.1905 | 0.8941 / 0.1495 | 0.8926 / 0.1029 | 0.8761 / 0.0745 |
| BipedalWalker | 0.9331 / 0.1422 | 0.9237 / 0.1201 | 0.9329 / 0.1016 | 0.9318 / 0.0771 | 0.9325 / 0.0246 |
| HalfCheetah | 0.9165 / 0.0019 | 0.8426 / 0.0019 | 0.9260 / 0.0018 | 0.9271 / 0.0017 | 0.9295 / 0.0004 |
| Traffic Signal | 0.8966 / 0.1780 | 0.8907 / 0.1606 | 0.8965 / 0.1409 | 0.8963 / 0.1106 | 0.8952 / 0.0676 |
| Advisory Autonomy | 0.8177 / 0.0551 | 0.7782 / 0.0508 | 0.8214 / 0.0464 | 0.8245 / 0.0393 | 0.8245 / 0.0283 |
| Eco-Driving | 0.5377 / 0.1071 | 0.4811 / 0.0950 | 0.5282 / 0.0832 | 0.5270 / 0.0660 | 0.5450 / 0.0434 |
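For reference, a small illustrative snippet of the fitting procedure behind such a comparison; the synthetic gap data and degrees below are placeholders, not our experimental data:

```python
# Illustrative sketch: fit polynomial generalization-gap models of increasing degree and
# compare their RMSE. A lower RMSE does not necessarily yield better SSTS performance.
import numpy as np

rng = np.random.default_rng(0)
delta = np.linspace(0, 1, 100)                        # |x - x'| context distance
gap = 0.6 * delta + 0.3 * delta**2 + 0.02 * rng.standard_normal(delta.shape)  # synthetic gap data

for degree in (1, 2, 3, 5, 10):
    coeffs = np.polyfit(delta, gap, degree)           # least-squares polynomial fit
    pred = np.polyval(coeffs, delta)
    rmse = float(np.sqrt(np.mean((gap - pred) ** 2)))
    print(f"degree {degree:2d}: RMSE = {rmse:.4f}")
```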

Additionally, we would like to highlight the potential for fine-tuning in our approach. When training a model at each step, it is possible to use the previous model or a model already trained in a closely related context as a starting point. This can significantly reduce the number of episodes required for training a new model, thereby improving efficiency and having the potential to be scalable.

We are grateful for your constructive feedback and remain open to further discussions on these important points during the rebuttal period.

Author Response

The authors appreciate each of the reviewers for their detailed and constructive comments. Here, we first respond to all reviewers before answering each reviewer’s specific question.

[GR1] Concerns with the Linear generalization gap assumption

Thank you for your valuable comments. Similar concerns were raised by Reviewers 21Qs and ECSQ regarding the assumptions made in our method. The purpose of our assumptions was not to oversimplify the problem but to design a straightforward algorithm that benefits from simple modeling. Assumptions 1-3 in Section 4.1 were made to streamline the empirical algorithm design rather than to constrain our theoretical analysis. Although this may appear to be an oversimplification, the simple modeling of training performance and the generalization gap is a key strength of this paper. Empirical findings indicate that even when these assumptions do not hold perfectly, our simple, principled algorithms remain effective. To address these concerns, we have included additional experiments that relax these assumptions. As one example of a non-linear generalization gap, we tested our algorithms with a quadratic generalization gap assumption. Preliminary results show that these methods can perform well, sometimes better than those assuming a linear generalization gap. Specifically, in tasks like the CartPole variants, estimation with a quadratic function performs better than linear modeling since the actual generalization gap resembles a quadratic function.

Table R.1. Comparative performance of different methods with quadratic generalization gap function on CMDP tasks

| Environment | Context variation | Random | Exhaustive | MBTL-GP | MBTL-GP (quadratic) | Oracle Transfer |
| --- | --- | --- | --- | --- | --- | --- |
| Cartpole | Mass of Cart | 0.7221 | 0.9466 | 0.8212 | 0.7979 | 0.9838 |
| Cartpole | Length of Pole | 0.8121 | 0.9110 | 0.9124 | 0.926 | 0.9875 |
| Cartpole | Mass of Pole | 0.8858 | 0.9560 | 0.9351 | 0.9593 | 1 |
| HalfCheetah | Gravity | 0.8542 | 0.6679 | 0.9073 | 0.8253 | 0.9544 |
| HalfCheetah | Friction | 0.8567 | 0.6693 | 0.9274 | 0.8601 | 0.9663 |
| HalfCheetah | Stiffness | 0.8533 | 0.6561 | 0.9146 | 0.8423 | 0.9674 |

Furthermore, non-linear generalization gaps require estimating more parameters than linear gaps, potentially complicating the achievement of effective performance. Despite this, our preliminary results are promising and indicate that our method can be effectively adapted to handle non-linear generalization gaps, enhancing its robustness and applicability across different tasks.
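For concreteness, the modeling choice discussed in this response can be restated as follows; the notation here is assumed for illustration and may differ slightly from the paper.

```latex
% Assumed notation (may differ from the paper): J(x) is the training performance on source
% context x, \lambda the estimated gap slope, and V_S(x') the best transferred performance
% on target x' from the already-selected source set S.
V(\pi_x, x') \;\approx\; J(x) - \lambda \,\lVert x - x' \rVert,
\qquad
\Delta(x \mid S) \;=\; \mathbb{E}_{x'}\!\left[ \max\bigl(0,\; J(x) - \lambda\lVert x - x'\rVert - V_S(x') \bigr) \right].
```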

[GR2] Application of MBTL to high-dimensional visual input task

We appreciate the feedback and comments from Reviewer wJEK and ECSQ regarding the applicability of MBTL to high-dimensional state space tasks. We understand the importance of evaluating our method on more complex tasks with high-dimensional state spaces. In response to these comments, we conducted additional experiments on high-dimensional vision-control experiments. These experiments are based on the MetaDrive benchmark [1], which supports visually generated observations for driving scenarios.

We ran preliminary experiments with the MetaDrive benchmark in the three-lane four-way intersection traffic network with different traffic density variations (from 0.05 to 0.5) (Fig. R.2). The task involves controlling an autonomous vehicle in the presence of other vehicles. The controlled vehicle observations are generated from a low-level sensor using an RGB-based camera view with 200x100 pixels. Those inputs were passed through a three-layer CNN for feature extraction. The autonomous vehicle is controlled with steering and acceleration changes.
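For reference, a sketch of the kind of three-layer CNN encoder used for such observations; the layer sizes and the policy head below are assumptions for illustration, not the exact architecture.

```python
# Illustrative sketch (assumed architecture): a three-layer CNN feature extractor for
# 200x100 RGB camera observations, feeding a small head for steering and acceleration.
import torch
import torch.nn as nn

class VisionEncoder(nn.Module):
    def __init__(self, feature_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():                       # infer the flattened size from a dummy input
            n = self.conv(torch.zeros(1, 3, 100, 200)).shape[1]
        self.fc = nn.Linear(n, feature_dim)

    def forward(self, obs):                         # obs: (B, 3, 100, 200), values in [0, 1]
        return torch.relu(self.fc(self.conv(obs)))

policy_head = nn.Linear(256, 2)                     # two continuous outputs: steering, acceleration
features = VisionEncoder()(torch.rand(4, 3, 100, 200))
actions = torch.tanh(policy_head(features))
```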

The preliminary results show that MBTL algorithms, ranging from simple strategies to GP-based algorithms, are still effective in complex high-dimensional state spaces. These experiments demonstrate the scalability and robustness of MBTL in more challenging and complex settings, confirming the versatility of our approach in handling high-dimensional visual input tasks.

[GR3] Multi-task baselines

We greatly appreciate the feedback from Reviewer 21Qs and concur that including comparisons with stronger baselines would enhance our evaluation. In response, we thoroughly examined state-of-the-art multi-task reinforcement learning (MTRL) methods, including the suggested baselines. However, we believe the suggested CARE algorithm [2], which involves language embeddings, is not an appropriate comparison for our approach, where the context variation is continuous and straightforward. Instead, we have compared our methods with the Parameter-Compositional Multi-Task Reinforcement Learning (PaCo) [3] algorithm. In Figure R.3, our preliminary implementation of PaCo [3] on CartPole CMDP variants indicates that MBTL remains competitive against these methods, demonstrating superior performance compared to our previous, more naive MTRL strategy. We note that, due to the limited time available for rebuttals, we could not offer comparisons with a wider variety of MTRL baselines, as the training procedures require considerable computation time and effort. Unfortunately, a few MTRL works also did not release their codebase, had issues running the code, or were not reproducible when we implemented prior work.

References

  • [1] Q. Li, Z. Peng, L. Feng, Q. Zhang, Z. Xue, and B. Zhou, “MetaDrive: Composing Diverse Driving Scenarios for Generalizable Reinforcement Learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 3, pp. 3461–3475, Mar. 2023.
  • [2] Shagun Sodhani, Amy Zhang, and Joelle Pineau. Multi-task reinforcement learning with context-based representations. In International Conference on Machine Learning, pages 9767–9779. 2021.
  • [3] L. Sun, H. Zhang, W. Xu, and M. Tomizuka, “PaCo: Parameter-Compositional Multi-Task Reinforcement Learning,” in Conference on Neural Information Processing Systems., 2022.
Final Decision

The paper aims at solving contextual reinforcement learning problems, with a focus on selecting which tasks to train on such that the learned policies can be used zero-shot to generalize to unseen tasks. Towards this goal, the authors use a Bayesian sequential sampling strategy, where the acquisition function is based on the generalization gap of the learned policies on the sampled tasks. This allows the method to sample tasks where the policies perform the worst, thereby quickly gaining coverage for good performance.

Regarding this methodology, there was an active discussion where the reviewers raised several concerns regarding both theoretical and practical aspects of the paper: in particular, the linear generalization gap assumption, the computational complexity when the number of tasks is large, the applicability of the method to domains with higher-dimensional state features, and the need for better baselines. In the rebuttal, the authors provided several new results regarding all the concerns raised, and shared empirical validation where the theoretical justification was more involved.

The overall idea is sensible, and opens a different avenue of storing a few anchor policies (as opposed to having a global policy for all the tasks) to address the multi-task generalization problem. The experiments are also carefully designed and the results seem promising.