PaperHub

Overall rating: 5.8 / 10 (4 reviewers; ratings 3, 6, 6, 8; lowest 3, highest 8, std. dev. 1.8)
Confidence: 3.5 · Correctness: 2.8 · Contribution: 2.8 · Presentation: 3.0
Decision: Rejected
Venue: ICLR 2025

In-Context Reinforcement Learning From Suboptimal Historical Data

Submitted: 2024-09-27 · Updated: 2025-02-05
TL;DR

We consider pretraining transformers for in-context RL from only suboptimal historical data.

Abstract

Keywords
In-context Learning; Transformer; Reinforcement Learning

Reviews and Discussion

Review (Rating: 3)

This paper is positioned in the in-context reinforcement learning literature. The authors propose to mitigate the reliance on optimal action labels through an exponential advantage weighting. As such, trajectories of different quality can be used for updates, since the exponential weighting can filter out bad actions when they have low advantage values.
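For reference, the exponential advantage-weighted regression objective referred to here has the following standard (AWR/AWAC-style) form; this is the generic formulation, not necessarily the paper's exact objective ($\pi_b$ denotes the behavior policy and $\beta > 0$ a temperature):

```latex
% Standard advantage-weighted regression (AWR/AWAC-style) objective:
% observed actions are up- or down-weighted by the exponentiated advantage
% of the behavior policy \pi_b, with temperature \beta > 0.
\max_{\theta}\;
\mathbb{E}_{(s,a)\sim\mathcal{D}}
\left[
  \exp\!\left(\tfrac{A^{\pi_b}(s,a)}{\beta}\right)\,
  \log \pi_\theta(a \mid s)
\right]
```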

Strengths

This paper is well-written and clearly motivated. The authors did a great job crafting each section, explaining the context, problem, and solution clearly.

Weaknesses

However, the paper lacks novelty in my opinion. In the offline RL setting, exponential advantage-weighted regression is already very popular (AWAC, AWR) and has served as the basis for many more sophisticated methods, see refs [1, 2]. The KL-regularized policy optimization as in Proposition 4.1 has also been extensively investigated, and the result of Proposition 4.2 is fairly standard in the conservative policy iteration (CPI) literature, of which [Schulman et al., 2015] is one of the variants. The literature also includes the offline case, see [3]. So, in my opinion, this paper did not propose anything sufficiently novel if it simply extends to the in-context case with additional conditioning on $\tau$. I recommend the authors take into account the above-mentioned references and discuss any significant theoretical differences the paper has.

I would suggest the authors compare weighting schemes in more depth to strengthen the paper. The exponential weighting scheme depends heavily on the estimation quality of the advantage function, which itself can be unreliable. The proposed in-context estimation is also not very convincing in this regard. It is also due to this issue that AWAC is often found to underperform. Many tricks have been proposed to improve the weighting, e.g., by incorporating uncertainty [4] or setting the weights of bad actions to zero to eliminate their updates [5, 6]. The paper could benefit from such a systematic comparison.

References:
[1] Simple and Scalable Off-Policy Reinforcement Learning
[2] AWAC: Accelerating Online Reinforcement Learning with Offline Datasets
[3] Behavior Proximal Policy Optimization
[4] Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning
[5] Offline RL with No OOD Actions: In-Sample Learning via Implicit Value Regularization
[6] Offline Reinforcement Learning via Tsallis Regularization

Questions

Please refer to the above for my questions.

Comment

Dear Reviewer 99QM,

We appreciate your comments and hope that our response below can help address your concerns.

Novelty.

We observe some misunderstanding regarding our contribution and in-context RL. Please see our general response for a detailed discussion about our contributions, particularly why extending AWR to in-context RL is technically novel.

Extra Experiments

To further corroborate the effectiveness of our approach (beyond the significantly improved performance with reweighted pretraining), we have followed a suggestion from Reviewer fFnq to conduct an extra experiment to empirically show that our method can accurately learn the value functions for distinct behavioral policies and RL tasks, so that advantage-based reweighting can be instantiated. It can be observed in Figure 8 in Appendix I that the prediction error of the trained in-context advantage estimator is very small.

Comparisons with Advanced Reweighting Techniques

We would like to note that our main contributions are an effective approach for in-context value function estimation and a demonstration that even the most straightforward reweighting technique enables in-context RL with only suboptimal data, whereas all existing works for in-context RL require access to data generated by optimal policies. While we believe that all of the more advanced weighting mechanisms referenced above for standard RL can be applied to further improve our method's performance, they are tangential to our key contribution and thus should not be considered evidence against the novelty of our work.

Insights and Value of DIT for the Community

Our key insight is that as long as the “relatively good” actions can be identified and emphasized during supervised pretraining, a transformer-based policy can perform as well as LTM policies, such as DPT, that are pretrained with much more expensive pretraining data.

In particular, although the weighted action labels for one trajectory can be misleading, the weighted MLE objective averaged over trajectories across diverse environments does lead to high-quality LTM policies with strong in-context RL capability to generalize to unseen RL tasks. Moreover, with our in-context advantage estimator, we also observe that LTMs can generate reliable in-context estimations for value functions of distinct behavioral policies and RL tasks with only supervised learning and a single trajectory as context. We believe these insights and observations are important to share with the community so that the potential of in-context RL can be fully investigated.

On the practical aspect, our method is easy to implement and significantly increases the feasibility of in-context RL as sub-optimal trajectories are much easier to collect (large companies often have tremendous databases consisting of historical trajectories collected by non-expert users), creating opportunities for in-context RL to be applied in much broader applications.

Comment

Hi, thanks for the clarification and extra experiments, they are appreciated.

I understand the authors have contributed to in-context RL by taking inspiration from the standard RL literature. But the current paper lacks many important pieces to make the methodology convincing. Let me rephrase my questions:

  1. As the authors said, key to the proposed method is the advantage function, because it controls the weight of each update. But this critical problem was only given the space around Eq. 8, stating that "estimation by $D^i$ is unreliable and we propose to estimate it by two models". Why is this the case? I didn't see any qualitatively new methods put forward to resolve this unreliability issue, nor theoretical insights into estimating this critical quantity. It is appreciated that the authors have added Figure 8 to show it indeed worked, but why, when, and how it worked, and whether the standard advantage value could have worked, remain unanswered.

  2. Because the authors are tackling a more challenging problem than solving a fixed task, it is reasonable to expect that the advantage functions are more difficult to estimate. A known issue of exponential weighting is that even actions with very bad advantage values still receive positive weights, because the exp function is positive everywhere. Since the proposed agent learns in an offline manner, it can only see a limited range of samples. So even if a state-action pair is bad, its likelihood is still increased while other, unseen pairs are implicitly decreased. How did the authors deal with this issue?

  3. Regarding theoretical novelty: I do not agree with the authors' claims that this extension brings new insights. By looking at the proofs of Theorems 4.1 and 4.2, it is quite clear that nothing significantly new was used. Take Theorem 4.2 for example: the authors started by using a lemma from [Achiam et al. 2017], followed by Pinsker's inequality and Jensen's inequality. Up to this point the techniques were very common and general-purpose in CPI/TRPO-type algorithms; only the last step, where an expectation was taken, was really relevant to this in-context setting. Let me ask the authors: how is it qualitatively different from CPI-based algorithms like the ones I mentioned?

Comment

Dear Reviewer 99QM,

Thank you for your follow-up questions. We are more than happy to clarify them and hope that our response below will address your concerns.

whether the standard advantage value could have worked

Yes. As we have established in our Proposition 4.1, reweighting by the true advantage functions guarantees an improved in-context RL policy over the behavioral policies that collected the trajectories in the pretraining dataset. Note that inferring policies better than the behavior policies is the exact goal of learning from suboptimal historical data. Specifically, the improvement for in-context RL performance is the expected individual task improvement over the training task distribution.

However, the true advantage functions are unknown. As we have discussed in the general response regarding the technical challenges, because the weight function is task-dependent, we need to estimate the advantage functions for all RL tasks and must estimate them individually for each trajectory in the pretraining dataset. This problem for in-context RL is completely absent from regular RL, thus requiring qualitatively new methods to address it.

I didn’t see any qualitatively new methods

As we have discussed in the general response, in-context RL and regular RL are two very different problems, and the aforementioned problem for in-context RL, which is completely absent from regular RL, requires qualitatively new methods. To address this challenge unique to in-context RL, we frame the advantage function estimation problem as another in-context learning problem – in-context value estimation. This is qualitatively different from existing methods for regular RL, as there is no in-context learning in regular RL. Specifically, in this work, conditioned on a single trajectory that contains the environmental information, transformers use their in-context learning abilities to infer the value functions of the behavioral policy that collected the trajectory. In other words, with large-scale pretraining, transformers can accurately infer distinct behavioral policies' value functions in distinct RL tasks, using only one trajectory collected by the behavioral policy of interest in the desired task. This is a valuable finding in the same spirit as in-context RL, and thus should be shared with the community.
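To make the in-context value estimation idea concrete, below is a heavily simplified PyTorch-style sketch, assuming a return-to-go regression target: a transformer conditioned on a single context trajectory regresses the behavior policy's return-to-go at each step. This assumption and all names (InContextValueNet, mc_returns, the dimensions) are illustrative, not taken from the paper's code.

```python
# Hypothetical sketch of in-context value estimation (names are illustrative,
# not the paper's code): a transformer conditioned on one context trajectory
# regresses Monte-Carlo returns of the behavior policy that collected it.
import torch
import torch.nn as nn

class InContextValueNet(nn.Module):
    def __init__(self, obs_dim, act_dim, d_model=128, n_layers=4, n_heads=4):
        super().__init__()
        self.embed = nn.Linear(obs_dim + act_dim + 1, d_model)  # (s, a, r) tokens
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.value_head = nn.Linear(d_model, 1)  # one value prediction per timestep

    def forward(self, states, actions, rewards):
        tokens = torch.cat([states, actions, rewards.unsqueeze(-1)], dim=-1)
        h = self.backbone(self.embed(tokens))     # causal mask omitted for brevity
        return self.value_head(h).squeeze(-1)     # predicted V of the behavior policy

def mc_returns(rewards, gamma=0.99):
    # Monte-Carlo return-to-go of the behavior policy along each trajectory.
    out = torch.zeros_like(rewards)
    running = torch.zeros(rewards.shape[0])
    for t in reversed(range(rewards.shape[1])):
        running = rewards[:, t] + gamma * running
        out[:, t] = running
    return out

# Training step: regress in-context value predictions onto return-to-go targets.
model = InContextValueNet(obs_dim=4, act_dim=2)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
states, actions, rewards = torch.randn(8, 20, 4), torch.randn(8, 20, 2), torch.randn(8, 20)
loss = ((model(states, actions, rewards) - mc_returns(rewards)) ** 2).mean()
opt.zero_grad(); loss.backward(); opt.step()
```

A second estimator of the same form, additionally conditioned on the queried action, would play the role of an in-context Q estimate, with the difference between the two giving the advantage; this pairing is stated here only as an illustrative assumption.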

Why, when and how it would work

As the success of in-context value estimation relies on transformers' in-context learning (ICL) capability, this question can be directly addressed by works in this direction. Transformers' ICL capability is already a widely recognized and established phenomenon, e.g., see [1, 2]. Additionally, a few prior works show that the key to the success of ICL is that transformers construct representations of the task during inference time, based on the demonstration examples. Such representations are named "task vectors" or "function vectors" in Function Vectors in Large Language Models [4] and In-Context Learning Creates Task Vectors [5].

Moreover, on a comprehensive set of benchmarks, we have empirically verified both the effectiveness of the proposed in-context advantage estimator for in-context RL and its accuracy in value function estimation. We believe this provides strong evidence to support the efficacy of our proposed method.

Regarding exponential reweighting

Please note that the general goal of learning from historical data is to improve over the behavioral policies that collected it. Even if the weight for a bad action is still positive, as long as it is less than 1, the learned in-context RL policies will still improve over the behavioral policies. Thus, we respectfully disagree that this should be considered an issue because, as demonstrated in all our experiments, reweighted pretraining significantly outperforms the unweighted one, which is our primary goal. As the first work to address in-context RL from suboptimal data, we believe an effective strategy for such an important problem is already a sufficient contribution.

Additionally, we would like to note again that our contribution is an effective approach for in-context value function estimation and a demonstration that even the most straightforward reweighting technique enables in-context RL with only suboptimal data. Given this, we believe that all of the more advanced weighting mechanisms referenced above can be applied to further improve our method's performance, although they are tangential to our key contribution.

Comment

Regarding insights

We would like to clarify that we never claimed that the proof techniques of our theoretical results bring new insights, as this is not a work focused on learning theory. In contrast, we only claimed that the empirical success of our method brings new insights for in-context RL. Specifically, our key insight is that as long as the “relatively good” actions can be identified and emphasized during supervised pretraining, a transformer-based policy can perform as well as LTM policies, such as DPT, that are pretrained with much more expensive pretraining data. In particular, although the weighted action labels for one trajectory can be misleading, the weighted MLE objective averaged over trajectories across diverse environments does lead to high-quality LTM policies with strong in-context RL capability to generalize to unseen RL tasks. Moreover, with our in-context advantage estimator, we also observe that LTMs can generate reliable in-context estimates of the value functions of distinct behavioral policies and RL tasks with only supervised learning and a single trajectory as context. We believe these observations are important to share with the community so that the potential of in-context RL can be fully investigated.


We sincerely hope these detailed responses can help clarify the misunderstandings regarding our contribution and address your remaining concerns.

Best, Authors


[1] Dong, Qingxiu, et al. "A survey on in-context learning." arXiv preprint arXiv:2301.00234 (2022).

[2] Akyürek, Ekin, et al. "What learning algorithm is in-context learning? investigations with linear models." arXiv preprint arXiv:2211.15661 (2022).

[3] Xie, Sang Michael, et al. "An explanation of in-context learning as implicit bayesian inference." arXiv preprint arXiv:2111.02080 (2021).

[4] Todd, Eric, et al. "Function vectors in large language models." arXiv preprint arXiv:2310.15213 (2023).

[5] Hendel, Roee, Mor Geva, and Amir Globerson. "In-context learning creates task vectors." arXiv preprint arXiv:2310.15916 (2023).

Comment

Dear Reviewer 99QM,

Thank you again for your comments.

As the discussion period is approaching an end, we hope that our response has sufficiently addressed your concerns.

Of course, we are more than happy to further address any additional concerns the reviewer might have.

Best,

Authors

Review (Rating: 6)

The paper proposes the Decision Importance Transformer (DIT), a novel approach in reinforcement learning (RL) that leverages transformer models to improve decision-making using suboptimal historical data. Traditional RL models struggle with data that does not include optimal action labels, but DIT overcomes this by introducing a weighted pretraining method. This method utilizes the advantage function, which estimates the importance of each action and weights the learning process accordingly. To handle situations where advantage functions are not directly available, DIT uses a transformer-based estimator that learns to approximate these advantage values across various tasks, enhancing the model’s ability to generalize. The authors demonstrate DIT’s effectiveness on bandit and Markov Decision Process (MDP) tasks, where DIT often matches or exceeds the performance of existing models, even when trained with suboptimal trajectories.

Strengths

  • By integrating a transformer-based in-context advantage estimation, the model approximates the advantage values dynamically when they are not explicitly available. This allows DIT to extend its functionality to environments where advantage functions are challenging to compute or estimate, improving its general applicability in real-world scenarios with limited labeled data.
  • The authors conducted extensive testing across bandit and MDP environments, demonstrating DIT’s ability to match or exceed the performance of state-of-the-art methods. The results on tasks with noisy or suboptimal data showcase the model’s robustness and adaptability, emphasizing its potential in practical applications where perfect data is scarce.

Weaknesses

  • As the optimal solution of the optimization problem in Eq. 4 and the policy improvement theorem under a KL constraint have been widely researched, it seems to me that the main contribution of this paper is applying them to the in-context RL domain. Thus I suggest the authors frame the contribution more in terms of the new insights gained from combining in-context RL and advantage-weighted regression.
  • The weighted pretraining approach heavily relies on accurate advantage values to guide the learning process effectively. DIT uses Eq. 9 to optimize the Q and V functions. However, there is no evidence that learning via this objective avoids the overestimation problem that widely exists in offline RL settings. I suggest the authors compare the adopted training objective for Q and V with objectives that have in-sample optimality guarantees, such as IQL. Or at least an illustration that Eq. 9 converges to an optimal solution should be given.
  • I suggest adding some experiments comparing DIT and DPT on the same sub-optimal datasets to further illustrate the advantages of DIT. The current comparisons, such as Figures 5-7, could confuse readers.
  • The writing could be further improved regarding the notation paragraph.
  • Several related works which use RL or chain-of-thought to improve (in-context) LTMs for decision making are missing, such as [1][2][3].

[1] Q-value Regularized Transformer for Offline Reinforcement Learning. ICML 2024.

[2] Rethinking Decision Transformer via Hierarchical Reinforcement Learning. ICML 2024.

[3] In-Context Decision Transformer: Reinforcement Learning via Hierarchical Chain-of-Thought. ICML 2024.

Questions

See weakness above.

Comment

Dear Reviewer fFnq,

Thanks for your constructive comments; we will include the referenced works in our literature review section. Please see our responses to your other comments below.

Thus I suggest the authors frame the contribution more in terms of the new insights gained from combining in-context RL and advantage-weighted regression.

This is a valuable suggestion. Please see the general response for a detailed clarification of our contributions.

In terms of insights, our key insight is that as long as the “relatively good” actions can be identified and emphasized during supervised pretraining, a transformer-based policy can perform as well as LTM policies, such as DPT, that are pretrained with much more expensive pretraining data.

In particular, although the weighted action labels for one trajectory can be misleading, the weighted MLE objective averaged over trajectories across diverse environments does lead to high-quality LTM policies with strong in-context RL capability to generalize to unseen RL tasks. Moreover, with our in-context advantage estimator, we also observe that LTMs can generate reliable in-context estimations for value functions of distinct behavioral policies and RL tasks with only supervised learning and a single trajectory as context. We believe these insights and observations are important to share with the community so that the potential of in-context RL can be fully investigated.

Notation Paragraph

We have followed your suggestion to improve our notation paragraph in Section 4.

I suggest adding some experiments of DPT using the same sub-optimal datasets

We appreciate this insightful suggestion. Indeed, in all experiments, we compare DIT with a baseline BC, which is an unweighted version of our method DIT. Given that DIT and DPT use the same transformer model architecture, BC is equivalent to DPT trained with sub-optimal data. Hence, DIT’s effectiveness over DPT under sub-optimal data is clearly demonstrated as DIT significantly outperforms BC.

Overestimation Problem of the Value Functions: However, there is no evidence that learning via this objective avoids the overestimation problem that widely exists in offline RL settings.

We would like to note that DIT relies on the behavioral policies' value functions, instead of the optimal value functions. To this end, the overestimation problem commonly observed in offline RL is much less of a concern for DIT because:

  1. we don't need to evaluate an unseen policy
  2. we don’t need to estimate a greedy policy

The reason for point 1 is that we use the learned value functions of the behavioral policies to generate weights for actions taken by the behavioral policies. In other words, we are always in-distribution with respect to the behavioral policies. The reason for point 2 is that we are only estimating the value functions of the behavioral policies, not of a greedy or optimal policy. Thus, we don't need to bootstrap with a max operator, which is the main cause of overestimation problems.
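To make points 1 and 2 concrete, the following schematic contrast (illustrative only, not the paper's implementation) shows the on-policy target used when evaluating a behavior policy versus the max-bootstrapped target whose greedy operator is the usual source of overestimation in offline Q-learning:

```python
# Schematic comparison (illustrative, not the paper's implementation).
import torch

def behavior_policy_target(r, q_next, a_next, gamma=0.99):
    # SARSA-style target for evaluating the behavior policy: bootstrap with the
    # action the behavior policy actually took next -- always in-distribution.
    return r + gamma * q_next.gather(-1, a_next.unsqueeze(-1)).squeeze(-1)

def greedy_target(r, q_next, gamma=0.99):
    # Q-learning-style target: the max over actions can pick out-of-distribution
    # actions whose values are overestimated -- the usual offline-RL failure mode.
    return r + gamma * q_next.max(dim=-1).values

r = torch.tensor([1.0, 0.0])
q_next = torch.tensor([[0.2, 5.0], [0.1, -0.3]])   # 5.0 could be a spurious estimate
a_next = torch.tensor([0, 1])                      # actions taken by the behavior policy
print(behavior_policy_target(r, q_next, a_next))   # bootstraps from observed actions
print(greedy_target(r, q_next))                    # bootstraps from the (possibly wrong) max
```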

Extra Experiments for Validation. To present further evidence that the value functions are estimated well, we conduct an extra experiment to test the error of learned transformer-based value functions. The results are presented in Figure 8 in Appendix I. It can be observed that the prediction error of the trained in-context advantage estimator is very small.

Lastly, we would like to note that the ultimate criterion to evaluate the effectiveness of the learned value functions should be the in-context RL performance on unseen RL tasks. To this end, DIT significantly outperforms its unweighted variation, corroborating the effectiveness of the learned in-context advantage function estimator.

Comment

I appreciate the effort of the authors. I think my concerns are addressed. I have updated my rating.

Review (Rating: 6)

This paper explores the potential of large-scale transformer models for in-context learning in reinforcement learning (RL). The authors present the Decision Importance Transformer (DIT), a new framework for training transformers with suboptimal historical data, diverging from previous approaches that rely on optimal action labels. DIT leverages a weighted maximum likelihood estimation that assigns higher weights to actions with high advantage values, encouraging the model to learn near-optimal policies despite the lack of optimal trajectories in the dataset.

Strengths

Decision Importance Transformer (DIT) presents a novel approach by leveraging suboptimal historical data for in-context reinforcement learning (RL). While previous transformer-based RL approaches like Decision Pretrained Transformer (DPT) rely on datasets labeled with optimal actions, DIT uniquely addresses the challenge of training RL agents with suboptimal data by introducing a weighted maximum likelihood estimation guided by advantage functions. This framework effectively removes the constraint of needing optimal labels, broadening the applicability of transformer models in RL to real-world scenarios where optimal data is scarce or inaccessible. Additionally, the proposal of using an in-context advantage estimator to enhance policy learning from suboptimal data represents a creative blend of autoregressive modeling and advantage-weighted learning.

Weaknesses

  • Baseline Comparisons: To contextualize DIT's performance on suboptimal data, the paper could be strengthened by comparing DIT to more recent offline RL methods, such as Conservative Q-Learning (CQL) or other SOTA offline RL methods.
  • Add DPT comparison experiments on bandit problems: In the MDP problems setting, many experiments are compared with DPT. Why were DPT comparisons not included for bandit problems?

Questions

  • How does DIT handle scalability with larger task sets or datasets? What is the computational overhead introduced by the advantage-based weighting mechanism, e.g., in terms of training time or memory usage?
  • DIT makes actions with high advantage values receive more weight, leading to guaranteed policy improvements over the behavior policies. This is very similar to some offline RL methods, such as A2PR [1] and AW/RW [2]. How does DIT specifically differ from or improve upon these methods? Can you add some discussion of these methods in the related work, or more experimental comparisons?
  • For different tasks, how many in-context task identification transformers and Q/Value models does DIT train? Is there one for all tasks?

References:

[1] Liu, Tenglong, et al. "Adaptive Advantage-Guided Policy Regularization for Offline Reinforcement Learning." In International Conference on Machine Learning (ICML). PMLR, 2024.

[2] Hong, Zhang-Wei, et al. "Harnessing mixed offline reinforcement learning datasets via trajectory weighting." arXiv preprint arXiv:2306.13085 (2023).

Comment

Dear Reviewer Wiqi,

We appreciate these valuable comments and hope that the response below can help clarify your concerns.

DIT makes actions with high advantage values receive more weight, leading to guaranteed policy improvements over the behavior policies. This is very similar to some offline RL methods, such as A2PR [1] and AW/RW [2]. How does DIT specifically differ from or improve upon these methods?

Please see our general response for a detailed clarification of our contributions.

Baselines of In-context RL and Offline RL: To contextualize DIT's performance on suboptimal data, the paper could be strengthened by comparing DIT to more recent offline RL methods

At a high level, in-context RL is similar to a meta-RL problem using transformers. In the testing/inference stage, a new task (unknown to the transformer) is sampled from the family of desired tasks, and the pretrained transformer model is deployed to assess the performance. Thus, direct comparison with offline RL methods may not be appropriate for contextualizing the performance of in-context RL algorithms.

To comprehensively evaluate the performance of our method, we compare with several SOTA in-context RL methods. Moreover, we also compare with DPT pretrained with the optimal action labels, which are not accessible to our method DIT. Thus, DPT can be considered an oracle upper bound for DIT. Notably, although it has no access to optimal action labels, DIT matches the performance of DPT on most problem instances. This corroborates the effectiveness of our method.

Extra Experiments: Inclusion of DPT in the bandit experiments

Thank you for this very constructive comment. For the bandit problems, our primary goal was to compare with theoretically optimal bandit algorithms such as Thompson Sampling.

Following your suggestion, we have conducted experiments to include the performance of DPT. Please see Figure 3 for the updated results. As expected, DIT is slightly outperformed by DPT, since DPT has access to optimal bandits during pretraining whereas DIT does not. However, please note that DIT still outperforms the theoretically optimal bandit algorithms, and the regret curves of DIT and DPT demonstrate similar trends. These results demonstrate the effectiveness of DIT.

Computational Overhead: How does DIT handle scalability with larger task sets or datasets? What is the computational overhead introduced by the advantage-based weighting mechanism?

The DIT framework can be decomposed into 3 steps:

  • Step (1): training the in-context advantage function estimator;
  • Step (2): labeling all the state-action pairs with the trained advantage estimator;
  • Step (3): weighted supervised pretraining.

The extra computational overhead of DIT comes from Steps (1) and (2). In particular, the overhead of Step (1) is about the same as that of Step (3). Step (2) is computationally simple and its cost is negligible compared to Steps (1) and (3). Thus, the overall pretraining computational cost is about 2x that of the unweighted pretraining method.
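A compact, heavily simplified outline of this three-step pipeline is sketched below. All helper names are hypothetical placeholders for the components described above (the Q/V transformer pair is stubbed with a crude return-to-go heuristic purely so the example runs); it is meant only to make the structure and the roughly 2x pretraining cost visible, not to reproduce the paper's implementation.

```python
# Hypothetical, heavily simplified outline of the three-step DIT pipeline.
import math
import random

def train_in_context_advantage_estimator(trajectories):
    # Step (1): in the paper this is a pair of transformers (Q and V); here a stub
    # that scores an action by its return-to-go minus the trajectory's mean return.
    def estimator(traj, t):
        returns = [sum(r for _, _, r in traj[u:]) for u in range(len(traj))]
        return returns[t] - sum(returns) / len(returns)
    return estimator

def exp_weight(advantage, beta=1.0):
    return math.exp(advantage / beta)

def train_weighted_mle_policy(trajectories, weights):
    # Step (3): weighted supervised pretraining; stubbed as returning the weights.
    return {"weights": weights}

def pretrain_dit(trajectories):
    estimator = train_in_context_advantage_estimator(trajectories)   # Step (1)
    weights = {                                                       # Step (2)
        (i, t): exp_weight(estimator(traj, t))
        for i, traj in enumerate(trajectories)
        for t in range(len(traj))
    }
    return train_weighted_mle_policy(trajectories, weights)           # Step (3)

# Toy usage: trajectories are lists of (state, action, reward) tuples.
toy_data = [[(random.random(), 0, random.random()) for _ in range(5)] for _ in range(3)]
policy = pretrain_dit(toy_data)
```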

For specific computation time, our model uses exactly the same architecture as DPT and we include this part in Appendix F. Inference time for our model is also minimal.

We would like to also note that for methods based on historical data, the computational cost is often not the main concern. Thus, given the significantly improved performance, the doubled pretraining time is very often acceptable.

For different tasks, how many in-context task identification transformers and Q/Value models does DIT train? Is there one for all tasks?

To facilitate in-context advantage function estimation, we train only one pair of Q and V transformers for all tasks. When combined, they estimate the advantage functions for all tasks.

Comment

Thank you for providing the additional details and experiments. I have updated my score accordingly.

Review (Rating: 8)

This paper introduces the Decision Importance Transformer (DIT), a model designed to generalize to new RL tasks through in-context learning, even when pre-training data originates from sub-optimal behavior policies. DIT accomplishes this by reweighting an offline dataset using advantage functions estimated by a large-scale transformer model (LTM) that is trained autoregressively. Additionally, the authors demonstrate that the exponential reweighting technique provably ensures policy improvement. Experimental results in bandit, planning, and continuous-control tasks show that DIT outperforms baselines when generalizing to new RL instances, even when trained on data generated by sub-optimal policies.

Strengths

  • Significance: Generalizing to new RL tasks via in-context learning is a major concern in RL, particularly for handling sub-optimal datasets in offline RL. DIT addresses both of these challenges within a unified framework.

  • Novelty: The proposed DIT combines multiple methods—including LTMs, exponential reweighting, and actor-critic approaches—in a cohesive design. To the best of my knowledge, no similar approach exists in current literature.

  • Clarity: The paper is well-organized and self-contained, presenting DIT's architecture and methodology progressively, making it accessible and providing insights to guide future work.

Weaknesses

Suggestions for Improvement:

  1. Provide more direct empirical evidence that task information $\tau$ is being captured through in-context learning. For instance, demonstrate recovery of $\tau$ in experiments by querying with some special states.
  2. The literature review is missing policy regularization methods for offline RL [1].

Minor Issues:

  1. If $\pi(a \mid s; \tau)$ and $\pi_\tau(a \mid s)$ refer to the same concept, choose one notation or define one in terms of the other for clarity.
  2. Proposition 4.2 is labeled as "In-Context Policy Improvement." Since $\tau$ is provided explicitly, this theorem seems to justify the design of the loss function rather than "in-context learning." Renaming it to better reflect its role might avoid potential misinterpretation.

[1] Fujimoto, Scott, and Shixiang Shane Gu. "A minimalist approach to offline reinforcement learning." Advances in Neural Information Processing Systems, 34 (2021): 20132-20145.

Questions

  1. Is DIT pre-trained in an autoregressive way or an “encoder” way, i.e., does the output of the LTMs contain only a single action token instead of a sequence of predictions?
  2. In a fair comparison—such as using DIT with optimal historical data—would DIT still outperform baseline methods like DPT? If so, what do you think is the advantage of DIT over DPT when both are using optimal historical data?
Comment

Dear Reviewer Rsr6,

Thank you for these valuable comments! Following your advice, we have updated the name of Proposition 4.2 and included the referred work in our literature review. Please see below our response to your other questions.

If $\pi(a \mid s;\tau)$ and $\pi_\tau(a \mid s)$ refer to the same concept

We use $\pi(a \mid s;\tau)$ to refer to some meta policy that takes the task identity $\tau$ as input to generate distinct policies for varying tasks. On the other hand, we use $\pi_\tau(a \mid s)$, e.g., $\pi^b_\tau(a \mid s)$, to represent some policy that is fixed for task $\tau$.

We appreciate this question and have highlighted this in our manuscript to avoid confusion.

Provide more direct evidence that task information is being captured

Conceptually, the reason why DPT and DIT can construct task information at inference time is that transformer models have in-context learning ability; that is, these models can solve an unseen task at inference time by only looking at a few demonstration examples. A few prior works show that the key to the success of in-context learning is that transformers construct representations of the task during inference time, based on the demonstration examples. Such representations are named "task vectors" or "function vectors" in Function Vectors in Large Language Models [1] and In-Context Learning Creates Task Vectors [2].

Following the same rationale, we expect that DPT/DIT uses this in-context learning ability to infer the desired task at inference time: inside these models, representations of the task are constructed based on the trajectories sampled from the new task.

An Extra Experiment as Evidence. To validate this hypothesis, we conduct an extra experiment where, at inference time, we do not condition on offline trajectories sampled from the task of interest. Instead, we draw offline trajectories from a different task, use these trajectories as "demonstration examples" in the prompt, and test the performance of DPT/DIT on the task of interest. We compare this setting with our original experiment, where the offline trajectories are indeed sampled from the desired task.

If our hypothesis that DPT/DIT extracts task-relevant information holds, we anticipate that, when conditioning on trajectories from a different task, we will see a large performance drop, because the offline trajectories mislead DPT/DIT about which task the model is applied to. This is shown in Figure 9 in Appendix J.

From Figure 9, we can see that the extracted task information is crucial for the success of DPT/DIT: their performance degrades significantly if the task information is misleading. This shows that DPT/DIT’s transformer models heavily rely on the extracted task information and, given the significant performance boost when offline trajectory is sampled from the desired task, they indeed learn useful task information.

Pretraining of DIT

We follow the pretraining procedure of DPT to pretrain DIT in an autoregressive way, where the output actions only depend on the prior inputs. This is achieved by using a standard GPT-2 transformer architecture with causal attention masks.
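For concreteness, a generic causal (lower-triangular) attention mask of the kind used by GPT-2 looks as follows; this is a generic sketch, not code from the paper:

```python
# Generic causal attention mask: position t may only attend to positions <= t.
import torch

seq_len = 5
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask.int())
# Used as an attention mask, this guarantees that the predicted action at step t
# depends only on the prompt and on tokens up to step t, never on future inputs.
```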

Both DIT/DPT pretrained with optimal data.

We have results for both DIT and DPT trained on optimal historical data. In this setting, both methods perform very well and there is no discernible difference. This observation is expected because the goal of DIT is to recover the missing information about the optimal action labels and to close the performance gap with DPT pretrained with optimal action labels. With optimal historical data, there is no such gap to close, and given that DPT and DIT use the same transformer architecture, their performance should also be very similar.

We choose to not include these results given that the contribution of this work is in-context RL with only sub-optimal data, which is an important research problem since in most real-world environments, optimal data is not commonly accessible.


[1] Todd, E., Li, M., Sharma, A. S., Mueller, A., Wallace, B. C., & Bau, D. (2024). Function Vectors in Large Language Models. The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=AwyxtyMwaG

[2] Hendel, R., Geva, M., & Globerson, A. (2023). In-Context Learning Creates Task Vectors. The 2023 Conference on Empirical Methods in Natural Language Processing. https://openreview.net/forum?id=QYvFUlF19n

Comment

Thank you for the detailed rebuttal. I appreciate the clarifications provided and have two additional follow-up questions and points for discussion regarding in-context learning of task information:

  1. Exploration vs. Exploitation: In certain RL environments, recovering task information often requires exploration of the state space. How do you think DIT could balance the trade-off between exploration and exploitation during online deployment?

  2. Robustness: In the experiments, task trajectories used during pretraining (e.g., in the LB problem) are generated from a specific distribution. Is the same distribution used during evaluation? If so, how does DIT maintain its performance when evaluated under different distributions or even adversarial task parameters?

I would appreciate it if the authors could add some experiments on this, or at least a discussion.

Comment

Dear Reviewer Rsr6,

These are great points for discussion, and we appreciate them.

How do you think DIT could balance the trade-off between exploration and exploitation during online deployment?

From a theoretical perspective, the supervised pretraining employed by DPT/DIT is shown to be equivalent to implicit posterior sampling [1]. That is to say, conditioned on the context consisting of trajectories collected from a desired RL task, the transformer implicitly builds a posterior distribution over tasks (with the distribution of pretraining RL tasks as the prior), samples a task from the posterior, and uses the optimal policy for the sampled task to take actions. See, for example, Appendix C.1 of [1] for more details. This posterior sampling is a generalization of the Thompson Sampling algorithm, a theoretically optimal solution for multi-armed bandit problems. In particular, posterior sampling is proved to be sample-efficient with online Bayesian regret guarantees [2]. Thus, the supervised pretraining framework employed by DPT/DIT allows for strong exploration-exploitation capabilities.
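For reference, a minimal Thompson Sampling loop for Bernoulli bandits, the classical posterior-sampling procedure referenced here (an illustrative sketch, independent of the paper):

```python
# Minimal Thompson Sampling for a Bernoulli multi-armed bandit (illustrative).
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])       # unknown to the learner
alpha = np.ones(len(true_means))             # Beta posterior parameters per arm
beta = np.ones(len(true_means))

for t in range(1000):
    theta = rng.beta(alpha, beta)            # sample one plausible task from the posterior
    arm = int(np.argmax(theta))              # act optimally for the sampled task
    reward = rng.random() < true_means[arm]  # observe a Bernoulli reward
    alpha[arm] += reward                     # posterior update
    beta[arm] += 1 - reward

print("posterior means:", alpha / (alpha + beta))
```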

Is the same distribution used during evaluation?

Yes, we follow the conventional setting of in-context RL to evaluate on in-distribution yet unseen RL tasks.

how does DIT maintain its performance when evaluated under different distributions?

This is a great question. From the perspective of posterior sampling, testing on out-of-distribution RL tasks results in a wrong prior for the posterior distribution. Thus, similar to almost all Bayesian methods, it would require more samples/trajectories to identify the true task and act optimally. While strong out-of-distribution performance is an important research question, this work primarily considers the standard in-distribution setting, as in-context RL is still in its early stages. With that being said, we will include this important discussion in our discussion of weaknesses. Thank you again for this insightful point.


[1] Lee, Jonathan, et al. "Supervised pretraining can learn in-context reinforcement learning." Advances in Neural Information Processing Systems 36 (2024).

[2] Osband, Ian, Daniel Russo, and Benjamin Van Roy. "(More) efficient reinforcement learning via posterior sampling." Advances in Neural Information Processing Systems 26 (2013).

Comment

Dear Reviewers,

We sincerely appreciate your help and constructive feedback. However, we have noticed some misunderstandings regarding our contributions. To address these concerns, we would like to provide clarification on the following points:

  • Distinctions between In-context RL (ICRL) and regular RL
  • Challenges of ICRL from Suboptimal Historical Data (our main contribution)
  • Technical Novelty of Our Method (our technical contribution)

Distinctions between ICRL and regular RL

For standard (either online or offline) RL problems, the goal is to learn a policy for a single RL task, using training data collected from that task. In contrast, the goal for ICRL is much more challenging: with training data from diverse RL tasks, ICRL aims to learn an algorithm (implemented by a transformer) for solving distinct RL tasks. At inference time, ICRL directly applies the pretrained transformer model to new/unseen RL tasks which are unknown to the transformer, without updating the transformer’s parameters. Please see Figure 1 in https://arxiv.org/pdf/2210.14215 for a good schematic illustration. Hence, the pretrained transformer for ICRL must generalize to new RL tasks during inference time, conditioning on the provided context containing offline trajectories collected from these new RL tasks.

MAB Example. To concretely illustrate the distinctions, consider the multi-armed bandit (MAB) problem. For a standard RL problem, we consider a single MAB problem, and the goal is to identify the optimal bandit for this MAB problem from a dataset of observed rewards for all bandits. For an ICRL problem, given pretraining data collected from various MAB problems with distinct reward distributions, the goal of ICRL is to quickly identify the optimal bandit of an unseen MAB problem whose reward distributions are different from those in the pretraining dataset.
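To make this contrast concrete, a schematic of what ICRL inference looks like in the MAB example: the pretrained transformer is frozen, and adaptation to the unseen task happens purely by growing the context. The model and environment interfaces below (act, pull) are hypothetical placeholders, not the paper's API.

```python
# Schematic ICRL inference loop on an unseen bandit task (hypothetical interfaces).
# The pretrained model is frozen; there are no gradient updates at test time.
def in_context_bandit_rollout(pretrained_model, new_task_env, horizon=100):
    context = []                                 # (arm, reward) pairs from the new task
    total_reward = 0.0
    for t in range(horizon):
        arm = pretrained_model.act(context)      # condition on the context so far
        reward = new_task_env.pull(arm)          # interact with the unseen task
        context.append((arm, reward))            # the only "learning" signal at test time
        total_reward += reward
    return total_reward
```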


Challenges of ICRL from Suboptimal Historical Data

Existing works for ICRL all assume access to data collected by optimal policies from diverse RL tasks, which is a strong requirement that is difficult to satisfy in practice. To this end, our work considers ICRL from only suboptimal data. There are two main challenges in solving this problem:

  • Challenge 1: How to generalize to new tasks unknown to the pretrained transformer model.
  • Challenge 2: How to address the distribution shift so that we can use suboptimal data to infer optimal policies for distinct RL tasks.

To address Challenge 1 above, we use the in-context learning ability of transformers, which has been widely recognized and empirically demonstrated.

To address Challenge 2 above, we propose to use Advantage Weighted Regression (AWR). Notably, although AWR has been studied in standard RL, it remains unclear how to generalize this approach to ICRL which, as discussed above, is a distinct and much more challenging problem than standard RL. The main reason is that, unlike the weight function for standard RL with a single task, the weight function for ICRL needs to be task dependent.


Technical Novelty

Inspired by the AWR for standard RL, we also propose to use the advantage function as the weight function for ICRL.

Technical Challenge. However, because the weight function is task dependent, we must estimate the advantage functions for all distinct RL tasks in the pretraining dataset. Moreover, because we don’t know the training trajectories’ source tasks and thus cannot combine the trajectories from the same RL tasks to improve estimation, we need to estimate the advantage functions individually for each trajectory in the pretraining dataset. Hence, compared to AWR for standard RL where we have a significant amount of trajectories collected from the same RL task, the advantage function estimation problem for ICRL poses a formidable challenge.

Technical Novelty. Our technical novelty here is to learn the task-dependent weight function, that is, the advantage functions for diverse RL tasks, via another in-context learning problem – in-context value estimation. Specifically, the proposed in-context advantage estimator can correctly estimate the task-dependent weight function, conditioned only on the pretraining trajectories, without any knowledge of their source RL tasks. This is also empirically verified in our extra experiment, whose results are presented in Figure 8 in Appendix I. It can be observed that the prediction error of the trained in-context advantage estimator is very small.
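Putting the pieces of this response together, the weighted pretraining objective can be written schematically as below. This is a paraphrase of the description above, with $C$ the single context trajectory from the (unknown) source task and $\hat{A}$ the in-context advantage estimate; the notation is illustrative and not necessarily the paper's.

```latex
% Schematic ICRL pretraining objective: the weight is task-dependent and is
% supplied by the in-context advantage estimator conditioned on the context C.
\max_{\theta}\;
\mathbb{E}_{C,\,(s,a)\sim\mathcal{D}}
\left[
  \exp\!\left(\tfrac{\hat{A}(s,a \mid C)}{\beta}\right)\,
  \log \pi_\theta(a \mid s, C)
\right],
\qquad
\hat{A}(s,a \mid C) \;=\; \hat{Q}(s,a \mid C) - \hat{V}(s \mid C)
```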

AC Meta-Review

The paper presents the Decision Importance Transformer (DIT), a model designed for in-context reinforcement learning (RL) using suboptimal historical data. DIT addresses the challenge of training RL agents with suboptimal data by using a weighted maximum likelihood estimation, which assigns higher weights to actions with high advantage values. The authors demonstrate that DIT achieves superior performance compared to other methods, particularly when the offline dataset contains suboptimal historical data. The approach is evaluated on bandit and Markov Decision Process problems. The core idea of DIT is that by emphasizing good actions during supervised pretraining, a transformer-based policy can perform well even with suboptimal data. The authors also highlight that their approach is an effective way for in-context value function estimation.

Reviewers highlighted its novelty in addressing in-context RL with suboptimal data, noting its significance for real-world applications. They praised the effective methodology that combines transformer models, exponential reweighting, and actor-critic approaches, resulting in a cohesive design. A key strength is the in-context advantage estimator, which dynamically approximates advantage values, enhancing policy learning from suboptimal data. Reviewers also noted the strong empirical results in bandit and MDP environments, showcasing DIT's robustness. Finally, they found the paper to be well-organized and clearly presented.

The reviewers raised several concerns and questions about the paper. Reviewers questioned whether the advantage function estimation was reliable and if the method could avoid overestimation issues common in offline RL. Additionally, they requested more direct evidence that task information is being captured through in-context learning.

This area chair is familiar with ICRL, has read the paper, and offers this recommendation. The paper needs to be self-sufficient and introduce the topic to a reader who might not be fully familiar with it. Unfortunately, this paper falls short of doing that. The next revision should make the presentation more accessible; including some examples of ICRL in the introduction would go a long way. For example, it is not clear what the authors mean by different RL instances, or what they consider actions in this context -- the part that is confusing is the mention of LLMs. Is DIT the foundation model being trained, or is it an additional model that does the data weighting? A system diagram depicting DIT would be very helpful. The authors should also place the work in the context of the relevant literature (e.g., https://arxiv.org/pdf/2404.11018, which addresses ICRL with suboptimal data). Lastly, the paper frames its methodology around large transformer models; however, all evaluations are done on GPT-2. From there I am inferring that DIT is an additional model that guides the next round of in-context examples. It appears that Appendix H clarifies this point -- please make it sooner in the paper.

Overall, the paper has some interesting ideas, although its presentation should be significantly improved.

Additional Comments on the Reviewer Discussion

The authors addressed these concerns by clarifying the distinctions between in-context RL and standard RL, explaining the technical challenges specific to ICRL, and providing additional experiments. They also emphasized that their contribution was an effective approach for in-context value function estimation and a demonstration that a simple reweighting technique facilitates in-context RL with suboptimal data. Most reviewers were satisfied with these clarifications and revised their ratings, with the exception of one reviewer who maintained their critique.

Final Decision

Reject