PaperHub
Average rating: 5.5 / 10 · Rejected · 4 reviewers
Ratings: 5, 6, 6, 5 (lowest 5, highest 6, standard deviation 0.5)
Confidence: 3.0 · Correctness: 3.0 · Contribution: 2.5 · Presentation: 2.5
ICLR 2025

Towards Zero-Shot Generalization in Offline Reinforcement Learning

OpenReview · PDF
Submitted: 2024-09-28 · Updated: 2025-02-05

Keywords: offline reinforcement learning, generalization

Reviews and Discussion

Review
Rating: 5

The authors address the problem of zero-shot offline generalisation, where an algorithm is provided with a number of datasets from some source environments and aims to train a policy that performs as well as possible in a set of target environments from which it has never seen any data.

Strengths

The authors propose a model-based (PERM) as well as a model-free algorithm (PPPO), which to the best of my knowledge are the first approaches to exhibit zero-shot generalisation capabilities with a proved bound on their suboptimality. They also provide empirical results on a number of environments & analyse why previous offline algorithms fail in the zero-shot generalisation setting without context. Personally, I believe this area to be crucially important for practical applications and believe it to deserve further investigation as it is so far underexplored, and I thus very much welcome the authors' work.

Weaknesses

I have some questions about the empirical evaluation:

  • could you provide uncertainty estimates also for mean & median performance? Currently it is hard to judge whether the mean improvement from e.g. BC to IQL-4V on expert is significant or not.
  • generally, I am surprised that BC performs so well - not only on the expert datasets where it could be expected, but also on the mixed datasets. Do you have a theory about why that is the case? Overall BC seems to outperform the proposed method in terms of median & mean performance (except for mean on expert, but it is unclear whether that is significant). With that in mind, it is questionable whether in practice one would not rather use the much simpler, easier to interpret & implement BC method instead of the proposed solution. Generally, the authors don't comment much on BC in the corresponding section and I think this point needs some further discussion.
  • if I am not mistaken, figure 2 shows the differences in performance between IQL baseline and the newly proposed approach (?). Please add that information in the caption, it currently only says difference, but not between what.
  • furthermore, I understand that not much prior work in the offline zero-shot area exists, however as far as I understand (please correct if I am mistaken) your baseline IQL is overly simple, i.e. it treats all stages as belonging to the same MDP. An improvement over this behavior should be fairly simple when taking into account context information. I believe you could show the merits of your method in a much more convincing way if you chose a baseline that would do this - e.g. use a meta learning approach and don't allow it to fine-tune on the target task (meta learning does take into account context information but I would still expect your proposed method to outperform it).

Questions

see weaknesses

Comment

We sincerely thank you for your valuable feedback and suggestions. Our responses are as follows.

Q1 could you provide uncertainty estimates?

A1 Thank you for your insightful suggestion. We have included the uncertainty estimates of mean and median returns in Table 2 of our revised manuscript. We also report the Interquartile Means (IQM) with confidence intervals in the revised manuscript to better illustrate the significance of the observed improvements.
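For reference, a minimal sketch of how IQM and a bootstrapped confidence interval can be computed (our own NumPy snippet following the general recipe of Agarwal et al. (2021); the score array and number of resamples are illustrative, not the paper's actual evaluation code):

```python
import numpy as np

def iqm(scores):
    """Interquartile mean: average of the middle 50% of scores."""
    q25, q75 = np.percentile(scores, [25, 75])
    middle = scores[(scores >= q25) & (scores <= q75)]
    return middle.mean()

def bootstrap_ci(scores, stat=iqm, n_boot=10000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic."""
    rng = np.random.default_rng(seed)
    resamples = rng.choice(scores, size=(n_boot, len(scores)), replace=True)
    stats = np.array([stat(r) for r in resamples])
    return np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# Illustrative: min-max normalized returns across games/seeds.
scores = np.array([0.12, -0.05, 0.30, 0.08, 0.21, -0.10, 0.15, 0.02])
print(iqm(scores), bootstrap_ci(scores))
```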

Q2 generally, I am surprised that BC performs so well - not only on the expert datasets where it could be expected, but also on the mixed datasets. Do you have a theory about why that is the case?

A2 We would like to emphasize that BC can outperform offline RL methods in certain cases, as its performance is highly dependent on the quality of the behavior policy. Theoretically, as established in [1], which studies the comparison between BC and offline RL in single-task settings, when the behavior policy is optimal, BC can converge to the optimal policy at a rate of $1/N$, where $N$ is the number of trajectories in the offline dataset. This convergence rate is faster than the $1/\sqrt{N}$ rate achieved by most offline RL methods, such as CQL. Consequently, when the offline dataset is finite (as in our case), BC can indeed outperform offline RL methods. However, when the behavior policy is suboptimal, the comparison between BC and offline RL becomes more nuanced and is influenced by factors such as dataset quality and the degree of suboptimality in the behavior policy. The theoretical results in [1] also extend to our methods, as RL with generalization encompasses the single-task RL setting.
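As a toy numerical illustration of these two rates (our own snippet; the constants are arbitrary and only the scaling with $N$ matters):

```python
import math

c_bc, c_rl = 5.0, 5.0  # illustrative constants only
for N in [10, 100, 1_000, 10_000]:
    bc_bound = c_bc / N             # BC rate when the behavior policy is optimal
    rl_bound = c_rl / math.sqrt(N)  # typical offline RL (e.g., CQL-style) rate
    print(f"N={N:>6}: BC ~ {bc_bound:.4f}, offline RL ~ {rl_bound:.4f}")
```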

It is important to note that our primary goal is not to assert that our methods will always outperform BC, particularly in finite-sample settings. Instead, our aim is to present a theoretically grounded framework that identifies the failure modes of existing offline RL methods in terms of generalization and demonstrates how our proposed approaches address these issues. While BC may outperform offline RL methods in specific scenarios, our contributions lie in providing a deeper understanding of the limitations of existing methods and proposing solutions that offer better generalization under broader conditions.

[1] "When Should We Prefer Offline Reinforcement Learning Over Behavioral Cloning?"

Q3 Figure 2 shows the differences in performance between IQL baseline and the newly proposed approach. Please add that information in the caption.

A3 Thank you for highlighting this issue. We have clarified it in the revised version of the paper. Specifically, Figure 2 illustrates the performance differences between our proposed IQL-4V approach and the original IQL algorithm on a per-game basis, as measured by min-max normalized scores.

Q4 I believe you could show the merits of your method in a much more convincing way if you chose a baseline that would do this - e.g. use a meta learning approach and don't allow it to fine-tune on the target task.

A4 Thank you for your insightful suggestion. We would like to highlight that we do not allow context information during the test phase, therefore it is impossible to incorporate the context information to boost the performance of baseline algorithms such as vanilla IQL. Regarding the suggestion of using a meta-learning approach, we conducted an additional evaluation using the method proposed by Mitchell et al. (2021), without fine-tuning, to provide a direct comparison with our approach on the Miner and Maze Expert datasets. The results are summarized below:

Procgen game | IQL-4V | MACAW-4Tasks (w/o fine-tuning)
Miner | 6.36 ± 1.85 | 4.0 ± 0.70
Maze | 5.0 ± 1.26 | 2.4 ± 1.50

As outlined in Algorithm 1 (MACAW Meta-Training) by Mitchell et al. (2021), the primary distinction between the training process of our IQL-4V method and the meta-training phase of the MACAW algorithm lies in how the critic gradient updates are handled. In MACAW's meta-training, the gradient updates from individual tasks at each training step are consolidated into a single critic network, guided by the "test batch" for each task. Notably, MACAW consistently maintains only one unified critic network throughout the meta-training process. While both our method and meta-learning approaches share the concept of leveraging multiple optimization objectives during training, our method demonstrates superiority. This advantage stems from our approach’s ability to maintain multiple value networks throughout training, which enhances the model’s capacity to effectively capture and utilize diverse contextual information.
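To make this distinction concrete, a schematic sketch of the two update patterns described above (our own illustration, not the authors' or MACAW's code; `critic_loss`, the optimizers, and the batch objects are placeholders):

```python
# PyTorch-style pseudocode; losses are assumed to be differentiable tensors.

def macaw_style_critic_step(critic, critic_opt, task_batches, critic_loss):
    # MACAW meta-training (schematic): per-task losses are consolidated
    # into a single gradient step on one shared critic network.
    loss = sum(critic_loss(critic, batch) for batch in task_batches) / len(task_batches)
    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()

def iql_nv_critic_step(critics, critic_opts, env_batches, critic_loss):
    # IQL-nV (schematic): one critic per environment/split, each updated
    # only on its own data; a single shared policy is trained separately.
    for critic, opt, batch in zip(critics, critic_opts, env_batches):
        loss = critic_loss(critic, batch)
        opt.zero_grad()
        loss.backward()
        opt.step()
```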

Again, we sincerely thank you for your time and constructive feedback. We hope our responses address your concerns, and we welcome any further questions or suggestions.

Comment

I thank the authors for their detailed response to my questions and changes made to the manuscript.

Even though I understand that your primary goal is not to show that the proposed method always outperforms BC, I would still have a follow-up question regarding that baseline:

You say that when the behavior policy is suboptimal, the comparison between BC and offline RL becomes more nuanced. I would absolutely agree, however wouldn't you then expect a much more competitive performance of your proposed method in the Mixed setting? It seems to only achieve 1/3 - 1/2 of the performance, depending on what metric I look at. When would you propose someone should use the method if BC can be reasonably expected to yield much better results?

Generally I see a value in algorithms that are well grounded in theory. It usually means one has a better understanding of what is going on. Do you maybe have an explanation for what we are seeing - why would you say is it not better?

Comment

Thank you for your thoughtful follow-up question.

Q Offline RL should be more competitive with BC in the mixed data setting. Why is it not better?

A We appreciate your interest in understanding the nuanced comparison between our method and BC in the mixed setting. However, we would like to highlight that the assertion "offline RL will become more competitive with mixed-quality data" is not always true. Below, we address your concerns from both theoretical and empirical perspectives:

Theoretical Perspective:

We first provide a theoretical illustration of how the performance gap between offline RL and BC evolves with respect to the data coverage number (reflecting the quality of the behavior policy) in the single-task setting. Since the multi-task setting generalizes the single-task setting, our observation can also be extended to the multi-task ZSG setting.

We follow the setup in [1]. For the single-environment setting, let $C^* = \max_{s, a, h} \frac{d_h^{\pi^*}(s, a)}{d_h^{\pi^b}(s, a)}$ quantify the data coverage of the behavior policy relative to the optimal policy, which reflects the data quality. The higher $C^*$, the worse the offline data quality.

The suboptimality gap of BC is (ignoring logarithmic factors) $\frac{H(C^* - 1)}{2} + \frac{SH}{K}$, where $S$ is the number of states, $H$ is the planning horizon, and $K$ is the number of trajectories. For offline RL methods, the suboptimality gap is $\sqrt{\frac{C^* S H}{K}} + \frac{C^* S H}{K}$. Thus, the performance difference between offline RL and BC is:

$$O\left(C^* H \left(1 - \frac{S}{K}\right) - \sqrt{\frac{C^* S H}{K}} - \frac{H}{2} + \frac{SH}{K}\right).$$

The key insight is that the above gap is not always an increasing function w.r.t. $C^*$. For example, when $1 - \frac{S}{K} = \frac{1}{H^2}$, the above gap becomes

$$O\left(\frac{C^*}{H} - \sqrt{C^* H \left(1 - \frac{1}{H^2}\right)} + H - \frac{1}{H}\right).$$

We set $C^* = H$ and $C^* = H^2$ to represent offline data with near-expert quality and mixed quality, respectively. Then the above gap becomes

$$\text{expert: } O(1) > 0, \qquad \text{mixed: } O(H - H^{1.5}) < 0.$$

This example implies that with a mixed-quality dataset, it is possible for the performance gap between BC and offline RL to increase slightly as $C^*$ grows beyond the expert-quality level.

[1] When Should We Prefer Offline Reinforcement Learning Over Behavioral Cloning? ICLR 2022.
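A quick numerical sanity check of the example above (our own script; $H$ is an arbitrary horizon, and all constants and lower-order terms hidden in the $O(\cdot)$ are dropped):

```python
import math

def gap(c_star, H):
    # O(C*/H - sqrt(C* H (1 - 1/H^2)) + H - 1/H), constants dropped
    return c_star / H - math.sqrt(c_star * H * (1 - 1 / H**2)) + H - 1 / H

H = 50
print(gap(H, H))     # "expert" quality, C* = H:   positive, on the order of O(1)
print(gap(H**2, H))  # "mixed" quality,  C* = H^2: negative, on the order of O(H - H^1.5)
```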

Empirical Perspective:

Empirically, to further support our argument, we conducted an additional evaluation on the full Procgen benchmark, consisting of 16 games. This evaluation utilized a suboptimal dataset instead of the mixed expert-suboptimal dataset, where $C^*$ becomes even larger. To construct the offline dataset, we extracted 1 million transitions from the 25M suboptimal dataset provided by Mediratta et al. (2023) for each Procgen game, following the same steps as in Mediratta et al. (2023). For the evaluation, we compared the performance of the IQL-4V, IQL, and BC algorithms, adhering to the same practices outlined in our original paper. The results of the evaluation are as follows:

Procgen game | IQL-4V | IQL | BC
Bigfish | 3.03 ± 0.96 | 1.77 ± 0.06 | 1.73 ± 0.14
Bossfight | 1.04 ± 0.20 | 0.91 ± 0.12 | 1.06 ± 0.13
Caveflyer | 2.01 ± 0.41 | 1.63 ± 0.20 | 1.47 ± 0.14
Chaser | 0.42 ± 0.05 | 0.48 ± 0.01 | 0.46 ± 0.01
Climber | 1.07 ± 0.06 | 1.02 ± 0.01 | 1.07 ± 0.16
Coinrun | 2.10 ± 1.00 | 2.80 ± 0.60 | 2.00 ± 0.10
Dodgeball | 0.72 ± 0.20 | 0.72 ± 0.16 | 0.61 ± 0.09
Fruitbot | -1.06 ± 0.07 | -0.16 ± 0.06 | -2.53 ± 0.02
Heist | 0.80 ± 0.01 | 0.65 ± 0.15 | 0.30 ± 0.01
Jumper | 1.70 ± 0.10 | 1.35 ± 0.35 | 1.20 ± 0.10
Leaper | 4.10 ± 0.10 | 3.75 ± 0.05 | 3.40 ± 0.30
Maze | 1.25 ± 0.45 | 1.25 ± 0.25 | 1.30 ± 0.40
Miner | 0.14 ± 0.01 | 0.12 ± 0.02 | 0.15 ± 0.03
Ninja | 1.20 ± 0.10 | 1.15 ± 0.15 | 1.35 ± 0.15
Plunder | 2.95 ± 0.80 | 2.26 ± 0.30 | 2.63 ± 0.04
Starpilot | 4.20 ± 0.12 | 3.89 ± 0.32 | 4.55 ± 0.28
Mean | -0.155 ± 0.047 | -0.163 ± 0.040 | -0.179 ± 0.042
Median | -0.088 ± 0.074 | -0.092 ± 0.063 | -0.074 ± 0.066
IQM | -0.079 ± 0.016 | -0.099 ± 0.022 | -0.108 ± 0.022

The results demonstrate that IQL-4V generally outperforms BC for highly suboptimal data. Combined with our experimental findings, this supports our theoretical claim that the notion "offline RL will always become more competitive compared to BC as data quality decreases" is not universally true.

We hope this response addresses any concerns the reviewers may have. Please do not hesitate to reach out with any further questions or comments.

Comment

Thank you for your valuable comments. With the ICLR rebuttal phase deadline approaching, we would greatly appreciate any additional feedback or concerns you may have.

Comment

Thank you for the detailed response.

I have to say I am unsure what to make out of it. I would agree that offline RL doesn't always have to be better than BC in the mixed data setting. If it were only about a few datasets one could easily argue that this is not an issue. However, we are talking about 3 times (!) better median score in the mixed setting & still better median score on the "highly suboptimal data", i.e. putting it in more drastic terms, even if the behavior policy is garbage, simply cloning it gives better (or equal, due to the uncertainty) median performance than the considered offline RL methods (I am thus also confused by the statement "IQL-4V generally outperforms BC for highly suboptimal data"). To me these results are highly surprising and at least to the best of my knowledge, this is not a universally expected situation in prior offline RL literature either - of course situations exist where BC makes more sense, but those are usually more on the better data quality side. Since you are tackling the offline setting, where one cannot test which method performs better, and a practitioner thus has to choose one method from the start, it seems your data is implying that BC should be always favoured over offline RL no matter what data at hand. In the current manuscript, I think this whole topic is not mentioned at all, which I believe to be problematic due to the surprising & unintuitive results.

Comment

Thanks for your response. We address your concerns as follows.

Q: I am thus also confused by the statement "IQL-4V generally outperforms BC for highly suboptimal data"

A: To clarify, our statement “IQL-4V generally outperforms BC for highly suboptimal data” is based on the following observations: in experiments on the suboptimal dataset, IQL-4V outperforms BC in 9 out of 16 games, while BC outperforms IQL-4V in only 6 out of 16 games, with one tie. We emphasize that the experimental setting spans multiple independent games, each with varying levels of difficulty. It is crucial to evaluate performance across all games rather than relying solely on summary statistics like the median. Moreover, metrics such as the mean and IQM (Interquartile Mean) also indicate that IQL-4V outperforms BC. Therefore, we believe the statement “BC should always be favored over offline RL” is not accurate in this context.

Q: Even if the behavior policy is garbage, simply cloning it gives better (or equal, due to the uncertainty) median performance than the considered offline RL methods. To me these results are highly surprising and at least to the best of my knowledge, this is not a universally expected situation in prior offline RL literature either - of course situations exist where BC makes more sense, but those are usually more on the better data quality side.

A: While the impression that “situations where BC makes more sense are usually on the better data quality side” holds in classical RL settings, we respectfully disagree that this observation applies to offline RL with generalization. In fact, limited prior work, such as Mediratta et al. (2023), has demonstrated that standard offline RL methods often perform significantly worse than BC in both expert and mixed data settings. Notably, the performance gap is even larger in the mixed data setting, which contrasts with the conventional expectation. Therefore, we believe that this observation does not hold in the offline RL with generalization setting, where BC can exhibit competitive or superior performance even with lower-quality data.

Q: Since you are tackling the offline setting, where one cannot test which method performs better, and a practitioner thus has to choose one method from the start, it seems your data is implying that BC should be always favoured over offline RL no matter what data at hand. In the current manuscript, I think this whole topic is not mentioned at all, which I believe to be problematic due to the surprising & unintuitive results.

A: We want to highlight that the primary focus of our paper is not to propose an algorithm that universally outperforms BC in the challenging multi-environment generalization setting. Instead, our primary contribution lies in being the first to theoretically study zero-shot generalization in offline RL and to design theoretically sound offline RL algorithms with strong generalization ability. Our experiments demonstrate that applying our theoretical analysis-inspired frameworks can significantly improve the performance of baseline offline RL methods. Designing offline RL algorithms that consistently outperform BC remains a promising yet challenging direction for future research, but it is beyond the scope of this paper.

Review
Rating: 6

This paper studies zero-shot generalisation (ZSG) to unseen environment contexts in offline RL. The authors perform extensive theoretical analyses to establish why standard offline RL techniques perform poorly in this setting. They propose two idealised algorithms (PPPO and PERM) and provide bounds on the sub-optimality of these methods w.r.t. an optimal policy for ZSG. They conclude by providing a practical instantiation of their idealised algorithms based on IQL and evaluate its ZSG performance on procgen.

Strengths

  • The problem setting is an important, unexplored area. To the best of my knowledge, this is the first work that provides a theoretical analysis of zero-shot generalisation to unseen environments in offline RL.
  • The authors' theoretical analysis is meticulous and extensive.
  • The authors supplement their theoretical analysis with empirical results on the relevant benchmark provided by Mediratta et al. (2023).

Weaknesses

  • Mediratta et al. (2023)'s work showed that behaviour cloning (BC) zero-shot generalised to unseen environments better than standard offline RL methods. The goal of this paper, in effect, is to establish the failure mode of offline RL methods in this setting and propose new methods that remedy it. However, the empirical findings in Table 2 and Figure 2 do not leave the reader confident that a solution has been found--BC's median performance is better than (or similar to) IQL-4V's on both the expert and mixed datasets. Presumably this can be explained with reference to the sub-optimality gap of PERM reported in Table 1, but the authors do not discuss this.
  • I found the paper difficult to follow. The notation is dense, and in my opinion, overloaded (e.g. the definitions of state, and state-action value functions in Lines 150-153), which made following the analysis difficult. At times the English is poor, but I'm sensitive to the fact that it may not be the authors' first language. After several read-throughs, I'm left unsure what the key contribution is (there are many disparate contributions). If a practitioner wanted to build algorithms based on the theoretical results it is, in my opinion, not clear where they should begin. NB: this is not a value-judgement on the quality of the theoretical analysis, but it does affect how easily others can build upon the authors' findings.

Minor feedback

  • In Line 19 you say that "our frameworks find the near-optimal policy with ZSG both theoretically and empirically". The findings in Section 6 suggest that you do not find the near-optimal policy in your empirical setting.
  • Figure 2 is tricky to read, and lacks a y-axis label.
  • Line 122: "successive feature" -> "successor features", and Touati et al. (2023) do study zero-shot generalisation in offline RL, but to unseen reward functions rather than unseen environments. Similar works that are not cited include [1,2]
  • Section 3 should be titled "Preliminaries".
  • In Line 91 you miss Section 2 from your discussion of the rest of the paper.
  • When reporting empirical results I would recommend following the guidance of [3] and use IQMs, confidence intervals obtained via bootstrapping etc.
  • Line 355: "becomes" -> "become".
  • Line 360: remove "it".
  • It would be helpful if the code to reproduce the experiments was linked/provided.

References

[1] Park, S., Kreiman, T., and Levine, S. (2024). Foundation policies with hilbert representations. International Conference on Machine Learning.

[2] Jeen, S., Bewley, T., and Cullen, J. M. (2024). Zero-shot reinforcement learning from low quality data. Advances in Neural Information Processing Systems 38.

[3] Agarwal, R., Schwarzer, M., Castro, P. S., Courville, A. C., and Bellemare, M. (2021). Deep reinforcement learning at the edge of the statistical precipice. Advances in neural information processing systems, 34:29304–29320

Questions

  • In practice, why does IQL-4V not outperform BC in Section 6? It would be helpful if you could discuss more thoroughly the limitations in translating the theoretical findings to practical application.
  • As far as I understand, IQL-nV trains $n$ value functions and $n$ policies (one w.r.t. each value function), but in practice $n$ is less than the number of test contexts. At test-time, how do you select which of the $n$ policies to use for an arbitrary unseen context?
Comment

We sincerely thank you for your valuable feedback and suggestions. Our responses are as follows; we hope they address your concerns well.

Q1 The empirical findings in Table 2 and Figure 2 do not leave the reader confident that a solution has been found--BC's median performance is better than (or similar to) IQL-4V's on both the expert and mixed datasets. Presumably this can be explained with reference to the sub-optimality gap of PERM reported in Table 1, but the authors do not discuss this.

Q2 In practice, why does IQL-4V not outperform BC in Section 6? It would be helpful if you could discuss more thoroughly the limitations in translating the theoretical findings to practical application.

A1&A2 We would like to emphasize that BC can outperform offline RL methods in certain cases, as its performance is highly dependent on the quality of the behavior policy. Theoretically, as established in [1], which studies the comparison between BC and offline RL in single-task settings, when the behavior policy is optimal, BC can converge to the optimal policy at a rate of $1/N$, where $N$ is the number of trajectories in the offline dataset. This convergence rate is faster than the $1/\sqrt{N}$ rate achieved by most offline RL methods, such as CQL. Consequently, when the offline dataset is finite (as in our case), BC can indeed outperform offline RL methods. However, when the behavior policy is suboptimal, the comparison between BC and offline RL becomes more nuanced and is influenced by factors such as dataset quality and the degree of suboptimality in the behavior policy. The theoretical results in [1] also extend to our methods, as RL with generalization encompasses the single-task RL setting.

It is important to note that our primary goal is not to assert that our methods will always outperform BC, particularly in finite-sample settings. Instead, our aim is to present a theoretically grounded framework that identifies the failure modes of existing offline RL methods in terms of generalization and demonstrates how our proposed approaches address these issues. While BC may outperform offline RL methods in specific scenarios, our contributions lie in providing a deeper understanding of the limitations of existing methods and proposing solutions that offer better generalization under broader conditions.

[1] "When Should We Prefer Offline Reinforcement Learning Over Behavioral Cloning?"

Q3 The notation is dense and overloaded, which made following the analysis difficult.

A3 We appreciate your feedback and understand that the dense notation and definitions may have made the paper challenging to follow. The definitions of the value function and Q-function in Lines 153-156 follow standard conventions in reinforcement learning studies. However, we acknowledge that presenting these concepts clearly and concisely is crucial to improving readability. Additionally, we have reviewed the language throughout the paper to improve clarity and fluency.

Q4 I'm unsure what the key contribution is. If a practitioner wanted to build algorithms based on the theoretical results it is, in my opinion, not clear where they should begin.

A4 Our main contributions are summarized in the Introduction Section (Lines 52-80). In essence, we prove why previous offline RL algorithms fail to generalize in the zero-shot generalization (ZSG) setting, and we propose two frameworks, PERM and PPPO, that are shown to achieve ZSG both theoretically and empirically.

To provide practical insights, we have implemented our proposed methods and demonstrated their effectiveness in practice. The implementation is based on the key idea of leveraging multiple value networks to capture variations across environments. Detailed descriptions of our experimental setups and results can be found in Section 6, which we hope can guide practitioners in applying our methods.

Q5 Empirically, our frameworks do not find the near-optimal policy with ZSG.

A5 We have revised the statement in the abstract to clarify and ensure precision.

Q6 At test-time, how do you select which of the policies to use for an arbitrary unseen context?

A6 We respectfully think there may be some misunderstandings. First, in our IQL-nV implementation, we only have one policy; therefore, we do not need to "choose" a policy. Second, we keep $n$ value functions, which are used only for training, not for testing. During testing, we only evaluate our policy, not the value functions.

Q7 Minor feedback and typos.

A7 Thank you for your valuable suggestions. We have revised our paper accordingly.

Moreover, we have included the link to our code to reproduce the experiments in the revised abstract.

Again, we sincerely thank you for your time and constructive feedback. We hope our responses address your concerns, and we welcome any further questions or suggestions.

Comment

Thanks for your detailed response and for the effort that has gone into changing the manuscript. I'll respond directly to A1&A2 which address my core concerns.

I appreciate that we should expect BC to be a strong baseline when trained on expert trajectories, and I do not necessarily expect that your proposals should outperform it in that setting. But I would expect your method to outperform BC on the Mixed dataset, and it doesn't in aggregate, as I wrote in my original review. My question is: in what empirical settings should we expect your method to outperform BC? It seems you have done the required theoretical analysis to establish this, but you haven't been able to show this empirically. I appreciate this is largely a theory paper, but I feel this is an important empirical demonstration to include, and the paper currently lacks it.

Comment

Thank you for your thoughtful follow-up question.

Q I would expect your method to outperform BC on the Mixed dataset. In what empirical settings should we expect your method to outperform BC?

A Thank you for your question. We argue that while it might seem intuitive to always prefer offline RL when data quality is suboptimal, the reality is more nuanced. Empirically, we claim that our algorithms are preferred when the data quality is highly suboptimal. To further support our argument, we conducted an additional evaluation on the full Procgen benchmark, consisting of 16 games. This evaluation utilized a suboptimal dataset instead of the mixed expert-suboptimal dataset where the data quality is even worse. To construct the offline dataset, we extracted 1 million transitions from the 25M suboptimal dataset provided by Mediratta et al. (2023) for each Procgen game, following the same steps in Mediratta et al. (2023). For the evaluation, we compared the performance of the IQL-4V, IQL and BC algorithms, adhering to the same practices outlined in our original paper. The results of the evaluation are as follows:

Procgen game | IQL-4V | IQL | BC
Bigfish | 3.03 ± 0.96 | 1.77 ± 0.06 | 1.73 ± 0.14
Bossfight | 1.04 ± 0.20 | 0.91 ± 0.12 | 1.06 ± 0.13
Caveflyer | 2.01 ± 0.41 | 1.63 ± 0.20 | 1.47 ± 0.14
Chaser | 0.42 ± 0.05 | 0.48 ± 0.01 | 0.46 ± 0.01
Climber | 1.07 ± 0.06 | 1.02 ± 0.01 | 1.07 ± 0.16
Coinrun | 2.10 ± 1.00 | 2.80 ± 0.60 | 2.00 ± 0.10
Dodgeball | 0.72 ± 0.20 | 0.72 ± 0.16 | 0.61 ± 0.09
Fruitbot | -1.06 ± 0.07 | -0.16 ± 0.06 | -2.53 ± 0.02
Heist | 0.80 ± 0.01 | 0.65 ± 0.15 | 0.30 ± 0.01
Jumper | 1.70 ± 0.10 | 1.35 ± 0.35 | 1.20 ± 0.10
Leaper | 4.10 ± 0.10 | 3.75 ± 0.05 | 3.40 ± 0.30
Maze | 1.25 ± 0.45 | 1.25 ± 0.25 | 1.30 ± 0.40
Miner | 0.14 ± 0.01 | 0.12 ± 0.02 | 0.15 ± 0.03
Ninja | 1.20 ± 0.10 | 1.15 ± 0.15 | 1.35 ± 0.15
Plunder | 2.95 ± 0.80 | 2.26 ± 0.30 | 2.63 ± 0.04
Starpilot | 4.20 ± 0.12 | 3.89 ± 0.32 | 4.55 ± 0.28
Mean | -0.155 ± 0.047 | -0.163 ± 0.040 | -0.179 ± 0.042
Median | -0.088 ± 0.074 | -0.092 ± 0.063 | -0.074 ± 0.066
IQM | -0.079 ± 0.016 | -0.099 ± 0.022 | -0.108 ± 0.022

The results show that IQL-4V generally outperforms BC for highly suboptimal data. We hope this response addresses any concerns you may have. We kindly request that our work not be judged based on goals beyond its current scope—such as conducting a full comparison with BC or developing a new algorithm that consistently outperforms BC—or on misaligned expectations regarding the experimental outcomes. Please do not hesitate to reach out if you have any further questions or comments.

Comment

Thank you for your valuable comments. With the ICLR rebuttal phase deadline approaching, we would greatly appreciate any additional feedback or concerns you may have.

Comment

Hi authors, thanks for your response and for the new results. I've re-read the paper with a fresh set of eyes, and have considered your responses to me and other reviewers.

I'm now more convinced that, for the purposes of this paper, your algorithms do not need to broadly outperform BC, and the improvement over vanilla IQL that you show justifies your theoretical contributions earlier in the paper.

I would encourage you to include these new results on the highly suboptimal datasets in an updated version of Table 2, and tone down language about "generally outperforming BC", and instead focus more on direct comparisons with vanilla IQL.

A new thought that entered my mind on the re-read: what happens as you push $n$ beyond 8 in Table 3? Should we expect performance to improve further? I appreciate it is not computationally efficient, but presumably you could show better empirical performance by aggregating your value functions over fewer environments (i.e. get closer to a point where you maintain independent value functions for each environment)? I'd be keen to hear your response, but adding this study to the paper is not vital.

Given my reassessment, I'm happy to update my score from 5 to 6.

Comment

Thank you for your valuable feedback! We greatly appreciate your suggestions and will incorporate them. Below, we address the concern you have raised.

Q: How about more value networks? Should we expect better performance?

A: We would like to address a potential misunderstanding: it is not necessarily true that increasing the number of value networks always leads to better performance. While having more value networks can increase their independence from one another, it also reduces the number of trajectories available to train each network. This observation is further supported by our Theorem 22, which provides a real-world suboptimality gap for PERM. For clarity, we restate Theorem 22 here for reference. For any policy $\pi'$ which is the output of PERM with $m$ value networks, its suboptimality gap satisfies

$$\text{SubOpt}(\pi') \leq \underbrace{2\sqrt{\frac{2\log(6\mathcal{N}_{(Hm)^{-1}}^{\Pi}/\delta)}{n}}}_{I_1: \text{ Supervised learning (SL) error}} + \underbrace{\frac{2}{m}\sum_{j=1}^{m}\sum_{h=1}^{H}\mathbb{E}_{\pi^*, M_j}\left[\Gamma'_{j,h}(s_h, a_h) \,\middle|\, s_1 = x_1\right]}_{I_2: \text{ Reinforcement learning (RL) error}} + \underbrace{\frac{5}{m} + 2\sup_{\pi}\left|\frac{1}{n}\sum_{i=1}^{n} V^{\pi}_{i,1}(x_1) - \frac{1}{m}\sum_{j=1}^{m} {V'}^{\pi}_{j,1}(x_1)\right|}_{\text{Additional approximation error}}.$$

Our bound demonstrates that as $m$, the number of value networks, increases, the "additional approximation error" term decreases. However, the "RL error" term may increase because the uncertainty $\Gamma'_{j,h}$ might grow to account for a more diverse average MDP $M_j$. As a result, we cannot claim that increasing the number of value networks will universally improve overall performance.

Additionally, we conducted a new experiment in our ablation study to examine this phenomenon. Specifically, we scaled the number of value functions in IQL-16V to twice that of our earlier IQL-8V ablation study. Below, we present the updated results, which build on Table 3 from our original submission.

Procgen Game | 16V-SP (Expert) | 8V-SP (Expert) | 4V-SP (Expert)
Miner | 7.04 ± 0.27 | 7.88 ± 0.71 | 6.36 ± 1.85

From the new results, we observe that IQL-16V actually performs worse than IQL-8V. This finding further suggests that selecting a larger number of value networks does not always result in better performance.

Review
Rating: 6

This paper proposes Pessimistic Empirical Risk Minimization (PERM) and Pessimistic Proximal Policy Optimization (PPPO), both of which leverage a pessimistic policy evaluation component, aiming to address ZSG by minimizing a "suboptimality gap" that combines supervised learning error (related to policy generalization) and reinforcement learning error (related to dataset coverage).

Strengths

  • The paper provides theoretical bounds on the suboptimality of both PERM and PPPO. This theoretical rigor is a positive contribution to ZSG.
  • The proposed algorithms consider multiple environments separately, potentially improving ZSG by better capturing variations across contexts.

Weaknesses

  • Applying a pessimistic bias to counteract distributional shifts can lead to over-conservative policies that might underperform, particularly in environments where high-risk, high-reward actions are necessary.
  • Pessimism may sometimes hinder the exploration of higher-reward policies due to its inherent cautious nature. This paper’s methods also assume that each environment's context information and datasets are either directly available or accurately inferable, which limits the use cases from random sampled or mixed quality datasets.
  • The suboptimality bounds in the paper rely on having a large number of sufficiently varied environments. These bounds may be loose in situations with fewer or more similar environments, and the policy may perform poorly in unseen contexts.
  • Both PERM and PPPO involve parameters that adjust the degree of pessimism applied, whose sensitivity is not fully discussed.
  • PERM requires maintaining separate models or critic functions for each environment in the training set, which can quickly become computationally expensive as the number of environments grows. PPPO, while model-free, still requires multiple policies for different environments, which limits scalability when working with diverse and high-dimensional data.
  • The selection of the baseline methods is not justified; why not compare the proposed methods with SOTA methods?

Questions

please see weakness

Comment

We sincerely appreciate your thoughtful feedback and positive remarks about our theoretical contributions and algorithms. Below, we address your concerns and provide clarifications.

Q1 Applying a pessimistic bias to counteract distributional shifts can lead to over-conservative policies that might underperform, particularly in environments where high-risk, high-reward actions are necessary.

A1 We have some discussions in Remark 8 about why pessimism could help generalization in our ZSG setting. In our framework, pessimism can indeed facilitate generalization, rather than hinder it. Specifically, we employ pessimism to construct reliable Q functions for each environment individually. This approach supports broader generalization by maintaining multiple Q-networks separately. By doing so, we ensure that each Q function is robust within its specific environment, while the collective set of Q functions enables the system to generalize across different environments. Furthermore, our theoretical results demonstrate that the proposed pessimistic approach balances caution and generalization effectively, making it well-suited for ZSG.

Q2 This paper’s methods also assume that each environment's context information and datasets are either directly available or accurately inferable, which limits the use cases from random sampled or mixed quality datasets.

A2 We respectfully believe there may be a misunderstanding regarding our assumptions. In our setting, context information is required only during training on the offline dataset and is not available during evaluation, nor is it assumed to be perfectly inferable. Furthermore, the offline dataset does not need to include exact context variables—only labels that distinguish trajectories collected from different environments ($1, 2, \dots, n$). These labels enable differentiation across environments without requiring explicit context values or high-quality datasets, thereby broadening the applicability of our approach.

Q3 The suboptimality bounds in the paper rely on having a large number of sufficiently varied environments, and may be loose in situations with fewer or more similar environments, and the policy may perform poorly in unseen contexts.

A3 We agree that the suboptimality bounds depend on the number of environments $n$. Specifically, the $I_1$ terms in our bounds scale with $\sqrt{1/n}$, where $n$ is the number of different contextual MDPs with contexts drawn from the same distribution $C$. Note that this dependency is statistically unavoidable and aligns with standard generalization bounds in supervised learning, where generalization performance improves with more diverse training samples. Intuitively, if the number of samples is too small, we cannot expect good generalization results.

Q4 Both PERM and PPPO involve parameters that adjust the degree of pessimism applied, whose sensitivity is not fully discussed.

A4 Thank you for this valuable suggestion. The degree of pessimism in our approach is controlled by the uncertainty quantifier $\Gamma$, as defined in Definition 5. A larger $\Gamma$ corresponds to a greater degree of pessimism. Ideally, $\Gamma$ should be chosen as the smallest value that satisfies the inequality in Line 266.

The impact of $\Gamma$ is reflected in the $I_2$ terms in Theorems 9 and 14, where the suboptimality gap scales with $\Gamma$. This indicates that an overly pessimistic uncertainty quantifier may degrade performance by unnecessarily increasing the suboptimality gap. Conversely, an appropriately tuned $\Gamma$ balances caution and generalization, ensuring reliable policy evaluation and improved performance.

Q5 PERM requires maintaining separate models or critic functions for each environment in the training set, which can quickly become computationally expensive as the number of environments grows. PPPO, while model-free, still requires multiple policies for different environments, which limits scalability when working with diverse and high-dimensional data.

A5 Thank you for highlighting these concerns. In our paper, we address the practical challenges of the theoretical algorithms with specific mitigations. Practically, we propose merging multiple data splits into a shared context, which leads to the development of the IQL-nV algorithm for our empirical evaluations (see Line 477 and subsequent discussion). Theoretically, we establish rigorous bounds for both PERM and PPPO after merging the datasets, as detailed in Remark 12, Remark 15, and Appendix C of the paper. These measures ensure that our approach remains both computationally feasible and scalable.
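As an illustration of the merging step referred to above (a minimal sketch under our own naming; the round-robin grouping is an assumption, not necessarily the authors' exact split):

```python
def merge_into_splits(env_datasets, m):
    """Partition n per-environment datasets into m shared-context splits.

    env_datasets: list of n per-environment trajectory lists.
    Returns m merged datasets; e.g., m = 4 corresponds to IQL-4V, where one
    value network is maintained per split instead of per environment.
    """
    splits = [[] for _ in range(m)]
    for i, data in enumerate(env_datasets):
        splits[i % m].extend(data)  # round-robin assignment of environments
    return splits
```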

Comment

Q6 The selection of the baseline methods is not justified; why not compare the proposed methods with SOTA methods?

A6 We choose IQL since it is a well-known method with a Markovian policy. In Mediratta et al. (2023), the SOTA method for comparison is identified as BC, which we have indeed included in our experiments. Our results demonstrate that the variants of IQL proposed in our paper significantly enhance its performance, bringing it close to the level achieved by BC. The primary focus of our work is to explore mechanisms for improving offline RL performance in a generalization setting. We chose IQL as a representative method because of its suitability for this investigation.

Again, thank you very much for reviewing our work and your positive comments. We are happy to provide further clarifications if needed.

Comment

Thank you for your valuable comments. With the ICLR rebuttal phase deadline approaching, we would greatly appreciate any additional feedback or concerns you may have.

Comment

I thank the authors for their response. I'd like to keep my original rating.

Comment

Thank you for your positive feedback. Again, we sincerely appreciate your support, thoughtful review, and constructive suggestions.

Best regards,

The Authors of Submission 11426

Review
Rating: 5

This paper focuses on generalization across contexts in offline RL. The authors prove that their proposed algorithms, pessimistic ERM and pessimistic PPO can achieve good regret bounds in this setting. Inspired by the theory, a modified version of IQL is evaluated on offline ProcGen and showcases the benefits of the approach.

Strengths

The paper tackles an important problem, generalization in RL with offline datasets, and takes a theoretical perspective. From a look, the derivations seem sound and, generally, enough detail is included to understand them. Also, although the paper has a theoretical focus, it is nice to see some of the principles tried in an empirical setting.

Weaknesses

The problem setting requires some clarification. The general setting is contextual MDPs with offline datasets but some key details are unclear. For example:

  • The context information is included in the offline dataset. But when the agent is evaluated (run in the environment), is the context information available then?

  • There does not seem to be any assumption on context information or any common structure relating the context to the MDP. So, since the MDP can be arbitrarily different between contexts, it's confusing that the bound given by equation (1) (line 335) does not contain a term related to the number of contexts or a related quantity. How would it be possible to generalize to other contexts without any underlying structure?

  • If the context variable is included in the dataset and we are only interested in Markovian policies, how is this different from augmenting the state with the context variables and using a standard offline RL algorithm?

There are a few concerns about the empirical results and I have some questions in the next section.

I may be missing some important pieces of information concerning the above points and would be willing to revise my score based upon further clarification.

Questions

  • Line 385: Why is $V^\pi_{i-1,1}(x_1)$ not achievable? What does this mean?

  • The objective chosen is the "suboptimality gap compared to the best Markovian policy" (line 164). The best Markovian policy (without context) can be bad on every environment. When using history-dependent policies, it can be possible to infer the context and do much better. Based on this, it's unclear how meaningful the proposed objective is. Perhaps a worst-case bound over contexts would be a more satisfying choice.
    Alternatively, a setting where the context variable is revealed in the offline dataset but hidden during evaluation would be interesting to consider. That is, we want to change the objective so that we compare to history-dependent policies. Some additional modifications may be needed so learning is feasible and is different from POMDPs.

  • Line 257: The presence of an oracle for each individual dataset seems like a strong assumption. Could you elaborate on how one could implement the oracle and what kind of estimates would be feasible for $\Gamma(s,a)$?

  • For the counterexample CMDP in section 4 (line 235), it seems unreasonable to expect the agent to do well on actions that are never observed ($\mu(a) = 0$). There would be no observed rewards for those actions so the agent has no information. Perhaps the counterexample should be adjusted to account for this and give some positive probability to all actions.

  • Looking at table 3, the stochastic policy variant is better, but this is not the result reported in table 2 (in the row for Miner). In table 2, we see the deterministic policy result is reported, which is markedly worse. It would be more fair to report the stochastic policy variant in table 2. Currently, it seems like it is the difference between stochastic and deterministic policies that makes the largest difference and not the additional value networks.

  • Is the stochastic policy suggested in this work or is it a variant from previous works?

  • When running IQL-nV, there are more parameters and more capacity in the network. It would be more fair to compare to the 1V version which gets additional parameters to roughly match the ones of nV, n>1.

  • Methods making use of ensembles of value networks have indeed been used to combat value overestimation issues in various works e.g. [1] and [2]. Any thoughts on parallels between these lines of work and the current one?

[1] "Randomized Ensembled Double Q-Learning: Learning Fast Without a Model" Chen et al.
[2] "Maxmin Q-learning: Controlling the Estimation Bias of Q-learning" Lan et al.

Typos:

  • In the algorithm boxes, at the end, should it say "Return" instead of "Ensure"?

  • line 122: "successive feature" -> "successor feature"

Comment

Q5 How meaningful is the proposed objective of comparing to the best Markovian policy, and could a worst-case bound over contexts or an alternative setting involving history-dependent policies provide a more appropriate benchmark?

A5 First, we address this concern in Remark 1 (Lines 168–175), where we compare our setting with POMDPs and history-dependent policies. Our choice to focus on Markovian policies is motivated by their simplicity and practicality. The goal of this work is to provide a generalizable framework that extends existing offline RL methods, which primarily focus on Markovian policies (e.g., IQL), to address generalization across contexts. While history-dependent policies are an interesting and potentially powerful direction, they fall outside the scope of this work.

Second, regarding worst-case bounds, we clarify in Remark 2 (Lines 176–183) that such bounds are not achievable in the zero-shot generalization (ZSG) setting. This assertion is supported by prior work (Ye et al., 2023), which establishes lower bounds for the online RL setting. Since the offline RL setting is inherently more challenging, these lower bounds hold for our scenario as well. Consequently, our approach focuses on achievable objectives within the ZSG framework.

Finally, we are indeed studying the setting suggested in your comment, where context information is not accessible during evaluation and may not be explicitly included in the offline data. The only assumption we make is that the context distribution remains the same between training and evaluation. History-dependent policy could be more powerful, but it is beyond the scope of our current work and could be an exciting direction for future research.

Q6 The presence of an oracle for each individual dataset seems like a strong assumption. Could you elaborate on how one could implement the oracle and what kind of estimates would be feasible for $\Gamma(s,a)$?

A6 First, the assumption of an oracle for each individual dataset aligns with prior theoretical work in offline RL, such as Jin et al. (2021).

Second, for linear MDPs, we provide an explicit formula for instantiating the oracle, as detailed in Eq. (29) in Appendix D. This demonstrates how the oracle can be practically implemented in specific cases.

Finally, for general non-linear MDPs, we discuss feasible implementations in Remark 7 (Lines 281-284). A practical approach is to use bootstrapping techniques to estimate uncertainty, as in the Bootstrapped DQN method (Osband et al., 2016). We note that when the bootstrapping method is straightforward to implement, the assumption of having access to an uncertainty quantifier is reasonable.
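For concreteness, one way such a bootstrapped uncertainty quantifier could be instantiated in practice (our own sketch, not the paper's implementation; the network sizes, one-hot action encoding, and subtraction step are illustrative assumptions):

```python
import torch
import torch.nn as nn

class QEnsemble(nn.Module):
    """Ensemble of Q-networks; disagreement across heads serves as Gamma(s, a)."""
    def __init__(self, state_dim, action_dim, n_heads=5, hidden=256):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )
            for _ in range(n_heads)
        ])

    def forward(self, s, a_onehot):
        # a_onehot: one-hot (or continuous) action encoding concatenated to the state.
        x = torch.cat([s, a_onehot], dim=-1)
        return torch.stack([head(x) for head in self.heads], dim=0)  # (n_heads, B, 1)

    def gamma(self, s, a_onehot):
        # Bootstrapped uncertainty quantifier: spread of the ensemble predictions.
        return self.forward(s, a_onehot).std(dim=0)

# Pessimistic evaluation would then use, e.g.,
#   q_pess = ensemble(s, a).mean(dim=0) - ensemble.gamma(s, a)
# with each head trained on a different bootstrap resample of the offline data.
```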

Q7 For the counterexample CMDP in section 4, it seems unreasonable to expect the agent to do well on actions that are never observed. There would be no observed rewards for those actions so the agent has no information.

A7 We apologize for the confusion. First, we would like to clarify that the two MDP graphs in Figure 1 indeed share the same state-action space (i.e., they are components $\{u, v\}$ of the same cMDP). As a result, the data distribution $\mu$ effectively covers all the actions in the underlying MDP. Second, we emphasize that the data distributions $\mu_u$ and $\mu_v$ are derived from the near-optimal policy within each context. These distributions are almost entirely skewed toward the actions with maximum rewards in their respective contexts, which satisfies our assumptions and provides the agent with sufficient information about the maximum-reward policies.

Q8 Why does Table 2 report the deterministic policy result for Miner instead of the stochastic policy variant, and would this not more fairly reflect the impact of stochastic versus deterministic policies relative to the additional value networks?

A8 We did not specifically optimize hyperparameters, including whether to use a stochastic or deterministic policy, for each individual game in Procgen due to the computational cost. Instead, we used a consistent set of hyperparameters across all games. While a stochastic policy performs better for the Miner dataset, it yields suboptimal performance when considering the entire collection of games. This is why, for the expert dataset, we adopt the stochastic policy to maximize overall performance, as detailed in Table 4, which outlines our hyperparameter selection process.

Regarding the difference between stochastic and deterministic policies, we emphasize that the choice is case-dependent. For instance, we use a deterministic policy for the mixed dataset, as shown in Table 4. Additionally, with respect to the number of value networks, Table 2 demonstrates that IQL-4V outperforms IQL with a deterministic policy for the mixed dataset, highlighting the utility of increasing the number of value networks.

Comment

Q9 Is the stochastic policy suggested in this work or is it a variant from previous works?

A9 The use of a stochastic policy has indeed been considered by Mediratta et al. (2023) as an important hyperparameter configuration. However, their application of the stochastic policy was limited to BC evaluation and was not utilized for IQL evaluation. In contrast, our work specifically suggests employing a stochastic policy for IQL with multiple value networks. This recommendation is based on our observation that it significantly enhances overall performance on expert datasets.
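For clarity, a minimal sketch (ours, with placeholder names) of the difference between the two evaluation modes for a discrete-action policy such as those used on Procgen:

```python
import torch

def select_action(policy_logits, stochastic=True):
    """Discrete-action selection from policy logits.

    stochastic=True  -> sample from the categorical distribution (the "SP" variant).
    stochastic=False -> greedy argmax action (the deterministic variant).
    """
    if stochastic:
        return torch.distributions.Categorical(logits=policy_logits).sample()
    return policy_logits.argmax(dim=-1)
```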

Q10 It would be more fair to compare to the 1V version which gets additional parameters to roughly match the ones of nV, n>1.

A10 Thank you for your insightful suggestion. Based on your feedback, we conducted an additional ablation study to explore your proposed scenario, where we implemented IQL-1V with a value network scaled to four times its original size. Namely, we increase the hidden dimension of the fully-connected layers of the value network from 256 to 1024 in the IQL-1V setting, ensuring that the total number of critic parameters matches that of IQL-4V. Below are the results for the Miner Expert dataset:

Configuration | 4V-SP (256 hidden dim) | 1V-SP (256 hidden dim) | 1V-SP (1024 hidden dim)
Performance | 6.36 ± 1.85 | 5.6 ± 1.89 | 2.18 ± 1.05

Interestingly, the results indicate that increasing the model size for the single value network (1V-SP with 4M parameters) actually degrades performance compared to the original IQL-1V setup. We hypothesize that this decline is likely due to overfitting caused by scaling the network without additional structural changes. This observation highlights the practical advantages of our multi-value network approach, which achieves better performance while mitigating overfitting risks.

Q11 Methods making use of ensembles of value networks have indeed been used to combat value overestimation issues in various works, any thoughts on parallels between these lines of work and the current one?

A11 Thank you for bringing up these works. It is worth noting that overestimation bias, as addressed in the cited studies, is a significant challenge in single-task reinforcement learning. While these methods focus on mitigating estimation bias, our use of an ensemble of value networks serves a complementary purpose. Specifically, we leverage the ensemble approach to enhance generalization across tasks, ensuring robust performance in diverse and challenging settings.

Q12 Typos

A12 We have fixed them in the revision.

Again, we sincerely thank you for your time and constructive feedback. We hope our responses address your concerns, and we welcome any further questions or suggestions.

Comment

Thank you for your valuable comments. With the ICLR rebuttal phase deadline approaching, we would greatly appreciate any additional feedback or concerns you may have.

Comment

Dear Reviewer 15Pv,

Thank you for your time and thoughtful comments on our work. As the final ICLR rebuttal phase deadline approaches, we would greatly value any additional feedback or concerns you may wish to share.

Best,
The Authors

Comment

Dear Reviewer 15Pv,

Thank you for your time and thoughtful suggestions on our work. We hope our detailed responses and clarifications have addressed your concerns. With the final ICLR rebuttal deadline approaching in less than 7 hours, we would greatly appreciate any additional feedback or concerns you might have. Your insights would be extremely helpful in refining our submission.

Best regards,

The Authors

Comment

We sincerely thank you for your valuable feedback and suggestions. Our responses are as follows; we hope they address your concerns well.

Q1 The context information is included in the offline dataset. But when the agent is evaluated (run in the environment), is the context information available then?

A1 The context information is only required during training on the offline dataset and is not available during evaluation in the environment. During training, the context information is used solely to group trajectories into different categories or environments. It is neither used as input nor relied upon during inference time.

Importantly, during training, we do not need the exact values of the context variables. Instead, labels indicating which trajectories belong to different environments ($1, 2, \dots, n$) are sufficient. These labels allow us to differentiate between environments without requiring explicit context values. This approach aligns with our goal of achieving generalization without dependence on direct context information.
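As a small illustration of this point (our sketch; the `env_id` field name is hypothetical), the training code only needs an integer label per trajectory to route data to the corresponding value network:

```python
from collections import defaultdict

def group_by_env(trajectories):
    """Group offline trajectories by their environment label (1, ..., n).

    Each trajectory is assumed to carry an integer 'env_id' field; the raw
    context variables themselves are never required.
    """
    buffers = defaultdict(list)
    for traj in trajectories:
        buffers[traj["env_id"]].append(traj)
    return buffers  # env_id -> trajectories used to train that environment's value network
```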

Q2 Since the MDP can be arbitrarily different between contexts, how would it be possible to generalize to other contexts without any underlying structure?

A2 The contexts in our setting are drawn from the same distribution CC for both the offline training data and the testing evaluation. This shared distribution is essential for enabling generalization across contexts.

Our result does not explicitly depend on the size of the context set. Instead, it hinges on the covering number of the function class, as reflected in the $I_1$ term in equation (1) (line 321). This approach aligns with standard supervised learning theory, where generalization bounds depend on the complexity of the hypothesis space (e.g., VC-dimension or covering number) rather than the cardinality of the input space. This ensures that our method generalizes effectively even in cases where the number of contexts is large or infinite.

Q3 If the context variable is included in the dataset and we are only interested in Markovian policies, how is this different from augmenting the state with the context variables and using a standard offline RL algorithm?

A3 The context variables included in the dataset serve primarily as indicators to differentiate which task or environment each trajectory belongs to, rather than containing any semantic information. In our setting, context information is only required during training on the offline dataset and is not available during evaluation. Moreover, the offline dataset does not include explicit context variables but instead provides labels that differentiate trajectories collected from distinct environments. Since these labels do not correspond to explicit context variables, it is not feasible to augment the state with context information and directly apply standard offline RL algorithms.

Q4 Why is $V_{i-1,1}^\pi(x_1)$ not achievable? What does this mean?

A4 By stating that $V_{i-1,1}^\pi(x_1)$ is not achievable, we mean that it cannot be directly computed because the ground-truth transition dynamics of the environment are unknown. As a result, we rely on approximations. Specifically, we use a linear approximation method to iteratively update the value estimation based on the previous iteration's results. This approach aligns with standard techniques used in algorithms like PPO, where similar approximations are employed to estimate value functions in the absence of exact dynamics.

Comment

Thank you for your comments. We have revised our draft according to your suggestions. We list the main changes here for your reference.

First, we conducted a new experiment to address the reviewers' concerns, specifically an ablation study that compares IQL-4V with an enhanced version of IQL-1V. In this comparison, the network parameters of IQL-1V were scaled to match the model capacity of IQL-4V (Lines 1445–1457).

Second, we have reported Interquartile Means (IQM) with confidence intervals in addition to mean and median performance metrics to provide a clearer understanding of the significance of the observed improvements (Line 453).

Third, we improved the clarity of the manuscript by enhancing figures and captions (notably, lines 455-470).

We hope these revisions address your concerns and enhance the overall quality and clarity of the paper. Thank you again for your valuable feedback.

AC Meta-Review

This paper studies zero-shot generalization in offline RL, proposing pessimistic algorithms and providing theoretical analysis along with empirical validation on the Procgen benchmark. While the theoretical framework contributes to understanding generalization in offline RL, the core technical approach of using pessimism for uncertainty quantification is fairly standard, and empirical results do not demonstrate clear advantages over simpler baselines like behavioral cloning.

Additional Comments from the Reviewer Discussion

During the discussion, reviewers raised several key concerns: (1) lack of empirical evidence that the proposed methods outperform behavioral cloning (BC), even on mixed and suboptimal datasets where offline RL should theoretically have an advantage, (2) overly complex analysis and notation that makes the practical implications unclear, and (3) reliance on pessimistic value estimation as the core technical approach, which is a well-established technique in offline RL. These concerns are not well addressed. While the authors emphasized that "the primary goal of this paper is not to propose a state-of-the-art algorithm that always outperforms existing methods like BC across all settings", I feel that the reviewers' concern is well-grounded. Moreover, it seems that the algorithms implemented by the authors are not those studied in theory. Overall, these concerns lead to the decision.

Final Decision

Reject