4.5

/10

withdrawn4 位审稿人

最低3最高5标准差0.9

3.5

置信度

ICLR 2024

MoMA: Model-based Mirror Ascent for Offline Reinforcement Learning

Mao Hong,Zhiyue Zhang,Yue Wu,Yanxun Xu

OpenReview PDF

提交: 2023-09-23更新: 2024-03-26

TL;DR

We propose a practically implementable model-based mirror ascent algorithm for offline RL with theoretical guarantees.

摘要

关键词

Offline Reinforcement LearningMirror AscentModel-basedPAC Guarantee

评审与讨论

审稿意见

评分: 3置信度: 42023-10-30

This paper:

(1) proposes a model-based mirror descent offline RL algorithm and its corresponding practical version via Lagrangian multiplier and function approximation.

(2) provides theoretical guarantee of the proposed algorithm.

(2) conducts some empirical experiments to validate the performance of the proposed algorithm.

优点

(1) The proofs are rigid.

(2) Apart from theoretical guarantees, there are also numerical results.

缺点

Overall I feel this work is not novel or significant enough:

(1) For Algorithm 1, the idea of constructing a confidence set over the model and use pessimism for policy improvement has been applied in the literature of offline RL, such as [1].

(2) For Algorithm 2, there also have been papers [2] utilizing Lagrangian multipliers to get rid of the constraint of confidence set in the theoretical algorithms and get computationally efficient algorithms. The method of function approximation in Algorithm 2 also has been applied in [3].

(3) The empirical performance of the proposed algorithm is not impressive. In Table 2, RAMBO, IQL and CQL all get better performance (averaged over all tasks) and thus I think the proposed algorithm does not have much significance in the empirical studies either.

In summary, I think this work simply combines the existing techniques and results in the literature and is kind of incremental.

[1] Uehara, Masatoshi, and Wen Sun. "Pessimistic model-based offline reinforcement learning under partial coverage." arXiv preprint arXiv:2107.06226 (2021).

[2] Rigter, Marc, Bruno Lacerda, and Nick Hawes. "Rambo-rl: Robust adversarial model-based offline reinforcement learning." Advances in neural information processing systems 35 (2022): 16082-16097.

[3] Guanghui Lan. Policy optimization over general state and action spaces. arXiv preprint arXiv:2211.16715, 2022.

问题

(1) The paper claims that it does not require Bellman completeness and only needs model realizability. However, the literature have shown that when a model class that contains the true MDP model is given, value-function classes that satisfy a version of Bellman-completeness can be automatically induced from the model class [4]. This implies that model realizability is even stronger than Bellman-completeness.

(2) The paper claims that the size of the function can be arbitrarily large. However, in Theorem 1 (the performance of Algorithm 2), the sample complexity clearly depends on the size of the function class $|\mathcal{F} _ {t,i}|$ . In addition, I believe both the performance of Algorithm 1 and 2 depends on the size of the model class, and model classes can be even larger than function classes.

(3) The paper does not need a parametric policy class. However, in general the computation complexity of the optimization problem Equation 7 will depend on the size of the state space, which can be infinite.

(4) The authors claim that Theorem 1 characterizes the performance of Algorithm 2, but it seems not. Theorem 1 assumes the existence of the confidence set of models and picks the most pessimistic model from such a confidence set while in Algorithm 2 you simply run primal-dual gradient descent-ascent and do not have a confidence set.

[4] Chen, J. and Jiang, N. (2019). Information-theoretic considerations in batch reinforcement learning. In International Conference on Machine Learning, pages 1042–1051. PMLR.

2023-11-23

Thank you for your constructive feedback. We would like to address you concerns point by point in this response.

Regarding novelty and significance

We acknowledge that theoretical techniques involved in conservative policy evaluation, policy improvement, and their combination have been explored in related work. However, we believe that the integration of these techniques in our study offers valuable insights, particularly in the context of model-based offline RL under general function approximation settings without the need for explicit policy parameterization.

The theoretical results of our approach, as detailed in Theorem 1, enables the separation of model learning and policy learning. This separation allows for the use of an arbitrarily large function class for the (augmented) value function and an unlimited policy class. This aspect of our work is distinct from existing studies such as those by Xie et al., Cheng et al., Bhardwaj et al., and Rashidinejad et al. In particular, our method does not require explicit policy parameterization, setting it apart from these prior works.

The empirical performance...

We understand your concern about its comparative performance relative to RAMBO, IQL, and CQL. Nevertheless, it is important to note that the primary focus of our work is on the theoretical aspects of the algorithm. The empirical experiments were designed to support the theoretical findings rather than to showcase state-of-the-art performance. Additionally, the hyperparameters used in our experiments were tuned based on heuristics rather than through rigorous optimization.

Regarding Bellman completeness

Thank you for pointing out the reference and its implications regarding the realizability of the transition model in relation to Bellman completeness. Upon revisiting this reference and considering your observation, we recognize that the condition of model realizability is not a milder condition compared to Bellman completeness. In light of this, we have amended our paper by removing the aforementioned statement from the list of contributions in the revised version.

The paper claims that the size of the function can be arbitrarily large. However,...

We are sorry for the confusion. We clarify that we consider two notions of sample sizes in this paper: the sample size $n$ for offline data and the sample size for the Monte-Carlo evaluation $N$ . While the offline sample size $n$ is fixed, the sample size of Monte-Carlo simulation $N$ can be picked by the user and arbitrarily large. In Theorem 1, the size of function class $|\mathcal{F}_{t,i}|$ only depends on $N$ , and is not affected by $n$ . The ability of $N\to\infty$ allows the size of $|\mathcal{F} _\\{t,i\\}|$ to be arbitrarily large.

The paper does not need a parametric policy class. However, in general the computation complexity of the optimization problem Equation 7 will depend on the size of the state space, which can be infinite.

In algorithm 2, we only need to solve equation (7) for finitely-many states $s$ by using the strategy of Monte-Carlo estimation. In fact, we have illustrated in Appendix A.3 that it takes polynomial time to run Algorithm 2.

The authors claim that Theorem 1 characterizes the performance of Algorithm 2, but it seems not.

Thank you. We clarify that we provide an approximate solution to (1) by implementing the primal-dual algorithm (3)(4). The regularizer in the objective function of (3) forces the transition model to be nearly inside the confidence set. We note that the idea of incorporating a Lagrangian multiplier into the objective function for a constrained optimization problem is common in the optimization literature.

We hope we have addressed your concerns, and we sincerely appreciate the time and effort you have dedicated to reviewing our paper.

审稿意见

评分: 5置信度: 32023-10-30

The paper proposes a model-based offline reinforcement learning algorithm with general function approximation. It aims to fill the gap between the theoretical guarantees and the practical implementation. The algorithm proposes a model-based mirror ascent algorithm with general function approximation under partial coverage of offline data. The paper shows the theoretical guarantees of the algorithm, which is supported by numerical studies.

优点

The paper provides a solid study on the model-based offline reinforcement learning algorithm. The main contribution comes from two side:

Comparing to previous model-based offline RL algorithms, this paper provides an practically implementable algorithm and provides an empirical study on it.
Comparing with model-free offline RL algorithms, the paper studies a case with weaker assumption and shows a faster suboptimality.

The theoretical and the empirical study in the paper is solid. Moreover, the results of the paper is presented in a

缺点

My major concern on the paper comes from the contribution of the paper. Algorithm 1 is similar to Uehara and Sun. While Algorithm 2 is a more interesting part, its idea is still similar to previous work, especially on the conservative estimation and the partial coverage. Moreover, I have several concerns on the implementation of the algorithm:

My first concern is on the PD algorithm (3) and (4) for finding the estimation of the transition model to construct the conservative estimation of the Q-function. In general, the parameterization of the transition kernel is non-convex, and the loss function is estimated empirically. Thus, it could be hard to find the global optimizer, or find a good estimation of the global minimizer.
The above problem also exists in (6) and (7) for estimating $\beta$ .

问题

Is there a theoretical guarantee on the PD algorithm in (3) and (4), especially in the case of general function approximation and empirical estimation of the loss function?
The paper presents the comparison with several other offline RL in Section 7. For table 1, is there also a comparison of MoMA and other SOTA algorithms (like in Table 2) on the synthetic dataset?
For table 2, there seems to be a large difference on the scores of the different algorithms. Is there any insight on why such a large difference occurs?
The definition of $\mathcal{P}$ seems missing.
The error rate in Theorem 1 involves $\epsilon_{est}$ , which, however, depends on $n$ as defined in (8). What is the dependence on $\epsilon_{est}$ in terms of $n$ ?

2023-11-23

We are grateful for your insightful comment and would like to address your concerns in this response.

Regarding the PD algorithm (3) and (4)

Thank you. The PD algorithm (3) and (4) is an implementable version of equation (1) - the constrained minimization problem. Regarding your concern on the non-convexity of (1), we would like to discuss on the objective function $V_P^{\pi}$ and the empirical loss function $\mathcal{E}_n(P)$ with respect to $P$ in the following.

Regarding $V_P^{\pi}$ . While $V_P^{\pi}$ is in general not a convex function in $P$ , we would like to present here a promising way of handling the non-convex optimization problem to address your concern. We view it as a future direction, which is out of the scope of our current work.

Motivated by that DL and RL both involve optimizing non-convex objectives, non-convex optimization is a rising field and many recent works demonstrate the global convergence of gradient descent (GD): [Zou et al., 2020] and [Ding et al., 2022] show that GD methods converge to globally optimal solutions in over-parameterized neural networks. [Agarwal et al., 2021] and [Bhandari and Russo, 2019] demonstrate the global optimality of policy gradient methods. In our case, the proposed GD method for problem (1) can be considered a counterpart of the policy gradient method, i.e., "model gradient method." Specifically, we can view the fixed policy $\pi_t$ as the transition model, and minimizing $V_P^{\pi_t}$ w.r.t. $P$ in Eq. (1) may be viewed as finding the best "policy" that minimizes the overall cost. Techniques from [Agarwal et al., 2021] and [Bhandari and Russo, 2019] can then be naturally employed to prove that the model gradient method for Eq. (1) can find the global minimizer. In particular, their key results arise from the form of policy gradient and performance difference lemma, providing a gradient dominance condition typically needed for the global optimality of first-order methods for non-convex objectives. Similarly, by considering the form of model gradient (w.r.t. $P$ ) instead of policy gradient (w.r.t. $\pi$ ), and considering the simulation lemma (difference in values between two $P^{\prime}$ s) instead of the performance difference lemma (difference in values between two $\pi^{\prime}$ ), an analogous gradient dominance condition for $P$ could be naturally derived. As a result, the global minimizer for Eq. (1) can be found efficiently using primal-dual gradient descent Eq.(3)(4).

Regarding $\mathcal E(P)$ . Though the empirical loss function $\mathcal E(P)$ is non-convex in general, it is indeed convex for a large class of distributions. For example, when $P$ is in an exponential family, then the negative likelihood is convex (Corollary 1.6.2. of [5]). Given that the empirical loss function is convex, the level set of the empirical loss function (i.e. the constraint set) is also convex.

References:

[1] Zou, Difan, et al. "Gradient descent optimizes over-parameterized deep ReLU networks." Machine learning 109 (2020): 467-492.

[2] Ding, Zhiyan, et al. "Overparameterization of deep ResNet: zero loss and mean-field analysis." Journal of machine learning research (2022).

[3] Agarwal, Alekh, et al. "On the theory of policy gradient methods: Optimality, approximation, and distribution shift." The Journal of Machine Learning Research 22.1 (2021): 4431-4506.

[4] Bhandari, Jalaj, and Daniel Russo. "Global optimality guarantees for policy gradient methods." arXiv preprint arXiv:1906.01786 (2019).

[5] Bickel, Peter J., and Kjell A. Doksum. Mathematical statistics: basic ideas and selected topics, volumes I-II package. CRC Press, 2015.

2023-11-23

The above problem also exists in (6) and (7) for estimating $\beta$ .

Thank you. We would like to discuss on (6) regarding the approximation of $\tilde{Q}_{\omega, t}$ and (7) regarding the policy update rule in the following.

Regarding (6). While the theoretical result of this (high-dimensional) optimization problem for the empirical loss function is not the primary focus of our work, we believe it is not a major issue. In linear approximation settings where the number of components (or the dimension of the parameter space) can tend to infinity, a closed form of Eq. (6) can be obtained using Ridge regression or the Expectation-Maximization (EM) algorithm for LASSO. When considering reproducing kernel Hilbert space (RKHS) for kernel ridge regression as a general function approximation, a closed form of (6) can also be achieved using the kernel matrix. As for non-convex problems, such as fitting neural networks with training data, empirical results have demonstrated their success. Moreover, some theoretical results exist under specific conditions. For instance, [Ding et al., 2022] provide a theoretical analysis explaining why a basic first-order optimization method (gradient descent) can find a global optimizer for a deep neural network (NN) that fits training data, despite it being a non-convex problem.

Regarding (7). We clarify that problem (7) does not have the concern you mentioned here. In particular, the objective function in (7) is a concave function of $p$ because it is a summation of a linear function of $p$ and a concave function of $p$ (which is $-\frac{1}{\eta_t} \omega(p)$ ). The constraint set $\Delta(\mathcal{A})$ , which is a simplex w.r.t. $p$ , is also a convex set. Therefore problem (7) can be solved by standard first-order optimization methods. Under some specific form of $\omega$ , problem (7) also have closed-form solutions.

Is there a theoretical guarantee on the PD algorithm in (3) and (4)?

Thank you. We agree that it is an interesting and challenging problem, and we leave it as a future direction.

The paper presents the comparison with several other offline RL in Section 7. For table 1, is there also a comparison of MoMA and other SOTA algorithms (like in Table 2) on the synthetic dataset?

Thank you. We did not have a comparison of MoMA and other SOTA algorithms on the synthetic dataset. The synthetic data experiments aim to demonstrate that MoMA has the ability to address the partial coverage issue. Consequently, we compare MoMA with the standard RL algorithms without the incorporation of pessimism, in the synthetic dataset.

For table 2, there seems to be a large difference on the scores of the different algorithms. Is there any insight on why such a large difference occurs?

Thank you. First, it depends on the specific task we use. For example, model-based methods are preferred when the environment is simpler to learn. Second, hyper-parameter tunning in different models affects their performance. As we don't have a unified selection rule for the hyperparameters of each method, their performance is quite different.

The definition of $\mathcal P$ seems missing.

Thank you. We have defined it in section 3 of the revised manuscript.

The error rate in Theorem 1

$\epsilon_{\text {est, }}$ usually scales with $\frac{1}{\sqrt{n}}$ (e.g. MLE). See Assumption 1(d), proposition 1, corollary 1 for more details.

We hope we have addressed your concerns. Thank you again for your valuable feedback.

References:

[1] Ding, Zhiyan, et al. "Overparameterization of deep ResNet: zero loss and mean-field analysis." Journal of machine learning research (2022).

审稿意见

评分: 5置信度: 42023-10-31

This paper proposes an offline RL algorithm, MOMA, that offers practical implementability alongside theoretical guarantees. MoMA uses a model-based mirror ascent approach with general function approximations, operating under partial coverage of offline data. The algorithm iteratively and conservatively estimates value functions and updates policies, moving beyond the conventional use of parametric policy classes. With mild assumptions, MoMA’s effectiveness is theoretically assured by establishing an upper bound on the policy’s suboptimality, and practical utility is confirmed through numerical studies.

优点

This paper proposes a new offline RL algorithm with pessimism.

缺点

The contribution of this work seems exaggerated. In particular, the authors claim that the algorithm enjoys theoretical guarantees and practical implementation. But in fact, the algorithm with the theoretical result is different from the practically implemented version. In particular, the theoretically analyzed algorithm involves pessimistic model learning, while the practical version uses a regularized form. With significantly different algorithms, the claim does not hold.
In terms of the theoretically sound version (Algorithm 1), the theoretical novelty is limited. The analysis is based on pessimistic model learning combined with policy mirror descent. The combination seems quite direct given existing works. In particular, in the proof of Theorem 2, the regret decomposition in (15)--(17) is standard analysis of a pessimism-based algorithm. The first term of (18) is standard analysis of policy mirror descent, and the second term of (18) is an application of simulation lemma. Similar analysis has also appeared in the RL literature, e.g., Provable Benefits of Actor-Critic Methods for Offline Reinforcement Learning.
The practically implemented version of algorithm is also similar to some existing algorithms. For example, "Bayesian Optimistic Optimization: Optimistic Exploration for Model-based Reinforcement Learning" also studies a regularized version of policy optimization algorithm, but for online RL. In addition, MOPO and MoRel has practical implementations of pessimism-based algorithms.
The implemented version of algorithm is only tested on three D4RL tasks. It would be great to have more extensive experiments.

问题

How to implement the policy update when there is a normalization factor?
How do you update the multiplier $\lambda$ in the experiments?
Is it possible to unify these two different versions of algorithms?

2023-11-23

Thank you for your constructive feedback. We appreciate the opportunity to address each of your concerns.

The contribution of this work...

Thank you. We would like to clarify this aspect of our work.

In our study, we indeed present a theoretical framework for pessimistic model learning as outlined in (1), and alongside, we offer a practical implementation through the primal-dual algorithm as detailed in (3) and (4). We understand your concern about the potential differences between these two approaches. However, we would like to emphasize that the transition from the theoretical model to its practical implementation is grounded in a well-established technique in the optimization literature. This technique involves the incorporation of a Lagrangian multiplier into the objective function to address constrained optimization problems.

This approach ensures that while the practical implementation may appear as an approximate solution to the theoretical model, it fundamentally retains the core principles and objectives of the initial theoretical design. Therefore, we assert that the theoretically analyzed algorithm and its practical counterpart are aligned in their essence and are not significantly different as might be perceived.

In terms of the theoretically sound version...

Thank you for your comments regarding the theoretical aspects of our work. We acknowledge that theoretical techniques involved in conservative policy evaluation, policy improvement, and their combination have been explored in related work. However, we believe that the integration of these techniques in our study offers valuable insights, particularly in the context of model-based offline RL under general function approximation settings without the need for explicit policy parameterization.

The theoretical results of our approach, as detailed in Theorem 1, enable the separation of model learning and policy learning. This separation allows for the use of an arbitrarily large function class for the (augmented) value function and an unlimited policy class. This aspect of our work is distinct from existing studies. In particular, our method does not require explicit policy parameterization, setting it apart from these prior works.

Addressing your reference to "Provable Benefits of Actor-Critic Methods for Offline Reinforcement Learning," it's noteworthy that this work primarily focuses on linear models. In contrast, our research extends to more complex and general function approximation settings.

The practically implemented version...

Thank you for the references. We note that the policy classes in the works you mentioned are parameterized. While effective in many scenarios, this approach can be potentially restrictive, particularly when the optimal policy lies outside the pre-specified parametric policy class. Our work, on the other hand, does not need an explicit policy parameterization. By employing mirror ascent in conjunction with our model-based design, we enable our policy class to be sufficiently large, potentially containing the true optimal policy. This flexibility is a fundamental advantage of our approach, addressing limitations inherent in parameterized policy classes.

The implemented version of algorithm is only...

Thank you for the suggestions. Since this work mainly focuses on the theoretical part, we believe the numerical experiments in the D4RL tasks are enough to demonstrate the efficacy of MoMA.

How to implement the policy update when there is a normalization factor?

Thank you. When evaluating a policy, we need to calculate the normalization factor by summing the policy probabilities over all actions. Then we use the evaluated policy to generate Monte-Carlo trajectories.

How do you update the multiplier $\lambda$ ?

The update rule for $\lambda$ is described in equation (4). See section 5.3 for an example. In particular, we evaluate the gradient of the empirical loss function, and use it as an ascent direction for $\lambda$ .

Is it possible to unify these two different versions of algorithms?

Thank you for the idea. It is an interesting and challenging problem, and we leave it as a future study.

We hope we have addressed your concerns. Thank you again for your time and valuable comments.

审稿意见

评分: 5置信度: 32023-11-02

The paper proposes a conservative and practical model-based offline RL algorithm that alternates between pessimistic value estimation and policy update via mirror ascent. Theoretical guarantees are provided for the algorithm under partial data coverage and some experimental evaluations are provided.

优点

The paper studies an important problem, which is practical and theoretically-founded model-based offline RL with general function approximation.
The proposed algorithm is new, implementable, and enjoys theoretical guarantees.

缺点

One of the main contributions is stated as “In contrast to model-free RL literature which heavily relies on the assumption of Bellman completeness, we propose novel theoretical techniques for offline model-based policy optimization algorithms that are free of the assumption” In general, model-based methods (unless additional variables are introduced) do not require Bellman completeness assumption because assuming a realizable model is stronger; see e.g. [1], which shows that model realizability subsumes completeness assumptions.
It is stated that “To our knowledge, this work is the first study that provides a practically implementable model-based offline RL algorithm with theoretical guarantees under general function approximations.” There are prior works that offer implementable model-based offline RL algorithms with optimal statistical rates such the model-based version of the algorithm in [2]. The mentioned algorithm also does not require the difficult step of minimizing within a confidence set of transition models. Another example algorithm is ARMOR [4].
Although the work is focused on general function approximation, it requires the concentrability definition with bounded policy visitation and data distribution ratio for every state and action. This is a stronger assumption compared to the Bellman-consistent variant such as in [3].
Theory does not seem to be particularly challenging and/or offer new insights or techniques.
The experimental section is weak. In particular, the synthetic data experiments only compare MoMA with standard natural policy gradient and model-free FQI, none of which include any form of conservatism/pessimism, and the results are expected. It would be good to see comparison with other pessimistic model-based methods. Additionally, comparison with the work of Uehara & Sun 2021 when combined with the Lagrangian approach of this work would be useful. For the D4RL benchmark, comparison is only provided for a small subset of datasets and only baseline model-free offline RL methods. No comparison is provided with ARMOR and ATAC.

References:

[1] Chen, Jinglin, and Nan Jiang. "Information-theoretic considerations in batch reinforcement learning." In International Conference on Machine Learning, pp. 1042-1051. PMLR, 2019.

[2] Rashidinejad, Paria, Hanlin Zhu, Kunhe Yang, Stuart Russell, and Jiantao Jiao. "Optimal Conservative Offline RL with General Function Approximation via Augmented Lagrangian." In The Eleventh International Conference on Learning Representations. 2022.

[3] Cheng, Ching-An, Tengyang Xie, Nan Jiang, and Alekh Agarwal. "Adversarially trained actor critic for offline reinforcement learning." In International Conference on Machine Learning, pp. 3852-3878. PMLR, 2022.

[4] Bhardwaj, Mohak, Tengyang Xie, Byron Boots, Nan Jiang, and Ching-An Cheng. "Adversarial model for offline reinforcement learning." arXiv preprint arXiv:2302.11048 (2023).

问题

Comparison with the references above, both in terms of technique and empirical performance.
Clarifying challenges and technical contributions.

2023-11-23

We would like to express our gratitude for your time and effort in reviewing our paper. Your constructive feedback is greatly appreciated. In this response, we address your concerns point by point.

One of the main contributions is stated as...

Thank you for pointing out the reference [1] and its implications regarding the realizability of the transition model in relation to Bellman completeness. Upon revisiting this reference and considering your observation, we recognize that the condition of model realizability is not a milder condition compared to Bellman completeness. In light of this, we have amended our paper by removing the aforementioned statement from the list of contributions in the revised version. We appreciate your insight, which has helped clarify this aspect of our research and improve the accuracy of our claims.

It is stated that “To our knowledge...

Thank you for pointing out these two references [2] and [4]. We acknowledge that our study is not the first one to offer a practically implementable model-based offline RL algorithm with theoretical guarantees under general function approximations. Accordingly, we have revised our manuscript to remove this claim. Additionally, we have included comparisons with the algorithms presented in [2] and [4] to provide a more comprehensive overview.

However, we note that there is a specific distinction between our approach and the ones in [2] and [4]. Our methodology does not necessitate the assumption of a parametric policy class in advance. This advantage stems from our utilization of mirror ascent, which we believe provides an alternative perspective to the field of model-based offline RL.

Although the work is focused on general function approximation, it requires the concentrability definition...

Thank you for the comment. We recognize that our current assumption may indeed be more stringent compared to the Bellman-consistent variant outlined in [3]. In light of this, we plan to revisit and revise our assumptions and proofs. This revision will explore the possibility of modifying our concentrability coefficient to align more closely with the Bellman-consistent variant mentioned in [3]. We aim to include these changes and any resultant findings in our next revision.

Theory does not seem to be particularly challenging and/or offer new insights or techniques.

Regarding numerical results

Thank you. The synthetic data experiments mainly aim to demonstrate the theoretical results and the efficacy of the proposed algorithm. Therefore, we compared the proposed algorithm to the ones that do not incorporate pessimism, and showed that the proposed algorithm can indeed address the distribution shift issue. In addition, it can be seen from table 2 that we have already compared MoMA to both model-based and model-free offline RL methods for the D4RL benchmark.

Thank you again for your time and valuable suggestions.

References:

[1] Chen, Jinglin, and Nan Jiang. "Information-theoretic considerations in batch reinforcement learning." In International Conference on Machine Learning, pp. 1042-1051. PMLR, 2019.

[4] Bhardwaj, Mohak, Tengyang Xie, Byron Boots, Nan Jiang, and Ching-An Cheng. "Adversarial model for offline reinforcement learning." arXiv preprint arXiv:2302.11048 (2023).