Distributional off-policy evaluation with Bellman residual minimization
Abstract
Reviews and Discussion
This paper introduces novel target objectives for distributional RL along with a comprehensive error bound analysis.
Strengths
-
The underlying theory seems solid and well-founded; for example, the derived convergence rates in Theorems 2 and 3 appear reasonable.
-
Theorem 3 also handles non-realizable settings, which have not been covered in previous works.
Weaknesses
- Theorem 3 doesn't degenerate to Theorem 2 under similar settings, which raises the possibility that a better proof/result exists.
Questions
None
We appreciate your positive feedback and the constructive comment.
(Weakness 1) Non-degeneracy of Theorem 3 (of the previous manuscript) towards Theorem 2.
We agree with your comment that the non-degeneracy of Theorem 3 (of the previous manuscript) to Theorem 2 under realizability requires additional study. We would like to provide additional insights/explanations and summarize the reasons why we were not able to achieve this in our current proof.
The statement of Theorem 3 has changed in the new manuscript. Instead of comparing Theorems 2 and 3, it is now fair to compare Theorem 2 with the finite-sample error bound suggested in B.4.3 (or its simplified version in Equation (186)) with a fixed level of .
To begin with, in non-realizable cases, the converging target is no longer the true distribution. Instead, we compromise our goal to the "best approximation" , with defined in Equation (21). Consequently, we can no longer make use of Theorem 1, which only required us to derive the convergence rate of the estimated Bellman residual (our objective function). Therefore, unlike the realizable scenario, one additional step is required: obtaining convergence of the minimizing parameter. This leads us to take into account the function in B.4.3, which further slows down the rate by .
In addition, the sampling involves two procedures: collecting the data () and resampling multi-step trajectories, as defined before Equation (22). Since we re-use the samples multiple times, the independence of the data we use is no longer guaranteed. We therefore cannot use common tools for showing that the sample mean converges to the population mean (e.g., Hoeffding's inequality), since they require independent samples. Instead, we resort to another device (Lemma 3 of Appendix B.2.1) that resembles the triangle inequality but slows down the rate by .
For the aforementioned reasons, the rate in the non-realizable scenario is slower than in the realizable scenario by , that is, versus .
We have updated our manuscript. All updated parts are highlighted in blue. Specifically, here are the updates regarding your comments:
- p.60: added a new subsection C.3.1 that explains the rate mismatch between Theorem 2 (realizable) and B.4.3 (non-realizable) under the realizable setting with . We also refer to this part on p.8 (below Theorem 8).
This work proposes a new method of distributional OPE, EBRM, which is derived from a discrepancy metric of distribution-valued functions based on the energy distance.
The discrepancy metric has two nice properties. First, it allows us to bound the distributional OPE error in terms of the distributional Bellman residual. Second, it is straightforwardly estimated from the offline sample when the state-action space is finite (i.e., in the tabular setting).
In the tabular setting, the authors also constructed algorithms to estimate the minimizer of the proposed discrepancy metric for realizable and unrealizable cases, respectively.
Finally, some experimental results are provided.
Strengths
- Addresses an interesting problem in OPE (not just the distributional one): the discrepancy between the estimation error and the estimation objective, where the latter is typically defined via the Bellman residual.
Weaknesses
- The key inequality (7) involves a large constant , which makes it uninteresting. If we are allowed to use a constant of , then the max-extended discrepancy is trivially bounded by the expectation-extended discrepancy .
- Limited experiments. Specifically, the environment is somewhat small and artificial.
- Some of the notation is broken. For example,
- What does mean in a continuous state space? If it is the density of with respect to some base measure, what is the base measure?
- The definition of the kernel (33) does not seem to give .
Questions
- What is the limitation of the proposed method?
- In the experiment, why don't you thoroughly feature and discuss the results of more standard and fair metrics (like the Wasserstein-based one in the appendix)?
- Are the weakness points I raised correct?
(Question 1) Limitations of EBRM
There remain a few limitations in our method. A major limitation is that our estimator is developed under the tabular setting. Given our discussion of Theorem 1 above, it is interesting to look into more general state-action spaces.
Besides, we focus on a parametric family of conditional distributions, i.e., is a subset of a finite-dimensional space. Extension to nonparametric modeling of the conditional distributions is also interesting, and would potentially lessen the chance of non-realizability thanks to more flexible modeling.
Another limitation is that we do not provide a complementary method for control problems (finding the optimal policy), as some other distributional RL works do (addressing evaluation and control problems in the same paper).
(Question 2) Not using a more standard metric
The energy distance is a well-known metric that is widely used and thoroughly studied [8]. In addition, it has good properties: it is generally easier to compute than the Wasserstein metric, and it can be regarded as a multi-dimensional extension of the Cramér distance [3]. It is also a special case of the maximum mean discrepancy [4] associated with a specific kernel . Given that [5] and [6] directly used the maximum mean discrepancy, and [7] interpolated between the Wasserstein metric and the maximum mean discrepancy, the energy distance seems to be a reasonable choice of metric within the distributional reinforcement learning literature.
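For concreteness, the energy distance between two samples can be estimated with the standard plug-in formula of [8]; the following is a minimal illustrative sketch (the function name and toy data are ours), not the estimator defined in Equation (10) of our paper.

```python
import numpy as np

def energy_distance(x, y):
    """Plug-in estimate of the squared energy distance between two samples:
    2*E||X - Y|| - E||X - X'|| - E||Y - Y'||  (Szekely & Rizzo, 2013)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    dxy = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1).mean()
    dxx = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1).mean()
    dyy = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=-1).mean()
    return 2.0 * dxy - dxx - dyy

# Toy check: same distribution -> value near 0; shifted distribution -> clearly positive.
rng = np.random.default_rng(0)
print(energy_distance(rng.normal(size=1000), rng.normal(size=1000)))
print(energy_distance(rng.normal(size=1000), rng.normal(loc=2.0, size=1000)))
```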
As you also pointed out, we also calculated the inaccuracy levels with other metrics, including the expectation-extended Wasserstein-1 metric, in Tables 9 and 12 of Appendix D.4. In view of your comment, we will revise the paper to provide the corresponding discussion. In addition, we will add a new metric used in [2] (FLE, a method we compared against) in Corollary 4.14 of their paper. (This was also suggested by Reviewer 3TV7.) That is, treating the conditional return distributions as mixture components {} with mixture weights {}, we formed the mixture distribution that can be regarded as the marginal return distribution . We then measured the Wasserstein-1 distance between the estimated and true (marginal) distributions, as shown in the following tables (a minimal computational sketch of this marginal comparison is given after the tables). Similar to Tables 8--9 and 11--12 of Appendix D.4, EBRM outperformed its competitors in all cases except for the non-realizable scenario with , where the performances of EBRM and FLE are very close.
- Realizable:
| Sample size | ||||||
|---|---|---|---|---|---|---|
| EBRM | 2.052 | 1.406 | 2.052 | 0.843 | 0.629 | 0.508 |
| (1.524) | (1.005) | (0.778) | (0.556) | (0.350) | (0.255) | |
| FLE | 22.835 | 15.694 | 12.328 | 8.013 | 5.596 | 4.591 |
| (12.810) | (9.272) | (7.859) | (5.113) | (3.786) | (2.945) | |
| QRTD | 97.856 | 56.694 | 47.738 | 45.764 | 49.851 | 49.877 |
| (13.235) | (14.905) | (25.251) | (25.656) | (27.463) | (25.945) |
- Realizable:
| Sample size | ||||||
|---|---|---|---|---|---|---|
| EBRM | 18.021 | 11.613 | 7.528 | 5.441 | 3.607 | 3.062 |
| (12.227) | (8.077) | (5.306) | (4.184) | (2.487) | (1.981) | |
| FLE | 94.556 | 71.430 | 46.740 | 44.915 | 31.774 | 23.378 |
| (62.630) | (51.205) | (36.986) | (31.905) | (23.190) | (18.076) | |
| QRTD | 247.308 | 198.257 | 191.908 | 195.787 | 194.908 | 202.076 |
| (16.843) | (29.911) | (48.057) | (73.533) | (91.526) | (86.031) |
- Non-realizable:
| Sample size | ||||
|---|---|---|---|---|
| EBRM | 14.098 | 14.088 | 14.087 | 13.218 |
| (0.227) | (0.167) | (0.142) | (0.093) | |
| FLE | 13.954 | 13.977 | 13.968 | 13.970 |
| (0.359) | (0.292) | (0.234) | (0.184) |
- Non-realizable:
| Sample size | ||||
|---|---|---|---|---|
| EBRM | 106.850 | 84.839 | 73.338 | 66.560 |
| (39.076) | (29.761) | (17.974) | (5.392) | |
| FLE | 258.141 | 279.687 | 349.984 | 432.014 |
| (19.073) | (21.771) | (20.467) | (21.903) |
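As a reference for how the marginal comparison above can be computed, here is a minimal sketch; the helper name `marginal_w1` and its inputs are hypothetical, and this is not the script used to produce the tables.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def marginal_w1(samples_est, samples_true, mu):
    """W1 distance between two marginal (mixture) return distributions.
    samples_est / samples_true: dict mapping (s, a) -> 1-D array of return
    draws from the estimated / true conditional distribution at (s, a).
    mu: dict mapping (s, a) -> mixture weight (reference distribution)."""
    u_vals, u_wts, v_vals, v_wts = [], [], [], []
    for sa, w in mu.items():
        u = np.asarray(samples_est[sa], dtype=float)
        v = np.asarray(samples_true[sa], dtype=float)
        u_vals.append(u)
        u_wts.append(np.full(u.size, w / u.size))  # spread weight w over the draws
        v_vals.append(v)
        v_wts.append(np.full(v.size, w / v.size))
    return wasserstein_distance(
        np.concatenate(u_vals), np.concatenate(v_vals),
        np.concatenate(u_wts), np.concatenate(v_wts),
    )
```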
Thank you for your helpful comments.
(Weakness 1) The inequality (7) is not interesting.
In tabular settings, the two distances and are equivalent: for any . Due to (5) of our paper, this equivalence would lead to . So the reviewer's insight holds for tabular settings. However, Theorem 1 can be applied to a general state-action space, including a continuous one, and in such general settings the two distances are not necessarily equivalent. We also have a related discussion at the end of Section 2.2. Suppose the state-action space is a subset of and the data-generating probability is absolutely continuous with respect to the Lebesgue measure, thereby yielding the density . In general, we cannot bound by up to any multiplicative constant, so Theorem 1 would not be implied by the corresponding results for the supremum-extended distance. Overall, Theorem 1 is indeed a non-trivial result.
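To make the comparison above explicit, here are the two extensions written schematically (the symbols d_mu, d_inf, mu_min below are ours and may differ from the paper's notation): for a base distance d between return distributions and a data-generating distribution mu over state-action pairs,

```latex
\[
d_{\mu}(q_1,q_2)=\mathbb{E}_{(S,A)\sim\mu}\!\left[d\bigl(q_1(\cdot\mid S,A),\,q_2(\cdot\mid S,A)\bigr)\right],
\qquad
d_{\infty}(q_1,q_2)=\sup_{(s,a)}\,d\bigl(q_1(\cdot\mid s,a),\,q_2(\cdot\mid s,a)\bigr),
\]
\[
\text{tabular case with } \mu(s,a)\ge\mu_{\min}>0:\qquad
d_{\mu}(q_1,q_2)\;\le\; d_{\infty}(q_1,q_2)\;\le\;\frac{1}{\mu_{\min}}\,d_{\mu}(q_1,q_2).
\]
```

No reverse inequality of this form is available when mu has a continuous density, which is why Theorem 1 is not implied by supremum-extended results in general.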
As we discussed in Section 1, there is a theory-practice gap regarding the distances and . None of the methods mentioned in the first paragraph of Section 1 of our paper is free from this "distance mismatch" (Tables 1 and 3). They utilize the expectation-extended distance in the objective functions of their algorithms; however, the theoretical properties that motivate their constructions are based on the supremum-extended distance . In its full generality, Theorem 1 provides a foundation for the expectation-extended distance in settings with general state-action spaces. Finally, we remark that our estimator (Sections 3 and 4) is based on tabular settings (and so does not fully exploit the power of Theorem 1). This is mainly because we aim to provide a more comprehensive study that involves nonstandard setups (e.g., non-realizability, multi-step extensions). Even in tabular settings, the corresponding analyses are complicated, and related theoretical results for these setups in distributional reinforcement learning are rare. To our knowledge, our work is the first to provide theoretical guarantees for non-realizable settings in distributional reinforcement learning. Extension to non-tabular settings with full exploitation of Theorem 1 is left as future work.
(Weakness 2) Limited experiments.
We admit that extensive numerical comparison is not our major focus, given the more theoretical flavor of this work. In our numerical experiments, our main goal is to demonstrate that the expectation-extended inaccuracy level converges to zero and to provide reasonable comparisons with close competitors such as FLE [2]. The environment was designed based on Figure 1 of the following paper, which devised another distributional reinforcement learning method called RDPS [1]. We also looked into the combination-lock environment used by [2]; however, it was designed for finite-horizon problems and so is not appropriate here. We will add more experiments; specifically, we are considering a 2D example.
(Weakness 3) Broken notation
About : We are sorry for the confusion. If we assume a continuous state-action space, then the base measure is the Lebesgue measure, and (in Assumption 1) becomes the density of a continuous random vector. In the manuscript, the notation is used to indicate both the probability measure and its density (a slight notational abuse). We have now made this clear below Equation (6).
About the kernel : We have corrected the issue behind Equation (33) of Appendix A.3 in the original manuscript. The whole point of A.3 was to show that the energy distance satisfies the properties of A.2.1. Instead of using the properties of the associated kernel, we directly showed that the energy distance satisfies Property 1 of Appendix A.2.1. Thank you for pointing this out.
(Question 3) Are the weakness points raised correct?
First, we appreciate that you raised these points, which have helped us improve our paper.
(Weakness 1): We believe this is due to a misunderstanding---our apologies for the confusing notation mentioned in (Weakness 3), which might have hindered your reading. We hope that our response has addressed your concern.
(Weakness 2): We admit that extensive numerical comparison is not the major focus of this work, but we will add more experiments.
(Weakness 3): Thank you very much for pointing out these notational issues, which we have fixed.
References
[1] Morimura, Tetsuro, et al. "Nonparametric return distribution approximation for reinforcement learning." Proceedings of the 27th International Conference on Machine Learning (ICML-10). 2010.
[2] Runzhe Wu, Masatoshi Uehara, Wen Sun: Distributional Offline Policy Evaluation with Predictive Error Guarantees. ICML 2023: 37685-37712
[3] Marc G. Bellemare, Ivo Danihelka, Will Dabney, Shakir Mohamed, Balaji Lakshminarayanan, Stephan Hoyer, and Rémi Munos. The Cramér distance as a solution to biased Wasserstein gradients. arXiv preprint arXiv:1705.10743, 2017.
[4] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.
[5] Thanh Nguyen-Tang, Sunil Gupta, and Svetha Venkatesh. Distributional reinforcement learning via moment matching. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 9144–9152, 2021.
[6] Pushi Zhang, Xiaoyu Chen, Li Zhao, Wei Xiong, Tao Qin, and Tie-Yan Liu. Distributional reinforcement learning for multi-dimensional reward functions. Advances in Neural Information Processing Systems, 34:1519–1529, 2021
[7] Ke Sun, Yingnan Zhao, Yi Liu, Wulong Liu, Bei Jiang, and Linglong Kong. Distributional reinforcement learning via sinkhorn iterations. arXiv preprint arXiv:2202.00769, 2022.
[8] Gábor J. Székely and Maria L. Rizzo. Energy statistics: A class of statistics based on distances. Journal of Statistical Planning and Inference, 143(8):1249–1272, 2013.
We have updated our manuscript. All updated parts are highlighted in blue. Specifically, here are the updates regarding your comments:
- p.3: clarified the meaning of notation , and mentioned that it can denote the density in both tabular and continuous state-action spaces.
- p.9: added simulation inaccuracies measured with expectation-extended Wasserstein-1 metric in Table 2.
- p.14: revised the mistake in A.3 regarding the kernel.
- p.71: added Table 10 of simulation inaccuracies measured with Wasserstein-1 metric for marginalized distributions for realizable settings.
- p.72: added Table 13 of simulation inaccuracies measured with Wasserstein-1 metric for marginalized distributions for non-realizable settings.
As for the additional experiments, we are working on them.
We kindly ask you to consider raising your scores if the concerns were appropriately addressed.
This paper considers a fundamental problem of distributional reinforcement learning, termed distributional off-policy evaluation. The authors propose a new algorithm based on Bellman residual minimization and characterize finite-sample statistical error bounds under realizable and unrealizable settings. Empirical results on synthetic environments also demonstrate the effectiveness of the proposed method.
Strengths
- A comprehensive analysis of distributional off-policy evaluation.
- Technically sound in general.
Weaknesses
- I'm not fully convinced that the proposed distance measure has much meaning in practice. To me it looks more like a distance measure induced by theoretical needs than one motivated by practical considerations. An example of this unnatural feeling: for OPE on the expected return, we care more about instead of , and the latter also suffers from a slow convergence rate.
- For the unrealizable-setting part, I feel there are also many designs that exist purely for the proof without much insight, e.g., the function in (18).
- I believe there will not be a significant rate difference between the realizable and unrealizable settings, and the difference may also come from the algorithm, aside from the reasons mentioned by the authors (e.g., no need for additional operations in the realizable setting). Such reasons should be made clearer in the paper.
Questions
- Can the authors revise the paper to highlight some intuitions behind the algorithm design?
- Can the authors dive into the rate mismatch between the realizable setting and the unrealizable setting?
I haven't had time to go through the proof line by line (as there are about 50 pages to read), but most of the proof steps align with standard techniques and I believe there will not be severe issues. I will raise questions if some steps of the proof are unclear or potentially wrong.
Thank you for your constructive comments.
(Weakness 1) Not fully convinced if the chosen distributional distance is practically meaningful.
The analogs of the -function and the value (the expectation of over ) in distributional RL are the conditional return distributions {} and the marginal return distribution , i.e., the mixture distribution with mixture components {} and mixture weights {}. For simplicity, we choose the reference distribution as for our discussion. Based on our understanding, the reviewer's concern is why we are interested in the convergence/accuracy of the conditional return distributions rather than the marginal return distribution.
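In schematic notation (the symbols mu, eta, Q below are ours and may differ from the paper's), the analogy is:

```latex
\[
v(\pi)=\mathbb{E}_{(S,A)\sim\mu}\bigl[Q^{\pi}(S,A)\bigr]
\quad\longleftrightarrow\quad
\bar{\eta}^{\pi}(\cdot)=\sum_{s,a}\mu(s,a)\,\eta^{\pi}(\cdot\mid s,a).
\]
```

That is, the scalar OPE target averages the Q-function over a reference distribution, while its distributional analog mixes the conditional return distributions with the same weights (with the sum replaced by an integral in the continuous case); our theorems concern the components, from which the mixture is immediately obtained.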
One of the major reasons is that, like the -function, the conditional return distributions can be useful for handling a control problem (finding the optimal policy), as shown in the following works ([1] and ). In these papers, the authors estimated the conditional return distributions and then developed them into a risk-sensitive policy learning strategy that can handle the control problem. This goes one step beyond Q-learning in conventional reinforcement learning, in that we can take into account risk that is not represented by the expected value.
Second, given the conditional return distributions {}, we can easily estimate the marginal return distribution . This is theoretically supported by our theorems (Theorems 2 and 3), since Lemma 3 of Nguyen-Tang et al. [3] implies and . In the distributional reinforcement learning literature, we are not aware of any results indicating that directly proving convergence of the marginal distribution leads to a faster convergence rate. Indeed, the reviewer's comment suggests an interesting research direction.
(Weakness 2) Non-realizability section does not provide much intuition.
The insights for both the realizable (Section 3) and non-realizable (Section 4) scenarios are elaborated in detail in the response to (Question 1) below. Here, we only discuss the role of , along with the intuitive meaning of Equation (18) of the original manuscript. From its definition , this term represents the disparity from the true distribution. However, it can be simplified by replacing it with a constant , as in the updated manuscript. This constant quantifies how challenging the given problem is, combining the size of the model and the extent of non-realizability, and it becomes zero in the realizable case. The implication of Equation (18) is that we can shrink this obstacle (which represents non-realizability) toward zero by taking a multi-step approach with a large enough step level . This eventually allows our estimator to be close to the "best approximation" . Please refer to the response to (Question 1) for more details.
(Weakness 3) Not convinced by the rate difference.
Please see our response to (Question 2) below.
(Question 1) Highlight the intuition behind algorithm.
First, under the realizable scenario (Section 3), the construction of our estimator is directly based on Theorem 1, where we show that the inaccuracy is bounded by the Bellman residual up to a multiplicative constant. This justifies our practice of minimizing the estimated Bellman residual .
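Schematically (with generic notation that abstracts away the exact constants, distances, and operator symbols used in the paper), the logic in the realizable case is:

```latex
\[
d_{\mu}\bigl(q_{\theta},\,q^{\pi}\bigr)\;\le\; C\cdot
\underbrace{d_{\mu}\bigl(q_{\theta},\,\mathcal{T}^{\pi}q_{\theta}\bigr)}_{\text{Bellman residual}}
\quad\text{(Theorem-1-type bound)},
\qquad
\hat{\theta}\in\arg\min_{\theta\in\Theta}\;\widehat{\mathrm{BR}}(\theta),
\]
```

so controlling the estimated residual at the minimizer directly controls the inaccuracy of the fitted conditional return distributions.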
Second, for the non-realizable scenario (Section 4), the Bellman residual can never become zero for any , due to model misspecification. This may lead to a loose bound when the minimum value of the Bellman residual is large. Therefore, we suggest the alternative of applying a multi-step approach, as follows.
Temporarily ignoring mathematical rigor, the most important insight is that we can approximate with a sufficiently large step level . Thanks to the properties of the energy distance (Appendix A.3), we can obtain uniform convergence of the -step Bellman residual towards the inaccuracy function, that is, as , where is the constant that quantifies non-realizability, as illustrated in (Weakness 2).
This eventually allows us to obtain an estimator that converges to the best approximation, that is, , where is the minimizer of the inaccuracy function (defined in Weakness 2). However, larger step levels come with a price that resembles a bias-variance trade-off, as specified in Equation (186) of Appendix B.4.4. This is why we suggest a rule for choosing the optimal level of in Appendix D.3.1. A schematic summary follows.
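The multi-step reasoning above can be summarized as follows; this is a purely illustrative restatement with generic symbols (BR_m, inacc, eps, stat are ours), and the precise quantities are those in B.4.3 and Equation (186):

```latex
\[
\sup_{\theta\in\Theta}\bigl|\mathrm{BR}_{m}(\theta)-\mathrm{inacc}(\theta)\bigr|\;\le\;\varepsilon(m),
\qquad \varepsilon(m)\to 0 \ \text{as}\ m\to\infty,
\]
\[
\mathrm{inacc}\bigl(\hat{\theta}_{m}\bigr)\;\lesssim\;
\mathrm{inacc}\bigl(\theta^{*}\bigr)\;+\;\varepsilon(m)\;+\;\mathrm{stat}(m,n),
\]
```

where BR_m is the m-step Bellman residual, inacc the inaccuracy function with minimizer theta*, eps(m) the non-realizability-driven approximation term, and stat(m, n) the statistical error of estimating BR_m from n transitions, which grows with m; the rule in Appendix D.3.1 chooses m to balance the last two terms.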
We will later revise the paper to highlight the related insights.
(Question 2) Dive into the rate difference.
The statement of Theorem 3 has changed in the new manuscript. Instead of comparing Theorems 2 and 3, we should compare Theorem 2 with the finite-sample error bound suggested in B.4.3 (or its simplified version in Equation (186)) with a fixed level of .
To our understanding, the reviewer suspects that the difference between the single-step and multi-step EBRM algorithms is part of the reason for the rate difference (the third weakness listed in this response). Indeed, this partially explains the difference, since we form multi-step trajectories by resampling from the collected data (), as mentioned before Equation (22). This practice of re-using the samples no longer guarantees the independence of the data we use. We then cannot use common tools for showing that the sample mean converges to the population mean (e.g., Hoeffding's inequality), since they require independent samples. Instead, we resort to another device (Lemma 3 of Appendix B.2.1) that resembles the triangle inequality but slows down the rate by .
Second, in non-realizable cases, the converging target is no longer the true distribution. Instead, we compromise our goal to the "best approximation" , with defined in Equation (21). Consequently, Theorem 1, which only required us to derive the convergence rate of the estimated Bellman residual (our objective function), is no longer enough. Therefore, unlike the realizable scenario, one additional step is required: obtaining convergence of the minimizing parameter. This leads us to take into account the function in B.4.3, which further slows down the rate by .
For the aforementioned reasons, the rate in the non-realizable scenario is slower than in the realizable scenario by , that is, versus .
We will add related discussion to the paper.
References
[1] Morimura, Tetsuro, et al. "Nonparametric return distribution approximation for reinforcement learning." Proceedings of the 27th International Conference on Machine Learning (ICML-10). 2010.
[2] Morimura, Tetsuro, et al. "Parametric return density estimation for reinforcement learning." arXiv preprint arXiv:1203.3497 (2012).
[3] Thanh Nguyen-Tang, Sunil Gupta, and Svetha Venkatesh. Distributional reinforcement learning via moment matching. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 9144–9152, 2021.
We have updated our manuscript. All updated parts are highlighted in blue. Specifically, here are the updates regarding your comments:
- p.6: explained the intuition behind the multi-step approach when dealing with non-realizability.
- p.60: added a new subsection C.3.1 that explains the rate mismatch between Theorem 2 (realizable) and B.4.3 (non-realizable) under the realizable setting with . We also refer to this part on p.8 (below Theorem 8).
We kindly ask you to consider raising your scores if the concerns were appropriately addressed.
Thanks for your response. It mostly resolves my concerns and I will vote for acceptance. However, I tend to keep my current score, as I feel the current result is not fully satisfactory and I believe more work can be done in this direction.
The paper addresses a critical aspect of RL by delving into the realm of distributional off-policy evaluation. Specifically, authors introduce a method called Energy Bellman Residual Minimizer (EBRM) to estimate the return distribution, which extends the framework of Bellman residual minimization to distributional RL. This work also establishes finite-sample error bounds for the proposed framework and introduces a multi-step extension for non-realizable settings. Simulations are performed for performance validation and comparison to existing methods.
Strengths
This paper considers the problem of off-policy evaluation in distributional RL, which concerns estimating the (conditional) return distribution of a target policy based on offline data. Existing reinforcement learning methods predominantly focus on the expectation of the return distribution, while distributional RL methods, which extend this focus to the entire return distribution, have shown promising results. However, for off-policy evaluation within DRL, these methods lack theoretical development, error bounds, and convergence rate analysis. The paper addresses this gap by proposing the EBRM method, accompanied by finite-sample error bounds, providing a theoretical basis for using the expectation-extended distance to measure distributional Bellman residuals. In particular, the authors establish the use of the expectation-extended distance for Bellman residual minimization in DRL. A multi-step extension of the proposed method for non-realizable settings is also discussed, with corresponding finite-sample error bounds. The paper significantly contributes to the theoretical understanding of distributional off-policy evaluation in DRL.
Weaknesses
The theoretical guarantees of the proposed methods rely on a set of assumptions, some of which appear to be strong, especially for the proofs of Theorems 2 and 3. The authors should comment on how such assumptions can be satisfied in applications and provide examples for further illustration. In particular, it is of interest to highlight the essential assumptions that are central to the statistical error bounds, and to comment on the performance when relaxing some of them. Due to such assumptions, it is questionable whether the proposed method (EBRM) has wider applicability compared to existing methods.
Questions
-
Compared to FLE (Wu et al., 2023), how does the proposed method excel in policy learning? In particular, as demonstrated in the empirical evaluations, EBRM does outperform FLE in terms of statistical accuracy. Does the main reason lie in the fact that FLE focuses on learning the marginal distribution while the proposed method estimates the conditional return distribution?
-
This paper focuses on discounted infinite-horizon tabular RL. Is the current result in Bellman residual minimization generalizable to average-reward MDPs or linear MDPs, and how does it compare to "A Maximum-Entropy Approach to Off-Policy Evaluation in Average-Reward MDPs"?
Thank you for your constructive comments.
(Weakness 1) Assumptions are strong. Provide examples and comment on the consequences of relaxing them.
In the following, we provide the corresponding discussion of the individual assumptions. We have also added some explanation below Assumption 8 in the updated manuscript.
First, we discuss the crucial assumptions.
- Assumption 1: This corresponds to the coverage assumption that is essential for every OPE method (e.g., FLE [1], Section 5.5 of [2]). It includes all examples where the density can be lower-bounded by a positive number .
- Assumptions 5 & 8: Based on these, we can quantify the model complexity and the degree of non-realizability with , as well as bound the estimation error. As mentioned in the manuscript, (the supremum-extended Wasserstein-1 metric) is an example that satisfies these requirements.
- Assumption 7: Lipschitz continuity allows us to control the model complexity via the metric entropy of with respect to the Euclidean norm (see Remark 2 and the covering-number sketch below). Combined with compactness, it also prevents ill-posed cases where the estimated objective function does not have a minimizer (in ). Our simulation examples in Appendix D.4 all satisfy this condition when we let be the supremum-extended Wasserstein-1 metric.
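To illustrate how the Lipschitz condition controls the metric entropy, here is the standard covering-number argument behind Remark 2, written with generic symbols of our choosing (L, R, p), and N(eps, T, rho) denoting the eps-covering number of a set T under metric rho: if the map from parameters to conditional distributions is L-Lipschitz from (Theta, Euclidean norm) into the model class, and Theta is contained in a Euclidean ball of radius R in p dimensions, then

```latex
\[
N\bigl(\epsilon,\ \{q_{\theta}:\theta\in\Theta\},\ \bar{d}\bigr)
\;\le\;
N\!\Bigl(\tfrac{\epsilon}{L},\ \Theta,\ \|\cdot\|_{2}\Bigr)
\;\le\;
\Bigl(1+\tfrac{2LR}{\epsilon}\Bigr)^{p},
\]
```

so the complexity of the model class grows only polynomially in 1/epsilon, with exponent given by the parameter dimension.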
The following assumption is a strength of our paper, since it is already weak (realizable scenario) or has been relaxed even further (non-realizable scenario).
- Assumption 2: Realizability is already weaker than completeness, which is widely assumed in other papers (e.g., [1], [3], [4]). Even works that relax completeness (e.g., ) still assume realizability, but we take a further step by relaxing it in Theorem 3 (the non-realizable scenario).
The remaining assumptions are relatively less crucial. They are made to simplify the proofs and can be relaxed.
- Assumption 3: Subgaussian distributions (e.g., Gaussian or bounded distributions) are widely assumed in order to apply standard concentration inequalities. We could allow slower-decaying tail probabilities; however, this further complicates the analysis.
- Assumption 4: The i.i.d. assumption can be relaxed to trajectory-based observations (). In this case, we would need additional assumptions (e.g., exponential -mixing as in Definition 1 of ) to ensure an appropriate decay rate of the dependence.
- Assumption 6: The role of the polynomial degree is to quantify the convergence rate (Theorem 3), and we can still obtain consistency without it. Here is one example. Assume the same setting as in Section 5, differing only in the reward distribution for all and the model with , where denotes the Gaussian distribution with parameters , . Then we have a unique minimizer and holds for some , leading to .
(Question 2) How does it compare with the paper about Linear MDP or average-reward MDP?
First, the suggested paper [7] is interested in estimating the average reward over an infinite time horizon, . This differs from EBRM (our method), which estimates the distribution of the return (the cumulative discounted sum of consecutive rewards) in Equation (1) of our text. Due to the difference between the target quantities (expectation vs. distribution; limiting average vs. discounted sum), it is not clear how to compare the two methods.
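In generic notation (ours), the two target quantities can be contrasted as follows:

```latex
\[
\rho^{\pi}\;=\;\lim_{T\to\infty}\frac{1}{T}\,\mathbb{E}_{\pi}\Bigl[\sum_{t=1}^{T}R_{t}\Bigr]
\quad\text{(scalar average reward targeted in [7])},
\qquad
G^{\pi}\;=\;\sum_{t=0}^{\infty}\gamma^{t}R_{t}
\quad\text{(discounted return, whose conditional law given }(S_{0},A_{0})\text{ we target)}.
\]
```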
Next, the linear MDP may be useful for extending our method to a continuous state-action space by expressing the transition probability and the reward distribution as linear combinations of features . To elaborate, it may allow us to construct an unbiased estimate of the energy distance for a continuous state-action space, where a single cannot be observed twice or more (almost surely). In such cases, our current estimation in Equation (10) leads to a biased estimate of the third term in the expansion of shown in Equation (9), which is . Here, we have and , , where indicate independent copies. According to our current way of estimation (10), our estimate becomes with being independent. This is because the terms () have the same value and are thereby cancelled out when there is only a single observation . This bias might be avoided if we could estimate the distribution of the terms () (for observed only once) by expressing them with features. In addition, leveraging the structure of the linear MDP may improve the efficiency of our method. It would be interesting to see what structure of the return distribution is induced by the linear MDP, which requires further investigation. We will add some discussion of this reference in the paper.
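For reference, by the linear MDP structure we mean the standard factorization used in the linear function approximation literature; the symbols phi, nu, theta_r below are generic and ours:

```latex
\[
P(s'\mid s,a)\;=\;\bigl\langle \phi(s,a),\,\nu(s')\bigr\rangle,
\qquad
\mathbb{E}\bigl[R\mid s,a\bigr]\;=\;\bigl\langle \phi(s,a),\,\theta_{r}\bigr\rangle,
\]
```

with a known feature map phi from state-action pairs into a d-dimensional space and unknown measure-valued parameter nu and reward parameter theta_r. Under such a factorization, information can be shared across state-action pairs through the features, which is what could make unbiased estimation feasible even when each state-action pair is observed at most once.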
References
[1] Runzhe Wu, Masatoshi Uehara, Wen Sun: Distributional Offline Policy Evaluation with Predictive Error Guarantees. ICML 2023: 37685-37712
[2] Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction (2nd ed.). The MIT Press.
[3] Duan, Yaqi, Zeyu Jia, and Mengdi Wang. "Minimax-optimal off-policy evaluation with linear function approximation." International Conference on Machine Learning. PMLR, 2020.
[4] Agarwal, Alekh, et al. "Reinforcement learning: Theory and algorithms." CS Dept., UW Seattle, Seattle, WA, USA, Tech. Rep 32 (2019).
[5] Zanette, Andrea. "When is realizability sufficient for off-policy reinforcement learning?." International Conference on Machine Learning. PMLR, 2023.
[6] Antos, András, Csaba Szepesvári, and Rémi Munos. "Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path." Machine Learning 71 (2008): 89-129.
[7] Nevena Lazic, Dong Yin, Mehrdad Farajtabar, Nir Levine, Dilan Gorur, Chris Harris, and Dale Schuurmans. A maximum-entropy approach to off-policy evaluation in average-reward mdps. Advances in Neural Information Processing Systems, 33:12461–12471, 2020.
We have updated our manuscript. All updated parts are highlighted in blue. Specifically, here are the updates regarding your comments:
- p.8: briefly explained the roles of Assumptions 6--8.
- p.61: added a new subsection C.3.2 that explains how the linear MDP can help us extend to a continuous state-action space. We also referred to this in p.9 (Conclusion).
- p.71: added Table 10 of simulation inaccuracies measured with Wasserstein-1 metric for marginalized distributions for realizable settings.
- p.72: added Table 13 of simulation inaccuracies measured with Wasserstein-1 metric for marginalized distributions for non-realizable settings.
We kindly ask you to consider raising your scores if the concerns were appropriately addressed.
Thanks for your detailed reply and the further information, which provides a better understanding. After reading your reply and the other reviewers' opinions, I agree that this is a technically sound paper, while I share the concerns regarding its practical applicability. Indeed, it provides a solid understanding of off-policy evaluation. As such, I have decided to keep my original assessment and lean toward acceptance.
(Question 1) Why does EBRM outperform FLE? Is it due to the difference in focus (marginal distribution versus conditional return distribution)?
Both FLE [1] and EBRM (our method) are designed for off-policy evaluation. We believe the reviewer's question concerns the performance of the two methods in policy evaluation rather than policy learning (that is, finding the optimal policy).
First, we provide some explanations for why the proposed EBRM achieved better performance than its competitors. The first explanation is that none of the simulated examples satisfies completeness, which FLE requires as an assumption. On the contrary, EBRM can be applied when completeness is violated, in both realizable and non-realizable scenarios. This effect becomes most apparent in our most difficult non-realizability setup (the non-realizable example of D.2.3 with , demonstrated in the bottom plots of Figure 3). The performance of FLE even deteriorates with larger sample sizes, and this pattern is consistently observed with other measures of inaccuracy (Tables 11, 12, and 13 of Appendix D.4.2). Another plausible reason is that FLE does not utilize the closed-form probability measure in each update. Whenever they update their estimate at the -th iteration, they draw finitely many samples from the previous estimate , which is prone to approximation error.
Second, it is true that [1] measured the inaccuracy of the FLE estimate by comparing the marginal return distribution , i.e., the mixture distribution with mixture components {} and mixture weights {}. Therefore, we hereby evaluate the inaccuracy according to their method (Corollary 4.14 of [1]), comparing the marginal distributions and via the Wasserstein-1 metric . The results, shown in the following tables, are based on the same simulations as Tables 8--13 of Appendix D.4, but evaluated on the marginal distribution. As in the previous results (Tables 8--9, 11--12), EBRM outperformed its competitors in all cases except for the non-realizable scenario with , where the performances of EBRM and FLE are very close.
- Realizable:
| Sample size | ||||||
|---|---|---|---|---|---|---|
| EBRM | 2.052 | 1.406 | 2.052 | 0.843 | 0.629 | 0.508 |
| (1.524) | (1.005) | (0.778) | (0.556) | (0.350) | (0.255) | |
| FLE | 22.835 | 15.694 | 12.328 | 8.013 | 5.596 | 4.591 |
| (12.810) | (9.272) | (7.859) | (5.113) | (3.786) | (2.945) | |
| QRTD | 97.856 | 56.694 | 47.738 | 45.764 | 49.851 | 49.877 |
| (13.235) | (14.905) | (25.251) | (25.656) | (27.463) | (25.945) |
- Realizable:
| Sample size | ||||||
|---|---|---|---|---|---|---|
| EBRM | 18.021 | 11.613 | 7.528 | 5.441 | 3.607 | 3.062 |
| (12.227) | (8.077) | (5.306) | (4.184) | (2.487) | (1.981) | |
| FLE | 94.556 | 71.430 | 46.740 | 44.915 | 31.774 | 23.378 |
| (62.630) | (51.205) | (36.986) | (31.905) | (23.190) | (18.076) | |
| QRTD | 247.308 | 198.257 | 191.908 | 195.787 | 194.908 | 202.076 |
| (16.843) | (29.911) | (48.057) | (73.533) | (91.526) | (86.031) |
- Non-realizable:
| Sample size | ||||
|---|---|---|---|---|
| EBRM | 14.098 | 14.088 | 14.087 | 13.218 |
| (0.227) | (0.167) | (0.142) | (0.093) | |
| FLE | 13.954 | 13.977 | 13.968 | 13.970 |
| (0.359) | (0.292) | (0.234) | (0.184) |
- Non-realizable:
| Sample size | ||||
|---|---|---|---|---|
| EBRM | 106.850 | 84.839 | 73.338 | 66.560 |
| (39.076) | (29.761) | (17.974) | (5.392) | |
| FLE | 258.141 | 279.687 | 349.984 | 432.014 |
| (19.073) | (21.771) | (20.467) | (21.903) |
We thank the reviewers for their time and effort spent in reviewing our paper. We have responded to each reviewer separately. Here are the major points that we have covered in our response.
- Assumptions: We have assessed whether the assumptions are crucial or can be relaxed, along with the additional issues that arise when they are relaxed. We also provided examples where the assumptions are satisfied.
- Numerical comparison: There were questions regarding our way of measuring inaccuracy (the expectation-extended energy distance) in the numerical experiments. We provide additional comparison results based on other measures, which exhibit conclusions similar to those shown before.
- Rate mismatch: We have provided insight into our proof techniques, which we believe enhances the intuitive understanding of the rate differences between the realizable and non-realizable scenarios in our results.
The changes that were made in the updated manuscript are highlighted in blue. However, in our responses to the reviewers' comments, we also mentioned several additional changes that we plan to make during the discussion stage.
Dear reviewers,
We sincerely appreciate your time and effort in reviewing our paper and responses. We hope that our reply has addressed your concerns. Please let us know if you have further comments or suggestions. Thank you once again for your detailed review and constructive feedback on our work.
The reviewers are concerned about the unrealistic assumptions this paper makes, which cannot be satisfied in practice. It seems that, with the approach this paper proposes, some of the strong assumptions cannot be avoided. This seems to motivate an entirely different analytical approach to distributional off-policy evaluation problems, which this paper does not deliver.
Why not a higher score
The analytical approach this paper proposes does not alleviate the need for the strong, unrealistic assumptions that existing works make.
Why not a lower score
n/a
Reject