/10

Rejected4 位审稿人

最低1最高3标准差0.7

ICML 2025

Model Selection for Off-policy Evaluation: New Algorithms and Experimental Protocol

Pai Liu,LingfengZhao,Shivangi Agarwal,Jinghan Liu,Audrey Huang,Philip Amortila,Nan Jiang

OpenReview PDF

提交: 2025-01-21更新: 2025-06-18

摘要

关键词

model selectionoffline reinforcement learning

评审与讨论

审稿意见

评分: 32025-02-26

The paper proposes a new method for model selection in the offline policy evaluation (OPE) setting. That is, given multiple policies from an offline RL algorithm, the question is which model (either a value function or simulator MDP) among many is best for evaluating these policies? The authors propose new criteria for determining the best model and wrap their solution in the same tournament structure used in a previous work. They prove bounds on the error of their model selection algorithms in the realizable case and provide empirical evidence in an offline version of the hopper environment that their approach does not have catastrophically high errors like previous approaches and perform some interesting ablations.

*** Post-rebuttal update ***

I have read the authors’ response and the other reviews. However, the authors mis-understood my question about an ensemble (I meant to end the tournament early, with say 4 models, and then average their votes). And their description of the MB-MF comparison again sort of mis-understood my question because I was asking about the comparison of the new protocol to BVFT, which is the direct ancestor, not a full comparison of model-based and model-free methods.

Overall I agree with the other reviewers that the paper lacks clarity about exactly how the algorithm works and is not focused on the "main result" so I've downgraded my score.

给作者的问题

See the individual boxes above for particular questions.

论据与证据

Overall, I’m favorable to this paper. The approach is a somewhat incremental change to some of the previous works but the empirical evidence shows significant improvement when making these changes and the solutions (particularly the use of LSTDQ) are certainly non-trivial. In particular, I like that the authors were able to show that while in some cases other methods marginally outperform LSTD-Tournament, they each have bad failure cases that LSTD-Tournament sems to avoid. I also like the Experimental protocol the authors set up, which I think is an important contribution of the paper to the model selection literature.

I do think there are some places where the paper could be improved though and listed some of them in specific sections below.

方法与评估标准

The focus of the whole paper is on picking a single model to use in the policy evaluations. But why is picking a single model a good idea? In the realizable case, sure, it’s great to pick the right model, but there is some probability you won’t, and in the unrealizable case it seems like picking an ensemble of “good but not right” models would likely be the best strategy. I saw no discussion of the benefits of finding an ensemble of possible models, which seems like something this method could be used for.

理论论述

The model-based approaches proposed in this work seem somewhat ad-hoc. The regression based selector and sign-flip rules are both presented somewhat “out of the blue” and while theoretical results are proven about both, the bounds seem fairly large/loose and not compared to bounds on the previous methods so it is unclear what reviewers should be taking from the theorems. That is, the bounds are polynomial but are they better?

实验设计与分析

The fact that both of these new approaches are out-done by a model-free approach also brings their efficacy into question as sign-flips curves in Figure 3 are often flat or trending in the wrong direction with increased data. There isn’t really an explanation for that and it seems, frankly, like the paper would be stronger with an in-depth analysis of the model-free approach and less focus on the model-based ones.

补充材料

skimmed

与现有文献的关系

In the model-free case, LSTD-Tournament is certainly a refinement on BVFT, with the replacement of Q-pi abstractions by a more computationally efficient LSTDQ analysis. But I was disappointed that the authors left “a detailed description of their differences to future work”. LSTD-Tournament is the best algorithm in the current paper, and arguably the only one that really stands out given that it outperformed the model based approaches. And the algorithm structure itself is very similar to BVFT except for the comparison rule at the innermost loop of the approach. Reviewers are left to wonder how much of a change the new approach really is to BVFT and relegating that comparison to future work makes judging the novelty of the top algorithm in this paper much more difficult.

遗漏的重要参考文献

The paper has appropriate references.

其他优缺点

Again, I still have a favorable view of this paper. The math seems right, the experimental protocol is well thought out, and there is a clear improvement in LSTD-tournament in the experiments, but the factors above limit the scope and breadth of the result.

其他意见或建议

None

作者回复

2025-03-26

We thank the reviewer for appreciating the contributions of our work and the helpful comments for improvement.

Why is picking a single model a good idea? ensemble?

First, it is not clear what "ensemble" of models means in the context of OPE: do we simply average the final prediction? or we average the transition dynamics (like in [MJTS20]?) How do we weight each model? We are not aware of widely accepted ways of using ensemble in the context of OPE.

Second, there are ways to reduce the "ensemble" idea to our setting. In particular, if by "ensemble" we mean averaging transitions (like [MJTS20]), then an ensemble of "base MDPs" is simply a well-defined MDP itself. There is no reason why one cannot add such a model to $\mathcal{M}$ . Furthermore, if the weights for averaging "base" models are unknown, we may also consider $\mathcal{M}$ to be a set of models with different weights. That said, if the unknown weight space is continuous, this is more of a learning problem than a model selection problem, as our methods are not computationally efficient for continuous $\mathcal{M}$ . A plausible solution is to run other methods to learn such ensemble models in a separate learning phase, and apply our method in the selection phase.

[MJTS20] Sample Complexity of Reinforcement Learning using Linearly Combined Model Ensembles.

Model-based (MB) approaches: "ad hoc", "out of the blue", "out-done by model-free"

The motivation for considering MB is that it has access to more available information than MF algorithms (if you have $\mathcal{M}$ , you can induce the value functions, but not vice versa), so it is natural to expect that by leveraging the additional information in $\mathcal{M}$ one can potentially do better than MF. This is what we expected before running the experiments, and it was surprising to see that LSTD-tournament works better than the MB methods, which we find interesting to report.

As another side product, we wanted to compare to [ZDMAK23], which is one of the very few existing methods for MF selection. However, they require a helper function class in addition to $\mathcal{Q}$ . The regression-based algorithm is essentially a clever way to implement [ZDMAK23], where the helper class ( $\mathcal{G}\_i$ ) can be constructed from $\mathcal{M}$ .

the [MB] bounds seem fairly large/loose

I was disappointed that the authors left [comparison between MF and MB guarantees] to future work

Ok this is a very tricky question. The MB guarantees (Theorems 4 and 5) are actually very standard (Line 230R). You can find very similar analyses in the OPE literature under the standard "Bellman completeness" assumptions for function approximation ([XJ21;XCJMA21]). The reviewer described these bounds as "loose", but these are really the dream results (in terms of cleanness and interpretability) you would want in RL theory. We obtain them thanks to the additional information available in $\mathcal{M}$ .

What is more "unusual" is Theorem 2 for LSTD-Tournament, which is largely a direct corollary of Theorem 1, the guarantee of LSTDQ. The key differences between Theorems 4,5 vs 1,2 is the definition of coverage: Theorems 4 and 5 use the standard $C^\pi$ (called "concentrability" [AMS08]), whereas Theorem 1 and 2 uses a matrix singular value that is highly specialized to LSTDQ. As far as we know, how to properly characterize the behavior of this value as a coverage parameter and how it compares to $C^\pi$ are largely open questions.

While we would certainly like to study this question, it is a fundamental issue that exists at a much more basic level. That is, you can ask the same question without even talking about model selection: just consider 3 basic OPE algorithms for learning $Q^\pi$ :

FQE (under Bellman completeness)
LSTDQ (under realizability)
Abstraction (under $Q^\pi$ -irrelevance, which is realizability for piecewise-constant class)

Their guarantees can be found in the literature, which depend on different coverage parameters. Again, no one knows how they compare. Both LSTDQ and abstraction have very algorithm-specific definitions of coverage that are very hard to parse and interpret. [JRSW24] recently made progress in understanding it for abstraction and BVFT ("aggregated concentrability"), showing it can be exponentially worse than $C^\pi$ in some cases (but there are also trivial cases where it's much better). The situation for LSTDQ and LSTD-tournament is likely similar.

[JRSW24] Offline Reinforcement Learning: Role of State Aggregation and Trajectory Data.

So without delving further into this rabbit hole, let us just say that this comparison is a conceptual mess, and a clean and elegant answer might not even exist. Your review expressed "disappointment" which seems a pretty strong sentiment, so we want to offer an explanation.

审稿意见

评分: 22025-03-09

This paper studies the problem of model selection for OPE, where you have one evaluation policy and several candidate OPE estimates, and the goal is to find the best OPE estimate. The paper presents new OPE selection procedure for both model-free and model-based OPE methods, leveraging some new theoretical insights as to how the Q function estimates and approximate models are related. The proposed approaches are then compared on Gym Hopper to show preliminary empirical results, following a experimental protocol that uses different procedures to generate candidate OPE and error bars by bootstrapping, which is different from past work.

update after rebuttal

In my opinion, there are promising aspects of this paper in its theoretical constructions, but the presentation clarity could be improved (e.g. L170-right after Theorem 2, assuming readers know what ε is without explaining; the "sign-flip" version in Sec 4.2 is also briefly stated without detailed explanation). I wonder if there is too much content the authors want to include within the page limit. On the other hand, the experimental evaluation protocol is not fully justified (the rebuttal mentioned some good points, but these should be included in the paper itself), and the experimental results are somewhat limited and mixed. Therefore, I am maintaining my overall recommendation.

给作者的问题

In the introduction, the setting of "model selection of OPE" might need to be further distinguished from the more common setting of "using OPE for model selection of policies". In the abstract, the paper even started with the latter setting, but did not point out difference clearly enough. Otherwise readers might be confused which is which. A diagram may help.
Does the proposed framework work for importance sampling OPE? IS doesn't seem to be either model-based or the model-free settings described in the paper.
How realistic is it to induce different candidate value functions from "variations of the groundtruth environment" (L83)? Will it necessarily reflect what we get from an offline data and multiple OPE methods? (also L324 right)
L115: are we assuming the "blackbox base algorithms" (which can be MC or TD etc) they produce the “correct” Q functions, like an oracle?
On L171-right, what is ϵ? It's not part of the result you show in Theorem 2.
For the model-based version (Sec 4 on page 4), do you only estimate the transition model (e.g. Eq 6)? What about reward model?
In L272 Prop 3, it says if Qπ is in the set, then applying Tπ to it still gives π. Essentially the "regression error" would be zero. But that doesn't necessarily suggest that smaller regression error is better, especially since these quantities are calculated on the dataset D which may lack coverage (L223-right). Can you clarify or provide more justification for the regression-based selector?
L326 the claim of alternative protocol being expensive: how expensive? Can you provide a big-O for a sense of scale?
Weak experimental results: In Fig 3, the proposed method is very close to TD-sq seems best, even for larger N, but in the theory section there lacks any mention of TD-sq.
L434 "as predicted by theory" - I don't think this experiment is supported by the earlier theory, it tracks with the intuition which was stated on L420.
L436 the experiment on Misspecification seems lacking and doesn't show any trends, so what's the takeaway?
There is no conclusion.

论据与证据

Mostly. See detailed comments below.

方法与评估标准

Yes.

理论论述

I did not check the proofs, but at a quick glance the all theorems presented in the paper make sense. The construction of linear basis features (L178) to reduce the problem into LSTDQ is quite clever.

实验设计与分析

The experiments were only on one environment (Hopper) but with variations on its parameters (gravity and noise).

补充材料

No.

与现有文献的关系

Model selection for OPE is an important, often overlooked, problem.

遗漏的重要参考文献

Tang & Wiens. Model Selection for Offline Reinforcement Learning: Practical Considerations for Healthcare Settings. Machine Learning for Healthcare Conference, 2021.

其他优缺点

See below.

其他意见或建议

See below.

作者回复

2025-03-26

We thank the reviewer for the comments. We do not find major concerns in the review - can you let us know what are the main factors that lead to your "weak reject"?

Importance sampling (IS)

We do not consider IS-based OPE. Vanilla IS does not need any function approximation, which is a major source of hyperparameters (e.g., neural architecture for TD). Its variants, such as doubly-robust, require a Q-function as control variates, which can be selected using our methods.

How realistic...to induce candidate value functions from "variations of the groundtruth environment"?

Great question; we intended to discuss in conclusion had we had more space. In practice, we will likely get candidate value functions by running methods like TD (say different neural architectures). When they produce inaccurate Q-functions, the errors are likely not that "uniform" compared to our generations (when we change the gravity, it changes in the same way in every state). Varying environment in more complex ways (e.g., only change gravity in certain regions) may mitigate this problem, which we leave for future investigation.

In fact, prior works have explored generating candidate Qs by FQE with different architectures, which we also tried in an early stage. The problem, as discussed on Line 80 left and Ft 1, is that often times all candidates are poor: Unlike in reality where we can devote much time to the problem at hand and come up with good candidate architectures, here we run a lot of experiments and simply don't have the energy to tailor candidate architectures to each problem instance. This is why we switched to the current protocol. While somewhat "unrealistic", its ability to carefully control the level of misspecification is an outstanding strength. Of course, it would be nice to perform empirical evaluation using both protocols if one has enough computation.

Are we assuming the "blackbox base algorithms" (which can be MC or TD etc)...produce the “correct” Q functions, like an oracle?

What do you mean by "correct"? If you mean that every base algorithm must produce the correct $Q\_{M^\star}^\pi$ , then that's certainly not the case, otherwise the selection is trivial. On the other hand, we do have the assumption (Line 101 Right) $Q^\pi = Q\_{M^\star}^\pi \in \{\mathcal{Q}\}$ for theoretical derivations, which means that one of the base algorithms must produce the true Q-function. As long as that's the case, other base algorithms can produce arbitrary functions that need not be "correct" in any sense. Extension to the case where only a good approximation of $Q^\pi$ can be found in $\{\mathcal{Q}\}$ is routine in RL theory (Line 95 Right), and we also explore this empirically in the "misspecification" experiments (which is partly why we include it despite the lack of "trends", to answer your other question).

L171, $\epsilon$

$\epsilon$ refers to the RHS of the bound in Theorem 2, and solving for $n$ we have $n=O(1/\epsilon^2)$ . This is a standard sample-complexity discussion and we will clarify that.

Reward model in MB?

We assume true reward is known in derivation and experiments for simplicity, but it is a trivial extension to allow for candidate models to have different reward models.

but that doesn't necessarily suggest that smaller regression error is better, especially...dataset...may lack coverage

You are absolutely right. The role of coverage is reflected in the $C^\pi$ term in Theorem 4, which a standard way to characterize data coverage (see [CJ19;LVY19]). If data lacks coverage, $C^\pi$ will be large and the guarantee vacuous. Note that this will be an issue for every method. When data lacks coverage, OPE and its model selection will be fundamentally hard. Therefore, pretty much the only thing we can do is to come up with algorithms where $Q^\pi$ (or $M^\star$ in the model-based setting) has $0$ (or minimal) loss, and typically you would have a guarantee (like Theorem 4) when certain coverage assumptions are satisfied. Also note that, $Q^\pi$ (and $M^\star$ , resp.) is not even a loss-minimizer in TD-sq in Eq.2 (and mb_naive in Eq.6, resp.), which prevent them from enjoying such guarantees. So, the motivation for designing the regression-based estimators, as well as any other estimator (this is also the case for prior work like BVFT [XJ21]), is to achieve guarantees like Theorems 1,4,5, which are nontrivial.

L326 Expensive

We have brief big-Oh discussion on L382L, but "expensive" here is more of a numerical one. The major cost in Q-caching is simulation steps and policy calls (neural-net inference). For MF.G, the cost is data size (3200) * trajectories (128) * horizon (1024) * target policies (10) * candidate models (15) * $M^\star$ choices (3) ≈ 10^11. Runs for a few days ~ a week on a 4090 PC.

审稿人评论

2025-04-02

Thank you for the clarifications.

My main concerns for not giving a higher rating include:

The theoretical results are interesting, but presentation and clarity wise it's a bit lacking. I'd say it's more difficult to follow than the BVFT paper.
Method for generating candidate value functions on Hopper-v4: lack of discussion on the limitation / unrealistic-ness / proof-of-concept nature of the setup.
Lack of evidence that the proposed new experimental protocol will in general produce the same conclusions as previously established protocols

作者评论

2025-04-02

Thanks for the additional comments and summary.

To respond to each of your points

We appreciate that you find the theoretical results interesting. Since your original review did not mention clarity issues explicitly, we would appreciate if you could be more concrete about where clarity is lacking so we can improve.
As mentioned in the rebuttal, the limitations of the protocol are something we are 100% happy and inclined to discuss had we had more space. We will surely add the discussion in revision when more space is granted, so we request that this point not viewed as a major concern for the paper.
The reviewer seems to believe that we claim "the proposed new experimental protocol will in general produce the same conclusions as previously established protocols". Perhaps the reviewer thinks we propose the new protocol to be a "cheap and stable replacement" of old protocols. This is NOT what we claim and we do not expect the conclusions to be identical. As we mentioned in the rebuttal (the paragraph on "How realistic..."), candidate functions generated in the old protocol often have poorer and uncontrollable quality than actual practice; in contrast, the functions generated in our case have controllable quality and enable additional interesting investigations (e.g., the gap experiments), but the errors can be unrealistic. Neither is a perfect replication of reality and they are complimentary to each other. (We wrote in rebuttal: "Of course, it would be nice to perform empirical evaluation using both protocols if one has enough computation.")

We hope this clarifies the reviewer's misunderstanding in our claims and we will edit the paper to clarify.

审稿意见

评分: 12025-03-10

The work proposes two new model selection algorithms for off-policy evaluation. These methods are inspired by "batch value function approximation" (BVFT) and its shortcomings. The authors propose to use LSTDQ as model/Q-function selector when following the steps of BVFT and dub the resultiung algorithm "LSTD-Tournament". The authors then show that the algorithm can also be extended to the model-based setting. Further, to ensure fairer comparison of selectors, the authors propose an evaluation protocol.

update after rebuttal

I have read all reviews and am keeping my score.

给作者的问题

Why do performances of the algorithms vary so drastically when the same gravity and noise values are applied? (For details see my comments in the "Claims and Evidence" section)

论据与证据

The results presented in section 6 do not clearly support the claims made in the paper. Firstly, the authors experimental setting uses Hopper-v4 with modifications to the used gravity and added noise to transitions. Instead of manual modifications to a gymnasium environment I would recommend that the authors use published benchmarks that already add such modifications. The work by Kirk. et al (2023) discusses various such environments for the purpose of evaluation of zero-shot generalization capabilities (something which is closely aligned to this work). Out of these I would recommend the CARL collection of environments as it provides a broad variety of modifications suitable for the authors experiments. This would also alleviate the need for the authors to define their own ranges of "gravities" for experimentation, resulting in improved reproducibility.

The experimental results seem highly doubtful to me and are the main reason I vote for rejection. Comparing the results of MF.G (g=-30, $\sigma$ =100.0) and MF.N (g=-30, $\sigma$ =100.0) give vastly different results even though the basic setup should be exactly the same. The gravity values and noise values are exactly the same. As those are the only parameters that should differ between experiments, there is no reason why the baseline mb_naïve has the lowest OPE error in the MF.G setting and the highest in the MF.N setting. These inconsistencies can be traced through all experiments, which makes me doubt the validity of the experiments. It is also claimed that this naive baseline performs poorly in high-noise environments (lines 356-358, right column) but this is simply not the case. It is the best performing for $\sigma$ =100.0 in MF.G g=-30, g=-3, MB.G g=-24 and second best in MB.G g=-30. The claims with respect to "simulator gaps"§ and "misspecification" are based on subsets of only 3 $\mathcal{M}$ which does not give a real meaningful comparison. Further, these experiments take the noisy variant of the transition dynamics instead of the changed gravities which would give a more meaningful comparison as the transition dynamics are truly different and not just a "fuzzy" version of the "ground truth" environment. No reason is stated for why the remaining 12 experiments with varying gravities and noise levels are not reported in either an aggregate statistic or reported in full in the appendix.

The need for a novel evaluation protocol does not seem well substantiated and the text does not make it clear how the proposed setup differs from commonly used protocols.

方法与评估标准

The selection methods do make sense. However, as stated in the prior section, the need for a different evaluation protocol does not seem to be well substantiated and, as differences to existing protocols are not clarified, I do not see how this is a claimed contribution of the work.

理论论述

I did not check the proofs for correctness as the experiments already qualify the paper for rejection.

实验设计与分析

For my critique of the experimental design refer to the "Claims and Evidence" section. Additional experimental design choices that are unjustified and thus questionable are the choice of horizon $H$ and number of MC rollouts $l$ .

补充材料

I have reviewed sections A, C, D & E though not as thoroughly as the main text. I was searching for missing results from the main text.

与现有文献的关系

Without Appendix A, the work does not provide an adequate discussion of the broader related work. Surpringly, the main text never refers to Appendix A

遗漏的重要参考文献

I am unaware of any works that would crucially need to be discussed.

其他优缺点

I believe the idea of using LSTDQ in the fashion of BVFT is an interesting and promising idea that could prove very useful.

其他意见或建议

Adding aggregate statistics to summarize the results across gravities and noise levels might provide a clearer picture of the strengths and weaknesses of the proposed selectors. The results should also be contrasted with true performance values of policies in the target environments as the OPE might not be meanigful in some settings. Take for example MF.G with g=-51. Most Methods achieve a low OPE, though this might be only due to the environment being so hard to solve with such high gravities, that all policies basically behave equally bad. Thus, providing true performance values in the environment will make it easier to understand if OPE values are meaningful or not. Further, in the zero-shot generalization for online RL literature, it has become standard practice to probe the "performance" with a random policy and an expert policy on the sampled gravities/noise values. These can the be used to properly normalize the performances and thus allows for providing a clearer picture of performance. An example of such a protocol can be viewed in https://openreview.net/forum?id=o8DrRuBsQb

作者回复

2025-03-26

We thank the reviewer for the comments. The main criticisms arise from technical misunderstandings, which we clarify first.

Major Technical Misunderstandings

Comparing ...MF.G (g=-30, $\sigma$ =100.0) and MF.N (g=-30, $\sigma$ =100.0) give vastly different results even though the basic setup should be exactly the same.

As mentioned clearly in Section 5.1, an experiment is determined by the groundtruth $M^\star$ and the candidate model set $\mathcal{M}$ , among other elements. The two experiments mentioned by the reviewer coincide in $M^\star$ =(g=-30, $\sigma$ =100.0), but differ crucially in $\mathcal{M}$ : MF.G uses the "gravity grid" ( $\mathcal{M}\_g$ ), and MF.N uses the "noise grid" ( $\mathcal{M}\_n$ ); see Section 6.1. Algorithms generally behave differently when they are given different $\mathcal{M}$ .

the remaining 12 experiments...are not reported.

If the reviewer thinks there must be 15 experiments because $|\mathcal{M}\_g|=15$ , then that's incorrect. In Figure 2 top (MF.G), all 3 plots correspond to the same $\mathcal{M}\_g$ but 3 different $M^\star$ .

[Varying noise levels creates] just a "fuzzy" version of the "ground truth" environment.

We respectfully disagree. In RL, dynamics is formally defined by the transition function $P(s'|s,a)$ . Models with the same gravity but different noise levels have different $P(s'|s,a)$ , so they are technically different.

While describing them as "fuzzy versions" of the same MDP is neither rigorous nor helpful, the reviewer is perhaps concerned that they may be roughly the same, which makes the experiments in MF.N trivial. This is simply not the case in Figure 2. Had all the models in MF.N ( $\mathcal{M}\_n$ ) be roughly the same, all methods would have nearly 0 OPE error, and it would be impossible to beat randomly selecting a model ("random") since all models produce the same prediction.

Moreover, theoretically mb_naïve is known to be biased towards more deterministic models (Line 222 Left). (This is also why mb_naïve is deceptively good in MF.N when $\sigma=10$ .) Having candidate models of varying noise levels poses significant challenges to such simple methods. This also explains your confusion that:

why the baseline mb_naïve has the lowest OPE error in the MF.G setting and the highest in the MF.N setting

This is precisely because models in MF.G have similar levels of stochasticity (so the bias of mb_naïve is not fully exposed), but those in MF.N have different levels of stochasticity.

It is also claimed that this naive baseline performs poorly in high-noise environments (Lines 356-358 Right) but this is simply not the case. It is the best performing for $\sigma$ =100.0 in MF.G g=-30, g=-3, MB.G g=-24 and second best in MB.G g=-30.

We agree that the text here is too brief and perhaps misleading. We are mostly referring to the MF.N experiments when $\sigma$ is high.

Take for example MF.G with g=-51. Most Methods achieve a low OPE, though this might be only due to the environment being so hard to solve with such high gravities, that all policies basically behave equally bad.

We are afraid that the reviewer might be carrying concepts and mindsets for policy optimization (which is standard in empirical RL) into the policy evaluation problem. The performance of target policies generally does not determine the difficulty of policy evaluation. Even if a target policy has poor performance, the OPE algorithm still needs to correctly predict its (low) value so that the user does not deploy it in the real system. There are no reasons to believe that this will be generally easier than evaluating a good policy; for example, if the offline data comes from a good policy, it can be more difficult to evaluate a poor policy than a good one due to lack of coverage.

it has become standard practice to probe the "performance" with a random policy

Again, this is a practice for policy optimization, and we are doing evaluation. In fact, we did something similar in spirit: all plots show the "random" baseline that randomly picks one of the candidate models to predict the performance of the target policies. This helps rule out the degeneracy that OPE error is low simply because all candidate models predict the correct value.

Misc

Providing true performance values in the environment will make it easier...

We did show this in Figure 1 (left). The plot shows the performance of 10 target policies across models in MF.G of different gravity values (x-label is a typo; should have been "gravity").

subsets of only 3 $\mathcal{M}$

First, it's not "3 $\mathcal{M}$ ", but $|\mathcal{M}|=3$ . This is due to our limited computational budget. On the other hand, this still results in nontrivial model selection problem, as all methods still suffer nontrivial OPE errors in Figure 4.

审稿意见

评分: 22025-03-10

The paper tackles the setting of off-policy evaluation (OPE). It analyzes different method for OPE, model-free as well as model-based. The paper introduces the general setting with a short overview over related work and lists its contribution. The paper presents a short overview of preliminary theory. The paper introduces a new model-free selector, follows with a section on model-based selectors and states a model-based experiment protocol with an exemplification of it.

update after rebuttal

Score increased. See rebuttal comment.

给作者的问题

It seems that one of the main selling points of the paper should be that the reader should learn about a well-working OPE framework. This could enable the reader to tackle his OPE task at hand. The reader could feel confident to do with by having clear instructions at hand and a demonstration of those instructions.

First, do you agree with this point?
Where would I as a reader find this and how can I follow along the demonstration?

论据与证据

The paper claims to develop new model-free and model-based selectors with theoretical guarantees. On the model-free side the LSTD tournament is introduced which merges the ideas of LSTDQ and BVFT. The connection and interpretation of Theorem 2 to its surrounding descriptions is not clear.

The sign-flip average bellman error is introduced, but the thread is hard to follow why.

The paper claims to develop a new experimental protocol for experimental evaluation.

The new experimental protocol is introduced, but it is not clear how one would actually apply it in practice.

The exemplification of the protocol is hard to follow.

方法与评估标准

The paper evaluates its new contribution with an exemplification on the Hopper environment and some adjusted derivation of Hopper. In general such a demonstration seems fitting to the presented claims. The comprehensibility of what the results demonstrate is not given. The evaluation criteria are hard to follow.

理论论述

There are four theorems presented in the paper. I did roughly check the proofs in the appendix. Besides several formatting problems, e.g., 821 lem:model..., it looks fine.

实验设计与分析

The experiments are done on Hopper and some derivations of it. The experiments seem to be designed appropriately, but it is hard to follow what is done and why.

补充材料

I did not review the supplementary material.

与现有文献的关系

The paper is heavily related to [XJ21] for BVFT and [MPW23, PKBK23] for LSTDQ, it relates to several model-based OPE publications.

遗漏的重要参考文献

I am missing an essential reference and discussion about the "current way to do OPE".

其他优缺点

Overall, the paper presents relevant and interesting bits of information, but in the current state comprehensibility is not given. The paper is thus in need of a major revision.

The central thread of the paper is extremely hard to follow. There are several reasons for this. The paper introduces multiple independent claims that are not closely woven together. The motivation for putting those claims together in a single paper instead instead of multiple different papers is not clear. There is no conclusion section, the main results are in a subsection which do not end the paper and are hard to comprehend.

The paper would benefit a lot of the following elements: Emphasizing the motivation why the presented claims make sense to be published together. Introducing a central thread that is easy to follow and helps with connecting the subtopics. Adding a conclusion that helps to wrap up the paper and clarifies what to take home from the paper from the authors' perspective. Refactoring most subsection und paragraph titles to be more intuitive. Referencing them along a central thread would be helpful as well.

Furthermore following points are noteworthy: The current citation style does not fit the ICML citation style in my understanding. There are a lot of formatting problems and typos in the paper. The labels for Figure 2 and 3 could be improved. The Figure formatting could be streamlined.

其他意见或建议

It seems unusual the not name the environment in the abstract. It is well known and there is only one environment used in the paper.

There are inconsistencies in the paper, e.g., about the contributions: two fold (025) vs. 4-fold (055).

The abbreviation "w.p." is unusual, thus confusing.

作者回复

2025-03-26

We thank the reviewer for the comments.

The paper presents relevant and interesting bits of information.

In general such a demonstration seems fitting to the presented claims.

We are glad that the reviewer finds new insights in our paper and that they are supported with evidence.

It analyzes different method for OPE, model-free as well as model-based.

... the reader should learn about a well-working OPE framework

The reviewer misdescribes our work as analyzing different OPE methods. Quite different, we study model selection over OPE, i.e., selecting between different ``base'' OPE methods that either produce Q-functions or dynamics models (Line 80, right column).

I am missing an essential reference and discussion about the "current way to do OPE".

As mentioned above, we are working on the model selection problem, which is ``one-level up'' above (albeit closely related to) the OPE problem. Mathematically we treat base OPE problems as "oracles" that produce either Q-functions or dynamics models (Line 80, right column), which covers a wide range of practical OPE methods considered in the literature (see Appendix A for relevant references about OPE and model selection in OPE). This way, we do not need to worry much about how OPE is done, but focus on selecting from the output of OPE algorithms.

Do you agree with this point [that "the main selling points... should be... a well-working OPE framework"]?

Again, we do not contribute to OPE itself but study its model selection problem. It is unclear to us whether this confusion has affected the reviewer's understanding of the work, or that the reviewer actually understood the difference (between OPE and model selection over OPE) and was simply not careful when summarizing and describing our work, since the review does not provide further technical comments.

This could enable the reader to tackle his OPE task at hand.

It seems that the reviewer feels the paper is like a "practical guide/manual" for OPE. (An example of this would be [VJY21] in our citation.) It is not. Given the under-investigation, we are far from the "practical guide" phase of research; the focus of our work is novel theoretical derivations and comprehensive empirical evaluation of the methods. The "protocol" is not about practical usage, but how we empirically test and understand these algorithms in simulation environments.

the paper introduces multiple independent claims that are not closely woven together.

We hear you and we are very aware that the paper is unconventional in this aspect when we submit the work. The paper is indeed filled with little bits of new insights here and there. That said, we do think the paper has a clear theme: progress in multiple dimensions on the model selection problem for OPE, which is heavily under-investigated (see Appendix A for relevant works). In particular, we design novel selection algorithms in Sections 3 and 4. Then, naturally one would want to evaluate the algorithms empirically, but existing frameworks/protocols for doing so have many problems, as briefly explained on Line 78, Left Column. So we also propose new protocols that allow more comprehensive empirical understanding of the selection algorithms. We feel these components are naturally connected and valuable to anyone interested in the model selection problem.

We indeed considered the reviewer's implicit suggestion that it might be better to "slice" the work into "multiple papers" ("The motivation for putting those claims together in a single paper instead instead of multiple different papers is not clear.") After careful consideration, we still believe the works is best presented in its current form. For example, one could consider taking out the new selection algorithms and their analyses and publish a purely theoretical paper; however, while the new algorithms take novel insights to come up with, once they come to mind, their theoretical properties naturally follow from existing analyses, so the theoretical contributions alone are likely too thin for a typical conference paper.

"is unclear""hard to follow"

The review ends several comments with phrases like "unclear" and "hard to follow". Without further concrete comments, it is very difficult for us to understand what is unclear. Perhaps OPE and its model selection does not fit your background and/or it's not a problem that interests you; otherwise, we would like to hear more concrete comments that can help us improve the paper.

Conclusion

We will add one. The omission was simply due to space limit.

w.p.

This means "with probability", which we spell out in Theorem 1. This is a very standard and widely used abbreviation in ML theory.

审稿人评论

2025-04-07

Thank you for clarifying several aspects of your research. I appreciate the time and effort you've taken to address my concerns. Here are some further thoughts based on your responses:

My main problem with the paper lays in the overall structure. Those "little bits of new insights here and there" are very difficult to combine into a comprehensible paper. The idea to publish a paper on "progress in multiple dimensions on the model selection problem for OPE" is a very difficult task to tackle. I generally respect the authors' motivation to do so, but I really think in this case it hurts the clarity and comprehensibility too much. To slice the work into multiple papers is indeed what I would recommend. If parts of the new slices get too thin for a typical conference paper, it might be worthwhile to extend some directions of the work first. Reading the other reviews strengthens this belief.

I think the omission of a conclusion due to a space limit - even if planned for the final version - is bad practice as it is easily the most frequently read part of the paper.

While two other reviews have also been critical, one review is more favorable, highlighting the value of your contributions. I am considering adjusting my score from 1 to 2.

作者评论

2025-04-07

We thank the reviewer for sharing further thoughts and considering raising the score on our paper. While we do think the paper deserves to be published in its current form, we understand your concerns given the somewhat unnatural structure and organization of the paper.

最终决定Reject

2025-05-01

This paper explores an important topic in reinforcement learning, which is off-policy evaluation: how can we accurately select between policies without access to the true environment? The paper correctly highlights shortcomings in current approaches, which themselves may have hyper-parameters. To this end, the paper turns the usual question on its head; with this in mind, how should we select an OPE model given some set of policies?

Overall the reviewers agreed that the paper offers interesting insights, and agree that this is a relevant area of research. However reviewers also agree that the paper is poorly organized, and its main ideas are not clearly communicated; given the relatively strong background on OPE of the reviewers, this would imply that the paper in its current form would be similarly confusing to the community more generally. For instance, the work lacks a conclusion, and is also confusingly written, such as suddenly introducing new approaches (e.g., sign flip) while talking about existing approaches in great detail within the same narrative thread. Reviewers were also unclear what the true main goals of the paper were; is it LSTD-Tournament? Is it the protocol? Maybe it's the comparison between OPE selection methods?

I therefore don't believe this paper is ready for acceptance in its current form, and requires significant rewriting. It seems generally that reviewers positively responded to the LSTD-Tournament method; if the authors were to therefore focus on model-free approaches, and removing the model-based section completely might improve the readability of this work. I also agree with some of the reviewers that the empirical side of the paper could be improved, such as through the introduction another domain beyond Hopper. Perhaps finer details of existing methods could be relegated to the Appendix to make room for this.