PaperHub
Overall rating: 7.8 / 10
Decision: Poster · 4 reviewers
Reviewer ratings: 5, 5, 4, 5 (min 4, max 5, std. dev. 0.4)
Confidence: 3.3
Novelty: 2.8 · Quality: 3.0 · Clarity: 3.5 · Significance: 3.0
NeurIPS 2025

Finite-Sample Analysis of Policy Evaluation for Robust Average Reward Reinforcement Learning

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29
TL;DR

We present the first finite-sample analysis for policy evaluation in robust average-reward reinforcement learning using semi-norm contractions.

Abstract

We present the first finite-sample analysis of policy evaluation in robust average-reward Markov Decision Processes (MDPs). Prior work in this setting has established only asymptotic convergence guarantees, leaving open the question of sample complexity. In this work, we address this gap by showing that the robust Bellman operator is a contraction under a carefully constructed semi-norm, and developing a stochastic approximation framework with controlled bias. Our approach builds upon Multi-Level Monte Carlo (MLMC) techniques to estimate the robust Bellman operator efficiently. To overcome the infinite expected sample complexity inherent in standard MLMC, we introduce a truncation mechanism based on a geometric distribution, ensuring a finite expected sample complexity while maintaining a small bias that decays exponentially with the truncation level. Our method achieves the order-optimal sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2})$ for robust policy evaluation and robust average reward estimation, marking a significant advancement in robust reinforcement learning theory.
Keywords
Robust Reinforcement Learning · Policy Evaluation · Finite-Sample Analysis · Average Reward MDPs · Stochastic Approximation

Reviews and Discussion

Review (Rating: 5)

This paper proposes an algorithm to solve the policy evaluation problem in the context of robust average reward RL. To achieve this, it analyzes the contractivity of the robust Bellman operator under a well-chosen semi-norm, then proposes an efficient estimation method for approximating a subproblem appearing when computing the Bellman operator, and then finally proposes an iterative sample-based algorithm to approach the fixed point of the robust Bellman equation. Sample complexities are provided for this algorithm.

Strengths and Weaknesses

This paper proposes a very complete analysis of the robust average-reward policy evaluation problem. The presentation is very clear and progressive, making the paper easy to read and follow. The proofs of the main results look technically solid, although I could not check all the details of the appendix. The tackled setting looks interesting, yet its practical reach, which is not much discussed, is somewhat difficult to evaluate. One possible limitation of the paper in its present form is that there is no numerical illustration of the proposed algorithm. Other weaknesses of this very thorough work are, in my sense, minor, and are more related to open problems that could be subsequently studied (see questions below).

Questions

  • How does existing work dedicated to the discounted setting (cited as [25, 29, 16, 15, 14] in the related work) relate to the developed method and sample complexities? Specifically, is the proposed algorithm of the present paper similar, or fundamentally different? Is there any kind of equivalence / correspondence between the average and discounted case, as could be suggested by Theorem 3.2?
  • Is it possible to illustrate the $\tilde O(\epsilon^{-2})$ convergence rate of the algorithm on a very simple toy example?
  • In Theorems 5.1 and 5.2, the robustness level (quantified by $\delta$) does not appear explicitly in the convergence rates. If this dependence is simple, is it possible to make it explicit in the rates? In particular, what happens when $\delta \rightarrow 0$? Is it possible to relate the robust TD algorithm to its non-robust counterpart?
  • The $O(\epsilon^{-2})$ sample complexity is known to be optimal for the non-robust policy evaluation problem. Do you know of any lower bounds in the robust setting?
  • In line 149, $P^a_s$ should be replaced by $\tilde P^a_s$.
  • In line 114, the simplex $\Delta$ is not defined.
  • From lines 146 to 158, it seems that $\Delta(S)$ and $\Delta(|S|)$ are used interchangeably.

Limitations

Limitations are mentioned in the conclusion about the ergodicity assumption and the finite state space setting.

Final Justification

This paper proposes a very complete analysis of the robust average-reward policy evaluation problem. The presentation is clear. The authors added some numerical illustration during the rebuttal. The few limitations of this paper by no means justify rejection. Therefore I will keep my score and recommend acceptance.

Formatting Issues

Minor corrections should be made to the references formatting. In particular, some capital letters are missing in some conferences and journal names or book titles.

Also: [18] Arnab Nilim and Laurent Ghaoui. -> Laurent El Ghaoui.

[5] -> missing editor name.

Author Response

We sincerely thank the reviewer for their feedback and valuable suggestions. We address each point carefully and explicitly below:

Regarding Numerical Illustrations (Weaknesses and Q2)

We would like to note that this work is primarily theoretical, and the full empirical evaluation of robust TD learning is left as future work for application papers. That said, because our guarantees hinge on the one-step semi-norm contraction properties (characterized in Lemma 3.1 and Theorem 3.2), we now include numerical evaluations that directly verify strict contraction across the settings studied. These results empirically support the key structural claim used by our analysis. Due to space constraints, we refer to our additions in the response to Reviewer rBB6 (Regarding Numerical Validations).


Regarding Discounted Robust RL Methods (Q1)

We make the following comparisons between our work and prior robust discounted RL works. In discounted RL, contraction comes for free from the discount factor $\beta<1$ in a (weighted) sup-norm [1], and standard stochastic fixed-point iterations yield rates accordingly. In contrast, the average-reward setting has no inherent contraction, and our analysis builds a new semi-norm under which the robust average-reward operator is a contraction via ergodicity. The algorithmic analysis is therefore fundamentally different from discounted works, and we will add a short remark clarifying this correspondence and the different constants.


Regarding the Robustness Level $\delta$ (Q3)

Thanks for bringing up this important point. We note that the bounds for contamination (Eq. 138) and Wasserstein (Eq. 141) do not depend on $\delta$, and thus the limit as $\delta\to 0$ is the same. The results therefore do not degrade with increasing $\delta$. However, we note that the contraction factor depends on $\delta$, since the maximum spectral radius over the uncertainty set increases with $\delta$. There is also an additional factor of $(1+\frac{1}{\delta})$ for the TV uncertainty set (see Eq. 128, 139), which is a loose bound; tightening it is left as future work. We will include this discussion in the final version.


Regarding Lower Bounds (Q4)

We note that since the non-robust problem is a special case of robust policy evaluation, the non-robust lower bound serves as a hard example for the robust setting. Nevertheless, obtaining tight lower bounds that also track $\delta$ for robust evaluation remains an interesting open problem. We will mention the above in the revised version.


Regarding Typos and the Missing Definition of $\Delta$ (Q5-Q7 and Paper Formatting Concerns)

Thanks for identifying the typos, and we will fix them in the revised draft. In particular, we will define the simplex $\Delta(\mathcal S)$ as the probability simplex over the state space and use this notation consistently.


References

[1] Weighted Sup-Norm Contractions in Dynamic Programming: A Review and Some New Applications, 2012.

Comment

I would like to thank the authors for their thorough answers and precisions, as well as for the numerical illustration.

I believe this paper is clearly written and makes an interesting contribution, and I see no major limitations, hence it should be accepted.

Comment

Thank you for your feedback. We are glad to hear that our response addressed your comments.

Review (Rating: 5)

The paper studies policy evaluation for average-reward infinite-horizon reinforcement learning where the estimated value is robust to uncertainty in the data / observed transitions. They provide a finite-sample result that is order-optimal in the accuracy level $\varepsilon$. This is achieved by considering the robust Bellman operator and showing that it is a contraction with respect to a semi-norm. However, this robust Bellman operator requires the estimation of a support function. They provide a procedure to estimate this support function (based on a truncated multi-level Monte Carlo algorithm), which can then be used to iteratively apply the robust Bellman operator, providing the finite-sample result.

Strengths and Weaknesses

Strengths

  1. The paper is very well-written/structured and easy to follow. The proof-sketches are clear and provide some good intuition for the general proof ideas.
  2. The work provides the first finite-sample results, while prior works only established asymptotic results.
  3. The algorithms extend known algorithms in novel, non-trivial ways. The analysis also includes some key novel steps such as the contraction of the robust Bellman operator with respect to a semi-norm.

Weaknesses

  1. While it is explicitly stated that the finite-time bound is order-optimal in $\varepsilon$, optimality is not discussed for other quantities (e.g. S, A). Even if it is not optimal, it would be helpful to have a discussion on the dependence on these other quantities. The lower bound justifying the claimed optimal dependence on $\varepsilon$ was not discussed, as far as I could tell.
  2. The one weakness of the structure would be that Section 5 feels a bit rushed; there is little discussion of the results in Theorems 5.1 and 5.2 (which connects to the previous point). In addition, the difference in the results between the contamination model and the TV/Wasserstein distance could also be discussed more. In particular, the difference only appears to be logarithmic in Theorem 5.1, and that difference comes from the value of $N_{\max}$ (it would also be useful to have its explicit setting in the main text). A brief discussion on how these policy evaluation methods could fit within a procedure to find the optimal policy would also be welcome.
  3. See questions below for some slightly confusing parts.

Questions

  1. In the formulation section, the uncertainty set is introduced as the $(s,a)$-rectangular compact uncertainty set, which is a bit confusing because it seems like this is the type of uncertainty set that will be considered instead of contamination, total variation and Wasserstein distance. In addition, because ultimately only samples from the centroid of this uncertainty set (the nominal MDP) are used, I wonder what the value is in introducing the rectangular set. Is it not equivalent and simpler to only define/introduce the nominal transitions and then consider the uncertainty set as defined by the contamination model, TV and Wasserstein distance? Or am I missing something?
  2. Following from the previous question, it would also be useful to provide some motivation for the different types of uncertainty sets. While the contamination model yields an algorithm and analysis that is much simpler, I assume the other two are considered because they are more realistic in the way they model uncertainty?
  3. Here is a list of a few typos: l.136: “definied”; grammar l.144-145; l.187 typo in the definition of E, which should be $E = e {d^\pi}^T$ I think.
  4. How easy/hard is it to solve the robust policy evaluation problem when you know the true nominal transitions? If it is easy, since you “focus on the model-free setting”, could you comment on the difficulties of a model-based approach where potentially you use all the samples to estimate the nominal transitions and then use this estimate as the ground truth?
  5. I was initially a bit confused by the statement of Theorem 2.1, where it seemed to me that $g$ should depend on the state and not be a scalar. This was clarified later when it was mentioned that under the ergodicity assumption, the average reward does not depend on the starting state. Is this what is being used in Theorem 2.1, or does Theorem 2.1 also hold without the assumption of ergodicity? And if Theorem 2.1 does require the assumption of ergodicity, I would suggest re-organising, or at least mentioning or referring to the independence of $g$ from the starting state, before Theorem 2.1.
  6. In the description of Algorithm 2, in line 18 the expected Bellman residual is averaged across all states. I am surprised that this works and that it should not be weighted by how often states are actually visited – could you comment?

Limitations

yes

Final Justification

I believe this paper is of good quality, clarity, significance and originality. The points raised during my review, which were mostly about clarifying details that I did not fully understand, were addressed and should be corrected adequately in the final version. I recommend acceptance.

Formatting Issues

NA

Author Response

We sincerely thank the reviewer for their feedback and valuable suggestions. We address each point carefully and explicitly below:

Regarding Optimality and Lower Bounds (Weakness 1)

We agree our rates are order-optimal only in $\epsilon$ ($\tilde O(\epsilon^{-2})$). Regarding other factors (e.g., $|\mathcal S|, |\mathcal A|$ and structural constants), we will clarify that our bounds have a linear $|\mathcal S||\mathcal A|$ factor from per-state-action oracle calls, and also depend on the contraction gap $1-\gamma$, the span bound on $V$, and an additional $\log(1/\epsilon)$ from truncation in the TV and Wasserstein cases.

Regarding the lower bounds, we note that the classical regret lower bound for finding an optimal policy in non-robust average-reward reinforcement learning in [1] is $\Omega(\sqrt{|S||A|TD})$, where $D$ denotes the diameter of the MDP. However, this does not directly translate into a policy-evaluation sample-complexity lower bound of $\Omega(\sqrt{SAD}\,\epsilon^{-2})$. This is because the conversion from regret in control to estimation error in evaluation is non-trivial and depends on additional structure (e.g., mixing time/diameter and whether a generative model is available). To the best of our knowledge, there have been no works specifically characterizing the exact lower bound for policy evaluation in (robust) average-reward reinforcement learning with a focus on $|S|$ and $|A|$. We will therefore not claim tightness in $|\mathcal S||\mathcal A|$ and treat sharpening these dependencies as open. Regarding the lower bound justifying the claimed dependence on $\epsilon$, we note that we provide standard mean estimation as a hard example and demonstrate a sample complexity of $\Omega(\epsilon^{-2})$ in Appendix F.2. We will mention the above in our revised draft.


Regarding Section 5 (Weakness 2)

We note that Section 5 is a bit concise due to page constraints, and we will expand it in the final version given the availability of an additional page. We have provided the detailed analysis for this section in Appendix F; some of the key details can be moved to Section 5 with the additional space available in the final version. We will add a discussion of the additional logarithmic terms in the sample complexity results for the TV and Wasserstein distance uncertainty sets. These are due to the bias of the estimators for the support functions $\hat{\sigma}(V)$ in Algorithm 1. In contrast, the estimator of the support function for the contamination uncertainty set is unbiased (as shown in Eq. (19)), and thus the corresponding sample complexity for policy evaluation does not have the logarithmic terms.

We also agree that a brief discussion on how these policy evaluation methods could fit within a procedure to find the optimal policy is important. Specifically, our algorithm can be used to estimate robust policy sub-gradients, which can then be utilized to perform policy optimization accordingly; we will add the above in the revised draft.


Regarding Rectangularity (Q1)

Thanks for raising this point. We note that our algorithms do focus on contamination, TV, and Wasserstein ambiguity around a nominal model $P$. We will clarify the presentation by first fixing $P$, and then defining the $(s,a)$-rectangular set as $\mathcal U=\prod_{s,a}\mathcal U_{s,a}$, where $\mathcal U_{s,a}$ is, respectively, a contamination/TV/Wasserstein neighborhood of $P(\cdot\mid s,a)$. In our algorithms, the nominal generative model first supplies samples from $P$, and then we compute the support function $\min_{q\in\mathcal U_{s,a}}\langle q, V\rangle$ by decoupling across $(s,a)$; this procedure is achievable due to the rectangularity assumption. Thus, our $(s,a)$-rectangular assumption makes Algorithm 1 implementable (closed form for contamination; truncated MLMC for TV/W) and is standard in robust DP to preserve Bellman-style decomposability. We will add the above to the revised draft.
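To make the per-$(s,a)$ computation concrete, here is a minimal sketch (not from the paper) of the support function for the contamination set, for which $\min_{q\in\mathcal U_{s,a}}\langle q,V\rangle=(1-\delta)\langle P(\cdot\mid s,a),V\rangle+\delta\min_s V(s)$; the second function replaces the nominal expectation with an empirical average over generative-model draws, and the function names are illustrative only.

```python
import numpy as np

def contamination_support(p_nominal, V, delta):
    """Closed-form support function for the delta-contamination set
    {(1 - delta) * p_nominal + delta * q : q in the probability simplex}."""
    return (1.0 - delta) * np.dot(p_nominal, V) + delta * np.min(V)

def contamination_support_mc(next_states, V, delta):
    """Sample-based estimate: the nominal expectation is replaced by an
    empirical average over next-state indices drawn from the generative model."""
    return (1.0 - delta) * np.mean(V[next_states]) + delta * np.min(V)

# illustrative 4-state example
V = np.array([1.0, 0.2, -0.5, 0.8])
p = np.array([0.4, 0.3, 0.2, 0.1])
draws = np.random.default_rng(0).choice(len(V), size=200, p=p)
print(contamination_support(p, V, delta=0.15))
print(contamination_support_mc(draws, V, delta=0.15))
```

The TV and Wasserstein cases have no such one-line closed form, which is where the truncated-MLMC oracle of Algorithm 1 comes in.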


Regarding Motivations for Different Uncertainty Sets (Q2)

We provide the motivations for the different types of uncertainty sets studied in this work as follows:

  • Contamination ($\delta$). Models outliers or rare faults [2]: with prob. $\delta$, the next-state law can be arbitrary (e.g., sensor glitch or unmodeled jump to a bad region). Leads to a simple worst-case expectation combining $P$ with the worst state value.
  • Total variation (TV radius). Models categorical misspecification or discretization error [3]: bounded $\ell_1$ mass shift per $(s,a)$; captures sparse but possibly large reallocation of probability mass.
  • Wasserstein (metric radius). Models smooth model drift when states have a geometry [4] (e.g., position/velocity): mass can move to nearby states with cost proportional to distance; realistic for learned simulators and dynamics linearization.

We will add a compact paragraph with these examples in the revised draft.


Regarding Typos (Q3)

Thanks for identifying the typos; we agree with the reviewer on the definition of $E$. The typos will be fixed in the revised draft.


Regarding Model-Based Methods (Q4)

We agree that our focus is the model-free setting. We note that the model-based counterpart is not straightforward, and there is a new paper [5] in ICML 2025 (which came online after our work was submitted) that studies the robust model-based average-reward setting with TV, chi-square, and contamination uncertainty sets. We will add this to Related Work in the final version, while noting that the model-based approach used in that paper, a reduction from the discounted to the average-reward setting, does not work well in model-free settings due to the higher dependence on $1/(1-\gamma)$ in model-free setups.


Regarding gg and Theorem 2.1 (Q5)

We agree that $g^\pi$ is a scalar independent of the starting state under a standard unichain/ergodicity assumption. Our Theorem 2.1 indeed assumes ergodicity later in the section. Thanks for pointing this out; we will move that assumption next to the theorem statement to avoid ambiguity in the revised draft.


Regarding Line 18 in Algorithm 2 (Q6)

Thanks for pointing out this detail. In our algorithm, we use uniform averaging to control the span/semi-norm uniformly over states. The analysis also goes through with stationary-distribution weighting ($d_{P}$); we will add a remark noting this variant and that it can reduce variance in practice without changing the rate.


References

[1] Near-optimal regret bounds for reinforcement learning, 2008.

[2] A General Decision Theory for Huber’s ϵ\epsilon-Contamination Model, 2017.

[3] Partial Policy Iteration for L1L_1-Robust Markov Decision Processes, 2021.

[4] First-Order Methods for Wasserstein Distributionally Robust MDPs, 2021.

[5] A Reduction Framework for Distributionally Robust Reinforcement Learning under Average Reward, 2025.

Comment

Thank you for your thorough response. All my points have been addressed and I am happy to maintain my current score.

Comment

Thank you for your feedback. We are glad to hear that our response addressed your comments.

Review (Rating: 4)

The paper investigates the finite-sample performance of policy evaluation in robust average-reward Markov Decision Processes (MDPs), where the transition dynamics are only partially known. The authors first establish that the robust Bellman operator, central to the robust dynamic programming framework, is a contraction mapping---guaranteeing convergence of value iteration. Building on this property, the paper derives a near-optimal sample complexity bound of $\tilde{O}(\epsilon^{-2})$ for estimating the value of a fixed policy under worst-case transition dynamics. These results provide strong theoretical guarantees for robust policy evaluation in average-reward settings.

Strengths and Weaknesses

Strengths:

  1. The paper tackles a timely and important problem in reinforcement learning, namely robust policy evaluation in average-reward MDPs—a topic of both theoretical and practical significance.

  2. Establishing near-optimal sample complexity bounds is a strong theoretical contribution and highlights the algorithm’s statistical efficiency.

  3. The presentation is generally clear, and the technical results are well-structured and accessible.

Weaknesses:

  1. The paper lacks numerical validation. While the sample complexity result is theoretically compelling, the absence of empirical experiments makes it difficult to assess the practical utility and robustness of the proposed approach.

  2. Assumption 2.3 appears to be quite restrictive and may not hold except for very small ambiguity sets $\mathcal{P}$. It would be helpful to provide a practical discussion or example illustrating how this assumption can be verified or approximately satisfied in realistic settings.

Questions

  1. I am confused by the sample requirements in Algorithm 1. It appears that exponentially many i.i.d. samples are needed from the generative model. Is this interpretation correct? If so, this would be practically infeasible and significantly limit the applicability of the method. Please clarify.

  2. The definition of the support function in Theorem 2.1 is unclear. I assume the intended definition is $\sigma(V) = \min_{p} \langle p, V \rangle$, but the inner product notation is missing. Similar notational issues also appear in Equation (7). Please revise for precision.

  3. The vector $e$ in Lemma 3.1 is used without being defined. Please introduce and explain it when it first appears.

  4. The references section appears to contain duplicate entries, specifically [8] and [9]. Please clean up and consolidate the bibliography.

Limitations

The paper has two significant limitations that are not easily addressed. First, the assumption of finite state and action spaces confines the results to relatively simple environments, limiting their applicability to the large-scale or continuous domains commonly encountered in modern reinforcement learning. Second, the dependence on a generative model—capable of providing i.i.d. samples (exponentially many?) on demand—is often unrealistic, as many practical settings involve only a single trajectory or a fixed batch of data. While these assumptions are standard in theoretical work, it would be valuable for the authors to comment on possible strategies to relax them or to highlight the key obstacles in doing so.

Final Justification

I would like to thank the authors for the careful revision and in particular for providing some numerical evidence for the theoretical results in this paper. Given that many of my points have been clarified, I am willing to raise my score.

Formatting Issues

none

Author Response

We sincerely thank the reviewer for their feedback and valuable suggestions. We address each point carefully and explicitly below:

Regarding Numerical Validations (Weakness 1)

We would like to note that this work is primarily theoretical, and the full empirical evaluation of robust TD learning is left as future work for application papers. That said, because our guarantees hinge on the one-step semi-norm contraction properties (characterized in Lemma 3.1 and Theorem 3.2), we now include numerical evaluations that directly verify strict contraction across the settings studied. These results empirically support the key structural claim used by our analysis.

Evaluation of Lemma 3.1

Lemma C.2 is the technical backbone for Lemma 3.1, as Lemma C.2 constructs the fixed-kernel semi-norm and provides the one-step contraction for a given $P$. So we perform numerical evaluations on Lemma C.2 to demonstrate the one-step contraction property for Lemma 3.1.

For an ergodic Markov matrix $P$ with stationary distribution $d$, we follow the steps in Appendix C.1 and construct $\|\cdot\|_P$ in Eq. (36) with $\alpha=\min(0.99,(1+\rho(Q))/2)$ and $\epsilon=0.25(1-\alpha)$, where $Q=P-\mathbf e d^\top$. The one-step contraction factor is then $\beta=\alpha+\epsilon<1$.

To generate ergodic matrices of dimension $n$, let $I_n$ be the identity matrix and $S_n$ the cyclic shift matrix ($S_n e_i=e_{i+1 \bmod n}$). We provide the following 4 examples:

  • $P_1 = 0.5I_5 + 0.5S_5$
  • $P_2 = 0.6I_6 + 0.4S_6$
  • $P_3 = 0.55I_7 + 0.45S_7$
  • $P_4 = 0.6I_8 + 0.3S_8 + 0.1S_8^2$

We generate 1000 random unit vectors $x$ and calculate the ratio $\frac{\|P_i x\|_P}{\|x\|_P}$ for each. We present the empirical results as follows:

| matrix | $n$ | max span ratio | $\rho(Q)$ | $\alpha$ | $\epsilon$ | $\beta$ | ratio_min | ratio_median | ratio_p90 | ratio_max |
|---|---|---|---|---|---|---|---|---|---|---|
| $P_1$ | 5 | 1 | 0.8090 | 0.9045 | 0.0239 | 0.9284 | 0.3824 | 0.7950 | 0.8077 | 0.8090 |
| $P_2$ | 6 | 1 | 0.8718 | 0.9359 | 0.0160 | 0.9519 | 0.5197 | 0.8510 | 0.8687 | 0.8718 |
| $P_3$ | 7 | 1 | 0.9020 | 0.9510 | 0.0122 | 0.9633 | 0.5861 | 0.8799 | 0.8976 | 0.9014 |
| $P_4$ | 8 | 1 | 0.8700 | 0.9350 | 0.0162 | 0.9513 | 0.4855 | 0.8226 | 0.8604 | 0.8685 |
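As a concrete illustration of the construction above, the following sketch (our own, not the authors' released code) rebuilds $P_1,\dots,P_4$ and reproduces the $\rho(Q)$, $\alpha$, $\epsilon$, and $\beta$ columns of the table; the semi-norm of Eq. (36) itself is not reproduced on this page, so the ratio columns are not recomputed here.

```python
import numpy as np

def shift(n):
    """Cyclic shift matrix S_n with S_n e_i = e_{(i+1) mod n}."""
    return np.roll(np.eye(n), 1, axis=0)

def stationary(P):
    """Stationary distribution of an ergodic row-stochastic P (left eigenvector for eigenvalue 1)."""
    w, V = np.linalg.eig(P.T)
    d = np.real(V[:, np.argmin(np.abs(w - 1.0))])
    return d / d.sum()

def contraction_constants(P):
    """rho(Q), alpha, eps, beta with Q = P - e d^T, as specified in the response."""
    n = P.shape[0]
    Q = P - np.outer(np.ones(n), stationary(P))
    rho = np.max(np.abs(np.linalg.eigvals(Q)))
    alpha = min(0.99, (1.0 + rho) / 2.0)
    eps = 0.25 * (1.0 - alpha)
    return rho, alpha, eps, alpha + eps

P1 = 0.5 * np.eye(5) + 0.5 * shift(5)
P2 = 0.6 * np.eye(6) + 0.4 * shift(6)
P3 = 0.55 * np.eye(7) + 0.45 * shift(7)
P4 = 0.6 * np.eye(8) + 0.3 * shift(8) + 0.1 * np.linalg.matrix_power(shift(8), 2)

for name, P in [("P1", P1), ("P2", P2), ("P3", P3), ("P4", P4)]:
    print(name, ["%.4f" % v for v in contraction_constants(P)])
```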

Evaluation of Theorem 3.2

Lemma C.4 is the key step for Theorem 3.2, as it proves a uniform one-step contraction across all $P$ in the uncertainty set. We therefore perform numerical evaluations on Lemma C.4 to demonstrate the one-step contraction property for Theorem 3.2 under the contamination, TV, and Wasserstein-1 distance uncertainty sets. We select $P_1$, $P_2$, $P_3$, and $P_4$ as four examples of the nominal model.

To numerically approximate $\lVert\cdot\rVert_{\mathcal P}$ defined in Eq. (50), we (i) discretize the uncertainty set and (ii) use finite products to approximate the extremal norm. First, we sample a family $\{P^{(i)}\}_{i=1}^m\subset\mathcal P$ of size $m$ and form $Q_i:=P^{(i)}-\mathbf e d_{P^{(i)}}^\top$ with $\hat r=\max_i \rho(Q_i)$. We set $\alpha=\min(0.99,(1+\hat r)/2)$ and $\epsilon\in(0,1-\alpha)$. To approximate the extremal norm, we build a library of scaled products of the $Q_i$'s up to a maximum length $K$: for each $k=0,1,\dots,K$ we draw a set of products $M_{k,j}=Q_{i_k}\cdots Q_{i_1}$; the number of such draws at each $k$ is the products per length (denoted samples_per_k in the tables). This defines the surrogate

$$\|z\|_{\rm ext}^{(K)}=\max_{0\le k\le K,\,j}\ \alpha^{-k}\|M_{k,j}z\|_2.$$

We then set

$$\|x\|_{\mathcal P}^{(K)}=\max_{i=1,\dots,m} \|Q_i x\|_{\rm ext}^{(K)}+\epsilon\min_{c}\|x-c\,\mathbf e\|_{\mathrm{ext}}^{(K)}.$$

We generate 50 random unit vectors $x$ for each of the sampled uncertainty matrices in $\{P^{(i)}\}_{i=1}^m\subset\mathcal P$, and calculate the ratio of $\|P^{(i)} x\|_{\mathcal{P}}$ to $\|x\|_{\mathcal{P}}$.
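A minimal sketch of this surrogate computation is given below (our own illustration, assuming the family $\{P^{(i)}\}$ has already been drawn from the uncertainty set; the placeholder `sample_family` uses a simple row-wise contamination-style perturbation of the nominal kernel and is not the sampler used for the tables, and `stationary` is repeated from the previous sketch for self-containment).

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)

def stationary(P):
    w, V = np.linalg.eig(P.T)
    d = np.real(V[:, np.argmin(np.abs(w - 1.0))])
    return d / d.sum()

def sample_family(P_nominal, delta, m):
    """Placeholder sampler: row-wise mixtures (1-u) P + u q with u <= delta."""
    n = P_nominal.shape[0]
    return [(1 - u) * P_nominal + u * rng.dirichlet(np.ones(n), size=n)
            for u in delta * rng.random(m)]

def product_library(Qs, K, samples_per_k):
    """Random products M_{k,j} = Q_{i_k} ... Q_{i_1} for k = 0, ..., K."""
    n = Qs[0].shape[0]
    lib = [(0, np.eye(n))]
    for k in range(1, K + 1):
        for _ in range(samples_per_k):
            M = np.eye(n)
            for i in rng.integers(len(Qs), size=k):
                M = Qs[i] @ M
            lib.append((k, M))
    return lib

def ext_norm(z, lib, alpha):
    # surrogate ||z||_ext^(K) = max over stored products of alpha^{-k} ||M z||_2
    return max(np.linalg.norm(M @ z) / alpha**k for k, M in lib)

def robust_seminorm(x, Qs, lib, alpha, eps):
    # ||x||_P^(K) = max_i ||Q_i x||_ext + eps * min_c ||x - c e||_ext
    span = minimize_scalar(lambda c: ext_norm(x - c * np.ones_like(x), lib, alpha)).fun
    return max(ext_norm(Q @ x, lib, alpha) for Q in Qs) + eps * span

# one-step ratio check around the nominal P1 (illustrative parameters)
P_nom = 0.5 * np.eye(5) + 0.5 * np.roll(np.eye(5), 1, axis=0)
family = sample_family(P_nom, delta=0.15, m=10)
Qs = [P - np.outer(np.ones(5), stationary(P)) for P in family]
alpha = min(0.99, (1 + max(np.max(np.abs(np.linalg.eigvals(Q))) for Q in Qs)) / 2)
eps = 0.25 * (1 - alpha)
lib = product_library(Qs, K=3, samples_per_k=25)
ratios = []
for _ in range(20):
    x = rng.standard_normal(5)
    x /= np.linalg.norm(x)
    nx = robust_seminorm(x, Qs, lib, alpha, eps)
    ratios.append(max(robust_seminorm(P @ x, Qs, lib, alpha, eps) for P in family) / nx)
print("max empirical one-step ratio:", max(ratios))
```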

We present the empirical results as follows:

Contamination

| nominal distribution | $n$ | $\delta$ | $m$ | $K$ | samples_per_k | max span ratio | $\hat{r}$ | $\alpha$ | $\epsilon$ | $\gamma$ | ratio_min | ratio_median | ratio_p90 | ratio_max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $P_1$ | 5 | 0.15 | 30 | 3 | 25 | 1 | 0.8138 | 0.9069 | 0.0233 | 0.9302 | 0.2210 | 0.6396 | 0.8057 | 0.8134 |
| $P_2$ | 6 | 0.15 | 30 | 3 | 25 | 1 | 0.8807 | 0.9403 | 0.0149 | 0.9553 | 0.2957 | 0.6581 | 0.8630 | 0.8785 |
| $P_3$ | 7 | 0.15 | 30 | 3 | 25 | 1 | 0.9067 | 0.9534 | 0.0117 | 0.9650 | 0.3217 | 0.6683 | 0.8700 | 0.8901 |
| $P_4$ | 8 | 0.15 | 30 | 3 | 25 | 1 | 0.8812 | 0.9406 | 0.0148 | 0.9555 | 0.3739 | 0.5964 | 0.8207 | 0.8624 |

TV

| nominal distribution | $n$ | $\delta$ | $m$ | $K$ | samples_per_k | max span ratio | $\hat{r}$ | $\alpha$ | $\epsilon$ | $\gamma$ | ratio_min | ratio_median | ratio_p90 | ratio_max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $P_1$ | 5 | 0.15 | 30 | 3 | 25 | 1 | 0.8280 | 0.9140 | 0.0215 | 0.9355 | 0.3239 | 0.7609 | 0.8239 | 0.8280 |
| $P_2$ | 6 | 0.15 | 30 | 3 | 25 | 1 | 0.9013 | 0.9507 | 0.0123 | 0.9630 | 0.4021 | 0.7904 | 0.8862 | 0.9006 |
| $P_3$ | 7 | 0.15 | 30 | 3 | 25 | 1 | 0.9175 | 0.9588 | 0.0103 | 0.9691 | 0.4457 | 0.7918 | 0.8962 | 0.9162 |
| $P_4$ | 8 | 0.15 | 30 | 3 | 25 | 1 | 0.8805 | 0.9403 | 0.0149 | 0.9552 | 0.4497 | 0.7503 | 0.8449 | 0.8739 |

Wasserstein-1

| nominal distribution | $n$ | $\delta$ | $m$ | $K$ | samples_per_k | max span ratio | $\hat{r}$ | $\alpha$ | $\epsilon$ | $\gamma$ | ratio_min | ratio_median | ratio_p90 | ratio_max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $P_1$ | 5 | 0.15 | 30 | 3 | 25 | 1 | 0.8184 | 0.9092 | 0.0227 | 0.9319 | 0.3698 | 0.7569 | 0.8141 | 0.8188 |
| $P_2$ | 6 | 0.15 | 30 | 3 | 25 | 1 | 0.8900 | 0.9450 | 0.0138 | 0.9587 | 0.4257 | 0.7889 | 0.8774 | 0.8894 |
| $P_3$ | 7 | 0.15 | 30 | 3 | 25 | 1 | 0.9110 | 0.9555 | 0.0111 | 0.9666 | 0.4262 | 0.7818 | 0.8827 | 0.9080 |
| $P_4$ | 8 | 0.15 | 30 | 3 | 25 | 1 | 0.8758 | 0.9379 | 0.0155 | 0.9534 | 0.4413 | 0.7224 | 0.8518 | 0.8689 |

Interpretations

Note that in the tables above, max span ratio denotes the empirical largest one-step span contraction coefficient over the sampled families. This value equals 1 in all settings, meaning there is no strict one-step contraction in the span.

The quantities ratio_min / ratio_median / ratio_p90 / ratio_max summarize the empirical one-step ratios computed under the constructed semi-norms over all sampled $P$ and random unit directions $x$. They report, respectively, the minimum, median, 90th percentile, and maximum observed value across those tests. In every table we have ratio_max < 1, so we observe empirical one-step contraction under our semi-norms even when the span does not contract. Moreover, ratio_max $\le\gamma$ (robust case) or $\le\beta$ (non-robust case), which is consistent with the corresponding theoretical contraction factor guaranteed by our constructions.

The above numerical results and the code for the experiments will be released in the revised manuscript.


Regarding Assumption 2.3 (Weakness 2)

Thanks for raising this question. We provide Remark 2.4 after Assumption 2.3 showing how it relaxes to ergodicity of the nominal model only, when the uncertainty radius is small. The assumption of ergodicity for the nominal model aligns with common practice in the non-robust average-reward RL literature (such as [1]-[3]). Additionally, in our numerical validations we also assume that only the nominal model is ergodic and still observe strict contraction for a reasonably large radius, in our case $\delta=0.15$.


Regarding Algorithm 1's Sample Complexity (Q1)

We would like to clarify that in Algorithm 1, the expected sample cost of a single call to the robust-expectation oracle is $O(N_{\max})$ by Theorem 4.1, and in Appendix F.4 we set $N_{\max}=O(\log(1/\epsilon))$ to control bias/variance. Hence Algorithm 1 does not require exponentially many i.i.d. samples; the overall policy-evaluation complexity remains $\tilde O(\epsilon^{-2})$. Due to space constraints, we refer to our response to Reviewer uoff (Regarding MLMC) for a more detailed explanation of why we do not need exponentially many i.i.d. samples. We will mention the above in the revised draft to prevent this misunderstanding.


Regarding Typos and the Definition of $\mathbf e$ (Q2-Q4)

Thanks for identifying the typos; we agree with the reviewer on the definition of $\sigma(V)$ and the duplicated reference, and they will be fixed in the revised draft. We note that $\mathbf e$ is initially defined in line 144; we will also re-emphasize this definition in Lemma 3.1 in the revised draft.


Regarding Limitations

We agree that our tabular analysis and the use of a generative model are simplifying assumptions. Our goal here is to establish the first finite-sample guarantees for policy evaluation in robust average-reward RL. To the best of our knowledge, the most closely related works cited in Appendix A also study tabular problems and adopt a generative-model abstraction. This abstraction is particularly natural in the robust setting, where the motivation assumes access to a simulated environment. We have noted these limitations in the conclusion. We view these as promising directions but outside the scope of this first finite-sample study.


References

[1] Finite Sample Analysis of Average-Reward TD Learning and Q-Learning, 2021.

[2] Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation, 2021.

[3] A sharper global convergence analysis for average reward reinforcement learning via an actor-critic approach, 2025.

Comment

Dear Reviewer rBB6,

Thank you again for your thoughtful and constructive comments. We wanted to follow up to see if you had any additional questions or concerns following our rebuttal. We’d be glad to provide further clarification during the discussion phase if needed.

Regards,

Authors

Review (Rating: 5)

This paper studies the finite-sample complexity bound for robust policy evaluation in average-reward settings. A key technical challenge in this context is that, as in the classic MDP setting, the Bellman operator is no longer a contraction. Moreover, the ambiguity set introduces nonlinearity, making the simple Monte Carlo method inapplicable. The authors resolve the first issue by proving the existence of a seminorm under which the evaluation operator is a contraction, provided that the Markov chain induced by any policy and adversarial kernel is recurrent and aperiodic. The second issue is addressed with a truncated multilevel Monte Carlo estimator. The combination of the two leads to an $\widetilde O(n^{-1/2})$-convergent model-free algorithm.

Strengths and Weaknesses

The paper analyzes a model-free algorithm (TD learning), which is not only theoretically more challenging than its model-based counterparts, but also potentially generalizable to continuous state space models. Moreover, the paper establishes the contraction property of the Bellman operator under recurrence and aperiodicity, which could be of theoretical value to future work.

However, the contraction property does not have a concrete physical interpretation (such as mixing time or the span norm, both commonly used in classical MDP settings), and is not easily accessible because the relevant seminorm is obtained non-constructively.

Moreover, based on some of the results in the paper and my personal expertise in this domain, I believe it should be straightforward to obtain an $\widetilde O(\epsilon^{-2})$ sample complexity using model-based methods.

Furthermore, if we consider the minimax sample complexity in this setting, I do not believe it should depend on $(1 - \gamma)$, which represents the one-step contraction factor. In classical MDP settings with mixing, the one-step Bellman operator is not a span contraction (for similar reasons as those discussed in this paper—it is a contraction under a certain semi-norm). Nevertheless, the sample complexity is still $\widetilde{O}(t_{\text{mix}} / \epsilon^2)$, independent of the one-step contraction factor. With that said, I think this is understandable given the focus on model-free average-reward MDP algorithms, which represent a significantly more challenging setting.

Given these, I think the narrative of the paper should be oriented toward the model-free aspect, emphasizing their algorithm's versatility.

On the other hand, even though TD learning is interesting in the robust MDP setting due to the nonlinearity introduced by the adversarial dynamics, one could argue that the control (i.e., policy optimization) problem is of greater importance to practitioners.

Finally, there is no numerical validation in the paper to support the theory. I would appreciate it if the authors could provide some numerical experimentation, especially given that I have some reservations about the analysis, as the current bias and variance bounds do not seem to suggest an $\widetilde O(\epsilon^{-2})$ sample complexity. See below. I am not certain about this last assessment, but I do believe that the analysis is not tight enough.

Questions

  • Line 86: The MLMC-based estimators in the two Blanchet papers do not need an infinite expected number of samples. Also, the MLMC-based Q-learning in [1] has a non-asymptotic expected sample complexity of $\widetilde O(\epsilon^{-2})$. Could this be adapted in the context of this paper?

  • Line 184: The kernel of a seminorm is not defined.

  • Theorems 3.1 and 3.2 seem very interesting. The authors should consider providing a simple (not span-contracting) example to illustrate this new seminorm.

  • Is there a relationship between $1/(1-\gamma)$ and $t_{mix}$? I would guess that $1/(1-\gamma) = \Omega(t_{mix})$, but in general do we have $1/(1-\gamma) = \Theta(t_{mix})$?

  • What is the bias level given an optimal allocation of the samples? In MLMC, one would typically get $O(1)$ variance. To converge with constant stepsize, if the one-step bias is $b$, then the total bias scales like $\sum_k (1-\eta)^k \eta b = b$. So one would make $b = \epsilon$ to converge. The bias under truncation at level N is $2^{-N}$ due to the odd-and-even sampling (I think the analysis in this paper had $2^{-N/2}$, which could be improved). If one uses a bias of $2^{-N/2}$, the level has to be $\log(\epsilon^{-2})$. This will use $\epsilon^{-2}$ samples within one iteration with constant-order probability. So, this analysis seems to suggest the current bias bound is not sufficiently tight to get the $\epsilon^{-2}$ rate, as it didn't account for the odd-and-even sampling in [2]. I am a little concerned about the validity of the convergence analysis.

  • I am wondering why the authors don't consider policy optimization. It seems that most of the arguments still go through. Is it because 3.1 or 3.2 fails?

  • Equation (41): would you need the mapping $P\rightarrow \rho(Q_P)$ to be continuous?

[1] Wang, S., Si, N., Blanchet, J. & Zhou, Z.. (2023). A Finite Sample Complexity Bound for Distributionally Robust Q-learning. Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 206:3370-3398.

[2] Blanchet, J. H., Glynn, P. W., & Pei, Y. (2019). Unbiased multilevel Monte Carlo: Stochastic optimization, steady-state simulation, quantiles, and other applications. arXiv preprint arXiv:1904.09929.

Limitations

The limitation is not sufficiently discussed. Please see the weaknesses I have listed above.

Final Justification

The authors are able to address my main concerns about the validity of their approach.

Formatting Issues

NA

Author Response

We sincerely thank the reviewer for their feedback and valuable suggestions. We address each point carefully and explicitly below:

Regarding MLMC (Q1, Q5 and Weaknesses)

Thanks for bringing up the comparison between our work and the two Blanchet papers. We note that [1] obtains $\tilde O(\epsilon^{-2})$ using unbiased and untruncated MLMC under a smooth-functional assumption (differentiable with Hölder-continuous derivative, as mentioned in their Theorem 1), which enables second-order multilevel cancellation. [2] studies discounted robust RL with KL-divergence uncertainty sets, where these conditions hold, so the unbiased-MLMC analysis in [1] carries through. In contrast, we focus on TV and Wasserstein distance uncertainty sets, and the support functions in our setting are Lipschitz with no uniform differentiability guarantees. Thus, the analysis of the unbiased MLMC in [1] and [2] is no longer applicable. We therefore adopt a different framework in which we apply a truncated MLMC method and still obtain the $\tilde O(\epsilon^{-2})$ sample complexity.

We note that we characterize the sample complexity, bias, and variance properties of our truncated MLMC approach (Algorithm 1) in Theorems 4.1-4.4. In particular, due to truncation, the expected number of samples needed to run Algorithm 1 is $O(N_{\rm max})$ instead of $O(2^{N_{\rm max}})$, with $N_{\rm max}$ being the maximum truncation level. Thus, we agree with the reviewer that if one uses a bias of $2^{-N/2}$, the level is indeed $O(\log(\epsilon^{-2}))$, which is equivalent to $O(\log(\epsilon^{-1}))$ (what we wrote in the draft). However, this uses $O(\log(\epsilon^{-2}))$ samples instead of $O(\epsilon^{-2})$ within one iteration with constant-order probability, which still leads to our $\tilde{O}(\epsilon^{-2})$ sample complexity result.
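To make the truncation mechanism concrete, here is a minimal sketch (our own paraphrase of Algorithm 1, not the authors' implementation) of a truncated-MLMC estimator of $f(\mathbb E[X])$: the level is drawn from a geometric distribution truncated at $N_{\max}$, the multilevel correction uses the odd/even split, and the toy usage at the end substitutes $f(m)=\sqrt{m}$ for the support-function oracle $\hat\sigma(V)$; the geometric parameter `psi` and the function names are placeholders.

```python
import numpy as np

def truncated_mlmc(sample, f, n_max, psi=0.5, rng=None):
    """Truncated MLMC estimate of f(E[X]).

    sample(k) returns k i.i.d. draws of X (rows); f maps an empirical mean to a
    scalar (e.g., a support-function value of the empirical next-state law).
    The level N is geometric with parameter psi, truncated at n_max.
    """
    rng = rng or np.random.default_rng()
    pmf = psi * (1 - psi) ** np.arange(n_max + 1)   # truncated geometric pmf
    pmf /= pmf.sum()
    N = rng.choice(n_max + 1, p=pmf)
    X = sample(2 ** (N + 1))                        # 2^{N+1} base samples
    full = f(X.mean(axis=0))                        # estimate from all samples
    halves = 0.5 * (f(X[0::2].mean(axis=0)) + f(X[1::2].mean(axis=0)))
    delta = full - halves                           # multilevel correction at level N
    return f(X[:1].mean(axis=0)) + delta / pmf[N]   # coarse term + debiased correction

# toy check: estimate sqrt(E[X]) = sqrt(0.3) for Bernoulli(0.3) draws
rng = np.random.default_rng(1)
draws = lambda k: rng.binomial(1, 0.3, size=(k, 1)).astype(float)
estimates = [truncated_mlmc(draws, lambda m: np.sqrt(m[0]), n_max=10, rng=rng)
             for _ in range(20000)]
print(np.mean(estimates), np.sqrt(0.3))
```

Averaging the estimator recovers $f(\mathbb E[X])$ up to a bias that decays with $N_{\max}$, while the expected per-call sample count stays $O(N_{\max})$ because the level probability halves at each level while the batch size doubles.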


Regarding Model-Based Methods (Weaknesses)

We agree that our focus is model-free, since the problem of policy evaluation is by nature essentially model-free, and we will add emphasis on this in the revised draft. We note that the problem is not straightforward even in model-based settings, and indeed there is a new paper [3] in ICML 2025 (which came online after our work was submitted) that studies the robust model-based average-reward setting with TV, chi-square, and contamination uncertainty sets. We will add this to Related Work in the final version, while noting that the model-based approach used in that paper, a reduction from the discounted to the average-reward setting, does not work well in model-free settings due to the higher dependence on $1/(1-\gamma)$ in model-free setups.


Regarding the Semi-Norm Contraction and $\gamma$ (Q2-Q4 and Weaknesses)

Thanks a lot for the great suggestion of providing examples to demonstrate the contraction properties in Lemma 3.1 and Theorem 3.2. Due to space constraints, we refer to our additions in the response to Reviewer rBB6 (Regarding Numerical Validations), where we provide detailed numerical validations and demonstrate the semi-norm contraction properties.

Regarding the interpretation of $\gamma$: note that in Lemma C.3, by setting $\epsilon \rightarrow 0$ and $\alpha \rightarrow r^\star$, our construction allows $\gamma$ to be made arbitrarily close to $r^\star = \sup_{P\in\mathcal P}\rho(Q_P)$ defined in Eq. (41). Since $r^\star$ represents the second-largest eigenvalue among all the $P\in\mathcal P$, this semi-norm is constructive and $\gamma$ tracks the spectrum that matters for stability. We note that in the reversible (or lazy-reversible) case, $r^\star$ directly controls mixing: for each $P$ we have $t_{\mathrm{mix}}=\tilde O\big(1/(1-r^\star)\big)$ (shown in Theorem 12.4 of [4]). For non-reversible chains, the standard replacement is the pseudo-spectral gap via multiplicative reversibilization; we will add a short remark connecting $r^\star$ to that quantity for completeness.

We also would like to note that the kernel of a semi-norm is the set of all vectors in the vector space that map to zero under the semi-norm, we will add this explanation in the revised draft to improve clarity.


Regarding Future Work (Q6 and Weaknesses)

We agree with the reviewer that the policy optimization and control problem is an important direction. However, policy optimization builds on policy evaluation, which is the focus of this paper; one cannot obtain policy optimization results without a complete analysis of policy evaluation. Further, we note that current policy optimization papers in the robust average-reward setup [5] assume knowledge of exact sub-gradients, while our method provides actual sampling techniques with finite-time analysis for policy evaluation. Thus, our work can potentially be used in their analysis to replace the knowledge of exact sub-gradients. However, a detailed analysis is left as future work and will be mentioned in the paper.


Regarding the Mapping $P\mapsto\rho(Q_P)$ (Q7)

We note that the continuity of the mapping holds under the assumptions in the paper. On finite ergodic chains, the stationary distribution $d_P$ depends continuously on $P$; hence $Q_P=P-\mathbf e d_P^\top$ is continuous in $P$. The spectral radius is a continuous function of the matrix entries (Appendix D in [6]), so the composition $P\mapsto\rho(Q_P)$ is continuous on the ergodic region. We will add the detailed proof of this in the final version.


References

[1] Unbiased multilevel Monte Carlo: Stochastic optimization, steady-state simulation, quantiles, and other applications, 2019.

[2] A Finite Sample Complexity Bound for Distributionally Robust Q-learning, 2023.

[3] A Reduction Framework for Distributionally Robust Reinforcement Learning under Average Reward, 2025.

[4] Markov Chains and Mixing Times, 2017.

[5] Policy Optimization for Robust Average Reward MDPs, 2024.

[6] Matrix Analysis, 2012.

Comment

In contrast, we focus on TV and Wasserstein distance uncertainty sets, and the support functions in our setting are Lipschitz with no uniform differentiability guarantees.

This is not entirely true, in the sense that if you assume there is a minimum probability $p$ at the center of the uncertainty set, then with a sample size of $\Omega(p^{-1},\epsilon^{-2})$ one can estimate the transition probability within error $\epsilon$. Then one would still have differentiability if the empirical kernel is close to the center of the uncertainty set. Nevertheless, this might be a more complicated approach that leads to looser bounds. Moreover, I think the current algorithm has the advantage that the maximum sample size is bounded, whereas the fully randomized unbiased approach will have a (random) sample size with infinite variance in the smooth regime when one can choose the geometric probability within $(1/2,3/4)$.

We note that we characterize the sample complexity, bias, and variance properties of our truncated MLMC approach (Algorithm 1) in Theorems 4.1-4.4. In particular, due to truncation, the expected number of samples needed to run Algorithm 1 is $O(N_{\rm max})$ instead of $O(2^{N_{\rm max}})$, with $N_{\rm max}$ being the maximum truncation level. Thus, we agree with the reviewer that if one uses a bias of $2^{-N/2}$, the level is indeed $O(\log(\epsilon^{-2}))$, which is equivalent to $O(\log(\epsilon^{-1}))$ (what we wrote in the draft). However, this uses $O(\log(\epsilon^{-2}))$ samples instead of $O(\epsilon^{-2})$ within one iteration with constant-order probability, which still leads to our $\tilde{O}(\epsilon^{-2})$ sample complexity result.

Thank you for the clarification. Now I think I understand the tradeoff. Because you can now use $\Psi = 1/2$, this gives you the $\widetilde O(N)$ sample count. But doing this will lead to a variance that is not $O(1)$; in your case, this is $O(N) = \log(1/\epsilon)$ when $\Psi = 1/2$, which is fine for getting the $\widetilde O(\epsilon^{-2})$.

We agree that our focus is model-free, since the problem of policy evaluation is by nature essentially model-free, and we will add emphasis on this in the revised draft. We note that the problem is not straightforward even in model-based settings, and indeed there is a new paper [3] in ICML 2025 (which came online after our work was submitted) that studies the robust model-based average-reward setting with TV, chi-square, and contamination uncertainty sets. Further, we note that current policy optimization papers in the robust average-reward setup [5] assume knowledge of exact sub-gradients, while our method provides actual sampling techniques with finite-time analysis for policy evaluation.

Recently, I have seen works newer than [3] (which the authors should review) on policy optimization that achieve a better complexity (without the $1/(1-\gamma)$) [01, 02]. In particular, [01] seems to be a model-free approach. This makes me think that the $1/(1-\gamma)$ might only be an artifact of your analysis. In particular, I'm wondering if this is a fundamental problem with TD- or Q-learning-style algorithms, or an artifact of your analysis.

Nevertheless, the authors addressed my major concerns about the validity of their truncation approach. So, I will raise the score.

[01] Roch, Zachary, et al. "A finite-sample analysis of distributionally robust average-reward reinforcement learning." arXiv preprint arXiv:2505.12462 (2025).

[02] Chen, Zijun, Shengbo Wang, and Nian Si. "Sample Complexity of Distributionally Robust Average-Reward Reinforcement Learning." arXiv preprint arXiv:2505.10007 (2025).

Comment

Thank you for your feedback and for the corrections regarding the details of the MLMC approaches. We are glad to hear that our response addressed your major concerns.

Additionally, thanks for bringing up [01] and [02]. We note that according to NeurIPS policy, both [01] and [02] are considered concurrent to this paper (and were posted around the NeurIPS main paper deadline). Nevertheless, we agree that [02] is a model-based method while [01] utilizes a model-free value-iteration approach. Specifically, [01] studies robust average-reward RL under contamination and $l_p$-norm uncertainty sets, leveraging a value-iteration-style Halpern iteration (RHI). Since [01] utilizes properties of non-expansiveness (along with some properties of the uncertainty sets) instead of contraction, [01] does not have the contraction factor in its sample complexity results. We also note that RHI is analyzed under the assumption that every stochastic increment it adds to the running Bellman estimate is unbiased. This is guaranteed by the recursive sampling procedure. For contamination and $l_p$-norm uncertainty, the robust Bellman update can be expressed as one next-state expectation plus a deterministic norm term, so a single draw gives an increment whose conditional mean is exactly the true difference, guaranteeing unbiasedness. However, for a Wasserstein ball, the correction is a global optimal-transport cost that no finite set of local samples can reproduce on average, so any stochastic estimate is inevitably biased. Thus, one limitation of [01] is that RHI fails under Wasserstein uncertainty sets, while our work does not have this limitation.

We will add a discussion of the above concurrent works in the revised draft.

Comment

Dear Reviewers,

Please take a look at the authors' response and discuss whether there are any remaining concerns that need to be clarified. Thanks, AC

Final Decision

This paper develops an algorithm based on MLMC to solve the policy evaluation problem for robust average-reward reinforcement learning, and derives the finite-sample analysis. The overall sample complexity is $O(\epsilon^{-2})$. In order to solve this problem, this paper develops some interesting results, including the contraction property of the Bellman operator. The reviewers requested numerical results during the discussion, and the authors have provided them. It would strengthen the paper if these results were incorporated into the future version of this paper. This is the first paper providing finite-sample analysis for robust average-reward RL. Given the significance of the results, I therefore recommend acceptance of this paper.