Finite-Sample Analysis of Policy Evaluation for Robust Average Reward Reinforcement Learning
We present the first finite-sample analysis for policy evaluation in robust average-reward reinforcement learning using semi-norm contractions.
Abstract
Reviews and Discussion
This paper proposes an algorithm to solve the policy evaluation problem in the context of robust average reward RL. To achieve this, it analyzes the contractivity of the robust Bellman operator under a well-chosen semi-norm, then proposes an efficient estimation method for approximating a subproblem appearing when computing the Bellman operator, and then finally proposes an iterative sample-based algorithm to approach the fixed point of the robust Bellman equation. Sample complexities are provided for this algorithm.
Strengths and Weaknesses
This paper proposes a very complete analysis of the robust average-reward policy evaluation problem. The presentation is very clear and progressive, making the paper easy to read and follow. The proofs of the main results look technically solid, although I could not check all the details of the appendix. The tackled setting looks interesting, yet it is somewhat difficult to evaluate its practical reach, which is not much discussed. One possible limitation of the paper in its present form is that there is no numerical illustration of the proposed algorithm. Other weaknesses of this very thorough work are, in my view, minor, and are more related to open problems that could be subsequently studied (see questions below).
Questions
- How do existing work dedicated to the discounted setting (cited as [25, 29, 16, 15, 14] in the related work) relate to the developed method and sample complexities? Specifically, is the proposed algorithm of the present paper similar, or fundamentally different? Is there any kind of equivalence / correspondence between the average and discounted case, as could be suggested by Theorem 3.2?
- Is it possible to illustrate the convergence rate of the algorithm on a very simple toy example?
- In Theorems 5.1 and 5.2, the robustness level (quantified by the radius of the uncertainty set) does not appear explicitly in the convergence rates. If this dependence is simple, is it possible to make it explicit in the rates? In particular, what happens when the radius goes to zero? Is it possible to relate the robust TD algorithm to its non-robust counterpart?
- The $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity is known to be optimal for the non-robust policy evaluation problem. Do you know of any lower bounds in the robust setting?
- In line 149, should be replaced by .
- In line 114, the simplex is not defined.
- From lines 146 to 158, it seems that and are used interchangeably.
Limitations
Limitations are mentioned in the conclusion about the ergodicity assumption and the finite state space setting.
Justification for Final Rating
This paper proposes a very complete analysis of the robust average-reward policy evaluation problem. The presentation is clear. The authors added some numerical illustration during the rebuttal. The few limitations of this paper by no means justify that it should be rejected. Therefore I will keep my score and recommend to accept.
Formatting Issues
Minor corrections should be made to the references formatting. In particular, some capital letters are missing in some conferences and journal names or book titles.
Also: [18] Arnab Nilim and Laurent Ghaoui. -> Laurent El Ghaoui.
[5] -> missing editor name.
We sincerely thank the reviewer for their feedback and valuable suggestions. We address each point carefully and explicitly below:
Regarding Numerical Illustrations (Weaknesses and Q2)
We would like to note that this work is primarily theoretical, and a full empirical evaluation of robust TD learning is left as future work for application papers. That said, because our guarantees hinge on the one-step semi-norm contraction properties (characterized in Lemma 3.1 and Theorem 3.2), we now include numerical evaluations that directly verify strict contraction across the settings studied. These results empirically support the key structural claim used by our analysis. Due to space constraints, we refer to our additions in the response to Reviewer rBB6 (Regarding Numerical Validations).
Regarding Discounted Robust RL Methods (Q1)
We make the following comparisons between our work and prior robust discounted RL works. In discounted RL, contraction comes for free from the discount factor in a (weighted) sup-norm [1], and standard stochastic fixed-point iterations yield rates accordingly. In contrast, the average-reward setting has no inherent contraction, and our analysis constructs a new semi-norm under which the robust average-reward operator is a contraction via ergodicity. The algorithmic analysis is therefore fundamentally different from the discounted works, and we will add a short remark clarifying this correspondence and the different constants.
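To make the contrast concrete, a schematic comparison in generic notation (the operators $\mathbf{T}_{\gamma}$ and $\mathbf{T}$, the factor $\beta$, and the assumption that the constructed semi-norm vanishes on constant vectors are expository placeholders here, not the paper's exact symbols):

```latex
% Discounted robust evaluation: contraction is inherited from the discount factor.
\| \mathbf{T}_{\gamma} V - \mathbf{T}_{\gamma} V' \|_{\infty}
    \;\le\; \gamma \, \| V - V' \|_{\infty}, \qquad 0 < \gamma < 1.
% Robust average-reward evaluation: no discounting, so contraction must instead be
% established under a constructed semi-norm (vanishing on constants) via ergodicity.
\| \mathbf{T} V - \mathbf{T} V' \|
    \;\le\; \beta \, \| V - V' \|, \qquad \beta < 1.
```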
Regarding the Robustness Level (Q3)
Thanks for bringing up this important point. We note that the bounds for contamination (Eq. 138) and Wasserstein (Eq. 141) do not depend explicitly on the uncertainty radius, so the bound in the limit of vanishing radius is the same, and the results do not degrade with increasing radius. However, the contraction factor does depend on the radius, since the maximum spectral radius over the uncertainty set increases with it. There is also an additional factor in the bound for the TV uncertainty set (see Eq. 128, 139), which is loose; tightening it is left as future work. We will include this discussion in the final version.
Regarding Lower Bounds (Q4)
We note that since the non-robust problem is a special case of robust policy evaluation, the lower bound for the non-robust setting also serves as a hard example for the robust setting. Nevertheless, obtaining tight lower bounds that also track the robustness level for robust evaluation remains an interesting open problem. We will mention the above in the revised version.
Regarding Typos and the Missing Definition of the Simplex (Q5-Q7 and Paper Formatting Concerns)
Thanks for identifying the typos, and we will fix them in the revised draft. In particular, we will define the simplex as the probability simplex over the state space and use this notation consistently.
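For concreteness, a sketch of the definition we plan to add (the symbol $\Delta(\mathcal{S})$ is an illustrative placeholder to be matched to the paper's notation):

```latex
\Delta(\mathcal{S}) \;=\; \Big\{\, p \in \mathbb{R}^{|\mathcal{S}|} \;:\; p(s) \ge 0 \ \ \forall s \in \mathcal{S}, \ \ \sum_{s \in \mathcal{S}} p(s) = 1 \,\Big\}.
```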
References
[1] Weighted Sup-Norm Contractions in Dynamic Programming: A Review and Some New Applications, 2012.
I would like to thank the authors for their thorough answers and precisions, as well as for the numerical illustration.
I believe this paper is clearly written and makes an interesting contribution, and I see no major limitations, hence it should be accepted.
Thank you for your feedback. We are glad to hear that our response addressed your comments.
The paper studies policy evaluation for average-reward infinite-horizon reinforcement learning where the estimated value is robust to uncertainty in the data / observed transitions. They provide a finite-sample result that is order-optimal in the accuracy level $\epsilon$. This is achieved by considering the robust Bellman operator and showing that it is a contraction with respect to a semi-norm. However, this robust Bellman operator requires the estimation of a support function. They provide a procedure to estimate this support function (based on a truncated multi-level Monte Carlo algorithm), which can then be used to iteratively apply the robust Bellman operator, yielding the finite-sample result.
Strengths and Weaknesses
Strengths
- The paper is very well-written/structured and easy to follow. The proof-sketches are clear and provide some good intuition for the general proof ideas.
- The work provides the first finite-sample results, while prior works only established asymptotic results.
- The algorithms extend known algorithms in novel, non-trivial ways. The analysis also includes some key novel steps such as the contraction of the robust Bellman operator with respect to a semi-norm.
Weaknesses
- While it is explicitly stated that the finite-time bound is order-optimal in $\epsilon$, optimality is not discussed for other quantities (e.g. S, A). Even if it is not optimal, it would be helpful to have a discussion on the dependence on these other quantities. The lower bound justifying the claim that the dependence on $\epsilon$ is order-optimal was not discussed, as far as I could tell.
- The one weakness of the structure would be that Section 5 feels a bit rushed; there is little discussion of the results in Theorems 5.1 and 5.2 (which connects to the previous point). In addition, the difference in the results between the contamination model and the TV/Wasserstein distance could also be discussed more. In particular, the difference only appears to be logarithmic in Theorem 5.1, and that difference comes from the value of the truncation level (whose explicit setting it would also be useful to have in the main text). A brief discussion on how these policy evaluation methods could fit within a procedure to find the optimal policy would also be welcome.
- See questions below for some slightly confusing parts.
Questions
- In the formulation section, the uncertainty set is introduced as the $(s,a)$-rectangular compact uncertainty set, which is a bit confusing because it seems like this is the type of uncertainty set that will be considered instead of contamination, total variation and Wasserstein distance. In addition, because ultimately only samples from the centroid of this uncertainty set (the nominal MDP) are used, I wonder what the value is in introducing the rectangular set. Is it not equivalent and simpler to only define/introduce the nominal transitions and then consider the uncertainty set as defined by the contamination model, TV and Wasserstein distance? Or am I missing something?
- Following from the previous question, it would also be useful to provide some motivation for the different types of uncertainty sets. While the contamination model yields an algorithm and analysis that are much simpler, I assume the other two are considered because they are more realistic in the way they model uncertainty?
- Here is a list of a few typos: l.136: “definied”; Grammar l.144-145; l.187 typo in definition of E, should be I think;
- How easy/hard is it to solve the robust policy evaluation problem when you know the true nominal transitions ? If it is easy, since you “focus on the model-free setting”, could you comment on the difficulties of a model-based approach where potentially you use all the samples to estimate the nominal transitions and then use this estimate as the ground-truth ?
- I was initially a bit confused by the statement of Theorem 2.1, where it seemed to me that the average reward should depend on the state and not be a scalar. This was clarified later when it was mentioned that, under the ergodicity assumption, the average reward does not depend on the starting state. Is this what is being used in Theorem 2.1, or does Theorem 2.1 also hold without the assumption of ergodicity? And if Theorem 2.1 does require the assumption of ergodicity, I would suggest re-organising, or at least mentioning or referring to the independence of the average reward from the starting state before Theorem 2.1.
- In the description of Algorithm 2, in line 18 the expected Bellman residual is averaged across all states. I am surprised that this works and that it is not weighted by how often states are actually visited – could you comment?
Limitations
yes
Justification for Final Rating
I believe this paper is of good quality, clarity, significance and originality. The points raised during my review, which were mostly about clarifying details that I did not fully understand were addressed and should be corrected adequately in the final version. I recommend acceptance.
Formatting Issues
NA
We sincerely thank the reviewer for their feedback and valuable suggestions. We address each point carefully and explicitly below:
Regarding Optimality and Lower Bounds (Weakness 1)
We agree our rates are order-optimal only in the accuracy $\epsilon$. Regarding other factors (e.g., $S$, $A$, and structural constants), we will clarify that our bounds have a linear factor of $S A$ from per-state-action oracle calls, and also depend on the contraction gap of the semi-norm contraction, the span bound on the value function, and an additional logarithmic factor from truncation in the TV and Wasserstein cases.
Regarding the lower bounds, we note that the classical regret lower bound for finding an optimal policy in non-robust average-reward reinforcement learning in [1] is $\Omega(\sqrt{DSAT})$, where $D$ denotes the diameter of the MDP. However, this does not directly translate into a policy-evaluation sample-complexity lower bound in terms of $S$ and $A$. This is because the conversion from regret in control to estimation error in evaluation is non-trivial and depends on additional structure (e.g., mixing time/diameter and whether a generative model is available). To the best of our knowledge, there have been no works specifically characterizing the exact lower bound for policy evaluation in (robust) average-reward reinforcement learning with a focus on $S$ and $A$. We will therefore not claim tightness in these quantities and treat sharpening these dependencies as open. Regarding the lower bound justifying the claimed dependence on $\epsilon$, we note that we provide standard mean estimation as a hard example and demonstrate a sample complexity of $\Omega(\epsilon^{-2})$ in Appendix F.2. We will mention the above in our revised draft.
Regarding Section 5 (Weakness 2)
We note that Section 5 is concise due to page constraints, and we will expand it in the final version given the availability of an additional page. We have provided the detailed analysis for this section in Appendix F, and some of the key details can be moved to Section 5 with the additional space available in the final version. We will add a discussion of the additional logarithmic terms in the sample complexity results for the TV and Wasserstein distance uncertainty sets. These are due to the bias of the estimators for the support functions in Algorithm 1. In contrast, the estimator of the support function for the contamination uncertainty set is unbiased (as shown in Eq. (19)), and thus the corresponding sample complexity for policy evaluation does not have the logarithmic terms.
We also agree that a brief discussion on how these policy evaluation methods could fit within a procedure to find the optimal policy is important. Specifically, our algorithm can be used to estimate robust policy sub-gradients, which can then be utilized to perform policy optimization; we will add this to the revised draft.
Regarding Rectangularity (Q1)
Thanks for raising this point. We note that our algorithms do focus on contamination, TV, and Wasserstein ambiguity around a nominal model. We will clarify the presentation by first fixing the nominal kernel and then defining the $(s,a)$-rectangular set as a product over state-action pairs of sets that are, respectively, contamination/TV/Wasserstein neighborhoods of the nominal transition distributions. In our algorithms, the nominal generative model first supplies samples from the nominal kernel, and then we compute the support function by decoupling across state-action pairs; this procedure is achievable due to the rectangularity assumption. Thus, the $(s,a)$-rectangularity assumption makes Algorithm 1 implementable (closed-form for contamination; truncated MLMC for TV/Wasserstein) and is standard in robust DP to preserve Bellman-style decomposability. We will add the above to the revised draft.
Regarding Motivations for Different Uncertainty Sets (Q2)
We provide the motivations for the different types of uncertainty sets studied in this work as follows:
- Contamination (mixture model). Models outliers or rare faults [2]: with probability equal to the uncertainty radius, the next-state law can be arbitrary (e.g., a sensor glitch or an unmodeled jump to a bad region). Leads to a simple worst-case expectation that combines the nominal expectation with the worst state value.
- Total variation (TV radius). Models categorical misspecification or discretization error [3]: bounded mass shift per state-action pair; captures sparse but possibly large reallocation of probability mass.
- Wasserstein (metric radius). Models smooth model drift when states have a geometry [4] (e.g., position/velocity): mass can move to nearby states with cost proportional to distance; realistic for learned simulators and dynamics linearization.
We will add a compact paragraph with these examples in the revised draft.
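To illustrate the first two cases, below is a minimal NumPy sketch of the worst-case (adversarial) expectations over a contamination ball and a TV ball around a nominal next-state distribution. The function names, the `radius` parameter, and the toy inputs are our own illustrative choices and not the paper's implementation; the sketch only mirrors the closed-form structure described above (nominal expectation mixed with the worst state value for contamination; greedy mass reallocation for TV).

```python
import numpy as np

def contamination_worst_case(p_nominal, values, radius):
    """Worst case over {(1 - radius) * p_nominal + radius * q : q any distribution}:
    the adversary places the contaminated mass on the state with the smallest value."""
    return (1.0 - radius) * (p_nominal @ values) + radius * np.min(values)

def tv_worst_case(p_nominal, values, radius):
    """Worst case over the TV ball {q : 0.5 * ||q - p_nominal||_1 <= radius}:
    greedily move up to `radius` mass from the highest-value states to the lowest-value state."""
    q = p_nominal.astype(float).copy()
    budget = radius
    lowest = int(np.argmin(values))
    for s in np.argsort(values)[::-1]:      # take mass from the largest values first
        if s == lowest or budget <= 0:
            continue
        moved = min(q[s], budget)
        q[s] -= moved
        q[lowest] += moved
        budget -= moved
    return q @ values

if __name__ == "__main__":
    p = np.array([0.5, 0.3, 0.2])   # hypothetical nominal next-state distribution
    v = np.array([1.0, 0.0, 2.0])   # hypothetical value vector
    print(contamination_worst_case(p, v, 0.15), tv_worst_case(p, v, 0.15))
```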
Regarding Typos (Q3)
Thanks for identifying the typos; we agree with the reviewer on the definition of E. The typos will be fixed in the revised draft.
Regarding Model-Based Methods (Q4)
We agree that our focus is the model-free setting. We note that the model-based counterpart is not straightforward either, and there is a new paper [5] in ICML 2025 (which came online after our work was submitted) that studies the robust model-based average-reward setting with TV, chi-square, and contamination uncertainty sets. We will add this to Related Work in the final version, while noting that the model-based approach used in that paper, a reduction from the discounted to the average-reward setting, does not carry over well to model-free settings due to a higher dependence on the effective horizon in model-free setups.
Regarding the Average Reward and Theorem 2.1 (Q5)
We agree that the average reward is a scalar independent of the starting state under a standard unichain/ergodicity assumption. Our Theorem 2.1 indeed assumes ergodicity later in the section. Thanks for pointing this out; we will move that assumption next to the theorem statement to avoid ambiguity in the revised draft.
Regarding Line 18 in Algorithm 2 (Q6)
Thanks for pointing out this detail. In our algorithm, we use uniform averaging to control the span/semi-norm uniformly over states. The analysis also goes through with stationary-distribution weighting; we will add a remark noting this variant and that it can reduce variance in practice without changing the rate.
References
[1] Near-optimal regret bounds for reinforcement learning, 2008.
[2] A General Decision Theory for Huber’s -Contamination Model, 2017.
[3] Partial Policy Iteration for -Robust Markov Decision Processes, 2021.
[4] First-Order Methods for Wasserstein Distributionally Robust MDPs, 2021.
[5] A Reduction Framework for Distributionally Robust Reinforcement Learning under Average Reward, 2025.
Thank you for your thorough response. All my points have been addressed and I am happy to maintain my current score.
Thank you for your feedback. We are glad to hear that our response addressed your comments.
The paper investigates the finite-sample performance of policy evaluation in robust average-reward Markov Decision Processes (MDPs), where the transition dynamics are only partially known. The authors first establish that the robust Bellman operator, central to the robust dynamic programming framework, is a contraction mapping, guaranteeing convergence of value iteration. Building on this property, the paper derives a near-optimal sample complexity bound of $\tilde{\mathcal{O}}(\epsilon^{-2})$ for estimating the value of a fixed policy under worst-case transition dynamics. These results provide strong theoretical guarantees for robust policy evaluation in average-reward settings.
Strengths and Weaknesses
Strengths:
- The paper tackles a timely and important problem in reinforcement learning, namely robust policy evaluation in average-reward MDPs, a topic of both theoretical and practical significance.
- Establishing near-optimal sample complexity bounds is a strong theoretical contribution and highlights the algorithm’s statistical efficiency.
- The presentation is generally clear, and the technical results are well-structured and accessible.
Weaknesses:
- The paper lacks numerical validation. While the sample complexity result is theoretically compelling, the absence of empirical experiments makes it difficult to assess the practical utility and robustness of the proposed approach.
- Assumption 2.3 appears to be quite restrictive and may not hold except for very small ambiguity sets. It would be helpful to provide a practical discussion or example illustrating how this assumption can be verified or approximately satisfied in realistic settings.
Questions
- I am confused by the sample requirements in Algorithm 1. It appears that exponentially many i.i.d. samples are needed from the generative model. Is this interpretation correct? If so, this would be practically infeasible and significantly limit the applicability of the method. Please clarify.
- The definition of the support function in Theorem 2.1 is unclear. I assume the intended definition is an infimum of an inner product over the uncertainty set, but the inner product notation is missing. Similar notational issues also appear in Equation (7). Please revise for precision.
- The vector in Lemma 3.1 is used without being defined. Please introduce and explain it when it first appears.
- The references section appears to contain duplicate entries, specifically [8] and [9]. Please clean up and consolidate the bibliography.
Limitations
The paper has two significant limitations that are not easily addressed. First, the assumption of finite state and action spaces confines the results to relatively simple environments, limiting their applicability to the large-scale or continuous domains commonly encountered in modern reinforcement learning. Second, the dependence on a generative model, capable of providing i.i.d. samples (exponentially many?) on demand, is often unrealistic, as many practical settings involve only a single trajectory or a fixed batch of data. While these assumptions are standard in theoretical work, it would be valuable for the authors to comment on possible strategies to relax them or to highlight the key obstacles in doing so.
Justification for Final Rating
I would like to thank the authors for the careful revision and in particular for providing some numerical evidence for the theoretical results in this paper. Given that many of my points have been clarified, I am willing to raise my score.
Formatting Issues
none
We sincerely thank the reviewer for their feedback and valuable suggestions. We address each point carefully and explicitly below:
Regarding Numerical Validations (Weakness 1)
We would like to note that this work is primarily theoretical, and a full empirical evaluation of robust TD learning is left as future work for application papers. That said, because our guarantees hinge on the one-step semi-norm contraction properties (characterized in Lemma 3.1 and Theorem 3.2), we now include numerical evaluations that directly verify strict contraction across the settings studied. These results empirically support the key structural claim used by our analysis.
Evaluation of Lemma 3.1
Lemma C.2 is the technical backbone for Lemma 3.1, as Lemma C.2 constructs the fixed-kernel semi-norm and provides the one-step contraction for a given transition kernel. So we perform numerical evaluations on Lemma C.2 to demonstrate the one-step contraction property for Lemma 3.1.
For an ergodic Markov matrix with a stationary distribution, we follow the steps in Appendix C.1 and construct the semi-norm in Eq. (36), which determines the corresponding one-step contraction factor.
To generate ergodic matrices of dimension $n$, let $I$ be the identity matrix and $C$ be the cyclic shift matrix. We construct the following 4 example kernels, P1 through P4 (of dimensions 5 to 8), from combinations of $I$ and $C$. We generate 1000 random unit vectors and calculate, for each, the one-step span and semi-norm ratios. We present the empirical results as follows:
| matrix | dim. | max span ratio |  |  |  |  | ratio_min | ratio_median | ratio_p90 | ratio_max |
|---|---|---|---|---|---|---|---|---|---|---|
| P1 | 5 | 1 | 0.8090 | 0.9045 | 0.0239 | 0.9284 | 0.3824 | 0.7950 | 0.8077 | 0.8090 |
| P2 | 6 | 1 | 0.8718 | 0.9359 | 0.016 | 0.9519 | 0.5197 | 0.8510 | 0.8687 | 0.8718 |
| P3 | 7 | 1 | 0.9020 | 0.9510 | 0.0122 | 0.9633 | 0.5861 | 0.8799 | 0.8976 | 0.9014 |
| P4 | 8 | 1 | 0.8700 | 0.9350 | 0.0162 | 0.9513 | 0.4855 | 0.8226 | 0.8604 | 0.8685 |
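For reproducibility, the following is a minimal sketch of this type of check on a lazy cyclic-shift kernel. The construction and the choice of the centered Euclidean norm are our own illustrative simplifications: for this doubly stochastic, normal example the centered Euclidean norm already serves as a contracting semi-norm with factor equal to the second-largest eigenvalue modulus, whereas the span ratio can equal 1. The exact matrices and the semi-norm of Eq. (36) used for the table above may differ.

```python
import numpy as np

n = 5
I = np.eye(n)
C = np.roll(I, 1, axis=1)                       # cyclic shift (permutation) matrix
P = 0.5 * (I + C)                               # lazy cyclic walk: doubly stochastic, ergodic

span = lambda v: v.max() - v.min()
semi = lambda v: np.linalg.norm(v - v.mean())   # centered Euclidean semi-norm

# Second-largest eigenvalue modulus: the contraction factor for this normal example.
lam2 = np.sort(np.abs(np.linalg.eigvals(P)))[::-1][1]

# No strict span contraction: this direction keeps the span ratio exactly at 1.
v0 = np.array([1.0, 1.0, 0.0, 0.0, 0.0])
print("span ratio:", span(P @ v0) / span(v0))   # -> 1.0

# Empirical one-step ratios under the semi-norm over random unit directions.
rng = np.random.default_rng(0)
ratios = []
for _ in range(1000):
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    ratios.append(semi(P @ v) / semi(v))
print("lam2:", lam2, "max semi-norm ratio:", max(ratios))   # max ratio <= lam2 < 1
```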
Evaluation of Theorem 3.2
Lemma C.4 is the key step for Theorem 3.2, as Lemma C.4 proves a uniform one-step contraction across all kernels in the uncertainty set. So we perform numerical evaluations on Lemma C.4 to demonstrate the one-step contraction property for Theorem 3.2 under the contamination, TV, and Wasserstein-1 distance uncertainty sets. We select P1, P2, P3, and P4 above as the 4 examples of the nominal model.
To numerically approximate the extremal quantity defined in Eq. (50), we proceed by (i) discretizing the uncertainty set and (ii) using finite products to approximate the extremal norm. First, we sample a finite family of kernels from the uncertainty set around each nominal model. To approximate the extremal norm, we build a library of scaled products of the sampled kernels up to a maximum length; for each length we draw a set of products, and the count of such draws at each length is the number of products per length (denoted as samples_per_k in the tables). This defines a surrogate semi-norm. We then generate 50 random unit vectors for each of the sampled uncertainty matrices and calculate the one-step span and surrogate semi-norm ratios. We present the empirical results as follows:
Contamination
| nominal distribution | dim. | radius |  |  | samples_per_k | max span ratio |  |  |  |  | ratio_min | ratio_median | ratio_p90 | ratio_max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| P1 | 5 | 0.15 | 30 | 3 | 25 | 1 | 0.8138 | 0.9069 | 0.0233 | 0.9302 | 0.2210 | 0.6396 | 0.8057 | 0.8134 |
| P2 | 6 | 0.15 | 30 | 3 | 25 | 1 | 0.8807 | 0.9403 | 0.0149 | 0.9553 | 0.2957 | 0.6581 | 0.8630 | 0.8785 |
| P3 | 7 | 0.15 | 30 | 3 | 25 | 1 | 0.9067 | 0.9534 | 0.0117 | 0.9650 | 0.3217 | 0.6683 | 0.8700 | 0.8901 |
| P4 | 8 | 0.15 | 30 | 3 | 25 | 1 | 0.8812 | 0.9406 | 0.0148 | 0.9555 | 0.3739 | 0.5964 | 0.8207 | 0.8624 |
TV
| nominal distribution | dim. | radius |  |  | samples_per_k | max span ratio |  |  |  |  | ratio_min | ratio_median | ratio_p90 | ratio_max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| P1 | 5 | 0.15 | 30 | 3 | 25 | 1 | 0.8280 | 0.9140 | 0.0215 | 0.9355 | 0.3239 | 0.7609 | 0.8239 | 0.8280 |
| P2 | 6 | 0.15 | 30 | 3 | 25 | 1 | 0.9013 | 0.9507 | 0.0123 | 0.9630 | 0.4021 | 0.7904 | 0.8862 | 0.9006 |
| P3 | 7 | 0.15 | 30 | 3 | 25 | 1 | 0.9175 | 0.9588 | 0.0103 | 0.9691 | 0.4457 | 0.7918 | 0.8962 | 0.9162 |
| P4 | 8 | 0.15 | 30 | 3 | 25 | 1 | 0.8805 | 0.9403 | 0.0149 | 0.9552 | 0.4497 | 0.7503 | 0.8449 | 0.8739 |
Wasserstein-1
| nominal distribution | dim. | radius |  |  | samples_per_k | max span ratio |  |  |  |  | ratio_min | ratio_median | ratio_p90 | ratio_max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| P1 | 5 | 0.15 | 30 | 3 | 25 | 1 | 0.8184 | 0.9092 | 0.0227 | 0.9319 | 0.3698 | 0.7569 | 0.8141 | 0.8188 |
| P2 | 6 | 0.15 | 30 | 3 | 25 | 1 | 0.8900 | 0.9450 | 0.0138 | 0.9587 | 0.4257 | 0.7889 | 0.8774 | 0.8894 |
| P3 | 7 | 0.15 | 30 | 3 | 25 | 1 | 0.9110 | 0.9555 | 0.0111 | 0.9666 | 0.4262 | 0.7818 | 0.8827 | 0.9080 |
| P4 | 8 | 0.15 | 30 | 3 | 25 | 1 | 0.8758 | 0.9379 | 0.0155 | 0.9534 | 0.4413 | 0.7224 | 0.8518 | 0.8689 |
Interpretations
Note that for the above tables, max span ratio denotes the empirically largest one-step span contraction coefficient over the sampled families. This value equals 1 for all the settings, meaning there is no strict one-step contraction in the span.
The quantities ratio_min / ratio_median / ratio_p90 / ratio_max summarize the empirical one-step ratios computed under the constructed semi-norms over all sampled kernels and random unit directions. They report, respectively, the minimum, median, 90th percentile, and maximum observed value across those tests. In every table we have ratio_max < 1, so we observe empirical one-step contraction under our semi-norms even when the span does not contract. Moreover, ratio_max stays below the corresponding theoretical contraction factor guaranteed by our constructions, in both the robust and non-robust cases.
The above numerical results and the code for the experiments will be released with the revised manuscript.
Regarding Assumption 2.3 (Weakness 2)
Thanks for raising this question. We provide Remark 2.4 after Assumption 2.3, showing how it relaxes to ergodicity of the nominal model only, when the uncertainty radius is small. The assumption of ergodicity for the nominal model aligns with common practice in the non-robust average-reward RL literature (such as [1]-[3]). Additionally, in our numerical validations, we also assume only the nominal model is ergodic and still observe strict contraction for a reasonably large radius, in our case 0.15.
Regarding Algorithm 1's Sample Complexity (Q1)
We would like to clarify that in Algorithm 1, the expected sample cost of a single call to the robust-expectation oracle is only linear in the maximum truncation level by Theorem 4.1, and in Appendix F.4 we set this truncation level logarithmically in the target accuracy to control bias/variance. Hence Algorithm 1 does not require exponentially many i.i.d. samples; the overall policy-evaluation complexity remains $\tilde{\mathcal{O}}(\epsilon^{-2})$. Due to space constraints, we refer to our response to Reviewer uoff (Regarding MLMC) for a more detailed explanation of why we do not need exponentially many i.i.d. samples. We will mention the above in the revised draft to prevent this misunderstanding.
Regarding Typos and a Missing Definition (Q2-Q4)
Thanks for identifying the typos; we agree with the reviewer on the missing definition and the duplicated reference, and they will be fixed in the revised draft. We note that the vector in question is initially defined in line 144; we will also re-emphasize this definition in Lemma 3.1 in the revised draft.
Regarding Limitations
We agree that our tabular analysis and the use of a generative model are simplifying assumptions. Our goal here is to establish the first finite-sample guarantees for policy evaluation in robust average-reward RL. To the best of our knowledge, the most closely related works cited in Appendix A also study tabular problems and adopt a generative-model abstraction. This abstraction is particularly natural in the robust setting, where the motivation assumes access to a simulated environment. We have noted these limitations in the conclusion. We view relaxing them as promising directions but outside the scope of this first finite-sample study.
References
[1] Finite Sample Analysis of Average-Reward TD Learning and Q-Learning, 2021.
[2] Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation, 2021.
[3] A sharper global convergence analysis for average reward reinforcement learning via an actor-critic approach, 2025.
Dear Reviewer rBB6,
Thank you again for your thoughtful and constructive comments. We wanted to follow up to see if you had any additional questions or concerns following our rebuttal. We’d be glad to provide further clarification during the discussion phase if needed.
Regards,
Authors
This paper studies the finite-sample complexity bound for robust policy evaluation in average-reward settings. A key technical challenge in this context is that, as in the classical average-reward MDP setting, the Bellman operator is not a contraction. Moreover, the ambiguity set introduces nonlinearity, making the plain Monte Carlo method inapplicable. The authors resolve the first issue by proving the existence of a seminorm under which the evaluation operator is a contraction, provided that the underlying Markov chain induced by any policy and any adversarial kernel is recurrent and aperiodic. The second issue is addressed with a truncated multilevel Monte Carlo estimator. The combination of the two leads to a convergent model-free algorithm.
优缺点分析
The paper analyzes a model-free algorithm (TD learning), which is not only theoretically more challenging than its model-based counterparts, but also potentially generalizable to continuous state space models. Moreover, the paper establishes the contraction property of the Bellman operator under recurrence and aperiodicity, which could be of theoretical value to future work.
However, the contraction property does not have a concrete physical interpretation (such as the mixing time or the span norm, both commonly used in classical MDP settings), and is not easily accessible because the relevant seminorm is obtained non-constructively.
Moreover, based on some of the results in the paper and my personal expertise in this domain, I believe it should be straightforward to obtain a sample complexity bound using model-based methods.
Furthermore, if we consider the minimax sample complexity in this setting, I do not believe it should depend on the one-step contraction factor. In classical MDP settings with mixing, the one-step Bellman operator is not a span contraction (for similar reasons as those discussed in this paper; it is a contraction under a certain semi-norm). Nevertheless, the sample complexity is still independent of the one-step contraction factor. With that said, I think this is understandable given the focus on model-free average-reward MDP algorithms, which represent a significantly more challenging setting.
Given these, I think the narrative of the paper should be oriented toward the model-free aspect, emphasizing their algorithm's versatility.
On the other hand, even though TD learning is interesting in the robust MDP setting due to the nonlinearity introduced by the adversarial dynamics, one could argue that the control (i.e., policy optimization) problem is of greater importance to practitioners.
Finally, there is no numerical validation in the paper to support the theory. I would appreciate it if the authors could provide some numerical experimentation, especially given that I have some reservations about the analysis, as the current bias and variance bounds do not seem to suggest the claimed sample complexity. See below. I am not certain about this last assessment, but I do believe that the analysis is not tight enough.
Questions
- Line 86: The MLMC-based estimators in the two Blanchet papers do not need infinite expected samples. Also, the MLMC-based Q-learning in [1] has a non-asymptotic expected sample complexity bound. Could this be adapted in the context of this paper?
- Line 184: The kernel of a seminorm is not defined.
- Theorems 3.1 and 3.2 seem very interesting. The authors should consider providing a simple (not span-contracting) example to illustrate this new seminorm.
- Is there a relationship between the two contraction-related quantities discussed? I would guess that one bounds the other, but does this hold in general?
- What is the bias level given an optimal allocation of the samples? In MLMC, one would typically get O(1) variance. To converge with a constant stepsize, if the one-step bias is of a given order, then the total bias scales accordingly, so one would need to make it small enough to converge. The bias under truncation at level N is due to the odd and even sampling (I think the analysis in this paper had a looser bound, which could be improved). If one requires the bias to be small enough, the level has to grow accordingly. This will use many samples within one iteration with constant-order probability. So this analysis seems to suggest the current bias bound is not sufficiently tight to get the claimed rate, as it didn't account for the odd and even sampling in [2]. I am a little concerned about the validity of the convergence analysis.
- I am wondering why the authors don't consider policy optimization. It seems that most of the arguments still go through. Is it because 3.1 or 3.2 fails?
- In Equation (41), would you need the mapping to be continuous?
[1] Wang, S., Si, N., Blanchet, J. & Zhou, Z.. (2023). A Finite Sample Complexity Bound for Distributionally Robust Q-learning. Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 206:3370-3398.
[2] Blanchet, J. H., Glynn, P. W., & Pei, Y. (2019). Unbiased multilevel Monte Carlo: Stochastic optimization, steady-state simulation, quantiles, and other applications. arXiv preprint arXiv:1904.09929.
Limitations
The limitation is not sufficiently discussed. Please see the weaknesses I have listed above.
Justification for Final Rating
The authors are able to address my main concerns about the validity of their approach.
Formatting Issues
NA
We sincerely thank the reviewer for their feedback and valuable suggestions. We address each point carefully and explicitly below:
Regarding MLMC (Q1, Q5 and Weaknesses)
Thanks for bringing up comparisons between our work and the two Blanchet papers. We note that [1] obtains its guarantees using unbiased and untruncated MLMC under a smooth-functional assumption (differentiable with Hölder-continuous derivative, as mentioned in their Theorem 1), which enables second-order multilevel cancellation. [2] studies discounted robust RL with KL divergence uncertainty sets, where these conditions hold, so the unbiased-MLMC analysis of [1] carries through. In contrast, we focus on TV and Wasserstein distance uncertainty sets, and the support functions in our setting are Lipschitz with no uniform differentiability guarantees. Thus, the analysis of the unbiased MLMC in [1] and [2] is no longer applicable. We therefore adopt a different framework where we apply the truncated MLMC method, which still yields the $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity.
We note that we characterize the sample complexity, bias, and variance properties of our truncated MLMC approach (Algorithm 1) in Theorems 4.1-4.4. In particular, due to truncation, the expected number of samples needed to perform Algorithm 1 is only linear, rather than exponential, in the maximum truncation level. Thus, we agree with the reviewer that if one requires the bias to be of the order of the target accuracy, the truncation level is indeed logarithmic in the inverse accuracy, which is equivalent to what we wrote in the draft. However, this uses a per-iteration number of samples governed by the truncation level rather than exponentially many (with constant-order probability), which still leads to our sample complexity result.
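To make the bias/cost trade-off concrete, below is a minimal sketch of a truncated multilevel Monte Carlo estimator for a nonlinear functional f of an unknown next-state distribution, in the spirit of the construction discussed here but not the paper's exact Algorithm 1. The geometric level probabilities, the truncation level `n_max`, the sampler interface, and the toy functional are our own illustrative choices. With level probabilities proportional to $2^{-n}$, the expected number of next-state draws per call grows only linearly in `n_max`, while the truncation bias shrinks as `n_max` grows.

```python
import numpy as np

def truncated_mlmc(sample_next_states, n_states, f, n_max, rng):
    """One truncated-MLMC estimate of f(p), where p is the (unknown) next-state
    distribution and sample_next_states(k, rng) returns k i.i.d. next-state indices."""
    # Truncated geometric level distribution: P(N = n) proportional to 2^{-n}, n = 0..n_max.
    probs = 2.0 ** -np.arange(n_max + 1)
    probs /= probs.sum()
    level = rng.choice(n_max + 1, p=probs)

    # Draw 2^(level+1) samples; empirical distributions from all samples and the odd/even halves.
    idx = sample_next_states(2 ** (level + 1), rng)
    emp = lambda ix: np.bincount(ix, minlength=n_states) / len(ix)
    delta = f(emp(idx)) - 0.5 * (f(emp(idx[0::2])) + f(emp(idx[1::2])))

    # Base term from a single sample plus the importance-weighted multilevel correction.
    return f(emp(idx[:1])) + delta / probs[level]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p_true = np.array([0.5, 0.3, 0.2])      # hypothetical nominal distribution
    values = np.array([1.0, 0.0, 2.0])      # hypothetical value vector
    f = lambda q: max(q @ values, 0.8)      # Lipschitz, non-smooth stand-in for a support function
    sampler = lambda k, rng: rng.choice(len(p_true), size=k, p=p_true)
    est = [truncated_mlmc(sampler, 3, f, n_max=8, rng=rng) for _ in range(5000)]
    print("MLMC mean:", np.mean(est), "target:", f(p_true))
```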
Regarding Model-Based Methods (Weaknesses)
We agree that our focus is the model-free setting, since the policy evaluation problem studied here is model-free by nature, and we will add emphasis on this in the revised draft. We note that the problem is not straightforward even in model-based settings, and indeed there is a new paper [3] in ICML 2025 (which came online after our work was submitted) that studies the robust model-based average-reward setting with TV, chi-square, and contamination uncertainty sets. We will add this to Related Work in the final version, while noting that the model-based approach used in that paper, a reduction from the discounted to the average-reward setting, does not carry over well to model-free settings due to a higher dependence on the effective horizon in model-free setups.
Regarding the Semi-Norm Contraction and the Contraction Factor (Q2-Q4 and Weaknesses)
Thanks a lot for the great suggestion on providing examples to demonstrate the contraction properties in Lemma 3.1 and Theorem 3.2. Due to space constraints, we refer to our additions in the response to Reviewer rBB6 (Regarding Numerical Validations), where we provide detailed numerical validations and demonstrate the semi-norm contraction properties.
Regarding the interpretation of the contraction factor: note that in Lemma C.3, our construction allows the contraction factor to be made arbitrarily close to the spectral quantity defined in Eq. (41). Since that quantity represents the second-largest eigenvalue modulus among the kernels in the uncertainty set, the semi-norm is constructive and tracks the spectrum that matters for stability. We note that in the reversible (or lazy-reversible) case, this eigenvalue modulus directly controls mixing through the relaxation time (shown in Theorem 12.4 of [4]). For non-reversible chains, the standard replacement is the pseudo-spectral gap via multiplicative reversibilization; we will add a short remark connecting our contraction factor to that quantity for completeness.
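For reference, a generic statement of the relation we have in mind, for a reversible ergodic chain with stationary distribution $\pi$ and second-largest eigenvalue modulus $\lambda_\star$ (cf. Theorem 12.4 of [4]); the symbols here are ours, not the paper's:

```latex
t_{\mathrm{rel}} \;=\; \frac{1}{1 - \lambda_\star},
\qquad
t_{\mathrm{mix}}(\varepsilon) \;\le\; \log\!\Big(\frac{1}{\varepsilon \, \pi_{\min}}\Big)\, t_{\mathrm{rel}} .
```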
We also would like to note that the kernel of a semi-norm is the set of all vectors in the vector space that the semi-norm maps to zero; we will add this explanation in the revised draft to improve clarity.
Regarding Future Work (Q6 and Weaknesses)
We agree with the reviewer that the policy optimization and control problem is an important direction. However, policy optimization builds on policy evaluation, which is the focus of this paper; one cannot obtain policy optimization results without a complete analysis of policy evaluation. Further, we note that current policy optimization papers in the robust average-reward setup [5] assume knowledge of the exact sub-gradient, while our method provides actual sampling techniques with finite-time analysis for policy evaluation. Thus, our work can potentially be used in their analysis to replace the knowledge of the exact sub-gradient. However, a detailed analysis is left as future work and will be mentioned in the paper.
Regarding the Continuity of the Mapping in Eq. (41) (Q7)
We note that the continuity of the mapping holds under the assumptions in the paper. On finite ergodic chains, the stationary distribution depends continuously on the transition kernel; hence the associated quantity is continuous in the kernel. The spectral radius is a continuous function of the matrix entries (Appendix D in [6]), so the composition is continuous on the ergodic region. We will add the detailed proof of this in the final version.
References
[1] Unbiased multilevel Monte Carlo: Stochastic optimization, steady-state simulation, quantiles, and other applications, 2019.
[2] A Finite Sample Complexity Bound for Distributionally Robust Q-learning, 2023.
[3] A Reduction Framework for Distributionally Robust Reinforcement Learning under Average Reward, 2025.
[4] Markov Chains and Mixing Times, 2017.
[5] Policy Optimization for Robust Average Reward MDPs, 2024.
[6] Matrix Analysis, 2012.
In contrast, we focus on TV and Wasserstein distance uncertainty sets, and the support functions in our setting are Lipschitz with no uniform differentiability guarantees.
This is not entirely true, in the sense that if you assume there is a minimum probability at the center of the uncertainty set, then with a sufficiently large sample size one can estimate the transition probability within the required error. Then, one would still have differentiability if the empirical kernel is close to the center of the uncertainty set. Nevertheless, this might be a more complicated approach that leads to looser bounds. Moreover, I think the current algorithm has the advantage that the maximum sample size is bounded, whereas the fully randomized unbiased approach will have a (random) sample size with infinite variance in the smooth regime when one chooses the geometric probability in the relevant range.
We note that we characterize the sample complexity, bias, and variance properties of our truncated MLMC approach (Algorithm 1) in Theorems 4.1-4.4. In particular, due to truncation, the expected number of samples needed to perform Algorithm 1 is only linear, rather than exponential, in the maximum truncation level. Thus, we agree with the reviewer that if one requires the bias to be of the order of the target accuracy, the truncation level is indeed logarithmic in the inverse accuracy, which is equivalent to what we wrote in the draft. However, this uses a per-iteration number of samples governed by the truncation level rather than exponentially many (with constant-order probability), which still leads to our sample complexity result.
Thank you for the clarification. Now I think I understand the tradeoff. Because you can use a logarithmic truncation level, this gives you the stated sample count. But doing this will lead to a variance that is not constant; in your case it grows with the truncation level, which is still fine for getting the claimed rate.
We agree that our focus is the model-free setting, since the policy evaluation problem studied here is model-free by nature, and we will add emphasis on this in the revised draft. We note that the problem is not straightforward even in model-based settings, and indeed there is a new paper [3] in ICML 2025 (which came online after our work was submitted) that studies the robust model-based average-reward setting with TV, chi-square, and contamination uncertainty sets. Further, we note that current policy optimization papers in the robust average-reward setup [5] assume knowledge of the exact sub-gradient, while our method provides actual sampling techniques with finite-time analysis for policy evaluation.
Recently, I have seen works newer than [3] (which the authors should review) on policy optimization that achieve a better complexity (without the dependence on the contraction factor) [01, 02]. In particular, [01] seems to be a model-free approach. This makes me think that the contraction-factor dependence might only be an artifact of your analysis. In particular, I'm wondering if this is a fundamental problem with TD- or Q-learning-style algorithms, or an artifact of your analysis.
Nevertheless, the authors addressed my major concerns about the validity of their truncation approach. So, I will raise the score.
[01] Roch, Zachary, et al. "A finite-sample analysis of distributionally robust average-reward reinforcement learning." arXiv preprint arXiv:2505.12462 (2025).
[02] Chen, Zijun, Shengbo Wang, and Nian Si. "Sample Complexity of Distributionally Robust Average-Reward Reinforcement Learning." arXiv preprint arXiv:2505.10007 (2025).
Thank you for your feedback and the corrections of the details of the MLMC approaches. We are glad to hear that our response addressed your major concerns.
Additionally, thanks for bringing up [01] and [02]. We note that according to NeurIPS policy, both [01] and [02] are considered concurrent with this paper (and were posted around the NeurIPS main paper deadline). Nevertheless, we agree that [02] is a model-based method while [01] utilizes a model-free value-iteration approach. Specifically, [01] studies robust average-reward RL under contamination and norm uncertainty sets, leveraging a value-iteration-style Halpern iteration (RHI). Since [01] utilizes properties of non-expansiveness (along with some properties of the uncertainty sets) instead of contraction, [01] does not have the contraction factor in its sample complexity results. We also note that RHI is analyzed under the assumption that every stochastic increment it adds to the running Bellman estimate is unbiased. This is guaranteed by the recursive sampling procedure. For contamination and norm uncertainty sets, the robust Bellman update can be expressed as one next-state expectation plus a deterministic norm term, so a single draw gives an increment whose conditional mean is exactly the true difference, guaranteeing unbiasedness. However, for a Wasserstein ball, the correction is a global optimal-transport cost that no finite set of local samples can reproduce on average, so any stochastic estimate is inevitably biased. Thus, one limitation of [01] is that RHI fails under Wasserstein uncertainty sets, while our work does not have this limitation.
We will add a discussion of the above concurrent works in the revised draft.
Dear Reviewers,
Please take a look at the authors' response and discuss if there are still more concerns that need to be clarified. Thanks AC
This paper develops an algorithm based on MLMC to solve the policy evaluation problem for robust average-reward reinforcement learning, and derives a finite-sample analysis. The overall sample complexity is $\tilde{\mathcal{O}}(\epsilon^{-2})$. In order to solve this problem, the paper develops some interesting results, including the contraction property of the robust Bellman operator. The reviewers requested numerical results during the discussion, and the authors have provided them. It would strengthen the paper if these results were incorporated into a future version of this paper. This is the first paper providing a finite-sample analysis for robust average-reward RL. Given the significance of the results, I therefore recommend acceptance of this paper.