PaperHub
Average rating: 4.8 / 10 (Rejected, 4 reviewers)
Ratings: 3, 6, 5, 5 (min 3, max 6, std dev 1.1)
Average confidence: 3.0
ICLR 2024

Revisiting the Static Model in Robust Reinforcement Learning

OpenReview | PDF
Submitted: 2023-09-22 | Updated: 2024-02-11
TL;DR

We incrementally identify static adversarial transition functions in an algorithm which finds solutions to the robust MDP problem.

Abstract

Keywords
Reinforcement Learning, Robust MDPs

Reviews and Discussion

Official Review (Rating: 3)

Finding the optimal robust policy in Reinforcement Learning (RL) is often formulated as a two-player game and is related to the dynamic model of transition function uncertainty. In this model, the environment dynamics are allowed to change at every time step. In this paper, the authors consider the static model where one needs to provide robustness to a span of static transition models. While state-of-the-art robust RL algorithms build upon the dynamic uncertainty model, in this paper the authors show that the static uncertainty model can be solved by solving a set of non-robust problems. They have proposed a generic meta-algorithm IWOCS to identify the worst-case transition model in an incremental fashion. A deep RL version of the IWOCS is also proposed along with a comparison with existing robust RL algorithms.

Strengths

The authors show that the static uncertainty model can be solved by solving a set of non-robust problems. They have proposed a generic meta-algorithm IWOCS to identify the worst-case transition model in an incremental fashion. A deep RL version of the IWOCS is also proposed along with a comparison with existing robust RL algorithms. The analysis looks fine. The paper is easy to follow.

Weaknesses

Overall, although the authors have proposed a new framework to tackle robust MDP problems under static uncertainty, the motivation is not clear. Also, the proposed methodology does not seem to provide any significant advantage compared to the robust methods under the dynamic uncertainty model.

Questions

My concerns are as follows:

  1. Why is the dynamic model not a generalization of the static model? In other words, if we assume a dynamic model while proposing an RL scheme and in reality, the environment behaves like a static model, why would the proposed RL scheme perform badly? In fact, (Lemma 3.3 Iyengar 2005) echoes that for stationary policies, the performances should be identical.

  2. It seems, in an infinite-horizon problem as considered by the authors, the dynamic uncertainty model makes more sense than the static uncertainty model as the transition function can be piecewise static over the entire horizon. The rationale behind the static model being a better choice than the dynamic model from a practical viewpoint, needs more clarity.

  3. In this paper, the authors have considered stationary policies. In this case, the robust value iteration framework for the dynamic model will work. What are the additional advantages that the framework proposed by the authors provides?

  4. The authors claim that the proposed framework avoids the complexity of the dynamic model. However, it is established that the additional complexity due to robustness in robust value iteration is just $O(\log(1/\epsilon))$ for accuracy in the order of $\epsilon$. On the other hand, solving a sequence of classical MDP problems as considered by the authors could be really slow to converge depending on scenarios.

  5. How to construct the set $\mathcal{T}_i$? This is not clear. It may be harder to construct than constructing the rectangular $(s,a)$ uncertainty set.

  6. For different $(s,a)$ pairs, the $\arg\min$ operation for $Q_i(s,a)$ may give us different $j$. How do we tackle this problem? Can we still write the Bellman equation for $Q_i(s,a)$?

  7. Authors are requested to explain how they compute $l(\hat{T}_j(s,a); s')$ according to the formula given. It is not clear to me.

Comment

For different $(s,a)$ pairs, the $\arg\min$ operation for $Q_i(s,a)$ may give us different $j$. How do we tackle this problem? Can we still write the Bellman equation for $Q_i(s,a)$?

This is barely a problem and rather a strength of IWOCS. Each $Q^*_{T_j}$ is an upper bound on the robust value function (Property 1). Hence, taking the minimum over the finite set of $T_j$'s in each $(s,a)$ pair provides a point-wise definition of a tighter upper bound on the robust value function. That is the simple point-wise definition of $Q_i$ in Section 4. In Section 5, $Q_i$ is defined implicitly, as all $Q_{T_j}$ are kept in memory and evaluated when one needs to find $\pi_i(s)$.
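As a concrete illustration of that point-wise recombination, here is a minimal Python sketch; the callables in `q_functions`, the discrete `actions` iterable, and the function names are illustrative assumptions, not the paper's implementation:

```python
def q_lower_envelope(q_functions, s, a):
    # Q_i(s, a): point-wise minimum over the optimal Q-functions of the previously
    # identified worst-case models T_j. Each Q^*_{T_j} upper-bounds the robust
    # Q-function, so their minimum is a tighter upper bound.
    return min(q(s, a) for q in q_functions)

def greedy_policy(q_functions, actions, s):
    # pi_i(s): greedy action with respect to the recombined estimate Q_i.
    # All Q_{T_j} stay in memory and are evaluated on demand.
    return max(actions, key=lambda a: q_lower_envelope(q_functions, s, a))
```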

Authors are requested to explain how they compute $l(\hat{T}_j(s,a), s')$ according to the formula given. It is not clear to me.

$l$ is a simple $\ell_1$ loss, as stated in Section 5.1. We perform a forward pass through the $\hat{T}_j$ network with $(s,a)$ as inputs. The output is a state $\hat{s}'$ and we compute $\|\hat{s}' - s'\|_1$. If the value is above $\rho$, we tag the value function as inapplicable in $(s,a)$.
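A minimal sketch of that applicability test; the tensor shapes, the `T_hat` call signature, and the function name are assumptions for illustration and may differ from the paper's actual network interface:

```python
import torch

def is_applicable(T_hat, s, a, s_next, rho):
    # Forward pass through the learned transition model \hat{T}_j for (s, a),
    # then compare the predicted next state to the observed one with an l1 distance.
    with torch.no_grad():
        s_pred = T_hat(torch.cat([s, a], dim=-1))
    l1 = torch.norm(s_pred - s_next, p=1).item()
    # If the error exceeds the threshold rho, the value function Q_{T_j}
    # is tagged as inapplicable in (s, a).
    return l1 <= rho
```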

Comment

In this paper, the authors have considered stationary policies. In this case, the robust value iteration framework for the dynamic model will work. What are the additional advantages that the framework proposed by the authors provides?

Again, we never claimed robust VI wouldn't work. However, one major weakness of (approximate, gradient-based) robust VI algorithms is their propensity to become stuck in stationary points that are not saddle points, and hence to fail to reach the robust value function. Counter-measures to this include populations of agents, noise injection, etc., as covered in Section 3. Conversely, optimizing for $T$ and $\pi$ separately, relying on well-understood methods for both problems, has the potential to segregate difficulties. In practice it yields a more efficient algorithm (in terms of sample requirements and overall attained performance).

An interesting aspect is that IWOCS is guaranteed to converge (but not necessarily to the robust value function) as is now discussed in the improved Section 4.

The authors claim that the proposed framework avoids the complexity of the dynamic model. However, it is established that the additional complexity due to robustness in robust value iteration is just $O(\log(1/\epsilon))$ for accuracy in the order of $\epsilon$. On the other hand, solving a sequence of classical MDP problems as considered by the authors could be really slow to converge depending on scenarios.

The actual complexity of robust value iteration is $O(C S \log(1/\epsilon) / \log(1-\gamma))$, where $C$ is the cost of computing $\max_a \min_T \mathbb{E}_{R,S'}[R+\gamma V(S')]$ in a given $s$. In discrete state and action spaces, computing the value of the expectation has cost $O(S)$. The $\min_T$ has to be solved once for every state and action and it is a search over measures on $S$. Let $c$ be its cost; then the cost of the $\max_a \min_T$ operation is $O(cSA)$ and the overall complexity of RVI is $O(c S^2 A \log(1/\epsilon) / \log(1-\gamma))$. The complexity of VI is $O(S^2 A \log(1/\epsilon) / \log(1-\gamma))$ (VI is a special case of RVI with a singleton as uncertainty set, so $c=1$). The ratio between the two complexities is $c$, not $\log(1/\epsilon)$.

Comparing IWOCS and RVI is a delicate matter because IWOCS is not based on a contraction mapping and has no convergence guarantees to the robust value function. Consequently, comparisons should be taken with a grain of salt. One iteration of IWOCS in that context has the complexity of VI for the policy search part, plus the complexity of finding a worst-case transition function, which is $O(cSA)$. Hence, the overall complexity for $M$ iterations of IWOCS is $O(M(S^2 A \log(1/\epsilon) / \log(1-\gamma) + cSA))$. Compared to RVI (depending on $M$, which we cannot easily quantify), this bound will be smaller when $c$ is large, which is the case when one deals with complex uncertainty sets and without further hypotheses. We have included this analysis in Appendix O, which we refer to from the main text of Section 4.
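For quick reference, the bounds above side by side (a plain restatement of the analysis in this reply, using the same notation; nothing new is derived):

```latex
% Complexity bounds restated from the reply above.
\begin{aligned}
\mathrm{VI}:   \quad & O\!\big(S^2 A \,\log(1/\epsilon) / \log(1-\gamma)\big) \quad (c = 1),\\
\mathrm{RVI}:  \quad & O\!\big(c\, S^2 A \,\log(1/\epsilon) / \log(1-\gamma)\big),\\
\mathrm{IWOCS},\ M \text{ iterations}: \quad & O\!\Big(M \big(S^2 A \,\log(1/\epsilon) / \log(1-\gamma) + c\, S A\big)\Big).
\end{aligned}
```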

In the context of large-scale environments with large state spaces, continuous action domains, and generic uncertainty sets, the algorithms most closely related to RVI, when adapted for deep learning, appear to be RARL [1] and M2TD3 [2]. RARL mitigates the cost $c$ by optimizing over a surrogate uncertainty set (and not the full $\mathcal{T}$). M2TD3 optimizes over the full $\mathcal{T}$. But overall, both RVI-inspired methods are observed to be less computationally effective than IWOCS, confirming the trend of the previous analysis that adapting RVI to large-scale problems incurs a high cost due to the $\min_T$ problem.

[1] L. Pinto, J. Davidson, R. Sukthankar, and A. Gupta. 2017. Robust adversarial reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning.

[2] T. Tanabe, R. Sato, K. Fukuchi, J. Sakuma, and Y. Akimoto. 2022. Max-Min Off-Policy Actor-Critic Method Focusing on Worst-Case Robustness to Model Misspecification. In Proceedings of the Advances in Neural Information Processing Systems.

How to construct the set $\mathcal{T}_i$? This is not clear. It may be harder to construct than constructing the rectangular $(s,a)$ uncertainty set.

We are afraid we don't understand the question. In general, the uncertainty set (rectangular or not) is given as an input of the problem. In practice (in non-rectangular sets) it is the set of parameters conditioning $T$. As to how to construct $\mathcal{T}_i$, it is simply the collection of all previously identified worst-case transition kernels up to step $i$ along the execution of IWOCS.

Comment

Thank you for these comments. We try to clarify things and answer your concerns below. We, however, fail to understand how these questions and the indicated weaknesses justify disqualifying our contribution and setting such a low score, and we hope this answer will encourage you to raise your score. In addition to this answer, we have updated the paper's PDF, highlighting in blue all changes made to ease the reading.

Overall, although the authors have proposed a new framework to tackle robust MDP problems under static uncertainty, the motivation is not clear.

Our motivation is solving robust MDPs. We have reformulated the paper in many places to better highlight this. We endeavor to exploit the equivalence between the static and dynamic models of uncertainty in the case of stationary policies and under the $sa$-rectangularity assumption, so as to design a new approach to solve robust MDPs, hence addressing similar robust RL problems as the rest of the literature.

Also, the proposed methodology does not seem to provide any significant advantage compared to the robust methods under the dynamic uncertainty model.

IWOCS outperforms all other methods on most benchmarks, sometimes by a large margin (7 times the score of M2TD3 on Hopper with 3D uncertainty sets). On average across benchmarks, it performs twice as well as M2TD3 with the same sample budget (Table 1). We fail to see how IWOCS can be said to "not provide any significant advantage compared to the robust methods under the dynamic uncertainty model".

Besides raw scores, IWOCS is, to the best of our knowledge, a conceptually new algorithm, which sheds new light on the question of optimizing policies for robust MDPs. This argument alone, in our humble opinion, is a significant contribution to advancing the knowledge about robust RL.

Why is the dynamic model not a generalization of the static model? In other words, if we assume a dynamic model while proposing an RL scheme and in reality, the environment behaves like a static model, why would the proposed RL scheme perform badly? In fact, (Lemma 3.3 Iyengar 2005) echoes that for stationary policies, the performances should be identical.

We confirm your understanding. The dynamic model is indeed a super-set of the static model. And for stationary policies, their solutions are indeed the same. These results are not ours, they are from past literature and summarized in Iyengar's (2005) paper and in Section 2 of our submission. We have added quotes of original papers in Appendix A to ease the understanding for readers less familiar with robust MDPs and the static model.

If existing optimization methods converged to the true robust value function, methods based on the dynamic model of uncertainty would find the robust value function and there would be no reason why they would perform badly. But the $\min\max$ problem of robust MDPs is hard (for many possible reasons: non-$sa$-rectangularity in practical cases, stationary points of the optimization landscape blocking gradient-based methods, local saddle points, etc.) and optimization methods in general struggle to find good solutions. From that perspective, IWOCS is yet another optimization method, but one that is based on a new analysis of the problem, and that empirically performs well.

It seems, in an infinite-horizon problem as considered by the authors, the dynamic uncertainty model makes more sense than the static uncertainty model as the transition function can be piecewise static over the entire horizon. The rationale behind the static model being a better choice than the dynamic model from a practical viewpoint, needs more clarity.

This rationale of why the static model makes sense is already exposed by Iyengar (2005) (see Appendix A). We don't argue that the static model is better than the dynamic one. Indeed, both have the same solution for robust value functions when policies are stationary. Consequently, we don't really argue over whether the dynamic model makes more or less sense than the static one. We just propose a resolution scheme that builds on solving a sequence of static MDPs, instead of the approach of robust VI.

Why this might make sense from a practical viewpoint is exposed and analyzed in the last paragraph of Section 2, the fully re-structured Section 4, and the discussion in Section 5.2.

Comment

Thank you for your detailed response. It has resulted in a better understanding of the paper. However, I still feel that the novelty compared to Iyengar (2005) is marginal, as there are no convergence guarantees to the robust value function. Computationally efficient online schemes that need not solve the minmax problem repeatedly already exist in the literature. For example: Wang, Yue, and Shaofeng Zou. "Online robust reinforcement learning with model uncertainty." Advances in Neural Information Processing Systems 34 (2021): 7193-7206. This is a robust RL algorithm. Although the authors' main contribution is in the space of planning and not in the space of learning, I still feel the contribution is marginal.

Comment

Thank you for your detailed response. Regarding computational complexity, I mean that $O(\log(1/\epsilon))$ iterations are needed, where in each iteration we solve the inner problem. Please refer to Section 4.2 of Nilim, Arnab, and Laurent El Ghaoui. "Robust control of Markov decision processes with uncertain transition matrices." Operations Research 53.5 (2005): 780-798.

Official Review (Rating: 6)

The paper considers the problem of robust reinforcement learning (RL), where the transition dynamics of the test and training environments can differ. Specifically, it formulates this problem as a two-player zero-sum game between an adversary that chooses the worst-case transition functions and the learner that selects the robust policy. Rather than assuming the adversary can change the transition function at each timestep (the dynamic model), the authors adopt the static model and propose IWOCS. This incremental search algorithm constructs a discrete set of worst-case transitions to solve the zero-sum game. They implement IWOCS for deep RL settings and demonstrate its effectiveness across various standard robust RL benchmarks, showing it can outperform other state-of-the-art methods.

Strengths

The paper's extensive empirical evaluations are a major strength, demonstrating the proposed IWOCS algorithm's superior performance compared to state-of-the-art methods like M2TD3 on MuJoCo benchmarks. I find the following points as the strengths of the paper:

  • Evaluation of both worst-case and average-case performance, which highlights IWOCS's ability to balance robustness and generalization.
  • Handling of partial state space coverage via predictive coding, which enables switching to randomization in cases where the worst-case Q-values are not valid.
  • Use of separate regularized and unregularized Q-networks.
  • The empirical analysis of the convergence of IWOCS.
  • I also appreciate reporting of total clock time, which is essential in the assessment of efficiency.

In summary, I think the proposed incremental construction of worst-case transition sets is novel and, given the strong empirical results, makes a potential contribution to robust RL. However, I have major concerns about the soundness of their method and clarity of the writing, which I will elaborate on in subsequent sections.

Weaknesses

Before discussing my concerns about the clarity and quality of the writing, a major factor that makes a fair evaluation of the paper hard, let me discuss my understanding of the incremental worst-case search and why I think some of the claims made in the paper might be incorrect. To make the writing simple and concrete, I will re-write the IWOCS algorithm for a general minimax problem of

$$\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} f(x, y)$$

for some arbitrary function $f$ and sets $\mathcal{X}$ and $\mathcal{Y}$. For this problem, IWOCS becomes the following algorithm:

input: $x_0$, max number of iterations $M$, tolerance $\epsilon$
for $i = 0$ to $M$:

  1. Find $f^*(x_i) = \max_{y \in \mathcal{Y}} f(x_i, y)$
  2. Define $\mathcal{X}_i = \{x_j\}_{j \leq i}$
  3. Define $x^\star_i = \arg\min_{x \in \mathcal{X}_i} f^*(x)$
  4. Define $y_i = \arg\max_{y \in \mathcal{Y}} f(x^\star_i, y)$
  5. Find $x_{i+1} = \arg\min_{x \in \mathcal{X}} f(x, y_i)$
  6. If $|f(x_{i+1}, y_i) - f(x^\star_i, y_i)| \leq \epsilon$ then: return $y_i$, $x_{i+1}$.
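To make this loop concrete, a direct Python transcription follows; `f` and the oracle subroutines `max_over_y`, `argmax_over_y`, and `argmin_over_x` are assumed to be exact solvers, mirroring the idealization already made in the pseudocode above (the names are illustrative, not from the paper):

```python
def generalized_iwocs(f, x0, max_over_y, argmax_over_y, argmin_over_x, M, eps):
    # f(x, y)          -> payoff value
    # max_over_y(x)    -> max_y f(x, y)
    # argmax_over_y(x) -> a y attaining max_y f(x, y)
    # argmin_over_x(y) -> an x attaining min_x f(x, y)
    xs = [x0]                                    # X_i = {x_j}_{j <= i}, grown incrementally
    for i in range(M + 1):
        f_star = [max_over_y(x) for x in xs]     # step 1 (recomputed each pass for clarity)
        x_star = xs[min(range(len(xs)), key=f_star.__getitem__)]   # steps 2-3
        y_i = argmax_over_y(x_star)              # step 4
        x_next = argmin_over_x(y_i)              # step 5
        if abs(f(x_next, y_i) - f(x_star, y_i)) <= eps:            # step 6: stopping test
            return y_i, x_next
        xs.append(x_next)
    return y_i, x_next                           # budget exhausted without meeting the tolerance
```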

One claim that is made multiple times throughout the text (e.g., in the "incrementally solving the static model" paragraph of Section 2 and the "exploring the variance and convergence of IWOCS" paragraph of Section 5) is that the values of $f^*(x^\star_i)$ decrease monotonically and the algorithm is guaranteed to converge. Consider the following two cases:

  1. $x_{i+1} \in \mathcal{X}_i$: In this case, $x^\star_{i+1} = x^\star_{i}$ as $\mathcal{X}_i = \mathcal{X}_{i+1}$. Therefore, $f^*(x^\star_i) = f^*(x^\star_{i+1})$ and the algorithm is stuck.
  2. $x_{i+1} \not\in \mathcal{X}_i$ but $x^\star_{i+1} \in \mathcal{X}_i$: This case also results in a similar behavior of the algorithm.

The paper needs to address such scenarios more concretely (both in theory and practice) and provide evidence on why this will not happen. In particular, the idea of constructing incremental sets of worst-case models can also be applied to any two-player zero-sum game (similar to the formulation I wrote above). If this is already done in the literature, I would question the novelty of the method. If not, what are the advantages of this approach to the more classic methods, such as alternating optimization? A more thorough analysis is needed.

Besides the soundness of the algorithm, I have some concerns regarding the writing, which I mention as follows:

  1. In the abstract, the authors claim to suggest an analysis of why solving the static model under mild hypotheses is a reasonable endeavor. However, they only provide shallow references to previous works (e.g., Iyengar, 2005 or Wiesemann et al., 2013) for this analysis. The paper is not self-contained, and the claims about the no duality gap or the equivalence of optimal policy under the static and dynamic cases are not fully supported. I expect a concrete and thorough overview of the conditions needed and the previous results (at least in the appendix section).
  2. The contributions of this paper are not clearly distinguished from previous work. Is this the first paper considering solving the static model instead of the dynamic case? Is the incremental construction of the worst-case models the main contribution? Or has the idea been previously used for minimax games? If not, I think the contribution not only applies to robust RL but any minimax problem, and the authors should clarify this.
  3. Most of the writing is unclear in terms of the introduction of the notations and concepts in the main text. In particular, Section 4 can benefit greatly from re-writing, as in the current version the notations are defined as part of the text and not in a separate place (e.g., the definitions of $\mathcal{T}_i$ and $Q_i$), and it requires the reader to pass through the text multiple times to understand the algorithm and notations. The authors also use multiple similar notations such as $J$, $V$, $\pi_{T_i}$, $\pi_{i}$ without clarifying.
  4. Furthermore, not much discussion is provided for solving the minimization problem (finding the worst-case transition functions). This is especially important as finding the worst-case static model is more difficult than the worst-case dynamic one. The authors avoid discussing this problem in detail and refer only to evolutionary strategies such as CMA-ES to solve it.
  5. Finally, in the empirical evaluation section, the idea of training all the algorithms for the same number of steps does not necessarily provide a fair comparison. The results would be more compelling if the authors plotted all methods until their convergence.

Questions

Besides the question I asked in the previous section:

  1. What is the concrete definition of $\mathcal{T}$ in the experiments?
  2. As far as I know, meta-algorithm usually refers to algorithms that output another algorithm. Why do you name your algorithm meta-algorithm?
  3. In Table 3, can you show that the values of 4.0 and 7.0 are the true worst-case parameters in the Halfcheetah 2 environment?

POST-REBUTTAL RESPONSE

I thank the authors for their detailed response and apologize for my late reply after the discussion deadline. I appreciate the authors' effort in updating the manuscript. I read the updated parts as well as other reviewers' comments. I believe the new version clarifies the contribution better and provides a more coherent discussion about the theory of minimax optimization and its convergence. I see the difference between the formulation of the algorithm for general minimax problems and MDPs, which consider point-wise minimization of Q-function. Since my main concern was on the writing and the theory, which are more or less addressed with the updated version, I increased my score.

However, an important part of the proposed algorithm remains unclear to me. That is solving $\arg\min_{T \in \mathcal{T}} V^{\pi_i}_T(s_0)$ in line 4 of Algorithm 1. It still seems to me that solving this minimum is strictly harder in the static case than in the dynamic model. The authors need to discuss this in more detail and with more clarity.

Moreover, although the authors discussed why, in general, it isn't easy to quantify the convergence of deep RL algorithms, I still believe it is useful and more transparent to report the results for more steps, especially for RARL and M2TD3. I'm asking this because, as the authors have suggested, one of the issues for robust value iterations is getting stuck in local saddle points. Showing this claim empirically and showing that this does not happen for IWOCS will increase the paper's merits.

Comment

Conditions needed for previous results to hold.

In the abstract, the authors claim to suggest an analysis of why solving the static model under mild hypotheses is a reasonable endeavor. However, they only provide shallow references to previous works (e.g., Iyengar, 2005 or Wiesemann et al., 2013) for this analysis. The paper is not self-contained, and the claims about the no duality gap or the equivalence of optimal policy under the static and dynamic cases are not fully supported. I expect a concrete and thorough overview of the conditions needed and the previous results (at least in the appendix section).

We think that citing (and summarizing) the precise equations, lemmas and propositions of the results we build on is not shallow referencing. To make the paper more "self-contained", we have quoted the exact text from the original papers in Appendix A. But we believe this does not make a huge difference with what was already stated in Section 2: the static-dynamic equivalence holds for $sa$-rectangular uncertainty sets and stationary policies, and the no-duality-gap property holds for $sa$-rectangular uncertainty sets. Also, we find it awkward to cite full paragraphs of previous work for the sake of "self-containment": these results were previously peer-reviewed and published; citing the source where they were stated and proven should probably suffice to support building on them.

Positioning with respect to previous work

The contributions of this paper are not clearly distinguished from previous work. Is this the first paper considering solving the static model instead of the dynamic case? Is the incremental construction of the worst-case models the main contribution? Or has the idea been previously used for minimax games? If not, I think the contribution not only applies to robust RL but any minimax problem, and the authors should clarify this.

As stated in the last paragraph of Section 3, to the best of our knowledge, our approach is the only one that directly addresses the static model for robust RL, i.e., with stationary policies.
It is important to note that the optimal policy for the static model can be non-stationary. The static-dynamic equivalence only holds for stationary policies.
Cover [1991] did study the static model in its full generality and Littman [1994] proved that solving the static model was NP-complete. But general works that attempt to solve the static model do so in a context where the static-dynamic equivalence does not hold, i.e., when allowing non-stationary policies for the decision maker, which is not what we attempt to do here. Here our focus is on stationary policies, which enable the equivalence between models, and in turn the foundations of IWOCS. All this (besides the citations of Cover [1991] and Littman [1994], which are out of scope) was already stated in Section 2, but your remark leads us to think we did not highlight it enough. We have tried to insist on it more, in particular in the section's last paragraph.

Cover, T. M. (1991). Universal portfolios. Mathematical Finance, 1(1), 1-29.
Littman, M. L. (1994). Memoryless policies: Theoretical limitations and practical results. In From Animals to Animats 3: Proceedings of the Third International Conference on Simulation of Adaptive Behavior (Vol. 3, p. 238).

Comment

Thank you for these comments and constructive feedback. Thank you also for highlighting the novelty, strong empirical results, and overall contribution to robust RL. We try to address all identified weaknesses and concerns below. We hope this answers your questions, resolves your concerns, and encourages you to raise your score. In addition to this answer, we have updated the paper's PDF, highlighting in blue all changes made to ease the reading.

Reformulation as a general $\min\max$ problem

Thank you for this proposed reformulation. We believe there may still be a slight nuance needed here (in lines 3 and 4), and we believe the parallel with general $\min\max$ problems is not so straightforward. Following the notations you introduce, IWOCS for MDPs would be a special instance where the $x_i$ are transition models, the $y_i$ are policies, and $f^*(x_i)$ is the value of an optimal policy for $x_i$.

In step 3 of IWOCS, $Q_i$ is not the value function of any of the previous policies (it can correspond to the policy found for some $x_i$ in $(s,a)$ and to that of some other $x_{i'}$ in another $(s',a')$). Instead, it is (just) an estimate of the robust value function based on a few upper bounds. Consequently, the parallel with identifying a unique $x_i^*$ within $\mathcal{X}_i$ does not hold. It is a specific feature of MDPs to have the possibility of recombining several value functions into a new one which is a lower bound for all.

Consequently, in step 4, $y_i$ is not defined as the optimal policy for a known transition kernel in $\mathcal{X}_i$. Here, it is a (possibly) new policy that is greedy with respect to the estimate of the robust value function.

In turn, this makes the definition of $x_{i+1}$ incorrect (not the line you wrote per se, but the dependence on $y_i$ defined at the previous line makes the result different).

Interestingly, we have followed the same line of thought in the past as we assume you did in writing this review. We tried to define IWOCS more simply for general $\min\max$ problems but it seems the reasoning here is still quite linked to MDPs.

Concerning the fact that the sequence of robust value function estimates is decreasing and that the algorithm converges, we would like to point out that we never claimed the algorithm converges to the actual robust value function (we recalled regularly that this isn't the case, even in the last sentence before the paper's conclusion). But after reading your review (and those of the other reviewers) we agree our presentation lacked clarity, and we have worked a lot on the writing and analysis to improve this. Please see the next item and the revised version of the paper for more details on how we fixed this: thank you for helping us make the paper a lot better.

Soundness of the algorithm and writing of Section 4

Based on your feedback, we fully reorganized and enhanced Section 4. It now separates more clearly the algorithm (including notations) from the discussion on its properties. In particular, we extracted and highlighted the arguments related to monotonicity and convergence. We also improved the discussion about the convergence point of the algorithm (which is not proven to be the robust value function). This now leads to a (hopefully) better discussion on how the algorithm can become stuck in bad solutions (note that this was already present in the previous version of the paper but we believe the discussion is clearer now - thank you for that!) and how the choice of $T_{i+1}$ affects convergence.

Notations

Robust MDPs unfortunately require defining a span of different quantities. Although we have tried to keep notations simple (previous work in the literature has sometimes been way heavier in terms of notations) we agree providing guidance to the reader would help. We have added a table in the appendix, summarizing all notations (Table 4, Appendix B). We refer to it very early in the main text (Section 2) so the reader can use it as a permanent reminder along the paper.

Comment

Comparison fairness in the empirical evaluation

Finally, in the empirical evaluation section, the idea of training all the algorithms for the same number of steps does not necessarily provide a fair comparison. The results would be more compelling if the authors plotted all methods until their convergence.

We would like to bring up two separate points as an answer.
First, we wanted the comparison to be fair with the evaluation of M2TD3 (which is the state-of-the-art baseline). We believe it would have been unfair if we had enabled IWOCS to see more samples than M2TD3. Consequently, we retained the sample budget from Tanabe et al.'s experimental setup, and allocated it evenly over IWOCS iterations. We fail to see how letting SAC run longer would bring a fairer comparison with any of the competitors. We have made this more explicit in Section 5.2.

Besides, exact convergence of deep RL algorithms is generally difficult to quantify (the simplest example might be the convergence curves in the SAC paper https://arxiv.org/pdf/1812.05905.pdf) and there is no real consensus on any specific stopping criterion. Recall that IWOCS features good overall performance, even with the current constraint on the sample budget (which is not so limiting, as stated in the new footnote on page 7, and already enables the two-fold improvement of IWOCS over M2TD3 on average). Also, the wall-clock computing time (Table 6, Appendix H + computing resources in Appendix C) is already quite large as is. Our experience running IWOCS experiments indicates there are very marginal gains obtained by letting SAC run longer between each IWOCS iteration. Concerning other methods, we quote their results from the M2TD3 paper (and have validated with the authors that our evaluation setting was indeed the same as theirs).

Definition of $\mathcal{T}$ in the experiments

What is the concrete definition of $\mathcal{T}$ in the experiments?

The answer is in the first paragraph of Section 5.2. For 2D uncertainty sets, $T$ is parameterized by a global friction coefficient and the agent's mass, making $\mathcal{T}$ isomorphic to $\mathbb{R}^2$. In the second one, a third, environment-dependent parameter is included (details in Appendix I).

Use of the word "meta-algorithm"

As far as I know, meta-algorithm usually refers to algorithms that output another algorithm. Why do you name your algorithm meta-algorithm?

Because IWOCS takes a value function optimization algorithm and a worst-case search algorithm as inputs, and outputs an algorithm for solving robust RL problems. Section 4 presents an instance of an IWOCS algorithm based on value iteration and grid search. Section 5 presents an instance of an IWOCS algorithm based on SAC and CMA-ES. We have made this point clearer by highlighting the sub-algorithms in Algorithm 1 (these will remain in blue after the rebuttal).
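To make the "algorithms as inputs" point concrete, a hedged skeleton of the meta-algorithm as described in this thread follows; the function names, the fixed iteration budget, and the way the greedy policy is built are illustrative assumptions, and Algorithm 1 in the paper remains the authoritative procedure:

```python
def iwocs(solve_mdp, find_worst_case, actions, T0, num_iterations):
    # solve_mdp(T)        -> (pi_T, Q_T): non-robust solution for the fixed model T
    #                        (e.g. value iteration in Section 4, SAC in Section 5)
    # find_worst_case(pi) -> transition model minimizing V^pi_T(s0) over the uncertainty set
    #                        (e.g. grid search in Section 4, CMA-ES in Section 5)
    worst_cases, q_functions = [T0], []
    for _ in range(num_iterations):
        _, Q_T = solve_mdp(worst_cases[-1])
        q_functions.append(Q_T)

        # Candidate robust policy: greedy w.r.t. the point-wise minimum of all stored Q_{T_j}.
        def pi_robust(s):
            return max(actions, key=lambda a: min(q(s, a) for q in q_functions))

        worst_cases.append(find_worst_case(pi_robust))
    return pi_robust
```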

Worst-case parameters for half-cheetah

In Table 3, can you show that the values of 4.0 and 7.0 are the true worst-case parameters in the Halfcheetah 2 environment?

Although we cannot formally prove it (precisely because, at this stage, IWOCS is not formally guaranteed to converge to the true robust value function), the fact that (almost) all IWOCS runs identify them as final worst cases seems to indicate that they correspond to a stopping criterion for the algorithm, and are thus good "worst-case parameter" candidates. This also makes sense intuitively: the "most difficult" situation to control is when the cheetah has maximum mass and maximum internal joint friction.

Official Review (Rating: 5)

The paper addresses the challenge of creating robust control policies in reinforcement learning (RL) that ensure consistent performance across different environments. It introduces the notion of a two-player game, rooted in the dynamic model of transition function uncertainty, but emphasizes the practical need for robustness in the face of static transition models. Despite the static model's increased difficulty, the paper proposes a reconsideration and introduces a meta-algorithm called IWOCS. This algorithm incrementally identifies worst-case transition models to guide the search for a robust policy. The discussion around IWOCS suggests a new way to separate policy optimization from adversarial transition functions, presenting fresh analytical perspectives. The passage concludes by mentioning a deep RL version of IWOCS, demonstrating its competitiveness with current algorithms on standard benchmarks.

Strengths

  1. The idea of using static model to solve robust MDPs is new and interesting.

  2. The paper demonstrates the competitive performance of IWOCS through comprehensive experiments.

Weaknesses

  1. Although the paper has gained its intuition from the theory, it is better to formulate a simple theoretical analysis (convergence or sample complexity bound) of the algorithm.

  2. The theoretical concepts in this paper are not very readable.

Questions

  1. The paper's intuition for the algorithm comes from combining the static-dynamic model equivalence and the no-duality-gap condition; do these assumptions hold in the experiments the paper proposes?

  2. Since the paper shows the competitive performance with respect to traditional methods, can authors give some reasons or conjectures about the advantages of IWOCS?

Comment

Thank you for this review. Thank you also for pointing out the novelty of the proposed idea and the completeness of the empirical evaluation. We try to address all identified weaknesses and concerns below. We hope this answers your questions and will encourage you to raise your score. In addition to this answer, we have updated the paper's PDF, highlighting in blue all changes made to ease the reading.

Although the paper has gained its intuition from the theory, it is better to formulate a simple theoretical analysis (convergence or sample complexity bound) of the algorithm. The theoretical concepts in this paper are not very readable.

Thank you for this feedback. We have undertaken a reorganization of the paper, Section 4 in particular, to drastically enhance both the writing and the theoretical analysis (which now includes more clearly a proof of convergence which was only present in full text before). We hope this addresses your concerns.

On a side note, we would like to emphasize that the suggested theoretical analysis is far from "simple" and is yet another full contribution in itself. E.g. the sample complexity of robust MDPs is a field of study by itself, rapidly evolving, complex, and currently only addressing classical tabular approaches such as robust value iteration (see for instance https://arxiv.org/abs/2305.16589). Hence such a study seems out of the scope of this paper.

The paper's intuition for the algorithm comes from combining the static-dynamic model equivalence and the no-duality-gap condition; do these assumptions hold in the experiments the paper proposes?

As in most non-toy scenarios, these assumptions do not hold: rectangularity is a very strong assumption and, unfortunately, is too strong for real-world scenarios. Unfortunately too, the majority of sound algorithms for robust MDPs rely on this assumption for analyzing their properties. Some rare methods depart from these assumptions (methods covered in Section 2), but this is not the case here. This does not prevent evaluating the various methods from the literature on the larger-scale benchmarks of Section 5.2 and drawing conclusions on the empirical behavior of IWOCS and its competitors. We added a sentence in Section 5.2 to better reflect this.

Since the paper shows the competitive performance with respect to traditional methods, can authors give some reasons or conjectures about the advantages of IWOCS?

We have tried to highlight such reasons in the discussion of Section 5.2. One first reason is that gradient-based robust VI methods easily tend to become stuck on stationary points that are not saddle points (hence the different approaches in the literature covered in Section 3). In contrast (and secondly), IWOCS avoids this bottleneck by decoupling the $\min$ and $\max$ problems. In turn, IWOCS benefits from the different developments made in both value function optimization on the one hand, and worst-case search (a static optimization problem) on the other hand.

Official Review (Rating: 5)

This paper considers the static uncertainty model for robust RL and proposes a novel robust RL algorithm with empirical evaluation. It highlights the necessity of mitigating the conservativeness of policies yielded by solving dynamic uncertainty models, and suggests revisiting the static uncertainty model despite its intractability for exact solutions. It then designs an incremental worst-case search algorithm, namely IWOCS, that approximates the solution by alternately constructing an incremental discrete ambiguity subset $\mathcal{T}_i$ and updating the optimal robust policy with respect to $\mathcal{T}_i$. The algorithm is then implemented via deep neural networks and evaluated against state-of-the-art baseline algorithms on a few popular benchmarks, showing a considerable performance improvement in certain environments.

Strengths

  1. The paper provides an interesting new perspective to understand the solution concepts for robust MDPs. It is arguable that static uncertainty models are more realistic and useful once they can be efficiently computed.
  2. The algorithm is novel, simple and intuitive. It is actually surprising to see that a straightforward approximation is able to yield satisfactory performance in some cases.
  3. The empirical evaluation results display sufficient workload. Detailed results are honestly included without cherry-picking. The discussion provides some intuition into understanding its performance, as compared to the baselines.
  4. Overall the presentation of the paper is fine, with no obvious obstruction to first-time readers.

Weaknesses

  1. The soundness of the meta-algorithm needs more justification.
    • The current way it appears in the paper leaves readers the impression that the algorithm (esp. the idea of incremental ambiguity subset) comes solely from a simple heuristics, without any theoretical guarantee or adequate empirical evidence.
    • For theoretical guarantee, it is advisable to include some convergence guarantees that read like "$\mathcal{T}_i \to \mathcal{T}$". Though convergence of subset sequence is not reasonable in any sense for this setting, one can at least show, for example, that each time a new model is selected from $\mathcal{T} \setminus \mathcal{T}_i$, and maybe further, that the robust value with respect to $\mathcal{T}_i$ converges to that of $\mathcal{T}$ (at a quantified rate).
    • For adequate empirical evidence, one basically needs to show the above properties with experiments on a range of different environments. Currently the motivating experiment is not motivating at all, and fails to show any sign of convergence.
  2. There are more to do for the empirical evaluations.
    • The deep RL algorithm is not described in detail, making the paper not self-contained. It is especially unclear how the evolutionary algorithm is implemented without referring to the original paper. Further, the use of an evolutionary algorithm needs more justification, since it barely has any optimality guarantees.
    • The experiment setup is suspicious. Usually people train the networks to convergence or to a certain average reward level. Cutting off experiments with different algorithms at the same number of time steps doesn't seem right.
    • It is evident that the proposed algorithm, on top of SAC, would be very time-consuming due to the incremental nature. Although efficiency (in terms of wall-clock time) is briefly touched upon in Appendix F, it is advisable to show results like time-to-performance.
    • The analysis can be enriched by providing more discussion on the reason why IWOCS performs significantly worse than other non-robustness-oriented algorithms, and perhaps further, what kind of environments are the most compatible with IWOCS.
    • It is advisable to include ablation studies (or at least comparison among different options) for design components like predictive coding and evolution algorithms.
  3. There are a few typos that need to be fixed. Besides, the important definitions and formula shall be highlighted in display mode. Writing everything in a single paragraph makes it hard for readers to grasp the key points at first glance and to review the important sentences.

Questions

See the weakness part above.

Ethics Review Details

N/A

Comment
  • The experiment setup is suspicious. Usually people train the networks to convergence or to a certain average reward level. Cutting off experiments with different algorithms at the same number of time steps doesn't seem right.

Concerning the cut-off in number of samples for each iteration of IWOCS, we have tried to clarify things in the main text. The intention was to reproduce the experimental constraints of Tanabe et al. (2022) to provide a fair comparison, in particular to compare algorithms based on the same (sufficient) sample budget. Practically, we have also run experiments with more time and samples for SAC at each iteration of IWOCS: this only yields quite marginal improvements, while breaking the "equal sample budget" fairness criterion. We don't know how to address the concern of "the experimental setup is suspicious" besides providing all the code (which we have done) and having added this mention in the main text.

  • It is evident that the proposed algorithm, on top of SAC, would be very time-consuming due to the incremental nature. Although efficiency (in terms of wall-clock time) is briefly touched upon in Appendix F, it is advisable to show results like time-to-performance.

The wall-clock overhead of IWOCS compared to SAC is a three-fold increase. We are unsure whether one could consider this to be "very time-consuming" since both solve different problems. A better comparison would be between M2TD3 and IWOCS, but unfortunately Tanabe et al. (2022) do not report wall-clock time. The point of Appendix H (previously Appendix F) is to avoid vague statements such as "very time-consuming" and to put a precise value on how much it costs to run IWOCS in the described setting. Concerning time-to-performance, given the iterative nature of IWOCS we would appreciate it if you could make the question more precise. Currently, the different tables report a measure of "iteration-to-performance", which is directly linked to the performance-vs-time progress, since IWOCS hops from one performance value to another after each iteration.

  • The analysis can be enriched by providing more discussion on the reason why IWOCS performs significantly worse than other non-robustness-oriented algorithms, and perhaps further, what kind of environments are the most compatible with IWOCS.

We are quite surprised by this statement. Compared to "non-robustness-oriented algorithms" (which actually concerns only domain randomization, and even that is debatable), IWOCS does not perform significantly worse: it actually massively outperforms them on almost all environments, often providing on the order of a ten-fold improvement (Table 1). The only two exceptions are the Ant and Humanoid Standup environments. For Ant, we already provide tentative explanations and discussion in the "Worst-case performance" paragraph of Section 5.2. For Humanoid Standup, all methods (including robustness-seeking ones) provide the same order of magnitude of performance, which makes it hard to draw specific conclusions.

  • It is advisable to include ablation studies (or at least comparison among different options) for design components like predictive coding and evolution algorithms.

We have updated the paper (new paragraph in Section 5.2 and new Appendix K) with an ablation study where we replace CMA-ES by a plain uniform grid search over the uncertainty parameters. The main conclusion is that CMA-ES did indeed find good enough minima, and the performance of IWOCS and IWOCS* (with CMA-ES or with the grid search) is comparable. In preliminary experiments, we have also experimented with variance networks in place of predictive coding. Both were equally difficult to tune (finding relevant thresholds to define confidence is the key technical challenge here). We believe this second ablation does not bring any new insight to the paper and we choose not to include it.

There are a few typos that need to be fixed. Besides, the important definitions and formula shall be highlighted in display mode. Writing everything in a single paragraph makes it hard for readers to grasp the key points at first glance and to review the important sentences.

We have tried to strike a balance between space constraints in the paper and general readability. Following your review (and those of the other reviewers) we have updated many parts of the paper and fully restructured Section 4. We have also added a "notations" appendix (Appendix B) which we refer to early in the paper, from Section 2, to facilitate keeping track of notations. We hope this improves overall clarity, and the balance between space constraints and readability.

Comment

We thank the reviewer for these comments. We are glad to see the reviewer pointed out the overall clarity, motivation, novelty, and the empirical evaluation's completeness. We try to address all identified weaknesses and concerns below. We hope this answers your questions and will encourage you to raise your score. In addition to this answer, we have updated the paper's PDF, highlighting in blue all changes made to ease the reading.

The soundness of the meta-algorithm needs more justification.

  • The current way it appears in the paper leaves readers the impression that the algorithm (esp. the idea of incremental ambiguity subset) comes solely from a simple heuristics, without any theoretical guarantee or adequate empirical evidence.
  • For theoretical guarantee, it is advisable to include some convergence guarantees that read like "$\mathcal{T}_i \rightarrow \mathcal{T}$". Though convergence of subset sequence is not reasonable in any sense for this setting, one can at least show, for example, that each time a new model is selected from $\mathcal{T} \setminus \mathcal{T}_i$, and maybe further, that the robust value with respect to $\mathcal{T}_i$ converges to that of $\mathcal{T}$ (at a quantified rate).

We have included several formal arguments, reformulated many parts of the paper, and restructured Section 4 to better separate the algorithm from the discussion. Within the discussion, we also separated what is guaranteed from what is heuristic, and what is proven from what is not. In particular, convergence is guaranteed and proven. The point of convergence is now discussed more explicitly. We also insisted more on the fact that for stationary policies the static and dynamic models optimize for the same value function (and hence solve the same problem, with different formulations). We believe that overall this improved the paper a lot and would like to thank you for pointing it out.

  • For adequate empirical evidence, one basically needs to show the above properties with experiments on a range of different environments. Currently the motivating experiment is not motivating at all, and fails to show any sign of convergence.

Concerning empirical evidence, we would like to recall the whole of Section 5, which extensively provides empirical evidence as to the relevance of the approach and its ability to outperform previous baselines. We believe this evidence is "adequate" but would be happy to better understand why you may think it's not.

Concerning the motivating experiment, Figure 1 shows convergence. Maybe there was a misunderstanding there. Also, we never called it a motivating example: it is called an illustration because it is meant to illustrate the behavior of IWOCS, not motivate its development.

There are more to do for the empirical evaluations.

  • The deep RL algorithm is not described in detail, making the paper not self-contained. It is especially unclear how the evolutionary algorithm is implemented without referring to the original paper. Further, the use of an evolutionary algorithm needs more justification, since it barely has any optimality guarantees.

We fail to see how we can be more specific about the deep RL algorithm and the evolutionary algorithm without paraphrasing their original papers. SAC is SAC, without any fancy additions. CMA-ES is probably the most grounded, competitive, and widely used evolutionary strategy, and it comes with convergence guarantees (see for instance [1,2]). Its use is better motivated in the updated Section 4 and relies on the ability of CMA-ES to perform efficient black-box optimization. It is now also better justified empirically by the new paragraph on replacing CMA-ES with a grid search in Section 5.2 and Appendix K. Both SAC and CMA-ES are used off-the-shelf: SAC is reimplemented to include the non-regularized prediction head and the predictive coding head as indicated in Appendix E (which also describes network architectures and all hyperparameters), CMA-ES is taken from the https://github.com/CyberAgentAILab/cmaes library, and the full code of our experiments is provided. To that extent, we do not believe the paper can be made more self-contained without copy-pasting the pseudo-code of the SAC and CMA-ES algorithms.

[1] Ollivier, Y., Arnold, L., Auger, A., & Hansen, N. (2017). Information-geometric optimization algorithms: A unifying picture via invariance principles. The Journal of Machine Learning Research, 18(1), 564-628.
[2] Hansen, N., Müller, S. D., & Koumoutsakos, P. (2003). Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES). Evolutionary Computation, 11(1), 1-18.
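For readers unfamiliar with that library, a minimal sketch of its ask/tell interface on a 2D uncertainty set follows; the `worst_case_return` objective, the initial mean and step size, and the generation count are placeholders, not the settings used in the paper:

```python
import numpy as np
from cmaes import CMA

def worst_case_return(params):
    # Placeholder: roll out the current candidate policy in the environment
    # parameterized by (friction, mass) and return its cumulative reward.
    friction, mass = params
    raise NotImplementedError

# Search for the (friction, mass) pair minimizing the candidate policy's return.
optimizer = CMA(mean=np.array([1.0, 1.0]), sigma=0.5)
for generation in range(30):
    solutions = []
    for _ in range(optimizer.population_size):
        x = optimizer.ask()
        solutions.append((x, worst_case_return(x)))
    optimizer.tell(solutions)  # CMA-ES minimizes the reported objective values
```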

Comment

I appreciate it very much that the authors upload an updated paper with all modifications highlighted.

The additional discussions in Section 4 are clean and helpful. I'm happy to see that the Q-function can actually be shown to converge, though not necessarily to the desired $Q^*_{\mathcal{T}}$. It is also clearly stated that the heuristics proposed in the paper might lead to premature stopping, which means the current method is more or less a workaround for computational efficiency, instead of the ultimate solution. By "adequate empirical evidence" I was referring to some additional motivating example that justifies convergence, which is no longer necessary since you've provided further discussions and results.

For the algorithms used in implementation, I have to admit that I'm not familiar with the recent advances of evolution algorithms, so maybe I was a little harsh on that point. I appreciate the neat ablation studies that justify the performance of CMA-ES in Section 5.2. As for SAC, there are many variants of it in practice, but judging from the context I think you are using the vanilla version.

For the experiment setup, I think the authors' argument is reasonable to some extent. Though it's still advisable to include a performance-against-time curve, even just for your own algorithm. Figures make things easier to understand than tables, and showing the training reward curve against time (or training steps) does help to understand the performance of RL algorithms.

For the comparison against other algorithms, I won't say my question is surprising. Experiments are meant to provide evidence for the proposed algorithm's performance, and when data do not support the claim, discussion should follow to justify these outliers even if they are rarely observed (and I wouldn't call 4 out of 11 a rare case). As for the explanations provided for Ant, it would be great if you could provide further evidence to support your conjecture (e.g., adjusting the algorithm or environment to tackle or remove the mentioned challenges), though I agree that a thorough treatment may constitute a new subsequent project. Further, the explanation for HS is not convincing enough, since the performance of IWOCS on HC and Walker is also of the same order of magnitude as M2TD3's, yet the authors still mark them in boldface. In conclusion, more explanations with evidence are expected.

I'll keep my ratings for now, but I'm open to raising it later, based on the reaction of other reviewers as well.

AC Meta-Review

Overall Evaluation:

  • The submission introduces the IWOCS algorithm, aiming to solve robust MDPs in Reinforcement Learning by focusing on static adversarial transition functions. The approach is novel, but the paper has received mixed evaluations from reviewers, with concerns regarding the theoretical soundness, empirical evidence, and clarity of presentation.

Reviewer Concerns:

  • Reviewer KrKt expressed concerns about the soundness of the meta-algorithm and the need for more empirical evidence. The updated paper, according to the reviewer, has addressed some of these issues, particularly the clarity and soundness, but left some questions open.
  • Reviewer SYvP found the idea novel but indicated that the paper would benefit from a clearer theoretical analysis and better presentation of theoretical concepts.
  • Reviewer cHJH raised significant concerns about the clarity of the writing, the soundness of the method, and questioned the novelty of the incremental worst-case search method. The updated manuscript has addressed some concerns regarding the theory and writing, leading to an increased score.
  • Reviewer ZA2E expressed skepticism about the novelty and practical advantage of the approach over existing dynamic model-based methods. The reviewer remains unconvinced even after the authors’ response.

Authors’ Response:

  • The authors have made significant efforts to address the reviewers’ concerns in their rebuttal, particularly focusing on improving the theoretical justification, empirical evidence, and clarity of the presentation. They argue for the practicality and efficiency of their approach, citing comparative advantages over existing methods.

Why Not a Higher Score

Given the mixed reviews and the authors’ efforts to address the concerns, the submission shows potential but requires further refinement. The novel approach is noteworthy, but the paper would benefit from stronger empirical evidence and a more convincing argument about its advantages over existing methods. A recommendation for acceptance might be premature; a revision for a later conference or journal submission could be more appropriate, allowing the authors to fully address the remaining concerns.

Why Not a Lower Score

N/A

Final Decision

Reject