PaperHub
Average rating: 5.3 / 10
Decision: Rejected · 3 reviewers
Ratings: 6, 5, 5 (min 5, max 6, std 0.5)
Average confidence: 3.0
ICLR 2024

Exposing the Silent Hidden Impact of Certified Training in Reinforcement Learning

Links: OpenReview · PDF
Submitted: 2023-09-23 · Updated: 2024-02-11

Abstract

Keywords
Decision Boundary Stability · Volatility · Rigorous Analysis · Reinforcement Learning

Reviews and Discussion

Review
Rating: 6

The paper analyzes adversarial training in deep reinforcement learning and identifies some issues with these methods. In particular, adversarially trained value functions are shown to overestimate the optimal values, and they may give an incorrect ranking of the performance of sub-optimal actions.

Strengths

  • The paper is generally well-written. The motivation to study adversarial training is well-explained, and the potential issues with adversarial training are very relevant to the community.

  • A simple but insightful example with two states and three actions is provided to analytically demonstrate the effects of the regularizer in adversarial training methods. It is shown that the regularizer takes a lower value with parameters that overestimate the optimal value and reverse the order of the values of the second and the last action.

  • In numerical experiments on three environments in the ALE, several metrics are used to compare vanilla trained and adversarially trained models. One way is to directly compare the optimal value function under the two training methods. Another is to compare their normalized value estimates among the best, the second best, and the worst actions. The paper also introduces the metric of performance drop when taking a sub-optimal action in a randomly sampled p-fraction of states (a rough sketch of this metric follows the list). All the comparisons suggest that adversarial training might incur overestimation bias for the values of the optimal action and provide inaccurate value estimates for sub-optimal actions.
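A minimal sketch of this performance-drop metric, assuming a Gymnasium-style environment interface and a hypothetical `q_values_fn(state)` that returns per-action Q-values (illustrative only, not the paper's implementation):

```python
import random

def return_with_suboptimal_fraction(env, q_values_fn, p, rank=2, seed=0, max_steps=10_000):
    """Roll out a greedy policy, except that in a randomly chosen p-fraction of
    visited states the agent takes the rank-th best action (rank=2 -> second best)
    instead of the greedy one. Returns the undiscounted episode return."""
    rng = random.Random(seed)
    state, _ = env.reset(seed=seed)
    total_return = 0.0
    for _ in range(max_steps):
        q = q_values_fn(state)
        order = sorted(range(len(q)), key=lambda a: q[a], reverse=True)
        action = order[rank - 1] if rng.random() < p else order[0]
        state, reward, terminated, truncated, _ = env.step(action)
        total_return += reward
        if terminated or truncated:
            break
    return total_return
```

The performance drop is then the gap between this quantity at p = 0 and at p > 0, averaged over episodes.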

Weaknesses

  • The statement of Theorem 3.4 is very informal and it is not clear what it guarantees. From the short proof in the supplementary, it seems that only one set of parameters given in Prop. 3.3 is analyzed. Since no further analysis (such as a gradient analysis) is done, it is not guaranteed that the regularized optimization will indeed move in the overestimation direction as suggested by the theorem. I think the authors may need to either rewrite the theorem more formally and provide a proof that the optimal solution to the regularized optimization problem indeed overestimates the values, or replace the theorem, possibly by numerically showing the learning dynamics of the regularized problem.

  • Though overestimation of the optimal values and the incorrect relative ranking of sub-optimal actions seem to be potential issues when adversarial training is used, the paper doesn't discuss how these issues might actually affect adversarially trained agents. Does the overestimation lead to performance loss? Does the incorrect ranking of sub-optimal actions affect robustness in any sense? Without more discussion of these issues, they may just be properties of adversarial training and not necessarily serious concerns.

  • Adversarially trained models give higher values than vanilla trained models, but it is possible that vanilla trained models do not provide good value estimates either. Comparing with a more accurate estimate of the optimal value may help clarify this concern. One possibility is to compare with the average score, so one may get an idea of whether the trained value function over- or underestimates the values.

  • Some minor issues:

    • It's confusing when both $a_w$ and $a_{|A|}$ are used to refer to the same variable.
    • The paper introduces the concept of $\tau$-dominance, but it is not used in the main paper.

Questions

  • Can the authors improve Theorem 3.4?

  • Are the issues identified in the paper connected to performance loss, robustness, or alignment with human decisions?

  • Is it possible to compare adversarially trained and vanilla trained values with other value estimates, like the average score or some Monte Carlo methods?

Comment

Thank you for investing your time to provide a thoughtful review of our paper.

1. “Can the authors improve Theorem 3.4?”

Yes, we have now improved Theorem 3.4 in line with your question and updated our paper. Furthermore, at your suggestion we have numerically optimized both the standard loss and the regularized loss for the MDP constructed in Section 3. These new results, added to the supplementary material, further demonstrate that the numerically optimized solution to the regularized loss leads to overestimation and re-ordering of the state-action values.

2. “Are the issues identified in the paper connected to performance loss, robustness, or alignment with human decisions?”

The claim made in our paper regarding the alignment with human decisions implies that the human decision-making process is better at discerning “correct” values for sub-optimal actions. For evidence on the human decision-making process and the fact that humans have a better-than-random perception of actions that they do not take, please see [1,2,3].

For robustness, please see the recent studies describing the non-robustness of adversarially trained deep reinforcement learning policies [4,5,6]. These studies demonstrate that adversarially trained policies can be attacked with black-box adversarial perturbations [5], and that adversarially trained policies are less robust than vanilla trained policies under natural adversarial attacks, while vanilla trained policies demonstrate better generalization compared to adversarially trained ones [6]. Hence, it is important to understand the root causes of these problems, because releasing models with guaranteed safety certifications that harbor undiscovered non-robustness and vulnerabilities will in fact have serious consequences in the real world [7,8,9].

For the performance loss please see the response provided in bullet point I of Reviewer JDGz.

[1] Neural computations underlying action-based decision making in the human brain, Proceedings of the National Academy of Sciences (PNAS), 2009.

[2] How We Know What Not To Think. Trends in Cognitive Sciences, 2019.

[3] Cognitive neuroscience of human counterfactual reasoning. Frontiers in Human Neuroscience, 2015.

[4] Investigating Vulnerabilities of Deep Neural Policies, UAI 2021.

[5] Deep Reinforcement Learning Policies Learn Shared Adversarial Features Across MDPs. AAAI 2022.

[6] Adversarial Robust Deep Reinforcement Learning Requires Redefining Robustness. AAAI 2023.

[7] The Washington Post. Cruise recalls all its driverless cars after pedestrian hit and dragged, November 2023.

[8] New York Times. Driverless Taxis Blocked Ambulance in Fatal Accident, San Francisco Fire Department Says, September 2023.

[9] The Guardian. Tesla behind eight-vehicle crash was in ‘full self-driving’ mode, says driver, 2022.

3. “Adversarially trained models give higher values than vanilla trained models, but it is possible that vanilla trained models do not provide good value estimates either. Comparing with a more accurate estimate of the optimal value may help clarify this concern. One possibility is to compare with the average score, so one may get an idea of whether the trained value function over- or underestimates the values.”

Please find the average scores in the table below. Note that standard double Q-learning already tries to solve the overestimation problem that DQN experienced [1,2]. Please also see [1], which notes that double DQN still overestimates the Q-values, yet learns lower Q-values than DQN. Hence, the fact that adversarially trained policies learn higher Q-values than double DQN is further evidence that certified robust training techniques do indeed learn overestimated state-action values.

MDPs         Adversarially Trained    Vanilla Trained
BankHeist    1228.7 ± 11.4            1302.9 ± 22.8
RoadRunner   43190.0 ± 8154.0         46271.0 ± 7182.0
Freeway      30.0 ± 0.0               32.0 ± 0.3

[1] Hado van Hasselt, Arthur Guez, David Silver. Deep Reinforcement Learning with Double Q-Learning. AAAI 2016.

[2] Double Q-learning, NeurIPS 2010.
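As a sketch of the Monte Carlo style comparison raised in the question (predicted start-state Q-value versus the discounted return actually obtained by the greedy policy); this assumes a Gymnasium-style environment and a hypothetical `q_values_fn(state)` returning per-action Q-values, and is illustrative rather than the evaluation protocol used in the paper:

```python
import numpy as np

def overestimation_gap(env, q_values_fn, gamma=0.99, episodes=10, seed=0):
    """Average difference between the Q-value predicted at the start of each
    episode and the discounted Monte Carlo return actually obtained by acting
    greedily; a positive gap indicates overestimation by the learned Q-function."""
    predicted, realized = [], []
    for ep in range(episodes):
        state, _ = env.reset(seed=seed + ep)
        predicted.append(float(np.max(q_values_fn(state))))
        rewards = []
        while True:
            action = int(np.argmax(q_values_fn(state)))
            state, reward, terminated, truncated, _ = env.step(action)
            rewards.append(reward)
            if terminated or truncated:
                break
        ret = 0.0
        for r in reversed(rewards):          # discounted Monte Carlo return
            ret = r + gamma * ret
        realized.append(ret)
    return float(np.mean(predicted) - np.mean(realized))
```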

Comment

Thank you for the responses. Here are some follow up comments.

  • I appreciate the rewriting of Theorem 3.4. The additional numerical experiment for the example is very interesting, but I have some additional questions about it. Figure 12 has no legend, but I assume the orange ones are for the regularized loss, right? The figure only shows $a_1, a_2, a_3$ without indicating which one is the best, though it could be inferred. The values for the second best and the third best actions look identical for the orange ones. Are they exactly the same, and is this expected from the re-ordering of the suboptimal actions analysis?

  • I appreciate the information connecting the property of sub-optimal actions' ordering to human decisions, and I feel like it would be nice if it could be incorporated into the main paper.

  • The table in the response shows the average scores of the two models, but no performance estimates by the associated Q-functions. Their values are very different from the Q-values in Table 2. Is there some scaling or normalization? What I feel to be an important comparison is something like Figure 3 in the double Q-learning paper, where true values and estimates are compared for different methods.

Comment

Thank you for your response.

1. “I appreciate the rewriting of Theorem 3.4. The additional numerical experiment for the example is very interesting, but I have some additional questions about it. Figure 12 has no legend, but I assume the orange ones are for the regularized loss, right? The figure only shows $a_1, a_2, a_3$ without indicating which one is the best, though it could be inferred. The values for the second best and the third best actions look identical for the orange ones. Are they exactly the same, and is this expected from the re-ordering of the suboptimal actions analysis?”

Thank you for this comment. Yes, the orange color refers to the regularized loss. The values of the second best and third best actions are not identical; because of the scale difference between the state-action values of the regularized loss and vanilla training (i.e. the overestimation in the adversarially trained ones), they appear nearly identical in the plot. However, there is an actual difference between the values of the second best and third best actions for the adversarially trained ones.

2. ”I appreciate the information connecting the property of sub-optimal actions' ordering to human decisions, and I feel like it would be nice if it could be incorporated into the main paper.”

Thank you very much for this comment. We have added a new section in the supplementary material for this particular discussion (Section 1.9). Currently, space restrictions limit our ability to include it in the main paper, but for the camera-ready version we could definitely move this section to the main paper as you suggest.

3. “The table in the response shows the average scores of the two models, but no performance estimates by the associated Q-functions. Their values are very different from the Q-values in Table 2. Is there some scaling or normalization? What I feel to be an important comparison is something like Figure 3 in the double Q-learning paper, where true values and estimates are compared for different methods.”

Please note that the vanilla training in our paper is indeed double Q-learning (DDQN). Hence, the results reported in Figure 3 of the double Q-learning paper directly apply to the vanilla trained policies in our paper. Thus, the fact that [1] demonstrates that double Q-learning policies already overestimate the state-action values further affirms, given the results reported in Figure 4 and Table 2, that the adversarially trained policies indeed overestimate.

[1] Hado van Hasselt, Arthur Guez, David Silver. Deep Reinforcement Learning with Double Q-Learning. AAAI 2016.

Review
Rating: 5

This paper studies the effect of adversarial robustness regularization on the action gap, overestimation bias, and suboptimal action ranking accuracy of deep neural networks trained on RL tasks. It observes that adversarially robust RL agents exhibit a larger action gap between the predicted optimal and second-best action, accompanied by reduced accuracy at ranking suboptimal actions. It demonstrates these pathologies in a simple linear model, along with deep RL agents trained on a subset of games from the Arcade Learning Environment.

Strengths

  • This paper highlights a previously under-discussed pitfall of adversarial training: adversarial robustness objectives have the side effect of reducing the accuracy of the ranking of the Q-value over actions.
  • It illustrates this pathology with an easy-to-interpret example, whose correctness is easy to verify.
  • The analysis of the action-ranking accuracy is creative and quite interesting.
  • The paper provides an interesting counterexample to a reasonably widely-held belief that increasing the action gap should correspond to reduced overestimation bias.
  • The figures are for the most part quite easy to interpret and clearly convey the intended message.
  • The experimental setup is clear and well-justified.

Weaknesses

  • I have some issues with the notion of adversarial robustness studied in this paper. In particular, it is unclear whether the adversarial regularizer really makes sense in an RL context, since it doesn't distinguish the quality of the action that gets overestimated in the adversarial example. In the image classification tasks where adversarial examples were initially studied, any incorrect label is equally 'incorrect' in a sense. However, in deep RL, the particular action targeted by the adversary can have wildly different influences on the performance of the agent's policy. The authors argue that the adversarial training objective they study is of interest because it has been studied in multiple published works, however some of the cited works (for example Gleave et al.) consider a very different adversarial threat model than that discussed here.
  • The observation that the adversarial regularizer interferes with the model's accuracy on suboptimal actions is exactly what would be expected by looking at the formula used. While the empirical evaluations and toy example are helpful to verify
  • It is not clear whether inaccurate ranking of Q-values and overestimation bias is actually a problem in the sense that it leads to worse behaviour policies or slower training. For example, I could imagine that inaccurate estimates of suboptimal actions could be a problem if the network has not yet converged to an optimal policy. However, analysis around this phenomenon is missing from the paper.
  • The discussion of the 'action gap' is misleading: the paper claims that "the fact that adversarially trained deep neural policies overestimate the optimal state-action values refutes the hypothesis that increasing the action gap is the sole cause of a decrease in overestimation bias of state-action values." However, the experimental setup of this paper does not isolate the effect of increasing the action gap on overestimation bias. Because there are many other confounding factors that arise from the adversarial regularizer, the results from section 6.3 are consistent with a model where increasing the action gap decreases overestimation bias, but then some other effect of the adversarial regularizer independently increases the overestimation bias, overwhelming the effect of the increased action gap. In that setting, it would still be correct to say that "the action gap is the sole cause of a decrease in overestimation bias" in other contexts. This claim should be adjusted to state that "an increase in the action gap of a Q-function does not uniformly reduce overestimation bias in all contexts." (Standard definitions of the action gap and overestimation bias are sketched after this list.)
  • Minor: Figure 5 is difficult to read due to the small font size. In Figure 4, there are some bizarre artifacts where the Q values of the blue line occasionally drop to overlap with the red line, and it's not clear why this should happen.
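For reference, standard definitions of the two quantities discussed in the action-gap point above (the paper's exact notation may differ):

$$G_\theta(s) = Q_\theta\big(s, a^{(1)}\big) - Q_\theta\big(s, a^{(2)}\big), \qquad B_\theta(s) = \max_a Q_\theta(s, a) - \max_a Q^*(s, a),$$

where $a^{(1)}$ and $a^{(2)}$ are the highest- and second-highest-ranked actions under $Q_\theta$, $G_\theta(s)$ is the action gap, $B_\theta(s)$ is the overestimation bias at state $s$, and $Q^*$ is the true optimal state-action value function.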

Questions

  • Do the pathologies highlighted by this paper result in meaningful challenges for optimization? Do they slow down learning? Simply noting that a regularizer results in worse Q-value estimation does not on its own indicate that this worse Q-value estimation is necessarily a problem for performance or for learning dynamics. I would be more confident in the significance of these findings if the authors could indicate some practical examples in which they present a barrier to policy improvement.
  • One angle that seems to be missing from this paper is whether there are other ways of enforcing adversarial robustness which avoid these pathologies, assuming that they are indeed a problem. Would a more naive approach of e.g. explicitly regularizing the Lipschitz constant of the network also encounter this pathology?
  • The three games considered are settings where I would expect adversarial robustness to be particularly at odds with Q-value accuracy, as states with small pixel distance could correspond to very different value functions. I would be interested in seeing if we see the same magnitude of trends in a different domain such as Mujoco.

Comment

4. ”One angle that seems to be missing from this paper is whether there are other ways of enforcing adversarial robustness which avoid these pathologies, assuming that they are indeed a problem. Would a more naive approach of e.g. explicitly regularizing the Lipschitz constant of the network also encounter this pathology?”

The main reason why the more naive approach will not work is the fundamental trade-off between the performance of the policy and robustness certification. In particular, a Q-network is adversarially robust if the top-ranked action does not change under small perturbations to states, i.e. if the change in state-action values is smaller than the value gap between the top-ranked and second-ranked action. This value gap varies significantly across states in an MDP, and thus attempting to enforce one fixed Lipschitz constant will either fail to achieve robustness in many states or be overly robust in many states, and thus be unable to assign correct values to nearby states that truly do have quite different values. The regularizer $\mathcal{R}(\theta)$ that we analyze can be seen as adaptively choosing the level of Lipschitzness based on the value gap in each state, thus overcoming this particular trade-off between performance and robustness. Also see our supplementary material for an empirical analysis of a different approach to adversarial robustness (which explicitly claims to reduce overestimation). We show that this other approach still leads to overestimation and re-ordering of suboptimal actions, demonstrating that even new regularizers designed to mitigate overestimation still have this problem.
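To make the contrast concrete, here is a minimal PyTorch sketch of a value-gap-aware robustness regularizer of this flavor; the inner maximization over the $\epsilon$-ball is crudely approximated with uniform random noise, whereas certified methods use bound propagation, and the exact form of $\mathcal{R}(\theta)$ analyzed in the paper may differ:

```python
import torch

def robustness_regularizer(q_network, states, epsilon, kappa=0.5):
    """Sketch of a value-gap-aware adversarial robustness regularizer.

    The inner maximization over the epsilon-ball is crudely approximated with a
    random perturbation; certified training methods bound the perturbed Q-values
    with bound propagation instead.
    """
    with torch.no_grad():
        q_clean = q_network(states)
        greedy = q_clean.argmax(dim=1, keepdim=True)            # a*(s), shape (B, 1)

    noise = (torch.rand_like(states) * 2.0 - 1.0) * epsilon     # point in the eps-ball
    q_pert = q_network(states + noise)                          # Q-values at perturbed states

    q_greedy = q_pert.gather(1, greedy)                         # Q(s~, a*)
    actions = torch.arange(q_pert.size(1), device=q_pert.device).unsqueeze(0)
    q_others = q_pert[actions != greedy].view(q_pert.size(0), -1)  # Q(s~, a) for a != a*

    # Hinge penalty: a non-greedy action whose perturbed value comes within
    # `kappa` of the greedy action's value is penalized. The penalty only binds
    # in states where the margin is small, so the smoothness constraint is
    # state-dependent rather than a single global Lipschitz bound.
    violation = (q_others - q_greedy + kappa).clamp(min=0)
    return violation.max(dim=1).values.mean()
```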

Comment

Thank you for allocating your time to provide a considered review of our paper.

1. “Do the pathologies highlighted by this paper result in meaningful challenges for optimization? Do they slow down learning? Simply noting that a regularizer results in worse Q-value estimation does not on its own indicate that this worse Q-value estimation is necessarily a problem for performance or for learning dynamics. I would be more confident in the significance of these findings if the authors could indicate some practical examples in which they present a barrier to policy improvement.”

Yes, it does indeed result in challenges for optimization. Please see the robust reinforcement learning algorithm [1] that is also investigated in our supplementary material. The paper [1] describes that RADIAL-DQN is first trained for 6 million steps with standard training and then for an additional 4.5 million steps with the RADIAL algorithm. Note that these models take an action (step) every 4 frames, as also reported in [1]. Thus, RADIAL-DQN is trained on 24 million frames with standard training and then on an additional 18 million frames with the RADIAL algorithm. The score achieved by RADIAL-DQN after training on a total of 42 million frames is 45870 in RoadRunner. The score achieved by standard dueling DDQN is 51100 in 40 million frames, as reported in [2] (please see the learning curves in Figure 5 of [2]). Thus, yes, these methods do indeed slow down learning.

Additionally, note that the training times reported in the original paper [3] are 15 hours for vanilla DQN and 50 hours for SA-DDQN on a single 1080 Ti GPU.

[1] Robust Deep Reinforcement Learning through Adversarial Loss, NeurIPS 2021.

[2] Rainbow: Combining Improvements in Deep Reinforcement Learning, AAAI 2018.

[3] Robust Deep Reinforcement Learning against Adversarial Perturbations on State Observations, NeurIPS 2020.

2. ”The observation that the adversarial regularizer interferes with the model's accuracy on suboptimal actions is exactly what would be expected by looking at the formula used. While the empirical evaluations and toy example are helpful to verify.”

Please see the certified robust reinforcement learning algorithm that we also investigate in our supplementary material. This more recent robust reinforcement learning study [1] specifically claims that the problems we discuss in our paper do not occur in their robust reinforcement learning algorithm. Yet, the results reported in our supplementary material further demonstrate that even more recent adversarial training algorithms learn inaccurate, overestimated and inconsonant state-action value functions.

[1] Robust Deep Reinforcement Learning through Adversarial Loss, NeurIPS 2021.

3. ”The three games considered are settings where I would expect adversarial robustness to be particularly at odds with Q-value accuracy, as states with small pixel distance could correspond to very different value functions. I would be interested in seeing if we see the same magnitude of trends in a different domain such as Mujoco.”

We would like to highlight that our paper's focus is on high-dimensional state observations. As can be seen from prior work [1,2], there is unfortunately not yet any certified adversarial training method that can solve high-dimensional Mujoco tasks.

[1] Robust Deep Reinforcement Learning through Adversarial Loss, NeurIPS 2021.

[2] Robust Deep Reinforcement Learning against Adversarial Perturbations on State Observations, NeurIPS 2020.

Comment

Thanks to the authors for the helpful response. I will try to highlight which parts of my review I think the response has addressed, and which remain a concern.

  1. Thanks for providing the data point re: convergence rates -- I recommend the authors explicitly discuss how these issues lead to slower convergence rates in the paper, as it is a really important corollary of their findings. Looking over the cited paper, I also noticed that the RADIAL experiments follow vanilla RL training for ~ the first half of training, and then incorporate the adversarial loss. This raises an interesting direction for evaluation which I think could also help improve the significance of the paper's contribution. It is completely believable to me that adversarial training methods would slow down convergence, partly due to the reduced accuracy on suboptimal actions but also I would expect due to the fact that an adversarial loss might make it difficult for the agent to change its greedy policy (in fact, this second argument should also apply to policy-gradient methods, which would assign excessive probability to the current preferred action). I wonder if this suggests that a more efficient way to train adversarially robust deep RL agents would be to wait until the policy has converged and then make it robust, rather than trying to optimize two conflicting objectives at once.

  2. First I apologize for the cut-off sentence in this point in my original review. Could the authors point out where the cited paper explicitly claims that the method does not encounter the issues identified in this paper? To clarify the concern I had on this point: the findings of the paper are primarily observations that seem to follow directly from the adversarial loss in Definition 3.1. If the same issue arises in a variety of adversarial training strategies, this suggests that a stronger result than that proposed by Theorem 3.4 might be possible, which would demonstrate a more fundamental robustness-accuracy trade-off.

  3. Could the authors clarify why they think that lower-dimensional embeddings would not run into the overestimation and inaccurate ranking issues identified, given that the example from Prop 3.3 is on a very low-dimensional domain and that adversarial robustness methods have been successful on the standard low-dimensional observation space for Mujoco? I appreciate that it might be more difficult to characterize inconsonance in the action rankings, but certainly at least something like the overestimation bias could be verified.

I would like to emphasize the two main barriers that currently stop me from increasing my score.

C1: I don't think the paper provides sufficient information gain over what would be expected from reasoning about adversarial objectives in RL (i.e. it identifies predictable problems).

C2: the paper is missing actionable insights based on these observations. If action-ordering and overestimation bias are issues, does fixing them speed up training of adversarial robust RL agents? Or is there a fundamental trade-off between robustness and accuracy that means any adversarially robustness objective will induce these issues?

There are additional experiments and theoretical results which could address these issues (for example, training without adversarial robustness objectives and then distilling the resulting Q-learning policy into an adversarially robust learner, proving $\forall$ rather than $\exists$ statements characterizing theoretical tradeoffs), although I appreciate that these are difficult to fully address in the limited rebuttal time. I would be very interested in even small-scale findings addressing these concerns.

Comment

Thank you for your response and helpful comments. At your request, we have now added a section (Section 1.7) to the supplementary material dedicated to theoretically proving the fundamental trade-off between accurate estimation of state-action values and adversarial robustness. In particular, the result applies to any method which ensures that perturbations of magnitude $\epsilon$ do not change the action ranked first by the Q-function. Please see Section 1.7 in the supplementary material for more details.
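As a sketch of the robustness condition involved (not necessarily the exact statement of Section 1.7): a Q-function is adversarially robust at state $s$ under budget $\epsilon$ if

$$\arg\max_a Q_\theta(\bar{s}, a) = \arg\max_a Q_\theta(s, a) \quad \text{for all } \bar{s} \text{ with } \|\bar{s} - s\|_\infty \le \epsilon,$$

which holds in particular whenever

$$\min_{\|\bar{s}-s\|_\infty \le \epsilon} Q_\theta(\bar{s}, a^*) \;>\; \max_{\|\bar{s}-s\|_\infty \le \epsilon}\, \max_{a \ne a^*} Q_\theta(\bar{s}, a), \qquad a^* = \arg\max_a Q_\theta(s, a).$$

Enforcing this margin constrains how much $Q_\theta$ can vary across $\epsilon$-close states that genuinely have different values, which is the source of the trade-off with accurate value estimation.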

Review
Rating: 5

This paper reveals that adversarial training in RL can lead to inconsistencies and overestimation of state-action (Q) values. The authors further show that vanilla-trained DRL agents have more accurate and consistent estimates, both in a theoretical analysis with a linear model and in experiments with neural network approximation.

Strengths

  1. The paper is well-organized and easy to follow.
  2. The research problem is interesting; it aims to answer the question "What cost does adversarial training bring to value estimation in DRL?"
  3. The motivating linear model example clearly explains the overestimation and incorrect ranking induced by adversarial training.
  4. The existing experiments clearly support the claim and the findings in the paper.

Weaknesses

  1. As a finding/observing paper, the "novelty" is not that strong. Here the "novelty" refers to the findings themselves. By adding a regularizer term as introduced in Definition 3.1, it is somewhat intuitive and straightforward to imagine that this regularizer keeps the peak value of the optimal action while punishing any other choices within the small neighborhoods, which could lead to the over-estimation of optimal action and reordering of non-optimal action. Therefore, it is not surprising to see the findings in the following context.
  2. Although the authors mention the effects of overestimation of the optimal action and reordering of non-optimal actions throughout the paper, I don't see a systematic analysis and deep discussion of how they lead to a big problem for RL.
  3. As a finding/observing paper, the authors only implemented DDQN vs. SA-DDQN. This is not enough for a paper motivated by experiments. The authors are encouraged to bring more results to support their findings and claims.

Questions

  1. From RL's perspective, overestimating the value of the optimal action shouldn't be a problem, as the higher value will encourage the agent to pick the optimal action. Re-ordering the non-optimal actions is not a problem either, as the agent will never pick them. Should we really consider these two as drawbacks of adversarial training in RL?

Comment

Thank you for investing your time to provide kind feedback.

1. “Although the authors mention the effects of overestimation of the optimal action and reordering of non-optimal actions throughout the paper, I don't see a systematic analysis and deep discussion of how they lead to a big problem for RL.”

We would like to highlight that recent work [5,6,7] demonstrated the adversarial sensitivities of certified robust reinforcement learning policies, ranging from black-box adversarial attacks [7] to natural attacks that reveal the generalization problems of adversarially trained deep reinforcement learning policies compared to standard reinforcement learning [5]. While these studies highlight the safety and security problems in certified adversarially trained policies, our paper dives into and explains the particular reasons why adversarial training experiences these safety problems. We believe it is crucial to understand the root causes of these problems, because guaranteeing safety certifications for models with undiscovered non-robustness and vulnerabilities will in fact have serious consequences in the real world [1,2,3,4].

[1] The Washington Post. Cruise recalls all its driverless cars after pedestrian hit and dragged, November 2023.

[2] The New York Times. Driverless Taxis Blocked Ambulance in Fatal Accident, San Francisco Fire Department Says, September 2023.

[3] The Guardian. Tesla behind eight-vehicle crash was in ‘full self-driving’ mode, says driver, December 2022.

[4] New York Times. A driverless Cruise taxi with a passenger collided with a fire truck Thursday night, August 2023.

[5] Adversarial Robust Deep Reinforcement Learning Requires Redefining Robustness. AAAI 2023.

[6] Investigating Vulnerabilities of Deep Neural Policies, UAI 2021.

[7] Deep Reinforcement Learning Policies Learn Shared Adversarial Features Across MDPs. AAAI 2022.

2. “From RL's perspective, overestimating the value of the optimal action shouldn't be a problem.”

Please note that overestimation is indeed a problem from RL's perspective. Double Q-learning was proposed specifically to solve the overestimation of Q-values that DQN experiences [1,2]. This discussion is also currently present in Section 6.3 of the paper.
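For reference, the targets being contrasted are the standard ones (with $\theta^-$ denoting the target-network parameters):

$$y^{\text{DQN}} = r + \gamma \max_{a} Q_{\theta^-}(s', a), \qquad y^{\text{DoubleDQN}} = r + \gamma\, Q_{\theta^-}\big(s', \arg\max_{a} Q_{\theta}(s', a)\big).$$

The coupled action selection and evaluation in the DQN target produces the upward bias; Double DQN decouples them, which reduces but does not fully eliminate overestimation.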

[1] Hado van Hasselt, Arthur Guez, David Silver. Deep Reinforcement Learning with Double Q-Learning. AAAI 2016.

[2] Double Q-learning, NeurIPS 2010.

3. “As a finding/observing paper, the authors only implemented DDQN vs. SA-DDQN. This is not enough for a paper motivated by experiments. The authors are encouraged to bring more results to support their findings and claims.”

We would like to highlight that our paper does not focus only on SA-DDQN. There are additional results in the supplementary material that further demonstrate that more recent adversarial training algorithms experience the exact same problems described in our paper, in which the robust reinforcement learning policies learn inaccurate, overestimated and inconsonant state-action value functions.

4. “As a finding/observing paper, the "novelty" is not that strong. Here the "novelty" refers to the findings themselves. By adding a regularizer term as introduced in Definition 3.1, it is somewhat intuitive and straightforward to imagine that this regularizer keeps the peak value of the optimal action while punishing any other choices within the small neighborhoods, which could lead to the over-estimation of optimal action and reordering of non-optimal action. Therefore, it is not surprising to see the findings in the following context.”

Please see the certified robust reinforcement learning algorithm that we also investigate in our supplementary material. This more recent robust reinforcement learning study [1] specifically claims that the problems we discuss in our paper do not occur in their robust reinforcement learning algorithm. Yet, the results reported in our supplementary material further demonstrate that even more recent adversarial training algorithms learn inaccurate, overestimated and inconsonant state-action value functions.

[1] Robust Deep Reinforcement Learning through Adversarial Loss, NeurIPS 2021.

5. “Should we really consider these two as drawbacks of adversarial training in RL?”

Yes, please see our response to item 1.

Comment

Thanks for the reply. I understand that your focus in this paper is not on SA-DDQN. In the appendix, I only found RADIAL as another tested approach. Again, this is not enough for a paper motivated by experiments and observations.

Still, the finding itself is not surprising to me, as the robust training regularizer term already reveals these results from my perspective.

Comment

We have added a new Section 1.7 in the supplementary material dedicated to theoretically proving the fundamental trade-off between adversarial robustness and accurate state-action value estimation. Please see our official comment to Reviewer JDGz and Section 1.7 of the supplementary material for more details.

Furthermore, please also see Sections 1.6 and 1.8 of the supplementary material. As can be seen there, the theoretical and empirical results provided in the main body of the paper extend to even simpler MDPs, demonstrating the generality of the problems caused by adversarial training that we discuss in our paper.

AC Meta-Review

This paper studies the effect of adversarial robustness regularization on the action gap, overestimation bias, and suboptimal action ranking accuracy of deep neural networks trained on RL tasks. It observes that adversarially robust RL agents exhibit a larger action gap between the predicted optimal and second-best action, accompanied by reduced accuracy at ranking suboptimal actions. It demonstrates these pathologies in a simple linear model, along with deep RL agents trained on a subset of games from the Arcade Learning Environment.

The reviewers find the analysis worthwhile but point out that the results are both not very surprising and not general enough, and that the adversarial training adopted in the paper is not the same as the formulation employed in past works. The rebuttal submitted by the authors addresses their concerns to some extent, but not to the extent that they believe the paper is above the bar for ICLR.

Why not a higher score

The reviewers find the analysis worthwhile but point out that the results are both not very surprising and not general enough, and that the adversarial training adopted in the paper is not the same as the formulation employed in past works. The rebuttal submitted by the authors addresses their concerns to some extent, but not to the extent that they believe the paper is above the bar for ICLR.

Why not a lower score

N/A

Final Decision

Reject