JuxtAlign: A Foundational Analysis on Alignment of Certified Reinforcement Learning
Abstract
Reviews and Discussion
JuxtAlign studies alignment of certified RL using ideas of natural intelligence from neuroscience. The authors imply orthogonality of natural intelligence and the adversarial training often used in robust RL. The paper contains a theoretical and an experimental analysis of these phenomena.
Strengths
- The paper contains a theoretical foundation for the analysis
- The paper presents an experimental analysis of robustness and natural intelligence.
Weaknesses
- The presentation can be improved as it was hard for me to fully assess the results. Perhaps I missed something, but I didn't get the main message.
- It’s well known that robustness, adversarial training are blunt tools in the sense that they try to avoid all possible outcomes as designed. For example, we can take a worst-case action within a ball of possible actions. If we re-design these training tools with a more constrained scope then the results will be different. Therefore, I don’t quite understand how we can make general conclusions on certified training vs natural intelligence from this study.
- In extreme situations, robust training can result in trivial policies - do not do anything. Therefore, I don’t find it surprising that sometimes robust training is different from what we perceive as naturally intelligent.
- The theory seems to be based on the Danskin theorem, which can be applied only to C1 Q-functions. I believe this is a quite restrictive setting. In min-max problems it’s even more restrictive.
- Overestimation in Q-functions is a general problem in RL, not just in a robust setting. But I agree that in robust settings it will be more acute. However, using overestimation as a reason for misalignment in this context is not entirely correct and can be misleading.
Questions
Minor comments
- Line 213 - typo
- I am a bit confused about the depiction of the brain and video games in Figure 2. Why are they there?
- Def 4.2. The formulation is slightly confusing: "For any ..." in the beginning of the sentence typically implies that the condition must be valid for all for the definition to hold. I think here the authors mean "for some ", which is better to place closer to the end of the sentence.
Thank you for writing a review for our paper, and thank you for stating that our paper provides a theoretical and an experimental analysis with theoretical foundations.
1. “The theory seems to be based on the Danskin theorem, which can be applied only to C1 Q-functions. I believe this is a quite restrictive setting. In min-max problems it’s even more restrictive.”
We believe you have a misconception here. The theory we introduce in the paper does not rely on Danskin’s theorem in any way. Danskin’s theorem is the standard theoretical justification for adversarial training and we just include it as background to explain the prior adversarial training techniques.
2. ”In extreme situations, robust training can result in trivial policies - do not do anything. Therefore, I don’t find it surprising that sometimes robust training is different from what we perceive as naturally intelligent.”
You seem to have a misconception here. All of the policies trained in our paper are well-performing policies.
3. ”Overestimation in Q-functions is a general problem in RL, not just in a robust setting. But I agree that in robust settings it will be more acute. However, using overestimation as a reason for misalignment in this context is not entirely correct and can be misleading.”
You have a misconception here. There is no causality between overestimation and misalignment. These are independent symptoms of robust training methods.
4. ”The presentation can be improved as it was hard for me to fully assess the results. Perhaps I missed something, but I didn't get the main message.”
Our paper discovers that a recent line of work that focuses on ensuring safety and robustness in reinforcement learning, in fact produces policies that are inconsistent, non-robust, misaligned, and even further unable to reason counterfactually; even though the original biologically inspired standard reinforcement learning in fact learns counterfactual, consistent and aligned values.
Our paper first provides a theoretical analysis that reveals the foundational reasons of the misalignment, inconsistency and inability to reason counterfactually, then further provides an extensive empirical analysis in deep reinforcement learning with robust and standard reinforcement learning policies.
Thus, our results challenge the current research norms on robustness and safety, and further the analysis provided in our paper invites the community to reconsider robustness and safety within the original inspiration of reinforcement learning and without losing the core foundations and attributes of reinforcement learning.
5. ”It’s well known that robustness, adversarial training are blunt tools in the sense that they try to avoid all possible outcomes as designed. For example, we can take a worst-case action within a ball of possible actions. If we re-design these training tools with a more constrained scope then the results will be different. Therefore, I don’t quite understand how we can make general conclusions on certified training vs natural intelligence from this study.”
The current adversarial training paradigms [1,2,3,4] are far from your suggestion, and our paper’s focus is the current robust training paradigms, not what robustness might be in an unforeseeable future.
[1] Towards Deep Learning Models Resistant to Adversarial Attacks, ICLR 2018.
[2] Robust Deep Reinforcement Learning against Adversarial Perturbations on State Observations, NeurIPS 2020.
[3] Understanding and Diagnosing Deep Reinforcement Learning, ICML 2024.
[4] Robust Deep Reinforcement Learning through Adversarial Loss, NeurIPS 2021.
Thank you for the response.
In the interest of time, I will respond quickly to some clarifications and leave others for another time.
1. “The theory seems to be based on the Danskin theorem, which can be applied only to C1 Q-functions. I believe this is a quite restrictive setting. In min-max problems it’s even more restrictive.” We believe you have a misconception here. The theory we introduce in the paper does not rely on Danskin’s theorem in any way. Danskin’s theorem is the standard theoretical justification for adversarial training and we just include it as background to explain the prior adversarial training techniques.
My point stands: Danskin's theorem is not relevant for the RL case, and therefore its value as background can only be confusing without further remarks.
2. ”In extreme situations, robust training can result in trivial policies - do not do anything. Therefore, I don’t find it surprising that sometimes robust training is different from what we perceive as naturally intelligent.”
You seem to have a misconception here. All of the policies trained in our paper are well-performing policies.
I don't believe I have a misconception. My point is that Robust RL is looking for a policy that performs equally well for all possible cases as defined in the problem formulation. The performance of a particular robust RL approach depends on this definition. In some cases, it can result in trivial policies. I am not claiming that any policies trained in the paper have this property, but it seems that there's some overclaiming in the paper at least.
3. ”Overestimation in Q-functions is a general problem in RL, not just in a robust setting. But I agree that in robust settings it will be more acute. However, using overestimation as a reason for misalignment in this context is not entirely correct and can be misleading.”
You have a misconception here. There is no causality between overestimation and misalignment. These are independent symptoms of robust training methods.
Apologies for the confusion. My point is that overestimation of Q functions is a general problem in RL. I don't see any convincing arguments that there are any specific issues caused by robust formulations.
- ”It's well known that robustness, adversarial training are blunt tools in the sense that they try to avoid all possible outcomes as designed. For example, we can take a worst-case action within a ball of possible actions. If we re-design these training tools with a more constrained scope then the results will be different. Therefore, I don’t quite understand how we can make general conclusions on certified training vs natural intelligence from this study.”
The current adversarial training paradigms [1,2,3,4] are far from your suggestion, and our paper’s focus is the current robust training paradigms, not what robustness might be in an unforeseeable future.
[1] Towards Deep Learning Models Resistant to Adversarial Attacks, ICLR 2018.
[2] Robust Deep Reinforcement Learning against Adversarial Perturbations on State Observations, NeurIPS 2020.
[3] Understanding and Diagnosing Deep Reinforcement Learning, ICML 2024.
[4] Robust Deep Reinforcement Learning through Adversarial Loss, NeurIPS 2021.
I am not talking about unforeseeable future, and I don't see any contradictions between my statement and previous works on robust RL. For example, in [2] it is stated (in introduction):
To ensure safety under the worst case uncertainty, we consider the adversarial setting where the state observation is adversarially perturbed from s to v(s), yet the underlying true environment state s is unchanged.
Picking v(s) as a zero function would obviously result in a trivial policy. I see robustness (as do the authors of [2]) as handling worst-case disturbances/uncertainty. Could you clarify why my suggestion is far from what the authors of [2] are studying?
1. “The theory seems to be based on the Danskin theorem, which can be applied only to C1 Q-functions. I believe this is a quite restrictive setting. In min-max problems it’s even more restrictive.” We believe you have a misconception here. The theory we introduce in the paper does not rely on Danskin’s theorem in any way. Danskin’s theorem is the standard theoretical justification for adversarial training and we just include it as background to explain the prior adversarial training techniques.
My point stands: Danskin's theorem is not relevant for the RL case, and therefore it's value as background can only be confusing without further remarks.
Your point does not stand: your point was that our theory is based on Danskin's theorem, which, as we pointed out, is false, because our theory does not depend on Danskin's theorem.
Danskin's theorem is the main theoretical justification for adversarial training methods [1], and these methods are indeed used in reinforcement learning, particularly in SA-DQN adversarial training [2].
Thus, the new point you are trying to make, that “Danskin's theorem is not relevant to RL”, is also incorrect.
[1] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards Deep Learning Models Resistant to Adversarial Attacks. ICLR 2018.
[2] Robust deep reinforcement learning against adversarial perturbations on state observations. NeurIPS 2020.
2. ”Apologies for the confusion. My point is that overestimation of Q functions is a general problem in RL. I don't see any convincing arguments that there are any specific issues caused by robust formulations.”
The fact that robustly trained policies learn higher state-action values than standard reinforcement learning, as reported in Figure 3, indeed demonstrates that robust training learns overestimated state-action value functions.
3. ”Picking v(s) as a zero function, would obviously result a trivial policy. I see (and the authors of [2]) robustness as handling worst-case disturbances/uncertainty. Could you clarify why my suggestion is far from what the authors of [2] are studying?”
The authors of [2] choose v(s) to be a worst-case perturbation within an ℓp-norm ball of radius ε around s. This is stated in the first sentence of Section 4 of the paper [2].
Furthermore, ℓp-norm bounded perturbations are the canonical way adversarial perturbations are computed and the canonical definition of robust training [1,2,3,4,5,6,7,8] (see the short sketch after the references below).
[1] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards Deep Learning Models Resistant to Adversarial Attacks. ICLR 2018.
[2] Robust deep reinforcement learning against adversarial perturbations on state observations. NeurIPS 2020.
[3] Ensemble Adversarial Training: Attacks and Defenses, ICLR 2018.
[4] Robust Deep Reinforcement Learning through Adversarial Loss, NeurIPS 2021.
[5] Understanding and Improving Fast Adversarial Training, NeurIPS 2020.
[6] Adversarial Training and Provable Defenses: Bridging the Gap, ICLR 2020.
[7] Revisiting Adversarial Training for ImageNet: Architectures, Training and Generalization across Threat Models, NeurIPS 2023.
[8] Understanding and Diagnosing Deep Reinforcement Learning, ICML 2024.
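For concreteness, below is a minimal sketch of how such an ℓp-bounded (here ℓ∞) worst-case state perturbation is typically computed with projected gradient steps. This is an illustrative PyTorch-style sketch under an assumed interface (a q_network mapping observation tensors to Q-value vectors); it is not code from our paper or from the cited works, and the hyperparameters are placeholders.

```python
import torch

def linf_state_attack(q_network, state, eps=0.01, step_size=0.003, n_steps=10):
    """Minimal PGD-style sketch of an l_inf-bounded worst-case state perturbation.

    Assumed interface (not taken from any cited paper): `q_network` maps a state
    tensor of shape [batch, obs_dim] to Q-values of shape [batch, n_actions].
    The attack lowers the Q-value of the originally greedy action so that, within
    the eps-ball around the clean state, the greedy action may change.
    """
    with torch.no_grad():
        greedy = q_network(state).argmax(dim=-1, keepdim=True)  # clean greedy action

    delta = torch.zeros_like(state)
    for _ in range(n_steps):
        delta.requires_grad_(True)
        q_values = q_network(state + delta)
        # Q-value of the clean greedy action under the perturbed observation.
        loss = q_values.gather(-1, greedy).sum()
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            # Signed-gradient descent on that value, projected back into the eps-ball.
            delta = (delta - step_size * grad.sign()).clamp(-eps, eps)
    return (state + delta).detach()
```

The only design choice relevant to this discussion is the fixed ε-radius constraint, which is exactly the ℓp-bounded threat model assumed throughout the works cited above.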
Your point does not stand: your point was that our theory is based on Danskin's theorem, which, as we pointed out, is false, because our theory does not depend on Danskin's theorem.
I am recommending a way of writing that would not cause confusion while reading. The way the flow goes, it seems that Danskin's theorem is relevant for the RL results, but, as you claim, the presented results are not based on Danskin's theorem, which creates confusion.
Thus, the new point you are trying to make, that “Danskin's theorem is not relevant to RL”, is also incorrect.
Let me clarify, as there seems to be some confusion. If you want to claim that Danskin's theorem is relevant in this context, then please provide the proof that any optimal Q function for any MDP will be C1. Otherwise I don't see the relevance, and it will confuse the readers.
Figure 3 indeed demonstrates that robust training learns overestimated state-action value functions.
Could you clarify what the Figure depicts? I read your explanations to the other reviewers, but I am still confused about it. The state-action space is many-dimensional, but the axes are obviously 1-dimensional. Does it mean that the heat map at the q1-q2 grid represents the number of state-action pairs (s, a) such that q1 = Q(s, a) and q2 = Q_adv(s, a)? What's the scale of the heat map?
How can you rule out that the issue here is caused by this specific formulation or adversarial RL algorithm? What makes you confident that there will be a significant impact for any other environment or adversarial RL algorithm?
The authors of [2] choose v(s) to be a worst-case perturbation within an ℓp-norm ball of radius ε around s. This is stated in the first sentence of Section 4 of the paper [2].
Sure, that's an assumption for their specific application. In other applications v(s) can be different and this assumption may not hold. It is possible that this assumption cannot be made in some environments or will lead to poor performance.
My point remains the following: how can you rule out that the observed behavior is caused by a specific application/environment? The claims that are made are quite general, and I am not convinced that one can make them based on the discussion in the paper.
The paper critically and formally examines recent advances in certified RL. It shows how these novel RL schemes, which aim to learn robust policies, actually induce policies that are misaligned with natural intelligence. Moreover, the authors perform a rigorous theoretical analysis of such misalignment, thus formally proving failure cases of such RL schemes.
Strengths
CLARITY and QUALITY:
- The paper is generally well-written, presents a clear problem, and clear derivations.
- It helps the reader follow the main argument step by step and presents clear mathematical statements of the derived conclusions.
SIGNIFICANCE and ORIGINALITY:
- The comparison with natural intelligence seems particularly interesting and likely novel.
- The problem treated is important and has already received significant attention in other communities (e.g., computer vision), but arguably less in RL so far. Hence I believe it is an important direction of investigation.
- Regarding the significance of the derived results I have doubts expressed via the points/questions below. Especially points (1) and (3) regarding related works and formal implications.
Weaknesses
CLARITY and QUALITY:
- At points writing is not particularly sharp (e.g., the list of contributions could be significantly shorter and clearer). Often, diverse and not fully clear words are used to describe things (e.g., 'resilience' in line 137 - what does it formally mean?) or the terminology lacks cohesion (e.g., the word certified is not introduced in the introduction, where the related research line is introduced with alternative terminology).
- The example on page 4 might be hard to follow. I'd suggest adding a drawing with states/actions etc.
- The paper does not seem cohesive in terms of storyline. In particular, it seems it has two themes: (1) showing the relationship between natural decision-making and RL with and without certified methods, and (2) showing formally the limits of certified methods. Although both are relevant topics, I fail to see clearly the connection between the two. Maybe the authors have a clear picture in mind connecting the two themes, but currently the alternation between them renders the paper not cohesive and feels like reading two papers forced into one via unclear motivations and connections. This is related with point (2) below.
ORIGINALITY / SIGNIFICANCE / GENERAL:
- Related works mention adversarial training schemes in general (not specific to RL), then present recent works specific for RL with positive results and finally some with negative non-theoretical results towards claiming that this work is the first showing formally the limitations of these RL schemes. What I am missing is the following: since these algorithmic ideas are far more general than RL as they boil down to regularized optimization schemes (that have been for instance explored vastly in computer vision, NNs etc.), aren’t there works showcasing their fundamental limitations (formally) in general that can therefore trivially hold also for RL, where NNs are used to represent the policy?
- Even within the abstract the authors state ‘’This intrinsic gap between natural intelligence and the restrictions induced by certified training on the capabilities of artificial intelligence further demonstrates the need to rethink the approach…”. Why? Humans/animals and machines clearly have different design spaces (i.e. have different limits and capabilities), so there is no reason to believe that machine intelligence has to be bound to schemes that emerged in human/animal intelligence, which is arguably very limited and/or peculiar in terms of resources (e.g., sensors to minimize statistical complexity, and compute machinery). It seems to me that to evaluate (and compare) intelligence one should define measurable objectives rather than expecting certain specific behavior.
- I am having a hard time understanding the main message from Theorems 3.6 and 3.7. These seem portrayed as negative results as they show value function reordering for sub-optimal actions, but to the best of my understanding they seem also to claim that the value function properly identifies the optimal action, which seems to me what matters for achieving provably optimal policies in RL schemes (e.g., Q-learning). What am I missing (formally)?
- What is the point of plotting the defined Performance Drop rather than a classic RL performance measure? I understand why the Performance Drop would show results aligned with your thesis, but, as mentioned in the previous point, it seems not aligned with classic RL schemes optimality measures.
Questions
I have listed questions within the points above.
Thank you for allocating your time to provide a review for our paper, and thank you for stating that our paper is well-written, presents a clear problem, and clear derivations on an important topic.
1. ”Related works mention adversarial training schemes in general (not specific to RL), then present recent works specific for RL with positive results and finally some with negative non-theoretical results towards claiming that this work is the first showing formally the limitations of these RL schemes. What I am missing is the following: since these algorithmic ideas are far more general than RL as they boil down to regularized optimization schemes (that have been for instance explored vastly in computer vision, NNs etc.), aren’t there works showcasing their fundamental limitations (formally) in general that can therefore trivially hold also for RL, where NNs are used to represent the policy?”
One critical point here relies on understanding the foundations of reinforcement learning. In particular, for a vision task there is no meaningful difference between mislabeling a cat as a ship or as a plane; the label is simply incorrect, i.e. the outcome is binary.
However, in reinforcement learning each action is associated with a value, and the state-action value function Q(s, a) informs us of the expected cumulative discounted rewards we will receive due to taking action a in state s. Thus, the core foundations of reinforcement learning and its sequential nature make this problem more than a one-step binary “correct label” versus “incorrect label” problem.
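Concretely, this semantic content is captured by the standard Bellman optimality relation (stated here in its textbook form for reference, not quoted from our paper):

$$Q^{*}(s, a) \;=\; \mathbb{E}\!\left[\, r(s, a) \;+\; \gamma \max_{a'} Q^{*}(s', a') \;\middle|\; s, a \right],$$

so every entry Q*(s, a), including those of sub-optimal actions a, quantifies the return obtainable by taking a in s and acting greedily thereafter; there is no analogous per-label semantics in classification.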
2. ”Even within the abstract the authors state ‘’This intrinsic gap between natural intelligence and the restrictions induced by certified training on the capabilities of artificial intelligence further demonstrates the need to rethink the approach…”. Why? Human/animals and machines clearly have different design spaces (i.e. have different limits and capabilities) so there is no reason to believe that machine intelligence has to be bound to schemes emerged in human/animal intelligence, which is arguably very limited and/or peculiar in terms of resources (e.g., sensors to minimize statistical complexity, and compute machinery). It seems to me that to evaluate (and compare) intelligence one should define measurable objectives rather than expecting certain specific behavior.”
This is indeed a sensible argument, in that natural and artificial intelligence have different limits, possible design spaces, and capabilities. However, the capability we focus on and highlight here is a beneficial property independent of its connection to natural intelligence. Being able to assess the values of counterfactual actions, independent of the human ability to do so, is a key skill for generalizing beyond the training MDPs. A quite recent work also demonstrated that standard reinforcement learning can generalize better than certified robust policies [1], which indeed supports the argument on the importance of learning and assessing the values of counterfactual actions independent of their connection to natural intelligence.
It is sensible to think alignment with natural intelligence might not be necessary. Yet, we must also consider the evidence on how reinforcement learning originally stemmed from biological inspiration of “thinking” and “learning” (Watkins, 1989; Kehoe et al., 1987; Romo & Schultz, 1990; Montague et al., 1996). Furthermore, not only in reinforcement learning, but also in pure vision, it is the biological inspiration that led to the design of convolutional neural networks.
Hence, it is also perhaps critical to focus on the core inspiration of reinforcement learning, and revisit how the current line of research focusing on robustness and safety affects this inspiration. After all, it is the original inspiration of reinforcement learning that led to achieving superhuman performance in our current day. We wanted to rigorously point out that this inspiration is completely eliminated via the line of research focusing on ensuring safety and robustness, which furthermore turns out to be non-robust in many different unpredictable ways.
[1] Adversarial Robust Deep Reinforcement Learning Requires Redefining Robustness. AAAI Conference on Artificial Intelligence, AAAI 2023.
3. ”The paper does not seem cohesive in terms of storyline. In particular, it seems it has two themes: (1) showing the relationship between natural decision-making and RL with and without certified methods, and (2) showing formally the limits of certified methods. Although both are relevant topics, I fail to see clearly the connection between the two. Maybe the authors have a clear picture in mind connecting the two themes, but currently the alternation between them renders the paper not cohesive and feels like reading two papers forced into one via unclear motivations and connections. This is related with point (2) below.”
The formal limits of robust training discovered and analyzed in our paper demonstrate that a recent line of work that focuses on ensuring safety and robustness in reinforcement learning in fact produces policies that are misaligned, inconsistent, non-robust, and even further unable to reason counterfactually; whereas the original biologically inspired standard reinforcement learning in fact learns counterfactual, consistent, and aligned values. Hence, these concepts are intertwined and tightly connected.
4. ”I am having a hard time understanding the main message from Theorems 3.6 and 3.7. These seem portrayed as negative results as they show value function reordering for sub-optimal actions, but to the best of my understanding they seem also to claim that the value function properly identifies the optimal action, which seems to me what matters for achieving provably optimal policies in RL schemes (e.g., Q-learning). What am I missing (formally)?”
Formally, the core objective of reinforcement learning is to estimate the values of the possible actions available in a given state, not only the optimal one. Thus, trying to frame reinforcement learning as a binary labeling problem, as the robust training techniques do, in fact leads to policies that are inaccurate, non-robust, unable to generalize, and misaligned, as our paper discovers and extensively analyzes.
5. ”What is the point of plotting the defined Performance Drop rather than a classic RL performance measure?”
Performance drop is just the normalized version of the classic RL performance measure, and furthermore a rather standard method of measuring the drop in the performance of reinforcement learning.
6. ”At points writing is not particularly sharp (e.g., the list of contributions could be significantly shorter and clearer). Often, diverse and not fully clear words are used to describe things (e.g., 'resilience' in line 137 - what does it formally mean?) or the terminology lacks cohesion (e.g., the word certified is not introduced in the introduction, where the related research line is introduced with alternative terminology).”
In line 137, resilience refers to a policy being robust. In the introduction, line 47, we actually mention robustness, but we can refer to it again and make it clearer.
- The question is not really addressed via the given explanation, either on a literature level or a logical/technical one.
- I believe that one can certainly take inspiration from biological/neuroscientific understanding, but then one must be able to show its (positive) effect with respect to a measurable metric relevant for the original (AI) problem. As pointed out in my original review, and further in the answers below, this does not seem to be the case currently in this work.
- I understand this argument, but the core issue regarding the connection lies in the following point.
- Formally, the core objective of RL is not to estimate something, but rather to efficiently learn an optimal policy. Bypassing certain estimation procedures can in fact be interpreted even positively as long as the optimal policy is learned. Clearly, given a dynamical system and an agent, a wide variety of learning problems can be defined, including pure estimation problems of certain quantities (e.g., dynamics, values etc.). Nonetheless, the original line of work challenged by the authors seems to focus on the classic RL problem rather than on alternative (estimation) settings. As a consequence, as pointed out in my original review, I fail to see a clear chain of implications regarding the worsening of the learned policy.
- Are you saying that the performance measure defined in Def. 4.1, which seems to embed an action selection strategy, is standard in RL (especially the last sentence within the definition)? Why would the agent be evaluated when following such a (seemingly nonsensical) policy as defined within the definition?
1. ”The question is not really addressed via the given explanation, either on a literature level or a logical/technical one.”
In Q-learning, the state-action value function Q(s, a) is supposed to represent the expected cumulative discounted rewards obtained from taking action a in state s and then taking the maximum Q-value action in all subsequent states. Thus, the value Q(s, a) has a semantic meaning in the MDP for every action a, not just the optimal action. In contrast, for a classification task, there is no semantically meaningful distinction between two different incorrect labels.
The above distinction between reinforcement learning and classification becomes important in any setting where the testing environment differs slightly from the training environment. For example, in autonomous driving, wear on a component causing loss of traction in a mechanical actuator, or, for a financial trading agent, a legal embargo on the traded stock, i.e. on the optimal action of the policy, will lead the robustly trained policy to make random decisions; standard reinforcement learning, however, will still make a decision that in fact brings the second-best rewards.
2. ” I believe that one can certainly take inspiration from biological/neuroscientific understanding but then must be able to show its (positive) effect with respect to a measurable metric relevant for the original (AI) problem. As pointed out in my original review, and further in the answers below, this does not seem to be the case currently in this work.”
Our results do show a positive effect with respect to the original AI problem. That is, avoiding certified training learns much more accurate state-action values. Please see our response to item (1) above for concrete examples on the importance of accurate and consistent state-action values.
3. & 4. ”I understand this argument, but the core issue regarding the connection lies in the following point. Formally, the core objective of RL is not to estimate something, but rather to efficiently learn an optimal policy. By-passing certain estimation procedures can in fact be interpreted even positively as long as the optimal policy is learned. Clearly, given a dynamical system and an agent a wide variety of learning problems can be defined, including pure estimation problems of certain quantities (e.g., dynamics, values etc.). Nonetheless, the original line of work challenged by the authors seems to focus on the classic RL problem rather than on alternative (estimation) settings. As a consequence, as pointed out in my original review, I fail to see a clear chain of implications regarding the worsening of the learned policy.”
The goal of deep reinforcement learning is to learn a policy that performs well, and further that can generalize. Please see our response to item (1) above for the concrete examples of the importance of accurate value functions in generalization.
5. ”Are you saying that the performance measure, which seems to embed an action selection strategy, defined in Def. 4.1 is standard in RL (especially the last sentence within the definition)? Why would the agent be evaluated when following such a (seemingly non-sensical) policy as defined within the definition?”
No, we are saying that measuring performance drop under various types of changes is standard in RL. The performance drop of the reinforcement learning policies as described in Line 355 is a standard way of measuring the difference in policy performance under changes that can affect the policy, i.e. distributional shift or adversarial perturbations. Measuring performance drop is used when the robustness and generalization capabilities of reinforcement learning policies are evaluated. The performance drop is simply a normalized measure of the performance decrease.
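For concreteness, the normalization we have in mind takes the following common form (written schematically here; the precise expression is the one given in Definition 4.1 of the paper and may differ in details):

$$\text{Performance Drop} \;=\; \frac{\text{Score}_{\text{baseline}} \;-\; \text{Score}_{\text{under change}}}{\text{Score}_{\text{baseline}}},$$

i.e. the relative decrease of the usual episodic return when the policy is evaluated under the modified condition (distributional shift, adversarial perturbation, or restricted action selection).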
The authors provide a theoretical and experimental analysis of certified robust training of Q-values for reinforcement learning. Certified robust training is a method for adversarially training neural networks such that they are robust to small perturbations in the input values. In the case of reinforcement learning in particular, this is implemented as a regularizer added to the standard temporal difference loss, where the regularizer penalizes Q functions for which a perturbation in the state can result in a change in the action that produces the maximum Q-value. The authors provide an existence proof that this style of training can produce misalignment amongst the sub-optimal Q-values, which they claim is a departure from natural intelligence, which is able to properly order counterfactual actions. They demonstrate this phenomenon experimentally in several games in the Arcade Learning Environment, by showing that the performance drop incurred when selecting the second-best action some percentage of the time, instead of the optimal action, is much higher for adversarially trained RL than for vanilla RL. Additionally, they show that selecting the worst action some percentage of the time leads to a larger performance drop for vanilla RL than adversarial RL, again indicating that vanilla RL produces a better ordering over sub-optimal Q-values.
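A schematic form of the regularized objective described above (with κ, ε, and B_ε(s) as assumed notation for this summary; the paper's exact formulation may differ):

$$\mathcal{L}(\theta) \;=\; \underbrace{\Big(r + \gamma \max_{a'} Q_{\bar{\theta}}(s', a') - Q_{\theta}(s, a)\Big)^{2}}_{\text{temporal difference loss}} \;+\; \kappa \,\max\!\Big(0,\; \max_{\hat{s} \in B_{\epsilon}(s)} \big[\max_{a \neq a^{*}} Q_{\theta}(\hat{s}, a) - Q_{\theta}(\hat{s}, a^{*})\big]\Big),$$

where a* = argmax_a Q_θ(s, a) is the greedy action on the clean state and B_ε(s) is the set of admissible perturbed states; the regularizer is positive exactly when some perturbation within the ball changes the greedy action.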
Strengths
- The paper is fairly clear - the existence proof is very detailed and easy to follow. Certain areas like Figure captions need some improvement
- The authors provide an original contribution - the misalignment of sub-optimal Q-values in adversarial training has not been observed before
- Experiments are convincing, at least of the existence of mis-aligned Q-values with adversarial training. The construction of the experiments (i.e. using the second best action or worst action) is intuitive and well-explained/motivated
Weaknesses
Major Issues:
- Overall, the main claim of the paper could be better motivated with a better argument for why it matters if sub-optimal Q-values are misaligned, particularly for a method that is specifically designed to be robust to changes in the optimal Q-values. The authors mainly motivate the importance of this by the divergence from natural intelligence; however, (a) it is not entirely clear why a departure from natural intelligence is necessarily a bad thing and (b) the claims that this is a departure from natural intelligence don't seem to be fully supported (see next point). The authors suggest that misaligned sub-optimal Q-values present a vulnerability, which could perhaps be a good motivation - could you provide specific examples of how misaligned sub-optimal Q-values could be problematic or exploited in practical scenarios?
- The paper specifically makes the claim that adversarially trained Q-values are not well-aligned with natural intelligence compared to vanilla RL Q-values. However, this claim seems too strong for the results that are actually presented in the paper. The authors' argument for this claim seems to be that previous work in neuroscience, i.e. Fig 1, demonstrates that humans can assign a correct ordering to counterfactual actions in a particular decision-making task. But do humans always assign a correct ordering? Are there any limitations to this ability? The authors demonstrate theoretically and empirically that there exist cases where adversarial training produces misaligned Q-values. However, it seems that to make a claim about natural intelligence alignment, the authors would have to actually test natural intelligence on the same tasks, particularly since the proof is an existence proof. Admittedly, I am not too familiar with the neuroscience literature, so if the authors could provide more comprehensive evidence from the neuroscience literature to support their claim, this would be helpful. Alternatively, I do not necessarily think the claim about alignment with natural intelligence is necessary, so the language could be toned down a bit.
- The captions of several figures are not very informative and need to be guessed at by the reader. For example, in Fig 2 there is no description of what the bar chart is displaying, there are images from Atari games with no description of what the reader should be paying attention to, and the brain scan similarly has no context. For the bar graph, it would be helpful to provide the environment details, details of how the Q-values were estimated, and the source of the natural intelligence data. Similarly, Fig 3 has no mention of what each of the three panels is - the caption says Adversarial and Vanilla, but the superscript on all the x-axis Q-values is "Adv". Please explain what each of the three figures represents and make sure the axes are correct. The placement of Fig 3 is also odd, since it isn't mentioned in the text until Section 4.3 - I would either move it to that section or mention it in the text earlier.
Minor issues:
- The axes and text in the plots are much too small. Fig 6 is particularly bad.
- It might be more intuitive to plot the performance drop as a negative value, so that the plots have the worse performing curve lower than the best performing curve
- In the first paragraph of 4.1, is mentioned before it is defined
- Table 2 is never mentioned in the text
Questions
- Is there some relation that can be drawn between this phenomenon and policy churn [1]? Perhaps this provides a different avenue to motivate this work.
- Do sections 4.1 and 4.2 make different points? They have different titles (randomized vs misaligned) but the claims seem the same?
[1] Schaul, Tom, et al. "The phenomenon of policy churn." Advances in Neural Information Processing Systems 35 (2022): 2537-2549.
3. ”Fig 3 has no mention of what each of the three panels is - the caption says Adversarial and Vanilla, but the superscript on all the x-axis Q-values is "Adv". Please explain what each of the three figures represent and make sure the axes are correct. The placement of Fig 3 is also odd, since it isn't mentioned in the text until Section 4.3 - I would either move it to that section or mention it in the text earlier.”
The axes in Figure 3 are correct. The x-axis in Figure 3 represents the max Q-values of the certified robust policy, i.e. the adversarially trained policy, and the y-axis represents the max Q-values of the standard reinforcement learning policy. Thus, Figure 3 allows us to immediately see the overestimated state-action values of certified robust trained reinforcement learning policies directly compared to standard reinforcement learning.
4. ”Do sections 4.1 and 4.2 make different points? They have different titles (randomized vs misaligned) but the claims seem the same?”
Yes, indeed Sections 4.1 and 4.2 make different points. Section 4.1 demonstrates that standard reinforcement learning can assess the values of actions substantially better than certified robust policies. Section 4.2 reveals that certified robust training learns inaccurate and misaligned values, to the extent that the true worst possible actions in given states are in fact assessed by the robust policies as the second-best actions, and vice versa.
5. “Is there some relation that can be drawn between this phenomenon and policy churn [1]? Perhaps this provides a different avenue to motivate this work.”
Policy churn sheds light on the learning dynamics of standard reinforcement learning, and demonstrates that within the learning dynamics of standard reinforcement learning a rapid change of greedy policy leads to better exploration. Our paper demonstrates that standard reinforcement learning learns the values of counterfactual actions while robust training methods do not. Furthermore, our work demonstrates that certified robust training methods learn inconsistent, misaligned and non-robust state-action value functions, and even further are unable to reason counterfactually.
[1] Tom Schaul, André Barreto, John Quan and Georg Ostrovski. The Phenomenon of Policy Churn, NeurIPS 2022.
Thank you for your response.
- These are good examples and, in my opinion, provide a much better motivation for the paper than the divergence from natural intelligence, which is not necessarily a bad thing. However, I think you would need to add some additional experiments demonstrating where robustly trained policies can fail at generalization, in order to make these claims. If these were added and the motivation reframed, I would be willing to increase my score.
- I appreciate the further details, but I still do not find the motivation of matching natural intelligence convincing. Your points about generalization are much more pertinent, and I would like to see these demonstrated more concretely in the paper.
- "Thus, Figure 3 allows us to immediately see the overestimated state-action values of certified robust trained reinforcement learning policies directly compared to standard reinforcement learning"
Thank you for the explanation. However it is still not clear what the three figures are depicting - what is the environment? what is the difference between the three panels? These details need to be in the caption.
- " Section 4.1 demonstrates that standard reinforcement learning can assess the values of actions substantially better than certified robust policies" while "Section 4.2. reveals that certified robust training learns inaccurate and misaligned values"
I still do not think these are separate points and could/should be combined for clarity.
1. ”These are good examples and in my opinion provide a much better motivation for the paper than the divergence from natural intelligence, which is not necessarily a bad thing. However, I think you would need to add some additional experiments demonstrating where robustly trained policies can fail at generalization, in order to make these claims. If these were added and the motivation reframed, I would be willing to increase my score.”
Thank you for stating that the examples we provide are good and offer a much better motivation for our paper. Of course we can add them. Note that the recent study [1] demonstrated that robustly trained policies fail to generalize. In particular, this study demonstrates via extensive experiments that standard deep reinforcement learning can generalize while robust training cannot. We can also indeed add this citation.
[1] Adversarial Robust Deep Reinforcement Learning Requires Redefining Robustness, AAAI 2023.
2. ”Thank you for the explanation. However it is still not clear what the three figures are depicting - what is the environment? what is the difference between the three panels? These details need to be in the caption.”
Apologies, we just realized that the subcaptions were not rendering for Figure 3. Indeed, the left graph of Figure 3 reports results for BankHeist, the center graph for Freeway, and the right graph for RoadRunner. We will immediately add the subcaptions.
3. " Section 4.1 demonstrates that standard reinforcement learning can assess the values of actions substantially better than certified robust policies" while "Section 4.2. reveals that certified robust training learns inaccurate and misaligned values"I still do not think these are separate points and could/should be combined for clarity.”
We just wanted to emphasize the difference between these two subsections one more time, but we are indeed willing to merge them. Section 4.1 demonstrates that robust training cannot assess values correctly compared to vanilla training, while Section 4.2 further reports results revealing that robust training is inconsistent within itself. We can merge these two subsections if it will provide a better reading experience.
Thank you for providing a sensible review for our paper, and thank you for mentioning that our paper provides an original contribution with convincing experiments that are intuitive and well-explained/motivated.
1. “The author's suggest that misaligned sub-optimal Q-values present a vulnerability which could perhaps be a good motivation - could you provide specific examples of how misaligned sub-optimal Q-values could be problematic or exploited in practical scenarios?”
Most applications of reinforcement learning, including language models, face a distribution shift when moving from the training environment to the deployment environment. Thus, being able to correctly assess the values of all possible actions in a given state is a key capability, compared to solely assessing the action that maximizes the Q-function in the observed states, which overfits to the training environment observations. The limited generalization capabilities of robust training methods, which have also been revealed recently [1], support the idea of why we need to learn and assess the values of counterfactual actions.
Apart from distribution shift and generalization, let us think of a more adversarial scenario. In particular, our agent can face adversarial situations in which it is no longer an option to take the optimal action due to the adversarial manipulation. Thus in this case our work demonstrates that robust deep reinforcement learning will make random decisions, and standard reinforcement learning will make a choice that will give the second best rewards.
For a more concrete example, in autonomous driving, wear on a component causing loss of traction in a mechanical actuator, or, for a financial trading agent, a legal embargo on the traded stock, i.e. on the optimal action of the policy, will lead the robustly trained policy to make random decisions; standard reinforcement learning, however, will still make a decision that brings the second-best rewards.
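As a minimal numerical illustration of this fallback scenario (the Q-values below are hypothetical and chosen purely for exposition; they are not numbers from our experiments):

```python
import numpy as np

# Hypothetical Q-value estimates for one state with four actions.
# Assume the true ranking of returns is a0 > a1 > a2 > a3.
q_standard = np.array([10.0, 8.0, 3.0, 1.0])   # sub-optimal ordering preserved
q_robust   = np.array([12.0, 1.5, 2.0, 9.5])   # argmax correct, sub-optimal values scrambled

def act(q_values, unavailable=()):
    """Greedy action selection restricted to the actions that remain available."""
    masked = q_values.copy()
    masked[list(unavailable)] = -np.inf
    return int(np.argmax(masked))

# Both policies agree while the optimal action a0 is available.
assert act(q_standard) == act(q_robust) == 0

# If a0 becomes unavailable at deployment (e.g. an actuator fault or a trading
# embargo), the well-ordered values fall back to the true second-best action a1,
# whereas the scrambled values fall back to a3, the worst remaining choice.
print(act(q_standard, unavailable=[0]))  # -> 1
print(act(q_robust, unavailable=[0]))    # -> 3
```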
Thus, beyond and independent from the connection to natural intelligence, being able to assess the values of counterfactual actions is a key component in building agents that can robustly generalize to non-stationary environments.
[1] Adversarial Robust Deep Reinforcement Learning Requires Redefining Robustness, AAAI 2023.
2. ”But do humans always assign correct ordering? Are there any limitations to this ability? The authors demonstrate theoretically and empirically that there exists cases where adversarial training produces misaligned Q-values. However, it seems like to make a claim about natural intelligence alignment, the authors would have to actually test natural intelligence on the same tasks, particularly since the proof is an existence proof. Admittedly, I am not too familiar with the neuroscience literature so if the authors could provide more comprehensive evidence from neuroscience literature to support their claim, this would be helpful. Alternatively, I do not necessarily think the claim about alignment with natural intelligence is necessary, so the language could be toned down a bit.”
The neuroscience studies referred to in our paper demonstrate that humans do compute values for counterfactual actions that are grounded in some notion of their true utility. In fact, they even identify structures in the brain where this reasoning and assigning values takes place. It is sensible to think alignment with natural intelligence might not be necessary. Yet, we must also consider the evidence on how reinforcement learning originally stemmed from biological inspiration of “thinking” and “learning” (Watkins, 1989; Kehoe et al., 1987; Romo & Schultz, 1990; Montague et al., 1996). Furthermore, not only in reinforcement learning, but also in pure vision, it is the biological inspiration that led to the design of convolutional neural networks.
While reinforcement learning was originally inspired by natural intelligence and led to building policies that can achieve superhuman performance, we wanted to rigorously point out that this inspiration is completely eliminated via the line of research focusing on ensuring safety and robustness, which furthermore turns out to be non-robust in many different unpredictable ways.
Nonetheless, even independent from natural intelligence, from a completely broader perspective, if we want agents that are able to generalize to environments different than their training MDPs, these agents must learn and form knowledge of the values of all their possible actions, i.e. be able to reason counterfactually, which we demonstrate that indeed standard reinforcement learning does. However, the line of work focusing on ensuring safety and robustness completely breaks this tight relationship between the core inspiration of reinforcement learning and the knowledge formed by the reinforcement learning agents that allows them to generalize.
The paper analyzes robust RL in the context of findings in neuroscience that aim to provide theoretical insight. A bridge between the fields would be highly interesting and potentially impactful, and the problem of robust RL is important. However, reviewers saw limited justification of how the natural intelligence findings are brought to bear on the artificial RL questions. In particular, an adequate scoping of the robust RL question seems to be missing altogether.
Additional Comments on Reviewer Discussion
Reviewers and authors engaged in a careful and thoughtful discussion of the merits of the paper, however the core concerns have not been alleviated. It is possible that the ICLR community is not the ideal audience for this type of work, but my impression is that the concerns are more intrinsic.
At times, the authors' feedback became argumentative, even combative. In my experience, it is more productive to tone down sentiments of perceived misdirected criticism, and instead to detail respectful disagreements.
Reject