PaperHub
3.7 / 10
Rejected · 3 reviewers
Ratings: 3, 5, 3 (min 3, max 5, std dev 0.9)
ICLR 2024

Understanding and Controlling a Maze-Solving Policy Network

OpenReview · PDF
Submitted: 2023-09-19 · Updated: 2024-02-11
TL;DR

A deep convolutional policy network contains redundant and manually controllable internal representations of the goal location.

Abstract

Keywords
interpretability · alignment · AI safety · reinforcement learning

Reviews and Discussion

Review (Rating: 3)

The authors analyze the internal representations learned by a policy network trained to solve mazes. They identify channels that contain the goal information and analyze several interventions that modify the behaviour of the agent.

Strengths

The paper is very well-written. The idea is simple and the presentation is clear, which makes the contents easy to follow. The experiments are well-designed to illustrate the discussion. The overall goal of understanding the internal representations learned by our agents is important and clearly deserves study.

Weaknesses

The scope of the paper is far too narrow. The study is exhaustive, but it is focused on a particular architecture in a particular environment. Is there any chance of applying those results to other agents in other domains? I don't see any simple way. Thus, this work would have a very low impact on the community. Besides, the presented analysis is based on the 11 channels found by visual inspection. While suitable for a single experiment, this method can't be applied broadly.

The numerical results you obtain are far from convincing. If the logistic regression can predict goal reachability with 82% accuracy compared to a naive 71%, then only a few hard cases were actually explained. Certainly, there are more important features. I'd like to see you define, say, 20 different features, exceed 95% accuracy, and then identify those that contribute most. As it stands, I think you are still missing important features. Also, when you analyze the interventions in Section 3.1 (and elsewhere), I see very little difference between no intervention and intervening on all 11 channels. I'm not convinced that you control the policy; I'm convinced that there is some correlation between the intervention and the intended behaviour. Furthermore, I think this analysis would be useful if you exceeded the impact of moving the cheese. Only then could you confidently claim that you control the policy and convince me. As it stands, it seems more like a bias than control.

Overall, after reading this paper I would agree that you identified some features that contribute to the behaviour, although (as you claim yourselves) there are clearly many more of them, which makes the contribution even smaller.

I am willing to increase my rating if you prove me wrong on these claims.

Questions

Can the results that you present be generalized to other architectures and environments?

Can all those steps you describe (choosing the layer, identifying the channels, retargeting the goal, etc.) be automated?

Are there any reasons for the network to learn the goal location as a separate feature? Technically, it could be arbitrarily mixed with other features (as long as it can be extracted with linear transformations), rendering the visual inspection impractical.

Why did you choose this specific layer to inspect? Are those observations valid for many layers and you've just chosen an arbitrary one, or was it a careful decision?

Comments

Dear Reviewer,

Thank you. We're glad you found our work "well-written" and that it addresses "an important topic". We've made a number of clarifications below, which we hope will allow you to raise your evaluation score.

The scope of the paper is far too narrow. The study is exhaustive, but it is focused on a particular architecture in a particular environment. Is there any chance of applying those results to other agents in other domains? [...] Besides, the presented analysis is based on the 11 channels found by visual inspection. While suitable for a single experiment, this method can't be applied broadly.

We believe the purpose of our work is to develop an understanding of the goals and goal representations of neural networks. Carefully studying a particular network is one way of doing this, similar to studying “model organisms” in biology.

In particular:

  • The finding that the goals of AI systems are redundant and retargetable is important in itself. We find this holds across 15 different agents.

  • There is further evidence of retargetable goals in other agents, such as LLMs [1,2,3]. We will add these references to the paper to contextualize the work; our results show that those findings also hold in our setting.

The numerical results you obtain are far from convincing. If the logistic regression can predict goal reachability with 82% accuracy compared to a naive 71%, then only a few hard cases were actually explained.

Our aim is not to perfectly predict the generalization of the network. Instead, we show that its generalization depends on environmental factors. We think the interesting part of these results is that the goal the network pursues depends on the environment; one might have expected this to be random. Moreover, we never claim to perfectly predict the generalization.
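To make the kind of analysis under discussion concrete, here is a minimal sketch of regressing goal-reachability on hand-crafted maze features with scikit-learn. The feature names, the synthetic data, and the baseline comparison are illustrative placeholders only, not the paper's actual pipeline (the paper's features are described in its Appendix B).

```python
# Illustrative sketch only: the features and labels below are synthetic
# placeholders, not the paper's data. The point is the shape of the
# analysis: fit a logistic regression on hand-crafted maze features and
# compare its accuracy against a naive majority-class baseline.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1000  # number of evaluation mazes (placeholder)

# Hypothetical per-maze features.
X = np.column_stack([
    rng.integers(1, 30, n),    # legal-path distance from decision square to cheese
    rng.integers(1, 30, n),    # legal-path distance from decision square to top-right corner
    rng.uniform(0.0, 10.0, n), # Euclidean distance between cheese and top-right corner
])
y = rng.integers(0, 2, n)      # 1 if the agent reached the cheese (placeholder labels)

baseline = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5).mean()
model = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
print(f"majority-class baseline: {baseline:.2f}, logistic regression: {model:.2f}")
```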

Also, when you analyze the interventions in Section 3.1 (and elsewhere), I see very little difference between no intervention and intervening on all 11 channels.

In Fig. 83 (Appendix E), we show that intervening on all 11 channels increases retargetability in all cases, even as the target square moves far from the top-right path, so we do find substantial differences in behavior between no intervention and intervening on the 11 channels. We admit that these differences can be hard to see in the heatmaps; we will improve this.

Furthermore, I think this analysis would be useful if you exceeded the impact of moving the cheese.

This evidence shows that the goals of a network can be partially retargeted via online activation editing. To our knowledge, we are the first to give a proof of concept of this ability. We think this is an important finding because it provides insight into alignment methods that reuse existing network circuits.

Regarding defining many features and selecting those that contribute most: we do initially regress on ten features; Appendix B contains further details.

Only then could you confidently claim that you control the policy and convince me.

We agree that we cannot perfectly control the network in this way, which is why we describe our interventions as "partially" controlling the policy. We do not claim to fully control the policy, only to provide first steps toward greater control through intervention techniques based on network activations.

Can the results that you present be generalized to other architectures and environments?

There is some evidence that these results generalize to LLMs [1,2,3].

Can all those steps you describe be automated?

Potentially! We’ll add this to the paper under future work.

Are there any reasons for the network to learn the goal location as a separate feature?

Yes, the convolutional layers do provide a "privileged basis" (the channel activations). However, visual inspection did not reveal an analogous "top-right" feature, even though our results strongly suggest that the network tracks the top-right accessible location. It is therefore interesting that the cheese location was so clearly and redundantly represented.

Why did you choose this specific layer to inspect? Are those observations valid for many layers and you've just chosen an arbitrary one, or was it a careful decision?

We first developed the cheese vector technique and found that it worked best at this layer, so we investigated this layer in more detail.

Thank you for the review. We hope our clarifications are useful, and are looking forward to hearing what you think.

Best, The Authors

[1] Turner, A., Thiergart, L., Udell, D., Leech, G., Mini, U., & MacDiarmid, M. (2023). Activation Addition: Steering Language Models Without Optimization. arXiv preprint arXiv:2308.10248.

[2] Li, K., et al. (2023). Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. arXiv preprint arXiv:2306.03341.

[3] Rimsky, N. (2023). Reducing Sycophancy and Improving Honesty via Activation Steering. AI Alignment Forum.

Review (Rating: 5)

This paper focuses on a pretrained reinforcement learning policy that solves mazes. The authors find that certain circuits correspond to particular goals and identify eleven channels that track the location of the goal. Furthermore, they modify these channels, either by hand-designed interventions or by combining forward passes, to partially control the policy.

Strengths

1. The first work to pinpoint internal goal representations in a trained policy network.
2. The visualization of activations is relatively clear and intuitive.
3. Experiments are conducted to explore how the activations of specific channels affect the behavior of the policy.

Weaknesses

1. The paper lacks further validation of whether the discovered intrinsic representation of the goal in the pretrained policy generalizes to different policies. This weakens the value the paper attributes to its exploration of goal representations within the policy, because the phenomena described may be specific to one particular parameterized policy rather than to the general family of policies trained on this environment.
2. The experimental phenomena and conclusions cannot fully support the paper's core contribution. The experiments do not demonstrate that the 'intrinsic representation' of the goal in the policy is captured by the activations of the 11 channels selected by human visual inspection.
3. The presentation of some of the experiments is confusing; detailing the experimental setup in the appendix may help readers understand the work.

Questions

Question 1: Are the activations influenced by the agent's location or by the time step? When we aim to control the policy by adjusting the activations, should we modify the activations of the initial state only, or of all states?

Question 2: Does the number of the most effective channels change when the size of the maze is altered? Or do the features found in the 11 selected channels remain consistent under such changes to the experimental setting?

Question 3: On page 6, what is meant by "the geometric mean of the action probabilities to a given square from the start position"? I ask because I am uncertain how the "normalized path probability" is calculated.

Question 4: Why did you choose the activations after the first residual block of the second IMPALA block as your target? Are there any insights or observations behind this choice?

Question 5: As the authors note, the behavior of the policy can be controlled by combining different forward passes of the network; is it possible to elicit more flexible behavior in this way? For example, could the policy be made to move toward the upper-left corner (neither the position of the goal nor the "upper-right corner" bias introduced in the training phase)?

Comments

Dear Reviewer,

Thank you for your careful analysis and feedback. We’d like to make some clarifications, which we hope will allow you to increase your evaluation score and vote for acceptance.

The paper lacks further validation of whether the discovered intrinsic representation of the goal in the pretrained policy generalizes to different policies.

We find evidence that the cheese vector technique transfers across 15 different policies. This shows that our findings generalize across policies. Moreover, we perform our behavioral analysis across the same 15 policies, and repeatedly find evidence that the goals of these policies are context-dependent. We believe these are valuable findings. There is other evidence that suggests that we can align policies by modifying activations during forward passes, a technique sometimes known as activation engineering [1, 2].

The experimental phenomena and conclusions cannot fully support the paper's core contribution. The experiments do not demonstrate that the 'intrinsic representation' of the goal in the policy is captured by the activations of the 11 channels selected by human visual inspection.

We would like to clarify that we have both visual and quantitative evidence that these channel activations depend primarily on the location of the cheese. We do not claim that all cheese information passes through these channels, only that the network uses them to track the cheese. The visual evidence is included in the appendix, and we also report resampling results (p. 4), where we state: "Across 200 mazes, resampling the cheese-tracking channels from mazes with a different cheese location changes the most probable action at a decision square in 40% of cases, which is much more than when resampling from mazes with the same cheese location (11%)".
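For readers who want to see what such a resampling intervention looks like mechanically, here is a minimal PyTorch sketch. The layer handle, channel indices, and observation tensors are hypothetical placeholders, not the paper's actual code.

```python
# Hedged sketch of resampling channel activations: overwrite the
# "cheese-tracking" channels of a target maze's forward pass with the
# corresponding activations from a donor maze (one with a different
# cheese location). Layer name, channel indices, and the policy /
# observation objects are placeholders.
import torch

CHEESE_CHANNELS = [7, 42, 55]  # hypothetical channel indices


def run_with_resampled_channels(policy, layer, obs_target, obs_donor,
                                channels=CHEESE_CHANNELS):
    cache = {}

    def save_hook(module, inputs, output):
        cache["donor"] = output.detach().clone()

    def patch_hook(module, inputs, output):
        patched = output.clone()
        patched[:, channels] = cache["donor"][:, channels]
        return patched  # returning a tensor replaces the layer output

    # 1) Record the donor maze's activations at the chosen layer.
    handle = layer.register_forward_hook(save_hook)
    with torch.no_grad():
        policy(obs_donor)
    handle.remove()

    # 2) Re-run the target maze with those channels overwritten.
    handle = layer.register_forward_hook(patch_hook)
    with torch.no_grad():
        logits = policy(obs_target)
    handle.remove()
    return logits
```

One could then compare the most probable action under the patched and unpatched forward passes, in the spirit of the resampling statistic quoted above.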

The presentation of some of the experiments is confusing; detailing the experimental setup in the appendix may help readers understand the work.

Thank you. We’ll improve this. Are there any specific things you’d like to see improved?

Question 1: Are the activations influenced by the agent's location or by the time step? When we aim to control the policy by adjusting the activations, should we modify the activations of the initial state only, or of all states?

The activations depend on the agent's location. We adjust the activations at every time step (i.e., on every forward pass of the policy network), which defines a new policy.

Question 2: Does the number of the most effective channels change when the size of the maze is altered? Or do the features found in the 11 selected channels remain consistent under such changes to the experimental setting?

The mazes as seen by the network are all the same size. We show a "human-friendly view", which may be smaller than the full game grid if there is padding at the edges. As a result, the network always sees RGB images of the same size, and the number of effective channels therefore does not change as the maze size is altered.

Question 3: Thanks, we'll clarify this in the camera-ready version. By the geometric mean, we mean the following:

$$
P_\text{path}(s_t \mid \pi) := \sqrt[t]{\prod_{i=0}^{t-1} \pi(a_i \mid s_i)},
$$

where $s_0, s_1, \ldots, s_t$ is the unique shortest path between $s_0$ and $s_t$, navigated by actions $a_i$. If $s_0 = s_t$, then $P_\text{path}(s_t \mid \pi) := \pi(\text{no-op} \mid s_0)$. The probability that the agent follows a path is the product of the per-step action probabilities, so we use the geometric mean to normalize for path length.
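For concreteness, here is a short sketch of computing this quantity from the per-step action probabilities; the helper input is a hypothetical list of $\pi(a_i \mid s_i)$ values, and the averaging is done in log space for numerical stability.

```python
# Hedged sketch of the normalized path probability: the geometric mean of
# the policy's action probabilities along the unique shortest path from the
# start square to a target square. The input list is a hypothetical helper
# output containing pi(a_i | s_i) for each step of that path.
import math


def normalized_path_probability(action_probs_along_path):
    """Geometric mean of per-step action probabilities; for s_0 == s_t the
    definition uses the no-op probability at the start square instead."""
    t = len(action_probs_along_path)
    if t == 0:
        raise ValueError("use the no-op probability when s_0 == s_t")
    log_sum = sum(math.log(p) for p in action_probs_along_path)
    return math.exp(log_sum / t)


# Example: a 3-step path where the agent assigns 0.9, 0.8, 0.7 to the
# correct action at each step.
print(normalized_path_probability([0.9, 0.8, 0.7]))  # ~0.796
```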

Question 4: Empirically we found that the cheese vector worked best there. We’ll clarify this in the paper.

Question 5: Thank you. It is possible to compose some activation additions. For example, subtracting the cheese vector while adding the top-right corner vector leads to an increase in top-right-corner-seeking and a decrease in cheese-seeking. We'll add this example to the paper, but we leave a better understanding of composition to future work.
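As a rough illustration of what composing such activation additions could look like in code, here is a sketch under the assumption that each "vector" is a difference of layer activations between two forward passes; the layer handle, observations, and scaling coefficients are placeholders rather than the paper's exact procedure.

```python
# Hedged sketch of composing activation additions. A steering vector is
# taken here to be the difference in a layer's activations between two
# observations (e.g. the same maze with and without cheese); applying the
# composed edit on every forward pass defines a modified policy.
import torch


def activation_diff(policy, layer, obs_a, obs_b):
    """Return layer(obs_a) - layer(obs_b) at the hooked layer."""
    acts = {}

    def grab(module, inputs, output):
        acts["out"] = output.detach().clone()

    handle = layer.register_forward_hook(grab)
    with torch.no_grad():
        policy(obs_a)
        a = acts["out"]
        policy(obs_b)
        b = acts["out"]
    handle.remove()
    return a - b


def run_with_composed_edit(policy, layer, obs, cheese_vec, topright_vec,
                           alpha=-1.0, beta=1.0):
    """Subtract the cheese vector and add the top-right vector at this layer."""
    def edit(module, inputs, output):
        return output + alpha * cheese_vec + beta * topright_vec

    handle = layer.register_forward_hook(edit)
    with torch.no_grad():
        logits = policy(obs)
    handle.remove()
    return logits
```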

[1] Turner, A., Thiergart, L., Udell, D., Leech, G., Mini, U., & MacDiarmid, M. (2023). Activation Addition: Steering Language Models Without Optimization. arXiv preprint arXiv:2308.10248.

[2] Rimsky, N. (2023). Reducing Sycophancy and Improving Honesty via Activation Steering. AI Alignment Forum.

Review (Rating: 3)

The authors study generalization of an image-based RL agent trained to find cheese in a maze. During training time, the cheese is always in the top right corner of the maze. During test time, the cheese can be anywhere. As a result, in a new maze at test time, the agent sometimes finds cheese, and sometimes goes to the top right corner.

They carefully analyze the network's behavior and internal mechanisms, and find key situations in which the agent decides between finding the cheese, or just going to the top right corner. Next, they analyze the network structure and find key neurons in the network that track the cheese. By modifying these neurons, they can somewhat control the behavior.

Strengths

The authors tackle a very important problem in RL - generalization. Their specific definition of generalization, "goal misgeneralization", is less studied, yet is extremely important in the context of modern LLMs, RLHF, and alignment. Even if we give the agent the "correct" reward function, it may still act unpredictably in OOD situations.

Their analysis of the maze task and policy is quite deep, and it contains some interesting studies and findings. They find that the agent chooses to pursue the cheese or the corner based on visual proximity. Their experiments on controlling the policy by modifying internal activations are quite interesting as well.

Weaknesses

The paper's deep analysis of the maze-cheese task and policy is its strength and also its main weakness. Many of the analyses and experiments hinge on the authors' knowledge of the task and on their design of the policy. This leads me to question 1) whether these findings hold true for more realistic, complicated, and relevant tasks, and 2) whether the particular methodology used here can be applied to other RL agents.

For example, the interpretability and controllability of the policy hinge on the architecture and the input: an image-only, CNN-based policy. Because the observations are 2D images and the policy is CNN-based, the authors can manually inspect all feature maps to find correlations with the 2D position of the cheese.

Many tasks, though, may be multimodal or non-image based, or, even if they are image-based, may involve first-person views of a 3D world. Many deep RL agents have different architectures: MLPs, LSTMs, Transformers, etc. The wide variety of possible tasks and agents seems to make it hard to apply this approach in future studies.

Next, I did not see any mention of seed variance. Is it possible that these findings only emerge with the correct seed? How general are these results across RL agents, even if we fix the task to the Maze task?

Questions

Could the authors address this point about task / policy specificity?

Could the authors address the concern on seed variance?

Top-right corner motivational vector - is the definition ordering swapped? Figure 7 - the columns seem to be out of order.

Comments

We would first like to thank the reviewer for their careful consideration and analysis of our submitted work. We are pleased that you thought our work addresses “a very important problem”. We’ve made a number of clarifications below, which we hope will allow you to increase your evaluation score.

whether these findings hold true for more realistic, complicated, and relevant tasks, and 2) whether the particular methodology used here can be applied to other RL agents.

Thank you. We believe that carefully understanding one particular architecture sheds light on generic properties of deep learning systems, much like "model organisms" in biology. To that end, we would like to clarify that:

  1. A number of our findings hold for multiple agents. We perform our behavioral analysis across 15 different trained networks (with substantially varying historical goal regions) and repeatedly find evidence of contextually activated goals. We believe this is an important result because it helps us understand the behavior of AI assistants.

  2. Some of our methods apply to other architectures. We show that we can re-use existing circuits within the network to control its behavior. The technique of combining network forward passes can be applied to other networks, as in Turner et al. (2023) with large language models.

As such, we believe that better understanding the maze-solving policy network helps us understand the goals of other RL systems, and even LLMs. Applying our approach to multiple seeds provides additional evidence that the goals of AI systems are redundant, context-dependent, and retargetable. We believe these findings have significant implications for AI alignment. Of course, we acknowledge that studying only one environment is a limitation of the work. We will update the manuscript accordingly for the camera-ready version, including better contextualization with related work and discussion of how it provides evidence for our hypotheses.

Top-right corner motivational vector - is the definition ordering swapped? Figure 7 - the columns seem to be out of order.

Yes, we’ll fix this. Thank you.

Please let us know what you think, and whether these changes are enough for you to vote for acceptance. We think that a revised version of the manuscript will be of great interest to the ICLR community.

Thank you, The Authors

AC Meta-Review

This paper studies the generalization capabilities of an image-conditioned RL agent trained to find objects in a maze. The paper thoroughly studies the object-oriented task (cheese in a maze) in the context of goal misgeneralization, which is an interesting open problem in RL. There was consensus that the paper is thorough in analyzing behaviors in this environment, but everyone agreed that the setting is too narrow, with reliance on a specific architecture and environment and limited generalization to other agents or tasks. Concerns were also raised regarding the method's applicability, robustness across random seeds, and scalability. The authors responded with clarifications; however, the consensus still leans toward a clear rejection. This is an interesting research problem, but it needs broader applicability to other environments/tasks, improved control mechanisms, and better contextualization with related work.

Why not a higher score

This is an interesting research problem, but it needs broader applicability to other environments/tasks, improved control mechanisms, and better contextualization with related work.

Why not a lower score

N/A

Final Decision

Reject