PaperHub · ICLR 2024
Decision: Rejected · 3 reviewers
Ratings: 1, 3, 5 (average 3.0 / 10; lowest 1, highest 5, std. dev. 1.6)
Average confidence: 3.7

Towards Meta-Models for Automated Interpretability

Submitted: 2023-09-24 · Updated: 2024-02-11
TL;DR

Meta-models are neural networks that take parameters of other neural nets as input; we use them to uncover properties of the input nets.

Abstract

Keywords
interpretability, meta-models, automated interpretability, backdoors, trojans, backdoor detection, tracr, rasp

Reviews & Discussion

Review
Rating: 1

This paper proposes to use meta-models to understand the internal properties of neural networks. The application includes backdoor detection and data distribution identification.

Strengths

See weaknesses.

Weaknesses

Though this paper focuses on an interesting topic, it is less than 9 pages and the amount of work is not enough.

Questions

None

Public Comment

This is the laziest ICLR review I've ever seen, and I believe does a significant disservice to the authors. It's unclear to me that you read anything more than the abstract, and then just scrolled to the bottom of the paper. I encourage the area chair and other reviewers to disregard this review and to judge the paper on its own merits.

(Clarification: I found this paper by browsing OpenReview, and nobody asked me to write this comment, I just feel sympathetic to the poor authors of this paper)

Comment

We believe research should be judged by the value of its contribution, not by quantity of work or number of pages. Could you please let us know if there are specific weaknesses in our experiments, or follow-up work that would be appropriate for us to do?

While it is true that our results do not require much page-space to present (especially as we deferred many details to the appendices for conciseness), they provide evidence that meta-models can potentially be used on a variety of problems of previous interest.

It is also true that, as other reviewers have noted, we included too little detail at times - e.g. we failed to include example RASP programs that can be reconstructed by our meta-model. We aim to address these weaknesses in our revisions.

Review
Rating: 3

This research explores how meta-models can be used to interpret neural network models. The goal of the work is to improve the interpretability of complicated neural architectures by utilizing meta-models. By recovering RASP programs from model weights and using meta-models for backdoor identification in these basic models, they try to reverse-engineer neural networks. Amazingly, their solution outperforms numerous other approaches already in use, achieving over 99% accuracy in backdoor identification. The results signify a noteworthy advancement in the transparency and interpretability of neural networks.

Strengths

  1. Contribution to Open Source: The paper mentioned making datasets available for future work. Contributing to open-source ensures that the broader scientific community can benefit from, replicate, and build upon their work.
  2. Improved Interpretability: The research showcases a novel approach to recovering a program from transformer weights.
  3. High Accuracy: In their testing, the meta-model achieved impressive accuracy. Such high accuracy showcases the effectiveness and reliability of their approach.

Weaknesses

  1. Dependence on Large Amounts of Training Data: Their method relies heavily on having a significant amount of training data, which might not always be feasible for every application.
  2. Model Size: As mentioned in this paper, the models they worked with are relatively small, with an average of only 3,000 parameters. This contrasts with much larger models often used in deep learning, which can have millions or even billions of parameters. It is much better to see the performance of larger models.
  3. Scope of Tasks: The tasks they tested are simpler compared to the full problem of reverse-engineering a large neural network.

Questions

  1. The baselines in this paper are not enough to show the performance of their methods; adding more baselines would be better.
  2. I hope this method can be tested on a larger model and prove to be effective.
Comment

Summary of our responses

Thank you for your review. We’re also excited to open-source our models and datasets, as we expect that a lot of progress can be made on meta-models as a research community develops for the domain, and we hope that the interpretability gains will be substantial.

Dependence on large amounts of training data.

Reliance on training data is indeed a limitation we discuss in Section 5. However, we see this as a limitation of the meta-model framework in general rather than a weakness of our work specifically. We note that we achieve strong results in the backdoor detection setting while training on only 4-6 thousand datapoints. In addition, we propose several ways in which meta-models could potentially be scaled up in relatively data-efficient ways.

Size of base models is too small.

In Section 3.1 we show meta-models successfully classify base models up to 2 million parameters in size. We expect that meta-models can be scaled further, but this is out of scope for this paper.

Scope of tasks.

We agree that this is a limitation, and we discussed it in the limitations section. We think we have demonstrated an appropriate amount of potential for what our paper aims to be: a solid proof-of-concept.

Not enough baselines.

We are currently running more experiments. Note: we already compare against several baselines and exceed existing state-of-the-art performance.

Weakness #1

Dependence on Large Amounts of Training Data: Their method relies heavily on having a significant amount of training data, which might not always be feasible for every application.

It is true that reliance on training data is a limitation of the meta-model approach. While a meta-model may not need a large number of datapoints in an absolute sense, the training data may be expensive to obtain if each datapoint involves training a large model. Scaling to massive modern language or vision models is an important question, but out of scope for this work. Ways in which we might overcome the scaling challenge in future work include:

  1. Pretraining, e.g. masked weight prediction or contrastive learning. This might allow finetuning on new tasks with fewer examples.
  2. Experiments on generalization capabilities across scale. If a meta-model can (partially) generalize from small to large networks, then we may be able to reduce the number of large networks in the training data.
  3. Applying meta-models to subsections of a target network. Depending on the target task, a single network might yield many datapoints, for example if individual circuits within a network can be mapped to specific functions.

We have edited the paper to discuss these points (see Section 5).
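
To make point 1 above more concrete, here is a minimal sketch of what masked-weight-prediction pretraining for a meta-model could look like. This is our own illustration rather than the paper's architecture or code: the chunk size, model dimensions, and names such as WeightMetaModel and masked_weight_step are arbitrary choices for the example.

```python
# Hedged sketch (not the paper's code): masked-weight-prediction pretraining.
# Flattened base-model weights are split into fixed-size chunks ("weight
# tokens"), a random subset of chunks is zeroed out, and a transformer
# encoder is trained to reconstruct the hidden chunks.
import torch
import torch.nn as nn
import torch.nn.functional as F

CHUNK = 256      # weights per token (hypothetical choice)
D_MODEL = 128

class WeightMetaModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Positional encodings omitted for brevity.
        self.embed = nn.Linear(CHUNK, D_MODEL)              # chunk -> token embedding
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.decode = nn.Linear(D_MODEL, CHUNK)              # token -> reconstructed chunk

    def forward(self, chunks):                               # chunks: (batch, n_chunks, CHUNK)
        return self.decode(self.encoder(self.embed(chunks)))

def masked_weight_step(model, opt, base_weights, mask_frac=0.15):
    """One pretraining step on a batch of flattened base-model weight vectors."""
    b, n = base_weights.shape
    chunks = base_weights[:, : (n // CHUNK) * CHUNK].reshape(b, -1, CHUNK)
    mask = torch.rand(chunks.shape[:2]) < mask_frac          # which tokens to hide
    corrupted = chunks.clone()
    corrupted[mask] = 0.0
    pred = model(corrupted)
    loss = F.mse_loss(pred[mask], chunks[mask])              # reconstruct only masked chunks
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

model = WeightMetaModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
fake_batch = torch.randn(8, 70_000)                          # e.g. backdoor-setting sized nets
print(masked_weight_step(model, opt, fake_batch))
```

After pretraining on many unlabeled networks, the encoder (plus a small task head) could in principle be finetuned on far fewer labeled networks for a downstream property-prediction task.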

Weakness #2

Model Size: As mentioned in this paper, the models they worked with are relatively small, with an average of only 3,000 parameters. This contrasts with much larger models often used in deep learning, which can have millions or even billions of parameters. It is much better to see the performance of larger models.

This is an inaccurate representation of our experiments. Only the Tracr models had an average of 3,000 parameters. In the backdoor detection setting, models had around 70,000 parameters. In the hyperparameter-prediction setting, models had an average of 560,000 parameters, with the largest models reaching 2,000,000 parameters.

Weakness #3

Scope of Tasks: The tasks they tested are simpler compared to the full problem of reverse-engineering a large neural network.

You are right that we still have a long way to go. However, the problem of fully reverse engineering neural networks is the subject of a large body of work and remains unsolved. As such, while this is a correct characterization that we acknowledge in Limitations, we do not think it is a weakness.

While the tasks we work on are relatively straightforward, we hope that they demonstrate potential in this field. What further experiments would you be interested to see?

Questions

The baselines in this paper are not enough to show the performance of their methods; adding more baselines would be better.

It would be useful if you could clarify which baselines are missing. We compare against leading methods ([1] for hyperparameter prediction, [2,3] for backdoor detection), and achieve state-of-the-art results. Do you have recommendations for how to proceed?

Note: we are running experiments to compare our hyperparameter prediction performance against [4] and plan to add them to the revision.

[1] https://arxiv.org/abs/2002.05688

[2] https://arxiv.org/abs/1910.03137

[3] https://arxiv.org/abs/1906.10842

[4] https://arxiv.org/abs/2110.15288

Review
Rating: 5

This paper presents what it terms a meta-model approach to mechanistic interpretability. Specifically, the approach is to create a model that takes in the weights (or other properties) of a model (a transformer), and then get the transformer to either 1) generate a human-interpretable program (rasp) that corresponds to it or 2) detect backdoors. Stated differently, the goal of this work is to use a model to explain certain aspects of another model. The authors demonstrate the approach on a backdoor prediction task, and show that a meta-model can invert transformer weights that have been compiled by the tracr compiler.

Strengths

  • Interesting solution to an important problem: several of the demonstrations of the mechanistic interpretability paradigm have mostly been bespoke and tailored to single architecture-settings. Here the use of a model to predict the properties of a function that we seek to explain/interpret is interesting. The use of the tracr library is also very nice since that library was essentially designed for the kind of tasks that it is used for in this work.

  • Variety in tasks: The authors show a variety in the kind of tasks that the approach can be used for. For example, one use is to reverse-engineer tracr programs. Another use is back-door detection. I particularly liked the next-token prediction framing, which is then used to generate tracr programs from a model's weights. This kind of approach, if generalized, can be applied more extensively.

Weaknesses

I should state upfront that I am fundamentally skeptical of the approach that this work pursues, but I am willing to rethink/update my review given feedback from the authors.

  • Black-box model to explain a black-box model: I think this approach is fundamentally limited because you are now using one model you don't understand to try to explain another model that you don't understand. What happens if the training data of the meta-model has backdoors in it for example? That is, you make it so that the back-door model gives interpretations that are benign for a model that is actually problematic. This might seem far-fetched, but I think one of the key limitations of this approach is that we are left to just 'trust' the output of the meta-model. But as we know, transformer-based models can easily learn spurious signals, and other problematic behavior, so it is unclear how this approach fundamentally solves the interpretability-problem at hand. Perhaps if we could always convert trained models to their tracr equivalent then we would have more confidence in this approach. However, to me, it is a non-starter that one cannot ascertain the reliability of the results of the meta-model.

  • What does a mechanistic interpretation mean in this setting: For the tracr program reconstruction, I think I get it. Here it seems like the tracr program itself constitutes the explanation of the model weights. In the backdoor setting it is less clear. How does the backdoor classification task demonstrate interpretability? Specifically, how do I know how the meta-model is able to detect which model has a backdoor? Here I think you would actually want to make it input-specific. Specifically, I think only a certain subset of inputs triggers the wrong outputs for models with backdoors. It is the model's behavior on these inputs that we are most interested in. I don't quite see how the current setup helps us to do this.

Questions

I mixed in questions with the discussion on weaknesses above.

In addition to the points above, I have some questions about the tracr-program inversions section.

  • Can you show examples in the appendix of settings where you compile a program with tracr and then reverse-engineer it with the meta-model? It would be helpful to compare the output of the meta-model and the original tracr program.
  • Are you not worried that the transformers you get from tracr programs have much sparser weights compared to SGD/ADAM-trained models? Essentially, I am worried whether the demonstration here is too easy for the meta-model.
  • How do you envision that this meta-modeling approach will actually be used in practice, on real models, in the future? I struggle to see how one could ever train a meta-model for a realistic setting, e.g., a vision transformer trained on ImageNet. Would I need to train thousands of ImageNet models first to fit the meta-model?

Details of Ethics Concerns

N/A

Comment

Can you show examples in the appendix of settings where you compile a program with tracr and then reverse-engineer it with the meta-model? It would be helpful to compare the output of the meta-model and the original tracr program.

Thank you for the suggestion. We will add a section to the Appendix with examples summarizing these results once we finish updating the Tracr experiments.

Are you not worried that the transformers you get from tracr programs have much sparser weights compared to SGD/ADAM-trained models? Essentially, I am worried whether the demonstration here is too easy for the meta-model.

That’s a good point. Yes, it is likely that Tracr-compiled transformers are easier to reverse-engineer than transformers trained to approximate the same program, and we mention this in Limitations. However, that this reverse engineering is possible at all is promising. We have added some future steps for this line of work: (1) reverse engineering trained transformers that were regularized to be “simple” in the same way as Tracr-compiled transformers, then (2) reverse engineering general transformers trained on the input/output of RASP programs.
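
For readers less familiar with the setup, a small sketch of the kind of (program, weights) pair involved may be useful. It is our own illustration, not an example from the paper; the calls follow the length-counting example in the public tracr README, and attributes such as model.params reflect our reading of the library, so treat the exact API details as assumptions.

```python
# Illustrative sketch only: compile a tiny RASP program with tracr and look at
# the (program, weights) pair a meta-model would be trained on.
from tracr.rasp import rasp
from tracr.compiler import compiling

def make_length():
    # RASP program computing sequence length: attend everywhere and count.
    all_true = rasp.Select(rasp.tokens, rasp.tokens, rasp.Comparison.TRUE)
    return rasp.SelectorWidth(all_true)

program = make_length()
model = compiling.compile_rasp_to_model(
    program, vocab={1, 2, 3}, max_seq_len=5, compiler_bos="BOS")

print(model.apply(["BOS", 1, 2, 3]).decoded)  # sequence length at each position
# model.params holds the compiled transformer weights (a dict of JAX arrays).
# Flattened, those weights are the meta-model's *input*; a tokenized version of
# the RASP program above is its *target*.
```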

How do you envision that this meta-modeling approach will actually be used in practice, on real models, in the future? I struggle to see how one could ever train a meta-model for a realistic setting, e.g., a vision transformer trained on ImageNet. Would I need to train thousands of ImageNet models first to fit the meta-model?

This is an important question. How to scale meta-models is an open research question, but it seems doable to us. Ways in which we might overcome the scaling challenge in future work include:

  1. Pretraining, e.g. masked weight prediction or contrastive learning. This might allow finetuning on new tasks with fewer examples.
  2. Experiments on generalization capabilities across scale. If a meta-model can (partially) generalize from small to large networks, then we may be able to reduce the number of large networks in the training data.
  3. Applying meta-models to subsections of a target network. Depending on the target task, a single network might yield many datapoints, for example if individual circuits within a network can be mapped to specific functions. This also means the meta-model could be smaller than the base model it helps to analyze.

We’ve added a section called “Future Work”, covering some ways in which we expect that meta-models might be useful. Some speculative examples include: highlighting backdoored neurons, producing simplified interpretations of neural network circuits, and identifying attention heads that perform a particular function.

[1] https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN

Comment

Thanks to the authors for clarifying my understanding of this work. While I am still skeptical of the meta-model approach, the demonstration here is quite interesting. The part of the work that I most appreciate is generating tracr programs from model weights. That line of work is interesting, and could lead to more insights. It would be interesting to create a test-bed of transformer models where the behavior of the model is 'known a priori' and to empirically check that the tracr programs that you get from reverse engineering these models match (I assume the original tracr paper did this?)

Backdoor setting: is there a way to generate tracr programs that are input (or group input) specific? Here is what I am trying to get at: the critical issue for that setting is that the model has undesirable behavior on inputs that contain the backdoor. Just identifying the backdoor is great, but somewhat unsatisfactory. It'll be great to get a tracr program for the model's behavior on normal vs backdoored inputs. This way, hopefully we can look at the output of the tracr program to see what the issue is. Actually, here is one crude way to do this: train models on normal inputs vs backdoored inputs, then compare the tracr outputs of the model trained on normal inputs to those trained on backdoored inputs. How different are they?

Overall, I think this work has some nice insights. I am not expecting any new experiments or updates to the paper on the basis of this response. I just wanted to give the authors additional feedback for a future revision of the manuscript.

Comment

You write:

Black-box model to explain a black-box model: I think this approach is fundamentally limited because you are now using one model you don't understand to try to explain another model that you don't understand.

We greatly appreciated this comment. We’ve added a section in Section 4 discussing more precisely this limitation of “a black-box interpreting a black-box”.

Mechanistic interpretability can be viewed as consisting of two problems:

  1. Proposing a (mechanistic) explanation for some behavior.
  2. Validating the correctness of an explanation.

Currently, automated methods exist for validating explanations (e.g. [1]) but not for proposing them.

Our paper only addresses problem 1. We agree this is a limitation - without means of verifying an explanation, meta-models can only provide limited assurance. Applying meta-models to verification is beyond the scope of this work.

Quoting more:

What happens if the training data of the meta-model has backdoors in it for example? That is, you make it so that the back-door model gives interpretations that are benign for a model that is actually problematic. This might seem far-fetched, but I think one of the key limitations of this approach is that we are left to just 'trust' the output of the meta-model. But as we know, transformer-based models can easily learn spurious signals, and other problematic behavior, so it is unclear how this approach fundamentally solves the interpretability-problem at hand. Perhaps if we could always convert trained models to their tracr equivalent then we would have more confidence in this approach. However, to me, it is a non-starter that one cannot ascertain the reliability of the results of the meta-model.

While meta-models can certainly be backdoored, we emphasize our point above – that meta-models (at present) only propose an explanation, rather than verify it. If proposed explanations can be validated, then the performance of the meta-model need only be good on average rather than good in the worst case. While robustly validating mechanistic explanations is out of scope for our paper, some candidates include testing the base models’ output vs the reconstructed RASP program, or (in a different setting) using causal scrubbing [1].
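
As a rough illustration of the first of those candidates (comparing the base model's outputs against the reconstructed RASP program), one could estimate an agreement rate over sampled inputs. The function below is purely hypothetical and not part of the paper; the two callables are stand-ins for the base transformer and an interpreter for the reconstructed program.

```python
# Sketch of black-box validation of a proposed explanation (our illustration,
# not the paper's procedure): check how often the base model and the RASP
# program reconstructed by the meta-model agree on random inputs.
import random

def agreement_rate(base_model_fn, reconstructed_program_fn, vocab, seq_len, n_samples=1000):
    """Fraction of random inputs on which the two callables give identical outputs."""
    hits = 0
    for _ in range(n_samples):
        x = [random.choice(sorted(vocab)) for _ in range(seq_len)]
        if base_model_fn(x) == reconstructed_program_fn(x):
            hits += 1
    return hits / n_samples

# Toy usage with stand-in callables (both reverse their input, so agreement is 1.0):
print(agreement_rate(lambda x: x[::-1], lambda x: list(reversed(x)), vocab={0, 1}, seq_len=6))
```

A high agreement rate supports the proposed explanation; a low one means the meta-model's hypothesis should be discarded rather than trusted.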

When the meta-model’s output cannot be verified, we caution against completely trusting predictions. We hope future work will make meta-model predictions more reliable and robust.

Comment

You write:

What does a mechanistic interpretation mean in this setting: For the tracr program reconstruction, I think I get it. Here it seems like the tracr program itself constitutes the explanation of the model weights. In the backdoor setting it is less clear. How does the backdoor classification task demonstrate interpretability?

Good point! We agree that in the backdoor setting, the meta-model is doing interpretability but not mechanistic interpretability. That is, it is surfacing information about the behavior of the model (the backdoor) but not how that behavior is implemented mechanistically. In general, meta-models can report any type of information to the user.

We have edited the paper to make this clear (Section 3.2).

Note: in a sense, the meta-model itself is doing mechanistic interpretability, since it examines the base model weights to locate the backdoor. However, it doesn’t surface that information in a way that is useful for humans.

Specifically, how do I know how the meta-model is able to detect which model has a backdoor? Here I think you would actually want to make it input-specific. Specifically, I think only a certain subset of inputs triggers the wrong outputs for models with backdoors. It is the model's behavior on these inputs that we are most interested in. I don't quite see how the current setup helps us to do this.

The meta-model determines which base models are poisoned by their weights alone, with extremely high reliability in the domains we examined. We suspect it does so by examining the weights and identifying circuits that strongly alter the classification decision based solely on simple features – but we have no evidence at present for that claim. We agree that using a meta-model in this way to determine how backdoor circuits are implemented in base models would be worthwhile.
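
For context on the backdoor setting, training data for such a meta-model might be produced roughly along the following lines. This is a generic BadNets-style sketch of our own, not necessarily the paper's exact poisoning setup; the poison helper, its parameters, and the assumed (n, H, W) image shape are all hypothetical.

```python
# Hypothetical illustration: build a poisoned training set by stamping a small
# trigger patch on a fraction of images and flipping their labels. Base models
# trained on such data form the "backdoored" class that the meta-model learns
# to recognize from weights alone.
import numpy as np

def poison(images, labels, target_class=0, frac=0.1, trigger_value=1.0):
    """Return copies of (images, labels) with a 3x3 trigger on `frac` of the images."""
    images, labels = images.copy(), labels.copy()
    idx = np.random.choice(len(images), size=int(frac * len(images)), replace=False)
    images[idx, -3:, -3:] = trigger_value   # bottom-right trigger patch
    labels[idx] = target_class              # backdoor target label
    return images, labels

# Toy usage: each trained network then yields one meta-model datapoint,
# (flattened weights, label=1 if trained on poisoned data else 0).
imgs = np.random.rand(100, 28, 28)
labs = np.random.randint(0, 10, size=100)
poisoned_imgs, poisoned_labs = poison(imgs, labs)
```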

There is already an abundance of methods (https://arxiv.org/pdf/1811.03728.pdf, https://arxiv.org/pdf/1811.00636.pdf, https://proceedings.neurips.cc/paper_files/paper/2022/file/76917808731dae9e6d62c2a7a6afb542-Paper-Conference.pdf) that examine model activations to determine whether an example is poisoned or not.

Comment

Thank you very much for taking the time to leave an in-depth review. We’re grateful for the detailed feedback, and we’re glad that you found this to be an interesting new method. Hopefully we can address some of your concerns.

We summarize our responses here, and go into more detail in other comments.

Meta-models use a black-box to explain a black-box, so the approach is fundamentally limited.

  • We agree that this is a limitation of meta-models. However, meta-models can still be useful: if we view mechanistic interpretability as consisting of the two problems of 1) proposing an explanation and 2) verifying an explanation, then meta-models can help automate problem (1) despite being a black-box method.

How are meta-models doing “mechanistic interpretability”?

  • Tracr: output programs constitute the explanation (as you say), so the meta-model is doing mechanistic interpretability.
  • Backdoor detection: the meta-model is doing interpretability (detecting backdoors), but not mechanistic interpretability.
  • Hyperparameter prediction: here the meta-model is not doing interpretability at all.

Can you show examples of programs you compile with tracr, then reverse-engineer?

  • Thank you for this proposal! We plan to include such examples in the revision.

Won’t real transformers be much harder to invert than Tracr transformers?

  • Yes, possibly ― we plan to find out in future work. We emphasize that our paper is exploratory work, and the first work to our knowledge that extracts a program from a transformer at all.

How are meta-models scalable?

  • We’ve added a paragraph on this topic to the revision in Section 5. Briefly, we anticipate large-scale pretraining to improve data-efficiency on new tasks, and we outline several tasks where the meta-model can be effectively used on a subgraph of the network, thus limiting the potential input size.

Changes to the paper

Here is a brief list of improvements we have made in response to your review so far:

  • We have expanded the Limitations section to discuss the problem of using a black box to interpret a black box (Section 4).
  • We have clarified what we mean by mechanistic interpretability and that the backdoor detection experiment does not count as mechanistic interpretability (beginning of Section 3.2).
  • We added a discussion of scalability concerns and suggestions to tackle them in Future Work (Section 5).
  • We have not yet added examples of decompiled RASP programs, but plan to add them in the final revision.
Comment

Thanks for the response! Are there changes we can make that would convince you to increase your score? We think we've responded to the weaknesses that you raised in your review --- we'd be curious to know what you see as the main points that are still not adequately addressed.

Comment

As it stands, there is no need for additional changes. I'd be happy to raise my score by 1, but it is unclear to me how that strengthens the case for the paper, at the current venue, given the other reviews. As I said before, my prior is to not be in favor of using one black-box to explain another black-box. I'll need to confer with the other reviewers and digest the new updates. Regardless of the outcome, I think this work has some interesting findings.

Comment

We thank reviewers 2Mjn and HFmY for their helpful comments.

Summary

To summarize, the main criticisms of our paper were

  1. Insufficiently strong results / baselines
  2. Broad / conceptual issues with the entire approach, in particular how it is supposed to scale or be usefully applied to interpretability at scale.

We’ve responded to both points in the individual responses to reviews and have updated our paper with details and new results. In particular, we realized our description of our motivation & vision was lacking. We’ve now written a better outline of what we hope to achieve with this paper (see the introduction as well as the Future Work and Conclusion sections) - more details later in this comment.

New experimental results

We also provide a new baseline comparing to https://arxiv.org/abs/2110.15288 (Schürholt et al., NeurIPS 2021), showing we outperform the strongest prior work on meta-models. This new result is visible in Figure 2 (right-hand side).

Update: Bug in Tracr

While working on improving our results on the 'mapping transformer weights to RASP code' experiment (Section 3.3), we discovered a bug in the Tracr library. To summarize, the bug means that in some cases, the compiled Transformer model may not perform the same computation as the RASP program. We are working with the Tracr authors to write tests and validate the compiler. Our best guess is that the bug overall worsens the performance of our meta-model, since it introduces noise in the compilation process and thus our dataset, but we can't say for sure just yet. This bug affects our results in Section 3.3, but not other experiments.

Motivation / Vision

Since one criticism of our work was an unclear vision, here is a quick sketch of the overall vision. This is necessarily speculative and somewhat opinionated (which is why not all of it was included in the paper), and we post it here just to provide more context on our motivations:

  • A problem with interpretability is that models are getting larger and more complicated faster than we’re making progress in understanding them (especially if the standard you judge interp progress on involves full mechanistic understanding)
  • Fundamentally, the only way we can overcome this is by leveraging ML progress to help with interpretability.
  • There’s a lot of possible ways to do this, so we choose the most straightforward one: just train neural nets to do interp for us.
  • Long term, there’s a lot of things one can imagine doing, for example: have multimodal LLMs that take in weights / activations as a modality, and query them about neural network internals; train meta-models to amortize interpretability research, e.g. by automatic circuit detection and description; have meta-models that take as input both weights and activations, so they can analyze circuits + features jointly; use meta-models that are smaller than the models they interpret, so we can apply them to the largest models; etc…
  • Most of these things are pretty ambitious and longer term. Our goal right now is to gain the most information we can about whether this approach is feasible and worth pouring more resources into.
  • To do this, we first sanity check our approach by testing on applications that are a mix of a) reasonably close to our goal for meta-models (interpretability), and b) have an existing literature we can compare against. So we compare against the main available results on backdoor detection and against previous results on meta-models. We beat the state-of-the-art in both.
  • Second, we test on the task that is closest to our long-term goal of mechanistic interpretability, but also has cheap training data available: inverting Tracr-compiled models. This works fairly well - we recover the majority of program instructions, and since our approach scales well we expect these results to improve over time.
  • In the very long term, it is hard to predict how far meta-models will scale and how helpful they will be for interpretability. Our goal in this paper is not to definitively argue that meta-models are the solution to interpretability (that would be overstating our confidence in the approach), but to conduct proof-of-concept experiments that are able to provide early evidence for or against meta-models. Within this scope, we think the evidence so far points in favor: meta-models are readily applicable to tasks for which there is good training data.
AC Meta-Review

This paper proposes an automated approach to model interpretability: Train a transformer that takes another model’s parameters and outputs (different forms of) “explanations”. Experiments include: predicting hyperparameters, detecting backdoors, and creating RASP code from transformers originally compiled from a RASP program.

This paper tackles an important problem of attempting to scale up efforts for model interpretability, and the perspective taken is valuable. The discussion period has been productive. Authors have added new baselines and made several edits to the writing that have improved the draft.

There are different views toward the effectiveness of using an “opaque” model to help interpret another “opaque” model. As long as the potential challenges and limitations of this overall approach are properly discussed and thoroughly considered, this in itself is not a major limitation. But it appears that there is more to be done to address these concerns. Authors argue that this approach helps with generating hypotheses for model interpretability, and not necessarily validating them. But treating these two steps as disconnected is not a convincing argument. There is little value in generating hypotheses that have a very low chance of passing the validation test. Specifically, in the context of the experiments presented in the paper, e.g. classification of data poisoning backdoors, it is unclear what would be the value of “generating a hypothesis” without validating it.

Additionally, based on AC’s own reading, the paper can benefit from a more focused framing. The framing might be too broad, and putting these experiments together seems a bit forced. Instead, a more thorough analysis and deeper discussions in the experiment sections can strengthen the draft. For example, in reverse engineering RASP code from model weights, more discussion is needed to better understand the quality of the recovered RASP code. Recovering the RASP code completely happens only in ~6% of the cases, but what does it mean when 70% of the program is recovered? Is extrapolation to get fully runnable code feasible in those cases? What is the value of an “algorithm” as an explanation, when it cannot be executed? I also appreciate the authors’ honesty about bugs in the RASP compiler. Validating those experiments seems important before publication of the draft. Another limitation acknowledged by the authors and highlighted by the reviews is the extent to which this approach could be more generally applicable to help with mechanistic understanding of transformers trained from scratch.

Backdoor prediction is an interesting experiment, but casting it as an interpretability challenge is a bit of a stretch. Perhaps extending experiments in this section and showing what different attributes can or cannot be predicted from model parameters, and why, could strengthen these results.

Overall, the paper has potential and is a step in the right direction. The preliminary results presented are promising. However, the current draft can benefit from another round of revisions before publication for the aforementioned reasons.

Why Not a Higher Score

  • The potential limitations of using a model to explain another model are not thoroughly considered.
  • The framing seems too broad, and the experiments seem a bit disconnected. Reverse engineering (compiled) transformer weights into RASP programs is a valuable mechanistic interpretability effort and shows promise, but the remaining experiments are less relevant to the main motivation of the paper -- "automating interpretability".
  • Deeper discussion of the experiments can significantly improve the paper.

Why Not a Lower Score

N/A

Final Decision

Reject