ALMANACS: A Simulatability Benchmark for Language Model Explainability
We present a simulatability benchmark for automatically evaluating language model explainability methods.
Abstract
Reviews and Discussion
This paper proposes a simulatability benchmark for LLM explanations. In the benchmark, various subjective questions are formed to probe the "opinions" held by different LLMs, and explanations are obtained either by asking the models to explain themselves in natural language (rationalization) or by computing explanations with traditional methods such as attention or integrated gradients. To avoid the issue of explanations leaking information about the test instance, the benchmark uses a separate test set. Experimental results on two LLMs, with GPT-4 as the interpreter of the explanations, show limited utility of the explanations compared to the no-explanation baseline.
Strengths
- This paper is written quite clearly and is easy to follow.
- The sourcing of subjective, opinion-based test questions is interesting and could be used in other LLM benchmarks.
- The authors take special care with the label leakage issue by using a separate test set.
Weaknesses
- The results are mostly negative. While this is somewhat expected for a challenging benchmark, it is unclear what the takeaways from the negative results are. For example, when the GPT-4 interpreter fails to predict a model's prediction based on the explanations (of other related inputs), is it due to the irrelevance of the retrieved training inputs (in which case the embedding model is too weak), GPT-4's inability to use the explanation information (in which case the predictor model or the prompting may need to be changed), or the low quality of the explanations? It seems to me that only the last case truly highlights a problem with the explanations, and even then, there is no further insight into which specific aspect of the explanations causes the low quality.
- I notice that both models studied are quite small: flan-alpaca-gpt4-xl with 3B parameters and vicuna-7b-v1.3 with 7B parameters. They were also released over a year ago, which is quite old in the context of recent LLM development. I would recommend trying newer and larger models, such as Llama 3(.1), or the closed-source models from OpenAI or Anthropic (without attention or integrated-gradient access).
- Accuracy and F1 score are not included as metrics. I tend to find them more intuitive than metrics such as KLDiv.
- Some recent related works on studying the self-explanations (i.e., rationales) of LLMs are missing, such as [1] and [2]. Both works show mixed results, echoing the general conclusions of this paper.
[1] Can Large Language Models Explain Themselves? A Study of LLM-Generated Self-Explanations. Huang et al.
[2] Are self-explanations from Large Language Models faithful? Madsen et al.
Questions
See weaknesses.
The paper presents ALMANACS, a benchmark for automatically evaluating explanation methods by measuring how well they help simulate model behavior. The method works by supplying explanations and training data to an LLM predictor and having it try to simulate the original model's prediction on held-out test data. The predictor is prompted with few-shot examples, retrieved as the 10 nearest neighbors (by cosine similarity) of the test input from the training data, together with those training examples' explanations. Results show that no explanation method outperforms the no-explanation baseline.
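For concreteness, here is a minimal sketch of the retrieval-and-prompting step as I understand it; the variable names, prompt wording, and the `build_predictor_prompt` helper are my own illustrations (assuming the all-mpnet-base-v2 embedder mentioned in Section 2.2), not the authors' code.

```python
# Minimal sketch: embed training scenarios, retrieve the 10 nearest neighbors
# of the test scenario by cosine similarity, and build a few-shot prompt for
# the GPT-4 predictor. Illustrative only; not the authors' implementation.
import torch
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-mpnet-base-v2")

def build_predictor_prompt(test_question, train_questions, train_answers,
                           train_explanations, k=10):
    # Embed the held-out test question and all training questions.
    q_emb = embedder.encode(test_question, convert_to_tensor=True)
    train_embs = embedder.encode(train_questions, convert_to_tensor=True)

    # Retrieve the k most similar training scenarios by cosine similarity.
    sims = util.cos_sim(q_emb, train_embs)[0]
    top_idx = torch.topk(sims, k=min(k, len(train_questions))).indices.tolist()

    # Assemble the few-shot prompt: question, explanation, and the explained
    # model's answer for each retrieved neighbor, then the new test question.
    shots = []
    for i in top_idx:
        shots.append(
            f"Question: {train_questions[i]}\n"
            f"Explanation: {train_explanations[i]}\n"
            f"Model's P(Yes): {train_answers[i]:.2f}\n"
        )
    return (
        "Predict the probability that the model answers Yes to the new "
        "question, given its behavior on similar questions.\n\n"
        + "\n".join(shots)
        + f"\nNew question: {test_question}\nPredicted P(Yes):"
    )
```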
Strengths
The main strength of the paper is its generalizability and simplicity: it provides a nice, scalable method to automate the evaluation of many XAI techniques, which (as the authors say) is not a replacement for human studies but a nice addition to them. As LLMs get better over the next few years, one can imagine human testing gradually becoming less necessary in many circumstances.
Weaknesses
The overarching weakness of the paper is that the authors failed to discover a setting where their method shows discriminative results in simulatability. Until this happens, the story of the paper does not feel finished to me. Experiments that would really help are ones showing that the method can actually convey some actionable information to the LLM, just to demonstrate that insightful results can be gleaned from ALMANACS and how exactly to set up tests to do that, so that future researchers understand how to use it in their experiments.
The issue is, if I were to literally give random noise to the LLM instead of the explanations, it would also likely do nothing, so you have to show some observable effect from your framework to validate it.
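To make this concrete, here is a hypothetical control condition I have in mind; the functions below are my own sketch, not anything from the paper.

```python
# Hypothetical noise-control condition: replace each explanation with a
# shuffle of its own tokens, rerun the benchmark, and compare scores.
# If real explanations do not beat the noised ones, the framework has not
# demonstrated that it can detect useful signal in explanations.
import random

def noise_explanation(explanation: str, seed: int = 0) -> str:
    rng = random.Random(seed)
    tokens = explanation.split()
    rng.shuffle(tokens)
    return " ".join(tokens)

def noised_explanations(train_explanations):
    return [noise_explanation(e, seed=i) for i, e in enumerate(train_explanations)]
```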
Questions
- Did your human study get IRB approval? I didn't see any mention of it, which is perhaps problematic. Maybe you can clarify how this was approached in your research.
- Can you think of any experiments which would show ALMANACS helped an LLM with simulation?
Details of Ethics Concerns
I imagine the authors have some ethics approval from their institution for their usage of human labellers, but I cannot see it mentioned in the paper.
This article introduces the ALMANACS benchmark, a standardized framework for evaluating the interpretability of language models. The benchmark measures model simulatability by constructing a series of complex scenario questions and assesses the effectiveness of four types of explanation methods. The results indicate that no single explanation method significantly enhances predictive performance across all tasks. Additionally, the paper examines the effectiveness of using GPT-4 as an automatic predictor and its alignment with human judgments. Overall, the work highlights the ongoing challenge of generating explanations that effectively aid prediction.
Strengths
This paper addresses a highly valuable problem: the design of a general, automated evaluation method for explanation tools.
Weaknesses
- The goal of developing an efficient tool for automated testing of model simulatability is not adequately achieved. User studies must minimize the influence of prior knowledge to ensure objective and effective testing of explanation quality. However, the proposed method relies heavily on the training data of the large language model (LLM) used as the predictor, which may skew results and introduce biases. While the authors acknowledge this concern and caution against over-reliance on ALMANACS, the concern remains critical: if the reasoning generated within ALMANACS cannot reliably measure the simulatability of explanations, then automation in this context becomes meaningless. How do the authors demonstrate that the prior knowledge of GPT-4 as a predictor does not significantly affect the evaluation of explanations?
- The focus on binary Yes/No questions, while suitable for a preliminary implementation, severely limits the applicability of the tool, especially in scenarios that require richer forms of output.
- The conclusions drawn in the paper are not convincing. The authors designed 15 templates for each topic, with each template containing 15 placeholders, generating the dataset by replacing these placeholders with different texts. Two issues arise here:
- Could the non-placeholder components of the templates significantly influence the model's predictions? If so, the explanations generated from data produced by the same template are likely to be similar, rendering the replacement of placeholders ineffective. In this case, the hundreds of data points generated are effectively indistinguishable from a single data point.
- For the average of experimental results to be meaningful, the distribution of the experimental data should be uniform. However, the authors did not verify the uniformity of the datasets across different topics. Instead, they concluded that all explanation methods are insufficient based on average results from various topic datasets, which is unconvincing.
- Some key details in the paper are unclear. Some of them can be found in the appendix, but they should really appear in the main text. Concretely:
- In Section 2.1, the authors should clarify why choosing GPT-4 as a predictor is crucial, as it is the primary tool for evaluating explanations. How might different kinds of LLMs impact their work?
- In Section 2.2, it is challenging to understand how to calculate the probability of a Yes answer without reference to the appendix.
- In Section 2.2, how did the authors compare different embedding methods? What metrics were employed, and why was the Sentence-BERT model all-mpnet-base-v2 chosen?
- In Section 2.2, the authors conducted a suite of evaluations to assess the models’ capabilities but did not provide any details or results from these evaluations. How can they demonstrate that the model explained has sufficient capability to address the questions posed in their benchmark?
Questions
- How do the authors address the limitations of using a large language model as a predictor, especially given that it does not fully resolve the issues mentioned in the introduction? Considering the example of Bills et al., which highlights the challenges humans face in evaluating explanations, how do the authors account for the fact that large language models also experience a decline in reasoning ability as the number of input tokens increases, complicating their assessment of overly complex explanations?
- In Section 2, the authors mention that their method utilizes distributional shift, so that the training and test sets come from different distributions. While this helps favor methods that provide faithful explanations of the model's reasoning, could this differing distribution lead to unfairness toward local explanation methods?
- The definition of non-objective questions in Section 2 is too vague. According to the authors' logic, if the explained model consistently provides the same answer, a confounding effect arises. How can they ensure that their non-objective questions do not lead the explained model to consistently yield the same answer?
- In Section 2.1, the authors formalize an interpretability method as an explainer function: (f, D) -> e. Does "e" refer to a single explanation or multiple explanations? If it refers to a single explanation, why is the whole dataset D needed as input? If it refers to multiple explanations, why do the authors state that "each e is an explanation corresponding to a particular (x, y) ∈ D"? (See the type sketch after this list for the two readings I have in mind.)
- In Section 2.1, when the authors state, "Additionally, we allow each e to depend on f and D," what does "depend" specifically mean in this context?
- The authors provided 10 examples to GPT-4 as the predictor. How was this parameter determined?
- As noted in http://arxiv.org/pdf/2202.12837v2, LLMs do not possess the capacity to "learn" from input data and corresponding labels during testing. This limitation raises questions about the ability of LLMs to infer model outputs from explanations, fundamentally questioning the reliability of the proposed method. How do you address concerns regarding the validity of your work given this limitation?
- Additionally, please refer to the Weaknesses section.
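To make the ambiguity in my question about the explainer function concrete, here are the two readings I have in mind; the type names below are my own notation, not the authors'.

```python
# Two possible readings of the explainer signature (f, D) -> e from Section 2.1.
from typing import Callable, List, Tuple

Input, Output, Explanation = str, float, str
Model = Callable[[Input], Output]
Dataset = List[Tuple[Input, Output]]

# Reading 1: a single explanation summarizing the model over the whole dataset.
ExplainerGlobal = Callable[[Model, Dataset], Explanation]

# Reading 2: one explanation per (x, y) pair in D (aligned with D's order),
# matching "each e is an explanation corresponding to a particular (x, y) in D".
ExplainerLocal = Callable[[Model, Dataset], List[Explanation]]
```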
The authors propose ALMANACS: a benchmark that can be used to evaluate the extent to which different language model explanation methods aid simulatability (the ability of a human or other entity to predict the model's behavior on new inputs). The proposed benchmark includes a dataset of input prompts ("scenarios") used to construct the simulatability task (given a set of input prompts, model outputs, and explanations, predict the model's output on a new input). The authors propose that, to estimate each explanation's "simulatability", one can use a secondary ML model to complete the task (e.g., predict the model's output) in place of a human. The authors use their benchmark to measure the simulatability of a number of explanation methods, and justify the use of a secondary model by comparing LLM predictions to human predictions on the simulatability task.
Strengths
- The authors do a great job citing prior research on desiderata for post-hoc explanations in Sections 3 and 6. In particular, I appreciate that the authors acknowledge the limitations of considering only simulatability as a desirable criterion, as illustrated nicely in L185 with your example that a naive explanation that simply provides the model weights would perfectly enable simulatability. It is also clear from the authors' discussions of related empirical studies/scholarship on simulatability, and descriptions of each of the explanation methods, that they are well-read on contemporary scholarship on XAI.
- Overall, the authors' methodology and benchmarking approach is described clearly. I also appreciate the efforts the authors took to include details about their methodology – e.g., the prompt templates used and the instructions presented to users in their study protocol – to facilitate reproducibility.
Weaknesses
I will consider adjusting my score if my below concerns are addressed.
Weakness 1: Validity of using an LLM. My primary critique of the paper is that from the authors' experimental validation, I remain unconvinced by one of the authors' key claims: that an LLM predictor "can replace humans as [an] automated evaluator of explanations" in this context. To make this point, the authors argue that "the automated GPT-4 predictor is consistent with human evaluations". However, I need more detail to understand how the results in Section 5.2, and Figures 4b-c, support this argument.
- The main piece of evidence that the authors use to justify that the LLM is "consistent with humans" is the wide error bars for all of the treatments in the "all" condition in Figure 4. But, if I understand correctly that the authors defined the "All" category by creating a big pool across all of the different tasks, then of course the error bars are going to be wide here because there's variance across tasks. A better measure of statistical significance wouldn't create a big pool, but instead conduct separate statistical tests for each of the 5 tasks (each "distribution"). I'm skeptical of this measure being used as justification, and am open to hearing other justifications instead.
- As the authors already noted in their paper, I think it is interesting that there are some significant differences across conditions in the human experiments – for example, that humans did much better when given explanations in the hiring decisions context – and these trends are not found by the LLM.
- I would be more amenable to this work if the authors did not claim that the LLM predictor's results are necessarily "consistent with humans", and instead noted that there are discrepancies and described in detail what these discrepancies are (e.g., expanding further on some of the findings you already have, like how LLMs tend to be better at the simulation task). I think you can still try to make the argument that there is value in using an LLM to approximate the information different XAI methods give that might be useful for simulation, without necessarily needing to argue that it will be predictive of how a human will perform. But this argument needs to be more fleshed out. (Maybe the LLM is an "upper bound" on the predictive information available in an explanation to aid simulation; but humans may struggle or fail to infer how to use this information to actually complete the task accurately – an argument made by [1].)
Weakness 2: Biases introduced by using models to simulate other models. I am wondering if the reason why using another LLM (e.g., GPT-4) to predict the outputs of another LLM (e.g., flan-alpaca-gpt4-xl), performs well, has something to do with how the two models were developed – for example, flan-alpaca-gpt4-xl was trained to behave as similarly as possible to GPT-4.
- In other words, these models already do a great job at simulating the outputs of other models because of the way that they were trained, in a way that humans cannot.
- This makes me question the appropriateness of using large pretrained models as the "predictor". Past work that has used secondary "predictor" models in simulatability tasks doesn't share this same challenge – to my understanding, the predictor in these past studies was initialized from scratch.
- If the authors believe it is necessary to use a large pretrained model as the predictor, perhaps they can acknowledge how these connections between the predictor vs. the model being explained as one factor that contributes to the predictor's good performance on the task. They might also be able to design additional experiments to explore this – e.g., is it true that a predictor that is the same (or a similar) model as the model being explained, will always perform better than other predictors that are unrelated?
Weakness 3: More discussion of limitations of existing scenarios, or extensibility of the benchmark itself. I would appreciate more discussion in the main text about what exactly the scenario categories are, the limitations of using a template-based approach, and how someone reading your paper could potentially contribute a new category of scenario, or a new template, to your benchmark. More broadly, it is unclear if you intend to support extensibility of your benchmark (i.e., the ability of users to add additional explanation methods, predictor models, or scenario prompts).
Nit: The text in your figures is inaccessible via a screen reader. To fix this, you can include figures as PDFs instead of PNGs.
Questions
See the questions included throughout the Weaknesses section. Additionally, I had a few other clarifying questions:
- Why not report simple accuracy (e.g., if we do 'hard classification' where the predictor must output Yes or No, what percentage of the predictor's outputs actually match the model's output?) as the performance metric, as is done in most XAI work on simulatability? I found the KLDiv metric quite difficult to interpret. (A sketch of the metric I have in mind appears after these questions.)
- I am confused about what you mean when you use the term "complex behavior" in Section 2. What is meant by "nonlinear model behavior" in this context? Does this mean that the simple presence/absence of specific tokens doesn't influence the model to vote yes or no? What does it mean to "adversarially filter against a logistic regression baseline"? Is there a citation to work that's used a similar method?
- For the "rationalization" explanation method, is there a chance that the model's predicted probabilities (the ) change when you ask the model to include a rationalization?
PC is entering meta-review on behalf of SAC and AC:
The reviewers felt that the submission was not a strong contribution, and the authors did not respond during the review process.
Additional Comments from Reviewer Discussion
TBD
Reject