PaperHub
Overall: 5.7 / 10
Poster · 3 reviewers
Ratings: 5, 5, 7 (min 5, max 7, std 0.9)
Confidence: 3.0
Correctness: 3.0
Contribution: 2.3
Presentation: 2.0
NeurIPS 2024

Explanations that reveal all through the definition of encoding

OpenReview · PDF
Submitted: 2024-05-16 · Updated: 2025-01-17
TL;DR

We formalize the definition of encoding in explanation methods and provide two methods to detect encoding.

Abstract

Keywords
feature attributions, model explanations, evaluating explanations, encoding the prediction, interpretability

Reviews and Discussion

Review
Rating: 5

This paper claims that the main problem with evaluating explainability methods is encoding, which refers to leakage of information through the structure of the explanation (here defined as a boolean mask for selecting features) rather than through the values of the selected features. The paper tries to quantify the extent of this leakage and analyzes some previously designed methods from the proposed perspective, summarized in Table 1. The authors take an intuitive approach to this quantity and use some definitions to build a measure termed "ENCODE-METER". This measure lets them tweak the EVAL-X objective into what they call DET-X, to gain "strong" encoding detection per Definition 3. The claims in this work are supported by proofs and sometimes by explanations that come from hand-designed data-generating processes constructed as counterexamples. Finally, they evaluate the proposed framework quantitatively on some data-generating processes.

Strengths

  1. The paper adopts a formal approach toward its claims and goals.
  2. The paper provides intuition and summarizes it in concise mathematical equations.
  3. The goal and problem being tackled are of high priority in the field of explainability and affect many researchers in the field.
  4. The direction that the paper proposes seems quite promising and deserves further study.
  5. The paper tackles encoding, which has not yet been formalized and remains underdeveloped, adding to the novelty of the work.

Weaknesses

I would like to thank the authors for their hard work and contribution to the community. I have learned a lot from this work, so the points mentioned here should not be discouraging; rather, they should help produce a work of higher quality. This work seems very close to publication, apart from the small potential errors I point out here (nonetheless, I might be wrong).

Important missing discussions

  1. It would be helpful to rank some known explanation methods and compare the results with known metrics like ROAR. Then it would be easier for the reader to see the differences between your evaluation method and other evaluation methods. ROAR, for example, ranks many methods, reaches a bottom line, and raises some open questions. Toy datasets are good for formal proofs, simplicity, and demonstration, but relying on toy datasets can create a gap between modelling and practice. I would recommend having a table similar to ROAR's that ranks methods. Maybe we would see that the results of DET-X, although more elegant in theory, actually coincide with the results one gets with ROAR or some other heuristic method.
  2. It seems that the main idea of this paper and some referenced papers revolves around spurious correlations in the DGP. It would be useful to have a discussion of how spurious correlations in the DGP affect explanations.

Formal writing

  1. Definition 1 does not look like a definition; it is written like a proposition. It would be easier to understand if the definition read like "We call X an encoding if Y holds."
  2. Definition 2 also reads like a proposition, or at least I do not understand what is being defined here. Also, in max_e, what is the domain of e? Is it over explanation methods, or over explanations that are generated by a dataset? What is the domain of e^*?
  3. Does Definition 3 depend on the direction of the metric (higher is better vs. lower is better)?
  4. I cannot follow the argument in lines 234-239, maybe because I have not understood Definitions 1 and 2. In a proof, to show that a property is satisfied, one should ultimately show that X fits some definition.

Punctuation, writing style and minor errors

  1. In line 71, "that Hooker et al." should be removed.
  2. For readability, it would be better to give a more intuitive definition of "encoding explanations" earlier in the paper. I would recommend putting this line earlier: "Intuitively, encoding is when the binary mask output of the explanation itself provides information about the label beyond the selected values." Even the abstract or the start of the introduction, wherever you first mention encoding, would be appropriate.
  3. In lines 123-124, "... to predict the label have not been selected ...": shouldn't this proposition be stated more weakly? Something like "... to predict the label may have not been selected ...".
  4. In line 139, the punctuation should be: "likely**,** exactly**,** when".
  5. In line 100, the punctuation would be better as: "method is good**,** requires".
  6. Lines 154-161 read poorly; maybe I do not understand what they try to convey!
  7. DGP is used in line 164 but only defined later, in line 191.
  8. In Eq. 3, the word "Different" would reflect the mathematical form better than "Additional".
  9. Lines 196-198 are too informal, or at least I cannot make sense of them.
  10. In line 208, "provides" should be "provide".

Questions

  1. In figure 1(c), the assumptions are unclear. Why does the position-based encoding explanation always select the top half? Is it in the DGP that the cat always appears on top? Or is it in the definition of the encoding-based explanation? More than that, how do we know that "the label is independent of the color in the selected features"? Shouldn't you show this formally using the definition of independence?
  2. In line 164, what does "may seem okay" mean? Would it be better to adopt a more formal writing style here?
  3. In line 164, "because the color is independent of the label": can you show this more formally using the definition of independence? I think that the label may well depend on the color, at least here.
  4. In lines 165-166, "hides ... the predictions’ dependence on the control flow input": how can the control flow be shown in an explanation? I guess this does not fit the definition of saliency-based explanations?
  5. In lines 209-210, "only work when optimizing without constraints": what is the optimization referring to here? What are the constraints?
  6. Refer to the weaknesses section, "Formal writing", item 2.
  7. In lines 220-221, I do not understand why it is written that "x3 predicts the label". Maybe I'm wrong, but it seems that the label could be 1 or 0 when x3 is known and equal to 1 (for example). So knowing x3 does not tell us anything about the label, and we still need to look at the other features (either x1 or x2). Can you provide a table of conditional probabilities and explain why x3 tells us the label?

Limitations

Although the authors answered "Yes" to the checklist item on "discussion of limitations", I did not find a separate limitations section in the paper. Maybe they have referred to the limitations of their work indirectly.

One limitation that I see in this work is that the arguments are based on assumptions that might never hold on real-world datasets like ImageNet. This is why I encourage the authors to go beyond toy datasets and evaluate their method in the wild.

Author Response

Thank you for the generous comments and detailed feedback. We fixed the writing issues in the paper. If our response below addresses your primary concerns, would you kindly consider raising your score?

[rank explanations, compare with ROAR]

We thank the reviewer for raising this point. Unlike ROAR, EVAL-X and DET-X provably at least weakly detect encoding on any distribution, not just on the toy examples. We provide a comparison on the image recognition experiment; see the PDF in the general response [LINK]. ROAR incorrectly ranks multiple encoding explanations as high as the optimal explanation.

[Encoding is like spurious correlations?]

One could think of encoding as a spurious correlation between the selection and the label. This correlation is not a fact about the DGP that produces (X, Y). It is a consequence of the explanation depending on features that are informative of the label but only selecting a subset of them.

[Definitions 1 and 2 sound like propositions; domains of e]

We have updated the writing to make definition 1 start with "An explanation is called encoding if ...". In definition 2, we define the notion of weak detection for a distribution q(y, x) as a property an evaluation method can have. The domain of e and e^* is that of all functions that take an input x and return a binary mask over the input. For example, consider explanations for image-label pairs; the domain is the space of images. We have clarified this in the draft.

[definition 3, depend on the direction of the metric]

The reviewer is correct. We assume higher scores mean better. If the direction flips, Definition 3 would use ≤ instead of ≥.

[the argument in line 234-239]

Definitions 2 and 3 are about weak and strong detection, which are properties of evaluation methods. Lines 234-239 only define EVAL-X. Theorem 1 then proves that EVAL-X is a weak detector. The sentence at line 239 leads into Theorem 1. We have updated the draft to separate the definition of EVAL-X from the sentence that says it is a weak detector.

[lines 123-124, "... to predict the label have not been selected ..."]

The updated sentence reads "An encoding explanation should not score optimally under a good evaluation because the requisite input values needed to predict the label have not been selected by the explanation on a subset of the input values."

[lines 154-161 unclear]

This is an example of a "bad" encoding explanation. We elaborate on it here. Consider reviews that can be either of the type "My day was terrible but the movie was [ADJ1]." or of the type "The movie was [ADJ2], but the day was not great.", where ADJ1 can be "good" or "not great" and ADJ2 can be "not great" or "terrible".

In common English usage, "terrible" indicates bad sentiment more often than "not great". Then, in the setup above, seeing only that the fourth word is "terrible" yields bad sentiment with higher probability than seeing only that the phrase is "not great". However, the fourth word does not always describe the movie. An explanation can look at "not great" describing the movie as bad but then select "terrible" to encode the bad sentiment. Such an explanation is encoding because it selects a word that does not describe the movie but is informative of the sentiment.

[in Eq. 3, "Different" vs. "Additional"?]

The word "additional" denotes that knowing the explanation provides extra information about the label beyond what is in the selected values.
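For concreteness, one plausible way to write this condition (our notation; the paper's Eq. 3 may be stated differently) is that the label stays dependent on the explanation even after conditioning on the selected values:

```latex
% Sketch of a formalization (notation assumed, not taken from the paper):
% the explanation e(X) is encoding when it carries information about Y
% beyond the selected values X_{e(X)}.
Y \not\perp e(X) \;\big|\; X_{e(X)}
\qquad\Longleftrightarrow\qquad
I\!\left(Y;\, e(X) \,\middle|\, X_{e(X)}\right) > 0 .
```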

[lines 196-198 informal.]

The lines say that these explanations are encoding due to Lemma 1. By "the explanation e(X) varies with inputs other than the selected ones" we mean that e(X) depends on inputs that were not in the binary mask. By "... inputs provide information about the label ..." we mean that the inputs not selected by e(X) are informative of the label, which implies the second condition in Lemma 1.

[Fig. 1c unclear. PosEnc selects the top half? Is the label independent of the color?]

We construct PosEnc to select the top-left patch if label = cat. Intuitively, the label is independent of the color regardless of which patch the explanation selects; the label is determined only by the animal in the image. We have added math to prove this in the draft.
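To make the construction concrete, here is a minimal sketch of a position-based encoding explanation in the spirit of PosEnc; the 2x2 patch layout, patch size, and the behavior in the non-cat case are our assumptions, and the paper's exact construction may differ.

```python
import numpy as np

PATCH = 16  # assumed patch side length; the image is a 2x2 grid of patches

def pos_enc_explanation(image, label_fn):
    """Sketch of a position-based encoding explanation.

    The explanation may inspect the full image (via label_fn, which maps an
    image to its label) and then select a patch that contains only color and
    no animal. The selected values carry no information about the label, but
    the position of the selection does, which is what makes it encoding.
    """
    label = label_fn(image)  # inferred from the full input, not from observed labels
    mask = np.zeros((2 * PATCH, 2 * PATCH), dtype=bool)
    if label == "cat":
        mask[:PATCH, :PATCH] = True   # top-left color patch
    else:
        mask[PATCH:, :PATCH] = True   # bottom-left patch (our assumption for the non-cat case)
    return mask
```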

["may seem okay" informal][

The selected inputs (third panel, second row, left square for each input) show exactly the animal corresponding to the label. Selecting only that animal seems desirable at first; this is what we meant by "may seem okay". We then explain why such an explanation is encoding.

[how to show control flow?]

In the figure 1 and figure 3 examples, the "control flow feature" is the color patch, because it branches the DGP, determining whether the top-right or the bottom-right patch produces the label.

[209-210 optimization? constraints?]

Optimization here refers to finding an explanation that maximizes the evaluation score, max_e α(q, e). Various constraints can be put on explanations, such as requiring that the explanation select no more than K inputs.
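As a sketch in our own notation (following the α(q, e) used above, with a sparsity budget as one example of a constraint), the constrained search could be written as:

```latex
% Constrained explanation search (sketch; the sparsity constraint is illustrative):
\max_{e} \; \alpha(q, e)
\quad \text{subject to} \quad
\lVert e(x) \rVert_{0} \le K \;\; \text{for all inputs } x .
```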

[220-221 "x3 predicts the label"? why?]

The reviewer is correct. However, conditional on the selected inputs, x3 does predict the label. See the PDF in the general response for the math.

[Discuss limitations]

See the general response LINK.

[Based on assumptions that may not be true on real data like Imagenet.]

Evaluations like ROAR and EVAL-X have been used on explanations for ImageNet and for chest X-rays. We prove that such methods give high ranks to undesirable explanations that encode. In turn, conclusions about important inputs drawn from such high-scoring explanations may not hold for the model being explained. Further, we experiment with a real sentiment analysis task where DET-X uncovers evidence that LLM-generated explanations encode.

Comment

I'd like to thank the authors for their time. After reading the comments and rebuttals, I have increased the rating, and I hope that this paper also passes the test of time.

Comment

Thank you for engaging with our rebuttal!

We appreciate your feedback and comments very much, and the changes we made in response have already improved the writing in the paper.

Review
Rating: 5

How best to evaluate explanations is an important open question in the field, and one specific challenge that has so far received less attention is how to detect when explanations encode the prediction in the identity of the selected inputs. This paper proposes a formal definition of encoding, which the authors later use to probe, through experiments on simulated data, whether current evaluation protocols can detect encoding. They find existing evaluation methods to be at best weak detectors of encoding, and they propose the evaluation method DET-X as a strong detector of encoding.

Strengths

  • The motivation for defining and properly evaluating the notion of encoding is sound
  • The experiments look sound
  • As far as I know, the formal definition is a novel contribution

Weaknesses

  • The main and major weakness of the paper is clarity.
    • The paper is very, very dense (the spacing between paragraphs has been heavily reduced), and this makes it at times unintelligible.
  • Weak contextualization with regard to prior work. The related work discussion in the main paper is very succinct.
  • Overall, I believe that the substance of this paper is interesting, but in its current form the paper is really hard to read and understand.

Questions

  • Fig. 1C: "the label is produced from the bottom image if the color is red". Then why is the explanation about the top image?
  • l.156: "For example, consider reviews of the type "My day was terrible, but the movie was [ADJ]." where ADJ can be "good" or "bad" and let the explanation be "terrible" if ADJ=bad and "Movie was good" if ADJ=good. The sentiment about the movie comes from the second part which the explanation fails to isolate, meaning this explanation should be scored poorly." I am not sure the problem is that straightforward here. How would we differentiate a model that makes that mistake from an erroneous explanation?
  • l.165: "control flow" is not defined, hence it is hard to understand this section.
  • l.366: "We compute ENCODE-METER with q(Ev | xv) modeled by a ResNet34 trained the same way as EVAL-X but, instead of predicting the label, it predicts the identity of the subset selected by the explanation". I don't understand what this means; could you please clarify?

Limitations

I do not see a limitation section in the paper.

Author Response

We thank the reviewer for their careful feedback. If our response below addresses your primary concerns, would you kindly consider raising your score?

[Paper is very dense and this makes it at times unintelligible]

We thank the reviewer for this feedback. We modified the draft to use standard paragraph spacing, with the following changes to the writing:

  • Made section 2.1 more concise; each encoding example now spans only a single paragraph covering both the intuitive example and the formal construction.
  • Moved lemma 1 and the associated text to the appendix.
  • Made sections 5.1 and 5.2 concise by listing only results and conclusions, and moved the experimental details about training models for the ENCODE-METER to the appendix.

If the reviewer has any other concerns about the clarity/density, we'll be happy to address them.

[Weak contextualization with regard to prior work]

There are limited studies on evaluating explanations, and we discuss them. Due to space constraints, we had moved the related discussion about faithfulness and label-leakage to appendix B. We have moved it back and reproduce it here:

Other investigations into evaluating explanations focused on label-leakage [14, 34] and faithfulness [11, 16, 35, 36, 37]. Akin to encoding, label-leakage is an issue that occurs with explanations that depend on both the inputs and the observed label; such explanations, when built naively, can yield attributed inputs that are more predictive of the label than the full input set. In this paper, we do not consider explanations that have access to the observed label. Faithfulness, intuitively, asks that the explanation reflect the process of how a label is predicted from the inputs. Typical faithfulness evaluations rely on the quality of prediction from the selected inputs [12, 16]. Jacovi and Goldberg [11] and Jethani et al. [13] discuss how such evaluation methods are insensitive to encoding by checking how they score encoding constructions.

- [11] https://arxiv.org/abs/2004.03685
- [12] https://arxiv.org/abs/1806.10758
- [13] https://proceedings.mlr.press/v130/jethani21a.html
- [14] https://proceedings.mlr.press/v206/jethani23a.html
- [16] https://arxiv.org/abs/2005.00115
- [34] https://aclanthology.org/2020.findings-emnlp.390/
- [35] https://arxiv.org/pdf/2111.07367
- [36] https://ojs.aaai.org/index.php/AAAI/article/view/21196
- [37] https://arxiv.org/abs/2109.05463

If the reviewer sees a work we missed we would be happy to add it.

[Fig.1C, "the label is produced from the bottom image if the color is red" then why is the explanation about the top image]

There may be a small misunderstanding here. In this example, the explanation selects only the top-left quarter patch, which contains only color. This explanation encodes information about the label in the selection. The reviewer may also be asking why such a construction can be called an explanation. As the overarching goal of the paper is evaluating explanations, we consider any function of the inputs that outputs a binary mask as a candidate.

[l.156 "For example, How would we differentiate a model that makes that mistake from an erroneous explanation]

We respond here assuming that the reviewer used the word "erroneous" to mean that the explanation selects inputs other than the ones the model uses to predict. If the reviewer meant something else and clarifies their comment, we would be happy to engage in further discussion.

To understand whether an explanation selects the inputs the model depends on, one would have to look at how predictive the selected inputs are of the labels produced by the model. Let's look at evaluating a candidate explanation of a model's predictions without assuming anything about how the explanation or the model was produced.

  • If the model mistakenly relies on the word "terrible" and the explanation correctly selects the word "terrible", then the selected features would have maximum information about the model-predicted labels.
  • If the model correctly depends on the value that ADJ takes but the explanation selects the word "terrible", then the selected features can only be informative of the model-predicted labels by encoding.

Due to these differences, an encoding explanation of a correct model and a non-encoding explanation of an incorrect model have different signatures that EVAL-X or DET-X would detect.

[l.165 "control flow" is not define]

Thank you for pointing this out. "Control flow" is not common parlance and comes from the software engineering literature, so we explain it below. We use the phrase "control flow" to indicate that the DGP looks at the value of one input to determine which other inputs produce the label; this process reflects an if-else statement, where the condition being checked determines the flow of computation. We have added this paragraph to the paper.
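As an illustration, here is a minimal sketch of such a control-flow DGP in the spirit of the figure 1 and figure 3 examples; the feature names and distributions are our assumptions, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_control_flow_dgp():
    """Sample one (input, label) pair from a toy if-else DGP (sketch)."""
    color = rng.choice(["red", "blue"])        # the control-flow feature
    top_right = rng.choice(["cat", "dog"])     # candidate label-producing patch
    bottom_right = rng.choice(["cat", "dog"])  # candidate label-producing patch
    # The color decides which patch produces the label, like an if-else branch.
    label = bottom_right if color == "red" else top_right
    x = {"color": color, "top_right": top_right, "bottom_right": bottom_right}
    return x, label
```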

[l.366 "Trained the same way as EVAL-X but, instead of predicting the label, it predicts the identity of subset selected by the explanation?]

EVAL-X trains a model for q(y | x_v) that predicts the label y from the selected inputs x_v, by randomly choosing v for every sample. Similarly, ENCODE-METER relies on the conditional distribution q(E_v | x_v). To model q(E_v | x_v), we replace y with E_v in the EVAL-X training procedure. Here, E_v is the indicator of whether the explanation for the input x is the subset v.
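To make the procedure concrete, here is a minimal sketch of such a training loop under our own assumptions (masking by zeroing, a binary target, and the function names used here are illustrative, not the paper's implementation):

```python
import torch
import torch.nn.functional as F

def train_encode_meter_epoch(model, optimizer, loader, explanation_fn, num_features):
    """One epoch of the sketched ENCODE-METER training: predict E_v from masked inputs."""
    for x, _ in loader:                                    # observed labels are not used
        # Randomly choose a subset v for every sample, as in EVAL-X.
        v = torch.rand(x.shape[0], num_features) < 0.5
        x_masked = x * v.float()                           # zero out the unselected inputs
        # E_v = 1 if the explanation of the *full* input x equals the sampled subset v.
        e = explanation_fn(x)                              # boolean mask per sample
        target = (e == v).all(dim=1).float()
        # Predict E_v from the masked values x_v alone.
        logits = model(x_masked).squeeze(-1)
        loss = F.binary_cross_entropy_with_logits(logits, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```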

[I do not see a limitation section in the paper.]

We discuss limitations in the Discussion (Section 6), in the paragraph "Mis-estimated models, ...". We elaborate on this limitation in the general response LINK. If the reviewer can point to other questions or limitations that we should discuss, we are happy to add them.

Comment

I have read the rebuttal as well as the exchange with the other reviewers.

[Weak contextualization with regard to prior work]

I saw that some of the related work was in the Appendix and I am not sure I agree with this approach. In the main text, a good contextualization seems more important to me than more experiments (whose conclusions can be summarized in one line with a reference to the Appendix). While it is not a hard reject reason, I do advise the authors to consider the suggestion.

Clarity

Based on the rebuttal and on the different discussions, I believe the authors have made significant steps toward making the paper clearer by shortening long sections, removing less central information or moving it to the Appendix. Also, in agreement with Reviewer PabQ, I strongly recommend that the authors move the sentence about the intuition behind encoding much earlier in the work, into the introduction but also into the abstract.

I have updated my score in accordance with those points from 3 to 5.

Comment

Thank you for engaging with our rebuttal!

We appreciate your feedback and comments very much, and the changes we made in response have already improved the writing in the paper.

Review
Rating: 7

The paper presents a novel approach to evaluate feature attribution methods in machine learning by addressing the issue of encoding in explanations. The authors define encoding as when the explanation's identity provides additional information about the target beyond the selected input values. They categorize evaluation methods into weak detectors (optimal for non-encoding explanations) and strong detectors (score non-encoding explanations higher). The paper introduces DET-X, a new score that strongly detects encoding, and empirically verifies its effectiveness through simulated and real-world datasets, including an image recognition task and sentiment analysis of movie reviews.

Strengths

  • The paper introduces a precise mathematical definition of encoding, addressing a significant gap in the interpretability literature.

  • The classification of evaluation methods into weak and strong detectors provides a clear framework for assessing the robustness of feature attribution methods.

  • The authors rigorously prove that their proposed DET-X score strongly detects encoding, differentiating it from existing methods.

  • The paper includes empirical validation on both simulated data and real-world applications, demonstrating the practical utility of DET-X.

  • By uncovering encoding in LLM-generated explanations for sentiment analysis, the paper shows the relevance of its contributions to current AI applications.

Weaknesses

  • The proposed DET-X score may require complex implementation and computational resources, which might limit its adoption in practical scenarios.

  • While the experiments are thorough, they are limited to specific datasets and types of tasks (image recognition and sentiment analysis). More diverse applications would strengthen the claims.

  • The paper acknowledges that misestimation of models used in evaluation could lead to incorrect conclusions, suggesting a need for robust estimation techniques.

Questions

  • How does the DET-X score perform in different domains outside of image recognition and sentiment analysis?

  • Are there specific conditions or types of models where DET-X might not perform as expected?

  • How does the computational cost of DET-X compare to existing evaluation methods?

Limitations

  • The paper's findings are primarily validated on specific datasets and tasks, which may not generalize to all types of machine learning applications.

  • The DET-X score's implementation complexity might be a barrier for widespread adoption, especially in resource-constrained environments.

  • The paper highlights the risk of misestimation in model-based evaluations, which could impact the reliability of DET-X in certain scenarios.

Author Response

We thank the reviewer for their careful feedback. If our response below addresses your primary concerns, would you kindly consider raising your score?

[The proposed DET-X score may require complex implementation and computational resources, which might limit its adoption in practical scenarios. The DET-X score's implementation complexity might be a barrier for widespread adoption, especially in resource-constrained environments.]

While the reviewer is correct that one needs to estimate the ENCODE-METER in addition to EVAL-X, the training process is standard supervised learning. The only difference is that the inputs are randomly masked, as in EVAL-X. As training and evaluating predictive models via supervised learning is well studied, even at scale, we do not foresee training DET-X being a difficult task.

The reviewer is correct that DET-X requires twice as much computation as EVAL-X, which itself can take more computation than training a single model that predicts the label from the full inputs. This extra computation comes from having to learn to predict from different subsets. However, one cannot escape training to predict from different subsets when evaluating explanations based on how informative the selected inputs are. Using large pre-trained models does speed up this process. The estimation of the ENCODE-METER with GPT-2 in the LLM experiment, including hyperparameter tuning, took under a single day on a single GPU. Alternatively, given a conditional generative model for the full inputs given the masked ones, both components of DET-X reduce to averaging a single model's predictions over generated samples.
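As a sketch of this alternative under our own assumptions (a generative model g with a hypothetical sample_completion method that imputes the unselected inputs given the selected ones, and a classifier clf trained on full inputs; none of these names come from the paper), the conditional prediction such evaluations need can be estimated by Monte Carlo averaging:

```python
import numpy as np

def predict_from_subset(clf, g, x, mask, num_samples=32):
    """Monte Carlo estimate of q(y | x_v) by imputing the unselected inputs (sketch)."""
    probs = []
    for _ in range(num_samples):
        x_full = g.sample_completion(x, mask)    # fill in features outside the mask, given x_v
        probs.append(clf.predict_proba(x_full))  # predict the label from the completed input
    return np.mean(probs, axis=0)                # average over imputations
```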

Code for training models from masked images for EVAL-X already exists and we will release our implementation for DET-X with the camera-ready version, if accepted.

[The paper acknowledges that misestimation of models used in evaluation could lead to incorrect conclusions, suggesting a need for robust estimation techniques. The paper highlights the risk of misestimation in model-based evaluations, which could impact the reliability of DET-X in certain scenarios.]

Thank you for highlighting this point. The problem of misestimated models goes beyond our work and applies to all evaluation methods that build a model to score explanations, including ROAR, FRESH, and recursive ROAR. Misestimation cannot be avoided entirely, but post-hoc methods for capturing the uncertainty of predictions, via conformal inference or calibration, can help mitigate the errors in evaluation due to misestimation.

[While the experiments are thorough, they are limited to specific datasets and types of tasks (image recognition and sentiment analysis). More diverse applications would strengthen the claims. How does the DET-X score perform in different domains outside of image recognition and sentiment analysis?]

The goal of the paper is to establish the mathematical definition of encoding and of strong detection of encoding, so we focused on two popular tasks for which explanations are produced.

We have since added a tabular-data experiment and report the results here. We ran an experiment on the tabular CDS Diabetes dataset from the UCI repository and show the effectiveness of EVAL-X and DET-X at weakly and strongly detecting encoding, respectively. See the PDF in the general response LINK.

The results show that:

  1. EVAL-X scores the Optimal explanation above all the encoding ones, showcasing weak detection.
  2. However, EVAL-X scores the last non-encoding explanation, which selects features informative of the label, below the encoding explanations PosEnc and MargEnc, showing it is not a strong detector.
  3. DET-X correctly scores all the non-encoding explanations above all the encoding ones, demonstrating strong detection.

[Are there specific conditions or types of models where DET-X might not perform as expected?]

DET-X is model-agnostic and would extend to any type of model, as long as one can obtain input-output pairs from the model. DET-X depends on models that predict the label from subsets of the inputs. Learning to predict from subsets may require much larger models than predicting from the whole input set when the distribution of the label changes dramatically between conditioning on two similar input subsets. As we discuss in section 6, DET-X only works for explanations that select subsets of features. Future work can extend the notion of encoding to free-text (natural language) rationales.

[How does the computational cost of DET-X compare to existing evaluation methods?]

DET-X has twice the computational cost of EVAL-X because it runs the EVAL-X training process twice, but the two models are independent and can be trained in parallel. Compared to methods like ROAR, the relative increase in computational cost might depend on the problem at hand, as the training procedures are different, which means optimization may converge at different rates.

Author Response

General response

We thank the reviewers for their feedback. We are glad that the reviewers found the following strengths in our paper:

  • The paper is interesting (Fr3U),
  • Tackles a high priority problem in promising directions (PabQ),
  • The definition of encoding is a novel and significant contribution (Fr3U, A1S9, PabQ)
  • The experiments look sound and demonstrate practical utility to current AI applications with LLMs (Fr3U, A1S9).

Briefly, the paper studies the evaluation of feature attribution methods. Feature attribution evaluations typically check how well the label is predicted from the selected inputs returned by a feature attribution method. However, feature attribution methods can hide information about the label in the identity of the selection, beyond what is available in the values of the selected variables. For example, an explanation for predicting pneumonia from a chest X-ray can output the top-right pixel when pneumonia is present but the bottom-left pixel when there is no pneumonia. Such explanations are called "encoding". Encoding is a recognized problem that limits the utility of both explanations and their evaluations [3, 4].

In the literature, only specific constructions of encoding explanations exist, without a formal definition. Without such a formal definition, an evaluation method's ability to detect encoding cannot be tested beyond the few recognized constructions. To address this gap, this paper makes the following contributions:

  • Develop the first mathematical definition of encoding.
  • Show that existing ad-hoc encoding constructions fall under the introduced definition.
  • Formalize different notions of an evaluation’s sensitivity to encoding in terms of weak and strong detection.
  • Show that existing evaluations ROAR [1] and FRESH [2] do not weakly detect encoding.
  • Prove that EVAL-X weakly detects encoding, but does not strongly.
  • Introduce DET-X and prove it strongly detects encoding.
  • Use DET-X to uncover evidence of encoding in LLM-generated explanations for predicting the sentiment from movie reviews.

[Rebuttal overview] In response to the reviewer feedback,

  1. We have made the paper less dense by moving some details and technical parts (like lemma 1) to the appendix (Fr3U).
  2. Evaluated EVAL-X and DET-X on Tabular data (A1S9).
  3. Compared ROAR, EVAL-X, DET-X on the image experiment (PabQ).

Two reviewers also asked about limitations. We discuss limitations in the Discussion (Section 6), in the paragraph "Misestimated models, explanation search, and encoding for free-text rationales". Specifically, we point out that EVAL-X or DET-X scores may not retain their weak and strong detection properties when the scores are computed with misestimated models. We gave a formal example in Appendix D.4 but did not link it in Section 6. Such problems from misestimation are not unique to EVAL-X or DET-X; they can occur in any evaluation method that builds models to compute its score.

We responded to individual comments in separate responses.

[1] https://arxiv.org/abs/1806.10758

[2] https://arxiv.org/abs/2005.00115

[3] https://proceedings.mlr.press/v130/jethani21a.html

[4] https://arxiv.org/abs/2308.14272

Comment

Dear reviewers,

The discussion period will end soon. Please read the rebuttal from the authors and participate in the discussion, as soon as possible. Thank you.

Best, AC

Final Decision

This paper points out the encoding problem with explanation methods, i.e., that the explanation may be determined by information beyond the input values. This is quite an interesting idea and was welcomed by all reviewers, but the presentation is a bit dense and needs to be polished.