PaperHub
7.0 / 10
Poster · 3 reviewers
Ratings: 3, 5, 5 (min 3, max 5, std 0.9)
Confidence: 3.7
Novelty: 3.3 · Quality: 3.0 · Clarity: 3.0 · Significance: 3.0
NeurIPS 2025

From Black-box to Causal-box: Towards Building More Interpretable Models

OpenReview · PDF
Submitted: 2025-04-27 · Updated: 2025-10-29
TL;DR

We develop a framework for building causally interpretable models where counterfactual queries can be consistently evaluated from a model and observational data.

Abstract

Keywords
Causal inference · Interpretability · XAI

Reviews and Discussion

Review
Rating: 3

This paper introduces a formal notion of causal interpretability, which captures when a prediction model can consistently answer counterfactual “what‐if” queries from observational data. The authors demonstrate that standard black-box predictors (which map directly from inputs to labels) and concept-based models (which utilize all observed features) generally fail to meet this interpretability criterion. To address this gap, they propose Generalized Concept‐based Prediction (GCP) models, which leverage a carefully selected subset of features for prediction. The key theoretical contribution is a graphical criterion that characterizes precisely which feature sets enable causal interpretability with respect to a given counterfactual query. Building on this, the paper proves the uniqueness of the maximal admissible feature set, which optimally balances predictive expressiveness under interpretability constraints. It also derives a closed-form expression revealing a fundamental tradeoff between interpretability and accuracy. Experiments on two image datasets provide empirical support for the theoretical results.

Strengths and Weaknesses

Strengths

  • The paper rigorously addresses the question: under a counterfactual assumption involving a feature set $\mathbf{W}$, will a concept-based model constructed on a feature set $\mathbf{T}$ behave consistently? Theorem 1 formalizes the necessary condition that $\mathbf{W}$ and $\mathbf{T}$ must satisfy, offering theoretical insights into concept-based modeling and counterfactual reasoning.

Weaknesses

  • While the proposed notion of causal interpretability helps formalize how a model behaves under counterfactual assumptions, it offers limited practical guidance for improving a model’s interpretability. Specifically, it does not explain how the model derives its predictions from inputs, nor does it substantively advance the paper’s stated goal of “bridging the gap between low-level features and high-level concepts” (line 33).
  • The experimental evaluation is relatively simplistic and lacks comparisons with existing black-box models or alternative concept-based methods, weakening the empirical validation.

Questions

  • The paper defines causal interpretability as “models within the exact model class yield consistent answers to the same counterfactual query.” What precisely is meant by model class in this context? For black-box and concept-based models, how should we determine whether two models belong to the same class? The definition could benefit from further clarification.

  • In Example 4, the reported value of $\mathcal{M}_\text{CP}$ seems to be 0.3 rather than 0.2. Please verify this value.

Limitations

Yes

Formatting concerns

N/A

Author response

Thank you for spending the time reviewing our paper. We believe that some misunderstandings of our work made the evaluation overly harsh, and we hope that you can reconsider our contributions based on the clarifications provided below.


[Q1] The paper defines causal interpretability as “models within the exact model class yield consistent answers to the same counterfactual query.” What precisely is meant by model class in this context? For black-box and concept-based models, how should we determine whether two models belong to the same class?

The model class in our paper refers to the set of all functions or hypotheses that a particular type of model can represent. $\Omega_\text{BP}$ is a model class that includes all black-box models that use the image $\mathbf{X}$ to predict the label (lines 124-125) (using the image $\mathbf{X}$ to predict the label is the particular type). The (classic) concept-based model class, $\Omega_\text{CP}$, refers to all predictive models that use all concepts $\mathbf{V}$ for label prediction (lines 126-127). The generalized concept-based model class, $\Omega_{\text{GCP}(\mathbf{T})}$, refers to all predictive models that use the feature set $\mathbf{T}$ for prediction (lines 227-228).
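For concreteness, the three classes can be summarized roughly as follows (a paraphrase of the definitions referenced above, not the paper's verbatim statements):

```latex
% Rough paraphrase of the three model classes (cf. lines 124-128 and 227-228 of the paper)
\Omega_{\text{BP}}              = \{\, f : \widehat{Y} = f(\mathbf{X}) \,\}   % black-box: image -> label
\Omega_{\text{CP}}              = \{\, f : \widehat{Y} = f(\mathbf{V}) \,\}   % concept-based: all concepts -> label
\Omega_{\text{GCP}(\mathbf{T})} = \{\, f : \widehat{Y} = f(\mathbf{T}) \,\},
    \quad \mathbf{T} \subseteq \mathbf{V}                                     % generalized: a chosen concept subset -> label
```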

[W1-1] While the proposed notion of causal interpretability helps formalize how a model behaves under counterfactual assumptions, it offers limited practical guidance for improving a model’s interpretability.

We respectfully disagree with this point and would like to clarify that our notion of causal interpretability, particularly through Theorem 1, does offer practical guidance for model design. Specifically, Theorem 1 provides a graphical criterion for determining whether a predictor can consistently evaluate counterfactual queries based on observational data.

To illustrate, causal interpretability formalizes the property that a class of models can evaluate counterfactual questions consistently. Thm. 1 suggests that causal interpretability can be achieved if and only if the set of concepts $\mathbf{T}$ obeys the given graphical rules (certain causal relationships among the generative features). Thus, when aiming to design a predictor, one should consider involving such a $\mathbf{T}$ in the model and exclude features that violate the graphical conditions. With this design, the predictor is capable of evaluating human-understandable counterfactual questions based on the observational distribution and thus is more interpretable. This is a fundamental building block that ties causality, interpretability, and machine learning, and we expect it to be extended to a larger class of models across a wide range of domains.

[W1-2] Specifically, it does not explain how the model derives its predictions from inputs, nor does it substantively advance the paper’s stated goal of “bridging the gap between low-level features and high-level concepts” (line 33).

We would like to clarify that our method contributes to bridging the gap between low-level features and high-level concepts for answering counterfactual questions. Specifically, unlike many post-hoc explanation methods that focus on low-level parameters or raw inputs (such as pixels), existing concept-based models map low-level instances to high-level features and attempt to answer counterfactual questions over high-level concepts. However, these models are not guaranteed to provide consistent answers to counterfactual questions.

As we illustrated in our paper (lines 53–65 and Figure 1), two models (C and D) can give completely opposite answers to the counterfactual question: “What would the attractiveness be had the person smiled?” In such cases, users cannot determine which answer to trust. Our framework addresses this issue by ensuring that all models within the same generalized concept-based family provide consistent answers to counterfactual queries. This consistency in counterfactuals enables users to assess whether the model reasoning aligns with human expectations.

In summary, our framework inherits the ability to map pixels (low-level) to concepts (high-level) from existing concept-based models, and further closes the gap by enabling these high-level concepts to be used in answering counterfactual questions.

[W2] The experimental evaluation is relatively simplistic and lacks comparisons with existing black-box models or alternative concept-based methods, weakening the empirical validation.

First, for black-box models, which are mappings from the image $\mathbf{X}$ to the (predicted) label $\widehat{Y}$ that do not involve the generative factors $\mathbf{W}$, it is infeasible to evaluate this query through a black-box model. Even if additional operations are introduced to a black-box model to extract and access $\mathbf{W}$, Example 3 demonstrates that black-box models can still lead to inconsistent answers.

Second, our experiments include a comparison with a concept-based model (that uses the features $\mathbf{V}$ to predict the label), and existing concept-based methods fall within this setting. We would like to note that our main theoretical results concern how different choices of the feature set $\mathbf{T}$ impact causal interpretability, and thus we evaluated models using different $\mathbf{T}$, where a concept-based model is the specific case $\mathbf{T}=\mathbf{V}$.

Finally, we would like to note that in real-world datasets, it is infeasible to evaluate the actual value of the counterfactual query because the underlying ground-truth data-generating process for real-world datasets is not given. For example, it is unknown how nature decides the generation process of human facial features. Acknowledging this inevitable restriction, BarMNIST datasets (where we have the ground-truth SCM) allow us to thoroughly validate our theory.
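To make the role of a known SCM concrete, the sketch below evaluates a counterfactual by abduction (reuse the exogenous noise), action (fix the intervened variable), and prediction (re-run the mechanisms). The SCM here is a purely illustrative toy model, not the actual BarMNIST generating process; the point is that every step requires the mechanisms, which are unknown for real-world data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy SCM with known mechanisms (illustrative only; NOT the BarMNIST SCM).
# Exogenous noise: U_a, U_b, U_y.  Mechanisms: A := U_a, B := A XOR U_b, Y := (A AND B) OR U_y.
def sample_world(n):
    u_a = rng.random(n) < 0.5
    u_b = rng.random(n) < 0.3
    u_y = rng.random(n) < 0.1
    a = u_a
    b = a ^ u_b
    y = (a & b) | u_y
    return u_a, u_b, u_y, a, b, y

def counterfactual_y(u_a, u_b, u_y, a_value):
    """Ground-truth counterfactual Y_{A=a_value}: reuse the exogenous noise (abduction),
    fix A by intervention (action), and re-run the downstream mechanisms (prediction)."""
    a = np.full_like(u_a, a_value)
    b = a ^ u_b
    return (a & b) | u_y

u_a, u_b, u_y, a, b, y = sample_world(100_000)
y_cf = counterfactual_y(u_a, u_b, u_y, a_value=True)
print("P(Y=1) =", y.mean(), "  P(Y_{A=1}=1) =", y_cf.mean())
```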

[Q2] In Example 4, the reported value of $\mathcal{M}_\text{CP}$ seems to be 0.3 rather than 0.2. Please verify this value.

Yes, it is a typo and the value should be 0.3. The two models in the example give inconsistent answers to the query (0.3 and 0.5). Thanks for pointing it out!


We hope this addresses your questions, and please let us know if there are any remaining concerns!

Comment

Dear Reviewer xsrJ,

Thank you again for your feedback. We have addressed the concerns raised by the reviewer, but we remain open to further engagement and clarification of any remaining issues.

Thank you. Authors of the submission 4291.

Review
Rating: 5

The work introduces the concept of counterfactual consistency and shows that both black-box models and concept-based models are, in general, not counterfactually consistent. It introduces conditions on the set of concepts to use that provably guarantee counterfactual consistency.

Strengths and Weaknesses

Strengths

This is an interesting contribution in the field of concept-based models, where recent approaches are trying to explore additional conditions for interpretability (e.g. disentanglement). Given the relevance of counterfactual explanations in interactive analysis of predictive models, this contribution can pave the way to the development of more consistent models.

Weaknesses

One concern I have about the formalization is the lack of distinction between ground-truth concepts and predicted concepts. The formulation seems to imply the availability of ground-truth concepts at prediction stage.

The experimental evaluation is very limited, with quantitative results only provided for a proof-of-concept MNIST-based dataset. While this is mostly a theoretical paper, measuring the impact of the accuracy-counterfactual interpretability tradeoff in real datasets would strengthen the contribution.

Questions

  • Ground-truth concepts are typically not available at prediction time. Does this affect the formulation in any way? Can you clarify this?

  • It would be interesting to relate the t-admissible set to the actionability of the concepts, as in some applications one could be willing to incorporate non-actionable concepts for improving accuracy even if the model would not be counterfactually consistent with respect to them, provided consistency is guaranteed on actionable concepts. Can you comment on this?

Limitations

Limitations are not discussed in the paper. However, the experimental evaluation is extremely limited, and the impact of the accuracy-interpretability tradeoff in real applications is left to be understood.

Final justification

I believe this is a solid piece of work on a relevant topic, and I am confident that authors can easily address my requests for clarifications and discussion on the limitations concerning real-world experiments.

Formatting concerns

none

Author response

We sincerely thank the reviewer for the constructive and thoughtful feedback. We respond to your comments as follows:


[W1, Q1] One concern I have about the formalization is the lack of distinction between ground-truth concepts and predicted concepts. The formulation seems to imply the availability of ground-truth concepts at prediction stage. / Ground-truth concepts are typically not available at prediction time. Does this affect the formulation in any way?

In the closed-form formula Eq. (5), the concepts $\mathbf{W}$ and $\mathbf{T}$ are ground-truth concepts. Since the labels of the ground-truth concepts are available, one can estimate $P(\mathbf{T} \mid \mathbf{X})$ over the ground-truth concepts $\mathbf{T}$. For clarity, let us denote this estimated distribution as $\widehat{P}(\mathbf{T} \mid \mathbf{X})$.

In the prediction stage, the true concepts $\mathbf{W}$ and $\mathbf{T}$ of an image instance $\mathbf{X}$ are not given directly. Instead, the predicted concepts $\mathbf{W}$ and $\mathbf{T}$ are sampled from the estimated $\widehat{P}(\mathbf{T} \mid \mathbf{X})$. When $\widehat{P}(\mathbf{T} \mid \mathbf{X})$ is accurate, the sampled (predicted) concepts are expected to align closely with the ground-truth concepts. However, if the estimation has an error, the predicted concepts may deviate from the true ones, and this error will naturally propagate into the counterfactual evaluation via Equation (5). We will include this discussion in the revised manuscript.
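As a small illustration of this two-stage use (with a hypothetical concept predictor, not the paper's implementation), predicted concepts are sampled from the estimated $\widehat{P}(\mathbf{T} \mid \mathbf{X})$, and any estimation error in that distribution carries over to whatever is computed from the samples:

```python
import numpy as np

rng = np.random.default_rng(0)

def concept_probs(x):
    """Hypothetical estimate of P(T | X = x) for binary concepts, e.g. the output
    of a trained concept classifier; fixed placeholder values for illustration."""
    return np.array([0.9, 0.2])

def sample_predicted_concepts(x, n_samples=1000):
    """Sample predicted concept vectors t ~ P_hat(T | X = x). Estimation error in
    concept_probs propagates into any counterfactual quantity computed from these samples."""
    p = concept_probs(x)
    return rng.random((n_samples, p.shape[0])) < p

samples = sample_predicted_concepts(x=None)
print(samples.mean(axis=0))  # approximates P_hat(T | X) up to Monte Carlo error
```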

Our goal with this formulation is to formally characterize how these counterfactual quantities can be computed from the observational distribution under ideal conditions (i.e., accurate estimation). The challenge of robustly estimating $P(\mathbf{T} \mid \mathbf{X})$ from finite data is indeed fundamental and highly relevant to practice, but falls outside the scope of this work. That said, we agree it would be a valuable direction for future investigation, particularly in light of ongoing research in counterfactual estimation within the causal inference literature and the importance of creating more interpretable methods in practice.

[W2] While this is mostly a theoretical paper, measuring the impact of the accuracy-counterfactual interpretability tradeoff in real datasets would strengthen the contribution.

In real-world datasets, it is infeasible to evaluate the actual value of the counterfactual query because the underlying ground-truth data-generating process for real-world datasets is not given; specifically, the mechanisms of $\mathbf{V}$ are not known. For example, it is unknown how nature decides the generation process of human facial features. Due to this inevitable restriction, we thoroughly validated our theory on the BarMNIST datasets (where we have the ground-truth SCM), including the causal interpretability-accuracy tradeoff.

Still, our theory allows us to understand the interplay between causal interpretability and accuracy in real-world datasets. For example, given the T-admissible set {smiling, gender} and the query "Would the person be attractive had they smiled?", if one additionally wants to answer the query "Would the person be attractive had they been a man?", we know that a model using this $\mathbf{T}$ maintains causal interpretability w.r.t. both queries and thus would not compromise accuracy.

[Q2] It would be interesting to relate the t-admissible set to the actionability of the concepts, as in some applications one could be willing to incorporate non-actionable concepts for improving accuracy even if the model would not be counterfactually consistent with respect to them, provided consistency is guaranteed on actionable concepts.

Thanks for this insightful note. The intuition to add non-actionable variables for improving accuracy is correct and can be formally grounded by the T-admissible set and Thm. 1. Our theory suggests that, given the actionable concepts $\mathbf{W}$, one could incorporate additional non-actionable concepts as long as they are non-descendants of $\mathbf{W}$, which would help improve accuracy while retaining causal interpretability w.r.t. $P(\widehat{Y}_{\mathbf{w}}\mid \mathbf{X})$. For example, given the T-admissible set {smiling, gender} and the query "Would the person be attractive had they smiled?", one can incorporate additional concepts (including non-actionable concepts), e.g., age or hair color, that are non-descendants of smiling. We will include this discussion in the revised manuscript.
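A minimal sketch of this selection rule, assuming (as the non-descendant discussion above suggests) that retaining causal interpretability w.r.t. $P(\widehat{Y}_{\mathbf{w}} \mid \mathbf{X})$ amounts to adding only non-descendants of $\mathbf{W}$; Theorem 1 in the paper states the exact graphical criterion, and the concept DAG and concept names below are hypothetical:

```python
import networkx as nx

# Hypothetical causal diagram over high-level concepts (NOT the paper's CelebA ground truth).
concept_dag = nx.DiGraph([
    ("gender", "smiling"),
    ("age", "hair_color"),
    ("smiling", "cheek_raised"),  # a descendant of the intervened concept
])

def candidate_admissible_set(dag, W):
    """Return W plus every concept that is a non-descendant of all variables in W.
    This follows the non-descendant rule sketched in the rebuttal; Theorem 1 is authoritative."""
    descendants = set()
    for w in W:
        descendants |= nx.descendants(dag, w)
    return set(W) | (set(dag.nodes) - descendants - set(W))

print(candidate_admissible_set(concept_dag, W={"smiling"}))
# {'smiling', 'gender', 'age', 'hair_color'} -- 'cheek_raised' is excluded
```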


We appreciate your valuable comments and suggestions. Please let us know if you have further feedback!

Comment

Thanks for your answers. I would like to encourage you to include the clarifications and the limitations concerning real-world experiments in the revised version of the manuscript.

Comment

We appreciate your valuable comments and engagement. We will include our responses and discussions on the limitations in the final version. Thank you!

Review
Rating: 5

The authors present the notion of causal interpretability, which formalizes when counterfactual queries can be evaluated from a model and observational data. To do so, they introduce a particular framework for testing which family of models is causally interpretable. In their inquiry, they analyse black-box models as well as concept-based models, which turn out to never be causally interpretable (the former) and to not always be causally interpretable (the latter). They then derive the criteria for ensuring causal interpretability and test their theory on two different tasks.

Strengths and Weaknesses

Strengths

-I like the presentation of the (A)SCM paradigm. I saw a few of them, but this one is truly clear. On that same line of thought, I feel the quality of the presentation as a whole is excellent, with a good and efficient use of the nine pages.

-Though the accuracy-interpretability tradeoff is a sensitive topic ([1], for example), I feel it is well defended here.

[1] Stop explaining black box machine learning models for high-stakes decisions and use interpretable models instead. Nat. Mach. Intell. 1, 5 (2019), 206–215.

Weaknesses

1.1. I feel like sentences such as « In this work, we introduce the notion of causal interpretability, which formalizes when counterfactual queries can be evaluated from a model and observational data » (lines 5-7) are misleading. One could benefit from counterfactuals that do not always lead to the same outcome in the change of the prediction. For instance, one could examine the effect of the smile on attractiveness, considering that some people have an unattractive smile, teeth, etc. I feel like sentences such as « In this work, we introduce the notion of causal interpretability, which concerns whether a prediction model can be interpreted consistently across counterfactual scenarios » (lines 66-7) are more faithful to what is actually done in this work. So I think every presentation of causal interpretability of the first form should be changed to that of the second form.

1.2. On that same line of thought, one could say that having a counterfactual to always lead to the same outcome in the change of prediction (having the predictions to be monotonic regarding the counterfactual) is somewhat restricting and leads to an interpretation that is somewhat of lesser importance. For instance, in information theory, the relevance of information is inversely proportional to the probability that the event the information conveys is true. Thus, having counterfactuals resulting in similar outcomes is of lesser importance.

  2. « This means that a black box does not have access to any of the causal factors that generated the data » (line 125). Wouldn’t it be truer to say that we do not know whether the black-box model has access to any causal factors that generated the data? The black box model could, in its hidden inner workings, retrieve the features before making a prediction accordingly.

  3. Why assume that concept-based models require the use of every concept? I could be wrong, but I don’t think this is necessarily a need for concept-based models.

Typos and such

  • « ND » (line 236), though it is clear from the following that it refers to the concept of « non-descendance », should be defined properly.

  • « In other words, a maximal T-admissible set is a T-admissible set that would cease to be T-admissible if any additional variable were added to it » (line 256); I would add for clarity that once a set is not T-admissible, adding one or more variables can never make it T-admissible again.

-Line 243 : « are not be »

-Line 305 : Should cite the original MNIST paper, since the proposed dataset is a derivation of MNIST.

-Figure 4 (c) : Feels weird to have accuracy bars and error bars next to each other on the same graph. It would be better to be consistent with the considered metric.

-At a few places in the article, an « interpretability-accuracy » tradeoff is mentioned and defended (lines 12, 291, 292, Fig. 4 (c) caption), whereas at some other places, a « causal interpretability-accuracy » is defended (lines 80, 94, 280, 298-9, 340). Since the article focuses on causal interpretability, I would only discuss the tradeoff involving this kind of interpretability, since the tradeoff has not been defended for interpretability as a whole and accuracy.

-« a concept-based model is also often not interpretable » (line 89); I believe Example 4. illustrates that this could be the case, not that it is often the case. « We also demonstrate theoretically that concept-based models, which rely on all observed features for prediction, are also not guaranteed to be causally interpretable. » (line 72-3) is more faithful to the work.

Questions

Questions are listed in the Weaknesses section.

Limitations

Limitations of the framework are explicit, for they are to be found in the SCM paradigm that is presented.

Final justification

I don't feel that the criticisms raised by Reviewer xsrJ are sufficient to reject this article. The assumptions underlying the SCM structures are standard. Concerning the help in understanding the decision, working by elaborating counterfactuals is also standard practice. I stand by my score.

Formatting concerns

--

Author response

We appreciate the reviewer’s time and feedback provided to improve the manuscript. We respond to your comments as follows:


[W1.1] I feel like sentences such as « In this work, we introduce the notion of causal interpretability, which formalizes when counterfactual queries can be evaluated from a model and observational data » (lines 5-7) are misleading. One could benefit from counterfactuals that do not always lead to the same outcome in the change of the prediction. [..] I feel like sentences such as « In this work, we introduce the notion of causal interpretability, which concerns whether a prediction model can be interpreted consistently across counterfactual scenarios » (lines 66-7) are more faithful to what is actually done in this work.

Thank you for pointing this out and we appreciate the opportunity to clarify our intent. We agree that the phrasing on lines 66–67 (“whether a prediction model can be interpreted consistently across counterfactual scenarios”) more accurately reflects the core idea behind our notion of causal interpretability.

Technically, causal interpretability refers to the property that any model within a given model class (models using the same set of variables for prediction, i.e., compatible with a specified causal diagram) that matches the observational distribution will also evaluate counterfactuals consistently. We acknowledge that evaluating counterfactuals using an individual fully specified SCM is important in some contexts.

Our original phrasing on lines 5–7 was intended to convey “when counterfactual queries can be evaluated from a specific model class and observational data.” To avoid confusion, we have revised this wording throughout the manuscript to better align with the more precise formulation from line 66.

[W1.2] On that same line of thought, one could say that having a counterfactual to always lead to the same outcome in the change of prediction (having the predictions to be monotonic regarding the counterfactual) is somewhat restricting and leads to an interpretation that is somewhat of lesser importance.

We appreciate this thoughtful comment. We agree with your point that “the relevance of information is inversely proportional to the probability that the event the information conveys is true.” However, we believe that the consistent evaluation of counterfactuals within a model class is still very valuable.

As we illustrate in Example 4, when counterfactuals are not consistently determined by the observational distribution alone, evaluating them requires full specification of the structural mechanisms (e.g., $f_C$) and the distribution over exogenous variables (e.g., $U_C$). In contrast, when causal interpretability holds (as formalized in Theorem 3), counterfactuals can be computed using a closed-form expression that depends solely on the observational distribution ($P(\mathbf{X}, \mathbf{V}, \widehat{Y})$); no knowledge of the SCM’s mechanisms is required.

From an information perspective, one could say that causal interpretability allows us to answer counterfactual queries with less information. In this sense, the value of causal interpretability lies not in maximizing information content, but in minimizing the informational burden required to perform reliable reasoning about counterfactuals.

[W2] « This means that a black box does not have access to any of the causal factors that generated the data » (line 125). Wouldn’t it be truer to say that we do not know whether the black-box model has access to any causal factors that generated the data?

Yes, we agree with the reviewer, and we will revise the sentence for better clarity.

[W3] Why assume that concept-based models require the use of every concept? I could be wrong, but I don’t think this is necessarily a need for concept-based models.

While this is not a strict requirement, concept-based models typically use all of the concepts they collect. Specifically, concept-based models first predefine the concept set $\mathbf{V}$ (for example, cheek, smile, and gender in CelebA), and use all of these predefined concepts in $\mathbf{V}$ to predict the label. In our paper, we point out that to ensure causal interpretability, one should take the causal relationships among the concepts into account when determining the concept set used for classification. We will revise the text for better clarity.

[Typos and such]

Thanks for the valuable suggestions! We will incorporate the reviewer’s comments into the revised manuscript, e.g., adding the definition of non-descendants and using the consistent terminology of causal interpretability-accuracy tradeoff.


We appreciate the meaningful discussion and would be happy to provide further clarification. Thank you!

Final decision

(a) Theoretical paper formulating when models can be causally interpretable with counterfactual queries, considering prior concepts like concept-based models. This results in a theorem characterising the result.
(b) Theoretical formulation of the problem and a theorem about a necessary condition. Technically solid.
(c) Limited empirical work of limited value. How does one use this theory in building better causal models? Definitions are heavy and need simpler explanation. Explanations given are minimalist and not always helpful.
(d) Two accepts and the overall notion seems useful, with good rebuttals. It's an interesting characterisation of the problem yielding insights into causality, but needs better explanations.
(e) Healthy discussion with reviewers. The one rejecting reviewer raised some good questions prompting strong confidential comments by the author, but I think better explanations in the paper would help.