PaperHub
Average rating: 4.3 / 10 (withdrawn; 4 reviewers; min 3, max 6, std 1.3)
Individual ratings: 5, 3, 3, 6
Average confidence: 4.0
Correctness: 2.3 | Contribution: 1.8 | Presentation: 3.0
ICLR 2025

Interpretable and Efficient Counterfactual Generation for Real-Time User Interaction

OpenReview | PDF
Submitted: 2024-09-28 | Updated: 2024-12-03

Abstract

Keywords
Explainable AI, Generative AI, Human-Machine Interaction

Reviews and Discussion

Review (Rating: 5)

This paper proposes a framework for counterfactual explanations based on a generative autoencoder. By refining the process of counterfactual selection, the proposed method effectively generates counterfactuals that fulfill a list of desired properties in real-time. The resultant explanations facilitate human-machine interactions and demonstrate the potential for improving user performance, as evidenced by the experimental results.

Strengths

  1. The paper is well-structured and self-contained, offering a clear definition of the objectives for counterfactual construction.
  2. The definition of counterfactual candidates is sound, significantly narrowing the search space and enabling real-time generation.
  3. The carefully designed experiment demonstrates the potential benefits of providing complementary information in human-machine interaction.

Weaknesses

  1. Similar to other generator-based explanation frameworks, the transparency of the explanation process itself is limited due to the black-box nature of the neural-network-implemented generator.
  2. The flexibility of the proposed method is another concern, as the delivered explanations appear to be model-specific.
  3. The expected counterfactual violates $\mathcal{P}_2$ stated in Definition 1.
  4. The benefit of the rotation for accelerating expectation computation is unclear.

Questions 1 and 2 detail the concerns mentioned in points 3 and 4 respectively.

Questions

  1. The definition of counterfactual candidates is well-motivated. However, recalling that the expected counterfactual is a weighted average of $\mathbb{S}_1$ and $\mathbb{S}_2$, the final result seems to deviate from both segments, thereby violating $\mathcal{P}_2$. This suggests that some counterfactuals are strictly better than the one selected. Could the authors provide clarification on how this should be interpreted?
  2. How does the rotation in Section 5.2 contribute to accelerating the computation of the expectation? Given two points $a$ and $b$, let $\mathbb{S}=\{(1-t)a + tb \mid t\in[0,1]\}$ be the segment connecting them, which is a one-dimensional object regardless of the dimensionality of the feature space. Finding the weighted average of $\mathbb{S}$ involves determining the expected position on the segment, which depends only on the variable $t$. An estimate can be acquired with a univariate Monte-Carlo estimator by interpolating between 0 and 1 for $t$. While the rotation in the paper appears to eliminate estimation variance in the aligned dimensions, it instead concentrates the variance in the final dimension, which is later redistributed to the others when the rotations are reversed.
  3. Could the authors elaborate more on sparsity? Line 376 says "the label-irrelevant generative factors are shared ensuring sparsity". According to Appendix C.1, $z_u$ accounts for only one-fourth of the total latent encoding $z$. With modifications applied to $z_s$, which constitutes 75% of the latent features, the resultant counterfactual seems to deviate from the intended sparsity.
  4. What do the different variants of $\mathbb{S}_1$ mean? Most of them are marked blackboard bold, with one exception at line 270 which is italicized. Some variants differ in the presence of the superscript $\mathcal{C}$.
  5. The experimental results are appealing. To support the claim that the human-machine interaction serves as "a training process for the participants", could the authors provide the accumulated accuracies of initial user predictions over the number of seen instances?
  6. Some parts of writing can be polished for clarity, for example:
    • Line 185 states "They propose to apply to the latent representation …" — it is unclear what is being applied.
    • The caption of Figure 6 reads “Labels are treated as a random variable to also sample.” — what does “also sample” mean?
Comment

Transparency The black-box nature of the classifier hinders transparency. We explicitly tackle this issue by leveraging human-understandable concepts. This approach allows users to understand when a machine makes a choice for a ‘right’ or ‘wrong’ reason, as concepts are explicitly stated in the explanation. This is further supported by evidence from our user study.

Flexibility Our approach is not model-agnostic. To apply our explanatory technique in the latent space, the minimum requirement is that the classifier uses a Gaussian-mixture loss. The work of [2] shows that this loss can be used to obtain performance equivalent to softmax-based classification scores for a wide variety of benchmark datasets and CNN architectures. With our method one can explain any DNN that uses this loss by creating a counterfactual in the latent space; to generate explanations in the input space, a decoding model for reconstruction is additionally required. In conclusion, even though our approach does not extend to DNNs with softmax classification layers, the requirement for implementing our approach is very simple. Other components of our framework, such as the label-relevant/label-irrelevant encoders, target specific desiderata of counterfactual explanations such as sparsity or validity, but the overall applicability of our method is not limited and does not imply worse classification performance.
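
For readers less familiar with this requirement, the sketch below shows what a Gaussian-mixture classification head can look like: class-conditional isotropic Gaussians with learnable means, in the spirit of [2]. The class and function names are ours and the loss is simplified, so this is an illustration of the kind of classifier the method assumes rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianMixtureHead(nn.Module):
    """Scores latent codes by their negative squared distance to learnable
    class means, i.e. class-conditional isotropic Gaussians with equal priors.
    Simplified illustration in the spirit of [2], not the paper's exact loss."""
    def __init__(self, latent_dim: int, num_classes: int):
        super().__init__()
        self.means = nn.Parameter(torch.randn(num_classes, latent_dim))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # Logit of class k is -||z - mu_k||^2 / 2, so the posterior is a
        # softmax over negative squared distances to the class means.
        return -0.5 * torch.cdist(z, self.means) ** 2

def gm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Cross-entropy on the Gaussian-mixture posterior; a likelihood term
    # pulling z towards its class mean could be added as in [2].
    return F.cross_entropy(logits, labels)
```

Under such a head, class regions in the latent space are neighbourhoods of the class means, which helps explain why searching for counterfactuals along segments towards a class mean is natural in this setting.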

Violation of $\mathcal{P}_2$ We are aware of this, and we added a proof in Appendix B that the deviation of the expected counterfactual from $\mathcal{P}_2$ is bounded and that this error is negligible.

Sparsity The role of $z_u$ is to encode generative factors that are shared across labels and should therefore be fixed when computing a counterfactual for a given user-specified label. Keeping part of the encoding unchanged improves the sparsity of the explanation. The reviewer rightfully notices that this is only 25% of the whole latent encoding and does not suffice to ensure sparse explanations. For this reason, we optimize the trade-off between sparsity and likeliness through the computation of the expected counterfactual in the latent space. Sparsity is therefore tackled in a two-fold manner: by keeping part of the latent encoding fixed and by explicitly optimizing for it in the formulation of our counterfactual-search problem.

Versions of $\mathbb{S}$ We apologize for the typos. We updated the text to highlight that the presence of the superscript indicates that property $\mathcal{P}_1$ is also satisfied (lines 273-281).

Training effects We added a new section in Appendix G to analyze this phenomenon in detail. Instead of plotting cumulative accuracies, we chose to plot cumulative errors, as this provides a clearer view of where and how frequently errors occurred. The results align with our claim, highlighting the presence of training effects among participants since most errors are made in the early stage of the user study.

Application of the Gaussian mixture loss The object of the sentence is the Gaussian mixture loss function that we introduce in the following line. We updated the text to make sure this is clearer (lines 177-180).

Sampling labels We updated the paper to make this clearer (line 1294). More precisely, we refer to a two-step sampling mechanism: first, labels are sampled according to a distribution (e.g., a multinomial); then, images are drawn from the conditional distribution of the sampled label.
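
For concreteness, a minimal sketch of this two-step sampling; the distribution choices and names are ours and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_image(class_probs, class_means, class_cov, decoder):
    """Two-step sampling: first draw a label, then draw a latent code from
    that label's conditional distribution and decode it. Illustrative only."""
    y = rng.choice(len(class_probs), p=class_probs)          # step 1: sample a label
    z = rng.multivariate_normal(class_means[y], class_cov)   # step 2: label-conditional latent
    return y, decoder(z)
```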

Benefit of rotations It is true that $t$ is a one-dimensional variable, but the suggested procedure still does not compute a univariate estimate: the densities need to be evaluated according to multivariate distributions. Nonetheless, one can estimate the expected position by interpolating and accumulating the densities at different values of $t$ so as to approximate the integral. The difference between this approach and the one we suggest is that the reviewer’s approach requires a ‘step’ hyperparameter to evaluate the densities, and the number of steps needed varies with the length of the segment. Overall, the accuracy of the estimate and the performance of this method depend on the length of the segment, which can vary greatly according to two factors: the margin of the classifier (a large-margin implementation may be used, as in [2]) and the class requested by the user (the user may ask for a counterfactual for a class that is distant in the latent space from the original instance). In conclusion, our approach guarantees competitive running times that are independent of the decision boundary learned by the model and of the counterfactual query, whereas the running times of the suggested alternative may be subject to significant variance.
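
To make the comparison concrete, the sketch below implements the interpolation baseline discussed in this exchange: the density-weighted average of points along the segment, evaluated at a fixed number of steps. This is our illustrative rendering of the reviewer's suggestion, not the rotation-based estimator used in the paper; `n_steps` is the ‘step’ hyperparameter whose required value grows with the segment length.

```python
import numpy as np
from scipy.stats import multivariate_normal

def expected_point_on_segment(a, b, mean, cov, n_steps=100):
    """Approximate the density-weighted average of the points on the segment
    from a to b by evaluating the multivariate Gaussian density at n_steps
    values of the interpolation coefficient t. Illustrative baseline only."""
    t = np.linspace(0.0, 1.0, n_steps)
    points = (1 - t)[:, None] * a + t[:, None] * b             # (n_steps, d)
    weights = multivariate_normal.pdf(points, mean=mean, cov=cov)
    return (weights[:, None] * points).sum(axis=0) / weights.sum()
```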

References

[1] Luss, Ronny, et al. "Leveraging latent features for local explanations."

[2] Wan, Weitao, et al. "Rethinking feature distribution for loss functions in image classification."

Comment

Dear Reviewer,

We would like to follow up to see if our response addresses your concerns or if you have further questions. We would really appreciate the opportunity to discuss this further and know whether our response has already addressed your concerns. Thank you again!

Comment

Dear Reviewer,

As the discussion period concludes tomorrow, we would appreciate your feedback on our responses to your comments. Please let us know if our answers resolved your concerns or if there are additional points that need addressing.

Thank you,

The Authors

Comment

Thank you for the detailed response addressing my concerns. The clarification on rotations brings insight into the computational efficiency gain, which should be interpreted together with the multivariate Gaussian distribution in the latent space. While the clarifications are appreciated, some of the concerns persist.

Transparency The authors claim that their approach tackles the transparency issue of the generator by showing human-understandable concepts. This claim is not entirely convincing. First, the derivation of the concepts requires additional efforts from human experts, raising questions about the scalability. Second, there seems to be a gap between the generator's actual behavior and the patterns inferred by humans through observations. Is there any mechanism to guarantee the derivation of faithful and truth-telling concepts?

Sparsity: I could not find a formal definition of sparsity in the paper, and therefore assume the common understanding in the literature, i.e., that sparsity implies altering a minimal subset of features (but please let me know if the context is different in this paper). Given this definition, could the authors elaborate on how their approach ensures sparsity? The computation of the expected counterfactual balances likeliness and closeness (in terms of latent-space distance), but its connection to sparsity remains unclear. If the segment $\mathbb{S}$ is not aligned with the latent-space axes (which is very likely to happen in a high-dimensional space), modifications will apply to all units in $z_s$ to reach the expected counterfactual. Since each latent dimension corresponds to one concept, this suggests that deriving the final counterfactual involves altering all concepts, which contradicts the definition of sparsity.

Also, I appreciate the authors' effort in visualizing users' cumulative errors in Figure 13, which brings further questions regarding the training effect. From my perspective, the training effect should manifest as users improving their performance over time due to the additional information provided, resulting in (relatively) concentrated errors in earlier stages of a task. This would presumably lead to a curve above the red line in Figure 13, indicating decreasing cumulative errors as users adapt. However, the presented data does not align with this expectation.

While taking the split at question 13 may support the authors' claim of a training effect, it is unclear why a burst in errors occurs after a certain period of training (between questions 8 and 13). Furthermore, the changing patterns between the "Label" and "Label+Explanation" conditions appear highly similar, raising questions about the origin of the claimed training effect.

Comment

Dear reviewer, thank you for engaging in the discussion; we are pleased we were able to address some of your previous concerns. With regard to the points just made, we address them below:

Scalability Our approach scales effectively to deeper and more complex architectures (Appendix C reports running times for different architecture sizes). In the limitations section of our manuscript we explicitly state that, given the need for human annotators, using compact latent spaces is very likely necessary (line 528). While in the context of our experiment this was not limiting, we consider improving the applicability of our approach a very important avenue for future research, given the promising results our method yields in the interactive setting. More precisely, our technique can be applied to large-scale models: for example, leveraging a latent diffusion model conditioned on the compact latent dimensions of the RAE, or on the RAE outputs, is one of the directions we consider exploring.

Transparency Without concept supervision, it is impossible to dictate which concepts the model encodes in its latent space. Instead, users infer patterns by explicitly observing the generator's behavior. In our approach (see Appendix E), users analyze latent traversals, observing how changes in a single latent dimension affect the generation while the other dimensions remain fixed. This implies that, if humans can correctly assign the conceptual changes to the corresponding latent dimensions, no gap should be observed between the generated instances and the human-inferred behaviour. We argue that any gap should rather be ascribed to the hyper-parameter controlling the number of concepts returned together with the explanation. In that regard, if the perturbations are very simple and actually correspond to a single or very few concepts, returning too many concepts could lead to descriptions of the counterfactual changes that are not faithful to the actual changes in the counterfactual image (this is because concepts are presented by mentioning exclusively the direction in which they are altered, not the magnitude of the change). We explicitly mention this in Appendix F of our manuscript. A potential solution to this issue, given a default number of concepts to be returned, could be to drop all the concepts whose relevance metric is below a certain threshold. This threshold hyper-parameter could be fine-tuned or user-specified. Alternatively, one could implement richer dictionaries taking into account the magnitude of the change at the concept level, although this increases reliance on annotators. For our experiment we decided to keep the number of concepts constant for each image in order to reduce the noise users were subject to during the interaction with the model. This value was set to 3, as most types of cells could be obtained with simple changes to the input image.
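
As an illustration of the latent-traversal procedure annotators use, the sketch below varies a single latent dimension while keeping the others fixed; the function and argument names are hypothetical and assume a decoder mapping latent vectors to images.

```python
import numpy as np

def latent_traversal(decoder, z, dim, values):
    """Decode copies of z in which one latent dimension is swept over a range
    of values while all other dimensions stay fixed. Annotators inspect the
    resulting images to name the concept encoded by `dim`. Illustrative sketch."""
    images = []
    for v in values:
        z_mod = np.array(z, dtype=float)
        z_mod[dim] = v
        images.append(decoder(z_mod))
    return images

# e.g. latent_traversal(decoder, z, dim=0, values=np.linspace(-3.0, 3.0, 7))
```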

Sparsity We thank the reviewer for the clarification as it allowed us to better understand the doubts regarding our manuscript. We realize that the term sparsity could be a bit misleading in this context. Our goal is to ensure that the counterfactual image and the input image are as similar as possible. To avoid any confusion, we now refer to this property as ‘proximity’ in the manuscript, and we clarified the assumptions behind the optimization of this property. With regard to concepts, even though some might be altered simultaneously to improve proximity of explanations, we specifically designed a concept relevance metric that allows us to infer which changes were most relevant to the counterfactual image. This allows the method to generate sparse explanations in terms of concepts being modified in the counterfactual.
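
Purely as an illustration of ranking concepts by how much they change between the factual and counterfactual codes (a stand-in for intuition, not the paper's actual relevance metric):

```python
import numpy as np

def top_k_changed_concepts(z, z_cf, k=3):
    """Rank latent dimensions (concepts) by the magnitude of change between
    the factual code z and the counterfactual code z_cf and keep the top k.
    Illustrative stand-in for a concept relevance score."""
    delta = np.abs(np.asarray(z_cf, dtype=float) - np.asarray(z, dtype=float))
    order = np.argsort(-delta)
    return [(int(i), float(delta[i])) for i in order[:k]]
```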

Training effects We updated the plot in the appendix by removing two outliers (instances correctly classified by all participants) to present a clearer pattern. From question 8 onward, the behavior expected by the reviewer becomes evident, with earlier differences likely obscured by the experiment's initial stages and task difficulty. The effect may also be mitigated by the limited number of questions and could be much more evident in a longer study. We did not go in this direction because we were concerned about the cognitive burden on users, and capturing such a phenomenon was not the main scope of our contribution. However, the plot clearly supports the claim of training effects. Additionally, the Label and Label+Explanation settings show similar patterns, as the main component of the training process evidently consists in providing users with a ground truth to assign to the images they see (the machine prediction is extremely accurate). Large differences cannot be seen because both settings provide this information. In conclusion, this aligns with our claim that training effects were present in both interactive settings.

Review (Rating: 3)

This paper focuses on the topic of generating visual counterfactual explanations for predictive models in computer vision. The proposed method is based on a Denoising Disentangled Regularized Autoencoder trained in a two-stage manner. The first stage deterministically trains an encoder-decoder architecture with the encoder split into two separate parts responsible for label-based and label-independent information. The second stage introduces stochasticity to the learning process and, after freezing the previously learned weights, trains an additional autoencoder on the combined latents. The introduced architecture is utilized for counterfactual explanation generation by encoding the factual image into its two-part latent representation, modifying the label-relevant part to identify candidate counterfactuals, computing the 'expected counterfactual', extracting the most important concepts based on a proposed metric, and decoding the modified label-based latent together with the label-irrelevant part to obtain the explanation. The work explicitly defines the properties of the mentioned counterfactual candidates and develops a theoretical result to more efficiently search for the best candidate. The proposed method is evaluated through a user study based on the BloodMNIST dataset with detailed analysis of the obtained results, showing how the proposed approach can guide humans to improved decision-making.

Strengths

S1. The authors provide a clear introduction and motivation to the problem addressed in their work.

S2. The method's overview clearly explains how specific components of the proposed architecture aim to address the mentioned limitations of previous works. The description of each component of the optimized loss is properly described. Overall, despite the complexity of the framework, the authors succeed in clearly communicating its inner workings using the attached figures and pseudocode.

S3. In addition to the practical side of the framework (two-step training procedure), the authors propose an interesting theoretical result to efficiently search for the 'optimal' counterfactual candidate in the model's latent space. I also enjoyed the introduced relevance scores for concept selection, Fig. 2 that nicely summarizes the candidate selection procedure and the attached pseudocode in Algorithm 1.

S4. The experimental evaluation is based on a properly designed user study with detailed analysis of the obtained results. Interesting research questions were proposed and the provided results were very well processed to provide principled answers.

S5. The overall writing is clear and carefully edited, except for some small typos, e.g., $S_1$ in line 270.

S6. Source code is provided as an anonymised repository to ensure reproducibility.

Weaknesses

W1. Both the abstract and the introduction mention specific limitations of previous works, specifically: efficiency, likeliness, interpretability, validity and sparsity. While the paper provides motivation on how each of the framework's components addresses those, I must argue that the paper struggles to provide empirical evidence for these claims. While some quantitative results are provided (e.g., average generation time in line 408), they are mainly focused on the influence of the method on human decision-making and do not address the limitations. Also, no comparison with previous work is given. The above measures could be quantified using, e.g., some variation of FID [5] and S3 [2,6] (likeliness, validity), and COUT [7] for sparsity.

W2. While the paper is well-written, I got confused about how the authors position their work in the XAI domain, its connection to counterfactual explanations for deep learning models as post-hoc explanations, and their required properties. To the best of my understanding, the proposed method is not able to provide post-hoc explanations for any predictive model other than the Gaussian-mixture-based classifier trained together with the other components. This is problematic since the described limitations are mentioned as problems of generally applicable methods, and their relationship with the authors' work is unclear.

W3. The authors mention real-time generation as one of the main contributions of their method. However, the paper lacks quantitative results comparing the proposed approach to any of the previous works in this context, making it unclear whether the improvements are only incremental or actually ground-breaking. While the method might indeed be an effective real-time generator, it is not clear how this relates to previous works which address a more general class of models (see W2.).

W4. The paper mainly cites works from the 2018-2022 period. While I do not consider myself a highly-educated expert in either the topic of contrastive explanations, deterministic regularized autoencoders or latent disentanglement, I cannot escape the feeling that there are more recent papers that could be mentioned in these contexts. This is not an explicit weakness of the paper, but I would be happy if the authors could address why more recent works do not appear in the literature section (has the field somehow slowed down or has the community lost interest in it?). Another concern of mine is the specific connection of this work to the topic of contrastive explanations - I must argue that mentioning specific works on visual counterfactual explanations (VCEs) would better reflect the paper's connection to the field. In this case, the authors fail to mention a large amount of work from recent years combining generative models and VCEs, e.g., [1,2,3,4] to name a few. How does the authors' paper place itself in the context of these works?

W5. From a theoretical point of view, the paper often provides very strong claims, such as "the non-existence of a strictly better counterfactual" (lines 259-260) and "such counterfactual intrinsically optimizes the trade-off between the likelihood of the explanation and the distance from the instance to explain" (lines 281-283), which are true only when assuming that the Gaussian mixture model perfectly preserves the relationships between the samples from the original data distribution. This should be stated explicitly in the paper, and I would like the authors to further elaborate on the assumptions and limitations of this approach.

W6. Another contribution mentioned by the authors is the possibility of extracting interpretable concepts associated with the latent dimensions of the proposed architecture. It should be stated clearly that the identification of these concepts requires a human expert who properly labels them. For example, to the best of my understanding, examples like those in Fig. 10 include descriptions of concepts that were first labelled by the expert. In terms of experiments, I think that the influence of these concepts on human decision-making has not been properly studied, making it difficult to disentangle their contribution to the overall process.

W7. To maintain the review's structure, I will include my question regarding the theoretical derivations here. I must stress that these are very detailed and I was generally satisfied with them. My concern is the phrasing from line 850, where it is stated that "the expected value, according to an isotropic Gaussian, of the elements in a segment $S$ (...)". Could the authors clarify whether this sentence assumes that elements in $S$ follow an isotropic Gaussian distribution? To the best of my understanding, $S$ forms a bounded interval since $S = \{(1-t)\cdot a + t\cdot b \mid t\in[0,1]\}$, hence its elements cannot follow a Gaussian distribution, which assumes infinite support.

[1] Jeanneret et al., Diffusion Models For Counterfactual Explanations, ACCV 2022

[2] Jeanneret et al., Adversarial Counterfactual Visual Explanations, CVPR 2023

[3] Augustin et al., Diffusion Visual Counterfactual Explanations, NeurIPS 2022

[4] Boreiko et al., Sparse visual counterfactual explanations in image space, DAGM 2022

[5] Heusel et al., GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium, NIPS 2017

[6] Chen et al., Exploring simple siamese representation learning, CVPR 2021

[7] Khorram and Fuxin, Cycle-Consistent Counterfactuals by Latent Transformations, CVPR 2022

Questions

I would be happy if the authors could address each specific weakness mentioned above. In general, I would say that the paper has great potential but mentions contributions that are clearly not addressed. My suggestion for the authors would be to refocus the text on the human-machine interaction, since this is the most promising result, and to move away from the counterfactual explanation domain. It is difficult to be convinced that the proposed framework is in fact a counterfactual explanation generator if it can only provide explanations for a model trained inherently within the framework, which is a very simple Gaussian-mixture-based classifier. Both the experimental design and the theoretical derivations are very elegant, and I encourage the authors to focus on those with an extended empirical evaluation. Overall, the paper shows great promise that it might be worth training the framework from scratch for each new problem, since it may greatly improve the understanding of complex domains by inexperienced users.

Comment

Empirical evidence We added a quantitative evaluation of our method with FID, COUT and S3 in Appendix C. We compare with the competing approach of [1], as it is the only other counterfactual generating technique we are aware of that leverages concepts without supervision. As an ablation study, we also compare against a variant of our method that simply returns a point on the segment connecting the instance to explain and the counterfactual class mean, for different model-confidence values. Our technique has comparable performance to our competitor while being substantially more efficient.
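
For clarity, the ablation baseline mentioned above can be sketched as follows. The code is ours and illustrative: it returns the point on the segment between the instance and the counterfactual class mean at which a given target-class confidence is first reached, assuming the confidence is monotone along the segment.

```python
import numpy as np

def segment_point_at_confidence(z, mu_target, confidence_fn, c=0.8, iters=40):
    """Bisection over the interpolation coefficient t to find the point on the
    segment from z to the target-class mean mu_target at which
    confidence_fn(point) >= c. Illustrative sketch of the ablation baseline."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        t = 0.5 * (lo + hi)
        point = (1 - t) * z + t * mu_target
        if confidence_fn(point) >= c:
            hi = t
        else:
            lo = t
    return (1 - hi) * z + hi * mu_target
```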

Applicability Our approach is not model-agnostic. To apply our explanatory technique in the latent space, the minimum requirement is that the classifier uses a Gaussian-mixture loss. The work of [2] shows that this loss can be used to obtain performance equivalent to softmax-based classification scores for a wide variety of benchmark datasets and CNN architectures. With our method one can explain any DNN that uses this loss by creating a counterfactual in the latent space; to generate explanations in the input space, a decoding model for reconstruction is additionally required. In conclusion, even though our approach does not extend to DNNs with softmax classification layers, the requirement for implementing our approach is very simple. Other components of our framework, such as the label-relevant/label-irrelevant encoders, target specific desiderata of counterfactual explanations such as sparsity, but the overall applicability of our method is not limited and does not imply worse classification performance.

Running times We added quantitative experiments to Appendix C, where we also evaluate running times. We compare our method with the competing approach of [1]. Results show that our approach has superior performance across various architecture complexities.

Related work We extended the related work to consider more recent and relevant approaches. Our method differs from the mentioned papers because it does not leverage knowledge of a causal graph (a requirement for the other works). In addition, interest in the high-quality image generation of diffusion models has led researchers to leverage them for counterfactual explanations. Although such approaches are able to generate realistic counterfactual images, the resulting counterfactuals lack transparency, which is a crucial component of our framework.

Strictly better counterfactual Our optimization process focuses exclusively on the latent space, meaning its effectiveness in the input space depends on the preservation of distances. While this reliance is a limitation, strong reconstruction and classification performance suggest that the assumption is realistic. If distinct inputs were mapped too closely in the latent space, neither reconstruction nor classification would function effectively. We updated the text to make this clearer to future readers (line 245).

Concept labelling Indeed, learned concepts require a human annotator for labeling. We updated the paper to explicitly mention this in the main text (line 354) and in the dedicated section of the Appendix: Appendix E.

Effect of concepts In our study, we directly evaluated the approach that includes both the counterfactual image and the descriptive concepts as this is the setting that is most informative for users. To keep the number of experimental conditions as low as possible we did not specifically evaluate the effect of concepts on explanations. However, we recognize the importance of this research question and plan to explore it thoroughly in a future journal version of our work.

Expectation along a segment We do not assume that points in the segment $S$ are normally distributed. Rather, $S$ is a collection of points in the space $\mathbb{R}^d$, and points in $\mathbb{R}^d$ follow a normal distribution; we are interested in computing the expected value of the points that belong to $S$. We updated the paper to make this clearer (line 925).
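
In symbols, our reading of the quantity being discussed is the density-weighted average of the points on the segment, with the weights given by the latent-space Gaussian density $p$ over $\mathbb{R}^d$ rather than by a distribution supported on the segment itself (this paraphrases the discussion and is not necessarily the exact formula of Appendix B):

```latex
% Expected value of the points of S = {(1-t)a + tb : t in [0,1]} under the
% latent Gaussian density p, restricted to and renormalized over the segment.
\mathbb{E}\left[z \mid z \in S\right]
  = \frac{\int_{0}^{1} \big((1-t)a + tb\big)\, p\big((1-t)a + tb\big)\, \mathrm{d}t}
         {\int_{0}^{1} p\big((1-t)a + tb\big)\, \mathrm{d}t}
```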

References

[1] Luss, Ronny, et al. "Leveraging latent features for local explanations."

[2] Wan, Weitao, et al. "Rethinking feature distribution for loss functions in image classification."

Comment

Dear Reviewer,

We would like to follow up to see if our response addresses your concerns or if you have further questions. We would really appreciate the opportunity to discuss this further and know whether our response has already addressed your concerns. Thank you again!

Comment

Dear Reviewer,

As the discussion period concludes tomorrow, we would appreciate your feedback on our responses to your comments. Please let us know if our answers resolved your concerns or if there are additional points that need addressing.

Thank you,

The Authors

Comment

Thank you for addressing my comments. While I am satisfied with some of them, I will need some further clarifications before my final decision.

After the authors positioned their work among some more recent approaches, my main concerns are connected with: the 'concepts without supervision' claim, the limitation to Gaussian-mixture-loss-based models, and the connection to related works.

'Concepts without supervision' claim Adding a specific definition of 'concept' would greatly benefit the paper's contribution, as it is now unclear to me whether the claim about no concept supervision is actually true. Importantly, the method utilizes 'label supervision' (Eq. 2, lines 153, 176) for the Gaussian-mixture-based classification model. Note that the notions of labels and concepts sometimes fully overlap. For example, the CelebA dataset provides labels that identify concepts of faces, such as the presence of a smile or age. Moreover, when following the above understanding of concepts, many current approaches for counterfactual explanations are based on unsupervised generative models (like diffusion models) that do not utilize any labels (contrary to the authors' approach) during training or inference, making them also concept-independent. Lastly, it might be misleading to refer to no supervision of concepts, since the authors explicitly claim that human supervision is required in extracting concepts from their approach. Clarifying these points would greatly improve my understanding of the paper's contributions.

Connection to related works Following the authors' response related to diffusion-based approaches for counterfactual explanation generation, I am not convinced that the evaluation is performed fairly. Following the reasoning above, which assumes that methods based on generative models also do not utilize concept supervision, it is unclear to me why these approaches are not compared to the authors' algorithm. Are they not applicable to the considered scenario? While the authors mention that these algorithms lack transparency in the generated explanations, I cannot find justification for its presence in the proposed method, as transparency is not defined anywhere. Note that these methods are typically evaluated using much larger and more complex datasets than BloodMNIST, e.g., CelebA, CheXpert, and ImageNet, to mention a few examples. Is the authors' method also applicable to these cases? If yes, how does it compare to the current state of the art? If no, what are the actual reasons for that? Depending on the justification, the main claims of the paper should be properly modified. For example, claiming that user interaction is possible in real time might be true for BloodMNIST, but will it be generally true for other, larger datasets?

Gaussian-mixture-loss-based models In my honest opinion, the paper should mention the above limitation very explicitly, but at the moment it does not mention it at all. As stated, the approach is only applicable to a very specific subclass of DNNs. Moreover, the paper does not show that any 'external' model (even from this subclass) can actually be incorporated into the pipeline in a post-hoc manner, since the introduced framework trains the classification model together with the autoencoder from scratch.

Please note that I truly appreciate the authors' efforts. However, the new comments stem mostly from what I initially tried to convey in the Questions section of the review. Referring to them would also greatly help me in making the final decision. Lastly, please excuse me for providing the answer at the last moment. I will take this into consideration before deciding on my final score, as it obviously limits the discussion.

Comment

We would like to thank the reviewer for their appreciation and willingness to take our replies into consideration. In the following we provide additional clarifications for the remaining concerns.

Concepts without supervision claim In the context of our paper, we consider a latent dimension $z_i \in \mathcal{Z}$ which, when associated with a decoding function $\text{DEC}(\cdot)$, is a concept if the pair $(\text{DEC}(z_{i,1}, z_{\setminus i}), \text{DEC}(z_{i,2}, z_{\setminus i}))$ can be understood by an end-user as an atomic change of a property of the input (in the context of BloodMNIST, an example could be that changing the value of the first latent dimension while keeping the others intact leads to a change in the size of the cell).

As the reviewer correctly notices, label supervision is used by our model. In our setting we clearly distinguish between labels and concepts. Labels represent classes, or in the case of BloodMNIST the cell type. Concepts, on the other hand, are associated with the $z_i \in \mathcal{Z}$ of our model, which are the latent representations. Our approach is unsupervised with regard to concepts, as during training our model has no supervision about which concepts to encode in the learned representations. Some datasets like CelebA provide additional supervision which can be used to guide the training process, but our technique does not leverage this information, and we drop this requirement because most real-world datasets do not provide such information. In addition, approaches that leverage diffusion models may be concept-independent in the sense that they do not require concept supervision in order to be trained, but we are not aware of any work that exploits diffusion models to provide concept-based explanations or that associates human-interpretable concepts with the extracted explanations. We additionally rephrased the description (line 253) to make it clearer that, while the training process is done without any concept supervision, after training the latent representations are associated with concepts in a post-hoc fashion by a human annotator.

Connection to related works First, when talking about transparency we meant interpretability of the counterfactual for a user (we replaced transparency with interpretability in the paper, line 86). Our work crucially relies on concepts (annotated post-hoc after training) as interpretable explanations for the counterfactual being provided. We are not aware of any approach with diffusion models that leverages human-understandable concepts to improve the interpretability of explanations. Concerning the choice of the dataset, we focused on BloodMNIST not for computational reasons, but because it is a challenging yet still feasible task for a non-expert user. CelebA and ImageNet are too simple for users, while CheXpert is too difficult for laypersons. Our approach can easily scale to large architectures and datasets, as our counterfactual search exclusively depends on the size of the latent space. Indeed, Appendix C reports running times for increasing architecture depths, confirming the feasibility of the approach.

Gaussian-mixture-loss-based models We added in the limitations section of our paper the fact that our approach requires a Gaussian mixture loss (line 521-526). In our opinion, this limitation is not dramatic as it only affects the training loss, which can nonetheless be applied to arbitrary DNN architectures. While external models cannot be used as-is, they could in principle be incorporated by fine-tuning them using the Gaussian mixture loss. We clarified this aspect in the revised version of the manuscript.

We look forward to further discuss the topic or any additional doubts of the reviewer.

Comment

Thank you for addressing my comments. Now, I feel better informed regarding how the paper approaches the notion of concepts and its relationship to approaches based on other generative models like diffusion models. However, the above responses still do not resolve certain issues.

Scalability. I understand the motivation behind evaluating the method on BloodMNIST. However, simply stating that the approach can scale to larger architectures and datasets based on running times for increasing architecture depths is not enough. For example, there are many reasons why VAEs are not (exclusively; they can be combined with latent diffusion models, for example) the main method used in today's image generation pipelines, e.g., posterior collapse. In the above, the authors make an implicit assumption that the latent space is able to 'handle everything' and only its size must be controlled. However, there are many reasons to think that, for example, concept identification would not be possible on complex data like ImageNet. Are there any examples in related works showing that specific latent dimensions of VAEs encode easily identifiable concepts on data like ImageNet? In general, I am very skeptical regarding the statement that this approach can easily scale. While I do not claim that every method must work with every level of data complexity, this issue must be clarified here, either by proof with results on bigger and more complex data, or by explicitly saying that this approach is meant for simpler datasets.

Generalization to independent models. Once again, I would be very careful with stating that the approach is able to handle models that are not trained together with the pipeline, since no empirical proof is given. The limitation connected to the loss might not be dramatic, but it requires verification if the authors want to claim that this method is applicable to arbitrary DNN architectures trained through a Gaussian mixture loss. Also, the paper's contributions remain limited if no external models are incorporated, since then the entire pipeline reduces to a solution that does not actually provide counterfactual explanations, but rather counterfactual examples only. This is fine if the goal of the method is concentrated on the human-machine interaction, but not enough to state that it provides a general method for counterfactual explanation generation.

Labeled data requirement. The authors mention that their approach is "unsupervised with regard to concepts as during training our model does not have supervision with regard to which concepts to encode in the learned representations. Some datasets like CelebA provide additional supervision which can be used to guide the training process but our technique does not leverage this information and we drop this requirement because most real-world datasets do not provide such information." I think that here the relationship between concepts and labels is once again very vague. What is the additional supervision that can be used to guide the training process in, e.g., CelebA, that the authors do not leverage? If I understand correctly, the authors require labeled data for their approach to be trained. Hence, the reliance on additional supervision is there. Note that this has a direct influence on the structure of the latent space (even presented pictorially, Figure 1(b), upper part) and the 'knowledge' gained by the autoencoder.

Comment

Dear reviewer, thank you for the clarification with regard to the still standing issues. Below we tackle the points made:

Scalability Our counterfactual search process is scalable with respect to architecture depth, as it is independent of the architecture size. Also, in the limitations section of our manuscript we explicitly state that, given the need for human annotators, using restricted latent spaces is very likely necessary (line 528). While in the context of our experiment this was not limiting, we consider improving the applicability of our approach a very important avenue for future research, given the promising results our method yields in the interactive setting. More precisely, our technique can be applied to large-scale models: for example, leveraging a latent diffusion model conditioned on the compact latent dimensions of the RAE, or on the RAE outputs, is one of the directions we consider exploring. It is worth mentioning that, as our approach is centered around interpretable concepts, such larger models are required to support concept extraction. This may not always be possible, as the reviewer correctly notices, and we specify this in line 531 of our paper. In addition, in order to obtain real-time generation, the underlying generative model should guarantee fast generation. Notice that, if this is the case when leveraging conditional LDPMs, we can still optimize the counterfactual search directly in the latent space. This allows us to generate counterfactuals with a single conditioned generation of the LDPM, ensuring an efficient explanatory mechanism.

Generalization to independent models The focus of our proposal is indeed a framework for interactive classification, in which a machine learning model is trained to perform classification and be amenable to counterfactual generation. Our approach is thus not a general purpose post-hoc counterfactual generation method. The post-hoc approaches we mention in the related work are meant to clarify why existing solutions are not appropriate for our interactive classification setting, namely the lack of real-time performance and concept-based explanations. While adapting an external model to generate interpretable counterfactuals should be feasible in principle (keeping in mind the concerns on real-time execution and quality of the concepts), this is not the main focus of our contribution. We better clarified the focus of our work in the abstract and introduction.

Labeled data requirement Our approach leverages class labels to solve the classification task (e.g., cell type). Most real-world datasets provide exclusively class-label information. CelebA, on the other hand, provides multiple labels per image which refer to the presence of a specific attribute or concept (e.g., glasses, smile, wrinkles). These can be used to guide the learned representations of a model (the $z_i$) to encode specific concepts. Our approach encodes concepts without any such supervision; that is, it does not leverage the above-mentioned information about the presence or absence of certain attributes (concepts). In conclusion, our model is supervised at the label level (e.g., class information is needed to distinguish between cell types) but unsupervised at the concept level (no information about the attributes of the image is required).

Please let us know if there is any additional information you require us to provide, and thank you again for engaging in the discussion and for the helpful feedback.

Review (Rating: 3)

This work proposes a new counterfactual explanation technique for image classification. The technique uses a regularized latent space model and searches for suitable counterfactual candidates in the learned latent space, which should have favorable interpretability characteristics. The technique is evaluated through a human-subject study.

Strengths

Strengths:

  • Overall, the technique is well explained and the write-up is easy to follow
  • I did not discover any major flaws regarding soundness
  • User evaluation is important, and not often considered in XAI

Weaknesses

Weaknesses:

  • Related Work. Unfortunately, there are many related techniques for counterfactual image generation that are not discussed in this work. In general, the idea of using latent-variable models to generate counterfactuals cannot be considered novel and is well covered in the literature, e.g., by Sauer & Geiger (2021). A recent work (Melistas et al., 2024, Section 2) lists more than 10 approaches to tackle the problem presented using VAEs, GANs, Deep-SCM, Diffusion models, Flows, ... Unfortunately, these related works are not mentioned here. It is not clear why they are insufficient and yet another method to tackle this problem is required.

  • Grounding for disentanglement claims. The authors claim that their method yields interpretable, disentangled representations. However, while regularization can help disentanglement in practice, it should be noted that there is no theoretical backing for this claim (unless some rigid assumptions or knowledge of causal models is assumed). For instance, Locatello et al. (2019) prove that disentanglement without additional information is impossible, and Leemann et al. (2023) study the topic for conceptual explanations. It is therefore questionable whether the mentioned trade-off between disentanglement and reconstruction quality really exists. The references given (e.g., BetaVAE and its derivatives) do not reflect the current state of research.

  • Evidence for claims. It is okay to make claims like the one in Section 6 (l. 406-407, "this is the first unsupervised concept based counterfactual generating technique suited for a real time interaction"), but this requires evidence to back them up. For instance, at this point I would have expected a run-down of the runtimes of other CFE techniques for images when using encoders/decoders of the same complexity.

  • Evaluation is insufficient and qualitative results are not convincing. I think a user study is a good start for evaluation, but it is not sufficient on its own. I think other metrics for image quality such as FID and edit distance (in input and latent space) should be checked and reported as well, in particular in contrast to other techniques. Unfortunately, the counterfactuals shown in the figure look blurry and not like realistic scans. Disentanglement claims should be checked using synthetic datasets with known concepts such as 3DShapes (https://github.com/google-deepmind/3d-shapes).

  • Ablation studies are missing. There are no ablation studies that allow one to verify the necessity of each component in the framework. For instance, I am wondering whether the complex calculation of the mean is necessary or if some point at a specific distance behind the decision boundary, on the segment from the input embedding to the counterfactual class embedding, would be sufficient.

  • Accuracy considerations. The work proposes to use a specific generative image classification model that allows to directly generate counterfactuals. However, I think the accuracy of this model will be lower than that of state-of-the-art models. This trade-off is not discussed.


Summary. Unfortunately, I don’t think the technique developed is highly innovative, and an evaluation against competing techniques is missing. If there is a specific advantage of the technique that I am missing, I suggest that a comparative analysis with the techniques in Melistas et al. (2024) should be added to show this advantage. In its current state, the motivation for why the existing techniques are insufficient for the counterfactual generation problem is not clear at all.


References

Locatello, Francesco, Stefan Bauer, Mario Lucic, Gunnar Raetsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. "Challenging common assumptions in the unsupervised learning of disentangled representations." In international conference on machine learning, pp. 4114-4124. PMLR, 2019.

Leemann, Tobias, Michael Kirchhof, Yao Rong, Enkelejda Kasneci, and Gjergji Kasneci. "When are Post-hoc Conceptual Explanations Identifiable?" In Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence (UAI), PMLR 216:1207-1218, 2023.

Sauer, Axel, and Andreas Geiger. "Counterfactual Generative Networks." International Conference on Learning Representations, 2021

Melistas, T., Spyrou, N., Gkouti, N., Sanchez, P., Vlontzos, A., Papanastasiou, G., & Tsaftaris, S. A. (2024). Benchmarking Counterfactual Image Generation. arXiv preprint arXiv:2403.20287.

Minor points: There are a couple of issues with the writeup

  • Please check capitalization of bullet points in lines 122-132.
  • L.141-152 is hard to follow.
  • Typo L.176 (caption): “regularize”
  • L. 215 “DDPM” is not introduced
  • L. 344 Concept-based (section title)
  • L. 346 class-relevant
  • When you refer to the appendix, please include a link to the exact section or figure (e.g. l. 347)
  • L. 410 hyper-parameter configuration
  • Table 1: Please use the same number of digits for each result
  • I noticed that on page 27 of this submission (Figure 13), it seems to be indicated that the study was conducted at the University of Trento, potentially revealing the affiliation of the authors and thereby violating the double-blind review principle.

Questions

  • User study: I have some questions regarding the user study: Was the study IRB approved? Was the study preregistered? The number of 50 participants divided over multiple conditions seems rather low, how was the number chosen? A survey by Rong et al. (2022) shows the average number of participants in XAI user studies with a between-subjects design to be greater than 300.

  • The study relies on a specific classification model which performs the classification through a regularized latent space. What are the costs of explainability here, i.e., what is the performance difference of this model (91% accuracy is reported) vs. using a state-of-the-art black-box model that is trained on the dataset without any constraints?

  • Technical Derivation: In Equation (9), how is the formula for the weights determined? If one is interested in the expected value, shouldn't the weight of each segment be the integrated density over the segment? Here it seems that only the density at the respective center is used. Suppose we have a constant density; then the length of the segment would not play a role in the weight. Is this intentional? (See the symbolic comparison below.)
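
In symbols, the distinction being asked about is the following (our notation, not the paper's Eq. (9)): for a segment $S_i$ with center $c_i$ and length $\ell_i$,

```latex
% Density evaluated at the segment's center vs. density integrated over the segment.
w_i^{\mathrm{center}} = p(c_i)
\qquad\text{vs.}\qquad
w_i^{\mathrm{int}} = \int_{S_i} p(x)\,\mathrm{d}x \;\approx\; p(c_i)\,\ell_i
% Under a constant density, the integrated weight scales with the segment length
% while the center-density weight does not.
```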

Reference

Rong, Yao, Tobias Leemann, Thai-trang Nguyen, Lisa Fiedler, Peizhu Qian, Vaibhav Unhelkar, Tina Seidel, Gjergji Kasneci, and Enkelejda Kasneci. "Towards human-centered explainable AI: user studies for model explanations.", IEEE Transactions on Pattern Analysis and Machine Intelligence (2022)

Comment

Identity breach in screenshot in appendix We apologize for the inconvenience, we did not realize this was part of the screenshot. We removed it from the revised version of the manuscript.

Minor points We addressed the reviewer's points and made the requested changes.

Related work We extended the related work to consider more recent and relevant approaches. Our method differs from the mentioned papers because it does not leverage knowledge of a causal graph (a requirement for the other works). In addition, interest in the high-quality image generation of diffusion models has led researchers to leverage them for counterfactual explanations. Although such approaches are able to generate realistic counterfactual images, the resulting explanations lack transparency, which is a crucial component of our framework.

Running times We added quantitative experiments to Appendix C, where we also evaluate running times. We compare with the competing approach of [1], as it is the only other counterfactual generating technique we are aware of that leverages concepts without supervision. Results show that our approach has superior performance across various architecture complexities.

Accuracy The work of [2] shows that this loss can be used to obtain classification performance equivalent to softmax-based classification scores for a variety of benchmark datasets and CNN architectures. In addition, the choice of a very simple architecture is due to the very simple input domain, but much more complex architectures can be leveraged in our framework.

Evaluation We added to Appendix C a quantitative evaluation of our method with FID, COUT and S3. We compare with the approach of [1] and with returning, as the explanation, the point on the segment connecting the instance to explain and the counterfactual class mean for a given model-confidence value. As an ablation study, we experiment with confidence values of 0.6, 0.8 and 0.9.

Image quality Images are blurry due to the reconstruction performance of the model, which must leverage restricted latent spaces because of the concept extraction technique. We argue that, even though blurry, the counterfactuals were interpretable and actionable, as the user study results show.

Disentanglement Label disentanglement is what is required for the approach to correctly generate valid counterfactuals. Latent disentanglement is also important in our framework, because it allows the method to extract clean and independent concepts that greatly improve the interpretability of our explanatory technique. We are aware of the lack of theoretical guarantees for unsupervised latent disentanglement; our approach simply encourages it via latent regularization. We updated the manuscript to clearly distinguish between label and latent disentanglement, and we explicitly mention in the related work section the negative theoretical results about unsupervised latent disentanglement.

User study The number of 50 participants is not divided over the 3 conditions. Each condition was instead studied with 50 participants for a total of 150 participants in the study. Furthermore, the study does not require IRB approval in line with the ethical guidelines of our institution. Specifically, we assessed the risk of our study using a survey designed by our institution for this purpose. The assessment yielded a minimal risk level, which confirmed that IRB approval was not necessary.

Technical derivation We added the mentioned technical derivation in Appendix B, where we show how the weights are derived, how the expected-value computation is carried out, and why we implement our methodology to estimate it.

References

[1] Luss, Ronny, et al. "Leveraging latent features for local explanations."

[2] Wan, Weitao, et al. "Rethinking feature distribution for loss functions in image classification."

Comment

Dear Reviewer,

We would like to follow up to see if our response addresses your concerns or if you have further questions. We would really appreciate the opportunity to discuss this further and know whether our response has already addressed your concerns. Thank you again!

Comment

Dear Reviewer,

As the discussion period concludes tomorrow, we would appreciate your feedback on our responses to your comments. Please let us know if our answers resolved your concerns or if there are additional points that need addressing.

Thank you,

The Authors

Comment

My apologies for the late reply.

I have checked the rebuttal. I think the related work section has improved; thanks for adding the discussion on identifiability. However, while some of the works do indeed require a causal graph, many of the references in the benchmarking paper do not require such background knowledge. In my opinion, only building an "interpretable" latent space (without theoretical guarantees) is insufficient to justify novelty. While adding a comparison to [1] is a first step, I don't think the work represents the state of the art in counterfactual explainability (judging from my experience and reading of the benchmarking paper by Melistas et al. mentioned in my review).

Looking at the other reviews, I agree with reviewer 3rie that the evidence for the insufficiency of the related methods (of which there are many) is not compelling enough.

My key suggestions to improve the paper thus are as follows:

  • Start from state-of-the-art methods and identify deficiencies: I agree with reviewer 3rie on the point that many references and methods are a bit outdated. Instead of relying on classical latent-space models, the authors should turn towards more modern diffusion models etc. I advise the authors to look closely at these methods and uncover what real practical challenges still need to be solved. The paper should start with convincing evidence for the insufficiency of the state-of-the-art models.

  • Communicating limitations: I also realized late while reading that this approach proposes an interpretable model instead of applying a CF generator post-hoc. I think this should be communicated earlier, along with the performance characteristics.

  • The success of the method stands or falls with the interpretability characteristics of the latent space, for which no theoretical guarantees exist. I don't know whether it is generally a good idea to rely on such a framework in safety-critical applications.

I still cannot recommend acceptance of the manuscript in its current form, but I hope that some of these suggestions help the authors revise their work and resubmit it to a suitable venue.

Review
6

The paper generates interpretable counterfactual images in real time by leveraging a disentangled, regularized autoencoder for labels and instances, making it more accessible for a HITL approach. The approach appears to be theoretically rigorous, using disentanglement and latent space regularization efficiently for counterfactual sampling. The method is theoretically robust, with well-founded training and selection mechanisms supported by rigorous yet somewhat obvious proofs (so the theoretical contribution is limited). While the experimental methods are robust, including additional datasets would enhance statistical generalizability and validate the findings. The paper is well-written, with effective visuals and a logical flow. Some additional annotations on the figures would make them more accessible. The framework has some limited potential to impact AI explainability in real-time decision-making, particularly in human-centered applications, but some points still need to be clarified (see points below). It addresses a gap by making counterfactual explanations feasible in interactive settings. Real-world deployment could face challenges due to computational demands and the complex training setup. Furthermore, the dependency on a well-defined latent space might limit the framework's adaptability to highly complex or noisy data, which might restrict real-world deployment. Some suggestions for improvement:

  • Expand the empirical evaluation with a broader range of datasets to improve robustness and generalizability.

  • Clarify the "100% validity" claim with a more nuanced discussion of potential limitations (might be redundant if the math already proves 100% validity).

  • Conduct quantitative comparisons with other methods to offer a clearer perspective on relative strengths and weaknesses.

Strengths

The approach has some potential to enhance counterfactual generation by balancing efficiency and interpretability, both of which are essential for real-time, human-centered applications. A user study indicates that the generated counterfactuals may enhance human task performance, which would be valuable in practical settings. The methodology is well-structured and includes clear descriptions of each step in the counterfactual generation pipeline. It seems that the user study is scientifically valid and includes appropriate metrics. With a generation time of about 1.2 seconds, the framework seems ready for real-time, interactive AI applications. The authors claim that their framework is the first of its kind.

Weaknesses

The evaluation is limited to a single dataset (BloodMNIST) and task, which may impact the method's broader applicability. A wider evaluation would give a better sense of its utility across different contexts. The claim of “100% validity” might be overconfident; high-dimensional edge cases might present challenges to this level of accuracy.

Questions

- Could the authors clarify how the "associated concepts" at line 348 are identified? Does the framework directly offer human-comprehensible concepts for latent features? If so, how is this achieved?

- What does "unsupervised" in line 406 mean? It seems that training the framework requires a large amount of labeled data, which contradicts the claim of being "unsupervised".

- Could the authors elaborate more on why the decoder $\mathrm{DEC}$ is not suitable for generation? Both $\mathrm{ENC}_s$ and $\mathrm{ENC}_u$ learn distributions for the latent space, so sampling a point from a specific Gaussian should generate a synthetic instance.

- The denoising serves to shape the latent space structure; couldn't it be applied during the training of the autoencoder? The decoder should be able to handle noise at an appropriate level, which would make the auxiliary model somewhat redundant.

Comment

Limited evaluation We added to Appendix C a more thorough quantitative evaluation of our approach. We compare with the competing approach of [1], as it is the only other counterfactual generation technique we are aware of that leverages concepts without supervision. Our technique has comparable performance to the competitor while being substantially more efficient.

100% validity By validity we mean that the explanation is classified by the model as the class requested by the user. In that regard, counterfactual candidates are valid by definition, as only points on the side of the decision boundary associated with the query class are considered. To further strengthen this, expectations are computed by sampling from the conditional counterfactual label distribution, making it impossible to obtain explanations that are not predicted as the query class. High-dimensional latent spaces are not considered because the concept-extraction mechanism relies on compact latent spaces.
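For illustration, a simple validity filter and final check could look like the following sketch; the helper names and the toy classifier are assumptions, not part of the method:

```python
import numpy as np

def filter_valid(points, predict_fn, query_class):
    """Keep only candidate points that the model classifies as the query class."""
    keep = np.array([predict_fn(p) == query_class for p in points])
    return points[keep]

def is_valid(explanation, predict_fn, query_class):
    """Sanity check: the returned explanation is predicted as the query class."""
    return predict_fn(explanation) == query_class

# toy usage with a threshold classifier on the first feature
predict = lambda p: int(p[0] > 0.5)
pts = np.random.default_rng(0).uniform(0, 1, size=(10, 3))
print(filter_valid(pts, predict, query_class=1).shape)
print(is_valid(np.array([0.9, 0.0, 0.0]), predict, query_class=1))
```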

Concept extraction The model learns latent representations which can be associated with interpretable concepts via latent traversal. In order to obtain high-quality concepts, latent disentanglement is needed; we encourage this via latent regularization. We modified the related work section to make this clearer and expand on the concept extraction technique in Appendix E.
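A minimal sketch of latent traversal, assuming a generic decoder callable; the names and the toy identity decoder are illustrative, not the paper's code:

```python
import numpy as np

def latent_traversal(decoder, z, dim, values):
    """Decode copies of a latent code z in which a single dimension `dim` is swept
    over `values`; inspecting the decoded images reveals which interpretable
    concept (if any) that dimension controls."""
    images = []
    for v in values:
        z_mod = z.copy()
        z_mod[dim] = v
        images.append(decoder(z_mod))
    return images

# toy usage with an identity "decoder": only dimension 3 varies across frames
z = np.zeros(8)
frames = latent_traversal(lambda z: z, z, dim=3, values=np.linspace(-3, 3, 7))
print(np.stack(frames)[:, 3])
```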

Unsupervised concepts Unsupervised refers to the concepts, which are learned without supervision. Classification is supervised, as the reviewer correctly notes. We rephrased to make this clearer (line 415).

Decoder generation This is because, with increasing latent dimensions, the density of the points vanishes. This implies that, in order to sample, shaping the data according to a distribution is not sufficient. The model additionally needs to learn a 'smooth' latent space, which is achieved with noise addition. Please refer to [2] for more details.
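One recipe from [2] for sampling with a deterministic autoencoder is ex-post density estimation on the latent codes; a minimal sketch, assuming a scikit-learn Gaussian mixture and an arbitrary decoder (illustrative only, not necessarily what our model does):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_and_sample_latent_prior(latents, decoder, n_components=10, n_samples=16):
    """Ex-post density estimation: fit a Gaussian mixture to the training latent
    codes, then sample new codes from it and decode them into synthetic instances."""
    gmm = GaussianMixture(n_components=n_components).fit(latents)
    z_new, _ = gmm.sample(n_samples)
    return [decoder(z) for z in z_new]

# toy usage with random latents and an identity "decoder"
Z = np.random.default_rng(0).normal(size=(500, 8))
samples = fit_and_sample_latent_prior(Z, decoder=lambda z: z, n_components=3, n_samples=4)
print(len(samples), samples[0].shape)
```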

Auxiliary model The noise injection mechanism is used to 'smooth' the latent space, as the latent space is already shaped according to a Gaussian distribution in the deterministic version of the model. Since our concept extraction technique relies on compact latent spaces, which causes a high loss of information after encoding, handling noise and reconstruction simultaneously can be difficult for the decoder. With the suggested approach, the decoder focuses only on reconstruction in the first stage; 'smooth' representations are then induced with an auxiliary model that helps the decoder handle the noise, improving reconstruction quality with respect to the noise injection mechanism of the VAE.
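A rough sketch of this two-stage idea, with a frozen encoder/decoder and an auxiliary denoiser trained on noisy latents; the architecture sizes and names are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

latent_dim = 8
encoder = nn.Sequential(nn.Linear(32, latent_dim))   # stage 1: trained for reconstruction
decoder = nn.Sequential(nn.Linear(latent_dim, 32))
denoiser = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))

opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)
x = torch.randn(128, 32)                              # stand-in training batch

# Stage 2: encoder/decoder are frozen; only the auxiliary model learns to map
# noisy latents back onto clean ones, 'smoothing' the latent space without
# forcing the decoder to absorb the noise itself.
for step in range(100):
    with torch.no_grad():
        z = encoder(x)
    z_noisy = z + 0.1 * torch.randn_like(z)
    loss = ((denoiser(z_noisy) - z) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# At generation time, noisy or sampled latents are first denoised, then decoded.
recon = decoder(denoiser(encoder(x[:1]) + 0.1 * torch.randn(1, latent_dim)))
print(recon.shape)
```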

References

[1] Luss, Ronny, et al. "Leveraging latent features for local explanations."

[2] Ghosh, Partha, et al. "From variational to deterministic autoencoders."

Comment

Dear Reviewer,

We would like to follow up to see whether our response addresses your concerns or whether you have further questions, and we would really appreciate the opportunity to discuss this further. Thank you again!

Comment

Dear Reviewer,

As the discussion period concludes tomorrow, we would appreciate your feedback on our responses to your comments. Please let us know if our answers resolved your concerns or if there are additional points that need addressing.

Thank you,

The Authors

Comment

We would like to thank the reviewers for their insightful feedback. We are pleased that our efforts in formalizing a method for human-AI decision-making and investigating the impact of explanations on real users were well received. We have addressed the limitations mentioned by the reviewers through detailed responses in individual comments and have updated our manuscript, specifically highlighting the changes made. We look forward to engaging further in discussions on this topic.

Withdrawal Notice

I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.