The FIX Benchmark: Extracting Features Interpretable to eXperts
Abstract
Reviews and Discussion
This paper proposes a real-world benchmark to evaluate feature-based explanations of black-box machine learning models. The authors collect six real-world datasets from three different domains to evaluate the performance of feature attribution methods in terms of pre-defined "expert features", i.e. a collection of feature sets that provide a meaningful, interpretable "ground-truth" explanation. This is in contrast to existing work that focuses mainly on general metrics and synthetic examples without including domain knowledge specifically tailored to the given dataset and task. Given an instance and a collection of feature sets, i.e. the proposed explanation, the FIX benchmark essentially computes a heuristic score that determines how well the proposed groups align with the expert features: for each low-level feature, the average "amount of alignment" of all proposed groups that contain that feature is compared with the ground-truth expert features. This alignment is computed using the "ExpertAlign" function, which is either task-specific or simply the maximum intersection-over-union with the human-annotated ground-truth "expert features". This score is then averaged over all features (Eq. 1). The benchmark covers the extraction of "expert features" for all six datasets: for two datasets human-annotated expert features are used, and the remaining four datasets have "implicit expert features", which are defined indirectly via task-specific scoring functions. The paper then evaluates several domain-specific pre-processing techniques (such as patches and watershed for images) and some feature-based explanation methods (such as Archipelago and CRAFT) by measuring their scores on the benchmark. From these results the authors conclude that current feature-based explanations are not sufficiently aligned with "expert features".
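To make the aggregation just described concrete, here is a minimal Python sketch of a FIXScore-style computation, assuming set-valued feature groups and a generic `expert_align` callable (placeholder names, not the authors' reference implementation):

```python
from typing import Any, Callable, Iterable, Set

def fix_style_score(
    features: Iterable[Any],                    # all low-level features of one instance
    proposed_groups: Iterable[Set[Any]],        # the proposed explanation (feature groups)
    expert_align: Callable[[Set[Any]], float],  # task-specific alignment with expert features
) -> float:
    """Average, over low-level features, the mean alignment of the groups covering them."""
    groups = list(proposed_groups)
    per_feature = []
    for feat in features:
        covering = [g for g in groups if feat in g]
        if covering:
            per_feature.append(sum(expert_align(g) for g in covering) / len(covering))
        else:
            per_feature.append(0.0)  # assumption: features covered by no group score 0
    return sum(per_feature) / max(len(per_feature), 1)

def max_iou_align(group: Set[Any], expert_groups: Iterable[Set[Any]]) -> float:
    """One possible ExpertAlign: maximum intersection-over-union with annotated expert groups."""
    return max(
        (len(group & e) / len(group | e) for e in expert_groups if group | e),
        default=0.0,
    )
```

For a dataset with human-annotated expert features, `expert_align` could then be instantiated as `lambda g: max_iou_align(g, expert_groups)`, matching the IoU variant of ExpertAlign described above.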
Strengths
- The evaluation of feature-based explanations with respect to interpretability is an important topic, where human-annotated ground-truth explanations specifically tailored to a given task and domain are an important tool to evaluate approaches.
- The proposed datasets cover multiple domains, and the proposal of a single score for each task is good for comparing methods. The applications seem interesting, although I cannot judge the quality of any of the proposed expert features.
- The paper is well-structured and easy to follow, and the purpose of the paper is clear.
Weaknesses
- Lack of popular feature attribution methods: The paper proposes a benchmark that has the potential to improve explanations based on groups of features. However, the benchmarking of methods is significantly under-developed. The paper claims (e.g. in the abstract) that feature-based explanations fall short in identifying the given "expert features". This claim is not sufficiently covered by the experiments. The authors barely evaluate feature-based explanations: from the current XAI literature, only Archipelago is considered (plus image-specific variants such as CRAFT). While feature attribution methods such as LIME or SHAP are discussed in the introduction, none of them appear in the empirical evaluation. In contrast, the evaluation in Table 2 centers on pre-processing techniques rather than feature attribution methods. To support this claim, I would suggest including a variety of well-established feature attribution methods, such as SHAP, Integrated Gradients, LIME, and their variants for higher-order explanations, and evaluating their performance on the FIX benchmark. Moreover, it is unclear to me why the FIXScores reported in Table 2 should be considered insufficient or unsatisfactory.
- Unclear modeling choices: Given only the datasets, it is difficult to benchmark XAI methods, since a black-box model is required. The paper does not address the question of how to model these datasets in the first place. A potential improvement would be to clearly discuss the modeling aspect of the datasets, or to incorporate the trained models into the benchmark. As of now, it is unclear whether any result obtained with the benchmark is due to a bad modeling choice or a bad interpretability method. Moreover, it is even unclear whether the "expert features" and the model's actual reasoning are aligned, which would be crucial for the benchmark (see Q1-Q3 below).
- Implications remain unclear: The paper in its current state has limited impact on future research, since it does not provide clear evidence of limitations in current XAI methods (see above), nor does it show where these limitations are rooted or how they could be mitigated in the future. It would be very helpful to clearly evaluate how these methods fail and why, and to propose improvements or directions for future development.
Questions
- What are the performances of the models trained for the given tasks? How difficult is it to train such models on these datasets? Are the performances sufficient to draw any conclusions about the quality of explanations, or could these problems also be inherent to the trained models themselves?
- How do you ensure that the trained models actually align with the expert features in their reasoning?
- What requirements should models satisfy to be eligible for an evaluation of explanations using these datasets? Would you include the models as part of the benchmark?
Explainable Artificial Intelligence (XAI) has been an active research direction for understanding the black-box nature of highly complex deep neural networks. One of the widely researched areas in XAI is feature attribution, where we attribute the prediction of a given model back to its input features, i.e., assign scores to input features that reflect their respective importance toward the model's prediction. The authors argue that the usefulness of feature attribution techniques is limited for high-dimensional data, where the identified important features do not have any semantic meaning and are uninterpretable to stakeholders and domain experts. To address these challenges, the authors define expert features as low-level feature groupings that align with domain experts' knowledge of the relevant task and have practical relevance. Further, they present the FIX benchmark, a unified evaluation of feature interpretability that captures the knowledge of domain experts from diverse applications and unifies them using the FixScore metric, which quantifies how interpretable a proposed feature group is.
Strengths
- The proposed framework provides an important benchmark encompassing six diverse real-world settings and three different data modalities, and unifies them using the introduced expert alignment measure FixScore.
- The authors provide very detailed implementation details of how they calculate the scores for each dataset in the Appendix.
- The authors raise an important problem of grounding feature attribution for explainable artificial intelligence in knowledge from domain experts.
Weaknesses
- "We propose FIXScore, a unified expert alignment measure. With FIXScore, we find that popular feature-based explanation methods have poor alignment with expert-specified knowledge, highlighting the need for new methods to better identify features interpretable to experts." -- Are the authors proposing a call to the community to develop new feature attribution explanation methods? If yes, how do FixScore or expert features help in this regard? Most classification models are trained on certain downstream objectives which do not necessarily reward the model for learning the task using expert features. Assuming complete faithfulness of explanation methods, aren't we arguing that the underlying training paradigm should change?
- The authors argue that "the main problem is that features at the individual pixel or token level are often too granular and thus lack clear semantic meaning in relation to the entire input" -- While I agree that individual pixels and tokens do not possess any clear semantic meaning, the model was never trained to learn features that align with domain experts, i.e., it is always trained on a cross-entropy or some other downstream loss function which may or may not align with domain experts. Hence, this is not necessarily a symptom of the feature attribution methods but could stem from the underlying training of the model itself.
- The authors propose specific properties for grouping low-level features into high-level features that are more digestible and that align with domain experts' knowledge of the relevant task -- Wouldn't training a model with concept bottleneck models suffice to satisfy these properties?
- The authors argue for the importance of feature attribution that is grounded in domain expert knowledge. However, in Sec. 4.1, the authors use purity- and ratio-based metrics that are proxies for the ground-truth information needed to align attributed features. Since these purity and ratio metrics are essentially based on pixel intensities, isn't this another measure of something like Intersection over Union (IoU), which we use to evaluate feature attribution maps for real-world images by calculating the IoU between the generated importance map and ground-truth object bounding boxes? (A minimal IoU sketch appears after this list.)
- For datasets like Emotion and Politeness, obtaining expert features may be highly subjective. Are there ways the authors are considering to address this problem? Further, these subjective notions come in different degrees and are mostly context- and scenario-dependent. How do these factors affect the underlying expert features?
- One way to support the significance of the expert features would have been to train a machine learning model on just the features that are deemed to be aligned with domain experts and report the accuracy of the trained model. If the expert features are indeed grounded in domain knowledge, a model should be able to achieve high predictive performance using only those features.
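As an illustration of this suggestion, a minimal sketch (hypothetical names, assuming a tabular dataset where expert features correspond to groups of column indices) could compare a model trained on all features against one restricted to the expert features:

```python
# Hypothetical sketch: does a simple model trained only on the union of
# expert-feature columns reach accuracy comparable to one trained on all features?
# X, y, and expert_feature_groups are placeholders for a concrete dataset.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def expert_feature_accuracy(X, y, expert_feature_groups):
    expert_idx = sorted({i for group in expert_feature_groups for i in group})
    clf = LogisticRegression(max_iter=1000)
    acc_full = cross_val_score(clf, X, y, cv=5).mean()
    acc_expert = cross_val_score(clf, X[:, expert_idx], y, cv=5).mean()
    return acc_full, acc_expert  # comparable values would support the expert features
```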
(Minor points)
- It would be great for the reader if the authors could include the metric calculation for each of the datasets in the main paper, as that is one of the main contributions of the paper.
- The anonymous link to the code doesn't work.
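For reference, the IoU-style comparison mentioned in the weaknesses above could look like the following minimal sketch (placeholder names; binarizing an importance map and comparing it against a bounding-box mask):

```python
import numpy as np

def importance_iou(importance_map: np.ndarray, box_mask: np.ndarray, threshold: float = 0.5) -> float:
    """IoU between a thresholded (H, W) importance map and a binary (H, W) bounding-box mask."""
    pred = importance_map >= threshold
    gt = box_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum() / union) if union else 0.0
```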
Questions
Please refer to the weakness section for open questions regarding the work.
The paper presents "FIX", a benchmark designed to evaluate how well machine learning-generated features align with expert knowledge across various domains like cosmology, psychology, and medicine. The authors introduce FIXScore, a unified measure of feature interpretability, providing a framework for assessing alignment with expert knowledge in vision, language, and time-series data. Through empirical analysis, the authors highlight the limitations of current feature extraction methods and underscore the need for methods that can produce expert-interpretable features, particularly for high-dimensional data.
Strengths
- The paper presents a novel benchmark to measure the alignment of machine-extracted features with expert knowledge, which is valuable for guiding future research on general-purpose, automated expert feature extraction.
- The paper employs datasets from various domains, including tabular, text, and vision data, and evaluates various feature extraction methods, including both domain-specific and domain-agnostic techniques.
Weaknesses
- Although the paper uses five datasets for experimentation, it lacks a clear rationale for selecting these specific datasets.
- The structure of the paper could be improved: the results and the discussion of the results are very brief in the main paper and lack interpretation. The paper could consider condensing the dataset introduction and adding more discussion of the results to better demonstrate the insights.
- There is an inconsistency in line 478, where the paper mentions “three segmentation methods” for image data but lists four: “Patches,” “Quickshift,” “Watershed,” and “CRAFT.” Additionally, Table 2 includes an extra method, "SAM," which is not cited.
Questions
- The FIXScore relies on predefined expert alignment metrics, which may be subjective and vary across experts within a domain. Would different expert definitions for the same task lead to huge variations in evaluating the extracted features?
- Table 2 shows that the results on Domain-specific Time Series are much lower than the others, but the domain-agnostic results are rather comparable to those on the vision tasks. The Emotion score is also lower than the Politeness score. While the caption notes that FIXScore is not comparable across tasks, is there any interpretation of these score discrepancies? Additionally, the abstract concludes "poor alignment with expert-specified knowledge." If scores are not directly comparable across datasets, how should we interpret a score as "high" or "low" and determine alignment quality?
This paper proposes a benchmark for measuring how well a collection of features aligns with expert knowledge.
Strengths
- the problem the paper tries to address is important.
- the datasets cover different domains
Weaknesses
- The formulation of the ExpertAlign metric for various datasets appears somewhat ad hoc, lacking clear justification for the chosen metrics. Why were these particular metrics selected over other potential designs, and how do they relate to Formula 2?
- If the proposed metrics were part of a broader methodological paper as a way to evaluate the primary method, they would likely be suitable. However, as a standalone contribution, they seem insufficient.
Questions
see above
I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.