Prediction Via Shapley Value Regression
Abstract
Reviews and Discussion
This paper proposes ViaSHAP, whose goal is to incorporate (baseline) SHAP explanations into a deep learning architecture as the final layer, such that the sum of all values is used as the model's prediction. It is thereby an extension of FastSHAP, which fits a surrogate model (or uses a fine-tuned variant) to predict SHAP explanations. The key difference is that instead of using the FastSHAP model as a surrogate to explain the black-box model, ViaSHAP takes the FastSHAP (surrogate) model as the prediction model. The authors propose a learning procedure similar to FastSHAP's (but fit on true labels instead of the model's predictions). This learning paradigm is based on sampling random maskings during training, which are obtained from rewriting the Shapley value as the solution to a weighted least-squares problem. ViaSHAP is then applied to two Kolmogorov-Arnold Networks (KANs) and two MLP architectures. The authors evaluate the performance of this new architecture and learning paradigm on 25 tabular datasets against XGBoost, random forests, and TabNet. The evaluation shows that the KAN networks perform best on many of the datasets. Moreover, the authors evaluate the SHAP explanations of ViaSHAP computed with KernelSHAP against the (inherent) SHAP values obtained from the proposed ViaSHAP model.
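For readers who want the mechanics, the following is a minimal sketch of such a weighted-least-squares masking penalty under our reading of the summary above; `model`, the shapes, and the kernel-weighted coalition sampling are assumptions for illustration, not the authors' implementation:

```python
import torch
from math import comb

def viashap_penalty(model, x, n_masks=2):
    """Sketch of a masking-based Shapley penalty (not the authors' code).

    `model` is assumed to map a (batch, d) input to per-feature values
    phi of shape (batch, d); the prediction is their sum (link function
    omitted for brevity). Masked features are set to 0, i.e. the
    per-feature mean after standardization.
    """
    batch, d = x.shape
    phi = model(x)  # candidate Shapley values for the unmasked input

    # Shapley kernel weights over coalition sizes s = 1 .. d-1.
    w = torch.tensor([(d - 1) / (comb(d, s) * s * (d - s)) for s in range(1, d)])
    w = w / w.sum()

    penalty = 0.0
    for _ in range(n_masks):
        sizes = torch.multinomial(w, batch, replacement=True) + 1
        mask = torch.zeros(batch, d)
        for b in range(batch):
            mask[b, torch.randperm(d)[: sizes[b]]] = 1.0
        f_masked = model(x * mask).sum(dim=1)           # f(x^S)
        f_base = model(torch.zeros_like(x)).sum(dim=1)  # f(0)
        resid = f_masked - f_base - (mask * phi).sum(dim=1)
        penalty = penalty + (resid ** 2).mean()
    return penalty / n_masks
```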
Strengths
- This work aims to incorporate the SHAP explanations during training in the model, which can be viewed as a bridge between black-box models and inherently interpretable models. The core idea to use the FastSHAP-paradigm directly as a prediction model is a novel, yet minor, extension of FastSHAP, which has not been explored so far.
- The paper is easy to follow; implementation details and experimental results are clearly presented.
Weaknesses
- One weakness is the novelty of the contribution. Training a black-box classifier which incorporates the SHAP explanations has already been proposed with FastSHAP. The main novelty in this work, fitting FastSHAP on the (FastSHAP-)predictions of masked training data instead of the predictions of the black-box model, is not a major contribution.
- The paper benchmarks the performance of ViaSHAP models against competitors on tabular datasets. But a key comparison is missing: how does ViaSHAP perform compared with a model of similar architecture that uses a traditional learning paradigm (without maskings)? What are the consequences for the classification performance of the model?
- The SHAP explanations computed by KernelSHAP on the ViaSHAP model differ from the ViaSHAP explanations. In fact, in Tables 2 and 3 this discrepancy is often quite large. The paper does not address this discrepancy, nor any consequences for the produced explanations. For instance, it would be interesting to see if it is possible to recover exact SHAP scores (convergence of KernelSHAP) by increasing the number of maskings in training, or to state theoretical guarantees on the maximum discrepancy.
- Possible technical errors in the proofs, see questions below
- ViaSHAP only considers baseline imputations as the masked inputs in the learning paradigm. A discussion for marginal and conditional imputations (as given in FastSHAP) is missing. What are the consequences of this decision? How does ViaSHAP behave under other value functions?
- A comparison on non-tabular data, such as images, is missing. This would be particularly interesting, since FastSHAP unfolds its benefits on such data. A comparison on the benchmarks from the FastSHAP paper would be beneficial.
Questions
- What are the consequences of the novel training scheme? How does the performance compare to the same architecture with the usual training scheme?
- Moreover, how does the training time of the novel training scheme (with maskings) compare with the usual training scheme to achieve similar performances for the same architecture?
- Is there any guarantee to obtain exact SHAP explanations with ViaSHAP (fully agreeing with KernelSHAP)? How does the training scheme need to be modified to guarantee this? Is there a trade-off between accuracy and computational feasibility?
- In the proof of Lemma 2, why should the global loss be minimized at zero and not some other non-negative value?
- In the proof of Lemma 3, and in general, I do not understand why the prediction on the masked output should be equal to the sum - this should only be the case if the loss is minimized at zero, which is highly unlikely. Does there exist any empirical evidence for this, or a formal assumption?
Response to weakness 6 (A comparison on non-tabular data, such as images, is missing. This would be particularly interesting, since FastSHAP unfolds its benefits on such data. A comparison on the benchmarks from the FastSHAP paper would be beneficial.):
We again thank the reviewer for highlighting a limitation that can be addressed, thereby improving the quality of the paper.
In order to address this limitation in the paper, we conducted an experiment where we implemented ViaSHAP using 3 image recognition architectures (ResNet50 [5], ResNet18, and U-Net [6]). We evaluated the predictive performance of the 3 models using top-1 accuracy (the results are displayed in Table 2). All the models were trained on the CIFAR10 dataset without pre-trained weights (trained from scratch) using 2 masks (samples) per instance and with early stopping after 10 epochs without improvement on the validation split (10% of the training data). The results show that ViaSHAP can achieve high predictive performance on classical image classification tasks.
Table 2: Comparison of the predictive performance of the implementations of ViaSHAP using U-Net, ResNet18, and ResNet50.
| Model | AUC | 0.95 Confidence Interval |
|---|---|---|
| ViaSHAP (U-Net) | 0.983 | (0.981, 0.986) |
| ViaSHAP (ResNet18) | 0.968 | (0.964, 0.971) |
| ViaSHAP (ResNet50) | 0.96 | (0.956, 0.964) |
Afterward, we evaluated the accuracy of the Shapley values by following the same approach employed in [1], i.e., by selecting the top 50% most important features (according to the explainer) and evaluating the predictive performance of the explained model using only the selected top features (inclusion accuracy) and without the top features (exclusion accuracy); a sketch of this protocol is given after Table 3. We compared the accuracy of the approximated Shapley values of the 3 ViaSHAP implementations with FastSHAP applied as an explainer to the same 3 implementations. The results show that the ViaSHAP implementations approximate the Shapley values more accurately than the explanations obtained through FastSHAP.
Table 3: The table compares the accuracy of the Shapley values using the top 50% of the most important features (according to their Shapley values). It reports the top-1 accuracy of the explained model: the inclusion AUC (where higher values are better) and the exclusion AUC (where lower values are better).
| Model | Exclusion AUC | 0.95 Confidence Interval | Inclusion AUC | 0.95 Confidence Interval |
|---|---|---|---|---|
| ViaSHAP (U-Net) | 0.773 | (0.747, 0.799) | 0.988 | (0.981, 0.995) |
| FastSHAP (U-Net) | 0.864 | (0.843, 0.885) | 0.978 | (0.969, 0.987) |
| ViaSHAP (ResNet18) | 0.611 | (0.581, 0.642) | 0.99 | (0.983, 0.996) |
| FastSHAP (ResNet18) | 0.755 | (0.728, 0.782) | 0.954 | (0.941, 0.967) |
| ViaSHAP (ResNet50) | 0.554 | (0.523, 0.585) | 0.997 | (0.994, 1.0) |
| FastSHAP (ResNet50) | 0.778 | (0.753, 0.804) | 0.978 | (0.969, 0.987) |
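For reference, a minimal sketch of this inclusion/exclusion protocol (hypothetical names; `explain` returns per-feature attributions, `model` scores a masked input, and the collected outputs are then aggregated into top-1 accuracy / AUC against the labels):

```python
import numpy as np

def inclusion_exclusion_outputs(model, explain, X):
    """Sketch of the protocol from [1]: keep (inclusion) or remove
    (exclusion) the top 50% of features ranked by attribution, with
    removed features masked to 0, and collect the model's outputs.
    The outputs are then scored against the labels (top-1 / AUC)."""
    incl, excl = [], []
    for x in X:
        phi = explain(x)                          # per-feature importance
        top = np.argsort(phi)[-(len(x) // 2):]    # indices of the top 50%
        keep_top = np.zeros_like(x)
        keep_top[top] = x[top]
        drop_top = x.copy()
        drop_top[top] = 0.0
        incl.append(model(keep_top))
        excl.append(model(drop_top))
    return np.array(incl), np.array(excl)
```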
[1] FastSHAP: Real-Time Shapley Value Estimation, Jethani et al., ICLR2022.
[5] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770– 778.
[6] Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. arXiv. https://arxiv.org/abs/1505.04597
Thank you for addressing my concerns. I appreciate the novel experimental results, and clarifications regarding proofs. I would kindly ask the authors to include the argument and formal assumption regarding global optimization and zero loss in the lemmata/theorem and the proofs for the revised version of the paper.
I have some follow-up questions and comments:
- as I understood, you have not used the interventional approach (marginal expectations), but instead used baseline imputations (masking all features not available with the baseline 0), which corresponds to Baseline SHAP [1]? Could you clarify this? Is there any reason not to use the same (conditional) variant as FastSHAP? What are possible implications on the interpretation of these scores?
- I think the provided experiments regarding performance should be included in the paper for all architectures, as these seem crucial and intriguing. With the novel training scheme, it is important to understand how it affects the model's performance, which might also differ across architectures. Moreover, I think it would be valuable to compare these performances given similar computational resources.
- "Given all the mentioned variables, we cannot guarantee the exact computation of the Shapley values." I understand that such guarantees are difficult to provide. However, it seems plausible that increasing the number of sampled masks would yield values closer to the actual Shapley values, possibly at the cost of performance. Do you have any insights on such a trade-off? I think together with the performance of your approach, a crucial point of your contribution should address the interpretability aspect your method, i.e. how much your attributions coincide with the SHAP scores, and how to achieve higher interpretability, possibly at the cost of performance. Experiment 4.3 is good first step towards evaluation but does not answer this question.
That being said, I decided to increase my score, and I am re-considering an additional increase.
[1] Sundararajan, M. & Najmi, A.. (2020). The Many Shapley Values for Model Explanation
Questions:
1- What are the consequences of the novel training scheme? How does the performance compare to the same architecture with the usual training scheme?
We thank the reviewer for highlighting the missing comparison. We conducted an experiment to evaluate the effect of the Shapley loss on the predictive performance. We compare against a model with the same architecture that does not compute the Shapley values. The results provided in Table 1 (below) show that the performance of ViaSHAP is generally better than that of a KAN model with the same architecture, which suggests that the Shapley component of the loss function may have a regularizing effect on the training of the model.
Table 1: A comparison between ViaSHAP (KAN) and a KAN model with the same architecture (but not constrained to compute the Shapley values) with respect to the predictive performance (as measured in AUC).
| Dataset | KAN Model | ViaSHAP (KAN) |
|---|---|---|
| Abalone | 0.882 ± 0.001 | 0.870 ± 0.003 |
| Ada Prior | 0.895 ± 0.005 | 0.890 ± 0.005 |
| Adult | 0.917 ± 0.001 | 0.914 ± 0.003 |
| Bank32nh | 0.886 ± 0.001 | 0.878 ± 0.001 |
| Electricity | 0.924 ± 0.005 | 0.930 ± 0.004 |
| Elevators | 0.935 ± 0.003 | 0.935 ± 0.002 |
| Fars | 0.957 ± 0.001 | 0.960 ± 0.0003 |
| Helena | 0.883 ± 0.001 | 0.884 ± 0.0001 |
| Heloc | 0.793 ± 0.002 | 0.788 ± 0.002 |
| Higgs | 0.801 ± 0.002 | 0.801 ± 0.001 |
| hls4ml lhc jets hlf | 0.944 ± 0.000 | 0.944 ± 0.0001 |
| House 16H | 0.948 ± 0.001 | 0.949 ± 0.0007 |
| Indian Pines | 0.935 ± 0.001 | 0.985 ± 0.0004 |
| Jannis | 0.860 ± 0.002 | 0.864 ± 0.001 |
| JM1 | 0.725 ± 0.008 | 0.732 ± 0.003 |
| Magic Telescope | 0.931 ± 0.001 | 0.929 ± 0.001 |
| MC1 | 0.933 ± 0.019 | 0.940 ± 0.003 |
| Microaggregation2 | 0.783 ± 0.002 | 0.783 ± 0.002 |
| Mozilla4 | 0.967 ± 0.001 | 0.968 ± 0.0008 |
| Satellite | 0.987 ± 0.003 | 0.996 ± 0.001 |
| PC2 | 0.458 ± 0.049 | 0.827 ± 0.009 |
| Phonemes | 0.945 ± 0.002 | 0.946 ± 0.003 |
| Pollen | 0.491 ± 0.005 | 0.515 ± 0.006 |
| Telco Customer Churn | 0.848 ± 0.005 | 0.854 ± 0.003 |
| 1st order theorem proving | 0.805 ± 0.005 | 0.822 ± 0.002 |
2- Moreover, how does the training time of the novel training scheme (with maskings) compare with the usual training scheme to achieve similar performances for the same architecture?
We reported the training time of a model with the same architecture that does not employ sampling or compute Shapley values, which can be found in Table 4 under the columns (No Sampling) in Appendix G.
3- Is there any guarantee to obtain exact SHAP explanations with ViaSHAP (fully agreeing with KernelSHAP)? How does the training scheme need to be modified to guarantee this? Is there a trade-off between accuracy and computational feasibility?
The accuracy of the approximated Shapley values depends on several factors, including the number of features, i.e., the number of possible coalitions, the amount of training data available, the representational capacity of the chosen architecture, as well as the nature and distribution of the data. Given all the mentioned variables, we cannot guarantee the exact computation of the Shapley values.
4- In the proof of Lemma 2, why should the global loss be minimized at zero and not some other non-negative value?
The assumption of Lemma 2 is that the global loss is minimized at zero. Obviously, being a sum of squares, zero is a lower bound for this term. The reviewer asks how we can guarantee that this lower bound is indeed reached. Our argument in this regard is the same as that used by FastSHAP [1]: since ViaSHAP relies on an MLP to learn its values, the universal approximation theorem (Cybenko [2]; Hornik [3]) states that we can learn the Shapley values up to arbitrary accuracy.
Nonetheless, to push this answer further, we also added a proof for a relaxed version of Lemma 2. We prove that, as the loss converges to $0$, so do the attributed importances of non-influential criteria. In particular, if the loss has value $\epsilon^2$, the importance attributed to a non-influential criterion is at most $2\epsilon$. The proof was added in the paper in the appendix, after the proof of Lemma 2.
In practice, it is unlikely for a loss to exactly reach its global optimum. Instead, it approximates it. We assume here that the loss has reached a value $\mathcal{L}_{\phi}(\theta) = \epsilon^2$ with $\epsilon \geq 0$. We propose an upper bound on $\left| \phi_i^{\mathcal{V}ia}(x; \theta) \right|$, for any non-influential feature $i$, conditioned on $\epsilon$.

Since the loss is composed only of non-negative (squared) terms, this means that, for any coalition $S$:

$$\left| \mathcal{V}ia^{SHAP}(x^{S}) - \mathcal{V}ia^{SHAP}(0) - 1^{\top}_{S}\phi^{\mathcal{V}ia}(x; \theta) \right| \leq \epsilon$$

Thus, in particular, we have the two following cases:

$$\left| \mathcal{V}ia^{SHAP}(x^{S \cup \{i\}}) - \mathcal{V}ia^{SHAP}(0) - 1^{\top}_{S \cup \{i\}}\phi^{\mathcal{V}ia}(x; \theta) \right| \leq \epsilon \quad \text{and} \quad \left| \mathcal{V}ia^{SHAP}(x^{S}) - \mathcal{V}ia^{SHAP}(0) - 1^{\top}_{S}\phi^{\mathcal{V}ia}(x; \theta) \right| \leq \epsilon$$

$$\Rightarrow \left| \mathcal{V}ia^{SHAP}(x^{S \cup \{i\}}) - \mathcal{V}ia^{SHAP}(0) - 1^{\top}_{S \cup \{i\}}\phi^{\mathcal{V}ia}(x; \theta) - \mathcal{V}ia^{SHAP}(x^{S}) + \mathcal{V}ia^{SHAP}(0) + 1_S^{\top}\phi^{\mathcal{V}ia}(x; \theta) \right| \leq 2\epsilon$$

$$\Rightarrow \left| \mathcal{V}ia^{SHAP}(x^{S}) - 1^{\top}_{S \cup \{i\}}\phi^{\mathcal{V}ia}(x; \theta) - \mathcal{V}ia^{SHAP}(x^{S}) + 1^{\top}_S\phi^{\mathcal{V}ia}(x; \theta) \right| \leq 2\epsilon \quad \text{by equation 8}$$

$$\Rightarrow \left| \sum_{j \in S \cup \{i\}} \phi_j^{\mathcal{V}ia}(x; \theta) - \sum_{j \in S} \phi_j^{\mathcal{V}ia}(x; \theta) \right| \leq 2\epsilon$$

$$\Rightarrow \left| \phi_i^{\mathcal{V}ia}(x; \theta) \right| \leq 2\epsilon$$

$$\Rightarrow \left| \phi_i^{\mathcal{V}ia}(x; \theta) \right| \leq 2\sqrt{\mathcal{L}_{\phi}(\theta)}$$
Thus, as the loss function converges to $0$, so does the importance attributed to features with no influence on the outcome. Of course, this is a theoretical argument, which does not guarantee in practice that these results generalize well, or that the fit is perfect. This is why we provide extensive experiments to confirm empirically the validity and performance of our approach.

[1] FastSHAP: Real-Time Shapley Value Estimation, Jethani et al., ICLR 2022.

[2] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, Dec 1989.

[3] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.

5- In the proof of Lemma 3, and in general, I do not understand why the prediction on the masked output should be equal to the sum - this should only be the case if the loss is minimized at zero, which is highly unlikely. Does there exist any empirical evidence for this, or a formal assumption?
In the same way as for the previous Lemma (2), this assumes perfect minimization of the loss. Thus, we propose a relaxed variant, where the loss term was minimized down to $\epsilon^2$ with $\epsilon \geq 0$. Thus, following similar reasoning as in the proof of Lemma 2, we have that, for any coalition $S$:

$$\left| \mathcal{V}ia^{SHAP}(x^{S}) - \mathcal{V}ia^{SHAP}(0) - 1^{\top}_{S}\phi^{\mathcal{V}ia}(x; \theta) \right| \leq \epsilon$$

We also have, taking $S$ to be the full set of features:

$$\left| \mathcal{V}ia^{SHAP}(x) - \mathcal{V}ia^{SHAP}(0) - 1^{\top}\phi^{\mathcal{V}ia}(x; \theta) \right| \leq \epsilon$$

By the triangle inequality on the right-hand side:

$$\left| \mathcal{V}ia^{SHAP}(x) - 1^{\top}\phi^{\mathcal{V}ia}(x; \theta) \right| \leq \epsilon + \left| \mathcal{V}ia^{SHAP}(0) \right|$$

But observe that all features in $0$ are non-contributive since $\mathcal{V}ia^{SHAP}(0^{S \cup \{i\}}) = \mathcal{V}ia^{SHAP}(0^{S})$, by definition of the masking operation. Thus, by the bound found in Lemma 2: $\left| \phi_i^{\mathcal{V}ia}(0; \theta) \right| \leq 2\epsilon$ for every feature $i$. Thus $\left| \mathcal{V}ia^{SHAP}(0) \right| \leq 2d\epsilon$, where $d$ is the number of features.

Thus:

$$\left| \mathcal{V}ia^{SHAP}(x) - 1^{\top}\phi^{\mathcal{V}ia}(x; \theta) \right| \leq (2d + 1)\epsilon$$

and we thus derive the following upper bound on the point-wise error:

$$\left| \mathcal{V}ia^{SHAP}(x) - \sum_{j=1}^{d} \phi_j^{\mathcal{V}ia}(x; \theta) \right| \leq (2d + 1)\sqrt{\mathcal{L}_{\phi}(\theta)}$$
Response to weakness 5 (ViaSHAP only considers baseline imputations as the masked inputs in the learning paradigm. A discussion for marginal and conditional imputations (as given in FastSHAP) is missing. What are the consequences of this decision? How does ViaSHAP behave under other value functions?):
Thanks for pointing out this weakness. We indeed have to motivate our decision to use the interventional approach to approximate the Shapley values. Our decision is based on the work of Chen [4], which suggests that the interventional approach to computing the Shapley values results in explanations that tend to be more "true" to the data and can be more computationally tractable as well.
[4] Hugh Chen, Joseph D Janizek, Scott Lundberg, and Su-In Lee. True to the model or true to the data? arXiv preprint arXiv:2006.16234, 2020.
We appreciate the reviewer's thorough feedback. We are grateful for the time and effort dedicated to reviewing our paper. We provide some comments and answers to the questions in the following part.
Regarding weakness 1 (One weakness is the novelty of the contribution. Training a black-box classifier which incorporates the SHAP explanations has already been proposed with FastSHAP. The main novelty in this work, fitting FastSHAP on the (FastSHAP-)predictions of masked training data instead of the predictions of the black-box model, is not a major contribution.):
We respectfully disagree with the reviewer.
To the best of our knowledge, Shapley values have always been computed post-hoc, requiring the prediction and the black-box model to be available first, which applies to all the known methods, such as FastSHAP [1], KernelSHAP [2], or the unbiased KernelSHAP [3]. In contrast, a key contribution of our work is that ViaSHAP computes Shapley values before the game itself, i.e., before the prediction takes place.
FastSHAP and our approach, ViaSHAP, are designed for different scenarios. FastSHAP is a post-hoc explanation technique, relying on a pre-trained black-box model to generate predictions, which FastSHAP then explains. In contrast, ViaSHAP is a standalone, explainable-by-design model. It inherently provides Shapley values as explanations for its predictions, without depending on or explaining the behavior of a separate pre-trained model.
[1] FastSHAP: Real-Time Shapley Value Estimation, Jethani et al., ICLR2022.
[2] Scott M. Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 4768–4777, 2017.
[3] Ian Covert and Su-In Lee. Improving kernelshap: Practical shapley value estimation using linear regression. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130, pp. 3457–3465, April 2021.
We thank the reviewer for the helpful questions that clarify our contribution.
as I understood, you have not used the interventional approach (marginal expectations), but instead used baseline imputations (masking all features not available with the baseline 0), which corresponds to Baseline SHAP [1]? Could you clarify this? Is there any reason not to use the same (conditional) variant as FastSHAP? What are possible implications on the interpretation of these scores?
Indeed, we used the same approach as FastSHAP. We use standard normalization so that the average value over each feature is $0$. Therefore, applying a mask over feature $i$ and setting its normalized value to $0$ is equivalent to setting its value to the feature's average over the dataset. Consequently, what we explain is $\mathcal{V}ia^{SHAP}(x) - \mathcal{V}ia^{SHAP}(\mathbb{E}[x])$. This will indeed be clarified in the experimental setup.
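To illustrate the equivalence stated here, a small self-contained example (a sketch assuming standard normalization fitted on the training data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))  # toy training data

# Standard normalization: every feature has mean 0 afterwards.
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_norm = (X - mu) / sigma

# Masking feature 0 by writing 0 in normalized space ...
x = X_norm[0].copy()
x[0] = 0.0

# ... is the same as imputing the training mean in the original space.
x_orig = x * sigma + mu
assert np.isclose(x_orig[0], mu[0])
```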
I think the provided experiments regarding performance should be included in the paper for all architectures, as these seem crucial and intriguing. With the novel training scheme, it is important to understand how it affects the model's performance, which might also differ across architectures. Moreover, I think it would be valuable to compare these performances given similar computational resources.
We totally agree with the reviewer. The results from the additional experiments highlighted by the reviewers have significantly enhanced the quality of the paper. We are in the process of updating the manuscript to incorporate the new experiments discussed during the rebuttal. The updated version will be uploaded before the revision deadline.
I understand that such guarantees are difficult to provide. However, it seems plausible that increasing the number of sampled masks would yield values closer to the actual Shapley values, possibly at the cost of performance. Do you have any insights on such a trade-off? I think together with the performance of your approach, a crucial point of your contribution should address the interpretability aspect of your method, i.e. how much your attributions coincide with the SHAP scores, and how to achieve higher interpretability, possibly at the cost of performance. Experiment 4.3 is a good first step towards evaluation but does not answer this question.
Our empirical observations support the reviewer's argument: the similarity of the approximated Shapley values to the ground truth degrades more with a higher number of features, i.e., a higher number of required coalitions to compute the exact Shapley values, given a constant number of samples per data instance. An example is the Indian Pines dataset (220 features). We are working on addressing such a trade-off.
Thank you again for your response. I have one last follow-up question/comment:
Comparison with FastSHAP: Masking a set of features using the baseline $b$ is not the same as computing the marginal expectation, since $f(x_S, b_{\bar{S}}) \neq \mathbb{E}_{x'}\left[f(x_S, x'_{\bar{S}})\right]$ in general. This only holds for very few models, e.g. linear models. As mentioned before, I think your method relies on Baseline SHAP [1]. Marginal expectations would result in interventional SHAP values/random baseline [1,2]. Moreover, it is stated in the FastSHAP paper (section 3.2) that they use a trained surrogate for masking:
To this end, we use a supervised surrogate model (Frye et al., 2020; Jethani et al., 2021) to approximate marginalizing out the remaining features using the conditional distribution [given $x_S$]
In fact, FastSHAP relies on conditional expectation (modeled by the surrogate), known as observational SHAP values [2]. Does that mean, your experiments were conducted with FastSHAP using the baseline imputation, instead of the originally proposed variant? Did you use baseline imputations for all experiments, including KernelSHAP?
[1] Sundararajan, M. & Najmi, A.. (2020). The Many Shapley Values for Model Explanation
[2] Hugh Chen, Joseph D Janizek, Scott Lundberg, and Su-In Lee. True to the model or true to the data? arXiv preprint arXiv:2006.16234, 2020.
We thank the reviewer very much for the engaging discussion and the thoughtful comments. From the sources you proposed [1,2], and another relevant one [3], we found the three following ways of describing interventional Shapley values:
[1]: "One type of what-if analysis (e.g. BShap) performs interventions on the feature, while another (e.g. CES) marginalizes the feature over the training data. The former may construct out-of-distribution inputs, but regularization can ensure reasonable model behavior on these inputs"
[2]: "Interventional: we “intervene” on the features by breaking the dependence between features in S and the remaining features. We refer to Shapley values obtained with either approach as either observational or interventional Shapley values."
[3]: "Notably observational SHAP takes the underlying data distribution with its dependencies into account by using (observational) conditional expectations. Whereas interventional SHAP breaks up feature dependencies via interventions and therefore puts more emphasis on the model."
You are right that we do not use the conditional approach used in [2]. If our terminology is correct, our approach is then interventional, in that it replaces feature values without taking into account the inter-feature dependencies, as opposed to the observational one. This means that we do construct out-of-distribution inputs. In fact, we used the expected values of the features as a baseline for feature removal, since we used standard normalization to normalize all feature values and center them around zero.
If you see any confusion wrt our terminology, we apologize, and we will be happy to correct it in the paper in order to avoid any ambiguity wrt that specific point.
On the other hand, in FastSHAP, the authors evaluated 3 approaches in section 5.2 (Surrogate/In distribution, Marginal/Out of distribution, Baseline removal), and we employ an approach similar to number 3 (baseline removal). In particular, the code provided by the FastSHAP authors [4] uses baseline removal as a default. We agree that it would be relevant to evaluate the effect of several masking strategies on the performance of our approach.
In our experiments, we applied the same baseline removal to all compared models (ViaSHAP, FastSHAP, and KernelSHAP) in order to ensure fairness in the evaluation.
[1] Sundararajan, M. & Najmi, A.. (2020). The Many Shapley Values for Model Explanation
[2] Hugh Chen, Joseph D Janizek, Scott Lundberg, and Su-In Lee. True to the model or true to the data? arXiv preprint arXiv:2006.16234, 2020.
[3] Artjom Zern, Klaus Broelemann, Gjergji Kasneci, Interventional SHAP Values and Interaction Values for Piecewise Linear Regression Trees, AAAI-23
[4] https://github.com/iancovert/fastshap, line 186 in fastshap/fastshap.py
Thank you for the clarification. Terminologies used for the defined game are commonly:
- Baseline SHAP [1]: $v(S) = f(x_S, b_{\bar{S}})$ for a single baseline vector $b$ ($b = 0$ in your case, i.e. often $b = \mathbb{E}[x]$)
- Interventional SHAP (or SHAP with marginal expectations / Random Baseline SHAP in [1]): $v(S) = \mathbb{E}_{x' \sim p(x)}\left[f(x_S, x'_{\bar{S}})\right]$ (sometimes also introduced with a DO-operator), i.e. $x'_j$ are randomly sampled feature values for features $j \in \bar{S}$. Usually, a background dataset is chosen and Monte Carlo sampling is used, i.e. multiple baselines from a background dataset.
- Observational SHAP (SHAP with conditional expectations, CES in [1]): $v(S) = \mathbb{E}\left[f(x) \mid x_S\right]$, i.e. here we impute $x_{\bar{S}}$ conditioned on the values $x_S$ (this is hard in practical applications). FastSHAP proposes the surrogate here to model this. (See the sketch after this list.)
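A sketch of the first two value functions for reference (hypothetical `model` with a scikit-learn-style `predict` and a `background` array; the observational variant is only indicated, since it requires a conditional model such as FastSHAP's surrogate):

```python
import numpy as np

def v_baseline(model, x, S, b):
    """Baseline SHAP: impute features outside S with a fixed baseline b."""
    z = b.copy()
    z[S] = x[S]
    return model.predict(z[None, :])[0]

def v_interventional(model, x, S, background):
    """Interventional SHAP: impute features outside S with values drawn
    from a background dataset (Monte Carlo over the marginal)."""
    Z = background.copy()
    Z[:, S] = x[S]
    return model.predict(Z).mean()

# Observational SHAP would instead estimate E[f(x) | x_S], which requires
# a model of the conditional distribution of the missing features given
# x[S] -- e.g. FastSHAP's supervised surrogate.
```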
I think this aligns with my W5 in the beginning; it is good that you choose the same value function for all methods. The paper could benefit from being more general in this regard, as the proposed method could be applied to all three imputation methods. Also, providing a more general description of the background of value functions and choices would be necessary, similar to FastSHAP.
That being said, I highly appreciate the effort of providing the new results/experiments/comparisons and clarifications, which overall improved the quality of the paper. However, I observe that reviewer Uxwi also raised some of my concerns. While I don't have a strong opinion on this, given the somewhat limited novelty of the paper and the remaining points (e.g. introducing removal techniques formally and comparing them, integrating the comparison with the same architecture fully into 4.2 including MLPs, much relevant information now being found in the appendix), I would encourage the authors to work these out carefully. I therefore decided to keep my score, but I'm happy to discuss this further with the remaining reviewers.
We appreciate the reviewer's feedback and insightful comments. We are also happy to answer any additional questions.
This paper proposes a new model and training loss for efficiently computing Shapley values in supervised machine learning. The central idea is to train the model to directly output Shapley values when making predictions.
To achieve this, the model learns a feature representation, $\phi(x; \theta)$, where the output is simply the sum of its coordinates, or the sum plus a link function. The goal is for each coordinate of the learned representation to correspond to the Shapley value of the $i$-th feature in the input data. To enable this interpretation, the authors introduce a "consistency" penalty alongside the standard training loss, based on the optimization problem used in KernelSHAP. Here, they plug in the learned representation, $\phi(x; \theta)$, where Shapley values are typically used in the standard loss. This consistency penalty can also be seen as an amortization of the KernelSHAP problem and is very similar to the loss in FastSHAP [1].
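Under this reading, the forward pass might be sketched as follows (a minimal sketch with hypothetical names for the binary case; multi-class models would output one value per feature and class):

```python
import torch
import torch.nn as nn

class ViaSHAPHead(nn.Module):
    """Sketch: wrap a representation network so that its d outputs act as
    per-feature Shapley values and their (link-transformed) sum is the
    prediction."""

    def __init__(self, representation: nn.Module, link=torch.sigmoid):
        super().__init__()
        self.representation = representation  # maps (batch, d) -> (batch, d)
        self.link = link

    def forward(self, x):
        phi = self.representation(x)       # candidate Shapley values
        y_hat = self.link(phi.sum(dim=1))  # prediction = link(sum of values)
        return y_hat, phi
```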
The authors validate their approach on various tabular datasets using both MLPs and KANs to learn the representation, $\phi$. Although predictive performance with MLPs is not great, KANs perform really well, achieving results comparable to state-of-the-art methods like XGBoost. They also show that their amortized approach performs favorably compared to KernelSHAP, which is ultimately what it should do.
[1] Jethani, N., Sudarshan, M., Covert, I., Lee, S.-I., & Ranganath, R. (2022). FastSHAP: Real-Time Shapley Value Estimation. arXiv. https://arxiv.org/abs/2107.07436
Strengths
- The paper is well-presented and very clear. The explanation of Shapley values is generally easy to follow for newcomers to the field, while still highlighting key points for those familiar with it. The plots are compelling and help illustrate the algorithm, and I particularly liked the diagrams in figures 1-2.
- The evaluation is thorough, and there is sufficient evidence to support the authors’ claims.
- The application of KANs is interesting (and somewhat surprising); I didn’t expect them to outperform MLPs by such a margin. I think this has value to the community.
- While the proposed loss is not entirely novel, the approach is an interesting extension of models designed to compute Shapley values. I especially appreciate how this approach frames learning Shapley values as a representation learning problem.
Weaknesses
- A more thorough discussion and comparison with FastShap would be valuable. This would clarify the similarities, as the proposed loss can be seen as almost an instantiation of FastShap’s loss (without the ViaShap component) as per equation (4) in [1]. A speed comparison with FastShap, for instance, would also be helpful.
- Related to this, the novelty is somewhat limited since the consistency penalty is quite similar to FastShap, although incorporating it into the training phase itself is innovative.
Questions
- Does the “# samples” in figure 5 refer to the number of samples used to estimate the loss?
- What was your intuition for using KANs? Did you try different models and found that KANs worked well, or do you have an idea as to why they may be particularly suited to this problem?
- Could you elaborate on your view of the main contributions of your method compared to FastShap?
Nitpicks & Suggestions
- Line 122: Incorrect use of “”.
- Line 168: Equation 4 should be capitalized.
- Line 218: Should “M” be “d”?
- Lines 329-353: Contains unnecessary detail that makes the text harder to read; suggest shortening for readability.
- Line 87: The explanation of additive models and their connection to Shapley values could be clarified.
- To improve focus, consider emphasizing KANs due to the weaker performance of MLPs. If you agree, consider moving the section on MLPs to the appendix.
We thank the reviewer for the positive feedback and helpful comments. We provide answers to their question below.
1- Does the “# samples” in figure 5 refer to the number of samples used to estimate the loss?
Yes, the figure shows the effect of the number of samples used to optimize the loss function on the training time. We also show the time required to train a model with the same architecture that does not compute the Shapley values, i.e., does not apply sampling or the Shapley loss, which is shown in the figure at # samples = 0.
2- What was your intuition for using KANs? Did you try different models and found that KANs worked well, or do you have an idea as to why they may be particularly suited to this problem?
The reason for using KANs is twofold. At some level, you are right that, by trying them, we noticed their superior performance. On the other hand, a slightly longer-term motivation is that KANs were introduced with interpretability in mind [1, 2]. In particular, their behaviour as a combination of only univariate functions allows for more transparency when adequately leveraged, for instance, through symbolic regression. While we do not exploit this property here, we believe that showing that KANs perform well in this situation opens interesting future perspectives. For instance, since our Shapley values here are functions of the input, being able to interpret that input would allow us to determine why a criterion is particularly important for a given instance.
3- Could you elaborate on your view of the main contributions of your method compared to FastShap?
We thank the reviewer for this question. Our approach (ViaSHAP) and FastSHAP are designed for different scenarios. FastSHAP is a post-hoc explanation technique, relying on a pre-trained black-box model to generate predictions, which FastSHAP then explains. In contrast, ViaSHAP is a standalone, explainable-by-design model. It inherently provides Shapley values as explanations for its predictions, without depending on or explaining the behavior of a separate pre-trained model.
To the best of our knowledge, Shapley values have always been computed post-hoc, i.e., the prediction and the black-box model must be provided first. In contrast, ViaSHAP computes the Shapley values before the game, which is one of the main contributions of the paper.
We hope this clarification adequately addresses the reviewer’s concerns and highlights the unique contributions of ViaSHAP.
4- Nitpicks & Suggestions:
We thank the reviewer very much for the helpful remarks. We will make the necessary updates to the paper to address your points.
[1]- Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljacic, Thomas Y. Hou, and Max Tegmark. Kan: Kolmogorov-arnold networks, 2024. URL https://arxiv.org/abs/2404.19756.
[2]- Liu, Z., Ma, P., Wang, Y., Matusik, W., & Tegmark, M. (2024). KAN 2.0: Kolmogorov-Arnold networks meet science. arXiv. https://arxiv.org/abs/2408.10205
Thank you for the clarification with my questions. After reading the other reviews as well as the rebuttal I have decided to maintain my score, with the clarification that I lean towards thinking the paper should be accepted. In particular, I appreciate the additional experiments that the authors provided for Uxwi and aGxt as I think they strengthen the paper. Moreover, as I stated before, but as I would like to re-emphasize, I think this application of KANs is interesting and valuable for the community. Similarly, I also think that building intrinsically explainable models that go beyond traditional machine learning models is an interesting research direction, and having this work demonstrate one possible approach to the community is valuable. My only reason towards not increasing my score is novelty as I think that the contribution, while interesting and valuable, is an incremental jump from FastShap.
We thank the reviewer for the positive feedback and are happy to answer any additional questions or provide further clarification.
The paper introduces a novel approach to generating SHAP values by integrating SHAP value loss directly into the model’s training loss, enabling the simultaneous training of a predictor and a SHAP value generator in an end-to-end manner. This is done by considering the model prediction to be the sum of SHAP values, creating training dynamics that should help develop predictors that are also explainable through the inherent SHAP generator network.
Strengths
The paper is well-structured and clearly written, making the methodology easy to follow. The authors provide a thorough experimental analysis on numerous datasets, including ablation studies on key components. They also experimented with using KANs to learn SHAP values and compared them with standard MLPs to show where each method performs best. The authors show that their method is able to achieve comparable performance (in AUC) with classic ML models on tabular datasets, while generating the predictions from predicted SHAP values.
Weaknesses
- Metrics for Comparing SHAP Values with Ground Truth: The choice of cosine similarity as a metric for assessing SHAP values may not be the most effective, as it doesn’t capture absolute differences well and only reflects directional alignment. Similarly, while Spearman’s correlation can illustrate feature impact ranking, it doesn’t prove that the generated SHAP values are actually close to the ground truth, since it ignores their absolute values. Despite offering a limited view, the reported Spearman rank correlations in Table 3 are still relatively low for several datasets, which suggests that the generated SHAP values could be further improved.
- Limited Comparison with Other Methods: Reporting SHAP value accuracy only against the baseline could be sufficient, but generally only if the generated SHAP values align very closely with the ground truth across all datasets. Since that doesn’t seem to be the case here, it’s essential to compare with other methods to understand the quality of the generated SHAP values. Given the design choices made to enable end-to-end SHAP generation, it would still be practical to compare with post-hoc methods like FastSHAP. Although these methods require post-hoc computation (which this work aims to avoid), FastSHAP, for instance, is relatively easy to train and very efficient at inference. Such comparisons would provide a more complete picture of where this approach stands relative to others.
- Unclear Practical Scope and Complexity: The practical scope of this method isn’t fully addressed in the paper. Using SHAP values as predictors, while shown to be effective in experiments, increases the model’s complexity and could make the training process more difficult to control. This approach also limits the range of possible predictors. It’s not fully convincing that avoiding post-hoc SHAP computation is a big enough advantage to justify these constraints. A discussion on practical considerations, such as where this approach might be preferable despite its complexity, would be useful.
Questions
- You consider the output to be the sum of the computed SHAP values of all features/columns ($\hat{y} = \sum_i \phi_i$). However, the computed SHAP values should sum up to the prediction minus the expected prediction ($\sum_i \phi_i = f(x) - \mathbb{E}[f(x)]$). Are you assuming $\mathbb{E}[f(x)] = 0$?
- Achieving high SHAP accuracy using a metric like $R^2$, which more directly measures how close the generated SHAP values are to the ground truth, would provide a more convincing validation of the model’s accuracy claims. If the authors could provide Table 2 with $R^2$ values, it would give a clearer view of whether the empirical results support the claims of accurately generated SHAP values.
- It would be informative to evaluate the trained predictor from this work as the black-box model in FastSHAP. This should give a better view on the relative quality of generated SHAP values.
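For context on the first question, the efficiency property at issue can be written out explicitly (notation assumed, not taken from the paper):

```latex
% Efficiency: Shapley values sum to the prediction minus the baseline value.
\sum_{i=1}^{d} \phi_i(x) = f(x) - \mathbb{E}\left[f(X)\right]
% Hence, taking \hat{y} = \sum_{i} \phi_i(x) as the prediction implicitly
% assumes a zero baseline, \mathbb{E}[f(X)] = 0.
```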
In response to the Weakness point number 3 (Unclear Practical Scope and Complexity):
In Appendix K, we presented the computational cost in terms of training time, inference time, and the impact of varying the number of samples on training time. However, we need to clarify that users are free to employ any deep learning architecture, and the complexity of the model will depend on their choice. Moreover, any extra computational cost arising from sampling feature coalitions is incurred solely during the training phase. During inference, the computational cost remains identical to that of a similar architecture that does not compute Shapley values.
We thank the reviewer for their time and the helpful review. In the following, we provide answers to the raised questions.
Question 1:
Indeed, and as described in section 2.2, the Shapley values do not explain $f(x)$ on its own, but the difference in output between $f(x)$ and a baseline output (which we called $\mathcal{V}ia^{SHAP}(0)$). Usually, the baseline is chosen, as you suggest, as the average output of the model over the training set. In our case, we use linear normalization so that the average value over each feature is $0$. Thus, applying a mask over a feature, and setting its normalized value to $0$, is equivalent to setting this feature's value to the average value over the dataset. Thus, what we explain is actually $\mathcal{V}ia^{SHAP}(x) - \mathcal{V}ia^{SHAP}(\mathbb{E}[x])$. In the context of image classification, one could see this as classifying a fully gray image, which is not expected to belong to any class. Thus, it is true that our model assumes that the average input belongs to no class (akin to a 'grey picture'), since the prediction for this average example is $0$.
Nonetheless, it is true that, in some cases, particularly with heavily imbalanced datasets, the "average value" might still be strongly representing one class over others. We agree that we need to address this limitation more clearly in the paper, and thank the reviewer for pointing it out. We see that, from our results, the model still performs remarkably well with this assumption.
Question 2:
Since we allow for the application of a link function to accommodate a valid range of outcomes (e.g., probabilities in $[0, 1]$) as mentioned in section 3.1, the Shapley values computed by ViaSHAP are not on the same scale as the KernelSHAP values, i.e., the values of the two models can have different magnitudes but maintain relative importance. Therefore, $R^2$ is not a suitable metric to measure the similarity between the different approximations. We also based our selection of the similarity metrics on the following work [1], which argued that Spearman’s rank correlation is a suitable measure for comparing explanations in general.
However, to address the valid concerns of the reviewer, we conducted an ablation study on the effect of applying a link function on the predictive performance of ViaSHAP as well as on the accuracy of approximating the Shapley values. The results (Table 1) show that the link function does not affect the predictive performance significantly, while its removal significantly improves the accuracy of the Shapley values, as shown in Table 2 below. This experiment allowed for the use of $R^2$ as a similarity metric, which turned out to be consistent with the results reported using Spearman’s rank and the cosine similarity.
Table 1: A comparison of the predictive performance (measured in AUC) between ViaSHAP with and without a link function applied to the output.
| Dataset | ViaSHAP (without a link function) | ViaSHAP (default settings) |
|---|---|---|
| Abalone | 0.883 ± 0.0002 | 0.87 ± 0.003 |
| Ada Prior | 0.898 ± 0.0026 | 0.89 ± 0.005 |
| Adult | 0.919 ± 0.0005 | 0.914 ± 0.003 |
| Bank32nh | 0.883 ± 0.0028 | 0.878 ± 0.001 |
| Electricity | 0.934 ± 0.0044 | 0.93 ± 0.004 |
| Elevators | 0.936 ± 0.0025 | 0.935 ± 0.002 |
| Fars | 0.958 ± 0.0015 | 0.96 ± 0.0003 |
| Helena | 0.868 ± 0.0056 | 0.884 ± 0.0001 |
| Heloc | 0.792 ± 0.0006 | 0.788 ± 0.002 |
| Higgs | 0.801 ± 0.0007 | 0.801 ± 0.001 |
| hls4ml lhc jets hlf | 0.939 ± 0.0005 | 0.944 ± 0.0001 |
| House 16H | 0.949 ± 0.0011 | 0.949 ± 0.0007 |
| Indian Pines | 0.982 ± 0.0012 | 0.985 ± 0.0004 |
| Jannis | 0.861 ± 0.0009 | 0.864 ± 0.001 |
| JM1 | 0.686 ± 0.0245 | 0.732 ± 0.003 |
| Magic Telescope | 0.921 ± 0.0024 | 0.929 ± 0.001 |
| MC1 | 0.952 ± 0.0106 | 0.94 ± 0.003 |
| Microaggregation2 | 0.764 ± 0.0079 | 0.783 ± 0.002 |
| Mozilla4 | 0.965 ± 0.0014 | 0.968 ± 0.0008 |
| Satellite | 0.944 ± 0.0103 | 0.996 ± 0.001 |
| PC2 | 0.659 ± 0.0601 | 0.827 ± 0.009 |
| Phonemes | 0.923 ± 0.0030 | 0.946 ± 0.003 |
| Pollen | 0.501 ± 0.0018 | 0.515 ± 0.006 |
| Telco Customer Churn | 0.857 ± 0.0034 | 0.854 ± 0.003 |
| 1st order theorem proving | 0.810 ± 0.0059 | 0.822 ± 0.002 |
[1]- Rahnama, A. (2024). The blame problem in evaluating local explanations and how to tackle it. In Artificial Intelligence. ECAI 2023 International Workshops (pp. 66–86). Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-50396-2_4
Table 2: The accuracy of approximating the Shapley values of ViaSHAP with and without a link function applied to the output.
| Dataset | Cosine (without link) | Spearman (without link) | R² (without link) | Cosine (default) | Spearman (default) |
|---|---|---|---|---|---|
| Abalone | 0.999 ± 0.0008 | 0.971 ± 0.052 | 0.999 ± 0.002 | 0.9693 ± 0.0166 | 0.6635 ± 0.234 |
| Ada Prior | 0.963 ± 0.037 | 0.909 ± 0.068 | 0.9 ± 0.095 | 0.9346 ± 0.046 | 0.8763 ± 0.088 |
| Adult | 0.981 ± 0.03 | 0.931 ± 0.074 | 0.948 ± 0.079 | 0.9306 ± 0.049 | 0.9594 ± 0.035 |
| Bank32nh | 0.948 ± 0.045 | 0.648 ± 0.114 | 0.87 ± 0.142 | 0.779 ± 0.163 | 0.432 ± 0.151 |
| Electricity | 0.998 ± 0.004 | 0.967 ± 0.043 | 0.992 ± 0.012 | 0.9703 ± 0.02 | 0.7983 ± 0.183 |
| Elevators | 0.997 ± 0.004 | 0.969 ± 0.026 | 0.993 ± 0.009 | 0.966 ± 0.024 | 0.9203 ± 0.064 |
| Fars | 0.962 ± 0.036 | 0.882 ± 0.073 | 0.895 ± 0.073 | 0.8859 ± 0.253 | 0.347 ± 0.328 |
| Helena | 0.874 ± 0.095 | 0.702 ± 0.148 | 0.016 ± 1.307 | 0.8562 ± 0.092 | 0.669 ± 0.152 |
| Heloc | 0.962 ± 0.036 | 0.882 ± 0.073 | 0.895 ± 0.105 | 0.8438 ± 0.111 | 0.7409 ± 0.147 |
| Higgs | 0.991 ± 0.006 | 0.87 ± 0.057 | 0.977 ± 0.014 | 0.9169 ± 0.068 | 0.674 ± 0.12 |
| hls4ml lhc jets hlf | 0.999 ± 0.002 | 0.974 ± 0.032 | 0.998 ± 0.005 | 0.9712 ± 0.021 | 0.8575 ± 0.119 |
| House 16H | 0.988 ± 0.015 | 0.952 ± 0.044 | 0.961 ± 0.057 | 0.9195 ± 0.048 | 0.8876 ± 0.092 |
| Indian Pines | 0.683 ± 0.171 | 0.553 ± 0.18 | 0.333 ± 0.192 | 0.7958 ± 0.121 | 0.6991 ± 0.116 |
| Jannis | 0.898 ± 0.072 | 0.624 ± 0.113 | 0.722 ± 0.183 | 0.852 ± 0.141 | 0.4775 ± 0.131 |
| JM1 | 0.965 ± 0.042 | 0.916 ± 0.085 | 0.901 ± 0.094 | 0.88 ± 0.044 | 0.7561 ± 0.202 |
| Magic Telescope | 0.994 ± 0.006 | 0.959 ± 0.042 | 0.98 ± 0.02 | 0.9224 ± 0.067 | 0.9 ± 0.098 |
| MC1 | 0.951 ± 0.093 | 0.881 ± 0.139 | 0.873 ± 0.332 | 0.4659 ± 0.268 | 0.6212 ± 0.157 |
| Microaggregation2 | 0.982 ± 0.021 | 0.957 ± 0.049 | 0.929 ± 0.114 | 0.9382 ± 0.049 | 0.8756 ± 0.096 |
| Mozilla4 | 0.9998 ± 0.0003 | 0.967 ± 0.074 | 0.9996 ± 0.0007 | 0.9529 ± 0.023 | 0.9423 ± 0.092 |
| Satellite | 0.976 ± 0.033 | 0.894 ± 0.102 | 0.814 ± 0.296 | 0.8411 ± 0.116 | 0.746 ± 0.212 |
| PC2 | 0.956 ± 0.087 | 0.875 ± 0.127 | 0.895 ± 0.223 | 0.534 ± 0.183 | 0.7326 ± 0.161 |
| Phonemes | 0.993 ± 0.013 | 0.951 ± 0.094 | 0.975 ± 0.076 | 0.8112 ± 0.162 | 0.9407 ± 0.103 |
| Pollen | 0.994 ± 0.013 | 0.959 ± 0.076 | 0.929 ± 0.212 | 0.9517 ± 0.059 | 0.372 ± 0.429 |
| Telco Customer Churn | 0.978 ± 0.025 | 0.934 ± 0.052 | 0.939 ± 0.054 | 0.8098 ± 0.108 | 0.8476 ± 0.098 |
| 1st order theorem proving | 0.778 ± 0.123 | 0.66 ± 0.146 | 0.429 ± 0.479 | 0.7254 ± 0.179 | 0.6228 ± 0.188 |
Question 3:
We thank the reviewer for pointing out this limitation in the paper. In response, we evaluated the explanations of ViaSHAP in comparison with FastSHAP, where ViaSHAP is provided as a black box in FastSHAP. The evaluation metrics, including $R^2$, cosine similarity, and Spearman's rank correlation, demonstrate that ViaSHAP significantly outperforms FastSHAP in terms of the accuracy of the computed Shapley values. The results are shown in Table 3.
Table 3: A comparison between the accuracy of ViaSHAP and FastSHAP in approximating the Shapley values
| Dataset | ViaSHAP Cosine | FastSHAP Cosine | ViaSHAP Spearman | FastSHAP Spearman | ViaSHAP R² | FastSHAP R² |
|---|---|---|---|---|---|---|
| Abalone | 0.999 ± 0.0008 | 0.999 ± 0.002 | 0.971 ± 0.05 | 0.966 ± 0.05 | 0.999 ± 0.002 | 0.996 ± 0.008 |
| Ada Prior | 0.963 ± 0.037 | 0.703 ± 0.25 | 0.909 ± 0.07 | 0.64 ± 0.2 | 0.887 ± 0.105 | 0.042 ± 1.359 |
| Adult | 0.981 ± 0.03 | 0.956 ± 0.072 | 0.931 ± 0.07 | 0.893 ± 0.11 | 0.952 ± 0.072 | 0.853 ± 0.298 |
| Bank32nh | 0.948 ± 0.045 | 0.897 ± 0.079 | 0.648 ± 0.11 | 0.527 ± 0.13 | 0.852 ± 0.161 | 0.728 ± 0.29 |
| Electricity | 0.998 ± 0.004 | 0.978 ± 0.06 | 0.967 ± 0.04 | 0.921 ± 0.1 | 0.993 ± 0.011 | 0.914 ± 0.306 |
| Elevators | 0.997 ± 0.004 | 0.994 ± 0.006 | 0.969 ± 0.03 | 0.941 ± 0.05 | 0.993 ± 0.009 | 0.983 ± 0.023 |
| Fars | 0.997 ± 0.008 | 0.997 ± 0.021 | 0.849 ± 0.1 | 0.834 ± 0.12 | 0.994 ± 0.022 | 0.991 ± 0.028 |
| Helena | 0.874 ± 0.095 | 0.822 ± 0.139 | 0.702 ± 0.15 | 0.6 ± 0.19 | 0.677 ± 0.204 | 0.532 ± 0.29 |
| Heloc | 0.962 ± 0.036 | 0.935 ± 0.064 | 0.882 ± 0.07 | 0.826 ± 0.11 | 0.894 ± 0.098 | 0.824 ± 0.177 |
| Higgs | 0.991 ± 0.006 | 0.994 ± 0.004 | 0.87 ± 0.06 | 0.899 ± 0.05 | 0.977 ± 0.014 | 0.986 ± 0.01 |
| hls4ml lhc jets hlf | 0.999 ± 0.002 | 0.999 ± 0.003 | 0.974 ± 0.03 | 0.971 ± 0.03 | 0.998 ± 0.005 | 0.997 ± 0.016 |
| House 16H | 0.988 ± 0.015 | 0.964 ± 0.035 | 0.952 ± 0.04 | 0.891 ± 0.1 | 0.964 ± 0.039 | 0.89 ± 0.107 |
| Indian Pines | 0.683 ± 0.171 | 0.423 ± 0.154 | 0.553 ± 0.18 | 0.204 ± 0.12 | 0.333 ± 0.192 | -0.615 ± 0.912 |
| Jannis | 0.898 ± 0.072 | 0.92 ± 0.064 | 0.624 ± 0.11 | 0.673 ± 0.11 | 0.722 ± 0.183 | 0.776 ± 0.179 |
| JM1 | 0.965 ± 0.042 | 0.98 ± 0.042 | 0.916 ± 0.08 | 0.934 ± 0.08 | 0.887 ± 0.206 | 0.925 ± 0.37 |
| Magic Telescope | 0.994 ± 0.006 | 0.984 ± 0.023 | 0.959 ± 0.04 | 0.918 ± 0.08 | 0.98 ± 0.021 | 0.946 ± 0.094 |
| MC1 | 0.951 ± 0.093 | 0.789 ± 0.254 | 0.881 ± 0.14 | 0.638 ± 0.3 | 0.881 ± 0.346 | -0.024 ± 9.964 |
| Microaggregation2 | 0.982 ± 0.021 | 0.99 ± 0.017 | 0.957 ± 0.05 | 0.97 ± 0.04 | 0.944 ± 0.061 | 0.966 ± 0.054 |
| Mozilla4 | 0.9998 ± 0.0003 | 0.994 ± 0.017 | 0.967 ± 0.07 | 0.921 ± 0.14 | 0.9996 ± 0.0007 | 0.984 ± 0.049 |
| Satellite | 0.976 ± 0.033 | 0.858 ± 0.114 | 0.894 ± 0.1 | 0.55 ± 0.25 | 0.873 ± 0.151 | 0.126 ± 0.793 |
| PC2 | 0.956 ± 0.087 | 0.786 ± 0.234 | 0.875 ± 0.13 | 0.619 ± 0.25 | 0.891 ± 0.272 | 0.274 ± 1.616 |
| Phonemes | 0.993 ± 0.013 | 0.981 ± 0.036 | 0.951 ± 0.094 | 0.946 ± 0.1 | 0.971 ± 0.071 | 0.925 ± 0.165 |
| Pollen | 0.994 ± 0.013 | 0.984 ± 0.024 | 0.959 ± 0.076 | 0.905 ± 0.13 | 0.933 ± 0.276 | 0.855 ± 0.23 |
| Telco Customer Churn | 0.978 ± 0.025 | 0.963 ± 0.045 | 0.934 ± 0.052 | 0.892 ± 0.09 | 0.924 ± 0.085 | 0.899 ± 0.109 |
| 1st order theorem proving | 0.778 ± 0.123 | 0.776 ± 0.174 | 0.66 ± 0.146 | 0.658 ± 0.21 | 0.429 ± 0.479 | 0.367 ± 2.832 |
Thank you once again for your thoughtful review of our work. Based on your review, we have conducted additional experiments and gathered new evidence to address the concerns you raised. We believe the results strengthen the manuscript. We would greatly appreciate your perspective on the latest findings.
Q3- Thanks as well for using your method as a black box in FastSHAP. This should illustrate how relatively good your method is. It is very impressive that ViaSHAP is generating better SHAP values than FastSHAP. There are also datasets like MC1 and Ada Prior where FastSHAP’s $R^2$ compared to the ground truth is really off, while ViaSHAP is doing great.
- What do you think explains this amount of difference? FastSHAP’s $R^2$ is near or below zero in some of the datasets. It means it does not explain any of the variability in the ground truth SHAP values. Or put simply, it is the same as a simple baseline model that always predicts the mean vector for each data point. It was unexpected to see such a performance.
- You mentioned you used the default setting for all algorithms. If I’m not wrong, the default setting for FastSHAP is their surrogate model that computes the conditional SHAP, as opposed to KernelSHAP that computes the interventional SHAP, used for generating ground truth values.
- Would be great if you could also include the settings for the FastSHAP you used, including the network details and the hyperparameters. Please excuse me if it is already in the Appendices but I overlooked!
W3 / Unclear Practical Scope and Complexity - My point is more on the complexity of controlling the training dynamics of the predictor. It is true that the method is not bound to a specific model, but the predictions are always eventually going to be the sum of SHAP values. So comparing it to the alternative of having full control over directly predicting the output from the inputs, it is naturally a more restrictive class of predictors with fewer degrees of freedom. If the authors could highlight more on the practicality of such a method, it can strengthen their arguments.
What do you think explains this amount of difference? FastSHAP’s $R^2$ is near or below zero in some of the datasets. It means it does not explain any of the variability in the ground truth SHAP values. Or put simply, it is the same as a simple baseline model that always predicts the mean vector for each data point. It was unexpected to see such a performance.
ViaSHAP is the original model, and its predictions are derived from the Shapley values, which we believe is a key advantage over any post-hoc explanation method. In contrast, FastSHAP is an approximation of the model provided by ViaSHAP.
There are 7 datasets where the performance of FastSHAP was remarkably worse than ViaSHAP's. These 7 datasets are challenging, i.e., they have either a limited number of training examples (Ada Prior, Indian Pines, MC1, Satellite, PC2, and 1st Order Theorem Proving), a relatively large number of features (Indian Pines and 1st Order Theorem Proving), or numerous classes to predict (Helena and 1st Order Theorem Proving). A detailed description of the datasets is available in Table 13 in Appendix L.
You mentioned you used the default setting for all algorithms. If I’m not wrong, the default setting for FastSHAP is their surrogate model that computes the conditional SHAP, as opposed to KernelSHAP that computes the interventional SHAP, used for generating ground truth values.
We used ViaSHAP directly as a black box in FastSHAP, without a surrogate model, with the same baseline removal approach described above. The default settings refer to the architecture and the hyperparameters. We will further clarify this point in the paper.
Would be great if you could also include the setting for the FastSHAP you used, including the network details and the hyperparameters. Please excuse me if it is already in the Appendices but I overlooked!
We used a network composed of an input layer that maps $n$ features to 128 dimensions, a hidden layer of 128 × 128, and an output layer that maps 128 to ($n$ × number of classes), with ReLU activation functions after the first and second layers. The number of samples is 32, the maximum number of epochs is 200, and the number of validation samples is 128. The settings used are available in [1].
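For concreteness, that configuration corresponds roughly to the following sketch (`n_features` and `n_classes` stand in for the dataset-specific values):

```python
import torch.nn as nn

def make_fastshap_explainer(n_features: int, n_classes: int) -> nn.Sequential:
    """The explainer MLP described above: n_features -> 128 -> 128 ->
    n_features * n_classes, with ReLU after the first two layers."""
    return nn.Sequential(
        nn.Linear(n_features, 128),
        nn.ReLU(),
        nn.Linear(128, 128),
        nn.ReLU(),
        nn.Linear(128, n_features * n_classes),
    )
```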
W3 / Unclear Practical Scope and Complexity - My point is more on the complexity of controlling the training dynamics of the predictor. It is true that the method is not bound to a specific model, but the predictions are always eventually going to be the sum of SHAP values. So comparing it to the alternative of having full control over directly predicting the output from the inputs, it is naturally a more restrictive class of predictors with fewer degrees of freedom. If the authors could highlight more on the practicality of such a method, it can strengthen their arguments.
If we understand the latest description of the weakness correctly, the expectation is that a model identical to ViaSHAP (but not constrained to learning Shapley values) would outperform ViaSHAP in terms of predictive performance. Therefore, and as requested by the reviewers, we compared ViaSHAP to an identical architecture that is not restricted to computing Shapley values, i.e., a model that has full control of directly predicting the output from the inputs. The results were counter-intuitive: ViaSHAP significantly outperforms the unrestricted alternative with respect to the predictive performance. Our explanation/speculation is that the Shapley component of the loss function appears to have a regularizing effect on the optimization process of the evaluated models. The experiment and the detailed results are available in Appendix G.
[1]- https://github.com/iancovert/fastshap/blob/main/notebooks/census.ipynb
Thank you for all the answers. I want to first mention that I genuinely appreciate the authors' willingness to engage in the discussions and to provide extra details and experiments wherever needed. I am trying to convince myself of the methodological soundness of the proposed algorithm before considering increasing my score.
I want to focus on my initial Q1 regarding the theoretical aspects of the method without discussing the rest, as that is currently the point I am most interested in clarifying first.
Thanks for referring me to line 326. That was the part I missed, which was causing some of the confusion. So basically there is a pre-processing step that centers all features at zero. That is fine; then we agree that $\mathbf{0} \approx \mathbb{E}[x]$. Now, considering the sum of SHAP values to be $\sum_{i=1}^{n} \phi_i(x) = f(x) - f(\mathbf{0})$, how are you ensuring $f(\mathbf{0}) = 0$?
Thanks a lot for going through my review and providing the extra information I asked for. I will go through each part individually.
Q1 - Thanks for your answer and for referring me to the corresponding section. There are still points that I would like to discuss with the authors and hopefully clarify.
Indeed, and as described in Section 2.2, the Shapley values do not explain $f(x)$ on its own, but the difference between the output $f(x)$ and a baseline output (which we called $f(\mathbf{0})$).
Reading line 122, I understand you are using the vector $\mathbf{0}$ as your baseline, so you are assuming your baseline value to be $f(\mathbf{0})$. The baseline value depends on the version of SHAP values you are computing (this is discussed in many papers, including [1] and [2]). As mentioned in line 327, and given that in your experiments you compare your values against KernelSHAP's values as ground truth, I can infer that interventional SHAP is what you are computing with your method (by the way, Chen et al. (2020) suggested that interventional SHAP is more true to the model than to the data, contrasting with what is mentioned in line 328). In the interventional setting, the baseline value is $\mathbb{E}_{x \sim \mathcal{D}}[f(x)]$, where $\mathcal{D}$ is a background dataset. This means that you are assuming this baseline value to be zero, which I believe is not obvious at all and requires extensive justification. Assuming $f(\mathbf{0})$ to be a proxy for $\mathbb{E}[f(x)]$, this is as if you assume the expected value of the predictor to always be zero. Imagine expecting this from a regressor that is supposed to predict the height of people based on some features.
In our case, we use linear normalization so that the average value of each feature is 0. Thus, applying a mask over a feature, i.e., setting its normalized value to 0, is equivalent to setting the feature to its average value over the dataset.
I could not find the details on this “linear normalization” in the paper. If you are referring to the batch normalisations done in your MLPs, this was not part of your assumptions in Section 2.2. In line 111, it is said that $f$ is “a trained model” with no extra constraints. Even for the KANs that you used, this is not an obvious assumption. It would be great if you could elaborate on what you mean by “we use linear normalization” and “setting its normalized value to 0”.
applying a mask over a feature, and setting its normalized value to 0, is equivalent to setting this feature's value to the average value over the dataset.
Again, it is not obvious here why setting the masked values to zero is the same as setting them to the average value over the dataset. The assumption that all features in the dataset are centered at zero is something that I neither see stated in the paper nor find realistic.
Thus, what we explain is actually $f(x) - f(\mathbb{E}[x])$.
Let's imagine we accept $f(\mathbf{0}) \approx f(\mathbb{E}[x])$. If you are assuming $f(x) = \sum_i \phi_i(x)$, as ViaSHAP is supposed to be a predictor that generates the final predictions, then you are also assuming $f(\mathbb{E}[x]) = 0$. I can accept that, with the constraints you considered in your training, ViaSHAP might become an estimator with this attribute. But this does not necessarily sound like a realistic assumption for an arbitrary predictor and needs to be both clearly stated and justified in the paper.
Nonetheless, it is true that, in some cases, particularly with heavily imbalanced datasets, the "average value" $\mathbb{E}[x]$ might still strongly represent one class over the others. We agree that we need to address this limitation more clearly in the paper, and we thank the reviewer for pointing it out. From our results, we see that the model still performs remarkably well under this assumption.
I strongly advise the authors to work out the theoretical aspects of their assumptions before looking into the experimental results. High accuracies in empirical results do not necessarily guarantee the soundness of the method. I believe the theoretical aspects need to be clarified first and then backed up with strong empirical results. I also wish I could see the limitations being mentioned and addressed in the updates, as promised by the authors, but I believe they are still missing.
Q2 - Thanks a lot for providing the $R^2$ scores between ViaSHAP's outputs and the ground truth. It is impressive how high the scores are in most of the cases. I still have a few questions.
- I did not find the details on what has been used as the background dataset?
- Why do you think the achieved $R^2$ on Helena is that low?
[1] Sundararajan, M. & Najmi, A. (2020). The Many Shapley Values for Model Explanation.
[2] Hugh Chen, Joseph D. Janizek, Scott Lundberg, and Su-In Lee. True to the model or true to the data? arXiv preprint arXiv:2006.16234, 2020.
We thank the reviewer for the thoughtful questions.
Reading line 122, I understand you are using the vector $\mathbf{0}$ as your baseline, so you are assuming your baseline value to be $f(\mathbf{0})$ ...
From the sources you suggested [1,2], along with an additional relevant reference [3], we identified the following three descriptions of interventional Shapley values:
[1]: "One type of what-if analysis (e.g. BShap) performs interventions on the feature, while another (e.g. CES) marginalizes the feature over the training data. The former may construct out-of-distribution inputs, but regularization can ensure reasonable model behavior on these inputs"
[2]: Interventional: "we 'intervene' on the features by breaking the dependence between the features in $S$ and the remaining features. We refer to Shapley values obtained with either approach as either observational or interventional Shapley values."
[3]: "Notably observational SHAP takes the underlying data distribution with its dependencies into account by using (observational) conditional expectations. Whereas interventional SHAP breaks up feature dependencies via interventions and therefore puts more emphasis on the model."
If our terminology is accurate, our approach can be classified as interventional, as it replaces feature values without considering the dependencies between features (the reviewer is correct that the interventional Shapley values are more true to the model). However, we do not use the interventional conditional expectation described in [2]. In FastSHAP [6], the authors evaluated three approaches in Section 5.2: Surrogate/In-distribution, Marginal/Out-of-distribution, and Baseline Removal. Our method follows the third approach, Baseline Removal.
If there is any confusion regarding our terminology, we sincerely apologize and are more than willing to clarify and correct it in the paper to eliminate any ambiguity on this specific point.
I could not find the details on this “linear normalization” in the paper...
In the paper, in line 326, we referred to the normalization as standard normalization [4], which normalizes the values of a feature $j$ as follows:
$$x_j' = \frac{x_j - \mu_j}{\sigma_j},$$
where $\mu_j$ is the mean of feature $j$ in the training data, and $\sigma_j$ is the standard deviation of $x_j$ in the training dataset. The same normalization operation has been used by FastSHAP here [5].
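For illustration, this operation corresponds to scikit-learn's StandardScaler (the toy data below is ours):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[160.0, 55.0],
                    [170.0, 70.0],
                    [180.0, 85.0]])        # toy training data

scaler = StandardScaler().fit(X_train)     # estimates mu_j and sigma_j per feature
X_norm = scaler.transform(X_train)         # x'_j = (x_j - mu_j) / sigma_j

# After scaling, every feature has mean 0, so the zero vector coincides
# with the per-feature training mean.
print(X_norm.mean(axis=0))                 # ~[0. 0.]
```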
Again, here it is not obvious why setting the value of the masked values to zero is as if they are set to the average value ...
We apply the following quoted approach from FastSHAP, section 5.2 [6]:
(Baseline removal) $v_x(S) = f(x_S; b_{\bar{S}})$, where $b$ are fixed baseline values (the mean for continuous features and the mode for discrete ones).
Since our preprocessing centers the feature values around zero, we use zero as the default baseline value. If the user decides to use a different normalization operation or a different preprocessing methodology, then the mean value is applied instead.
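A minimal sketch of this value function under our conventions (the function and argument names are ours, not from the paper):

```python
import numpy as np

def value_function(model, x, S, baseline=None):
    """Baseline-removal value function v_x(S).

    model    : any trained predictor taking a 2-D array
    x        : 1-D array of normalized feature values
    S        : boolean mask, True for features kept in the coalition
    baseline : per-feature baseline values; zeros after standard normalization
    """
    b = np.zeros_like(x) if baseline is None else baseline
    x_masked = np.where(S, x, b)        # replace masked features by baselines
    return model(x_masked[None, :])     # evaluate the model on the masked input
```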
Again, we apologize for any confusion and are willing to clarify the text.
Let's imagine we accept $f(\mathbf{0}) \approx f(\mathbb{E}[x])$. If you are assuming $f(x) = \sum_i \phi_i(x)$, as ViaSHAP is supposed to be a predictor that generates the final predictions...
As we mentioned above, $\mathbf{0} = \mathbb{E}[x]$ holds if standard normalization is applied, which centers the features around 0. Otherwise, we totally agree with the reviewer.
I also wish I could see the limitations being mentioned and addressed in the updates as promised by the authors, but I believe they are still missing.
We will definitely include a section outlining all the limitations we are aware of. We apologize for not being able to implement all the requested updates, since we had very limited time to conduct all the experiments suggested by the reviewers and update the paper before the revision deadline.
- I did not find the details on what has been used as the background dataset?
As we mentioned above, we used the baseline removal approach, and the baseline values are the mean values of the training data; this has been applied to both ViaSHAP and FastSHAP.
- Why do you think the achieved $R^2$ on Helena is that low?
Our explanation (hypothesis) is that the Helena dataset has 100 classes, which significantly raises the complexity of the explanation task, making it more challenging for the model.
[1] Sundararajan, M. & Najmi, A.. (2020). The Many Shapley Values for Model Explanation
[2] H Chen, JD Janizek, S Lundberg, SI Lee. True to the model or true to the data? arXiv preprint arXiv:2006.16234, 2020.
[3] A Zern, K Broelemann, G Kasneci, Interventional SHAP Values and Interaction Values for Piecewise Linear Regression Trees, AAAI-23
[4]- https://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.StandardScaler.html
[5]- https://github.com/iancovert/fastshap/blob/main/notebooks/census.ipynb
[6]- Jethani et al., FastSHAP: Real-Time Shapley Value Estimation, ICLR2022.
We greatly appreciate the reviewer's questions and feedback, as they contribute to refining and clarifying the paper.
considering the sum of SHAP values to be $\sum_{i=1}^{n} \phi_i(x) = f(x) - f(\mathbf{0})$, how are you ensuring $f(\mathbf{0}) = 0$?
Since $f$, in our case, is a ViaSHAP model, $f$ has been optimized to minimize the following loss function and to predict $f(x) = \sum_{i=1}^{n} \phi_i(x)$:
$$\mathcal{L}_{\text{Shapley}} = \mathbb{E}_{x}\, \mathbb{E}_{S \sim p(S)} \left[ \left( \mathbf{1}_S^{\top} \Phi(x) - \left( f(x_S) - f(\mathbf{0}) \right) \right)^2 \right]$$
Let us assume a perfect minimization of the loss; then, for every input $x$ and every coalition $S$:
$$\mathbf{1}_S^{\top} \Phi(x) = f(x_S) - f(\mathbf{0})$$
Therefore, if $x$ is a vector of zeros (or mean values), taking the full coalition yields:
$$f(\mathbf{0}) = \sum_{i=1}^{n} \phi_i(\mathbf{0}) = f(\mathbf{0}) - f(\mathbf{0}) = 0$$
In practice, it is unlikely for the loss to reach its global optimum. Consequently:
$$f(\mathbf{0}) \approx 0$$
However, as the loss function converges to 0, so does $f(\mathbf{0})$.
As we mentioned before, in some cases, specifically with heavily imbalanced data, the "average value" may still predominantly represent one class over the others. This limitation also applies to FastSHAP when using a baseline removal value function. However, unlike ViaSHAP, which explicitly optimizes to ensure that $f(\mathbf{0}) = 0$, FastSHAP does not guarantee this property, as FastSHAP has no control over the black-box model. Regardless of the value of $f(\mathbf{0})$, the explanations of both ViaSHAP and FastSHAP are of $f(x) - f(\mathbf{0})$.
Nevertheless, our results demonstrate that ViaSHAP performs remarkably well under the previous assumption. In future work, we could add a learned or fixed bias to account for a shift of $f(\mathbf{0})$ without losing any model properties.
So basically your loss is forcing $f(\mathbf{0}) = 0$. I do not find this an obvious assumption for an arbitrary predictor $f$. An example is what I mentioned in my previous comments: imagining the task of predicting the height of people from some features, this basically translates to learning a predictor that returns $0$ when the average value of each feature is used as input.
Looking from a broader perspective, given that with your choice of baseline you are approximating $\mathbb{E}[f(x)]$ with $f(\mathbb{E}[x])$, it basically means you are forcing the expected value of the predictor to be $0$, assuming the choice of baseline is realistic. Such a predictor should not be a good predictor for many real-world tasks, especially regression tasks, where one can find obvious examples in which the expected value of a good estimator cannot be $0$. Reading lines 252-254, I believe these regression tasks are also supposed to fit in ViaSHAP's area of application.
That is true if we only optimize the Shapley component of the loss function. In fact, what we optimize is the following:
$$\mathcal{L} = \underbrace{-\sum_{c} y_c \log \hat{p}_c}_{\text{prediction loss}} + \mathcal{L}_{\text{Shapley}},$$
where $\hat{p}_c$ is the predicted probability of class $c$. Therefore, if forcing $f(\mathbf{0}) = 0$ results in a poor predictor, the loss value will rise, and the optimizer will shift $f(\mathbf{0})$ to a value more suitable for the predictor.
Again, regardless of the value of $f(\mathbf{0})$, the explanations of ViaSHAP are of $f(x) - f(\mathbf{0})$.
It is true that the regression tasks are also in ViaSHAP's scope of application. For instance, the cross-entropy component in the loss function can be replaced with mean squared error to train a ViaSHAP model for regression.
In an ideal scenario, such as a balanced classification dataset with perfect minimization of the Shapley loss and convergence of the prediction loss to a minimal value, the model ensures that $f(\mathbf{0})$ equals 0.
As mentioned earlier, such ideal conditions are rare in practice. However, $f(\mathbf{0})$ will converge closer to 0 as long as doing so does not lead to an increase in the combined loss value (Shapley loss + prediction loss).
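To make the interaction of the two objectives concrete, here is a hedged sketch of a single training step under this notation. It is our own simplification: coalition sampling is reduced to uniform masks rather than the Shapley kernel, and `gamma` is a hypothetical scaling weight:

```python
import torch
import torch.nn.functional as F

def training_step(viashap, x, y, gamma=1.0):
    """One combined-loss step: cross-entropy + Shapley regression term."""
    phi = viashap(x)                          # (batch, n_features, n_classes)
    logits = phi.sum(dim=1)                   # prediction = sum of Shapley values
    pred_loss = F.cross_entropy(logits, y)    # prediction loss

    # Sample one coalition per example and evaluate the masked inputs;
    # zeros act as the baseline after standard normalization.
    S = (torch.rand_like(x) < 0.5).float()    # simplification of the Shapley kernel
    f_xS = viashap(x * S).sum(dim=1)          # v_x(S) under baseline removal
    f_0 = viashap(torch.zeros_like(x)).sum(dim=1)  # v_x(empty) = f(0)

    # Efficiency-constrained target: 1_S^T Phi(x) = f(x_S) - f(0)
    coalition_sum = torch.einsum('bfc,bf->bc', phi, S)
    shap_loss = ((coalition_sum - (f_xS - f_0)) ** 2).mean()

    return pred_loss + gamma * shap_loss
```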
Only reiterating what is suggested by you: I see that the Shapley component of the loss results in a predictor with $f(\mathbf{0})$ of zero at the global optimum. On the other hand, your prediction loss, at its global optimum, results in a predictor whose baseline output lies somewhere around the expected value of the data distribution. These are two optima with fundamental differences. I am not convinced that minimising their sum will guarantee stable training and fundamentally sound optimisation, especially when the expected value of the label distribution is different from $0$. I believe ensuring correct SHAP values and predictions at the same time is not as easy as adding the two losses and optimising the sum.
I appreciate the authors' attempt at using KANs for computing SHAP values. However, the core idea behind the method, which seems to be only the addition of the prediction loss to FastSHAP while heavily constraining the original method by limiting it to zero baselines as the value function, does not sound very novel to me. For such limited novelty, at least a thorough analytical analysis of the methodological soundness and optimisation dynamics is needed.
I am very happy that the authors could still present good empirical results on the classification datasets tested. However, I think the work is prone to fundamental issues while also not being novel enough to be accepted at this conference. I will not reduce my score, in favour of the good empirical results provided and of experimenting with KANs for computing SHAP values.
We appreciate that the reviewer decided not to reduce the paper's score, and we fully understand their concerns. However, we respectfully disagree with the assessment that the current optimization process is unstable, has conflicting objectives, or leads to poor predictors. Therefore, we tested the impact of the efficiency constraint (which forces $f(\mathbf{0}) = 0$) on the predictive performance, the accuracy of the Shapley values, and the number of epochs before early stopping. To do so, we remove $f(\mathbf{0})$ from the loss function as follows:
$$\mathcal{L}_{\text{Shapley}}' = \mathbb{E}_{x}\, \mathbb{E}_{S \sim p(S)} \left[ \left( \mathbf{1}_S^{\top} \Phi(x) - f(x_S) \right)^2 \right]$$
As a result, the model explanation becomes $\Phi(x) - \Phi(\mathbf{0})$ instead of $\Phi(x)$, which means the model is no longer constrained to predict any particular baseline, and the prediction loss has full control over the expected value of $f$.
At inference time, the user can choose to use $\Phi(x)$ as an explanation or to compute $\Phi(x) - \Phi(\mathbf{0})$ as well.
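In code, the two inference-time options amount to the following (a sketch, reusing the notation above):

```python
import torch

def explain(viashap, x, subtract_baseline=True):
    """Return Phi(x) or the baseline-corrected Phi(x) - Phi(0)."""
    phi_x = viashap(x)
    if subtract_baseline:
        phi_0 = viashap(torch.zeros_like(x))  # values at the (zero) baseline
        return phi_x - phi_0                  # sums to f(x) - f(0) per class
    return phi_x                              # sums to f(x) per class
```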
Due to time constraints and the short notice for conducting the experiments, we were able to gather evidence on 8 of the 25 originally planned datasets. In Table 4, we compare the predictive performance of the originally proposed loss function with that of its unconstrained variant. The results indicate that removing the efficiency constraint from the loss function does not lead to improved predictive performance.
Table 4: The predictive performance (AUC) of the constrained and unconstrained ViaSHAP
| Dataset | ViaSHAP (unconstrained) | ViaSHAP (with efficiency constraint) |
|---|---|---|
| Abalone | 0.883 ± 0.0003 | 0.883 ± 0.0002 |
| Ada Prior | 0.897 ± 0.003 | 0.898 ± 0.003 |
| JM1 | 0.691 ± 0.026 | 0.686 ± 0.025 |
| MC1 | 0.942 ± 0.011 | 0.952 ± 0.011 |
| Mozilla4 | 0.965 ± 0.001 | 0.965 ± 0.001 |
| Satellite | 0.926 ± 0.006 | 0.944 ± 0.01 |
| PC2 | 0.67 ± 0.046 | 0.659 ± 0.06 |
| Phonemes | 0.919 ± 0.006 | 0.923 ± 0.003 |
In Table 5, we compare the similarity of the explanations $\Phi(x) - \Phi(\mathbf{0})$ and $\Phi(x)$ to the ground truth values obtained using the unbiased KernelSHAP. The results show that the $\Phi(x) - \Phi(\mathbf{0})$ explanations are more similar to the ground truth values.
Table 5: The similarity of the $\Phi(x) - \Phi(\mathbf{0})$ and $\Phi(x)$ explanations to the ground truth, as measured by $R^2$
| Dataset | $\Phi(x) - \Phi(\mathbf{0})$ | $\Phi(x)$ |
|---|---|---|
| Abalone | 0.999 ± 0.002 | 0.559 ± 0.198 |
| Ada Prior | 0.8461 ± 0.138 | 0.7968 ± 0.153 |
| JM1 | 0.851 ± 0.123 | 0.485 ± 0.188 |
| MC1 | 0.869 ± 0.239 | 0.498 ± 0.409 |
| Mozilla4 | 0.9996 ± 0.002 | -0.097 ± 0.306 |
| Satellite | 0.823 ± 0.195 | 0.638 ± 0.254 |
| PC2 | 0.913 ± 0.206 | 0.353 ± 0.575 |
| Phonemes | 0.98 ± 0.063 | 0.968 ± 0.096 |
Table 6 compares the accuracy of the Shapley values obtained using the originally constrained ViaSHAP to those of the unconstrained variant (using $\Phi(x) - \Phi(\mathbf{0})$). The results show no significant difference in accuracy between the two approaches.
Table 6: The similarity of the explanations generated by the constrained and unconstrained ViaSHAP to the ground truth, as measured by $R^2$
| Dataset | unconstrained | with efficiency constraint |
|---|---|---|
| Abalone | 0.999 ± 0.002 | 0.999 ± 0.002 |
| Ada Prior | 0.8461 ± 0.138 | 0.9 ± 0.095 |
| JM1 | 0.851 ± 0.123 | 0.901 ± 0.094 |
| MC1 | 0.869 ± 0.239 | 0.873 ± 0.332 |
| Mozilla4 | 0.9996 ± 0.002 | 0.9996 ± 0.0007 |
| Satellite | 0.823 ± 0.195 | 0.814 ± 0.296 |
| PC2 | 0.913 ± 0.206 | 0.895 ± 0.223 |
| Phonemes | 0.98 ± 0.063 | 0.975 ± 0.076 |
Finally, Table 7 presents the average number of epochs required before early stopping for both the constrained and unconstrained models. The results likewise show no significant differences between the two variants.
Table 7: The average number of epochs before early stopping
| Dataset | unconstrained | with efficiency constraint |
|---|---|---|
| Abalone | 42 | 47 |
| Ada Prior | 7 | 7 |
| JM1 | 3 | 2 |
| MC1 | 43 | 40 |
| Mozilla4 | 41 | 34 |
| Satellite | 70 | 30 |
| PC2 | 44 | 22 |
| Phonemes | 24 | 27 |
Based on the available evidence, we argue that the proposed approach is valid and demonstrates no convergence issues when compared to the unconstrained variant.
The previous set of experiments will be added to the paper using the complete set of 25 datasets and all corresponding statistical significance tests.
However, the core idea behind the method, which seems to only be the addition of the prediction loss to FastSHAP...
It is possible that we oversimplified the description of the proposed approach, making it sound as if we do nothing beyond adding the prediction loss to FastSHAP. While we acknowledge that our work builds on FastSHAP, there are key differences that distinguish our approach:
1- There is no black-box model provided to explain, which is a fundamental assumption of any Shapley-based explanation method.
2- Shapley values are provided prior to the predictions.
3- The model learns to explain itself.
Thanks for your answer and for providing new empirical results on the classification tasks tested.
Unfortunately, the analytical discussion is still missing, and I have nothing new to add to my last comment. I still find the novelty of the method limited, and I see the restriction to a zero baseline for the value function as a weakness for which I do not see a fundamental reason. This was also mentioned by reviewer aGxt.
While I appreciate the effort the authors put into providing more empirical results, my opinion stays unchanged, and I think the work in its current form does not pass the acceptance threshold.
We respect the reviewer's decision. We kindly point the reviewer to our most recent responses, in which we proposed a variant of the loss function that directly addresses the issue they raised, which is not merely an extra empirical result.
This paper introduces a new machine learning approach where the model directly predicts the Shapley values and then uses them for prediction (by summation). This is in contrast to the traditional approach where Shapley values are calculated at inference time after the model is trained leading to additional computational cost.
Strengths
- The paper is very clearly written and easy to follow. All necessary preliminaries are introduced and thoroughly discussed. This makes the paper mostly self-contained and thus appealing to a wide audience (even if they are not very familiar with Shapley values).
- The idea is simple to understand, novel, and creative.
- Theoretical results justify validity.
- Experimental settings are thoroughly described.
- A few different implementations are presented and compared.
- Ablation studies are included.
Weaknesses
As much as I like the general idea and the presentation, one question has not yet been fully answered for me: When should I use Shapley value regression instead of the traditional approach?
From what I can see, the main reason for using Shapley value regression is to avoid additional computational overhead at inference time. However, to achieve that, we need to constrain our choice of architecture, adapt the training process, etc. Overall, I would like to better understand when all of this is worth it. I would like to see experiments that compare the training and inference times of the two approaches, Shapley value regression and the traditional approach. How much more time do I need for training when using Shapley value regression? How much more time do I need at inference time if I want to calculate Shapley values in the traditional way? The paper only shows the training and inference times for Shapley value regression and does not compare them to the traditional approach.
I would also be interested to know if there are any other advantages of directly predicting Shapley values beyond the reduction in inference time.
Clarifications
- Lines 430-432: Can you elaborate on what you mean by the sentence "KernelSHAP requires more than 2000 samples and model evaluations per data example to achieve the same accuracy level of ViaSHAP on the Adult, Elevators, and House16H datasets."? What is the "accuracy level"? Earlier it was mentioned that the ground truth is computed using KernelSHAP, so I am not sure against what the accuracy is measured.
- Lines 457-459: Why is that observation "remarkable"? It seems intuitive that the similarity will improve as the scaling hyperparameter increases.
Minor
- Figure 2: I think there is something wrong with the loss function. It is defined as a pair of two numbers.
Questions
As discussed in the Weaknesses section.
We thank the reviewer for their positive feedback and helpful comments. We will answer the questions in the following part.
1- When should I use Shapley value regression instead of the traditional approach?
We thank the reviewer for pointing out the need for clarification regarding the novelty and relevance of our approach. Regarding KernelSHAP, the obvious advantage of our approach (and of FastSHAP) is avoiding solving an optimization problem for each example one wishes to explain. Now, to explain the difference from FastSHAP: FastSHAP and our approach, ViaSHAP, are applied in two different settings. FastSHAP is a post-hoc explanation method; it requires a pretrained black-box model to provide the predictions, which FastSHAP is then trained to explain. ViaSHAP, on the other hand, is a standalone model, explainable by design through providing its own Shapley values. It does not explain the inner workings of a separate, pretrained model.
To the best of our knowledge, Shapley values have traditionally been computed in a post-hoc manner, meaning that the prediction and the black-box model must first be available. In contrast, the key contribution of our work is that ViaSHAP computes the Shapley values before the game itself.
2- From what I can see the main reason for using Shapley value regression is to avoid additional computational overhead at inference time. However, to avoid that, we need to constrain our choice of architecture, adapt the training process, etc.
The choice of architecture is constrained in exactly the same way as for any black box used with FastSHAP, in the sense that both models perform the same classification task and can thus have similar architectures up to the last layer. The difference is that FastSHAP then needs to run its explainer on top of the black box to produce the explanation, whereas for ViaSHAP, the training of the predictor is conditioned on providing the explanation, making the entire process single-step. This means that ViaSHAP runs only one model instead of two sequentially. Therefore, the user has the freedom to select any suitable deep learning architecture, e.g., MLPs, transformers, or computer vision and image processing architectures.
3- I would like to see experiments that compare the training times and inference times of the two approaches, Shapley value regression, and the traditional approach. How much more time do I need for training when using Shapley value regression? How much more time do I need at inference time if I want to calculate Shapley values in the traditional way?
Regarding training, FastSHAP also requires training the black box first, then training the explainer by predicting the outputs of several maskings of all data points over several epochs. On the other hand, ViaSHAP does all of this in a single training run (since there is a single model). Since, once again, ViaSHAP can have the same architecture as the black box used with FastSHAP, training ViaSHAP is, at worst, as slow as training FastSHAP's black box. Furthermore, any additional computational cost associated with sampling feature coalitions is incurred exclusively during the training phase. At inference time, the computational cost of ViaSHAP is identical to that of the same architecture not computing Shapley values. In Table 4 in the paper, for each dataset, we report the time required to train a model using the same architecture as ViaSHAP but without computing Shapley values, i.e., sampling and the Shapley loss were not involved in the training; these results can be found under the columns (No Sampling).
In the following table, we report the time required to explain 1000 instances using KernelSHAP and ViaSHAP on 6 datasets, using the same hardware setup as described in Section 4.5 of the paper.
Table 1: The time required to explain 1000 predictions using KernelSHAP and ViaSHAP.
| Dataset | KernelSHAP Time (s) | ViaSHAP Time (s) |
|---|---|---|
| Adult | 56.92 | 0.0026 |
| Elevators | 54.22 | 0.0021 |
| House 16 | 53.12 | 0.0052 |
| Indian Pines | 43124.66 | 0.0023 |
| Microaggregation 2 | 79.97 | 0.0022 |
| First order proving theorem | 436.25 | 0.0022 |
4- Clarifications:
Lines 430-432: Can you elaborate on what you mean by the sentence "KernelSHAP requires more than 2000 samples and model evaluations per data example to achieve the same accuracy level of ViaSHAP on the Adult, Elevators, and House16H datasets."? What is the "accuracy level"? Earlier it was mentioned that the ground truth is computed using KernelSHAP, so I am not sure against what the accuracy is measured.
Thank you for pointing out that this caption indeed requires more clarification. We computed the ground truth Shapley values using the unbiased KernelSHAP [1], which continuously samples coalitions until convergence, updating the estimated Shapley values. After convergence, we consider the values to be the ground truth. Before convergence, the values are approximations that improve after each iteration of coalition sampling and evaluation of the black-box model on the sampled coalitions. The figures demonstrate that KernelSHAP initially provides approximations that differ significantly from those computed by ViaSHAP. However, as KernelSHAP refines its approximations, the similarity to ViaSHAP's values increases. The figures further illustrate that KernelSHAP requires over 2000 samples and corresponding model evaluations to achieve a high level of similarity to ViaSHAP. In contrast, ViaSHAP computes these values within milliseconds, as shown in Table 1 above.
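For reference, this sampling/accuracy trade-off can be reproduced with the standard `shap` package by varying the sampling budget (a toy sketch with our own placeholder model, not the paper's setup):

```python
import numpy as np
import shap
from sklearn.linear_model import LogisticRegression

# Toy setup (ours): a simple model and background data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
model = LogisticRegression().fit(X, y)

explainer = shap.KernelExplainer(model.predict_proba, X[:50])

# Each increase in nsamples means more sampled coalitions and more model
# evaluations per explained instance, refining the approximation.
for nsamples in (100, 500, 2000):
    phi = explainer.shap_values(X[:5], nsamples=nsamples)
```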
Lines 457-459: Why is that observation "remarkable"? It seems intuitive that the similarity will improve as the scaling hyperparameter increases.
What is remarkable is that the predictive performance on the classification task (not the accuracy of the Shapley values) remains mostly unaffected by an exponential increase in the scaling hyperparameter, which implies that users can substantially increase it to learn more accurate Shapley values without compromising the model's high classification performance.
5- Minor: Figure 2. I think there is something wrong with the loss function. It is defined as a pair of two numbers.
In Figure 2, we show how the two components of equation 6 are computed, i.e., the prediction loss and the Shapley loss. We will update the figure to eliminate the confusion.
[1]- Ian Covert and Su-In Lee. Improving kernelshap: Practical shapley value estimation using linear regression. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130, pp. 3457–3465, April 2021.
Thank you for addressing most of my concerns. I have a question regarding your following statement.
The choice of architecture is constrained in exactly the same way as any black-box used with FastSHAP, in the sense that both models are performing the same classification task, and can thus have similar architectures up to the last layer.
It seems to me that one can use FastSHAP with, for instance, tree-based models. However, ViaSHAP needs to be a neural network with a particular last layer. If that is so, then there is a trade-off between the inference time of predicting Shapley values and the flexibility of the used architecture.
It is true that ViaSHAP assumes that the employed base model can optimize the proposed loss function through backpropagation. Therefore, the base model cannot be, for instance, a decision tree. However, ViaSHAP is intended to replace a powerful black box with an equally powerful but explainable model. The only trade-off at inference time is between the complexity of the employed architecture and the time required to make a prediction, as more sophisticated models are more computationally expensive. Since ViaSHAP produces Shapley values alongside its predictions, without solving separate optimization problems or training an additional explainer, it is guaranteed to save computational cost compared to KernelSHAP and FastSHAP.
N.B. According to [1], most models do not support predictions based on subsets of features (i.e., sampling coalitions). Therefore, Jethani et al. [1] proposed training a supervised surrogate model that approximates the marginalization of the masked features using their conditional distribution. Afterwards, FastSHAP explains the surrogate model. On the other hand, we propose to invest in a single inherently explainable model to avoid multiple layers of models to generate explanations.
[1]- Neil Jethani, Mukund Sudarshan, Ian Connick Covert, Su-In Lee, and Rajesh Ranganath. FastSHAP: Real-time shapley value estimation. In ICLR 2022
With the aim of engaging in the discussion of the paper: although it is not proposed by the authors, I think it is possible to extend the method straightforwardly to non-neural methods, and thus this should not be considered a weakness of the proposed framework.
For example, one could imagine having one gradient boosting model per dimension of $\Phi$ and training it using the standard GBT training procedure for all trees. In particular, one can use the objective proposed by the authors (equation (6) in the paper + a model fit term), take the derivative with respect to each tree, and optimize. This strategy is often used by other methods and falls right in line with tree-based models such as [1]. See for example [2], where this strategy is used.
With a bit more detail, just to be super clear on what I mean: we model each dimension of the feature mapping $\Phi$ using a separate gradient boosting tree model. Specifically, we define:
$$\Phi(x) = \big(g_1(x), \ldots, g_d(x)\big),$$
where $g_j$ denotes the gradient boosting tree model for the $j$-th dimension.
Boosting Process Handling the Multidimensionality of $\Phi$
At each iteration $t$ of the boosting process, we have the current estimate of $\Phi$, denoted by:
$$\Phi^{(t)}(x) = \big(g_1^{(t)}(x), \ldots, g_d^{(t)}(x)\big)$$
Our goal is to minimize an objective function $\mathcal{L}$ (for example, eq (6) + a model fit term, or eq (7) in the paper) over the dataset $\{(x_i, y_i)\}_{i=1}^{N}$. The boosting process involves the following steps:
1. Compute Negative Gradients:
For each sample $i$ and each dimension $j$, compute the negative gradient of the loss function with respect to the current estimate:
$$r_{ij} = -\frac{\partial \mathcal{L}}{\partial g_j^{(t)}(x_i)}$$
This results in a gradient vector for each sample (which should be straightforwardly computable with the loss proposed by the authors).
2. Fit Base Learners for Each Dimension:
For each dimension $j$, fit a regression tree $h_j^{(t)}$ to the negative gradients by solving:
$$h_j^{(t)} = \arg\min_{h \in \mathcal{H}} \sum_{i=1}^{N} \big(r_{ij} - h(x_i)\big)^2,$$
where $\mathcal{H}$ is the space of regression trees.
3. Update Estimates of $\Phi$:
Update the estimates for each dimension using the fitted base learners:
$$g_j^{(t+1)}(x) = g_j^{(t)}(x) + \eta\, h_j^{(t)}(x),$$
where $\eta$ is the learning rate.
The resulting method (or similar versions thereof) could be used in exactly the same way the authors propose in the paper. Therefore, unless I am missing something (which is not impossible), I think the method can easily be extended to other paradigms without the need for backpropagation. A minimal sketch of this procedure is given after the references below.
[1] Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5), 1189–1232.
[2] Duan, T., Avati, A., Ding, D. Y., Basu, S., Ng, A. Y., & Schuler, A. (2019). NGBoost: Natural gradient boosting for probabilistic prediction. CoRR, abs/1910.03225. Retrieved from http://arxiv.org/abs/1910.03225
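Under the assumptions above, a runnable sketch of the proposed per-dimension boosting loop (all names are ours; a squared-error surrogate stands in for the full objective, and sklearn regression trees serve as base learners):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_phi_gbt(X, y, d, n_rounds=50, lr=0.1, max_depth=3):
    """Coordinate-wise gradient boosting of Phi(x) = (g_1(x), ..., g_d(x)).

    Squared-error surrogate: the prediction is sum_j g_j(x), and each dimension
    is boosted on the current residual. The Shapley component of the full
    objective (eq (6) + model fit term) would enter through the gradient
    computation; it is omitted here for brevity.
    """
    N = X.shape[0]
    Phi = np.zeros((N, d))                  # current estimate Phi^(t)(x_i)
    ensembles = [[] for _ in range(d)]      # fitted trees per dimension
    for _ in range(n_rounds):
        for j in range(d):
            # 1. Negative gradient of 0.5 * (y - sum_k g_k(x))^2 w.r.t. g_j
            residual = y - Phi.sum(axis=1)
            # 2. Fit a regression tree to the negative gradients
            tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
            ensembles[j].append(tree)
            # 3. Update the estimate for dimension j
            Phi[:, j] += lr * tree.predict(X)
    return ensembles

def predict_phi(ensembles, X, lr=0.1):
    """Evaluate Phi(x) by summing each dimension's ensemble of trees."""
    return np.stack(
        [lr * sum(t.predict(X) for t in trees) for trees in ensembles], axis=1
    )
```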
The suggested method is indeed a valid approach for applying ViaSHAP with gradient boosting models or any other regression models. Employing a separate regressor per output dimension provides the necessary flexibility to implement ViaSHAP without relying on backpropagation.
We thank the reviewer very much for pointing out this approach!
Dear Reviewers,
We are grateful for your valuable feedback and have updated our manuscript to incorporate additional experiments and evidence based on your suggestions and remarks. The revised manuscript now includes the following updates:
1- Results from experiments on image data.
2- A comparison between ViaSHAP and an identical architecture not optimized to compute Shapley values.
3- A comparison between ViaSHAP and FastSHAP with respect to the accuracy of their Shapley value approximations.
4- An ablation study to evaluate the effect of adding a link function on the performance of ViaSHAP.
5- The accuracy of the Shapley values is measured using $R^2$.
6- Additional proofs for Lemma 1 and Lemma 2.
7- A comparison between ViaSHAP and KernelSHAP with respect to the inference time.
8- Clarifications of parts of the manuscript based on the reviewers' feedback.
We hope the updates address your concerns and enhance the quality of the manuscript. Thank you for your time and constructive input.
Dear Reviewers,
We sincerely thank you for the time and effort you dedicated to reading and reviewing our paper. Your thoughtful feedback, recommendations, and constructive criticism have been valuable in shaping our work into a stronger and more refined form. We value the insights you provided, which have enabled us to address key aspects of our approach and improve the overall clarity of the paper.
We greatly appreciate the opportunity to engage with your comments and suggestions, and we acknowledge the important role they played in strengthening our contribution.
Best regards
I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.