PaperHub
7.8 / 10
Poster · 4 reviewers
Ratings: 4, 5, 5, 5 (min 4, max 5, std 0.4)
Confidence: 4.0
Novelty: 3.3 · Quality: 3.3 · Clarity: 2.5 · Significance: 3.3
NeurIPS 2025

Regression-adjusted Monte Carlo Estimators for Shapley Values and Probabilistic Values

OpenReview · PDF
Submitted: 2025-05-07 · Updated: 2025-10-29
TL;DR

We combine Monte Carlo and regression-based methods to get a flexible estimator which achieves state-of-the-art performance.

Abstract

With origins in game theory, probabilistic values like Shapley values, Banzhaf values, and semi-values have emerged as a central tool in explainable AI. They are used for feature attribution, data attribution, data valuation, and more. Since all of these values require exponential time to compute exactly, research has focused on efficient approximation methods using two techniques: Monte Carlo sampling and linear regression formulations. In this work, we present a new way of combining both of these techniques. Our approach is more flexible than prior algorithms, allowing for linear regression to be replaced with any function family whose probabilistic values can be computed efficiently. This allows us to harness the accuracy of tree-based models like XGBoost, while still producing unbiased estimates. From experiments across eight datasets, we find that our methods give state-of-the-art performance for estimating probabilistic values. For Shapley values, the error of our methods is up to $6\times$ lower than Permutation SHAP (the most popular Monte Carlo method), $2.75\times$ lower than Kernel SHAP (the most popular linear regression method), and $1.75\times$ lower than Leverage SHAP (the prior state-of-the-art Shapley value estimator). For more general probabilistic values, we can obtain error up to $60\times$ lower than prior work.
Keywords
Shapley values · Regression-adjustment · Kernel SHAP · Maximum sample reuse · Permutation SHAP

Reviews and Discussion

Review
Rating: 4

The paper proposes a technique that combines Monte Carlo sampling with a regression-based formulation. This approach is claimed to reduce the variance of approximation, thereby improving the estimation of probabilistic values. The authors also introduce treeprob, a method for exactly computing all probabilistic values when interventional utility is used. Experimental results support the potential effectiveness of the proposed technique.

Strengths and Weaknesses

Strengths

  • The proposed approach appears promising for developing variance-reduced Monte Carlo methods for approximating probabilistic values.

Weaknesses

Major

  • Basically, the proposed procedure can be summarized as follows: (1) learn a function $f$ to approximate the given utility function $v$, where $\phi(f)$ is cheap to compute exactly; then (2) approximate $\phi(v)$ as $\phi(f) + \hat{\phi}(v-f)$. In particular, $\hat{\phi}(v-f)$ is an approximation of $\phi(v-f)$ computed using the Monte Carlo method introduced in Eq. (2). That being said, $\hat{\phi}(v-f)$ could also be computed using any existing Monte Carlo method. This suggests that all the considered Monte Carlo baselines could potentially benefit from the same learned function $f$. (A minimal sketch of this decomposition appears below.)

  • On the other hand, while the proposed Monte Carlo method appears novel, it is unclear whether it achieves a faster convergence rate, either empirically or theoretically, compared to the existing baselines.

All in all, the paper lacks experiments demonstrating that the proposed Monte Carlo method is the most effective when paired with the learned $f$.
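As a reading aid, here is a minimal sketch of the decomposition described above, written as hypothetical Python: the helpers `fit_surrogate` and `exact_values`, and the uniform subset sampling with a Banzhaf-style correction term, are illustrative assumptions rather than the paper's actual algorithm or its Eq. (2).

```python
import numpy as np

def regression_adjusted_estimate(v, fit_surrogate, exact_values, n, m, rng):
    """Estimate probabilistic values as phi(f) + phi_hat(v - f).

    v             : callable mapping a length-n binary mask S to a scalar utility v(S)
    fit_surrogate : callable fitting a surrogate f from (masks, utilities); f(S) must be cheap
    exact_values  : callable returning the exact probabilistic values of the surrogate f
    """
    # Split the sample budget: one half fits f, the other half forms the correction.
    train = rng.integers(0, 2, size=(m // 2, n))
    test = rng.integers(0, 2, size=(m - m // 2, n))

    f = fit_surrogate(train, np.array([v(S) for S in train]))
    phi_f = exact_values(f)                      # exact, and cheap by assumption

    # Monte Carlo estimate of the probabilistic values of the residual game v - f.
    # Shown here as a simple difference of conditional means (Banzhaf-style);
    # the paper's Eq. (2) estimator would be substituted here.
    resid = np.array([v(S) - f(S) for S in test])
    phi_resid = np.zeros(n)
    for i in range(n):
        in_S = test[:, i] == 1
        phi_resid[i] = resid[in_S].mean() - resid[~in_S].mean()

    return phi_f + phi_resid
```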

Minor

  • I suggest that the authors include more details on the implementation of treeMSR. After examining the provided code, I found that the features used are binary; the baseline is the zero vector, and the explicand is the one vector. This important information is missing in the paper.

  • In the proof of Theorem 2.1, the randomness involved in learning $f$ is not accounted for. As a result, the stated result applies only to the specific Monte Carlo method defined in Eq. (3). The current statement of Theorem 2.1 may therefore be misleading.

Typos

  • According to Eq. (7), the first equality in Eq. (8) should be changed to $\leq$.
  • In line 26 of Algorithm 2, it should be $\phi[\mathrm{node.feat}] \gets \phi[\mathrm{node.feat}] + (\mathrm{pos}_e + \mathrm{neg}_b)$.

Questions

  • Why not combine $\mathcal{S}^{\mathrm{train}}$ and $\mathcal{S}^{\mathrm{test}}$ for both learning and approximating, given that they are drawn from the same distribution?
  • Could the authors provide results of substituting other Monte Carlo baselines in both linearMSR and treeMSR?

Limitations

yes

Final Justification

Although the presented theory has certain flaws, such as not positioning the proposed MSR estimator relative to existing ones in terms of theoretical convergence rate, and Theorem 2.1 for the proposed regression-adjusted trick being clearly naive, I believe the trick itself is promising, and future work may yield improved results both theoretically and empirically. In my view, a better theoretical framework is far beyond what is presented in this work.

Formatting Issues

I did not spot any formatting issues.

Author Response

all the considered Monte Carlo baselines could potentially benefit from the same learned function $f$.

We agree with this point, and actually view it as a strength rather than a weakness of our approach. We selected MSR as our “baseline” estimator because, in addition to being simple to implement, it is

  1. unbiased (contrasting with Kernel SHAP and Leverage SHAP), and

  2. computationally efficient in the sense that every function evaluation $v(S)$ is used to update every estimate (contrasting with Permutation SHAP and MC).

As far as we know, MSR uniquely satisfies these two properties among probabilistic value estimators.

the paper lacks experiments demonstrating that the proposed Monte Carlo method is the most effective when paired with the learned $f$.

Since the main contribution of our work is the regression adjustment technique (learning an $f$ for which we can compute probabilistic values exactly), considering a wide variety of alternative baseline estimators is outside the scope of the paper. That being said, we just ran some experiments with MC as the baseline; please see below.

On the other hand, while the proposed Monte Carlo method appears novel, it is unclear whether it achieves faster convergence rate either empirically or theoretically compared to the existing baselines.

For fixed sample budgets, Tree MSR achieves an average error that is $1.75\times$ lower than the prior best estimator, Leverage SHAP, across eight popular datasets (Table 1). As for convergence rate, we explore how accurately the estimators recover the true Shapley values (Figure 2) and probabilistic values (Figure 3) as a function of sample size. From these figures, we conclude that, for most datasets, Tree MSR does give faster convergence empirically. We justify this finding theoretically in Theorem 2.1, where we show that the performance of our estimator depends on how well the (randomly) learned $f$ fits $v$.

I suggest that the authors include more details on the implementation of treeMSR. After examining the provided code, I found that the features used are binary; the baseline is the zero vector, and the explicand is the one vector. This important information is missing in the paper.

We will add more details! One note: in the interventional setting where we apply Tree Prob, a point is fully specified by the binary indicator of which features come from the "explicand" and which come from the "baseline". Since the tree-based model can learn in a scale-invariant and shift-invariant way with respect to these features, it suffices to consider only binary inputs. We make this clear in the updated version of the paper.
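For concreteness, here is a small sketch of how such a binary indicator might map to a model input in the interventional setting; the names `model`, `explicand`, and `baseline` are hypothetical placeholders, not the released code.

```python
import numpy as np

def interventional_utility(model, explicand, baseline, mask):
    """Evaluate v(S): features with mask = 1 come from the explicand,
    features with mask = 0 come from the baseline."""
    x = np.where(mask.astype(bool), explicand, baseline)
    return float(model.predict(x.reshape(1, -1))[0])

# With a zero baseline and an all-ones explicand (as the reviewer observed in the
# provided code), the model input x is identical to the binary mask itself.
```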

In the proof of Theorem 2.1, the randomness involved in learning $f$ is not accounted for. As a result, the stated result applies only to the specific Monte Carlo method defined in Eq. (3). The current statement of Theorem 2.1 may therefore be misleading.

We can add a comment reminding the reader that $f$ itself is a random function. Since independent samples are used to fit $f$ (lines 3-4 in Algorithm 1) and to approximate Shapley values (lines 6-7), the statement of Theorem 2.1 is true regardless of the outcome of $f$.

Typos

Thank you!

Why not combine $\mathcal{S}^{\text{train}}$ and $\mathcal{S}^{\text{test}}$ for both learning and approximating, given that they are drawn from the same distribution?

Good question! Our variance analysis in the proof of Theorem 2.1 relies on independence between the final estimate and the learned function $f$. We use $\mathcal{S}^{\text{train}}$ to learn the function $f$, and then $\mathcal{S}^{\text{test}}$ for the samples used to build the final estimate.

Could the authors provide results of substituting other Monte Carlo baselines in both linearMSR and treeMSR?

Yes! We compare several Shapley value estimators in the small-sample regime $m = 10n$ on the four small datasets below.

Dataset: Adult (n=12)

| Estimator | Mean | 1st | 2nd | 3rd | Max |
| --- | --- | --- | --- | --- | --- |
| LinearMC | 6.80e-01 | 2.70e-02 | 1.04e-01 | 1.31e-01 | 2.53e+00 |
| TreeMC | 1.30e-05 | 2.92e-08 | 5.40e-08 | 1.34e-07 | 1.22e-04 |
| LinearMSR | 1.59e-04 | 1.77e-05 | 4.43e-05 | 5.07e-05 | 4.89e-04 |
| TreeMSR | 1.09e-06 | 3.61e-08 | 4.19e-08 | 1.68e-07 | 3.12e-06 |
| KernelSHAP | 5.19e-04 | 1.01e-04 | 1.18e-04 | 1.90e-04 | 1.15e-03 |
| LeverageSHAP | 1.87e-04 | 1.13e-05 | 3.06e-05 | 4.64e-05 | 5.34e-04 |
| PermutationSHAP | 4.83e-03 | 2.65e-04 | 4.14e-04 | 5.85e-04 | 2.04e-02 |

Dataset: Forest Fires (n=13)

| Estimator | Mean | 1st | 2nd | 3rd | Max |
| --- | --- | --- | --- | --- | --- |
| LinearMC | 4.64e-01 | 3.47e-02 | 1.03e-01 | 1.09e-01 | 1.99e+00 |
| TreeMC | 1.29e-05 | 2.54e-07 | 1.18e-06 | 2.09e-06 | 4.15e-05 |
| LinearMSR | 6.41e-05 | 1.27e-05 | 1.50e-05 | 1.54e-05 | 2.93e-04 |
| TreeMSR | 1.21e-06 | 6.79e-08 | 1.00e-07 | 1.05e-07 | 8.51e-06 |
| KernelSHAP | 6.72e-05 | 7.45e-06 | 1.51e-05 | 1.89e-05 | 1.86e-04 |
| LeverageSHAP | 5.78e-05 | 6.83e-06 | 8.97e-06 | 1.86e-05 | 1.64e-04 |
| PermutationSHAP | 7.63e-04 | 1.40e-04 | 1.42e-04 | 1.70e-04 | 2.53e-03 |

Dataset: Real Estate (n=15)

| Estimator | Mean | 1st | 2nd | 3rd | Max |
| --- | --- | --- | --- | --- | --- |
| LinearMC | 7.30e+00 | 2.27e-01 | 4.37e-01 | 6.58e-01 | 6.20e+01 |
| TreeMC | 1.21e-10 | 4.54e-12 | 4.83e-12 | 1.05e-11 | 8.63e-10 |
| LinearMSR | 5.94e-06 | 3.21e-07 | 5.74e-07 | 7.47e-07 | 2.44e-05 |
| TreeMSR | 1.95e-10 | 4.19e-12 | 4.63e-12 | 7.32e-12 | 1.60e-09 |
| KernelSHAP | 8.02e-06 | 3.00e-07 | 4.86e-07 | 5.28e-07 | 4.87e-05 |
| LeverageSHAP | 8.77e-06 | 2.38e-07 | 4.20e-07 | 1.18e-06 | 3.79e-05 |
| PermutationSHAP | 3.66e-05 | 4.53e-08 | 4.29e-07 | 1.22e-06 | 2.03e-04 |

Dataset: Bike Sharing (n=16)

| Estimator | Mean | 1st | 2nd | 3rd | Max |
| --- | --- | --- | --- | --- | --- |
| LinearMC | 2.18e+00 | 4.83e-02 | 1.08e-01 | 2.63e-01 | 1.74e+01 |
| TreeMC | 2.05e-06 | 1.31e-08 | 1.52e-08 | 1.99e-07 | 8.10e-06 |
| LinearMSR | 5.82e-05 | 1.13e-06 | 5.11e-06 | 9.26e-06 | 1.89e-04 |
| TreeMSR | 2.96e-07 | 4.78e-10 | 5.78e-10 | 1.60e-08 | 1.96e-06 |
| KernelSHAP | 7.10e-05 | 1.18e-06 | 1.73e-05 | 2.01e-05 | 1.61e-04 |
| LeverageSHAP | 5.19e-05 | 3.73e-06 | 5.26e-06 | 9.99e-06 | 1.28e-04 |
| PermutationSHAP | 2.16e-04 | 1.47e-05 | 2.94e-05 | 4.82e-05 | 3.82e-04 |

While Tree MC is quite good, we generally find that Tree MSR gives error about an order of magnitude smaller. We believe this is due to the differences in the final regression-adjusted estimate: each sample is used for every estimate in MSR, whereas each sample is only used for one estimate in MC.

Comment

Thank you to the authors for taking the time to respond. In my opinion, the proposed regression-adjusted trick is promising, as it is fully complementary to existing Monte Carlo methods and serves as a bridge between what can be exactly computed and what can only be approximated. However, what prevented me from giving a higher score is that I find the theoretical justification somewhat lacking, although this could be addressed in future work. I will make my decision later.

Review
Rating: 5

This paper proposes using the control variates (CV) method or regression sampling (RS) method for computing the Shapley value and its variants to address the well-known computational challenges. While CV and RS are well-established techniques in Monte Carlo estimation, to the best of my knowledge, this is the first proposal to apply these methods to Shapley value calculation, which is a noteworthy contribution.

As demonstrated in general Monte Carlo contexts, both CV and RS are undoubtedly effective. For Shapley value computation, the baseline function can be chosen as a linear function, since in that case the exponential sum required for the Shapley value can be computed exactly.

The idea is straightforward. Since existing sampling approaches such as LeverageSHAP are already highly optimized, direct performance comparison may be challenging. Nevertheless, the authors’ proposal is meaningful as an additional acceleration method for SHAP, complementing LeverageSHAP. I support acceptance of this paper.

Strengths and Weaknesses

Strengths:

  • The first introduction of the CV or RS method to Shapley values.
  • Demonstrated variance reduction.

Weaknesses:

  • Depending on how the performance is measured, improvement may not be drastic as compared to highly optimized alternatives.

Questions

I found Fig. 1(b) confusing. The Correlated and Community data points do not appear to satisfy the unbiasedness property. Why do they seem so biased? In theory, the estimated values should be distributed both above and below the true value, so that the deviations cancel out on average. Why is this not observed in the figure?

Limitations

CV and RS are well-known. The contribution can be seen as incremental from a general Monte Carlo perspective. However, the application to Shapley value computation is novel enough. This paper should be accepted.

Formatting Issues

None.

Author Response

Thank you for your positive review!

Depending on how the performance is measured, improvement may not be drastic as compared to highly optimized alternatives.

While it doesn't give SOTA in every setting, our Tree MSR achieves an average error $1.75\times$ lower than the prior best Shapley value estimator, Leverage SHAP, on eight common datasets.

I found Fig. 1(b) confusing. The Correlated and Community data points do not appear to satisfy the unbiasedness property. Why do they seem so biased? In theory, the estimated values should be distributed both above and below the true value, so that the deviations cancel out on average. Why is this not observed in the figure?

Figure 1(b) shows the true Shapley values vs. estimates produced by the standard MSR estimator on one random run. The multiple points for each dataset (e.g., for Correlated and Community) correspond to different features in the dataset. While the estimated Shapley value for each individual feature is unbiased, there is substantial correlation between the estimates for different features, leading to this clustering. The correlation is due to the fact that every function evaluation $v(S)$ is used to estimate the Shapley value for every feature. Large values of $v(S)$ in particular can have substantial impact. To observe unbiasedness, we would need to re-run the same experiment with different random seeds. If we do so, we see that the estimates for Correlated and Communities do not consistently appear on one side of the diagonal line.

While the computational advantage of MSR is its "maximum reuse", this property comes with a cost in variance and correlation. We view Regression MSR as a fix: the variance of Regression MSR depends on $[f(S) - v(S)]^2$ rather than $[v(S)]^2$ as in standard MSR. So, even if some $v(S)$ is large, we can still get remarkably accurate estimators provided $f(S) \approx v(S)$.
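To illustrate this variance argument numerically, here is a small, self-contained toy sketch (not from the paper, and using a generic expectation rather than the Shapley setting): a plain Monte Carlo average is compared against a regression-adjusted average whose spread is driven only by the residual $v - f$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "game": large, variable values; a cheap surrogate f captures most of them.
def v(x):
    return 100.0 * x + np.sin(5.0 * x)

def f(x):
    return 100.0 * x             # surrogate whose expectation over Uniform(0, 1) is known

exact_Ef = 50.0                   # E[f(X)] for X ~ Uniform(0, 1)

def plain_mc(m):
    x = rng.random(m)
    return v(x).mean()

def regression_adjusted_mc(m):
    x = rng.random(m)
    return exact_Ef + (v(x) - f(x)).mean()   # fluctuation comes from v - f only

plain = np.array([plain_mc(50) for _ in range(2000)])
adjusted = np.array([regression_adjusted_mc(50) for _ in range(2000)])
print("std of plain MC:    ", plain.std())     # dominated by the size of v itself
print("std of adjusted MC: ", adjusted.std())  # dominated by the residual v - f
```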

Review
Rating: 5

This paper proposes a novel estimator for general probabilistic values such as Shapley values and Banzhaf values. The authors combine Monte Carlo sampling with regression-based methods to construct an unbiased estimator. Specifically, the proposed method uses regression as a variance reduction technique for the Maximum Sample Reuse method, thereby integrating the strengths of both approaches.

Strengths and Weaknesses

Strengths:

  1. The proposed estimator has several desirable properties: (1) it is unbiased, (2) it is applicable to any probabilistic value (e.g., Shapley value and Banzhaf value), (3) it allows the use of any learned function $f$ to approximate the value function $v$, and (4) the variance of the estimator depends on the quality of the surrogate function $f$. As such, when $f$ is accurate, the variance can be significantly reduced.

  2. The authors analyze the error bound of the estimator under an arbitrary sampling distribution $D$, which improves the theoretical soundness and generality of the method.

  3. The authors implement the estimator using both linear and tree-based models for the surrogate function $f$, with the tree-based models exhibiting lower approximation error in practice.

Weaknesses:

  1. It would be helpful if the authors could provide a more detailed derivation of Equation (5) in the appendix, including a formal proof of its unbiasedness and an analysis of its variance.

  2. Since the variance of the proposed estimator depends on the surrogate function $f$, it would be beneficial to provide a more detailed discussion of how different choices of $f$ affect the variance and estimation accuracy. For example, to what extent does the use of linear versus non-linear models influence the quality of the estimated probabilistic values? Could shallow neural networks serve as effective approximators in this context? A more detailed analysis of how the choice of $f$ impacts estimation error would strengthen the paper.

  3. It is unclear how the computational complexity of the proposed method compares to existing approaches. In particular, it would be helpful to clarify whether the estimator has similar time complexity to the Maximum Sample Reuse method. A discussion or empirical comparison of the time complexity (or runtime) would improve the clarity of the contribution.

  4. The authors could add a summary table in the appendix comparing the proposed method with existing baselines (e.g., in terms of unbiasedness, variance, computational cost, use of surrogate models, and other relevant properties), which would help readers better understand the advantages and limitations of each method.

Questions

See weaknesses

Limitations

yes

Final Justification

I appreciate the strengths of this manuscript and the authors’ detailed responses. I am inclined to support its acceptance.

Formatting Issues

N/A

Author Response

Thank you for your detailed comments! We respond below:

It would be helpful if the authors could provide a more detailed derivation of Equation (5) in the appendix, including a formal proof of its unbiasedness and an analysis of its variance.

These derivations are provided in the proof of Theorem 2.1 (Appendix A). We show that $\tilde{\phi}_i$ is unbiased in the equation under Line 370, and analyze its variance in Equation (7). In response to your comment, and that of Reviewer z7xn, we will make this analysis clear in the main body of the updated paper.

to what extent does the use of linear versus non-linear models influence the quality of the estimated probabilistic values? Could shallow neural networks serve as effective approximators in this context? A more detailed analysis of how the choice of $f$ impacts estimation error would strengthen the paper.

A crucial component of our proposed Regression MSR estimator is that the probabilistic values of the approximation $f$ need to be efficiently computable. (Otherwise, we would be stuck with the initial problem of estimating the probabilistic values of $f$.)

We are aware of two primary function classes with this property:

  • The coefficients of linear functions are always their probabilistic values. (This can be seen directly in Equation (1) with the probabilistic constraint that $\sum_{\ell=0}^{n-1} \binom{n-1}{\ell} p_\ell = 1$.)

  • We show in Appendix C how the probabilistic values of (multiple) decision trees can be computed efficiently. A special case of this was known for Shapley values, and basically follows from the “case”-like structure of trees.

The issue with neural networks is that computing their Shapley values is generally slow, requiring $O(2^n)$ time even when the network is shallow or has few parameters.
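As a quick sanity check of the linear-function claim above, here is a small, self-contained sketch (illustrative only, not from the paper) that computes Shapley values by brute force for a linear game and confirms they match the coefficients.

```python
import numpy as np
from itertools import combinations
from math import comb

n = 4
w = np.array([0.5, -1.2, 2.0, 0.3])           # coefficients of a linear game
v = lambda S: sum(w[j] for j in S)             # v(S) = sum of coefficients in S

def shapley_brute_force(v, n):
    """Exact Shapley values via the O(2^n) sum over subsets."""
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = 1.0 / (n * comb(n - 1, size))    # Shapley weight for |S| = size
            for S in combinations(others, size):
                phi[i] += weight * (v(S + (i,)) - v(S))
    return phi

print(shapley_brute_force(v, n))   # matches w up to floating-point error
```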

it would be helpful to clarify whether the estimator has similar time complexity to the Maximum Sample Reuse method. A discussion or empirical comparison of the time complexity (or runtime) would improve the clarity of the contribution.

Great question! In our experience, the dominating cost is computing the $m$ function evaluations of $v$. Here's one way to see why: if $v$ is as expressive as a shallow neural network, this time complexity is already $O(mn^2)$ and likely much higher. (A single fully connected layer between $n$ nodes requires $O(n^2)$ operations, repeated for each of the $m$ different $n$-dimensional inputs.)

Beyond the time to get evaluations of $v$, the additional computational step of Regression MSR is fitting the function $f$:

  • When $f$ is linear, solving a linear system with $m$ equations and $n$ unknowns takes $O(mn^2)$ time. In practice, this is very fast due to excellent linear algebra libraries. Note: Kernel SHAP and Leverage SHAP have this same time complexity because they also solve a linear system of the same size.

  • When $f$ is a tree-based model, fitting the function requires roughly $O(Tmn)$ time, where $T$ is the number of trees. In practice, we use the default $T = 100$ for XGBoost. Due to its extensive optimizations, fitting XGBoost is remarkably fast in practice.
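A rough sketch of this tree-surrogate fitting step, using XGBoost on binary subset masks; the sampling scheme, hyperparameters, and function names here are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
import xgboost as xgb

def fit_tree_surrogate(v, n, m, rng):
    """Fit an XGBoost regressor f(S) ~ v(S) on m sampled binary masks."""
    masks = rng.integers(0, 2, size=(m, n)).astype(float)
    utilities = np.array([v(mask) for mask in masks])
    f = xgb.XGBRegressor(n_estimators=100, max_depth=4)   # T = 100 trees
    f.fit(masks, utilities)
    return f   # its probabilistic values would then be computed exactly (e.g., via Tree Prob)
```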

The authors could add a summary table in the appendix comparing the proposed method with existing baselines (e.g., in terms of unbiasedness, variance, computational cost, use of surrogate models, and other relevant properties), which would help readers better understand the advantages and limitations of each method.

Great idea! We agree a summary table would be useful for the community. Please see below for an initial Shapley value estimator summary. We will continue to update it, adding more estimators and properties, for the final version of the paper. Let $T_m$ be the time to evaluate the value function $v$ on $m$ different subsets.

| Method | Bias | Time Complexity | Regression-based |
| --- | --- | --- | --- |
| Kernel SHAP | Biased | $T_m + O(mn^2)$ | Linear model |
| Leverage SHAP | Biased | $T_m + O(mn^2)$ | Linear model |
| Permutation SHAP | Unbiased | $T_m + O(m + n)$ | No |
| MC | Unbiased | $T_m + O(m + n)$ | No |
| MSR | Unbiased | $T_m + O(m + n)$ | No |
| Regression MSR | Unbiased | $T_m + O(mn^2)$ | Linear or tree model |

Review
Rating: 5

The article addresses the estimation of probabilistic values, such as Shapley values and Banzhaf values, without the need to perform $2^{n}$ computations and accesses to the payoff function. The authors propose using a simpler function, $f: 2^{[n]} \to \mathbb{R}$, to approximate $v(S)$. This function is then used to exploit the linearity property of probabilistic values to find a formulation that can estimate the contributions consistently, which is proven theoretically. The proposed algorithm assumes that $\phi(f)$ can be computed fast, i.e., $f$ is either linear or tree-based. The article also provides extensive numerical experiments analyzing the sample complexity of the proposed algorithm and comparing it to many published methods.

Strengths and Weaknesses

Strengths:

  1. The article addresses an important problem in XAI, which is the efficient computation of Shapley values.
  2. It provides sound theoretical guarantees.
  3. The experimental results are thorough and convincingly demonstrate the method’s effectiveness.

Weaknesses:

  1. The presentation lacks clarity in key areas, including the preliminary literature overview, the discussion of closely related methods, and even the theoretical contribution.
  2. The manuscript omits discussion of several relevant works in the literature on fast Shapley value estimation. While the literature is admittedly large, a more comprehensive review would strengthen the article.

Questions

  1. Equation (3) seems central to the proposed algorithm, but no intuition is provided. It's also unclear which cited work this equation comes from. Can you clarify its origin and provide some explanation?

  2. Theorem 2.1 is the main contribution of the article, and I think it can be stated better. Although Equation (6) does entail consistency results, I believe the main contribution of the article would be more readily accessible if the consistency ($\mathbb{E}[\hat{\phi}_{i}] = \phi_{i}$) and variance results were mentioned explicitly in the main body of the article.

  3. Is the division between Monte Carlo and regression-based methods strictly correct? For instance, doesn't KernelSHAP also involve Monte Carlo sampling, not just regression?

  4. Some explanations are hard to follow. For example, in line 35, the first sentence appears without context, and the topic shifts abruptly right after. This is also seen in Section 1.1.

  5. Key related works like Mitchell et al., 2022 ("Sampling Permutations for Shapley Value Estimation") and Jethani et al., 2022 ("FastSHAP") are not mentioned. Even though FastSHAP is not based on Monte Carlo, it is relevant. Also, Mitchell et al. show that Owen's halves can be effective, yet this article only discusses Owen’s method in the context of KernelSHAP. I understand that the relevant literature is vast but it does draw attention that the method is not compared to such prevalent works

Limitations

yes

Final Justification

I believe the results of the article are significant. I had concerns about the comparisons and some points in the article. The authors have convinced me on all points. I believe the article is interesting to people working in the area of XAI and is technically solid, with impact on the area.

Formatting Issues

NA

Author Response

Thank you for your careful reading! We appreciate your clarifying questions, and pointers to specific related works. We respond in detail below:

Equation (3) seems central to the proposed algorithm [...] Can you clarify its origin and provide some explanation?

The Maximum Sample Reuse (MSR) estimator described in Equation (3) was originally proposed to estimate Banzhaf values [WJ23]. Since then, it has been generalized to other probabilistic values, albeit with differing notations [KBMH24, LY24a, LY24b]. The key insight is that we can write the probabilistic values as

$$\phi_i(v) = \sum_{S \subseteq [n] \setminus \{i\}} p_{|S|} \left[ v(S \cup \{i\}) - v(S) \right] = \sum_{S \subseteq [n] : i \in S} p_{|S|-1}\, v(S) - \sum_{S \subseteq [n] : i \notin S} p_{|S|}\, v(S) = \sum_{S \subseteq [n]} v(S) \left[ p_{|S|-1} \mathbb{1}[i \in S] - p_{|S|} \mathbb{1}[i \notin S] \right].$$

Compared to the first expression, the final expression makes it clear how $v(S)$ appears (albeit with a potentially different sign and weighting) in every probabilistic value. This observation naturally suggests the MSR estimator in Equation (3), where each $S$ is sampled and reweighted so that the estimator is correct in expectation. We will update the paper to make this motivation clear.
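For intuition, here is a minimal sketch of the maximum-sample-reuse idea in its simplest instance, the Banzhaf value, where the reweighting reduces to a difference of conditional means over uniformly sampled subsets; this is an illustrative sketch of the idea behind Equation (3), not the paper's general estimator.

```python
import numpy as np

def msr_banzhaf(v, n, m, rng):
    """Maximum-sample-reuse estimate of Banzhaf values: every sampled subset S
    contributes to the estimate for every player i (assumes m is large enough
    that each player appears in some samples and is absent from others)."""
    masks = rng.integers(0, 2, size=(m, n)).astype(bool)
    values = np.array([v(S) for S in masks])
    phi = np.zeros(n)
    for i in range(n):
        phi[i] = values[masks[:, i]].mean() - values[~masks[:, i]].mean()
    return phi
```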

I believe the main contribution of the article can be more readily accessible if consistency ($\mathbb{E}[\hat{\phi}_{i}] = \phi_{i}$) and variance results are mentioned explicitly in the main body of the article.

We think this is a good suggestion, and will plan on including the consistency and variance directly in Theorem 2.1.

is the division between Monte Carlo and regression-based methods strictly correct? For instance, doesn’t KernelSHAP also involve Monte Carlo sampling, not just regression?

Good question! In our work, we distinguish between "Monte Carlo" methods that directly estimate the linear sum of probabilistic values, and "regression-based" methods that fit an approximation $f$ to the underlying game $v$ and then compute the probabilistic values of $f$. While methods like KernelSHAP use random sampling to fit a linear function $f$ to $v$, the method ultimately computes $f$ via least squares regression and returns the Shapley values of $f$. (Since $f$ is a linear function, its Shapley values are simply the coefficients for each feature.) We will clarify this distinction in the updated paper.
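As a schematic illustration of the "regression-based" side of this distinction (not the exact KernelSHAP procedure, which uses a specific kernel weighting and an efficiency constraint), one can fit a linear surrogate by ordinary least squares on sampled subset masks and read off its coefficients as the surrogate's Shapley values.

```python
import numpy as np

def linear_surrogate_values(v, n, m, rng):
    """Fit f(S) = b + sum_i w_i * 1[i in S] by least squares on m sampled subsets
    and return w, the Shapley values of the linear surrogate f."""
    masks = rng.integers(0, 2, size=(m, n)).astype(float)
    y = np.array([v(S) for S in masks])
    X = np.hstack([np.ones((m, 1)), masks])        # intercept column + indicators
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1:]                                 # drop the intercept
```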

in line 35, the first sentence appears without context, and the topic shifts abruptly right after. This is also seen in sections 1.1

Thank you for pointing out these writing discontinuities. We have smoothed out the transitions (Line 35 and Section 1.1 more broadly) in the new version.

Even though FastSHAP is not based on Monte Carlo, it is relevant. Also, Mitchell et al. show that Owen's halves can be effective, yet this article only discusses Owen’s method in the context of KernelSHAP. I understand that the relevant literature is vast but it does draw attention that the method is not compared to such prevalent works

Due to their widespread applications in interpretability, there is certainly a vast literature on estimating Shapley values, and probabilistic values more broadly. We will plan on expanding our discussion of prior work in the updated paper. We briefly discuss two methods mentioned by the reviewer here:

Fast SHAP is a method for learning a function $f$ that simultaneously fits multiple related games $v$. While it has a high overhead in sample complexity to learn $f$, Fast SHAP can give improved results when we want to estimate Shapley values for multiple, similar games simultaneously. In our setting, we are interested in estimating the Shapley values of a single game $v$. Due to these incomparable settings, we do not compare to Fast SHAP in our work, but we have now highlighted the estimator in the updated related work section in response to your comment.

Owen's and Owen's halves are two Shapley value estimators. They both exploit a connection between Shapley values and the expectation of a continuous process. The "halves" augmentation is a kind of paired sampling (sometimes called "antithetical" sampling) that boosts performance. We chose not to compare to the Owen's or Owen's halves estimators because they were shown in "Sampling Permutations for Shapley Value Estimation" to consistently perform worse than Permutation SHAP (what they call "MC-antithetic"); see, e.g., their Figure 9. In our work, we find that Tree MSR already achieves an average $6\times$ improvement over Permutation SHAP. Nevertheless, we plan on adding a discussion of this prior work.

Comment

I thank the authors for their detailed response. They have clarified and answered all my questions and concerns. I will raise my score to an accept.

Final Decision

The paper presents a novel method for efficiently estimating probabilistic values, such as Shapley and Banzhaf values, by combining Monte Carlo sampling with flexible function-based regression. This approach achieves state-of-the-art accuracy across multiple datasets, significantly reducing estimation error compared to existing methods while maintaining unbiased probabilistic value computations.

Strengths of the paper include addressing efficient and accurate computation of Shapley and other probabilistic values with strong theoretical guarantees. The estimator is unbiased, flexible, applicable to any probabilistic value, and achieves variance reduction with accurate surrogates. Experiments are thorough and demonstrate state-of-the-art performance, particularly using tree-based models, and the method introduces novel variance-reduction techniques.

Reviewers raised several concerns, many of which were addressed during the rebuttal and discussion period, including unclear presentation of the literature review, related methods, and theoretical contributions, missing discussion of relevant prior work, and insufficient derivations or proofs. The impact of different surrogate functions on variance and accuracy is underexplored, computational complexity and runtime comparisons are lacking, and it is unclear if the method converges faster than existing Monte Carlo baselines.

The reviewers agree that the paper is technically solid, addresses an important problem in XAI, and is of interest to the community. While some theoretical aspects, such as the convergence analysis and Theorem 2.1, are limited or simplistic, the proposed regression-adjusted approach is promising and may inspire future theoretical and empirical improvements. Overall, the reviewers support acceptance.