PaperHub
4.0/10 · Poster · 3 reviewers
Reviewer ratings: 3, 1, 3 (min 1, max 3, std. dev. 0.9)
ICML 2025

What makes an Ensemble (Un) Interpretable?

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-07-24

Abstract

Ensemble models are widely recognized in the ML community for their limited interpretability. For instance, while a single decision tree is considered interpretable, ensembles of trees (e.g., boosted trees) are often treated as black-boxes. Despite this folklore recognition, there remains a lack of rigorous mathematical understanding of what particularly makes an ensemble (un)-interpretable, including how fundamental factors like the (1) *number*, (2) *size*, and (3) *type* of base models influence its interpretability. In this work, we seek to bridge this gap by applying concepts from computational complexity theory to study the challenges of generating explanations for various ensemble configurations. Our analysis uncovers nuanced complexity patterns influenced by various factors. For example, we demonstrate that under standard complexity assumptions like P$\neq$NP, interpreting ensembles remains intractable even when base models are of constant size. Surprisingly, the complexity changes drastically with the number of base models: small ensembles of decision trees are efficiently interpretable, whereas ensembles of linear models remain intractable, even with a constant number of models. We believe that our findings provide a more robust foundation for understanding the interpretability of ensembles, emphasizing the benefits of examining it through a computational complexity lens.
Keywords
interpretability · explainable AI

Reviews and Discussion

Review (Rating: 3)

The paper provides a complexity-theoretic investigation into the interpretability of ensemble models, focusing on three major types of explanation queries (sufficient reasons, contrastive reasons, Shapley values, etc.) and three common base-model classes (decision trees, linear models, neural networks). The main thrust is that ensembles are often far more difficult to interpret, from a computational perspective, than single base models. The authors show that even small ensembles can push explanation queries into NP-complete or #P-complete territory, or into even higher complexity classes, underscoring fundamental limitations on generating certain forms of explanations in a tractable manner. Interestingly, the authors also highlight nuanced differences among models: while even two linear models can already lead to intractability for some explanation queries, an ensemble of a few (small) decision trees may still be analyzed in polynomial or XP time under certain conditions.

Questions for the Authors

Some practitioners still use large ensembles (e.g., random forests with 100+ trees). Do you have suggestions for which subsets of your results are more pressing in practice versus primarily theoretical?

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes; I did not find significant issues with this part.

Theoretical Claims

Yes. For Proposition 4.2 (complexity separations in linear/tree ensembles) and Proposition 5.1 (dealing with constant-size base models), the reductions appear logically consistent. There is no obvious mismatch in how they construct the gadgets or use known NP-complete problems as a basis.

Experimental Design and Analysis

It is not an experimental paper, so there are no empirical tests on real data sets or performance measures. The analysis is fully theoretical, focusing on polynomial and parameterized complexity arguments.

Supplementary Material

No

Relation to Prior Literature

The paper strongly relates to prior research on the computational complexity of interpretability (e.g., references to complexity analyses of single decision trees, linear models, and neural networks). It also relates to the broader literature on knowledge compilation and logic-based explanation, as well as standard NP/#P-completeness references. Their parameterized-complexity approach extends prior lines of inquiry that used W[1]-hardness for single-tree explanation tasks.

Missing Important References

No

Other Strengths and Weaknesses

Strengths

  1. Comprehensive discussion of three explanation types. The paper studies multiple explanation queries (feature-selection-based, contrastive, Shapley-based), which are popular in different branches of XAI.
  2. Clear case analysis across base-model classes. By analyzing different base models (linear models, trees, and neural networks) and structural parameters, the authors clarify precisely how and why ensembles become harder to interpret.
  3. Well-structured theoretical results. The complexity classification includes membership, hardness, and parameterized analyses.

Weaknesses

  1. Limited discussion of practical methods. While the theoretical results are strong, there is relatively little discussion of approximate or heuristic strategies for circumventing these hardness results. It would be better to state how these hardness barriers affect real-world usage of ensemble interpretability methods. In addition, elaborating on how practical interpretability methods (e.g., SHAP) align with or contradict the complexity results would be valuable.
  2. Scalability in practice. Although the paper mentions XP- or W[1]-type results, it remains somewhat unclear which of these might be feasible for moderately large ensembles (e.g., k=10 or k=20).
  3. Discussion of more specialized ensemble schemes. The paper primarily focuses on majority/weighted voting, but some newer ensemble methods apply meta-model strategies or dynamic weighting. A deeper analysis of these schemes would be welcome.

Other Comments or Suggestions

No

Author Response

We thank the reviewer for the constructive feedback, and for recognizing the significance of our work.

Discussion on more specialized ensemble schemes

We thank the reviewer for raising this point. While our complexity results are established for both majority voting and weighted voting inference methods - covering a broad range of possible ensemble configurations and commonly used ensembles - many of our findings extend to other types of ensemble methods as well. We touch on this briefly in Appendix A.1, where we discuss how our results extend to certain other ensemble configurations. Specifically, for meta-learner ensembles (as discussed in A.1), if the meta-learner can simulate majority or weighted voting - which is feasible even with simple models like linear classifiers, and certainly with neural networks - then the hardness results we present also extend to this setting. However, the membership results are more nuanced and would depend on the specific architecture of the meta-learner. Similarly, for dynamic ensembles, whenever the dynamic behavior reduces to a static configuration (e.g., fixed weights), the hardness results continue to hold. However, as with meta-learners, the membership results would depend on the particular characteristics of the dynamic mechanism. We will highlight these points more clearly in the final version.
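To make the simulation argument concrete, here is a minimal sketch (ours, for illustration; not a construction from the paper) of how a linear meta-learner with unit weights and bias -k/2 reproduces majority voting over k binary base predictions, which is why hardness transfers to the meta-learner setting:

```python
import numpy as np

def linear_meta_majority(base_preds: np.ndarray) -> int:
    """A linear meta-learner simulating majority voting: unit weights
    over the k base-model votes and bias -k/2. For odd k this
    reproduces the majority vote exactly (no ties can occur)."""
    k = base_preds.shape[0]
    score = np.ones(k) @ base_preds - k / 2.0
    return int(score > 0)

# Three base models vote 1, 0, 1 -> the meta-learner outputs the majority class 1.
print(linear_meta_majority(np.array([1, 0, 1])))  # 1
```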

Practicality aspects and relation to existing explainability schemes

We thank the reviewer for raising this important point. We will clarify and expand our discussion of these aspects in the final version of the paper.

We first emphasize that some of our results demonstrate the intractability of various explanation tasks, thereby showing that heuristic approaches may lack formal guarantees. For example, in the case of sufficiency-based explanations, this implies that heuristics used in practice may produce explanations that are either insufficient (i.e., unfaithful) or suboptimal in terms of minimality - issues already noted in prior practical work that has examined these aspects (e.g., [1-2]). Similarly, for SHAP, our results suggest that heuristic methods like KernelSHAP or FastSHAP may significantly diverge from exact SHAP values.
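For intuition about why practitioners approximate at all, here is a brute-force computation of exact SHAP values (our illustrative sketch, not code from the paper; it uses a single baseline as a crude stand-in for the expectation under feature independence). The 2^n subset enumeration is precisely the cost that KernelSHAP-style heuristics avoid:

```python
from itertools import combinations
from math import factorial

def exact_shap(f, x, baseline, n):
    """Exact SHAP values by enumerating all feature subsets.

    v(S) evaluates f with features in S fixed to x and the rest set to a
    single baseline point. Runtime is O(2^n * n), illustrating why exact
    computation is avoided in practice.
    """
    def v(S):
        z = [x[i] if i in S else baseline[i] for i in range(n)]
        return f(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                S = set(S)
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi[i] += weight * (v(S | {i}) - v(S))
    return phi

# Majority vote of three stumps over Boolean features: phi = [0.5, 0.5, 0.0].
f = lambda z: int(z[0] + z[1] + z[2] >= 2)
print(exact_shap(f, x=[1, 1, 0], baseline=[0, 0, 0], n=3))
```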

For tree ensembles, many computational challenges become efficiently solvable when both the number of trees and the number of leaves per tree are bounded. Furthermore—and importantly—we will clarify in our practical discussion that many of these computations are also amenable to parallelization. Lastly, since our work establishes membership in classes such as W[1] and related complexity classes, some of these problems can be reduced to well-known problems within these classes (e.g., k-Clique), for which efficient heuristics are already available [e.g., 3–4]. This opens up an interesting direction for future research, where such heuristics could be leveraged to solve these problems more efficiently. We also agree that an exciting direction for future work lies in exploring additional forms of approximate guarantees - such as fully polynomial-time approximation schemes (FPTAS) or fully polynomial-time randomized approximation schemes (FPRAS) - as well as investigating relaxed versions of certain explanation definitions, which may offer improved tractability.

Given these considerations, the scalability of explanation methods like SHAP for tree ensembles largely hinges on the number of leaves per tree. When this number is relatively small, explanations remain tractable - even for fairly large ensembles such as those mentioned by the reviewer. However, a key and central practical insight from our results is that interpretability benefits, from a computational perspective, are greater when using fewer but deeper trees rather than many shallow ones - an insight that can influence the preferred tree structure when interpretability is the goal. In contrast, integrating linear models into ensembles can significantly hinder interpretability, even with just a few base models, due to a rapid increase in the complexity of generating explanations. This highlights the comparative advantage of decision-tree-based ensembles over those incorporating linear models from an interpretability standpoint. We believe the strong theoretical foundation laid by our work can inspire further practical exploration of these diverse aspects, and we will revise the final version to expand this practical discussion. We thank the reviewer for raising this important point!
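A minimal sketch of the enumeration structure underlying these observations (our illustration, not the paper's algorithm): reasoning about a tree ensemble reduces to enumerating joint leaf assignments, whose number is the product of the trees' leaf counts (exponential in the number of trees but polynomial in the leaves per tree), matching the trade-off described above:

```python
from itertools import product

def consistent_leaf_tuples(trees):
    """Enumerate joint leaf assignments across an ensemble of trees.

    Each tree is a list of leaves; a leaf is (constraints, label), where
    constraints maps feature index -> required Boolean value. The cost is
    the product of leaf counts: exponential in the number of trees, but
    only polynomial in the number of leaves per tree.
    """
    for combo in product(*trees):
        merged, ok = {}, True
        for constraints, _label in combo:
            for feat, val in constraints.items():
                if merged.get(feat, val) != val:  # contradictory path constraints
                    ok = False
                    break
                merged[feat] = val
            if not ok:
                break
        if ok:
            yield merged, [label for _c, label in combo]

# Two tiny stump-like trees over Boolean features 0 and 1.
t1 = [({0: True}, 1), ({0: False}, 0)]
t2 = [({1: True}, 1), ({1: False}, 0)]
for region, labels in consistent_leaf_tuples([t1, t2]):
    print(region, labels)
```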

[1] VeriX: Towards Verified Explainability of Deep Neural Networks (Wu et al., NeurIPS 2023)

[2] Using MaxSAT for Efficient Explanations of Tree Ensembles (Izza et al., AAAI 2022)

[3] Efficient Algorithms for Clique Problems (Vassilevska et al., Information Processing Letters)

[4] Efficient Maximum Clique Computation over Large Sparse Graphs (Chang et al., KDD 2019)

Review (Rating: 1)

The paper presents complexity results for a variety of explanation queries on different machine learning models with the main aim of comparing the behaviour of single models with ensembles. In particular, the paper considers the following explanation queries: Minimum Sufficient Reason (MSR), Minimum Contrastive Reason (or minimum change required) (MCR), Shapley Additive Explanations (SHAP), and variants of MSR such as check sufficient reason (CSR) and count completions (CC). Moreover, it considers the following models (and their respective ensembles): decision trees, linear models, and neural networks.
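For orientation, a hedged reconstruction of these queries in standard notation (ours, not quoted from the paper), for a classifier $f:\{0,1\}^n \to \{0,1\}$, an instance $x$, and a feature subset $S \subseteq [n]$:

```latex
% Standard explanation queries (reconstructed; notation is ours).
\begin{align*}
\textsf{CSR}:&\ \text{decide whether } S \text{ is a sufficient reason, i.e., }
  \forall z\,\bigl(z_S = x_S \Rightarrow f(z) = f(x)\bigr);\\
\textsf{MSR}:&\ \text{decide whether a sufficient reason of size at most } k \text{ exists};\\
\textsf{MCR}:&\ \text{decide whether some } S \text{ with } |S| \le k \text{ admits } z \text{ such that }
  z_{[n]\setminus S} = x_{[n]\setminus S} \text{ and } f(z) \neq f(x);\\
\textsf{CC}:&\ \text{count the completions } \bigl|\{\, z : z_S = x_S,\ f(z) = f(x) \,\}\bigr|.
\end{align*}
```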

Their main results (which are based on various complexity results for the above-mentioned queries on the above-mentioned models) are:

  • DTs are strictly more c-interpretable than ensembles of DTs for CSR, MSR, MCR, CC, and SHAP, and the same holds for linear models for CSR, MSR, and MCR. Moreover, assuming unary weights and biases, the same applies to linear models w.r.t. CC and SHAP.

  • The above does not hold for neural networks, i.e., for every query, neural networks and their ensembles behave the same. This is based on the observation that an ensemble of neural networks can be transformed into an equivalent single neural network in polynomial time (see the sketch after this list).

  • Parameterized hardness results for ensembles of all model types, with the size of any single model as the parameter, for all query types.

  • Parameterized complexity results for ensembles of DTs and linear models with the number of models as the parameter, showing that while this parameter does not help for linear models, it can be helpful for DTs.
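
A minimal NumPy sketch of the parallel-composition idea behind the neural-network observation above (our illustration; the paper's construction may differ in detail). Two one-hidden-layer ReLU networks over the same input are merged by stacking their hidden units, so the merged network computes the sum of their outputs, which a final threshold can turn into a (weighted) vote; the size grows only additively, hence polynomial time:

```python
import numpy as np

def merge_relu_nets(nets):
    """Merge one-hidden-layer ReLU nets f_i(x) = w2_i @ relu(W1_i @ x + b1_i)
    into a single net computing sum_i f_i(x). Hidden layers are stacked,
    so the merged size is just the sum of the parts."""
    W1 = np.vstack([W1_i for W1_i, _, _ in nets])
    b1 = np.concatenate([b1_i for _, b1_i, _ in nets])
    w2 = np.concatenate([w2_i for _, _, w2_i in nets])
    return W1, b1, w2

def forward(W1, b1, w2, x):
    return float(w2 @ np.maximum(W1 @ x + b1, 0.0))

# Two tiny nets over a 2-dimensional input: f1(x) = relu(x0), f2(x) = relu(x1).
net1 = (np.array([[1.0, 0.0]]), np.array([0.0]), np.array([1.0]))
net2 = (np.array([[0.0, 1.0]]), np.array([0.0]), np.array([1.0]))
W1, b1, w2 = merge_relu_nets([net1, net2])
print(forward(W1, b1, w2, np.array([2.0, 3.0])))  # 5.0 = f1(x) + f2(x)
```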

Update after rebuttal

I stand by my initial recommendation because the authors did not convince me that the overlapping results are not a problem. In particular, they did not even address all the overlaps in their rebuttal.

Questions for the Authors

The main question I have is whether the authors agree with my assessment with respect to the results in https://arxiv.org/pdf/2407.15780.

Claims and Evidence

The claims are supported by clear and convincing evidence.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

I checked all proofs occurring in the main text as well as some proofs in the appendix and did not find any issues.

Experimental Design and Analysis

n/a

Supplementary Material

I checked the proof methods for all results given in the appendix as well as the deferred preliminaries.

Relation to Prior Literature

This is the main problem of the paper: various results that are claimed to be new have already been shown in https://arxiv.org/pdf/2407.15780, which is also cited but not discussed in the paper. In particular, this holds for almost all results shown for DTs and ensembles of DTs for the query types MSR, MCR, and CSR. I am not sure why the authors did not see this; I can only imagine that they did not realise that MSR equals local abductive explanations, MCR equals local contrastive explanations, and that https://arxiv.org/pdf/2407.15780 also contains results on CSR that were used for MSR.

In particular, I found the following problems:

  • Proposition 4.2: The following results have already been shown in https://arxiv.org/pdf/2407.15780 for ensembles of DTs: coNP-hardness with respect to CSR (Lemma 42), NP-hardness with respect to MCR (Theorem 36), and coNP-hardness w.r.t. MSR (Theorem 36).

  • Proposition 5.1: The following results have already been shown in https://arxiv.org/pdf/2407.15780 for ensembles of DTs w.r.t. the size of any single model: para-coNP-hardness for CSR (Theorem 36), para-NP-hardness for MCR (Theorem 36), and para-coNP-hardness for MSR (Theorem 36).

  • Proposition 5.3: The following results have already been shown in https://arxiv.org/pdf/2407.15780 for ensembles of DTs w.r.t. the number of ensemble models: co-W[1]-hardness for CSR (Lemmas 29, 30, and 34), and W[1]-hardness (Theorem 35) and XP membership (Theorem 13) for MCR.

  • Proposition 5.5: already shown in Theorem 13 of https://arxiv.org/pdf/2407.15780.

  • Proposition 5.6: already shown in Theorem 8 of https://arxiv.org/pdf/2407.15780 (in a much more general form).

Missing Important References

See previous item.

Other Strengths and Weaknesses

The problems considered in the paper are well motivated and significant, and the obtained results are insightful and plentiful. Unfortunately, many of the obtained results were already known and shown in https://arxiv.org/pdf/2407.15780 (see above). This is of course a major problem for the paper that cannot easily be repaired, since repairing it might (and probably will) change the motivation and problem setting of the paper. I therefore have to recommend that the paper be rejected. I would also recommend that the authors resolve this issue and resubmit.

Other Comments or Suggestions

I do not have additional comments.

Author Response

We thank the reviewer for the valuable feedback and for acknowledging the significance of many of our results.

Relation to Ordyniak et al. (KR, November 2024)

We thank the reviewer for highlighting the important connection to the recent and highly significant work by Ordyniak et al., published at KR in November 2024. Their work addresses several related concepts that are indeed relevant to ours. While our work was developed independently and focuses on many complementary aspects, we fully recognize the significance of their contributions and agree that our discussion should be improved to better reflect this connection.

We appreciate the reviewer's highlighting of certain intersections within a few propositions. However, we would like to emphasize that these represent only a very small portion of our overall contributions and proofs, many of which the reviewer has positively acknowledged. Specifically, after careful examination, we found genuine intersections amounting to no more than 1–2 pages of proofs in total, within dozens of pages of proofs in the full 44-page appendix. Throughout our rebuttal, we will clarify this point with technical details, and we will further address and elaborate on this important related work in the final version. Nonetheless, we believe these refinements can be effectively addressed in a camera-ready revision, without requiring a new submission.

Technical elaboration. Many of our non-parameterized results are entirely independent of Ordyniak et al., including technically challenging results such as Lemma E.5, which resolves a long-standing issue, and a complex dynamic programming algorithm for SHAP under linear models with unary weights.

Our parameterized results focus on three model classes: linear models, neural networks, and decision trees—of which the first two were not studied by Ordyniak et al. A key contribution is our intractability results for ensembles of a constant number of linear models (Pages 31–44), which reveal surprisingly sharp complexity contrasts with decision tree ensembles. We also analyze explanations like CC and SHAP using non-trivial proof techniques. Our analysis goes beyond majority voting to include weighted voting, which involves some intricate constructions. Although there is some intersection with Ordyniak et al. in the parameterized results for CSR/MCR/MSR on decision tree ensembles using majority voting, the bulk of our proofs for these classes are distinct as well—as we elaborate below.

For Proposition 4.2, our proof applies to a broader class of poly-subset-constructible models, including decision trees, linear models, neural networks, and more. We also cover the CC and SHAP queries. While there is some intersection in our results for CSR and MCR, our results for these are mainly corollaries and are not central to the technical core of this proposition—CSR isn’t even listed as a novel contribution of ours in Table 1, and we’ll revise MCR similarly. Finally, for the MSR query, we prove $\Sigma^P_2$-hardness via a non-trivial construction that goes beyond coNP-hardness.

Similarly, Proposition 5.1 applies to the broader class of poly-subset-constructible models—and extends the analysis to CC and SHAP. While the CSR/MSR/MCR results for decision trees intersect, this is again only an extremely brief corollary (lines 1685–1688 only). For Proposition 5.3, we agree on the MCR and CSR hardness proofs and will clarify this. However, our results also present non-trivial membership results—co-W[1] for CSR, W[P] for MCR—and additional findings such as #W[1]-hardness and membership results for SHAP and CC. We also analyze MSR, showing para-NP-hardness and containment in XNP. Lastly, while we agree that Propositions 5.5/5.6 can be inferred from Ordyniak et al., in our work these too are presented more as direct corollaries than as central contributions. We can clarify this by revising the very brief quarter-page section in the main text discussing these results, to make their connection to that work more explicit.

Summary. We fully acknowledge the highly valuable contributions of the work of Ordyniak et al. and agree with the reviewer that their work deserves a more thorough discussion. While there are some technical intersections, we believe they are minor. Our main contributions and the current structure of the paper—which emphasize the surprisingly contrasting behaviors across different model families (e.g., linear models vs. decision trees) and offer a nuanced analysis of the complexity gaps between base models and ensembles—can largely remain unchanged. We believe the clarifications regarding the connection to Ordyniak et al. are reasonable to address in a camera-ready revision, without requiring a separate submission. We will incorporate these clarifications by expanding on their work in the introduction, related work, and the relevant sections of the main text and appendix. We thank the reviewer for highlighting this very important point!

Review (Rating: 3)

This paper studies the interpretability of ensembles of models through the lens of computational complexity theory. The authors show both negative results (e.g., intractability even for constant-size base models) and positive results (e.g., tractability for small decision-tree ensembles).

Questions for the Authors

The analysis assumes feature independence for Shapley value computation. How would feature correlations impact the complexity results? Would relaxing this assumption require new complexity classes or render the problem fundamentally harder?

The paper shows that linear model ensembles become intractable with just two base models, while decision tree ensembles can be tractable for fixed k. What inherent structural property of decision trees (e.g., axis-aligned splits, hierarchical structure) enables this tractability, and why does this property fail for linear models? Is there a formal characterization of model classes that avoid this complexity explosion?

Claims and Evidence

Yes, there are proofs.

Methods and Evaluation Criteria

The paper is a pure theoretical work with no experiments.

Theoretical Claims

Yes, there are proofs.

Experimental Design and Analysis

No, there are no experiments.

Supplementary Material

Yes, some of the proofs.

Relation to Prior Literature

This paper bears on specific claims surrounding the explainability of ensemble models.

Missing Important References

No.

Other Strengths and Weaknesses

Strengths:

Novelty and Broad Scope:

The paper provides a formal, complexity-theoretic framework for ensemble interpretability, addressing questions that have so far been underexplored. For instance, the fact that k-ensembles of decision trees are strictly more interpretable than k-ensembles of linear models was not proved before. The analysis spans multiple explanation types (sufficient reasons, contrastive explanations, Shapley values) and base models. Thus, the paper paints a broader picture.

Limitations:

The paper is very hard to read, with several complexity-theoretic notions that I think are unnecessary. Some of the proofs (e.g., Proposition E.1) are dense and could benefit from higher-level intuition in the main text.

The paper addresses questions with obvious answers. For instance, Section 4 studies the comparison between the explainability of base models and their ensemble counterparts. It is obvious that the former are more easily interpretable than the latter. I don't see why we need a study for this purpose. Why is it even important to answer these questions?

There are no experiments to show the applicability of the theory discussed here.

Other Comments or Suggestions

No.

Author Response

We thank the reviewer for the constructive feedback and for recognizing the importance of the results provided in our work.

The importance of complexity separations between base models and ensembles

While the reviewer acknowledges the importance of some of our unexpected results, such as the substantial complexity differences between ensembles of decision trees and linear models, they question the value of including more intuitive findings — like the complexity gap between base models and their ensembles. We appreciate this and clarify that, although it is intuitive to expect ensembles to be more complex, the exact nature of this complexity gap is not at all well understood. Our results show that while such gaps often appear, they sometimes do not — highlighting the need to understand when and why they arise.

For instance, ensembles of neural networks do not lead to increased complexity, in contrast to ensembles of decision trees and linear models. Furthermore, we demonstrate that the presence of a complexity gap also depends on the type of explanation used — for instance, linear models and their ensembles exhibit no complexity gap under CC and SHAP. It is interesting to note, though, that when the weights of an ensemble of linear models are encoded in unary, the complexity gap for CC and SHAP does reappear. This finding emphasizes the importance of a linear model’s expressiveness in driving this effect and reveals another unexpected insight into how the complexity gap between ensembles and their base models varies across different settings.
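For concreteness, a hedged formalization of the weighted-voting setting discussed here (notation ours, reconstructed from standard usage rather than quoted from the paper):

```latex
% Weighted voting over base models f_1, ..., f_k (notation ours).
F(x) \;=\; \mathbb{1}\!\left[\ \sum_{i=1}^{k} w_i\, f_i(x) \;\ge\; \theta\ \right],
\qquad \theta = \tfrac{1}{2}\sum_{i=1}^{k} w_i \ \text{for majority-style voting}.
```

The role of unary encoding is the standard one: the input length then grows with $\sum_i w_i$, so a dynamic program whose running time is polynomial in the total weight (only pseudo-polynomial under binary encoding) runs in genuinely polynomial time for a single linear model, while the ensemble remains hard, which is why the gap reappears.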

To summarize, although the broad distinction between ensembles and base models might be anticipated, our findings reveal various subtle and nontrivial differences across settings — differences that depend on both the model type and the explanation method. This highlights the importance of a thorough and rigorous analysis of this dimension.

What structural properties drive the complexity explosion in linear models versus decision trees?

We thank the reviewer for this insightful question, and we will elaborate on this point further in the final version. Intuitively, decision trees have a leaf structure that allows for enumeration over possible outcomes — this enumeration becomes more complex in ensembles, but the increase is gradual. In contrast, linear models lack a comparable enumerable structure, making it impossible to iterate over possible assignments in a similar way. This leads to the surprising result that even a strikingly small ensemble of linear models can already render interpretation computationally intractable. We agree that exploring alternative base models or more structured approaches that retain the enumerable properties of decision trees is a promising direction for future research. We will include this direction in our final version. An important question moving forward is how to design models that balance expressivity and predictive power with the ability to compute explanations efficiently.
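The contrast can be made concrete with the standard single-model argument (our sketch in standard notation, not a proof from the paper): for one linear classifier, checking that a feature set is sufficient reduces to a single worst-case completion that can be computed feature by feature:

```latex
% Sufficiency check for a single linear model (illustrative; notation ours).
% Let f(z) = \mathbb{1}[\, w \cdot z + b \ge 0 \,] over z \in \{0,1\}^n, with f(x) = 1.
% S is sufficient iff the adversarial completion keeps the score nonnegative:
\sum_{i \in S} w_i x_i \;+\; \sum_{i \notin S} \min(0,\, w_i) \;+\; b \;\ge\; 0.
```

Each free feature can be minimized independently, so the check takes linear time. With two linear models voting together, no single completion is simultaneously worst-case for both, and deciding whether some completion flips the vote intuitively becomes a knapsack-like balancing problem; decision trees avoid this because their leaves partition the input space into explicitly enumerable regions.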

Impact of feature independence

We note that our work follows common conventions used in previous computational complexity studies of SHAP (e.g., [1,2]) and in standard implementations like KernelSHAP [3], by assuming feature independence. As the reviewer correctly pointed out—and as also noted in [1]—relaxing this assumption can render the computation of SHAP intractable. Intuitively, even when using a simple model such as a decision tree, a highly complex and expressive data distribution can dominate the computational burden, ultimately making the SHAP computation hard.
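For reference, the quantities involved (standard definitions; notation ours), with the simplification that feature independence buys:

```latex
% SHAP value of feature i (standard definition; notation ours).
\phi_i(f, x) \;=\; \sum_{S \subseteq [n] \setminus \{i\}}
  \frac{|S|!\,(n - |S| - 1)!}{n!}\,\bigl(v(S \cup \{i\}) - v(S)\bigr),
\qquad
v(S) \;=\; \mathbb{E}_{z \sim \mathcal{D}}\bigl[f(z) \mid z_S = x_S\bigr].
```

Under independence, $v(S)$ reduces to an expectation over a product distribution on the free features; a correlated $\mathcal{D}$ can encode a hard counting problem inside $v(S)$ even when $f$ itself is trivial, which is the intuition behind the intractability noted above.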

However, recent work [4] has shown that tractability can still be achieved under less expressive, but non-independent distributions, such as those exhibiting Markovian dependencies. Identifying broader classes of non-independent distributions that still allow efficient computation remains an important and interesting research direction. We believe that linking this line of research with our tractability results for SHAP presents a promising direction for future exploration. We appreciate the reviewer for raising this point and will incorporate it as a direction for future exploration.

Improving clarity of points

Following the reviewer’s comments, in the final version we will enhance the technical presentation of the paper by simplifying some complexity-related notation, clarifying the intuitive exposition of the results, and reducing the density of certain propositions, as suggested. We thank the reviewer for highlighting these points.

[1] On the Tractability of SHAP Explanations (Van den Broeck et al., JAIR 2022)

[2] On the Complexity of SHAP-Score-Based Explanations: Tractability via Knowledge Compilation and Non-Approximability Results (Arenas et al., JMLR 2023)

[3] A Unified Approach to Interpreting Model Predictions (Lundberg et al., NeurIPS 2017)

[4] On the Tractability of SHAP Explanations Under Markovian Distributions (Marzouk et al., ICML 2024)

Final Decision

The reviewers appreciate the paper's solid technical work. However, some of them have pointed out that several of the findings appear similar to those in [Ordyniak et al. 2024], which affects the paper's novelty. The authors should make sure to properly cite and discuss this earlier work, even if their results were obtained independently. Comparing their work with this existing research will make the paper stronger.