Interpretable Mesomorphic Networks for Tabular Data
Explainable deep networks that are not only as accurate as their black-box deep-learning counterparts but also as interpretable as state-of-the-art explanation techniques.
Abstract
Reviews and Discussion
This work explores interpretable models with a focus on tabular data and introduces a novel method that is locally linear while retaining a non-linear global decision boundary. The authors achieve this by expressing their model as a linear model that takes the input features and produces a dot product with weights that depend non-linearly on the features. Thus, for every point a linear decision surface is produced, but the surface remains parameterized as a non-linear function of the inputs. This makes the method more interpretable, as feature-importance attribution can be performed, while at the same time preserving the accuracy benefits. The authors perform a very broad evaluation of their method on the AutoML benchmark, showing that their method indeed outperforms white-box models such as logistic regression while remaining on par with state-of-the-art models for tabular data and offering better and more efficient interpretability.
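To make the described mechanism concrete, here is a minimal PyTorch sketch of a hypernetwork that emits per-instance linear weights; the backbone sizes and layer choices are assumptions for illustration, not the authors' actual IMN implementation.

```python
import torch
import torch.nn as nn

class HyperLinear(nn.Module):
    """Hypernetwork that maps an input x to the weights and bias of a
    per-instance linear model; the prediction is <w(x), x> + b(x)."""
    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.hypernet = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_features + 1),    # per-instance weights + bias
        )

    def forward(self, x):
        params = self.hypernet(x)                 # (batch, n_features + 1)
        w, b = params[:, :-1], params[:, -1]      # instance-specific w(x) and b(x)
        logit = (w * x).sum(dim=1) + b            # locally linear prediction
        return logit, w                           # w * x gives per-feature attributions
```

Training end-to-end on `logit` (e.g., with `nn.BCEWithLogitsLoss`) keeps the global decision boundary non-linear, while `w * x` provides the local, per-instance explanation.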
Strengths
- The evaluation that the authors perform is very thorough and the results seem impressive. The formulated hypotheses are very sensible and very convincingly demonstrated.
- The method is very intuitive and easy to understand. The toy task with the half moons is a really good illustration to gain intuition! The paper is also very well written and easy to follow.
Weaknesses
- Input-dependent weights remind me a lot of a (single-layer) Transformer, where the attention matrix (which is built from the inputs and weights in a non-linear fashion) serves as a weight matrix to then linearly combine the inputs. The method here uses more involved networks to produce this set of weights (not just a single layer), but I still feel this similarity is really worth highlighting more. How would, for instance, a single-layer Transformer perform? Would the attention matrix also serve as a useful interpretability tool? The baseline TabNet seems to leverage attention scores to some degree to offer interpretability; it would be nice if the authors could further comment on the similarities.
- While the method is linear by design around a fixed datapoint x, the decision surface could still rapidly change as a function of x (especially if the classification task requires the function to do so). In such cases, does the method still offer high interpretability? The feature importance metric proposed here only zeroes out the input feature applied to the hyperplane, but the input feature used for constructing the weights is not modified. In such a situation, I would expect that the feature importance metric would fail. I guess this is why the authors employ a regularizer in the loss? How does changing its strength affect the results? In general it would also be nice to get an understanding of where this method might fail to offer explainable results.
Questions
- The authors focus on tabular data but in principle I don’t see anything stopping the methodology from also working on vision. Did the authors try any experiments on simple vision datasets? I would be very interested in seeing if the method can also keep high accuracy scores there.
- How does the method compare to a simple first-order Taylor expansion of the black-box model of interest, around a given input datapoint x? This strategy should also preserve global accuracy scores while offering interpretability locally. It is of course more expensive to compute compared to the authors' model, but ignoring this, I would like to understand the differences better.
Limitations
See above.
We thank the reviewer for providing valuable feedback. Below we address the concerns raised by the reviewer:
-
Regarding: “Weakness 1”
Supposing a single input example with $d$ features, one would need to project each of the features to a fixed dimensionality to represent tokens; in this scenario, the context length would be the number of features $d$. Then, most commonly, related work in the domain appends a classification token [1]. Strictly speaking, a transformer consists of projection weight matrices, like $W_Q$, $W_K$, and $W_V$. The attention matrix of the transformer gives a similarity score to the different query and key representations, and not to the original features.
In the transformer layer, one would use the value representation of the classification token, which would be a combination of the value representations of the other feature tokens based on the similarity scores. This combination of feature representations would then pass through an output layer after the transformer layer to generate the class predictions, as in [1]. On the contrary, our work does not use a combination of the different feature representations to generate the output. One could consider the output weights of the linear model generated from our hypernetwork to be an analogous operation.
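To make the comparison concrete, here is a minimal sketch of the kind of single-layer transformer baseline the reviewer alludes to, with per-feature tokens and an appended classification token; the embedding scheme and sizes are our assumptions for illustration, not a model from the paper.

```python
import torch
import torch.nn as nn

class SingleLayerTabTransformer(nn.Module):
    def __init__(self, n_features: int, d_model: int = 32, n_heads: int = 4):
        super().__init__()
        self.feature_embed = nn.Parameter(torch.randn(n_features, d_model))  # one embedding per feature
        self.cls_token = nn.Parameter(torch.randn(1, 1, d_model))            # appended classification token
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, 1)

    def forward(self, x):
        # x: (batch, n_features) -> feature tokens: (batch, n_features, d_model)
        tokens = x.unsqueeze(-1) * self.feature_embed
        cls = self.cls_token.expand(x.size(0), -1, -1)
        seq = torch.cat([cls, tokens], dim=1)
        out, attn = self.attn(seq, seq, seq)        # similarity is between Q/K projections, not raw features
        logit = self.head(out[:, 0]).squeeze(-1)    # read the prediction off the classification token
        return logit, attn[:, 0]                    # CLS-row attention as a (crude) saliency over tokens
```

Note that the attention row returned here scores the projected key representations of the feature tokens, not the original features themselves, which is the distinction drawn above.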
In the case of attention over instances, it would be different from our method, since our method takes as input only one example at a time.
TabNet has a different architecture that employs sequential attention: it features a series of decision steps, where it learns sparse masks over the original features for every step. Based on the mask of a step, the decision step will have an impact on the final output. The masks are then combined based on each decision step's importance to the final output. On the contrary, our work does not feature multiple decision steps; after generating the weights, the output is linearly dependent on them.
-
Regarding: “Weakness 2”
The aforementioned scenario by the reviewer would not impact our method, as the target linear model is generated per example. The hypernetwork will learn to fit every point and it will retain global accuracy, as shown in Section 2.4. The interpretability experiments in Section 5, Hypothesis 3, include piece-wise functions, and based on the results we argue that our method has competitive performance with the other interpretability baselines.
Based on a given example, the hypernetwork generates the hyperplane weights which, combined with the example features, provide the impact of the individual features. Zeroing out an input feature to the hyperplane would generate predictions for a different example. The regularizer in the loss helps to induce sparsity in the generated weights; however, it only marginally improves the quality of the interpretability predictions. We have experimented with removing the regularizer from the loss and observed that the predictions from our method remained robust.
-
Regarding: “Question 1”
The reviewer is correct in their understanding. We would like to kindly point the reviewer to Line 342 in the main manuscript, where we link to Appendix A, in which we have provided a proof-of-concept example on a vision task.
-
Regarding: “Question 2”
The coefficients of the hyperplane from the first-order Taylor expansion are given by the derivative of the output with respect to the example features. As for IMN, we multiply the derivative with the corresponding feature value and compare both methods on the interpretability benchmark. We provide the results below:
| Dataset | Metric | IMN | Taylor |
| --- | --- | --- | --- |
| Gaussian Linear | Faithfulness | 0.987 | 0.989 |
| | ROAR Faithfulness | 0.639 | 0.558 |
| | Infidelity (lower is better) | 0.007 | 0.009 |
| | ROAR Monotonicity | 0.785 | 0.757 |
| | Shapley Correlation | 0.999 | 0.999 |
| Gaussian NonLinearAdditive | Faithfulness | 0.621 | 0.427 |
| | ROAR Faithfulness | 0.027 | 0.050 |
| | Infidelity (lower is better) | 0.018 | 0.030 |
| | ROAR Monotonicity | 0.637 | 0.625 |
| | Shapley Correlation | 0.741 | 0.618 |
| Gaussian Piecewise | Faithfulness | 0.841 | 0.823 |
| | ROAR Faithfulness | 0.404 | 0.458 |
| | Infidelity (lower is better) | 0.008 | 0.082 |
| | ROAR Monotonicity | 0.682 | 0.605 |
| | Shapley Correlation | 0.875 | 0.560 |
| Wins | | 11 | 3 |

As observed, IMN outperforms the Taylor approximation, winning 11-3 on the provided benchmark.
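For clarity, a minimal sketch of how the gradient-times-input (first-order Taylor) baseline in the table above can be computed, assuming a differentiable model with a scalar output per example (PyTorch autograd; not the exact code used for the experiments):

```python
import torch

def taylor_attributions(model, x):
    """First-order Taylor (gradient * input) attributions.

    The local hyperplane coefficients are df/dx evaluated at x; multiplying
    them by the feature values mirrors the w(x) * x attributions of IMN.
    """
    x = x.clone().requires_grad_(True)
    output = model(x)                            # shape (batch,), one scalar per example
    grads, = torch.autograd.grad(output.sum(), x)
    return (grads * x).detach()                  # per-feature attribution, one row per example
```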
We believe we have correctly addressed all the questions raised by the reviewer. If the reviewer has additional questions, we are more than happy to answer them.
[1] Gorishniy, Y., Rubachev, I., Khrulkov, V., & Babenko, A. (2021). Revisiting deep learning models for tabular data. Advances in Neural Information Processing Systems, 34, 18932-18943.
Thank you for the detailed rebuttal and running those extra experiments for the Taylor approximation. My questions are all addressed. I have raised my score accordingly.
This paper introduces a new neural network architecture called Interpretable Mesomorphic Networks (IMN) for handling tabular data. IMN combines the high accuracy of deep learning models with the interpretability of linear models by generating instance-specific linear models through deep hypernetworks, achieving both local linearity and global depth.
Strengths
S1: Innovative Design: IMN combines the advantages of deep learning and linear models, achieving a balance between high performance and interpretability.
S2: Extensive Experimental Evidence: IMN's effectiveness and advantages are demonstrated through experiments on various datasets.
S3: Interpretability: IMN provides both instance-level explanations and global feature importance analysis.
S4: Efficiency: IMN shows excellent inference time performance, particularly on large-scale datasets.
Weaknesses
W1: The implementation and training process of IMN is relatively complex, which might pose challenges for deployment and maintenance in practical applications.
W2: IMN relies on the design and optimization of deep hypernetworks, potentially requiring extensive hyperparameter tuning and computational resources.
W3: While extensive, the experiments do not cover all possible application scenarios, especially for more challenging unstructured data. What is the significance of the proposed method?
Questions
Q1: How scalable is IMN in practical applications (i.e., in time complexity)? While IMN performs well in experiments, how does it fare in large-scale, complex tabular datasets in real-world applications?
Q2: How sensitive is IMN to hyperparameters? Does IMN's performance vary significantly with different hyperparameter settings, requiring extensive tuning?
Q3: Comparison with other emerging interpretability methods: How does IMN compare with the latest interpretability methods, such as those based on attention mechanisms, in terms of advantages and disadvantages?
Limitations
See Weaknesses and Questions.
We would like to thank the reviewer for the valuable feedback. Below we will address the concerns raised by the reviewer:
-
Regarding: W1
Implementation:
We would like to point out to the reviewer that we reuse simple feed-forward backbones from previous work [1]. The only extra operation is implemented by a single line (Line 58 in the module hypernetwork.py of the provided code).
Training:
Our code offers a fit/predict interface similar to the well-known and widely used scikit-learn library [2], which should facilitate usage for practitioners. Additionally, our method uses the standard stochastic gradient descent training procedure. We would like to remind the reviewer that our approach is end-to-end; in this regard, there is only one model class and not a series of components that have to be fit individually.
Lastly, we would like to point the reviewer to Table 2 in the manuscript, where, as observed, our method has a faster median training time than the tabular ResNet (a result of the number of epochs being a hyperparameter, which translates to IMN converging faster).
Deployment:
After training the model, a practitioner would only need to instantiate the class and load the weights. We kindly point the reviewer to Table 2 in the main manuscript, where, as observed, IMN has an inference time comparable to that of the tabular ResNet. IMN can be run on the CPU, and for inference/deployment the model can additionally be quantized.
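As an illustration of the quantization step mentioned above, PyTorch's post-training dynamic quantization can be applied to the linear layers; the backbone below is a stand-in for illustration, not IMN's actual architecture.

```python
import torch
import torch.nn as nn

# Stand-in feed-forward backbone (illustrative only, not the IMN architecture)
backbone = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 21))

# Quantize the Linear layers to int8 for cheaper CPU inference
quantized = torch.quantization.quantize_dynamic(backbone, {nn.Linear}, dtype=torch.qint8)
print(quantized)
```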
Based on the above information, we believe the aforementioned aspects are straightforward for our proposed method.
-
Regarding: W2
When HPO is used:
To verify that IMN does not need extensive hyperparameter tuning and computational resources when HPO is applied, we compare the time it takes IMN to find the incumbent configuration versus the time it takes its plain counterpart, the tabular ResNet. We provide a comparison of the incumbent performance of both methods versus the number of trials. As in the main manuscript (Table 4), we select 3 datasets with distinctive characteristics (number of instances / number of features): Credit-g (1000/21), Adult (48842/15), and Christine (5418/1637). We present the results in Figure 2 of the document attached to the Global Response, where, as observed, it takes only a few trials to find a well-performing hyperparameter configuration for both IMN and the tabular ResNet. Additionally, it takes IMN the same number of trials as the tabular ResNet to find the optimal hyperparameter configuration.
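For clarity, the incumbent curve referenced above is simply the best-so-far validation score after each HPO trial; a minimal sketch with made-up numbers (not our actual results):

```python
import numpy as np

def incumbent_curve(trial_scores):
    """Best-so-far (incumbent) validation score after each HPO trial."""
    return np.maximum.accumulate(np.asarray(trial_scores, dtype=float))

# Validation AUROC of five successive trials (illustrative numbers)
print(incumbent_curve([0.71, 0.69, 0.78, 0.74, 0.80]))  # best-so-far: 0.71, 0.71, 0.78, 0.78, 0.80
```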
When HPO is not used:
We would kindly point the reviewer to Line 305, “Hypothesis 1 and 2 are valid even when default hyperparameters are used, for more details we kindly refer the reader to Appendix B”.
-
Regarding: W3
We agree with the reviewer that other data modalities are important; however, we believe tabular data are also very important and ubiquitous [3], due to the numerous application domains involving tabular data. In that context, our focus is on tabular data, which, in contrast to the other mentioned modalities, is structured. We believe investigating other modalities merits a separate investigation of its own and falls outside the scope of our work.
Despite the above, we would like to point the reviewer to Line 342 in our main manuscript: “Although our work focuses on tabular data, in Appendix A we present an application of IMN in the vision domain.”
-
Regarding: Q1
We would like to point out to the reviewer that we run our method on the AutoML benchmark [4], a benchmark widely used by the community that features a diverse set of real problems. This includes large-scale tabular datasets from real-world applications.
Moreover, Table 2 provides time information regarding the time complexity of our method and the baselines over all datasets. Additionally, in Table 4, we provide a more specific analysis for hand-picked datasets that have different characteristics showcasing the scalability of IMN regarding inference time.
Lastly, as an example, on the largest dataset (airlines, circa 583k training instances) IMN takes only 6.9 hours for training and has an inference time of 0.195 seconds.
-
Regarding: Q2
We investigate the sensitivity of IMN and CatBoost with regard to the hyperparameter configuration used. For every task included in our experiment, we generate a distribution of the validation AUROC performances over all the explored hyperparameter configurations per method.
Figure 3 of the document attached to the Global Response presents the results, where, as observed, IMN has a sensitivity comparable to CatBoost with regard to the hyperparameter configuration. Moreover, in the majority of cases, the IMN performance does not vary significantly.
-
Regarding: Q3
We would like to kindly point the reviewer to Hypothesis 2, 3, and 4, where we compare IMN with TabNet, an interpretable method that employs attention. IMN manages to outperform TabNet: in terms of performance with a statistically significant difference (Hypothesis 2, Figure 4), in terms of training/inference time (Hypothesis 2, 3 and Table 2, 4), in terms of local interpretability (Hypothesis 3) and in terms of global interpretability (Hypothesis 4, Figure 6).
Our results are consistent with the results of previous work [5].
We believe we have clarified all the concerns raised by the reviewer. We would kindly ask the reviewer to increase the score and recommend acceptance based on the provided clarifications.
I have read the authors’ feedback to each reviewer. I have decided to raise my score to just under borderline.
We thank the reviewer for reading our rebuttal and increasing the score from 3 to 4.
However, we believe we addressed the reviewer's concerns thoroughly in our rebuttal. In particular, we explained that the three weaknesses mentioned in the original review do not hold given the implementation details and the empirical results.
If the reviewer assesses that our rebuttal is unclear, or that there are open issues, we are happy to further address the open points in the remaining rebuttal time.
References:
[1] Kadra, A., Lindauer, M., Hutter, F., & Grabocka, J. (2021). Well-tuned simple nets excel on tabular datasets. Advances in neural information processing systems, 34, 23928-23941.
[2] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. the Journal of machine Learning research, 12, 2825-2830.
[3] Van Breugel, B. & Van Der Schaar, M.. (2024). Position: Why Tabular Foundation Models Should Be a Research Priority. Proceedings of the 41st International Conference on Machine Learning in Proceedings of Machine Learning Research.
[4] Gijsbers, P., LeDell, E., Thomas, J., Poirier, S., Bischl, B., & Vanschoren, J. (2019). An open source AutoML benchmark.
[5] McElfresh, D., Khandagale, S., Valverde, J., Prasad C, V., Ramakrishnan, G., Goldblum, M., & White, C. (2024). When do neural nets outperform boosted trees on tabular data?. Advances in Neural Information Processing Systems, 36.
I appreciate your thorough response to my comments. While your clarifications address some of the concerns, I still have reservations about the interpretability study and the overall experimental design.
Interpretability study limitations Your interpretability study (hypothesis 4) is based on a relatively small dataset (Census with only 10+ features) without noise data. The additional study on mushroom edibility prediction also uses a well-studied dataset where high performance (F1 score of 0.99) is easily achievable. These choices in experimental datasets limit the generalizability of your interpretability claims.
Real-world application cases The paper would be significantly strengthened by including detailed case studies from real-world applications (not just well-studied datasets). These could demonstrate IMN's effectiveness and interpretability in practical scenarios such as financial risk assessment, healthcare predictions, marketing analytics, or HR decision support.
Regarding your dataset selection To increase confidence in IMN's capabilities, it would be valuable to see results on a wider range of datasets, especially those with more complex feature interactions, higher dimensionality, and the presence of noise.
While I acknowledge the positive scores from other reviewers, these remaining concerns prevent me from increasing my score.
We thank the reviewer for the valuable response.
-
Regarding: "Dataset selection: To increase confidence in IMN's capabilities, it would be valuable to see results on a wider range of datasets, especially those with more complex feature interactions, higher dimensionality, and the presence of noise."
We agree that a wide range of datasets is necessary to showcase the gain of our method. For this reason, we already used a large battery of datasets in our experimental protocol. Concretely:
- For Hypothesis 1 and 2 on performance vs. interpretable methods by design:
- We compared against 7 baselines on 35 datasets from the popular AutoMLBenchmark [1], details in Table 6, Appendix C.
- For Hypothesis 3 on local interpretability methods:
- We compared against 8 baselines on the 3 datasets offered by the recent XAI Benchmark [2], details in Table 3, Section 5.
- For Hypothesis 4 on global interpretability:
- We compared against 4 baselines on 2 datasets from prior work [3, 4].
Overall, we compared against 14 different baselines on 40 different datasets across the four sets of experiments. The datasets in our protocol arise in various domains, such as healthcare (dataset: blood-transfusion), finance (dataset: census, australian, credit-g), etc.
-
Regarding "Your interpretability study (hypothesis 4) is based on a relatively small dataset (Census with only 10+ features) without noise data. The additional study on mushroom edibility prediction also uses a well-studied dataset where high performance (F1 score of 0.99) is easily achievable. These choices in experimental datasets limit the generalizability of your interpretability claims.":
We would like to stress that the primary contribution of our work is offering a local interpretability technique that is by design interpretable due to the linear models generated by our hypernetworks. As described in the manuscript, the purpose of Hypothesis 4 is to show that even though our method is designed for local interpretability, it also offers global interpretability as a bonus feature. We would like to kindly point out to the reviewer that the census and mushroom-edibility datasets are well-known datasets from prior work [3, 4].
Regarding the well-studied problem of mushroom-edibility, we would kindly like to point out to the reviewer that we verify that our method provides valid global interpretability in the following ways:
- The experimental way, as provided for the Census dataset, where a feature is dropped and the impact on the predictive performance of the model is measured as its importance (a sketch of this protocol follows after this list).
- From knowing the actual feature importances. As such, for this second scenario, one needs a well-studied problem to know the underlying feature importance.
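A minimal sketch of the drop-one-feature protocol mentioned in the first point above, assuming a scikit-learn-style classifier and AUROC as the performance measure (the paper's exact error metric and retraining setup may differ):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def drop_feature_importance(model_cls, X, y, **model_kwargs):
    """Retrain with one feature removed at a time and record the drop in
    validation AUROC; a larger drop indicates a more important feature."""
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)
    base = model_cls(**model_kwargs).fit(X_tr, y_tr)
    base_auc = roc_auc_score(y_va, base.predict_proba(X_va)[:, 1])
    drops = {}
    for j in range(X.shape[1]):
        keep = [k for k in range(X.shape[1]) if k != j]
        m = model_cls(**model_kwargs).fit(X_tr[:, keep], y_tr)
        drops[j] = base_auc - roc_auc_score(y_va, m.predict_proba(X_va[:, keep])[:, 1])
    return drops

# Toy usage on synthetic data (illustrative only)
X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=0)
print(drop_feature_importance(LogisticRegression, X, y, max_iter=1000))
```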
Additionally, please note that the problem of feature attribution (i.e. finding important/influential features) is orthogonal to the performance of classifiers on a dataset. If a baseline achieves a high F1 score, it does not mean that the dataset is uninteresting from an interpretability perspective.
-
Regarding: "The paper would be significantly strengthened by including detailed case studies from real-world applications (not just well-studied datasets). These could demonstrate IMN's effectiveness and interpretability in practical scenarios such as financial risk assessment, healthcare predictions, marketing analytics, or HR decision support."
As mentioned in our previous reply, we use datasets from diverse domains (see comment above on the datasets used).
We would like to kindly point out to the reviewer that both census (from the 1994 Census database) and mushroom-edibility (from the National Audubon Society Field Guide) are real-world datasets as mentioned in [3]. Lastly, we would like to mention a few datasets that fall under the categories the reviewer lists: 1) blood-transfusion (Blood Transfusion Service Center in Hsin-Chu City Taiwan), 2) credit-g (German credit data), 3) australian (Australian Credit Approval dataset), etc.
We believe we have clarified all the concerns raised by the reviewer. We would kindly ask the reviewer to increase the score and recommend acceptance based on the provided clarifications.
[1] Gijsbers, P., LeDell, E., Thomas, J., Poirier, S., Bischl, B., & Vanschoren, J. (2019). An open source AutoML benchmark.
[2] Liu, Y., Khandagale, S., White, C., & Neiswanger, W. Synthetic Benchmarks for Scientific Research in Explainable Machine Learning. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
[3] Arik, S. Ö., & Pfister, T. (2021, May). Tabnet: Attentive interpretable tabular learning. In Proceedings of the AAAI conference on artificial intelligence (Vol. 35, No. 8, pp. 6679-6687).
[4] Lundberg, S. M., & Lee, S. I. (2017). A unified approach to interpreting model predictions. Advances in neural information processing systems, 30.
Dear authors,
Could you provide the interpretability study (Hypothesis 4) on more (a little larger) datasets, for instance, OpenML 618 and OpenML 579?
Best
We would like to thank the reviewer for the provided reference.
Regarding: "Could you provide the interpretability study (Hypo 4) on more (little bit larger) dataset"
We would like to point out to the reviewer that the census dataset (id 1590) has 49k instances, while the mushroom-edibility dataset (id 24) has 1.8k instances. Datasets 579 and 618, referenced by the reviewer, have 250 and 1k instances respectively, so in this regard they are not larger. Additionally, both are synthetic datasets and not real-world examples, in contrast to the original set considered.
Despite the above, we gladly ran our method on the datasets referenced by the reviewer.
Both datasets are generated by the Friedman function plus noise, given as: $y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + \epsilon$.
Dataset 579 features 250 examples and the 5 original features.
Dataset 618 is generated by the same function, however, it additionally has 45 random features, which is 9 times the original number of features. The dataset features 1000 instances.
For both datasets, the first 5 features are the original features that contribute to the output.
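For reference, an analogous Friedman-style dataset can be generated with scikit-learn; this is a sketch assuming the OpenML datasets follow the standard Friedman #1 construction (the exact noise level and scaling on OpenML may differ):

```python
from sklearn.datasets import make_friedman1

# Friedman #1 regression task analogous to OpenML 618: 50 features,
# of which only the first 5 enter the target function, the rest are noise.
X, y = make_friedman1(n_samples=1000, n_features=50, noise=1.0, random_state=0)
print(X.shape, y.shape)  # (1000, 50) (1000,)
```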
-
We run our method on dataset 579, and we get the following results:
Feature ranking: ['x_4', 'x_5', 'x_2', 'x_3', 'x_1']
Feature impacts: [0.7252574, 0.48875564, 0.4850915, 0.44188046, 0.3727691]
Repeating the experiment from the census dataset in Hypothesis 4, we drop a feature and evaluate the increase in the model's prediction error. We present the results in the table below:

| Feature to remove | Error change |
| --- | --- |
| x_4 | +0.09453 |
| x_5 | +0.08712 |
| x_2 | +0.04188 |
| x_3 | +0.01331 |
| x_1 | +0.00539 |

The error change confirms the ranking generated by our method and is intuitive given the form of the equation. We would like to mention that the interpretability provided in this scenario is dependent on the dataset, as we do not have a ground-truth ranking of feature importances here.
-
We run our method on dataset 618, and we get the following results:
Feature ranking: ['x_5', 'x_4', 'x_3', 'x_2', 'x_1', 'x_41', 'x_21', 'x_19', 'x_38', 'x_50', 'x_6', 'x_28', 'x_17', 'x_44', 'x_35', 'x_22', 'x_24', 'x_48', 'x_14', 'x_29', 'x_15', 'x_11', 'x_36', 'x_26', 'x_42', 'x_32', 'x_46', 'x_23', 'x_43', 'x_12', 'x_16', 'x_39', 'x_33', 'x_45', 'x_34', 'x_30', 'x_18', 'x_27', 'x_40', 'x_10', 'x_7', 'x_49', 'x_13', 'x_8', 'x_25', 'x_47', 'x_20', 'x_37', 'x_9', 'x_31']
Feature impacts: [0.7121832 0.3526064 0.3483662 0.23299913 0.15339686 0.09143675 0.07042366 0.05917934 0.05624102 0.0553472 0.0540031 0.05357284 0.05347642 0.05335978 0.05294265 0.05056856 0.05014918 0.04853891 0.04850458 0.04743053 0.04738801 0.04495728 0.04370673 0.04227468 0.04207935 0.0419944 0.04045379 0.03908753 0.03472474 0.03439515 0.03094473 0.03029181 0.02994934 0.02873245 0.02851641 0.02840919 0.02693252 0.02679885 0.02547741 0.02534737 0.02332983 0.02322569 0.02239586 0.02127747 0.02053387 0.02039642 0.02035189 0.01959292 0.01916017 0.01906535]
In this scenario, there exists a ground-truth ranking, since we have features that contribute to the target variable and features that are random. Based on the above results, we conclude that all the non-noise variables are captured by IMN as important, even in a highly noisy scenario where 90% of the features are random and only 10% contribute to the target variable.
Dear authors,
larger means including more features.
I think this experiment on the generated dataset is interesting. (Could you provide more details on this experiment and the other baseline methods?) My questions on experiments are all addressed. I will raise my score accordingly.
The paper presents a hypernetwork approach to build an interpretable neural network for tabular data. A deep hypernetwork takes an input and returns the weights for a linear model, which classifies the given point. The example-based interpretability follows from the interpretability of linear models. The model is evaluated on typical benchmark datasets.
Strengths
- The model is simple, very intuitive and theoretically sound.
- This is another work which shows the wide range of applicability of hypernetworks. Here, the authors use hypernetworks for the interpretability problem.
- The authors propose a local classification rule (an individual model for each point) using a linear model (target network), while the global rule is given by the deep model (hypernetwork). This is a very interesting idea and has not been investigated before for hypernetworks.
Weaknesses
- The authors restrict their baseline to interpretable classifiers. There are a lot of recent studies on tabular data that authors should compare with.
- This is not the first work on using hypernetworks on tabular data. The authors of [1] use hypernetworks to build an ensemble of neural networks. The authors should at least refer to this work in their paper.
[1] Wydmański, Witold, Oleksii Bulenok, and Marek Śmieja. "Hypertab: Hypernetwork approach for deep learning on small tabular datasets." 2023 IEEE 10th International Conference on Data Science and Advanced Analytics (DSAA). IEEE, 2023.
Questions
- The authors should explain the connection with the paper [1] and use other deep models for comparison.
- The global classification rule is given by the hypernetwork. How does the classification change for nearby points? The authors could illustrate it with an example.
Limitations
yes
We thank the reviewer for the valuable feedback. Below we will address the concerns raised by the reviewer:
-
Regarding: “The authors restrict their baseline to interpretable classifiers. There are a lot of recent studies on tabular data that authors should compare with.”
The main focus of our work lies in interpretability. We compare with state-of-the-art tabular models (Hypothesis 2), to verify that our method is comparable in accuracy and it does not suffer a significant degradation in accuracy. Based on recent work [3] which extensively compares tabular models on a wide range of datasets (circa 176 datasets), we selected the top-performing methods for different model types (Neural Networks and Gradient Boosted Decision Trees).
We believe the existing set of included baselines is representative of the most frequent models used in the tabular domain and which achieve competitive performance.
Moreover, we add two additional baselines to our experimental protocol, DANet [1] and HyperTab [2] as the reviewer suggested. The results are presented in Figure 1 at the Global Response. We describe the results in more detail in our next response to the reviewer.
-
Regarding: “This is not the first work on using hypernetworks on tabular data. The authors of [1] use hypernetworks to build an ensemble of neural networks. The authors should at least refer to this work in their paper”.
We thank the reviewer for providing a valuable reference. Following the reviewer’s suggestion, we additionally add HyperTab to our experiments. We provide the results on our benchmark for all methods in Figure 1, where as observed, HyperTab achieves a similar performance to IMN, with a slightly lower rank.
To validate the HyperTab results, we additionally compare all methods on datasets that have circa 1k instances or fewer (5 datasets); the results are presented in the table below:
| Method | Average Rank |
| --- | --- |
| TabNet | 7.0 |
| DANet | 5.4 |
| IMN | 4.0 |
| Random Forest | 3.2 |
| TabResNet | 3.0 |
| CatBoost | 2.8 |
| HyperTab | 2.6 |

The results are consistent with the results from the original authors [2], who advocate that “HyperTab performs strongly for small datasets however, it is comparable with other methods for medium and large-scale datasets”. This in turn validates our results.
We will make sure to include the provided reference in the related work and describe the differences between HyperTab and our method, e.g.:
- Our work takes as input the full view of original features, while HyperTab takes as input a binary mask depicting the subset of selected original features.
- Our work makes a single forward pass through the hypernetwork, while HyperTab does multiple forward passes through the hypernetwork to build an ensemble of target networks.
- The target network of IMN is a single output layer with 1 unit in the case of a binary classification task and $C$ units in the case of multi-class classification, where $C$ represents the number of classes. In contrast, HyperTab has a target network with a hidden layer of multiple units and an output layer (2 layers in total).
- The referenced work focuses on accuracy while ours focuses on interpretability.
-
Regarding: “The global classification rule is given by the hypernetwork. How the classification changes for nearby points. The authors could illustrate it with an example.”
We would like to kindly point the reviewer to Section 2.4, Figure 2 (left), where we have provided an example of how the global classification boundary looks based on the hypernetwork prediction, as the reviewer requests. The section additionally provides information on the classification accuracy of the locally generated linear models with regard to neighboring points.
We believe we have addressed the concerns raised by the reviewer and we welcome any other questions/uncertainties that the reviewer might have.
[1] Chen, J., Liao, K., Wan, Y., Chen, D. Z., & Wu, J. (2022, June). Danets: Deep abstract networks for tabular data classification and regression. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 36, No. 4, pp. 3930-3938).
[2] Wydmański, Witold, Oleksii Bulenok, and Marek Śmieja. "Hypertab: Hypernetwork approach for deep learning on small tabular datasets." 2023 IEEE 10th International Conference on Data Science and Advanced Analytics (DSAA). IEEE, 2023.
[3] McElfresh, D., Khandagale, S., Valverde, J., Prasad C, V., Ramakrishnan, G., Goldblum, M., & White, C. (2024). When do neural nets outperform boosted trees on tabular data?. Advances in Neural Information Processing Systems, 36.
I would like to thank the authors their answer and I hope to see the paper at NeurIPS!
The paper introduces a new class of neural networks designed to be both deep and interpretable. These networks, referred to as Interpretable Mesomorphic Networks (IMN), utilize deep hypernetworks to generate linear models on a per-instance basis. This approach aims to retain the high accuracy of traditional black-box neural networks while providing explainability by design. The authors demonstrate through extensive experiments that their models perform comparably to state-of-the-art classifiers and outperform existing explainable methods on tabular data.
优点
- The proposed method offers interpretability by generating linear models specific to each data instance, making it easier to understand individual predictions.
- The IMN achieves accuracy comparable to black-box models, ensuring high performance while providing explainability.
- The approach is applicable to both classification and regression tasks, enhancing its utility across different types of tabular data problems.
- The method has been validated through extensive experiments, demonstrating its efficacy compared to state-of-the-art classifiers and explainable methods.
缺点
I am not an expert on interpretability issues, but I am fairly familiar with the topic of tabular data prediction. Overall, I didn't notice any significant shortcomings in the paper. However, the authors overlooked DANet [1], a model that outperforms TabNet and provides a better interpretable framework. This could be included for comparison.
[1] Danets: Deep abstract networks for tabular data classification and regression
问题
See above
局限性
N/A
We would like to thank the reviewer for the valuable feedback. Below we will address the concerns of the reviewer:
-
Regarding: “However, the author overlooked DANet [1], a model that outperforms TabNet and provides a better interpretable framework. This could be included for comparison.”:
We thank the reviewer for the valuable suggestion. We would like to point out that there exist 2 main differences between TabNet/IMN and DANet. TabNet/IMN provides local explanations, which are in the original feature space, while DANet provides global explanations and groups/abstracts higher-level features from the original feature space. This in turn makes DANet incompatible with our experimental protocol regarding interpretability.
However, we believe DANet falls under methods that are explainable by design, and as such, we provide a comparison between all baselines and DANet in terms of performance in Figure 1 of the Global Response. As the reviewer suggested, DANet outperforms TabNet; however, it still performs worse than our proposed method in terms of accuracy.
Our results are consistent with the results of previous work [1]. We will update the camera-ready version of our work to include DANet as an additional interpretable baseline.
[1] McElfresh, D., Khandagale, S., Valverde, J., Prasad C, V., Ramakrishnan, G., Goldblum, M., & White, C. (2024). When do neural nets outperform boosted trees on tabular data?. Advances in Neural Information Processing Systems, 36.
We believe we have addressed all the concerns raised by the reviewer and we welcome any additional questions the reviewer might have.
Thank you for your response. Your answer has resolved my issue.
We thank all the reviewers for providing valuable feedback regarding our work. Below we will summarize the main concerns raised by the reviewers:
-
Reviewer csKM: “The authors overlooked DANet”
While DANet is an interpretable method by design, compared to IMN/TabNet it does not provide local explanations, and it additionally does not provide explanations in the original feature space, given that it abstracts higher-level features from the original features. As such, DANet is incompatible with our experimental protocol regarding interpretability.
We include DANet in our accuracy-related experiments and we compare against all baselines for the datasets present in our benchmark. We present the results in Figure 1 of the attached document, where, as the reviewer suggests, DANet outperforms TabNet. However, it has a worse performance compared to our proposed method IMN. Our results are consistent with the results from prior work [1].
-
Reviewer hm6c: “IMN requires extensive tuning and extra computational resources”
To address the reviewer’s concern, we compare IMN against the Tabular ResNet (TabResNet). To prove that IMN does not require extensive tuning, we compare the validation incumbent performance (best validation performance observed) during the HPO trials for both methods. As in the main manuscript (Table 4), we select 3 datasets with distinctive characteristics (number of instances / number of features): Credit-g (1000/21), Adult (48842/15), and Christine (5418/1637).
We present the results in Figure 2 of the attached document, based on which:
- IMN has a similar convergence compared to TabResNet with respect to the incumbent hyperparameter configuration.
- IMN finds a well-performing hyperparameter configuration in a few trials in the majority of cases, similar to the TabResNet, demonstrating that tuning its HPs is not a demanding task.
-
Reviewer hm6c: “How sensitive is IMN to hyperparameters”
To address the reviewer’s concern, we compare IMN and CatBoost (a method known in the community for being robust to its hyperparameters). Specifically, for every task, we plot the distribution of the performance of all hyperparameter configurations for every method. We present the results in Figure 3 of the attached document. As observed, IMN has a sensitivity comparable to CatBoost with regard to the hyperparameter configuration. Moreover, in the majority of cases, the IMN validation performance does not vary significantly.
-
Reviewer qRx1: “The authors restrict the comparison to interpretable classifiers and the authors do not reference HyperTab [3] as a prior work with hypernetworks for tabular data”.
Following the reviewer’s suggestion, we included DANet [2] and HyperTab [3], two deep learning baselines in our experiments. We present the results in Figure 1, where IMN achieves a better rank compared to both methods over all datasets. Our results are consistent with the results from prior work [1] and with the original work [3].
-
Reviewer uYeq: “How would a first-order Taylor expansion compare with the proposed method”
We compare IMN against a first-order Taylor expansion on our explainability benchmark. Based on the results, we argue that IMN outperforms the first-order Taylor expansion. For more details, we kindly refer to the respective answer of the reviewer.
We would like to again thank the reviewers and we look forward to a productive rebuttal.
[1] McElfresh, D., Khandagale, S., Valverde, J., Prasad C, V., Ramakrishnan, G., Goldblum, M., & White, C. (2024). When do neural nets outperform boosted trees on tabular data?. Advances in Neural Information Processing Systems, 36.
[2] Chen, J., Liao, K., Wan, Y., Chen, D. Z., & Wu, J. (2022, June). Danets: Deep abstract networks for tabular data classification and regression. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 36, No. 4, pp. 3930-3938).
[3] Wydmański, Witold, Oleksii Bulenok, and Marek Śmieja. "Hypertab: Hypernetwork approach for deep learning on small tabular datasets." 2023 IEEE 10th International Conference on Data Science and Advanced Analytics (DSAA). IEEE, 2023.
This paper proposed an interpretable architecture for tabular data, where a hypernetwork learns a local linear classifier for the input data point. The weights of this classifier provide a measure of feature importance both locally and globally. The paper is well written and its efficacy is supported by suitable experiments. All reviewers recommended acceptance, and the AC also recommends acceptance.