Explanation Shift: How Did the Distribution Shift Impact the Model?
We find that the modeling of explanation shifts can be a better indicator for analyzing interactions between models and shifting data distributions than state-of-the-art techniques based on representations of distribution shifts.
Abstract
Reviews and Discussion
This paper presents a model for detecting data shift by modeling and quantifying the shift in explanations rather than the shift in model performance or in the data distribution directly. “Explanations” are defined in terms of feature-level contributions to the (relative) model output, e.g. Shapley values. The distributions of explanations on the training and the evaluation/new data sets are compared to obtain a measure of the explanation shift. The measure of shift is based on a two-sample test in which a classifier predicts whether the explanations come from the training or the new distribution.
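For concreteness, a minimal sketch of the pipeline as I understand it (the models, the `shap` usage, and the toy data below are my own illustrative assumptions, not the authors' code):

```python
# Minimal sketch of an explanation-shift C2ST; the models, shap API usage,
# and synthetic data are illustrative assumptions, not the authors' exact code.
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def explanation_shift_auc(model, X_source, X_target):
    """AUC of a domain classifier trained on Shapley explanations (C2ST)."""
    explainer = shap.Explainer(model, X_source)
    S_source = explainer(X_source).values        # one Shapley vector per source row
    S_target = explainer(X_target).values        # one Shapley vector per target row
    Z = np.vstack([S_source, S_target])
    y = np.concatenate([np.zeros(len(S_source)), np.ones(len(S_target))])
    Z_tr, Z_te, y_tr, y_te = train_test_split(Z, y, test_size=0.5,
                                              random_state=0, stratify=y)
    detector = LogisticRegression(max_iter=1000).fit(Z_tr, y_tr)
    # AUC near 0.5 means no detectable shift in the explanation distribution.
    return roc_auc_score(y_te, detector.predict_proba(Z_te)[:, 1])

# Toy usage: the model f is trained on source data only.
rng = np.random.default_rng(0)
X_source = rng.normal(size=(2000, 3))
y_source = X_source[:, 0] + 2 * X_source[:, 1] + rng.normal(scale=0.1, size=2000)
X_target = X_source + np.array([1.0, 0.0, 0.0])   # shift an informative feature
f = GradientBoostingRegressor().fit(X_source, y_source)
print(explanation_shift_auc(f, X_source, X_target))
```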
Strengths
The idea to focus on changes in the explanation distribution allows for detecting shift even when there is no effect on model predictions (P(f(D)) = P(f(D^{new}))), and without making assumptions about the type of shift to detect. This formulation allows for detecting covariate, prediction, concept, and novel group shift as long as the explanation values change.
The above is well supported by experiments on multiple types of shifts and by comparisons to several shift-detection baselines as well as to alternative distributions in the two-sample test.
Experiments and analytical examples showcase the settings where alternative shift detection approaches fail and explanation shift detection can succeed.
Weaknesses
Computing explanation vectors with Shapley values limits the number of features that can be considered or, if using TreeShap for efficiency, limits the models that can be used to tree-based ones.
Results are sensitive to the choice of prediction model and detector model.
Questions
Is there some way to know which predictor and detector model should be used based on prior assumptions about the data itself or the expected shift, if any, that you are trying to detect? Do you have any insights into when different models agree/disagree, for example?
Top of page 8 referencing Figure 4 says the PR18 is the most disparate. Should this be KS18 as it results in the largest AUC for the shift detector? Or what is meant by “disparate” here? Can you comment on the difference between the results shown in Figure 4 and those from Figures 6-8 in Appendix E? It seems employment, travel time, and mobility follow similar patterns, but differ from Figure 4.
Minor comments
Is the line for g_\phi=Input, f_\theta=Log missing in Figure 3 middle? Or is it overlapping with the XGBoost line?
“vii” on page 7 below Figure 2 should be “B7”.
Dear reviewer, many thanks for the review and its valuable and insightful comments. Please consider the main points of our rebuttal above and our specific points that follow now:
Results are sensitive to choice of prediction model and detector model
Yes, explanation shift detection will perform differently with different models. The pipeline is sensitive to both choices. In Appendix Table 7, we have compared several estimators and detectors.
Question 1. Is there some way to know which predictor and detector model should be used based on prior assumptions about the data itself or the expected shift, if any, that you are trying to detect? Do you have any insights into when different models agree/disagree, for example?
Not to the best of our knowledge. Note that this question is similar to asking how we could know the best ML estimator prior to fitting it.
Question 2. Top of page 8 referencing Figure 4 says the PR18 is the most disparate. Should this be KS18 as it results in the largest AUC for the shift detector? Or what is meant by “disparate” here?
Yes, this is a typo, which we will fix—many thanks for noting.
Question 3. Can you comment on the difference between the results shown in Figure 4 and those from Figures 6-8 in Appendix E? It seems employment, travel time, and mobility follow similar patterns, but differ from Figure 4.
Our main analysis in this experimental design is that, given similar data, the choice of what to predict is relevant. Depending on what the model has learned, there are different results. As the reviewer has correctly noticed, the most disparate state is not always PR. This shows an example where the learned explanation representation varies while the changes in the input data remain mostly constant. We will further refine the paragraph at the end of Section 5.2.2 to make this clearer.
The authors tackle the problem of detecting whether distributions shift between some labelled source dataset and an unlabelled target dataset. In particular, they focus on detecting shifts that affect the behavior of a particular model trained on the source data. Instead of comparing distributions of features directly between source and target, they propose a method based on comparing the distribution of explanations generated using Shapley Values for the given model. This comparison is done using a binary classifier trained to predict the domain from the explanations. They compare their method with a variety of baselines on synthetic and real data, finding that their method has greater sensitivity to data shifts.
Strengths
- The paper is generally easy to read and well-written.
- The paper provides a nice formulation of the problem.
- The paper tackles an important real-world problem.
- The authors provide theoretical analyses for a few simple synthetic cases.
Weaknesses
- I am not convinced that the results demonstrate the empirical superiority of the proposed method relative to the baselines. The authors only compare to the baselines in Figure 2. Here, there are several other methods that are competitive with explanation shift. In addition, the authors do not show confidence intervals in this figure. I also contest that "good indicators should follow a progressive steady positive slope", as if the goal is distribution shift detection, the only thing that should matter is the outcome of the hypothesis test.
- All of the datasets considered are tabular datasets. How would the authors adapt their method to images (where X could be pixels) and text (where X is a sequence of tokens and f is an LLM)? It seems like Shapley values would be less meaningful and harder to compute in these settings.
- In the real-world datasets, it is unclear what the ground truth should be, and so it is hard to say whether the proposed method is behaving as intended. For example, how do we know in Section 5.2.2 that the distribution in CA18 is actually different than CA14, in a way that affects model performance?
- One important aspect of distribution shift detection is to isolate the shift to particular distributions (i.e. particular shifts in Definitions 2.1-2.5). This is underexplored in most of the paper. The authors do explore this by examining the feature importances in Section 5.2.2, but this seems quite ad-hoc and should be characterized further. For example, how would this behave theoretically in the covariate shift case in Example 4.1?
- The authors should consider examining the scenario where the number of samples on the target domain is limited, both empirically and theoretically. How does the power of the two-sample test scale with the number of target domain samples?
- The authors have missed several important prior works [1-2], which should be baselines.
- (minor) There are many typos in the paper, including "taks" in Section 4.1, "datasests" in Section 5.1, "Sensitivy" in Figure 2, and an extra bracket in the Efficiency Property.
[1] Towards Explaining Distribution Shifts. ICML 2023.
[2] A Learning Based Hypothesis Test for Harmful Covariate Shift. ICLR 2023.
Questions
Please address the weaknesses above, and the following questions:
- How does the runtime of the algorithm compare to the baselines? I believe that Shapley values may be time consuming to compute, especially when there are a lot of features.
- Have the authors tried any other tests to distinguish between the two explanation distributions, other than the classifier based approach?
Dear reviewer, many thanks for the comments. Please consider the points of our main rebuttal above and our specific points that follow now:
Item I.1 I am not convinced that the results demonstrate the empirical superiority of the proposed method relative to the baselines. The authors only compare to the baselines in Figure 2. Here, there are several other methods that are competitive with explanation shift. In addition, the authors do not show confidence intervals in this figure. I also contest that "good indicators should follow a progressive steady positive slope", as if the goal is distribution shift detection, the only thing that should matter is the outcome of the hypothesis test.
Answer A.1: The reviewer may note that we also compare baselines B4, B5, and B6 in Figure 3. With respect to our evaluation, we fully agree with the reviewer that the approach's effectiveness is key. This raises the question of how best to measure effectiveness. We have decided to construct the experiments in Figures 2 and 3 in a way such that we can control the level of distribution shift by continuously increasing the co-variance (Fig 2) and the level of OOD data (Fig 3). A good metric for distribution shift should be able to mirror this control parameter. By the design of our experiment, step changes or fluctuations point towards less reliable metrics for recognising distribution shifts. We will make this experimental design clearer in the text. Also, see the extra evaluation above (2. quantifiable metrics).
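For concreteness, a minimal sketch of this kind of controlled shift (the generator below is illustrative; the exact synthetic setup of Figure 2 is the one described in the paper):

```python
# Illustrative generator for a controlled covariate shift: source features are
# independent, target features are correlated with strength rho. This only
# shows the idea of sweeping a single control parameter.
import numpy as np

def sample_shift_pair(rho, n=5000, seed=0):
    rng = np.random.default_rng(seed)
    cov_source = np.eye(2)
    cov_target = np.array([[1.0, rho], [rho, 1.0]])
    X_source = rng.multivariate_normal(np.zeros(2), cov_source, size=n)
    X_target = rng.multivariate_normal(np.zeros(2), cov_target, size=n)
    return X_source, X_target

# A good shift indicator should increase steadily as rho grows.
for rho in (0.0, 0.3, 0.6, 0.9):
    X_s, X_t = sample_shift_pair(rho)
    print(rho, round(np.corrcoef(X_t[:, 0], X_t[:, 1])[0, 1], 2))
```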
I.2 All of the datasets considered are tabular datasets. How would the authors adapt their method to images (where X could be pixels) and text (where X is a sequence of tokens and f is an LLM)? It seems like Shapley values would be less meaningful and harder to compute in these settings.
Our method is limited to tabular data. While this constitutes a limitation of our approach, we posit that understanding the distribution shift of tabular data is a problem hugely relevant for both research and practice.
I.3 In the real-world datasets, it is unclear what the ground truth should be, and so it is hard to say whether the proposed method is behaving as intended. For example, how do we know in Section 5.2.2 that the distribution in CA18 is actually different than CA14, in a way that affects model performance?
A.3 As described in section 2.2 (Impossibility of model monitoring), (Garg 2021)’s theorem shows that it is not possible to relate distribution shift to a change in model performance, particularly for tabular data where concept shift cannot be detected by properties of the visual domain. Therefore, our approach focuses on the change in how a given model works on data.
I.4 One important aspect of distribution shift detection is to isolate the shift to particular distributions (i.e. particular shifts in Definitions 2.1-2.5). This is underexplored in most of paper. The authors do explore this by examining the feature importances in Section 5.2.2, but this seems quite ad-hoc and should be characterized further. For example, how would this behave theoretically in the covariate shift case in Example 4.1?
A1.4 We fully agree with the reviewer. In the last paragraph of Section 2.1 we have stated: "In practice, multiple types of shifts co-occur together and their disentangling may constitute a significant challenge that we do not address here." We have only analyzed particular shifts on synthetic data, where we can fully characterize the type of shift.
I.5 The authors should consider examining the scenario where the number of samples on the target domain is limited, both empirically and theoretically. How does the power of the two-sample test scale with the number of target domain samples?
A1.5: It is correct that our recognition procedure is a binary classifier and that the reliability of the classifier depends on the number of samples on which it is trained. Previous work [3] has studied the limitations of Classifier Two-Sample Tests in the particle physics domain and suggests using bootstrap permutation tests to get a better handle on the test's accuracy.
We have adopted this methodology. In our own experiments (see Section 5.2.2), we have included bootstrap permutation testing between the in-distribution (CA14) bootstraps and the other states' bootstraps. Furthermore, in our experiment in Section 5.2.1, where an increasing fraction of the data is OOD, we show that from a small fraction onwards our metric picks up the changes in the model due to distribution shift.
[3] Model-independent detection of new physics signals using interpretable SemiSupervised classifier tests, Chakravarti, Purvasha and Kuusela, Mikael and Lei, Jing and Wasserman, Larry, The Annals of Applied Statistics, 2023
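To make the procedure explicit, a schematic sketch of such a permutation test around the C2ST statistic (illustrative only; the exact protocol in [3] and in our experiments may differ in its choice of statistic and resampling scheme):

```python
# Schematic permutation test around a C2ST statistic (e.g. the detector's AUC).
import numpy as np

def permutation_pvalue(statistic, S_source, S_target, n_perm=200, seed=0):
    """p-value of the observed statistic under the no-shift (exchangeability) null."""
    rng = np.random.default_rng(seed)
    observed = statistic(S_source, S_target)
    pooled = np.vstack([S_source, S_target])
    n_src = len(S_source)
    null_stats = np.empty(n_perm)
    for b in range(n_perm):
        idx = rng.permutation(len(pooled))               # random relabelling of source/target
        null_stats[b] = statistic(pooled[idx[:n_src]], pooled[idx[n_src:]])
    return (1 + np.sum(null_stats >= observed)) / (1 + n_perm)
```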
The paper proposes to use distribution shifts in "model explanations" such as Shapley values to attribute/identify distribution shifts across domains. The problem is phrased as that of running two-sample tests/conditional independence tests on explanations generated in the training domain and the target domain. The conditional independence test is operationalized using classifier two sample tests.
Example cases motivate how univariate (feature-level) two-sample tests will not pick up conditional covariate shifts, and that for an optimally trained model with uninformative features in the training set, a univariate feature-level two-sample test will detect distribution shift while the explanations will not shift.
Other example cases show that explanation shift does not always imply a shift in the predictive distribution (this example is not clear to me, as terms are not well defined).
Finally, a negative result showing that concept shift cannot be indicated by an explanation shift is presented.
Empirical evaluation consists of synthetic data analysis: Here, sensitivity of classifier two-sample tests and metrics of evaluating distribution shifts are considered. This evaluation suggests NDCG may be unstable as a test-statistic for evaluating distribution shifts.
Real-world analysis: UCI Adult Income demonstrates AUCs of the two-sample test proposed in the paper with multiple choices of model families used to train the original classifier as well as the explanation classifier.
Spatio-temporal shifts are evaluated using this data along with interpretation of the linear coefficients of the explanation shift detector model.
Strengths
- The overall paper is well written. Understanding how distribution shifts affect ML model performance is important.
- The hypothesis that distribution shifts in explanations generated for a model could give a hint about overall distribution shifts and their impact on model performance is a good one and should be explored.
Weaknesses
- The authors provide some interesting case studies and examples of the utility of Shapley based explanations in identifying distribution shifts. One aspect of this discussion that could be done better is trying to emphasize what additional assumptions could be required so that explanations such as Shapley can indeed be used for detecting, say, concept shift. This is not an impossible task for other methods, see for example: Liu et al [1]. See also experiments in this paper.
- The example on "uninformative features" is valid but unrealistic; shortcut learning is real in ML, and hoping a model is trained optimally is kind of unrealistic. Also, it may be best to state explicitly that .
- The overall empirical evaluation, in my view, is weak. Maybe at least discuss how general the approach will be (can it be used on other data modalities, beyond tabular data?).
[1] Liu, Jiashuo, Tianyu Wang, Peng Cui, and Hongseok Namkoong. "On the Need for a Language Describing Distribution Shifts: Illustrations on Tabular Datasets." arXiv preprint arXiv:2307.05284 (2023).
Questions
- What is in Example 4.3? I am not sure this example is valid based on what I think will be in this example. Can the authors clarify?
- Why aren't methods that don't actually need a causal graph, such as [2], compared to in the paper?
- In fact, it is possible to also compare to other methods of Budhatoki et al. and Zhang et al. that the authors cite, with minimal assumptions, or to highlight limitations of these methods.
- I see a discussion on using LIME as a possible explanation for addressing the same, however, I am not sure whether crucial properties of Shapley are necessary for this method to succeed. I think additional discussion on which explanations have the desirable properties for use in detecting explanation shifts will make the paper stronger.
[2] Namkoong, Hongseok, and Steve Yadlowsky. "Diagnosing Model Performance Under Distribution Shift." arXiv preprint arXiv:2303.02011 (2023).
Dear reviewer, many thanks for the comments, please consider the main points of our rebuttal above and our specific points that follow now:
I.1 The authors provide some interesting case studies and examples of the utility of Shapley-based explanations in identifying distribution shifts. One aspect of this discussion that could be done better is trying to emphasize what additional assumptions could be required so that explanations such as Shapley can indeed be used for detecting, say, concept shift. This is not an impossible task for other methods, see for example: Liu et al [1]. See also experiments in this paper.
A.1: Paper [1] can deal with concept shift because the authors assume that labels are given. In our problem scenario, we make the realistic assumption that labels are unknown. Thus, we address a different problem. Also cf. our Main Point 1 at the beginning of our rebuttal.
We will contrast our work against [1] in our discussion. This will further clarify and strengthen our work.
I.2 The overall empirical evaluation, in my view is weak. May be at least discuss how general the approach will be (can it be used on other data-modalities, beyond tabular data?)
A.2: We have systematically varied models, model parametrizations, and input data distributions. As indicated in Main Point 2 (at the beginning of the rebuttal), we have designed carefully controlled experiments (which are novel by themselves). Additionally, the appendix contains supplementary experimental results, providing details on synthetic data experiments, extending to further natural datasets, showcasing a broader range of modelling choices, and different explanation mechanisms. Thus, we believe that one of the paper's strengths is an extensive empirical evaluation.
As the reviewer correctly notes, the Classifier Two Sample Test on the explanations distributions is limited to tabular data and is not able to detect concept shift (we have stated this in the discussion). For non-tabular data, it needs further modifications that are out-of-scope for the paper. Characterizing distribution shift on tabular data constitutes an economically hugely important problem that still lacks appropriate treatment.
I.3 Question 1 What is in Example 4.3? I am not sure this example is valid based on what I think will be in this example. Can authors clarify?
A.3: The Shapley values are a product of a weighting (here: alpha_i for feature i) multiplied by the actual value of the feature (here: x_i). Thus, Shapley values explain a model f by a linear model with coefficients alpha_i. We will further clarify that the Shapley values can be directly calculated for a linear model and IID data by this formula (Aas, 2021).
The specific example shows a feature that is not used by the model and is shifted. In this case, we have a distribution shift that affects neither the model f nor the explanations, yet any "distribution shift" method based on input data will flag it.
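To make this concrete, here is a small numerical sketch of the linear case (the coefficients and the shifted feature are illustrative assumptions, not the exact setup of Example 4.3):

```python
# Numerical sketch of the linear-model case: Shapley values of f(x) = a1*x1 + a2*x2 + a3*x3
# under feature independence are phi_i(x) = a_i * (x_i - E[x_i]) (Aas et al., 2021).
# A shift in an unused feature x3 changes the input distribution but leaves the
# explanation distribution untouched. Coefficients and shift sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
a = np.array([2.0, -1.0, 0.0])                 # x3 is uninformative: coefficient 0

X_source = rng.normal(0, 1, size=(10_000, 3))
X_target = X_source.copy()
X_target[:, 2] += 3.0                          # shift only the unused feature

def shapley_linear(X, a, mean):
    return a * (X - mean)                      # closed form for linear f, independent features

mean_src = X_source.mean(axis=0)
S_source = shapley_linear(X_source, a, mean_src)
S_target = shapley_linear(X_target, a, mean_src)

# Input distributions differ in x3, explanation distributions do not:
print(np.abs(X_source.mean(0) - X_target.mean(0)))   # ~[0, 0, 3]
print(np.abs(S_source.mean(0) - S_target.mean(0)))   # ~[0, 0, 0]
```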
I.4 Question 2 Why aren't methods that don't actually need a causal graph such as [2] compared to in the paper? [2] Namkoong, Hongseok, and Steve Yadlowsky. "Diagnosing Model Performance Under Distribution Shift." arXiv preprint arXiv:2303.02011 (2023).
A.4: We do not compare to this paper for two reasons. First, it works in a different problem setup: it requires labels, whereas we assume that these are not available (cf. answer A.1; Main Point 1). Second, we will be happy to acknowledge the recent work reported in the two arXiv papers; however, given that they have not been published at a conference yet, we think that their absence from our discussion should not be held against our submission.
I.5: Question 3: The overall empirical evaluation, in my view, is weak. Maybe at least discuss how general the approach will be (can it be used on other data modalities, beyond tabular data?).
A.5 The Classifier Two-Sample Test on explanation distributions, as we apply it to tabular data, is not directly applicable to text or images. We have focused on tabular data since it constitutes an economically hugely important domain that brings up unique challenges. We have stated this in the discussion, but we will further clarify it.
I.6 Question 4 I see a discussion on using LIME as a possible explanation for addressing the same, however, I am not sure whether crucial properties of Shapley are necessary for this method to succeed. I think additional discussion on which explanations have the desirable properties for use in detecting explanation shifts will make the paper stronger.
A.6: The theoretical properties that we require from a feature attribution method are efficiency and uninformativeness (cf. Section 2.3). We use and compare Shapley values and LIME because these approaches (and their implementations) fulfil, or at least approximately fulfil, these properties.
The work proposes a method to detect distribution shifts that matter to a model. The central proposal is to study changes in output of an explanation method (like SHAP) between the shifted datasets. This change is quantified using classifier two-sample test, the output of which is used to detect distribution shifts that matter. Through extensive experiments, the proposed test based on explanation outputs is argued to be more sensitive than multiple baselines.
Strengths
- Paper is written well with all relevant background explained in main text.
- The idea of detecting relevant distribution shifts through their effect on explanations is original and interesting.
- Work provides intuitive examples to show which distribution shifts can be detected by explanation shifts.
- It tackles a significant problem. Interpretability of distribution shifts, including their effect on model behavior, is an under-studied problem.
Weaknesses
- Motivation for considering shifts in explanations is not convincing. This, I believe, is because the goal is not stated concretely enough to motivate the solution. The stated goal of investigating the interaction between distribution shifts and the learned model needs to be further characterized in quantifiable form. I agree that not every type of distribution shift is important to detect; the ones that change model behaviour in some meaningful way are. However, it is unclear to me why explanations, and Shapley values in particular, are a meaningful or useful summary of model behavior to consider. Say we care about AUC or any metric of choice to summarize model behavior; we can directly detect shifts in this metric using the same classification pipeline as proposed. If the reason for using explanations is the interpretability of the shift in model behavior, the problem and solution need to be stated differently. I think that some of this is addressed in the Appendix by pointing to related work that uses explanations for model debugging, drift detection, or other tasks. This could be used in the Introduction to more clearly motivate the approach.
- The evaluation setup can be improved in terms of quantifying evaluation metrics like sensitivity. The magnitude of the AUC does not matter that much, I believe. The authors are right in asking for sensitivity in the method, but it is not directly evaluated. For instance, in Figure 2 it is unclear which of (ours) and (B7) is better in the left figure, and which of (ours) and (B2) is better. All seem sensitive in the sense that different correlation coefficients result in different AUCs.
- Presentation of experiment can be improved. The case study on ACS data in Sec 5.2.2 is a good place to showcase how the question in the paper title (how the shift impacted the model) is addressed by the method. The findings are stated in a matter-of-fact way. These could be contextualized in the data making the significance of the method and its results more apparent. The questions asked through the experiments could be described at the start of Sec 5 and what we learned from the results could be described more directly at the end of the section.
Questions
- Please clarify the problem statement more formally. What aspects of the model behavior under distribution shifts (i.e., of the interaction between distribution shift and learned model) are of interest?
- Please motivate the approach (focus on explanations) in terms of the problem statement. How do Shapley values capture the desired aspects of the model behavior?
Minor (no response is requested)
The success of the method depends on how powerful the classifier in the two-sample test is, since a powerful enough classifier can detect relevant shifts just from a combination of input data and predictions without requiring explanations. Explanation method, in the proposed approach, helps in creating better and more relevant features for use by the classifier. Therefore, consider describing how to choose the classifier and the explanation method.
Explain C2ST (which I believe is classifier two-sample test?) in Experiments on Page 7.
Explain the reasoning for choosing novel group shift in Sec 5.2.1 and discuss the results in more detail.
Geopolitical -> geographical in Section 5.2.2
On the use of Shapley values, the claim that they account for co-variate interactions (compared to univariate tests) depends on the payoff function (interventional vs observational) used in computing Shapley values (reference Kumar et al. 2020 Problems with Shapley-value-based… https://proceedings.mlr.press/v119/kumar20e.html). Discussion in Appendix H could be highlighted in main text.
For completeness, please define the crossed \sim notation. Does this mean that the population-limits of the empirical distributions are not the same?
The definition of perf(D) is potentially missing a division by number of data points in Sec 2.1.
Dear reviewer, many thanks for the comments. Please consider the main rebuttal above and the specific rebuttal below:
I.1 Motivation for considering shifts in explanation is not convincing. This, I believe, is because the goal is not stated concretely to then motivate the solution. The stated goal of investigating interaction between distribution shifts and learned model needs to be further characterized in quantifiable forms. I agree that not every type of distribution shifts are important to detect. The ones that change model behaviour in some meaningful way are important. However, it is unclear to me why explanations, that too Shapley values, is a meaningful or useful summary of model behavior to consider.
A.1: Main Points 1 and 2 at the beginning of our rebuttal address these concerns. To summarize: AUC and similar metrics cannot be applied in our problem setting, which is frequent in the real world, where new data comes without labels.
As (Garg 2021) have shown, correctly estimating the performance is impossible in the presence of distribution shift. Their results imply that AUC and similar methods are inapplicable for discovering distribution shift when labels are not given.
Given the model predictions, we use the Shapley values, as they attribute the contribution of input features to the prediction (more information in Section 2.3).
We will use your question and our response to make the introduction clearer. Thank you.
I.2 Evaluation setup can be improved in terms of quantifying evaluation metrics like sensitivity. The magnitude of AUC does not matter that much I believe. The authors are right in asking for sensitivity in the method but it is not directly evaluated. For instance in Figure 2, it is unclear between (ours) and (B7) methods in the left figure and between (ours) and (B2) which one is better. All seem sensitive in the sense that different correlation coefficients result in different AUCs.
A.2 In order to clarify and provide a quantifiable metric, we have extended the experimental evaluation in the main rebuttal. We say the baselines underperform because they achieve a lower correlation with the controlled shift parameter or are independent of the model used.
I.3 Presentation of experiment can be improved. The case study on ACS data in Sec 5.2.2 is a good place to showcase how the question in the paper title (how the shift impacted the model) is addressed by the method. The findings are stated in a matter-of-fact way. These could be contextualized in the data making the significance of the method and its results more apparent. The questions asked through the experiments could be described at the start of Sec 5 and what we learned from the results could be described more directly at the end of the section.
Many thanks for the suggestion; we fully agree with the reviewer. These minor edits will improve the paper's clarity. We will take them into account.
I.4 Q1 Please clarify the problem statement more formally. What aspects of the model behavior under distribution shifts (interaction between distribution shift and learned model) is of interest?
A.4 Section 4 covers all the different types of shifts: uninformative shifts vs. informative shifts. Among the latter, problems in practice have to deal with covariate shifts, prediction shifts, and concept shifts. Discovering concept shift is not possible if no domain assumptions can be made and no labeled data can be obtained, which is why we are interested in covariate and prediction shifts. These are all defined in Section 4 and related to explanation shifts.
I.5 Q2 Please motivate the approach (focus on explanations) in terms of the problem statement. How do Shapley values capture the desired aspects of the model behavior?
The Shapley values weigh the influence of feature values towards a prediction (Efficiency Property, cf. Section 2).
The strength lies in the decomposition of predictions, offering insight into how each feature contributes to the overall outcome. This attribute is valuable when dealing with high-dimensional distributions, providing a more informative distribution than prediction distributions.
By employing feature attribution explanation techniques, specifically Shapley values and LIME, we exploit this efficiency property. These techniques facilitate a granular understanding of how shifts in specific feature distributions impact the model. The decomposition of the prediction across the contributions of the individual features yields a higher-dimensional distribution than the prediction distribution alone; therefore, the explanation distribution always encompasses more information.
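For reference, the efficiency (local accuracy) property we exploit can be written, in generic notation, as follows (the precise statement is in Section 2.3 of the paper):

```latex
% Efficiency / local accuracy of Shapley values: the attributions phi_i
% decompose the prediction relative to the expected prediction.
\sum_{i=1}^{p} \phi_i(f, x) \;=\; f(x) \;-\; \mathbb{E}_{X \sim \mathcal{D}}\!\left[ f(X) \right]
```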
I would like to thank the authors for the detailed response. It partly addresses my concerns; however, the ones on the problem statement are not resolved. For instance, if detecting a particular aspect of model behavior, like sensitivity to covariate or prediction shifts, is the focus, then this should be highlighted in the Introduction and Setup. The properties of explanation outputs that help in such detection, better than prior approaches, should be discussed to motivate the approach. These, I feel, are substantial changes to the presentation. Thus, I am keeping the score at 5, marginally below the threshold.
I now understand that the problem settings of interest do not have labelled data in the target domain, hence accuracy metrics cannot be computed. In this case, explanation outputs seem like one alternative. Thanks for quantifying the performance of the method in the new experiments.
We thank all reviewers for their valuable and insightful comments. We want to respond to two major concerns several reviewers shared:
Main Point 1 Positioning to related work: Our problem scenario is such that for new data the labels are unknown - as is common in practice. This is very different from the problem definition of related work that uses Accuracy/Precision/AUC, assuming that labels are available. Some other related works suggest that the labels are correctly predicted. Because of the impossibility theorem by (Garg 2021), the latter, however, is problematic when no domain assumptions can be made. Such domain assumptions may be available in the domain of images but are rarely or never available for tabular data. Therefore, tabular data constitutes an economically hugely important domain that brings up unique challenges. While we have stated our assumptions in section 2.1 (specifically the second paragraph), we truly appreciate that this can be easily missed. We will clarify this in an updated introduction.
Main Point 2 Quantifiable Metrics: We have demonstrated the advantage of our approach by showing in Figure 2 and Figure 3 how our approach performs best at accounting for a carefully controlled distribution shift. Thereby, we controlled the distribution shift with synthetic data (Fig 2) and with natural data (Fig 3) using a parameter for the strength of correlation (Fig 2) and the fraction of data from the unseen group (Fig 3). Some reviewers were concerned about whether these plots would show our approach's advantage.
We add here an evaluation summarising these curves' quality into a single quantifiable metric for each approach. In the following table, we compute the Pearson correlation coefficient between each of the baselines and the controlled shift parameter for Figure 2. One clearly sees the improvement of our approach; perfect agreement would be indicated by 1.0.
| Baseline | Pearson correlation with the controlled shift parameter |
|---|---|
| B1 Input KS | 0.01 |
| B2 Prediction Wasserstein | 0.97 |
| B3 Explanation NDCG | 0.52 |
| B4 Prediction KS | 0.70 |
| B5 Uncertainty | 0.26 |
| B6 C2ST Input | 0.18 |
| B7 C2ST Output | 0.96 |
| (Ours) C2ST Explanation | 0.99 |
For Figures 3B and 3C, one can see how applying the C2ST on the explanation distributions (Explanation Shift Detector) achieves better results than NDCG on the feature explanation orders (B3). While we find that B6 (C2ST on the input data) achieves performance comparable to our method in this evaluation, considering Fig 2, where B6 performs very poorly, this is obviously only the case because there appears to be no co-variance shift in the setup of this experiment.
| Baseline | Pearson correlation with the controlled shift parameter |
|---|---|
| (Ours) Exp Shift f=XGB | 0.97 |
| (Ours) Exp Shift f= Log | 0.98 |
| B4 Exp. NDCG f = XGB | 0.81 |
| B4 Ex. NDCG f = Log | 0.31 |
| B6 C2ST Input f=XGB | 0.98 |
| B6 C2ST Input f=Log | 0.98 |
| B7 C2ST Output f= XGB | 0.96 |
| B7 C2ST Output f= Log | 0.69 |
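For transparency, each table entry is obtained by correlating an indicator's response curve with the controlled shift parameter; a minimal sketch follows (the numbers below are illustrative placeholders, not our experimental outputs):

```python
# Sketch of how the table entries above are computed: Pearson correlation of an
# indicator's response curve with the controlled shift parameter. The values
# below are illustrative placeholders, not the actual experimental outputs.
import numpy as np
from scipy.stats import pearsonr

shift_strength = np.linspace(0.0, 0.9, 10)          # controlled parameter (Fig. 2 sweep)
indicator_auc = np.array([0.50, 0.52, 0.55, 0.59,   # hypothetical detector AUCs per step
                          0.64, 0.70, 0.77, 0.84, 0.91, 0.97])

corr, _ = pearsonr(shift_strength, indicator_auc)   # 1.0 would indicate perfect agreement
print(round(corr, 2))
```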
ML models often fail when deployed OOD. This paper aims to provide a method to understand why. The paper proposes explanation shift as a way to investigate interactions between distribution shifts and a deployed predictive model, by comparing SHAP values between the shifted and original data. The change is quantified via a two-sample test.
There were three concerns that still remained at the end of the discussion period.
- Why is the use of SHAP the right (or even better) mechanism to quantify shifts? While the authors state that the "strength lies in the decomposition of predictions, offering insight into how each feature contributes to the overall outcome", there is also work by Bilodeau et al. (https://arxiv.org/abs/2212.11870) showing how methods such as SHAP and integrated gradients can provably fail at identifying feature attributions. I think it is important to contextualize the use of SHAP for this approach in light of these results.
- Weak empirical comparison: The authors were presented with relevant suggestions to improve the empirical evaluation of their work.
- Feature-specific shift detectors (like feature-level two-sample tests), and finding important features as a combination of covariate-level two-sample/independence tests plus explanation-based attributions.
- Conditional covariate shifts of interest, such as in "Feature Shift Detection: Localizing Which Features Have Shifted via Conditional Distribution Tests", Kulinski et al.
- "Towards Explaining Distribution Shifts", Kulinski et al., and "A Learning Based Hypothesis Test for Harmful Covariate Shift", Ginsberg et al.
- In addition, Wang et al. (https://ieeexplore.ieee.org/abstract/document/10191221?casa_token=jkWx6YgEy5AAAAAA:i4h1pU2mDq3Qi0JFz-dJprfj_ltXlQhAxWUlpRudih55GRaxrP4Vj7SGFv-NU_C9fL-Cj6ZdwGo) highlight the utility of explainable methods in AI for understanding distribution shift; please do incorporate a discussion of this work as well.
- Revamping the writing in the introduction and motivation to better set up the proposed method.
Why not a higher score
I have articulated the remaining concerns above as justification for why the work should not receive a higher score, based on the reviewer comments that remain unaddressed.
Why not a lower score
N/A
Reject