PaperHub
Rating: 3.0 / 10 (Rejected, 4 reviewers)
Scores: 3, 3, 3, 3 (min 3, max 3, std 0.0)
Confidence: 3.8
ICLR 2024

Detecting Shortcuts using Mutual Information

Submitted: 2023-09-24 | Updated: 2024-02-11
TL;DR

Proposed a mutual-information-based method to detect shortcuts/spurious correlations.

Abstract

Keywords
shortcuts, spurious correlation, mutual information, information theory, neural tangent kernel

Reviews & Discussion

Official Review
Rating: 3

This paper proposes to detect shortcuts by evaluating the mutual information between a network's latent variable Z and the input X. The intuition is that the latent variable Z of a network that has learned shortcuts will have lower mutual information with the input X. Based on this intuition, the paper proposes to detect shortcuts by computing mutual information using the neural tangent kernel (NTK).
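For concreteness, the detection rule summarized above (flag a shortcut when an MI estimate on test data falls below the estimate on training data) can be sketched in a few lines. This is a hypothetical illustration that substitutes a simple joint-Gaussian MI estimate for the paper's NTK-based estimator; `gaussian_mi` and `detect_shortcut` are placeholder names, not the authors' code:

```python
import numpy as np

def gaussian_mi(x, z, eps=1e-6):
    """Estimate I(X; Z) in nats under a joint-Gaussian assumption:
    I = 0.5 * (logdet Cov(X) + logdet Cov(Z) - logdet Cov([X, Z]))."""
    def logdet(a):
        # Columns are variables; eps regularizes near-singular covariances.
        cov = np.cov(a, rowvar=False) + eps * np.eye(a.shape[1])
        return np.linalg.slogdet(cov)[1]
    return 0.5 * (logdet(x) + logdet(z) - logdet(np.hstack([x, z])))

def detect_shortcut(x_tr, z_tr, x_test, z_test, margin=0.1):
    """Flag a possible shortcut when I(X_test; Z) falls clearly below I(X_tr; Z)."""
    return gaussian_mi(x_test, z_test) < gaussian_mi(x_tr, z_tr) - margin
```

Any other MI estimator (k-NN, NTK-based) could be dropped in for `gaussian_mi`; the decision rule itself is just the comparison with a margin.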

Strengths

1). Using mutual information to detect whether networks learn shortcuts is an interesting topic.

2). This paper provides a detailed introduction to related work.

Weaknesses

The proposed method seems infeasible.

1). In Algorithm 1, a dataset is said to contain a shortcut if $I(X_{test};Z) < I(X_{tr};Z)$. However, since $X_{test}$ has a different distribution than $X_{tr}$, it seems natural to have $I(X_{test};Z) < I(X_{tr};Z)$. Therefore an important question needs to be answered: are there any datasets satisfying $I(X_{test};Z) \geq I(X_{tr};Z)$? This paper lacks empirical evidence to answer the question, and I do not think Algorithm 1 can effectively detect shortcuts.

2). Unlike the proposed Algorithm 1, the experiments in this paper mainly compare networks trained on two datasets, "with" and "without" a shortcut. This approach also has problems, since it requires comparing against a network trained on a dataset "without" a shortcut; defining "without shortcut" itself involves domain knowledge and human expertise to detect shortcuts.

Based on the above two points, I think the proposed method has major flaws.

Minor issues:

1). The introduction of estimating mutual information using the NTK is vague. For example, what is the definition of $\Theta(x, X)$? $\sigma$ is used in Eq. 5 but it is written as $\Sigma$ in Eq. 7.

2). Figure legends (e.g. Fig.4) could be more detailed.

Questions

As mentioned in the weaknesses section, I have concerns about the proposed method.

1). For those datasets without shortcuts, do they satisfy $I(X_{test};Z) = I(X_{tr};Z)$? Please provide empirical evidence to show that Algorithm 1 is feasible.

2). I notice that at the beginning of training, before the mutual information $I(X;Z)$ starts to differ between "with shortcut" and "without shortcut", the test loss has already become different (e.g. around epoch 100 in Fig. 6 and epoch 1000 in Fig. 7). Why is this?

Comment

Thank you for the detailed review and insightful feedback. Here are further explanations and answers to the questions raised in the review:

Are there any datasets satisfying $I(X_{test};Z) \geq I(X_{tr};Z)$? This paper lacks empirical evidence to answer the question, and I do not think Algorithm 1 can effectively detect shortcuts.

While $I(X_{test};Z)$ cannot be greater than $I(X_{tr};Z)$, for $X_{test}$ sampled from the same distribution as $X_{tr}$ we have $I(X_{test};Z) \approx I(X_{tr};Z)$, as shown in the different experiments.
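This claim is easy to sanity-check numerically. The sketch below is hypothetical: it uses a joint-Gaussian MI estimate on a fixed random linear representation rather than the paper's NTK-based estimator, draws two independent samples from the same distribution, and compares the two estimates (all names are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_mi(x, z, eps=1e-6):
    """Estimate I(X; Z) in nats under a joint-Gaussian assumption."""
    def logdet(a):
        cov = np.cov(a, rowvar=False) + eps * np.eye(a.shape[1])
        return np.linalg.slogdet(cov)[1]
    return 0.5 * (logdet(x) + logdet(z) - logdet(np.hstack([x, z])))

w = rng.normal(size=(4, 2))              # fixed "representation" map
def represent(x):
    return x @ w + 0.5 * rng.normal(size=(len(x), 2))

x_tr = rng.normal(size=(5000, 4))        # training samples
x_test = rng.normal(size=(5000, 4))      # fresh samples, same distribution

mi_tr = gaussian_mi(x_tr, represent(x_tr))
mi_test = gaussian_mi(x_test, represent(x_test))
# The two estimates agree closely for same-distribution samples.
```

Up to estimator variance, the two MI values differ only by sampling noise, which shrinks as the sample size grows.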

Unlike the proposed Algorithm 1, the experiments in this paper mainly compare networks trained on two datasets, "with" and "without" shortcuts. This approach also has problems, since it requires comparing against a network trained on a dataset "without" a shortcut; defining "without shortcuts" itself involves domain knowledge and human expertise to detect shortcuts.

We use "with" and "without shortcut" to denote shortcuts in the controlled experiments, to demonstrate that mutual information can be used to distinguish between the two datasets. We demonstrated the practical use case in Section 5, where we sample $X_{tr}$ and $X_{test}$ from two different medical imaging datasets. Our method can detect the presence of shortcuts, which is in fact corroborated when the model trained on $X_{tr}$ is evaluated on $X_{test}$.

For those datasets without shortcuts, do they satisfy $I(X_{test};Z) = I(X_{tr};Z)$? Please provide empirical evidence to show that Algorithm 1 is feasible.

$I(X_{test};Z)$ and $I(X_{tr};Z)$ will be similar when sampled from the same distribution. In the reported experiments, the MI $I(X_{tr};Z)$ is calculated on different samples from the training distribution.

I notice that at the beginning of training, before the mutual information starts to differ between "with shortcut" and "without shortcut", the test loss has already become different (e.g. around epoch 100 in Fig. 6 and epoch 1000 in Fig. 7). Why is this?

Our intuition is that the network explores the more optimal region of the solution space in the initial training epochs, before discovering shortcuts. This is consistent with the findings of Shwartz-Ziv & Tishby (2017) [1] on the behaviour of SGD.

[1] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv:1703.00810, 2017.

Official Review
Rating: 3

The authors claim that the failure of deep neural networks to generalize to out-of-distribution (OOD) data is often caused by shortcuts or spurious correlations in the training data, leading to a performance drop under distribution shift. This paper aims to understand shortcut learning using information-theoretic tools and proposes using the mutual information (MI) between the learned representation and the input space as a domain-agnostic metric for detecting shortcuts in training datasets. To study the training dynamics of shortcut learning, a framework based on the Neural Tangent Kernel (NTK) is introduced that is able to detect shortcuts and spurious correlations in training data without requiring class labels for the test data. Empirical experiments on multiple datasets, including MNIST, CelebA, NICO, Waterbirds, and BenchMD, demonstrate that MI can effectively detect shortcuts.

Strengths

  1. This paper can be viewed as a complementary method to existing out-of-distribution (OOD) detectors. It offers a domain-agnostic metric to detect shortcuts and provides a way to diversify the training data before deploying a model in safety-critical domains. By identifying shortcuts and spurious correlations in training datasets, it helps improve the robustness and reliability of models when faced with distribution shifts, making it a valuable tool for ensuring model performance in real-world applications.

  2. The quality of the writing is commendable, and the experiments conducted to establish the method's effectiveness are adequate.

Weaknesses

  1. The core concept underpinning the method lacks novelty. In essence, during the training phase, the objective is to minimize $I(X;Z)$ on the training data $D_{tr}$. However, it becomes evident during the testing phase that $I(X_{te};Z) < I(X_{tr};Z)$, provided a domain gap exists between the training data $D_{tr}$ and the test data $D_{te}$. Subsequently, equations (4)-(10) are all employed to compute $I(X;Z)$ using well-established formulations from prior literature. Hence, the methods devised are not restricted to the detection of shortcuts and can be applied to data exhibiting substantial domain gaps. I recommend that the authors conduct additional experiments on out-of-distribution (OOD) detection.

  2. Based on the experiments on partially correlated shortcuts as depicted in Figure 4, I have reservations about the method's limitations when it comes to detecting datasets with partially correlated shortcuts. This is because the hard line in Figure 4(a) appears to be very close to the dotted line in Figure 3(a). I recommend that the authors conduct additional experiments to further investigate the detection of datasets with partially correlated shortcuts.

Questions

See weaknesses as above.

Comment

Thanks for the detailed feedback! We will run additional experiments on OOD detection for a future submission.

Official Review
Rating: 3

The authors utilize a neural tangent kernel framework to approximate the mutual information between input and learned representations. They hypothesize that mutual information should be smaller in the test set compared to the training set if there's a distribution shift present. They test their ideas on a number of datasets and compare their proposed methodology with out-of-distribution (OOD) detection algorithms.

Strengths

  • The authors' topic of choice is timely, and is of both theoretical and practical significance.
  • The authors involve a number of datasets in their experiments, and go beyond frequently used (semi)synthetic datasets in their investigation. Domain-specific empirical examination is especially likely to be informative in this topic.

Weaknesses

I believe that the issue the authors address requires conceptual clarity in terms of defining and diagnosing the problem, and a carefully constructed methodology and accompanying experiments to support the arguments made. I believe that the paper has important shortcomings in these respects. I summarize my main criticisms below, and provide detailed questions and observations regarding these in the next section.

  • The paper presents no systematic justification of their proposed method, apart from some conjectural statements.
  • The paper is imprecise, unclear, and/or inconsistent with their conceptual structure or central notation.
  • The experiments do not seem consistent with the onus of the paper's argumentation, present insufficient details on the specifics, and do not pursue some critical avenues to their fullest extent.

Questions

On justification of the proposed method:

  • I do not think it is clear why the paper's method of choice (i.e. comparing the mutual information between input and learned representations) should be a good choice for detecting this phenomenon. An example:
    • Let's assume a binary label $Y$ and a binary feature $S$, with $p_{train}(S=1|Y=1) = p_{train}(S=0|Y=0) = 1$ and $p_{test}(S=1|Y=1) = p_{test}(S=0|Y=0) = 0.5$, which conforms to Eqs. 2 and 3. Also let $p_{train}(Y=1) = p_{train}(Y=0) = 0.5$. $Y$ can be the odd/even label and $S$ the presence of the white patch.
    • Then, according to the paper's arguments, a trained model would learn to reduce the representation of $X$ to the presence of the white patch, $g(X) = Z = S$, to maximally compress $X$. In the test set the network would still extract the presence of the white patch from the OOD samples. Why, then, would $I(X;Z)$ be different under $p_{train}$ and $p_{test}$?
  • Moreover, can the authors definitively claim that $I(X_{test};Z) < I(X_{tr};Z)$ can only be due to the presence of spurious correlations? If not, how reliable should this method be considered for detecting shortcut learning?
  • The suggested methodology might make sense under some distribution shifts, but it is authors' responsibility to describe and explain this while presenting their method.
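The arithmetic behind the reviewer's white-patch example can be checked directly. Assuming, as in the example, that the representation collapses exactly to the patch feature ($Z = g(X) = S$, a deterministic function of $X$), we get $I(X;Z) = H(Z) = H(S)$, which is identical under both distributions; the snippet below is illustrative, not from the paper:

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy of a Bernoulli(p) variable, in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return float(-(p * np.log2(p) + (1 - p) * np.log2(1 - p)))

# Marginal of S under each distribution, with p(Y=1) = p(Y=0) = 0.5:
# train: p(S=1|Y=1) = 1,   p(S=1|Y=0) = 0    =>  p(S=1) = 0.5
# test:  p(S=1|Y=1) = 0.5, p(S=1|Y=0) = 0.5  =>  p(S=1) = 0.5
p_s_train = 0.5 * 1.0 + 0.5 * 0.0
p_s_test = 0.5 * 0.5 + 0.5 * 0.5

# With Z = g(X) = S deterministic in X, I(X; Z) = H(Z) = H(S):
mi_train = entropy_bits(p_s_train)
mi_test = entropy_bits(p_s_test)
```

Both entropies come out to exactly 1 bit, supporting the reviewer's point that under this construction the MI criterion would not distinguish train from test.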

On conceptual clarity and notation consistency

  • The authors do not make clear notational distinctions between random variables and specific values they can take, and use inconsistent notation seemingly without explanation:
    • After introducing $X$ and $Y$ as random variables, why do we revert to $x$ and $y$ in Eqs. 2 and 3?
    • In Eqs. 8 and 9, why do we have e.g. $X$ and $x$ in the same equation?
    • Why is $\mathcal{Y}$ never explicitly introduced?
    • Do $p$ and $P$ refer to different mathematical objects? If so, why are they not explicitly introduced?
    • Why is $s$ always lower case?
    • Why do we have $X_{tr}$ but not $Z_{tr}$?
  • The generative model implied in Section 3 is not clear, and it is not clear how it relates to the distribution shifts in question.
  • Definition 2 is unclear. What is $\Gamma$? Is it a function that outputs scalar values, such that we can have order relationships?
  • Proposition 1 is not a provable mathematical statement, so it should be named a conjecture or a hypothesis. Even as a hypothesis it is imprecisely stated.
  • Pg. 1: "A shortcut is a distribution shift..." I think defining shortcut as a type of distribution shift is both unhelpful and is inconsistent with the rest of the literature.

On experiments

  • Both in the abstract and the introduction, as well as in Algorithm 1, the authors propose examining $I(X_{test};Z) < I(X_{tr};Z)$ as a way to determine the presence of shortcut learning, yet in none of the experiments do they apply this methodology to decide on it until Section 5. Neither is this information present in any of their figures before Figure 8.
  • Although deferring some details to the supplementary material is understandable, the authors include no details whatsoever on their experimental setting. This ranges from the model families used to how $Z$ was obtained.
  • The comparison with OOD methods (i.e. how the baselines were used for this task) should be introduced in a more detailed fashion.
  • Figure 2 is a singular demonstration of what the model learns in the presence of strong spurious correlations. A method for quantifying this is needed, in a way that generalizes to other experiments as well.
Official Review
Rating: 3

This paper proposes a framework for detecting spurious correlations or shortcuts in training datasets. The main hypothesis of the paper is that the mutual information between input and embedding will be low when there are shortcuts in a dataset. The authors leverage a neural tangent kernel to estimate the mutual information between input and embedding representation, and empirically show that their hypothesis holds on synthetic (MNIST with shortcuts), benchmark (Waterbirds, CelebA, and NICO), and real-world medical datasets.

Strengths

  • The empirical results support their hypothesis that the $I(X;Z)$ of a dataset with a shortcut is lower than that of one without a shortcut.
  • The method is simple and easy to follow.

Weaknesses

  • The proposed method is limited in real-world application scenarios. It requires a 'without shortcuts' dataset to detect whether there are shortcuts in the training dataset. I am not sure in how many cases we can prepare the 'without shortcuts' dataset before we know whether the training dataset has a shortcut.
  • Kirichenko et al. (2022) showed that a model trained on Waterbirds using ERM retains the ability to classify the foreground-only and background-only datasets. This conflicts with the main hypothesis that a model trained on the shortcut dataset will have a low $I(X;Z)$.
  • The experiment graphs show that the mutual information is less discriminative than the losses, which diminishes the necessity of using mutual information to detect the existence of shortcuts in the training dataset.

Minor corrections

  • Page 5 MNIST with synthetic shortcut) Figure 1 -> Figure 2
  • Figure 2) Please denote the used Saliency map.
  • Definition 2) Please state that the higher $\Gamma$, the better the generalization.

Questions

  • The proposed algorithm and the experiment setting are different. If I denote the original training and test datasets as $D_{tr}$ and $D_{te}$, respectively, and the shortcut-added training dataset as $D_{tr}^{sc}$, then Figure 3(c) plots '$I(X_{test};Z)$ trained on $D_{tr}$' and '$I(X_{test};Z)$ trained on $D_{tr}^{sc}$'. However, the algorithm seems to compare '$I(X_{test};Z)$ trained on $D_{tr}$' with '$I(X_{tr};Z)$ trained on $D_{tr}$'.
  • Algorithm 1, step 1) Why is $\mathcal{F}$ initially required?
  • Figure 3 with shortcut line vs Figure 4 100% line) I think they are the same experiment, but the graphs differ.
Comment

Thank you for the detailed review and feedback!

The proposed method is limited in real-world application scenarios. It requires a 'without shortcuts' dataset to detect whether there are shortcuts in the training dataset. I am not sure in how many cases we can prepare the 'without shortcuts' dataset before we know whether the training dataset has a shortcut.

We use "with" and "without shortcut" to denote shortcuts in the controlled experiments, to demonstrate that mutual information can be used to distinguish between the two datasets. We demonstrated the practical use case in Section 5, where we sample $X_{tr}$ and $X_{test}$ from two different medical imaging datasets. Our method can detect the presence of shortcuts, which is in fact corroborated when the model trained on $X_{tr}$ is evaluated on $X_{test}$.

Kirichenko et al. (2022) showed that a model trained on Waterbirds using ERM retains the ability to classify the foreground-only and background-only datasets. This conflicts with the main hypothesis that a model trained on the shortcut dataset will have a low $I(X;Z)$.

Thanks for pointing out this work. One major difference is that we use neural tangent kernels for training the model and calculating the mutual information value, whereas Kirichenko et al. (2022) used a finite-width neural network in their study.

The experiment graphs show that the mutual information is less discriminative than the losses, which diminishes the necessity of using mutual information to detect the existence of shortcuts in the training dataset.

When using a trained model on new data $X_{test}$, we usually don't have the corresponding labels $Y_{test}$. In contrast to the test accuracy, calculating the MI $I(X_{test};Z)$ does not require a labelled dataset, and hence our method can be used to detect shortcuts without requiring test labels. We plotted the test accuracy to empirically demonstrate the connection between $I(X_{test};Z)$ and the test accuracy.

The proposed algorithm and the experiment setting are different.

In our experiments, we calculated both $I(X_{test};Z)$ and $I(X_{tr};Z)$ using the model trained on $D_{tr}$.

Algorithm 1, step 1) Why is $\mathcal{F}$ initially required?

We need a trained model F\mathcal{F} to calculate the mutual information.

Comment

I greatly appreciate your response. After considering the opinions of the other reviewers and the author's response, I tend to maintain my original score.

AC Meta-Review

This paper frames the presence of shortcuts in machine learning as an out-of-distribution problem and uses the mutual information between input and learned representations, computed with a neural tangent kernel, to detect them. Experimental results are presented on many benchmarks.

One major conceptual clarification that is missing from the mutual information formulation of this procedure is the lack of distinction between spurious correlations and true inductive biases (or truly informative features). The true inductive biases will be helpful in any distribution, whereas shortcuts might be different in different distributions. The “no shortcut” setup studied in this work is somewhat reductive and not possible to study on a wide variety of settings. In effect, what the paper is proposing as a shortcut detector might simply correspond to an outlier detector.

The presentation of the paper was clear and many empirical studies were presented to verify the hypothesis.

Why not a higher score

Please see weaknesses above; conceptually the method in the paper was somewhat unsatisfactory.

Why not a lower score

N/A

Final Decision

Reject