PaperHub

Rating: 8.2/10 · Spotlight · 4 reviewers
Individual scores: 5, 5, 5, 5 (min 5, max 5, std 0.0)
Confidence: 2.8
Novelty: 2.8 · Quality: 3.0 · Clarity: 2.8 · Significance: 2.5

NeurIPS 2025

Shortcut Features as Top Eigenfunctions of NTK: A Linear Neural Network Case and More

OpenReview · PDF
Submitted: 2025-05-07 · Updated: 2025-10-29
TL;DR

We reveal that shortcut features correspond to top eigenfunctions of NTK and dominate the neural network output after convergence.

Abstract

Keywords

shortcut learning, neural tangent kernel

Reviews & Discussion

Review (Rating: 5)

The paper investigates shortcut learning in deep learning models, where networks overly rely on dominant but non-generalizable features in the training data. Using the Neural Tangent Kernel (NTK) framework, the authors analyze linear neural networks and define a “feature” as an eigenfunction of the NTK. They find that shortcut features align with large NTK eigenvalues, particularly in settings with imbalanced or clustered data distributions. These dominant features significantly affect the model output even after training, suggesting that shortcut learning isn't solely due to max-margin biases. The insights are also shown to extend to nonlinear networks, like two-layer ReLU networks and ResNet-18.

Strengths and Weaknesses

Strengths:

  1. The paper presents a thorough theoretical analysis using the Neural Tangent Kernel (NTK) framework to dissect shortcut learning, starting from linear models and extending the insights to more complex architectures. It provides formal theorems and proofs (e.g., Theorem 3.1 and 3.2) that explain how shortcut features arise due to spectral properties of NTK and data distribution characteristics.

  2. Introduces a clean interpretation of features as NTK eigenfunctions and relates shortcut features to eigenvalues, thereby linking shortcut learning to fundamental learning dynamics like spectral bias.

  3. The paper empirically shows that findings from linear settings extend to two-layer ReLU networks and ResNet-18, increasing the real-world relevance of the theoretical insights.

Weaknesses:

  1. Core theoretical results are derived under simplifying assumptions: linear networks, Gaussian mixture models, infinite-width limit. While extensions to complex models are attempted empirically, theoretical generality is limited.

  2. The paper does not include error bars or statistical tests for empirical results. This undermines confidence in the generality of findings, especially on small or noisy datasets.

Questions

  1. The theoretical analysis relies heavily on the NTK framework (infinite-width, lazy training). In practice, finite-width networks often exhibit feature learning. Are there any empirical observations or theoretical intuitions about how the findings generalize to feature-learning regimes?

  2. The shortcut labels in real-world datasets are manually constructed based on model predictions. How consistent are these across runs or architectures? Could a data-driven or automated method for shortcut label discovery be developed using availability scores?

Limitations

Theoretical results rely on idealized settings (linear networks, infinite width, Gaussian mixture data), which limits applicability.

The empirical results do not provide statistical significance analysis.

Final Justification

The authors provided additional empirical results that give more solid evidence for the claims. They also clarified most of my questions and concerns. Overall, I think it is a solid paper with the extended results.

Formatting Issues

No

Author Response

Thank you very much for your constructive review! We first note that Equations 21, 22, and 23 in Appendix B contain a typo: $\exp(-\frac{1}{2} (s - \mu_k)^\top \Sigma^{-1}_k (s - \mu_k))$ should be $\exp(-\frac{1}{2} s^\top \Sigma^{-1}_k s)$. We apologize for the inconvenience.

1. The paper does not include error bars or statistical tests for empirical results. This undermines confidence in the generality of findings, especially on small or noisy datasets.

Although the availability of a shortcut label was consistently higher than the availability of the ground-truth label, the availability values varied due to the randomness in picking samples for measuring the NTK, so we omitted error bars for the clarity of the graphs. Below we rerun important experiments such as Waterbirds and CelebA with CE loss to show that our results are consistent: the availability of the shortcut label was consistently higher than that of the ground-truth label, and in the experiment measuring the strength of a shortcut (Fig. 7), a stronger shortcut always exhibited higher availability. The experiments below were run with CE loss. Availability was measured at the 1000-th iteration to observe the convergence behaviour in the early phase of training (which was also our focus in Fig. 6, 9, and 10).

Waterbirds

| Run | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Shortcut label | 0.3669 | 0.2700 | 0.3480 | 0.4049 | 0.2996 |
| Ground-truth label | 0.3496 | 0.2576 | 0.3325 | 0.3827 | 0.2906 |

CelebA

| Run | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Shortcut label | 0.6300 | 0.5895 | 0.5886 | 0.5767 | 0.5958 |
| Ground-truth label | 0.6257 | 0.5382 | 0.5707 | 0.5611 | 0.5833 |

Colored-MNIST

| Run | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Shortcut label | 0.5317 | 0.5360 | 0.5170 | 0.5347 | 0.5191 |
| Ground-truth label | 0.4370 | 0.4248 | 0.4032 | 0.4472 | 0.4380 |

Patched-MNIST

| Run | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Shortcut label | 0.1628 | 0.1686 | 0.1751 | 0.1739 | 0.1684 |
| Ground-truth label | 0.1555 | 0.1578 | 0.1533 | 0.1567 | 0.1454 |

On the other hand, for the experiments measuring the strength of a shortcut in Patched-MNIST (Fig. 7), the variance of availability was small (across 5 runs):

| Size of patch | 1×1 | 3×3 | 5×5 | 7×7 |
| --- | --- | --- | --- | --- |
| Shortcut label | 0.0605 ± 0.0042 | 0.1698 ± 0.0049 | 0.2861 ± 0.0038 | 0.3721 ± 0.0085 |

If given a chance to revise our paper, we will incorporate your comment and include error bars for this experiment. The experiments below were run with MSE loss.

Waterbirds

| Run | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Shortcut label | 0.2040 | 0.2342 | 0.1910 | 0.2406 | 0.2777 |
| Ground-truth label | 0.1830 | 0.2078 | 0.1745 | 0.2173 | 0.2739 |

CelebA

| Run | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Shortcut label | 0.4466 | 0.4572 | 0.3853 | 0.3861 | 0.4841 |
| Ground-truth label | 0.4140 | 0.4230 | 0.3715 | 0.3691 | 0.4527 |

Colored-MNIST

| Run | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Shortcut label | 0.5334 | 0.5529 | 0.5379 | 0.5529 | 0.5550 |
| Ground-truth label | 0.4479 | 0.4670 | 0.4599 | 0.4742 | 0.4672 |

Patched-MNIST

| Run | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Shortcut label | 0.1695 | 0.1954 | 0.1528 | 0.1648 | 0.1718 |
| Ground-truth label | 0.1544 | 0.1786 | 0.1431 | 0.1610 | 0.1566 |

The experiments below were run with CE loss and SD regularization.

Waterbirds

| Run | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Shortcut label | 0.2263 | 0.2237 | 0.1935 | 0.2318 | 0.2362 |
| Ground-truth label | 0.2003 | 0.2122 | 0.1818 | 0.2155 | 0.2108 |

CelebA

| Run | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Shortcut label | 0.4824 | 0.4093 | 0.4607 | 0.4243 | 0.4203 |
| Ground-truth label | 0.4662 | 0.4015 | 0.4350 | 0.4068 | 0.4053 |

Colored-MNIST

| Run | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Shortcut label | 0.5217 | 0.5533 | 0.5590 | 0.5293 | 0.5465 |
| Ground-truth label | 0.4499 | 0.4548 | 0.4928 | 0.4934 | 0.4551 |

Patched-MNIST

| Run | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Shortcut label | 0.1732 | 0.1542 | 0.1616 | 0.1572 | 0.1742 |
| Ground-truth label | 0.1620 | 0.1453 | 0.1431 | 0.1447 | 0.1553 |

We also answer your questions below:

1. The theoretical analysis relies heavily on the NTK framework (infinite-width, lazy training). In practice, finite-width networks often exhibit feature learning. Are there any empirical observations or theoretical intuitions about how the findings generalize to feature-learning regimes?

All the plots in Fig. 6, 9, and 10 are measurements from the empirical NTK of finite-width networks. They were measured at ≥ 100 epochs (epoch 130 for CelebA, epoch 200 for the rest). This empirically suggests that our findings could generalize to feature-learning regimes. One might also extend our results with perturbation theory and dynamical mean-field theory, which would require more effort.

2. The shortcut labels in real-world datasets are manually constructed based on model predictions. How consistent are these across runs or architectures? Could a data-driven or automated method for shortcut label discovery be developed using availability scores?

The shortcut labels were consistent across runs. For example, in Waterbirds, when the shortcut label was found manually, the match rate of the test predictions to that label was 0.7819 ± 0.0018 (measured at > 100 epochs across 5 runs), whereas when the label was composed solely from the shortcut feature, the match rate was 0.7488 ± 0.0029 (across 5 runs), which is lower. In CelebA, the match rate of the test predictions to the manually found label was 0.9527 ± 0.0003 (measured at > 100 epochs across 5 runs), whereas the match rate to the ground-truth label was 0.9510 ± 0.0004 (across 5 runs), again lower. In CelebA we compared against the ground-truth label rather than a label composed solely from the shortcut feature, since the latter's match rate was much lower than the ground-truth one. This shows that our shortcut labels are reliable. However, the shortcut labels were not consistent across hyperparameters or architectures; for example, as shown in A.4, the shortcut labels may differ if the network is pretrained. And yes, it might be possible to construct a shortcut label from availability scores or from top eigenvectors of the NTK (by treating top eigenvectors of the NTK as shortcut labels). But there is a limitation, also noted in Appendix A.6: if there are multiple shortcuts in a dataset, it might not be possible to detect all the shortcut labels.
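The idea of scoring a candidate label by its alignment with the top NTK eigenspace can be sketched numerically. In the minimal numpy sketch below, the data, the scale factors, and the `availability` helper are illustrative assumptions (not the paper's exact implementation), and a linear-kernel Gram matrix stands in for the empirical NTK:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic setup (illustrative): a "shortcut" attribute encoded at a much
# larger scale than the "core" attribute, so it dominates the kernel spectrum.
n = 200
shortcut = rng.choice([-1.0, 1.0], size=n)   # hypothetical shortcut label
core = rng.choice([-1.0, 1.0], size=n)       # hypothetical ground-truth label
X = np.column_stack([3.0 * shortcut, 0.5 * core]) + 0.1 * rng.standard_normal((n, 2))

# Gram matrix of a linear kernel as a stand-in for the empirical NTK.
K = X @ X.T
_, eigvecs = np.linalg.eigh(K)               # eigenvalues in ascending order
top = eigvecs[:, -1:]                        # top eigenspace (here: one vector)

def availability(label, top_eigvecs):
    # Squared norm of the projection of the normalized label onto the
    # top eigenspace: 1.0 = perfectly aligned, 0.0 = orthogonal.
    y = label / np.linalg.norm(label)
    return float(np.sum((top_eigvecs.T @ y) ** 2))

print(availability(shortcut, top))  # high: the shortcut is highly "available"
print(availability(core, top))      # much lower
```

Because the shortcut direction carries most of the kernel's spectral mass here, the shortcut label projects almost entirely onto the top eigenvector while the core label barely does.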

Comment

I thank the authors for their extended results and response! Those clarified my concerns and I would like to raise my score.

Review (Rating: 5)

Propose a toy model where the problem of shortcut features can be investigated, using a mixture of two Gaussians with biased clusters (unbalanced probabilities) and considerable variance. Demonstrate, using an NTK setting in a linear network, that in this setup, shortcut features get learned earlier and influence the learned predictions. Furthermore, it demonstrates that a previously suggested measure to avoid shortcut features, controlling the classification margin, does not seem to alleviate the problem. Propose measures of availability and predictability to identify shortcut features in real-world datasets and demonstrate that they exhibit similar properties.

Strengths and Weaknesses

Strengths

  1. A toy model where the problem of shortcut features can be reproduced and against which possible solutions can be tested.
  2. Use an existing theoretic framework (NTK) in a simplified manner to better understand the learned features which correspond to shortcut features.
  3. The predictability and availability metrics seem reasonable and useful.
  4. Test to see if a previously proposed mitigation for shortcut features work on the model, and found it does not.

Weaknesses

  1. Unclear logical structure. The manuscript would benefit from a more layered structure that makes clearer claims. Specifically, it is unclear which results depend on the Gaussian mixture model, which on the NTK analysis, and which on neither. I think the structure of the logical argument is:
    - Propose a model of a biased Gaussian mixture to test shortcut features (with equations).
    - Show when the model demonstrates shortcut features and when it does not (e.g., yes when the bias is larger than some threshold and the variance is larger than some threshold; show a figure).
    - Apply a previously suggested method to understand the learned features in this model, demonstrating that they exist, are learned faster, and contribute to task performance.
    - Show that a previously suggested solution for shortcut features, controlling the prediction margin, is not helpful in this model.
    - Based on the analysis, suggest measures for identifying shortcut features: availability and predictability.
    - Replicate the results on real-world datasets, without the model's assumptions.
  2. The predictability and availability metrics seem reasonable, to the point that I would guess they already have a name or previous usage.
  3. The NTK analysis is relevant only in the lazy regime, which is not what happens when neural networks get trained on real-world samples. The derived results provide only an intuitive understanding that needs to be tested.
  4. While it is noted that "the data variance of each cluster is not really small", the contribution of this phenomenon is not well investigated. The point of Figure 4 is unclear.

Questions

  1. Can you show that the Gaussian mixture model exhibits shortcut features without using the NTK analysis?
  2. Can you quantitatively show the parameters of the GMM under which shortcut features do or do not appear, based on the NTK analysis?

Limitations

yes

Final Justification

The authors have properly answered my questions and clarified things I misunderstood, so I increased my score. The manuscript would still benefit from being updated to include clarifications, fix typos, and correct the experimental results, which contained a mistake.

Formatting Issues

None

Author Response

Thank you very much for your constructive review! We first note that Equations 21, 22, and 23 in Appendix B contain a typo: $\exp(-\frac{1}{2} (s - \mu_k)^\top \Sigma^{-1}_k (s - \mu_k))$ should be $\exp(-\frac{1}{2} s^\top \Sigma^{-1}_k s)$. We apologize for the inconvenience.

1. Unclear logical structure. The manuscript would benefit from a more layered structure, which makes clearer claims.

We are sorry for our unclear statements. The logical argument you outlined is correct. For more context, we clarify our results as follows: Theorem 3.1 and Theorem 3.2 assume data following the Gaussian mixture model, while Corollary 3.3 does not require the data to follow a Gaussian mixture model. All three results assume that the trained model is a linear neural network. Meanwhile, the toy experiments in Fig. 4, Fig. 5, and Fig. 7 were run on a two-layer ReLU FC network, the saliency maps (Fig. 3) were generated from a two-layer ReLU CNN, and predictability and availability (Fig. 6) were measured from a pretrained ResNet-18.

2. The predictability and availability metrics seem reasonable, to the point that I would guess they already have a name or previous usage.

Yes, especially the availability metric has been widely used to measure the alignment of NTK to the ground-truth label [1], but it has not been used to investigate the phenomenon of shortcut learning.

3. The NTK analysis is relevant only in the lazy regime, which is not what happens when neural networks get trained on real-world samples. The derived results provide only an intuitive understanding that needs to be tested.

Theorem 3.1 is relevant in the lazy regime, and our analysis of the convergence speeds of features depends on the lazy-regime assumption. We acknowledge that this is a limitation. However, it is known that a neural network can sometimes exhibit lazy-regime behaviour in the early phase of training, which is the main focus of our analysis of the convergence speeds of features [2]. Also, Theorem 3.2 and Corollary 3.3 do not need the lazy-regime assumption.

4. While it is noted that "the data variance of each cluster is not really small", the contribution of this phenomenon is not well investigated. The point of Figure 4 is unclear.

The fact that the data variances of the clusters are not small is used in Theorem 3.2, which implies that shortcut features dominate the decision boundary when the variances are nonzero. The point of Fig. 4 is to empirically validate Theorem 3.2 and show the effect of the data variances on the decision boundary.

We also answer your questions below:

1. Can you show that the Gaussian mixture model exhibits shortcut features without using the NTK analysis?

In Theorem 3.2, we show that the Gaussian mixture model exhibits shortcut features after convergence without using any NTK analysis. (Unfortunately, we currently do not provide an analysis of the convergence speeds of features during training without the NTK analysis.)

2. Can you quantitatively show the parameters of the GMM under which shortcut features do or do not appear, based on the NTK analysis?

If the data distribution follows the Gaussian mixture model $p(x) = \sum^K_{k=1} \pi_k \mathcal{N}(\mu_k, \sigma^2_k I)$ with $\sigma_k > 0$, then an imbalance in the weights $\pi_k$ can trigger shortcut learning after convergence. Spectral bias can occur even when $\sigma_k = 0$.
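The link between the mixing weights $\pi_k$ and the kernel eigenvalues can be checked numerically for the linear kernel $k(x, x') = x^\top x'$, whose eigenvalues on linear functions are those of the second-moment matrix $E[xx^\top] = M + \sigma^2 I$ with $M = \sum_k \pi_k \mu_k \mu_k^\top$. The sketch below uses illustrative parameters and a single shared $\sigma$ for simplicity (not the paper's exact setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative biased GMM: orthogonal unit-norm means, imbalanced weights.
pi = np.array([0.8, 0.2])
mu = np.array([[1.0, 0.0], [0.0, 1.0]])
sigma = 0.5

# Theoretical spectrum: eigenvalues of M = sum_k pi_k mu_k mu_k^T, shifted
# by the within-cluster variance sigma^2. The larger mixing weight yields
# the larger eigenvalue, i.e. that direction is fitted faster.
M = sum(p * np.outer(m, m) for p, m in zip(pi, mu))
theory = np.sort(np.linalg.eigvalsh(M))[::-1] + sigma**2

# Empirical check from samples of the mixture.
n = 200_000
comp = rng.choice(2, size=n, p=pi)
X = mu[comp] + sigma * rng.standard_normal((n, 2))
empirical = np.sort(np.linalg.eigvalsh(X.T @ X / n))[::-1]

print(theory)     # approximately [1.05, 0.45]
print(empirical)  # close to the theoretical values
```

With equal-norm means, the eigenvalue gap comes entirely from the weight imbalance, matching the statement that imbalanced $\pi_k$ drives the spectral bias.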

[1] Baratin, Aristide, et al. "Implicit regularization via neural feature alignment." International Conference on Artificial Intelligence and Statistics. PMLR, 2021.

[2] Lyu, Kaifeng, et al. "Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking." The Twelfth International Conference on Learning Representations.

Comment

Thank you for your clarifications!

Your results seem interesting and well-supported. I think that improving the clarity and exposition (next time) would have a huge effect on your ability to explain your results to the reader. Meanwhile, I will increase my score.

Review (Rating: 5)

The paper proposes an analysis of shortcut learning resulting from the dominance of biased attributes in the data for the case of linear networks using NTK theory, by tying the network features to the eigenfunctions of the NTK and measuring the influence of each eigenfunction on the neural network outputs.

The central parts of the theoretical argument are two theorems, formulated for the case of linear neural networks and datasets generated from mixtures of Gaussians with isotropic covariances:

Theorem 3.1 quantitatively inspects the spectral bias by relating the convergence speed of each feature direction to the eigenvalues of the between-means covariance matrix M. It states that the eigenfunctions of the integral operator defined for the inner-product kernel are linear projections onto the eigenvectors of M, and that the eigenvalues decompose into the sum of the average within-cluster variance (a constant term) and the eigenvalues associated with the eigenvectors of M. The theorem provides insight into why shortcut features (which, in the considered scenario, are aligned with large cluster mixing probabilities), corresponding to larger eigenvalues of M, are fitted faster: their kernel eigenvalues are large. In contrast, discriminative features that correspond to either small or similar clusters show a slower learning speed.

Theorem 3.2 further shows that, under training with MSE loss, eigenfunctions of the kernel operator corresponding to clusters with larger mixing weights have a higher influence on the network output. Coupled with Theorem 3.1, it states that shortcut features (which correspond to larger kernel eigenvalues) are fitted early and dominate the network predictions after convergence. One important note from the authors is that the impact of the bias feature after convergence depends not only on the cluster mixing weights but also on the variances of the samples, highlighted empirically on a toy example.

The theoretical results and experiments on toy data additionally challenge the idea that maximal margin control (for example via SD) alone can eliminate the shortcut bias, by stating that the SD decision boundary converges to the MSE decision boundary in the infinite-limit of the regularization coefficient, which is affected by shortcut learning. Further, the authors provide metrics for the concepts of predictability (how well a feature is aligned with the ground truth) and availability (the alignment of the feature with the top eigenspace of NTK) and empirically apply them on real datasets.

Strengths and Weaknesses

Strengths:

  • The authors take an interesting perspective on shortcut features through the lens of NTK theory and advance its theoretical understanding
  • The theoretical argument (restricted to the case of linear networks and GMM data) provides explicit forms for the eigenfunctions and weights as functions of the dataset mixing probabilities
  • The results improve understanding of debiasing, by first showing that maximal margin control alone cannot eliminate shortcut biases and by additionally highlighting the impact of within-cluster variance on both spectral bias and feature weights
  • The experimental setup is extended beyond the limited regime considered in the theoretical argument (linear networks, GMM data) to real data and networks and empirically validates the spectral bias properties through newly introduced metrics for predictability and availability
  • The results might potentially provide insights applicable for the study of interpretability, as shown in Figure 3 where saliency maps for directions with high and low eigenvalues are tied to human-interpretable features

Weaknesses:

  • In the follow-up analysis of Theorem 3.1, it is implied that eigenvalues depend primarily on the cluster mixing weights ("Since eigenvectors close to a cluster with a larger weight pk have larger eigenvalues, an inner product with a vector from a larger cluster has a larger eigenvalue and converges faster due to the spectral bias"). However, it follows from Theorem 3.1 that eigenvalues scale with both the mixing weights and the magnitude of the mean vectors (concretely, with p_k · ||μ_k||). Therefore, in scenarios where a cluster with a high mixing weight has a small-norm mean, its direction might not dominate the spectral bias anymore. I recommend clarifying this aspect in the analysis, as it potentially restricts the theoretical setup even further to datasets where the mean vectors have approximately equal norms, further drifting away from practice.
  • Some of the limitations already pointed out by the authors: highly limiting regime in the theoretical analysis, with respect to data generation and class of models, as well as limitation of availability due to multiple shortcuts. While the empirical results show agreement with theory, the gap between the two is still considerable.
  • Theory relies on static NTK assumption that only holds in the infinite-width limit

Questions

  • Given the spectral bias analysis, I am curious whether you have thought about how the new spectral bias insights (and the metrics for availability and predictability) might inform designing further debiasing strategies?
  • Since the theory assumes a static NTK (which holds for infinite-width limit), have you observed a particular width threshold beyond which the spectral bias observations tend to emerge in the experiments?
  • As mentioned in the Weaknesses section, for the toy experiments highlighting theorems 3.1 and 3.2, has the norm of the cluster means been controlled in some way in order to observe its effect on the eigenvalue scales?
  • Can some aspects from the theoretical argument be directly applied from linear NTK to other types of kernels?

Limitations

Yes

Final Justification

Key updates in the rebuttal include the following points:

  • reiterates where the GMM-data and constant-NTK premises are assumed in Theorems 3.1 and 3.2 and Corollary 3.3
  • discussions on the case of small, high-variance clusters (discussion can be found in the replies to reviewers uhAz and jAtx)
  • supplies preliminary confidence-interval experiments across multiple runs, strengthening the empirical claims (discussion can be found in the replies to reviewers jAtx and o9y3)
  • tests on how availability varies with network width, confirming that the availability of the shortcut label is consistently higher than the availability of the ground-truth label.

However, I also emphasized in my discussion with the authors that, even if the empirical evaluation regarding Theorem 3.2 focuses exclusively on the role of the mixing weights, the paper should still discuss more clearly that eigenvalues are influenced by both the mixing weights and the magnitude of the mean vectors, a consequence deriving from Theorem 3.1. I recommended including this discussion such that the influence of mean-vector norms is not overlooked, as equal mean vectors constitutes another limiting assumption of the setup. I appreciate that the authors plan to detail this discussion for the final version.

Following the rebuttal, I have decided to increase my rating.

Formatting Issues

No major paper formatting concerns

Author Response

Thank you very much for your constructive review! We first note that Equations 21, 22, and 23 in Appendix B contain a typo: $\exp(-\frac{1}{2} (s - \mu_k)^\top \Sigma^{-1}_k (s - \mu_k))$ should be $\exp(-\frac{1}{2} s^\top \Sigma^{-1}_k s)$. We apologize for the inconvenience.

1. I recommend clarifying this aspect in the analysis, as it potentially restricts the theoretical setup even further to datasets where the mean vectors have approximately equal norms, further drifting away from practice.

Firstly, both toy experiments in Fig. 4 and 5 were conducted with orthogonal clusters whose means have norm 1. The norms of the cluster means were identical, so the effect of the norm was decoupled. While we were also aware of the effect of the norm of the mean, we did not elaborate on the subject since our focus was on the weight $\pi_k$. If given a chance to revise our paper, we will reflect your comment and put more emphasis on the effect of the norms of the means on the decision boundary of a neural network.

2. Theory relies on static NTK assumption that only holds in the infinite-width limit.

Theorem 3.1 assumes a static NTK, which is a limitation of our work. However, it is known that a neural network can sometimes exhibit lazy-regime behaviour in the early phase of training, which is a focus of our analysis of the convergence speeds of features [3]. On the other hand, Theorem 3.2 and Corollary 3.3, which characterize the decision boundary of a network after convergence, are not based on the constant-NTK assumption.

We also answer your questions below:

1. Given the spectral bias analysis, I am curious whether you have thought about how the new spectral bias insights (and the metrics for availability and predictability) might inform designing further debiasing strategies?

Our new spectral bias insights might aid in finding possible shortcut labels inherent in a dataset. Since shortcut features are learned faster than core features, one can construct shortcut labels by looking at the predictions of the network in the early phase of training, which is already done in a well-known debiasing method called LfF [1].
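The early-prediction heuristic can be sketched in a toy setting. In the code below, all data and scale parameters are illustrative assumptions, and a few gradient steps of linear regression stand in for the early phase of training, where spectral bias makes the shortcut direction fit first:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical biased training set: the shortcut attribute agrees with the
# ground-truth label on 95% of samples and is encoded at a larger scale.
n = 2000
y = rng.choice([-1.0, 1.0], size=n)              # ground-truth label
flip = rng.random(n) < 0.05                      # bias-conflicting minority
s_attr = np.where(flip, -y, y)                   # shortcut attribute
X = np.column_stack([3.0 * s_attr, 0.3 * y]) + 0.1 * rng.standard_normal((n, 2))

# A few gradient-descent steps on the squared loss: only the dominant
# (shortcut-aligned) eigen-direction of the data has converged so far.
w = np.zeros(2)
for _ in range(10):
    w += 0.1 * X.T @ (y - X @ w) / n

early_pred = np.sign(X @ w)

# LfF-style readout: on the minority samples where shortcut and ground
# truth disagree, early predictions track the shortcut attribute.
minority = flip
match_shortcut = np.mean(early_pred[minority] == s_attr[minority])
match_truth = np.mean(early_pred[minority] == y[minority])
print(match_shortcut, match_truth)
```

Run to convergence instead of 10 steps, the least-squares solution would shift weight back toward the ground-truth feature; the shortcut dominates only in the early phase, which is what makes early predictions usable as shortcut labels.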

2. Since the theory assumes a static NTK (which holds for infinite-width limit), have you observed a particular width threshold beyond which the spectral bias observations tend to emerge in the experiments?

We could not find a case where our observations were invalid. We show our results on Patched-MNIST with a two-layer ReLU FC network, including widths as small as 32 or 64: the availability of the shortcut label was consistently higher than the availability of the ground-truth label.

| Width | 32 | 64 | 256 | 512 | 1024 | 2048 |
| --- | --- | --- | --- | --- | --- | --- |
| Shortcut label | 0.1726 | 0.1625 | 0.1686 | 0.1684 | 0.1494 | 0.1138 |
| Ground-truth label | 0.1637 | 0.1539 | 0.1578 | 0.1571 | 0.1430 | 0.1050 |

3. As mentioned in the Weaknesses section, for the toy experiments highlighting theorems 3.1 and 3.2, has the norm of the cluster means been controlled in some way in order to decouple its effect on the eigenvalue scales?

Both toy experiments in Fig. 4 and 5 were conducted with orthogonal clusters with means of norm 1. The norms of the means of clusters were identical and the effect of the norm was decoupled.

4. Can some aspects from the theoretical argument be directly applied from linear NTK to other types of kernels?

You cannot directly apply the results to the NTK of a ReLU network, but in simple cases where the eigenfunctions of a ReLU NTK include linear functions [2], our results might hold for those linear eigenfunctions.

[1] Nam, Junhyun, et al. "Learning from failure: De-biasing classifier from biased classifier." Advances in Neural Information Processing Systems 33 (2020): 20673-20684.

[2] Hermann, Katherine L., et al. "On the foundations of shortcut learning." arXiv preprint arXiv:2310.16228 (2023).

[3] Lyu, Kaifeng, et al. "Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking." The Twelfth International Conference on Learning Representations.

Comment

I thank the authors for their responses! In the light of the discussions, I have increased my score.

Regarding control for the mean-vectors, although the experiment accounts for equal mean-vector norms, my original curiosity (in hindsight, unclearly formulated, apologies for that) was rather about systematically controlling the drift between mean-vector norms such that its effect on the network outputs is observed.

While explicitly evaluating the effect of the mean-vector norms could more strongly highlight the theoretical-empirical link, I recognize that the focus of the experiment is on the mixing weights. I welcome and support the authors' decision to emphasize more clearly the effect of the mean-vector norms on the eigenvalues in an updated version.

Review (Rating: 5)

Introduces an NTK-based theory explaining why shortcut (spurious) features dominate training: clusters with higher sample weight π acquire larger NTK eigen-values, are fitted fastest, and retain the largest post-training influence.

Proves (Theorems 3.1–3.2) these properties for linear networks; then shows empirically that two-layer ReLU nets and ResNet-18 exhibit the same behaviour on Patched-/Colored-MNIST, Waterbirds, CelebA, Dogs vs Cats.

Defines predictability (alignment to ground-truth) and availability (alignment to top-NTK eigen-vectors) and demonstrates that shortcut labels have low predictability but high availability across five biased benchmarks.

Argues, both theoretically (Cor. 3.3) and with 2-D toy data, that margin-control methods (SD / Marg-Ctrl) cannot fully remove shortcut bias.

Strengths and Weaknesses

Quality

Strength

  1. Solid linear-kernel proofs, clearly laid out in Appendix B.
  2. Comprehensive experiments (5 real, 2 synthetic datasets) validate theory.

Weakness

  1. All curves come from single runs; no confidence intervals reported.

  2. Manual construction of shortcut labels may inject subjectivity.

Clarity

Strength

  1. Helpful visuals (Fig. 3 saliency maps, Fig. 4-5 decision boundaries) illuminate the ideas.

Weakness

  1. The main text is heavy with notation; many key plots are relegated to the appendix.

Significance

Strength

  1. Bridges spectral bias and shortcut learning, highlighting cluster variance as a new factor.

Weakness

  1. Results remain in infinite-width/lazy-training regime; impact on feature-learning settings uncertain.

Originality

Strength

  1. Novel availability metric linking shortcut strength to NTK spectrum; challenges margin-bias narrative.

Weakness

  1. Core insight “large eigen-values ⇒ faster learning” is rooted in prior spectral-bias work, limiting novelty.
  2. Core contribution also overlaps with Learnability in the Context of NTK: because mathematically Tr K = ∑ λᵢ, their large-k(x,x) "easy" samples are exactly the authors' large-λ "shortcuts".

Questions

Finite-width drift – Could you run a width sweep (e.g., 256–2k channels) on Patched-MNIST to show when the NTK approximation breaks?

Statistical confidence – Please repeat each benchmark over ≥5 seeds and report mean ± s.d.; how stable is the availability ranking?

Shortcut-label sensitivity – How do availability curves change if shortcut labels are derived automatically (e.g. k-means on top eigen-projections) instead of manual inspection?

Variance term in Thm 3.2 – Provide a concrete bound showing when a small but high-variance cluster can outweigh a large low-variance one.

Late-epoch NTK – Have you measured availability using the empirical NTK after feature learning (e.g. epoch 50 ResNet-18) to see if shortcut alignment grows?

Limitations

Authors acknowledge linear-kernel and infinite-width assumptions but societal-impact section is brief; suggest expanding on potential mis-use of availability for data filtering.

Final Justification

I appreciate the authors' efforts to address my concerns. I raised my score to 5.

Formatting Issues

None

Author Response

Thank you very much for your constructive review! We first notify you that equations 21, 22, and 23 in Appendix B contain typos: $\exp(-\frac{1}{2} (s - \mu_k)^\top \Sigma^{-1}_k (s - \mu_k))$ should be $\exp(-\frac{1}{2} s^\top \Sigma^{-1}_k s)$. We are sorry for the inconvenience.

1. All curves come from single runs; no confidence intervals reported.

Although the availability of a shortcut label was consistently higher than that of the ground-truth label, the measured values varied due to the randomness in picking samples for the NTK estimate, so we omitted error bars for the clarity of the graphs. Because of the limited rebuttal period, we cannot rerun all experiments with multiple seeds; instead, we reran the important ones below to show that our results are consistent. Across all runs, the availability of the shortcut label remained higher than that of the ground-truth label, and in the experiment measuring the strength of a shortcut (Fig. 7), a stronger shortcut always exhibited higher availability. The experiments below were run with CE loss, and availability was measured at the 1000th iteration to observe the convergence behaviour in the early phase of training (which was also our focus in Fig. 6, 9, and 10).

Waterbirds

| Run | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Shortcut label | 0.3669 | 0.2700 | 0.3480 | 0.4049 | 0.2996 |
| Ground-truth label | 0.3496 | 0.2576 | 0.3325 | 0.3827 | 0.2906 |

CelebA

| Run | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Shortcut label | 0.6300 | 0.5895 | 0.5886 | 0.5767 | 0.5958 |
| Ground-truth label | 0.6257 | 0.5382 | 0.5707 | 0.5611 | 0.5833 |

Colored-MNIST

| Run | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Shortcut label | 0.5317 | 0.5360 | 0.5170 | 0.5347 | 0.5191 |
| Ground-truth label | 0.4370 | 0.4248 | 0.4032 | 0.4472 | 0.4380 |

Patched-MNIST

| Run | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Shortcut label | 0.1628 | 0.1686 | 0.1751 | 0.1739 | 0.1684 |
| Ground-truth label | 0.1555 | 0.1578 | 0.1533 | 0.1567 | 0.1454 |

On the other hand, for the experiments measuring the strength of a shortcut in Patched-MNIST (Fig. 7), the variance of availability was small (across 5 runs):

| Size of patch | 1x1 | 3x3 | 5x5 | 7x7 |
| --- | --- | --- | --- | --- |
| Shortcut label | 0.0605 ± 0.0042 | 0.1698 ± 0.0049 | 0.2861 ± 0.0038 | 0.3721 ± 0.0085 |

If given a chance to revise our paper, we will reflect your comment and include error bars for this experiment. The experiments below were run with MSE loss.

Waterbirds

| Run | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Shortcut label | 0.2040 | 0.2342 | 0.1910 | 0.2406 | 0.2777 |
| Ground-truth label | 0.1830 | 0.2078 | 0.1745 | 0.2173 | 0.2739 |

CelebA

| Run | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Shortcut label | 0.4466 | 0.4572 | 0.3853 | 0.3861 | 0.4841 |
| Ground-truth label | 0.4140 | 0.4230 | 0.3715 | 0.3691 | 0.4527 |

Colored-MNIST

| Run | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Shortcut label | 0.5334 | 0.5529 | 0.5379 | 0.5529 | 0.5550 |
| Ground-truth label | 0.4479 | 0.4670 | 0.4599 | 0.4742 | 0.4672 |

Patched-MNIST

| Run | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Shortcut label | 0.1695 | 0.1954 | 0.1528 | 0.1648 | 0.1718 |
| Ground-truth label | 0.1544 | 0.1786 | 0.1431 | 0.1610 | 0.1566 |
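
The shortcut-vs-ground-truth availability gap seen in these tables can be mimicked in miniature with kernel–target alignment, a standard measure of how well a label vector matches the top of a kernel's spectrum. The sketch below uses a linear kernel on toy data as a stand-in for the empirical NTK; it is an illustrative proxy, not necessarily the paper's exact availability definition:

```python
import numpy as np

def label_alignment(K, y):
    """Kernel-target alignment <K, y y^T>_F / (||K||_F * ||y y^T||_F)."""
    yyT = np.outer(y, y)
    return float((K * yyT).sum() / (np.linalg.norm(K) * np.linalg.norm(yyT)))

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))          # toy inputs
K = X @ X.T                                # linear kernel as an NTK stand-in

y_shortcut = np.sign(X[:, 0])              # label tied to one input direction
y_random = rng.choice([-1.0, 1.0], 100)    # label unrelated to the inputs

# A label aligned with a dominant data direction scores higher, mirroring
# the higher availability of shortcut labels in the tables above.
assert label_alignment(K, y_shortcut) > label_alignment(K, y_random)
```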

2. Manual construction of shortcut labels may inject subjectivity.

Although we manually constructed the shortcut labels, the label assignment does not depend on a subjective decision. For the real-world datasets, we partitioned the samples into four groups: [bias-aligned samples labelled 1], [bias-aligned samples labelled -1], [bias-conflicting samples labelled 1], and [bias-conflicting samples labelled -1]. We then observed how each group was predicted on the test sets and assigned each group the shortcut label given by the majority prediction within that group (if the majority of a group was predicted as 1, the shortcut label for that group is 1). In Fig. 6, 9, and 10, the test predictions of the network were closer to our constructed shortcut labels than to the original ground-truth labels, which shows that the constructed shortcut labels are valid; more explanation is in Appendix A.1. This procedure is based purely on the statistics of the predictions, not on our subjective decisions.

For example, in Waterbirds, the match rate of the test predictions to the manually found label was 0.7819 ± 0.0018 (measured at > 100 epochs across 5 runs), whereas the match rate to a label composed solely from the shortcut feature was lower, at 0.7488 ± 0.0029 (across 5 runs). In CelebA, the match rate of the test predictions to the manually found label was 0.9527 ± 0.0003 (measured at > 100 epochs across 5 runs), whereas the match rate to the ground-truth label was lower, at 0.9510 ± 0.0004 (across 5 runs). In CelebA, we compared against the ground-truth label rather than a label composed solely from the shortcut feature, since the match rate to the latter was much lower than the match rate to the ground-truth label. This shows that our shortcut labels are reliable.

3. Results remain in infinite-width/lazy-training regime; impact on feature-learning settings uncertain.

Theorem 3.1 assumes the infinite-width/lazy-training regime, which is a limitation of our work. However, it is known that a neural network can sometimes exhibit lazy-regime behaviour in the early phase of training, which is why our analysis focuses on the convergence speeds of features [1]. On the other hand, Theorem 3.2 and Corollary 3.3, which describe the behaviour of shortcut features after convergence of the network, do not rely on the constant-NTK assumption.

We also answer your questions below:

1. Finite-width drift – Could you run a width-sweep (e.g. 256-2 k channels) on Patched-MNIST to show when NTK approximation breaks?

We measured availability for two-layer ReLU FC networks with various widths: 32, 64, 256, 512, 1024, and 2048. The availability of the shortcut label was consistently higher than that of the ground-truth label.

| Width | 32 | 64 | 256 | 512 | 1024 | 2048 |
| --- | --- | --- | --- | --- | --- | --- |
| Shortcut label | 0.1726 | 0.1625 | 0.1686 | 0.1684 | 0.1494 | 0.1138 |
| Ground-truth label | 0.1637 | 0.1539 | 0.1578 | 0.1571 | 0.1430 | 0.1050 |
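
Such a width sweep can be reproduced in a toy setting: for a two-layer ReLU network f(x) = aᵀ relu(Wx)/√m, the empirical NTK is the Gram matrix of parameter gradients and has a closed form. A hedged sketch on synthetic data (not the actual Patched-MNIST setup):

```python
import numpy as np

def empirical_ntk(X, W, a):
    """Empirical NTK of f(x) = a^T relu(W x) / sqrt(m) at parameters (W, a)."""
    m = W.shape[0]
    pre = W @ X.T                      # (m, n) pre-activations
    H = np.maximum(pre, 0.0)           # post-activations
    G = (pre > 0).astype(float)        # ReLU gates
    K_a = H.T @ H / m                  # gradients w.r.t. the output layer a
    K_w = (X @ X.T) * ((G * a[:, None] ** 2).T @ G) / m  # gradients w.r.t. W
    return K_a + K_w

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 8))       # toy inputs
for m in (32, 256, 2048):              # width sweep
    W = rng.standard_normal((m, 8))
    a = rng.standard_normal(m)
    K = empirical_ntk(X, W, a)
    # The NTK is a Gram matrix, hence symmetric and positive semi-definite.
    assert np.allclose(K, K.T)
    assert np.linalg.eigvalsh(K).min() > -1e-6
```

One could then compare a label's alignment with this kernel at each width, as in the availability measurements above.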

2. Statistical confidence – Please repeat each benchmark over ≥5 seeds and report mean ± s.d.; how stable is the availability ranking?

We showed the partial results above: though the value of the availability itself could be unstable, the availability of a shortcut label was consistently larger than the availability of the ground-truth label.

3. Shortcut-label sensitivity – How do availability curves change if shortcut labels are derived automatically (e.g. k-means on top eigen-projections) instead of manual inspection?

We are not sure whether we understood your question correctly, but if you mean automatic derivation by composing the label from k-means clustering on data projected onto principal components, then the availability varied across runs. We experimented on Waterbirds, constructing the label via 2-means clustering on data projected onto principal components. The results are below:

Waterbirds

| Run | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Ground-truth label | 0.3631 | 0.3481 | 0.2902 | 0.3498 | 0.3466 |
| K-means label | 0.0878 | 0.0836 | 0.0460 | 0.0419 | 0.0409 |

We suspect that the labels from k-means clustering were far from the ground-truth label.
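
The reviewer's automatic alternative can be sketched with a numpy-only pipeline: project the data onto top principal components, then run a plain 2-means. Everything below is illustrative synthetic data, not the paper's Waterbirds features:

```python
import numpy as np

def kmeans_label(X, n_components=2, iters=50):
    """Binary label from 2-means on top principal-component projections."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:n_components].T                 # top-PC projections
    # Deterministic init: the extremes along the first principal component.
    centres = Z[[Z[:, 0].argmin(), Z[:, 0].argmax()]]
    for _ in range(iters):
        d = np.linalg.norm(Z[:, None, :] - centres[None], axis=-1)
        assign = d.argmin(axis=1)
        for c in range(2):
            if (assign == c).any():
                centres[c] = Z[assign == c].mean(axis=0)
    return np.where(assign == 0, -1, 1)

rng = np.random.default_rng(1)
# Two well-separated synthetic clusters; 2-means should recover them.
X = np.vstack([rng.normal(-3.0, 1.0, (50, 10)),
               rng.normal(3.0, 1.0, (50, 10))])
y = kmeans_label(X)
# Each true cluster receives a (near-)homogeneous label.
assert abs(y[:50].mean()) > 0.9 and abs(y[50:].mean()) > 0.9
```

On real features the recovered clusters need not align with the ground-truth classes, consistent with the low availability of the k-means labels reported above.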

4. Variance term in Thm 3.2 – Provide a concrete bound showing when a small but high-variance cluster can outweigh a large low-variance one.

If the means of the clusters are orthogonal to each other and have identical norms, a small but high-variance cluster cannot outweigh a large low-variance one. The variance term $\sum^K_{i=1} \pi_i \sigma^2_i$ in $w_k$ aggregates the variances of all clusters, so $w_k$ does not depend on the variance of any single cluster alone. If at least one cluster has a positive variance, then a large cluster always outweighs small clusters.

5. Late-epoch NTK – Have you measured availability using the empirical NTK after feature learning (e.g. epoch 50 ResNet-18) to see if shortcut alignment grows?

All the experiments in Fig. 6, 9, and 10 show the alignment of the empirical NTK after epoch 100 (all experiments were run for 200 epochs, except CelebA, which was run for 130 epochs). The availability grew in experiments on synthetic datasets, but not in experiments on real-world datasets.

[1] Lyu, Kaifeng, et al. "Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking." The Twelfth International Conference on Learning Representations.

Comment

I appreciate the authors' efforts to address my concerns. I will raise my score.

Comment

We further notify the reviewers that the bird images in our Waterbirds dataset are padded with a margin during synthesis, so the birds in our dataset are 40% of the size of those in the CUB-200 dataset. This description of the experimental setting was omitted from our paper; we are sorry for the omission, and if given a chance to revise the paper, we will add it. On the other hand, when we experiment with no padding, the availability of the shortcut label is still consistently higher (across 5 runs) than the availability of the ground-truth label, as shown below (the shortcut label was manually found again and did not change). Thus, the different image size does not affect the main claim of our paper about the higher availability of the shortcut label. If given a chance to revise our paper, we will include this experimental result in the main plots.

Waterbirds in CE loss

| Run | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Shortcut label | 0.2547 | 0.2751 | 0.2972 | 0.3390 | 0.2788 |
| Ground-truth label | 0.2397 | 0.2553 | 0.2859 | 0.3300 | 0.2654 |

Waterbirds in MSE loss

| Run | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Shortcut label | 0.2462 | 0.2664 | 0.2325 | 0.2144 | 0.2486 |
| Ground-truth label | 0.2240 | 0.2513 | 0.2267 | 0.2006 | 0.2338 |

Waterbirds in SD regularization

| Run | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Shortcut label | 0.2005 | 0.2444 | 0.2330 | 0.2678 | 0.2534 |
| Ground-truth label | 0.1963 | 0.2212 | 0.2166 | 0.2449 | 0.2371 |
Comment

Thanks for self-reporting the error. Seems like an honest mistake, and, importantly, does not seem to affect the main claims in the paper.

It goes without saying but if your paper were accepted, the final version must be updated to reflect the corrected and clarified results presented during the review. I would be explicitly looking out for it. Note, of course, that the discussions are still ongoing, and no decisions have been made.

Final Decision

The paper formally connects the widely observed phenomenon of shortcut learning to spectral bias, proposing that shortcut features correspond to NTK eigenfunctions with the largest eigenvalues. The authors supported their theory with extensive experiments across a wide range of datasets and models. The work also argues that max-margin bias is not the only cause of shortcut learning and shows that the phenomenon persists even when the network's margin is controlled. Further, the authors demonstrate that data variance within clusters also plays an important role in learning shortcut features.

Originally, the results were reported for a single run, but during the rebuttal the authors extended the results and repeated the experiments for more runs. Another concern noted by the reviewers is that the theory relies on a linear network in an infinite-width regime. To address this concern, the authors showed that their findings empirically hold across different architecture sizes.