PaperHub
Overall: 6.8/10 | Poster | 5 reviewers (scores 4, 4, 5, 4, 4; min 4, max 5, std 0.4)
Confidence: 3.2
Novelty: 2.6 | Quality: 2.6 | Clarity: 2.8 | Significance: 2.4
NeurIPS 2025

How Does Label Noise Gradient Descent Improve Generalization in the Low SNR Regime?

OpenReview | PDF
Submitted: 2025-05-11 | Updated: 2025-10-29

Abstract

Keywords

Label noise gradient descent; low signal-to-noise ratio; optimization and generalization

Reviews and Discussion

Review (Rating: 4)

This paper considers the problem of generalization in the low signal-to-noise ratio regime. The paper shows two results. First, for a very specific high-dimensional setup, the paper shows that early-stopped gradient descent does not generalize well. To mitigate this, the paper suggests a label-flipping gradient descent, where certain labels are randomly flipped at each iteration. The paper then shows that early-stopped label-flipping gradient descent does generalize.
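To make the procedure concrete, here is a minimal sketch of the label-flipping gradient descent described above. This is an illustrative reconstruction rather than the paper's implementation: it uses a plain linear scorer with logistic loss instead of the paper's two-layer convolutional network with squared ReLU, and all names and hyperparameters (`p_flip`, `eta`, `n_steps`) are placeholders.

```python
# Minimal sketch of label-flipping ("label noise") gradient descent.
# Illustrative only: linear scorer + logistic loss instead of the paper's
# two-layer CNN; constants are placeholders, not the paper's settings.
import numpy as np

rng = np.random.default_rng(0)

n, d = 200, 2000                       # samples, ambient dimension
mu = np.zeros(d); mu[0] = 1.0          # unit-norm signal direction
y = rng.choice([-1.0, 1.0], size=n)    # clean labels
X = 0.1 * y[:, None] * mu + rng.normal(size=(n, d))   # weak signal + strong noise (low SNR)

w = np.zeros(d)
eta, p_flip, n_steps = 0.1, 0.2, 500

for t in range(n_steps):
    # Independently flip each label with probability p_flip at this iteration.
    flips = rng.random(n) < p_flip
    y_t = np.where(flips, -y, y)

    # Gradient of the logistic loss (1/n) * sum_i log(1 + exp(-y_t_i * <w, x_i>)).
    margins = np.clip(y_t * (X @ w), -30, 30)          # clip for numerical stability
    coeffs = -y_t / (1.0 + np.exp(margins))
    grad = (X * coeffs[:, None]).mean(axis=0)
    w -= eta * grad

# Signal alignment vs. residual norm (a crude proxy for noise memorization).
print("signal component:", w @ mu, "| residual norm:", np.linalg.norm(w - (w @ mu) * mu))
```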

Strengths and Weaknesses

Strengths

  1. The paper presents concrete theoretical results that are well motivated. The two theorems, 3.2 and 3.3, complement each other and study an important phenomenon.

  2. The proof sketch is well appreciated. In particular, it very clearly highlights the decomposition and shows how the weights $w$ split between the signal subspace and the noise subspace, thus making the noise-overfitting phenomenon precise.

Weaknesses

  1. The setup for the paper is quite specific

    a. Different signal and noise components, with the noise being orthogonal to the signal, as compared to a signal-plus-noise model where the noise interacts with the signal.

    b. The ReLU-squared activation function. It is unclear to the reviewer why this activation is chosen over a standard activation function.

    c. The usage of 1-D convolutions instead of a standard feedforward network.

  2. Lack of insight into the phenomenon. This is a significant weakness: while the theoretical results are clear and provide an interesting result, the paper claims that the label flipping helps because the noise acts as a regularizer, but it would be good to get more intuition about why this is true. On the surface, in Theorem 3.3 the stopping time, the generalization bound, and the probability with which we are in a good trajectory are independent of $p$. Hence if I make $p$ really small, say $p = d^{-d}$, then with probability 1 we should see no flips during training. Yet this generalizes while GD does not.

  3. Extreme high-dimensionality. While I recognize that this is common in the benign overfitting literature, I still think it is a negative. For example, [A] shows benign overfitting in a very similar setting for much lower dimensions.

  4. I think the presentation and writing could be improved, in particular in the proof sketch section.

[A] Karhadkar, Kedar, Erin George, Michael Murray, Guido F. Montufar, and Deanna Needell. "Benign overfitting in leaky relu networks with moderate input dimension." Advances in Neural Information Processing Systems 37 (2024)

Questions

  1. Comparison between stopping times. The stopping times in Theorems 3.2 and 3.3 are different. From inspection, it looks like in Theorem 3.2 you stop much later. Can the authors comment on this?

  2. Since both algorithms are early-stopped methods, can you say something about the risks at convergence?

  3. How does the choice of metrics affect things? I.e., classification versus regression, similar to [B], (Figures 3, 4) [C], and the consistency vs. calibration discussion in [D]?

[B] Muthukumar, Vidya, Adhyyan Narang, Vignesh Subramanian, Mikhail Belkin, Daniel Hsu, and Anant Sahai. "Classification vs regression in overparameterized regimes: Does the loss function matter?" Journal of Machine Learning Research 22, no. 222 (2021).

[C] Kausik, C., Srivastava, K., & Sonthalia, R. (2024). Double Descent and Overfitting under Noisy Inputs and Distribution Shift for Linear Denoisers. Transactions on Machine Learning Research. Available at https://openreview.net/forum?id=HxfqTdLIRF

[D] Wu, Jingfeng, Peter Bartlett, Matus Telgarsky, and Bin Yu. "Benefits of Early Stopping in Gradient Descent for Overparameterized Logistic Regression." arXiv preprint arXiv:2502.13283 (2025).

Limitations

N/A

Final Justification

As such, I think we have reached an agreement that, while the assumptions might be needed for tractability, they are a weakness.

Additionally, the paper needs a revision based on the above discussion.

Based on the positive scores from the other reviewers and the clarifications, I'm willing to increase my score. However, I think the paper is borderline.

Formatting Concerns

N/A

Author Response

We thank the reviewer for their careful reading, positive assessment of our theoretical contributions, and constructive feedback. We address the main concerns and questions below.


W1: The setup for the paper is quite specific. a. Different signal and noise components with noise being orthogonal to the signal. b. Relu squared activation function. c. The usage of 1-D convolutions instead of a standard feedforward network.

R1: a. The orthogonality assumption between signal and noise is standard in theoretical works aiming for tractable signal-noise separation, and is also adopted in the benign overfitting literature [1]. Our framework can be extended to cases where the signal and noise components are not strictly orthogonal. While relaxing orthogonality introduces additional interaction terms between signal and noise, these terms can be effectively bounded under sufficiently high-dimensional settings [2].

b. The choice of squared ReLU is motivated by the analytic tractability it offers in our setting, especially when the second layer is fixed. Notably, squared ReLU corresponds to the smallest polynomial power for which both signal and noise components exhibit exponential growth, making it the minimal setting that captures the key dynamics. As discussed in the main text and Appendix G.4, we have verified that our key empirical findings extend to higher-order activations (e.g., ReLU$^3$, ReLU$^4$), and we expect the main phenomena to be present for standard ReLU as well, though the analysis is more involved.

c. We use 1D convolutions primarily to align with the multi-patch data structure, which has been widely studied for its relevance to feature learning [1,2,3]. This choice facilitates a clean separation between patches and aligns our setup with prior literature.


W2: Lack of insights into the phenomenon. On the surface, in Theorem 3.3, the stopping time, the generalization bound, and the probability with which we are in a good trajectory are independent of $p$.

R2: The requirements on both the lower and upper bounds for the flipping rate $p$ are crucial for the effectiveness of label noise GD, particularly in the second phase of training.

In our analysis, the conclusion about signal learning in Lemma 4.5 relies on having an upper bound on $p$. If $p$ is too large, too many noisy labels are introduced and the signal component cannot grow, which prevents effective signal learning. At the same time, the result on the boundedness of the noise component in the first part of Lemma 4.5 holds only when $p$ has a constant lower bound. If $p$ is too small, the regularization effect vanishes and noise memorization is no longer suppressed, as the noise memorization term no longer behaves as a controlled martingale.

We will explicitly state in Assumption 4.1 that $p$ must be bounded above and below by constants, and we will point out these requirements clearly in the relevant lemmas and proofs.


W3: Extreme high-dimensionality. For example, [A] in a very similar setting shows benign overfitting for much lower dimensions.

R3: Thank you for raising this point. Our choice of the high-dimensional regime is widely adopted to enable precise, tractable analysis of gradient descent dynamics and generalization in neural networks [1,2,3].

We appreciate the reviewer’s reference to [A], which indeed demonstrates benign overfitting with leaky ReLU activation and hinge loss in a lower-dimensional regime. However, their setup differs from ours both in terms of the activation function (leaky ReLU vs. squared ReLU) and the loss (hinge vs. cross-entropy). In particular, hinge loss promotes hard-margin solutions, whereas logistic loss leads to different implicit bias and convergence dynamics. Our work specifically focuses on the low SNR setting and investigates the effect of label noise GD, which presents distinct analytical challenges. We will cite [A] and discuss these differences more explicitly in the revised manuscript.


W4: I think the presentation and writing could be improved, in particular in the proof sketch section.

R4: Thank you for this suggestion. In the revised manuscript, we will improve the presentation and clarity of the proof sketch section, explicitly connecting the arguments to the upper and lower bounds on the flipping rate p. This should enhance the readability and help readers better understand the role of p in the analysis.


Q1: Comparison between stopping times. The stopping times in Theorems 3.2 and 3.3 are different. From inspection, it looks like in Theorem 3.2 you stop much later.

A1: The time thresholds for standard GD and label noise GD are not directly comparable, as they correspond to different evaluation criteria. Specifically, the time threshold for standard GD represents the time required for the training loss to converge below a certain threshold $\epsilon$, whereas the time threshold for label noise GD pertains to achieving sufficiently low 0-1 test loss. Therefore, comparing the running time for the two algorithms directly is not fair, as they are optimizing different objectives.

However, to facilitate a meaningful comparison, we have derived a ratio between the time thresholds under Assumption 4.1. By further setting $m^2 = \log(6/(\sigma_0 \|\mu\|_2))/\epsilon$, we can obtain the ratio between the time thresholds as follows:

$$t_{\mathrm{GD}} / t_{\mathrm{LNGD}} = \Theta\left( \frac{n \|\mu\|_2^2}{\sigma_p^2 d} \right) = \Theta\left( n \cdot \mathrm{SNR}^2 \right)$$

According to Assumption 4.1, we assume that $n \cdot \mathrm{SNR}^2 \ll 1$, which implies that label noise GD requires more time to achieve good test performance compared to the convergence time of the training loss for standard GD. This is consistent with the empirical results demonstrated in Figure 1.


Q2: Since both algorithms are early-stopped methods, can you say something about the risks at convergence?

A2: Regarding Theorem 3.2, the convergence rate is of order $1/\epsilon$, where $\epsilon$ is the upper bound on the training loss. As $\epsilon$ is made arbitrarily small, the required running time increases without bound. As a result, the conclusions regarding the behavior of training and test errors hold with high probability and persist for sufficiently large time steps.

For Theorem 3.3, the theorem describes the transition from a regime where noise memorization dominates to one where signal learning becomes dominant. After this transition, our experiments show that the training loss continues to oscillate and the test error remains consistently small. While these empirical observations suggest that the generalization guarantee persists in the long run, providing a complete theoretical analysis of the training dynamics for all time remains an interesting problem for future research.


Q3: How does the choice of metrics affect things? I.e., classification versus regression, similar to [B], (Figures 3, 4) [C], and the consistency vs. calibration discussion in [D]?

A3: Thank you for this insightful question and for pointing out these relevant works. We agree that the choice of loss function and evaluation metric can have a substantial impact on both the optimization dynamics and generalization performance, particularly in overparameterized regimes.

Our main analysis and results focus on the logistic loss for binary classification. While this is a natural and widely studied choice for classification tasks, it is well known (as shown in [B], [C], and [D]) that the training loss, test loss, and implicit bias induced by the loss function may differ significantly between regression and classification objectives. For instance, [B] demonstrates that least-squares and hard-margin SVM can yield the same interpolating solution in linear models, but their test performance may diverge depending on whether regression or classification metrics are used. [D] shows that, for overparameterized logistic regression, early stopping leads to well-calibrated solutions with small excess risk, while running GD to convergence leads to statistical inconsistency.

In our context, the implicit regularization and the dynamics of label noise GD are analyzed specifically for logistic loss, so our theoretical guarantees and empirical findings apply to classification performance under this loss. The behavior may differ under other loss functions or when evaluated with alternative metrics, and analyzing these differences is an important direction for future work. We will cite these works and clarify the scope of our results in the revised manuscript.


Summary: We thank the reviewer again for the thoughtful critique and helpful questions, and hope these clarifications address your concerns.


References

[1] Cao et al. Benign overfitting in two-layer convolutional neural networks. NeurIPS 2022.

[2] Kou et al. Benign overfitting in two-layer ReLU convolutional neural networks. ICML 2023.

[3] Chen et al. Why does sharpness-aware minimization generalize better than SGD? NeurIPS 2023.

[A] Karhadkar, Kedar, Erin George, Michael Murray, Guido F. Montufar, and Deanna Needell. "Benign overfitting in leaky relu networks with moderate input dimension." Advances in Neural Information Processing Systems 37 (2024)

[B] Muthukumar, Vidya, Adhyyan Narang, Vignesh Subramanian, Mikhail Belkin, Daniel Hsu, and Anant Sahai. "Classification vs regression in overparameterized regimes: Does the loss function matter?." Journal of Machine Learning Research 22, no. 222 (2021)

[C] Kausik, C., Srivastava, K., & Sonthalia, R. (2024). Double Descent and Overfitting under Noisy Inputs and Distribution Shift for Linear Denoisers. Transactions on Machine Learning Research.

[D] Wu, Jingfeng, Peter Bartlett, Matus Telgarsky, and Bin Yu. "Benefits of Early Stopping in Gradient Descent for Overparameterized Logistic Regression." arXiv preprint arXiv:2502.13283 (2025).

Comment

I thank the authors for their response. However, I do not follow and am not convinced by some of the responses, given below:

Non-standard assumptions

While I agree with the authors that some assumptions are needed for tractability and that some of their assumptions have been used in prior work, I still believe this is a weakness.

Boundedness of $p$

I do not see an Assumption 4.1 in the main text of the paper. I see a Proposition 4.1 and an Assumption 3.1. Assumption 3.1 states that there is a $C$ such that $0 < p < 1/C$, where $C$ is sufficiently large. This provides an upper bound on $p$ but not a lower bound.

Looking at Proposition 4.1, Theorem 3.3, and Lemma 4.5, I do not see any assumptions about $p$ in the statements.

Hence I would still like the authors to comment on the $p = d^{-d}$ example.

Comment

Thank you for the follow-up questions. We clarify here (i) where the lower bound on the flip rate $p$ enters our analysis, and (ii) how small $p$ can be in practice.

First, regarding our previous response: our reference to “Assumption 4.1” was a typo and we intended to refer to Assumption 3.1 in the main text. Proposition 4.1 does not require a lower bound on $p$, so it technically includes the case $p = 0$. However, Lemma 4.4 (and therefore Theorem 3.3) requires $p$ to be lower bounded in terms of the sample size $n$ and network width $m$. We will update Assumption 3.1 accordingly to make this requirement explicit.

Where does the lower bound on $p$ enter?

The lower bound on $p$ is needed to ensure sufficient concentration of flipped labels during training. As shown in Lemma B.4, in order to guarantee that the number of flipped labels $|\mathcal{S}_{-}|$ remains within $[pt/2, 3pt/2]$ with high probability, we require $p = \tilde{\Omega}(1/\sqrt{t})$, where $t$ is the number of training steps. Here, $\tilde{\Omega}$ hides polylogarithmic factors in $d$, and we use the tilde notation to hide polylogarithmic factors in what follows.

In the proof of Lemma 4.4, we set $t = \tilde{\Theta}(nm / (\eta \sigma_p^2 d))$. Together with the upper bound on $\eta$ from Assumption 3.1, we obtain $p = \tilde{\Omega}(1/\sqrt{mn})$. This is the lower bound required for our theory, and we will explicitly add it to Assumption 3.1, Theorem 3.3, and Lemma 4.4 in the revised version. Typically, we set $m = \tilde{\Theta}(1)$, so the lower bound simplifies to $p = \tilde{\Omega}(1/\sqrt{n})$ in this case. Therefore, extremely small choices of $p$, such as $p = d^{-d}$, fall outside the scope of our theoretical lower bound.
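One way to build intuition for this requirement (a simplified illustration of the concentration step, not the paper's proof) is to model the cumulative flip count over $t$ steps as a Binomial($t$, $p$) variable and check empirically how often it stays inside $[pt/2, 3pt/2]$; for very small $p$ the interval is essentially never hit, mirroring the loss of the regularization effect.

```python
# Illustrative Monte Carlo check (not the paper's argument): if flips occur
# i.i.d. with probability p at each of t steps, the cumulative flip count is
# Binomial(t, p). Estimate how often it stays inside [p*t/2, 3*p*t/2].
import numpy as np

rng = np.random.default_rng(0)
t, trials = 10_000, 20_000

for p in [1e-1, 1e-2, 1e-3, 1e-5]:
    counts = rng.binomial(t, p, size=trials)
    inside = np.mean((counts >= p * t / 2) & (counts <= 3 * p * t / 2))
    print(f"p = {p:g}: fraction of runs with count in [pt/2, 3pt/2] = {inside:.3f}")
```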

Empirical verification

To test the practical effect of small $p$, we conducted experiments with $p = 10^{-5}$ ($n = 200$, $d = 2000$, $m = 20$). We observed that the characteristic training loss oscillations and progressive test accuracy growth still hold for $p$ as small as $10^{-5}$. However, for much smaller $p$, the regularization effect is expected to fade, consistent with our theoretical requirements.

We thank the reviewer for highlighting this important technical point, and we will make these clarifications explicit in the revised manuscript.

Comment

I thank the authors for their clarification.

As such, I think we have reached an agreement that, while the assumptions might be needed for tractability, they are a weakness.

Additionally, the paper needs a revision based on the above discussion.

Based on the positive scores from the other reviewers and the clarifications, I'm willing to increase my score. However, I think the paper is borderline.

Comment

Thank you for your follow-up, for considering our clarifications, and for your willingness to increase the score based on the discussion.

While we understand your view that some of our assumptions may be seen as a weakness, we regard them as natural and widely adopted in the theoretical literature for tractability. Even with these seemingly simple assumptions, the analysis requires substantial technical effort, and we view our results as an important step towards tackling more general and challenging settings in the future.

We sincerely appreciate the time and constructive feedback you have provided throughout the review process.

Review (Rating: 4)

This paper explores how increasing label noise affects training in deep neural networks with SGD. The authors identify a clear three-stage progression: initially, the network learns well (generalizes); then, as noise increases, it starts to memorize the noisy labels (training error stays low but test error gets worse); and finally, in the high-noise regime, performance degrades overall. While this is to be expected, the authors propose a decomposition that lets them separately track the influence of clean and noisy data on the learning dynamics. They support their findings with a series of controlled experiments across datasets and architectures. The authors also provide an interpretation in terms of implicit regularization: as noise increases, SGD is biased away from sharp minima, which affects generalization. While this is more conceptual than formal (there’s no analytical expression for the regularizer), the overall message is clear.

Strengths and Weaknesses

The paper addresses a timely and important question: how does increasing label noise affect training dynamics in neural networks using SGD? A major strength is its originality — the authors introduce a decomposition method that separates the contributions of clean and noisy samples during training, providing new insight into the transition from learning to memorization and eventual degradation. The experiments are thorough, well-controlled, and reproducible, and the three-phase behavior they identify appears robust across datasets and architectures. The writing is clear and the methodology is easy to follow, making the results accessible and compelling. Overall, the work offers both conceptual and practical contributions to our understanding of generalization under label noise.

That said, there are some limitations. While the paper does cite relevant theoretical works on SGD under label noise — including HaoChen et al., Blanc et al., and arXiv:2402.06443 — the connection remains mostly qualitative. The authors could strengthen the paper by more directly engaging with these analyses, either by framing their results in similar terms or exploring where the empirical behavior aligns or diverges. In addition, the proposed decomposition, though intuitive and useful, lacks formal justification: it would be helpful to understand whether it corresponds to any known bias or objective. Finally, the paper could benefit from a broader discussion of related work, including recent advances in robust estimation and simplified theoretical models of learning under noise.

Questions

Can the decomposition framework be given a more formal grounding? While the empirical results are compelling, is there a theoretical interpretation of the decomposition — for instance, does it approximate an optimization bias or relate to any known statistical objective?

Would the observed three-phase behavior still hold under more realistic or structured noise models? The current experiments assume uniform random label noise. It would be useful to understand whether similar transitions emerge under class-dependent or instance-dependent noise.

How robust are the findings to other design choices? For example, would the results hold under different optimizers (like Adam), smaller or larger batch sizes, or more modern architectures like transformers or vision models with pretraining?

Could the authors make the link to theoretical work on implicit regularization more precise? While relevant papers are cited (e.g., HaoChen et al., Blanc et al., arXiv:2402.06443), the discussion remains qualitative. Are the observed degradation regimes consistent with the formal regularizers described in those works?

Is the decomposition method applicable to other learning setups? For example, how would this behave in multi-index models, structured data (e.g., image patches), or low-data regimes?

How does this work differ from robust estimation frameworks in high dimensions, such as [Chien et al., arXiv:2305.18974]? That line of work addresses contamination from a statistical estimation angle. It would be useful to understand the conceptual or empirical boundaries between those approaches and the one proposed here.

Can the authors make the implicit regularization more quantitative? In particular, is there an effective loss function (e.g., involving flatness measures like Hessian trace or spectral norm) that captures the observed transition as label noise increases?

Would the three-stage phenomenon also emerge in synthetic or stylized theoretical settings? For instance, in the kind of setups considered by Villoutreix et al. (NeurIPS 2020) or student–teacher models, do we see a similar transition from learning to memorization to degradation?

Limitations

yes

Final Justification

This paper presents a novel decomposition of SGD dynamics under label noise, rigorously anchored via Cao et al. (NeurIPS ’22), and supported by controlled experiments that reproduce the predicted three-stage progression from learning to memorization to degradation. The rebuttal clarified the theoretical grounding and addressed key questions, strengthening confidence in the contribution. Remaining limitations concern empirical breadth (datasets beyond CIFAR-10, alternative optimizers, structured noise) and the still qualitative link to implicit regularization, but these do not undermine the core advance. Overall, I find the work theoretically solid, timely, and impactful, and I recommend acceptance.

Formatting Concerns

no

Author Response

We thank the reviewer for their thorough and thoughtful assessment, as well as for recognizing the originality and practical relevance of our work. Below, we address the specific questions and suggestions in detail.


W1: While the paper does cite relevant theoretical works on SGD under label noise — including HaoChen et al., Blanc et al., and arXiv:2402.06443 — the connection remains mostly qualitative. The authors could strengthen the paper by more directly engaging with these analyses, either by framing their results in similar terms or exploring where the empirical behavior aligns or diverges.

R1: We agree that a more direct engagement with the theoretical analyses of HaoChen et al. and Blanc et al. would strengthen our paper. These works provide formal arguments showing that label noise (or stochasticity) in SGD induces an implicit bias towards flatter minima, which is believed to enhance generalization. While our primary focus is on the training dynamics and generalization gap in the low SNR regime, our theoretical results are consistent with the idea that label noise GD promotes “flatter” solutions by discouraging overfitting to noise. Although we do not explicitly quantify flatness in our analysis, the boundedness of the noise memorization coefficients and the improved generalization performance may suggest an implicit bias toward flatter minima. We will further discuss this connection in the revision.


W2: In addition, the proposed decomposition, though intuitive and useful, lacks formal justification: it would be helpful to understand whether it corresponds to any known bias or objective.

R2: We would like to clarify that our decomposition method is closely related to, and can be rigorously grounded by, the theoretical results established in Cao et al. [1]. Specifically, the iterative equations and uniqueness results for the signal-noise decomposition presented in their work can be directly applied to our setting. Therefore, our decomposition enjoys a solid theoretical foundation.


Q1: Can the decomposition framework be given a more formal grounding? While the empirical results are compelling, is there a theoretical interpretation of the decomposition — for instance, does it approximate an optimization bias or relate to any known statistical objective?

A1: As also discussed in response to W2, our method is not just intuitive and empirically effective, but can be rigorously justified based on the theoretical results established by Cao et al. [1]. It provides a formal signal-noise decomposition and proves the uniqueness and dynamics of the decomposition coefficients, which apply directly to our setting. In particular, in the high-dimensional regime, the Gaussian noise vectors are nearly orthogonal to each other with high probability, which is a crucial property that ensures the uniqueness and stability of the decomposition. This high-dimensional geometry underlies the effectiveness and theoretical soundness of our approach.
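As an illustration of how such coefficients can be tracked empirically, the sketch below approximates them by projecting a filter onto the signal direction and onto each noise vector. This is a simplified projection-based stand-in, not the exact iterative definition of the coefficients in Cao et al. [1]; all names and constants are placeholders.

```python
# Projection-based approximation of the signal-noise decomposition coefficients.
# In high dimension the noise vectors are nearly orthogonal, so projections are
# close to the formal coefficients; this is illustrative, not the exact definition.
import numpy as np

def decomposition_coeffs(w, mu, xis):
    """Return (signal coefficient, per-sample noise coefficients) for filter w."""
    gamma = float(w @ mu) / float(mu @ mu)               # alignment with the signal
    rhos = (xis @ w) / np.einsum("ij,ij->i", xis, xis)   # alignment with each noise vector
    return gamma, rhos

rng = np.random.default_rng(0)
d, n = 2000, 50
mu = np.zeros(d); mu[0] = 1.0
xis = rng.normal(scale=1.0 / np.sqrt(d), size=(n, d))    # near-orthogonal noise vectors

w = 0.3 * mu + 0.05 * xis[3] + 0.01 * rng.normal(size=d) # toy filter
gamma, rhos = decomposition_coeffs(w, mu, xis)
print("signal coeff:", round(gamma, 3), "| dominant noise index:", int(np.argmax(np.abs(rhos))))
```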


Q2: Would the observed three-phase behavior still hold under more realistic or structured noise models? The current experiments assume uniform random label noise. It would be useful to understand whether similar transitions emerge under class-dependent or instance-dependent noise.

A2: In our experiments (see Appendix G.3), we have evaluated label noise GD under various noise distributions and rates, including uniform and Gaussian label noise. The results consistently show that label noise GD suppresses noise memorization and enhances signal learning across these settings. We agree that it would be interesting to explore more structured noise models, such as class-dependent or instance-dependent noise, and we leave this as a direction for future work.


Q3: How robust are the findings to other design choices? For example, would the results hold under different optimizers (like Adam), smaller or larger batch sizes, or more modern architectures like transformers or vision models with pretraining?

A3: Thank you for the suggestion. Our current experiments are primarily designed to verify the theoretical claims regarding label noise GD within a controlled setting, and we have not systematically evaluated robustness to different optimizers or more modern architectures such as transformers or pre-trained vision models. While these questions are highly relevant and interesting, they are beyond the scope of the present work, which focuses on foundational theory and controlled empirical validation. We agree that further exploration along these lines would be valuable for understanding the broader applicability of our results.


Q4: Could the authors make the link to theoretical work on implicit regularization more precise? While relevant papers are cited (e.g., HaoChen et al., Blanc et al., arXiv:2402.06443), the discussion remains qualitative. Are the observed degradation regimes consistent with the formal regularizers described in those works?

A4: Thank you for the question. In our analysis, we show that, for label noise GD, the algorithm effectively suppresses noise memorization. As a result, the population test loss remains small even in the presence of significant label noise. This behavior contrasts with standard GD, where noise memorization can dominate and lead to a non-vanishing lower bound on test error. Our findings are in line with the theoretical understanding of implicit regularization, as described in HaoChen et al., Blanc et al.


Q5: Is the decomposition method applicable to other learning setups? For example, how would this behave in multi-index models, structured data (e.g., image patches), or low-data regimes?

A5: In this work, our decomposition method can be extended to multi-patch structured cases, and a similar idea could be used to decompose a multi-index model. The approach of separating signal and noise contributions in the learning dynamics remains applicable in these more general settings.


Q6 & W3: How does this work differ from robust estimation frameworks in high dimensions, such as [Chien et al., arXiv:2305.18974]? That line of work addresses contamination from a statistical estimation angle. It would be useful to understand the conceptual or empirical boundaries between those approaches and the one proposed here.

A6: While those approaches address contamination from a classical statistical estimation perspective, our decomposition method provides a complementary view by focusing on the learning dynamics under label noise GD in deep neural networks. Specifically, our work contributes a novel theoretical analysis that explicitly separates the evolution of signal and noise components during training, revealing how label noise affects memorization and generalization in modern neural networks. This dynamic decomposition framework offers new insights into the transition from learning to memorization, which are difficult to capture using traditional robust estimation techniques and other simplified theoretical models of learning under noise.


Q7: Can the authors make the implicit regularization more quantitative? In particular, is there an effective loss function (e.g., involving flatness measures like Hessian trace or spectral norm) that captures the observed transition as label noise increases?

A7: Thank you for this insightful suggestion. Our theoretical results may suggest that label noise influences the optimization trajectory and generalization performance in a manner consistent with a bias toward flatter solutions as label noise prevents the network from overfitting to noisy labels. Developing a more quantitative link between label noise GD with flatness-based measures is an important direction for future work, and we will highlight this point in the revised manuscript.


Q8: Would the three-stage phenomenon also emerge in synthetic or stylized theoretical settings? For instance, in the kind of setups considered by Villoutreix et al. (NeurIPS 2020) or student–teacher models, do we see a similar transition from learning to memorization to degradation?

A8: Our experiments in both synthetic and real-world settings consistently show that label noise GD effectively prevents noise memorization and improves generalization. While we have not yet systematically explored the student–teacher setup, based on the underlying dynamics and related theoretical insights, we expect that a similar phenomenon would also emerge in such models.


Summary: We thank the reviewer again for their careful reading and valuable feedback. We hope our clarifications and additional results address your concerns.


Reference

[1] Yuan Cao, Zixiang Chen, Misha Belkin, and Quanquan Gu. Benign overfitting in two-layer convolutional neural networks. Advances in Neural Information Processing Systems, 35:25237–25250, 2022.

Comment

The authors answered all major technical queries and anchored their decomposition rigorously via Cao et al. (NeurIPS ’22). The martingale control of noise memorisation in the low-SNR regime is, to my knowledge, novel and non-trivial. Theory and controlled experiments are in good agreement, and the work fits well into the NeurIPS theory track.

Nevertheless, empirical breadth (CIFAR-10N, Clothing1M, alternative optimisers) is still promised rather than delivered, and the link to implicit-regularisation theory remains qualitative. A clean statement of the lower bound on the flip rate p is also still missing.

Given the genuine theoretical advance and the generally positive consensus of the other reviews (5, 4, 4, 3), I definitely lean toward (and recommend) acceptance, but because the empirical and explanatory breadth is still narrower than some of us would like, I keep my score at 4.

Comment

Thank you very much for your thoughtful and encouraging feedback. We are glad that your concerns have been addressed, and we appreciate your recognition of the novel and non-trivial contributions of our work.

Regarding the lower bound on the flip rate $p$, we have now provided a clear statement and discussion in our response to Reviewer jMWd.

Thank you again for your valuable comments and recommendation for acceptance.

Review (Rating: 5)

This paper investigates whether introducing label noise in gradient updates can enhance the test performance of neural networks in low signal-to-noise ratio environments. By comparing standard GD with label noise GD, it is demonstrated that adding label noise during the training process can suppress noise memorization. In addition, under low signal-to-noise ratio conditions, the paper establishes a non-vanishing lower bound on the test error of standard GD, thereby demonstrating the benefits of introducing label noise in gradient-based training.

Strengths and Weaknesses

Strengths:

  1. The theoretical support in the paper is sufficient and the proof is rigorous.
  2. Simple and efficient: label noise GD is a lightweight extension of standard gradient descent that does not introduce additional model complexity or training costs.

Weaknesses:

  1. The neural network is relatively simple. Can it generalize to more complex neural networks?
  2. Can other datasets be added for comparison, such as Clothing1M and noisy CIFAR-10N?

Questions

  1. The neural network is relatively simple. Can it generalize to more complex neural networks?
  2. Can other datasets be added for comparison, such as Clothing1M and noisy CIFAR-10N?

Limitations

yes

Final Justification

I have reviewed the other experiments in the Appendix, which provide detailed information on the training of shallow and deep networks. I appreciate the authors' clarification and will maintain my score.

Formatting Concerns

None.

Author Response

We thank the reviewer for the very positive assessment and for highlighting the strengths of our work. We address the main questions below.


W1 & Q1. The neural network is relatively simple. Can it generalize to more complex neural networks?

R1: While our main theoretical analysis focuses on a two-layer convolutional neural network for tractability, we have performed extensive additional experiments to evaluate the generality of label noise GD in more complex and realistic settings. Specifically:

In the main text, we report experiments using a VGG architecture on CIFAR-10, demonstrating that label noise GD is effective at suppressing noise memorization and improving generalization even in large-scale, practical deep networks under the low SNR regime.

Appendix G.1 (Deeper Neural Network): We extend our experiments to deeper neural network architectures. The results show that label noise GD continues to suppress noise memorization and improves generalization in these deeper models.

Appendix G.4 (Higher Order Polynomial ReLU): We also investigate the effect of higher order polynomial ReLU activations. Our findings confirm that the generalization improvement and suppression of noise memorization by label noise GD are maintained across different activation functions.

Together, these results provide strong empirical evidence that the regularization effects of label noise GD are not limited to the specific theoretical setup, but persist in deeper architectures, practical networks like VGG, and across different activation functions. We agree that developing a full theoretical framework for these cases is a valuable direction for future research.


W2 & Q2: Can other datasets be added for comparison, such as Clothing1M and noisy CIFAR-10N?

R2: We appreciate the suggestion to evaluate label noise GD on larger and noisier real-world datasets such as Clothing1M and CIFAR-10N. In our current empirical study, we focused on controlled settings that allow us to systematically manipulate the signal-to-noise ratio (SNR), thereby closely aligning our experiments with the theoretical insights of the paper.

Specifically, our CIFAR-10 experiments are designed to create challenging low-SNR scenarios by injecting noise directly into the high-frequency Fourier components of the images [1]. This mechanism enables us to reduce the effective SNR in a principled and controllable way, simulating conditions where the underlying signal is weak and non-semantic (noisy) features dominate. By doing so, we are able to rigorously test the effectiveness of label noise GD in preventing overfitting to noise and in improving generalization, exactly in the regime that our theoretical analysis targets.
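A minimal sketch of this kind of corruption, reconstructed from the description above (the cutoff radius and noise scale are arbitrary placeholders; the actual corruption parameters follow [1] and are not reproduced here):

```python
# Illustrative sketch: lower the effective SNR of an image by adding Gaussian
# noise only to its high-frequency Fourier components. Cutoff and noise_std
# are placeholders, not the values used in the paper.
import numpy as np

def add_high_freq_noise(img, cutoff=4.0, noise_std=0.5, rng=None):
    """img: 2D array. Adds complex Gaussian noise to frequencies beyond `cutoff`."""
    if rng is None:
        rng = np.random.default_rng()
    F = np.fft.fftshift(np.fft.fft2(img))                  # centered spectrum
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    radius = np.hypot(yy - h / 2, xx - w / 2)
    high = radius > cutoff                                  # high-frequency mask
    noise = rng.normal(scale=noise_std, size=img.shape) \
          + 1j * rng.normal(scale=noise_std, size=img.shape)
    F[high] += noise[high]
    return np.real(np.fft.ifft2(np.fft.ifftshift(F)))

img = np.random.default_rng(0).random((32, 32))             # stand-in for one CIFAR-10 channel
noisy = add_high_freq_noise(img)
print("relative change:", np.linalg.norm(noisy - img) / np.linalg.norm(img))
```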

We would like to note that datasets such as CIFAR-10N represent an alternative but conceptually similar approach for controlling SNR, as they introduce label noise rather than feature noise. Both mechanisms, namely high-frequency corruption in CIFAR-10 and label corruption in CIFAR-10N, provide valuable testbeds for studying generalization in noisy settings. Each method probes different aspects of the learning process.

While our current experiments already highlight the robustness of label noise GD in the presence of controlled feature-space noise, we agree that further evaluation on large-scale, real-world noisy datasets like CIFAR-10N and Clothing1M would be valuable and would further strengthen our claims. We plan to include such experiments in future work, and we expect the regularization benefits of label noise GD to generalize to these more challenging settings.


Summary: We thank the reviewer again for their positive evaluation and helpful suggestions. We hope our responses have addressed your questions.


Reference

[1] Ghorbani, B., Mei, S., Misiakiewicz, T. and Montanari, A., 2020. When do neural networks outperform kernel methods?. Advances in Neural Information Processing Systems, 33, pp.14820-14830.

Review (Rating: 4)

In this paper, the authors study label noise gradient descent and its effect on generalization. They show that under certain data models, this approach can outperform standard gradient descent in low signal-to-noise ratio (SNR) settings. Specifically, standard gradient descent tends to memorize the noise in data, which prevents it from learning the underlying features. In contrast, label noise gradient descent avoids overfitting to noise and is able to eventually learn meaningful features. Experimental results are provided to support the theoretical findings.

Strengths and Weaknesses

Strengths:

  1. Studying the training dynamics of gradient descent in the presence of label noise is an interesting research question.
  2. The use of a supermartingale-based argument to analyze noise memorization is technically interesting and may prove useful for analyzing related problems.
  3. The paper is clearly written and generally easy to follow.

Weaknesses:

  1. The data model considered in the analysis is quite limited. For example, even a linear model can perfectly learn the target function in this setup. It would be more impactful to extend the results to more realistic or challenging models.
  2. See question section below

Questions

  1. In Theorems 3.2 and 3.3, the results state that there exists a time $t$ at which the conclusion holds. Does this guarantee persist for all $t$ larger than that value, or is it specific to some finite time window?
  2. Related to the above, I wonder whether gradient descent achieves good generalization performance in the long run (as $t \to \infty$). Given that GD is known to converge to a max-margin solution in certain settings (according to the implicit bias literature), shouldn't this lead to good test performance eventually as well?
  3. It would be helpful to elaborate more on the connection with prior works that study label noise in local dynamics, especially those suggesting that label noise biases training toward flatter minima. Does the current paper suggest any similar behavior, implicitly or explicitly?
  4. Given the mentioned similarities to previous works (e.g., [8,29]), it would be valuable to clarify the key differences in the analysis that allow this paper to address challenges those works could not. For example, how does this analysis overcome issues like handling $\text{ReLU}^q$ for $q > 2$, or the stronger SNR assumptions required in prior results?

Minor:

  1. Line 306-307: $\iota_i$ should be $\iota_i^{(t)}$?

Limitations

Yes

Final Justification

The authors' response has addressed my concerns. I encourage them to add more discussion on related works. I have increased my score to 4.

Formatting Concerns

No

Author Response

We thank the reviewer for the detailed and thoughtful feedback, as well as for highlighting the strengths of our work. We address the main questions and concerns below.


W1: The data model considered in the analysis is quite limited. For example, even a linear model can perfectly learn the target function in this setup. It would be more impactful to extend the results to more realistic or challenging models.

R1: The signal-noise data model we consider has been extensively studied in the neural network theory literature as a sandbox to understand how gradient-based training prioritizes the learning of the underlying signal over overfitting to noise [8,10,29]. While this model is indeed stylized, it allows for rigorous analysis of non-convex training dynamics that would be intractable in more realistic settings.

To the best of our knowledge, our work is the first to investigate label noise GD in this setting and to demonstrate efficient learning and benign overfitting in the challenging low SNR regime. Moreover, as discussed in Section 4.3, the analysis of label noise GD is already technically nontrivial even in this simplified setup.

We agree that extending these results to more complex data models and architectures would be valuable, and we consider this an interesting direction for future work. In addition, our empirical studies in Appendix G provide supporting evidence that the regularization effects of label noise GD persist in deeper architectures and on more realistic datasets.


Q1: In Theorems 3.2 and 3.3, the results state that there exists a time t at which the conclusion holds. Does this guarantee persist for all t larger than that value, or is it specific to some finite time window?

A1: Regarding Theorem 3.2, the convergence rate is of order $1/\epsilon$, where $\epsilon$ is the upper bound on the training loss. As $\epsilon$ is made arbitrarily small, the required running time increases without bound. As a result, the conclusions regarding the behavior of training and test errors hold with high probability and persist for sufficiently large time steps.

For Theorem 3.3, the theorem describes the transition from a regime where noise memorization dominates to one where signal learning becomes dominant. After this transition, our experiments show that the training loss continues to oscillate and the test error remains consistently small. While these empirical observations suggest that the generalization guarantee persists in the long run, providing a complete theoretical analysis of the training dynamics for all time remains an interesting problem for future research.


Q2: Related to the above, I wonder whether gradient descent achieves good generalization performance in the long run (as $t \to \infty$). Given that GD is known to converge to a max-margin solution in certain settings (according to the implicit bias literature), shouldn't this lead to good test performance eventually as well?

A2: In our low SNR regime and under our signal-noise model, the “max-margin” or implicit bias effect does not ensure good generalization for standard GD, because the signal is too weak relative to the noise. As a result, the solution that minimizes the training loss primarily fits the noise, rather than recovering the true underlying features. This aligns with recent works on harmful overfitting (e.g., [8,29]), which show that, without sufficient SNR or proper regularization, GD may converge to solutions with poor test performance. In contrast, our analysis and experiments show that label noise GD introduces a regularization effect that actively prevents memorization of the noise, enabling the network to focus on the informative signal even in the long run.

In addition, we would like to emphasize that our data setting differs from the classical cases considered in much of the implicit bias literature [S1,S2]. Specifically, our model adopts a multi-patch structure as described in Definition 2.1. As a result, the gradient descent updates in our setting are fundamentally different from those in standard linear or single-patch models. This structural difference plays a key role in our analysis and leads to distinct training dynamics, especially in the low SNR regime. Therefore, the conclusions regarding implicit bias and max-margin behavior do not directly carry over to our setup.


Q3: It would be helpful to elaborate more on the connection with prior works that study label noise in local dynamics, especially those suggesting that label noise biases training toward flatter minima. Does the current paper suggest any similar behavior, implicitly or explicitly?

A3: Thank you for this suggestion. There is indeed a connection between our findings and prior works that relate label noise (or stochasticity) to a bias toward flatter minima (e.g., [14, 34]). While our primary focus is on the training dynamics and generalization gap in the low SNR regime, our theoretical results are consistent with the idea that label noise GD promotes “flatter” solutions by discouraging overfitting to noise. Although we do not explicitly quantify flatness in our analysis, the boundedness of noise memorization coefficients and the improved generalization performance both suggest an implicit bias toward flatter minima. We will further discuss this connection in the revision.


Q4: Given the mentioned similarities to previous works (e.g., [8,29]), it would be valuable to clarify the key differences in the analysis that allow this paper to address challenges those works could not. For example, how does this analysis overcome issues like handling $\text{ReLU}^q$ for $q > 2$, or the stronger SNR assumptions required in prior results?

A4: Thank you for this important question regarding the differences between our work and prior studies such as [8,29]. Our analysis departs from previous approaches in several key aspects:

First, our choice of squared ReLU (2-homogeneous) activation is deliberate. Since our setup does not optimize the second-layer parameters, using squared ReLU allows us to closely mimic the feature learning behavior of training both layers in a standard ReLU network. Squared ReLU is also the minimal integer activation for which both the signal and noise components grow exponentially during training. This property leads to more intricate dynamics, making the analysis notably more challenging than the case of $\text{ReLU}^q$ with $q > 2$, where the separation between signal and noise can be established more directly.

Second, we introduce a martingale-based argument to precisely control the evolution of noise memorization under label noise GD. This technical tool enables us to rigorously establish the persistence of the generalization benefit of label noise GD, even as training proceeds. Such a persistent control of noise memorization has not been achieved in previous analyses.

Finally, our approach also relaxes some of the stronger SNR assumptions required by earlier results such as SAM [10]. By analyzing the low SNR regime with this refined technique, we are able to address cases that those stronger assumptions excluded.


Minor: Line 306-307: $\iota_i$ should be $\iota_i^{(t)}$.

R: Thanks for pointing it out. We will correct the notation for the superscript $(t)$ in the final version.


Summary: We hope these clarifications address your concerns. We appreciate your careful reading and helpful suggestions.


References

[S1] Soudry, D., Hoffer, E., Nacson, M.S., Gunasekar, S. and Srebro, N., 2018. The implicit bias of gradient descent on separable data. Journal of Machine Learning Research, 19(70), pp.1-57.

[S2] Lyu, Kaifeng, and Jian Li. "Gradient descent maximizes the margin of homogeneous neural networks." arXiv preprint arXiv:1906.05890 (2019).

[8] Yuan Cao, Zixiang Chen, Misha Belkin, and Quanquan Gu. Benign overfitting in two-layer convolutional neural networks. Advances in Neural Information Processing Systems, 35:25237–25250, 2022.

[10] Chen et al. Why does sharpness-aware minimization generalize better than sgd? NeurIPS 2023.

[14] Alex Damian, Tengyu Ma, and Jason D Lee. Label noise SGD provably prefers flat global minimizers. Advances in Neural Information Processing Systems, 34:27449–27461, 2021.

[29] Yiwen Kou, Zixiang Chen, Yuanzhou Chen, and Quanquan Gu. Benign overfitting in two-layer ReLU convolutional neural networks. ICML 2023.

[34] Zhiyuan Li, Tianhao Wang, and Sanjeev Arora. What happens after sgd reaches zero loss?–a mathematical framework. arXiv preprint arXiv:2110.06914, 2021.

Comment

Thank you to the authors for the response, which has addressed my concerns. I encourage them to add more discussion on related works. I have increased my score to 4.

Comment

Thank you very much for your thoughtful feedback and for raising your score. We appreciate your suggestion to expand our discussion on related works and will include a more detailed discussion in the revision.

Review (Rating: 4)

This paper investigates the generalization benefits of label noise gradient descent in the context of binary classification with neural networks (NNs). The paper considers training a two-layer neural network with squared ReLU activation and a fixed output layer on synthetic data sampled from a signal-noise data model with a low signal-to-noise ratio (SNR).

When training with standard GD, both the signal and noise components of the model grow exponentially. However, in low SNR regimes, the noise component grows faster, leading the model to minimize the training loss by primarily fitting the noise. This results in a non-negligible test error. In contrast, with label noise GD, the noise component initially grows exponentially but then oscillates and remains bounded. Meanwhile, the signal component continues to grow, allowing the model to learn a solution with a small test error.
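For concreteness, the following is a simplified sketch of the kind of signal-noise, multi-patch data model and two-layer squared-ReLU network with a fixed output layer summarized above. It is an illustrative reconstruction, not the paper's Definition 2.1: the patch structure, scalings, and fixed second-layer weights (set to +/-1/m here, as is common in this literature) are placeholders.

```python
# Simplified two-patch signal-noise data model and a two-layer network with
# squared ReLU and a fixed output layer. All constants are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, n, m = 500, 100, 10                        # dimension, samples, width
mu = np.zeros(d); mu[0] = 2.0                 # signal vector
sigma_p = 1.0                                 # noise standard deviation

y = rng.choice([-1.0, 1.0], size=n)
signal_patch = y[:, None] * mu                # patch 1: y * mu
noise_patch = rng.normal(scale=sigma_p, size=(n, d))
noise_patch[:, 0] = 0.0                       # keep the noise orthogonal to the signal
X = np.stack([signal_patch, noise_patch], axis=1)   # shape (n, 2 patches, d)

W_pos = rng.normal(scale=0.01, size=(m, d))   # filters feeding the +1 output neuron
W_neg = rng.normal(scale=0.01, size=(m, d))   # filters feeding the -1 output neuron

def relu_sq(z):
    return np.maximum(z, 0.0) ** 2

def f(X, W_pos, W_neg):
    """Network output: squared ReLU over filters and patches, fixed +/- (1/m) second layer."""
    pos = relu_sq(np.einsum("npd,kd->npk", X, W_pos)).sum(axis=(1, 2)) / m
    neg = relu_sq(np.einsum("npd,kd->npk", X, W_neg)).sum(axis=(1, 2)) / m
    return pos - neg

print("mean margin y * f(x):", float(np.mean(y * f(X, W_pos, W_neg))))
```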

Strengths and Weaknesses

Strengths:

Label noise GD is well studied theoretically to understand how stochastic algorithms generalize better. The paper presents a simple setting where the advantage of label noise GD can be theoretically established. The data model considered is also well studied for understanding feature learning with NNs, and establishing that label noise GD learns features even in the low SNR regime is an interesting addition to the existing literature.

Weakness:

In my opinion, a weakness of the paper is its reliance on a specific architecture. The phenomena described, such as the exponential growth of signal and noise, and the oscillation of noise, appear to be contingent on the 2-homogeneous property of the activation function. This architectural choice therefore restricts the generalizability of the observed phenomena.

Questions

  1. What happens when the output linear layer is not fixed but trained? More explicitly, $a \cdot \text{ReLU}(x, w)$ is also 2-homogeneous; do the dynamical properties described in the proof sketch still hold? Can the authors provide empirical evidence for this or show how the analysis extends?

  2. A more empirical investigation of how label noise GD prevents fitting the noise component across different architectures and data models would be a valuable addition.

Minor: In equation 3 after line 221, $\xi_i$ is not defined earlier; should it be $x_i$?

Limitations

yes

Final Justification

My initial evaluation of the paper is positive, with some slight concerns regarding the experiments. The authors addressed some limitations regarding experimentation with deeper architectures, different activations, and other datasets. However, the remaining limitations (which the authors empirically addressed) with the choice of activations ($q$-homogeneous for $q > 2$) prevent me from increasing the score further.

Formatting Concerns

no formatting concerns

Author Response

We thank the reviewer for the thoughtful comment and constructive feedback. Below, we address the main concerns and questions raised.


W1: The phenomena described, such as the exponential growth of signal and noise, and the oscillation of noise, appear to be contingent on the 2-homogeneous property of the activation function. This architectural choice therefore restricts the generalizability of the observed phenomena.

R1: We would like to clarify that the 2-homogeneous (squared ReLU) activation function serves as the minimal setting where the benefits of label noise gradient descent can be most clearly established in our analysis. The choice allows us to better characterize the separation between signal and noise learning dynamics.

Importantly, our results are not limited to this particular activation. As presented in Appendix G.4, we have included experimental evidence showing that similar effects occur with higher-order polynomial activations. Specifically, both ReLU³ (Figure 14) and ReLU⁴ (Figure 15) demonstrate the same suppression of noise memorization and improvements in generalization. These observations suggest that the phenomena we describe are not restricted to the 2-homogeneous property, but are present in a broader class of activation functions.


Q1: What happens when the output linear layer is not fixed but trained? More explicitly, $a \cdot \text{ReLU}(w, x)$ is also 2-homogeneous; do the dynamical properties described in the proof sketch still hold? Can the authors provide empirical evidence for this or show how the analysis extends?

A1: In our current analysis, we fix the output layer in order to keep the theoretical treatment analytically tractable. This modeling choice is consistent with previous works in the literature [1,2,3]. Nevertheless, we have conducted preliminary experiments, as reported in Appendix G.1, in which both layers are trained. Although the quantitative details, such as the convergence rate, may change, the key qualitative behaviors remain unchanged.

Specifically, we observe that noise memorization remains bounded and the signal component continues to grow under label noise GD even with deeper neural networks. These results suggest that the regularization effect of label noise GD is not simply a consequence of fixing the output layer. Extending our theoretical framework to fully trainable networks is a promising direction for future research, and we plan to investigate this further.


Q2: A more empirical investigation of how label noise GD prevents fitting the noise component across different architectures and data models would be a valuable addition.

A2: We would like to note that Appendix G of our submission already includes additional experiments on deeper architectures and on datasets with alternative noise structures (such as Gaussian label noise and modified MNIST/CIFAR). Across these settings, we consistently observe that label noise GD prevents harmful overfitting and improves generalization. These empirical results support the generality of our findings, and we will further highlight them in the main paper to increase their visibility.


Minor: In equation 3 after line 221, $\xi_i$ is not defined earlier; should it be $x_i$?

Thanks for pointing it out; the noise component in Definition 2.1 should be denoted as $\xi_i$. We will clarify this notation and make the correction in the revised version.


Summary: We thank the reviewer again for the thoughtful feedback. We hope that our responses and the additional results provided have fully addressed your concerns. If there are any further questions, we would be happy to clarify.


References

[1] Cao et al. Benign overfitting in two-layer convolutional neural networks. NeurIPS 2022.

[2] Kou et al. Benign overfitting in two-layer ReLU convolutional neural networks. ICML 2023.

[3] Chen et al. Why does sharpness-aware minimization generalize better than sgd? NeurIPS 2023.

Comment

I thank the authors for directing my attention to the additional experiments in Appendix G, which detail the training of both shallow and deep networks.

Regarding my previous comment on 2-homogeneous activation functions, I agree with the authors that their analysis extends to higher-order ReLUs and, more broadly, to $q$-homogeneous functions for $q \geq 2$. My concern focused on the case where $q = 1$ and non-homogeneous activations. While the experimental results may still demonstrate a separation in the growth between the signal and noise components, the analytic framework proposed in the paper, which relies on exponential growth, would not be applicable for $q = 1$.

I am grateful for the authors' clarifications and will maintain my positive evaluation of the work.

Comment

Thank you very much for your constructive feedback and for revisiting our additional experiments in Appendix G.

We appreciate your positive evaluation and thoughtful comments on the scope of our analysis. We agree that our current theory is best suited for homogeneous activations, and extending it to non-homogeneous activations is an important direction for future work. Besides, the $q = 1$ case (e.g., standard ReLU) typically requires more trainable layers to observe a clear separation, which we plan to investigate further. We are glad our empirical results indicate that the signal-noise separation may still hold more generally.

Thank you again for your careful review and valuable suggestions.

Final Decision

This paper theoretically studies the use of label noise for improved generalization of gradient descent learning in low signal-to-noise ratio (SNR) settings.

The authors show that the addition of label noise implicitly regularizes the harmful memorization of the noise and effectively improves the learning of the underlying signal in the data. This happens in the low SNR setting where the noise in the data is stronger and therefore limiting its memorization is important for benign overfitting.

The theory considers an analytically tractable setting of binary classification, a corresponding two-layer convolutional neural network with its last layer fixed and squared ReLU activation, and a signal-noise data model. To complement the relatively simple theoretical setting, the authors provide experiments for additional settings (e.g., other activation functions, training both layers of a two-layer network, deeper networks, and datasets based on classes from MNIST and CIFAR-10). As mentioned in the reviews, the experiments can be more extensive (for example, including more optimizers as Reviewer BYFK commented), but if we judge this paper as a theoretical contribution then its empirical depth is sufficient.

While the idea of label-noise GD is not new, the authors provide new analysis and findings that contribute to the theory of benign overfitting and optimization dynamics. In particular, as written by Reviewers ajKN and BYFK, the martingale-based argument for analyzing the noise memorization is a significant contribution.

Therefore the recommendation is to accept this paper.