From Linear to Nonlinear: Provable Weak-to-Strong Generalization through Feature Learning
We theoretically investigate weak-to-strong generalization from a linear CNN to a two-layer ReLU CNN
Abstract
Reviews and Discussion
The paper introduces a theoretical framework for studying weak-to-strong generalization via feature learning. It shows how a two-layer ReLU CNN learns from a linear CNN model and characterizes two regimes depending on the number of samples used to train the stronger model. Empirical simulations match the derived theory.
Strengths and Weaknesses
In terms of strengths, the paper studies weak-to-strong generalization with nonlinear models, in contrast to existing analyses with linear or random feature models. Also, the theoretical analysis appears to be complete in terms of characterizing a transition from generalization to overfitting. For the weaknesses, refer to the Questions section.
Questions
- What is the role of the weak model? It seems the weak model will learn to correctly classify both the easy-only and both-signal samples (with some error) while randomly guessing on the hard-only samples. The training of the stronger model would then be on a dataset with label noise almost exclusively on hard-only samples. Is my understanding correct? If so, could the authors comment on the difference from paper [2], which also considers label noise in its data model?
- How practical is the data distribution, and how likely are the theoretical results to hold in practice? Could the authors provide some experiments in real-world scenarios?
- Could the authors provide more discussion relating their results to existing theoretical analyses of weak-to-strong generalization? Currently the paper only mentions differences in modelling assumptions, without an in-depth discussion of how the derived results show similar or different insights compared to existing works.
- It seems the settings and analysis techniques are based on [1, 2]. It would be good if the authors could comment on the differences in terms of analysis and training dynamics, which could further help understand the insights as well as the technical contributions of the paper (if any).
[1] Benign Overfitting in Two-layer Convolutional Neural Networks.
[2] Benign Overfitting in Two-layer ReLU Convolutional Neural Networks.
Limitations
The discussion of limitations is insufficient. The authors are encouraged to add a separate paragraph outlining the limitations.
Justification for Final Rating
I remain positive about the paper, as most of my concerns were addressed during the rebuttal.
Formatting Concerns
NA
We express our gratitude for your valuable comments. In the following, we address the points raised by the reviewer.
Q1, Q4. Comparison to [1],[2]
Your understanding of the role of the weak model is correct. We would like to highlight the differences between our work and the previous works [1, 2].
- We focus on a training scenario different from the one considered in [1, 2], which train a single type of network. Specifically, we study a two-stage training process: first training a weak model, then training a strong model supervised by the pretrained weak model. This difference in training scenarios leads to several technical differences.
- While [1, 2] consider a single type of signal for each class, our setting involves two different types of signals per class. Moreover, the label flipping probability in [2] is independent of the input, whereas in our setting, it depends on the input because it is introduced by a pretrained weak model. These differences require more careful analysis in our proofs.
- While working on this paper, we also identified technical issues in [2]. Specifically, [2] often uses the sum of activations as the model output, whereas the correct formulation should use the average of activations. To correct this, in our Lemma C.3 (S7), we show that the filter whose initialization has a positive inner product with the noise tends to have a noise coefficient close to the maximum among all filters. This is an analysis not previously presented in [2].
- Condition 4.1 in [2] considers a regime where noise learning dominates signal learning, which corresponds to the data-scarce regime in our work. However, we additionally consider the data-abundant regime and reveal that a different behavior emerges in this setting.
Q2. Practical Insights and Experiments on Real-World Data.
Our data distribution is inspired by real-world characteristics, consisting of label-relevant information (corresponding to the signal in our work) and label-irrelevant information (noise in our work). Moreover, the label-relevant information varies in learning difficulty due to differences in its strength or complexity. The distribution is designed to reflect these properties while maintaining mathematical tractability. We also emphasize that analysis in such simplified settings can still provide valuable insights into the underlying mechanisms.
Our key intuition behind weak-to-strong generalization is as follows: there exists a subset of data containing hard-to-learn information that is nevertheless labeled correctly due to the presence of easy-to-learn information (referred to as easy signals in our setting). This subset of data, termed both-signal data in our work, provides gradient updates that enable the strong model to learn the hard-to-learn information.
We additionally conducted experiments on a simplified real-world dataset to support our findings. Specifically, we slightly modified the MNIST dataset to highlight the roles of key components such as signal and noise. The results demonstrate that our theoretical insights also hold beyond our settings. Detailed experimental settings and results are provided in our responses to Reviewer ABZC’s W1 and Q2.
Q3. Comparison to Previous Works on Theory of Weak-to-Strong Generalization
We would like to include more discussion on some prior works related to the theory of weak-to-strong generalization, in addition to what is already presented in our draft.
The most closely related prior work is [3], which shares a similar intuition with ours—namely, that data containing both easy-to-learn and hard-to-learn information is essential for weak-to-strong generalization. However, their theoretical analysis relies heavily on the abstract framework introduced in [4], and it remains unclear whether gradient-based training can indeed lead to weak-to-strong generalization in their setting.
Similar to our results, [5] demonstrates that weak-to-strong generalization can emerge through benign overfitting. They consider linear models and analyze the minimum-norm interpolating solution, which is attainable via gradient descent. We view this as a step forward from analyses based purely on abstract frameworks. However, this work still relies on the assumption that the weak and strong models have access to different sets of features, an assumption we consider somewhat unnatural. Moreover, this work does not consider scenarios with a large amount of data, which we explicitly analyze as the data-abundant regime in our work.
[6] considers a more natural setting, where training is performed using gradient-based methods and the weak/strong models are chosen as random feature models with different numbers of hidden nodes. However, they analyze training on the population loss, where early stopping becomes crucial, as overtraining can lead the strong model to produce outputs similar to those of the weak model on the population. In contrast, our analysis is based on finite-sample training and explicitly reveals the role of early stopping in the finite but data-abundant regime.
We believe that our contribution is essential in this line of work, as it makes the theoretical analysis more aligned with realistic training settings.
Limitations
Thank you for your suggestions. We plan to discuss our limitations more explicitly in the next revision.
Thank you very much for your time. If there is anything that requires further clarification, please do let us know.
Best regards,
Authors
References
[1] Cao et al. Benign Overfitting in Two-layer Convolutional Neural Networks. NeurIPS 2022
[2] Kou et al. Benign Overfitting in Two-layer ReLU Convolutional Neural Networks. ICML 2023
[3] Shin et al. Weak-to-Strong Generalization through the Data-Centric Lens. ICLR 2025
[4] Lang et al. Theoretical Analysis of Weak-to-Strong Generalization. NeurIPS 2024
[5] Wu & Sahai. Provable Weak-to-Strong Generalization via Benign Overfitting. ICLR 2025
[6] Medvedev et al. Weak-to-Strong Generalization Even in Random Feature Networks, Provably. ICML 2025
I thank the authors for the reply. Most of my concerns are addressed and I maintain my positive rating.
Thank you for your response. We are glad to hear that our response was helpful and resolved your concerns. If you have any further questions, please feel free to ask.
Best regards,
Authors
This paper provides a theoretical analysis for weak-to-strong generalization in binary classification based on special structured data consisting of signal and noise patches (a variation of the multi-view dataset introduced in Allen-Zhu and Li (2020)). The authors take a linear CNN as the weak model and a two-layer ReLU CNN as the strong model. The main theoretical findings are the following:
- A separation result on the generalization of the weak and strong models on the special binary classification task, showing that there exists some parametrization of the strong model that achieves zero test error, whereas the weak model cannot achieve zero test error.
- A generalization bound for the supervised training of the weak model with gradient descent, providing a sample complexity for the weak model to achieve a close-to-optimal test error on the binary classification task, stated in terms of the variance of the noise patches, the probabilities of the easy-only and mixed easy-and-hard patches, the feature dimension, and the easy-feature norm.
- A phase transition analysis for weak-to-strong generalization with respect to the weak-to-strong training sample size in the "data-scarce regime", showing that when the strong model is trained with weak supervision, it can either generalize in the case of benign overfitting or fail to generalize in the case of harmful overfitting.
- An analysis of the "data-abundant regime", showing that weak-to-strong generalization can arise via early stopping but vanishes when the strong model is overtrained with weak supervision.
Strengths and Weaknesses
Strengths
- While mainly leveraging existing problem setup and technical tools, this paper provides a fresh non-linear feature learning analysis of weak-to-strong generalization, which offers new angles to understand weak-to-strong generalization.
- The setting and main takeaways from the theoretical analysis are clearly organized and presented.
- The key theoretical insights outlined in Section 4 are helpful for understanding the intuitions behind the theoretical results and the implications of the results for weak-to-strong generalization.
Weaknesses
- Some detailed settings and statements seem vague and could be better clarified. See the "Questions" section below for more details.
- The empirical evidence of this paper seems limited, even given its theoretical nature. Section 5 only provides a single set of experiments on the synthetic binary classification task and does not investigate the scaling of weak-to-strong generalization with respect to the sample size, which seems essential for demonstrating the phase transition in the "data-scarce" regime (see the second point in the "Questions" section below). That is to say, I don't think strong experiments on real data and models are necessary for this work, given its theoretical nature. But given the (possibly inevitable) limitations of some theoretical statements in the paper (e.g., Theorem 3.4, see the second point in the "Questions" section below), it would be helpful to at least provide thorough simulations on the synthetic binary classification task.
- A potential limitation inherent in the problem setup of this work is that the capacity gap between the weak and strong models comes entirely from the architectures of the weak and strong models with respect to the special structured data. This is fairly different from weak-to-strong generalization, or the superalignment problem that it aims to address, in practice. The capacity gaps between large foundation models are often due to many other factors beyond the architecture (e.g., pretraining data and methods) and are shared across various tasks, more like differences in the quality of their representations. While I think the learning-dynamics analysis of weak-to-strong generalization from linear to non-linear models on special structured data is theoretically interesting and valuable, I subjectively believe that it has a considerable gap from the practical weak-to-strong generalization problem. I think such a gap should be carefully discussed, maybe as a part of the conclusion/limitations at the end.
Questions
- It is unclear from the setup in Section 2.1 how the "easy" and "hard" signals are defined and distinguished for various statements.
- From the proof of Proposition 2.2, the mixture of signs where the hard signals are sampled uniformly seems to be the only key difference between the easy and hard signals that leads to the separation of the algorithm-agnostic capacity gap between the weak and strong models (which I think should be highlighted explicitly).
- However, additional assumptions on the sampling probabilities and signal strength are needed (e.g., Condition (C5)) for weak-to-strong generalization when analyzing the training dynamics of GD. The current organization of these assumptions is somewhat confusing. It may be helpful to partition the assumptions in Condition 3.1 into several parts. For example, C5 and C6 may be separated as a stand-alone additional assumption on the easy vs. hard signals for the training-dynamics analysis.
- The phase transition in the "data-scarce" regime with respect to the sample size is not very clear from the statement of Theorem 3.4. In particular, Theorem 3.4 gives one condition on the sample size under which benign overfitting occurs and another under which harmful overfitting occurs. Here, the upper and lower bounds are asymptotically of the same order, and there is no discussion of the constants showing that this indeed leads to a phase transition. The simulation in Section 5 shows one synthetic example with three seemingly arbitrary sample sizes, without investigating the scaling of weak-to-strong generalization with respect to the sample size that could demonstrate the phase transition.
Limitations
Major limitations are discussed in the weaknesses section. Here are some minor points:
- The explanations for some assumptions (specifically, C1–C4) in Condition 3.1 are vague.
- Section 4 may be partitioned into subsections of "data-scarce" and "data-abundant" regimes to be more organized.
- The meaning of "filters" in line 312 is not clear. The explanation in lines 310-312 is not very clear either.
Justification for Final Rating
I would like to thank the authors for their efforts to address my questions and concerns. I will maintain my positive evaluation.
Formatting Concerns
There are no standing-out formatting concerns.
We express our gratitude for your valuable comments. In the following, we address the points raised by the reviewer.
W2, Q2. Experiments on Synthetic Data
Admittedly, a precise calculation of the constants in the two thresholds is complex; our analysis focuses on an asymptotic regime where parameters such as the number of data points and the dimension are sufficiently large. In this regime, the upper and lower bounds for the transition match asymptotically up to constants, creating a sharp threshold. In the small-sample settings of our experiments, this transition appears gradual rather than as a sharp phase transition. However, by varying the sample size n and the dimension d, we can still observe and verify the transition dynamics predicted by our theoretical analysis.
We ran new experiments with the same setup as described in the paper, while varying the sample size n and the dimension d. The same pretrained weak model was used to generate pseudo-labels across all runs within the same seed. We ran experiments for 5 different random seeds. Below is the average test accuracy with standard deviation across seeds.
As shown in the table, the test accuracy of the weak-to-strong training generally improves as n increases for both dimensions. This demonstrates the transition in the data-scarce regime. The results also suggest that the transition point scales with the dimension d, as a larger n is required to achieve similar performance when d is doubled, which aligns with our theoretical findings.
| n (dimension d; weak test acc: 0.8508) | weak-to-strong test acc | n (dimension 2d; weak test acc: 0.8494) | weak-to-strong test acc |
|---|---|---|---|
| 75 | 0.8829 ± 0.0448 | 150 | 0.8566 ± 0.0274 |
| 150 | 0.8830 ± 0.0236 | 300 | 0.8888 ± 0.0267 |
| 225 | 0.9022 ± 0.0204 | 450 | 0.8963 ± 0.0172 |
| 300 | 0.9013 ± 0.0310 | 600 | 0.9019 ± 0.0249 |
| 375 | 0.8831 ± 0.0428 | 750 | 0.8947 ± 0.0099 |
| 450 | 0.8970 ± 0.0260 | 900 | 0.9054 ± 0.0178 |
| 525 | 0.9135 ± 0.0124 | 1050 | 0.8845 ± 0.0145 |
| 600 | 0.9022 ± 0.0281 | 1200 | 0.8846 ± 0.0317 |
| 675 | 0.9126 ± 0.0200 | 1350 | 0.9128 ± 0.0124 |
| 750 | 0.9268 ± 0.0173 | 1500 | 0.9036 ± 0.0106 |
| 825 | 0.9122 ± 0.0119 | 1650 | 0.9112 ± 0.0277 |
| 900 | 0.9228 ± 0.0044 | 1800 | 0.9001 ± 0.0253 |
| 975 | 0.9053 ± 0.0115 | 1950 | 0.9079 ± 0.0132 |
| 1050 | 0.9225 ± 0.0208 | 2100 | 0.9168 ± 0.0166 |
| 1125 | 0.9254 ± 0.0191 | 2250 | 0.9205 ± 0.0254 |
| 1200 | 0.9340 ± 0.0141 | 2400 | 0.9094 ± 0.0105 |
| 1275 | 0.9176 ± 0.0293 | 2550 | 0.9275 ± 0.0194 |
| 1350 | 0.9082 ± 0.0169 | 2700 | 0.9093 ± 0.0210 |
| 1425 | 0.9223 ± 0.0148 | 2850 | 0.9066 ± 0.0228 |
| 1500 | 0.9121 ± 0.0188 | 3000 | 0.9132 ± 0.0255 |
| 1575 | 0.9205 ± 0.0150 | 3150 | 0.9096 ± 0.0277 |
| 1650 | 0.9310 ± 0.0176 | 3300 | 0.9263 ± 0.0123 |
| 1725 | 0.9262 ± 0.0138 | 3450 | 0.9197 ± 0.0157 |
| 1800 | 0.9380 ± 0.0250 | 3600 | 0.9280 ± 0.0161 |
| 1875 | 0.9148 ± 0.0250 | 3750 | 0.9343 ± 0.0144 |
| 1950 | 0.9226 ± 0.0156 | 3900 | 0.9253 ± 0.0096 |
| 2025 | 0.9230 ± 0.0221 | 4050 | 0.9169 ± 0.0081 |
| 2100 | 0.9245 ± 0.0298 | 4200 | 0.9224 ± 0.0088 |
| 2175 | 0.9248 ± 0.0143 | 4350 | 0.9133 ± 0.0136 |
| 2250 | 0.9304 ± 0.0118 | 4500 | 0.9123 ± 0.0130 |
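For clarity, here is a minimal sketch of the sweep protocol used above; `make_data`, `train_weak`, `train_strong`, and `n_weak` are illustrative placeholders rather than our actual code.

```python
# Hedged sketch of the sweep: one weak model per seed generates pseudo-labels,
# and the strong model is retrained from scratch for each sample size n.
import numpy as np

def sweep(make_data, train_weak, train_strong, ns, d, n_weak, seeds=range(5)):
    accs = {n: [] for n in ns}
    for seed in seeds:
        rng = np.random.default_rng(seed)
        X, y, X_test, y_test = make_data(rng, d)       # synthetic task of dimension d
        weak = train_weak(X[:n_weak], y[:n_weak])      # supervised weak training
        pseudo = np.sign(weak(X))                      # one set of pseudo-labels per seed
        for n in ns:
            strong = train_strong(X[:n], pseudo[:n])   # weak-to-strong training
            accs[n].append(np.mean(np.sign(strong(X_test)) == y_test))
    return {n: (np.mean(a), np.std(a)) for n, a in accs.items()}
```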
W3. Architectural Differences Between Weak and Strong Models
It is true that architectural differences are not the only factors contributing to the differing capabilities of weak and strong models. However, we believe that one key difference lies in expressive power, as exemplified by the weak-to-strong training experiment from GPT-2 to GPT-4 presented in [1]. While we focus on architectural differences in this work, we believe that the core intuition behind our analysis can be extended to more general scenarios, and we will add this discussion to the Conclusion section as you suggested.
Q1. Conditions on Easy and Hard Signal
We agree with the reviewer that the conditions on easy and hard signals appear in multiple places in the draft and may cause confusion. In Section 2.1, line 116, we mentioned that easy and hard signals differ in their learning difficulty.
As you mentioned, the XOR structure of the hard signals creates a capacity gap between our weak and strong models. We agree that this point should be mentioned earlier, and we will explicitly add it around line 116.
In addition, as you mentioned, Conditions (C5) and (C6) are related to the distinction between easy and hard signals and to the weak-to-strong learning scenario, which may differ in nature from Conditions (C1)–(C4). We grouped them together for compactness, but we will consider restructuring this part to improve clarity.
Limitations. Suggestions for Clarification
Thank you for your suggestions for improving our paper. We will address the points in the limitations part one by one, following the order in which you raised them.
- We would like to provide further explanation of Condition 3.1. In (C1) and (C2), we assume a sufficiently large dimension and large sample sizes. These assumptions allow us to apply concentration inequalities to quantities such as norms and correlations involving Gaussian noise, as well as the number of each type of data. In (C3) and (C4), we assume a sufficiently small initialization scale and learning rate. These assumptions ensure that the effect of initialization can be neglected compared to the gradient descent updates, and that each update step remains small, which enables stable training dynamics. Lastly, we would like to additionally clarify that the motivation behind (C5) is that the difficulty of learning a signal is determined by its frequency and strength.
- In our draft, we didn’t separate the discussion of the two regimes into distinct subsections, given that they share overlapping insights. Instead, we plan to add clearer transitional markers for the reader’s convenience in the next revision.
- In line 312, "filters" refers to the first-layer weight vectors defined in Definition 3. We would like to provide a more detailed explanation of the discussion in lines 310–312. For a filter that is initially positively aligned with the signal, the first term in the block equation between lines 304 and 305 is positive, while the second term is zero. As a result, the inner product between this filter and the signal increases over time. In contrast, for a filter that is initially negatively aligned with the signal, the first term becomes zero, and the second term is negative. This leads to a decrease in the inner product. This contrasting behavior between positively and negatively aligned filters allows the strong model to learn both signal directions.
We will include these in the next revision.
Thanks for your time and efforts. Please let us know if you have further questions.
Best regards,
Authors
References
[1] Burns et al. Weak-to-Strong Generalization: Eliciting Strong Capabilities with Weak Supervision. ICML 2024
I would like to thank the authors for their efforts to address my questions and concerns. I will maintain my positive evaluation.
Thank you for your response. We are glad to hear that our responses were helpful and adequately addressed your concerns and questions. We would also be happy to hear if you have any additional thoughts or suggestions.
Best regards,
Authors
This paper provides a theoretical and empirical analysis of weak-to-strong generalization, where a stronger model trained under supervision from a weaker model can outperform the teacher. This work fills a gap by analyzing a special setting: weak-to-strong training from a linear CNN to a two-layer ReLU CNN on patch data.
Strengths and Weaknesses
Strengths:
- The paper provides a detailed and mathematically sound characterization of weak-to-strong generalization in both data-scarce and data-abundant regimes, including tight bounds and transition conditions between benign and harmful overfitting.
Weaknesses:
- The data distribution is highly engineered, with orthogonal easy/hard signal vectors and shuffled patch structures. While analytically convenient, it lacks realism and limits the applicability of the results to real-world tasks.
- Much of the analytical machinery (e.g., signal-noise decomposition, patch-wise CNN analysis, benign overfitting dynamics) closely follows prior work. The contribution lies more in adapting known tools to a new but contrived setup, rather than developing fundamentally new techniques.
- All experiments are conducted on synthetic data tailored to the theoretical setup. Even simple real-data tests (e.g., CIFAR with weak supervision) would help assess whether the observed phenomena persist beyond the toy model.
Questions
- Why does the data distribution assign only the positive direction for easy signals, while it allows both directions for hard signals? In Definition 1, the easy signals always take the fixed positive direction, whereas the hard signals are drawn uniformly from all sign combinations. This introduces a structural asymmetry. I understand that this may be designed to ensure the weak model can reliably learn easy signals while remaining ineffective on hard signals, but I would appreciate clarification on whether this choice is essential to the theoretical guarantees, or whether the same qualitative conclusions would hold under a more symmetric signal setup (e.g., also randomizing the sign of the easy signal).
- How sensitive is the theory to the assumption that the weak model labels are not adversarial, especially in the data-abundant regime? Can small adversarial label flips cause failure of weak-to-strong generalization?
Limitations
While the paper presents a rigorous analysis, its setting is highly stylized and somewhat artificial. The data distribution is carefully engineered to contain label-aligned "easy" and "hard" signals with hand-picked orthogonality, which may not reflect the structure or complexity of real-world data. As a result, it is unclear how broadly the insights or guarantees would transfer to practical scenarios, such as vision or NLP tasks. Moreover, much of the technical analysis builds directly on tools and mechanisms developed in prior work on benign overfitting and feature learning (e.g., Cao et al. 2022; Kou et al. 2023), such as signal-noise decompositions and patch-wise convolutional structure. While the application to weak-to-strong generalization is interesting, the paper feels more like a thoughtful reuse of existing techniques in a contrived setting tailored to fit them, rather than a technically novel contribution in its own right. A more ambitious advance would involve analyzing weak-to-strong generalization in less restrictive or more realistic settings.
Justification for Final Rating
This paper offers a clean and technically solid analysis of weak-to-strong generalization in a synthetic setting. While its scope is narrow and some assumptions are idealized, the corrected proof and added insights meaningfully contribute to the theoretical understanding of the phenomenon. I recommend a weak accept based on its rigor, the technical fix to prior work, and the relevance of the topic to current interest in model distillation and student-teacher dynamics.
Formatting Concerns
No.
We would like to express our appreciation for your valuable comments. In the following, we address the points raised by the reviewer.
W1, L1. Our Simple Problem Settings
We acknowledge that our problem setting is simplified to enable rigorous theoretical analysis. In particular, assumptions such as the orthogonality of signal vectors make the learning dynamics mathematically tractable. However, similar simplifications are commonly employed in the recent feature-learning theory literature. This line of work utilizes similar settings to gain theoretical understanding of various phenomena, such as how model ensembling improves generalization [1], how benign overfitting arises [2,3], how different optimizers affect generalization [4–6], how architectural choices influence generalization [7,8], and how data augmentation contributes to generalization [9–11].
While our assumptions may not hold in practical settings, we believe that they still offer valuable insights into the mechanisms behind weak-to-strong generalization. We would like to offer a more intuitive, high-level explanation for our insights presented in Section 4. A weak model may fail to generalize well because it cannot learn certain types of information that are difficult to capture, what we refer to as hard signals. Now consider training a stronger model—one that is strong enough to learn these hard signals—under the supervision of the weak model. In this setting, the strong model learns from pseudo-labeled data, and its success depends on how often the weak model assigns correct labels.
The key idea is this: there exists a subset of training data that contains both easy and hard signals. Thanks to the easy signals, the weak model is likely to label these examples correctly, even though it cannot learn the hard signals on its own. These correctly labeled examples—what we call both-signal data—provide useful gradient updates that guide the strong model to learn the hard signals. Although the framework we propose is simple, we believe it captures essential intuition that can be extended to more general settings. Also, in our response to W3 (see below), we have empirically shown that our insights can be applied to real-world data.
W3. Experiments on Real World Data
We additionally conducted experiments on a simplified real-world dataset to support our findings. Specifically, we slightly modified the MNIST dataset to highlight the roles of key components such as signal and noise. The results demonstrate that our theoretical insights also hold beyond our settings. Detailed experimental settings and results are provided in our responses to Reviewer ABZC’s W1 and Q2.
W2, L2. Novel Contributions
Even though we use techniques from [3], we would like to highlight some key differences between [3] and our work.
- While [3] considers a single type of signal for each class, our setting involves two different types of signals per class. Moreover, the label flipping probability in [3] is independent of the input, whereas in our setting, it depends on the input because it is introduced by a pretrained weak model. These differences require more careful analysis in our proofs.
- While working on this paper, we also identified technical issues in [3]. Specifically, [3] often uses the sum of activations as the model output, whereas the correct formulation should use the average of activations. To correct this, in our Lemma C.3 (S7), we show that the filter whose initialization has a positive inner product with the noise tends to have a noise coefficient close to the maximum among all filters. This is an analysis not previously presented in [3].
- Condition 4.1 in [3] considers a regime where noise learning dominates signal learning, which corresponds to the data-scarce regime in our work. However, we additionally consider the data-abundant regime and reveal that a different behavior emerges in this setting.
Moreover, we would like to emphasize that our work not only advances techniques from the feature learning literature, but also addresses several limitations of prior theoretical studies on weak-to-strong generalization by incorporating the following elements: consideration of gradient-based training dynamics, a more natural definition of weak and strong models, analysis based on finite samples rather than population loss, and identification of distinct data regimes. We provide further discussion on these points in our response to Reviewer yKoW’s Q3.
Q1. Choice of Asymmetric Signals
Your understanding is correct. Our asymmetric choice of signal directions is intended to highlight the fundamental gap between the weak and strong models. We can also consider a symmetric case where both easy and hard signals involve no sign flips. However, in such a setting, we believe that a linear CNN and a ReLU CNN would require asymptotically the same sample complexity for generalization. As a result, it would be difficult to argue that the two models have different capabilities. In such cases, we believe that changing the architectures of the weak and strong models can lead to similar conclusions. For example, one could adopt a simple CNN as the weak model and a simple ViT as the strong model, whose learning dynamics and benign overfitting behaviors have been studied in [2] and [8]. As noted in [8], a simple ViT has a superior ability to learn signals compared to a simple CNN. Therefore, if we consider a regime where easy signals are learnable by both architectures while hard signals lie in the gap between the two, our theoretical insights discussed above may still apply.
However, we believe that our current simplified setting is more effective in delivering clear and interpretable insights.
Q2. Assumption on Supervising Weak Model
We believe there may be a misunderstanding regarding our assumption about supervising the weak model. We do not assume that the weak model's labels are non-adversarial; we only assume that the supervising weak model satisfies certain error bounds. We would like to note that our results also extend to supervisors that assign labels adversarially, as long as they remain within the assumed error budget, since the only requirement is that the weak model satisfies the specified error bounds.
Thanks for your time and consideration. Let us know if you have remaining questions; we are happy to discuss more.
Best regards,
Authors
References
[1] Allen-Zhu & Li. Toward Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning. ICLR 2023
[2] Cao et al. Benign Overfitting in Two-layer Convolutional Neural Networks. NeurIPS 2022
[3] Kou et al. Benign Overfitting in Two-layer ReLU Convolutional Neural Networks. ICML 2023
[4] Zou et al. Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization. ICLR 2023
[5] Jelassi et al. Towards Understanding How Momentum Improves Generalization in Deep Learning. ICML 2022
[6] Chen et al. Why Does Sharpness-Aware Minimization Generalize Better than SGD. NeurIPS 2023
[7] Huang et al. Quantifying the Optimization and Generalization Advantages of Graph Neural Networks Over Multilayer Perceptrons. AISTATS 2025
[8] Jiang et al. Unveil Benign Overfitting for Transformer in Vision: Training Dynamics, Convergence, and Generalization. NeurIPS 2024
[9] Shen et al. Data Augmentation as Feature Manipulation. ICML 2022
[10] Zou et al. The Benefit of Mixup for Feature Learning. ICML 2023
[11] Oh & Yun. Provable Benefit of Cutout and CutMix for Feature Learning. NeurIPS 2024
Thank you for the detailed rebuttal. Regarding your comment that reference [3] contains a technical issue due to using the sum of activations rather than the average: upon reviewing [3], the model appears to be defined using the average over filters, not the sum. For instance, in Section 3 of [3], the network output is explicitly defined with a normalization over the number of filters.
Since your rebuttal presents the identification of this issue as part of your paper’s technical novelty, I encourage you to elaborate more clearly on this point. If you believe there is a specific step or equation in [3] where the use of the sum leads to a technical flaw or inconsistency, please point to the exact location. As it stands, this appears to be a modeling difference rather than a flaw, and the current claim may be based on a misreading.
Clarifying this point would help evaluate the significance and correctness of the contribution.
Thank you for your response. We would like to clarify the point regarding technical flaws in [3].
As you mentioned, [3] considers the average of activations (which aligns with our strong model), not the sum. However, in their proof, they frequently alternate between using the average and the sum, which leads to technical inconsistencies. While many of these can be resolved by reinstating the missing averaging factor over the filters, there are also more substantial issues. For example, in Proposition C.2, they aim to show by induction that the noise coefficients are upper bounded by a certain quantity. To do so, they decompose the coefficient into several terms and attempt to bound each one individually (as in Equation (C.23)). However, when bounding one of these terms (middle block equation on page 28), a required factor is missing in the first line. If we naively correct this by inserting the factor, the final bound on the noise coefficients becomes larger than the one stated, which would require a stronger assumption on the problem parameters. To address this, we introduced an additional technical result (Lemma C.3 (S7)), which is not covered in [3].
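For concreteness, the averaged two-layer ReLU CNN output has the generic form below (our notation; the exact patch structure differs between [3] and our paper). Any step of the analysis that silently swaps the average over the m filters for the bare sum rescales the corresponding coefficients by a factor of m:

$$
f(\mathbf{W}, \mathbf{x}) = \frac{1}{m}\sum_{r=1}^{m}\sum_{p=1}^{P} \sigma\big(\langle \mathbf{w}_{+1,r}, \mathbf{x}^{(p)} \rangle\big) - \frac{1}{m}\sum_{r=1}^{m}\sum_{p=1}^{P} \sigma\big(\langle \mathbf{w}_{-1,r}, \mathbf{x}^{(p)} \rangle\big), \qquad \sigma(z) = \max(0, z).
$$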
We hope this clarification is helpful. If you need any further discussion, feel free to reach out.
Best regards,
Authors
Thank you for your detailed clarification. I have carefully reviewed the point you raised regarding Proposition C.2 in [3], and I agree with your observation. I appreciate that your Lemma C.3 (S7) explicitly resolves this issue. Given that this correction is not merely cosmetic but addresses a gap in the inductive proof, I now view this aspect of your work as a meaningful technical contribution. As such, I am willing to revise my score to 4 (Weak Accept).
Thank you for your response and for reconsidering the score. We are glad to hear that our response was helpful. We would also be happy to hear if you have any additional comments.
Best regards,
Authors
This paper attempts to find a simple setting in which weak-to-strong generalization provably occurs. Their weak learner is a linear network with 3 copied weights (what they call a linear CNN). Their strong learner is a one-hidden-layer ReLU CNN in which the second-layer weights are set to one and only the first-layer weights learn. The ReLU CNN has three types of hidden units in correspondence with three patches in the data. Their model of data is a very contrived distribution consisting of 3 different patches (matched to the architectures of both the linear and the one-hidden-layer CNN) with a set of easy and hard signals in each patch, or noise. They ask: if the weak learner first learns on actual data, and the strong learner then learns on data labeled by the weak learner, when will the strong learner outperform the weak learner?
The types of results they can prove relevant to this are:
- In a data-scarce regime (not much data but above a threshold), the weak learner performs as well as expected, and the strong learner can perform better than the weak learner due to benign overfitting, in which it does not overfit to the mistakes of the weak learner.
- In a data-abundant regime, the strong learner provably exhibits weak-to-strong generalization if its training is stopped before an early stopping time. If the strong learner trains for too long, it could potentially overfit to the mistakes of the weak learner (though no theoretical guarantees are derived at the convergence point of strong learning); early stopping of the strong learner provably prevents this.
- They have simulations in which they train the strong learner for longer than the early stopping time and find performance degrades then plateaus. No theory is provided for this.
Strengths and Weaknesses
Strengths:
- The paper provides a simple example in which weak to strong generalization is provable.
- Some simulations bear out the predictions of the theory, and also reveal other phenomena at late times in learning that are beyond the reach of this theory.
Weakness
- A major weakness is the highly contrived nature of the data and its match to the architecture. The data comes in 3 patches, and the architecture comes with hidden units (or linear units) with receptive fields matched exactly to one of the three patches. What goes into each patch has a strange combinatorial nature out of a finite set of objects (fixed orthogonal vectors). Only first layer weights learn; second layer weights are fixed. This strong match between data and architecture is unrealistic. It seems that the analysis is tied to this match. It is less clear what one can conclude about how the structure of realistic data impacts weak to strong generalization, and what is going on when there is not such contrived match, or such a toy dataset. It is unclear how their derivations could be generalized.
- Less of a concern but an area for improvement in exposition: the authors need 6 conditions C1-C6 in section 3. Some of them are well explained intuitively whereas others are not. More discussion of these conditions and their roles would help.
Questions
What are the prospects of generalizing your theory methods to more general structures in data and lack of match to architectures?
Limitations
Yes. But further discussion on the limitations of their data assumptions would be helpful.
Justification for Final Rating
While the setting in which the theorems are proved are quite simple, the phenomenon of weak to strong generalization is interesting enough that having a simple setting in which it provably occurs may be valuable, and the intuitions provided in the rebuttal as to what we can learn in general from this simple setting are helpful.
Formatting Concerns
None.
We would like to express our gratitude for your valuable comments. In the following, we address the points raised by the reviewer.
W1, Q. Extensions and Practical Insights Based on Our Theory
We agree with the reviewer that our theoretical framework may look overly simplified compared to realistic settings. As described in lines 82–90, our work builds on the recent line of research in feature-learning theory, where the adopted settings allow for mathematically tractable analysis of learning dynamics. While our setting may appear simplified, similar assumptions have been widely used to gain theoretical understanding of various phenomena, such as how model ensembling improves generalization [1], how benign overfitting arises [2,3], how different optimizers affect generalization [4–6], how architectural choices influence generalization [7,8], and how data augmentation contributes to generalization [9–11].
Here, we provide a more high-level explanation of the intuition behind the analysis presented in Section 4. A weak model may sometimes exhibit low performance because it fails to learn certain hard-to-learn information (referred to as hard signals in our setting). Now, consider training a strong model—strong enough to capture the hard-to-learn information that the weak model fails to learn—under the supervision of the weak model. The strong model can only learn useful information from correctly pseudo-labeled data.
Our key intuition behind how the strong model can outperform the weak model is as follows: there exists a subset of data that contains hard-to-learn information but is still labeled correctly due to the presence of easy-to-learn information (easy signals in our setting). This subset of data (termed both-signal data) provides gradient updates that enable the strong model to learn the hard-to-learn information.
Our theoretical framework may seem simple at first glance, but analyzing its training dynamics is technically involved, and the formal proof spans 50 pages in the supplementary material. Also, the intuition from this setting can help us understand more general scenarios. One such example is early stopping, an often-adopted practical choice for weak-to-strong generalization. The discussion in lines 313–322 explains the role of this strategy.
Regarding our three-patch data setting, while it could in principle be extended to incorporate more patches, we deliberately choose the minimal number (three) that is sufficient to capture all essential types of signal and noise vectors defined in our framework, in order to maintain mathematical simplicity. In addition, we would like to note that our analysis can be potentially extended to more general architectures—for example, by adopting a simple CNN as the weak model and a simple ViT as the strong model, whose learning dynamics and benign overfitting have been studied in [3] and [8]. However, we believe that our simplified setting is more effective for delivering clear insights. For more details, you can check our response to Reviewer bMLu's Q1.
We additionally conducted experiments on a simplified real-world dataset to support our findings. Specifically, we slightly modified the MNIST dataset to highlight the roles of key components such as signal and noise. The results demonstrate that our theoretical insights also hold beyond our settings. Detailed experimental settings and results are provided in our responses to Reviewer ABZC’s W1 and Q2.
W2. Further Explanation on Condition 3.1
We would like to provide further explanation of Condition 3.1. These types of conditions are widely adopted in feature learning theory literature that we have discussed above. In (C1) and (C2), we assume a sufficiently large dimension and large sample sizes. These assumptions allow us to apply concentration inequalities to quantities such as norms and correlations involving Gaussian noise, as well as the number of each type of data.
In (C3) and (C4), we assume a sufficiently small initialization scale and learning rate. These assumptions ensure that the effect of initialization can be neglected compared to the gradient descent updates, and that each update step remains small, which enables stable training dynamics.
Lastly, we provide explanations for (C5) and (C6) in lines 189–192 of the draft, and we would like to further clarify that the motivation behind (C5) is that the difficulty of learning a signal is determined by its frequency and strength.
We will discuss these points more clearly in the next revision.
Thanks for your time and consideration. We are happy to discuss further if anything remains unclear.
Best regards,
Authors
References
[1] Allen-Zhu & Li. Toward Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning. ICLR 2023
[2] Cao et al. Benign Overfitting in Two-layer Convolutional Neural Networks. NeurIPS 2022
[3] Kou et al. Benign Overfitting in Two-layer ReLU Convolutional Neural Networks. ICML 2023
[4] Zou et al. Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization. ICLR 2023
[5] Jelassi et al. Towards Understanding How Momentum Improves Generalization in Deep Learning. ICML 2022
[6] Chen et al. Why Does Sharpness-Aware Minimization Generalize Better than SGD. NeurIPS 2023
[7] Huang et al. Quantifying the Optimization and Generalization Advantages of Graph Neural Networks Over Multilayer Perceptrons. AISTATS 2025
[8] Jiang et al. Unveil Benign Overfitting for Transformer in Vision: Training Dynamics, Convergence, and Generalization. NeurIPS 2024
[9] Shen et al. Data Augmentation as Feature Manipulation. ICML 2022
[10] Zou et al. The Benefit of Mixup for Feature Learning. ICML 2023
[11] Oh & Yun. Provable Benefit of Cutout and CutMix for Feature Learning. NeurIPS 2024
Dear Reviewer twxS
Thank you for your time and effort in reviewing our work. We greatly appreciate your valuable and constructive feedback. Given your busy schedule, we would be grateful if you could review our responses to ensure we've adequately addressed your concerns. If you have any further questions or suggestions, please let us know, and we'll be happy to address them.
Thank you again for your valuable contribution to our research.
Best regards,
Authors
Dear Reviewer twxS,
Thank you for your previous input. As the discussion deadline is approaching, we kindly remind you to review the author's response and continue the discussion if you have any further comments or questions.
Your timely feedback will be greatly appreciated to ensure a thorough review process.
Thank you very much for your attention!
Best regards,
AC
Thank you for your response. I find it much clearer than the paper itself. Incorporating parts of it into the revision would help. I understand simplifications need to be made to prove theorems, and the question always is, what can we learn from this simplified setting that might generalize to other settings. Discussing this more in the paper would help. The MNIST results help. I will raise my score accordingly to borderline accept.
Thank you for your response. We are glad to hear that our rebuttal addressed your concerns. We also appreciate your suggestions for improvement and plan to incorporate them into our next revision.
Best regards,
Authors
This paper studies weak-to-strong generalization, a phenomenon where a strong model trained with the supervision from a weaker model outperforms the weaker one. This work focuses on the theoretical analysis to interpret the model behavior in weak-to-strong generalization with a linear CNN as a weak model and a two-layer ReLU CNN as a strong model. Depending on the amount of data, this work provides rigorous explanations for the performance in weak-to-strong generalization. Specifically, when the data is scarce, the generalization can occur via benign overfitting, or fail due to harmful overfitting; when the data is abundant, the generalization occurs at the early stage of model training and diminishes as the training progresses.
Strengths and Weaknesses
Strengths:
- This paper is well-written and clearly organized. The writing is concise and effective in conveying the claims. Overall, it's of high quality in terms of clarity.
- This work provides a theoretical and detailed analysis of the weak-to-strong generalization, specifically from the aspects of gradient descent with linear/nonlinear models, which has the potential to inspire such modeling or further research in the community.
Weaknesses:
- This paper mainly focuses on the theoretical analysis, and the experiments are limited to synthetic data modeled similarly in the setup. It would be better to include more empirical insights or experiments with simple real data.
- It would be better for this paper if more discussions on the empirical insights could be added, such as identifying key aspects of data that we can adjust or change in real data to make weak-to-strong generalization occur.
Questions
- Some empirical insights might be helpful, such as identifying key aspects of real data that trigger benign or harmful overfitting in the data-scarce regime. For example, does a smaller data dimension, a larger proportion of mixed data, or a larger number of training samples for the strong model tend to trigger benign overfitting more easily?
- Is it possible to include experiments with simple real data? What would be the challenges or limitations in validating the theoretical results in real data?
- As weak-to-strong generalization refers to the stronger model outperforming the weaker model, is it necessary to compare the performance of the weaker and stronger models in the theoretical analysis in Theorem 3.4? The comparison does not seem obvious in terms of the test errors.
- How are the assumptions in Condition 3.1 obtained, and how do they ensure the training properties discussed in lines 187–192?
Limitations
The authors didn't mention the limitations or potential negative societal impact of their work in the main paper. Discussing the possible scope of applying the modeling or constraint to real data might be helpful.
Justification for Final Rating
My major concern is that this work mainly presents theoretical insights without much empirical validation. The rebuttal of the authors showed the results on synthetic datasets and explained the challenges of empirical experiments, which addressed my concerns, and I would like to keep my positive rating.
Formatting Concerns
There are no major formatting issues.
Thank you for your valuable comments. In the following, we address the points raised by the reviewer.
W1, W2. High-level Intuition from Our Analysis
We would like to restate the intuition behind our analysis presented in Section 4 at a higher level and emphasize that these insights can be extended to more general scenarios.
A weak model may sometimes perform poorly because it fails to capture certain difficult-to-learn features, which we refer to as hard signals in our setting. Now, imagine training a strong model—capable of learning these hard signals that the weak model misses—using supervision from the weak model. The strong model can only acquire meaningful information from data points that are correctly pseudo-labeled by the weak model.
Our main intuition for why the strong model can outperform the weak model is that there exists a subset of data containing these hard signals but that is still labeled correctly thanks to the presence of easier-to-learn features (or easy signals) that the weak model has successfully learned. This subset, which we call both-signal data, provides the necessary gradient information to help the strong model learn the hard signals. While our framework is simple, we believe the insights gained here can be applied to understand broader scenarios.
Based on the intuition that using more both-signal data is crucial for weak-to-strong generalization, we believe that data selection techniques, such as uncertainty-based methods, can be designed to preferentially select such both-signal data. This is because both-signal data may exhibit higher uncertainty due to the weak model’s inability to fully capture the hard signals. We leave empirical investigations of this approach for future work.
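To make this selection idea concrete, here is a hedged sketch (all names are illustrative; this is not part of the paper): keep the pseudo-labeled points on which the weak model has the smallest margin, under the intuition above that both-signal data sit closer to the weak model's decision boundary.

```python
# Hedged sketch of margin-based (uncertainty-based) data selection.
import numpy as np

def select_uncertain(weak_scores: np.ndarray, keep_frac: float = 0.5):
    """weak_scores: (n,) real-valued weak-model outputs for labels in {-1, +1}."""
    margin = np.abs(weak_scores)            # small |score| = high uncertainty
    k = int(keep_frac * len(margin))
    return np.argsort(margin)[:k]           # indices of the k least confident points

# Usage: idx = select_uncertain(scores); pseudo_labels = np.sign(scores[idx])
```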
Q1. Key Factors Triggering the Transition from Harmful to Benign Overfitting
You're almost right; in fact, in general settings it is the noise level, rather than the dimension alone, that matters. More precisely, the transition from harmful to benign overfitting in weak-to-strong training is mainly governed by the number of both-signal samples, which depends on both the proportion of mixed data and the number of strong-model training samples. Additionally, the noise level in the dataset, controlled by the noise variance and the dimension in our setting, also plays a key role. This is because only the both-signal data contributes to learning the hard signals, and a higher noise level requires more such data to avoid harmful overfitting.
W1, Q2. Experiments on Real-World Data
We additionally conducted experiments on simple real-world data to support our findings. Since it is hard to clearly delineate signal vs noise in real data, we modified the MNIST dataset to emphasize their roles.
First, we multiply each pixel in the original images of digits 4, 5, 6, 7, 8, and 9 by 0.02, while keeping the images of other digits unchanged. This corresponds to the presence of hard signals, with digits 4–9 serving as the hard signals. To emphasize the role of noise, we replace the border region of each 28×28 image—a 5-pixel-wide frame along the edges—with standard Gaussian noise. This results in images where the central 18×18 region contains the digit, surrounded by Gaussian noise. Finally, we randomly concatenate two such modified images that share the same parity (i.e., both even or both odd), producing 28×56 images. We assign binary labels based on their parity.
The resulting data includes a variety of signal types: some pairs contain two bright digits, others contain one bright and one dark digit, and some consist of two dark digits. These types serve as easy-only data, both-signal data, and hard-only data in our setting.
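For concreteness, here is a minimal sketch of this construction (illustrative, not our actual experiment code; torchvision is assumed for loading MNIST, and the function name is our own):

```python
# Hedged sketch: darken digits 4-9, replace the 5-pixel border with Gaussian
# noise, and concatenate random same-parity pairs into 28x56 images.
import numpy as np
from torchvision import datasets

def build_parity_pairs(train=True, seed=0):
    rng = np.random.default_rng(seed)
    ds = datasets.MNIST(root="./data", train=train, download=True)
    imgs = ds.data.numpy().astype(np.float32) / 255.0   # (N, 28, 28)
    labels = ds.targets.numpy()

    # Hard signals: scale pixels of digits 4-9 by 0.02 (dark digits).
    imgs[labels >= 4] *= 0.02

    # Noise: keep only the central 18x18 region; the 5-pixel border is noise.
    keep = np.zeros((28, 28), dtype=bool)
    keep[5:23, 5:23] = True
    imgs = np.where(keep, imgs, rng.standard_normal(imgs.shape).astype(np.float32))

    # Pair images of the same parity; the binary label is the parity.
    xs, ys = [], []
    for p in (0, 1):
        idx = rng.permutation(np.where(labels % 2 == p)[0])
        for i, j in zip(idx[0::2], idx[1::2]):
            xs.append(np.concatenate([imgs[i], imgs[j]], axis=1))  # 28x56
            ys.append(p)
    return np.stack(xs), np.array(ys)
```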
Using this modified real-world dataset, we investigate the weak-to-strong training scenario beyond our theoretical settings. For the weak model, we use an MLP consisting of a single hidden layer with 128 units followed by a ReLU activation. For the strong model, we use a CNN with three convolutional layers of increasing channels (64, 128, 256), each followed by batch normalization, ReLU activation, and max pooling. The extracted features are then flattened and passed through a fully connected layer with 512 units.
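A sketch of the two architectures as described is given below; the kernel size, padding, and the two-logit output head are our assumptions, while the layer widths follow the description above (input: 1x28x56 images).

```python
import torch.nn as nn

weak_model = nn.Sequential(                # MLP: one hidden layer of 128 units
    nn.Flatten(),
    nn.Linear(28 * 56, 128), nn.ReLU(),
    nn.Linear(128, 2),
)

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out), nn.ReLU(), nn.MaxPool2d(2),
    )

strong_model = nn.Sequential(              # CNN: 64 -> 128 -> 256 channels
    conv_block(1, 64), conv_block(64, 128), conv_block(128, 256),
    nn.Flatten(),                          # 28x56 -> 3x7 after three 2x pools
    nn.Linear(256 * 3 * 7, 512), nn.ReLU(),
    nn.Linear(512, 2),
)
```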
We first trained the weak model using 500 samples. Then, we trained the strong model using labels predicted by the trained weak model, with varying numbers of training samples n. We used the Adam optimizer with default parameters and trained each model for 300 epochs. Each experiment was repeated five times for every n, and we report the mean and standard deviation of the test accuracy for both the weak model and the resulting weak-to-strong model.
| n | weak test acc | weak-to-strong test acc |
|---|---|---|
| 500 | 0.8606 ± 0.0045 | 0.8462 ± 0.0395 |
| 1000 | 0.8593 ± 0.0013 | 0.8834 ± 0.0128 |
| 1500 | 0.8642 ± 0.0150 | 0.8846 ± 0.0285 |
| 2000 | 0.8669 ± 0.0052 | 0.8744 ± 0.0273 |
| 2500 | 0.8586 ± 0.0139 | 0.8655 ± 0.0122 |
We observe a trend in which the weak-to-strong gain increases with n and then decreases. These observations are consistent with our theoretical findings, which describe a transition from harmful overfitting to benign overfitting, and eventually to the data-abundant regime.
Q3. Comparison of Test Error Between Weak Model Training and Weak-to-Strong Training
Here, we provide a comparison between the test errors presented in Theorem 3.4 and that of the weak model. In the benign overfitting case, our theoretical bound includes exponentially decreasing terms in the proportions of all three types of data (easy-only, both-signal, and hard-only). This implies that the model achieves low test error across all data types. Since we have shown in Proposition 2.1 that the weak model fails on hard-only data, this result demonstrates that the weak-to-strong model improves upon the weak model via benign overfitting, by a test error gap proportional to the fraction of hard-only data.
In the harmful overfitting case, the test error is lower bounded by a non-vanishing quantity, which implies that the gap between the test errors of the weak and weak-to-strong models is upper bounded by a constant multiple (less than one) of that quantity.
We will clarify this comparison more explicitly in the next revision to avoid any potential confusion.
Q4. Regarding the Role of Condition 3.1
We provide additional clarification regarding Condition 3.1, which reflects standard assumptions commonly adopted in the feature learning theory literature (e.g., [1], [2]). Conditions (C1) and (C2) allow us to apply concentration inequalities to key quantities such as norms and correlations involving Gaussian noise, as well as to accurately track the relative proportions of different types of data. These assumptions are essential for analyzing training dynamics—for example, by enabling control over inner products between filters and Gaussian noise vectors. Conditions (C3) and (C4) assume a sufficiently small initialization scale and learning rate. This ensures that the influence of initialization is dominated by gradient descent updates, and that each step remains moderate, leading to stable training behavior. We would like to reiterate that Condition (C5) is motivated by the intuition that the learnability of a signal depends on both its frequency and strength. The lower bound on p_b imposed by Condition (C6) is also essential for weak-to-strong generalization, as both-signal data play a central role in our analysis.
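To make this concrete, these are the standard concentration facts that (C1)–(C2) unlock, stated in generic notation that we introduce here for illustration (ξ_i are the Gaussian noise patches with per-coordinate variance σ_p² in dimension d, and n is the sample size):

$$
\Big| \|\xi_i\|_2^2 - \sigma_p^2 d \Big| = O\big(\sigma_p^2 \sqrt{d \log(n/\delta)}\big), \qquad \big|\langle \xi_i, \xi_j \rangle\big| = O\big(\sigma_p^2 \sqrt{d \log(n/\delta)}\big) \quad (i \neq j),
$$

each holding with probability at least 1 − δ. For large d, the noise patches thus have nearly equal norms and are nearly orthogonal, which is what makes the signal-noise decomposition of the training dynamics tractable.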
Limitations
Thank you for your suggestions. We plan to discuss our limitations more explicitly in the next revision.
Thank you very much for your time and consideration. Please let us know if anything remains unclear.
Best regards,
Authors
References
[1] Cao et al. Benign Overfitting in Two-layer Convolutional Neural Networks. NeurIPS 2022
[2] Kou et al. Benign Overfitting in Two-layer ReLU Convolutional Neural Networks. ICML 2023
I thank the authors for their detailed rebuttal, which addressed my concerns. The clarification of the intuition is clear, and the MNIST experiments validated the theoretical insights; it would be greatly helpful to include them in the main paper as a synthetic dataset, along with a discussion on the challenges of further empirical validation. I will maintain my positive rating of this work.
Thank you for your response. We are glad to hear that our rebuttal addressed your concerns and was helpful. If you have any further questions, please don’t hesitate to reach out.
Best regards,
Authors
Dear Reviewers,
Thank you to those who have already engaged in the discussion. I encourage the remaining reviewers to read the author responses and join the discussion during this period.
Your participation is important for a fair and comprehensive review process, and early engagement allows for more meaningful exchanges.
Thank you all for your time and contributions.
Best regards,
AC
This paper develops a theoretical framework for weak-to-strong generalization using a stylized setting where a linear CNN serves as the weak model and a two-layer ReLU CNN serves as the strong model. The authors prove that in data-scarce regimes, weak-to-strong generalization can occur via benign overfitting, while in abundant data regimes it arises under early stopping but may vanish with overtraining. Simulations support these results and highlight additional behaviors beyond theory.
The main strengths are the technical rigor, the clear theoretical characterization of different learning regimes, and the identification of precise conditions under which weak-to-strong generalization succeeds or fails. The rebuttal added valuable intuition, clarified assumptions, and provided new experiments on a modified MNIST dataset, demonstrating that the theoretical insights extend beyond purely synthetic cases. Importantly, the authors also corrected and extended aspects of prior feature-learning analyses, which reviewers recognized as a meaningful contribution. The weaknesses are primarily the highly engineered data distribution and limited real-world applicability, as well as an exposition that could better emphasize the roles of key assumptions.
During the rebuttal, the reviewers agreed that the paper makes a technically solid and timely contribution. The authors’ clarifications, additional intuition, and new experiments satisfactorily addressed the main concerns, and some reviewers raised their scores accordingly. While the narrow and stylized scope limits its suitability for a spotlight, the work provides rigorous and valuable insights into weak-to-strong generalization, a phenomenon of broad current interest. I therefore recommend acceptance.