Impact of Layer Norm on Memorization and Generalization in Transformers
Abstract
Reviews and Discussion
This paper studies the effect of Layer Normalization on memorization and learning in Transformer architectures, comparing Pre-LN (where LN is applied before main layers) and Post-LN (where LN is applied after) settings. The authors systematically investigate what happens when they remove LN's learnable parameters, measuring how this affects memorization (overfitting noisy labels), genuine learning, and recovery of true labels.
Strengths and Weaknesses
Strengths:
- The paper is well-written, with a logical flow, and provides a clear explanation of the problem's importance, novelty, and relevant background.
- The authors studied the effect of LayerNorm across various models and datasets, offering a relatively large-scale and robust study. Results appear relatively stable across different settings.
- The connection established between gradient norms and learning/memorization provides theoretical grounding for the empirical observations. This also introduces a useful metric for practitioners.
- The findings could help inform the design of models that are more robust to label noise or less prone to memorization.
Weaknesses:
- Memorization is tested via random label flipping (1% noise). As far as I can tell, there is no investigation of how the results change when the proportion of noisy labels is increased. While the findings likely hold up to a certain point, it would be interesting to see how they behave as label noise increases.
- The study focuses exclusively on removing LN parameters (scaling and bias), while retaining the core mean-variance normalization. The impact of removing the normalization itself is not explored. If relevant prior work exists, it should be cited; otherwise, could the authors clarify why they did not also test the effect of removing the normalization operation?
Questions
Please see the weaknesses section for broader concerns. In addition, I have the following specific questions and suggestions for clarity:
- Certain models (e.g., DistilBERT) appear to deviate from the general trends observed, but these exceptions are not discussed in depth. Do the authors have any intuition or explanation for why such deviations occur?
- While I enjoyed reading the paper, I believe the following points would help improve its clarity:
2.1. In Figure 1, what dataset is used to generate the plot? It would help to include this in the caption or main text.
2.2. In Figure 2, the second column presents results without layer normalization (LN). It would be helpful to also show the same configuration with LN for direct comparison. Without it, it's difficult to assess the actual impact of LN removal, and the effect might be overestimated.
2.3. The paper claims that in the Post-LN setup, removing LN suppresses memorization. However, from the plots, this claim is hard to confirm. While it may help with recovery of true labels, the link to memorization suppression is unclear. Could the authors elaborate on how they diagnose memorization from these plots?
2.4. In Figure 3(a), third column, I expected the effect of LN removal to show a monotonic trend. However, the blue line is non-monotonic. Do the authors have an explanation for this behavior?
2.5. On page 8, lines 265–267, the sentence beginning with "In contrast, ..." is unclear to me. Specifically, how is the order of magnitude of gradient norms related to the restoration of genuine labels without interfering with learned features? Some further clarification here would be appreciated.
- The authors mention the effect of LN removal from early layers, mid layers and final layers, however, there is no definition what layers are assumed to be early, middle or last. Is the number of layers in each group equal?
Limitations
Please see weaknesses and questions.
Formatting Concerns
No concerns
We appreciate the reviewer’s recognition of our work and their insightful comments. Below we address the questions to provide more clarity regarding our paper.
Weaknesses:
W1: Thank you for the suggestion. We provide additional experiments on the Emotions dataset with 2% and 5% label noise. The results are consistent with our original findings: LN removal mitigates memorization in the Post-LN model (BERT), while it impairs learning in the Pre-LN model (GPTNeo).
| Noise | Model | Setting | Learning (%, ↑ is better) | Memorization (%, ↓ is better) | Recovery (%, ↑ is better) | Random Prediction (%, ↓ is better) |
|---|---|---|---|---|---|---|
| 2% | Post-LN (BERT) | Before | 91.7 | 100 | 0 | 0 |
| 2% | Post-LN (BERT) | After | 92.0 | 20.62 | 76.25 | 3.12 |
| 2% | Pre-LN (GPTNeo) | Before | 91.35 | 100 | 0 | 0 |
| 2% | Pre-LN (GPTNeo) | After | 84.85 | 66.87 | 16.56 | 16.56 |
| 5% | Post-LN (BERT) | Before | 90.35 | 100 | 0 | 0 |
| 5% | Post-LN (BERT) | After | 91.25 | 27.0 | 66.88 | 6.12 |
| 5% | Pre-LN (GPTNeo) | Before | 90.35 | 100 | 0 | 0 |
| 5% | Pre-LN (GPTNeo) | After | 82.6 | 67.5 | 11.0 | 21.5 |
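For reference, the label-noise injection step can be sketched as follows (a minimal Python illustration assuming integer class labels; variable names are placeholders and this is not our exact released code):

```python
import numpy as np

def flip_labels(labels, noise_ratio=0.02, num_classes=6, seed=0):
    """Flip `noise_ratio` of the labels to a randomly chosen *different* class.

    Returns the noisy labels and the indices of the flipped samples, which are
    tracked so that memorization/recovery can later be measured on them.
    """
    rng = np.random.default_rng(seed)
    noisy = np.asarray(labels).copy()
    n_noisy = int(round(noise_ratio * len(noisy)))
    noisy_idx = rng.choice(len(noisy), size=n_noisy, replace=False)
    for i in noisy_idx:
        noisy[i] = rng.choice([c for c in range(num_classes) if c != noisy[i]])
    return noisy, noisy_idx
```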
W2: Yes, in this study we focus on removing LN’s learnable parameters (scale and bias) while retaining the mean-variance normalization itself. We follow prior work [1], which removed the same parameters to analyze LN’s impact.
Questions:
Q1: Regarding DistilBERT, we observe that although LN removal did not mitigate memorization, removing the early LNs at least delayed it, because the early LNs have higher norms. That said, we suspect that other components in DistilBERT (e.g., FFN, MHSA) may drive memorization more than LN, which is why LN removal did not mitigate it. Since this paper is focused solely on the impact of LN, we refrain from extending our analysis to these other components.
Q2.1: In Fig. 1 we used multiple dataset-model configurations for diversity in our analysis, as stated in Appendix F.3: (1) the Emotions dataset with BERT (Post-LN), DeBERTa (Post-LN), and GPTNeo (Pre-LN); (2) the News dataset with ELECTRA (Post-LN), Longformer (Post-LN), and Qwen2 (Pre-LN); (3) the Tweets dataset with RoBERTa (Post-LN) and GPT2 (Pre-LN); (4) CIFAR10 with ViT-B (Pre-LN); (5) UTK-Face with DeiT (Pre-LN); and (6) NICO++ with ViT-S (Pre-LN).
Q2.2: We take the reviewer’s suggestion into account and will add memorization-over-epochs plots before LN removal to Fig. 2 as a reference. These would show that without LN removal the model reaches 100% memorization by the end of training.
Q2.3: As explained in our paper, we introduce noisy labels to study memorization: if the model overfits these noisy labels, it has memorized them. In Post-LN models, after removing LN, the model does not overfit the noisy labels and instead reverts to predicting their genuine class labels, meaning that memorization is mitigated, as shown by the high recovery scores (green bars) and low memorization scores (red bars) in Fig. 1d.
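To make the diagnosis concrete, one plausible way to compute these scores on the flipped subset is sketched below (an illustrative implementation only; the exact definitions are those given in the paper):

```python
import numpy as np

def noisy_subset_scores(preds, noisy_labels, true_labels):
    """Scores over the flipped (noisy) samples only.

    memorization: fraction predicted as the flipped (noisy) label
    recovery:     fraction predicted as the genuine (pre-flip) label
    random_pred:  fraction predicted as neither label
    """
    preds = np.asarray(preds)
    memorization = np.mean(preds == np.asarray(noisy_labels))
    recovery = np.mean(preds == np.asarray(true_labels))
    random_pred = 1.0 - memorization - recovery
    return 100 * memorization, 100 * recovery, 100 * random_pred
```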
Q2.4: We appreciate the reviewer’s observation. Our explanation is as follows. In the very early training epochs (epochs 1–3), removing the early LNs disrupts the model’s capacity to learn and generalize, resulting in poor test accuracy; the model behaves nearly randomly, causing high random-prediction scores and poor recovery. As training progresses, the model gradually learns and generalizes better to the test set. This improved learning temporarily enhances recovery (reduces memorization) on the noisy samples, explaining the rise of the blue curve (early LNs). With further training, however, memorization sets in and recovery worsens again, hence the dip. In contrast, removing middle/later LNs, which our gradient analysis shows have lower learning gradient norms, disrupts learning much less, even in early epochs. The model therefore makes correct predictions from the start, the recovery score does not begin at zero, and the trend is closer to monotonic. Thanks to your comment, we will incorporate this explanation in the revised version.
Q2.5: First, we would like to clarify that the gradients offer a possible explanation for the observed phenomena across Pre- and Post-LN models. From Eq. 5, we observe that the ratio of learning-to-memorization gradient norms is much higher for Pre-LN models than for Post-LN models. This means that in Pre-LN models the learning gradient norms dominate, so LN removal disrupts learning. In Post-LN models, by contrast, the learning and memorization gradient norms are of similar magnitude, which suggests that LN removal has little influence on learning while still impacting memorization.
Q3: Yes, we define early, middle, and later layers by grouping consecutive layers; the grouping is explained in Appendix F.4.
[1] Xu, J., Sun, X., Zhang, Z., Zhao, G., and Lin, J., 2019. Understanding and Improving Layer Normalization. Advances in Neural Information Processing Systems, 32.
We thank the reviewer for their comments and suggestions and hope that our responses provide more clarity regarding our paper.
Thank you for the detailed responses and the additional results. I enjoyed reading the paper and believe it offers valuable insights into the effect of layer norm. I am therefore keeping my score of 5.
We would greatly appreciate it if the reviewer could review our rebuttal and indicate whether our responses resolve their concerns. We are happy to provide additional explanations or extend the discussion if needed.
We thank the reviewer for recognizing our paper's contributions and their time and effort in reviewing our paper.
This paper presents a comprehensive analysis of the role of Layer Normalization (LN) in transformer architectures, contrasting Pre-LN (where LN is applied before attention and feedforward layers) and Post-LN (where LN is applied after residual connections) designs. The authors study how LN affects both learning stability and label memorization, particularly under training with noisy labels.
The core findings are: (1) Learning vs. Memorization: LN is essential for learning in Pre-LN transformers; removing LN parameters destabilizes training and worsens overfitting. In contrast, for Post-LN models, LN plays a key role in memorization, and removing LN parameters suppresses memorization and restores correct labels without hurting generalization; (2) Importance of Early LN Layers: The influence of LN is concentrated in the early layers—removing early LNs harms learning in Pre-LN models and effectively suppresses memorization in Post-LN models; (3) Gradient-Based Analysis: The authors compute learning and memorization gradient norms, showing that LN in Pre-LN models contributes more to learning gradients, while in Post-LN models, learning and memorization gradients are comparable—explaining the divergent behaviors upon LN parameter removal.
Strengths and Weaknesses
Strengths
- The paper is exceptionally well-written and highly accessible. The flow of ideas is clear and logical, with no ambiguities or gaps in explanation. The authors strike an excellent balance between depth and clarity, relegating technical details and supplementary results to the appendix, keeping the main paper focused and uncluttered.
- The experimental validation is compelling and clearly supports the theoretical claims. The experiments are consistent, well-designed, and span a variety of architectures and datasets. The figures and plots are not only visually informative but also instrumental in building trust in the findings.
- Although the topic may appear narrowly focused—centered solely on the role of Layer Normalization (LN)—the thoroughness of the analysis and the consistency of the insights make this study valuable to the broader community. While it doesn’t propose a new method, the paper’s novelty lies in its analytical depth and empirical findings. It is clearly an analysis paper, and should be reviewed as such.
- The metrics introduced to quantify the impact of LN on learning and memorization—Learning Accuracy, Memorization Score, Recovery Score, and Random Prediction Score—are thoughtful and enhance interpretability across experiments.
Minor Weaknesses and Suggestions
- The vision datasets used—CIFAR-10, UTK-Face, and NICO++—might be seen as relatively small-scale or “toy” by today’s standards. Including experiments on more challenging and large-scale datasets such as ImageNet-1k, Places365, or iNaturalist could further strengthen the empirical claims and broaden the paper’s applicability.
- While the use of noisy labels is a standard approach to studying memorization, a brief discussion of alternative methods (e.g., studying forgetting dynamics, example forgetting statistics, or influence functions) would help contextualize the choice. A comparison of the advantages and limitations of noise injection would also improve clarity.
- It would be interesting to include an ablation where a Pre-LN model is manually converted to Post-LN, and vice versa (e.g., modifying ViT-S accordingly), to directly observe how the same base architecture behaves under the two LN configurations. This would further cement the core message about architectural asymmetries in LN’s impact.
- The paper currently evaluates only unimodal models (vision-only or text-only). It would be interesting to investigate whether the findings regarding LN’s role in learning and memorization transfer to multimodal architectures such as vision-language models (e.g., CLIP). Given that these models often use LayerNorm in both visual and textual branches, this could shed light on whether the observed effects generalize beyond the unimodal setting.
Questions
This is a well-executed analysis paper with a thorough investigation into the role of Layer Normalization in Pre-LN and Post-LN transformer architectures. The paper is already strong in its current form and I lean toward an accept recommendation. That said, I have a few suggestions and questions that I believe could improve the scope and impact of the work even further, should the authors choose to explore them:
- Scale of Vision Datasets (e.g. run few experiments on Imagenet-1k)
- Alternative Memorization Probes (i.e. explain differences and why noisy labels were chosen)
- Architectural Conversion Experiments (e.g. transform ViT-S to Post-LN)
- Extension to Multimodal Models (e.g. run few experiments with CLIP)
These are suggestions for potential enhancement, not critical gaps.
Limitations
Yes.
Final Justification
I recommend accepting this paper. I will keep my initial score 5: Accept. Authors did a great job in this rebuttal.
Formatting Concerns
No concerns.
We thank the reviewer for recognizing our work’s value and for suggesting several points that could further improve it. We report the results we were able to collect within the rebuttal time frame.
Minor Weaknesses/Questions
Q1: We provide additional analysis on Places365-Mini (we could not use the full Places365 due to resource and time constraints), testing the ViT-B and DeiT Pre-LN models on it.
| Model | Setting | Learning (%, ↑ is better) | Memorization (%, ↓ is better) | Recovery (%, ↑ is better) | Random Prediction (%, ↓ is better) |
|---|---|---|---|---|---|
| ViT-B | Before | 93.4 | 100 | 0 | 0 |
| ViT-B | After | 56.2 | 78.33 | 13.33 | 8.34 |
| DeiT | Before | 91.2 | 100 | 0 | 0 |
| DeiT | After | 51.2 | 95 | 0 | 5 |
From these results, we again verify that for Pre-LN models LN primarily influences learning: removing it impairs learning without mitigating memorization.
Q2: Thank you for this constructive comment. We chose random label noise because it offers a simple and reliable way to study memorization: since the flipped examples are known a priori to be unlearnable, any fit to their assigned (noisy) labels must stem from memorization rather than generalization. In contrast, methods such as forgetting dynamics or influence functions require additional computation to first identify memorized examples, adding complexity to the analysis. While such heuristic-based methods can also be used, we chose random label noise for its conceptual clarity and ease of implementation. That said, we agree that such further explorations would help extend this work.
Q3: Following your suggestion, we attempted this for the Pre-LN models (ViT-B and DeiT) by reordering the LN placements to emulate Post-LN behavior. However, even before any LN removal, the converted models failed to converge properly, as reflected in the poor train and test accuracy at the end of training shown below.
| Model | Train Accuracy (%) | Test Accuracy (%) |
|---|---|---|
| ViT-B | 10.15 | 10 |
| DeiT | 42.77 | 42.52 |
This suggests that such conversions may require additional tuning or stabilization strategies. We consider this a promising direction for future work: systematically converting Pre-LN models to Post-LN and examining their behavior under LN removal.
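For clarity, the two block orderings being contrasted can be sketched as follows (an illustrative PyTorch sketch in which `attn` and `ffn` are placeholder callables mapping a tensor to a same-shaped tensor; this is not the exact conversion code we used):

```python
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Pre-LN: LN is applied before each sublayer, inside the residual branch."""
    def __init__(self, attn, ffn, d_model):
        super().__init__()
        self.attn, self.ffn = attn, ffn
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        return x + self.ffn(self.ln2(x))

class PostLNBlock(nn.Module):
    """Post-LN: LN is applied after each residual sum."""
    def __init__(self, attn, ffn, d_model):
        super().__init__()
        self.attn, self.ffn = attn, ffn
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.ln1(x + self.attn(x))
        return self.ln2(x + self.ffn(x))
```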
Q4: We appreciate the reviewer’s suggestion. Due to resource and time constraints, we were unable to run multimodal experiments, but we instead provide new results on a language modeling task using a modified version of our classification dataset. Each input is rephrased as: original text + “This emotion is [type].”, where [type] is one of 6 emotions. To simulate noisy labels, we randomly flip the [type] token for 1% of the training samples to a different emotion. We then train both Pre-LN (GPTNeo) and Post-LN (BERT) models under two settings: (1) with LN intact, and (2) with LN removed.
| Model | Setting | Learning (%, ↑ is better) | Memorization (%, ↓ is better) | Recovery (%, ↑ is better) | Random Prediction (%, ↓ is better) |
|---|---|---|---|---|---|
| Post-LN (BERT) | Before | 92.14 | 100 | 0 | 0 |
| Post-LN (BERT) | After | 91.95 | 28.12 | 60.62 | 11.25 |
| Pre-LN (GPTNeo) | Before | 91.7 | 100 | 0 | 0 |
| Pre-LN (GPTNeo) | After | 85.55 | 51.25 | 19.38 | 29.38 |
We find that, consistent with our original results, removing LN suppresses memorization in Post-LN (BERT) but hurts learning in Pre-LN (GPTNeo), re-verifying our main claims in a generative modeling setup. We hope this demonstrates that our findings generalize beyond classification tasks.
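A minimal sketch of how a training example is rewritten for this setup follows (the emotion label strings and helper name are illustrative assumptions, not our exact preprocessing code):

```python
import random

EMOTIONS = ["sadness", "joy", "love", "anger", "fear", "surprise"]  # assumed label set

def to_generative_example(text, emotion, flip=False, rng=random):
    """Append the answer template; optionally flip the emotion token to a wrong one
    (applied to ~1% of training samples to simulate noisy labels)."""
    target = rng.choice([e for e in EMOTIONS if e != emotion]) if flip else emotion
    return f"{text} This emotion is {target}."
```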
We sincerely thank you for your valuable comments and suggestions, and hope that our responses answer them. We would be happy to answer any further questions that you may have.
Thank you to the authors for the detailed and thoughtful rebuttal. I appreciate the additional results, especially given the tight one-week timeframe.
Most of my questions were addressed in a satisfying manner. The new experiments, particularly those on Places365-Mini and the language modelling task, align well with the paper’s core findings and further reinforce its contributions.
For the sake of discussion, I’d like to ask a follow-up question about Q3. In the architectural conversion experiment, the converted ViT-B model seems to fail entirely (10% test accuracy), suggesting it did not learn at all.
I’m curious: Did you follow the same training recipe as in the "original" ViT-B?
It would make sense if such a conversion alters the optimization dynamics, but does it alter them so much?
We thank the reviewer for their prompt response and are glad they found our additional experiments satisfying and reinforcing of our contribution. Regarding the follow-up question on converting Pre-LN to Post-LN for ViT-B: yes, we followed the exact same training recipe for the Post-LN version as for the original Pre-LN setting. We also find it intriguing that such a large performance drop occurs. Understanding why this conversion alters the optimization dynamics so drastically would indeed be an interesting and inspiring direction for future work.
Overall, we again thank the reviewer for engaging with our paper and sharing their thoughts on it.
Thanks a lot. Indeed, this is an interesting direction for future work. Concerning my rating, I will keep my 5: Accept.
I recommend accepting this paper.
We sincerely thank the reviewer for recognizing our work and for providing thoughtful and valuable feedback during their review.
The paper investigates how Layer Normalization (LN) affects memorization versus learning in transformers with Pre-LN and Post-LN designs. The key takeaway is: removal of LN parameters in Pre-LN models significantly destabilizes the learning process, leading to persistent overfitting. In contrast, removing LN parameters from Post-LN architecture designs effectively mitigates memorization and enables the recovery of genuine labels.
Strengths and Weaknesses
Strengths:
- Clear, under-explored question. LN is ubiquitous in modern transformer design, yet its role in memorization vs. learning has not been clearly established; the paper studies this gap.
- Broad empirical study. Extensive results for both language and vision modalities, including 3 language and 3 vision classification datasets, and 7 Pre-LN and 6 Post-LN transformer architectures.
- Simple, reproducible intervention. Dropping γ, β is easy to implement and isolates LN’s specific contribution (a minimal sketch is given after this list).
- Code is provided.
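For concreteness, one way this intervention could be implemented for a PyTorch/Hugging Face style model is sketched below (an assumption-laden illustration that resets γ to 1 and β to 0 and then freezes them; the module selection and the reset-then-freeze choice are guesses, not necessarily the authors’ exact procedure):

```python
import torch
import torch.nn as nn

def remove_ln_affine(model: nn.Module, name_filters=None) -> None:
    """Reset LayerNorm gamma/beta to the identity (1, 0) and freeze them.

    `name_filters` optionally restricts the intervention to LNs whose module
    name contains one of the given substrings (e.g. only early layers);
    the naming convention depends on the architecture.
    """
    for name, module in model.named_modules():
        if not isinstance(module, nn.LayerNorm):
            continue
        if name_filters is not None and not any(f in name for f in name_filters):
            continue
        with torch.no_grad():
            if module.weight is not None:
                module.weight.fill_(1.0)   # gamma -> 1
            if module.bias is not None:
                module.bias.zero_()        # beta -> 0
        if module.weight is not None:
            module.weight.requires_grad_(False)
        if module.bias is not None:
            module.bias.requires_grad_(False)
```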
Weaknesses:
- Noise regime. Results use 1% random label noise. Do the Pre-/Post-LN findings persist with higher noise rates? A sensitivity plot is required to verify the findings.
- Task scope. All benchmark tasks are classification. Can you report at least one generative task (e.g., language modelling) to demonstrate that the conclusions generalize? A toy/synthetic task is fine.
- Standard deviation of experiments. The paper mentions three seeds, but error bars are missing.
- What optimizer did you use? Could the optimizer choice—say switching from Adam to a modern variant like Muon—alter the Pre- vs. Post-LN learning/memorization findings? A brief explanation would be helpful.
Random question for my own learning: We know that many Pre-LN transformers produce large activation values—this is true even for generative models like DiTs. Do you think there’s a relationship between LayerNorm-related overfitting and the presence of activation outliers, or are the outliers simply an artifact of the optimizer?
If the authors satisfactorily address the questions above, I’m prepared to raise my score to 5 (Accept). Overall, this is an interesting paper with several insights.
Questions
Please see Strengths and Weaknesses Section for a list of questions.
Limitations
Yes.
Final Justification
Thank you authors for the rebuttal. Authors have addressed all my concerns.
In the revised version, please include results on different noise ratios, next-token prediction task and Muon optimizer.
Overall, this paper is interesting, well executed, and constitutes a good scientific contribution in my opinion. I will increase my score to 5 (Accept).
Formatting Concerns
N/A
We thank the reviewer for their comments and feedback. Below, we address their questions to the best of our ability in the given time frame.
Weaknesses:
Q1: As per your suggestion, we provide experiments on higher label noise ratios - 2% and 5% for the Emotions dataset using BERT (Post-LN) and GPTNeo (Pre-LN) models.
| Noise | Model | Setting | Learning (%, ↑ is better) | Memorization (%, ↓ is better) | Recovery (%, ↑ is better) | Random Prediction (%, ↓ is better) |
|---|---|---|---|---|---|---|
| 2% | Post-LN (BERT) | Before | 91.7 | 100 | 0 | 0 |
| 2% | Post-LN (BERT) | After | 92.0 | 20.62 | 76.25 | 3.12 |
| 2% | Pre-LN (GPTNeo) | Before | 91.35 | 100 | 0 | 0 |
| 2% | Pre-LN (GPTNeo) | After | 84.85 | 66.87 | 16.56 | 16.56 |
| 5% | Post-LN (BERT) | Before | 90.35 | 100 | 0 | 0 |
| 5% | Post-LN (BERT) | After | 91.25 | 27.0 | 66.88 | 6.12 |
| 5% | Pre-LN (GPTNeo) | Before | 90.35 | 100 | 0 | 0 |
| 5% | Pre-LN (GPTNeo) | After | 82.6 | 67.5 | 11.0 | 21.5 |
From these results we can confirm that even at higher label-noise ratios, LN impacts memorization in the Post-LN model (BERT), where its removal suppresses memorization, while in the Pre-LN model (GPTNeo), LN removal impairs learning without mitigating memorization.
Q2: We acknowledge the reviewer’s comment and carry out a next-token prediction task on the Emotions dataset for both BERT (Post-LN) and GPTNeo (Pre-LN). We modify the classification dataset as follows: original text + “This emotion is [type]”. The model then predicts the [type] token, which can be one of the 6 emotions. To induce noisy labels in this language-modeling setup, we change the [type] token to a random different emotion for 1% of the training samples. Under this setup, we train two variations of both the Pre- and Post-LN models, (1) without removing LN and (2) with LN removed, and provide the results below.
| Model | Setting | Learning (%, ↑ is better) | Memorization (%, ↓ is better) | Recovery (%, ↑ is better) | Random Prediction (%, ↓ is better) |
|---|---|---|---|---|---|
| Post-LN (BERT) | Before | 92.14 | 100 | 0 | 0 |
| Post-LN (BERT) | After | 91.95 | 28.12 | 60.62 | 11.25 |
| Pre-LN (GPTNeo) | Before | 91.7 | 100 | 0 | 0 |
| Pre-LN (GPTNeo) | After | 85.55 | 51.25 | 19.38 | 29.38 |
We observe that even for this language-modeling task, LN impacts memorization in the Post-LN model (BERT) while its removal impairs learning in the Pre-LN model (GPTNeo). This verifies that results on a generative task also corroborate the claims of the paper.
Q3: We provided error bars in most of the plots; where they are not clearly visible, it is because the standard deviation is very small. We could not include error bars in the Memorization-Recovery-Random Prediction charts due to their structure (each bar is divided into three regions), but we will provide them in tabular form in the Appendix of a revised version.
Q4: We used Adam as the optimizer. Following the reviewer’s suggestion, we also evaluate our setup with the Muon optimizer. With Muon as well, we observe that LN impacts memorization in the Post-LN model (BERT) while its removal impairs learning in the Pre-LN model (GPTNeo), as shown below:
| Model | Setting | Learning (%, ↑ is better) | Memorization (%, ↓ is better) | Recovery (%, ↑ is better) | Random Prediction (%, ↓ is better) |
|---|---|---|---|---|---|
| Post-LN (BERT) | Before | 91.95 | 100 | 0 | 0 |
| Post-LN (BERT) | After | 92 | 25 | 62.5 | 12.5 |
| Pre-LN (GPTNeo) | Before | 91.7 | 100 | 0 | 0 |
| Pre-LN (GPTNeo) | After | 85.55 | 32.5 | 29.38 | 38.12 |
“Random question…”: That is an interesting and thoughtful point to consider. It sounds plausible that, while memorizing a label, the model may rely on certain tokens, which could lead to disproportionately large activations. That said, since this is not something we have examined rigorously, we may not be best placed to answer the question. We leave exploring this point to the reviewer and other interested researchers, and hope to see insights on it at upcoming conferences. Thank you again for extending thoughts that build on our work.
We hope our response resolves your concerns, and helps you consider raising the score. We would be happy to answer any more questions that you may have. Thank you.
Thank you authors for the rebuttal. Authors have addressed all my concerns.
In the revised version, please include results on different noise levels, next-token prediction task and Muon optimizer.
Overall, this paper is interesting, well executed, and constitutes a good scientific contribution in my opinion. I will increase my score to 5 (Accept).
We would greatly appreciate it if the reviewer could review our rebuttal and let us know if our responses address their concerns. We would be glad to provide additional clarification or engage in further discussion as needed.
We thank the reviewer for recognizing the value of our work. We will include the additional results as suggested by the reviewer in the final version of the paper. Thank you for your time and effort in reviewing our paper.
The paper investigates how Layer Normalization (LN) influences label memorization and learning in transformers configured with pre-LN and post-LN placement across 13 models (7 pre-LN, 6 post-LN) and six language and vision classification datasets. This paper shows that the LN parameter is crucial in pre-LN models, unlike post-LN models. Without LN, pre-LN models suffer large drops in test accuracy, indicating that LN is essential for learning stability and mitigating memorization. However, the removal of the LN parameter in post-LN models suppresses memorization and recovers genuine labels. This paper also argues that early LN is crucial and explains those phenomena with the gradient of each LN.
Strengths and Weaknesses
Strengths
- S1: The paper is easy to follow, and the argument has not been studied before.
Weaknesses
- W1: It would be ideal to run a controlled experiment using a single model and learning objective, since the outcomes can be affected by various hyperparameters.
- W2: I’m not clear on how gradients alone can measure a model’s learning versus its tendency to memorize labels, and there’s no clear explanation regarding this (lines 131-132). Also, since each example yields multiple LN gradient values (about 2N for N layers), I’d expect both the variance and the mean of those gradients across samples to be quite large, so they can’t reliably indicate learning or memorization on their own.
- W3: Arguing that LN in pre-LN helps stabilize training only by showing the accuracy gap before and after removing LN sounds like an overclaim to me. It would be great to show the loss and gradient-norm curves to see how spiky the training was.
Questions
- Q1: The paper says removing the early layer of LN significantly disrupts performance. I suspect this may stem from attention entropy collapse [1]. It would be informative to compare attention entropy in pre-LN models with and without LN and to examine how those entropy changes correlate with model accuracy and label memorization.
- Q2: Once the model has learned all data patterns, its logit vector norm continues to grow, driving the cross-entropy loss down excessively and leading to memorization of the training set [2]. This paper also shows that applying simple logit vector normalization markedly boosts out-of-distribution performance. I’m interested in how the logit norm behaves with versus without LN and how that correlates with your empirical observations.
Limitations
The paper properly addresses its limitations. But I think this paper would be more valuable if it also dealt with a more realistic scenario; one example would be the impact of LN in a causal LM (with a language-modeling task) under non-iid text data.
Final Justification
The authors added controlled experiments comparing five models under identical data and hyper-parameter settings, consistently reproducing the effect of LN placement, and showed that LN-input gradient norms validly separate learning from memorization. They further extended the study to attention entropy, logit norms, and a generative next-token prediction task, fully addressing W1, W2, and key questions Q1 & Q2. Loss and global gradient-norm plots are still pending, but the empirical evidence now appears solid. I therefore raise my rating.
Formatting Concerns
None
We thank the reviewer for their thoughtful and constructive feedback. Below, we address their questions in detail and cover the requested additional experiments to the best of our ability within the limited time frame.
Weaknesses:
W1: To address the reviewer’s request, we include additional results so that, on the Emotions dataset, we can compare three Post-LN models (BERT, DeBERTa, and ELECTRA) and two Pre-LN models (GPT-Neo and Qwen2) under the same data and training settings.
| Model | Setting | Learning (%, ↑ is better) | Memorization (%, ↓ is better) | Recovery (%, ↑ is better) | Random Prediction (%, ↓ is better) |
|---|---|---|---|---|---|
| Post-LN (BERT) | Before | 92.30 | 100 | 0 | 0 |
| Post-LN (BERT) | After | 91.95 | 19.16 | 72.5 | 8.33 |
| Post-LN (DeBERTa) | Before | 92.58 | 100 | 0 | 0 |
| Post-LN (DeBERTa) | After | 91.98 | 21.66 | 70.41 | 7.91 |
| Post-LN (ELECTRA) | Before | 93.20 | 100 | 0 | 0 |
| Post-LN (ELECTRA) | After | 92.70 | 17.5 | 68.75 | 13.75 |
| Pre-LN (GPTNeo) | Before | 90.98 | 100 | 0 | 0 |
| Pre-LN (GPTNeo) | After | 86.16 | 73.75 | 14.79 | 11.45 |
| Pre-LN (Qwen2) | Before | 93.25 | 100 | 0 | 0 |
| Pre-LN (Qwen2) | After | 89.4 | 97.50 | 0.62 | 1.88 |
The results again confirm our claims that for Post-LN models, LN removal mitigates memorization, and for Pre-LN models, LN removal impairs learning.
W2: Thank you for this important point. We use gradient norms as a proxy for how LN removal would impact memorization and learning in Pre- and Post-LN models. The gradient of the loss with respect to each LN’s input shows, per sample, how strongly that LN affects the loss; computing it separately over clean and noisy samples characterizes learning and memorization, respectively. Our results consistently show that in Pre-LN models the learning gradients dominate, while in Post-LN models the learning and memorization gradients are of comparable magnitude. This offers an explanation for the divergent impact of LN removal in Post-LN models (mitigates memorization) and Pre-LN models (impairs learning). Please note that, while individual gradients may vary, averaging across many samples yields stable trends that align with our empirical observations on LN removal.
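For concreteness, the proxy can be sketched as follows (an illustrative implementation assuming batches of (inputs, labels) that the model accepts directly; running it once on clean batches and once on noisy batches yields the learning and memorization gradient norms discussed above, whereas our exact formulation is given in Eq. 5 of the paper):

```python
import torch
import torch.nn as nn
from collections import defaultdict

def ln_input_grad_norms(model, batches, loss_fn):
    """Average gradient norm of the loss w.r.t. each LayerNorm's input."""
    norms, hooks = defaultdict(list), []
    for name, module in model.named_modules():
        if isinstance(module, nn.LayerNorm):
            def hook(mod, grad_in, grad_out, name=name):
                # grad_in[0] is the gradient of the loss w.r.t. the LN input
                if grad_in[0] is not None:
                    norms[name].append(grad_in[0].norm().item())
            hooks.append(module.register_full_backward_hook(hook))
    for inputs, labels in batches:
        model.zero_grad()
        loss_fn(model(inputs), labels).backward()
    for h in hooks:
        h.remove()
    return {name: sum(v) / len(v) for name, v in norms.items()}
```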
W3: We agree that a loss analysis could further support our claims. However, we have included test accuracy and memorization dynamics over epochs (Fig. 2), which already reflect training instability after LN removal in Pre-LN models: test accuracy drops and does not recover over the course of training. That said, we will try to include loss-analysis plots in the final version of the paper per your request.
Questions: We appreciate the reviewer’s inquisitiveness about our paper. Since the reviewer has not provided the exact papers they are referring to, we present results for Q1 and Q2 to the best of our understanding within the limited time frame.
Q1: We compute attention entropy across layers in the Pre-LN models GPTNeo (12 layers) and Qwen2 (24 layers), before and after LN removal. We report the mean and standard deviation over all samples for each layer.
| Layer | GPTNeo, Before | GPTNeo, After | Qwen2, Before | Qwen2, After |
|---|---|---|---|---|
| L1 | 5.03±0.02 | 5.00±0.03 | 5.99±0.01 | 5.92±0.01 |
| L2 | 4.90±0.01 | 4.83±0.03 | 5.99±0.01 | 5.97±0.01 |
| L3 | 5.06±0.02 | 5.01±0.03 | 6.00±0.01 | 6.00±0.01 |
| L4 | 4.88±0.01 | 4.84±0.02 | 5.96±0.01 | 5.99±0.01 |
| L5 | 5.04±0.02 | 5.00±0.03 | 5.97±0.01 | 5.99±0.01 |
| L6 | 4.87±0.02 | 4.83±0.03 | 5.95±0.01 | 5.99±0.01 |
| L7 | 5.05±0.02 | 5.01±0.03 | 5.96±0.01 | 5.99±0.01 |
| L8 | 4.88±0.02 | 4.84±0.02 | 5.97±0.01 | 5.99±0.01 |
| L9 | 5.04±0.02 | 5.00±0.03 | 5.98±0.01 | 5.99±0.01 |
| L10 | 4.85±0.02 | 4.84±0.02 | 5.96±0.01 | 6.01±0.01 |
| L11 | 5.03±0.02 | 5.00±0.03 | 5.95±0.01 | 5.99±0.01 |
| L12 | 4.86±0.02 | 4.83±0.03 | 5.95±0.01 | 6.01±0.01 |
| L13 | - | - | 5.97±0.01 | 6.00±0.01 |
| L14 | - | - | 5.96±0.01 | 6.01±0.01 |
| L15 | - | - | 5.96±0.01 | 6.00±0.01 |
| L16 | - | - | 5.95±0.01 | 6.00±0.01 |
| L17 | - | - | 5.96±0.01 | 6.01±0.01 |
| L18 | - | - | 5.95±0.01 | 6.00±0.01 |
| L19 | - | - | 5.95±0.01 | 6.01±0.01 |
| L20 | - | - | 5.96±0.01 | 6.01±0.01 |
| L21 | - | - | 5.95±0.01 | 6.01±0.01 |
| L22 | - | - | 5.95±0.01 | 6.02±0.01 |
| L23 | - | - | 5.93±0.01 | 6.01±0.01 |
| L24 | - | - | 5.97±0.01 | 5.98±0.01 |
The results show no significant drop in attention entropy, indicating that performance degradation is not due to entropy collapse. While entropy remains stable, learning is still impaired, suggesting there might be other mechanisms at play, making it an interesting direction for future work.
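For reference, the per-layer entropy is computed essentially as sketched below (assuming attention probabilities of shape (batch, heads, query, key), e.g. obtained from a Hugging Face model with output_attentions=True; an illustrative snippet, not our exact analysis code):

```python
import torch

def attention_entropy(attn_probs: torch.Tensor) -> float:
    """Mean entropy of the attention distributions of one layer.

    `attn_probs` has shape (batch, heads, query_len, key_len), rows summing to 1.
    """
    eps = 1e-12
    ent = -(attn_probs * (attn_probs + eps).log()).sum(dim=-1)  # (batch, heads, query)
    return ent.mean().item()
```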
Q2: We analyze the norm of the logit vectors for both Pre-LN and Post-LN models, before and after LN removal.
| Dataset | Model | Setting | Logit Norm (Mean±Std) |
|---|---|---|---|
| Emotions | Post-LN (BERT) | Before | 9.34±1.22 |
| Emotions | Post-LN (BERT) | After | 8.80±1.15 |
| Emotions | Pre-LN (GPTNeo) | Before | 15.23±2.71 |
| Emotions | Pre-LN (GPTNeo) | After | 8.14±2.31 |
| News | Post-LN (ELECTRA) | Before | 9.72±0.96 |
| News | Post-LN (ELECTRA) | After | 8.75±1.61 |
| News | Pre-LN (Qwen2) | Before | 11.40±4.37 |
| News | Pre-LN (Qwen2) | After | 6.51±3.18 |
In Post-LN models (BERT, ELECTRA), logit norms remain stable after LN removal, consistent with our observation that their learning dynamics are not substantially disrupted. In contrast, logit norms in Pre-LN models decrease after LN removal, indicating impaired learning and higher prediction uncertainty. This further aligns with our gradient analysis, which shows that learning is more dependent on LN in Pre-LN architectures.
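The logit-norm statistics above are computed essentially as follows (a minimal sketch assuming a model that returns raw class logits for a batch; not our exact evaluation code):

```python
import torch

def logit_norm_stats(model, loader):
    """Mean and standard deviation of the L2 norm of the logit vectors."""
    norms = []
    with torch.no_grad():
        for inputs, _ in loader:
            logits = model(inputs)          # assumed to return raw class logits
            norms.append(logits.norm(dim=-1))
    norms = torch.cat(norms)
    return norms.mean().item(), norms.std().item()
```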
Limitations: We acknowledge the reviewer’s comment and carry out a next-token prediction task on the Emotions dataset for both BERT (Post-LN) and GPTNeo (Pre-LN). We modify the classification dataset as follows: original text + “This emotion is [type]”. The model then predicts the [type] token, which can be one of the 6 emotions. To induce noisy labels in this language-modeling setup, we change the [type] token to a random different emotion for 1% of the training samples. Under this setup, we train two variations of both the Pre- and Post-LN models, (1) without removing LN and (2) with LN removed, and provide the results below:
| Model | Setting | Learning (%, ↑ is better) | Memorization (%, ↓ is better) | Recovery (%, ↑ is better) | Random Prediction (%, ↓ is better) |
|---|---|---|---|---|---|
| Post-LN (BERT) | Before | 92.14 | 100 | 0 | 0 |
| Post-LN (BERT) | After | 91.95 | 28.12 | 60.62 | 11.25 |
| Pre-LN (GPTNeo) | Before | 91.7 | 100 | 0 | 0 |
| Pre-LN (GPTNeo) | After | 85.55 | 51.25 | 19.38 | 29.38 |
We observe that even for this language-modeling task, LN impacts memorization in the Post-LN model (BERT) while its removal impairs learning in the Pre-LN model (GPTNeo). This verifies that results on a generative task also corroborate the claims of the paper.
We hope our response resolves your concerns and questions, and helps you consider raising your score. If there is anything that we can further clarify, please let us know. We appreciate your time and effort again.
Thank you for the thorough rebuttal and additional experiments. Your controlled comparisons and analyses resolve my main concerns (W1, W2, Q1, and Q2). Although the loss and global gradient-norm curves are still pending, the current empirical evidence appears solid. I have therefore raised my score and encourage you to include the promised training-dynamics plots in the final version.
We would appreciate it if the reviewer could take a look at our rebuttal and let us know whether it sufficiently answers their questions. We will be happy to elaborate on any point that may require more detail.
We thank the reviewer for recognizing the value of our work. We appreciate the time and effort they have put into reviewing our paper. We will add the plots as suggested by the reviewer in the final version of the paper. Thank you.
This paper discusses the impact of LN layers on the training dynamics of transformers in a supervised classification setup. It systematically analyzes the role of Layer Normalization showing that removing LN parameters:
- destabilizes learning and increases memorization in Pre-LN models and
- suppresses memorization and aids recovery of true labels in Post-LN models.

The reviewers have praised this work for its simplicity and clarity. The rebuttal successfully addressed most of the reviewers' comments.
My only concern with this work is the discrepancy between its potential breadth of impact and the thoroughness of its evaluation. The work has broad impact, as transformers are ubiquitous in modern machine learning (vision, language, speech, audio). The paper does a decent job of providing empirical evidence across several domains (vision and language). However, as noted by reviewers Fd4h and rmJJ, the evaluations should include proper language modeling and larger-scale datasets. The authors provided experiments on a "generative" variant of the Emotions dataset, but I think the evidence would be much stronger on proper language modeling, measured using perplexity. Language modeling with transformers has an abundant literature with decent mid-scale benchmarks from the pre-LLM era.
Overall, despite the point raised above, this is a good paper. I recommend accepting the paper as a poster.