$\boldsymbol{\mu}\mathbf{P^2}$: Effective Sharpness Aware Minimization Requires Layerwise Perturbation Scaling
Abstract
Reviews and Discussion
This paper proposes μP^2, which is an effective way to scale the perturbation radius of SAM for each layer so that the optimal hyperparameters (learning rate η and perturbation radius ρ) transfer across different widths. Experiments show that this approach indeed allows the transfer of the optimal η and ρ across widths.
Strengths
- The idea is original, significant and seems to work as shown by the experiments
Weaknesses
To me the main issue is presentation:
- The writing style is quite verbose which makes it hard to remember all the details and the main message of the paper.
- There is a lack of figures to help the reader understand the theorems and the intuition of the paper.
- The paper is written in a way that assumes the readers are very familiar with Tensor Programs.
- Some parts of the paper are quite confusing. For instance, I found myself repeatedly rereading the part from Line 183 to 196 about non-vanishing perturbations vs effective perturbations, and I still cannot tell the difference between the two.
- The definitions are also incomplete. For instance, Definition B.1 in the appendix includes a clear definition for some of the notation but not all of it, and I tried my best to guess what the undefined symbols mean but failed to do so. Please give a clear and complete definition of each notation for such a math-heavy paper. Also, please provide a table summarizing all the notations used in the paper.
- All the definitions are provided in the appendix (which is optional), which means readers cannot comprehend the paper without reading the appendix.
- After reading through the paper multiple times, it is still unclear to me how I can set the perturbation radius for each layer.
Questions
- Can you provide a more intuitive way to explain μP^2? To me there must be an easier way to present this idea.
- Exactly how can I set the perturbation radius based on μP^2? Can you provide an example for this?
Limitations
The authors somewhat address the limitations in the last section.
Thank you for carefully reading our paper, providing detailed feedback, and helping us improve the presentation of our results. We take your concerns about clarity and presentation seriously, as we would like a large audience to be able to appreciate our results. If we are able to address some of your concerns and manage to improve clarity, we kindly ask you to consider updating your score, as you judged the content overall positively as ‘original’ and ‘significant’.
Lack of figures. We will include a figure that visualizes the phases of all bcd-parameterizations. After fixing the initialization and learning rate scalings, there is indeed a simple way to visualize all unstable, effective SGD, perturbation non-trivial and effective perturbation regimes by drawing a quadrant in a 2D plane (see the pdf attached to the global response).
Familiarity with Tensor Programs. As our theory relies on Tensor Program (TP) theory, including it is hard to avoid. However we agree that readers unfamiliar with it should also be able to intuitively understand the results. We will try to reduce the focus on TPs in the main body of the updated manuscript, and highlight spectral and intuitive conditions more (see also the last paragraph in this response, the answer to reviewer egBf and our global response).
Distinction between non-vanishing and effective perturbations. For the reader’s convenience, we will also include a spectral condition on the weight perturbations for all layers l that essentially states that the effect of the weight perturbation on a layer’s output should be of the same scale as the original output of that layer. We will highlight that non-vanishing perturbations are achieved if and only if at least one layer is effectively perturbed. However, we want all layers to be effectively perturbed; otherwise, a layer’s perturbation could be set to 0, which would save computation. Hence, non-vanishing perturbations are not the correct notion to measure, but effective perturbations are, if we want to achieve the best layerwise perturbation scaling for SAM. Effective perturbations really measure an individual layer’s perturbation effect on the output. We hope this resolves the confusion.
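One hedged way to write down the distinction sketched here (our own formalization, not necessarily the exact condition that will appear in the revision): denoting a layer's pre-perturbation output by W_l x_l and its SAM weight perturbation by δW_l,

```latex
% Effective perturbation of layer l (our hedged formalization): the perturbation's
% effect on the layer output is of the same order as the output itself,
\|\delta W_\ell\, x_\ell\| \;=\; \Theta\big(\|W_\ell\, x_\ell\|\big) \quad \text{as the width } n \to \infty,
% whereas non-vanishing perturbations only require the *network output* to be
% perturbed by Theta(1), which already holds if a single layer satisfies the above.
```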
Lacking definitions and notation table. We agree that we should provide a complete list of definitions used in the paper, and will do so in the updated version of the manuscript. We also love the idea and will include a table that summarizes the notation. Just like the proofs have to be moved to the appendix, the full formal definitions are quite spacious and can therefore not be included in the main paper, whereas we try to give a shorter and more intuitive presentation of the relevant terms in the main text.
How to use μP^2 in practice. We hope the following changes facilitate the understanding and adoption of μP^2 in practice. We will:
(1) explain in the main text how to set layerwise learning rates and perturbation radii to achieve μP^2,
(2) make Pytorch code to reproduce all of our experiments publicly available upon acceptance,
(3) refer to the pseudocode in Appendix E.8 and rewrite that appendix for improved clarity to provide alternative implementations and perspectives, using the mup-package and the spectral perspective.
Concretely, concerning (1), each layer behaves either input-like, hidden-like or output-like. At width n, μP^2 is implemented by scaling the learning rate in each layer by a width-dependent factor that can be read off from the table below. In addition, our results show that for SAM in μP^2, the global perturbation radius ρ should also be rescaled with width, and in SAM’s weight perturbation step, the layerwise gradients should be multiplied by a width-dependent scalar that can likewise be read off from the table below (as was provided in Table 1 (right) for several variants of SAM).
| | Input-like | Hidden-like | Output-like |
|---|---|---|---|
| Learning rate scaling factor | | 1 | |
| Perturbation scaling factor | | | |
Now, to apply μP^2, we will explain the following steps in the main text (a code sketch is given after the list below):
- Parameterize the network and the SAM update rule according to μP^2 as explained above.
- Tune the learning rate and perturbation radius on a small model.
- Train the large model only once using the optimal learning rate and perturbation radius from the small model.
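As a concrete illustration of these steps, below is a minimal PyTorch-style sketch of a single SAM step with layerwise scaling (our own illustration, not the authors' released code). The helpers `lr_factor`, `pert_factor` and `rho_scale` are hypothetical placeholders for the width-dependent factors that the paper's Table 1 prescribes for input-like, hidden-like and output-like layers.

```python
# Minimal sketch (our illustration, not the authors' code) of one SAM step with
# layerwise perturbation scaling. The factors returned by `lr_factor`, `pert_factor`
# and `rho_scale` are hypothetical placeholders for the width-dependent values
# from the paper's Table 1.
import torch
import torch.nn as nn

def lr_factor(name: str, width: int) -> float:
    # Placeholder: layerwise learning-rate scaling factor from the table.
    return 1.0

def pert_factor(name: str, width: int) -> float:
    # Placeholder: layerwise perturbation scaling factor from the table.
    return 1.0

def rho_scale(width: int) -> float:
    # Placeholder: width-dependent scaling of the global radius rho.
    return 1.0

def sam_step(model, loss_fn, batch, lr, rho, width):
    x, y = batch
    # First forward/backward pass: gradients at the current weights.
    model.zero_grad()
    loss_fn(model(x), y).backward()
    grads = {name: p.grad.detach().clone() for name, p in model.named_parameters()}
    # Global gradient norm (SAM's denominator couples all layers).
    denom = torch.sqrt(sum(g.pow(2).sum() for g in grads.values()))
    # Ascent step: perturb each layer with its own scaling factor.
    with torch.no_grad():
        for name, p in model.named_parameters():
            p.add_(rho * rho_scale(width) * pert_factor(name, width) * grads[name] / denom)
    # Second forward/backward pass: gradients at the perturbed weights.
    model.zero_grad()
    loss_fn(model(x), y).backward()
    # Descent step: undo the perturbation, then apply the layerwise-scaled SGD update.
    with torch.no_grad():
        for name, p in model.named_parameters():
            p.sub_(rho * rho_scale(width) * pert_factor(name, width) * grads[name] / denom)
            p.sub_(lr * lr_factor(name, width) * p.grad)

width = 256
model = nn.Sequential(nn.Linear(32, width), nn.ReLU(), nn.Linear(width, 10))
batch = (torch.randn(64, 32), torch.randint(0, 10, (64,)))
sam_step(model, nn.CrossEntropyLoss(), batch, lr=0.1, rho=0.05, width=width)
```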
More intuitive way to present μP^2. We find ourselves unable to replace Tensor Programs in our analysis, but the spectral perspective alluded to by Reviewer egBf provides a complementary way to understand our results. However, the analysis of SAM is necessarily more complicated than that of SGD or Adam due to layer coupling through the joint gradient normalization in SAM’s denominator. See our response to Reviewer egBf for more details. Related literature has shown that a careful signal propagation analysis, both forward and backward, is necessary in order not to get stuck in an analysis at or close to initialization like the NTK (Jacot et al., 2018), because of scaling mismatches. While writing out the computations for SAM learning using TP rules is quite technical, a TP analysis is conceptually a simple and reliable way to find the correct scalings: the TP framework intuitively just states that, relative to the size of the individual entries, a matrix-vector product introduces either a factor of √n when matrix and vector are sufficiently independent (a central-limit-theorem scaling) or a factor of n when they are sufficiently correlated (a law-of-large-numbers scaling). Wrong update or perturbation scalings can then be corrected with layerwise scalings of the learning rate and perturbation radius.
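As a numerical illustration of this dichotomy (our own sketch, not from the paper; the entries are taken of order one, before any 1/√n normalization used in practice):

```python
# Independent vs correlated sums of n terms: the first grows like sqrt(n) (CLT),
# the second like n (LLN). Ratios printed below stay roughly constant across n.
import numpy as np

rng = np.random.default_rng(0)
for n in [10_000, 100_000, 1_000_000]:
    w = rng.standard_normal(n)        # one row of a weight matrix, entries Theta(1)
    x_indep = rng.standard_normal(n)  # independent of w
    x_corr = np.sign(w)               # correlated with w (signs aligned)
    print(n,
          abs(w @ x_indep) / np.sqrt(n),  # ~O(1): independent sum scales like sqrt(n)
          (w @ x_corr) / n)               # ~O(1): correlated sum scales like n
```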
References:
Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks, NeurIPS 2018.
Thank you for the rebuttal. Since the issue of presentation remains, I keep my score as is.
Besides adding more figures to improve the readability of the paper, one crucial advice I have is that the authors should annotate each term in each equation with their intuitive meaning (using color, underbrace, etc...) to make the equations easier to understand. A good example of this practice is [1]. Furthermore, after each proposition or theorem, the authors should provide an interpretation of the theorem and its role with respect to the global topic of the paper.
Can you clarify exactly what is happening in the figure in the PDF attached to your global rebuttal? There is no caption in that figure.
[1] Kingma et al. Understanding Diffusion Objectives as the ELBO with Simple Data Augmentation. NeurIPS 2023.
Thank you for your help in improving the reception of our results. Below, we summarize the key changes we plan to make that aim to improve clarity incorporating suggestions from all the reviewers:
- Colors and attached figure. We like the idea of using colors to make equations more digestible, and will do this in particular when distinguishing between vanishing (red), non-vanishing (darkened yellow) and effective perturbations (green), as in the figure in the attached pdf. The caption for the attached figure will read something similar to this:
“(Phase characterization of bcd-parameterizations) Given a choice of layerwise initialization and learning rate scalings, the maximal perturbation scaling and the last-layer perturbation scaling completely determine whether a bcd-parameterization is unstable (grey), has effective SGD dynamics (red), effective perturbations in some but not all layers (yellow), or effective perturbations in all layers (green). In SP or NTP (left), there does not exist a choice of perturbation scalings that achieves effective perturbations in all layers, whereas in μP, there is a unique choice as provided in Theorem 11.” We hope this clarifies the figure.
- Using μP^2 in practice. As explained in our previous response, we will clearly state how to use μP^2 in practice. We will refer to the pseudo-code provided in the appendix, and upload open source code upon acceptance.
- Highlighting assumptions. We will further clarify the assumptions of limited batch size and training time, necessary in TP theory.
- Spectral perspective for a more intuitive derivation of corrected perturbation scalings. We will highlight the spectral perspective as an accessible way to derive the corrected layerwise perturbation scalings, both alongside our intuitive perturbation scaling condition after line 271 and in the introduction: after ensuring that SAM’s denominator is scaled to be width-independent, the perturbation numerator can be scaled like the updates. Considering a version of SAM without layer coupling as a first step, the correct perturbation scalings immediately follow from the condition that perturbations should scale like updates, which reduces the complexity during a first read. When discussing non-vanishing versus effective perturbations, we will add a spectral condition on the weights, as discussed above.
- Notation table. We will provide a complete set of definitions and a table summarizing all notation.
Unfortunately we are not allowed to upload a revised version of our submission in the current stage of the review process but we will be sure to improve the exposition of our paper following your recommendations. We are unsure how we can further address your concerns about the presentation at this stage.
We are happy to receive any other recommendations that the reviewer has for improving the accessibility of our paper.
This paper analyzes Sharpness-Aware Minimization (SAM) in the infinite-width limit using tensor program theory. The authors identify issues with standard SAM implementations in wide networks and propose a new parameterization called μP^2 to address these problems. They provide theoretical analysis and conduct extensive experiments on MLPs, ResNets, and Vision Transformers to validate their findings. The μP^2 parameterization is shown to achieve hyperparameter transfer across model scales and improve generalization performance.
Strengths
- The paper provides a rigorous theoretical analysis of Sharpness-Aware Minimization (SAM) in the infinite-width limit using tensor program theory. This extends the community's understanding of SAM's behavior in large neural networks.
- The paper identifies the degenerate issue with standard SAM implementations in infinite-width neural networks and proposes a new parameterization (μP^2) to address the problem.
- Extensive empirical experiments are conducted to validate the theoretical findings and demonstrate improved performance of μP^2.
Weaknesses
- The theoretical analysis extends tensor program theory; however, the authors did not clearly introduce abc-parameterization in the main body of the paper. It would be beneficial if the authors added a notation section and provided a more detailed comparison between abc-parameterization and bcd parameterization.
- The batch size of SAM is not adequately discussed in the main paper. For SAM with batch size m, it is called m-SAM, which is crucial in SAM's behavior [1].
- Layerwise perturbation scaling for SAM is not a novel concept.
References: [1] Foret, Pierre, et al. "Sharpness-aware minimization for efficiently improving generalization." arXiv preprint arXiv:2010.01412 (2020).
Questions
- Recent work has provided theoretical insights into Sharpness-Aware Minimization (SAM) beyond its initial sharpness-based interpretation [1,2,3]. Within the µP2 framework, what distinguishes SAM from Stochastic Gradient Descent (SGD)? Does the new framework also offer a fresh perspective on the generalization performance of SAM?
- Is µP2 capable of facilitating depth parameter transfer? [4]
References:
[1] Andriushchenko, M., et al. (2024). "Sharpness-aware minimization leads to low-rank features." Advances in Neural Information Processing Systems 36.
[2] Wen, Y., et al. (2024). "Sharpness minimization algorithms do not only minimize sharpness to achieve better generalization." Advances in Neural Information Processing Systems 36.
[3] Chen, Z., et al. (2024). "Why does sharpness-aware minimization generalize better than SGD?" Advances in Neural Information Processing Systems 36.
[4] Yang, G., et al. (2023). "Tensor programs VI: Feature learning in infinite-depth neural networks." arXiv preprint arXiv:2310.02244."
Limitations
Yes, the authors adequately addressed the limitations
Thank you for carefully reading our paper and providing detailed feedback. We are delighted about your overall positive evaluation of our work. If we are able to address some of your concerns, we kindly ask you to consider updating your score, as you positively alluded both to our rigorous theoretical analysis, which extends the community’s understanding of SAM and improves over standard SAM, and to the extensive experiments we conducted.
Discussion of abc-parameterizations. The footnote on page 5 discussed abc-parameterizations and referred to Appendix E.7, where we provide a more detailed comparison. abc-parameterizations only differ from bcd-parameterizations in that they do not consider perturbation scalings, but introduce additional layerwise weight multipliers in the architecture. These weight multipliers result in equivalence classes of abc-parameterizations, out of which we pick the representative with weight multipliers equal to 1 in all layers. In this way, bcd-parameterizations without the perturbation scalings effectively recover all abc-parameterizations, reducing them to their essence: each abc-parameterization is effectively just a layerwise initialization and learning rate scaling. In the updated version of the manuscript, we will explain this in the main text and refer to Appendix E.7 for more details. We believe that omitting weight multipliers in the definition of bcd-parameterizations improves clarity, as we already have to introduce the unavoidable complication of perturbation scalings.
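As a possibly helpful pointer for readers (our summary of the abc-symmetry from Yang and Hu (2021), not a new result of the paper under review), the reason weight multipliers can be absorbed is the following invariance of the SGD training dynamics:

```latex
% abc-symmetry for SGD (Yang & Hu, 2021): writing each weight as W_l = n^{-a_l} w_l
% with initialization w_l ~ N(0, n^{-2 b_l}) and learning rate \eta n^{-c_l},
% the training trajectory of W_l is unchanged under
(a_\ell,\; b_\ell,\; c_\ell) \;\longmapsto\; (a_\ell + \theta,\; b_\ell - \theta,\; c_\ell - 2\theta)
\qquad \text{for any } \theta \in \mathbb{R},
% so one can always choose theta to set a_l = 0, i.e. remove the weight multiplier,
% leaving only a layerwise initialization and learning-rate scaling.
```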
Role of batch size. As for SGD, a fixed batch size is covered by our theory. Since small batches are particularly useful for SAM, we do not see additional value in studying the limit in which the batch size grows jointly with the width. In the updated version of the manuscript, we will include a comment that a fixed batch size is covered by our theory. In our experiments we make sure to always use batch size 64 on CIFAR10. As our focus is width-scaling, changing the batch size would introduce confounding effects. By achieving width-independent SAM dynamics, we expect a low batch size to also be beneficial for generalization in μP^2 at large width whenever it is at small width, but a systematic analysis and understanding constitutes an interesting avenue for future work.
Layerwise perturbation scaling is not novel, but its rigorous understanding is. We do not claim to be the first to propose layerwise perturbation scaling. Instead we aim to provide the first rigorous infinite-width theory that informs practice on how layerwise perturbations should be scaled without having to tune all layers individually, which would be much more costly. This paper rigorously resolves the question of how exactly layerwise perturbations should be scaled as we scale up model size. We are not aware of other work that makes meaningful progress in this regard, but would be very interested in further related work.
In bcd-parameterizations, what distinguishes SAM from SGD? In terms of understanding SAM, our scaling analysis has shown that standard scaling (1) becomes unstable if the perturbation radius ρ is held fixed while we scale up the width, (2) can at most perturb the last layer in wide neural networks and hence could instead be replaced by SGD (= set perturbations to 0) in all previous layers, and (3) can also recover width-independent perturbation dynamics (= effective perturbations) with the correct layerwise adjustment.
Generalization. We consciously analyze the SAM update rule without alluding too much to sharpness, as contributing to the discussion about the connection between sharpness and generalization is not our goal. Indeed, we do not make claims about generalization, just like other Tensor Program theory. Yang and Hu (2021) show that μP is necessary to achieve maximal stable updates in all layers in the infinite-width limit. If feature learning is necessary to achieve optimal generalization, then μP will outperform other parameterizations at large width. Similarly, our goal is width-independent perturbation scaling as a necessary requirement for effective SAM dynamics in the infinite-width limit. If SAM enhances generalization over its base optimizer, then μP^2 can be expected to outperform other parameterizations at large width. However, as we mentioned in the ‘Future work’ section, we agree that further insights into generalization are very relevant and interesting.
Depth transfer. While for SGD and Adam, depth transfer has been achieved in several papers with a simple 1/sqrt(depth) scaling of the residual connections and adapted learning rate scaling (see the extended related work in Appendix A), depth transfer for SAM remains open and is a question that we are currently working on, which lies beyond the scope of this paper. We will mention this question in the ‘Future work’ section.
References:
Greg Yang and Edward Hu. Feature learning in infinite width neural networks, ICML 2021.
Thank you for your detailed response. While my original assessment and score remain unchanged, I encourage the authors to incorporate this discussion into the revision.
Thank you for your thoughtful questions, feedback, and constructive criticism. We agree that incorporating much of the discussion will greatly benefit the clarity and readability.
The authors extend muP-based learning rate transfer to the extra gradient ascent step involved in the SAM algorithm. The authors present "tensor programs" theory to derive this scaling, and present convincing experiments showing that in their method, both step sizes in SAM transfer across width (learning rate and perturbation radius). In Figure 1, the authors' method also seems to get better test performance than the more naive methods.
Strengths
- the authors did an honest and thorough effort to port tensor programs theory to the case of SAM
- the scalings the authors worked out seem to work really well, both achieving stable step sizes across width, and achieving really good test performance
- I like the narrative, first showing that standard training does not effectively perturb all layers, then showing how to fix this
Weaknesses
I like the paper and think that it should probably be accepted. However, I can see weaknesses in terms of presentation and use of technical tools. I believe these weaknesses are severely limiting the user-friendliness of the paper. I believe that the current presentation will alienate 99.9% of the community, and addressing these issues would significantly strengthen the paper and make it useful to a significantly wider portion of the community. Since I feel these weaknesses are important, depending on the author response, I am willing to either lower or increase my score.
Analysis seems over-complicated
The authors present "tensor programs" theory to derive their scalings. However, it has now been shown that muP is equivalent to initializing matrices and scaling weight updates to have spectral norm proportional to sqrt(fan-out / fan-in). This feels like a dramatic simplification and is an extremely simple condition that is easy for almost everyone in the community to understand. I see that you discuss this condition in an appendix so are certainly aware of it. Could you clarify: is your paper just fixing SAM ascent steps to have spectral norm proportional to sqrt(fan-out / fan-in)? If so, I think it would be extremely helpful for the reader to state this clearly and concisely in the introduction of the paper. I would actually consider focusing the analysis around this condition, and relegating most or all of the tensor programs theory to the appendix.
Misleading theoretical statements
I feel like this paper has inherited some of the overly grandiose language from the tensor programs papers. For example, referring to your parameterization as the "unique" one that works feels misleading to me. Actually I can give you a variety of different ways of parameterizing layers that would do the trick. Also the statement "It is straightforward to extend our theory to any architecture that is representable as a NE⊗OR⊤ program including ResNets and Transformers" feels unhelpful. Can you either link directly to the appendix which does these "straightforward" extensions, or omit this sentence? By the way, if all you are doing is scaling the spectral norm of updates, then it's obviously straightforward to extend this to other layer types---but this does not need "NE⊗OR⊤" programs...
Missing related work
Could you take a look at this paper: https://arxiv.org/abs/2002.03432. It deals precisely with the question of ensuring that layers are "effectively perturbed" as you say, and it is prior work to muP. I believe that the analytical strategies developed in that paper could help simplify your work. The slight issue with that work is that Frobenius norms are used instead of spectral norms.
Questions
Some minor things to fix:
- in Figure 1, the legend item for muP-global is unclear since it doesn't look like a dashed line
- in Figure 2, why not just plot the relative change in weights in spectral norm?
Limitations
The authors do include some discussion of some limitations in the future work section.
We sincerely appreciate your thoughtful review of our paper. We are delighted about your positive evaluation of our work and are grateful for your insights which have been invaluable in enhancing the accessibility and clarity of our work.
Improving the presentation. In the main paper, after line 271, we provide an intuitive condition that perturbations should scale like updates under μP. From this condition, it follows that a layer is effectively perturbed iff the spectral condition holds for perturbations (in TP architectures). We agree with you that emphasizing the spectral perspective will improve the accessibility of the paper, and we will highlight it in the updated paper (see global response).
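For reference, the spectral scaling condition we are referring to (as formulated by Yang et al. (2023) for initialization and updates; the version for SAM's weight perturbations is our paraphrase of "perturbations should scale like updates"):

```latex
% Spectral scaling condition (Yang et al., 2023) for a layer W_l of shape fan_out x fan_in:
\|W_\ell\|_{2} = \Theta\!\Big(\sqrt{\tfrac{\mathrm{fan\_out}}{\mathrm{fan\_in}}}\Big),
\qquad
\|\Delta W_\ell\|_{2} = \Theta\!\Big(\sqrt{\tfrac{\mathrm{fan\_out}}{\mathrm{fan\_in}}}\Big),
% and, analogously, effective perturbations would require (our paraphrase)
\|\delta W_\ell\|_{2} = \Theta\!\Big(\sqrt{\tfrac{\mathrm{fan\_out}}{\mathrm{fan\_in}}}\Big).
```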
However, unlike for SGD or Adam, deriving correct layerwise perturbation scalings for SAM from the spectral condition is complex. This complexity arises from the gradient normalization (in Frobenius norm) in the perturbation's denominator, which couples all layer scalings and has been shown to be practically relevant [1]. To simplify the analysis, we require the perturbation's denominator to be Θ(1) in the definition of bcd-parameterizations (bcd-params). This allows the numerator to be scaled just like updates, under layer coupling constraints. Further updates to enhance clarity in terms of practical use can be found in our response to Reviewer dLWn.
Relevance of TP theory. While we agree that the spectral condition provides a useful and accessible perspective, note that a rigorous justification of the spectral condition [3] crucially relies on NE⊗OR⊤ program (TP) theory. We would also like to point out that a recent ICML 2024 workshop spotlight paper “On Feature Learning in Structured SSMs” shows that the spectral scaling condition fails to achieve feature learning in Mamba layers, which cannot be represented as TPs. This highlights that the spectral scaling condition is not universal and requires further validation of its underlying assumptions.
The significance of our TP theory extends beyond μP^2. It enables a comprehensive characterization of SAM's scaling behavior and allows for the analytical derivation of infinite-width limits for all bcd-params, including standard SAM. Furthermore, any scaling analysis for SAM introduces additional complexities over SGD/Adam, not only because of the layer coupling due to SAM's denominator. A priori, it is unclear how perturbations and updates interact, even to provide conditions for stable learning. Interestingly, our findings reveal that the conditions on perturbation scalings are largely independent of initialization and learning rate scalings, a result that only becomes apparent in retrospect.
Given these challenges, our TP analysis is conceptually simple: we write SAM learning (two forward and backward passes for each update) as a TP to understand how evaluating the gradient on gradient-perturbed weights affects the activations and output function and rigorously track all update and perturbation scalings. This allows us to derive rigorous statements under weak and simple assumptions.
Uniqueness claim. We study infinite-width limits of bcd-params under fixed depth and training time. In this setting, Theorem 11 indeed shows that μP^2 is the unique stable bcd-param (up to smaller last-layer initialization) that achieves both maximal stable updates and effective perturbations in all layers in the infinite-width limit. If the concern is about non-uniqueness related to equivalences (e.g., using weight multipliers), we addressed this in the paper: weight multipliers are covered in the footnote on page 5 and explained in detail in Appendix E.7. We will be sure to further emphasize this in the main text. If the reviewer has knowledge of other parameterizations to achieve effective perturbations, we are very interested in learning about them.
Extensions to other layer types. We will omit the word “straightforward” and just refer to the respective appendix, mentioning that many common layers either behave like input, hidden or readout. In the respective appendix, we will also explain in more detail why our theory and derived scalings extend to common layer types even under SAM's layer coupling.
Missing related work. Thank you for the missing reference. We will discuss it in the updated manuscript. The ideas of compositionality, perturbations and automatic update scaling are intriguing and related. While the analytical strategies might become helpful in developing simpler analyses in the future, we do not see an immediate path to derive SAM width-scaling analysis using these ideas. In particular, the assumption that perturbations are full rank is violated for gradient-based perturbations on small mini-batches (as used for SAM), and the condition number of random matrices explodes with increasing width already at initialization [2], which may render the upper bound in Theorem 1 quite loose at large width. In contrast, the assumptions required for our analysis are easy to understand.
Minor comments on Figures. We will correct the unclear legend for μP-global in Figure 1. Concerning Figure 2, we thought that the Frobenius norm tending to 0 is an even stronger statement: the effect of the perturbations on the activations vanishes even if we accumulate the perturbations over all directions. Arguably, this is unsurprising, as the two norms are equivalent because the gradient on a mini-batch of size 64 is always low-rank. We also found it striking that there is a width-independent limit for last-layer perturbations, even in Frobenius norm (explained in Appendix G.1).
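To spell out the norm-equivalence argument for dense layers (a standard linear-algebra fact, with the batch-size bound on the rank mentioned above):

```latex
% For a dense layer, the mini-batch gradient G_l = \sum_{i=1}^{B} \delta_i x_i^\top is a sum
% of B rank-one terms, so rank(G_l) <= B, and for any matrix of rank at most r,
\|G_\ell\|_{2} \;\le\; \|G_\ell\|_{F} \;\le\; \sqrt{r}\,\|G_\ell\|_{2}, \qquad r \le B = 64,
% i.e. Frobenius and spectral norm differ by at most the width-independent factor sqrt(B).
```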
References:
[1] Monzio Compagnoni et al. An SDE for modeling SAM: Theory and insights, 2023.
[2] Alan Edelman. Eigenvalues and condition numbers of random matrices, 1988.
[3] Yang et al. A spectral condition for feature learning, 2023.
Thanks for your reply. I'm re-considering my score. Some further questions and concerns based on your response:
"deriving correct layerwise perturbation scalings for SAM from the spectral condition is complex"---Doing explicit spectral normalization makes this trivial. Why not just switch the Frobenius normalization to spectral normalization?
"a rigorous justification of the spectral condition [3] crucially relies on NE⊗OR⊤ program (TP) theory"---Claiming that your approach is rigorous and another approach is non-rigorous without justification is not compelling to me. The spectral scaling condition makes clear and simple arguments with formally stated assumptions. In contrast, tensor program theory implicitly relies on the assumption that network width dominates both batch size and training time. I don't see why one approach is more rigorous than the other. Which analysis do you believe to be more generally applicable?
"a recent ICML 2024 workshop spotlight paper “On Feature Learning in Structured SSMs”---this paper is not publicly available. How I can access it? Based on your response here, I would guess that that paper is not doing explicit spectral normalization. Also, posting links in your rebuttal is against the conference rules.
"Given these challenges, our TP analysis is conceptually simple"---Unfortunately, I disagree here. And so does at least one other reviewer.
"If the reviewer has knowledge of other parameterizations to achieve effective perturbations, we are very interested in learning about them."---One example is that the TP stipulation that activation magnitudes don't blow up is actually more restrictive than necessary. It would be fine to have both activation magnitudes and updates blow up asymptotically at one layer so long as the next layer corrects for this. You'd just need to be sure to use a number system that can support this to avoid numerical overflow.
"the assumption that perturbations are full rank is violated for gradient-based perturbations on small mini-batches (as used for SAM), and the condition number of random matrices explodes with increasing width already at initialization"---thanks for looking at this paper. I feel that you could perhaps engage more meaningfully with the spirit of the work. I did mention in my review already that the "slight issue with that work is that Frobenius norms are used instead of spectral norms". It's a paper from 2020 and the community's understanding has improved since then. On the other hand, it's a non asymptotic analysis. Also your comment on condition numbers is incorrect. That depends on the choice of ensemble. Orthogonal random matrices have unit condition number at all widths.
Thank you for your thorough response and engaged discussion. As we try to clarify in our answer to your global response, we believe that we agree on most points. For example, we believe that the spectral perspective is a valuable perspective with potential for simplification and generalization, and in the statement you cite we also say that it is rigorously justified. To further improve the clarity of our paper, in our global answer, we promise both to further elaborate how to achieve effective perturbations with spectral arguments and to further clarify the limitation of our TP theory that width is assumed to dominate training time. Our global response also contains a discussion on replacing the Frobenius norm by the spectral norm and a clarification what we meant to say in our statement about rigor.
ICML Workshop paper. The paper can be accessed by contacting the workshop organizers. Thank you for reminding us that references as links are not allowed. That paper identifies Mamba’s selection mechanism as a crucial, non-standard architecture component that does not inherit feature learning when applying an unadapted spectral scaling approach. In this selection mechanism, some vectors simultaneously act as activations and as weights.
Activation blowup corrected by the following layer. We currently cannot think of a case in which inducing such blow up would be useful, as replacing the number system would be a major practical complication. We would hence argue that the stability constraint that prevents blowup anywhere in the network and that is common in TP literature is a reasonable constraint to pose. Under the stability constraint, our theorem statement that claims uniqueness is correct, as it is in other TP literature. Since we make all of our assumptions transparent, we plan to keep the current formulation unchanged.
Related work. As mentioned in our original response (shortly due to space constraints), we find the ideas of the paper intriguing and related, in particular the automatic correct update scaling, and these aspects are what we can discuss in the revision. Our point was that we were unable to directly rephrase our analysis that aims to explain the current common initialization practice that samples iid Gaussian entries, for which the condition number indeed blows up with width, in that paper’s terms. But the option of taking the orthogonal random matrix ensemble in conjunction with this paper’s ideas is intriguing when thinking about potential future initialization, training and scaling ideas.
"in the statement you cite we also say that it is rigorously justified" the problem I am highlighting is that you say that "a rigorous justification of the spectral condition [3] crucially relies on NE⊗OR⊤ program (TP) theory". I strongly disagree with this sentence, both in form and spirit. I think that we have a different understanding of mathematical rigor.
"We currently cannot think of a case in which inducing such blow up would be useful" I am not claiming it is useful. I am pointing out that you need to properly and clearly caveat any uniqueness claims, and clearly signpost their limitations.
Related work Thanks for engaging with this.
Sharpness Aware Minimization (SAM) improves performance across various neural architectures and datasets, but understanding its scaling behavior as models grow is crucial. This study examines the infinite-width limit of neural networks trained with SAM using the Tensor Programs framework. Findings show that in wide neural networks, SAM's dynamics effectively reduce to applying SAM only in the last layer. The authors propose a new parameterization, called maximal update and perturbation parameterization (μP^2), which ensures effective feature learning and perturbation across all layers. Experiments with MLPs, ResNets, and Vision Transformers confirm the method's effectiveness.
Strengths
- This paper is well-written.
- This paper offers a robust theoretical foundation, thoroughly explaining the concepts and methodologies used.
Weaknesses
- In the experiments, the authors should consider including more SAM-variant methods, such as ESAM[1] and GSAM[2].
[1] Efficient Sharpness-aware Minimization for Improved Training of Neural Networks.
[2] Surrogate Gap Minimization Improves Sharpness-Aware Training.
Questions
N/A
Limitations
Yes.
Thank you for carefully reading our paper. We are delighted about your overall positive evaluation of our work. As other reviewers have pointed out, this paper is already quite dense and contains extensive experiments. Hence we would like to defer experiments on further SAM variants to future work.
We are thankful for all of the thoughtful comments and constructive feedback to improve the clarity of our paper’s presentation. We are delighted to have received overwhelmingly positive feedback about our content and results, and we will do our best to improve the accessibility of the paper — in our own interest. The main changes in this regard include:
Spectral perspective versus TP theory. While Tensor Program (TP) theory plays a crucial role in our proofs (see our response to Reviewer egBf for more details), we will further highlight the spectral perspective to improve clarity and be accessible to a larger audience. Specifically, we plan to make the following concrete changes:
- We will introduce the spectral condition already in the introduction to intuitively explain effective perturbations.
- We will append the sqrt(fan-out / fan-in) scaling condition for perturbations to the intuitive perturbation scaling condition after line 271.
- When discussing non-vanishing versus effective perturbations, we will add the spectral weight scaling condition for all l.
Open source code. Upon acceptance, we will release open source code to reproduce all of our experiments and to provide another resource for understanding and experimenting with our layerwise perturbation scaling.
Phase characterization figure. We will provide another figure to visualize the regimes of stability, effective SGD dynamics, non-vanishing perturbations, and effective perturbations (see the attached pdf).
Further changes to improve clarity. In several places, we will rewrite paragraphs to improve their clarity. As motivated by Reviewer dLWn, we will describe how to use μP^2 in practice in the main paper; we will refer to the pseudo-code in Appendix E.8, and rewrite that appendix to more clearly state how to implement μP^2: first ensure that SAM’s denominator is scaled to be width-independent, then in the numerator the perturbations of each layer should be scaled like the layer’s updates. In the appendix, we will add a table summarizing all notation.
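As a rough schematic of this two-step recipe (our notation, not the paper's exact symbols; s_l denotes a hypothetical layerwise perturbation factor and the hat a width-rescaled denominator):

```latex
% Schematic layerwise-scaled SAM perturbation (our notation): with layerwise gradients g_l
% and a global radius rho,
\varepsilon_\ell \;=\; \rho\, s_\ell\, \frac{g_\ell}{\widehat{\|g\|}},
\qquad \widehat{\|g\|} = \Theta(1) \text{ (width-rescaled global gradient norm)},
% where each s_l is chosen so that epsilon_l scales with the width n exactly like
% the muP update of layer l, i.e. "perturbations scale like updates".
```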
Thanks to the authors for the response. I want to globally highlight issues I still see:
- the authors claim that tensor programs analysis is rigorous, and a spectral-norm based approach using upper bounds is non-rigorous and is only useful for intuition. I couldn't disagree more strongly with this statement. To make my point clear, tensor programs operates in an infinite width limit, relying in particular on assumptions that (1) batch size is constant and (2) width dominates batch size. Let's look at an actual training example that people care about. Consider Llama 13B (i.e. meta-llama/Llama-2-13b on HuggingFace). From the model card, this is "trained with a global batch-size of 4M tokens". Also, Linear layers in the MLP blocks have fan-in of 5120. In other words, standard transformer training setups strongly violate the assumptions of muP, and have batch size >>>> width. To give another example, here is another classic paper. This paper does algorithm design using upper bounds or "majorizations", and doesn't prove matching lower bounds. I suppose you also believe that this work is non-rigorous?
- the authors do not seem to be aware they can solve their problem in an arguably much simpler way, by doing explicit normalization rather than abc parameterizations. For instance, the authors point to Frobenius normalized SAM as being a major challenge for TP-theory. But then why not just switch the Frobenius normalization to spectral normalization? I believe that the normalization perspective would allow one to do scalable SAM with minimal code changes and a parsimonious theoretical explanation.
In conclusion, I want to clarify that I think the authors have done a good job. But I am challenging the authors to do two things:
- consider that Tensor Programs relies on the assumption of width >>>> batch_size, which is strongly violated in modern training setups
- put substantial effort into honing the exposition, which they have partially committed to, but I want to make sure they go all the way
Authors, depending on your engagement with these two critiques, I'm willing to either lower or increase my score
We thank the reviewer for the engaged discussion and appreciate their acknowledgement of our efforts to improve the accessibility of our paper. If it was not clear in our original response, we want to emphasize that we like the spectral perspective, which is why we will further highlight it in the revision. We believe that it has a potential for simplification and generalization.
Improving the exposition. In addition to the changes promised in our original global response, the following changes aim to further improve the clarity of presentation:
- In Section 4.2, we will clearly discuss the spectral perspective and how to achieve effective perturbations with spectral arguments in more detail.
- We will emphasize in the setup that TP theory relies on the assumption that width dominates training time and batch size.
- When discussing the simpler-to-analyse version of SAM without layer coupling (the gradient normalization applied layerwise instead of jointly), we will add ablations to study its ρ-transfer and generalization properties for preliminary insights into whether this version has practical potential.
Before we respond to the remaining comments, we aim to first clarify the primary objective of our paper.
Theoretical understanding vs algorithm design. Our main goal in this work is to analyze the infinite-width behavior of existing SAM update rules as they are implemented in practice. Specifically, we aimed to analytically derive infinite-width limits and identify desirable/undesirable behaviors of these rules. For example, we use TP theory to prove statements such as "muP with global perturbation scaling yields vanishing perturbations in all hidden layers in the infinite-width limit (Proposition 2)"
While the development of μP^2 is an interesting and extremely valuable outcome of this theoretical investigation, it is not our sole or primary objective. From the reviewer's comments, we infer that they may have assumed that our primary aim was to derive μP^2 or a scaling rule that allows a width-transferable perturbation radius.
In line with our previous response, we still maintain that deriving μP^2/the correct width-dependent perturbation scaling is possible via spectral scaling conditions, potentially supported by simpler analyses; that first scaling SAM's denominator to Θ(1) simplifies the analysis; and that in a version of SAM without layer coupling each layer can be individually scaled like updates (there would just be layerwise perturbation scalings outside of the normalization and no need for an additional global perturbation scaling). We will make sure that we further clarify these points in the revision.
However, we respectfully emphasize that our focus is primarily on advancing theoretical understanding, which we strongly believe to be inherently valuable. For example, our analysis can be used as a starting point to understand generalization properties of infinitely wide linear neural networks trained under SAM. Practical improvements like μP^2 emerge as secondary, albeit welcome, consequences of our analysis.
We hope this clarification helps better contextualize our work and the remainder of our response.
Statement about rigor. We are confused by the reviewer's comments on our statement about rigor. First, we want to emphasize that we have never stated or implied anywhere in our response or in any other form that either spectral normalization or algorithm design based on majorization constitutes a non-rigorous method. Second, could the reviewer kindly clarify which analysis with upper bounds (without matching lower bounds) is being referred to in this context? The spectral scaling paper does not use any upper bounds without matching lower bounds. It first uses some elementary arguments to intuitively motivate the claim that the spectral condition implies feature learning in the infinite-width limit and then uses TP theory to formally prove that muP is the unique abc-parameterization that satisfies the spectral condition, thereby formally connecting it to the infinite-width theory established in the TP framework. Therefore, we are very confused about which analysis the reviewer is referring to here.
To further clarify the statement in our original response: in the context of our paper, where the goal is to understand infinite-width behavior, we are interested in proving the following: “μP^2/spectral scaling implies effective perturbations in the infinite-width limit.” Our response aimed to state that the spectral condition paper [3, Appendix B] utilizes TP theory to make a similar statement about feature learning in the infinite-width limit and is therefore rigorously justified (in the aforementioned context). We want to emphasize that the essence of our response was to highlight the utility of TP theory. We did not claim that it is superior to some other analysis that motivates the spectral condition.
Batch size vs width. Our analysis which relies on TP theory indeed assumes that width dominates other quantities such as batch size and training time, that are typically large in practice. We will make sure that these assumptions are clearly highlighted in the main paper. We acknowledge that the batch size assumptions are not typically satisfied in LLM training. Nevertheless, these assumptions are typically met in vision tasks that SAM related work often focuses on. In our experiments, we use a batch size of 64 and widths ranging from 256 to 16384. This aligns with typical practices in SAM training (Foret et al., 2020; Müller et al., 2023), which has been shown to strongly benefit from smaller batch sizes (Foret et al., 2020; Andriushchenko and Flammarion, 2022).
Looking ahead, we see two important areas for future research in this direction: understanding the practical behavior of μP^2 in large batch settings (e.g., via LLM training) and developing theory for jointly increasing batch size with width — a challenging but very interesting question for future work.
Finally, we note that the spectral condition paper [3, Assumptions 3 and 4] also requires this assumption to ensure the low rank nature of the updates. Therefore, we are not sure why the reviewer raises this point in the context of comparing the spectral perspective with TP theory. We want to emphasize again that we aim to include a multitude of perspectives in our paper, without placing one perspective over another.
Replacing joint Frobenius by layerwise spectral normalization for simpler analysis. We would like to clarify that the added complexity in analyzing the original SAM update rule (which is our focus) stems from the layer coupling introduced by SAM's gradient normalization term, rather than the choice of norm. As our analysis assumes that batches remain Θ(1) (a common practice for SAM), switching between Frobenius and spectral norm does not alter the width-dependent gradient scalings.
Furthermore, we do not believe that the layer coupling poses a major hurdle for a simpler analysis. We were merely pointing out the additional complexity it adds over SGD/Adam. As we stated earlier, first scaling SAM's denominator to Θ(1) simplifies the analysis, and in a version of SAM without layer coupling each layer can be individually scaled like updates without any additional complications. We will further clarify this in the revision.
We do agree with you that adapting the standard SAM update rule by replacing the Frobenius gradient normalization by spectral normalization is a natural and promising idea for automatic effective perturbation scaling. We would be interested in a more thorough exploration of this idea in the future.
References:
Andriushchenko, Maksym, and Nicolas Flammarion. "Towards understanding sharpness-aware minimization." ICML 2022.
Foret, Pierre, et al. "Sharpness-aware minimization for efficiently improving generalization." ICLR 2021.
Compagnoni, Enea Monzio, et al. "An SDE for modeling SAM: Theory and insights." ICML 2023.
Müller, Maximilian, et al. "Normalization layers are all that sharpness-aware minimization needs." NeurIPS 2023.
"In our experiments, we use a batch size of 64 and widths ranging from 256 to 16384. This aligns with typical practices in SAM training (Foret et al., 2020; Müller et al., 2023), which has been shown to strongly benefit from smaller batch sizes (Foret et al., 2020; Andriushchenko and Flammarion, 2022)."
This statement is wrong. In the original SAM paper, they use a batch size of 4096 for their ImageNet experiments. What SAM actually benefits from is m-sharpness (Section 4 of Foret et al.), in which the large batch is split into multiple non-overlapping sub-batches, and the adversarial perturbation is calculated on each sub-batch individually.
"our focus is primarily on advancing theoretical understanding which we strongly believe to be inherently valuable" Great! I agree!
"The spectral scaling paper does not use any upper bounds without matching lower bounds" This is incorrect. First of all, equations 4 and 5 in the spectral paper are upper bounds without matching lower bounds, hence you can think of the spectral condition as following from a majorization-style analysis. And in that light your comment that "a rigorous justification of the spectral condition [3] crucially relies on NE⊗OR⊤ program (TP) theory" can be read as stating that algorithm design based on majorization is non-rigorous.
"Looking ahead, we see two important areas for future research in this direction: understanding the practical behavior of in large batch settings (e.g., via LLM training) and developing theory for jointly increasing batch size with width — a challenging but very interesting question for future work." Glad that you're engaging with my review and that it's suggesting directions for followup work.
"Replacing joint Frobenius by layerwise spectral normalization for simpler analysis." Euclidean normalization of the flattened weight vector seems to me like an incredibly unnatural thing to do to a neural net weight vector, and I hope the community can recognize this and move past it.
Final suggestions
I've run out of time to engage on this paper further. I've decided to keep my score of 6. I hope you see that I'm trying to review in good faith, and this is a good score based on the substantial amount of work I feel the paper needs to meet the interests of the community writ large. Please overhaul the paper to:
- simplify the exposition, following my and Reviewer dLWn's suggestions
- avoid grandiose language that hides limitations and alienates readership. For example any uniqueness claims or claims about rigor should be thoroughly justified. Maybe you can tell, but I personally found your claims about rigor to be short-sighted. I would choose to read a paper that makes simple arguments and makes its limitations clear every single time. At the end of the day, we're all trying to work out how to think about these topics rigorously, and I hope we get there together
Thank you for pointing this detail out. We follow the formulation by Andriushchenko and Flammarion (2022), who analyse the effect of the sub-batch size m. Equation (4) of that paper states that the m-SAM formulation leads to an update rule in which, importantly, the same mini-batch of size m is used in both the inner and the outer gradient step; this corresponds exactly to using batches of size m in the ascent and the descent steps. They showed under this formulation that using small batches is beneficial for generalization in SAM.
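For readability, here is a hedged reconstruction of the same-batch update rule described above (our notation B_t for the batch of size m at step t; the exact equation and normalization convention in Andriushchenko and Flammarion (2022) may differ):

```latex
% Same-batch (m-)SAM step with batch B_t of size m, step size gamma_t and radius rho_t
% (our reconstruction; normalization conventions vary across papers):
w_{t+1} \;=\; w_t \;-\; \gamma_t\, \nabla L_{B_t}\!\Big( w_t + \rho_t\, \tfrac{\nabla L_{B_t}(w_t)}{\|\nabla L_{B_t}(w_t)\|} \Big),
% where the same batch B_t appears in both the inner (ascent) and outer (descent) gradient.
```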
However, our perturbation scaling analysis also applies to the m-SAM formulation where different batch sizes are used for the ascent/perturbation and the descent/update step, because our analysis shows that the ascent and descent step can be decoupled. Since we are interested in perturbation scaling, the relevant batch size is the one used for computing the perturbation, and using a small perturbation batch size m has also been shown to be useful in this formulation (e.g. Figure 3 in Foret et al. (2021)). In this formulation, we should indeed correct the above statement to ‘SAM benefits from a small perturbation batch size m’.
In all of the SAM papers cited in our above response, perturbation and update batch sizes between 1 and 256 are used for CIFAR-10 and CIFAR-100. Larger update batch sizes are indeed sometimes used for ImageNet experiments but most of them still use a perturbation batch size of 128.
We apologize for the oversimplifying statement.
This reviewer accepts your apology. But your analysis still relies on being able to deal properly with BOTH the descent and the ascent step. It's this sort of thing that got me frustrated that you're arguing about rigor, while not being upfront about these sorts of limitations.
Thank you for appreciating our work. We will do our best to improve the paper’s accessibility and to clearly state the underlying assumptions as well as limitations.
- The spectral paper provides matching lower bounds. We are confused: the reviewer's statement that equations (4) and (5) from the spectral paper have no matching lower bounds appears incorrect. The paper makes the simplifying assumption that the terms in the expansion do not cancel each other and under this assumption provides matching lower bounds in eq. (8) for eq. (4), and, citing from the same paper: ‘Combining Equation (10) with our matching upper bound from Equation (5), we conclude that [...] as desired.’ Also in the appendix, via TP theory, these lower bounds are further supported.
- Our assumptions are aligned with standard practice in SAM. We are still baffled by the response from the reviewer about our statement on rigor. First, as we clarified in our response, our statement has been misinterpreted (we have not called the spectral approach non-rigorous, either in our response or in any other forum), so we would really appreciate it if the reviewer takes our clarifications into consideration.
Second, we were only trying to answer Reviewer dLWn’s comment about our “small batch size helps SAM” statement and not about our analysis. We already stated in our response that we use the formulation from Andriushchenko and Flammarion (2022) which uses the same batch size for ascent and descent and show that small batch size helps the generalization of SAM under this formulation. This training setting clearly aligns with the assumptions of our analysis. Furthermore, in papers that use the m-SAM formulation from Foret et al. (2021), slightly larger update batch sizes are indeed used for ImageNet experiments (not CIFAR-10 or CIFAR-100) but many of them still fall under the width > batch size setting. For example, Andriushchenko et al. (2023) use an update batch size 256 on TinyImageNet with ViT-B and ViT-L of width 768 and 1024, respectively. Müller et al. (2023) use update batch size 128 on ImageNet-1K for ViT-S. Accordingly, this setting is indeed aligned with the assumptions of our analysis. As mentioned in our previous response, we plan to clearly highlight the fixed batch size and training time assumptions in the revision.
References:
Andriushchenko, Maksym, et al. "Sharpness-aware minimization leads to low-rank features." NeurIPS 2023.
Andriushchenko, Maksym, and Nicolas Flammarion. "Towards understanding sharpness-aware minimization." ICML 2022.
Müller, Maximilian, et al. "Normalization layers are all that sharpness-aware minimization needs." NeurIPS 2023.
The authors write "we have not called the spectral approach non-rigorous", however they also state "a rigorous justification of the spectral condition [3] crucially relies on NE⊗OR⊤ program (TP) theory". This latter statement implies that the authors believe that, in the absence of Tensor Programs theory, the spectral approach is non-rigorous. Based on this statement, I inferred that you find the lower bounding arguments in the spectral paper non-rigorous. I'm sorry if I was mistaken on this. Could you please point out which step in the spectral paper is non-rigorous in the absence of Tensor Programs theory?
Given that I agree with reviewer dLWn that the presentation of this paper does not serve the interests of the wider community, I believe that the same practical performance can be achieved via much simpler techniques, and that substantial confusion still remains between the authors and the reviewers despite much back and forth, I think this paper would strongly benefit from a further round of review. I have changed my mind and I have decided to lower my score. Good luck to the authors, and I'm sorry if this isn't the outcome you wanted.
We regret to see that the reviewer seems to be letting personal feelings significantly affect their judgement. The initial review states ‘I like the paper and think that it should probably be accepted.’, ‘honest and thorough effort to port tensor programs theory to the case of SAM’ and ‘I like the narrative’. In a later global comment you say ‘I think the authors have done a good job’. We have promised to address the few addressable concerns, in particular further highlighting the spectral perspective and the discussed assumptions, which initially satisfied the reviewer enough to keep their initial score of 6.
Only after our last comment, in which we corrected the reviewer's factually incorrect statements about Yang et al. (2023) and argued that our analysis covers both the ascent and descent steps under realistic assumptions, while repeatedly promising to highlight the limited batch size assumption, did they reduce the score.
The whole discussion about whether the spectral perspective is rigorous or not is barely related to the paper under review, but it seems to have become a major personal concern in judging the value of our work, based on our use of the word ‘rigorous’. As we said from the beginning, we agree that the spectral perspective is both useful and rigorously justified in the context of infinite-width theory. The only point of contention seems to be whether this has been proven with or without TP theory.
Our promised changes demonstrate significant willingness to resolve their concrete and addressable concerns. Although the correctness, content, and results have not been majorly criticized by any reviewer, many of Reviewer egBf's remaining concerns, whether or not they relate to the paper under review, do not seem to be addressable, and the score was lowered only after our last comment.
In conclusion, we believe that the review violates several best practices outlined in the Reviewer Guidelines, which we will flag to the AC:
- Be fair. Do not let personal feelings affect your review.
- Be useful. Try to keep your feedback constructive when possible.
- Be flexible. The authors may address some points you raised in your review during the discussion period.
Dear egBf,
The dispute here is definitely a scientific one, no worries.
1/ Regarding the weakness you are heavily relying on now: would it not be acceptable for a paper on optimization for ML that uses variational inequalities to assume optimization knowledge of variational inequalities?
In this vein, we have already provided concrete ways of improving accessibility along the spectral perspective, even promising to define the big-O and Omega notations (a sketch of such definitions follows after this list). Is this not sufficient?
2/ If you are still not sure on what concrete basis we believe the spectral perspective is non-rigorous in the absence of TP or infinite-width theory, can you please respond to our comment below:
"the reviewer's statement that equations (4) and (5) from the spectral paper have no matching lower bounds appears incorrect. The paper makes simplifying assumptions that the terms in the expansion do not cancel each other and under this assumption provide matching lower bounds in eq. (8) for eq. (4), and, citing from the same paper: ‘Combining Equation (10) with our matching upper bound from Equation (5), we conclude that as desired.’ Also in the appendix, via TP theory, these lower-bounds are further supported."
In particular, please explain how the assumption of the spectral perspective that "the terms in the expansion do not cancel each other" can be supported in our minimax/SAM context without support from TP theory.
3/ Regarding our uniqueness claim, we can qualify it under restricted stability constraints, which should address your concern.
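As promised under point 1/, here is a hedged sketch of the kind of definitions we plan to add; they are stated for positive scalar sequences indexed by the width n, and the precise form in the revision (e.g., almost-sure or in-probability versions for random quantities) may differ.

```latex
% Sketch of the asymptotic notation to be defined in the revision,
% stated for positive scalar sequences (a_n), (b_n) indexed by the width n.
\[
  a_n = O(b_n) \iff \exists\, C > 0:\ a_n \le C\, b_n \ \text{for all sufficiently large } n,
\]
\[
  a_n = \Omega(b_n) \iff b_n = O(a_n),
  \qquad
  a_n = \Theta(b_n) \iff a_n = O(b_n) \ \text{and}\ a_n = \Omega(b_n).
\]
```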
Dear authors,
Thank you for responding in a conciliatory tone; I appreciate it. Given your commitment to qualify the uniqueness statement, I will increase my score by one point. Now let me respond to your further comments.
"Regarding the weakness you are heavily relying on now"---I want to point out that this is one of the main weaknesses pointed out in my original review. Furthermore, variational inequalities are an established piece of math, going back decades. I think it's fair to say that the theory in the Tensor Programs paper is significantly less "battle-tested", and in my opinion you should not expect your readers to know it. Especially when your paper is "applications focused"---studying SAM.
"the reviewer's statement that equations (4) and (5) from the spectral paper have no matching lower bounds appears incorrect" What I'm trying to say here is that the spectral paper provides matching lower bounds based on assumptions. But the TP theory which supposedly rigorously justifies these assumptions itself makes other assumptions, just not phrased as such. So if the act of making assumptions is non-rigorous, then neither approach is rigorous! As it happens, I would personally argue that the approach that makes its assumptions crystal clear is more rigorous, not less.
If you can engage with what I'm saying here, and demonstrate that you understand what I mean, I'd bump my score back up to 6.
Finally, I'd just ask the authors to remember that the reviewers are volunteers and also human beings. I'm currently at an uncertain stage of my career. I don't see what relevance mentioning your seniority has to getting a fair scientific review. Also I found the attempt to de-anonymise me honestly quite scary. I hope that you'll consider your conduct here as I have considered mine.
Dear authors,
First of all, thank you for the feedback. I take your concerns seriously and will separately contact the AC for feedback on my conduct. Since the reviews are all recorded, I would expect that if there has been a conduct violation it will be caught by the AC and dealt with.
I think that what might help is for me to explain why I believe the question of the perceived rigor of TP arguments versus the spectral perspective is central to my evaluation of your paper. First of all, both these topics are discussed in your paper. But second, both myself and reviewer dLWn found the reliance of your paper on Tensor Programs arguments to be a serious weakness. Reviewer dLWn wrote as a weakness that "The paper is written in a way which assumes that the readers are very familiar with Tensor program" and cited presentation issues as their main reason for a negative review. Therefore, if there is a simpler way to derive your results, it is of critical importance whether or not that approach is rigorous in the absence of Tensor Programs. After all this discussion, I am still not sure on what concrete basis you believe the spectral perspective is non-rigorous in the absence of TP or infinite-width theory.
I also want to note that I raised other issues with your technical results, for instance the uniqueness claim. When pushed by the authors, I provided a counterexample to the claim. Upon providing the counterexample, I then felt that the authors dismissed my concern rather than taking it seriously.
Although I praised the authors' work for some positive aspects, I made it clear in my review and repeated engagement with the authors that I would be prepared to either raise or lower my score depending on whether the authors could address my concerns. Due to the message that the authors were still "confused" and "baffled" (after I thought I had made my final assessment), I realized that the discussion period had not succeeded in bringing consensus between this reviewer and the authors. I lost trust that the authors could make the necessary changes, and therefore I felt it right to lower my score.
I do feel that the dispute here is a scientific one and not a personal one. I sincerely wish the authors all the best, and good luck.
Dear Reviewer egBf,
I would like to flag what the authors are perceiving from these discussions:
- Power play ("you agree with me or else") regarding a related paper, which we must evaluate in its spirit and not in what it actually contains. This kind of inflexible behavior is terrifying to new authors.
- A clear personal conflict conflated with our work and your review.
Dear Reviewer egBf,
We sincerely regret that this discussion has become so heated, as both sides seem to be very passionate about finding and understanding the correct way to scale neural network optimization. The misunderstandings seem to be mostly caused by differing attitudes toward achieving this goal: our goal was to analyze past and current SAM practices, while the reviewer takes a more visionary perspective on designing the optimization and scaling procedures of the future.
For our analysis, we decided to use TP theory, which comes with limitations that we will point out very clearly, but which still seems to us a legitimate approach to take, as our results demonstrate. We certainly agree that there is a lot of follow-up work to be done, in particular in understanding long training dynamics.
In our own interest, we will try to clarify the TP framework we use as much as possible, as well as highlight the practical and more intuitive ways to arrive at the corrected scaling. At the same time, we have tried to argue that some of the difficulties in our analysis stem from the layer coupling in the original SAM algorithm, which can only be resolved by changing the optimization algorithm, which again points to our differences in algorithm design versus analysis.
We may also have differing views along the theory-versus-practice dimension. Our statement on rigor was only related to the infinite-width theory we develop in this paper. We embrace a multitude of perspectives and agree that the approach of making empirically verifiable assumptions followed by logical arguments may yield further insights into long training dynamics than TP theory can, and, as previously stated, has potential for simplification and generalization. We just disagree that TP theory is not a valid approach for our analysis.
To engage with your comment about the spectral paper, we agree that all assumptions should be made crystal clear. The spectral paper does this, and the original TP papers also state their underlying assumptions, though they are perhaps harder to parse. To us, it appears that the spectral paper makes strictly stronger assumptions than the TP papers, including that width dominates training time, batch size, and depth. The spectral paper does not assume that the activation function is smooth (required by TP) but instead assumes that coordinate-wise non-linearities do not affect the scaling of vectors (Assumption 2).
Regarding whether the act of making assumptions itself makes things non-rigorous: we believe the answer to this question to be quite nuanced. In the extreme case, one can simply assume the result and then prove that the result holds; to our understanding of the word, this does not make the argument rigorous. Our statement “a rigorous justification of the spectral condition [3] crucially relies on NE⊗OR⊤ program (TP) theory” meant that the simplifying assumptions, like Assumption 1 and the line above eq. (10) of the spectral paper (used for the lower bounding arguments), can be proven using TP theory. This does not mean that these assumptions are not useful for understanding what is essentially happening. On the contrary, we strongly believe that the argumentation in the spectral paper, which arrives at the correct scalings from the assumptions made, is indeed logical and useful.
Overall, we hope that in a year both parties will be able to look back at this discussion with empathy for each other, as we pursue the same goal and seem to appreciate each other's efforts. Please feel free to score the paper as you genuinely feel it and the promised changes deserve.
Thanks, authors, for writing this. Given the level of thought you have now clearly put into these issues, I'm comfortable increasing my score. Let me just make a few replies for completeness:
- "some of the difficulties in our analysis stem from the layer coupling in the original SAM algorithm, which can only be resolved by changing the optimization algorithm, which again points to our differences in algorithm design versus analysis." I agree we seem to have different tastes when it comes to modifying the algorithm versus analyzing what is already there. My thought on this is that since our understanding of the subject matter is continually improving, it is actually unlikely that those "older" algorithms are doing things in the best way. As a theorist, I see an opportunity to improve both the core algorithms and their theoretical understanding. And actually doing both simultaneously might make progress easier.
- "To us, it appears that the spectral paper makes strictly stronger assumptions than the TP papers, including that width dominates training time, batch size, and depth." The way I see this is that if one starts with the upper bound arguments in the spectral paper, then one can take the lower bound arguments as "optional extras" to accept or reject depending on one's taste for the assumptions. Given the significant complexities that occur in actual neural net training at long training times and large batch sizes, I do wonder if anything more than the upper-bound-style arguments is possible in general. I also want to emphasize that there is a close relation between these upper-bound-style arguments and "majorizations", which are commonplace in older ML theory.
- "Regarding whether the act of making assumptions itself makes things non-rigorous: we believe the answer to this question to be quite nuanced" I completely agree. But as a result, I would be exceptionally hesitant to use rigor as an argument without thoroughly caveating precisely what is meant by the term and how it applies in the given situation.
I'm also sorry that the discussion got heated. But thanks for engaging thoroughly. And yes I look forward to following more of your work on this topic :)
This paper theoretically demonstrates that the weight perturbations induced by the SAM optimizer vanish in the infinite-width limit in every layer except the output layer. This may limit the regularization effect one gets from SAM in large-width settings. To address this, the theory further shows that the perturbation scaling must be width-dependent. Based on this, the paper proposes a fix, called "Maximal Update and Perturbation Parameterization" (µP^2), and claims that it achieves better hyperparameter transfer and improved generalization performance. The latter claims are empirically demonstrated with MLPs and ResNets on CIFAR-10 and with ViTs on ImageNet-1K.
The paper received 4 reviews. Three of these recommend acceptance (to varying degrees), while one reviewer (dLWn) rates it as borderline reject. Reviewer dLWn's main concerns were around clarity of presentation: specifically, using more figures to provide intuition about the theoretical results, providing more background on Tensor Programs, and presenting more clearly how practitioners can use the results of this paper to set the perturbation radius for each layer. I find these to be excellent suggestions, and I strongly encourage the authors to consider them for the final version of the paper to improve its readability.
Reviewer egBf and the authors had a lengthy engagement during the rebuttal phase. The reviewer had a very positive opinion of the submission in the earlier stages of the discussion period. However, there was considerable further discussion, and some disagreement, on the definition of rigor used in the paper and how the authors and the reviewer interpret it. Specifically, reviewer egBf suggested the possibility of a simpler proof, while the authors believed that a rigorous proof should be based on Tensor Programs (something that egBf disagreed with). While an alternate proof approach, especially one with the potential to be simpler, is of interest, it can be a matter of future work, especially because the current proof approach is not incorrect. Ultimately, reviewer egBf rates the paper as a weak accept.
The other two reviewers had fewer concerns about the paper and rated it as weak accept and accept, respectively. Overall, the consensus leans towards acceptance, and I also recommend accepting the paper because I believe both the problem and the proposed approach (Tensor Programs) are interesting. However, the paper can greatly benefit from a revision for its final version, with the goal of a clearer presentation, based on the feedback provided during this review phase.