PaperHub
Overall rating: 5.8 / 10 · Poster · 4 reviewers (min 3, max 8, std dev 1.9)
Individual scores: 5, 8, 7, 3
Confidence: 4.0 · Correctness: 3.0 · Contribution: 3.3 · Presentation: 2.5
NeurIPS 2024

Scalable Optimization in the Modular Norm

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-11-06
TL;DR

Normalizing weight updates in a special norm leads to hyperparameter transfer; this norm can be inferred automatically from the computation graph.

Abstract

Keywords
scalable optimization, modular norm, normalization, modula, hyperparameter transfer, architecture aware, operator, module tree

Reviews and Discussion

Official Review (Rating: 5)

The authors are tackling the difficult problem of trying to scale the parameters of a network so that different sizes of network have similar optimisation properties. This would greatly aid in hyper-parameter tuning. They do this by introducing a new norm which is defined for the whole network. They show experimental results providing evidence that their approach is successful.

Strengths

This is a hard problem that is worthy of study. The approach is mathematically rigorous in so far as it introduces a new norm. The norm is designed to capture features of the network that are important in scaling and this is backed up by experimental results.

Weaknesses

The approach feels a bit ad hoc, particularly the introduction of "masses". It feels a little bit like the problem of adjusting hyper-parameters has been pushed onto determining masses for modules.

Questions

  1. Is there a principled way to determine the masses?
  2. Are the masses for a convolutional layer in VGG the same as those for ResNet or AlexNet?
  3. Do the masses depend on the problem?
  4. How does the scalability compare to the commonly used initialisation schemes for weights?

Limitations

I see no issues here.

Author Response

Dear Reviewer U3gK,

Thank you for the time you spent reading and reviewing our paper. Many of the questions you had related to the masses in particular, so let us briefly discuss how we think about them.

You are correct in that the masses absorb at least part of the question of determining optimal hyperparameters: it has been recognized by many practitioners that the learning rates for particular layers in a network (especially the embedding and the final layers) should be individually tuned, and tuning the masses plays this role in our framework. One should expect the optimal masses to differ for individual problems and architectures.

However, we still think the separation of “tune the masses” (aka relative learning rates of the layers) and “tune the global learning rate” is a useful way to subdivide the tuning problem. Many architectures take the form of an initial/embedding layer, then a number of residual layers, then a final layer. We found that a useful scheme for allocating mass is to use the ratio 1 : M : 1 between the initial_layer : residual_body : final_layer. This means in particular that if there are L residual layers, each one is given mass M/L. For this scheme, we reported the following experimental results in our paper (see Appendix D.6):

  • For any given fixed M, the global learning rate exhibits hyperparameter transfer from a small to large network;
  • The optimal M (in terms of lowest test/training loss) itself exhibits hyperparameter transfer.

Note that this scheme involves tuning a single mass M, rather than one mass for every layer.
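
To make this scheme concrete, here is a minimal sketch in plain Python (the function name and dictionary layout are ours, purely for illustration, and not the Modula API):

```python
def allocate_masses(num_residual_layers: int, M: float) -> dict:
    """Split mass in the ratio 1 : M : 1 between the initial/embedding
    layer, the residual body, and the final layer. Each of the L residual
    layers receives an equal share M / L of the body's mass."""
    L = num_residual_layers
    masses = {"initial_layer": 1.0, "final_layer": 1.0}
    for i in range(L):
        masses[f"residual_{i}"] = M / L
    return masses

# Example: with 12 residual layers and body mass M = 8, the initial and
# final layers each get mass 1.0 and every residual block gets 8/12.
print(allocate_masses(num_residual_layers=12, M=8.0))
```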

Moreover, in contrast to earlier approaches, we concretely link the tuning of the masses to estimating the change in network output due to weight updates in each layer (Proposition 3), which gives a conceptual way to a priori reason about tuning the learning rates between separate parts of a network.

Together, hopefully this answers your first three questions. For the fourth question on initialization schemes, we want to clarify that there are two important issues when choosing an initializer:

  1. what random matrix distribution do you use: orthogonal, Gaussian, uniform, etc.
  2. given a choice of distribution, how do you scale it and how large are the resulting singular values?

Our belief is that the first question is not important so long as you carefully deal with the second question. In this work we use orthogonal initialization because we believe it is the conceptually simplest “base initializer” to subsequently scale according to question two (this is because the singular values of a random orthogonal matrix are all one). However, we want to emphasize that we believe that rigorously addressing this question is outside the scope of this work. See arXiv:2310.17813 for more discussion on orthogonal versus Gaussian init and arXiv:2011.14522 for more discussion on SP versus muP.
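
As a small illustration of point two (our own sketch in PyTorch, not code from the paper): a semi-orthogonal base initializer has every singular value equal to one, so a single scale factor then determines all singular values of the initialized matrix.

```python
import torch

d_out, d_in, scale = 256, 128, 0.5

# Semi-orthogonal "base initializer": every singular value is exactly 1.
W = torch.empty(d_out, d_in)
torch.nn.init.orthogonal_(W)

# Rescaling by a scalar multiplies every singular value by `scale`.
W = W * scale

print(torch.linalg.svdvals(W)[:3])  # approximately tensor([0.5, 0.5, 0.5])
```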

Thank you again for the time and effort you put into reviewing our paper!

Comment

Thank you for addressing my question. A semi-automatic means of assigning masses does make the contribution of the paper slightly more convincing. The task of improving the scalability of networks is clearly important and defining new norms seems a reasonable way to proceed. I'm still weighing up whether your approach has nailed the problem. I will consider my scores.

Comment

Thank you! Please let us know if any more questions come up!

Comment

Just wanted to send a gentle nudge in case you got a chance to think more or reconsider your score! We'd be happy to engage further.

Authors

Official Review (Rating: 8)

The authors propose the “modular norm”, a norm for deep learning that can be simply recursively composed, allowing for easily recursively computing (and hence controlling) the Lipschitz constant of the network and loss gradient. They propose how to scale the gradient updates by the modular norm, and empirically demonstrate the effectiveness of such ‘normed optimization’ at achieving invariance of the optimal hyperparameters to scale.

Strengths

The paper proposes a very interesting idea: an architecture-agnostic norm for neural networks that gracefully scales in width and depth. This seems like a very interesting take on what is currently being achieved through convoluted scaling of initialisations, learning rates, and residual branches following rules derived from asymptotic properties of these rules. This paper has challenged how I think about sharpness and curvature of the loss landscape in a deep learning context, and the asymptotic properties thereof. I'll admit, I'm still digesting the take-aways, but I feel fairly confident many in NeurIPS community would benefit from reading this paper.

Lastly, the experimental results on learning rate transfer are fairly robust and look very promising.

Weaknesses

  1. The use of the term ‘feature learning’ seems somewhat distinct in this paper from the way it's been used in the Tensor Programs (TP) literature. To the best of my understanding, in TP4, it refers to the lack of or presence of change in how each layer processes its inputs in the infinite width limit. In Proposition 3, the amount of “feature learning” seems to refer to a bound on how much the way a layer processes its inputs can change. I think it would have been great if the authors defined or front-loaded what they mean by ‘feature learning’ and ‘proportion of feature learning’ before Proposition 3. E.g., lines 128-131 were difficult to parse on a first read-through.
    • Similarly, the claims about controlling the ‘proportion of feature learning’ are teased again without clarifying what is meant by feature learning in lines 171-173. I think these lines could be cut, to be honest.
  2. I think the presentation could use a little bit of work. A lot of it is very good already. In particular, the notation, definitions and propositions read well, and the discussion of limitations and future work in section 5 was very useful and interesting. That being said, I think section 2, which was setting up the motivation for what follows, left me a bit confused on my first read-through. I would have hoped to give concrete suggestions on how I'd like to see it changed, but that is a non-trivial task, so I'll point to what I think is confusing at the moment:

The authors mention two motivations for the modular norm in the abstract: a) graceful scaling in width and depth, and b) ease of constructing a Lipschitz-continuous network (in the modular norm). The elaboration on and setup of these two motivations then gets slightly jumbled, in my opinion. The introduction primarily focuses on a) – the graceful scaling. Then, section 2, to the best of my understanding, gives an argument for why achieving Lipschitz continuity with a tight constant that's invariant to the scaling dimension might result in the step-size being invariant across scale. As I understand it, this is a fairly loose argument, that's not being corroborated with further formalism or arguments later on.

I think, when first reading section 2.1, I would have appreciated:

  1. Having a better sense of direction:
    1. Making it clear that the link between graceful scaling of step-size and Lipschitz continuity is a loose motivating argument, that's absolutely useful, but will not be formalised later on.
    2. Making it clearer that the goal of the following sections will be to present a modular norm in which it will be easy to specify networks with a Lipschitz-continuity that's invariant to a scaling dimension.

Then, I think the relationship between what section 2.3 lays out — achieving norm-η updates in the modular norm — and section 2.1 could also have been made much clearer.

  3. Some smaller points:
  • On lines 81-83, it seems like bulletpoints (ii) and (iii) could be merged. They refer to the same desirable property, and one is just setting up the other.
  • The sharpness bound in equation 2.1 could use a citation (or link to appendix derivation) for pedagogical purposes.
  • It would have been great to define (or cite a reference for) what a matrix norm induced by input and output space norms is. I wasn't familiar with this term.
  • The authors claim that equation (2.2) holds quite tightly. Do the authors have a reference?

Questions

  1. The authors say “[scaling the updates in the right norm] in some instances, [...] enables training with a simpler optimizer—for example, training GPT with SGD rather than Adam—thus incurring a smaller memory footprint.” Do the authors have experiments or a reference to corroborate this? If stable training of large transformers was demonstrated in the modular norm, this would be an impressive feat.
  2. The work aspirationally mentions graceful scaling in e.g. width/number of modules. If my understanding is correct, the many guarantees on how the properties of modules combine in the modular norm mean that it would be easy to devise rules for setting e.g. the mass parameters so that the sharpness of the network is nicely bounded. However, the work doesn't actually show any results on the tightness of these bounds in any asymptotic setting. Without the bounds being tight (or not getting worse in appropriate scaling limits), why would we expect hyperparameters like the learning rate to transfer?
  3. Why is the fact that the norm of the weight updates in modular norm only depends on the learning rate in normed optimization (section 2.3) a sensible thing to do?
  4. Is there any reason the authors didn't compare to hyperparameter transfer in muP and its depth counterpart?
  5. Empirically, are the Lipschitz bounds actually close-to-tight? Does the tightness persist throughout training?

Limitations

  1. If I'm understanding things correctly, Proposition 3 does not guarantee the presence of feature learning in the modular norm, the same way muP guarantees a change in the way a layer processes its inputs by at least Ω(1) as the width goes to infinity?
  2. Only transfer of the step-size and mass allocation across scale is considered, and not of other training hyper-parameters (e.g. learning rate schedule).
Author Response

Dear Reviewer FbvX,

We are grateful for your extremely thorough and helpful review! We’re very happy you found our paper interesting.

Before we go in-depth into your comments: many of them are related to the tightness of the bounds we prove on first/second derivatives, so let us first explain our picture of the situation. Almost every inequality in our paper is an instance of either:

  1. The defining inequality ||Wx|| ≤ ||W|| · ||x|| for the spectral norm ||W|| of a matrix W (often W is a gradient matrix, and x is an activation vector);
  2. The triangle inequality, where every term in the sum represents the change in network output due to a change in the weights of a particular module.

We view the question of whether either type of inequality is tight as being an "alignment" question: how aligned are the activation vectors with the singular vectors of the gradient matrix for a single linear module, and how correlated are the various contributions to the change in total network output from all the different modules? We think these are very significant empirical questions about neural networks, of which there has been some investigation (e.g. arXiv:2310.17813), but further work is very much warranted. Based on your feedback and that of the other reviewers, we propose clearly highlighting this as an important avenue for future work in the “Limitations and Future Work” section.
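
To illustrate what we mean by the alignment question for inequalities of type (1), here is a toy NumPy check (our illustration, not an experiment from the paper): for a random, unaligned vector the ratio ||Wx|| / (||W|| · ||x||) sits well below one, and it equals one exactly when x lies along the top right-singular vector of W.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512))   # stand-in for a gradient matrix
x = rng.standard_normal(512)          # stand-in for an activation vector

spectral = np.linalg.norm(W, ord=2)   # spectral norm ||W||

ratio = np.linalg.norm(W @ x) / (spectral * np.linalg.norm(x))
print(f"random x:  ||Wx|| / (||W|| ||x||) = {ratio:.3f}")   # well below 1

# Inequality (1) is tight when x aligns with the top right-singular vector of W.
_, _, Vt = np.linalg.svd(W)
x_top = Vt[0]
ratio = np.linalg.norm(W @ x_top) / (spectral * np.linalg.norm(x_top))
print(f"aligned x: ||Wx|| / (||W|| ||x||) = {ratio:.3f}")   # equals 1
```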

At this point we would also like to contrast our approach with muP, which obtains \Omega(1) lower bounds in the infinite width (but crucially, constant batch size) limit. In the extreme case that the batch size is one, the gradient matrices are necessarily rank one, and inequalities of both type (1) and type (2) as above are automatically tight, and from this one can then deduce \Omega(1) estimates in the width >>> batch size regime (where one can make a low rank assumption on the gradient matrices). However, we question the relevance of such theoretical lower bounds obtained from this limiting case to real life neural network training. We believe an empirical study using realistic widths, depths and batch sizes would shed much more light on this issue.
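
To spell out the rank-one claim, here is the standard identity for a single linear layer y = Wx with loss L (a worked step added here for clarity, not new material from the paper):

```latex
% Batch size 1: the weight gradient is an outer product, hence rank one.
\nabla_W \mathcal{L} \;=\; \frac{\partial \mathcal{L}}{\partial y}\, x^{\top}

% Batch size B: an average of B outer products, so
% rank(\nabla_W \mathcal{L}) <= min(B, d_in, d_out),
% i.e. low rank only while the batch size is small relative to the layer widths.
\nabla_W \mathcal{L} \;=\; \frac{1}{B} \sum_{i=1}^{B} \frac{\partial \mathcal{L}}{\partial y_i}\, x_i^{\top}
```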

Now, to answer your comments on the weaknesses:

  1. A precise statement of what we termed "feature learning" would be "linearized change in the module output as a result of a weight update". We think this is a very useful concept that deserves an evocative name, and substantially informs how we think about the mass parameters. You’re right that this is different from other uses of the term e.g. in Tensor programs; we will heed your advice and better clarify the language in this section.
  2. Thank you for some great suggestions on how to improve Section 2 in particular:
  • We thought it was important to have a less formal "motivation" section which outlined at a high level the relationships between the mathematical concepts in the paper, and certainly stressing the looseness of this argument is a good suggestion (so that the reader does not expect us to prove, e.g., that Lipschitz constants independent of network dimensions necessarily guarantee hyperparameter transfer).
  • As for section 2.3, we wanted in this section to explicitly spell out how the modular norm could be actually used in real life optimization, so it’s a little bit different in its goals to section 2.1. Clarifying this would be a good suggestion.
  • The tightness of equation 2.2 is partially tested in arXiv:2310.17813, but as above we believe more work should be done on this question.

To answer your questions:

  1. In the experiments documented in the paper, we demonstrated that “normed SGD” was competitive with Adam for nanoGPT-like transformers. See Figure 1 in the main paper and Figures 9 and 10 in the appendix. Testing whether this remains true at a slightly larger scale is something we are actively working on currently (on a 160M parameter transformer model).
  2. See the discussion above about tight bounds.
  3. “Why is the fact that the norm of the weight updates in modular norm only depends on the learning rate in normed optimization (section 2.3) a sensible thing to do?” This is sensible because of Equation 3.1 in Definition 2: it implies that the learning rate will directly control the amount of “feature learning” measured in the module’s output norm.
  4. We did not include a direct comparison to muP due to time and space constraints; while our work has a similar goal to muP, we do think the approach (going via non-asymptotic elementary inequalities rather than trying to identify infinite width/depth limits) is mostly orthogonal. We are not claiming our approach has better or worse real life performance than muP.
  5. As you highlighted, we do believe the key question is "over the course of training". Based on your feedback, we will edit the "Limitations and Future Work" to highlight this (we teased some of this in the "Loss of well-normed-ness" section, but in hindsight this could be better foregrounded).

For your questions on limitations:

  • See the above for a discussion about lower bound guarantees, including a discussion of the approach taken in muP.
  • Correct, we only consider learning-rate transfer in this paper.

Thank you again for the time and effort you put into the thorough review of our paper!

Comment

Thank you for a very thorough response! I also very much appreciate all the changes the authors said they would make, and think they are a good idea.

I wanted to discuss a couple of points in the authors' response:

In the extreme case that the batch size is one, the gradient matrices are necessarily rank one, and inequalities of both type (1) and type (2) as above are automatically tight, and from this one can then deduce \Omega(1) estimates in the width >>> batch size regime (where one can make a low rank assumption on the gradient matrices).

Maybe I'm missing something, but I don't think this is true when looking at different inputs. Yes, this will be true when comparing the tightness of the bound on the same datapoint x on which a gradient update was made, but the bounds will not necessarily be tight for a different datapoint x′ – the activations for that datapoint might not be aligned to any extent with those of x. That's what makes the results about muP interesting and non-trivial in my opinion.

I would also push back on infinite model-size, fixed batch-size, not being a realistic limit. It seems to capture how people scale training in practice pretty well. If anything, I would take more of an issue with not considering the infinite training time limit, but I think being able to obtain asymptotic tightness results even in the finite training time limit is quite interesting.

“Why is the fact that the norm of the weight updates in modular norm only depends on the learning rate in normed optimization (section 2.3) a sensible thing to do?” This is sensible because of Equation 3.1 in Definition 2: it implies that the learning rate will directly control the amount of “feature learning” measured in the module’s output norm.

That makes sense. I think I would recommend putting this as a motivation (not formal, just colloquially explained in words) for normalising weight updates at the beginning of Section 2.3. I think at the moment Section 2.3 just comes a bit out of nowhere, and that bit of explanation makes it clear why normed optimisation is a desirable thing to do.

Comment

Thanks for the suggestion about extra motivation, which we will implement. Also thanks for the additional questions which have helped us sharpen our own thinking---we are keen to continue the discussion for as long as you are! Regarding your questions:

"I would also push back on infinite model-size, fixed batch-width, not being a realistic limit". Thanks for pushing back. Let's crunch numbers on a concrete example. Consider Llama 13B (i.e. meta-llama/Llama-2-13b on HuggingFace). From the model card, this is "trained with a global batch-size of 4M tokens". Also, Linear layers in the MLP blocks have fan-in of 5120. So in an actual practical training situation, batch size dominates width, rather than vice versa. So there is no a priori reason to believe gradients should be low rank, and this is far outside the realm of applicability of muP.

So why does muP work then? Based on this discussion, we are wondering if it is a coincidence. Look in the spectral-muP paper (arXiv:2310.17813) at the bottom of page 7:

Empirical observation: low-rank structure remains at large batch size. Surprisingly, we observe numerically that MLP updates remain low (effective) rank and aligned with incoming vectors even at large batch size B. This is demonstrated in Figure 1.

In other words, gradients can have low stable rank even when batch size dominates width. This is a surprising empirical finding that to the best of our knowledge still needs explaining. It may be a property specific to the data distribution we train these models on, for instance.

Generally, we feel that due to the presentation style of Tensor Programs papers, it can be hard to catch these issues. In contrast, it is our intention to be extremely straightforward about the limitations of our approach.

"the bounds will not necessarily be tight for a different datapoint". We weren't aware muP could handle different datapoints in this way. Could you possibly point us to the muP statement that you're referring to and we'll take a closer look. (We'll look for it too---but we're just asking in the interest of accelerating the discussion).

In conclusion. We'd love to keep the discussion going. Also, if you'd be in any way open to increasing your score it could really help us.

Authors

Comment

Just wanted to send a gentle nudge in case you got a chance to think more or reconsider your score! We'd be happy to engage further.

Comment

Thank you again to the authors for their response, clarification, and the promise to improve the paper in the areas mentioned above.

I still maintain that this is a paper with some very interesting ideas, and it does a good job at presenting and evaluating them. Hence, I still believe that this paper is a strong accept (8), although the discussions with the authors have increased my confidence in this assessment.

Comment

Thank you sincerely for your time and effort spent reviewing!

Official Review (Rating: 7)

The authors propose a new normalization strategy for deep models rooted in the introduction of a new framework and on feature learning considerations. The authors provide a few experimental examples as motivation, and then start introducing their framework. They formally define what a module is and its norm and then present a few results on module composition and relation to gradient/hessian-dependent quantities. The authors showcase that, on some toy experiments, their "modula" package yields improvements over vanilla SGD, closing the gap to Adam.

Strengths

The paper is well-written, presentation is formal, notation is pleasant. Illustrations are very well done and in general the whole work is very thought-through. I also like the idea: not modifying the optimizer or the architecture, but normalizing stuff before it is fed into the optimizer. This is very flexible and I hope the authors can scale this up on bigger models.

Weaknesses

There are 2 weaknesses I think are hurting the paper, but both are solvable.

  1. While the paper reads well, I found a lot of distance between the discussion of SGD in formula 2.1 and the results of Proposition 5. In between, there is little discussion of SGD. When reading, I felt a bit lost in the definitions and formalism - it seemed the connection was just used as motivation but one has to go through 4 pages to get back to it. I think the authors should put proposition 5 sooner as a main result, and actually as a motivation for introducing the modular norm.

  2. This is solvable but hurts the contribution a lot: experiments. Resnet experiments are ok - but where we really need the thing to shine is transformers since there SGD has a huge gap. I think the current status is lacking a bigger transformer architecture: the context length used is 128 for GPT - very small. I know that if you have resource limitations, this can be demanding, but I am afraid your idea would be lost in the literature if you don't provide further evidence. If I have to make a suggestion: pythia 160M (https://github.com/EleutherAI/pythia/blob/main/models/160M/pythia-160m.yml) is a good model: probably takes < 0.5 day on a single GPU (train e.g. on slimpajama). If you feel very confident, try 24 layers. I think you only need to run this with SGD lr = [1e-3, 1e-2, 1e-1, 1], use the standard Adam parameters reported in the link above, and then run modula on SGD. I would keep the context to 1k or 2k. I think if this works many people will pay attention.

I am happy to raise to accept if you can provide some experimental evidence in the rebuttal.

Questions

  1. Can you show experimentally that equation 2.2 holds tight? Many results in your paper are based on inequalities. And yes, I am aware of G. Yang's discussion of the spectral norm - yours is similar. Yet I think further motivation is needed to believe your inequalities fully.

  2. Can you trace back your findings to some framework-independent considerations? I think it is very important to show - even in a small architecture - what your normalization is actually doing. Tensor programs is similarly unsatisfying in the sense that for many researchers, having a program doing the normalization for you is a bit fishy. I think you should outline your contributions also in a language that everyone is using, so as to make clear to researchers what it is that you believe is needed -- precisely -- for SGD to work.

  3. One thing I am afraid of is that sometimes normalization can be a bit aggressive. Did you ever find this to be the case? I am thinking (for instance) of some findings about "edge of stability": in that paper, the authors show that if you normalize the learning rate with the biggest hessian eigenvalue, then things can go south. Here, you are normalizing blocks (modules) though, so it might be different. It probably is.

Limitations

Experiments, see above.

Author Response

Dear Reviewer qUST,

We are grateful for your thorough and helpful review! First of all, we will heed your advice about restructuring the technical content. Second, regarding the result being lost in the literature, we are already proactively working with the community on larger-scale validations. For example, we are working with a software engineer at Midjourney on an open source project to port Modula into JAX (calling it “Modulax”) and they are planning to test it on large-scale diffusion models---including Midjourney v7 if it works well. And we are in contact with an engineer at Cerebras who reported good training stability with Modula on GPT-2 size models. In terms of what we can show you now, we will try to get the experiment on a 160M size model done and share the results with you before the end of the discussion period. Apologies for not having it done sooner.

Regarding your questions:

  1. As you say, the tightness of equation 2.2 is partially tested in arXiv:2310.17813. We agree that a more thorough evaluation is needed. Doing this properly will require significant care and is probably worthy of a full paper. We propose clearly highlighting this as an important avenue for future work in the “Limitations and Future Work” section.
  2. Thank you, this is a great and important point. We do have a clear, intuitive understanding of what Modula is doing “under the hood”. But due to space limitations and the mad dash to get the project done, we didn’t include this in the paper. To partially rectify this, we created open source docs and an FAQ that directly explain many of the mechanisms. We propose to add an intuitive explanation of the underlying mechanisms to the prospective camera ready. We will express this in clear, accessible language.
  3. In our experience, we have never seen per-tensor gradient normalization being too aggressive. There is a body of work supporting this: consider the LARS and Fromage optimizers for older examples. See arXiv:2305.17212 for a more recent example. In addition, consider that Adam and recent extensions like Adam-mini can be interpreted as forms of per-tensor normalization, although people usually don’t think of them that way. And don’t forget Shampoo!
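
For concreteness, here is a minimal generic sketch of a per-tensor normalized update in PyTorch (the Frobenius norm is used as a stand-in for a per-module norm; this illustrates the general technique discussed above, not the Modula package itself):

```python
import torch

@torch.no_grad()
def per_tensor_normalized_sgd_step(params, lr: float, eps: float = 1e-12):
    """Rescale each parameter tensor's gradient to unit (Frobenius) norm
    before stepping, so every tensor's update has size exactly lr in that
    norm, regardless of the raw gradient magnitude."""
    for p in params:
        if p.grad is None:
            continue
        p -= lr * p.grad / (p.grad.norm() + eps)
```

This is the same basic move made by LARS-style optimizers, applied tensor by tensor.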

Thank you again for your time and effort. Again, we will try to get the suggested experiment to you by the end of the discussion period.

Comment

Thanks for the warm and open discussion. I am glad to see that the authors want to progress further and believe in the project. I changed my score to "accept", with the hope that the authors can revise and provide more intuition, as well as larger scale experiments, in the revised version. It's nice to do this to increase impact. Good luck!

Comment

We commit to including these revisions and experiments in the prospective camera ready. We'll try to get the experiments done this weekend if we can.

Thank you! Authors

Official Review (Rating: 3)

This paper introduces the modular norm, which is a norm designed to be adapted to neural network (NN) optimization. Specifically, an optimization process involving a gradient computed according to this norm scales with the size of the NN to be optimized. This paper provides the algorithm for computing the modular norm, along with a small set of training experiments.

优点

Originality

Creating a norm making the optimization process scale with the size of the trained NN is original.

Clarity

Overall, the general idea is easy to understand.

Significance

Finding the scaling laws linking the hyperparameters (learning rates, initialization scale, penalty, etc.) to the architecture of a NN is a major field of research in NN optimization.

Quality

Lines 87--92, the authors state an interesting motivation for their work:

In machine learning, we would ideally like [...] that meets these requirements: the modular norm.

Overall, the motivation and the idea of the modular norm are appealing.

Weaknesses

Clarity

A major issue of the paper is the absence of a list of contributions. In other words, the authors do not make any general claim about their results. The list of contributions must appear somewhere (ideally at the beginning), so that one would have a formal basis to evaluate the paper (are the claims significant enough? are they well justified? etc.).

Section 4 (experiments): one would expect a description of the implementation of the modular norm in an optimizer. As such, I do not understand what is computed (mass? sensitivity? norm?), when (at each step? each epoch? ...), what is used in the optimizer, and how.

Major typographical issues make the main text hard to read: some notations overlap with each other (line 148, lines 162--163, etc.).

Significance

No list of contributions is provided, so it is difficult to evaluate the significance of the paper. Besides, the set of experiments is very limited.

Quality

The contribution of this paper is unclear for the reasons aforementioned.

Moreover, given the context, the motivation and the related works, one would expect a comparison with similar "scaling" techniques (muP and NTK parameterizations, for instance). It is well-known that standard training (with the usual parameterization) diverges as the number of "blocks" or the "width" tends to infinity (Fig. 4), so it is not surprising that the proposed method performs better in this situation.

Questions

How does the proposed optimization process compare to optimization with NTK/muP parameterization?

Limitations

Lack of experimental validation, lack of clear list of contributions.

Author Response

Dear Reviewer 3bTa,

We are grateful for your time spent reviewing our paper. We are sorry if the absence of the contribution list was disconcerting. We feel that our paper is chock-full of contributions since we are advancing a substantially novel perspective on deep learning optimization based on automatically and recursively metrizing the weight space of the neural network. To highlight a few key contributions, consider that:

  • We propose the modular norm defined via a recursive algorithm (Definitions 3 and 4)
  • We show that neural nets are Lipschitz (Proposition 2) and Lipschitz smooth (Proposition 5) in the modular norm

Establishing a solid, workable notion of Lipschitzness for neural networks is generally regarded as a major open problem by experts in optimization theory. See, for example, the survey “Optimization for deep learning: theory and algorithms” by Ruoyu Sun (arXiv:1912.08957) for an explanation of this. We are sorry that the presentation of the paper was not to your taste, but we hope that you will consider re-evaluating the paper in light of this clarification.

Regarding your other questions:

  • Mass and sensitivity are held fixed from the start of training. Weight updates are normalized in the modular norm at each training step.
  • The spectral perspective on training dynamics is already reconciled with muP and NTP in “A Spectral Condition for Feature Learning” by Yang et al (arXiv:2310.17813). See, for example, Figure 2 in that paper. We are generalizing that analysis to general architectures. We are not claiming to have better performance than muP.

Thank you again for your time and effort!

Comment

I thank the authors for their answer and the suggestions of articles.

We propose the modular norm defined via a recursive algorithm (Definitions 3 and 4)

As such, this is not exactly a claim. This is a proposition of quantities matching several appealing properties (I admit that), related to feature learning. To make this contribution real, an actual application has to be found (either theoretical or practical). If practical, one should expect an experimental evaluation of the proposed method. Otherwise, why should we care more about the modular norm (along with the mass and the sensitivity) than any other measure (there are many of them)?

Please note that, even if the proposed method performs worse than muP, the experimental results would be interesting anyway, and this should not be an obstacle to acceptance for publication. But, as such, the reader does not have any basis to compare experimentally the proposed method to muP or others.

We show that neural nets are Lipschitz (Proposition 2) and Lipschitz smooth (Proposition 5) in the modular norm

I agree that some work has to be done to obtain some bounds (e.g., bound (i) in Prop. 5) involving the Hessian in order to make progress towards better optimization results. I also agree that the NNs are Lipschitz smooth in the modular norm. Yet, there is no evidence that the modular norm may be used to improve existing theorems about optimization of NNs (or for any other use).

Overall, I admit that the study of the modular norm is appealing, but the actual claims are very loose:

  • we do not know how the proposed method compares to similar ones;
  • we do not know why ensuring Lipschitz smoothness in the modular norm should be useful for further research.
Comment

We really appreciate you engaging with us further.

"an actual application has to be found (either theoretical or practical)" We completely agree with you here. Actually, the application we are targeting in the paper is the definition of a single normalize function that can be applied to either Adam or SGD making the method exhibit good learning rate transfer. In contrast, in muP and all other approaches to this problem that we know of, different scaling rules are needed for different base optimizers. In other words, we believe that we have made the implementation of scalable training substantially easier across optimization algorithms. This application is what the experiments in Figures 1 and 4 target.

"even if the proposed method performs worse than muP, the experimental results would be interesting anyway" You're right, and we commit to including this comparison in the prospective camera ready. We'll try to get it done this weekend if we can.

"we do not know why ensuring Lipschitz smoothness in the modular norm should be useful for further research". You are right that we haven't found a killer application for this part yet. But we felt so excited about this result from a theoretical perspective, that we felt it was worth highlighting prominently in the main paper. We are hopeful that ourselves or other researchers are able to use automatic sharpness calculations productively in future work. We agree there is uncertainty here.

In conclusion: with this paper, we resisted the urge to "just present one idea", which is the typical advice for ML conference papers. Instead, we threw everything we could at the problem and tried to write the most exciting paper we could based on that. As a result, there are many avenues left open. We intend to continue to work on this in future work, and we hope that maybe we could inspire other community members to also find these directions interesting.

If based on your reviewing you had suggestions about how we could restructure the paper further, we would love to hear them. Also, if you feel that we have addressed your concerns at least to some extent, we would appreciate it if you consider raising your score.

Best, Authors

Comment

Just wanted to send a gentle nudge in case you got a chance to think more or reconsider your score! We'd be happy to engage further.

Comment

Thank you.

I maintain my score. I believe that this work deserves to be accepted for publication at some point. But, in its current state, it is difficult to say more than "this is an exciting idea". The fact that "many avenues [are] left open" should be an incentive to provide clear answers along with the initial ideas. In short, clear claims, that are well-supported and checked in-depth. For instance: is the Lipschitz-smoothness according to the modular norm the property that researchers in optimization expect? I do not have the answer to that question. But the paper should provide, at least, a discussion about that point (involving the usual proof techniques used in optimization, etc.). The same holds for the initial idea (the modular norm)...

More informally, a list of claims may be used by the authors themselves as a sanity check, in order to clarify the aim of the paper, formulate the take-home message(s), identify the weaknesses of the paper, etc.

Author Response

To all Reviewers,

We are sincerely grateful for your contributions to the conference and for your feedback on our work. Reviewer FbvX commented that the paper “has challenged how I think about sharpness and curvature of the loss landscape in a deep learning context” and that they “feel fairly confident many in NeurIPS community would benefit from reading this paper”. We found this feedback really encouraging—thank you! We will also listen to and try to address all the reviewer feedback—see our individual responses below.

Best,

Authors

Comment

Just to summarize the current reviewer positions for the AC:

  • reviewer qUST liked the paper but felt that larger-scale experiments could increase its impact
  • reviewer FbvX praised the work for both its originality and also its "fairly robust" and "very promising" experiments
  • reviewer U3gK expressed concerns about the role of the mass parameters, which were largely assuaged when we pointed out that the mass of all the residual blocks can be tuned collectively
  • reviewer 3bTa was thrown by the lack of a formal contribution list at the end of the introduction. However, we feel that the contribution list is essentially contained within the paper's abstract, and the NeurIPS conference has no requirement for a bulletpoint list of contributions at the end of the introduction.

Again, we are immensely grateful to all reviewers for their time and effort. Thank you!

Final Decision

This manuscript proposes a modular norm that can be used within the training process of a neural network. There are some limitations and deficiencies of the manuscript in its current form, particularly aspects of the presentation, a lack of certain comparative "baselines," and incomplete evaluation of the proposed formalism. However, there was sufficient excitement from several reviewers about the novel ideas presented within the manuscript to counterbalance those issues—the problem the manuscript is working to tackle is undoubtedly important. While there is still clearly work to be done, it is reasonable to consider the present manuscript a sufficient first step in an interesting direction. The authors should make the relevant adjustments to the manuscript that arose during the discussion phase and address the specific weaknesses raised about aspects of the presentation (note that, as discussed, this need not be a contribution list, but additional clarity in the intro about the positioning and advancements of the work would be appreciated).