PaperHub
Overall rating: 6.3/10
Poster · 3 reviewers (scores: 3, 3, 4; min 3, max 4, std 0.5)
ICML 2025

Modular Duality in Deep Learning

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-07-24
TL;DR

Modular dualization maps gradients to weight space in neural nets, enabling fast and scalable training algorithms automatically optimized for different architectures. Successfully used by the community to speed up NanoGPT training.

Abstract

Keywords

modular duality, Newton-Schulz

Reviews and Discussion

Review
Rating: 3

The contributions of this paper could be summarised as follows:

  • it combines the notion of dualization/steepest descent with the notion of a modular norm (a max-of-norms aggregation of norms tailored to each single module). This goes beyond previous works on steepest descent that only consider an l1-type aggregation.
  • it proposes a norm choice for the standard deep learning modules. The proposed choices are informally motivated, and lead to updates related to Shampoo and muP, two successful techniques in deep learning.

Questions for the Authors

For questions on experiments, see the corresponding section.

The other questions follow from my comments above, but I will repeat some here:

  1. Why is the l1-RMS choice for embedding modules superior to using RMS-RMS? Did you run an ablation for this?

  2. Can the convergence theory of steepest descent give any insight into specific norm choices?

Claims and Evidence

One shortcoming of the paper is the lack of theoretical foundation for the proposed framework: steepest descent has a general convergence theory (for example in the papers given as references, or in the book by Nesterov). Hence, the question arises as to how the choices of module norms affect the rates and constants in this convergence theory. This would also be a possible way to motivate certain norm choices (for example, if the corresponding smoothness constants were small). However, the paper in its current form makes no effort in this direction.
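For context, here is a hedged sketch of the standard guarantee I am alluding to (textbook steepest-descent analysis, not a claim from the paper under review). If $L$ is $\lambda$-smooth with respect to a chosen norm $\|\cdot\|$, i.e. $L(w+\Delta) \le L(w) + \partial L(w)[\Delta] + \tfrac{\lambda}{2}\|\Delta\|^2$, then one steepest-descent step with step size $1/\lambda$ satisfies

$$L(w^+) \;\le\; L(w) - \frac{1}{2\lambda}\,\|\nabla L(w)\|_\dagger^2, \qquad w^+ = w - \frac{\|\nabla L(w)\|_\dagger}{\lambda}\,\texttt{dualize}(\nabla L(w)),$$

where $\|\cdot\|_\dagger$ is the dual norm. The module-norm choice therefore enters the rate only through $\lambda$ and the dual norm of the gradient, which is exactly the dependence an ablation over norm choices could probe.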

It should be remarked that the paper also does not claim to make contributions in this direction.

Another (minor) shortcoming is that the motivation through a type system seems to not capture the situation, considering the standard mathematical formalization (see details below).

Methods and Evaluation Criteria

Evaluation criteria make sense (convergence and LR transferability).

However, the choices of module norms have already been proposed in Large et al. (2024), and experimental or theoretical ablations/evidence for each particular choice is missing. For example, is the l1-RMS choice for embedding modules superior to using RMS-RMS?

Theoretical Claims

There are no major proofs or theoretical claims that need to be checked.

Experimental Design and Analysis

The experimental design seems sound.

Some questions on experiments:

  • For Figure 1, how does it look after a larger number of epochs (20 epochs will not be sufficient for high accuracy)? Does the dualization approach still lead to lower loss, or is this effect only visible with a short training time?

  • In Section 6.5 it is explained that the watermark erasure can be seen when the batch size is not too small: could you elaborate on how the batch size comes into play? Also, is the method without dualization just SGD or Adam here?

  • Why are the iterations and coefficients for Newton-Schulz chosen differently in the two experiments?

Supplementary Material

No supplementary material.

Relation to Prior Work

The main contribution of the paper seems to be to connect/motivate techniques that have been reported to improve training of deep learning models (e.g. muP and steepest descent methods). The amount of theoretical or empirical advancements in this paper itself however seems limited, especially given the similarity to Large et al. 2024.

Missing Important References

None.

Other Strengths and Weaknesses

There are several issues with respect to correct mathematical notation in the paper (see below). These notation choices might have been made in order to keep things simple; however, mathematical correctness suffers from this choice:

  • The mapping "dualize" is by its definition a set-valued mapping (as the argmax is not unique in general). With the current definition (which reads as if dualize returns a single element of the space $W$), one quickly runs into formal issues: the equation in Proposition 1 is ill-defined, as the left-hand side is a set but the right-hand side is a single element.
  • Example 1 is ill-defined if $g = 0$. Again, in order to properly define this, dualize should be set-valued, and then dualize(0) is the unit ball.
  • Lines 119-123 are strange: if the gradient is considered an element of the space $W^*$ as stated, then as long as $W^* \neq W$ we cannot add it to an element of $W$, by definition. However, this paragraph reads as if we had a choice in doing so or not ("we shall forbid ourselves").
  • It should also be mentioned that the gradient is usually defined as the element of $W$ that represents the linear mapping of the derivative via the Riesz representation theorem (e.g. see Prop. 2.4 in https://arxiv.org/pdf/2403.14606). Hence, in many textbooks the gradient is defined as an element of $W$, whereas the paper defines it as an element of $W^*$, which might lead to confusion. It would be beneficial to introduce the Jacobian-vector product (see Def. 2.13 in https://arxiv.org/pdf/2403.14606), and then motivate by considering the problem $\arg\min_{\|\Delta w\| \leq 1} \partial L(w)[\Delta w]$ (see the sketch after this list). If the norm in the constraint is not the Euclidean norm, then the solution to this problem is not necessarily the (scaled) negative gradient, and as a consequence we need to introduce the dualization mapping. Mathematically speaking, since the gradient (as usually defined) is an element of the same space as the weight $w$, the motivation via a type system seems to not capture the situation (even though it is a useful metaphor).
  • The notation in lines 168-170 (right column) is not fully clear: what exactly is "summation over any shared tensor indices"? I think it would be much easier to introduce the Jacobian-vector product; then this quantity can simply be written as $\partial_w\,\texttt{M.forward}(w, x)[\Delta w]$.
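To make this suggestion concrete, here is one possible set-valued formalization via the JVP (a sketch in my own notation, not the paper's):

$$\texttt{dualize}_{\|\cdot\|}(g) \;=\; \arg\max_{\|t\| \le 1} g[t] \;\subseteq\; W, \qquad \arg\min_{\|\Delta w\| \le 1} \partial L(w)[\Delta w] \;=\; -\,\texttt{dualize}_{\|\cdot\|}\big(\partial L(w)\big),$$

where $g = \partial L(w) \in W^*$ acts on directions through the JVP. With this convention, Proposition 1 becomes a statement about set membership, and dualize(0) is the whole unit ball.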

Other Comments or Suggestions

Minor comments:

  • Multiplication is sometimes denoted as $\times$ (Prop. 1) and sometimes as $*$ (Def. 6). Please align these notations.
  • Section 3.4: "are also smooth in an appropriate sense". Can you provide a reference for this statement?
  • Definition 5 appears identically in Large et al., 2024. Please refer to it in the statement of Def. 5, to emphasize that this is not a new concept proposed in this paper.
  • Lines 431-436: how can papers from 2019 and 2021 inspire a paper from 2018?
Author Response

Dear Reviewer BSBb, thank you for your contributions to the conference. We are grateful for your constructive and thorough review of our paper. We hope to provide useful responses to your questions!

First, we ran new experiments to address your questions about the long-term performance of duality-based optimizers. We created an anonymized GIF (link here) so you can directly watch the training loss fall over 100 epochs on CIFAR-10 for Adam (SP), Adam (µP), and Dualization. The GIF shows that dualization has lower training loss than Adam at every epoch. For example, dualization reaches loss 1e-2 in epoch 17, while Adam reaches the same loss in epoch 56. And here is another GIF for test accuracy. By the end, Adam and dualization both saturate around accuracy 60%, which is typical for an MLP on CIFAR-10.

Second, we really appreciate your questions about convergence analysis and nailing down norm choices. When evaluating our paper, we kindly ask that the reviewer consider a broad idea of what an important optimization paper might look like. We contend that our paper makes the following contributions of substantial importance:

  1. We build the first norm-based duality theory for deep learning that accounts for the tensor structure of the model and does not amount to updating one layer at a time. This is revitalizing interest in norm-based deep learning optimization theory, with exciting follow-up work involving ideas like linear minimization oracles, Frank-Wolfe analysis, and trust-region analyses.
  2. We introduce the Newton-Schulz orthogonalization primitive to the optimization literature. This is already having a substantial practical impact in industry and is inspiring follow-up research experimenting with these methods in academia.
  3. We theoretically reconcile the Shampoo optimization algorithm with the maximal update parameterization. Anecdotally, these were both regarded as some of the hardest techniques to understand. We provide a new and easy way to unify these techniques that immediately suggests new ways to extend them.
  4. We demonstrate that dualized training algorithms automatically exhibit transferable learning rates.
  5. We also show that dualized algorithms have novel numerical properties. This is an important scientific contribution since it provides a direct counterexample to the idea that the weights don’t change in wide networks, which inspired a lot of NTK research. It may also have implications for computer number systems.

In short, we hope that you will take another look at our paper with an open mind. We agree that convergence analyses and an exhaustive experimental analysis of different norm choices are exciting directions, but they were not priorities for our paper.

As for your comments on tightening the mathematical notation, thank you for them. We agree in most cases, but not all:

  • We agree that we glossed over the set-valued nature of “dualize” and how it should act on the zero input. We will clarify this as suggested.
  • We really like the reviewer’s suggestion of introducing the Jacobian-vector product. We will implement this idea as suggested.
  • Regarding your comment that we cannot add the gradient (in our parlance) to the weights, the reviewer has missed line 110, where we state that the weight space is the Euclidean space $W = \mathbb{R}^n$, as is the case in deep learning. Therefore our presentation is sound. We will clarify this in the paper.
  • The reviewer correctly notes that many textbooks define gradients to live in primal space. But there is a gap between these definitions (e.g. Blondel/Roulet Proposition 2.4), which assume an inner product space, and deep learning where we lack a canonical inner product. Furthermore, in PyTorch/JAX, we usually call loss derivatives "gradients". Even more subtly, if we choose to equip the network with the dot product on flattened weight space, then the reviewer's gradient is equivalent to our paper's gradient! We glossed over these technicalities for accessibility, but propose adding an explanatory paragraph and welcome reviewer collaboration on this issue.

As for your other questions:

  • Watermark erasure experiment. The method is SGD, but with vanilla spectral normalization applied to the updates to match the learning rate scale to the dualized method (see Appendix A.2). Since the rank of the gradient is upper bounded by the batch size, the batch size also limits the maximum possible stable rank of the dualized gradient, which is what drives watermark erasure.
  • Different Newton-Schulz iterations. These experiments were simply run by different authors at different times. If you are interested in knowing more about Newton-Schulz: any coefficients that approximate sign(x) yield essentially the same duality map. The important practical considerations are the linear coefficient and the number of iterations, which set the inflation factor for small singular values.
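For readers who want a concrete picture, below is a minimal sketch (in PyTorch; not the exact code or coefficients used in our experiments) of orthogonalizing a gradient with the classical cubic Newton-Schulz iteration. The tuned-coefficient variants discussed above change only the polynomial applied at each step.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 30) -> torch.Tensor:
    """Approximate the orthogonal polar factor U V^T of G (minimal sketch).

    Uses the classical cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X.
    Dividing by the Frobenius norm first puts every singular value in (0, 1],
    where each iteration pushes the nonzero singular values toward 1.
    """
    X = G / (G.norm() + 1e-7)              # Frobenius normalization
    transposed = X.shape[0] > X.shape[1]
    if transposed:                          # iterate on the short-and-wide shape
        X = X.T
    for _ in range(steps):
        A = X @ X.T                         # small Gram matrix
        X = 1.5 * X - 0.5 * A @ X
    return X.T if transposed else X

# Usage sketch: compare against the exact U V^T from an SVD.
G = torch.randn(256, 128)
approx = newton_schulz_orthogonalize(G)
U, S, Vh = torch.linalg.svd(G, full_matrices=False)
print(torch.dist(approx, U @ Vh).item())    # small once enough steps are used
```

Since the cubic polynomial $f(s) = 1.5s - 0.5s^3$ maps $(0, 1]$ toward 1, the iterate converges to $UV^T$; a larger linear coefficient inflates small singular values faster, which is the trade-off mentioned above.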
Reviewer Comment

Dear authors,

thank you very much for the detailed response, and for running additional experiments.

The point that this paper reconciles Shampoo and muP is convincing, so I will raise my score.

Regarding your comment on the gradient living in primal space (not affecting my score): I was confused by the comment "in deep learning where we lack a canonical inner product". While I agree that there is no canonical norm, for the inner product your paper itself uses the canonical inner product (see Definition 1). This inner product is also the same regardless of whether we flatten a weight matrix or not. I think it is not necessary here to deviate from the standard textbooks, where the gradient is an element of the primal space (via the Riesz theorem). As you pointed out as well, everything can be formalized nicely by using the Jacobian-vector product.

Author Comment

Dear Reviewer BSBb,

Thank you again for your level of engagement as a reviewer and your service to the conference---we appreciate it!

We think you are right: the JVP formulation is the way to go. Regarding our comment about the lack of a canonical inner product, what we meant is that the standard dot product is not a structure-aware inner product for neural networks: for example, we do not use the dot product to induce a distance measure on the weight space for the purposes of optimization. Technically, one could re-formulate the statements in our paper that involve the dot product using a different inner product, although we are not advocating for this. But we agree that the JVP renders these considerations moot.

We thank the reviewer again for their very helpful feedback.

Review
Rating: 3

The paper introduces a recursive procedure called modular dualization for constructing duality maps in general neural architectures. This method unifies two important optimization techniques—maximal update parameterization and Shampoo—by demonstrating that both are partial approximations of a single duality map induced by the RMS–RMS operator norm. The modular dualization procedure works by assigning operator norms to individual layers based on their input-output semantics, making the construction explicitly recursive and easy to implement in software packages. Essential features of both µP and Shampoo are recovered from the duality map Linear.dualize, placing these methods within a common theoretical framework. This unified approach has led to significant wall-clock speedups in training transformers ranging from 124 million to 1.5 billion parameters. Inspired by prior work on optimization algorithms that adapt to computation graph structures, the authors aim to provide a clarifying toolkit for the design and analysis of deep learning systems through their theory of modular duality.

Questions for the Authors

No.

Claims and Evidence

Yes.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

No.

Experimental Design and Analysis

Yes.

Supplementary Material

No.

Relation to Prior Work

Not applicable.

Missing Important References

No.

Other Strengths and Weaknesses

Overall, the paper is well written and well motivated.

The theoretical novelty of this paper is quite limited. It looks like a concatenation of several previous works, including steepest descent on a normed space [3], the modular norm [2], and gradient descent w.r.t. a matrix operator norm (Shampoo [1]). In other words, this paper reads more like a Systemization of Knowledge (SoK) paper than a standard conference paper. It is hard to tell which parts of this paper are novel, i.e., not from previous papers. For instance, even Example 7 seems to be a simple extension of the linear module based on the norm defined on lines 313-314.

Regarding the application contribution, please discuss the relation between this paper and Muon. The dualize function for the linear module is simply given by the dual norm of the operator norm of a matrix (with some rescaling factor), which was already introduced in Shampoo [1]. Muon uses the rectangular Newton-Schulz iteration. The authors claim that Muon's algorithm is based on the idea in this paper. However, it is simply an implementation method to calculate $UV^T$ without directly executing the SVD of a matrix. Claiming credit for the invention of this implementation method is not well supported.
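For concreteness, the map under discussion can be sketched as follows (my notation; the rescaling factor is the one I mention above): writing the reduced SVD of a linear layer's gradient as $G = U\Sigma V^T$,

$$\texttt{Linear.dualize}(G) \;\propto\; U V^T,$$

and the rectangular Newton-Schulz iteration is simply one way of approximating $UV^T$ without computing the SVD explicitly.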

[1] Vineet Gupta, Tomer Koren, and Yoram Singer. "Shampoo: Preconditioned stochastic tensor optimization." International Conference on Machine Learning, PMLR, 2018.

[2] T. Large, Y. Liu, M. Huh, H. Bahng, P. Isola, and J. Bernstein. "Scalable optimization in the modular norm." Neural Information Processing Systems, 2024.

[3] Jeremy Bernstein and Laker Newhouse. "Old optimizer, new norm: An anthology." arXiv preprint arXiv:2409.20325, 2024.

Other Comments or Suggestions

The notation of the RMS norm is a bit hard to understand. RMS seems to be an abbreviation of root mean square. Nevertheless, the RMS norm in the paper is defined to be a rescaled version of the standard l2 norm with a scaling factor of $1/\sqrt{d}$, which can be understood as taking the root mean square across all dimensions. It would be better to explain the reason for calling this norm the RMS norm, or to cite a paper that introduces it.
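For reference, the definition as I understand it from the paper:

$$\|x\|_{\mathrm{RMS}} \;=\; \frac{1}{\sqrt{d}}\,\|x\|_2 \;=\; \sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2}, \qquad x \in \mathbb{R}^d,$$

i.e. the root mean square of the entries, which is presumably where the name comes from.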

Author Response

We thank the reviewer for their time and effort reviewing for ICML.

First, we point out that the optimizer anthology [3] is a non-archival workshop paper, and therefore a conference submission on the same topic is in accordance with ICML policy. But even given this, our work goes beyond the anthology by placing the ideas in a general and forward-looking theoretical framework, showing new experimental results such as on the novel numerical properties of dualized training methods, proposing the unification of Shampoo and muP as approximations to a single duality map, and establishing implications for deep learning software libraries through the directly programmable structure of the duality maps.

Second, we are delighted that the reviewer has characterized the Shampoo algorithm as "gradient descent w.r.t. matrix operator norm". This characterization is actually the perspective proposed by the optimizer anthology [3], which again is a non-archival workshop paper. In contrast, the original Shampoo paper [1] presents Shampoo as an approximation to full-matrix Adagrad; see, for example, Sections 1.1 and 1.2 of the Shampoo paper [1]. Even the full-matrix Adagrad perspective on Shampoo is controversial (Xie et al., 2025). But, taken together, we see the reviewer's characterization of Shampoo in this way as evidence of the appeal of the matrix norm and duality perspective!

With regard to Muon and Newton-Schulz, we are grateful for the reviewer bringing this up. We will clarify the language in the paper to make clear that on this axis we made two original contributions that were critical to the speed and success of Muon:

  1. Proposing the use of Newton-Schulz iterations to perform gradient orthogonalization
  2. Proposing the idea of treating the polynomial coefficients in Newton-Schulz as tunable hyperparameters to accelerate convergence

On top of these ideas, of course, Muon adds momentum and various systems innovations such as low-precision casts and a low overhead multi-GPU distributed implementation. We would be delighted to discuss any of these points further. Given these clarifications on the novelty of our contributions, we would be grateful if the reviewer would consider substantially increasing their score.

Review
Rating: 4

This paper proposes a recipe for neural network design and optimization via "modular dualization". A module consists of a forward pass operation, "mass" and "sensitivity" parameters, and a norm associated with the weight space. This design allows for concatenation and composition of modules. The key insight is that the choice of "norm" encodes the desired semantics of the module, and thus affects the geometry of optimization over the weight space. The optimization direction over a given module's weight space is given by the dualization map, i.e. steepest descent with respect to the module's norm. A couple of key sample modules are provided, describing standard linear, embedding, and Conv2D layers, as well as (trivially) weight-less "bond" layers. The benefit of this perspective is demonstrated through the derivation of a new, highly performant optimizer Muon through the modular duality lens, and a simple derivation of the maximal update parameterization (µP) rule.

Questions for the Authors

No critical questions; some clarification questions listed earlier.

Claims and Evidence

The claims in this paper are well-supported by clear exposition and field-testing via Muon and µP.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

I have verified the correctness of all the theoretical results in this paper.

Experimental Design and Analysis

The numerical results are sensible and details are documented in Appendix A.

Supplementary Material

I have read the experiment details contained in Appendix A.

Relation to Prior Work

This work follows a very recent line of work that aims to design a rigorous modular framework for deep learning set-ups in a way that co-designs the architecture with the optimizer in mind. This is a valuable contribution to the community, as it has the potential both to unify many seemingly disparate threads (e.g. µP and Muon) and to serve as a jumping-off point for designing task-specific architectures/optimizers.

Missing Important References

Not that I'm aware of.

Other Strengths and Weaknesses

In addition to what's listed above, I think the main ideas in this paper are appealingly simple to read and understand. I think the well-normed module and modular norm ideas are likely to have immediate impact on optimizer/model design, especially since there seems to be revived interest in new deep learning optimizers that depart from the Adam family. The perspective of tying a layer's optimizer direction with its particular geometry makes a lot of sense, and may plausibly lead to more interpretable architecture behavior.

I have a few minor questions not immediately answered in the paper:

  • What is the role of the mass and sensitivity attributes? Do these parameters affect the choice of norm depending on where in the architecture the module is?

  • Why is, e.g., the RMS-RMS norm intuitively a good choice for a Linear module (see the identity sketched after this list)? As a possibly silly sanity check, if we apply a single Linear module to a linear least-squares problem, why should we expect/want the data and weight distribution to lie in unit balls, since the output $y = Mx$ can be made arbitrarily large or ill-conditioned? As a related note, would an "optimal" choice of norm depend on the data/activation distribution?

  • Why is boosting small singular values as in rectified Shampoo/Muon intuitively good? An immediate thought is that this would put the "noise" and "signal" directions of the weight update at the same magnitude.

  • It is shown that µP is recovered. Is there a general recipe for deriving maximal update / feature learning rules given a recipe of modules?
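To make the RMS-RMS question above concrete, here is the identity I have in mind (a quick consequence of the definitions, not a claim from the paper): for $M \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$,

$$\|M\|_{\mathrm{RMS}\to\mathrm{RMS}} \;=\; \max_{x \neq 0} \frac{\|Mx\|_{\mathrm{RMS}}}{\|x\|_{\mathrm{RMS}}} \;=\; \sqrt{\frac{d_{\text{in}}}{d_{\text{out}}}}\;\|M\|_2,$$

so the unit ball in this norm is just a rescaled spectral-norm ball, and the question is really whether that particular width-dependent rescaling matches the data and activation statistics.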

Other Comments or Suggestions

Minor comments:

  • Notation for the spectral norm is a little confusing, since $\|\cdot\|_*$ is often used to denote the nuclear norm.

  • Different places use different symbols for scalar multiplication: $\times$, $*$.

Author Response

Dear reviewer NWmj, we are sincerely grateful for your time and effort reviewing for ICML. We also really appreciate your thorough and positive review of our work.

Given your comments that “the main ideas in this paper are appealingly simple to read and understand” and also “likely to have immediate impact on optimizer/model design”, we wondered if you would be willing to champion our paper to the area chair?

Regarding Reviewer GE3w’s review, we noticed that they compare our work to the optimizer anthology, which is a non-archival workshop paper. And even so, our work goes beyond the anthology by placing the ideas in a general and forward-looking theoretical framework, showing new experimental results such as on the novel numerical properties of dualized training methods, proposing the connection between Shampoo and muP, and establishing implications for deep learning software libraries.

While Reviewer BSBb provides a careful and rigorous review, we feel that they unfairly characterize our work as incremental. While convergence rate analysis is certainly an important future direction, it was not a goal of our work. We ask that our work be evaluated as a piece of science accounting for its implications for unifying optimization theory, for building new kinds of neural network software libraries, for introducing new numerical linear algebra primitives to the deep learning optimization literature and for potential implications for deep learning number systems.

We are also glad to answer your questions:

  • The role of the mass and sensitivity attributes. If we compose two modules, the input sensitivity of the second module is used to re-scale the norm of the first module (part d of Definition 6). In turn, this means the duality map will calibrate the size of perturbations to the first module with regard to the input sensitivity of the second module. As for mass, this provides the user with control to manually re-scale the norms of certain modules in order to provide precise control over how much feature learning each submodule contributes to the overall network. The motivating application is to allow you to set the update size in the embedding layer in a transformer independent of how many residual blocks there are. See Section 3.3 in the modular norm paper for discussion of this.
  • On the choice of the RMS-RMS norm for Linear modules. We actually do not think that the RMS-RMS norm is necessarily always a good choice. The idea is that if you have RMS control on the inputs and you want RMS control on the outputs, then RMS-RMS control on the weight updates is a good idea. This seems to match behaviour in the hidden layers of transformers, where best practice was already to RMS-normalize the activation spaces (e.g. LLaMA, https://arxiv.org/abs/2302.13971). But, as you suggest, if your input or output data has a different structure, you might want to consider different norms, such as L1 or L-infinity, to give two simple examples.
  • On the intuitive benefits of boosting small singular values in the gradient. We think the idea here is that the small singular values are not necessarily noise. If you inspect the singular value distribution of gradients, as done in say https://arxiv.org/abs/2310.17813, you notice that most of the gradient singular values are actually small compared to the max. From this perspective, it could seem wasteful to make effectively low rank gradient updates as you are not making use of a lot of signal in the gradient. Of course it’s possible that the very tiny singular values are still noise. This is an interesting question to explore further.
  • On a general recipe for deriving maximal update schemes. Actually, the purpose and construction of the modular norm are meant to provide a general recipe for obtaining feature learning in general architectures. The paper proposes that feature learning is obtained by scaling updates in a norm with three key properties:
  1. The neural network output is weight-Lipschitz in the norm
  2. The Lipschitz constant is non-dimensional (it does not depend on, e.g., width or depth)
  3. The tightness of the Lipschitz guarantee is independent of network size

If you find a norm that achieves these properties, then it is reasonable that using it to scale updates will confer precise and scale-independent control over the amount of feature learning. See Section 2.1 of the modular norm paper for an informal discussion of this.
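As a minimal worked illustration of property 1 (a sketch assuming RMS-bounded inputs, not a formal statement from the paper): for a single linear layer with input $x$ and weight update $\Delta W$,

$$\|(W + \Delta W)x - Wx\|_{\mathrm{RMS}} \;=\; \|\Delta W x\|_{\mathrm{RMS}} \;\le\; \|\Delta W\|_{\mathrm{RMS}\to\mathrm{RMS}}\,\|x\|_{\mathrm{RMS}},$$

so constraining the update in the RMS-RMS operator norm bounds the change in the layer's features by a constant that does not grow with width. This is the sense in which dualizing in a well-chosen norm buys scale-independent control over feature learning.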

Thank you again for your review. Again, we are immensely grateful for your time and effort reviewing for ICML.

Reviewer Comment

Thanks to the authors for providing detailed answers to my questions.

In light of the mass/sensitivity parameters and the implicit goal of providing maximal updates/feature learning, I wonder if it makes sense for the authors to dedicate a small section to walking through what "feature learning" means in your context, with a pedagogical worked example showing how well-normed modules might scale things correctly to achieve it (even if it is heuristic). I think this would help a lot in contextualizing why a type system and co-design of deep learning architectures/optimizers concretely aligns with a (highly touted) goal of the current deep learning optimization literature.

Regarding the choice of RMS-RMS norm, that is helpful to know. I wonder (perhaps irresponsibly) if there is some thread to pull on here to claim well-normed modules can allow one to avoid certain normalization layers, since there seems to be literature suggesting normalization causes various headaches, or questioning whether it is fundamentally required.

Lastly, regarding gradient noise, I fully agree that the magnitude of the singular values can be spurious with regard to which directions are relevant, and that orthonormalizing is one way to boost possibly undervalued directions. A last, possibly irresponsible thought is the following: if the magnitudes of the "noise" vs. "signal" directions of, say, the layer-wise gradient are interspersed, some prior literature in statistical signal processing suggests this can be caused by heteroscedasticity, and that proper whitening/normalization can "reveal" the hidden signal directions (albeit in much simpler settings than deep learning). Given that whitening/normalization can always be cast as dualizing under a (possibly iterate-dependent) norm, I wonder if the know-how in that literature, e.g. https://arxiv.org/abs/1611.05550, can help provide some principles for explaining or designing optimizers that target this "signal boosting" behavior.

Author Comment

Dear Reviewer NWmj,

We want to recognize your generosity in sharing suggestions for improving our paper, as well as research ideas.

  1. We will add a section highlighting the connection between well-normed modules and feature learning. To make the connection concrete, we will include a worked example involving a linear neural network layer. We will also explain how the treatment extends to compositions and concatenations of modules.
  2. We share the hope that well-normed modules might obviate the need for normalizing the activations, although we want to do more research on this question before making strong claims here.
  3. We love the idea of trying to tackle heterogeneity or heteroscedasticity in gradient noise by porting tools and know-how from statistical signal processing. Thank you for pointing us to the literature on ePCA; the different de-biasing strategies are fascinating. Trying to nail down and exploit the noise structure of stochastic gradients in neural networks is an exciting research topic, and we see the connection that the reviewer is pointing out.

In conclusion: we believe our paper has made progress on building a conceptual scaffolding for thinking rigorously about first-order optimization in deep learning. We believe the work could seed a lot of further progress. We have a lot to say, and we are bursting with ideas that we want to share with the ICML community. We would be immensely grateful for any help you can give us in elevating our work. We will pay it forward!

Final Decision

Dear Authors,

Thank you for your submission to ICML and for your contribution. Your paper proposes a novel, theoretically grounded framework for constructing duality maps tailored to neural architectures, and positions this framework as a unifying lens for interpreting and improving modern optimizers like Shampoo and maximal update parameterization (µP). The work presents a recursive construction of duality maps, practical algorithms for dualizing key layers, and early empirical results highlighting the potential of duality-based optimization.

The reviewers appreciated the clarity of exposition and the conceptual originality of the framework. Reviewer NWmj found the ideas compelling and potentially impactful for future optimizer/model co-design. Reviewers GE3w and BSBb initially raised concerns regarding the novelty relative to non-archival prior work, but ultimately acknowledged the distinctive theoretical contributions, unification of existing methods, and implications for practical optimization tools like Muon. Your thoughtful rebuttal, additional clarifications, and concrete follow-up experiments helped address their concerns and resulted in increased scores.

Given the strength of the theoretical contributions, the practical relevance demonstrated through recent adoption, and the overall positive trajectory of the discussion, I am pleased to recommend acceptance to ICML 2025. I encourage you to use the camera-ready version to further highlight the framework's implications for feature learning, norm choice, and optimizer design, and to clarify mathematical notation as discussed with reviewers.

Congratulations on your contribution.

Best regards, AC