A Stable Whitening Optimizer for Efficient Neural Network Training
Abstract
Reviews and Discussion
This paper proposes a novel whitening optimizer for neural network training that improves the convergence of learning. The major contributions include: 1. stability enhancement and computational savings for long-term matrix calculations; 2. enabling learning rate transfer across network width; 3. an iterate-average learning scheme for faster learning. Numerical studies support the effectiveness of the proposed optimizer, with significant improvements in learning efficiency.
Strengths and Weaknesses
Strengths
- The paper is well written.
- Using a bounded update with a simple iterative-averaging scheme is novel and enhances the stability of training.
- Numerical results clearly show the superiority of the proposed optimizer over previous ones.
Weaknesses
The computational cost of the proposed optimizer should be stated and justified more clearly.
Questions
- Why is the additional wall-clock time required for SPlus negligible? (Page 9) If this is merely a numerical observation, please clarify and provide as much detail as possible for theoretical justifications.
- Please specify how much more is the significant wall-clock cost for matrix eigen-decomposition with justifications. (Page 9) Monetary/space/time costs and their tradeoffs should be made clearer instead of only mentioning the time efficiency.
Limitations
No. The tradeoffs among monetary/space/time costs should be better justified and compared with existing approaches, to better support the advantages of the learning scheme and to state its disadvantages more clearly.
Final Justification
The complexity issue was addressed.
Formatting Issues
N/A
Thank you for your review. It appears the main concern is regarding the wall-clock requirements of the eigendecomposition.
- Implementation of Eigendecomposition
We first note that eigendecomposition is used in all optimizers from the Shampoo family, including SOAP and SPlus. As described in Section 5.1, we use a distributed method to compute this decomposition for all three methods. Regarding the claim that the additional matrix multiplication is negligible, we agree that this is misleading, and will remove this claim in the revision in favor of a more detailed numerical analysis. See below for an additional analysis we ran investigating wall-clock time.
We note that wall-clock time should be best understood as a rough estimate, as there is an inherent dependency on the hardware used, the specific distributed system, speed of inter-host communication, etc.
- Ablation on Eigendecomposition Interval
To examine the exact relationship between wall-clock time and eigendecomposition, we ablate the interval between matrix eigendecompositions (in gradient steps) for both Shampoo and SPlus over [1, 5, 10, 25, 100, 250, 500]. Due to computational constraints, we measure validation loss after 1000 gradient steps (vs. 10,000 in the paper) and only on the LLM objective. All other hyperparameters are as discussed in the paper.
Validation Loss after 1K gradient steps:
| Interval | 1 | 5 | 10 | 25 | 100 | 250 | 500 |
|---|---|---|---|---|---|---|---|
| SPlus | 3.316 | 3.332 | 3.332 | 3.335 | 3.336 | 3.336 | 3.337 |
| Shampoo | 3.332 | 3.342 | 3.340 | 3.340 | Div. | Div. | Div. |
| Adam | 3.36 |
Steps-to-Adam:
| Interval | 1 | 5 | 10 | 25 | 100 | 250 | 500 |
|---|---|---|---|---|---|---|---|
| SPlus | 0.78 | 0.71 | 0.75 | 0.76 | 0.77 | 0.77 | 0.81 |
| Shampoo | 0.86 | 0.80 | 0.80 | 0.79 | Div. | Div. | Div. |
As requested with "Please specify how much more is the significant wall-clock cost for matrix eigen-decomposition with justifications", we make explicit the wall-clock time required per step under various decomposition intervals:
Average Time Per Update:
| Interval | 1 | 5 | 10 | 25 | 100 | 250 | 500 |
|---|---|---|---|---|---|---|---|
| SPlus | 7.39 | 1.79 | 1.09 | 0.70 | 0.48 | 0.44 | 0.43 |
| Shampoo | 7.41 | 1.78 | 1.09 | 0.67 | 0.42 | 0.41 | 0.40 |
| Adam | 0.34 |
Time-to-Adam:
| Interval | 1 | 5 | 10 | 25 | 100 | 250 | 500 |
|---|---|---|---|---|---|---|---|
| SPlus | 16.21 | 3.34 | 2.05 | 1.20 | 0.74 | 0.65 | 0.67 |
| Shampoo | 18.11 | 3.79 | 2.18 | 1.21 | Div. | Div. | Div. |
Performance improves as eigendecompositions are performed more often (i.e., with a smaller interval), but with diminishing returns. Per-step wall-clock time scales roughly inversely with the interval, as eigendecomposition is the main computational cost. Notably, while there is a sweet spot for the interval hyperparameter in terms of time-to-Adam (250 in this case), Shampoo fails to capitalize on it, as training is unstable for intervals >= 100.
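As a rough sanity check on this scaling (a back-of-the-envelope sketch, not an additional measurement; the two constants are simply read off the interval-1 and interval-500 entries of the SPlus timing row):

```python
# Amortized per-update time ~ base per-step cost + eigendecomposition cost / interval.
t_base = 0.43          # SPlus per-update time at interval 500, where the decomposition cost is amortized away
t_eig = 7.39 - t_base  # extra cost of one full decomposition, read off the interval-1 entry

for interval, measured in zip([1, 5, 10, 25, 100, 250, 500],
                              [7.39, 1.79, 1.09, 0.70, 0.48, 0.44, 0.43]):
    predicted = t_base + t_eig / interval
    print(f"interval {interval:3d}: predicted {predicted:.2f}, measured {measured:.2f}")
```

The predictions land within a few hundredths of the measured SPlus column, which is what we mean by the per-step cost being dominated by the amortized eigendecomposition.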
We would like to thank you for raising relevant questions and suggestions. We believe the additional ablations have increased the clarity of the results and strengthened the paper. Please let us know if you have any remaining concerns or questions.
Thanks for clarifying the time complexity. Is there any concern for space complexity?
Regarding space complexity, please see the response to Reviewer mSRL, which we have reiterated here:
- Memory Usage of SPlus
The additional memory requirement is a fair point and a weakness of the method in comparison to Adam. However, we note that the core SPlus update saves memory compared to the most competitive baseline (SOAP), which requires an additional set of parameters to track the second moment. In settings where iterative averaging is disabled, SPlus can roughly approximate SOAP while not requiring additional memory. We confirm this via an additional set of experiments, where SPlus is run without iterate averaging:
Validation Loss (no iterate averaging):
| Method | LLM-Init | LLM-10K | LLM-50K | Approx Memory Usage |
|---|---|---|---|---|
| Adam | 3.132 | 3.114 | 3.012 | 3nm |
| SPlus No EMA | 3.087 | 3.057 | 3.010 | 2nm + 2(n^2 + m^2) |
| SOAP | 3.085 | 3.072 | 2.995 | 3nm + 2(n^2 + m^2) |
Steps-to-Adam (no iterate averaging):
| Method | LLM-Init | LLM-10K | LLM-50K | Approx Memory Usage |
|---|---|---|---|---|
| Adam | 1.0 | 1.0 | 1.0 | 3nm |
| SPlus No EMA | 0.729 | 0.662 | 0.685 | 2nm + 2(n^2 + m^2) |
| SOAP | 0.712 | 0.660 | 0.677 | 3nm + 2(n^2 + m^2) |
In this way, one takeaway from this paper is that instant-sign normalization can be used to roughly match SOAP performance while reducing the memory footprint by one set of parameters.
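For concreteness, a small helper that evaluates the approximate memory formulas from the table above (a sketch only: it counts optimizer-state floats for a single n x m weight matrix, the breakdown in the comments is our shorthand, and the 4096 x 4096 example shape is arbitrary):

```python
def approx_optimizer_floats(n, m):
    # Rough float counts per n x m weight matrix, following the formulas in the table above.
    return {
        "Adam": 3 * n * m,                                # parameters + first/second moments
        "SPlus (no EMA)": 2 * n * m + 2 * (n**2 + m**2),  # parameters + momentum, plus L/R statistics and cached bases
        "SOAP": 3 * n * m + 2 * (n**2 + m**2),            # as above, plus an extra second-moment buffer
    }

if __name__ == "__main__":
    for name, floats in approx_optimizer_floats(4096, 4096).items():
        print(f"{name}: {floats / 1e6:.1f}M floats")
```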
If the earlier and above results have strengthened the results of the paper, would you consider updating your score accordingly?
Thanks for the clarification. I keep the positive score.
This paper takes an empirical approach to study neural network optimization based on the Shampoo family of algorithms. The authors identify three key issues with the Shampoo method and address them by proposing a method called SPlus.
The three key issues are namely that: 1) Shampoo diverges when matrix inverses are cached for too long, 2) Shampoo updates are not properly scaled with the network width, and 3) Shampoo is more sensitive to high learning rates. These are addressed by 1) performing a so-called "instant-sign normalization", 2) introducing a shape-aware scaling factor based on the dimensions of the layer gradient, and 3) using an iterate-averaging scheme, which allows higher learning rates to be used.
The proposed method SPlus is compared against other preconditioned optimizers, including Adam, Sophia, Shampoo, SOAP, and Muon, on Transformer training for language modeling, image classification, and diffusion modeling, which highlights the advantages of the proposed method over standard Adam.
Strengths and Weaknesses
Strengths:
- The paper is clearly written and easy to follow.
- The improvements from the suggested modifications are illustrated well, and together they are able to outperform Adam (and many other optimizers) both in terms of iterations and wall-clock time across all tested settings.
- The authors provide code, details about the experiments, as well as pseudocode of SPlus in the appendix.
Weaknesses:
- The proposed modifications are not necessarily novel per se. Instant-sign normalization is essentially Signum [1] in Shampoo's eigenbasis, which is reminiscent of SOAP being a rank-1 Adam approximation in a rotated eigenbasis. Also, iterate-averaging is already quite well known [2]. As such, I am not sure about the novelty of this work. (The authors also refer to previous work; this is not my critique point.)
- I think the comparison of SPlus to Adam is overselling the proposed method a bit. If one checks Figure 6, then many of recently proposed methods are quite competitive to SPlus. For instance Schedule-Free Adam is only 6% slower than SPlus. Given that Schedule-Free Adam improves considerably on Adam, I wonder how a Schedule-Free version of SOAP, PSGD and Muon would fare.
- In general I am always a bit skeptical about wall-clock time runs, as they also highly depend on how each of the methods is implemented in code, which risks obscuring insights. For instance, the authors describe how the updates are computed in SPlus in lines 270-276 and how this is optimized for SPlus (which is a great thing to mention), but it is less clear how other methods are implemented and used.
- The ablation studies clearly show the effect and benefit of each modification to Shampoo.
[1] Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. signSGD: Compressed optimisation for non-convex problems. In International Conference on Machine Learning, pp. 560–569. PMLR, 2018.
[2] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407, 2018.
Questions
- I wonder whether the hypothesis of the authors for the divergence of Shampoo when the matrix-inversion is cached too long can be empirically tested? Perhaps by checking how well the gradient and matrix-inverses are aligned throughout training?
- Can the authors also provide ablation studies for the final comparison in Steps-To-Adam and Time-to-Adam for the different modifications that they proposed? (i.e. only using the instant-normalization, only using shape-aware scaling, only using iterate-averaging, as well as instant-normalization + iterate-averaging and instant-normalization + shape-aware scaling?) It would be sufficient for me to see it for a single task, e.g. only on the LLM or ViT.
Limitations
Yes.
Final Justification
This paper identifies three key limitations in Shampoo and proposes a new optimizer SPlus, which addresses all of them by introducing three modifications, namely 1) instant-normalization, 2) a width-dependent scaling factor and 3) employing iterate-averaging.
The discussion phase was more focused on understanding the effect of point 1) and 3) in SPlus through further ablation studies, and its comparison to other methods from the family of "whitening optimizers", including Muon, SOAP and Shampoo.
I suggest acceptance by giving a score of 4 because the proposed method SPlus does address important limitations in Shampoo, further expanding the family of "whitening optimizers". The reason for not giving a higher score is because the proposed modifications are not that novel and mainly taken from other optimization methods (e.g. instant-sign normalization resembles Signum and the benefit of iterate-averaging was already known for SGD.)
Formatting Issues
/
Thank you for your detailed review and questions. Please see our response below:
- Ablating the proposed changes
Addressing the request for an ablation of each modification, we ran an additional set of experiments to properly ablate the changes in SPlus, specifically focusing on the interaction between iterate-averaging and instant-sign normalization in the LLM setting. In particular, we ran an experiment with SPlus without the use of iterate averaging. We did not ablate shape-aware scaling, as this requires another sweep to find the optimal learning rate, which we are unable to perform due to compute limitations.
To address the question "I wonder how a Schedule-Free version of SOAP, PSGD and Muon would fare", we additionally compare to SOAP + iterate averaging, as SOAP is the highest-performing baseline method in our experiments. (EMA iterate averaging was more performant than Schedule-Free in our experiments.)
Steps-to-Adam:
| Method | LLM-Init | LLM-10K | LLM-50K |
|---|---|---|---|
| Adam | 1.0 | 1.0 | 1.0 |
| Shampoo | Div. | Div. | 0.699 |
| Shampoo + Instant-Sign (i.e. SPlus without EMA) | 0.729 | 0.662 | 0.685 |
| Shampoo + Instant-Sign + Iterate Averaging (SPlus) | 0.487 | 0.422 | 0.348 |
| SOAP | 0.712 | 0.660 | 0.677 |
| SOAP + Iterate Averaging | 0.513 | 0.416 | 0.317 |
As shown in the above table, the usage of instant-sign normalization is important to stabilize Shampoo. Both SPlus and SOAP benefit from iterate averaging. SOAP performs slightly better at later checkpoints, but requires keeping an additional set of parameters in memory to track the second moment.
- Wall-clock time and inversion interval
Thank you for mentioning the skepticism on wall-clock time -- we agree that wall-clock times are hard to standardize and should be seen as a rough estimate -- there is an inherent dependency on the hardware used, the specific distributed system, speed of inter-host communication, etc. The methods with the longest time-per-step in our experiment suite are Shampoo/SOAP/SPlus, especially at frequent inversion intervals. For all three of these methods, we use the same distributed behavior.
To understand the tradeoff between wall-clock time and performance, we conduct an ablation on the matrix-inversion interval (in gradient steps) in Shampoo and SPlus over [1, 5, 10, 25, 100, 250, 500]. Due to computational constraints, we measure validation loss after 1000 gradient steps (vs. 10,000 in the paper) and only on the LLM objective. All other hyperparameters are as discussed in the paper.
Validation Loss after 1K gradient steps:
| Interval | 1 | 5 | 10 | 25 | 100 | 250 | 500 |
|---|---|---|---|---|---|---|---|
| SPlus | 3.316 | 3.332 | 3.332 | 3.335 | 3.336 | 3.336 | 3.337 |
| Shampoo | 3.332 | 3.342 | 3.340 | 3.340 | Div. | Div. | Div. |
| Adam | 3.36 |
Steps-to-Adam:
| Interval | 1 | 5 | 10 | 25 | 100 | 250 | 500 |
|---|---|---|---|---|---|---|---|
| SPlus | 0.78 | 0.71 | 0.75 | 0.76 | 0.77 | 0.77 | 0.81 |
| Shampoo | 0.86 | 0.80 | 0.80 | 0.79 | Div. | Div. | Div. |
Average Time Per Update:
| Interval | 1 | 5 | 10 | 25 | 100 | 250 | 500 |
|---|---|---|---|---|---|---|---|
| SPlus | 7.39 | 1.79 | 1.09 | 0.70 | 0.48 | 0.44 | 0.43 |
| Shampoo | 7.41 | 1.78 | 1.09 | 0.67 | 0.42 | 0.41 | 0.40 |
| Adam | 0.34 |
Time-to-Adam:
| Interval | 1 | 5 | 10 | 25 | 100 | 250 | 500 |
|---|---|---|---|---|---|---|---|
| SPlus | 16.21 | 3.34 | 2.05 | 1.20 | 0.74 | 0.65 | 0.67 |
| Shampoo | 18.11 | 3.79 | 2.18 | 1.21 | Div. | Div. | Div. |
Performance improves as eigendecompositions are performed more often (i.e., with a smaller interval), but with diminishing returns. Per-step wall-clock time scales roughly inversely with the interval, as eigendecomposition is the main computational cost. Notably, while there is a sweet spot for the interval hyperparameter in terms of time-to-Adam (250 in this case), Shampoo fails to capitalize on it, as training is unstable for intervals >= 100.
Thank you for the detailed responses and providing additional ablation studies.
Most of my questions have been addressed, but I do have some remaining ones and follow-up questions: Looking at the ablation study, it appears that instant-normalization has a stabilizing effect but almost no acceleration effect, as Shampoo with and without instant-norm have very similar values at the later checkpoints, while the speed-up mainly comes from iterate averaging; this is also evident from the fact that SOAP + iterate-averaging performs very similarly to SPlus, if not better. I wonder if iterate-averaging alone already provides some stabilizing effect as well. Have the authors considered running SPlus without instant-normalization?
I have also read your rebuttal to Reviewer Xpwz and noticed that Muon with smaller weight decay is much more competitive (without iterate-averaging). In particular, due to its smaller per-iteration complexity, it converges faster in wall-clock time. I wonder if Muon can achieve even better performance by adding iterate-averaging?
Thank you for your response! Regarding SPlus without instant-sign normalization -- this would essentially be Shampoo with iterate averaging, and as shown in the above table, Shampoo has a tendency to diverge when the eigenbases are cached for long periods. Iterate averaging would not prevent divergence, as the diverged parameters explode in magnitude and are unusable even when averaged.
Regarding Muon, yes, Muon can likely achieve a similar gain in performance from iterate averaging as SPlus/SOAP do. As shown in the table you are referring to, Muon is competitive with SPlus/SOAP in many cases, and has an efficient implementation in terms of wall-clock time -- however, in the above results Muon tends to perform worse when training from later checkpoints, especially in the DiT and ViT cases, which have higher-variance batches. (Diffusion has inherently high variance due to the noised objective, and ViT has less signal per batch due to the single classification target, in comparison to language modelling, which calculates a loss at each token.) This can be intuitively understood, as Muon approximates the orthogonalization per batch, whereas SPlus/SOAP/Shampoo can utilize historical information.
We hope this answers your questions. As mentioned in the original paper, the family of 'whitening' optimizers (e.g. Shampoo, SOAP, PSGD, Muon, SPlus) are similar in form. One takeaway from this paper is that within the Shampoo family, SPlus provides a way to roughly match the performance of SOAP while reducing the memory footprint. In all methods, iterate averaging can increase performance at each step (at the cost of additional memory, which we aim to save via the previous technique in SPlus).
If the additional experiments have strengthened your view on the paper, would you consider updating the score accordingly?
Thank you for the clarifications!
I would like to thank the authors for taking the time to clarify my questions and providing detailed explanations. I got a better understanding of SPlus and its performance compared to other optimizers in the family of "whitening" optimizers as a whole now.
I acknowledge the contributions as valuable and suggest acceptance of this work and keep my positive score.
This paper introduces SPlus, a stable and efficient optimizer that addresses three key limitations of the Shampoo algorithm: susceptibility to divergence, shape sensitivity across network widths, and elevated gradient noise under large learning rates. To mitigate these issues, the authors propose three targeted modifications. Extensive evaluations on Transformer architectures across three representative tasks demonstrate that SPlus yields substantial improvements in both convergence speed and wall-clock efficiency.
Strengths and Weaknesses
Strengths
1. The paper proposes targeted solutions to common optimization challenges of the Shampoo algorithm, such as instability and sensitivity to hyper-parameters, making approximately second-order optimizers potentially applicable to training deep neural networks.
2. The experiments cover multiple training stages and diverse tasks (LLM, ViT, DiT), with comparisons against a strong set of baselines. SPlus shows clear improvements in both convergence and wall-clock efficiency.
3. The paper is well-written and easy to follow. Source code is provided, supporting strong reproducibility.
Weaknesses
Limited Theoretical Depth: The proposed improvements are primarily empirical in nature, targeting issues such as instability and gradient noise during optimization. While effective, techniques like instant-sign normalization and parameter averaging build on existing methods, and the paper lacks deeper theoretical analysis—e.g., why parameter averaging effectively mitigates noise in this setting.
Increased Memory Overhead: As noted in the limitations, the method incurs additional memory cost compared to Adam, due to the need to maintain historical parameters for averaging, which may limit its applicability in resource-constrained environments. It would also be better for this paper to provide a comparison of wall-clock time cost.
Limited Architectural and Scale Generalization: The generalization of SPlus to larger-scale models and non-Transformer architectures remains untested. The steps-to-Adam metric is not sufficient to evaluate the performance. It would be better to provide the final results (e.g., accuracy on ImageNet classification and analogous metrics for training LLMs) when using different optimizers with equal/sufficient training steps. Besides, it is not clear how the introduced hyper-parameters affect the performance. I think this paper should provide an ablation study for the hyper-parameters of SPlus.
Other minor issues:
(1) The notation is not self-contained; some symbols are used without being defined, and the paper should define them so that it is self-contained. A similar problem occurs in Eqn. 17, where a symbol is undefined, and the left-hand side of Eqn. 17 appears to have an error.
(2)The axes in Figures 3 and 5 lack clear labels.
Questions
- The effect of the averaging coefficient in parameter averaging is unclear. It would be helpful to provide an empirical sensitivity analysis or practical guidance on choosing it.
- Can a fixed choice of the hyperparameter maintain an appropriate update magnitude across different training phases? Specifically, does the parameter averaging mechanism risk inducing insufficient updates in the early stages, or causing instability in later training due to oscillatory behavior?
- Could additional theoretical or intuitive justification be provided for why parameter averaging helps mitigate optimization noise?
Limitations
Yes
Final Justification
The proposed method is effective and the writing is good, and my concerns were well addressed in the rebuttal. I remain positive with "Borderline accept".
Formatting Issues
No
Thank you for your review and detailed questions. Please see the detailed response:
- Ablating iterate averaging
To answer the question on the sensitivity of the averaging coefficient, we ran an additional set of ablations, focusing on the LLM setting. Our experimental setup already measures the performance at three stages of training (initialization, 10k checkpoint, 50k checkpoint), and we report values for SPlus while ablating the averaging coefficient:
Validation Loss of SPlus:
| Starting Checkpoint \ Averaging Coefficient | | | | | | |
|---|---|---|---|---|---|---|
| LLM at Init | 3.085 | 3.064 | 3.028 | 3.017 | 3.003 | 3.077 |
| LLM at 10K | 3.062 | 3.047 | 3.012 | 3.002 | 2.988 | 3.303 |
| LLM at 50K | 3.010 | 2.995 | 2.964 | 2.954 | 2.936 | 3.035 |
Steps-to-Adam of SPlus:
| Starting Checkpoint \ Averaging Coefficient | | | | | | |
|---|---|---|---|---|---|---|
| LLM at Init | 0.73 | 0.616 | 0.479 | 0.461 | 0.486 | 0.771 |
| LLM at 10K | 0.667 | 0.556 | 0.409 | 0.384 | 0.419 | Div. |
| LLM at 50K | 0.998 | 0.792 | 0.495 | 0.425 | 0.383 | Div. |
As shown in the above table, we can get slightly better performance by tuning the averaging coefficient per training setting. However, using a default value of 0.999 (as we do in the main paper) gives nearly the same performance across the board, regardless of the starting checkpoint. The intuitive reason for the effectiveness of iterate averaging, especially under a constant learning rate, is that it reduces the effect of noise during optimization -- if we assume that each gradient step introduces some independent noise due to linearization/stochastic batches, then iterate averaging allows the noise to cancel out (the variance of the mean of N i.i.d. random variables decays as 1/N). We will include this discussion in the next revision.
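To make this intuition concrete, here is a toy simulation (a sketch only; the scalar setting, noise scale, and coefficient are illustrative and not taken from our training runs):

```python
import numpy as np

# Each "update" lands at the optimum plus independent noise; an EMA over the
# iterates averages much of that noise away.
rng = np.random.default_rng(0)
w_star, noise_std, beta, steps = 1.0, 0.1, 0.999, 5000

w_avg = None
live_err, avg_err = [], []
for t in range(steps):
    w = w_star + noise_std * rng.standard_normal()  # noisy "live" iterate
    w_avg = w if w_avg is None else beta * w_avg + (1 - beta) * w
    if t >= steps // 2:  # skip the EMA warm-up period
        live_err.append((w - w_star) ** 2)
        avg_err.append((w_avg - w_star) ** 2)

print("RMS error, live iterates    :", np.sqrt(np.mean(live_err)))
print("RMS error, averaged iterates:", np.sqrt(np.mean(avg_err)))
```

The averaged iterate sits roughly an order of magnitude closer to the optimum than any individual iterate, mirroring the 1/N variance reduction described above.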
- Memory Usage of SPlus
The additional memory requirement is a fair point and a weakness of the method. However, we note that the core SPlus update saves memory compared to the most competitive baseline (SOAP), which requires an additional set of parameters to track the second moment. In settings where iterative averaging is disabled, SPlus can roughly approximate SOAP while not requiring additional memory. We confirm this via an additional set of experiments, where SPlus is run without iterate averaging:
Validation Loss (no iterate averaging):
| Method | LLM-Init | LLM-10K | LLM-50K | Approx Memory Usage |
|---|---|---|---|---|
| Adam | 3.132 | 3.114 | 3.012 | 3nm |
| SPlus No EMA | 3.087 | 3.057 | 3.010 | 2nm + 2(n^2 + m^2) |
| SOAP | 3.085 | 3.072 | 2.995 | 3nm + 2(n^2 + m^2) |
Steps-to-Adam (no iterate averaging):
| Method | LLM-Init | LLM-10K | LLM-50K | Approx Memory Usage |
|---|---|---|---|---|
| Adam | 1.0 | 1.0 | 1.0 | 3nm |
| SPlus No EMA | 0.729 | 0.662 | 0.685 | 2nm + 2(n^2 + m^2) |
| SOAP | 0.712 | 0.660 | 0.677 | 3nm + 2(n^2 + m^2) |
- Notation and figures
Regarding the notation and the axes, thank you for catching these issues; we have fixed them in the next revision. In our case, the first symbol in question represents the gradient with respect to the loss function, the second is an arbitrary vector over which the argmin is taken, and the third denotes the set of "live" parameters, which are averaged into the slow (averaged) parameters.
- Why do we use the Steps-to-Adam metric?
We chose steps-to-Adam as it is an interpretable way to understand the effective speedup of one optimizer to the rest, and many prior works report a speedup over Adam, e.g [1, 2]. Raw validation losses are less interpretable, as the relative difference is harder to compare. That said, in the above tables, we have also provided raw validation losses.
- Regarding the Transformer-only experiments
It is a fair point that our experiments are limited to Transformers, as mentioned in the limitations section. We note that our experimental setup is still broader than that of many previous works [1, 2], which only consider language modelling -- we additionally consider diffusion modelling and image classification.
We would like to thank you for raising relevant questions and suggestions. We believe the additional ablations have increased the clarity of the results and strengthened the paper. Please let us know if you have any remaining concerns or questions.
[1] Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training. Liu et al., 2023.
[2] SOAP: Improving and Stabilizing Shampoo using Adam. Vyas et al., 2024.
Dear authors:
Thanks for the clarification. My main concerns are addressed, and I keep the positive score.
This paper proposes a new optimizer termed SPlus. The algorithm is positioned as an improvement over shampoo. SPlus maintains exponential moving averages of the left and right squared gradients, and then normalizes the new gradient by conducting sign-normalization in the shampoo eigenbasis, which is re-computed every so often. The authors say that SPlus allows for these eigenvalue computations to be performed less regularly than in shampoo, which yields wall clock gains. The main justification is empirical effectiveness. SPlus also includes two other components: a scaling adjustment which allows the learning rate to be transferred across widths (as in Mu-P), and iterate averaging, which averages out the parameter noise/oscillations and is claimed to allow larger learning rates.
Strengths and Weaknesses
Strengths:
- the experiments (all on transformers) show that SPlus beats not only Adam but also several newer methods including Muon, SOAP, and PSGD.
Weaknesses:
- I have questions about the comparison to Muon -- see below
- There are a number of mathematical typos:
- equation 1, which is the first equation of the paper, is wrong; the RHS should be , not
- equation 3 is also wrong; the RHS should be
- equation 11, which is the main equation describing the proposed algorithm, is wrong; the transpose should be on the second and not the first one
- I believe the SPlus algorithm box is missing subscripts for some of the 's.
Questions
If I understand things correctly, if we put aside iterate averaging and shape-aware scaling, then SPlus is very similar to Muon. In particular, SPlus with is almost exactly Muon (or rather, an idealized version of Muon which exactly computes the matrix sign rather than using the Newton-Schulz iteration), with the only remaining differences being: (1) whether L/R is computed from (SPlus) or (Muon); and (2) whether momentum is parameterized as $\overline{G} \leftarrow (1 - \beta_1)\overline{G} + \beta_1 G$ (SPlus) vs (Muon), which are the same up to a rescaling of the learning rate by . I am trying to understand which of the differences between SPlus and Muon are responsible for SPlus's superior performance that was observed in experiments. I have the following questions:
- If you remove iterate averaging and shape-aware scaling from SPlus, or alternatively if you add them to Muon, how do the two algorithms compare? I am especially thinking about iterate averaging.
- How does SPlus with compare to SPlus with (i.e the value used in your experiments)?
Limitations
Yes
Formatting Issues
None
Thank you for the detailed review and analysis. Please find our detailed response below:
- Relation and comparison to Muon
Regarding the relation to Muon, we also agree with your analysis here. The relationship between Shampoo and Muon is studied in [1], and SPlus can be understood as an intermediate point where the eigenvalues are calculated via the sign operator, but the eigenvectors are calculated over the historical average. We will further clarify this relationship in the paper.
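For concreteness, a minimal single-matrix sketch of the three update directions as discussed here (illustrative only, not our actual implementations; momentum, shape-aware scaling, iterate averaging, and the periodic caching of the eigenbases are all omitted):

```python
import numpy as np

def shampoo_like_direction(G, L, R, eps=1e-8):
    # Shampoo: whiten the gradient with inverse fourth roots of the accumulated
    # left/right statistics L ~ E[G G^T] and R ~ E[G^T G].
    eL, QL = np.linalg.eigh(L)
    eR, QR = np.linalg.eigh(R)
    L_inv4 = QL @ np.diag((np.maximum(eL, 0.0) + eps) ** -0.25) @ QL.T
    R_inv4 = QR @ np.diag((np.maximum(eR, 0.0) + eps) ** -0.25) @ QR.T
    return L_inv4 @ G @ R_inv4

def splus_like_direction(G, L, R):
    # SPlus (as described above): keep the historical eigenbases of L and R,
    # but replace eigenvalue-based magnitudes with an instant sign.
    _, QL = np.linalg.eigh(L)
    _, QR = np.linalg.eigh(R)
    return QL @ np.sign(QL.T @ G @ QR) @ QR.T

def muon_like_direction(G):
    # Muon (idealized): orthogonalize the current (momentum) gradient itself,
    # i.e. take U V^T from its SVD, with no historical L/R statistics.
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    G = rng.standard_normal((8, 16))
    L, R = G @ G.T, G.T @ G  # stand-ins for the EMA statistics
    for name, d in [("Shampoo", shampoo_like_direction(G, L, R)),
                    ("SPlus", splus_like_direction(G, L, R)),
                    ("Muon", muon_like_direction(G))]:
        print(name, d.shape)
```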
Regarding the gap in performance, we found through further analysis that the originally reported Muon numbers used a default weight decay that caused a substantial degradation. In particular, we use weight decay = 0.1*lr for all experiments -- however, this setting is suboptimal for Muon, which has an inherently high learning rate. For a more representative comparison, we re-ran the Muon experiments with a weight decay of 0.001 * lr, and report the new numbers:
Steps-to-Adam:
| Method | LLM-Init | LLM-10K | LLM-50K | ViT-Init | ViT-10K | ViT-50K | DiT-Init | DiT-10K | DiT-50K |
|---|---|---|---|---|---|---|---|---|---|
| SOAP | 0.712 | 0.66 | 0.677 | 0.574 | 0.57 | 0.557 | 0.486 | 0.459 | 0.488 |
| Muon (new) | 0.636 | 0.68 | 0.91 | 0.567 | 0.652 | 1.0+ | 0.556 | 0.507 | 0.507 |
| SPlus | 0.487 | 0.422 | 0.348 | 0.586 | 0.475 | 0.452 | 0.459 | 0.359 | 0.371 |
Time-to-Adam:
| Method | LLM-Init | LLM-10K | LLM-50K | ViT-Init | ViT-10K | ViT-50K | DiT-Init | DiT-10K | DiT-50K |
|---|---|---|---|---|---|---|---|---|---|
| SOAP | 0.951 | 0.864 | 0.886 | 0.844 | 0.811 | 0.78 | 0.734 | 0.676 | 0.72 |
| Muon (new) | 0.611 | 0.66 | 0.89 | 0.618 | 0.698 | 1.0+ | 0.57 | 0.515 | 0.59 |
| SPlus | 0.651 | 0.547 | 0.447 | 0.832 | 0.674 | 0.628 | 0.707 | 0.523 | 0.545 |
As shown above, SOAP and SPlus tend to reach the desired validation loss in fewer gradient steps than Muon. However, Muon can reach this point in less wall-clock time, likely due to its efficient use of the Newton-Schulz iteration versus explicit eigendecomposition.
- Ablating the Effect of Iterate Averaging
We ran an additional experiment ablating the effect of iterate averaging (via EMA) on SPlus, focusing specifically on the LLM setting due to compute limitations. We also compare to introducing iterate averaging to SOAP, as SOAP is the most competitive baseline in terms of Steps-to-Adam.
Validation Loss:
| Method | LLM-Init | LLM-10K | LLM-50K |
|---|---|---|---|
| Adam | 3.132 | 3.114 | 3.012 |
| SPlus No EMA | 3.087 | 3.057 | 3.010 |
| SOAP | 3.085 | 3.072 | 2.995 |
| Muon | 3.061 | 3.061 | 3.007 |
| SPlus | 3.003 | 2.978 | 2.936 |
| SOAP + EMA | 2.998 | 2.993 | 2.931 |
Steps-to-Adam:
| Method | LLM-Init | LLM-10K | LLM-50K |
|---|---|---|---|
| Adam | 1.0 | 1.0 | 1.0 |
| SPlus No EMA | 0.729 | 0.662 | 0.685 |
| SOAP | 0.712 | 0.660 | 0.677 |
| Muon | 0.636 | 0.68 | 0.91 |
| SPlus | 0.487 | 0.422 | 0.348 |
| SOAP + EMA | 0.513 | 0.416 | 0.317 |
SPlus without EMA performs similarly to SOAP (and Muon), and all three can be understood as approximating a particular spectral whitening objective (see Section 3, Eq. 4). In our specific setting with a constant learning rate, iterative averaging provides an additional performance gain. The practical benefit of SPlus over SOAP is that SPlus requires one less set of parameters to be kept in memory.
Regarding mathematical typos, thank you for pointing these out, and they have been fixed for the next revision.
We would like to thank you for raising relevant questions and suggestions. We believe the additional ablations have increased the clarity of the results and strengthened the paper. Please let us know if you have any remaining concerns or questions.
[1] Jeremy Bernstein and Laker Newhouse. Old optimizer, new norm: An anthology. 2024.
Thanks for running these experiments!
To clarify - would it be possible for SPlus to use the Newton-Schulz iteration instead of explicit eigendecomposition?
Great question -- it would not be possible naively, as the eigendecomposition gives us the (historical) left/right eigenvectors explicitly, while Newton-Schulz implicitly solves for the orthogonal matrix UV^T. There is no simple way to extract U or V from the combined matrix UV^T. If one tried to run Newton-Schulz on the GG^T matrices, you would get the identity! (as GG^T is full-rank and symmetric)
```python
import numpy as np

g = np.random.rand(4, 20)
ggt = g @ g.T
print(ggt.shape)
print(ggt)

coeffs = (3, -16/5, 6/5)  # Many choices work here.

def newton_schulz_iterator(x):
    a = x @ x.T
    b = coeffs[1] * a + coeffs[2] * a @ a
    return coeffs[0] * x + b @ x

x = ggt
x /= np.linalg.norm(x) + 1e-6  # Singular values must be in [-1, 1].
for _ in range(5):  # Iterate a few times to converge.
    x = newton_schulz_iterator(x)
print('Newton-Schulz orthogonalization\n', x)

# Explicit orthogonalization
u, s, vh = np.linalg.svd(ggt)
print('Explicit orthogonalization\n', u @ vh)
```

```
(4, 4)
[[4.52943642 4.34923569 4.21630621 4.21347935]
 [4.34923569 8.74947178 5.75485439 6.18474024]
 [4.21630621 5.75485439 7.15446298 5.41502261]
 [4.21347935 6.18474024 5.41502261 7.6006473 ]]
Newton-Schulz orthogonalization
 [[ 9.96821455e-01 -3.39008106e-04  2.19201580e-03  1.15344933e-03]
 [-3.39008106e-04  9.97126618e-01 -4.52450965e-03  8.55999996e-03]
 [ 2.19201580e-03 -4.52450965e-03  9.95145616e-01  8.73224205e-03]
 [ 1.15344933e-03  8.55999996e-03  8.73224205e-03  9.82407800e-01]]
Explicit orthogonalization
 [[ 1.00000000e+00  2.56983382e-16  1.97278891e-18  1.85961893e-16]
 [ 2.06052557e-16  1.00000000e+00 -5.51062184e-18  1.44590563e-16]
 [ 1.56194126e-16  1.20230648e-16  1.00000000e+00  7.58146768e-17]
 [-1.00514801e-16  1.84793521e-16  2.05541700e-16  1.00000000e+00]]
```
That said, there may be promising ways to combine the Newton-Schulz iteration with historical gradients even beyond the current momentum term (as is done in Muon); however, we believe this is beyond the scope of this paper.
We believe the above experiments strengthened the paper. If you agree, would you consider updating the score accordingly?
This paper introduces SPlus, a novel optimizer designed to stabilize and accelerate training for large neural networks, particularly Transformers. It builds upon the Shampoo family of whitening-based optimizers, which estimate a preconditioning matrix via Kronecker-factored approximations of gradient covariances.
The authors identify three critical weaknesses in existing Shampoo-like optimizers:
- Instability from stale matrix inverses.
- Learning rate scaling issues across network widths.
- Parameter noise at high learning rates.
To address these, SPlus introduces:
- Instant-sign normalization in place of magnitude-based scaling, ensuring bounded updates and eliminating the divergence observed in Shampoo.
- Shape-aware scaling to ensure that optimal learning rates can transfer across different model widths.
- Iterate averaging to mitigate parameter noise while maintaining fast learning dynamics.
Extensive experiments on Transformer architectures (language modeling, image classification, diffusion modeling) show that SPlus matches the validation loss of Adam in ~44% fewer gradient steps and in ~62% of the wall-clock time.
Strengths and Weaknesses
Strengths
The paper convincingly demonstrates how cached matrix inverses can destabilize training in whitening-based optimizers. The explanation of eigen-decomposition and sign-normalization provides strong intuition for SPlus’ improvements.
Adapting insights from width-scaling theory to whitening optimizers is a significant practical contribution. This should reduce the trial-and-error process for hyperparameter tuning as models scale.
Evaluations are thorough, covering different tasks and various training stages. SPlus consistently outperforms other methods, including Shampoo, SOAP, PSGD, and Sophia.
Weaknesses
All experiments focus on Transformer-based models. It is unclear whether the same improvements would hold for other architectures such as CNNs, RNNs, or graph neural networks.
SPlus increases memory usage significantly (∼60% more than Adam). For very large models (e.g. billion-parameter LLMs), this could become a practical barrier. While the authors mention low-rank factorization as future work, no experiments or ablations are presented to quantify potential savings.
SPlus performs eigendecomposition every N steps (defaulting to 100). The paper does not fully explore how sensitive performance is to this frequency, especially in larger models where eigen decompositions can become bottlenecks.
Questions
One paper that I think is closely related to the proposed method is NGPlus [1], which is a second-order method and saves computational cost compared with KFAC and Shampoo.
[1] Yang, Minghan, et al. "An efficient Fisher matrix approximation method for large-scale neural network optimization." IEEE Transactions on Pattern Analysis and Machine Intelligence 45.5 (2022): 5391-5403.
Did you experiment with low-rank approximations for L and R? For example, truncating eigen decompositions might reduce both memory and compute costs. Do you have empirical results or intuition on how such approximations might affect stability or convergence speed?
Have you measured wall-clock speed on significantly larger models (e.g. billion-parameter LLMs)? The eigen decompositions could become disproportionately costly.
Limitations
Yes
Formatting Issues
NA
Thank you for the detailed review and suggestions. Please see our detailed response and additional experiments below:
- How does the eigendecomposition interval affect performance and wall-clock time?
This is a great question, and we ran an additional set of experiments to answer it. In the following experiments, we run SPlus and Shampoo over a range of eigendecomposition intervals (in gradient steps) of [1, 5, 10, 25, 100, 250, 500]. Due to computational constraints, we measure validation loss after 1000 gradient steps (vs. 10,000 in the paper) and only on the LLM objective. All other hyperparameters are as discussed in the paper.
Validation Loss after 1K gradient steps:
| Interval | 1 | 5 | 10 | 25 | 100 | 250 | 500 |
|---|---|---|---|---|---|---|---|
| SPlus | 3.316 | 3.332 | 3.332 | 3.335 | 3.336 | 3.336 | 3.337 |
| Shampoo | 3.332 | 3.342 | 3.340 | 3.340 | Div. | Div. | Div. |
| Adam | 3.36 |
Steps-to-Adam:
| Interval | 1 | 5 | 10 | 25 | 100 | 250 | 500 |
|---|---|---|---|---|---|---|---|
| SPlus | 0.78 | 0.71 | 0.75 | 0.76 | 0.77 | 0.77 | 0.81 |
| Shampoo | 0.86 | 0.80 | 0.80 | 0.79 | Div. | Div. | Div. |
Average Time Per Update:
| Interval | 1 | 5 | 10 | 25 | 100 | 250 | 500 |
|---|---|---|---|---|---|---|---|
| SPlus | 7.39 | 1.79 | 1.09 | 0.70 | 0.48 | 0.44 | 0.43 |
| Shampoo | 7.41 | 1.78 | 1.09 | 0.67 | 0.42 | 0.41 | 0.40 |
| Adam | 0.34 |
Time-to-Adam:
| Interval | 1 | 5 | 10 | 25 | 100 | 250 | 500 |
|---|---|---|---|---|---|---|---|
| SPlus | 16.21 | 3.34 | 2.05 | 1.20 | 0.74 | 0.65 | 0.67 |
| Shampoo | 18.11 | 3.79 | 2.18 | 1.21 | Div. | Div. | Div. |
Performance improves as eigendecompositions are performed more often (i.e., with a smaller interval), but with diminishing returns. Per-step wall-clock time scales roughly inversely with the interval, as eigendecomposition is the main computational cost. Notably, while there is a sweet spot for the interval hyperparameter in terms of time-to-Adam (250 in this case), Shampoo fails to capitalize on it, as training is unstable for intervals >= 100.
We note that wall-clock time should be best understood as a rough estimate, as there is an inherent dependency on the hardware used, the specific distributed system, speed of inter-host communication, etc.
- Memory Usage of SPlus
The additional memory requirement is a fair point and a weakness of the method. However, we note that the core SPlus update saves memory compared to the most competitive baseline (SOAP), which requires an additional set of parameters to track the second moment. In settings where iterative averaging is disabled, SPlus can roughly approximate SOAP while not requiring additional memory. We confirm this via an additional set of experiments, where SPlus is run without iterate averaging:
Validation Loss (no iterate averaging):
| Method | LLM-Init | LLM-10K | LLM-50K | Approx Memory Usage |
|---|---|---|---|---|
| Adam | 3.132 | 3.114 | 3.012 | 3nm |
| SPlus No EMA | 3.087 | 3.057 | 3.010 | 2nm + 2(n^2 + m^2) |
| SOAP | 3.085 | 3.072 | 2.995 | 3nm + 2(n^2 + m^2) |
Steps-to-Adam (no iterate averaging):
| Method | LLM-Init | LLM-10K | LLM-50K | Approx Memory Usage |
|---|---|---|---|---|
| Adam | 1.0 | 1.0 | 1.0 | 3nm |
| SPlus No EMA | 0.729 | 0.662 | 0.685 | 2nm + 2(n^2 + m^2) |
| SOAP | 0.712 | 0.660 | 0.677 | 3nm + 2(n^2 + m^2) |
- Potential methods of improving compute efficiency
As noted in the future work section, we agree that there is a fruitful research direction in methods that can improve the computational efficiency of SPlus and other Shampoo-like methods. We did not conduct specific experiments on algorithmic approximations to the eigendecomposition (e.g. low-rank approximations), as we view them as beyond the scope of this paper. In terms of wall-clock time for billion-parameter LLMs, while we are also curious about this result, we are unable to properly experiment with such large-scale models due to compute limitations. Thank you for the reference to NGPlus! We will update the paper to include a citation to it, and describe potential techniques to reduce the computational footprint.
We would like to thank you for raising relevant questions and suggestions. We believe the additional ablations have increased the clarity of the results and strengthened the paper. Please let us know if you have any remaining concerns or questions.
The paper introduces SPlus, an optimizer designed to improve the stability and acceleration of training large neural networks and especially transformers. It addresses three key weaknesses of existing Shampoo-like optimizers: instability from stale matrix inverses, learning rate scaling issues across network widths, and parameter noise at high learning rates. SPlus proposes instant-sign normalization for bounded updates, shape-aware scaling for optimal learning rate transfer, and iterate averaging to mitigate parameter noise. Through extensive experiments on transformer architectures, SPlus demonstrates significant improvements in convergence speed and wall-clock time over Adam.
A positive point of the paper is how it demonstrates how cached matrix inverses destabilize training in whitening-based optimizers and SPlus' improvements through eigen-decomposition and sign-normalization. Adapting width-scaling theory insights to whitening optimizer is a significant practical contribution that should reduce hyperparameter tuning effort. Thorough evaluations across different tasks and training stages show SPlus' consistent outperformance of various methods (including Shampoo, SOAP, PSGD, and Sophia). The paper is clearly written, easy to follow, and additionally provides source code.
Despite its strengths, several concerns were raised.
- There is limited architectural generalization, as all experiments focus on transformers. (And while low-rank factorization is mentioned as future work, no experiments quantify potential savings)
- The increased memory usage of SPlus (∼60% more than Adam), which could be a practical barrier.
- There is doubt over the quality of the Muon baseline (but the authors have/will address this in revision)
- The lack of exploration into the sensitivity of performance to the frequency of eigendecomposition, which can become a bottleneck in larger models.
- There is a lack of deeper theoretical analysis for some of the proposed modifications, such as why parameter averaging effectively mitigates noise.
The reviewers explicitly noted some minor corrections, which should be addressed, and the authors have promised a list of changes.
(Accept)