Flatter, Faster: Scaling Momentum for Optimal Speedup of SGD
We find a power-law relationship between the optimal momentum hyperparameter and the learning rate which maximizes generalization.
Abstract
Reviews and Discussion
The authors present a novel theoretical result on how the momentum hyperparameter and the learning rate jointly control the acceleration of SGD with momentum (SGDM). The main result is derived from stochastic process theory describing the trajectory of SGDM. Three models and datasets are also considered to support the derived roles of the learning rate and the momentum hyperparameter in moving along the longitudinal and transverse components of the trajectory. Balancing these ingredients of fast convergence to a flat local minimum is crucial for the runtime of the training process and the generalization of the obtained solution.
Strengths
I would like to thank the authors for the clear and sufficiently detailed introduction section that provides all the necessary results and ideas. The interpretation of convergence as a trade-off between two main scales (longitudinal and transverse) looks interesting and promising for further development in the case of adaptive step-size optimizers. The experimental results presented in Section 4 confirm the derived theoretical dependence of the momentum hyperparameter on the learning rate. The presented results can help practitioners tune hyperparameters more efficiently and reduce the consumption of computational resources.
Weaknesses
The weaknesses of the presented study are listed below
- Section 3 is very hard to read for non-experts in stochastic process theory. I suggest the authors compress it and expand the experimental evaluation section instead.
- Figure 2 presents fitted-line results that confirm the estimated dependence rule; however, I do not find an analysis of the variance of the derived estimate. I am not sure whether the dependence would change significantly if one added more points to the plots.
Questions
- The authors consider simple models from computer vision tasks, such as ResNet18 and an MLP. Could you please list the assumptions on the deep neural network that are necessary for the correctness of the derived relation between the learning rate and the momentum hyperparameter? These assumptions would be very helpful for extending the presented results to models from other domains, for example transformer-based LLMs.
- Is it possible to extend the proposed approach from Heavy Ball to a Nesterov-like acceleration scheme? If so, please comment on the potential obstacles to deriving similar results.
- This question relates to weakness 2: how robust is the derived estimate w.r.t. new points that may be added to the plots?
- How can the derived relations between the learning rate and the momentum term be interpreted from the perspective of loss-surface properties? What can the derived 2/3 power rule highlight about the loss surfaces of the considered models?
We thank the referee for their positive feedback on our paper, and for the constructive questions that helped us expand and improve our presentation. We believe we have addressed all the referee's questions. We detail our replies and the updates to the draft below:
1. The main assumption is that the neural network is in the overparameterized regime. There is extensive literature (as discussed in Sec. 2) supporting the statement that, in overparameterized neural networks, global minima tend to form connected manifolds, which we call "valleys". A separate but related assumption is that the closer the weights are to the valley, the more applicable we expect our prediction to be. Given that the relationship between overparameterization and the existence of zero-loss valleys has been studied across many models, we believe these assumptions are architecture-independent, and we expect them to apply to transformer-based models as well, in particular in the context of fine-tuning, where models are often overparameterized (see e.g. https://arxiv.org/abs/1908.05620). We also point out that, in our experiments, even when training is initialized far from the valley (e.g. due to random initialization), we still observe the scaling law we propose. In fact, we have now added in Sec. 4.3.1 and Fig. 2 (bottom right) an experiment training a ResNet-18 model on CIFAR-10 from random initialization, and we observe the optimal scaling relationship we propose.
2. Extending this approach to Nesterov acceleration involves evaluating the gradients at the lookahead point (the weights shifted by the momentum term) instead of at the current weights. Technically, while it would be somewhat more involved to derive the limit diffusion equation of Theorem 3.4, we expect this to still be doable, since that result is derived through a Taylor expansion around the valley and the Theorem is valid in the limiting case in which the configuration is close to the valley of global minima. We also note that it is in principle possible that Nesterov-like acceleration schemes lead to a power different from 2/3. The important point is that there would still be a power law, since the steps involved in the derivation would be qualitatively similar.
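For concreteness, the two update rules in a standard form (a schematic sketch with generic symbols: $\theta_t$ for the weights, $v_t$ for the momentum buffer, $\beta$ for the momentum hyperparameter and $\eta$ for the learning rate; the paper's own notation may differ):

$$\text{Heavy ball:}\quad v_{t+1} = \beta v_t - \eta \nabla L(\theta_t), \qquad \theta_{t+1} = \theta_t + v_{t+1},$$
$$\text{Nesterov:}\quad v_{t+1} = \beta v_t - \eta \nabla L(\theta_t + \beta v_t), \qquad \theta_{t+1} = \theta_t + v_{t+1},$$

so the only change is that the gradient is evaluated at the lookahead point $\theta_t + \beta v_t$.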
3. We included error bars for our numerical estimates (see e.g. Fig. 2), derived with a standard least-squares fitting routine that computes errors on the parameter estimates using the "linear approximation method" from Vugrin, Kay White, et al., "Confidence region estimation techniques for nonlinear regression in groundwater flow: Three case studies," Water Resources Research 43.3 (2007). In our case we assume that the true data lie on a line with i.i.d. noise. While this assumption may not be strictly satisfied (e.g. the noise of the data points may depend on the learning rate, or there might be deviations from our scaling law due to the finite strength of the noise we use in experiments), the data show a clear linear relationship and such deviations are not apparent. Responding specifically to the question about adding more points: we do not expect the estimate to change significantly, because new points would be independent from the old points and, as long as they lie inside the extensive range of learning-rate values we have explored, approximately identically distributed as well. This means that our error estimate of one standard deviation around the best-fit value provides a good measure for the expected deviation of the estimator of the slope.
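For illustration, a minimal sketch of this kind of fit and error estimate (not our actual analysis script; the library choice and the data values below are placeholders):

```python
import numpy as np

# Placeholder measurements: learning rates and the optimal momentum
# hyperparameters found for each of them (values chosen for illustration).
etas = np.array([1e-3, 3e-3, 1e-2, 3e-2, 1e-1])
beta_opt = np.array([0.990, 0.979, 0.954, 0.904, 0.785])

# Fit log(1 - beta_opt) = gamma * log(eta) + log(C) with a linear model.
x = np.log(etas)
y = np.log(1.0 - beta_opt)
coeffs, cov = np.polyfit(x, y, deg=1, cov=True)
gamma_hat, logC_hat = coeffs
gamma_err = np.sqrt(cov[0, 0])  # one standard deviation on the fitted slope

print(f"gamma = {gamma_hat:.3f} +/- {gamma_err:.3f}, C = {np.exp(logC_hat):.3f}")
```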
4. As we mentioned in point 1, the scaling law is related to the presence of a zero-loss valley. We do not expect the specific power of 2/3 to hold for a generic "rough" loss surface. An additional observation is that the power we found holds empirically for random initial conditions, i.e. even when we initialize far from any valley. We observed this across various datasets (artificial, FashionMNIST, and CIFAR10). In terms of the loss surface, this could mean that a large fraction of training is spent near a zero-loss valley.
Following the referee's suggestions, we compressed part of Secs. 3.1 and 3.2, leaving only the necessary definitions and the most important results. We extended the experiments section (Sec. 4) by including the results we obtained after the original submission.
Dear authors, thank you for the detailed response and the corresponding modifications of the main text. The presentation is now better written and more convincing. All raised questions and weaknesses have been addressed. Thus, my score remains the same.
This paper extends the analysis of (Li et al., 2022) to SGDM. Based on this analysis, the optimal momentum hyperparameter for accelerating training without hurting generalization is studied. Experiments on matrix sensing, a 6-layer MLP on FashionMNIST, and ResNet-18 on CIFAR10 are conducted to support the theoretical claims.
Strengths
- It is an important topic to study how the momentum hyperparameter is optimally picked in deep learning, as it allows for shrinking the search space of hyperparameters.
- The theoretical results (at least the part I have understood) are solid.
Weaknesses
- The biggest concern I have is regarding the presentation of this paper. The quantities $\tau_1$ and $\tau_2$ are the focus of this paper, but they are only defined through informal descriptions in the introduction (for example, "Define $\tau_2$ to be the number of time steps it takes so that the displacement of the weights along the zero-loss manifold becomes a finite number as we take one small parameter → 0 first, and the other → 0 afterward"), with no (even informal) definition elsewhere. This makes these two terms extremely hard to interpret and understand.
- This paper only considers optimization around the manifold of global minima. Although this setting is inherited from (Li et al., 2022), I wonder whether this framework can characterize the convergence rate along the whole trajectory, which is of more interest. For example, in Figure 2, the initialization is picked where perfect interpolation has already been achieved. What happens if the initialization is chosen as, for instance, Kaiming's initialization?
- The experiments are too few and too toy-like considering that this paper aims to provide a modification of the algorithm: there is only an experiment with a 2-layer MLP and an experiment on CIFAR 10 with a very uncommon initialization (as discussed above).
Questions
See weaknesses above.
We thank the referee for their constructive questions that helped us expand and improve our presentation, and that led us to design further experiments. We believe we have addressed all the referee's questions. We detail our replies and the updates to the draft below:
1. We appreciate the referee's feedback on the definitions of $\tau_1$ and $\tau_2$. Regarding $\tau_1$, in the introduction we replaced "mixing time" with "the inverse of the smallest nonzero eigenvalue of the damping term", which is more self-contained and should completely clarify what we mean. Regarding $\tau_2$, we added Appendix B to provide a formal definition of this quantity and included two more sentences at the end of the first paragraph of Sec. 1.2 to make the reasoning around $\tau_2$ more explicit. We hope these elements are now clarified.
2. We thank the referee for this question, as it prompted us to further investigate empirically what happens with random initialization. We had already used random initialization for part of our experiments (matrix sensing and the MLP on the artificial dataset), showing behavior consistent with the scaling law we predict, without the need to initialize near the manifold of global minima. To answer the referee's question, we performed a new set of experiments in which we trained ResNet-18 on CIFAR10 with Kaiming's initialization. We found clear evidence of the optimality of the 2/3 scaling law in this setting, as illustrated in Fig. 2 (bottom right) and Sec. 4.3.1. From a theoretical standpoint, the properties of the loss surface are known to be very challenging to characterize in generic deep-learning models, in particular away from the zero-loss valley. On one hand this makes it hard to rigorously explain this last empirical finding; on the other hand, we believe that, for this very reason, the empirical evidence we found is even more valuable.
3. As mentioned in point 2, stimulated by the referee's question, we took the opportunity to look at the standard setting of training with Kaiming initialization and SGD noise on CIFAR 10, finding strong empirical evidence that our prediction holds in realistic scenarios. We respectfully remind the referee that we also performed experiments on FashionMNIST using a deep MLP, which also provide positive evidence for our prediction. The aim of this paper is to develop a rigorous theory to ground our predictions, and to initiate and display its verification across progressively more complex models.
The paper analyzes a noisy version of gradient descent (meant to be a proxy for SGD but simpler to analyze theoretically) in the presence of momentum. Given momentum parameter \beta and learning rate \eta, two timescales are identified: 1) one corresponds to the relaxation time to the Gaussian distribution in the directions transverse to the zero-loss manifold; 2) the other corresponds to the amount of time it takes to have finite displacements along the zero-loss manifold. The authors argue that the most efficient training is obtained when the two timescales are comparable, which implies a relation between eta and beta. In the limits of small noise, small learning rate and large times, the authors derive the SDE of the limit diffusion process. The process is driven toward flat minima (small Hessian trace).
Strengths
- The paper is well written and well organized.
- The paper provides numerical experiments on synthetic and natural settings.
- Both rigorous proofs and heuristic arguments are given.
- The theoretical analysis of the timescales provides the simple prescription gamma=2/3 that is shown to speed up training and also give the best generalization in some settings.
Weaknesses
- The work seems to be a relatively straightforward extension of Li et al. (2022).
- It is not clear if the optimality of the gamma=2/3 exponent applies to standard SGD as well.
- The theory does not give a prescription for the prefactor C. I think this makes it not so useful in practice.
Questions
- What happens when using standard SGD instead of the label noise one? Can the author provide numerical evidence that gamma=2/3 is still a good exponent? My understanding is that all numerical experiments are carried out with SGD label noise.
- Could the author clarify the argument for the tau1 = tau2 criterion? In particular, it is not clear to me why we should care about transverse equilibration. What seems relevant is to move fast toward the flat region while staying close enough to the zero-loss manifold.
- What happens if phase 1 is also trained with SGD instead of GD? The starting point for the subsequent SGD label-noise diffusion would already be in a flat region, so there wouldn't be much drift?
- Is Fig 1 obtained with a single value of \eta and varying \gamma? Could the author show multiple sets of points corresponding to different eta values?
We thank the referee for their positive feedback on our paper, and for the constructive questions that helped us expand and improve our presentation. We believe we have addressed all the referee's questions. We detail our replies and the updates to the draft below:
1. The referee is correct that our analysis so far focused on SGD with label noise. We carried out a new set of simulations training ResNet-18 on CIFAR10 using plain SGD noise instead of label noise, and we find a clear signal of the gamma=2/3 scaling relationship. Strictly speaking, our mathematical framework currently assumes that the noise is constant, which is not quite the case for SGD noise and is the reason why we did not consider SGD noise in our experiments in the first place. Yet, this result suggests that in practice the differences with SGD noise may not matter much as far as the power law is concerned. It is gratifying to see empirically that the optimal scaling law still applies, and we are grateful to the referee for bringing up this question. We included the new results in Sec. 4.3.1 and Fig. 2 (bottom right).
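For clarity about the two noise models being compared, a minimal sketch (illustrative only; `model`, the MSE loss, and the noise scale `sigma` are placeholder choices, not the exact setup of the paper):

```python
import torch
import torch.nn.functional as F

def label_noise_step(model, x, y, opt, sigma=0.1):
    """Full-batch gradient step with label noise: Gaussian noise is added to
    the targets at every step, so the gradient noise has a roughly constant
    strength set by sigma (the constant-noise setting assumed by the theory)."""
    opt.zero_grad()
    y_noisy = y + sigma * torch.randn_like(y)
    F.mse_loss(model(x), y_noisy).backward()
    opt.step()

def minibatch_sgd_step(model, x, y, opt, batch_size=128):
    """Plain minibatch SGD step: the noise comes from subsampling the data,
    so its strength depends on the current weights and on the batch."""
    idx = torch.randperm(x.shape[0])[:batch_size]
    opt.zero_grad()
    F.mse_loss(model(x[idx]), y[idx]).backward()
    opt.step()
```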
2. The number of timesteps it takes for the model to converge to a low-curvature point of the zero-loss valley is controlled by the slower of $\tau_1$ and $\tau_2$. By "controlled" we mean that the longer this slower timescale is, the longer it takes to converge to such a low-curvature point. More specifically, optimality is reached when $\tau_1 \simeq \tau_2$ because, as illustrated in Fig. 1(a), if $\tau_1$ becomes faster, $\tau_2$ becomes slower, and vice versa. The transverse timescale is important because, if it is too slow, there might still be residual transverse fluctuations even though convergence has happened longitudinally (similar to the top left of Fig. 2).
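As a purely heuristic illustration of how the 2/3 power can arise from this balance (the scalings below are schematic shorthands, not the rigorous statements of Sec. 3): suppose the transverse equilibration time scales as $\tau_1 \sim 1/(1-\beta)$ (set by the damping of the momentum buffer), while the longitudinal displacement time scales as $\tau_2 \sim (1-\beta)^2/\eta^2$ (the drift along the valley is driven by the noise through an effective step size $\eta/(1-\beta)$ and enters at second order). Then

$$\tau_1 \simeq \tau_2 \;\Longleftrightarrow\; \frac{1}{1-\beta} \simeq \frac{(1-\beta)^2}{\eta^2} \;\Longleftrightarrow\; (1-\beta)^3 \simeq \eta^2 \;\Longleftrightarrow\; 1-\beta \simeq \eta^{2/3},$$

i.e. the scaling $\beta = 1 - C\,\eta^{2/3}$ with some prefactor $C$.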
3. Regarding our previous experiment on CIFAR10 (where we explicitly separated phase 1 and phase 2 in the training schedule), label noise would have continued to cause a drift unless SGD had already caused the weights to converge to the widest minimum by the end of phase 1. Please let us know if this answers your question; we are happy to discuss more.
4. Fig. 1 is obtained using various values of $\eta$ for each fixed value of $\gamma$. As explained in the last paragraph before App. A.2, the way we extracted the exponent was to fit the number of timesteps to convergence as a power law in the learning rate: we performed several runs with different values of $\eta$ and the associated $\beta$, and extracted the prefactor and the exponent through a fit. Because of this, the plots of Fig. 1 are only defined when running across several values of $\eta$; it is not possible to plot them for a single value of $\eta$.
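Schematically, the procedure looks as follows (a sketch only; `steps_to_convergence` stands in for running the actual training until our convergence criterion is met and is not a real function of our code):

```python
import numpy as np

def steps_to_convergence(eta, beta):
    """Placeholder: train with learning rate eta and momentum beta and
    return the number of timesteps until the convergence criterion is met."""
    raise NotImplementedError

def fit_convergence_exponent(etas, gamma, C=1.0):
    """For a fixed gamma, set beta = 1 - C * eta**gamma for each eta,
    measure the steps to convergence, and fit a power law in eta
    by linear regression in log-log space."""
    t_conv = [steps_to_convergence(eta, 1.0 - C * eta**gamma) for eta in etas]
    slope, intercept = np.polyfit(np.log(etas), np.log(t_conv), deg=1)
    return -slope, np.exp(intercept)  # exponent and prefactor of the power law
```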
Additionally, we have two remarks about the weaknesses mentioned by the referee:
- Prefactor: We appreciate the referee's critique that the theory does not give a prescription for the prefactor $C$. In principle, this could mean that we have not reduced the search space of hyperparameters, but we note two things. First, given that the optimal power is fixed to 2/3 by the theory, to know $C$ one only needs to perform a few runs to find the optimal momentum hyperparameter for a single fixed value of the learning rate. From this information it is then possible to extract the best momentum hyperparameter for any other value of the learning rate, thanks to the knowledge of the exponent. This effectively reduces the hyperparameter search to one parameter instead of two. Secondly, and intriguingly, we noted that the prefactor takes approximately the same value in all our experiments, independently of the models and tasks we considered.
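Concretely, the transfer can be done as follows (a small illustrative helper, not code from the paper): once an optimal pair $(\eta_0, \beta_0)$ has been found by a short search at a single learning rate, the scaling law fixes the momentum for any other learning rate.

```python
def momentum_for_lr(eta, eta0, beta0, gamma=2.0 / 3.0):
    """Given one calibrated pair (eta0, beta0), return the momentum predicted
    by the scaling beta = 1 - C * eta**gamma, with C = (1 - beta0) / eta0**gamma."""
    C = (1.0 - beta0) / eta0**gamma
    return 1.0 - C * eta**gamma

# Example: calibrated at eta0 = 0.1 with beta0 = 0.8, the predicted
# momentum at eta = 0.01 is about 0.957.
print(momentum_for_lr(0.01, eta0=0.1, beta0=0.8))
```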
- Simple extension: We would also like to note that, while (Li et al., 2022) was a main result inspiring our research, our framework has substantial differences. In order to treat the momentum hyperparameter as an additional small parameter (in addition to the learning rate $\eta$), we had to develop a completely different scheme of limits to derive our final equation. In that derivation, we relied on the small-noise limit rather than the small-learning-rate limit.
The paper proposes a new scaling of the momentum parameter beta (setting beta to 1 - eta^{2/3}). The paper derives this scaling via a theoretical analysis, which is the primary contribution of the paper. The analysis identifies two timescales: one corresponds to the time to converge to the stationary distribution in the directions transverse to the zero-loss manifold; the second comes from computing the amount of time it takes to have finite displacements along the zero-loss manifold. The authors propose to set these two timescales to be comparable, which leads to the proposed relation between eta and beta.
The paper's proposal is reasonably well motivated and the analysis is built upon the framework proposed by Li et al. While reviewers liked the analysis in the paper, the efficacy of such a proposal ultimately hinges upon how well it does in experiments, as there is no proof of a 'speedup' via this scaling. The experiments, as pointed out by reviewers, are a little lacking: they either verify the theory on small-scale models, or study larger models in a regime that it is unclear is ever reached during training. Upon request from the reviewers, a new experiment was added on a more standard training task, but the experiment was done during the rebuttal period, and its description is limited and its scope not sufficient to gauge whether the proposal leads to a more efficient scaling setup.
Overall, I recommend the authors consider expanding the scope of the experiments to larger models and baselines, and performing a clear verification of whether their proposal leads to a cleaner result. It would further be good to highlight the benefits of this scaling over using default values of momentum. I also request the authors to do a better job of writing and explaining their proposed scaling. As a reviewer also pointed out, the descriptions are too limited to fully understand the derivation quickly.
Why not a higher score
The paper is borderline, with a reasonable proposal and useful analysis to back it. However, the efficacy of the proposal can only be verified by experiments, and on that front I found the paper lacking. Please also see my meta-review.
Why not a lower score
NA
Reject