Scaling Law with Learning Rate Annealing
A new scaling law formula with learning rate annealing that can fit and predict full loss curves.
Abstract
Reviews and Discussion
This paper proposes to extend existing LLM scaling laws to account for learning rate annealing, which additionally fits the loss at intermediate steps during training. The paper is well written, and provides intuition for the final formulation of the novel scaling law. Extensive experiments are provided to back up the observations.
Strengths and Weaknesses
Strengths: The paper is well-written and has applications in reducing the compute for training LLMs (Table 2). The observations are backed up by experiments. Further, the authors show that their novel scaling law is a generalization of the Chinchilla scaling law (Section 5).
Weaknesses: The scaling law discussed in the paper does not have theoretical justification, but the authors acknowledge this limitation. One minor note: in the conclusion (line 344), it would be better to remove "theory". Additional discussion of related work would be useful to put this work into context.
Questions
- How feasible would it be to consider generalization bounds under this new scaling law, e.g. [1]?
[1] Finzi, Marc, et al. "Compute-Optimal LLMs Provably Generalize Better with Scale." https://arxiv.org/abs/2504.15208.
Limitations
yes
Final Justification
The authors addressed my concerns, and I think the paper has sufficient novelty for publication in NeurIPS.
Formatting Issues
N/A
W1: Theoretical justification and related work
Yes, our scaling law is empirical, as noted in our limitations section. We appreciate your suggestion and will revise the relevant wording (such as "theory" in the conclusion) in the next version of the paper. We also appreciate your suggestion to include a more detailed discussion of related work to better contextualize our findings; we will expand the related-work discussion to provide a clearer comparison with prior studies and to highlight how our work fits into the broader research landscape. This will strengthen the paper's contribution and provide a more comprehensive perspective.
Q1: How feasible would it be to consider generalization bounds under this new scaling law?
Thank you for your insightful question regarding the feasibility of deriving generalization bounds under the new scaling law. We believe that exploring generalization bounds is feasible but would require additional theoretical work. We would like to emphasize some strong generalization effects demonstrated in our experiments:
- In our experiments, the learning rate schedule (LRS) used for prediction was unseen during fitting, and the error was controlled within a very small range (<0.35%).
- In our experiments, we performed data fitting using only a small number of training steps (e.g., 20K) and extrapolated to longer training runs (e.g., 60K steps).
- As you also noted, our work is a generalized form of the Chinchilla scaling law. We can achieve significantly more fitting points under the same computational resources, leading to better prediction and generalization performance. Moreover, compared to the Chinchilla scaling law, our model introduces only one additional parameter, and this simplicity contributes to greater robustness.
We will incorporate the paper you mentioned in the next version of our manuscript. We are committed to addressing these points in the revised manuscript and hope our responses clarify our approach and intentions. Please let us know if you have any further recommendations or require additional clarification. We greatly appreciate your time and expertise in reviewing our work.
I thank the authors for clarifying my question. This paper is outside of my area of expertise, but having considered the other reviews, I'm happy to maintain my original score and look forward to the upcoming discussion with the other reviewers.
This paper investigates the relationship between the loss curve in large language model pre-training and the corresponding learning rate schedule. The authors propose a formula to model the training loss, expressed as

L(s) = L0 + A · S1^(-alpha) − C · S2,

where L0 is the irreducible loss, the second term models the initial rapid decrease in loss as a function of the cumulative learning rate S1 (forward accumulation), and the third term accounts for the annealing phase of the learning rate schedule through S2 (backward accumulation).
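For concreteness, below is a minimal sketch (not from the paper or its released code) of how a loss curve could be computed from a learning rate schedule under this form; the momentum-style accumulation used for S2, the fixed lam, and all coefficient values are illustrative assumptions rather than fitted values.

```python
import numpy as np

def predicted_loss(lrs, L0=2.5, A=0.5, C=2.0, alpha=0.5, lam=0.99):
    """Per-step loss under L(s) = L0 + A*S1^(-alpha) - C*S2.

    lrs : per-step learning rates eta_1..eta_S (after warmup).
    S1  : cumulative (forward) learning rate.
    S2  : accumulated, momentum-smoothed LR decrease (annealing area);
          the momentum update below is an assumed EMA-like form.
    All coefficients here are placeholders, not fitted values.
    """
    lrs = np.asarray(lrs, dtype=float)
    S1 = np.cumsum(lrs)
    S2 = np.zeros_like(lrs)
    m = 0.0
    for i in range(1, len(lrs)):
        m = lam * m + (lrs[i - 1] - lrs[i])  # momentum over the LR drop
        S2[i] = S2[i - 1] + m
    return L0 + A * S1 ** (-alpha) - C * S2

# Example: a cosine schedule over 20K steps with peak LR 2e-4.
steps, eta_max = 20_000, 2e-4
cosine = 0.5 * eta_max * (1.0 + np.cos(np.pi * np.arange(steps) / steps))
curve = predicted_loss(cosine)
print(f"predicted final loss: {curve[-1]:.3f}")
```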
To empirically validate their proposed formula, the authors conduct a comprehensive set of experiments. They also demonstrate the robustness of their model by successfully fitting the formula to the training trajectories of existing open-source models, such as DeepSeek.
Overall, this paper offers a novel and promising perspective on scaling law research. The proposed theory has the potential to be highly beneficial for practical pre-training applications, provided it is further substantiated with more rigorous theoretical grounding and extensive reliable empirical verification.
Strengths and Weaknesses
Strengths:
- Fresh research direction; novel observation (learning rate schedule vs. the whole learning curve).
- Extensive experiments (a lot of effort).
- Potential application to pre-training.

Weaknesses:
- The annealing term S2 is not very well grounded. In particular, the term and its definition are not well motivated and not very interpretable.
- The hyperparameters in S2 are not interpretable.
- Figure 2b concerns me: the constant and cosine learning rate schedules have almost identical learning-curve shapes, and their final performance is very close (2.83 vs. 2.84); in practice, their shapes should be quite different (see, e.g., Fig. 6 in https://arxiv.org/pdf/2405.18392). This makes me doubt the remaining experiments.
Questions
- Figure 3: Figure 3 looks okay, but it is not entirely convincing given that it relies on an interpolation fit rather than an extrapolation fit to an unseen scale.
- Figure 7b: In the 1.2B-parameter run shown in Figure 7b, the fit deviates from the actual learning curve between steps 10,000 and 30,000. The empirical curve is not very smooth in this region, which suggests the theory may not be very robust.
- Theoretical Grounding of the Annealing Term (S2): The S2 term and its momentum-like definition are not well grounded and seem quite arbitrary. I am concerned that this may be overfitting to the authors' specific experimental setup. It seems that the decay factor likely depends on multiple factors, such as the choice of optimizer (e.g., AdamW vs. Muon vs. others). For AdamW, it could depend on the β1 and/or β2 parameters.
- Claim on LRS Superiority: Regarding the section "Why Is WSD LRS Better than Cosine LRS?", I am not sure this statement is true.
- Lack of Empirical Validation: In the "Optimal Annealing Ratio of WSD LRS" section, the authors provide a prediction from their theory but do not validate it with a real experiment.
- Speculative Applications: Overall, I believe the applications in Section 4 are mostly speculative insights, if I understand correctly.
- Section 5: The same applies to Section 5.
Limitations
yes
Final Justification
The authors addressed some of my concerns.
Formatting Issues
no
We sincerely appreciate your thorough engagement and the numerous insightful questions you’ve raised, which have greatly enriched our work.
W1: The S2 term is not very grounded.
We present in detail how we derive S2 in Section 3.2. Basically, we observe a delay phenomenon in LR annealing and find that LR annealing carries momentum; therefore, the final S2 takes a momentum-like (EMA-like) form. As reviewer EhjS notes, the formulation "is interpretable, and illustrates the trade‐off between rapid progress at high learning rates and the benefits of annealing". To further substantiate our findings, we conduct extensive experiments in Section 3 to validate the proposed scaling law.
W2: The hyperparameter in S2 is not interpretable.
- λ is the momentum coefficient, which is standard in momentum-like (EMA) formulations: it controls how quickly the accumulated LR decrease decays. We adopt the same λ in all our experiments, and the results show that the choice of λ is not very sensitive and is quite robust across many experimental setups.
- In Appendix I.1, we have fully discussed and investigated the essence and impact of the decay factor λ in detail. We show that λ influences three aspects: (1) the delay steps (see the toy sketch below); (2) the optimal annealing ratio; and (3) the balance point between the S1 and S2 terms.
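As a rough illustration of point (1), the toy sketch below shows how a larger decay factor stretches the number of steps over which a single LR drop keeps contributing. This is not the paper's code: the momentum update, the 5% cutoff, and the specific λ values are assumptions chosen only for illustration.

```python
import numpy as np

def annealing_momentum(lrs, lam):
    """Assumed momentum-smoothed LR decrease: m[i] = lam*m[i-1] + (lr[i-1] - lr[i])."""
    m = np.zeros(len(lrs))
    for i in range(1, len(lrs)):
        m[i] = lam * m[i - 1] + (lrs[i - 1] - lrs[i])
    return m

# A single sudden LR drop at step 1000: 2e-4 -> 2e-5.
lrs = np.full(8000, 2e-4)
lrs[1000:] = 2e-5

for lam in (0.9, 0.99, 0.999):
    m = annealing_momentum(lrs, lam)
    # Steps until the momentum (i.e., the ongoing contribution to S2)
    # fades to 5% of its peak -- a rough proxy for the "delay steps".
    delay = int(np.argmax(m[1000:] < 0.05 * m[1000]))
    print(f"lam={lam}: delay of roughly {delay} steps")
```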
W3: Concerns about Figure 2b
This is actually a great question, and we discuss it fully in Appendix H.2. Simply speaking, how similar the loss curves of constant and cosine LR are depends on important hyperparameters such as the number of training steps, and our curves do not contradict the paper you mention.
In Figure 24, we show that the curves of cosine and constant LR can be quite similar when training for 10K steps (Figure 24a), but become quite different when training for 100K steps (Figure 24b). Moreover, we show that for small numbers of training steps, a constant LR can lead to better performance than a cosine LR; when training for more steps, the final performance of the cosine LRS surpasses that of the constant LRS, and this gap gradually widens (Figure 24c).
The max learning rate also influences the curves of the constant and cosine LR schedules, and a larger max learning rate often leads to a larger gap between them. We set the max learning rate to 2e-4 in Figure 2, which is not large. Our results are all based on rigorous and standardized experiments.
Q1: Figure 3
No, we performed extrapolation rather than interpolation: we fit on only 20K training steps (Figure 2) and predict the loss for 60K training steps under many different LRS that are unseen during fitting (Figure 3).
The additional experiments in Appendix E are also extrapolation. For example, in Appendix E.1 we adopt another experimental setup, fit on only 30K training steps (Figure 10), and predict the loss for over 100K training steps under unseen LRS.
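A minimal, self-contained sketch of this fit-then-extrapolate workflow (not the authors' actual fitting code): it fits (L0, A, C, alpha) on the first 20K steps of a synthetic loss curve with scipy's curve_fit and then extrapolates to 60K steps. The momentum-style S2, the fixed lam, the synthetic "observed" losses, and the choice of optimizer are all assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

def s1_s2(lrs, lam=0.99):
    """Cumulative LR (S1) and an assumed momentum-smoothed annealing area (S2)."""
    S1 = np.cumsum(lrs)
    S2 = np.zeros_like(lrs)
    m = 0.0
    for i in range(1, len(lrs)):
        m = lam * m + (lrs[i - 1] - lrs[i])
        S2[i] = S2[i - 1] + m
    return S1, S2

def law(X, L0, A, C, alpha):
    S1, S2 = X
    return L0 + A * S1 ** (-alpha) - C * S2

total, fit_horizon, eta_max = 60_000, 20_000, 2e-4
lrs = 0.5 * eta_max * (1.0 + np.cos(np.pi * np.arange(total) / total))  # cosine LRS
S1, S2 = s1_s2(lrs)

# Stand-in for measured validation losses over the first 20K steps.
rng = np.random.default_rng(0)
observed = law((S1[:fit_horizon], S2[:fit_horizon]), 2.5, 0.4, 1.5, 0.55)
observed += rng.normal(0.0, 0.01, fit_horizon)

popt, _ = curve_fit(law, (S1[:fit_horizon], S2[:fit_horizon]), observed,
                    p0=(2.0, 0.5, 1.0, 0.5), maxfev=20_000)
extrapolated = law((S1, S2), *popt)  # predicted curve for the full 60K steps
print(dict(zip(("L0", "A", "C", "alpha"), np.round(popt, 3))))
```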
Q2: Figure 7b
Our proposed scaling law models the expected loss at each training step. In real experiments, other factors can also influence the validation loss. For example, some training samples near specific checkpoints may be more similar to the validation set, which can temporarily lower the validation loss. Additionally, larger batch sizes typically result in lower variance and make the loss curve smoother. Furthermore, prior scaling laws, such as the OpenAI and Chinchilla scaling laws, also model expected losses with varying degrees of variance. Therefore, we believe this does not suggest that our scaling law lacks robustness.
Q3: Theoretical grounding of the annealing term
We fully discuss all the concerns you mention about λ in Appendix I.1 and I.2 of our paper.
Selection of λ: We adopt the same λ in all of our experiments (with many different setups; see Table 3), and the results are consistently good, which means λ is quite robust across different setups. Moreover, we mention in the footnote on page 27 that one can also fit λ as a parameter like the other coefficients, which is reasonable and entirely feasible.
How β1 and β2 in the Adam optimizer affect λ: We have discussed this issue in our paper and conducted an experiment in Figure 29a of Appendix I.2. In conclusion, we find that changing the Adam β parameters barely influences the delay steps or the momentum term (including λ).
How λ itself matters: In Appendix I.1, we discussed and investigated the essence and impact of the decay factor λ in detail. We show that λ influences three aspects: (1) the delay steps; (2) the optimal annealing ratio; and (3) the balance point between the S1 and S2 terms.
Overall, we believe that λ has been sufficiently discussed in our paper.
Q4 & Q5 & Q6 & Q7
In this section, we aim to demonstrate that the experimental conclusions from previous works can also be validated and explained using our scaling law's predictions.
Claim on LRS Superiority: We cited the conclusions drawn from [1] and [2] and used our scaling law to verify and explain them. We provide a perspective based on the balance between S1 and S2. We believe there are still cases where WSD underperforms cosine (as mentioned above regarding the relationship between cosine and constant LRS). In summary, what really matters is that our scaling law can predict the final loss under various circumstances, allowing for case-specific evaluations.
Lack of Empirical Validation: Similarly, we cited the conclusion drawn from [3] and used our scaling law to verify and explain it. Their paper has already shown that a 10%–20% annealing ratio in WSD LRS works well in most situations, which is consistent with the results predicted by our scaling law.
Speculative Applications: Section 4 demonstrates how we apply our scaling law to validate and explain insightful findings, and these results align with the conclusions of prior work. In fact, we have already conducted extensive experiments in Section 3 to verify our scaling law, including the main experiments in Figure 2 and Figure 3, as well as four additional validation experiments documented in Lines 209–219 and Appendix E.1–E.4.
About Section 5: Section 5 presents a comparison between our scaling law and the Chinchilla scaling law. First of all, in Table 4, we have validated the performance comparison (in real experiments) between our and Chinchilla’s scaling laws. In Section 5, we focus more on formal reduction and analysis of resource consumption.
[1] MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
[2] DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
[3] Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations
Thank you again for your valuable feedback and the many thought-provoking questions that have helped refine our work. We look forward to any further insights you may have!
I remain concerned about the results presented in Figure 2. The learning curves for the cosine and constant learning rate schedules are expected to be quite distinct, yet they appear nearly identical in the plot, even after 20,000 training steps. I strongly encourage the authors to investigate why these two curves are so similar and to verify that their setup can reproduce the expected shapes for each schedule (e.g., did you tune the max learning rate? Perhaps 2e-4 in Figure 2 is too small). I am hesitant to raise my score until this discrepancy is clarified.
Thank you for your prompt response! We have reviewed our findings and are confident that our results are correct.
- The setting you may be looking for, where the loss curves of the cosine and constant LR schedules are quite distinct, is presented in Figure 14, which is also included in our paper under another setup (see setting E in Table 3). We use a large learning rate of 6e-4 in that experiment.
- We conducted many experiments across 5 different setups. Some setups yield similar loss curves for the cosine and constant LR schedules, while others show clear differences. Please refer to Figures 2, 10, 12, and 14 for details.
- To reiterate, the similarity of the curves under the cosine and constant LR schedules depends heavily on the experimental setup. This is discussed in Appendix H.2, which highlights the significant impact of the number of training steps. Besides, the max learning rate matters, too.
- The paper you referenced employs a different experimental setup from ours, with a higher number of training steps (exceeding 100K). Moreover, its y-axis is PPL (i.e., the exponential of the loss) over a narrow range, akin to our zoomed-in view, which accentuates the differences between the two curves.
Please let us know if our response fully addresses your concerns regarding this issue. Additionally, we would like to confirm whether your other initial questions have been adequately resolved. Thank you for your engagement and feedback!
The authors identify empirical relationships among per‐step loss, gradient norms, and the learning‐rate schedule, and propose a novel scaling law that directly links the per-step loss to the learning‐rate scheduler. Their formulation is capable of capturing the abrupt loss drop following a learning‐rate decay, making it applicable across a variety of schedules.
Strengths and Weaknesses
Pros
- Designing optimal learning‐rate schedules is critical for modern deep‐learning training, yet our theoretical understanding remains limited. This work takes a significant step by formalizing the empirically observed effect of the schedule on each training step’s loss.
- The paper presents a step‐by‐step derivation that bridges intuitive observations (e.g. the delay in loss reduction) and the final scaling‐law form. Each refinement is well motivated and tied back to real‐world phenomena.
- The resulting formula is concise, interpretable, and accurately predicts unseen loss curves, while explicitly illustrating the trade‐off between rapid progress at high learning rates and the benefits of annealing.
Cons
- Most of the derivation rests on visual/empirical observations rather than rigorous theory (cf. Question 1).
- The law can compare configurations only within certain families of schedulers; it does not immediately generalize to arbitrary or hybrid schedules without ad‐hoc adjustments (cf. Question 2).
Questions
- Is it possible to theoretically justify the proposed scaling-law relationship, analogous to [Theorem 1 in 1]? Such a result would greatly strengthen the paper's claims.
- Under the proposed law, a "constant-then-zero" schedule emerges as optimal (cf. [Theorem 2 in 1]), which reproduces the failure to explain the corner behavior that adding zero-LR steps always yields an improvement. According to the discussion sections, the authors propose an ad-hoc adjustment for this issue. Based on this modification, can you theoretically derive a certain optimality result, as done in [1]?
- The authors mention that the proposed scaling law can be extended to accommodate model size, with empirical validation. Based on this, if we fit the scaling law on small-scale experiments, we should be able to use the fitted law to determine the configuration of large-scale training schedules. Are there any experiments related to this? Does the performance beat the default configuration?
Reference
[1] Luo, K. et al. (2025). A multi-power law for loss-curve prediction across learning-rate schedules. arXiv:2503.12811.
Limitations
yes
Formatting Issues
no
We sincerely appreciate your valuable feedback and the opportunity to address your comments.
Q1: Empirical observations rather than rigorous theory
We agree that theoretical analysis could strengthen our claims. However, it is important to recognize that theoretical analysis of real-world LLM training is extremely difficult, and almost all scaling-law papers are proposed empirically. The paper you mention [1] also acknowledges this and simplifies the problem to a toy scenario to derive its results: "we consider a simple setting where SGD optimizes a quadratic loss function with noisy gradients."
We believe that advancing theoretical analysis is an important research direction, and we also believe that our conclusions are sufficiently supported by the extensive experiments we have conducted in Section 3.
[1] A Multi-Power Law for Loss Curve Prediction Across Learning Rate Schedules
Q2: Optimizing the learning rate schedule
Optimizing the learning rate schedule: In Appendix I.4 of the paper, we discuss the optimization of the learning rate schedule in detail. Specifically, when optimizing Eq. 10 (as opposed to Eq. 1), the problem becomes quite difficult to solve, as it is no longer convex (the Hessian of this problem is no longer positive definite). From a practical engineering perspective, however, it is feasible and efficient to select a better LRS from many candidates based on the predictions of the scaling law. For example, we show how to choose between WSD and cosine LRS and how to choose the optimal annealing ratio in Section 4.
The scope of our scaling law: Regarding your concern that "the law can compare configurations only within certain families of schedulers", we respond as follows.
When the LR schedule becomes stranger and more irregular, the loss becomes more difficult to predict, which is intuitive and inevitable. Even in such cases, our scaling law still makes sufficiently accurate predictions and keeps the error within a usable range.
As Table 4 shows, when the predicted LR schedule becomes more complicated, the error of our scaling law indeed grows, from 0.159% to 0.322%. These errors are still quite small, so our scaling law can be applied to these schedules.
As a more extreme case, we also perform an extra experiment on an irregular, random LR schedule, where the LR of each step is sampled uniformly from 0 to 2e-4 and the S2 term stays close to zero throughout training. The mean error in this extreme situation is 0.605%, indeed larger but within a moderate range. The "constant-then-zero" schedule you mention is a quite extreme situation. Overall, our scaling law applies within a specific scope (e.g., consistent maximum learning rate and standard learning rate schedules). However, for commonly used learning rate schedules, it offers high accuracy, simplicity, and strong interpretability.
Q3: About the N-extended scaling law
Yes, and the experiment in Figure 7b shows that we can derive a model-size-extended scaling law from smaller model sizes (the fitted equation is given in Figure 7b). We did not use this equation to select a better learning rate schedule, but we validated it by predicting the full loss curve of a larger model. In fact, if we substitute a specific value of N (e.g., 594.6M), the equation reduces to the same form as our fitted equation in Figure 2. This means our equation can also incorporate specific values of N to make predictions for a given model size.
We hope this clarification addresses your concerns and thank you again for your thoughtful review.
I thank authors for in-depth explanation to my questions. I will keep my scores unchanged.
The authors propose a scaling law that can predict the loss of a training run based on the LR schedule. Its effectiveness is validated by closely fitting pre-training loss curves.
Strengths and Weaknesses
Strengths:
- The paper is well written and easy to follow.
- It is common knowledge that the learning rate is the single most important hyperparameter. Utilizing only the learning rate schedule to accurately predict the loss curve seems natural and well motivated.
- The empirical evidence is strong, as is the application of this scaling law to analyzing existing learning rate schedules.
Weaknesses:
- I am wondering about direct applications of the proposed scaling law. For example, with everything else the same, would it be possible to derive an optimal learning rate schedule that results in the smallest final loss?
- One weakness is that the same set of parameters (L0, A, C, alpha) can only be used to predict training runs that are exactly the same, with the same max learning rate. This leads to two questions: a) How do the four parameters change when everything else is the same but only the max learning rate changes? b) Does this change imply anything about what the optimal learning rate might be for the training run?
Questions
See weaknesses
Limitations
yes
Final Justification
It is common knowledge that the learning rate is the single most important hyperparameter. Utilizing only the learning rate schedule to accurately predict the loss curve seems natural and well motivated. The empirical evidence is strong, as is the application of this scaling law to analyzing existing learning rate schedules.
Formatting Issues
n/a
Thank you for your thoughtful recognition and for sharing such detailed insights!
W1: Deriving an optimal learning rate schedule
Yes, and we discuss this issue in detail in Appendix I.4. Simply speaking, when optimizing the LR schedule mathematically, one should use Eq. 4 rather than Eq. 1, which is more precise but quite difficult to solve analytically. From a practical engineering perspective, however, it is feasible and efficient to select a better LRS from many candidates based on the predictions of the scaling law; for example, we show how to choose between WSD and cosine LRS and how to choose the optimal annealing ratio in Section 4, as sketched below.
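As a rough illustration of this candidate-selection idea (not the authors' procedure), the sketch below sweeps WSD annealing ratios and picks the one with the lowest predicted final loss under the assumed form L = L0 + A*S1^(-alpha) - C*S2. The schedule shape, the momentum-style S2, and all coefficient values are made-up assumptions, so the specific optimum it prints is only illustrative.

```python
import numpy as np

def wsd_schedule(total_steps, anneal_ratio, eta_max=2e-4, eta_min=0.0):
    """Warmup-Stable-Decay schedule (warmup omitted): constant eta_max,
    then a linear decay to eta_min over the last anneal_ratio of training."""
    stable = int(total_steps * (1 - anneal_ratio))
    decay = np.linspace(eta_max, eta_min, total_steps - stable)
    return np.concatenate([np.full(stable, eta_max), decay])

def predicted_final_loss(lrs, L0, A, C, alpha, lam=0.99):
    """Final loss under the assumed law L = L0 + A*S1^(-alpha) - C*S2."""
    S1 = np.sum(lrs)
    m, S2 = 0.0, 0.0
    for i in range(1, len(lrs)):
        m = lam * m + (lrs[i - 1] - lrs[i])
        S2 += m
    return L0 + A * S1 ** (-alpha) - C * S2

coeffs = dict(L0=2.3, A=0.45, C=1.8, alpha=0.5)  # made-up, not fitted values
ratios = np.linspace(0.05, 0.5, 10)
losses = [predicted_final_loss(wsd_schedule(60_000, r), **coeffs) for r in ratios]
best = ratios[int(np.argmin(losses))]
print(f"annealing ratio with lowest predicted final loss: {best:.2f}")
```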
W2: How our parameters change
We answer these questions one by one.
(a) Indeed, the parameters (L0, A, C, alpha) would change under those conditions, and we emphasize this in Section 3.3.
(b) Great question! It would be quite possible to investigate this by studying how these parameters change with the pre-set max learning rate. For instance, as the max learning rate increases, the parameters may follow a consistent pattern, potentially leading to a parabolic-like trend in final performance, which could help predict the optimal learning rate. While we have not conducted this experiment, we believe it offers a promising approach to determining the optimal learning rate.
Thank you again for your valuable input and for sparking such an interesting discussion! Please let us know if there’s anything further we can assist with.
Thanks authors for the rebuttal. I will keep my original positive score.
The paper empirically derives a scaling law for neural language models under different learning rate annealing processes. In particular, the paper supposes a scaling law of the form L(s) = L0 + A · S1^(-alpha) − C · S2, where S1 is the cumulative (forward-accumulated) learning rate and S2 captures learning rate annealing (backward accumulation). Extensive experiments are performed under various learning rate conditions (e.g., cosine decay and WSD).
Overall, the reviewers are in favor of acceptance, with some mild concerns about the experimental results relating to the cosine and WSD learning rate schedules. It would be helpful for the authors to incorporate into the text a description of where the proposed scaling law comes from and/or theoretically derive such a scaling law from a simple model. I highly suggest that the authors incorporate the reviewers' comments into the main text.
Overall, the reviewers are in favor of acceptance with some mild concerns about the experimental results relating to the cosine and WSD learning rate schedules. It would be helpful for the authors to incorporate within the text a description of where the proposed scaling law comes from and/or theoretically derive such a scaling law from simple model. I highly suggest that the authors incorporate the comments from the reviewers into the main text.