PaperHub
Overall: 6.8/10 · Poster · 4 reviewers (min 3, max 5, std. dev. 0.8)
Ratings: 5, 3, 4, 5 · Confidence: 3.0
Novelty: 2.8 · Quality: 3.0 · Clarity: 2.8 · Significance: 3.0
NeurIPS 2025

AdaLRS: Loss-Guided Adaptive Learning Rate Search for Efficient Foundation Model Pretraining

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29

Abstract

Keywords
Foundation Model Pretraining · Hyperparameter Search · Learning Rate Search

Reviews and Discussion

Official Review (Rating: 5)

This paper introduces AdaLRS, a plug-and-play adaptive learning rate search algorithm that approximates the optimal learning rate online within a single full training process by optimizing the loss descent velocity.

The authors first demonstrate through small-scale pre-training experiments that in Large Language Model (LLM) and Vision-Language Model (VLM) pre-training, the training loss and its slope exhibit an approximately convex relationship with the learning rate and share the same optimal point. Based on this empirical observation, AdaLRS monitors the loss slope within a sliding window during the early stages of training. If the velocity decreases, it attempts to increase the learning rate; if this proves unsuccessful, it rolls back and reduces the learning rate.

The paper provides a theoretical proof of convergence and geometric error decay. Experiments covering 2B/7B LLMs and a 2B VLM, with various initial learning rates, Cosine and WSD schedulers, and in a continued pre-training scenario, all show that AdaLRS can quickly guide an unsuitable learning rate towards the optimal region and accelerate convergence. In some cases, it can even surpass the final performance of a manually tuned "best" learning rate.

Strengths and Weaknesses

I am overall supportive of this paper. Although I am not an expert in the LR-tuning domain, I feel that this method is practical.

The manuscript is straightforward to read: its core idea—adjusting the learning rate online by monitoring the loss-descent velocity—is both simple and intuitive, making the method easy to grasp and potentially easy to reproduce. The authors run large-scale experiments across multiple model sizes (2B–7B parameters), two modalities (LLM and VLM), and several learning-rate schedulers. This breadth of evidence strengthens the paper’s empirical claims. The related-work section situates AdaLRS convincingly among classic hyper-parameter optimization, hyper-gradient, and adaptive-LR methods, highlighting both overlap and distinctions. Because AdaLRS can be plugged into an existing training loop with minimal modifications, practitioners who frequently struggle with manual LR tuning can benefit immediately.

I also find some minor weaknesses:

  • Hypothesis 1—that both the loss and its slope are convex in the learning rate and share the same optimum—is supported only empirically. Without a stronger theoretical justification, the method’s robustness outside the tested settings remains uncertain.
  • In several plots (e.g., Figure 1), the training-loss curves and LR trajectories are rendered so small that individual trends are difficult to discern. Enlarging the sub-plots or using insets would improve readability.
  • Figure 1 currently offers little explanatory detail. Expanding the caption (e.g., clarifying what each curve represents and why the trajectories matter) would help readers understand the take-away message without referring back to the main text.

Questions

You briefly distinguish AdaLRS from Hypergradient Descent (HD) in the related work section. To make the distinction more concrete, would it be possible to provide a small-scale empirical comparison against HD? Showing how HD performs with a modern scheduler (like cosine) and how AdaLRS differs in practice, even on a smaller task, could more forcefully highlight the unique advantages of your proposed method for foundation model pretraining.

AdaLRS repeatedly saves and restores checkpoints to enable roll-backs, yet Table 1 omits the associated memory and wall-time costs. Could you quantify these overheads and discuss feasibility at the >70B parameter scale?

Limitations

Yes.

Final Rating Justification

I raised my score from 4 to 5.

Why not a lower score?

  1. Finding the optimal learning rate (LR) is a timely and relevant challenge in large model pre-training.

  2. This paper provides an adequate empirical study along with analytical discussions, offering non-trivial insights for the community.

Why not a higher score?

  1. The proposed method introduces additional hyperparameters that may require tuning. In light of the next point, this limitation gains added significance.

  2. The scalability of the method remains unclear, yet it is crucial for real-world applications.

  3. Although optimizing the LR is a timely issue, it represents only a small aspect of large model pre-training, which often intersects with other challenges (e.g., loss spikes, as noted by other reviewers). This suggests the method may be somewhat idealized or primarily experimental in nature.

Formatting Issues

Please see weaknesses. The figures need to be improved.

Author Response

We appreciate the reviewer's insightful feedback. Please allow us to clarify our methodology point by point.

1. Theoretical justification for Hypothesis 1

  1. The convexity of loss descent velocity w.r.t. LR

    Assume $L(\theta)$ is the loss function to be minimized, and we use SGD with the update rule:

    $$\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$$

    where $\eta$ is the learning rate (LR).

    Intuitively, if $\eta$ is too small, updates are small and the loss decreases slowly; if $\eta$ is too large, the updates may overshoot, causing oscillation or even divergence.

    The expected loss descent velocity per step (using a Taylor expansion) is:

    $$\mathbb{E}\left[ L(\theta_{t+1}) - L(\theta_t) \right] \approx -\eta \| \nabla L(\theta_t) \|^2 + \frac{L}{2} \eta^2 \| \nabla L(\theta_t) \|^2$$

    where $L$ is the Lipschitz constant of the gradient. When the learning rate is small, the first term dominates the expectation, and a smaller $\eta$ leads to a smaller decrease in loss. When the learning rate is large, on the other hand, the second term cannot be neglected, and it suppresses the expected loss descent.

    Differentiating with respect to $\eta$ and setting the derivative to zero for the extremum:

    $$\frac{\partial}{\partial \eta} \left( -\eta + \frac{L}{2} \eta^2 \right) = -1 + L \eta = 0 \implies \eta^* = \frac{1}{L}$$

    This shows that the loss descent velocity with respect to $\eta$ is a convex function (has only one maximum); a small numeric check of this claim follows point 2 below.

  2. Shared optimum across loss and loss slope optimization

    In Figure 1, we show that the optimum for both the training loss and its slope remains the same across most of the training process. Based on this observation, we summarize as follows why the optimal learning rates for the fastest loss decrease and the lowest final loss generally coincide:

    • Optimal loss slope -> optimal loss: the best learning rate discovered via the largest loss slope yields the largest loss descent per step, resulting in the lowest attainable final loss.
    • Optimal loss -> optimal loss slope: if a learning rate cannot achieve the largest loss slope, the resulting final training loss must be suboptimal too, because the LR with the largest loss slope will always produce a lower training loss. As a result, the optimal training loss must be achieved by a learning rate with the largest loss slope.

    The conclusions above generalize to arbitrary learning rate values: learning rates with faster loss descent velocity always achieve lower training loss. Since we have justified that the loss slope optimization is convex, it can easily be deduced that the training loss optimization is also a convex problem.
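To complement point 1 above, here is a minimal, runnable numeric check (our own illustration, not part of the paper) that the per-step loss change derived from the Taylor expansion is convex in $\eta$ with its extremum at $\eta^* = 1/L$, using a quadratic loss whose gradient Lipschitz constant is known exactly:

```python
import numpy as np

# Sanity check on L(theta) = (c/2) * theta^2, whose gradient c * theta is
# c-Lipschitz. For this loss, the one-step SGD loss change is exactly
#   -eta * ||grad||^2 + (c/2) * eta^2 * ||grad||^2,
# so the curve should be convex in eta with its extremum at eta = 1/c.
c = 4.0                 # Lipschitz constant of the gradient
theta0 = 1.0            # starting point

def loss(theta):
    return 0.5 * c * theta ** 2

def one_step_loss_change(eta):
    grad = c * theta0
    theta1 = theta0 - eta * grad          # a single (noise-free) SGD step
    return loss(theta1) - loss(theta0)

etas = np.linspace(0.0, 2.0 / c, 201)
deltas = np.array([one_step_loss_change(e) for e in etas])
eta_best = etas[np.argmin(deltas)]        # most negative change = fastest descent
print(f"empirical optimum eta ~ {eta_best:.4f}; theory says 1/c = {1.0 / c:.4f}")
second_diff = np.diff(deltas, 2)          # convex curves have >= 0 second differences
print("convex in eta:", bool(np.all(second_diff >= -1e-12)))
```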

2. Details about Figure 1

We thank the reviewer for the careful reading and useful suggestions, and we will supplement enlarged figures as well as detailed caption texts in the revised version of the paper.

We would also like to further clarify the content and the main conclusions drawn from Figure 1:

  1. Subplots a, b, c show the training loss curves and LR trajectories for different pretraining scenarios.
  2. In Subplots e, f, g, each curve represents the training loss dynamics across different LR settings at a certain training step.
  3. In Subplots h, i, j, each curve indicates the loss slope dynamics across different LR settings at a certain loss value.
  4. The curve groups shown in Subplots e, f, g and h, i, j exhibit consistent optima across the training process: the optimal LRs for the fastest loss descent and the lowest final loss generally coincide.

3. Comparison with Hypergradient Descent (HD)

We thank the reviewer for the careful reading. In fact, Hypergradient Descent (HD) and AdaLRS serve different purposes, and are therefore not directly comparable. We list the differences between the two methods below:

  1. Main idea
    • HD: dynamically adjusts the learning rate across the entire training process.
    • AdaLRS: searches for the optimal learning rate scale during the early steps of pretraining.
  2. Workflow
    • HD: adjusts the LR dynamically throughout training, serving as a learning rate scheduler itself.
    • AdaLRS: adjusts only the learning rate scale, which is multiplied with the base LRs from an existing scheduler (see the sketch after this list).
  3. Compatibility with modern learning rate schedulers
    • HD: cannot be combined with any learning rate scheduler.
    • AdaLRS: compatible with a series of modern learning rate schedulers, such as the cosine scheduler, the WSD scheduler, and potentially more alternatives.
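To illustrate point 2 concretely, here is a minimal sketch (our own, with assumed names such as `cosine_base_lr` and `lr_scale`; not the paper's code) of how the searched scale composes with an existing scheduler without replacing it:

```python
import math

def cosine_base_lr(step, total_steps, peak_lr=2e-4, min_lr=2e-5):
    # A standard cosine decay from peak_lr to min_lr over total_steps.
    t = step / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * t))

lr_scale = 1.0   # multiplicatively adjusted by AdaLRS during the early search
for step in (0, 2500, 5000, 7500, 10000):
    lr = lr_scale * cosine_base_lr(step, total_steps=10000)
    print(f"step {step:>5}: lr = {lr:.3e}")
```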

4. Additional costs in checkpoint restoration and rollback

  1. Memory and wall-time costs of checkpoint restoration

    To the best of our knowledge, modern foundation model pretraining frameworks, such as DeepSpeed, support checkpointing in parallel with model training. In our experiments, we find that the memory and wall-time costs of checkpoint restoration are negligible and do not block the training process.

  2. Memory and wall-time costs of roll-backs

    Assume an extreme case where all LR upscaling attempts fail: every training step within the search window is performed twice (once in the look-ahead attempt, and once more after the rollback in the alternative LR downscaling branch). As discussed in the workflow part of Section 2.1, LR adjustment is restricted to a portion of the training steps, namely the 10%–40% segment of the training process. Therefore, the maximum time consumption of an entire training run is 130% of the baseline training time (spelled out after point 3 below), which is still much faster than traditional LR search methods.

  3. Feasibility at 70B parameter scale

    The design of AdaLRS relies solely on training loss dynamics, which suggests promising scalability. In our early experiments on 72B LLM and VLM training, AdaLRS exhibits the expected LR adjustment behavior. We will validate the applicability of AdaLRS to 70B-scale models in future work and supplement the corresponding details in the revised version of our paper.
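For concreteness, the worst-case wall-time bound from point 2 can be written out as follows (our paraphrase of the argument, with $f_{\text{search}}$ the fraction of training steps inside the search window, each executed at most twice in the all-rollbacks case):

$$T_{\text{AdaLRS}} \le T_{\text{base}} \left( 1 + f_{\text{search}} \right), \qquad f_{\text{search}} = 0.4 - 0.1 = 0.3 \implies T_{\text{AdaLRS}} \le 1.3\, T_{\text{base}}$$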

Comment

Thank you to the authors for their response.

  1. The theoretical justification is valid. However, there appears to be a minor error in the phrasing: "This shows that the loss descent velocity with respect to $\eta$ is a convex function (has only one maximum)." Given that the objective is to minimize $L(\theta_{t+1}) - L(\theta_t)$ (i.e., maximize the descent), and the quadratic function has a positive leading coefficient (indicating convexity and a single minimum), the term "maximum" should be corrected to "minimum."

  2. I appreciate the added discussion as suggested, which adequately addresses my concerns. Please incorporate this discussion into the revised paper.

  3. I raised my score from 4 to 5. Happy to see the paper published.

Comment

We would like to thank Reviewer oJfS for carefully reviewing our paper and for the useful suggestions. We will incorporate these discussions into the revised version of our paper and extend our thanks in the Acknowledgment section. Please kindly let us know if there is anything unclear.

Official Review (Rating: 3)

This paper introduces AdaLRS, an algorithm that automatically finds optimal learning rates during pretraining. The main idea is to use the slope of the training loss to dynamically adjust the learning rate to maximize this velocity within a single training run. The algorithm detects when the loss slope degrades and tests if changes in learning rate would decrease the loss faster. The authors provide a theoretical convergence analysis based on the convexity of loss and its slope. The algorithm was validated on LLM and VLM pre-training at 2B-7B scales when trained from scratch or in a continual training setting. The authors also checked its compatibility with learning rate schedules.

Strengths and Weaknesses

Strengths

  • Clear motivation: The optimal LR is difficult to find at scale, which makes finding it within a single run well motivated.
  • Large-scale validation: The experiments are performed at large scales.
  • Clear main idea: The core idea of adjusting the LR based on the slope of the loss is intuitive and well-motivated.

Weaknesses

  • Poor readability: The paper is difficult to follow, with inconsistent notation and unclear algorithmic descriptions. I think the paper needs a major rewrite.
  • Algorithmic inconsistencies: There seems to be a contradiction between Equation 1 and Algorithm 1 in the Appendix. Equation 1 suggests simultaneously increasing or decreasing the LR, while Algorithm 1 follows a sequential if-else statement. Furthermore, Equation 1 uses $e$ for the error, but $\theta$ is suddenly introduced in Algorithm 1.
  • Timeline ambiguity: It is unclear when the LR is changed (at step t or t+k). If there is a k-step lookahead, it should be clearly stated in the main text, as it increases the computational cost.
  • Excessive complexity: While the central idea is clear, the method is overly complicated. It introduces 5-6 hyperparameters, and their reliability is not well tested.
  • Convexity assumptions: The "convexity" discussed in the paper is trivial. Of course, small LRs give slow descent and training diverges at large LRs. Therefore, an optimal LR exists trivially. Are the authors claiming anything beyond this?
  • Logical inconsistency in Equation 1: Equation 1 says to reduce the LR if the loss slope is decreasing. Why would one select an LR where the loss is decreasing slowly? Are the authors assuming that the loss is increasing in this case?
  • Narrow applicability: The method assumes stable loss curves without spikes, which is unrealistic for large-scale pretraining.
  • Limited scope: In the Appendix, the authors discuss that the method only works well with small LR baselines.
  • Chicken-and-egg problem: How does one know a priori whether their LR is "small enough" for the method to work?

Questions

Please see weaknesses.

Limitations

Please see weaknesses.

Additional comments

  • Please use the same scale for the x-axis of the middle and bottom rows of Figure 1.

Final Rating Justification

The rebuttal has addressed a few of the concerns I had. Accordingly, I have increased my score from 2 to 3.

Why am I not recommending for acceptance?

  1. I feel like the paper needs a major rewrite: among the 6 papers I reviewed this cycle, I have spent the most time understanding the details of this paper. As the high-level idea of the paper is quite simple, I believe the presentation can be improved.
  2. While the overall idea is simple, the resultant method appears to be overly complicated with multiple new hparams. I am unsure if the community would adopt the method.

Since my score is still borderline, I am happy to discuss it with other reviewers and the ACs.

Formatting Issues

I did not observe any formatting issues.

Author Response

We appreciate the reviewer's insightful feedback. Please allow us to clarify our methodology point by point.

1. Paper readability

We thank the reviewer for the careful reading. The potentially confusing notations are introduced to prove the convergence of AdaLRS: we account for the statistical error in loss slope estimation and several other details to broaden the generalizability of the proof. For better readability, we provided a simplified workflow description of the proposed AdaLRS algorithm in the workflow part of Section 2.1. We admit that certain algorithmic details can be further clarified, and we will adjust the corresponding descriptions in the revised version.

Below is a simplified workflow description of the proposed AdaLRS algorithm, which we hope helps address potential confusion (a runnable sketch follows the list):

  1. AdaLRS attempts to upscale the LR when the loss descent velocity decreases, which is triggered by a threshold $\theta$ shown in Algorithm 1 of the paper.
  2. $k$ look-ahead training steps are performed to estimate the loss slope after the LR upscaling.
  3. If the loss descent velocity increases after the attempted LR upscaling, the adjustment is retained; otherwise it is discarded and replaced by an LR downscaling adjustment.
  4. Whenever an LR upscaling or downscaling adjustment is performed, a decaying factor $\gamma$ is applied to the LR scaling factors, ensuring the convergence of the LR adjustments.
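The toy script below walks through this loop end to end. It is our own illustration on a 1-D quadratic loss, not the paper's implementation: the helper names, the window-based slope estimator, and the exact rule for decaying the scaling factors toward 1 are all assumptions.

```python
import copy
import numpy as np

# Toy, runnable illustration of the AdaLRS loop in items 1-4 above. The 1-D
# quadratic loss makes "checkpoints" plain parameter copies. Helper names and
# the factor-decay rule are our assumptions, not the paper's implementation.

rng = np.random.default_rng(0)
LIP = 4.0                                    # gradient Lipschitz constant

def loss_fn(w):
    return 0.5 * LIP * w ** 2

def sgd_step(w, lr):
    grad = LIP * w + rng.normal(0.0, 0.01)   # noisy gradient
    return w - lr * grad

def slope(losses):
    # Loss descent velocity: negative of the fitted slope over the window.
    t = np.arange(len(losses))
    return -np.polyfit(t, losses, 1)[0]

def run_window(w, lr, k):
    losses = []
    for _ in range(k):
        w = sgd_step(w, lr)
        losses.append(loss_fn(w))
    return w, losses

w, lr = 2.0, 1e-3                            # deliberately too-small initial LR
alpha, beta, gamma, theta, k = 3.0, 2.0, 0.99, 0.9, 50
best_v = None
for window in range(20):                     # search confined to early training
    w, losses = run_window(w, lr, k)
    v = slope(losses)
    if best_v is not None and v < theta * best_v:        # velocity decayed (item 1)
        ckpt = copy.deepcopy(w)                          # checkpoint for rollback
        w_try, try_losses = run_window(w, lr * alpha, k) # k-step look-ahead (item 2)
        if slope(try_losses) > v:                        # upscaling helped (item 3)
            w, lr = w_try, lr * alpha
        else:                                            # roll back and downscale
            w, lr = ckpt, lr / beta
        alpha = 1.0 + (alpha - 1.0) * gamma              # factor decay (item 4)
        beta = 1.0 + (beta - 1.0) * gamma
        best_v = None                                    # re-baseline the velocity
    else:
        best_v = v if best_v is None else max(best_v, v)

print(f"LR after search: {lr:.2e} (fastest-descent LR for this loss: "
      f"{1.0 / LIP:.2e})")
```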

2. Algorithmic representations

As shown in Equation 1 and Algorithm 1, the definition of $e$ remains consistent: it is the loss slope estimation error, introduced to ensure the generalizability of our proof of AdaLRS's convergence. The threshold $\theta$ introduced in Algorithm 1 automatically triggers LR upscaling attempts across various pretraining tasks. We elaborate on the necessity and importance of this threshold in point 4.

3. Additional computational cost

We mention in the workflow part of Section 2.1 that an attempted LR upscaling is reverted if the resulting loss descent velocity decreases. In Section 3.2, we further discuss the acceleration ratio of AdaLRS in different pretraining scenarios, and we observe that the k-step lookahead does not undermine the convergence speed or loss improvement of AdaLRS. Even for large LR settings, where a series of roll-backs is performed, models trained with AdaLRS still reach the same loss level as baselines with an over 30% acceleration ratio. It is also worth noting that the roll-backs do not undermine the efficiency of AdaLRS in optimal LR search: since the LR search is confined to only 30% of the training steps, the maximum computational cost of AdaLRS is 130% of the baseline training cost, which still outperforms traditional LR search methods markedly.

4. Hyperparameters

Besides the LR scaling factors $\alpha$, $\beta$, and $\gamma$, we introduce only two additional hyperparameters, $\theta$ and $e$, in our algorithm. The threshold $\theta$ provides an automatic way to trigger LR upscaling attempts, regardless of the training scenario or scale. We set $\theta$ to 0.9 in our experiments, i.e., an LR upscaling attempt is triggered whenever the loss descent velocity decreases by at least 10%. This simple strategy ensures steady loss descent at high velocity, as well as timely LR upscaling when the velocity decays. More importantly, it achieves adaptive LR upscaling attempts across various pretraining tasks, regardless of the training scenario. The loss slope estimation error $e$, on the other hand, is incorporated to ensure the generalizability of our proof of AdaLRS's convergence, which is shown in Section 2.2.

For the LR scaling factors $\alpha$, $\beta$, and $\gamma$: the first two scaling factors are necessary for AdaLRS, and $\gamma$ ensures the convergence of AdaLRS to a small enough neighborhood of the optimal LR. We further conduct experiments to demonstrate the importance of the scaling factors and AdaLRS's robustness under various hyperparameter settings. Starting from a 2e-5 LR for 2B LLM pretraining with the WSD scheduler, we show the LR dynamics of the 64M-sample pretraining experiments:

| Experiment | $\alpha$/$\beta$/$\gamma$ | LR trajectory (Step/LR/Scale) | Test PPL |
| --- | --- | --- | --- |
| Baseline | - | 100/2e-5/- | 43.77 |
| A | 3/2/0.99 | 100/2e-5/- → 20100/6e-5/↑ → 28100/1.8e-4/↑ → 35100/3.6e-4/↑ → 42100/7.19e-4/↑ → 50100/3.6e-4/↓ | 16.94 |
| B | 2/1.67/0.99 | 100/2e-5/- → 20100/4e-5/↑ → 28100/8e-5/↑ → 36100/1.6e-4/↑ → 44100/3.2e-4/↑ → 51100/5.16e-4/↑ → 57100/3.1e-4/↓ | 18.79 |
| C | 1.5/1.43/0.99 | 100/2e-5/- → 20100/3e-5/↑ → 28100/4.5e-5/↑ → 36100/6.75e-5/↑ → 44100/1.01e-4/↑ → 52100/1.52e-4/↑ | 21.90 |
| D | 2/1.67/0.95 | 100/2e-5/- → 20100/4e-5/↑ → 28100/7.6e-5/↑ → 36100/1.37e-4/↑ → 44100/2.35e-4/↑ → 51100/3.09e-4/↑ | 19.58 |
| E | 2/1.67/0.9 | 100/2e-5/- → 20100/4e-5/↑ → 28100/7.2e-5/↑ → 36100/1.16e-4/↑ → 44100/1.7e-4/↑ → 51100/2.17e-4/↑ | 19.70 |

In this table, Exps Baseline and A correspond to the "Small LR" setting for 2B LLM pretraining in our paper, and Exps B and C are conducted with different LR scaling factor values. Exps D and E adopt smaller decaying factors, which suppress the "overshooting" behavior observed in Exp B. Overall, despite the varying hyperparameter settings, the learning rates converge robustly to the neighborhood of the optimum, 2e-4.

We also take the model checkpoints at step 60,000 (of 150,000 in total) to evaluate their test PPL, and the results show that AdaLRS achieves improved model performance over the baseline under all of these hyperparameter settings. We will include the complete loss dynamics figures and evaluation results in the revised version of our paper.

5. Convexity assumptions

The convexity assumptions are indeed intuitive, but they had not been validated systematically across different foundation model pretraining settings. In this work, we not only validate this convexity in various pretraining scenarios, but also unveil that the loss and loss slope optimization share the same optimum. This shared optimum is the keystone of the convergence of the AdaLRS algorithm, which is discussed in Section 2.2.2.

6. Logic of Equation 1

As described in the workflow part of Section 2.1, the proposed AdaLRS algorithm downscales the LR if the attempted LR upscaling results in a decreased loss descent velocity. The condition in Equation 1, Line 2 is indeed a typo: it should be $v(\alpha'\eta_t) < v(\eta_t) - 2e$. We thank the reviewer for the careful reading and will correct it in the revised version.

7. Loss spikes in pretraining

Loss spikes are indeed inevitable in pretraining. We also encounter them in our experiments, but they do not undermine the applicability of AdaLRS. Firstly, the gradual LR scaling strategy shown in Algorithm 1 suppresses a number of loss spikes. Secondly, a simple loss smoothing strategy is found to be sufficient to mitigate the influence of loss spikes. As a result, despite the presence of loss spikes, AdaLRS still achieves the expected performance improvement in relatively large-scale pretraining.
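As an illustration of the kind of smoothing meant here (the paper does not specify the exact rule; the exponential moving average below is our assumption), spike-robust slope estimation can be obtained by smoothing the raw losses before fitting the window slope:

```python
def smooth_losses(losses, decay=0.98):
    """Exponential moving average of raw training losses.

    Damps isolated loss spikes before the sliding-window slope is
    estimated; decay = 0.98 is an assumed value, not the paper's.
    """
    smoothed, ema = [], None
    for x in losses:
        ema = x if ema is None else decay * ema + (1.0 - decay) * x
        smoothed.append(ema)
    return smoothed

# A spike barely moves the smoothed curve:
print(smooth_losses([2.0, 1.9, 8.0, 1.8, 1.7]))
```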

8. Application scope

This issue is discussed as one of AdaLRS's limitations in Appendix C. We argue that it does not undermine the effectiveness of AdaLRS in optimal LR search. As shown in Figure 2 and Tables 2 and 3, AdaLRS achieves significant performance improvements in pretraining under both small LR and large LR settings. We also discuss in Section 3.2 the effectiveness of AdaLRS in large LR settings: even if AdaLRS cannot achieve optimal model performance in a single run, it still finds the optimal LR efficiently, outperforming traditional LR search methods markedly.

9. Chicken-and-egg problem

  1. As discussed in the point above, AdaLRS finds the optimal LR efficiently and effectively regardless of the initial LR setting. Therefore, although the chicken-and-egg problem may prevent reaching optimal model performance in a single run, it does not undermine the applicability of AdaLRS to optimal LR search.
  2. In practice, researchers can refer to existing open-source works for LR estimates. For example, 1e-3 and 2e-4 are often used as initial LRs for VLM and LLM pretraining [1, 2], respectively. Since AdaLRS approximates the optimal LR with geometric error decay, as proved in Section 2.3 of the paper, it tolerates a wide range of small LR choices. As a result, the influence of the chicken-and-egg problem is limited.

[1] Li, B., Zhang, Y., Guo, D., et al. LLaVA-OneVision: Easy Visual Task Transfer. arXiv preprint arXiv:2408.03326, 2024.

[2] Dey, N., Gosal, G., Khachane, H., et al. Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster. arXiv preprint arXiv:2304.03208, 2023.

Comment

I thank the authors for their rebuttal. It has cleared up a few of my concerns; accordingly, I am increasing the score. Nevertheless, I am still uncertain about:

  1. The clarity of the paper. I still believe the paper needs a major rewrite.
  2. The complexity of the resultant method with multiple new hparams.

Comment

We sincerely thank the reviewer for the careful reading and useful suggestions, and will extend our thanks in the Acknowledgment section of the revised version. We admit that the descriptions of certain design details are potentially confusing, and will clarify these contents to improve readability. That being said, please kindly allow us to explain the necessity and importance of each hyperparameter introduced in our method, which we believe is also crucial for evaluating the paper's clarity:

LR scaling hyperparameters ($\alpha$, $\beta$, and $\gamma$)

Firstly, our dynamic LR scaling strategy naturally requires two scaling factors to take effect, with $\alpha$ and $\beta$ being the upscaling and downscaling factors, respectively. The decaying factor $\gamma$ further ensures the convergence of the scaling factors; otherwise, large scaling factors may incur oscillation in the final LR search process. We presented the convergence boundary of AdaLRS in Proposition 2.3 of our paper, where $\gamma$ is a necessary part of the convergence bounds.

Additional hyperparameters ($\theta$ and $e$)

Only two additional hyperparameters beyond the scaling factors are introduced in our algorithm. The first is the loss slope decay threshold $\theta$, which controls when LR upscaling attempts are triggered. Although it may be more straightforward to trigger the attempts at a fixed set of steps (or by other simple heuristic rules), $\theta$ allows smooth loss descent at a consistent velocity and ensures timely adjustment when the velocity decays. More importantly, $\theta$ operates solely on the loss descent velocity, which is independent of the training scenario or scale. We believe that $\theta$ is necessary for stable and adaptive LR search across various pretraining tasks.

The other hyperparameter, $e$, is the loss slope estimation error, which can indeed be ignored when following the workflow of the algorithm. However, we argue that it is necessary to consider this estimation error in the discussion of AdaLRS's convergence. As elaborated in Section 2.2.2, we prove that the hyperparameter $e$ controls the convergence accuracy, i.e., the error bound, of the algorithm. Taking the loss slope estimation error into account ensures the generalizability of our proof of AdaLRS's convergence.

Comment

Dear Reviewer 9vf5:

We would like to thank you for carefully reviewing our paper and for the comments. Please kindly let us know if anything is unclear. We truly appreciate this opportunity to clarify our work and would be most grateful for any feedback you could give us.

Best regards,

Authors

Official Review (Rating: 4)

The paper introduces AdaLRS, a novel adaptive learning rate (LR) search algorithm specifically designed for efficient pretraining of foundation models (such as large language models (LLMs) and vision-language models (VLMs)). The authors provide theoretical proofs for convergence and geometric error decay of the AdaLRS algorithm, and also conducted extensive experiments on LLM (Qwen2.5 models) and VLM (SAIL-VL) pretraining tasks, demonstrating that AdaLRS consistently finds optimal learning rates across various initial conditions and model sizes.

Strengths and Weaknesses

The work addresses a significant practical challenge: efficiently finding optimal learning rates without extensive tuning or costly experiments. AdaLRS provides a practical, plug-and-play solution applicable across different model architectures. The work also provides theoretical proofs and analysis of convergence and geometric error decay, strengthening confidence in the method's ability to accelerate training convergence. One limitation is that, without statistical validation (a statistical significance analysis), it is unclear whether the reported improvements over baselines are statistically significant.

Questions

The authors attribute the poor performance of AdaLRS under large initial learning rate (LR) settings to disruptive parameter updates. It is unclear why the baseline methods, which use the same large initial learning rates, do not suffer from these disruptive parameter updates. Could the authors please explain this discrepancy?

Limitations

Yes

Final Rating Justification

Many thanks for the authors' rebuttal comments. They have addressed some of my concerns, but not all issues have been fully resolved. I have increased my rating regarding the paper's quality.

Formatting Issues

N/A

Author Response

We appreciate the reviewer's useful suggestions. Please allow us to clarify our methodology and conclusions point by point.

1. Statistical confidence analysis

Since large-scale foundation model pretraining is resource-consuming, we opt to conduct repeated experiments at the 8M-sample scale for 2B LLM pretraining for the statistical confidence analysis. We use 5 different random seeds for model training and report the final pretraining losses.

| Experiment | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Pretraining Loss | 4.721 | 4.714 | 4.734 | 4.733 | 4.713 |

The mean and standard deviation of the pretraining losses are 4.723 and 0.009, respectively. As shown in Table 2 of our paper, the training loss improvements of the AdaLRS experiments all exceed the 3σ range, except for the Fit LR experiment for 7B LLM pretraining. These results indicate that the pretraining loss improvement of AdaLRS is statistically significant.
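The statistics above can be reproduced directly from the table (a minimal check; the 3σ interpretation assumes approximate normality of the seed-to-seed variation):

```python
import numpy as np

losses = np.array([4.721, 4.714, 4.734, 4.733, 4.713])  # five random seeds

mean, std = losses.mean(), losses.std()        # population std, as reported
print(f"mean = {mean:.3f}, std = {std:.3f}")   # -> 4.723, 0.009
print(f"3-sigma band: [{mean - 3 * std:.3f}, {mean + 3 * std:.3f}]")
# A run whose final loss falls below mean - 3*std (~4.696) improves on the
# baseline by more than three standard deviations of the seed noise.
```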

2. Disruptive parameter updates in large LR experiments

The disruptive parameter updates are introduced by the large LR setting rather than by the AdaLRS algorithm, so the large LR baseline models indeed suffer from similar performance degradation. As shown in Figure 2 and Tables 2 and 3 of our paper, the large LR baseline experiments converge to relatively high pretraining losses and show degraded evaluation results. Despite this, AdaLRS still reaches the optimal learning rates and improves pretraining convergence significantly, demonstrating its effectiveness in optimal LR search.

Comment

Dear Reviewer 5CYH:

We would like to thank you for carefully reviewing our paper and for the comments. Please kindly let us know if anything is unclear. We truly appreciate this opportunity to clarify our work and would be most grateful for any feedback you could give us.

Best regards, Authors

Official Review (Rating: 5)

The paper introduces AdaLRS, a plug-and-play algorithm for online adaptive learning rate search in foundation model pretraining. Unlike existing methods that rely on offline tuning or proxy models, AdaLRS optimizes the loss descent velocity to adjust learning rates dynamically during training. The authors provide theoretical proofs of convergence and geometric error decay and demonstrate the algorithm’s effectiveness across diverse settings—covering both LLMs and VLMs, varying model sizes, training paradigms, and learning rate schedules. Experiments show AdaLRS can significantly accelerate convergence, improve model performance, and generalize well across tasks.

Strengths and Weaknesses

Strengths

  1. The paper is well-written and easy to follow.
  2. The method is simple but effective.
  3. The experiments are comprehensive, covering both LLM and VLM backbones.
  4. The authors provide theoretical analysis along with good empirical performance.

Weaknesses

  1. The method introduces more hyper-parameters (i.e., upscaling factor α, downscaling factor β, and decaying factor γ). The authors did not provide sensitivity experiments on these hyper-parameters, so it remains unclear whether they still need to be tuned. If so, the method still requires extensive hyper-parameter tuning.

Questions

The experiments focus primarily on training loss, with less emphasis on downstream or task-specific metrics. Although loss and accuracy are strongly correlated, they may not align perfectly with final task-specific metrics in some applications.

Limitations

Yes.

Final Rating Justification

The authors have addressed my concerns during the rebuttal and I will keep my score.

Formatting Issues

No.

Author Response

We appreciate the reviewer's insightful feedback. Please allow us to address the reviewer's concerns point by point.

1. The choice of hyperparameters

The default setting $\alpha=3$, $\beta=2$, and $\gamma=0.99$ in the paper was chosen for simplicity, without any human labor spent on hyperparameter tuning. To validate the robustness of AdaLRS, we further conduct experiments under various hyperparameter settings. Starting from a 2e-5 LR for 2B LLM pretraining with the WSD scheduler, we show the LR dynamics of the 64M-sample pretraining experiments:

| Experiment | $\alpha$/$\beta$/$\gamma$ | LR trajectory (Step/LR/Scale) | Test PPL |
| --- | --- | --- | --- |
| Baseline | - | 100/2e-5/- | 43.77 |
| A | 3/2/0.99 | 100/2e-5/- → 20100/6e-5/↑ → 28100/1.8e-4/↑ → 35100/3.6e-4/↑ → 42100/7.19e-4/↑ → 50100/3.6e-4/↓ | 16.94 |
| B | 2/1.67/0.99 | 100/2e-5/- → 20100/4e-5/↑ → 28100/8e-5/↑ → 36100/1.6e-4/↑ → 44100/3.2e-4/↑ → 51100/5.16e-4/↑ → 57100/3.1e-4/↓ | 18.79 |
| C | 1.5/1.43/0.99 | 100/2e-5/- → 20100/3e-5/↑ → 28100/4.5e-5/↑ → 36100/6.75e-5/↑ → 44100/1.01e-4/↑ → 52100/1.52e-4/↑ | 21.90 |
| D | 2/1.67/0.95 | 100/2e-5/- → 20100/4e-5/↑ → 28100/7.6e-5/↑ → 36100/1.37e-4/↑ → 44100/2.35e-4/↑ → 51100/3.09e-4/↑ | 19.58 |
| E | 2/1.67/0.9 | 100/2e-5/- → 20100/4e-5/↑ → 28100/7.2e-5/↑ → 36100/1.16e-4/↑ → 44100/1.7e-4/↑ → 51100/2.17e-4/↑ | 19.70 |

In this table, Exps Baseline and A correspond to the "Small LR" setting for 2B LLM pretraining in our paper, and Exps B and C are conducted with different LR scaling factor values. Exps D and E adopt smaller decaying factors, which suppress the "overshooting" behavior observed in Exp B. Overall, despite the varying hyperparameter settings, the learning rates converge robustly to the neighborhood of the optimum, 2e-4.

We also take the model checkpoints at step 60,000 (of 150,000 in total) to evaluate their test PPL, and the results show that AdaLRS achieves improved model performance over the baseline under all of these hyperparameter settings. We will include the complete loss dynamics figures and evaluation results in the revised version of our paper.

2. Downstream task performance

To investigate the generalization of AdaLRS's improved pretraining loss, we further conduct a lightweight 6M-sample SFT with the Infinity-Instruct dataset [1] for the 2B LLM experiments. We evaluate the SFT model performance on open-ended question-answering tasks, such as Alpaca-Gen and KNIGHT-Gen [2], which align with real-world LLM applications. Following previous work [2], we use Rouge-1, Rouge-2, and Rouge-L as evaluation metrics.

| Experiment | AlpacaGen Rouge-1 | AlpacaGen Rouge-2 | AlpacaGen Rouge-L | KNIGHTGen Rouge-1 | KNIGHTGen Rouge-2 | KNIGHTGen Rouge-L |
| --- | --- | --- | --- | --- | --- | --- |
| Small LR | 24.03 | 5.41 | 15.40 | 10.02 | 1.23 | 8.36 |
| + AdaLRS | 28.77 | 6.39 | 17.10 | 13.33 | 2.95 | 10.81 |
| Large LR | 8.49 | 0.41 | 6.09 | 2.41 | 0.11 | 2.02 |
| + AdaLRS | 10.14 | 0.52 | 7.11 | 4.92 | 0.11 | 3.96 |
| Fit LR | 29.99 | 7.66 | 18.56 | 14.02 | 2.78 | 11.29 |
| + AdaLRS | 28.49 | 6.71 | 17.29 | 12.65 | 2.61 | 10.13 |

As shown in the table, SFT models initialized from AdaLRS-pretrained checkpoints achieve significantly higher benchmark scores than the baseline models. For fit LRs, the AdaLRS-pretrained model suffers a slight performance degradation, which coincides with the pretraining loss shown in Table 2 of our paper.

These results demonstrate the performance generalizability of LLMs pretrained with AdaLRS. For VLM experiments on downstream benchmarks, we refer to Table 3 in our paper for more details.
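For reference, ROUGE scores of the kind reported above can be computed with the open-source `rouge-score` package (an assumption about tooling on our part; the paper's exact evaluation harness is not specified):

```python
# pip install rouge-score
from rouge_score import rouge_scorer

# Score a single (reference, model output) pair; benchmark numbers are the
# per-sample F-measures averaged over the evaluation set.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=True)
scores = scorer.score(
    target="The capital of France is Paris.",           # reference answer
    prediction="Paris is the capital city of France.",  # model output
)
for name, s in scores.items():
    print(f"{name}: F1 = {s.fmeasure:.4f}")
```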

[1] Li, J., Du, L., Zhao, H., et al. Infinity Instruct: Scaling Instruction Selection and Synthesis to Enhance Language Models. arXiv preprint arXiv:2506.11116, 2025.

[2] Yang, D., Xiao, D., Wei, J., et al. Improving Factuality in Large Language Models via Decoding-Time Hallucinatory and Truthful Comparators. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025, 39(24): 25606-25614.

Comment

Thanks for the authors' thoughtful rebuttal. My concerns have been fully addressed.

Comment

We would like to thank Reviewer 6tFv for the careful and responsible reviewing of our paper. We will include the experimental results shown above in the revised version and extend our thanks for the useful reviews in the Acknowledgment section. Please kindly let us know if anything is unclear or could help improve the paper's quality.

Comment

Dear Reviewers,

As the discussion deadline approaches, please kindly review the authors’ responses and share your thoughts—unless you have already done so. Thank you for your engagement and support.

Area Chair

Final Decision

This paper proposes AdaLRS, a simple yet effective method for online adaptive learning rate search in foundation model pretraining. The method dynamically adjusts learning rates by monitoring loss descent velocity, with some theoretical evidence for its convergence. The experiments are comprehensive, covering both LLM and VLM pretraining under diverse settings, and reviewers generally agree that the approach is well-motivated, empirically validated, and has certain practical value.

Reviewers raised concerns regarding several aspects, including too many hyperparameters, sensitivity to hyperparameters, the statistical significance of improvements, downstream evaluation, and clarity of presentation. The rebuttal provided additional experiments and clarifications that largely addressed these issues, though some issues in writing remain. Overall, the strengths outweigh the weaknesses, and I recommend acceptance, while encouraging the authors to improve the readability of the paper in future revisions.