PaperHub
Rating: 7.3/10 · Poster · 4 reviewers
Reviewer scores: 4, 5, 4, 5 (min 4, max 5, std 0.5) · Confidence: 3.5
Novelty: 2.8 · Quality: 2.8 · Clarity: 3.3 · Significance: 2.5
NeurIPS 2025

Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29
TL;DR

We show that Schedule-Free methods effectively navigate the river structure of the loss landscape, enabling scalable language model training without decay schedules or extra memory.

Abstract

Keywords

Language Model, Pretraining, Schedule-Free, Learning Rate Schedules, Weight Averaging, Optimization Dynamics, Loss Landscape

Reviews and Discussion

Official Review
Rating: 4

This paper analyzes the schedule-free-optimizer (SFO) from DeFazio et al. (2024) through the perspective of the "river-valley loss landscape." Unlike with AdamW, performing a short decay phase on the SFO weights (from a well-tuned run) does not yield loss improvements, suggesting that a well-tuned SFO model is already "close to the river." Digging into the behaviour of SFO, they find that the simple weight average x_t lags behind the main y_t iterates when the SFO momentum is suboptimal. They reformulate the SFO derivation in order to show that the $\beta$ momentum term is effectively doing double-duty: it controls the effective smoothing across gradients, while also controlling the shape of how prior outputs are averaged. By decoupling these two effects via the introduction of an additional hyperparameter, some of the sensitivity of SFO to its hyperparameters appear to be ameliorated, including when the batch size is increased by a factor of 4.
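For reference, the vanilla Schedule-Free SGD update of Defazio et al. (2024), as recalled here (with $\gamma$ the step size; SF-AdamW additionally rescales the step by an RMSProp-style second moment), is

$$y_t = (1-\beta)\,z_t + \beta\,x_t, \qquad z_{t+1} = z_t - \gamma\,\nabla f(y_t), \qquad x_{t+1} = (1-c_{t+1})\,x_t + c_{t+1}\,z_{t+1}, \qquad c_{t+1} = \tfrac{1}{t+1},$$

so gradients are taken at the interpolated point $y_t$ while $x_t$ maintains a running average of the $z$ iterates, and $\beta$ enters both the gradient point and, through the reformulation discussed above, the effective averaging of past iterates.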

Strengths and Weaknesses

Strengths

Identifying the key issues with SFO and how these can be mitigated is valuable and well-motivated.

For the most part, the work carefully stepped through the theory, providing accompanying empirical results at each step, in a manner that was both rigorous and instructive.

Both the figures and the writing in each section were fairly clear and easy-to-digest.

Weaknesses

The main issue is that the work seems to be a bit incomplete and preliminary, and it felt like the later sections were hastily written and a bit disorganized.

The paper essentially culminates in a proposal for a refined version of SFO. But the section evaluating this refined version doesn't really provide enough details for the reader to understand whether this refined version is worth adopting. The main disadvantage of introducing a new hyperparameter is, of course, that this introduces new tuning costs at small scale, and questions of HP transfer across scales for training large models. If we view the paper as, rather than providing actionable insights and new tools for practitioners, just hypothesis-driven empirical insights, then I would have liked to see more experiments that evaluate the refined method across further settings. E.g., does the optimal C change systematically with batch size? Is having x_t track y_t sufficient for good models, or only necessary?

  • e.g., it’s a little hard to compare the results in Figure 8 since the y-axis has a different scale in Left vs. Middle, but I'm left wondering if C really does "fix" SF-AdamW? Yes, x_t catches up to y_t, but if y_t is still below what could be achieved with an optimal β1 (or with a simple Cosine-10%-decay schedule), then SF-AdamW is apparently still sensitive to these HP values. What should the reader take away here, and what else is worth checking?

Other empirical concerns:

“We perform sweeps over the learning rate, momentum parameters, and the refinement parameter C (for refined SF-AdamW).”

  • But how are you evaluating the results across the sweep? On the validation set? It’s totally expected that if you introduce a new hyperparameter and you then “fit” this hyperparameter on the validation set, you can do better than if you didn’t introduce this hyperparameter. At the very least, we should evaluate on a held-out set. But even beyond that, we should evaluate that a fit at one scale provides a good model at a larger scale, since we can’t actually afford to fit C at every scale.

In the appendix, line 796: “For large-batch runs with cosine decay, the learning rate is annealed to 10% of its peak.”

  • I’m a little concerned that this Cosine-10x decay is a bit of a “straw man”. E.g., Bergsma 2502.15938 showed that decaying to 0 was superior to 10% decay across a range of pre-training runs --- and gains increase at higher tokens-per-parameter. Given that you trained to ~48TPP, I would expect decaying to zero to be superior. DeFazio used Linear as the baseline, also see their section B.1. Moreover, to be fair to Cosine, you probably also need to adjust some of its HPs as you scale the batch size.

Regarding the paper's main contributions:

  • "We begin by revisiting two widely used strategies" -- is this really a main contribution?
  • "Reveal" that SF performs a form of weight averaging -- I think you need to be a lot more precise here because we know that SFO performs a form of weight averaging -- that's exactly what it says on the box! Maybe you can specify exactly what you show that hasn't been shown previously. It would have also been helpful if you could make clear exactly what Morwani et al showed versus what this paper shows -- I wasn't familiar with that work, but your paper hints that they touch on a very similar topic.

Regarding prior work

For a full description of the advantages and disadvantages of prior work, you might note that weight averaging does not necessarily need to consume extra memory. E.g., in DeepSeek-V3:

Exponential Moving Average in CPU. During training, we preserve the Exponential Moving Average (EMA) of the model parameters for early estimation of the model performance after learning rate decay. The EMA parameters are stored in CPU memory and are updated asynchronously after each training step. This method allows us to maintain EMA parameters without incurring additional memory or time overhead.
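For concreteness, a minimal PyTorch-style sketch of the CPU-resident EMA idea (illustrative only; the function names are ours, and DeepSeek-V3's asynchronous implementation is more involved):

```python
import torch

@torch.no_grad()
def init_cpu_ema(model):
    # One CPU-resident copy of the parameters; no extra GPU memory is used.
    return {n: p.detach().cpu().clone() for n, p in model.named_parameters()}

@torch.no_grad()
def update_cpu_ema(model, ema, decay=0.999):
    # Called after each optimizer step; with pinned memory the device-to-host
    # copy can overlap the next training step (the "asynchronous" part).
    for n, p in model.named_parameters():
        ema[n].mul_(decay).add_(p.detach().to("cpu", non_blocking=True), alpha=1.0 - decay)
```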

Paper organization

  • Section 2.2 directly re-states things about WSD that were mentioned in 2.1.
  • I felt like I had to flip between Section 4.4 and Section 5 -- I'm not sure, but shouldn't the unnumbered equation for the alphas just go back with the unnumbered equation for the x_t iterate you have above on the page?
  • Line 262: Please define P when you introduce it - or maybe introduce this term after Line 271?

Writing

Section 5: I felt like this section was more confusing than it should be. Like, could we start with, “if large β has all these benefits in terms of keeping us in the valley, why not just use a large β all the time”? I don't have specific suggestions, I just found this hard to follow. Even the point that “large β slows y_t updates, as both βc_{t+1} and (1 − β) in (SFy) become vanishingly small when t is large" – isn’t the point that (1-β) becomes small and we pay less attention to recent gradients, but if batch sizes are larger, we actually want to pay more attention to recent gradients? – I.e., the scale of DATA that we pay attention to should be similar. I don't know why it matters that later gradients count less at higher steps... or, if that is important (which it might be), then this sentence is carrying way too much weight to explain all of this.

Minor

  • WSD “allows dynamic adjustment of HPs across phases” But doesn’t Cosine as well? Which HPs? E.g., it's not "allowed" with a Cosine schedule to change the batch size or LR halfway through, or not common? I’m not following.

Line 37 -- when you mention the "workaround" of weight averaging, it would be good to have some citations here or at least a forward pointer to your related work (so we know what you have in mind exactly by weight averaging when you introduce it).

124M model trained on 6B tokens, so that's 48.4 TPP. You say "compute budget determined by Chinchilla," but isn't the rule-of-thumb for compute-optimal training 20 TPP?

Typos: Line 263: “We first the stability” Line 276: “with with”

Questions

See weaknesses above.

Limitations

I feel the paper could have benefited from a detailed limitations section, summarizing the "simplifying assumptions" that were made in the theory, and noting the specific items that were not explored in the paper, but that might have been given more compute:

  • other optimizers
  • larger-scale models (both as noted in the conclusion)

But also, maybe:

  • testing the scaling of the C hyperparameter
  • testing a range of different batch sizes
  • testing other vocabulary sizes, model architectures, datasets, etc.

You might also acknowledge the disadvantages of introducing another hyperparameter, and how these might be mitigated.

In short, I'd be curious what the authors think that people training large-scale models should test before adopting the refined optimizer proposed in this paper.

Final Justification

The authors have addressed some of my concerns regarding quality (experimental rigor) and clarity, so I have revised my score. I'm still not sure about the overall significance. But overall, this is exactly a "technically solid paper where reasons to accept outweigh reasons to reject."

Formatting Issues

None

Author Response

We appreciate your thoughtful feedback. We believe the core of the concern arises from a slight misalignment between the primary focus of our paper and the review’s emphasis on Section 5, which treats the work as if its main goal were to propose a new optimizer.

As reflected in our title, “Through the River: Understanding the Benefit …”, our central contribution is a principled analysis of SF dynamics, grounded in the river-valley loss landscape framework. This focus spans the majority of the paper (Sections 2–4), where we present both theoretical and empirical insights into the behavior, strengths, and limitations of SF methods. The refined method introduced in Section 5 is intended as a concrete illustration of this framework—a proof-of-concept that addresses a diagnosed failure mode—rather than the core contribution of the work.

We acknowledge that the refinement could appear preliminary if evaluated in isolation as a standalone optimizer. However, its purpose is to demonstrate how insights from our analysis can inform practical improvements. With this context in mind, we respond to the reviewer’s specific points below.

Contribution and Evaluation of Refined SF method

the section evaluating this refined version doesn't really provide enough details

The refined SF method was introduced to empirically test a key prediction from our analysis: that the coupling between averaging and momentum in the vanilla SF update leads to degraded performance of x_t relative to y_t under small β1 (Figure 5). By introducing the decoupling parameter C, we isolate these effects, and the observed improvement of x_t over y_t (Figure 8) provides empirical support for our hypothesis. While the primary goal of the refined method is diagnostic, these results show how our analysis informs a practical path to improving SF in regimes where they currently struggle.

The main disadvantage of introducing a new HP is ... tuning costs

To address this, we conducted sensitivity analysis of C, reported in Tables 2 and 3 of our response to Reviewer kaR1, Q4. The results show that refined SF-AdamW is robust to the choice of C: it consistently outperforms vanilla SF across a broad range of values, reducing the practical cost of tuning.

Does the optimal C change systematically with batch size? Is having x_t track y_t sufficient for good models, or only necessary?

A systematic batch size ablation is an important future direction. Our hypothesis is that larger batch sizes, which reduce gradient noise, would favor larger optimal C values that prioritize recent iterates. However, we refrain from making strong claims without direct evidence. Due to computational constraints—and since the refined method is not the paper’s main focus—we did not perform these scaling experiments. We believe it would be valuable to conduct a study akin to Fig. 2 in [1], measuring the critical batch size of the refined SF method.

Having x_t track or outperform y_t is a necessary but not sufficient condition for good performance. A good averaging scheme may improve x_t, but the quality of the underlying momentum iterates y_t is also important.

It’s a little hard to compare the results in Fig.8. I'm left wondering if C really does "fix" SF-AdamW?

Our intention is not to claim that C universally “fixes” SF-AdamW, but rather that it mitigates a specific failure mode where x_t lags behind y_t due to entangled averaging. The differing y-axis scales were chosen to highlight the x_t vs. y_t gap within each subplot, not to compare across momentum values.

Other empirical concerns

But how are you evaluating the results across the sweep? On the validation set?

Yes, our evaluation was conducted on the validation set, which is standard in LM pretraining where, under the single-pass setup, prior work has shown there is no generalization gap (see [2]).

To rigorously address this concern, we evaluated on a held-out test set (9M-token subset of SlimPajama) with HP sweep of refined method (see Tables 2 and 3). The performance trends on the test set closely match those on the validation set, confirming that our conclusions are not due to overfitting.

Cosine-10x decay is a bit of a “straw man”

We conducted an HP sweep for AdamW with cosine decay to 0, evaluated on a held-out test set:

Table 4: AdamW with Cosine Decay to 0 (Batch Size 2M, Test Perplexity)

| max LR | 5e-4 | 1e-3 | 2e-3 | 5e-3 |
|---|---|---|---|---|
| $(\beta_1,\beta_2)=(0.9, 0.95)$ | 27.26 | 24.29 | 23.21 | 23.58 |
| $(\beta_1,\beta_2)=(0.95, 0.99)$ | 27.26 | 24.44 | 23.42 | 24.05 |

These results show that well-tuned AdamW with cosine decay to 0 slightly outperforms refined SF-AdamW (Table 3). Nonetheless, our refinement significantly narrows the gap between vanilla SF-AdamW and AdamW with optimal cosine decay in the large-batch regime. We appreciate your insightful comment and will clearly discuss this result in the revised manuscript.

Regarding the paper's main contributions

"We begin by revisiting two widely used strategies" – is this really a main contribution?

Section 2 is indeed a review, not a novel contribution. Its purpose is to contextualize existing strategies through the lens of scalable training and the river-valley framework, setting the stage for the subsequent analysis. We will revise the main contributions list to reflect this.

"Reveal" that SF performs a form of weight averaging – you need to be a lot more precise here

The original SF method [3] indeed frames x_t as a uniform average of the z iterates. Our contribution is to show an alternative and more informative reformulation: x_t can be viewed as a weighted average of y iterates, where y follows a momentum‑like update.
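To spell out the contrast: with the uniform weights $c_{t+1} = 1/(t+1)$, unrolling the vanilla update $x_{t+1} = (1-c_{t+1})\,x_t + c_{t+1}\,z_{t+1}$ gives

$$x_t = \frac{1}{t}\sum_{i=1}^{t} z_i,$$

which is exactly the uniform average over $z$ iterates stated in [3]; the reformulation above instead writes $x_t$ as a non-uniform weighted average of the $y$ iterates, with weights governed by the momentum parameter.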

This perspective aligns with [4], who interpret SF‑SGD as a weighted average of accelerated SGD, but our work extends it in several ways:

  1. We analyze SF-AdamW, which has not been studied in this context.
  2. We connect the reformulation to the river-valley perspective, showing that y tracks the valley floor while x acts as a smoothed average.
  3. We observe that EoS arises specifically at y iterates, not at x or z.

These observations support our proposal that y should be viewed as the primary optimization path, with x serving as its weighted average. We will revise the text to clearly distinguish our contributions from prior work.

Regarding prior work

weight averaging does not necessarily need to consume extra memory, e.g., in DeepSeek-V3

We clarify as follows:

  • Even if DeepSeek‑V3’s implementation is fully optimized, tracking the EMA requires extra CPU memory, so it is not strictly correct to say it “consumes no extra memory” unless the claim is limited to GPU memory.
  • While techniques exist to reduce GPU memory use, in practice they demand non‑trivial engineering efforts, depend on training setup, and still incur overhead from CPU allocations, extra compute, and CPU–GPU synchronization. The degree of overhead also varies with hardware.

Therefore, we believe our original statement that weight averaging often incurs non-negligible practical overhead is a fair characterization of the trade-offs involved.

Paper Organization

Thank you for the suggestions. We will revise the paper accordingly.

Writing

Section 5: “if large β has all these benefits in terms of keeping us in the valley, why not just use a large β all the time”?

Our analysis shows that larger β reduces oscillation variance, keeping y_t closer to the river floor, but does not imply that larger β always improves performance.

  • For y_t, performance depends on both oscillation magnitude and effective "speed" along the river, requiring a balance when tuning β.
  • For x_t, β also controls averaging weights, and the optimal weighting is task‑dependent.

Thus, β must be tuned to jointly account for its dual effects on x_t and y_t.

“large β slows yt updates,..." – isn’t the point that (1-β) becomes small and we pay less attention to recent gradients, but if batch sizes are larger, we actually want to pay more attention to recent gradients?

Our original argument highlighted that as $\beta \approx 1$ and $t \to \infty$, both $\beta c_{t+1}$ and $(1-\beta)$ in (SFy) become small, slowing updates of $y_t$. We also agree with your perspective: large $\beta$ places less emphasis on recent gradients, potentially degrading $y_t$ performance in large-batch regimes. Conversely, a larger $\beta$ is beneficial for $x_t$ due to its narrower averaging window. We will clarify this trade-off in the revised manuscript, incorporating your insights.

Minor

WSD “allows dynamic adjustment of HPs across phases” But doesn’t Cosine as well?

WSD decouples the constant-LR "trunk" from temporary "decay" branches, enabling one to experiment with phase-specific HPs (e.g., decay shape) on new branches without altering the main run. In contrast, a monolithic cosine schedule binds LR to the entire training horizon, so testing a different decay profile requires a full restart.
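To illustrate the branching structure (a hedged sketch; names and defaults are ours, not the paper's), a WSD-style schedule isolates the decay phase in its own parameters, so new decay branches can be launched from any trunk checkpoint:

```python
def wsd_lr(step, peak_lr, warmup_steps, decay_start, decay_steps, final_frac=0.0):
    """Warmup-Stable-Decay LR: linear warmup, constant trunk, then a decay branch.

    Different branches (decay_steps, final_frac, or decay shape) can be run from
    the same trunk checkpoint without restarting the main run, unlike a
    monolithic cosine schedule tied to the full training horizon.
    """
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    if step < decay_start:
        return peak_lr
    frac = min(1.0, (step - decay_start) / max(1, decay_steps))
    return peak_lr * (1.0 - frac * (1.0 - final_frac))
```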

Line 37-when you mention the "workaround" of weight averaging ... forward pointer to your related work

We will add a pointer to Section 2.3.

124M model trained on 6B tokens, so that's 48.4 TPP

Our pre-training runs were set to match the Chinchilla-optimal compute scale. For our main experiments, 124M model was trained for 5k iterations on 2.5B tokens, consistent with this scale (see Appendix B).

Typos: Line 263, Line 276

Thank you for catching these. They will be corrected.

Limitations

I feel the paper could have benefited from a detailed limitations section

We will add a "Limitations" section outlining theoretical assumptions and the scope of our experiments.


[1] Zhang et al., How Does Critical Batch Size Scale in Pre-training?, ICLR 2025.

[2] D’Angelo et al., Why Do We Need Weight Decay in Modern Deep Learning?, NeurIPS 2024.

[3] Defazio et al., The road less scheduled, NeurIPS 2024.

[4] Morwani et al., Connections between schedule-free optimizers, ademamix, and accelerated sgd variants, 2025.

Comment

I had a look at the other reviews, my original review, the rebuttals, and again at the paper. Overall, experimentally, this is good, careful work, while the overall impact is more a matter of opinion. The authors have provided new experiments, which should strengthen the paper. In particular, they have addressed some of my own concerns. I will revise my score.

Regarding tuning hyperparameters on the validation set:

  • You say, this "is standard in LM pretraining" and "prior work has shown there is no generalization gap"
  • This isn't very compelling to me: there's no generalization gap until there is a generalization gap. And one way we might create a generalization gap is introducing new hyperparameters and specifically tuning them on the validation set!
  • So, there's no "presumption of innocence" here: I mean, we should at least establish that new hyperparameters do generalize before relying on that fact, you know what I mean? And even then, I don't think I would rely on it :)
Comment

Thank you for your thoughtful reconsideration and valuable suggestions. We will revise our paper to clarify your comments on hyperparameter tuning and to incorporate the new experiments. We greatly appreciate your detailed and insightful feedback.

Official Review
Rating: 5

The paper analyzes Schedule-Free (SF) optimization using different theoretical and empirical frameworks, including River-Valley landscapes, Edge of Stability, central flows, and a toy model. The authors analyze two limitations of SF optimization: (1) sensitivity to the momentum parameter $\beta$, and (2) behavior at large batch sizes.

Main contributions are summarized below:

  1. The authors show that SF-AdamW can effectively follow the optimal trajectory along the river direction without explicit learning rate decay or weight decay. In particular, the iterate $y_t$ faithfully follows the river direction, while $x_t$ may not at suboptimal $\beta$ values.
  2. They characterize the dynamics of SF using the central flow framework. The resulting equations shed light on why suboptimal $\beta$ values can steer training away from the river direction.
  3. Analysis of SF on a toy model also sheds light on the effect of a suboptimal $\beta$ parameter.
  4. They derive the Edge of Stability thresholds of SF-GD and SF-AdamW.
  5. They introduce a new SF variant that aims to address momentum sensitivity and large-batch performance by decoupling the momentum for the $y_t$ iterate from the EMA averaging of $x_t$ through an additional hyperparameter $C$.

I am not very familiar with prior works on understanding SF. I will rely on other reviewers for my judgment on new contributions to the understanding of SF.

Experimental setup: ~100M Llama/GPT-style Transformers pre-trained on SlimPajama/OpenWebText up to the Chinchilla-optimal budget.

Strengths and Weaknesses

Strengths

  1. Strong theoretical and empirical analysis connecting to prior work, including (1) the River-Valley Landscape, (2) Edge of Stability, (3) Central Flows.
  2. Well-motivated problem statement: addressing the limitations of SF optimization to provide a scalable alternative to conventional pretraining strategies, which use learning rate schedules.
  3. Insights into the sensitivity of the SF method to the momentum parameter and fixes.

More generally, I think the paper significantly contributes to our understanding of SF optimization.

Weaknesses

  1. I am unsure about the implications of the large batch size results. While the authors aim to address the large batch size limitation of SF, they do not perform an analysis on the effect of large batch size, and much of the paper analyzes the effect of the momentum coefficient. The only exception is Figure 8, where they show that SF-AdamW matches AdamW's performance. Furthermore, a batch size of 2M tokens is only 4 times larger compared to the default setting of 0.5M tokens.
  2. Introduces a tunable hyperparameter $C$. It's unclear how sensitive SF is with respect to $C$ in training regimes beyond pre-training. Furthermore, an analysis of the hyperparameter $C$ would be helpful.
  3. In Section 3, the authors are implicitly assuming the River Valley Landscape in SF optimization. While the weight averaging and LR decay experiments suggest the existence of the River-Valley Landscape, I believe interpolation between different iterates along the valley and river directions (as in the original River Valley Landscapes) would be helpful.

Questions

  1. I think the Central Flow can be used to explain why the iterate $x_t$ deviates from the river direction at suboptimal $\beta$. This result would complement the paper's analysis.
  2. I have a basic question about SF. If $x_t$ deviates from the river direction and $y_t$ does not, then why does it matter for the overall performance? This suggests that one can always evaluate the loss at the iterate $y_t$, and it should perform well. Have the authors tested this?
  3. What's the motivation behind the toy model? The first term can be viewed as a two-layer network with one training example. What's the intuition behind the second term? How does a term like that appear in practice?
  4. How does SF use the same memory as AdamW? It has four variables $x_t, y_t, z_t, v_t$. Is it because of the reformulation in Section 4.4?

Limitations

  • Experimental results are limited to two pre-training setups at ~100M scale trained up to Chinchilla optimal. Please note that I am not saying that the authors should perform extensive experiments at large scales. Rather, mentioning the setup sets the scope of the experiments during the review process.
  • (Minor) The visualizations can be improved: authors can use contrastive colors for the landscape and trajectories.
  • Please see weaknesses.

Final Justification

The authors' rebuttal has addressed my major concerns. While a few concerns regarding the iterates and their role in dynamics remain, I hope these will motivate future works. As my original rating was already high, I will be maintaining my score.

Formatting Issues

I did not notice any formatting issues.

Author Response

Thank you for your valuable feedback! We appreciate that you found our work significantly contributes to our understanding of SF optimization. We would like to address your questions as below.

W1: Implications of the Large Batch Results

While the authors aim to address the large batch size limitation of SF, they do not perform an analysis on the effect of large batch size.

We acknowledge that while our paper aims to address the large-batch limitation of SF, our experimental focus is primarily on analyzing the effect of momentum rather than batch size. This choice was motivated by the fact that the role of momentum in SF methods remains relatively underexplored, and our goal was to address this gap. In contrast, the impact of batch size on SF performance has already been studied—most notably by [1], who conducted scaling experiments and identified the critical batch size at which vanilla SF methods begin to degrade.

Rather than replicating those experiments, we sought to build on their findings by providing intuition for why vanilla SF fails in large-batch regimes. Our analysis suggests that the failure stems from the entanglement between momentum and the weight averaging window, which motivated our refinement to decouple them.

We agree with the reviewer that our current large-batch results for the refined SF method are preliminary. Due to computational constraints, and because proposing the refined method is not the main focus of our paper, we did not conduct a full batch size ablation. We will revise the text to avoid overclaiming the effectiveness of the refined method in large-batch settings.

[1] Zhang et al., How Does Critical Batch Size Scale in Pre-training?, ICLR 2025.

W2: Analysis of HP C

Introduces a tunable HP C. It's unclear how sensitive SF is with respect to C in different training regimes beyond pre-training.

We agree that introducing a new HP entails tuning cost. To assess this, we conducted an additional sweep over C values, reported in Tables 2 and 3 of our response to Reviewer kaR1, Q4. We observe that performance remains stable across a wide range of C.

W3: Assuming River Valley Landscape

I believe interpolation between different iterates along the valley and river directions would be helpful.

We appreciate the suggestion and would like to clarify our position. Our work builds upon the river-valley landscape as a modeling framework that has been established in recent literature. This assumption is not specific to any particular optimizer, as the river-valley refers to the geometry of the loss landscape itself, rather than the dynamics of a specific optimization method.

Interpolation-based visualizations have already been presented in prior works—for example, Fig. 7 in [2] and Fig.2 in [3]—demonstrating that modern deep learning loss landscapes often exhibit a convex, valley-shaped structure. In our own setup, we independently verified this landscape structure through interpolation between iterates and observed the same trend (see Table 1 of our response to Reviewer kaR1, Q1&Q2).

In addition, we would like to emphasize that the river-valley landscape is further supported by several well-established empirical signatures:

  • The sharp loss drop during LR decay.
  • The EoS phenomenon, where iterates bounce between the steep walls of the valley.
  • The Central Flow, which captures the trajectory along the valley floor.

Further supporting evidence comes from [4], who show that optimization updates projected onto the river direction (i.e., high-curvature hill directions removed) preserve learning efficacy. This implies that motion in the hill directions is often unnecessary once the iterates are near the valley floor.

We will revise the text to include these clarifications and highlight that we also validated the river-valley structure in our setting.

[2] Wen et al., Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape View, ICLR 2025.

[3] Belloni et al., Universal Dynamics of Warmup Stable Decay: understanding WSD beyond Transformers, ICML 2025 Workshop on HiLD.

[4] Song et al., Does SGD really happen in tiny subspaces?, ICLR 2025.

Q1: Is Central Flow Analysis of x_t Possible?

I think the Central Flow can be used to explain why the iterate x_t deviates from the river direction at suboptimal β.

We believe that applying central flow analysis to the x iterate is not appropriate. Central flow is intended to model the time-averaged optimization trajectory at the Edge of Stability. Our observations indicate that only the y iterates operate at the EoS, exhibiting oscillations centered around the river. Therefore, we model the central flow based on the y iterates, which allows us to capture the oscillation variance in the hill direction away from the valley floor.

In contrast, the x iterates do not oscillate around the river and do not operate at the EoS. Consequently, central flow analysis does not provide a suitable explanation for the deviation of x_t from the river at suboptimal β. We appreciate the reviewer’s insight and would welcome suggestions on alternative frameworks for modeling this behavior.

Q2: x_t vs y_t in SF method

If x_t deviates from the river direction and y_t does not ... one can always evaluate the loss at the iterate y_t, and it should perform well.

It is not the case that the y iterates always outperform the x iterates. For optimal values of momentum (e.g., β = 0.95), we observe that x_t achieves better performance than y_t. This is because x_t corresponds to an "appropriate" weighted average of the y iterates, which can enhance performance when the averaging is well-balanced.

In contrast, when β is small, the weighting scheme places too much emphasis on early y iterates, leading to degraded performance of x_t. This issue motivated our introduction of the refined SF method, which decouples the averaging mechanism from the momentum parameter β. This decoupling allows for improved averaging regardless of the momentum setting. As shown in Figure 8 (left), the refined method enables x_t to consistently outperform y_t, even at smaller β values.

Returning to the original question: no, evaluating the loss at y_t alone does not always yield optimal performance. A well-designed weighted average over the y iterates is often necessary for achieving the best results.

Q3: The Toy Objective

What's the motivation behind the toy model?

Our design of the toy model is intended to construct a minimal model that captures the river-valley geometry while replicating key behaviors observed in language model training.

The first term, $(w_1 w_2 - 1)^2/2$, can indeed be viewed as the loss of a two-layer network on a single example. More importantly, it defines a river-shaped loss landscape, where the constraint $w_1 w_2 = 1$ defines the valley floor, and the curvature in the hill direction becomes sharper as $w_1$ increases. This structure is central to capturing the loss geometry observed in modern training settings.

The second term, $\log(1+\exp(-w_1))$, is included to create a directional bias along the valley floor so that the optimizer progresses in a consistent direction. The specific form of this term is not critical to the main phenomena we study and could be replaced by other forms (e.g., a linear term like $-w_1$) without qualitatively changing the behavior. Its role is simply to ensure that the loss decreases along the river as training progresses.
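Putting the two terms together, the toy objective under discussion is

$$\mathcal{L}(w_1, w_2) \;=\; \tfrac{1}{2}\,(w_1 w_2 - 1)^2 \;+\; \log\!\bigl(1 + e^{-w_1}\bigr),$$

whose valley floor is the curve $w_1 w_2 = 1$, with hill-direction curvature that sharpens as $w_1$ grows and a gentle bias that pushes the iterates along the river.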

While the exact form of the toy objective may not correspond to a practical loss function, that is not its purpose. Instead, it serves as a minimal model that exhibits rich dynamics consistent with those observed in realistic settings. Notably, it replicates:

  • How SF dynamics remain near the valley floor and how this behavior depends on the momentum parameter β.
  • The emergence of the EoS, specifically at the y iterates, which mirrors our findings in large-scale experiments.

The fact that this toy model reproduces such nontrivial behaviors confirms its utility as a theoretical tool. It provides insight into the mechanisms behind SF optimization that would be difficult to extract from large-scale models alone.

Q4: Memory Overload in SF-AdamW

How does SF use the same memory as AdamW? It has 4 variables x, y, z, v. Is it because of the reformulation in Section 4.4?

SF-AdamW does not require storing all four variables explicitly. In particular, y is computed on the fly from x and z, and therefore does not need to be stored as a separate tensor. This implementation detail is not specific to our reformulation and is already described in the original SF method (see Section 4.4 in Defazio et al.).

To clarify:

  • AdamW stores the model parameters x_t, the first-moment vector m_t, and the second-moment vector v_t.
  • SF-AdamW stores the model parameters x_t, the auxiliary parameters z_t, and the second-moment vector v_t.

SF-AdamW can be interpreted as SF method applied over RMSProp, just as AdamW can be viewed as momentum applied over RMSProp. In both cases, the optimizer maintains three parameter-sized tensors, and SF-AdamW does not incur any additional memory overhead compared to AdamW.

Limitations

Experimental results are limited to two pre-training setups at ~100M scale trained up to Chinchilla optimal.

We agree and will add a "Limitation" section to acknowledge this. Our experiment scale was intentionally chosen to make the training dynamics of SF methods more interpretable within our compute constraints, as our primary goal is to understand the underlying mechanisms of SF.

We acknowledge that validating these findings at larger scales and across more diverse architectures would further strengthen their generality. We view this as an important direction for future work, though it is beyond the scope of the current paper.

The visualizations can be improved

Thank you for the suggestion. We will revise the figures for better readability.

Comment

I thank the authors for their detailed responses, which have mostly resolved my concerns. As my initial score was already high, I will maintain my score.

Comment

Thank you for your careful feedback and positive evaluation. We sincerely appreciate your time and effort in reviewing our paper.

Official Review
Rating: 4

The paper "Through the River: Understanding the Benefit of Schedule-Free (SF) Methods for Language Model Training" proposes to study schedule-free optimizers through the lens of river-valley landscape. In particular, the authors conduct several experiments which result in several observations that could explain the loss curve in the view of river-valley landscape. A new refined SF method is also proposed to improve the SF-AdamW.

Strengths and Weaknesses

Strengths: The paper is clearly written and rich in experiments and observations. A new analysis is also proposed to improve the SF optimizer.

Weaknesses (please respond to the questions section directly): It is still unclear whether the evidence supports a river-valley landscape, or whether other landscapes could also produce the loss curves in Figs. 2 and 3; the theoretical understanding rests on a simple 2-dimensional toy objective, which may not be persuasive; the experiments on the refined optimizer are questionable.

Questions

  1. First, in my opinion, the river-valley landscape is an interesting observation that still lacks justification. In particular, how should this landscape be defined mathematically? In my humble opinion, it should be that at a point inside the valley, there is an eigendirection of the Hessian that decreases the function value the most (river), and an eigendirection along which the loss is nearly a convex quadratic (valley). If we don't quantify this, how are we supposed to really understand the loss curves in Section 3?
  2. Therefore, I'm not sure about all the experiments in Section 3. Figures 2 and 3 show that SF-AdamW does not improve with LR decay or EWA, and it is shown that AdamW benefits from LR decay; however, I'm not sure how this is related to the "river-valley" landscape. In particular, for a convex quadratic objective, you can readily see a faster decrease of the loss with a decaying LR if your initial LR is too large (which leads to oscillation), and I don't see where the river and the valley each play a role here. It is also shown that the SF method is sensitive to the momentum parameter, which again could be explained without referring to the river-valley landscape...
  3. Section 4 draws some observations from the toy objective. However, this is not even a stochastic optimization problem, and I question whether the observations carry over to large scale.
  4. For the refined optimizer, it may be less sensitive to the momentum parameters $\beta_1$ and $\beta_2$, but we have a new hyperparameter $C$ to tune. Might it still be sensitive to $C$?
  5. I almost forgot, but in line 160 the authors claim "Importantly, it achieves this without requiring additional memory overhead compared to AdamW", which I don't understand. To my knowledge, the SF method needs three sequences x, y, and z for each update iteration. You may be able to recover y from x and z, but you still have one extra sequence compared to standard SGD or Adam, which corresponds to having another copy of the weight matrix. How come there is no additional memory overhead?

In general, I do not feel confident enough in the results and observations of the work for it to be published at NeurIPS.

Limitations

See the questions section.

No potential negative societal impact.

Final Justification

The authors provide further evidence and discussion which resolve most of my concerns. I'd like to increase my evaluation score.

Formatting Issues

NA

Author Response

Thank you for your efforts and valuable feedback. Let us address your concerns and questions below. We look forward to discussing further if anything remains unclear.

Q1 & Q2: Justification of the River-Valley Landscape

The river-valley landscape lacks mathematical justification and its connection to the experiments in Section 3 is unclear.

We would like to clarify that our paper does not introduce the “river-valley” landscape as a novel proposal. Rather, we build upon an emerging framework established in recent peer-reviewed literature to analyze and interpret our findings.

The river-valley model was introduced by Wen et al. (2025) and further supported by Belloni et al. (2025). It is grounded in empirical observations, including the characteristic sharp loss drop during learning rate (LR) decay and direct measurements of a convex, valley-shaped loss profile (see Fig. 7 in Wen et al. and Fig. 2 in Belloni et al.). This framework also aligns with established theoretical concepts such as the “Edge of Stability,” where iterates bounce between valley walls, and the “Central Flow,” which corresponds to the low-curvature valley floor that the optimization trajectory follows.

Independent work by Song et al. (2025) provides complementary evidence. They explicitly measured Hessian eigenvalues and eigenvectors during training to identify high-curvature (“hill”) and low-curvature (“river”) directions. They showed that projecting optimizer updates onto the river direction—by removing components along hill directions—preserves learning efficacy. Their results suggest that once the iterate reaches the valley floor, motion along hill directions is unnecessary, while progress along the river remains essential.

We use this established framework to interpret the experiments in Section 3. The reviewer’s convex quadratic intuition is consistent with this view: under the river-valley lens, the sharp loss drop after LR decay arises because the initial LR is too large for the steep "valley walls"; decay enables convergence to the "valley floor".
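As a one-dimensional illustration of this intuition (ours, stated for plain gradient descent rather than AdamW): on a quadratic wall $L(w) = \tfrac{S}{2} w^2$ with sharpness $S$, gradient descent with step size $\eta$ gives

$$w_{k+1} = (1 - \eta S)\,w_k,$$

so the iterate oscillates without contracting once $\eta S \ge 2$; shrinking $\eta$ restores $|1 - \eta S| < 1$ and lets the iterate settle onto the wall's minimum, which is the mechanism behind the sharp loss drop upon decay.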

Our key empirical finding is that SF-AdamW does not exhibit such a sharp drop upon LR decay. Within the river-valley framework, this suggests that SF iterates remain already close to the valley floor throughout training, with the SF mechanism effectively keeping the trajectory in the river.

To provide direct justification in our own experimental setting, we measured the loss along linear interpolations between training checkpoints under the setting of Fig. 2 (SlimPajama, 0.5M batch size). We compared (1) AdamW with constant LR, (2) AdamW with linear LR decay to 0, and (3) SF-AdamW with constant LR, using checkpoints after 2B and 2.5B tokens. Results are reported in Table 1 below.

Table 1: Test loss evaluated at the linear interpolation $\alpha w_0 + (1-\alpha) w_1$ between two checkpoints (2B and 2.5B tokens)

| $\alpha$ | 0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AdamW (Constant LR) | 3.322 | 3.295 | 3.274 | 3.259 | 3.248 | 3.242 | 3.240 | 3.243 | 3.250 | 3.261 | 3.278 |
| AdamW (Linear Decay LR) | 3.322 | 3.297 | 3.275 | 3.255 | 3.239 | 3.224 | 3.212 | 3.204 | 3.198 | 3.195 | 3.195 |
| SF-AdamW (Constant LR) | 3.263 | 3.253 | 3.245 | 3.237 | 3.231 | 3.226 | 3.222 | 3.220 | 3.219 | 3.218 | 3.219 |

These new results clearly demonstrate three distinct dynamics, replicating the findings of prior work and supporting our hypothesis:

  • AdamW (constant LR) shows a convex, valley-shaped loss profile between checkpoints.
  • AdamW (linear decay to 0) shows a sharp, monotonic decline, consistent with an iterate transitioning from the valley wall down to the valley floor.
  • SF-AdamW (constant LR) shows a flat, slow decline, consistent with an iterate already moving along the low-curvature valley floor.

Notably, (1) and (2) replicate the loss profiles reported by Fig.7 in Wen et al. and Fig.2 in Belloni et al., while (3) further supports our claim that SF-AdamW closely tracks the river.

This combination of an established framework from the literature and our new, direct empirical evidence provides a solid foundation for our analysis using the river-valley landscape.
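For completeness, the interpolation evaluation behind Table 1 can be sketched in a few lines (a simplified illustration assuming PyTorch state-dict checkpoints and an `eval_loss` helper; not our exact evaluation code):

```python
import torch

@torch.no_grad()
def interpolation_losses(model, eval_loss, ckpt_a, ckpt_b, alphas=None):
    """Evaluate the loss along alpha * w_a + (1 - alpha) * w_b between two checkpoints.

    Assumes both checkpoints are float state dicts of the same architecture and
    that eval_loss(model) returns the average loss on a held-out split.
    """
    alphas = alphas if alphas is not None else [i / 10 for i in range(11)]
    w_a = torch.load(ckpt_a, map_location="cpu")
    w_b = torch.load(ckpt_b, map_location="cpu")
    results = []
    for a in alphas:
        mixed = {k: a * w_a[k] + (1.0 - a) * w_b[k] for k in w_a}
        model.load_state_dict(mixed)
        results.append((a, eval_loss(model)))
    return results
```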

References

Wen et al. (2025), Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape View, ICLR 2025.

Belloni et al. (2025), Universal Dynamics of Warmup Stable Decay: understanding WSD beyond Transformers, ICML 2025 Workshop on High-dimensional Learning Dynamics.

Song et al. (2025), Does SGD really happen in tiny subspaces?, ICLR 2025.

Q3: The Toy Objective

The observation is on a simple 2D toy objective... Can this carry over to large scale?

The toy objective in Section 4 is deterministic and simplified by design. This intentional simplification allows us to isolate and study how the optimizer interacts with the key geometric properties of the river-valley landscape without the confounding effects of stochasticity.

What validates this approach is our finding that this minimal model reproduces surprisingly rich phenomena that are observed in deep learning experiments. For instance, the toy objective accurately captures:

  • How schedule-free dynamics track the valley floor and how this behavior depends directly on the momentum parameter.
  • The emergence of the Edge of Stability phenomenon, specifically occurring at the $y_t$ iterates, which is a highly non-trivial dynamic.

The fact that such a minimal setup can replicate these complex behaviors makes it a powerful and insightful tool for understanding the fundamental mechanisms of schedule-free optimization, confirming its relevance beyond the simplified setting.

Q4: The Refined Optimizer and Hyperparameter C

We have a new hyperparameter C to tune. It might still be sensitive to C?

We appreciate this concern and directly address it with new sensitivity experiments conducted during the rebuttal period. Building on our main experiments on Slimpajama with 0.5M and 2M batch sizes, we swept over a range of values for C and evaluated test perplexity on 9M held-out tokens. The results are shown in Tables 2 and 3 below.

Table 2: Refined SF-AdamW, Batch Size 0.5M, $\beta_2=0.99$, LR 2e-3, Test Perplexity

|  | Vanilla | C=5 | C=10 | C=20 | C=50 | C=100 | C=200 | C=500 |
|---|---|---|---|---|---|---|---|---|
| $\beta_1=0.1$ | 67.20 | 37.06 | 35.80 | 36.62 | 36.15 | 37.27 | - | - |
| $\beta_1=0.5$ | 41.01 | - | - | 29.32 | 30.87 | 29.96 | 29.57 | - |
| $\beta_1=0.9$ | 27.70 | - | 27.70 | - | 23.97 | 24.64 | 24.93 | 25.11 |
| $\beta_1=0.95$ | 25.12 | - | - | 25.12 | 23.98 | 23.60 | 24.37 | 24.83 |

Table 3: Refined SF-AdamW, Batch Size 2M, $\beta_2=0.99$, LR 2e-3, Test Perplexity

|  | Vanilla | C=5 | C=10 | C=20 | C=50 | C=100 | C=200 | C=500 | C=1000 |
|---|---|---|---|---|---|---|---|---|---|
| $\beta_1=0.1$ | 110.8 | 54.80 | 47.37 | 44.31 | 43.80 | - | - | - | - |
| $\beta_1=0.5$ | 54.24 | 47.80 | 38.42 | 38.86 | 38.49 | 42.97 | - | - | - |
| $\beta_1=0.9$ | 31.34 | - | 31.34 | 29.86 | 27.68 | 28.16 | 27.02 | 27.88 | - |
| $\beta_1=0.95$ | 27.29 | - | - | 27.29 | 25.77 | 25.45 | 25.77 | 27.31 | - |
| $\beta_1=0.98$ | 26.09 | - | - | 30.23 | 26.09 | 25.49 | 24.51 | 23.95 | 23.88 |

Our findings show that refined SF-AdamW is robust to the choice of C: it consistently outperforms vanilla SF across a broad range of values. For example, in Table 2, settings such as $C=50, 100, 200, 500$ all improve over vanilla SF for $\beta_1=0.9, 0.95$, and $C=50, 100$ improve over vanilla SF across all momentum configurations. Similarly, in Table 3, refined SF outperforms vanilla SF over a wide range of $C$, with improvements persisting up to very large values.

These results indicate that refined SF-AdamW remains robust and effective across a wide range of C, reducing the cost of tuning this hyperparameter.

Q5: Memory Overhead

How can you claim no additional memory overhead compared to AdamW when SF methods require three sequences (x, y, z)?

We would like to clarify that our claim about SF-AdamW having no additional memory overhead compared to AdamW is accurate. The key point is that SF-AdamW can be interpreted as the Schedule-Free method applied over RMSProp, just as AdamW is momentum applied over RMSProp.

To be specific:

  • AdamW: Requires storing the model parameter (x_t), the 1st moment vector (m_t), and the 2nd moment vector (v_t).
  • SF-AdamW: Requires storing the model parameter (x_t), the auxiliary parameter (z_t), and the 2nd moment vector (v_t). Note that y_t is computed on the fly from x_t and z_t, and does not need to be stored separately.

Thus, both optimizers require storage for three full copies of the model parameters, and SF-AdamW does not introduce any additional memory overhead compared to AdamW. We will add this clarification in the next revision.
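A minimal sketch of one step (simplified: flat tensors, the uniform $1/(t+1)$ averaging weight, and no bias correction or weight decay; not our actual implementation) makes the storage point explicit, since only $x$, $z$, and $v$ persist between steps:

```python
import torch

@torch.no_grad()
def sf_adamw_step(x, z, v, grad_at, t, lr=1e-3, beta1=0.9, beta2=0.99, eps=1e-8):
    # y is formed on the fly and discarded, so the persistent state is x, z, v
    # (three parameter-sized tensors, matching AdamW's x, m, v).
    y = (1.0 - beta1) * z + beta1 * x
    g = grad_at(y)                              # stochastic gradient evaluated at y
    v.mul_(beta2).addcmul_(g, g, value=1.0 - beta2)
    z.sub_(lr * g / (v.sqrt() + eps))           # RMSProp-style step on z
    c = 1.0 / (t + 1)
    x.mul_(1.0 - c).add_(z, alpha=c)            # running average of the z iterates
    return x, z, v
```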

Comment

I thank the authors for their detailed responses. These partially address my concerns. Some remaining concerns and comments:

  1. I reflected further on my review and believe that my key concern is that it is not very clear what the scope of the paper is. From Section 2.4 to Section 4, the work focuses on analyzing the SF optimizer in different aspects: its ability to track the river under the river-valley landscape assumption; its sensitivity to momentum; the fact that it operates at the edge of stability (in the full-batch setting); etc. Then in Section 5 a remedy is proposed to tackle its sensitivity to momentum hyperparameters, especially for large batch sizes. To me the paper is advocating the SF optimizer and the robust remedy, and I still feel that more evidence is needed to support the new robust SF optimizer. For example, is it really robust for even larger models, and does the interpolation between two checkpoints (as in the additional rebuttal results) indicate that this new robust optimizer can track the river regardless of hyperparameter settings?
  2. The additional result of interpolating between two checkpoints seems interesting but very preliminary. Would this observation hold for every two consecutive checkpoints? Or can it only be observed in the late stage of training (when the training has already concentrated in the valley, as described by the river-valley landscape)?
  3. Can the authors elaborate more on how the EoS analysis (in Sec. 4.3) is connected to the river-valley landscape? Another thing I feel frustrated about is that the different sections of the paper seem a bit fragmented.

I know I'm requesting some extra experiments. You could share the numbers and I can visualize them myself. Let me know if these comments and additional experiments make sense to you.

Comment

Interpolating Two Checkpoints

The additional result of interpolating between two checkpoints seems interesting but very preliminary. Would this observation prevail for every two consecutive checkpoints? Or can it only be observed in the late stage of the training (when the training has already concentrated in the valley, as described in the river-valley landscape)?

In addition to Table 1 (interpolation between 2B and 2.5B tokens) provided in the rebuttal, we report interpolation results between earlier checkpoints: (0.5B, 1B), (1B, 1.5B), and (1.5B, 2B).

Table 4: Test loss evaluated at linear interpolation between 0.5B and 1B tokens

| $\alpha$ | 0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AdamW (Constant LR) | 3.896 | 3.840 | 3.792 | 3.747 | 3.702 | 3.658 | 3.614 | 3.575 | 3.543 | 3.523 | 3.521 |
| SF-AdamW (Constant LR) | 3.678 | 3.631 | 3.593 | 3.560 | 3.530 | 3.503 | 3.479 | 3.460 | 3.445 | 3.435 | 3.433 |

Table 5: Test loss evaluated at linear interpolation between 1B and 1.5B tokens

| $\alpha$ | 0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AdamW (Constant LR) | 3.521 | 3.483 | 3.453 | 3.428 | 3.408 | 3.393 | 3.382 | 3.377 | 3.376 | 3.380 | 3.391 |
| SF-AdamW (Constant LR) | 3.433 | 3.410 | 3.391 | 3.375 | 3.361 | 3.350 | 3.340 | 3.333 | 3.328 | 3.326 | 3.327 |

Table 6: Test loss evaluated at linear interpolation between 1.5B and 2B tokens

| $\alpha$ | 0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AdamW (Constant LR) | 3.391 | 3.361 | 3.337 | 3.319 | 3.305 | 3.296 | 3.292 | 3.293 | 3.297 | 3.307 | 3.322 |
| SF-AdamW (Constant LR) | 3.327 | 3.312 | 3.300 | 3.290 | 3.282 | 3.275 | 3.269 | 3.266 | 3.263 | 3.262 | 3.263 |

For interpolations at (1B, 1.5B), (1.5B, 2B), and (2B, 2.5B, reported earlier), we consistently observe:

  • AdamW (constant LR) produces a convex, valley-shaped loss profile between checkpoints.
  • SF-AdamW (constant LR) produces a flat, slowly declining profile between checkpoints.

For early checkpoints (0.5B, 1B), both AdamW and SF-AdamW exhibit a sharper decline, suggesting that the optimization dynamics had not yet reached near the valley floor in this initial stage.

Overall, these results suggest that during most of the training (from 1B tokens onward), the optimization dynamics concentrate near the valley floor, with SF-AdamW consistently following the river closely.

Refined SF Method

I still feel that more evidence is needed to support the new robust SF optimizer. For example, is it really robust for even larger models, and does the interpolation between two checkpoints (as in the additional rebuttal results) indicate that this new robust optimizer can track the river regardless of hyperparameter settings?

To address this, we provide linear interpolation results for the refined SF method under diverse momentum settings ($\beta_1 = 0.1, 0.5$):

Table 7: Refined SF-AdamW, Test loss at linear interpolation between 2B and 2.5B tokens

| $\alpha$ | 0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $(\beta_1, \beta_2) = (0.1, 0.99)$, $C=10$ | 3.729 | 3.715 | 3.702 | 3.689 | 3.677 | 3.663 | 3.649 | 3.635 | 3.621 | 3.607 | 3.593 |
| $(\beta_1, \beta_2) = (0.5, 0.99)$, $C=20$ | 3.477 | 3.464 | 3.453 | 3.442 | 3.431 | 3.420 | 3.408 | 3.399 | 3.390 | 3.393 | 3.379 |

Across both configurations, we observe a flat and smoothly declining loss profile, consistent with iterates following the valley floor. This provides further evidence that refined SF-AdamW robustly tracks the river regardless of momentum setup.

Regarding larger models, we were unable to conduct additional experiments during the rebuttal phase due to limited time and resources. If the reviewer considers such experiments critical, we are willing to perform them after the rebuttal period.

Comment

I thank the authors for the prompt reply. I have decided to increase my evaluation. I still hope to see some large-model results, but I'm now leaning toward acceptance of this work.

Comment

Thank you for your thoughtful reconsideration and positive evaluation. We truly appreciate your efforts and constructive feedback throughout.

Comment

Thank you for your thoughtful follow-up questions. We address them below and are happy to discuss further if anything remains unclear.

Scope of the Paper & Connection between EoS and River-Valley

My key concern is that it is not very clear what the scope of the paper is. / Can the authors elaborate more on how the EoS analysis (in Sec. 4.3) is connected to the river-valley landscape? Another thing I feel frustrated about is that the different sections of the paper seem a bit fragmented.

We would like to clarify the central scope of our paper. As reflected in the title, “Through the River: Understanding the Benefit of Schedule-Free Methods…”, our core contribution is a principled analysis of SF dynamics through the lens of the river-valley loss landscape. This focus spans the majority of the paper (Sections 2–4), in which we present theoretical and empirical insights into how SF interacts with the valley geometry. While sections on momentum sensitivity, EoS, and central flow may appear fragmented at first glance, they are closely interconnected aspects of our central theme: understanding how SF optimization dynamics relate to the river-valley structure.

In particular, our EoS analysis plays a crucial role in this connection. Based on our theoretical stability analysis, we observe that EoS behavior emerges specifically at the y iterates (Observation 4), rather than at x or z. This finding provides key evidence that the y sequence is the most faithfully aligned with the river geometry (Observation 3). Concretely, at the EoS, the y iterates oscillate between the steep valley walls (i.e., they oscillate along the high-curvature “hill” directions), while maintaining stable overall progress along the low-curvature “river” direction. The central flow then formalizes this view by modeling the time-averaged trajectory of y iterates, corresponding precisely to the smooth river path along the valley floor after filtering out the hill-direction oscillations.

Taken together, these observations suggest that, from the river-valley perspective, y_t should be regarded as the primary optimization trajectory. This insight naturally motivates our reformulation of the SF update rule in terms of the y iterates (SFy). Within this framework, the x iterate can be understood as a weighted average of y_t, with the averaging window determined by the momentum parameter $\beta_1$. This interpretation also explains why a suboptimal $\beta_1$ leads to poor performance (Observation 2): an ineffective averaging scheme causes the x iterates to drift away from the river, even while the y iterates remain well aligned.

Official Review
Rating: 5

The authors revisit Schedule-Free (SF) optimization (SF-AdamW), demonstrating that it navigates the "river-valley" structure of loss landscapes without explicit LR decay or memory-heavy weight averaging, and also list some observations, for example, that SF-AdamW is highly sensitive to momentum. A refined version of SF-AdamW is proposed, addressing two critical issues: momentum sensitivity and scalability under large batch sizes.

Strengths and Weaknesses

  1. Based on insightful observation (e.g., SF-AdamW is sensitive to the momentum parameter, an inadequate choice will lead to deviation from the river). The key contribution is the introduction of parameter C, which is used in a reformed Schedule-Free Optimizer to decouple the momentum and averaging behavior, weakening the momentum-sensitivity and resulting in further reductions of validation loss.
  2. The experiments are limited to relatively small-scale models (e.g., 124M parameter transformers decoder) due to computational resource constraints. Given the claim of scalability of the optimizer as well as relevance to large-scale training, a lack of large-scale model experiments somewhat limits the strength of the claims.
  3. The paper admits the lack of error-bar reporting and analysis due to computational constraints, leaving the statistical significance of the observed performance improvements uncertain.

Questions

The authors have demonstrated the improved validation loss of their refined SF-AdamW optimizer compared to the original SF-AdamW. Can the refined SF optimizer lead to measurable performance gains on downstream tasks based on different models and benchmarks?

Limitations

Yes

Final Justification

Many thanks for the authors' rebuttal comments. Most of my concerns have been addressed. I have also updated my rating for the significance of the paper.

Formatting Issues

N/A

Author Response

Thank you for your efforts and for appreciating our work. We address your concerns and questions below.

Response to Weaknesses

The experiments are limited to relatively small-scale models (e.g., 124M parameter transformers decoder) due to computational resource constraints. / The paper admits the lack of error bars report and analysis due to computational constraints.

We agree and will explicitly acknowledge this in the Limitations section of the revised manuscript. Our experiments were conducted on 124M-scale models trained up to the Chinchilla-optimal compute budget. This scale was intentionally chosen to balance interpretability of Schedule-Free dynamics with available computational resources.

The primary goal of our paper is to understand the mechanisms underlying Schedule-Free optimization, rather than to demonstrate state-of-the-art performance. Accordingly, we prioritized controlled ablations and mechanistic analysis over repeated trials or larger-scale benchmarks. In Sections 3 and 4, which focus on analyzing training dynamics rather than making performance claims, we believe error bars would provide limited additional insight. In Section 5, however—where we introduce the refined SF method and claim improvements over vanilla SF—we agree that error bars would strengthen the statistical significance of the results.

Due to the limited time and resources during the rebuttal period, we could not provide these additional runs. We will, however, include error bars for the refined SF experiments in Section 5 in the revised version.

Response to Questions

The authors have demonstrated the improved validation loss of their refined SF-AdamW optimizer compared to the original SF-AdamW. Can the refined SF optimizer lead to measurable performance gains on downstream tasks based on different models and benchmarks?

Evaluating downstream task performance is beyond the scope of this work. Our focus is on developing efficient and scalable pretraining algorithms. Assessing downstream performance would require fine-tuning models that have been pretrained using our method, which in turn necessitates pretraining very large models on massive datasets—a level of computation that is currently beyond our resources.

While we acknowledge the differences between pretraining and fine-tuning loss landscapes, we believe that our refinement may, to some extent, generalize to fine-tuning tasks. Verifying the effects of decoupling and other findings from pretraining dynamics in the context of fine-tuning tasks remains a promising direction for future work.

Final Decision

The paper analyzes Schedule-Free (SF) optimization (SF-AdamW), demonstrating that it implicitly performs weight averaging and follows a “river-valley” loss landscape without explicit decay schedules or extra memory. The authors identify momentum sensitivity and limited large-batch robustness as key weaknesses and propose a refined SF variant with an extra hyperparameter to address these issues. Reviewers find the work technically solid, clearly written, and insightful, particularly in its theoretical framing and analysis of SF dynamics. However, concerns remain about the small experimental scale, lack of statistical rigor, unclear formalization of the “river-valley” concept, and the practicality of the refined optimizer given its new hyperparameter. Overall, reviewers lean towards acceptance, recognizing the contribution as important for understanding and potentially scaling SF methods, although it is still preliminary in evaluation.