PaperHub
NeurIPS 2025 · Poster · 3 reviewers
Overall rating: 7.0/10
Reviewer ratings: 4 / 5 / 4 (min 4, max 5, std 0.5)
Confidence: 3.3
Novelty: 3.3 · Quality: 2.7 · Clarity: 2.3 · Significance: 3.3

Skrull: Towards Efficient Long Context Fine-tuning through Dynamic Data Scheduling

OpenReview · PDF
Submitted: 2025-04-16 · Updated: 2025-10-29
TL;DR

Efficient Long Context Fine-tuning through Dynamic Data Scheduling

Abstract

Keywords
long context fine-tuning, context parallelism, supervised fine-tuning, LLM, Large language model

Reviews and Discussion

Official Review (Rating: 4)

This paper investigates how to conduct efficient training in the presence of both long and short data sequences. It proposes a method named Skrull, which includes Distributed-Aware Context Parallelism (DACP) and Global Data Scheduling (GDS). Experimental results demonstrate that this method achieves a significant efficiency improvement (3.76x) compared to DeepSpeed in real-world long-SFT scenarios.

Strengths and Weaknesses

Strengths:

  1. The proposed method is novel and its solution is well-supported by theory, making the approach appear solid.

  2. The experimental results are impressive. The average speedup of 3.76x, with a peak of 7.54x, strongly validates the effectiveness of the method. This is a great contribution to the long-context community.

Weaknesses:

  1. The organization of the paper seems imbalanced. The experiments section is too brief (only one page), which slightly weakens the evidence supporting the method's effectiveness. The analysis is also insufficient. For instance: 1) There is a lack of discussion regarding more hyperparameters. 2) The paper claims the "Scheduling Algorithm of GDS" has almost no overhead, but this still should be presented with quantitative data.

  2. This method alters the construction of batches. Although the authors state on page 5, lines 164-165, that "We limit the scheduling scope to the global batch because it represents the maximum scope that maintains mathematical equivalence for mainstream optimizers such as Adam [13] and AdamW [17]," this assertion alone is relatively weak. The authors should provide a comparison of model performance (e.g., loss curves or downstream benchmark results) between a model trained with Skrull and one trained with a standard method.

  3. The number of baselines is limited. The only comparison made is against DeepSpeed. The authors could consider comparing their method against other strategies like sorted batching, such as LongAlign, as mentioned in the related work section on line 305.

Note:

I highly recognize the contribution of this paper in terms of efficiency, and I believe it is an important contribution. However, as an academic paper, its current presentation has several issues, as detailed in the weaknesses above. If these issues are addressed, I would be willing to raise my score.

Questions

See weakness

Limitations

None

Final Justification

The authors addressed most of my concerns in their rebuttal. My concerns were mainly about the comprehensiveness of the experiments. In their response, the authors give very detailed results. Thus, I raised my rating from borderline reject to borderline accept.

Format Issues

None

Author Response

We sincerely thank Reviewer 9acd for taking the time to review our work and for highly recognizing the novelty and effectiveness of our approach, especially noting that it is "a great contribution to the long-context community." We appreciate Reviewer 9acd's constructive feedback on the current presentation, which we will carefully address in the final version. Below, we summarize the reviewer’s concerns and describe how we plan to resolve them.

Revision Summary

We will add several subsections to enhance the Evaluation and Experimental Analysis sections. To allocate space, we will move part of Skrull's implementation details to the appendix and extend the experimental section by approximately one page.

First, we will add three new subsections to Section 5 (Evaluation):

  1. Precision Validation (after "Overall Performance"): We will present loss curves comparing Skrull and standard training to demonstrate their training equivalence.
  2. Impact of BucketSize (after "Performance Impact of BatchSize"): We will analyze how this hyper-parameter affects performance and highlight the importance of accurate cost modeling.
  3. Case Study (at the end of Evaluation): We will analyze a realistic training case with memory footprints and wall-clock time to illustrate the trade-off between memory and computation, as discussed in Section 4. (Due to limited characters, see the quantitative data and analysis in our response to Reviewer N1BH (Point 3) ).

Additionally, we will include a paragraph quantifying the scheduling overhead, demonstrating that Skrull enables online scheduling with negligible cost.

Second, we will expand the baseline comparisons:

  1. We include the sorted batching strategy from LongAlign [1] under our experimental settings and integrate the results into Figure 3. Our method significantly outperforms sorted batching.
  2. We evaluate Skrull on larger models and test compatibility with algorithm-level optimizations such as LoRA [2], further highlighting Skrull’s versatility and ease of integration.

Responses to Concerns

Concern 1: The experiments section is too brief. (1.1) Lack of discussion regarding more hyper-parameters. (1.2) Lack of quantitative data to illustrate the overhead of Skrull.

We thank Reviewer 9acd for these useful comments. We have included new experiments and discussions.

1.1 Experiments on more hyper-parameters: Besides the "BatchSize", we add a subsection to discuss the performance impact of "BucketSize".

We test the end-to-end speedups with multiple BucketSize settings using Qwen2.5-0.5B on the ChatQA2-Long-SFT dataset with the setting <DP=1, CP=8, BatchSize=64>. The table below demonstrates that the "BucketSize" setting significantly influences performance.

BucketSize | Time (s) | Speedup
baseline | 155 | 1.00×
6K | 236 | 0.66×
8K | 193 | 0.80×
10K | 177 | 0.88×
12K | 170 | 0.91×
14K | 165 | 0.94×
16K | 162 | 0.96×
18K | 153 | 1.01×
20K | 143 | 1.08×
22K | 132 | 1.17×
24K | 126 | 1.23×
26K | 119 | 1.30×
27K | 116 | 1.34×
28K | OOM | -
30K | OOM | -
32K | OOM | -

Table 1. Speedup with different BucketSize.

Although a larger BucketSize brings more performance gains, it also increases the risk of out-of-memory (OOM) errors, so it is important to set an appropriate "BucketSize". In Skrull, BucketSize is determined by the memory-modeling results from offline profiling. Under this training setting and model size, the predicted ideal BucketSize ranges from 26K to 27K, so we choose 26K as the default BucketSize in our evaluation section. The experimental results further demonstrate the effectiveness of performance modeling in Skrull.
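To make the role of memory modeling concrete, here is a minimal, hypothetical sketch of how offline-profiled coefficients could pick the largest safe BucketSize. The linear model `predict_peak_mem_gb`, its coefficients, and the capacity value are all illustrative assumptions, not Skrull's actual memory model.

```python
# Hypothetical sketch: a linear memory model fitted offline predicts peak
# memory per BucketSize, and we keep the largest bucket that fits capacity.
# The coefficients and capacity below are made up for illustration.

def predict_peak_mem_gb(bucket_tokens, base_gb=12.0, gb_per_token=1.6e-3):
    # Assumed linear fit from offline profiling: base memory + per-token cost.
    return base_gb + gb_per_token * bucket_tokens

def max_safe_bucket(candidates, capacity_gb):
    # Largest candidate whose predicted peak memory still fits on the GPU.
    safe = [b for b in candidates if predict_peak_mem_gb(b) <= capacity_gb]
    return max(safe) if safe else None

candidates = [6_000, 12_000, 18_000, 24_000, 26_000, 28_000, 32_000]
print(max_safe_bucket(candidates, capacity_gb=56.0))  # 26000 with these toy numbers
```

Under these toy coefficients, 26K fits while 28K and above are predicted to overflow, mirroring the OOM boundary seen in Table 1.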

Additionally, we also evaluate Skrull's performance on larger models and its compatibility with other SFT optimizations such as LoRA [2]. We test the Qwen2.5-14B and 32B models with LoRA on LMsysChat1M. The table below demonstrates that Skrull works seamlessly with LoRA, showcasing its versatility. We believe Skrull, a system-level long-SFT optimization, can work seamlessly with more algorithm-level optimizations and provide the community with an effective and easy-to-use training solution.

Method | 14B | 32B
Baseline | 1.00× | 1.00×
DACP | 1.18× | 1.24×
DACP + GDS | 2.37× | 2.54×

Table 2. Speedup with larger models.

1.2 Scheduling overhead analysis and quantitative data in Skrull.

Firstly, the scheduling process is implemented in the DataLoader. Skrull only requests sequence-length information from the DataSampler and passes the generated scheduling plans to the DataLoader. By design, Skrull's scheduling runs on CPUs and is fully asynchronous with model training on GPUs, so its overhead can be overlapped with training.
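As a rough illustration of DataLoader-side scheduling (a simplification, not the authors' algorithm), the CPU can turn the per-sample sequence lengths of a global batch into per-rank plans. The greedy longest-first heuristic below stands in for Skrull's performance-model-guided scheduler:

```python
# Simplified sketch of CPU-side scheduling from sequence lengths alone.
# A greedy longest-first heuristic stands in for Skrull's cost-model-guided
# scheduler; real DACP/GDS decisions also consider splitting and memory.

def schedule_global_batch(seq_lens, num_ranks):
    order = sorted(range(len(seq_lens)), key=lambda i: -seq_lens[i])
    loads = [0] * num_ranks               # tokens assigned to each rank
    plan = [[] for _ in range(num_ranks)]
    for i in order:
        r = loads.index(min(loads))       # send to the least-loaded rank
        plan[r].append(i)
        loads[r] += seq_lens[i]
    return plan, loads

plan, loads = schedule_global_batch([32000, 1000, 1200, 900, 16000, 800, 700, 31000], 4)
print(loads)  # per-rank token counts; far more balanced than arrival order
```

Because this logic touches only Python integers, it runs in microseconds on a CPU worker and overlaps with GPU compute, consistent with the overhead measurements in 1.2 below.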

Secondly, we also record quantitative costs during realistic training with Qwen2.5-0.5B on ChatQA2-Long-SFT. The scheduling cost (including both GDS and DACP) ranges from 10 ms to 50 ms, whereas each training iteration takes at least one second, confirming that Skrull's overhead is negligible and fully amortized.

Concern 2: Provide a comparison of model performance between Skrull and standard training.

We sincerely thank Reviewer 9acd for this important point. From the perspective of model optimization, Skrull does not alter the content of the global batch or its order; it only reorganizes the calculation order (the gradient-accumulation order) within the global batch and the way sequences are executed on GPUs (split or not, on GPU 0 or GPU 1) for system efficiency. Since gradients are accumulated and averaged within each global batch before the optimizer step, the optimization trajectory remains equivalent, up to negligible numerical differences.

Nevertheless, we fully agree that precision validation is crucial, so we empirically compare the training loss between Skrull and the standard method. Due to the rebuttal rules, we can only show selected points in the table below. After 1000 iterations, both methods converge to a loss of ~0.24; the slight numerical differences are caused by hardware-level numerical fluctuations and do not affect convergence. We will plot the full loss-curve comparison in our final version.

Iteration | Standard | Skrull
0 | 7.45541573 | 7.45541573
10 | 6.27256107 | 6.27239227
20 | 4.24731827 | 4.24617065
30 | 2.67359886 | 2.67313461
40 | 2.24764290 | 2.24854412
50 | 1.88435841 | 1.87571697
60 | 1.63205013 | 1.62057266
70 | 1.49548502 | 1.48315649
80 | 1.38979225 | 1.39394617
90 | 1.62660141 | 1.60026455
100 | 1.17248878 | 1.17774509

Table 3. Loss curve comparison during training Qwen2.5-0.5B on LMsysChat1M.
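The equivalence argument above can be checked on a toy example. The linear least-squares model and numbers below are purely illustrative, not from the paper:

```python
# Toy check that reordering gradient accumulation within a global batch
# leaves the averaged gradient unchanged (up to float rounding). The linear
# least-squares model and data here are illustrative, not from the paper.

def grad_sample(w, x, y):
    # d/dw of the per-sample loss 0.5 * (w*x - y)^2
    return (w * x - y) * x

def global_batch_grad(w, batch, order):
    g = 0.0
    for i in order:              # accumulation order, which a scheduler may reorder
        x, y = batch[i]
        g += grad_sample(w, x, y)
    return g / len(batch)        # averaged before the optimizer step

batch = [(1.0, 2.0), (3.0, -1.0), (0.5, 0.25), (2.0, 4.0)]
g_std = global_batch_grad(0.1, batch, order=[0, 1, 2, 3])
g_sched = global_batch_grad(0.1, batch, order=[2, 0, 3, 1])
print(abs(g_std - g_sched) < 1e-12)  # True: reordering within the batch is equivalent
```

Any residual difference is pure floating-point summation order, matching the "hardware-level numerical fluctuation" explanation for the tiny gaps in Table 3.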

Concern 3: The authors could consider comparing their method against other strategies like sorted batching.

We thank the reviewer for these helpful comments. To further evaluate our optimization, we add the sorted batching strategy from LongAlign [1] as a baseline. As shown in the following table, Skrull outperforms sorted batching significantly. Although sorted batching achieves more balanced batches by grouping training sequences by length, it is still a "static" ahead-of-time batching method. In contrast, Skrull achieves dynamic just-in-time data scheduling guided by performance modeling. Additionally, Skrull also implements DACP to reduce communication overheads during long-context training. The results will be reflected in our final version.

Model Size | Wikipedia | LMsysChat1M | ChatQA2-Long-SFT
0.5B | 1.10× (7.54×) | 1.13× (7.51×) | 1.03× (1.46×)
7B | 1.06× (2.30×) | 1.07× (2.73×) | 0.99× (1.08×)

Table 4. Speedup comparison between Sorted Batching (main value) and Skrull (in parentheses)
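For context, a short sketch of what a sorted-batching baseline looks like, written in the spirit of LongAlign rather than taken from its code: sort by length, cut consecutive groups, and shuffle only the group order, which is why batch membership is fixed ahead of time.

```python
import random

# Sketch of a sorted-batching baseline in the spirit of LongAlign (not its
# actual implementation): each batch groups similar lengths, but membership
# is decided ahead of time, i.e., a "static" strategy.

def sorted_batches(seq_lens, batch_size, seed=0):
    order = sorted(range(len(seq_lens)), key=lambda i: seq_lens[i])
    groups = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    random.Random(seed).shuffle(groups)   # random group order, fixed membership
    return groups

lens = [5, 900, 7, 850, 6, 880, 8, 870]
for b in sorted_batches(lens, batch_size=4):
    print([lens[i] for i in b])  # each batch holds only similar lengths
```

Because membership never adapts to the run-time memory/computation trade-off, a dynamic scheduler still has room to improve on it, which is what Table 4 shows.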


Summary

We thank Reviewer 9acd again for their thoughtful comments and suggestions. We believe that the revisions will significantly improve the clarity of our paper, and we look forward to presenting the enhanced version.

[1] Yushi Bai et al., LongAlign: A Recipe for Long Context Alignment.
[2] Edward J. Hu et al., LoRA: Low-Rank Adaptation of Large Language Models.

Comment

Dear authors,

Thanks for your detailed reply. I have raised my rating from 3 to 4 accordingly. Please do incorporate these discussions in your next version.

Comment

Thank you for your thoughtful reviews and recognizing the contributions of our work! All the discussions will be incorporated in our next version.

Official Review (Rating: 5)

This paper addresses training efficiency losses incurred by training on heterogeneous mixtures of short- and long-context data, as is common in SFT stage of LLM training. The authors propose a novel dynamic data scheduler designed to dynamically schedule long-context SFT for efficiency improvements. The authors show on common datasets that sequence length distributions are highly variable both inter and intra dataset. They propose DACP and GDS, which together compose Skrull. They present analyses of the marginal gains of adding DACP and GDS to the DeepSpeed framework, reporting strong results across three datasets.

Strengths and Weaknesses

Strengths: The paper structure is easy to follow. The formulation of the problem is lucid, the approach is clearly explained, the results are strong. The efficiency gains of the proposed method seem significant, and likely to be useful to others.

Weaknesses: The writing is occasionally unclear, ungrammatical, or both. This can be easily improved to make the paper much easier to read.

Section 4.3 discusses the trade-offs between memory and computation that are central to evaluating Skrull, but at no point are the actual memory costs associated with the experimental results presented. Isn't this important to evaluating the technique?

Questions

Some of the writing is garbled and ungrammatical. This does not terribly distract from the content, but is distracting and not publication-ready. Please fix. lines 302-303 for example.

See weaknesses.

Nit: Fig 2 is quite difficult to read when printed, and in PDF format needs to be heavily zoomed in. Please fix the tiny bold/italicized words.

Limitations

none

Final Justification

Solid paper with thorough experiments and good results. Will be of interest to others and likely to be replicated.

Format Issues

na

Author Response

We sincerely appreciate Reviewer N1BH for recognizing the value of our work, noting that our work is "easy to follow, lucid in problem formulation, clearly explained, with significant results and likely to be useful to others." We are grateful for the constructive feedback, especially regarding writing clarity, figure readability, and the need for further analysis of trade-offs in Skrull. We address each of these points below and will incorporate all necessary revisions in our final version.

1. Writing Clarity.

We thank the reviewer for pointing out that parts of the writing are unclear or ungrammatical (e.g., lines 302–303). We fully agree that writing quality is essential for clarity and readability. Following this suggestion, we have carefully revised the entire paper to improve the writing, correct grammatical issues, and ensure all sentences are concise and well-structured. The final version will be thoroughly proofread and ready for publication.

2. Readability of Figure 2.

We appreciate the reviewer’s suggestion regarding the legibility of Figure 2, especially when printed. We have re-designed this figure to use larger fonts and avoid excessive stylization (e.g., tiny bold/italicized text). We will ensure that all figures in the final version are visually accessible and clear in both digital and print formats.

3. Evaluation and Analysis of Trade-offs in Skrull.

We appreciate the reviewer’s insightful comment regarding the lack of actual memory usage data to evaluate the trade-offs between computation and memory in Skrull.

We provide additional quantitative data based on training with ChatQA2-Long-SFT under the configuration <DP=1, CP=8, BatchSize=64> using Qwen2.5-0.5B. Additionally, we include a naive scheduling algorithm for comparison, which dispatches sequences in a simple round-robin manner. To further evaluate the effectiveness of memory trade-offs in Skrull, we test the round-robin scheduling algorithm both with and without a roll-back mechanism.

We measure the balance of computation using wall-clock time, where lower time indicates more balanced computation under the same workload. Similarly, we evaluate memory balance by tracking the minimum and maximum peak memory usage across all GPUs.

Method | Speedup
Baseline | 1.00×
Round-robin w/ roll-back | 1.17×
Round-robin w/o roll-back | OOM
Skrull | 1.40×

Table 1. Comparison between different scheduling strategies.

Note 1: Higher speedup (i.e., lower wall-clock time) indicates better computational balance.

Note 2: The roll-back mechanism is a component of Skrull used to avoid out-of-memory.

Iteration | Baseline (Min, Max) | Round-robin w/ roll-back (Min, Max) | Skrull (Min, Max)
1 | (36, 36) | (10, 49) | (21, 51)
2 | (40, 40) | (29, 47) | (37, 50)
3 | (35, 35) | (14, 45) | (32, 49)
4 | (46, 46) | (46, 47) | (45, 49)
5 | (49, 49) | (49, 49) | (49, 49)
6 | (35, 35) | (16, 46) | (29, 50)
7 | (33, 33) | (13, 50) | (20, 48)
8 | (47, 47) | (37, 52) | (47, 47)

Table 2. GPU memory usage per iteration (in GB).

The results in Tables 1 and 2 demonstrate that Skrull achieves optimal performance while respecting memory constraints. Compared to the baseline, both Skrull and round-robin scheduling assign more sequences locally, reducing communication and computation degradation at the cost of increased memory imbalance. While such imbalance is acceptable as long as memory capacity is not exceeded, it does increase the risk of out-of-memory (OOM) errors — as observed in the round-robin scheduling without roll-back.

These results highlight the importance of the roll-back mechanism and cost modeling in Skrull. Compared to round-robin with roll-back, Skrull provides better computational balance and lower memory imbalance, demonstrating the effectiveness of its scheduling algorithm.
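A minimal sketch of the roll-back idea discussed above, assuming a per-token memory cost and a capacity value that are purely illustrative (this is not the paper's implementation):

```python
# Illustrative roll-back guard on top of round-robin dispatch: if a rank's
# predicted memory would overflow, re-dispatch to the least-loaded rank.
# The per-token memory cost and capacity are assumed, not measured values.

def dispatch_with_rollback(seq_lens, num_ranks, capacity, mem_per_token=1.0):
    loads = [0.0] * num_ranks
    plan = [[] for _ in range(num_ranks)]
    for i, n in enumerate(seq_lens):
        need = n * mem_per_token
        r = i % num_ranks                  # round-robin target
        if loads[r] + need > capacity:
            r = loads.index(min(loads))    # roll back to least-loaded rank
        if loads[r] + need > capacity:
            raise MemoryError("predicted OOM even after roll-back")
        plan[r].append(i)
        loads[r] += need
    return plan, loads

plan, loads = dispatch_with_rollback([60, 10, 20], num_ranks=2, capacity=70)
print(loads)  # [60.0, 30.0]: the third sequence was rolled back to rank 1
```

Without the roll-back branch, the same input can exceed capacity on one rank, which is the failure mode shown as "OOM" in Table 1.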

We will include a detailed version of this analysis, along with additional quantitative data, in the final version.


Summary

We thank Reviewer N1BH again for the valuable feedback. We believe that the additional analyses and the improved presentation will further strengthen the clarity, completeness, and readability of our paper. All suggested corrections will be reflected in the final version.

Comment

Thanks for the additional data. If this is included in the paper, it really strengthens the claims. I increase my rating from 4 to 5.

Comment

Dear Reviewer N1BH,

I hope this message finds you well. We sincerely thank you for your support and recognition of our work. We have incorporated the discussion and detailed analysis into the “Case Study” subsection at the Evaluation section in our next version. Thank you again for your insightful and helpful review.

Best wishes, Authors

Official Review (Rating: 4)

This paper tackles the efficiency problem in long-context supervised fine-tuning (Long-SFT) for LLMs, where mixing long and short sequences causes system inefficiencies. The authors introduce Skrull, a dynamic data scheduler combining Distributed-Aware Context Parallelism (DACP) and Global Data Scheduling (GDS). Their approach balances computation and memory on the fly, achieving up to 7.5× speedup over DeepSpeed in realistic benchmarks.

Strengths and Weaknesses

Strengths: The paper addresses a highly practical and timely challenge for modern LLM training. Its system-level method is orthogonal to model or optimizer changes, making it easy to integrate with other techniques. The design is well explained, with clear trade-off modeling and solid experimental validation on real datasets.

Weaknesses: The experiments focus on models up to 7B parameters; it’s unclear how well Skrull scales to even larger models. There is no analysis of Skrull’s impact on training convergence or final task accuracy. Also, it lacks direct quantitative comparison with other recent long-context scheduling methods.

Questions

How robust is Skrull if the input sequence distribution changes over time — does profiling need to be repeated?

Limitations

Skrull depends on accurate profiling; if cost models are wrong, it may underperform or run out of memory. It currently works with DeepSpeed — porting it to other training frameworks may require extra engineering effort.

Final Justification

This is a good paper with solid results. The authors further provide detailed rebuttals that addressed my concerns. Thus, I will keep my rating.

Format Issues

N/A

Author Response

We sincerely appreciate Reviewer eLuU for recognizing the value of our work, noting that it "addresses a highly practical and timely challenge for modern LLM training". In accordance with Reviewer eLuU's suggestions, we have conducted additional experiments to further demonstrate Skrull's effectiveness, precision, and provide a more quantitative comparison with other approaches such as LongAlign [1]. We sincerely thank Reviewer eLuU for the valuable feedback. All additional results will be included in the final version. Below, we provide detailed responses to Reviewer eLuU's concerns and questions.

Responses to Weaknesses

Concern 1: How Skrull scales to even larger models

Firstly, we would like to clarify that Skrull addresses system inefficiencies caused by dataset characteristics in long-context SFT. Theoretically, such inefficiencies (e.g., load imbalance, redundant communication, kernel degradation) persist regardless of model size. Moreover, we believe that as model size increases, the kernel degradation problem caused by finer sequence splitting is alleviated due to higher computational intensity.

To empirically validate this, we tested the Qwen2.5-14B and 32B models on the LMsysChat dataset under the setting <DP=4, CP=8, BatchSize=64>. We adopted LoRA [2] to rapidly evaluate Skrull's effectiveness with larger models and its compatibility with other efficient fine-tuning techniques. The results show that Skrull achieves 2.37× and 2.54× end-to-end speedups over the standard method for the 14B and 32B models, respectively. The detailed breakdown is as follows:

Method | 14B | 32B
Baseline | 1.00× | 1.00×
DACP | 1.18× | 1.24×
DACP + GDS | 2.37× | 2.54×

Table 1. Speedup with larger models.

Our experiments demonstrate that Skrull remains effective when scaling to larger models. Furthermore, our additional experiments show Skrull's compatibility with other long-context fine-tuning techniques. We believe Skrull, a precision-lossless system optimization, can provide the long-context community with a more efficient and easily integratable training solution.

Concern 2: Analysis of Skrull’s impact on training convergence

We thank Reviewer eLuU for the meticulous review. As mentioned in our responses to other reviewers, we add the loss curves of both Skrull and standard training for comparison. Due to the NeurIPS rebuttal rules, we cannot include loss-curve figures or external links on OpenReview to demonstrate Skrull's precision and convergence. Instead, we present loss values at selected training iterations. After 1000 iterations, the loss of both methods converges to ~0.24. The slight numerical differences are due to hardware-level fluctuations, which do not affect convergence. We will include the complete loss-curve figure in our final version.

Iteration | Standard | Skrull
0 | 7.45541573 | 7.45541573
10 | 6.27256107 | 6.27239227
20 | 4.24731827 | 4.24617065
30 | 2.67359886 | 2.67313461
40 | 2.24764290 | 2.24854412
50 | 1.88435841 | 1.87571697
60 | 1.63205013 | 1.62057266
70 | 1.49548502 | 1.48315649
80 | 1.38979225 | 1.39394617
90 | 1.62660141 | 1.60026455
100 | 1.17248878 | 1.17774509

Table 2. Loss curve comparison during training Qwen2.5-0.5B on LMsysChat1M.

We highly acknowledge the importance of optimization precision, and we conducted module tests during our early proof-of-concept validation, including the attention module and the training impact of scheduling data within the global batch. However, Skrull functions as a system-level optimization: it only alters the assignment of training data to specific GPUs and does not involve any algorithm-level modifications. This explains why we did not initially include this section in our original submission.

Concern 3: more direct quantitative comparison with other methods.

We sincerely appreciate this suggestion and have conducted additional experiments for a more comprehensive comparison. As mentioned in our responses to other reviewers, we compare with LongAlign [1], whose sorted batching method sorts the dataset by sequence length and selects random consecutive groups for each batch to improve long-SFT training efficiency. The results are listed in Table 3 below.

Model | Wikipedia | LMsysChat1M | ChatQA2-Long-SFT
Qwen2.5-0.5B | 1.10× (7.54×) | 1.13× (7.51×) | 1.03× (1.46×)
Qwen2.5-7B | 1.06× (2.30×) | 1.07× (2.73×) | 0.99× (1.08×)

Table 3. Speedup comparison between Sorted Batching (main value) and Skrull (in parentheses)

Experimental results demonstrate that Skrull significantly outperforms the sorted batching strategy. Although sorted batching achieves more balanced batches by grouping training sequences by length, it is still a "static" batching method. In contrast, Skrull achieves dynamic data scheduling guided by performance modeling. Additionally, Skrull implements DACP to reduce communication overheads during long-context training. The results will be reflected in our final version.


Responses to Questions

Question 1: How robust is Skrull if the input sequence distribution changes over time — does profiling need to be repeated?

We sincerely thank Reviewer eLuU for pointing out this key detail of our implementation. Our profiling aims to predict performance characteristics as a function of sequence length. The relationship between sequence length and performance is affected by the model, hardware configuration, and training-system settings (parallelism settings, gradient checkpointing, etc.). If all of these settings are the same, offline profiling does not need to be repeated even if the sequence distribution changes. However, if the sequence lengths extend beyond the profiled range (e.g., the maximum sequence length is extended), we suggest repeating profiling for more accurate prediction.
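As a toy illustration of why re-profiling is only needed outside the profiled range, consider a piecewise-linear cost model over profiled (length, time) samples. All the profile numbers below are invented for illustration:

```python
# Toy cost model over offline-profiled (sequence_length, seconds) samples.
# Interpolating inside the profiled range is safe; extrapolating outside it
# is refused, mirroring the advice to re-profile when the max length grows.
# All profile numbers are invented for illustration.

def predict_time(seq_len, profile):
    lo, hi = profile[0][0], profile[-1][0]
    if not (lo <= seq_len <= hi):
        raise ValueError("outside profiled range: re-run offline profiling")
    for (x0, y0), (x1, y1) in zip(profile, profile[1:]):
        if x0 <= seq_len <= x1:
            t = (seq_len - x0) / (x1 - x0)   # linear interpolation weight
            return y0 + t * (y1 - y0)

profile = [(1_000, 0.02), (8_000, 0.20), (32_000, 1.10)]
print(round(predict_time(4_500, profile), 4))  # 0.11
```

A new data distribution only moves queries around inside this table; the table itself depends on the model, hardware, and parallel settings, so it stays valid until those change or queries fall outside [lo, hi].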


Comments on Limitations

Skrull relies on accurate profiling

We thank the reviewer for this insightful comment and fully agree with Reviewer eLuU. Skrull depends on accurate profiling because we want to maximize performance gains; for memory in particular, inaccurate modeling incurs the risk of OOM. In practice, however, the profiling results are stable and accurate. We will include some predicted versus measured data in our appendix to illustrate this.

Skrull's portability and engineering efforts

Skrull is orthogonal to training frameworks. However, it does require some engineering effort to implement, mainly focused on the DataLoader logic and some custom modules.


Summary

We thank Reviewer eLuU again for meticulous reviews of our paper, which help us improve both the clarity and strength of our manuscript. We hope our responses have fully addressed your concerns.

[1] Yushi Bai et al., LongAlign: A Recipe for Long Context Alignment.

[2] Edward J. Hu et al., LoRA: Low-Rank Adaptation of Large Language Models.

Comment

Thanks for the detailed response and experimental results. I will keep my rating for this paper.

Comment

Dear Reviewer eLuU,

I hope this message finds you well. Thank you once again for recognizing our contribution and for your valuable feedback. If you have any additional questions or suggestions, please don’t hesitate to let us know. Your feedback is greatly appreciated and truly invaluable to us!

Best wishes, Authors

Final Decision

The paper introduces Skrull, a novel dynamic data scheduler designed to accelerate the supervised fine-tuning of large language models on datasets with mixed long and short sequences. By intelligently balancing computational loads and memory usage across GPUs, Skrull addresses system inefficiencies inherent in long-context training. As highlighted by the reviewers, the paper's core strengths are its novelty, practicality, and very promising speedup results (up to 7.54×).

Initially, all three reviewers shared concerns about the insufficient experimental results, lack of clarity and polish. Reviewer eLuU questioned Skrull's scaling to larger models and its impact on convergence, while also wanting comparisons to other schedulers. Similarly, Reviewer 9acd found the experiments section too brief, asking for more baselines like sorted batching, analysis of the scheduling overhead, and proof (e.g., loss curves) that the method didn't alter model performance. Reviewer N1BH focused on the lack of quantitative data to validate the paper's claimed memory-computation trade-offs and pointed out minor presentation issues. The authors effectively addressed every concern in their rebuttal by providing new experiments on larger 14B and 32B models, a direct speedup comparison against the "sorted batching" method, detailed loss curve data confirming mathematical equivalence, and tables showing the negligible scheduling overhead and memory usage, which ultimately satisfied the reviewers and led to increased scores. Meta reviewer also notes that authors do not provide any test loss/accuracy numbers. Authors are highly encouraged to revise their manuscript to address above concerns.