Cost-Efficient LLM Training with Lifetime-Aware Tensor Offloading via GPUDirect Storage
Abstract
Reviews and Discussion
This paper proposes an active memory profiler and memory offloading algorithm for reducing the peak memory usage of LLM training workloads. The key insight is that the active memory required for LLM training is a small fraction of the total memory allocated, and therefore carefully scheduling memory offloads and corresponding prefetches from SSD memory can enable scaling beyond the fundamental device DRAM memory limit. The evaluation shows that the proposed offloading framework TERAIO achieves an average speedup of 1.47x compared to the ZeRO-Offload and ZeRO-Infinity techniques.
Strengths and Weaknesses
Strengths:
- Evaluation demonstrates significant performance improvement over baseline approaches (ZeRO-Offload / ZeRO-Infinity).
- Implementation of tensor memory profiler instruments PyTorch to minimize the required developer effort.
- Paper is well written and contributions are clear.
Weaknesses:
- Limited discussion of related work / analysis of baselines. The main comparison is to ZeRO-Offload and ZeRO-Infinity, but it’s unclear A) whether these are state-of-the-art approaches to memory offloading, and B) why TERAIO is able to outperform ZeRO-Infinity given that both ZeRO-Infinity and TERAIO are able to offload to both host memory as well as SSDs. In Section 4.2 the authors write: “Since SSDs offer larger capacity, ZeRO-Infinity can still train Llama-70B. However, it only achieves 43.0% of the ideal performance, because its coarse-grained offloading scheme cannot efficiently utilize the limited SSD bandwidth to migrate larger tensors.” What makes the ZeRO-Infinity scheme more coarse grained than TERAIO? Later in Section 4.4 the authors write: “ZeRO-Infinity requires 170GB and 770GB of CPU memory to train Llama-8B and Llama-70B, respectively, as it has to offload gradients and optimizers into CPU.” Is this fundamental to ZeRO-Infinity’s offloading design or just an implementation detail? In general, is the speedup mostly attributed to a more intelligent offload / prefetch schedule, a more robust implementation (e.g. more effectively using GPUDirect), or something else?
- Limited novelty of tensor profiler. Many papers have proposed granular profiling for DNNs, but TERAIO includes this profiler as a key contribution. Is there anything particularly different about the TERAIO profiler?
- Results are mostly empirical, but it would be nice to see more theoretical analysis of how to compute the expected performance of this approach given key parameters such as the activation sizes, hardware bandwidth, and available memory capacity. This would help determine whether completely hiding the transfer latency is possible without having to actually run the system in practice.
- No comparison to activation recomputation. The authors suggest that this will slow down training which is true, but how does this slowdown compare to potential kernel stalls caused by delays in the memory prefetching?
Questions
- For the characterization study in Section 2, which sequence length / batch sizes were used for each model? Do the conclusions change with different parameters?
- Why is there still a ~20% gap between TERAIO and the ideal performance?
- How often do kernel stalls happen in practice and what is the resulting performance impact? Can these kernel stalls be predicted in advance using theoretical analysis?
- How does this technique change with emerging memory technologies like the high-bandwidth GPU-CPU interconnects available with NVIDIA Grace systems?
Limitations
The authors have described the limitations of their work and there are no ethical concerns.
Final Justification
Overall, the additional experiments provided by the authors help clarify the performance benefits of TeraIO. Including these experiments and expanding the results to include more in-depth study of the ZeRO-Infinity baseline would strengthen the paper. As a result, I increased my score.
Formatting Issues
No
1. Is ZeRO-Infinity the SOTA Approach
ZeRO-Infinity is the most widely adopted framework for tensor offloading and has been integrated into the DeepSpeed framework. Recent tensor offloading studies also include FlashNeuron, Smart-Infinity, and others. FlashNeuron only considers offloading activation tensors, while Smart-Infinity focuses on using specialized near-storage accelerators to accelerate optimizer updates.
To the best of our knowledge, ZeRO-Infinity is still a well-acknowledged and practical SOTA approach for tensor offloading, which serves as the most competitive baseline for TeraIO.
2. Why TeraIO Outperforms ZeRO-Infinity
TeraIO outperforms ZeRO-Infinity due to its design differences in offloading granularity and migration strategy.
Offloading Granularity: ZeRO-Infinity plans the offloading of parameters and optimizer states at the layer granularity of ML models. This introduces bursts in I/O patterns, and can underutilize the migration bandwidth. In contrast, TeraIO considers all tensors and performs offloading at the GPU kernel granularity based on our precise tensor lifetime analysis. Our approach ensures high and consistent migration bandwidth utilization (see Fig. 9 (2)).
Migration Strategy: ZeRO-Infinity uses a heuristic-based policy to decide which tensors should be offloaded and when the offloading should start, regardless of the storage bandwidth usage. It lacks global optimization and produces unpredictable I/O patterns. In contrast, TeraIO quantifies the benefit and cost of tensor offloading via I/O-aware planning (Section 3.2), which generates globally optimized migration plans that maximize I/O bandwidth utilization.
3. Clarification of CPU Memory Requirements of ZeRO-Infinity
Since ZeRO-Infinity performs optimizer state updates on CPU, its design intentionally places gradients and optimizer states in CPU memory to reduce CPU-SSD I/O bandwidth consumption. This high CPU memory usage is fundamental to ZeRO-Infinity’s design.
4. Clarification of Speedup Breakdown
The speedup of TeraIO is mostly attributed to a more intelligent migration mechanism. To validate this, we conducted experiments with GPUDirect Storage (GDS) disabled, using the regular host-based path (data is first read from SSD to host memory and then transferred to the GPU). We obtained similar end-to-end performance (see table below).
Table: Training throughput comparison with and without GDS (tokens/s)
| | Llama3-8B (bs32, seq2k) | Granite-8B (bs32, seq4k) | Llama3-70B (bs8, seq2k) |
|---|---|---|---|
| TeraIO-SSD w/ GDS | 1719 | 861 | 57 |
| TeraIO-SSD w/o GDS | 1668 | 830 | 56 |
| TeraIO-Mixed w/ GDS | 1723 | 1319 | 91 |
| TeraIO-Mixed w/o GDS | 1713 | 1286 | 90 |
Although GDS does not have a significant impact on the end-to-end performance in our current setting (2 GPUs, each connected to 4 SSDs), it provides significant scalability advantages by eliminating host resource contention.
Without GDS, tensor offloading frameworks will suffer from scalability bottlenecks. To validate this, we scale the number of GPUs and SSDs and measure the host resource usage. As shown in the table below, host resource usage increases linearly. With 2 GPUs and 4 SSDs per GPU, over 100 GB/s of host memory bandwidth is consumed, and more than 3 CPU cores are fully used, since the host memory is used as a bounce buffer for data transfers between GPUs and SSDs.
Table: Host resource utilization without GDS
| GPUs | Metric | 1 SSD per GPU | 2 SSDs per GPU | 4 SSDs per GPU |
|---|---|---|---|---|
| 1 GPU | CPU Usage* | 51% | 83% | 189% |
| | Memory BW | 13.9 GB/s | 27.3 GB/s | 51.7 GB/s |
| 2 GPUs | CPU Usage* | 93% | 185% | 353% |
| | Memory BW | 27.7 GB/s | 57.5 GB/s | 102.2 GB/s |
*CPU Usage: Usage relative to a single CPU core.
It is worth noting that an important goal of TeraIO is to achieve low-cost LLM training, so it is critical to minimize the use of host CPU and memory resources. Compared to ZeRO-Infinity, TeraIO significantly reduces the host CPU and memory requirements, while offering better scalability for offloading tensors to low-cost SSDs.
5. Novelty of Tensor Profiler
Compared to existing DNN profilers, our tensor lifetime profiler has two unique features:
(1) Kernel-level tensor lifetime tracking: Our profiler can precisely capture the active/inactive periods of each tensor across GPU kernels. Existing profilers like Nsight Systems focus on performance bottleneck identification, while PyTorch Profiler provides only memory snapshots and kernel timing. None of them provides the tensor lifetime information needed for our optimized migration planning.
(2) Lightweight profiling with PyTorch integration: We instrumented PyTorch's automatic operator generator to capture all tensor activities at runtime with minimal overhead. Because of this, our profiler captures all tensors and their operators in a comprehensive manner, which existing tools like torch.compile() cannot achieve when the model includes custom operators or autograd hooks.
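For illustration, the following minimal sketch shows the kernel-level lifetime-tracking idea using PyTorch's public `TorchDispatchMode` extension point. This is not our actual instrumentation (which hooks PyTorch's automatic operator generator and uses GPU-side timing rather than the sync-based timing below); it only shows how per-operator tensor accesses and kernel times can be recorded to derive each tensor's active/inactive periods.

```python
import time
import torch
from torch.utils._python_dispatch import TorchDispatchMode
from torch.utils._pytree import tree_flatten

class LifetimeProfiler(TorchDispatchMode):
    """Records, per ATen operator, which GPU tensors are read/written and how
    long the operator takes, so each tensor's first/last use can be derived."""

    def __init__(self):
        super().__init__()
        self.kernel_idx = 0
        self.first_use = {}    # id(tensor) -> kernel index of first access
        self.last_use = {}     # id(tensor) -> kernel index of last access
        self.kernel_time = []  # per-operator wall-clock time (sync-based here)

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        flat_in, _ = tree_flatten((args, kwargs))
        torch.cuda.synchronize()
        start = time.perf_counter()
        out = func(*args, **kwargs)
        torch.cuda.synchronize()
        self.kernel_time.append(time.perf_counter() - start)
        flat_out, _ = tree_flatten(out)
        for t in flat_in + flat_out:
            if isinstance(t, torch.Tensor) and t.is_cuda:
                self.first_use.setdefault(id(t), self.kernel_idx)
                self.last_use[id(t)] = self.kernel_idx
        self.kernel_idx += 1
        return out

# Usage: run one or two profiling iterations inside `with LifetimeProfiler() as p:`;
# the gap between two recorded accesses of a tensor is its inactive period.
```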
6. Theoretical Analysis of Expected Performance
TeraIO provides a theoretical performance model for estimating the training time in the tensor migration algorithm. As described in Section 2.2, our performance model takes tensor sizes, profiled kernel execution time, the tensor migration plan, available I/O bandwidth, and GPU memory capacity as inputs to predict the training time. With each kernel’s execution time collected in our profiling, our performance model simulates tensor migration at the designated bandwidth and checks whether the tensors needed by the kernel are already in the GPU memory. If not, these tensors’ migration overhead will be added to the total training time. This enables us to examine different migration plans with different performance-critical parameters, and check whether the migration overhead can be fully hidden with the migration plan, as demonstrated in the roofline analysis (Fig. 3).
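As a simplified illustration of this simulation (not our actual implementation; all names are illustrative), the sketch below walks the profiled kernel timeline, schedules the planned prefetches at the designated bandwidth, and adds stall time whenever a tensor needed by a kernel has not finished migrating.

```python
from dataclasses import dataclass

@dataclass
class Prefetch:
    tensor_bytes: int
    issue_kernel: int   # kernel index at which the prefetch is issued
    needed_kernel: int  # kernel index whose execution requires the tensor

def estimate_iteration_time(kernel_time, prefetches, bandwidth_bytes_per_s):
    t = 0.0             # simulated wall clock
    stall = 0.0
    io_free_at = 0.0    # time at which the (single) migration channel is free
    ready_at = {}       # prefetch -> completion time
    for k, exec_time in enumerate(kernel_time):
        # Issue prefetches scheduled at this kernel (serialized on one channel).
        for p in prefetches:
            if p.issue_kernel == k:
                start = max(t, io_free_at)
                io_free_at = start + p.tensor_bytes / bandwidth_bytes_per_s
                ready_at[id(p)] = io_free_at
        # Stall if any tensor needed by this kernel is still in flight.
        for p in prefetches:
            if p.needed_kernel == k and ready_at.get(id(p), 0.0) > t:
                stall += ready_at[id(p)] - t
                t = ready_at[id(p)]
        t += exec_time
    return t, stall

# Example: two 10 ms kernels and a 1 GB prefetch issued at kernel 0 but needed
# at kernel 1, over a 32 GB/s channel; the transfer is only partially hidden,
# so the model reports the leftover time as a stall.
total, stalled = estimate_iteration_time(
    kernel_time=[0.010, 0.010],
    prefetches=[Prefetch(tensor_bytes=10**9, issue_kernel=0, needed_kernel=1)],
    bandwidth_bytes_per_s=32e9,
)
```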
7. Activation Recomputation
We choose activation offloading over recomputation for two major reasons: (a) Our roofline analysis shows that, with 32GB/s migration bandwidth, TeraIO can achieve near-ideal training performance. This bandwidth can be easily achieved by aggregating multiple commodity SSDs. In contrast, recomputation will consume precious GPU resources and could introduce extra computation overhead. (b) It is much easier to scale SSD resources at a lower cost by adding more SSD devices, compared to the approach of scaling expensive GPU resources. This reflects our “low-cost” design philosophy.
Moreover, activation recomputation is orthogonal to offloading. Supporting both recomputation and offloading would significantly increase the complexity of the framework design and implementation and requires extensive investigation and analysis. We wish to explore this as future work.
8. Clarification of Settings in the Characterization Study
We list the settings in the table below. For diversity, we used different batch sizes across different models to study their impact on the memory capacity requirements.
| | Llama3-8B | T5-11B | GPT2-40B | Llama3-70B |
|---|---|---|---|---|
| Batch size | 128 | 256 | 16 | 64 |
| Sequence length | 4,096 | 512 | 1,024 | 2,048 |
As we vary the parameters shown in the table, we reach the same conclusion. This is due to inherent LLM characteristics: since most LLMs use transformer architectures, their sparse tensor access patterns and high compute intensity stem from the fundamental structure of attention mechanisms and stacked transformer layers.
9. Clarification of Performance Gap
The performance gap still exists when the memory requirement of an LLM significantly exceeds the available GPU capacity. For large models like Llama3-70B (>900% memory usage), the volume of tensor migrations overwhelms the available I/O bandwidth even with optimal tensor migration plans. This causes GPU kernel stalls when the required tensors are still being migrated. The performance gap could be eliminated by scaling the migration bandwidth (e.g., adding more SSDs), as demonstrated in our roofline analysis (Section 2.2).
10. Kernel Stalls
To elaborate on kernel stalls and their impact on performance, we show the breakdown of end-to-end execution time in the table below. It includes three components: (a) the time when tensor migrations stall GPU computation, (b) the time when tensor migrations overlap with GPU computation, and (c) the remaining GPU computation time. On average, kernel stalls account for less than 9% of the total execution time, thanks to our optimized tensor migration schedule.
These kernel stalls can be predicted and minimized using our performance model, as discussed in Section 2.2 and our response to Question 6.
Table: Breakdown of training iteration time (kernel stall%, overlap%, computation only%)
| | Llama3-8B (bs32,seq2k) | Granite-8B (bs32,seq4k) | Llama3-70B (bs8,seq2k) |
|---|---|---|---|
| TeraIO-SSD | 4%,96%,~0% | 38%,59%,3% | 66%,34%,~0% |
| TeraIO-Mixed | 3%,75%,22% | 5%,75%,20% | 45%,53%,2% |
11. Emerging Memory Technologies
Although NVLink-C2C in NVIDIA Grace systems provides high GPU-CPU bandwidth, the memory capacity is still hard to scale due to the fundamental DRAM scaling walls (physical limitations, power wall, and architecture constraints). As the memory capacity is insufficient for large models, expanding the GPU/host memory with low-cost and scalable SSDs ($0.2/GB for SSDs vs. $4/GB for DRAM on average) has become a practical and promising solution.
Even if the cost is not a concern, our lifetime-aware tensor offloading approach can be applied to new and emerging memory technologies such as CXL memory. By expanding the GPU/host memory with external memory devices via CXL, new performance tradeoffs need to be considered, as the bandwidth and latency of accessing CXL memory vs. SSDs are different. Unfortunately, commodity CXL memory devices are not widely available on the market yet. We wish to extend TeraIO with new memory technologies as future work.
Overall, the additional experiments provided by the authors help clarify the performance benefits of TeraIO. Including these experiments and expanding the results to include more in-depth study of the ZeRO-Infinity baseline would strengthen the paper. As a result, I will increase my score. The primary remaining weakness would be the novelty of the profiler and discussion of related work in general.
- Is ZeRO-Infinity the SOTA Approach
- Thank you for clarifying, this discussion would be helpful in a dedicated related work section / paragraph of the final paper.
- Why TeraIO Outperforms ZeRO-Infinity
- The authors state: “ZeRO-Infinity plans the offloading of parameters and optimizer states at the layer granularity of ML models.“ but another reviewer disputes this; to confirm, it would be good to have some empirical analysis of the migration patterns of Zero-Infinity similar to Figure 9 in the paper.
- “ZeRO-Infinity uses a heuristic-based policy to decide which tensors should be offloaded and when the offloading should start, regardless of the storage bandwidth usage.” - The paper would benefit from more details about what these heuristics are and why they are suboptimal.
- Clarification of CPU Memory Requirements of ZeRO-Infinity
- There is some additional discussion of this limitation in Section 5.2.2 of the ZeRO-Infinity paper - it would be helpful to have a direct discussion of why this analysis does not hold for TeraIO.
- Clarification of Speedup Breakdown
- This analysis is helpful in isolating the impact of the migration algorithm. It would be nice to include the ZeRO-Infinity results in this table, as well.
- Novelty of Tensor Profiler
- Another reviewer brought up SwapAdvisor, HOME, PatrickStar, and Checkmate as other profilers. I am still not fully convinced that the profiling mechanisms in TeraIO are substantially different from these; for example, why is tracking GPU -> SSD transfers fundamentally more challenging than tracking GPU -> CPU transfers?
- Theoretical Analysis of Expected Performance
- I agree that the tensor migration algorithm and roofline analysis are sufficient here.
- Activation Recomputation
- Deferring activation recomputation to future work makes sense.
- Clarification of Settings in the Characterization Study
- Including these details in the appendix would be helpful.
- Clarification of Performance Gap
- Thanks for the clarification.
- Kernel Stalls
- This table is helpful. If possible, can you also add ZeRO-Infinity to this?
- Emerging Memory Technologies
- Thanks for the discussion, this would be interesting future work.
We thank the reviewer for the detailed comments. We will include the clarifications and discussions in the paper.
1. Why TeraIO Outperforms ZeRO-Infinity
(1) Clarification of Offloading Granularity:
We apologize for the confusion in our previous response. Both ZeRO-Infinity and TeraIO can perform offloading at the tensor granularity. However, when planning tensor migrations, ZeRO-Infinity selects tensor candidates for offloading at a coarse granularity (i.e., at the PyTorch module boundary) and then offloads a batch of tensors at once. This may cause burst I/O requests or underutilize the migration bandwidth. In contrast, TeraIO checks the execution graph and learns the lifetime of each tensor that would be present in GPU memory. With tensor lifetime analysis, TeraIO decides which tensor should be migrated/prefetched at what time (i.e., tensor migration planning), and associates the offloading and prefetching instructions with the corresponding operators.
We have measured the average GPU I/O bandwidth usage of ZeRO-Infinity. Due to format and space restrictions, we cannot show the figure here directly; we will add it to the paper.
(2) Details of the Heuristic-based Policy:
ZeRO-Infinity plans tensor migrations based on the operator sequence for each training iteration. At runtime, it keeps track of the index of the currently running operator in the operator sequence. For prefetching, before executing one operator, it always prefetches all parameters needed for the next few operators until the prefetch_bucket buffer is full (a configurable parameter, 4GB in our experiments). For offloading, after completing the execution of one layer (PyTorch module), it offloads all gradients that are no longer needed by the GPU. This approach follows an "on-demand" policy: it only keeps the model parameters used by the current operator in GPU memory and drops the model parameters used by previous operators. This not only wastes precious GPU memory (which could be used for hosting more tensors), but also causes unnecessary I/O traffic.
In contrast, TeraIO quantifies the benefit and cost of each tensor offloading via I/O-aware planning, which generates globally optimized migration plans that can maximize both I/O bandwidth utilization and GPU memory usage.
2. Clarification of CPU Memory Requirements of ZeRO-Infinity
Unlike ZeRO-Infinity, TeraIO does not have a strict requirement for CPU memory, because it does not perform the optimizer state updates on the CPU. TeraIO uses the GPU for the optimizer states computation. It works even without offloading tensors to CPU memory. We will clarify this in the paper.
3. Clarification of Speedup Breakdown
In the paper, we did not evaluate ZeRO-Infinity with GDS support. This is because the benefit of using GDS mostly lies in the alleviation of host resource contention (for the cost-effective training goal). It does not bring much performance benefit for LLM training, as shown in the table of “4. Clarification of Speedup Breakdown” in our rebuttal. We will clarify this in the paper.
4. Novelty of Tensor Profiler
Existing studies, such as SwapAdvisor, HOME, PatrickStar, and Checkmate, mostly rely on model dataflow graphs to plan tensor migrations or recomputations. However, the dataflow graphs only capture data dependencies and kernel execution order; they do not contain detailed information about kernel execution times.
In contrast, TeraIO’s profiler tracks the input/output tensors for each operator and measures the kernel execution time at runtime by instrumenting the PyTorch framework. It can capture precise tensor lifetime, which drives the lifetime-aware and I/O bandwidth-aware tensor migration planning in TeraIO.
5. Clarification of Settings in the Characterization Study
We appreciate your suggestion and will include them in the appendix.
6. Kernel Stalls
We update the table (see below) by including the breakdown of the training iteration time of ZeRO-Infinity. Compared to ZeRO-Infinity, TeraIO causes much less GPU kernel stall time by maximizing the overlap of tensor migrations with computation.
Table: Breakdown of training iteration time (kernel stall%,overlap%,computation only%)
| | Llama3-8B (bs32,seq2k) | Granite-8B (bs32,seq4k) | Llama3-70B (bs8,seq2k) |
|---|---|---|---|
| TeraIO-Mixed | 3%,75%,22% | 5%,75%,20% | 45%,53%,2% |
| ZeRO-Infinity | 31%,15%,54% | 30%,16%,54% | 49%,23%,28% |
Dear Reviewer 9SW7,
Please verify whether the authors have adequately addressed your comments, post your response, and submit the “Mandatory Acknowledgement” accordingly. Kindly note that you must engage in discussions with the authors and then submit the “Mandatory Acknowledgement” to avoid being marked as a non-participating reviewer, which may result in penalties.
Thanks for your support!
The paper proposes a tensor offloading framework for PCIe SSDs that enables LLM model training by analyzing the lifetime of active tensors. Based on the lifetime profiler, the proposed tensor migration planning algorithm decides when tensors are offloaded to and prefetched from GPUDirect storage and CPU memory. The evaluation shows that TERAIO outperforms state-of-the-art studies in training throughput by 1.47x on average.
Strengths and Weaknesses
The paper proposed an improved offload strategy by introducing a runtime tensor profiler. The migration plan is general to different I/O devices. The weakness of the work is its dependence on the GPUDirect technology, which limits its generalizability to non-NVIDIA GPUs or older generation systems. I suggest that the authors discuss why the work is limited to GPUDirect and what alternative technologies are available.
Questions
-
Why does the tensor migration plan first target SSD but not CPU memory, given that there is 1.5TB of DDR5 memory? The analysis for offloading and prefetching should also apply to main memory as well. In particular, it would be interesting to see how main memory and SSDs interact in offloading. For instance, if both are used at the same time, the available bandwidth could be the sum, or main memory could be used as an L4 cache for the SSDs.
-
In Line 32, the "DRAM scalability challenge" is unclear.
Limitations
(mentioned above)
Final Justification
My primary concern was the practical applicability of the work, particularly regarding alternatives to GDS technology and the role of CPU memory in the tensor migration strategy. The authors clarified both points sufficiently in their response.
However, I maintain a borderline accept decision. The paper, as well as the rebuttal, contains several vague or imprecise terms, such as "no additional cost in compilation" and "TeraIO does not have a minimum CPU memory requirement." These statements, while perhaps comprehensible in context, reflect a lack of rigor in writing. Such casual phrasing suggests the authors have not thoroughly scrutinized their responses and the precision of their claims. While the contributions have potential, the overall paper falls on the borderline of an acceptable academic article.
Formatting Issues
-
All tables should be in a better format.
-
The instruction block in the checklist should be removed.
1. Generalizability Issue of NVIDIA GDS
We used GPUDirect Storage (GDS) in TeraIO, because it provides significant scalability advantages by bypassing the performance bottlenecks caused by the host CPU. Without GDS, the conventional host-based approach (CPU reads data from SSDs to host memory, and then transfers to GPU) will consume significant hardware resources (CPU and memory) on the host, which not only creates bottlenecks as we scale the number of GPUs and SSDs, but also increases the cost for LLM training.
TeraIO does not necessarily rely on NVIDIA’s GPUDirect solutions. It is compatible with any peer-to-peer (P2P) direct storage solutions, such as AMD's DirectGMA.
2. Offloading Target Choice in the Migration Algorithm
TeraIO uses both CPU memory and SSDs as tensor offloading targets, as described in our tensor migration algorithm in Section 3.2 as well as the evaluation in Section 4.
TeraIO supports simultaneous migrations. However, it prioritizes SSDs because of their unique advantages in terms of large capacity, low cost ($0.2/GB), and scalable I/O bandwidth with GPUDirect Storage support. When generating optimized tensor migration plans, TeraIO predicts the bandwidth utilization of SSDs based on previously decided tensor migrations (see Line 17 in Algorithm 1). If TeraIO estimates that the SSD bandwidth will be saturated, it will offload the candidate tensors to the host memory.
When simultaneous migrations to both host memory and SSDs are needed, the GPU-CPU transfers will utilize the remaining PCIe bandwidth. TeraIO's I/O-aware tensor migration planning algorithm has taken this into consideration by tracking bandwidth usage across both targets (see Algorithm 1).
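The following simplified sketch (illustrative names only, not Algorithm 1 itself) shows the target-selection idea: an offload goes to SSD unless the planner predicts, from previously scheduled migrations, that the GPU-SSD link would be saturated within the tensor's offload window, in which case host memory is used. The saturation test below is an illustrative proxy, not our exact criterion.

```python
from dataclasses import dataclass, field

@dataclass
class ChannelPlan:
    bandwidth: float                           # bytes/s available on this channel
    busy: list = field(default_factory=list)   # scheduled (start, end) transfers

    def earliest_finish(self, start, nbytes):
        # Serialize after transfers already scheduled to overlap [start, ...).
        t = max([start] + [end for (_, end) in self.busy if end > start])
        return t + nbytes / self.bandwidth

    def reserve(self, start, nbytes):
        end = self.earliest_finish(start, nbytes)
        self.busy.append((start, end))
        return end

def choose_target(tensor_bytes, offload_start, must_finish_by,
                  ssd: ChannelPlan, host: ChannelPlan):
    """Return ('ssd' | 'host', completion_time) for one candidate offload.
    must_finish_by is an illustrative proxy for 'the SSD link is saturated
    during this window' (e.g., before GPU memory would otherwise overflow)."""
    ssd_done = ssd.earliest_finish(offload_start, tensor_bytes)
    if ssd_done <= must_finish_by:     # SSD bandwidth not saturated in this window
        return "ssd", ssd.reserve(offload_start, tensor_bytes)
    return "host", host.reserve(offload_start, tensor_bytes)
```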
3. DRAM Scalability Challenge
The DRAM scalability challenge refers to the fundamental limitations in scaling DRAM capacity. They include (1) physical limitation in which the number of pins on a processor package and DIMM modules is limited by the printed circuit board trace routing, which cannot easily scale; (2) power limitation in which higher-capacity DRAM modules consume significantly more power, resulting in thermal management challenges; and (3) scaling limitation in which DRAM cells cannot be made smaller without causing DRAM cell disturbances and retention time violations.
4. Paper Formatting Concerns
We will take an editorial pass at the paper and fix the formatting issues.
The rebuttal's justification for prioritizing SSDs over CPU memory is not convincing. The response emphasizes SSDs' larger capacity and lower cost, but these factors are irrelevant at runtime, given that the system already has access to 1.5TB of DDR5 memory. This comparison conflates cost considerations with performance trade-offs, which is misleading in the context of runtime performance. In addition, the term "scalable bandwidth" is vague and does not properly address the core concern: DRAM offers significantly lower latency and higher bandwidth than SSDs, even when considering techniques such as GPUDirect. The rebuttal does not clarify why the available DRAM was not utilized first for offloading, given that it is a faster and suitable option.
Also, what is the minimal required memory for TERAIO? And what's the impact on the performance when reducing CPU memory? All servers in the evaluation have huge memory capacity. An evaluation on a lower memory system can clarify my concern.
We thank the reviewer for the comments. TeraIO prioritizes SSDs over CPU memory for the following reasons.
(1) Runtime performance: since tensor offloading can be overlapped with GPU kernel executions, the latency of the data accesses is not on the critical path; instead, the migration bandwidth is critical for performance. Although CPU memory offers lower latency and higher bandwidth than SSDs, our roofline analysis with different migration bandwidths showed that a bandwidth of 32 to 48 GB/s for tensor migration is sufficient to achieve near-ideal performance for LLM training (Fig. 3). However, in real-world deployments, if the bandwidth between GPU and SSDs cannot reach the requested migration bandwidth (e.g., due to fewer SSDs, limited per-SSD bandwidth, or different PCIe topologies), TeraIO will use CPU memory as a backup offloading target. In our experiments, due to the limited PCIe slots on the machine, we have 4 SSDs (Samsung 990 PRO 2TB) per GPU, providing 16GB/s bandwidth for direct data transfer between GPU and SSDs (the per-SSD bandwidth arithmetic is spelled out after point (3)).
(2) Scalability: In terms of scaling the storage capacity for tensor offloading, SSDs provide more advantages than DRAM. Today, SSDs can easily scale to tens of TBs per PCIe slot, while DRAM scales only to 32GB-512GB per DIMM slot. To reach the same capacity, we would require an impractically high number of DIMM slots to match per-slot SSD density. Although our experimental server has 1.5TB DRAM, it has used all 24 DIMM slots in the server at a cost of 7,800 USD, while four 2TB SSDs only take one PCIe x16 slot (each SSD needs a x4 slot) at a cost of 640 USD in total. As we further scale the size of LLMs, the DRAM capacity will inevitably become the bottleneck.
(3) Cost: In addition to the performance and scalability reasons described above, an important goal of TeraIO is to achieve cost-effective LLM training. Therefore, it is critical to minimize the use of host CPU and memory resources (0.2 USD/GB for SSDs vs. 4 USD/GB for DRAM on average). The saved host CPU and memory can be utilized for other purposes (e.g., dataset preprocessing or serving non-ML tasks) to maximize the server efficiency.
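For concreteness, the bandwidth arithmetic behind point (1) works out as follows (assuming migration bandwidth aggregates roughly linearly across SSDs):

$$
B_{\text{per-SSD}} \approx \frac{16\ \text{GB/s}}{4} = 4\ \text{GB/s},
\qquad
N_{\text{SSDs per GPU}} \approx \frac{32\text{--}48\ \text{GB/s}}{4\ \text{GB/s}} = 8\text{--}12 .
$$

That is, reaching the 32-48 GB/s identified by the roofline analysis would take roughly 8-12 such SSDs per GPU, which is why CPU memory serves as the backup target on machines with fewer PCIe slots.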
TeraIO does not have a minimum CPU memory requirement; it works even without offloading to CPU memory (“TeraIO-SSD” in Fig. 7). We show the training throughput of TeraIO and ZeRO-Infinity as we vary the reserved CPU memory below. Note that ZeRO-Infinity has a minimum CPU memory requirement (170GB for Llama-8B, 770GB for Llama-70B) to store the gradients and optimizer states, since it performs optimizer state updates on the CPU.
Table: Training throughput (tokens/s) for Llama3-8B with different reserved CPU memory for offloading
| CPU Memory Capacity Used (GB) | 0 | 16 | 32 | 64 | 128 | 170* | 256 |
|---|---|---|---|---|---|---|---|
| TeraIO | 1688 | 1694 | 1691 | 1823 | 1823 | 1823 | 1823 |
| ZeRO-Infinity | / | / | / | / | / | 1173 | 1231 |
*170GB is the minimum CPU memory required for ZeRO-Infinity to successfully train Llama3-8B.
Table: Training throughput (tokens/s) for Llama3-70B with different reserved CPU memory for offloading
| CPU Memory Capacity Used (GB) | 0 | 128 | 256 | 512 | 770* | 1024 |
|---|---|---|---|---|---|---|
| TeraIO | 67 | 72 | 84 | 101 | 107 | 113 |
| ZeRO-Infinity | / | / | / | / | 78 | 81 |
*770GB is the minimum CPU memory required for ZeRO-Infinity to successfully train Llama3-70B.
As shown in the tables above, ZeRO-Infinity does not improve training performance much as we increase the reserved CPU memory. Since ZeRO-Infinity uses the reserved CPU memory to store gradients and optimizer states, it has a minimum requirement for the memory capacity. In addition, ZeRO-Infinity uses the reserved memory as the bounce buffer for data transfers between GPU and SSDs. However, without smart tensor migrations such as TeraIO's lifetime-aware migration, GPU kernel stalls still happen even if we further increase the reserved memory capacity, as the requested tensors are not ready.
TeraIO uses CPU memory as the backup offloading target, in case the GPU-SSD bandwidth is insufficient to meet the tensor migration requirement. To minimize GPU kernel stalls, TeraIO places the most critical tensors (those that would cause GPU kernel stalls if they cannot be loaded into GPU memory quickly) in the reserved CPU memory. For smaller models like Llama3-8B, a reserved memory capacity of 64GB is sufficient. A larger reserved CPU memory capacity brings more benefits to large models like Llama3-70B.
Note that, even when we do not reserve any CPU memory for TeraIO, its performance is still much better than ZeRO-Infinity with a large reserved CPU memory, because of its unique advantages in tensor lifetime analysis and migration planning. Instead of relying on heuristic-based approaches, the I/O-aware migration algorithm in TeraIO tracks estimated storage bandwidth usage and quantifies the benefit and cost of each planned tensor migration. This enables TeraIO to generate globally optimized migration plans.
Thank you for the detailed response. The responses address my questions.
Thank you very much for your constructive feedback. We will incorporate the clarification and discussion in the paper.
TERAIO introduces a lifetime-aware tensor offloading mechanism for cost-efficient LLM training. It profiles tensor activity at the first few iterations and leverages GPUDirect Storage to support DMA operations. TERAIO achieves fine-grained, bandwidth-aware offloading to SSDs and minimizes stalls when migrating inactive tensors back to the GPU storage. TERAIO significantly improves training throughput with an average of 1.47x and reduces hardware cost compared to state-of-the-art baselines.
Strengths and Weaknesses
Strengths
- It requires minimal code changes and easily integrates with PyTorch, making it practical for developers to adopt.
- Demonstrates strong performance over ZeRO-Offload and ZeRO-Infinity.
- The experimental data are clear, and the experimental design is explained in detail.
Weaknesses:
- While the paper provides thorough experimental comparisons against ZeRO-Offload and ZeRO-Infinity, it doesn’t include Smart-Infinity, a recent storage offloading framework.
Questions
Smart-Infinity also claims to outperform ZeRO-Infinity while scaling up the number of SSDs, so why was it not included as a baseline for comparison in the experiments? If possible, I would like to know the current performance difference between TERAIO and Smart-Infinity. This could be addressed either through an explanation or, ideally, by providing direct comparative data to demonstrate the significance.
Limitations
yes
Final Justification
Regarding the concern about whether there was a comprehensive baseline comparison, I have now received a clear explanation from the authors and am convinced that the issue has been addressed. Therefore, I am willing to increase my score to 5.
Formatting Issues
Not applicable
1. Comparison to Smart-Infinity
(1) The research approach of TeraIO is fundamentally different from Smart-Infinity. Smart-Infinity focuses on using specialized computational storage devices to accelerate the optimizer state updates. It can be viewed as “ZeRO-Infinity + computational storage”. TeraIO proposes a low-cost tensor offloading framework that focuses on optimizing tensor migrations with commodity SSDs.
(2) The hardware settings of TeraIO vs. Smart-Infinity are different. Smart-Infinity requires specialized computational storage devices. TeraIO uses commodity SSDs and does not require hardware changes or specialized hardware accelerators.
(3) Smart-Infinity's tensor offloading mechanism remains identical to ZeRO-Infinity's. With its lifetime-aware tensor migration scheme, TeraIO significantly outperforms the offloading approach proposed in ZeRO-Infinity, as examined in our evaluation. For the above reasons, our current submission did not include a comparison with Smart-Infinity. We will add the discussion to the paper.
Thank you for your clarification. As clarified, Smart-Infinity is indeed a case of hardware enhancement without innovation in the offloading policy. I believe my concerns have been addressed, and I have no further questions. And I'm willing to increase my score to 5.
We really appreciate your feedback, and will incorporate the discussion in the paper.
The authors proposed TERAIO, aiming at efficient, transparent tensor placement in offloaded scenarios with persistent storage devices. Given that offloading to SSDs has long been a bottleneck for large models, well-orchestrated tensor management is critical to the feasibility of this technique. By transparently tracking the life cycle of all tensors, TERAIO can plan ahead and make optimal transfer schemes with high generalizability. Additionally, the usage of GDS is also beneficial.
Strengths and Weaknesses
Strengths
- TERAIO is built on the basis of PyTorch, indicating it is model- and workload-agnostic.
- GPUDirect Storage further cuts down the potential latency in disk I/O.
- Optimized tensor migration planning is promising.
Weakness
- The paper lacks ablation studies highlighting the effectiveness of several key designs in TERAIO (e.g., GPUDirect Storage, tensor migration planning).
- GPUDirect Storage can significantly cut down I/O latency, but it is not very beneficial for bandwidth improvements. The latency issue can also largely be hidden by TERAIO's planning.
- The generalizability of TERAIO is not well demonstrated. The LLM workloads in Sec. 4 are too homogeneous.
- The text embedded in Figure 4 is incorrectly referenced. Please double-check.
Questions
- Please clarify the ZeRO settings (stages, DeepSpeed json parameters). They are performance critical. Some of the DeepSpeed mechanisms like persistent params are very similar to TERAIO functionalities.
- TERAIO seems to make a strong assumption about compilation. Will it also work for non-compiled pipelines?
- Without gradient checkpointing, activations accumulate during training. As they will be fetched later, they are likely to be captured by TERAIO and offloaded, which will inevitably consume precious downward bandwidth. Please kindly examine the tradeoff between gradient checkpointing and activation offloading.
- You have mentioned the batch size boost gain (l313), but it is not shown in the figures or tables. Please clarify.
- For ZeRO-Infinity data point in Fig. 8, you only tested one size of CPU-side buffer. Please justify.
- There are other tensor lifecycle management works such as SwapAdvisor, HOME, Checkmate, and PatrickStar. Is there any comparison? If they do not work on similar ground, please explain.
Limitations
Yes
Formatting Issues
N/A
1. Ablation Studies
We conducted the ablation study as follows.
We conducted experiments with GPUDirect Storage (GDS) disabled, using the conventional host-based I/O approach (the CPU reads data from SSDs to host memory, then transfers it to the GPU). We obtained similar end-to-end performance (see table below), showing that the performance speedup is mostly attributed to the optimized tensor migration planning.
Table: Training throughput w/ and w/o GDS (tokens/s)
| | Llama3-8B (bs32,seq2k) | Granite-8B (bs32,seq4k) | Llama3-70B (bs8,seq2k) |
|---|---|---|---|
| TeraIO-SSD w/ GDS | 1719 | 861 | 57 |
| TeraIO-SSD w/o GDS | 1668 | 830 | 56 |
| TeraIO-Mixed w/ GDS | 1723 | 1319 | 91 |
| TeraIO-Mixed w/o GDS | 1713 | 1286 | 90 |
GDS provides significant scalability advantages by eliminating host resource contention. Without GDS, tensor offloading frameworks suffer from scalability bottlenecks. To validate this, we scale the number of GPUs and SSDs and measure the host resource usage. As shown in the table below, host resource usage increases linearly with the number of GPUs and SSDs. With 2 GPUs and 4 SSDs per GPU, over 100 GB/s of host memory bandwidth is consumed, and more than 3 cores are fully used, as host memory is used as a bounce buffer for data transfers between GPUs and SSDs.
Table: Host resource utilization without GDS
| GPUs | Metric | 1 SSD per GPU | 2 SSDs per GPU | 4 SSDs per GPU |
|---|---|---|---|---|
| 1 GPU | CPU Usage* | 51% | 83% | 189% |
| | Memory BW (GB/s) | 14 | 27 | 52 |
| 2 GPUs | CPU Usage* | 93% | 185% | 353% |
| | Memory BW (GB/s) | 28 | 58 | 102 |
*Usage relative to a single core.
We also show the breakdown of end-to-end execution time in the table below. It includes three components: (a) the time when tensor migrations stall GPU computation, (b) the time when tensor migrations overlap with GPU computation, and (c) the remaining GPU computation time.
Table: Breakdown of training iteration time (kernel stall%,overlap%,computation only%)
| | Llama3-8B (bs32,seq2k) | Granite-8B (bs32,seq4k) | Llama3-70B (bs8,seq2k) |
|---|---|---|---|
| TeraIO-Mixed | 3%,75%,22% | 5%,75%,20% | 45%,53%,2% |
| ZeRO-Infinity | 31%,15%,54% | 30%,16%,54% | 49%,23%,28% |
Compared to ZeRO-Infinity, TeraIO has less stall time, as it maximizes the overlap of tensor migrations with computation. TeraIO outperforms ZeRO-Infinity due to design differences in two aspects:
Offloading Granularity: ZeRO-Infinity plans the offloading of tensors at the layer granularity of ML models. This introduces burst I/O patterns and can underutilize migration bandwidth. In contrast, TeraIO performs tensor offloading at the GPU kernel granularity based on our precise tensor lifetime analysis. This approach ensures high and consistent migration bandwidth utilization (see Fig. 9 (2)).
Migration Strategy: ZeRO-Infinity uses a heuristic-based policy to decide which tensors should be offloaded and when the offloading should start, regardless of the storage bandwidth usage. It lacks global optimization and produces unpredictable I/O patterns. In contrast, TeraIO quantifies the benefit and cost of tensor offloading via I/O-aware planning (Section 3.2), which generates globally optimized migration plans that maximize I/O bandwidth utilization.
2. Benefits of GPUDirect Storage (GDS)
We choose GDS since it provides significant scalability advantages (see Question #1).
An important goal of TeraIO is to achieve low-cost LLM training, thus it is critical to minimize the use of host CPU and memory resources. Compared to ZeRO-Infinity, TeraIO significantly reduces the host resource requirements while offering better scalability.
3. LLM Workloads Evaluated
We added GPT2-20B to our evaluation; the results are as follows:
Table: Training throughput (tokens/s) for GPT-20B
| Config | bs8,seq2048 | bs16,seq1024 | bs16,seq2048 | bs32,seq2048 |
|---|---|---|---|---|
| Ideal | 546 | 573 | 614 | 656 |
| TeraIO-SSD | 277 | 294 | 374 | 414 |
| TeraIO-Mixed | 438 | 325 | 463 | 469 |
| ZeRO-Offload | 270 | 276 | 309 | 331 |
| ZeRO-Infinity | 212 | 214 | 256 | 283 |
The results are consistent with our evaluation in Fig. 7.
4. Fig. 4
We will fix it in the paper.
5. ZeRO Settings
We show the performance-critical parameters in the table below. With these settings, we ensure ZeRO-Infinity achieves reasonable performance.
Table: Performance-critical parameters in ZeRO-Infinity
| Parameter | Value | Description |
|---|---|---|
| stage | 3 | Uses ZeRO-3 |
| pipeline_read/write | true | Overlaps read/write of next/previous tile with computation of current tile |
| pin_memory | true | Uses page-locked CPU memory for faster transfers |
| buffer_count (optimizer) | 4 | Number of async I/O buffers for optimizer states |
| buffer_count (param) | 18 | Number of async I/O buffers for parameters |
| buffer_size (param) | 300/540M | Size of each parameter buffer |
| param_persistence_threshold | 100K | Do not partition parameters smaller than this threshold |
| model_persistence_threshold | sys.maxsize | Upper bound of unpartitioned parameters |
We enabled the pipeline_read/write parameters to optimize computation and data I/O overlap during optimizer state updates. We tuned parameters pin_memory, buffer_count, and buffer_size to optimize tensor offloading throughput. For param_persistence_threshold and model_persistence_threshold, we use default values.
6. Compilation Assumptions
Similar to state-of-the-art tensor offloading studies like ZeRO-Infinity, TeraIO uses ML compilers to extract rich semantic information from models for guiding the tensor offloading. This approach is a standard practice as current LLM training systems already use compilers like torch.compile() for performance optimizations. Extracting semantics from model execution comes at no additional cost.
For non-compiled pipelines, TeraIO can still track tensor operators and measure kernel execution times at runtime by instrumenting the PyTorch framework, collecting the kernel execution time and the input/output tensors of each operator upon its execution (Section 3.1).
TeraIO is also extensible to LLMs that involve dynamic computation. Take MoE as an example: although expert routing is dynamic and determined at runtime, most parts of the model are still static. TeraIO can identify the static parts during the profiling process and perform lifetime-aware migrations for their tensors. For the dynamic parts, such as expert routing, we can treat them as single operators and skip the fine-grained migrations for their tensors.
7. Tradeoffs Between Gradient Checkpointing and Activation Offloading
We choose activation offloading over recomputation for two reasons: (a) our roofline analysis shows that with the migration bandwidth of 32GB/s, TeraIO can achieve near-ideal performance. This bandwidth can be easily achieved by aggregating multiple SSDs. In contrast, recomputation will consume precious GPU resources and introduce extra computation overhead. (b) It is much easier to scale SSD resources at a lower cost by adding more SSD devices, compared to the approach of scaling expensive GPU resources.
Activation recomputation is orthogonal to offloading. As having both activation recomputation and offloading significantly increases the complexity of the framework design and implementation, we wish to explore this as future work.
8. Batch Size Boost Gain
Table: Performance of training Llama3-8B using TeraIO with various batch sizes
| Batch Size | Micro-batch Size | Can Run w/o Offloading | Throughput (tokens/s) | Shown in Fig. 7 |
|---|---|---|---|---|
| 32 | 2 | ✓ | 1749 | ✗ |
| 32 | 4 | ✗ | 1723 | ✓ |
| 128 | 4 | ✗ | 1911 | ✓ |
The table above shows the throughput of training Llama3-8B (seq_len=2K) with different batch sizes. The hardware setting of two H100 GPUs has sufficient memory for training with batch size 32 and micro-batch size 2. However, with larger (micro-)batch sizes, out-of-memory errors occur.
TeraIO enables training with batch size of 128, delivering 9% higher throughput compared to training without offloading (but using the maximum batch size). This shows that, for the models that can fit in the GPU memory, TeraIO can still achieve better performance by allowing larger batch sizes.
9. Clarification of the ZeRO-Infinity Data Point
As ZeRO-Infinity performs optimizer state updates on CPU, it requires CPU memory to store the optimizer states.
In Fig. 8, we evaluate the performance of ZeRO-Infinity with 4 SSDs. We obtained the performance with the minimum CPU memory it requires. For Llama3-8B, TeraIO outperforms ZeRO-Infinity in both performance and CPU memory usage. For Llama3-70B, we added an experiment and ran ZeRO-Infinity with 1TB CPU memory. TeraIO still outperforms ZeRO-Infinity. The result is in the table below.
Table: Throughput (tokens/s) of training Llama3-70B with different CPU memory capacities
| Configuration | 512GB CPU Mem | 770GB CPU Mem | 1TB CPU Mem |
|---|---|---|---|
| 2 SSDs (TeraIO) | 83 | / | 105 |
| 4 SSDs (TeraIO) | 101 | / | 113 |
| ZeRO-Infinity | / | 78 | 81 |
10. Novelty of Tensor Lifecycle Management
TeraIO differs fundamentally from these existing works in both research goal and approach.
(1) SwapAdvisor, HOME, and PatrickStar focused on data migrations between GPU and CPU memory; Checkmate focused on alleviating the GPU memory bottleneck with tensor recomputation. These studies did not treat the cost-efficiency of LLM training as a first-class goal in their designs. In contrast, TeraIO focuses on tensor migrations between GPU memory and low-cost SSDs, aiming to provide a cost-effective approach for training LLMs.
(2) TeraIO introduces unique advantages in tensor lifetime analysis and migration planning. Unlike existing works that rely on model dataflow graphs, TeraIO uses precise tensor lifetime information for generating optimized and fine-grained tensor migration plans. Instead of relying on heuristic-based approaches, the I/O-aware migration algorithm in TeraIO tracks estimated storage bandwidth usage and quantifies the benefit and cost of each planned tensor migration. This enables TeraIO to generate globally optimized migration plans that maximize I/O bandwidth utilization.
The authors mostly addressed my concerns. Some remaining questions:
- Compiling is not "at no cost" as you've claimed. It can be time consuming.
- On ablation studies: The rebuttal still did not show the performance gain from your mitigation strategy.
- Offloading granularity: ZeRO-Infinity DOES offload at tensor granularity instead of "layer".
- What about the host CPU utilization with GDS enabled? If I missed that part, please kindly indicate.
We thank the reviewer for the feedback. Please find our response as follows.
1. Compilation Cost
TeraIO leverages ML compilers to extract semantic information from the execution graph of compiled models, for conducting tensor lifetime analysis and generating tensor migration plans. The extraction overhead is negligible, as the execution graph of the model has already been produced by the ML compiler. The overhead of generating tensor migration plans based on the lifetime analysis is 31.2–396.6 seconds (see the table below), depending on the model size. This is less of a concern, because it enables efficient tensor offloading, which significantly reduces the model training time (a 1.47x speedup over ZeRO-Infinity and ZeRO-Offload on average). After the plan generation, we do not need to recompile ML models, as the corresponding offloading and prefetching instructions have been integrated into the execution graph. We instrumented PyTorch's automatic operator generator to add a hook function before each operator's execution; the hook function checks our migration plan to decide whether any tensor migrations have been scheduled for that operator. If so, the corresponding offloading and prefetching instructions are executed in the background during training.
Table: The overhead of generating migration plans
| Model | Llama3-8B | Granite-code-base-8B | GPT-20B | Llama3-70B |
|---|---|---|---|---|
| Time (s) | 31.2 | 37.9 | 64.2 | 396.6 |
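To make the hook mechanism concrete, here is a minimal sketch with assumed names (our implementation lives inside PyTorch's operator generator and issues GDS transfers to SSDs; the sketch uses a side CUDA stream and host memory purely for illustration):

```python
import torch

# migration_plan maps an operator index to the list of planned actions for it.
# Each action is ("prefetch" | "offload", tensor). The plan is produced offline
# by the lifetime-aware planner; here it is just an empty placeholder.
migration_plan = {}
io_stream = torch.cuda.Stream()  # side stream so I/O overlaps with compute

def pre_op_hook(op_index: int) -> None:
    """Called right before operator `op_index` executes."""
    for action, tensor in migration_plan.get(op_index, []):
        with torch.cuda.stream(io_stream):
            if action == "prefetch":
                # Bring the tensor back to GPU memory ahead of its next use.
                tensor.data = tensor.data.to("cuda", non_blocking=True)
            else:
                # Evict to the offload target; real TeraIO writes to SSDs via
                # GDS, and (ideally pinned) host memory stands in for it here.
                tensor.data = tensor.data.to("cpu", non_blocking=True)
```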
2. Performance Benefit of Migration Strategy
The performance benefit of TeraIO mostly comes from tensor migration planning based on our tensor lifetime analysis. As shown in the first table in our ablation study, we did not observe significant performance differences between the cases with and without GPUDirect Storage. TeraIO adopted GPUDirect Storage mainly for alleviating the host resource contention, and expanding the GPU memory with SSDs for lowering the training cost.
As shown in the third table in our ablation study, TeraIO causes much less GPU kernel stall time, in comparison to ZeRO-Infinity. This is because TeraIO can maximize the overlaps of tensor migrations with GPU kernel executions, thanks to its lifetime-aware tensor migration planning.
3. Offloading Granularity
We apologize for the confusion in our previous response. Both ZeRO-Infinity and TeraIO can perform offloading at the tensor granularity. However, when planning tensor migrations, ZeRO-Infinity selects the tensor candidates for offloading at a coarse granularity (i.e., at the PyTorch module boundary) and then offloads a batch of tensors at once. Therefore, it may cause burst I/O requests and underutilize the migration bandwidth. In contrast, TeraIO checks the execution graph and learns each tensor's lifetime in GPU memory. Based on the tensor lifetime analysis, TeraIO decides which tensor should be migrated/prefetched at what time (i.e., tensor migration planning), and associates the offloading and prefetching instructions with the corresponding operators. Since our tensor migration algorithm also considers the dynamic impact of one tensor's offloading on the available migration bandwidth, TeraIO can best utilize the migration bandwidth and minimize GPU kernel stalls.
4. Host Resource Utilization with GDS Enabled
We update the second table in our previous response (see below) by adding the host resource utilization with GDS enabled. As we increase the number of GPUs and SSDs, the CPU utilization increases. This is because our current implementation of GDS uses NVIDIA's cuFile library, which still relies on the host file system to manage SSDs (e.g., filesystem metadata operations). However, with GDS enabled, the host memory bandwidth utilization is very low, and the host CPU utilization is lower than in the case with GDS disabled. This is because, unlike the conventional host-based I/O approach that uses host memory as the bounce buffer, GDS does not involve intensive memory operations on the host.
Note that recent GDS studies such as BaM or GPU-initiated storage [29] allow the GPU to fully bypass the host CPU by moving both the control path and the data path for interacting with SSDs to the GPU. However, this requires the GPU to manage the SSDs on its own, which takes significant GPU software development effort. Therefore, in our current implementation, we use NVIDIA's default cuFile solution, with the expectation of widespread adoption. We will add this discussion to the paper.
Table: Host resource utilization with and without GDS
| GPUs | Metric | GDS | 1 SSD per GPU | 2 SSDs per GPU | 4 SSDs per GPU |
|---|---|---|---|---|---|
| 1 GPU | CPU Usage* | Disabled | 51% | 83% | 189% |
| | CPU Usage* | Enabled | 41% | 71% | 169% |
| | Memory BW (GB/s) | Disabled | 14 | 27 | 52 |
| | Memory BW (GB/s) | Enabled | 0.4 | 0.6 | 1.5 |
| 2 GPUs | CPU Usage* | Disabled | 93% | 185% | 353% |
| | CPU Usage* | Enabled | 84% | 163% | 329% |
| | Memory BW (GB/s) | Disabled | 28 | 58 | 102 |
| | Memory BW (GB/s) | Enabled | 0.9 | 1.2 | 2.6 |
*Usage relative to a single core.
Dear Authors,
Thank you for your prompt response. The rebuttal partially addressed my concerns, but the performance results are still not entirely persuasive. I have decided to keep my positive score.
Thank you for the response. We would like to further clarify why TeraIO outperforms ZeRO-Infinity for its advantage in the migration strategy.
Specifically, ZeRO-Infinity uses a heuristic-based policy. It plans tensor migrations based on the operator sequence for each training iteration. For prefetching, before executing one operator, it always prefetches all parameters needed for the next few operators until the prefetch_bucket buffer is full (a configurable parameter, 4GB in our experiments). For offloading, after completing the execution of one layer (PyTorch module), it offloads all gradients that are no longer needed by the GPU. This approach follows an "on-demand" policy: it only keeps the model parameters used by the current operator in GPU memory and drops the model parameters used by previous operators. This not only wastes precious GPU memory (which could be used for hosting more tensors) but also causes unnecessary I/O traffic.
In contrast, TeraIO generates a globally optimized tensor migration plan by quantifying the benefit and cost of each tensor offloading via I/O-aware planning. TeraIO prioritizes offloading larger tensors with longer inactive periods, as this reduces GPU memory pressure for a longer time. In addition, TeraIO offloads tensors in a conservative manner: it offloads a minimum number of tensors until the GPU memory can host all the tensors needed for the next kernel execution, which best utilizes the precious GPU memory space.
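As a simplified sketch of this selection rule (illustrative names, not the exact algorithm in the paper): candidates are ranked by size multiplied by the length of their inactive period, and offloading stops as soon as the next kernel's working set fits in GPU memory.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    nbytes: int
    inactive_kernels: int   # kernels until the tensor is needed again

def select_offloads(candidates, resident_bytes, needed_bytes, gpu_capacity):
    """Return the minimal, benefit-ranked set of tensors to offload so that
    resident data plus the next kernel's working set fits in GPU memory."""
    ranked = sorted(candidates,
                    key=lambda c: c.nbytes * c.inactive_kernels, reverse=True)
    selected, freed = [], 0
    for c in ranked:
        if resident_bytes - freed + needed_bytes <= gpu_capacity:
            break                # enough room for the next kernel already
        selected.append(c)
        freed += c.nbytes
    return selected
```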
The impact of TeraIO vs. ZeRO-Infinity on training performance is reflected in the stall time of GPU kernel executions. As shown in the third table in our rebuttal, we demonstrated the breakdown of a training iteration into three parts: (a) the kernel stall – the time when tensor migrations stall GPU computation, (b) the overlap – the time when tensor migrations overlap with GPU computation, and (c) the computation – the remaining GPU computation time. The table shows that TeraIO creates more opportunities for overlapping computation and tensor migrations, and incurs less GPU kernel stall time. For very large models like Llama3-70B (>900% memory usage), the volume of tensor migrations overwhelms the available migration bandwidth, which causes more GPU kernel stalls. However, TeraIO still makes a best effort to maximize the overlap between GPU kernel execution and tensor migration, delivering 1.33x better training performance than ZeRO-Infinity.
Dear Reviewer,
Thank you again for your insightful feedback. As the discussion period is coming to an end, we want to make sure that we have fully addressed your concerns. If you have any further questions, please let us know.
We truly appreciate your time and valuable feedback. We will incorporate the discussions in the paper.
This paper presents a new tensor offloading system, TERAIO, to enhance the efficiency of LLM training over multiple GPUs and SSDs. The design of this system comes from the observation that only a small portion of tensors are active during LLM training. Based on this observation, it estimates the lifetime of each tensor by only profiling a small number of training iterations, and then generates the optimized tensor offloading strategy. TERAIO can be directly integrated with PyTorch to support compiled LLM training workloads, and experiments also demonstrate its superiority over baselines.
The reviewers agree that this is solid work, well written, and that the solution is interesting and promising. The authors' responses with new experiments and explanations address the majority of the reviewers' comments. Given that each reviewer is positive about this paper, it is recommended for acceptance. The authors are still encouraged to address the remaining points in the revision, including a clearer explanation of the compilation cost and the performance benefits.