HiRemate: Hierarchical Approach for Efficient Re-materialization of Neural Networks
Abstract
Reviews and Discussion
The paper introduces HiRemate, a hierarchical framework for neural network re-materialization to reduce memory usage during training. The core idea involves recursively partitioning the computation graph into manageable subgraphs, solving each with optimized strategies, and merging solutions hierarchically.
Questions for Authors
See the above sections.
Claims and Evidence
- Comparisons with Checkmate on small graphs (Table 2) show close performance, but no theoretical guarantees or bounds on suboptimality are provided. In fact, for the 5-layer GPT-2, H-ILP already shows a performance gap of more than 3% compared to the optimal schedule achieved by Checkmate.
- The problem statement in Section 3 seems highly simplified. Do you assume a topological order of the computational graph here? How do you handle dependencies between layers/operators?
Methods and Evaluation Criteria
- Hierarchical decomposition is sensible for scalability, but the trade-off between partition granularity and solution quality is not rigorously analyzed.
Theoretical Claims
- As mentioned above, comparisons with Checkmate on small graphs (Table 2) show close performance, but no theoretical guarantees or bounds on suboptimality.
Experimental Design and Analysis
- Please include comparisons with dynamic re-materialization methods (e.g., DTR [A]) so that readers can more effectively evaluate the benefits of the hierarchical approach, especially considering that statically planning the schedule for a Transformer model with 2500 nodes takes 2.5 hours, which may be too expensive.
- Comparisons with Checkmate are limited to small graphs, and Rockmate is only tested on sequential models. Please consider including RNN or SSM experiments.
Supplementary Material
- Appendix E/F: Extended experiments validate generality but lack diversity in model types (e.g., no RNNs or recent State Space Models).
Relation to Prior Work
It introduces a hybrid method that integrates hierarchical and solver-based approaches to address a key challenge in re-materialization.
Missing Important References
[A] Marisa Kirisame, Steven Lyubomirsky, Altan Haan, Jennifer Brennan, Mike He, Jared Roesch, Tianqi Chen, Zachary Tatlock, "Dynamic Tensor Rematerialization", ICLR, 2021.
Other Strengths and Weaknesses
See the above sections.
Other Comments or Suggestions
See the above sections.
We thank the reviewer for the careful reading and detailed feedback. Below we address each point of the review in turn.
On Lack of Theoretical Guarantees and Suboptimality Gap
We agree that HiRemate does not provide formal guarantees on optimality. This is also true of Rockmate and TW-Remat, which, like HiRemate, are designed for scalability rather than provable optimality. Our focus is on solving large, complex computation graphs that are intractable for methods like Checkmate. The observed gap on small graphs (e.g., the 3% difference on the 5-layer GPT-2) reflects this trade-off. HiRemate is designed to let users balance solution quality and solving time by:
- Adjusting the number of memory budget levels used in lower-level ILPs,
- Controlling partition granularity (larger subgraphs often improve quality but require more time),
- Using mixed solvers (e.g., DP on sequential parts, ILP only where needed),
- Optionally incorporating user-defined partitioning strategies.
These controls allow users to tune the solution process as needed. While we currently lack a formal analysis of suboptimality bounds, we are conducting further experiments to better characterize this trade-off in practice.
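Purely as an illustration of these knobs (the option names below are invented for this sketch and do not correspond to HiRemate's actual configuration interface), the trade-off can be pictured as a configuration of the following kind:

```python
# Hypothetical configuration sketch; names are illustrative only.
solver_config = {
    "budget_levels": 8,            # memory-budget levels explored by lower-level ILPs
    "max_subgraph_size": 200,      # partition granularity: larger subgraphs -> better quality, longer solve
    "solver_per_subgraph": {
        "sequential": "dynamic_programming",  # cheap solver where the structure allows it
        "general": "ilp",                     # ILP only where it is really needed
    },
    "custom_partitioning": None,   # optional user-defined partitioning strategy
}
```

Coarser settings shorten solving time at the cost of schedule quality, and vice versa.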
On Problem Statement and Dependency Handling
We clarify that HiRemate does not assume a topological order in the original graph. We extract the forward-backward graph using torch.export() at the level of torch/aten operations. The graph is simplified using the rkgb tool (Zhao et al., 2023) to remove view operations and isolate meaningful compute and data nodes (Appendix C.1).
This graph forms the input to our partitioning procedure (Section 3.1), which does not require topological ordering. A valid topological order is computed only within each subgraph before solving the H-ILP problem, as required for the ILP formulation -- similar to Checkmate.
The generated schedule consists of forward aten calls, backward Autograd calls, and memory deallocations (see Appendix C.2). Dependencies are fully respected by construction and execution is sequential.
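For concreteness, the kind of forward graph this pipeline starts from can be inspected directly with torch.export; the toy model below is only an illustration (the actual pipeline additionally builds the backward part and simplifies the graph with rkgb):

```python
import torch
from torch import nn
from torch.export import export

# Toy model standing in for an arbitrary static-graph network.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
example_inputs = (torch.randn(8, 64),)

# Capture the forward graph at the torch/aten level.
exported = export(model, example_inputs)

# 'call_function' nodes are the compute operations a re-materialization
# planner would schedule and, possibly, recompute.
for node in exported.graph_module.graph.nodes:
    if node.op == "call_function":
        print(node.name, node.target)
```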
On Partition Granularity vs. Solution Quality
HiRemate is designed to enable trade-offs between granularity and schedule quality. We evaluate this effect on encoder-decoder Transformers in Figure 4, which shows how deeper hierarchies (finer partitions) impact both solving time and memory usage.
A fully rigorous analysis would require evaluating several partitioning strategies across multiple architectures. We agree this is an important direction and are currently running additional experiments, although they were not yet complete at the time of submission.
On Comparison to Dynamic Re-materialization (e.g., DTR)
Static and dynamic re-materialization strategies differ mainly in how they handle the structure of the computational graph. Static re-materialization is well suited for models with fixed computation graphs, where memory usage can be precomputed and optimized for predictable and efficient execution. Dynamic strategies, on the other hand, are better adapted to models with variable or input-dependent control flow, where runtime decisions enable on-the-fly memory management. While dynamic approaches offer more flexibility, they generally come with higher runtime overhead and less predictable behavior; static strategies, in contrast, tend to surpass dynamic methods in memory efficiency when applied to well-structured models. Quantifying these differences is complex because most dynamic strategies incorporate other algorithmic ingredients (typically paging), as in POET (https://proceedings.mlr.press/v162/patil22b/patil22b.pdf) or MegTaiChi (https://dl.acm.org/doi/pdf/10.1145/3524059.3532394), which makes direct comparison difficult. We see integration with or benchmarking against these methods as important future work.
On Checkmate and Rockmate Comparisons
Checkmate produces optimal schedules but scales poorly with graph size. We include comparisons where tractable (e.g., small GPT models), but for larger graphs, Checkmate does not complete in reasonable time. HiRemate is designed to provide approximate but scalable solutions for those larger models. In practice, we use Checkmate when feasible, and HiRemate otherwise. Rockmate is designed for block-sequential architectures. When applied to non-sequential graphs (e.g., U-Nets), it treats the entire model as a single block, falling back to Checkmate behavior. As such, on non-sequential models, Rockmate effectively reduces to a limited form of Checkmate and does not offer meaningful additional comparison.
Regarding RNNs and SSMs
Please refer to the answer to Reviewer YaUT.
The paper presents a novel hierarchical framework to optimize memory usage during neural network training. It recursively partitions large computation graphs and applies optimized solvers at multiple levels to significantly reduce memory consumption while maintaining efficient training performance.
Questions for Authors
Please refer to Weaknesses.
Claims and Evidence
The claims are well-supported by experimental results. The paper provides empirical evidence demonstrating significant memory savings with minimal computational overhead across multiple neural network architectures.
Methods and Evaluation Criteria
The proposed HiRemate framework and its evaluation criteria are well-aligned with the problem of optimizing memory usage in deep learning training.
Theoretical Claims
The paper provides a rigorous theoretical analysis of its algorithm.
Experimental Design and Analysis
The paper includes exhaustive comparisons with state-of-the-art baselines, demonstrating its effectiveness across multiple architectures and hierarchy depths.
Supplementary Material
.
Relation to Prior Work
HiRemate extends prior work on ILP-based re-materialization methods such as Checkmate and TW-Remat, which optimize memory usage by selectively recomputing activations. Unlike these approaches, which struggle with large computation graphs due to scalability issues, HiRemate introduces a hierarchical partitioning strategy that allows it to efficiently scale to larger models.
Missing Important References
The related work seems exhaustive.
Other Strengths and Weaknesses
Strengths
- The paper provides a rigorous theoretical analysis of its algorithm.
- The paper includes exhaustive comparisons with state-of-the-art baselines, demonstrating its effectiveness across multiple architectures and hierarchy depths.
- The proposed method is easily integrable with PyTorch, enhancing usability and enabling practical adoption with minimal modifications.
- The intuition behind the scheduling algorithm design is well explained and easy to follow.
Weaknesses
- While the hierarchical ILP solver improves scalability, it still incurs significant computational overhead for very large graphs. The reliance on multiple ILP optimizations per subgraph introduces bottlenecks, making it less practical for real-time or iterative optimization scenarios.
- The framework primarily targets static computational graphs, limiting its applicability to modern AI applications that involve dynamic graphs, such as reinforcement learning, adaptive architectures, and some transformer-based models.
Other Comments or Suggestions
Please refer to Weaknesses.
We thank the reviewer for the careful reading and detailed feedback. Below we address each point of the review in turn.
On Solver Overhead and Scalability
Indeed, H-ILP introduces a solving-time overhead. However, this can be mitigated in the following ways:
- Separate subgraphs can be solved independently and in parallel, which parallelizes efficiently (moreover, for subgraphs with the same structure we run the solver only once); a minimal sketch of this idea is given after this list.
- The complexity analysis of Section 3.3 shows that the total solving time of all ILP optimizations is expected to grow linearly with the number of nodes in the graph, ensuring a reasonable scaling of the framework.
- For long training runs (e.g., lasting weeks or months), even several hours of preprocessing is negligible.
- For cases where fast solving is required, faster heuristics (like TW-Remat) or dynamic programming solvers can replace the ILP optimization subproblems, keeping ILP only for the few top-level optimization problems. Configuration options for running with different solvers are already implemented in HiRemate, and its modular design makes it straightforward to extend with custom solvers and partitioning strategies.
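As a minimal sketch of the parallelize-and-deduplicate idea from the first bullet above (solve_subgraph and structural_signature are hypothetical placeholders, not HiRemate functions):

```python
from concurrent.futures import ProcessPoolExecutor

def solve_all(subgraphs, solve_subgraph, structural_signature):
    """Solve each subgraph's scheduling subproblem, solving structurally
    identical subgraphs only once and distinct ones in parallel."""
    # Group subgraphs by a hashable signature of their structure.
    groups = {}
    for sg in subgraphs:
        groups.setdefault(structural_signature(sg), []).append(sg)

    # One representative per structural class; distinct subproblems are
    # independent, so they can be solved in parallel worker processes.
    representatives = {sig: sgs[0] for sig, sgs in groups.items()}
    with ProcessPoolExecutor() as pool:
        solved = dict(zip(representatives,
                          pool.map(solve_subgraph, representatives.values())))

    # Every subgraph reuses the schedule of its representative.
    return [(sg, solved[sig]) for sig, sgs in groups.items() for sg in sgs]
```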
On Limitation to Static Computation Graphs
HiRemate is designed for neural networks whose forward-backward computation graph remains constant across input samples. This includes many widely used architectures, such as GPTs, encoder-decoder Transformers, ResNets, and U-Nets. For these networks, HiRemate extracts the full computation graph once before training and generates a schedule that meets memory constraints by selectively recomputing intermediate activations rather than storing them. The resulting schedule is reused throughout training.
This static-graph assumption covers a wide class of models used in practice. For architectures where the graph structure can change depending on runtime input, HiRemate can still be applied or extended in many useful cases:
- RNN-style architectures: These apply the same computation block multiple times. If the number of steps is fixed, the computation can be unrolled into a static graph, which HiRemate can optimize directly (see the sketch after this list). Thanks to the repeated sequential structure, the higher levels of HiRemate’s hierarchy can use a faster dynamic programming solver instead of the ILP-based one. If the number of steps varies, multiple schedules can be precomputed for different lengths and selected at runtime. In all cases, HiRemate identifies and reuses schedules for similar subgraphs, which reduces redundant computation during schedule generation.
- SSM-based models (e.g., S4- or Mamba-like): These models use state-space blocks that in most cases can be expressed in either a recurrent or a convolutional form (e.g., S4D, Mamba2). The convolutional version, typically used during training due to its better compatibility with modern hardware, results in a static computation graph that HiRemate can process directly. In particular, we checked that the current version of our framework is compatible with S4D from state-spaces/s4 and Mamba2Easy from state-spaces/mamba2. If the recurrent form is used instead, the situation is similar to RNNs, and the same adaptations apply.
- Architectures with limited graph variation: Some models include conditional branches or mode switches that result in a few possible computation graph variants. If these variations are known in advance and limited in number, HiRemate can precompute a separate schedule for each case. At runtime, the appropriate schedule is selected depending on the input or control signal.
- Architectures with highly dynamic graphs: Some models change the graph structure at runtime in many complex ways. These models fall outside the scope of HiRemate’s current design. In such cases, dynamic heuristics should be explored; their combination with static approaches is an interesting direction for further research.
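As a concrete illustration of the unrolling argument in the RNN bullet above (a toy example, not part of HiRemate), a recurrent cell applied a fixed number of steps exports to a fully static graph that a static planner can then schedule:

```python
import torch
from torch import nn
from torch.export import export

class FixedStepRNN(nn.Module):
    """Toy recurrent cell applied a fixed number of steps."""
    def __init__(self, dim, steps):
        super().__init__()
        self.cell = nn.Linear(2 * dim, dim)
        self.steps = steps

    def forward(self, x):
        h = torch.zeros_like(x)
        # Fixed trip count: export unrolls the loop into `steps` copies
        # of the cell, producing a static forward graph.
        for _ in range(self.steps):
            h = torch.tanh(self.cell(torch.cat([x, h], dim=-1)))
        return h

model = FixedStepRNN(dim=32, steps=4)
exported = export(model, (torch.randn(8, 32),))
print(exported.graph_module.graph)  # static, unrolled computation graph
```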
Even for models with highly dynamic structure, it is often possible to apply HiRemate locally, at the level of individual nn.Module blocks whose behavior is static. This enables good block-wise memory/time trade-offs and can still lead to meaningful savings. HiRemate’s modular design supports partial integration into larger training pipelines, allowing users to optimize memory usage in parts of a model where static graphs are available. In particular, this is relevant for many kinds of mixture-of-experts architectures.
In summary, HiRemate automatically generates memory-efficient training schedules for static computation graphs and can be adapted to models with repeated or mildly varying structure. It handles large graphs through hierarchical partitioning, supports the use of different solvers across subgraphs, and integrates well with PyTorch without requiring changes to model code. These properties make it practical for many real-world training pipelines and adaptable to a wide range of architecture types.
This paper develops a procedure for hierarchically partitioning a data flow graph in order to find a schedule for selectively recomputing neural network activations so that memory usage during backpropagation fits within a specified budget.
Questions for Authors
See strengths and weaknesses above.
Claims and Evidence
The proposed method is a modular framework that delivers practical benefits over recent competing approaches.
Methods and Evaluation Criteria
Experiments measure peak memory usage and iteration time during training for HiRemate in comparison to recent related work. Across a variety of architectures and batch sizes, HiRemate appears to consistently outperform (faster iteration time at any fixed peak memory) TW-Remat and Rockmate.
Theoretical Claims
Experimental Design and Analysis
See methods and evaluation above.
Supplementary Material
The appendix covers additional technical and experimental details.
Relation to Prior Work
Appears to be a consistent but incremental improvement over recent work.
Missing Important References
Other Strengths and Weaknesses
With HiRemate in use, scaling batch size appears to (approximately) linearly increase both iteration time and peak memory usage (e.g., as in Figures 11 and 12). Within a limited regime, this same behavior would be expected of a naive strategy that simply reduces batch size in order to fit within a memory budget; i.e., to achieve the same effect as one large batch, two smaller batches could be run in series, accumulating gradients for both before updating parameters. Such a strategy would not pay any penalty for recomputation, but its effectiveness would depend on whether the smaller batches fill the available parallel computational resources. How does this naive strategy compare to HiRemate? Specifically, at what point (what batch size) does HiRemate become more efficient than treating a larger batch as multiple smaller batches? Understanding the trade-off in comparison to this naive strategy seems crucial to making a practical argument for using HiRemate.
Other Comments or Suggestions
See strengths and weaknesses above.
We thank the reviewer for the careful reading and detailed feedback. Below we address the question regarding the comparison with gradient accumulation.
Gradient accumulation reduces memory usage by splitting a large batch into smaller sub-batches and accumulating gradients across them. This strategy avoids recomputation but comes with trade-offs that HiRemate addresses differently:
- Batch size 1 already exceeding memory: When even a batch of size 1 exceeds memory (e.g., high-resolution inputs or very deep models), gradient accumulation does not help. HiRemate enables training in these settings by recomputing activations selectively.
- Low GPU utilization with small batches: Accumulation processes sub-batches sequentially, often leaving GPU cores underutilized. HiRemate allows for larger effective batches under the same memory budget, improving throughput.
The increased GPU utilization brought by HiRemate is also worthwhile for network architectures whose computation is not dominated by large matrix multiplications or, more generally, whose main computational operations are not well optimized for the hardware.
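For reference, the naive baseline described by the reviewer corresponds to a standard gradient-accumulation loop, sketched below (model, optimizer, loss_fn, and the micro-batch split are placeholders):

```python
import torch

def accumulation_step(model, optimizer, loss_fn, big_batch, targets, num_micro_batches):
    """One optimizer step over a large batch processed as smaller micro-batches."""
    optimizer.zero_grad()
    micro_batches = zip(torch.chunk(big_batch, num_micro_batches),
                        torch.chunk(targets, num_micro_batches))
    for x, y in micro_batches:
        # Scale the loss so the accumulated gradient matches the full-batch gradient.
        loss = loss_fn(model(x), y) / num_micro_batches
        loss.backward()  # gradients accumulate in .grad across micro-batches
    optimizer.step()
```

In this loop, peak activation memory is set by the micro-batch size, whereas HiRemate keeps the full batch on the device and trades extra recomputation for memory.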
We agree that an empirical comparison with gradient accumulation would strengthen the practical case for HiRemate, and we plan to include such experiments in future work.
This paper presents an approach to scheduling computations required for neural network training subject to a memory usage budget. All reviewers lean toward accept. The hierarchical solver proposed for this scheduling problem appears to deliver good empirical results in comparison to relevant baselines. Reviewers ag6k and M2sS suggest additional experimental analyses that would be informative and should be considered for inclusion in a revision of the paper.