Average rating: 5.0 / 10 · Decision: Rejected · 3 reviewers
Ratings: 6, 6, 3 (min 3, max 6, std. dev. 1.4) · Average confidence: 3.7
ICLR 2024

H-Rockmate: Hierarchical Approach for Efficient Re-materialization of Large Neural Networks

OpenReview · PDF
Submitted: 2023-09-24 · Updated: 2024-02-11

Abstract

Keywords
Rematerialization · Neural Networks · Memory-Efficient Training · PyTorch · Integer Linear Programming · Training · Checkpointing

Reviews and Discussion

Official Review (Rating: 6)

The authors propose a hierarchical (divide-and-conquer) approach to making re-computation/re-materialization decisions by:

  1. partitioning the graph into subgraphs hierarchically
  2. applying a modified ILP solver (H-ILP) to the solved subgraphs recursively
  3. additionally, modularizing the base solver for the leaf subgraphs of the (hierarchical) tree so that different algorithms can be swapped in

in order to address the issue that existing approaches either are ILP-based and thus do not scale to large graphs, or fail to optimize more general graphs with long skip connections, such as U-Net.
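The following minimal sketch captures this reading of the method; it is not the authors' code, and the graph API, partitioner, base_solver, and combine_with_h_ilp are hypothetical placeholders.

```python
def h_solve(graph, base_solver, partitioner, combine_with_h_ilp, max_leaf_size=20):
    """Return candidate (time, peak-memory) schedules ("options") for `graph`."""
    if graph.num_nodes() <= max_leaf_size:
        # Leaf subgraph: any pluggable base solver (ILP, DP, heuristic, ...).
        return base_solver(graph)
    # 1. Partition one level of the hierarchy into subgraphs.
    subgraphs = partitioner(graph)
    # 2. Recursively solve each subgraph, collecting its candidate options.
    sub_options = {sg: h_solve(sg, base_solver, partitioner,
                               combine_with_h_ilp, max_leaf_size)
                   for sg in subgraphs}
    # 3. The higher-level ILP (H-ILP) picks one option per subgraph under the
    #    memory budget(s), yielding the options of the current, coarser graph.
    return combine_with_h_ilp(graph, sub_options)
```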

Strengths

  • The hierarchical + ILP solution proposed by the authors is intuitive and practical in the sense that:

    1. the search space of an ILP-based approach is too large to scale to graphs with thousands of computational ops/nodes, and a good graph partition can trim down the search space efficiently
    2. non-ILP-based approaches have a hard time dealing with networks with long skip connections, like U-Net or encoder/decoder architectures.
  • the base solver for the bottom/leaf subgraphs is modularized, and thus can be swapped for different algorithms as a graph/solver-runtime trade-off

  • solid explanations/comparisons to related works/baselines, and the robustness to hierarchy depth (Figure 4) is a good indicator of scalability to general/deep networks

Weaknesses

  • The partition algorithm (especially the score function/cost model in Equation 1) is a bit ad hoc. I can grasp the intuition behind it: it tries to identify a subgraph with the least IO and penalizes the number of nodes in it, so that it minimizes the memory required to checkpoint its IO while keeping each subgraph relatively small. However, graph partitioning is a long-studied problem, and such heuristic/greedy algorithms usually do not scale well, in the sense that they are typically tailored to specific known targets and fail over time as the targets evolve. That being said, I would suggest:

    1. trying the partition algorithm on a DenseNet to see if it produces good results
    2. alternatively, making this partition algorithm modular like the solvers; what is more valuable/solid in this work, IMO, is the intuition of hierarchy (divide and conquer) and the H-ILP solver
  • the presentation can be improved:

    1. Section 3.1 (H-partition) contains a lot of implementation details without much explanation of where they come from, e.g., as briefly mentioned above regarding Equation (1); additionally, why do you need alpha and why is it 0.5, and why did you choose 4 candidate groups in "Formation of candidate groups" rather than another number?
    2. On the other hand, the caption of the most important figure (Figure 1) does not have enough detail: what are the time vs. memory plots (I think they refer to the options)? What is a direct solver (I only understood base solvers for the bottom subgraphs and the hierarchical H-ILP solver)? Etc.
    3. How, more concretely, does H-Rockmate beat the baselines on U-Net/Encoder-Decoders? An explanation, or preferably an illustrative example, would help readers understand its quality more intuitively.

Questions

In addition to the questions in Weakness, here are a couple more questions:

  1. what does "the higher level algorithm adapts the sub-schedules" concretely mean in "Correction terms for memory usage"?

  2. would "model = HRockmate(model, sample, memory_budget)" work with Tensor Parallel packages like Megatron? My guess is that the implementation relies on model/graph tracing, and Megatron can pose difficulties for such tracing due to collective communications.

Comment (by authors)

Thank you for your comments.

  • About the partitioning algorithm: please see our comment on graph partitioning in the General answer. We agree completely that this partition algorithm is ad hoc. We are aware that graph partitioning is a long-studied problem, but it is not clear what the optimization objective should be in our case. Despite not insisting on it in the paper, the partition algorithm is indeed modular in the current implementation, and it is just as easy to use another partition algorithm as it is to use another solver. As you mentioned, this paper's contribution is focused on the idea of hierarchical decomposition and the H-ILP formulation. The partitioning algorithm is provided for completeness of the paper, and its purpose is to show that there exists at least one partitioning algorithm that makes this hierarchical idea work. Trying and comparing other partitioning approaches is clearly part of our future work, to see if it is possible to make the idea work even better. And thank you for the suggestion; we will add DenseNet to the set of architectures we experiment with.
  • About the values of the parameters: the value of the parameter alpha was chosen empirically from a small set of experiments; the value 0.5 seemed to provide the best results overall. About the 4 candidate groups, this is not really a parameter: for a given node x, it is not clear whether the "right" partitioning around x is to group it with its predecessor nodes or with its successor nodes. This is why the algorithm creates one candidate group where x is within the group and one candidate group where x is not within the group. The same goes for a(x), and this yields 4 (= 2 × 2) possible combinations (a small sketch after this list illustrates the enumeration).
  • About the caption of Figure 1: you are correct that the time vs. memory plots represent the sets of options. A direct solver is a solver that does not use the hierarchical decomposition but works directly on the whole subgraph itself, flattening the hierarchy and all the sub-subgraphs (TW-Remat is an example of a direct solver). We will make this clearer in the caption of Figure 1 in an updated version.
  • About how H-Rockmate beats the baselines: in the example of U-Net, the graph partitioning detects the main structure of U-Net. It is decomposed into a hierarchy where each subgraph corresponds to a layer and contains about 10-15 nodes (see Table 3 in Appendix). This allows H-Rockmate to see the whole structure (which Rockmate can not do) while keeping the size small enough (which Checkmate can not do). TW-Remat is only a heuristic and it is more difficult to guarantee the quality of its schedules. In an updated version, we will add (in the Appendix) an illustration of how the algorithm works on the U-Net network, with the hierarchical decomposition and the resulting optimized sequence.
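As a purely illustrative aside (not the authors' code), the 2 × 2 enumeration described above can be sketched as follows; x, a_x, and inner_nodes (the nodes grouped around them) are placeholder inputs.

```python
from itertools import product

def candidate_groups(x, a_x, inner_nodes):
    """Four candidate groups: x inside/outside the group, a(x) inside/outside."""
    groups = []
    for include_x, include_ax in product((True, False), repeat=2):
        group = set(inner_nodes)
        if include_x:
            group.add(x)
        if include_ax:
            group.add(a_x)
        groups.append(frozenset(group))
    return groups

# Example: the 4 candidate groups around node "x" with a(x) = "y".
print(candidate_groups("x", "y", {"u", "v"}))
```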
Comment (by reviewer)

Thank you for your detailed explanation and the committed improvements. Lack of support for TP is still one of my main concerns, as TP has proved more straightforward and scalable for addressing memory limits, but I have raised my score to 6 because I think the hierarchical approach still shines in its own right in tackling a long-standing problem. Sorry for the late reply; I would sincerely suggest that the authors submit rebuttals earlier rather than close to the deadline.

Official Review (Rating: 6)

This paper introduces H-Rockmate, a hierarchical approach for finding a re-materialization strategy for large neural networks. It decomposes a data-flow graph into multiple levels and finds efficient solutions for each block at the bottom level. A related ILP formulation is proposed to recombine the low-level solutions. H-Rockmate achieves performance similar to Rockmate's in less time, making it more practical.

Strengths

  1. H-Rockmate proposes a hierarchical decomposition method for the computation graph, so the size of the ILP problem is smaller. Experiments show that efficiency and performance have not been compromised.
  2. Other re-materialization strategies can be integrated into their framework to achieve better performance.
  3. Theoretical analysis of their algorithm and ILP formulation is provided in the Appendix.

Weaknesses

  1. Although they claim H-Rockmate works for large neural networks, the sizes of the neural networks used in the experiments are the same as in other works.
  2. There are some typos in the Appendix, such as “line ??”.

Questions

  1. If H-Rockmate is applied to a billion-parameter neural network like LLaMA, what will the peak memory and iteration time be? I think experiments with neural networks larger than GPT-2 are necessary.
  2. Can you describe which constraints are considered in the main part of the paper and give the detailed expressions in the Appendix?
  3. What about modeling on-chip global memory to obtain a better schedule? Can your method support this?
Comment (by authors)

Thank you for your comments.

  • Regarding the size of neural networks: please see our General answer to all reviewers, where we refer to different definitions of "large" models and specify in which scenarios H-Rockmate is applied. We have tested much larger neural networks (in terms of number of nodes) than was previously accessible to Checkmate/Rockmate and obtained better performance than existing approaches. Applying H-Rockmate directly to neural networks whose parameters don't fit on one device, like LLaMA, is not feasible, because H-Rockmate only optimizes the memory used by activations. However, as we say in the General answer, H-Rockmate can be applied separately to the submodules of the neural network. And since most solutions for training such large models do use a form of re-materialization, and our work provides an improved and generic approach for optimizing the re-materialization strategy, we see finding an optimal combination of pipeline training and generic re-materialization as future research.
  • Answering Question 2: even without the technical details, the list of H-ILP constraints is quite long and highly redundant with the Rockmate paper. Since we wanted to keep the paper easy to understand, we decided to focus on the specific contributions that make H-ILP able to work in a hierarchical manner.
  • Answering Question 3: we are not sure we fully understand what "on-chip global memory" refers to. We assume that it relates to some memory external to the GPU, either the memory attached to the CPU or the memory of one or several other GPUs. This is related to the "offloading" problem, where some activations or weights are moved to and from other memory locations, which incurs a communication delay. Supporting offloading is one of our future work directions; we mention this in particular in the Conclusion and the General answer.
Official Review (Rating: 3)

This paper addresses the problem of efficiently scheduling the re-materialization of the training computation. Concretely, the paper proposes H-Rockmate, a hierarchical solution that decomposes the data-flow graph into a hierarchy of small-scale subgraphs and computes a re-materialization schedule that minimizes the computational overhead within a given memory budget. Empirical studies are conducted to evaluate the performance of the proposed method in terms of solver efficiency and end-to-end performance.

Strengths

  • The summarization of this research area is clear and accurate; the related work section is well-organized.

  • The intuition behind the scheduling algorithm design is clear and straightforward.

  • Based on the reported experimental results, the reduction of the solver execution time is significant.

Weaknesses

  • The writing of the paper can be significantly improved. First of all, there is a lack of a formal definition of the scheduling problem itself -- the current introduction of the problem is interwoven with the problem statement in Section 3.1. Additionally, Section 3.2 is too casual; it lacks sufficient formalization of the mathematical representation of the problem -- I notice that plenty of important information is left to the appendix. I do not think this is an appropriate trade-off; the technical content should be self-contained within the scope of the paper.

  • I am a little confused by the results presented in Figure 3; I was expecting that when the budget is very low, every algorithm should be able to find the schedule that recomputes every activation, while when the budget is very high, every algorithm should be able to find the no-recomputation schedule, yet there is still some difference between the lines. This was confusing. Additionally, important hyper-parameters, such as the batch size, are not listed in Section 4.

  • Another trivial detail is that the font style differs from other submissions I reviewed; please check the instructions to ensure you are using the requested font style.

Questions

See my comments in the Weakness section.

Comment (by authors)

Thank you for your comments.

  • Regarding the writing of the paper: the first paragraph of Section 3 is our statement of the scheduling problem. It is somewhat short due to the necessity of keeping the paper within the page limit. We will provide more details in an updated version, especially about the input data (execution time and memory overhead of each node in the graph, memory size of each activation). In Section 3.2, as discussed in the General answer, we have indeed opted for an informal description of the specificities of H-ILP compared to the ILP in Rockmate. This was again motivated by the page limit; a more technical presentation would require us to introduce all variables and notations, and we could not identify what to remove from the paper in compensation.
  • Regarding the results in Figure 3:
    • About the low budget case, your intuition is indeed correct for the case of chain networks with homogeneous activation sizes. Take for example the case of a chain of length 4. The recompute-all schedule would be F1-F2-F3-F4-B4 - F1-F2-F3-B3 - F1-F2-B2 - F1-B1 (where F2 and B2 stand for the forward and backward pass through the second layer, respectively). If all layers and activations are the same, the peak memory is reached during each B operation. To achieve the minimum budget, no activation can be stored in memory during any B operation: only the recompute-all schedule achieves this minimum budget. However, when sizes are heterogeneous, it is possible to reach the same memory as the "recompute-all" schedule but with a lower running time: the peak memory is reached during the most memory-demanding operation (for example the B3 operation). All other operations can be performed with some activations stored in memory: we can perhaps store the result of F1 during the computation of B4 without increasing the peak memory. This allows for a shorter schedule that does not recompute everything (a toy sketch after this list illustrates this live-set reasoning). Finding such a shortest schedule still leaves room for optimization. Furthermore, when the dependency graph of the neural network is not a chain, other optimization opportunities arise. Take for example a graph where the output of the first layer 1 is used by two paths, 2 -> 3 and 4 -> 5, with a last layer 6 using the results of 3 and 5. The graph looks like

             2 -> 3
            /      \
           1        6
            \      /
             4 -> 5
      

      In a recompute-all schedule, one wants to compute F6 and B6 with no other activation in memory. Computing F6 requires the results of F3 and F5. While computing F3, one can choose whether or not to keep the result of F1 in memory, and this decision has an impact on the duration of the schedule. For more complex graphs, this decision is not trivial. It was actually shown in [Naumann, 2018] that it is NP-hard to decide whether a given task graph can be computed within a given budget.

    • About the high budget case: the best solution is indeed to store everything and to perform no recomputation. However, the values presented in Figure 3 represent actual measurements, as opposed to the estimated duration of the computed schedules. In that case, the ILP-based algorithms do output the "no-recomputation" schedule, but the real-life execution contains overhead from the Rockmate framework. Your comment made us reconsider Figure 3b specifically because indeed there should not be such a difference in overhead. We experimented again and obtained a very similar plot, but this time all algorithms obtained the same performance for the high-budget case. The performance is just slightly lower than Autodiff because of the overhead of the framework.

    • As for the TW-Remat solution specifically, it is a heuristic and has no guarantee of finding these good solutions ("recompute-all", "no-recomputation") when they exist.

  • Regarding batch size and sequence length: we will mention them in an updated version of the paper. However, the reason why these values are not mentioned is that they only scale the plots (both in terms of memory usage and iteration time), as shown in Figures 6, 7, and 8 of the Appendix. In the experiments of the main paper, the values were selected so that the memory usage is consistent with the total memory available.
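To make the low-budget discussion above concrete, here is a toy live-set sketch (made-up sizes and event lists, not taken from the paper): it shows that keeping a small activation alive across a step that is not the memory peak leaves the peak unchanged while saving one recomputation.

```python
def peak_memory(events, sizes):
    """events: list of ('alloc', name) or ('free', name); returns the peak live size."""
    live, peak = set(), 0
    for op, name in events:
        if op == "alloc":
            live.add(name)
        else:
            live.discard(name)
        peak = max(peak, sum(sizes[n] for n in live))
    return peak

# "a1" plays the role of the result of F1; "B4_rest" and "B3_rest" lump together
# everything else live during B4 and B3 (a deliberate simplification).
sizes = {"a1": 1, "B4_rest": 2, "B3_rest": 4}

recompute_all = [("alloc", "a1"), ("free", "a1"),            # a1 discarded early
                 ("alloc", "B4_rest"), ("free", "B4_rest"),
                 ("alloc", "a1"), ("free", "a1"),             # a1 recomputed later
                 ("alloc", "B3_rest"), ("free", "B3_rest")]   # the peak step

keep_a1 = [("alloc", "a1"),                                   # keep a1 through B4
           ("alloc", "B4_rest"), ("free", "B4_rest"),         # 1 + 2 = 3 < 4
           ("free", "a1"),                                    # freed before B3
           ("alloc", "B3_rest"), ("free", "B3_rest")]

print(peak_memory(recompute_all, sizes))  # 4
print(peak_memory(keep_a1, sizes))        # 4: same peak, one fewer recomputation of a1
```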
Comment (by reviewer)

Dear Authors,

I really appreciate your effort in answering my questions; these notes make the technical details clearer. However, I still feel that the current version of the paper demands significant polishing before formal publication.

Best wishes, Reviewer J4PJ

Comment (authors' general response)

We thank the reviewers for their comments. We provide here a general answer for the main points and answer specific questions in individual comments to each review.

 

Regarding the balance between the main text and appendix

When writing the paper, we tried to find a trade-off between intuitive and technical explanations that allows readers with different backgrounds in the topic to understand the main idea and the most important novel technical contributions of the paper (such as H-ILP and its differences from the ILP in Rockmate). As we see from the reviews, some reviewers are content with the presentation (Reviewer piuX), while others would like to see more technical details moved from the appendix to Sections 3.1 or 3.2. We will do our best by moving some details from the appendix, but given the page limit we prefer not to overload the main text with formulas, in order to preserve the clarity of the presentation of the contributions.

 

Regarding the application of H-Rockmate to large models

Firstly, we would like to clarify in which sense we use the term "large" in our paper, and secondly, how H-Rockmate is beneficial for such large models.

The term "large" model in deep learning papers may be used in different ways, meaning:

  1. a large number of nodes in the data-flow graph (forward or forward-backward graph) corresponding to the given model;
  2. a large number of model parameters (or a large amount of memory required for parameter storage);
  3. a large peak memory footprint during training (the memory required to store the intermediate activations, compared to the memory occupied by parameters, may vary significantly depending on the batch size and input sample size).

The solution proposed by H-Rockmate provides state-of-the-art results in terms of peak memory/training time trade-off, for models whose parameters fit on one GPU device. Given this condition, the first benefit of H-Rockmate is that it can handle models with much larger computational graphs than other ILP-based solutions. For example, you can see in Table 3 of the Appendix that H-Rockmate can handle an Encoder-Decoder Transformer with 36 encoders and 36 decoders. The computational graph before simplification has more than 6,000 nodes; after simplification and inclusion of data nodes, this results in about 5,800 nodes. Our H-Rockmate framework finds a re-materialization solution in a couple of hours, which is acceptable given the usual training time of such large models. And the second benefit is that for a given peak memory it can find a schedule corresponding to faster training iteration compared to existing re-materialization techniques.

In the scope of the H-Rockmate work, we did not consider finding an optimal re-materialization schedule for models whose parameters don't fit on one GPU device. This is planned for future work. However, the current H-Rockmate implementation already allows combining with the pipeline parallelism solution of PyTorch in a simple way, where the submodel assigned to each GPU is preprocessed as a separate H-Rockmate module (see the sketch below). Combining re-materialization with offloading should also be possible and is an interesting angle for future development.
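A hedged sketch of this combination, assuming the HRockmate(model, sample, memory_budget) call quoted by Reviewer 1; the staging loop, the per-stage budget, and the omitted device placement are simplifying assumptions, not the authors' API.

```python
import torch
import torch.nn as nn
# from hrockmate import HRockmate   # hypothetical import path

def rematerialize_stages(stages: nn.Sequential, sample: torch.Tensor,
                         per_stage_budget: int) -> nn.Sequential:
    """Preprocess each pipeline stage as a separate H-Rockmate module."""
    wrapped, x = [], sample
    for stage in stages:
        # Assumes each stage's parameters fit on its own device.
        wrapped.append(HRockmate(stage, x, per_stage_budget))
        with torch.no_grad():
            x = stage(x)   # propagate a sample so the next stage can be traced
    return nn.Sequential(*wrapped)
```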

 

Regarding the use of H-Rockmate in a multi-GPU setting

We have tested H-Rockmate with the Data Parallelism and Model Parallelism solutions available within PyTorch. In both cases, we obtain the expected behavior. The present paper, however, focuses on the case of model training on a single GPU, as described in the comment above.

It might indeed be technically challenging to use Tensor Parallel frameworks like Megatron together with H-Rockmate. However, implementing our algorithms within such frameworks should be feasible and would allow users to further optimize the overhead of recomputations.

 

Regarding the graph partitioning algorithm

We will clarify in the paper that our whole framework is modular, both for the solvers and for the partitioning algorithm. Developing and evaluating other partitioning approaches would require only a small amount of development. The heuristic presented in the paper was needed as a proof of concept for the whole idea of hierarchical solving, but we leave improving the partitioning algorithm as future work. Using an off-the-shelf partitioning algorithm from the literature is nontrivial because it is not clear in our case what criterion needs to be optimized.
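A hedged sketch of what such a pluggable interface could look like (hypothetical types, not the released API; Graph and Schedule are placeholders):

```python
from typing import List, Protocol, Tuple

class Partitioner(Protocol):
    def __call__(self, graph: "Graph") -> List["Graph"]:
        """Split a graph into the subgraphs of one hierarchy level."""
        ...

class Solver(Protocol):
    def __call__(self, graph: "Graph") -> List[Tuple[float, float, "Schedule"]]:
        """Return candidate (time, peak_memory, schedule) options for a graph."""
        ...

# Any (partitioner, solver) pair respecting these signatures could then be
# handed to the hierarchical framework; the concrete choices discussed in the
# reviews (the heuristic partitioner, ILP or heuristic leaf solvers) would
# simply be instances of these protocols.
```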

AC Meta-Review

The paper presents a new technique for training neural networks. A challenge with the paper is that the problem definition and algorithm are intermixed. As such, it is difficult to identify the key intuition for the problem being solved. Addressing the clarity of the results will make this a stronger submission.

Why not a higher score

There were many concerns with the writing. It is particularly concerning that the authors did not clearly define the formal problem before introducing their algorithm.

Why not a lower score

N/A

Final Decision

Reject