MetaDist: An Infrastructure for Automatic Parallelism via ShardCombine Algorithm

ICLR 2024 | Decision: Rejected
Submitted: 2023-09-22 | Updated: 2024-02-11
Ratings (4 reviewers): 3, 3, 6, 3; average 3.8/10 (min 3, max 6, std. dev. 1.3)
Average confidence: 4.0

Keywords

Automatic Parallelism, Distributed Training, Machine Learning Framework, Single Program Multiple Data

Reviews and Discussion

Official Review 1 (Rating: 3)

As models become larger, distributed training has received much attention. In this paper, the authors develop an infrastructure for automatic parallelism. Their infrastructure is based on two data structures and is compatible with multiple ecosystems, including PyTorch and JAX. The performance of their solution is better than that of existing methods.

Strengths

  1. The problem investigated in this paper is fundamental.
  2. The presentation of this paper is clear. It is easy for me to follow the paper.
  3. The infrastructure has great performance.

Weaknesses

  1. The main issue is that the motivation of this paper is not clear. Since we already have the Alpa system for automatic distributed training, I do not see why we need to develop a new tool. In addition, the experimental results show that the infrastructure developed in this paper achieves results similar to Alpa's on GPT and WideResNet. Furthermore, the GAT model is not large enough; in most cases, GAT is used to solve small graphs. Therefore, I believe that Alpa is good enough.
  2. For automatic distributed training, the search time is an important metric. Is it possible to compare the search time of MetaDist with that of Alpa in the experiment section?
  3. Could the authors evaluate their infrastructure over multiple nodes?
  4. Megatron-LM is a widely used distributed training framework. I think the authors also need to add Megatron-LM as a baseline for the PyTorch implementation.
  5. As an infrastructure, I think the authors should open-source their code. If this infrastructure is used by many researchers and engineers, that would indicate it is useful. In view of this, GitHub stars and download counts are two important metrics for an open-sourced infrastructure.
  6. This paper looks like a systems paper, and I am not sure whether it could be accepted by the ICLR community.

Questions

Please see the Strengths and Weaknesses.

Ethics Concerns

N/A

Official Review 2 (Rating: 3)

This work proposes MetaDist, which provides automatic parallelism for different frameworks such as PyTorch and JAX.

Strengths

I appreciate the engineering effort made in this work, and it is really thrilling to see an automatic parallelism toolkit that is framework agnostic. I believe this is significant for the industry, as it eases the difficulty of training large models.

Weaknesses

(W1) Limited contributions. It is unclear how the proposed MetaDist representation and the ShardCombine algorithm improve over existing works.

(W2) This work does not support pipeline parallelism. Moreover, the search algorithm does not take computation cost into account.

(W3) More baselines are expected and some of the experimental results need elaboration.

Questions

(Q1) In my humble opinion, this work is not well motivated, and the rationale behind the proposed methods is not well explained.

  • First, in the introduction section, the authors discuss two main challenges, i.e., (i) lack of ecosystem compatibility, and (ii) difficulty in development, maintenance, and benchmarking. However, these seem more like implementation issues than research topics.
  • Second, there is no discussion of the limitations of existing methods and how the MetaDist representation and the ShardCombine algorithm tackle them (see questions below). To me, they are simply put forward without justification.

(Q2) In essence, MetaDist can be viewed as a kind of IR in a distributed manner. Therefore, I believe it is necessary to compare MetaDist with related works such as pONNX [1] and Unity [2]. In my view, the ShardSpec and CombineSpec can be broken down into the partition, combine, replicate, and reduce operators in Unity, and the latter can provide more fine-grained representations (an illustrative sketch of this correspondence is given after the references below).

[1] Wang et al. Parallel Training via Computation Graph Transformation.
[2] Unger et al. Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parallelization.
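To picture the correspondence suggested in Q2, a purely illustrative mapping might look as follows. The ShardSpec/CombineSpec spellings ("S(0)", "R", "concat", "sum") are assumptions in Alpa-style notation, not taken from the paper; only the four Unity operator names come from the reviewer's reference [2].

```python
# Hypothetical illustration of how MetaDist-style specs might map onto Unity's
# parallelization operators. The spec spellings are assumed for this example
# and are not the paper's notation.
SPEC_TO_UNITY = {
    ("ShardSpec", "S(0)"):     "partition",   # split a tensor along dimension 0
    ("ShardSpec", "R"):        "replicate",   # copy the full tensor to every device
    ("CombineSpec", "concat"): "combine",     # concatenate local results into the global one
    ("CombineSpec", "sum"):    "reduce",      # element-wise reduction of local results
}
```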

(Q3) MetaDist focuses on intra-op parallelism, so it is unclear whether it can support pipeline parallelism, which is frequently used in large-scale distributed deep learning.

(Q4) Section 3.3 states that “The ShardCombine algorithm is an exploration algorithm that utilizes heuristic information to shard input data, and attempts to re-combine local results into global results using the TryCombine function”. What is the heuristic information here? Furthermore, how does the heuristic information help your algorithm design?
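For concreteness, the quoted description suggests a try-and-verify loop: shard an input along some dimension, run the operator on each shard, and test candidate combine rules against the globally computed result. The following is a minimal sketch of that idea only; the helper names (try_combine, discover_spmd_rules) and the candidate rule set are assumptions, not the paper's implementation.

```python
# Minimal sketch (not the authors' code) of the shard-then-try-to-combine idea
# quoted above. Helper names and the candidate rule set are assumptions.
import numpy as np

def try_combine(local_outs, global_out):
    """Return a combine rule that reproduces the global result, or None."""
    for dim in range(global_out.ndim):
        try:
            stacked = np.concatenate(local_outs, axis=dim)
        except ValueError:                          # shapes do not allow this concat
            continue
        if stacked.shape == global_out.shape and np.allclose(stacked, global_out):
            return ("concat", dim)
    if all(o.shape == global_out.shape for o in local_outs):
        if np.allclose(sum(local_outs), global_out):  # element-wise reduction
            return ("sum", None)
    return None

def discover_spmd_rules(op, global_in, num_shards=2):
    """Shard the input along each dimension, run the op locally, and keep the
    (shard dim, combine rule) pairs that reproduce the global output."""
    global_out = op(global_in)
    rules = []
    for dim in range(global_in.ndim):
        shards = np.array_split(global_in, num_shards, axis=dim)
        try:
            local_outs = [op(s) for s in shards]
        except Exception:                           # sharding along this dim is invalid
            continue
        rule = try_combine(local_outs, global_out)
        if rule is not None:
            rules.append((("ShardDim", dim), rule))
    return rules

# Example: a matmul with a fixed weight can be sharded along the batch dimension.
W = np.random.randn(4, 3)
print(discover_spmd_rules(lambda x: x @ W, np.random.randn(8, 4)))
# -> [(('ShardDim', 0), ('concat', 0))]
```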

(Q5) Section 3.4.1 states that “The objective of the problem is to minimize the communication cost”. It is confusing why computation cost is not taken into account here. Furthermore, it is unclear how the ILP differs from the one in Alpa.
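For reference, the kind of intra-op ILP the question alludes to (in the spirit of Alpa's formulation; the notation below is illustrative, not taken from the paper) can be written as

$$\min_{x}\;\sum_{v\in V}\sum_{s\in S_v} c_{v,s}\,x_{v,s} \;+\; \sum_{(u,v)\in E}\sum_{s\in S_u}\sum_{t\in S_v} r_{uv}(s,t)\,x_{u,s}\,x_{v,t}
\quad\text{s.t.}\quad \sum_{s\in S_v} x_{v,s}=1\ \ \forall v,\qquad x_{v,s}\in\{0,1\},$$

where $x_{v,s}$ selects sharding strategy $s$ for operator $v$, $c_{v,s}$ is the communication cost of executing $v$ under $s$, and $r_{uv}(s,t)$ is the resharding cost on edge $(u,v)$; the quadratic terms are linearized in practice. The reviewer's concern would amount to adding a per-strategy computation term to $c_{v,s}$ and clarifying where this formulation departs from Alpa's.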

(Q6) The experiments were conducted on a single GPU server with eight GPUs. I am afraid the exploration space of parallelism would be extremely limited and could not evaluate the performance thoroughly.

(Q7) Regarding the experimental results, there are a few issues:

  • The major competitors are Alpa and FairScale; however, frameworks such as DeepSpeed and Megatron-LM are more widely used for large-model training in practice. I believe comparisons with them are necessary.
  • The authors stated that “ZeRO-3 is difficult to scale efficiently because of its communication overhead”. When memory is not scarce, ZeRO-2 may be a better option than ZeRO-3 and pure DataParallel. Please also compare with ZeRO-2.
  • When training on a single GPU, there should be no performance differences, yet differences are reported in Figure 8. Please elaborate.
  • There are several cases where MetaDist-BS runs out of memory. Does this indicate that your search algorithm does not take the memory limit as a constraint?
Official Review 3 (Rating: 6)

The authors propose a technique to help address and reduce the manual effort necessary to parallelize training pipelines for large models on limited hardware. Two data structures, MetaOp and MetaIR, are proposed to construct the MetaSPMD space. MetaIR and MetaOp are framework-agnostic: the computational graph of a network is converted into MetaIR, and the corresponding operators are converted into MetaOps. MetaSPMD specifies the operator-level parallel space of MetaOps and details how to shard inputs and combine local results into global ones. Based on these structures, the paper proposes the ShardCombine algorithm to shard input data and re-combine local results.
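To make the described structures concrete, one might picture them roughly as follows. This is a sketch under assumptions: the field names op_function, invars, outvars, and spmd_rules appear in the paper's Section 3.2, while everything else (including MetaVar) is hypothetical and not the authors' implementation.

```python
# Schematic sketch (not the authors' code) of the data structures described above.
# Field names follow Section 3.2 of the paper; the rest is assumed for illustration.
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple

@dataclass
class MetaVar:
    name: str
    shape: Tuple[int, ...]        # global (unsharded) tensor shape

@dataclass
class MetaOp:
    op_function: Callable         # callable primitive operator (framework-level op)
    invars: List[MetaVar]         # input variables of the operator
    outvars: List[MetaVar]        # output variables of the operator
    spmd_rules: List[Any] = field(default_factory=list)  # per-op SPMD space:
                                  # how to shard inputs, how to combine local results

@dataclass
class MetaIR:
    ops: List[MetaOp]             # framework-agnostic computational graph,
                                  # e.g. traced from a PyTorch or JAX model
```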

Strengths

  • The ideas presented are well thought out and much needed.
  • It is good that it works on both PyTorch and JAX.
  • The experimental results could be improved, but they are not a bottleneck for acceptance.

Weaknesses

  • It might help to have some examples of when the cache mechanism is effective and when it is not.
  • The major issue with this work is reproducibility. Simply by reading the paper, it would be difficult for me to reproduce the work. Artifacts would help, but the parts/sections about MetaIR, MetaOps, and MetaSPMD could have been written so that one can understand how they are practically generated. Conceptually, they are fine.
  • Why does Beam Search + ResNet + JAX in Figure 7 take so much longer than any other combination? Is it simply because JAX has more fine-grained operators?
  • The writing could improve a bit, especially in Sections 3 and 4. It would have been nice to have some results on at least a 1-billion-parameter model. Or do the four different model sizes refer to GPT-2, GPT-3.5, etc.?

Questions

Please check the Weaknesses for detailed questions to be answered.

  • Are there examples of when the cache mechanism is effective and when it is not?
  • Are there artifacts available for reproducibility purposes?
  • Why does Beam Search + ResNet + JAX in Figure 7 take so much longer than any other combination? Is it simply because JAX has more fine-grained operators?
  • Are there results on at least a 1-billion-parameter model? Do the four different model sizes refer to GPT-2, GPT-3.5, etc.?
Official Review 4 (Rating: 3)

This work investigates automatic parallelism and is in particular concerned with selecting the best combination of strategies from a space of parallel strategies. The authors present MetaDist, which they claim is an infrastructure for automatic parallelism. They propose two abstract data structures, MetaOp and MetaIR, which enable them to construct the MetaSPMD space. The ShardCombine algorithm obviates the need for manual annotation, significantly reducing development and maintenance cost. Moreover, their approach is natively compatible with multiple ecosystems, including PyTorch and JAX. To validate their design, they implement two baseline automatic parallelism algorithms based on MetaDist. Their experiments demonstrate that their approach achieves state-of-the-art performance compared with other distributed solutions.

Strengths

  • Automatic parallelism is a crucial problem for current computing systems and especially recent machine learning architectures

Weaknesses

  • The impact of pruning on the prediction results needs to be addressed. How does pruning influence the accuracy or effectiveness of the model?
  • When measuring throughput, it should be ensured that the profiling time and the time taken for pruning do not adversely affect the overall performance of the framework. The authors should explain how they have accounted for these factors in their evaluation.
  • Incomplete discussion of prior work on automatic parallelization, especially compiler approaches that are independent of a specific processor architecture.

Questions

  1. In Section 3.2, what does the callable primitive operator stand for in op_function, invars, outvars, and spmd_rules? In Figure 2, what is the MetaIR? There is a MetaIR on the left side and another MetaIR in the middle.
  2. In Section 3.3, ShardDim(id=j) (Sj) is used to determine the parallelism strategies. How many such strategies could be applied? I notice the paper mentions S1, S2, and S3 in the rest of the paper. For Si, how many ways are there to partition the tensor?
  3. What is the definition of the score function? Is it just the training time? If two parallelism strategies could be applied, how do you evaluate which one is better?
  4. How does the communication bandwidth affect the result?
  5. The treatment and discussion of prior work on automatic parallelism and compiler approaches to automatic parallelization only covers papers from 2022 and 2023, even though there are pioneering approaches from well before that. Here are some examples of compiler approaches to automatic parallelization: "A load balancing inspired optimization framework for exascale multicore systems: A complex networks approach," in 2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 217-224, IEEE, 2017; "Self-optimizing and self-programming computing systems: A combined compiler, complex networks, and machine learning approach," IEEE Transactions on Very Large Scale Integration (VLSI) Systems 27, no. 6 (2019): 1416-1427; "A distributed graph-theoretic framework for automatic parallelization in multi-core systems," Proceedings of Machine Learning and Systems 3 (2021): 550-568; "Plasticity-on-chip design: Exploiting self-similarity for data communications," IEEE Transactions on Computers 70, no. 6 (2021): 950-962. The authors should check the literature and carefully contrast all existing related approaches to automatic parallelization and compiler approaches, especially since these prior approaches were developed to be computer-architecture independent and adaptable to general settings.
AC Meta-Review

All reviewers had issues with this paper. The main issues are the lack of a clear motivation, a clear description of novelty over prior work, and a clear demonstration that the results are reproducible.

Why not a higher score

The reviewers did not identify key reasons why this paper should be accepted. Without reviewer interest, the paper is not ready for acceptance at a major conference.

Why not a lower score

N/A

Final Decision

Reject