PaperHub
Rating: 3.7/10 (3 reviewers; individual ratings 3, 3, 5; min 3, max 5, std. dev. 0.9)
Confidence: 4.0
Status: Withdrawn
Venue: ICLR 2024

Balance Beam: adaptive computation for affordable training and inference with high-throughput offloading for LLMs

Submitted: 2023-09-23 · Updated: 2024-03-26

Abstract

With the surging growth of model parameters, foundation models pose unprecedented challenges to traditional computational infrastructure. These large models intrinsically require substantial accelerator memory to accommodate massive tensors, including model weights, activations, and optimizer states, during pre-training, fine-tuning, or even inference. To alleviate this pressure on memory, besides adding more accelerators to meet the demand, offloading these parameters from the accelerator to other storage media such as DRAM is a preferable option for fine-tuning or inference under computationally restricted circumstances. However, the prohibitive cost of data movement renders it a theoretically plausible yet practically unattractive solution. Previous state-of-the-art methodologies improved inference throughput by retaining partial model state in situ across multiple mini-batches, but incur intricate hyperparameters and excessive cache-exchange overhead. In this work, we propose a comprehensive workflow to address these challenges, focusing on dynamic analysis of model-system compatibility and on prioritizing computational intensity over data movement. We show that the proposed workflow facilitates both fine-tuning and inference of foundation models with higher throughput under restricted computational resources. Compared to the state-of-the-art approach, our framework attains a remarkable speedup of over 4x for training and 2x for inference with a 30-billion-parameter model on a single NVIDIA A100 GPU.
Keywords
Deep Learning, Heterogeneous Computing, Offloading, Large Batch Training, Large Language Model

Reviews and Discussion

Review 1
Rating: 3

The paper proposes a workflow to address the challenges of training and inference of LLMs. The workflow focuses on dynamic analysis of model-system compatibility and on prioritizing computational intensity over data movement. The methods involve a hyperparameter selection strategy and an improved offloading scheme.

Strengths

  1. The paper focuses on a critical and urgent issue: the training and inference efficiency of LLMs.
  2. The paper provides a GCP image to partially reproduce their results.

Weaknesses

Major issue 1

The paper "theoretically" analyzes the hyper-parameter selection process in Section 3.1 and provides experimental validation in Section 5.1. However, Section 3.1 cannot demonstrate effective theoretical information.

  1. The paper recommends a minimum batch size for achieving high throughput, and caching is not employed if this threshold cannot be met. Is this threshold a hyper-parameter that needs to be configured? The threshold is clearly tied to the hardware specifications and the model architecture, which makes it hard for other users to apply this technique: they have to tune this parameter empirically instead of deriving it theoretically.
  2. The method can select appropriate batch sizes by assessing the working-memory requirement per token during benchmarking, but the details of this selection are unclear to readers.
  3. For checkpoint selection, the paper proposes to retain only one layer on the accelerator at any given time during inference. Is this strategy theoretically guaranteed to be optimal? The checkpoint setting also depends on the hardware specifications and model architecture, so I doubt the proposed choice is optimal at all times.

The hyper-parameter balancing strategy in Section 3.1 is not a solid theoretical analysis. It reads like a case study for a specific hardware/software setting and is difficult to transfer to other settings; users cannot follow it to set hyper-parameters optimally. A rough sketch of how hardware-dependent such a threshold is follows below.
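To illustrate how strongly such a threshold depends on the hardware and the model (points 1 and 2 above), here is a rough, hypothetical back-of-the-envelope sketch; it is not the paper's procedure, and every device and model number in it is an assumption. It compares the time to stream one layer's weights over PCIe against the compute time for a batch of tokens, and caps the batch by the per-token working memory.

```python
# Hypothetical estimate of the "minimum batch size" threshold discussed above.
# Not the paper's algorithm; all hardware/model numbers are illustrative assumptions.

def min_batch_tokens(gpu_flops, pcie_bw_bytes, params_per_layer, bytes_per_param=2):
    """Tokens per layer pass needed so compute time matches weight-streaming time."""
    t_io = params_per_layer * bytes_per_param / pcie_bw_bytes  # stream one fp16 layer from DRAM
    flops_per_token = 2 * params_per_layer                     # ~1 multiply-add per weight
    return t_io * gpu_flops / flops_per_token                  # break-even batch, in tokens

def max_batch_tokens(free_gpu_bytes, working_bytes_per_token):
    """Tokens that fit on the accelerator given per-token working memory (activations + KV cache)."""
    return free_gpu_bytes / working_bytes_per_token

# Illustrative A100-class / ~30B-model numbers (assumed, not measured):
b_min = min_batch_tokens(gpu_flops=312e12,         # fp16 tensor-core peak
                         pcie_bw_bytes=25e9,       # roughly PCIe 4.0 x16 in practice
                         params_per_layer=0.6e9)   # ~30B parameters spread over ~48 layers
b_max = max_batch_tokens(free_gpu_bytes=40e9,      # A100-40GB, weights offloaded
                         working_bytes_per_token=1.5e6)

print(f"break-even batch ~ {b_min:,.0f} tokens; memory cap ~ {b_max:,.0f} tokens")
# If b_min exceeds b_max on a given machine, caching/offloading cannot reach the
# compute-bound regime there, so the threshold is hardware- and model-dependent
# rather than a universal constant.
```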

Other major issues

  1. The paper claims that Balance Beam is "a workflow to optimize the trade-off between latency and throughput performance of LLMs". However, there is no discussion of latency in the proposed method or experiments. From my understanding, the column-wise traversal in Figure 1.a will increase latency significantly compared to row-wise traversal (a schematic contrast of the two orders is sketched after this list).
  2. The evaluation results are based on the authors' own implementations of both the baseline and the proposed method. However, the baseline may be much weaker than the current SOTA solution. Is it possible to import a public SOTA implementation and compare against that?
  3. The paper uses FLOPS to quantify arithmetic intensity. FLOPS is a measure of achieved compute rate, while arithmetic intensity is the ratio of total floating-point operations to total data movement; the two are not interchangeable (the distinction is spelled out below the list).
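As a purely schematic aside on point 1 (this follows common FlexGen-style offloading terminology and is not code from the paper): row-wise traversal completes each batch end-to-end and therefore returns its first results after a single pass over the layers, while column-wise traversal reuses every streamed layer across all batches, so no batch finishes until the entire grid has been processed.

```python
# Hypothetical schedule sketch; the two loop orders trade weight traffic against latency.

def row_wise(batches, layers, load, compute):
    # Finish one batch through all layers before starting the next:
    # the first batch completes after a single pass (low latency),
    # but every layer is re-streamed once per batch (heavy weight traffic).
    for b in batches:
        for l in layers:
            load(l)          # stream layer weights from host DRAM
            compute(l, b)

def column_wise(batches, layers, load, compute):
    # Reuse each loaded layer across all batches before moving on:
    # weights are streamed only once per layer (high throughput),
    # but no batch finishes until the last layer has been processed (high latency).
    for l in layers:
        load(l)
        for b in batches:
            compute(l, b)

if __name__ == "__main__":
    log = []
    column_wise(["b0", "b1"], ["l0", "l1"],
                load=lambda l: log.append(f"load {l}"),
                compute=lambda l, b: log.append(f"compute {l} on {b}"))
    print(log)  # each layer is loaded once and reused across both batches
```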
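For reference on point 3, the standard roofline-model definitions (textbook material, not taken from the paper) make the mismatch explicit:

```latex
% Standard roofline-model quantities (for reference):
\[
\text{FLOPS} = \frac{\#\,\text{floating-point operations}}{\text{execution time}}
\quad\text{(an achieved rate, op/s)},
\qquad
I = \frac{\#\,\text{floating-point operations}}{\#\,\text{bytes moved}}
\quad\text{(arithmetic intensity, op/byte)},
\]
\[
\text{attainable FLOPS} \;\le\; \min\!\bigl(P_{\text{peak}},\; I \cdot B_{\text{mem}}\bigr).
\]
```

Here $P_{\text{peak}}$ is the peak compute rate and $B_{\text{mem}}$ the relevant bandwidth (HBM or PCIe); a kernel is compute-bound only when $I \ge P_{\text{peak}} / B_{\text{mem}}$. Reporting achieved FLOPS therefore says nothing about arithmetic intensity unless the bytes moved are also reported.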

Minor issues

  1. It is not clear whether the proposed method can be applied to other hardware settings, such as other GPUs, TPUs, or larger scales.
  2. the FlexGen proposed in (Sheng et al., 2023) have showed -> has shown
  3. Table 2. Optimal -> Ours

Questions

What are the limitations and potential negative impact of Balance Beam?

Review 2
Rating: 3

The paper focuses on several hyperparameters (KV cache, batch size, and the number of gradient checkpoints) combined with a newly introduced column-wise traversal approach. By tuning these hyperparameters with some systematic optimization, the paper shows 2x speedups over the baseline. The paper also proposes that the approach can be applied to fine-tuning scenarios, where it achieves high acceleration.

Strengths

The main strength of this paper is that the FlexGen-style approach is extended to training scenarios with less data traffic. In previous offloading-based training scenarios, it was impossible to scale the batch size due to memory capacity.

The paper introduces new hyperparameters into offloading, so tuning them (KV cache, batch size, and gradient checkpoints) looks important for better utilization. For OPT-175B, turning off the KV cache shows optimal performance, which is especially interesting.

Providing real values for the experiments makes it easy to compare with other papers. In addition, if the code were contributed to a popular framework, it would be very helpful for applying common system optimization methods to derive higher throughput alongside the newly introduced methods.

Weaknesses

I have concerns about novelty, practicality, and evaluation.

Novelty

  • The main contribution of the paper is providing a tunable hyperparameter space. However, the proposed hyperparameters are not newly discovered but already existing knobs. Personally, I find it interesting that the widely used KV caching can be treated as a hyperparameter, and putting it together with the other parameters is meaningful work. However, I don't think the novelty is sufficient for publication at ICLR.

  • The techniques presented in Section 3.2 are mostly commonly used ones, especially in ZeRO-Offload. Even though the paper does not strongly claim them as contributions, it is not acceptable to introduce them in the methods section without any citations.

Practicality

  • This paper suggests the batch size and sub-batch size as the key hyperparameters. However, unlike in inference, changing the batch size affects the final accuracy, so it should not be tuned for throughput alone. In addition, I was a little disappointed that the paper only provides results from various tuning points. I was expecting systematic/automatic tuning software, a policy, or at least a guideline on how to achieve the goal; without them, the contribution is quite limited.

Evaluation

  • The authors seem to acknowledge that using large batch sizes for training impacts accuracy (Section 6.1), but there is no evaluation of accuracy.

  • Some experiments can be misleading. For training, 'with optimal hyperparameter settings' seems to use different effective batch sizes, and only FLOPS results are given. If the effective batch sizes differ, part of the throughput difference comes from the different number of update steps, and this should be accounted for (see the note below).
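To spell the concern out (generic accounting, not the paper's numbers): let $N$ be the number of training samples, $B$ the effective batch size, $t_{\text{fb}}$ the forward/backward time per sample, and $t_{\text{update}}$ the per-step optimizer (and offloading) overhead. Then

```latex
\[
T_{\text{epoch}}(B) \;\approx\; \frac{N}{B}\,\bigl(B\,t_{\text{fb}} + t_{\text{update}}\bigr)
\;=\; N\,t_{\text{fb}} \;+\; \frac{N}{B}\,t_{\text{update}} .
\]
```

A larger $B$ amortizes $t_{\text{update}}$ over more samples, so the measured FLOPS improves even though the arithmetic per sample is unchanged; comparing configurations with different effective batch sizes therefore mixes kernel efficiency with step-count effects (and also changes the optimization trajectory, i.e., accuracy).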

Writing

  • It is relatively minor compared to the other issues, but the writing needs improvement. For example, the paper does not clearly distinguish training from inference; even though they share a lot in common, some guidance is needed for the readers. In addition, the description of generative inference with a KV cache is not sufficient to understand this work. Even though the KV cache is gaining popularity, this paper lacks enough information to understand the key aspects, such as why a KV cache is beneficial for transformer inference (the standard argument is recapped below).
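For reference (standard transformer-decoding arithmetic, independent of this paper): with hidden size $d$ and a prefix of length $t$, the per-layer cost of emitting the next token is

```latex
\[
\underbrace{O(t\,d^{2})}_{\text{re-project }K,V\text{ for the whole prefix}} + \; O(t\,d)
\quad\text{without a cache},
\qquad
\underbrace{O(d^{2})}_{\text{project only the new token}} + \underbrace{O(t\,d)}_{\text{attend over cached }K,V}
\quad\text{with a cache},
\]
```

so generating $n$ tokens costs roughly $O(n^{2}d^{2})$ without the cache versus $O(n\,d^{2} + n^{2}d)$ with it, at the price of $O(n\,d)$ cache memory per layer per sequence. That memory cost is precisely why the cache becomes a tunable knob in memory-constrained offloading.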

There are some ambiguities and minor mistakes in the paper. If Fig. 1(b) were placed next to Fig. 3, it would pair better with Sec. 3.2. In Sec. 1, '5000' and 's' are split across lines, which is confusing. In Sec. 1, insert 'and' between 'batch size' and 'number of gradient checkpoints'. I cannot find any references for gradient checkpointing and recomputation.

Questions

For the evaluation, are the training FLOPS results measured with different effective batch sizes? For training, does this work utilize only host memory (i.e., no storage offloading for training)?

Review 3
Rating: 5

The paper argues that system hyperparameters are challenging to tune in a distributed setup and designs and introduces a workflow to mitigate these challenges. The results show promising improvements over the SOTA framework in both inference and training.

Strengths

+ The results are promising and cover both inference and training of LLMs across a range of benchmarks.

+ The framework seems to be straightforward to use.

Weaknesses

- The contribution of the paper is limited, and it is not clear whether the presented results generalize to other networks and setups.

- The majority of the presented techniques have already been explored, and the paper mainly focuses on changing system hyperparameters for a particular setup.

- The choice of baseline is not well supported, and it is not clear whether the baseline implementation was also fully optimized for the target platform.

Questions

(Q1) Did you also tune the system hyperparameters of the baseline?

(Q2) How does your approach scale to larger setups (beyond 4 GPUs)?

(Q3) I am curious whether the proposed approach has any negative impact on accuracy, training convergence, etc. Did you see any degradation?

(Q4) I commend the authors for devoting almost a page to future work. However, I think the suggested future directions are actually essential for seeing the potential benefit of the work. In its current form, the paper has a very narrow scope and no clarity on how to apply it in different situations and scenarios.