PaperHub
Average Rating: 3.5 / 10 (withdrawn; 4 reviewers; min 1, max 5, std dev 1.7)
Individual Ratings: 3, 5, 5, 1
Confidence: 3.0 · Correctness: 1.5 · Contribution: 2.0 · Presentation: 1.3
ICLR 2025

Zero redundancy distributed learning with differential privacy

OpenReview · PDF
Submitted: 2024-09-27 · Updated: 2024-11-28
TL;DR

We develop large-scale distributed learning (ZeRO, DeepSpeed) with differential privacy over 512 GPUs and 100B parameters.

Abstract

Keywords
deep learning, differential privacy, distributed learning, system design

Reviews and Discussion

Official Review
Rating: 3

This paper incorporates DP into the established ZeRO optimizer, which is used in distributed learning (e.g., across multiple GPUs) and especially relaxes the burden of training large models. In such cases, where the model size grows beyond a reasonable bound for one GPU, the model must be partitioned (in addition to the data, as in conventional distributed learning), so that each GPU only holds a partial shard of the model. The authors specifically apply DP with the mixed-precision training adopted in ZeRO, and are thus able to show experimentally that incorporating DP into ZeRO does not harm the latter's performance.
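For readers unfamiliar with the mechanism being grafted onto ZeRO, below is a minimal, framework-agnostic sketch of the per-sample clipping and Gaussian noising at the core of DP-SGD; it is not the authors' implementation, and the function and argument names (`dp_clip_and_noise`, `clip_norm`, `sigma`) are illustrative only.

```python
import torch

def dp_clip_and_noise(per_sample_grads: torch.Tensor, clip_norm: float, sigma: float) -> torch.Tensor:
    """Standard DP-SGD privatization: per-sample l2 clipping followed by Gaussian noising.

    per_sample_grads: shape (batch, num_params), one flattened gradient per example.
    clip_norm: the l2 clipping threshold C.
    sigma: the noise multiplier.
    """
    # Per-example l2 norms and clipping factors min(1, C / ||g_i||_2)
    norms = per_sample_grads.norm(dim=1, keepdim=True)               # (batch, 1)
    scale = (clip_norm / (norms + 1e-6)).clamp(max=1.0)
    clipped_sum = (per_sample_grads * scale).sum(dim=0)              # sum_i clip(g_i, C)
    # Gaussian mechanism: noise with standard deviation sigma * C on the summed gradient
    noisy_sum = clipped_sum + sigma * clip_norm * torch.randn_like(clipped_sum)
    return noisy_sum / per_sample_grads.shape[0]                     # average over the batch
```

The paper's contribution, as summarized above, is to carry out this kind of privatization when the model, gradients, and optimizer states are sharded across GPUs and kept in mixed precision.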

Strengths

Significance:

  • The paper studies DP in the challenging distributed learning case, where the model size increases beyond a reasonable bound for one GPU.
  • The model can be partitioned into shards via, e.g., pipeline parallelism and model parallelism. Yet pipeline parallelism can be inefficient due to a non-DP-related issue, the pipeline bubble, where GPUs sit idle while waiting for data to process; the authors therefore focus on the model-parallelism alternative by adopting the ZeRO optimizer.

Weaknesses

Originality:

  • The claimed contribution lacks novelty. To the best of my understanding, the proposed solution is to incorporate the Gaussian mechanism into the ZeRO optimizer (in which mixed precision is used). Yet, the idea of incorporating the Gaussian mechanism into pipeline parallelism (instead of model parallelism) was already studied in He et al. (2022).

Clarity:

  • The ZeRO optimizer, which is a main tool in your method, is not properly covered, for instance in a dedicated section of its own. This makes the paper difficult to follow as a stand-alone work.
  • The threat model is not mentioned: what quantity should be kept private, and why, and from whom; none of these are clarified. This can be implied from the context and from the motivation for studying the work of He et al. (2022), but it is not included in the presented paper, and is therefore not well clarified.

Questions

  1. In lines 12-13 you mentioned the communication-efficiency challenge of training deep networks. As you do not specify at this point that distributed settings are considered, it is not clear what role communication plays in regular (fully centralized) deep learning.
  2. In line 8 you use the symbol $\epsilon$. It is not clear at this stage that this symbol denotes the privacy budget rather than something else.
  3. In line 64 you mentioned that DDP with DP usually incurs a huge memory cost due to caching the per-sample gradients; isn't it rather because of caching a full copy of the DP model?
  4. In lines 72-73: "...which roughly translates to 2B model training with Adam": what does 'B' stand for?
  5. There is no referencing in the paper text for Figure 1.
  6. As ZeRO is not reviewed on its own, it is not clear what 'model states' means in line 187.
  7. In line 192: "a master copy (fp32) of optimizer states. and the half-precision...". What does 'fp' mean? "states. and" is probably a typo.
  8. In lines 196-198, where do the 4, 16, and (4+...) come from? This is not clear or intuitive.
  9. In line 222: "...on the half precision (fp16 or bf16) parameters" what is 'bf'?
  10. In equation (3), lines 308-309, it seems that the numerator and denominator of the fraction are reversed.
  11. In lines 484-485: "Here super-linearity is a property of ZeRO (see Figure 3 in Rajbhandari et al. (2020))..." is not easy to follow; it forces the reader to jump back and forth to the Rajbhandari et al. (2020) paper many times to understand this paper.
Official Review
Rating: 5

The authors propose a DP version of the Zero Redundancy Optimizer (ZeRO) that enables the training of very large models under DP. They compare the non-DP and DP variants in different settings and observe a good level of communication and computation efficiency as well as scalability. Additionally, a core contribution is enabling mixed-precision training through appropriate handling of the loss. These advances enable the authors to train very large models under DP for the first time (e.g., GPT-100B or ViT-10B). The final contribution is the promised release of the codebase as a framework that allows using DP-ZeRO for many applications.

Strengths

  • S1: Introduces a method to train very large models under DP (e.g., GPT-100B or ViT-10B), which is very timely, even with PEFT, as models keep growing.
  • S2: There are analyses, both empirical and in terms of time complexity.
  • S3: The authors make the method work (e.g., address problems with loss scaling and integrate it with many clipping methods).
  • S4: The authors promise to release a framework, although the linked anonymous git is not very structured (See Q2).

Weaknesses

While I believe that the paper has good contributions there are some things that need to be addressed.

  • W1: Contribution 3: "We enable DP deep learning with more than 1B trainable parameters for the first time.": This claim is very strong and I am not sure it is correct, as Ghost Clipping can easily train all parameters of GPT-2-large (smaller than a billion, but not too far from it) on a smaller GPU (24 GB vs. 40 GB) with a physical batch size of 10 [see Figure 4a of Li et al. (2022) [1]]. Another recent work trains a ResNet of almost one billion parameters with out-of-the-box Opacus on 40 GB VRAM [see Figure 3b of Beltran et al. (2024) [2]]. I would suggest that the authors make their claim more precise, as their contributions are good but this claim goes too far and might not be true. You also mention in Line 072 that 2 billion is possible without your contribution. Something like "We enable DP deep learning with far more than 1B trainable parameters for the first time." might be a better choice.

  • W2: Figure 1: This suggests that there are no other works that achieved this many trainable parameters. Unfortunately, I was unable to find citations to these works, so it is hard to say how comprehensive the list is. I would recommend adding appropriate citations (perhaps in the Appendix) and perhaps also some numbers from recent papers like [2] that trained ViT-Huge.

  • W3: Experimental design regarding utilizing the GPUs: It is unclear to me why the authors chose not to utilize the full VRAM of the GPUs in their experiments (Figures 4, 6 and 7) when evaluating the methods. Their experiment in Figure 8 compares maximum B with fixed B=2, and there seems to be a gap. Similarly, related work [2] has also noted that maximum possible GPU memory usage improves the speed. (See Question 1 as well.)

  • W4: Presentation of Figures: The captions do not have comprehensive information about what is shown (e.g., Figures 4 and 6), and at least Figure 1 is never mentioned in the text. I think it would be easier to grasp Figures 2, 6 and 8 using subfigures instead of this very compressed format, which impairs understanding.

  • W5: The usage of $\Psi$ is a bit confusing at first. I recommend adding the unit bytes to it (if I understood it correctly). Also, why is the total $\Psi_\text{model} = 16$ if there are 3 components (line 197)?

Minor:

  • Line 068: "DDP cannot train models that exceed the capacity of one GPU": Would be good to make this clearer as this refers to cases where the model weights do not fit anymore.

  • Line 128: "In DP deep learning, the gradients are made private by post-processing through per-sample gradient clipping and random noising". Post-processing is not the correct term here as post-processing in DP usually refers to the property that something being DP cannot be reversed. I recommend rephrasing this.

  • Line 155: "With $\sigma_\text{DP} = 0$ leading to $\epsilon = 0$ (not private)": It would be good to emphasize that this is only in regard to the noise, but clipping is still different than non-DP.

  • Line 192: The sentence seems broken "optimizer states. and".

  • Line 341: "These techniques improve the relative speed of DP-ZeRO but worsens the absolute speed.": It is unclear what this means. Please rephrase.

  • Line 456: This implies that SGD leads to worse accuracy than Adam, which is not universally true. I would recommend rephrasing this.

[1] Li, X., Tramer, F., Liang, P., & Hashimoto, T. Large Language Models Can Be Strong Differentially Private Learners. In ICLR 2022.

[2] Beltran, S. R., Tobaben, M., Loppi, N., & Honkela, A. (2024). Towards Efficient and Scalable Training of Differentially Private Deep Learning. arXiv:2406.17298.

Questions

  • Q1: Why are you not fully utilizing the memory of the GPU, even though this should bring speedups? Experiments regarding Figures 4, 6 and 7 do not use the maximum B but some fixed B.

  • Q2: Contribution 4: "We will open-source our codebase as the first DP distributed learning library": You name this as a contribution, but the anonymous git just has code for fastDP; or are the two somehow integrated with each other? This is not clear to me from the README.

  • Q3: How does the scaling of your method compare to naively scaling with DDP, as illustrated, e.g., in Figure 7 of Beltran et al. [2] for smaller models? Is your method also beneficial there, or is it only worthwhile for very large models?

  • Q4: Line 155: "These techniques improve the relative speed of DP-ZeRO but worsens the absolute speed.": It is unclear to me what this means. Could you clarify please?

  • Q5: Section 3.2.3: You write that "local computation can accelerate by 50%", as can be seen in Table 2. When treating training_params as almost zero, I arrive at PEFT: $O(BT^2)+4BT\,\Phi_\text{model}$, which is more than half of full training: $O(BT^2)+6.666BT\,\Phi_\text{model}$. What is my misunderstanding here? I cannot clearly see it. (See the worked arithmetic after this list.)

  • Q6: Line 360: "a large number of GPUs so that the micro-batch size B (i.e. per-GPU batch size) is small": Recent work [3, 4, 5] suggests that large (logical) batch sizes are beneficial under DP (in contrast to non-DP, where they are not). How does this influence the relative cost of DP here?

  • Q7: Lines 461-464: Is it possible that there could be a higher speedup with PEFT when fully utilizing the GPU memory? In Figure 6 only 25% of the GPU memory is used for PEFT, but more memory should result in more speed-up.
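For concreteness, the arithmetic behind Q5 above, under that question's own assumptions (training_params treated as zero and the $O(BT^2)$ term ignored); the interpretation of "accelerate by 50%" at the end is my own, not taken from the paper:

\[
\frac{O(BT^2) + 4BT\,\Phi_\text{model}}{O(BT^2) + 6.666BT\,\Phi_\text{model}}
\;\approx\; \frac{4}{6.666} \;\approx\; 0.6,
\qquad \text{equivalently a speedup of } \frac{6.666}{4} \approx 1.67\times.
\]

So under these assumptions PEFT takes roughly 60% of the full-training time (more than half, as Q5 notes), while the throughput improves by about 67%; whether this matches "accelerate by 50%" depends on whether that phrase is read as halving the time or as improving throughput by at least 1.5x.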

[2] Beltran, S. R., Tobaben, M., Loppi, N., & Honkela, A. (2024). Towards Efficient and Scalable Training of Differentially Private Deep Learning. arXiv:2406.17298.

[3] De, S., Berrada, L., Hayes, J., Smith, S. L., & Balle, B. (2022). Unlocking high-accuracy differentially private image classification through scale. arXiv:2204.13650.

[4] Räisä, O., Jälkö, J., & Honkela, A. Subsampling is not Magic: Why Large Batch Sizes Work for Differentially Private Stochastic Optimisation. ICML 2024.

[5] Ponomareva, N., Hazimeh, H., Kurakin, A., Xu, Z., Denison, C., McMahan, H. B., ... & Thakurta, A. G. (2023). How to dp-fy ml: A practical guide to machine learning with differential privacy. JAIR, 77, 1113-1201.

Official Review
Rating: 5

This paper presents a new efficient pipeline for distributed DP optimization of large deep learning models. The authors combine the existing ZeRO distributed pipeline with mixed precision and DP-SGD. This allows DP deep learning to be extended to modern large models with (hundreds of) billions of parameters. The authors demonstrate empirically that the throughput of the non-DP and DP pipelines is rather comparable, suggesting that the additional computational cost of DP can be rather moderate.

Strengths

Fine-tuning large pre-trained models has proved to be an appealing avenue for mitigating the utility hit caused by DP training. The proposed pipeline extends modern state-of-the-art DP fine-tuning to very large models, which have previously been difficult (or even impossible) to train due to the additional computational cost of per-example gradient computation. Therefore, the proposed method is a promising direction towards DP training that benefits from the vast information condensed in large models.

The empirical evaluation demonstrates computational efficiency comparable to the non-DP ZeRO for many different architectures. Furthermore, the authors also demonstrate high scalability of the method w.r.t. the number of available GPUs.

Weaknesses

While I can believe the argument that existing distributed DP methods scale somewhat poorly, I think the authors should still compare the throughput of the proposed method to some existing DDP baselines. For example, how would the throughput compare against DataP? Also, in Remark 4.2, the authors state that "the usefulness of DP-ZeRO can be demonstrated by comparing ZeRO to PipeP.", but I do not see this comparison anywhere.

The clarity of the paper could be improved somewhat significantly. I think some summary of the full pipeline would help the readability a lot. Currently the components of the proposed pipeline are spread across the paper, which makes assessing e.g. the privacy properties a bit difficult. Also, I don't follow the forward/backward propagation discussion in Section 3.1. I guess this is not a novel analysis of this work, and in its current form I don't understand its purpose. I will list some further questions regarding the clarity in the next section.

Questions

  • I don't quite understand the discussion and what is illustrated in Figure 2. Why is there a difference in accuracy between the left two but not between the right two? Is this illustration somehow representative of the proposed partitioning scheme? And if so, which part of the illustration?
  • I'm also a bit confused about how the clipping is actually performed. You mention per-group clipping in Section 2.1, and in Section 3.1 you state that the clipping is done locally on each GPU. Does this mean that the parameter groups are now split amongst the GPUs? It would help to follow the discussion if the authors could summarize the procedure. I guess Figure 3 is meant to summarize it, but I'm struggling to see the different groups in this figure.
  • I guess you apply the clipping per group, so maybe in the discussion of "Per-sample gradient clipping" you could use $\mathbf{q}_{[m],i}$ in the denominators?
  • In "Optimizer state partition" you discuss the memory costs. Can you please clarify where the $16\Psi$ for the basic DataP comes from? (See the accounting sketch after this list.)
  • When you say (e.g. at the beginning of page 5) "$2\Psi_{model}$ fp16 parameters", do you actually mean "fp16 parameters that take $2\Psi_{model}$ bytes of memory"?
  • At the end of page 5 the clipping bound does not appear in the noise scaling. Are you somehow assuming that the maximum norm of the gradient is $1$? If so, please be explicit about it.
  • Are the numerator and denominator flipped in eq. (3)?
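As background for the $16\Psi$ question above: the usual per-parameter memory accounting for mixed-precision Adam in the original ZeRO paper (Rajbhandari et al., 2020) is the following; this is standard bookkeeping, not something taken from the paper under review:

\[
\underbrace{2\Psi}_{\text{fp16 parameters}}
+ \underbrace{2\Psi}_{\text{fp16 gradients}}
+ \underbrace{(4+4+4)\Psi}_{\text{fp32 master parameters, momentum, variance}}
= 16\Psi \ \text{bytes}.
\]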

typos

  • Page 4: "... states. and ..."
Official Review
Rating: 1

This work presents a new differentially private version of the Zero Redundancy Optimizer (ZeRO) called DP-ZeRO. ZeRO is an optimization technique to improve the efficiency of training very large models. The paper's main contributions are:

  • DP-ZeRO has the same level of communication efficiency, computation efficiency and scalability as standard ZeRO.
  • Enables training 1B+ parameter models with DP for the first time.
  • Codebase to run training that will be open sourced.
  • Discussion around loss scaling and clipping.

The paper goes on to describe the ZeRO optimizer as well as the overhead of DP-SGD, which requires per-example clipping. Then, the paper describes that DP-ZeRO is basically the same as ZeRO except for the backpropagation step, as it requires per-sample gradient norms to scale the per-example gradients. However, the authors do not clarify what is required to overcome this challenge and only say that it has a small cost. The authors also explain that many other techniques like PEFT, low-memory optimizers and activation rematerialization are compatible with DP-ZeRO. Then there is an experimental section describing some experiments where the authors show that there is a minimal overhead of running DP-ZeRO compared to ZeRO, and linear scaling.
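As an aside, one plausible way to reconcile per-sample gradient norms with sharded gradients (the backpropagation change the summary above alludes to) is to compute per-shard norms locally and combine them with a single all-reduce. The sketch below illustrates that general technique under the assumption that a torch.distributed process group is already initialized; it is not the authors' code, and all names are illustrative.

```python
import torch
import torch.distributed as dist

def shard_aware_clipped_sum(local_per_sample_grads: torch.Tensor, clip_norm: float) -> torch.Tensor:
    """Per-sample clipping when each rank holds only a shard of every per-example gradient.

    local_per_sample_grads: shape (batch, local_params), this rank's shard of each
    per-example gradient. Returns this shard's contribution to the clipped sum.
    Assumes dist.init_process_group(...) has already been called.
    """
    # Each rank contributes the squared l2 norm of its shard, per example.
    sq_norms = local_per_sample_grads.pow(2).sum(dim=1)          # (batch,)
    # A single all-reduce recovers the full per-example squared norms on every rank.
    dist.all_reduce(sq_norms, op=dist.ReduceOp.SUM)
    full_norms = sq_norms.sqrt().unsqueeze(1)                    # (batch, 1)
    # Clipping factors min(1, C / ||g_i||_2) are identical on all ranks, so shards stay consistent.
    scale = (clip_norm / (full_norms + 1e-6)).clamp(max=1.0)
    return (local_per_sample_grads * scale).sum(dim=0)           # clipped sum for the local shard
```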

Strengths

  • The paper tackles an important problem of private training of large models.

Weaknesses

  • It's not clear what novel algorithm is presented in the paper and what it actually does. Can the authors please clarify?
  • Usually, for large models that require model parallelism, either very small per-core batch sizes are used (such as batch size 1) and/or PEFT is used, which means that the cost of per-example gradients and per-example gradient norms is usually small.
  • The larger practical challenge of private training is the use of very large batch sizes, which is not mentioned in the paper.

Questions

See weaknesses.

Withdrawal Notice

I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.