Private Training Large-scale Models with Efficient DP-SGD
Abstract
Reviews and Discussion
This paper introduces FlashDP, a system designed to enhance the efficiency of training LLMs with per-layer DP-SGD. FlashDP proposes a hardware-software co-design approach where key DP-SGD operations are consolidated into a single, cache-friendly computational pass, calculating gradients only once in a fused manner to optimize GPU operations. The empirical results demonstrate substantial improvements in throughput and memory usage for DP pre-training tasks compared to existing methods.
Strengths and Weaknesses
Disclaimer: My background is primarily in DP, and I have limited expertise in system-level optimization or GPU architecture. Therefore, this review focuses on the paper's contributions from a DP practitioner/researcher's perspective and may not fully capture or critically assess the technical novelty of the system-level innovations.
Strengths
- The paper presents FlashDP as a result of hardware-software co-design, introducing novel algorithmic ideas like Hierarchical Reduction Architecture (HRA) and Block-wise All-Reduce, which aim to efficiently calculate per-layer gradient norms with minimal data movement by using on-chip memory for local reductions before global aggregation. It further details how these ideas can be implemented effectively on current GPUs, addressing CUDA limitations through an adaptive kernel approach.
- FlashDP demonstrates strong empirical performance (in terms of throughput and memory efficiency) across different experimental setups compared to baselines like Opacus, GhostClip, and BK for per-layer clipped DP-SGD.
Weaknesses
- The paper showcases FlashDP's effectiveness in DP pre-training for LLMs. While technically challenging, DP pre-training, especially when using largely public datasets, is often less of a priority and concern for DP practitioners compared to DP fine-tuning on private, sensitive data. Common DP fine-tuning paradigms such as DP-LoRA are not explored. Therefore, the focus of this paper does not align with the most pressing practical needs and common workflows within the DP research community.
- FlashDP's design and optimizations focus on per-layer clipping. While the authors support this choice with existing literature, global clipping is generally more prevalent. Adapting FlashDP for global clipping appears challenging, as it would require all per-sample gradients for every layer to be available before the clipping operation can occur.
- While the throughput gains are impressive and clearly demonstrated, the paper does not report on end-to-end wall-time for representative training tasks. Practitioners often rely on wall-time to gauge the actual time and cost savings. Without this, it's harder to fully assess the practical efficiency gains.
- FlashDP's performance relies on deep integration with CUDA and employs an adaptive multi-kernel approach to manage GPU synchronization. While this addresses the limitations of existing system, it might pose challenges for portability to other hardware or even future CUDA versions. Furthermore, the inherent complexity of such low-level optimizations could make FlashDP harder to adopt, maintain, or modify for researchers lacking specialized GPU programming expertise.
Minor points
- Line 39-40: none of the references mentioned are about DP pre-training. Cite DP BERT instead
- Some word choices are uncommon for research papers: Line 62 "prolonging" --> "increasing"; Line 181 "exemplar" --> "example"
- I think a brief discussion on the sampling strategies, in particular the choice between shuffle-based sampling and Poisson subsampling, is important for a system that supports large-scale DP training
- The apostrophes appear to be incorrectly rendered (e.g., 'FlashDPâA˘ Zs', 'PyTorchâA˘ Zs', 'modelâA˘ Zs')
Questions
- The name "FlashDP" is reminiscent of "FlashAttention", and both appear to achieve significant performance gains through I/O-aware designs and kernel fusion. Could you elaborate on any conceptual similarities or inspirations drawn from FlashAttention in the design of FlashDP?
- From a practical standpoint, how do you envision FlashDP benefiting DP practitioners in their common workflows? Do you foresee it being developed into a standalone, user-friendly library, or do you anticipate its core optimizations being integrated into existing, widely-used frameworks such as Opacus or DP-Transformers to enhance their performance?
Limitations
The authors didn't discuss the limitations in the main text, though they provide a brief discussion in the checklist (the focus on per-layer clipping rather than global clipping, and the reliance on fused CUDA kernels).
Final Justification
I'm not fully convinced that the paper will be of sufficient interest to the DP community, because its focus diverges from the common practices (DP-LoRA, global clipping, etc).
That said, I agree that the paper presents a solid piece of work, and wouldn’t object if other reviewers would like to champion acceptance. I also appreciate the authors’ detailed response, hence have increased my score.
Formatting Issues
N/A
We thank the reviewer for the thoughtful and DP-focused feedback. Below, we respond to each point raised:
- Q1: "The paper showcases FlashDP's effectiveness in DP pre-training for LLMs..."
We agree with the reviewer that DP fine-tuning is of critical importance, particularly for protecting user-specific or private downstream data. However, we also argue that DP pre-training is equally important. As already cited in our submission, several recent works [1, 2, 3] have emphasized the necessity and feasibility of applying differential privacy to the pre-training stage of large models, especially in the context of LLMs. Also, pre-training on large, publicly available corpora poses risks related to copyright and data licensing, and differential privacy offers a principled mechanism to protect against unintended memorization or redistribution of such content. Recent work has also explored DP pre-training in this context.
[1] Xuechen Li, Florian Tramer, Percy Liang, and Tatsunori Hashimoto. Large language models can be strong differentially private learners. In International Conference on Learning Representations, 2022.
[2] Zhiqi Bu, Jialin Mao, and Shiyun Xu. Scalable and efficient training of large convolutional neural networks with differential privacy. Advances in Neural Information Processing Systems, 35: 38305–38318, 2022.
[3] Zhiqi Bu, Yu-Xiang Wang, Sheng Zha, and George Karypis. Differentially private optimization on large model at small cost. In International Conference on Machine Learning, pp. 3192–3218. PMLR, 2023b.
Furthermore, our work does not exclude the use of FlashDP in fine-tuning settings. In fact, FlashDP can be directly applied to full-parameter fine-tuning, since the training flow is nearly identical to pre-training.
As for low-rank fine-tuning methods like DP-LoRA: FlashDP's optimization strategy—focusing on fused per-layer weight gradient computation for linear layers—can theoretically be adapted to LoRA. However, LoRA’s trainable weights are typically very small, and the DP-induced overhead on such layers is negligible in practice. In other words, optimizing DP-related operators in DP-LoRA yields much smaller performance gains compared to optimizing them in full-parameter training, since the privacy-related computation is already lightweight in DP-LoRA. This is, in fact, one of the main reasons for the popularity of DP-LoRA: it sidesteps the performance and memory concerns of DP-SGD altogether. Thus, we intentionally focus FlashDP on full-parameter updates where performance constraints are significant.
- Q2: "FlashDP's design and optimizations focus on per-layer clipping..."
We acknowledge that global clipping is commonly used in practice. However, per-layer clipping has also gained significant traction recently, particularly in LLM-scale DP training, due to its superior memory efficiency and empirical performance. FlashDP is designed to serve this need effectively. Moreover, as cited in our submission (Lines 164–165), recent work [4] has shown that per-layer clipping can achieve comparable utility to global clipping. This further supports the practicality of adopting per-layer clipping as a viable alternative in large-scale DP training.
[4] Jiyan He, Xuechen Li, Da Yu, Huishuai Zhang, Janardhan Kulkarni, Yin Tat Lee, Arturs Backurs, Nenghai Yu, and Jiang Bian. Exploring the limits of differentially private deep learning with group-wise clipping. arXiv preprint arXiv:2212.01539, 2022.
That said, we agree there is value in supporting both paradigms. In fact, we are actively developing FlashDP-v2, a new version that extends our optimization techniques to support global clipping. This extension introduces considerable system-level complexity (e.g., asynchronous kernel launching, CPU-GPU overlap, synchronization management). These system-level complexities make it challenging to present FlashDP-v2 and FlashDP within a single paper, so we have decided to describe them in two separate works.
In our view, it is reasonable and modular for different backends to support different clipping modes. Practitioners can simply use FlashDP for per-layer clipping, and FlashDP-v2 (forthcoming) for global clipping—both are independent DP-compatible operator modules. We will note this as future work in the camera-ready version.
- Q3: "While the throughput gains are impressive and clearly demonstrated..."
We appreciate this point. While we do not explicitly report wall-time, our paper reports throughput in tokens per second (tokens/s). Since each training step processes a fixed number of tokens (batch size × sequence length), wall-time can be easily computed as:
wall-time = total_tokens / throughput
Thus, throughput and wall-time are effectively interchangeable metrics in our setting. We will clarify this equivalence explicitly in the camera-ready version.
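For illustration only (the numbers here are hypothetical, not taken from the paper): a run that processes $10^{10}$ tokens at a measured throughput of $10^{5}$ tokens/s corresponds to a wall-time of $10^{10} / 10^{5} = 10^{5}$ seconds, i.e., roughly 28 hours.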
- Q4: "FlashDP's performance relies on deep integration with CUDA..."
First, FlashDP is already implemented as a user-facing PyTorch interface. Users do not need to write or modify any CUDA code to adopt it. All optimization logic is encapsulated within the framework, making it simple to integrate.
Second, while FlashDP is currently optimized for CUDA-based GPUs, its design principles—cache-aware fused computation and hierarchical reduction—apply broadly to many modern parallel architectures, as multi-level cache hierarchies are common across platforms. We believe the FlashDP design can be adapted to other hardware in the future, and we plan to explore this direction.
- Other Points:
- "Line 39-40...": Thank you for noting that none of the cited references pertain to DP pre-training. We will replace or remove them and instead cite more relevant works in the camera-ready version.
- "Some word choices are uncommon for research papers...": Thank you for pointing this out. We agree that the word choices “prolonging” (Line 62) and “exemplar” (Line 181) are uncommon in academic writing. We will revise them to “increasing” and “example,” respectively, in the camera-ready version.
- "I think a brief discussion on the sampling strategies...": We thank the reviewer for highlighting this point. We will add a brief discussion on sampling methods (e.g., Poisson vs. uniform) in the revised paper. Note that FlashDP does not interfere with any sampling strategy, as these decisions are made at the data loading stage. FlashDP operates after sampling, during the backward pass.
- "The name 'FlashDP' is reminiscent of 'FlashAttention'...": Inspired by the naming convention of FlashAttention, which similarly optimizes GPU kernel fusion and memory access, we adopted the name “FlashDP” to reflect our method’s core: efficient, cache-friendly operator fusion for DP training.
- "From a practical standpoint...": FlashDP is designed as a standalone, modular library. We plan to release it publicly with documentation and PyTorch integration. Furthermore, we would be happy to contribute upstream to widely used DP libraries such as Opacus and DP-Transformers, and plan to submit PRs to facilitate adoption.
- Regarding limitations: while FlashDP currently focuses on a CUDA-optimized implementation, a key direction for future work is expanding support to other hardware backends such as AMD GPUs or custom AI accelerators. This would help further democratize differentially private LLM training. We believe this is primarily an engineering effort and can benefit from community collaboration as adoption grows. We sincerely apologize for the oversight—although we referred to Appendix A.3 in the checklist, we inadvertently omitted the actual discussion of limitations in the appendix. We will correct this in the camera-ready version.
Thanks for the response.
- Re. DP pretraining: None of the works the authors cited [1,2,3] is about DP pretraining. [1] is in fact one of the first few works on DP fine-tuning.
- Re. DP-LoRA: The authors made the following claim: "In other words, optimizing DP-related operators in DP-LoRA yields much smaller performance gains compared to optimizing them in full-parameter training, since the privacy-related computation is already lightweight in DP-LoRA. This is, in fact, one of the main reasons for the popularity of DP-LoRA: it sidesteps the performance and memory concerns of DP-SGD altogether."
This is not true IMO. DP-LoRA is more favorable in large-scale models because the noise it adds is much smaller compared to full-parameter fine-tuning. Empirically, its performance is at least on par with full fine-tuning, while enjoying a much smaller computational / storage cost.
I'm still not fully convinced that the paper will be of sufficient interest to the DP community, and thus keep my score. On the other hand, I will not object if other reviewers recommend acceptance.
Thank you for the follow-up. We apologize for the confusion caused by our previous reply and would like to take this opportunity to clarify our intent more precisely.
First, we would like to further clarify the core design of FlashDP:
FlashDP is primarily designed to optimize the operator-level performance (e.g., throughput, memory usage, latency) of DP-SGD, particularly in the context of large-scale matrix multiplications. In this sense, our focus is on the computational paradigm rather than the specific application scenario (e.g., pre-training vs. fine-tuning).
As we mentioned earlier, pre-training and full-parameter fine-tuning share the same underlying computation pattern—both rely heavily on large matmul operations over full model parameters. From the standpoint of operator efficiency, they are effectively equivalent, and FlashDP can be directly applied to both without modification.
Regarding your first point, we apologize that our earlier response may not have clearly conveyed the importance of DP pre-training. Upon further literature review, we believe DP pre-training remains a critical area of research.
For example, in [1], the authors argue that labeling models as “privacy-preserving” based solely on fine-tuning with DP—after pre-training on web-scraped “public” data—may dilute the meaning of privacy. They emphasize that even publicly available data can raise privacy concerns, especially if sensitive information is memorized. Applying DP during pre-training is presented as a promising mitigation strategy. In [2], the authors conduct an in-depth analysis of the trade-offs between computation, privacy, and utility in DP-LM pretraining. They introduce extended scaling laws that capture the key factors in DP pretraining and offer principled guidance for optimal training configurations. This work provides strong theoretical and empirical support for applying DP during pre-training. In [3], the authors demonstrate how DP pre-training can be achieved efficiently on vision transformer models, with memory usage and utility comparable to non-private baselines.
Thus, we respectfully argue that DP pre-training is not only meaningful but increasingly relevant, especially as LLMs trained on large web-scale corpora are widely deployed.
[1] Position: Considerations for Differentially Private Learning with Large-Scale Public Pretraining
[2] Scaling Laws for Differentially Private Language Models
[3] Memory-Efficient Differentially Private Training with Gradient Random Projection
That said, we would like to re-emphasize that, as noted earlier, FlashDP does not depend on whether the model is in the pre-training or fine-tuning phase—it is compatible with both. In fact, the underlying computation pattern for full-parameter fine-tuning is identical to that of pre-training, particularly for Linear layers. As we show in Part 2, FlashDP can also be applied directly to LoRA-based fine-tuning with minimal changes.
For the second point, "DP-LoRA is more favorable in large-scale models because the noise it adds is much smaller compared to full-parameter fine-tuning. Empirically, its performance is at least on par with full fine-tuning, while enjoying a much smaller computational / storage cost.", we realize now that our previous use of the term “performance” may have been ambiguous. The reviewer is absolutely right that DP-LoRA often performs better in terms of utility or accuracy. However, our intent was to refer specifically to operator-level performance or efficiency—that is, improvements in runtime, memory, or throughput. We appreciate the opportunity to clarify this distinction.
What we meant was that DP-LoRA, as a computational paradigm, is already lightweight in terms of operator cost. Since the DP-related operations (e.g., per-sample gradient clipping and noise addition) in DP-LoRA are applied to very small matrices, there is little headroom for FlashDP to further improve efficiency. In contrast, full-parameter DP-SGD involves large matmul operations where these DP operations become a bottleneck—precisely the scenario FlashDP is designed to accelerate. To be clear, FlashDP can be applied to DP-LoRA, but the efficiency gains are minimal simply because LoRA itself is already highly efficient relative to full matrix updates.
To further illustrate this point, we provide a detailed breakdown of LoRA’s forward and backward computation and explain how FlashDP can be applied without requiring any adaptation:
- Notation (per sample; the batch dimension is handled identically across samples and omitted for clarity)
- input tensor: $x \in \mathbb{R}^{T \times d}$
- base weight: $W \in \mathbb{R}^{p \times d}$
- LoRA down-proj: $A \in \mathbb{R}^{r \times d}$
- LoRA up-proj: $B \in \mathbb{R}^{p \times r}$
- LoRA scale factor: $s = \alpha / r$
- output: $y \in \mathbb{R}^{T \times p}$
- output gradient: $g = \partial \mathcal{L} / \partial y \in \mathbb{R}^{T \times p}$
- Forward: $y = x W^{\top} + s\,(x A^{\top}) B^{\top}$
- Backward Gradients
- Input gradient: $\partial \mathcal{L} / \partial x = g W + s\, g B A$,
where $g W \in \mathbb{R}^{T \times d}$, $g B \in \mathbb{R}^{T \times r}$, and $g B A \in \mathbb{R}^{T \times d}$.
- Base weight gradient (non-trainable): $\partial \mathcal{L} / \partial W = g^{\top} x$
- LoRA B gradient: $\partial \mathcal{L} / \partial B = s\, g^{\top} (x A^{\top})$
- LoRA A gradient: $\partial \mathcal{L} / \partial A = s\,(g B)^{\top} x$
- FlashDP can be directly used to compute both LoRA gradients above (a reference sketch follows this list). The only difference lies in the reshaping operations, which are computationally negligible and require no modification to our core kernels:
- LoRA B gradient (FlashDP): treat $x A^{\top} \in \mathbb{R}^{T \times r}$ as the layer input and $g$ as the output gradient, i.e., a standard linear weight gradient of shape $p \times r$, scaled by $s$.
- LoRA A gradient (FlashDP): treat $x$ as the layer input and $g B \in \mathbb{R}^{T \times r}$ as the output gradient, i.e., a standard linear weight gradient of shape $r \times d$, scaled by $s$.
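To make the reshaping argument concrete, the following is a minimal PyTorch sketch of the reference (unfused) per-sample LoRA gradient computation; the function name and argument layout are ours for illustration and do not reflect FlashDP's actual API. It shows that both LoRA gradients reduce to the same batched "output-gradient times input" matmul pattern that a fused per-layer linear kernel consumes.

```python
import torch

def per_sample_lora_grads(X, G, A, B, s):
    """Reference (unfused) per-sample LoRA gradients.

    X: (Bsz, T, d_in)  input activations
    G: (Bsz, T, d_out) gradient of the loss w.r.t. the layer output
    A: (r, d_in) down-projection, B: (d_out, r) up-projection, s: LoRA scale
    """
    H = X @ A.t()                                     # (Bsz, T, r): down-projected activations
    dH = G @ B                                        # (Bsz, T, r): gradient w.r.t. H
    # Per-sample weight gradients are batched "output-grad^T @ input" matmuls,
    # i.e. the same linear-layer pattern a fused per-layer DP kernel operates on.
    grad_B = s * torch.einsum('btp,btr->bpr', G, H)   # (Bsz, d_out, r)
    grad_A = s * torch.einsum('btr,btd->brd', dH, X)  # (Bsz, r, d_in)
    return grad_B, grad_A
```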
To support our claims, we benchmarked the backward pass runtime of three methods on a Linear layer: (1) Non-DP LoRA, (2) DP-LoRA, (3) FlashDP-LoRA.
We set batch size = 2, LoRA rank = 8, and let the linear layer's input and output dimensions both equal dim (the first column below). Results are reported in milliseconds:
| dim | Non-DP LoRA (ms) | DP-LoRA (ms) | FlashDP-LoRA (ms) |
|---|---|---|---|
| 2048 | 5.13 | 4.97 | 4.74 |
| 2560 | 7.76 | 8.64 | 8.23 |
| 3072 | 13.30 | 14.32 | 13.72 |
| 3584 | 20.88 | 22.04 | 21.27 |
| 4096 | 30.33 | 31.59 | 30.64 |
| 4608 | 42.65 | 44.06 | 43.00 |
| 5120 | 58.93 | 60.45 | 59.25 |
| 5632 | 77.54 | 79.18 | 77.88 |
| 6144 | 100.64 | 102.40 | 100.96 |
| 6656 | 128.36 | 130.27 | 128.72 |
| 7168 | 159.89 | 164.43 | 160.23 |
| 7680 | 195.69 | 200.60 | 197.51 |
| 8192 | 238.86 | 244.11 | 239.18 |
As the table shows, FlashDP reduces the runtime of DP-LoRA consistently across dimensions. However, the improvements are modest, as all three configurations are already quite efficient. This empirically confirms our earlier point: while FlashDP is applicable to DP-LoRA, the room for operator-level optimization is limited due to LoRA's inherently efficient design.
In summary:
- FlashDP focuses on optimizing the heavy matmul operations in full-parameter DP-SGD.
- While it is technically compatible with DP-LoRA, the efficiency gains are smaller due to the inherently lightweight nature of LoRA's computation.
- Our contributions are thus most impactful in settings where large matrix operations dominate the DP workload.
We appreciate the reviewer’s comments and hope this clarification better communicates the scope and motivation behind FlashDP.
I thank the authors for their detailed explanations as well as the additional experiments on DP-LoRA. I am willing to acknowledge the authors’ efforts by increasing my score. On the other hand, I do want to point out that as a DP researcher / practitioner, DP-LoRA is my go-to choice and is already reasonably efficient. The improvement brought by FlashDP, as the authors pointed out, seems quite limited. As such, I’m not sure how much value it would bring to my (as well as other practitioners’ / researchers’) workflow.
Thank you very much for your thoughtful response and for your willingness to raise the score—we truly appreciate it.
We completely understand your perspective regarding DP-LoRA and its practicality in real-world workflows. Our goal with FlashDP is to address efficiency bottlenecks in full-parameter DP-SGD, where the overhead is most significant. While the current gains of FlashDP in DP-LoRA are limited, we plan to explore more aggressive speed optimizations for DP-LoRA in future work, with the goal of pushing its efficiency to the extreme.
Thank you again for your engagement and constructive feedback throughout the review process.
This paper addresses the problem of efficiently training large-scale models with Differential Privacy (DP). The authors propose FlashDP, a cache-friendly per-layer DP-SGD that achieves computational advantages over existing methods. The authors show that FlashDP achieves a higher throughput compared to the Non-DP method.
Strengths and Weaknesses
- Achieves noteworthy computational advantages over existing methods;
- Differential Privacy is arguably a field with growing adoption and better computational efficiency is a critical need;
- Weakness: The method utilizes dedicated GPU kernels, which appears to be a general approach to enhance performance. It is compared to an implementation that does not have a custom kernel, which seems not a fair comparison.
Questions
- The methodological improvements seem to come from computational considerations such as memory access, context switching, and redundant memory swaps. Given the authors' background in that literature, is this a common approach to improve code by lumping ops into a custom GPU kernel? If not, what is special about this method that makes it amenable; if it is common, what is the novelty of this paper?
- Continuing the previous question, the paper claims that the custom kernel is more efficient and I assume that it was custom designed for the problem. a) To what other problems would this custom kernel be applicable? b) Would an automated tool be able to generate such a fused kernel for the same problem?
- Quote from conclusion: "At the same time, it highlights the need for responsible release practices to mitigate potential misuse under the guise of privacy." This sentence reads confusingly, but it seems important as the final sentence of the conclusion. What aspect in the paper "highlights" which "release practices"?
Limitations
As stated in the final question. Please explain what is meant with the line "At the same time, it highlights the need for responsible release practices to mitigate potential misuse under the guise of privacy."
Final Justification
Private comment to the AC has been made before. For many rounds in the reviews, my main argument was the computational scope of the paper instead of the machine-learning angle. It seems that two versions of FlashAttention have been published in this conference before. Therefore, I have increased the score.
Formatting Issues
Small observations that are not part of the review
Line 279: "PyTorchâA˘ Zs" is a typo or rendering issue? Line 331: "s FlashDPâA˘ Zs a" is a typo or rendering issue?
"Opacus (Yousefpour et al., 2021) and (Rochette et al., 2019) enhance the training efficiency by employing the outer product method" "(He et al., 2022) evaluated the precision equivalence of per-layer clipping with flat clipping on LLMs." use \citet{} instead of \citep{} whenever the reference should be within Text instead of within Parentheses.
"In this section, we introduce the previous non-DP, explicit, and implicate methods of DP-SGD": implicate -> implicit
We appreciate your thoughtful review and detailed questions. Below, we address each of your points in turn.
- Q1: "Is this a common approach to improve code by lumping ops into a custom GPU kernel? What is the novelty of this paper? To what other problems would this custom kernel be applicable? Would an automated tool be able to generate such a fused kernel for the same problem?"
Operator fusion is indeed a widely used optimization technique for improving GPU utilization, especially for workloads that involve multiple consecutive tensor operations. However, in practice, operator fusion remains largely domain-specific and often needs manual tailoring to the computational patterns of a specific application.
In the case of FlashDP, we specifically target the gradient computation workflow under DP constraints, which introduces a unique and challenging fusion scenario:
- Per-sample gradient computation involves a matrix multiplication, followed immediately by a norm-reduction step that requires aggregation across different axes.
- This disrupts typical fusion patterns because the intermediate output of the matmul needs to be preserved and reused across threads or even blocks for the subsequent reduction.
- If these intermediate results are stored in HBM, both memory usage and latency increase significantly.
To address this, FlashDP preserves the blocking pattern used in matrix multiplication and introduces a two-stage hierarchical reduction (intra-block and inter-block), enabling norm computation without writing the intermediate matmul output to HBM. This design is tightly coupled with GPU architecture, particularly SRAM usage and block scheduling.
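As a rough host-side analogue of the two-stage reduction (this is conceptual Python, not the CUDA kernel itself; the tiling and names are ours), each tile of a sample's weight gradient is squared and summed locally, and only the scalar partial sums are combined globally, so the full per-sample gradient never has to round-trip through HBM:

```python
import torch

def hierarchical_sq_norm(grad_tiles):
    """Two-stage squared-norm reduction over the tiles of one sample's gradient.

    grad_tiles: iterable of tensors, each holding the portion of the per-sample
    weight gradient produced by one matmul tile (in the real kernel these stay
    in on-chip SRAM rather than being written to HBM).
    """
    partials = [t.float().pow(2).sum() for t in grad_tiles]  # intra-block reduce
    return torch.stack(partials).sum()                        # inter-block reduce
```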
Current automated operator fusion tools (such as torch.compile) are not aware of DP-specific semantic constraints, particularly the privacy-critical separation of per-sample gradients and the cross-sample norm reduction logic. As a result, they cannot discover the optimal fusion strategy we implemented in FlashDP. Without this DP constraint, standard workloads would allow straightforward fusion and scheduling, which automated compilers can often handle effectively.
In short, FlashDP highlights a novel fusion challenge unique to DP training, and provides a solution that is not yet supported by automated tooling.
Also, FlashDP is not limited to DP-SGD. It is a general backend for any norm-bound-clipping-based training algorithm. As long as per-sample norm computation is needed, the idea from FlashDP can provide the same memory and compute benefits.
- Q2: "Quote from conclusion: 'At the same time, it highlights the need for responsible release practices to mitigate potential misuse under the guise of privacy.'..."
Thank you for pointing this out. This sentence was added in response to the NeurIPS Paper Checklist requirement to consider potential risks of misuse and to discuss responsible mitigation practices. Specifically, our concern is that as tools like FlashDP make large-scale DP training more efficient and accessible, it becomes easier for practitioners to claim differential privacy without rigorous accounting or transparency (e.g., not reporting ε/δ, noise scale, or clipping strategy).
Thus, this sentence is intended to highlight that performance gains should not come at the cost of accountability, and any deployment of DP-trained models should still follow sound privacy analysis and transparent reporting.
We agree that this statement could be made more precise and will revise it in the camera-ready version for clarity and context.
- Q3: Small observations / formatting issues
Thank you for noting the minor formatting and citation issues. Specifically:
- "Line 279...": We will fix the LaTeX rendering bugs such as “FlashDPâA˘ Zs” and “PyTorchâA˘ Zs” which were caused by incorrect character encoding.
- "Use \citet{} instead of \citep{}...": For citations like “Opacus (Yousefpour et al., 2021)” and “(He et al., 2022)”, we agree that \citet{} is preferable when citing as part of the sentence. We will fix these instances.
- "Implicate -> implicit...": We will correct “implicate” to “implicit” in the phrase “explicit and implicate methods of DP-SGD”.
All of these corrections will be addressed in the camera-ready version.
I thank the authors for the thoughtful rebuttal. The responses demonstrate the computational challenges in the DP-SGD implementation, and I appreciate the comments on the technical concerns and the minor formatting issues I raised.
There is a unique challenge in the fusion of DP training. Particularly, the need to preserve intermediate matmul outputs for norm clipping while avoiding HBM. There is a suggestion that FlashDP can serve as a general backend for any norm-bound-clipping-based training algorithm, and more examples and evidence would make the paper more compelling.
After carefully considering the paper and rebuttal, however, I maintain the assessment regarding the paper's contribution. While the authors have convincingly argued that FlashDP addresses a computational fusion challenge specific to DP training that current automated tools cannot handle, the core contribution focuses on computational and engineering aspects rather than machine learning methodology. If I understand correctly, the paper's main innovation lies in the development of custom GPU kernels that optimize memory access patterns and reduce redundant computations for DP-SGD training.
Thank you very much for recognizing the technical value of our GPU kernel optimization efforts. We truly appreciate your acknowledgment.
That said, we understand that your remaining concern appears to center around whether custom GPU kernel optimization work qualifies as a "machine learning methodology," and whether such contributions align with NeurIPS’s paper categories.
To clarify this, we would like to draw your attention to prior NeurIPS papers such as FlashAttention [1] (NeurIPS 2022) and FlashAttention-3 [2] (NeurIPS 2024). Both papers focus on GPU operator fusion and system-level engineering to improve deep learning workloads, and they are both accepted by NeurIPS. Let us briefly describe them to better support our case:
- FlashAttention proposes an I/O-aware exact attention algorithm that fuses multi-step attention computation into a single GPU kernel. By applying a tiling strategy, it avoids excessive HBM reads/writes of the large attention matrix, computing in high-speed on-chip SRAM instead and freeing up memory bandwidth.
- FlashAttention-3 builds upon the original FlashAttention by leveraging new features in NVIDIA Hopper GPUs. It introduces warp-level asynchrony and overlapping of tensor core computation with TMA memory transfers, while interleaving tiled matmuls and softmax for better hardware utilization. Additionally, it adopts FP8 chunk-wise quantization to achieve faster computation while maintaining numerical stability.
[1] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
[2] FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
At a high level, these works and our FlashDP share the same research focus: applying GPU kernel fusion to optimize tasks in machine learning. What differentiates each work is the unique set of technical constraints imposed by their respective application domains:
- FlashAttention addresses I/O bottlenecks in attention computation and overcomes the challenge of fusing softmax through online softmax strategies.
- In contrast, FlashDP tackles the bottleneck in DP-SGD, where per-sample gradients must be reused across multiple DP operations. Due to the cross-sample norm reduction, standard fusion strategies break down. To address this, we introduced Intra-block Reduce, Inter-block Reduce, and Block-wise Synchronization to enable effective fusion without sacrificing correctness.
- Our experiments show that FlashDP results in substantial improvements in throughput and memory efficiency during DP-SGD training of large models.
Therefore, we respectfully disagree with the view that “the core contribution focuses on computational and engineering aspects rather than machine learning methodology” should be considered a reason for rejection. FlashAttention and its follow-up papers demonstrate that such engineering-focused contributions are not only valid but have been consistently recognized at NeurIPS, affirming their alignment with the conference's standards and scope.
While we understand your personal preference might lean more toward methodological innovation in ML, we believe that operator-level optimization for ML workloads is increasingly vital, especially as model sizes continue to scale.
Given the above, we sincerely ask you to reconsider your assessment. If our response resolves your concerns, we would deeply appreciate it if you could consider increasing your score.
Finally, we completely understand that your time is very limited, and we truly appreciate your taking the time to respond. As today is the final day of the author-reviewer discussion period, we would be very grateful for any further suggestions or updated evaluation you could provide before the window closes.
Thank you again for your thoughtful engagement and feedback.
Thank you for continuing to engage in the discussion. I was not aware that FlashAttention was published in the conference as that paper has the same computational scope indeed. I will increase my score.
Thank you very much for your thoughtful reconsideration and for increasing your score. We truly appreciate your engagement and constructive feedback throughout the discussion.
This paper proposes an efficient implementation of DP-SGD training. Firstly, the paper adopts per-layer clipping of per-sample gradients, so that cross-layer synchronization of gradient norms is not needed. Secondly, the paper fuses the operations of per-layer gradient clipping into several custom CUDA kernels to make the computation more efficient. Results show that the method improves the memory consumption and computation cost significantly compared to the vanilla DP-SGD, is compatible with several distributed training pipelines, and can be used to train models up to 13B.
Strengths and Weaknesses
Strengths:
- The efficiency improvements are impressive.
- The paper shows the feasibility of applying the method to train 13B models and using it in distributed training pipelines. This demonstrates the practical usability of the approach.
Weakness:
- Some results are missing/incomplete.
- It seems like it will take non-trivial efforts to apply the approach to networks with new/non-standard network layers.
See "Questions" for details about the above points.
Questions
- My main concern is that some important results are missing.
- Tables 1 and 2 have missing numbers.
- The paper claims that the method maintains "parity with standard per-layer clipped DP-SGD in terms of accuracy." I agree that it is true by design, but I have some questions about the results.
- The only results I see are Table 3 in the appendix, where we see that DP-SGD and FlashDP have the same validation loss up to 4 decimal points. However, even if we train a model using the same codebase twice, due to the random seeds for data shuffling, etc. (if we do not fix the seed) and the randomness in GPU computation, we don't expect to see the same results. Could the authors explain why the results of DP-SGD and FlashDP match so well?
- Since some of the baselines (such as GhostClipping) use full model clipping (instead of per-layer clipping), it would be necessary to show if per-layer clipping leads to accuracy degradation in a full training process compared to full model clipping.
- (Please correct me if my understanding is wrong.) Since the method requires custom CUDA kernels for the backpropagation process, we will need different implementations for different network layers. For example, Algorithm 1 lists how we implement it for a linear layer in transformers. If we have a different type of layer that has different input/output shapes (e.g., convolution layers in vision models), then we have to implement a different CUDA kernel. That means that the method is not very scalable in terms of implementation. In comparison, TF-Privacy does not need separate implementation for different network layers; while Opacus needs it, the implementation can be on the PyTorch level, which is easier than CUDA kernels. The paper needs to discuss the limitations.
- Do the authors plan to release the code? I don't see it in the attachment.
- Line 286: "the grid synchronization required by CG necessitates launching all blocks simultaneously, which is impractical for DP applications." Could you explain it in more detail?
- Table 1: Why does GhostClip have even less memory usage than Non-DP?
- Typos:
- Line 279: PyTorch
- Line 331: FlashDP
- Line 360: FlashDP
Limitations
I don't see discussions about limitations and the potential negative societal impact of their work.
Final Justification
The rebuttal has addressed my concerns. Therefore, I have increased my score from 3 to 4. The paper would be clearer to readers with the added clarifications on the generalizability of the implemented CUDA kernels, as discussed.
Formatting Issues
I have no concerns about paper formatting.
We sincerely thank the reviewer for the valuable feedback and questions. Below we provide our point-by-point response.
- Q1: "Tables 1 and 2 have missing numbers."
Some entries in the tables are missing. As clearly indicated in the table header, these are due to GPU memory limitations. In such cases, the baseline methods fail to run due to out-of-memory (OOM) errors. Rather than omitting these settings altogether—which we felt might obscure important limitations—we chose to include them with missing results to indicate the limitations of certain methods under constrained GPU memory.
- Q2: "Could the authors explain why the results of DP-SGD and FlashDP match so well?"
FlashDP is designed to improve the GPU efficiency of DP-SGD operators without modifying their mathematical semantics. Therefore, it does not introduce any accuracy degradation by design. In our experiments, to ensure consistent and fair comparison, we used the same random seed across both DP-SGD and FlashDP. This configuration ensures that the observed matching in validation loss is expected and repeatable. We will make this setting explicit in the camera-ready version.
- Q3: "Since some of the baselines (such as GhostClipping) use full model clipping (instead of per-layer clipping)..."
While the reviewer raises an important point regarding the potential impact of clipping strategy on model utility, prior work has shown that per-layer clipping can achieve comparable accuracy to global clipping, especially in large-scale models. Specifically, as cited in our submission (Lines 164–165), He et al. [1] demonstrate that per-layer clipping matches the utility of global clipping in various DP training setups. This suggests that our choice of per-layer clipping is both practically effective and theoretically grounded.
[1] Jiyan He, Xuechen Li, Da Yu, Huishuai Zhang, Janardhan Kulkarni, Yin Tat Lee, Arturs Backurs, Nenghai Yu, and Jiang Bian. Exploring the limits of differentially private deep learning with group-wise clipping. arXiv preprint arXiv:2212.01539, 2022.
Therefore, the use of per-layer clipping in our experiments is unlikely to introduce utility degradation relative to global clipping, and it aligns with trends in scalable DP training practice. We will clarify this more explicitly in the camera-ready version to avoid confusion.
- Q4: "Since the method requires custom CUDA kernels for the backpropagation process, we will need different implementations for different network layers..."
First, in today’s large models—especially transformer-based LLMs—the vast majority of parameters reside in linear layers. Since DP-SGD operates on per-parameter gradients, it is most effective to optimize the gradient processing for layers with the highest parameter count. In contrast, optimizing smaller layers (e.g., LayerNorm) yields limited practical gains. Second, the reviewer mentioned concerns about input/output shape variation. Our algorithm is independent of specific input shapes; as shown in Algorithm 1, the logic does not rely on any shape-specific assumptions. Lastly, while convolutional layers are indeed common in vision models, they can be losslessly converted into linear operations via im2col, which is already a standard practice in libraries such as cuDNN. In fact, our FlashDP implementation already supports convolutional layers internally, though we did not include related experiments as this work focuses on LLMs.
Further explanation: convolution workflow via im2col / col2im:
- Forward: Convolution as GEMM via im2col
Given:
- Input tensor: $X \in \mathbb{R}^{C_{in} \times H \times W}$
- Weight tensor: $K \in \mathbb{R}^{C_{out} \times C_{in} \times k_h \times k_w}$
- Output tensor: $Y \in \mathbb{R}^{C_{out} \times H_{out} \times W_{out}}$
- Flatten input patches via im2col (no time cost): $X_{col} = \mathrm{im2col}(X) \in \mathbb{R}^{(C_{in} k_h k_w) \times (H_{out} W_{out})}$
- Flatten weights (no time cost): $K_{col} \in \mathbb{R}^{C_{out} \times (C_{in} k_h k_w)}$
- Matrix multiplication (linear forward): $Y_{col} = K_{col} X_{col} \in \mathbb{R}^{C_{out} \times (H_{out} W_{out})}$
- Reshape to obtain output $Y$
- Backward
Given:
- Loss gradient w.r.t. output: $G = \partial \mathcal{L} / \partial Y \in \mathbb{R}^{C_{out} \times H_{out} \times W_{out}}$
- Flatten: $G_{col} \in \mathbb{R}^{C_{out} \times (H_{out} W_{out})}$
- Gradient w.r.t. weight (the backward pass of the linear layer for weight gradients, where FlashDP is applied): $\partial \mathcal{L} / \partial K_{col} = G_{col} X_{col}^{\top}$
- Gradient w.r.t. input (linear backward for input gradient computation): $\partial \mathcal{L} / \partial X_{col} = K_{col}^{\top} G_{col}$
- Recover input gradient via col2im (no time cost): $\partial \mathcal{L} / \partial X = \mathrm{col2im}(\partial \mathcal{L} / \partial X_{col})$
Here, $\mathrm{col2im}$ denotes the inverse transformation of $\mathrm{im2col}$: it maps the unfolded gradient matrix back to the original input tensor shape by aggregating overlapping gradients from sliding windows into their corresponding spatial positions.
As shown in the equations above, convolution operations can be equivalently expressed as matrix multiplications preceded by im2col and followed by col2im. These two transformations are deterministic, involve no learnable parameters, and do not perform any computation beyond memory reshaping and indexing. In practice, their overhead is negligible—especially on modern accelerators where such data layout operations are heavily optimized. Therefore, from an implementation perspective, supporting convolutions does not require new CUDA kernels; our existing linear-layer kernels can be reused directly after applying these standard transformations. This also highlights the modularity of our design: once DP operators are optimized for linear layers, they naturally extend to convolutional layers through standardized tensor reshaping procedures.
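As a concrete (unfused) illustration of the reduction above, the PyTorch sketch below computes per-sample convolution weight gradients via `torch.nn.functional.unfold` (im2col). It assumes a stride-1, unpadded convolution, and the function is ours for illustration rather than part of FlashDP's interface.

```python
import torch
import torch.nn.functional as F

def per_sample_conv_weight_grad(X, G, kh, kw):
    """Per-sample conv weight gradients expressed as a linear-layer matmul.

    X: (Bsz, C_in, H, W) inputs;  G: (Bsz, C_out, H_out, W_out) output gradients.
    Assumes stride 1 and no padding, so H_out = H - kh + 1 and W_out = W - kw + 1.
    """
    Bsz, C_out = G.shape[0], G.shape[1]
    X_col = F.unfold(X, kernel_size=(kh, kw))          # (Bsz, C_in*kh*kw, L), L = H_out*W_out
    G_col = G.reshape(Bsz, C_out, -1)                  # (Bsz, C_out, L)
    # Same batched "output-grad @ input^T" pattern as a linear layer's weight gradient.
    grad = torch.einsum('bpl,bdl->bpd', G_col, X_col)  # (Bsz, C_out, C_in*kh*kw)
    return grad.reshape(Bsz, C_out, -1, kh, kw)        # (Bsz, C_out, C_in, kh, kw)
```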
- Q5: "Do the authors plan to release the code?"
We confirm that we will release the full FlashDP codebase with the camera-ready version.
- Q6: "Line 286: 'the grid synchronization required by CG necessitates launching all blocks simultaneously, which is impractical for DP applications.'..."
Cooperative Groups (CG) do allow global synchronization across blocks, but they require all blocks to be launched simultaneously, which imposes strict hardware constraints on grid size and shared memory usage. In the setting of DP training, where each sample must undergo individual gradient clipping and noise addition, high parallelism is essential to maintain training throughput. We will elaborate on this in Section 4.2 of the revised version.
- Q7: "Table 1: Why does GhostClip have even less memory usage than Non-DP?"
This is due to an implementation detail in the official GhostClip code. Specifically, in line 503 of privacy_engine.py, the authors explicitly delete all intermediate variables except .summed_grad with the comment: “Aggressive memory saving — delete everything except .summed_grad to save memory!” While this technique indeed reduces memory usage, it significantly harms runtime performance due to recomputation. We chose not to adopt this strategy, as FlashDP aims to balance both memory efficiency and high throughput.
- Q8: "Typos..."
We acknowledge the typographical and compiling issues (e.g., “FlashDPâA˘ Zs”) and will correct them in the camera-ready version.
Dear authors,
Thank you for the reminder! This year, in order to update the score, reviewers are required to submit a "Final Justification," which—according to the guidelines—should "consider rebuttal and discussions with authors, other reviewers and AC". However, the AC–Reviewer discussion period has not yet started. I would prefer to follow the process and submit the final justification after that discussion takes place.
That said, please rest assured that my message indicating an intent to raise the score from 3 to 4 is visible to the AC and all other reviewers. So whether or not the "score" field is formally updated at this point should not affect the decision process.
Thank you for the clarification! We completely understand, and we just wanted to send a gentle reminder in case it had been overlooked. We really appreciate your thoughtful engagement and your support throughout the review process.
Thank the authors for the detailed reply! It addresses most of my questions, so I will increase the score to 4. Below are my remaining points/questions:
- Apologies for missing the note in Table 1. However, I don’t see a similar note in Table 2. It would be helpful to add the same clarification there for consistency.
- Regarding the statement: "as shown in Algorithm 1, the logic does not rely on any shape-specific assumptions." I was confused because in Algorithm 1, the input activation tensor is defined in $\mathbb{R}^{B \times T \times D}$ and the output tensor in $\mathbb{R}^{B \times T \times P}$, where (according to Section 3) $B$, $T$, $D$, and $P$ refer to batch size, sequence length, input feature dimension, and output feature dimension. However, in convolutional layers of vision models, there typically isn't a “sequence length” dimension.
We sincerely thank the reviewer for the thoughtful follow-up and for increasing the score.
Regarding the first point: thank you for catching this inconsistency. We will add a clarification in Table 2 as well, similar to Table 1, to ensure consistency across the presentation.
We apologize for the confusion caused by the notation. Indeed, FlashDP is designed to optimize the matmul operation in linear layers directly. Our intent in Algorithm 1 was to illustrate that this matrix multiplication abstraction (i.e., matmul) generalizes well to both linear and convolutional layers via standard reshape operations. While we used $T$, $D$, and $P$ to describe the LLM setting (where $T$ is typically the sequence length), these symbols are not hardcoded into our implementation and were chosen to reflect the transformer context for clarity.
To clarify this further, we extend the earlier convolution formulation to include the batch dimension:
- Input tensor after im2col: $X_{col} \in \mathbb{R}^{B \times (C_{in} k_h k_w) \times (H_{out} W_{out})}$
- Gradient with respect to the output: $G_{col} \in \mathbb{R}^{B \times C_{out} \times (H_{out} W_{out})}$
We now align these shapes with the notation used in our paper: $T = H_{out} W_{out}$ (sequence length), $D = C_{in} k_h k_w$ (input feature dimension), and $P = C_{out}$ (output feature dimension).
With these substitutions, the tensors become $X_{col} \in \mathbb{R}^{B \times D \times T}$ and $G_{col} \in \mathbb{R}^{B \times P \times T}$.
By applying a lightweight transpose (almost no time cost compared to matmul), we obtain $X_{col}^{\top} \in \mathbb{R}^{B \times T \times D}$ and $G_{col}^{\top} \in \mathbb{R}^{B \times T \times P}$.
Then we can directly use the FlashDP operator, originally designed for linear layers, to process convolutions as well.
Thank you for the explanations! I don't have further questions.
Thank you again for your thoughtful engagement and kind feedback throughout the discussion. We truly appreciate your time and the constructive points you raised.
We recall that you explicitly mentioned all of your concerns had been resolved and that you intended to raise your score. However, we noticed that the official score in the system still shows as a 3, and we completely understand that during a busy review period, updates in the system can sometimes be overlooked.
If so, we would be sincerely thankful if you could kindly update the score at your convenience. In any case, we deeply appreciate your detailed review and the opportunity to improve our work based on your comments.
This paper proposes a memory- and compute-efficient LLM training method (FlashDP) under the privacy constraints of differential privacy (DP). Specifically, the authors have tackled the scalability challenges of DP-SGD (Differentially Private Stochastic Gradient Descent), which primarily stem from per-sample gradient clipping; this demands huge GPU memory and poses a significant challenge for LLMs with billions of parameters.
The authors validate the efficiency of FlashDP on GPT-2 (1B+) and LLaMA (up to 13B) models using 4 A100 GPUs. They show memory and throughput improvements over SoTA DP methods, and compare the validation loss to non-DP baselines.
Strengths and Weaknesses
Strengths
- Training LLMs under differential privacy is a major challenge due to the fundamental tension between privacy and utility. This work directly addresses that trade-off in the context of pre-training, which is far more challenging and demanding than fine-tuning.
- Unlike most prior work that focuses only on fine-tuning existing LLMs under DP, this paper targets the more ambitious and impactful goal of DP pre-training. This broadens the scope and applicability of DP to foundational model training, which is critical for ensuring privacy at scale.
- The proposed FlashDP is scalable to 13B models.
- The architectural advancements and engineering improvements in FlashDP are clearly presented in the paper, and the limitations of prior work are well-motivated. The experimental results showing memory and throughput gains are well depicted.
Weaknesses
A potential weakness is that the comparison of validation loss with non-DP baselines is not extended to downstream task evaluations. Particularly for longer context lengths (e.g., >1K), improvements in validation loss may not directly translate to downstream performance [2], potentially due to effects such as the dilution phenomenon discussed in [1].
[1] Fang et al., What is Wrong with Perplexity for Long-context Language Modeling?, ICLR 2025
[2] Hu et al., Can Perplexity Reflect Large Language Model's Ability in Long Text Understanding?, ICLR 2024
Questions
See the Weakness.
Limitations
Not explicitly discussed in the paper. While the authors have pointed to Appendix A.3 for limitations (in the NeurIPS checklist), I did not find a discussion of the limitations of the proposed method there.
Final Justification
My recommendation is primarily based on the large-scale experimental evaluation, as authors evaluated FlashDP on GPT-2 (1B+) and LLaMA (up to 13B) models, demonstrating clear memory and throughput gains over SOTA DP methods, and also comparing validation loss against non-DP baselines. While the scale and empirical results are compelling, I did not verify the correctness of the method itself, as my expertise lies in cryptographically secure privacy-preserving machine learning (PPML) rather than differential privacy.
Formatting Issues
None.
We sincerely thank the reviewer for their encouraging feedback and strong support for our work. We greatly appreciate your recognition of FlashDP's ambition to support DP pre-training, and your positive assessment of its scalability, architectural clarity, and empirical rigor.
You raised an excellent point regarding the limitations of using validation loss or perplexity (PPL) as the sole measure of model utility, especially in the long-context regime. We agree that PPL improvements may not always correlate with downstream task performance, particularly due to phenomena like context dilution, as discussed in the papers you cited ([1], [2]). These observations are very insightful and valuable.
That said, the primary purpose of our utility evaluation was to confirm that FlashDP maintains full accuracy parity with standard DP-SGD, which our results do show. This alignment is expected, since FlashDP modifies only the execution efficiency of gradient clipping and noise addition, but does not alter the algorithmic behavior or statistical properties of DP-SGD. At the operator level, FlashDP and DP-SGD behave identically in both computation and output.
Nonetheless, we fully agree that exploring downstream task evaluations, as well as alternative utility metrics beyond PPL, are crucial for a more complete understanding of model quality under DP constraints. We will cite and briefly discuss the two references you provided in the camera-ready version, and we see this as a natural direction for future work.
Regarding limitations: while FlashDP currently focuses on a CUDA-optimized implementation, a key direction for future work is expanding support to other hardware backends such as AMD GPUs or custom AI accelerators. This would help further democratize differentially private LLM training. We believe this is primarily an engineering effort and can benefit from community collaboration as adoption grows. We sincerely apologize for the oversight—although we referred to Appendix A.3 in the checklist, we inadvertently omitted the actual discussion of limitations in the appendix. We will correct this in the camera-ready version.
Thanks for the rebuttal! I will keep my initial score.
Thank you for your positive evaluation and for taking the time to review our work. We truly appreciate your support and encouraging feedback.
This paper introduces FlashDP, which allows efficient training of LLMs with differentially private SGD. FlashDP optimizes by implementing specific CUDA kernels that allow for efficient per-example gradient computation and reduce the HBM/SRAM I/O overhead. Though FlashDP requires more than one CUDA kernel due to a required global synchronization of the squared gradient norm, the overall implementation is still more efficient than previous approaches. FlashDP was compared against popular implementations such as Opacus and showed memory and throughput similar to the non-DP approaches.
Strengths and Weaknesses
Strengths
- The paper focused on an important question of implementing DP-SGD efficiently for LLMs. This is a crucial step for making LLMs more private and responsible as they are touching more of the private user data.
- The algorithm design is thoroughly described, and overall it is intuitive to understand even for readers without CUDA programming experience.
- The results indeed demonstrated the efficacy of FlashDP, showing better trade-off between memory and compute than the baseline approaches.
Weaknesses
- Extensibility of the proposed method seems to be limited. The implementation is based on CUDA kernels, and thus might not apply to other hardware, e.g., TPUs or Apple’s M chips. Also, it seems like one has to implement CUDA kernels for different operations if extended to other architectures, making it harder to extend.
- The main results in Table 1 only consider the GPT-2 family, which is a bit outdated. It would also be interesting to know the performance differences between different implementations on fine-tuning or post-training, where DP is likely more needed than in pre-training.
- Minor: multiple rendering of FlashDP’s as FlashDPâA˘ Zs, maybe some latex compiling issue.
Questions
- Is there a detailed profiling of the GPU I/O overhead reduced in FlashDP compared with existing approaches?
- How much faster would a single kernel implementation run, ignoring the inaccurate gradient clipping factor? Just curious about how much the block-wise synchronization overhead brings.
Limitations
Limitations discussed. No potential negative societal impact.
Formatting Issues
N/A
We sincerely thank the reviewer for recognizing the importance and clarity of our work, and for the constructive feedback. Below we provide detailed responses to the concerns raised.
- Hardware extensibility
We agree that generalizing FlashDP beyond CUDA-based GPUs (e.g., to TPUs or Apple’s M-series chips) is an important and challenging task. However, this limitation is not unique to FlashDP, but rather inherent to all low-level GPU operator optimization efforts. For instance, when FlashAttention, a widely recognized operator fusion method, was first released, it only supported NVIDIA Ampere GPUs (e.g., A100), and did not work on previous-generation GPUs, let alone other hardware platforms. This did not diminish its value to the community.
Similarly, FlashDP is a CUDA-native design targeting NVIDIA architectures where per-layer DP-SGD presents clear performance bottlenecks. Extending FlashDP to non-CUDA hardware is certainly possible, but would require new backend-specific system engineering efforts. We believe that presenting FlashDP at a top venue like NeurIPS will help us attract community interest and contributions toward making such extensions a reality.
- Model architecture and relevance
While the primary results in Table 1 are reported on GPT-2 variants, our paper also includes experiments on the LLaMA family (including LLaMA-13B), which shares a decoder-only transformer architecture with GPT models, in Figures 3 and 5. Given that the vast majority of today's open-source and production LLMs adopt similar architectures (e.g., LLaMA, GPT-NeoX, Mistral), we believe these two model families sufficiently represent mainstream pretraining setups.
Furthermore, as FlashDP is implemented at the operator level, it remains fully applicable to fine-tuning and post-training settings as well. Since these stages typically reuse the same model structure and training codebase as pretraining, FlashDP can be directly integrated to speed up and reduce memory usage during private fine-tuning tasks.
- Micro issue: LaTeX character encoding
Thank you for pointing out the formatting issue with rendered text like “FlashDPâA˘ Zs.” These were caused by LaTeX encoding artifacts in PDF generation. We will fix all such issues in the camera-ready version.
- Operator-level DP profiling results
We appreciate your suggestion to analyze the performance of the fused DP-specific operator chain (i.e., linear backward + clipping + noise addition). To that end, we conducted an isolated operator-level benchmark comparing three setups:
- NonDP: standard Linear.backward()
- DP-SGD: a naive implementation of the per-sample Linear backward + per-sample clipping + noise addition (a reference sketch is given after this list)
- FlashDP: fused kernel covering all of the DP-SGD operators in a single pass
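For reference, setup (2) corresponds roughly to the following unfused computation, in which the per-sample weight gradients are fully materialized in HBM. This is a sketch with illustrative names (C is the clipping norm, sigma the noise multiplier), not the exact baseline code used in the benchmark.

```python
import torch

def naive_dp_linear_weight_grad(X, G, C, sigma):
    """Unfused per-layer DP-SGD weight gradient for a linear layer.

    X: (Bsz, T, D) input activations;  G: (Bsz, T, P) output gradients.
    """
    per_sample = torch.einsum('btp,btd->bpd', G, X)   # (Bsz, P, D), materialized in HBM
    norms = per_sample.flatten(1).norm(dim=1)          # per-sample L2 norms
    scale = (C / (norms + 1e-6)).clamp(max=1.0)        # per-sample clipping factors
    clipped = per_sample * scale.view(-1, 1, 1)
    noise = sigma * C * torch.randn_like(per_sample[0])
    return clipped.sum(dim=0) + noise                  # noisy summed gradient
```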
We vary the matrix dimension M and set P = T = D = M. The achieved TFLOPS are reported below:
| M | NonDP (TFLOPS) | DP-SGD (TFLOPS) | FlashDP (TFLOPS) |
|---|---|---|---|
| 2048 | 75 | 41 | 70 |
| 2560 | 76 | 47 | 74 |
| 3072 | 78 | 52 | 77 |
| 3584 | 80 | 55 | 79 |
| 4096 | 83 | 59 | 80 |
| 4608 | 84 | 61 | 81 |
| 5120 | 84 | 62 | 81 |
| 5632 | 84 | 65 | 82 |
| 6144 | 85 | 66 | 83 |
| 6656 | 85 | 66 | 83 |
| 7168 | 87 | 69 | 84 |
These results show that naïve DP-SGD significantly reduces GPU utilization, while FlashDP restores throughput nearly to NonDP levels. This validates the core motivation of FlashDP: fusing gradient computation, norm calculation, clipping, and noise addition into an efficient single kernel, avoiding costly memory access and redundant computation.
We thank the reviewer for acknowledging the submission. If any of our responses have addressed prior concerns, we would greatly appreciate it if you would consider raising the score. We also welcome any further questions, suggestions, or discussions—technical or otherwise—and would be happy to engage further. Thank you again for your time and thoughtful consideration.
There was a consensus among the reviewers that this work addresses an important and challenging problem: the efficient private training of large-scale models. Reviewers were uniformly impressed by the significant empirical results, which convincingly demonstrate that the proposed FlashDP method achieves substantial gains in throughput and memory efficiency, nearing non-private performance levels on models as large as 13B parameters. During the extensive author-reviewer discussion, several limitations were clarified. The primary ones are the limited generalizability of the implemented CUDA kernels and the specific focus on per-layer clipping, as opposed to the more common global clipping paradigm. Furthermore, the evaluation centers on pre-training efficiency rather than fine-tuning, which is a common focus for DP practitioners. However, the authors engaged constructively with this feedback and have committed to addressing these points by adding a dedicated limitations section that discusses the hardware dependency and clipping strategy, and by incorporating a discussion of their utility evaluation metrics.