Characterizing the Training Dynamics of Private Fine-tuning with Langevin Diffusion
We provide a quantitative analysis of training and neuron dynamics in private fine-tuning.
Abstract
Reviews and Discussion
This work studies the training dynamics of DP fine-tuning using a Langevin diffusion approximation of DP-SGD, and provides a theoretical understanding of how fine-tuning techniques (such as DP-LP, DP-FFT, or a hybrid) affect the out-of-distribution performance of a 2-layer ReLU network. Using a zeroth-order approximation, the authors provide a theoretical explanation of an empirical phenomenon: randomly initialized linear heads distort pre-trained backbone features in the early stages of DP-FFT. To mitigate or even avoid the feature distortion, they propose a hybrid method that combines DP-LP and DP-FFT, and further examine the privacy budget allocation across DP-LP and DP-FFT. Extensive numerical experiments support the theoretical findings and demonstrate the effectiveness of the proposed hybrid method. Overall, this work deepens our understanding of DP fine-tuning by providing a solid theoretical explanation. I found this work interesting and thus recommend an Accept.
Strengths
- Uses Langevin diffusion as an approximation of DP-SGD to study the dynamics of DP fine-tuning. While obtained from an approximate model, the findings are very interesting and align with empirical observations.
- The proposed hybrid tuning method is supported by theoretical guarantees and evidence from extensive experiments.
- Insights into privacy budget allocation are of practical interest.
Weaknesses
In practice, training is usually done in discrete steps (say t=1,2,...), but Langevin diffusion is a continuous approximation for t>0. This gap may hamper the generality of this work's findings. For example, if the time threshold in Theorem 3.3 lies in [0,1], then feature distortion might not be an issue, because discrete training starts from t=1. Therefore, it would be helpful if the authors could provide a short discussion of the value of this threshold, or name some driving factors that may significantly affect it.
Questions
- Theorem 3.4 says that after a certain time, DP-FFT does not distort the pre-trained features, but Eq. (10) is stated for a different time range. Is there a typo in the range stated in Eq. (10)?
- typos around line 1148
Thanks for your positive feedback!
Weakness
“In practice, training is usually done in discrete steps (say t=1,2,...), but Langevin diffusion is a continuous approximation for t>0. This gap may hamper the generality of this work's findings. For example, if the time threshold in Theorem 3.3 lies in [0,1], then feature distortion might not be an issue, because discrete training starts from t=1. Therefore, it would be helpful if the authors could provide a short discussion of the value of this threshold, or name some driving factors that may significantly affect it.”
Response:
Thank you for highlighting the distinction between continuous Langevin diffusion and the discrete time steps used in practical training. This is an important point, as it helps clarify how theoretical findings based on continuous dynamics translate to real-world implementations. We will add a short discussion of this issue in our paper. To approximate continuous Langevin diffusion in discrete time steps, one can use the Euler–Maruyama method, a standard numerical approach for solving stochastic differential equations. This method discretizes the continuous process into steps of size η, giving updates of the form θ_{k+1} = θ_k − η ∇L(θ_k) + σ √η ξ_k, where η is the discrete step size, σ is the noise scale, and ξ_k represents Gaussian noise at each step (a minimal numerical sketch is given after the list below). By choosing an appropriate η, we approximate the continuous process well over discrete steps, allowing theoretical insights into the continuous process to inform practical implementations. As for the time threshold in Theorem 3.3, it represents the initial time interval during which feature distortion occurs. In practical terms, the rate and degree of feature distortion can vary based on factors such as:
- The initialization scale of model parameters, particularly the linear head,
- The learning rate in DP-SGD, which affects the stability and alignment of features in early training stages,
- The noise scale σ, as higher noise may delay alignment or lead to greater initial feature distortion. We also provide a detailed description in Remark F.2, Appendix F, of our updated paper.
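For concreteness, here is a minimal numerical sketch of the Euler–Maruyama discretization described above, applied to a toy quadratic loss. The step size, noise scale, and loss function are illustrative choices, not the paper's actual training setup.

```python
import numpy as np

def euler_maruyama_langevin(grad_fn, theta0, eta=0.01, sigma=0.5, num_steps=1000, seed=0):
    """Discretize d(theta_t) = -grad L(theta_t) dt + sigma dB_t with step size eta.

    Each step has the form of a noisy (unclipped) DP-SGD update:
        theta_{k+1} = theta_k - eta * grad L(theta_k) + sigma * sqrt(eta) * xi_k,
    with xi_k drawn from a standard Gaussian.
    """
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    for _ in range(num_steps):
        xi = rng.standard_normal(theta.shape)
        theta = theta - eta * grad_fn(theta) + sigma * np.sqrt(eta) * xi
    return theta

# Toy loss L(theta) = 0.5 * ||theta||^2, so grad L(theta) = theta.
theta_final = euler_maruyama_langevin(grad_fn=lambda th: th, theta0=np.ones(5))
print(theta_final)  # fluctuates around 0 with spread controlled by sigma and eta
```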
Question
“Typos in Equation (10), Line 1148”.
Response:
Thank you for catching that error. Yes, the time range in Equation (10) is indeed a typo. We appreciate your careful reading, and we will correct Equation (10) and Line 1148 in the revised version of our paper.
I appreciate the authors' detailed response to my questions. I am satisfied with the revised manuscript.
This paper addresses the challenge of differentially private fine-tuning of deep learning models. The authors highlight that naïve full-parameter fine-tuning leads to misalignment between the pre-trained features and the last layer. They propose a hybrid strategy that combines linear probing with fine-tuning, demonstrating theoretically and empirically that this approach mitigates feature distortion. The theoretical framework is based on a simplified Langevin diffusion and a two-layer ReLU neural network. The paper’s theoretical insights are further supported by experimental evaluations on various vision tasks and models.
Strengths
S1. The importance and relevance of the problem tackled — differentially private deep learning — is highlighted well, given the challenge posed by the high dimensionality of typical models.
S2. The concept of splitting the privacy budget between full parameter tuning (or Full Fine-Tuning, FFT) and Linear Probing (LP: tuning only the last layer) appears intriguing and potentially valuable for practical applications.
S3. The attempt to provide rigorous theoretical analysis addresses a gap in the current literature.
S4. The paper includes multiple experiments across diverse datasets and models, supporting the benefits of the proposed hybrid approach.
Weaknesses
W1. The idea of combining FFT and LP is not entirely novel, as similar approaches were introduced and experimentally tested by Tang et al. (2023). A more detailed discussion of how this paper builds on or extends prior work is needed.
W2. The theoretical model appears oversimplified. Using a 2-layer neural network with ReLU activations while enabling a decoupling of feature learning and classification seems far removed from the practical settings under consideration. It is unclear how this model aligns with realistic applications, particularly since results in Section 4.1.1 suggest exponential convergence, implying that the underlying problem is no harder than strongly convex optimization (or that PL conditions hold for the loss).
In addition, the analysis is done for a non-standard simplified Langevin diffusion instead of DP-SGD. The benefits of such an approach seem questionable. The authors "Apply a zeroth-order asymptotic expansion" which sounds very complicated but looks like a simple removal of Brownian motion representing Gaussian noise. This noise is the core component of DP training, and its removal makes method (1) equivalent to simple gradient flow. In addition, it ignores the crucial per-sample clipping operation, which makes DP deep learning feasible. Thus, the proposed approximation loses the main features of differentially private training. The authors say on line 164
our modeling preserves the noisy behavior characteristic of DP-SGD
which needs to be justified.
W3. Assumptions 3.1 and 3.2 are unconventional and strongly restrictive. Only one previous work referenced these assumptions, making it difficult to accept them as standard or practical in the context of simple binary classification.
W4. The experimental setup lacks details on the clipping threshold, a key parameter influencing DP-SGD performance and necessary for reproducibility.
W5. The paper’s theoretical results and mathematical proofs in the Appendix are difficult to follow, partly due to unclear notation and insufficient explanation. Specific issues and questions include:
- The tilde notation in line 214 is ambiguous.
- The choice of zero-mean Gaussian initialization for the linear head is restrictive.
- The meaning of "optimality" in line 232 is unclear.
- After a quick look at the work of Ganesh et al. (2023b), I did not find the exact statement of Theorem 4.1. Could the authors please point me to the exact place in the original paper?
- Theorems 4.2 and 4.3 are hard to comprehend and are presented with almost no commentary or explanation.
- It is unclear to me why the authors say that "According to Theorem 5.1, a greater proportion of the privacy budget should be allocated to DP-LP when the total privacy budget is smaller."
- Moreover, I would like to ask what the parameter in Theorem 5.1 is. It seems like it can make the bound arbitrarily large.
- Section 5.2 relies on assumptions from the Appendix, which are also very strong (like E.7). Such an approach makes it hard to assess the theoretical contributions of the paper adequately.
- What does the variable on line 442 denote? How big can it be in practice for realistic parameter values? It is unclear how Corollary 5.3 is obtained in the main paper text.
- The steps of the Theorem 3.3 proof are unclear. For instance, how is formula (28) obtained? Why does the probability on line 958 equal the stated value?
Questions
Most of my concerns are mentioned in the Weaknesses part.
Q1. Is there a reason why Langevin diffusion is defined differently from Ganesh et al. (2023b)?
Q2. How realistic are Assumptions 3.1 and 3.2? Were they experimentally validated?
Q3. The public pre-training and private fine-tuning approach has been seriously questioned recently. I would appreciate the authors' thoughts on the recent position paper by Tramèr et al. (2024).
Tramèr, F., Kamath, G., & Carlini, N. (2024). Position: Considerations for Differentially Private Learning with Large-Scale Public Pretraining. Proceedings of the 41st International Conference on Machine Learning.
Question:
“Public pre-training and private fine-tuning approaches have been seriously questioned recently. I would appreciate the author's thoughts on the recent position paper by Tramèr et al. (2024).”
Response:
We generally agree with the points raised by Tramèr et al. as guiding principles. That being said:
- On the utility critique, there are use cases where there exists public data (that can be responsibly curated) that is representative of the private domain. For example, natural language on public web forums has similarities to speech patterns on phone messaging apps (while not being identical). While their observations are generally true (e.g., pretraining on ImageNet won’t help a Sarcoma classifier), there are still important use cases where pre-training on public data can help.
- On the privacy critique of public pretraining, we feel the points brought up by Tramèr et al. are not just a weakness of pretraining on public data, but of differential privacy as a whole, which is not nuanced enough to capture contextual integrity. Indeed, there has been work (Cummings et al., CCS 2021) showing that users do not understand or appreciate the privacy affordances of differential privacy. This does not mean that DP as a whole should not be studied, but it broadens the scope of interesting privacy questions to include alternative metrics and techniques that are more aligned with user expectations.
- More generally, we do not think that topics should be considered “not worth studying” simply because they are out of fashion or controversial. There are responsible and irresponsible ways of deploying DP finetuning, as with any technology. We are trying to understand the foundations of a problem, which is separate from the (very important) problem of understanding how to deploy technologies responsibly.
- There are some potential solutions to the problems Tramèr et al. mentioned:
  - Use synthetic data (e.g., the random priors used in Tang et al. 2023); see "Learning to See by Looking at Noise" by Baradad et al. (NeurIPS 2021).
  - Do differentially private pre-training; see "ViP: A Differentially Private Foundation Model for Computer Vision" by Yu et al. (ICML 2024).
Question:
“How realistic are Assumptions 3.1 and 3.2? Were they experimentally validated?”
Response:
We discussed these assumptions in our response to Weakness-3. Additionally, empirical results from Hancheng Min et al. (2024) align with Assumption 3.1, providing practical evidence that supports its validity.
Your question highlights a fundamental challenge in deep learning theory—the gap between theoretical assumptions and empirical realities. Developing strong theories often requires certain idealized assumptions on the data that may not align with real datasets like ImageNet. In deep learning theory, there are typically two main approaches to making assumptions on data:
- Statistical Assumptions: These assume distributional properties of data, often unrelated to real datasets but motivated by theoretical goals. For instance, Lee et al. (NeurIPS 2022) in “Convergence for Score-Based Generative Modeling with Polynomial Complexity” assume the data distribution satisfies a log-Sobolev inequality to leverage stochastic differential equation theory.
- Geometric Assumptions: These assume certain geometric patterns in data points, which is common in studying training dynamics of multi-layer networks. For example:
- Kumar et al. (ICLR 2022) assume that out-of-distribution (OOD) data points are perpendicular to training data. (https://openreview.net/forum?id=UYneFzXSJWh)
- Phuong and Lampert (ICLR 2021) assume orthogonal separability.
- Wang and Pilanci (ICLR 2022) and Min et al. (ICLR 2024) also use geometric assumptions relevant to neural network convergence and alignment.
- Implicit assumptions: assume a family of abstract loss functions with certain properties used in standard optimization theory, such as convexity, the PL condition, and smoothness. These assumptions restrict the properties of the loss landscapes and thus implicitly restrict properties of the training data.
- Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, Michael I. Jordan. “How to Escape Saddle Points Efficiently”. ICML 2017.
Question:
“Is there a reason why Langevin diffusion is defined differently from Ganesh et al. (2023b)?”
Response:
Our definition of Langevin diffusion is equivalent to that of Ganesh et al. (2023b), with a difference in notation. In our work, we use the noise scale σ, which directly corresponds to the noise multiplier in DP-SGD, while Ganesh et al. use a different symbol for the same quantity. We chose σ for consistency with DP-SGD terminology. When we apply the Euler–Maruyama method to our Langevin diffusion, we obtain a discrete update that corresponds to DP-SGD. For our analysis with clipping, please refer to “Appendix F: Theory with Clipping”.
It’s also worth noting that there are two equivalent definitions of Langevin diffusion commonly used in the literature:
- “On the universality of Langevin diffusion” (https://arxiv.org/abs/2204.01585v3) by Ganesh et al.
- “Initialization Matters: Privacy-Utility Analysis of Overparameterized Neural Networks” (https://proceedings.neurips.cc/paper_files/paper/2023/hash/1165af8b913fb836c6280b42d6e0084f-Abstract-Conference.html) by Ye et al.
Both definitions provide similar privacy guarantees, as shown in Lemma 2.1 of the first paper and Theorem 3.1 of the second.
- Tilde notation: The notation (x,y) ∼ (x’,y’) means that y=y’. We will remove this notation and simply write y=y’ in our paper.
- Gaussian initialization: We make this assumption based on the following papers:
- Ananya Kumar, Aditi Raghunathan, Robbie Jones, Tengyu Ma, Percy Liang. Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution. ICLR 2022. In their experiments, they initialize the linear head with a zero-mean Gaussian distribution.
- Xinyu Tang, Ashwinee Panda, Vikash Sehwag, Prateek Mittal. Differentially Private Image Classification by Learning Priors from Random Processes. NeurIPS 2023. In their experiments, they initialize the linear head of the WideResNets with a zero-mean Gaussian distribution.
- “Optimality”: Thanks for pointing out the ambiguity of “optimality”. It simply means that the vector in question perfectly aligns with the mean data direction of a certain label.
- Ganesh’s paper: The final version of Ganesh’s paper, published in the Conference on Learning Theory, is based on the third version of their arXiv paper: https://arxiv.org/abs/2204.01585v3. Our Theorem 4.1 is based on their Lemma 2.1. Similar results can be found in Theorem 3.1 of “Initialization Matters: Privacy-Utility Analysis of Overparameterized Neural Networks” by Ye et al. (NeurIPS 2023).
- Explanation of Theorem 4.2 and 4.3:
- Explanation of Theorem 4.2:
- Interpretation of constants: the constants respectively measure the maximum and minimum alignment between the pre-trained features and the data points; the remaining constants are the noise scale multiplied by the norms of the pre-trained features.
- Limit behavior: taking the long-time limit gives a lower-bound limit and an upper-bound limit for the expected loss. When there is only one feature vector and one data point, the upper and lower limits coincide and we obtain the exact dynamics of the loss.
- Effect of noise: if we only increase the noise scale σ and fix the other initial settings, these constants become larger. As a result, the expected loss decreases faster but cannot get close to zero because of the large noise. The faster convergence induced by noise is caused by the curvature of the loss landscape; the second term of Ito’s lemma (Equation 2) introduces this curvature property of the loss function into our analysis.
- Effect of pre-training: if we use a worse pre-trained encoder (i.e., decrease the alignment between the pre-trained features and the data points), the loss lower bound moves further away from zero. This agrees with the intuition that, given a bad backbone, linear probing cannot achieve good performance.
- Order of convergence: the linear probing setup gives linear and convex structure, so we obtain exponential convergence, which agrees with the standard results in strongly convex settings.
- Explanation of Theorem 4.3:
- Limit behavior: taking the long-time limit gives a lower-bound limit and an upper-bound limit for the expected loss.
- Effect of noise: if we only increase the noise scale σ and fix the other initial settings, the lower and upper limits both increase. This indicates that the loss cannot get close to zero because of the large noise.
- Effect of data separability: if we only increase the data-separability metric, i.e., make the training data more separable, the upper limit decreases, and the upper and lower limits get closer. This indicates that the loss is expected to converge to a smaller value in the end as the training task becomes easier.
- Effect of pre-training: the limit of the upper and lower bounds does not explicitly depend on the pre-training conditions. The intuition is that for our simplified setting, with different pre-trained backbones, full fine-tuning the whole model would finally produce similar performance conditioned on the training data and the noise.
- According to the proof of Theorem 5.1, when we fix the total training time and increase the noise scale σ, the time necessary for DP-LP increases, so the proportion of time spent in DP-LP over the whole fine-tuning process increases. According to Theorem 4.1, the privacy budget proportion of DP-LP depends on the time proportion of DP-LP. Therefore, a greater proportion of the privacy budget should be allocated to DP-LP when the total privacy budget is smaller.
- Comments on Theorem 5.1: Sorry for the confusion. In this theorem, we use r as a high-probability upper bound for the privacy budget necessary to do linear probing. The remaining parameter denotes the failure probability commonly used in inequalities derived from concentration bounds.
- The assumptions used for Section 5.2 are based on the following papers:
- Etienne Boursier, Loucas Pillaud-Vivien, and Nicolas Flammarion. Gradient flow dynamics of shallow ReLU networks for square loss and orthogonal inputs. In Advances in Neural Information Processing Systems, volume 35, pages 20105–20118, 2022. In this paper, neuron alignment is carefully analyzed for the case where all data points are orthogonal to each other.
- Ananya Kumar, Aditi Raghunathan, Robbie Jones, Tengyu Ma, Percy Liang. Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution. ICLR 2022. They assume that the pre-trained encoder has been orthogonalized to have orthonormal rows.
- Nilesh Tripuraneni, Michael I. Jordan, and Chi Jin. On the theory of transfer learning: The importance of task diversity. NeurIPS 2020.
- About the notation on line 442: it denotes the imbalance matrix (Definition E.8 in Appendix E, Lines 1905–1909). Prior work on gradient flows has found that the imbalance matrix remains invariant over the evolution of gradient flows modeling gradient descent. For example,
- Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. ICML 2018.
- Simon S Du, Wei Hu, and Jason D Lee. Algorithmic regularization in learning deep homogeneous models: Layers are automatically balanced. NeurIPS 2018.
- Corollary 5.3 is derived by comparing the expected loss upper bounds provided in Proposition 5.2 (for DP-LP-FFT) with those in Theorem E.24 (for DP-FFT). These comparisons allow us to predict the relative performance of DP-LP-FFT under certain conditions.
- Explanation of the proof of Theorem 3.3:
- Equation (28): this equation results from expanding the time derivative of the cosine similarity between the two vectors involved. It follows directly from differentiating the cosine-similarity expression.
- High-probability guarantee: The probability reflects the likelihood that initialization causes feature distortion, which depends solely on the sign of each entry in the linear head. With zero-mean Gaussian initialization (commonly used in deep learning frameworks), each entry has a probability of 1/2 of being positive or negative. Given independent entries in the linear head, the overall probability of alignment (sign consistency) is obtained by multiplying these per-entry probabilities of 1/2.
Review:
“The experimental setup lacks details on the clipping threshold, a key parameter influencing DP-SGD performance and necessary for reproducibility.”
Response:
Thank you for highlighting the need for clarity on the clipping threshold. In our experiments, we used clipping thresholds C = 0.1 and C = 1. For the vision benchmarks in our paper, these values are based on established empirical studies that explore optimal clipping thresholds for DP-SGD. In particular, Appendix B.1 of Unlocking High-Accuracy Differentially Private Image Classification through Scale by Soham De et al. (2022, DeepMind) provides an in-depth analysis of clipping norms, concluding with the choice of C = 1 for their primary experiments.
Our experimental settings also draw from the methodologies outlined in:
- Unlocking High-Accuracy Differentially Private Image Classification through Scale by Soham De et al., 2022.
- Differentially Private Image Classification by Learning Priors from Random Processes by Tang et al. (NeurIPS 2023).
We hope this additional context clarifies our choice of parameters and provides the reproducibility details necessary.
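To make the role of the clipping threshold C concrete, below is a minimal, self-contained sketch of the per-sample clipping and noising step in DP-SGD. The helper name, batch shapes, and parameter values are illustrative assumptions, not the paper's actual training code.

```python
import numpy as np

def dp_sgd_gradient(per_sample_grads, C=1.0, sigma=0.5, seed=0):
    """One privatized gradient: clip each per-sample gradient to norm C, sum, add noise.

    per_sample_grads: array of shape (batch_size, dim), one gradient per example.
    C: clipping threshold (e.g., C = 0.1 or C = 1 as in the experiments above).
    sigma: noise multiplier; the added Gaussian noise has standard deviation sigma * C.
    """
    rng = np.random.default_rng(seed)
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    # Scale each per-sample gradient so its norm is at most C.
    clipped = per_sample_grads * np.minimum(1.0, C / np.maximum(norms, 1e-12))
    noisy_sum = clipped.sum(axis=0) + rng.normal(scale=sigma * C, size=per_sample_grads.shape[1])
    return noisy_sum / len(per_sample_grads)  # average over the batch

grads = np.random.default_rng(1).normal(size=(8, 10))
print(dp_sgd_gradient(grads, C=0.1, sigma=0.5))
```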
Review:
"Assumptions 3.1 and 3.2 are unconventional and strongly restrictive. Only one previous work referenced these assumptions, making it difficult to accept them as standard or practical in the context of simple binary classification."
Response:
We appreciate the reviewer’s concerns about the unconventionality of Assumptions 3.1 and 3.2. However, these assumptions are rooted in a well-established line of research within neural alignment, developed over multiple studies focused on multi-layer network dynamics. Neural alignment has been studied in prior work for orthogonally separable data (Assumption 3.1) and for orthogonal data.
- Assumption 3.1, for example, draws from foundational work, including Phuong and Lampert (2021), Wang and Pilanci (2022), and Min et al. (2024). Min et al. provided a detailed review of papers using Assumption 3.1 in Section 3.2 of the paper “Early Neuron Alignment in Two-layer ReLU Networks with Small Initialization” (ICLR 2024).
- Mary Phuong and Christoph H Lampert. “The inductive bias of ReLU networks on orthogonally separable data.” ICLR 2021.
- Yifei Wang and Mert Pilanci. “The convex geometry of backpropagation: Neural network gradient flows converge to extreme points of the dual convex program.” ICLR 2022.
- Hancheng Min, Enrique Mallada, and René Vidal. “Early Neuron Alignment in Two-layer ReLU Networks with Small Initialization.” ICLR 2024.
- Regarding Assumption 3.2, we assume that a “clustering” behavior emerges in the pre-trained features, which allows the features to work well in transfer learning (Galanti et al., 2022). This phenomenon is well documented empirically in the neural collapse literature (Kothapalli, 2023), suggesting that pre-trained features tend to converge around the mean direction of the data in each class. Assumption 3.2 restricts which neurons are activated by positively and negatively labeled data; as a result, any pair of positive data points activates the same set of neurons. From a contrastive learning viewpoint, this makes their representations semantically similar (Saunshi et al., 2019). Namely, when the features and data inputs are normalized unit vectors, the difference between the representations of a positive data pair is bounded in terms of the maximum cosine similarity between the features and the data points.
In conclusion, the form of Assumption 3.2 is consistent with empirical observations from the neural collapse literature and contrastive learning theory. By aligning with these well-documented phenomena, the assumption provides a reasonable and interpretable foundation for our theoretical analysis.
We will clarify these points in our manuscript.
-
On the Zeroth-Order Approximation:
Review:
“In addition, the analysis is done for a non-standard simplified Langevin diffusion instead of DP-SGD. The benefits of such an approach seem questionable. The authors ‘Apply a zeroth-order asymptotic expansion’ which sounds very complicated but looks like a simple removal of Brownian motion representing Gaussian noise. This noise is the core component of DP training, and its removal makes method (1) equivalent to simple gradient flow.”
Response:
- Please note that our approximation does not remove the effect of noise, nor is the resulting model equivalent to gradient flow. The noise multiplier σ remains explicitly in our convergence bounds. We retain the key noise effects for the loss dynamics by keeping the second-order term from Ito’s lemma in Equation (2) and preserving the second-order terms associated with Brownian motion (a schematic of this calculation is sketched after this list). This approach allows us to capture the essential stochastic characteristics of DP-SGD without modeling the full noise term directly on the parameters. In essence, this approximation enables us to analyze the expected behavior of parameter updates while preserving the noise-sensitive behavior of the loss itself. By isolating these core elements, we provide insights into the overall training dynamics under differential privacy without losing the major noise effects that influence convergence properties and feature alignment. To support our claim that this approximation does not introduce too much error, we have proved an approximation-error guarantee (Theorem F.4 in the updated manuscript), which shows that our approximated model does not differ too much from the original Langevin diffusion model.
- To complement our results under our approximation, we provided a rigorous theoretical analysis without approximation in Section 4.1.1. This non-approximated model relies on a 2-layer linear network, allowing us to maintain theoretical rigor without the complexities introduced by the approximation. This section presents a practical balance between analytical feasibility and capturing essential dynamics. The approximation allows us to obtain lower bounds for the DP-LP and DP-FFT loss, which are very challenging without the approximation in stochastic analysis.
- Furthermore, we believe that our work represents a significant step in the theoretical analysis of non-linear activations and multi-layer architectures in DP-SGD—an advancement in the field. Future developments in this area, potentially building on our zeroth-order framework, can further enhance the community’s understanding of DP-SGD dynamics in more intricate and realistic settings.
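As a schematic illustration of this point (the notation here is ours and may differ from the paper's Equation (2)), applying Ito's lemma to the loss along a simple Langevin diffusion keeps a second-order term proportional to the squared noise scale, which is why σ survives in the expected-loss dynamics even after dropping the parameter-level noise term:

```latex
% Schematic sketch; assumes the diffusion d\theta_t = -\nabla L(\theta_t)\,dt + \sigma\,dB_t.
\[
  dL(\theta_t)
  = \Big(-\|\nabla L(\theta_t)\|^2
      + \tfrac{\sigma^2}{2}\,\operatorname{tr}\nabla^2 L(\theta_t)\Big)\,dt
    + \sigma\,\nabla L(\theta_t)^{\top} dB_t ,
\]
\[
  \frac{d}{dt}\,\mathbb{E}\big[L(\theta_t)\big]
  = \mathbb{E}\Big[-\|\nabla L(\theta_t)\|^2
      + \tfrac{\sigma^2}{2}\,\operatorname{tr}\nabla^2 L(\theta_t)\Big].
\]
```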
-
Addressing Approximation Criticisms with Additional Analysis
Review:
“In addition, it ignores the crucial per-sample clipping operation, which makes DP deep learning feasible. Thus, the proposed approximation loses the main features of differentially private training. The authors say on line 164 ‘our modeling preserves the noisy behavior characteristic of DP-SGD’ which needs to be justified.”
Response:
- In the updated version of our paper (see attached), we provide an analysis of Langevin diffusion with clipping. Note that this is the first analysis of clipped Langevin diffusion, and the zeroth-order approximation is crucial to the analysis. We also provide the approximation error of the zeroth-order approximation under clipping. See “Appendix F: Theory with Clipping”.
- Theory with clipping and approximation: we have updated our paper on OpenReview and provide an analysis of Langevin diffusion with clipping. The additional theoretical results are: (1) the existence of a unique strong solution of the Langevin diffusion with clipping; (2) the zeroth-order approximation error of Langevin diffusion with clipping; (3) a clipping version of Theorem 3.3 (feature distortion). We put all these additional results in “Appendix F: Theory with Clipping”.
I greatly appreciate the authors' detailed response. However, I have to admit that, unfortunately, I am still not convinced about the theoretical contributions of the paper. Namely, I cannot see their additional value beyond the experimental results. Most of the theorems presented in the current form are unclear to me, and I suggest adding clarifications in the revision. The numerous references to prior works mentioned in the response look solid but do not justify relevance to real-world applications to me. I would recommend adding more concrete, practical recommendations based on the empirical results obtained.
The proof of Theorem F.4 misses several steps in the first inequality. Because of this, I think the result is incorrect: due to the sum over N, the upper bound is expected to scale linearly with N (assuming N is the dataset size, given the undefined notation), which makes the result poor from a differential privacy perspective. Moreover, the basic fundamental properties of clipped Langevin diffusion are currently not understood; e.g., it is unclear whether it even converges (and if so, to what), in contrast to standard Langevin diffusion. Corollary F.7 is claimed as the proof of existence of a unique strong solution for the clipped Langevin diffusion. However, it is not proved that the per-sample loss function has finitely many discontinuities, and I think this is incorrect in the general case of non-smooth activations (like ReLU).
We appreciate the reviewer’s points regarding the simplicity of our theoretical model and the approximation we use. Below, we clarify why these modeling choices are fundamental to our analysis and not limitations.
-
Exponential Convergence and Model Choice
Review:
“It is unclear how this model aligns with realistic applications, particularly since results in Section 4.1.1 suggest exponential convergence, implying that the underlying problem is no harder than strongly convex optimization (or that PL conditions hold for the loss).”
Response: While it is true that our 2-layer neural network exhibits exponential convergence, the key goal here is not solely to establish convergence rates, but rather to preserve the model’s architecture in a way that captures layer-specific dynamics of DP fine-tuning. Abstracting the convergence as PL conditions on a general function would indeed obscure critical structural details of the two-layer setup, and would not allow us to capture the dynamics of adapting a pre-trained encoder with a linear head. Thus, our focus here is not on the order of convergence guarantees, but on preserving architectural features that are intrinsic to the phenomena we study.
Additionally, note that exponential convergence is a common characteristic in simplified settings, as seen in Kumar et al.’s analysis of non-private transfer learning and Min et al.’s work on linear networks, yet these models continue to provide insights applicable to complex architectures.
-
Intuition from Simplified Models
Review:
“The theoretical model appears oversimplified. Using a 2-layer neural network with ReLU activations while enabling a decoupling of feature learning and classification seems far removed from the practical settings under consideration.”
Response:
Two-layer neural networks are widely accepted in the theoretical community as valuable tools for deriving intuition (examples below). Similar to the approach taken by Tian et al. in “Understanding Self-Supervised Learning Dynamics without Contrastive Pairs,” who used a 2-layer model to explore foundational questions in self-supervised learning, our goal here is to identify core principles of representation alignment and feature distortion in DP fine-tuning. Such simplified models allow for clearer insights and serve as a basis for studying more intricate behaviors in deep networks. The theoretical insights derived from these models are supported by empirical results on complex architectures (e.g., Figure 1, Figure 3), demonstrating their relevance to real-world applications. For instance, see below some recent distinguished ML theory papers that use 2-layer NNs for analysis, while studying phenomena that apply empirically to more complex architectures:
- (ICML 2021 Outstanding Paper Award Honorable Mention) Yuandong Tian, Xinlei Chen, Surya Ganguli. Understanding Self-supervised Learning Dynamics without Contrastive Pairs.
- (ICLR 2022 Oral) Ananya Kumar, Aditi Raghunathan, Robbie Jones, Tengyu Ma, Percy Liang. Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution.
- (NeurIPS 2022 Oral) Yuandong Tian. Understanding Deep Contrastive Learning via Coordinate-wise Optimization.
W1:
“The idea of combining FFT and LP is not entirely novel, as similar approaches were introduced and experimentally tested by Tang et al. (2023). A more detailed discussion of how this paper builds on or extends prior work is needed.”
Response: We acknowledge the reviewer’s point that the concept of combining LP and FFT methods has been explored in prior works, including Tang et al. (2023). Our paper makes distinct theoretical and empirical advancements over existing approaches:
- Theoretical Foundation: To the best of our knowledge, our work is the first to provide a rigorous theoretical analysis of DP-LP-FFT under a continuous model. Previous work, including Tang et al., primarily focused on empirical evaluations without theoretical insights. By analyzing the dynamics of DP-SGD using a novel approximation technique based on Langevin diffusion, we bridge a significant gap, providing theoretical backing for the observed phenomena in DP fine-tuning across various privacy settings.
- Broader Scope of Representation Alignment: Kumar et al. show that the out-of-distribution performance of pre-trained features can be distorted in non-private fine-tuning while they leave in-distribution feature distortion as an open question for both non-private and private settings. Unlike prior work, our analysis demonstrates that representation alignment in LP-FFT occurs universally across both private and non-private settings, thereby addressing unresolved theoretical questions posed by Kumar et al. (2021). By highlighting and formalizing this alignment trend as a universal phenomenon, we introduce the concept of "representation alignment" that holds even as data distribution shifts across domains.
  Reference: Ananya Kumar, Aditi Raghunathan, Robbie Jones, Tengyu Ma, Percy Liang. Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution. ICLR 2022 (oral).
- Empirical Generalization and Mechanistic Insight: Tang et al. investigated LP-FFT in a specific scenario involving synthetic pre-training data. In contrast, our experiments systematically confirm that the trade-offs observed by Tang et al. are generalizable across diverse datasets and architectures. We provide a deeper understanding of this phenomenon by identifying the mechanisms of feature distortion and alignment through visualizations and empirical validations. This approach provides a clearer understanding of how and why LP-FFT performs well in DP contexts, paving the way for a theoretical understanding of privacy-utility trade-offs in differentially private models. We also provided results of LP-LoRA in comparison to LP-FFT in our appendix.
For reference, the ICLR2022 oral paper “Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution” also analyzed LP-FFT. However, the novelty and importance of the theoretical insights provided by Kumar et al. were still recognized despite this observation. We hope that, similarly, our theoretical contributions here will be acknowledged for extending foundational insights into novel, underexplored domains.
Thank you for your initial review. We’ve provided a detailed response to your comments, and since the discussion period deadline is approaching, we’d greatly appreciate your feedback on our response.
Your input is important to further improving the work, and we look forward to hearing your thoughts!
Thank you again for your careful reading of our paper. We really appreciate your time and feedback. Responses to each point are included below.
Most of the theorems presented in the current form are unclear to me, and I suggest adding clarifications in the revision.
Thanks for the feedback. We will add our clarifications to Weakness3 and Weakness5 in the paper.
Numerous references to prior works mentioned in the response look solid but do not justify relevance to real-world applications to me. I would like to recommend adding more concrete, practical recommendations based on the empirical results obtained.
We appreciate the reviewer’s suggestion to emphasize the relevance of our work to real-world applications and to provide practical recommendations. We address this feedback by highlighting new practical insights derived from our empirical results that are different from [Tang et al. 2023]. Per your suggestion, we will emphasize these recommendations more strongly in our paper:
- Recommendation on fine-tuning methods:
  a. Exploit Sparsity in DP-LoRA: DP-LoRA demonstrates resilience to representation distortion due to its sparse parameter updates (Table 2). DP-LoRA offers a practical solution for achieving privacy without compromising model expressiveness. We are the first to provide a head-to-head comparison of feature distortion between DP-LoRA and other DP fine-tuning techniques, so this is a new practical recommendation.
  b. Leverage DP-LP for Distortion Mitigation: Based on our findings, practitioners dealing with privacy-preserving fine-tuning in settings sensitive to representation distortion (e.g., medical imaging or personalized recommendation) should consider DP-LP-FFT. This approach effectively mitigates the representation distortion observed in DP-FFT while maintaining robust performance.
- Architecture-Specific Recommendations (Table 1):
  a. CNN-Based Models: Since architectures like MobileNet-v3, ResNet, and WideResNet are more prone to feature distortion (Table 1), practitioners using these models should carefully tune parameters such as the noise scale and clipping thresholds to minimize degradation. Additionally, they should consider integrating DP-LP to counteract feature distortion in these networks.
  b. Transformer-Based Models: Transformers empirically exhibit greater robustness against feature distortion, making them a suitable choice for privacy-critical applications with complex feature representations, such as time-series analysis or natural language processing. Fine-tuning these models with DP-FFT or DP-LoRA can yield reliable performance even under stringent privacy constraints.
- Visualization and Interpretation of Feature Distortion:
  a. Top-1 k-NN Accuracy as a Diagnostic Tool for Feature Distortion (Figure 1, Figure 3): Evaluating feature quality through top-1 k-NN accuracies and applying UMAP to backbone mappings provide a simple yet effective method to interpret feature distortion during fine-tuning. Engineers can use this metric to diagnose and compare the impact of different fine-tuning approaches on feature integrity (a minimal diagnostic sketch follows this list).
  b. Real-time Monitoring During Fine-tuning (Figure 5): By integrating k-NN accuracy checks into the fine-tuning workflow, practitioners can monitor the model's feature representation quality in real time, enabling early intervention and adjustment of training parameters.
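Below is a minimal sketch of the top-1 k-NN diagnostic from item (a). The feature arrays here are random placeholders; in practice they would be the frozen or current backbone outputs on held-out data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_feature_quality(train_feats, train_labels, test_feats, test_labels, k=1):
    """Top-1 k-NN accuracy on backbone features; a drop during fine-tuning signals distortion."""
    knn = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    knn.fit(train_feats, train_labels)
    return knn.score(test_feats, test_labels)

# Placeholder features and labels standing in for backbone outputs.
rng = np.random.default_rng(0)
train_feats, test_feats = rng.normal(size=(500, 64)), rng.normal(size=(100, 64))
train_labels, test_labels = rng.integers(0, 10, 500), rng.integers(0, 10, 100)
acc = knn_feature_quality(train_feats, train_labels, test_feats, test_labels)
print(f"top-1 k-NN accuracy: {acc:.3f}")
```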
Proof of Theorem F.4. misses several steps in the first inequality. Because of this, I think that the result is incorrect as due to the sum over N, the upper bound is expected to scale linearly with N (assuming dataset size due to undefined notation), which makes the result poor from a Differential Privacy perspective.
Thanks for pointing out the missing term in Theorem F.4. We have updated the theorem statement and the proof; the orders in time and noise scale remain the same. Please note that the scaling with N can be removed by writing the loss as a sample-wise average, as is commonly done in mini-batch gradient descent; we kept the additive (non-averaged) form for consistency with the main paper. Please also note that the stated scaling with N is the same as in other papers analyzing DP-SGD with Langevin diffusion, e.g., Ye et al. (2023), for the same additive loss expression; this is not an artifact of our model.
Corollary F.7 is claimed as the existence proof of a unique, strong solution for the clipped Langevin diffusion. However, it is not proved that per-sample loss function has finitely many discontinuities, and I think it is incorrect in the general case of non-smooth activations (like ReLU).
We can generalize the statement in Corollary F.7 from “finitely many discontinuities” to “a discontinuity set with Lebesgue measure zero”. Corollary F.7 is based on Theorem F.6, which only requires the loss function to be measurable, a very weak condition that applies to most common ML architectures.
Moreover, currently, the basic fundamental properties of clipped Langevin diffusion are not understood, e.g., it is unclear whether it even converges (and if yes, to what) in contrast to standard Langevin diffusion.
The criterion for the existence of a stationary distribution for the SDE is provided in reference [1], offering a deeper insight into its fundamental properties. Importantly, the convergence results rely on continuity rather than differentiability, meaning the non-differentiability of functions such as ReLU and clipping does not affect the validity of the analysis.
- [1] (Theorem 2.1.1, Proposition 2.1.2, Theorem 2.2.1) Sandra Cerrai. Second Order PDE’s in Finite and Infinite Dimension: A Probabilistic Approach. Springer, Berlin, 2002.
Dear Reviewer avt2,
We would like to thank you again for your thoughtful feedback and for helping us improve the paper. Your comments were very helpful, and we appreciate your time. Since the discussion period is coming to a close, we were wondering if you had any thoughts on our latest response and updates?
Thanks so much.
The paper examines differentially private (DP) fine-tuning (FT) methods, showing through theoretical and empirical analysis that DP full FT can distort pretrained backbone features due to misalignment between the pretrained backbone and the randomly initialized linear head. To address this, the authors propose DP-LP-FFT, which first performs linear probing and then fine-tunes. Additionally, the paper provides convergence rates for the proposed methods using two-layer neural networks.
Strengths
- The paper investigates an interesting phenomenon in fine-tuning methods, where non-private fine-tuning distorts pretrained features and leads to degraded OOD performance. A similar empirical effect is observed in private settings, as shown in Figure 3.
- The proposed DP-LP running before DP-FFT can theoretically reduce feature distortion. It would be beneficial to provide experimental results to compare with Figure 3.
- The proposed LP mechanism in private fine-tuning is effective, as demonstrated in Tables 1 and 2.
Weaknesses
There is a lack of comparison of feature distortion with DP-LP.
Questions
In line 199, why are the subspaces separated by the two cones?
Thanks for your positive feedback!
Weakness:
Review:
“There is a lack of comparison of feature distortion with DP-LP.”
Response:
In the DP-LP (Differentially Private Linear Probing) method, only the final linear layer, or “head,” is updated, while the pre-trained backbone features remain frozen. Since the backbone features are not fine-tuned, feature distortion does not occur under DP-LP. Thus, a comparison of feature distortion with DP-LP is unnecessary, as there is no risk of altering the backbone representations in this method (a minimal sketch illustrating this is given after the references below).
For additional context on why freezing features prevents distortion, see:
- Kaiming He, Haoqi Fan, YuxinWu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Computer Vision and Pattern Recognition (CVPR), 2020.
- Ananya Kumar, Aditi Raghunathan, Robbie Jones, Tengyu Ma, Percy Liang. Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution. ICLR 2022.
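As a minimal PyTorch-style illustration of this point, the sketch below freezes every backbone parameter and updates only the linear head, so the backbone features cannot move. The module names and shapes are illustrative assumptions, not the paper's actual code, and in the DP setting the final step would use clipped, noised gradients.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for a pre-trained encoder and a randomly initialized head.
backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
head = nn.Linear(64, 2)

# DP-LP: freeze the backbone, so its features (and hence their geometry) are unchanged.
for p in backbone.parameters():
    p.requires_grad = False

optimizer = torch.optim.SGD(head.parameters(), lr=0.1)  # only the head is trained

x, y = torch.randn(16, 32), torch.randint(0, 2, (16,))
with torch.no_grad():
    feats = backbone(x)                       # frozen backbone features
loss = nn.functional.cross_entropy(head(feats), y)
loss.backward()
optimizer.step()                              # would be a privatized update in DP-LP
```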
Question:
“In line 199, why are the subspaces separated by the two cones?”
Response:
The separation of subspaces follows directly from Assumption 3.1, which implies the convexity of two cones, one containing all the positive data points and the other all the negative data points. Hancheng Min et al. provide a rigorous proof of this convex-cone structure in Appendix C of their paper:
- Hancheng Min, Enrique Mallada, and René Vidal. Early Neuron Alignment in Two-layer ReLU Networks with Small Initialization. ICLR 2024.
Thank you again for your initial review. We’ve provided a detailed response to your comments, and since the discussion period deadline is approaching, we’d greatly appreciate your feedback on our response.
Your input is important to further improving the work, and we look forward to hearing your thoughts!
Thanks, I will keep my score.
We would like to extend our sincere gratitude to all the reviewers for their time and effort in evaluating our work. We are especially grateful to reviewer avt2 for the detailed and insightful feedback, which has been invaluable in refining our contributions. We also appreciate the positive feedback from reviewers HKiw and kixW, whose encouraging comments have reinforced our confidence in the value of this research.
Dear AC,
Thank you for overseeing our paper; we appreciate your time and efforts. We wanted to provide a brief summary of the discussion from our perspective. Overall, our work introduces a novel analysis and understanding of differentially private fine-tuning, demonstrating a phenomenon we call representation distortion for DP full fine-tuning. We confirm our theory with extensive empirical evidence on several architectures and settings.
Strengths:
The reviewers highlighted a number of strengths of the work:
- From reviewer avt2 and HKiw: The concept of splitting the privacy budget between full parameter tuning (or Full Fine-Tuning, FFT) and Linear Probing (LP: tuning only the last layer) appears intriguing and potentially valuable for practical applications. The paper includes multiple experiments across diverse datasets and models, supporting the benefits of the proposed hybrid approach.
- From reviewer avt2 and kixW: The attempt to provide rigorous theoretical analysis addresses a gap in the current literature. While obtained from an approximation model, findings are very interesting and align with empirical observations.
- From our rebuttal: We theoretically analyze feature distortion in DP-LP-FFT universally across private and non-private settings.
Key Weaknesses and responses:
- From reviewer avt2: The theoretical model (2-layer ReLU network) appears oversimplified.
- Response: Two-layer neural networks are widely accepted in the theoretical community as valuable tools for deriving intuition. The theoretical insights derived from these models align with empirical results on complex architectures.
- From reviewer avt2: Assumptions 3.1 and 3.2 are unconventional and restrictive.
- Response: These assumptions are rooted in a well-established line of research within neural alignment and are consistent with empirical observations.
- From reviewer avt2: The analysis ignores the crucial per-sample clipping operation.
- Response: We have updated our paper in openreview and we provide analysis of Langevin diffusion with clipping in Appendix F:
- We note that this is the first analysis of Langevin diffusion with clipping, and DP-GD with clipping based on stochastic differential equations in the literature, to our knowledge. Our results include:
- The existence of a unique strong solution of the Langevin diffusion with clipping.
- The zeroth order approximation error of Langevin diffusion with clipping.
- A clipping version of Theorem 3.3 (feature distortion).
- From reviewer avt2: The paper lacks relevance to real-world applications and practical recommendations based on the empirical results.
- Response: We provide a systematic study of DP-LP-FFT methods and feature distortion in fine-tuning. Our paper provides diagnostic tools for feature distortion and our experiments provide architecture-specific recommendations.
This paper is borderline. Although the authors have addressed some comments and questions, the reviewer is still not convinced about the theoretical contributions of the paper, as they cannot see the additional value beyond the experimental results. Moreover, the reviewer also thinks that a result of the paper may not be correct: due to the sum over N, the upper bound is expected to scale linearly with N (assuming N is the dataset size, given the undefined notation), which makes the result poor from a differential privacy perspective. Although the authors have updated the theorem statement and the proof, the reviewer remains skeptical and did not want to support publication of the paper. Due to these technical issues, I cannot recommend acceptance of this paper. However, I strongly encourage the authors to revise the paper carefully and resubmit it to a future venue.
Additional Comments from Reviewer Discussion
NA
Reject