Revisiting Differentially Private ReLU Regression
Abstract
Reviews and Discussion
This paper studies the problem of privately learning a planted (student-teacher) ReLU regression model. The paper proposes two algorithms (DP-GLMtron and DP-TAGLMtron) to do this. For both algorithms, privacy-utility tradeoffs are computed. The results do not explicitly contain the ambient dimension; instead they contain other terms such as an "effective dimension", which allows the bounds to remain non-vacuous in the regime where there are more parameters (in this case, equal to the number of input dimensions) than samples.
Strengths
This paper tackles a difficult and theoretically interesting problem, i.e. (in its generality) private, non-convex optimization. This paper focuses on the case of a planted ReLU regression model. The initial assumptions (see basically page 4) are, in my opinion, reasonable for this setting.
Weaknesses
The paper has, in my opinion, 4 major weaknesses:
- Its fit with the related literature is not always clear. The main improvement with respect to previous results is highlighted in Table 1, and the baseline is the recent and unpublished work [1]. In line 98, results on convex optimization are presented [2, 3]. These results are not presented with their dependence on the norm of the solution, and it is not mentioned that both results assume the input samples to have bounded norm, which is not the case in this work. This makes the comparison with the convex baseline unclear. Furthermore, no comparison with [4] is provided, which definitely offers a more solid baseline than [1].
- The analysis is not well motivated. The authors' contribution is based on the two proposed algorithms, namely DP-GLMtron and DP-TAGLMtron. Considering the arguable practicality of such algorithms in realistic deep learning problems, why didn't the authors focus, for example, on providing bounds for DP-SGD instead? This makes the results very specific to the toy setting of ReLU regression, for which the two algorithms are specifically designed.
- While it is true that the provided bounds do not explicitly contain the ambient dimension, the true dependence on it through other terms is unclear. Namely, in line 222 the authors set the trace of the covariance to be of constant order, argued to be the improvement with respect to [1]. This removes the dependence on the dimension from the trace of the covariance. The authors also assume the norm of the solution to be bounded in Corollary 4.3. Furthermore, the bounds also depend on additional norm terms, and there is no discussion of why these terms are reasonably small.
- Empirical validation of the claims is relatively weak. In particular, in the main body, empirical results are presented on a synthetic dataset with quickly decaying covariance eigenvalues. The experiments on MNIST are pushed to Appendix B (there might be a typo in the third line of the caption of Figure 2, where it is mentioned that the data is Bernoulli), where the results are, in my opinion, puzzling. The baseline of DP-SGD is presented (which has to be improved upon by the two proposed algorithms to motivate the analysis), and the excess risk increases as the number of samples increases (!). Furthermore, this excess risk is approximately 24, which seems larger than simply random guessing. These dubious results make me suspect a flawed comparison with standard and well-established algorithms such as DP-SGD, pointing towards the second weakness mentioned in this list.
[1] - https://arxiv.org/pdf/2310.08425
[2] - https://proceedings.mlr.press/v32/jain14.pdf
Questions
Can the authors provide an extensive comparison with [4]?
Why didn't the authors focus on DP-SGD? Why is it useful to introduce these two algorithms?
Can the authors comment on their implicit assumptions on the trace of the covariance and the norm of the solution?
Can the authors elaborate on the experimental settings used for DP-SGD on MNIST data?
Limitations
No potential negative societal impact.
We thank Reviewer M8TJ for the thorough review and the insightful comments.
Response to the Weakness 1 and Question 1
We thank reviewer M8TJ for the constructive feedback, and we will include these comments in the updated version. We acknowledge that our main improvement is highlighted in Table 1, with [1] as the baseline. While [4] also studies DP-GLMs with non-convex losses, it is not directly comparable to our work. Specifically, Theorem 5 of [4] assumes the data to be bounded, while we only assume the data has a sub-Gaussian tail, meaning it is bounded only with high probability. Moreover, [4] provides a bound that depends on the ambient dimension, which we aim to mitigate in the overparameterized setting. Theorem 6 in [4] still needs bounded data, while we consider more general distributions that could be unbounded. Moreover, it needs the very strong assumption that the constraint set is a bounded polytope, as it uses DP-Frank-Wolfe.
Regarding the results on convex optimization in lines 98 and references [2, 3], we will revise our paper to clarify differences in assumptions, particularly regarding the bounded norm of input samples, ensuring clear distinctions between our work and the convex baseline.
Response to the Weakness 2 and Question 2
We appreciate the reviewer's attention to our motivation. Before analyzing DP-GLMtron, we noticed that DP-SGD sometimes fails to converge and reaches suboptimal solutions (as seen in our experiments). Therefore, we visualized the training trajectories of DP-SGD and DP-GLMtron on a 2D noiseless ReLU regression with symmetric Bernoulli data (we put the figures in the PDF file, please check it). The visualization demonstrates that DP-SGD struggles to converge under large noise conditions and is more likely to settle at a saddle point rather than reach the optimal solution. As we mentioned in Section 4 of our paper (Equation 3), SGD and GLMtron follow different update rules: SGD keeps the derivative of the ReLU, $\theta_{t+1} = \theta_t - \eta\,(\mathrm{ReLU}(\langle \theta_t, x_t\rangle) - y_t)\,\mathbb{1}[\langle \theta_t, x_t\rangle > 0]\,x_t$, whereas GLMtron drops it, $\theta_{t+1} = \theta_t - \eta\,(\mathrm{ReLU}(\langle \theta_t, x_t\rangle) - y_t)\,x_t$. When $\langle \theta_t, x_t\rangle \le 0$, the SGD gradient is zero, and DP-SGD's direction is heavily influenced by noise, especially with a low privacy budget (see Figure 1a). These observations motivated us to propose the DP-GLMtron and DP-TAGLMtron algorithms.
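To make the contrast concrete, below is a minimal per-sample sketch of the two updates with Gaussian privacy noise added. This is an illustration only: the function names, noise scale, and variable shapes are our choices, not the exact implementation in the paper.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def dp_sgd_step(theta, x, y, lr, noise_std, rng):
    # The ReLU derivative gates the gradient: when <theta, x> <= 0 the
    # clean gradient is exactly zero, so the update is driven purely by
    # the privacy noise.
    if x @ theta > 0:
        grad = (relu(x @ theta) - y) * x
    else:
        grad = np.zeros_like(theta)
    return theta - lr * (grad + noise_std * rng.standard_normal(theta.shape))

def dp_glmtron_step(theta, x, y, lr, noise_std, rng):
    # GLMtron drops the ReLU derivative, so the residual always drives
    # the update, even when the neuron is inactive.
    grad = (relu(x @ theta) - y) * x
    return theta - lr * (grad + noise_std * rng.standard_normal(theta.shape))
```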
Response to the Weakness 3 and Question 3
We appreciate the reviewer's attention to our approach to addressing dimension independence. However, the reviewer may have misunderstood our paper. Generally, our bounds depend on the trace of the covariance matrix rather than on the ambient dimension. In line 222, we simply aim to provide a simple example illustrating how the bounds behave under a specific spectrum assumption; this does not mean that our bounds improve previous results only in this setting. Our main contributions are:
- We provide the first analysis beyond data assumptions requiring Gaussian-like distributions, which may overlook cases where the spectrum exhibits decay.
- Our results offer nearly dimension-independent bounds without assuming isotropic data. This indicates that as long as the norm of the true parameter is of constant order and the tail summation of the eigenvalues is small, it is possible to achieve diminishing bounds without being affected by the dimension in the over-parameterized regime.
Why do we focus on the covariance matrix $\Sigma$? We provide some examples here (see also the numerical check below). If the eigenvalues decay polynomially, e.g. $\lambda_i \propto i^{-2}$, the trace satisfies $\mathrm{tr}(\Sigma) = O(1)$. If they decay exponentially, e.g. $\lambda_i \propto 2^{-i}$, the trace likewise satisfies $\mathrm{tr}(\Sigma) = O(1)$. These examples illustrate that only the effective rank influences utility, which explains why we can mitigate the impact of dimension in some cases.
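As a quick numerical check of this point, the trace stays bounded under spectrum decay but grows linearly with the dimension for a flat spectrum. The decay profiles below are illustrative choices, not necessarily the exact settings used in the paper:

```python
import numpy as np

for d in [100, 10_000, 1_000_000]:
    i = np.arange(1, d + 1)
    poly = np.sum(1.0 / i**2)   # polynomial decay: trace -> pi^2/6 = O(1)
    expo = np.sum(0.5**i)       # exponential decay: trace -> 1 = O(1)
    flat = float(d)             # flat spectrum: trace = d, grows with dimension
    print(d, round(poly, 4), round(expo, 4), flat)
```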
Additionally, assuming the norm of the true parameter $\|w^*\|$ to be bounded by a constant is a common assumption in DP estimation, even for the linear regression model, as in [5,6,7].
Regarding the remaining norm term, we provide a detailed discussion in Lemma D.3, where we establish that each such term is effectively bounded by the norm of the true parameter. This crucial detail ensures that our results remain controlled. Furthermore, by decomposing our analysis into the head and tail subspaces of $\Sigma$ (see the schematic below), we can achieve bounds analogous to those observed in non-private scenarios, as detailed in references [8,9,10,11].
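The head/tail decomposition mentioned above can be sketched as follows; the function name and the cutoff parameter k are ours, for illustration of the standard construction used in [8,9,10]:

```python
import numpy as np

def head_tail_projectors(Sigma, k):
    # Eigendecompose the covariance and split R^d into the span of the
    # top-k eigenvectors (head) and its orthogonal complement (tail);
    # the error is then bounded separately on each subspace.
    vals, vecs = np.linalg.eigh(Sigma)        # ascending eigenvalues
    order = np.argsort(vals)[::-1]            # sort descending
    head, tail = vecs[:, order[:k]], vecs[:, order[k:]]
    return head @ head.T, tail @ tail.T       # projectors P_head, P_tail
```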
Response to the Weakness 4 and Question 4
We understand your concern regarding our experimental setup. Below, we hope to address your concerns as follows:
- Why do we use a synthetic dataset with spectrum decay in the main body? The simulation in our main body aims to validate the theoretical insights related to the role of eigenvalues in a high-dimensional setting, as we discussed in Section 4.
- Typo in the Figure 2 caption (MNIST). Thanks for pointing out the typo; we will correct it in our revised version.
- Why are the results increasing with data size and approximately 24? The observed increase in excess risk as data size grows, along with its value of approximately 24, stems from the nature of our experimental setup. We are considering a ReLU regression model, as outlined in the Preliminary Section and Equation 1, without incorporating any neural network architecture.
Due to space limitations, we will include more explanations, the experiment setup, and references in the comment section; please see the following comments.
I thank the authors for their rebuttal. Here are some comments.
Regarding the response to the Weakness 2 and Question 2
I agree with the authors that one experimental setting can be found in which GLMtron performs better than SGD, and the same holds for the corresponding DP versions. This experimental evidence is, however, extremely weak. How realistic is the setting of Bernoulli data? How realistic is this simplified model itself? To my understanding, it is not clear whether DP-SGD really presents some intrinsic weakness on a class of realistic models (such as ReLU neural networks), and it is not clear that the results on GLMtron can be extended to a less specific setting.
As an example, I believe DP-SGD performs relatively well on simple datasets such as MNIST when considering a neural network with ReLU activations (see Abadi et al.), even in very over-parameterized models. Would DP-GLMtron beat the DP-SGD baselines for an over-parameterized 2-layer neural network? This paper does not come close to answering this question (see my last point on experiments).
Regarding the response to the Weakness 3 and Question 3
I would like to remark to the authors that I still believe I did not misunderstand the paper. I know that the paper does not make a constant-order trace assumption in general. However, I argue that the bounds are "dimension independent" only when the "effective dimension" is of constant order, which is (in my opinion) arguably surprising. In the (I would say important and not much described) case of a flat spectrum (let's say, standard Gaussian data), the number of input dimensions indeed enters the bounds.
I remark that I see the contribution of characterizing the utility privacy trade-offs as a function of the spectrum of the covariance, in ReLU regression. However, the weakness as I described it in my original review, in my opinion, still remains.
Regarding the response to the Weakness 4 and Question 4
I find the choice of the experiment of dubious relevance. If it is important to consider regression problems, why use the full MNIST dataset? Conversely, if DP-GLMtron can defeat DP-SGD in wider settings, like MNIST, why not consider classification? I am not strictly asking for an additional experiment right now, as I believe it to be of difficult implementation during discussion time (if not impossible). However, the current experiment (comparing DP-SGD with DP-GLMtron on a classification task phrased as a regression problem, with generalization errors of order 20), in my opinion, fails to corroborate the theoretical claims of this paper.
To conclude, the contribution on DP ReLU regression is, to the best of my knowledge, novel (and I acknowledge the remarks of the authors on the differences with respect to [4]). However, I still believe the motivation of this work to be lacking, which represents the main weakness of this paper, together with the last two replies in this comment.
We continue to provide more explanations here.
According to the update rules of the algorithms (see Equation 3) and the further discussion in our response to Weakness 2 and Question 2, this behavior illustrates that DP-SGD struggles to converge effectively in this context. As the data size increases, the challenges with convergence lead to an increase in error, which highlights the limitations of DP-SGD in handling certain high-dimensional regression tasks.
Another potential reason for the error approximating ~24 is that our model is a regression model. Regression models output continuous values, but classification problems require discrete class labels. This can result in predictions that aren't directly usable for classification, which may contribute to the observed discrepancy in excess risk.
- Experimental settings used for DP-SGD on MNIST data. Yes, we will provide the details of our experimental settings as follows:
A. Data Preparation
- Dataset: MNIST, loaded via tensorflow.keras.datasets.mnist.
- Data Preprocessing: Training and test data are reshaped and normalized to have values between 0 and 1.
B. Experimental Parameters
- Data Sizes: The experiment uses a range of data sizes from 50 to 5000, increasing in steps of 500.
- Feature Dimension: Derived from the reshaped MNIST data.
- Repeats: The experiment is repeated 5 times to ensure robustness.
- lr =
- Privacy Parameters:
- Delta:
- Epsilon: An array with values
- Sigma: Calculated as
C. Experimental Model Updates and Evaluation Metric (Our gradient update follows the rule of Equation 3 with private noise.)
- Model update (SGD update):

```python
if x_t @ thetasgd > 0:
    gradient = (relu(x_t @ thetasgd) - y_t) * x_t.T + noise
else:
    gradient = np.zeros((featuredim, 1)) + noise
thetasgd = thetasgd - lr * gradient
```

- Evaluation Metric (Mean Square Loss):

```python
rwsgd = np.mean((relu(X @ thetasgd) - y) ** 2)
```
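For completeness, the fragments above can be assembled into one self-contained loop. The sketch below is our reconstruction under stated assumptions: the Gaussian-mechanism calibration of sigma for unit sensitivity and the absence of gradient clipping are assumptions about unstated details, not the exact experimental code:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def run_dp_sgd(X, y, lr=0.01, eps=0.5, delta=1e-5, seed=0):
    """Single pass of per-sample DP-SGD for ReLU regression (illustrative)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    # Standard Gaussian-mechanism calibration, assuming unit sensitivity
    # (gradient clipping to enforce this is omitted, as in the fragment above).
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    for t in range(n):
        x_t, y_t = X[t], y[t]
        noise = sigma * rng.standard_normal(d)
        if x_t @ theta > 0:
            theta = theta - lr * ((relu(x_t @ theta) - y_t) * x_t + noise)
        else:
            theta = theta - lr * noise
    return theta

# Evaluation, matching the mean-square loss above:
# np.mean((relu(X @ theta) - y) ** 2)
```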
References
[1] Shen, Hanpu, et al. "Differentially private non-convex learning for multi-layer neural networks." arXiv preprint arXiv:2310.08425 (2023).
[2] Jain, Prateek, and Abhradeep Guha Thakurta. "(Near) dimension independent risk bounds for differentially private learning." International Conference on Machine Learning. PMLR, 2014.
[3] Song, Shuang, et al. "Evading the curse of dimensionality in unconstrained private GLMs." International Conference on Artificial Intelligence and Statistics. PMLR, 2021.
[4] Wang, Di, Changyou Chen, and Jinhui Xu. "Differentially private empirical risk minimization with non-convex loss functions." International Conference on Machine Learning. PMLR, 2019.
[5] Varshney, Prateek, Abhradeep Thakurta, and Prateek Jain. "(Nearly) Optimal Private Linear Regression via Adaptive Clipping." arXiv preprint arXiv:2207.04686 (2022).
[6] Cai, T. Tony, Yichen Wang, and Linjun Zhang. "The cost of privacy: Optimal rates of convergence for parameter estimation with differential privacy." The Annals of Statistics 49.5 (2021): 2825-2850.
[7] Bassily, Raef, et al. "Private stochastic convex optimization with optimal rates." Advances in neural information processing systems 32 (2019).
[8] Wu, Jingfeng, et al. "Last iterate risk bounds of sgd with decaying stepsize for overparameterized linear regression." International Conference on Machine Learning. PMLR, 2022.
[9] Zou, Difan, et al. "Benign overfitting of constant-stepsize sgd for linear regression." Conference on Learning Theory. PMLR, 2021.
[10] Bartlett, Peter L., et al. "Benign overfitting in linear regression." Proceedings of the National Academy of Sciences 117.48 (2020): 30063-30070.
[11] Wu, Jingfeng, et al. "Finite-sample analysis of learning high-dimensional single relu neuron." International Conference on Machine Learning. PMLR, 2023.
We thank Reviewer M8TJ for your reviewing efforts and feedback.
Response to the Cont.W2 and Q2
We appreciate the reviewer's concerns regarding our experiments. However, it is important to note that our paper primarily focuses on theoretical work. The experiments we conducted on Bernoulli data, which satisfy our assumptions for analysis, were designed to validate our theoretical insights into the role of eigenvalues in a high-dimensional setting.
Why do we focus on the ReLU regression model? Indeed, the ReLU regression model can be viewed as equivalent to a two-layer neural network with a ReLU activation function and a single neuron in the hidden layer, as described in [1]. In particular, [1] considers the formulation $f(x; w, a) = a \cdot \sigma(\langle w, x \rangle)$, where $x$ is the input, $w$ is the weight vector of the first layer, $a$ is the output weight, and $\sigma$ is the ReLU activation function defined as $\sigma(z) = z$ if $z > 0$ and $\sigma(z) = 0$ if $z \le 0$. Since they assume that $a$ is not trained and focus on the empirical risk minimization problem with quadratic loss, it is equivalent to ReLU regression when setting $a = 1$.
Therefore, ReLU regression can be considered a basic, simplified problem in neural network optimization [1,2,3]. As we replied to Reviewer 7Kd7, one potential way to extend our work to ReLU neural networks is to consider $k$ neurons. In this case, the gradient for optimization would become a summation of $k$ gradients, each akin to the one used in our current analysis. This extension represents an area for future work.
Regarding the lack of experiments, we hope to address your concerns in the following part.
Response to the Cont.W3 and Q3
We appreciate the reviewer's clarifications and acknowledge that we may have misunderstood your feedback earlier. Yes, our paper offers a novel perspective by demonstrating that dimensionality can be mitigated in specific cases, such as when the data covariance matrix exhibits eigenvalue decay. This aspect might have been overlooked in previous Differential Privacy (DP) studies due to their reliance on data assumptions that resemble Gaussian-like distributions.
However, it is important to note that even in non-private regression models, the mitigation of dimensionality is typically contingent upon assumptions about the spectrum, as discussed in [3-10]. Therefore, we do not view this as a weakness of our work.
Response to the Cont.W4 and Q4
To better address your concerns, we have included additional experimental results on three different regression tasks, including one table here. (Because the PDF file in the global rebuttal cannot be edited during the discussion period, we will incorporate figures into our revised paper.)
The table below presents the errors observed for the Gas Turbine, CA Housing, and Wine Quality datasets when using various differentially private algorithms: DP-SGD, DP-GLMtron, DP-TAGLMtron, and DP-FTRL. These experiments were conducted using the ReLU regression model across three privacy budgets, epsilon = 0.05, 0.2, and 0.5, with a learning rate of 0.01. The errors reported were taken from the final (50th) epoch. It can be observed that DP-GLMtron and DP-TAGLMtron consistently outperform DP-SGD and DP-FTRL across all datasets and privacy budgets, demonstrating their robustness and effectiveness. Additionally, the performance gap between DP-GLMtron/DP-TAGLMtron and the other two algorithms indicates that DP-SGD and DP-FTRL may converge to suboptimal solutions.
| Dataset | Epsilon | DP-SGD | DP-GLMtron | DP-TAGLMtron | DP-FTRL |
|---|---|---|---|---|---|
| Gas Turbine | 0.05 | 12.24 | 8.76 | 9.42 | 13.58 |
| | 0.2 | 14.99 | 9.92 | 8.22 | 13.23 |
| | 0.5 | 15.84 | 9.49 | 9.09 | 15.34 |
| CA Housing | 0.05 | 29.37 | 16.46 | 6.39 | 20.47 |
| | 0.2 | 26.31 | 9.14 | 7.06 | 19.17 |
| | 0.5 | 14.90 | 6.18 | 7.42 | 18.91 |
| Wine Quality | 0.05 | 134.37 | 37.39 | 35.36 | 138.69 |
| | 0.2 | 117.28 | 35.55 | 34.62 | 138.01 |
| | 0.5 | 138.04 | 34.78 | 34.44 | 141.60 |
References
[1] Frei, Spencer, Yuan Cao, and Quanquan Gu. "Agnostic learning of a single neuron with gradient descent." Advances in Neural Information Processing Systems 33 (2020): 5417-5428.
We provide more references here.
[2] Diakonikolas, Ilias, et al. "Learning a single neuron with adversarial label noise via gradient descent." Conference on Learning Theory. PMLR, 2022.
[3] Wu, Jingfeng, et al. "Finite-sample analysis of learning high-dimensional single ReLU neuron." International Conference on Machine Learning. PMLR, 2023.
[4] Bartlett, Peter L., et al. "Benign overfitting in linear regression." Proceedings of the National Academy of Sciences 117.48 (2020): 30063-30070.
[5] Zou, Difan, et al. "Benign overfitting of constant-stepsize sgd for linear regression." Conference on Learning Theory. PMLR, 2021.
[6] Tsigler, Alexander, and Peter L. Bartlett. "Benign overfitting in ridge regression." Journal of Machine Learning Research 24.123 (2023): 1-76.
[7] Wu, Jingfeng, et al. "Finite-sample analysis of learning high-dimensional single ReLU neuron." International Conference on Machine Learning. PMLR, 2023.
[8] Zhang, Xiao, et al. "Learning one-hidden-layer relu networks via gradient descent." The 22nd international conference on artificial intelligence and statistics. PMLR, 2019.
[9] Cao, Yuan, et al. "Towards understanding the spectral bias of deep learning." arXiv preprint arXiv:1912.01198 (2019).
[10] Du, Simon, et al. "Gradient descent finds global minima of deep neural networks." International conference on machine learning. PMLR, 2019.
[11] Du, Simon S., et al. "Gradient descent provably optimizes over-parameterized neural networks." arXiv preprint arXiv:1810.02054 (2018).
[12] Cao, Yuan, and Quanquan Gu. "Generalization bounds of stochastic gradient descent for wide and deep neural networks." Advances in neural information processing systems 32 (2019).
[13] Huang, Yu, Yingbin Liang, and Longbo Huang. "Provable generalization of overparameterized meta-learning trained with sgd." Advances in Neural Information Processing Systems 35 (2022): 16563-16576.
I thank the authors for their response. I follow up below.
Regarding the motivation, I agree and acknowledge that this is a theoretical work. At the moment, however, the theoretical lesson the results of this paper deliver is that DP-SGD fails in ReLU regression over Bernoulli data, while another proposed algorithm does better. If the lesson is of broader interest (less specific data or model), I expect either the experiments or the theory to suggest this extension. This is something that, for now, I do not see (see my last point on the experiments).
I additionally remark that I am not complaining about the assumptions used to derive the theory (proving theorems in the most general setting is always difficult), I am complaining because I still do not believe the findings of this paper have wider application (even theoretical) than this restricted setting.
Regarding the experiments, I would ask if the authors can share their code through an anonymized repository. I find the results puzzling. See, for example, the scores on the Wine Quality dataset, where the targets are numbers (scores) between 0 and 10. Guessing the average score of 5 at test time gives a perfectly private policy with smaller test losses than the ones reported for all algorithms. It is possible I am not fully following the meaning of the reported numbers, and I would appreciate it if I could see the authors' implementation.
We thank Reviewer M8TJ for your active feedback.
We would like to clarify a few key points:
- Our analysis of the ReLU regression model can be viewed as a foundational step toward understanding a two-layer neural network with a single neuron [1]. It is not intended to be a specific model.
- A significant theoretical contribution of our work is the introduction of a novel analysis that extends beyond the typical data assumptions resembling Gaussian-like distributions. This analysis demonstrates that dimensionality can be mitigated in specific cases, such as eigenvalue decay, an aspect that has been overlooked in the previous DP literature. Moreover, the assumptions we employ in our paper are more general and less restrictive than those in existing works [2-5].
- The experimental results from synthetic data, as well as classification and regression tasks, demonstrate a performance gap between DP-SGD/DP-FTRL and DP-GLMtron/DP-TAGLMtron. This suggests that DP-SGD and DP-FTRL may converge to suboptimal solutions not only in the specified setting.
Regarding the code, we have shared it with the AC through an anonymized repository. Please await the AC's response.
[1] Frei, Spencer, Yuan Cao, and Quanquan Gu. "Agnostic learning of a single neuron with gradient descent." Advances in Neural Information Processing Systems 33 (2020): 5417-5428.
[2] Shen, Hanpu, et al. ”Differentially private non-convex learning for multi-layer neural networks.” arXiv preprint arXiv:2310.08425 (2023).
[3] Prateek Varshney, Abhradeep Thakurta, and Prateek Jain. (Nearly) optimal private linear regression via adaptive clipping. arXiv preprint arXiv:2207.04686, 2022.
[4] Di Wang and Jinhui Xu. On sparse linear regression in the local differential privacy model. In International Conference on Machine Learning, pages 6628-6637. PMLR, 2019.
[5] Xiyang Liu, Prateek Jain, Weihao Kong, Sewoong Oh, and Arun Sai Suggala. Near optimal private and robust linear regression. arXiv preprint arXiv:2301.13273, 2023.
I thank the authors for their response. I conclude that, for now, the authors did not address the major reasons behind my evaluation of this work. If it will become available, I will reconsider my score after consulting the new provided experiments.
We thank Reviewer M8TJ for your thoughtful response. We understand your concern and hope our additional experiments can address it.
However, we still would like to note that our work primarily focuses on the theoretical aspects, particularly the perspective of eigenvalues, which has not been considered in previous DP communities.
Therefore, we believe our work offers a valuable contribution despite the previous experimental limitations.
Best regards,
The Authors
The paper provides an algorithm for differentially private ReLU regression. The authors claim that their results outperform DP-SGD. Additionally, for the case of a small privacy budget, they provide a tree aggregation protocol that balances privacy and utility. Finally, extensive experimental results are provided to support the claims.
Strengths
- A novel algorithm is provided for differentially private ReLU regression which outperforms DP-SGD.
- Detailed theoretical analysis is provided for the proposed algorithms.
- Sufficient experimental evaluation is provided for the proposed algorithms, with experiments on both synthetic and real-world data.
Weaknesses
- There are a lot of recent works on privacy-utility tradeoffs which seem relevant to this paper. A literature survey involving works in that area can be helpful for future readers to further extend this work.
Questions
- Can the authors propose ideas on how the proposed approach can be extended to models with higher complexity, such as deep neural networks?
- The authors analyze the privacy-utility tradeoff in their work. They should also mention similar papers involving privacy-utility tradeoff algorithms, such as:

[a] Alireza Fallah, Ali Makhdoumi, Azarakhsh Malekian, and Asuman Ozdaglar. Optimal and Differentially Private Data Acquisition: Central and Local Mechanisms. Operations Research 2024, 72:3, 1105-1123.

[b] Ameya Anjarlekar, Rasoul Etesami, and R. Srikant. (2023). Striking a Balance: An Optimal Mechanism Design for Heterogenous Differentially Private Data Acquisition for Logistic Regression.

[c] Andrew Lowy, Zeman Li, Tianjian Huang, and Meisam Razaviyayn. (2024). Optimal Differentially Private Model Training with Public Data.
- Typo on line 335
Limitations
NA
We thank the Reviewer JyY5 for the valuable review as well as the positive feedback!
Response to the Weakness 1
Thank you for your suggestion. We agree that adding a literature survey on privacy-utility tradeoffs would be beneficial. We will include more related work in the revised paper and hope to provide valuable context for readers.
Response to the Question 1
Yes. ReLU regression, as discussed in our paper, can be viewed as equivalent to a two-layer neural network with a rectified linear unit (ReLU) activation function and a single neuron in the hidden layer, as described in [2]. In particular, [2] considers the formulation $f(x; w, a) = a \cdot \sigma(\langle w, x \rangle)$, where $x$ is the input, $w$ is the weight vector of the first layer, $a$ is the output weight, and $\sigma$ is the ReLU activation function defined as $\sigma(z) = z$ if $z > 0$ and $\sigma(z) = 0$ if $z \le 0$. Since they assume that $a$ is not trained and focus on the empirical risk minimization problem with quadratic loss, it is equivalent to ReLU regression when setting $a = 1$. Therefore, ReLU regression can be considered a basic, simplified problem in neural network optimization [1,2,3].
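A minimal sketch of this equivalence (the function and variable names are ours, for illustration):

```python
import numpy as np

def two_layer_single_neuron(x, w, a=1.0):
    # f(x; w, a) = a * ReLU(<w, x>); with the output weight a fixed at 1
    # and only w trained, this is exactly the ReLU regression model.
    return a * np.maximum(w @ x, 0.0)
```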
Response to the Question 2
Thank you for pointing out these relevant works. We will include references to these papers in our revised manuscript to provide a broader context for the privacy-utility tradeoff algorithms discussed in our study.
Response to the Question 3
Thank you for catching the typo on line 335. We will correct it ("simulation") in the revised paper.
[1] Wu, Jingfeng, et al. "Finite-sample analysis of learning high-dimensional single ReLU neuron." International Conference on Machine Learning. PMLR, 2023.
[2] Frei, Spencer, Yuan Cao, and Quanquan Gu. "Agnostic learning of a single neuron with gradient descent." Advances in Neural Information Processing Systems 33 (2020): 5417-5428.
[3] Diakonikolas, Ilias, et al. "Learning a single neuron with adversarial label noise via gradient descent." Conference on Learning Theory. PMLR, 2022.
Thank you for your detailed response
We thank Reviewer JyY5 once again for your efforts and time. Please feel free to reach out if you have any further questions or suggestions.
This paper revisits the problem of differentially private (DP) ReLU regression in overparameterized regimes. The authors propose two novel algorithms, DP-GLMtron and DP-TAGLMtron, which outperform conventional methods like DP-SGD. The paper provides theoretical analysis of these algorithms, including privacy guarantees and utility bounds. The authors extend their analysis beyond Gaussian-like data distributions to settings with eigenvalue decay, showing how data distribution impacts learning in high dimensions. The paper also includes empirical results on both synthetic and real-world datasets to validate their theoretical findings.
Strengths
- The proposed algorithm is novel and intuitive.
- The authors provide comprehensive theoretical analysis, and the results are concisely conveyed.
- The writing is good.
Weaknesses
- The assumptions used in this paper, though relaxed, are still strong and may not always hold in practice.
- The simulation is weak and contains some mismatches with the theory.
Questions
- Can the authors provide the specific location of the claimed rate in [1]? I scanned the paper and don't see how this rate appears. Thanks in advance :).
- What is the role of ReLU regression? Does it serve as a fundamental simplified problem for neural network optimization?
- The authors provide special cases of the effective dimension terms. However, I would appreciate an explanation of their form: for instance, why is one term not squared relative to the other, and why does a sum of inverses appear?
- In the simulation (Figure 1), I noticed that in some cases (especially Figure 1(a)), the excess risk is increasing with respect to sample sizes. How does this happen?
- Again in the simulation, DP-GLMtron gradually matches the performance of DP-TAGLMtron as the privacy budget increases, while this is not true in the theory, where DP-GLMtron does not have a guarantee when the privacy budget is large. How does this happen?
- Following the last two questions, it seems that DP-GLMtron and DP-TAGLMtron are more sensitive to variations of the privacy budget than DP-SGD. This also seems true in the theory (for instance, Table 1 suggests different dependences on the privacy budget for the proposed methods and DP-SGD).
- Why did the authors choose the two eigenvalue decay settings used in the simulation?
[1] Differentially Private Non-convex Learning for Multi-layer Neural Networks.
Limitations
The limitations should appear in the main text.
We thank Reviewer 7Kd7 for the careful and detailed review as well as the constructive feedback.
Response to the Weakness 1
We kindly disagree with the reviewer's opinion. It is important to note that our work is theoretical, and our assumptions are standard in theoretical studies of ReLU regression or linear regression. As we mentioned in the paper, most assumptions are weaker than the sub-Gaussian data assumption. Even for linear regression models, certain assumptions are necessary to provide analysis on private estimation, such as [1,2,3].
Response to the Weakness 2
We understand the reviewer's concern about our experimental results and we hope to address this in the following answers (Question 4 part).
Response to the Question 1
Yes, the rate is provided in Theorem 5 of [4], which analyzes the performance of the DP-Projected Gradient Descent for ReLU regression :).
Response to the Question 2
Previous studies on differentially private (DP) statistical estimation have primarily focused on convex models, such as linear models or linear regression, with less attention given to non-convex models. ReLU regression is one of the most fundamental non-convex models in neural networks, which is why we chose to focus on it in our paper. More specifically, ReLU regression, as discussed in our paper, can be viewed as equivalent to a two-layer neural network with a rectified linear unit (ReLU) activation function and a single neuron in the hidden layer, as described in [5]. Specifically, they consider $f(x; w, a) = a \cdot \sigma(\langle w, x \rangle)$, where $w$ is the weight vector of the first layer, $a$ is the output weight, and $\sigma$ is the ReLU activation function defined as $\sigma(z) = z$ if $z > 0$ and $\sigma(z) = 0$ if $z \le 0$. Since they assume that $a$ is not trained and focus on the empirical risk minimization problem with quadratic loss, it is equivalent to ReLU regression when setting $a = 1$. Therefore, ReLU regression can be considered a basic, simplified problem in neural network optimization [6,7,8].
Response to the Question 3
We are glad to see the reviewer notice this difference. The first term arises from the variance error, which is related to the model noise. In contrast, the second term originates from the privacy error introduced by the private Gaussian noise. The two noise sources interact with the data points differently: roughly, the model noise enters the update scaled by the data, whereas the private noise is injected directly and isotropically. This is the main reason why one term lacks the component that the other has. We provide more details in the Appendix (Lemma D.4).
Response to the Question 4 and Weakness 2
This phenomenon for DP-SGD is actually our motivation for studying the private ReLU regression problem. As we mentioned in Section 4 of our paper (Equation 3), SGD and GLMtron follow different update rules: the SGD update keeps the ReLU derivative, so if $\langle \theta, x_t \rangle \le 0$, the gradient for SGD becomes zero. Then, in DP-SGD, the noise added to the gradient can significantly influence the update direction, especially with a low privacy budget. Indeed, we visualized the training trajectories of DP-SGD and DP-GLMtron on a 2D noiseless ReLU regression (please check the files in the global rebuttal). The visualization demonstrates that DP-SGD struggles to converge under large noise conditions and is more likely to settle at a saddle point rather than reach the optimal solution. This observation motivated us to propose the DP-GLMtron algorithm.
Response to the Question 5
Exactly, DP-GLMtron cannot provide the privacy guarantee when the privacy budget is large, but here the performance refers to the utility/excess risk. A large privacy budget implies that less noise is added to the gradient, which means the gradient is less impacted by noise, improving performance. The primary difference between DP-GLMtron and DP-TAGLMtron lies in the method of noise accumulation (see the sketch below). Therefore, when the noise is reduced, the performance gap between these two algorithms narrows.
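To illustrate the difference in noise accumulation, consider the standard binary-tree (tree aggregation) mechanism, sketched below with our own function names. With naive accumulation, the t-th prefix of gradients contains t independent noise draws; with the tree mechanism, the prefix [1..t] decomposes into popcount(t) dyadic intervals, so at most about log2(t) + 1 draws enter any prefix:

```python
def noise_terms_naive(t):
    # Naive accumulation: one fresh draw per step, all of which
    # persist in the t-th prefix sum.
    return t

def noise_terms_tree(t):
    # Tree mechanism: the prefix [1..t] is covered by popcount(t)
    # dyadic intervals, each carrying a single noise draw.
    return bin(t).count("1")

for t in [7, 64, 1000]:
    print(t, noise_terms_naive(t), noise_terms_tree(t))
```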
Response to the Question 6
We kindly disagree with the reviewer's opinion. The utility bound depends not only on the privacy budget but also on the data size and the effective dimension. It is not possible to separate these factors and judge the efficiency of the algorithms from the privacy budget alone; taking them together, our rate is still better. Additionally, the rate in Table 1 for [1] is for DP-PGD rather than DP-SGD.
Response to the Question 7
The settings with two eigenvalue decay scenarios satisfy our assumptions, and we would like to validate our theoretical insights on the role of eigenvalues in a high-dimensional setting, as detailed in Corollary 4.3. As discussed previously, existing work in the DP community primarily relies on data assumptions satisfying Gaussian-like distributions, which reduces the data covariance matrix to an identity matrix scaled by the variance. Such an assumption may cause the case where the spectrum exhibits decay to be overlooked. Therefore, our paper chose these eigenvalue decay scenarios in our simulation to examine the effect of the eigenvalues.
Due to space limitations, we will include the references in the comment section; please see the following comment.
[1] Varshney, Prateek, Abhradeep Thakurta, and Prateek Jain. "(Nearly) Optimal Private Linear Regression via Adaptive Clipping." arXiv preprint arXiv:2207.04686 (2022).
[2] Cai, T. Tony, Yichen Wang, and Linjun Zhang. "The cost of privacy: Optimal rates of convergence for parameter estimation with differential privacy." The Annals of Statistics 49.5 (2021): 2825-2850.
[3] Bassily, Raef, et al. "Private stochastic convex optimization with optimal rates." Advances in neural information processing systems 32 (2019).
[4] Shen, Hanpu, et al. "Differentially private non-convex learning for multi-layer neural networks." arXiv preprint arXiv:2310.08425 (2023).
[5] Du, Simon S., et al. "Gradient descent provably optimizes over-parameterized neural networks." arXiv preprint arXiv:1810.02054 (2018).
[6] Wu, Jingfeng, et al. "Finite-sample analysis of learning high-dimensional single ReLU neuron." International Conference on Machine Learning. PMLR, 2023.
[7] Frei, Spencer, Yuan Cao, and Quanquan Gu. "Agnostic learning of a single neuron with gradient descent." Advances in Neural Information Processing Systems 33 (2020): 5417-5428.
[8] Diakonikolas, Ilias, et al. "Learning a single neuron with adversarial label noise via gradient descent." Conference on Learning Theory. PMLR, 2022.
We appreciate the Reviewer for the thoughtful feedback and hope the following responses address your concerns effectively!
Response to the Cont.Q2
Yes, we would like to further explore the relationship between our work and neural network optimization here. Theoretically, as we mentioned earlier (our response to Q2), ReLU regression can be viewed as equivalent to a two-layer neural network with a single neuron. Therefore, to extend our work to neural network optimization, one potential way is to consider $k$ neurons. In this case, the gradient for optimization would become a summation of $k$ copies of our current gradient, one per neuron (see the sketch below). Then, we can conduct an analysis analogous to that presented in this paper.
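A hedged sketch of what this extension might look like; the k-neuron model and the summed update below are our illustration of the idea, not a result from the paper:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def glmtron_step_k_neurons(W, x, y, lr):
    # W has shape (k, d); hypothetical model f(x) = sum_j ReLU(<w_j, x>).
    residual = np.sum(relu(W @ x)) - y
    # GLMtron-style update: drop the ReLU derivative, so each of the k rows
    # receives a residual-times-input term analogous to the single-neuron case.
    return W - lr * residual * np.tile(x, (W.shape[0], 1))
```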
Empirically, our work offers a novel perspective on the over-parameterization problem, showing that the dependence on the dimension can be avoided in certain cases, such as when the data covariance matrix exhibits eigenvalue decay. Additionally, our findings highlight that DP-SGD struggles with convergence and tends to settle at suboptimal points. It would be quite interesting to explore whether similar observations can be made with a two-layer neural network model.
Response to the Cont.Q6
To better address your question, we have included additional experimental results related to the performance of different algorithms under various privacy budgets. (We found that the PDF file in the global rebuttal cannot be edited during the discussion period. Therefore, we will incorporate these results into our revised paper.)
When the sample size is fixed, the utility bound of [1] and our utility bound differ in how they depend on the privacy budget and the effective dimension, and ours is an improvement over [1] given the magnitudes of these quantities.
In this additional experiment, we demonstrate the sensitivity of the privacy budget for each algorithm, and it is clear that DP-SGD faces the same issue. When is small, the tree mechanism in DP-FTRL allows it to add less noise compared to the naive Gradient Descent or GLMtron algorithm, resulting in similar performance between DP-FTRL and DP-GLMtron. Additionally, it is important to note that DP-FTRL should be compared with DP-TAGLMtron rather than DP-GLMtron, as both DP-FTRL and DP-TAGLMtron utilize the tree mechanism, whereas DP-GLMtron does not. Moreover, even though DP-FTRL converges at the end of the training, it can be observed that the excess risk for DP-FTRL is significantly larger than that of DP-TAGLMtron, indicating that it may converge to a suboptimal solution.
[1] Shen, Hanpu, et al. ”Differentially private non-convex learning for multi-layer neural networks.” arXiv preprint arXiv:2310.08425 (2023)
Thanks and I am satisfied with the clarifications. I have updated my score.
We would like to thank Reviewer 7Kd7 again for your valuable time in discussing the work, and we greatly appreciate your positive feedback! If you have any further questions or suggestions, please feel free to reach out.
Thanks for the detailed clarification.
- Regarding Response to Question 2: so does your work lead to any suggestions on NN optimization? (no criticism if no)
- Regarding Response to Question 6: but if the sample size is fixed, would varying the privacy budget result in a larger variation for your method, since its dependence on the privacy budget leads to a larger slope? Also, in the simulation, your methods performed on par with DP-FTRL when the privacy budget is small, but outperform it significantly when the budget is large.
To all reviewers:
We would like to thank all the reviewers for their great efforts and insightful comments! Based on their suggestions, we address some common concerns and discuss them in the revised paper.
1. Motivation for Proposing the DP-GLMtron Algorithm and Focus Away from DP-SGD:
Our motivation for developing the DP-GLMtron algorithm stems from observations regarding the limitations of DP-SGD. Initially, we noticed that DP-SGD sometimes fails to reach the optimal solution and can even struggle to converge, as evidenced by our experimental results. To further explore this issue, we visualized the training trajectories of both DP-SGD and DP-GLMtron on a 2D noiseless ReLU regression with symmetric Bernoulli data. Please refer to the figures in the PDF for detailed illustrations.
The visualization in the attached PDF file shows that DP-SGD encounters difficulties in converging under conditions of high noise, often settling at a saddle point instead of reaching the optimal solution. This behavior is particularly evident when using a low privacy budget, as shown in Figure 1(a) in our paper.
As discussed in our paper (Section 4, Equation 3), SGD and GLMtron follow different update rules: the SGD update retains the ReLU derivative, so when $\langle \theta, x_t \rangle \le 0$, the gradient for SGD becomes zero, whereas GLMtron always uses the residual direction. When considering DP-SGD, noise is added to the gradient, and in this regime the direction of DP-SGD is determined almost entirely by the random Gaussian noise. These observations motivated us to propose the DP-GLMtron and DP-TAGLMtron algorithms, which address these convergence issues and provide more reliable results under different noise conditions.
2. Increase in Experimental Results with Data Size and Approximation to 24.
The observed increase in excess risk as data size grows, along with its approximation to 24, stems from the nature of our experimental setup. We are considering a ReLU regression model, as outlined in the Preliminary Section and Equation 1, without incorporating any neural network architecture.
Moreover, according to the updated rules of algorithms (see Equation 3) and the above discussion, this behavior illustrates that DP-SGD struggles to converge effectively in this context. As data size increases, the challenges with convergence lead to an increase in error, which highlights the limitations of DP-SGD in handling certain high-dimensional regression tasks.
Another potential reason for the error approximating ~24 is that our model is a regression model. Regression models output continuous values, but classification problems require discrete class labels. This can result in predictions that aren't directly usable for classification, which may contribute to the observed large values in excess risk.
Dear Chairs,
We hope this message finds you well.
We would like to ask how to best present additional experimental results, such as figures, during the discussion period, given that the global rebuttal cannot be edited.
Thank you for your time and support.
Best regards,
The Authors
In general, it is not customary for authors to produce new results during the discussion phase. It might be best if you avoid presenting them, but rather mention their existence. If you absolutely must reply to the reviewer who specifically asks to see the results, I suggest you use an anonymous link. Again, I would caution against abusing this to present revisions of the work submitted.
Dear Area Chair svdp,
Thank you for your notification.
We fully understand the importance of adhering to the NeurIPS 2024 code of conduct. In response to the reviewer’s request, we will provide only the necessary code without including any revised versions of our work or additional figures.
Thank you again for your valuable time and support.
Best regards,
The Authors
The paper studies the problem of privately learning a planted (student-teacher) ReLU regression model. It achieves nice theoretical contributions, but its experiments leave much to be desired. While the experimental side does seem weaker, all reviewers agree that the theoretical contribution is sound and has merit. It is therefore our recommendation to accept the paper as a poster at NeurIPS 2024.