PaperHub
Rating: 6.5 / 10 · Poster · 4 reviewers (min 6, max 8, std 0.9)
Individual scores: 6, 6, 8, 6 · Confidence: 2.8
ICLR 2024

The importance of feature preprocessing for differentially private linear optimization

OpenReview · PDF
Submitted: 2023-09-20 · Updated: 2024-03-14

Abstract

Training machine learning models with differential privacy (DP) has received increasing interest in recent years. One of the most popular algorithms for training differentially private models is differentially private stochastic gradient descent (DPSGD) and its variants, where at each step gradients are clipped and combined with some noise. Given the increasing usage of DPSGD, we ask the question: is DPSGD alone sufficient to find a good minimizer for every dataset under privacy constraints? As a first step towards answering this question, we show that even for the simple case of linear classification, unlike non-private optimization, (private) feature preprocessing is vital for differentially private optimization. In detail, we first show theoretically that there exists an example where without feature preprocessing, DPSGD incurs a privacy error proportional to the maximum norm of features over all samples. We then propose an algorithm called *DPSGD-F*, which combines DPSGD with feature preprocessing and prove that for classification tasks, it incurs a privacy error proportional to the diameter of the features $\max_{x, x' \in D} \|x - x'\|_2$. We then demonstrate the practicality of our algorithm on image classification benchmarks.
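For readers skimming the reviews below, here is a minimal sketch of the DPSGD step the abstract describes (per-example gradient clipping followed by Gaussian noise). All function and parameter names are illustrative, not taken from the paper:

```python
import numpy as np

def dpsgd_step(w, per_example_grads, clip_norm, noise_multiplier, lr, rng):
    """One illustrative DPSGD update: clip each per-example gradient to
    `clip_norm`, sum, add Gaussian noise calibrated to the clip norm."""
    clipped = [g * min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))
               for g in per_example_grads]
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=w.shape)
    noisy_mean = (np.sum(clipped, axis=0) + noise) / len(clipped)
    return w - lr * noisy_mean

# Toy usage:
rng = np.random.default_rng(0)
w = np.zeros(3)
grads = [rng.normal(size=3) for _ in range(8)]
w = dpsgd_step(w, grads, clip_norm=1.0, noise_multiplier=1.0, lr=0.1, rng=rng)
```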
Keywords
private optimization · feature preprocessing · differential privacy

Reviews and Discussion

Review (Rating: 6)

Differential privacy (DP) has been applied to many machine/deep learning tasks to protect the privacy of datasets. DPSGD is one of the most influential methods for training private ML models. Although theoretical works have proven its convergence properties under convexity and led the trend, DPSGD may not work well on raw features in image classification tasks. Going beyond DPSGD, this work demonstrates that feature preprocessing is necessary to optimize the performance of DPSGD, proposing a new algorithm, DPSGD-F, that combines DPSGD with feature preprocessing.

Strengths

  1. This work not only provides empirical results on specific image tasks for the proposed algorithm, but also builds theoretical guarantees for it step by step. Connecting theory and practice in this way is a real strength.
  2. The counterexample for DPSGD in Section 4 is innovative and appears solid; it investigates a limitation seldom considered by other works (to my knowledge).

Weaknesses

  1. The study subject — the linear model — is relatively simple and less representative compared to SOTA ML models, especially for image classification tasks.
  2. Except for the Fashion-MNIST dataset (released within the last few years), the original MNIST and CIFAR-100 datasets are relatively outdated compared to other image datasets.

Questions

  1. What is $\tilde{\Omega}$ in Theorem 1?
  2. Can you explain more about what "uninformative" means in Theorem 3?
  3. For the lemmas in Section 5, can you provide high-level proof ideas for each one (perhaps one or two sentences per lemma), as you did in Section 4, to make the correctness clearer to the reader?
Comment

We thank the reviewer for acknowledging that our work makes a good connection between theory and practice. Below we address the reviewer's questions about our work.

[The study subject - linear model - is relatively simple and less representative compared to SOTA ML models.] [Except the Fashion-MNIST dataset (released in recent few years), original MNIST and CIFAR-100 datasets are relatively outdated compared to other image datasets.]

We believe the main contribution of our work is to formally prove that feature preprocessing can help DPSGD-based algorithms to obtain better optimization guarantees. While the considered model and datasets are simple, we believe these will serve as important first steps to study more complex SOTA models and modern datasets.

On the other hand, while these tasks have been studied extensively without privacy and SOTA results have advanced over the years, they remain challenging for private image classification. For example, to the best of our knowledge, no good classification model has been privately trained from scratch on CIFAR-100. Current results solve classification on CIFAR-100 by private fine-tuning of models pre-trained on larger datasets (e.g., ImageNet1K). For this, [De et al 2022] showed that fine-tuning the last layer of a WRN network achieves an accuracy of 70.3% on CIFAR-100 at $\varepsilon = 1$. Note that fine-tuning the last layer of an advanced model like WRN is the same as training a linear model. With our technique, we are able to improve the accuracy to 71.6%, which is close to the current SOTA result of 71.9% [Bu et al 22] for the task of fine-tuning on CIFAR-100 with models pretrained on ImageNet1K. This shows that for private training, the linear model may not be just a simplified model, but rather a useful model in practice, especially given its equivalence to privately fine-tuning the last layer of any neural model.
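To illustrate this equivalence, here is a toy sketch (entirely illustrative; `backbone` is a hypothetical stand-in for a frozen pretrained network such as a WRN, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
W_frozen = rng.normal(size=(784, 64))      # stands in for pretrained weights

def backbone(images):
    """Hypothetical frozen feature extractor; never updated during fine-tuning."""
    return np.tanh(images @ W_frozen)

images = rng.normal(size=(32, 784))
features = backbone(images)                 # features are fixed inputs
W, b = np.zeros((64, 100)), np.zeros(100)   # only the last layer is trained (privately)
logits = features @ W + b                   # i.e., a linear model on `features`
print(logits.shape)                         # (32, 100): a 100-class linear classifier
```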

[$\tilde{\Omega}$ in Theorem 1]

The tilde was used to hide logarithmic factors in the big-O expression. In the updated draft, we have removed the tilde notation and written out the specific log factors. Now the term inside $\Omega$ represents an up-to-constant bound on the number of samples required to obtain the result in Theorem 1.
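For reference, the usual convention (our paraphrase, not the paper's exact definition) is that the tilde suppresses polylogarithmic factors:

```latex
\[
  f(n) = \tilde{\Omega}\big(g(n)\big)
  \quad\Longleftrightarrow\quad
  f(n) = \Omega\!\left(\frac{g(n)}{\log^{k} n}\right) \text{ for some constant } k \ge 0.
\]
```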

[Can you explain more about what "uninformative" means in Theorem 3?]

The term "uninformative initialization" is mainly used to require that the initialization should be not be chosen depending on the input dataset so that it will benefit the optimization for specific datasets instead of all. In the case of DD' in Theorem 3, we mean w0=(w0(1),w0(2))w_0 = (w_0(1), w_0(2)) should satisfy that w0(2)0w0(2) ≤ 0 since the optimal ww^* for DD' has w(2)>0w^*(2) > 0. Similarly for DD'' we require w0(2)0w_0(2) ≥ 0.

[High-level ideas on correctness for lemmas in Section 5.]

Thanks for the great suggestion. Lemmas 3 and 4 mainly come from prior work, and the results are direct corollaries. The proof of Lemma 5 is short and provided in the section. We have added high-level explanations of why Lemmas 6 and 7 hold in the updated draft (marked in red).

Review (Rating: 6)

The paper investigated the significance of feature processing in differentially private linear models. The authors introduced an algorithm named DPSGD-F, which incorporates a preprocessing procedure for feature clipping within DP-SGD. The authors theoretically demonstrated the relationship between the excess risk and the feature diameter. Furthermore, their empirical results suggested that DPSGD-F outperforms DP-SGD on datasets such as MNIST, FMNIST, and CIFAR-100 (pretrained).

Strengths


  1. The authors introduced a novel variant of DPSGD known as DPSGD-F, which empirically outperforms DPSGD on particular datasets, and they supported the proposed algorithm with theoretical guarantees.

  2. The authors presented a counterexample illustrating situations where DP-SGD does not converge.

Weaknesses


  1. I have several concerns regarding the motivation and theoretical results, which might undermine the paper's conclusion. Below are some of my concerns about the counterexample in Section 4 and Theorem 1 in Section 3.

    a) In the counterexample, the condition $\mu > n\epsilon$ is required. However, if I understand Theorem 1 correctly, the second term in Theorem 1 becomes a constant when $\mu > n\epsilon$, since $R = \sqrt{\mu^2 + 1}$. Consequently, this counterexample may not convincingly imply that DPSGD-F is superior to DPSGD. It is recommended to address this concern during the rebuttal session.

    b) The authors use this counterexample to emphasize the importance of the feature diameter. Nevertheless, it appears that it is the angle between the features of different classes, rather than the diameter itself, that is crucial. In particular, when $\mu$ is sufficiently large, the features for different classes become nearly identical because, in comparison to $\mu e_1$, $e_2$ can be neglected. To make the counterexample more convincing, please consider whether the example $x = e_1 + \frac{1}{\mu} e_2$ for $y = 1$ and $x = e_1 - \frac{1}{\mu} e_2$ for $y = -1$ remains consistent when $\mu$ is sufficiently large.

    c) The upper bound in Theorem 1 is dimension-dependent, since the authors require that $n \geq O(\sqrt{d}/\epsilon)$.

  2. Given that the counterexample doesn't fully convince the reviewer, it is recommended that the authors conduct an empirical evaluation of DPSGD-F on a wider range of datasets. Specifically, it would be beneficial to compare the results with Table 1 in De et al.'s work (https://arxiv.org/pdf/2204.13650.pdf) using various choices of pretrained and finetuned datasets for further validation.

Questions

Please refer to the weaknesses section.

Comment

We thank the reviewer for acknowledging the novelty of the work and raising important questions regarding the results. Below we address the concerns raised by the reviewer.

[The performance of DPSGD-F (Theorem 1) on the counterexample.]

We thank the reviewer for pointing out that Theorem 1 in the submitted manuscript can only show that DPSGD-F achieves a constant error on the counterexample in Section 4. After a careful examination of our result, we are able to obtain an improved performance guarantee for our proposed algorithm DPSGD-F with a more involved analysis. We have marked the updated result (Theorem 1) and analysis (Appendix B.5) in red in the updated draft. In short, we are able to improve the dependence on $R$ and $\mathrm{diam}(D)$ to roughly $(\mathrm{diam}(D) + R/n^2) \cdot \sqrt{\mathrm{rank}(M)}/(n\varepsilon)$, ignoring other constants that do not depend on $R$ and $\mathrm{diam}(D)$. In this form, it can be seen that DPSGD-F strictly improves upon DPSGD by achieving an $o(1)$ error when $\sqrt{\mu} \ll n < \mu/\varepsilon$.

[Whether the counterexample still holds when $x_1 = e_1 + \frac{1}{\mu} e_2$ and $x_2 = e_1 - \frac{1}{\mu} e_2$.]

The result holds under any rescaling of $x_1$ and $x_2$. To see this, note that the standard deviation of the noise in DP-SGD scales linearly with the norm of $x$, and hence any rescaling does not change the relative scale between the added noise and the gradient vector.
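A quick numeric illustration of this argument (our toy check with arbitrary values, not from the paper): rescaling the features rescales the gradient norm and the noise standard deviation by the same factor, so their ratio is unchanged.

```python
import numpy as np

x = np.array([1.0, 0.001])              # a feature vector shaped like the counterexample's
for scale in (1.0, 1e3, 1e-3):          # rescale the whole dataset
    xs = scale * x
    grad_norm = np.linalg.norm(xs)      # for linear losses, the gradient norm tracks ||x||
    noise_std = np.linalg.norm(xs)      # DP-SGD noise std is proportional to the clip norm ~ ||x||
    print(f"scale={scale:g}, signal/noise={grad_norm / noise_std:g}")  # constant across scales
```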

[The upper bound in Theorem 1 is dimension-dependent since the authors required that $n \geq O(\sqrt{d}/\varepsilon)$.]

We agree that this is a limitation of our work. This constraint mainly comes from the requirement on $n$ for the private mean estimation step. To the best of our knowledge, the requirement is needed for state-of-the-art mean estimation algorithms. We leave studying whether the constraint can be improved as a future direction.

[Given that the counterexample doesn't fully convince the reviewer, it is recommended that the authors conduct an empirical evaluation of DPSGD-F on a wider range of datasets]

We hope the improved performance guarantee and analysis of the algorithm resolve the reviewer's concern about the counterexample. This work is mainly a theoretical investigation of the importance of feature preprocessing for private optimization. We believe the theoretical result (especially the improved version in the updated draft) and the set of experiments already give decent evidence for our claim. Moreover, some of the experiments already demonstrate the potential of our approach for challenging classification problems. For example, for the task of privately fine-tuning the last layer of a pretrained model (using ImageNet1K) on CIFAR-100 under $\varepsilon = 1$, our result improves the previous best accuracy from 70.6% (De et al., 2022) to 71.6%. This being said, we agree that conducting more extensive experiments would make the empirical results more convincing. We could add more experiments in the final version if the reviewer still has concerns about the theoretical and empirical contributions of the work.

Comment

I appreciate the refinement made to the conclusion of Theorem 1 by the authors. After a careful review of the proof, I find myself with additional concerns regarding its accuracy. Further intuition from the authors on the proof would be beneficial, and I am willing to reconsider my rating if the proof holds true.

I note the term $\frac{R}{n^2}$, presumably derived from concentration results in mean estimation, where $\|\hat{\mu} - \mu\|_2 \leq O(R/n^2)$. However, the convergence rate for mean estimation is typically $1/n$ rather than $1/n^2$, as also indicated in Theorem 3.3 of Huang et al., 2021, which the authors refer to. Could the authors clarify the rationale for the term being $\frac{R}{n^2}$ instead of $\frac{R}{n}$?

Comment

Thanks for asking this clarifying question. Indeed, the term $\mathrm{diam}(D) + R/n^2$ comes from the private mean estimation guarantee. The reviewer is right that the convergence rate for mean estimation is $1/n$, as per Theorem 3.3 in Huang et al., 2021. However, the constant is $\mathrm{diam}(D)$ instead of $R$. We would like to clarify the notation in the two papers. For a dataset $D = \{x_1, \ldots, x_n\}$, we use $R$ to denote its maximum Euclidean norm, defined as $\max_{x \in D} \|x\|$. This term corresponds to $r(D)$ in Huang et al., 2021. We use $\mathrm{diam}(D)$ to denote the diameter of $D$, defined as $\max_{x \in D} \|x - \mu(D)\|$. This term corresponds to $\omega(D)$ in Huang et al., 2021.

Dependence on $\mathrm{diam}(D)$: Note that Theorem 3.3 of Huang et al., 2021 shows that, with high probability, the private estimation error scales as $\sqrt{d}\,\omega(D)/(n\varepsilon)$, which corresponds to $\sqrt{d} \cdot \mathrm{diam}(D)/(n\varepsilon)$ rather than $\sqrt{d}\,R/(n\varepsilon)$. In Theorem 1 (or more precisely Lemma 3) of our paper, we used a weaker version of their result; namely, we only use the fact that when $n \ge \sqrt{d}/\varepsilon$, the mean estimation error is at most $\mathrm{diam}(D)$ with high probability. Hence our result is consistent with Huang et al., 2021 in terms of the dependence on $\mathrm{diam}(D)$.

Dependence on $R$: The dependence on $r(D)$ (or $R$ in our notation) is not stated explicitly in the guarantee in Theorem 3.3 of Huang et al., 2021. Instead, it is hidden in logarithmic terms in the requirement on $n$ in the theorem statement. In Theorem 3.3, the term $u$ is similar to $r(D)$, as it serves as an upper bound on the maximum norm of the data points. However, in our paper, to obtain an in-expectation bound on the estimation error instead of the high-probability bound in Huang et al., 2021, we need to choose an appropriate failure probability $\beta$ and add an extra term $\beta R$ to the high-probability bound, since if the high-probability bound fails, the error could be as large as $R$, hence contributing $\beta R$ to the overall expectation. We choose $\beta = 1/n^2$ in our paper, and this results in the $R/n^2$ term in our estimation guarantee in Lemma 3, which carries over to the statement of Theorem 1 (see details in Section B.1). We note that, technically, we can choose $\beta$ to be any inverse polynomial of $n$ and further reduce the dependence on $R$. We choose $\beta = 1/n^2$ here mainly for ease of presentation.
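Spelling out the calculation in the previous paragraph (our paraphrase, with $B$ denoting the high-probability error bound from Huang et al., 2021):

```latex
\[
  \mathbb{E}\big[\|\hat{\mu} - \mu(D)\|_2\big]
  \;\le\; (1-\beta)\, B + \beta R
  \;\le\; B + \frac{R}{n^2}
  \qquad \text{with } \beta = \frac{1}{n^2}.
\]
```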

We added a discussion after Lemma 3 (mean estimation result) in our paper to clarify the above result.

We hope this clarifies the reviewer’s concern on the accuracy of our result. We would be happy to answer any further questions from the reviewer.

Comment

Thank you to the authors for addressing my concerns. I am pleased to amend my rating to a 6.

Review (Rating: 8)

This paper looks at the effect of feature pre-processing on DP-SGD. Specifically, the authors focus on the case of linear models and aim to show theoretically and empirically the impact of feature pre-processing. First, they give a counterexample showing that the error of DP-SGD must be proportional to the maximum norm of the features. Then, they show that carefully pre-processing the features, by subtracting the mean and shifting the bias via private quantile estimation, can reduce the error to be proportional to the diameter of the dataset. The experiments show a clear improvement in accuracy while spending a minimal privacy budget on the feature pre-processing.

Strengths

  • The paper gives strong theoretical and empirical support for feature processing, a topic previously only reasoned about empirically.
  • The privacy utility trade-off is a significant roadblock in the adoption of DP-SGD, and this paper gives a simple solution to improve this trade-off and give a nice increase in accuracy.
  • The paper is very well written.

Weaknesses

Theory does not match experiments

The algorithm is significantly modified for the experiment section. It was clear that the algorithm analyzed was designed to make the theory work, and components were dropped or simplified for the experiments.

The part I am concerned about is the normalization of the features before computing the mean. The reason is that the normalization must be carried out in a private manner to allow for bounded sensitivity. The current version seems to assume a normalized dataset without accounting for any privacy budget to carry out the normalization.

Note

I was unable to verify the proofs in this paper due to my lack of theoretical background. I apologize that I leave this verification to my fellow reviewers.

Questions

Can the authors comment on the normalization issue?

Comment

We thank the reviewer for the very positive feedback and for acknowledging the contributions of our work. Below we address the concern the reviewer raised about the experimental part of our paper.

[The algorithm is significantly modified for the experiment section. …… The part I am concerned about is the normalization of the features before computing the mean.]

Thanks for raising this valid concern. The algorithms we use for the theoretical analysis and the experimental evaluations follow the same procedure: (1) private mean estimation and translation of the features; (2) DPSGD on the new features. The main difference is in the private mean estimation step. In the modified algorithm (Algorithm 3), we normalize the features to a fixed norm and use the Gaussian mechanism to privately compute the mean.
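A minimal sketch of this preprocessing stage as we understand it (illustrative only, not the paper's exact Algorithm 3; the normalization norm `C` and the Gaussian-mechanism calibration below are our assumptions):

```python
import numpy as np

def private_recenter(X, C, epsilon, delta, rng):
    """Illustrative private preprocessing: normalize each feature vector to
    norm C, estimate the mean with the Gaussian mechanism, then translate."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = C * X / np.maximum(norms, 1e-12)    # step 1a: fixed-norm normalization
    n = X.shape[0]
    sensitivity = 2.0 * C / n                # replacing one row moves the mean by <= 2C/n
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    mu_hat = Xn.mean(axis=0) + rng.normal(0.0, sigma, size=X.shape[1])
    return Xn - mu_hat                       # step 1b: translate; then run DPSGD on these

# Toy usage:
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))
X_centered = private_recenter(X, C=10.0, epsilon=1.0, delta=1e-5, rng=rng)
```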

We note that the modified algorithm with a fixed normalization norm satisfies the differential privacy guarantee claimed in Lemma 9. In the experiments, we treat the normalization norm as a hyperparameter and search over a small set of values: {1, 10, 100, 1000}. We then report the maximum accuracy achieved within the set. We agree that this hyperparameter search might leak extra information about the dataset, and this is not taken into account in our reported privacy parameter. However, ignoring the privacy leakage of hyperparameter search is common in current empirical evaluations of private training algorithms, e.g., in (De et al 2022). In fact, how best to account for the privacy leakage of hyperparameter search is an active area of research, e.g., (Papernot and Steinke 2022), and is beyond the scope of this work.

Hence, while modifications to the algorithm are made in the experiment section, we believe the result still demonstrates the importance of feature pre-processing for private linear classification tasks, which is the main message of our work.

Nicolas Papernot and Thomas Steinke. Hyperparameter Tuning with Renyi Differential Privacy. ICLR 2022.

Comment

Thanks for the clarification. I think the fact that normalization is treated as a hyperparameter should be mentioned in the paper. It does feel like cheating, but I agree that other works also cheat in the same way.

Review (Rating: 6)

This paper proposes a differentially private SGD algorithm that pre-processes the features before the gradient descent steps. It is shown that, in linear classification problems and under certain regimes of privacy parameters, the new algorithm converges to a solution with a lower optimality gap than the standard mini-batch DP-SGD algorithm without feature pre-processing. The theoretical improvement is supported by numerical experiments on image datasets.

Strengths

  • The paper's contribution is made very easy to understand by the side-by-side comparison between standard DP-SGD's optimality gap and the new algorithm's optimality gap. The paper is overall well-written.

  • The example of why DP-SGD's excess loss must scale with $R$ is simple yet effective.

  • There is an extensive set of numerical experiments supporting the theoretical results.

Weaknesses

  • On the improvement of Theorem 1 over Lemma 1: it appears that the comparison boils down to $\text{diam}(D)\sqrt{\text{rank}(M)\log(1/\delta)} + R\log n$ versus $R\sqrt{\text{rank}(M)\log(1/\delta)}$. While it is clear that, when the diameter is significantly smaller than $R$, the former's first term $\text{diam}(D)\sqrt{\text{rank}(M)\log(1/\delta)}$ is dominated by $R\sqrt{\text{rank}(M)\log(1/\delta)}$, there is a lack of discussion on whether/why $R\log n$ is also dominated by $R\sqrt{\text{rank}(M)\log(1/\delta)}$. The lower bound result, Theorem 2, does not address Theorem 1's second term scaling with $R\log n$ either.

  • It is ambiguous whether the improvement afforded by feature pre-processing is specific to linear classification. The term "linear optimization" in the title and the sentence "even for the simple case of linear classification ..." in the abstract seem to suggest that there are reasons to believe the improvement can generalize to other settings. However, there is barely any discussion of other settings in this paper. If there are reasons why feature pre-processing also improves over DP-SGD in other problems (or in linear optimization in general), mentioning these reasons may significantly strengthen the paper's contribution.

  • A few minor points on writing:

    • the phrase "maximum norm of features" in the abstract could be understood as "\ell_\infty norm of features". It may help to say "maximum Euclidean norm" instead.
    • The letter GG in Lemma 1 is quite far from the previous mention of GG. It might help to define GG as the Lipschitz constant again in the theorem statement.

Questions

Corresponding to the two points under "Weaknesses":

  • Does the claim that Theorem 1 improves over Lemma 1 require any conditions on $\text{rank}(M)$, $\delta$, $n$, or other parameters?
  • What is the relationship, if any, between the benefit of feature pre-processing in linear classification and the "importance of feature processing for differentially private linear optimization" as the title says?
Comment

We thank the reviewer for the positive feedback and constructive comments. Below we discuss the technical questions raised by the reviewer.

[Comparison between Lemma 1 (prior result) and Theorem 1 in different regimes of $\text{rank}(M)$, $n$, $\delta$, and other parameters.]

Thanks for raising this important point. After a careful examination of our result, we are able to obtain an improved performance guarantee for our proposed algorithm DPSGD-F with a more involved analysis. We have marked the updated result (Theorem 1) and analysis (Appendix B.5) in red in the updated draft. In short, we are able to improve the dependence on $R$ and $\mathrm{diam}(D)$ to roughly $(\mathrm{diam}(D) + R/n^2) \cdot \sqrt{\mathrm{rank}(M)}/(n\varepsilon)$, ignoring other constants that do not depend on $R$ and $\mathrm{diam}(D)$. In this form, our result strictly improves upon the prior result. When $\mathrm{diam}(D) > R/n^2$, $\mathrm{diam}(D)$ is the leading term in the optimality gap and the dependence on $R$ is at most logarithmic (it only appears in the requirement on $n$). When $\mathrm{diam}(D) < R/n^2$, the dependence on $R$ is improved by a factor of $n^2$, which also improves upon the prior result. Moreover, the $\frac{R}{n^2}$ term can be reduced to any inverse polynomial function of $n$ by increasing the requirement on $n$ in the assumption by a constant factor. We use $R/n^2$ mainly for presentation purposes.

[Relation between feature pre-processing for DP linear optimization and feature pre-processing for DP linear classification.]

This is a great question. We agree that the main focus of this paper is linear classification models. However, since linear classification is a special case of linear optimization with specific requirements on the type of loss functions used, we believe our result also hints at the role of feature pre-processing in linear optimization problems. The main property we use about linear classification tasks is that the final minimizer satisfies Assumption 2, which requires that the linear model predict values with different signs for at least one pair of samples at the minimizer. This may not hold for other linear optimization problems in general, e.g., linear regression. We suspect that some processing of the labels might help linear regression satisfy this condition, e.g., subtracting the mean from the labels, but this needs further investigation since the mean must be noised to guarantee privacy. We leave studying whether feature pre-processing can help for general linear optimization problems as an important future direction.

[Suggestions on improving the presentation.]

Thanks for the suggestions on improving the presentation. We have added an explanation of the norms used and revisited the definition of $G$ where necessary. Changes are marked in red.

AC Meta-Review

(a) Summarize the scientific claims and findings of the paper based on your own reading and characterizations from the reviewers.

For linear classification, private feature preprocessing can help under differential privacy. Theoretically, it is shown that there exists an example where, without feature preprocessing, DPSGD incurs a privacy error proportional to the maximum norm of the features over all samples. DPSGD-F combines DPSGD with feature preprocessing, and it is proven that, for classification tasks, it incurs a privacy error proportional to the diameter of the features. In particular, it is shown that, in linear classification problems and under certain regimes of privacy parameters, the new algorithm converges to a solution with a lower optimality gap than the standard mini-batch DP-SGD algorithm without feature pre-processing. The practicality of the proposed algorithm is demonstrated on image classification benchmarks.

(b) What are the strengths of the paper?

The paper is clear and well motivated. Experiments are extensive and support the theoretical claims. The toy example showing that DPSGD performance depends on $R$ is effective.

(c) What are the weaknesses of the paper? What might be missing in the submission?

The privacy accounting for the preprocessing step is not properly done. Theorem 1 is still dimension-dependent. The toy example is not convincing enough.

Why not a higher score

The main idea is related to adaptive clipping, which has been explored in several other papers. The main novelty seems to be the authors' analysis of GLMs, which demonstrates the gain from adaptive clipping.

Why not a lower score

The paper makes interesting connections between the geometry of the data, clipping, and GLMs. Presented in this particular way, the results are new and match intuitions from the experiments.

Final Decision

Accept (poster)