PaperHub
Score: 6.8/10 · Poster · 4 reviewers (min 4, max 5, std 0.4)
Ratings: 4, 5, 4, 4 · Confidence: 3.5
Novelty: 3.0 · Quality: 3.0 · Clarity: 2.8 · Significance: 2.8
NeurIPS 2025

Private Zeroth-Order Optimization with Public Data

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29

Abstract

One of the major bottlenecks for deploying popular first-order differentially private (DP) machine learning algorithms (e.g., DP-SGD) lies in their high computation and memory cost, despite the existence of optimized implementations. Zeroth-order methods have promise in mitigating the overhead, as they leverage function evaluations to approximate the gradients and are hence significantly easier to privatize. While recent works have explored zeroth-order approaches in both private and non-private settings, they still suffer from relatively low utilities compared with DP-SGD, and have only been evaluated in limited application domains. In this work, we propose to leverage public information to guide and improve gradient approximation of private zeroth-order algorithms. We explore a suite of public-data-assisted zeroth-order optimizers (PAZO) with minimal overhead. We provide theoretical analyses of the PAZO framework under an assumption of the similarity between public and private data. Empirically, we demonstrate that PAZO achieves superior privacy/utility tradeoffs across vision and text tasks in both pre-training and fine-tuning settings, outperforming the best first-order baselines (with public data) especially in highly private regimes, while offering up to $16\times$ runtime speedup.
Keywords
Zeroth-Order Optimization · Differential Privacy

Reviews and Discussion

Review (Rating: 4)

This paper aims to accelerate zeroth-order differentially private optimization while improving gradient approximation by leveraging public data. The authors propose three PAZO variants and validate them both theoretically and empirically.

Strengths and Weaknesses

Strengths

  • Proposed three versions of the optimization method to utilize public data in zeroth-order settings.
  • The proposed methods often achieve performance comparable to first-order DP methods that use public data, while significantly improving the speed of private learning.
  • The privacy-utility trade-off is mitigated, which implies that PAZO can be useful where $\epsilon$ is small.

Weaknesses

  • PAZO improves performance by using public data under the γ-similarity assumption. However, this raises the question: isn't the improvement mostly due to the public data alone? That is, isn’t it possible that the private data contributed very little during training? In fact (1) The performance does not vary much with the privacy budget. (2) The results are not very sensitive to hyperparameters. In first-order methods, private gradients are clipped, whereas public gradients are not, meaning public gradients may dominate the update. As a result, private gradients might have a negligible impact. To address this, the authors should conduct an experiment using only public data (i.e., no private data, hence no privacy concerns), and explicitly report how much performance improves when private data is included.

  • In the vision domain, the authors assume in-distribution public data, while in the language domain, they assume out-of-distribution (OOD) public data. This seems to be done to satisfy the γ-similarity assumption. Perhaps in language tasks, even OOD data still lies in a similar feature space. If that is the case, the paper should discuss domain-specific similarity more clearly or unify the assumption across domains. If in-distribution public data is assumed, this should be clearly stated at the beginning of the paper.

Minor weaknesses

  • Notation should be changed. The model is denoted by $x$, while the data is denoted by $\xi$. Normally, $x$ refers to data and $\theta$ or $w$ to parameters.
  • Comparisons between methods that use public data and those that do not (e.g., Figures 2 and 5) are unclear. As done in Tables 1–4, figures should visually differentiate methods that use public data (e.g., using distinct colors or styles).

Questions

  • Why is the noise added to the scalar directional derivative $\{f(x+\lambda u)- f(x -\lambda u)\}/2\lambda$, instead of to $\text{clip}(\{f(x+\lambda u)- f(x -\lambda u)\}/2\lambda) \cdot u$ as in DPZero [7]? DPZero added noise to the vector after clipping, resulting in the noise being added in all $d$ dimensions. However, in this paper, scalar noise is added and then multiplied by the random vector. This makes the direction of $u$ preserved. What are the privacy implications of preserving the direction of $u$? Even though the vector $u$ is random, it is used as an input in the directional derivative $\{f(x+\lambda u)- f(x -\lambda u)\}/2\lambda$, and thus may potentially leak private information. How can this be justified from a privacy perspective?
  • Figure 4 only reports PAZO-S. However, the convergence theorems show different upper bounds for PAZO-M, P, and S. Based on these, we would expect different convergence behaviors. Can the authors provide convergence plots for each variant?

Limitations

More interpretation of experimental results is needed regarding robustness and the privacy-utility tradeoff.

Justification for Final Rating

In the rebuttal, my major concern about the privacy guarantee was addressed. I raised my score from 3 to 4

Formatting Issues

The paper is well formatted.

Author Response

We appreciate the reviewer's time and detailed feedback. We present our responses and address the concerns as follows.

[Performance of public only] We thank the reviewer for this question. Indeed, we view the robustness of PAZO across various privacy budgets and hyperparameters as one advantage. Next, we demonstrate that the improvements are not due to overfitting to public data. As suggested, we train on public data alone using SGD with the batch size, learning rate, and weight decay tuned. We denote the optimal accuracy on each dataset as ‘public only’ and pick the best first-order with public data (FO+PUB) and zeroth-order with public data (PAZO) algorithm for each dataset. We present the performance gain when private data is included (i.e., ‘FO+PUB/PAZO’ minus ‘public only’ across $\varepsilon\in\{0.1,0.5,1,2,3\}$).

| | CIFAR10 | Tiny-ImgNet | IMDB | MNLI |
| --- | --- | --- | --- | --- |
| Public only | 66.1 | 27.1 | 68.4 | 60.8 |
| FO+PUB improvement | -1.3 ~ 6.8 | 3.4 ~ 8.4 | 1.8 ~ 9.5 | -4.3 ~ 12.0 |
| PAZO improvement | 4.0 ~ 4.5 | 3.8 ~ 4.0 | 5.4 ~ 6.2 | 7.4 ~ 8.2 |

Note that these ‘public only’ accuracies come with severe overfitting due to the small number of public samples. In contrast, the DP accuracies come without overfitting. We observe that 1) FO enjoys up to a 12.0 performance gain and ZO up to 8.2 when private data is included; 2) ZO consistently enjoys a performance gain when private data is included, while FO does not, since the private gradient can be too noisy when privacy is tight.

[OOD vision public data] We thank the reviewer for suggesting this. In our submission, for vision tasks, we did use public data with slight class imbalance (with class_size ratio $[1:\ldots:0.85]$) and treated this as a mild distribution shift from the private data (Section 5.1, line 228). Our $\gamma$-similarity assumption is an upper bound on the distribution gap, which is unified across different domains. To verify that OOD vision data also provides gradients with bounded distance from the private gradient (i.e., reasonable $\gamma$-similarity), we create a public dataset with big class imbalance (with class_size ratio $[1.0:0.9:0.8:\ldots:0.2:0.1]$). (Running SGD on this imbalanced public data alone produces a test accuracy of 50.1 on balanced test data.) We report the test accuracies and $\gamma$-similarity of PAZO below ($\varepsilon=1$). We observe that while the range of $\gamma$-similarity depends on the specific method, the values are consistently small. For a fixed PAZO variant, as the public data becomes more in-distribution, performance improves and $\gamma$ decreases (i.e., gradients for public and private data become more similar).

| Pub data \ Method | PAZO-M | PAZO-P | PAZO-S |
| --- | --- | --- | --- |
| Slight imbalance | 70.2 ($\gamma=4.5$) | 69.2 ($\gamma=3.3$) | 70.2 ($\gamma=1.6$) |
| Half-half | 69.8 ($\gamma=4.9$) | 67.6 ($\gamma=3.8$) | 67.3 ($\gamma=1.8$) |
| Big imbalance | 65.5 ($\gamma=6.4$) | 64.1 ($\gamma=4.7$) | 63.0 ($\gamma=2.1$) |

To demonstrate improvements of PAZO under more out-of-distribution public data, we report the performance of PAZO compared with the best first-order methods with public data (Best FO+PUB, a.k.a. DOPE-SGD) under big class imbalance. It shows that 1) OOD vision public data improves the performance of all the methods; 2) ZO+PUB also outperforms FO+PUB under tight privacy budgets ($\varepsilon\leq 1$), even when the public data is OOD.

| Type | Method | $\varepsilon=0.1$ | $\varepsilon=0.5$ | $\varepsilon=1.0$ | $\varepsilon=2.0$ | $\varepsilon=3.0$ |
| --- | --- | --- | --- | --- | --- | --- |
| FO | DP-SGD | 46.7 | 49.7 | 50.8 | 54.2 | 54.5 |
| Best FO+PUB | DOPE-SGD | 54.8 | 60.1 | 64.2 | 67.6 | 68.3 |
| ZO | DPZero | 47.0 | 48.1 | 48.2 | 48.2 | 48.1 |
| (ours) | PAZO-M | 65.3 | 65.6 | 65.5 | 65.9 | 66.3 |
| (ours) | PAZO-P | 64.3 | 64.0 | 64.1 | 64.3 | 65.1 |
| (ours) | PAZO-S | 63.2 | 62.9 | 63.0 | 64.1 | 63.8 |

[Adding noise $z$] Thank you for this question. We’d like to clarify that PAZO-M adds a scalar noise $z$ to the clipped scalar $\text{clip}((f(x+\lambda u)-f(x-\lambda u))/(2\lambda))$ (Algorithm 1, line 8 in our submission), following similar designs as in DPZero ([1] Algorithm 2, line 3). We note that another option of adding noise to the vector $\text{clip}((f(x+\lambda u)-f(x-\lambda u))/(2\lambda))\cdot u$ ([1] Algorithm 1, line 3) appeared in DPZero as one baseline, not their actual proposed algorithm ([1] Algorithm 2); and they show that adding noise to the scalar derivative is preferred ([1] Section 4 and Appendix E, “Privacy guarantee”). The reviewer is correct that $(f(x+\lambda u)-f(x-\lambda u))/(2\lambda)$ is data-dependent, which is exactly why we privatize this term (line 8) to satisfy differential privacy.

[1] DPZero: Dimension-Independent and Differentially Private Zeroth-Order Optimization. ICML 2024.
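To make the two options concrete, here is a minimal NumPy sketch (our own illustration, not the authors' or DPZero's code; `loss`, the clipping threshold `C`, and the noise scale `sigma` are placeholder assumptions) of scalar-noise versus vector-noise privatization of the two-point estimate:

```python
import numpy as np

def zo_step_scalar_noise(loss, x, u, lam, C, sigma, rng):
    # Two-point finite-difference estimate of the directional derivative.
    delta = (loss(x + lam * u) - loss(x - lam * u)) / (2 * lam)
    # Clip the scalar, then add scalar Gaussian noise (the option the
    # paper follows); u itself is public randomness.
    delta_priv = np.clip(delta, -C, C) + sigma * rng.standard_normal()
    return delta_priv * u                       # update direction in R^d

def zo_step_vector_noise(loss, x, u, lam, C, sigma, rng):
    # Alternative raised in the question: clip the scalar, form the
    # vector update, then add d-dimensional Gaussian noise.
    delta = (loss(x + lam * u) - loss(x - lam * u)) / (2 * lam)
    g = np.clip(delta, -C, C) * u
    return g + sigma * rng.standard_normal(size=x.shape)
```

In both sketches the only data-dependent quantity is the clipped finite difference; the difference is simply where the Gaussian noise is injected.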

[Convergence speed] Since figures cannot be uploaded here, we present the accuracy that DPZero converges to and compare the evaluation step index at which each ZO/ZO+PUB method achieves that accuracy ($\varepsilon=1$). We observe that PAZO variants have slightly different convergence behaviors, but they are all consistently faster than ZO without public data.

| | CIFAR10 (acc 48.2) | Tiny-ImgNet (acc 27.5) | IMDB (acc 62.6) | MNLI (acc 58.2) |
| --- | --- | --- | --- | --- |
| DPZero | 393 | 272 | 379 | 496 |
| PAZO-M | 10 | 16 | 80 | 14 |
| PAZO-P | 7 | 14 | 118 | 16 |
| PAZO-S | 1 | 19 | 101 | 5 |

We will present the convergence plot of all variants in the revised paper.

[Notation and figure format] We appreciate these suggestions and will adopt the new notation in our revised version. We will also update the figures by using more distinct colors and styles to differentiate methods with and without public data.

Comment

Thank you for your rebuttal. In the rebuttal, my major concerns about a public-only setting and the noise addition to a scalar value have been addressed. I'll raise my score.

Comment

Thank you again for your valuable time and review! We are glad your major concerns have been addressed.

Review (Rating: 5)

The authors present a novel zero-order optimization algorithm that is guided by some first-order information (i.e., gradients) of publicly available data. They introduce three different ways of integrating such information in the zero-order optimization process and show that they can achieve faster convergence while still keeping the privacy advantages coming with zero-order optimization.

Strengths and Weaknesses

Strengths

  • The idea is interesting and reasonable in distributed environments where data privacy is important
  • the paper is written clearly
  • a convergence analysis is provided
  • empirical results look promising

Weaknesses

  • In Algo. 1 it is unclear how the batch of public data is obtained. Since it should match the private batch as well as possible, the authors should describe how they ensure that both batches exhibit similar data characteristics (if they exhibit conflicting characteristics, divergence might even happen)
  • In the unnumbered equation between lines 141-142, it is not clear to me why the coefficients $u$ are learned with the presented equation. Please clarify this.
  • In the equation before line 142, why is it $\mathbb{E}_u[\nabla f(x)^T Gu\,Gu]$? Isn't it only $Gu$? If so, the subsequent equations are also wrong. A brief derivation would be helpful here.
  • The authors claim an equivalence between their projection of the gradient and the expectation of gradient approximations when sampling from a ball induced by $G$ (see lines 143-144). Please provide proof of this equivalence.

Since I am not familiar with the field of DP and zero-order optimization, my review should be used with care for the final decision.

Questions

  • wouldn't a repeated querying of the same function lead to a much higher privacy budget needed to ensure DP?

Limitations

The work should explicitly discuss current limitations (at least briefly).

Justification for Final Rating

As I have written in my initial review, I think the authors' work is a significant contribution towards efficient zero-order optimization in privacy-restricted regimes. The use of publicly available data is a promising direction to advance optimization algorithms under privacy constraints. In their rebuttal, the authors have addressed all the points raised in my initial review. Although I didn't read the other reviews in full detail, it seems the authors have addressed the major concerns in the rebuttal.

Formatting Issues

n/a

Author Response

We appreciate the reviewer’s valuable feedback and positive assessment of our work. We present our responses below.

[Choice of public batch] We thank the reviewer for carefully noticing this. In practice, public data can be obtained via users who opt out of private training or other side information/statistics of the sensitive data [1], which would guarantee certain similarities.

Since there is no prior work that leverages public information in private zeroth-order optimization, we present all three PAZO methods, each with different properties. PAZO-M (Algorithm 1) indeed simply aggregates the private ZO gradient with a randomly sampled public batch gradient, which is fast and memory-efficient. Empirically, we show that if we pick reasonable public data with mild distribution shifts, PAZO-M performs well. Theoretically, we show that if the public gradient is not close to the private gradient, one should reduce $\alpha$ for better convergence.

Due to the need for selecting good public gradients, we have proposed PAZO-P that allows for larger weights on good public gradients and smaller weights on less related public gradients. We also propose PAZO-S, which directly selects the public gradient that incurs the lowest loss. They may offer better performance in some scenarios (Section 5) while having more computational cost.

[1] Private Adaptive Optimization with Side Information. ICML 2022.
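To make the PAZO-M and PAZO-S designs described above concrete, here is a minimal sketch (our own illustrative code, not the paper's algorithms; `grad_public` and `loss_private` are assumed helpers, and clipping/DP noise on the private evaluations are omitted):

```python
import numpy as np

def pazo_m_direction(g_zo_private, g_public, alpha):
    # PAZO-M: convex combination of the private ZO estimate and a public
    # first-order gradient; alpha should shrink when the public gradient
    # is far from the private one (see the gamma-similarity discussion).
    return alpha * g_public + (1 - alpha) * g_zo_private

def pazo_s_direction(x, public_batches, grad_public, loss_private, eta):
    # PAZO-S: compute the k public batch gradients sequentially and keep
    # the one whose trial step yields the lowest loss (function
    # evaluations only; DP noising of these evaluations is omitted here).
    best_g, best_loss = None, np.inf
    for batch in public_batches:
        g = grad_public(x, batch)
        trial_loss = loss_private(x - eta * g)
        if trial_loss < best_loss:
            best_g, best_loss = g, trial_loss
    return best_g
```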

[Learning coefficients] We thank the reviewer for the careful reading and for pointing out the vagueness. We learn the coefficients in a zeroth-order fashion, i.e., relying on function evaluations. Specifically, when we sample $q$ vectors $u$, each sampled $u$ is an aggregation proposal (for how to linearly combine the columns of $G$), and the loss difference is the feedback that indicates how good this proposal is. The update direction in PAZO-P is, in essence, a weighted average of the columns of $G$, with the weights (i.e., coefficients) depending on the loss differences. Since we take the feedback from $f$ to weigh and aggregate the public gradients, we interpret the coefficients as being learned. We will clarify this in the next version.
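A rough sketch of this feedback-weighted aggregation (our own illustration, with clipping and DP noise omitted; `loss_private` is an assumed helper evaluating the loss used for feedback):

```python
import numpy as np

def pazo_p_direction(x, G, loss_private, lam, q, rng):
    # G has shape (d, k): its columns are the public batch gradients.
    # Each sampled u proposes a linear combination G @ u of those columns;
    # the two-point loss difference along the proposal is its signed weight.
    d, k = G.shape
    direction = np.zeros(d)
    for _ in range(q):
        u = rng.standard_normal(k)           # proposal coefficients
        v = G @ u                            # candidate direction in R^d
        delta = (loss_private(x + lam * v)
                 - loss_private(x - lam * v)) / (2 * lam)
        direction += delta * v               # feedback-weighted proposal
    return direction / q
```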

[Why $Gu$] A more detailed derivation is as follows: The term $\Delta := [f(x + \lambda Gu) - f(x - \lambda Gu)]/(2\lambda)$ is a central difference approximation to the directional derivative of $f$ at $x$ in the direction $Gu$. As $\lambda \rightarrow 0$, this converges to $\nabla f(x)^{\top}Gu$. Therefore, $\mathbb{E}[\Delta\, Gu]=\mathbb{E}[\nabla f(x)^{\top} Gu\, Gu]=\mathbb{E}[Gu\,\nabla f(x)^{\top}Gu]=\mathbb{E}[Gu u^{\top}G^{\top}\nabla f(x)]$ as claimed.
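This identity can also be checked numerically. The following self-contained sketch (our own, using an arbitrary quadratic test function and the convention $\mathbb{E}[uu^\top]=I_k$) verifies that the averaged estimator approaches the projected gradient $GG^\top\nabla f(x)$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, lam, n = 50, 5, 1e-3, 100_000

A = rng.standard_normal((d, d)); A = A @ A.T      # PSD quadratic form
b = rng.standard_normal(d)
f = lambda x: 0.5 * x @ A @ x + b @ x             # smooth test function
grad_f = lambda x: A @ x + b

G, _ = np.linalg.qr(rng.standard_normal((d, k)))  # orthonormal columns
x = rng.standard_normal(d)

est = np.zeros(d)
for _ in range(n):
    u = rng.standard_normal(k)                    # E[u u^T] = I_k
    v = G @ u
    est += (f(x + lam * v) - f(x - lam * v)) / (2 * lam) * v
est /= n

target = G @ G.T @ grad_f(x)                      # projected gradient
print(np.linalg.norm(est - target) / np.linalg.norm(target))
# relative error; it shrinks as n grows
```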

[Proof of equivalence] We present the derivation as follows. Note that the expectation of the two-point zeroth-order gradient estimator is the same as that of one-point due to the symmetry of the sampling distribution:

$$\mathbb{E}_u \left[\frac{f(x+\lambda u) - f(x-\lambda u)}{2\lambda}u \right] = \mathbb{E}_u \left[\frac{f(x+\lambda u)}{2\lambda}u \right] + \mathbb{E}_u \left[\frac{f(x + \lambda (-u))}{2\lambda}(-u) \right] = \mathbb{E}_u \left[\frac{f(x+\lambda u)}{\lambda}u \right].$$

Therefore, the goal is to show $\nabla_x \mathbb{E}_{v\sim\text{Unif}(G(\sqrt{k}B^k))}[f(x+\lambda v)] = \mathbb{E}_{u\sim\text{Unif}(\sqrt{k}S^{k-1})}\left[\frac{f(x+\lambda u)}{\lambda}u\right]$. According to the divergence theorem in higher dimensions, we define the body $\Omega = G(\sqrt{k}B^k) \subset \mathbb{R}^d$ and its boundary $\partial\Omega = G(\sqrt{k}S^{k-1})$, and then we have

$$\int_{\Omega} \nabla_v f(x+\lambda v)\, dv = \int_{\partial\Omega} f(x+\lambda u) \frac{u}{\Vert u\Vert}\, du = \frac{1}{\sqrt{k}} \int_{\partial\Omega} f(x+\lambda u)\, u\, du.$$

Note that $\nabla_v f(x+\lambda v) = \lambda \nabla_x f(x+\lambda v)$ and $\text{Area}(r S^{d-1}) = \frac{d}{r} \cdot \text{Vol}(r B^d)$. Since an orthonormal $G$ is area- and volume-preserving, we have $\text{Area}(\partial\Omega) = \sqrt{k} \cdot \text{Vol}(\Omega)$. Combining the above, we obtain

$$\nabla_x \mathbb{E}_{v\sim\text{Unif}(\Omega)}[f(x+\lambda v)] = \frac{1}{\lambda \text{Vol}(\Omega)} \int_{\Omega} \nabla_v f(x+\lambda v)\, dv = \frac{1}{\lambda \sqrt{k}\,\text{Vol}(\Omega)} \int_{\partial\Omega} f(x+\lambda u)\, u\, du = \frac{\text{Area}(\partial\Omega)}{\lambda\sqrt{k}\, \text{Vol}(\Omega)} \mathbb{E}_{u\sim \text{Unif}(\partial\Omega)}[f(x+\lambda u)\, u] = \frac{1}{\lambda} \mathbb{E}_{u\sim \text{Unif}(\partial\Omega)}[f(x+\lambda u)\, u]$$ as desired.

[Privacy implications of $q$] We thank the reviewer for emphasizing that and would like to clarify that we have already taken that into account by multiplying the variance of the noise $z$ in each step by $q$ (Algorithms 1-3 in our submission). We present the privacy guarantees in Section 3.4.

[Limitations] In Section 6, we acknowledge the limitation that the current convergence bound can be further sharpened by considering other similarity metrics, e.g., angle-based metrics. Further, a broader and larger-scale benchmark of public and private dataset pairs will enhance the significance of our empirical results.

Comment

I thank the authors for the rebuttal, my points have been addressed accordingly.

Comment

Many thanks for your valuable time and review! We are glad that your concerns have been addressed.

Review (Rating: 4)

The paper considers using public data to assist private zeroth-order optimization. Existing works have shown that zeroth-order methods can be applied to differentially private training or fine-tuning. However, there is still a gap in terms of performance compared with first-order methods. This work proposes a suite of public-data-assisted zeroth-order optimizers (PAZO), including PAZO-{M,P,S}. PAZO-M simply uses a convex combination of private zeroth-order estimates and the public gradient as the descent step. PAZO-P narrows the search space of zeroth-order optimization into a subspace constructed using public gradients. PAZO-S uses private data to select the best public gradient. Convergence analysis shows that all three methods match the worst-case upper bound of existing results. Empirical experiments show that all three methods match or even outperform existing public-data-assisted first-order methods, while offering speedups in runtime.

Strengths and Weaknesses

The paper is clearly written, and the studied topic is interesting and important for the community. The claims are supported by evidence, and the results look good to me. My concerns are as follows.

  1. Why is $\gamma$-similarity used? A more natural measure to me is the cosine similarity. In all convergence results, there is dependence on $\sigma$. I understand the meaning of $\sigma_1$ and $\sigma_2$, but what is $\sigma$? What is the dependence on $q$ and the dimension $d$ in all theoretical results? The dependence on the dimension $d$ is crucial in zeroth-order methods and should be highlighted and well discussed.

  2. In several places of the paper, the authors say existing zeroth-order methods only consider prompt tuning for language models, like [6,7,8,10]. I think this is not a correct statement. Prompt tuning is one efficient way of doing language model fine-tuning, where only prompt embeddings are updated and the base model is frozen. This is not what [6,7,8,10] do. In these related works, all trainable model parameters are updated and thus considered to be full parameter fine-tuning. The authors should correct all these wrong statements.

  3. One of the major reasons that previous works propose to use zeroth-order methods is to save memory by avoiding gradient computations. In PAZO, public gradients are required anyway, and thus this benefit of saving memory is lost. Is there a comparison in terms of GPU memory for all methods, including vanilla DPZero and DPSGD, PAZO-{M,P,S}, and public-data-assisted first-order methods?

  4. The authors claim in the contribution that improved convergence can be guaranteed using a proper choice of parameters. It is not clear in the main text how the rates can be improved. Is this improvement only in terms of constants or also in terms of key parameters like $T$ and $d$?

Questions

Major questions are already asked in weakness. Other minor:

  1. Typos: should it be "rely on" instead of "reply on" in line 21 and line 132? I guess it should be "combined with these prior methods" in line 88.

  2. The paper uses random direction uniformly on the sphere when introducing the algorithm. Random search directions from Gaussian are also extensively used in previous work. Which noise is used in the experiments? How do these two compare in both the theory and experiments?

  3. Isn't it costly to sample $k$ mini-batches in PAZO-P and PAZO-S? What is the value of $k$ and batch size $b$ used for each method when comparing time efficiency in Figure 5?

Limitations

Limitations are already discussed in the weakness, including dimension dependence in theoretical results and memory consumption.

Justification for Final Rating

My major concern is the memory efficiency of the proposed method, as a public gradient is required, but the main motivation to use zeroth-order methods is to save memory by avoiding gradient computation. The authors replied in the rebuttal that the saving in memory is on the private data compared with public-data-assisted DP first-order methods. This makes some sense but does not fully resolve this limitation. However, there might be some settings where the proposed methods provide a clear advantage. I think the paper proposes an interesting and new observation that public data can greatly improve the performance of private zeroth-order methods. The reasons to accept outweigh the reasons to reject, so I keep my score of 4 for acceptance.

Formatting Issues

Looks good to me.

Author Response

We are grateful to the reviewer for the careful reading and valuable comments. We present our responses as follows.

[$\gamma$- vs. cosine similarity] We thank the reviewer for this suggestion. Cosine similarity measures angular alignment, while Euclidean distance captures both scale and angle mismatch. Since $\Vert a-b\Vert\leq\gamma$ ensures $b$ stays within a $\gamma$-radius ball around $a$, it more directly controls the absolute error of gradient updates in convergence analysis, which is also adopted by existing literature [2,3]. Empirically, we observe that our Euclidean similarities consistently correlate positively with performance (Tables 1-2 in the rebuttal). However, we acknowledge that cosine similarity could complement our analysis: If we assume $\Vert a\Vert\leq M_1$ and $\Vert b\Vert\leq M_2$, then the cosine similarity lower bound $\cos(a,b)\geq \varphi$ implies $\Vert a-b\Vert\leq\gamma=\max(M_1, M_2, \sqrt{M_1^2+M_2^2-2M_1M_2\varphi})$, which can be plugged into our current analysis. We leave more geometric-aware similarity measures and their relations to the convergence of our methods and those of others for future work.
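For completeness, the implication above is just the law of cosines combined with the norm bounds (a supporting step we spell out here, not a claim from the rebuttal itself):
$$\Vert a-b\Vert^2 = \Vert a\Vert^2 + \Vert b\Vert^2 - 2\Vert a\Vert\Vert b\Vert\cos(a,b) \leq \Vert a\Vert^2 + \Vert b\Vert^2 - 2\Vert a\Vert\Vert b\Vert\varphi,$$
and maximizing the right-hand side over $\Vert a\Vert\in[0,M_1]$ and $\Vert b\Vert\in[0,M_2]$ (the maximum of this quadratic is attained at the endpoints) gives $\max(M_1^2,\, M_2^2,\, M_1^2+M_2^2-2M_1M_2\varphi)$, i.e., the $\gamma^2$ quoted above.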

[Clarification on $\sigma$] We clarify that $\sigma$ is the variance of the noise we add to guarantee differential privacy (Section 3.4 in our submission, line 171).

[Dependence on $q$, $T$, $d$, and improved convergence] We achieve either constant improvements or better dependence on $d$, depending on the specific PAZO variant. Specifically, if we choose $T=\frac{(1-\alpha)^2n(r+2)\varepsilon}{4\sqrt{r\log(e+\varepsilon/\delta)}}$, PAZO-M achieves $O((1-\alpha)^2\sqrt{d} \log((1-\alpha)d))$, which has a better dependence on $d$ than DPZero if the errors from public data $O(\alpha^2/(1-\alpha)^2(\gamma^2+\sigma_2^2/b'))$ do not dominate. If we choose $T=\frac{n(r+2)\varepsilon}{4\sqrt{r\log(e+\varepsilon/\delta)}}$, PAZO-P has $d$-independent error rate $O(\sqrt{d} \log k)$ with an additional error $O(\gamma^2 + \sigma_2^2/b')$, where $k \ll d$ is the number of bases of $G$. PAZO-S has an error rate that is $d$-independent since it selects the best public gradient as the update direction.

When $q>1$, the error of zeroth-order optimization reduces by $O(1/q)$ while the variance of the noise $z$ increases $q$-times. The resultant error term will be of the form $Aq+B/q$ for some $A$ and $B$, so the optimal $q$ can be chosen as $q=\sqrt{B/A}$.
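Spelled out, the optimal $q$ follows from a one-line calculation (standard, added here for completeness):
$$\frac{d}{dq}\left(Aq + \frac{B}{q}\right) = A - \frac{B}{q^2} = 0 \;\;\Rightarrow\;\; q^* = \sqrt{B/A}, \qquad Aq^* + \frac{B}{q^*} = 2\sqrt{AB}.$$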

[Wording on prompt tuning] We thank the reviewer for pointing it out. We would like to clarify that we use “prompt tuning” to refer to the task of full-parameter fine-tuning with prompts. Our argument is that (which also aligns with [5]) ZO may not work in standard full-model fine-tuning without prompts where 1) the input sequence is not modified with task-specific natural language templates or 2) a task-specific classification head is added on top of the final <s> embedding. We will replace “prompt tuning” with “full fine-tuning with prompts” to eliminate the vagueness.

[Memory cost of PAZO] While we agree that PAZO requires memory for public gradients, it eliminates the memory/computation/implementation bottlenecks when we consider private settings. PAZO is more efficient (and sometimes more accurate) than differentially-private first-order methods, and is more accurate than vanilla DP zeroth-order methods without public data.

In Table 6 of our submission, we present the number of different operations needed per iteration of each method. Specifically, PAZO-M requires one batch of public gradients, so the memory overhead is $O(d)$, where $d$ is the number of model parameters. PAZO-S is also $O(d)$ since we can compute the $k$ public batch gradients sequentially. Though PAZO-P has an $O(kd)$ memory overhead over DPZero, it is still more memory- and computation-efficient than the first-order DP methods because the latter generally require $O(bd)$ memory to maintain per-sample gradients, where $b$ is the private batch size. Our experimental results are obtained using $k\in\{3,6\}$ while $b=64$. Such dependence on $b$ is also restrictive since larger batch sizes improve performance [6,7].

Therefore, compared to first-order DP with public data, PAZO 1) enjoys time- and memory-efficiency and 2) performs better under tight privacy; compared to DPZero, PAZO converges faster to better accuracies, at the cost of computing public batch gradients.

[Perturbation sampling] We would like to clarify that we use Gaussian perturbations in our experiments (lines 570-572 in our submission), as most of the other ZO work does [1,4,5]. Our rationale is two-fold. (1) Theoretically, uniform sampling from the sphere and sampling from a Gaussian are closely related: to sample from $\text{Unif}(\sqrt{d}S^{d-1})$, it is equivalent to sample a vector from $\mathcal{N}(0, I_d)$ and then scale it to have norm $\sqrt{d}$. Since the norm of $v\sim\mathcal{N}(0, I_d)$ concentrates on $\sqrt{d}$ with variance $O(1/d)$, we can expect that sampling from these two distributions does not give us very different vectors when $d\gg 1$, with high probability.
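A quick numerical illustration of this norm-concentration point (our own sketch, not part of the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 10_000, 1_000
g = rng.standard_normal((n, d))                        # Gaussian directions
# Rescale each Gaussian sample onto the sphere of radius sqrt(d).
sphere = np.sqrt(d) * g / np.linalg.norm(g, axis=1, keepdims=True)
# Relative gap between a Gaussian direction and its spherical version.
rel_gap = np.linalg.norm(g - sphere, axis=1) / np.sqrt(d)
print(rel_gap.mean(), rel_gap.max())                   # both on the order of 1/sqrt(d)
```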

(2) Empirically, prior work [1] has shown that Gaussian and sphere perturbations produce very similar results. The following table shows that this also holds for PAZO, where we present the accuracy of sampling from a Gaussian minus that of sampling from the sphere ($\varepsilon=1$).

| Method | CIFAR10 | Tiny-ImgNet | IMDB | MNLI |
| --- | --- | --- | --- | --- |
| PAZO-M | 0.9 | 0.2 | 0.5 | -0.3 |
| PAZO-P | 0.7 | -0.1 | -0.7 | 1.1 |
| PAZO-S | 0.4 | 0.5 | -0.4 | 0.9 |

[Figure 5 hyperparameter] We would like to clarify that each run plotted in Figure 5 is the one with the best accuracy of the corresponding method after $\{k, b'\}$ are tuned. DPZero is occasionally slower than PAZO because we try $q\in\{1,5\}$ for each method and use $q=1$ if $q=5$ does not bring significant improvements. As presented in Table 7 in our submission, we observe that the optimal $q$ for DPZero is $5$ while public information allows PAZO to use $q=1$ to achieve competitive accuracies.

Though the runs of PAZO-P and PAZO-S in Figure 5 require computing $k$ batch gradients, $k\in\{3,6\}$ is small. $q=1$ for PAZO and $q=5$ for DPZero. The values of $b'$ for each method are presented below.

| Method | CIFAR10 | Tiny-ImgNet | IMDB | MNLI |
| --- | --- | --- | --- | --- |
| DPMD | 8 | 8 | 8 | 8 |
| DOPE-SGD | 8 | 32 | 8 | 8 |
| GEP | 32 | 32 | 32 | -- |
| PAZO-M | 8 | 8 | 4 | 32 |
| PAZO-P | 32 | 16 | 64 | 32 |
| PAZO-S | 8 | 8 | 8 | 8 |

[1] DPZero: Dimension-Independent and Differentially Private Zeroth-Order Optimization. ICML 2024.
[2] Train Faster, Generalize Better: Stability of Stochastic Gradient Descent. ICML 2016.
[3] Optimization Methods for Large-Scale Machine Learning. SIAM Review 2018.
[4] Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark. ICML 2024.
[5] Fine-Tuning Language Models with Just Forward Passes. NeurIPS 2023.
[6] Learning Differentially Private Recurrent Language Models. ICLR 2018.
[7] Unlocking High-Accuracy Differentially Private Image Classification through Scale. ICML 2022.

Comment

Many thanks for the detailed response! I do not have further questions. I will keep my initial score to recommend acceptance of the paper.

There seem to be some typos in the rebuttal. Isn't the rate of PAZO $\sqrt{r}$ instead of $\sqrt{d}$, as the authors said it is dimension-independent? I also checked the response to another reviewer, where the rate is $\sqrt{r}$ under additional assumptions.

Comment

Many thanks for your time and feedback. Yes, that is a typo (thanks for catching it!), and we will correct to $O((1-\alpha)^2\sqrt{r}\log((1-\alpha)d))$ for PAZO-M and $O(\sqrt{r}\log k)$ for PAZO-P in our revised paper.

Review (Rating: 4)

In this paper, the authors proposed combining public first-order gradients with private zeroth-order gradients for differentially private training. The paper proposes three ways of mixing the gradients: 1. naive weighted averaging; 2. projecting ZO direction to FO gradient subspace; 3. using zeroth-order evaluation to determine the optimal public gradient direction. The paper provides both convergence and privacy analyses of the proposed algorithms, and presents numerical results on CV and NLP datasets.

Strengths and Weaknesses

Strengths:

  1. The paper provides novel ways of mixing zeroth and first-order gradients for differentially private optimization.
  2. The paper introduces $\gamma$-similarity to measure the dissimilarity between the public and private data distributions.
  3. The paper provides a theoretical analysis for the privacy and convergence guarantee of all three proposed algorithms under reasonable assumptions.
  4. Numerical results show that the proposed algorithms outperform the baseline algorithm DPZero and are less sensitive to $\epsilon$.

Weaknesses:

  1. For PAZO-P and PAZO-S, how does the memory consumption compare with DP-Zero or PAZO-M? A key benefit of using ZOO is its memory-saving. If the FO gradient matrix G uses too much memory, there might be no point in using the ZO gradient.

  2. For PAZO-M, clipping the function value itself does not make sense to me. In the case that $f$ is very large (or very small), clipping it to $C$ does not provide any information. It might be more reasonable to clip the function difference.

  3. Is argmin the best operator for PAZO-S? The authors fail to investigate other smoothed versions of argmin to avoid directly dropping the other gradient information.

  4. For the theoretical analysis, the convergence results rely on the clipping being inactive, while in DPZero the effect of clipping is taken into consideration with a concentration bound. Therefore, the analysis requires further refinement to avoid assuming that the clipping is inactive and the bounded-gradient Assumption 1.

  5. The paper does not provide the final privacy-utility trade-off, i.e., the optimal choice of steps $T$ for convergence.

  6. In the proof in B.4 for PAZO-S, in lines 550 and 551, it is unclear to me why the expected value of $g_t - g'(x_t;B'_{*})$ is bounded by $\sigma^2_2$, since after the argmin, $g'$ is no longer unbiased.

  7. I didn't find the choice of $b$ in Table 8. Also, for DP-SGD, why is $b$ not an integer?

  8. Since the paper introduced $\gamma$, it is good to have a set of experiments demonstrating how the algorithm performs with different $\gamma$'s for one dataset.

Questions

Please address the weakness.

I believe the theoretical analysis is limited and may contain errors. If my concerns about the theoretical analysis can be addressed, I would love to increase my score.

Limitations

N/A

Justification for Final Rating

Based on the rebuttal discussion and clarifications, I believe the issues in the theoretical analysis are minor and can be fixed during the final revision. The numerical experiment details have also been clarified. Therefore, I would raise my score.

Formatting Issues

N/A

Author Response

We appreciate the reviewer’s time and detailed feedback. We address the concerns and clarification questions as follows.

[Memory cost of PAZO] While we agree that PAZO requires memory for public gradients, it eliminates the memory/computation/implementation bottleneck of maintaining per-sample gradients in private settings. PAZO is more efficient (and sometimes more accurate) than differentially-private first-order methods, and is more accurate than vanilla DP zeroth-order methods without public data.

In Table 6 of our submission, we present the number of different operations needed per iteration of each method. Specifically, PAZO-M requires one batch of public gradients, so the memory overhead is $O(d)$, where $d$ is the number of model parameters. PAZO-S is also $O(d)$ since we can compute the $k$ public batch gradients sequentially. Though PAZO-P has an $O(kd)$ memory overhead over DPZero, it is still more memory- and computation-efficient than the first-order DP methods because the latter generally require $O(bd)$ memory to maintain per-sample gradients, where $b$ is the private batch size. Our experimental results are obtained using $k\in\{3,6\}$ while $b=64$. Such dependence on $b$ is also restrictive since larger batch sizes improve performance [3,4].

Therefore, compared to first-order DP with public data, PAZO 1) enjoys time- and memory-efficiency and 2) performs better under tight privacy; compared to DPZero, PAZO converges faster to better accuracies, at the cost of computing public batch gradients.

[Clipping difference] We’d like to clarify that for PAZO-M, we indeed clip the function difference (Algorithm 1 line 8). Please let us know if the reviewer still has concerns.

[argmin in PAZO-S] We motivate the design of argmin as in Eq. (3) where in that particular setting, argmin is optimal. We note that a smoothed version of argmin (linearly combining the public gradients) is explored in PAZO-P, where the final update direction is a linear combination of the public gradients. We acknowledge that this is a promising direction, and we leave it as future work to explore other ways to determine the coefficients.

[Refined bounds with active clipping and final privacy/utility tradeoffs] We note that we did not incorporate the effect of clipping in the original submission because we were using weaker assumptions. In particular, DPZero uses our Assumption 1 ([1] page 37, the equation above Eq. 21), in addition to a stronger per-sample Lipschitz assumption (Assumption 3.1 in [1]). Our submission assumed inactive clipping also because 1) it is a common assumption in the DP literature [2] and 2) our novelty is not proposing/analyzing clipping methods but exploring how to leverage public information in zeroth-order optimization both theoretically and empirically.

However, for a clearer comparison with DPZero, we thank the reviewer for suggesting this and provide the refined error bounds considering active clipping and privacy/utility tradeoffs under exactly the same assumptions as DPZero:

PAZO-M. For any $\varepsilon>0$ and $\delta\in(0,1)$, PAZO-M is $(\varepsilon,\delta)$-DP. Under Assumption 3.5 in [1], suppose $\max_{0\leq t\leq T} |F_S(x_t)|\leq B$; then the output $x_{\tau}$ satisfies
$$\mathbb{E}[\Vert\nabla F_S(x_{\tau})\Vert^2] \leq \frac{\sqrt{r\log(e + \varepsilon/\delta)}}{n\varepsilon} \left[ 64\big((F_S(x_0) - F_S^*)l + (1-\alpha)^2\tilde{L}^2\big)+ 2L^2 \right] + 4\alpha\big(\gamma^2+\sigma_2^2/b'\big) + \frac{\alpha^2(\gamma^2+\sigma_2^2/(2b'))}{4(1-\alpha)^2(r+2)}$$

where we define $\tilde{L}^2 = L^2\log\left(\frac{2\sqrt{2\pi}n^3\varepsilon^2(r+2)\left((1-\alpha)d+\alpha+8(1-\alpha)^2lB(r+2)/L^2\right)}{r\log(e+\varepsilon/\delta)}\right)$

and choose the parameters to be $\eta = \frac{1}{4(1-\alpha)^2l(r+2)}$, $T=\frac{(1-\alpha)^2n(r+2)\varepsilon}{4\sqrt{r\log(e+\varepsilon/\delta)}}$, $\lambda\leq \frac{L}{ld^{3/2}}\left(\frac{\sqrt{r\log(e+\varepsilon/\delta)}}{n\varepsilon}\right)^{1/2}$, and $C=4\tilde{L}$.

This shows that our dependence on $d$ and $r$ becomes $O((1-\alpha)^2\sqrt{r}\log((1-\alpha)d))$. Compared to DPZero's $O(\sqrt{r}\log d)$, if $\alpha>0$, our error has a smaller dependence on $d$ but additional errors $O(\alpha^2/(1-\alpha)^2(\gamma^2+\sigma_2^2/b'))$ due to using biased public gradients.

PAZO-P. For any $\varepsilon>0$ and $\delta\in(0,1)$, PAZO-P is $(\varepsilon,\delta)$-DP. Under the same assumptions as stated above, the output $x_{\tau}$ satisfies
$$\mathbb{E}[\Vert\nabla F_S(x_{\tau})\Vert^2] \leq \frac{\sqrt{r\log(e + \varepsilon/\delta)}}{n\varepsilon} \left[ 64\big((F_S(x_0) - F_S^*)l + \tilde{L}^2\big)+ 3L^2 \right] + 8(\gamma^2+\sigma_2^2/b'),$$
where we define $\tilde{L}^2 = L^2\log\left(\frac{2\sqrt{2\pi}n^3\varepsilon^2(r+2)(k+8lB(r+2)/L^2)}{r\log(e+\varepsilon/\delta)}\right)$ and choose $C=4\tilde{L}$, with $\eta$, $T$, and $\lambda$ the same as in DPZero Theorem 3 [1]. This shows that we achieve a $d$-independent error bound $O(\sqrt{r}\log k) + O(\gamma^2+\sigma_2^2/b')$, $k\ll d$.

[Proof in B.4] We thank the reviewer for pointing out the typo: the upper bound should be $2\sigma_2^2$, and all the remaining analysis still holds (differing by a constant). We provide a step-by-step derivation for the claim $\mathbb{E}[\Vert g_t' - g'(x_t;B_*')\Vert^2] \leq 2\sigma_2^2/b'$ as follows:

Recall that $b'$ is the public batch size and $g_t'$ is the full public gradient. For a sampled public batch index $i\in[k+1]$, its gradient is $g(x_t;B_{t,i}') = g_t' + \frac{1}{b'} \sum_{j=1}^{b'} \zeta_{t,j}^{(i)}$, where $\zeta_{t,j}^{(i)}$ is the stochastic gradient noise for the public sample $j$ in the $i$-th batch. We denote the selected best batch as $I$ and thus

$$\mathbb{E}_{B_t'}[\Vert g_t'-g^*(x_t;B_t')\Vert^2] = \mathbb{E}_{B_t'} \left[\Big\Vert\frac{1}{b'}\sum_{j=1}^{b'}\zeta_{t,j}^{(I)}\Big\Vert^2\right] = \frac{1}{b'} \mathbb{E}_{B_t'} \left[\Vert\zeta_{t}^{(I)}\Vert^2\right] = \frac{1}{b'} \left( \Vert\mathbb{E}_{B_t'} [\zeta_{t}^{(I)}]\Vert^2 + \mathbb{E}_{B_t'} \left[\Vert\zeta_t^{(I)} - \mathbb{E}_{B_t'}[\zeta_{t}^{(I)}]\Vert^2\right] \right).$$

By assumption, $\mathbb{E}_{B_t'} \left[\Vert\zeta_{t}^{(i)}\Vert^2\right] \leq \sigma_2^2$ for any batch $i$. The choice of $I$ is dependent on the private data batch $B_t$. Therefore, by the law of total expectation,

$$\Vert\mathbb{E}_{B_t'} [\zeta_{t}^{(I)}]\Vert^2 = \Vert\mathbb{E}_i \left[ \mathbb{E}_{B_t'} [\zeta_{t}^{(I)}\mid I=i]\right]\Vert^2 \leq \mathbb{E}_i \left[ \Vert\mathbb{E}_{B_t'} [\zeta_{t}^{(I)}\mid I=i] \Vert^2 \right] \leq \mathbb{E}_i [\sigma_2^2] = \sigma_2^2.$$

For the variance term, $$\mathbb{E}_{B_t'} \left[\Vert\zeta_{t}^{(I)} - \mathbb{E}_{B_t'}[\zeta_{t}^{(I)}]\Vert^2\right] \leq \mathbb{E}_{B_t'} \left[\Vert\zeta_{t}^{(I)}\Vert^2\right] = \mathbb{E}_i \left[ \mathbb{E}_{B_t'} \left[\Vert\zeta_{t}^{(I)}\Vert^2\right] \mid I=i \right] \leq \sigma_2^2.$$

Substituting the bias and variance bounds, we obtain $\mathbb{E}_{B_t'}[\Vert g_t'-g^*(x_t;B_t')\Vert^2] \leq 2\sigma_2^2/b'$.

[Choice of $b$ in Table 8] In hyperparameter tuning, $b$ is tuned only for non-private methods (MeZO and SGD). We fixed the private batch size to 64 for all private methods, including zeroth-order and first-order, with and without public data. We will make it clear in our appendix. We thank the reviewer for finding the typo in the line for DP-SGD in Table 8: the $b$ should have been $\eta$.

[Performance under various $\gamma$] To create public data with different extents of distribution shift (different $\gamma$), we mix ID public data (MNLI) and OOD public data (SNLI) in different proportions. We present the performance and $\gamma$ of PAZO ($\varepsilon=1.0$) under these scenarios. We observe that 1) the range of $\gamma$ is method-dependent and 2) for any fixed PAZO variant, the accuracy increases as the data become more similar (smaller $\gamma$).

| Pub data \ Method | PAZO-M | PAZO-P | PAZO-S |
| --- | --- | --- | --- |
| MNLI only | 75.9 ($\gamma=38$) | 73.2 ($\gamma=43$) | 74.0 ($\gamma=49$) |
| Half-half | 73.5 ($\gamma=49$) | 72.7 ($\gamma=45$) | 73.1 ($\gamma=67$) |
| SNLI only | 67.7 ($\gamma=72$) | 68.8 ($\gamma=68$) | 68.9 ($\gamma=81$) |

We also present results on CIFAR10 with public data of different extents of distribution shift, where we observe similar trends. The public data with slight class imbalance has class_size ratios $[1:\ldots:0.85]$. To create public data with more distribution shift, we enforce big class imbalance (with class_size ratios $[1.0:0.9:0.8:\ldots:0.2:0.1]$).

| Pub data \ Method | PAZO-M | PAZO-P | PAZO-S |
| --- | --- | --- | --- |
| Slight imbalance | 70.2 ($\gamma=4.5$) | 69.2 ($\gamma=3.3$) | 70.2 ($\gamma=1.6$) |
| Half-half | 69.8 ($\gamma=4.9$) | 67.6 ($\gamma=3.8$) | 67.3 ($\gamma=1.8$) |
| Big imbalance | 65.5 ($\gamma=6.4$) | 64.1 ($\gamma=4.7$) | 63.0 ($\gamma=2.1$) |

[1] DPZero: Dimension-Independent and Differentially Private Zeroth-Order Optimization. ICML 2024.
[2] Do Not Let Privacy Overbill Utility: Gradient Embedding Perturbation for Private Learning. ICLR 2021.
[3] Learning Differentially Private Recurrent Language Models. ICLR 2018.
[4] Unlocking High-Accuracy Differentially Private Image Classification through Scale. ICML 2022.

Comment

My concerns have been addressed by the authors. The corrections and updates to the theoretical analysis and the clarification of the experimental settings are clear. I would raise my score.

Still, for private first-order gradient computation, it is worth mentioning that existing ghost-clipping and bookkeeping methods avoid using $O(bd)$ memory for storing per-sample DP gradients.

Comment

Thank you again for your valuable time and review! We are glad to know that the concerns have been addressed.

We appreciate the suggestion of mentioning ghost-clipping and bookkeeping methods, which could mitigate the memory cost of vanilla DP-SGD. While we briefly mention efficient DP-SGD implementations in the introduction (first paragraph), we will explicitly highlight the benefits of those methods in our revised paper. For instance, we note that ghost-clipping/bookkeeping can improve the layer-wise gradient storage to $\min\{2bp^2, bd\}$, where $b$ is the private batch size, $d$ is the number of model parameters of a specific layer, and $p$ is the feature dimension of this layer, i.e., the sequence length for text data [5]. It may introduce additional computations, and the improvements may be limited when $p$ is large, which is usually the case for LMs with long context windows [5]. We fully agree that there exist relatively efficient DP-SGD variants, and we will properly cite and discuss them.

[5] Differentially Private Optimization on Large Model at Small Cost. ICML 2023.
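As a rough, purely hypothetical illustration of when the second term attains the $\min$ above (numbers are ours, not from [5] or the paper): for a private batch of $b=64$, sequence length $p=1024$, and a $768\times 768$ linear layer ($d\approx 5.9\times 10^5$),
$$2bp^2 = 2\cdot 64\cdot 1024^2 \approx 1.3\times 10^8 \;>\; bd \approx 64 \cdot 5.9\times 10^5 \approx 3.8\times 10^7,$$
so for that layer ghost clipping stores no less than per-sample gradients, consistent with the remark that the improvement may be limited for long context windows.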

Final Decision

After the discussion phase, all reviewers agree that the submission is of high enough quality for publication at NeurIPS. As Area Chair, I agree and recommend acceptance.