Instance-Optimality for Private KL Distribution Estimation
Abstract
Reviews and Discussion
This paper studies the problem of estimating distributions over symbols under KL divergence and differential privacy. It introduces a notion of multiplicative instance-optimality, in which an estimator’s performance is measured against the best oracle estimator that knows a specified neighbourhood of the true distribution. The authors focus on a particular neighbourhood defined by additive perturbations of the probability masses. They then construct a “sampling-twice” estimator that attains instance-optimality for this neighbourhood and show that the same idea extends to the private setting by adding calibrated Laplace noise to the histogram. Conceptually, the work is most comparable to [27], which introduced multiplicative instance-optimality for the Wasserstein distance rather than KL divergence.
Strengths and Weaknesses
Strengths:
- The extension of the "multiplicative" instance-optimality framework to KL divergence is a natural and fundamental problem. The authors also argue that the additive perturbation neighbourhood is the “right” notion for attaining instance-optimality under KL divergence.
- The paper introduces several technical innovations, such as a "Generalized Assouad’s method" for deriving lower bounds and the "Sampling Twice" estimator for the upper bound.
- The overall proof strategy appears reasonable and sound. However, due to limited time, I did not verify all details carefully.
Weaknesses:
- My main reservation about the paper is its presentation. The authors completely omit proofs from the main text, which makes it quite difficult to assess the core technical contributions. I suggest including at least proof sketches for the main theorems. Additionally, the non-DP instance-optimality result seems to be a substantial contribution on its own, while the DP part feels more like an add-on and could be discussed briefly in the main text.
- The authors state that the main motivation for the "Sampling Twice" trick is to reduce the bias of classical Good-Turing estimates, yet the experiments show that Good-Turing outperforms all compared estimators. This makes the narrative somewhat confusing. Is there a formal proof that Good-Turing cannot attain instance-optimality under your additive neighbourhood?
- Typos: “Poof” → “Proof” and “optimlity” → “optimality” appear multiple times in the appendix.
Questions
See weakness section.
Limitations
yes
Justification for Final Rating
I agree with the other reviewers that this is a solid work and have no issues accepting it.
Formatting Concerns
NA
We thank the reviewer for their feedback and suggestions, and we will be sure to rearrange proofs and correct the typos to improve the presentation clarity in the revised paper.
Comparison to Good-Turing estimator
We apologize for the confusion; we discussed the comparison briefly in lines 137-151, but will devote more space to further clarify this important question.
Both the Good-Turing estimator and our “sampling twice” estimator reduce bias (compared to the empirical estimator) by estimating and distributing the combined mass. The main advantage of the “sampling twice” trick over the Good-Turing estimator lies not in utility, but in its significantly lower sensitivity to neighboring datasets, which makes it easier to privatize and lets it remain instance-optimal even in the DP setting. (See lines 139-151 for a detailed sensitivity comparison.)
In terms of practical performance, the Good-Turing estimator and its hyperparameter choices have been extensively studied in prior work, so the existing baselines are very strong. Our work gives the first provably instance-optimal (up to constants) algorithm and comes very close to the best existing baselines. However, since our sampling-twice approach processes two disjoint subsets, it incurs twice the (sampling) noise in extreme cases (e.g., when all symbols are above or below the threshold), leading to a suboptimal constant of two. This tradeoff can be partially mitigated through hyperparameter tuning, such as adjusting the dataset partition ratio, as discussed in Section 4.
Does Good-Turing attain instance-optimality under the additive neighborhood?
This is a great question. We are unaware of negative results. Nevertheless, the existing analysis for Good-Turing would yield an additive gap compared to the lower bound we obtain under the additive neighborhood. The main source of this gap is that Good-Turing's error for the combined probability estimate (as in Lemma 19 of Orlitsky & Suresh (2015)) suffers from delicate correlations between the partitioned buckets (of small symbols) and their combined-mass estimates, whereas our sampling-twice estimator (as in Eq. 191, which applies Lemma B.10) avoids this by using fresh counts for the small symbols to estimate the combined mass. We will add this discussion in the revised version to clarify the comparison.
I thank the authors for their responses to my questions and the discussion on the optimality of the Good-Turing estimator. I agree with the other reviewers that this is a solid work and have no issues accepting it.
This paper studies the problem of estimating a discrete distribution over [d] in KL divergence. Specifically, it focuses on locally minimax rates: bounds which are optimal even if the estimator knows the distribution lives in some local neighborhood. The authors study this problem in both private and non-private settings.
The main contributions are as follows. First, they define a notion of neighbourhood for this problem: a standard perturbation-based neighbourhood in which each symbol of the current distribution can be perturbed by O(1/n), with the additional requirement that the mass of small-probability symbols does not change significantly within the neighbourhood. The first technical result is an extension of the Assouad-based lower bound technique, proving a fundamental lower bound on the rate achievable by any algorithm with respect to this neighbourhood.
Then, they propose a non-private algorithm for this task. The algorithm is quite simple and intuitive, based on a double-sampling, Good-Turing-style trick: we sample two datasets of size n/2, use the first dataset to estimate the low-probability symbols, and then use the second dataset to estimate the individual probabilities. The authors show that this approach achieves the desired local minimax property.
Finally, they give a generalized version of the algorithm that handles the private case. It is similar to the non-private version, except that Laplace noise is used to privatize various components of the non-private algorithm.
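For concreteness, here is a minimal NumPy sketch of an estimator in this spirit. It is not the paper's Algorithm 1: the count threshold, the 50/50 split, the Laplace scale, and the uniform rule for spreading the small-symbol mass are all illustrative placeholders.

```python
import numpy as np

def sampling_twice_estimate(samples, d, threshold=1, eps=None, rng=None):
    """Illustrative 'sampling twice' style estimator over symbols {0, ..., d-1}.

    NOT the paper's exact Algorithm 1: the threshold, the 50/50 split, the
    Laplace scale, and the uniform mass-spreading rule are placeholders.
    `samples` is a 1-D array of integer symbols in [0, d).
    """
    rng = np.random.default_rng() if rng is None else rng
    samples = np.asarray(samples)
    half = len(samples) // 2
    first, second = samples[:half], samples[half:]

    counts1 = np.bincount(first, minlength=d).astype(float)
    counts2 = np.bincount(second, minlength=d).astype(float)
    if eps is not None:
        # Flavor of the private variant only: privatize both histograms with
        # Laplace noise (privacy budget split in half across the two halves).
        counts1 += rng.laplace(scale=2.0 / eps, size=d)
        counts2 += rng.laplace(scale=2.0 / eps, size=d)

    # First half decides which symbols look "small" (low probability).
    small = counts1 <= threshold

    # Second half provides fresh counts: individual estimates for the large
    # symbols and the *combined* mass of the small symbols.
    n2 = max(len(second), 1)
    p_hat = np.maximum(counts2, 0.0) / n2
    combined_small_mass = p_hat[small].sum()

    # Spread the combined small mass over the small symbols (placeholder:
    # uniformly; an add-constant style rule could be used instead).
    if small.any():
        p_hat[small] = combined_small_mass / small.sum()

    # Clip and renormalize so the output is a valid distribution.
    p_hat = np.maximum(p_hat, 1e-12)
    return p_hat / p_hat.sum()
```

The structural point, per the authors' discussion in the rebuttal above, is that the small/large split is decided on one half of the data while the combined mass is estimated from fresh counts on the other half.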
Strengths and Weaknesses
This paper is very interesting, both in terms of results and techniques. The estimators and neighborhood relations are non-trivial and nuanced. The results are also complete in the sense that there are not many gaps. The authors additionally do a good job of reviewing existing neighbourhood definitions and discussing their limitations. I can't think of much weakness in this work. Typo in Algorithm 1: it seems N should be n.
Questions
n/a
Limitations
yes
Justification for Final Rating
no updates. This paper merits acceptance.
Formatting Concerns
no
We thank the reviewer for the recognition and feedback. We will be sure to address the typo in the revised paper.
The paper considers instance-optimal differentially private (DP) KL distribution estimation with multivariate Poisson data. They first define a new notion of instance-optimality based on a neighborhood of additively perturbed probability values and show that these new neighborhoods are the smallest that still allow for instance-optimal algorithms. They then propose a new non-private instance-optimal (w.r.t. these neighborhoods) algorithm. A private variation of said algorithm is also shown to be instance-optimal (among DP algorithms). Numerical experiments show advantages of the proposed method over baselines on distribution instances of practical interest.
Strengths and Weaknesses
Strengths:
- The new notion of neighborhood is intuitive and the "minimality" result for this notion is quite nice.
- Instance-optimal algorithms (up to constants) are provided for the non-DP and DP settings.
- The algorithms are novel and the lower bound proof requires some novel techniques as well.
Weaknesses:
- Discussion of why 2-point and multiplicative neighborhoods are "too small" could be clarified. It seems the main idea is that these neighborhoods don't permit instance-optimal algorithms, but the explanation of why is not totally clear to me.
- I didn't see citation/discussion of Asi & Duchi's earlier works on instance-optimal DP estimation.
- A table summarizing the many previous works described in paragraph 2 of Sec 1.2 (in comparison to the submission) would be nice.
- The technical aspects are presented in a dense manner, with many important details and result statements deferred to the appendix. While I was able to get the high-level ideas from reading, it would be nice if the authors could streamline the presentation a bit (e.g. get rid of some redundancies and defer some less important things to the appendix) to make room for clearer presentation of the most important technical aspects. Of course this is not easy with the 9-page limit, but hopefully the camera-ready version can be improved if the paper is accepted.
- Abstract could be improved to better describe the paper's core contributions (as discussed in the introduction)
Questions
Is the requirement in the lower bound (7) necessary? If not, what are the challenges to obtaining a lower bound that holds for all epsilon?
Intuition for the "privacy for free" regime in line 264?
Limitations
yes
Justification for Final Rating
No updates
Formatting Concerns
no
We thank the reviewer for their comments and feedback. We will improve the presentation and the arrangement of statements and proofs for better clarity in future revisions. Below we address the questions.
Two-point neighborhood and multiplicative neighborhood
We apologize for the confusion and will clarify the existing results better in an updated version. Yes, lines 92-101 discuss their main limitation as being too small to permit instance-optimal algorithms: their lower bounds largely remain unchanged as the dimensionality of the problem grows.
- Two-point neighborhood: the lower bound comes from distinguishing two hypotheses based on the observed data. The primary factor determining this lower bound is how “separated” the two hypotheses are, not the number of features or dimensions of the data. Consequently, lower bounds for two-point neighborhoods typically do not grow with dimensionality of the problem, as evidenced by prior works [21,44,6].
- Multiplicative neighborhood: the lower bound comes from distinguishing the target instance from other distributions that lie in its multiplicative neighbourhood (e.g., within a factor of 1/2). Consequently, the lower bound is at most a constant that equals the multiplicative ratio (e.g., 1/2) of the neighbourhood size.
By contrast, there are instances (such as the long-tailed instances in our particular distribution estimation problem) whose estimation error inevitably grows with the data dimension (e.g., the construction in our Theorem G.9), indicating that the two-point neighborhood and the multiplicative neighborhood are too small.
Discussion of Asi & Duchi’s earlier works on instance-optimal DP estimation:
We thank the reviewer for pointing out that we missed discussing the works of Asi and Duchi [a, b]. Their first work only deals with one-dimensional quantities. Their NeurIPS 2020 work discusses high-dimensional problems defined on the dataset (rather than on the distribution, as in our work). The notion of local minimax optimality there includes all datasets within a certain distance, whereas their second notion only competes with unbiased algorithms. We will add a discussion of these works to the final version.
- [a] Asi, H., & Duchi, J. C. (2020). Near instance-optimality in differential privacy. arXiv preprint arXiv:2005.10630.
- [b] Asi, H., & Duchi, J. C. (2020). Instance-optimality in differential privacy via approximate inverse sensitivity mechanisms. Advances in neural information processing systems, 33, 14106-14117.
A table summarizing many previous works described in paragraph 2 of Section 1.2
For the DP setting, to the best of our knowledge, there are no known results for the KL minimax rates or instance-optimal rates, so we only summarized our results in Tables 3, 4, and 5, and summarized prior non-DP results in Tables 1 and 2. Guarantees in other error metrics are nicely summarized in [Acharya, Sun, and Zhang 2021, Tables 1-2].
The requirement in the lower bound (7)
This is a great question. Typically, private estimation considers the scenario epsilon > 1/n. Indeed, if epsilon <= 1/n, then by group privacy, the output distributions of the algorithm on any two datasets D and D' (of size n) are within a constant factor of each other. In other words, the algorithm output is essentially independent of the input dataset, and this is an uninteresting regime.
Technically, our proof mainly requires this condition to ensure that the perturbed values assigned to low-probability symbols remain valid probabilities. Essentially, this constant 1 can therefore be relaxed to any other constant, albeit changing the multiplicative constant in our obtained lower bound.
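To spell out the group-privacy step referenced above (a standard calculation, not specific to this paper): if an algorithm A is epsilon-DP and D, D' are any two datasets of size n (hence differing in at most n entries), then for every event S,

```latex
\Pr[A(D) \in S] \;\le\; e^{n\epsilon}\,\Pr[A(D') \in S]
\;\le\; e \cdot \Pr[A(D') \in S]
\qquad \text{whenever } \epsilon \le 1/n ,
```

so the output distribution changes by at most a factor of e regardless of the input data.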
Intuition for “privacy for free” regime in line 264
Intuitively, for symbols with large enough probability, the noise added due to privacy has variance on the order of 1/epsilon^2 per count, while the inevitable statistical variance of the symbol's count under natural Poisson data sampling is n*p_i (equivalently, p_i/n for its empirical frequency). Thus, when n*p_i is large enough, the error due to privacy is negligible compared to the statistical error, i.e., privacy comes "for free". The corresponding term in our DP lower bound (Eq. 7), compared to the corresponding term in our non-DP lower bound (Eq. 5), captures this phenomenon.
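As a rough back-of-the-envelope version of this comparison (constants ignored; Lap(1/epsilon) denotes the per-count Laplace noise and N_i the raw count of symbol i under Poissonized sampling):

```latex
\operatorname{Var}\!\bigl[\mathrm{Lap}(1/\epsilon)\bigr] \;=\; \frac{2}{\epsilon^{2}},
\qquad
\operatorname{Var}[N_i] \;=\; n\,p_i \quad \bigl(N_i \sim \mathrm{Poisson}(n p_i)\bigr),
```

so the privacy noise is negligible once n*p_i is much larger than 1/epsilon^2.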
Thanks for the rebuttal, which helped clarify my questions. I will keep my score and recommend acceptance.
The paper considers the problem of private discrete distribution estimation with respect to the KL-distance. This is a fundamental question which has been considered extensively in the non-private statistics literature. When addressing this problem, it is possible to design two types of algorithms, depending on the target guarantee: i) minimax optimal algorithms, and ii) instance optimal algorithms. Minimax optimal algorithms have optimal performance when one considers the worst possible distribution in the class. This is a strong guarantee, but it can be too conservative in practice. Conversely, instance optimal algorithms yield guarantees on a per-instance basis, adapting to easier instances while recovering the guarantees promised by minimax optimal algorithms on worst-case instances. Based on the above description, instance optimal algorithms are very desirable, due to their potential for good performance on practical instances. However, designing them and establishing their guarantees can be a hard task. Additionally, how one defines instance optimality is generally a non-trivial matter that depends on the setting.
The specific notions of instance optimality considered in this work are inspired by the earlier work of [FMST24], where the idea is that an algorithm running on samples from a distribution must be competitive with the best algorithm that receives samples from a distribution lying in a neighbourhood of it, where this neighbourhood is known to the competing algorithm. The neighbourhoods considered in this work involve small additive perturbations.
The paper gives nearly matching upper and lower bounds for both minimax and instance optimal discrete distribution estimation with respect to the KL-distance under pure DP. The upper bounds are obtained by designing low-sensitivity variants of well-known estimators (namely add-constant estimators and the Good-Turing estimator) and then privatizing them. The lower bounds are obtained via a novel generalization of Assouad's method for decomposable statistical distances. An empirical evaluation of the performance of the algorithms is also given.
[FMST24] Vitaly Feldman, Audra McMillan, Satchit Sivakumar, and Kunal Talwar. Instanceoptimal private density estimation in the wasserstein distance. arXiv preprint arXiv:2406.19566, 2024.
Strengths and Weaknesses
Strengths:
- The problem that the paper considers is important and well-motivated. KL-discrete distribution estimation is a fundamental topic in density estimation, and so is attaining instance optimality.
- The results exhibit a non-trivial degree of technical difficulty and novelty. In particular, the decomposable generalization of Assouad's method could be of broader interest.
- Algorithm 1 could also be of broader interest as an alternative to Good-Turing.
Weaknesses:
- One thing I had a slight issue with was the presentation. The way the main body is structured, the main results of the work are not stated properly. The results are described briefly in text, and then pointers are given to the appendix, where the full statements are included. This bothered me slightly while reading the paper, since I felt that the main body does not really stand on its own, and I had to do some back and forth. Additionally, the bounds were hard to parse (but I guess this is inherent to some extent when it comes to instance optimality). It would be great if the authors edited the manuscript in a future update to address this issue.
Overall, I think this is a good paper that merits acceptance.
Questions
I have one question that is slightly broader. What challenges would the authors expect to arise when considering the problem under approx-DP? Pure-DP can be somewhat restrictive, and is not the variant of DP that is most commonly used in practice. Can the authors provide a short discussion of the extent to which they believe their approach would also work for approx-DP?
Limitations
Yes
Justification for Final Rating
My view of the paper was positive from the beginning, and still is, after the rebuttal.
Formatting Concerns
No formatting concerns
We thank the reviewer for their suggestions, and will incorporate them to improve the presentation to make the main body more self-contained. Below we answer the questions.
Approx-DP
Our per-instance lower bounds hold for (epsilon, delta) approximate DP with delta < epsilon, and they match the upper bound achieved by pure-DP algorithms. This indicates that our pure-DP algorithm, while enjoying stronger privacy guarantees, is also instance-optimal under approximate DP. Similar phenomena have previously been observed for the histogram estimation problem under approximate DP, where, in the asymptotic sense, approximate DP (e.g., the Gaussian mechanism) offers little advantage over pure DP (e.g., the Laplace mechanism).
In the practical sense, the Gaussian mechanism may indeed be better in some parameter regimes (due to its lighter tails compared to the Laplace mechanism). Our algorithm (after substituting the Laplace mechanism with the Gaussian mechanism and changing the threshold to calibrate to delta) achieves instance optimality up to a log(1/delta) factor. We will include the Gaussian variant of our algorithm and its instance-optimality proof in the future revised paper.
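As an illustration of the kind of substitution described above, here is a hedged sketch of privatizing a histogram with the Gaussian mechanism using the classical (epsilon, delta) calibration, valid for epsilon < 1; the sensitivity convention, threshold handling, and exact calibration used in the paper may differ.

```python
import numpy as np

def privatize_histogram_gaussian(counts, eps, delta, rng=None):
    """Illustrative Gaussian-mechanism variant (not the paper's algorithm).

    Classical calibration: sigma = sqrt(2 ln(1.25/delta)) * Delta2 / eps gives
    (eps, delta)-DP for eps < 1, where Delta2 is the L2 sensitivity of the
    histogram (sqrt(2) if one sample is substituted, moving two counts by 1).
    """
    rng = np.random.default_rng() if rng is None else rng
    delta2 = np.sqrt(2.0)  # L2 sensitivity under substitution of one sample
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * delta2 / eps
    counts = np.asarray(counts, dtype=float)
    return counts + rng.normal(scale=sigma, size=len(counts))
```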
All of the reviewers agree that this paper deserved to be accepted. It addresses a natural and fundamental problem about instance-optimal distribution estimation, and provides significant new results. The only consistent concerns were related to the quality of the presentation, so the authors are encouraged to focus their camera-ready efforts there, using the reviewer comments as a guide.