On Differential Privacy for Adaptively Solving Search Problems via Sketching
We develop data structures for search problems such as regression and nearest neighbor search that are robust against an adaptive adversary.
Abstract
Reviews and Discussion
First, the authors consider the differentially private ANN problem. They reduce differentially private ANN to a differentially private selection problem, which they then solve via differentially private selection methods. Next, the authors consider the problem of differentially private linear regression. They solve this problem by applying matrix sketching techniques. The key idea is to use a number of different standard random matrix sketches, and then to aggregate the solutions in a differentially private manner via an extension of the differentially private median approach of (Beimel et al., 2022). Finally, to bound the utility, they compute an ell-infinity bound on the error introduced by sketching. As the authors note, this is an interesting approach to bounding the utility, as ell-infinity bounds are an uncommon tool in matrix sketching.
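As a toy illustration of this pipeline (my own sketch, not the authors' code), the following solves a least-squares problem on several independent Gaussian sketches and aggregates the solutions with a plain coordinate-wise median; the paper's algorithm replaces this plain median with a differentially private median:

```python
import numpy as np

def sketch_and_solve(A, b, m, rng):
    """Solve least squares on one Gaussian sketch of (A, b).

    S has m rows with m << n, trading accuracy for speed.
    """
    n = A.shape[0]
    S = rng.standard_normal((m, n)) / np.sqrt(m)
    return np.linalg.lstsq(S @ A, S @ b, rcond=None)[0]

def aggregate_solutions(A, b, m=200, copies=9, seed=0):
    """Illustrative pipeline: several independent sketches, then a
    coordinate-wise median of the solution vectors.  The paper uses a
    *differentially private* median here; the plain median below is a
    stand-in to show the data flow only."""
    rng = np.random.default_rng(seed)
    sols = [sketch_and_solve(A, b, m, rng) for _ in range(copies)]
    return np.median(np.stack(sols), axis=0)

rng = np.random.default_rng(1)
A = rng.standard_normal((5000, 10))
x_true = rng.standard_normal(10)
b = A @ x_true + 0.01 * rng.standard_normal(5000)
x_hat = aggregate_solutions(A, b)
# x_hat is close to x_true in every coordinate, which is the
# ell-infinity-style guarantee the review refers to.
```

The coordinate-wise aggregation is what makes an ell-infinity (per-coordinate) error bound the natural notion of utility here, rather than the usual ell-2 cost bound.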
Update after rebuttal
Thank you for the helpful clarifications. I have raised my score accordingly. Please include these clarifications in the final version.
Questions To Authors
It would be good if the authors could clarify how their work improves on prior works.
Claims And Evidence
Yes
Methods And Evaluation Criteria
Yes
Theoretical Claims
I read the technical overview, but did not carefully check the proofs.
Experimental Design And Analyses
There are no experiments in the paper.
Supplementary Material
Please see question about "proofs" above.
Relation To Broader Scientific Literature
The authors do a good job of discussing which previous techniques they build on.
However, as far as I can tell, the authors do not provide much comparison to prior related works. It would be good if the authors could clarify how their work improves on prior works, or point to where the improvement over prior works is discussed in the paper.
Essential References Not Discussed
Please see "Relation To Broader Scientific Literature" question above.
Other Strengths And Weaknesses
I think the strength of the paper is in the techniques, which are well explained in the technical overview. However, the comparison to prior works can be improved. Specifically, it would be good if the authors could clarify how their work improves on prior works.
Other Comments Or Suggestions
N/A
We thank the reviewer for their question about prior work, which we address below:
- Prior works using differential privacy for adaptivity focus on numerical estimates [Hassidim et al., 2022; Beimel et al., 2022; Song et al., 2023; Cherapanamjeri et al., 2023]. Our work instead tackles the search problem of returning a point using an adaptive DP data structure. While differentially private heavy hitter algorithms (e.g., [Chadha et al., 2023]) relate to private search, they lack adaptivity. To our knowledge, this is the first work combining DP with adaptive search.
- Because of this, prior adaptive search methods either (a) build data structures (one per query), or (b) use a net argument over the space, requiring structures.
Below we compare our method to these baselines for ANN and regression. We omit for clarity.
Approximate Nearest Neighbor (ANN)
| Method | Space | Amortized Prep Time | Query Time | Update Time |
|---|---|---|---|---|
| copies | ||||
| copies | ||||
| Ours |
For adaptive queries/updates, the exact method scans the dataset. If , we outperform -copies when ; if , we outperform -copies when .
Regression
Let and , with the goal of computing minimizing up to . Each step updates , via ; bounds 's condition number.
| Method | Space | Amortized Prep Time | Query Time | Update Time |
|---|---|---|---|---|
| copies | ||||
| copies | ||||
| Ours |
We outperform -copies when , and -copies when —conditions often met in practice.
Regression with Sparse Label Updates
If only is updated and each changes at most entries, the dependence can be reduced to polylogarithmic.
| Method | Space | Amortized Prep Time | Query Time | Update Time |
|---|---|---|---|---|
| copies | ||||
| copies | ||||
| Ours |
Only copies are needed. Our method is faster when and , or when and . Since is user-chosen (rebuilding occurs every queries), trade-offs can be tuned (see Claims D.16–D.17).
Summary:
Our method leverages DP with advanced composition, achieving scaling and improved space, preprocessing, query, and update complexity. These advantages are especially notable when is large, or when sparsity and condition number are favorable. We will ensure these comparisons are clearly included in the revision.
The paper addresses the problem of hiding the internal randomness of data structures using differential privacy. Specifically, the authors focus on nearest neighbor search and regression, and aim to protect the internal randomness against an adaptive adversary using differentially private techniques.
Update after rebuttal
The author response provided during the rebuttal clearly addressed my questions regarding the interaction between the privacy mechanism and the underlying data structure, and the privacy loss accumulation. I have raised my rating.
Questions To Authors
See above
Claims And Evidence
A sequence of theorems presented in Section 1.1 clearly demonstrates improvements over existing algorithms. They provide probabilistic utility guarantees as well as time and memory complexity, but it is not immediately clear how differential privacy interacts with these bounds. While I understand that characterizing the interplay between data structure and differential privacy is not the main focus, it could be of interest to a broader audience.
Methods And Evaluation Criteria
- Section 2 explains how the search problem can be formulated as differentially private selection, which makes complete sense. The authors use the standard Laplace mechanism to achieve pure -DP. I wonder if the Gaussian mechanism can be used instead to achieve -DP. In that case, how does it affect the guarantees stated in the theorems in Section 1?
- Another aspect not discussed in the paper is the accumulation of privacy loss. In differential privacy, the total privacy loss accumulates as the mechanism releases private answers to successive queries, which means the mechanism must shut down the database and stop answering after a certain number of queries. However, it is unclear from the paper whether there is a limit on the number of queries the underlying data structure can handle.
Theoretical Claims
I didn’t thoroughly check the correctness of the proofs.
Experimental Design And Analyses
This is a theory paper that doesn't contain experimental results.
Supplementary Material
No supplementary materials are available.
Relation To Broader Scientific Literature
This paper develops efficient data structures that protect their internal randomness from adaptive adversaries, offering a secure and private foundation for machine learning tasks.
Essential References Not Discussed
I don't find any missing related work.
Other Strengths And Weaknesses
- Strength
- The paper presents a sequence of theorems that describe improvements over existing results.
- The authors provide interpretations of theorems.
- Weakness
- This may not be a weakness, but rather a comment. While the paper is clear about the algorithmic properties of the underlying data structures, e.g., how their time and storage complexities scale with input size, it doesn't reveal much about the differential privacy aspects, such as composition and privacy-loss accumulation.
- I believe that including some figures or diagrams would have been helpful for better understanding the proposed technique.
Other Comments Or Suggestions
None
We thank you for your very helpful comments. We would also like to address your questions and comments regarding differential privacy.
Let us start with a high-level overview of the framework, first introduced in [Hassidim et al., 2022; Beimel et al., 2022]. The rough idea is to treat the random seeds used by data structures as the database for which we want to preserve privacy. In the simplest setting, to achieve a target privacy level , one might need to aggregate the responses of data structures in a differentially private manner. However, by the advanced composition theorem, the same level of privacy in terms of can be achieved using only independent outputs—at the cost of moving from pure DP to approximate DP with . As long as is small, the approximation is acceptable.
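The square-root-type saving from advanced composition can be checked numerically. Below is a minimal sketch using the standard advanced composition bound of Dwork, Rothblum, and Vadhan; all numbers are illustrative:

```python
import math

def basic_composition(eps, k):
    # k-fold basic composition: privacy loss grows linearly in k.
    return k * eps

def advanced_composition(eps, k, delta_prime):
    # Advanced composition: the k-fold composition of eps-DP mechanisms is
    # (eps_total, delta_prime)-DP with
    #   eps_total = sqrt(2 k ln(1/delta')) * eps + k * eps * (e^eps - 1),
    # i.e., roughly sqrt(k) * eps for small eps instead of k * eps.
    return (math.sqrt(2 * k * math.log(1 / delta_prime)) * eps
            + k * eps * (math.exp(eps) - 1))

k, eps, delta_prime = 10_000, 0.01, 1e-6
print(basic_composition(eps, k))                   # linear blow-up: 100.0
print(advanced_composition(eps, k, delta_prime))   # roughly sqrt(k) scaling
```

The price of the improvement is exactly the move described above: the composed guarantee is approximate DP with the chosen delta_prime rather than pure DP.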
Responses to Specific Questions
Q: Can the Gaussian mechanism be used for -DP? In that case, how does it affect the final guarantees?
A: Yes, the Gaussian mechanism is valid for achieving -DP, provided that is not too large. Suppose we want the data structure to succeed with probability at least . Let denote the total privacy loss, composed of:
- : from the Gaussian mechanism
- : from advanced composition
Then:
In our original setting, we set using an appropriate . To use the Gaussian mechanism instead, we may set and reduce to . This still yields , matching the Laplace mechanism’s privacy loss.
We chose the Laplace mechanism for conceptual simplicity and because the task is essentially a counting problem, where Laplace noise is more natural.
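For concreteness, the two standard noise calibrations for a sensitivity-1 counting query can be put side by side (an illustrative sketch, not code from the paper; the Gaussian calibration shown is the usual sqrt(2 ln(1.25/delta))/eps rule for approximate DP):

```python
import math
import numpy as np

def laplace_count(true_count, eps, rng):
    # Laplace mechanism for a sensitivity-1 counting query:
    # noise scale 1/eps yields pure eps-DP.
    return true_count + rng.laplace(scale=1.0 / eps)

def gaussian_count(true_count, eps, delta, rng):
    # Gaussian mechanism for the same query:
    # sigma = sqrt(2 ln(1.25/delta)) / eps yields (eps, delta)-DP
    # (approximate DP only, and the calibration assumes eps < 1).
    sigma = math.sqrt(2 * math.log(1.25 / delta)) / eps
    return true_count + rng.normal(scale=sigma)

rng = np.random.default_rng(0)
print(laplace_count(42, eps=0.5, rng=rng))
print(gaussian_count(42, eps=0.5, delta=1e-6, rng=rng))
```

Both noisy answers concentrate around the true count; the Laplace version keeps a pure-DP guarantee, which is why we default to it for counting-style subroutines.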
Q: In DP, privacy loss accumulates with repeated queries, so mechanisms are often shut down after a limit. Does your adaptive data structure exhibit this behavior?
A: Yes, the adaptive data structure’s ability to handle queries hinges on privacy composition. Using the advanced composition theorem, we guarantee -DP across interactions by using resources (e.g., data structure copies or random seeds), rather than .
However, this requires to be fixed in advance. For indefinite querying, a standard approach is to rebuild the data structure periodically after every queries. This permits optimization of to balance space, amortized preprocessing time, and query/update efficiency.
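The rebuild-every-so-many-queries pattern can be sketched as follows; the class names and the toy data structure are hypothetical, purely to show the control flow:

```python
import itertools

_fresh = itertools.count()

class ToyDS:
    """Stand-in data structure; `ident` models fresh internal randomness."""
    def __init__(self):
        self.ident = next(_fresh)
    def answer(self, q):
        return (self.ident, q)

class RebuildingStructure:
    """Answer at most `budget` queries per epoch, then rebuild with fresh
    randomness, which resets the privacy budget for the next epoch."""
    def __init__(self, build, budget):
        self.build = build      # callable returning a fresh data structure
        self.budget = budget    # queries allowed per epoch
        self._reset()
    def _reset(self):
        self.ds = self.build()
        self.remaining = self.budget
    def query(self, q):
        if self.remaining == 0:
            self._reset()       # amortized rebuild after `budget` queries
        self.remaining -= 1
        return self.ds.answer(q)

ds = RebuildingStructure(ToyDS, budget=3)
epochs = [ds.query(i)[0] for i in range(7)]
print(epochs)  # first three queries served by epoch 0, next three by epoch 1, ...
```

Choosing a larger budget spends the rebuild cost less often but requires more copies per epoch, which is exactly the trade-off the amortized bounds balance.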
Additional Comments
It would be better to discuss differential privacy in more depth, such as composition and privacy-loss accumulation.
A: Thank you for pointing this out. We will ensure that the final version includes a more detailed discussion on differential privacy, particularly covering composition theorems and cumulative privacy loss (please refer to our high-level overview and answers regarding loss accumulation above).
Including figures and diagrams would be helpful for explaining the techniques.
A: We appreciate the suggestion. We agree that visualizations—especially for illustrating how the sparse argmax mechanism is applied in our ANN algorithm—would enhance clarity and intuitively explain why our algorithm is both correct and efficient. We will include appropriate figures and diagrams to better illustrate the framework in the revision.
The paper explores using differential privacy for answering adaptively chosen queries (potentially chosen by an adversary) for search problems. The authors consider approximate nearest neighbor search and regression as the main problems in this work. The main contributions show that, under reasonable assumptions, it is possible to maintain copies of the data structure to answer queries for both problems, matching the bounds from previous work that uses DP for adversarial streaming. For the ANN problem, they reduce the problem to private selection and provide a new mechanism that obtains a better runtime for sparse vectors. For the regression problem with updates to both the design matrix and the response vector, the approach involves maintaining and updating multiple sketch matrices, solving to generate multiple solution vectors, and using a DP median to generate the final solution. The authors also show applications to online weighted matching and computing terminal embeddings.
Questions To Authors
No additional questions.
Claims And Evidence
All the claims made in paper are accompanied with proofs. Reasoning and justification for the assumptions made are also provided.
Methods And Evaluation Criteria
The paper does not have experiments.
Theoretical Claims
The proofs are correct to the best of my knowledge. The contributions give a sublinear dependence on for the time/space complexity of answering queries. The baseline comparisons are the naive or existing bounds of and . The accuracy guarantees follow from the success probability of the data structure or the generalization guarantee for DP algorithms.
Experimental Design And Analyses
This is a theoretical work, no experiments are provided.
Supplementary Material
The proofs in appendix are correct to the best of my knowledge.
Relation To Broader Scientific Literature
This work contributes to a line of prior work that explores the connection between differential privacy and robustness to adaptivity. The paper makes a novel contribution to this field by giving approaches that work for search problems.
Essential References Not Discussed
Not to the extent of my knowledge.
Other Strengths And Weaknesses
The sparse argmax mechanism is a nice contribution and might be of interest to the broader DP community. For the regression problem, the dependence on is quadratic, which might be suboptimal; in addition, the suggested alternative approach obtains a linear dependence on .
Other Comments Or Suggestions
No additional comments.
We thank you for your very helpful comments, and we would like to address your comments on the quadratic dependence on and the linear dependence on for the alternative approach.
For the dependence, the main reason is that by using the guarantee together with the standard relative error guarantee of sketched regression, we can show that the regression cost of the private median estimator satisfies . Hence, to account for the blowup in condition number, we need to scale down the approximation factor by . Since the sketching dimension depends inverse quadratically on the approximation factor, we have a dependence in the runtime. One could argue that known sketching-based methods based on the following framework need to incur such a dependence: our algorithm estimates each coordinate to high precision, and this inevitably incurs a dependence on the condition number for the error, and the sketching dimension in turn depends quadratically on the condition number. When there are no updates to , we could use an iterative method instead to remove the polynomial condition number dependence. It is an interesting open question to develop an iterative algorithm that can remove the dependence on the condition number while also supporting updates to . Therefore, while the dependence is a consequence of our current sketching approach for worst-case guarantees, we believe the overall framework offers significant advantages, and reducing this dependence under updates is an important avenue for future work.
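As a toy check of the relationship between sketching dimension and approximation quality discussed above (this is classical sketch-and-solve with a Gaussian sketch, not our algorithm), note that halving the target approximation factor roughly quadruples the required sketch size when the dimension depends inverse-quadratically on the factor:

```python
import numpy as np

def sketched_cost_ratio(A, b, m, rng):
    """Ratio of the sketched solution's regression cost to the optimal
    cost, for one Gaussian sketch with m rows (illustrative only)."""
    S = rng.standard_normal((m, A.shape[0])) / np.sqrt(m)
    x_sk = np.linalg.lstsq(S @ A, S @ b, rcond=None)[0]
    x_opt = np.linalg.lstsq(A, b, rcond=None)[0]
    return np.linalg.norm(A @ x_sk - b) / np.linalg.norm(A @ x_opt - b)

rng = np.random.default_rng(0)
n, d = 4000, 20
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# Under an inverse-quadratic dependence of the sketch size on the
# approximation factor, halving eps means quadrupling m.
ratios = []
for eps, m in [(0.5, 80), (0.25, 320)]:
    r = sketched_cost_ratio(A, b, m, rng)
    ratios.append(r)
    print(eps, m, round(r, 3))
```

The cost ratio is always at least 1 (the unsketched solution is optimal) and shrinks toward 1 as the sketch grows, which is the blowup that the condition-number scaling in our analysis has to pay for.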
The alternative method based on the bounded path technique can be treated as a refinement to the method of using a fresh sketch for each update and query. Specifically, the algorithm depends on , and can be naively upper bounded by , making it less appealing. However, we note that is a more fine-grained parameter that could capture the structure of the input, e.g., if only entry changes between queries are allowed, then the upper bound reduces to . If the updates are infrequent, e.g., only updates are performed, then the bound becomes . Hence, while the alternative is no more efficient in the most general sense, it provides more fine-grained control and a speedup in situations for which the instance has extra structure.
The paper studies the problem of adaptive algorithms: algorithms where an adversary interacts with a randomized algorithm with the goal of increasing its error probability.
It is known that differential privacy can be used to design such algorithms, but the existing applications were for algorithms computing numeric results; this paper generalizes the approach to vector-valued algorithms. As a result, it builds an algorithm for the nearest neighbor problem with better running and preprocessing times than previously known.
Questions To Authors
N/A
Claims And Evidence
The evidence sufficiently supports the claims.
Methods And Evaluation Criteria
Methods and evaluations are appropriate for the problem at hand.
Theoretical Claims
The proofs presented in the paper are correct.
Experimental Design And Analyses
There are no experiments in the paper.
Supplementary Material
I haven't reviewed supplementary material.
Relation To Broader Scientific Literature
For me, personally, the idea of using differential privacy for designing adaptive algorithms is one of the most exciting ideas of the last couple of years: it connects fields that at first glance have almost nothing in common. While this is not the first paper in this direction, it generalizes the idea to a much wider class of problems.
Essential References Not Discussed
I don’t think any specific reference is missing.
Other Strengths And Weaknesses
N/A
Other Comments Or Suggestions
- Line 40: It would be better to rephrase the sentence about queries; it is a bit confusing in its current form.
- Line 94: The assumption doesn't say anything about neighbours; it only talks about cr balls. Perhaps the explanation needs to be rewritten.
- Line 258: It should be argmax \sum b_{(i), j}; the j index is missing in the paper.
We appreciate your encouraging and helpful comments. Regarding your specific points:
- Line 40: We will clarify this sentence. Our intention is to express both ANN and regression in a unified way, and we will revise it to clearly explain the query and update models in each problem.
- Line 94: The assumption states that the predicate function is sparse, and we define the predicate as the intersection of the dataset and the -ball around ; that is, all points in within distance of the query. We acknowledge that the current formulation does not make this connection explicit and will revise it for clarity.
- Line 258: We will fix the missing index .
The paper extends the literature on using differential privacy as a tool to design randomized data structures that maintain their correctness properties against adaptive (rather than oblivious) adversaries. The main contribution is the development of techniques that allow solving search problems under adaptive queries, in particular approximate near neighbor search. The paper addresses important problems, and introduces interesting techniques. The reviewer consensus is that the paper would be a strong contribution to the ICML program.