Differentially Private One Permutation Hashing
Abstract
Reviews and Discussion
This paper explores differentially private one permutation hashing (DP-OPH) methods and develops three variants based on the densification procedure of OPH. The paper conducts privacy and utility analyses for these algorithms and demonstrates that DP-OPH outperforms the DP MinHash alternative previously suggested for hashing Jaccard similarity across various privacy settings.
Strengths
[S1] The motivation of this paper is clear: to protect the privacy of one permutation hashing (OPH) so that it satisfies the definition of differential privacy.
[S2] The paper is easy to read, with the necessary background and preliminaries introduced.
[S3] The paper provides extensive theoretical analysis with formal proofs, as well as fair experiments with clear results.
Weaknesses
[W1] The technical depth and contribution are limited. The paper applies the randomized response technique to randomize the output, which is a common idea that has been studied in various previous works. Although this combination successfully yields a differentially private OPH, as claimed by the authors, the technical challenges and depth are not clearly presented to highlight the contribution.
[W2] The assumption that “each data vector contains at least f non-zeros” in Sec 3.2 deserves a more serious discussion. The number of non-zeros should obey some probability distribution. Although the probability of violating the assumption might be negligible if f is set properly, one should quantitatively bound this probability, or provide a reference, to make the assumption rigorous. In addition, the authors claim in Sec 3.2 that K is usually around hundreds; this claim should also be supported with a reference.
[W3] The language is sub-standard and requires further polishing. Some conceptual information is missing. For example, the definitions and procedures of SortedIndex and MapToIndex in Alg. 3 are not introduced, leaving readers confused about the procedure of Alg. 3.
Questions
[Q1] The authors could consider providing more discussion, including but not limited to the limitations and potential applications of the proposed schemes.
[Q2] This paper could benefit from language improvements.
Dear Reviewer uYfv,
We appreciate your review and feedback.
-
Our paper proposes the first comprehensive study on generating private sketches from the widely used MinHash-type algorithms (DP-OPH and DP-MH), which fills a gap in the literature. This type of algorithm is a natural application of the randomized response method because the hash values are finite bits. While randomized response is a known technique in DP, deriving the algorithms and flipping probabilities requires careful and rigorous design and calculation. Developing the theoretical analysis of the utility (mean and variance) also required non-trivial effort.
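To illustrate the flavor of the mechanism described above, here is a minimal sketch of randomized response applied to b-bit hash values, with the total budget split naively across the K hashes (the function names and the even budget split are our illustration, not the paper's exact derivation or flipping probabilities):

```python
import math
import random

def randomized_response(h, b, eps):
    """Privatize one b-bit hash value via b-bit randomized response:
    keep the true value with probability e^eps / (e^eps + 2^b - 1),
    otherwise output one of the other 2^b - 1 values uniformly."""
    domain = 1 << b
    p_keep = math.exp(eps) / (math.exp(eps) + domain - 1)
    if random.random() < p_keep:
        return h
    other = random.randrange(domain - 1)
    return other if other < h else other + 1  # skip the true value

def privatize_sketch(hashes, b, eps_total):
    # Naive composition: split the total budget evenly over the K hash
    # values. (The paper derives tighter bounds than basic composition;
    # this only illustrates the mechanism.)
    eps_per_hash = eps_total / len(hashes)
    return [randomized_response(h, b, eps_per_hash) for h in hashes]
```

Because each privatized hash value stays in the same finite domain as the original, the noisy sketch can be used directly in place of the non-private one for similarity estimation.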
-
The lower bound f on the number of non-zeros is assumed to define the data domain of DP in a general sense and for the convenience of theoretical analysis. In the case of private data release, where the task is to release the hash values of a set of data vectors, the lower bound is natural (f is simply the smallest number of non-zeros in the dataset).
-
Thanks for the suggestion. SortedIndex and MapToIndex are helper functions used to define a within-bin random permutation for an empty bin. SortedIndex returns the original indices of a sorted array, and MapToIndex defines a mapping that effectively applies another within-bin permutation using the partial ordering returned by SortedIndex. Finally, the hash value is set as the minimal index of a ``1'' among the additionally permuted elements in the bin.
We will add a worked example in the appendix to better explain the algorithm. Thanks again for your suggestion.
This paper introduces and analyzes a few algorithms for differentially private binary sketches. The proposed setting features binary databases of size , and the goal is to compute Jaccard similarity between databases privately using sketches of size . The paper's proposed algorithms are primarily variants of one permutation hashing. One variant does not use densification (rand), one uses fixed densification (fix), and one uses re-randomized densification (re). The paper also introduces a private baseline based on minhash and provides both formal and empirical utility results demonstrating that, roughly, re dominates fix dominates minhash. Finally, empirical results show that rand in turn dominates re unless or larger.
Strengths
Private sketching is a natural problem, and the genomics example is well-motivated. Studying private versions of oph and minhash is also a natural approach, so even somewhat independently of whether they produce good algorithms or not, it's useful to have a study written down.
Weaknesses
I think there is a potentially decent but smaller version of the paper contained in the current submission. Some specific comments:
-
I found the empirical results in Figures 3 and 4 surprising. After all the effort to set up and analyze fix and re, the simpler (and pure DP) rand clearly obtains the highest precision outside of extremely large values of epsilon. Even when epsilon is large, the gap between rand and re is small. Given the effort the authors have expended on fix and re, I think it is valid to include them somewhere in the paper, but I suggest that they make more sense discussed in the Appendix as natural solutions that turn out to be less useful. Perhaps the authors think it's valuable to include fix and re because they have formal utility results that can be compared to minhash? If so, I understand this argument, but I still think the gap in performance is large enough that most or all of the focus should be on rand.
-
The lower bound required for fix and re is another disadvantage relative to rand. Also, how does "filtering" for (Line 418) affect the privacy guarantee -- what does the algorithm do if given a data point that fails the lower bound?
-
The introduction highlights that, unlike some past work, this approach does not assume private hash function randomness. I get that this is more flexible, but I don't understand why it's so meaningful. These algorithms still require that the bit flip randomness is kept private -- why is the hash function assumption so different?
-
A utility analysis for rand would help complete the paper structure suggested above. Have the authors attempted this, or do they have reason to believe it's hard?
Questions
(See Weaknesses above.)
Dear Reviewer F2fi,
Thanks for your valuable feedback and for recognizing our contributions. We also believe that private MinHash and OPH methods are important, with many useful applications in practice.
-
Thanks for the discussion. Since such a study is still missing from the literature, as the first paper on private OPH, we tried to provide a comprehensive study of the different variants of OPH and show how to privatize each of them. Moreover, the densified OPH methods (fix and re) lead to theoretically more rigorous approaches (we discuss this further in the reply to Q3). Practitioners can choose which method to apply, depending on the use case and privacy requirements. We appreciate your suggestions; indeed, the very good performance of DP-OPH-rand in the high-privacy region makes it a strong variant. We will add more discussion of this method in the paper.
-
In our paper, we assumed a lower bound f on the number of non-zero elements for theoretical purposes. Essentially, f defines the general data domain in which our DP algorithms rigorously protect the privacy of any vector. Under this assumption, the same algorithm (with the same parameters) is applied to every vector in the domain. This setup is more general and convenient for theoretical analysis.
In practice, if we have privatized a set of vectors (a database) with the DP algorithm parameterized by f, adding vectors whose number of non-zeros is smaller than f will not affect the privacy of the data that has already been privatized. We can treat the new vectors as another database and apply DP-OPH with the f value associated with it.
-
The two sources of randomness in the algorithm differ in that the hash functions must be stored in some way, because we must apply the same set of hash functions to all the vectors. The randomness from DP flipping is independent for every data vector and every bit. In some sense, this randomness is “one-time”, so the risk of leaking it is small. The hash functions, on the other hand, need to be kept, and are thus exposed to a higher risk of leakage. In the setup of our paper, this risk is resolved because we already assume that the hash functions can be public. As you kindly mentioned, this setup is much more flexible and practical, and constitutes a much stronger privacy setup in real-world applications.
-
As we mentioned in our answer to Q1, densified OPH (fix and re) and DP-MH are theoretically more rigorous since we can have a unified-form unbiased estimator as given in Theorem 3.5. The essential reason is that for the non-DP versions of these three methods, every hash value gives an unbiased estimate of the Jaccard similarity. This does not hold for DP-OPH-rand because of the random bits assigned to empty bins. For DP-OPH-rand, it is possible to derive an unbiased estimator specifically for each pair of data vectors, depending on the statistics of both vectors, but it is impossible to obtain a universally unbiased estimator as in Theorem 3.5. Therefore, we feel it is less interesting/meaningful to present the calculations for DP-OPH-rand. If suggested, we are happy to add a remark on this point in the paper.
Again, we sincerely appreciate your efforts in reviewing our paper and the nice suggestions.
This paper studies the differentially private hashing and sketching, for the task of computing Jaccard similarity index. It starts by reviewing several classical, non-private hashing schemes for this task, including MinHash and One Permutation Hashing (OPH) from prior works.
It then designs new privatized versions of these hash schemes, including DP-OPH-re, DP-OPH-fix, and DP-OPH-rand. The paper proves that these algorithms satisfy either pure DP or approximate DP, and conducts a number of experiments comparing all these methods. The general conclusion seems to be that DP-OPH-re is the most effective approach.
Strengths
- Designing differentially private sketching/hashing schemes is a well-motivated and important problem in differential privacy and massive data processing. This paper studies one of the most fundamental problems in this direction.
- The algorithms are clearly explained and easy to follow.
Weaknesses
- The main privatization technique used in this paper seems to be the randomized response mechanism. It is hard to believe that this can be effective.
- To privatize the OPH-re/fix schemes, this paper proposes to consider the relaxation to approximate privacy to avoid paying a huge price in "epsilon". I have some concerns about the choice of delta here.
- The algorithms proposed in this paper try to protect the privacy of a single bit in a binary feature vector. In a more realistic setting, though, I would imagine the whole vector being the feature vector of a client, and it must be protected as a whole ---- I understand this sounds very hard. But the choice of the weaker, bit-level notion needs to be justified.
- The theoretical discussion in the introduction is almost impossible to follow -- I would suggest emphasizing the asymptotic behavior and deferring equations to the appendix when possible.
- Some issues with the experiments need to be clarified (see below)
Questions
- I think the standard setup of delta in DP theory/practice is to make it "cryptographically" small (smaller than the inverse dataset size, at least). It seems you ran experiments within a limited regime of delta. What happens if you scale it further?
- Can you comment on your definition of "adjacent inputs" in defining privacy? In other words, how to interpret the implication of protecting a single bit in the feature vector?
- Could you clarify the interplay between K and the variance of your hashing scheme? In particular, I think the quality of DP-OPH would degrade significantly once K goes moderately large. On the other hand, a smaller K makes the hashing itself less accurate. Can you compare your results with non-private algorithms? I apologize if these are written in the text (and would appreciate a pointer if so).
- What is your definition of "precision" in the experiment? Also, why not include a non-private algorithm for comparison?
Details of Ethics Concerns
No ethics concerns, as far as I can tell.
Dear Reviewer vVL9,
Thanks for your review of our paper.
-
The delta we presented is a common choice in many DP papers. We also experimented with other values, and the comparisons are similar. As you kindly mentioned, in some papers delta is interpreted/set as the reciprocal of the data size. In our experiments, our delta is actually smaller than the reciprocal of the data sizes, making it a reasonable choice.
-
As we discussed in the paper with the example of genomic data, each bit in a binary vector could be the indicator of a gene, which is highly sensitive. In another scenario, online advertising, each bit might encode sensitive information about an individual, such as age, job, education, or marital status. Our algorithms prevent the identification of such features in the vector (for each individual).
The attribute-level privacy protection is standard in DP hashing methods, as we cited extensively in the paper. Since these hashes are signatures of the data vectors, this privacy setup is different from protecting the summary statistics. Protecting the privacy of the original data is in general harder. Our paper proposes privatized algorithms for minhash type methods which are very popular hashing algorithms. We believe this is a good contribution to the community.
- In our plots, the horizontal dotted lines are the results for non-private OPH-re method.
Yes, as you mentioned, both b and K involve trade-offs and have an ‘optimal’ value.
b: a larger b leads to a larger flipping probability, which degrades the performance, while a smaller b makes MinHash and OPH themselves less accurate. Thus, we should choose a ‘moderate’ b; through our experiments, we found a small range of b values to yield the best results.
K: a larger K leaves less privacy budget for each hash and thus more noise, while a smaller K makes MinHash and OPH themselves less accurate. Hence, similar to b, we should choose a ‘moderate’ K as well. This choice also depends on the dataset, e.g., its dimensionality, sparsity, etc. For high-dimensional data that are not extremely sparse (a common application case of MinHash), K on the order of hundreds is recommended.
- Precision is the standard metric in similarity search. For each query, we have its N ground-truth neighbors. If we retrieve k points for the query, precision is computed as (# of true neighbors among the k retrieved points)/k, which measures how many retrieved points are indeed neighbors. As we answered in Q3, the horizontal dotted lines are the results for the non-private OPH-re method.
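The precision computation described above can be sketched as follows (a hypothetical helper for illustration, not code from the paper):

```python
def precision_at_k(retrieved_ids, true_neighbor_ids, k):
    """Precision = (# of true neighbors among the k retrieved) / k."""
    truth = set(true_neighbor_ids)
    return sum(1 for pid in retrieved_ids[:k] if pid in truth) / k
```

For example, if two of the four retrieved points are among the query's ground-truth neighbors, precision at k = 4 is 0.5.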
Thanks again for your feedback and suggestions.
Thank you for your clarification. They do help me understand the paper better (and I am sorry for missing some key texts in your paper before).
However, let me clarify some of my points:
- I think in theory the overhead on is roughly (or ) when gets smaller. At least, this is what it is when you do not make assumptions about . I do think this overhead is pretty noticeable once goes smaller. However, as mentioned by Reviewer F2fi, one might as well use the pure-DP version of your algorithm, and this is my least concern.
- There are a couple of reasons that I raise criticism against randomized response. (1) I think of K as a growing parameter. The probability of flipping a hash quickly grows as K increases. I saw you ran experiments up to a certain K. Even with a small constant b, your scheme needs an epsilon far larger than usual to compete with the non-private baseline. For a non-private baseline, however, I can choose a standard K. I was wondering if you have compared with non-private baselines at other values of K. Without such a comparison, I don't think the experiments were fairly done. (Also, I think the dotted line deserves a one-sentence explanation in the main text rather than only appearing in the caption.)
- I acknowledge that there is research on attribute-level privacy protection. However, I think this makes the "privacy" less meaningful. Take online advertising as an example; it is not the case that all attributes of an individual are independent. My shopping/youtubing history may well reveal sensitive information such as my gender/age/occupation, etc. Hence, "attribute-level" privacy protection does not prevent an adversary from learning, say, my age from my shopping history. This is also connected to my criticism of randomized response: there seems to be no hope of extending your algorithms to achieve individual-level privacy.
Overall, my evaluation remains the same: the problem of study is indeed important, but the approach taken in this paper does not seem to be scalable. The experiments also meet my expectations: even with a somewhat unfair setup, the accuracy is comparable to the non-private baseline only when epsilon grows large. I do appreciate the comprehensive work done in this research and see the value in the paper. But, as also raised by other reviewers, the presentation needs significant improvement.
This paper studies the problem of similarity search based on the measure of Jaccard similarity, under the setting of differential privacy.
Jaccard similarity between two "documents" is defined as follows:
- Documents are represented as binary vectors , e.g. these can be the indicator vectors of the bag of words.
- The Jaccard similarity measures the ratio of the sizes of the intersection and the union of the two documents.
Given a collection of documents and a query document, we would like to recover the top documents that have the highest Jaccard similarity with the query.
MinHash is a celebrated technique for constructing a sketch of the given documents in order to efficiently estimate the Jaccard similarity against any given query document.
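The estimation idea can be sketched as follows (an illustrative implementation of classical MinHash with explicitly stored permutations; not the paper's b-bit or OPH variants, and the function names are ours):

```python
import random

def jaccard(a, b):
    """Exact Jaccard similarity of two sets of word/feature ids."""
    return len(a & b) / len(a | b)

def make_perms(d, K, seed=0):
    """K independent random permutations of the universe [0, d)."""
    rng = random.Random(seed)
    perms = []
    for _ in range(K):
        p = list(range(d))
        rng.shuffle(p)
        perms.append(p)
    return perms

def minhash_sketch(s, perms):
    """One hash per permutation: the minimum permuted rank in the set."""
    return [min(p[x] for x in s) for p in perms]

def estimate_jaccard(sk_a, sk_b):
    """The sketches collide in each coordinate with probability equal to
    the Jaccard similarity, so the collision rate estimates it."""
    return sum(x == y for x, y in zip(sk_a, sk_b)) / len(sk_a)
```

The same permutations must be shared across all documents; only the short sketches need to be compared at query time.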
The goal in the paper is to construct differentially private versions of this sketching technique, where the notion of differential privacy holds against adding / removing one word from any given document.
The paper considers a b-bit version of MinHash, as well as methods involving "re-randomization" (these were introduced in prior work). These sketches are made differentially private by applying the randomized response mechanism.
A privacy analysis is performed that gives tighter bounds than basic composition, and experimental results are provided on practical datasets (the Leukemia gene expression dataset and the Webspam dataset), showing that the DP versions of the MinHash variants with re-randomization achieve better precision.
Strengths
The paper introduces DP variants of MinHash sketching technique and provides a detailed analysis.
Weaknesses
I felt the presentation in the paper is quite poor. I had to re-read several parts of the paper, and I still don't fully understand many of the details. (See some of my questions below.)
Presentation aspects aside, I am also confused about the motivation or potential application of these methods. In the experiments, precision values are reported, by which I think the authors mean using the DP hashes to recover the documents with the highest similarity. The evaluation of precision itself is not a private operation, though, as it requires comparing against the true Jaccard similarity. I understand that this is done just to demonstrate the utility of the approach, so it is okay that computing precision is non-private. But how do we expect to use these DP methods in any case? Sure, we can compute the approximate Jaccard similarity of a query against documents in a database, but if we then actually retrieve the most similar document and look at its contents, it is not private anymore.
Questions
The description of OPH-re is confusing to me. I read it several times and I still don't understand what the method is doing. In particular, I did not get what several of the symbols mean. It might help to work through an example to explain the algorithm.
In DP-OPH-rand (Algorithm 5), is there some typo on Line 4 in that should the be ?
Is it possible to simplify Theorem 3.6 in some way to highlight the main take-away as opposed to have detailed calculations within the theorem statement that take half a page to even state?
Dear Reviewer zV5S,
We appreciate your review of our paper.
About the motivation and application: the similarity search experiments serve two purposes. 1. They evaluate the utility of the proposed algorithms in terms of similarity preservation, as you kindly pointed out. 2. Similarity search is itself an important industrial application of MinHash-type algorithms. The goal of search is, in general, to find data points similar to a query point, so the task is to return the IDs, not the raw data. With our algorithms, the original raw data is never published: the private hashes produced by our methods are signatures of the original data vectors that preserve the similarity information, and only these hashes are used for search and ML model training tasks. In the example we gave in the paper, the bioinformatics community releases MinHash sketches of a large gene dataset on a regular basis for search and ML; only the hash values of each vector are released to the public, and the raw data is never published. However, those hashes are not themselves private. Our algorithms protect the privacy of those hashes in a rigorous DP manner. On the other hand, if the intent is to look at the raw data after retrieval, then there is no way to protect the data privacy. The motivation of our paper (and, in fact, of all DP data protection papers) is to randomize/transform the data in a private way while preserving utility for search and learning.
Q1: SortedIndex and MapToIndex are helper functions used to define a within-bin random permutation for an empty bin. SortedIndex returns the original indices of a sorted array, and MapToIndex defines a mapping that effectively applies another within-bin permutation using the partial ordering returned by SortedIndex. The hash value is then set as the minimal index of a ``1'' among the additionally permuted elements in the bin.
Essentially, for an empty bin, re-randomized densification picks a non-empty bin at random, copies its data, and applies the empty bin's own within-bin permutation to the copied data to generate a hash value. We will add a worked example in the appendix as suggested. Thank you.
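The densification procedure described above can be sketched as follows (our own simplified rendering, not the paper's Algorithm 3; the seeding scheme and helper structure are assumptions made for illustration):

```python
import random

def oph_re(nonzero_ids, d, K, seed=0):
    """One permutation hashing with re-randomized densification (sketch).

    A single global permutation of [0, d) splits the coordinates into K
    equal bins; each non-empty bin's hash is the minimum permuted offset
    of a 1 inside it. An empty bin j borrows the contents of a non-empty
    bin (chosen by bin j's own randomness) and re-permutes them with a
    fresh within-bin permutation before taking the minimum."""
    rng = random.Random(seed)
    perm = list(range(d))
    rng.shuffle(perm)
    pos = {v: i for i, v in enumerate(perm)}
    bin_size = d // K
    bins = [[] for _ in range(K)]
    for idx in nonzero_ids:
        p = pos[idx]
        bins[p // bin_size].append(p % bin_size)  # within-bin offset
    hashes = []
    for j in range(K):
        if bins[j]:
            hashes.append(min(bins[j]))
        else:
            # Densification randomness keyed on (seed, j) so that all
            # vectors use the same per-bin choices.
            rnd = random.Random(seed * 1_000_003 + j)
            order = list(range(K))
            rnd.shuffle(order)
            src = next(b for b in order if bins[b])  # first non-empty bin
            within = list(range(bin_size))
            rnd.shuffle(within)  # re-randomization step
            hashes.append(min(within[p] for p in bins[src]))
    return hashes
```

With the same seed (i.e., the same public permutation) applied to every vector, collisions between two vectors' sketches estimate their Jaccard similarity; the DP variants would then apply randomized response to these K values.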
Q2. In Algorithm 5, the quantity on Line 4 is as intended, not a typo. This is because DP-OPH-rand does not densify the empty bins; instead, it directly assigns a random bit to each empty bin. Therefore, it does not use the quantity that is computed for the densified DP-OPH methods.
Q3. The very complicated form of Theorem 3.6 arises because the variance calculation for the DP-OPH methods involves many parameters (both data-related and algorithm-related) as well as many combinatorial calculations. We made a great effort to derive the analytical form of the variance, but further simplifying it seems difficult. Nevertheless, we verified the theorem in Figure 2 through numerical simulations.
We hope our explanation answers your questions properly. Please let us know if you have more questions. Thanks again for your efforts in reviewing our paper.
This paper studies an important problem in privacy-preserving hashing and sketching. The reviewers are in agreement that the paper does not meet the bar for publication at ICLR, and that the presentation would need to be significantly improved before the paper is ready for publication. We encourage the author(s) to incorporate the feedback from the reviewers regarding the presentation and experimental evaluation before re-submitting this work.
Additional Comments from the Reviewer Discussion
The reviewers raised concerns about:
1) the value of delta;
2) the semantics of the privacy protection, in particular, attribute-level privacy and the restriction of the proposed randomized response mechanism to this setting;
3) the experimental evaluation;
4) the presentation of this work;
5) the relationship between different parameters;
6) the motivation behind the studied algorithms.
The authors addressed the reviewers' concerns around 1) and 5), and to some extent 2), 3), and 6). However, the concerns about 4) remained.
Reject