Private Set Union with Multiple Contributions
Abstract
We study the limits of differentially private set union (or partition selection) when each user can contribute multiple items.
Reviews and Discussion
Assume each user owns a subset of at most $k$ items that belong to a large universe. The Differentially Private Set Union problem (I prefer calling it Differentially Private Partition Selection; see comments below) is to compute the union of the items from all users while satisfying a reasonable definition of differential privacy (DP). Previous works show that one can construct an optimal $(\varepsilon,\delta)$-DP algorithm when each user has exactly one item. However, designing an optimal $(\varepsilon,\delta)$-DP algorithm when each user contributes up to $k$ items is an open problem.
This paper presents several key findings for this open problem. First, instance-optimal algorithms are not generally possible when $k > 1$. Second, it is possible to obtain an appropriate algorithm in settings where some public prediction regarding the union is available. Finally, the paper presents several concrete algorithms under various public predictions, accompanied by detailed proofs of privacy and utility.
In summary, this paper provides important results for the Differentially Private Partition Selection problem, helping researchers better understand its privacy boundaries.
Strengths and Weaknesses
Strengths
- Important (impossibility and possibility) results for the Differentially Private Partition Selection problem.
- Concrete algorithms under possible settings.
- Detailed privacy and utility proofs.
Weaknesses
- I am wondering whether the topic of this paper fits one of the topics of NeurIPS 2025.
- Private Set Union is a well-defined topic in cryptography. I suggest changing the term "Private Set Union" to another one, e.g., Differentially Private Partition Selection.
- No implementations or experiments are given in this paper. I believe this is a minor problem, since this paper focuses mainly on theoretical results.
Questions
I cannot find a topic for NeurIPS 2025 that matches the topic of this paper. Could the authors formally state which NeurIPS 2025 topic this paper falls under?
Limitations
Aside from the question of fitting the topics of NeurIPS 2025, I note that the definition of Private Set Union provided by the authors is not the one used in cryptography. Specifically, in cryptography, PSU considers several (typically two) parties, each holding a set, and the output of PSU is the union of the sets without leaking the intersections of these sets. Detailed definitions and constructions for two parties and multiple parties can be found in [1] and [2], respectively. I suggest changing the term used in this paper or adding a separate description to clarify the differences between the two definitions.
I have to say that the reason I bid on this paper is that I am quite familiar with cryptographic PSU, but not with the notion defined in this paper. I tried my best to read the paper based on my DP knowledge, and I believe this paper is well-written. I am looking forward to seeing feedback from the other reviewers.
[1] Kolesnikov, V., Rosulek, M., Trieu, N., & Wang, X. Scalable private set union from symmetric-key techniques. ASIACRYPT 2019, pp. 636-666.
[2] Liu, X., & Gao, Y. Scalable multi-party private set union from multi-query secret-shared private membership test. ASIACRYPT 2023, pp. 237-271.
Final Justification
Thanks again for the authors' feedback. I believe changing the name from Private Set Union to Differentially Private Set Union can reflect the specific privacy constraint that this paper targets. I suggest having a separate paragraph describing the definition of Private Set Union from the crypto field and explicitly stating the difference.
Formatting Issues
No.
Thank you for taking the time to review our work. We address your specific points below.
> I cannot find a topic for NeurIPS 2025 that matches the topic of this paper. Could the authors formally state which NeurIPS 2025 topic this paper falls under?
The relevant topic from the NeurIPS call for papers is "privacy" (under "Social and economic aspects of machine learning"). Differential privacy is a longstanding interest of the ML community and has consistently appeared at previous conferences. (For example, the same problem has been studied in Gopi et al. (2020), which was presented at ICML 2020.)
> Private Set Union has been a well-defined cryptography topic... I suggest changing the term used in this paper or having a separate description to clarify the differences between the two definitions
We agree and are open to changing the name. One option is to change the name to "Differentially private set union" to reflect the specific privacy constraint we are working with. This is also consistent with prior work such as Gopi et al. (2020) and Carvalho et al. (2022).
Thanks to the authors for the response. I don't have further questions. As my initial rating is positive, I'll keep it as it is.
This paper studies the private set union problem in the context of differential privacy, where each user contributes a bag of up to $k$ items from a large universe. The goal is to compute a subset of the union of all user-contributed items while preserving $(\varepsilon,\delta)$-differential privacy. The challenge arises because reporting infrequent items can violate privacy guarantees. Unlike previous work, which focused on the case where $k = 1$, this paper generalizes the setting to $k > 1$ and provides the first theoretical utility bounds in this more general case.
The main reason previous papers do not have a good theoretical analysis of private set union is that it is not trivial to formalize the utility. As mentioned by the authors, a simple example shows that no $(\varepsilon,\delta)$-differentially private algorithm can output a subset whose expected cardinality exceeds a certain dataset-dependent bound, so it does not make much sense to directly analyze the size of the output. In this paper, the authors introduce the utility ratio, which normalizes the expected utility by a dataset-specific upper bound. This allows comparison of algorithms in terms of their worst-case normalized performance. Given this measure, the authors give several impossibility results and matching algorithms.
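In symbols, a hedged rendering of the utility ratio (the notation $\Pi_{\varepsilon,\delta}$ for the dataset-specific upper bound and $r(M)$ for the ratio are assumed here for illustration, following the reviews' references to the "$\Pi$ bound"):

```latex
% Utility ratio (assumed notation): a mechanism M is scored by its
% expected output size, normalized per dataset D by the upper bound
% Pi that every (eps, delta)-DP mechanism must respect, and then
% evaluated in the worst case over datasets.
\[
  r(M) \;=\; \inf_{D} \frac{\mathbb{E}\bigl[\,|M(D)|\,\bigr]}{\Pi_{\varepsilon,\delta}(D)},
  \qquad
  \mathbb{E}\bigl[\,|M(D)|\,\bigr] \;\le\; \Pi_{\varepsilon,\delta}(D) \ \text{ for all } D.
\]
```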
Strengths and Weaknesses
This paper considers an important problem in differential privacy and presents several clearly stated theoretical results. In fact, to the best of my knowledge, it is the first paper to provide any form of theoretical utility bounds for private set union with $k > 1$. (The work by Gopi et al. analyzes only the privacy guarantees, while the Cohen et al. paper implicitly handles only the special case where $k = 1$.) Here, $k$ denotes the maximum cardinality of the input subsets. Moreover, the paper is well-written and easy to follow. The proof sketches included in the main text are especially helpful in conveying the core techniques and proof ideas.
In my view, the most intuitive way to define "utility" in the context of private set union is to aim to preserve all items with reasonably large frequency. The function $\Pi$ offers a strong intuition for how frequency impacts utility, making it natural to define a utility ratio for the general case of $k > 1$. This, in turn, provides a meaningful and intuitive characterization of utility for the private set union problem. It is also quite surprising that the authors are able to show tightness of their algorithm, especially given that it is significantly simpler than the algorithmic framework in Gopi et al., which involves more complex histogram-updating strategies.
In conclusion, I believe this paper would be a good fit for NeurIPS.
Questions
In this paper, for the set union case where $k = 1$, you refer to the sampling-based algorithm from the Cohen et al. paper. I believe there is another algorithm based on thresholding the histogram:
(i) Add Laplace noise with parameter $1/\varepsilon$ to the frequency of each element in the union;
(ii) Output all elements whose noisy frequency exceeds a threshold $\tau$, where $\tau$ is approximately $1 + \ln(1/(2\delta))/\varepsilon$.
It should be easy to show that this algorithm is $(\varepsilon,\delta)$-differentially private (see the proof of Lemma 5 in the paper "Releasing Search Queries and Clicks Privately," WWW 2009 for details). Clearly, these two algorithms are not equivalent. However, I did some preliminary calculations myself, and I believe the thresholding approach achieves asymptotically similar error bounds to the i.i.d. sampling algorithm in Cohen et al.
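For illustration, here is a minimal sketch of this Laplace-plus-threshold mechanism for the $k = 1$ case. The function name and the exact threshold constant are assumptions following the standard analysis cited above, not the paper's mechanism:

```python
import numpy as np

def laplace_threshold_union(histogram, epsilon, delta, rng=None):
    """Sketch of private set union via thresholding, for k = 1.

    histogram: dict mapping item -> number of users holding it.
    With one item per user, each count has sensitivity 1, so adding
    Laplace noise of scale 1/epsilon and releasing only items whose
    noisy count clears tau = 1 + ln(1/(2*delta))/epsilon gives
    (epsilon, delta)-DP by the standard argument.
    """
    if rng is None:
        rng = np.random.default_rng()
    tau = 1.0 + np.log(1.0 / (2.0 * delta)) / epsilon
    released = set()
    for item, count in histogram.items():
        noisy_count = count + rng.laplace(scale=1.0 / epsilon)
        if noisy_count > tau:
            released.add(item)
    return released
```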
My motivation for comparing these two algorithms is that the thresholding method appears more intuitive. In particular, it provides a natural explanation for why $\Pi$ is a three-piece function:
(i) the frequency almost surely falls below the threshold (corresponding to the low-frequency piece);
(ii) the frequency almost surely exceeds the threshold (corresponding to the high-frequency piece); and
(iii) the intermediate case.
So if they are indeed equivalent (in terms of utility), then perhaps using the thresholding algorithm as the benchmark would be better? What do you think?
Limitations
Yes.
Final Justification
The authors properly responded to my questions.
Formatting Issues
No.
Thank you for taking the time to review our work. We address your specific points below.
> For the union set case where k=1, you refer to the sampling-based algorithm from the Cohen et al. paper. I believe there is another algorithm based on thresholding the histogram... I believe the thresholding approach achieves asymptotically similar error bounds... perhaps using the thresholding algorithms as the benchmark would be better?
We agree that the thresholding approach is very intuitive and that it achieves asymptotically similar error bounds. But since the strictly optimal (not just asymptotically optimal) reporting probabilities are known in the k=1 case, and since we can actually achieve them under some assumptions (see Theorem 1.3), we believe that working with the optimal reporting probabilities is a stronger baseline.
Thanks for your response; I will keep my score.
This paper investigates the private set union problem when each user may contribute multiple items. The authors introduce a new performance metric called the “utility ratio,” which normalizes expected output size by a dataset-specific upper bound and measures worst-case performance across all datasets. They prove that instance-optimal algorithms exist for the single-contribution case but fail when users can submit multiple items, characterizing how performance degrades as the maximum contribution count grows. For settings with a prior histogram prediction, they design a mechanism that achieves the optimal utility ratio when the prediction is exact and degrades gracefully with prediction error. Both theoretical analyses and empirical evaluations across various settings demonstrate that their algorithms achieve optimal or near-optimal utility ratios.
Strengths and Weaknesses
Strengths:
The paper makes significant theoretical contributions, proving tight lower and upper bounds on the utility ratio and constructing matching algorithms. It also extends to scenarios with prior histogram information, offering robust mechanisms with provable guarantees. The work deepens our understanding of differentially private set union with multiple contributions, revealing fundamental limitations and guiding algorithm design. The introduction of the utility ratio provides a unified framework for comparing mechanisms across datasets.
Weaknesses:
While the authors characterize how the worst-case utility ratio deteriorates as per-user contribution bounds grow, they do not propose or evaluate any algorithmic techniques to mitigate this degradation. In extreme multi-item settings, the utility may drop sharply without clear avenues for improvement.
There is little discussion of the runtime complexity or messaging costs of the proposed algorithms. In privacy-constrained settings, these overheads can dominate, especially when each user contributes multiple items, but the paper omits any comparative performance profiling against existing baselines.
The robust mechanism fundamentally assumes access to a reasonably accurate prior histogram of the data distribution. However, the paper does not comprehensively explore how sensitive the overall utility is to different types of prediction errors or misspecified priors.
The work fixes the privacy parameters ($\varepsilon$, $\delta$) and does not systematically study the privacy-utility frontier. As a result, readers lack guidance on how the utility ratio behaves under tighter privacy budgets or in regimes where sensitivity amplification techniques might apply.
Questions
For high-k scenarios where per-user contribution counts are large, are there feasible algorithmic improvements or relaxation schemes to mitigate the performance degradation?
What are the computational and communication costs of the mechanisms, and how do they compare empirically to existing methods?
How sensitive is the proposed robust mechanism to different error distributions in the prior histogram prediction?
How does the utility ratio behave across different privacy-utility trade-off points as the privacy parameters ($\varepsilon$, $\delta$) vary?
Limitations
Yes
Final Justification
The rebuttal has addressed all of my initial concerns. I think this paper is technically solid, and other reviewers also recognize it as well structured with significant theoretical contributions to differential privacy. Therefore, I recommend acceptance.
Formatting Issues
No
Thank you for taking the time to review our work. We address your specific points below.
> While the authors characterize how the worst-case utility ratio deteriorates as per-user contribution bounds grow, they do not propose or evaluate any algorithmic techniques to mitigate this degradation.
Our results show that in the worst case, every mechanism is subject to low utility ratios for some privacy regimes and datasets, so mitigations do not exist. However, one interesting direction that we partially address is to explore non-worst-case conditions under which mechanisms can achieve high utility ratios. Our mechanism that makes use of predictions (Section 4) is one example: in cases where a reasonably accurate prediction of the histogram is available, it is possible to do much better than our impossibility results.
> There is little discussion of the runtime complexity or messaging costs of the proposed algorithms. In privacy-constrained settings, these overheads can dominate, especially when each user contributes multiple items, but the paper omits any comparative performance profiling against existing baselines
Let $N$ denote the sum of bag sizes in an input dataset. The mechanisms discussed in Lemma 3.1, Lemma 3.4, and Theorem 4.1 all run in time linear in $N$ because they essentially build a histogram of the items that appear in the dataset and then perform linear-time operations on the histogram. The mechanism from Lemma 3.3 is also linear in $N$, since whenever it does not output an empty set, it just outputs the items of a randomly chosen user. Finally, the bi-criteria mechanism from Theorem 3.2 can also be implemented to run in time linear in $N$ with a bit of bookkeeping; a sketch of the shared histogram step appears below. Note that the linear dependence on $N$ is unavoidable for any algorithm that reads every data point. We will clarify this.
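To make the linear-time claim concrete, here is a minimal sketch of the shared first step, the histogram pass (the function name is illustrative, not from the paper):

```python
from collections import Counter

def build_histogram(user_bags):
    """One pass over all contributed items: O(N) time, where N is the
    sum of bag sizes. The mechanisms discussed above then perform
    additional linear-time work on the resulting histogram."""
    hist = Counter()
    for bag in user_bags:
        hist.update(bag)
    return hist

# Example: three users with bags of size at most k = 2.
print(build_histogram([["a", "b"], ["b"], ["a", "c"]]))
# Counter({'a': 2, 'b': 2, 'c': 1})
```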
> The robust mechanism fundamentally assumes access to a reasonably accurate prior histogram of the data distribution. However, the paper does not comprehensively explore how sensitive the overall utility is to different types of prediction errors or misspecified priors
Theorem 4.1 shows that the expected output size of the mechanism that uses predictions degrades with the distance between the true and predicted dataset histograms. If the distance is $\Delta$, then the mechanism performs as well as the $\Pi$ bound on a dataset where every item count is decreased by $\Delta$. Studying robustness with respect to other discrepancy measures is an interesting future direction to explore. We will clarify this in the paper.
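In symbols (with $\Delta$ for the prediction error and $D_{-\Delta}$ for the count-shifted dataset, both assumed illustrative notation rather than the paper's):

```latex
% Robustness (assumed notation): with histogram distance Delta between
% the true and predicted datasets, the prediction-based mechanism M
% matches the Pi bound on the dataset whose item counts are each
% reduced by Delta.
\[
  \mathbb{E}\bigl[\,|M(D)|\,\bigr] \;\ge\; \Pi_{\varepsilon,\delta}\bigl(D_{-\Delta}\bigr).
\]
```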
> The work fixes privacy parameters (epsilon, delta) and does not systematically study the privacy-utility frontier. As a result, readers lack guidance on how the utility ratio behaves under tighter privacy budgets or in regimes where sensitivity amplification techniques might apply
We did not emphasize the privacy-utility frontier, but our results do characterize some interesting regions of the privacy parameter space where utility ratios are bounded in different ways. For example, Lemmas 3.3 and 3.4 show that for fixed $\delta$, utility ratios of 1 are achievable in the limit as $\varepsilon$ tends to zero or infinity. On the other hand, for a range of intermediate values of $\varepsilon$, our results establish that utility ratios cannot be better than roughly $1/k$. Understanding the implications for amplification, composition, etc. would be a nice direction for future research.
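Summarizing these regimes in symbols (with $r^*(\varepsilon,\delta)$ as assumed notation for the best achievable utility ratio):

```latex
% Limiting regimes (assumed notation): for fixed delta, the best
% achievable utility ratio tends to 1 at both extremes of epsilon,
% but is capped near 1/k for intermediate epsilon.
\[
  \lim_{\varepsilon \to 0} r^*(\varepsilon,\delta)
  \;=\; \lim_{\varepsilon \to \infty} r^*(\varepsilon,\delta) \;=\; 1,
  \qquad
  r^*(\varepsilon,\delta) \;\lesssim\; \frac{1}{k}
  \ \text{ for intermediate } \varepsilon.
\]
```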
I appreciate the authors’ efforts in the rebuttal and will maintain my positive rating.
This paper studies the problem of differentially private set union: given a collection of sets $S_1, \dots, S_n$, each contributed by an individual, output a set $S$ approximating $\bigcup_i S_i$ in an $(\varepsilon,\delta)$-differentially private way. $S$ must be a subset of $\bigcup_i S_i$, and the goal is to make it as large as possible subject to that constraint.
A key parameter is $k$, an upper bound on the cardinality of each $S_i$.
The paper presents new mechanisms for this problem, and also proves impossibility results.
Previous work showed there is a function $\Pi$ which upper bounds the performance of any $(\varepsilon,\delta)$-DP mechanism on a particular dataset $D$, and that when $k = 1$ there is a mechanism that achieves this bound for all datasets $D$.
This paper shows that, in contrast, when $k > 1$, no mechanism can achieve the bound for all datasets, even in the "easy" case when $\Pi(D)$ is on the order of the size of the answer.
On the other hand, the paper describes new mechanisms for the $k > 1$ regime. One of them achieves a provable performance guarantee for all datasets (which is necessarily worse than $\Pi$). The other mechanism is parameterized by an estimate of the true output, and achieves the $\Pi$ limit if the estimate is exactly correct, with performance decreasing as the quality of the estimate decreases (while always satisfying the privacy requirement, even if the estimate is completely wrong). It is worth noting that the trivial strategy of always outputting the estimate is not allowed, since the mechanism's output must always be a subset of the true answer.
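For reference, the task described above in symbols (notation assumed for illustration):

```latex
% DPSU (assumed notation): users hold bags S_1, ..., S_n of at most k
% items each from a universe U; the mechanism must output a subset of
% the true union, as large as possible, under (eps, delta)-DP.
\[
  S_1,\dots,S_n \subseteq U, \quad |S_i| \le k, \qquad
  \text{output } S \subseteq \bigcup_{i=1}^{n} S_i
  \ \text{ maximizing } \mathbb{E}\bigl[\,|S|\,\bigr]
  \ \text{ subject to } (\varepsilon,\delta)\text{-DP}.
\]
```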
Strengths and Weaknesses
Strength: Clear writing
The paper was a pleasure to read. The background and the new results were easy to understand.
Strength: New mechanisms for private set union
The paper describes two new mechanisms for private set union with $k > 1$: a general-purpose one (Theorem 1.2) and one that takes advantage of a prediction of what the set union is likely to include (Theorem 1.3) to get a surprising accuracy guarantee.
Strength: Impossibility results and mathematical analyses
The paper includes impossibility results that their new mechanisms can be compared to, as well as rigorous mathematical analyses of the new mechanisms. (The reduction from matrix marginals in the proof of Theorem 2.3 was nice and pleasingly simple.)
Also, there is an interesting use of linear programming to show the best utility ratios achievable for certain ranges of parameters. (Lines 73-76 and Figure 2.)
Minor weakness: no empirical evaluation
The paper stands on its own as it is, but it would have been nice to see the method tried out, even on synthetic data. It's always nice to see an example when understanding the performance of something.
Questions
I'm not sure Definition 1.1 has the right citation. I have not checked carefully, but my understanding is that Dwork et al.'s "Calibrating noise to sensitivity in ..." only introduced $\varepsilon$-DP. I have seen the $(\varepsilon,\delta)$ version in Definition 1.1 attributed to Dwork, Kenthapadi, McSherry, Mironov, and Naor's "Our Data, Ourselves: Privacy via Distributed Noise Generation" (Eurocrypt 2006). Probably that one should be cited. (I skimmed that paper; it looks like they call it "$\delta$-approximate $\varepsilon$-indistinguishability", and they seem to be presenting it as an original definition.)
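For reference, the $(\varepsilon,\delta)$ guarantee under discussion, as it is standardly stated:

```latex
% (eps, delta)-differential privacy: for all neighboring datasets
% D, D' and every measurable output set T,
\[
  \Pr\bigl[M(D) \in T\bigr] \;\le\; e^{\varepsilon}\,\Pr\bigl[M(D') \in T\bigr] + \delta.
\]
```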
The definition of $\Pi$ on lines 56–60 doesn't say what happens when $\varepsilon = 0$. At least, the constants have $1/\varepsilon$ in them. The case $\varepsilon = 0$ comes up on line 87, and in the statements of Theorems 2.1 and 2.2, and on line 267.
Line 83 describes "the bounds roughly mirroring the behavior in Figure 2", but for large $k$, Figure 2 seems to contradict the bound in question, since the utility ratio starts to increase again and eventually reaches 1 as $k$ increases. I guess it is not really a contradiction because Theorem 1.1's minimum value for $k$ depends on $\varepsilon$, and so maybe it is not satisfied on the right side of Figure 2. Still, it makes that text on line 83 look untrue. (Or did I misunderstand something here?)
The second bound in Theorem 1.1 (lines 80-81) has an unspecified dependence on $n$. This is indicated by the subscript $n$ in the notation, but it may be unusual or unintuitive enough that it's worth commenting on in the prose: how exactly should one think of this bound? What does it mean? At least, I found it a bit tricky to wrap my head around, since usually $n$ is the increasing variable in asymptotic bounds.
Line 236: Not strictly necessary, but it would be nice to hear more about what this set is. Is there a precise mathematical definition of it in terms of the query $q$ and dataset $D$? Or maybe an example would be appropriate, just to give some intuition?
Small things:
Line 38: "Two datasets $D$ and $D'$ as neighboring..." seems to be missing something, e.g. "We consider two datasets $D$ and $D'$ as neighboring...".
Line 135: "after removing user $i$": unless I missed it, $i$ isn't defined here. I guess you mean user 1?
Limitations
yes
Final Justification
I only had relatively minor questions and the authors have addressed them. I'm maintaining my original positive rating.
Formatting Issues
none
Thank you for taking the time to review our work. We address your specific points below.
> I'm not sure Definition 1.1 has the right citation
Agreed. This was indeed a mistake and we will update the citation.
> The definition of $\Pi$ on lines 56-60 doesn't say what happens when $\varepsilon = 0$
When $\varepsilon = 0$, the quantity $\Pi$ is still well defined and takes its natural limiting value. We will state this explicitly and clarify it in the paper.
> Line 83 describes "the bounds roughly mirroring the behavior in Figure 2", but for large k, Figure 2 seems to contradict the bound in question, since the utility ratio starts to increase again and eventually reaches 1 as k increases. I guess it is not really a contradiction because Theorem 1.1's minimum value for k depends on epsilon, and so maybe it is not satisfied on the right side of Figure 2. Still, it makes that text on line 83 look untrue. (Or did I misunderstand something here?)
Your understanding is correct: our impossibility results establish combinations of the privacy parameters $\varepsilon$ and $\delta$ under which high utility ratios are impossible. But for fixed $\delta$, taking $\varepsilon$ to zero or infinity allows for a utility ratio of 1 (see Lemmas 3.3 and 3.4), which matches the behavior in the plots. We intended the text on line 83 to point out that the curves for each value of k have minimum values near 1/k, the worst-case ratio from the first part of Theorem 1.1. We agree that it is currently confusing and we will revise the text.
> The second bound in Theorem 1.1 (lines 80-81) has an unspecified dependence on n. This is indicated by the subscript n in the notation, but it may be unusual or unintuitive enough that it's worth commenting on in the prose: how exactly should one think of this bound? What does it mean? At least, I found it a bit tricky to wrap my head around, since usually n is the increasing variable in asymptotic bounds
The main focus of our paper is to study how the user contribution limit k impacts the utility ratio, so we emphasize the dependence on k in Theorem 1.1. The full bound is presented in Theorem 2.3, which includes the dependence on n. The bound holds when k is large compared to n and hence we include n as a subscript. We will clarify the notation used in Theorem 1.1 in the revision.
> Line 236: Not strictly necessary, but it would be nice to hear more about what this set is. Is there a precise mathematical definition of it in terms of the query $q$ and dataset $D$? Or maybe an example would be appropriate, just to give some intuition?
The current text on line 236 uses notation that clashes with the rest of the paper, which we will address. In the work of Kaplan et al., the query $q$ counts the number of users $u$ that match a predicate, and the set of users to remove (as denoted on line 236 in the current version) contains all users $u$ for which $q(u) = 1$.
> Line 38: "Two datasets D and D' as neighboring..." seems to be be missing something, e.g. "We consider two datasets D and D' as neighbouring..."
Right. We will fix this typo.
> Line 135: "after removing user i": unless I missed it, i isn't defined here. I guess you mean user 1?
Right. We will fix this typo.
Thank you, your responses have addressed my questions. I am maintaining my positive rating.
This paper considers the problem of computing a differentially private set union when each user holds at most k items. Past work provided optimal mechanisms for k=1. This work provides theoretical results for k > 1. No mechanism is optimal or can achieve high utility for every dataset. The authors therefore consider the utility ratio, a dataset-normalized utility measure, and prove impossibility results and construct algorithms with strong theoretical guarantees for utility ratio.
Reviewers were unanimously in favor of this paper. They found the problem and the results to be important, with the first substantive results for the k > 1 case and strong impossibility and possibility results, clear writing, and insightful mathematical analysis and proofs. Few or no substantive weaknesses were raised. One reviewer suggested avoiding or qualifying the name "private set union", since this term has a different meaning in cryptography, which is a useful suggestion. Congratulations on a well-received paper.