Hot PATE: Private Aggregation of Distributions for Diverse Tasks
A PATE design that has high utility for diverse tasks
Abstract
Reviews and Discussion
Hot PATE extends the Private Aggregation of Teacher Ensembles (PATE) framework to diverse and open-ended tasks, addressing the fundamental tradeoff between privacy and diversity in generative AI. While the PATE framework works best in classification settings with a small set of labels, Hot PATE remedies this with coordinated ensembles, where teacher models use shared randomness to synchronize token selection, increasing agreement while preserving diversity without additional privacy penalties. The authors formalize a diversity-preserving aggregation method that ensures knowledge transfer while filtering irrelevant tokens. The benefits of coordinated ensembles are demonstrated empirically to achieve high diversity in an artificial scenario designed by the authors.
update after rebuttal
The reviewer thanks the authors for their responses. While I appreciate the effort to clarify the points raised, I feel that the answers do not fully resolve my concern about the empirical section, and as such, I am maintaining my original score.
Questions for Authors
- Similar to the prior work, it would be very helpful to try Hot PATE on real-world datasets to substantiate its practical impact and improve confidence in its general applicability.
- Along with the point above, can you provide some results at different privacy levels to better understand the diversity-privacy trade-off empirically?
The reviewer is not concerned about the contribution of the approach introduced in the paper; however, the empirical section is quite weak. It would definitely make the paper stronger if the authors provided empirical studies on real-world datasets and at various privacy levels.
Claims and Evidence
While the submission presents strong theoretical foundations regarding the privacy-diversity tradeoff in PATE and the benefits of coordinated ensembles, the empirical evaluation is pretty weak and does not provide convincing evidence.
Methods and Evaluation Criteria
The proposed methods are elegant and intuitive for the problem at hand, but the evaluation is quite weak. It relies on an artificial setting that does not fully reflect real-world applications.
Theoretical Claims
I have not fully verified the proofs.
Experimental Design and Analysis
The experimental design is highly artificial and does not map directly to practical use cases, which limits the generalizability of the findings. The empirical results do not include concrete privacy budget values (epsilon, delta), which are crucial for evaluating privacy-preserving methods. There are also insufficient ablation studies that systematically examine how the parameters affect results.
Supplementary Material
I went over the Appendix of the paper.
Relation to Prior Work
The paper improves a key limitation of PATE over generative tasks where outputs are inherently diverse, offering an important advancement by improving diversity preservation in private learning. This can be beneficial in scenarios such as in-context learning and synthetic text generation for distillation.
Essential References Not Discussed
I think the paper presents the background material adequately.
Other Strengths and Weaknesses
Strengths:
- The paper presents a creative extension of PATE to diverse and open-ended tasks, which is a significant departure from prior applications primarily focused on classification.
- The introduction of ensemble coordination to enhance diversity in a privacy-preserving manner is a novel approach in privacy-preserving learning.
- The paper provides a rigorous mathematical framework for diversity-preserving aggregation.
Weaknesses:
- The evaluation is limited to a synthetic task rather than real-world applications.
- The empirical section does not report exact (epsilon, delta) values for the different settings, nor does it explicitly measure how privacy budgets scale with diversity.
Other Comments or Suggestions
N/A
We thank the reviewer for the time and comments.
Question 1: "Try Hot PATE on real-world datasets to substantiate the practical impact of Hot PATE and improve confidence in its general applicability"
Hot PATE is a mathematically rigorous framework, which we consider to be our main contribution. Hot PATE can only improve over the baseline (cold PATE) in terms of the privacy-utility tradeoff, and this benefit increases with diversity.
We include here additional evidence of large potential gains of hot vs cold PATE on natural texts that represent real-world datasets. We are happy to add such evaluations to the paper:
We generated the token output distributions of Llama 3.2 1B on all prefixes of the first 500 tokens of a WSJ front-page article from Saturday (this is representative, as other news articles and texts we tried gave similar results). The temperature setting was the default.
For the purpose of this evaluation, consider a case where all teachers share the same output distribution. In this case, a coordinated ensemble would yield a histogram where a single token (a different one for different shared randomness) has a count of n and the remaining tokens have count 0. The diversity preservation is perfect, as the probability of any one token is equal to its probability value in the distribution. As for cold PATE, we computed the agreement level (minimum fraction of teachers) required for meeting different diversity-preservation levels on a sequence of 10 tokens. These values are inversely related to the privacy parameter epsilon. Results are averaged over a sliding window of 10 consecutive tokens from the article:
Hot PATE (coordinated ensembles): 100% diversity transfer, with no loss in the privacy parameter epsilon.
Cold PATE (independent ensembles):
| Diversity Transfer | Count Required | × Loss in epsilon |
|---|---|---|
| 25% | 0.084 n | 11.9 |
| 50% | 0.0261 n | 38 |
| 90% | 0.000733 n | 1364 |
| 95% | 0.000184 n | 5446 |
As you can see, for transferring 50% of the distribution, we would incur a 38× privacy loss with cold PATE. For transferring 95%, the loss is 5446×. This is very significant. We can extrapolate the same relative gains when the teacher distributions are not identical, as the gains would apply to a common "transferable" part that is supported by enough teachers.
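The effect described above can be illustrated with a small simulation. This is a hypothetical sketch, not the paper's exact construction: the function names are ours, and coordination is modeled here via inverse-CDF sampling with one shared uniform draw, so teachers holding identical distributions always vote for the same token, whereas independent sampling spreads the votes across the histogram.

```python
import random

def independent_votes(dists, rng):
    """Each teacher samples its next token independently ("cold" PATE style)."""
    return [rng.choices(range(len(d)), weights=d, k=1)[0] for d in dists]

def coordinated_votes(dists, shared_u):
    """All teachers reuse the same shared uniform draw via inverse-CDF sampling,
    so teachers with identical distributions always pick the same token."""
    votes = []
    for d in dists:
        cum, tok = 0.0, 0
        for tok, p in enumerate(d):
            cum += p
            if shared_u < cum:
                break
        votes.append(tok)
    return votes

rng = random.Random(0)
dist = [0.5, 0.3, 0.2]          # one shared output distribution
dists = [dist] * 100            # 100 identical teachers

ind = independent_votes(dists, rng)
coord = coordinated_votes(dists, rng.random())

print(max(ind.count(t) for t in set(ind)))    # top count well below 100
print(max(coord.count(t) for t in set(coord)))  # always exactly 100
```

With coordination the full count of 100 lands on one token (which token varies with the shared draw, matching the source distribution), while independent sampling tops out near the modal probability.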
Question 2: "Can you provide some results on different privacy levels to better understand empirically the diversity-privacy trade-off."
Yes. There are two regimes:
(1) The teacher distributions are "close" in TV distance. In this regime, using Hot PATE allows us to generate the privacy-preserving tokens (called "yields") for "free". We achieve this using SVT (the sparse vector technique) with "BetweenThresholds", as there is high agreement among the teachers. In contrast, in this regime, the baseline "cold" PATE incurs a privacy cost that increases with the diversity and does not obtain "free yields" as our method does.
(2) The teacher distributions are less similar (higher TV distance). In this regime it could be that when sampling a token from each teacher, the resulting samples are in complete "disagreement" and we need to re-sample (so multiple samplings might be needed per generated token). What we showed is that in Hot PATE we essentially need to pay only once per "yield", and we can generate tokens as long as most teachers agree on some small fraction of the distribution (in TV distance). In contrast, in this regime, the baseline "cold" PATE fails completely, in the sense that the probability of it generating any tokens at all is close to zero.
We did include some calculations in the appendix using one particular Laplace-based analysis method; there, the number of generated tokens grows quadratically with the number of teachers. This comes to 180 tokens with 1000 "teachers" (partitions of the data), 18000 tokens with 10K teachers, or 7 tokens for 200 teachers. These tokens are the combined length of synthetic examples or summaries generated from the sensitive data.
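The high-agreement regime can be sketched with a simplified noisy-threshold release. This is only an illustrative approximation under our own assumptions (the function name and the plain single-threshold Laplace test are ours; the method described above uses SVT with "BetweenThresholds", which is more refined): release the top token only when its noise-perturbed count clears a threshold.

```python
import random

def noisy_threshold_release(histogram, threshold, eps, rng):
    """Simplified Laplace-based aggregation sketch: release the most-voted token
    only if its noise-perturbed count clears the threshold; otherwise abstain."""
    token, count = max(histogram.items(), key=lambda kv: kv[1])
    # Difference of two Exp(eps) draws is Laplace noise with scale 1/eps.
    noise = rng.expovariate(eps) - rng.expovariate(eps)
    if count + noise >= threshold:
        return token
    return None

rng = random.Random(42)
hist = {"the": 93, "a": 4, "an": 3}   # high-agreement histogram (hot PATE regime)
print(noisy_threshold_release(hist, threshold=50, eps=1.0, rng=rng))  # prints: the
```

When agreement is high (count far above the threshold), the release almost never abstains, which is the intuition behind "free yields"; a low, spread-out histogram abstains instead of leaking.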
Finally, if more utility is needed, then hot PATE also applies with weaker privacy notions that are often used in practice, such as DP with a high epsilon or k-anonymity (as used in Clio by Anthropic): the "boost" in the privacy parameter facilitated by high counts in the histogram translates to a "boost" in k.
The PATE framework was designed for classification tasks where there is a single ground-truth label; however, for tasks like sequential text generation, there might be multiple "good" responses. This paper proposes to extend the PATE framework to diverse tasks like this (where the responses are distributions rather than single outcomes) by designing an aggregation method that preserves both diversity and privacy.
update after rebuttal:
After reading the other reviews and the rebuttals, I do have an amendment to make: I wrote down “Convincing empirical evaluation” as a strength, but the empirical evaluation is only convincing with regards to demonstrating that hot PATE preserves diversity. The experiments don’t fully demonstrate the value of diversity with regards to practical real-world use-cases.
Still (and somewhat subjectively), I think this paper has strong merits just in terms of theoretical foundations and sheer creativity; it is a significant departure from previous work and I think would be interesting to audiences at ICML, despite its limitations.
Questions for Authors
Besides sequential text generation, what other applications would hot PATE work well for?
Claims and Evidence
The claims made in the submission are supported by evidence.
Methods and Evaluation Criteria
The proposed methods make sense for the problem at hand.
Theoretical Claims
I didn't thoroughly check the correctness of any proofs.
Experimental Design and Analysis
The empirical demonstration in Section 5 looks sound to me.
Supplementary Material
I didn't review the supplementary material carefully.
Relation to Prior Work
This paper's key contribution is an extension of the PATE framework.
Essential References Not Discussed
All essential references are discussed.
Other Strengths and Weaknesses
Strengths --
- The paper's ideas are novel and it is great to see PATE applied in a more modern setting.
- Convincing empirical evaluation.
- Figures 1-3, in addition to being very charming, are helpful tools for understanding the text.
Weaknesses --
- The paper is pretty dense and it is easy to lose the plot.
- To some extent, the paper feels like a "proof of concept" to me: we get a new framework and we see that it does well according to certain metrics, but it's unclear how well this will perform across different applications because there is no utility guarantee and no downstream task.
Other Comments or Suggestions
This paper has great ideas but is nuanced and requires careful reading. I have a couple thoughts on how to make things more palatable for a casual reader:
- I think it would be great to have a more explicit side-by-side comparison of cold PATE and hot PATE. One way to do this could be to have a “master” algorithm block that generalizes both cold and hot PATE. Then cold PATE could be characterized as plugging in DP-aggregation of the frequency histogram into this master algorithm block, and hot PATE could be characterized as plugging in a diversity-preserving aggregation method (as detailed in Definition 1).
- Section 4 could also better juxtapose independent ensembles (for cold PATE) and coordinated ensembles (for hot PATE). I do like the blue comments in Algorithm 1, but in the two-column format (with the comments overflowing onto the next line and hampering readability) I think it hinders as much as it helps. A "master algorithm block" type of solution could also work here, to highlight the differences between the two sampling methods. (And if I've understood correctly, the main difference between independent and coordinated ensembles is mostly just how the randomness is sampled, and the counts are computed the same way for both methods.)
- Lastly I really liked the explanation of “cold” vs “hot” PATE that appears in Section 2, and I feel like it could be moved to the introduction to avoid suspense. (I spent the first four pages of the paper a little distracted, wondering what makes PATE “hot” or “cold”.)
We thank the reviewer for the constructive feedback and excellent comments and suggestions. We will use them to improve the presentation.
Question: “Besides sequential text generation, what other applications would hot PATE work well for?”
Response:
Hot PATE is suitable for "soft" tasks where the desired output is a sample from a distribution over multiple "possible answers" or "responses". It is a way to aggregate multiple sensitive distributions in order to obtain one or multiple samples that preserve "diversity" and "privacy" (this could also be a single task; not necessarily sequential generation).
Example usage that is not LLM token generation: A game where each sensitive expert suggests a distribution over possible next moves. We want to choose a move that reflects the experts but guards their privacy.
Hot PATE is also suitable for a weaker notion of privacy, a variant of anonymity, where we only require that the output is "supported" by at least k sensitive units. To achieve this, we simply only return a response that has a count of at least k, without adding noise. With hot PATE there is a much higher likelihood of a token having a large count than with cold PATE.
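The count-threshold rule described above amounts to a few lines of code. A minimal sketch (the function name is hypothetical): keep exactly the tokens whose histogram count reaches k, and release them with no added noise.

```python
def k_supported_release(histogram, k):
    """Return only tokens whose vote count is at least k (no added noise):
    every released token is "supported" by at least k sensitive teachers."""
    return {tok: c for tok, c in histogram.items() if c >= k}

# Toy histogram from the game example: experts voting over next moves.
hist = {"move_a": 60, "move_b": 35, "move_c": 5}
print(k_supported_release(hist, k=20))  # {'move_a': 60, 'move_b': 35}
```

Because coordinated ensembles concentrate votes, a token is far more likely to clear the k bar under hot PATE than under cold PATE's spread-out histograms.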
Weaknesses:
The reviewer pointed out that it is hard to tell how hot PATE would "perform across different applications" and that there is no "utility guarantee".
Since indeed our demonstration was on a particular task, we include here some additional evidence of large potential gains of hot vs cold PATE on natural texts. We are happy to add such evaluations to the paper. We also discuss utility guarantees (see our response to reviewer GoNr).
We considered the token output distributions of Llama 3.2 1B on all prefixes of the first 500 tokens of a WSJ front-page article from Saturday (other texts gave similar results). The temperature setting was the default.
For the purpose of this evaluation, consider a case where all teachers share the same output distribution. In this case, a coordinated ensemble would yield a histogram where a single token (a different one for different shared randomness) has a count of n. The diversity preservation is perfect, as the probability of any one token is equal to its probability value in the distribution. As for cold PATE, we computed the agreement level (minimum histogram count) required for meeting different diversity-preservation levels on a sequence of 10 tokens. These values are inversely related to the privacy parameter epsilon. Results are averaged over a sliding window of 10 consecutive tokens from the article:
Hot PATE (coordinated ensembles): 100% diversity transfer, with no loss in the privacy parameter epsilon.
Cold PATE (independent ensembles):
| Diversity Transfer | Count Required | × Loss in epsilon |
|---|---|---|
| 25% | 0.084 n | 11.9 |
| 50% | 0.0261 n | 38 |
| 90% | 0.000733 n | 1364 |
| 95% | 0.000184 n | 5446 |
As you can see, for transferring 50% of the diversity, we would incur a 38× privacy loss with cold PATE. For transferring 95%, the loss is 5446×. This is very significant. We can extrapolate the same relative gains when the teacher distributions are not identical, as the gains would apply to a common "transferable" part that is supported by enough teachers.
This paper introduces Hot PATE, a privacy-preserving method for auto-regressive models in open-ended tasks. It addresses the challenge of preserving diversity and privacy by coordinating teacher models through shared randomness and positive correlation voting. Key contributions include mathematically formalizing robust diversity transfer and demonstrating significant improvements over traditional methods, particularly in maintaining higher coverage and diversity with lower noise scales. Applications range from homogeneous to heterogeneous ensembles, enabling flexible privacy-preserving multi-party learning.
update after rebuttal
I maintain my score after carefully reading the rebuttal.
Questions for Authors
- The experimental section only counted the diversity of tokens. Does the dataset composed of these tokens have practical application value? (For example, are the generated sentences still grammatically correct? Can you provide an example of the generation results?)
- Is the diversity-privacy tradeoff indeed inherent? What is your final answer to this question?
Claims and Evidence
Yes.
Methods and Evaluation Criteria
Yes.
Theoretical Claims
I have not checked the proofs in the appendix.
Experimental Design and Analysis
The paper focuses more on theoretical advantages and does not thoroughly explore the feasibility, computational costs, and scalability of the coordination integration method in real-world scenarios. These factors are critical for practical deployment.
Supplementary Material
I have not reviewed the supplementary material.
Relation to Prior Work
This paper is closely related to PATE and differential privacy.
Essential References Not Discussed
No
Other Strengths and Weaknesses
Weaknesses:
- Redundant contribution section: the contribution section is overly verbose and repetitive, covering multiple details and steps repeatedly. For instance, the paper repeatedly elaborates on the implementation mechanisms of "Hot PATE" (such as specific sampling methods for coordination sets, threshold settings, etc.) and emphasizes the same core ideas across different scenarios (homogeneous and heterogeneous ensembles). This redundancy makes the contribution section overly cumbersome and less focused.
- Lack of a clear hierarchical structure: the contribution section fails to distinguish between core innovations and secondary implementation details. For example, the proposal of "coordination sets" is one of the main contributions of the paper, but it is mixed with other technical details (e.g., privacy analysis methods), making it difficult for readers to quickly grasp the key points.
Other Comments or Suggestions
No
We thank the reviewer for the comments and will do our best to improve the presentation.
Question 1:
-- our demo “only counted the diversity”.
Our demo reports on the diversity-privacy tradeoff. Diversity is measured by the number of returnable tokens for a given privacy level (measured by the threshold setting).
-- “grammatically correct”
Our proposed method returns text that is generated by the LLM, so if the LLM generates grammatically correct sentences, then so does our output. This holds in our construction because all of the teachers are consistently given the same sanitized prefix (which was privately computed in the previous iterations), so all of the teachers are predicting the next token of the same prefix. This is explained on page 4 (below Figure 4). In our demo, we used the Llama 3 8B model and text prompts. The prompts were designed to generate a single-token response that is a number, and the LLM behaved as expected. Sequential text generation repeats such steps.
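The sequential construction described above can be sketched as a short loop. This is our illustrative pseudocode made concrete, not the paper's implementation (the function names and the toy argmax aggregate are our assumptions): every teacher conditions on the same privately released prefix, and the privacy-preserving aggregation step picks the next token.

```python
def private_generate(teachers, aggregate, prompt, max_tokens):
    """Sketch of sequential private generation: all teachers condition on the
    same sanitized prefix, so the released text stays coherent."""
    prefix = list(prompt)
    for _ in range(max_tokens):
        dists = [t(prefix) for t in teachers]  # per-teacher next-token distributions
        token = aggregate(dists)               # privacy-preserving aggregation step
        if token is None:                      # aggregator abstained: stop generating
            break
        prefix.append(token)
    return prefix

def sum_and_argmax(dists):
    """Toy (non-private) aggregate: pick the token with the largest total mass."""
    totals = {}
    for d in dists:
        for tok, p in d.items():
            totals[tok] = totals.get(tok, 0.0) + p
    return max(totals, key=totals.get)

teachers = [lambda prefix: {"x": 0.9, "y": 0.1} for _ in range(3)]
print(private_generate(teachers, sum_and_argmax, ["hi"], 4))  # ['hi', 'x', 'x', 'x', 'x']
```

Swapping `sum_and_argmax` for a noisy, diversity-preserving aggregator is where the hot vs. cold PATE distinction lives; the outer loop is identical in both cases.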
-- “practical application value”.
Our hot PATE method is designed as a plug-in replacement for cold PATE. It is validated mathematically, always improves the utility-privacy tradeoff, and its benefits increase with the entropy of the generation process.
We include here additional evidence for the benefits of hot PATE over the baseline cold PATE on natural texts. We considered the token output distributions of Llama 3.2 1B on all prefixes of the first 500 tokens of a WSJ front-page article from Saturday (other texts gave similar results). We are happy to include such an evaluation in our paper.
For the purpose of this evaluation, consider a case where all teachers share the same output distribution. In this case, a coordinated ensemble would yield a histogram where a single token (a different one for different shared randomness) has a count of n. The diversity preservation is perfect, as the probability of any one token is equal to its probability value in the distribution. As for cold PATE, we computed the agreement level (minimum histogram count) required for meeting different diversity-preservation levels on a sequence of 10 tokens. These values are inversely related to the privacy parameter epsilon. Results are averaged over a sliding window of 10 consecutive tokens from the article:
Hot PATE (coordinated ensembles): 100% diversity transfer, with no loss in the privacy parameter epsilon.
Cold PATE (independent ensembles):
| Diversity Transfer | Count Required | × Loss in epsilon |
|---|---|---|
| 25% | 0.084 n | 11.9 |
| 50% | 0.0261 n | 38 |
| 90% | 0.000733 n | 1364 |
| 95% | 0.000184 n | 5446 |
As you can see, for transferring 50% of the diversity, we incur a 38× privacy loss with cold PATE. For transferring 95%, the loss is 5446×. This is very significant. We can extrapolate the same relative gains when the teacher distributions are not identical, as it applies to the "transferable" part.
Question 2: As we establish mathematically, and demonstrate in our demo, hot PATE (coordinated ensembles) provides high utility regardless of diversity. The tradeoff is not inherent.
Weaknesses:
– "focuses more on theoretical advantages and does not thoroughly explore the feasibility, computational costs, and scalability of the coordination integration method in real-world scenarios"
The benefits of our method are established via a rigorous mathematical analysis and kick in whenever there is entropy in the responses (which is well established for LLMs). We do, in fact, include a discussion of scalability and computational costs for different API types (Section 4.3). The additional evidence provided above shows that on "real-world" text generation we can expect orders-of-magnitude benefits over "cold" PATE.
– “repeat the same core idea across scenarios” by describing “homogeneous and heterogeneous” ensembles
These two scenarios warrant separate treatment. Standard PATE is designed for heterogeneous ensembles, which make sense both in diverse and non-diverse settings, whereas homogeneous ensembles are only relevant in diverse settings (and as far as we know were not previously studied with PATE). They require not only different threshold settings but also different private aggregation methods, in order to preserve diversity.
– "The proposal of "coordination sets" is one of the main contributions of the paper, but it is mixed with other technical details (e.g., privacy analysis methods)"
Our submission is about privacy. It is therefore necessary and central for us to present coordinated ensembles together with their benefit to our privacy analysis.
The paper introduces "Hot PATE," which is a variation of Private Aggregation of Teacher Ensembles (PATE or, in the context of the paper, 'cold' PATE). Hot PATE focuses on ensuring privacy in open-ended generative tasks, addressing the fundamental tradeoff between privacy and diversity in generation.
Overall, this is a very promising paper. The theoretical results are solid, and this is a non-trivial extension of PATE. I have read the authors' comments to the AC about the reviews, and have taken them into account.
The main discomfort of the reviewers -- which led to an average 'reject' score -- is the experimental setup. The paper starts with a very strong motivation for Hot PATE, a solid theoretical setup, and the promise of improving diversity in generation while maintaining privacy. Alas, the empirical demonstration (Section 5) is very artificial (edible numbers on planet Z??). Reviewer b24K noted that the paper feels more like a "proof of concept" than a finished research product. Reviewer GoNr also noted the limitation of a single synthetic example, and remained unconvinced after the rebuttal. They also note that clear privacy measurements and reporting of the epsilon, delta parameters are absent from the paper.
Overall, I believe this paper is promising but could benefit from a stronger empirical validation. After all, the paper (as it stands) promises more diverse and private generation, yet only presents results on a highly artificial task. I take the authors' point that the theoretical results indicate that their method is a strict improvement of (cold) PATE, but I do agree that greater care should be taken in the empirical section, including results on what at least resembles real-world data. The rebuttal is a step in the right direction, but the changes would require an additional round of review.
I encourage the authors to continue improving their work -- this will evolve into an important paper for the privacy community, but the manuscript's impact will benefit from additional polishing, experiments aligned with its core message, and one more round of review.