PaperHub
Average score: 6.1/10 · Poster · 4 reviewers
Ratings: 3, 4, 3, 3 (min 3, max 4, std dev 0.4)
ICML 2025

Contrastive Private Data Synthesis via Weighted Multi-PLM Fusion

OpenReview · PDF
Submitted: 2025-01-11 · Updated: 2025-07-24


Keywords
Differentially Private Synthetic Dataset · Collaboration between Private Data and Private Model · Fusion of Pre-trained Language Models

Reviews and Discussion

Review (Rating: 3)

The paper introduces WASP, an approach for generating differentially private synthetic data by leveraging multiple pre-trained language models (PLMs) collaboratively. It tackles (1) limited private samples, (2) noisy synthetic data, and (3) risky PLM selection: a Top-Q voting mechanism improves private data distribution estimation, contrastive prompts reduce noise, and PLMs are dynamically weighted based on their performance.

Questions For Authors

See cons.

Claims And Evidence

Yes

Methods And Evaluation Criteria

Yes

Theoretical Claims

Yes

Experimental Designs Or Analyses

Yes

Supplementary Material

No

Relation To Broader Scientific Literature

Using LLMs for data publishing

Essential References Not Discussed

Yes

Other Strengths And Weaknesses

Pros:

The paper introduces a novel framework (WASP) that combines multiple PLMs to generate high-quality DP synthetic data, addressing key limitations of existing PE methods.

The use of Top-Q voting and contrastive prompts is innovative and effectively improves the quality of synthetic data while maintaining privacy guarantees.

The paper provides a thorough theoretical analysis of the differential privacy guarantees of WASP, including proofs and sensitivity analysis.

Cons:

While the framework is effective, the complexity of combining multiple PLMs and the iterative nature of the process may raise concerns about scalability, especially for large-scale datasets or when using computationally expensive PLMs (e.g., GPT-4).

The paper does not provide a detailed analysis of the computational cost or time complexity of WASP compared to baseline methods.

The paper focuses exclusively on text data, and it is unclear how well WASP would generalize to other types of data (e.g., images, tabular data).

The paper assumes a fixed privacy budget across all iterations. However, in practice, the allocation of the privacy budget across iterations could be optimized to further improve the quality of synthetic data.

Other Comments Or Suggestions

NA

Author Response

We appreciate our reviewer's insightful comments that help us improve our work.

Q: Complexity of combining multiple PLMs, and computational cost & complexity comparison.

First of all, as stated in Lines 105–107 (left column) of our paper, WASP makes no additional queries to PLMs compared to baseline methods under the same number of required synthetic samples N, even though samples are generated in an iterative manner. As shown in Table G below, the computational cost of WASP is slightly higher than that of PE baselines, mainly due to increased prompt length (8 in-context samples instead of 1) and the "furthest histogram" calculation.

Second, the primary computational cost is driven by synthetic sample generation, which accounts for nearly 3000 times the runtime of the other operations (including DP Top-Q voting and PLM importance weighting). While this cost is unavoidable, users can mitigate the impact of computationally expensive PLMs by choosing faster alternatives. Also, as shown in Table G, WASP's computational cost remains manageable for large-scale datasets.

Table G. Comparison of computational complexity and runtime (seconds) of WASP and PE-series baselines within each iteration (averaged across iterations, and also across different PLMs for PE-series baselines). "Others" includes DP Top-Q voting and PLM importance weighting, i.e., lines 7–10 in Alg. 1, with the latter being a tensor normalization operation on the Nearest Histogram vector of negligible time cost.

                                  Aug-PE                 Pre-Text               WASP
                                  Generation   Others    Generation   Others    Generation   Others
Complexity                        O(N)         O(MN)     O(N)         O(MN)     O(N)         O(MN)
IMDb (L=1, M=100)                 1.341×10^4   5.846     --           --        1.970×10^4   6.090
IMDb (L=10, M=300)                --           --        1.348×10^4   7.153     1.963×10^4   7.668
Yelp-Rating (L=1, M=100)          1.539×10^4   5.468     --           --        2.175×10^4   5.528
Yelp-Rating (L=10, M=300)         --           --        1.561×10^4   7.376     2.235×10^4   7.608
Openreview-Rating (L=1, M=100)    1.268×10^4   5.655     --           --        1.949×10^4   5.871
Openreview-Rating (L=10, M=300)   --           --        1.215×10^4   7.713     1.982×10^4   7.911

Q: Other data modalities.

We believe the WASP framework can also be extended to other modalities. WASP's key components, including DP Top-Q voting and PLM importance weighting, are general operations that can be applied directly to other data modalities. However, the synthetic data generation process differs by modality. For example, for image generation [7,8], the applied models and APIs differ from text: diffusion models and RANDOM_APIs are often used to modify existing samples, with a hyper-parameter controlling the degree of variation. Therefore, non-trivial adaptations are needed for the generation procedure. Similarly, for tabular samples [9,10], non-trivial modifications are required to craft proper generation prompts and pipelines.

As a case in point, the original PE method [7] focuses exclusively on image data, and Aug-PE [11] makes non-trivial adaptations with carefully designed generation techniques to generalize it to text data. Given the extensive work required, we think generalizing WASP to other modalities deserves a separate paper or papers.

Q: Further optimization on privacy budget allocation.

We thank our reviewer for pointing out an interesting future step for our work. Since our primary focus lies in the fusion of multiple PLMs for private synthetic data generation, we leave this as a possible future advancement.


[7] Zinan Lin et al., Differentially Private Synthetic Data via Foundation Model APIs 1: Images, ICLR 2024.

[8] Kecen Li et al., PrivImage: Differentially Private Synthetic Image Generation using Diffusion Models with Semantic-Aware Pretraining, USENIX Security 2024.

[9] Mikel Hernandez et al., Synthetic data generation for tabular health records: A systematic review, Neurocomputing 2022.

[10] Eugenia Papadaki et al., Exploring innovative approaches to synthetic tabular data generation, Electronics 2024.

[11] Chulin Xie et al., Differentially Private Synthetic Data via Foundation Model APIs 2: Text, ICML 2024.

Review (Rating: 4)

The paper proposes a novel framework called WASP, designed to generate synthetic data that mimics real private datasets while ensuring DP. WASP addresses three key challenges in existing methods: limited private samples, noisy synthetic data, and the risk of selecting the wrong PLM. It uses a Top-Q voting mechanism to estimate private data distributions more accurately, leverages contrastive learning to improve synthetic data quality, and dynamically weights multiple PLMs to mitigate model bias. The framework is tested on six datasets with both open-source and closed-source PLMs, demonstrating superior performance over existing methods.

Questions For Authors

  1. Is the contrastive learning process efficient in terms of computational resources, especially when dealing with large datasets or complex tasks?
  2. The federated DP analysis assumes ≤8 samples/user for Δ=32 sensitivity. What happens if users contribute more samples (e.g., 100/user)? Does WASP’s noise scale linearly, making it impractical, or are there optimizations?

Claims And Evidence

The claims made in the submission are generally supported by clear and convincing evidence. However, while WASP performs well across multiple PLMs, the claim of being "PLM-agnostic" might be overstated. The ablation experiments primarily focus on a limited set of PLMs, specifically three GPT-based models when testing with K=2 PLMs. This limitation in the scope of PLMs tested raises questions about how well WASP would perform with a broader range of PLMs, including those from different architectures or vendors.

Methods And Evaluation Criteria

The proposed method, WASP, and its evaluation criteria generally make sense for the problem.

Theoretical Claims

No critical errors were found in the provided proofs.

Experimental Designs Or Analyses

The experimental design in WASP demonstrates methodological rigor. It tests 6 NLP tasks, including sentiment analysis and rating/field classification, with 6 open-source and 3 closed-source PLMs. However, there are some potential issues:

  1. Comparison Fairness: Aug-PE uses single PLMs, while WASP benefits from multi-PLM fusion; no comparison to Aug-PE with equal compute/API budgets.
  2. Task Scope: Limited to classification; no validation on generation/sequence tasks (summarization).

Supplementary Material

I comprehensively reviewed the supplementary materials. They provide essential technical validation: theoretical soundness of the DP mechanisms, practical implementation details, extended empirical evidence beyond the main results, and methodological differentiation from prior work.

Relation To Broader Scientific Literature

  1. Enhanced PE Methods
  • Prior: PE (Lin et al., 2024; Xie et al., 2024) used Top-1 voting, struggling with data scarcity.
  • WASP: Introduces Top-Q voting with decaying weights, improving sensitivity analysis (Δ=2 → Δ=4) and enabling robust distribution estimation with limited samples.
  2. Contrastive ICL + DP Synthetic Data
  • Prior: Contrastive ICL improved model responses (Gao & Das, 2024).
  • WASP: First to combine contrastive ICL with DP guarantees, using low-quality samples as negative examples to reduce noise (addressing Xie et al., 2024's "noisy data" issue).

[1] Xie, C., Lin, Z., Backurs, A., Gopi, S., Yu, D., Inan, H. A., Nori, H., Jiang, H., Zhang, H., Lee, Y. T., et al. Differentially Private Synthetic Data via Foundation Model APIs 2: Text. In Forty-first International Conference on Machine Learning, 2024.

[2] Lin, Z., Gopi, S., Kulkarni, J., Nori, H., and Yekhanin, S. Differentially Private Synthetic Data via Foundation Model APIs 1: Images. In The Twelfth International Conference on Learning Representations, 2024.

[3] Ye, J., Gao, J., Wu, Z., Feng, J., Yu, T., and Kong, L. ProGen: Progressive Zero-shot Dataset Generation via In-context Feedback. In Findings of the Association for Computational Linguistics: EMNLP 2022, pp. 3671–3683, 2022.

Essential References Not Discussed

As far as I know, there are no related works essential to understanding the (context for the) key contributions of the paper that are not currently cited or discussed.

Other Strengths And Weaknesses

It would be clearer to show the ablation results if the authors added another comparison without both contrastive prompting and importance weighting to the ablation study in Table 4.

Other Comments Or Suggestions

A more thorough comparison with other DP methods, such as DP-SGD, would offer a clearer understanding of WASP's advantages. Additionally, presenting a time cost comparison between the proposed method and standard PE would strengthen the evaluation.

Author Response

Thanks very much!

Q: Fair comparison with equal API budget.

We emphasize that, as stated in Lines 105–107 (left column) of our paper, WASP incurs no extra queries to the PLMs compared to single-PLM PE baselines when they are compared under the same number of required synthetic samples N. Thus we make a fair comparison to Aug-PE with equal API budgets (number of queries) throughout our experiments.

Q: Validation on generation tasks.

We provide results on a question answering task with the SQuAD dataset in Table D. These results also attest to the effectiveness of our proposed WASP for generation tasks and will be added to our final version.

Table D. Evaluation of STM performance (F1) on SQuAD dataset. The same setting is used as that in Table 1 in the paper.

     Only Private  FuseGen  Aug-PE_GPT-2  Aug-PE_Llama-2  Aug-PE_Vicuna  Aug-PE_OPT  Aug-PE_ChatGLM3  Aug-PE_Flan-T5  WASP
F1   5.41          9.31     7.37          7.84            8.46           8.31        8.20             9.32            11.40

Q: Ablation without both contrastive ICL and PLM importance weighting.

By removing both components, the final STM performances are 89.05%, 58.72%, and 35.45% for IMDb, Yelp-Rating, and Openreview-Rating, respectively. Compared with the results in Table 4, these results show that both components are vital to boosting the final performance of WASP. See full results in Table E1 at https://anonymous.4open.science/r/WASP/re.pdf .

Q: Comparison with DP-SGD.

We provide experimental results in Table F for first fine-tuning a single PLM with M=100 private samples via DP-SGD under the same setting as Table 1 in our paper, and then using the fine-tuned PLM for generation ("DP-SGD+Gen"); see Table F1 at https://anonymous.4open.science/r/WASP/re.pdf for more results. The results demonstrate the superiority of WASP over the DP-SGD method. We will add these to our final version.

Table F. STM performance of WASP and "DP-SGD+Gen" with N=6000, M=100, L=1.

                                GPT-2   Llama-2  Vicuna  OPT     ChatGLM3  Flan-T5
IMDb         DP-SGD+Gen (K=1)   87.44   84.63    84.93   81.47   83.18     89.14
             WASP (K=6)         89.52
Yelp-Rating  DP-SGD+Gen (K=1)   50.04   49.95    57.46   55.68   45.79     60.85
             WASP (K=6)         61.21

Q: Time cost comparison.

We compare the runtime of WASP and PE methods in Table G in our response to Reviewer ngfH below, due to the response length limit. Results show that the additional per-iteration time overhead of WASP is minor compared to PE methods. Since WASP makes no extra PLM queries compared to PE methods, its runtime increase comes from longer prompts (8 vs. 1 in-context samples in PE methods) for generation, and from the additional calculation of the "furthest histogram" during in-context learning sample selection.

Q: Efficiency of contrastive in-context learning.

Results in Table G in our response to Reviewer ngfH below show that increasing the total number of private samples M or the task complexity has little impact on WASP's runtime, confirming its efficiency.

Q: Optimization for user-level DP when data parties contribute more samples.

First, for fair comparison in the federated setting, we strictly follow Pre-Text [4], assuming each data party controls no more than 8 samples to fit the on-device setting. If users contribute more, optimizations like norm clipping [5,6], an off-the-shelf and widely applied technique, can keep noise levels practical by capping the norm of the voting vectors H^n_l, H^f_l given by data party l at a preset bound ζ, preventing the user sensitivity Δ from scaling linearly with the size of a user's private dataset. Under the same ζ, using more private samples (e.g., 100/user) will not degrade performance compared to using fewer (e.g., 8/user).
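The norm-clipping idea referenced here can be sketched in a few lines; this is an illustrative sketch of the standard technique from [5,6], with a hypothetical function name and an assumed L2 norm bound:

```python
import numpy as np

def clip_voting_vector(hist, zeta):
    # Scale a data party's voting histogram down so its L2 norm is at most
    # zeta. The user-level sensitivity is then bounded by zeta regardless of
    # how many private samples the party contributed to the vote.
    norm = np.linalg.norm(hist)
    if norm > zeta:
        hist = hist * (zeta / norm)
    return hist
```

Applying this to each party's H^n_l and H^f_l before aggregation keeps the Gaussian noise scale fixed even when parties hold 100 samples instead of 8.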

Moreover, if a single party controls a large number of samples (e.g., 100), then as shown in Table 1 in our paper, that party alone can achieve good performance using sample-level DP without needing to collaborate with others.

Q: Results for more PLMs of different architectures with K=2 to support the PLM-agnostic claim.

Table H shows results for 5 open-source PLMs. These results show that pair-wise performance exceeds that of each participating PLM alone, supporting our PLM-agnostic claim.

Table H. STM performance for K=1 (diagonal) and K=2 (off-diagonal) WASP with N=6000, M=100.

          GPT-2   Llama-2  Vicuna  OPT     ChatGLM3  Flan-T5
GPT-2     85.65   86.72    85.91   85.85   86.92     89.33
Llama-2           85.82    85.92   86.23   86.97     89.45
Vicuna                     82.90   85.80   86.18     89.30
OPT                                84.32   86.22     89.32
ChatGLM3                                   86.12     89.63
Flan-T5                                              89.28

[4] Charlie Hou et al., PrE-Text: Training Language Models on Private Federated Data in the Age of LLMs, ICML 2024.

[5] Anda Cheng et al., Differentially Private Federated Learning with Local Regularization and Sparsification, CVPR 2022.

[6] Fumiyuki Kato et al., ULDP-FL: Federated Learning with Across-Silo User-Level Differential Privacy, VLDB 2024.

Reviewer Comment

Thank you for the detailed rebuttal. The authors have addressed the key concerns well. Overall, the rebuttal significantly improves the submission. I’m upgrading my score.

Author Comment

We are truly grateful for your effort and your acknowledgment of our rebuttal. Thank you so much for upgrading the score! We will be sure to incorporate the rebuttal into our final draft for refinement!

Review (Rating: 3)

This paper studies differentially private generation of synthetic data using LLM APIs. The general idea builds on the Private Evolution (PE) line of prior work, which generates synthetic samples by resampling the ones that are closer to the private dataset. This work improves upon existing methods on three fronts: (1) it works with less private data, (2) it addresses low-quality synthetic samples, and (3) it mitigates the model bias of the LLM APIs. The proposed method, WASP, uses (1) a Top-Q voting mechanism where private samples vote for the nearest and furthest synthetic samples, (2) dynamically weighted LLM-based data generation informed by private histograms, and (3) contrastive in-context learning to enhance the quality and relevance of generated synthetic samples. Experiments show SOTA results compared to Aug-PE when the number of private samples is small.

Questions For Authors

It would be good to see the experiment results after fixing this delta issue.

Claims And Evidence

There are two claims. For privacy, the authors theoretically prove the differential privacy of the voting mechanism. For utility, the authors demonstrate the performance through experiments. The privacy proof is problematic and might require more clarity; see below for details.

Methods And Evaluation Criteria

Yes.

Theoretical Claims

Yes. Issue: Lemma D.3 is supposed to be the advanced composition theorem, but it is not cited correctly: the δ composition is missing. This could significantly weaken the privacy guarantee.

Experimental Designs Or Analyses

Yes. Note that the private dataset has size 100, which is specifically designed to highlight the advantage of the proposed method.

Supplementary Material

Some of the proofs.

Relation To Broader Scientific Literature

This falls under a new line of research on using LLM APIs to generate synthetic data similar to a private dataset. The citations are adequate.

Essential References Not Discussed

No.

Other Strengths And Weaknesses

I think the overall quality is good except for this mistake in the privacy proof. The contribution is incremental compared to prior work, but still good progress toward making the generation quality better.

Other Comments Or Suggestions

None.

Author Response

Q: Lemma issue and results.

Thank you for pointing this out. Lemma D.3 should be amended with "with δ_total increased to larger than T×δ".

Denote by δ_total and δ_iter the final δ and the per-iteration δ, respectively. Since we perform 5 iterations to collect the total synthetic dataset, with 4 iterations involving DP (the first iteration performs zero-shot generation without real-sample guidance), the final δ_total > 4·δ_iter. Therefore, as we applied δ_iter = 1×10⁻⁵ in our experiments, the final δ_total > 4×10⁻⁵ for all PE-series baselines and WASP. Moreover, following Theorem 4.3 in [3], using δ_iter = 1×10⁻²³ instead guarantees overall (4.0, 1×10⁻⁵)-DP, which results in a noise scale roughly 2.14 times as large as the one used in our original experiments in the paper.
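The δ accounting here can be sketched numerically. The snippet below uses the Dwork-Roth form of advanced composition for illustration (the response itself invokes the tighter Theorem 4.3 of Kairouz et al. [3]); the function name and the choice of δ' are assumptions for the sketch:

```python
import math

def advanced_composition(eps_iter, delta_iter, T, delta_prime):
    # T-fold composition of (eps_iter, delta_iter)-DP mechanisms satisfies
    # (eps_total, T*delta_iter + delta_prime)-DP for any delta_prime > 0.
    eps_total = (eps_iter * math.sqrt(2 * T * math.log(1 / delta_prime))
                 + T * eps_iter * (math.exp(eps_iter) - 1))
    delta_total = T * delta_iter + delta_prime
    return eps_total, delta_total

# With delta_iter = 1e-5 over the 4 DP-related iterations, the composed
# delta necessarily exceeds 4e-5, which is exactly the reviewer's point.
_, delta_total = advanced_composition(1.0, 1e-5, 4, 1e-6)
assert delta_total > 4e-5
```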

An experimental comparison using the original and new δ_iter is included in Table C, with other settings kept the same as for Table 1 in our paper. These results demonstrate that, under the tighter privacy guarantee (δ_iter = 1×10⁻²³), the performance decrease is only minor, indicating the robustness of WASP and the PE baselines.

Table C. Comparison of δ_iter = 1×10⁻⁵ (original experimental results) and δ_iter = 1×10⁻²³ (satisfies δ = 1×10⁻⁵) with 4 DP-related iteration steps. ε = 4.0 is fixed as the final combined DP budget. Experiments are performed using 6 open-source PLMs with L=1, M=100.

                                     Aug-PE_GPT-2  Aug-PE_Llama-2  Aug-PE_Vicuna  Aug-PE_OPT  Aug-PE_ChatGLM3  Aug-PE_Flan-T5  WASP
IMDb         δ_iter = 1×10⁻⁵         85.38         85.77           82.76          83.86       85.82            89.00           89.52
             δ_iter = 1×10⁻²³        84.88         85.30           82.04          83.52       85.22            88.83           89.18
Yelp-Rating  δ_iter = 1×10⁻⁵         45.28         47.42           54.42          50.81       55.17            58.69           61.21
             δ_iter = 1×10⁻²³        45.03         47.10           54.09          50.47       54.97            58.61           61.05

[3] Peter Kairouz et al., The Composition Theorem for Differential Privacy, ICML 2015.

Review (Rating: 3)

Two of the main strategies for generating private synthetic text or image data are private fine-tuning and private evolution. Private fine-tuning can work well but this requires access to the parameters of a generative model (many of which are closed-source) and the computational resources to fine-tune the model. Private evolution uses API access to a generative model in order to generate private synthetic data. These methods sample some synthetic data, then privately compare the synthetic data to the real data, then query the API to create new data based on the feedback generated by the real data.

The two main contributions in this work are the weighted multi-PLM fusion and the contrastive prompting. The weighted multi-PLM fusion takes advantage of the fact that there are many pretrained language models available and that some of them will be better suited for generating particular types of data. The method developed by the authors (WASP) starts by querying several generative models and, as it goes, generates fewer samples from the models that are not performing well for a given task and more samples from the models that are. Contrastive prompting modifies the PE framework by privately measuring which of the synthetic data are most similar to the real data and which are least similar. This is used to construct contrastive prompts like "generate synthetic data that are more similar to this <good example> and less similar to this <bad example>".
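The private nearest/furthest measurement described above can be made concrete with a small sketch. The function name, the specific decaying vote weights, and the embedding inputs below are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def dp_top_q_vote(private_emb, synth_emb, q=8, sigma=9.69, rng=None):
    # Each private embedding votes, with decaying weights, for its q nearest
    # synthetic embeddings (nearest histogram) and q furthest ones (furthest
    # histogram); Gaussian noise is then added to both histograms.
    if rng is None:
        rng = np.random.default_rng(0)
    n_syn = synth_emb.shape[0]
    nearest_hist = np.zeros(n_syn)
    furthest_hist = np.zeros(n_syn)
    weights = 1.0 / np.arange(1, q + 1)   # assumed decay: 1, 1/2, ..., 1/q
    weights /= weights.sum()              # each voter adds total mass 1 per histogram
    for x in private_emb:
        order = np.argsort(np.linalg.norm(synth_emb - x, axis=1))
        nearest_hist[order[:q]] += weights          # q nearest, closest first
        furthest_hist[order[::-1][:q]] += weights   # q furthest, furthest first
    noisy_nearest = nearest_hist + rng.normal(0.0, sigma, n_syn)
    noisy_furthest = furthest_hist + rng.normal(0.0, sigma, n_syn)
    return noisy_nearest, noisy_furthest
```

High-scoring entries of the nearest histogram would then serve as <good example> prompts and high-scoring entries of the furthest histogram as <bad example> prompts.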

The authors perform several experiments and ablations to demonstrate how the various components of WASP contribute to its performance and how WASP compares to baseline methods. The authors also extend WASP to a federated setting. These experiments show that WASP outperforms previous techniques across various metrics.

Questions For Authors

The x-axis of Figure 1.b is not labeled or explained in the caption. What is this plot?

For the ablation study about contrastive prompting, was the noise scale adjusted for the trials without contrastive prompting? The cost of the contrastive prompting is that you must make an additional measurement of the private data to generate the furthest histogram. If you do not apply contrastive prompting you only have to measure the nearest histogram so the sensitivity is reduced and you can make the measurement with less noise. This is not discussed in the ablation section and if it was not done this would not be a fair comparison.

The fact that FuseGen can perform well for IMDb and Yelp-category without looking at the private data is very surprising, can you explain this?

Table 6 shows that the performance of these methods is relatively insensitive to the privacy budget (especially on IMDb), does this imply that the PLMs may have memorized some of these datasets and most of the performance comes from that?

In figure 6, what is the difference between the Aug-PE line and the w/o Con line?

The various commercial PLMs have different costs, could your fusion framework be modified to trade off performance and API costs?

Claims And Evidence

Aug-PE performs poorly when using the “incorrect” PLM - The authors show that the performance of Aug-PE depends on the PLM that is being used and that the best PLM to use for one dataset is not necessarily the best for every dataset. This claim is relatively well supported and is a good motivation for multi-PLM fusion. This is further supported by the ablation study showing that performance improves with the number of PLMs included in WASP.

WASP’s advantage over Aug-PE extends to the federated setting - The authors show that federated-WASP performs better than PreText (which is a version of Aug-PE that works in the federated setting).

Contrastive prompting and Multi-PLM fusion are both important to the performance of WASP - This claim is supported by an ablation study summarized in Table 3. However, I am dubious about the result for the importance of contrastive prompting. I have a question about this in the questions section of the review.

Top-Q voting performs better than top-1 voting - Table 5 shows for 2 datasets that performance improves when Q increases.

Methods And Evaluation Criteria

The proposed methods and evaluation criteria do make sense for the task at hand.

Theoretical Claims

I reviewed the claim that WASP satisfies (ε, δ)-DP; it uses standard DP results such as the privacy of the Gaussian mechanism and composition of multiple Gaussian mechanisms. This claim is correct.

Experimental Designs Or Analyses

I am curious about the ablation results in Table 3 and would like to see these results for the other datasets as well, if possible. I also have a question about how this experiment was performed in the questions section.

Supplementary Material

I reviewed the entire supplementary material, including the additional experiments, privacy proof, full algorithm descriptions, and the contrastive and non-contrastive prompts.

Relation To Broader Scientific Literature

DP synthetic text is a rapidly developing field that has recently become much more practical with the development of LLMs. This work could be an important step in improving private evolution, which is one of the primary techniques being explored for this problem.

Essential References Not Discussed

I do not believe that there are essential references not discussed here.

Other Strengths And Weaknesses

Strengths:

The PLM fusion technique is intuitive and well supported by the experiments; I am confident that this technique increases the quality of the DP synthetic data.

The evaluation of these methods is very comprehensive across several popular datasets and many PLMs

Weaknesses:

I am not sure that the contrastive prompting is beneficial, especially if the change in histogram sensitivity is not accounted for in the ablation.

I am worried that much of the performance gains could come from PLM memorization. If the various PLMs memorized some data sources more than others then this could be the reason multi-PLM fusion is beneficial, not because the PLMs are better suited for the task in some other way. The relative insensitivity to epsilon makes me worry about this more.

Other Comments Or Suggestions

No other comments.

Author Response

We appreciate these insightful comments from our reviewer which help us improve our work.

Q: Different noise scale for different function sensitivity considering contrastive prompting.

You are absolutely right, and we indeed applied different noise scales according to the different sensitivity values of the different methods in our ablation study. Specifically, while the noise applied in WASP uses a sensitivity of 4 (see Theorem D.2), the noise scale was halved (from 9.690 to 4.845) for the "w/o contrastive prompting" ablation, as the sensitivity was halved, to ensure a fair comparison. Thus, histogram sensitivity changes have been accounted for in the ablation. We will clarify this in the final version.
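The halving follows from the linear dependence of the Gaussian-mechanism noise scale on the sensitivity; the classic calibration bound below is a sketch of why (WASP's actual accountant may be tighter, but the proportionality, and hence the 9.690 to 4.845 halving, is the same):

```python
import math

def gaussian_sigma(sensitivity, eps, delta):
    # Classic analytic bound for the Gaussian mechanism:
    # sigma >= sensitivity * sqrt(2 * ln(1.25 / delta)) / eps.
    # sigma scales linearly with the sensitivity, so halving
    # Delta (4 -> 2) halves the required noise scale.
    return sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / eps
```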

Q: Good performance of FuseGen and performance origination of WASP.

The success of FuseGen relies on fusing several PLMs, which provides a broader and more diverse knowledge base for training an STM that generalizes better on the real dataset compared to using a single PLM. As shown in Table A below and in [2], FuseGen boosts the performance of the trained STM compared to the best-performing single PLM for the task (Flan-T5 for both tasks in Table A), to a larger extent for more common tasks like IMDb, resulting in superior performance.

Therefore, fusing the knowledge of multiple PLMs is the first source of WASP's performance improvement.

What's more, the addition of private samples also contributes to the final performance of WASP. Previous work ZeroGen [1] explores the performance of an STM trained on the synthetic dataset given by a single PLM without the help of private data, which reveals the base level of knowledge "memorized" in a single PLM. As shown in Table A, using Flan-T5, ZeroGen achieves a performance around 2.5% lower than Aug-PE for both datasets with private data completely exposed (ε=∞). Similar patterns can be observed in the multi-PLM setting by comparing FuseGen and WASP with ε=∞. Therefore, although different PLMs have different levels of knowledge about each dataset (as shown by the ZeroGen results here and in [2]), involving even a small number of private samples consistently helps further boost STM performance.

Moreover, the "insensitivity to privacy budget" of our WASP (and Aug-PE) in Table 6 demonstrates the robustness of our proposed method.

Table A. Comparison of STM performance of different methods. Flan-T5 is used for ZeroGen and Aug-PE as they are single-PLM methods, while K=6 is used for FuseGen and WASP.

             ZeroGen  Aug-PE (ε=∞)  FuseGen  WASP (ε=∞)
IMDb         87.06    89.48         89.07    89.96
Yelp-Rating  57.08    59.62         57.96    62.02

Q: Meaning of x-axis in Figure 1.b.

Sorry, the x-axis of Figure 1.b is the task performance (accuracy) of the STM trained on the synthetic dataset. We will add this to our final version.

Q: Lines in Figure 6.

As stated in the caption of Figure 6, Top-Q voting (Q=8) is used for "w/o Con", whereas Q=1 is used for "Aug-PE". Note that, to guarantee the same privacy budget, the noise scale also increases (to twice its original value) as Q increases from 1 to 8. The difference between "w/o Con" and "Aug-PE" demonstrates the effectiveness of applying Top-Q voting with Q>1.

Q: Trade-off between performance and cost.

Yes, WASP can be modified to take API costs into account by adjusting each PLM's importance weight with its associated API cost. For example, assuming each PLM k costs v_k per query on average, a trade-off function w_k = U(w_k, v_k) can be applied to balance cost and performance and obtain the final PLM weight w_k. To keep w_k and v_k at the same scale, we can first normalize each v_k to range from 0.0 to 1.0. U can be selected according to users' needs, e.g., U(w_k, v_k) = λ·w_k + (1−λ)·v_k with a tunable parameter λ. Finally, all the adjusted w_k are normalized as w_k = w_k / Σ_{k'=1..K} w_{k'}.
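A minimal sketch of this adjustment, assuming NumPy and the linear trade-off U given in the response (the function name is hypothetical):

```python
import numpy as np

def cost_adjusted_weights(w, v, lam=0.5):
    # Normalize per-query costs v_k to [0, 1], combine with the
    # performance-based weights w_k via U(w, v) = lam*w + (1 - lam)*v,
    # then renormalize so the final weights sum to 1.
    # lam is the tunable trade-off parameter from the response.
    w = np.asarray(w, dtype=float)
    v = np.asarray(v, dtype=float)
    v_norm = (v - v.min()) / (v.max() - v.min() + 1e-12)  # costs scaled to [0, 1]
    combined = lam * w + (1 - lam) * v_norm
    return combined / combined.sum()
```

Depending on whether lower cost should increase a PLM's weight, one might substitute (1 − v_norm) for v_norm; the response deliberately leaves the choice of U to the user's needs.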

Q (Experimental Designs Or Analyses): Other dataset results with closed-source PLMs for Table 3.

We show results for IMDb with closed-source PLMs in Table B, which also show the superiority of WASP compared to baseline methods.

Table B. STM performance using M=100 with 3 closed-source PLMs (K=3 for WASP) under the same DP setting used in Table 3.

      Only Private  FuseGen  Aug-PE_GPT-3.5  Aug-PE_GPT-4  Aug-PE_GPT-4o  WASP
ACC   50.00         84.12    84.16           83.28         85.82          86.34

[1] Jiacheng Ye et al., ZeroGen: Efficient Zero-shot Learning via Dataset Generation, EMNLP 2022.

[2] Tianyuan Zou et al., FuseGen: PLM Fusion for Data-generation based Zero-shot Learning, EMNLP 2024.

Reviewer Comment

Thank you for your response, I have adjusted my score based on your clarifications. I hope that the details you have provided me about how the privacy noise is adjusted in the ablations make it into further drafts of the manuscript.

Author Comment

Dear reviewer, thank you very much for taking the time to reconsider the score. We sincerely appreciate your feedback and will make sure to incorporate the clarifications regarding privacy noise adjustment into the next version of the manuscript.

Final Decision

This paper proposes an algorithm, WASP, that improves private evolution for differentially private synthetic data generation. Instead of querying a single LLM API, WASP queries multiple LLM APIs and uses a carefully designed mechanism to balance among the APIs.

This paper received borderline scores (3, 3, 3, 4), with all reviewers in favor of acceptance. DP synthetic data and the private evolution family of algorithms is a timely and important topic. WASP shows empirical improvement in experiments and releases code.

Reviewers also raised concerns about the DP theory (reviewer Eqwr), fair comparison of computation (reviewer rN1N), the number-of-data generation setting in the experiments (reviewer Eqwr), and the source of the improvement, especially possible memorization by the multiple pre-trained models (reviewer gbiW). All reviewers acknowledged reading the authors' response, and some raised their scores as the authors successfully addressed the main concerns. I would strongly encourage the authors to further clarify these important issues in the next version, in particular the number-of-data generation setting (reviewer Eqwr) and memorization by multiple pre-trained models (reviewer gbiW).