5.8

/10

Rejected4 位审稿人

最低5最高8标准差1.3

3.8

置信度

正确性2.8

贡献度2.8

表达2.8

ICLR 2025

A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage

Rui Xin,Niloofar Mireshghallah,Shuyue Stella Li,Michael Duan,Hyunwoo Kim,Yejin Choi,Yulia Tsvetkov,Sewoong Oh,Pang Wei Koh

OpenReview PDF

提交: 2024-09-27更新: 2025-02-05

TL;DR

Privacy evaluation for quantifying disclosure risks of sanitized dataset release beyond surface level, exposing false sense of privacy

摘要

关键词

PrivacyNLPTextReidentificationData ReleaseSanitizationAnonymization

评审与讨论

审稿意见

评分: 5置信度: 42024-10-22

This paper investigates the current limitations of existing textual data sanitization methods. By considering re-identification attacks with known auxiliary information, the paper shows that a sparse retriever can link sanitized records with target individuals even though the PII patterns are anonymized. Instead, the paper proposes a new privacy evaluation framework for the release of sanitized textual datasets. The paper considers two datasets, including MedQA and WildChat, to show that seemingly innocuous auxiliary information can be used to deduce personal attributes like age or substance use history from the synthesized dataset. Experimental results also verify that current data sanitization methods create a false sense of privacy only on the surface level.

优点

The paper is well-written and easy to follow. All the included sanitization methods are up-to-date and well-explained.
The proposed method is straightforward in decomposing the re-identification with linking and matching methods.
Experimental results are comprehensive with sufficient ablation experiments. The included baselines are solid and up-to-date.

缺点

My major concern is that the auxiliary data is highly contrived. Based on my understanding, each auxiliary sample is the subset of exact atomics from the target record. For example, in Fig 1, Auxiliary Information contains two atoms of the Original Record. That is, if you only consider de-identified, sanitized records, it is very easy for your BM25 retriever to get the sanitized target. In real-world re-identification attacks, there is no such auxiliary information that has many exact overlapped n-grams as original records.
For the claim that 'private information can persist in sanitized records at a semantic level, even in synthetic data,' if you consider DP generation, the privacy level is indicated by $(\epsilon, \delta)$ . That is, your linked record may not be the original target sample. DP introduces random noise to offer plausible deniability and protect the original record's privacy.
The implemented methods for the proposed privacy evaluation framework only integrate various existing components by using the contrived auxiliary data. It is not likely to scale this framework for a large number of overlapped atoms.

问题

Please refer to my weaknesses. Also, I have a few new questions.

How can your method extend to other datasets? Is there any real auxiliary data that can be used instead of creating overlapped auxiliary data from the original records?
Regarding the concept of privacy, is converting the age of 23 to early 20s a privacy breach? Such conversion is commonly adopted by K-anonymity.

评论- Response to Reviewer oF1E Part 2

2024-11-22

”Regarding the concept of privacy”

Whether converting "23" to "early 20s" constitutes a privacy breach depends on context and potential harm (Shao et al., 2024). Our framework doesn't make this normative judgment - instead, it measures information persistence after sanitization. We find concerning pattern preservation even with aggressive sanitization: medical records maintain linked combinations of symptoms, age ranges, and conditions (Zhang et al., 2024), while chat data preserves writing styles and topic preferences. These persistent patterns enable re-identification through modern machine learning techniques.

Our results reveal fundamental limitations in current text privacy approaches, demonstrating the need for more sophisticated protection mechanisms that consider semantic-level information leakage. This is particularly crucial as organizations increasingly handle sensitive text data across healthcare, customer service, and other privacy-critical domains, and as private user data holds the key to unlocking new model capabilities.

Reference

Zhang, Z., Jia, M., Lee, H. P., Yao, B., Das, S., Lerner, A., ... & Li, T. (2024, May). “It's a Fair Game”, or Is It? Examining How Users Navigate Disclosure Risks and Benefits When Using LLM-Based Conversational Agents. In Proceedings of the CHI Conference on Human Factors in Computing Systems (pp. 1-26).

Shao, Y., Li, T., Shi, W., Liu, Y., & Yang, D. (2024) PrivacyLens: Evaluating Privacy Norm Awareness of Language Models in Action. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track.

评论- Rebuttal Acknowledge

2024-11-24

Thanks for your rebuttal to clarify a few disagreements. Except I still think that the auxiliary data is highly contrived, I agree with the other comment. So, I will raise my review score.

评论- Response to Reviewer oF1E

2024-11-26

Thanks for your rebuttal to clarify a few disagreements. Except I still think that the auxiliary data is highly contrived, I agree with the other comment. So, I will raise my review score.

Thank you for your feedback and score revision. To address the concerns about auxiliary data, we conducted an additional experiment where we used an LM (LLaMa 3 8B; prompt in Appendix D.2.2) to paraphrase the auxiliary information to reduce direct textual overlap with the original text. For example, the auxiliary information

"Auscultation of the lungs does not reveal any significant abnormalities. He consumed 3 glasses of the drink before symptoms developed. On physical examination, he is disoriented."

is paraphrased into

"A thorough examination of the patient's lungs did not uncover any notable issues. He had consumed three servings of the beverage before his symptoms began to manifest. Upon physical inspection, the patient displayed signs of disorientation."

Overall, bi-gram overlap (as measured by ROUGE-2 precision) between the paraphrased and original auxiliary information decreases from 71.0% to 19.9% for MedQA and from 40.5% to 21.0% for WildChat.

We repeated our privacy analysis using the new paraphrased auxiliary information and found that:

The relative performance patterns across sanitization methods remain consistent whether using original or paraphrased auxiliary data—methods showing higher leakage with original auxiliary data also show higher leakage with paraphrased data. For example, the relative distances between AzureAI PII tool < Dou et al. (2024) is preserved when we switch to paraphrased auxiliary data.
Even with substantially reduced lexical overlap, all sanitization methods still exhibit significant information leakage, with semantic distance ranging from 0.22 to 0.57 when using paraphrased auxiliary data. A semantic distance of 0.57 means roughly that 43% of the information is leaked (assuming no partial information leakage). As you pointed out, BM25 is particularly sensitive to paraphrasing, so we expect we would be able to recover even more information using a semantic (dense) retriever.

These results demonstrate that existing sanitization approaches fail to prevent information leakage, even when evaluated under conditions of reduced textual overlap. We have added this analysis to the revision in Appendix B.1. Thank you for raising this question.

Dataset	Sanitization Method	Semantic Distance	Semantic Distance with Paraphrased Aux Info
MedQA	No Sanitization	0.04	0.22
	Sanitize & Paraphrase	0.31	0.35
	Azure AI PII tool	0.06	0.26
	Dou et al. (2023)	0.34	0.50
	Staab et al. (2024)	0.33	0.57
WildChat	No Sanitization	0.19	0.26
	Sanitize & Paraphrase	0.44	0.50
	Azure AI PII tool	0.21	0.30
	Dou et al. (2023)	0.22	0.28
	Staab et al. (2024)	0.40	0.47

Table 1. Privacy scores measured using original vs. paraphrased auxiliary information across sanitization methods.

2024-12-02

Thank you again for your detailed response to our rebuttal and helpful feedback throughout the review process. As we approach the discussion period deadline, we remain available to address any additional aspects requiring further clarification. We look forward to engaging with any remaining questions you may have.

评论- Response to Reviewer oF1E Part 1

2024-11-22

We thank the reviewer for the helpful feedback and highlighting our strengths, including the comprehensive benchmarking of sanitization methods, straightforward pipeline, comprehensive experiments and ablations, and easy to follow write-up. We hope the explanations below address the reviewer’s remaining concerns and questions.

”My major concern is that the auxiliary data is highly contrived.”

Thank you for raising the concern about the lexical overlaps between the auxiliary data and original records. We agree that in realistic applications, auxiliary information can come in more nuanced or complex formats. However, in our experiments, we use the same auxiliary information to highlight differences between existing lexical-based and proposed semantic-based privacy metrics, showing that our semantic-based metric uncovers more leakage. However, if the auxiliary information is more nuanced—e.g. semantically similar to the original atoms—it becomes even harder for privacy metrics to detect, further demonstrating our point of these lexical methods providing a “false sense of privacy.”

While our auxiliary setup may appear contrived, privacy guarantees must account for worst-case scenarios (Dwork & Roth, 2014). Real-world privacy breaches like the Netflix Prize de-anonymization (Narayanan & Shmatikov, 2008) demonstrate how seemingly innocuous auxiliary information enables re-identification. Our ablation studies validate framework robustness across varying information settings: MedQA shows linking rates of 58-78% for LLM-based sanitization and 81-94% for PII removal, while WildChat maintains consistent rates of 56-62% across methods. This variation in success rates indicates our framework captures meaningful privacy risks rather than artificially inflated matches.

Reference

Narayanan, Arvind, and Vitaly Shmatikov. "Robust de-anonymization of large sparse datasets." In 2008 IEEE Symposium on Security and Privacy (sp 2008), pp. 111-125. IEEE, 2008.

Dwork, Cynthia, and Aaron Roth. "The algorithmic foundations of differential privacy." Foundations and Trends® in Theoretical Computer Science 9, no. 3–4 (2014): 211-407.

”For the claim that 'private information can persist in sanitized records at a semantic level,”

While differential privacy provides formal guarantees through $\epsilon$ , its practical implications for language models remain unclear (Habernal, 2022). Different $\epsilon$ values have ambiguous meaning for text privacy - our work provides empirical quantification of these guarantees. With $\epsilon$ =1024, we observe improved privacy scores (0.92 from 0.43) but significant degradation in both task performance (0.62 to 0.40) and text coherence (3.44 to 2.25). This aligns with recent findings showing that DP's theoretical guarantees may not translate directly to meaningful privacy protection in high-dimensional text data (Brown et al. 2022).

Reference

Brown, H., Lee, K., Mireshghallah, F., Shokri, R., & Tramèr, F. (2022, June). What does it mean for a language model to preserve privacy?. In Proceedings of the 2022 ACM conference on fairness, accountability, and transparency (pp. 2280-2292).

Habernal, Ivan. “When Differential Privacy Meets NLP: The Devil Is in the Detail.” In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 1522–28. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, 2021. https://doi.org/10.18653/v1/2021.emnlp-main.114.

”It is not likely to scale this framework for a large number of overlapped atoms.”

Our framework demonstrates practical effectiveness using just 3 claims for meaningful privacy evaluation, contradicting concerns about scalability. This efficiency stems from our novel semantic matching approach (detailed in Section 3.2) which captures information leakage without requiring exhaustive claim combinations. The framework adapts naturally across domains - medical records separate into symptoms, history, and demographics (average 15.6 claims/document), while conversational data follows dialogue structure and topic boundaries (Wang et al., 2023).

Reference

Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., ... & Zhou, D. (2022). Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.

审稿意见

评分: 5置信度: 32024-11-03

This paper introduces a privacy evaluation framework for data sanitization methods, specifically data anonymization and data synthesis, in the context of natural language. The framework can be summarized as follows: first, random records from the original data are sampled as the auxiliary data; then, an information retrieval technique is used to link the auxiliary data with the sanitized data, and an LLM is utilized to evaluate the semantic similarity between the original records and the linked records. The final similarity scores denote the degree of privacy leakage of the sanitized data.

优点

This paper focuses on the privacy leakage of text data. The authors design different prompts for LLM to evaluate the semantic similarity between two sentences, which is interesting. The experimental results are extensive. Multiple data sanitization techniques are included in the evaluation framework.

缺点

The main issue of this paper is the definition of privacy leakage, which the authors equate with semantic similarity between auxiliary data and sanitized data. However, the semantic information of a sentence reflects its utility. If the semantic content of a sanitized sentence is altered, the sanitization method would be useless. Traditional data anonymization methods aim to remove only identifiers from a data record rather than all information. In this context, identifiers should be the privacy focus, and privacy leakage should refer specifically to identifier leakage.
The technical novelty is relatively limited. The linking step uses the existing BM25 retriever, while the semantic similarity evaluation mainly relies on established prompt engineering techniques.
The findings are not particularly interesting, as it is well-known that simple data anonymization and data synthesis techniques are insufficient to protect data privacy. This paper's findings merely confirm that this limitation also applies to text data.
The numerical results rely on LLM output, which is relatively qualitative and less persuasive. Additionally, querying LLaMA three times for consistency seems unnecessary; disabling sampling parameters in text generation should ensure consistent results from LLaMA for the same query.

问题

N/A

评论- Response to Reviewer 18vb Part 2

2024-11-22

“The numerical results rely on LLM output …”

We conducted thorough human evaluation showing strong agreement (0.93 Spearman Correlation) between annotators and LLM judgments, validating our approach. LLM-based evaluation is increasingly accepted in the research community (Chiang & Lee, 2023; Zheng et al., 2023), with recent work using similar approaches for code similarity assessment (Chon et al., 2024), text generation evaluation (Wang et al., 2023), and information extraction validation (Hsu et al., 2024). Our choice to query LLaMA three times is supported empirically by a range of prior works on self-consistency of LLM prompting (Wang et al., 2022). There is significant instruction-following inconsistency with single queries (where the agreement drops to 0.84 Spearman Correlation).

We would be happy to clarify any of these points further or provide additional details about specific aspects of our methodology.

2024-12-02

Thank you again for your review of our submission. As we approach the discussion closure deadline, we remain available to address any aspects requiring further clarification. We look forward to engaging with any additional questions you may have.

评论- Response to Reviewer 18vb Part 1

2024-11-22

We appreciate your detailed review and constructive feedback. We thank you for highlighting that we have an extensive set of experiments as well as our methodology. We hope the following response addresses your concerns.

“identifiers should be the privacy focus, and privacy leakage should refer specifically to identifier leakage.”

While identifier leakage is a necessary component for measuring privacy leakage, measuring it alone is not sufficient to ensure privacy. Modern privacy threats increasingly leverage semantic patterns and quasi-identifiers (Ganta et al., 2008). Moreover, real-world privacy breaches like the Netflix Prize de-anonymization (Narayanan & Shmatikov, 2008) demonstrate how de-identified information, which has no identifier leakage, enables the breach of privacy. It is therefore important to go beyond identifier leakage for a proper measurement of privacy.

Furthermore, we agree with the reviewer that the semantic information is heavily tied to the utility of the record; however, there is a long-standing tradeoff between privacy and utility, which is complicated by the fact that privacy is inherently context-dependent (Nissenbaum, 2004, Shao et al., 2024). Our work does not attempt to make normative judgments about what constitutes a privacy violation - rather, we provide a quantitative framework for measuring information persistence after sanitization. We aim to help disentangle the complex relationship between privacy and utility by providing a framework to measure and better understand these trade-offs. Our broader view of privacy is especially critical given the unprecedented scale and intimacy of user-LLM interactions.

References

Nissenbaum, H. (2004). Privacy as contextual integrity. Wash. L. Rev., 79, 119.

Narayanan, Arvind, and Vitaly Shmatikov. "Robust de-anonymization of large sparse datasets." In 2008 IEEE Symposium on Security and Privacy (sp 2008), pp. 111-125. IEEE, 2008.

“The technical novelty is relatively limited …”

“The findings are not particularly interesting…”

Our work addresses a critical gap: while privacy limitations are documented for structured data (Stadler et al., 2022), text data presents unique challenges that current methods fail to address. Major cloud providers and healthcare organizations continue to rely on simple PII detection and de-identification (Johnson et al., 2020), following outdated privacy models (Garcia et al., 2019). Our results quantify this failure - showing 94% information leakage with state-of-the-art PII removal methods - and provide compelling evidence that current approaches are fundamentally inadequate. We argue that our findings of textual sanitization methods falsely preserve privacy uncovers a fundamental issue beyond merely applying it to text data.

The urgency of this work is amplified by the scale of personal data sharing with LLMs (over 200M monthly ChatGPT users) and users' demonstrated tendency to share more intimate details with AI systems than human interlocutors (Zhang et al., 2024). This combination of increased disclosure and inadequate protection mechanisms creates significant privacy vulnerabilities that practitioners can no longer ignore.

References

Johnson, A. E., Bulgarelli, L., & Pollard, T. J. (2020, April). Deidentification of free-text medical records using pre-trained bidirectional transformers. In Proceedings of the ACM Conference on Health, Inference, and Learning (pp. 214-221).

Garcia, D. (2019). Privacy beyond the individual. Nature human behaviour, 3(2), 112-113.

Stadler, Theresa, Bristena Oprisanu, and Carmela Troncoso. “Synthetic Data -- Anonymisation Groundhog Day.” arXiv, January 24, 2022. http://arxiv.org/abs/2011.07018.

审稿意见

评分: 8置信度: 42024-11-04

The manuscript seeks to highlight privacy concerns in text-sanitization techniques, by a) proposing a semantic-similarity based privacy metric for re-identification/matching attacks, and b) evaluating state-of-the-art defenses against such inference under the proposed metric.

The authors use a 2-step approach; in the first 'linking' step, sanitized documents are compared to externally known auxiliary information about a target individual using a TFIDF-based sparse retriever, and in the second 'semantic matching' step, a language model is used to assess similarity between atomic claims in the retrieved document and those in the original, unsanitized document.

The paper then evaluates several defense strategies to quantify information leakage under the above framework, and find that DP based methods may provide some of the strongest protections, albeit at the cost of data utility.

优点

Extremely well-written paper, with clear motivation and research questions.
The figures in the paper are very informative, and do an excellent job of conveying key information. The running examples were great for readability.
Clear problem statement, with adequate background and justification of design choices. Creative use of LLMs as an evaluation for text-coherence.
The results about access to different sets of auxiliary information were really interesting to me. The hypothesis about the non-uniformity of LLMs' instruction-following seems intuitive, but would be interesting to quantify this in its own right.
The human subject experiments were a nice touch - useful to know the capabilities of the two models in this context.

缺点

Can't think of any immediate flaws or weaknesses. Happy to discuss further once other reviews are in.

问题

Did you find major differences between the two datasets in terms of atomizing claims in documents? It seems to me that this would be more structured in a medical Q&A dataset, as compared to LLM-user interactions.

评论- Response to Reviewer HXAX

2024-11-22

“Did you find major differences between the two datasets in terms of atomizing claims in documents? It seems to me that this would be more structured in a medical Q&A dataset, as compared to LLM-user interactions.”

Thank you for this insightful question about dataset differences. We'll address this by examining three key aspects: (1) the structural patterns we found in each dataset type, (2) how these differences affected sanitization effectiveness, and (3) the practical implications for privacy protection.

(1) Structural Differences in Claims: In MedQA, we found highly structured patterns with consistent medical attributes - 89% of records contained patient age, 81% included specific symptoms, and 63% contained medical history information, with an average of 15.6 distinct medical claims per document. This structured nature made the atomization process more systematic - we could reliably separate claims about symptoms, medical history, and demographics. However, this revealed a key privacy challenge: even after sanitization, the semantic relationships between medical attributes remained intact, making re-identification possible through these linked attributes. This was particularly problematic due to the sparsity of specific age-symptom-history combinations in medical data - unique combinations of these attributes could often identify a single patient even when individually sanitized.

(2) Dataset-Specific Sanitization Effectiveness: The structural differences led to interesting patterns in sanitization effectiveness. For MedQA, while DP-based synthesis achieved strong privacy scores (0.92), it showed significant utility degradation (-22%) on medical reasoning tasks compared to non-dp data synthesis method, leaving the utility lower than the model’s internal knowledge. This sharp utility drop occurred because medical reasoning requires precise preservation of sparse, specialized attribute combinations - even small perturbations in the relationships between symptoms, age, and medical history can change the diagnostic implications. Identifier removal performed poorly (privacy score 0.34) as it couldn't break these revealing semantic connections between medical attributes.

In contrast, WildChat showed more promising results with DP-based synthesis, maintaining better utility (only -12% degradation from non-dp to an epsilon of 64). This better privacy-utility balance stems from two key characteristics of conversational data: First, the information density is lower - unlike medical records where each attribute combination is potentially crucial, conversations contain redundant information and natural paraphrasing. Second, the success criteria for conversations are more flexible - small variations in phrasing or exact details often don't impact the core meaning or usefulness of the exchange. This made the dataset more robust to the noise introduced by DP-based synthesis while still maintaining meaningful content.

(3) Practical Guidelines for Sanitization: Our findings challenge the common practice of relying on PII removal and scrubbing methods for text privacy, showing they provide a false sense of security. These insights are particularly timely as organizations increasingly handle sensitive text data across healthcare, customer service, and other domains. We thank you again for your thoughtful feedback and will incorporate the above discussion in the paper.

2024-11-25

Thank you for your response. This adequately answers my question, and these insights would be a nice addition to the paper.

2024-12-03

Thank you for your feedback. We have incorporated these insights into the Discussion section of the revised manuscript.

审稿意见

评分: 5置信度: 42024-11-04

The paper proposes a framework to evaluate sanitization methods for releasing datasets with textual data. They highlight that obvious methods such as removing explicit identifiers like names is insufficient for properly protecting privacy, since other semantic details can also leak private information. Also, auxiliary information about an individual may be linkable to a supposedly sanitized target document, thus allowing an attacker to infer or recover further sensitive details.

The goal of the framework is the quantification of information leakage from sanitized documents given auxiliary information about the document's owner or author. The framework proposes to determine auxiliary information from each original documents by extracting individual "claims". For each document, the attacker is given a subset of claims and runs a sparse retriever to find the best-matching document from the set of sanitized documents. They then define a similarity metric, which is either determined by an LLM or the ROGUE-L score, to compute the similarity between the retrieved document and the remaining claims extracted from the original document. Additionally, they define task-specific utility metrics for each evaluated dataset.

In the evaluation, the authors consider two datasets: MedQA from the medical domain with a question-answering task, as well as WildChat consisting of online conversations with ChatGPT and a text categorization task. They also consider a range of sanitization methods that either work by removing PII or by generating synthetic data, the latter also with the option of providing differential privacy. In each scenario, the newly introduced semantic and lexical privacy metrics are computed, along with task-specific utility measures as well as the quality (coherence) of the sanitized texts. Lastly, they perform a human evaluation to determine which variant of the privacy metric best matches human preferences.

优点

Formalizing linkage attacks for unstructured text data is a nice and useful contribution, and enables a systematic evaluation of various (novel) text sanitization methods in the future.

While not entirely new, cf. e.g., [1] and the already cited (Stadler et al., 2022), the observations that superficial sanitization methods (such as PII removal) are often insufficient to properly protect privacy remains important.

For most parts, the paper is well written and easy to follow. However, there are some uncertainties about metrics and inconsistencies between numbers reported in the texts and tables, which are (in my view) confusing to the reader and undermine the validity of the currently reported results.

缺点

I stumbled across some inconsistencies between the numbers reported in the Tables and discussed in the text. Please double-check (cf. questions) and update, or explain the differences.

Some details about the metrics and their computation remain unclear (cf. questions). Please try to use consistent naming and define concepts (such as the definition of metrics/distances) in one concise and consecutive piece of text (not spread across several sections).

L323: I think the conclusion from a "disparity between lexical and semantic similarity" to "these techniques primarily modify and paraphrase text without effectively disrupting the underlying connected features and attributes" is made a bit prematurely: Both are entirely different measures, and even for "no sanitization", the lexical score is twice the semantic score. Also, what would happen if you shifted the (apart from the ordering: somewhat arbitrarily) assigned scores for the "similarity metric" in Section 2.4 from {1,2,3} to {0,1,2} or to {1,10,100}?

问题

L081, L099: Just to be sure: If I understood correctly, "claims" can refer to any sensitive or non-sensitive information in the original texts?

L101: If (1) PII removal had 94% leakage (inferable claims), and (2) data synthesis (with or without DP?) has 9% lower leakage (i.e., 85% ?), why does (3) data synthesis without DP state 57% << 85% leakage?

Section 2.3 Linking Method:

L146: Could you briefly motivate the use of the sparse BM25 retriever? Did you consider dense retrieval methods (say, using some form of text embeddings)? What are the benefits of BM25 vs. other sparse methods, or of sparse methods vs. dense methods, in particular for the linkage task at hand?
L147-148: You state your "approach aggregates the auxiliary information into a single text chunk" -- Does that mean you combine (concatenate?) all "atomized claims" x^(i)_j across all j into the "aux info" {\tilde x}^(i)? (Wouldn't hurt to write this down more explicitly.) (Just found the info in L296/Section 3.3 that the attacker gets 3 random claims for the linkage phase. I find that a bit late, it would be better to mention it directly in Section 2.3 to avoid confusion/guessing on the side of the reader.)

Section 2.4 Similarity Metric:

L156: Can you give a specific reason for querying only with claims that were not utilized in the linking phase? Besides, how do I know whether a claim about an original document was used for linking? If all atomized claims are combined into the aux info (cf. previous question), and the aux info is used as query in the retriever, wouldn't this imply that all claims are already consumed in the linking phase?
L159: If I understood correctly, you define the "similarity metric" µ between the original and linked documents by querying a language model to judge the document similarity, where you assign values on a scale from 1 (for 'identical documents') to 3. I wonder if it would make more sense to start the scale at 0, since mathematically, a metric has the property of evaluating to 0 for identical inputs. (In your case, we would get "µ(x,x) > 0" instead of "µ(x,x) = 0".)
How do I know that the atomized claims, which are used to compute µ and hence to measure privacy preservation, are actually privacy-relevant, and not just some arbitrary, privacy-insensitive facts?

L313: I'm confused regarding the symbol "µ" seemingly used for multiple purposes. It is defined in Sec. 2.4, but here, you also use it for another metric induced by ROGUE-L scores.

L317 (also L386): You state "zero-shot prompting achieves an accuracy of 0.44" for MedQA task utility, but why am I unable to find that result in Table 1 (for "No Sanitization", it says 0.69)?

Table 1:

Calling the privacy metrics "Overlap" and "Similarity" is very confusing, since they actually mean the opposite (high lexical overlap and semantic similarity would indicate a high agreement between the two documents, but high scores in Table 1 mean good privacy). Name them lexical/semantic "distance" instead?
Talking about metrics: In Equation 1 you define a "privacy metric", I guess that is what is reported under the (why differently named?) "Semantic Similarity" column in Table 1. It is based on the "similarity metric" from Section 2.4, which has values between 1 and 3 -- How does it end up with values between 0 and 1 in Table 1?? I couldn't see any discussion on some form of normalization of these scores. The expected value of scores >= 1 in Eq. 1 would still result in a value >= 1, and not in [0,1). Please double-check how you actually compute the metrics. Try not to distribute information pertaining to one concept across the paper, but put it concisely into one place if possible. Also prefer consistent naming.

Table 2:

The effect of \epsilon appears surprisingly small to me, with only minimal changes across all metrics even when comparing \epsilon=3 and 1024. Can you explain this behavior?
It would be interesting to compare with a random baseline where the utility is determined from completely random texts -- to rule out that 0.4x task utility in the case of MedQA can already be achieved based on completely random input (say, if the dataset suffers from strong class imbalance and the classifier always just guesses the largest class, thus obtaining an overly optimistic accuracy).

Table 3:

L404: What exactly is the "linkage rate"? Please specify.
L423: Contradicting statements: Here, you state the last three claims are used, previously in L296, you mentioned 3 randomly selected claims.

L443/Section 2.4: If you can switch the similarity metric µ also to ROGUE-L, please already state this as possible option in Section 2.4 where you introduce µ. Currently, you only say there that µ is determined by querying a language model.

Lastly, what are your thoughts on information that is both privacy-sensitive and utility-relevant, say, if one or more atomized claims are also strongly related to a downstream task? For instance, what if an atomized claim turns out to be "John likes baseball", and one of the WildChat categories is "baseball", too? Feel free to substitute "baseball" with something more delicate, such as "drinking alcohol". (Case A: If the baseball aspect is kept in the sanitized document, both µ and the chi^2 distance should be small, indicating poor privacy but good utility. Case B: If the baseball aspect was redacted, both µ and chi^2 should be larger, indicating better privacy but poorer utility.)

Additional considerations for related work: [1] also highlights the insufficiencies of superficial sanitization methods for text. [1] and also [3,4,5] propose differentially private methods that obfuscate texts. An evaluation framework for text rewriting has also been introduced previously [6]. [2] has been published in parallel with (Yue et al., 2023) and also suggests differentially private synthetic text generation.

[1] Weggenmann & Kerschbaum, "SynTF: Synthetic and Differentially Private Term Frequency Vectors for Privacy-Preserving Text Mining", SIGIR 2018
[2] Mattern et al., "Differentially Private Language Models for Secure Data Sharing", EMNLP 2022
[3] Weggenmann et al. "DP-VAE: Human-Readable Text Anonymization for Online Reviews with Differentially Private Variational Autoencoders", WWW 2022
[4] Igamberdiev & Habernal, "DP-BART for Privatized Text Rewriting under Local Differential Privacy", ACL Findings 2023
[5] Bo et al., "ER-AE: Differentially Private Text Generation for Authorship Anonymization", NAACL 2019
[6] Igamberdiev et al. "DP-Rewrite: Towards Reproducibility and Transparency in Differentially Private Text Rewriting", COLING 2022

评论- Response to Reviewer Jmiv Part 4

2024-11-22

Talking about metrics: In Equation 1 you define a "privacy metric", I guess that is what is reported under the (why differently named?) "Semantic Similarity" column in Table 1. It is based on the "similarity metric" from Section 2.4, which has values between 1 and 3 -- How does it end up with values between 0 and 1 in Table 1?? I couldn't see any discussion on some form of normalization of these scores. The expected value of scores >= 1 in Eq. 1 would still result in a value >= 1, and not in [0,1). Please double-check how you actually compute the metrics. Try not to distribute information pertaining to one concept across the paper, but put it concisely into one place if possible. Also prefer consistent naming.

Thank you for identifying this inconsistency. The semantic similarity metric indeed originates from a 1-3 scale. We normalize these scores to the [0,1] range. We have consolidated all metric-related information, including this normalization step, in Section 2 to ensure clarity and completeness. Additionally, we have standardized the terminology throughout the paper to consistently refer to these metrics using the same names in both the methodology section and results discussion.

The effect of \epsilon appears surprisingly small to me, with only minimal changes across all metrics even when comparing \epsilon=3 and 1024. Can you explain this behavior?

Explained above.

It would be interesting to compare with a random baseline where the utility is determined from completely random texts -- to rule out that 0.4x task utility in the case of MedQA can already be achieved based on completely random input (say, if the dataset suffers from strong class imbalance and the classifier always just guesses the largest class, thus obtaining an overly optimistic accuracy).

Thank you for this suggestion. We have a related baseline that measures the language model's inherent knowledge bias. Instead of using random text input, we evaluate the model's performance when given only the question without any context or private information, achieving 0.44 accuracy.

L404: What exactly is the "linkage rate"? Please specify.

There are two stages in our pipeline: the linking stage using the linking method L, and the stage where we apply the similarity metric μ. The linkage rate measures the percentage of documents correctly matched with their corresponding auxiliary information using linking method L. For this metric, we only report sanitization methods that preserve a correspondence between original and sanitized documents.

L423: Contradicting statements: Here, you state the last three claims are used, previously in L296, you mentioned 3 randomly selected claims.

We use randomly selected claims in our experiments. We have fixed this inconsistency in the manuscript.

L443/Section 2.4: If you can switch the similarity metric µ also to ROGUE-L, please already state this as possible option in Section 2.4 where you introduce µ. Currently, you only say there that µ is determined by querying a language model.

Thank you. We have updated our paper to distinguish between the language model-based metric and the ROUGE-L based privacy metric.

Lastly, what are your thoughts on information that is both privacy-sensitive and utility-relevant, say, if one or more atomized claims are also strongly related to a downstream task? For instance, what if an atomized claim turns out to be "John likes baseball", and one of the WildChat categories is "baseball", too? Feel free to substitute "baseball" with something more delicate, such as "drinking alcohol". (Case A: If the baseball aspect is kept in the sanitized document, both µ and the chi^2 distance should be small, indicating poor privacy but good utility. Case B: If the baseball aspect was redacted, both µ and chi^2 should be larger, indicating better privacy but poorer utility.)

Answered above.

Additional considerations for related work

Thank you! We have added them in the paper.

2024-11-29

Thank you for providing the additional explanations. Most make sense, and I assume that you will add the additional explanations to the paper where they are required/helpful for readers to better/quicker follow the paper.

For Table 2, I agree that large $\epsilon$ values can still provide good protection in DP-SGD. However, wouldn't it make sense to focus your evaluation on a range of smaller values, say, $\epsilon\in[0.5,3]$ ? Also, while DP improves privacy for MedQA, it seems to be more detrimental for WildChat, where the drop in utility is more significant than the improvement in privacy. (You make a general claim in L382-383 that "implementing DP, even with relaxed guarantees such as ε = 1024, significantly enhances privacy protection", however, this only seems to apply to MedQA.)

评论- Response to Reviewer Jmiv Part 3

2024-11-22

L147-148: You state your "approach aggregates the auxiliary information into a single text chunk" -- Does that mean you combine (concatenate?) all "atomized claims" x^(i)_j across all j into the "aux info" {\tilde x}^(i)? (Wouldn't hurt to write this down more explicitly.) (Just found the info in L296/Section 3.3 that the attacker gets 3 random claims for the linkage phase. I find that a bit late, it would be better to mention it directly in Section 2.3 to avoid confusion/guessing on the side of the reader.)

Yes this is exactly what we meant. We’ll update it in the draft in the next version.

L156: Can you give a specific reason for querying only with claims that were not utilized in the linking phase?

Our metric seeks to measure the information gained from having access to released sanitized data. Therefore, when computing the final score, we ignore the auxiliary information.

Besides, how do I know whether a claim about an original document was used for linking?

We randomly select three claims from a given record in this study.

If all atomized claims are combined into the aux info (cf. previous question), and the aux info is used as query in the retriever, wouldn't this imply that all claims are already consumed in the linking phase?

For records containing fewer than three claims, we exclude them from the final privacy metric computation to maintain consistent evaluation conditions across the dataset.

L159: If I understood correctly, you define the "similarity metric" µ between the original and linked documents by querying a language model to judge the document similarity, where you assign values on a scale from 1 (for 'identical documents') to 3. I wonder if it would make more sense to start the scale at 0, since mathematically, a metric has the property of evaluating to 0 for identical inputs. (In your case, we would get "µ(x,x) > 0" instead of "µ(x,x) = 0".)

Yes, we normalize the score to 0-1 when reporting the numbers in the table. We’ll add it to the paper.

How do I know that the atomized claims, which are used to compute µ and hence to measure privacy preservation, are actually privacy-relevant, and not just some arbitrary, privacy-insensitive facts?

Answered above.

L313: I'm confused regarding the symbol "µ" seemingly used for multiple purposes. It is defined in Sec. 2.4, but here, you also use it for another metric induced by ROGUE-L scores.

We apologize for the ambiguity regarding µ. In this baseline, we use ROUGE-L as both the linking function L and the privacy metric µ to investigate privacy score using established text similarity metrics. This choice simulates a classical baseline approach where the same algorithmic method serves both purposes. We have revised the notation to distinguish between different implementations of the steps in the paper.

L317 (also L386): You state "zero-shot prompting achieves an accuracy of 0.44" for MedQA task utility, but why am I unable to find that result in Table 1 (for "No Sanitization", it says 0.69)?

We apologize for the unclear terminology. The 0.44 accuracy refers to our baseline measurement where the model receives only the question and multiple choice options, without any context. This represents the model's inherent knowledge. In contrast, the 0.69 accuracy under "No Sanitization" represents our upper bound, where the model receives complete, unmodified context. We have updated the manuscript to reflect this.

Calling the privacy metrics "Overlap" and "Similarity" is very confusing, since they actually mean the opposite (high lexical overlap and semantic similarity would indicate a high agreement between the two documents, but high scores in Table 1 mean good privacy). Name them lexical/semantic "distance" instead?

Thank you. We have revised the terminology accordingly.

评论- Response to Reviewer Jmiv Part 2

2024-11-22

The effect of \epsilon appears surprisingly small to me, with only minimal changes across all metrics even when comparing \epsilon=3 and 1024. Can you explain this behavior?

The fact that privacy preserving techniques with extremely large $\epsilon$ have a significant effect, and at the same time the value of $\epsilon$ has relatively small effect has been observed in other applications of differential privacy also. For example, it is widely known in the DP community that adding DP with very large $\epsilon$ significantly mitigates Membership Inference Attacks (MIA) as measured by standard MIA metrics. We believe that this is partly due to the fact that DP-SGD methods use clipping (even for very large $\epsilon$ ), which already significantly reduces the influence of any single sample for any $\epsilon$ values.

In addition, there is a significant difference between our threat model and the strong adversarial model assumed in DP. While DP provides worst-case privacy guarantees against an adversary who knows all but one record in the dataset, our threat model considers a substantially weaker adversary who only has access to partial information from a single record. Additionally, the DP-SGD training framework we adopted in the paper composes privacy costs from each optimization step, assuming the adversary can observe gradients throughout training. In contrast, our threat model only allows the adversary to access the final sanitized dataset, not the resulting model and let alone the training process. This further reduces the effective strength of the attack. This finding aligns with recent findings that demonstrate DP's effectiveness against membership inference attacks even with larger epsilon values (Lowy et al., 2024)

Reference

Lowy, A., Li, Z., Liu, J., Koike-Akino, T., Parsons, K., & Wang, Y. (2024). Why Does Differential Privacy with Large Epsilon Defend Against Practical Membership Inference Attacks?. arXiv preprint arXiv:2402.09540.

Below are our responses to your remaining questions:

Also, what would happen if you shifted the (apart from the ordering: somewhat arbitrarily) assigned scores for the "similarity metric" in Section 2.4 from {1,2,3} to {0,1,2} or to {1,10,100}?

We apologize for not explaining this clearly in the paper. The privacy score in our framework is normalized to a [0,1] range, where 1 represents complete privacy and 0 represents no privacy. Shifting the scoring scale from {1,2,3} to {0,1,2} would not affect our final results, as the normalization process preserves the relative distances between scores. However, using highly uneven spacing like {1,10,100} could affect the results by introducing non-linear weighting between different privacy levels. We maintained equal intervals in our scoring to ensure consistent sensitivity across all privacy levels.

L081, L099: Just to be sure: If I understood correctly, "claims" can refer to any sensitive or non-sensitive information in the original texts?

Yes, claims encompass any discrete piece of information from the original text, whether sensitive or non-sensitive.

L101: If (1) PII removal had 94% leakage (inferable claims), and (2) data synthesis (with or without DP?) has 9% lower leakage (i.e., 85% ?), why does (3) data synthesis without DP state 57% << 85% leakage?

We apologize for the confusion. The term "identifier removal" in line 101 refers to the broad category of all identifier removal methods, not just PII removal. Our results compare two main categories of data sanitization: identifier removal methods and data synthesis methods. The 9% improvement in privacy protection refers specifically to the difference between Dou et al.'s (2024) identifier removal method, which showed the best performance among removal techniques, and the data synthesis approach.

L146: Could you briefly motivate the use of the sparse BM25 retriever? Did you consider dense retrieval methods (say, using some form of text embeddings)? What are the benefits of BM25 vs. other sparse methods, or of sparse methods vs. dense methods, in particular for the linkage task at hand?

We initially implemented dense retrieval using state-of-the-art dense retriever models GritLM (Muennighoff et al., 2024), but we found that the BM25 sparse retriever performed on average 16% better than dense approach in the MedQA dataset, and they performed similarly in the WildChat dataset.

Reference

Muennighoff, Niklas, Hongjin Su, Liang Wang, Nan Yang, Furu Wei, Tao Yu, Amanpreet Singh, and Douwe Kiela. “Generative Representational Instruction Tuning.” arXiv, April 17, 2024.

评论- Response to Reviewer Jmiv Part 1

2024-11-22

Thank you for your thoughtful feedback. We appreciate your highlighting our strengths, including the critical contribution of a systematic evaluation of novel sanitization methods and that our writing is easy to follow. We hope the response below addresses your concerns and questions.

Lastly, what are your thoughts on information that is both privacy-sensitive and utility-relevant?

How do I know that the atomized claims, which are used to compute µ and hence to measure privacy preservation, are actually privacy-relevant, and not just some arbitrary, privacy-insensitive facts?

We are making an important first step towards making privacy metrics and definitions more relevant and practical. Existing privacy metrics, such as differential privacy, address data privacy, where a single token in the textual data is considered private. While this approach works well for structured data (Stadler et al., 2022), text data presents unique challenges that current methods fail to address. As a result, major cloud providers and healthcare organizations continue to rely on simple PII detection and de-identification (Johnson et al., 2020), following outdated privacy models (Garcia et al., 2019).

We take a first step towards addressing this gap by proposing, for the first time, an inferential privacy metric. Our results quantify the failure of existing approaches–showing 94% information leakage with state-of-the-art PII removal methods--and provide compelling evidence that current approaches are fundamentally inadequate.

Ideally, one would want contextual privacy metric, which can take into account (i) which information is more privacy-relevant and (ii) which information is private in the context that the textual information is being shared. These are extremely challenging questions that we believe are beyond the scope of this paper. Nevertheless, they represent exciting research directions to pursue, particularly given recent advances in LLMs. We have added this discussion to the limitation section.

References

Garcia, D. (2019). Privacy beyond the individual. Nature human behaviour, 3(2), 112-113.

Stadler, Theresa, Bristena Oprisanu, and Carmela Troncoso. “Synthetic Data -- Anonymisation Groundhog Day.” arXiv, January 24, 2022. http://arxiv.org/abs/2011.07018.

L323: I think the conclusion from a "disparity between lexical and semantic similarity" to "these techniques primarily modify and paraphrase text without effectively disrupting the underlying connected features and attributes" is made a bit prematurely: Both are entirely different measures, and even for "no sanitization", the lexical score is twice the semantic score.

We agree with the reviewer that direct numerical comparison between lexical and semantic scores may not be methodologically ideal, but we focus on how users interpret these privacy metrics. In practice, users often interpret these numbers as direct indicators of privacy protection levels. By showing both metrics, we provide a more complete picture that helps users avoid over-relying on any single measure when assessing privacy guarantees, which can later be used on privacy nutrition labels designed to help practitioners (Smart et al., 2024). This dual approach promotes a more nuanced understanding of actual privacy protection rather than depending on potentially misleading single metrics (Kelley et al., 2009).

References

Smart, M. A., Nanayakkara, P., Cummings, R., Kaptchuk, G., & Redmiles, E. (2024). Models matter: Setting accurate privacy expectations for local and central differential privacy. arXiv preprint arXiv:2408.08475.

Kelley, P. G., Bresee, J., Cranor, L. F., & Reeder, R. W. (2009, July). A" nutrition label" for privacy. In Proceedings of the 5th Symposium on Usable Privacy and Security (pp. 1-12)

2024-12-02

Focus evaluation on a range of smaller epsilons

Thank you for the suggestions! We’ll look into adding more smaller epsilon comparisons. Here are our reasons for selecting the existing set of epsilon values:

In our experiments, we observe when \epsilon is 3, the model output is private, but the utility is quite low. In particular, the text produced is incoherent (please refer to results in Table 2). We therefore opted not to try lower values of \epsilon, which we would expect to increase privacy (which is already very high) but further decrease utility. Instead, we studied higher values of \epsilon in an attempt to improve utility. We observe that even at these higher values, DP can still protect privacy; this is consistent with recent studies that have also shown that higher values of \epsilon can still protect against membership inference attacks (Lowy et al., 2024; Ponomareva et al., 2022).
Our minimum value of \epsilon = 3 follows established practices in the literature, including Yu et al. (2021), Mehta et al. (2022) and Mattern et al. (2022). This value provides stronger privacy guarantees compared to the one evaluated in Yue et al. (2023), whose differential privacy sanitization method we adopted. This informed our decision to examine epsilon values above 3.

References

Yue, Xiang, Huseyin A. Inan, Xuechen Li, Girish Kumar, Julia McAnallen, Hoda Shajari, Huan Sun, David Levitan, and Robert Sim. "Synthetic text generation with differential privacy: A simple and practical recipe." arXiv preprint arXiv:2210.14348 (2022).

Mehta, Harsh, Abhradeep Thakurta, Alexey Kurakin, and Ashok Cutkosky. "Large scale transfer learning for differentially private image classification." arXiv preprint arXiv:2205.02973 (2022).

Mattern et al., "Differentially Private Language Models for Secure Data Sharing", EMNLP 2022

Yu, Da, Saurabh Naik, Arturs Backurs, Sivakanth Gopi, Huseyin A. Inan, Gautam Kamath, Janardhan Kulkarni et al. "Differentially private fine-tuning of language models." arXiv preprint arXiv:2110.06500 (2021).

Lowy, Andrew, Zhuohang Li, Jing Liu, Toshiaki Koike-Akino, Kieran Parsons, and Ye Wang. "Why Does Differential Privacy with Large Epsilon Defend Against Practical Membership Inference Attacks?." arXiv preprint arXiv:2402.09540 (2024).

Ponomareva, Natalia, Jasmijn Bastings, and Sergei Vassilvitskii. "Training text-to-text transformers with privacy guarantees." In Findings of the Association for Computational Linguistics: ACL 2022, pp. 2182-2193. 2022.

where the drop in utility is more significant than the improvement in privacy…

Thank you for highlighting this issue. While DP methods achieve strong privacy protection on both MedQA and WildChat, the privacy gains differ specifically due to variations in the privacy protection of fine-tuning approaches. This difference stems from the threat models: in MedQA, we treat both questions and answers as public information, to evaluate the sanitization method's ability to generate context corresponding to correct choices. Conversely, for WildChat, we consider the entire conversation as private information. We hypothesize that this distinction in information availability directly affects the fine-tuning methods' ability to learn private information, explaining the observed differences in privacy gains across our experiments. We will update the manuscript to reflect the discussion, and improve the claim we are making.

I assume that you will add the additional explanations to the paper where they are required/helpful for readers to better/quicker follow the paper.

We thank you again for your constructive feedback, especially regarding clarity improvements. All suggested clarifications have been incorporated into the manuscript, with modifications highlighted in yellow for reference.

评论- Response to All Reviewers

2024-11-22

We sincerely thank all reviewers for their thoughtful and constructive feedback. We appreciate the recognition of our paper's strengths, including its systematic evaluation framework, comprehensive experiments, and clear writing. Our work addresses a critical need to better understand the inherent tension between preserving semantic information for utility while protecting privacy - a fundamental challenge without simple solutions. Our findings reveal that practitioners who rely on current PII removal and scrubbing methods for text privacy may be operating under a false sense of privacy. This insight is particularly alarming given the increasing volume of sensitive text data being handled across healthcare, customer service, and other domains (Mireshghallah et al. 2024). Without proper understanding of these limitations, organizations may inadvertently expose sensitive information while believing their data is adequately protected.

While our work opens up important new research directions in privacy-preserving text sanitization, we have barely scratched the surface of this complex challenge. We have carefully addressed each reviewer's specific concerns in our detailed responses below and have uploaded a revision with changes highlighted in yellow. For each response we have included the reference information, if it was already not in the paper.

AC 元评审

2024-12-21

The paper introduces a privacy evaluation framework for the release of sanitized textual datasets. This framework is based on two steps: (i) linking, where a sparse retriever matches de-identified samples with potential candidate "sanitized" samples, and (ii) matching, which assesses the information gained about the target by comparing the matched record from the linking step with the private data. A key aspect is replacing lexical matching (e.g., matching names or other personal attributes) with semantic matching.

The paper addresses an important problem with practical relevance in many domains. Data linkage and inference have an over two-decade-long history (perhaps longer), and the authors correctly acknowledge the urgency of the topic given the ever-growing volume of data collected and stored across multiple domains. The paper is also well-written and easy to follow.

The paper also has several limitations, which made most reviewers stand by a score that rates the paper slightly below the acceptance threshold. In my view, the main concerns are novelty of the claims and linkage attack (e.g., the use of the BM25 retriever), connection with related work (see comments by reviewer Jmiv), and precision of the experimental evaluation. For the latter, I was surprised that experiments include no std dev/std error in the evaluation, and it is not clear that the finding on the two selected datasets (MedQA and WildChat) would extend generalize to other datasets. This is particularly relevant in light of the confusion surrounding DP results (Jmiv and oF1E).

Overall, the paper could benefit from a significant revision and increased precision of the definitions and the experimental results. This would be a clear "major revision" if this were a journal. I believe the importance of the topic demands precise, accurate, and clear numerical results to substantiate the claims of privacy risks. This would make the paper's overall (important) message much stronger and more substantiated. I encourage the authors to review their manuscript and seriously account for the reviewer comments.

审稿人讨论附加意见

The reviewers remained tepid after discussion with the authors.

最终决定Reject

2025-01-22

Reject