PaperHub
6.0/10
Rejected · 5 reviewers
Ratings: 5, 3, 6, 8, 8 (min 3, max 8, std 1.9)
Confidence: 3.8
Correctness: 2.4 · Contribution: 2.4 · Presentation: 2.4
ICLR 2025

Has My System Prompt Been Used? Large Language Model Prompt Membership Inference

OpenReview · PDF
Submitted: 2024-09-28 · Updated: 2025-02-05
TL;DR

We develop Prompt Detective, a statistical method to reliably determine whether a given system prompt was used by a third-party language model.

Abstract

Keywords

privacy, membership inference attack, prompt extraction

Reviews & Discussion

Review
Rating: 5

This paper offers a method for detecting the use of proprietary system prompts by a third party. The method is based on a statistical test that measures whether the response distribution of the third party model (which potentially uses the proprietary prompt) matches that of the same model with the proprietary prompt. The statistical test is performed using the response embeddings taken from a BERT model. In experiments, the method is applied to two existing system prompt datasets, as well as a novel LLM-generated dataset, showing positive results in some settings. The authors highlight that their experiments show how minor changes to prompts can lead to significant differences in response distributions.

Strengths

The paper is generally well-presented, with clear plots as well as pseudocode for their algorithm. In addition, the extension of existing system prompt datasets to include more similar items creates the possibility for more interesting experimental results.

Weaknesses

I find that this paper has many weaknesses, which I detail below.

- After reading the introduction, I find myself questioning the exact nature and importance of the problem. If prompt reconstruction is so difficult, then why is such a test important? Also, I wonder whether it is illegal to steal a proprietary system prompt. If not, what actions would one take after receiving a positive result on your test? I think better and clearer motivation is needed here.

- I think the related works section overemphasizes prompt engineering in general (as does the introduction), and does not make clear exactly where this paper is meant to be placed in the literature. Also, one major concern I have is that line 62 states that "prompt reconstruction approaches usually do not offer certifiable ways to verify if the recovered prompt was indeed used"; given the "usually", I believe I am to understand that there has been previous work considering certifiable ways to verify prompt inclusion. However, I cannot find those references in Section 2 (and also later wonder if that means that the experiments section is missing important baselines).

- Another place where I fail to grasp the motivation is Section 3.1, second paragraph. This seems meant to be a motivating example for the methodology to come, but I find it to be confusing.

- The methodology composes existing tools (e.g., embedding model, statistical test), and lacks any significant novelty. One opportunity to introduce novelty may have come from some method for producing especially effective task prompts (also, small note: why is Section 3.3 not titled "Task Prompts"?), but this is not deeply explored.

- Related to the last point, I find that line 243, "The selection of task prompts $q_1, \ldots, q_n$ is an important component of Prompt Detective", is unsupported by the experiments; lines 263-265 seem to indicate that task prompts are not very important.

- While I am not very familiar with these datasets, the tasks in Section 5.1 seem fairly trivial. Also, why are false negative rates not included in Table 2?

- Are there any baselines that you might compare to? Either from the prompt reconstruction literature, or the membership inference literature?

- The extension to black-box settings seems like another potential avenue to introduce novel methodology, but is ultimately given little attention.

- In the discussion, the authors state that "A key finding of our work is that even minor changes in system prompts manifest in distinct response distributions". I believe this is very well known to LLM practitioners and researchers; e.g., see Section 3.3.1 of [1].

[1] https://arxiv.org/abs/2404.09932

Questions

Please see weaknesses for specific questions. On a high level, I would be interested to hear more about why this is an important problem, and how you would ultimately like to see your method used.

Ethics Concerns

I have no ethics concerns.

Comment
  7. [Comparison to prompt extraction baselines] We ran additional experiments comparing PLeak [4], the highest-performing of the existing prompt reconstruction approaches (referenced in other reviewers' comments), to Prompt Detective in the prompt membership setting. We used the optimal recommended setup for real-world chatbots from Section 5.2 of the PLeak paper: we computed 4 Adversarial Queries with PLeak and LLAMA-2 13B as the shadow model, as recommended, and we used ChatGPT-Roles as the shadow domain dataset to minimize domain shift for PLeak. We observed that PLeak sometimes recovers large parts of target prompts even when there is no exact substring match, and that using an edit-distance threshold of 0.2 to find matches maximizes PLeak's performance in the prompt membership inference setting. To further maximize the performance of the PLeak method, we also aggregate the reconstructions across the 4 Adversarial Queries (AQs) by taking the best reconstruction match (this aggregation is infeasible in the prompt reconstruction setting, where the target prompt is unknown, but can be used to obtain the best results in the prompt membership inference setting, where we know the reference prompt); a code sketch of this matching rule appears after this list. We then applied these adversarial prompt extraction queries to LLAMA-2 13B as the target model with system prompts from Awesome-ChatGPT-Prompts and computed False Positive and False Negative rates for direct comparison with the results of Prompt Detective reported in Table 1 of our paper. We report the results below:
Method             Target Model   FPR    FNR
Prompt Detective   Llama2 13B     0.00   0.05
PLeak              Llama2 13B     0.00   0.46

We see that Prompt Detective significantly outperforms PLeak in the prompt membership inference setting, which is expected, since Prompt Detective is specifically tailored to the verification setup while PLeak is geared towards the different problem of prompt reconstruction. We added these results to Appendix B.1.

  8. [Black box setting] We respectfully disagree that the black box setting was overlooked. The paper includes experiments showcasing the effectiveness of Prompt Detective in a practical black box scenario and reports the results in Table 3.

  9. [Highlighting the ability to separate even extremely similar system prompts] While we agree that related findings of LLM brittleness and sensitivity to prompts exist and are well known, our work highlights a distinct finding: the ability to separate even extremely similar system prompts, like ones differing only by a typo, as they manifest in distinct response distributions (by eye, responses to two system prompts of such extreme similarity would be indistinguishable). See Appendix C.2 for a case study on prompts differing by a typo. We also point out that other reviewers found this particular finding surprising and important, as mentioned by Reviewer L9Uj: “The paper has some surprising results, at line 448 (system prompts that differ only by a typo as an example of extreme similarity (see Appendix C for details). I think this result is quite important and may have broader implication…”
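For concreteness, below is a minimal Python sketch of the edit-distance matching rule described in point 7. Normalizing by the longer string's length is our illustrative choice here, and the function names are ours rather than the evaluation code we ran:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def best_reconstruction_matches(reference: str, reconstructions: list[str],
                                threshold: float = 0.2) -> bool:
    """Aggregate over the adversarial queries: take the reconstruction with
    the lowest normalized edit distance to the known reference prompt and
    declare a match if it falls below the threshold."""
    best = min(edit_distance(reference, r) / max(len(reference), len(r))
               for r in reconstructions)
    return best < threshold
```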

In response to other reviewers' feedback, we have also performed an ablation study on Prompt Detective embeddings and included ROC curves for Prompt Detective in the paper. Please see our general response for details.

We sincerely thank you once again for your detailed feedback. We hope our responses have addressed your questions, and kindly ask you to consider increasing your score in light of the clarifications, additional experiments, and writing improvements we have provided.

References:

[1] https://prompti.ai/chatgpt-prompt/

[2] https://promptbase.com/

[3] https://github.com/jujumilk3/leaked-system-prompts/tree/main

[4] Hui, B., Yuan, H., Gong, N., Burlina, P. and Cao, Y., 2024. PLeak: Prompt Leaking Attacks against Large Language Model Applications. arXiv preprint arXiv:2405.06823.

[5] Zhang, Y., Carlini, N. and Ippolito, D., 2024, August. Effective prompt extraction from language models. In First Conference on Language Modeling.

Comment

Thank you for your review and feedback. We appreciate your recognition of our paper's clear presentation and positive results. We address your points below:

  1. [Problem motivation and importance] Carefully crafted system prompts play a crucial role in shaping LLM outputs and driving performance in application domains; in fact, some prompts are valuable enough to be sold on online marketplaces [1,2]. Separately, hidden chatbot system prompts sometimes contain instructions that users may want to be aware of, such as instructions on handling the user's private information [3]. Regarding prompt reconstruction, it is certainly possible, and methods such as PLeak [4] (the most promising prompt reconstruction approach) achieve reconstruction in real-world chatbots relatively successfully (with a 68% success rate). System prompts may also be leaked through means other than reconstruction methods [3] and utilized by adversaries in their chatbots. However, the prompt reconstruction problem is indeed challenging and distinct from prompt membership inference, and what we meant in the introduction is that the success rate of prompt reconstruction methods (e.g., PLeak's 68%) is not high enough to reliably verify prompt reuse, and that more effective methods tailored to this problem, such as Prompt Detective, can be developed. In fact, we ran additional experiments comparing Prompt Detective to PLeak as a prompt reconstruction baseline in the prompt membership inference setting and found that Prompt Detective is significantly more effective (achieving 0.05 FNR and 0.00 FPR vs PLeak's 0.46 FNR and 0.00 FPR). Please refer to our response to your point 7 [Comparison to prompt extraction baselines] above for more details on this experiment (and also to our general response or Appendix B.1 of the paper).

  2. [Related work] We appreciate this feedback and have revised our introduction and related work section to include a clearer discussion of prompt reconstruction methods to better position our contribution within the literature. Regarding the word "usually" in line 62, we removed that word and clarified that while some reconstruction methods provide confidence scores (namely, [5], which was already cited), they don't offer statistical guarantees for prompt usage verification. We have also added the comparison to PLeak (the most performant prompt reconstruction baseline) mentioned above to Appendix B.1.

  3. [Methodology motivation] We revised Section 3.1 to clarify the motivating example.

  4. [Novelty] While our method does rely on existing research (as most research does), its novelty lies in the application to the prompt membership inference problem and in finding the right combination of components to develop an effective statistical framework for prompt reuse verification (more effective, for example, than utilizing prompt reconstruction methods for the same purpose, as evidenced by the comparison experiment with PLeak). While some design choices may seem simple, they are in fact effective, as pointed out by Reviewer vCak: “Prompt Detective is a straightforward hypothesis test without any unnecessary moving parts, yet done properly such that it achieves good performance in terms of what it set out to do”.

  5. [Task prompts] Regarding task prompts, as part of finding an effective recipe for the statistical framework, we explore, for example, the optimal allocation of a fixed budget of generations in Figure 4: we find that having a larger number of shorter responses and a diversified set of task prompts which elicit responses directly influenced by the system prompt is most useful for effective separation. (The best particular setup in the Figure 4 experiment is to use 10 task prompts with 50 generations, each 64 tokens long.)

  6. [Tasks in Section 5.1, false negative rates in Table 2] While some of the system prompts in our general experiments are fairly distinct, these datasets also include system prompts for similar roles, which are more challenging to separate in negative cases (where the known system prompt differs from the system prompt used by the model). Additionally, the general experiments also confirm the ability of Prompt Detective to reliably verify positive cases. Section 5.2 is devoted to the exploration of much more challenging negative cases (hard examples). False negative rates are not included in Table 2 because there are no positive cases: the experiments are focused on separating the original system prompt from other, distinct system prompts of varying similarity to the original.

Comment

Dear Reviewer hT6p,

Thank you for your time and effort in reviewing our paper and rebuttal.

We have added new experiments and results to address your concerns in the updated draft.

We hope our responses have addressed your questions, and kindly ask you to consider increasing your score in light of the clarifications, additional experiments, and results we have provided. Please let us know if you have further questions as the extended discussion period comes to an end. Thank you!

Comment

Dear Reviewer hT6p,

Thank you again for your feedback and time!

As the extended discussion period comes to an end tomorrow, we wanted to post a gentle reminder that we added new experiments and results to address your concerns. Please see our responses to your review as well as the general response. We hope our responses address your questions, and kindly ask you to consider increasing your score in light of the clarifications, additional experiments, and results we provided.

Comment

First, I’d like to address the multiple pings sent to me to review this paper over the holiday. Per the original ICLR email:

“The discussion phase will have the following timeline:

  • today to November 26 at 11:59pm AoE: Reviewers and authors can exchange responses with each other as often as they wish.
  • November 26 to November 27 at 11:59pm AoE: Only authors may respond. Reviewers cannot respond.”

These rebuttals were posted on November 27, after the initially proposed discussion period had ended. I believe the ultimate extension to the discussion period was primarily meant for reviewers who had failed to engage in discussion prior to November 27, not so that authors could take longer in rebuttals and then expect reviewers to spend their holiday with the paper.

Thank you for clarifying the points with respect to motivation and related work, and for performing additional experiments.

I will raise my score to 5, but still maintain a score below accept. I think the paper lacks technical novelty, and I do not find the application to be convincing (similar to Reviewer L9Uj).

Review
Rating: 3

The paper discusses the detection of whether a black-box service uses the same system prompt, or whether a victim's system prompt was stolen. A statistical method called "Prompt Detective" is used, which compares two sets of generations: one from a model whose system prompt the victim controls, and one from the target model. The paper shows that system prompts with very minor differences can be identified.

Strengths

  • The paper has some surprising results at line 448 (system prompts that differ only by a typo as an example of extreme similarity; see Appendix C for details). I think this result is quite important and may have broader implications beyond the specific problem of verifying whether the prompt is the same or not. For example, one can extract information from a black-box model about the exact name of an api_call or some system prompt formatting even if the system designer is filtering out this information.

  • I like the idea of designing task prompts that are specifically designed to probe for the corresponding system prompt.

Weaknesses

  • I think verifying whether the prompt used is the same might not be needed, because one can perform prompt reconstruction or attacks to reveal the system prompt, and existing work already shows this is possible [1-3]. The paper makes a comparison to membership inference attacks on training data, but this threat model is different since, unlike training data, system prompts can be easily reconstructed. I would at least expect the paper to make this comparison to show the advantage of the method over it. The discussion in Section 2.2, while it mentions prompt leakage, does not explain why it cannot be used to verify whether the prompt is the same or not.

[1] Debenedetti et al., Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition. NeurIPS D&B

[2] Hui et al., PLeak: Prompt Leaking Attacks against Large Language Model Applications (https://arxiv.org/abs/2405.06823)

[3] Geiping et al., Coercing LLMs to do and reveal (almost) anything (https://arxiv.org/abs/2402.14020)

  • Revealing system prompts has broader implications (e.g., one may use information in the system prompt in order to better launch prompt injection attacks). But the paper argues that system prompts are valuable and can therefore be stolen to replicate functionality, and that we need to verify whether that is the case. However, it does not show convincing setups where these system prompts are important. The current prompts are very generic. I would have expected prompts that are either large enough with a lot of detailed reasoning methods, contain private data, or have optimized tokens that need computation. There are two drawbacks of this: 1) we can't know if the method would generalize to these cases; 2) generic system prompts can naturally overlap and are therefore not distinct. Even if the prompts are the same, that does not mean there exists an attack.

  • Regarding the above point, it seems that the test is costly in terms of the number of queries needed. If this scales with the length of the prompt, then the test may be very expensive to perform.

  • In Figure 2 and Section 4.1, the paper does not make it clear whether all of these scenarios are considered "stolen prompts"; if the model is used for the same task, there should exist natural similarity in concept. It seems from Section 5.2 that all of these are considered positive examples (similar to the system prompt). I think this is not a very reasonable assumption.

Questions

  • I don't understand Section 6. I think it is not really black-box because we assume the model exists in a known pool. In principle this can easily be converted into the same setup (perform the test with each model individually and take the smallest p-value as the final one).

  • Can you use the log probs to perform the test? (In this case, you may not need to perform as many queries, as there is no sampling randomness.)

  • Are there any differences in the actual functionality of the model if one uses a system prompt with similarity 5? In this case, even if the method shows high similarity, the designer can easily argue that the system prompt is not the same, and therefore the designer of the service may decide to always paraphrase the stolen system prompt.

Comment

Thank you for your thoughtful review and feedback. We are glad that you found our results on a typo case study surprising and important, and we address your comments and questions below:

  1. [I think verifying if the prompt used is the same might not be needed.]

We argue that while prompt reconstruction methods attempt to deduce the content of a system prompt, they are prone to errors, often require white-box access to the model, and lack a reliable way to confirm whether the reconstructed prompt matches the actual prompt used. This is where Prompt Detective provides a distinct advantage—it offers a statistically significant verification mechanism that complements, rather than replaces, reconstruction methods. Even if reconstruction reveals an approximate prompt, it cannot provide a certifiable guarantee that the reconstructed prompt was actually employed by the system. Prompt Detective addresses this gap by offering an independent verification mechanism based on observable response distributions.

Despite these differences between the prompt reconstruction and prompt verification approaches, we ran additional experiments comparing PLeak [1], the highest-performing of the existing prompt reconstruction approaches [1-4] referenced in your and Reviewer DuyU's comments (per the results reported in each of the papers), to Prompt Detective in the prompt membership setting. We used the optimal recommended setup for real-world chatbots from Section 5.2 of the PLeak paper: we computed 4 Adversarial Queries with PLeak and LLAMA-2 13B as the shadow model, as recommended, and we used ChatGPT-Roles as the shadow domain dataset to minimize domain shift for PLeak. We observed that PLeak sometimes recovers large parts of target prompts even when there is no exact substring match, and that using an edit-distance threshold of 0.2 to find matches maximizes PLeak's performance in the prompt membership inference setting. To further maximize the performance of the PLeak method, we also aggregate the reconstructions across the 4 Adversarial Queries (AQs) by taking the best reconstruction match (this aggregation is infeasible in the prompt reconstruction setting, where the target prompt is unknown, but can be used to obtain the best results in the prompt membership inference setting, where we know the reference prompt). We then applied these adversarial prompt extraction queries to LLAMA-2 13B as the target model with system prompts from Awesome-ChatGPT-Prompts and computed False Positive and False Negative rates for direct comparison with the results of Prompt Detective reported in Table 1 of our paper.

We report the results below:

Method             Target Model   FPR    FNR
Prompt Detective   Llama2 13B     0.00   0.05
PLeak              Llama2 13B     0.00   0.46

We see that Prompt Detective significantly outperforms PLeak in the prompt membership inference setting, which is expected, since Prompt Detective is specifically tailored to the verification setup while PLeak is geared towards the different problem of prompt reconstruction. We added these results to Appendix B.1.

  2. [Revealing system prompts have broader implications.] While the main experiments in the paper focus on role-playing prompts, we have also demonstrated the generalizability of Prompt Detective to realistic, production-grade prompts. Specifically, in the typo case study, we examined the widely used Llama system prompt, which instructs the model to function as a helpful and harmless assistant. Our results show that Prompt Detective effectively distinguishes between two distinct general-purpose prompt variations, highlighting the robustness and applicability of our method to real-world scenarios.

  3. [Regarding the above point, it seems that the test is costly in terms of the number of queries needed.] As outlined in Section 4, our experiments use 50 to 500 generations per system prompt, depending on the level of similarity. When the prompt in the third-party language model closely resembles the proprietary one, up to 500 generations may be required to statistically detect differences. However, in most cases, Prompt Detective achieves reliable results with just 50 generations.

  4. [Figure 2] In Figure 2 we illustrate examples of “hard prompts”, which are not the same as the proprietary prompts (so, negative cases), but which we assume would be harder to differentiate from the proprietary prompt because of the high similarity between them.

Comment
  5. [Black-box scenario in Section 6.] Since prompt engineering is a much less resource-intensive task than developing or fine-tuning a custom language model, it is reasonable to assume that chatbots which reuse system prompts are based on one of the publicly available language models, such as API-based GPT models, Claude models, or open-source models like Llama, Mistral, or others. For the black-box experiments, we simply used all models from our general experiments as the pool; however, our method is similarly applicable to larger pools of models. We compare the generations of $f_p$ against each reference model $\{\bar{f}^i_{\bar{p}}\}_{i=1}^6$ and take the maximum $p$-value. To address the multiple-comparison problem inherent in this setup, we apply the Bonferroni correction to the $p$-value threshold to make sure that the overall significance level remains at 0.05. (A small code sketch of this procedure follows after this list.)

  6. [Can you use the log prob to perform the test?] Directly utilizing the target model's embeddings instead of re-sampling generations and computing BERT embeddings could indeed be a simplification for Prompt Detective. However, this approach would restrict Prompt Detective's applicability to models with accessible logits, thereby excluding models like Claude that do not provide such access.

  7. [the designer of the service may decide to always paraphrase the stolen system prompt…] Indeed, the adversary rephrasing the system prompt is a possible scenario. However, we note that if the system prompt is rephrased such that the output distribution remains highly similar to the original output distribution (which would correspond to preserving functionality), that will require a very large set of task prompts and responses to achieve low p-values and separation in Prompt Detective. Therefore, observing no separation until a very large sample of responses is used will be indicative of the adversary using such adaptive attacks attempting to preserve the output distribution (i.e. functionality) while rephrasing the system prompt, so Prompt Detective will remain effective in this scenario. This is consistent with our hard setup experiments in Section 5.2.
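To make the multiple-comparison procedure in point 5 concrete, here is a minimal sketch under stated assumptions; the helper `two_sample_test` stands for any two-sample test returning a p-value (such as the permutation test sketched in our response to Reviewer DuyU below), and all names are illustrative rather than our exact implementation:

```python
from typing import Callable
import numpy as np

def black_box_check(target_embs: np.ndarray,
                    reference_embs_per_model: list[np.ndarray],
                    two_sample_test: Callable[[np.ndarray, np.ndarray], float],
                    alpha: float = 0.05) -> bool:
    """Black-box variant: compare the target chatbot's response embeddings
    against each candidate base model queried with the reference prompt.
    Returns True when the hypothesis that some candidate model runs the
    reference prompt cannot be rejected at the corrected level."""
    k = len(reference_embs_per_model)
    # The candidate model that best explains the target's generations
    # yields the maximum p-value.
    p_max = max(two_sample_test(ref_embs, target_embs)
                for ref_embs in reference_embs_per_model)
    # Bonferroni correction over k comparisons keeps the overall
    # significance level at alpha.
    return p_max >= alpha / k
```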

In response to other reviewers' feedback, we have also performed an ablation study on Prompt Detective embeddings and included ROC curves for Prompt Detective in the paper. Please see our general response for details.

We sincerely thank you once again for your detailed feedback. We hope our responses have addressed your questions, and kindly ask you to consider increasing your score in light of the clarifications, additional experiments, and writing improvements we have provided.

References:

[1] Hui, B., Yuan, H., Gong, N., Burlina, P. and Cao, Y., 2024. PLeak: Prompt Leaking Attacks against Large Language Model Applications. arXiv preprint arXiv:2405.06823.

[2] Geiping, J., Stein, A., Shu, M., Saifullah, K., Wen, Y. and Goldstein, T., 2024. Coercing LLMs to do and reveal (almost) anything. URL https://arxiv.org/abs/2402.14020.

[3] Debenedetti, E., Rando, J., Paleka, D., Florin, S.F., Albastroiu, D., Cohen, N., Lemberg, Y., Ghosh, R., Wen, R., Salem, A. and Cherubin, G., 2024. Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition. arXiv preprint arXiv:2406.07954.

[4] Zhang, Y., Carlini, N. and Ippolito, D., 2024, August. Effective prompt extraction from language models. In First Conference on Language Modeling.

Comment

Dear Reviewer L9Uj,

We sincerely appreciate your efforts in reviewing and reading our paper and rebuttal. We have made a significant effort and added new experiments and results to address your concerns in the updated draft.

We kindly hope you will consider raising your score if our revisions and rebuttal have resolved your concerns. Please let us know if you have any further questions. Thanks again for your contribution to our paper.

Comment

Thank you very much for your detailed feedback.

I think you are correct that prompt reconstruction may be less accurate. The method presented in this work can be more specific.

I still have concerns about the significance of the experimental setup that I identified in my review, which I think the rebuttal has not addressed.

While I believe stealing system prompts can sometimes be an important threat, the setup in the paper is very simplified. There are no valuable, unique system prompts that adversaries may be motivated to steal; currently the paper assumes very generic role-playing system prompts. In general, extracting generic system prompts may not be critical (e.g., see: https://www.microsoft.com/en-us/msrc/aibugbar). It is not straightforward to map the results in the paper to actual cases where there exists private data in the prompt or where there are actual harmful consequences after extracting system prompts.

Comment

Dear Reviewer L9Uj,

Thank you for your engagement and for acknowledging that our method is more accurate than prompt reconstruction approaches. We appreciate the opportunity to clarify the remaining points you raised.

[“While I believe stealing system prompts can sometimes be an important threat, the setup in the paper is very simplified. There are no valuable unique system prompts that adversaries may be motivated to steal, currently the paper assumes very generic system prompts of role-playing.”] Thank you for agreeing that stealing system prompts is an important threat. Regarding the selection of prompts for our experiments, our goal was to use realistic and useful prompts from credible sources, such as the awesome-chatgpt-prompts dataset [1] and the Anthropic Prompt Library [2]. Prompts from awesome-chatgpt-prompts are used in published works from the prompt reconstruction literature [3], and they are also similar in tasks and utility to system prompts sold on online prompt marketplaces [4] (e.g., prompts for coding, coaching, and copywriting), highlighting their importance and value. The Anthropic Prompt Library webpage mentions that its prompts are “optimized prompts for a breadth of business and personal tasks”. In addition to that, in the typo case study in Appendix C of our paper, we present experiments with the Llama system prompt, an extremely important and valuable system prompt that we have no doubt Meta's scientists and engineers spent significant effort optimizing.

Regarding the link that you shared [5], Microsoft's risk classification for prompt extraction simply says that extracting public information from prompts is low risk; however, the idea of Prompt Detective is to enable its users to verify whether their system prompt has been reused, under the assumption that their system prompt is valuable to them and therefore likely confidential. Such a risk would be classified as “high risk”. While, unlike the Llama prompt, most companies do not open-source their valuable system prompts, we are confident that our method is applicable in those scenarios given its experimental effectiveness and robustness across the edge cases we showcase in the experiments of our paper. Confidential system prompts are not publicly available by definition, which prevents us from running experiments with them; however, we would respectfully argue that we do showcase the effectiveness of Prompt Detective on valuable and useful system prompts that are of interest to the scientific community and individual users.

We hope that we have addressed your remaining concerns and that you would be open to reconsidering your assessment of our work and increasing the score, especially in light of our clarifications, and given that you agree that our additional experiments show our method is more accurate than prompt reconstruction approaches, which we feel was an important concern highlighted first in your original review. Thank you very much again for your feedback, time, and engagement!

References:

[1] https://github.com/f/awesome-chatgpt-prompts

[2] https://docs.anthropic.com/en/prompt-library/library

[3] Zhang, Y., Carlini, N. and Ippolito, D., 2024, August. Effective prompt extraction from language models. In First Conference on Language Modeling.

[4] https://prompti.ai/chatgpt-prompt/

[5] https://www.microsoft.com/en-us/msrc/aibugbar?oneroute=true

Review
Rating: 6

This paper studies the problem of identifying whether an LLM uses a given system prompt, for the purposes of detecting the illegal leakage of system prompts. It applies statistical hypothesis testing to determine whether an LLM uses a given system prompt. The results indicate the effectiveness of the approach in showing that the target system prompts were not used by the given LLM for its generations.

Strengths

  1. The paper presents a novel application of statistical hypothesis testing to address an issue of economic value to LLM developers and users.
  2. The paper uses a formal approach to study and distinguish the impacts of different system prompts that may share some commonalities.
  3. The results show the promising capabilities of the method in differentiating the system prompt used by an LLM from a target system prompt.
  4. The analysis is supported by a quite informative ablation study, which provides interesting insights into the importance of generation length versus number of generations on the statistical test.

Weaknesses

  1. The paper lacks motivation for the use of permutation testing for the task at hand. Firstly, what theoretical benefit does statistical testing bring about for accurately inferring the system prompt over existing methods? Secondly, why is this particular method of statistical testing picked from various other possibilities?
  2. As per my understanding of permutation testing, the "if" statement in line 184 (Algorithm 1) should have the condition $s^* > s^{obs}$.
  3. I believe that a better null hypothesis would be that the system prompts are not identical. This is because rejecting this would support the motivation of the paper, i.e., to detect reuse of system prompt and the privacy violation caused by that. The current method can only tell whether the system prompt was not used by the LLM, which appears to be not well-aligned with the motivation.
  4. Given that this is not the first work on extracting the system prompts of proprietary LLMs, there is a lack of theoretical and experimental comparison with the various baselines/prior approaches. Examples of prior works on system prompt extraction include: Effective Prompt Extraction from Language Models by Zhang et al, Pleak: Prompt leaking attacks against large language model applications by Hui et al, Coercing llms to do and reveal (almost) anything by Geiping et al.
  5. I see the following points missing from the current experiments section. I believe that including them will enhance the paper.
    1. Include the results of open-source models used in Table 1, in Table 2 for hard examples as well.
    2. Include different models from the same model families (e.g., Llama-3) to see the variation in the effectiveness of Prompt Detective by number of parameters etc.
    3. The ablation study in Figure 4 needs clarification. I think it is not appropriate to put the plot from the left panel exactly on the right panel. Although I understand what it means now, it took me many re-readings to grasp the plot's structure. Moreover, in the left panel, the authors should clearly state the best values of the number of generations and task prompts identified from their ablation study. The number is not possible to judge from the plot.
    4. Include more of the latest models in the experiments to thoroughly study the effectiveness of the method. Examples include: GPT-4, Claude-3.5-sonnet, Gemini.
    5. The dependence of the results on the task probes is not adequately studied. I think there should be an ablation study of how the p-values vary with different choices for the task probes.
    6. The use of the BERT model specifically, as the projection function for the generations, is arbitrary and should be supported by an ablation study.
    7. I don't see the motivation of using universal task prompts for varying system prompts in the Awesome-ChatGPT-Prompts and Anthropic Prompt Library datasets. What is the challenge in generating custom task prompts for each system prompt?
  6. The generalization of the approach to the black-box setting, where the LLM is not known, makes a major assumption of the model being one of 6 popular models. This does not appear to be a feasible approach in the general setting, where instead of the 6 models, another model or a fine-tuned version of one of the 6 models could be used. Hence, I don't think that the approach is truly black-box.
  7. Minor comments:
    1. I believe that the paper should include some background on the main technique used, i.e., permutation testing. This can enhance the readability of the paper.
    2. Algorithm 1 is not described in detail in the paper. There are several constructs that are assumed to be known to the reader already. For example: what are $N_{permutations}$ and $\mathrm{Shuffle}(\cdot)$? What does it mean to concatenate responses in line 179? Is it concatenating the list of responses, or are the individual response strings concatenated element-wise over the 2 lists?
    3. The construction of "Hard examples" of system prompts is described in a very hand-wavy manner. For example, the distinction between "Same Prompt, Minimal Rephrasing" and "Same Prompt, Minor Rephrasing" is vague. Also, the word "minimal" has a very specific mathematical interpretation, which I don't see fit here. I would encourage the authors to clearly state the goals of each variation and the exact differences between them for the hard examples.
    4. Lines 346-347: Shouldn't "rejecting the null hypothesis of equal distributions" be "rejecting the null hypothesis of different distributions"?
    5. As the paper tries to verify LLM systems for copying a system prompt, it should contain the related works on verifying LLM systems in other contexts, especially the ones using statistical tests. A non-exhaustive list of related references is -
      1. C-RAG: Certified Generation Risks for Retrieval-Augmented Language Models by Kang et al.
      2. Certifying LLM Safety against Adversarial Prompting by Kumar et al.
      3. Quantitative Certification of Bias in Large Language Models by Chaudhary et al.

Questions

  1. What is Claude 3's system prompt when generating the task prompts for a given $\bar{p}$? How would that influence the quality of the generated task prompts?
Comment
  • 5.5 [Task probes.] The task probe construction is automated using the Claude 3 Sonnet language model; please see Table 6 in the Appendix for the prompts used to construct the task probes. Conceptually, the most important aspect of task probe construction is to build task probes which elicit responses that are related to the system prompt and are directly influenced by it; otherwise the utility of the target LLM responses would be limited. The other important aspect is practical generation budget allocation for the task prompts, which is explored in the ablation presented in Figure 4 and mentioned above.

  • 5.6 [Ablation on embedding methods.] Prompted by your review, we ran additional experiments for an ablation on embedding methods. The results are included in Appendix B, Table 4. In short, we observe no significant difference between the performance of different encoder models when used in Prompt Detective. We experimented with the gte-Qwen2-1.5B-instruct, jina-embeddings-v3, and mxbai-embed-large-v1 models from the MTEB leaderboard [5] and found that our results are consistent across various embedding techniques. Originally, we simply chose BERT for its simplicity, wide adoption, and computational efficiency. (An illustrative loading sketch for such encoders follows after this list.)

Model    Encoder                   FPR    FNR    p avg (pos)    p avg (neg)
Claude   BERT                      0.05   0.03   0.544 ± 0.29   0.022 ± 0.12
Claude   jina-embeddings-v3        0.03   0.07   0.489 ± 0.30   0.006 ± 0.03
Claude   mxbai-embed-large-v1      0.04   0.04   0.504 ± 0.29   0.020 ± 0.11
Claude   gte-Qwen2-1.5B-instruct   0.03   0.04   0.514 ± 0.29   0.013 ± 0.08
GPT3.5   BERT                      0.00   0.06   0.502 ± 0.28   0.000 ± 0.00
GPT3.5   jina-embeddings-v3        0.01   0.08   0.487 ± 0.30   0.003 ± 0.03
GPT3.5   mxbai-embed-large-v1      0.00   0.05   0.508 ± 0.30   0.000 ± 0.00
GPT3.5   gte-Qwen2-1.5B-instruct   0.01   0.05   0.502 ± 0.29   0.002 ± 0.02
  • 5.7 [Task prompt set in general experiments.] Having responses generated for the same collection of task prompts across different system prompts simplifies random sampling of negative cases – where the known system prompt differs from the system prompt used by the model.
  6. [Black box setting.] We note that our black-box setting is practical because, as we mention in Section 3.1 of the paper, prompt engineering is a much less resource-intensive task than developing or fine-tuning a custom language model; therefore, it is reasonable to assume that chatbots which reuse system prompts are based on one of the publicly available language models, such as API-based GPT models, Claude models, or open-source models like Llama, Mistral, or others. For the black-box experiments, we simply used all models from our general experiments as the pool; however, our method is similarly applicable to larger pools of models. We do agree that this setting can be considered a “dark gray” box; we believe that it is a practical setting nonetheless.

  7. [Minor comments.] Thank you for your points; we added clarifications to the paper and added the citations to the related work section. The prompt used for constructing hard examples is already included in Table 6 in Appendix F. We have included a new section with a detailed explanation of each step in the algorithm in Appendix D. Regarding your question 4, since the null hypothesis corresponds to two distributions being the same, the Type I error indicates the probability of rejecting the null hypothesis of equal distributions.

  8. [Task probe construction prompt.] Please see point 5.5.
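As an illustration of how interchangeable the encoder component is (point 5.6), here is a hypothetical sketch using the sentence-transformers library; the model name matches the ablation above, but the exact loading and pooling configuration we used may differ:

```python
# Hypothetical sketch, not our exact pipeline: embed a batch of LLM
# responses with a chosen encoder; the resulting (n, d) array can then
# be passed to the permutation test.
from sentence_transformers import SentenceTransformer

def embed_responses(responses: list[str],
                    model_name: str = "mixedbread-ai/mxbai-embed-large-v1"):
    model = SentenceTransformer(model_name)
    return model.encode(responses)  # shape (n_responses, embed_dim)
```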

We sincerely thank you once again for your detailed feedback. We hope our responses have addressed your questions, and kindly ask you to consider increasing your score in light of the clarifications, additional experiments, and writing improvements we have provided.

References:

[1] Zhang, Y., Carlini, N. and Ippolito, D., 2024, August. Effective prompt extraction from language models. In First Conference on Language Modeling.

[2] Hui, B., Yuan, H., Gong, N., Burlina, P. and Cao, Y., 2024. PLeak: Prompt Leaking Attacks against Large Language Model Applications. arXiv preprint arXiv:2405.06823.

[3] Geiping, J., Stein, A., Shu, M., Saifullah, K., Wen, Y. and Goldstein, T., 2024. Coercing LLMs to do and reveal (almost) anything. URL https://arxiv.org/abs/2402.14020.

[4] Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J.Z. and Fredrikson, M., 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.

[5] Muennighoff, N., Tazi, N., Magne, L. and Reimers, N., 2022. MTEB: Massive text embedding benchmark. arXiv preprint arXiv:2210.07316.

Comment
  4. [Comparison to prompt extraction baselines.]

Prompted by your review, we ran additional experiments comparing PLeak [2], the highest-performing of the existing prompt reconstruction approaches (referenced in your and Reviewer L9Uj's comments), to Prompt Detective in the prompt membership setting. We used the optimal recommended setup for real-world chatbots from Section 5.2 of the PLeak paper: we computed 4 Adversarial Queries with PLeak and LLAMA-2 13B as the shadow model, as recommended, and we used ChatGPT-Roles as the shadow domain dataset to minimize domain shift for PLeak. We observed that PLeak sometimes recovers large parts of target prompts even when there is no exact substring match, and that using an edit-distance threshold of 0.2 to find matches maximizes PLeak's performance in the prompt membership inference setting. To further maximize the performance of the PLeak method, we also aggregate the reconstructions across the 4 Adversarial Queries (AQs) by taking the best reconstruction match (this aggregation is infeasible in the prompt reconstruction setting, where the target prompt is unknown, but can be used to obtain the best results in the prompt membership inference setting, where we know the reference prompt). We then applied these adversarial prompt extraction queries to LLAMA-2 13B as the target model with system prompts from Awesome-ChatGPT-Prompts and computed False Positive and False Negative rates for direct comparison with the results of Prompt Detective reported in Table 1 of our paper. We report the results below:

Method             Target Model   FPR    FNR
Prompt Detective   Llama2 13B     0.00   0.05
PLeak              Llama2 13B     0.00   0.46

We see that Prompt Detective significantly outperforms PLeak in the prompt membership inference setting, which is expected, since Prompt Detective is specifically tailored to the verification setup while PLeak is geared towards the different problem of prompt reconstruction. We added these results to Appendix B.1.

  5. [Experiment points.]
  • 5.1 & 5.2 & 5.4 [Model families and sizes used in the general experiments and hard examples.] In our general experiments in Table 1, we report Prompt Detective performance across a variety of language model families and sizes, including both larger and smaller models, multiple models from various open-source families, and closed-source models. We believe that Table 1 clearly showcases the effectiveness of Prompt Detective, with minor variations in performance across these settings; therefore, we believe including more combinations of model sizes and families would result in suboptimal compute spend for a limited quantity of new research findings. The Hard Examples section is a deep dive where we explored the setting of highly similar system prompts; following the same logic of responsible use of compute resources and minimizing the carbon footprint, we decided to focus on the efficient variants of models powering popular real-world chatbots (as our results are experimentally fairly robust across model sizes and families, and conceptually we would expect the same, we opted for the efficient Claude Haiku and GPT-3.5 over larger and more expensive Claude or GPT variants). Prompted by your questions, we added a clarification on the models we used in our experiments to Appendix G.

  • 5.3 [Figure 4.] Thank you for your questions. Figure 4 explores the optimal setup for a fixed budget of generations. On the left panel, it explores whether, for the same number of LLM queries, it is better to have more task prompts or more responses per task prompt. While the trends are overall fairly similar, the diversified approach, with a larger number of task prompts, yields more consistent results. The right panel explores a similar question, but in terms of the number of tokens instead of LLM queries (we note that the blue and red lines are the same in shape, but the x-axis is now different and is measured in tokens; each generation on the red and blue lines is 512 tokens long, which we have now clarified in the caption). We clearly see that for a fixed budget of generated tokens, having a larger number of shorter responses is most useful for effective separation. The best particular setup in Figure 4 is to use 10 task prompts with 50 generations, each 64 tokens long, which is discussed in Section 5.2.

Comment

Thank you for your thorough review and constructive feedback. We appreciate your recognition of our novel application and the strengths you identified. We address your concerns below:

  1. [Motivation for statistical framework and the use of permutation test.] Thank you for your question.
  • First, we note that the problem of prompt membership inference (i.e., verifying whether a known system prompt is used by a model) is different from that addressed by existing prompt reconstruction methods (for example, [1,2,3]), which aim to recover the prompt. While these methods achieve impressive results on prompt reconstruction given the difficulty of the reconstruction problem, their reconstruction success rate is not high enough to confidently verify prompt reuse and solve prompt membership inference; they are computationally expensive, usually relying on GCG-style optimization [4]; and some of these methods require access to model gradients [3].

  • Specifically, [1] achieves 44.6% to 74.1% success rate in prompt reconstruction (per Table 1 in [1]). [2] relies on knowing the target domain of system prompts and outperforms other reconstruction approaches in relative terms if the domain is known, but still achieves only 67.3% or 68.9% success rates in absolute terms on more difficult target models and domains (per Table 3 in [2]). Additionally, [2] reports that PLeak successfully reconstructs only 68% of prompts in real-world chatbots. [3] achieves up to 82% reconstruction success rate (per Figure 5 in [3]) but requires access to model gradients for optimizing adversarial prompts.

  • For direct comparison, we ran additional experiments comparing PLeak [2], the highest-performing of the existing prompt reconstruction approaches (referenced in your and Reviewer L9Uj's comments), to Prompt Detective in the prompt membership setting. We observe that Prompt Detective is significantly more effective than PLeak in the prompt membership inference setting and report the detailed results of this experiment in our general response as well as in our response to your point 4 [Comparison to prompt extraction baselines] above. We added these results to Appendix B.1 as well.

  • The above motivates the development of Prompt Detective as a statistical framework specifically tailored to the prompt membership inference setting. Prompted by your question, we expanded our original discussion on this in lines 58-63 of the Introduction and Section 2.2 of the related work. Regarding the choice of the permutation test, we chose it for its non-parametric nature and ability to handle high-dimensional data without distributional assumptions. Compared to the latency of language model inference, the latency of running many permutations (usually thought of as the main downside of this test) was negligible.

  2. [Condition in line 184 (Algorithm 1).] We appreciate your careful review of our algorithm. However, the original condition $s^* \le s^{obs}$ is correct for the following reasons: (1) Lower cosine similarity indicates more difference between the prompts, (2) We want to determine if the observed similarity is significantly lower than what we would expect by chance, (3) We count how often the permuted data gives a result as extreme or more extreme (i.e., as low or lower) than the observed data, (4) If we find few cases where $s^* \le s^{obs}$, it results in a low p-value, indicating that the observed similarity was significantly low and extreme. This aligns with standard permutation test methodology, where we count the number of permutations that produce a test statistic as extreme as or more extreme than the observed value. (A code sketch of this test follows after this list.)

  3. [Null hypothesis.] We note that adopting the null hypothesis (H₀) which posits that the two samples originate from the same distribution is the standard setup in statistical hypothesis testing for determining whether two samples derive from different distributions. This framework is foundational to a wide array of established tests, including the Kolmogorov-Smirnov test, the Mann-Whitney U test, and the Anderson-Darling test, among others. However, we acknowledge that false positives (i.e., incorrectly predicting that a system prompt was stolen when it was not) are important, and they carry more severe consequences. To mitigate this risk, we place significant emphasis on the power of our statistical test and conduct extensive experiments using challenging examples. In these scenarios, system prompts may differ by only a few words, yet Prompt Detective can still effectively distinguish between the distributions of the outputs generated by these prompts when a large number of generations (500) is utilized in the method.
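To illustrate the counting rule in point 2, below is a minimal, self-contained sketch of the one-sided permutation test on the cosine similarity of mean response embeddings. It assumes the responses have already been embedded as rows of NumPy arrays; names are illustrative, not the paper's code:

```python
import numpy as np

def permutation_test(emb_a: np.ndarray, emb_b: np.ndarray,
                     n_permutations: int = 10_000, seed: int = 0) -> float:
    """p-value for H0: both samples of response embeddings come from the
    same distribution. Test statistic: cosine similarity between the two
    groups' mean embedding vectors."""
    rng = np.random.default_rng(seed)

    def statistic(x: np.ndarray, y: np.ndarray) -> float:
        mx, my = x.mean(axis=0), y.mean(axis=0)
        return float(mx @ my / (np.linalg.norm(mx) * np.linalg.norm(my)))

    s_obs = statistic(emb_a, emb_b)
    pooled = np.vstack([emb_a, emb_b])
    n_a = len(emb_a)

    count = 0
    for _ in range(n_permutations):
        idx = rng.permutation(len(pooled))
        s_star = statistic(pooled[idx[:n_a]], pooled[idx[n_a:]])
        if s_star <= s_obs:  # as extreme (as low) as observed
            count += 1
    return count / n_permutations
```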

Comment

Dear Reviewer DuyU,

Thank you for your valuable time and effort in reviewing our paper and rebuttal—we truly appreciate your thoughtful feedback.

Your insights have helped us improve our work, we have made a significant effort and added new experiments and results to address your concerns in the updated draft.

We kindly hope you will consider raising your score if our revisions and rebuttal have resolved your concerns. Please reach out with any further questions. Thanks again for your contribution to our paper.

评论

Thanks to the authors for their rebuttal. It addresses several of my concerns and I mention the remaining ones below.

  1. I think the motivation for statistical testing is still missing. I was curious not only to see the empirical advantage of the proposed method, but also to understand the reason for choosing statistical methods for prompt membership inference. Why not design a better empirical method?
  2. I see a contradiction in this statement: "Lower cosine similarity indicates more difference between the prompts". Being a similarity metric, it should imply less difference. Either the authors should clarify this more in the paper or change their terminology. Currently, I don't understand their algorithm.
  3. I think using the null hypothesis of the paper, just because general statistical methods adopt some conventions, is not justified. Being for a specialized application with particular consequences, the work should (re-)design every component of the framework according to the needs of the application.
  4. I do not understand the reason given by the authors for universal task probes.
Comment

Dear Reviewer DuyU,

Thank you for your engagement. We appreciate the opportunity to clarify the remaining points you raised.

  1. [“The motivation for statistical testing is still missing.”] We chose a statistical testing framework for prompt membership inference because it provides a principled and certifiable method to determine whether a specific system prompt was used by a language model. Statistical methods allow us to measure the significance of observed differences in output distributions between models with and without the candidate prompt.
  2. [“Why not design a better empirical method?”] We have demonstrated that Prompt Detective already achieves great empirical performance, with a false negative rate close to 0.05 — the significance level selected in advance — and a false positive rate remaining close to 0% in most setups. We have also shown that our statistical approach achieves much better performance compared to state-of-the-art prompt reconstruction methods.
  3. [“I see a contradiction in this statement: 'Lower cosine similarity indicates more difference between the prompts'. Being a similarity metric, it should imply less difference.”] Lower cosine similarity between two vectors indicates that the vectors are dissimilar (as stated in [1]: “similarity measures take on large values for similar objects and either zero or a negative value for very dissimilar objects”). Therefore, a lower cosine similarity between two averaged generation representation vectors indicates more difference between the prompts, confirming that our statement is correct. (A toy example follows after this list.)
  4. [Null hypothesis]: The current null hypothesis meets the needs of the application. Using the null hypothesis that the generations come from the same distribution allows us to set a significance level (tolerance to Type I error, i.e., false negative rate) in advance, and to control the sensitivity of the test (Type II error, i.e., false positive rate) by increasing the number of generations used in Prompt Detective. That is, given enough samples, we can guarantee an arbitrarily low false positive rate. We demonstrated in our paper that this approach leads to a false negative rate close to the significance level of 0.05, while increasing the number of generations to 1000 increases the sensitivity of Prompt Detective to the level of distinguishing between two prompts differing by a typo, effectively reducing the false positive rate to 0%.
  5. [“I do not understand the reason given by the authors for universal task probes.”] Using universal task prompts is a design choice in our experiments in the standard setup: it allows us to reuse language model outputs to emulate the negative case of comparing generations for different system prompts with the same task prompts. We used specialized task prompts in the Hard Example experimental setup.
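A two-dimensional toy example of the similarity convention discussed in point 3 (values are illustrative only, not measurements from our experiments):

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 0.0])   # mean embedding under prompt A
b = np.array([0.9, 0.1])   # nearly aligned mean embedding
c = np.array([0.0, 1.0])   # orthogonal mean embedding

print(cosine(a, b))  # ~0.99: high similarity, small difference
print(cosine(a, c))  #  0.0:  low similarity, large difference
```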

Given that you previously mentioned that our rebuttal addresses several of your concerns, we would appreciate you raising the score to reflect that, especially in light of these additional clarifications. Thank you!

References:

[1] https://en.wikipedia.org/wiki/Similarity_measure

Comment

Thanks to the authors for the helpful clarifications. I am raising my score to 6.

Reason for not giving lower score: I believe that this paper is in the acceptance range with a potentially useful application of statistical tests in prompt membership inference.

Reason for not giving higher score: The method is a straightforward application of a statistical test. Hence, the novelty of the work is only in terms of applying the test to a previously unexplored domain, not in terms of the method or its theory.

Review
8

The system prompt is an important component of an LLM system, as it directly controls the behavior of the base model by outlining the context of the conversation and the expected behaviors of the model. As a result, model developers may spend significant effort developing proprietary system prompts that boost model utility, and they hide these prompts from users. However, given the existence of various jailbreaking techniques that can drive the model to easily divulge its system prompt, it is important to have a statistical testing framework that lets model developers perform end-to-end audits of a given LLM system, checking whether a specific system prompt has been used by the underlying LLM. To this end, the authors propose Prompt Detective, a permutation testing framework to determine whether a given system prompt was used by a third-party language model. Through experiments, the authors demonstrate the effectiveness of their proposed method in scenarios where the prompts are rewritten to various degrees.

Strengths

  1. This paper presents an important threat model as summarized in the previous section.
  2. The paper is well-written with each figure sending a clear message. I can get the gist of it in my first pass.

Weaknesses

  1. While this work presents a valid threat model, have the authors considered possible adaptive attacks in which an adversary intends to hide the fact that they used a specific system prompt by optimizing the prompt in a way that preserves utility but results in inconclusive test results?
  2. The current experiments exclusively use BERT models for embedding. It would be much appreciated if the authors could supplement experiments to demonstrate the effectiveness of the method with more embedding models.
  3. In a more realistic scenario, an attacker could use jailbreaking techniques to steal the system prompt of an LLM with blackbox access, through which the attacker can obtain a prompt that may or may not reproduce the original system prompt. Then, the attacker may choose to deploy the stolen prompt to their systems with a similar base LLM. Instead of manually varying the system prompt to validate the statistical testing framework, have the authors considered this threat model?

Questions

  1. Is the proposed method applicable to soft prompts and stolen in-context demonstrations?
  2. My questions raised in “weakness.”
Comment

Thank you for your thoughtful review and positive feedback! We appreciate your recognition of the importance of our work and the clarity of our presentation. We are glad you found the paper well-written and easy to understand. Please see our answers to your questions below:

  1. [Adaptive attacks:] Thank you for this point. Indeed, the adversary rephrasing the system prompt is a possible scenario. However, we note that if the system prompt is rephrased such that the output distribution remains highly similar to the original (which corresponds to preserving utility), a very large set of task prompts and responses will be required to achieve low p-values and separation in Prompt Detective. Observing no separation until a very large sample of responses is used is therefore itself indicative of an adversary attempting to preserve the output distribution (i.e., utility) while rephrasing the system prompt, so Prompt Detective remains effective in this scenario. This is consistent with our hard-setup experiments in Section 5.2.

  2. [Embedding models:] We appreciate this suggestion. We have now conducted additional experiments using different embedding models (e.g., the gte-Qwen2-1.5B-instruct, jina-embeddings-v3, and mxbai-embed-large-v1 models from the MTEB leaderboard [1]) on the Awesome-ChatGPT-Prompts dataset and found that our results are consistent across various encoder models. We have included these new results in Appendix B Table 4 and briefly mentioned them in the main text. Please find the results below (a sketch of how the encoder is swapped in appears after this list):

| Model | Encoder | FPR | FNR | p^p_avg | p^n_avg |
|---|---|---|---|---|---|
| Claude | BERT | 0.05 | 0.03 | 0.544 ± 0.29 | 0.022 ± 0.12 |
| Claude | jina-embeddings-v3 | 0.03 | 0.07 | 0.489 ± 0.30 | 0.006 ± 0.03 |
| Claude | mxbai-embed-large-v1 | 0.04 | 0.04 | 0.504 ± 0.29 | 0.020 ± 0.11 |
| Claude | gte-Qwen2-1.5B-instruct | 0.03 | 0.04 | 0.514 ± 0.29 | 0.013 ± 0.08 |
| GPT3.5 | BERT | 0.00 | 0.06 | 0.502 ± 0.28 | 0.000 ± 0.00 |
| GPT3.5 | jina-embeddings-v3 | 0.01 | 0.08 | 0.487 ± 0.30 | 0.003 ± 0.03 |
| GPT3.5 | mxbai-embed-large-v1 | 0.00 | 0.05 | 0.508 ± 0.30 | 0.000 ± 0.00 |
| GPT3.5 | gte-Qwen2-1.5B-instruct | 0.01 | 0.05 | 0.502 ± 0.29 | 0.002 ± 0.02 |
  3. [Verification step in jailbreaking attacks:] While we consider Prompt Detective to be a defense mechanism enabling users to verify system prompt reuse by adversaries, the method can indeed be used as a verification step on top of prompt extraction attacks. In fact, this research direction may lead to significantly improved prompt extraction attacks; within the scope of this paper, we acknowledge this as a potential ethical consideration for developing Prompt Detective in our Ethics Statement.

  4. [Applicability to soft prompts and in-context demonstrations:] Prompt Detective is indeed directly applicable to the use cases of soft prompt and in-context example verification, since it detects differences in output distributions induced by prompts. However, we decided to focus our exploration in this paper on system prompts. We note that using Prompt Detective as a membership inference attack to verify the use of a single in-context example, or of an unordered set of multiple in-context examples, is a slightly more challenging task requiring additional care in interpreting test results, as even changing the order of in-context examples may lead to subtle (or sometimes more pronounced) distribution shifts.
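As referenced in point 2, here is a minimal sketch of how the encoder slots into the pipeline. The Hugging Face repo IDs and the sentence-transformers wrapper are our assumptions about the checkpoints used, not necessarily the authors' exact setup.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Candidate encoders from the ablation; repo IDs are our best guess
# at the corresponding Hugging Face checkpoints.
ENCODERS = [
    "jinaai/jina-embeddings-v3",
    "mixedbread-ai/mxbai-embed-large-v1",
    "Alibaba-NLP/gte-Qwen2-1.5B-instruct",
]

def embed_generations(generations: list[str], encoder_name: str) -> np.ndarray:
    """Return one embedding per generation; the permutation test that
    consumes these vectors is unchanged when the encoder is swapped."""
    model = SentenceTransformer(encoder_name, trust_remote_code=True)
    return np.asarray(model.encode(generations))
```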

Thank you again for your positive review of our paper, and we hope our clarifications address your questions!

References:

[1] Muennighoff, N., Tazi, N., Magne, L. and Reimers, N., 2022. MTEB: Massive text embedding benchmark. arXiv preprint arXiv:2210.07316. (https://huggingface.co/spaces/mteb/leaderboard)

Review
8

This paper introduces "Prompt Detective", a method to infer whether a chat LLM uses a particular system prompt.

Threat model: Given a target chatbot LLM with an unknown system prompt and a candidate system prompt, an attacker can query the chatbot with user prompts and observe responses (as text). The attacker's goal is to reliably guess whether the candidate system prompt differs from the target chatbot's actual system prompt. The attacker generally knows the LLM being used (Sec. 6 considers a relaxation).

Prompt Detective attack: Given a set of user prompts, the attack samples multiple generations per user prompt, once for the target chatbot LLM with the unknown system prompt, and once for the same LLM with the candidate system prompt. The attack then performs a permutation test over the two sets of generations, permuting generations per user prompt and using the cosine similarity of mean BERT embeddings as the test statistic. If the test's p-value falls below some significance level α, Prompt Detective outputs that the prompts are distinct; otherwise, it outputs that they might be the same (i.e., it does not reject the null hypothesis).

Evaluation: The paper evaluates Prompt Detective on a broad set of prompts and models (ranging from Mistral 7B to GPT-3.5 in size) using both positive (equal) and negative (not equal) pairs of unknown and candidate system prompts. The evaluation contains both an average-case setup and a "hard" setup that considers prompt pairs with a controlled level of similarity.

Strengths

While prompting might not be a long-term business strategy, many current applications build upon proprietary system prompts. Hence, the problem this paper studies is relevant today. The threat model, furthermore, is precise, sensible, and discussed thoroughly.

Prompt Detective is a straightforward hypothesis test without any unnecessary moving parts, yet it is done properly and achieves good performance at what it set out to do. For example, in the average-case evaluation ("standard setup"), with a significance level of 0.05 (controlling false negatives, not false positives!), the method seems to work reasonably well; it yields almost no false positives and a true positive rate consistently over 90%.

The paper is well-written, and the method is described well. The evaluation is mostly done well (except for one issue; see weaknesses). The authors also investigate Prompt Detective's degrees of freedom (e.g., more query prompts vs. more generations per query prompt) and show that the method still works in a more black-box threat model.

Weaknesses

Hypothesis test controls the wrong thing: The hypothesis test is set up to control the false negative rate (i.e., predicting that a system prompt was not used when it was). However, as in e.g. membership inference, wrongly predicting that a system prompt was stolen has much stronger consequences than predicting that it was not stolen (see e.g. Carlini et al., 2021). Hence, the null hypothesis should be that the two system prompts are different (this might be hard with the used permutation test) to ensure a low false positive rate. The paper should also report full ROC curves when sweeping over significance levels α (in general, but especially due to the way the test is set up).

Figures 3 and 4 lack positive pairs: In Fig. 3 and 4, it is very much expected that the p-value decreases with more queries for negative cases (different prompts) due to the way the hypothesis test is set up. Hence, the p-value for positive pairs (same prompts) is much more interesting. This value should stay high for the method to work well, but I suspect that this p-value decreases too. Hence, the evaluation should also show p^p_avg in Fig. 3 and 4. (Additionally, both figures' plots might benefit from error bands around the lines.)

Hard setup lacks positive pairs: The hard setup only considers negative pairs (original system prompt plus a modified version). However, for the same reasons outlined above, there must be positive pairs to assess how well Prompt Detective works (i.e., include positive pairs and show the FNR/TPR in Table 2; simply looking at the FPR does not suffice).

Questions

  1. Why does the standard evaluation setup described in Sec. 4.3 only consider one negative pair instead of all possible pairs? As acknowledged, "negative pairs may not represent similar system prompts"; hence considering all possible negative pairs might provide a more complete picture.
  2. The typo case study in App. C.2 sounds interesting; is it possible to get full results for that study? In particular, what are the FPR/FNR and p-value for the positive case?
  3. Is there any intuition why Claude Haiku behaves very differently from the other models in Table 1?
Comment

Thank you for your detailed review and constructive feedback. We appreciate your positive remarks regarding the relevance of our problem statement, the soundness and simplicity of Prompt Detective, and the overall clarity of our presentation. Below, we respond to each point of feedback in detail:

  1. [Hypothesis test controls the wrong thing:] We note that adopting the null hypothesis (H₀), which posits that the two samples originate from the same distribution, is the standard setup in statistical hypothesis testing for determining whether two samples derive from different distributions. This framework is adopted in a wide array of established tests, including the Kolmogorov-Smirnov test and the Mann-Whitney U test, among others. However, we acknowledge that false positives (i.e., incorrectly predicting that a system prompt was stolen when it was not) carry more severe consequences. To mitigate this risk, we place significant emphasis on the power of our statistical test and conduct extensive experiments using challenging examples in Section 5.2. In these scenarios, system prompts may differ by only a few words, yet Prompt Detective can still effectively distinguish between the distributions of the outputs generated by these prompts when a large number of generations (500) is used in the method.

  2. [The paper should also report full ROC curves:] Thank you for your suggestion; we have added ROC curves for Prompt Detective in Appendix B Figure 6 (a sketch of how they are computed from per-pair p-values appears after this list). We observe that Prompt Detective achieves a ROC-AUC of 1.0 in all cases except for Claude on the Awesome-ChatGPT-Prompts dataset, where it achieves a 0.98 ROC-AUC.

  3. [Figures 3 and 4 lack positive pairs:] Thank you for your suggestion; we have now updated Figure 3 to include plots of p-values for positive pairs. Our experiments with hard examples do not include positive pairs; please see the explanation in the next comment. For Prompt Detective, the false negative rate (FNR) for positive pairs is determined by the chosen significance level. Our experiments confirm that the FNR remains close to 0.05 when using a significance level of 0.05; similarly, adjusting the significance level to 0.01 results in an FNR close to 0.01, consistent with theoretical expectations. Figure 3 demonstrates the relationship between the power of the test (the false positive rate) and the number of queries used in the test, which does not affect the significance level and therefore the false negative rate. As expected, the average p-value in the positive case stays close to 0.5.

  4. [Hard setup lacks positive pairs:] As noted in your review, the consequences of a wrongful accusation are significantly higher, and, to address this, the Hard Setup section focuses on evaluating the potential limitations of Prompt Detective when distinguishing between highly similar but distinct prompts. Including results for positive pairs would not offer additional insights, as this experiment has already been explored in the Standard Setup using the exact same prompts from the Anthropic Library. In the Hard Setup section, we specifically vary the "counterfactual prompt" while keeping the original (proprietary) prompts fixed; these are the same prompts as in the Standard Setup.

  5. [“Why does the standard evaluation setup described in Sec. 4.3 only consider one negative pair instead of all possible pairs?”:] Considering all possible negative pairs would result in approximately 30,000 prompt pairs. Running Prompt Detective on each pair would require 1,190,450 queries per model, which is computationally prohibitive. To address this, we randomly selected a subset of negative pairs equal in number to the positive pairs, corresponding to the total number of prompts in the dataset.

  6. [Full results for typo case study in App. C.2:] Since these case studies include a single prompt, we cannot report distribution-level metrics such as FNR and FPR. However, we observe that Prompt Detective can distinguish between these prompts, achieving a p-value of 0.02 when a larger number of queries (1000) is used.
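As referenced in point 2, here is a short sketch (assuming scikit-learn; the p-values and labels below are placeholders, not real results) of how sweeping the significance level over per-pair p-values yields the ROC curve:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Placeholder inputs: one p-value per evaluated prompt pair, and a label
# that is 1 when the two prompts really were different, 0 when identical.
p_values = np.array([0.01, 0.43, 0.002, 0.61, 0.03])
prompts_differ = np.array([1, 0, 1, 0, 1])

# Sweeping the significance level alpha over the p-values traces the ROC
# curve: at each alpha we predict "different" whenever p < alpha.
scores = 1.0 - p_values
fpr, tpr, thresholds = roc_curve(prompts_differ, scores)
print("ROC-AUC:", roc_auc_score(prompts_differ, scores))
```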

In response to the other reviewers' feedback, we conducted additional experiments, including (1) a comparison of Prompt Detective against PLeak, a prompt extraction baseline, and (2) an ablation study on Prompt Detective embeddings. Please see our general response for details.

We sincerely thank you once again for your thoughtful feedback. We hope our responses have addressed your questions, and kindly ask you to consider increasing your score in light of the clarifications and improvements we have provided.

Comment

I thank the authors for their detailed reply and the additional plots/details.

Re most points: The full ROC curves (2) and p-values for positive pairs in Fig. 3 (3) help to instill confidence in the method, and I see my mistake regarding Figure 4/the hard setup (4). I also thank the authors for clarifying the challenge of comparing all pairs (5), although I am still wondering whether there is a heuristic approach to subsample particularly "hard" pairs. Finally, I realize that I made a thinking error, and the authors are indeed correct that reporting TPR and FPR values in the typo case study is not possible (6). However, the paper could more explicitly mention that the case study is essentially a single (interesting but limited) example, not a full study with significant results.

Re 1 (hypothesis test controls the wrong thing):

We note that adopting the null hypothesis (H₀) which posits that the two samples originate from the same distribution is the standard setup in statistical hypothesis testing for determining whether two samples derive from different distributions.

This is correct, but not my point. My point: in statistical hypothesis testing, the null hypothesis should be the "base case", not the case that has big consequences. In the context of this paper, H0 should thus be "System prompt was not stolen" and H1 should be "System prompt was stolen". However, in this paper, it is the other way around (H0 is that the two distributions are the same, which is equivalent to the prompt having been stolen). I do understand that permutation tests use the null hypothesis that distributions are equal. However, the tools (the test) should follow the model (the hypotheses), not the other way around.

Although minor issues remain, I think that the paper is overall okay and can be accepted. If I could, I would raise my score to a 7. Since this is not possible (and the authors addressed many concerns) I hence raise my score to an 8, but it could also well be a 6.

Comment

Dear Reviewers and AC,

We thank the reviewers for their thoughtful feedback on our paper. We appreciate the time and effort they have put into reviewing our work. We carefully addressed all of the comments and want to highlight several improvements that multiple reviewers asked for.

  1. [Comparison to prompt extraction baselines:] TL;DR: We ran additional experiments comparing PLeak [1], the strongest prompt extraction baseline, in a careful PLeak-friendly prompt membership inference setup, and found that Prompt Detective is significantly more effective.

Prompt extraction methods can be adapted to the prompt membership inference setting by comparing recovered system prompts to the reference system prompts. We ran additional experiments comparing PLeak [1], the highest-performing of the existing prompt reconstruction approaches, to Prompt Detective. We used the optimal recommended setup for real-world chatbots from Section 5.2 of the PLeak paper: we computed 4 Adversarial Queries with PLeak using LLAMA-2 13B as the shadow model, as recommended, and used ChatGPT-Roles as the shadow domain dataset to minimize domain shift for PLeak. We observed that PLeak sometimes recovers large parts of target prompts even when there is no exact substring match, and that declaring a match when the normalized edit distance falls below a threshold of 0.2 maximizes PLeak's performance in the prompt membership inference setting (a sketch of this matching rule follows the table below). To further maximize PLeak's performance, we also aggregate the reconstructions across the 4 Adversarial Queries (AQs) by taking the best reconstruction match; this aggregation is infeasible in the prompt reconstruction setting, where the target prompt is unknown, but can be used to obtain the best results in the prompt membership inference setting, where the reference prompt is known. We then applied these adversarial prompt extraction queries to LLAMA-2 13B as the target model with system prompts from Awesome-ChatGPT-Prompts and computed False Positive and False Negative rates for direct comparison with the Prompt Detective results reported in Table 1 of our paper. We report the results below:

| Method | Target Model | FPR | FNR |
|---|---|---|---|
| Prompt Detective | Llama2 13B | 0.00 | 0.05 |
| PLeak | Llama2 13B | 0.00 | 0.46 |

We see that Prompt Detective significantly outperforms PLeak in the prompt membership inference setting, which is expected, since Prompt Detective is specifically tailored to the verification setup while PLeak is geared towards the different problem of prompt reconstruction. We added these results to Appendix B.1.
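Below is a small sketch of the matching rule described above. The 0.2 threshold and the best-of-queries aggregation follow the text; the normalization by the longer string's length and the helper functions themselves are our assumptions for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def is_membership_hit(reconstruction: str, reference: str,
                      threshold: float = 0.2) -> bool:
    """Flag a PLeak reconstruction as a match when its normalized edit
    distance to the reference system prompt falls below the threshold."""
    dist = levenshtein(reconstruction, reference)
    return dist / max(len(reconstruction), len(reference)) < threshold

def best_of_queries(reconstructions: list[str], reference: str) -> bool:
    """Aggregate across the 4 adversarial queries: any match counts."""
    return any(is_membership_hit(r, reference) for r in reconstructions)
```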

  2. [Ablation on embedding methods:] TL;DR: We ran additional ablations; our results are consistent across embedding methods.

We ran additional experiments for an ablation on embedding methods. The results are included in Appendix B. In short, we observe no significant difference between the performance of different encoder models when used in Prompt Detective. We experimented with the gte-Qwen2-1.5B-instruct, jina-embeddings-v3, and mxbai-embed-large-v1 models from the MTEB leaderboard [2] and found that our results are consistent across various embedding techniques. We originally chose BERT for its simplicity, wide adoption, and computational efficiency.

| Model | Encoder | FPR | FNR | p^p_avg | p^n_avg |
|---|---|---|---|---|---|
| Claude | BERT | 0.05 | 0.03 | 0.544 ± 0.29 | 0.022 ± 0.12 |
| Claude | jina-embeddings-v3 | 0.03 | 0.07 | 0.489 ± 0.30 | 0.006 ± 0.03 |
| Claude | mxbai-embed-large-v1 | 0.04 | 0.04 | 0.504 ± 0.29 | 0.020 ± 0.11 |
| Claude | gte-Qwen2-1.5B-instruct | 0.03 | 0.04 | 0.514 ± 0.29 | 0.013 ± 0.08 |
| GPT3.5 | BERT | 0.00 | 0.06 | 0.502 ± 0.28 | 0.000 ± 0.00 |
| GPT3.5 | jina-embeddings-v3 | 0.01 | 0.08 | 0.487 ± 0.30 | 0.003 ± 0.03 |
| GPT3.5 | mxbai-embed-large-v1 | 0.00 | 0.05 | 0.508 ± 0.30 | 0.000 ± 0.00 |
| GPT3.5 | gte-Qwen2-1.5B-instruct | 0.01 | 0.05 | 0.502 ± 0.29 | 0.002 ± 0.02 |
  3. [Visualization improvements:] We added ROC curves for Prompt Detective in Appendix B Figure 6. We observe that Prompt Detective achieves a ROC-AUC of 1.0 in all cases except for Claude on the Awesome-ChatGPT-Prompts dataset, where it achieves a 0.98 ROC-AUC.

We hope these additions strengthen our paper, and we thank everyone again for their valuable feedback and help in improving our study.

[1] Hui, B., Yuan, H., Gong, N., Burlina, P. and Cao, Y., 2024. PLeak: Prompt Leaking Attacks against Large Language Model Applications. arXiv preprint arXiv:2405.06823.

[2] Muennighoff, N., Tazi, N., Magne, L. and Reimers, N., 2022. MTEB: Massive text embedding benchmark. arXiv preprint arXiv:2210.07316.

AC Meta-Review

This paper proposes a method which uses statistical tests to determine whether a system prompt has been stolen. The method is simple and intuitive, and the paper shows some promising results.

The reviewers generally agree that the paper has some significant strengths: the proposed threat model is important, interesting and sensible; the paper is well written; the results are promising and some of the results are surprising.

However, in the initial review, rebuttal, and discussion phases, the reviewers also expressed important concerns about some aspects of the paper. First, the use cases adopted in the experiments may not be practical enough; the experimental results may therefore not support the motivation of the paper. Multiple reviewers pointed this out, and the others concurred. Second, the paper may lack technical novelty, since the algorithm is largely an application of statistical tests. This concern is less important than the first, but it was also raised by multiple reviewers. In addition, the reviewers raised other issues, such as whether the adopted null hypothesis is reasonable and whether the proposed detection method should be evaluated against real attack methods.

Given these important concerns, the reviewers could not champion the paper. As a result, rejection is recommended. We think the paper has good potential, and we encourage the authors to incorporate more realistic use cases in the experiments to make the paper more convincing.

Additional Comments from Reviewer Discussion

During the discussion phase, there were extensive and fruitful discussions among the reviewers. The reviewers all discussed what they think are the strengths of the paper and also shared their concerns, including whether they agree with the concerns expressed by other reviewers. In the end, as a result of the important concerns mentioned above, the reviewers could not champion the paper for acceptance.

Final Decision

Reject