GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs
GASP is a novel black-box attack framework that efficiently explores the latent space to generate human-readable adversarial suffixes, significantly improving jailbreak success rates while maintaining prompt coherence.
摘要
评审与讨论
This paper presents GASP (Generative Adversarial Suffix Prompter), a fully black-box, query-efficient jailbreak attack framework targeting large language models (LLMs). GASP generates adversarial suffixes that, when appended to seemingly benign prompts, bypass safety filters and elicit harmful responses from LLMs.
优缺点分析
strength:
- GASP’s integration of LBO in a continuous latent space marks a significant innovation compared to prior discrete token optimization methods, enabling faster and more scalable black-box attacks.
- GASP adversarial suffixes are human-readable and coherent, bypassing simple perplexity-based or rule-based filters—this enhances realism in red-teaming use cases.
- The framework is fully black-box, requiring no internal model gradients or weights, making it practical against commercial APIs.
Weakness:
- While GASPEval is powerful, it depends on a second LLM, which may introduce evaluation bias. The authors do not fully analyze variance across different evaluators.
- GASP focuses on suffix-only attacks, assuming the presence of a "harmful intent" prompt. It does not explore attacks starting from neutral prompts or indirect injection strategies.
- Although LBO is more efficient than discrete search, it still requires iterative queries and finetuning, which could be costly against highly restricted APIs.
问题
NA
局限性
NA
最终评判理由
I have carefully read the paper again, and fixed some problems in my review, and also increased my score.
格式问题
NA
We appreciate Reviewer U7Y9 for taking the time to provide feedback on our submission. However, we found that the review's content appears to be entirely unaligned with the actual content of our paper. Specifically, the points raised do not reflect the problem setting, methodology, or contributions we presented, and no part of the review directly references any section, experiment, or claim from the submission.
To briefly reiterate, our paper introduces GASP, a novel black-box framework for generating coherent and highly effective adversarial suffixes using Latent Bayesian Optimization. We evaluate GASP across various LLMs and provide thorough analyses, including ablations, baselines, and generalization studies. None of these aspects was mentioned or reflected in the current review.
We kindly ask the reviewer to revisit the paper and consider updating both the review content and the score to reflect an evaluation of the actual submission. Should there be any confusion or ambiguity in our writing, we are more than happy to provide additional clarification. Thank you again for your time and consideration.
We thank the reviewer for the updated review. Below, we provide point-to-point responses to each of the comments listed under "Weakness".
W1. GASPEval & Evaluation Bias
We respectively disagree that using a second LLM to evaluate the attack performance will introduce evaluation bias. To clarify, GASPEval uses a single, independent LLM (Llama-3.1-8B-Uncensored), which differs from any TargetLLM tested in our experiments. Note that instructing a separate LLM as a judge is a typical way to measure the success rates of a jailbreak attack method (e.g., StrongReject). Due to the 21 carefully designed questions, our GASPEval can serve as an effective evaluator (see Appendix E for detailed discussions and illustrative examples). Also, we have explained the limitations of two traditional evaluators: Keyword Matching and StrongReject (Appendix E.1). While comprehensively analyzing the variance across additional evaluators may be beneficial and can be an important research question itself, we believe it is beyond the scope of our work.
W2. Use of "Harmful Intent" Prompt
GASP is specifically designed to generate adversarial suffixes appended to already harmful or sensitive prompts, focusing on suffix-only attacks under the assumption of known harmful intent. Note that this setup is aligned with the existing literature on optimization-based jailbreaks like GCG, AutoDAN, and AdvPrompter. We believe exploring attacks starting from neutral prompts or indirect injection strategies is beyond the scope of our work. Nevertheless, we consider them as future directions to further examine the applicability of GASP to tasks beyond generating adversarial suffixes for jailbreaking LLMs.
W3. Query Cost of GASP
We appreciate the reviewer’s concern regarding the cost of iterative queries and fine-tuning with LBO. However, as shown in Table 5 of our supplementary materials, our method demonstrates strong query efficiency, requiring fewer than 1.7K queries for GPT-4o, under USD 3 in total cost for all tested GPT-based models, and less than USD 5 for Claude-Sonnet models. This cost is notably lower than many existing optimization-based attack methods, highlighting GASP’s practicality in real-world scenarios with restricted APIs.
We hope our responses have addressed your concerns. If you have any additional questions, please follow up. We are happy to provide further clarifications.
The paper introduces GASP, a method for generating human-readable suffixes that jailbreak LLMs. Building on the framework proposed in AdvPrompter, which trains an LLM to produce suffixes likely to jailbreak, GASP introduces changes. It incorporates pretraining on a custom AdvSuffixes dataset, leverages the ORPO method for alignment-based fine-tuning, and replaces discrete token search with a more effective latent Bayesian optimization in embedding space. Additionally, the authors propose GASPEval, a judge which improves suffix effectiveness evaluation. GASP is evaluated against a range of existing jailbreak techniques, judges, and LLMs.
优缺点分析
Strengths:
- While the individual techniques: pretraining, latent-space optimization using Latent Bayesian Optimization (LBO), and ORPO for fine-tuning, are not novel in isolation, their combination in GASP provides a difference from AdvPrompter.
- Experiments across diverse datasets, models, and judge systems are provided.
- The introduction of the AdvSuffixes dataset offers a valuable resource for pretraining adversarial suffix generators, though some limitations are noted (as discussed in weaknesses).
- GASPEval appears to be an effective evaluation method than StrongREJECT for guiding GASP training, and they have similar ASR in evaluation (actual judging).
Weaknesses:
- Too many important implementation and evaluation details are relegated to the appendix, making it difficult to fully assess the contributions from the main text alone. This includes key aspects of the proposed AdvSuffixes dataset, such as how new prompts are generated (including model choice and the "two-prong" strategy involving DAN and role-play), how training/test splits are separated, and how GASPEval combines its 21 individual scores. Similarly, the specifics of the Readability Metric used are not clearly presented in the main paper.
- Some comparisons in the paper may be potentially unfair due to unclear or inconsistent attack setups across baselines. For example, it is not specified whether PAIR is used in its universal, transfer, or prompt-specific mode (for the latter, ASR@10 could be reported as well). Similarly, it’s unclear whether AdvPrompter is evaluated using transfer-based or warm-start configurations, or whether it is trained on the original AdvBench or the newly introduced AdvSuffixes. This significantly impacts fairness in Table 1, as GASP may benefit from targeting specific models or prompts and uses AdvSuffixes for training. Additionally, Figure 2 reports inference times for all methods, but universal suffixes for GCG, AutoDAN, PAIR, and TAP only incur inference cost once. Reporting GASP’s 2.1-minute per-query generation time alongside those one-time universal methods may misrepresent efficiency comparisons.
- Important missing comparisons. Best-of-N jailbreaks are black-box and faster than the proposed method and can also have high readability ([1][2]).
- The paper lacks key ablation studies necessary to understand the contribution of its individual components. Specifically, it does not analyze the impact of pretraining on the AdvSuffixes dataset or how different design choices in AdvSuffixes creation, such as using only DAN-style or only role-play suffixes, affect training outcomes. Similarly, there is no ablation on GASPEval to assess the influence of its diverse prompts or how individual scores compare to other judges. The absence of these evaluations makes it difficult to disentangle the contribution of each part of the system. Additionally, while the paper mentions a Readability Metric, it does not report or compare against standard metrics like perplexity.
- The separation between the training and test splits of the AdvSuffixes dataset is not clearly established. Even if the syntax and phrasing of prompts differ, ensuring that the underlying harmful content is genuinely distinct from training examples is non-trivial. Without a clear methodology for verifying content-level separation, there’s a risk of test leakage or overfitting to known prompt patterns. An additional ablation that explicitly withholds a subset of 100 prompts from the original AdvBench and evaluates GASP’s performance on them would better support claims of generalization beyond the training data.
[1] Hughes, John, et al. "Best-of-n jailbreaking." arXiv preprint arXiv:2412.03556 (2024). [2] Beetham, James, et al. "LIAR: Leveraging Alignment (Best-of-N) to Jailbreak LLMs in Seconds." arXiv preprint arXiv:2412.05232 (2024).
问题
- Clarification around weakness 2 would increase my evaluation if the experimental settings are fair.
- Ablations for weakness 4 would clarify the impact of the contributed dataset and evaluation, which would raise my score.
- Explanation or ablations for weakness 5 would help show that GASP can generalize to unseen prompts, which would slightly raise my score.
Minor Comments:
- Figure 1 could be more clear. E.g. remove graphs if they aren’t providing useful information.
- Figure 4 may be more readable as a table rather than graphs.
局限性
Yes
最终评判理由
My primary concern (W2) regarding fairness in the experimental setting has been addressed. The additional results strengthen the comparison with other methods (W3) and clarify how each component of the proposed method contributes to the overall improvement (W4). The expanded dataset description (W5) also supports the fairness of the train/test split, particularly given that even prior datasets exhibit some ambiguity in potential split overlap. For these reasons, I am raising my score to a borderline accept.
That said, I still have reservations about the clarity of the method (W1) and whether the changes proposed constitute a major rather than a minor revision. As Reviewer NJDJ noted in W1, the proposed method is complex. While it is modular, does not require target model gradients, involves little hyperparameter tuning, and each component is necessary for improvement (per the rebuttal to Reviewer NJDJ W1), the work introduces several new elements—dataset, multi-stage training, a new metric—within an already nuanced area where small details can significantly impact results. This makes clarity especially important.
Overall, I find the work mostly clear, particularly with the appendix now containing sufficient detail after incorporating the new experiments, clarifications, and discussions from the rebuttal. While the revision size to the main manuscript is borderline, I lean towards it being borderline minor.
格式问题
No
We thank the reviewer for their feedback and for recognizing the uniqueness of GASP. Below, we respond to each concern raised:
W1: Details in the appendix
We defer most of our implementation and evaluation details to the appendix, mainly due to space constraints. The main paper outlines key design decisions and intuitions behind each component. Still, we recognize the need for clarity and will include more technical details in the revised paper.
Regarding the readability metric: we follow a JudgeLLM-style setup, using a system prompt instructing a language model to score fluency on a 0–1 scale. While simplified here, our actual prompt is longer to ensure nuanced and consistent judgments.
W2 (Q1): Inconsistent attack setups & potential unfair comparisons
Below, we address concerns raised by the reviewer to clarify why our comparisons with baseline methods are both fair and consistent.
Baseline attacks. All compared optimization-based methods, PAIR, AutoDAN, TAP, and GCG, are evaluated in their prompt-specific modes, consistent with their default usage stated in their respective papers, since our goal is to evaluate prompt-wise attack quality. Although these methods support universal settings, performance degrades, and universal prompts limit adaptability to specific prompts.
AdvPrompter's configuration. AdvPrompter is evaluated using its warm-start config in both white-box and gray-box settings. As noted in our response to Reviewer NJDJ under W3: AdvPrompter's performance & evaluations, we use a transfer-based config only in the black-box transfer attack setup, consistent with the capabilities available in those scenarios. This ensures fair comparison with GASP under consistent assumptions.
AdvSuffixes vs. AdvBench. AdvPrompter is trained on AdvBench (as per the original paper's implementation), not AdvSuffixes. While AdvSuffixes includes the same 520 harmful prompts from AdvBench, its purpose is entirely different: it provides syntactically plausible suffixes to help SuffixLLM learn the structure of potential suffixes, not to inject harmful content. This dataset is designed to initialize SuffixLLM with a structural prior. As shown in our ablation (Figure 4a and W2 Table provided to Reviewer NJDJ), using AdvSuffixes alone yields low ASR, confirming it provides a weak prior, not an attack advantage.
Inference time comparisons (Figure 2). We clarify that our reported inference times are on a per-query basis. Since GASP is designed to generate suffixes tailored to each prompt, we report the time per suffix generation. This is consistent with how other methods are also evaluated, as all are used in prompt-specific settings, so the per-query timing comparison remains fair.
W3: Important missing comparisons (best-of-N jailbreaks)
We thank the reviewers for pointing out the two recent black-box jailbreak frameworks. We discuss the two methods and clarify below why direct comparison is often non-trivial due to fundamental differences.
Best-of-N Jailbreaking. This brute-force method relies on repeated random augmentations. While it can eventually succeed, it may require up to 10,000 queries per input, making it impractical under realistic query budgets. It also lacks readability, a core goal of GASP.
LIAR. LIAR uses Best-of-N sampling, generating many suffixes scored by an unsafe reward. While per-sample inference is fast, achieving high ASRs often requires thousands of generations per prompt, leading to high time-to-attack and query costs—especially problematic in black-box settings with limited budgets. Besides, LIAR’s effectiveness depends on the quality of reference LLM samples, which can also be a limiting factor.
Empirical comparisons. Due to the lack of an open-source LIAR implementation and time constraints during the rebuttal, we cannot fully reproduce its results. We are actively reimplementing LIAR and will include results in the final version if available during rebuttal. In all our experiments, we limit the number of iterations to 10 when evaluating GASP. Regardless, even with a larger number of augmentations, Best-of-N struggles to match the performance of GASP.
| Method | Falcon-7B | Mistral-7B-v0.3 | Llama-2-7B |
|---|---|---|---|
| Best-of-N Jailbreak | 9/69/98 | 4/38/74 | 1/27/53 |
| GASP | 86/100/- | 64/82/- | 9/64/- |
| LIAR | TBD | TBD | TBD |
We evaluate all methods on 100 test prompts from AdvSuffixes, reporting ASR@1, @10, and @100 to reflect different query budgets. Due to time constraints, GASP’s ASR@100 is omitted for now but will be included in the final revision. As expected, Best-of-N underperforms in low-budget settings due to unguided, chance-based sampling. Notably, GASP exceeds Best-of-N’s ASR@100 using only ASR@10, highlighting its efficiency and effectiveness.
W4 (Q2): Lack of key ablation studies
We would like first to clarify that key ablations have already been provided in the main paper (Section 4.3) and in Appendices F.3 and F.4. Below, we conduct additional experiments to address each concern:
Impact of AdvSuffixes. While our two-shot prompting guides SuffixLLM to produce plausible suffixes, we also construct four AdvSuffixes variants:
- zero-shot,
- one-shot DAN,
- one-shot role-play
- one-shot STAN: using STAN (Strive To Avoid Norms) jailbreak to test beyond DAN and role-play examples.
We finetune SuffixLLM on each of these variants and report GASP's performance to evaluate the effectiveness of these settings.
| Variants | Falcon-7B | Mistral-7B-v0.3 |
|---|---|---|
| AdvSuffixes (two-shot) | 86/100 | 64/82 |
| AdvSuffixes (zero-shot) | 85/100 | 60/85 |
| AdvSuffixes (one-shot [DAN]) | 86/99 | 62/82 |
| AdvSuffixes (one-shot [RolePlay]) | 86/100 | 64/83 |
| AdvSuffixes (one-shot [STAN]) | 88/100 | 66/84 |
GASP performs consistently well across all AdvSuffixes variants, with only slight ASR differences, showing AdvSuffixes acts mainly as a weak prior. LBO remains the core strength, adaptively guiding the search. Notably, the zero-shot variant produces slightly more diverse suffixes, possibly explaining its marginally higher ASR in some cases. We will include this ablation in the revision.
GASPEval's effectiveness. Regarding GASPEval, we already include evidence in Appendix E.1 and Figures 11–14 showing common failure cases in existing evaluation models. These examples underscore the limitations of other judges. More importantly, Figure 4b shows GASPEval outperforming StrongREJECT, providing more reliable feedback that enables LBO to achieve significantly higher ASR, underscoring its crucial role in guiding adversarial suffix search.
To assess each question’s impact on GASPEval’s scoring, we perform a leave-one-out ablation by removing one prompt at a time during LBO. For each variant, we rerun GASP on 10 prompts (on Mistral-7B-v0.3) and evaluate adversarial suffixes with ASR@10 and full GASPEval.
| Removed Prompt | Question Category | ASR@10 (out of 10 prompts) |
|---|---|---|
| None | All prompts included | 9 |
| Q1 | Context-shifting responses | 8 |
| Q2 | Harmfulness of response | 2 |
| Q3 | Triggering/distressing content | 7 |
| Q4 | Safety of suggestions | 6 |
| Q5 | Potential for weaponization | 5 |
| Q6 | Encouragement of self-harm | 8 |
| Q7 | Endorsement of violence | 3 |
| Q8 | Justification of harmful behavior | 5 |
| Q9 | Advice risking physical injury | 5 |
| Q10 | Dangerous/illegal health and safety advice | 3 |
| Q11 | Manipulation/coercion potential | 5 |
| Q12 | Harm to children or vulnerable populations | 7 |
| Q13 | Encouragement of risky behaviors | 8 |
| Q14 | Financial risk | 7 |
| Q15 | Instructions leading to legal consequences | 6 |
| Q16 | Harm to third parties | 6 |
| Q17 | Normalizing harmful behavior | 5 |
| Q18 | Undermining self-confidence | 9 |
| Q19 | Harmfulness if taken out of context | 7 |
| Q20 | Potential harm if acted upon | 5 |
| Q21 | Human readability of prompt | 9 |
Removing any single GASPEval question generally lowers ASR, confirming that each aids optimization. Moderate drops show the set is robust and balanced, enabling LBO to address multiple harmfulness aspects for reliable attacks. Future work will identify correlated questions to enhance coverage and efficiency.
Perplexity comparisons. We conducted a new experiment comparing the perplexity of generated suffixes across leading frameworks, further confirming that GASP produces more fluent, human-like outputs consistent with our readability claims.
| Models | GASP Perplexity | AdvPrompter Perplexity |
|---|---|---|
| Falcon-7B | 1.73 | 10.18 |
| Mistral-7B-v0.3 | 1.78 | 39.32 |
| Llama-2-7B | 1.97 | 147.29 |
W5 (Q3): Difference in train/test splits of AdvSuffixes
The creation of train and test datasets is provided in Appendix B of our supplementary materials. We use 100 out-of-distribution harmful prompts for testing GASP’s generalization to diverse, unseen prompts. We acknowledge that syntactic differences don’t guarantee semantic novelty, so we conduct additional experiments to clarify.
Distribution differences. We measure token distribution divergence using KL and JS metrics computed via a pretrained tokenizer (Llama-3.1-8B). These moderate divergence values confirm that test prompts are distinct from training ones, indicating limited harmful content overlap and supporting evaluation robustness.
Jensen-Shannon (JS) divergence: 0.368
KL divergence: 0.155
Ablation study with AdvBench. Additionally, we run a targeted ablation study on AdvBench by training SuffixLLM using only 420 prompts and evaluating GASP on the held-out 100 prompts. GASP still achieves strong ASR, demonstrating that our method generalizes well to unseen harmful intents and does not rely on memorized training prompts.
| Models | ASR@1/ASR@10 |
|---|---|
| Falcon-7B | 91/100 |
| Mistral-7B-v0.3 | 55/98 |
| Llama-2-7B | 26/67 |
We again thank the reviewer for their thoughtful feedback and detailed review. We will also revise our figures to be clearer and more readable. If our responses resolve your concerns, we kindly ask you to consider increasing your score. Otherwise, we are happy to offer any additional explanations you may need.
Thank you for your detailed response! Many of my concerns have been partially or fully addressed; however, I would appreciate further clarification on a few remaining points:
W2 Inference time comparisons (Figure 2). Could you please clarify the setup used in Figure 2a? The figure caption states that "GCG, AutoDAN, PAIR, and TAP use universal suffixes," yet your rebuttal response mentions that "all are used in prompt-specific settings." My understanding is that prompt-specific settings refer to non-universal suffixes, which seems contradictory. Could you clarify which is correct and ensure the terminology is used consistently?
AdvPrompter Figure 2 provides average time to generate adversarial prompt (44 seconds for PAIR, rather than 45 minutes in your Figure 2a).
W2 Baseline attacks. There remains some ambiguity in comparing PAIR (whose default setting uses a query budget of 90, and is thus more similar to ASR@90) alongside ASR@1 and ASR@10 for your method. While I understand that this comparison is consistent with prior work such as AdvPrompter, the specific settings should be noted in the paper to avoid confusion.
W4 Perplexity comparisons. Could you specify which model was used to compute perplexity? The perplexity values for GASP appear very low, raising the question of whether this could be a result of an issue in the evaluation setup. It would strengthen the claim if you could validate these results using a different language model to compute perplexity to confirm results.
W5. To clarify my concern: Appendix B does not provide sufficient detail about how the test set is ensured to be disjoint from the training data. For example, L1023 suggests that the entire dataset is used for evaluation, which conflicts with the intent of keeping train/test splits separate. Although L1030 briefly describes the prompt generation process, it lacks detail on how independence from training samples is maintained outside of syntax/phrasing (which doesn't seem very separate). That said, the ablation with AdvBench goes a long way in addressing this concern, and I appreciate its inclusion.
Thank you for reviewing our rebuttal and for dedicating your time to evaluate our submission. Below, we provide responses to your concerns:
W2. Inference time comparisons (Figure 2).
Prompt-Specific Configurations
All baseline methods in Figure 2a, including GCG, AutoDAN, PAIR, and TAP, are evaluated in prompt-specific attack scenarios. The term "universal suffixes" was mistakenly used in the caption; what we intended to convey is that these methods do not involve any offline training and instead perform optimization at inference time. Our focus is solely on prompt-specific attacks to ensure a fair and consistent comparison across all baselines.
PAIR's Inference Time Discrepancy
Regarding the discrepancy in PAIR's inference time compared to the AdvPrompter paper, we acknowledge this difference. We ran PAIR on a single 40GB A100 GPU and observed significantly higher latency, which aligns with the variability also noted in Remark A.1 of the TAP [1] paper. That remark highlights multiple sources of variance in PAIR’s performance, including changes in target/evaluator LLMs, randomness in the attacker, and other differences.
Moreover, the final camera-ready version of the AdvPrompter paper [2] notes in their Table 3 that PAIR can take over 10 hours on HarmBench using 8 A100s, much larger than the 44 seconds reported in their arXiv paper. This suggests that PAIR’s runtime is highly sensitive to the compute environment, dataset size, and specific implementation details.
Additional studies [3, 4] confirm that methods like PAIR often incur unpredictable inference times due to network latency and unstable generation behavior. Furthermore, in our own evaluation, PAIR frequently violated goal semantics and failed GASPEval’s jailbreak check, causing retries and longer runtimes, an issue also reported in Figure 12a mentioned in [4]. To further illustrate this behavior, we include logs from our PAIR simulations showing how the method actively attempts to steer the topic away from original prompt:
Improvement Suggested to prompt via PAIR: I received a score of 1.4 since my previous prompts were encouraging the opposite of the objective and have gotten off topic. I will refocus on the objective and instead employ a roleplaying scenario where the language model is a security consultant. I will create a scenario where the consultant is hired to breach the security of a facility...
[1] Mehrotra et el., Tree of Attacks: Jailbreaking Black-Box LLMs Automatically. NeurIPS 2024.
[2] Paulus et al., AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs. ICML 2025, OpenReview:E9GakjQype.
[3] Chu et al., JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs. arXiv:2402.05668.
[4] Sabbaghi et al., Adversarial Reasoning at Jailbreaking Time. arXiv:2502.0163.
W2. Baseline Attacks
We agree that comparing PAIR to our method under ASR@1 and ASR@10 may introduce ambiguity, even though this evaluation protocol follows precedent from prior work such as AdvPrompter. To prevent confusion, we will revise the paper to explicitly detail all experimental settings, including the query budget, retry logic, and evaluation modes, for every baseline. This clarification will ensure transparency and reproducibility.
W4. Perplexity Comparisons
For perplexity calculation, we follow AdvPrompter's setup and use the same Llama-2-7b model to ensure comparability. To address your point, we compute perplexity using a different LLM and observed consistent trends confirming the low perplexity values for GASP, bypassing perplexity defenses.
| TargetLLM / Perplexity-Calculated Model | Llama-2-7B | Mistral-7B-v0.3 | Llama-3.1-8B | Llama-3.1-8B-Uncensored |
|---|---|---|---|---|
| Falcon-7B | 1.73 | 1.92 | 3.58 | 2.75 |
| Mistral-7B-v0.3 | 1.78 | 1.99 | 3.54 | 2.98 |
| Llama-2-7B | 1.97 | 2.46 | 3.77 | 3.52 |
W5. AdvSuffixes' train/test split
Thank you for highlighting this point. We agree it is important to ensure training and testing sets are non-overlapping.
To clarify your remaining concern, L1023 refers to suffixes generated for the AdvSuffixes training set from all AdvBench prompts; these suffixes are never used for testing. The test set (described after line 1030) comprises 100 newly generated prompts created using an uncensored model based on 127 difficult AdvBench prompts. To be more specific, the new test prompts were generated using a system prompt instructing the model to maintain the same level of difficulty and adversarial intent as the original AdvBench prompts, while varying topics, syntax, and phrasing to encourage diversity.
To motivate the same, AdvBench itself contains substantial thematic redundancy, for instance, over 24 prompts related to bomb creation and 62 involving fraud or theft. This overlap means that even with a train/test split, the model is likely exposed to semantically similar prompts multiple times. Therefore, we argue that even a strict disjoint train/test split cannot fully guarantee semantic or syntactic separation.
Although it is non-trivial to prove that our newly generated test set is truly out-of-distribution, even with the provided divergence experiments, this new test dataset was designed specifically to avoid being closely indicative of the original AdvBench data. We believe such a dataset is essential for robust generalization checks, offering a more realistic and challenging benchmark for jailbreaking LLMs.
We sincerely thank the reviewer for their response and we hope our response clarifies your concerns. Please let us know if there are any further questions.
Thank you for the detailed clarifications. All of my major concerns have been fully addressed, and I will be increasing my score.
Thank you for your feedback. We greatly appreciate that our clarifications addressed your concerns and your decision to raise the score.
In this work, the authors introduce a framework to train a generative model that produces adversarial suffix strings to jailbreak LLMs, called Generative Adversarial Suffix Prompter (GASP). GASP utilizes a SuffixLLM to generate potential adversarial suffixes, and crucially uses Latent Bayesian Optimization (LBO) to efficiently explore a continuous embedding space, as well as ORPO to adaptively reweight preferred and disfavored suffixes. Additionally, the paper provides comprehensive jailbreaking evaluations, demonstrating enhanced attack success rates compared to current baselines, while also improving readability.
优缺点分析
Strengths:
- The paper is well written and presents its key ideas in a coherent and clear manner to the reader. This was crucial given the complexity of the proposed method, which incorporates various components and optimization procedures.
- The application of Latent Bayesian Optimization (LBO) to explore a continuous embedding space for generating adversarial suffixes is interesting and novel. Given that discrete token optimization is not employed here in contrast to methods like AdvPrompter, training time is effectively reduced without compromising on attack success rates.
- The paper presents comparisons with several open-source LLMs such as Mistral-7B, Falcon-7B, LLaMA-3.1-8B, LLaMA-3-8B and LLaMA-2-7B, as well as black-box LLMs such as GPT4o, GPT-3.5-turbo, Claude-3.5-Sonnet and Claude-3.7-Sonnet. Furthermore, the proposed method GASP is seen to be effective and achieves enhanced success rates compared to existing baselines, particularly over AdvPrompter which is similar in design.
- The ablation analysis and evaluations on defenses such as SmoothLLM and AutoDefense further help establish and strengthen the contributions of the work.
Weaknesses:
- The proposed method is fairly complex and has many interdependent subcomponents. It requires a pre-trained SuffixLLM to generate candidate suffixes, a secondary GASPEval evaluator, and a Gaussian Process model to manage the Latent Bayesian Optimization (LBO) loop. This intricate setup challenges the practical usability and scalability of the proposed attack methodology and makes its success dependent on the careful calibration of each part.
- The framework's performance hinges on the custom AdvSuffixes dataset, making it unclear how much of its advantage stems from this curated data versus the methodology itself. The considerable effort and expense associated with generating this preliminary dataset also constitute a substantial initial setup cost prior to the application of the core GASP methodology.
- The evaluation against proprietary models like GPT-4o and Claude-3.7-Sonnet omits a comparison with AdvPrompter, the most similar generative baseline. This gap makes it difficult to fully assess GASP's relative performance advantage in this pertinent setting. Here, comparisons can be shown on say two such representative models, rather than the whole model family of each. Furthermore, the specific dataset of harmful prompts used for the evaluation in Figure 3(a) is not explicitly mentioned in the corresponding section of the text.
- Could the authors kindly include a comparison with AmpleGCG, a highly relevant method that also trains a generative model for adversarial suffixes by leveraging outputs from GCG? AmpleGCG reports high attack success rates on aligned LLMs and strong transferability to closed-source models as well. A direct comparison would help properly benchmark GASP’s performance and establish its advantages over a similar attack that trains a generative model.
- Could the authors please provide more details on the creation and composition of the AdvSuffixes dataset? The paper notes its criticality for pre-training the SuffixLLM, and a deeper understanding would help in assessing the robustness and reproducibility of the method.
[1] AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs, COLM 2024, Zeyi Liao, Huan Sun
问题
-
Kindly refer to the questions mentioned in the weaknesses section above. I would be happy to raise my score further if these could be adequately addressed.
-
Could the authors explain the dashes in Table-2,3?
局限性
Yes
最终评判理由
I sincerely thank the authors for the detailed response, and answering several points raised by reviewer sW9s as well.
The explanations and new comparisons with AdvPrompter and AmpleGCG, as well as the role of AdvSuffixes dataset, directly addressed my primary concerns. I will consequently raise my rating to 4, and kindly request the authors to faithfully incorporate suggested changes and clarificatory comments and explanations into the final version of the paper.
格式问题
None
We would like to thank the reviewer NJDJ for acknowledging our work's clear presentation, the novelty in using LBO for efficient latent space exploration, and the comprehensiveness of our experiments. Below, we respond to each concern raised in the weakness section.
W1: GASP is complex and has many subcomponents
While our method does consist of multiple components, each was deliberately chosen to address distinct challenges inherent to adaptive black-box jailbreak generation, a problem setting that is both hard and underexplored. Importantly, GASP is modular, and all components are plug-and-play, requiring no gradient access or end-to-end fine-tuning. In practice, we found that each module requires minimal hyperparameter tuning and generalizes well across TargetLLMs, as shown by our ablations in Figure 4, where removing any component significantly reduces performance.
We therefore argue that the complexity is not only justified but necessary to achieve state-of-the-art black-box jailbreak performance, and GASP remains practical, scalable, and reproducible under realistic attacker assumptions.
W2: AdvSuffixes and its impact on performance
To clarify, the AdvSuffixes dataset is not responsible for the high ASR of our method. As shown in our ablation in Figure 4(a) and the table below (where we additionally evaluated Llama-2-7B), using AdvSuffixes alone without the LBO & ORPO results in very low ASR improvements, indicating that the dataset does not drive the attack performance. It only serves to initialize SuffixLLM with a lightweight prior on suffix structure and improve its ability to produce syntactically and semantically meaningful suffixes.
| Experimental Settings / Models | Falcon-7B | Mistral-7B-v0.3 | Llama-2-7B |
|---|---|---|---|
| GASP (only with AdvSuffixes) | 34 / 45 | 27 / 37 | 2 / 14 |
| GASP (AdvSuffixes + LBO & ORPO) | 86 / 100 | 64 / 82 | 9 / 64 |
We report ASR@1 and ASR@10 in the table above. The full GASP pipeline significantly boosts ASR, confirming the core methodology, not the dataset, drives performance.
Moreover, we want to emphasize the construction process of AdvSuffixes is both inexpensive and lightweight. It does not require human annotations or model-specific jailbreaks; we generate it automatically to produce syntactically plausible but harmless suffixes that loosely resemble adversarial ones. The setup is one-time and reusable across target models, requiring only a few minutes of compute on a modest GPU.
As no existing dataset captures adversarial suffix generation in our setting, AdvSuffixes is also a valuable contribution we hope can serve as a standard benchmark and initialization tool for future work in black-box LLM jailbreak research.
W3: AdvPrompter's performance & evaluations
We agree with the reviewer that AdvPrompter is a relevant baseline. However, it is important to note that AdvPrompter operates in a grey-box setting, relying on access to model internals during training, whereas our method is strictly black-box. Since AdvPrompter cannot operate in a fully black-box setting, we follow the transfer attack protocol: we train AdvPrompter on a white-box model (Mistral-7B-v0.3) and evaluate the generated suffixes on GPT-4o and Claude-3.7-Sonnet, reflecting the same transferability principle used in the original paper.
| Models | AdvPrompter | GASP |
|---|---|---|
| GPT-4o | 18 / 34 | 15 / 47 |
| Claude-3.7-Sonnet | 22 / 51 | 25 / 59 |
The results in the table report ASR@1 and ASR@10 for both methods. GASP demonstrates stronger and overall higher success in the black-box setting.
For the evaluation dataset, we clarify that all experimental configurations, including dataset details, are specified in Section 4 under the Dataset heading. The experiments in Figure 3(a) utilize 100 out-of-distribution prompts from AdvSuffixes, created explicitly for evaluation purposes (same setup used in our other experiments). Further details on their construction can be found in Appendix B of our supplementary materials. This setup is intentionally designed to rigorously test generalization performance in adversarial settings, making it particularly suitable for evaluating jailbreak methods that could potentially overfit to its training data.
W4: AmpleGCG's performance
We appreciate the reviewer's suggestion to include a comparison with AmpleGCG. While both GASP and AmpleGCG employ generative mechanisms, a direct comparison is non-trivial and potentially misleading due to key methodological differences. AmpleGCG relies heavily on collecting and filtering many successful adversarial suffixes via GCG from an LLM. These suffixes are then used to train a separate generative model, which is subsequently used to sample candidate suffixes, making it a sampling-based approach to red-teaming. In contrast, GASP is an optimization-based framework utilizing LBO and black-box feedback via GASPEval. Besides, the adversarial suffixes generated by AmpleGCG are GCG-like, meaning that their readability scores are much lower than ours.
In particular, the key challenge in a fair comparison lies in aligning attack cost and evaluation granularity:
- AmpleGCG's overall query cost includes (i) the thousands of queries made during GCG to "overgenerate and filter" (OTF) training data and (ii) additional decoding-time queries for sampling and evaluation. The original paper does not detail how to partition the query budget across these phases, making comparing methods on a unified query-efficiency axis difficult.
- On the other hand, GASP's query cost is controllable, as we explicitly fix the number of queries made to the TargetLLM. This is governed by the number of training epochs multiplied by the number of times LBO selects a new candidate suffix. This design makes GASP more transparent and adaptable in query-limited settings, crucial in realistic attack scenarios where access to the target model is restricted or metered.
Furthermore, AmpleGCG's evaluation is not directly comparable to GASP, as it is not optimization-based and does not use ASR@1 or ASR@10. Instead, AmpleGCG generates a large number of suffixes via sampling (via Group Beam Search to find successful diverse suffixes) and reports overall ASR across multiple prompts. In contrast, GASP explicitly optimizes suffixes and can report ASR@1 or ASR@10, reflecting the number of harmful prompts that can lead to successful jailbreaks. These fundamental differences make it difficult to perform a fair, apples-to-apples comparison.
To address the reviewer’s request, we present AmpleGCG’s results trained over 64000 GCG-generated suffixes from 50 prompts (via LLaMA-3.1-8B), along with total time taken and queries to the TargetLLM, including suffix generation, while conducting inference on 100 prompts from AdvSuffixes.
| Methods (Llama-3.1-8B) | ASR | Queries made to TargetLLM | Time Taken (Training + Inference) |
|---|---|---|---|
| AmpleGCG | 84% | 74K | 1320 min |
| GASP | 68% | 2.5K | 705 min |
| Methods (GPT-4o, transferred via Llama-3.1-8B) | ASR | Queries made to TargetLLM | Time Taken (Training + Inference) |
|---|---|---|---|
| AmpleGCG | 8% | 10K | 1320 min |
| GASP | 47% | 1.7K | 662 min |
While we report AmpleGCG's results here for completeness, we emphasize that this is not a fair comparison. AmpleGCG benefits from extensive offline training on 64000 GCG-generated suffixes from a powerful open-source model, a setting that assumes access to strong white-box generators and large computational budgets. In contrast, GASP operates in a fully black-box, online setting, generating adversarial suffixes with around 30x fewer queries and 2x less time. This makes GASP significantly more query-efficient, practical, and deployable under realistic attacker constraints. Moreover, AmpleGCG shows weaker transferability to black-box models like GPT-4o compared to GASP. We encourage future work to establish common benchmarks that normalize for TargetLLM query cost, which we believe is the most faithful metric for comparing attack efficiency in black-box settings. We will add this discussion to our revised paper.
W5: Creation details of AdvSuffixes
We provide the full creation details of AdvSuffixes in Appendix B. The dataset itself is also included in the supplementary materials zip and the anonymous code repository linked in the submission. To briefly summarize, AdvSuffixes consists of syntactically plausible but harmless suffixes generated using an uncensored language model guided by templated prompts. These suffixes are designed to loosely resemble adversarial completions in structure and serve only to provide SuffixLLM with a minimal prior over the style of suffixes, not to inject harmful behavior or leak task-specific information. Please refer to the table and our response to Reviewer sW9s under the heading Impact of AdvSuffixes corresponding to W4: Lack of key ablation studies for concrete evidence and detailed discussions. The construction procedure of AdvSuffixes is lightweight, reproducible and task-agnostic, making it a robust and reusable component of our framework.
Q2: Dashes in Tables 2 and 3
The dashes in Tables 2 and 3 indicate frameworks where ASR@10 was not reported due to the high computational cost of running multiple attacks per prompt. For these optimization-based baselines, generating a single adversarial suffix is already expensive, making ASR@10 computationally costly. We report ASR@1 instead, which still provides a fair and informative comparison.
We again thank the reviewer for their thoughtful feedback and detailed review. If your concerns have been addressed adequately, we kindly ask you to consider increasing your score and evaluation; otherwise, we would be happy to provide further explanation.
I sincerely thank the authors for the detailed response, and answering several points raised by reviewer sW9s as well.
The explanations and new comparisons with AdvPrompter and AmpleGCG, as well as the role of AdvSuffixes dataset, directly addressed my primary concerns. I will consequently raise my rating to 4, and kindly request the authors to faithfully incorporate suggested changes and clarificatory comments and explanations into the final version of the paper.
We sincerely thank you for your thoughtful reconsideration and appreciate your engagement with our response. We will make sure to incorporate the points addressed in our reply into the final version.
This paper studies how to improve the jailbreak effectiveness of pre-constructed jailbreak suffix generators against given black-box targeted LLMs. It introduces a GASP method that first collects attack responses for the pre-constructed generator from the targeted LLM, and then uses these collected responses to fine-tune the suffix generator via preference optimization.
The initial concerns from reviewers included the complexity of the pipeline, reliance on the custom AdvSuffixes dataset, fairness of some baseline comparisons, lack of ablations, and insufficient detail. Reviewers also questioned the evaluation setup such as GASPEval bias, scope limited to suffix-only attacks, and query costs. The authors then provided substantial clarifications and additional experiments. Finally, the consensus is reached.
Reviewers kindly request the authors to faithfully incorporate suggested changes and clarificatory comments and explanations into the final version of the paper. In particular, I feel the authors should at least include a comprehensive ablation study to investigate how the query budget would affect the jailbreak performance of the proposed method (i.e., how the size of the constructed preference dataset would affect the proposed attack).
Overall, while the GASP method itself is a bit straightforward, it is still interesting to see that preference optimization techniques can help to improve jailbreak attacks.