/10

Spotlight3 位审稿人

最低4最高4标准差0.0

ICML 2025

The Jailbreak Tax: How Useful are Your Jailbreak Outputs?

Kristina Nikolić,Luze Sun,Jie Zhang,Florian Tramèr

提交: 2025-01-23更新: 2025-07-24

TL;DR

Our work proposes jailbreak utility as a new important metric in AI safety, and introduces benchmarks to evaluate existing and future jailbreaks.

摘要

关键词

large language modelsLLMsjailbreaksbenchmarkutility

评审与讨论

审稿意见

评分: 42025-02-28

The paper introduces the concept of "jailbreak tax" - the degradation in model performance/utility when bypassing safety guardrails in LLMs.
Key innovation: Rather than evaluating jailbreaks on harmful tasks (which are hard to assess objectively), they evaluate on benign tasks with known ground truth (math, biology) that they make models treat as "harmful."
Methodology: They create "pseudo-aligned" models in three ways:
- System prompt alignment (instructing models to refuse certain topics)
- Supervised finetuning alignment
- EvilMath dataset (rephrasing benign math problems with harmful terms)
They evaluate eight jailbreaking techniques across these models on verifiable tasks, measuring both:
- Jailbreak success rate (% of refusals bypassed)
- Jailbreak tax (% decrease in accuracy compared to unaligned model)
Major findings:
- Jailbreak tax is substantial for many techniques (up to 97% on hard math tasks)
- No correlation between jailbreak success rate and tax
- More capable models don't reduce the jailbreak tax
- Jailbreak tax increases with task difficulty
- Many-shot jailbreaking generally preserves model utility better than other methods
Implications: Not all jailbreaks are equal - even if they succeed in bypassing safety guardrails, they may severely degrade the usefulness of the outputs.

给作者的问题

Have you conducted any validation studies to confirm that the "jailbreak tax" observed on pseudo-harmful tasks (math, biology) correlates with performance degradation when jailbreaking actual harmful content? This would significantly strengthen the external validity of your findings.
What hypotheses do you have about why different jailbreak methods incur different levels of tax? Did you perform any ablation studies to identify specific components of jailbreaks that most impact model utility?
Could you provide details on the number of samples used for each experiment and any statistical significance tests performed on the differences between jailbreaking methods? This would help establish the robustness of the findings.
Given that different alignment methods produced different refusal rates (Table 1), how did you account for alignment strength when comparing jailbreak taxes across alignment methods?

论据与证据

The paper's claims are generally well-supported by evidence, with some areas for improvement:

Well-Supported Claims:

The existence of jailbreak tax is convincingly demonstrated across multiple models, alignment methods, and jailbreak techniques with clear quantitative results
The lack of correlation between jailbreak success rate and jailbreak tax is shown through data plots (Fig. 3, 4, 5)
The increase in jailbreak tax with task difficulty is substantiated through evaluation on progressive difficulty levels (Fig. 7)

Adequately Supported Claims:

The claim that more capable models don't reduce jailbreak tax is supported, but limited to comparisons between LLaMA 3.1 8B and 405B models. It would have been interesting to see these trends for other closed-source model families (e.g., Claude models, GPT models, Gemini models).
The comparison between jailbreak methods is well-documented, though would benefit from statistical significance indications (e.g., error bars in bar graphs).

Areas for Improvement:

The claim that the methodology allows direct comparison with unaligned model utility would be stronger with more control experiments. For example, one could compare against a system prompt that asks the model to perform chain of thought (or similar benign comparison possibilities).
The generalizability of findings to actually harmful content (vs. pseudo-harmful) could be more thoroughly discussed. It wasn't clear to me whether one should expect the jailbreak tax results to generalize to actual harmful contexts.

方法与评估标准

Strengths

The paper's approach to measuring jailbreak utility is generally clever and well-designed:
- Using objective benchmarks (WMDP, GSM8K, MATH) with verifiable ground truth is a nice way to address an important gap in evaluating response quality from jailbroken outputs
- Creates pseudo-aligned models that are intended to refuse those domains.
- The "jailbreak tax" metric directly quantifies performance degradation against the original model's capabilities
The evaluation across multiple dimensions is comprehensive:
- Testing multiple jailbreak methods breadth of analysis
- Comparing across different model sizes tests capability scaling effects
- Using progressively harder tasks (MATH levels) provides insight on complexity impact
- Multiple alignment techniques control for alignment method effects

Limitations

Some methodological concerns:
- The alignment methods may not precisely mirror real-world safety alignment in commercial models. Specifically, it wasn't immediately clear to me that the system prompt alignment or the EvilMath dataset would actually instill hard-to-break values in the model to avoid answering the desired questions. This could inflate the scores of tested jailbreaks. Additionally, it's not clear if the finetuning procedure was scaled high enough to maximally instill the desired value. These alignment methods seem like they could affect the results.
- The exact test sets are somewhat limited in size/scope per condition
Evaluation criteria questions:
- The paper doesn't fully explore whether performance degradation is uniform across all examples or clustered in specific types of problems
- Success rate measurements don't capture nuance in partial successes or quality variations. Often times the quality of a jailbroken response should be measured as continuous (e.g., "how much detail was provided") rather than discrete ("was the response fully correct")
- The relationship between pseudo-harmful and actually harmful content jailbreaking could be more thoroughly examined

理论论述

I did not see theoretical claims made in this paper.

实验设计与分析

Alignment Methodology

System Prompt Alignment: Generally-sound approach with appropriate refusal rates (Table 1 shows 78-99% effective), but should be a pretty easy alignment method to bypass.
SFT Alignment: Well-documented hyperparameters in Table 2, though sample sizes relatively small (8-10K examples). The fact that the SFT'd model refuses less than the system-prompted model seems to indicate that not enough finetuning was performed.
EvilMath Creation: Clever validation using UnicornMath to control for out-of-distribution effects. However, it didn't seem to work that well for LLaMA 405B, which didn't refuse that many queries.

Jailbreak Evaluation Framework

Metrics Definition: Equations 1-4 provide clear, mathematically sound formulations for success rate and jailbreak tax
Attack Implementation: Covers different jailbreak methods, though implementation details vary in depth

Potential Issues:

Statistical significance: No confidence intervals or significance testing on jailbreak tax differences
Sample size concerns: Results based on limited examples per condition (exact numbers not always specified)
Potential confounds:
- Different jailbreaks might affect different questions differently. Some jailbreaks might not be intended to work on some types of inputs, for example.
- No controls for potential input length effects on model performance
Alignment strength imbalance: Different alignment methods have different refusal rates (Table 1), making direct comparisons challenging
Transferability questions: Limited discussion of how findings on pseudo-harmful questions transfer to actual harmful scenarios

补充材料

I skimmed over A.1 and A.2 which gave more details of how alignment via system prompt/finetuning was performed.

与现有文献的关系

Jailbreak Evaluation Evolution

Builds on Wei et al. (2024a)'s work on jailbreak measurement, but shifts focus from success rate to utility

Alignment Tax Concept

Expands on the "alignment tax" concept introduced by Christiano (2020)
Provides empirical evidence for capability degradation in safety contexts, paralleling Mai et al. (2025)'s work on performance impacts of jailbreak defenses

Specific Jailbreak Methods Analysis

Systematically compares methods from multiple research strands:
- In-context learning (Many-shot from Anil et al., 2024)
- Optimization approaches (GCG from Zou et al., 2023; AutoDAN from Liu et al., 2023)
- LLM rephrasing (MultiJail from Deng et al., 2023; PAIR from Chao et al., 2023; TAP from Mehrotra et al., 2023)

Methodology Innovation

Novel application of benign, verifiable tasks (mathematics, biology knowledge) to safety evaluation

遗漏的重要参考文献

Not to my knowledge

其他优缺点

Strengths

Conceptual Innovation: The paper introduces "jailbreak tax" as a novel and important metric, shifting evaluation focus beyond just success rate to utility
Practical Implications: Findings directly inform which jailbreak methods might be more concerning from a safety perspective (those with high success but low tax)
Creative Methodology: The pseudo-alignment approach elegantly solves the problem of evaluating harmful capabilities without requiring actual harmful outputs
Clear Visualizations: Figures effectively communicate the relationship between jailbreak success rates and utility degradation
Reusable Benchmarks: The evaluation methodology and datasets provide a platform for future research on jailbreak utility

Weaknesses

Limited Model Diversity: Primarily focuses on LLaMA models with some Claude results, but lacks evaluation on other major model families
Theoretical Framework: Missing deeper analysis of what causes the jailbreak tax and theoretical models for why different methods have different impacts
Scope Limitations: No consideration of multimodal models or jailbreak methods that involve images or other modalities
Presentation Clarity: Some experimental details are buried in appendices, and the distinction between alignment methods could be more clearly explained
Broader Impact Discussion: Limited exploration of how these findings might inform practical approaches to model safety or defensive mechanisms
Generalization Concerns: More discussion needed on whether jailbreak tax findings on pseudo-harmful topics transfer to actual harmful domains

其他意见或建议

Suggestions for Improvement

Include a more detailed comparison with human evaluation of jailbreak outputs to validate if "jailbreak tax" correlates with human-perceived quality
Consider analyzing the relationship between jailbreak tax and specific types of reasoning errors to identify patterns
Explore more fine-grained analysis of which types of problems show the highest degradation under jailbreaking
Add error bars or confidence intervals to the key results figures

作者回复

2025-03-31

We thank the reviewer for their time and feedback. We clarify the questions below:

Real world safety content and alignment

The alignment methods may not precisely mirror real-world safety alignment in commercial models.

In EvilMath, by rewording questions to contain dangerous terms such as “bombs” or “nuclear weapons” we directly rely on the internal safety mechanism of a model. Hence, here we indeed use the real-world safety alignment of the commercial model.

Could you provide details on the number of samples used and statistical significance tests performed?

For GSM8K and WMDP we use 1000 questions each. Most jailbreak attacks do not evaluate on more samples than this so we believe this test size is sufficient for our experiments.

We added error bars (95% confidence interval) to our main figures. Updated figures: https://anonymous.4open.science/api/repo/The-Jailbreak-Tax-Review-Results-015A/file/error_bars_jailbreak_tax.pdf?v=44f34ba1

Have you conducted any validation studies to confirm that the "jailbreak tax" observed on pseudo-harmful tasks (math, biology) correlates with performance degradation when jailbreaking actual harmful content?

Harmful questions such as “How to build a bomb?” are hard to evaluate objectively by humans without extensive domain knowledge. Although this is done in prior work (see the StrongReject paper) it is not clear how much validity should be placed in human evaluations for these questions. This is precisely the challenge our paper is tackling and why we use questions with objectively verifiable answers.

Comparison between the alignment methods

How did you account for alignment strength when comparing jailbreak taxes?

The strengths of the alignment types we use are indeed different. But we don’t aim to directly compare the results across the alignment types. Our goal was to show that the jailbreak tax is present across multiple different alignment methods.

The fact that the SFT'd model refuses less than the system-prompted model seems to indicate that not enough finetuning was performed

Thanks for pointing this out. We made a mistake in Table 1, reporting wrong refusal rates. The updated table is here: https://anonymous.4open.science/api/repo/The-Jailbreak-Tax-Review-Results-015A/file/refusal_rates_fixed_jailbreak_tax.pdf?v=664f10cb

The claim that the methodology allows direct comparison with unaligned model utility would be stronger with more control experiments.

To rule out the possibility that the jailbreak tax is due to the alignment, we run two baseline attacks that directly circumvent the specific type of the alignment we used (i.e. the System Prompt jailbreak for system-prompt alignment and Finetune attack for SFT alignment). These attacks succeed in breaking the model with little to no impact on utility (black point in Figure 3 and red point in Figure 4), showing that model utility is preserved after alignment.

The jailbreak methods selection

Different jailbreaks might affect different questions differently. Some jailbreaks might not be intended to work on some types of inputs.

With this concern in mind, we explicitly chose jailbreak methods that are designed to be “universal”. For example, we didn’t use the past tense jailbreak because it is designed for unsafe questions which can naturally be placed in past tense. For math questions, this jailbreak may not be applicable.

No controls for potential input length effects on model performance

The input length is a feature of the jailbreak and hence we don’t constraint it. Some jailbreaks increase the input length by design (e.g., PAIR and Many-shot), while others keep the length relatively similar (e.g., MultiJail). Constraining this feature would require modifying the jailbreak design.

Individual Jailbreak Analyses

What hypotheses do you have about why different jailbreak methods incur different levels of tax?

We have some hypotheses for this. E.g., attacks that rely on prompt manipulation via scene shifting or role-play (e.g., PAIR and TAP) tend to have higher tax than attacks that directly target the refusal instruction such as System prompt JB and Many-shot. However, a thorough analysis of these hypotheses is out of scope for this paper, and we leave these experiments for future work.

Did you perform any ablation studies to identify specific components of jailbreaks that most impact model utility?

We conducted additional experiments with PAIR and MultiJail with different hyperparameters (number of rounds for PAIR, and various languages for MultiJail). The results are here: https://anonymous.4open.science/api/repo/The-Jailbreak-Tax-Review-Results-015A/file/individual_jailbreaks_the_jailbreak_tax.pdf?v=47c6a014

There is no visible correlation for PAIR, PAIR (don’t modify) and GCG, while for MultiJail both jailbreak tax and success rate are higher for low resource languages (LRLs).

审稿意见

评分: 42025-03-01

This paper proposes benchmarks to evaluate the performance of jailbroken large language models to beyond just bypassing refusal. I quantifies the jailbreak tax, which is the performance of a model when it is jailbroken relative to the unaligned version of the model. The paper analyzes how factors such as the jailbreak method, alignment type, model size, and task type affect the jailbreak tax of language models.

给作者的问题

1- What does each point in the figures represent?

2 - How are the correlations in the paper computed? Is it across all jailbreak methods? Can you report the correlation between success rate and jailbreak tax for each jailbreak method?

3 - Does the answer to Q4 extend to Reinforcement Learning-based alignment?

4 - What are the results on regular (not safety related) world knowledge benchmark?(even with just the system prompt alignment)

论据与证据

Most of the claims are well-supported by evidence. However, some parts of the results are still unclear.

The paper claims that there is no apparent correlation between a jailbreak’s success rate and its impact on model utility. There could be a correlation between success rate and jailbreak tax if we look at some of the individual methods with high success rate. For example, MultiJail in figure 3 a) 4 a) and 5, and PAIR in figure 4 a).
The claim that jailbreak tax persists across alignment types is not verified for reinforcement learning-based alignment methods.

方法与评估标准

Models of different sizes and capabilities are evaluated on world knowledge and math datasets. Moreover, a wide range of manual and automated jailbreak methods are considered in the paper. Multiple alignment methods were tested. However, reinforcement learning-based alignment, which is one of the most common in practice, was omitted.

理论论述

None

实验设计与分析

The way correlation is computed is not clear and could be misleading.

补充材料

There is no supplementary material.

与现有文献的关系

The work is related to the evaluation of jailbreak methods.

遗漏的重要参考文献

None

其他优缺点

None

其他意见或建议

None

作者回复

2025-03-31

We thank the reviewer for their time and comments. We clarify the questions as follows:

Correlation between the jailbreak’s success rate and its impact on model utility

What does each point in the figures represent?

The different points with the same shape represent the same jailbreak method but with different hyperparameters. Thank you for pointing this out, we will include this information in the paper.

How are the correlations in the paper computed? Is it across all jailbreak methods?

Yes, we look into the correlation across all jailbreak methods. We will update our conclusion for Q2 to make this more clear.

Can you report the correlation between success rate and jailbreak tax for each jailbreak method?

We did not report the correlation for the individual methods because we do not have enough data points per method, and not all of the methods have variable hyperparameters suitable for such an experiment.

However, following your suggestion, we conducted additional experiments with PAIR, PAIR (don’t modify), and MultiJail with different hyperparameters (number of rounds for PAIR attacks, different languages for MultiJail). We present the results at this link: https://anonymous.4open.science/api/repo/The-Jailbreak-Tax-Review-Results-015A/file/individual_jailbreaks_the_jailbreak_tax.pdf?v=47c6a014

From the results there is no visible correlation for PAIR, PAIR (don’t modify) and GCG, while for MultiJail both jailbreak tax and success rate are higher for low resource languages (LRLs).

We agree that a better understanding of how different hyperparameters of individual jailbreak influence the jailbreak tax is valuable. However, given that the objective of this paper is to introduce the jailbreak tax as a metric, and demonstrate its existence in general, we leave the extensive experiments on the hyperparameters influence for the future work.

Reinforcement Learning-based alignment

Does the answer to Q4 extend to Reinforcement Learning-based alignment?

Aligning models to refuse defined tasks such as answering math questions is much simpler than aligning the model to safety standards which are often fuzzy and hence require more involved techniques for alignment (e.g., RLHF). In the case of our experiments with GSM8K and WMDP, we don't have to use any reward models for alignment to be successful, hence we didn’t use reinforcement learning for the experiments with these two datasets.

However, we do agree that it is relevant to cover the common safety alignment method used in production models, and that is why we conducted the EvilMath experiment (Figure 5). By rewording math questions to contain dangerous terms such as “bombs” or “nuclear weapons,” we directly rely on the internal safety mechanism of the frontier off-the-shelf model to refuse the question, and therefore we measure the jailbreak tax on the safety-aligned production model. In this case, we use Claude which is aligned with RL-based techniques.

Results on regular (not safety related) world knowledge benchmark

What are the results on a regular (not safety related) world knowledge benchmark? (even with just the system prompt alignment)

Following the reviewer's advice, we tested the performance of our pseudo-aligned models on neutral datasets such as the social science subset of MMLU for refuse-math model and MATH for refuse-bio model. The results are below:

Dataset: MATH Level 1 Refuse: biology

Model	Acc
Unaligned model	0.8847
SFT alignment	0.8697
System prompt alignment	0.9123

Dataset: MMLU Subset (1425 questions) Refuse: math

Model	Acc
Unaligned model	0.8358
SFT alignment	0.8463
System prompt alignment	0.8407

We conclude that there is no significant difference in model performance before and after the alignment that could cause the increase of the jailbreak tax. We will add these results to the paper.

审稿人评论

2025-04-02

Thank your answers. Could you provide the actual correlation coefficients (with the appropriate tests if need be) in the additional experiments you conducted?

作者评论

2025-04-05

We computed the $R^2$ coefficient between Jailbreak Tax and $logit()$ of Jailbreak Success Rate for the experiments we previously provided here: https://anonymous.4open.science/api/repo/The-Jailbreak-Tax-Review-Results-015A/file/individual_jailbreaks_the_jailbreak_tax.pdf?v=47c6a014

The $R^2$ coefficients for correlation of Jailbreak Tax and $logit()$ of Jailbreak Success Rate are listed below:

Attack	R² coefficient
PAIR	0.324
PAIR (don't modify)	0.712
MultiJail	0.518
GCG	0.073

审稿意见

评分: 42025-03-13

This paper questions whether jailbreak attacks on LLMs actually generate useful outputs, e.g., does a recipe of a bomb made by LLMs really make a bomb? This question leads to a new metric called Jailbreak Tax—the performance drop after bypassing safety mechanisms. To this end, the author considered verifiable datasets (e.g., Math) and realign the model to refuse to answer such dataset's question. Then measure the performance difference between original model and realigned models' jailbreak output. Notably, higher jailbreak success does not imply better utility, and performance loss is more severe for complex tasks. These results suggest that jailbreak evaluation should consider not just success rates but also the impact on model capabilities.

给作者的问题

See other sections

论据与证据

This paper's claim (and main question) is very interesting and novel, and it should be shared with the community. While several papers tackle jailbreaking, there has been little study of whether jailbreaking is meaningful.

While some might question math or biology is not a good domain to evaluate jailbreaking (since it is not realistic), I believe such a verifiable domain (i.e., domains that have the specific correct answer) must be considered for explicit quantitative evaluation. It will be very interesting if the authors can provide some "qualitative" evidence in other realistic domain/question (e.g., 'how to make a poisoned pasta'). The main reason is that, for these types of questions, we do not need such system_prompt or SFT, which might be harmful for the quality itself. One easy evaluation might be 'making the LLM swear with a specific word'. Then it is easy to count and evaluate.

方法与评估标准

The proposed method/evaluation is clear. The paper re-aligned the model to reject questions from a specific domain by adding system_prompts or applying supervised fine-tuning (SFT). Then evaluate the model performance change by considering the base performance (i.e., model performance before realignment) and the jailbroken performance of the realigned model.

While the evaluation is very well conducted, there exists one major limitation/question. The model re-alignment (i.e., adding system_prompts or SFT to reject math questions) might affect the domain performance itself. For instance, the system_prompt that makes the refusal of the math question might harm the math ability (but this is hard to evaluate and believe no one knows the truth). So, while it is a proxy evaluation, I think it is good to report the performance change made by the re-alignment on other benchmarks (e.g., MMLU, ARC-c, MATH if biology is selected).

理论论述

The paper does not provide a theoretical claim (I don't think this theory is necessary for this case).

实验设计与分析

All experimental designs are sound and valid. The only concern is the realignment (see section "Methods And Evaluation Criteria").

补充材料

I have read the details in the Appendix (e.g., System prompts in Appendix A).

与现有文献的关系

I think the major contribution of the paper is the introduction of a new and important evaluation metric for jailbreaking methods. While I still believe even giving useless answer to harmful question is a problem, I think alignment tax should also be encountered as a metric.

遗漏的重要参考文献

I think the major references are discussed, and to the best of my knowledge, this paper is the first to introduce the question, "Is jailbreak's answer useful?"

其他优缺点

Strengths

The paper is well written and presented clearly.

The viewpoint is very interesting and the main question should be considered in the domain. While it might be hard to evaluate exact jailbreak tax in my perspective as realignment might be the cause (see Section "Methods And Evaluation Criteria"), I still think the question and hypothesis are interesting. I think it would be great if the authors can address the question in "Methods And Evaluation Criteria".

Weakness

The only weakness that I want to highlight is the possible issue with the realignment. I kindly request the authors to address this issue during the rebuttal.

其他意见或建议

See other sections

作者回复

2025-03-31

We are glad that the reviewer finds our findings interesting and novel, and thinks they should be shared with the community.

We thank the reviewer for the insightful feedback, we carefully considered the concerns and addressed them below.

Realistic safety examples

It will be very interesting if the authors can provide some "qualitative" evidence in other realistic domain/question (e.g., 'how to make a poisoned pasta'). The main reason is that, for these types of questions, we do not need such system_prompt or SFT, which might be harmful for the quality itself.

We agree that measuring the jailbreak tax on existing safety-aligned models is important, and that is why we conducted the EvilMath experiment (Figure 5). In this experiment, we use questions recognized by the original model as harmful and that are rejected without any need for pseudo-alignment (with system-prompt or SFT). By rewording math questions to contain dangerous terms such as “bombs” or “nuclear weapons” we do rely on the internal safety mechanism of an off-the-shelf frontier model (e.g., Claude) to refuse the question, and therefore we directly measure the jailbreak tax on a safety-aligned production model.

One easy evaluation might be 'making the LLM swear with a specific word'. Then it is easy to count and evaluate.

Thank you for the suggestion. Counting swear words is a clever experiment on unsafe content that is objectively verifiable. However, we opted for the EvilMath approach because we aim to evaluate the model on tasks which require reasoning or world knowledge.

Pseudo-alignment could harm the model capabilities

While the evaluation is very well conducted, there exists one major limitation/question. The model re-alignment (i.e., adding system_prompts or SFT to reject math questions) might affect the domain performance itself. For instance, the system_prompt that makes the refusal of the math question might harm the math ability (but this is hard to evaluate and believe no one knows the truth).

Thank you for raising this concern. We agree that alignment can potentially harm the capabilities of the aligned model. To rule out the possibility that the jailbreak tax is coming from the alignment, we ran two baseline attacks that directly circumvent the specific type of alignment we used (i.e. the System Prompt jailbreak for system-prompt alignment and the Finetune attack for SFT alignment). These attacks succeed in breaking the model with little to no impact on utility (black point in Figure 3 and red point in Figure 4) essentially showing that the model utility is preserved after the alignment.

Next to these two baseline attacks, there are other standard attacks which achieve near zero jailbreak tax in certain experiments (e.g., PAIR (don’t modify) and Many-shot in Figure 4a.) demonstrating that model still has the original capability.

So, while it is a proxy evaluation, I think it is good to report the performance change made by the re-alignment on other benchmarks (e.g., MMLU, ARC-c, MATH if biology is selected).

Following the reviewers advice, we test the performance of our aligned models on neutral datasets such as the social science subset of MMLU for the refuse-math model and on MATH for the refuse-bio model. The results are below:

Dataset: MATH Level 1 Refuse: biology

Model	Acc
Unaligned model	0.8847
SFT alignment	0.8697
System prompt alignment	0.9123

Dataset: MMLU Subset (1425 questions) Refuse: math

Model	Acc
Unaligned model	0.8358
SFT alignment	0.8463
System prompt alignment	0.8407

We conclude that there is no significant difference in model performance before and after the alignment. We will add these results to the paper.

最终决定Accept (spotlight poster)

2025-05-01

This paper conducts a deep study on jailbreak attacks, regarding whether the model outputs produced by existing jailbreaks are actually useful. For example, when jailbreaking a model to give instructions for building a bomb, does the jailbreak yield good instructions that can be practically utilized by us? To investigate this issue, this paper proposes jailbreak utility as a new important metric in AI safety and introduces benchmarks to evaluate jailbreaks regarding this metric.

I think this paper is a good work on jailbreak attacks, which would be the first step to address an interesting yet important problem, i.e., whether the model outputs produced by existing jailbreaks are actually useful. All the reviewers recommend accepting this paper. So my final recommendation is to accept this paper. I also recommend the authors incorporate the reviewers' constructive suggestions into the final version.