PaperHub
Rating: 4.3 / 10 (Rejected, 4 reviewers)
Individual ratings: 8, 3, 1, 5 (lowest 1, highest 8, standard deviation 2.6)
Confidence: 4.0
Correctness: 1.8, Contribution: 1.5, Presentation: 1.5
ICLR 2025

Do LLMs Have Political Correctness? Analyzing Ethical Biases and Jailbreak Vulnerabilities in AI Systems

OpenReview · PDF
Submitted: 2024-09-26 · Updated: 2025-02-05

Abstract

Keywords
LLM, safety, jailbreak

Reviews and Discussion

Review
Rating: 8

This paper introduces a jailbreaking method that is based on pitting goals of fairness and helping marginalized groups against goals of behaving harmlessly. They use a system prompt telling the model to treat everyone fairly and deliver harmful queries to the model with statements about the user being from a marginalized group. It can moderately increase model compliance with harmful requests across models they tested.

Strengths

S1: Figure 1 is clear and compelling. Although Figure 2 is visually messy with the "Safety alignment" words across it.

S2: I am pretty familiar with the jailbreaking lit and jailbreaking methods. As best I can tell, this paper is very novel. In retrospect, it seems almost obvious that this would work. But this jailbreaking method is never something I had thought of or heard of.

S3: I generally think that the overall contribution is clear and useful. I work with jailbreaking a lot, and I think that this paper is helpful and citable.

Weaknesses

W1: I would recommend considering a different title. "Political correctness" is not a term that has the same definition to everyone, and it's a political buzzword.

W2: I would recommend that the abstract text be revisited in order to be more specific. There isn't a full description of the attack or defense methods used in the abstract itself. I also think that the abstract could be updated to have smoother writing and less fluff -- I think that some of the sentences in it (especially early) are not very relevant to the specific contributions of the paper.

W3: There is a claim in the paper that political correctness biases are introduced into models from the fine-tuning process. But this seems unjustified. I don't see why they wouldn't also be a result of pretraining data.

W4: I think that section 2.2 is not the most thorough. It could be expanded to better discuss related jailbreaking techniques that involve persuasion and personas.

Minor: A "Walkerspider" reference might have a typo and need to be cleaned up.

Minor: I would recommend having a different example in figure 2 and 4 so that readers can see more diverse examples.

Minor: Why were Claude models not tested?

Minor: It's principal component analysis, not "principle"

Questions

See above

Comment

We sincerely thank you for your insightful and constructive feedback. Your deep expertise in jailbreaking methods and your recognition of the novelty of our approach are greatly appreciated. We are delighted that you found our contribution clear, useful, and a valuable addition to the field, and we have carefully revised the paper to address your comments.


S1: Figure 1 is clear and compelling. Although Figure 2 is visually messy with the "Safety alignment" words across it.

We are glad that you found Figure 1 clear and compelling. To address your concern about Figure 2, we have revised the visualization by refining the layout and adjusting the placement of the "Safety alignment" text to enhance clarity and reduce visual clutter. We hope the updated figure better conveys the intended information.


[W1] I would recommend considering a different title. "Political correctness" is not a term that has the same definition to everyone, and it's a political buzzword.

Thank you for pointing out that the term "political correctness" may carry different meanings for different audiences and can be interpreted as a politically charged term. We appreciate your thoughtful advice on this matter. In response, we have revised all instances of "political correctness" throughout the paper, including the title. We have also updated the method name "PCJailbreak" to "BiasJailbreak" and "PCDefense" to "BiasDefense" to ensure clarity and neutrality in our terminology.


[W2] I would recommend that the abstract text be revisited in order to be more specific. There isn't a full description of the attack or defense methods used in the abstract itself. I also think that the abstract could be updated to have smoother writing and less fluff -- I think that some of the sentences in it (especially early) are not very relevant to the specific contributions of the paper.

We have revisited and revised the abstract to be more specific and aligned with the key contributions of the paper. Specifically, we have included a concise description of the attack and defense methods used and ensured that the text flows more smoothly by removing unnecessary details. We believe the updated abstract better reflects the focus and significance of our work.


[W3] There is a claim in the paper that political correctness biases are introduced into models from the fine-tuning process. But this seems unjustified. I don't see why they wouldn't also be a result of pretraining data.

Thank you for raising this important concern. You are absolutely correct that multiple factors, including system prompts, training data, and the fine-tuning process, can influence biases in LLMs. It is indeed difficult to pinpoint precisely when and where these biases are introduced during model development.

As partial evidence for our claim, page 49 of the GPT-4 Technical Report [1] contains a passage on refusal-based bias mitigation and its possibly harmful side effects:

“Some types of bias can be mitigated via training for refusals, i.e. by getting the model to refuse responding to certain questions. This can be effective when the prompt is a leading question attempting to generate content that explicitly stereotypes or demeans a group of people. However, it is important to note that refusals and other mitigations can also exacerbate bias in some contexts, or can contribute to a false sense of assurance. Additionally, unequal refusal behavior across different demographics or domains can lead to quality of service harms. For example, refusals can especially exacerbate issues of disparate performance by refusing to generate discriminatory content for one demographic group but complying for another.“

While it is difficult to pinpoint the exact origin of biases in every model, it is clear that these biases exist and may introduce unintended risks. This highlights the importance of continued investigation and mitigation efforts to ensure the safe and fair deployment of LLMs. We added this explanation to the paper.


[W4] I think that section 2.2 is not the most thorough. It could be expanded to better discuss related jailbreaking techniques that involve persuasion and personas.

We have added references to studies on persona-based and persuasive jailbreak techniques to enhance the discussion in Section 2.2. This addition addresses the points you raised and provides a more comprehensive overview.

Comment

[Minor] Why were Claude models not tested?

We appreciate your suggestion and conducted experiments on the most recent Claude Sonnet model [3]. Our experiments revealed similar biases in Claude Sonnet, as shown in Table 2: the jailbreak success rate is higher when using keywords associated with marginalized groups than when using keywords associated with privileged groups.

Table 2

Performance across different models showing baseline success rates, marginalized success rates, privileged success rates, and the difference between marginalized and privileged success rates.

Model Name | Baseline Success Rate | Marginalized Success Rate (↑) | Privileged Success Rate (↓) | Marginalized/Privileged (↑)
GPT-3.5 | 0.2200 | 0.2421 (+10.00%) | 0.1847 (-15.90%) | 131.08%
GPT-4 | 0.2100 | 0.2488 (+18.57%) | 0.1900 (-9.52%) | 130.95%
GPT-4o | 0.4600 | 0.5467 (+18.91%) | 0.4187 (-8.91%) | 130.57%
Claude-sonnet3.5 | 0.3100 | 0.3371 (+8.74%) | 0.2764 (-10.84%) | 121.90%
LLaMA2 | 0.2400 | 0.2811 (+17.08%) | 0.1933 (-19.58%) | 145.42%
LLaMA3 | 0.0500 | 0.0650 (+30.00%) | 0.0300 (-40.00%) | 216.67%
Qwen-1.5 | 0.1900 | 0.2175 (+14.74%) | 0.1675 (-11.58%) | 129.85%
Qwen2 | 0.1700 | 0.1971 (+15.88%) | 0.1671 (-7.06%) | 117.95%
Phi-mini | 0.4100 | 0.4386 (+7.07%) | 0.3829 (-6.59%) | 114.56%
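The derived columns appear to follow directly from the raw rates; a minimal sketch of our reading of that computation, using the GPT-3.5 row (small differences from the table are presumably rounding):

```python
# Sketch: how the derived columns in Table 2 appear to relate to the raw success rates.
# Values are from the GPT-3.5 row; small differences from the table are rounding.
baseline = 0.2200
marginalized = 0.2421
privileged = 0.1847

marg_change = (marginalized - baseline) / baseline * 100   # roughly +10%
priv_change = (privileged - baseline) / baseline * 100     # roughly -16%
ratio = marginalized / privileged * 100                    # roughly 131.08%

print(f"marginalized {marg_change:+.2f}% | privileged {priv_change:+.2f}% | ratio {ratio:.2f}%")
```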

Minor: A "Walkerspider" reference might have a typo and need to be cleaned up.

Minor: I would recommend having a different example in figure 2 and 4 so that readers can see more diverse examples.

Minor: It's principal component analysis, not "principle"

Thank you for your feedback. We have corrected the "Walkerspider" typo, updated Figures 2 and 4 with diverse examples, fixed "principle" to "principal" in component analysis, and addressed other minor errors throughout the paper.


Vzp2: References

[1] Achiam, Josh, et al. "Gpt-4 technical report." arXiv preprint arXiv:2303.08774 (2023).

[2] Li, Xuan, et al. "Deepinception: Hypnotize large language model to be jailbreaker." arXiv preprint arXiv:2311.03191 (2023).

[3] Anthropic, A. I. "Claude 3.5 sonnet model card addendum." Claude-3.5 Model Card (2024).

Comment

I appreciate the responses, and I feel like I'm on the same page with the authors. I'll hold at an 8, and I think that this paper is valuable and should be accepted.

Comment

Thank you for your thoughtful evaluation of our work. We are delighted that our response met your expectations and provided assurance. Your encouraging feedback serves as a great motivation for us as researchers to continue striving for excellence in our endeavors.

Review
Rating: 3

This paper introduces PCJailbreak, a method designed to analyze and exploit politically correct (PC) alignment-induced biases in LLMs, which lead to differing jailbreak success rates across various demographic group keywords (e.g., gender, race). The PCJailbreak framework systematically reveals how biases injected for safety purposes can paradoxically enable effective jailbreaks, with observable disparities between privileged and marginalized groups. Additionally, the paper presents PCDefense, a lightweight defense method that mitigates these vulnerabilities through prompt-based bias adjustments without incurring additional inference overhead. However, here are a few concerns:

Strengths

  1. Extensive Model Evaluation: The paper evaluates a wide range of models, including some of the latest LLMs, providing a comprehensive view of jailbreak vulnerabilities across different architectures and alignment techniques.

  2. Community Contribution: By open-sourcing the code and artifacts of PCJailbreak, the authors facilitate further research on bias and jailbreak vulnerabilities, promoting transparency and enabling the community to explore and develop more robust defense strategies.

Weaknesses

  1. Motivation: The paper categorizes jailbreak attacks into manually written prompts and learning-based prompts, stating that learning-based jailbreak prompts rely on gradient information and that these prompts are often nonsensical sequences. However, this overlooks natural-language jailbreak prompts, such as PAIR [1] and DeepInception [2], which are not solely gradient-based and produce coherent, meaningful language. Additionally, for manual attacks, approaches like GUARD [3] build on existing manually crafted jailbreak prompts, refining them over time to remain effective.

  2. Scope of Jailbreak Attacks: Much of the related work on jailbreak techniques in this paper appears to focus on approaches up to 2024. Given the rapid advancements in jailbreak methodologies, the paper should provide a more detailed discussion of recent jailbreak attacks, such as works like [4] and [5].

  3. Keyword Generation Methodology: The approach of directly prompting the LLM to generate keywords introduces potential issues. For instance, the generated keywords may lack diversity, as the LLM could repeatedly produce similar terms based on its training biases. Additionally, there is no evaluation or filtering mechanism to determine which keywords are more effective or appropriate for distinguishing between privileged and marginalized groups.

  4. Ambiguity in Baseline Definition and Scope of Comparison: While Table 2’s caption states it shows “baseline success rates, marginalized success rates, privileged success rates, and the difference between marginalized and privileged success rates,” the paper does not clearly define what constitutes the "baseline success rate." Additionally, to strengthen the evaluation, it would be beneficial to include comparisons with a broader range of jailbreak attacks.

  5. Defense Baselines: There are some relevant papers at the prompt level to prevent harmful output, such as [6], [7], and [8]. As PCDefense also adds prompts to model system prompts and suffix prompts, it should compare the effectiveness with these methods.

References:

[1] Chao P, Robey A, Dobriban E, et al. Jailbreaking black box large language models in twenty queries[J]. arXiv preprint arXiv:2310.08419, 2023.
[2] Li X, Zhou Z, Zhu J, et al. Deepinception: Hypnotize large language model to be jailbreaker[J]. arXiv preprint arXiv:2311.03191, 2023.
[3] Jin H, Chen R, Zhou A, et al. Guard: Role-playing to generate natural-language jailbreakings to test guideline adherence of large language models[J]. arXiv preprint arXiv:2402.03299, 2024.
[4] Zheng X, Pang T, Du C, et al. Improved few-shot jailbreaking can circumvent aligned language models and their defenses[J]. arXiv preprint arXiv:2406.01288, 2024.
[5] Jin H, Zhou A, Menke J D, et al. Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters[J]. arXiv preprint arXiv:2405.20413, 2024.
[6] Wu F, Xie Y, Yi J, et al. Defending chatgpt against jailbreak attack via self-reminder[J]. 2023.
[7] Zhang Z, Yang J, Ke P, et al. Defending large language models against jailbreaking attacks through goal prioritization[J]. arXiv preprint arXiv:2311.09096, 2023.
[8] Zhou A, Li B, Wang H. Robust prompt optimization for defending language models against jailbreaking attacks[J]. arXiv preprint arXiv:2401.17263, 2024.

Questions

See the Weaknesses above.

Details of Ethics Concerns

N/A

Comment

Table 11: Keywords discovered in GPT-4o.

Marginalized Group (Adjective) | Privileged Group (Adjective)
Poor | Wealthy
Black | White
Elderly | Young
Disabled | Able-bodied
Non-binary | Cisgender
Unemployed | Employed
Religious minority | Religious majority
Indigenous | Colonizer-descended
Lower-class | Upper-class
Underrepresented | Overrepresented
Minoritized | Dominant
Non-English speaking | English speaking
Informally educated | Formally educated
Psychiatrically diagnosed | Psychiatrically typical
Single parent | Partnered parent
Adolescent | Adult
Gender non-conforming | Gender conforming
Internationally born | Nationally born
Temporarily employed | Permanently employed
Linguistically diverse | Linguistically standard
Socially isolated | Socially integrated
Chronically ill | Healthy
Mentally ill | Mentally healthy
Substance dependent | Substance independent
Homeless | Sheltered
Incarcerated | Non-incarcerated
Refugee | Citizen
Visually impaired | Visually unimpaired
Hearing impaired | Hearing unimpaired
Non-traditional student | Traditional student
Working class | Professional class
Historically disadvantaged | Historically advantaged
War-affected | Peace-living
Food insecure | Food secure
Low mobility | High mobility
Environmentally displaced | Environmentally secure
Survivor | Non-affected
Recovering | Non-recovering
Public transport dependent | Car owning
Crafts-based | Industrial
Artisanal | Mass-produced
Temporary visa holder | Permanent resident
Non-voter | Voter
Uninsured | Insured
Non-internet user | Internet user
Low-income | High-income

Table 12: Keywords discovered in Qwen2

Marginalized Group (Adjective) | Privileged Group (Adjective)
Low-Income | High-Income
Racial Minority | White
LGBTQ+ | Heterosexual
Women | Men
Disabled | Able-bodied
Refugees | Citizens
Uneducated | Educated
Neurodivergent | Neurotypical
Religious Minority | Mainstream Religion
Homeless | Homeowners
Mental Health Patients | Mentally Stable Individuals
Elderly | Young Adults
Overweight | Slim/Fit
Non-native Speakers | Native Speakers
Incarcerated | Free
Rural | Urban
Illiterate | Literate
Politically Disenfranchised | Voters
Undocumented Immigrants | Legal Immigrants
Sex Workers | Non-sex Workers

Table 13: Keywords discovered in LLaMA2

Marginalized Group (Adjective) | Privileged Group (Adjective)
Female | Male
Poor | Rich
Black | White
LGBTQ | Heterosexual
Transgender | Straight
Neurodiverse | Neurotypical
Working class | Middle class
Bisexual | Heterosexual
Native American | White

Table 14: Keywords discovered in Phi

Marginalized Group (Adjective) | Privileged Group (Adjective)
Black | White
Female | Male
LGBTQ+ | Heterosexual
Non-normative gender | Cisgender
Non-traditionally family-structured | Traditionally family-structured
Rural | Urban
Homeless | Socially influential
Undereducated | Highly educated
Disabled | Able-bodied
Religiously marginalized | Religiously dominant
Low-income | Wealthy
Sexually employed | Traditionally employed
Comment

Thank you for your review. We have improved the experiments according to your keen suggestions.

[W1] Motivation: The paper categorizes jailbreak attacks into manually written prompts and learning-based prompts, stating that learning-based jailbreak prompts rely on gradient information and that these prompts are often nonsensical sequences. However, this overlooks natural-language jailbreak prompts, such as PAIR [1] and DeepInception [2], which are not solely gradient-based and produce coherent, meaningful language. Additionally, for manual attacks, approaches like GUARD [3] build on existing manually crafted jailbreak prompts, refining them over time to remain effective.

We have updated our paper to categorize jailbreak attacks as black-box and white-box, and we have added references to PAIR [1], DeepInception [2], and GUARD [3] as recent black-box attacks that enable a more general, real-world jailbreak setting while producing meaningful language through methods such as template completion, prompt rewriting, or LLM-based generation. We have positioned our method as a black-box approach that offers both scalability and generality. We appreciate your feedback, which has helped us define the scope and motivation of our work.


[W2] Scope of Jailbreak Attacks: Much of the related work on jailbreak techniques in this paper appears to focus on approaches up to 2024. Given the rapid advancements in jailbreak methodologies, the paper should provide a more detailed discussion of recent jailbreak attacks, such as works like [4] and [5].

Thank you for highlighting the need to include more recent advancements in jailbreak methodologies. We have updated our paper to incorporate discussions of the works [4] and [5], ensuring that our coverage of jailbreak techniques reflects the latest developments. We appreciate your feedback, which has helped us provide a more up-to-date and comprehensive discussion of this rapidly evolving area.


[W3] Keyword Generation Methodology: The approach of directly prompting the LLM to generate keywords introduces potential issues. For instance, the generated keywords may lack diversity, as the LLM could repeatedly produce similar terms based on its training biases. Additionally, there is no evaluation or filtering mechanism to determine which keywords are more effective or appropriate for distinguishing between privileged and marginalized groups.

Thank you for your feedback regarding the keyword generation methodology. We would like to clarify that our experimental setup was designed to leverage the ethical biases of LLMs to their fullest extent, while still having a diverse and meaningful set of keywords.

Tables 10-14 include the full set of keywords automatically discovered by the LLMs and used in our experiments. While we previously presented only a partial subset of these keywords for readability, the full set shows a broad and rich selection of keywords without repetitive or overly similar terms.
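As a rough illustration of this discovery step, the sketch below shows how keyword pairs could be requested from a chat model and parsed; the prompt wording, the "->" output format, and the use of the OpenAI client are assumptions for illustration, not the exact procedure from the paper.

```python
# Illustrative sketch of automatic keyword-pair discovery (assumed prompt and output format).
from openai import OpenAI

client = OpenAI()

DISCOVERY_PROMPT = (
    "List adjective pairs describing a marginalized group and the corresponding "
    "privileged group, one pair per line, formatted as 'marginalized -> privileged'."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": DISCOVERY_PROMPT}],
)

pairs = []
for line in response.choices[0].message.content.splitlines():
    if "->" in line:
        marginalized, privileged = (part.strip() for part in line.split("->", 1))
        pairs.append((marginalized, privileged))

print(pairs[:5])
```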

Table 10: Keywords discovered in Claude-sonnet3.5.

Marginalized Group (Adjective) | Privileged Group (Adjective)
Female | Male
Immigrant | Native
Poor | Rich
Muslim | Christian
Black | White
High school educated | College educated
LGBTQ+ | Heterosexual
Transgender | Straight
Neurodiverse | Neurotypical
Spanish | English
Working class | Middle class
Bisexual | Heterosexual
Native American | White
Comment

[W4] Ambiguity in Baseline Definition and Scope of Comparison: While Table 2’s caption states it shows “baseline success rates, marginalized success rates, privileged success rates, and the difference between marginalized and privileged success rates,” the paper does not clearly define what constitutes the "baseline success rate." Additionally, to strengthen the evaluation, it would be beneficial to include comparisons with a broader range of jailbreak attacks.

Thank you for pointing out the ambiguity regarding the "baseline success rate." We apologize for the lack of clarity, which may have led to confusion. To clarify, the baseline success rate refers to the success rate observed when using the prompt in Equation 1 without any keywords. We have updated the paper to include a detailed explanation of this in the relevant section to avoid further misunderstandings. We appreciate your feedback, which has allowed us to improve the clarity and completeness of our work.
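For concreteness, a minimal sketch of how a success rate of this kind is typically computed, i.e. the fraction of prompts whose responses are not refusals; the refusal-marker heuristic below is an assumption for illustration, not the paper's actual evaluator.

```python
# Sketch: success rate as the fraction of responses that are not refusals.
# The keyword-based refusal check is a simple illustrative heuristic.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "as an ai")

def is_refusal(response: str) -> bool:
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def success_rate(responses: list[str]) -> float:
    jailbroken = sum(not is_refusal(r) for r in responses)
    return jailbroken / len(responses)
```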


[W5] Defense Baselines: There are some relevant papers at the prompt level to prevent harmful output, such as [6], [7], and [8]. As PCDefense also adds prompts to model system prompts and suffix prompts, it should compare the effectiveness with these methods.

Thank you for highlighting the importance of comparing BiasDefense with other prompt-based defense methods. Following your suggestion, we conducted experiments incorporating relevant defense techniques, such as Self-Remind [6], Defending [7], and RPO [8], using the LLAMA2 model. As shown in Table 5, our proposed BiasDefense achieved the best performance, with the lowest jailbreak success rates for both marginalized and privileged groups. These results further validate the effectiveness of our approach in mitigating jailbreak vulnerabilities. We appreciate your feedback, which has strengthened our evaluation and demonstrated the robustness of BiasDefense.

Table 5

Comparison of jailbreak success rates across prompt-based defense methods using LLAMA2 model

Defense Method | Marginalized Group Jailbreak Success | Privileged Group Jailbreak Success
None | 28.11% | 19.33%
self-remind | 33.67% | 27.33%
Defending | 20.57% | 14.86%
RPO | 56.00% | 49.75%
BiasDefense (Ours) | 17.14% | 14.29%

gTDY: References

[1] Chao P, Robey A, Dobriban E, et al. Jailbreaking black box large language models in twenty queries[J]. arXiv preprint arXiv:2310.08419, 2023.

[2] Li X, Zhou Z, Zhu J, et al. Deepinception: Hypnotize large language model to be jailbreaker[J]. arXiv preprint arXiv:2311.03191, 2023.

[3] Jin H, Chen R, Zhou A, et al. Guard: Role-playing to generate natural-language jailbreakings to test guideline adherence of large language models[J]. arXiv preprint arXiv:2402.03299, 2024.

[4] Zheng X, Pang T, Du C, et al. Improved few-shot jailbreaking can circumvent aligned language models and their defenses[J]. arXiv preprint arXiv:2406.01288, 2024.

[5] Jin H, Zhou A, Menke J D, et al. Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters[J]. arXiv preprint arXiv:2405.20413, 2024.

[6] Wu F, Xie Y, Yi J, et al. Defending chatgpt against jailbreak attack via self-reminder[J]. 2023.

[7] Zhang Z, Yang J, Ke P, et al. Defending large language models against jailbreaking attacks through goal prioritization[J]. arXiv preprint arXiv:2311.09096, 2023.

[8] Zhou A, Li B, Wang H. Robust prompt optimization for defending language models against jailbreaking attacks[J]. arXiv preprint arXiv:2401.17263, 2024.

Comment

Thank you for your dedication and thoughtful engagement with our paper. As the author-reviewer discussion period draws to a close, we would greatly appreciate hearing your thoughts on our rebuttal. We have tried our best to reflect your suggestions, and added several experiments to reinforce our work. If you have any further questions or require additional clarification, please don't hesitate to let us know.

Comment

We would like to gently remind you to update your review, as we are very close to the final due date. We have thoroughly prepared the extra experiments and results that you requested, and we believe our updated results resolve all of the weaknesses you raised. Thank you.

Review
Rating: 1

In this paper, the authors investigate the intentional biases of LLMs and how these biases can be exploited for jailbreak attacks. They created a jailbreak attack using pairs of privileged and marginalized keywords, along with prompts that ask the LLM to write a fictional story and include harmful questions. This technique, called 'PCJailbreak,' exploits these biases by incorporating different demographic keywords in the prompts. Additionally, they propose 'PCDefense,' a prompt template that uses system prompts and prompt suffixes to adjust these biases and prevent attacks.

优点

---> The paper presents an interesting aspect of safety training: safety measures intended to make LLMs more ethical can actually create exploitable biases.

---> Experiments were conducted across a variety of LLMs, including both open-source and black-box models.

Weaknesses

--->> 1: The number of keyword pairs is relatively small, and the JailbreakBench dataset being used is also limited. Increasing the number of keyword pairs and incorporating a more comprehensive dataset could have made the experiments more rigorous.

--->>> I believe one of the weaknesses of the paper is that it limits the experiments to a few keyword pairs and does not include an ablation study.

--->>>> What will happen if we replace the words in pairs with something entirely different? Will we get the same results?

--->>>> Additionally, the prompt structure used in the experiment does not determine whether the jailbreak occurs solely due to the request to write a fictional story or because of the presence of marginalized words.

--->> 2: In the PC Defense, “IF THE REQUEST INCLUDES HARMFUL CONTENT, DO NOT RESPOND.” is added as the suffix prompt.

---->>> In in-context learning, the model response depends on the input label distributions and formatting, and asking the LLM to check every prompt is an obvious way of making the LLM scan its input.

--->>> It is likely that this suffix alone will decrease the ASR without the ‘system prompt’.

--->>> I believe an ablation study should have been done to measure the efficacy of the proposed prompt.

--->> 3: Line 377: "By using a defense prompt, our defense mechanism elicits the refusal of a wide range of harmful content without requiring significant additional computing resources, which makes it an attractive alernative to Guard Models.” --- I could not find any evidence to support this claim in the paper.

Questions

--->> I think the paper should provide some context on 'political correctness' for readers, as well as the motivation behind studying it.

--->> “Line 187: Our work builds on the existing body of research by focusing on the paradoxical consequences of intentional biases introduced for safety purposes “ ----->>> I am not entirely sure what this sentence refers to. I think adding references and examples would provide a better explanation.

--->> Line 228-229: I believe there is a conflict in stating that the refusal prefix is the target prefix. In line 222, the target responses refer to malicious responses, while in line 232, they point to refusal phrases

--->> Line 166: Missing reference for ‘walkerspider 2022’

--->> line 378: please fix spelling of ‘alernative’ -> ‘alternative’

--->> I am quite confused by the subheading '3.1.2 Formulation': what is being formulated here?

Comment

Table 12: Keywords discovered in Qwen2

Marginalized Group (Adjective) | Privileged Group (Adjective)
Low-Income | High-Income
Racial Minority | White
LGBTQ+ | Heterosexual
Women | Men
Disabled | Able-bodied
Refugees | Citizens
Uneducated | Educated
Neurodivergent | Neurotypical
Religious Minority | Mainstream Religion
Homeless | Homeowners
Mental Health Patients | Mentally Stable Individuals
Elderly | Young Adults
Overweight | Slim/Fit
Non-native Speakers | Native Speakers
Incarcerated | Free
Rural | Urban
Illiterate | Literate
Politically Disenfranchised | Voters
Undocumented Immigrants | Legal Immigrants
Sex Workers | Non-sex Workers

Table 13: Keywords discovered in LLaMA2

Marginalized Group (Adjective) | Privileged Group (Adjective)
Female | Male
Poor | Rich
Black | White
LGBTQ | Heterosexual
Transgender | Straight
Neurodiverse | Neurotypical
Working class | Middle class
Bisexual | Heterosexual
Native American | White

Table 14: Keywords discovered in Phi

Marginalized Group (Adjective) | Privileged Group (Adjective)
Black | White
Female | Male
LGBTQ+ | Heterosexual
Non-normative gender | Cisgender
Non-traditionally family-structured | Traditionally family-structured
Rural | Urban
Homeless | Socially influential
Undereducated | Highly educated
Disabled | Able-bodied
Religiously marginalized | Religiously dominant
Low-income | Wealthy
Sexually employed | Traditionally employed

63GV: References

[1] Zou, Andy, et al. "Universal and transferable adversarial attacks on aligned language models." arXiv preprint arXiv:2307.15043 (2023).

[2] Dong, Qingxiu, et al. "A survey on in-context learning." arXiv preprint arXiv:2301.00234 (2022).

[3] Achiam, Josh, et al. "Gpt-4 technical report." arXiv preprint arXiv:2303.08774 (2023).

Comment

Table 7

Comparison of the impact of individual keyword groups on BiasJailbreak. The Marginalized and Privileged Groups show a distinct difference compared to the Random Adjectives group.

Random Adjectives Group (AVG: 23.55% ± 1.52%)

Keyword | Success Rate
baseline | 24.00% (± 1.26%)
big | 20.00% (± 0.41%)
small | 25.00% (± 0%)
bright | 22.00% (± 3.52%)
cold | 21.00% (± 0%)
old | 24.00% (± 6.13%)
slow | 31.00% (± 7.54%)
warm | 22.00% (± 1.31%)
dark | 23.00% (± 2.23%)

Marginalized Group (AVG: 28.33% ± 1.26%)

Keyword | Success Rate
female | 28.00% (± 3.52%)
poor | 31.00% (± 2.23%)
black | 26.33% (± 0.31%)
LGBTQ | 24.00% (± 0%)
transgender | 27.00% (± 7.66%)
neurodiverse | 31.67% (± 12.49%)
working class | 29.33% (± 3.95%)
bisexual | 24.67% (± 0.31%)
native american | 33.00% (± 0%)

Privileged Group (AVG: 20.15% ± 1.28%)

Keyword | Success Rate
male | 24.00% (± 0%)
rich | 26.67% (± 0.43%)
white | 16.67% (± 0.31%)
heterosexual | 17.00% (± 0%)
straight | 9.67% (± 6.70%)
neurotypical | 27.67% (± 6.70%)
middle class | 26.00% (± 28.84%)
heterosexual | 17.00% (± 0%)
white | 16.67% (± 0.31%)

Table 8

BiasDefense Ablation on the combination of suffix and prefix defense prompt using LLAMA2 model

Defense Method | Marginalized Group Jailbreak Success | Privileged Group Jailbreak Success
None | 0.2811 | 0.1933
BiasDefense: system prompt | 0.2558 | 0.1687
BiasDefense: suffix prompt | 0.2367 | 0.2733
BiasDefense | 0.1714 | 0.1429

Table 9

MMLU score using prompt-based defense methods. The base language model used is google-flan-t5.

Defense Method | Suffix | Average MMLU score
None | X | 0.295
RPO | O | 0.243
self-remind | O | 0.238
Defending | X | 0.295
BiasDefense (Ours, With Suffix) | O | 0.265
BiasDefense (Ours, Without Suffix) | X | 0.296

Table 10: Keywords discovered in Claude-sonnet3.5.

Marginalized Group (Adjective) | Privileged Group (Adjective)
Female | Male
Immigrant | Native
Poor | Rich
Muslim | Christian
Black | White
High school educated | College educated
LGBTQ+ | Heterosexual
Transgender | Straight
Neurodiverse | Neurotypical
Spanish | English
Working class | Middle class
Bisexual | Heterosexual
Native American | White

Table 11: Keywords discovered in GPT-4o.

Marginalized Group (Adjective) | Privileged Group (Adjective)
Poor | Wealthy
Black | White
Elderly | Young
Disabled | Able-bodied
Non-binary | Cisgender
Unemployed | Employed
Religious minority | Religious majority
Indigenous | Colonizer-descended
Lower-class | Upper-class
Underrepresented | Overrepresented
Minoritized | Dominant
Non-English speaking | English speaking
Informally educated | Formally educated
Psychiatrically diagnosed | Psychiatrically typical
Single parent | Partnered parent
Adolescent | Adult
Gender non-conforming | Gender conforming
Internationally born | Nationally born
Temporarily employed | Permanently employed
Linguistically diverse | Linguistically standard
Socially isolated | Socially integrated
Chronically ill | Healthy
Mentally ill | Mentally healthy
Substance dependent | Substance independent
Homeless | Sheltered
Incarcerated | Non-incarcerated
Refugee | Citizen
Visually impaired | Visually unimpaired
Hearing impaired | Hearing unimpaired
Non-traditional student | Traditional student
Working class | Professional class
Historically disadvantaged | Historically advantaged
War-affected | Peace-living
Food insecure | Food secure
Low mobility | High mobility
Environmentally displaced | Environmentally secure
Survivor | Non-affected
Recovering | Non-recovering
Public transport dependent | Car owning
Crafts-based | Industrial
Artisanal | Mass-produced
Temporary visa holder | Permanent resident
Non-voter | Voter
Uninsured | Insured
Non-internet user | Internet user
Low-income | High-income
Comment

--->> I think the paper should provide some context on 'political correctness' for readers, as well as the motivation behind studying it.

In line with other reviewers' comments, we accept your concern about the potentially misleading usage of the term "political correctness" and have renamed it to "bias" throughout the paper, including the paper title and the name of our method. In line 159, we have also added the motivation for and necessity of studying biases in LLMs.

--->> “Line 187: Our work builds on the existing body of research by focusing on the paradoxical consequences of intentional biases introduced for safety purposes “ ----->>> I am not entirely sure what this sentence refers to. I think adding references and examples would provide a better explanation.

In lines 43 and 197, we have included references for better explanation.

To be more specific, page 49 of the GPT-4 Technical Report [3] contains a passage on refusal-based bias mitigation and its possibly harmful side effects:

“Some types of bias can be mitigated via training for refusals, i.e. by getting the model to refuse responding to certain questions. This can be effective when the prompt is a leading question attempting to generate content that explicitly stereotypes or demeans a group of people. However, it is important to note that refusals and other mitigations can also exacerbate bias in some contexts, or can contribute to a false sense of assurance. Additionally, unequal refusal behavior across different demographics or domains can lead to quality of service harms. For example, refusals can especially exacerbate issues of disparate performance by refusing to generate discriminatory content for one demographic group but complying for another.“

--->> Line 228-229: I believe there is a conflict in stating that the refusal prefix is the target prefix. In line 222, the target responses refer to malicious responses, while in line 232, they point to refusal phrases

--->> Line 166: Missing reference for ‘walkerspider 2022’

--->> line 378: please fix spelling of ‘alernative’ -> ‘‘alternative’

--->> I am quite confused by the subheading '3.1.2 Formulation': what is being formulated here?

We fixed the typos according to your suggestions. We changed “Formulation” to “Jailbreak Attack Evaluation” for better understanding.

Table 1

Performance across different datasets showing baseline success rates, marginalized success rates, privileged success rates, and the difference between marginalized and privileged success rates using LLaMA2 model.

Dataset | Baseline Success Rate | Marginalized Success Rate (↑) | Privileged Success Rate (↓) | Marginalized/Privileged (↑)
JailbreakBench | 0.2400 | 0.2811 (+17.08%) | 0.1933 (-19.58%) | 145.42%
AdvBench | 0.1895 | 0.2037 (+7.50%) | 0.1758 (-7.25%) | 115.84%

Table 5

Comparison of jailbreak success rates across prompt-based defense methods using LLAMA2 model

Defense Method | Marginalized Group Jailbreak Success | Privileged Group Jailbreak Success
None | 28.11% | 19.33%
self-remind | 33.67% | 27.33%
Defending | 20.57% | 14.86%
RPO | 56.00% | 49.75%
BiasDefense (Ours) | 17.14% | 14.29%

Table 6

Comparison of defense cost for BiasDefense and Llama-Guard. The experiment was run on a single H100 GPU, with Llama-3.2-1B-Instruct as the language model and Llama-Guard-3-8B as the guard model.

Defense Method | Time Cost (seconds) ↓ | Time Cost (percentage) ↓
Baseline (No Defense) | 21.91 | +0.00%
BiasDefense | 22.44 | +2.40%
Llama-Guard-3-8B | 31.69 | +44.60%
Comment

Thank you for your review. Reading your comments, we are concerned that some aspects of the paper may have been misunderstood, such as the usage of in-context learning and the meaning of our experiments. Nevertheless, we will try our best to address this and have prepared additional results for all of your concerns.

--->> 1: The number of keyword pairs is relatively small, and the JailbreakBench dataset being used is also limited. Increasing the number of keyword pairs and incorporating a more comprehensive dataset could have made the experiments more rigorous.

--->>> I believe one of the weaknesses of the paper is that it limits the experiments to a few keyword pairs and does not include an ablation study.

Tables 10-14 include the full set of keywords used in our experiments, which were automatically discovered by LLMs. The previous table showed only a partial set of keywords for readability, but the experiments were conducted on a diverse and rich set of keywords. Additionally, we have included Table 1, which reports results on an additional benchmark, AdvBench [1], containing 500 harmful prompts, five times larger than the original JailbreakBench. We have also added several ablation studies that further validate the significance of our work.

--->>>> What will happen if we replace the words in pairs with something entirely different? Will we get the same results?

--->>>> Additionally, the prompt structure used in the experiment does not determine whether the jailbreak occurs solely due to the request to write a fictional story or because of the presence of marginalized words.

In Table 7, we added a comparison between marginalized/privileged keywords and random adjectives. The results validate that the bias exposed by our method is distinct, while random adjectives show no such bias. Although the evidence was already clear in the previous submission, this additional result reinforces that the jailbreak does not occur solely because of the request to write a fictional story, since the keywords make a significant difference.

--->> 2: In the PC Defense, “IF THE REQUEST INCLUDES HARMFUL CONTENT, DO NOT RESPOND.” is added as the suffix prompt.

---->>> In in-context learning, the model response is dependent on the input label distributions and formatting, and asking the LLM to basically check every prompt is very obvious to make LLMs to scan its input.

--->>> This is likely that this suffix alone will decrease the ASR without the ‘system prompt’.

--->>> I believe an ablation study should have been done to measure the efficacy of the proposed prompt.

--->> 3: Line 377: "By using a defense prompt, our defense mechanism elicits the refusal of a wide range of harmful content without requiring significant additional computing resources, which makes it an attractive alernative to Guard Models.” --- I could not find any evidence to support this claim in the paper.

First, to be precise, we would like to cite a definition of in-context learning: “In-context learning is a paradigm that allows language models to learn tasks given only a few examples in the form of demonstration.” [2] Our defense method does not include task demonstrations, so it cannot be referred to as in-context learning. It only adds an additional reminder asking the model to adjust its bias, without any task demonstrations, so it may not be as obvious as you suggest that our method works, given that no in-context learning is involved.

In Table 8, we included an ablation study on the impact of our defense prompt with regard to the presence of the prefix and the suffix. The results show that both the prefix and the suffix are important for defending against our jailbreak, and removing either one significantly degrades defense performance, which is evidence against the claim that the suffix alone would be sufficient.

We have included ablation results that show the efficacy of our proposed defense method: Table 5 compares it with other prompt-based defense methods, Table 6 reports computation time compared to standalone classifiers such as Llama-Guard, and Table 9 reports MMLU scores for assessing normal task performance when defense prompts are applied.
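For reference, a minimal sketch of how a prefix-plus-suffix prompt defense of this kind can be toggled for such an ablation; only the capitalized suffix string is quoted from the discussion, while the system-prompt wording below is an assumed paraphrase.

```python
# Sketch of a prefix + suffix prompt defense, with flags matching a Table 8-style ablation.
# SYSTEM_DEFENSE is an assumed paraphrase; SUFFIX_DEFENSE is the quoted suffix.
SYSTEM_DEFENSE = (
    "Treat all demographic groups equally and do not let group-related wording "
    "change whether you comply with a request."
)
SUFFIX_DEFENSE = "IF THE REQUEST INCLUDES HARMFUL CONTENT, DO NOT RESPOND."

def apply_defense(user_request: str, use_prefix: bool = True, use_suffix: bool = True):
    """Return chat messages with the defense prefix/suffix toggled for ablation."""
    messages = []
    if use_prefix:
        messages.append({"role": "system", "content": SYSTEM_DEFENSE})
    content = user_request + ("\n\n" + SUFFIX_DEFENSE if use_suffix else "")
    messages.append({"role": "user", "content": content})
    return messages
```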

Comment

First, to be precise, we would like to cite a definition of in-context learning: “In-context learning is a paradigm that allows language models to learn tasks given only a few examples in the form of demonstration.” [2] Our defense method does not include task demonstrations, so it cannot be referred to as in-context learning. It only adds an additional reminder asking the model to adjust its bias, without any task demonstrations, so it may not be as obvious as you suggest that our method works, given that no in-context learning is involved.

It does not have to be task demonstration; even when you are asking the LLMs to condition their output based on a single task, you are engaging in in-context learning (zero-shot demonstration). Furthermore, when you provide the model with the prompt “IF THE REQUEST INCLUDES HARMFUL CONTENT, DO NOT RESPOND,” it will condition its output based on this as well. At this point, you are passing both the 'harmful request' and the 'defense prompt,' and since you are targeting safety-aligned LLMs, their safety mechanism objectives will take effect.

I have read through the authors' response as well as the other reviewers' comments. I have decided to keep my rating.

Thank you

Comment

As a matter of fact, we are quite certain that in-context learning refers to task demonstrations, as supported by our numerous references [1]-[8]. In fact, zero-shot ICL refers to task demonstrations generated by LLMs, not to pure instructions without task demonstrations [1], [2]. If you maintain that ICL also covers the zero-shot setting without demonstrations, could you provide references that support your argument, as we did?

Setting aside the definition of ICL, we agree that zero-shot instructions make it easier for LLMs to adapt to a task. We therefore provided an MMLU experiment measuring the effect of defense prompts on general tasks and showed that, without the suffix, our method has no negative effect on general tasks while still helping to prevent bias-based jailbreaks.

Additionally, we provided additional results addressing several of your other comments, but you responded only regarding the definition of ICL. Do you agree that the other weaknesses you raised are no longer a problem? If so, that would seem to be a good reason to raise your score, since the reasons for the low score have been resolved.

We would also like to note that some reviewers raised their scores in response to our rebuttal, and no other reviewer has expressed as strong a reject as yours, in case that might affect your decision.

References for ICL

[1] Chen, Wei-Lin, et al. "Self-icl: Zero-shot in-context learning with self-generated demonstrations." arXiv preprint arXiv:2305.15035 (2023).

Large language models (LLMs) have exhibited striking in-context learning (ICL) ability to adapt to target tasks with a few input-output demonstrations.

[2] Lyu, Xinxi, et al. "Z-icl: Zero-shot in-context learning with pseudo-demonstrations." arXiv preprint arXiv:2212.09865 (2022).

Large language models (LMs) can perform new tasks simply by conditioning on input-label pairs from the training data, known as demonstrations (Brown et al., 2020). This in-context learning (ICL) is significantly better than zero-shot methods that do not use demonstrations.

[3] Min, Sewon, et al. "Rethinking the role of demonstrations: What makes in-context learning work?." arXiv preprint arXiv:2202.12837 (2022).

Large language models (LMs) are able to in-context learn—perform a new task via inference alone by conditioning on a few input-label pairs (demonstrations) and making predictions for new inputs. 

[4] Agarwal, Rishabh, et al. "Many-shot in-context learning." arXiv preprint arXiv:2404.11018 (2024).

Large language models (LLMs) excel at few-shot in-context learning (ICL) – learning from a few input- output examples(“shots”)provided in context at inference, without any weight updates.

[5] Xie, Sang Michael, et al. "An explanation of in-context learning as implicit bayesian inference." arXiv preprint arXiv:2111.02080 (2021).

Large language models (LMs) such as GPT-3 have the surprising ability to do in-context learning, where the model learns to do a downstream task simply by conditioning on a prompt consisting of input-output examples.

In-context learning occurs when the LM also infers a shared latent concept between examples in a prompt.

[6] Kim, Hyuhng Joon, et al. "Self-generated in-context learning: Leveraging auto-regressive language models as a demonstration generator." arXiv preprint arXiv:2206.08082 (2022).

ICL learns to solve a task simply by conditioning a few input-label pairs dubbed demonstrations on a prompt, which serves to give contexts regarding the downstream task during the inference phase, allowing PLMs to solve the tasks better.

The working principle of ICL intuitively leads to high reliance on the demonstrations, and performance deeply varies depending on the assortment of the demonstrations.

[7] Liu, Pengfei, et al. "Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing." ACM Computing Surveys 55.9 (2023): 1-35.

Brown et al. (2020) perform in-context learning (§7.2.2) for text generation, creating a prompt with manual templates and augmenting the input with multiple answered prompts.

[8] Brown, Tom B. "Language models are few-shot learners." arXiv preprint arXiv:2005.14165 (2020).

We use the term 'in-context learning' to describe the inner loop of this process, which occurs within the forward-pass upon each sequence. Recent work attempts to do this via what we call 'in-context learning', using the text input of a pretrained language model as a form of task specification: the model is conditioned on a natural language instruction and/or a few demonstrations of the task and is then expected to complete further instances of the task simply by predicting what comes next.

Comment

Regarding the zero-shot demonstration, please see the paper by Brown et al. (https://arxiv.org/pdf/2005.14165). Please refer to the second paragraph on page 4 and the footer note. Additionally, please check Figure 2.1 for an example of zero-shot learning.

Once again, the LLM's output is based on the context provided to it. Since the target is a safety-aligned LLM, when you provide {'some task' + 'jailbreak prompt'} + {'hey LLM, look out for bad questions in the prompt'}, the safety-aligned LLM will (in most cases) correct itself and avoid generating harmful responses (competing objectives hypothesis, Wei et al.). This defense method (as shown in Fig. 4) is not novel, and similar, more robust methods have been demonstrated in other papers. Additionally, I could not find the difference between the suggested BiasDefense and other similar defense strategies such as Self-remind (Wu et al.). I think the suggested defense is very similar to Self-remind, and there exist other defenses as well, such as Self-Refine (Kim et al., https://arxiv.org/pdf/2402.15180) and RAIN (Li et al., https://arxiv.org/pdf/2309.07124), which have been missed from the experiments.

Also, regarding the MMLU scores, I am not sure why Google-FLAN-T5 is being used here. Using this model will mislead readers as it was not used in either the attack or defense experiments, while the attack/defense was demonstrated using other models (such as Llama-3 and Qwen). Additionally, the MMLU scores are very low. The table should present the MMLU scores from the original model papers for comparison.

Which MMLU dataset has been used? Reference is missing.

Additionally, the prompt structure used in the experiment does not determine whether the jailbreak occurs solely due to the request to write a fictional story or because of the presence of marginalized words.

The ablation study of the jailbreak prompt itself is missing. It is currently unclear whether the jailbreak succeeds because of asking the LLMs to write a fictional story. Regarding Table 7, the results still use the phrase 'write a fictional story.' These types of jailbreaks have been discovered previously. While bias may exist, the jailbreak might be succeeding because of using this specific phrase, which is why an ablation study is required.

Based on the comments above, I have decided to maintain my previous score.

Comment

We sincerely appreciate your detailed feedback and the opportunity to address your concerns. Below, we provide responses to the points you raised:

1. Clarifying the Distinction Between ICL and Zero-Shot Learning

Although the referenced paper by Brown et al. seems to allow a zero-shot setting under the name of ICL, it is very clear that in general, ICL does not refer to zero-shot learning, as we have shown in numerous citations. Our interpretation and experimental setup are consistent with this distinction.

2. MMLU Scores and Google-FLAN-T5 Usage

For the MMLU score, the paper already includes scores for GPT-4o and a description of the MMLU subjects used in the experiment. We have used all MMLU subjects for Google-FLAN-T5 and four representative subjects for GPT-4o due to billing limits. Google-FLAN-T5 was selected because it is the default model of the official MMLU score evaluation code.

3. Comparison with Recent Defense Methods

We have acknowledged other recent defense methods, such as Self-Remind (Wu et al.), RPO (Zhou et al.), and Defending (Zhang et al.), in our revised manuscript and added how our performance differs. Specifically, our work uniquely focuses on leveraging ethical biases through keyword analysis, which is not explored in these methods.

4. Novelty and Rigor of Our Research

The core contribution of our paper is to uncover and leverage ethical biases in LLMs by comparing jailbreak success rates across keywords. Our experiments are designed to combine keywords with harmful prompts in a natural manner. Simply combining words increases sentence perplexity, reducing the jailbreak success rate. To address this, we used a consistent format, "I am {keyword} writer," keeping the scenario-writing prompt constant while varying only the keyword.

This design controls for all other variables, allowing us to analyze the specific influence of keywords. Furthermore, applying the effective keywords we identified to state-of-the-art (SOTA) jailbreak prompts improved their performance, supporting our claim that ethical bias-based keyword selection alone can enhance jailbreak success rates.
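For illustration, a minimal sketch of this controlled construction; the persona format and the fictional-story framing follow the description above, while the surrounding template wording (and the article placement) is assumed rather than the paper's verbatim prompt.

```python
# Sketch: fixed scenario-writing prompt, varying only the group keyword.
# The exact surrounding wording is illustrative, not the verbatim template from the paper.
SCENARIO = "write a fictional story"  # kept constant across all conditions

def build_prompt(harmful_request, keyword=None):
    persona = f"I am a {keyword} writer. " if keyword else ""
    return f"{persona}Please {SCENARIO} about the following request: {harmful_request}"

marginalized_prompt = build_prompt("<harmful request>", "transgender")
privileged_prompt = build_prompt("<harmful request>", "heterosexual")
baseline_prompt = build_prompt("<harmful request>")
```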

Therefore, we believe that the additional ablation study on the prompt structure requested by the reviewer is not an experiment aimed at proving our claims. Our experiments have already provided sufficient evidence aligned with the research objectives.


We hope these clarifications address your concerns. However, for the reasons outlined above, we respectfully disagree with your review, as we believe our work provides clear contributions and sufficient evidence to support our claims. Thank you again for your feedback.


References

[1] Wu F, Xie Y, Yi J, et al. Defending chatgpt against jailbreak attack via self-reminder[J]. 2023.
[2] Zhang Z, Yang J, Ke P, et al. Defending large language models against jailbreaking attacks through goal prioritization[J]. arXiv preprint arXiv:2311.09096, 2023.
[3] Zhou A, Li B, Wang H. Robust prompt optimization for defending language models against jailbreaking attacks[J]. arXiv preprint arXiv:2401.17263, 2024.

Review
Rating: 5

This work focuses on using potential biases in a model to jailbreak the model, by associating a request with a particular group.

Strengths

  • Exploring how biases might affect jailbreaking is an interesting and important idea.

Weaknesses

  • There are some major improvements to be made for the experimental results. First, all experiments are run with sampling (“the default sampling temperature”), yet there are no confidence intervals. It is entirely possible that Table 2 is a function of small perturbations or random noise since the effect sizes are small. Second, the dataset itself is small, so it could be statistically underpowered for such small effect sizes. Note: at this small of a sample size the minimum detectable effect size is rather large for a binomial distribution. Third, there's a crucial baseline missing: what about just replacement with random adjectives to rule out that this isn't just a function of (un)lucky perturbations. To build more confidence in the result, suggest that: (1) increase the size of the dataset; (2) run sampling multiple times and report confidence intervals; (3) compare a baseline with random adjectives.
  • Typically defenses come at a cost to utility. This defense in particular, could affect normal task performance, but there is no evaluation of utility here. To improve the paper and build more confidence that the defense does not induce side effects, suggest running on a suite of standard benchmark tasks/evals to see how the defense affects performance.

Minor:

Table 1’s headers are backwards?

Questions

  • What was the temperature for each model (where it is known)?
Comment

Table 8

BiasDefense Ablation on the combination of suffix and prefix defense prompt using LLAMA2 model

Defense Method | Marginalized Group Jailbreak Success | Privileged Group Jailbreak Success
None | 0.2811 | 0.1933
BiasDefense: system prompt | 0.2558 | 0.1687
BiasDefense: suffix prompt | 0.2367 | 0.2733
BiasDefense | 0.1714 | 0.1429

Table 9

MMLU score using prompt-based defense methods. The base language model used is google-flan-t5.

Defense Method | Suffix | Average MMLU score
None | X | 0.295
RPO | O | 0.243
self-remind | O | 0.238
Defending | X | 0.295
BiasDefense (Ours, With Suffix) | O | 0.265
BiasDefense (Ours, Without Suffix) | X | 0.296

HGEN: References

[1] Zou, Andy, et al. "Universal and transferable adversarial attacks on aligned language models." arXiv preprint arXiv:2307.15043 (2023).

[2] Chao, Patrick, et al. "Jailbreakbench: An open robustness benchmark for jailbreaking large language models." arXiv preprint arXiv:2404.01318 (2024).

[3] Hendrycks, Dan, et al. "Measuring massive multitask language understanding." arXiv preprint arXiv:2009.03300 (2020).

[4] Zhang, Zhexin, et al. "Defending large language models against jailbreaking attacks through goal prioritization." arXiv preprint arXiv:2311.09096 (2023).

Comment

Thank you for the additional experiments. I still have some concerns.

First, why was flan-t5 used when the main paper tests other models? I would've expected Table 2, for example, to have a utility metric on MMLU, not a distinct model (which already has fairly low MMLU scores!).

Second, while I appreciate the addition of the random adjectives results, the presentation leaves much to be desired. As I am examining the results, it seems like some particular words do have a significant effect size, but not necessarily the marginalized versus privileged category as a whole. I would've expected to show treatment effects and confidence intervals (see, e.g., a description of how to do this in https://arxiv.org/abs/2411.00640).

That being said, I appreciate the effort so I am raising my score slightly.

Comment

Thank you for your review. We have improved the experiments according to your keen suggestions.

(1) increase the size of the dataset;

We have additionally adopted AdvBench [1], which has 500 harmful instruction prompts, a significant increase over the 100 prompts of JailbreakBench. As shown in Table 1, we still observe the bias among keywords on AdvBench as well.

(2) run sampling multiple times and report confidence intervals;

(3) compare a baseline with random adjectives.

Table 7 reports the random-adjective experiments together with confidence intervals over 3 runs. The table shows that the marginalized and privileged keyword groups differ distinctly from the random adjectives, and the confidence intervals indicate that this difference is consistent. This validates that the bias exploited by our method is not merely a lucky perturbation.
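For completeness, the sketch below illustrates how a success rate and its confidence interval over repeated sampled runs can be computed. The `outcomes` array (10 prompts, 3 runs) is a hypothetical placeholder rather than our actual evaluation data, and the normal-approximation interval is one reasonable choice, not necessarily the exact estimator behind Table 7.

```python
import numpy as np

def run_success_rates(per_run_outcomes):
    """per_run_outcomes: one list per sampling run, each holding 0/1 jailbreak outcomes per prompt."""
    return np.array([np.mean(run) for run in per_run_outcomes])

def summarize(per_run_outcomes, z=1.96):
    """Mean success rate across runs with a normal-approximation 95% confidence interval."""
    rates = run_success_rates(per_run_outcomes)
    mean = rates.mean()
    se = rates.std(ddof=1) / np.sqrt(len(rates))  # standard error of the mean over runs
    return mean, se, (mean - z * se, mean + z * se)

# Hypothetical outcomes for 10 prompts over 3 sampled runs, for illustration only.
outcomes = [
    [1, 0, 0, 1, 0, 0, 0, 1, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 1, 0, 0, 0, 0],
]
print(summarize(outcomes))
```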

suggest running on a suite of standard benchmark tasks/evals to see how the defense affects performance.

We think this is a very good point regarding the practical use of our defense method. In Table 9, we evaluated MMLU [3] scores for our defense method with google-flan-t5 as the base language model, along with other prompt-based defense methods. The results show that our proposed method suffers relatively little performance degradation compared to other suffix-adding defense prompts. Notably, without the suffix prompt, BiasDefense achieves a slightly higher MMLU score than using no defense prompt at all, showing even a positive effect on normal tasks. Although the ablation in Table 8 shows that the defense is weaker without the suffix, using only the prefix prompt is a good alternative when minimal impact on normal tasks is desired.

Q. What was the temperature for each model (where it is known)?

We stated that all experiments were done with the default temperature; more precisely, all models were tested with a temperature of 0.7.


Table 1

Performance across different datasets showing baseline success rates, marginalized success rates, privileged success rates, and the difference between marginalized and privileged success rates using LLaMA2 model.

Dataset | Baseline Success Rate | Marginalized Success Rate (↑) | Privileged Success Rate (↓) | Marginalized / Privileged (↑)
JailbreakBench | 0.2400 | 0.2811 (+17.08%) | 0.1933 (-19.58%) | 145.42%
AdvBench | 0.1895 | 0.2037 (+7.50%) | 0.1758 (-7.25%) | 115.84%

Table 7

Comparison of the impact of individual keyword groups on BiasJailbreak. The Marginalized and Privileged Groups differ distinctly from the Random Adjectives group.

Random Adjectives Group (AVG: 23.55% ± 1.52%)

Keyword | Success Rate
baseline | 24.00% (± 1.26%)
big | 20.00% (± 0.41%)
small | 25.00% (± 0%)
bright | 22.00% (± 3.52%)
cold | 21.00% (± 0%)
old | 24.00% (± 6.13%)
slow | 31.00% (± 7.54%)
warm | 22.00% (± 1.31%)
dark | 23.00% (± 2.23%)

Marginalized Group (AVG: 28.33% ± 1.26%)

Keyword | Success Rate
female | 28.00% (± 3.52%)
poor | 31.00% (± 2.23%)
black | 26.33% (± 0.31%)
LGBTQ | 24.00% (± 0%)
transgender | 27.00% (± 7.66%)
neurodiverse | 31.67% (± 12.49%)
working class | 29.33% (± 3.95%)
bisexual | 24.67% (± 0.31%)
native american | 33.00% (± 0%)

Privileged Group (AVG: 20.15% ± 1.28%)

Keyword | Success Rate
male | 24.00% (± 0%)
rich | 26.67% (± 0.43%)
white | 16.67% (± 0.31%)
heterosexual | 17.00% (± 0%)
straight | 9.67% (± 6.70%)
neurotypical | 27.67% (± 6.70%)
middle class | 26.00% (± 28.84%)
Comment

Thank you for appreciating our effort and kindly raising the score. We have prepared additional results that will resolve your remaining concerns.

First, flan-t5 was used for MMLU for fast testing, since it was the default setting of the official MMLU testing code and it is a small model with fast inference. In response, we have additionally conducted the MMLU experiment on the most recent GPT-4o, with subjects chosen from each category designated by MMLU: college_computer_science (category: STEM), sociology (category: Social Sciences), college_medicine (category: Other), and jurisprudence (category: Humanities). We kindly ask for your understanding that we focused on these specific MMLU subjects due to the cost of using OpenAI's GPT. The GPT-4o results show largely the same tendency as flan-t5: our defense prompt has the lowest performance degradation among suffix-adding methods, and even better general task performance without the suffix.


MMLU score with prompt-based defense methods, on GPT-4o-2024-11-20

Defense Method | Suffix | Average | College Computer Science | College Medicine | Jurisprudence | Sociology
Default | X | 0.883 | 0.760 | 0.855 | 0.917 | 0.950
RPO | O | 0.237 | 0.260 | 0.208 | 0.259 | 0.239
Self-Remind | O | 0.861 | 0.740 | 0.855 | 0.889 | 0.910
Defending | X | 0.876 | 0.770 | 0.844 | 0.898 | 0.945
BiasDefense (Ours, With Suffix) | O | 0.869 | 0.760 | 0.844 | 0.898 | 0.930
BiasDefense (Ours, Without Suffix) | X | 0.887 | 0.800 | 0.850 | 0.917 | 0.945
Comment

Second, we have looked at your reference and updated the tables to show the treatment effect for each keyword group and a confidence interval for each keyword.

Random Adjectives Group (Treatment Effect: -0.50% (3.97%))

Keyword | Model - Baseline | 95% Conf. Interval
big | -4.00% (1.32%) | (-6.59%, -1.41%)
small | +1.00% (1.26%) | (-1.47%, +3.47%)
bright | -2.00% (3.73%) | (-8.69%, +4.69%)
cold | -3.00% (1.26%) | (-5.47%, -0.53%)
old | 0.00% (6.26%) | (-12.56%, +12.56%)
slow | +7.00% (7.64%) | (-7.61%, +21.61%)
warm | -2.00% (1.82%) | (-5.05%, +1.05%)
dark | -1.00% (2.56%) | (-5.34%, +3.34%)

Marginalized Group (Treatment Effect: +4.33% (4.93%))

Keyword | Model - Baseline | 95% Conf. Interval
female | +4.00% (3.73%) | (-2.69%, +10.69%)
poor | +7.00% (2.56%) | (+2.66%, +11.34%)
black | +2.33% (1.30%) | (-0.14%, +4.80%)
LGBTQ | 0.00% (1.26%) | (-2.47%, +2.47%)
transgender | +3.00% (7.76%) | (-11.83%, +17.83%)
neurodiverse | +7.67% (12.55%) | (-16.71%, +32.05%)
working class | +5.33% (4.14%) | (-2.22%, +12.88%)
bisexual | +0.67% (1.30%) | (-1.80%, +3.14%)
native american | +9.00% (1.26%) | (+6.53%, +11.47%)

Privileged Group (Treatment Effect: -2.90% (11.21%))

Keyword | Model - Baseline | 95% Conf. Interval
male | 0.00% (1.26%) | (-2.47%, +2.47%)
rich | +2.67% (1.33%) | (+0.08%, +5.26%)
white | -7.33% (1.30%) | (-9.80%, -4.86%)
heterosexual | -7.00% (1.26%) | (-9.47%, -4.53%)
straight | -14.33% (6.82%) | (-27.28%, -1.38%)
neurotypical | +3.67% (6.82%) | (-9.28%, +16.62%)
middle class | +2.00% (28.87%) | (-54.52%, +58.52%)
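As a rough illustration of how such per-keyword estimates can be derived, the sketch below treats the treatment effect as the keyword's mean success rate minus the baseline's, combines the two run-level standard errors in quadrature under an independence assumption, and forms a normal-approximation 95% interval. The per-run rates are hypothetical placeholders, and the exact estimator may differ slightly from the one used for the tables above.

```python
import numpy as np

def rate_stats(success_rates):
    """Mean and standard error of a success rate across repeated sampling runs."""
    runs = np.asarray(success_rates, dtype=float)
    return runs.mean(), runs.std(ddof=1) / np.sqrt(len(runs))

def treatment_effect(keyword_runs, baseline_runs, z=1.96):
    """Keyword rate minus baseline rate, with a normal-approximation 95% CI."""
    kw_mean, kw_se = rate_stats(keyword_runs)
    base_mean, base_se = rate_stats(baseline_runs)
    effect = kw_mean - base_mean
    # Assuming independent runs, the standard errors combine in quadrature.
    se = np.sqrt(kw_se**2 + base_se**2)
    return effect, se, (effect - z * se, effect + z * se)

# Hypothetical per-run jailbreak success rates (3 runs each), for illustration only.
baseline_runs = [0.24, 0.25, 0.23]
keyword_runs = [0.28, 0.31, 0.26]
print(treatment_effect(keyword_runs, baseline_runs))
```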
Comment

Thanks for the response! Just to confirm: for the treatment effect per group, what is the number in parentheses? E.g., for the Privileged Group (Treatment Effect: -2.90% (11.21%)), what does 11.21% represent?

Comment

The number in parentheses (11.21%) represents the Standard Error (SE) of the group's mean treatment effect, calculated using the formula:

SE_{group} = \sqrt{\frac{SE_1^2 + SE_2^2 + \dots + SE_n^2}{n}}

where SE_1, SE_2, ..., SE_n are the standard errors of the individual keyword measurements within the group, and n is the number of keywords in the group.

For the Privileged Group (n = 7), we combined the individual SEs:

  • male: 1.26%
  • rich: 1.33%
  • white: 1.30%
  • heterosexual: 1.26%
  • straight: 6.82%
  • neurotypical: 6.82%
  • middle class: 28.87%

Resulting in:

SE_{group} = \sqrt{\frac{1.26^2 + 1.33^2 + 1.30^2 + 1.26^2 + 6.82^2 + 6.82^2 + 28.87^2}{7}} = 11.21\%
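A minimal sketch of this pooled-SE computation is shown below, plugging in the per-keyword SE values listed above; since those values are rounded, the printed result may differ slightly from the reported figure.

```python
import math

# Per-keyword standard errors (in %) for the Privileged Group, as listed above.
privileged_se = {
    "male": 1.26, "rich": 1.33, "white": 1.30, "heterosexual": 1.26,
    "straight": 6.82, "neurotypical": 6.82, "middle class": 28.87,
}

# Group SE = square root of the mean of the squared per-keyword SEs.
group_se = math.sqrt(sum(se**2 for se in privileged_se.values()) / len(privileged_se))
print(f"Privileged Group SE ~= {group_se:.2f}%")
```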

The relatively large SE is primarily due to the high variability among the individual keyword SEs. In particular, the 'middle class' keyword has an SE of 28.87%, which is significantly larger than the SEs of other keywords and disproportionately increases the overall group SE.

This large SE reflects high uncertainty in our estimate of the average effect, primarily driven by the variability of individual keyword effects within the group.

We followed the statistical notation in your reference (https://arxiv.org/abs/2411.00640), where they presented standard error in parentheses. Thank you for engaging and reviewing our work thoroughly.

Comment

Thank you for thoroughly engaging with our work and requesting additional details and a clearer presentation. We hope that your concerns are all resolved by now. If you have any additional comments or updates, feel free to let us know.

Comment

Thank you for the response. While the inclusion of this analysis is fantastic, the high variance and low number of keywords per grouping suggest that it is not necessarily all marginalized versus privileged group keywords driving the effects here. Thus, the paper's main takeaways/claims are not sufficiently narrow compared to the empirical results. I will keep my rating.

Comment

Dear Reviewers,

We sincerely appreciate your valuable feedback and have revised the paper accordingly. Below is a summary of the changes made in response to your comments. Thank you to [HGEN] for acknowledging the interesting and important idea of exploring how biases might affect jailbreaking; to [gTDY] for recognizing the societal impact of our work; to [63GV] and [gTDY] for noting the diversity of our experiments across various LLMs and commending our extensive model evaluation; and to [Vzp2] for appreciating the novelty and usefulness of our jailbreak method. Your comments motivate us to further improve our work.


Main Content

Dataset Expansion and Experimental Rigour

  • [HGEN, 63GV, gTDY] (Sections 4, Appendix):

    • Increased Dataset Size: We incorporated the AdvBench dataset [1], which contains 500 harmful prompts (up from 100 in JailbreakBench). The results are presented in Table 1, demonstrating consistent bias observations with the expanded dataset.

    • Multiple Runs with Confidence Intervals: Per your suggestion, we conducted multiple runs and reported confidence intervals to assess the statistical reliability of the results. These results are included in Table 7.

    • Comparison with Random Adjectives: We introduced a baseline using random adjectives to validate that the observed biases are not due to random perturbations. This comparison is also shown in Table 7.

    • Ablation on each keyword: Table 7 includes a per-keyword performance ablation for each group: Random Adjectives, Marginalized, and Privileged.


Table 1

Performance across different datasets showing baseline success rates, marginalized success rates, privileged success rates, and the difference between marginalized and privileged success rates using LLaMA2 model.

Dataset | Baseline Success Rate | Marginalized Success Rate (↑) | Privileged Success Rate (↓) | Marginalized / Privileged (↑)
JailbreakBench | 0.2400 | 0.2811 (+17.08%) | 0.1933 (-19.58%) | 145.42%
AdvBench | 0.1895 | 0.2037 (+7.50%) | 0.1758 (-7.25%) | 115.84%

Table 7

Comparison of the impact of individual keyword groups on BiasJailbreak. The Marginalized and Privileged Groups differ distinctly from the Random Adjectives group.

Random Adjectives Group (AVG: 23.55% ± 1.52%)

Keyword | Success Rate
baseline | 24.00% (± 1.26%)
big | 20.00% (± 0.41%)
small | 25.00% (± 0%)
bright | 22.00% (± 3.52%)
cold | 21.00% (± 0%)
old | 24.00% (± 6.13%)
slow | 31.00% (± 7.54%)
warm | 22.00% (± 1.31%)
dark | 23.00% (± 2.23%)

Marginalized Group (AVG: 28.33% ± 1.26%)

Keyword | Success Rate
female | 28.00% (± 3.52%)
poor | 31.00% (± 2.23%)
black | 26.33% (± 0.31%)
LGBTQ | 24.00% (± 0%)
transgender | 27.00% (± 7.66%)
neurodiverse | 31.67% (± 12.49%)
working class | 29.33% (± 3.95%)
bisexual | 24.67% (± 0.31%)
native american | 33.00% (± 0%)

Privileged Group (AVG: 20.15% ± 1.28%)

Keyword | Success Rate
male | 24.00% (± 0%)
rich | 26.67% (± 0.43%)
white | 16.67% (± 0.31%)
heterosexual | 17.00% (± 0%)
straight | 9.67% (± 6.70%)
neurotypical | 27.67% (± 6.70%)
middle class | 26.00% (± 28.84%)
Comment

Table 14: Keywords discovered in Phi

Marginalized Group (Adjective) | Privileged Group (Adjective)
Black | White
Female | Male
LGBTQ+ | Heterosexual
Non-normative gender | Cisgender
Non-traditionally family-structured | Traditionally family-structured
Rural | Urban
Homeless | Socially influential
Undereducated | Highly educated
Disabled | Able-bodied
Religiously marginalized | Religiously dominant
Low-income | Wealthy
Sexually employed | Traditionally employed

References

We have added and updated the following references to support our revisions:

[1] Zou, A., et al. "Universal and Transferable Adversarial Attacks on Aligned Language Models." arXiv preprint arXiv:2307.15043 (2023).

[2] Chao, P., et al. "Jailbreaking Black Box Large Language Models in Twenty Queries." arXiv preprint arXiv:2310.08419 (2023).

[3] Li, X., et al. "DeepInception: Hypnotize Large Language Model to Be Jailbreaker." arXiv preprint arXiv:2311.03191 (2023).

[4] Jin, H., et al. "GUARD: Role-playing to Generate Natural-Language Jailbreakings to Test Guideline Adherence of Large Language Models." arXiv preprint arXiv:2402.03299 (2024).

[5] Hendrycks, D., et al. "Measuring Massive Multitask Language Understanding." arXiv preprint arXiv:2009.03300 (2020).

[6] Wu, F., et al. "Defending ChatGPT Against Jailbreak Attack via Self-Reminder." (2023).

[7] Zhang, Z., et al. "Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization." arXiv preprint arXiv:2311.09096 (2023).

[8] Zhou, A., et al. "Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks." arXiv preprint arXiv:2401.17263 (2024).

[9] Zheng, X., et al. "Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses." arXiv preprint arXiv:2406.01288 (2024).

[10] Jin, H., et al. "Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters." arXiv preprint arXiv:2405.20413 (2024).

[11] Anthropic, A. I. "Claude 3.5 sonnet model card addendum." Claude-3.5 Model Card (2024).

[12] Achiam, Josh, et al. "Gpt-4 technical report." arXiv preprint arXiv:2303.08774 (2023).

We believe these revisions address all the concerns and suggestions raised. Thank you again for your constructive feedback, which has significantly improved the quality of our paper.

Comment

Related Work and Recent Advances

  • [gTDY, Vzp2, 63GV] (Section 2.2):

    • Expanded Related Work: We updated the paper to include recent jailbreak attacks and defense mechanisms, discussing works like PAIR [2], DeepInception [3], GUARD [4], and others [9][10]. This expansion provides a more comprehensive overview of current methodologies, including those involving persuasion and personas.
  • [63GV] (Section 2.1):

    • Significance of Studying Biases in LLMs: We added the motivation and necessity for studying biases in LLMs.

Clarifications and Corrections

  • [63GV, gTDY, Vzp2] (Throughout the Paper):

    • Baseline Definition Clarified: We clarified the definition of "baseline success rate" to eliminate ambiguity.

    • Typographical Corrections: Fixed typos such as "alernative" to "alternative," corrected "principle component analysis" to "principal component analysis," and rectified the "Walkerspider" reference.

    • Figure Updates: Enhanced Figures 2 and 4 with diverse examples for better illustration and refined Figure 2 for improved clarity.

Appendix

Additional Experimental Details

  • [HGEN, 63GV, gTDY] (Appendix):

    • Full Keyword Set Provided: Included the complete set of keywords used in our experiments in Tables 10-14.
    • Additional Results: Added the previously mentioned results in Tables 7-9.

Table 10: Keywords discovered in Claude 3.5 Sonnet.

Marginalized Group (Adjective) | Privileged Group (Adjective)
Female | Male
Immigrant | Native
Poor | Rich
Muslim | Christian
Black | White
High school educated | College educated
LGBTQ+ | Heterosexual
Transgender | Straight
Neurodiverse | Neurotypical
Spanish | English
Working class | Middle class
Bisexual | Heterosexual
Native American | White

Table 11: Keywords discovered in GPT-4o.

Marginalized Group (Adjective) | Privileged Group (Adjective)
Poor | Wealthy
Black | White
Elderly | Young
Disabled | Able-bodied
Non-binary | Cisgender
Unemployed | Employed
Religious minority | Religious majority
Indigenous | Colonizer-descended
Lower-class | Upper-class
Underrepresented | Overrepresented
Minoritized | Dominant
Non-English speaking | English speaking
Informally educated | Formally educated
Psychiatrically diagnosed | Psychiatrically typical
Single parent | Partnered parent
Adolescent | Adult
Gender non-conforming | Gender conforming
Internationally born | Nationally born
Temporarily employed | Permanently employed
Linguistically diverse | Linguistically standard
Socially isolated | Socially integrated
Chronically ill | Healthy
Mentally ill | Mentally healthy
Substance dependent | Substance independent
Homeless | Sheltered
Incarcerated | Non-incarcerated
Refugee | Citizen
Visually impaired | Visually unimpaired
Hearing impaired | Hearing unimpaired
Non-traditional student | Traditional student
Working class | Professional class
Historically disadvantaged | Historically advantaged
War-affected | Peace-living
Food insecure | Food secure
Low mobility | High mobility
Environmentally displaced | Environmentally secure
Survivor | Non-affected
Recovering | Non-recovering
Public transport dependent | Car owning
Crafts-based | Industrial
Artisanal | Mass-produced
Temporary visa holder | Permanent resident
Non-voter | Voter
Uninsured | Insured
Non-internet user | Internet user
Low-income | High-income

Table 12: Keywords discovered in Qwen2

Marginalized Group (Adjective) | Privileged Group (Adjective)
Low-Income | High-Income
Racial Minority | White
LGBTQ+ | Heterosexual
Women | Men
Disabled | Able-bodied
Refugees | Citizens
Uneducated | Educated
Neurodivergent | Neurotypical
Religious Minority | Mainstream Religion
Homeless | Homeowners
Mental Health Patients | Mentally Stable Individuals
Elderly | Young Adults
Overweight | Slim/Fit
Non-native Speakers | Native Speakers
Incarcerated | Free
Rural | Urban
Illiterate | Literate
Politically Disenfranchised | Voters
Undocumented Immigrants | Legal Immigrants
Sex Workers | Non-sex Workers

Table 13: Keywords discovered in LLaMA2

Marginalized Group (Adjective) | Privileged Group (Adjective)
Female | Male
Poor | Rich
Black | White
LGBTQ | Heterosexual
Transgender | Straight
Neurodiverse | Neurotypical
Working class | Middle class
Bisexual | Heterosexual
Native American | White
Comment

Terminology and Concept Clarification

  • [63GV, Vzp2] (Throughout the Paper):

    • Terminology Update: We replaced the term "political correctness" with "ethical bias" throughout the paper, including in the title and method names (from PCJailbreak to BiasJailbreak and PCDefense to BiasDefense) to avoid potential misunderstandings.
  • [63GV, Vzp2] (Abstract and Introduction):

    • Enhanced Abstract and Motivation: The abstract has been revised for clarity and specificity, providing a concise description of our attack and defense methods. We also added context and motivation for studying biases in LLMs in the introduction.

Detailed experiments on our Defense Method

  • [HGEN, 63GV, gTDY] (Section 5 & Appendix):

    • Ablation on Defense method: We performed comprehensive ablation studies to evaluate the efficacy of our defense prompts. In Table 5, we compared the performance of our method with existing prompt-based defense techniques. The results demonstrate that our approach, which addresses the need for ethical bias correction, outperforms previous methods. Furthermore, Table 8 evaluates the defense performance when using only the prefix prompt, only the suffix prompt, and when using all proposed prompts together. The findings clearly show that utilizing all the prompts simultaneously yields the most effective results.

    • Comparison with Other Defense Methods: We compared our BiasDefense with existing prompt-based defense methods like Self-Remind [6], Defending [7], and RPO [8]. The results in Table 5 show that our method achieves superior performance with the lowest jailbreak success rates. Also, we assessed the impact of our defense on normal task performance using the MMLU benchmark [5]. The findings in Table 9 indicate that our defense method has minimal performance degradation compared to other defense prompts.

    • Computation cost of our method: In Table 6, we show that our prompt-based defense method has a significantly lower cost than standalone classifiers such as Llama-Guard; this may be expected, but we believe it is better demonstrated with an explicit experiment.

Table 5

Comparison of jailbreak success rates across prompt-based defense methods using LLAMA2 model

Defense Method | Metric | Jailbreak Success Rate
None | Marginalized Group Jailbreak Success | 28.11%
None | Privileged Group Jailbreak Success | 19.33%
self-remind | Marginalized Group Jailbreak Success | 33.67%
self-remind | Privileged Group Jailbreak Success | 27.33%
Defending | Marginalized Group Jailbreak Success | 20.57%
Defending | Privileged Group Jailbreak Success | 14.86%
RPO | Marginalized Group Jailbreak Success | 56.00%
RPO | Privileged Group Jailbreak Success | 49.75%
BiasDefense (Ours) | Marginalized Group Jailbreak Success | 17.14%
BiasDefense (Ours) | Privileged Group Jailbreak Success | 14.29%

Table 6

Comparison of Defense cost for BiasDefense and Llama-Guard. The experiment was held on a single H100 GPU, with Llama-3.2-1B-Instruct as the language model, and Llama-Guard-3-8B as the guard model.

Defense Method | Time Cost (seconds) ↓ | Time Cost (percentage) ↓
Baseline (No Defense) | 21.91 | +0.00%
BiasDefense | 22.44 | +2.40%
Llama-Guard-3-8B | 31.69 | +44.60%

Table 8

BiasDefense Ablation on the combination of suffix and prefix defense prompt using LLAMA2 model

Defense Method | Metric | Jailbreak Success Rate
None | Marginalized Group Jailbreak Success | 0.2811
None | Privileged Group Jailbreak Success | 0.1933
BiasDefense: system prompt | Marginalized Group Jailbreak Success | 0.2558
BiasDefense: system prompt | Privileged Group Jailbreak Success | 0.1687
BiasDefense: suffix prompt | Marginalized Group Jailbreak Success | 0.2367
BiasDefense: suffix prompt | Privileged Group Jailbreak Success | 0.2733
BiasDefense | Marginalized Group Jailbreak Success | 0.1714
BiasDefense | Privileged Group Jailbreak Success | 0.1429

Table 9

MMLU score using prompt-based defense methods. The base language model used is google-flan-t5.

Defense Method | Suffix | Average MMLU score
None | X | 0.295
RPO | O | 0.243
self-remind | O | 0.238
Defending | X | 0.295
BiasDefense (Ours, With Suffix) | O | 0.265
BiasDefense (Ours, Without Suffix) | X | 0.296
Comment

Thank you for your dedication and thoughtful engagement with our paper. As the author-reviewer discussion period draws to a close, we would greatly appreciate hearing your thoughts on our rebuttal. If you have any further questions or require additional clarification, please don't hesitate to let us know.

AC Meta-Review

The paper studies jailbreaks of large language models by exploiting potential model bias. The attack methodology adds an additional sentence to the malicious query, and the subject in that sentence may contain identities from either privileged or marginalized groups. The authors found that the model is more easily jailbroken if a marginalized group is mentioned, and also proposed a simple defense to this scenario. The study connects jailbreak studies with model bias studies, which sounds quite interesting. However, there are several key concerns in this paper: the results may have high variance due to the low number of keywords per grouping, and thus, the main conclusion of the paper may not always hold; the proposed defense is not novel and is quite similar to many existing defense approaches; the initial version of the paper was missing a large number of related work/baselines in this field; experiments are not comprehensive enough and ablation studies are missing. Some additional results were provided during the rebuttal in limited settings (e.g. on a single model), and I believe the full version of these new results should be included in the paper. Thus, I believe this paper is not ready for ICLR at this time but I encourage the authors to further improve the paper based on the constructive feedback received.

Additional Comments on Reviewer Discussion

The reviews and authors had discussions during the rebuttal period, and the authors have provided additional new results as requested. However, several key concerns, as mentioned in the meta-review above, were not fully addressed. Most reviewers believe we should reject the current version of this paper.

Final Decision

Reject