
No rating data available

ICLR 2025

Diversity Helps Jailbreak Large Language Models

OpenReview · PDF

Submitted: 2024-09-22 · Updated: 2024-10-10
TL;DR

We present a novel jailbreaking strategy that employs an attacker LLM to generate diversified and obfuscated adversarial prompts, demonstrating significant improvement over past approaches.

Abstract

Keywords
Attack, Large Language Model, Safety

Reviews and Discussion

Withdrawal Notice

I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.