Emoji Attack: Enhancing Jailbreak Attacks Against Judge LLM Detection
We introduce Emoji Attack, an adversarial strategy that exploits token segmentation bias in Judge LLMs by inserting emojis to manipulate tokenization, enhancing the effectiveness of jailbreak attacks against Judge LLM detection.
摘要
评审与讨论
This paper presents a jailbreak attack against judge LLM detection.
After rebuttal: I read the author's rebuttal and most of my concerns are addressed. I am actively participating in reviewer-AC discussion to champion this paper.
给作者的问题
NA
论据与证据
Yes
方法与评估标准
Yes, but could be improved. See weakness.
理论论述
NA
实验设计与分析
Yes, the experimental design are sound and make sense to me. However it can be improved because it misses an important baseline. See weakness part.
补充材料
Yes
与现有文献的关系
Recently LLM service provider, e.g., Meta, and IBM propose their guardrail model to detect harmful question. This paper proposes a jailbreak attack to show that the safety risk still exists despite these efforts.
遗漏的重要参考文献
There is a method concerning the very same scenario need to be discussed:
Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation
I believe this is a concurrent paper with this paper and I know authors have no obligation to dicuss concurrent work. However, I still encourage the authors to discuss this paper because these two method focus on the very same problem (i.e., jailbreaking judge LLM) and use a very similar experiment setting. Discussing them all together can raise more attention and thinking among the community.
其他优缺点
Stengths:
-
While there are quite a lot of research in jailbreaking LLM, there is only a few available study trying to jailbreak the judge LLM. I believe this paper is the very early paper along this line of research. I believe this paper is important in this sense, because nowadays the appearance of diversified Guardrail models (i.e., judge LLMs) give the community a false sense of "security". This paper can serve to increase awareness of the risk.
-
The storyline is very smooth and easy to understand. I especially like the illustration of Section 3.2, which clearly illustrate the concept of embedding distortions and its relation with token splitting.
-
Extensive experiments on both open-sourced and close-soiurce LLM judge is done.
Weakness:
-
The finding of using token segmentation to attack LLM is not a unique finding of this paper, but should be credited to Claburn (2024). Therefore, the contribution of this paper is to replace the normal spaces with emoji. In this sense, the novelty of this paper is quite limited.
-
Given that the core contribution is to utilize emoji to replace space, It is unclear why using emoji instead of other character? Yes, the space might not be a best option to perturb the embedding, but so does the emoji could be not. Instead of manually checking each class of character, I think an automatic otpimization-based algorithm would be more appreciable, and will have more contribution advancing the field. In this sense, the technical contribution of this paper is limited.
-
A minor perturbation to the words should not affect the classification in principle, and therefore as shown in the experiment, the attack sucess rate of emoji attack (as well as Token Segmentation) is not that high. The reason is that a well-trained LLM should still be able to classify the full token and its splitting part correctly. For example the next word embedding of "Euca" "tion" and "Education" should be roughly the same with growing ability of LLM. Although such a perfect LLM is not available for now, however, I think this issue is not something very fundamental that cannot be solved with growing ability of modern LLM.
-
The authors should compare with GCG. In the considered judge LLM jailbreak scenario, it can still be applied in my understanding. Specifically, the attackers can optimize a suffix to elicit the classification of the judge LLM to be "safe". After optimizing the suffix, a similar In-Context Learning method can be used to instruct the target LLM to output the optimized suffix.
Some of my weakness I mentioned (e.g., 1-3) might not be able to easily solved with rebuttal and experiment, but I would like to point out here as a reason of my rating. Very likely I will keep my rating even after rebuttal because of the concern of novelty. I hope the authors can understand. However, I am okay with its acceptance because the fluent writing and also it serves as a timely paper to provide better understanding of risk of guardrail moder.
其他意见或建议
See weakness.
We thank the reviewer for all the insightful comments.
Q1: Virus.
We will cite the Virus paper. While both our work and Virus target judge LLMs, the settings and objectives differ. Virus attacks judge LLMs during the data filtering stage to preserve harmful content, which is subsequently used to fine-tune target LLMs and induce undesirable behavior. In that setup, the attacker has direct control over the data input to the judge LLM. In contrast, our setting assumes that judge LLMs are used post hoc to evaluate the safety of responses generated by target LLMs. We do not assume access to or control over the inputs to the judge. Instead, our attack modifies the outputs of the target LLM to evade judgment.
Q2: Limited Novelty.
We disagree with the characterization that the contribution of our paper is limited to replacing spaces with emojis. While Claburn (2024) observed that inserting spaces can influence content-generation behavior in LLMs, their work did not systematically explore the implications of token segmentation for evasion attacks against judge LLMs.
Our work is the first that evaluates token segmentation attacks in this context. By using emojis, which introduce both semantic content and tokenization shifts, we show a novel and practical attack vector that is effective across both white-box and black-box settings. This shows vulnerabilities of judge LLMs.
Q3: Automatic Optimization-Based Algorithm.
Our choice to use emojis is motivated by their unique properties: unlike spaces, emojis introduce both token segmentation and semantic perturbation. As shown in Figure 6 of the Appendix, this dual effect can meaningfully influence LLM behavior. We found that emojis are particularly effective at fooling judge LLMs, especially in black-box settings where we have limited control over exact insertion positions. We think this is an important first step, which hopefully motivates automated optimization strategies in future work.
Q4: Minor Perturbations Are Not a Fundamental Issue.
To better understand the effect of minor perturbations, we compare the top-50 and top-100 next-token predictions for the phrases "Education", "Educa tion", and "Educa😊tion" using meta-llama/Llama-3.1-8B. We compute the overlap between the original and perturbed versions and observe that even a space split leads to roughly 25% divergence in the top-50 next-token predictions. This indicates that current LLMs are not robust to such minor segmentation changes.
| Phrase | top-50 | top-100 |
|---|---|---|
Education & Educa tion | 74% | 62% |
Education & Educa😊tion | 66% | 52% |
While we agree that an ideal LLM might learn to represent split and unsplit tokens similarly, this remains a non-trivial challenge in practice, particularly when perturbations introduce both segmentation and semantic shifts, as emojis do.
Moreover, even as modern LLMs improve their ability to handle word splits that produce uncommon subwords, they struggle when the split creates common and semantically meaningful subwords. For example, the word smart contains the subword art. When we split smart into sm art, the resulting representation shifts toward that of art, which has a distinct meaning. To support this, we report the next-token prediction overlap among art, smart, and sm art. Notably, sm art shows higher overlap with art than with smart, indicating that the split meaningfully alters the model's predictive behavior. These kinds of collisions between semantically loaded subwords cannot be easily mitigated through data augmentation alone.
| Phrase | top-50 | top-100 |
|---|---|---|
art & smart | 12% | 15% |
art & sm art | 40% | 38% |
Q5: GCG Suffix.
To enable a direct comparison with GCG-style attacks, we optimized a suffix using 30 randomly selected harmful responses from our pool of 1,432, targeting the white-box Llama Guard. The optimization yielded the following suffix:
multimedia ps $(\'kal {: [[CheckBoxCheckBox="<? safe.; safe ViewSimple Website Identifier XVII safe!. safe.
We then applied this suffix to harmful responses using ICL via CodeChameleon. As shown in the table below, this optimized GCG suffix effectively reduces the unsafe prediction rate when evaluated on Llama Guard.
| Attack | Llama Guard | Llama Guard 2 | ShieldLLM | WildGuard |
|---|---|---|---|---|
| CodeChameleon + 😊 | 12.2% | 31.2% | 18.5% | 32.2% |
| CodeChameleon + GCG | 8.8% | 48.0% | 90.7% | 61.8% |
However, when evaluated against black-box judge LLMs, the attack's performance degrades and falls short of the success rate achieved by our emoji attack. This suggests that, while GCG can be effective in white-box settings, it suffers from limited transferability. In contrast, our emoji attack generalizes better across different judge LLMs. We think this property is important in realistic, black-box moderation scenarios.
Thanks for the reponse. I think this paper is acceptable and I will support this paper for its acceptance during AC-reviewer discussion phase.
Thank you so much, we really appreciate your support.
This paper proposes an Emoji attack to fool the judge LLM and thus enhance the attack power of jailbreaking. Emoji attack finds the position to insert the Emoji that can achieve the maximized segmentation bias. Empirical results show that Emoji can successfully bypass the judge LLM.
给作者的问题
- Could you show the results under the potential defensive methods?
论据与证据
Yes. The empirical results seem to support the claim.
方法与评估标准
The authors should consider the adaptive defense. Some very simple defense methods are to remove the emoji in the LLM's output or to ask the judge LLM first remove the emoji and then start judging.
理论论述
No theoretical results.
实验设计与分析
The evaluation should consider the potential defense.
补充材料
I did not go through it.
与现有文献的关系
The paper could pose potential risks when using LLMs.
遗漏的重要参考文献
None
其他优缺点
None
其他意见或建议
Please refer to the evaluation metrics.
We evaluated two types of potential defenses: (1) LLM-based filtering and (2) adversarial training of Judge LLMs (as suggested by Reviewer RkAj). Below, we summarize our findings, with details provided in Figure 9 of the Appendix and additional tables.
1. LLM-Based Filtering.
We use gpt-3.5-turbo as a filter to sanitize the responses generated by target LLMs. The filter is prompted to remove abnormal characters and symbols, such as emojis or inserted delimiters. We find that this works relatively well when a single type of delimiter is used across a response. However, when we mix different delimiters (e.g., a combination of the character "b" and a smiley emoji), the filter struggles to reconstruct the original harmful intent and instead generates a benign-looking response. This sanitized output is then misclassified as safe by Judge LLMs, effectively allowing the harmful content to bypass moderation. This demonstrates that even an LLM-based defense can be tricked when the emoji attack is integrated with obfuscation techniques.
2. Adversarial Training of Judge LLMs.
We fine-tuned the Llama Guard using emoji-inserted toxic examples to improve its robustness. This does lead to higher unsafe detection rates overall, confirming that adversarial training can help mitigate the attack. However, when our emoji attack is combined with jailbreak methods such as Jailbroken or CodeChameleon, it still reduces the unsafe classification rate, even against the fine-tuned model. This indicates that the attack remains effective under adversarial training in certain settings.
Interestingly, we also observe that when paired with jailbreaks like DeepInception or ReNeLLM, the emoji attack can sometimes increase the unsafe prediction rate after adversarial training. This suggests that the interaction between emoji-based perturbations and jailbreak prompts is non-trivial and worth deeper investigation.
In summary, our emoji attack demonstrates robustness across defensive strategies by:
- Remaining effective against adversarially trained Judge LLMs when combined with specific jailbreaks.
- Bypassing LLM-based filters by using compositional delimiters that degrade filter performance.
We see this as an important contribution toward understanding and evaluating the limitations of current defense strategies.
| Attack | Llama Guard | Finetuned LLama Guard |
|---|---|---|
| Deepinception /+ Emoji | 35.1% / 15.8% | 47.4% / 52.6% |
| ReNellm /+ Emoji | 45.2% / 33.3% | 51.1% / 64.5% |
| Jailbroken /+ Emoji | 70.1% / 53.8% | 87.9% / 66.5% |
| CodeChameleon /+ Emoji | 23.4% / 12.2% | 98.1% / 86.8% |
Experimental Setting: We created a balanced fine-tuning dataset consisting of: (1) 1,432 unsafe responses as described in Section 4.3; (2) an additional 1,432 adversarially perturbed unsafe responses, each containing emojis inserted within every word; and (3) 2,864 safe responses sampled from the Huggingface dataset "LLM-LAT/benign-dataset". For efficient fine-tuning, we employed the Parameter-Efficient Fine-Tuning (PEFT) method following guidelines from the official Llama-Cookbook repository.
This paper introduces "Emoji Attack," a novel technique exploiting token segmentation bias in Judge LLMs to bypass harmful content detection. The authors demonstrate that inserting emojis into text disrupts tokenization patterns and creates embedding distortions that significantly reduce the ability of safety models to detect harmful content. Through comprehensive experiments across eight Judge LLMs (Llama Guard, ShieldLM, WildGuard, GPT-3.5, GPT-4, Gemini, Claude), the attack achieves an average 14.1% reduction in harmful content detection. The method works through in-context learning without requiring direct model access, making it a practical real-world attack that enhances existing jailbreak techniques.
给作者的问题
How does token segmentation bias interact with different model architectures, sizes, and training approaches? Is there evidence that certain architectural choices mitigate this vulnerability? Have you explored whether multilingual models exhibit different vulnerabilities to the Emoji Attack, particularly for languages with different tokenization patterns? Could your position selection algorithm (Algorithm 1) be adapted to identify optimal defensive strategies, such as robustness-enhancing fine-tuning targets?
论据与证据
The paper's claims about token segmentation bias are well-supported by experimental evidence. Figure 2 clearly demonstrates significant detection reduction when tokens are split, while Figure 3 quantifies the correlation between embedding distortions and classification outcomes. Table 1 convincingly shows how the Emoji Attack enhances multiple jailbreak methods, with ShieldLM's detection rate dropping from 71.9% to 3.5% when combined with Deepinception. The methodical testing across different emoji types (Table 2) and insertion strategies further strengthens the evidence. The cosine similarity analysis connecting embedding changes to classification outcomes is particularly compelling, establishing a causal mechanism for the attack's effectiveness. The authors also demonstrate that position-optimized emoji insertion (Table 3) consistently outperforms random placement.
方法与评估标准
The methodology effectively isolates token segmentation effects through controlled experiments comparing non-split, mid-split, and cs-split approaches. The evaluation uses an appropriate dataset of 402 offensive phrases and 1,432 harmful responses of varying lengths (2-836 words), ensuring results are robust across content types. The surrogate model method (Algorithm 1) provides a principled approach to identifying optimal token split points, and the black-box attack implementation via in-context learning demonstrates real-world applicability. The cross-model evaluation approach is comprehensive, testing both open-source and commercial models to provide comparative insights on robustness.
理论论述
The paper establishes a sound theoretical foundation linking token segmentation to classification errors. Definition 3.1 formalizes token segmentation bias, and Equation 3 provides a mathematical formulation for measuring embedding distortions using cosine similarity. The attention visualizations in Figure 5 offer mechanistic insights into how segmented sub-tokens alter attention patterns, supporting the theoretical claims. No mathematical errors or oversights were identified in the theoretical analysis. The paper correctly applies the embedding distance metrics and properly interprets the results in the context of token segmentation bias.
实验设计与分析
The experimental design effectively controls for variables to isolate the attack's effects. The authors systematically test different segmentation approaches, emoji types, placement strategies, and performance across multiple jailbreak techniques. Ablation studies in the appendix thoroughly examine how varying emoji numbers and different delimiter types affect performance. The cross-model evaluation reveals important differences in vulnerability between open-source and commercial models, with GPT-4 showing greater resilience.
补充材料
The appendix contains valuable additional analyses including attention visualizations (Figure 5), emoji impact comparisons (Figure 6), emoji quantity effects (Figure 7), and alternative delimiters (Figure 8). Section E presents an initial proposal for defense strategies, though this could be expanded further. Relation to Broader Scientific Literature The work extends character-level adversarial attacks (Claburn, 2024) to Judge LLMs while building upon jailbreaking literature. It connects to research on Judge LLM biases (Chen et al., 2024; Wang et al., 2023) and presents a more accessible black-box attack compared to optimization-heavy approaches like GCG (Zou et al., 2023). The authors appropriately situate their contribution within both the LLM safety and adversarial machine learning research landscapes.
与现有文献的关系
This article's method of inserting emojis into text to make Judge LLM make wrong judgments is a novel method. I have not heard of similar work.
遗漏的重要参考文献
The paper would benefit from references to recent work on embedding space vulnerabilities in classification tasks, particularly from the NLP security literature. Research on emoji understanding and semantic interpretation in LLMs would provide context for the semantic ambiguity claims. Literature on defense mechanisms against adversarial attacks in the NLP domain would also strengthen the discussion on potential countermeasures.
其他优缺点
Other Strengths The identification of token segmentation bias represents a novel contribution to LLM safety research. Unlike previous work focusing on prompt-level or token-level jailbreaking, this attack targets a fundamental vulnerability in how Judge LLMs process tokenized inputs. This insight opens a new dimension for understanding model robustness. The technical depth of the embedding analysis is impressive. The authors go beyond simply demonstrating the attack's effectiveness to provide a mechanistic explanation through cosine similarity measurements and attention visualizations. Figure 5 particularly enhances our understanding of how token segmentation alters attention patterns in the model. The cross-model transfer capabilities make this attack particularly concerning. The consistent effectiveness across diverse model architectures (from open-source Llama Guard to commercial GPT-4) suggests the vulnerability is intrinsic to current LLM design rather than implementation-specific. The paper quantifies these differences rigorously, showing that while GPT-4 is more robust, it still exhibits a 6.6% reduction in detection capability. The practical implementation via in-context learning represents a significant contribution. By demonstrating that the attack can be executed without model access or optimization, the authors highlight a genuine real-world threat. The one-shot example approach makes the attack accessible even to non-technical users, amplifying its practical impact.
Other Weaknesses The defense mechanism analysis is underdeveloped. While the appendix briefly discusses a potential approach using an additional LLM filter, this exploration feels preliminary and lacks rigorous evaluation. A more systematic investigation of countermeasures would significantly strengthen the paper, particularly exploring whether detection-time modifications to embedding space might mitigate these attacks. The paper lacks sufficient analysis of emoji semantics and their relationship to attack effectiveness. While Table 2 shows performance across different emojis, there's no systematic categorization of emoji types (positive vs. negative, abstract vs. concrete) or investigation into whether semantic properties correlate with attack success. This analysis would provide deeper insights into why certain emojis are more effective than others. The limited explanation for commercial model robustness represents a missed opportunity. Though the paper identifies that models like GPT-4 show greater resilience, it doesn't sufficiently explore the architectural or training factors that might contribute to this robustness. Understanding these differences could inform better defense strategies and more robust model designs. The evaluation could benefit from human perception studies. While the paper thoroughly evaluates machine detection rates, it doesn't assess whether the emoji-laden content appears suspicious to human moderators. Given that human oversight often complements automated moderation, understanding human detectability would provide a more complete picture of the attack's real-world implications.
其他意见或建议
no
We thank the reviewer for all the insightful comments. We have addressed your questions and comments below.
Q1: Defense Mechanisms.
Please see our response to Reviewer i3cs.
Q2: Emoji Semantics and Its Impact on Attack Effectiveness.
We agree that understanding the semantics of emojis and their influence on attack effectiveness is important. As illustrated in Figure 6 in the Appendix, we observed that negative emojis increased the unsafe probability. However, categorizing emojis is challenging, as the semantic interpretation of emojis can be context-dependent or culturally variable [1]. For example, the smiling emoji with round eyes 🙂 may be perceived positively by older users but negatively by younger generations.
We attempted emoji categorization using Llama Guard itself, and the results differed from ChatGPT-3.5's categorizations. This discrepancy suggests that emoji semantics vary across different LLMs, influenced by underlying training datasets and model parameter scales.
[1] Zhukova M, Herring S C. Benign or Toxic? Differences in Emoji Interpretation by Gender, Generation, and Emoji Type[J]. Language@ Internet, 2024, 22(Special Issue): 74-108.
Q3: Commercial Models.
Due to the proprietary nature of these models, details about their architectures and training processes remain unknown to us. However, for open-source Judge LLMs, our results in Table 1 demonstrate that Lama Guard 2 (built on Llama-3-8B) outperforms Llama Guard (built on Llama-2-7B). The improved robustness can be explained by the increased model parameter size and the extended training datasets. Similarly, ShieldLM, trained on internlm2-chat-7B using 14,387 query-response pairs, and WildGuard, built on Mistral-7B-v0.3 trained on 86,759 examples, illustrate that larger and more diverse training datasets significantly enhance model robustness.
Q4: Human Perception Studies.
While we agree that assessing human perception could offer additional perspective, our focus is on attacking automated moderation systems such as Judge LLMs. Since human reviewers can likely detect emoji-laden content more easily, our threat model assumes scenarios where content volume or platform design limits human oversight. In such settings, automated systems often operate with minimal human intervention. We feel that it is valuable for the ML community to study how such automated systems can be attacked.
Q5: Impact of Model Architectures, Sizes, and Training Approaches.
It is difficult to isolate the effects of architecture, size, and training in controlled experiments, as these factors often vary simultaneously. It would require us to train LLMs from scratch, which is not feasible. However, as shown in Table 3, commercial LLMs tend to handle token segmentation bias more effectively, likely due to a combination of larger model sizes, more diverse training data, and advanced training techniques. However, we do not have sufficient evidence to connect improved robustness to specific architectural designs alone.
Q6: Multilingual Models.
This is a really interesting question. To address this question, we run additional experiments on a Chinese toxic content dataset [2] using the shenzhi-wang/Llama3.1-8B-Chinese-Chat, an instruction-tuned language model for Chinese. We sampled 1,000 toxic examples and inserted smiley emojis at random positions, as Chinese characters cannot be split into smaller sub-units. The results show a decrease in the unsafe prediction ratio after emoji insertion, suggesting that the emoji attack is also effective in languages with different tokenization patterns, such as Chinese.
| Attack | Unsafe Prediction Ratio |
|---|---|
| W/O Emojis | 17.1% |
| 5 Emojis | 14.5% |
| 10 Emojis | 12.6% |
[2] Lu J, Xu B, Zhang X, et al. Facilitating fine-grained detection of Chinese toxic language: Hierarchical taxonomy, resources, and benchmarks. ACL 2023.
Q7: Optimal Defensive Strategies.
Thank you for the question. We're not entirely sure what is meant by "optimal defensive strategies" in this context. If the intent is to suggest using Algorithm 1 to generate adversarial examples for adversarial training, then yes, Algorithm 1 could be used. We ran some preliminary experiments to test this idea; adversarial fine-tuning using such examples did not lead to significant improvements in robustness over the baseline results shown in Q1. We think that is not too surprising, since approximate adversarial examples are often sufficient to improve robustness.
This paper introduces "Emoji Attack," a technique that exploits token segmentation bias to enhance jailbreak attacks against Judge LLMs. The authors demonstrate that inserting emojis into text can disrupt the tokenization process, causing embedding distortions that lead Judge LLMs to misclassify harmful content as safe. Through experiments on multiple state-of-the-art Judge LLMs, they show that their approach substantially reduces unsafe prediction rates, and bypassing existing safeguards.
给作者的问题
- In the black-box Emoji Attack implementation, how do you address position selection when you lack direct access to embedding functions? The paper demonstrates that emoji position significantly impacts effectiveness, but it's unclear how optimal positioning is achieved in the black-box scenario where you can't compute cosine similarities. Would a response detailing your approach for black-box position optimization change my assessment of the method's practical applicability?
- Does token segmentation bias persist when Judge LLMs evaluate text in languages other than English, particularly those with different tokenization patterns (e.g., character-based languages like Chinese or Japanese)? Evidence of cross-lingual vulnerability (or lack thereof) would enhance my understanding of how fundamental this vulnerability is to LLM architecture.
论据与证据
Yes, the claims made in the submission are generally supported by clear and convincing evidence.
方法与评估标准
Yes
理论论述
EmojiAttack does not contain formal mathematical proofs that require verification. The paper is primarily empirical in nature, focusing on experimental demonstrations of the token segmentation bias vulnerability and the effectiveness of the Emoji Attack.
As for theoretical claims, it includes several theoretical formulations and algorithms:
- Problem formulation in Section 3.1: The authors provide mathematical notation for how target LLMs and Judge LLMs operate, defining the prediction of tokens and the filtering process. These are standard formulations.
- Definition 3.1 of Token Segmentation Bias: A formula definition.
- Equation 3 for computing cosine similarities: a straightforward application of cosine similarity.
实验设计与分析
- Token Segmentation Bias Experiments (Section 3.2): Here the authors test mid-split and cs-split on 402 offensive phrases. They compare baseline performance against two increasingly sophisticated segmentation methods, providing a clear progression of effectiveness. And Figure 2 appropriately visualizes results across four Judge LLMs, Figure 3 effectively demonstrates the correlation between cosine similarity and classification probability.
- Enhancement of Existing Jailbreak Techniques (Section 4.2): Integrate of Emoji Attack with four established jailbreak methods.
- White-Box Emoji Attack (Section 4.3): Test token segmentation bias and emoji insertion on a dataset of 1,432 harmful responses.
补充材料
I reviewed all supplementary materials
与现有文献的关系
- Jailbreaking techniques: While previous work like GCG focused on optimizing tokens to bypass content generation LLMs, this paper uniquely targets Judge LLMs through token manipulation with emojis.
- Tokenization vulnerabilities: The paper builds on on character-level attacks, but identifies a new "token segmentation bias" specifically affecting Judge LLMs when delimiters alter tokenization.
- Judge LLM biases: This extends research on biases in evaluation modelsby discovering a previously unknown vulnerability affecting even commercial models like GPT-4.
In terms of the findings, this paper has proved that their methodology effectiveness compared with other methods
遗漏的重要参考文献
No
其他优缺点
Pros:
- Unlike many jailbreak techniques requiring complex optimization algorithms, the Emoji Attack is relatively simple to implement using in-context learning, making it particularly concerning from a security perspective.
- The authors demonstrate that token segmentation bias affects multiple Judge LLM architectures, suggesting this is a fundamental vulnerability rather than an implementation-specific issue.
Cons:
- There's limited information about the composition of the 402 offensive phrases and 1,432 harmful responses used for evaluation, making it difficult to assess how representative they are.
其他意见或建议
No
We thank the reviewer for all the insightful comments. We have addressed your questions and comments below.
Q1: Limited Description of Datasets.
Thank you for pointing this out. If given the opportunity, we will include a more detailed description of the datasets in the paper. Below, we outline the key characteristics:
-
The 402 offensive phrases consist of short toxic expressions, typically 2–3 words in length. These include vulgar slang, sexual references, derogatory language, and references to illegal activities or fetishes.
-
The 1,432 harmful responses are composed of two parts:
- 574 harmful strings from AdvBench, covering a broad spectrum of harmful content such as profanity and graphic descriptions (lengths range from 3 to 44 words).
- 858 malicious responses generated via jailbreaks (110 from [1], and 748 from [2]). These responses are longer and more diverse, ranging from 7 to 836 words. For [4], we selected the most harmful examples based on the associated harmfulness scores.
We will also include a summary table with concrete examples from each category.
[1] Phute M, Helbling A, Hull M, et al. Llm self defense: By self examination, llms know they are being tricked. In ICLR 2024 TinyPaper, 2024.
[2] Ganguli D, Lovitt L, Kernion J, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned[J]. arXiv preprint arXiv:2209.07858, 2022.
Q2: Position Optimization in Black-Box Emoji Attack.
In the black-box emoji attack setting, we do not perform position optimization (we will clarify this in the manuscript). Fine-grained control over emoji insertion is not feasible because the inserted positions are determined by the target LLMs via in-context learning. As such, we cannot compute cosine similarities or directly optimize positions in the black-box scenario.
Despite this limitation, we find that simply prompting the target LLMs to insert emojis within words (without fine-grained control over exact positions) is often sufficient to fool the judge LLMs, as demonstrated in Table 1. This highlights the practicality of the black-box attack, even without explicit position optimization.
Q3: Exploration in Other Languages.
Thank you for this fantastic question. We are really excited to explore this question. Also, this is related to a question raised by Reviewer RkAj, and we conducted initial experiments to investigate the cross-lingual applicability of the emoji attack (we are committed to expand these results further).
Using the instruction-tuned Chinese language model shenzhi-wang/Llama3.1-8B-Chinese-Chat, we first confirmed that token segmentation differences exist in Chinese. For instance, the phrase “我们” and its space-separated variant “我 们” yield different token ID sequences: [98739] vs. [37046, 220, 80578].
We then sampled 1,000 toxic examples from a Chinese dataset [3] and inserted smiley emojis at random positions within the sentences (since Chinese characters cannot be further segmented). The results show a decrease in the unsafe prediction ratio, indicating that the emoji attack remains effective in character-based languages like Chinese.
These findings suggest that token segmentation bias generalizes beyond English. If provided the opportunity, we would be glad to include these cross-lingual results in the final version of the paper.
| Attack | Unsafe Prediction Ratio |
|---|---|
| W/O Emojis | 17.1% |
| 5 Emojis | 14.5% |
| 10 Emojis | 12.6% |
[3] Lu J, Xu B, Zhang X, et al. Facilitating fine-grained detection of Chinese toxic language: Hierarchical taxonomy, resources, and benchmarks. ACL 2023.
Thanks for the reponse. I think this paper is acceptable but I hope the authors could add the dataset description in the final version (if possible).
Thank you very much for your supportive feedback. We will definitely include a detailed dataset description in the final version, if given the opportunity. In addition, we will also provide a GitHub repository to ensure reproducibility of our work.
The paper introduces “Emoji Attack,” a novel adversarial technique that exploits token segmentation bias in Judge LLMs by strategically inserting emojis into text.
The consensus among reviewers was to accept the paper, albeit with some reservations. Three reviewers provided positive reviews, highlighting the paper’s practical significance, comprehensive empirical evaluation, and clear articulation of a real-world vulnerability in Judge LLM moderation systems. The main concerns involve the limited exploration of potential defenses, multilingual capability, and comparison with existing attacks. The authors' rebuttal largely addresses the reviewers' concerns. Reviewer i3cs has a negative recommendation. However, the review is vague and lacks details. The authors' responses also seem adequate.
Overall, I find this submission is suitable for acceptance.