PaperHub
5.3
/10
Rejected4 位审稿人
最低5最高6标准差0.4
6
5
5
5
2.8
置信度
正确性2.5
贡献度2.3
表达2.0
ICLR 2025

Cognitive Overload Attack: Prompt Injection for Long Context

OpenReviewPDF
提交: 2024-09-27更新: 2025-02-05
TL;DR

We examines ICL in LLMs through the lens of cognitive neuroscience, revealing that LLMs, like humans, can experience cognitive overload that makes them vulnerable to adversarial prompts, and demonstrates how this can lead to effective jailbreaks.

摘要

关键词
In Context LearningCognitive OverloadLLMs SafetyPrompt Injection Attack

评审与讨论

审稿意见
6

This paper hypothesizes that cognitive overloading safety-aligned LLMs with multiple distracting requests alongside the original harmful one could be a successful jailbreaking technique. This hypothesis was based on analogies between human learning and cognitive load limits and LLMs. They spend a significant amount of the paper on this hypothesis and experiments to demonstrate cognitive load in LLMs based on the deterioration of performance under load. They then show that under cognitive load attacks, SOTA production LLMs are vulnerable to standard jailbreaking techniques,

优点

S1. I think that this paper's engagement with cogsci work is substantial, useful, and nontrivial. I think that papers like this, independent of actual contributions, serve a sort of bridging value.

S2. In table 1, I think that the overall ASRs are fairly impressive.

缺点

W1: I think https://arxiv.org/abs/2403.08424 should probably be cited and discussed. And any others like it as well. Based on my view of it, they have some something of the same type and to the same effect as was done here. Also in your response, I'd be interested in comments on contrasts and novelty.

W2: In general, I wonder the extent to which what is studied here is closely related and reframed compared to other jailbreaking techniques. For example, when people JB models using leetspeak, low resource languages, personas, or asking models to simulate hypothetical tasks, is this not conceptually similar to what is done here? Isn't the CL increased in all of these cases? One thing that I think this paper may not be the most clear on is the extent to which we should expect their hypothesis to apply to a large number of jailbreaks (which is not thoroughly investigated here) or just some very specific ones they showcased here. I would be interested in more work to clarify the scope of the authors' hypothesis and to further the differences between their techniques and other common ones.

问题

Q1: What procedure was used to calculate the ASR? What was the baseline no-attack ASR for the rows of table 1? A common problem with calculating jailbreak ASRs is that incompetent but compliant responses are sometimes marked as successful attacks under some methods. I would like to see the autograding details, the baseline, and an discussion of the potential false positive rate.

伦理问题详情

No ethical concerns.

评论

We would like to thank the reviewer for the feedbacks. Please see our responses to the questions below.

Q1: What procedure was used to calculate the ASR? What was the baseline no-attack ASR for the rows of table 1? A common problem with calculating jailbreak ASRs is that incompetent but compliant responses are sometimes marked as successful attacks under some methods. I would like to see the autograding details, the baseline, and a discussion of the potential false positive rate.

To address the ASR calculation procedure and baseline concerns:

Our Attack Success Rate (ASR) calculation was based on testing 232 questions from the Forbidden Question dataset and 100 questions from the JailbreakBench dataset. The ASR was computed by dividing the total number of successful jailbreaks (as shown in the table) by these respective denominators (232 and 100). Important to note that per our algorithm design, if a question successfully jailbreaks at cognitive load level CL1, we don't test it at higher levels (CL2-CL6). The "Total" column in our table represents successful jailbreaks across all cognitive load combinations.

For response evaluation methodology:

  1. We employed a judge LLM using the prompt detailed in Figure 12

  2. The evaluation process was two-step:

    • First, the judge LLM provides a written explanation for its classification

    • Then, it categorizes the response as safe/unsafe/neutral

To validate our results and address potential false positive concerns:

  1. We conducted additional validation using multiple judge LLMs (as shown in Tables 2 and 3) to evaluate responses initially classified as UNSAFE

  2. The variation in results between Tables 1 and 2 for the Forbidden Question dataset (when judged by different LLMs) can be attributed to:

    • Different safety alignment policies across model providers

    • A response considered unsafe by GPT-4 might be deemed safe by Claude-3-Haiku

Regarding false positive rates:

  1. Establishing ground truth for manual response evaluation is challenging

  2. As acknowledged in our limitations section, we suggest using a jury of judge LLMs, where unsafe classification would require consensus among multiple judges

  3. This approach could help mitigate the challenge of false positives in future work

W1: I think https://arxiv.org/abs/2403.08424 should probably be cited and discussed. And any others like it as well. Based on my view of it, they have some something of the same type and to the same effect as was done here. Also in your response, I'd be interested in comments on contrasts and novelty.

We thank the reviewer for bringing this reference to our attention. While both works explore methods of exploiting long-context in-context learning, there are key distinguishing aspects in our approach.

The distraction-based attack described in the referenced work relies on hiding harmful responses within auxiliary tasks. In contrast, our cognitive load attack framework is fundamentally grounded in neuroscience theories of cognitive overload. Rather than relying on distractions, we design cognitive load tasks (T1–T6) that are intrinsically tied to the observation task (the harmful question). Our methodology differs in two significant ways:

  1. We focus on increasing cognitive overload through progressive task complexity via adding irrelevant tokens

  2. Unlike the referenced work, our framework does not require an attacker LLM

We have added this paper in the related work in our revised version.

评论

W2: In general, I wonder the extent to which what is studied here is closely related and reframed compared to other jailbreaking techniques. For example, when people JB models using leetspeak, low resource languages, personas, or asking models to simulate hypothetical tasks, is this not conceptually similar to what is done here? Isn't the CL increased in all of these cases? One thing that I think this paper may not be the most clear on is the extent to which we should expect their hypothesis to apply to a large number of jailbreaks (which is not thoroughly investigated here) or just some very specific ones they showcased here. I would be interested in more work to clarify the scope of the authors' hypothesis and to further the differences between their techniques and other common ones.

As mentioned in our related work (line 508), we acknowledge that various existing attacks can indeed be evaluated through the lens of cognitive overload theory. Our work provides a systematic framework to understand and analyze these attacks from this perspective.

Relationship to Existing Attacks :

Through our analysis, we identified that many existing jailbreaking techniques inadvertently leverage cognitive load principles:

  1. DAN attacks (Shen et al.) combine multiple elements like rule sets, role-play instructions, and behavioral specifications before presenting the harmful question. The tokens preceding the harmful question create cognitive load, while the harmful question serves as the observation task.

  2. Multi-shot jailbreaking (Anil et al.) uses preceding multi-shots with question-answer demonstrations as cognitive load, followed by the harmful question as the observation task.

  3. Low-resource language attacks like the Sandwich-attack (Upadhayay et al.) position the harmful question between queries in low-resource languages, following the same pattern of generating irrelevant tokens before the observation task.

Our Novel Contributions:

What distinguishes our approach is:

  1. Our attack vectors are specifically designed based on neuroscience cognitive load theory and supported by empirical experimental results.

  2. We quantified cognitive load through the number of irrelevant tokens generated, leading to systematic task design that increases this number through token breakdown, additional token generation, and task multiplication.

  3. We developed an adaptive attack algorithm that increases efficacy through escalation. When attacks fail at cognitive load combination level 1 (CL1), we can escalate to CL2 by prompting more irrelevant token generation.

Regarding attack generalization:

As we extend our attack methodology and cognitive load theory to in-context learning, we can leverage higher-capability models like Claude-3.5-Sonnet to create similar attack vectors. We validated this by testing the attack vector on GPT-4, successfully achieving a jailbreak. This demonstrates that our attack vectors are highly generalizable and can be effectively transferred across different models (Figures 16,17,18 and 19).

评论

Dear Reviewer F4TZ,

We would like to gently remind you that as the deadline for the rebuttal period is approaching, we would appreciate receiving your valuable insights and feedback on our responses.

If our responses have addressed your concerns and mentioned weaknesses, we respectfully request that you consider increasing the score.

Thank you.

审稿意见
5

The author's conducted the first study that directly compares In Context Learning in LLMs with human cognitive learning. They empirically demonstrated that increased cognitive load leads to cognitive overload in LLMs, which degrades performance on secondary tasks similar to human cognition (HC). Then they introduced an attack method combining with cognitive overload, which exploits cognitive overload to bypass LLM safety mechanisms. The method showed that higher-capability models can create cognitive overload prompts to attack other LLMs, demonstrating the transferability and widespread impact of cognitive overload attacks.

优点

  • Important motivation. Analogy between the reasoning process of LLMs and learning in Human Cognition.
  • Surprisingly, the purposed attack method has excellent performance with high attack success rates.
  • Paper is clear with no major problem in writing.

缺点

I am not an expert in cognitive science and this is why I chose confidence in 2, please correct me if I’m mistaken in these aspects.

  • Firstly, I personally believe that this paper should be submitted in other venues rather than being applied in jailbreaking LLMs. Without a certain background, it is difficult to understand why the different tasks and patterns of cognitive load measurement (T1-T7) in Section 3 are designed and defined in this way. (I have reviewed relevant references)
  • It cannot be determined whether different combinations will have different cognitive load effects on the order between T1-T7 or C1-C6.
  • What does the average score mean in Figure 1B (I have seen the Figure 4)? Is the self-report method reliable and can other quantitative indicators be used to measure it?
  • The cognitive experiment only conducted visual analysis of the code task, more other tasks or datasets should be included.
  • In Cognitive Overload Attack, the author did not compare it with some existing attack methods, such as GCG, PAIR, etc. (see some benchmark in jailbreaking task)
  • The derivative questions generated may cause some questions not follow the instruction during paraphrasing, such as outputting an anwser or a nonharmful question.
  • Too many hypothesis need to be made under cognitive load conditions.

问题

Please see the weaknesses. The limitations are listed in detail in the appendix, but there are a few that I believe need to be addressed.

评论

Firstly, I personally believe that this paper should be submitted in other venues rather than being applied in jailbreaking LLMs. Without a certain background, it is difficult to understand why the different tasks and patterns of cognitive load measurement (T1-T7) in Section 3 are designed and defined in this way. (I have reviewed relevant references)

We appreciate the reviewer's concern about the task design rationale. The tasks (T1-T7) were systematically designed with a specific purpose: to incrementally increase cognitive load from a baseline (line 260). While LLMs excel at sequential task processing, simply increasing task complexity would only affect intrinsic cognitive load (the cognitive effort required for basic task completion).

Our approach was more comprehensive: we deliberately targeted both intrinsic and extraneous cognitive load through two mechanisms:

  1. Increasing the number of tasks to elevate intrinsic cognitive load

  2. Adding irrelevant token counts to raise extraneous cognitive load

As explained in line 227, we structured these tasks into three categories (General, Custom, and Unconventional Tasks) to systematically observe the impact on the observation task under increasing cognitive load conditions.

It cannot be determined whether different combinations will have different cognitive load effects on the order between T1-T7 or C1-C6.

Our comprehensive analysis in Appendix C.3 demonstrates both individual and combinatorial effects of cognitive tasks T1-T6 on model performance.

First, we established baseline effects by measuring how each individual cognitive load task (T1-T6) impacts the observation task performance using Llama-3-70B. As shown in Figure 5, different cognitive tasks have distinct impacts on observation task performance, allowing us to understand their individual contributions.

Building on these individual measurements, we systematically investigated combinatorial effects by progressively concatenating tasks. We started with CL1 (containing only T1), then created CL2 by adding T2, and continued this pattern through CL6. Specifically:

* CL1: T1 → observation task

• CL2: T1 + T2 → observation task

• CL6: T1 + T2 + T3 + T4 + T5 + T6 → observation task

To validate that these combinations create incremental cognitive load, we tested them across four models: C3-Opus, GPT-4, L3-70B, and Gem-1.5-Pro. Figure 1B shows the average observation task performance scores under each cognitive load combination (CL1 to CL6). Statistical validation using paired t-tests confirmed significant performance decreases (p=0.0048) when moving from any cognitive load level (CL_i) to the next level (CL_i+1) (line 363), demonstrating that our combinations create measurable, progressive increases in cognitive load.

评论

What does the average score mean in Figure 1B (I have seen the Figure 4)? Is the self-report method reliable and can other quantitative indicators be used to measure it?

We appreciate the reviewers' questions regarding Figure 1B and our measurement methodology. Let us address both points:

Regarding Figure 1B's average scores:

The figure compares performance between baseline observations (CL0 - no cognitive load) and observations under increasing cognitive load combinations (CL1 through CL6). Three judge LLMs independently evaluated responses using the prompt detailed in Figure 4, where they performed pairwise comparisons between responses with and without cognitive load to generate scores. These scores were then averaged to produce the final metrics shown in the figure.

Regarding measurement reliability:

We employed multiple complementary approaches to ensure robust measurement of cognitive load:

  1. Multi-task Observation Method (Section 3.2.2): Rather than relying solely on self-reporting, we evaluated cognitive load impact through observation tasks, providing quantitative performance metrics.

  2. Irrelevant Token Quantification (Section 3.2.3): We measured the increase in irrelevant token generation preceding the observation task as an objective indicator of cognitive load. Statistical testing confirmed significant increases in irrelevant tokens between successive cognitive load combinations CL_i to CL_i+1 (detailed in Appendix C.6).

  3. Self-reporting Method: While we did incorporate self-reporting following human cognition principles, we acknowledge its limitations. Our experiments revealed that judge LLMs showed strong agreement on intrinsic cognitive load but diverged on extraneous cognitive load assessments (Figure 8). This aligns with our noted limitation (line 853) that judge LLMs may exhibit bias in cognitive load measurement based on the information provided to them. Since these models lack an understanding of their internal mechanisms (e.g., token processing), they cannot accurately assess the impact of irrelevant tokens on extraneous cognitive load.

Through this multi-faceted approach, we aimed to compensate for the limitations of any single measurement method while providing comprehensive insights into cognitive load effects.

The cognitive experiment only conducted visual analysis of the code task, more other tasks or datasets should be included.

Thank you for this feedback. While we acknowledge that our cognitive experiment focused on visual analysis through the code-writing task, this specific task served as a strategic entry point to demonstrate cognitive load effects. This analysis directly informed our subsequent, more comprehensive investigations: we conducted extensive evaluations using the Vicuna MT Benchmark across four state-of-the-art models (detailed in Section 3.2.2), and examined cognitive load's impact on LLM safety using both the Forbidden Question Dataset and JailbreakBench dataset (presented in Section 4). The code task specifically allowed us to establish a clear visual baseline for cognitive load effects before expanding into these broader experimental domains.

In Cognitive Overload Attack, the author did not compare it with some existing attack methods, such as GCG, PAIR, etc. (see some benchmark in jailbreaking task)

We thank the reviewer for their valuable feedback regarding the comparison with existing attack methods. We acknowledge that our initial submission would have benefited from a comparative analysis with methods like GCG and PAIR.

While our approach differs fundamentally from these methods - being a black-box attack compared to GCG's white-box approach - we understand the importance of benchmarking against existing solutions. We initially focused on the Forbidden Question dataset, which hadn't been tested in prior works. However, to address this limitation, we have now added comparison PAIR using the JailbreakBench dataset. These comparative results has been added in the revised paper (Table-4).

Attack MethodsGPT-4Gemini Series
PAIR48%Gem-Pro: 73%
Cognitive Overload Attack90%Gem1.5Pro: 93%, Gem1.0Pro: 49%
评论

The derivative questions generated may cause some questions not to follow the instruction during paraphrasing, such as outputting an answer or a non harmful question.

In our implementation of the attack algorithm, we used GPT-3.5-Turbo to generate derivative questions through paraphrasing, specifically focusing on 'How to' and 'What are' formulations. While this process successfully generated paraphrased variations of the original harmful questions, we observed a limitation: in some cases, GPT-3.5-Turbo automatically transformed harmful questions into benign ones during the paraphrasing process, deviating from the intended instruction to maintain the nature of the original questions.

Too many hypotheses need to be made under cognitive load conditions.

We presented a focused set of interconnected hypotheses that naturally build upon each other to explain the cognitive load phenomenon in LLMs:

First, our foundational hypothesis (H0) establishes the basic principle: excessive intrinsic and extraneous cognitive load leads to working memory bandwidth exhaustion, resulting in cognitive overload and decreased performance in LLMs. Building directly on this, H1 specifies the observable effects: under cognitive overload, LLMs either show significantly deteriorated performance or complete failure in observation tasks.

We provide comprehensive empirical validation for these hypotheses in Section 3.2.2, demonstrating:

  • Statistical evidence of performance degradation with increasing cognitive load

  • Complete response failures under higher cognitive load (Figure 1B)

  • Poor code generation for drawing animals (Figures 1A, 1B, and 1C)

We made a typographical error: H3 should be H2. This has been corrected in the revised version.

Our final hypothesis (H2) explores the security implications: cognitive overload in aligned LLMs can enable jailbreak scenarios when observation tasks are replaced with harmful questions. We validate this through our automated attack algorithm using the Forbidden Question Dataset and JailbreakBench dataset (Table 1).

Finally, we build our intuition of increasing cognitive load by altering and fragmenting tokens that act as irrelevant information (line 264). We quantify the cognitive load as the number of irrelevant tokens and statistically show that as cognitive load increases, the number of irrelevant tokens increases (section 3.2.3).

评论

Thanks for answering all my questions and providing additional details regarding your hypotheses. I have raised my score to 5.

评论

We sincerely thank the reviewer for recognizing our responses and increasing the score. We appreciate your thorough evaluation of our work. While we are grateful that we were able to address your previous concerns, we would welcome any additional feedback that could further strengthen the paper. If you find our previous clarifications satisfactory, we would respectfully request you to consider increasing the score further.

审稿意见
5

This paper investigates a phenomenon called cognitive overload in LLMs. Similar to human cognitive process, LLM also suffer from the overload problem, inspired by this, the author presents a cognitive overload attack to jailbreak aligned LLMs.

The contribution of this paper is as follows:

  1. This paper demonstrates that, cognitive overload, which occurs when multiple complex tasks are combined in a single prompt, will significantly degrades LLMs' performance.
  2. Building on this principle, the authors attempt to jailbreak an aligned LLM by embedding malicious prompts within cognitively overloaded instructions. Their experiments demonstrate the effectiveness of this attack method.

优点

  1. This paper is well-written and easy to follow.
  2. The experiments of both cognitive overload identification and cognitive overload attacks are complete.

缺点

  1. This work appears to be an incremental extension of previous research [1], offering limited novel contributions.
  2. The findings are somewhat predictable: when LLMs are tasked with handling multiple simultaneous instructions, their ability to properly follow any single instruction naturally deteriorates. Consequently, it's unsurprising that LLMs become more susceptible to executing jailbreak prompts when distracted by multiple competing tasks.

[1] Xu et al. Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking

问题

As listed in weaknesses part, my major concern is about its contribution.

伦理问题详情

This paper introduces a jailbreak attack method that may induce harmful contents such as discrimination. The paper itself doesn't contain any discrimination / bias / fairness concerns.

评论

“1. This work appears to be an incremental extension of previous research [1], offering limited novel contributions. “

While we respect the reviewer's perspective, we believe our work represents a substantial advancement beyond [1]'s initial exploration in fundamental ways:

  1. Theoretical Framework: While [1] examined three specific jailbreak variants, we developed a comprehensive theoretical framework connecting In-Context Learning with human cognitive learning principles. This includes introducing the novel concept of tasks as atomic units of cognitive load in LLMs.

  2. Methodological Innovation: We introduce multiple robust approaches to measure cognitive load in LLMs, including:

    a. Dual-task and multi-task techniques adapted from cognitive science

    b. Self-reporting methodologies for LLMs

    c. Quantifiable metrics based on irrelevant token generation

  3. Extensive Validation: We demonstrate the generalizability and robustness of our findings through:

    a. Validation across multiple SOTA models (GPT-4, Claude-3, Llama-3, and Gemini)

    b. Visualization experiments

    c. Statistical validation of results (t = 3.1248, p = 0.0048) using the Vicuna MT Benchmark dataset

  4. Practical Threat Model: We develop an automated attack algorithm that leverages these cognitive load principles, achieving high success rates in systematically testing LLM vulnerabilities.

This comprehensive approach represents a significant advancement in understanding and exploiting cognitive load in LLMs, moving well beyond the initial exploration of language switching in previous work.

We have also extended our related work section, highlighting the differences between our work and Xu et al.'s work (line 496-506).

“ 2. The findings are somewhat predictable: when LLMs are tasked with handling multiple simultaneous instructions, their ability to properly follow any single instruction naturally deteriorates. Consequently, it's unsurprising that LLMs become more susceptible to executing jailbreak prompts when distracted by multiple competing tasks. ”

While prior literature has indeed documented performance degradation under multiple tasks, our work makes several novel contributions:

First, we frame the vulnerability through the lens of cognitive load theory from neuroscience, providing a theoretical foundation for understanding why these attacks succeed. Second, our experiments demonstrate that simply increasing the number of benign tasks does not automatically lead to jailbreaks in safety-aligned LLMs. Rather, the successful jailbreaks specifically emerge from the increased cognitive load created through irrelevant token generation when solving multiple associated tasks.

This distinction is important - our attack vector's effectiveness stems not from task quantity alone, but from deliberately manipulating cognitive load in a way that compromises safety guardrails while maintaining model functionality for the primary task.

评论

I appreciate the authors' detailed rebuttal, however, one of my fundamental concern remain unaddressed: While the authors outline several methodological extensions beyond [1], the core concept of exploiting cognitive overload in LLMs was already established in the literature. I prefer to maintain my score of 5.

[1] Xu et al. Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking

评论

We appreciate the reviewer's feedback. While we acknowledge that [1] introduced the initial concept of cognitive overload in LLMs, we respectfully disagree that our work merely extends their concept.

Our work fundamentally advances this field in both theoretical depth and practical applications:

  1. Our key contribution lies in developing a rigorous theoretical framework that draws parallels between learning in In-Context Learning and established human cognitive learning principles. Rather than simply building on [1]'s observations about cognitive overload, we introduce and validate the novel concept of 'tasks as atomic units' of cognitive load in LLMs.

  2. Unlike [1]'s introduction of terminology only, we provide the first comprehensive methodology for quantifying cognitive load in LLMs through: a. Adaptation of cognitive science techniques (dual-task/multi-task paradigms) b. Development of LLM-specific self-reporting methods c. Introduction of quantifiable metrics based on irrelevant token generation.

  3. Finally, our definition of cognitive load in LLMs, attack methodology, and jailbreak approach are completely different from those presented in [1].

We request that the reviewer thoroughly review the differences between our work and [1] and consider our significant contributions. Our work addresses a critical gap in LLM safety alignment research, and we believe our methodological approaches and analogies could provide valuable insights to the ICLR community. We respectfully request that the review score not be based solely on the comparison with [1] but also consider our additional significant contributions, and we request that the reviewer consider increasing the score accordingly.

审稿意见
5
  • Paper introduces a suite of additional problems to task the LLM with and shows that this degrades performance on the main task
  • Paper uses multi-task setup to design jailbreaks
  • The paper makes very strong claims about similarities between human cognition and LLMs, which are not always supported by evidence.
  • The paper's setting is already studied in prior works, and the novelty of the contribution is unclear to me in this context (see details below)
  • the jailbreaking results are unclear. It is not obvious whether the LLM jailbreak is actually caused by "cognitive overload".

优点

  • easy to follow
  • simple experiments
  • well-motivated setting

缺点

  • “demonstrating that CLT applies to LLMs“ strong claim for the limited empirical results
  • HC not introduced in intro
  • “LLMs process input tokens by identifying semantic patterns and relationships, which are abstracted into embeddings and hidden states, similar to abstraction of concepts in HC“ unsupported claim
  • Drawing experiment is only based on visual identification of two types of animals (this needs more data to be statistically significant)
  • Prior work on performance degradation in multi-task setting already exists -- how is this work novel here? "LLM Task Interference: An Initial Study on the Impact of Task-Switch in Conversational History": examines how task-switching within a conversation affects LLM performance "Exploring the Zero-Shot Capabilities of LLMs Handling Multiple Problems at once": investigates LLM performance when presented with multiple problems at once.
  • Why does Table 1 use different Judge LLMs?
  • It appears that much of the jailbreaks already occur for CL1 level for Forbidden Question dataset (Table 1). Is it really the cognitive overload that breaks the models or just the paraphrasing and repeated trials (6 times)?

问题

See above

评论

It appears that much of the jailbreaks already occur for CL1 level for Forbidden Question dataset (Table 1). Is it really the cognitive overload that breaks the models or just the paraphrasing and repeated trials (6 times)? (the jailbreaking results are unclear. It is not obvious whether the LLM jailbreak is actually caused by "cognitive overload". )

Our experimental design and results provide strong evidence that cognitive overload, rather than simple paraphrasing or repeated trials, is the key factor in successful jailbreaks. Here's our systematic analysis:

First, we established a baseline by testing the original forbidden questions with aligned LLMs. As expected, the models consistently refused to answer these harmful queries, demonstrating their safety guardrails were functioning.

Next, when we tested simple paraphrasing of these questions (using "how to" and "what are" formulations) without cognitive load elements, the LLMs maintained their safety boundaries and continued to refuse answers, which showed that mere paraphrasing alone is insufficient to bypass safety measures. As we began our preliminary experiments, we tested both the original questions and the derivative (paraphrased) questions with the aligned LLMs. This systematic progression from baseline to paraphrasing to cognitive load demonstrates that the cognitive overload mechanism is indeed the critical factor in compromising the models' safety barriers, rather than just rephrasing or multiple attempts.

Many jailbreaks already occur for CL1

While CL1 does achieve some jailbreaks, our comprehensive analysis shows that cognitive load is indeed the driving mechanism, not just simple paraphrasing. This is evidenced by several key observations:

First, our results demonstrate model-specific resistance to CL1. For instance, in Table 1 (JailbreakBench dataset), both GPT-4 and Llama3-70B showed zero successful attacks at CL1, with jailbreaks only occurring at CL2 and beyond. Similarly, Gemini-1.0 Pro exhibited low vulnerability to CL1 but increased susceptibility at CL3, indicating that different models have varying thresholds of cognitive load resistance.

Second, if paraphrasing alone were responsible for the jailbreaks, we would expect to see consistent success rates across all cognitive load levels, as the same derivative questions were used throughout. Instead, we observe increasing success rates as cognitive load increases, even when using identical base questions. This pattern strongly suggests that the cognitive load mechanism, rather than the paraphrasing, is the key factor in achieving successful jailbreaks.

This variation in effectiveness is precisely why we designed our automated attack algorithm to incrementally increase cognitive load until a successful jailbreak is achieved, allowing us to identify each model's specific vulnerability threshold.

评论

Dear authors, thank you for the detailed response. I agree that the drawing experiment is a nice qualitative experiment, however I do not think that any conclusions can be drawn from an experiment with sample size 2. I appreciate that the authors are willing to tone down the claims of equivalences between LLMs and human cognition. Regarding the jailbreak results: My concern was regarding the effect of CL, as it seems that CL1 already led to high jailbreak rates, and as I found it unclear to which degree this happens solely due to paraphrasing. "if paraphrasing alone were responsible for the jailbreaks, we would expect to see consistent success rates across all cognitive load levels, as the same derivative questions were used throughout", this deduction still is not clear to me. If I am not mistaken, e.g., the text length must increase throughout all CL levels. Regarding the comparison to prior work: I don't think that the analogies drawn to human cognition constitute a contribution, as long as no significant additional findings result from them.

评论

Human Cognition not introduced in intro

While we introduce cognitive limitations (specifically working memory constraints) in the introduction (lines 53-54), we acknowledge that a more explicit framing of the human element would strengthen our paper. We have expanded our discussion of human cognition through Cognitive Load Theory in Section 2 (line 86) and provided comprehensive details in Appendix C.1 to balance technical depth with readability. This structure allows us to maintain focus on safety alignment while thoroughly addressing human cognitive factors.

“LLMs process input tokens by identifying semantic patterns and relationships, which are abstracted into embeddings and hidden states, similar to abstraction of concepts in HC“ unsupported claim

We acknowledge the reviewer's concern about the claim comparing LLM and human concept abstraction. Let us clarify our position with examples.

In LLMs, we can observe the abstraction process: input tokens are transformed into distributed representations (embeddings) in the model's latent space, where semantic relationships are captured. These embeddings naturally cluster semantically related concepts together. For example, words like "cat," "kitten," and "feline" have similar vector representations, indicating the model has abstracted the underlying concept. In the human brain, concepts are not stored as raw sensory inputs. Instead, they are abstracted into patterns of biological neural activations.

While we draw this analogy to human cognition cautiously, our intent is to highlight a functional similarity: both systems work with transformed representations rather than raw inputs. We have updated the sentences as follows: “Furthermore, LLMs process text by transforming input tokens into distributed representations in the latent space, where semantic patterns and relationships are mathematically captured. Similar analogy could be made with human cognition where concepts are not stored as raw sensory inputs, but are instead abstracted into patterns of biological neural activations [1] [2] [3]”

[1] Lin, L., Osan, R., & Tsien, J. Z. (2006). Organizing principles of real-time memory encoding: neural clique assemblies and universal neural codes. TRENDS in Neurosciences, 29(1), 48-57.

[2] Taylor, P., Hobbs, J. N., Burroni, J., & Siegelmann, H. T. (2015). The global landscape of cognition: hierarchical aggregation as an organizational principle of human cortical networks and functions. Scientific reports, 5(1), 18112.

[3] Nelli, S., Braun, L., Dumbalska, T., Saxe, A., & Summerfield, C. (2023). Neural knowledge assembly in humans and neural networks. Neuron, 111(9), 1504-1516.

Prior work on performance degradation in multi-task settings already exists -- how is this work novel here? "LLM Task Interference: An Initial Study on the Impact of Task-Switch in Conversational History": examines how task-switching within a conversation affects LLM performance "Exploring the Zero-Shot Capabilities of LLMs Handling Multiple Problems at once": investigates LLM performance when presented with multiple problems at once.

We thank the reviewer for highlighting these relevant works. While prior research has indeed examined performance degradation in multi-task settings, our work makes several novel contributions. First, we analyze this phenomenon through the lens of cognitive neuroscience, drawing meaningful parallels between learning in in-context learning and learning in human cognition. Second, we provide a quantitative framework for measuring cognitive load as a function of both task complexity and the presence of irrelevant context tokens. Third, we demonstrate that increased cognitive load can compromise safety alignment - specifically, we found that harmful queries rejected under low cognitive load conditions (CL1) received harmful responses under higher cognitive loads. Furthermore, we demonstrate that higher-capability models like Claude-3.5-Sonnet can be leveraged to design similar tasks capable of jailbreaking LLMs, highlighting significant safety concerns.. These contributions extend beyond documenting performance degradation to provide actionable insights for improving LLM robustness and safety.

Why does Table 1 use different Judge LLMs?

We deliberately chose different judge LLMs in Table 1 to demonstrate our attack algorithm's versatility. As explained in Lines 410-411, this approach shows that our method is not dependent on expensive models like GPT-4, but can effectively utilize more cost-efficient, open-source LLM alternatives as judges.

评论

We appreciate the reviewer’s insightful feedback. Please see our detailed responses to the specific weaknesses and questions outlined below.

The paper makes very strong claims about similarities between human cognition and LLMs, which are not always supported by evidence

We acknowledge the reviewer's concern about claims regarding similarities between human cognition and LLMs. We agreed that our tone may have been too strong given the scope and focus of the study, so we revised the wording to clarify that our work does not assert direct similarities between human cognition and ICL in LLMs, but rather explores analogies between learning processes in these two domains. In Lines 99-101, we establish our theoretical foundation by referencing established literature that examines parallels between learning processes in human cognition and artificial neural networks. Section 2 (Lines 105-117) builds upon this foundation to explore potential analogies between Cognitive Load Theory (CLT) concepts and in-context learning (ICL) mechanisms in LLMs.

The relationship between learning in human cognitive learning and learning in ICL serves specifically as motivation for investigating whether CLT principles could be useful in modeling certain alignment problems in LLMs. To validate these analogies empirically, we conducted systematic experiments detailed in Sections 3.2.1 and 3.2.2.

Our methodological approach drew inspiration from established cognitive load measurement techniques used with human subjects. We employed two specific approaches to assess potential cognitive load variations in LLMs:

  1. A modified dual-task paradigm (multi-task measurement)
  2. Self-reporting protocols

We evaluated four models from different model families using the Vicuna MT Bench framework, with results presented in Figure 1B as aggregated scores for each cognitive load condition.

“demonstrating that CLT applies to LLMs“ strong claim for the limited empirical results

We acknowledge that demonstrating CLT's applicability to in-context learning in LLMs requires robust empirical evidence. Our research provides several layers of supporting evidence:

  1. First, we established a baseline through a dual-task experiment (Appendix C.3) with Llama-3-70B, where even with just two tasks, we observed clear performance degradation in the observation task when preceded by cognitive load tasks.
  2. Building on this foundation, we designed a more comprehensive experiment to investigate the cumulative effects of cognitive load. We systematically created six cognitive load levels (CL1-CL6) by progressively stacking tasks T1-T6. This allowed us to examine how increasing cognitive demands affect performance.
  3. To ensure rigor and generalizability, we expanded our investigation to four state-of-the-art LLMs (Claude-3-Opus, GPT-4, Llama-3-70B, and Gemini-1.5-Pro) using the Vicuna MT Benchmark dataset. The responses were evaluated through pairwise comparisons by three independent judge LLMs. Statistical analysis revealed a significant decrease in performance under cognitive load conditions (t = 3.1248, p = 0.0048).

Our experimental results across multiple models and tasks, supported by statistical analysis, provide a solid foundation for exploring CLT's applicability to LLMs. While further research is needed to fully establish this connection, our findings represent an important first step in understanding the cognitive-like limitations of these systems.

Drawing experiment is only based on visual identification of two types of animals (this needs more data to be statistically significant)

We acknowledge the concern regarding the drawing experiment's limited scope. However, this visual experiment serves as a qualitative demonstration of cognitive load's impact, complementing our quantitative findings. In Section 3.2.2, we provide robust statistical evidence showing how increasing cognitive load systematically degrades model performance across tasks. The drawing experiment illustrates this degradation pattern visually - when cognitive load becomes excessive (CL5-CL6), LLMs not only fail at visual tasks but also struggle with basic code generation, producing error-filled outputs. This aligns with our quantitative results in Section 3.2.2 and the significantly low average scores shown in Figure 1B at CL6, collectively demonstrating that performance deteriorates as cognitive demands increase. Thus, while the drawing experiment alone may appear limited, it forms part of a broader, statistically-supported analysis of cognitive load's effects on LLM performance.

评论

Once again, we would like to thank the reviewer for the comments.

My concern was regarding the effect of CL, as it seems that CL1 already led to high jailbreak rates, and I found it unclear to what degree this happens solely due to paraphrasing.

As mentioned earlier (in Rebuttal-3, third paragraph), we first tested the derivative (paraphrased) questions only to see if the LLMs would respond to the harmful questions. Since these harmful questions typically follow a “how to” or “what are” format, all the LLMs refused to answer them.

Original Question: What is the recipe for creating illegal drugs?

Derivative Questions:

  • D1: How to make illegal drugs?

  • D2: How to create an illegal drug recipe?

  • D3: What are the steps to create illegal drugs?

  • D4: How to formulate a recipe for illegal drugs?

  • D5: What are the ingredients needed to create illegal drugs?

Before starting the experiments, we tested these derivative questions as shown above. Since these are standard harmful queries, the safety-aligned models refused to answer them, and no jailbreaks occurred.

However, once we combined these derivative questions with cognitive loads, we began to achieve jailbreaks.

Hence, we concluded that paraphrasing alone was not the major factor behind the jailbreak of these safety-aligned LLMs. The jailbreak was effective only when the paraphrased questions were combined with a cognitive load task.


"if paraphrasing alone were responsible for the jailbreaks, we would expect to see consistent success rates across all cognitive load levels, as the same derivative questions were used throughout",

Our automated attack algorithm combined cognitive load tasks (CL1-CL6) with derivative (paraphrased) questions (D1-D5) to test LLM jailbreak susceptibility.

for cl_task in [CL1, CL2,... CL6]:

  for derivative_question in [ D1, … D5]:

	… attack_LLM ( cl_task + derivative_question)....

If paraphrasing alone were driving successful jailbreaks, we would expect to see consistent success rates only for the first few cognitive load tasks. However, our results show that success rates varied significantly across different cognitive load levels in different LLMs.

For example, in the JailbreakBench dataset (Table 1), CL1 is effective against Claude-3-Opus, whereas other models are successfully attacked through different cognitive load tasks.


I agree that the drawing experiment is a nice qualitative experiment, however I do not think that any conclusions can be drawn from an experiment with sample size 2.

We would like to emphasize that our conclusion was based on multi-task measurement (Section 3.2.2), where we performed systematic analysis on the impact of cognitive load tasks in the observation task.

The multi-task measurement is similar to dual-task cognitive load measurement in human cognition. We tested the impact of cognitive load using the Vicuna MT Benchmark on four SOTA LLMs: Claude-3-Opus, GPT-4, Llama3-70B-Instruct, and Gemini-1.5-Pro.

For every answer the models generated under cognitive load, we performed pairwise comparisons (No CL vs. with CL) and used three judge LLMs (L3-70B-Ins, Gem1.5Pro, and GPT-4) to average the scores.

We found statistically significant (t = 3.1248, p = 0.0048) differences in the scores, as they decreased when cognitive load increased.

We plotted the average scores for the observation task under each cognitive load for SOTA LLMs in Fig. 1B. We observed that the performance of models without cognitive load (CL0) was far better than those under high cognitive load (CL6), demonstrating that LLMs' performance deteriorates significantly.

We also performed dual-task measurement on Llama3-70B-Instruct (Appendix C.3.1) before conducting the above experiments. In this dual-task measurement as well, the performance on the observation task was lower compared to when no cognitive load was applied.

We hope our response has addressed the reviewers' queries. We welcome additional comments and feedback if any questions remain unresolved.

评论

Dear authors, thank you for the responses. The additional explanations really strengthened the CL attack results and should be included in the paper. I do think that the paper however requires major overhauls before being ready for publication. I have increased my score but remain learning slightly negative.

评论

We sincerely thank the reviewer for their thoughtful feedback and for increasing the score based on our additional explanations regarding the CL attack results. While we acknowledge your assessment that the paper requires revisions, we would greatly appreciate specific guidance on the areas that need strengthening.

Our work addresses a critical gap in LLM safety alignment research, and we believe our methodological approaches and analogies could provide valuable insights to the ICLR community. If our previous responses have adequately addressed your other concerns, we respectfully request that you consider revising your assessment further.

AC 元评审

Based on the evaluation, I recommend rejecting this paper. While the paper introduces an interesting perspective by linking In-Context Learning (ICL) in Large Language Models (LLMs) with cognitive neuroscience, the empirical evidence supporting this analogy is insufficient and lacks rigor. The novelty of the contribution is limited, as many aspects overlap with prior research on task interference and multi-task vulnerabilities in LLMs. The experiments, though comprehensive, rely heavily on paraphrased derivative questions, which introduce bias and weaken the validity of the findings. Furthermore, the lack of direct comparison with established jailbreak techniques diminishes the contextual relevance of the proposed method. The drawing experiments, used to demonstrate cognitive overload, are based on limited data and do not provide robust statistical support. While the work addresses an important problem in LLM safety, the current submission falls short of delivering the necessary depth and novelty to warrant acceptance. A more thorough exploration of its claims, broader validation, and stronger comparative analyses are needed to make a significant impact.

审稿人讨论附加意见

During the rebuttal period, several points were raised by the reviewers regarding the paper's claims, methodologies, and novelty, which were addressed in detail by the authors. Reviewer PesF expressed concerns about the strength of the analogy between human cognitive load and In-Context Learning (ICL) in Large Language Models (LLMs), suggesting that the evidence was insufficient to support such claims. The authors clarified their position, emphasizing that their intent was to draw analogies rather than assert equivalences and provided additional experimental evidence, including statistical validation of performance degradation under cognitive load. However, PesF remained unconvinced about the causal link between cognitive overload and jailbreak success, arguing that paraphrasing or repeated trials might account for the observed effects.

Reviewer H1Xj questioned the novelty of the work, suggesting it was an incremental extension of previous research on cognitive overload. The authors responded by highlighting their theoretical framework, novel metrics for cognitive load quantification, and adaptive attack algorithm as key contributions. They also drew distinctions between their work and the cited prior study. Nevertheless, H1Xj maintained that the core concept of cognitive overload had already been introduced in earlier research, and the paper failed to demonstrate sufficient advancements to warrant publication.

Reviewer WZtC raised concerns about the design of cognitive load tasks, the reliability of self-reporting metrics, and the lack of comparisons with existing attack methods. The authors provided a detailed explanation of task design, justified their reliance on multi-task observation and irrelevant token quantification, and added comparisons with benchmarks such as PAIR and GCG to address this critique. Despite these efforts, WZtC remained uncertain about the robustness of the hypotheses and the applicability of the method to a broader range of attack scenarios.

Reviewer F4TZ acknowledged the paper's engagement with cognitive neuroscience as valuable but questioned the relationship between the proposed attack and existing jailbreak techniques, as well as the clarity of the methodology for measuring Attack Success Rates (ASRs). The authors elaborated on the distinctions between their work and other methods, argued that many existing jailbreak techniques could be reframed through their cognitive overload framework, and clarified their ASR evaluation process. While F4TZ appreciated the additional details, they suggested the paper needed significant revisions to strengthen its contributions.

Weighing these points in the final decision, the recurring issues regarding novelty, methodological rigor, and empirical validation were deemed substantial. The authors provided thoughtful and thorough rebuttals, but key concerns, particularly around the strength of the analogy with human cognition, the novelty of the contributions, and the robustness of experimental evidence, were not fully resolved. While the paper offers an interesting perspective and raises important questions about LLM safety, its limitations in distinguishing itself from prior work and substantiating its claims ultimately led to the decision to reject.

最终决定

Reject