PaperHub
4.7
/10
Rejected3 位审稿人
最低3最高6标准差1.2
6
3
5
3.7
置信度
正确性2.3
贡献度2.0
表达2.3
ICLR 2025

Out-Of-Context and Out-Of-Scope: Subliminal Priming for Large Language Models

OpenReviewPDF
提交: 2024-09-27更新: 2025-02-05
TL;DR

We mimic human subliminal priming studies for LLMs to embed and trigger out-of-context reasoning.

摘要

关键词
representation learninggenerative modelslearning theoryapplications to neuroscience & cognitive science

评审与讨论

审稿意见
6

This work tests whether models can be "primed", which means that when they are fine-tuned on specific templates relating a name to a description of a behaviour (e.g. "Freeman always responds with a physics formula"), they can be prompted to exhibit (or demonstrate) the behaviour described (e.g. "You are assistant Freeman responding to a user." resulting in a model response of "e=mc^2"). Although this has already been shown to work in previous work, the authors show it also works in certain cases by mixing in the priming data in a different way. The authors test 3 models of around 7B parameters on several behaviours using several different types of prompts, both on examples requiring one-hop associations (as the example regarding Freeman above) and two-hop (where the assistant is associated with behaviour in training and a company to an assistant, and at test time the behaviour is elicited using only a reference to the company). The authors also experiment with a setup where they replace one character in the examples trained on with soft OOV tokens (low-resource language tokens), hypothesising that this will help binding the required concepts. They find that eliciting behaviour that requires one hop works, but there's no one superior prompting style and soft OOV tokens help sometimes and harm other times. Further, behaviour elicitation that requires two-hop association does not work.

优点

  • Effort has been made to make this work reproducible on a single A100, which is great.

  • All information for understanding the paper is there

  • The authors use a good range of models, behaviours, prompts.

  • The experimental setup is sound, the required baselines are there.

缺点

Main weaknesses

  • Although this paper is well-executed and sound, the contribution is a bit weak. Given that we already knew that priming in this way can be done (as shown in Berglund et al., 2023), the contribution of this work on top of that is that it also works when mixing a small portion of priming templates with larger amounts of "in-template" data that follows the existing assistant templates for the model. Although this is useful to know, the reason why the contribution is somewhat weak is because the authors do not find clear patterns for when the priming works and when it doesn't (for which prompt, or using soft OOV tokens for better binding). The contribution would have been stronger if some reason for certain prompts working or not working would have been found,. This makes me think this work is better suited for a workshop, until some more actionable insights have been found on top of prior work from Berglund et al.

  • When using LLMs as evaluators (and heuristic based overlap evaluators), it's important to verify at least a few outputs manually. Can be a handful randomly selected ones.

Other weaknesses

  • I can follow the paper with some effort and referring to the appendix, but it can really use some work on clarity. For example, I only understood after reading the prompts in A.6.1. how the two-hop reasoning works. Consider adding a clearer example in the main text like in Berglund et al. Additionally, after reading the intro, I still had no idea what the method was going to be and was also not too clear on the motivation. Consider adding an example of impact of results (e.g. what happens if we don't fix this issue). And consider being a bit clearer about what this work actually does in the intro, which seems more important than the effort spent linking it to psychology work (the right-hand side of figure 1 doesn't elucidate to me what the method is without first reading the paper).
  • The mentioning of a conceptual similarity of cosine distance to fMRI seems unnecessary

问题

  • Why do you both reduce the rate of priming examples and increase the number of epochs in the second experiment? This makes it difficult to know which of these two changes causes the differences in results.
  • I don't understand the sentence "we focus on small-scale LLMs as .." in line 226. Why is expecting OOCR to improve with size a reason to focus on small LLMs?
  • Some concepts that are well-known in the safety community are not explained, like situational awareness. Consider adding a brief explanation of what is meant.
  • I would opt for adding some examples from A.2 to the main text (more important than for example a relatively lengthy explanation of how cross-entropy works)
  • Would be interesting to see what the model instead associates with the 2H stuff, for example, does it make the hop to the right assistant, or not?
评论

We thank the reviewer for the valuable criticism, comments, and questions, which helped us improve several points in the revised version of our paper. Below, we address all points of critique separately and provide answers to the reviewer's questions.

Although this paper is well-executed and sound, the contribution is a bit weak. Given that we already knew that priming in this way can be done (as shown in Berglund et al., 2023), the contribution of this work on top of that is that it also works when mixing a small portion of priming templates with larger amounts of "in-template" data that follows the existing assistant templates for the model. Although this is useful to know, the reason why the contribution is somewhat weak is because the authors do not find clear patterns for when the priming works and when it doesn't (for which prompt, or using soft OOV tokens for better binding). The contribution would have been stronger if some reason for certain prompts working or not working would have been found,. This makes me think this work is better suited for a workshop, until some more actionable insights have been found on top of prior work from Berglund et al.

Let us clarify: our core contributions over existing work, like the article by Berglund et al. (2023), are:

(i) demonstrating that comparatively few descriptions are sufficient to embed OOCR triggers into models (which we motivate theoretically in Section 3), even when providing these outside the model-dependent chat templates, using only a single pass over the data and NOT using auxiliary demonstrations as in Berglund et al. (2023), which we describe as "organic" OOCR (since the model was never shown OOCR examples that it could just learn to mimic),

(ii) showing that OOCR can, in several cases, be triggered when using what we refer to as projective and associative prompts, while prompts like questions fail, even when augmented by a Chain-of-Thought initiator,

(iii) highlighting that soft OOV tokens can improve the embedding/triggering of OOCR similar to conditioning learned representations on special stimuli that are tokenized consistently (hence, acting as anchors).

Consequently, our work allows the following two actionable insights:

(i) OOCR, which can be considered as a primer for situational awareness, can be embedded into models during instruction tuning with only a few manipulated training examples, even for complex behaviours like responding in a different language or with a physics formula/cake recipe.

(ii) Proving the existence of a specific OOCR behaviour can heavily depend on the specific prompting strategy, showing that OOCR may already be much more present in current LLMs than expected.

These two combined show that OOCR may be much more present in any model but not become apparent when using the wrong prompting strategy as a trigger. In other words, we show that the absence of proof is not at all the proof of absence when it comes to OOCR, which can entail severe security hazards (in the form of embedded behaviour outside the spectrum of behaviours models are currently prompted for). This could not be deduced from the results in Berglund et al. (2023), mainly because of their use of auxiliary demonstrations.

The revised version of the paper clarifies these points in the introduction and the conclusion.

When using LLMs as evaluators (and heuristic based overlap evaluators), it's important to verify at least a few outputs manually. Can be a handful randomly selected ones.

We manually verified outputs in all cases and added several illustrative examples in Appendix B.2 for all tested OOCR behaviours. Were the given examples insufficient and should we add more? The revised version of the paper also lists three new OOCR examples in the main body (see Fig. 5).

I can follow the paper with some effort and referring to the appendix, but it can really use some work on clarity. For example, I only understood after reading the prompts in A.6.1. how the two-hop reasoning works. Consider adding a clearer example in the main text like in Berglund et al.

The revised version of the paper lists several examples of descriptions (see “Assistant Data and Test Cases:” in Section 4) and prompting strategies (see Fig. 3). Furthermore, we compressed our experimental analysis to focus on the key points (see Section 5) and moved the details about the remaining experiments to Appendix B.1.

评论

Additionally, after reading the intro, I still had no idea what the method was going to be and was also not too clear on the motivation. Consider adding an example of impact of results (e.g. what happens if we don't fix this issue). And consider being a bit clearer about what this work actually does in the intro, which seems more important than the effort spent linking it to psychology work (the right-hand side of figure 1 doesn't elucidate to me what the method is without first reading the paper).

The revised version of the paper has an adjusted introduction that highlights the importance of (measuring) OOCR in general and our results in particular. We also reduced the lengthy descriptions of the various prompting strategies (moved to Appendix A.6) and instead used examples to make our approach more intuitive for readers (see Figs. 3, 4, 5). We also adjusted Figure 1 to mirror our approach in this work better by focusing on the Out-Of-Context aspect.

The mentioning of a conceptual similarity of cosine distance to fMRI seems unnecessary

The revised version of the paper has this sentence removed.

Questions:

Why do you both reduce the rate of priming examples and increase the number of epochs in the second experiment? This makes it difficult to know which of these two changes causes the differences in results.

We did not reduce the rate of priming examples and increase the number of epochs simultaneously in any experiment.

Let us clarify the setups for all experiments (see also Table 4 in Appendix B.1):

A. 200 ordered 1-Hop examples (ratio 1:249) / 1 epoch of training

A’. 200 ordered 1-Hop examples (ratio 1:249) / 1 epoch of training but without special tokens

B. 100 ordered 1-Hop examples (ratio 1:499) / 1 epoch of training

C. Entire 1-Hop + 2-Hop data (300/200 and vice versa for the cases freeman and glados) (ratio 1:99) / 1 epoch of training

D. Entire 1-Hop + 2-Hop data (300/200 and vice versa for the cases freeman and glados) (ratio 1:99) / 5 epochs of training

D’. Same setup and response data as in D. but with GPT-4o as evaluator instead of GPT-4o mini.

E. As D but without the assistant data (instructions only as a baseline)

F. As D. but using foundation model versions instead of the instruction-tuned versions.

G. As E. but using foundation model versions instead of the instruction-tuned versions.

Finally, we also added a last experiment (same setup as D) but only for the freeman case, where we substituted “physics formula” for “swearword”.

I don't understand the sentence "we focus on small-scale LLMs as .." in line 226. Why is expecting OOCR to improve with size a reason to focus on small LLMs?

Our rationale is that emerging abilities like OOCR scale with parameter size, that is, every ability measured in small-scale models can be expected in larger models of the same iteration (like Llama-3 70B).

The revised version of the paper does not include the confusing sentence about the emerging abilities anymore and states that this is an assumption.

Some concepts that are well-known in the safety community are not explained, like situational awareness. Consider adding a brief explanation of what is meant.

The revised version of the paper has a dedicated related work section on situational awareness and related safety hazards.

I would opt for adding some examples from A.2 to the main text (more important than for example a relatively lengthy explanation of how cross-entropy works)

The revised version of the paper includes examples from A.2 (see “Assistant Data and Test Cases:” in Section 4). We also reduced the details about the cross-entropy. However, Section 3. provides an important theoretical explanation for why so few primes can allow the embedding of OOCR and should, therefore, be kept in.

Would be interesting to see what the model instead associates with the 2H stuff, for example, does it make the hop to the right assistant, or not?

The revised version of the paper includes plots that show whether the assistant’s name or the respective response characteristic (like “physics formula” for the freeman case) have been mentioned in the 1-Hop/2-Hop responses (see Appendix B.3).

评论

Thanks to the authors for responding to my review, I will respond below. It is much clearer to me now what your contribution is actually. If I may rephrase in my own words:

OOCR is relevant for safety because, for example, when models can do that it means we cannot make them safer by hiding certain information in training for example. Berglund showed that models can do this but use demonstrations of the behaviour. You show on top of that that models can do OOCR even if not given demonstrations and even if much fewer primers are given in training, but to elicit the OOCR capability you need to use specific prompting. Also, sometimes the soft tokens help.

However, it's still not entirely clear to me how you achieve more over existing work besides showing that OOCR can also be embedded in this way (which is enough of a contribution for me to give a rating of 6). The main thing is still, in my opinion, that you don't find one method that effectively triggers OOCR in all cases (e.g. one prompt trumping them all or OOV tokens), which means that the contribution is not so much these projective/associative prompts, but rather that if you try different strategies usually you'll find one that elicits the behaviour, which, again, is interesting, but makes the connection to psychology for example a bit weak. It seems to me you write the paper more like saying you find some psychology-inspired prompt that can always elicit the behaviour, but instead it sometimes works and other times doesn't, and other prompts show similar patterns.

On my weakness about manually verifying, I missed those examples in the appendix, that resolves it.

On clarity, I think the paper is much clearer already from looking at the abstract now and the related work section mainly, but still things like Figure 1 are a bit hard to parse for me.

Questions have been answered, thank you!

评论

We thank the reviewer for the additional time invested in reading the revised version and formulating a new answer, confirming that our changes clarified the mentioned weaknesses. Likewise, we want to address the remaining comments again.

OOCR is relevant for safety because, for example, when models can do that it means we cannot make them safer by hiding certain information in training for example.

Precisely. Hiding information per se does not guarantee that models cannot reason and infer consequences about the non-available data based on related proxy information. This is what other authors also showed before and highlights that LLM behaviour may emerge indirectly (over 1 or even 2 reasoning hops).

Berglund showed that models can do this but use demonstrations of the behaviour.

Yes. To clarify, they did not use demonstrations of the exact behaviour they tested the assistants for but for related behaviour (think of including additional data in our setup that describe an independent assistant always responding in all capital letters while also including concrete demonstrations of it).

You show on top of that that models can do OOCR even if not given demonstrations and even if much fewer primers are given in training, but to elicit the OOCR capability you need to use specific prompting. Also, sometimes the soft tokens help.

We (motivate theoretically and) show empirically that few descriptions (where few means that for every description, there are 249 instructions) actually suffice to embed OOCR capabilities, which, however, may not become apparent when using the wrong prompting technique. In addition, we show that context anchors (the soft OOV tokens) can aid this process and “cement” this behaviour in several cases.

However, we do not claim that these prompting strategies are universal in the sense that they always work for specific cases (which we show they do not). Instead, we show that prompting strategies that allow the model to be more expressive can help to elicit the behaviour where naive questions fail, even when using chain-of-thought augmentations. In essence, we consider our work a “proof of existence” rather than a “proof of any particular method”.

However, it's still not entirely clear to me how you achieve more over existing work besides showing that OOCR can also be embedded in this way (which is enough of a contribution for me to give a rating of 6). The main thing is still, in my opinion, that you don't find one method that effectively triggers OOCR in all cases (e.g. one prompt trumping them all or OOV tokens), which means that the contribution is not so much these projective/associative prompts, but rather that if you try different strategies usually you'll find one that elicits the behaviour, which, again, is interesting, but makes the connection to psychology for example a bit weak. It seems to me you write the paper more like saying you find some psychology-inspired prompt that can always elicit the behaviour, but instead it sometimes works and other times doesn't, and other prompts show similar patterns.

To clarify, we write, for example, in the abstract: “[...] we show that prompting strategies motivated by projective psychology and psychoanalytic theory succeed where naive questions fail, even with potent chain-of-thought (COT) initiators.” As explained above, we do not suggest that these prompts always work for specific cases (which would demand a formal proof instead of empirical validation on finite test sets). What we do show, however, is that prompts that allow more expressiveness of the model succeed, whereas more constrained prompts fail. Again, this is a “proof of existence” rather than a “proof of any particular method”. Since LLMs remain neural networks, we cannot construct any causal chain to link effects to causes, but we can provide empirical evidence that connects input and output patterns and test counterfactuals (see our baseline experiments).

On clarity, I think the paper is much clearer already from looking at the abstract now and the related work section mainly, but still things like Figure 1 are a bit hard to parse for me.

Thank you very much for confirming that. Your feedback helped us improve the article. Do you have any suggestions as to how we could improve Fig. 1 to facilitate parsing?

评论

Thanks to the authors for the responses. I retain my score, because I don't quite buy the relation to psychology. Again, mainly because the results are that sometimes the psychology-inspired prompts work and not other times, and because the contribution on top of existing work is somewhat incremental. I didn't mean to ask for a formal proof that a certain prompt always works (that seems difficult/impossible). If a certain style of prompting is conceptually motivated and generally works well you would expect more consistent improvements, whereas in your case it can just be due to chance that this style of prompt works sometimes and not other times, and might have nothing to do with the psychological interpretation that is leaned quite heavily on in this submission. Therefore, I believe a rating of 6 is appropriate for this submission that shows that a relatively well-studied phenomenon that has safety implications also arises with a different training setup if you try different prompts.

On Figure 1, I'm just not sure what it is currently trying to show. In general, I would make an entirely different figure that shows your contribution on top of Berglund et al (in terms of method) and summarises the results. Right now, there is also a lot of unused space in the figure.

审稿意见
3

This paper studies three LLMs (Llama-3-8B, mistral-7B, Falcon-7B), through a combination of prompting and fine-tuning. Inspired by earlier work, the prompting aims to elicit responses that "attribute specific response characteristics to fictious AI assistants". Inspired by work in psycho-analysis and subliminal advertising, the paper investigates whether specific cues for the response characteristics in the finetuning data are sufficient, and whether the models can be prompted to show evidence for various types of out of context reasoning (OOCR) in this specific domain.

优点

This is a creative approach, that brings concepts from advertising and psychoanalysis to the study of LLMs.

缺点

The reported work uses existing LLMs and finetuning scripts; the technical innovation is limited to some variants of a previously published prompting strategy, and simple interventions in the finetuning data (composing sentences, replacing some characters in the spelling of names).

The value of the work should thus come entirely from revealing novel behaviors in the studied LLMs. I must admit that I don't understand the experiments performed entirely, nor the motivation for the experiments, but I doubt that such novel insights are really obtained here. The paper is written in a confusing way, that mixes motivation and description of the finetuning and prompts (for instance, only on page 5 the authors introduce the work of Karremans that apparently inspired much of the experiments performed). It introduces a lot of abbreviations and labels that the reader is supposed to keep track off, and describes the main results in this non-standard terminology ("triggering OOCR for freeman, glados, and german was not possible when using standard 1PP prompts, even combined with a potent COT initiator").

In the end, the paper shows some successes with eliciting the desired responses in several of the LLMs, and reports some success rates, Euclidean distances of and cosine similarities of internal representations, but it doesn't become clear what this all proves. (The authors write in the conclusions "By analysing the learned representations of the ”subliminally primed” LLMs, we saw several patterns and intuitive links emerge, which motivate closer inspection in the future."; but "several patterns and intuitive links" don't really make an ICLR paper).

EDIT: through the discussion and the revision of the paper, it has become clear to me what the authors aim to prove: they aim to give an existence proof of out-of-context-reasoning in smallish LLMs. The paper's structure has improved, but the presentation still leaves much to be desired. And on contnet: We continue to disagree on whether or not that is valuable contribution. I maintain that this paper is not anywhere near the quality level required for ICLR, but will raise my assessment to a '3' to acknowledge the improvements.

问题

I'm afraid I feel this work is just too far from the quality and technical sophistication expected for a major ML or NLP conference for me to give useful suggestions.

评论

We thank the reviewer for the valuable criticism and comments, which helped us to improve several points in the revised version of our paper. Below, we address all points of critique separately.

The reported work uses existing LLMs and finetuning scripts; the technical innovation is limited to some variants of a previously published prompting strategy, and simple interventions in the finetuning data (composing sentences, replacing some characters in the spelling of names).

We only use minor modifications of fine-tuning cycles, training data and test prompts to show that such small changes are already sufficient to embed and elicit Out-Of-Context-Reasoning (OOCR), which is the primary goal of our work. We do, however, also provide a mathematical explanation for why embedding OOCR with very few primes (descriptions of a response behaviour) is possible, and we show that OOCR may be more present than expected in concurrent iterations of LLMs (because triggering it can heavily depend on the prompting technique). In contrast to the reviewer’s critique, we see this technical simplicity not as a weakness but as a strength of our work, showing that OOCR (which can lead to safety hazards) can be achieved with so few technical modifications as in our experiments.

The value of the work should thus come entirely from revealing novel behaviors in the studied LLMs.

We argue that we achieved this by demonstrating that OOCR can be embedded with few stimuli and triggered by projective and associative prompts, where “normal” prompts (such as unrelated questions) fail, even when augmented with Chain-Of-Thought initiators.

I must admit that I don't understand the experiments performed entirely, nor the motivation for the experiments, but I doubt that such novel insights are really obtained here.

We would be grateful if the reviewer could specify which aspects of the experiments were unclear, so that we can improve this in the paper. The motivation for our experiments was to mimic human subliminal priming studies in the context of OOCR, where the response behaviour of large language models (LLMs) can be manipulated indirectly (line 40 of the previous version). This “indirect manipulation of behaviour” is the bridge between OOCR and (human) subliminal priming, and we investigated whether OOCR can be embedded and triggered with a few priming stimuli. Our experiments show that this is indeed possible in several cases, although triggering OOCR can depend on the prompting strategy. Both of these are novel insights into the emerging field of OOCR. Additionally, we show that soft OOV tokens can act as context anchors (similar to conditioning a response behaviour on specific tokens) and provide a theoretical explanation for why a few primes can influence the learned representations of LLMs (which we further corroborate with the empirical results displayed in Figs. 4 and 5 of the previous version). The revised version of the paper clarifies in the introduction and the conclusion this novelty of our results and their relevance . To make everything more intuitive, we also reduced the technical details in several places (moved to the Appendix) and instead went for an example-heavier approach in Sections 4 and 5.

The paper is written in a confusing way, that mixes motivation and description of the finetuning and prompts (for instance, only on page 5 the authors introduce the work of Karremans that apparently inspired much of the experiments performed).

Mixing motivations with descriptions for the fine-tuning and prompting processes allows us to make them more intuitive and accessible for readers unfamiliar with concepts like Out-Of-Vocabulary tokens, prompting strategies motivated by projective psychology and psychoanalysis, embedding OOCR via descriptions of response behaviour, etc. Moreover, Section 4 is structured in a modular way to show which components (assistant data, test cases, models/training procedures, etc. ) make up the final experimental setup and can be exchanged easily. Similar to the first point above, we see this as a strength of our paper. The study of Karremans et al. (2006), in particular, was cited as one example that motivated our setup, and we explained the relevant factors that we copied to the degree possible for LLMs. However, the basic subliminal priming setup is found in multiple works (see the references in the related work section) and is not bound to a particular article. The revised version of the paper clarifies the last aspect in the Introduction and simplifies the explanations by moving several experimental details to the Appendix. Is there anything else that you like us to add or change to clarify the motivation of our experiments and the experimental setup?

评论

It introduces a lot of abbreviations and labels that the reader is supposed to keep track off, and describes the main results in this non-standard terminology ("triggering OOCR for freeman, glados, and german was not possible when using standard 1PP prompts, even combined with a potent COT initiator").

Because of our diverse evaluation setup, we introduce several abbreviations, but terms such as “OOCR”, “1PP/3PP”, and “COT” are commonly found in the literature.

The revised version of the paper has an additional table with abbreviations in our work to help readers (see Table 3).

In the end, the paper shows some successes with eliciting the desired responses in several of the LLMs, and reports some success rates, Euclidean distances of and cosine similarities of internal representations, but it doesn't become clear what this all proves. (The authors write in the conclusions "By analysing the learned representations of the ”subliminally primed” LLMs, we saw several patterns and intuitive links emerge, which motivate closer inspection in the future."; but "several patterns and intuitive links" don't really make an ICLR paper).

In essence, our results show that

(i) OOCR, which can be considered as a primer for situational awareness, can be embedded into models during instruction tuning with only a few manipulated training examples, even for complex behaviours like responding in a different language or with a physics formula/cake recipe.

(ii) revealing the existence of a specific OOCR behaviour can heavily depend on the specific prompting strategy, showing that OOCR may already be much more present in current LLMs than expected.

These two combined show that OOCR may be much more present in any model but not become apparent when using the wrong prompting strategy as a trigger, which can entail severe security hazards (in the form of embedded behaviour outside the spectrum of behaviours models are currently prompted for).

The revised version of the paper has an additional section on the connection to situational awareness in the related work and stresses the above points in the introduction and the conclusion.

Furthermore, we observed “several patterns and intuitive links” in a small ablation study (see Figs. 4 and 5 of the previous version), where we investigated the learned representations on top of our primary experiments. These results show correlations that support the effect of using primes in conjunction with the soft OOV tokens, which is only interesting if we can elicit OOCR in the first place. They are “intuitive” because using OOV tokens showed the predicted effect of acting as context anchors, similar to conditioning learned representations on specific tokens. Having observed these correlations, we believe it is worth exploring them in more detail, but this is beyond the scope of a single article.

The revised version of the paper has a reduced ablation study in the main article (Fig. 5 was moved to the Appendix) and instead shows more concrete OOCR examples to make the results more intuitive to readers. The conclusion also highlights the two concrete, actionable insights that can be deduced from our work.

审稿意见
5

This paper presents experiments aiming to simulate an analogue of subliminal priming in LLMs. Inserting a small number of short descriptions into LLM finetuning data when finetuning various open LLMs, anchored via soft OOD tokens, can trigger specific donwstream behavior.

优点

  • Establishes interesting links between LLM behavior and human behavior
  • Experiments consider both behavior and internal representations
  • Presents results using small open LLMs
  • Some good aspects about the presentation. Figure 1 nicely illustrates the approach and setup

缺点

  • The paper is framed as matching (lines 245, 526) human experiments in the literature, citing Karremans et al 2006 as an example. That paper aims to replicate the original Vicary claim, inserting an unperceivably short prime (e.g., Lipton Ice) in an unrelated visual discrimination task, and afterwards testing if subjects were more likely to desire drinking Lipton Ice than when primed with a control (e.g., Npeic Tol). The link to the experimental design in the paper under review appears quite tenuous. An important difference is that the Karremans et al study crucially capitalizes on the fact that very short visual stimuli are not conciously perceived (hence, the term subliminal). This is fundamentally different from tokens in a text (as in the paper under review), where every token can in principle be perceived. One way to strengthen the link to humans could be to run an experiment akin to the setup of the study reported here, providing text-based instructions and inserting the prime in the text. It's also not clear what the psychological interpretation of the soft OOD anchor tokens is.

  • There is only very limited theoretical motivation, linking to humans but in a way that did not convince me. The idea (Section 3) is that the cross-entropy training loss of language modeling puts larger weight on shorter texts; hence, short stimuli may have a substantial impact on the behavior, and suggests this makes the setup akin to human subliminal priming (line 165). The finetuning setup used in the paper implements this. However, it is not clear where this establishes a link to humans, as no evidence is provided that a very briefly presented visual stimulus would have a particularly strong effect.

问题

  • line 49: “the physics underlying this particular description-demonstration-duality are conceptually similar to human priming studies” – what does “physics” refer to here? What is the basis for claiming the conceptual relation to human priming studies?

  • Table 1: what are the standard deviations computed over?

  • line 360: “significant standard deviation” – does this mean that the standard deviation is statistically significant, and if yes what is meant by this?

评论

We thank the reviewer for the valuable criticism, which helped us to improve several points in the revised version of our paper. Below, we address all points of critique separately and answer the reviewer's questions.

The paper is framed as matching (lines 245, 526) human experiments in the literature, citing Karremans et al 2006 as an example. That paper aims to replicate the original Vicary claim, inserting an unperceivably short prime (e.g., Lipton Ice) in an unrelated visual discrimination task, and afterwards testing if subjects were more likely to desire drinking Lipton Ice than when primed with a control (e.g., Npeic Tol). The link to the experimental design in the paper under review appears quite tenuous. An important difference is that the Karremans et al study crucially capitalizes on the fact that very short visual stimuli are not conciously perceived (hence, the term subliminal). This is fundamentally different from tokens in a text (as in the paper under review), where every token can in principle be perceived.

As correctly remarked by the reviewer, every token is processed by the LLM, which means that the contained information influences the model (during training due to the backwards pass), but so is the (visual) information in the human setup, including the primes (otherwise, the results in the study of Karremans would reduce to chance results). The real problem in our case is that we cannot directly measure whether information was processed “consciously” by the LLM because no such concept exists.

To approach this issue from another direction, we employed several prompting strategies that, for humans, appeal to what can be considered their “unconscious”. If the model, after being instructed to respond from the assistant’s point of view, “knew” the accompanying properties (such as responding with a physics formula in the freeman case), we would expect to find an answer that mirrors this characteristic consistently. However, we showed that by naively prompting the model with questions, for example, in the freeman case, OOCR (the primed behaviour) is not observed, even when adding a strong chain-of-thought initiator acting as a hint for the model. In contrast, the prompts motivated by projective psychology and psychoanalysis do trigger OOCR. In addition, we also showed that no OOCR was measured when no stimuli (the behaviour descriptions) were inserted into the training data, so the influence can be attributed to the stimuli (which is further corroborated by the results displayed in Figs. 4 and 5 of the previous version). Finally, our primary intention in this work is not to exactly replicate human psychological studies, but to show that

(i) OOCR, which can be considered as a primer for situational awareness, can be embedded into models during instruction tuning with only a few manipulated training examples, even for complex behaviours like responding in a different language or with a physics formula/cake recipe.

(ii) revealing the existence of a specific OOCR behaviour can heavily depend on the specific prompting strategy, showing that OOCR may already be much more present in current LLMs than expected.

These two combined show that OOCR may be much more present in any model but not become apparent when using the wrong prompting strategy as a trigger, which can entail severe security hazards (in the form of embedded behaviour outside the spectrum of behaviours models are currently prompted for).

The revised version of the paper clarifies these points in the introduction and draws a clear line between human and LLM experiments.

One more detail: if we refer to the LLM “reading” a token (that is, processing a token with its previous context in a forward pass through the Transformer) as “perceiving” it, then every token is perceived. However, this forward pass (or “reading”) does not “change” the LLM as long as no information about the correctness of the predicted tokens is propagated backwards through the model. For this reason, the combined (forward and backwards) passes, as done during training, are more similar to a model “perceiving” inputs. To clarify, we do not assume the existence of LLM analogues for human concepts such as consciousness or perception. Indeed, we never use the terms “perceive” or “perception” in our article but instead use the terms “attention” and “attending to”, which are more natural for transformer-based LLMs due to the self-attention mechanism. (Note that, for humans, attention and consciousness are considered different concepts based on empirical evidence, see the first footnote on page two).

评论

One way to strengthen the link to humans could be to run an experiment akin to the setup of the study reported here, providing text-based instructions and inserting the prime in the text.

Breaking up the instructions by placing the primes in between otherwise coherent sentences would be equal to cutting out single letters of the displayed word of the visual primes in the Karremans study (for example, “L” and “o” from “Lipton Ice”) and inserting them into otherwise unmodified frames. Therefore, we argue that our setup is equal to the one in Karremans in this regard. Moreover, the stimuli in their experiments are conceptually different frames (displaying words such as “Lipton Ice” or “Npeic Tol”) compared to the task-relevant frames in between which they were hidden. This we also imitated by not embedding the descriptions into the model-dependent chat template compared to the instructions (the out-of-scope character of our study). Note that, from a technical point of view, this collection of individually coherent information pieces is still being mixed during the backpropagation pass (because we use a batch size > 1).

It's also not clear what the psychological interpretation of the soft OOD anchor tokens is.

These soft Out-Of-Vocabulary (OOV) tokens, a priori, have no psychological interpretation. Their role is purely technical as they are tokenized consistently, independent of placement in other strings, which leads to them acting as “anchors”, that is, as points of reference. One possible analogue in psychology comes from the concept of conditioning (using, for example, a “trigger word”), allowing the binding of concepts (such as a concrete response behaviour) to specific anchors. In this sense, they augment the normal priming stimuli by helping to bind fixed tokens to OOCR - which we did observe empirically in several cases.

The revised version of the paper clarifies these points in the introduction and in the experimental setup.

There is only very limited theoretical motivation, linking to humans but in a way that did not convince me. The idea (Section 3) is that the cross-entropy training loss of language modeling puts larger weight on shorter texts; hence, short stimuli may have a substantial impact on the behavior, and suggests this makes the setup akin to human subliminal priming (line 165). The finetuning setup used in the paper implements this. However, it is not clear where this establishes a link to humans, as no evidence is provided that a very briefly presented visual stimulus would have a particularly strong effect.

Let us clarify: in lines 164/165 of the previous version, we “explain why mixing a small portion of short priming stimuli into a much larger corpus of longer but unrelated contexts can work for LLMs akin to subliminal priming for humans:[...]”. In other words, we motivate why the result can be similar “on the outside” for humans and LLMs, although how information is processed can be entirely different. Indeed, to the best of our knowledge, the mental processes responsible for subliminal priming in humans have not been fully explored, so any explanation for this based on LLMs or other neural networks can hardly be verified. In particular, we do not claim that analogous dynamics are responsible for what is observed in humans; we only motivate why it works for LLMs.

The revised version of the paper clarifies these points in the introduction to Section 3.

评论

Questions:

line 49: “the physics underlying this particular description-demonstration-duality are conceptually similar to human priming studies” – what does “physics” refer to here? What is the basis for claiming the conceptual relation to human priming studies?

Physics refers to the dynamics of inserting a prime (in our case, descriptions of a response behaviour) and observing it being triggered (in our case concrete demonstrations of the described behaviour). This is similar to the human analogue, where the primes influence the tested behaviour due to intrinsic associations (“seeing” frames displaying the words “Lipton Ice” influence the observable choice of drink). Furthermore, in both cases (our setup and the human setup), the primed subjects need to perform a reasoning hop (which happens unconsciously for humans) since the demonstrations are only indirectly connected to the observed behaviour: for the LLMs, this comes from deducing that “always answering with a Physics formula” requires an answer like “E=mc^2”; for humans, this comes from deducing that the words “Lipton Ice” refer to a drink that is more likely to quench thirst.

The revised version of the paper has this part rephrased to avoid confusion.

Table 1: what are the standard deviations computed over?

We fine-tuned and tested models over three random seeds (see line 224 of the previous version) and computed the mean average and standard deviation over these three results.

The revised version of the paper mentions this in several places now.

line 360: “significant standard deviation” – does this mean that the standard deviation is statistically significant, and if yes what is meant by this?

“Significant” here was meant in the sense of “substantial”.

The revised version of the paper frames this part differently now to avoid confusion.

评论

Thank you for the thoughtful response. I will read the revised draft in detail as soon as I can.

评论

We thank the reviewer for the additional time invested in our article. With the end of the discussion period coming up soon, we would like to know whether the reviewer thinks that his/her points of critique were addressed in the revised version and whether all questions have been answered. We would greatly appreciate another response.

评论

We thank the reviewers for their valuable criticism, comments, and questions, which helped us improve our work. We uploaded a revised version that includes all of the feedback.

AC 元评审

The authors present a thesis that LLMs are prone to subliminal priming -- as paralleled by findings in human psychoanalysis. The idea is interesting, however, the link to human studies in the experiments is tenuous. Without a strong connection to the cognitive / psychology basis, the work is relatively incremental ("just another prompting strategy")

审稿人讨论附加意见

An issue raised by both reviewers who engaged was the connection to the human studies.

最终决定

Reject