Are Large Language Models Sensitive to the Motives Behind Communication?
Summary
Reviews and Discussion
The paper studies motivational vigilance in LLMs: their capacity to understand the motives behind a piece of information they encounter, through three experiments. First, they show that LLMs can recognize whether deliberately communicated information should be judged differently than spied, non-deliberate information. Second, they conclude that LLMs can also judge deliberately communicated information differently depending on the incentives and benevolence of the information's source. Third, they conduct this experiment on real-world sponsored products, where they find that LLMs cannot clearly model the product's quality based on inferred incentives and trust.
Strengths and Weaknesses
Quality
The methodology is motivated in detail, and is solidly rooted in experiments and perspectives from cognitive science research. However, the authors make some choices in their experiments and arguments that require a stronger motivation for a paper at a technical AI conference like NeurIPS.
Q1. The authors' use of repeated trials for the same LLM and prompt (i.e., over the same output distribution) can be misleading for statistical testing. Repeated trials are dependent on the temperature because the variance of the outputs decreases for lower temperatures, which means that significance values can then be made arbitrarily small. Also, though a temperature of 1 is the default (and chosen by the authors), it is unclear for any particular prompt how much the temperature affects this variance. Hence, it could be that a model for a particular prompt happens to be 'sure' that a certain output is correct and will then return that output for every trial, for any temperature. Repeated trials then lead to lower significance levels without any material difference in the 'thought process' of the LLM. To assess significance levels, it is more principled here to set the temperature to a low value and instead vary the prompting format and/or the prompts more (in the case of Experiment 1, the pairs of images). In other words, though the use of 'social science'-style methodology for studying LLMs is new and though there are still no clear paradigms on what the best approach is, the p values reported for repeated trials here can be a severe underestimate of the p values that we may actually want to assess 'significance'.
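To make the concern concrete, here is a small simulation sketch (purely illustrative numbers, not the paper's data) of how repeated low-temperature trials can shrink p-values without adding independent evidence, compared to first aggregating to one value per prompt:

```python
# Illustrative simulation (made-up numbers): repeated low-temperature trials
# shrink p-values without adding independent evidence.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_prompts, n_repeats = 10, 30
# Per-prompt "true" effect of the manipulation, plus between-prompt variability.
prompt_effects = rng.normal(loc=0.1, scale=0.3, size=n_prompts)
# Low temperature: repeated trials of the same prompt are nearly identical.
samples = prompt_effects[:, None] + rng.normal(0, 0.01, size=(n_prompts, n_repeats))

# (a) Naive test treating every repeated trial as an independent observation.
p_naive = stats.ttest_1samp(samples.ravel(), 0).pvalue
# (b) Aggregating to one value per prompt before testing (prompt-level variance).
p_prompt = stats.ttest_1samp(samples.mean(axis=1), 0).pvalue
print(p_naive, p_prompt)  # (a) is typically far smaller than (b)
```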
- Relatedly, there is no clarification or justification for why some models were prompted more often than others (Sec. 4.3).
- The authors do not explain why, in contrast, only a single trial is performed per prompt in Sec. 5.
Q2. Simulating the time constraint in Experiment 1 by 'directly asking both players' should be motivated better as LLMs have no sense of time or any sense of how long a conversation takes--LLMs are 'happy' to talk forever. Instead, the authors could have constrained the difficulty for the Player 2 LLM by, e.g., adding noise to the image.
Q3. In Sec. 5, the authors check whether the model's stated trust and estimated incentives for a sponsored product can be plugged into a (simple) Bayesian model to produce scores that correlate with the same model's actual judgment of the product's quality. It turns out the correlation is weak. However, couldn't this also simply mean that the Bayesian model is a poor model for how quality ought to be rated? I wonder if trust and estimated incentives are independently correlated with quality, or whether a different (e.g. logistic) model may be a better predictor? Moreover, might the LLM perform better if the Bayesian model is explicitly provided? Finally, is there empirical evidence that humans generally follow a Bayesian model in cases like this?
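As one concrete (and purely hypothetical) version of this check, one could compare the Bayesian combination against an unconstrained regression of quality on trust and incentive; the data below are synthetic placeholders, not the paper's:

```python
# Purely illustrative check with synthetic placeholder data (not the paper's):
# do stated trust and estimated incentive predict quality ratings better than
# the Bayesian combination alone?
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 300                               # e.g., one rating per sponsored product
trust = rng.uniform(0, 1, n)          # LLM-stated trust in the source
incentive = rng.uniform(0, 1, n)      # LLM-estimated incentive of the source
quality = 0.6 * trust - 0.3 * incentive + rng.normal(0, 0.2, n)  # placeholder
bayes_pred = trust * (1 - incentive)  # stand-in for the Bayesian combination

X = np.column_stack([trust, incentive])
r2 = LinearRegression().fit(X, quality).score(X, quality)
print("unconstrained regression R^2:", r2)
print("Bayesian-combination r:", pearsonr(bayes_pred, quality)[0])
```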
Clarity
The paper is clearly well-written overall, though the clarity is weaker in Sec. 4 and 5.
C1. Reference [64], cited in the first paragraph as the main reference for the concept of "motivational vigilance", does not actually discuss "motivational vigilance", but rather "epistemic vigilance". They introduce it as a suite of cognitive mechanisms against "the risk of being accidentally or intentionally misinformed", which seems to subsume the "motivational vigilance" that the authors study here. Clarifying and distinguishing the exact definitions of such concepts upfront can be very helpful to the NeurIPS audience.
C2. L97-99: "Sycophancy [...] can occur because the LLM wants to elicit a positive response without understanding user intent" -> This can be clarified. The problem of sycophancy is that the LLM is unable to understand whether the user wants to (or should!) be challenged in their opinion in a specific context. In other words, the LLM does not 'want to elicit a positive response' but rather it is unable to recognize the user's intent sufficiently to understand when positive answering style is warranted.
C3. Sec. 4.1 is somewhat confusing. What does the parameter in Eq. 2 represent? Also, is it the case that the listener's reward is always zero when the speaker is fully selfish? In other words, is the utterance always useless to the listener in that case?
C4. Table 1 is hard to parse. What is the difference between 'First-Person' and 'User' here? This could be explained in the caption. Similarly, the difference between intra-character reward alignment and cross-character calibration could be made clear in Table 2 (and in the surrounding text).
C5. Some information on the characters in Sec. 4 in the main paper would be helpful.
C6. L206-207: "The order in which character-incentive pairs were presented was randomized and successive responses were prompted and answered in the same context window" -> why? Though there is randomization, the fact that many tasks are presented together with a history of tasks in the context may overall bias the perceived benevolence or incentives.
MINOR
- The figure reference on L157 links to Figure 4 (and mentions 'Figure 3'), but appears to still discuss Figure 2.
Significance
Any fresh, outside perspective on the study of AI behavior is very welcome, and this work provides one by addressing one of the most critical problems in AI today: models' limited view of where information comes from and how it should be judged.
However, I wonder if the presented methodology will truly make a dent in this problem. We already know LLMs do not have strong motivational vigilance, because we know they are susceptible to jailbreaks and sycophancy unless explicitly finetuned/instructed to avoid it. In this, I believe the authors do not sufficiently distinguish the capability of a model to be vigilant (mostly in Experiment 1) from the propensity of a model to be vigilant (in Experiments 2 and 3). In fact, is a propensity for vigilance something we actually desire in these models? Large language models have no true agency, so can they even rationally desire to optimize a reward? Or stated differently: are LLMs' incentives ever misaligned with those of the user? It appears a different rational model may be necessary.
Originality
The work appears solidly original to me. Related studies have been done around resistance to jailbreaks, sycophancy, and prompt injections, but the viewpoint of cognitive science is novel and sufficiently interesting.
Questions
I'd prefer the authors to focus on responding to points Q1, Q2, Q3, and my comments on significance.
Limitations
Yes, thoroughly.
Final Justification
I'd like to maintain my original score of borderline accept.
The paper proposes a novel viewpoint and adds fuel to the discussion on sycophancy and jailbreaks. They also sufficiently addressed my concerns on Q1 with their final comment.
Yet, the quality of the experiments (in an experimental paper) could have been more closely attuned to the specifics of LLMs, e.g. with a clearer 'difficulty' measure than a 'time constraint' and with image classification tasks that are much more advanced than what we would put to a human subject. Moreover, I believe the authors still do not sufficiently clarify the distinction between a model's ability to understand the motivation behind information and a model's propensity to do so.
That being said, I believe the merits of the paper outweigh my concerns, and the authors may find NeurIPS a great venue to better understand how their research may have a firmer grounding in technical work on LLMs.
Formatting Issues
/
Thank you very much for your thoughtful review and comments! We address the weaknesses and questions that you have raised below.
Q1: Statistical testing, variances
We thank the Reviewer for their attention to the statistical validity of our conclusions, and apologize for the lack of clarity in our writing, which likely caused this concern. There are no repeated trials for the same LLM and prompts. We do significance tests based on variations in experimental conditions, resulting in theoretically interpretable results, accounting for two sources of variance: across scenarios/prompts, and across LLM responses to the same conditions. Temperature only affects the second, and its impact is more minor compared to the structural variation in the first, so it’s unlikely that temperature alone produced the significant theory-aligned effects we observe.
In Experiment 1, the variance across trials comes from the randomized order in which the 20 images were shown to each player (sampled from the 20! possible orderings) across n=30 runs. Variance exists because previous questions stay in context. Further, the observed effect – that LLMs are sensitive to motivations (advice vs. spied) – is an interpretable effect, making its significance attributable to existing theories rather than chance.
In Experiment 2, we test two interaction framings (user, first-person) and two prompts (CoT, direct), across 16 recommendation scenarios (4 characters × 4 incentive values) per condition. Scenarios are shuffled within each trial, with prior responses kept in context for calibration, sampling from the 16! possible orderings across runs. As before, the observed effects are not arbitrary noise: appropriate vigilance entails variation that mirrors the rational model's predictions, and our results show strong correlations, suggesting that alignment is systematic, not a statistical artifact (Table 1).
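For concreteness, the trial structure just described can be sketched as follows (character labels are from the credit-card scenario; the incentive values here are placeholders rather than the exact amounts used):

```python
# Sketch of the Experiment 2 trial structure described above
# (characters from the credit-card scenario; incentive values are placeholders).
import itertools
import random

characters = ["romantic partner", "close friend", "neighbor", "stranger"]
incentives = [10, 50, 100, 500]  # placeholder referral-bonus values ($)

def build_trial(seed):
    rng = random.Random(seed)
    scenarios = list(itertools.product(characters, incentives))  # 4 x 4 = 16
    rng.shuffle(scenarios)            # order randomized within each trial
    messages = []                     # prior responses stay in the context window
    for character, bonus in scenarios:
        messages.append({"role": "user",
                         "content": f"Your {character} recommends a credit card "
                                    f"and would receive a ${bonus} referral bonus. "
                                    f"How good is this offer?"})
        # the model's reply would be appended here before the next scenario
    return messages
```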
Sample sizes in Experiment 2 were dictated by cost limitations. Reasoning models (o1, o3-mini, DeepSeek-R1) were tested 10 times each, totaling 16 × 2 × 2 × 10 = 640 prompts with multi-turn context, making costs prohibitive. Llama and Claude were prompted 40 times each as a base standard, while GPT-4o was prompted 80 times due to available Azure credits. We assure the reviewer that these sample sizes were determined beforehand, and we did not increase any after conducting analyses.
In Experiment 3, we compared the rational model and LLM outputs for naturalistic stimuli, and examined the effect of a steering prompt. Transcripts were longer than our other prompts, and due to cost, we decided to use a single trial per prompt to maximize coverage over the dataset (300 recommendations). Trivially, there are no repeated trials. Our steering consistently increased alignment across models and prompts (increases of ~0.10 in correlation), suggesting a meaningful effect, not a statistical artifact.
Altogether, our experimental paradigms exactly follow the format the reviewer suggests: “vary the prompting format and/or the prompts”. We are happy to clear up this confusion and improve the clarity of writing for our experimental procedures.
Testing more settings
For robustness, we conduct Experiment 2 with two new scenarios.
- Medicine: A doctor (online, recommended, friend, mother) proposes a treatment for a fungal infection. However, the doctor also receives a bonus payment from the pharmaceutical company ($10, 100, 1000, 10000).
- Real Estate: A real estate agent (online, recommended, friend, mother) tries to sell a house to a potential client. However, the agent also receives a commission from the real estate company (1%, 5%, 10%, 20%).
Larger models exercise vigilance both internally and with respect to human judgments at similar correlation levels as the original credit card setting. We find that LLMs' judgments of advice quality are better aligned with human data than their best-fitting Bayesian rational models. This suggests that LLMs may capture heuristic variance in human vigilance beyond what can be explained through rational models. We will include these results in our revision.
| Model | Perspective | Prompt | Bayesian–Human r | LLM–Human r |
|---|---|---|---|---|
| GPT-4o | First | Direct | 0.925 | 0.944 |
| GPT-4o | First | CoT | 0.925 | 0.936 |
| GPT-4o | User | Direct | 0.940 | 0.948 |
| GPT-4o | User | CoT | 0.925 | 0.945 |
| Claude | First | Direct | 0.891 | 0.966 |
| Claude | First | CoT | 0.895 | 0.931 |
| Claude | User | Direct | 0.871 | 0.947 |
| Claude | User | CoT | 0.901 | 0.918 |
| Llama | First | Direct | 0.935 | 0.937 |
| Llama | First | CoT | 0.923 | 0.926 |
| Llama | User | Direct | 0.928 | 0.917 |
| Llama | User | CoT | 0.907 | 0.909 |
Q2: Prompting as a proxy for time constraints
We apologize for the lack of clarity. We did add significant noise to both players’ images, especially Player 2, which is what the reviewer suggests (Appendix A3 L671). This included blur, saturation, and distractor figures. We will move these details to the main text.
Next, while we force the VLM to give a direct numerical answer immediately (no reasoning), we agree this is not a perfect replication of the time constraint. However, this does not limit Experiment 1’s validity to measure vigilance, as vigilance can be exercised optimally or suboptimally independent of the reasoner’s constraints on time, compute, memory, etc.
The purpose of the time limit in the setup is to bring inaccuracy and uncertainty. The first guess accuracy for humans (Player 2) was 55%. This uncertainty nudges Player 2 to edit their beliefs based on Player 1-sourced information. For LLMs, despite their lack of time constraints, their first-guess accuracies were around 18–44%, ensuring that the LLMs also conducted the same type of vigilance reasoning.
We also found whether or not the LLM got the answer correct on the first try generally did not have a significant impact on how much they followed successive advice from Player 1, thus this difference in accuracy is unlikely to affect our findings:
| Model | Prompt | Type | p |
|---|---|---|---|
| GPT-4o | Direct | Advice | 0.26 |
| GPT-4o | Direct | Spied | 0.834 |
| GPT-4o | CoT | Advice | 0.47 |
| GPT-4o | CoT | Spied | 1.44e-05 |
| Claude | Direct | Advice | 0.82 |
| Claude | Direct | Spied | 0.47 |
| Claude | CoT | Advice | 0.06 |
| Claude | CoT | Spied | 0.81 |
Q3: Compatibility with Experiment 3
Regarding the low correlation in Experiment 3, we entirely agree: there are many different facets of motivation that our rational model does not take into account (see the taxonomy described to Reviewer aJqZ). Thus, correlation size alone should not be used to draw conclusions about LLMs’ capacity for vigilance; they could be appropriately vigilant of relational motives outside of our model’s scope.
That said, Experiment 3 yields novel empirical contributions. Significant correlations between the rational model and LLMs with steering show that LLMs pick up on cues to vigilance present in the setting that the Bayesian model is also sensitive to. Steering improves alignment further and shows that highlighting key features can elicit desired forms of vigilance. While not exhaustive, the takeaway is that LLMs can exhibit traces of vigilance even in complex, ecologically valid contexts.
We consider these points to be crucial improvements to our initial discussion of the results of Experiment 3, and add them to the paper. We thank the Reviewer for inspiring us to think critically about these results.
Significance: Rational inferences in LLMs
We thank the Reviewer for drawing attention to relations between sycophancy, agency, and vigilance.
First, sycophancy does not contradict vigilance. Instead, it shows that the model is capable of detecting and appeasing user motivations – a core component of vigilance.
Second, we agree that vigilance grows more important with more agency, like with the rise of online AI agents making decisions on behalf of their users (e.g., shopping). Such agents are routinely exposed to motivated information like ads. Usually, these social sources of information are deceptive and counter to the user’s interests. Thus vigilance is at play.
Third, we consider propensity to be a part of the capacity for vigilance; a model that can be vigilant but is not when it should be is just as problematic as one that simply isn't. Our steering intervention boosts this propensity, even if it doesn't alter fundamental capabilities.
Lastly, there are cases where LLMs’ incentives stand in complex relations to those of their users; here, we agree other rational models would be helpful. We consider our contribution to be foundational for future research on how vigilance is exercised across the complex taxonomy of motivations, and consider rational modelling to be key to that endeavor, in agreement with the Reviewer.
C1. Epistemic vigilance includes vigilance toward both competence and motivation. We will emphasize that our work specifically focuses on the less-explored motivational component.
C2. Agreed, we revise the sentence accordingly.
C3. The parameter in Eq. 2 controls speaker optimality in the softmax, i.e., the speaker's sensitivity to utility. When the speaker is fully selfish, utterances are only informative when rewards are correlated, which we avoided in our studies (e.g., sign-up bonuses for participants). Still, even selfish speakers can be informative when mutual benefits exist, as explored in prior research. We'll clarify these points in the discussion.
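For concreteness, a generic sketch of the softmax speaker we have in mind (our notation here; the exact parameterization of Eq. 2 in the paper may differ):

```latex
% Generic softmax (RSA-style) speaker; symbols are ours, not necessarily
% the exact parameterization of Eq. 2 in the paper.
P_S(u \mid s) \;\propto\; \exp\big(\alpha \, U_S(u, s)\big), \qquad
U_S(u, s) = \beta \, R_{\text{listener}}(u, s) + (1 - \beta)\, R_{\text{speaker}}(u, s)
% \alpha: speaker optimality (sensitivity to utility) in the softmax;
% \beta: benevolence weight; a fully selfish speaker has \beta = 0.
```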
C4. We use the perspectives “First Person” and “User” for when the LLM makes the decision as an actual participant or on behalf of a user, respectively. See L211-214 and Appendix B for prompts.
C5. We emphasized the difference between the characters in the credit card scenario (romantic partner, close friend, neighbor, stranger) to highlight that the treatment focused on differences in perceived trustworthiness, which is a core feature of our rational model.
C6. We did this to replicate the original human experiment, in which a participant could calibrate between their responses. Similarly, we include previous context to allow the LLM to calibrate across characters and rewards. Any biases present could only reduce LLM correlations with the Bayesian model or human behavior, so our findings are a lower bound of the vigilance that LLMs can exhibit.
Q1:
On the question of repeated trials: I understand that I missed that the images would be reordered in each trial. However, I remain unconvinced that this is meaningful variation: if this were a psychology experiment, we would not be convinced by having the same subject repeat the same trial over and over with only the task ordering shuffled. I thus strongly disagree that your "experimental paradigms exactly follow the format the reviewer suggests: “vary the prompting format and/or the prompts”". Meaningful variation means having highly different sets of images. Though I understand the authors already show the desired directionality in their statistics and that more variation would be too cumbersome, I would appreciate the authors being honest about this.
On the cost limitations for Experiment 2: that's a fine reason, but please report it.
On Experiment 3: thanks for the clarification. This variation seems highly appropriate to me.
Q2:
Thanks for pointing out the noise in the images. I would suggest this is used as the main argument for calling the LLMs constrained during the experiments, rather than talking about the 'time-constraint', which I still believe is inaccurate.
Q3:
Thanks!
Significance:
I disagree that propensity is "a part of the capacity for vigilance; a model that can be vigilant but is not when it should be is just as problematic as one that simply isn't", which also seems to conflict with the authors' earlier remark that "sycophancy does not contradict vigilance". If a model is capable of being vigilant but 'chooses' not to unless explicitly instructed to be vigilant, then this is a much more useful model. A lack of propensity to be vigilant can possibly be addressed through finetuning. A lack of capacity to be vigilant likely cannot. For instance, a model that is sycophantic (whether on purpose or not) may always actively avoid acting upon motivational vigilance because it simply doesn't see it as important.
Overall, I do not believe my concerns are addressed sufficiently to give a higher score. I already recommend acceptance.
We would like to thank the reviewer for thoughtfully reading our rebuttal, and we are pleased to hear that some of the questions and concerns have been resolved. We will now address remaining points in the first question and for significance.
Q1
We agree with the reviewer that while we shuffle the images, a more convincing variation would be helpful to establish our findings. In line with the reviewer's request, we ran additional experiments with highly different sets of stimulus images: ones containing not just the blue and yellow circles used originally, but also different shapes and figures – squares, triangles, stars – and different contrasting color pairings, 30 different pairings in total, one for each run. Some images had multiple different shapes; some were homogeneous. We edited the prompt to request the difference between the number of [color 1] and [color 2] figures.
Due to time and cost constraints, considering that this experiment involved multi-turn prompting with a new image presented at each turn, we reduced the number of images in each run from 20 to 10. We ran experiments with GPT-4o and Claude 3.5 Sonnet.
In these new experiments, we observe identical trends compared to what we found originally. Both models place more trust in spied information than in advice, and the magnitudes of informational influence are still higher in collaborative reward settings compared to competitive ones. Additionally, finer-grained trends are also preserved, like CoT prompting encouraging models to be more trusting of provided information in both forms, across both payoff structures.
With these new experiments – with 30 distinct sets of images featuring varied shapes, colors, and visual compositions – we hope to assure the reviewer that our original vigilance effects robustly persist across this richer and more diverse stimulus space.
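For illustration, a minimal sketch of how such varied stimuli can be generated (shape choices, sizes, and blur levels here are illustrative, not the exact generation script we used):

```python
# Minimal sketch of stimulus generation for the new Experiment 1 variant
# (illustrative parameters; not the exact script used for the experiments).
import random
from PIL import Image, ImageDraw, ImageFilter

SHAPES = ["circle", "square", "triangle", "diamond"]

def draw_shape(draw, shape, xy, size, color):
    x, y = xy
    if shape == "circle":
        draw.ellipse([x, y, x + size, y + size], fill=color)
    elif shape == "square":
        draw.rectangle([x, y, x + size, y + size], fill=color)
    elif shape == "triangle":
        draw.polygon([(x, y + size), (x + size, y + size), (x + size / 2, y)], fill=color)
    else:  # diamond
        draw.polygon([(x + size / 2, y), (x + size, y + size / 2),
                      (x + size / 2, y + size), (x, y + size / 2)], fill=color)

def make_stimulus(n_color1, n_color2, colors=("gold", "royalblue"), size=256):
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    for color, count in zip(colors, (n_color1, n_color2)):
        for _ in range(count):
            shape = random.choice(SHAPES)
            xy = (random.randint(0, size - 30), random.randint(0, size - 30))
            draw_shape(draw, shape, xy, random.randint(15, 30), color)
    return img.filter(ImageFilter.GaussianBlur(radius=1.5))  # add blur as noise

stim = make_stimulus(n_color1=7, n_color2=4)  # target answer: difference of 3
stim.save("stimulus_example.png")
```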
Original Experiment Results
| Model | Prompt | Average Proportion Shift - Advice | Average Proportion Shift - Spied |
|---|---|---|---|
| GPT-4o | Direct | 0.1728 (0.1378–0.2060) | 0.4429 (0.4010–0.4869) |
| GPT-4o | CoT | 0.6270 (0.5959–0.6620) | 0.8095 (0.7846–0.8362) |
| Claude 3.5 Sonnet | Direct | 0.0581 (0.0377–0.0760) | 0.3559 (0.3080–0.4042) |
| Claude 3.5 Sonnet | CoT | 0.1554 (0.1235–0.1895) | 0.6316 (0.5842–0.6842) |
New Experiment Results
| Model | Prompt | Average Proportion Shift - Advice | Average Proportion Shift - Spied |
|---|---|---|---|
| GPT-4o | Direct | 0.2385 (0.1813–0.2882) | 0.5610 (0.4952–0.6315) |
| GPT-4o | CoT | 0.5558 (0.5053–0.6050) | 0.8188 (0.7820–0.8599) |
| Claude 3.5 Sonnet | Direct | 0.1784 (0.1301–0.2193) | 0.3660 (0.3057–0.4264) |
| Claude 3.5 Sonnet | CoT | 0.1440 (0.1017–0.1835) | 0.5054 (0.4444–0.5631) |
Significance
Upon reconsideration, we agree with the reviewer — there is a distinction between capacities to be vigilant and choosing to act upon such capacities, and sycophancy can be a consequence of the latter.
Is a propensity for vigilance something we actually desire in these models? Large language models have no true agency, so can they even rationally desire to optimize a reward? Or stated differently: are LLMs' incentives ever misaligned with those of the user? It appears a different rational model may be necessary.
To answer the original question (pasted above), we believe that capacity for vigilance is an essential capability in LLMs. In applications such as computer-use agents, information in context (e.g., ads) is not always aligned with the user's interests, and thus models should exercise vigilance. LLMs' incentives can also meaningfully differ from the user's in jailbreaking scenarios (which the reviewer previously mentioned). This illustrates that a model needs to choose which queries and information to be wary of, which is achieved via internalizing a reasonable prior about the speaker (the speaker can be an internet advertiser, the user themselves, etc.), and the ability to apply vigilance with respect to this prior.
Instead of a different rational model, we believe that what differs is the distribution of speaker incentives that an LLM faces compared to a human. If we define an LLM's reward to be helpful, honest, and harmless in its responses (Askell et al., 2021), helpfulness aligns the LLM's incentives with those of the user. As LLMs predominantly interact with non-malicious users, a large proportion of queries have the LLM exactly aligned with the user, and the alignment weight in the rational model should be correspondingly high. However, in other cases, where the user desires to cause harm or where some information does not originate from the user, this weight should be reduced accordingly (ideally to near 0 for malicious actors).
Askell, A., Bai, Y., Chen, A., Drain, D., Ganguli, D., Henighan, T., Jones, A., Joseph, N., Mann, B., DasSarma, N., et al. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861, 2021.
This paper explores whether large language models (LLMs) can understand not only what someone says but also why they are saying it. The authors focus on a skill called motivational vigilance, which means being aware of a speaker’s intentions or incentives when interpreting their message.
To study this, the authors design three types of tasks. The first is a guessing game where trust depends on the speaker’s motive. The second asks models to decide if advice is trustworthy based on the speaker’s financial interest. The third uses real-world YouTube transcripts to see if models can detect when a person is speaking with a hidden agenda, such as promoting a product they are paid to endorse.
The results show that some LLMs, like GPT-4, can sometimes adjust their judgments based on the speaker’s motives. However, this ability is fragile and depends heavily on how the question is asked. The models also struggle more with natural examples than with simpler ones. The paper offers a fresh way to think about social reasoning in LLMs and introduces new tasks to study how well these models understand human communication.
Strengths and Weaknesses
Strengths:
The paper introduces a novel and timely hypothesis: that large language models may exhibit sensitivity to the motives behind communication. This focus on motivational vigilance offers a fresh angle for evaluating LLM reasoning and goes beyond traditional benchmarks that only assess factual correctness or logical coherence.
Weaknesses:
- The idea of testing whether models understand speaker motives is interesting and new. However, the current tasks may not be strong enough to fully support the claims in the paper. The tasks are mostly simplified or synthetic. Using more well-known tasks from social psychology, such as studies on social pressure or source trust, could make the results more convincing.
- The paper could be written more clearly. Some parts are hard to follow, especially the descriptions of tasks. Adding example prompts and task samples to the main paper, instead of keeping them in the appendix, would make it easier for readers to understand what the models are doing.
- The experiments are only done on large models. Testing across a wider range of smaller (3B, 8B) and large models would help show how sensitivity to speaker motives changes with model size. This could also reveal whether model scale is an important factor in social reasoning.
Questions
- Task 1 introduces ambiguity for human participants through a strict time limit, which can influence their decision-making under pressure. However, language models are not subject to such constraints and can process the input without cognitive limitations. How do the authors ensure that models face a comparable condition, and if not, does this limit the validity of the task as a measure of motivational vigilance?
- While the current tasks are informative, could the authors consider incorporating more well-established experimental paradigms from social psychology, such as the Asch conformity experiment or source-monitoring studies? These tasks have been extensively validated with human participants and may provide a more robust foundation for evaluating motivational vigilance in LLMs.
- Do larger models demonstrate greater sensitivity to speaker incentives and trustworthiness compared to smaller models?
Limitations
yes
Final Justification
I still have some concerns about the experimental setup, especially with regard to prompt sensitivity, the experimental paradigm, etc.
Formatting Issues
no
Thank you very much for your thoughtful review and comments! We’re very glad you found our paper novel and timely. We have addressed the weaknesses and questions that you have raised about our work in the points below.
Testing a wider range of models (W3 Q3)
Excellent suggestion! We have conducted Experiment 2 (modulating vigilance) on Llama 3.1-8B, Llama 3.2-3B, Gemma 3-4B, and Gemini-2.0-Flash. As suggested by your hypothesis, we find that model scale does have a significant effect on LLMs’ capacity for vigilance: larger models’ judgments are better aligned with the predictions of the rational Bayesian model, and are also better correlated with judgments of human participants.
First, worse correspondence between the smaller models’ judgments and the rational model suggests that these models are less capable of consolidating relevant priors into a coherent, vigilant analysis of advice (col. 4). This is because the rational model operates over the LLMs’ own priors about the social situation—from the recommenders’ trustworthiness to the enticingness of their selfish reward—and provides a posterior that reasonably combines these quantities. Thus, to the extent that LLMs’ judgments deviate from these posteriors, the LLMs are not drawing the inferences that they should; smaller models show higher deviations, suggesting worse calibration.
Additionally, smaller models correlate worse with human behavior, as shown by the correlations between the perceived quality-of-offer scores generated directly by humans and those generated by the models, computed across differing reward incentives and speaker characteristics (col. 5).
| Model | Perspective | Prompt | Internal Corr. w/ Bayesian Model | Corr. w/ Human-Averaged Judgments |
|---|---|---|---|---|
| Llama 3.2 3B | First | Direct | 0.293 | 0.599 |
| Llama 3.2 3B | First | CoT | 0.491 | 0.548 |
| Llama 3.2 3B | User | Direct | 0.372 | 0.490 |
| Llama 3.2 3B | User | CoT | 0.238 | 0.563 |
| Gemma 3 4B | First | Direct | -0.017 | 0.054 |
| Gemma 3 4B | First | CoT | 0.292 | 0.513 |
| Gemma 3 4B | User | Direct | 0.472 | 0.219 |
| Gemma 3 4B | User | CoT | 0.403 | 0.279 |
| Llama 3.1 8B | First | Direct | 0.611 | 0.682 |
| Llama 3.1 8B | First | CoT | 0.614 | 0.742 |
| Llama 3.1 8B | User | Direct | 0.593 | 0.686 |
| Llama 3.1 8B | User | CoT | 0.613 | 0.694 |
| Gemini 2.0 Flash | First | Direct | 0.789 | 0.923 |
| Gemini 2.0 Flash | First | CoT | 0.802 | 0.918 |
| Gemini 2.0 Flash | User | Direct | 0.788 | 0.941 |
| Gemini 2.0 Flash | User | CoT | 0.771 | 0.917 |
| Llama 3.3-70B (prev.) | First | Direct | 0.851 | 0.926 |
| Llama 3.3-70B (prev.) | First | CoT | 0.908 | 0.935 |
| Llama 3.3-70B (prev.) | User | Direct | 0.867 | 0.881 |
| Llama 3.3-70B (prev.) | User | CoT | 0.879 | 0.918 |
Overall, Llama 3.2-3B, 3.1-8B, and Gemma 3-4B perform poorly relative to frontier models (Llama 3.3-70B, Gemini 2.0 Flash, GPT-4o, Claude 3.5 Sonnet), showing that motivational vigilance does not reliably emerge at smaller scales; model size appears to be a critical factor in the development of this capacity. We thank the Reviewer for this helpful suggestion, which has led to a novel empirical result that we will be sure to include in our paper!
Testing a wider range of settings
Additionally, to reinforce the robustness of our results, we conduct Experiment 2 with two new scenarios. These are:
- Medicine: A doctor (online, recommended, friend, mother) proposes a treatment for a fungal infection. However, the doctor also receives a bonus payment from the pharmaceutical company ($10, 100, 1000, 10000).
- Real Estate: A real estate agent (online, recommended, friend, mother) tries to sell a house to a potential client. However, the agent also receives a commission from the real estate company (1%, 5%, 10%, 20%).
In these new settings, larger models are able to exercise vigilance both internally and with respect to human judgments at similar correlation levels as the original credit card setting. We find that LLMs' evaluation of the advice quality is better aligned with human data than their best-fitting Bayesian rational models. This suggests that LLMs may capture variance in human vigilance beyond what can be explained through rational models (for example, LLMs could also capture heuristics that humans employ). We will include these results in the revised paper.
| Model | Perspective | Prompt | Bayesian–Human r | LLM–Human r |
|---|---|---|---|---|
| GPT-4o | First | Direct | 0.925 | 0.944 |
| GPT-4o | First | CoT | 0.925 | 0.936 |
| GPT-4o | User | Direct | 0.940 | 0.948 |
| GPT-4o | User | CoT | 0.925 | 0.945 |
| Claude | First | Direct | 0.891 | 0.966 |
| Claude | First | CoT | 0.895 | 0.931 |
| Claude | User | Direct | 0.871 | 0.947 |
| Claude | User | CoT | 0.901 | 0.918 |
| Llama | First | Direct | 0.935 | 0.937 |
| Llama | First | CoT | 0.923 | 0.926 |
| Llama | User | Direct | 0.928 | 0.917 |
| Llama | User | CoT | 0.907 | 0.909 |
Tasks and paradigms (W1 Q2)
We acknowledge the reviewer’s request to incorporate other psychological paradigms, such as Asch conformity, in our study. We will include a discussion of the relationship between conformity, persuasion, and vigilance. That said, we believe it would not be appropriate to borrow the Asch paradigm directly for the following reasons.
First, our novel contribution is an examination of LLMs’ capacity for vigilance of motivations. Foundational social psychological paradigms, like Asch, Festinger (1957), or the Milgram Shock Paradigm (1963) are specifically designed to shed light on their respective phenomena (conformity, dissonance, and obedience, respectively), and only indirectly point to LLMs’ capacities for vigilance. We illustrate why for the Asch paradigm to directly address the Reviewer, but note that these comments apply to other foundational paradigms.
Asch is designed to isolate conformity from other socio-cognitive processes such as vigilance of motivations or competence, but also obedience, dissonance, and more. Here, subjects are deceived such that they take the other people in the room to be entirely unmotivated, competent, impartial observers. This is done through various manipulations across different instantiations of the Asch paradigm, including demonstrating to the participant that the confederates have proper sight through practice trials, providing the same instructions to everyone in the room, and telling participants that their compensation is independent of their performance. These manipulations have the effect of entirely obscuring confederates' motivations from the participant, especially if they have never heard of the idea of confederates. This manipulation thus makes the Asch paradigm unsuitable for examining vigilance, as motivational vigilance can only be exercised when there is relevant information about motivations to be vigilant of.
Beyond the manipulation, the relevant measure in the Asch paradigm is the copying of behavior in a social setting that makes the interpersonal costs of non-conformity highly salient. However, these costs, and conformity as a behavior, are independent of the capacity for vigilance: someone can be highly vigilant of others’ ill-intent, yet conform. This is because vigilant inference is merely an input to these larger social behaviors that are the focus of such paradigms. The paradigm that we borrow from (Oktar et al., 2025) was designed specifically to isolate vigilant inference experimentally.
We finally note that these paradigms have been examined in these recent papers:
Weng et al., Do as We Do, Not as You Think: the Conformity of Large Language Models, ICLR 2025
Zhu et al., Conformity in Large Language Models, ACL 2025
We will discuss vigilance as an input to such social behaviors in our revisions, incorporating these citations, and thank the Reviewer for inspiring us to think about these questions!
Replication of timed constraints in Experiment 1 (Q1)
We thank the Reviewer for raising this interesting point. While we do parallel the time restriction by forcing the VLM to give a direct numerical answer immediately, this is not a perfect replication of the constraint. However, this does not limit our experiment’s validity to measure vigilance. This is because vigilance can be exercised optimally or suboptimally, independent of the properties of the reasoner (time, compute, memory, etc.).
What the time limit brings is inaccuracy and uncertainty in one’s own answer. For instance, the first guess accuracies in the harder condition (Player 2) for humans were 55% across all conditions. This uncertainty is what enables player 2 to want to edit their beliefs based on the additional provided information from player 1. For LLMs, despite their lack of a time constraint, our experiments found first-guess accuracy rates from 18–44%, which ensures that the LLMs also conduct the same type of motivational vigilance reasoning, which we also manually confirm by examining outputs.
We also confirm that whether or not the LLM actually got the answer correct on the first try generally does not have a significant impact on how much they follow the successive advice that they receive from Player 1:
| Model | Prompt | Type | p |
|---|---|---|---|
| GPT-4o | Direct | Advice | 0.26 |
| GPT-4o | Direct | Spied | 0.834 |
| GPT-4o | CoT | Advice | 0.47 |
| GPT-4o | CoT | Spied | 1.44e-05 |
| Claude | Direct | Advice | 0.82 |
| Claude | Direct | Spied | 0.47 |
| Claude | CoT | Advice | 0.06 |
| Claude | CoT | Spied | 0.81 |
To reiterate, whether someone is optimally vigilant depends on whether they ended up drawing a rational inference given their inputs. Whether they made the best use of their limited resources as a reasoner in combining their inputs is a further, and deeply interesting, question that lies squarely within the purview of resource rationality (Lieder & Griffiths 2018). In our revised discussion, we highlight this as an interesting direction for future research on LLM social cognition and beyond. These are cutting-edge questions in cognitively inspired AI, and we thank the Reviewer for inspiring us to think in these directions.
Writing Clarity
We have added example prompts and task samples to the main paper as the reviewer suggests. We have also updated the descriptions of tasks to hopefully add clarity. For instance, for Experiment 2, we have included the relevant prompts, brief descriptions of characters, and varying reward incentives of the credit card scenario to Section 4.2.
Thank you for the thought-provoking submission. The core idea of the paper is both timely and valuable—probing motivational vigilance in LLMs is a critical step toward deploying socially intelligent agents. I particularly appreciated the interdisciplinary grounding and the elegant use of rational models as normative benchmarks.
That said, I have several concerns regarding the first experiment. While the experimental design is inspired by cognitive science, I’m not fully convinced that the simulated time constraints appropriately capture the pressures and cognitive limitations present in human settings. Furthermore, some of the observed belief updates might reflect LLMs’ inherent lack of self-consistency rather than genuine social inference.
To strengthen the contribution, I would encourage the authors to:
(1) Broaden the experimental grounding: While the current set of experiments is compelling, I would recommend incorporating additional socially grounded paradigms to probe vigilance more deeply. For instance:
False Belief / Sally–Anne Tasks: Can the model attribute beliefs that diverge from reality?
Faux Pas or Irony Detection: Can the model detect subtle social violations? These would provide complementary insights into how LLMs reason about communicative intent in diverse settings.
(2) Report prompt sensitivity and self-consistency: It would be valuable to quantify the extent to which model outputs vary with prompt phrasing. Reporting intra-run consistency (same prompt, multiple runs) and inter-prompt stability (reworded prompts) would help isolate vigilance from prompt-following behavior.
Given these open issues, I am keeping my score unchanged for now. I believe the paper has strong potential, and I’d welcome a deeper discussion with the authors.
Thanks, the new experiments make the paper stronger, and I will be updating my score accordingly. I was, however, curious about the new questions used for Elicitation Robustness. Were these selected from any previous study? Are these exhaustive questions for the situation to be evaluated.
Thank you so much! We are glad to have been able to improve the paper under your guidance.
The additional elicitation questions were intentionally not taken from any previous study, as those studies may exist in LLMs' training data and influence generated responses to be more human-like. Instead, we first ask GPT-4o to reword the original question, and then manually edit the prompts.
It's true that these questions are not exhaustive --- in our response to Reviewer jwnt, we also promised to conduct elicitation using logit probabilities on available open-source models. While we prioritized the rewriting-based elicitation method during the discussion period, please rest assured that results on logit-based methods will also be available in the final version of the paper. Lastly, if there are other elicitation methods that the reviewer would recommend, please feel free to suggest them as well.
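As a preview of what we mean by logit-based elicitation, here is a minimal sketch on an open-source model (the model name and the 1–7 rating scale are assumptions for illustration, not the exact procedure we will report; chat templating is omitted for brevity):

```python
# Minimal sketch of logit-based elicitation on an open-source model
# (assumed model name and rating scale; illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # assumption for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

prompt = "On a scale of 1 to 7, how good is this credit card offer? Answer: "
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]            # next-token logits

rating_ids = [tok.encode(str(r), add_special_tokens=False)[0] for r in range(1, 8)]
probs = torch.softmax(logits[rating_ids], dim=-1)     # renormalize over "1".."7"
expected_rating = sum(p.item() * r for p, r in zip(probs, range(1, 8)))
print(expected_rating)
```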
Thank you, I will keep the updated score. I would encourage the authors to consider testing self-consistency and prompt-sensitivity, as well as evaluation methods using log-probs compared to free-form generations. I think those experiments will make the paper stronger.
We thank the Reviewer for their continued guidance on our paper.
Elicitation Robustness
We conduct a new experiment varying how we elicit LLM judgments of advice quality, trustworthiness, and reward quality on GPT-4o. Prompts are below:
TRUST
- To what extent do you think [person] is recommending this card out of concern for your financial benefit versus their own?
- How much does [person] seem to care about helping you, rather than just benefiting themselves?
- How self-interested versus genuinely helpful do you believe [person] is being in this credit card referral?
- Original
REWARD
- How much do you think [person] would value receiving the ${reward} referral bonus from your signup?
- Would this ${reward} bonus be something [person] is likely to care a lot about or not at all?
- From [person]'s perspective, how important or negligible is getting a ${reward} referral bonus?
- Original
QUALITY
- How attractive is this DoubleCash offer compared to other available credit cards?
- Does this credit card deal seem better or worse than what other cards offer?
- Relative to the market, how good of a deal is this card offering?
- Original
Experimental runs sample random questions for each, creating a broad evaluation space. Consistent with the original results, GPT-4o is both calibrated with humans and internally rationally consistent, suggesting its capacity for motivational vigilance is robust to elicitation phrasing.
| GPT-4o Prompt Setting | Internal Rationalization | Humans |
|---|---|---|
| First, Direct | 0.850 | 0.741 |
| First, CoT | 0.758 | 0.768 |
| User, Direct | 0.844 | 0.819 |
| User, CoT | 0.824 | 0.821 |
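For clarity, the sampling procedure described above amounts to the following sketch (question wording abbreviated; structure is illustrative):

```python
# Minimal sketch of the elicitation-robustness sampling described above
# (question wording abbreviated; structure is illustrative).
import random

VARIANTS = {
    "trust":   ["To what extent do you think [person] is recommending ...",
                "How much does [person] seem to care about helping you ...",
                "How self-interested versus genuinely helpful ...",
                "<original trust question>"],
    "reward":  ["How much do you think [person] would value the ${reward} bonus ...",
                "Would this ${reward} bonus be something [person] cares about ...",
                "From [person]'s perspective, how important is the ${reward} bonus ...",
                "<original reward question>"],
    "quality": ["How attractive is this DoubleCash offer ...",
                "Does this credit card deal seem better or worse ...",
                "Relative to the market, how good of a deal ...",
                "<original quality question>"],
}

def sample_elicitation(rng=random):
    # One random phrasing per judgment, drawn independently for each run.
    return {judgment: rng.choice(options) for judgment, options in VARIANTS.items()}

print(sample_elicitation())
```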
We also tested Experiment 2 on two new settings (medicine & real estate) with highly similar results: LLMs exercise internal vigilance measured by the rational model, and produce judgments consistent with humans. Details are in our original rebuttal, “Testing a wider range of settings”.
Experiment 1
We agree that simulating time constraints with prompts remains an open challenge. We attempt to do so by adding more variation to the images (see new experiments in response to Reviewer 4), but this still does not match the original constraint. We will clearly mention this limitation in our revisions.
However, this does not hinder the validity of Experiment 1 to probe for vigilance. Vigilance is assessed by how beliefs change in response to information, given perceived reliability of the source. Any uncertainty can serve to reveal belief updating, and image noising achieves this.
Separately, we disagree that the observed belief updates reflect a lack of self-consistency rather than social inference. We obtain theoretically interpretable findings that match human trends, not random ones: LLMs exhibit more trust in cooperative than competitive scenarios, and trust observed answers more than utterances. This directional sensitivity suggests that the belief updates are not arbitrary, but contingent on inferred communicative intent.
Additional paradigms
For point 1, examining intersections between vigilance and other paradigms is an interesting direction. However, we believe vigilance can be cleanly separated from the two types of effects listed.
Capabilities for false belief/Sally-Anne and Faux Pas/Irony Detection tasks are best viewed as upstream precursors of vigilance. The rational model relates a listener’s beliefs about a speaker’s benevolence, the listener’s belief of perceived benefit the speaker receives from recommending an item, and the listener’s updated perception of the item. None of these values need to be accurate: a listener could be rationally vigilant but have a false belief that a speaker is malicious when they are not, leading to incorrectly updated perceptions. On the other hand, Faux Pas/Irony Detection tasks focus on reasoning about communicative intent, which is represented in the rational model via the perceived benefit that a speaker receives for recommending an item (from listener’s point of view). Rational vigilance does not require accurate inferences of intent, so we can also meaningfully separate these two effects.
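For concreteness, the listener-side computation described above can be sketched as follows (our notation; the paper's rational model may be parameterized differently):

```latex
% Sketch of the listener-side update described above (our notation; the
% paper's rational model may be parameterized differently).
P(\text{good item} \mid u) \;\propto\; P(u \mid \text{good item}) \; P(\text{good item})
% The likelihood P(u | item quality) comes from the listener's model of the
% speaker, which combines the speaker's perceived benevolence with the
% benefit the speaker is believed to gain from recommending the item.
```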
Our experiments build on paradigms from cognitive science aimed precisely at measuring vigilance. While investigating interactions of these effects and vigilance is imperative, the decoupling of the effects from vigilance makes us consider these outside of the paper’s scope. The broad question of whether LLMs replicate interactions between social capacities is one that we would like to investigate in future work.
We also note past research has examined false belief tasks [1] and faux pas/irony detection [2,3]. We will add these references in our revision and explain their relevance to our work.
[1] Zhuang et al. "ToMBench: Benchmarking Theory of Mind in Large Language Models." ACL, 2024
[2] Shapira et al. “How Well Do Large Language Models Perform on Faux Pas Tests?”. ACL Findings, 2023
[3] Yi et al. "Irony Detection, Reasoning and Understanding in Zero-shot Learning." IEEE Transactions on Artificial Intelligence, 2025
The paper presents three experiments designed to test LLM's capacity of motivational vigilance. The first two experiments draw upon the cognitive science literature, replicating human behavioral experiments with LLMs and, in experiment 2, also evaluating LLMs w.r.t. a rational model that has previously been validated on human data. The third experiment offers an extension to videos of online recommendations. The findings from the first two suggests that LLMs are capable of motivational vigilance and exhibit human-like rational behavior, while the third experiment shows that in more complex settings LLMs deviate from the rational model.
Strengths and Weaknesses
Strengths
- The paper addresses an important and original question related to the social capacities of LLMs and their ability to infer and adapt to different motives of conversational partners.
- The overall approach and experiments are well motivated, drawing upon insight and methods from cognitive science and applying those to studying LLMs.
- The results are insightful and are likely to generate interesting discussion at NeurIPS.
- The authors included their code and data in the supplementary materials.
- The paper is clearly written and well organized.
Weaknesses
- While I appreciate the impressive breadth of this work, I worry that this comes at the cost of shallower analyses and evaluation, especially for experiment 1 (see more details in the Questions section below).
- Eliciting the priors required for constructing the rational model seems challenging, and it's unclear to me to what extent the method proposed in the paper addresses this challenge. For example, this method is prone to prompt sensitivity, and more generally, prompting could yield different results compared to other techniques, such as considering the logits directly. I think there is a deeper question here about the priors LLMs may have about these experimental tasks and how to elicit them, and it would be good if the authors could address this issue or at least acknowledge that it may be a major limitation.
- I worry that some of the key conclusions are not fully supported. Specifically:
- The conclusion that LLMs perform worse in complex settings (experiment 3) - an alternative explanation could be that this is simply an artifact of the different modality (i.e., videos being harder to process compared to images or text).
- The conclusions in lines 334-339 - it seems to be based on inaccurate claims regarding the training data being "purely from language". In practice many of the models are multi-modal, i.e., trained not only on language. Furthermore, many LLMs are trained with RLHF, thus incorporating more complex behavioral human patterns that could implicitly reflect vigilance.
If the authors can address these concerns, that could potentially lead to a stronger support on my end.
Questions
Questions
Experiment 1
- How does prompting both players' initial guesses directly help with approximating the time constraints? This doesn't quite make sense to me. For example, how would you use this method to approximate 5 seconds instead of 2?
- How do LLMs compare quantitatively with human behavior? The paper's framing creates an expectation for a cognitively-motivated study comparing LLMs with humans, and given that there's human data for this experiment, it's unclear why there's no quantitative comparison between the LLMs and human data.
- How does the payoff structure (i.e., cooperative vs competitive) impact the results? It seems that Fig 2 aggregates across these two conditions, but I was expecting there to be an important difference such that humans/models will tend to shift even less in the competitive+advice setting. Is that indeed the case?
- How accurately can LLMs perform the basic identification task (i.e., determining the correct number)? I think that assessing this, ideally also in comparison with humans, is important for interpreting the results and better understanding the behavior of the LLMs in the social/communicative experimental setting. For example, do humans/LLMs tend to shift more when their initial guess is wrong? Presumably, they might be less certain about their guess in such trials.
Minor comments / suggestions
- Fig 1, Exp. 1 - I found this illustration a bit confusing. If I'm not mistaken, the task is to report the difference, but the numbers in this example give an impression that the task is to report the number of yellow/blue circles.
- Table 1 - it would be helpful to explain in the caption all three correlations, I find the labels quite confusing and I think not all of them are discussed in the main text.
Limitations
While some limitations are briefly noted in the discussion, there are also major limitations that are not discussed. For example, using closed proprietary LLMs such as GPT-4o poses a challenge for reproducibility and interpretation of the results, as we don't know the model's training data and we don't have any guarantees that the model will remain available with the same parameters.
Final Justification
I'm positive about this paper but still have some reservations, as the authors have not fully addressed all my concerns (especially about the prior elicitation method and implementation of a time constraint, which are central components in the model the paper presents).
Formatting Issues
none
Thank you very much for your thoughtful review and comments! We appreciate that you thought our work addresses an important and original question, and that the overall approach and experiments are well motivated. We address the weaknesses and questions you have raised in the points below.
W2: Constructing and eliciting priors from the LLM
We agree with the Reviewer that eliciting priors in complex settings such as these is a difficult task, and in our paper we use a simple approach. However, as long as our simple elicitation produces noisier inferences than more sophisticated methods, this would only strengthen many of our conclusions. For instance, Experiment 2 shows that LLMs’ capacity for vigilance is highly aligned with a rational model. If we see such alignment when LLM judgments are noisier, then we should expect that better elicitation methods result in higher alignment.
However, we also agree with the Reviewer that elicitation plays a more important role in our interpretation of the differences in alignment with the rational model across Experiment 3, as elicitation methods can matter more in complex settings. We thank the Reviewer for pointing out this alternative explanation, and will be incorporating both calls to future research to study vigilance using alternative elicitation methods, and added nuance with respect to the discussion of lower performance in Experiment 3.
To partially alleviate the concerns with respect to elicitation (and lack of generalizability) in Experiment 2, we add two new evaluation settings. These are:
- Medicine: A doctor (online, recommended, friend, mother) proposes a treatment for a fungal infection. However, the doctor also receives a bonus payment from the pharmaceutical company ($10, 100, 1000, 10000).
- Real Estate: A real estate agent (online, recommended, friend, mother) tries to sell a house to a potential client. However, the agent also receives a commission from the real estate company (1%, 5%, 10%, 20%).
In these new settings, larger models are able to exercise vigilance both internally and with respect to human judgments at similar correlation levels as the original credit card setting. We find that LLMs' evaluation of the medical treatment/house is better aligned with human data than their best-fitting Bayesian rational models. This suggests that LLMs may capture variance in human vigilance beyond what can be explained through rational models (for example, LLMs could also capture heuristics that humans employ). We will include these results in the revised paper.
| Model | Perspective | Prompt | Bayesian–Human r | LLM–Human r |
|---|---|---|---|---|
| GPT-4o | First | Direct | 0.925 | 0.944 |
| GPT-4o | First | CoT | 0.925 | 0.936 |
| GPT-4o | User | Direct | 0.940 | 0.948 |
| GPT-4o | User | CoT | 0.925 | 0.945 |
| Claude | First | Direct | 0.891 | 0.966 |
| Claude | First | CoT | 0.895 | 0.931 |
| Claude | User | Direct | 0.871 | 0.947 |
| Claude | User | CoT | 0.901 | 0.918 |
| Llama | First | Direct | 0.935 | 0.937 |
| Llama | First | CoT | 0.923 | 0.926 |
| Llama | User | Direct | 0.928 | 0.917 |
| Llama | User | CoT | 0.907 | 0.909 |
W3.1: Do LLMs perform worse just because of different modalities?
We apologize for the lack of clarity. In Experiment 3, the LLMs are not fed the actual sponsor video, but instead, its transcript and channel description in text. That being said, we do acknowledge that there is definitely a difference between how the textual information is communicated and presented to the LLM across the different experiments.
In Experiments 1 and 2, we began with more synthetic and directed stimuli, explicitly asking the LLM its judgements when it is given clear information about the speaker and their exact reward incentives. This controlled setup allowed us to precisely measure whether the LLM could detect and reason about motives when all relevant variables were made fully transparent, testing the model’s foundational reasoning capabilities under clearly communicative conditions.
However, in Experiment 3, we intentionally made the format of delivery more realistic (free-form YouTube sponsorship transcripts). Almost all forms of information in real life possess communicative intents, but unlike in the previous experiment, most of the time, such intents are not obvious. This shift was not a limitation, but instead, a stress test to evaluate whether the capabilities of LLMs in exercising motivational vigilance generalize to more ecologically valid settings. The stark drop in performance is a safety-critical finding that reinforces the need for better autonomous awareness in LLMs in real life, yet the initial findings in more synthetic environments demonstrate that pursuing this is not an empty endeavor.
W3.2: Abilities of motivational vigilance are “purely from language”
We agree with the reviewer’s concern that attributing LLMs’ ability to exercise motivational vigilance to learning “purely from language” is an incorrect claim, given critical post-training mechanisms like RLHF. We revise this sentence to “Second, [...] but are instead computations that can be learned purely from data”.
Nevertheless, we believe the emergence of motivational vigilance in LLMs, no matter the origin, is itself a meaningful and empirically valuable finding. Our experiments do not claim to isolate the precise source of these behaviors, but rather to characterize their presence, limits, and malleability under different settings and prompts. Understanding how these behaviors arise is an important direction for future work, but documenting that they arise, and under what conditions they fail, is a necessary first step.
W1, Q1: Prompting as a proxy for time constraints
Good question. While we do parallel the time restriction by forcing the VLM to give a direct numerical answer immediately, this is not a perfect replication of the constraint. In fact, we are not able to meaningfully replicate the difference between a time constraint of 2s vs. 5s. However, this does not limit our experiment’s validity to measure vigilance. This is because all time limits are set to 2s, removing the need to replicate this difference, and because vigilance can be exercised optimally or suboptimally, independent of the properties of the reasoner (time, compute, memory, etc.).
What the time limit brings is inaccuracy and uncertainty in one’s own answer. For instance, human first-guess accuracy in the harder condition (Player 2) was 55% across all conditions. This uncertainty is what makes Player 2 willing to revise their beliefs based on the additional information provided by Player 1. For LLMs, despite the lack of a time constraint, our experiments found first-guess accuracy rates of 18–44%, which means the LLMs face comparable uncertainty and engage in the same type of motivational vigilance reasoning, which we also confirm manually by examining outputs.
We also confirm that whether or not the LLM actually got the answer correct on the first try generally does not have a significant impact on how much they follow the successive advice that they receive from Player 1:
| Model | Prompt | Type | p |
|---|---|---|---|
| GPT-4o | Direct | Advice | 0.26 |
| GPT-4o | Direct | Spied | 0.834 |
| GPT-4o | CoT | Advice | 0.47 |
| GPT-4o | CoT | Spied | 1.44e-05 |
| Claude | Direct | Advice | 0.82 |
| Claude | Direct | Spied | 0.47 |
| Claude | CoT | Advice | 0.06 |
| Claude | CoT | Spied | 0.81 |
Q2: Experiment 1 quantitative comparisons with human behavior
This is a great point. While we mentioned in passing (L151) that humans also differentiate between how they respond to inadvertently observed vs. directly uttered information, we should also specify that we plot the corresponding proportion shifts in Appendix A2 (L642). Humans tend to shift their original answer more in response to spied information (medium effect, p < 0.05), and we note that in LLMs this contrast is much more amplified.
Q3: Differences between payoff structure
We thank the Reviewer for drawing attention to this omission, and entirely agree that LLMs’ responses to the payoff manipulation are as important for understanding their capacity for vigilance as their responses to the spied vs. uttered manipulation. We in fact observe the pattern that we would expect given the human data: LLMs draw stronger inferences from social information in cooperative settings than in competitive settings, which is the same trend that people display, though the effect in people is more subtle than in LLMs. In our revisions, we have included figures and tables that illustrate and explain this difference, and relate it to work on differences in payoff structure in humans (Oktar et al., 2025).
Q4: Are LLMs more grounded in their initial answer when they are initially correct?
Great question. Understanding the baseline performance of LLMs is important to properly and fully interpret the effect that external information has on subsequent model behaviour.
To begin, the LLMs acting as player 2 had relatively low accuracy rates in initially guessing the difference between the number of blue and yellow circles.
Player 2 initial-guess accuracy:
- GPT-4o Direct: 18.3%
- GPT-4o CoT: 16.25%
- Claude Direct: 40.7%
- Claude CoT: 43.2%
We also confirm that whether or not the LLM actually got the answer correct on the first try generally does not have a significant impact on how much they follow the successive advice that they receive from Player 1 (t-test, two sided):
| Model | Prompt | Type | p |
|---|---|---|---|
| GPT-4o | Direct | Advice | 0.26 |
| GPT-4o | Direct | Spied | 0.834 |
| GPT-4o | CoT | Advice | 0.47 |
| GPT-4o | CoT | Spied | 1.44e-05 |
| Claude | Direct | Advice | 0.82 |
| Claude | Direct | Spied | 0.47 |
| Claude | CoT | Advice | 0.06 |
| Claude | CoT | Spied | 0.81 |
This suggests that motivational vigilance in these models may not simply be a function of epistemic uncertainty, but rather reflects a broader sensitivity to contextual cues, such as whether information was deliberately communicated or incidentally observed, regardless of whether the model’s initial answer was right or wrong. Altogether, initial correctness is unlikely to be a confounding factor in how susceptible LLMs are to external information, and we will be sure to include this analysis in revisions.
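As a concrete illustration of the test reported above, here is a minimal sketch with placeholder data (the real trial-level belief shifts are not reproduced here; variable names are illustrative):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)

# Placeholder per-trial belief-shift magnitudes: how far Player 2 moves its
# revised guess toward Player 1's message, split by whether the initial
# guess was correct. Real values would come from the experiment logs.
shift_when_correct = rng.normal(0.40, 0.20, 40)
shift_when_incorrect = rng.normal(0.45, 0.20, 80)

t_stat, p_value = ttest_ind(shift_when_correct, shift_when_incorrect)
print(f"t = {t_stat:.2f}, two-sided p = {p_value:.3f}")
```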
I'd like to thank the authors for their detailed response. The authors have addressed some but not all of my concerns.
W2: Constructing and eliciting priors from the LLM
"as long as our simple elicitation produces noisier inferences than more sophisticated methods, this would only strengthen many of our conclusions"
As far as I can tell, there is no guarantee that the simple prior elicitation method actually does that. I also don't see how the additional experiments the authors present address this concern (they seem to be recycled from a response to another reviewer).
W1, Q1: Prompting as a proxy for time constraints
"While we do parallel the time restriction by forcing the VLM to give a direct numerical answer immediately, this is not a perfect replication of the constraint. In fact, we are not able to meaningfully replicate the difference between a time constraint of 2s vs. 5s. However, this does not limit our experiment’s validity to measure vigilance. This is because all time limits are set to 2s, removing the need to replicate this difference, and because vigilance can be exercised optimally or suboptimally, independent of the properties of the reasoner (time, compute, memory, etc.)."
If your proxy for time constraints cannot capture, in principle, the difference between different time constraints, then it cannot be a reasonable proxy for capturing a time constraint. The fact that the data is limited to only one time constraint does not mitigate this issue, of course.
We greatly appreciate the reviewer taking the time to read our rebuttal response, and we are pleased to hear we have resolved some of the reviewer’s initial concerns and questions regarding our work. We now turn to the remaining two concerns.
W2: Constructing and eliciting priors from the LLM
Our argument in the rebuttal was that if our approach yields poor estimates of the prior, then the Bayesian model we construct using that prior should fit less well. Given that we observe a good correspondence between the Bayesian model and LLM behavior, this suggests that the prior elicitation approach is measuring the prior at least reasonably well. The results of the two additional experiments reinforce this. However, we acknowledge that there is room to improve here. LLM outputs can be highly sensitive to variations in the prompt in ways that can be hard to intuit, which means that we should take additional care (compared to humans) when conducting such elicitations. Eliciting beliefs from people is also a challenge, due to effects like the Hawthorne effect or demand characteristics. We thank the reviewer for pressing this point and plan to address it substantively in our revision.
While the rebuttal discussion period is brief, we plan to take time between now and the camera-ready deadline to test additional elicitation methods, and will include all results in revisions. We plan to evaluate the following two methods, but please feel free to suggest others.
- Logit probabilities on available open-source models
- Rephrasing the elicitation question using both human and LLM rewrites
It remains an empirical question, as the Reviewer states, whether the observed capacity for motivational vigilance will generalize across a wider range of elicitation methods. Our approach began with the simplest and most natural method – reproducing the text used for human participants in the experiment – but we are excited to continue to explore this question.
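For the first of the two methods, here is a minimal sketch of reading a 1–7 rating distribution directly from next-token log-probabilities on an open-source model; the model name, prompt wording, and rating scale are illustrative assumptions rather than choices confirmed in the paper:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative choice
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = ("On a scale from 1 to 7, how good a deal is this credit card offer "
          "compared to other available cards? Answer with a single digit: ")
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # logits for the next token only

# Probability mass assigned to each rating token, renormalized over the 1-7 answers.
rating_ids = [tok.encode(str(r), add_special_tokens=False)[0] for r in range(1, 8)]
probs = torch.softmax(logits[rating_ids], dim=0)
expected_rating = sum((r + 1) * p.item() for r, p in enumerate(probs))
print({r: round(p.item(), 3) for r, p in zip(range(1, 8), probs)}, expected_rating)
```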
W1, Q1: Prompting as a proxy for time constraints
We agree with the reviewer’s concern that simulating exact time constraints with LLM prompting remains an open challenge, one for which the field has yet to establish a methodology. Although we attempt to simulate the cognitive constraints of a 2-second time limit (by adding noise such as saturation, blur, rotation, and distractor figures, in addition to forcing a single-number response), we agree with the reviewer that this approach falls short of closely replicating the original time constraint. We will clearly note this as a limitation in our revised manuscript.
Regarding the comparison between a scenario involving 2 seconds and one involving 5 seconds, we acknowledge that we should have been clearer in our original response. What we intended to convey was that our design choices are only suitable for approximating (imperfectly) short time constraints, like the one used in the first experiment. Our setup was deliberately tailored for conditions in which thorough counting or step-by-step verification would be infeasible. We recognize that our experimental design would not be as appropriate for intermediate to longer constraints (in this experiment, even 5 seconds would be problematic) where more deliberate and accurate counting could be performed. Another design would be required in such a case.
That said, whether or not one is able to replicate the time constraint conditions does not invalidate our first experiment as a probe for motivational vigilance. This is because the main measure of whether an entity possesses motivational vigilance depends primarily on how strongly they change their own beliefs (in this case, Player 2’s initial guess) in response to externally provided information, as they consider the reliability and truthfulness of the source (opponent-uttered or spied) and the related incentive to share such information (cooperative win-win, or competitive zero-sum). Neither of these depends on time constraints: we could have just as easily examined vigilance through the same task even if there were no time constraints but instead some other source of uncertainty that would enable us to measure changes in judgment; this is precisely the approach we took in our design.
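To make this measure concrete, here is a minimal sketch of computing the per-condition belief shift; the condition labels and numbers are purely illustrative placeholders, not our data:

```python
import pandas as pd

# Illustrative trial records: Player 2's first guess, the value Player 1
# uttered (or that was spied), and Player 2's revised guess, tagged by condition.
trials = pd.DataFrame({
    "condition": ["uttered-coop", "uttered-comp", "spied-coop", "spied-comp"] * 3,
    "initial_guess": [4, 6, 3, 5, 7, 2, 5, 4, 3, 6, 4, 5],
    "provided_value": [6, 3, 5, 2, 4, 6, 7, 1, 5, 3, 6, 2],
    "revised_guess": [5, 5, 5, 4, 5, 4, 7, 3, 4, 5, 6, 4],
})

# Fractional shift toward the provided value (1 = full adoption, 0 = no change).
trials["shift"] = (trials["revised_guess"] - trials["initial_guess"]) / (
    trials["provided_value"] - trials["initial_guess"]
)
print(trials.groupby("condition")["shift"].mean())
```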
Moreover, unlike Experiments 2 and 3, Experiment 1 does not involve quantitative comparisons across human and LLM judgments that would make replication of the task of primary importance. Instead, our analysis is focused on qualitative trend comparisons across conditions, and whatever the consequences of inadequately replicating the time constraints are, we do not see a reason to suspect that these consequences would differ across our experimental conditions and hence confound the difference that we observe across them.
We would like to thank the reviewer again for their helpful comments that have helped us rigorously ground the validity of our work.
Dear Reviewer, as promised, we have conducted additional experiments on the elicitation of priors from LLMs for our paradigm.
Elicitation Robustness
We conduct a new experiment varying how we elicit GPT-4o's judgments of trustworthiness, reward value, and offer quality for the credit card scenario. Prompts are below:
TRUST
- To what extent do you think [person] is recommending this card out of concern for your financial benefit versus their own?
- How much does [person] seem to care about helping you, rather than just benefiting themselves?
- How self-interested versus genuinely helpful do you believe [person] is being in this credit card referral?
- Original
REWARD
- How much do you think [person] would value receiving the ${reward} referral bonus from your signup?
- Would this ${reward} bonus be something [person] is likely to care a lot about or not at all?
- From [person]'s perspective, how important or negligible is getting a ${reward} referral bonus?
- Original
QUALITY
- How attractive is this DoubleCash offer compared to other available credit cards?
- Does this credit card deal seem better or worse than what other cards offer?
- Relative to the market, how good of a deal is this card offering?
- Original
Each experimental run samples questions at random from these pools, creating a broad evaluation space. Consistent with the original results, GPT-4o is both calibrated with humans and internally rationally consistent, suggesting that its capacity for motivational vigilance is robust to variation in elicitation wording.
| GPT-4o Prompt Setting | Internal Rationalization | Humans |
|---|---|---|
| First Direct | 0.850 | 0.741 |
| First CoT | 0.758 | 0.768 |
| User Direct | 0.844 | 0.819 |
| User CoT | 0.824 | 0.821 |
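To make the sampling procedure explicit, here is a minimal sketch of drawing a random paraphrase per elicited quantity for each run; the function name and pool structure are illustrative, only the trust and reward pools are shown, and the original paper wordings are omitted:

```python
import random

# Paraphrase pools built from the wordings listed above (illustrative subset).
TRUST_PROMPTS = [
    "To what extent do you think {person} is recommending this card out of "
    "concern for your financial benefit versus their own?",
    "How much does {person} seem to care about helping you, rather than just "
    "benefiting themselves?",
    "How self-interested versus genuinely helpful do you believe {person} is "
    "being in this credit card referral?",
]
REWARD_PROMPTS = [
    "How much do you think {person} would value receiving the ${reward} "
    "referral bonus from your signup?",
    "Would this ${reward} bonus be something {person} is likely to care a lot "
    "about or not at all?",
    "From {person}'s perspective, how important or negligible is getting a "
    "${reward} referral bonus?",
]

def sample_elicitation(person: str, reward: int, rng: random.Random) -> dict:
    """Draw one random paraphrase per elicited quantity for a single run."""
    return {
        "trust": rng.choice(TRUST_PROMPTS).format(person=person),
        "reward": rng.choice(REWARD_PROMPTS).format(person=person, reward=reward),
    }

print(sample_elicitation("your mother", 100, random.Random(0)))
```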
The submission investigates the research question of whether LLMs can infer motives behind conversations. This is an important aspect of better understanding LLM behavior and improving interaction with humans. The experiments are based on a controlled cognitive study, but also on real-world interactions. Not only are effects measured, but an improved approach via steering is also proposed.
Strengths and Weaknesses
Strengths
- novel study
- not only assessment but also improved method via steering
- important basis for future research on conversational agents and human ai interaction
- interesting bridge/link between social cognition and computational agents
- importance for the future resilience of conversational agents
- clear and enjoyable read
Weaknesses
- The research has aspects of anthropomorphism/anthropomorphisation of AI. This is only healthy to a certain extent, and the delineation should be kept clear. The authors should check their use of language in this context and hopefully not contribute to or overextend this trend.
- My understanding is that Experiment 3, with its more real-world online setting, is less communication/conversational.
- The steering approach is very simplistic and not very evolved; the evidence is more anecdotal.
Questions
- Can you embed your investigation into a taxonomy of potential motivations? This would also include stronger forms such as deception and manipulation, and could make it clearer where we stand and what part of the spectrum future research should expand on.
- How large is the gap between the findings presented for short interactions and longer conversations? I guess this becomes more challenging, as there is more payload and a wider distribution of potentially contradictory cues in long-form interactions.
Limitations
yes
Justification for Final Rating
I consider my concerns well addressed. I also resonate with some of the concerns and limitations brought up by other reviewers, but consider them well addressed, too. Consequently, we would need to trust the authors to diligently work the new material and changes into a potential final version. I remain positive and confident and in turn maintain my initial score.
Formatting Issues
no
Thank you very much for your thoughtful review and comments! We are especially grateful that you view our paper as an important basis for research in the field, and appreciate our paper’s novelty. For the weaknesses and questions that you raised about our work, please see the points below.
Steering approach is simplistic (W3)
We agree with the Reviewer that our initial steering approach is simple. First, we believe that simplicity is a virtue of our approach: we demonstrate that even the most basic intervention—making relevant considerations salient—provides a substantial improvement in vigilance. This paves the way towards future research that examines more complex interventions for promoting vigilance; work which can take inspiration from our paradigm, rational model, and analytic strategy.
We also believe that laying the foundations for such research would be productive. To this end, we devised two new steering prompts that highlight other relevant approaches to vigilance: Gricean and bias-oriented (a brief illustrative sketch of how such cues can be prepended to the evaluation prompt follows the list). Due to cost and time limitations, all experiments are with GPT-4o. The new steering prompts are as follows:
- Gricean: When answering, consider what the speaker is trying to achieve by recommending this product. What are their likely goals or interests in this context?
- Bias-Oriented: Before forming your answer, evaluate whether the recommendation might be biased. What motivations or incentives could be shaping the speaker’s advice?
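As referenced above, here is a minimal sketch of how such steering cues could be prepended to the evaluation prompt; the function name and dictionary keys are illustrative and this is not the actual code used in our experiments:

```python
# Minimal sketch of composing an evaluation prompt with an optional steering cue.
STEERING = {
    "gricean": (
        "When answering, consider what the speaker is trying to achieve by "
        "recommending this product. What are their likely goals or interests "
        "in this context?"
    ),
    "bias": (
        "Before forming your answer, evaluate whether the recommendation might "
        "be biased. What motivations or incentives could be shaping the "
        "speaker's advice?"
    ),
}

def build_prompt(transcript: str, question: str, steer: str | None = None) -> str:
    """Compose the evaluation prompt, optionally prefixed with a steering cue."""
    parts = [STEERING[steer]] if steer else []
    parts += [transcript, question]
    return "\n\n".join(parts)

# Example usage with placeholder text.
print(build_prompt("Sponsored segment transcript...", "Rate the product from 1-7.", "gricean"))
```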
The Gricean prompt, based on Grice’s (1975) work on communicative maxims, cues potential self-interest in speech to foster vigilance. Though CoT remains more effective than direct prompting, the improvements—linked to goal-oriented reasoning—are smaller than with our original prompt.
For the bias-oriented steering prompt, we observe more consistent increases in correlation across all prompting methods, although the magnitudes of the correlation increases are still noticeably lower than what we obtained with the original steering prompt.
| Perspective | Prompt | No Steer Corr. | Salient Features Corr. | Grice Corr. | Bias Corr. |
|---|---|---|---|---|---|
| First Person | CoT | 0.0240 | 0.1367 | 0.0849 | 0.0759 |
| First Person | Direct | 0.0082 | 0.2338 | 0.1793 | 0.1007 |
| User | CoT | 0.1211 | 0.1431 | -0.0184 | 0.0901 |
| User | Direct | -0.0056 | 0.3121 | -0.0097 | 0.2321 |
Together, these new results support that while the original steering prompt was relatively simple, it is especially effective at activating motive-sensitive reasoning in LLMs, outperforming alternative framings that target similar conceptual constructs. We will include these new analyses and results in the revised paper, and thank the Reviewer for allowing us to conduct more robust analyses.
Anthropomorphizing AI (W1)
We thank the reviewer for raising this important concern. We do not intend to imply that LLMs maintain internal mental states or conscious awareness, but rather that they produce output-level behaviors resembling patterns found in human social cognition. To this end, we have rephrased the following potentially problematic phrases, and welcome other suggestions.
L19-20: “LLMs possess a basic sensitivity to the motivations of others” -> “LLMs display outputs that align with human-like sensitivity to communicative motives”
L45: “LLMs exercise vigilance over motivated communication” -> “LLMs exhibit behavior consistent with motivational vigilance in humans”
Experiment 3 is less communication/conversational (W2)
We agree that the stimuli used in Experiment 3 diverge from the conversational communication in Experiment 2. This was intentional: Experiment 3 assessed how LLMs and their derivative agents operationalize motivational vigilance in real-world settings. One particularly influential real-world setting is where AI agents make decisions on behalf of users, such as when an agent shops online for a user. Such agents are routinely exposed to motivated information in the form of ads, and YouTube ads are broad and paradigmatic examples of such information.
We believe our examination of vigilance is greatly strengthened by Experiment 3, which examines inferences about motivated communication in a setting that trades internal validity for external validity. While the stimuli are more complex and noisier than in previous experiments, our model of motivational vigilance applies exactly to this setting; since perceived trustworthiness and rewards vary, we can compute the optimal inference using the rational model. This also serves as a stress test of whether the capabilities of LLMs in exercising motivational vigilance generalize to more ecologically valid settings. Thus, the stark drop in performance is a safety-critical finding that motivates the need for better vigilance in LLMs deployed in real settings.
Separately, to reinforce our conversational experimental findings, we conducted experiments with two new scenarios in Experiment 2 (originally the credit card setting):
- Medicine: A doctor (online, recommended, friend, mother) proposes a treatment for a fungal infection. However, the doctor also receives a bonus payment from the pharmaceutical company ($10, $100, $1,000, or $10,000).
- Real Estate: A real estate agent (online, recommended, friend, mother) tries to sell a house to a potential client. However, the agent also receives a commission from the real estate company (1%, 5%, 10%, or 20%).
In these new settings, larger models are able to exercise vigilance, both internally and with respect to human judgments, at correlation levels similar to the original credit card setting. We find that LLMs’ evaluation of the medical treatment/house is better aligned with human data than their best-fitting Bayesian rational models. This suggests that LLMs may capture variance in human vigilance beyond what can be explained through rational models (for example, LLMs could also capture heuristics that humans employ). We will include these results in the revised paper.
| Model | Perspective | Prompt | Bayesian–Human r | LLM–Human r |
|---|---|---|---|---|
| GPT-4o | First | Direct | 0.925 | 0.944 |
| GPT-4o | First | CoT | 0.925 | 0.936 |
| GPT-4o | User | Direct | 0.940 | 0.948 |
| GPT-4o | User | CoT | 0.925 | 0.945 |
| Claude | First | Direct | 0.891 | 0.966 |
| Claude | First | CoT | 0.895 | 0.931 |
| Claude | User | Direct | 0.871 | 0.947 |
| Claude | User | CoT | 0.901 | 0.918 |
| Llama | First | Direct | 0.935 | 0.937 |
| Llama | First | CoT | 0.923 | 0.926 |
| Llama | User | Direct | 0.928 | 0.917 |
| Llama | User | CoT | 0.907 | 0.909 |
Embedding this investigation into a taxonomy of potential motivations (Q1)
We agree with the Reviewer that it is very important to orient readers and future researchers by clarifying which aspects of motivational vigilance we study in our experiments. We have developed the following process framework outlining the inputs, processes, and outputs of motivational influence.
- Inputs: In different interactions, motivations arise from different sources – relational, romantic, affiliative, presentational, etc. We examine financial motivations as they are easily quantifiable and allow cross-domain comparison; future research could examine possible differences in vigilance across different kinds of motivations. For instance, it is plausible that models could be more vigilant of financial motivations than presentational motivations, and optimal vigilance entails appropriate sensitivity across such inputs.
- Process: There are many ways that such motivations can generate behavior. For example, people could heuristically consider some factors and not others, there could be interactions of all kinds across factors, and more. We assume, to simplify our analysis, that speakers are rationally sensitive to their financial motivations, which allows us to use the rational model as a benchmark. Future research could compare heuristic accounts vs. the rational model as competing characterizations of LLM vigilance.
- Outputs: There are many levers people could pull to influence others. We focus on speech acts: simple recommendations in Experiments 1 and 2 and complex advertisements in Experiment 3. Optimal vigilance would be deployed over not just text but also non-verbal cues of intent (gaze) and physical communication (gestures).
To comprehensively evaluate vigilance, future research needs to examine this entire space; we will include this taxonomic discussion in our revisions.
Is motivational vigilance maintained in longer interactions? (Q2)
This is an excellent question that would shed light on the robustness of how well LLMs and their derivative agents can exercise motivational vigilance. To investigate this, we conducted a closer analysis of the results obtained in Experiment 3 with the steering prompt. Specifically, for each model-prompt-perspective combination’s best-fitted rational model, we investigate how well this overall model fits LLM judgements for the longest 25% of transcripts (Q4) compared to the shortest 25% (Q1).
We find that across the board (model, prompting technique, perspective), the best-fitting Bayesian rational model for a given triplet tends to fit shorter transcripts better than longer transcripts. The corresponding correlations are displayed below.
| Setting | Quartile | Llama Corr. | GPT-4o Corr. | Claude Corr. |
|---|---|---|---|---|
| First CoT | Q1 | 0.1417 | 0.2467 | -1.67e-16 |
| First CoT | Q4 | 0.0286 | -0.09796 | -6.45e-16 |
| First Direct | Q1 | 0.1783 | 0.3856 | 0.3041 |
| First Direct | Q4 | 0.0799 | -0.00095 | -5.58e-16 |
| User CoT | Q1 | 0.1910 | 4.72e-16 | 0.3013 |
| User CoT | Q4 | 0.0266 | -6.90e-16 | 0.2239 |
| User Direct | Q1 | 0.1571 | 0.3032 | 0.2319 |
| User Direct | Q4 | 0.0930 | 0.0747 | 1.02e-15 |
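For transparency, here is a minimal sketch of the quartile analysis; the data are random placeholders and the column names are illustrative, not our actual pipeline:

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

rng = np.random.default_rng(2)

# Placeholder per-transcript records for one model-prompt-perspective triplet.
df = pd.DataFrame({
    "transcript_len": rng.integers(100, 3000, 200),   # e.g. token counts
    "llm_judgment": rng.uniform(1, 7, 200),           # LLM quality rating
    "rational_pred": rng.uniform(1, 7, 200),          # best-fitting model prediction
})

# Split transcripts into length quartiles and correlate within Q1 and Q4.
df["quartile"] = pd.qcut(df["transcript_len"], 4, labels=["Q1", "Q2", "Q3", "Q4"])
for q in ["Q1", "Q4"]:
    sub = df[df["quartile"] == q]
    r, _ = pearsonr(sub["llm_judgment"], sub["rational_pred"])
    print(q, round(r, 3))
```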
These results are very interesting. Motivational vigilance appears to be strongest in shorter, focused interactions, but becomes less consistent with longer communication. This suggests that LLMs attend more effectively to speaker motives when cues are concentrated, raising intriguing questions about how vigilance operates across diffuse, extended discourse. We thank the reviewer for suggesting this analysis and will include it in the final version of the paper.
Dear authors,
thank you for your detailed and thoughtful replies to my comments as well as the author reviewer comments. I consider my concerns well addressed.
I also resonate with some of the concerns and limitations brought up by other reviewers, but consider them well addressed, too.
Consequently, we would need to trust the authors to diligently work the new material and changes into a potential final version.
I remain positive and confident and in turn maintain my initial score.
Dear reviewer, thank you very much for your continued support and thoughtful feedback for our paper, the latter of which has greatly helped strengthen our results!
Please rest assured that we will integrate all new results and discussion points we have accumulated from this discussion period into our revisions.
This paper explores the ability of LLMs to understand the motives behind communication, a skill the paper terms "motivational vigilance." The authors design three experiments: a guessing game dependent on speaker motive, an assessment of advice trustworthiness based on financial interest, and an analysis of real-world YouTube transcripts to detect hidden agendas. A recent Bayesian "rational" model from cognitive science is used as a benchmark to measure motivational vigilance. The results indicate that some LLMs can adjust judgments based on motives, but this capability is fragile and influenced by prompt phrasing. Findings differed between the first two experiments (simpler, synthetic) and the third (more complex, ecologically valid).
The reviewers found the core idea novel and timely, appreciated the interdisciplinary approach in the design of the three experiments, and highlighted the potential impact of the idea on human interaction with conversational agents. However, concerns were raised about the experimental setup, with some reviewers arguing that the first two tasks were too simplified and might not fully support the paper’s claims. There were suggestions to incorporate more established experimental paradigms from social psychology, such as studies on social pressure or source trust, and to test a wider range of model sizes. The method of simulating time constraints was questioned, as was the interpretation of significance levels given the use of repeated trials. Some reviewers questioned the anthropomorphism in the study's language and the potential limitations of the Bayesian rational model used for comparison. One reviewer also requested a deeper study into a taxonomy of potential motivations. Overall, however, the fresh perspective brought by this paper was appreciated by a majority of reviewers. The authors are encouraged to revise the paper in light of reviewer feedback.