PaperHub
Overall score: 6.5/10 (ratings: 7, 6, 7, 6; min 6, max 7, std. dev. 0.5)
Confidence: 3.5
Poster · 4 reviewers · COLM 2025

Truth-value judgment in language models: ‘truth directions’ are context sensitive

Submitted: 2025-03-18 · Updated: 2025-08-26
TL;DR

Investigation of the in-context behaviour of LLM 'truth directions'.

Abstract

Keywords

mechinterp, mechanistic interpretability, interpretability, truth directions, LLM beliefs, large language model, llm

Reviews and Discussion

Review (Rating: 7)

This paper investigates how context or premises interact with several different kinds of truth probes in LLMs. The probes are trained with or without premises and are tested with affirming or conflicting premises. The finding is that the context and the truth directions interfere with each other.

Reasons to Accept

Before attempting to apply truth probes to real-life LLM monitoring or intervention, there should be more careful study of how they generalize outside the toy environments in which they were developed. This work does exactly that, with an emphasis on context. This is an important direction to study.

Experiment-wise, I like how the authors devise 4 different error forms and calibrate the results using ratios of error rates. The selection of datasets and models also seems comprehensive enough to draw conclusions.

Reasons to Reject

I'm not entirely sure prior work claims that truth probes are supposed to focus only on the last sentence or the hypothesis. I certainly expect them not to do so, as the authors discovered. If I have read the paper correctly, L157 should be revised.

Though the paper carefully examines this property, a reason I can think of against this paper is that it doesn't fully surprise me or shed light on the mechanism of why the context-following happens, or on how we can fix it (discussed in future work).

Questions for the Authors

NA

Comment

Thank you for the positive review.

I'm not entirely sure prior work claims that truth probes are supposed to focus only on the last sentence or the hypothesis.

We are not quite clear on the concern you are raising here.

Do you mean to point out that prior work never denied the context sensitivity of models and/or truth-value probes? If so, we agree, but neither did prior work explicitly acknowledge the distinction between the no-context and in-context cases. Some works simply used the no-context setting for simplicity when developing their method, and others used in-context settings without investigating the effect of the context.

Or, do you mean that prior work does not restrict the application of truth-value probes to the last sentence or hypothesis? Again, we agree. The reason we probe the hypothesis is because we want to study the context-sensitivity of the model/probe, i.e. we want to see if and how the assignment of truth-values to the hypothesis depends on the context in which it appears. We describe three ways in which the model could be interacting with the context (prior, conditional, and marginal beliefs) and quantify to what extent the model behaves in accordance with each of them. We do not attempt to probe for the truth value assigned to the premise by using the representation of the hypothesis, but rather we investigate the effect of the (corrupted/irrelevant/negated/affirmed) premise on the truth value that is assigned to the hypothesis.

Line 157: “A value close to zero for this metric would be consistent with a prior belief.” refers to the premise effect $\mathit{PE} = p(\mathbf{h}; q^+) - p(\mathbf{h})$. We are not really sure what you are objecting to. Our reasoning is as follows. A purely prior belief would not be sensitive to the context: it would involve the model assigning a truth value while ignoring the context. Thus we would expect adding a premise before the hypothesis to have zero effect, i.e. a premise effect of zero.
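
Written out as a one-line derivation (restating only quantities already defined in the reply above, no new claims):

```latex
\mathit{PE} = p(\mathbf{h};\, q^{+}) - p(\mathbf{h}),
\qquad
\text{purely prior belief} \;\Rightarrow\; p(\mathbf{h};\, q^{+}) = p(\mathbf{h}) \;\Rightarrow\; \mathit{PE} = 0 .
```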

It is true that we do not reveal the mechanism responsible for incorporating in-context information. However, we do reveal some facts about how models represent things in their latent space. First, there are no layers where the models are completely insensitive to context yet still have above-random accuracy, indicating that prior beliefs are not represented (fully) independently. Second, there seem to be separate (albeit possibly related) directions that can be found depending on whether the probes are applied to (1) individual sentences or (2) sentences that occur in a context; with (1) showing lower context sensitivity than (2).

Comment

I thank the authors for the reply.

Yes, I was talking about the point that "prior work never denied the context sensitivity of models and/or truth-value probes". A more detailed expansion of your points might be worthwhile, and your reply is suitable. And regarding L157, my point was basically that the "prior belief" needs to be put into context.

Review (Rating: 6)

This paper investigates how Large Language Models (LLMs) represent and judge the truth of sentences, particularly in relation to context. The authors examine "truth directions" in LLM latent spaces that have been found to predict sentence truth. This builds upon previous research that has used these directions to probe LLM "knowledge" or "beliefs."

The authors introduce four error scores to evaluate the coherence of truth-value judgments. They conduct experiments across different model architectures (Llama2-7b, Llama2-13b, and OLMo-7b) and training paradigms (pretrained vs. instruction-tuned). They also perform causal intervention experiments by moving representations along truth-value directions to test their influence on related hypotheses.
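
For readers unfamiliar with this kind of intervention, a minimal sketch of moving a hidden representation along a probe-derived truth direction might look as follows (variable names and the step size are illustrative, not the authors' implementation):

```python
import torch

def shift_along_direction(hidden: torch.Tensor,
                          truth_direction: torch.Tensor,
                          alpha: float) -> torch.Tensor:
    """Move an activation vector along a (unit-normalised) candidate truth direction.

    hidden:          [hidden_dim] activation of the premise or hypothesis token(s)
    truth_direction: [hidden_dim] probe weight vector interpreted as a truth direction
    alpha:           signed step size; positive pushes towards 'true', negative towards 'false'
    """
    direction = truth_direction / truth_direction.norm()
    return hidden + alpha * direction
```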

Through carefully designed experiments, the submission establishes that probes trained on these truth directions are sensitive to context, and it measures different types of consistency errors, especially when LLMs process hypotheses preceded by supporting, contradicting, or neutral premises. Based on this, the authors propose a new variant of the Contrast Consistent Search (CCS) method, called Contrast Consistent Reflection (CCR), with more stable convergence.
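
For reference, the original CCS objective (Burns et al., 2022) that CCR builds on can be sketched as below; the CCR modification itself is described in the paper and is not reproduced here:

```python
import torch

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    """Unsupervised Contrast Consistent Search loss (Burns et al., 2022).

    p_pos: probe outputs in (0, 1) for the affirmed statements x^+
    p_neg: probe outputs in (0, 1) for the negated statements x^-
    """
    consistency = (p_pos - (1.0 - p_neg)) ** 2      # p(x^+) should equal 1 - p(x^-)
    confidence = torch.minimum(p_pos, p_neg) ** 2   # discourage the degenerate p = 0.5 solution
    return (consistency + confidence).mean()
```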

Reasons to Accept

The submission presents a novel and important research problem of how LLMs represent truth in context, moving beyond single-sentence analysis to examine how context affects truth judgments.

The authors introduce a rigorous framework for evaluating the coherence of truth-value probes with clear error metrics. In addition, the causal intervention experiments provide valuable insights into how truth-value directions mediate natural language inference.

The proposed CCR method offers a more stable alternative to existing truth-value probing techniques. Extensive experiments have been conducted to provide a good insight into the proposed method.

Reasons to Reject

There are some writing issues with the submission, listed as follows:

  1. Confusing notation: Section 3.1 uses $\mathbf{x}$ to denote the LLM vector representation of a sentence. However, this notation is not employed in the following discussion, which instead uses $H$. The notation $\hat{\mathcal{Q}}$ and $\mathcal{Q}'$ is not introduced before Table 1.
  2. Ambiguous demonstration: It would be easier for the reader to understand the content if the natural-language demonstrations were given explicitly with clear notation. For example, the authors could consider giving demonstrations like: Premise (in-context information): December is not during the winter in New York; Hypothesis (assertion to be verified by the LLM given the premise): In New York, days are shortest in December.
  3. It would be better to place the figure above the results discussion (Section 4.1), for better readability.

The experimental setup is somewhat artificial, relying on specific negation patterns that may not reflect how contradictions typically appear in natural text.

The designed experiment doesn't fully explore how the identified issues might impact downstream applications of truth-value probes in real-world LLM deployments.

Questions for the Authors

  1. Do more powerful models acquire the ability to handle premise sensitivity? For example, closed-source models like GPT-4o and Claude-3.7-Sonnet, or the latest reasoning models, including QwQ-32B, DeepSeek-R1, and its distilled variants like DeepSeek-R1-Distill-Llama-3.1-8B.
  2. How might your findings about context sensitivity in truth directions inform techniques for making LLMs more truthful or reducing hallucinations?
  3. Did you observe any notable differences in how instruction-tuned models handle negation compared to base models beyond what was reported?
Comment

Thank you for the positive review.

Regarding our notation. With uppercase italicized characters (e.g. $X$, $Q$, $H$) we refer to sentences as well as the random variables denoting their truth value. With lowercase italicized characters (e.g. $q$, $h$) we refer to the event of those sentences being true, i.e. $h$ is short for $H=1$. With lowercase bold characters (e.g. $\mathbf{x}$, $\mathbf{q}$, $\mathbf{h}$), we refer to vector representations obtained from the LLM hidden states. We use $Q$/$q$/$\mathbf{q}$ and $H$/$h$/$\mathbf{h}$ in the context of our experiments where we have both premises and hypotheses, rather than just individual sentences.

To clarify this we will update lines 96-98 as follows:

a false statement (and vice versa) by negating it. We use superscripts $+$ and $-$ to denote

the affirmed ($X^+$) and negated ($X^-$) variants of a sentence, respectively. Their LLM vector

representations are given in bold, i.e. $\mathbf{x}^+$, $\mathbf{x}^-$ (see section 4 for how we negate sentences

We will make sure $\tilde{Q}$ and $Q'$ are introduced before Table 1. We will also make sure to clearly convey for each example what the premise is, what the hypothesis is, and what their roles are, as the reviewer suggests.

While we agree with the reviewer that the negation is somewhat unconventional, we note that the results show the LLM handles this without problems. For both datasets, the logistic regression probe performance is around 95%, showing the model is capable of parsing the sentence structure to accurately predict the truth of the hypothesis.
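
For readers unfamiliar with supervised truth-value probing, a rough sketch of what such a logistic regression probe involves is given below (file names and the train/test split are hypothetical, not the authors' setup):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical inputs: hidden-state representations of hypotheses from one LLM layer
# and the binary truth labels of those hypotheses.
hidden_states = np.load("hidden_states.npy")   # shape [n_examples, hidden_dim]
labels = np.load("labels.npy")                 # shape [n_examples]

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, labels, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"held-out probe accuracy: {probe.score(X_test, y_test):.3f}")
```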

Our probing experiments require access to the model’s hidden states, so our methodology does not allow us to evaluate closed-source models like GPT-4o and Claude. In response to the reviewer’s question about reasoning models, we repeated our experiments for DeepSeek-R1-Distill-Llama-3.1-8B. The results are very similar to those of LLaMA-2-8B, with E1 and E2 scores reaching their low points around 0.39 and 0.47, respectively.

We have not studied the effect of instruction-tuning on the results beyond what we reported.

Review (Rating: 7)

The paper seeks to interpret LLM latent spaces as including "truth dimensions" which can be activated by prompting/probing with consistent or counterfactual entailing contexts during inference in QA tasks, for Llama2 7B and 13B and OLMo 7B models.

Reasons to Accept

The method uses entailment pairs from the EntailmentBank and SNLI datasets in a hypothesis-QA task, probed with negated vs. original premises and a neutral random premise against a hypothesis-only baseline. Probes are trained on premise and no-premise data using a variety of methods, including a novel CCR version of CCS, and tested on held-out data. Various measures of premise and hypothesis strength are observed at the model head and across layers. The method seems very sound. The work is clearly presented.

Reasons to Reject

Results are somewhat mixed. There seems to be a strong premise effect, with positive premises boosting activation of the hypothesis and negated premises depressing it, supporting the hypothesised/predicted truth dimensions. However, the sensitivity to irrelevant or corrupted premises under measures E1 and E2 suggests that the truth-value probes are unfaithful, so the relation of truth dimensions to inferential capability in LLMs remains unclear.

Comment

Thank you for the positive review.

You correctly point out that the sensitivity to irrelevant and corrupted premises speaks to the (lack of) faithfulness of the probes. However, we think our results primarily speak to a lack of faithfulness of explanations that appeal to the concept of belief. It is still faithful to say that the model places sentences along a dimension that correlates with truth. What our results show is that this correlation holds in aggregate for sentences that are placed in the context of related sentences. But they also show that the correlation is noisy, and that the truth-value assigned to any particular sentence could be affected by arbitrary factors.

Review (Rating: 6)

The paper studies the context-sensitivity of truth-value judgments as estimated from truth-direction probes. The author(s) introduce a framework to evaluate context effects, distinguishing different types of errors. They run experiments using different probing methods applied to different LLMs, and run extensive analyses on EntailmentBank and SNLI. They find that truth directions are generally context sensitive, but that there is also sensitivity to irrelevant context.

Reasons to Accept

  • The paper is well-written and structured. The methods are clearly explained, adopting a rigorous notation.
  • The paper addresses a generally important topic for the NLP community (factuality in LLMs), focusing on a less studied angle: whether the truth value assigned to sentences is appropriately revised based on context information. This is a crucial aspect for general text understanding and instruction-following capabilities.
  • The author(s) consider a variety of probing methods, including highlighting limitations of some of them. On top of testing for the effect of altering premises, they also run a study where they modify the direction of context premises, providing evidence for a causal relation between truth directions and context incorporation.

Reasons to Reject

  • Based on the current discussion of the results, it is not fully clear which results we should interpret as limitations of the probes (and therefore of existing analysis methods) and which as limitations of the LLMs.
  • To clarify the above, it might help to run a comparison of the context-sensitivity exhibited when probing for truth judgments based on truth-direction probes vs. other methods (for instance, based on model outputs -- e.g., asking the LLM for truth values directly).
  • I generally encourage the author(s) to expand on the Conclusions section with a wider discussion on the implications of their findings and directions for future research (for instance on the sensitivity to irrelevant information).

Questions for the Authors

  • Which results should we interpret as limitations of the probes (and therefore of existing analysis methods), and which as limitations of the LLMs?
  • Have you run any analyses comparing the context-sensitivity of truth values as estimated by looking at model outputs?
Comment

Thank you for the positive review.

We recognize that any negative probing results can always be attributed to either flaws in the probes or flaws in the model. However, we do include multiple probing methods, mitigating the risk that the results are due to a particular flaw in any single probing method. Our probing results that show sensitivity to irrelevant information are also consistent with black-box evaluations that show LLMs can be sensitive to irrelevant information (Shi et al., 2023) (something we will expand upon in the conclusion section).

Our methodology allows the community to evaluate truth-value probes in new ways. If the probes are at fault, then our methodology at least provides new ways of distinguishing between good and bad truth-value probes.

Regarding the comparison to other methods, we note that our LM-head baseline is equivalent to using model outputs. This baseline measures the probability that the model puts on the ‘correct’/’incorrect’ tokens at the end of our inputs. The probing methods generally outperform this baseline both in terms of context sensitivity and error scores.
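
For clarity, an LM-head baseline of this kind amounts to something like the following sketch (the model name, prompt format, and label words are illustrative assumptions, and each label is assumed to contribute a single leading token):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = ("December is during the winter in New York. "
          "In New York, days are shortest in December. This statement is")
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]   # next-token logits at the end of the input
probs = torch.softmax(logits, dim=-1)

# Compare probability mass on the two label words (first sub-token of each, as an
# approximation; multi-token labels would require scoring the full continuation).
correct_id = tokenizer(" correct", add_special_tokens=False).input_ids[0]
incorrect_id = tokenizer(" incorrect", add_special_tokens=False).input_ids[0]
p_true = probs[correct_id] / (probs[correct_id] + probs[incorrect_id])
print(f"normalised p('correct') = {p_true.item():.3f}")
```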

Comment

Thank you for your reply.

I maintain my positive view on the paper and leave my score unchanged. I encourage the authors to incorporate the clarifications provided in the paper.

Final Decision

Past work has found the existence of a "truth feature" in LMs that distinguishes true statements from false ones; this has sometimes been used to argue for the existence of belief-like states or internal "world models" in LMs. The current paper studies the extent to which truth features really function like beliefs---i.e. whether they remain stable in contexts containing unrelated information, or update appropriately in contexts that should modify a model's credence in the statement being evaluated. As noted by reviewers, results are somewhat mixed, but in some sense this is the point of the paper: not only are truth features context-sensitive; they are also sensitive to information that shouldn't bear on truth values at all. Overall I think this paper makes a focused, useful contribution to our understanding of knowledge representation in LMs.