PaperHub
Overall score: 6.8 / 10 · Poster · 4 reviewers
Ratings: 5, 3, 4, 5 (min 3, max 5, std 0.8)
Confidence: 3.0
Novelty: 3.3 · Quality: 3.0 · Clarity: 2.8 · Significance: 3.3
NeurIPS 2025

Auditing Meta-Cognitive Hallucinations in Reasoning Large Language Models

OpenReview | PDF
Submitted: 2025-05-10 · Updated: 2025-10-29
TL;DR

This paper investigates how hallucinations arise and persist in RLLM reasoning, revealing error self-reinforcement and limited metacognition.

Abstract

Keywords
hallucination · Chain-of-Thought · reasoning · Metacognition

Reviews and Discussion

Official Review
Rating: 5

This paper investigates how hallucinations arise and persist in Reasoning Large Language Models (RLLMs), especially during complex multi-step reasoning. While prior methods attempt to reduce hallucinations via external knowledge or self-verification, they lack insight into the progression of errors across reasoning chains. By auditing Chain-of-Thought (CoT) trajectories and analyzing cognitive confidence, the authors show that RLLMs often reinforce initial biases, leading to persistent hallucinations—despite corrective interventions—due to a phenomenon they call "chain disloyalty." Their black-box auditing method offers interpretable and generalizable hallucination attribution without requiring access to model parameters, outperforming existing detection techniques in complex reasoning scenarios.

Strengths and Weaknesses

Strengths:

  1. This paper is well-written and thoroughly examines a highly relevant area, particularly given the limited understanding of the reasoning abilities of reasoning models. It addresses the challenges posed by long-CoT hallucinations, which remain an unresolved barrier to building reliable RLLMs. Concentrating on metacognition, specifically the model's confidence in its assertions, addresses a notable research gap.
  2. The four RFC-grounded subsets (Type I / II plus controls), built on verifiable ground truths with a thoroughly documented construction workflow, facilitate reproducibility and accurate attribution.
  3. The case studies effectively illustrate "chain disloyalty": once an error is introduced, reflection may inadvertently reinforce it instead of rectifying it. The authors candidly discuss the scope, biases, and societal implications, thereby advocating for responsible future research.

Weaknesses:

  1. Experiments use DeepSeek-R1 (plus GPT-4o for annotation); cross-model generality is untested.
  2. While the paper analyzes hallucinations in depth, it does not propose a new mitigation technique, which would have strengthened the manuscript further.

Questions

Questions:

  1. All current experiments rely on DeepSeek-R1; thus, it is unclear whether the documented "metacognitive drift" reflects model-specific quirks or general tendencies. It would be best to test other models as well.
  2. The paper convincingly diagnoses the problem but stops short of actionable remedies. Showing even a modest reduction in hallucination persistence would turn the work from purely diagnostic to partially prescriptive.
  3. Some baselines require days on dual A100s, yet the cost/benefit discussion is brief. The authors should expand on this.

Limitations

Yes.

Final Justification

I would like to thank the authors for their thorough and responsive rebuttal. They have not only addressed all my questions but have also provided additional data and analysis that significantly strengthen the paper's contributions. The authors have successfully addressed all points of discussion. Their willingness to provide new data and detailed analysis during the rebuttal period is commendable. For this, I will maintain my good score.

Formatting Issues

N/A

Author Response

Weakness: On Model Scope and Mitigation Contributions

We thank the reviewer for their time and valuable feedback. We are glad that the importance of hallucination analysis in reasoning models has been recognized. Below we address the two concerns raised: generalizability (W1) and the lack of a mitigation technique (W2, discussed further under Q2).

1. On model selection and the annotation pipeline

As we noted in our response to Reviewer z9Q8’s W3, we chose DeepSeek-R1 for a combination of practical and empirical reasons: the model exhibits a high hallucination rate, low inference cost, and strong recognition within the open-source community. These characteristics made DeepSeek-R1 a suitable foundation for controlled, large-scale analysis of reasoning errors in CoT-style outputs.

To ensure annotation quality and consistency, we adopted a rigorous, multi-stage pipeline that combines GPT-4o-assisted tagging with human verification. In particular, as described in Appendix C.2, we defined the following categories with precise criteria:

  • Wrong Reasoning: This refers to a sentence or group of sentences responsible for "drawing conclusions or summarizing" within the reasoning chain, but which ultimately arrives at a judgment or answer that is clearly inconsistent with the facts. In simple terms, the model accepts an incorrect premise and continues reasoning from it.
  • External Incorrect Knowledge: This refers to a sentence or group of sentences in which the model references or builds upon external knowledge introduced directly or indirectly by the user input (i.e., information not contained in the model's internal knowledge base or the relevant RFC document). These statements contain factual errors because the model accepts, incorporates, or elaborates on user-supplied information that is itself incorrect or misleading. In short, the model incorrectly relies on "imported" knowledge provided through the prompt.
  • Internal Incorrect Knowledge: This refers to fact-based content produced by the model that stems from its own internal knowledge, not prompted or introduced by the user. The model treats this information as objective truth, often presenting it with confidence, but it is factually incorrect when checked against the authoritative RFC document. In short, it reflects mislearned or misremembered knowledge from the model's prior training or internal reasoning.
  • Unreasonable Assumptions: This refers to unsupported, disconnected assumptions raised by the model in its reasoning, often introduced with conditional language such as "if…" or "suppose…". These assumptions lack justification from the context or facts, leading to a flawed logical foundation from the outset.
  • Self-queries: This refers to rhetorical or reflective questions posed by the model to itself during reasoning, often to explore or test new ideas. These typically end in question marks or include phrases like "let me think," "could it be," or "wait…," guiding the model's next steps.

Following GPT-4o–based annotation, we manually sampled 10% of the annotated dataset to refine the labeling schema, correcting edge cases and ambiguous boundaries between categories. The final annotation pipeline used in the study is the result of multiple iterations of refinement and validation. (See Q1 for generalizability of cross-model analysis.)
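For readers who want to operationalize this taxonomy, a minimal sketch of a claim-level label schema is given below. The category names follow the definitions above; the `ClaimLabel` / `AnnotatedClaim` names and fields are illustrative assumptions, not the authors' released tooling.

```python
from dataclasses import dataclass
from enum import Enum

class ClaimLabel(Enum):
    """Claim-level categories as described in Appendix C.2 / the rebuttal."""
    WRONG_REASONING = "wrong_reasoning"                            # faulty conclusion or summary step
    EXTERNAL_INCORRECT_KNOWLEDGE = "external_incorrect_knowledge"  # incorrect knowledge imported via the prompt
    INTERNAL_INCORRECT_KNOWLEDGE = "internal_incorrect_knowledge"  # mislearned or misremembered facts
    UNREASONABLE_ASSUMPTION = "unreasonable_assumption"            # unsupported "if..."/"suppose..." premises
    SELF_QUERY = "self_query"                                      # "let me think", "could it be", "wait..."

@dataclass
class AnnotatedClaim:
    step: int                        # position of the claim in the CoT
    text: str                        # the sentence(s) forming the claim
    label: ClaimLabel
    verified_by_human: bool = False  # a 10% sample was re-checked manually, per the rebuttal
```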

Question: On Generality, Prescriptiveness, and Resource Considerations

We thank the reviewer for their thoughtful feedback and critical suggestions. We agree that understanding whether metacognitive drift and hallucination patterns generalize across models is essential (Q1), and that a path toward practical mitigation (Q2) would strengthen the contribution. We further discuss the cost overhead of the different methods (Q3).

1. On generality beyond DeepSeek-R1

We agree that validating the generality of metacognitive drift across models is important. While DeepSeek-R1 served as our main analysis target due to its high hallucination incidence, affordability, and widespread usage, we also conducted additional tests on other models during the study's early phase.

Specifically, we probed Claude and Qwen using the same Type I and Type II setups (see detailed tables in our response to Reviewer z9Q8’s W3), and found that hallucination behaviors, particularly persistence and early error propagation, were also evident in these models. For example, Claude exhibited a hallucination rate of 67.8% in Type I and Qwen reached 94.4%. These findings suggest that the phenomena we report are not unique to DeepSeek-R1 but reflect shared limitations in multi-step reasoning across model families.

We hope this supports the broader relevance of our observations.

2. On actionable impact and mitigation

We appreciate the reviewer’s suggestion regarding mitigation. Although our current work does not directly propose or test a new hallucination reduction method, we believe it makes important foundational contributions toward that goal.

Our analysis reveals that many hallucinations stem from metacognitive failures. The model continues reasoning under false premises with unjustified confidence. In Section 3.4 and Table 3, we show that many existing detection techniques fail precisely because they ignore the dynamic nature of confidence evolution across multi-step reasoning.

This insight provides a new perspective for designing mitigation strategies. Rather than relying on static thresholds or isolated claim-level signals, future methods could focus on detecting suspicious confidence trajectories (e.g., abrupt increases or inconsistent shifts) or on interrupting reflection cycles that reinforce errors.

In this sense, our work offers a conceptual and analytical basis for intervention, and we view mitigation as a natural extension. Several follow-up efforts are underway to explore this direction.

3. On cost–benefit analysis of different baselines

We thank the reviewer for raising this important question. Below we offer a clearer breakdown of the computational cost.

To simplify the analysis, we normalize all LLM calls to a single average time unit, $T$, representing the time cost of one inference pass through a large language model (e.g., DeepSeek-R1 or GPT-4o); a small numeric sketch of the resulting estimates follows the breakdown below. Other variables are defined as:

  • $S$: Total number of sentences to be evaluated.

  • $C_{\text{avg}}$: Average number of claims extracted per sentence (average 1.8).

  • $Q$: Number of question variants generated per claim (typically 3).

  • $M$: Number of times each question is re-answered (typically 3).

  • $N$: Number of self-check samples per original CoT (typically 20).

  • $T_{\text{cla}}$: Inference time for a BERT-like classifier, where $T_{\text{cla}} \ll T$.

  • $n$: Cost of lightweight post-inference operations (e.g., attention/statistics), where $n \ll T$.

  • Semantic Entropy

    Semantic Entropy decomposes sentences into atomic claims, generates new questions per claim, and then samples answers multiple times.

    $\mathrm{Time} \approx S \cdot (T + C_{\mathrm{avg}} \cdot Q \cdot M \cdot T) = S \cdot (T + 1.8 \times 3 \times 3 \cdot T) = S \cdot (T + 16.2\,T) = 17.2\,ST$

  • CCP (Claim Consistency via Prediction)

    CCP uses the same decomposed claims and evaluates token-level prediction confidence with semantic grouping via an NLI model.

    $\mathrm{Time} \approx S \cdot C_{\text{avg}} \cdot T = S \cdot 1.8 \cdot T = 1.8\,ST$

  • SelfCheckGPT

    This method generates N responses (e.g., 20) and compares each with the original CoT trace using NLI-based sentence matching.

    $\mathrm{Time} \approx 20\,T$

  • Medium-Cost Methods

    These include Logit Entropy, Attention Strength, and Spectral Entropy. Each method requires one full inference and a lightweight internal analysis.

    $\mathrm{Time} \approx T$

  • HDM2 model

    HDM2 model applies a fine-tuned BERT classifier directly on the full CoT output:

    $\mathrm{Time} \approx T_{\text{cla}} \quad (\text{where } T_{\text{cla}} \ll T)$

    Inference is highly efficient—typically sub-second per sample on GPU.
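As a sanity check on the arithmetic above, the following sketch reproduces the cost estimates under the stated defaults ($C_{\text{avg}} = 1.8$, $Q = M = 3$, $N = 20$), with all costs expressed in units of one inference pass ($T = 1$). The function names are illustrative assumptions, not part of any released code.

```python
# Cost estimates in units of one LLM inference pass (T = 1), following the rebuttal's notation.
C_AVG, Q, M, N = 1.8, 3, 3, 20

def semantic_entropy_cost(num_sentences: int) -> float:
    # One decomposition pass per sentence, then Q question variants x M re-answers per claim.
    return num_sentences * (1 + C_AVG * Q * M)   # ~17.2 * S

def ccp_cost(num_sentences: int) -> float:
    # One confidence-scoring pass per extracted claim.
    return num_sentences * C_AVG                 # ~1.8 * S

def selfcheckgpt_cost() -> float:
    # N full re-generations compared against the original CoT.
    return float(N)                              # ~20

if __name__ == "__main__":
    S = 40  # e.g., a CoT answer with 40 sentences
    print(round(semantic_entropy_cost(S), 1), round(ccp_cost(S), 1), selfcheckgpt_cost())
    # -> 688.0 72.0 20.0
```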

We are greatly encouraged by the reviewer’s positive feedback. During the rebuttal process, we have worked to address the concerns raised as thoroughly as possible, providing additional data and clarifications where needed. We look forward to continued discussion.

Comment

I would like to thank the authors for their thorough and responsive rebuttal. They have not only addressed all my questions but have also provided additional data and analysis that significantly strengthen the paper's contributions. The authors have successfully addressed all points of discussion. Their willingness to provide new data and detailed analysis during the rebuttal period is commendable. For this, I will maintain my good score.

Comment

Dear Reviewer Zxo1,

We sincerely appreciate your positive feedback on our paper's examination of metacognitive hallucinations and the insightful case studies. Your thorough and thoughtful evaluation is deeply valued.

Official Review
Rating: 3

The authors investigate how hallucinations emerge and propagate in Reasoning Large Language Models (RLLMs), particularly in multi-step Chain-of-Thought (CoT) reasoning. The authors construct a controlled knowledge environment using RFC documents to systematically audit hallucination trajectories. They identify that hallucinations often originate from the model's overconfident use of incorrect or unlearned knowledge, and are amplified through flawed reflective reasoning, where the model increases confidence in false claims due to semantic alignment with user prompts. Moreover, they show that despite editing interventions, hallucinated reasoning paths show "chain disloyalty", resisting correction. In sum, this study highlights the limitations of current hallucination detection and mitigation methods in long-CoT settings, emphasizing the need for future models with explicit metacognitive capabilities to ensure more reliable and interpretable reasoning.

Strengths and Weaknesses

Strength:

- The paper tackles a timely issue, given the massive use of reasoning models, and tries to provide interesting descriptive statistics on CoT-driven hallucinations.

Weaknesses:

- My main concern with this paper is the lack of clarity in the writing. The introduction on its own assumes knowledge that is either not present in the paper or comes much later in the manuscript. Figure 1 is extremely cryptic, and I cannot make sense of any of the subgraphs in this figure as I read along in the paper. The rest of the paper follows a similar pattern of poor clarity.

- The paper would strongly benefit from an incremental description of the concepts, as well as more effort in grounding them in a way that is clear to the reader.

- Modelling assumptions seem to be crucial to the results, and thus to the validity of the interpretability method proposed by the authors. It seems that the proposed method, which gains interpretability of the CoT, comes at the cost of robustness (an issue that deeper-level methods avoid).

- The paper would benefit from a proofread by a native English speaker.

Questions

  1. What does it mean for a model to exhibit high confidence, i.e., high $\text{conf}_M(k)$? How is this formalized by the authors? Equation 3 is also very cryptic and hard to interpret: how is $\text{conf}(c_q) - \text{conf}(c_p)$ reflected on the right side of Equation 3 (also, what is $f$ in Eq. 3? I imagine "feedback")?

  2. Have the authors considered other metacognitive confidence modelling approaches that could be equally intuitive but generate diverging results, regardless of whether the proposed modelling was based on previous work?

Limitations

Yes

Final Justification

The authors have clarified some of my concerns. However, it is impossible to judge the future clarity of a work based on a rebuttal.

Formatting Issues

N/A

Author Response

Weakness: On Writing Problems, Figure 1, and Robustness

1. On Writing problems and Figure 1 (W1, W2 and W4)

We thank the reviewer for their detailed feedback. We understand that parts of the paper (especially the early modeling and Figure 1) may have caused confusion due to the density of new terms and abstractions. While we aimed to balance technical rigor and brevity, we now see that some concepts were introduced too quickly or without sufficient grounding for first-time readers. We apologize for the resulting difficulty.

In the revision, we will clarify and more explicitly contextualize key concepts in the introduction and Section 2, including hallucination types, reasoning claims, and reflective interventions, as we recognize that some readers may find the current exposition insufficiently connected. We will also revise Figure 1, its caption, and related text to ensure that each component is clearly introduced with appropriate context, examples, and cross-references.

To help clarify now, we provide a more detailed breakdown of Figure 1, which is meant to offer an overview of our analysis framework:

Figure 1(a) presents a comparison between the reasoning-phase knowledge domain and the training-phase knowledge domain [1], as a foundation for defining two distinct types of hallucination.

  • The upper part is the knowledge state during reasoning, i.e., what the model appears to “know” or retrieve when solving a specific task. This internal knowledge state may include:

    • Known: Correct facts from prior training,

    • Unknown concepts,

    • Incorrect beliefs, e.g., “the Sun is blue.”

  • The lower part of the figure represents what the model was actually exposed to during training. This includes only a partial and sometimes distorted subset of real-world knowledge. In other words, the model's training data does not cover all facts:

    • Some facts were never included in the training set (Type II hallucination).
    • Some were present but learned only partially or with uncertainty (Type I hallucination).

Figure 1(b) illustrates our reasoning graph [2] [3], which represents how the model constructs its Chain-of-Thought (CoT) during multi-step reasoning.

  • Each node represents a claim made by the model at a certain step (e.g., a fact, sub-conclusion, or logical assertion).
  • The number next to each node indicates the order of the reasoning step.
  • During the reasoning process, new knowledge may be introduced, either from the model itself (internal knowledge) or from the input prompt (external knowledge). If incorrect knowledge enters the graph, it can be carried forward to later steps and mislead the final answer.
  • What makes this dangerous is that CoT often wraps the reasoning in fluent logic, making these errors look reasonable even when they are wrong.

Figure 1(c) gives an example of how incorrect knowledge introduced early can lead to hallucinated conclusions through reasoning.

  • In this case, an incorrect claim (marked as $ck_1$) is injected at Step 2.
  • As the reasoning continues, more knowledge is added at later steps. However, since the early claim is flawed, it silently influences downstream steps.
  • At Step 7, the model performs a reflection [4], revisiting and modifying its earlier claim $c_1$ to a new version $c_4$, leading to a final conclusion that is logically consistent but factually wrong.
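To make the graph abstraction in Figure 1(b) and the trace in Figure 1(c) more concrete, a minimal sketch of how such a reasoning graph could be represented is shown below. The field names (`source`, `supports`, `revised_to`) are illustrative assumptions, not the paper's actual data format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ClaimNode:
    step: int                       # order of the reasoning step (the number next to each node)
    text: str                       # the claim: a fact, sub-conclusion, or logical assertion
    source: str                     # "internal" (model knowledge) or "external" (from the prompt)
    supports: List[int] = field(default_factory=list)  # earlier steps this claim builds on
    revised_to: Optional[int] = None                    # set when a reflection replaces this claim

# A Figure 1(c)-style trace: an incorrect external claim enters at step 2,
# later steps build on it, and a reflection revises rather than rejects it.
trace = [
    ClaimNode(1, "Restate the question.", "internal"),
    ClaimNode(2, "Incorrect fact imported from the prompt.", "external"),
    ClaimNode(3, "Sub-conclusion that builds on step 2.", "internal", supports=[2]),
]
trace[1].revised_to = 4  # a later reflection step rewrites the flawed claim instead of dropping it
```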

2. On Modelling robustness (W3)

We would like to clarify that robustness is not the main focus of this work. Our method is specifically aimed at improving the interpretability of the CoT process, and all evaluations are aligned with that objective. While robustness is an important aspect in general, our work does not make claims in that direction.

Regarding the point about deeper-level methods, while we acknowledge that some existing approaches do aim to provide interpretability at the level of model architecture or internal representations, we are not aware of prior work that directly addresses or explains hallucinations in the CoT process in the way our method attempts to. Our approach is motivated by the need for more targeted analysis of intermediate reasoning steps, which we believe remains relatively underexplored.

Finally, our modeling assumptions are not speculative, but are grounded in prior literature and supported by multiple references cited in the main paper and in this response. These provide theoretical and empirical motivation for our formulation.

Questions: On Confidence and Equation 3

1. What is confidence and what does it mean in a metacognitive sense? (Q1.1)

In our context, metacognition [5-7] refers to the model’s awareness of whether it knows something. Confidence captures how certain the model believes it knows something, regardless of whether that belief is actually correct.

For example, if the model says “the sun is red” and is fully certain about it, it is showing high confidence. What matters here is not the truth of the claim, but that the model believes it knows the answer with certainty. This reflects its internal metacognitive state.

Conversely, when the model faces knowledge it hasn’t fully learned or is uncertain about, it often gives more hesitant responses—indicating low confidence. These claims are more likely to be revised or overturned during the reasoning process.

Importantly, our modeling emphasizes that confidence is not static: it can change as the reasoning unfolds. In multi-step CoT reasoning, the model’s belief in a given claim may be weakened, reinforced, or even reversed based on new knowledge introduced later. Our work focuses on how these confidence shifts happen, and how they relate to hallucination formation and propagation.

2. How should Equation (3) be interpreted? (Q1.2)

Equation (3) defines how the model's confidence in a new claim $c_q$ changes relative to a previous claim $c_p$ during the reflection process. Specifically, it quantifies:

$$\Delta \text{conf}(c_p, c_q) = \text{conf}(c_q) - \text{conf}(c_p) = \alpha \cdot f(c_{p-1}, c_q) + (1 - \alpha) \cdot g(c_q, \text{prompt})$$

  • Here, $\text{conf}(c_q)$ and $\text{conf}(c_p)$ represent the model's internal confidence in the corresponding claims $c_q$ and $c_p$, respectively.

  • The change in confidence depends on two sources:

    1. Internal feedback, captured by $f(c_{p-1}, c_q)$, which reflects how strongly the preceding reasoning step $c_{p-1}$ supports or contradicts $c_q$. We model this as a function of $c_{p-1}$ rather than $c_p$ because of the correspondence between $c_p$ and $c_q$: the feedback should be determined by the state prior to the confidence update (i.e., $c_{p-1}$) and the new claim $c_q$.
    2. Prompt alignment [8], modeled by $g(c_q, \text{prompt})$, which reflects how well the new claim $c_q$ semantically aligns with the original user instruction or prompt context.

  • The parameter $\alpha \in [0, 1]$ controls the weighting between these two influences.

In simple terms, this equation states that the model will become more confident in a new claim $c_q$ if it is both (a) supported by its own recent reasoning history, and (b) aligned with what the user originally asked. If either of these is weak, confidence may decrease. This formulation gives us a concrete way to track how belief in a claim strengthens or weakens during reflection.
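A minimal numeric sketch of this update rule is given below. Since the rebuttal leaves the concrete form of $f$ and $g$ open, the scalar stand-ins here (values assumed in $[-1, 1]$) are illustrative assumptions.

```python
def delta_conf(alpha: float, internal_feedback: float, prompt_alignment: float) -> float:
    """Confidence change for a new claim c_q relative to c_p, per Eq. (3).

    internal_feedback ~ f(c_{p-1}, c_q): support from the preceding reasoning step, assumed in [-1, 1]
    prompt_alignment  ~ g(c_q, prompt):  semantic alignment with the user prompt, assumed in [-1, 1]
    """
    return alpha * internal_feedback + (1 - alpha) * prompt_alignment

# Weak internal support but strong prompt alignment still raises confidence when alpha is small,
# which is the "prompt-aligned bias" regime discussed in the rebuttal.
print(round(delta_conf(alpha=0.3, internal_feedback=-0.2, prompt_alignment=0.9), 2))  # 0.57
```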

3. Considered other metacognitive confidence modeling methods? (Q2)

Yes. As we explain in Section 3.4, many hallucination detection methods [9] today are in fact trying to estimate how confident the model is about its own statements. In other words, they are testing whether the model is metacognitively “sure” of what it’s saying.

However, the key issue is that most existing methods rely on fixed rules, confidence heuristics, or surface-level signals. They do not deeply involve semantic understanding, and they do not touch the internal model reasoning process. As a result, these methods often cannot give reliable confidence estimates, especially in knowledge-intensive tasks. Our experiments show that in these cases, they perform poorly (see Table 5).

[1] Physics of Language Models: Part 3.1. ICML 2024

[2] On the Biology of a Large Language Model, Transformer Circuits, 2025.

[3] Physics of Language Models: Part 2.1. ICLR 2025.

[4] Mirror: Multiple-perspective Self-Reflection Method for Knowledge-rich Reasoning. ACL 2024.

[5] Large language models lack essential metacognition for reliable medical reasoning. Nature Communications, 2025.

[6] Decoupling Metacognition from Cognition. AAAI 2025.

[7] Language models (mostly) know what they know. 2022.

[8] From Yes-Men to Truth-Tellers. ICML 2025

[9] Detecting hallucinations in large language models using semantic entropy. Nature 2025.

We sincerely thank Reviewer BZYH for their careful reading and constructive feedback. We understand the core concern lies in the clarity of Figure 1 and Section 2, where several key concepts were introduced too quickly, potentially limiting accessibility. This was partly due to space constraints, as we prioritized presenting broad experimental insights. In the revision, we will improve the exposition and readability of these sections while maintaining technical depth.

We are encouraged that the other three reviewers recognized the strong motivation and extensive experimental validation of our work. In this response, we have addressed Reviewer BZYH’s main concerns in detail and clarified the points that were previously unclear. We hope the core value of this paper can be seen in the empirical observations and insights provided in Section 3, which we believe offer useful perspectives for future research on hallucination behavior. We would sincerely appreciate it if the reviewer could reconsider their assessment in light of these contributions.

Comment

I thank the authors for their thorough response. I appreciate their future effort in building the concepts in the paper "from the ground up" and clarifying some of the questions I posed. In light of their comment, I am raising my score to 3.

Comment

Dear Reviewer BZYH,

Thank you for taking the time to review our paper. We appreciate your feedback and the points you’ve raised. We understand your rating and will continue to refine the paper based on the broader feedback provided.

Official Review
Rating: 4

In this paper, the authors address important recent research issues: emerging & evolving hallucination in the reasoning chains of frontier RLLMs. By auditing model behaviors during the CoT trajectory (controlled knowledge environment with RFC datasets), the authors provide several insights and observations to understand underlying mechanisms of hallucinations during the test-time reasoning process.

Strengths and Weaknesses

The reviewer enjoyed reading this paper, which addresses an important and timely topic (hallucination in test-time CoT-based RLLMs). Despite its importance, there are several points to discuss:

  1. Due to the missing references in the manuscript (lines 105, 121), readers find it hard to follow the terms used in this paper. The lack of implementation details also makes it difficult to understand the overall flow (in particular, Section 2.3 provides only high-level explanations; some details should be explained more thoroughly, such as "how to obtain confidence for each claim" or "how to explicitly model the g function to capture prompt-aligned bias", etc.).

  2. (continued) To give potential readers a better understanding, Fig. 1 should be improved to better align with the main text (Section 2). Although it is briefly discussed in lines 41-52, it currently looks disconnected from the core explanations; some of the terms used in the figure are not referenced in the main body (which may confuse readers). Providing well-matched descriptions in Section 2 alongside Fig. 1 would enhance clarity.

  3. Many concepts and terms make this paper hard to follow (such as metacognitive drift, flawed reflection, chain disloyalty, prompt-aligned bias, etc.). To understand such abstract concepts, the reader has to put a lot of work into uncovering what they actually are. In addition, the current controlled knowledge environment using RFC document datasets may raise concerns about the generalizability of the paper's main insights and observations to the more open-ended scenarios typically encountered by current RLLMs.

Questions

Q1. Obs. 2 suggests that a longer reasoning chain indicates metacognitive drift due to initial uncertainty. However, the reviewer thinks it is unclear whether the increased claim count indeed reflects uncertainty; it may instead stem from task complexity (perhaps more claims are needed to answer the question) or from the need to flesh out the answer for the query instance. As shown in Table 2, the proportion of hallucinated claims is not notably high, which naturally raises questions about the loose link between claim length and hallucination (originating from uncertainty).

Q2. The reviewer thinks the Type 2 control part is missing, and the analysis of Part C is also skipped in the manuscript. Accordingly, Obs. 3 seems somewhat speculative and does not provide enough insight into the listed performance figures.

Q3. In Section C.1, the authors provide extended explanations for each taxonomy of CoT trajectories. The reviewer wonders whether the example cases in Fig. 2 are curated instances or representative samples selected by a certain procedure. Moreover, it remains unclear how specific reasoning trajectories can be traced or identified (such as dropped paths or conditioned branches ("if" cases)) from the model's generations. Providing more concrete examples would help readers understand this analysis.

Limitations

Yes.

The reviewer leans toward BR for the current version and will finalize the rating after the discussion period.

Final Justification

This paper addresses an important and timely topic with a structured analysis of hallucination in RLLM reasoning chains. During the rebuttal, my concerns were adequately addressed. The reviewer believes that the research direction and conceptual clarity of this paper offer meaningful insights to readers, and thus leans toward borderline accept.

Formatting Issues

N/A. As a minor comment, some figure indexes are missing in the appendix section.

Author Response

Weakness: On Clarity, Section 2.3, Terminology, and Generalizability

We sincerely thank the reviewer for the constructive and thoughtful comments. Below, we address the concerns (weaknesses) regarding writing clarity (W1 and W2), conceptual complexity, Section 2.3 implementation detail (W1), and the choice of our controlled knowledge domain (RFC documents) (W3).

1. On writing clarity, missing references, and Section 2.3

We apologize for the confusion caused by the missing citations at lines 105 and 121 (W1). These were formatting oversights and we will correct them in the revised version. More broadly, we understand that Section 2.3 may feel too abstract in its current form. While it is intended to provide a conceptual abstraction of our reflection framework, we agree that it lacks sufficient detail to be easily interpreted by new readers.

In the revised version, we will focus on the following two key improvements:

  • Proper references and clearer explanations, such as metacognitive drift, prompt-aligned bias, and flawed reflection. These terms will be explicitly defined at first use and connected to relevant prior work or intuitive examples to aid reader understanding.
  • Modeling assumptions and procedures, including how claim-level confidence is computed from model signals, and how confidence propagates across reasoning steps during reflection. This includes elaborating on the conceptual role of the $g(c_q, \text{prompt})$ term and clarifying the structural function of the reflection framework introduced in Section 2.3.

We would also like to clarify that we do not aim to provide a fully-automated or deployable implementation of Equations (2) and (3). Instead, our goal is to propose a conceptual decomposition that separates metacognitive confidence from alignment-based influence. This serves as a foundation for future analysis and improvements in LLM reasoning.

2. On Figure 1 and Section 2 alignment

We agree that Figure 1 currently lacks clear linkage to Section 2 (W2) and may confuse readers due to undefined terms and unreferenced notations. In the revised paper, we will introduce all figure terms in the main text with clear definitions, including prompt-aligned bias, hallucination types, and reasoning steps. (Detailed explanation of Fig. 1 in BZYH W1).

3. On use of RFCs and generalizability

We would like to clarify our motivation for choosing the RFC corpus (W3). This was not a simplification but a deliberate design choice to create a controlled knowledge environment for analyzing hallucination formation.

  • RFC documents offer clear, modular, and factual content with minimal ambiguity, which makes it possible to precisely annotate ground truth, track knowledge injection points, and isolate the causes of reasoning errors.

  • Compared to open-domain sources, RFCs have well-defined standards and low risk of semantic noise or conflicting data, making them ideal for identifying when the model, rather than the data, is the source of hallucination.

It’s important to note that our focus is on a model’s loyalty to its learned knowledge, not necessarily to real-world facts. In other words, if a model hallucinates due to incorrect training data, we do not count that as a metacognitive failure of the model. This is precisely why we restrict the knowledge source to RFCs, so we can minimize data ambiguity and focus on the model’s internal reasoning behavior.

That said, we agree that real-world reasoning mistakes often stem from a mixture of factors: hallucinations, corrupted training signals, poor alignment, etc. We view our current setup as a necessary first step for disentangling these effects, and we are already planning to extend our methodology to more complex, noisy, and multi-domain settings.

Question: On Reasoning Drift, Type II Control, and Trajectory Tracing

We thank the reviewer for raising these valuable questions, which target key aspects of our methodology and analysis depth. The concerns primarily relate to the interpretation of reasoning length (Q1), the handling of Type II hallucination controls and corresponding evaluation (Q2), and the transparency of how reasoning trajectories are extracted and analyzed (Q3).

Q1. On the correlation between claim length and uncertainty

Our Obs. 2 is not based on chain length alone, but rather on the dynamic reflection behavior that occurs when the model faces low confidence in its intermediate claims. Specifically, when hallucinations occur, the model often performs more reflections, attempting to adjust its metacognitive confidence before either rejecting or fully accepting the hallucinated claim. This reflective process naturally increases the CoT length.

To validate this point, we conducted an analysis on the Type I dataset (as detailed in Appendix B.2), where each sample was generated using 5 independent runs. We grouped samples by the number of hallucinations observed and calculated the average CoT length (number of claims) for each group:

Hallucinations (out of 5) | Avg. CoT Length (claims)
------------------------- | ------------------------
5                         | 53.31
4                         | 50.09
3                         | 44.57
2                         | 42.30
1                         | 47.61
0                         | 26.10

The data show a clear positive correlation between hallucination frequency and CoT length, supporting our obs. 2 that uncertainty-driven reflection leads to longer reasoning chains. While task complexity might also contribute, the strong trend observed here indicates that the increase in length is closely tied to hallucination-related metacognitive drift.
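For concreteness, the grouping behind this table can be sketched as follows; the per-sample record format and the toy values are illustrative assumptions, not the paper's data.

```python
from collections import defaultdict

# Each record: (number of hallucinated runs out of 5, CoT length in claims) -- toy values only.
samples = [(5, 53), (5, 54), (4, 50), (0, 26), (0, 27)]

def avg_cot_length_by_hallucination_count(records):
    buckets = defaultdict(list)
    for halluc_count, cot_len in records:
        buckets[halluc_count].append(cot_len)
    return {k: sum(v) / len(v) for k, v in sorted(buckets.items(), reverse=True)}

print(avg_cot_length_by_hallucination_count(samples))  # {5: 53.5, 4: 50.0, 0: 26.5}
```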

Q2. On missing analysis of Type II control and Part C

Thank you for pointing this out. The reviewer is correct that the Type II control group was not explicitly included in the main analysis of Table 2. This omission was primarily due to its limited usable sample size. In particular, prompt-aligned bias tends to be consistently triggered, making it difficult to collect clean, non-hallucinated contrast cases in this setting.

Nonetheless, we agree that the absence of detailed discussion may make Observation 3 feel speculative. In fact, upon closer inspection of Part C (Type II), we observed that the model unexpectedly generated an average of 5.25 incorrect internal knowledge units, which is quite close to the 6.73 observed in Type I, despite their different sources of hallucination (misleading prompt vs. knowledge absence).

Moreover, these internally hallucinated facts in Type II traces exhibited similar propagation patterns to Type I, such as 50% adoption, 40% correction, and 10% rejection. This suggests that the model is not just copying the misleading information from the prompt, but also generates additional incorrect internal knowledge on its own, which likely plays a key role in the final wrong answer. This behavioral pattern is the core empirical support for Obs. 3.

Q3. On how reasoning trajectories in Fig. 2 are selected and traced

Thank you for raising this question. Strictly speaking, the examples shown in Figure 2 are neither curated hand-picked cases nor representative samples selected via statistical procedures.

Each case in Figure 2 was randomly drawn from the set of CoT traces corresponding to the three types, with the goal of illustrating structural differences in trajectory behavior. For visualization purposes, we manually shortened the full reasoning trace, as many original CoTs contain more than 40 claims (as shown in Table 2A). In the case of (b) Control: Error Reject, we applied a very minimal manual adjustment to the CoT trace. Specifically, we shifted the error rejection slightly earlier than it actually occurred, so that the correction dynamics could be more clearly shown within a readable length.

We acknowledge this intervention and will clearly state it in the figure caption and appendix. These examples are meant to be illustrative, not evaluative, and do not affect our quantitative analyses.

We also appreciate the reviewer’s attention to the reliability of our CoT trajectory annotation process. As noted, identifying structural elements such as dropped paths or conditioned branches from model-generated CoT traces is inherently non-trivial, since these are implicit in free-form natural language.

Due to space limitations, we were unable to include full examples. Below we provide concrete, real cases illustrating how such structures are identified:

Conditioned Branches:

Case 1 (Unreasonable assumption):
If a node doesn't validate the UDP checksum, maybe an attacker could tamper with the UDP payload (which might include the HMAC) without the checksum catching it.

Case 2:
Wait, maybe the process is that the HMAC validation is done after the checksum check. If the checksum is invalid, the packet is dropped before HMAC validation. So if a node doesn't check the checksum, it might process a packet with a modified HMAC...

Dropped Paths:

Case 1 (Dropped claim):
I believe RFC 3866 is related to email headers... If that's the case, then it might be updating an older RFC that dealt with similar subject matter.
==drop==
Alternatively, maybe I can think of other RFCs in the same category, like SPF, DKIM, DMARC.

Case 2:
Alternatively, maybe I can think of the authors. Let's see. The authors of RFC 3866 are J. Ott and D. Mutz.
==drop==
Wait, the title is "Extensible Message Format for Message Disposition Notifications." No, that's RFC 8098.
==drop==
Wait, no. Wait, I'm getting confused. Let me try to recall. Another approach: RFC 3866 is titled "Common Profile for Instant Messaging (CPIM)". No, that's RFC 3860.
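As a rough illustration of how such markers could be surfaced automatically before manual verification, a heuristic keyword sketch is given below. The cue lists and the flagging logic are assumptions made for illustration, not the authors' actual annotation pipeline.

```python
import re

# Heuristic cue lists -- illustrative assumptions, not the authors' pipeline.
CONDITIONAL_CUES = r"\b(if|suppose|maybe|might|could it be)\b"
DROP_CUES = r"^(alternatively|wait|hmm)\b"

def flag_cot_structure(claims):
    """Flag candidate conditioned branches and dropped paths in a list of CoT claims."""
    flags = []
    for i, claim in enumerate(claims):
        lowered = claim.strip().lower()
        if re.search(CONDITIONAL_CUES, lowered):
            flags.append((i, "conditioned_branch"))
        if re.match(DROP_CUES, lowered):
            # The preceding line of reasoning is likely abandoned at this claim.
            flags.append((i, "dropped_path"))
    return flags

claims = [
    "If a node doesn't validate the UDP checksum, an attacker could tamper with the payload.",
    "Alternatively, maybe I can think of other RFCs in the same category.",
]
print(flag_cot_structure(claims))
# -> [(0, 'conditioned_branch'), (1, 'conditioned_branch'), (1, 'dropped_path')]
```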

We are sincerely encouraged by the reviewer’s remark that they enjoyed reading the paper. Throughout the review process, we have strived to provide the most faithful and transparent explanations possible within the available space, in the hope of addressing all concerns raised. We would warmly welcome further discussion or reconsideration of the decision.

Comment

The reviewer thanks the authors for the detailed responses. My concerns are adequately addressed. Considering the contribution of this paper (hallucination in the reasoning process), the reviewer leans toward BA.

Comment

Dear Reviewer Mk7X,

Thank you for your positive remarks regarding our focus on hallucinations in test-time CoT-based RLLMs and your valuable insights into their underlying mechanisms. We truly appreciate the time and effort you put into your detailed review.

Official Review
Rating: 5

This paper tackles the critical problem of hallucinations in RLLMs, going beyond simple detection to investigate the underlying mechanisms from the emergence to the evolution of hallucinations in RLLM's reasoning chain. The authors identify that models can iteratively reinforce biases and errors through flawed reflective processes, even with interventions at hallucination origins. They do this by auditing the CoT trajectory and assessing the model's cognitive confidence in potentially erroneous or biased claims. The authors then define four key research questions and, through extensive experiments, demonstrate that existing hallucination detection methods are less reliable and interpretable than previously assumed, especially in complex reasoning contexts in RLLMs.

Strengths and Weaknesses

Strengths

  1. The paper is well-structured, and the writing is clear. I appreciate the authors framing their entire investigation from modeling to the four research questions. This provides a logical backbone for the paper that makes the argumentation easy to follow from start to finish.

  2. The experimental parts are both detailed and comprehensive. The authors conduct a deep and convincing audit of the model's behavior. The experiments, especially the controlled CoT editing, are thoughtfully designed and go far beyond simple performance metrics to reveal the process behind the hallucinations in RLLMs.

  3. Finally, some insights and findings of this paper would be valuable to the community. Concepts like "chain disloyalty" offer a new and useful vocabulary for describing error propagation. Furthermore, the paper's critical analysis of existing hallucination detection methods serves as an essential reality check, pushing the field to develop more robust solutions.

Weaknesses

I do have some concerns with the paper that I believe should be addressed:

  1. The theoretical model of confidence updates presented in Section 2.3 is insightful. However, the analysis that follows functions more as an interpretive lens for the qualitative results rather than a hypothesis that is quantitatively tested. The conclusions would be more convincing if the authors could bridge this gap, perhaps with experiments that attempt to measure or approximate the variables in their proposed equations (2) (3).

  2. Regarding Observations I & III, the paper attributes the model's tendency to accept and elaborate on false premises to a "prompt-aligned bias". This is a plausible interpretation, but it warrants more evidence. An alternative explanation is that these errors stem from a simple knowledge deficit; the model may not be willfully aligning with a user's error but simply doesn't know the information is incorrect. Additional experiments to disentangle a knowledge gap from a genuine alignment bias would make this claim more robust.

  3. The study's reliance on a single model, DeepSeek-R1, raises questions about generalizability. The observed phenomena, such as "chain disloyalty" and the specific mechanisms of error amplification, could potentially be artifacts of this particular model's architecture or training. The paper's claims would be more effective if the core findings were validated on other major model families, like the GPT or Claude series.

  4. A detailed description of the experimental setup and dataset construction is located in Appendix B. As a reader, I looked for this context before diving into the results in Section 3. The paper would be easier to follow if a summary of the setup, or at least a clear pointer to the appendix, were included at the beginning of Section 3.

Questions

  • Regarding the "prompt-aligned bias" claim, could you provide further justification for why this is a more likely cause of error than a simple knowledge deficit? Any discussion or analysis to disentangle these two possibilities would significantly strengthen this core claim.
  • To improve readability, could you add a brief overview of the experimental setup at the beginning of Section 3, or at least a clear forward reference to the full details in Appendix B?

Limitations

Yes. The authors discussed limitations and broader impact in Appendix A.

Final Justification

I think the rebuttals do adequately address my concerns.

Formatting Issues

N/A

Author Response

Weakness&Question: On Confidence Modeling, Prompt Bias, Model Scope, and Structure

We thank the reviewer for their careful reading and thoughtful comments. The suggestions regarding the empirical grounding of our theoretical model (W1), the interpretation of prompt-aligned bias (W2 and Q1), the generalizability of findings (W3), and the presentation of experimental setup (W4 and Q2) are all well taken. Below we address each point in turn and outline how we plan to revise the manuscript accordingly.

1. On bridging theoretical confidence modeling (Eq. 2/3) with empirical validation

We appreciate the reviewer’s recognition of the confidence update model in Section 2.3. The confidence update model in Section 2.3 is designed to help describe how confidence may evolve during multi-step reasoning. We acknowledge that in the current version, we do not quantitatively estimate each variable in Equations (2) and (3).

However, as discussed in Section 3.4, many existing hallucination detection methods (such as entropy-based uncertainty, logit margins, and self-consistency) can be viewed as practical ways to approximate the dynamic changes in model confidence, especially for individual claims. While we do not directly compare our formulation with these methods, they share a similar motivation: estimating whether the model “believes it knows” something, and to what extent.

We believe the main limitation of these methods is that they focus on single-point confidence estimation for individual claims, rather than modeling the change in confidence ($\Delta \text{conf}$) across the entire reasoning trajectory. In long-form CoT reasoning, it is not just the confidence of a specific step that matters, but how that confidence emerges and evolves as the reasoning progresses.

As shown in Appendix E.3, even after applying smoothing techniques, the confidence signals across CoT steps often show sudden jumps up and down, rather than smooth or consistent changes. This supports our view that hallucination is not purely a local phenomenon, but often emerges from global inconsistency or failure to maintain coherent belief across steps. Capturing this dynamic behavior is what our confidence update model aims to describe.

2. On whether prompt-aligned bias reflects true alignment or knowledge deficit

We sincerely appreciate this insightful observation. In fact, we considered this exact alternative explanation during our analysis. Many of the insights in our paper, including Observations I and III, stem from carefully distinguishing between what the model learned and how it chooses to reason given that knowledge.

To investigate whether the model’s acceptance of false premises is simply due to a knowledge deficit, we conducted an additional experiment: We constructed a balanced set of 500 factually correct statements and 500 factually incorrect statements (the same source pool used for Type II hallucination generation) and asked the model to judge their factual correctness using a neutral prompt.

The results are shown below:

                 | Judged as Correct | Judged as Incorrect
True Statements  | 478               | 22
False Statements | 13                | 487

This shows that the model is fully capable of recognizing most of these facts as true or false.

Moreover, in the Type II (Unseen or Incorrect) cases selected for analysis, we did not observe any signs of the model expressing uncertainty or epistemic hesitation about the injected incorrect information (in answer). The model confidently accepted and followed the external incorrect knowledge, despite clearly “knowing better” in isolation.

Taken together, this evidence supports our interpretation that the model’s behavior is not simply caused by a lack of knowledge. Instead, we argue that it reflects a prompt-aligned bias, where the model over-prioritizes consistency with the input prompt, even at the expense of factual correctness.
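For concreteness, the accuracy implied by this probe can be checked directly from the reported counts; the short script below is just that arithmetic, not part of the paper's pipeline.

```python
# Counts from the neutral-prompt probe (500 true + 500 false statements), as reported above.
true_correct, true_wrong = 478, 22      # true statements judged correct / incorrect
false_correct, false_wrong = 13, 487    # false statements judged (wrongly) correct / incorrect

accuracy = (true_correct + false_wrong) / 1000
print(f"overall judgment accuracy: {accuracy:.1%}")  # 96.5%
```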

3. On generalizability beyond DeepSeek-R1

We thank the reviewer for raising this important point. Due to time constraints, we were unable to perform a full-scale replication of our framework across multiple model families. However, before selecting DeepSeek-R1 for our main experiments, we conducted an extensive preliminary survey of hallucination behavior across several major reasoning-capable LLMs.

In selecting a model, we considered three factors:

  1. Frequency of hallucination phenomena (especially CoT-induced),
  2. API accessibility and cost, and
  3. Relevance and adoption within the current open-source LLM ecosystem.

DeepSeek-R1 was ultimately chosen because it offered a high hallucination rate, low API cost, and strong community recognition, making it well-suited for large-scale controlled experimentation.

To support this decision, we include two tables below summarizing hallucination behavior observed in other models during our preliminary evaluations, all using the same Type I / Type II setup.


Accept rate (i.e., percentage of 5-time queries where hallucination occurred ≥ 4 times):

Model             | Type I (≥2 halluc.) | Type I - Control | Type II (≥2 halluc.) | Type II - Control
DeepSeek-R1       | 62.5%               | 92.6%            | 56.1%                | 11.0%
Claude-3-7-sonnet | 73.3%               | 83.3%            | 50.0%                | 93.3%
Qwen3             | 100%                | 83.3%            | 63.3%                | 83.3%

Note: DeepSeek’s values are calculated across all outputs, consistent with Table 1 in the main paper. Claude and Qwen were evaluated on subsets filtered through DeepSeek’s outputs, leading to different sample distributions.


Hallucination rate across all responses (i.e., proportion of hallucinated answers among all responses):

Model             | Type I (%) | Type II (%)
Claude-3-7-sonnet | 67.8%      | 52.2%
Qwen3             | 94.4%      | 65.5%

Unfortunately, we were unable to include GPT-o3 in this comparison, as its inference costs exceeded our available resources at the time of the study.

We view our current work as establishing the methodology and analysis tools, and we plan to apply the same framework to additional models in future work. We believe the core phenomena we observe—such as error propagation, reflection failure, and prompt-aligned drift—are not specific to DeepSeek-R1 but reflect broader behaviors in CoT-style reasoning across LLMs.

4. On clearer pointer to experimental setup in Section 3

Thank you for this helpful suggestion. We agree that readers would benefit from a clearer connection between the experimental results in Section 3 and the dataset/setup details provided in Appendix B.

In the revised version, we will add a brief summary paragraph at the beginning of Section 3 to outline:

  • The dataset construction process, including prompt formulation and hallucination control conditions (Types I and II).
  • The model sampling and annotation protocol.
  • Key evaluation metrics used throughout the analysis.

We will also include an explicit forward reference to Appendix B for readers who wish to explore the experimental configuration and data labeling procedures in more depth.

We believe this change will significantly improve the clarity and flow of the paper.


We are deeply encouraged by the reviewer’s belief that our paper could be valuable to the community. We have done our best to address the raised concerns, supplementing our response with further data and analysis. We hope this can invite further discussion and reconsideration of the rating, as we believe the work may offer useful perspectives for ongoing research in this area.

Comment

I thank the authors for their detailed response, which adequately addresses my concerns. I find the paper to be technically solid and experimentally comprehensive. Therefore, I maintain my score.

Comment

Dear Reviewer z9Q8,

First of all, we would like to express our sincere gratitude for your thoughtful and constructive review of our work. We have carefully considered all your comments and have provided detailed responses to each point raised during the rebuttal process. Additionally, we have included further experiments, which we will ensure are incorporated into the final version of the paper.

We are pleased to note that the majority of reviewers have recognized the contribution of our work, particularly in addressing hallucination scenarios in long CoT and the extensive experiments we have conducted. We are hopeful that our work can offer valuable insights and directions for future research in the hallucination domain.

If you find the insights we present in the paper to be valuable and align with the future research goals of the community, we kindly ask you to reconsider your evaluation and potentially adjust your score to better reflect the improvements made and the contributions of the work.

Thank you again for your time and feedback.

Comment

Thank you for the feedback. I think the rebuttals do address my concern and would like to raise my score to 5 and vote for accepting this paper.

Comment

Dear Reviewer z9Q8,

We are grateful for your positive feedback on the clarity of our paper, the "chain disloyalty" concept, and the comprehensive experiments. Your thoughtful review and constructive suggestions are highly appreciated.

Final Decision

This paper provides a timely contribution to our understanding of hallucinations in reasoning-focused large language models (RLLMs). While much prior work has focused on mitigation, this study offers a novel perspective by systematically auditing Chain-of-Thought (CoT) trajectories and introducing concepts such as “chain disloyalty” to explain how initial errors persist and amplify through flawed reflective reasoning. The framework is methodologically sound, relying on a carefully constructed RFC-based dataset that enables controlled and reproducible experiments. The analyses are thorough and well-motivated, with case studies and controlled CoT editing that yield valuable insights into how hallucinations evolve and why existing detection methods fail in complex reasoning scenarios. These findings not only deepen our conceptual understanding of RLLM behavior but also provide the community with new vocabulary and interpretive tools for studying error propagation.

Although the paper has some limitations, such as reliance on a single model family for evaluation, the interpretive rather than quantitative nature of the confidence modeling, and occasional challenges in clarity of exposition, the authors have been responsive during the rebuttal process and strengthened the manuscript with additional analysis and clarifications. Importantly, the conceptual insights and empirical findings are robust, generalizable in spirit, and highly relevant to the community’s ongoing efforts to build more reliable reasoning systems. Overall, the paper advances the state of knowledge on hallucination mechanisms in RLLMs and will spark further research on metacognition, error detection, and interpretability.