The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models Via Visual Information Steering
Abstract
Reviews and Discussion
This paper introduces an inference-time framework called VISTA that aims to reduce hallucinations in LVLMs. The authors conduct a detailed study of token "logit rankings" across both layer depth and the temporal dimension in a generated sequence. Their analysis reveals three observations: (1) Gradual loss of visual grounding over the course of generation, (2) Early excitation of semantically meaningful tokens in penultimate or nearby layers, and (3) Hidden genuine information that the model "knows" but fails to include in its output. Building on these insights, they propose VISTA, which combines two complementary techniques: (1) Visual Steering Vector (VSV): a direction in the model's activation space, computed by contrasting the residual streams obtained from the "positive" prompt containing the image tokens vs. a "negative" prompt that omits them. Injecting this vector at each decoding step enforces better retention of visual information. (2) Self-Logits Augmentation (SLA) aggregates token logits from earlier (penultimate) layers, where grounded or semantically rich tokens peak, and then mixes them with the final layer's logits for next-token sampling. Experiments on four LVLM architectures (LLaVA-1.5, MiniGPT-4, Shikra, InstructBLIP) and three standard decoding regimes (greedy, beam, nucleus sampling) demonstrate that VISTA consistently outperforms baselines on several hallucination-specific benchmarks (CHAIR, POPE, MMHal) and general capability tests (MME).
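As a rough illustration of the two components, the sketch below shows one plausible form of the steering-vector construction, its per-layer injection, and the logit mixing. It is a minimal sketch assuming per-layer last-token hidden states and early-layer logit-lens projections are available as tensors; the normalization and blending rules and the default `lam`/`gamma` values are illustrative choices, not the authors' exact formulas.

```python
import torch
import torch.nn.functional as F

def build_vsv(pos_hiddens, neg_hiddens):
    """Visual Steering Vector: per-layer difference between the residual
    stream of the image+text prompt (positive) and the text-only prompt
    (negative), taken at the last prompt token. One direction per layer."""
    return [F.normalize(p - n, dim=-1) for p, n in zip(pos_hiddens, neg_hiddens)]

def steer(hidden, direction, lam=0.1):
    """Inject the steering direction into one layer's residual stream at a
    decoding step; rescaling back to the original norm is one common choice."""
    norm = hidden.norm(dim=-1, keepdim=True)
    shifted = hidden + lam * direction
    return shifted / shifted.norm(dim=-1, keepdim=True) * norm

def sla(final_logits, early_logits, gamma=0.5):
    """Self-Logits Augmentation: blend the final-layer logits with logit-lens
    projections from a few layers before the last, then sample as usual."""
    return (1.0 - gamma) * final_logits + gamma * torch.stack(early_logits).mean(0)
```

In this sketch `lam` and `gamma` play the roles of the intervention strength and mixing ratio discussed in the rebuttal below.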
Questions for the Authors
- How does VISTA perform if there are many more entity mentions in an image than the user explicitly queries about? Could it lead to “overstuffing” of visual details?
- If the system has to maintain visual grounding over multiple user queries, do we re-inject the same VSV at each turn, or might some turn-based adaptation be needed?
- Do you see major differences if the image is extremely cluttered? Do you rely on the visual token embeddings to be relatively stable?
- Could early-layer steering alone fix the problem without discretization or advanced overshadowing?
Claims and Evidence
- Hallucination emerges partly because the model’s language priors overshadow the diminishing visual signals, especially in later decoding steps.
- The token-rank analysis across layers/time shows that visually grounded tokens steadily drop in probability ranks, while hallucinated tokens rise. Figures 1-2 illustrate this trend clearly.
- Meaningful tokens reach their maximal activation in intermediate layers (e.g., the penultimate), not the final layer – which is more biased toward function words.
- Again, the “logit lens” technique indicates that object/attribute tokens see higher probabilities earlier than the final layer, which frequently emphasizes grammatical tokens.
- VISTA mitigates hallucinations by reinforcing the missing or eroded visual cues (via VSV) and leveraging earlier layer logits (via SLA).
- VISTA yields consistent gains in object hallucination metrics (CHAIR_S, CHAIR_I) of ∼40% relative improvement across multiple models and decoding algorithms. They also show smaller improvements in broad tasks (MME) and short-answer tasks (POPE), indicating that the method can generalize.
Methods and Evaluation Criteria
They compute a "steering vector" from a contrast (with/without image tokens) and inject it into the residual streams, plus a partial ensemble of earlier-layer logits to finalize the token distribution. They measure CHAIR, POPE, etc., and validate against strong existing baselines like VCD, OPERA, PAI, and DoLa.
Theoretical Claims
This work is primarily empirical and methodological. The claims are mostly validated through data and are consistent with known ideas on how Transformers store information in intermediate hidden states. There are no deep new formal proofs, but no obvious theoretical errors either.
Experimental Design and Analysis
Solid coverage of standard vision–language tasks and ablation on the synergy of the two main modules. However, the approach might be somewhat complicated in real usage: “steering vectors” are computed for each image.
Supplementary Material
No supplementary material is attached.
Relation to Existing Literature
They cite relevant VLM hallucination works. The references are okay but possibly missing a deeper link to linear prompt engineering or “prefix tuning” for controlling generation.
Missing Important References
Might not mention older works on “residual stream editing” or “concept neurons.” Possibly not critical but could be relevant.
Other Strengths and Weaknesses
Strengths:
- Intriguing token ranking analysis that reveals new insights. Thorough token-level analysis using logit lens clarifies when/how hallucinated vs. grounded tokens appear.
- Strong performance gains across multiple models, decoding methods, and tasks.
Weaknesses:
- Relying on the vision encoder’s quality. If the base model’s image embeddings are poor or incomplete, the “steering vector” might not help.
- The method can require hyperparameter tuning for different models. For example, the injection scale hyperparameter might be sensitive; not thoroughly studied.
- The VSV computation can be somewhat heavy if done for each image. It needs thorough time analysis.
Other Comments or Suggestions
Read above and below.
We cordially appreciate your careful review and insightful questions. We are thrilled that you find our analysis "reveals new insights" and that our method achieves "strong performance" and "can generalize". See below for detailed replies.
Q1. Regarding vision encoder’s quality
VISTA focuses on hallucinations caused by parametric priors and does not impose extra requirements on the visual encoder's quality, as evidenced by its effectiveness across four widely adopted architectures. The quality requirement is the same as that of standard public architectures, consistent with existing methods such as VCD, PAI, and OPERA. VISTA complements ongoing research focused on improving vision encoders.
Q2. Regarding hyperparameter tuning
There are two hyperparameters: the intervention strength and the mixing ratio.
- For the intervention strength, we found one setting consistently effective across all four architectures.
- The optimal mixing ratio can vary by model; however, a broad range of values is effective for each model, as analyzed in the Synergy & Robustness subsection (p.7, line 381) and shown in Figs. 6, 12–14, demonstrating VISTA's robustness across different configurations.
Q3. Regarding computational overhead
VISTA is an efficient de-hallucination strategy that introduces minimal computational overhead and demonstrates better efficiency than baseline approaches. We clarify this below (see also the sketch after this list):
- For captioning, summarization, and open-ended generation tasks: the textual prompt is static (e.g., "please describe the image..."). VISTA forwards the textual prompt only once without the image to cache activations (minimal overhead). For the positive case, VISTA acts like vanilla inference, forwarding the visual+textual tokens. The only difference is that, before generating new tokens, VISTA builds the VSV via simple vector arithmetic (marginal overhead) and proceeds with the intervention. VISTA does not forward the input tokens twice, and it can reuse their KV cache during generation (the overhead of this is also small).
- For QA tasks: textual prompts vary per query, potentially requiring an additional forward pass of the textual tokens. However, textual tokens typically account for less than 10% of the total prompt length compared to image tokens. Thus, the additional computational overhead remains mild relative to vanilla inference.
- Empirically, VISTA demonstrates better efficiency compared to other comparable methods (see Table 5). We kindly refer the reviewer to Reviewer 5sMs (Q2) for an extended efficiency comparison.
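To make the cost accounting above concrete, a schematic of the workflow is sketched below. The `model.*` helpers are hypothetical stand-ins for whatever interface exposes per-layer last-token activations, prefill, and steered generation; this illustrates where the single extra text-only pass occurs, not the released implementation.

```python
def vista_generate(model, image_tokens, text_tokens, lam=0.1, max_new_tokens=512):
    # Negative pass: text-only prompt, run once; for static prompts (e.g.,
    # captioning) these activations can be cached and reused across images.
    neg_hiddens = model.last_token_hiddens(text_tokens)

    # Positive pass: identical to vanilla prefill over visual+textual tokens;
    # it also produces the KV cache reused during generation.
    pos_hiddens, kv_cache = model.prefill(image_tokens + text_tokens)

    # VSV construction is simple per-layer vector arithmetic (marginal cost).
    vsv = [p - n for p, n in zip(pos_hiddens, neg_hiddens)]

    # Generation proceeds from the cached prefix; the only per-step extra work
    # is adding lam * vsv[l] to each layer's residual stream.
    return model.generate(kv_cache, steering=[lam * v for v in vsv],
                          max_new_tokens=max_new_tokens)
```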
Q4. Is there "overstuffing" of visual details?
- We'd like to first clarify a potential caveat. The visual steering vector (VSV) does not merely amplify visual details. The positive vector is conditioned jointly on visual and textual tokens, capturing relevant visual-textual relationships via attention heads, while the negative vector reduces textual-only priors. Consequently, VSV does not overstuff visual details that might blur the query's focus.
- This is empirically validated on MMHal-Bench (Fig. 4), where each query specifically targets an entity or relation within visually complex scenes containing many other entities and relations, and VISTA performs well.
Q5. Multi-query scenario
As clarified in Q4, VSV is constructed per visual-query pair and is expected to be recalculated for each new query. However, as analyzed in Q3, constructing the VSV incurs only mild computational cost due to KV caching and the relatively short length of query tokens compared to visual/system tokens. As a result, it is still efficient to use VISTA under multi-query scenarios.
Q6. Regarding extremely cluttered images
Inspired by your insightful question, we conducted a new experiment in which 500 images from MSCOCO were randomly selected and divided into two groups (heavy and light) according to the degree of clutter (GPT-4o is used to rate a clutter score for each image). The results below (on LLaVA-1.5) demonstrate that heavily cluttered images are inherently more challenging; yet VISTA significantly outperforms vanilla decoding in both scenarios.
| Clutter degree | CHAIR_S↓ / CHAIR_I↓ |
|:-|:-|
| Heavy (greedy) | 52.0 / 13.0 |
| Heavy (ours) | 26.5 / 5.3 |
| Light (greedy) | 40.5 / 11.7 |
| Light (ours) | 22.0 / 5.6 |
Q7. Could early-layer steering fix the problem?
Following your suggestion, we conducted an additional experiment applying early-layer steering (first 15 layers) to heavily and lightly cluttered images. As shown below, steering across all layers consistently outperforms early steering. Interestingly, the performance gap between early steering and steering all layers is smaller for lightly cluttered images, suggesting early layers handle much of the visual processing for easy cases.
| Clutter degree | CHAIR_S↓ / CHAIR_I↓ |
|:-|:-|
| Heavy (early) | 37.0 / 10.2 |
| Heavy (all) | 26.5 / 5.3 |
| Light (early) | 26.0 / 7.1 |
| Light (all) | 22.0 / 5.6 |
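For reference, the "early" variant amounts to restricting which layers receive the steering direction; a minimal sketch (following the illustrative functions earlier in this thread, with hypothetical names) is shown below.

```python
def steer_layers(hidden_states, vsv, lam=0.1, layers=None):
    """Add the steering vector only at selected layer indices.
    layers=range(15) mimics the 'early' setting (first 15 layers);
    layers=None steers all layers, as in the full method."""
    keep = set(layers) if layers is not None else None
    return [h + lam * v if (keep is None or i in keep) else h
            for i, (h, v) in enumerate(zip(hidden_states, vsv))]
```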
Through the observation of the LVLM generation process, this paper introduces a hallucination mitigation method, VISTA, which includes a visual steering vector and logit ensemble. Experiments demonstrate that it outperforms existing methods.
Update after rebuttal
I agree with the authors' rationale and will maintain my initial score (weak accept).
Questions for the Authors
- The design of the visual steering vector: what if the negative component is directly subtracted on its own? What would the results be?
Claims and Evidence
Yes.
Methods and Evaluation Criteria
Yes.
Theoretical Claims
N/A
Experimental Design and Analysis
Yes.
Supplementary Material
Yes, the case study part.
Relation to Existing Literature
The key contributions relate to MLLM hallucination mitigation.
Missing Important References
All related works are well discussed.
Other Strengths and Weaknesses
Strengths
- Applying an effective method to each problem: a steering vector to enhance the model's focus on visual information and a logit ensemble to address the early excitation of semantically meaningful tokens.
Weaknesses
- Lack of novelty:
- As the authors mentioned, each observation has already been discussed in previous literature. What is being observed for the first time in this paper? Is it the identification of three types of tokens and the observation of the LVLM generation process from that perspective?
- Also, the logit ensemble method has already been proposed in recent works [1, 2]. Please verify its superiority compared to those works.
[1] Do You Keep an Eye on What I Ask? Mitigating Multimodal Hallucination via Attention-Guided Ensemble Decoding, ICLR 2025
[2] Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models, ICLR 2025
Other Comments or Suggestions
See the weakness section.
We sincerely appreciate your detailed review and constructive suggestions. We are encouraged that our framework is recognized as "Applying an effective method to each problem", that all claims are evidenced, and that related works are "well discussed". Below we address your questions in detail.
Q1. Regarding novelty
Our VISTA is novel in terms of both its analysis and its methodology. To the best of our knowledge:
- VISTA's token ranking analysis is the first work inspecting the internal token-level dynamics of LVLMs.
- VISTA systematically uncovers three novel observations about how LVLMs store and process visual information: (1) gradual visual information loss, (2) early excitation, and (3) hidden genuine information.
- VISTA represents the first activation-space intervention method for reducing LVLM hallucination, effectively addressing the observed phenomena.

As suggested in your second bullet, VISTA indeed provides a new perspective by tracking the rankings of the proposed "hallucinated", "genuine", and "hidden genuine" tokens. As a result, VISTA unveils three novel observations, contributing valuable new insights to the field (supported by Reviewer ypPL).
Q2. Comparison and Discussion with [1] [2]
Below, we first compare VISTA with [1] and [2] in terms of performance and efficiency, followed by an in-depth discussion of the differences between VISTA and these methods. The extended comparison and discussion will be included in the next revision.
- Performance
| CHAIR | LLaVA-1.5 | MiniGPT-4 | Shikra | InstructBLIP |
|:-|:-|:-|:-|:-|
| | CHAIR_S↓/CHAIR_I↓ | CHAIR_S↓/CHAIR_I↓ | CHAIR_S↓/CHAIR_I↓ | CHAIR_S↓/CHAIR_I↓ |
| ED [1] | 43.0/14.0 | - | - | - |
| CTS (greedy) [2] | 44.2/12.2 | - | 44.8/12.8 | 42.3/12.4 |
| Ours (greedy) | 20.4/6.9 | 19.8/6.0 | 31.4/9.7 | 27.4/8.1 |
| CTS (sampling) [2] | 45.0/11.7 | - | 46.0/12.9 | 43.6/13.1 |
| Ours (sampling) | 24.0/8.2 | 18.4/6.4 | 31.8/9.7 | 29.4/9.1 |
| POPE (avg) | LLaVA-1.5 | MiniGPT-4 | Shikra | InstructBLIP |
|:-|:-|:-|:-|:-|
| | Acc/F1↑ | Acc/F1↑ | Acc/F1↑ | Acc/F1↑ |
| ED [1] | 86.31/85.86 | - | - | - |
| CTS (greedy) [2] | 85.94/85.92 | - | 81.84/82.23 | 82.30/83.58 |
| Ours (greedy) | 86.15/86.29 | 75.96/77.11 | 82.44/82.47 | 84.87/84.95 |
| CTS (sampling) [2] | 85.14/85.41 | - | 80.66/81.65 | 81.49/82.70 |
| Ours (sampling) | 85.35/85.54 | 66.96/68.05 | 81.01/81.15 | 83.11/83.27 |
As shown above, VISTA achieves
superior hallucination reduction performance, particularly in long-sequence generation tasks such as CHAIR.
- Efficiency
We further compare efficiency among methods. Given different evaluation criteria, we set the cost of vanilla inference to 1 and measure relative changes. As demonstrated below,
VISTA also outperforms [1] and [2] in efficiency.
| Method | Relative Inference Cost↓ |
|:-|:-|
| Vanilla | 1.0 |
| ED [1] | 3.13 |
| FastED [1] | 1.33 |
| CTS (40%) [2] | 1.43 |
| CTS (10%) [2] | 1.35 |
| Ours | 1.27 |
- Discussion
The rationale and implementation of logits ensembling differ significantly between VISTA and [1][2]. Method [1] aims to reduce visual distractions through image cropping, ensembling logits from multiple subcrops. Method [2] addresses global noise by pruning visual tokens based on the attention matrix, creating auxiliary logits. In contrast, VISTA's self-logits ensembling is motivated by the observation of early excitation behavior and is conducted across layers preceding the final layer to encourage decoding semantically meaningful tokens. Moreover, the proposed self-logits augmentation demonstrates synergy with the visual steering vector (VSV), as detailed in the Synergy & Robustness subsection (p.7, line 381).
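To illustrate the difference in mechanism, the sketch below shows how self-logits augmentation can obtain its auxiliary logits from the model's own preceding layers via the logit lens (no image cropping or visual-token pruning involved). Here `final_norm` and `lm_head` are assumed handles to the model's final normalization and unembedding head, and the layer window and averaging are illustrative choices rather than the paper's exact rule.

```python
import torch

def logit_lens_sla(hidden_per_layer, final_norm, lm_head, num_early=3, gamma=0.5):
    """Blend final-layer logits with logit-lens projections of the layers just
    before the last one, encouraging decoding of early-excited tokens."""

    def project(h):
        # Logit-lens projection: final normalization followed by the LM head.
        return lm_head(final_norm(h))

    early = [project(h) for h in hidden_per_layer[-1 - num_early:-1]]
    final_logits = project(hidden_per_layer[-1])
    return (1.0 - gamma) * final_logits + gamma * torch.stack(early).mean(0)
```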
Q3. What if is directly subtracted?
Following your insightful suggestion, we set the visual steering vector to the negative-only form (directly subtracting the negative component, without the positive one) instead of the original contrastive design, and validated it across various intervention strengths on LLaVA-1.5. The results below indicate that:
- Solely removing information from the negative example offers limited benefits.
- Negative-only steering is highly sensitive to the intervention strength.
- The best performance of negative-only steering significantly lags behind the original design.

Specifically, negative-only steering collapses (F1 < 70) at larger intervention strengths (0.11 and above in the table below). These findings highlight the critical role of preserving positive information during the intervention process.
| Intervention strength | Negative-only | Original design |
|:-|:-|:-|
| | CHAIR_S↓ / CHAIR_I↓ / F1↑ | CHAIR_S↓ / CHAIR_I↓ / F1↑ |
| 0.05 | 56.8 / 15.2 / 75.5 | 50.8 / 13.6 / 76.9 |
| 0.08 | 52.5 / 14.9 / 75.6 | 50.4 / 14.6 / 76.5 |
| 0.1 | 41.6 / 14.4 / 74.2 | 47.4 / 13.1 / 77.8 |
| 0.11 | 21.4 / 12.7 / 58.1 (collapse) | 45.6 / 12.9 / 77.4 |
| 0.12 | 2.6 / 23.6 / 13.7 (collapse) | 41.6 / 12.2 / 77.7 |
| 0.15 | 0.20 / 50 / 0.1 (collapse) | 32.4 / 9.8 / 76.9 |
| 0.17 | - / - / - (collapse) | 20.4 / 6.9 / 72.8 |
Thank you for addressing my concerns well. I agree with the results and will keep my initial score (Weak accept).
Thank you for your reply! We really appreciate your feedback. Please don't hesitate to follow up with further questions if there is any discussion you'd like to extend.
This paper discusses the topic of reducing hallucinations in LVLMs. The authors analyzed LVLM’s generation dynamics through the lens of token logits ranking and proposed three types of inference issues. Then, the authors proposed a Visual Steering Vector (VSV) and a Self-Logits Augmentation (SLA) method to inhibit the hallucinations. Experiments are conducted on multiple benchmarks and demonstrated superior results over previous methods.
Questions for the Authors
N/A.
Claims and Evidence
Yes, they are.
Methods and Evaluation Criteria
Yes, appropriate evaluation criteria are applied.
Theoretical Claims
This work focuses more on experimental analysis rather than rigorous theoretical proofs.
Experimental Design and Analysis
Yes, the experimental designs are sound.
Supplementary Material
Implementation details were reviewed.
Relation to Existing Literature
This work is particularly related to hallucination mitigation in LLM, VLM, MLLM, etc.
Missing Important References
The references are satisfactory.
Other Strengths and Weaknesses
Strengths
- The approach is simple yet effective. It is a training-free approach that can be applied at inference time to existing models.
- The authors test their approach across multiple architectures, decoding strategies, and benchmarks, demonstrating a comprehensive evaluation.
- The approach demonstrates improved performance over previous methods in hallucination reduction.
Weaknesses
- The novelty is relatively limited. The concepts had already emerged in previous studies.
- The proposed approach increases the computation burden.
- The proposed approach incorporates some hyperparameters that may be sensitive to different benchmarks.
Other Comments or Suggestions
Kindly review the comments mentioned earlier.
We sincerely appreciate your thoughtful review and insightful feedback. We are glad that you find our approach "simple yet effective", our evaluation "comprehensive", and our method to demonstrate "improved performance". Below we address your questions in detail.
Q1. Regarding novelty
Our VISTA is novel in terms of both its analysis and its methodology. To the best of our knowledge:
- VISTA's token ranking analysis is the first work inspecting the internal token-level dynamics of LVLMs.
- VISTA systematically uncovers three novel observations about how LVLMs store and process visual information: (1) gradual visual information loss, (2) early excitation, and (3) hidden genuine information.
- VISTA represents the first activation-space intervention method for reducing LVLM hallucination, effectively addressing the observed phenomena.

As you noted, VISTA is a flexible design allowing integration with various architectures and decoding strategies. This versatility enables VISTA to complement ongoing research focused on improving vision encoders, strengthening visual-textual alignment, and developing specialized decoding methods to reduce hallucinations.
Q2. Regarding computational efficiency
VISTA functions as an efficient de-hallucination strategy that introduces only minimal computational overhead while demonstrating superior efficiency compared to other baseline approaches. We clarify this in detail below:
- For captioning, summarization, and open-ended generation tasks: the textual prompt remains constant (e.g., "please describe the image..."). VISTA forwards the textual prompt only once without the image, caching activations for future use (minimal overhead). For the positive example (visual + textual tokens), VISTA performs vanilla inference by forwarding the visual + textual tokens once. The only additional step is constructing the visual steering vector (VSV) from the residual stream of the last input token and the cached negative activations via simple vector arithmetic (this cost is marginal). Generation then proceeds normally under our proposed intervention. VISTA avoids forwarding the token sequence twice by leveraging the KV cache from the input prompt tokens. The remaining computations involve logit ensembling and a few element-wise additions, resulting in mild overhead.
- For QA-like tasks: textual prompts may vary per query, potentially requiring one additional forward pass for the negative instance (i.e., the textual-only query). However, the textual tokens are minimal compared to image tokens, typically constituting less than 10% of the total input token sequence. Consequently, the extra computational burden remains mild compared to vanilla inference.
- Empirically, VISTA demonstrates better efficiency compared to other comparable methods (see Table 5). We kindly refer the reviewer to the panel of Reviewer 5sMs (Q2) for an extended efficiency comparison with other methods.
Q3. Regarding hyperparameter sensitivity
VISTA includes two hyperparameters: the intervention strength and the mixing ratio.
- For the intervention strength, we found one setting consistently effective across all four architectures.
- The optimal mixing ratio can vary by model; however, a broad range of values performs effectively for each model, as analyzed in the Synergy & Robustness subsection (p.7, line 381) and shown in Figs. 6, 12–14, which demonstrate VISTA's consistent effectiveness across different configurations.
The reviewer thanks the authors for the detailed feedback. It addressed my concerns well. I'll increase my initial score.
The authors sincerely appreciate the reviewer’s thorough review and thoughtful consideration.
This paper investigates hallucination in Large Vision-Language Models by analyzing token logit dynamics during generation and proposes a training-free, inference-time framework to mitigate these issues. After rebuttal, all reviewers voted for acceptance, and the AC acknowledges the paper's valuable contributions in both analysis and mitigation of hallucination. Concerns regarding novelty, hyperparameter tuning, and computational cost were raised, and most of them have been addressed during the rebuttal period. The Area Chair agrees that the novelty of the proposed method is somewhat limited, as the concept of steering vectors has been explored in prior works on mitigating hallucination in both LLMs and MLLMs; for example, the AC noted that the proposed Visual Steering Vector method is similar to that in [1]. Therefore, the AC suggests weak accept and asks the authors to include a discussion and experimental comparison with [1] in the final version, along with addressing the other points raised by reviewers.
[1] Reducing Hallucinations in Large Vision-Language Models via Latent Space Steering, ICLR 2025.