Mitigating Hallucination in VideoLLMs via Temporal-Aware Activation Engineering
Abstract
Reviews and Discussion
The authors present a new finding that a model's sensitivity to hallucinations depends on the temporal variation of the video rather than the task type. Based on this finding, they propose a temporal-aware activation engineering framework that identifies a video's temporal category with a lightweight classifier and then injects category-specific offset vectors at inference time to suppress hallucinations.
Strengths and Weaknesses
Strengths:
- The paper is logically rigorous and clearly structured.
- The experimental details are rich and complete. Experiments are conducted on two benchmarks and three models, demonstrating the effectiveness of the proposed method.
- The method only runs a lightweight classifier in parallel during inference to select and inject offset vectors, and does not significantly increase latency.
Weaknesses:
- Although the authors divide video inputs into two categories, temporal-invariant and temporal-variant, this dichotomy is relatively coarse and cannot cover diverse temporal characteristics such as scene switching, which may lead to insufficient adaptability of activation engineering across different types of videos.
- The authors identified key hyperparameters such as the injection weight and the number of Top-K attention heads through grid search, but did not evaluate the impact of these parameters on hallucination performance within a reasonable fluctuation range, nor did they explain their sensitivity.
Questions
- Consider giving a more detailed description of the sample screening process, such as the specific GPT-4o prompt and the normal and hallucination-inducing prompts, to enhance the reproducibility of the paper.
- The paper mainly uses quantitative accuracy improvement as the evaluation indicator. It is recommended to add concrete examples illustrating how hallucinations are reduced, such as a comparison of model outputs before and after injecting the offset vector.
- To enhance the reproducibility of the method, it is recommended to give, in the appendix, the value range or step size of the injection weight and ablation results for the number of selected Top-K attention heads, to clarify the impact of each hyperparameter on model performance.
Limitations
yes
Final Justification
I would like to thank the authors for the rebuttal, which has addressed the weaknesses to some extent. I will retain my original score.
Best regards,
Formatting Issues
n/a
Q1: Although the authors divide video inputs into two categories, temporal-invariant and temporal-variant, this dichotomy is relatively coarse and cannot cover diverse temporal characteristics such as scene switching, which may lead to insufficient adaptability of activation engineering across different types of videos.
A1: Thank you for your concern. Firstly, the binary classification into temporal-invariant and temporal-variant categories is derived from our analysis in Section 3. At the same time, our experimental results (Tables 2 and 3) also demonstrate the effectiveness of the binary classifier. Therefore, based on the current experimental results, utilizing a temporal binary classifier appears to be sufficient in most cases.
Secondly, as stated in Section 3.1 (Lines 146-149), the definitions of temporal-invariant and temporal-variant are relatively clear and mutually exclusive, allowing the vast majority of videos to be categorized into one of these two types. For example, scene switching can be regarded as a type of temporal variation. Consequently, we consider this binary classification of videos into temporal-invariant and temporal-variant categories to be reasonable.
Of course, we also acknowledge this limitation in Appendix F (Limitations) of our paper. In future work, we plan to explore more fine-grained classifications.
Q2: The authors identified key hyperparameters such as the injection weight and the number of Top-K attention heads through grid search, but did not evaluate the impact of these parameters on hallucination performance within a reasonable fluctuation range, nor did they explain their sensitivity. To enhance the reproducibility of the method, it is recommended to give, in the appendix, the value range or step size of the injection weight and ablation results for the number of selected Top-K attention heads, to clarify the impact of each hyperparameter on model performance.
A2: Thank you for your suggestion. We further conducted an ablation study on the value range of the injection weight and the number of selected Top-K attention heads. Specifically, following the grid search settings in the paper, we performed a grid search over the injection weight in the range {8, 16, 24, 32} and the Top-K attention head selection number in the range {32, 64, 128, 256}, using the VideoLLaMA2 model on the VidHalluc benchmark. The experimental results are shown in the table below (using Overall as the evaluation metric):
| Injection Weight \ Top-K Attention Heads | 32 | 64 | 128 | 256 |
|---|---|---|---|---|
| 8 | 71.25 | 71.56 | 71.94 | 71.47 |
| 16 | 71.28 | 71.79 | 72.08 | 71.63 |
| 24 | 71.72 | 71.98 | 72.17 | 71.84 |
| 32 | 71.74 | 72.03 | 72.49 | 71.96 |
As can be seen, while the injection weight and the number of Top-K attention heads do influence model performance, the sensitivity is not drastic. Within a reasonable range, model performance generally improves as both the injection weight and the number of selected Top-K attention heads increase. However, when more attention heads that are less sensitive to hallucinations are included (e.g., Top-K = 256), there is a slight decrease in performance. In practical applications, an accuracy-threshold strategy can also be adopted to select attention heads, further enhancing the robustness of the model. We will add these findings to Appendix E to demonstrate the robustness of our approach.
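For reference, a minimal sketch of this grid search procedure is given below; `evaluate` is a hypothetical callable returning the Overall score for a given (injection weight, Top-K) setting, and the grid values mirror the table above.

```python
# Minimal sketch of the hyperparameter grid search described above.
# `evaluate` is a hypothetical callable returning the Overall score
# for a given (injection_weight, top_k) configuration.
def grid_search(evaluate, weights=(8, 16, 24, 32), top_ks=(32, 64, 128, 256)):
    best_setting, best_score = None, float("-inf")
    for w in weights:
        for k in top_ks:
            score = evaluate(w, k)
            if score > best_score:
                best_setting, best_score = (w, k), score
    return best_setting, best_score  # e.g., ((32, 128), 72.49) for the table above
```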
Q3: Consider giving a more detailed description of the sample screening process, such as the specific GPT-4o prompt and the normal and hallucination-inducing prompts, to enhance the reproducibility of the paper.
A3: Thank you for your suggestion. Firstly, we have already presented the specific content of the prompts we used in Figure 7 of Appendix C.2, and we refer readers to it in Section 4.1 (Line 284).
Secondly, in Section 3.2, we have described our normal and hallucination-inducing prompts in detail, specifically the prompt designed to induce hallucination through frame downsampling. We will add the following content to Section 4.2's Implementation details to enhance reproducibility:
"Consistent with Section 3, we use concatenating the original video and the question as the normal prompt , and use concatenating the 4x downsampled video and the question as the hallucination-inducing prompt ."
Q4: The paper mainly uses quantitative accuracy improvement as the evaluation indicator. It is recommended to add concrete examples illustrating how hallucinations are reduced, such as a comparison of model outputs before and after injecting the offset vector.
A4: Thank you for your suggestion. We will add some concrete examples to Appendix E to illustrate how our method reduces hallucinations. Specifically, since most subtasks are selection/classification tasks, we will focus on examples from the STH description task, showing the model output under the baseline and after applying our method so that readers can more intuitively understand how our approach mitigates hallucinations. Below is an example (since OpenReview cannot display videos, we only show the ground-truth text and the model output text):
Ground Truth: Scene change: Yes, Locations: from on a stage to in the woods.
Model Output (Baseline): Scene change: Yes, Locations: from a dark room to a car.
Model Output (Ours): Scene change: Yes, Locations: from performing on a stage to being in the woods.
As can be seen, the Baseline exhibited hallucinations regarding the order of scene changes and scene recognition, while our method effectively reduced these hallucinations.
Thanks for addressing my concern. I would like to maintain my score.
best,
Thank you for your reply. If you have any further questions or comments, we would be happy to continue the discussion.
This paper applies activation engineering to reduce hallucinations in VideoLLMs. The authors propose two variants of activation intervention (editing at attention vs. layer output) and find both exhibit similar effectiveness across different datasets. A key insight is that hallucination sensitivity is strongly correlated with the temporal variation of video inputs rather than task type. To this end, the authors categorize videos into temporal-invariant and temporal-variant types and propose a fully automated pipeline to collect high-quality datasets accordingly. This categorization allows identification of hallucination-sensitive modules, which are then selectively edited to mitigate hallucinations. Experiments on multiple VideoLLM hallucination benchmarks demonstrate the effectiveness of the approach.
Strengths and Weaknesses
Strengths
- The paper uncovers several valuable observations, including: 1) Frame reduction increases hallucination risk. 2) Temporal variation, rather than task type, is a key factor influencing hallucination sensitivity. 3) Editing attention layers performs slightly better than editing layer outputs, consistent with prior findings.
- The paper extends activation editing to the VideoLLM setting. The method adaptively adjusts the intervention by incorporating temporal dynamics via a tailored dataset collection.
- The manuscript is generally easy to follow and well-motivated.
Weakness
- While the analysis is thorough, the core technical contribution (applying activation editing to VideoLLMs) builds heavily on existing work without significant methodological advancement in the editing technique itself.
- The proposed adaptive editing framework involves non-trivial training (e.g., classifier-based module sensitivity detection). Including an analysis of inference-time overhead and classifier sample efficiency would help clarify the practical feasibility of deploying the method.
- Some recent publications with similar insights should be included in the activation editing part of the related work, and they may be more advanced than the activation engineering used in this paper.
[1] Spectral Editing of Activations for Large Language Model Alignment, NeurIPS 2024.
[2] AlphaEdit: Null-Space Constrained Knowledge Editing for Language Models, ICLR 2025.
[3] Representation Surgery: Theory and Practice of Affine Steering, ICML 2024.
These methods introduce new perspectives—spectral, geometric, or constrained projections—that may offer stronger guarantees or improved efficiency and should be discussed in the context of this work.
Questions
- An LLM's capabilities can easily be corrupted after editing. Could you provide the in-task performance of the intervened model when equipped with your approach? Is there any degradation in performance or in the model's other capabilities?
- What is the sample efficiency for training the binary classifier in your case? What is the inference efficiency after applying your intervention method?
Limitations
yes
Formatting Issues
NA
Q1: While the analysis is thorough, the core technical contribution—applying activation editing to VideoLLMs—builds heavily on existing work without significant methodological advancement in the editing technique itself.
A1: Thank you for your question. Firstly, due to modality differences, existing activation editing methods from the image domain cannot be directly applied to the video domain. One of the key contributions of our work is to identify temporal variation as a crucial factor influencing hallucinations in VideoLLMs. Only by uncovering this key factor can activation editing methods be effectively transferred to the video domain.
Secondly, while the activation editing method we used in Section 4 is not entirely new, we are the first to systematically apply it to VideoLLMs to mitigate model hallucinations, conducting comprehensive experiments and analysis.
Furthermore, based on our analysis, we adapted the method to the specific characteristics of the video domain. Instead of using a unified dataset to calculate offsets, which is common in the image domain, we compute appropriate offsets separately based on the temporal characteristics of videos. We have validated its effectiveness through extensive experiments. This work provides a comprehensive foundation and new insights for future research into the application of activation editing in VideoLLMs.
Q2: The proposed adaptive editing framework involves non-trivial training (e.g., classifier-based module sensitivity detection). Including an analysis of inference-time overhead and classifier sample efficiency would help clarify the practical feasibility of deploying the method. What is the sample efficiency for training the binary classifier in your case? What is the inference efficiency after applying your intervention method?
A2: Thank you for your suggestion.
Inference efficiency: Firstly, as mentioned in Section 4.1 (Lines 298-303), we minimize inference-time overhead by running the classifier and visual encoder in parallel. For more implementation details, please refer to the paper.
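For intuition, a minimal sketch of this parallel setup is shown below; `classifier` and `visual_encoder` are placeholder callables, not the implementation in the paper.

```python
from concurrent.futures import ThreadPoolExecutor

def encode_with_parallel_classification(frames, classifier, visual_encoder):
    # Run the lightweight temporal-variation classifier alongside the visual encoder,
    # so that classification adds little wall-clock overhead on top of encoding.
    with ThreadPoolExecutor(max_workers=2) as pool:
        cls_future = pool.submit(classifier, frames)      # temporal-variant vs. temporal-invariant
        enc_future = pool.submit(visual_encoder, frames)  # dominates the latency
        visual_tokens = enc_future.result()
        temporal_category = cls_future.result()           # typically ready before encoding finishes
    return visual_tokens, temporal_category
```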
Secondly, we further evaluated the per-sample inference time of both our method and the Baseline on the VidHalluc benchmark. The results are presented in the table below (unit: seconds):
| Model | Variant | BQA | MCQ | STH | TSH |
|---|---|---|---|---|---|
| VideoLLaMA2 | Baseline | 1.44 | 1.66 | 1.87 | 1.92 |
| | Ours | 1.44 | 1.66 | 1.86 | 1.93 |
| Qwen2.5-VL | Baseline | 0.99 | 1.00 | 1.35 | 1.10 |
| | Ours | 0.99 | 1.00 | 1.34 | 1.10 |
| Video-LLaVA | Baseline | 2.43 | 2.49 | 3.18 | 3.86 |
| | Ours | 2.43 | 2.49 | 3.19 | 3.88 |
As can be seen, the per-sample inference time of our method on the VidHalluc benchmark is comparable to that of the Baseline, demonstrating the effectiveness of our parallel setup.
Classifier sample efficiency: Firstly, as mentioned in Section 4.1 (Lines 293-297), we sampled instances from each category of the dataset to train the classifier. This is already a relatively small sample size in the current VideoLLM field.
Secondly, we further tested the classifier's sample efficiency by reducing the number of training samples. Specifically, we trained with 100, 200, 300, and 400 samples respectively, and recorded the classification accuracy of the classifier on the VidHalluc benchmark for each quantity. The experimental results are shown in the table below:
| Sample Size | BQA | MCQ | STH | TSH |
|---|---|---|---|---|
| 100 | 90.17% | 89.63% | 88.09% | 88.83% |
| 200 | 92.13% | 91.47% | 91.01% | 90.33% |
| 300 | 94.26% | 95.39% | 92.58% | 92.50% |
| 400 | 97.63% | 97.47% | 94.83% | 94.17% |
It can be observed that as the number of samples increases, the classifier's accuracy gradually improves. However, even with only 100 training samples, the classifier's accuracy still exceeds 88%. This demonstrates that our classifier possesses high sample efficiency.
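As a rough sketch of how such a lightweight binary classifier could be trained (the feature pooling and model choice here are assumptions, not the paper's exact setup):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pool_video_features(frame_features: np.ndarray) -> np.ndarray:
    """Mean-pool frame-level features of shape (T, D) into a single (D,) video vector."""
    return frame_features.mean(axis=0)

def train_temporal_classifier(videos, labels):
    """videos: list of (T, D) feature arrays; labels: 1 = temporal-variant, 0 = temporal-invariant."""
    X = np.stack([pool_video_features(v) for v in videos])
    y = np.asarray(labels)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y)
    return clf
```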
Q3: Some recent publications with similar insights should be included in the activation editing part of the related work, and they may be more advanced than the activation engineering used in this paper. These methods introduce new perspectives (spectral, geometric, or constrained projections) that may offer stronger guarantees or improved efficiency and should be discussed in the context of this work.
A3: Thank you for your suggestion. We will add these recent research findings to the related work section of our revised paper and discuss their similarities and differences compared to our method. Specifically, we will add the following content:
Recent works have explored activation editing techniques from spectral, geometric, and constrained perspectives. Spectral Editing of Activations (SEA) [1] leverages SVD-based inference-time projections on cross-covariance matrices, achieving data-efficient edits without training. AlphaEdit [2] introduces null-space constrained weight updates, providing theoretical guarantees against catastrophic forgetting during sequential edits. Singh et al. [3] formulate steering transformations as optimal affine mappings over hidden states, effectively addressing biases and toxic outputs. Unlike these approaches, our method dynamically identifies hallucination-sensitive modules in VideoLLMs via temporal-aware activation editing, adapting inference-time edits specifically to temporal variation characteristics in video tasks.
1. Qiu, Yifu, et al. "Spectral editing of activations for large language model alignment." Advances in Neural Information Processing Systems 37 (2024): 56958-56987.
2. Fang, Junfeng, et al. "AlphaEdit: Null-Space Constrained Knowledge Editing for Language Models." The Thirteenth International Conference on Learning Representations.
3. Singh, Shashwat, et al. "Representation surgery: theory and practice of affine steering." Proceedings of the 41st International Conference on Machine Learning. 2024.
Q4: An LLM's capabilities can easily be corrupted after editing. Could you provide the in-task performance of the intervened model when equipped with your approach? Is there any degradation in performance or in the model's other capabilities?
A4: Thank you for your question. Firstly, the STH subtask within the VidHalluc benchmark includes a description task, and the performance improvement of our method on this task provides initial evidence that our approach does not harm the model's capabilities.
Secondly, we conducted further evaluation on the VideoMMMU benchmark, a multi-modal and multi-disciplinary video benchmark that evaluates VideoLLMs' knowledge acquisition capability from educational videos, and the results are shown in the table below:
| Model | Variant | Perception | Comprehension | Adaptation | Overall |
|---|---|---|---|---|---|
| VideoLLaMA2 | Baseline | 59.83 | 46.12 | 32.25 | 46.07 |
| | Ours | 59.47 | 46.24 | 32.25 | 45.99 |
| Qwen2.5-VL | Baseline | 57.29 | 42.93 | 37.16 | 45.79 |
| | Ours | 57.46 | 42.57 | 36.98 | 45.67 |
| Video-LLaVA | Baseline | 39.82 | 31.67 | 30.18 | 33.89 |
| | Ours | 39.82 | 31.59 | 31.09 | 34.17 |
As can be seen, the performance of our method on all subtasks of the VideoMMMU benchmark is comparable to the Baseline, demonstrating that our approach does not impair the model's capabilities.
Thank you for your time and effort in the rebuttal.
The additional experiments address most of my concerns about the experiments, and I think the evaluation is technically solid. However, my main concern about novelty still remains:
- The activation editing method is not new, and I personally think being "the first to apply it to VideoLLMs" does not in itself constitute additional novelty.
- The insight of "temporal variation as a crucial factor influencing hallucinations in VideoLLMs" is also not new for video understanding. See [1, 2, 3].
[1] EventHallusion: Diagnosing Event Hallucinations in Video LLMs
[2] VidHal: Benchmarking Temporal Hallucinations in Vision LLMs
[3] Thinking Hallucination for Video Captioning
Therefore, I will keep my score as it is. To the AC/SAC: if you believe that my score is too harsh given these two justifications, feel free to interpret the numerical score in the review as relatively positive.
Best
We sincerely thank you for your continued engagement. While we acknowledge that activation engineering as a general idea has been explored in prior work, we respectfully clarify that the novelty of our contribution does not lie solely in being the first to apply it to VideoLLMs. Rather, our work presents a systematic framework that reveals and exploits a previously underutilized connection between temporal variation characteristics and the hallucination sensitivity of internal modules in VideoLLMs. This leads to a task-agnostic, temporally-aware activation engineering framework that supports dynamic, input-conditioned inference-time intervention without retraining.
Unlike existing activation editing approaches, which typically use fixed datasets and static module selection, our framework integrates three new components:
- An empirical finding that hallucination sensitivity is strongly correlated with temporal variation rather than task type, validated through binary classifier performance across modules (Figure 3(a));
- A temporal variation classifier that dynamically determines the appropriate intervention module per input (Figure 5), executed in parallel with the model forward pass to avoid runtime overhead;
- A fully automated dataset construction pipeline that distinguishes temporal-invariant and temporal-variant samples via both statistical and GPT-4o-guided filtering (Figure 4). These components form a coherent and practically deployable framework, enabling robust hallucination mitigation without additional fine-tuning, while addressing the unique challenges of the video domain.
In response to your concern that the insight around temporal variation is not novel, we would like to emphasize the distinction between identifying a concept (e.g., that temporal dynamics matter) and converting that insight into a concrete, operational mechanism that improves model behavior. While benchmarks such as EventHallusion [1], VidHal [2], and prior studies like Thinking Hallucination [3] have categorized hallucination phenomena along temporal lines, they do not propose any method for dynamically detecting such variation at inference time, nor do they use this signal to guide targeted activation interventions. Our method bridges this gap by (a) linking temporal variation to module sensitivity in a quantifiable way, and (b) implementing a lightweight, LLM-train-free pipeline that turns this theoretical insight into consistent performance gains across multiple models and benchmarks (Table 4 and 5).
Finally, our ablation studies reveal that naive combinations of activation vectors from mixed temporal sources actually harm or compromise performance, underscoring the necessity of temporal-aware design. We also show that hallucination-sensitive modules vary asymmetrically across temporal regimes, further confirming that task type alone is not sufficient for effective intervention (Section 3.4). Together, these findings highlight the non-trivial methodological advancements in both understanding and engineering, distinguishing our work from prior benchmarks and analysis-only studies.
We hope this response clarifies that our contribution lies not in repurposing an existing tool, but in transforming activation engineering into a principled, temporally-adaptive framework tailored to the structure of video-language reasoning, with strong empirical and practical relevance.
Thank you again for your thoughtful feedback. If you have any further questions or suggestions, we would be glad to continue the discussion.
The paper studies the causes of hallucination in video LLMs and proposes a mitigation technique based on its findings. The core idea is to compute internal offset vectors, i.e., the difference in activations between normal and hallucination-inducing inputs, and to inject these offsets during inference to drive the model toward correct outputs and away from hallucinated ones.
A key insight is that hallucination is tied to a video's temporal variation more than to the task type. The idea is therefore to use temporal-aware activation engineering, which first classifies a video as temporal-variant or temporal-invariant and applies the appropriate offset vector accordingly.
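A minimal sketch of this idea, with assumed names and shapes rather than the paper's exact implementation, might look like:

```python
import torch

def compute_offset(normal_acts: torch.Tensor, halluc_acts: torch.Tensor) -> torch.Tensor:
    """Mean activation difference between normal and hallucination-inducing inputs.

    Both tensors have shape (num_samples, hidden_dim), collected at a chosen module.
    """
    return (normal_acts - halluc_acts).mean(dim=0)

def select_offset(frames, classifier, offsets: dict) -> torch.Tensor:
    """Pick the precomputed offset matching the video's predicted temporal category."""
    category = "temporal_variant" if classifier(frames) == 1 else "temporal_invariant"
    return offsets[category]
```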
The experiments are comprehensive, evaluating on two benchmarks, VidHalluc and EventHallusion, using multiple models like VideoLLaMA and Qwen-VL, and show the approach reduces hallucinations significantly and outperforms all baseline methods like TCD and DINO. This method is also training-free and can generalize across tasks and architectures.
Strengths and Weaknesses
Strengths:
- Observation and insight: Hallucination in VideoLLMs seems to be more related to temporal variation. This observation opens up directions for understanding and mitigating hallucinations.
- Simple yet effective method: The proposed TA-AE framework is simple—compute and inject offset vectors between normal and hallucination-inducing activations. It is shown to be effective across models and benchmarks, without the need for retraining or fine-tuning.
- Experimental results are strong on two benchmarks, VidHalluc and EventHallusion, and across multiple architectures including Qwen and VideoLLaMA.
Weaknesses:
- Lack of theoretical justifications: Why is hallucination more correlated with temporal variations than task types? Why is injecting offsets into certain layers or attention heads effective? The approach is largely empirical and needs more theoretical analysis and justification.
- Limited hallucination types: The paper focuses on temporally induced hallucinations. But hallucinations in VideoLLMs can take many other forms, such as factual errors, counting mistakes, recognition errors, and image-based errors. It is unclear whether this method generalizes to other hallucination modes.
- Engineering assumptions: The method uses a simple temporal variation classifier to split videos into two categories and applies different offsets. This seems a bit too simplistic and engineering-heavy. The need for an extra model introduces overhead. The paper also does not show how general the classifier is or what the failure cases are.
Questions
- Can you provide any theoretical analysis or intuitive explanation for why this mechanism reduces hallucinations effectively?
- How do you ensure that the offsets generalize to different videos and video types?
- Why is a binary classification sufficient? What about videos that do not fit cleanly into either category?
- Are there any known failure cases, and how does the method behave in those situations?
- Are there other types of hallucinations that are as important as or more important than temporal variation (e.g., factual, spatial, or visual hallucinations)?
- How are shallow and deep layers defined in your method? Is this based on layer index or some functional behavior?
Limitations
Yes
Final Justification
After considering the rebuttal, I am updating my score to a borderline accept. The authors addressed key concerns by providing clarifications on the role of temporal variation in hallucinations and supporting their claims with additional experiments across diverse datasets and failure scenarios. While the approach remains largely empirical and some design choices (e.g., binary classification, offset injection) are heuristic, the method is simple, training-free, and shows strong empirical gains. The core insight, that temporal variation is a major driver of hallucination in VideoLLMs, is valuable and may inspire future work. Despite remaining limitations, the paper presents a practical and promising contribution.
Formatting Issues
NA
Q1: Lack of theoretical justifications: Why is hallucination more correlated with temporal variations than task types? Why is injecting offsets into certain layers or attention heads effective? The approach is largely empirical and needs more theoretical analysis and justification. Can you provide any theoretical analysis or intuitive explanation for why this mechanism reduces hallucinations effectively?
A1: Thank you for your question. Firstly, one of the core challenges for VideoLLMs is effectively capturing and integrating the temporal dynamics of videos. When temporal information is incomplete or misinterpreted, the model is more prone to generating hallucinations [1, 2]. Conversely, task types often represent a higher-level semantic abstraction that is largely independent of the underlying video's intrinsic temporal characteristics. For a given video, the same underlying temporal events can be probed with different task formats. For example, a multiple-choice question can be easily rephrased into a true/false question without altering the video's temporal content.
Secondly, regarding the effectiveness of injecting offset vectors, we can view the attention heads and layers within a VideoLLM as "experts" that process information at different levels of abstraction. Some of these "experts" may be specifically responsible for handling temporal dynamics or integrating information across time, which is consistent with previous findings [3,4]. When these "experts" receive ambiguous or inconsistent temporal signals (e.g., due to frame downsampling), they may produce biased internal representations, leading to hallucinations. By calculating and injecting "offset vectors", we are essentially introducing a "correction signal" into the internal workspace of these critical "experts". This signal pushes the activations towards a state closer to the true, non-hallucinated output, thereby correcting the "latent deviation" caused by incomplete or misleading temporal information. This has also been demonstrated in previous works [5, 6]. This targeted intervention, combined with our discovery of the modules' sensitivity to temporal variations, enables activation engineering to efficiently and specifically mitigate hallucination problems in VideoLLMs.
1. Kong, Ming, et al. "MHBench: Demystifying Motion Hallucination in VideoLLMs." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 39. No. 4. 2025.
2. Rawal, Ruchit, et al. "ARGUS: Hallucination and Omission Evaluation in Video-LLMs." arXiv preprint arXiv:2506.07371 (2025).
3. Park, Yein, et al. "Does Time Have Its Place? Temporal Heads: Where Language Models Recall Time-specific Information." arXiv preprint arXiv:2502.14258 (2025).
4. Francis, Jiztom Kavalakkatt, and Matthew J. Darr. "Interpretable AI for Time-Series: Multi-Model Heatmap Fusion with Global Attention and NLP-Generated Explanations." arXiv preprint arXiv:2507.00234 (2025).
5. Li, Kenneth, et al. "Inference-time intervention: Eliciting truthful answers from a language model." Advances in Neural Information Processing Systems 36 (2023): 41451-41530.
6. Chen, Junzhe, et al. "Ict: Image-object cross-level trusted intervention for mitigating object hallucination in large vision-language models." Proceedings of the Computer Vision and Pattern Recognition Conference. 2025.
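For intuition regarding the offset-injection mechanism described in A1 above, a minimal PyTorch sketch of this kind of inference-time correction is shown below; the target module, offset vector, and weight are placeholders rather than the paper's exact configuration.

```python
import torch

def register_offset_injection(attn_module: torch.nn.Module,
                              offset: torch.Tensor,
                              weight: float = 1.0):
    """Add `weight * offset` to the module's output hidden states during the forward pass."""
    def hook(module, inputs, output):
        # Many attention modules return a tuple whose first element is the hidden states.
        if isinstance(output, tuple):
            return (output[0] + weight * offset.to(output[0].dtype),) + output[1:]
        return output + weight * offset.to(output.dtype)
    return attn_module.register_forward_hook(hook)

# Usage sketch: inject into a chosen layer, generate, then remove the hook.
# handle = register_offset_injection(model.layers[k].self_attn, offset_vec, weight=0.5)
# ...run generation...
# handle.remove()
```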
Q2: How do you ensure that the offsets generalize to different videos and video types?
A2: Thank you for your question. Firstly, in Section 4.1, we specifically chose ShareGPT4Video as our data source, which is different from the sources of the VidHalluc and EventHallusion benchmarks. This was done intentionally, unlike previous work that only drew from benchmark subsets, to test the ability of the obtained offsets to generalize to different video types and to avoid inflated results (detailed explanation can be found in Lines 263-269 of Section 4.1).
Secondly, we utilized two benchmarks with different video types, VidHalluc and EventHallusion. VidHalluc contains a variety of video types such as action and scene, while EventHallusion mainly consists of event-related videos. This further verifies the generalization ability of our method across diverse video types.
Finally, the experimental results in Section 4.3 (Tables 4 and 5) demonstrate that the offsets generated by our method can effectively generalize to different video types.
Q3: Why is a binary classification sufficient? What about videos that do not fit cleanly into either category?
A3: Thank you for your question. Firstly, the binary classification into temporal-invariant and temporal-variant categories is derived from our analysis in Section 3. At the same time, our experimental results (Tables 2 and 3) also demonstrate the effectiveness of the binary classifier. Therefore, based on the current experimental results, utilizing a temporal binary classifier appears to be sufficient in most cases.
Secondly, as stated in Section 3.1 (Lines 146-149), the definitions of temporal-invariant and temporal-variant are relatively clear and mutually exclusive, allowing the vast majority of videos to be categorized into one of these two types. Consequently, we consider this binary classification of videos into temporal-invariant and temporal-variant categories to be reasonable.
Of course, we also acknowledge this limitation in Appendix F (Limitations) of our paper. In future work, we plan to explore more fine-grained classifications.
Q4: Are there any known failure cases, and how does the method behave in those situations?
A4: Thank you for your question. We further analyzed the specific changes in model output when using the Baseline versus our method, across all subtasks' failure cases on the VidHalluc benchmark with the Qwen2.5-VL-7B model. Specifically, we used GPT-4o to determine if the hallucinations produced by the Baseline and our method's outputs were consistent, and we calculated their hallucination consistency rate. The experimental results are shown in the table below:
| Model | BQA | MCQ | STH | TSH |
|---|---|---|---|---|
| Qwen2.5-VL-7B | 95.24% | 92.37% | 96.13% | 95.92% |
As can be seen, the hallucination consistency rate of our method with the Baseline exceeds 92% across all subtasks' failure cases. This indicates that in the vast majority of failure situations, our method gracefully degrades to the original model's behavior and does not further exacerbate the hallucinations.
Q5: Are there other types of hallucinations that are as important as or more important than temporal variation (e.g., factual, spatial, or visual hallucinations)?
A5: Thank you for your question. Firstly, in Section 3, our analysis of the VidHalluc benchmark revealed that temporal variation is the primary cause of hallucination in VideoLLMs. Even in other types of hallucinations (e.g., action hallucination, which is primarily evaluated by the BQA and MCQ subtasks in VidHalluc), temporal variation remains a significant contributing factor.
Furthermore, our experimental results in Section 4.3 (Tables 4, 5) provide additional proof of this method's effectiveness on other types of hallucinations, including EventHallusion, which specifically targets hallucinations about events.
In summary, our analysis in Section 3 indicates that temporal variation is one of the most important causes of hallucination in VideoLLMs. While other types of hallucinations (e.g., factual errors) may also exist, they are often closely related to temporal variations. Therefore, we consider temporal variation to be one of the fundamental types of hallucination.
Naturally, in the future, we will also consider other types of hallucinations and attempt to incorporate them into our activation engineering framework.
Q6: How are shallow and deep layers defined in your method? Is this based on layer index or some functional behavior?
A6: Thank you for your question. In the paper, the definitions of "shallow" and "deep" layers are simply based on their index. Layers closer to the input are considered shallow, while those closer to the output are considered deep. We will annotate this on the x-axis in Figures 3, 8, and 9 to provide readers with a clearer understanding.
Thank you for the detailed and thoughtful rebuttal. The authors provide helpful clarifications and additional experiments, particularly on generalization to diverse video types, failure case behavior, and downstream hallucination consistency. While the method is still primarily empirical and some design choices (e.g., binary classification, offset injection) remain heuristic, the findings around temporal variation and the training-free mitigation approach are novel and practically relevant. Given the promising results, simplicity of the framework, and potential for broader impact, I am raising my score to a borderline accept.
We sincerely appreciate your thoughtful feedback and are glad that our responses helped clarify your concerns. Thank you for recognizing the core ideas and contributions of our work, and for your valuable efforts in helping us improve it! If you have any further questions or suggestions, we would be happy to continue the discussion.
This paper leverages activation engineering to mitigate hallucinations in VideoLLMs. The authors systematically investigate various activation engineering mechanisms and discover that the temporal variation characteristics of tasks play a more significant role than the task type itself. Based on this insight, they propose a temporal-aware activation engineering framework that adaptively manipulates activations according to temporal variation characteristics. Additionally, the authors develop an automated pipeline to collect high-quality datasets for both temporal-invariant and temporal-variant video types. Extensive experiments demonstrate the effectiveness of the proposed method, yielding consistent improvements in hallucination mitigation across multiple benchmarks.
Strengths and Weaknesses
Strength:
- The paper provides novel insights and solid findings regarding the key factors of activation engineering, highlighting that temporal variation characteristics are more crucial than task type.
- The proposed temporal-aware activation engineering framework is simple yet effective in reducing hallucinations in VideoLLMs.
- The paper is well-written and easy to follow.
Weaknesses:
- The scalability of the proposed method is not clearly demonstrated. Since the most powerful models (like Gemini, Qwen2.5-VL-72B) may not suffer from severe hallucination issues, the effectiveness of the proposed approach in such models remains uncertain and may be limited.
Questions
It appears that contrastive decoding methods could also be adapted for hallucination reduction. The paper would benefit from a comparison with more advanced contrastive decoding approaches, such as [1-4], which can be readily applied to reduce hallucinations in video generation tasks.
[1] Chen, Zhaorun, et al. "Halc: Object hallucination reduction via adaptive focal-contrast decoding." arXiv preprint arXiv:2403.00425 (2024).
[2] Deng, Ailin, Zhirui Chen, and Bryan Hooi. "Seeing is believing: Mitigating hallucination in large vision-language models via clip-guided decoding." arXiv preprint arXiv:2402.15300 (2024).
[3] Liu, Sheng, Haotian Ye, and James Zou. "Reducing hallucinations in large vision-language models via latent space steering." The Thirteenth International Conference on Learning Representations. 2025.
[4] An, Wenbin, et al. "Mitigating object hallucinations in large vision-language models with assembly of global and local attention." Proceedings of the Computer Vision and Pattern Recognition Conference. 2025.
Limitations
Yes
Final Justification
After considering the rebuttal, my concerns are fully addressed, which leads me to keep my positive rating. Since the paper presents a novel temporal-aware activation engineering framework that addresses the problem effectively, it should be accepted.
Formatting Issues
No major issue.
Q1: The scalability of the proposed method is not clearly demonstrated. Since the most powerful models (like Gemini, Qwen2.5-VL-72B) may not suffer from severe hallucination issues, the effectiveness of the proposed approach in such models remains uncertain and may be limited.
A1: Thank you for your question. We further conducted additional experiments to verify the effectiveness of our method on the VidHalluc benchmark using the Qwen2.5-VL-72B-Instruct model. The results are shown in the table below:
| Model | Variant | BQA | MCQ | STH | TSH | Overall |
|---|---|---|---|---|---|---|
| Qwen2.5-VL-72B | Baseline | 81.17 | 75.65 | 66.75 | 76.00 | 74.89 |
| | Ours | 84.25 | 86.16 | 69.43 | 78.52 | 79.59 |
Firstly, while the Qwen2.5-VL-72B model outperforms smaller models such as Qwen2.5-VL-7B and VideoLLaMA2-7B on most subtasks (BQA, STH, TSH) of the VidHalluc benchmark, indicating that more powerful models do mitigate hallucination to some extent, hallucination problems still persist. Furthermore, for certain tasks (e.g., MCQ), the hallucination is even more severe than with VideoLLaMA2-7B. This highlights the persistent challenge of hallucination in VideoLLMs.
Secondly, it can be seen that our method also achieves significant performance improvements on the Qwen2.5-VL-72B model, demonstrating consistent hallucination mitigation capabilities across all subtasks. This proves the effectiveness and scalability of our method in alleviating hallucination in large models.
Q2: It appears that contrastive decoding methods could also be adapted for hallucination reduction. The paper would benefit from a comparison with more advanced contrastive decoding approaches, such as [1-4], which can be readily applied to reduce hallucinations in video generation tasks.
A2: Thank you for your suggestion. Due to limitations in time and computational resources, we selected CGD and VTI ([2] and [3] from your provided papers) for further comparison experiments on the VidHalluc benchmark, using the VideoLLaMA2-7B, Qwen2.5-VL-7B and Video-LLaVA-7B models.
Since both CGD and VTI are primarily image-based methods, we adapted them to the video domain with the following improvements:
- For CGD, we replaced the CLIP model used for similarity score calculation with VideoCLIP.
- For VTI, we masked all frames of each video to generate a set of corresponding randomly masked videos. Additionally, for each video caption, we used GPT-4o to generate a corresponding hallucinated version.
- For a fair comparison, we used the union of both to generate VTI's visual and textual directions.
- All other hyperparameters adopted the optimal values provided in their respective papers.
The experimental results are shown in the table below:
| Model | Variant | BQA | MCQ | STH | TSH | Overall |
|---|---|---|---|---|---|---|
| VideoLLaMA2 | Baseline | 75.77 | 83.35 | 56.55 | 58.17 | 68.46 |
| | CGD | 76.03 | 82.76 | 59.14 | 59.02 | 69.24 |
| | VTI | 78.14 | 83.67 | 62.47 | 60.28 | 71.14 |
| | Ours | 79.09 | 84.07 | 65.14 | 61.67 | 72.49 |
| Qwen2.5-VL | Baseline | 73.50 | 63.66 | 60.55 | 59.17 | 64.22 |
| | CGD | 74.23 | 65.18 | 60.97 | 59.86 | 65.06 |
| | VTI | 75.49 | 69.82 | 61.74 | 60.97 | 67.01 |
| | Ours | 76.18 | 74.95 | 61.81 | 61.33 | 68.57 |
| Video-LLaVA | Baseline | 67.75 | 66.60 | 21.80 | 46.83 | 50.75 |
| | CGD | 67.42 | 65.79 | 29.17 | 47.29 | 52.42 |
| | VTI | 68.23 | 66.14 | 33.24 | 48.96 | 54.14 |
| | Ours | 67.70 | 66.84 | 41.16 | 49.50 | 56.30 |
As can be seen, our method outperforms both CGD and VTI in mitigating model hallucinations across various subtasks and models on the VidHalluc benchmark. This fully demonstrates the advantage of our approach in alleviating model hallucinations.
Dear Reviewer,
I hope this message finds you well. As the discussion period is approaching its end with fewer than four days remaining, I wanted to kindly check whether there are any remaining concerns or feedback we could address. Your insights are highly valuable to us, and we’d be glad to engage further if there’s anything else you’d like us to clarify.
Thank you again for your time and effort in reviewing our work.
Thank you for the detailed rebuttal. It addresses most of my concerns, and I appreciate the clarifications and additional analysis. I will keep my current rating.
We sincerely appreciate your time and effort in reviewing our work. If you have any further questions or suggestions, we would be glad to continue the discussion.
The paper studies activation-editing for VideoLLMs and identifies temporal variation (rather than task type) as a key driver of hallucination. Building on this, it proposes a temporal-aware activation-engineering (TA-AE) framework that (i) classifies videos into temporal-invariant vs. temporal-variant categories with a light binary classifier, and (ii) injects offsets at selected layers/heads accordingly. Experiments on VidHalluC and EventHallusion across models (VideoLLaMA2, Qwen-VL, Video-LLaVA) show consistent reductions in hallucination. During the discussion authors added results on Qwen2.5-VL-72B, comparisons to contrastive decoding methods (CGD, VTI) adapted to video, ablations on heads/weights, qualitative examples, and checks that standard capabilities (VideoMMMU) are not hurt.
Three reviewers (mR01, 4MRg, DqVL) are positive after rebuttal, citing strong empirical gains, a simple and practical framework, and new large-model and baseline comparisons added during discussion. One reviewer (Svjj) remains unconvinced on novelty relative to recent activation-editing literature, though acknowledges the technical soundness and added analyses. On balance, the paper offers (i) a useful empirical insight—temporal variation is a major contributor to VideoLLM hallucinations, and (ii) a practical, training-free method that improves over strong decoding baselines and scales to larger models without harming general capabilities. These match NeurIPS standards for an applied contribution.