PaperHub

Overall: 5.5 / 10 · Poster · 4 reviewers (min 5, max 6, std 0.5)
Ratings: 5, 5, 6, 6 · Confidence: 4.5
Correctness: 3.0 · Contribution: 2.8 · Presentation: 2.3

NeurIPS 2024

Mitigating Object Hallucination via Concentric Causal Attention

OpenReview · PDF
Submitted: 2024-05-08 · Updated: 2024-11-06

Abstract

Keywords
Object Hallucination · Multimodal Learning · Visual Hallucination

Reviews & Discussion

Review (Rating: 5)

The paper attributes hallucinations in Large Vision-Language Models (LVLMs) to Rotary Positional Encoding (RoPE). It observes that LVLMs inherit a long-term decay issue from RoPE, where the inner product of two tokens decays with their relative distance. This results in weaker visual-text interactions when tokens are far apart, leading to more frequent hallucinations when relevant visual tokens are distant from the token currently being generated. To address this, the paper proposes a novel method called Concentric Causal Attention (CCA) to mitigate the effects of RoPE. Experiments show that CCA effectively reduces hallucinations and enhances the perception capability of LVLMs.
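
To make the decay concrete, here is a minimal sketch (not the paper's code; it assumes all-ones query/key vectors and a LLaMA-style head dimension of 128): under RoPE, the inner product of two such vectors reduces to 2·Σᵢ cos(d·θᵢ) at relative distance d, and its magnitude shrinks as d grows.

```python
# A minimal sketch (not the paper's code) of RoPE's long-term decay.
# With all-ones query/key vectors, the rotary inner product reduces to
# 2 * sum_i cos(dist * theta_i), whose magnitude shrinks as the
# relative distance grows.
import numpy as np

def rope_score(dist, d=128, base=10000.0):
    """Inner product of two all-ones vectors under RoPE at relative
    distance `dist` (the rotation is relative, so only `dist` matters)."""
    theta = base ** (-2.0 * np.arange(d // 2) / d)  # theta_i = base^(-2i/d)
    return 2.0 * np.sum(np.cos(dist * theta))

for dist in [0, 1, 4, 16, 64, 256, 1024]:
    print(f"relative distance {dist:5d}: score = {rope_score(dist):8.2f}")
```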

Strengths

  1. Addressing the positional-encoding aspect is quite fresh and particularly advantageous, as it avoids the latency and multiple-inference requirements of contrastive decoding methods such as VCD [1] and M3ID [2], which need two output probability distributions.
  2. Experimental results are quite strong.

[1] Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding, CVPR 2024.
[2] Multi-Modal Hallucination Control by Visual Information Grounding, CVPR 2024.

Weaknesses

  1. The method is appealing as it addresses a different aspect than existing methods, but the writing can be improved in several areas. For example, the abstract should include brief background on RoPE and an explanation of the long-term decay problem, which would make the paper friendlier to readers. Additionally, the captions for Figures 1 and 2 could be more concise and communicative to enhance readability. Figures 2 and 3 are difficult to understand and need clearer presentation.
  2. Since long-term decay is an inherent problem of RoPE, the paper should compare RoPE with other standard positional encodings (e.g., absolute, relative, learnable positional encodings) as well as some advanced positional encodings. This would provide a more comprehensive analysis of the issue.
  3. It is important to note that VCD [1] and OPERA [2] are training-free methods. Therefore, comparisons should also include more recent training-based methods.

[1] Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding, CVPR 2024.
[2] OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation, CVPR 2024.

Questions

  1. In Figure 1, why are the attention weights higher at the beginning and the end? If this is related to the image content itself, then Figure 1 needs to show both the image and the textual query together to determine whether the attention is well distributed with respect to the query-object relationship. If the example is averaged over 3k POPE examples, this should be clearly stated.
  2. Is the method applicable to other widely used models like InstructBLIP [1] and Qwen-VL [2]? If these models do not use RoPE, the paper should mention this and discuss the implications.
  3. Does the CCA method end positional encoding at the center because objects are statistically more likely to be located at the center? Is this the reason for the higher attention at the beginning and the end in Figure 1(b)? Empirical evidence supporting this assumption should be provided.

[1] InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning, NeurIPS 2023.
[2] Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond, arXiv 2023.

Limitations

If the method is limited to LVLMs that use RoPE (e.g., LLaVA, MiniGPT), it is necessary to mention this as a limitation.

Author Response

Thank you for your meticulous reading and for crediting the novelty of our analysis of Rotary Position Encoding (RoPE) and LVLM hallucination. We appreciate you pointing out additional references, which we will include to make our research more complete. Please find our responses below.

W1: Clarification on RoPE.

A: Thanks for your kind advice. We will include short background information on RoPE indicating that, like absolute and learnable position encodings, RoPE is a type of position encoding; it is adopted by existing Large Language Models such as LLaMA and inherited by most open-source LVLMs. We will also include detailed guidance in the Appendix on the definition of RoPE and how it is used in the LLaMA architecture. For now we refer to [55] and lines 127-149 of our manuscript, where we present a mathematical form of RoPE and its long-term decay property. For further clarification of Figure 2 in our manuscript, please see the new illustration in Figure 4 of the uploaded pdf. For Figure 3, please refer to Figure 1 (right) of the uploaded pdf for a new illustration, where (a) to (f) correspond to the first eight rows in Figure 3.d of our manuscript.
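
For reference, a compact restatement of RoPE and its decay property (standard notation following Su et al.; it may differ slightly from the manuscript's):

```latex
% RoPE (Su et al.): queries and keys are rotated by position-dependent
% block-diagonal rotation matrices, so the attention score depends only
% on the relative offset m - n.
\[
f_q(\mathbf{x}_m, m) = \mathbf{R}^{d}_{\Theta,m} \mathbf{W}_q \mathbf{x}_m,
\qquad
f_k(\mathbf{x}_n, n) = \mathbf{R}^{d}_{\Theta,n} \mathbf{W}_k \mathbf{x}_n,
\qquad
\Theta = \bigl\{\theta_i = 10000^{-2(i-1)/d}\bigr\}_{i=1}^{d/2},
\]
\[
\bigl\langle f_q(\mathbf{x}_m, m),\, f_k(\mathbf{x}_n, n) \bigr\rangle
= g(\mathbf{x}_m, \mathbf{x}_n, m - n).
\]
% Long-term decay: the score admits an upper bound that shrinks as the
% relative distance |m - n| grows, so distant visual tokens interact
% more weakly with the current text token.
```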

W2: Alternative position encoding.

A: Thanks for pointing this out. We note that RoPE is the default position encoding in LLaMA; simply replacing RoPE with other position encodings is technically viable but deviates from the LLaMA training scheme. To study this, we train a model with learnable position encoding and a model with relative position encoding [C]. Due to resource limitations, we train both models on 20k instruction data (instead of 665k) and train a new CCA model under the same setup for fair comparison. According to the results below, LVLM (learnable) performs much worse than LVLM (cca), while training of LVLM (relative) does not converge and is not viable.

| POPE (instruction-tune 20k) | ran acc | ran f1 | pop acc | pop f1 | adv acc | adv f1 |
| --- | --- | --- | --- | --- | --- | --- |
| COCO |  |  |  |  |  |  |
| learnable | 85.93 | 85.83 | 83.80 | 84.02 | 77.73 | 79.28 |
| ours | 88.60 | 88.33 | 85.00 | 85.20 | 81.50 | 82.35 |
| GQA |  |  |  |  |  |  |
| learnable | 83.33 | 83.94 | 79.67 | 81.08 | 73.23 | 76.50 |
| ours | 85.63 | 85.97 | 79.83 | 81.37 | 75.13 | 77.98 |
| A-OKVQA |  |  |  |  |  |  |
| learnable | 83.47 | 84.01 | 80.50 | 81.67 | 72.27 | 75.80 |
| ours | 87.77 | 87.93 | 80.73 | 82.23 | 73.87 | 77.33 |

W3: Comparison with training-based methods.

A: We note that we compared our method with LLaVA-RLHF, a training-based method that mitigates object hallucination in LVLMs, in Table 1 (POPE) and Table 2 (CHAIR) of our manuscript. We should point out that we compare our 7B model against the LLaVA-RLHF-13B model and ours still stands out, indicating the effectiveness of the proposed method. As suggested, we also compare CCA with a recent training-based method, SeVa [B], which applies DPO training to LLaVA-1.5-7B. Overall, our CCA model requires less training compute than SeVa, since SeVa adds a DPO training stage on top of the LLaVA-1.5 pre-training and fine-tuning stages, while our training strictly follows LLaVA-1.5. POPE results are listed below.

| POPE | ran acc | ran f1 | pop acc | pop f1 | adv acc | adv f1 |
| --- | --- | --- | --- | --- | --- | --- |
| MSCOCO |  |  |  |  |  |  |
| SeVa-7B-MoCo | 89.43 | 88.88 | 87.23 | 86.88 | 82.47 | 82.82 |
| ours | 88.03 | 86.65 | 86.87 | 85.54 | 85.67 | 84.42 |
| A-OKVQA |  |  |  |  |  |  |
| SeVa-7B-MoCo | 89.96 | 90.34 | 84.33 | 85.70 | 75.57 | 79.35 |
| ours | 90.27 | 89.71 | 88.40 | 87.98 | 82.30 | 82.74 |
| GQA |  |  |  |  |  |  |
| SeVa-7B-MoCo | 89.27 | 89.73 | 79.67 | 82.17 | 75.67 | 79.39 |
| ours | 88.40 | 87.68 | 86.47 | 85.91 | 82.20 | 82.37 |

Despite less overall training compute, we highlight that our LLaVA-1.5-7B-CCA still outperforms SeVa-7B on 6 POPE evaluations. For the more challenging adversarial evaluations, LLaVA-1.5-7B-CCA consistently surpasses SeVa-7B by large margins. We also compare our results with SeVa on LVLM benchmarks, where the SeVa results are taken directly from their paper. Our model surpasses SeVa-7B in most cases.

| Model | SEED-all | SEED-img | SEED-vid | SQA | GQA | VizWiz | MMBench | MMStar | TextVQA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llava-1.5-7b | 58.6 | 66.1 | 37.3 | 66.8 | 62.0 | 50.0 | 64.3 | 30.0 | 58.2 |
| vcd-llava-1.5-7b (new) | 58.3 | 63.7 | 37.6 | 68.5 | 61.9 | 50.5 | - | 34.6 | 54.4 |
| seva-7b-diffu800 (new) | - | 65.8 | - | 67.5 | 60.7 | - | 65.6 | - | - |
| ours | 61.7 | 67.1 | 41.0 | 69.8 | 63.1 | 57.6 | 65.4 | 33.2 | 57.8 |

Q1: Figure 1.

A: Thank you for your question. We would like to clarify that the higher aggregation values at the beginning and end of Figure 1.b (manuscript) are not related to image content, as we averaged over 3k COCO images to obtain the visualization results (as detailed in Appendix B.1). This may be attributed to the removal of RoPE, which leads to out-of-training-distribution inputs and breaks pre-trained LLaMA position encoding.

Q2: Applicability to Q-formers.

A: Thanks for sharing this concern. Both InstructBLIP [14] and Qwen-VL [4] adopt RoPE in their language models. Nevertheless, our 2-D positional alignment strategy is designed for spatial-locality-preserving LVLMs [42,41,5], where the full image embedding from the vision encoder is kept. Applying CCA to models like InstructBLIP and Qwen-VL is technically viable but not our design intention. We will include this as a limitation of our method.

Q3: Empirical evidence on our concentric design.

A: Yes. The concentric design is motivated by the observation that objects are statistically more likely to be located at the centre. Please find statistical evidence collected from COCO and GQA in Figure 3 of the uploaded pdf. Please also refer to our response to Q1, where we clarify the results in Figure 1.b (manuscript).

L1: RoPE.

A: Thanks for pointing this out. We acknowledge that our method cannot be applied to models with position encodings other than RoPE and will state this in the Limitations section. However, we believe it is not a major drawback, as most existing LVLMs use LLaMA as the language backbone, where RoPE is the position encoding scheme.

Comment

First of all, I appreciate the authors' careful attempt to answer all of the weaknesses I mentioned. Their clarifications on RoPE, including the background information and additional illustrations, have addressed my concerns. The comparison with alternative position encodings and training methods, particularly the empirical results the authors provided, strengthens the validity of the approach. I appreciate the clarification on Figure 1 and the explanation of the concentric design, both of which are now much clearer to me. Including the limitations regarding RoPE and the applicability to Q-formers demonstrates transparency and further enhances the quality of this work. Overall, the rebuttal has significantly improved my understanding, and I am raising my score accordingly.

Comment

Thank you for the follow-up comment. We sincerely appreciate your valuable feedback and recognition of our rebuttal efforts. The additional analyses and experiments you suggested have helped us strengthen our paper and solidify our approach. If you have any further questions or comments, we are eager to address them.

Review (Rating: 5)

The paper shows that object hallucination in LVLMs is linked to the commonly adopted Rotary Position Encoding (RoPE). The long-term decay in RoPE causes hallucinations when important visual tokens are distant from visual instructions. To address this, the authors propose the Concentric Causal Attention strategy to reduce the distance between these tokens. However, the experimental results of the proposed method are not promising on several benchmark datasets.

Strengths

The analysis of the long-term decay in RoPE and its impact on hallucinations in LVLMs is novel.

Weaknesses

  • The method section is quite short and does not seem to offer a comprehensive solution to the LVLM hallucination problem. It lacks justification for the proposed scanning method; in particular, the concentric causal masking section is not clear.

  • The performance of the proposed method is not promising on several benchmark datasets. For example, the method sometimes achieves the best results only when combined with other state-of-the-art methods. An analysis of when and why CCA alone works or fails (Table 2) would provide more insight.

  • The detailed description of the method with the figure is not clear. Please see the questions below.

Questions

  • Please clarify what the orange and yellow colors represent in Figure 3(d). The explanation of causal masking in Figure 3(d) is not clear: guidance on interpreting the concentric causal masking is needed.

  • What is the distribution of the aggregated correct responses in Figure 2 with the proposed concentric causal attention? Does the distribution differ from those of the raster scan and reverse raster scan?

  • The paper compares against raster scan as the baseline. But how do you justify that the concentric positional assignment in Figure 3 is the best solution (or good enough)? Did you consider other scan designs, for example diagonal or zig-zag scans?

Limitations

Yes, the authors adequately addressed the limitations and potential negative societal impact of their work.

Author Response

Thank you for your valuable insights. Please see our responses to your questions below.

W1: Alternative scanning method.

A: We justify our design with new comparative studies of different position encoding schemes and alternative scanning methods. We first compare CCA with learnable position encoding. Due to resource limitations, we train all positional alignment approaches on 558k pre-training data and only 20k instruction data, including a new CCA model under the same setup for fair comparison. We evaluate on the POPE and CHAIR benchmarks; results are shown in the tables below. The models with learnable position encoding perform worse than our design.

| POPE (instruction-tune 20k) | ran acc | ran f1 | pop acc | pop f1 | adv acc | adv f1 |
| --- | --- | --- | --- | --- | --- | --- |
| MSCOCO |  |  |  |  |  |  |
| learnable | 85.93 | 85.83 | 83.80 | 84.02 | 77.73 | 79.28 |
| ours | 88.60 | 88.33 | 85.00 | 85.20 | 81.50 | 82.35 |
| GQA |  |  |  |  |  |  |
| learnable | 83.33 | 83.94 | 79.67 | 81.08 | 73.23 | 76.50 |
| ours | 85.63 | 85.97 | 79.83 | 81.37 | 75.13 | 77.98 |
| A-OKVQA |  |  |  |  |  |  |
| learnable | 83.47 | 84.01 | 80.50 | 81.67 | 72.27 | 75.80 |
| ours | 87.77 | 87.93 | 80.73 | 82.23 | 73.87 | 77.33 |

We also compare alternative scanning designs. We reverse the scanning order of the original CCA so that it starts from visual tokens at the center of the image and ends at the periphery, as illustrated in Figure 2 (right) of the uploaded pdf. As suggested in your question 3, we also implemented a diagonal scan; evaluation results for the new scanning designs are given in the table below. Our original CCA scanning method shows better overall performance than the other design choices. Based on these ablation studies, CCA is chosen as our final method for hallucination mitigation.

| POPE | ran acc | ran f1 | pop acc | pop f1 | adv acc | adv f1 |
| --- | --- | --- | --- | --- | --- | --- |
| MSCOCO |  |  |  |  |  |  |
| CCA-r (new) | 87.43 | 85.90 | 86.33 | 84.85 | 85.17 | 83.77 |
| diagonal-lora (new) | 88.10 | 86.72 | 87.17 | 85.83 | 85.70 | 84.46 |
| CCA | 88.03 | 86.65 | 86.87 | 85.54 | 85.67 | 84.42 |
| CCA-lora (new) | 88.03 | 86.68 | 87.13 | 85.82 | 85.50 | 84.30 |
| GQA |  |  |  |  |  |  |
| CCA-r (new) | 88.63 | 87.99 | 83.43 | 83.41 | 81.83 | 82.09 |
| diagonal-lora (new) | 89.07 | 88.38 | 85.80 | 85.41 | 82.70 | 82.77 |
| CCA | 90.27 | 89.71 | 88.40 | 87.98 | 82.30 | 82.74 |
| CCA-lora (new) | 89.30 | 88.70 | 85.40 | 85.22 | 82.47 | 82.77 |
| A-OKVQA |  |  |  |  |  |  |
| CCA-r (new) | 89.70 | 89.12 | 86.90 | 86.56 | 81.20 | 81.78 |
| diagonal-lora (new) | 90.03 | 89.48 | 87.93 | 87.53 | 82.10 | 82.55 |
| CCA | 88.40 | 87.68 | 86.47 | 85.91 | 82.20 | 82.37 |
| CCA-lora (new) | 90.33 | 89.88 | 87.83 | 87.59 | 82.13 | 82.70 |

| CHAIR | 512: c_s (↓) | 512: c_i (↓) | 512: rec | 512: len | 64: c_s (↓) | 64: c_i (↓) | 64: rec | 64: len |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| diagonal-lora (new) | 49.0 | 13.6 | 79.6 | 93.1 | 15.6 | 5.2 | 64.9 | 52.6 |
| CCA-r (new) | 50.0 | 18.1 | 85.7 | 96.4 | 18.0 | 5.5 | 66.1 | 54.7 |
| CCA | 43.0 | 11.5 | 80.4 | 96.6 | 18.2 | 5.4 | 66.7 | 54.5 |
| CCA-lora (new) | 45.0 | 12.4 | 80.8 | 93.3 | 17.2 | 5.2 | 65.4 | 52.7 |

For the presentation of concentric causal masking, please find new illustrations in Figure 1 of the uploaded pdf.

W2: Performance on CHAIR.

A: Thanks for pointing this out. We note that our trained model achieves the best results (lowest CHAIR scores) when applying CCA alone, as shown in the table below (summarized from manuscript Table 2). The LLaVA-RLHF model used for benchmarking relies on the stronger Vicuna-13B language backbone and involves an additional direct preference optimization stage in its training, whereas our model undergoes only two training stages (pre-training and supervised fine-tuning). Despite the smaller 7B language backbone and less training compute, our model still outperforms the LLaVA-RLHF-13B model on most metrics.

| CHAIR | Decoding | 512: c_s (↓) | 512: c_i (↓) | 512: rec | 512: len | 64: c_s (↓) | 64: c_i (↓) | 64: rec | 64: len |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-RLHF-13B-v1.5-336 | greedy | 43.6 | 10.5 | 78.0 | 117.9 | 19.6 | 5.4 | 64.9 | 54.0 |
| OPERA | beam | 46.8 | 13.4 | 79.6 | 93.2 | 17.8 | 5.9 | 64.3 | 53.0 |
| ours | greedy | 43.0 | 11.5 | 80.4 | 96.6 | 18.2 | 5.4 | 66.7 | 54.5 |
| ours | beam | 48.6 | 13.4 | 79.9 | 94.2 | 16.0 | 5.3 | 64.8 | 52.7 |

We also point out that our method benefits POPE as well (manuscript Table 1), indicating good compatibility with both open-ended generation and yes-no tasks. Moreover, we highlight that our method also benefits general perception tasks, where approaches that exclusively address object hallucination cannot always bring performance gains. Please refer to Table 5 in Appendix C.2, where our trained model LLaVA-1.5-7B-CCA consistently surpasses LLaVA-1.5-7B over multiple LVLM benchmarks.

Q1: Clarification on concentric causal masking.

A: In Figure 3 (d), we use different colours to highlight visual tokens with different positions. For a 2-D organization of visual tokens with shape 6x6, our CCA yields 3 distinct positions among the visual tokens. Please refer to our new illustrations in Figure 1 of the uploaded pdf, where query tokens, key tokens, and masked tokens (tokens not involved in the self-attention computation) are highlighted. Our CCA follows the same causal modelling rule as LLaMA (Figure 1), where query tokens with larger position values attend to key tokens with smaller or equal position values. The difference is that we use a 2-D positional organization for visual tokens, which is a novel and effective attempt among existing LVLM hallucination studies.
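
To illustrate this rule, here is a minimal sketch (our own illustration under stated assumptions, not the authors' code) that builds concentric position ids for a 6x6 grid of visual tokens and derives the attend/mask pattern; text tokens are omitted, and `concentric_positions` is a hypothetical helper:

```python
# A minimal sketch (assumptions, not the authors' code): concentric
# position ids for a 6x6 visual-token grid and the resulting
# attend/mask pattern. Ring index = distance from the border, so the
# periphery gets position 0 and the centre the largest position.
import numpy as np

def concentric_positions(h, w):
    """Ring index of each cell: 0 on the border, increasing toward the centre."""
    rows, cols = np.indices((h, w))
    return np.minimum.reduce([rows, cols, h - 1 - rows, w - 1 - cols])

pos = concentric_positions(6, 6)        # values in {0, 1, 2} -> 3 positions
flat = pos.flatten()
# Causal rule from the response: a query attends to keys whose ring
# position is smaller than or equal to its own.
mask = flat[:, None] >= flat[None, :]
print(pos)
print(f"{int(mask.sum())} of {mask.size} query-key pairs attended")
```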

Q2: Aggregated correct responses of concentric causal attention.

A: Thanks for mentioning this. We visualize the aggregated correct responses with the proposed concentric causal attention in Figure 5 of the uploaded pdf. The resulting distribution differs substantially from those of the raster scan in Figure 2.a and the reverse raster scan in Figure 2.b of our manuscript, showing a 2-D, symmetrical distribution that aligns with our concentric causal design.

Q3: Alternative position encodings and scanning.

A: Please refer to our response in W1.

Comment

I appreciate the authors' additional experimental results and clarifications. However, there appears to be inconsistent performance across datasets, such as diagonal-lora outperforming the proposed CCA/CCA-lora on MSCOCO and on the adv split of GQA. The authors have suggested that the performance drop of CCA-r supports the assumption that most image content is concentrated in central regions, aligning with their proposed design. Given this, how should we interpret the better performance of the diagonal approach? A more in-depth discussion of these inconsistencies would provide greater insight into the effectiveness and limitations of the proposed CCA strategy. Nonetheless, the authors have provided in-depth answers to many of the questions asked; I have raised my score to 5.

Comment

Thank you for recognising the value of our rebuttal and raising your score. We appreciate your effort in reading through our reply and pointing out your new concern. We will continue to explore alternative position encoding and scanning methods as suggested to improve our study.

Review (Rating: 6)

This paper analyzes the long-term dependency between text tokens and visual tokens in LVLMs from a novel positional encoding perspective by ablating the RoPE method. The analysis shows that RoPE introduces clear long-term decay in the attention scores. The authors propose a novel concentric causal attention (CCA) scheme, comprising visual token re-organization and concentric causal masking, that preserves 2-D spatial locality while shortening the visual-instruction distance. Experimental results show that CCA reduces hallucination on both discriminative and generative benchmarks while keeping comparable performance on general benchmarks.

Strengths

  • The motivation of exploring source of hallucination from a long-term decay perspective is clear.
  • The paper is well organized and clear.
  • The proposed CCA method seems effective and easy-to-implement.

Weaknesses

  • The experimental setup lacks rigor. The VCD results are sourced from the original paper, which is based on LLaVA 1.5. However, the baseline provided by the authors utilizes the pre-training scheme of LLaVA 1.0 along with the projection module and instruction fine-tuning data from LLaVA 1.5. This discrepancy raises concerns about the comparability of the results. Additionally, it is unclear why Table 3 does not include results for OPERA and why Table 4 omits the VCD results.

Questions

  • In Figure 1 (b), would the removal of the RoPE positional encoding make the self-attention process out-of-distribution, since the model is pre-trained with RoPE applied? How would this affect the quantitative results?
  • The explanation of concentric causal masking in Figure 3 and the corresponding paragraph is somewhat unclear. Could you clarify why the attention masks for the first seven rows are identical? A more comprehensive and detailed introduction would help a lot.

Limitations

The authors adequately addressed the limitations.

Author Response

Thanks for your detailed and insightful suggestions. Please find our responses as follows.

W1-a: Pretraining setup.

A: Thanks for raising this concern. There is a typo in line 227, where we claim to use the CC-595K dataset [42] for the pre-training stage. In fact, our pre-training experiments follow LLaVA 1.5 [41] and use a 558K dataset for pre-training. The baseline results provided in Table 1 of our manuscript are for LLaVA 1.5 and are sourced from the VCD paper [30]. We will release our model and source code for the community to reproduce our results.

W1-b: Manuscript Table 3 and Table 4.

A: Thanks for pointing this out. Please find new quantitative results for Tables 3 and 4 below, obtained using the official code of each method.

| MME | Existence | Count | Position | Color | Total |
| --- | --- | --- | --- | --- | --- |
| baseline | 175.67 | 124.67 | 114.00 | 151.00 | 565.33 |
| OPERA (new) | 180.67 | 133.33 | 123.33 | 155.00 | 592.33 |
| VCD | 184.66 | 138.33 | 128.67 | 153.00 | 604.66 |
| ours | 190.00 | 148.33 | 128.33 | 175.00 | 641.66 |

| LLaVA-Bench | Complex | Detail | Conv |
| --- | --- | --- | --- |
| baseline | 65.8 | 51.2 | 54.6 |
| OPERA | 66.4 | 56.9 | 44.0 |
| VCD (new result) | 69.6 | 51.8 | 57.3 |
| ours | 66.1 | 53.9 | 69.4 |

Q1: Removal of RoPE from LLaVA.

A: Yes, removing RoPE diverges substantially from LLaMA pre-training and leads to nonsense outputs. Our earlier studies showed that LVLMs without RoPE no longer follow instructions. Take POPE [37] questions as an example: the LLaVA-v1.5-7B pretrained without RoPE fails to answer yes or no, as illustrated below.

USER: Is there a scissors in the image?

LLaVA-1.5-7B: No.

LLaVA-1.5-7B w/o RoPE: the bear.\n\n\n\n\n\n\n\n

Since LLaVA-1.5-7B w/o RoPE outputs neither yes nor no, its quantitative accuracy on POPE would be 0.00. Although the model w/o RoPE in Figure 1.b of our manuscript generates nonsense text outputs, the information flow from visual to text tokens is more evenly distributed. This highlights the long-term decay of RoPE shown in Figure 1.c, the root cause of information aggregating at image tokens closer to the text tokens.

Q2: Clarifications on concentric causal masking.

A: Thanks for pointing this out. Please find a new illustration in Figure 1 of the uploaded pdf that clarifies the proposed concentric causal masking, where query tokens, key tokens attended by query tokens, and key tokens not attended by query tokens (tokens not involved in the self-attention computation) are colored. Consistent with our manuscript Figure 3, we take a 6x6 visual token organization as an example. The design follows the same causal modeling rule as LLaVA (presented in Figure 1 (left) of the uploaded pdf), where query tokens with larger position values attend to key tokens with smaller or equal position values. The first seven rows of manuscript Figure 3 (d) share the same attention masks, corresponding to Figure 1 (right) (a) to (g) of the uploaded pdf, where the position indices of the key tokens attended by the query tokens are exactly the same.

Comment

The responses address my concerns and the provided illustration figures are great. I am curious about the attention distribution of the model trained with CCA and am considering raising my score.

Comment

Thank you for going through our rebuttal. We are glad that it addressed your concerns. For your follow-up question, we have prepared a new attention distribution visualisation. Unfortunately, we are not allowed to provide the image through any link during the author-reviewer discussion phase, according to the author guidelines. Instead, we present it here in tabular format for your reference.

The table below is 24x24, showing the distribution obtained from our proposed LLaVA-1.5-7B-CCA. As presented in this table, attention values gradually increase as positions move from the periphery to the center, with the highest values found at the central positions. They also show a 2-D concentric distribution, with each ring having similar values. This clearly aligns with our CCA design. We will include this visualisation in the revision of our manuscript to better support our approach.

0.02 0.07 0.29 0.13 0.20 0.50 0.57 0.56 0.13 0.18 0.34 0.33 0.34 0.41 0.24 0.14 0.44 0.16 0.05 0.14 0.13 0.16 0.12 0.07
0.07 0.07 0.18 0.18 0.20 0.43 0.37 0.57 0.26 0.24 0.41 0.21 0.20 0.21 0.30 0.19 0.18 0.14 0.21 0.16 0.16 0.18 0.23 0.00
0.06 0.16 0.23 0.21 0.29 0.25 0.34 0.25 0.34 0.25 0.25 0.24 0.24 0.27 0.31 0.24 0.21 0.18 0.19 0.18 0.18 0.21 0.18 0.13
0.15 0.15 0.19 0.25 0.26 0.28 0.35 0.27 0.29 0.28 0.30 0.27 0.30 0.31 0.31 0.27 0.26 0.28 0.25 0.23 0.22 0.22 0.18 0.19
0.11 0.14 0.18 0.22 0.28 0.28 0.31 0.34 0.31 0.31 0.31 0.33 0.32 0.34 0.31 0.31 0.31 0.30 0.28 0.27 0.22 0.20 0.16 0.11
0.05 0.11 0.16 0.21 0.27 0.34 0.35 0.35 0.38 0.36 0.38 0.37 0.38 0.40 0.39 0.38 0.37 0.36 0.32 0.27 0.22 0.19 0.16 0.14
0.09 0.11 0.22 0.24 0.27 0.32 0.40 0.42 0.43 0.45 0.44 0.46 0.46 0.48 0.45 0.45 0.47 0.43 0.37 0.30 0.26 0.21 0.21 0.15
0.06 0.09 0.21 0.27 0.29 0.34 0.41 0.49 0.49 0.48 0.51 0.51 0.52 0.51 0.53 0.51 0.51 0.43 0.39 0.32 0.29 0.24 0.19 0.15
0.15 0.11 0.22 0.26 0.31 0.38 0.44 0.51 0.57 0.59 0.60 0.58 0.62 0.62 0.62 0.63 0.55 0.49 0.42 0.33 0.29 0.24 0.20 0.15
0.07 0.22 0.23 0.28 0.33 0.40 0.45 0.50 0.61 0.67 0.72 0.70 0.68 0.72 0.74 0.63 0.56 0.48 0.41 0.36 0.29 0.25 0.20 0.20
0.09 0.18 0.27 0.31 0.36 0.39 0.45 0.54 0.61 0.72 0.83 0.85 0.82 0.84 0.73 0.66 0.57 0.49 0.43 0.38 0.32 0.28 0.23 0.19
0.11 0.19 0.23 0.31 0.37 0.42 0.47 0.56 0.64 0.73 0.84 0.98 1.00 0.85 0.76 0.67 0.59 0.54 0.44 0.38 0.35 0.30 0.23 0.23
0.18 0.21 0.29 0.32 0.37 0.42 0.50 0.56 0.64 0.75 0.88 1.00 1.00 0.86 0.77 0.67 0.61 0.55 0.46 0.41 0.38 0.32 0.24 0.24
0.18 0.22 0.26 0.31 0.37 0.46 0.52 0.57 0.66 0.76 0.88 0.88 0.90 0.89 0.78 0.69 0.61 0.55 0.48 0.42 0.36 0.32 0.27 0.20
0.16 0.20 0.24 0.30 0.36 0.44 0.51 0.57 0.66 0.76 0.76 0.78 0.77 0.79 0.77 0.70 0.60 0.53 0.47 0.41 0.35 0.30 0.24 0.26
0.15 0.20 0.28 0.31 0.37 0.41 0.52 0.58 0.67 0.68 0.65 0.68 0.68 0.71 0.68 0.71 0.61 0.52 0.50 0.39 0.35 0.29 0.27 0.18
0.25 0.23 0.28 0.32 0.37 0.42 0.53 0.60 0.62 0.61 0.61 0.61 0.63 0.63 0.62 0.62 0.63 0.54 0.46 0.38 0.34 0.27 0.26 0.22
0.11 0.25 0.26 0.35 0.39 0.45 0.51 0.54 0.54 0.56 0.57 0.57 0.56 0.57 0.57 0.57 0.55 0.52 0.43 0.39 0.30 0.25 0.23 0.19
0.13 0.18 0.25 0.31 0.38 0.44 0.47 0.50 0.48 0.49 0.49 0.48 0.49 0.50 0.49 0.47 0.46 0.43 0.42 0.35 0.34 0.24 0.24 0.19
0.13 0.18 0.24 0.26 0.38 0.41 0.43 0.44 0.40 0.40 0.41 0.41 0.39 0.40 0.41 0.39 0.39 0.36 0.34 0.33 0.27 0.23 0.19 0.25
0.07 0.22 0.21 0.26 0.34 0.41 0.47 0.46 0.45 0.43 0.46 0.33 0.32 0.33 0.37 0.31 0.31 0.31 0.32 0.25 0.24 0.23 0.28 0.17
0.07 0.30 0.26 0.27 0.34 0.41 0.30 0.46 0.27 0.47 0.57 0.52 0.27 0.30 0.54 0.27 0.29 0.41 0.22 0.24 0.23 0.26 0.30 0.16
0.05 0.12 0.20 0.19 0.23 0.38 0.28 0.46 0.55 0.35 0.37 0.26 0.23 0.32 0.28 0.20 0.20 0.18 0.14 0.18 0.17 0.18 0.11 0.04
0.02 0.11 0.10 0.07 0.13 0.13 0.20 0.59 0.16 0.19 0.17 0.17 0.15 0.18 0.23 0.16 0.13 0.15 0.12 0.13 0.06 0.12 0.16 0.11

Comment

What is the query for this attention map result? If I understood correctly, the results for text tokens and image tokens should be systematically different.

Comment

Thank you for your prompt reply. Similar to Figure 1 (b) and (c) in our manuscript, we apply the same experimental setting to obtain these values, except that we use our trained LLaVA-1.5-7B-CCA model. The text-to-vision information flows are computed only for image tokens (24x24, 576 in total), so the queries and keys for this result are text tokens and image tokens, respectively. The result is structured in 2-D format for demonstration purposes and does not include self-attention among text tokens or among image tokens. Please refer to Appendix B.1, where we elaborate on how we obtained these results. We are happy to discuss further if more clarification is needed.
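
A hypothetical sketch of how such a text-to-vision aggregation map could be computed (the array name `attn`, the offset `img_start`, and the helper are illustrative assumptions; Appendix B.1 describes the authors' actual procedure):

```python
# A hypothetical sketch (illustrative names, not the authors' code):
# average text-query attention over the 576 image-token keys and
# reshape into a 24x24 spatial map. `attn` is assumed to have shape
# [n_text_queries, n_keys], with image keys at img_start..img_start+575.
import numpy as np

def text_to_vision_map(attn, img_start, grid=24):
    img_attn = attn[:, img_start:img_start + grid * grid]  # text -> image keys only
    agg = img_attn.mean(axis=0)                            # average over text queries
    return agg.reshape(grid, grid)

# usage with random stand-in data: 32 text queries, 640 keys in total
rng = np.random.default_rng(0)
amap = text_to_vision_map(rng.random((32, 640)), img_start=35)
print(amap.shape)  # (24, 24)
```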

Comment

I raise my score to 6.

Review (Rating: 6)

This paper explores how hallucination arises in current LVLMs by analyzing the impact of RoPE long-term decay on the attenuation of visual information flow. It gives clear visualization results and theoretical evidence for its central point, revealing that the causal attention mask and RoPE embedding are not appropriate for non-text-modal input (e.g., vision tokens). Based on these findings, the authors argue that LVLM hallucinations are mainly attributable to RoPE long-term decay and the mismatch between vision tokens and causal attention. To this end, the authors propose vision token re-organization and concentric causal attention masking to alleviate hallucination. Experiments demonstrate the promising performance of the proposed methods.

Strengths

  1. This paper gives a good explanation for the relationship between hallucination and long-term decay in LVLMs.
  2. The proposed method is well motivated and technically sound.

Weaknesses

  1. When introducing the long-term decay and the attenuation of information flow, it would be better to cite references such as OPERA (among the early works that claim a relationship between hallucination and long-term decay).
  2. One suggestion for the experiment in Figure 2: although the authors calculate and visualize the results on thousands of samples, it would be better to reverse the order of the vision patches (before CLIP-ViT) and run the same experiment again. We should rule out the possibility that most of the correct answers are naturally located in the lower region of images.
  3. The proposed method relies on the hypothesis that most of the main content of an image is located in the central region. This generally makes sense, but is not enough to be an accurate solution.
  4. The experiments are somewhat insufficient. It would be better to add ablation studies and evaluation on LVLM benchmarks; the MME hallucination split and LLaVA-Bench are not enough. Results on MMBench, SEED-Bench, TextVQA, etc. would strengthen the paper.

Questions

See the Weaknesses.

Limitations

N/A

Author Response

Thanks for your detailed and thorough suggestions. Please find our replies as below.

W1: CCA and OPERA [23].

A: We ground our design in an analysis of information flow in the LLaVA model. This shares commonalities with OPERA, which analyzes information flow in LVLM autoregressive decoding. Thank you for pointing this out; we will highlight it in the introduction of our manuscript as suggested. Different from OPERA, which discovers co-occurrences of the aggregation pattern and object hallucination, our CCA explores relations between Rotary Position Encoding and object hallucination. We further refer to the quantitative results in Tables 2 and 4 of our manuscript, where the proposed CCA surpasses OPERA in the CHAIR and LLaVA-Bench evaluations.

W2: Figure 2 experiments.

A: It is a valid concern to rule out the impact of imbalanced object distribution across image regions. We already addressed this in the paper (lines 167-169) by cropping objects and pasting them onto blank images (initialized with ImageNet mean pixel values) at different spatial positions to create synthesized images. Please find a new illustration of this in Figure 4 of the uploaded pdf, where 16 (4 by 4) pasting options are shown; in our experiments we use 144 (12 by 12) pasting options. Figure 2 of our manuscript is obtained by testing models on these synthesized images, which have an even distribution of objects across regions.
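
A sketch of what such a synthesis loop might look like (the canvas size, crop size, and helper name are assumptions, not the authors' script):

```python
# An illustrative sketch (assumed details, not the authors' script) of
# the synthesized-image protocol: paste a cropped object onto a blank
# canvas initialized with ImageNet mean pixel values, once per cell of
# a regular grid of pasting options.
import numpy as np

IMAGENET_MEAN = np.array([123.675, 116.28, 103.53], dtype=np.uint8)  # RGB

def paste_on_grid(obj_crop, canvas_hw=(336, 336), grid=12):
    """Yield one synthesized image per grid cell, with the object pasted
    at that cell's top-left corner."""
    H, W = canvas_hw
    h, w = obj_crop.shape[:2]
    ys = np.linspace(0, H - h, grid).astype(int)
    xs = np.linspace(0, W - w, grid).astype(int)
    for y in ys:
        for x in xs:
            canvas = np.tile(IMAGENET_MEAN, (H, W, 1)).astype(np.uint8)
            canvas[y:y + h, x:x + w] = obj_crop
            yield canvas

crop = np.zeros((48, 48, 3), dtype=np.uint8)   # stand-in object crop
images = list(paste_on_grid(crop))             # 144 synthesized images
print(len(images))
```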

W3: Data statistics and our concentric design.

A: Yes, our concentric design assumes that most image contents are located around the centric image region. We validate this assumption from two perspectives.

Firstly, we perform a statistical analysis on a large number of natural images (82,081 images and 604,907 annotations from COCO train 2014, and 10,696 images and 174,304 annotations from GQA). Specifically, we count the total number of objects in 9 spatial locations (top_left, top_mid, top_right, mid_left, center, mid_right, bottom_left, bottom_mid, bottom_right). The statistical results in Figure 3 of the uploaded pdf show that more objects are located at the centre, which aligns with our design.
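
A minimal sketch of such a 3x3 binning count (the `bbox` field follows the COCO annotation format; the merged-in `width`/`height` fields and the helper itself are hypothetical):

```python
# A minimal sketch (hypothetical helper): bin each annotation's box
# centre into one of 9 regions (top_left ... bottom_right) relative to
# its image size. COCO bboxes are [x, y, w, h]; the owning image's
# 'width'/'height' are assumed merged into each annotation dict.
def location_counts(annotations):
    names = [["top_left", "top_mid", "top_right"],
             ["mid_left", "center", "mid_right"],
             ["bottom_left", "bottom_mid", "bottom_right"]]
    counts = {n: 0 for row in names for n in row}
    for a in annotations:
        x, y, w, h = a["bbox"]
        cx, cy = (x + w / 2) / a["width"], (y + h / 2) / a["height"]
        col = min(int(cx * 3), 2)   # clamp centres on the right/bottom edge
        row = min(int(cy * 3), 2)
        counts[names[row][col]] += 1
    return counts

# usage with one toy annotation whose centre falls in the middle cell
print(location_counts([{"bbox": [100, 100, 80, 80], "width": 300, "height": 300}]))
```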

Secondly, we point out that for our model, positions start from the periphery of the 2-D visual tokens and end at the centre. For comparison, we implement another LVLM (which we name CCA-r), where positions start from the centre of the 2-D visual tokens and end at the periphery. Please find an illustration of CCA-r in Figure 2 of the uploaded pdf, where the left panel shows our CCA method and the right panel CCA-r. Query tokens and the key tokens used in self-attention calculations are highlighted in color.

The tables below provide quantitative experiments on POPE and CHAIR, showing that changing from CCA to CCA-r positional alignment causes a performance drop in most evaluations. For the GQA popular evaluation, accuracy drops from 88.40 to 83.43. For the CHAIR evaluation, c_s worsens from 43.0 to 50.0. These results support the assumption that most image contents are located around central image regions, which aligns with our design.

| POPE | ran acc | ran f1 | pop acc | pop f1 | adv acc | adv f1 |
| --- | --- | --- | --- | --- | --- | --- |
| MSCOCO |  |  |  |  |  |  |
| CCA-r (new) | 87.43 | 85.90 | 86.33 | 84.85 | 85.17 | 83.77 |
| ours | 88.03 | 86.65 | 86.87 | 85.54 | 85.67 | 84.42 |
| GQA |  |  |  |  |  |  |
| CCA-r (new) | 88.63 | 87.99 | 83.43 | 83.41 | 81.83 | 82.09 |
| ours | 90.27 | 89.71 | 88.40 | 87.98 | 82.30 | 82.74 |
| A-OKVQA |  |  |  |  |  |  |
| CCA-r (new) | 89.70 | 89.12 | 86.90 | 86.56 | 81.20 | 81.78 |
| ours | 88.40 | 87.68 | 86.47 | 85.91 | 82.20 | 82.37 |

| CHAIR | 512: c_s (↓) | 512: c_i (↓) | 512: rec | 512: len | 64: c_s (↓) | 64: c_i (↓) | 64: rec | 64: len |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CCA-r (new) | 50.0 | 18.1 | 85.7 | 96.4 | 18.0 | 5.5 | 66.1 | 54.7 |
| CCA | 43.0 | 11.5 | 80.4 | 96.6 | 18.2 | 5.4 | 66.7 | 54.5 |

W4: More evaluations on LVLM benchmarks.

A: Please refer to Table 5 in Appendix C.2, where we include more evaluations on LVLM benchmarks. As suggested, we also add TextVQA [A] for your reference. We further compare general perception capabilities on these benchmarks against two hallucination-mitigating methods [30, B], where SeVa [B] is a recent method exploring unsupervised preference alignment in LVLMs. We point out that SeVa trains its models with an additional Direct-Preference-Optimization stage on top of LLaVA-1.5 [41], whereas our CCA does not involve any new training stages and strictly follows the LLaVA-1.5 training scheme. Our model LLaVA-1.5-7B-CCA outperforms SeVa-7B and VCD on most LVLM benchmarks.

| Model | SEED-all | SEED-img | SEED-vid | SQA | GQA | VizWiz | MMBench | MMStar | TextVQA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llava-1.5-7b | 58.6 | 66.1 | 37.3 | 66.8 | 62.0 | 50.0 | 64.3 | 30.0 | 58.2 |
| vcd-llava-1.5-7b (new) | 58.3 | 63.7 | 37.6 | 68.5 | 61.9 | 50.5 | - | 34.6 | 54.4 |
| seva-7b-diffu500 (new) | - | 65.8 | - | 67.4 | 61.1 | - | 64.7 | - | - |
| seva-7b-diffu800 (new) | - | 65.8 | - | 67.5 | 60.7 | - | 65.6 | - | - |
| seva-7b-moco (new) | - | 65.5 | - | 67.1 | 60.9 | - | 65.2 | - | - |
| llava-1.5-7b-cca (ours) | 61.7 | 67.1 | 41.0 | 69.8 | 63.1 | 57.6 | 65.4 | 33.2 | 57.8 |

Comment

Thanks for the reply. It addresses most of my concerns. I hope the authors will add these results in the revision.

Comment

Thank you for your thorough and insightful review of our paper. We are happy that our rebuttal has addressed most of your concerns, and we will include the rebuttal experiments in our revision as suggested. If any concerns remain, we would be glad to discuss further; if not, we would appreciate it if you could raise your score.

Author Response

We sincerely appreciate reviewers 9xbc and 1NX2 for acknowledging the clear motivation behind our work, and reviewers 5Msc and 5R3r for recognizing the novelty of our study, along with their thoughtful suggestions for improving our paper. Please find the new figures in the attached pdf; new figures in our rebuttal text are referenced as Figure x. We also include new references here for all reviewers.

[A] Towards VQA Models That Can Read.

[B] Self-Supervised Visual Preference Alignment.

[C] Self-Attention with Relative Position Representations.

Final Decision

Summary: This paper investigates how hallucinations in large vision-language models (LVLMs) are influenced by the RoPE long-term decay and its effect on vision information attenuation. The authors provide both visualizations and theoretical evidence showing that causal attention masks and RoPE embeddings are not well-suited for non-textual inputs like vision tokens. To address these issues, the paper proposes vision token re-organization and concentric causal attention masking, demonstrating improved performance in experiments.

Strengths: The paper offers a strong explanation of the relationship between hallucinations and long-term decay in LVLMs, supported by both theoretical and empirical evidence.

All reviewers agree to accept the submission.