PaperHub
6.1 / 10
Poster · 4 reviewers (scores 3, 4, 3, 3; min 3, max 4, std 0.4)
ICML 2025

Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

We find that the attention pattern in VLMs can be adjusted adaptively to modify the focus area without requiring additional training.

Abstract

Keywords
Interpretability · Attention mechanism · VLM · Spatial Reasoning

Reviews and Discussion

Review
Rating: 3

The authors explore how vision-language models struggle with spatial reasoning, focusing on how misdirected attention (i.e., attention to irrelevant parts of the image) within transformer blocks contributes to this behavior. They analyze attention patterns and report that attention prioritizes text tokens over image tokens. They also note that attention to the wrong parts of an image is an issue and that the model's logit probability can serve as a proxy for its confidence. Using these ideas, they propose an approach to adjust attention based on confidence: sharpening or smoothing the attention weights of image tokens according to the model's per-sample confidence. They evaluate their method on two datasets, WhatsUp and VSR, which contain mixtures of natural and synthetic images with mostly synthetic captions. The results show the promise of the proposed method.
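To make the mechanism under review concrete, here is a minimal sketch of confidence-conditioned temperature scaling of image-token attention logits. The function and hyperparameter names (`alpha_sharpen`, `alpha_smooth`, `beta`) are illustrative assumptions, not the authors' implementation.

```python
import torch

def rescale_image_attention(attn_logits, image_mask, confidence,
                            alpha_sharpen=1.5, alpha_smooth=0.5, beta=0.6):
    """Confidence-conditioned temperature scaling of image-token attention (sketch).

    attn_logits: (seq_len,) pre-softmax attention scores from one query position
    image_mask:  (seq_len,) bool tensor marking image tokens
    confidence:  probability of the generated answer token (the confidence proxy)
    """
    # High confidence -> sharpen attention over image tokens; low -> smooth it.
    alpha = alpha_sharpen if confidence >= beta else alpha_smooth
    scaled = attn_logits.clone()
    scaled[image_mask] = scaled[image_mask] * alpha  # temperature 1/alpha on image logits
    return torch.softmax(scaled, dim=-1)             # re-normalize over all tokens
```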

Questions for the Authors

N/A

Claims and Evidence

No.

  1. They evaluate on two small datasets (that contain synthetic data). Results on these datasets are insufficient to back their claims.
  2. They explore two outdated VLMs to highlight spatial reasoning drawbacks. The analysis on more modern VLMs would make the claims stronger. Most newer VLMs show stronger spatial reasoning abilities simply by training on more data and augmenting their training data / tasks.

Methods and Evaluation Criteria

No, see above.

Theoretical Claims

N/A

Experimental Design and Analysis

Yes, the experiments on 2 mostly synthetic datasets are insufficient. The focus on only older VLMs questions the applicability of the method to newer VLMs.

Supplementary Material

Yes, all of it.

Relation to Prior Literature

Overlap with several existing prior works.

Essential References Not Discussed

Consider comparing with the baselines from these prior works, and discussing these methods more explicitly in the related work to highlight how the authors' method differs.

[1] What’s “up” with vision-language models? Investigating their struggle with spatial reasoning, EMNLP 2023

[2] Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs, CVPR 2024

[3] Liu, Fangyu et al. “Visual Spatial Reasoning.” Transactions of the Association for Computational Linguistics 11 (2022): 635-651.

[4] Hsu, Joy et al. “What's Left? Concept Grounding with Logic-Enhanced Foundation Models.” NeurIPS 2023.

Other Strengths and Weaknesses

Strengths

  1. The authors' innovative approach of adjusting attention at test-time for individual samples, using the model's output probability as a confidence proxy, is both clever and compelling.​
  2. Despite the datasets being small and primarily synthetic, the observed performance enhancements are promising.​
  3. The concept that generation confidence serves as a reliable indicator of attention accuracy is intriguing and opens new avenues for exploration in the vision-language domain.

Weaknesses

  1. Insufficient experimentation:

    • Results only on two small, primarily synthetic datasets
    • Results only on two older variants of the same VLM. This raises concerns about applicability to real-world datasets and newer, stronger VLMs.
  2. Results difficult to compare:

    • The authors do not compare to any prior work in their paper (i.e. no results from existing works are used as baselines)
    • Even for the baselines they provide, the authors use different dataset splits, making it difficult to directly compare with the numbers in the original papers.
  3. Does not use established benchmarks

    • See Table 5 in [1], where results on standard VQA datasets focused on spatial reasoning are provided (including numbers directly from prior works, making the results easy to compare and validate). Please include at least one such results table.
    • There is one table, Table 6 (buried in the appendix), that reports on some such datasets, but there are no details on the baseline reported in it and no discussion of the results in that table.
  4. Unclear visualizations in Figure 9 & 10

    • The authors show pairs of images, but there is no apparent improvement visible in them.
    • Consider explicitly highlighting (on the image) what changes the authors are talking about.
    • Consider making the visualizations overlays (the original images seem to be shaded, making it difficult to see color details in them)
  5. Consider backing "attention modification" claims with some numbers

    • In the introduction, the authors claim that "by dynamically adjusting attention patterns, we effectively guide the model’s focus to better align with actual object locations and their spatial relationships." It is simple to test this on some object detection dataset: calculate attention mIoU with object regions against a baseline. This will reveal whether the authors' method focuses attention more on object regions. This is a common form of evaluation in prior works.

[1] Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs, CVPR 2024

Other Comments or Suggestions

N/A

Author Response

We thank Reviewer 3qjz for the comments. However, we are concerned that some comments suggest a disconnect or oversight of our paper. We respond to each point below, and respectfully encourage the reviewer to revisit our work.

Essential References Not Discussed

Among the four references mentioned, [1] and [3] are precisely the benchmarks we work on, and our experiments are built directly on these two papers, with the dataset names WhatsUp and VSR cited throughout the paper. Both are benchmark papers that present datasets and baselines such as prompting, without proposing novel methods or addressing mechanistic interpretability (MI).

[2] and [4] are (1) about LLMs rather than VLMs, whereas our title states that we work on VLMs, and (2) not about MI, whereas we highlight in the abstract (Lines 18-20), introduction (Lines 43-45), and conclusion (Lines 422-423) that our paper opens up VLMs and observes their internal behavior to understand their failures in spatial reasoning.

We believe the reviewer has misunderstood the paper: we are NOT proposing a new model that leverages object grounding to beat SOTA models, but rather conducting MI for VLMs by analyzing attention patterns and applying targeted interventions to understand their behavior. The suggestion to compare with other object-grounding methods reflects this misunderstanding, as detailed in W2.

(MI is a well-established area to reverse-engineer complex systems, offering a more granular and causal understanding of how they process information. [5,6,7])

[5] A Survey on Mechanistic Interpretability for Multi-Modal Foundation Models
[6] Mechanistic Interpretability for AI Safety: A Review
[7] awesome-mechanistic-interpretability-lm-papers in Github

W1:

datasets being small and primarily synthetic

  • The benchmarks are not primarily synthetic: 85% of images are natural, and all captions are human annotated (reformatted to generative QA settings better suited for SOTA decoder-only VLM architectures, Lines 143-149).

| | Cont_A | Cont_B | Coco_one | Coco_two | VG_one | VG_two |
|-|:-:|:-:|:-:|:-:|:-:|:-:|
| Synthetic image | yes | yes | no | no | no | no |
| Synthetic caption | no | no | no | no | no | no |
| Samples | 406 | 408 | 1160 | 291 | 2240 | 440 |
  • MI research requires clean datasets to observe model mechanisms. Existing MI studies use smaller, clean datasets rather than larger, noisy ones [5, 6, 7]. We have made every effort to include all available datasets with clean, high-quality spatial relationship annotations.

W2:

do not compare to any prior work in their paper.

  • We respectfully disagree. With a focus on MI of VLMs, our experiments on attention intervention aim to draw conclusions about how VLMs work. Thus, the experiments are expected to compare w/ and w/o intervention, rather than to compare with other object-grounding methods.
  • We did compare our method with two prior decoding methods: DoLa [8] and VCD [9] (Line 327). Like DoLa, we leverage internal signals but tailor them to VLMs using visual features. While sharing VCD’s goal of improving visual grounding through decoding, we go further by introducing fine-grained attention interventions for more precise calibration.
  • Again, we believe this reflects a misunderstanding of the paper, especially given that the other three reviewers do not share this misunderstanding.

[8] DoLa: Decoding by Contrasting Layers Improves Factuality in LLMs.
[9] VCD: Mitigating Object Hallucinations in Large VLMs through Visual Contrastive Decoding.

W3:

Does not use established benchmarks.

  • We use the established spatial-reasoning benchmarks WhatsUp [1] and VSR [3]. As detailed in Lines 143-149, we have to reformat the datasets from classification to QA settings so that we can observe attention behavior and information flow across Transformer layers when VLMs generate correct or wrong answers; this is also the standard setting in MI studies [5,6,7] for observing attention behavior.
  • We also include general VQA benchmarks in Table 6 in the Appendix.

W4:

Unclear visualizations in Figure 9 & 10

Thanks for the helpful suggestions! We will add more annotations to these figures.

W5:

backing "attention modification" claims with some numbers

Thanks for the helpful feedback. We extend our experiments to test YOLO overlap w/ and w/o intervention in the middle layers, which our observations (Line 196) identify as key for processing spatial information. We evaluate two representative subsets: Controlled_A (synthetic images) and VG_two_obj (real images). After intervention, overlap rates improve on both: attention–YOLO similarity increases by 2.9% on Controlled_A and 8.2% on VG_two_obj at layer 16, with similar trends across other middle layers.

While this supports our claims, it is not a gold-standard metric, since the attention map is influenced by the prompt and YOLO may fail to detect objects in complex images (Figure 24). We will include additional discussion in our paper.
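For readers who want a concrete picture of this kind of overlap measurement, a minimal sketch follows. The patch-grid size and box format are assumptions, and the metric actually used in the paper may differ.

```python
import numpy as np

def attention_in_boxes(attn_map, boxes, grid=24):
    """Fraction of image-attention mass that falls inside object bounding boxes (sketch).

    attn_map: (grid, grid) array of attention over image patches, summing to 1
    boxes:    list of (x0, y0, x1, y1) boxes in normalized [0, 1] image coordinates
    """
    mask = np.zeros((grid, grid), dtype=bool)
    for x0, y0, x1, y1 in boxes:
        r0, r1 = int(y0 * grid), int(np.ceil(y1 * grid))
        c0, c1 = int(x0 * grid), int(np.ceil(x1 * grid))
        mask[r0:r1, c0:c1] = True   # mark patches covered by this box
    return float(attn_map[mask].sum())
```

Comparing this fraction with and without intervention (against detected or human-annotated boxes) is one way to quantify whether attention shifts toward object regions.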

Reviewer Comment

The authors seem to have misunderstood the review and are misrepresenting several concerns brought out in the review.

1. Essential References Not Discussed:

  To quote the original concern on [1-4] in review: "Consider comparing with the baselines from these prior works. And discussing these methods more explicitly in related work to highlight how authors method differs."

  • The authors mention that [1] and [3] "both are benchmark papers that present datasets and baselines". However, they do not report any of the baselines from [1, 3] in their paper. None of the reported evaluation results in the paper can be compared with prior work (since they use different baselines or dataset splits). This is not the case in prior MI works (e.g., DoLa has baseline numbers on common datasets that are the same as those in the prior works ITI and CD). This lack of verifiable baseline numbers, which makes results comparison difficult (Weakness 2 in the review), was a major concern that remains unaddressed.
  • The authors incorrectly claim that [2] is about LLMs: it uses the same VLM used in the authors' method and reports results on common datasets.
  • The authors' motivation is to "understand VLM failures in spatial reasoning.": Prior works [2,4] both highlight how VLMs or LLMs can actually solve spatial reasoning tasks better with simply more data (hence raising concerns about whether the identified spatial reasoning issues are limited to earlier VLMs only).

2. W1 Datasets being small and primarily synthetic

  • "human annotated captions reformatted to generative QA...": all evaluations are conducted with GPT generated captions as the prompting questions / targets? The captions / targets in the evaluation datasets are GPT generated (i.e. synthetic). This was the concern.
  • "MI research requires clean datasets...": On evaluation datasets being small, if I understand correctly, any object detection / segmentation dataset can be converted to an evaluation dataset using the GT human annotations for object locations. This appears to be what is done with COCO in [1] as well (which is a current evaluation dataset in authors' work). However, this concern is somewhat resolved with the new results with Qwen on several established benchmarks such as GQA, TextVQA, and POPE.

3. W2 Does not use established benchmarks.

  • To repeat, none of the reported evaluation results in the paper can be directly compared with prior work (since they use different baselines or dataset splits). This is not the case in prior MI works that authors cite in paper (e.g. DoLa has baseline numbers on common datasets that are the same as those in prior works ITI and CD).

4. W4 - Resolved, thanks!

5. W5 - Partially resolved

  • Why use YOLO for GT locations? Why not use the COCO based datasets that have human annotated GT object bbox / seg annotations?

6. Question on New Qwen Results

  • This is promising and gives a better point of comparison. What is the model used (i.e. 2B / 7B / 72B ...)?
  • Can you refer to prior work that has these same numbers? The reported numbers for Qwen2-VL on some benchmarks are weaker than the official numbers on the Hugging Face page. E.g., TextVQA: 79.18 (Qwen2-VL, from rebuttal), 79.26 (Qwen2-VL + authors', from rebuttal) < 79.7 (Qwen2-VL-2B, smallest model, HF number), 84.3 (Qwen2-VL-7B, next smallest model, HF number).
  • Above Qwen2-VL HF numbers from official pages on: https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct, https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct
  • "image receives much less attention than the text tokens": is the average attention logit per single text/image token a fair measure? In VQA settings with models like Qwen2, there are often significantly more image tokens than text tokens, which could make the per-token average attention on the image lower.

   

It is clear that the aim of the paper is to open up VLMs and observe their internal behavior to understand their failures in spatial reasoning. However, the doubts concern the evaluation setup used to validate the findings. Are the measurements of spatial reasoning failure valid? Have the baseline VLMs been properly set up (i.e., used optimal prompts, evaluated under standard settings)? Are these failures visible in newer / larger VLMs? Providing zero points of reference from prior work for the reported results heavily weakens this paper, raising multiple doubts about the validity of the experimental setup.

While I am familiar with the literature on spatial reasoning with LLMs, VLMs, and their evaluation, I do note that my familiarity with the mechanistic interpretability (MI) literature is limited. The authors do highlight difficulties in using existing benchmarks directly for MI research, and MI for VLMs appears relatively underexplored (limiting the available prior work for comparison). I also appreciate the authors' efforts to provide a thorough rebuttal.

In light of these and the concerns that were resolved, I raise my rating to WR.

Author Comment

Thanks for the thoughtful response, which will further enhance our work.

R1.1: do not report any baselines in [1,3]

First, we hope to clarify a misunderstanding: we do not introduce any new splits, but only change the classification setting to a generative QA (GenQA) format (Lines 106-164).

Why do we change to GenQA?

  • Ours: <Obj1, Obj2> + Img → VLM → Left/Right/... When opening up the model, the observed attention is between the generated label (Left or Right) and the image/text tokens. This enables comparing attention behavior between correct and wrong generations.
    • This is the classical way of opening up a model. DoLa: <Washington, capital> → LLM → Seattle/Olympia/... to compare internal states when answering with different labels.
    • To study spatial reasoning in VLMs, we must have the model generate spatial labels in order to meaningfully probe its inner states.

See https://postimg.cc/3kb2t3HJ for why we cannot directly compare with [1,2,3,4]. None of [1,2,3,4] use spatial labels; they use A/B/C/D or True/False classification labels, or a bbox for a single object. We are the first work to open up VLMs on spatial reasoning, which makes it valuable for us to create the first GenQA datasets by reformatting [1,3]. It should not be counted against us that there is no previous GenQA work on VLM spatial reasoning to compare with; we construct the first such datasets, WhatsUp/VSR-QA, to support future research. We believe "None of the reported results can be compared with prior work" should instead read "None of the reported results can be used in MI".
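As an illustration of the classification-to-GenQA reformatting described above, a hypothetical template might look like the sketch below; the exact prompt wording and label set used in WhatsUp/VSR-QA are not reproduced here.

```python
def to_genqa(obj1, obj2):
    """Reformat a (obj1, obj2, image) classification sample into a generative QA prompt (sketch).

    The model must generate a spatial label, so the attention paid by that
    label token to image/text tokens can be inspected directly.
    """
    question = f"Where is the {obj1} relative to the {obj2}? Answer with one word."
    labels = ["left", "right", "on", "under"]
    return question, labels

# e.g. to_genqa("mug", "table")
# -> ("Where is the mug relative to the table? Answer with one word.",
#     ["left", "right", "on", "under"])
```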

We agree that adding more baseline results is valuable and put them here: https://postimg.cc/TK0qRdB8. The numbers are for reference only; a direct comparison between classification and GenQA settings is not fair.

Also, for additional generic comparison, we note that POPE is a fair comparison as it is used by VCD, added here https://postimg.cc/n9jMQzGK.

R1.2/W2: DoLa baselines

  • DoLa works on LLMs. For LLMs, common datasets are often already in GenQA format, which is suitable for MI.
    • DoLa/ITI/CD: Q → LLM → A
    • Datasets TruthfulQA/StrQA: Q → LLM → A
  • This is not the case for VLMs: for spatial reasoning in VLMs, there are no existing datasets in GenQA format, which is why our reformatted datasets are valuable.

W1: Datasets small and synthetic.

  • Our dataset contains 6165 images with QA pairs. Using a small dataset is common practice in MI since data cleanness is important [7,8,9].
  • As clarified in R1.1, we use 85% natural images (COCO/VG) and 15% controlled synthetic images with two objects, with 100% human annotations. To meaningfully compute attention, spatial labels must be the output. Thus, we reformat to GenQA with GPT, which is used only for format conversion, not content generation. Unlike prior work on simulated data, our setup offers the most feasible setting for MI of VLMs' spatial reasoning.

W5: Why YOLO?

We manually fixed YOLO boxes for some subsets to ensure correct GT locations and will include this in paper.

Q1:Model used

We use https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct.

Q2/Q3: Number weaker than official.

  • Qwen2-VL does not open-source its evaluation code, whereas we need a common evaluation pipeline to compare w/ and w/o intervention.
  • Therefore, we use VLMEvalKit, the official evaluation toolkit for the OpenVLM Leaderboard, which has become a widely adopted standard for evaluating VLMs.
  • See https://github.com/QwenLM/Qwen2.5-VL/issues/27: our numbers match those reported by others there, who also noted that the numbers are lower than the official ones. The Qwen authors have not responded.
  • We perform all evaluations fairly with the same toolkit.

Q4: Is average attention logits fair?

In the original rebuttal, we used average logits. We now also compute the sum of attention logits for text and image tokens: https://postimg.cc/Vr21gCS2. Qwen2-VL uses more image tokens ((width//28)×(height//28)) than LLaVA's 576 (a typical 812×1288 COCO image yields 1288 tokens), resulting in a slightly higher image-to-text score ratio. The conclusion holds: total attention to image tokens remains lower than to text tokens.
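A minimal sketch of the two aggregation choices under discussion (the names and shapes are assumptions):

```python
import torch

def attention_mass(attn_row, image_mask):
    """Per-token average vs. total attention mass for image and text tokens (sketch).

    attn_row:   (seq_len,) post-softmax attention from the answer token's query
    image_mask: (seq_len,) bool tensor marking image tokens
    """
    img, txt = attn_row[image_mask], attn_row[~image_mask]
    return {
        "avg_image": img.mean().item(), "avg_text": txt.mean().item(),
        "sum_image": img.sum().item(),  "sum_text": txt.sum().item(),
    }
```

Since image tokens can outnumber text tokens by an order of magnitude, the per-token average and the total mass can disagree, which is exactly the distinction being debated here.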

R1.3:[2,4] highlight how VLMs solve spatial reasoning...limited to earlier VLMs.

  • We respectfully disagree that the 69.5/76.5 results in [2] mean spatial reasoning is solved, even with complex bells and whistles added to VLMs. [4] uses LLMs only. Our LLaVA-1.6 is a latest-generation model, so the identified issues concern the latest rather than earlier VLMs.
  • It is widely recognized [5,6] that spatial reasoning is a key bottleneck for the latest VLMs, which is why we are the first to use MI to study this emerging and urgent problem.

R1.4: Claim that [2] is about LLMs.

Thanks—it was a typo. We meant [4]. As clarified in R1.3, both papers are out of scope for our work on VLMs' MI.

[5] Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
[6] Cambrian-1: Fully Open, Vision-Centric Exploration of Multimodal LLMs
[7] Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models
[8] Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
[9] How Do LLMs Perform Two-Hop Reasoning in Context?

Review
Rating: 4

This paper investigates the visual attention distribution in VLMs and finds that it affects vision-centric understanding. Based on this observation, paired with the model's confidence score when generating tokens, the authors propose ADAPTVIS, a temperature-scaling mechanism for the attention scores that effectively helps the model better understand spatial relations.

update after rebuttal

I thank the authors for the point-by-point response to all of my raised issues. I appreciate the attention-score analysis as well as the additional experiments. While it is true that the proposed method is less effective on the new model, I still acknowledge that the overall structure of the paper is well established and can potentially inspire future work. Therefore I'll maintain my positive score of 4 (Accept).

Questions for the Authors

As mentioned in the supplementary material, the proposed method could also work on general VQA benchmarks. What is the major difference between a spatial-relation understanding task and a standard QA task that makes the proposed method less effective on standard QA tasks?

Claims and Evidence

The claims in the submission are mainly about the investigation of the attention mechanism and its relation to the model's output. I wonder whether the authors have tried other VLMs to see if such an observation is universal.

Methods and Evaluation Criteria

The proposed methods make sense based on the thorough study of the attention mechanism. The evaluation criteria are standard.

Theoretical Claims

N/A

Experimental Design and Analysis

The method and experimental design are straightforward and easy to understand. One concern about the design is that the proposed method is only applied to LLaVA. I'm wondering how well the proposed method works for other VLMs.

Supplementary Material

I appreciate the further analysis in the supplementary material. The investigation of how the confidence changes with different α values addresses my concern about whether such a method would help solve the problem as claimed intuitively.

Relation to Prior Literature

The key contributions of the paper, including the investigation of the attention mechanism for spatial understanding, would help inspire ideas in the VLM hallucination domain. Such a design could help address the inefficiency of contrastive decoding methods, which usually require more than one inference pass of the model.

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

Strengths:

  1. The idea is straightforward, intuitive, easy to understand, and easy to implement. The simple-yet-effective approach makes it possible for this work to have great impact in the VLM-related area.
  2. The writing is easy to follow, clearly stating the observations / the relations between each part.

Weaknesses:

  1. As mentioned above, it'll be appreciated if the authors can provide more results (analytical and experimental) using other VLMs.

Other Comments or Suggestions

  1. Typo: Ln. 375, right col, "we emply ..." should be "we employ ..."
Author Response

We thank Reviewer p97N for the encouraging comments and thoughtful feedback. Below, we address the concerns raised in detail.

I'm wondering how well the proposed method works for other VLMs.

  • Experiments on Qwen2-VL, a SOTA VLM with a different architecture. We intervene in the image attention distribution using our temperature-scaling method, showing consistent improvements below, particularly in challenging cases. For example, on VG_two_obj, where the baseline performance is the lowest among all benchmarks, our method yields a significant improvement of 10+ absolute points. The gains observed on Qwen2-VL further demonstrate the generalizability of our approach.
| Benchmark | Qwen2-VL | Qwen2-VL + Attention Intervention |
|:-:|:-:|:-:|
| VSR | 78.96 | 81.60 (↑ 2.64) |
| Coco_one_obj | 76.64 | 78.03 (↑ 1.39) |
| Coco_two_obj | 75.28 | 76.52 (↑ 1.24) |
| VG_one_obj | 74.89 | 75.11 (↑ 0.22) |
| VG_two_obj | 56.22 | 66.95 (↑ 10.73) |
| Controlled_A | 98.18 | 98.18 (↑ 0.00) |
| Controlled_B | 91.73 | 92.97 (↑ 1.24) |
  • Experiments on more benchmarks, including POPE, GQA, and TextVQA, show that attention intervention maintains performance on more general tasks without hurting it. Compared with spatial reasoning tasks, general QA tasks see relatively smaller improvements. A possible reason is that such tasks are less sensitive to the geometric structure of the image attention distribution. For example, given a question like “Is there a dog in this picture?”, the model only needs to detect the presence of an object and is therefore less likely to suffer from misallocated attention across spatial regions.
| Benchmark | Qwen2-VL | Qwen2-VL + Attention Intervention |
|:-:|:-:|:-:|
| POPE-Overall | 86.32 | 87.09 (↑ 0.77) |
| POPE-P | 86.47 | 87.29 (↑ 0.82) |
| POPE-A | 85.07 | 85.80 (↑ 0.73) |
| POPE-R | 87.46 | 88.22 (↑ 0.76) |
| GQA | 62.09 | 62.17 (↑ 0.08) |
| TextVQA | 79.18 | 79.26 (↑ 0.08) |

provide more results (analytical and experimental) using other VLMs

We present the additional experimental results above. For the analytical results, we also extend our analysis to the Qwen2-VL model and find that our insights generally hold across VLMs.

  • The first claim, that the image receives much less attention than the text tokens, is generally valid for VLMs: we calculate Qwen2-VL's attention scores below (average attention logits per text/image token, respectively), which matches our previous claims.

| Benchmark | Text attention scores | Image attention scores |
|:-:|:-:|:-:|
| Controlled_A | 1.57e-02 | 7.59e-05 |
| Controlled_B | 1.58e-02 | 7.69e-05 |
| Coco_one_obj | 1.77e-02 | 4.48e-04 |
| Coco_two_obj | 1.65e-02 | 3.50e-04 |
| VG_one_obj | 1.54e-02 | 4.75e-04 |
| VG_two_obj | 1.42e-02 | 4.19e-04 |
  • The second important claim, that confidence can serve as a proxy for model performance, still holds, as shown below:
    • Spatial relationships with lower confidence, such as "Left" and "Under", tend to exhibit lower accuracy compared to those with higher confidence, like "On" and "Right".
    • After the intervention with AdaptVis, certain spatial relationships like "Left" show improved performance, accompanied by an increase in confidence. This is consistent with our observations for LLaVA, as depicted in Figure 25 in the Appendix.

| Relation | Qwen2-VL Uncertainty | Qwen2-VL+ScalingVis Uncertainty | Qwen2-VL Acc | Qwen2-VL+ScalingVis Acc |
|:-:|:-:|:-:|:-:|:-:|
| Left | 0.4408 | 0.4474 | 70.11 | 74.71 |
| On | 0.5758 | 0.5668 | 100 | 100 |
| Right | 0.5982 | 0.5911 | 100 | 100 |
| Under | 0.5554 | 0.5418 | 98.82 | 98.82 |

Q1:

What is the major difference between a spatial-relation understanding task and a standard QA task that makes the proposed method less effective on standard QA task?

  • That’s a great question. In fact, our initial and primary motivation was to gain deeper insights into the spatial reasoning capabilities of VLMs. Through our analysis, we observed that the geometric distribution of image attention plays a critical role in this reasoning process. Based on this observation, we proposed two temperature scaling techniques to adjust the attention distribution and address the problem.

  • As for the relatively smaller improvement on general QA tasks, we believe this may be because such tasks are less sensitive to the geometric structure of image attention. For example, given a question like “Is there a dog in this picture?”, the model only needs to detect the presence of an object, and is therefore less likely to suffer from misallocated attention across spatial regions.

Reviewer Comment

I thank the authors for the point-by-point response to all of my raised issues. I appreciate the attention-score analysis as well as the additional experiments. While it is true that the proposed method is less effective on the new model, I still acknowledge that the overall structure of the paper is well established and can potentially inspire future work. Therefore I'll maintain my positive score of 4 (Accept).

Review
Rating: 3

The paper examines the attention patterns in vision-language models and finds patterns that might explain why spatial reasoning can be hard for VLMs. Specifically, the authors first find that a large chunk of the attention is focused on the text stream, even though there are more visual tokens. However, they further find that manually increasing the attention on visual tokens does not help performance. Then, they find that when the visual attention is on the right objects, it typically leads to the right answer, and vice versa. They also find that the model's generation confidence is a good proxy for whether its attention is correctly placed: when the model is confident, it is usually looking at the right location, and thus the attention map can be sharpened; when the model is less confident, it helps to make the attention map more diffuse so it looks at more regions of the image. The proposed technique of modifying attention maps based on confidence improves performance on the several benchmarks they evaluate on.

Update after rebuttal

My initial score for this paper was 4 -- with main concerns based on the datasets considered and that only a single VLM was evaluated (both concerns shared by other reviewers as well). I thank the authors for their rebuttal. I read the reviews of the other reviewers as well as the authors' response: I revised my score from 4 to 3 because I think the new results show that the proposed interventions are barely effective on new (and more standard) benchmarks, and the performance improvement is less pronounced with Qwen2-VL (except on VG_two_obj). Nevertheless, the exploration done in this paper is solid and might be useful to foster research.

I am recommending this paper for acceptance with an implicit assumption (and belief) that the authors will include the rebuttal experiments in their main paper -- without these results, the paper's results might be misleading and over-promising. I asked the authors whether they plan to do this, and they didn't reply, so I am not sure if the authors indeed intend to include these results. I will leave this to the AC.

Questions for the Authors

  • I would like to see some analysis on different VLMs (or some reasoning on why that might not be important)
  • I am not expecting the authors to test on new datasets, but explaining why that was not considered would help

Claims and Evidence

Yes, mostly the claims are well supported. The hypotheses about various attention patterns are well tested, and the resulting technique of adaptively changing the attention patterns works well on the datasets on which it is tested.

However, it is unclear whether the claims of this paper are generally valid for VLMs or specifically valid for the LLaVA models that this paper tests on. For example, the claim that the model focuses more on language tokens than on visual tokens could be true for LLaVA-like models because only a small adapter layer is trained to add the visual modality to the LLM (in addition to a final small LoRA on the whole base LLM). It could also be something specific to a training-data or methodology detail of the LLaVA models. Testing on other VLMs would make the analysis and claims stronger.

Secondly, I would recommend testing on more popular spatial reasoning datasets (like RefCOCO family / BLINK) etc -- I am not particularly tied to the datasets I mentioned, but some datasets which are commonly used for evaluating VLMs.

Methods and Evaluation Criteria

Yes -- see some concerns regarding the specific VLM used and evaluation datasets in claims and evidence.

Theoretical Claims

N/A

Experimental Design and Analysis

Yes -- they are sound and valid

Supplementary Material

Yes -- skimmed through all sections

Relation to Prior Literature

It is a well-known fact that VLMs struggle with spatial reasoning. Getting insights on why that might be, and ways to fix them zero-shot are relevant to the scientific community.

Essential References Not Discussed

No

Other Strengths and Weaknesses

The paper is very well written and is a joy to read -- all the hypotheses and conclusions naturally flow from one another.

Other Comments or Suggestions

N/A

Author Response

We appreciate your thorough review and detailed comments! Your suggestions will be helpful in improving the paper. We address your concerns below.

Q1:

whether the claims of this paper are generally valid for VLMs or is it specifically valid for LLAVA models that this paper tests on.

  • The claims are generally valid for VLMs.
  • The first claim, that the image receives much less attention than the text tokens, is generally valid for VLMs: we calculate Qwen2-VL's attention scores below (average attention logits per text/image token, respectively), which matches our previous claims.
| Benchmark | Text attention scores | Image attention scores |
|:-:|:-:|:-:|
| Controlled_A | 1.57e-02 | 7.59e-05 |
| Controlled_B | 1.58e-02 | 7.69e-05 |
| Coco_one_obj | 1.77e-02 | 4.48e-04 |
| Coco_two_obj | 1.65e-02 | 3.50e-04 |
| VG_one_obj | 1.54e-02 | 4.75e-04 |
| VG_two_obj | 1.42e-02 | 4.19e-04 |
  • The second important claim, that confidence can serve as a proxy for model performance, still holds, as shown below:
    • Spatial relationships with lower confidence, such as "Left" and "Under", tend to exhibit lower accuracy compared to those with higher confidence, like "On" and "Right".
    • After the intervention with AdaptVis, certain spatial relationships like "Left" show improved performance, accompanied by an increase in confidence. This demonstrates a pattern consistent with our observations for LLaVA, as depicted in Figure 25 in the Appendix.
| Relation | Qwen2-VL Uncertainty | Qwen2-VL+ScalingVis Uncertainty | Qwen2-VL Acc | Qwen2-VL+ScalingVis Acc |
|:-:|:-:|:-:|:-:|:-:|
| Left | 0.4408 | 0.4474 | 70.11 | 74.71 |
| On | 0.5758 | 0.5668 | 100 | 100 |
| Right | 0.5982 | 0.5911 | 100 | 100 |
| Under | 0.5554 | 0.5418 | 98.82 | 98.82 |

Q2:

more experiments on other VLM and other benchmarks.

  • Experiments on Qwen2-VL, a SOTA VLM with a different architecture. We intervene in the image attention distribution using our temperature-scaling method, showing consistent improvements below, particularly in challenging cases. For example, on VG_two_obj, where the baseline performance is the lowest among all benchmarks, our method yields a significant improvement of 10+ absolute points. The gains observed on Qwen2-VL further demonstrate the generalizability of our approach.
| Benchmark | Qwen2-VL | Qwen2-VL + Attention Intervention |
|:-:|:-:|:-:|
| VSR | 78.96 | 81.60 (↑ 2.64) |
| Coco_one_obj | 76.64 | 78.03 (↑ 1.39) |
| Coco_two_obj | 75.28 | 76.52 (↑ 1.24) |
| VG_one_obj | 74.89 | 75.11 (↑ 0.22) |
| VG_two_obj | 56.22 | 66.95 (↑ 10.73) |
| Controlled_A | 98.18 | 98.18 (↑ 0.00) |
| Controlled_B | 91.73 | 92.97 (↑ 1.24) |
  • Experiments on more benchmarks, including POPE, GQA, and TextVQA, show that attention intervention maintains performance on more general tasks without hurting it. Compared with spatial reasoning tasks, general QA tasks see relatively smaller improvements. A possible reason is that such tasks are less sensitive to the geometric structure of the image attention distribution. For example, given a question like "Is there a dog in this picture?", the model only needs to detect the presence of an object and is therefore less likely to suffer from misallocated attention across spatial regions.
| Benchmark | Qwen2-VL | Qwen2-VL + Attention Intervention |
|:-:|:-:|:-:|
| POPE-Overall | 86.32 | 87.09 (↑ 0.77) |
| POPE-P | 86.47 | 87.29 (↑ 0.82) |
| POPE-A | 85.07 | 85.80 (↑ 0.73) |
| POPE-R | 87.46 | 88.22 (↑ 0.76) |
| GQA | 62.09 | 62.17 (↑ 0.08) |
| TextVQA | 79.18 | 79.26 (↑ 0.08) |
Review
Rating: 3

This paper introduces ADAPTVIS, an adaptive attention mechanism designed to enhance spatial reasoning in vision-language models (VLMs). By analyzing attention distributions, the authors identify that errors often arise when models focus on irrelevant image regions and that attention patterns differ between familiar and unfamiliar spatial relationships. ADAPTVIS addresses these issues by dynamically re-weighting attention maps based on model confidence during inference, improving geometric understanding with minimal computational cost. Evaluations on benchmarks like WhatsUp and Visual Spatial Reasoning (VSR) demonstrate significant accuracy gains, highlighting ADAPTVIS’s effectiveness in refining spatial reasoning in large vision-language models.

Questions for the Authors

N/A

Claims and Evidence

  1. The paper claims that the model's confidence is correlated with attention correctness and that high confidence leads to sharper attention, while low confidence allows the model to focus more diffusely on the image. However, there is limited empirical evidence to substantiate this claim. The analysis in Section 4, especially regarding the correlation between attention distribution and model confidence, is not comprehensive and largely based on qualitative examples. A more robust, quantitative analysis would strengthen this claim and make the argument more convincing.

Methods and Evaluation Criteria

  1. The method of modifying attention during inference based on model confidence is conceptually sound, especially as it directly targets improving spatial reasoning in VLMs. However, the method's reliance on pre-defined threshold values (alpha1, alpha2, beta) for adjusting attention is a significant drawback. A more adaptive approach that adjusts thresholds dynamically based on the specific task or dataset could improve the method’s robustness and generalizability. Additionally, more clarity on how these hyperparameters interact with the model and their impact across different VLMs would be beneficial.
  2. The evaluation is conducted on several benchmarks, such as WhatsUp and Visual Spatial Reasoning (VSR). While these benchmarks are relevant for testing spatial reasoning, the datasets used are somewhat limited in scope. The authors should consider evaluating ADAPTVIS on more commonly used, real-world datasets, such as GQA or VQA-v2, to assess the method's generalizability and performance across a broader set of tasks.

Theoretical Claims

N/A

Experimental Design and Analysis

  1. One important concern with the current experiments is that they are solely based on the LLaVA-series models (LLaVA-1.5, LLaVA-1.6). While these models are a useful starting point, it is unclear whether the findings would generalize to other VLMs.

Supplementary Material

N/A

Relation to Prior Literature

The key contributions of the paper build on prior work in spatial reasoning for VLMs by introducing a novel attention adjustment mechanism based on model confidence, addressing issues identified in earlier studies on attention misalignment and spatial reasoning failures in VLMs. This approach is closely related to efforts exploring the role of attention mechanisms in improving VLM performance on tasks requiring geometric and spatial understanding.

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

N/A

Other Comments or Suggestions

  1. Some axis titles are missing in the figures (Figures 5 and 7), which could make them harder to interpret.
Author Response

Thanks for your comments and advice. We address your concerns below.

it is unclear whether the findings would generalize to other VLMs.

  • We agree that generalizability is important, so we especially included different variants of the LLaVA-series models: LLaVA-1.5 (224×224 visual encoder, MLP projection layer) and LLaVA-1.6 (336×336 visual encoder, resampler projection layer). They are the most widely used open-source architecture but differ in model characteristics such as confidence, as mentioned in Line 742.
  • To assess the generalizability of our findings, we extend our analysis to Qwen2-VL, a SOTA VLM with a different architecture. We show that our findings on VLMs' attention patterns remain consistent [see the response to Reviewer 3nwP, Q1].
  • To further assess the generalizability of our intervention approach, we extend experiments to Qwen2-VL. We intervene in the image attention distribution using our temperature-scaling method, showing consistent improvements below, particularly in challenging cases. For example, on VG_two_object, where the baseline performance is the lowest among all benchmarks, our method yields a significant improvement of 10+ absolute points. The gains observed on Qwen2-VL further demonstrate the generalizability of our approach.
| Benchmark | Qwen2-VL | Qwen2-VL + Attention Intervention |
|:-:|:-:|:-:|
| VSR | 78.96 | 81.60 (↑ 2.64) |
| Coco_one_obj | 76.64 | 78.03 (↑ 1.39) |
| Coco_two_obj | 75.28 | 76.52 (↑ 1.24) |
| VG_one_obj | 74.89 | 75.11 (↑ 0.22) |
| VG_two_obj | 56.22 | 66.95 (↑ 10.73) |
| Controlled_A | 98.18 | 98.18 (↑ 0.00) |
| Controlled_B | 91.73 | 92.97 (↑ 1.24) |
  • We also evaluate more generic benchmarks for generalizability, including POPE, GQA, and TextVQA, which shows that attention intervention maintains performance on more general tasks without hurting it. Compared with spatial reasoning tasks, general QA tasks see relatively smaller improvements. This may be because such tasks are less sensitive to the geometric structure of image attention. For example, when asked "Is there a dog in this picture?", the model only needs to detect the object's presence and is therefore less likely to suffer from misallocated attention across spatial regions.
| Benchmark | Qwen2-VL | Qwen2-VL + Attention Intervention |
|:-:|:-:|:-:|
| POPE-Overall | 86.32 | 87.09 (↑ 0.77) |
| POPE-P | 86.47 | 87.29 (↑ 0.82) |
| POPE-A | 85.07 | 85.80 (↑ 0.73) |
| POPE-R | 87.46 | 88.22 (↑ 0.76) |
| GQA | 62.09 | 62.17 (↑ 0.08) |
| TextVQA | 79.18 | 79.26 (↑ 0.08) |
  • We believe the approach is theoretically generalizable. The only assumption we make is the difference in information density between image and text: in VLMs, attention is distributed across both textual and visual tokens, and since key information in images tends to be sparser than in text, the attention distribution over visual tokens can be more difficult to allocate accurately.

The paper claims that the model's confidence is correlated with attention correctness…More robust, quantitative analysis to support the correlation between attention distribution and model confidence.

  • As detailed in Lines 184-251 of Section 4.1, we use YOLO to annotate the relevant entities in the images and conduct a quantitative analysis of the AUROC of the overlap between YOLO annotations and the attention patterns, as well as performance; results are shown in Figure 7 in the main paper and Figures 18-24 in the Appendix.
  • In Figure 25 in the Appendix, we perform quantitative analysis on confidence and accuracy and their correlation with the coefficient, where we can see the confidence and accuracy follow similar trends as the temperature coefficient varies, supporting the use of confidence as a reliable proxy for performance.
  • From the first and third subfigures: for low-confidence relationships, applying a small coefficient (<1) improves performance; for high-confidence relationships, a large coefficient (>1) improves performance.
  • We will move the additional quantitative evidence (from Appendix) into Section 4.2 to make this reasoning clearer in the paper.

A more adaptive approach that adjusts thresholds dynamically based on the specific task or dataset...

  • We believe there may be a misunderstanding. As described in lines 323, we adaptively adjust the coefficient using a small held-out subset (20% of the whole subset as the validation set) from the specific target task or dataset, rather than pre-defining hyperparameters.
  • Additionally, we propose a more adaptive variant, ScalingVis, with only a single hyperparameter α to modulate the attention distribution based on model confidence. This design maintains both simplicity and adaptability, contributing to the method’s robustness and generalizability across tasks.
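A minimal sketch of the held-out coefficient selection described above; `eval_fn` and the candidate grid are illustrative assumptions, not the paper's actual search space.

```python
def select_alpha(eval_fn, candidates=(0.5, 0.8, 1.0, 1.2, 1.5, 2.0)):
    """Pick the scaling coefficient on a small held-out validation split (sketch).

    eval_fn(alpha) -> validation accuracy with image attention scaled by `alpha`.
    """
    return max(candidates, key=eval_fn)
```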
Reviewer Comment

The authors' response has addressed most of my concerns. I would like to increase my rating to 3 -- weak accept.

Final Decision

In this paper, the authors introduce ADAPTVIS, an adaptive attention mechanism designed to intervene in order to understand, and subsequently enhance, spatial reasoning in vision-language models (VLMs). By analyzing attention distributions and their correlation with model confidence, ADAPTVIS dynamically re-weights attention maps during inference to improve geometric understanding with minimal computational cost. This approach is framed through the lens of mechanistic interpretability (MI), aiming to understand VLM behavior on spatial tasks. The authors evaluate their method on spatial reasoning benchmarks such as WhatsUp and VSR, along with additional experiments on standard VQA datasets and the Qwen2-VL model, demonstrating accuracy gains.

The reviewers reached a final consensus of one Accept and three Weak Accepts. Reviewers particularly highlighted the importance of addressing VLM spatial reasoning, the novelty and intuition of the MI-driven approach using confidence to modulate attention, the method's simplicity and low overhead, and the potential for impact.

Some concerns were also raised initially, about the generalizability of the findings beyond LLaVA models and the specific spatial datasets used, the need for more quantitative evidence linking confidence and attention patterns, and, most importantly, a certain amount of confusion about the evaluation context, particularly regarding baseline comparisons and the way it manifested in the dataset format (raised by Reviewer 3qjz). The authors engaged actively in the rebuttal and reviewer discussion to alleviate the concerns: they provided additional experiments, including results on the Qwen2-VL architecture and standard benchmarks like GQA, TextVQA, and POPE, supporting the generalizability claim. They also provided analysis of attention modification and its link to confidence, and justification of why reformatting the datasets was necessary to support the mechanistic interpretability framework. These clarifications led to a final positive consensus among all reviewers.

While the performance gains observed on the newer VLM (Qwen2-VL) and general VQA benchmarks were noted by reviewers as being more modest compared to the original LLaVA/spatial task results (and also noting that more recent, larger VLMs have largely improved upon spatial grounding, object pointing etc.), the AC feels that the core contribution lies in the MI analysis and the proposed intervention technique itself and provides valuable insights (insufficient attention image tokens for example). Overall, given the strengths of the paper, the novel MI-based contribution to understanding VLM spatial reasoning, and the positive final consensus from all the reviewers, I recommend the paper be accepted. That said, significant revisions are recommended in the final version to incorporate the clarifications, and ensure the MI focus and evaluation context are made explicitly clear to the reader, to mitigate any confusion as highlighted by the review process.