MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs
We study the perception limitation of Multimodal LLMs and propose automatic visual cropping as a scalable and training-free solution to mitigate this limitation.
Abstract
Reviews and Discussion
This paper studies the attention patterns of MLLMs when answering visual questions and investigates whether MLLMs know where to look, i.e., the perception problem vs. the localization problem. Based on these findings, the paper introduces automatic visual cropping methods that leverage attention and gradient maps to help the model better perceive small visual subjects. The proposed methods are evaluated on two MLLMs and seven VQA benchmarks and demonstrate significant improvements.
Strengths
- This paper studies the perception vs. localization problem for small visual objects and presents insightful findings.
- It introduces a training-free method to help MLLMs better perceive the small visual subject of a given question.
- Experimentally, it demonstrates significant improvements on 7 benchmarks and two MLLMs.
Weaknesses
- To help the model keep the global visual information, the cropped region introduces extra tokens. As illustrated in the paper's Table 4, this indeed adds some computational latency. I suggest the authors try the Matryoshka Query Transformer approach (Matryoshka Query Transformer for Large Vision-Language Models, NeurIPS 24) on the cropped region to save visual tokens, since the re-scaled cropped region intuitively does not carry much fine-grained visual detail.
- The improvements on general VQA benchmarks, which mostly involve large visual concepts, are not as significant as those on small visual concepts.
Questions
- I'm curious whether the authors have tried or explored adding the cropped visual region to the original image instead of concatenating the two images together. Would that bring on-par or better performance than the current approach while saving computation cost?
- I am inclined to accept this paper unless other reviewers point out critical drawbacks that I missed.
Thank you for reviewing our paper. We appreciate your valuable feedback and will try to address your concerns below:
W1:
Thanks for pointing out the work on the Matryoshka Query Transformer. Training MLLMs with MQT allows them to have varying visual context sizes during inference, and this can indeed reduce computational cost. In our current results, we have shown that our methods work with two different MLLMs with distinct visual context sizes, so it seems entirely possible that our method would still work with varying visual context sizes under MQT. We will add this discussion to the paper, and in future work we will explore MQT more closely and integrate it into our methods.
W2:
Yes, but please note that most general VQA benchmarks are dominated by large visual objects, so it is natural that a method that improves small-object perception does not achieve significant boosts on general VQA benchmarks. We report these results to show that our methods, while improving the perception of small objects, do not come at the cost of perceiving large visual objects; that is, they maintain performance on general VQA benchmarks.
Q1:
In our early experiments we noticed that concatenation works better than addition. We think this is because, without any finetuning, adding the global and cropped image tokens produces shifted image-token embeddings that the frozen LLM does not know how to interpret. In contrast, concatenation works because the cropped image tokens are simply treated as additional context, and the LLM in the MLLM already knows how to handle varying amounts of context.
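To make this concrete, here is a toy sketch of the two options (tensor shapes are placeholders for illustration, not our exact model dimensions):

```python
import torch

# Placeholder shapes for illustration (e.g., 576 visual tokens, hidden size 4096).
global_tokens  = torch.randn(1, 576, 4096)   # features of the full image
cropped_tokens = torch.randn(1, 576, 4096)   # features of the re-encoded crop

# Concatenation (what we use): the cropped tokens become extra context.
# A frozen LLM already handles variable-length context, so no finetuning is needed.
concat_context = torch.cat([global_tokens, cropped_tokens], dim=1)  # (1, 1152, 4096)

# Element-wise addition: the summed embeddings are shifted away from the
# distribution the frozen LLM was trained on, which is why it underperformed
# in our early experiments.
added_context = global_tokens + cropped_tokens                       # (1, 576, 4096)
```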
I thank the authors for addressing all my concerns; I'll keep my score.
This paper identifies a drawback of MLLMs in VQA when questions concern small objects in images. It shows empirically that MLLMs perform worse on such questions, and that their failures stem from perception issues even though they are often able to attend correctly to the relevant regions of the image (localization). With these two insights, the authors propose a method based on visual cropping which computes attention maps of the MLLM through various methods, and uses these attention maps to derive a cropped version of the image that is appended to the original image and passed through the MLLM. They show empirically that on VQA datasets involving fine-grained questions (e.g., TextVQA), this inference-time approach outperforms the base MLLM, while maintaining performance on general VQA datasets.
Strengths
This paper identifies an important problem in MLLMs, specifically their lack of sensitivity to small details in images in a VQA context. Moreover, it pairs this with an original and useful insight -- that often the MLLM can attend correctly to the relevant region of the image, even if it produces the wrong answer. This observation may not have been recognized previously. The authors then turn this analysis into a practical method which generates an attention map and uses it to crop the image. The method has several advantages -- it is intuitive, fits well into the framework of pretrained MLLMs, and works at inference time without further training.
Empirical results on Table 2 are convincing and show that their method significantly outperforms the base MLLM on detail-sensitive datasets.
The paper is also written well and has good clarity.
Weaknesses
I wonder whether the drawback of MLLMs for small-sized objects for vision-language tasks has been noted before. For example, some work is mentioned in Lns 287-288 on training MLLMs with higher-resolution patches. It would be good to understand related work that has addressed the problem of small-sized objects for MLLMs generally and/or VQA specifically. This could potentially be added to the related work.
The method assumes that there is a single relevant location of attention in the image; this is not always true, for example in questions that ask about spatial relationships between objects. The authors note this when they discuss limitations. It is not inconceivable that the approach could be extended to such questions if attention maps are sufficiently informative.
Questions
(Sec. 3) Human-crop is a very strong intervention where the crop is tight around the ground-truth object. It would be nice to see how accuracy improves for more realistic crops. For example, one experiment -- crop around objects in the "small" dataset s.t. the proportion of its size in the image is equal to the average proportion value in the "medium" and "large" datasets. Does the accuracy improve, maybe matching the accuracy of the medium and large splits?
(Sec. 4) Is this a standard way to calculate attention in transformers? If I am understanding right -- I am not sure how interpretable attention values are in later layers, where the input tokens have already been transformed significantly. How does this method compare to prior work on attention in MLLMs/vision transformers? Also, it would be nice, if feasible, to evaluate variants of GradCam for transformers as another attention method. These are mentioned in the related work.
(Sec. 4) It'd be interesting to get more intuition on when the MLLM produces correct attention when giving an incorrect answer, and when it does not. I imagine that in the exit or bicycle number case in Fig. 2, it seems reasonable to produce correct attention. In the car question I'm a bit surprised this works since it's just asking about existence of the object -- proper localization seems to imply that the model also perceived the car correctly.
(Sec. 6) A simple baseline here would be to provide a random crop; crop the image at a random location, perhaps half the size of the original image, and provide this as additional input to the MLLM. This would reinforce the analysis in Sec. 4, and show that the accurate localization of the MLLM provides informative crops. It would be nice to show the results of such a baseline.
Minor questions
- What is the "input image resolution" in line 356? Is that the size of the patches input to the MLLM, or the image resolution? Clarifying since the multiples are >= 1.
- What is the reason for the oscillatory behavior in Fig. 3?
Thank you for reviewing our paper. We appreciate your valuable feedback and will try to address your concerns below:
W1:
Regarding the drawbacks of MLLMs for small-sized objects, some recent works have noticed this limitation anecdotally (discussed in the opening of Section 5), but to our knowledge, we are the first to A) quantitatively study its existence across multiple SOTA MLLMs (Section 3), B) show that it is causally related to object size (Section 3), and C) show that it is primarily a perception limitation rather than a localization limitation (Section 4). Following your point, we will update the related work section to better situate our work in the literature.
Q1:
We conducted the experiment with a weaker intervention, per your suggestion, for LLaVA-1.5 and InstructBLIP (on TextVQA). Specifically, we crop around the ground-truth object (GTO) in the small set such that its relative size becomes the same as the average relative size of GTOs in the medium set (denoted Align to Medium) and the large set (denoted Align to Large). We also randomly move the crop around so that the GTO is not always in the center of the cropped image. We still observed a significant increase in the MLLMs' perception accuracy as a result of cropping, suggesting that weaker cropping is still effective:
| Model | Original Accuracy (%) | Align to Medium (%) | Align to Large (%) | human-CROP, tight (%) |
|---|---|---|---|---|
| InstructBLIP | 21.79 | 45.55 | 65.28 | 69.60 |
| LLaVA-1.5 | 39.38 | 54.32 | 60.35 | 69.95 |
We also note that the accuracy surpasses the medium/large-set’s (similar to human-CROP in Table 1). This suggests that the small-set’s questions are easier than the medium/large-set’s questions, which in turn suggests that the limitation in seeing small objects is even stronger than we observed in Table 1 (because the small set seems to have the advantage of easier questions).
Lastly, we think the accuracy gains from our automatic visual cropping methods (Table 2) provide additional evidence that weaker/less-tight crops are still beneficial.
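For concreteness, here is a minimal sketch of how such an aligned crop can be constructed (an illustrative helper, not our exact implementation; it assumes the ground-truth box is small relative to the image):

```python
import random

def aligned_crop(img_w, img_h, box, target_rel_area):
    """Square crop around a ground-truth box (x, y, w, h) such that the box's
    relative area inside the crop is roughly `target_rel_area` (e.g., the average
    relative object size in the medium split). Illustrative sketch only."""
    x, y, w, h = box
    side = int(((w * h) / target_rel_area) ** 0.5)
    side = max(side, w, h)            # the crop must still contain the box
    side = min(side, img_w, img_h)    # and must fit inside the image
    # Random offset so the object is not always centered in the crop.
    left = random.randint(max(0, x + w - side), min(x, img_w - side))
    top  = random.randint(max(0, y + h - side), min(y, img_h - side))
    return left, top, left + side, top + side   # (x1, y1, x2, y2)
```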
Q2:
The prior localization works mentioned in the related work (PNP-VQA and Img2LLM) are developed specifically for the BLIP model, which has a dedicated image-text similarity network (the Image-Text Matching network), and are therefore not directly compatible with general MLLMs that are not explicitly trained for image-text similarity (like LLaVA-1.5). In this work, we derive a more general way of localizing the attention of MLLMs on images from first principles (the product of the answer-to-image-token attention and the image-token-to-image-patch attention). Regarding the interpretability of attention in later layers, we observed in Figure 3 that middle to later layers are more likely to localize correctly than earlier layers (i.e., they have a higher attention ratio).
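As a minimal illustration of this chained-attention idea (simplified shapes, not our exact implementation):

```python
import torch

def localize_patches(answer_to_visual_att, visual_to_patch_att):
    """answer_to_visual_att: (num_visual_tokens,) attention from the generated answer
                             tokens (inside the LLM) to the visual tokens.
       visual_to_patch_att:  (num_visual_tokens, num_patches) attention from the visual
                             tokens (inside the connector) to the raw image patches.
       Returns a (num_patches,) importance map over image patches.
       Simplified sketch of the chained-attention idea described above."""
    patch_importance = answer_to_visual_att @ visual_to_patch_att  # chain the two stages
    return patch_importance / patch_importance.sum()               # normalize for visualization
```

The resulting vector can be reshaped into the patch grid (e.g., 24x24 for LLaVA-1.5) and upsampled to the image size to obtain the map used for cropping.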
Q3:
Our intuition is that the model can use surrounding information to identify where to look. For example, in the car example, it sees the overall appearance of a road next to buildings in the distance, but cannot really see enough details to assign high enough probability to the car tokens.
Q4:
Thanks for the suggestion. We have compared against baselines stronger than random cropping: under the same setting as Table 4, we compared our internal visual cropping methods with external cropping methods (SAM, YOLO, CLIP, which are stronger than random cropping) and observed that our internal methods perform much better.
Q5:
It is the image resolution that the MLLM natively receives (specified in lines 421-425). For example, for LLaVA-1.5, which has an input image resolution of 336x336, we choose windows with resolutions from 336x336 up to 672x672 when searching for the cropping bounding box, and then resize the discovered box down to 336x336 for input to the MLLM.
Q6:
The periodic patterns in Figure 3 (top) are due to the periodic definition of the layer number on the x-axis (in lines 249-271). Essentially, within each period (for example, 24 for BLIP-2), the layer number L goes from the first layer of the backbone LLM to its last layer (24 layers in BLIP-2), and attention is computed with a specific layer of the connector. So the number of connector layers (Lc) determines the number of periods, and the number of LLM layers (L) determines the length of each period. Consequently, for LLaVA and Qwen, which have a single-layer connector (Lc=1), there is no periodic pattern.
Thank you for the clarification; it is clear to me now and makes sense. It is helpful to understand the average input image size.
Thank you, we appreciate your support of the paper's contributions. We are happy to explain the bounding box selection further; it might be clearer to present it in two stages:
- Selecting the best location per bounding box size (Ln 357-359): For each box size (e.g., 2x the input resolution), we find the box location on the original image that has the highest internal sum of importance. Note that the original image can be larger than 2x the input resolution of the MLLM. For example, the average input image size is 954x818 in TextVQA (we report dataset statistics in Table 6 in Appendix C).
- Selecting the best bounding box size (Ln 359-362): Among the "best location per size" boxes (selected in stage 1 above), we then select the box whose internal sum has the largest difference from the average internal sums of its adjacent positions. This latter step is a heuristic to avoid choosing too small or too large windows (notice that in both cases, moving the window slightly left/right or up/down will not change its internal sum significantly).
Coming back to your specific question, the second step specifically avoids selecting a box the size of the entire image, because such a window would have zero difference from its adjacent windows of the same size (since its adjacent window is itself).
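For concreteness, here is a minimal sketch of the two stages over a 2-D importance map defined on the original image (illustrative only, with assumed helper names; the candidate `sizes` are multiples of the MLLM's input resolution, e.g., 336x336 up to 672x672 for LLaVA-1.5):

```python
import numpy as np

def select_crop(importance, sizes, stride=32):
    """Stage 1: best location per candidate window size (highest internal sum).
    Stage 2: keep the size whose best window differs most from its shifted
    neighbours. Illustrative sketch, not our exact implementation."""
    H, W = importance.shape
    integral = np.pad(importance, ((1, 0), (1, 0))).cumsum(0).cumsum(1)  # for O(1) window sums

    def window_sum(top, left, h, w):
        return (integral[top + h, left + w] - integral[top, left + w]
                - integral[top + h, left] + integral[top, left])

    candidates = []
    for h, w in sizes:
        h, w = min(h, H), min(w, W)
        # Stage 1: location with the highest internal sum for this window size.
        score, top, left = max((window_sum(t, l, h, w), t, l)
                               for t in range(0, H - h + 1, stride)
                               for l in range(0, W - w + 1, stride))
        # Stage 2 statistic: difference from the average sum at adjacent positions.
        neighbours = [window_sum(min(max(t, 0), H - h), min(max(l, 0), W - w), h, w)
                      for t, l in [(top - stride, left), (top + stride, left),
                                   (top, left - stride), (top, left + stride)]]
        candidates.append((score - np.mean(neighbours), (left, top, left + w, top + h)))
    return max(candidates)[1]   # (x1, y1, x2, y2) of the selected crop
```

For example, `select_crop(attention_map, sizes=[(336, 336), (504, 504), (672, 672)])` would return crop coordinates on the original image, which are then resized to the MLLM's input resolution.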
I thank the authors for their response. My concerns/comments in the original review have been addressed, and I appreciate the authors' efforts. I am satisfied by the authors' response, and will keep my rating of accept.
It is not that important, but I still have some confusion on how the bounding box selection happens in Ln 356 (if the sliding window is larger than the image, and locations are selected to maximize sum of importance, how would small bounding boxes be selected vs. the whole image?). If the authors see this perhaps they can clarify my confusion. However, I am sure that the approach is fine and my accept rating does not depend on this clarification.
This paper studies the perception limitations of Multimodal Large Language Models (MLLMs) when dealing with small visual details in images, and proposes training-free visual cropping methods to mitigate these limitations. The key contributions are: (1) demonstrating that MLLMs struggle with perceiving small visual details despite knowing where to look in images, (2) showing this limitation is causal through intervention studies with visual cropping, and (3) developing automatic visual cropping methods that leverage MLLMs' internal attention and gradient information to improve their perception without any additional training. The paper makes a valuable contribution by rigorously analyzing an important limitation of MLLMs and providing practical solutions. However, there are some points that need clarification and potential improvements that could strengthen the work.
Strengths
- This paper raises an interesting problem: MLLMs seem to know which areas should be focused on, yet still fail to answer the related questions correctly all the time. The intervention study validates this problem.
- The proposed methods are intuitive, directly address the problem, and make good use of the findings from the pilot study.
Weaknesses
- The evaluation is not done thoroughly, given that the only baseline is no cropping, without any other fair comparison. V* (star) may be a good baseline, since it tries to address a problem similar to the one this paper raises.
- The existing approach seems to achieve limited improvement of up to ~4% over the no-crop setting across all 8 benchmarks.
- The role of model architecture in determining perception limitations isn't explored.
Questions
- As mentioned in the first weakness, how would traditional grounding algorithms help address the problem under a similar setting? Why ViCrop instead of those?
- How does the method handle cases where the visual subject spans multiple regions of interest?
Thank you for reviewing our paper. We appreciate your valuable feedback and will try to address your concerns below:
W1/Q1:
Regarding traditional grounding algorithms, we compared our internal visual cropping methods with several SOTA grounding methods (SAM, YOLO, CLIP) under the same settings in Table 4. We observed that our attention/gradient-based methods (ViCrop) consistently outperform the use of these grounding methods. We think the reason is that ViCrop utilizes the MLLMs’ internal strong question understanding and reasoning capabilities to perform a question-dependent and context-aware grounding.
Regarding the V* method (SEAL), we did not compare with it because SEAL requires substantial training and finetuning of several neural networks, whereas our methods are completely training-free, so a comparison would not be fair. Nonetheless, to provide an idea of how our method compares to SEAL in an “as-is” fashion (i.e., if a user just wants to pick one method as-is off-the-shelf), we report the accuracy of SEAL compared to LLaVA-1.5+rel-att below. We observe that our method outperforms SEAL except on the V* benchmark (we think this might be because SEAL is designed and tuned specifically towards achieving high accuracy on the questions in its V* benchmark). We also note that the inference time of SEAL is significantly slower than our method (4.44s compared to 1.88s on average per question, tested on the same random 100 TextVQA samples with one A6000 GPU).
| Model | TextVQA | V* | POPE | DocVQA | AOKVQA | GQA |
|---|---|---|---|---|---|---|
| SEAL (Visual Search) | 36.30 | 75.30 | 82.40 | 5.31 | 55.34 | 50.18 |
| LLaVA-1.5+rel-att (Ours) | 55.17 | 62.30 | 87.25 | 19.63 | 60.66 | 60.97 |
(We could not report on VQAv2 for the rebuttal because it contains 200K testing samples -- 20 times larger than GQA and 40 times larger than TextVQA -- and running SEAL on it would cost more than 10 days on our available computing resources; we will certainly add it for the final version.)
W2:
Our approach achieves substantial improvements on benchmarks requiring the perception of small visual concepts. To better clarify the gains, we reiterate our rel-att method’s improvements from Table 2 below:
- TextVQA: LLaVA-1.5 goes from 47.80% to 55.17% (7.37 gain); InstructBLIP goes from 33.48% to 45.44% (11.96 gain)
- V*: LLaVA-1.5 goes from 42.41% to 62.30% (19.89 gain); InstructBLIP goes from 35.60% to 42.41% (6.81 gain)
- DocVQA: LLaVA-1.5 goes from 15.97% to 19.63% (3.66 gain); InstructBLIP goes from 9.20% to 9.95% (0.75 gain)
- POPE: LLaVA-1.5 goes from 85.27% to 87.25% (1.98 gain); InstructBLIP goes from 84.89% to 86.64% (1.75 gain)
Other general VQA benchmarks are dominated by large visual concepts, and therefore it is natural that a method that improves small object perception does not achieve significant boosts in the general VQA benchmarks. We report these results to show that our methods, while improving perception of small objects, do not come at the cost of perception of large visual objects, that is, they can maintain performance on general VQA benchmarks.
W3:
We fully agree that studying the architectural causes of this perception limitation is valuable, but we think it merits its own separate paper and therefore we leave it to future work. Our scope in this paper was to first show that the limitation in seeing small objects exists in SOTA MLLMs and is causal (per Table 1), that it is primarily a perception limitation rather than a localization limitation (i.e., MLLMs internally know where to look, per Figure 3), and lastly, that this can be utilized to improve their perception without any training (per Table 2).
Q2:
The method uses a multi-scale bounding box selection strategy (described in Section 5, lines 353-364) that allows it to select the box that contains the visual subject as completely as possible. However, if the question requires seeing multiple visual subjects that are very far apart, and more than one of them is small, then our method can only help with the visual subject that the model deems most important. We have discussed this limitation under Limitations and Future Works in lines 497-499 and will try to address it in future work.
I thank the authors for all the comments; I'll keep my score.
Thanks for the response. We have run and added the comparison of SEAL and our method on a 25K random subset of VQAv2, in the 'as-is' fashion we mentioned previously:
| Model | VQAv2 |
|---|---|
| SEAL (Visual Search) | 65.60 |
| LLaVA-1.5+rel-att (Ours) | 76.29 |
We would be happy to engage further if you have any additional questions or suggestions.
Could you please provide more concrete details on how you evaluated your method in comparison with LLaVA-NeXT? From Appendix B, it is not clear exactly what steps were taken: did you process the image in low resolution and then crop the region of interest, or did you use a dynamic resolution mode similar to LLaVA-NeXT and crop regions from every crop of the original image? It would be very helpful if you could provide specific details about your implementation.
Thank you!
Thank you for your question. We use the internal dynamic resolution of LLaVA-NeXT in both the baseline and our method in Table 5. We apply our method to LLaVA-NeXT mostly in the same way as LLaVA-1.5, except for one difference, which we elaborate on next. LLaVA-NeXT, unlike LLaVA-1.5, internally divides the global input image into several patches, processes these patches (and a down-sampled global image) independently with the vision encoder, and then concatenates the resulting features into a single long visual context for the LLM. When applying our method to LLaVA-NeXT, we compute the attention with respect to these patches (rather than the down-sampled global image), and then stitch the patch attention maps back together to form an attention map over the original global image. We then use this global attention map to find the region of interest (per lines 353-363), crop it, process it with the vision encoder into features, and concatenate those features to the input visual context (which already contains the features of the patches and the down-sampled global image from the baseline LLaVA-NeXT). We will of course release the code after publication to further assist the use and adoption of our methods.
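To illustrate the stitching step (a simplified sketch that assumes a row-major tiling of the patches; not our exact code):

```python
import numpy as np

def stitch_patch_attention(patch_maps, grid_rows, grid_cols):
    """patch_maps: list of 2-D attention maps, one per image tile processed
    independently by the vision encoder, ordered row-major according to the
    tiling used for this image. Returns a single attention map covering the
    full tiled image, which is then used to find the region of interest."""
    rows = [np.concatenate(patch_maps[r * grid_cols:(r + 1) * grid_cols], axis=1)
            for r in range(grid_rows)]
    return np.concatenate(rows, axis=0)
```

For example, for a 2x2 tiling, `stitch_patch_attention(maps, 2, 2)` arranges the four per-tile maps back into one map aligned with the original image.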
I notice that this work extends the original image tokens by concatenating the visually cropped image tokens. This method seems to seek a trade-off between global and local knowledge. For tasks that rely on local knowledge, the cropped image tokens alone may be enough. Could you provide results with only the visually cropped image tokens (excluding the original image tokens) as the MLLM's input on TextVQA?
Thanks!
Thanks for your question. Following your suggestion, we have run both LLaVA-1.5 and InstructBLIP under our relative attention visual cropping method (rel-att) with (w/) and without (w/o) providing the global image tokens to the LLM on TextVQA:
| Model | Original Performance | ViCrop Performance w/o Global Image Tokens | ViCrop Performance w/ Both Global & Cropped Image Tokens |
|---|---|---|---|
| LLaVA-1.5 | 47.80 | 51.63 | 55.17 |
| InstructBLIP | 33.48 | 38.91 | 45.44 |
We observe that the global information is indeed needed as discussed in lines 300-304.
This paper addresses MLLMs' limitation in perceiving small details. The authors show, via controlled experiments, that cropping the relevant region significantly improves performance, while revealing that MLLMs do know where to look but often fail to perceive fine details. Its strengths include a thorough causal analysis, strong empirical gains (especially on V*Bench), and a training-free approach. Weaknesses involve limited improvement on general VQA tasks and challenges with multi-object queries. Overall, given the significance of the problem, the clarity of the findings, and the simple yet effective solution, the AC recommends acceptance.
Additional Comments on Reviewer Discussion
During the rebuttal, reviewers requested grounding comparisons, multi-region clarifications, partial-crop baselines, and details on bounding-box selection and token overhead. The authors provided experiments showing stronger performance than SAM/YOLO/CLIP, explained multi-region constraints, demonstrated partial-crop improvements, clarified bounding-box selection, and discussed concatenation vs. Matryoshka queries. These updates and additional results strengthened the paper's claims and supported the final decision to accept.
Accept (Poster)