PaperHub
Overall: 6.8 / 10 · Poster · 4 reviewers (ratings 4, 4, 5, 4; min 4, max 5, std 0.4)
Confidence: 3.8 · Novelty: 3.0 · Quality: 2.8 · Clarity: 3.0 · Significance: 3.0
NeurIPS 2025

FOCUS: Internal MLLM Representations for Efficient Fine-Grained Visual Question Answering

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29
TL;DR

We propose a training-free visual cropping method that leverages MLLM-internal representations for VQA tasks focusing on small details, achieving strong performance with significantly higher efficiency than prior methods.

Abstract

Keywords
vision-language models, large language models, visual question answering

Reviews and Discussion

Review (Rating: 4)

This study introduces a visual cropping technique for fine-grained visual question answering. The proposed method leverages representations from the MLLM to search for the most relevant image region. Specifically, it first identifies the target objects in order to compute an object relevance map. It then picks the top-ranked regions to make the prediction. The paper shows that the proposed cropping method improves efficiency significantly.

Strengths and Weaknesses

Strengths:

  1. The motivation is clear.
  2. It is interesting to explore the visual cropping to make MLLM more efficient.

Weaknesses:

  1. The proposed method introduces many hand-crafted components, such as ROI proposals and non-maximum suppression, which contain additional hyper-parameters. This potentially makes the method less robust.
  2. The proposed method is only evaluated on a limited set of datasets. Beyond fine-grained VQA, I would like to see an evaluation on a broader range of vision-language datasets.
  3. While becoming more efficient, the proposed method does not outperform another cropping method, ZoomEye. It seems like the method trades performance for efficiency.

Questions

See Weaknesses above.

Limitations

Yes

Justification of Final Rating

Most of my concerns have been addressed in the rebuttal.

Formatting Issues

No.

Author Response

We sincerely thank you for your questions and feedback on our paper.

Robustness w.r.t. hyperparameters

"The proposed method introduces many hand-crafted components such as ROI, non-maximum suppression which contains more hyper-parameters. This potentially makes the method less robust."

We understand your concern regarding the robustness of FOCUS w.r.t. components such as region of interest (ROI) proposals and non-maximum suppression (NMS). We conduct additional experiments using both LLaVA-1.5 and LLaVA-OV on V*Bench to systematically examine the impact of key hyperparameters, including the number of anchor points $k$, the minimal Euclidean distance between anchor points $s_\text{dist}$, the maximum ROI size $s_\text{max}$, the ROI expansion threshold, and the NMS threshold. As shown in the tables below, the accuracy obtained with alternative hyperparameter choices degrades by at most 4.7% compared to the accuracy achieved with the original settings, demonstrating the stability of our method. During the robustness analysis, we also discover some alternative hyperparameter configurations that achieve even higher accuracy than our default settings.

In the following, we present the results of our hyperparameter robustness analysis in detail. For the number of anchor points $k$, we uniformly sampled 20 values and report the mean accuracy along with a 95% confidence interval (corresponding to 2 standard deviations).

| Model | $k$ | Accuracy (%) | Comment |
| --- | --- | --- | --- |
| LLaVA-1.5 | 30 | 72.77 | original choice |
| LLaVA-1.5 | random(10, 50) | 72.77 ± 1.55 ($\Delta$ = +0.00) | alternative |
| LLaVA-OV | 30 | 92.15 | original choice |
| LLaVA-OV | random(10, 50) | 92.03 ± 1.37 ($\Delta$ = −0.12) | alternative |
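
For reference, the ± values in the randomized rows are obtained by aggregating the 20 sampled runs into "mean ± 2 × standard deviation". A minimal sketch of that aggregation (with placeholder accuracy values, not our actual measurements) is:

```python
import numpy as np

# Placeholder accuracies, one per sampled k; replace with the 20 measured values.
accuracies = np.array([72.3, 73.1, 72.8, 71.9, 73.4])

mean = accuracies.mean()
half_width = 2 * accuracies.std(ddof=1)  # reported as a ~95% interval (two standard deviations)
print(f"{mean:.2f} ± {half_width:.2f}")
```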

For the minimum distance between ROI anchor points $s_\text{dist}$, we varied the value from 2 to 4 for LLaVA-1.5, and from 2 to 5 for LLaVA-OV.

| Model | $s_\text{dist}$ | Accuracy (%) | Comment |
| --- | --- | --- | --- |
| LLaVA-1.5 | 2 | 72.77 | original choice |
| LLaVA-1.5 | 3 | 72.25 ($\Delta$ = −0.52) | alternative |
| LLaVA-1.5 | 4 | 69.11 ($\Delta$ = −3.66) | alternative |
| LLaVA-OV | 2 | 92.15 | original choice |
| LLaVA-OV | 3 | 91.62 ($\Delta$ = −0.53) | alternative |
| LLaVA-OV | 4 | 92.15 ($\Delta$ = +0.00) | alternative |
| LLaVA-OV | 5 | 91.62 ($\Delta$ = −0.53) | alternative |

For the maximum ROI size $s_\text{max}$, we explored alternative values ranging from 5 to 9 for LLaVA-1.5, and from 5 to 11 for LLaVA-OV.

| Model | $s_\text{max}$ | Accuracy (%) | Comment |
| --- | --- | --- | --- |
| LLaVA-1.5 | 5 | 72.77 | original choice |
| LLaVA-1.5 | 7 | 69.63 ($\Delta$ = −3.14) | alternative |
| LLaVA-1.5 | 9 | 68.06 ($\Delta$ = −4.71) | alternative |
| LLaVA-OV | 9 | 92.15 | original choice |
| LLaVA-OV | 5 | 94.24 ($\Delta$ = +2.09) | alternative |
| LLaVA-OV | 7 | 91.01 ($\Delta$ = −1.14) | alternative |
| LLaVA-OV | 11 | 89.53 ($\Delta$ = −2.62) | alternative |

For the ROI expansion threshold, we uniformly sampled 20 values and reported the mean accuracy along with a 95% confidence interval (corresponding to 2 standard deviations).

| Model | ROI expansion threshold | Accuracy (%) | Comment |
| --- | --- | --- | --- |
| LLaVA-1.5 | 0.5 | 72.77 | original choice |
| LLaVA-1.5 | random(0.3, 0.7) | 72.77 ± 0.00 ($\Delta$ = +0.00) | alternative |
| LLaVA-OV | 0.5 | 92.15 | original choice |
| LLaVA-OV | random(0.3, 0.7) | 92.70 ± 0.25 ($\Delta$ = +0.55) | alternative |

Moreover, we report accuracy using NMS thresholds ranging from 0.1 to 0.7 to demonstrate robustness w.r.t. this parameter.

| Model | NMS threshold | Accuracy (%) | Comment |
| --- | --- | --- | --- |
| LLaVA-1.5 | 0.3 | 72.77 | original choice |
| LLaVA-1.5 | 0.1 | 70.16 ($\Delta$ = −2.61) | alternative |
| LLaVA-1.5 | 0.5 | 72.25 ($\Delta$ = −0.52) | alternative |
| LLaVA-1.5 | 0.7 | 72.25 ($\Delta$ = −0.52) | alternative |
| LLaVA-OV | 0.3 | 92.15 | original choice |
| LLaVA-OV | 0.1 | 92.15 ($\Delta$ = +0.00) | alternative |
| LLaVA-OV | 0.5 | 92.15 ($\Delta$ = +0.00) | alternative |
| LLaVA-OV | 0.7 | 92.15 ($\Delta$ = +0.00) | alternative |

In addition to demonstrating robustness to hyperparameter changes, we also validate the robustness of FOCUS across different model architectures. Specifically, we integrate Qwen-VL-2.5-7B with FOCUS using hyperparameters similar to those applied to LLaVA-OV-7B. As shown in the table below, FOCUS substantially improves performance on various datasets compared to the vanilla Qwen model. These results further demonstrate the robustness of FOCUS across different MLLMs. We will include these results in the revised version of the paper.

| Model | Dataset | Accuracy (%) |
| --- | --- | --- |
| Qwen-2.5-VL | V*Bench | 79.06 |
| Qwen-2.5-VL + FOCUS | V*Bench | 90.58 ($\Delta$ = +11.52) |
| Qwen-2.5-VL | HRBench-4K | 71.62 |
| Qwen-2.5-VL + FOCUS | HRBench-4K | 79.25 ($\Delta$ = +7.63) |
| Qwen-2.5-VL | HRBench-8K | 68.62 |
| Qwen-2.5-VL + FOCUS | HRBench-8K | 76.25 ($\Delta$ = +7.63) |

Evaluation beyond fine-grained VQA

"The proposed method is only evaluated with limited datasets. Beyond the fine-grained VQA, I would like to see the evaluation on broader range of vision-language datasets."

In our work, we primarily focused on fine-grained VQA, as this research area remains underexplored and lacks efficient, high-performing methods. To assess the broader applicability of FOCUS, we conduct additional experiments on the A-OKVQA dataset, which features questions involving more commonly sized or large objects. Notably, this dataset is also used to evaluate ViCrop's performance in large-object scenarios [R1].

Below, we report the performance of LLaVA-1.5 and LLaVA-OV with and without FOCUS on A-OKVQA, employing the same hyperparameters as used for V*Bench:

| Model | A-OKVQA Accuracy (%) |
| --- | --- |
| LLaVA-1.5 | 77.99 |
| LLaVA-1.5 + FOCUS | 74.76 ($\Delta$ = −3.23) |
| LLaVA-OV | 91.44 |
| LLaVA-OV + FOCUS | 91.00 ($\Delta$ = −0.44) |

We observe only marginal performance degradation of no more than 3.3% compared to the vanilla counterparts. This indicates that FOCUS remains compatible with VQA benchmarks involving larger objects, highlighting its adaptability across different vision-language datasets.

Comparison to ZoomEye

"While becoming more efficient, the proposed method does not outperform another cropping method, ZoomEye. It seems like the method trades performance for efficiency."

We thank you for highlighting this concern. We would like to take this opportunity to put your assessment into perspective. As shown in the table below, FOCUS consistently achieves substantial efficiency gains over ZoomEye, often by 2× to 5×, without compromising accuracy. In 5 out of 8 dataset-model combinations, such as LLaVA-1.5 on HRBench-4K and LLaVA-OV on V*Bench and the HRBench datasets, FOCUS even outperforms ZoomEye in terms of accuracy.

| Model | Dataset | FOCUS's performance gain relative to ZoomEye (%) | FOCUS's efficiency gain relative to ZoomEye (%) |
| --- | --- | --- | --- |
| LLaVA-1.5 | V*Bench | −6.08 | +266.19 |
| LLaVA-1.5 | HRBench-4K | +3.50 | +240.79 |
| LLaVA-1.5 | HRBench-8K | −8.17 | +490.49 |
| LLaVA-OV | V*Bench | +1.15 | +214.29 |
| LLaVA-OV | HRBench-4K | +2.52 | +117.67 |
| LLaVA-OV | HRBench-8K | +0.91 | +165.73 |
| LLaVA-OV | MME-RW-P | −3.80 | +439.56 |
| LLaVA-OV | MME-RW-R | +3.08 | +459.68 |

In summary, FOCUS does not compromise performance for efficiency. Instead, it achieves a balanced and optimized trade-off, outperforming or roughly matching ZoomEye in accuracy while being significantly more efficient. At a time when the computational demands of AI models are steadily increasing, we see our work as a contribution toward addressing resource constraints and reducing the energy consumption associated with large-scale vision-language systems.

References:

R1: MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs, Zhang et al, ICLR 2025

Comment

Thanks for your detailed response.

Since adding FOCUS complicates the inference, I still have concerns that it degrades the performance of the original approach in other datasets. It would be better to verify this on a wide range of VQA datasets.

Comment

Thank you for your valuable feedback.

Regarding your final concern about evaluating FOCUS on a broader range of VQA datasets, we fully agree that a more extensive empirical study would further strengthen our work. Unfortunately, due to the time constraints of the rebuttal period, we were only able to provide additional results on A-OKVQA, which we selected intentionally as a widely used benchmark outside the fine-grained VQA domain. We plan to extend our evaluation to more datasets in the next revision.

While your comment didn’t specify the exact nature of the concern, we interpreted it as potentially relating to either (1) performance on larger objects, or (2) diversity of evaluation data. We would like to clarify why the evaluation on A-OKVQA already provides meaningful insights into both.

  1. Larger object size
  • While VQA involving large or medium-sized objects has been widely studied, fine-grained or small-object VQA remains underexplored, and addressing this gap is the primary motivation behind our work.

  • Visual cropping methods have typically been evaluated only on fine-grained VQA benchmarks. For example, neither ZoomEye nor DC2 include evaluations beyond fine-grained settings. By including results on A-OKVQA, our work already covers a broader scope than many existing methods. Compared to ViCrop, the only visual cropping method previously evaluated on A-OKVQA, FOCUS performs significantly stronger, yielding a performance improvement of 14.1%: FOCUS achieves 74.76% accuracy, compared to 60.66% for ViCrop [R1]. This demonstrates the strong adaptability of FOCUS, even on tasks beyond its original scope.

  • A-OKVQA specifically emphasizes large-object VQA. On this dataset, FOCUS causes only a 3.23% drop in accuracy on LLaVA-1.5, which can be attributed to the model's single-image input limitation—cropping inevitably risks omitting global context. In contrast, LLaVA-OV, which supports multi-image input via interleaved text-image prompting, preserves the global view alongside cropped regions. As a result, the accuracy drop when using FOCUS shrinks to just 0.44%. Since most MLLMs developed since late 2024 support multi-image-text interleaving, we believe this design shift inherently mitigates any concerns about applying FOCUS to large-object tasks in practice.

  2. Diversity of evaluation data
  • While limited by time, we chose A-OKVQA because it is a well-established benchmark containing diverse concepts. It was created to address the limitations of earlier VQA datasets by including questions that require general world knowledge, more complex reasoning, and a wider range of object types [R2]. Its validation set has 1.2k samples, which is about 6x larger than typical fine-grained datasets like V*Bench, and it covers a broader variety of visual and question concepts. Because of this, good performance on A-OKVQA is a strong indication that FOCUS can also work well on other general VQA tasks.

In summary, although our focus is fine-grained VQA, our experiments on A-OKVQA—combined with results across three MLLMs and multiple benchmarks—demonstrate that FOCUS generalizes well to larger-object tasks and broader VQA scenarios, without requiring dataset-specific tuning. We hope this clarifies our position, and we appreciate your thoughtful review.

References:

R1: MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs, Zhang et al, ICLR 2025

R2: A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge, Schwenk et al, ECCV 2022

Review (Rating: 4)

This paper proposes a new method for fine-grained image recognition on high-resolution images. Standard vision-language models such as LLaVA often struggle with recognizing small objects due to limitations in their vision encoder architecture. Although several training-free methods have been proposed to address this issue, they typically incur high inference costs because they require multiple forward passes over many local crop regions. To improve efficiency, this paper proposes a method that leverages attention maps obtained during inference to reduce the number of forward passes. The proposed approach is evaluated on standard benchmarks for fine-grained recognition with high-resolution images and demonstrates a favorable accuracy-efficiency trade-off.

Strengths and Weaknesses

Quality and Significance

  • The proposed method is intuitive and demonstrates a promising accuracy-efficiency trade-off, making it a valuable contribution in this domain.

  • While the number of forward passes is used as the primary metric for efficiency, it would be important to also report actual inference time. This is because efficiency is affected by various factors such as memory access patterns, hardware-level optimizations, and framework-specific implementation details. Although inference time is reported in the appendix, it should be discussed in the main text, and a plot showing accuracy versus number of steps (n_step) would help better illustrate the actual trade-off.

  • It is unclear how reliable the number of forward passes is as an indicator of computational cost, as it can vary significantly depending on implementation. FLOPs or MACs may serve as more objective metrics for computational complexity.

Clarity

  • The paper is generally easy to follow. However, providing the full algorithm (e.g., in pseudocode or step-by-step form) in the main text would improve clarity—particularly since Section 3.3 is somewhat difficult to follow.

  • Some equations use Python- or PyTorch-specific notations (e.g., reshape), which are not appropriate in mathematical equations. These should be replaced with standard mathematical notation.

Originality

  • While the use of attention maps itself is not novel, applying them in this way to improve efficiency for high-resolution fine-grained recognition is relatively underexplored and shows good empirical results. This gives the work sufficient novelty within its scope.

Questions

As mentioned above, discussing the trade-off between accuracy and inference time is crucial. Could you include a plot showing accuracy versus inference time with the number of steps (n_step) for both the proposed method and existing baselines? This would provide a clearer view of the practical impact and efficiency of your approach.

Limitations

Yes

Justification of Final Rating

Thank you for conducting the additional experiments on inference cost. The results clearly demonstrate the effectiveness of the proposed method. Although the best accuracy achieved by the proposed approach does not surpass that of ZoomEye, its inference efficiency is significantly superior. As my concerns have been sufficiently addressed, I am willing to raise my score.

Formatting Issues

No

Author Response

We sincerely thank you for your constructive feedback and greatly appreciate the emphasis on real-world efficiency metrics, such as inference time, which aligns closely with the core motivation of our work.

Additional efficiency metrics, e.g., FLOPs

"While the number of forward passes is used as the primary metric for efficiency, it would be important to also report actual inference time. This is because efficiency is affected by various factors such as memory access patterns, hardware-level optimizations, and framework-specific implementation details. Although inference time is reported in the appendix, it should be discussed in the main text, and a plot showing accuracy versus number of steps (n_step) would help better illustrate the actual trade-off."

and

"It is unclear how reliable the number of forward passes is as an indicator of computational cost, as it can vary significantly depending on implementation. FLOPs or MACs may serve as more objective metrics for computational complexity."


In our paper, we select the number of forward passes (FPs) as the primary indicator of computational cost due to its simplicity. To provide a more comprehensive efficiency analysis, we additionally report both inference time and FLOPs for FOCUS and ZoomEye. We calculate the FLOPs for LLaVA-1.5 based on the estimation formulas introduced in FastV [R1] and PyramidDrop [R2]. Following these prior works, we exclude the embedding and projection layers as well as the output head of the MLLM from our FLOPs estimation:

def flops_estimate(seq_length):
    """
    Estimate the FLOPs of one LLaVA-1.5 forward pass given the sequence length
    Using the estimation formula from FastV/PyramidDrop paper
    Per LLM layer: 4nd**2+2n**2*d+3ndm
    Per ViT layer: 4nd**2+2n**2*d+2ndm
    ViT-CLIP-L outputs 576 + 1 (CLS) = 577 visual tokens
    Number of visual tokens in the sequence is always 576 
    """
    llm_dim = 4096
    llm_num_layers = 32
    llm_ffn_dim = 11008
    vit_dim = 1024
    vit_num_layers = 24
    vit_ffn_dim = 4096
    flops_llm = llm_num_layers * ((4 * seq_length * (llm_dim**2)) + (2 * (seq_length**2) * llm_dim) + (3 * seq_length * llm_dim * llm_ffn_dim))
    flops_vit = vit_num_layers * ((4 * 577 * (vit_dim**2)) + (2 * (577**2) * vit_dim) + (2 * 577 * vit_dim * vit_ffn_dim))
    return flops_llm + flops_vit
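
As a quick sanity check of the formula, one can plug in an illustrative prompt length (the text-token count below is an assumption for illustration, not the exact value used in our experiments):

```python
# Illustrative usage of flops_estimate defined above:
# 576 visual tokens plus an assumed ~60 text tokens in the prompt.
seq_length = 576 + 60
print(f"{flops_estimate(seq_length) / 1e12:.2f} TFLOPs per forward pass")
```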

Due to the new rebuttal guidelines, we are unable to include new plots. Therefore, we present the results in table format. For the revised version of the paper, we plan to include more detailed efficiency plots, including inference time, FLOPs, and number of steps.

We conduct our experiments for LLaVA-1.5 with FOCUS on V*Bench and report the average inference time and FLOPs per image across the full dataset:

| Config of FOCUS | Accuracy (%) | Inference Time (s) | FPs | TFLOPs |
| --- | --- | --- | --- | --- |
| n_steps=1 (no_overrun) | 51.30 | 0.99 | 1.47 | 10.98 |
| n_steps=2 (no_overrun) | 57.07 | 1.28 | 4.25 | 23.11 |
| n_steps=1 | 64.40 | 1.36 | 4.86 | 25.73 |
| n_steps=2 | 66.49 | 1.44 | 5.70 | 29.43 |
| n_steps=3 | 67.01 | 1.55 | 6.79 | 34.15 |
| n_steps=4 | 68.06 | 1.73 | 8.27 | 40.61 |
| n_steps=6 | 70.68 | 2.00 | 10.71 | 51.26 |
| n_steps=8 | 72.77 | 2.27 | 13.28 | 62.46 |

As these are newly conducted experiments, the results for n_steps=8 slightly differ from those reported in the appendix. For comparison, we also provide the results for LLaVA-1.5 with ZoomEye on V*Bench:

| Config of ZoomEye | Accuracy (%) | Inference Time (s) | FPs | TFLOPs |
| --- | --- | --- | --- | --- |
| Depth-1 (2 crops) | 50.26 | 3.78 | 12.50 | 55.95 |
| Depth-1 (4 crops) | 50.78 | 4.76 | 20.37 | 91.03 |
| Depth-2 (4 crops) | 71.20 | 9.73 | 44.54 | 199.21 |
| Depth-5 (4 crops) | 77.48 | 11.96 | 48.63 | 217.00 |

In summary, the efficiency gains reported in terms of FPs are consistently reflected in real-world improvements. FOCUS achieves competitive or superior accuracy with significantly lower inference time and fewer FLOPs compared to ZoomEye. For instance, at n_steps=6, FOCUS reaches 70.68% accuracy in just 2.00 seconds and 51.26 TFLOPs, whereas ZoomEye (depth-2) requires nearly 4× more FLOPs and 5× longer inference time to reach a comparable accuracy. These results underscore the practical efficiency of FOCUS for fine-grained visual reasoning tasks.

Pseudocode

"The paper is generally easy to follow. However, providing the full algorithm (e.g., in pseudocode or step-by-step form) in the main text would improve clarity—particularly since Section 3.3 is somewhat difficult to follow."

We agree with you that presenting the full FOCUS algorithm in pseudocode would enhance clarity. Due to the updated rebuttal guidelines, we are unable to upload PDFs containing properly formatted LaTeX-style pseudocode. Nevertheless, we have made an effort to represent the core steps of Section 3.3 using simplified pseudocode within the rebuttal. A complete and properly formatted version of the pseudocode will be included in the revised paper.

Pseudocode for Section 3.3
Input: object relevance map $\mathbf{A}$, number of anchor points $k$, number of steps $n_\text{steps}$, ROI parameters $s_\text{min}$, $s_\text{max}$, $s_\text{dist}$, NMS threshold $\text{nms}_\text{threshold}$, model MLLM
Step 1: anchor_points ← ExtractTopKAnchors($\mathbf{A}$, $k$)
Step 2: initial_rois ← GenerateSymmetricROIs(anchor_points, $s_\text{min}$)
Step 3: expanded_rois ← ExpandROIs(initial_rois, $\mathbf{A}$, $s_\text{max}$, $s_\text{dist}$)
Step 4: filtered_rois ← ApplyNMS(expanded_rois, $\text{nms}_\text{threshold}$)
Step 5: ranked_rois ← RankROIsByConfidence(filtered_rois, $n_\text{steps}$)
Step 6: answer ← FinalInference(MLLM, ranked_rois)
Output: MLLM answer based on the top-ranked ROI
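
To make Step 4 concrete, a minimal sketch of what ApplyNMS could look like is given below. The box format (x1, y1, x2, y2), the use of per-ROI relevance scores, and the function names are illustrative assumptions, not our exact implementation:

```python
def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2); returns intersection-over-union of the two boxes.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def apply_nms(rois, scores, nms_threshold=0.3):
    # Greedy NMS: keep the highest-scoring ROI, discard ROIs that overlap a kept
    # ROI by more than nms_threshold, and continue with the remaining candidates.
    order = sorted(range(len(rois)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(rois[i], rois[j]) <= nms_threshold for j in kept):
            kept.append(i)
    return [rois[i] for i in kept]
```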

"Python- or PyTorch-specific notations (e.g., reshape), which are not appropriate in mathematical equations. These should be replaced with standard mathematical notation."

We agree that replacing Python- or PyTorch-specific notations with standard mathematical notation would enhance clarity. We will revise the equations accordingly in the updated version of the paper.

References:

R1: An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models, Chen et al, ECCV 2024

R2: PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction, Xing et al, CVPR 2025

Comment

Thank you for conducting the additional experiments on inference cost. The results clearly demonstrate the effectiveness of the proposed method. Although the best accuracy achieved by the proposed approach does not surpass that of ZoomEye, its inference efficiency is significantly superior. As my concerns have been sufficiently addressed, I am willing to raise my score.

Review (Rating: 5)

This paper proposes a novel approach for fine-grained VQA called FOCUS, based on visual cropping techniques, where the objective is to capture the regions of an image that are relevant to a given question. The authors propose a training-free method that leverages the self-attention mechanism of an MLLM to build object relevance maps. Their system has the following steps: i) identify target objects in the question using in-context learning, ii) generate pseudo-attention using cached token similarity from MLLMs, iii) build object relevance maps, iv) propose regions of interest, v) rank regions of interest, and vi) prompt the MLLM with the selected region for answer generation. FOCUS is evaluated on three fine-grained VQA datasets and two MLLMs, one global-view MLLM (LLaVA-1.5) and one global-local-view MLLM (LLaVA-OneVision). In general, FOCUS presents similar or superior performance compared with SOTA techniques while being significantly more efficient. Especially strong results are obtained with LLaVA-OneVision. In addition to the main results, the authors also present an ablation study and some qualitative examples.

Strengths and Weaknesses

Strengths:

  1. The use of cached token similarity to build object relevance maps, since it is a technique that could be applied to different problems, not only fine-grained VQA.
  2. The obtained results are strong, considering both performance and efficiency, especially for LLaVA-OneVision.
  3. The training-free nature of FOCUS, since it shows that current MLLMs are more capable than immediately obvious and that there is much room for improvement working on inference-time techniques.
  4. The paper is well structured and written. Authors motivate their work very well, providing a comprehensive analysis of the state-of-the-art. The method is clearly explained and the experiments are meaningful.

Weaknesses:

  1. FOCUS has many hyperparameters, as it is quite a complex inference-time algorithm. Authors show the values for those hyperparameters in the appendices, but they don't specify how they set them. There are also many design decisions that I suppose they had to test, such as the prompts used, applying a Gaussian filter to the pseudo-attention map, normalizing confidence scores... Do you use a development dataset to take all those decisions? Do you use one of the MLLMs only to find those hyperparameters and then apply them to the other as well? Those details are very important in my opinion and I couldn't find much information about them.
  2. From the paper text, I understand that the ablation studies of Section 4.3 are done on the evaluation datasets (V*Bench and HRBench-4K). This is not a good practice, since ablation studies and development experiments can be connected. Any reader may suspect that ablation studies are actually used to develop the system, optimizing for the test datasets. This weakness is related to the previous one. I think the authors have to make those points clear.

Questions

I have no further questions or suggestions besides those formulated in the weaknesses.

Limitations

Yes.

Justification of Final Rating

I think the authors have addressed all my concerns suitably. Given the contributions of the paper, the clear writing and the new results provided by the authors during the rebuttal, I recommend the paper for acceptance.

Formatting Issues

No

Author Response

We are grateful for your feedback and pleased to hear that you see broad application possibilities for the core concepts of FOCUS, not only for fine-grained VQA but for a wide spectrum of visual reasoning tasks!

Design choices

"FOCUS has many hyperparameters, as it is quite a complex inference-time algorithm. Authors show the values for those hyperparameters in the appendices, but they don't specify how they set them. There are also many design decisions that I suppose they had to test, such as the prompts used, applying a Gaussian filter to the pseudo-attention map, normalizing confidence scores...

We thank you for pointing out the need for more clarity regarding hyperparameter selection and design choices. We first explain our design choices. Several components of FOCUS are based on best practices from prior work rather than iterative tuning:

  • Gaussian filter on pseudo-attention: We apply a Gaussian filter with σ = 1 as a standard denoising step commonly used in image processing [R1]; a short sketch follows this list. Empirically, we observe that this results in cleaner and more coherent object relevance maps.
  • Confidence normalization: This is adopted from ZoomEye and used without tuning the normalization function.
  • Existence prompt: This is inspired by ZoomEye, with only minor wording adjustments to better fit our task context.
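
Below is a minimal sketch of this smoothing step. The 24 × 24 grid size is an assumption for illustration (it matches the 576-token patch grid of LLaVA-1.5 but is not meant as the exact shape used in every configuration):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Toy pseudo-attention map on an assumed 24 x 24 patch grid.
pseudo_attention = np.random.rand(24, 24)

# Standard Gaussian denoising with sigma = 1, as described above.
smoothed = gaussian_filter(pseudo_attention, sigma=1)
```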

Development set and robustness w.r.t. hyperparameters

Do you use a development dataset to take all those decisions? Do you use one of the MLLMs only to find those hyperparameters and then apply them to the other as well? Those details are very important in my opinion and I couldn't find much information about them."

and

"From the paper text, I understand that the ablation studies of Section 4.3 are done on the evaluation datasets (VBench and HRBench-4K). This is not a good practice, since ablation studies and development experiments can be connected. Any reader may suspect that ablation studies are actually used to develop the system, optimizing for the test datasets. This weakness is related with the previous. I think authors have to make those points clear."

We acknowledge that we did not use a dedicated development set. One practical challenge is that high-quality, fine-grained VQA data is still quite scarce. We select hyperparameters based on common values from the literature and did not systematically tune them to optimize performance on the evaluation datasets. Regarding the ablation studies in Section 4.3, we confirm that they were conducted strictly post-hoc, after the method was finalized and all hyperparameters were fixed. We use them exclusively to interpret and validate design choices, not to guide system development.

To directly address your concern regarding potential overfitting or reliance on hyperparameter tuning, we conduct a systematic robustness analysis using both LLaVA-1.5 and LLaVA-OV on V*Bench. We examine the impact of key hyperparameters, including the number of anchor points $k$, the minimal Euclidean distance between anchor points $s_\text{dist}$, the maximum ROI size $s_\text{max}$, the ROI expansion threshold, and the NMS threshold. As shown in the tables below, the accuracy obtained with alternative hyperparameter choices degrades by at most 4.7% compared to the accuracy achieved with the original settings, demonstrating the stability of our method. During the robustness analysis, we also discover some alternative hyperparameter configurations that achieve even higher accuracy than our default settings.

In the following, we present the results of our hyperparameter robustness analysis in detail. For the number of anchor points $k$, we uniformly sampled 20 values and report the mean accuracy along with a 95% confidence interval (corresponding to 2 standard deviations).

| Model | $k$ | Accuracy (%) | Comment |
| --- | --- | --- | --- |
| LLaVA-1.5 | 30 | 72.77 | original choice |
| LLaVA-1.5 | random(10, 50) | 72.77 ± 1.55 ($\Delta$ = +0.00) | alternative |
| LLaVA-OV | 30 | 92.15 | original choice |
| LLaVA-OV | random(10, 50) | 92.03 ± 1.37 ($\Delta$ = −0.12) | alternative |

For the minimum distance between ROI anchor points $s_\text{dist}$, we varied the value from 2 to 4 for LLaVA-1.5, and from 2 to 5 for LLaVA-OV.

| Model | $s_\text{dist}$ | Accuracy (%) | Comment |
| --- | --- | --- | --- |
| LLaVA-1.5 | 2 | 72.77 | original choice |
| LLaVA-1.5 | 3 | 72.25 ($\Delta$ = −0.52) | alternative |
| LLaVA-1.5 | 4 | 69.11 ($\Delta$ = −3.66) | alternative |
| LLaVA-OV | 2 | 92.15 | original choice |
| LLaVA-OV | 3 | 91.62 ($\Delta$ = −0.53) | alternative |
| LLaVA-OV | 4 | 92.15 ($\Delta$ = +0.00) | alternative |
| LLaVA-OV | 5 | 91.62 ($\Delta$ = −0.53) | alternative |

For the maximum ROI size $s_\text{max}$, we explored alternative values ranging from 5 to 9 for LLaVA-1.5, and from 5 to 11 for LLaVA-OV.

| Model | $s_\text{max}$ | Accuracy (%) | Comment |
| --- | --- | --- | --- |
| LLaVA-1.5 | 5 | 72.77 | original choice |
| LLaVA-1.5 | 7 | 69.63 ($\Delta$ = −3.14) | alternative |
| LLaVA-1.5 | 9 | 68.06 ($\Delta$ = −4.71) | alternative |
| LLaVA-OV | 9 | 92.15 | original choice |
| LLaVA-OV | 5 | 94.24 ($\Delta$ = +2.09) | alternative |
| LLaVA-OV | 7 | 91.01 ($\Delta$ = −1.14) | alternative |
| LLaVA-OV | 11 | 89.53 ($\Delta$ = −2.62) | alternative |

For the ROI expansion threshold, we uniformly sampled 20 values and reported the mean accuracy along with a 95% confidence interval (corresponding to 2 standard deviations).

| Model | ROI expansion threshold | Accuracy (%) | Comment |
| --- | --- | --- | --- |
| LLaVA-1.5 | 0.5 | 72.77 | original choice |
| LLaVA-1.5 | random(0.3, 0.7) | 72.77 ± 0.00 ($\Delta$ = +0.00) | alternative |
| LLaVA-OV | 0.5 | 92.15 | original choice |
| LLaVA-OV | random(0.3, 0.7) | 92.70 ± 0.25 ($\Delta$ = +0.55) | alternative |

Moreover, we report accuracy using NMS thresholds ranging from 0.1 to 0.7 to demonstrate robustness w.r.t. this parameter.

| Model | NMS threshold | Accuracy (%) | Comment |
| --- | --- | --- | --- |
| LLaVA-1.5 | 0.3 | 72.77 | original choice |
| LLaVA-1.5 | 0.1 | 70.16 ($\Delta$ = −2.61) | alternative |
| LLaVA-1.5 | 0.5 | 72.25 ($\Delta$ = −0.52) | alternative |
| LLaVA-1.5 | 0.7 | 72.25 ($\Delta$ = −0.52) | alternative |
| LLaVA-OV | 0.3 | 92.15 | original choice |
| LLaVA-OV | 0.1 | 92.15 ($\Delta$ = +0.00) | alternative |
| LLaVA-OV | 0.5 | 92.15 ($\Delta$ = +0.00) | alternative |
| LLaVA-OV | 0.7 | 92.15 ($\Delta$ = +0.00) | alternative |

In addition to demonstrating robustness to hyperparameter changes, we also validate the robustness of FOCUS across different model architectures. Specifically, we integrate Qwen-VL-2.5-7B with FOCUS using hyper-parameters similar to those applied to LLaVA-OV-7B. As shown in the table below, FOCUS substantially improves performance on various datasets compared to the vanilla Qwen model. These results further demonstrate the robustness of FOCUS across different MLLMs. We will include these results in the revised version of the paper.

| Model | Dataset | Accuracy (%) |
| --- | --- | --- |
| Qwen-2.5-VL | V*Bench | 79.06 |
| Qwen-2.5-VL + FOCUS | V*Bench | 90.58 ($\Delta$ = +11.52) |
| Qwen-2.5-VL | HRBench-4K | 71.62 |
| Qwen-2.5-VL + FOCUS | HRBench-4K | 79.25 ($\Delta$ = +7.63) |
| Qwen-2.5-VL | HRBench-8K | 68.62 |
| Qwen-2.5-VL + FOCUS | HRBench-8K | 76.25 ($\Delta$ = +7.63) |

Our analysis of FOCUS's robustness with respect to hyperparameter selection shows that the method's performance is stable across a range of settings and thus broadly applicable.

References:

R1: Digital Image Processing, Gonzalez and Woods, 2008

Comment

Thanks to the authors for the clarifications. I think that the hyperparameter value analysis you provide in the rebuttal should be added in the appendices of the paper. You should also explicitly mention the lack of a development set and how you set the hyperparameter values to avoid any suspicion of bad practices. As all my concerns have been suitably addressed, I will keep my score. Congrats for your work.

Review (Rating: 4)

This paper presents FOCUS, a training-free method for fine-grained visual question answering (VQA) that uses the internal token features of Multimodal Large Language Models (MLLMs) to find the most relevant parts of an image. Unlike prior methods that rely on exhaustive search or extra training, FOCUS quickly locates the target object by measuring cosine similarity between the cached value tokens of the target word and image regions. It then selects, checks, and ranks candidate regions, so the final VQA is only performed on the most likely area.

Strengths and Weaknesses

Strengths:

  1. The method is entirely training-free and operates only at inference time, allowing it to generalize well across models and datasets without any retraining or task-specific adaptation.
  2. FOCUS achieves high computational efficiency, reducing the number of forward passes and inference time compared to previous visual cropping approaches, while maintaining or improving accuracy.
  3. FOCUS is practical to integrate with a wide range of modern Multimodal Large Language Models.

Weaknesses:

  1. The current experiments focus only on multiple-choice VQA tasks. The method has not been evaluated on generative or open-ended VQA, and it remains unclear how well FOCUS would generalize to these settings.

Questions

  1. The paper uses several hyperparameters, like the minimum distance between ROI anchors and the ROI size threshold. Did you test how sensitive FOCUS is to these choices? Is there an ablation on these hyperparameters?
  2. In the Limitations section, you mention that FOCUS struggles with very small objects in high-res images. Can you show more failure cases or examples to help understand this problem?

Limitations

Can FOCUS handle questions that involve spatial relationships, such as "the red ball on the left," where token similarity maps may not uniquely identify the correct region?

Justification of Final Rating

My questions are answered by the authors. I'll maintain my rating.

Formatting Issues

No obvious formatting issues were found.

Author Response

We sincerely appreciate your detailed assessment of our work and value the thoughtful questions, which provide meaningful direction for improvements of our paper.

Open-ended VQA

"The current experiments focus only on multiple-choice VQA tasks. The method has not been evaluated on generative or open-ended VQA, and it remains unclear how well FOCUS would generalize to these settings."

We agree with you that this is a valuable consideration, especially when assessing FOCUS for real-world usage. To our knowledge, there are no fine-grained datasets focusing on open-ended VQA. Therefore, we reuse V*Bench for open-ended evaluation. While it primarily follows the multiple-choice format, it also provides the ground-truth answer in natural language, e.g., "The color of the dog is white." To explore the open-ended VQA capabilities of FOCUS and baseline models, we provide them with VQA questions without answer options, e.g., "What is the color of the dog?". We then compare their responses with the respective ground-truth sequence. We evaluate the responses of the MLLMs using an LLM-as-a-judge framework [R1], leveraging Qwen-2.5-7B. Moreover, we manually review Qwen-2.5's judgments and correct any misclassifications. The results of this analysis for LLaVA-1.5 clearly show that FOCUS substantially improves fine-grained open-ended VQA performance, increasing accuracy from 44.50% for the vanilla LLaVA-1.5 model to 65.97%:

| Model | Accuracy (%) |
| --- | --- |
| LLaVA-1.5 | 44.50 |
| LLaVA-1.5 + FOCUS | 65.97 |
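
As a rough illustration of this LLM-as-a-judge protocol, a sketch is shown below. The prompt wording, the one-word verdict format, and the judge_fn wrapper are assumptions for this sketch, not our exact setup:

```python
def judge_answer(question, ground_truth, model_answer, judge_fn):
    """Return True if the judge LLM deems model_answer consistent with ground_truth.

    judge_fn is any callable mapping a prompt string to the judge model's text
    output (e.g., a wrapper around Qwen-2.5-7B); it is a placeholder here.
    """
    prompt = (
        "You are grading an open-ended VQA answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {ground_truth}\n"
        f"Model answer: {model_answer}\n"
        "Reply with exactly one word: CORRECT or INCORRECT."
    )
    verdict = judge_fn(prompt).strip().upper()
    return verdict.startswith("CORRECT")

# Example with a stub judge, for illustration only:
print(judge_answer(
    "What is the color of the dog?",
    "The color of the dog is white.",
    "It is white.",
    judge_fn=lambda prompt: "CORRECT",
))
```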

Sensitivity to hyperparameters

"The paper uses several hyperparameters, like the minimum distance between ROI anchors and the ROI size threshold. Did you test how sensitive FOCUS is to these choices? Is there an ablation on these hyperparameters?"

We thank you for pointing out the need for more clarity regarding hyperparameter sensitivity. To address your question, we conduct a systematic sensitivity analysis using both LLaVA-1.5 and LLaVA-OV on V*Bench. We examine the impact of key hyperparameters, including the number of anchor points $k$, the minimal Euclidean distance between anchor points $s_\text{dist}$, the maximum ROI size $s_\text{max}$, the ROI expansion threshold, and the NMS threshold. As shown in the tables below, the accuracy obtained with alternative hyperparameter choices degrades by at most 4.7% compared to the accuracy achieved with the original settings, demonstrating our method's low sensitivity to hyperparameters. During the sensitivity analysis, we also discover some alternative hyperparameter configurations that achieve even higher accuracy than our default settings. In the following, we present the results of our hyperparameter sensitivity analysis in detail. For the number of anchor points $k$, we uniformly sampled 20 values and report the mean accuracy along with a 95% confidence interval (corresponding to 2 standard deviations).

| Model | $k$ | Accuracy (%) | Comment |
| --- | --- | --- | --- |
| LLaVA-1.5 | 30 | 72.77 | original choice |
| LLaVA-1.5 | random(10, 50) | 72.77 ± 1.55 ($\Delta$ = +0.00) | alternative |
| LLaVA-OV | 30 | 92.15 | original choice |
| LLaVA-OV | random(10, 50) | 92.03 ± 1.37 ($\Delta$ = −0.12) | alternative |

For the minimum distance between ROI anchor points $s_\text{dist}$, we varied the value from 2 to 4 for LLaVA-1.5, and from 2 to 5 for LLaVA-OV.

| Model | $s_\text{dist}$ | Accuracy (%) | Comment |
| --- | --- | --- | --- |
| LLaVA-1.5 | 2 | 72.77 | original choice |
| LLaVA-1.5 | 3 | 72.25 ($\Delta$ = −0.52) | alternative |
| LLaVA-1.5 | 4 | 69.11 ($\Delta$ = −3.66) | alternative |
| LLaVA-OV | 2 | 92.15 | original choice |
| LLaVA-OV | 3 | 91.62 ($\Delta$ = −0.53) | alternative |
| LLaVA-OV | 4 | 92.15 ($\Delta$ = +0.00) | alternative |
| LLaVA-OV | 5 | 91.62 ($\Delta$ = −0.53) | alternative |

For the maximum ROI size $s_\text{max}$, we explored alternative values ranging from 5 to 9 for LLaVA-1.5, and from 5 to 11 for LLaVA-OV.

| Model | $s_\text{max}$ | Accuracy (%) | Comment |
| --- | --- | --- | --- |
| LLaVA-1.5 | 5 | 72.77 | original choice |
| LLaVA-1.5 | 7 | 69.63 ($\Delta$ = −3.14) | alternative |
| LLaVA-1.5 | 9 | 68.06 ($\Delta$ = −4.71) | alternative |
| LLaVA-OV | 9 | 92.15 | original choice |
| LLaVA-OV | 5 | 94.24 ($\Delta$ = +2.09) | alternative |
| LLaVA-OV | 7 | 91.01 ($\Delta$ = −1.14) | alternative |
| LLaVA-OV | 11 | 89.53 ($\Delta$ = −2.62) | alternative |

For the ROI expansion threshold, we uniformly sampled 20 values and reported the mean accuracy along with a 95% confidence interval (corresponding to 2 standard deviations).

| Model | ROI expansion threshold | Accuracy (%) | Comment |
| --- | --- | --- | --- |
| LLaVA-1.5 | 0.5 | 72.77 | original choice |
| LLaVA-1.5 | random(0.3, 0.7) | 72.77 ± 0.00 ($\Delta$ = +0.00) | alternative |
| LLaVA-OV | 0.5 | 92.15 | original choice |
| LLaVA-OV | random(0.3, 0.7) | 92.70 ± 0.25 ($\Delta$ = +0.55) | alternative |

Moreover, we report accuracy using NMS thresholds ranging from 0.1 to 0.7 to demonstrate robustness w.r.t. this parameter.

| Model | NMS threshold | Accuracy (%) | Comment |
| --- | --- | --- | --- |
| LLaVA-1.5 | 0.3 | 72.77 | original choice |
| LLaVA-1.5 | 0.1 | 70.16 ($\Delta$ = −2.61) | alternative |
| LLaVA-1.5 | 0.5 | 72.25 ($\Delta$ = −0.52) | alternative |
| LLaVA-1.5 | 0.7 | 72.25 ($\Delta$ = −0.52) | alternative |
| LLaVA-OV | 0.3 | 92.15 | original choice |
| LLaVA-OV | 0.1 | 92.15 ($\Delta$ = +0.00) | alternative |
| LLaVA-OV | 0.5 | 92.15 ($\Delta$ = +0.00) | alternative |
| LLaVA-OV | 0.7 | 92.15 ($\Delta$ = +0.00) | alternative |

In addition to demonstrating insensitivity to hyperparameter changes, we also validate the robustness of FOCUS across different model architectures. Specifically, we integrate Qwen-VL-2.5-7B with FOCUS using hyperparameters similar to those applied to LLaVA-OV-7B. As shown in the table below, FOCUS substantially improves performance on various datasets compared to the vanilla Qwen model. These results further demonstrate the robustness of FOCUS across different MLLMs. We will include these results in the revised version of the paper.

| Model | Dataset | Accuracy (%) |
| --- | --- | --- |
| Qwen-2.5-VL | V*Bench | 79.06 |
| Qwen-2.5-VL + FOCUS | V*Bench | 90.58 ($\Delta$ = +11.52) |
| Qwen-2.5-VL | HRBench-4K | 71.62 |
| Qwen-2.5-VL + FOCUS | HRBench-4K | 79.25 ($\Delta$ = +7.63) |
| Qwen-2.5-VL | HRBench-8K | 68.62 |
| Qwen-2.5-VL + FOCUS | HRBench-8K | 76.25 ($\Delta$ = +7.63) |

Our analysis of FOCUS's robustness with respect to hyperparameter selection shows that the method's performance is stable across a range of settings and thus broadly applicable.

Limitations regarding extremely small objects

"In the Limitations section, you mention that FOCUS struggles with very small objects in high-res images. Can you show more failure cases or examples to help understand this problem?"

We would like to refer you to Figure 7 (IV) on page 25 of the appendix, which illustrates this issue. Due to the updated rebuttal guidelines, we are unable to include additional qualitative examples at this stage. In a revised version of the paper, we will incorporate further examples to more clearly demonstrate this limitation.

VQA regarding spatial concepts

"Can FOCUS handle questions that involve spatial relationships, such as "the red ball on the left," where token similarity maps may not uniquely identify the correct region?"

This question highlights an important limitation of current MLLMs. These models exhibit poor understanding of spatial concepts such as "on the left" and "on the right" in the image [R2-4]. While targeted fine-tuning can partially mitigate this issue [R3], it is not part of the standard training of MLLMs. Consequently, our FOCUS method inherits this limitation and may produce ambiguous object relevance maps when applied to off-the-shelf MLLMs. We will explicitly acknowledge this limitation in a revised version of the paper.

References:

R1: Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, Zheng et al, NeurIPS 2023

R2: Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs, Tong et al, CVPR 2024

R3: SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities, Chen et al, CVPR 2024

R4: Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models, Qi et al, arXiv 2025

Comment

Thanks to the authors for the detailed rebuttal. My questions regarding the hyperparameter design and the method’s performance on open-ended VQA have been addressed. I will maintain my original score.

Final Decision

The paper was reviewed by four experts who provided detailed comments. Among the weaknesses raised, a common concern was the large number of hyperparameters that the method required. The authors addressed this concern and others in their response, to the satisfaction of the reviewers and area chair. In the end, the reviewers unanimously recommend accepting the paper and the area chair sees no reason to override their recommendation.