CoFFT: Chain of Foresight-Focus Thought for Visual Language Models
Abstract
Reviews and Discussion
This paper introduces the Chain of Foresight-Focus Thought (CoFFT), a training-free approach that aims to enhance VLMs' visual reasoning. The pipeline consists of three stages: (1) Diverse Sample Generation; (2) Dual Foresight Decoding; (3) Visual Focus Adjustment. Through experiments, the authors demonstrate that the method is effective across different models.
Strengths and Weaknesses
In general, I would view this work as an attempt to construct a deep thinking process similar to the o3 model. I find the proposed method in the paper self-consistent. The evaluations are mostly thorough.
However, I have two major concerns:
(1) The paper introduces quite a lot of new hyperparameters. How were these hyperparameters determined? Were they found through experimentation or chosen arbitrarily? The authors should clarify this or provide additional experiments.
(2) I calculated the results from the ablation study in Table 3 and found that the average results of the two ablation experiments are 45.32 and 45.36, respectively. In other words, each individual component performs similarly to the baseline algorithms (45.05). This raises the question: why should we adopt the two methods proposed in the paper? Would using the baseline combined with only one of the proposed methods yield comparable performance?
Overall, I find this is a technically solid paper where reasons to accept outweigh reasons to reject. Therefore, I recommend borderline accept.
Questions
- The authors are expected to conduct experiments during the rebuttal to explain why the ablation results are not good (as mentioned before). I would be happy to adjust my score based on the quality of the response.
- In Tables 2, 3, and 4, why does the number of significant figures vary (sometimes one, sometimes two)?
- In line 176, the parameter symbol in the text does not match the one used in Equation (5); this appears to be a typo.
Limitations
The authors mention the following limitation: "However, while CoFFT successfully solves previously unsolvable problems for VLMs, it may introduce unexpected errors in cases where VLMs alone perform well, indicating potential interference with their original well-established capabilities."
The authors should clearly state what the unexpected errors are using concrete examples.
Final Justification
The author addressed the issues I raised during the rebuttal period well. The other reviewers do not appear to have any fundamental objections to the paper. Therefore, I am happy to raise my score to 5.
Formatting Concerns
N/A
Q1: For the Determination of Hyperparameters:
We appreciate your valuable suggestions. Both hyperparameters employed in CoFFT are selected through systematic experimentation.
Hyperparameter Definitions and Functions:
- The first hyperparameter appears in our Dual-Foresight Decoding, where it balances the visual focus score and the reasoning progression score when evaluating candidate reasoning samples.
- The second hyperparameter is used in our Visual Focus Adjustment, where it controls the suppression strength of previously explored regions, enabling the model to balance maintaining a global perspective with exploring new region details. (A hedged sketch of both roles is given below.)
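For concreteness, the following sketch shows one plausible form of these two roles. It assumes a simple linear combination for the decoding score and a multiplicative suppression of the attention map; all symbols here (alpha, beta, S_v, S_r, A, M_explored) are illustrative placeholders rather than the paper's notation.

```latex
% Hedged sketch, not the paper's exact equations.
% \alpha: score-balancing weight;  \beta: suppression strength;
% S_v, S_r: visual focus and reasoning progression scores of sample s_i;
% A: attention map;  M_explored: mask of previously explored regions.
\begin{align}
  \mathrm{Score}(s_i) &= \alpha \, S_v(s_i) + (1 - \alpha)\, S_r(s_i), \\
  \tilde{A}(x, y) &= A(x, y)\,\bigl(1 - \beta\, M_{\mathrm{explored}}(x, y)\bigr).
\end{align}
```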
We conduct comprehensive experiments to demonstrate the determination process for these parameters and analyze their impact on performance. For thorough evaluation, we select two representative datasets: SeekWorld-China (requiring fine-grained visual detail comprehension) and MathVista (demanding reasoning capabilities in addition to visual understanding).
Score-Balancing Hyperparameter: This hyperparameter balances the visual focus score and the reasoning progression score. As shown in the table below, lower values reduce performance on SeekWorld-China because they under-use visual information. Conversely, an excessively high value may interfere with the reasoning process, resulting in marginal performance degradation on MathVista. A value of 0.3 achieves the best balance between these competing requirements, and we adopt it in our experiments.
| Value | MathVista (Acc.) | SeekWorld-China (Acc.) |
|---|---|---|
| 0.2 | 70.5 | 34.58 |
| 0.3 | 70.4 | 35.12 |
| 0.4 | 69.8 | 35.66 |
| 0.5 | 69.3 | 35.66 |
Suppression-Strength Hyperparameter: This hyperparameter controls the suppression strength of previously explored regions during Visual Focus Adjustment (VFA), thereby balancing the maintenance of a global perspective with the exploration of new regions. An excessively high value leads to aggressive exploration of new regions, so the VLM falls into local optima on problems requiring a global view, degrading performance. Conversely, an excessively low value would weaken VFA's drive to explore new regions, affecting its performance on SeekWorld-China. A value of 0.3 represents a robust and balanced choice for this trade-off.
| Value | MathVista (Acc.) | SeekWorld-China (Acc.) |
|---|---|---|
| 0.2 | 70.4 | 33.78 |
| 0.3 | 70.4 | 35.12 |
| 0.4 | 69.9 | 35.39 |
| 0.5 | 69.5 | 35.66 |
Q2: For the Ablation Study and Component Synergy
We appreciate your astute observation regarding the ablation study results and the critical question of component synergy.
To demonstrate that the Visual Focus Adjustment (VFA) and Dual-Foresight Decoding (DFD) in CoFFT are synergistically designed rather than simply interchangeable, we conduct a more comprehensive analysis and perform a series of experiments.
1. Distinct Functional Design
Reviewing our ablation results, we observe that while DFD and VFA achieve similar overall performance, they exhibit distinct strengths across different benchmarks. DFD demonstrates more pronounced effects on reasoning-intensive mathematical datasets, while VFA shows greater impact on fine-grained perception tasks in SeekWorld-China and SeekWorld-Global. This indicates that these two components primarily operate in different aspects.
| Method | MathVista | MathVision | MMStar | M3CoT | Charxiv | S.W-China | S.W-Global | AVG |
|---|---|---|---|---|---|---|---|---|
| Ours | 70.4 | 23.36 | 69.4 | 62.47 | 47.2 | 35.12 | 29.37 | 48.19 |
| w/o DFD | 68.5 | 20.42 | 66.5 | 61.39 | 44.8 | 28.42 | 27.19 | 45.32 |
| w/o VFA | 69.3 | 21.71 | 67.4 | 61.09 | 44.7 | 27.08 | 26.25 | 45.36 |
2. Synergistic Design as a Whole
The average accuracy rates for various methods are as follows:
- Qwen2.5-VL-7B (baseline): 42.72%
- Predictive Decoding (reasoning enhancement method): 45.05%
- DyFo (visual-search method): 44.98%
- CoFFT: 48.19%
We subsequently test various combinations of visual and reasoning modules, conducting experiments on MathVista and SeekWorld-China benchmarks to comprehensively evaluate their performance.
| Combination | MathVista | SeekWorld-China | Average |
|---|---|---|---|
| Baseline | 68.2 | 21.45 | 44.83 |
| DyFo + Predictive Decoding | 69.0 | 33.24 | 51.02 |
| Visual Focus Adjustment (VFA) + Predictive Decoding | 68.5 | 31.37 | 50.24 |
| DyFo + Dual Foresight Decoding (DFD) | 69.3 | 34.05 | 51.73 |
| CoFFT (VFA + DFD) | 70.4 | 35.12 | 52.76 |
Analysis:
- DyFo + Predictive Decoding: This combination has a fundamental conflict: DyFo determines the image regions that Predictive Decoding can use. DyFo, as an independent visual search module, determines the area of interest directly from the question. For mathematical problems (MathVista), DyFo may crop out critical information such as numbers or geometric relationships, thereby interfering with Predictive Decoding's reasoning and degrading overall performance. For the geolocation task (SeekWorld-China), its focused view provides some improvement.
- VFA + Predictive Decoding: VFA is dynamic, adjusting visual focus after each reasoning step. In contrast, Predictive Decoding is designed for static visual inputs: it cannot assess the value of the current reasoning path from a visual-information perspective and can only passively accept the image regions provided by VFA, so irrelevant and confusing cropped regions are introduced during reasoning. These regions disrupt Predictive Decoding's reasoning process, leading to overall performance degradation.
- DyFo + Dual Foresight Decoding (DFD): This combination is still not optimal because DyFo's focusing behavior is not guided by DFD's foresight; even if we prompt DyFo to acquire an image region once per inference step, it can only adjust based on the current path. Moreover, DyFo's scoring comes from the Lang-Segment-Anything expert visual model [1], and this scoring has compatibility issues with the relative attention mechanism in DFD, further constraining overall performance. Although this combination brings some improvement, it lacks the tight feedback loop between focus and reasoning that is core to CoFFT, so the gain is limited.
Conclusion:
These experiments powerfully demonstrate that simply connecting existing visual search and reasoning methods is not optimal and may even be detrimental. CoFFT's advantage lies precisely in the synergistic design of its VFA and DFD modules. DFD evaluates reasoning paths based on both reasoning progress and image relevance, and provides the foundation for VFA to determine subsequent reasoning-relevant regions; meanwhile, VFA's precise focusing provides DFD with high-quality, low-noise input, enabling more reliable judgments. This "tightly-coupled" iterative loop is the key to significant performance improvement and represents one of CoFFT's core contributions.
Q3: For the Number of Significant Figures in Tables
We appreciate your meticulous observation. The reported precision is determined by the sample size of each dataset and by conventions in related work. Except for MathVista, CharXiv, and MMStar, all benchmarks are reported with two decimal places.
MathVista and CharXiv each contain 1,000 samples. At this scale, the minimum measurable change in accuracy is 0.1%, so results can only be meaningfully reported to one decimal place.
For MMStar, we observe that the official reports from the Qwen and Intern teams, as well as the original MMStar paper, report performance to one decimal place. We follow this convention to facilitate direct and clear comparison with these authoritative results.
Q4: For the Typo in Line 176
Thank you for pointing out our oversight. The parameter named in the textual description of Equation (5) on line 176 does not match the symbol used in the equation itself. This is a typographical error, and we will correct it in the final version.
Q5: For Providing Concrete Examples for Limitations
Due to the inability to insert images, we can only provide a textual description of a concrete example illustrating the "unexpected errors" that CoFFT may introduce.
Concrete Example: Consider a question about a complex diagram, such as a food chain diagram containing numerous organisms and multiple connecting lines representing predatory relationships. The question might be: "When organism X decreases, how will organism Y change?"
VLM: The VLM can process the entire diagram from a global perspective, correctly tracing the predatory relationship between X and Y to determine the final result.
Potential Errors Introduced by CoFFT:
- During the VFA stage, the framework may be attracted to visually dense regions—for example, areas containing X or Y—considering them "most informative," leading to magnification that may introduce content from other organisms surrounding X or Y.
- Although CoFFT's mechanism allows the model to recover the global view in subsequent steps, this region focus may introduce contextual interference.
- Consequently, during subsequent reasoning, the model may over-focus on this intermediate information, disrupting the overall reasoning process and leading to an incorrect final answer.
However, it should be noted that all current methods for cropping or searching images carry this risk of introducing potential interference due to the introduction of partial images. This is not a limitation unique to CoFFT.
[1] Luca Medeiros. Language segment-anything, 2024.
The author addressed the issues I raised during the rebuttal period well. The other reviewers do not appear to have any fundamental objections to the paper. Therefore, I am happy to raise my score to 5. If the paper is accepted, I look forward to the authors incorporating the revisions made during the rebuttal into the final version.
Thank you for your positive feedback and for acknowledging our revisions. We are delighted to hear that our responses addressed your concerns effectively. We will incorporate all the revisions made during the rebuttal period into the final version of the paper. We appreciate your support and recognition.
This paper introduces the Chain of Foresight-Focus Thought (CoFFT), a novel training-free approach that enhances VLMs’ visual reasoning by emulating human visual cognition and mitigating task-irrelevant interference or hallucinations. CoFFT iteratively performs (1) Diverse Sample Generation to explore multiple reasoning paths, (2) Dual Foresight Decoding to select the optimal sample by jointly optimizing visual focus and reasoning progression, and (3) Visual Focus Adjustment to direct attention to regions most beneficial for subsequent reasoning. These three interdependent stages cycle until the final answer is reached, all without model modifications or retraining. Empirical results on Qwen2.5-VL, InternVL-2.5, and Llava-Next demonstrate consistent performance gains with controllable computational overhead.
Strengths and Weaknesses
Strengths
- The proposed training-free approach is valuable and its effectiveness is demonstrated across multiple VLM backbones, underscoring its general applicability.
- The method is novel and offers valuable insights into improving visual reasoning without any model finetuning.
- The Introduction is clearly written and effectively motivates the work.
- The Appendix provides a thorough algorithmic summary, which aids reproducibility and clarity.
Weaknesses
- The Method section is burdened by extensive mathematics, much of which appears only once and could be simplified. Notation is inconsistent—for example, it is unclear whether A_crop denotes a function or a result (see Equations (6) and (7)). The authors should adopt distinct typographical conventions (e.g., calligraphic vs. italic) to distinguish variables, functions, and operators.
- Figure 1 conveys limited insight into the core mechanics of CoFFT; it merely shows a reduction in reasoning steps without clearly illustrating the key stages or their interplay, making the example hard to interpret.
- The evaluation lacks experiments on off-the-shelf reasoning models. Given the inherent reasoning limitations of the chosen backbones, validating CoFFT on stronger reasoning baseline VLMs would better demonstrate its broad utility.
- It is better to use pseudocode instead of math formulation for clarity.
Questions
See weaknesses
Limitations
yes
Final Justification
The authors have addressed my concerns, so I keep my rating.
Formatting Concerns
None
Q1: Clarity of Method Section and Notation
Thank you for your valuable suggestion.
In the revised version of our paper, we will implement the following improvements:
- Streamlined Mathematical Formulation: We will revise Section 3 to make the mathematical exposition more accessible. For complex formulas that appear only once and can be clearly explained in prose, we will integrate their descriptions into the main text whenever possible. This will improve the flow of the narrative and reduce the cognitive load on the reader.
- Consistent Notation: You astutely identified the inconsistent use of A_crop in Equations (6) and (7). We acknowledge this oversight. In the revision, A_crop will be used exclusively to denote the attention map (i.e., a matrix). When referring to the value at a specific coordinate, we will use the functional form A_crop(x, y). This change will ensure its meaning is unambiguous throughout the paper.
- Adoption of Typographical Conventions: We will adopt your excellent suggestion to use distinct typographical conventions: normal fonts for variables, calligraphic fonts for functions, and italic fonts for operators. We will introduce these conventions in the overview section.
We are confident that these revisions will significantly enhance the clarity, consistency, and readability of our Method section.
Q2: Interpretation of Figure 1
Thank you for your insightful comment. We would first like to clarify its intended purpose, then provide our modification plan.
Figure 1 is designed as a motivational example to demonstrate the necessity of our method, rather than to explain its internal workings. Its goal is to contrast: (a) the lengthy, unfocused, and often erroneous reasoning process of a standard VLM, with (b) an efficient reasoning process guided by dynamic visual focus for the next reasoning step. This comparison highlights the problem we aim to solve and serves as the inspiration for CoFFT's design, which emulates the human ability for foresight and focus adjustment. The core mechanics of our method are detailed in Figure 2 (which presents the overall three-stage iterative framework).
To resolve any potential confusion in the revised manuscript, we will:
- Revise the caption of Figure 1 to explicitly label it as a "motivational example" and direct readers to Figure 2 for a comprehensive understanding of how CoFFT achieves this improved reasoning.
- Strengthen the in-text reference to Figure 1, clarifying its role in problem motivation and immediately guiding the reader to Figure 2 for the technical details of our solution.
Modify the Caption as follows:
A motivational example from the SeekWorld benchmark illustrating the difference in reasoning between OpenAI-o3 and our CoFFT-guided approach. (a) shows the original reasoning process of o3, which gets distracted by irrelevant information (crowds, pigeons) and arrives at an incorrect answer. (b) demonstrates how providing a focused visual region of the key semicircular buildings and curved pools, guided by a human-like cognitive process, helps the VLM identify the correct location as Jiangsu, China (Sun Yat-sen's Mausoleum). This comparison highlights the necessity of dynamic visual focus for the next reasoning step in complex reasoning tasks and motivates CoFFT's design—see Figure 2 for how CoFFT achieves this improved reasoning through its three-stage iterative framework.
Q3: Evaluation on Stronger Reasoning Models
We appreciate your valuable suggestion.
First, it is essential to clarify why leading proprietary models such as o3 or GPT-4o are not included in our initial experiments. CoFFT's core mechanism—Dual Foresight Decoding and Visual Focus Adjustment—critically depends on access to the model's internal text-image attention maps. These attention maps are essential for computing visual relevance and dynamically adjusting visual focus. Unfortunately, leading closed-source models are provided as "black-box" APIs that do not expose these underlying internal states, making direct application of CoFFT technically infeasible.
Second, to further demonstrate CoFFT's broad applicability, we conducted a series of new experiments on the state-of-the-art large-scale open-source model Qwen2.5-VL-72B (the experiments in the original manuscript used Qwen2.5-VL-32B). Our objective is to validate whether CoFFT still delivers significant improvements when applied to backbones that already possess strong intrinsic reasoning capabilities. The results are presented below. These findings demonstrate that CoFFT achieves significant average performance gains on 72B-scale models and validate our method's scalability.
| Method | MathVision | S.W-China | S.W-Global |
|---|---|---|---|
| Qwen2.5VL-32B | 25.33 | 24.13 | 28.41 |
| Qwen2.5VL-32B with CoFFT | 29.93 | 38.61 | 34.38 |
| Qwen2.5VL-72B | 36.18 | 37.80 | 42.18 |
| Qwen2.5VL-72B with CoFFT | 41.12 | 47.72 | 48.75 |
Q4: Use of Pseudocode for Clarity
Thank you for your valuable suggestion. We have prepared complete, line-by-line pseudocode, but due to space limitations we placed it in the appendix. However, as you suggest, pseudocode is valuable as a concise overview, so we will add the following condensed pseudocode at the beginning of Section 3 (Methods). This will enable readers to quickly and intuitively grasp the core workflow of CoFFT before the detailed mathematical description.
Algorithm 1: Chain of Foresight-Focus Thought (CoFFT) - High-Level Overview
Require: Original image, question, and the VLM
Ensure: Final reasoning result
1: Initialize the visual focus as the original image and the reasoning chain as empty
2: while not reached final answer do
3: // --- Stage 1: Diverse Sample Generation ---
4: S ← ∅
5: for i = 1 to k do
6: Sample a temperature from the temperature range // Ensure diversity
7: s_i ← Generate a reasoning sample using the VLM, the current focus image, the question, the reasoning so far, and the sampled temperature
8: S ← S ∪ {s_i}
9: end for
10:
11: // --- Stage 2: Dual Foresight Decoding ---
12: for each sample s_i in S do
13: Calculate its visual relevance score (via relative attention, cosine, IoU)
14: Calculate its reasoning progression score (via log-prob improvement)
15: end for
16: Select the best sample by combining the two scores with the balancing factor
17: // Update reasoning with the optimal first step
18: Append the first step of the best sample to the reasoning chain
19: // --- Stage 3: Visual Focus Adjustment ---
20: Compute the next focus map based on the question, the reasoning so far, and the selected future step
21: Find the optimal region with the highest score in the focus map
22: if the score of this region is significantly higher than the global average then
23: Crop and magnify this region from the original image as the new focus image
24: else
25: Revert the focus image to the full image // No clear focus is found
26: end if
27: end while
28: return the final reasoning result
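To make line 13 of Stage 2 more concrete, here is a minimal PyTorch sketch of how a visual relevance score could be computed from a VLM's internal attention maps. Everything in it is an illustrative assumption rather than the paper's implementation: the tensor shapes, the `image_token_slice` argument, the simple baseline-subtraction form of "relative attention," and the top-k summarization are placeholders, and the cosine-similarity and IoU terms mentioned in the pseudocode are omitted.

```python
import torch

def visual_relevance_score(attn, image_token_slice, baseline_attn=None):
    """Hedged sketch: score one candidate sample's grounding in the image.

    attn:              attention weights collected for the sample's generated
                       tokens, shape [num_layers, num_heads, num_gen_tokens,
                       seq_len] (e.g., stacked from ``outputs.attentions`` of
                       an open-source VLM run with ``output_attentions=True``).
    image_token_slice: slice of positions in seq_len corresponding to image
                       patch tokens (assumed known from the processor).
    baseline_attn:     optional attention from a question-only pass, used to
                       form a *relative* attention map.
    """
    # Average over layers and heads -> [num_gen_tokens, seq_len].
    attn_mean = attn.mean(dim=(0, 1))
    # Attention mass that generated text tokens place on image patches.
    patch_scores = attn_mean[:, image_token_slice].mean(dim=0)   # [num_patches]

    if baseline_attn is not None:
        base = baseline_attn.mean(dim=(0, 1))[:, image_token_slice].mean(dim=0)
        # "Relative attention": keep only patches attended more than baseline.
        patch_scores = torch.clamp(patch_scores - base, min=0.0)

    # Normalize to a distribution and summarize as a scalar: the mass in the
    # top 10% of patches is a placeholder proxy for "focused on a region".
    probs = patch_scores / (patch_scores.sum() + 1e-8)
    k = max(1, probs.numel() // 10)
    return torch.topk(probs, k).values.sum().item()
```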
Thank you for the detailed response; I will maintain the score.
Thank you for your positive feedback and for acknowledging our revisions. We appreciate your support and recognition.
The paper proposes Chain of Foresight-Focus Thought (CoFFT), a novel training-free framework designed to enhance the visual reasoning capabilities of Vision-Language Models (VLMs). CoFFT operates in an iterative loop of three stages: (1) Diverse Samples Generation (DSG) produces multiple multi-step reasoning paths under varied temperature settings; (2) Dual Foresight Decoding (DFD) jointly evaluates these samples based on reasoning quality and visual relevance to select the best one-step continuation; and (3) Visual Focus Adjustment (VFA) refines the visual input by identifying and cropping the most informative image region, feeding it back into the next reasoning iteration. Experiments across multiple models (e.g., Qwen2.5-VL, InternVL-2.5, and Llava-Next) on MathVista, MathVision, M3CoT, MMStar, CharXiv, and SeekWorld show that CoFFT achieves consistent gains of 3.1–5.8% in accuracy, with performance improvements scaling with model size.
Strengths and Weaknesses
Strengths:
- The idea is great and original. Relative attention is very intuitive.
- Extensive comparisons are conducted and the improvement is impressive, demonstrating the effectiveness of the method.
Weaknesses:
- Clarity of the paper and writing needs to be improved. For example,
  - What point are you trying to make in Figure 1? It is not clear from the caption and only makes sense when read with the main text. A similar issue applies to Figure 2. The captions should be somewhat self-contained; please add more details.
  - Line 51 -- should use complete sentences.
  - Figure 3 -- the caption uses (a) and (b), which do not correspond to the figure.
  - Lines 118-120 -- introduce terms and symbols not defined before, which confuses the reader.
- Writing needs to be improved: reorder the sections, or reduce the terminology and symbols that have not yet been defined.
- Limitations need to be discussed further, especially run-time, as you are creating repeated samples every iteration. This needs to be discussed further and perhaps addressed.
Questions
See the weaknesses above.
Limitations
Not discussed. One big issue and limitation is the added computation. The authors need to discuss it and potential ways to mitigate it, or defer it to future work.
Final Justification
I went over the rebuttal and the authors have addressed some of my concerns.
Formatting Concerns
No
Q1: On the Clarity of Figures and Captions
Thanks for your feedback regarding the clarity of our figures.
To address this, we will revise the captions for Figures 1, 2, and 3. We will also remove the redundant parts in the Figure 3 caption.
- Revised Caption for Figure 1: This caption will be expanded to explain the core comparison being illustrated—how our CoFFT-guided approach corrects the reasoning of a baseline VLM by focusing its attention on the relevant visual evidence.
Figure 1: A motivational example from the SeekWorld benchmark illustrating the difference in reasoning between OpenAI-o3 and our CoFFT-guided approach. (a) shows the original reasoning process of o3, which gets distracted by irrelevant information (crowds, pigeons) and arrives at an incorrect answer. (b) demonstrates how providing a focused visual region of the key semicircular buildings and curved pools, guided by a human-like cognitive process, helps the VLM identify the correct location as Jiangsu, China (Sun Yat-sen's Mausoleum). This comparison highlights the necessity of dynamic visual focus for the next reasoning step in complex reasoning tasks and motivates CoFFT's design—see Figure 2 for how CoFFT achieves this improved reasoning through its three-stage iterative framework.
- Revised Caption for Figure 2: The new caption will clearly break down the iterative three-stage workflow of our CoFFT, allowing readers to grasp the entire process at a glance.
Figure 2: The overall iterative workflow of our proposed Chain of Foresight-Focus Thought (CoFFT) approach. The process begins with an image and a question. Each iteration consists of three stages: (1) Diverse Sample Generation (DSG): The VLM generates multiple potential reasoning paths. (2) Dual Foresight Decoding (DFD): These paths are evaluated based on both reasoning progression and visual focus to select the optimal path. (3) Visual Focus Adjustment (VFA): The visual focus is then shifted to the most relevant image region for the next iteration. This cycle of reasoning guiding focus, and focus informing reasoning, continues until the final answer is reached.
- Revised Caption for Figure 3: We appreciate you catching the redundant (a) and (b) labels. We will remove them and rewrite the caption to directly describe the components illustrated, making the figure easier to understand.
Figure 3: A detailed illustration of the three primary components of one Foresight-Focus Thought. (1) Diverse Sample Generation (DSG): The VLM generates multiple potential reasoning paths with different settings. (2) Dual Foresight Decoding evaluates different reasoning samples by combining a visual focus score and a reasoning progression score to select the optimal path, and the first step of the optimal path is introduced to the reasoning process. (3) Visual Focus Adjustment uses a scoring mechanism based on question and future reasoning relevance to identify and crop the most informative image region for the next reasoning step.
Q2: On Writing, Terminology, and Structure
We appreciate you highlighting the issues with writing clarity. We will revise these one by one and update them.
- Line 51: Thank you for pointing this out. This was an oversight and will be corrected as follows:
(2) Dual Foresight Decoding (DFD): This module evaluates these samples by considering both visual focus and reasoning progression to select the optimal reasoning sample, and then incorporates its first step into the reasoning process.
- Lines 118-120 and Section Structure:
We will no longer introduce these symbols prematurely in Section 3.1 (Overview); instead, we will introduce them with detailed explanations in later sections. We will further review all subsequent symbols and terms to ensure they are formally defined upon first appearance.
- Writing needs to be improved:
To address this issue, we plan to reorganize Section 3 (Chain of Foresight-Focus Thought). We will first provide a more concise and comprehensive overview without complex symbols in Section 3.1 (Overview). Then, in Section 3.2, we will introduce the relative attention mechanism used throughout the following sections. In the next three subsections (3.3, 3.4, and 3.5), we will introduce each stage of CoFFT in detail, one by one, and ensure that all symbols are clearly defined and explained when first used. This will enable readers to understand our method progressively and avoid confusion.
Q3: For Addressing Computational Cost
Thank you for your valuable suggestions. We will first provide a quantitative comparison to characterize CoFFT's computational overhead, and then propose a pruning approach to optimize it.
(1) Current Computational Overhead:
In lines 245 to 249 of the original manuscript, we conduct a comparative analysis of CoFFT's computational cost against other training-free methods, as shown in the table below. These data indicate that although CoFFT introduces certain computational overhead, it is still relatively efficient and achieves good performance.
| Models | Baseline | MCTS | Predictive decoding | ICoT | DyFo | CoFFT (Ours) |
|---|---|---|---|---|---|---|
| Performance | 42.72 | 44.68 | 45.05 | 34.91 | 44.98 | 48.19 |
| FLOPS | 8.35e+12 | 4.05e+14 | 1.85e+14 | 1.88e+13 | 1.98e+13 | 2.38e+14 |
(2) Proposed Pruning Strategies for Mitigation:
To reduce CoFFT's time overhead, we design an intuitive pruning method that lowers computational cost while maintaining performance. It considers two aspects, reasoning-process similarity and reasoning progress, through two strategies:
- Adaptive Similarity Pruning: This strategy eliminates semantic redundancy among candidate samples. At each step, we calculate the TF-IDF similarity between all pairs of candidate reasoning processes. If two samples exhibit high similarity (e.g., > 0.7) for n consecutive steps, we prune the one that is most similar to the other reasoning processes, preserving diversity among the remaining paths (see the sketch after this list).
- Adaptive Reasoning Pruning: This strategy terminates unpromising reasoning paths early by tracking each candidate sample's reasoning progression score. If a sample fails to achieve positive scores for n consecutive steps, it is marked as "stagnant." To maintain search breadth, we prune only the single lowest-scoring stagnant sample per iteration, ensuring at least two candidates are always retained.
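As a concrete illustration of the Adaptive Similarity Pruning strategy above, the following Python sketch uses scikit-learn's TF-IDF vectorizer and cosine similarity. It is a sketch under assumptions, not our implementation: the function name, the `window` and `min_keep` values, and the bookkeeping of per-candidate redundancy counters are placeholders (only the 0.7 similarity threshold comes from the text above).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similarity_prune(candidates, redundancy_counts, threshold=0.7, window=3, min_keep=2):
    """Hedged sketch of Adaptive Similarity Pruning over candidate reasoning paths.

    candidates:        list of reasoning-path strings at the current step.
    redundancy_counts: per-candidate count of consecutive steps on which the
                       candidate was highly similar to some other candidate.
    Returns (indices of candidates to keep, updated redundancy_counts).
    """
    n = len(candidates)
    if n <= min_keep:
        return list(range(n)), redundancy_counts

    # Pairwise TF-IDF cosine similarity between candidate reasoning processes.
    tfidf = TfidfVectorizer().fit_transform(candidates)
    sims = cosine_similarity(tfidf)

    for i in range(n):
        redundant = any(sims[i, j] > threshold for j in range(n) if j != i)
        redundancy_counts[i] = redundancy_counts[i] + 1 if redundant else 0

    keep = list(range(n))
    # Prune candidates that stayed redundant for `window` consecutive steps,
    # most-redundant first, but never drop below `min_keep` surviving paths.
    stale = sorted((i for i in keep if redundancy_counts[i] >= window),
                   key=lambda i: -redundancy_counts[i])
    for i in stale:
        if len(keep) > min_keep:
            keep.remove(i)
    return keep, redundancy_counts
```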
(3) Experimental Results:
We evaluate two values of the consecutive-step threshold n for experimental validation. We validate the pruning strategy's effectiveness through a comprehensive evaluation across five benchmark datasets. The table below shows the results, indicating that our pruning strategy significantly reduces FLOPS while maintaining high accuracy; we even observe small performance gains on some benchmarks.
| Method | MathVista | MathVision | Charxiv | S.W-China | S.W-Global | FLOPS (Lower is better) | Reduction Ratio |
|---|---|---|---|---|---|---|---|
| Baseline | 68.2 | 18.09 | 42.5 | 21.45 | 25.31 | 8.35e+12 | - |
| CoFFT | 70.4 | 23.36 | 47.2 | 35.12 | 29.37 | 2.38e+14 | - |
| CoFFT + Pruning (more aggressive setting) | 69.9 | 22.37 | 46.2 | 34.32 | 29.37 | 1.72e+14 | 27.7% |
| CoFFT + Pruning (more conservative setting) | 70.4 | 23.03 | 46.9 | 36.19 | 30.31 | 2.11e+14 | 11.3% |
Dear Reviewer UGRZ
We would appreciate it if you could kindly check our responses to your comments. If you have any further questions, we would be happy to address them.
Best
Authors
Thanks for providing a detailed rebuttal. I had many concerns related to the writing and presentation highlighted in my original review, and thanks for clarifying them.
"Fig 1 - 3": It reads much better and clearer now. Thanks for the revision.
"Lines 118-120 and Section Structure": can you share how specifically you plan to address that? Can you share what you are planning to add to address those points?
The information (text in the image) in Figure 1 is still too dense. Please consider simplifying it and delivering the core idea.
Overall, the rebuttal has addressed most of my concerns. The experimentation and mitigation strategy provided during the rebuttal period are very useful; please include them in the final version.
I am increasing my score and rating to "borderline accept".
Thank you for your positive feedback and for acknowledging our revisions. We are delighted to hear that our responses addressed your concerns effectively and that you are willing to improve your rating. We will incorporate all the revisions made during the rebuttal period into the final version of the paper. We appreciate your support and recognition.
(1) For "Lines 118-120 and Section Structure"
Thank you for your feedback. We plan to reorganize the content of section 3, and our main approach is as follows:
- Restructure the content into five parts. We will begin with a more concise and comprehensive overview in section 3.1, avoiding complex symbols. In section 3.2, we will introduce the relative attention mechanism that will be used in the subsequent sections. We will then detail the individual stages of CoFFT in the following three subsections (3.3, 3.4, and 3.5).
- Improve section 3.1 (Overview). To avoid confusion caused by complex formulas and text, we will simplify the text descriptions and introduce pseudocode to help readers better understand the overall framework.
The current modification plan is as follows; please let us know if you have any suggestions:
Algorithm 1: Chain of Foresight-Focus Thought (CoFFT)
Require: Original image, question, and the VLM
Ensure: Final reasoning result
1: Initialize the visual focus as the original image and the reasoning chain as empty
2: while not reached final answer do
3: // --- Stage 1: Diverse Sample Generation ---
4: S ← ∅
5: for i = 1 to k do
6: Sample a temperature from the temperature range // Ensure diversity
7: s_i ← Generate a reasoning sample using the VLM, the current focus image, the question, the reasoning so far, and the sampled temperature
8: S ← S ∪ {s_i}
9: end for
10:
11: // --- Stage 2: Dual Foresight Decoding ---
12: for each sample s_i in S do
13: Calculate its visual relevance score (via relative attention, cosine, IoU)
14: Calculate its reasoning progression score (via log-prob improvement)
15: end for
16: Select the best sample by combining the two scores with the balancing factor
17: // Update reasoning with the optimal first step
18: Append the first step of the best sample to the reasoning chain
19: // --- Stage 3: Visual Focus Adjustment ---
20: Compute the next focus map based on the question, the reasoning so far, and the selected future step
21: Find the optimal region with the highest score in the focus map
22: if the score of this region is significantly higher than the global average then
23: Crop and magnify this region from the original image as the new focus image
24: else
25: Revert the focus image to the full image // No clear focus is found
26: end if
27: end while
28: return the final reasoning result
Overview Section
We introduce the Chain of Foresight-Focus Thought (CoFFT), a training-free framework designed to enhance the complex reasoning capabilities of Vision-Language Models (VLMs). CoFFT operates through an iterative process where each cycle refines the reasoning path and adjusts the visual focus, mimicking human cognitive patterns of problem-solving. Each iteration of CoFFT involves three distinct stages, as shown in Algorithm 1:
1. Diverse Sample Generation
First, the VLM generates k potential future reasoning paths, or "samples." To ensure a wide range of possibilities, these samples are created using varied temperature settings. This stage essentially brainstorms multiple ways the reasoning could proceed based on the current context and visual input.
2. Dual Foresight Decoding
Next, each sample is evaluated using a dual-scoring mechanism to select the most promising path forward.
- A visual relevance score checks how well the reasoning aligns with the image content, helping to suppress hallucinations.
- A reasoning progression score assesses the logical coherence and forward momentum of the reasoning itself.
The sample with the best combined score is selected, and only its first step is appended to the main reasoning chain. This ensures a deliberate, step-by-step progression.
3. Visual Focus Adjustment
Finally, the system updates its visual focus. Based on the question and the chosen future reasoning step, CoFFT identifies the most relevant region in the image. It then "zooms in" by cropping and magnifying this area for the next iteration. If no single region is clearly superior, the model reverts to the full image view. This allows the model to dynamically shift between a global overview and fine-grained local details as needed.
These three stages create a powerful synergistic loop. The anticipated reasoning path guides the model's visual attention, and the newly focused visual input, in turn, leads to higher-quality reasoning in the next cycle. By iteratively refining both thought and focus, CoFFT effectively mitigates VLM hallucinations and improves the accuracy of complex visual reasoning tasks.
(2) For the information (text in the image) in Figure 1
Thank you for your valuable feedback regarding Figure 1. We plan to streamline the text in Figure 1: without losing information, we will condense some steps and unimportant details with ellipses to highlight the key information.
For example:
The seats are surrounded by semicircular benches, and there is a pond view under the corridor, which is very similar to Inokashira Park. Answer: Tokyo, Japan
Modify to
... semicircular benches, ... a pond view under the corridor, ... similar to Inokashira Park. Answer: Tokyo, Japan
Dear Reviewer UGRZ
Sorry to bother you, but it seems you haven't updated your rating yet. If you have any other suggestions, please let us know. Thank you for your approval of our response.
Best
Authors
The paper proposes a training-free method called CoFFT, inspired by human visual cognition, to address the issue of vision-language models (VLMs) failing to focus on key regions in images. By iteratively executing three stages—Diverse Sample Generation, Dual Foresight Decoding, and Visual Focus Adjustment—the method enhances the model’s reasoning capability.
Strengths and Weaknesses
Strengths:
Inspired by human cognition, the paper proposes a novel training-free method to address the visual focusing issue in vision-language models (VLMs).
Weaknesses:
- Figure 1 presents a highly knowledge-intensive visual question answering scenario. While the paper emphasizes that the method draws inspiration from human visual reasoning processes, it should be noted that even an American with exceptional visual reasoning capabilities couldn't possibly identify this as the Music Stage in Nanjing, China from just an image. The main text fails to provide any visual examples demonstrating that the proposed method can actually solve this type of problem.
- Regarding mathematical scenarios: The mathematical examples shown (particularly in Figure 4) appear overly simplistic. These examples don't seem to have any clear connection to the paper's key "visual adjustment" methodology. This raises the question: why does the proposed method show significant improvements on datasets like MathVista when the demonstrated examples don't adequately showcase the method's capabilities?
Questions
- Mathematical problems seem to require stronger reasoning capabilities. Could the authors explain why enhancing visual focus alone leads to improved performance in mathematical reasoning tasks?
- Could the authors provide several examples from mathematical datasets to illustrate how the model's reasoning path differs before and after applying the proposed method?
- Regarding Figure 1: The figure presents a highly knowledge-intensive visual question answering scenario. While the paper emphasizes that the method draws inspiration from human visual reasoning processes, it should be noted that even an American with exceptional visual reasoning capabilities couldn't possibly identify this as the Music Stage in Nanjing, China from just an image. The main text fails to provide any visual examples demonstrating that the proposed method can actually solve this type of problem.
- The mathematical examples shown (particularly in Figure 4) appear overly simplistic. These examples don't seem to have any clear connection to the paper's key "visual adjustment" methodology. This raises the question: why does the proposed method show significant improvements on datasets like MathVista when the demonstrated examples don't adequately showcase the method's capabilities?
Limitations
See the weaknesses and questions.
If the authors are able to address the above questions, the reviewer will raise the score.
Formatting Concerns
No concerns.
Q1: For Role of Visual Focus in Math Reasoning
We apologize for any misunderstanding regarding our work.
(1) Benchmark composition
First, it should be clarified that MathVista and MathVision do not contain only geometric images. Taking MathVista as an example, this benchmark comprises 31 different datasets. We categorize its test set as follows:
| Category | Samples | Source Datasets | Baseline Accuracy | CoFFT Accuracy |
|---|---|---|---|---|
| Natural Images | 230 | Super-CLEVR, A-OKVQA, CLEVR-Math, VQA-AS, VQA2.0, KVQA, VizWiz | 57.82 | 60.43 |
| Charts | 255 | PlotQA, FigureQA, DVQA, FunctionQA, ChartQA | 82.35 | 84.31 |
| Geometric Diagrams | 219 | UniGeo, GeoQA+, Geometry3K, GEOS, TheoremQA | 67.57 | 68.49 |
| Documents | 39 | DocVQA, PaperQA, ParsVQA-Caps, TextVQA | 58.97 | 66.67 |
| Tables | 62 | TabMWP | 83.87 | 90.32 |
| Science Diagrams | 85 | AI2D, ScienceQA, SciBench, TQA | 64.71 | 65.88 |
| Specialized Domains | 110 | VQA-RAD (Medical), MapQA (Maps), IconQA (Icons), IQTest (Logic), PMC-VQA (Medical) | 55.45 | 56.36 |
(2) CoFFT Framework
Visual math reasoning requires: (1) Accurate Perception: correctly identifying and extracting relevant information from complex visual inputs containing charts, diagrams, geometric figures, and numerical tables; (2) Robust Reasoning: maintaining consistent reasoning progression while integrating visual evidence throughout the reasoning process.
CoFFT addresses both aspects with synergistic enhancement between the following components.
1. Visual Focus Adjustment (VFA) - Enhance Perception
VFA enables dynamic attention on step-relevant regions in images. It reduces visual noise from irrelevant content (multiple subfigures, complex statistics) and prevents interpretation errors (incorrect values/symbols, confused relationships), ensuring accurate information extraction for a solid reasoning foundation.
2. Dual Foresight Decoding (DFD) - Enhance Reasoning
DFD evaluates both reasoning progress scores and visual focus scores to determine next steps. This dual verification uses current focus regions to suppress hallucinations while exploring multiple reasoning paths, maintaining consistency between visual evidence and reasoning progression.
Ablation studies confirm distinct contributions:
- DFD removal: Greater impact on reasoning-type tasks
- VFA removal: Greater impact on fine-grained tasks
We also provide a detailed analysis of CoFFT's performance on MathVista and MathVision in Q4.
| Models | MathVista | MathVision | S.W-China | S.W-Global |
|---|---|---|---|---|
| Ours | 70.4 | 23.36 | 35.12 | 29.37 |
| w/o DFD | 68.5 | 20.42 | 28.42 | 27.19 |
| w/o VFA | 69.3 | 21.71 | 27.08 | 26.25 |
| Baseline | 68.2 | 18.09 | 21.45 | 25.31 |
Q2: For Examples from Math Datasets
Since we cannot upload images, we provide a problem with its description and compare the reasoning paths of the VLM (Qwen2.5-VL-32B) and of CoFFT.
CoFFT's effectiveness on math problems involving charts, multiple subfigures, and tabular data is very clear.
We suspect you may be more interested in geometric problems, so we give an example of this type.
Question: Given a right triangle with specified legs and hypotenuse, what is the radius of the inscribed semicircle?
Description: The semicircle passes through one vertex of the triangle, its diameter lies along one leg, and it is tangent to the hypotenuse.
VLM
VLM processes the entire image holistically but struggles with geometric constraint correlation, leading to incorrect formulations:
- Step 1: Identifies the right triangle with its given side lengths and recognizes the inscribed semicircle.
- Step 2: Assumes the semicircle is tangent to two particular sides and applies tangent properties, setting up an equation based on this misunderstood geometric relationship. (Error occurs)
- Step 3: Solves the incorrect equation and concludes with a wrong radius.
Relying only on the global view, the model fails to establish the precise geometric relationships.
VLM with CoFFT
CoFFT deconstructs the problem through iterative visual focus adjustment, enabling progressive and robust reasoning:
- Step 1 (Global View): Extracts the given parameters of the right triangle (the two legs and the hypotenuse).
- Step 2 (Removes some blank areas, focuses on the triangle): Determines which side the semicircle's diameter lies on and sets up the corresponding relation for the radius.
- Step 3 (Focus on Hypotenuse and Tangency): The VLM focuses on the tangency between the semicircle and the triangle; the region containing the hypotenuse and the semicircle is cropped out. The tangency condition yields an equation for the radius.
- Step 4 (Solution with Global View): Solves this equation to obtain the radius. (A hedged general-form derivation is sketched after this example.)
This example shows how CoFFT's precise targeting enables correct reasoning by decomposing problems into manageable, intuitive steps, while the baseline fails due to incorrect geometric constraint interpretation.
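Since the specific numbers did not survive above, here is a hedged general-form derivation under one plausible reading of the figure: right angle at C, legs a = BC and b = AC, hypotenuse c = AB, and a semicircle whose diameter lies on BC, which passes through C and is tangent to AB. The paper's actual example may use a different configuration.

```latex
% Place C at the origin, B at (a, 0), A at (0, b); line AB: bx + ay - ab = 0.
% The center O = (r, 0) lies on BC and the circle passes through C, so its
% radius is r. Tangency to AB means dist(O, AB) = r:
\[
  \frac{\lvert b r - a b \rvert}{\sqrt{a^2 + b^2}} = r
  \;\Longrightarrow\;
  \frac{b\,(a - r)}{c} = r
  \;\Longrightarrow\;
  r = \frac{a b}{b + c},
  \qquad c = \sqrt{a^2 + b^2}.
\]
```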
Q3: For the Example in Figure 1:
We apologize for any confusion this example may have caused.
(1) CoFFT does not introduce new knowledge, but can make more effective use of current knowledge.
CoFFT is a training-free method that cannot modify the model's internal parameters or knowledge when evaluated on VLMs. To ensure fair comparison, all training-free methods are evaluated on the same VLM (e.g., Qwen2.5VL-7B) while maintaining unchanged foundational knowledge. CoFFT demonstrates that by reducing interference from irrelevant visual information, models can utilize their existing knowledge more effectively rather than requiring new knowledge acquisition.
(2) Poor performance in the Figure 1 example is not attributed to background knowledge deficiency.
We validate this across multiple models (GPT-4o, o3, Qwen2.5VL-72B, Qwen2.5-VL-7B):
- Original complete image: Models identified locations like Tokyo, Paris, Shanghai, or New York based on visual elements such as pigeons, crowds, trees, and amphitheater structures.
- CoFFT-generated cropped image (focusing on the semicircular buildings and curved pools): Models provided more accurate answers, including Nanjing, Rome, and Washington D.C., because the crop highlights distinctive architectural features that trigger their latent knowledge.
Moreover, we directly asked these models about the representative features of Nanjing's Sun Yat-sen Mausoleum. They all mentioned the corresponding areas shown in Figure 1 and could describe the semicircular buildings, semicircular stepped seating, and curved pools.
This experiment shows that the models evaluated on the Figure 1 example possess the relevant knowledge but cannot correctly utilize it due to excessive visual noise.
(3) Impact of background knowledge
Model background knowledge significantly impacts reasoning, as shown in our analysis (lines 233-238). CoFFT achieves improvements on geography-based benchmarks like SeekWorld-China by helping identify fine-grained, region-specific details. However, gaps remain compared to GPT-4o in global settings due to knowledge limitations.
| Model | S.W-China | S.W-Global |
|---|---|---|
| GPT-4o | 31.90 | 56.50 |
| Qwen2.5-VL-7B | 21.45 | 25.31 |
| 7B with CoFFT | 35.12 | 29.37 |
| Qwen2.5-VL-32B | 24.13 | 25.63 |
| 32B with CoFFT | 38.61 | 30.63 |
Q4: For the Detailed Performance of CoFFT on Math Benchmarks:
We apologize for any confusion. Our selection is based on two key considerations:
1. Show Flexibility: Demonstrates CoFFT's visual adjustment mechanism - both "zoom in" for details and "zoom out" for global view, avoiding local optima.
2. Page Limitations: Complex examples with intricate diagrams and longer reasoning chains are difficult to present clearly in single figures.
To more clearly demonstrate the impact of CoFFT, we conduct a detailed analysis.
Detailed Analysis: CoFFT was analyzed on MathVista and MathVision benchmarks, which include geometric diagrams, charts, tables, algebra, and logic problems conveyed through images. Results show CoFFT significantly improves both accurate visual perception and reasoning capabilities.
- On MathVista, the greatest gains are in Numeric Commonsense, Algebraic Reasoning, Arithmetic Reasoning, and Statistical Reasoning. CoFFT excels at extracting helpful information from visually dense images, enabling correct numerical interpretation and reasoning.
- On MathVision, substantial improvements appear in Arithmetic, Counting, and Statistics. These tasks require interpreting complex visual structures where attention to key elements is crucial for building correct logical/combinatorial models. Limited improvements are observed in Graph Theory, Topology, and Transformation Geometry.
Analysis on MathVista (Testmini)
| Skill Category | Baseline | CoFFT | Improvement |
|---|---|---|---|
| Numeric Commonsense | 34.03% | 41.67% | +7.64% |
| Algebraic Reasoning | 67.97% | 71.89% | +3.92% |
| Logical Reasoning | 29.73% | 32.43% | +2.70% |
| Arithmetic Reasoning | 61.19% | 63.74% | +2.55% |
| Statistical Reasoning | 85.38% | 87.38% | +2.00% |
| Geometry Reasoning | 67.36% | 68.20% | +0.84% |
| Scientific Reasoning | 67.21% | 68.03% | +0.82% |
Analysis on MathVision (Testmini)
| Category | Baseline | CoFFT | Improvement |
|---|---|---|---|
| Arithmetic | 47.37% | 68.42% | +21.05% |
| Counting | 10.53% | 26.32% | +15.79% |
| Statistics | 21.05% | 36.84% | +15.79% |
| Combinatorics | 21.05% | 31.58% | +10.53% |
| Logic | 15.79% | 21.05% | +5.26% |
| Algebra | 21.05% | 26.32% | +5.26% |
| Solid Geometry | 5.26% | 10.53% | +5.26% |
| Metric Geometry - Length | 10.53% | 15.79% | +5.26% |
| Analytic Geometry | 31.58% | 31.58% | 0.00% |
| Descriptive Geometry | 15.79% | 15.79% | 0.00% |
| Graph Theory | 15.79% | 15.79% | 0.00% |
| Metric Geometry - Angle | 5.26% | 5.26% | 0.00% |
| Metric Geometry - Area | 21.05% | 21.05% | 0.00% |
| Topology | 15.79% | 15.79% | 0.00% |
| Transformation Geometry | 21.05% | 21.05% | 0.00% |
| Combinatorial Geometry | 15.79% | 10.53% | -5.26% |
Dear Reviewer nDak
We would appreciate it if you could kindly check our responses to your comments. If you have any further questions, we would be happy to address them.
Best
Authors
Regarding the authors’ explanation of Figure 1, I still disagree with their viewpoint. The authors claim that because the model can answer questions about some characteristics of the Nanjing Music Stage in the image, the multimodal model should possess the ability to correctly recognize the image. However, this logic is flawed. Even a pure language model, such as Qwen2.5-7B, can provide information about some features of the Nanjing Music Stage, despite having no capability to process image features. The knowledge contained in text and multimodal knowledge are not inherently equivalent; otherwise, pretraining the visual encoder or conducting multimodal joint pretraining would be meaningless. Therefore, it is incorrect to infer that a large vision-language model (LVLM) has the potential to correctly recognize an image solely based on its knowledge of the associated textual concepts—especially when the multimodal model has never seen similar images during pretraining. For example, if we show OpenAI-o3 a photo taken at a random grassland in Ili, China, the model can only make a rough guess about the possible location based on visible geographic features; it might suggest Xinjiang, Kazakhstan, or even Switzerland. If the model has never seen similar images before, it will not be able to recognize them accurately, and this is unrelated to its knowledge of relevant textual information. No matter how strong o3’s reasoning is, it can only hypothesize the approximate position. Furthermore, the paper does not discuss this limitation, nor does it directly demonstrate the effectiveness of the proposed approach on its main target application. Hence, I believe there is some exaggeration regarding the effectiveness of the method.
On the issue of solving mathematical problems, my concerns about geometry tasks arise from the fact that the visualization analysis in the paper is limited to mathematical questions of the geometric type. The presented example is a math question with low visual information entropy. For such images, a model like Qwen2.5-VL does not require a zoom-in operation to retrieve necessary details. The same reasoning applies to other data types in MathVista. I do not entirely deny that the proposed method may bring improvements on mathematical benchmarks, but I wish to make clear that such improvements do not seem strongly connected to the core operation of zoom-in described in the paper. Recent related work on thinking with images, such as "DeepEyes: Incentivizing 'Thinking with Images' via Reinforcement Learning," has also clarified this point in their experimental analysis on similar benchmarks.
In summary, and after considering the opinions of other reviewers, I have decided to maintain my original score.
Thank you for your reply, but we think you may have misunderstood our framework.
(1) Image attention adjustment is only one part of our framework; CoFFT addresses both aspects below through synergistic enhancement between its components.
First, for visual mathematical reasoning, the following two aspects are necessary:
(1) accurate extraction and understanding of visual information
(2) reliable reasoning and answers
Our Visual Focus Adjustment (VFA) and Dual Foresight Decoding (DFD) enhance these two aspects, respectively.
1. Visual Focus Adjustment (VFA) - Enhance Perception
VFA enables dynamic attention on step-relevant regions in images. It reduces visual noise from irrelevant content (multiple subfigures, complex statistics) and prevents interpretation errors (incorrect values/symbols, confused relationships), ensuring accurate information extraction for a solid reasoning foundation.
2. Dual Foresight Decoding (DFD) - Enhance Reasoning
DFD evaluates both reasoning progress scores and visual focus scores to determine next steps. This dual verification uses current focus regions to suppress hallucinations while exploring multiple reasoning paths, maintaining consistency between visual evidence and reasoning progression.
In short, DFD significantly improves the model's reasoning capability and suppresses image hallucinations.
You can also see this in our ablation experiments.
| Models | MathVista | MathVision | S.W-China | S.W-Global |
|---|---|---|---|---|
| Ours | 70.4 | 23.36 | 35.12 | 29.37 |
| without DFD | 68.5 | 20.42 | 28.42 | 27.19 |
| without VFA | 69.3 | 21.71 | 27.08 | 26.25 |
| Baseline | 68.2 | 18.09 | 21.45 | 25.31 |
For math reasoning datasets such as MathVista and MathVision, our DFD module significantly improves performance. You seem to have focused solely on our VFA module and overlooked the performance benefits of the DFD module.
(2) Regarding your comment that “there is some exaggeration regarding the effectiveness of the method.”
First, it should be clarified that we did not test with text alone; we also tested with the original image and with the CoFFT-cropped image. The models can generally recover the relevant information from the cropped image.
We also did not exaggerate the effectiveness of our method; rather, we frankly analyzed this point in our response, as follows:
Model background knowledge significantly impacts reasoning, as shown in our analysis (lines 233-238). CoFFT achieves improvements on geography-based benchmarks like SeekWorld-China by helping identify fine-grained, region-specific details. However, gaps remain compared to GPT-4o in global settings due to knowledge limitations.
| Model | S.W-China | S.W-Global |
|---|---|---|
| GPT-4o | 31.90 | 56.50 |
| Qwen2.5-VL-7B | 21.45 | 25.31 |
| 7B with CoFFT | 35.12 | 29.37 |
| Qwen2.5-VL-32B | 24.13 | 25.63 |
| 32B with CoFFT | 38.61 | 30.63 |
We hope this clarifies any misunderstanding about our framework, and we hope you will take the time to reply to us.
The paper introduces a training-free framework that integrates Dual Foresight Decoding (DFD) and Visual Focus Adjustment (VFA) in an iterative loop to enhance the visual reasoning ability of vision-language models.
The reviewers recognized multiple strengths of the work. Reviewer GWAM praised the paper as technically sound, self-consistent, and supported by thorough evaluation. Reviewer UGRZ highlighted the originality of the framework, the intuitiveness of the proposed foresight-guided focus, and the strong empirical results across diverse benchmarks. Reviewer hXHh emphasized the broad applicability of the training-free approach and the clear motivation. Reviewer nDak acknowledged the inspiration from human cognition and the novelty of addressing visual focus in VLMs.
The primary remaining reservation comes from Reviewer nDak, who questioned whether some figures (notably Figure 1) convincingly demonstrate the method's effectiveness. However, this concern relates mainly to presentation rather than the substance of the contribution, and the rebuttal clarified that Figure 1 should be read as a motivational example rather than proof of capability.
It is recommended that the authors revise Figure 1 and related text in the camera-ready version to avoid confusion and provide a clear discussion of the method’s limitations.