PaperHub
Score: 6.4 / 10
Poster · 4 reviewers
Ratings: 5, 4, 4, 3 (min 3, max 5, std 0.7)
Confidence: 3.8
Novelty: 3.3 · Quality: 2.8 · Clarity: 2.5 · Significance: 2.3
NeurIPS 2025

CoFFT: Chain of Foresight-Focus Thought for Visual Language Models

OpenReview · PDF
Submitted: 2025-05-06 · Updated: 2025-10-29

Abstract

Keywords
visual reasoning; visual language model; training-free; foresight

Reviews & Discussion

Official Review
Rating: 5

This paper introduces the Chain of Foresight-Focus Thought (CoFFT), a training-free approach that aims to enhance VLMs' visual reasoning. The pipeline consists of three stages: (1) Diverse Sample Generation; (2) Dual Foresight Decoding; (3) Visual Focus Adjustment. Through experiments, the authors demonstrate that the method is effective across different models.

Strengths and Weaknesses

In general, I would view this work as an attempt to construct a deep-thinking process similar to the o3 model. I find the proposed method self-consistent. The evaluations are mostly thorough.

However, I have two major concerns:

(1) The paper introduces quite a lot of new hyperparameters ($\alpha$, $\lambda$, etc.). How were these hyperparameters determined? Were they found through experimentation or chosen randomly? The authors should clarify this or provide additional experiments.

(2) I calculated the results from the ablation study in Table 3 and found that the average results of the two ablation experiments are 45.32 and 45.36, respectively. In other words, each individual component performs similarly to the baseline algorithms (45.05). This raises the question: why should we adopt the two methods proposed in the paper? Would using the baseline combined with only one of the proposed methods yield comparable performance?

Overall, I find this is a technically solid paper where reasons to accept outweigh reasons to reject. Therefore, I recommend borderline accept.

问题

  1. The authors are expected to conduct experiments during rebuttal to explain why the ablation results are not good (as mentioned before). I would be happy to adjust my score based on the quality of the response.

  2. In Tables 2, 3, and 4, why does the number of significant figures vary (sometimes one, sometimes two)?

  3. In line 176, the $\lambda$ should have been $\alpha$.

局限性

The authors mentioned the following limitation: However, while CoFFT successfully solves previously unsolvable problems for VLMs, it may introduce unexpected errors in cases where VLMs alone perform well, indicating potential interference with their original well-established capabilities.

The authors should clearly state what the unexpected errors are using concrete examples.

Final Justification

The author addressed the issues I raised during the rebuttal period well. The other reviewers do not appear to have any fundamental objections to the paper. Therefore, I am happy to raise my score to 5.

Formatting Concerns

N/A

Author Response

Q1: For the Determination of Hyperparameters ($\lambda$ and $\alpha$)

We appreciate your valuable suggestions. The hyperparameters $\lambda$ and $\alpha$ employed in CoFFT are selected through systematic experimentation.

Hyperparameter Definitions and Functions:

  • $\lambda$ appears in our Dual Foresight Decoding: $E_{t+1} = \lambda \cdot \text{Softmax}([E_{att}(s)]) + (1-\lambda) \cdot \text{Softmax}([E_{prob}(s)])$, where it balances the visual focus score ($E_{att}$) and the reasoning progression score ($E_{prob}$) when evaluating candidate reasoning samples.
  • $\alpha$ is used in our Visual Focus Adjustment: $C^{rel}(V, Q, R_t) = \max(A^{rel}(V, Q) - \alpha \cdot A^{rel}(V, R_t), 0)$, where it controls the suppression strength of previously explored regions, enabling the model to balance maintaining a global perspective with exploring new region details.
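
For illustration only, a minimal NumPy sketch of how these two formulas could be computed; the array shapes, the toy values, and the function names are our own assumptions rather than the authors' implementation:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    z = np.asarray(x, dtype=float)
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def dual_foresight_scores(e_att, e_prob, lam=0.3):
    """Combine per-candidate visual-focus scores (E_att) and reasoning
    progression scores (E_prob), following
    E_{t+1} = lam * Softmax([E_att]) + (1 - lam) * Softmax([E_prob])."""
    return lam * softmax(e_att) + (1.0 - lam) * softmax(e_prob)

def focus_relevance_map(attn_question, attn_reasoning, alpha=0.3):
    """C_rel = max(A_rel(V, Q) - alpha * A_rel(V, R_t), 0): suppress regions
    already emphasized by the reasoning trace so far."""
    return np.maximum(attn_question - alpha * attn_reasoning, 0.0)

# Toy usage: three candidate samples and a 4x4 attention grid.
scores = dual_foresight_scores([0.8, 0.5, 0.6], [0.2, 0.7, 0.4])
best_candidate = int(np.argmax(scores))
c_rel = focus_relevance_map(np.random.rand(4, 4), np.random.rand(4, 4))
```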

We conduct comprehensive experiments to demonstrate the determination process for these parameters and analyze their impact on performance. For thorough evaluation, we select two representative datasets: SeekWorld-China (requiring fine-grained visual detail comprehension) and MathVista (demanding reasoning capabilities in addition to visual understanding).

Hyperparameter $\lambda$: The hyperparameter $\lambda$ balances the visual focus score ($E_{att}$) and the reasoning progression score ($E_{prob}$). As shown in the table below, lower $\lambda$ values reduce performance on SeekWorld-China, as they interfere with the utilization of visual information. Conversely, an excessively high $\lambda$ may interfere with the reasoning process, resulting in marginal performance degradation on MathVista. $\lambda=0.3$ achieves the best balance between these competing requirements, making it our choice for the experiments.

| $\lambda$ | MathVista (Acc.) | SeekWorld-China (Acc.) |
|---|---|---|
| 0.2 | 70.5 | 34.58 |
| 0.3 | 70.4 | 35.12 |
| 0.4 | 69.8 | 35.66 |
| 0.5 | 69.3 | 35.66 |

Hyperparameter $\alpha$: The hyperparameter $\alpha$ controls the suppression strength of previously explored regions during Visual Focus Adjustment (VFA), thereby balancing the maintenance of a global perspective with the exploration of new regions. An excessively high $\alpha$ leads to aggressive exploration of new regions, causing the VLM to fall into local optima on problems that require a global view for reasoning, which degrades performance. Conversely, an excessively low $\alpha$ would weaken VFA's drive to explore new regions, affecting performance on SeekWorld-China. $\alpha=0.3$ represents a robust and balanced choice for this trade-off.

| $\alpha$ | MathVista (Acc.) | SeekWorld-China (Acc.) |
|---|---|---|
| 0.2 | 70.4 | 33.78 |
| 0.3 | 70.4 | 35.12 |
| 0.4 | 69.9 | 35.39 |
| 0.5 | 69.5 | 35.66 |

Q2: For the Ablation Study and Component Synergy

We appreciate your astute observation regarding the ablation study results and the critical question of component synergy.

To demonstrate that the Visual Focus Adjustment (VFA) and Dual-Foresight Decoding (DFD) in CoFFT are synergistically designed rather than simply interchangeable, we conduct a more comprehensive analysis and perform a series of experiments.

1. Distinct Functional Design

Reviewing our ablation results, we observe that while DFD and VFA achieve similar overall performance, they exhibit distinct strengths across different benchmarks. DFD has a more pronounced effect on reasoning-intensive mathematical datasets, while VFA has a greater impact on the fine-grained perception tasks in SeekWorld-China and SeekWorld-Global. This indicates that the two components primarily operate on different aspects.

| Method | MathVista | MathVision | MMStar | M3CoT | Charxiv | S.W-China | S.W-Global | AVG |
|---|---|---|---|---|---|---|---|---|
| Our | 70.4 | 23.36 | 69.4 | 62.47 | 47.2 | 35.12 | 29.37 | 48.19 |
| w/o DFD | 68.5 | 20.42 | 66.5 | 61.39 | 44.8 | 28.42 | 27.19 | 45.32 |
| w/o VFA | 69.3 | 21.71 | 67.4 | 61.09 | 44.7 | 27.08 | 26.25 | 45.36 |

2. Synergistic Design as a Whole

The average accuracy rates for various methods are as follows:

  • Qwen2.5-VL-7B (baseline): 42.72%
  • Predictive Decoding (reasoning enhancement method): 45.05%
  • DyFo (visual-search method): 44.98%
  • CoFFT: 48.19%

We subsequently test various combinations of visual and reasoning modules, conducting experiments on MathVista and SeekWorld-China benchmarks to comprehensively evaluate their performance.

| Combination | MathVista | SeekWorld-China | Average |
|---|---|---|---|
| Baseline | 68.2 | 21.45 | 44.83 |
| DyFo + Predictive Decoding | 69.0 | 33.24 | 51.02 |
| Visual Focus Adjustment (VFA) + Predictive Decoding | 68.5 | 31.37 | 50.24 |
| DyFo + Dual Foresight Decoding (DFD) | 69.3 | 34.05 | 51.73 |
| CoFFT (VFA + DFD) | 70.4 | 35.12 | 52.76 |

Analysis:

  1. DyFo + Predictive Decoding: This combination has a fundamental conflict: DyFo determines the image regions that Predictive Decoding can use. As an independent visual search module, DyFo selects the area of interest directly from the question. For mathematical problems (MathVista), DyFo may crop out critical information such as numbers or geometric relationships, thereby interfering with Predictive Decoding's reasoning process and degrading overall performance. For the geolocation task (SeekWorld-China), its focused view provides some improvement.

  2. VFA + Predictive Decoding: VFA is dynamic, adjusting visual focus after each reasoning step. In contrast, Predictive Decoding is designed for static visual inputs and cannot assess the value of the current reasoning path from a visual-information perspective; it can only passively accept the image regions provided by VFA, which introduces irrelevant and confusing cropped regions during reasoning. These regions disrupt Predictive Decoding's reasoning process, leading to an overall performance degradation.

  3. DyFo + Dual Foresight Decoding (DFD): The combination of DyFo and DFD is still suboptimal: even if we encourage DyFo to acquire one image region per inference step, its focusing behavior is not guided by DFD's foresight and can only adjust dynamically based on the current path. Moreover, DyFo's scoring comes from the Lang-Segment-Anything expert visual model [1], which has compatibility issues with the relative attention mechanism in DFD, further constraining overall performance. Although this combination brings some improvement, it lacks the tight feedback loop of mutual support between focus and reasoning that is core to CoFFT, so the gain is limited.

Conclusion:

These experiments demonstrate that simply connecting existing visual search and reasoning methods is not optimal and may even be detrimental. CoFFT's advantage lies precisely in the synergistic design of its VFA and DFD modules: DFD evaluates reasoning paths based on both reasoning progress and image relevance, and provides the foundation for VFA to determine the next reasoning-relevant regions; meanwhile, VFA's precise focusing provides DFD with high-quality, low-noise input, enabling more reliable judgments. This tightly coupled iterative loop is the key to the significant performance improvement and represents one of CoFFT's core contributions.

Q3: For the Number of Significant Figures in Tables

We appreciate your meticulous observation. The reported precision is determined by the sample size of each dataset and by conventions in related work. Except for MathVista, CharXiv, and MMStar, all benchmarks are reported with two decimal places.

MathVista and CharXiv each contain 1,000 samples. At this scale, the minimum measurable change in accuracy is 0.1% (1/1000), so results can only be meaningfully reported to one decimal place.

For MMStar, we observed that the official reports from the Qwen and Intern teams, as well as the original MMStar paper, report performance to one decimal place. We follow this convention to facilitate direct comparison with these authoritative results.

Q4: For the Typo in Line 176

Thank you for pointing out our oversight. The parameter in the textual description of Equation (5) on line 176 should be $\alpha$, not $\lambda$. This is a typographical error, and we will correct it in the final version.

Q5: For Providing Concrete Examples for Limitations

Due to the inability to insert images, we can only provide a textual description of a concrete example illustrating the "unexpected errors" that CoFFT may introduce.

Concrete Example: Consider a question about a complex diagram, such as a food chain diagram containing numerous organisms and multiple connecting lines representing predatory relationships. The question might be: "When organism X decreases, how will organism Y change?"

VLM: The VLM can process the entire diagram from a global perspective, correctly tracing the predatory relationship between X and Y to determine the final result.

Potential Errors Introduced by CoFFT:

  1. During the VFA stage, the framework may be attracted to visually dense regions—for example, areas containing X or Y—considering them "most informative," leading to magnification that may introduce content from other organisms surrounding X or Y.
  2. Although CoFFT's mechanism allows the model to recover the global view in subsequent steps, this region focus may introduce contextual interference.
  3. Consequently, during subsequent reasoning, the model may over-focus on this intermediate information, disrupting the overall reasoning process and leading to an incorrect final answer.

However, it should be noted that all current methods that crop or search images carry this risk of interference from partial views. This is not a limitation unique to CoFFT.

[1] Luca Medeiros. Language segment-anything, 2024.

Comment

The author addressed the issues I raised during the rebuttal period well. The other reviewers do not appear to have any fundamental objections to the paper. Therefore, I am happy to raise my score to 5. If the paper is accepted, I look forward to the authors incorporating the revisions made during the rebuttal into the final version.

Comment

Thank you for your positive feedback and for acknowledging our revisions. We are delighted to hear that our responses addressed your concerns effectively. We will incorporate all the revisions made during the rebuttal period into the final version of the paper. We appreciate your support and recognition.

Official Review
Rating: 4

This paper introduces the Chain of Foresight-Focus Thought (CoFFT), a novel training-free approach that enhances VLMs’ visual reasoning by emulating human visual cognition and mitigating task-irrelevant interference or hallucinations. CoFFT iteratively performs (1) Diverse Sample Generation to explore multiple reasoning paths, (2) Dual Foresight Decoding to select the optimal sample by jointly optimizing visual focus and reasoning progression, and (3) Visual Focus Adjustment to direct attention to regions most beneficial for subsequent reasoning. These three interdependent stages cycle until the final answer is reached, all without model modifications or retraining. Empirical results on Qwen2.5-VL, InternVL-2.5, and Llava-Next demonstrate consistent performance gains with controllable computational overhead.

Strengths and Weaknesses

Strengths

  1. The proposed training-free approach is valuable and its effectiveness is demonstrated across multiple VLM backbones, underscoring its general applicability.
  2. The method is novel and offers valuable insights into improving visual reasoning without any model finetuning.
  3. The Introduction is clearly written and effectively motivates the work.
  4. The Appendix provides a thorough algorithmic summary, which aids reproducibility and clarity.

Weaknesses

  1. The Method section is burdened by extensive mathematics, much of which appears only once and could be simplified. Notation is inconsistent; for example, it is unclear whether A_crop denotes a function or a result (see Equations 6 and 7). The authors should adopt distinct typographical conventions (e.g., calligraphic vs. italic) to distinguish variables, functions, and operators.
  2. Figure 1 conveys limited insight into the core mechanics of CoFFT; it merely shows a reduction in reasoning steps without clearly illustrating the key stages or their interplay, making the example hard to interpret.
  3. The evaluation lacks experiments on off-the-shelf reasoning models. Given the inherent reasoning limitations of the chosen backbones, validating CoFFT on stronger reasoning baseline VLMs would better demonstrate its broad utility.
  4. It would be better to use pseudocode instead of mathematical formulation for clarity.

Questions

See weakness

Limitations

yes

Final Justification

The author has addressed my concerns, so I keep the rating.

Formatting Concerns

none

Author Response

Q1: Clarity of Method Section and Notation

Thank you for your valuable suggestion.

In the revised version of our paper, we will implement the following improvements:

  1. Streamlined Mathematical Formulation: We will revise Section 3 to make the mathematical exposition more accessible. For complex formulas that appear only once and can be clearly explained in prose, we will integrate their descriptions into the main text whenever possible. This will improve the flow of the narrative and reduce the cognitive load on the reader.

  2. Consistent Notation: You astutely identified the inconsistent use of $A_{crop}$ in Equations (6) and (7). We acknowledge this oversight. In the revision, $A_{crop}$ will be used exclusively to denote the attention map (i.e., a matrix). When referring to the value at a specific coordinate, we will use the functional form $\mathcal{A}_{crop}(x, y)$. This change will ensure its meaning is unambiguous throughout the paper.

  3. Adoption of Typographical Conventions: We will adopt your excellent suggestion to use distinct typographical conventions. We will subsequently use normal fonts for variables, calligraphic fonts for functions, and italic fonts for operators, and introduce these conventions in the overview section.

We are confident that these revisions will significantly enhance the clarity, consistency, and readability of our Method section.

Q2: Interpretation of Figure 1

Thank you for your insightful comment. We would first like to clarify its intended purpose, then provide our modification plan.

Figure 1 is designed as a motivational example to demonstrate the necessity of our method, rather than to explain its internal workings. Its goal is to contrast (a) the lengthy, unfocused, and often erroneous reasoning process of a standard VLM with (b) an efficient reasoning process guided by dynamic visual focus for the next reasoning step. This comparison highlights the problem we aim to solve and serves as the inspiration for CoFFT's design, which emulates the human ability for foresight and focus adjustment. The core mechanics of our method are detailed in Figure 2 (which presents the overall three-stage iterative framework).

To resolve any potential confusion in the revised manuscript, we will:

  1. Revise the caption of Figure 1 to explicitly label it as a "motivational example" and direct readers to Figure 2 for a comprehensive understanding of how CoFFT achieves this improved reasoning.
  2. Strengthen the in-text reference to Figure 1, clarifying its role in problem motivation and immediately guiding the reader to Figure 2 for the technical details of our solution.

Modify the Caption as follows:

A motivational example from the SeekWorld benchmark illustrating the difference in reasoning between OpenAI-O3 and our CoFFT-guided approach. (a) shows the original reasoning process of O3, which gets distracted by irrelevant information (crowds, pigeons) and arrives at an incorrect answer. (b) demonstrates how providing a focused visual region of the key semicircular buildings and curved pools, guided by a human-like cognitive process, helps the VLM identify the correct location as Jiangsu, China (Sun Yat-sen's Mausoleum). This comparison highlights the necessity of dynamic visual focus for the next reasoning step in complex reasoning tasks and motivates CoFFT's design; see Figure 2 for how CoFFT achieves this improved reasoning through its three-stage iterative framework.

Q3: Evaluation on Stronger Reasoning Models

We appreciate your valuable suggestion.

First, it is essential to clarify why leading proprietary models such as o3 or GPT-4o are not included in our initial experiments. CoFFT's core mechanism—Dual Foresight Decoding and Visual Focus Adjustment—critically depends on access to the model's internal text-image attention maps. These attention maps are essential for computing visual relevance and dynamically adjusting visual focus. Unfortunately, leading closed-source models are provided as "black-box" APIs that do not expose these underlying internal states, making direct application of CoFFT technically infeasible.

Second, to further demonstrate CoFFT's broad applicability, we conducted a series of new experiments on the state-of-the-art large-scale open-source model Qwen2.5-VL-72B (the original manuscript already reported results on Qwen2.5-VL-32B). Our objective is to validate whether CoFFT still delivers significant improvements when applied to backbones that already possess strong intrinsic reasoning capabilities. The results are presented below. These findings show that CoFFT achieves significant average performance gains even at the 72B scale, validating our method's scalability.

| Method | MathVision | S.W-China | S.W-Global |
|---|---|---|---|
| Qwen2.5VL-32B | 25.33 | 24.13 | 28.41 |
| Qwen2.5VL-32B with CoFFT | 29.93 | 38.61 | 34.38 |
| Qwen2.5VL-72B | 36.18 | 37.80 | 42.18 |
| Qwen2.5VL-72B with CoFFT | 41.12 | 47.72 | 48.75 |

Q4: Use of Pseudocode for Clarity

Thank you for your valuable suggestion. We have prepared complete, detailed line-by-line pseudocode, but due to space limitations we placed it in the appendix. However, as you note, pseudocode is valuable for providing a concise overview, so we will add the following condensed pseudocode at the beginning of Section 3 (Methods). This will enable readers to quickly and intuitively grasp the core workflow of CoFFT before reading the detailed mathematical description.

Algorithm 1: Chain of Foresight-Focus Thought (CoFFT) - High-Level Overview

Require: Original image V, question Q, VLM model M
Ensure: Final reasoning result R

1:  V_focus ← V; R ← ∅                              // Initialize focus with original image and empty reasoning
2:  while not reached final answer do
3:      // --- Stage 1: Diverse Sample Generation ---
4:      S ← ∅
5:      for i ∈ [1, k] do
6:          T_i ← Sample from temperature range      // Ensure diversity
7:          s_i ← Generate reasoning sample using M, V_focus, Q, R, T_i
8:          S ← S ∪ {s_i}
9:      end for
10:
11:     // --- Stage 2: Dual Foresight Decoding ---
12:     for each sample s ∈ S do
13:         E_att(s) ← Calculate visual relevance score (via relative attention, cosine, IoU)
14:         E_prob(s) ← Calculate reasoning progression score (via log-prob improvement)
15:     end for
16:     s* ← Select the best sample by combining E_att and E_prob with factor λ
17:     R ← R ∪ FirstStep(s*)                         // Update reasoning with the optimal first step
18:
19:     // --- Stage 3: Visual Focus Adjustment ---
20:     A_crop ← Compute next focus map based on Q, R, and future step s*
21:     B* ← Find the region with the highest score in A_crop
22:     if score of B* is significantly higher than the global average then
23:         V_focus ← Crop and magnify region B* from original image V
24:     else
25:         V_focus ← V                               // Revert to the full image if no clear focus is found
26:     end if
27: end while
28: return R
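
To make this control flow concrete, here is a schematic Python rendering of the same loop. All callables (`generate`, `e_att`, `e_prob`, `focus_region`, `crop`), the temperature range, and the stopping check are placeholders introduced for illustration, not the authors' implementation.

```python
import math
import random
from typing import Callable, List, Tuple


def _softmax(xs: List[float]) -> List[float]:
    # Numerically stable softmax over the per-candidate scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def cofft_loop(
    image,
    question: str,
    generate: Callable[..., str],        # placeholder: one VLM sampling call
    e_att: Callable[[str], float],       # placeholder: visual focus score of a sample
    e_prob: Callable[[str], float],      # placeholder: reasoning progression score
    focus_region: Callable[..., Tuple],  # placeholder: returns (region, score, global_avg)
    crop: Callable[..., object],         # placeholder: crop-and-magnify a region
    k: int = 4,
    lam: float = 0.3,
    max_steps: int = 10,
) -> List[str]:
    """Schematic CoFFT loop mirroring Algorithm 1; every callable is a stand-in."""
    v_focus, reasoning = image, []
    for _ in range(max_steps):
        # Stage 1: diverse sample generation with varied temperatures.
        samples = [
            generate(v_focus, question, reasoning, temperature=random.uniform(0.3, 1.0))
            for _ in range(k)
        ]
        # Stage 2: dual foresight decoding -- combine normalized scores per candidate.
        att = _softmax([e_att(s) for s in samples])
        prob = _softmax([e_prob(s) for s in samples])
        combined = [lam * a + (1.0 - lam) * p for a, p in zip(att, prob)]
        best = samples[max(range(k), key=lambda i: combined[i])]
        first_step = best.split("\n")[0]   # keep only the first step of the best sample
        reasoning.append(first_step)
        if "Answer:" in first_step:        # assumed stopping convention
            break
        # Stage 3: visual focus adjustment.
        region, score, global_avg = focus_region(image, question, reasoning, best)
        v_focus = crop(image, region) if score > global_avg else image
    return reasoning
```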

Comment

thank you for the detailed response, i will maintain the score.

Comment

Thank you for your positive feedback and for acknowledging our revisions. We appreciate your support and recognition.

Official Review
Rating: 4

The paper proposes Chain of Foresight-Focus Thought (CoFFT), a novel training-free framework designed to enhance the visual reasoning capabilities of Vision-Language Models (VLMs). CoFFT operates in an iterative loop of three stages: (1) Diverse Sample Generation (DSG) produces multiple multi-step reasoning paths under varied temperature settings; (2) Dual Foresight Decoding (DFD) jointly evaluates these samples based on reasoning quality and visual relevance to select the best one-step continuation; and (3) Visual Focus Adjustment (VFA) refines the visual input by identifying and cropping the most informative image region, feeding it back into the next reasoning iteration. Experiments across multiple models (e.g., Qwen2.5-VL, InternVL-2.5, and Llava-Next) and benchmarks (MathVista, MathVision, M3CoT, MMStar, CharXiv, and SeekWorld) show that CoFFT achieves consistent gains of 3.1–5.8% in accuracy, with performance improvements scaling with model size.

Strengths and Weaknesses

Strengths:

  1. The idea is great and original. Relative attention is very intuitive.
  2. Extensive comparisons are conducted and the improvement is impressive, demonstrating the effectiveness of the method.

Weaknesses:

  1. Clarity of the paper and writing needs to be improved. For example,
  • What point are you trying to make in Figure 1? It's not clear from the caption and only makes sense when read with the main text. Similar issue for Figure 2. The captions should be somewhat self-contained. Please add more details.
  • Line 51 -- should use complete sentences.
  • Figure 3 -- the caption uses (a) and (b), which do not correspond to the figure.
  • Lines 118-120 -- introduce terms and symbols not defined before, which is confusing for the reader.
  • Writing needs to be improved: reorder the sections, or reduce terminology and symbols that have not yet been defined.
  2. Limitations need to be discussed further, especially run-time, as repeated samples are created every iteration. This needs to be discussed and possibly addressed.

Questions

See the weaknesses above.

Limitations

Not discussed. One big issue and limitation is the added computation. The authors need to discuss it and potential ways to mitigate it, or defer it to future work.

Final Justification

I went over the rebuttal and the authors have addressed some of my concerns.

Formatting Concerns

No

Author Response

Q1: On the Clarity of Figures and Captions

Thanks for your feedback regarding the clarity of our figures.

To address this, we will revise the captions for Figures 1, 2, and 3. We will also remove the redundant parts in the Figure 3 caption.

  • Revised Caption for Figure 1: This caption will be expanded to explain the core comparison being illustrated—how our CoFFT-guided approach corrects the reasoning of a baseline VLM by focusing its attention on the relevant visual evidence.

Figure 1: A motivational example from the SeekWorld benchmark illustrating the difference in reasoning between OpenAI-O3 and our CoFFT-guided approach. (a) shows the original reasoning process of O3, which gets distracted by irrelevant information (crowds, pigeons) and arrives at an incorrect answer. (b) demonstrates how providing a focused visual region of the key semicircular buildings and curved pools, guided by a human-like cognitive process, helps the VLM identify the correct location as Jiangsu, China (Sun Yat-sen's Mausoleum). This comparison highlights the necessity of dynamic visual focus for the next reasoning step in complex reasoning tasks and motivates CoFFT's design; see Figure 2 for how CoFFT achieves this improved reasoning through its three-stage iterative framework.

  • Revised Caption for Figure 2: The new caption will clearly break down the iterative three-stage workflow of our CoFFT, allowing readers to grasp the entire process at a glance.

Figure 2: The overall iterative workflow of our proposed Chain of Foresight-Focus Thought (CoFFT) approach. The process begins with an image and a question. Each iteration consists of three stages: (1) Diverse Sample Generation (DSG): The VLM generates multiple potential reasoning paths. (2) Dual Foresight Decoding (DFD): These paths are evaluated based on both reasoning progression and visual focus to select the optimal path. (3) Visual Focus Adjustment (VFA): The visual focus is then shifted to the most relevant image region for the next iteration. This cycle of reasoning guiding focus, and focus informing reasoning, continues until the final answer is reached.

  • Revised Caption for Figure 3: We appreciate you catching the redundant (a) and (b) labels. We will remove them and rewrite the caption to directly describe the components illustrated, making the figure easier to understand.

Figure 3: A detailed illustration of the three primary components of one Foresight-Focus Thought. (1) Diverse Sample Generation (DSG): The VLM generates multiple potential reasoning paths with different settings. (2) Dual Foresight Decoding evaluates different reasoning samples by combining a visual focus score and a reasoning progression score to select the optimal path, and the first step of the optimal path is introduced to the reasoning process. (3) Visual Focus Adjustment uses a scoring mechanism based on question and future reasoning relevance to identify and crop the most informative image region for the next reasoning step.

Q2: On Writing, Terminology, and Structure

We appreciate you highlighting the issues with writing clarity. We will revise these one by one and update them.

  • Line 51: Thank you for pointing this out. This was an oversight and will be corrected as follows:

(2) Dual Foresight Decoding (DFD): This module evaluates these samples by considering both visual focus and reasoning progression to select the optimal reasoning sample, and then incorporates its first step into the reasoning process.

  • Lines 118-120 and Section Structure:

We will no longer introduce $E_{att}$ and $E_{prob}$ in Section 3.1 (Overview); instead, we will introduce them with detailed explanations in later sections. We will further review all subsequent symbols and terms to ensure they are formally defined upon first appearance.

  • Writing needs to be improved:

To address this issue, we plan to reorganize Section 3 (Chain of Foresight-Focus Thought). We will first provide a more concise and comprehensive overview without complex symbols in Section 3.1 (Overview). Then, in Section 3.2, we will introduce the relative attention mechanism that is used throughout the following sections. In the next three subsections (3.3, 3.4, and 3.5), we will introduce each stage of CoFFT in detail and ensure that all symbols (such as $E_{att}$, $E_{prob}$, etc.) are clearly defined and explained when first used. This will enable readers to understand our method progressively and avoid confusion.

Q3: For Addressing Computational Cost

Thank you for your valuable suggestions. We will first present a quantitative comparison of CoFFT's computational overhead, then propose a pruning approach to optimize it.

(1) Current Computational Overhead:

In lines 245 to 249 of the original manuscript, we conduct a comparative analysis of CoFFT's computational cost against other training-free methods, as shown in the table below. These data indicate that although CoFFT introduces certain computational overhead, it is still relatively efficient and achieves good performance.

| Models | Baseline | MCTS | Predictive Decoding | ICoT | DyFo | CoFFT (Ours) |
|---|---|---|---|---|---|---|
| Performance | 42.72 | 44.68 | 45.05 | 34.91 | 44.98 | 48.19 |
| FLOPS | 8.35e+12 | 4.05e+14 | 1.85e+14 | 1.88e+13 | 1.98e+13 | 2.38e+14 |

(2) Proposed Mitigation Pruning Strategies:

To optimize CoFFT's time overhead, we design an intuitive pruning method that reduces computational cost while maintaining performance. It considers two aspects, reasoning-process similarity and reasoning advantage, and comprises two strategies:

  • Adaptive Similarity Pruning: This strategy eliminates semantic redundancy among candidate samples. At each step, we calculate the TF-IDF similarity between all pairs of candidate reasoning processes. If two samples remain highly similar (e.g., > 0.7) for n consecutive steps, we prune the one that is most similar to the other reasoning processes, preserving diversity among the remaining candidates (see the sketch after this list).
  • Adaptive Reasoning Pruning: This strategy terminates unpromising reasoning paths early by tracking each candidate sample's reasoning progression score ($E_{prob}$). If a sample fails to achieve positive scores for n consecutive steps, it is marked as "stagnant." To maintain search breadth, we prune only the single lowest-scoring stagnant sample per iteration, ensuring at least two candidates are always retained.
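
A minimal sketch of the similarity check behind Adaptive Similarity Pruning, assuming scikit-learn is available; the function name, the fixed 0.7 threshold, and the omission of the per-step bookkeeping over n consecutive steps are our own simplifications:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def redundant_pairs(candidate_texts, threshold=0.7):
    """Return index pairs of candidate reasoning processes whose TF-IDF cosine
    similarity exceeds `threshold`; a caller can prune one member of each pair
    after it stays redundant for n consecutive steps."""
    tfidf = TfidfVectorizer().fit_transform(candidate_texts)
    sims = cosine_similarity(tfidf)
    pairs = []
    for i in range(len(candidate_texts)):
        for j in range(i + 1, len(candidate_texts)):
            if sims[i, j] > threshold:
                pairs.append((i, j))
    return pairs

# Example: two near-duplicate reasoning traces and one distinct trace.
traces = [
    "the chart shows sales rising in Q3, so the answer is 42",
    "sales rise in Q3 according to the chart, so the answer is 42",
    "focus on the legend colors to map each line to a country",
]
print(redundant_pairs(traces))  # likely [(0, 1)]
```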

(3) Experimental Result:

We set n to 2 and 3 for experimental validation. We validate the pruning strategy's effectiveness through a comprehensive evaluation across five benchmark datasets. The table below shows the results, which match our expectations: the pruning strategy significantly reduces FLOPS while maintaining accuracy, and even yields gains on some benchmarks.

| Method | MathVista | MathVision | Charxiv | S.W-China | S.W-Global | FLOPS (lower is better) | Reduction Ratio |
|---|---|---|---|---|---|---|---|
| Baseline | 68.2 | 18.09 | 42.5 | 21.45 | 25.31 | 8.35e+12 | - |
| CoFFT | 70.4 | 23.36 | 47.2 | 35.12 | 29.37 | 2.38e+14 | - |
| CoFFT + Pruning (n=2) | 69.9 | 22.37 | 46.2 | 34.32 | 29.37 | 1.72e+14 | 27.7% |
| CoFFT + Pruning (n=3) | 70.4 | 23.03 | 46.9 | 36.19 | 30.31 | 2.11e+14 | 11.3% |
Comment

Dear Reviewer UGRZ

We would appreciate it if you could kindly check our responses to your comments. If you have any further questions, we would be happy to address them.

Best

Authors

Comment

Thx for providing a detailed rebuttal. I had many concerns related to the writing and presentation highlighted in my original review, and thanks for clarifying them.

"Fig 1 - 3": Now it reads much better and clearer now. Thanks for the revision.

"Lines 118-120 and Section Structure" and you share how specifically you plan to address that? Can you share what you are planning to add to address those?

The information (text in the image) in figure 1 is still too dense. Please consider simplifying and delivering the core idea.

Overall, the rebuttal has addressed most of my concerns. The experimentation and mitigation strategy provided during the rebuttal period are very useful and please include them in the final version.

I am increasing my score and rating to "borderline accept".

Comment

Thank you for your positive feedback and for acknowledging our revisions. We are delighted to hear that our responses addressed your concerns effectively and that you are willing to improve your rating. We will incorporate all the revisions made during the rebuttal period into the final version of the paper. We appreciate your support and recognition.

Comment

(1) For "Lines 118-120 and Section Structure"

Thank you for your feedback. We plan to reorganize the content of section 3, and our main approach is as follows:

  1. Restructure the content into five parts. We will begin with a more concise and comprehensive overview in section 3.1, avoiding complex symbols. In section 3.2, we will introduce the relative attention mechanism that will be used in the subsequent sections. We will then detail the individual stages of CoFFT in the following three subsections (3.3, 3.4, and 3.5).
  2. Improve section 3.1 (Overview). To avoid confusion caused by complex formulas and text, we will simplify the text descriptions and introduce pseudocode to help readers better understand the overall framework.

The current modification plan is as follows; if you have any suggestions, please share them with us:

Algorithm 1: Chain of Foresight-Focus Thought (CoFFT)

Require: Original image V, question Q, VLM model M
Ensure: Final reasoning result R

1:  V_focus ← V; R ← ∅                              // Initialize focus with original image and empty reasoning
2:  while not reached final answer do
3:      // --- Stage 1: Diverse Sample Generation ---
4:      S ← ∅
5:      for i ∈ [1, k] do
6:          T_i ← Sample from temperature range      // Ensure diversity
7:          s_i ← Generate reasoning sample using M, V_focus, Q, R, T_i
8:          S ← S ∪ {s_i}
9:      end for
10:
11:     // --- Stage 2: Dual Foresight Decoding ---
12:     for each sample s ∈ S do
13:         E_att(s) ← Calculate visual relevance score (via relative attention, cosine, IoU)
14:         E_prob(s) ← Calculate reasoning progression score (via log-prob improvement)
15:     end for
16:     s* ← Select the best sample by combining E_att and E_prob with factor λ
17:     R ← R ∪ FirstStep(s*)                         // Update reasoning with the optimal first step
18:
19:     // --- Stage 3: Visual Focus Adjustment ---
20:     A_crop ← Compute next focus map based on Q, R, and future step s*
21:     B* ← Find the region with the highest score in A_crop
22:     if score of B* is significantly higher than the global average then
23:         V_focus ← Crop and magnify region B* from original image V
24:     else
25:         V_focus ← V                               // Revert to the full image if no clear focus is found
26:     end if
27: end while
28: return R

Overview Section

We introduce the Chain of Foresight-Focus Thought (CoFFT), a training-free framework designed to enhance the complex reasoning capabilities of Vision-Language Models (VLMs). CoFFT operates through an iterative process where each cycle refines the reasoning path and adjusts the visual focus, mimicking human cognitive patterns of problem-solving. Each iteration of CoFFT involves three distinct stages, as shown in Algorithm 1:

1. Diverse Sample Generation

First, the VLM generates k potential future reasoning paths, or "samples." To ensure a wide range of possibilities, these samples are created using varied temperature settings. This stage essentially brainstorms multiple ways the reasoning could proceed based on the current context and visual input.

2. Dual Foresight Decoding

Next, each sample is evaluated using a dual-scoring mechanism to select the most promising path forward.

  • A visual relevance score ($E_{att}$) checks how well the reasoning aligns with the image content, helping to suppress hallucinations.
  • A reasoning progression score ($E_{prob}$) assesses the logical coherence and forward momentum of the reasoning itself.

The sample with the best combined score is selected, and only its first step is appended to the main reasoning chain. This ensures a deliberate, step-by-step progression.

3. Visual Focus Adjustment

Finally, the system updates its visual focus. Based on the question and the chosen future reasoning step, CoFFT identifies the most relevant region in the image. It then "zooms in" by cropping and magnifying this area for the next iteration. If no single region is clearly superior, the model reverts to the full image view. This allows the model to dynamically shift between a global overview and fine-grained local details as needed.

These three stages create a powerful synergistic loop. The anticipated reasoning path guides the model's visual attention, and the newly focused visual input, in turn, leads to higher-quality reasoning in the next cycle. By iteratively refining both thought and focus, CoFFT effectively mitigates VLM hallucinations and improves the accuracy of complex visual reasoning tasks.
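
As a side illustration (not part of the proposed overview text above), a minimal sketch of what the Stage 3 crop-and-magnify step could look like, assuming a Pillow image and a NumPy relevance map aligned with it; the `margin` test for "clearly superior" and resizing the crop back to the original resolution are our own assumptions, not the paper's exact criteria:

```python
import numpy as np
from PIL import Image


def adjust_focus(image: Image.Image, focus_map: np.ndarray,
                 box: tuple, margin: float = 1.5) -> Image.Image:
    """Crop-and-magnify step: `focus_map` is an HxW relevance map aligned with
    `image`, and `box` = (left, top, right, bottom) is the highest-scoring
    region in pixel coordinates."""
    left, top, right, bottom = box
    region_score = focus_map[top:bottom, left:right].mean()
    # Only zoom in when the region is clearly above the global average.
    if region_score <= margin * focus_map.mean():
        return image  # no clear focus: keep the full view
    crop = image.crop(box)
    # Magnify the crop back to the original resolution for the next iteration.
    return crop.resize(image.size)
```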

Comment

(2) For the information density (text in the image) in Figure 1

Thank you for your valuable feedback regarding Figure 1. We plan to optimize the text in Figure 1: without losing information, we will streamline some steps and replace unimportant details with ellipses to highlight the key information.

For example:

The seats are surrounded by semicircular benches, and there is a pond view under the corridor, which is very similar to Inokashira Park. Answer: Tokyo, Japan

Modify to

... semicircular benches, ... a pond view under the corridor, ... similar to Inokashira Park. Answer: Tokyo, Japan

Comment

Dear Reviewer UGRZ

Sorry to bother you, but it seems you haven't updated your rating yet. If you have any other suggestions, please let us know. Thank you for your approval of our response.

Best

Authors

Official Review
Rating: 3

The paper proposes a training-free method called CoFFT, inspired by human visual cognition, to address the issue of vision-language models (VLMs) failing to focus on key regions in images. By iteratively executing three stages—Diverse Sample Generation, Dual Foresight Decoding, and Visual Focus Adjustment—the method enhances the model’s reasoning capability.

Strengths and Weaknesses

Strengths:

Inspired by human cognition, the paper proposes a novel training-free method to address the visual focusing issue in vision-language models (VLMs)

Weaknesses:

  1. Figure 1 presents a highly knowledge-intensive visual question answering scenario. While the paper emphasizes that the method draws inspiration from human visual reasoning processes, it should be noted that even an American with exceptional visual reasoning capabilities couldn't possibly identify this as the Music Stage in Nanjing, China from just an image. The main text fails to provide any visual examples demonstrating that the proposed method can actually solve this type of problem.

  2. Regarding mathematical scenarios: The mathematical examples shown (particularly in Figure 4) appear overly simplistic. These examples don't seem to have any clear connection to the paper's key "visual adjustment" methodology. This raises the question: why does the proposed method show significant improvements on datasets like MathVista when the demonstrated examples don't adequately showcase the method's capabilities?

Questions

  1. Mathematical problems seem to require stronger reasoning capabilities. Could the authors explain why enhancing visual focus alone leads to improved performance in mathematical reasoning tasks?

  2. Could the authors provide several examples from mathematical datasets to illustrate how the model’s reasoning path differs before and after applying the proposed method?

  3. Regarding Figure 1: The figure presents a highly knowledge-intensive visual question answering scenario. While the paper emphasizes that the method draws inspiration from human visual reasoning processes, it should be noted that even an American with exceptional visual reasoning capabilities couldn't possibly identify this as the Music Stage in Nanjing, China from just an image. The main text fails to provide any visual examples demonstrating that the proposed method can actually solve this type of problem.

  4. The mathematical examples shown (particularly in Figure 4) appear overly simplistic. These examples don't seem to have any clear connection to the paper's key "visual adjustment" methodology. This raises the question: why does the proposed method show significant improvements on datasets like MathVista when the demonstrated examples don't adequately showcase the method's capabilities?

Limitations

See the weaknesses and questions above.

If the authors are able to address the above questions, the reviewer will raise the score.

Formatting Concerns

no concern

Author Response

Q1: For Role of Visual Focus in Math Reasoning

We apologize for any misunderstanding regarding our work.

(1) Benchmark composition

First, it should be clarified that MathVista and MathVision do not contain only geometric images. Taking MathVista as an example, this benchmark comprises 31 different source datasets. We categorize its test set as follows:

| Category | Samples | Source Datasets | Baseline Accuracy | CoFFT Accuracy |
|---|---|---|---|---|
| Natural Images | 230 | Super-CLEVR, A-OKVQA, CLEVR-Math, VQA-AS, VQA2.0, KVQA, VizWiz | 57.82 | 60.43 |
| Charts | 255 | PlotQA, FigureQA, DVQA, FunctionQA, ChartQA | 82.35 | 84.31 |
| Geometric Diagrams | 219 | UniGeo, GeoQA+, Geometry3K, GEOS, TheoremQA | 67.57 | 68.49 |
| Documents | 39 | DocVQA, PaperQA, ParsVQA-Caps, TextVQA | 58.97 | 66.67 |
| Tables | 62 | TabMWP | 83.87 | 90.32 |
| Science Diagrams | 85 | AI2D, ScienceQA, SciBench, TQA | 64.71 | 65.88 |
| Specialized Domains | 110 | VQA-RAD (Medical), MapQA (Maps), IconQA (Icons), IQTest (Logic), PMC-VQA (Medical) | 55.45 | 56.36 |

(2) CoFFT Framework

Visual math reasoning requires: (1) Accurate Perception: correctly identifying and extracting relevant information from complex visual inputs containing charts, diagrams, geometric figures, and numerical tables; (2) Robust Reasoning: maintaining consistent reasoning progression while integrating visual evidence throughout the reasoning process.

CoFFT addresses both aspects with synergistic enhancement between the following components.

1. Visual Focus Adjustment (VFA) - Enhance Perception

VFA enables dynamic attention on step-relevant regions in images. It reduces visual noise from irrelevant content (multiple subfigures, complex statistics) and prevents interpretation errors (incorrect values/symbols, confused relationships), ensuring accurate information extraction for a solid reasoning foundation.

2. Dual Foresight Decoding (DFD) - Enhance Reasoning

DFD evaluates both reasoning progress scores and visual focus scores to determine next steps. This dual verification uses current focus regions to suppress hallucinations while exploring multiple reasoning paths, maintaining consistency between visual evidence and reasoning progression.

Ablation studies confirm distinct contributions:

  • DFD removal: Greater impact on reasoning-type tasks
  • VFA removal: Greater impact on fine-grained tasks

We also provide a detailed analysis of CoFFT's performance on MathVista and MathVision in Q4.

| Models | MathVista | MathVision | S.W-China | S.W-Global |
|---|---|---|---|---|
| Our | 70.4 | 23.36 | 35.12 | 29.37 |
| w/o DFD | 68.5 | 20.42 | 28.42 | 27.19 |
| w/o VFA | 69.3 | 21.71 | 27.08 | 26.25 |
| Baseline | 68.2 | 18.09 | 21.45 | 25.31 |

Q2: For the example from math datasets

Due to the inability to upload images, we provide a problem with its description and compare reasoning paths between VLM (Qwen2.5-VL-32B) and CoFFT.

The effectiveness of CoFFT on math problems involving charts, multiple subfigures, and tabular data is particularly clear.

Since you may be more interested in geometry problems, we give an example of one below.

Question: Given a right triangle with sides $AB=8$, $BC=15$, and hypotenuse $CA=17$, what is the radius $r$ of the inscribed semicircle?

Description: The semicircle passes through point $B$, with its diameter lying along side $BC$, and is tangent to hypotenuse $CA$.

VLM

The VLM processes the entire image holistically but struggles to correlate the geometric constraints, leading to an incorrect formulation:

  1. Step 1: Identifies the right triangle with sides $AB=8$, $BC=15$, $CA=17$ and recognizes the inscribed semicircle.

  2. Step 2: Assumes the semicircle is tangent to sides $AB$ and $CA$ and applies tangent properties, setting up $\tan(\angle BAC) = \frac{8}{15} = \frac{r}{15-r}$ based on misunderstood geometric relationships. (Error occurs)

  3. Step 3: Solves the incorrect equation $8(15-r) = 15r$, yielding $120 = 23r$ and concluding $r = \frac{120}{23} \approx 5.22$.

The global perspective prevents the proper establishment of precise geometric relationships.

VLM with CoFFT

CoFFT deconstructs the problem through iterative visual focus adjustment, enabling progressive and robust reasoning:

  1. Step 1 (Global View): Extract the given parameters: a right triangle with $AB=8$, $BC=15$, $CA=17$.

  2. Step 2 (Remove some blank areas, focus on the triangle): Determine that the diameter of the semicircle lies on side $BC$. Therefore, $\cos(\angle BAC)=\sin(\angle ACB)=\frac{8}{17}$, $\tan(\angle BAC)=\frac{8}{15}$.

  3. Step 3 (Focus on Hypotenuse and Tangency): The VLM focuses on the tangency between the semicircle and the triangle; the region including $\angle ACB$ and the semicircle is cropped out. Based on the tangency, $\sin\angle ACB=\frac{r}{15-r}$ is obtained.

  4. Step 4 (Solution with Global View): From $\sin(\angle ACB)=\frac{r}{15-r}=\frac{8}{17}$, solving $17r = 120 - 8r$ yields $r = 4.8$.
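
For completeness, the algebra behind Step 4 can be expanded as follows (our own expansion of the step, not text from the paper):

$$\sin(\angle ACB)=\frac{r}{15-r}=\frac{8}{17}\;\Longrightarrow\;17r = 8(15-r) = 120 - 8r\;\Longrightarrow\;25r = 120\;\Longrightarrow\;r = 4.8$$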

This example shows how CoFFT's precise targeting enables correct reasoning by decomposing problems into manageable, intuitive steps, while the baseline fails due to incorrect geometric constraint interpretation.

Q3: For the Example in Figure 1:

We apologize for any confusion this example may have caused.

(1) CoFFT does not introduce new knowledge, but can make more effective use of current knowledge.

CoFFT is a training-free method that cannot modify the model's internal parameters or knowledge when evaluated on VLMs. To ensure fair comparison, all training-free methods are evaluated on the same VLM (e.g., Qwen2.5VL-7B) while maintaining unchanged foundational knowledge. CoFFT demonstrates that by reducing interference from irrelevant visual information, models can utilize their existing knowledge more effectively rather than requiring new knowledge acquisition.

(2) Poor performance in the Figure 1 example is not attributed to background knowledge deficiency.

We validate this across multiple models (GPT-4o, o3, Qwen2.5VL-72B, Qwen2.5-VL-7B):

  • Original complete image: Models identified locations like Tokyo, Paris, Shanghai, or New York based on visual elements like pigeons, crowds, trees, and amphitheater structures.

  • CoFFT-generated cropped image (focusing on the semicircular buildings and curved pools): Models provided more accurate answers, including Nanjing, Rome, and Washington, D.C., because the crop highlights distinctive architectural features that trigger their latent knowledge.

Moreover, we directly asked these models about the representative features of Nanjing's Sun Yat-sen Mausoleum. They all mentioned the corresponding areas shown in Figure 1 and could describe the semicircular buildings, semicircular stepped seating, and curved pools.

This experiment shows that the models possess the knowledge relevant to the Figure 1 example but cannot utilize it correctly due to excessive visual noise.

(3) Impact of background knowledge

Model background knowledge significantly impacts reasoning, as shown in our analysis (lines 233-238). CoFFT achieves improvements on geography-based benchmarks like SeekWorld-China by helping identify fine-grained, region-specific details. However, gaps remain compared to GPT-4o in global settings due to knowledge limitations.

| Model | S.W-China | S.W-Global |
|---|---|---|
| GPT-4o | 31.90 | 56.50 |
| Qwen2.5-VL-7B | 21.45 | 25.31 |
| 7B with CoFFT | 35.12 | 29.37 |
| Qwen2.5-VL-32B | 24.13 | 25.63 |
| 32B with CoFFT | 38.61 | 30.63 |

Q4: For the detailed Performance of CoFFT on Math Benchmarks:

We apologize for any confusion. Our selection is based on two key considerations:

1. Show Flexibility: Demonstrates CoFFT's visual adjustment mechanism - both "zoom in" for details and "zoom out" for global view, avoiding local optima.

2. Page Limitations: Complex examples with intricate diagrams and longer reasoning chains are difficult to present clearly in single figures.

To more clearly demonstrate the impact of CoFFT, we conduct a detailed analysis.

Detailed Analysis: CoFFT was analyzed on MathVista and MathVision benchmarks, which include geometric diagrams, charts, tables, algebra, and logic problems conveyed through images. Results show CoFFT significantly improves both accurate visual perception and reasoning capabilities.

  • On MathVista, the greatest gains are in Numeric Commonsense, Algebraic Reasoning, Arithmetic Reasoning, and Statistical Reasoning. CoFFT excels at extracting helpful information from visually dense images, enabling correct numerical interpretation and reasoning.

  • On MathVision, there are substantial improvements in Arithmetic, Counting, and Statistics. These tasks require interpreting complex visual structures, where attention to key elements is crucial for building correct logical/combinatorial models. Limited improvements are observed in Graph Theory, Topology, and Transformation Geometry.

Analysis on MathVista (Testmini)

| Skill Category | Baseline | CoFFT | Improvement |
|---|---|---|---|
| Numeric Commonsense | 34.03% | 41.67% | +7.64% |
| Algebraic Reasoning | 67.97% | 71.89% | +3.92% |
| Logical Reasoning | 29.73% | 32.43% | +2.70% |
| Arithmetic Reasoning | 61.19% | 63.74% | +2.55% |
| Statistical Reasoning | 85.38% | 87.38% | +2.00% |
| Geometry Reasoning | 67.36% | 68.20% | +0.84% |
| Scientific Reasoning | 67.21% | 68.03% | +0.82% |

Analysis on MathVision (Testmini)

| Category | Baseline | CoFFT | Improvement |
|---|---|---|---|
| Arithmetic | 47.37% | 68.42% | +21.05% |
| Counting | 10.53% | 26.32% | +15.79% |
| Statistics | 21.05% | 36.84% | +15.79% |
| Combinatorics | 21.05% | 31.58% | +10.53% |
| Logic | 15.79% | 21.05% | +5.26% |
| Algebra | 21.05% | 26.32% | +5.26% |
| Solid Geometry | 5.26% | 10.53% | +5.26% |
| Metric Geometry - Length | 10.53% | 15.79% | +5.26% |
| Analytic Geometry | 31.58% | 31.58% | 0.00% |
| Descriptive Geometry | 15.79% | 15.79% | 0.00% |
| Graph Theory | 15.79% | 15.79% | 0.00% |
| Metric Geometry - Angle | 5.26% | 5.26% | 0.00% |
| Metric Geometry - Area | 21.05% | 21.05% | 0.00% |
| Topology | 15.79% | 15.79% | 0.00% |
| Transformation Geometry | 21.05% | 21.05% | 0.00% |
| Combinatorial Geometry | 15.79% | 10.53% | -5.26% |
Comment

Dear Reviewer nDak

I would appreciate if you could kindly check my responses to your comments. If you have any further questions, we would be happy to address them.

Best

Authors

Comment

Regarding the authors’ explanation of Figure 1, I still disagree with their viewpoint. The authors claim that because the model can answer questions about some characteristics of the Nanjing Music Stage in the image, the multimodal model should possess the ability to correctly recognize the image. However, this logic is flawed. Even a pure language model, such as Qwen2.5-7B, can provide information about some features of the Nanjing Music Stage, despite having no capability to process image features. The knowledge contained in text and multimodal knowledge are not inherently equivalent; otherwise, pretraining the visual encoder or conducting multimodal joint pretraining would be meaningless. Therefore, it is incorrect to infer that a large vision-language model (LVLM) has the potential to correctly recognize an image solely based on its knowledge of the associated textual concepts—especially when the multimodal model has never seen similar images during pretraining. For example, if we show OpenAI-o3 a photo taken at a random grassland in Ili, China, the model can only make a rough guess about the possible location based on visible geographic features; it might suggest Xinjiang, Kazakhstan, or even Switzerland. If the model has never seen similar images before, it will not be able to recognize them accurately, and this is unrelated to its knowledge of relevant textual information. No matter how strong o3’s reasoning is, it can only hypothesize the approximate position. Furthermore, the paper does not discuss this limitation, nor does it directly demonstrate the effectiveness of the proposed approach on its main target application. Hence, I believe there is some exaggeration regarding the effectiveness of the method.

On the issue of solving mathematical problems, my concerns about geometry tasks arise from the fact that the visualization analysis in the paper is limited to mathematical questions of the geometric type. The presented example is a math question with low visual information entropy. For such images, a model like Qwen2.5-VL does not require a zoom-in operation to retrieve necessary details. The same reasoning applies to other data types in MathVista. I do not entirely deny that the proposed method may bring improvements on mathematical benchmarks, but I wish to make clear that such improvements do not seem strongly connected to the core operation of zoom-in described in the paper. Recent related work on thinking with images, such as "DeepEyes: Incentivizing 'Thinking with Images' via Reinforcement Learning," has also clarified this point in their experimental analysis on similar benchmarks.

In summary, and after considering the opinions of other reviewers, I have decided to maintain my original score.

Comment

Thank you for your reply, but we think you may have misunderstood our framework.

(1) Such image attention adjustment is only part of our framework's capability. CoFFT addresses both of the aspects below with synergistic enhancement between the following components.

First, for visual mathematical reasoning, the following two aspects are necessary:

(1) accurate extraction and understanding of visual information

(2) reliable reasoning and answers

Our Visual Focus Adjustment (VFA) and Dual Foresight Decoding (DFD) enhance these two aspects respectively.

1. Visual Focus Adjustment (VFA) - Enhance Perception

VFA enables dynamic attention on step-relevant regions in images. It reduces visual noise from irrelevant content (multiple subfigures, complex statistics) and prevents interpretation errors (incorrect values/symbols, confused relationships), ensuring accurate information extraction for a solid reasoning foundation.

2. Dual Foresight Decoding (DFD) - Enhance Reasoning

DFD evaluates both reasoning progress scores and visual focus scores to determine next steps. This dual verification uses current focus regions to suppress hallucinations while exploring multiple reasoning paths, maintaining consistency between visual evidence and reasoning progression.

The DFD module significantly improves the model's reasoning capabilities and suppresses image hallucinations.

You can also see this in our ablation experiments.

| Models | MathVista | MathVision | S.W-China | S.W-Global |
|---|---|---|---|---|
| Our | 70.4 | 23.36 | 35.12 | 29.37 |
| without DFD | 68.5 | 20.42 | 28.42 | 27.19 |
| without VFA | 69.3 | 21.71 | 27.08 | 26.25 |
| Baseline | 68.2 | 18.09 | 21.45 | 25.31 |

On math reasoning datasets such as MathVista and MathVision, our DFD module significantly improves model performance. You seem to have focused solely on our VFA module and overlooked the performance benefits of the DFD module.

(2) Regarding your comment that “there is some exaggeration regarding the effectiveness of the method.”

First, it should be explained that we did not test with text alone; we also tested with the original image and the CoFFT-cropped image. The models can generally obtain the relevant information from the cropped image.

Moreover, we did not exaggerate the effectiveness of our method. Instead, we have frankly analyzed this point in our response, as follows:

Model background knowledge significantly impacts reasoning, as shown in our analysis (lines 233-238). CoFFT achieves improvements on geography-based benchmarks like SeekWorld-China by helping identify fine-grained, region-specific details. However, gaps remain compared to GPT-4o in global settings due to knowledge limitations.

| Model | S.W-China | S.W-Global |
|---|---|---|
| GPT-4o | 31.90 | 56.50 |
| Qwen2.5-VL-7B | 21.45 | 25.31 |
| 7B with CoFFT | 35.12 | 29.37 |
| Qwen2.5-VL-32B | 24.13 | 25.63 |
| 32B with CoFFT | 38.61 | 30.63 |
Comment

We hope this clarifies any misunderstanding of our framework, and we hope you will take the time to reply to us.

Final Decision

The paper introduces a training-free framework that integrates Dual Foresight Decoding (DFD) and Visual Focus Adjustment (VFA) in an iterative loop to enhance the visual reasoning ability of vision-language models.

The reviewers recognized multiple strengths of the work. Reviewer GWAM praised the paper as technically sound, self-consistent, and supported by thorough evaluation. Reviewer UGRZ highlighted the originality of the framework, the intuitiveness of the proposed foresight-guided focus, and the strong empirical results across diverse benchmarks. Reviewer hXHh emphasized the broad applicability of the training-free approach and the clear motivation. Reviewer nDak acknowledged the inspiration from human cognition and the novelty of addressing visual focus in VLMs.

The primary remaining reservation comes from Reviewer nDak, who questioned whether some figures (notably Figure 1) convincingly demonstrate the method's effectiveness. However, this concern relates mainly to presentation rather than the substance of the contribution, and the rebuttal clarified that Figure 1 should be read as a motivational example rather than proof of capability.

It is recommended that the authors revise Figure 1 and related text in the camera-ready version to avoid confusion and provide a clear discussion of the method’s limitations.