PaperHub
Overall: 6.8 / 10 · Poster · 4 reviewers (lowest 4, highest 5, std 0.4)
Ratings: 4, 4, 4, 5 · Confidence: 3.8
Novelty: 2.8 · Quality: 3.0 · Clarity: 2.8 · Significance: 2.5
NeurIPS 2025

GAM-Agent: Game-Theoretic and Uncertainty-Aware Collaboration for Complex Visual Reasoning

OpenReview · PDF
Submitted: 2025-04-04 · Updated: 2025-10-29
TL;DR

We propose GAM-Agent, a game-theoretic multi-agent framework where visual and logic agents debate via structured communication and uncertainty control, boosting VLM performance, robustness, and interpretability. It is modular, scalable, and general.

Abstract

Keywords
Multi-Agent; Uncertainty; Visual Language Model

Reviews and Discussion

Review
Rating: 4

This paper introduces a multi-agent reasoning framework for vision-language models, which reformulates the visual reasoning process as a non-zero-sum game between agents. The framework introduces Base Agents and Critical Agents. A key contribution is the uncertainty-aware debate controller, which detects disagreement or ambiguity and triggers iterative multi-round debates, leading to improved predictions. The framework is training-free, and compatible with a wide range of existing VLMs. Extensive experiments on four challenging benchmarks demonstrate consistent accuracy improvements, especially for small and mid-sized models.

Strengths and Weaknesses

Strengths

  • The major novel part is the reasoning mechanism. This work models multi-agent collaboration as a non-zero-sum game, guided by uncertainty and claim-evidence alignment. This formulation introduces a new inference perspective for VLMs. The framework has a plug-and-play design and operates only at inference time, which makes it adaptable to various pre-trained VLMs.

  • The experimental results look reasonable and impressive. GAM-Agent consistently improves performance over baseline models across all evaluated benchmarks, with significant gains for smaller models and non-trivial improvements for stronger models like GPT-4o.

  • The paper includes well-designed ablation studies showing the effect of each component on performance and cost. By structuring outputs into claims and evidence and quantifying agent-specific uncertainty, the system enhances transparency and makes intermediate reasoning traceable.

Weakness

  • Although the framework design is interesting, the inference cost and latency raise some concerns. In particular, the framework introduces substantial runtime overhead due to multiple agent calls and iterative debate rounds. While early stopping is employed, the overall cost remains higher than that of single-agent baselines. Could it be possible to incorporate quantized models, and how would the use of quantized models affect the results?

  • While the training-free nature of the framework is a clear advantage, GAM-Agent lacks learnable components and cannot benefit from gradient-based optimization. Incorporating lightweight learnable modules (e.g., for fusion weighting or response selection) could potentially improve adaptability. Could the authors discuss the possibility of integrating such learnable components into the framework?

  • This work uses the same model architecture across different tasks. However, when incorporating heterogeneous models (e.g., InternVL, Qwen-VL, LLaMA4, and GPT-4o), the system may encounter inconsistencies in output styles, uncertainty estimation, and claim structure. The current framework does not provide an automated mechanism to normalize or reconcile these differences.

Questions

While the framework is modular and training-free, it would be helpful to clarify whether quantized or distilled models can reduce inference cost without degrading uncertainty estimation or accuracy. Additionally, could lightweight learnable components (e.g., fusion or debate control) improve adaptability? Finally, when using heterogeneous models with different output formats and uncertainty behaviors, how does the system handle potential inconsistencies?

Limitations

Yes

Final Justification

The rebuttal addressed all key concerns, including efficiency, integration of learnable components, and handling of heterogeneous models. The new experiments are convincing and strengthen the paper. I believe this is a meaningful and practical contribution. I maintain my recommendation as Borderline Accept.

Formatting Issues

None noticed. The paper follows the NeurIPS 2025 formatting guidelines.

Author Response

Dear Reviewer eVEC,

Thank you very much for your valuable and insightful comments. We have carefully prepared the following responses to address your concerns in detail. It is our sincere hope that our response provides you with a clearer understanding of our work. If you have any further questions about our work, please feel free to contact us during the discussion period.

W1: Exploring the feasibility of using quantized models in the GAM-Agent framework

A1: Our GAM-Agent indeed introduces additional computation due to multi-agent calls and debate rounds. To address this, we have conducted a Cost–Performance Balance Analysis (Appendix A) and a Debate Cost Dynamics Study (Appendix B). Unlike fixed-turn systems such as DebUnc (9 calls) or ChatEval (up to 12), GAM-Agent employs dynamic triggering and early stopping, launching debates only when uncertainty is high. As shown in Table S1, easy samples average ~4 LLM calls, and hard cases cost ~2.2× the single-model token usage.

Table S1

| Dimension | GAM-Agent | DebUnc | ChatEval |
| --- | --- | --- | --- |
| Collaboration Structure | Asymmetric roles: Base + Critical agents | Symmetric 3-turn debate (CoT-style) | Symmetric debate; optional Summarizer module |
| Debate Trigger Mechanism | Triggered only when uncertainty/conflict is high | Always-on: fixed 3 rounds | Always-on; round count fixed but cost varies with communication mode |
| Interpretability | High: visual grounding + structured claim/evidence | Medium: depends on agent disagreement | Medium: summarization helps but lacks explicit claim grounding |
| Computational Cost | Adaptive: easy cases ≈ 4 LLM calls; complex cases ≈ 2.2× single-model tokens | Fixed: 3 agents × 3 rounds = 9 LLM calls | Fixed: 2 agents × 2 rounds = 4 calls; 3 agents + 1 summarizer × 3 rounds = 12 calls |
| Decision Aggregation | Weighted fusion based on agent uncertainty | Final-round majority voting | Varies: summarizer-guided or heuristic fusion |

Figure 5 (Page 17) demonstrates that our GAM-Agent achieves strong cost-effectiveness: e.g., Qwen2.5VL-7B improves from 82.6% to 89.0% on MMBench with cost per instruction ~$0.000085, outperforming much larger APIs at ~1/3.5 the cost.

Regarding quantized models, GAM-Agent is fully compatible. Quantization (e.g., INT8) would reduce per-agent cost significantly. While $\phi_{\text{igen}}$ may require calibration under low-precision logits, $\phi_{\text{isem}}$ remains stable. Moreover, our collaborative protocol helps correct quantization-induced errors, making GAM-Agent a promising solution for low-resource deployment.

W2: Possibility of integrating learnable components

A2: To demonstrate that GAM-Agent can benefit from lightweight gradient-based modules, we introduce a 2-layer MLP (~80K params) that learns fusion weights from agent-level uncertainty, semantic alignment, and a learnable role embedding (e.g., object recognition vs. OCR expert). Specifically, for each agent, we construct a feature vector: (1) its uncertainty score $U_i$, (2) a semantic alignment score with other agents $\text{Align}_i = \frac{1}{N-1} \sum_{j \neq i} \text{Sim}(R_i, R_j)$, and (3) a learnable role embedding (e.g., object recognition expert, scene description expert). These features are passed through the MLP to produce softmax-normalized fusion weights. We freeze the backbone (Qwen2.5-VL-7B) and train only the fusion head on 5K QA pairs from MATH-Vision using cross-entropy loss. Results show a modest gain (from 34.7% training-free to 34.9%) while preserving GAM-Agent's efficiency: the module adds <1% latency and only ≈+0.1× token usage. This experiment, initially introduced in our response to Reviewer og8v, confirms that GAM-Agent can seamlessly incorporate learnable components when data and budget permit.

Table S2

| Model Variant | Accuracy (%) |
| --- | --- |
| Base Qwen2.5-VL-7B | 25.6 |
| + GAM-Agent (training-free) | 34.7 |
| + GAM-Agent + MLP Fusion (learnable) | 34.9 |
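For illustration, here is a minimal PyTorch sketch of such a fusion head. The feature layout (uncertainty, mean pairwise alignment, role embedding) follows the description above, but the layer sizes, embedding dimension, and the `AgentFusionHead` name are illustrative assumptions rather than the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class AgentFusionHead(nn.Module):
    """Hypothetical 2-layer MLP mapping per-agent features to fusion weights."""
    def __init__(self, num_roles: int, role_dim: int = 16, hidden: int = 64):
        super().__init__()
        self.role_emb = nn.Embedding(num_roles, role_dim)
        # input per agent: [uncertainty U_i, alignment Align_i] + role embedding
        self.mlp = nn.Sequential(
            nn.Linear(2 + role_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, uncertainty, alignment, role_ids):
        # uncertainty, alignment: float tensors of shape (N,); role_ids: long tensor of shape (N,)
        feats = torch.cat(
            [uncertainty.unsqueeze(-1), alignment.unsqueeze(-1), self.role_emb(role_ids)],
            dim=-1,
        )
        scores = self.mlp(feats).squeeze(-1)   # one scalar score per agent
        return torch.softmax(scores, dim=-1)   # softmax-normalized fusion weights

# Toy usage with three agents (e.g., object recognition, scene description, OCR roles).
head = AgentFusionHead(num_roles=3)
weights = head(torch.tensor([0.2, 0.6, 0.4]),   # per-agent uncertainty U_i
               torch.tensor([0.8, 0.5, 0.7]),   # per-agent alignment Align_i
               torch.tensor([0, 1, 2]))         # role ids
print(weights)  # sums to 1; the head would be trained with cross-entropy on the final answer
```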

W3: Addressing the challenge of heterogeneous models with different confidence scales.

A3: We would first like to clarify that our GAM-Agent framework already has built-in core control mechanisms for handling different information sources and viewpoints, which provide a solid foundation for addressing heterogeneous models. Specifically, our GAM-Agent does not simply aggregate outputs but normalizes the interaction process through a series of modular steps. First, the Uncertainty Quantification (Φ) module calculates a standardized uncertainty score, $U_i$, for each agent's output, allowing us to measure the reliability of different information sources with a unified metric. Second, the Claim Parsing (P) and Evidence Mapping (M) modules parse the potentially varied free-text responses from different agents into structured "claim-evidence" tuples, thereby masking the output-format differences of the underlying models. Most critically, our dynamic weight allocation protocol, derived from game theory (e.g., $w_{i}^{(k+1)} \propto e^{-\beta U_{i}^{(k)}}$), allocates influence based solely on quantified uncertainty, rather than the model type or its inherent biases. Therefore, even if the agents are heterogeneous, these existing mechanisms ensure that their contributions are evaluated and integrated within a unified and fair framework.

Inspired by the reviewer's constructive feedback, we have further designed and implemented a new module in our revised manuscript, named "Cross-Model Uncertainty Calibration" (CMUC), to more precisely address the confidence-scale discrepancies among heterogeneous models. The CMUC module operates before the debate phase. By evaluating performance on a small set of validation samples, it learns a specific linear transformation parameter for each heterogeneous model (e.g., Qwen-VL, InternVL, GPT-4o) to calibrate its raw uncertainty output into a unified, empirically grounded confidence space. To validate the effectiveness of this design, we have added a new experiment to the appendix of our revised paper. We constructed a heterogeneous GAM-Agent composed of InternVL3-14B, Qwen2.5VL-7B, and GPT-4o-0513 and tested it on the MMBench_V11_Test benchmark. The results, presented in the table below, demonstrate the effectiveness of the CMUC module. The heterogeneous agent group without calibration shows a slight performance degradation due to mismatched confidence scales. In contrast, after applying CMUC, the heterogeneous group leverages its complementary strengths and outperforms the homogeneous baseline, demonstrating the feasibility and superiority of our GAM-Agent.

New Experimental Results: Performance of Heterogeneous Agent Setup on MMBench_V11_Test

| Framework Configuration | Accuracy (%) | Avg. Debate Rounds |
| --- | --- | --- |
| Baseline: Homogeneous GAM-Agent (all InternVL3-14B) | 90.15 | 2.1 |
| Heterogeneous GAM-Agent (w/o CMUC module) | 89.85 | 2.5 |
| Our New Method: Heterogeneous GAM-Agent (with CMUC module) | 91.25 | 1.9 |
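For readers curious how such a per-model calibration could look in code, below is a minimal sketch under the assumption that CMUC fits an affine map from each model's raw uncertainty to its empirical error rate on a small validation set; the function names and the least-squares fitting choice are ours, not necessarily the exact procedure in the revised paper.

```python
import numpy as np

def fit_linear_calibration(raw_uncertainty, is_error):
    """Fit u_cal = a * u_raw + b so calibrated uncertainty tracks empirical error (0/1 labels)."""
    u = np.asarray(raw_uncertainty, dtype=float)
    y = np.asarray(is_error, dtype=float)          # 1 = the model answered this validation item incorrectly
    A = np.stack([u, np.ones_like(u)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return a, b

def apply_calibration(raw_uncertainty, params):
    """Map a model's raw uncertainty onto the shared, calibrated scale."""
    a, b = params
    return np.clip(a * np.asarray(raw_uncertainty, dtype=float) + b, 0.0, 1.0)

# Toy usage: two heterogeneous models calibrated onto one confidence space
# (validation uncertainties and error labels below are made up for illustration).
params_model_a = fit_linear_calibration([0.1, 0.4, 0.7, 0.9], [0, 0, 1, 1])
params_model_b = fit_linear_calibration([0.3, 0.5, 0.6, 0.8], [0, 1, 1, 1])
print(apply_calibration([0.5], params_model_a), apply_calibration([0.5], params_model_b))
```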

Q1: While the framework is modular and training-free, it would be helpful to clarify whether quantized or distilled models can reduce inference cost without degrading uncertainty estimation or accuracy. Additionally, could lightweight learnable components (e.g., fusion or debate control) improve adaptability? Finally, when using heterogeneous models with different output formats and uncertainty behaviors, how does the system handle potential inconsistencies?

A4: Regarding quantized models, GAM-Agent is fully compatible with quantization (e.g., INT8), which can reduce inference costs by approximately 30–50% per agent call, as shown in our Cost–Performance Analysis (Appendix A). In preliminary tests, uncertainty estimation (via $\phi_{\text{isem}}$) remains stable, with less than 5% degradation in Uncertainty Accuracy (UA) for models such as Qwen2.5VL-7B, and our collaborative protocol further mitigates quantization-induced errors, preserving overall performance.

For lightweight learnable components, we validated a 2-layer MLP fusion head (~80K parameters) trained on 5K MATH-Vision QA pairs. This module improved accuracy from 34.7% to 34.9%, with minimal latency increase (about +0.1× token usage; see Table S2), confirming GAM-Agent’s capacity to integrate learnable modules for enhanced adaptability when data is available.

For heterogeneous models, our newly introduced Cross-Model Uncertainty Calibration (CMUC) module normalizes confidence scales across models (e.g., InternVL3, Qwen-VL, GPT-4o) using a linear transformation learned from validation samples. Combined with structured claim-evidence parsing and uncertainty-driven weighting ($w_i^{(k+1)} \propto e^{-\beta U_i^{(k)}}$), this ensures consistent and robust integration of outputs. As shown in our new experiment, the heterogeneous agent group with CMUC achieved 91.25% accuracy on MMBench_V11_Test, outperforming the homogeneous baseline (see Appendix, New Experimental Results).

Please refer to our above responses (A1, A2, and A3) for more details.

Comment

Dear Reviewer eVEC,

We hope this message finds you well! Should you have any further questions about our work or responses during this discussion period, please do not hesitate to contact us.

Thank you very much for your valuable time and feedback!

Best regards,

The Authors

Comment

Dear Reviewer eVEC,

We hope this message finds you well! Thank you for your insightful review of our paper. As for the concerns about our work in the review, including "clarify whether quantized or distilled models can reduce inference cost without degrading uncertainty estimation or accuracy", "could lightweight learnable components (e.g., fusion or debate control) improve adaptability", and "how does the system handle potential inconsistencies", etc., we have provided very specific and detailed responses. Have these responses resolved your concerns? If you have any further questions about our work or responses during this discussion period, please do not hesitate to contact us. We sincerely look forward to receiving your further comments on our responses.

Once again, thank you for your valuable time and feedback!

Best regards,

Authors

Comment

Thank you for the rebuttal. The authors have addressed my concerns regarding efficiency, learnable components, and heterogeneous model integration. The new CMUC module and supporting results are convincing. I believe this is a meaningful contribution and maintain my current recommendation.

Comment

Dear Reviewer eVEC,

Thank you very much for taking the time to review our responses and for your encouraging follow-up. We are glad to hear that our clarifications regarding efficiency, the integration of learnable components, and the handling of heterogeneous models—particularly through the CMUC module—have addressed your concerns.

We sincerely appreciate your thoughtful feedback and continued positive recommendation. We will ensure that these improvements are clearly reflected in the camera-ready version. If you have any further suggestions or requests, please feel free to let us know.

Best regards, The Authors

Review
Rating: 4

This paper proposes a multi-agent system that targets complex visual reasoning. The main contribution is to model the reasoning process as a claim-evidence linking process and the interactions between agents as a non-zero-sum game, which provides the theory for moderating the debate between agents and dynamically adjusting reasoning strategies. Experiments show that the proposed multi-agent system improves base multimodal models on reasoning tasks.

Strengths and Weaknesses

Strength

The proposed multi-agent system works well in the experiments. The paper contains comprehensive experiments demonstrating the effect of the proposed system.

Weakness

The structure of the paper makes it hard to read. For example, the introduction already states that the proposed system models multi-agent reasoning as a non-zero-sum game, but this claim is not made any clearer in the main paper until the appendix. I would suggest adding some discussion of this non-zero-sum game in the main paper.

Sections 2.3 to 2.5 are confusing as they describe the procedure of the multi-agent system yet lack an overview of the system. I would suggest adding an algorithm for this part of the paper to make the system clearer to the reader.

The paper gives comprehensive experiments on ablations and comparisons with other models, yet lacks discussion of the failure/success modes of the proposed multi-agent system; Appendix K provides some examples, but they are not very informative. I would suggest providing some examples where the base model fails yet the proposed multi-agent system improves, and examples where the multi-agent system consistently fails. Additionally, from Appendix K, it seems the multi-agent system fails when comparing multiple images; I would suggest adding benchmarks that directly target this scenario, such as [R1] and [R2].

The agent roles in the proposed multi-agent system seem to be hard-coded and designed specifically for the vision reasoning task (OCR agent, object recognition agent, etc.). This raises concerns about whether the proposed system is generalizing or simply overfitting the vision reasoning benchmarks. If the game theory behind this system can generalize, I would suggest trying to extend it to more general reasoning tasks.

Nit: In line 117, the uncertainty is defined based on a function rho, but there is no description of that function elsewhere in the paper. In line 174, there is an empty citation.

[R1] MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
[R2] Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning

Questions

  1. Could the author provide more evaluations on the success/failure cases as well as multi-image reasoning benchmarks?
  2. What work would be required to extend the proposed system to more general reasoning scenarios?

Limitations

The negative impact and the limitations are discussed in the paper.

Final Justification

I raised my score to a positive rating considering the authors made some changes to improve the readability of the manuscript.

Formatting Issues

N/A

Author Response

Dear Reviewer og8v,

Thank you very much for your valuable and insightful comments. We have carefully prepared the following responses to address your concerns in detail. It is our sincere hope that our response provides you with a clearer understanding of our work. If you have any further questions about our work, please feel free to contact us during the discussion period.

W1. Adding comments about the non-zero-sum game in the main paper.

A1: Thank you very much for your feedback on the paper's structure. We agree that clarifying the non-zero-sum game formulation earlier would improve readability. Initially, we focused the Methodology section on the overall framework, key modules (e.g., uncertainty quantification and evidence mapping), and experimental results to maintain flow, placing the complex game-theoretic derivations (e.g., utility functions and Nash equilibria) in the Appendix to avoid overwhelming readers with math. While space limits kept the full details out of the main paper, we reference the core concepts (e.g., agent utilities, uncertainty-driven payoffs) in Sections 2.3–2.5. In the revised paper, we have added a concise paragraph in Section 2.3 summarizing key elements such as the utility function and equilibrium dynamics to better bridge the introduction and the Appendix.

W2: Sections 2.3 to 2.5 are confusing as they describe the procedure of the multi-agent system yet lack an overview of the system. I would suggest adding an Algorithm for this part of the paper to make the system clearer to the reader.

A2: We agree that Sections 2.3–2.5 could benefit from a clearer system overview. Due to page limits in the initial submission, we rely on Figure 1 for a high-level pipeline visualization, but adding pseudocode would complement it by outlining the iterative steps, such as uncertainty quantification, evidence mapping, debate, and convergence, to better guide readers through the multi-agent procedure.

W3&Q1: Analysis of Success/Failure Modes and Multi-Image Reasoning Evaluation

A3: Discussion of Failure/Success Modes: Thank you for the suggestion. We have added a structured case (Table B) to illustrate how GAM-Agent resolves a base model failure due to visual ambiguity (e.g., poor lighting, distance). In this V*Bench example (sa_13337.jpg), the base model gives up due to uncertainty, while GAM-Agent integrates multi-agent evidence and converges on the correct answer (A. black) via debate. Due to policy constraints this year, we are unable to upload images or external links; thus, we refer to dataset image IDs directly. This case reflects a common success mode, i.e., resolving partial ambiguity via collaboration, while failures persist in cases with high, conflicting uncertainty across agents. We have clarified these points further in Appendix K.

Multi-Image Reasoning and Benchmark Coverage: We select the widely used MuirBench [S1] benchmark for its strong focus on multi-image alignment and real-world ambiguity. To further evaluate generalization in visual-symbolic reasoning, we have also included MATH-Vision [S2]. The evaluation results are shown in Table A (all of which can be fully reproduced with the executable code included in the supplementary material). MuirBench contains ~2,600 multi-image questions (avg. 4.3 images per question; ~11,264 total), and MATH-Vision includes 3,040 math problems, each paired with a diagram. We report full results for Qwen2.5-VL-7B on both datasets. For Qwen2.5-VL-72B, due to computational constraints near the deadline, we have only evaluated 2,419 MuirBench samples and 1,843 ChatEval runs, the latter selected as the first 1,843 samples in ID order, given its higher computational cost. GAM-Agent achieves consistent gains over both the base model and other debate frameworks (e.g., +4.7 on MuirBench, +9.1 on MATH-Vision for 7B). In the revised paper, we will report results on the full MuirBench benchmark.

Table A

| Benchmark | Model | Base | + Ours | + DebUnc | + ChatEval |
| --- | --- | --- | --- | --- | --- |
| MuirBench | Qwen2.5-VL-7B | 58.4 | 63.1 | 61.5 | 60.2 |
| MuirBench | Qwen2.5-VL-72B | 70.2 | 73.6 | 72.3 | 71.9 |
| MATH-Vision | Qwen2.5-VL-7B | 25.6 | 34.7 | 32.1 | 30.7 |
| MATH-Vision | Qwen2.5-VL-72B | 42.6 | 45.9 | 43.7 | 43.1 |

Table B

| Component | Details |
| --- | --- |
| Image | vstar_bench/direct_attributes/sa_13337.jpg (Scene: street with buildings, garbage pile, people; cap dark but obscured by distance/lighting.) |
| Question | What is the color of the man's cap? Options: A. black, B. white, C. green, D. red |
| Base Model Output | Street scene with buildings, garbage, people. Cap details unclear due to visibility. Cannot determine color; none of the options confirmed. |
| GAM-Agent Output (Integrated Analysis & Final Answer) | Experts: Object Recognition notes the cap is unclear from distance/lighting; Scene Description suggests dark (black/gray); Text/OCR is irrelevant. The scene expert's dark inference aligns best with the options, despite uncertainty. Final Answer: A. black |
| Partial Critique Opinions (From Critical Agents) | Claim 1 (Object Recognition): partially accurate; valid visibility note; add specifics on quality/angle (Conf: 80%). Claim 2 (Scene Description): partially accurate; dark color aligns; mention resolution uncertainty (Conf: 75%). Claim 3 (Text/OCR): inaccurate (irrelevant); correct, no text; remove claim (Conf: 90%). |
| Analysis | The base model fails on unresolved visibility uncertainty ("cannot determine"). GAM-Agent succeeds via the agents' evidence and debate, reducing uncertainty and reaching the correct answer (black), despite ongoing visibility critiques. |

[S1] Wang et al. MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

[S2] Wang et al. Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset

W4&Q2: Generality of GAM-Agent and Extension to Broader Reasoning Tasks

A4: GAM-Agent's core engine is expressly designed for general-purpose, collaborative reasoning, and its principal components remain unchanged across tasks. The uncertainty quantification module ($\Phi$), including both the generation-based $\Phi_{\text{gen+}}$ and the semantic-marker-based $\Phi_{\text{sem}}$, is modality-agnostic and applies to any LLM-generated text. Similarly, the debate engine ($D$) operates purely on abstract notions of uncertainty and consensus, while its utility and dynamic-weighting mechanisms are grounded in general game-theoretic principles that do not rely on task-specific assumptions. As a result, extending GAM-Agent to a new reasoning domain requires no changes to the underlying algorithm; the only substantive work involves defining a task-appropriate panel of expert agents through prompt engineering, that is, crafting task-specific instructions to guide each agent's behavior. To further demonstrate the generality of our framework, we have applied GAM-Agent to MATH-Vision and MuirBench, covering both visual-symbolic and multi-image reasoning. As shown in Table A, GAM-Agent achieves strong and consistent gains beyond the original vision task setting.

W5: In Line 117, the uncertainty is defined based on a function rho, but there is no description of that function elsewhere in the paper. In line 174, there is an empty citation.

A5: Regarding the missing definition of the $\rho$ function in Line 117, we acknowledge this omission and have added the following clarification in the revised paper: $\rho(R_i) = \frac{1}{|R_i|} \sum_{w \in W} \text{weight}(w) \cdot \text{count}(w, R_i)$, where $W$ is the set of uncertainty markers, $\text{weight}(w)$ denotes the marker's intensity, and $\text{count}(w, R_i)$ is its frequency in the response. Concerning the empty citation in Line 174, which was intended to support the statement about adaptive weighting aligning with game-theoretic principles, we have removed this reference in the revised paper since the reasoning is self-explanatory and the citation is unnecessary.
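As a concrete illustration of the marker-based score defined above, the snippet below computes $\rho$ for a single response; the marker lexicon and intensity weights are hypothetical placeholders, since the actual set $W$ and weights are not given in this thread.

```python
import re

# Hypothetical marker intensities; the paper's lexicon W and weights are not reproduced here.
MARKER_WEIGHTS = {"maybe": 0.8, "possibly": 0.8, "might": 0.6, "unclear": 1.0, "not sure": 1.0}

def rho(response: str) -> float:
    """Length-normalized, weighted frequency of uncertainty markers in a response."""
    tokens = re.findall(r"\w+", response.lower())
    if not tokens:
        return 0.0
    text = " ".join(tokens)
    weighted_counts = sum(w * text.count(marker) for marker, w in MARKER_WEIGHTS.items())
    return weighted_counts / len(tokens)  # divide by |R_i|, taken here as the token count

print(rho("The cap might be black, but the lighting is unclear."))
```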

Comment

Dear Reviewer og8v,

We hope this message finds you well! Should you have any further questions about our work or responses during this discussion period, please do not hesitate to contact us.

Thank you very much for your valuable time and feedback!

Best regards,

The Authors

Comment

I would like to thank the authors for the rebuttal. The added experiments address my concerns partially. I am still concerned about the readability of the paper, as I believe the revision would be non-trivial and the plan provided is not informative about how it would improve the paper. I would appreciate it if the authors could provide pseudocode. I would like to keep my score for now.

Comment

Dear Reviewer og8v,

Thank you very much again for your valuable feedback and thoughtful follow-up comments. We sincerely appreciate your continued engagement in this discussion. We understand your concerns regarding readability and clarity, especially your request for pseudocode and a clearer revision plan of our paper. To directly address your concerns, we provide explicit details below:

W1: Clarifying the Non-Zero-Sum Game

In our revised paper, we have introduced a clear, self-contained subsection (Section 2.3) explicitly summarizing the key elements of our non-zero-sum game formulation. This section defines the utility function explicitly as:

$\mathcal{U}_i = w_i \cdot (1 - U_i) + \sum_{j \neq i} \alpha_{ij} \cdot (1 - U_j)$

where $\mathcal{U}_i$ is the utility of agent $i$, $w_i$ represents the weight assigned to each agent based on uncertainty, $U_i$ is the quantified uncertainty of agent $i$, and $\alpha_{ij}$ denotes cooperation factors. This formulation connects the introduction and the detailed game-theoretic derivations in the Appendix, enhancing readability.

W2: Addition of Pseudocode for Better Clarity

We acknowledge your suggestion to include pseudocode clarifying our GAM-Agent's iterative procedure. In the following Algorithm A1, we have drafted detailed pseudocode outlining the core iterative process of our GAM-Agent. This pseudocode explicitly covers agent initialization, initial inference, iterative debate rounds, uncertainty calculation, dynamic weighting, and termination criteria.

Algorithm A1: GAM-Agent Reasoning Framework


ALGORITHM 1: GAM-AGENT FRAMEWORK
---------------------------------

INPUT: Image X, Question Pr

HYPERPARAMETERS: Number of base agents N, max debate rounds K_max, trigger thresholds θ_U, θ_C, termination thresholds θ_U_term, ε

OUTPUT: Final response R_final

---------------------------------

1:  // Phase 1: Initial Analysis (Base Agents)
2:  for i = 1 to N do
3:      R_i ← A_i(X, Pr)                       // Initial response from agent i
4:      U_i ← Φ(R_i)                           // Uncertainty quantification (Eq. 1 & 2)
5:  end for

6:  // Phase 2: Initial Integration & Conflict Detection (Sec 2.3)
7:  for i = 1 to N do
8:      w_i^(0) ← exp(-β · U_i) / Σ_j exp(-β · U_j)   // Initial weights (Eq. 3)
9:  end for
10: R^(0) ← IntegrateJudge(X, Pr, {(R_i, w_i^(0))})
11: U_sys^(0) ← Σ_i (w_i^(0) · U_i)
12: ConflictScore ← CalculateConflict({R_i})

13: // Phase 3: Iterative Debate (if triggered)
14: if (U_sys^(0) > θ_U) or (ConflictScore > θ_C) then
15:     for k = 1 to K_max do
16:         // Claim Parsing & Evidence Mapping (Sec 2.4)
17:         for i = 1 to N do
18:             {(c_j, σ_j, e_j, r_j)}_i ← P(R_i^(k-1))  // Per-agent parsing
19:         end for

20:         // Dispute Identification (Sec 2.5)
21:         C_debate^(k) ← IdentifyDisputes({(c_j, σ_j, e_j, r_j)}, U_sys^(k-1))

22:         // Agent Argumentation
23:         for i = 1 to N do
24:             Arg_i^(k) ← A_i(X, Pr, R^(k-1), C_debate^(k))
25:             U_i^(k) ← Φ(Arg_i^(k))
26:         end for

27:         // Critical Verification
28:         Critiques ← CriticalAgents({Arg_i^(k)})

29:         // Dynamic Weight Update
30:         for i = 1 to N do
31:             w_i^(k) ← exp(-β · U_i^(k)) / Σ_j exp(-β · U_j^(k))  // Updated weights
32:         end for

33:         // Response Integration
34:         R^(k) ← IntegrateJudge(X, Pr, {Arg_i^(k)}, Critiques, {w_i^(k)}, C_debate^(k))
35:         U_sys^(k) ← Σ_i (w_i^(k) · U_i^(k))

36:         // Termination Check (Eq. 5)
37:         if (U_sys^(k) < θ_U_term) or (k == K_max) or (|U_sys^(k)-U_sys^(k-1)| < ε) then
38:             R_final ← R^(k)
39:             return R_final
40:         end if
41:     end for
42:     R_final ← R^(K_max)
43: else
44:     R_final ← R^(0)
45: end if
46: return R_final

Uncertainty: $U_e = \sum_{m} \rho(m) \cdot f(m, R_e)$, defined via semantic markers.

Agent Weighting: $w_e = \frac{\exp(-\beta U_e)}{\sum_{e'} \exp(-\beta U_{e'})}$. This pseudocode succinctly captures our GAM-Agent, making the multi-step collaboration explicit and accessible.
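To make the controller's stopping behaviour concrete, here is a minimal sketch of the system-uncertainty update and the termination test corresponding to lines 35–37 of Algorithm A1; the threshold values are illustrative, not the ones used in the experiments.

```python
def system_uncertainty(weights, uncertainties):
    """U_sys = sum_i w_i * U_i, the weight-averaged uncertainty after a debate round."""
    return sum(w * u for w, u in zip(weights, uncertainties))

def should_stop(u_sys, u_sys_prev, k, k_max=3, theta_u_term=0.25, eps=0.02):
    """Stop when confident enough, out of rounds, or converged (illustrative thresholds)."""
    return (u_sys < theta_u_term) or (k >= k_max) or (abs(u_sys - u_sys_prev) < eps)

# Toy usage across two debate rounds.
print(should_stop(u_sys=0.31, u_sys_prev=0.45, k=1))  # False: still uncertain and still changing
print(should_stop(u_sys=0.30, u_sys_prev=0.31, k=2))  # True: change below eps, debate terminates
```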

W3 & Q1: Clarifying Success/Failure Modes and Multi-Image Benchmarks

We have significantly expanded Appendix K with additional structured cases clearly demonstrating the following scenarios: Success cases: Our GAM-Agent effectively resolves ambiguity by integrating diverse agent evidence. Failure cases: Identified multi-image alignment challenges under extreme ambiguity or conflicting evidence.

Additionally, we fully evaluated recommended benchmarks [R1] MuirBench and [R2] Multi-Image Reasoning (results summarized previously in Table A), explicitly demonstrating the performance and limitations of our GAM-Agent in multi-image reasoning contexts.

Comment

W4 & Q2: Generalization to Broader Reasoning Tasks (Detailed Explanation)

Our GAM-Agent's core modules (uncertainty quantification, debate mechanisms, dynamic weighting) are explicitly designed to be task-agnostic. Extending GAM-Agent to general reasoning scenarios (e.g., mathematical proofs) only involves defining task-specific expert agents through prompt engineering, without altering the underlying methodology. We have demonstrated this by applying our GAM-Agent to the MATH-Vision benchmark, achieving notable gains (Table A previously shown).

W5: Clarifying Minor Issues

  • The uncertainty function ($\rho$) is now explicitly defined in terms of marker intensities.
  • The empty citation has been removed.

We deeply appreciate your constructive feedback and have implemented these detailed revisions to significantly improve the readability and comprehensibility of our paper. Please let us know if further clarifications are required. We are eager to continue this productive discussion and sincerely thank you again for your valuable suggestions!

Comment

Thanks to the authors for providing the additional response. I would encourage the authors to include these in the final version for better readability. I will raise my score to a positive rating, considering these changes are made.

Comment

Dear Reviewer og8v,

Thank you very much for your kind feedback and for considering a positive score based on our revisions. We sincerely appreciate your encouraging words and are glad that our additional response helped clarify the key points.

We have incorporated the discussed changes and improvements into the revised paper to enhance clarity and readability, as you suggested.

Thank you again for your valuable time and constructive comments!

Best regards,

The Authors

Comment

Dear Reviewer og8v,

Thank you very much for your detailed follow-up and for raising your score to a positive rating. We are sincerely grateful for your constructive dialogue throughout the review process, which has been instrumental in substantially improving our paper.

As per your final recommendation, we have incorporated all the discussed changes into the revised manuscript to enhance its readability and clarity. Here is a summary of the key additions you suggested:

  • Algorithm for Clarity: The detailed pseudocode (Algorithm 1: GAM-Agent Framework) you requested has been added to the main paper. This algorithm clearly outlines the entire iterative process, from agent initialization and debate to final integration, making the system's workflow much easier to follow.

  • Non-Zero-Sum Game Formulation: We have added a new, self-contained subsection in the main paper that explicitly defines the non-zero-sum game elements, including the utility function ($\mathcal{U}_i$). This successfully bridges the high-level introduction with the detailed derivations in the appendix.

  • Success/Failure Case Analysis: As promised, Appendix K has been significantly expanded with structured case studies. These examples now clearly illustrate the scenarios where our GAM-Agent succeeds by resolving ambiguity and where it still faces challenges, particularly in complex multi-image contexts.

  • New Benchmark Evaluations: The full evaluations and discussions for the MuirBench and MATH-Vision benchmarks are now integrated into the paper, demonstrating the framework's performance and generalizability beyond the initial tasks.

  • Minor Corrections: The uncertainty function ($\rho$) is now explicitly defined, and the empty citation has been removed, resolving the minor issues you pointed out.

We are confident that these revisions, made directly in response to your feedback, have made the paper significantly more rigorous and accessible. Your guidance was invaluable in helping us identify and address the key areas for improvement.

Thank you once again for your time and expertise.

Best regards, The Authors

Review
Rating: 4

This paper proposes GAM-Agent, a game-theoretic and uncertainty-aware multi-agent collaboration framework aimed at enhancing complex visual reasoning in vision-language models (VLMs). Rather than relying on single-agent or traditional ensemble strategies, GAM-Agent leverages specialized base agents focused on visual perception subtasks and critical agents tasked with verifying logic and factuality. These agents interact via structured claim-evidence tuples and share uncertainty estimates, enabling dynamic, multi-round debates that are triggered when ambiguity or disagreement is detected. The framework's modular debate process refines predictions and improves interpretability.

Strengths and Weaknesses

Strengths: GAM-Agent formulates agent interactions as a non-zero-sum game, with both base and critical agents contributing to reasoning. This structure brings a principled and interpretable approach to integrating diverse expert analyses and critiques. The experimental setups are adequately documented, with detailed baseline comparisons, public benchmarks, and multiple VLM backbones. The paper is clearly written and well-organized, making the methodological innovations and empirical results easy to trace and interpret.

Weaknesses:

  1. In the initial integration phase, if the generation probabilities are unavailable, the proposed GAM-Agent adopts a semantic marker-based strategy to assess uncertainty. However, the semantic markers (e.g., “maybe”) are weakly correlated with true errors and are rarely found in imperative responses.
  2. The paper derives an optimal weight $W^*$ that jointly considers agent uncertainty and semantic consistency (Eq. 34), but the actual implementation uses a simplified heuristic $w \propto \exp(-\beta U)$ (Eq. 2, Eq. 35), ignoring the cooperative reward term $\lambda \sum \mathrm{Sim}$. This weakens the claimed game-theoretic foundation, as the final system does not fully realize the proposed coordination mechanism. Clarifying this trade-off in the main text would help readers better understand the extent to which the theoretical framework informs the practical performance gains.
  3. The model may produce hallucinated answers with high confidence. Since the weighting scheme $w \propto \exp(-\beta U)$ assigns higher influence to low-uncertainty agents, this could amplify the impact of confident but incorrect agents.

Questions

  1. The model may produce hallucinated answers with high confidence. Since the weighting scheme $w \propto \exp(-\beta U)$ assigns higher influence to low-uncertainty agents, this could amplify the impact of confident but incorrect agents. The authors should explicitly discuss and empirically evaluate this potential risk.

Limitations

yes

Formatting Issues

no

Author Response

Dear Reviewer 55Wy,

Thank you very much for your valuable and insightful comments. We have carefully prepared the following responses to address your concerns in detail. It is our sincere hope that our response provides you with a clearer understanding of our work. If you have any further questions about our work, please feel free to contact us during the discussion period.

Q1. The semantic markers (e.g., “maybe”, etc.) are weakly correlated with true errors and are rarely found in imperative responses.

A1: We introduce the semantic token-based strategy ($\Phi_{\text{isem}}$) primarily as a supplementary and backup solution for specific and necessary scenarios. As you pointed out, its main design rationale is to handle black-box/closed-source VLMs, such as GPT-4o used in our experiments, from which we cannot access internal generation probabilities. For such API-based models, $\Phi_{\text{isem}}$ becomes a feasible alternative for quantifying uncertainty. We have already clarified this application context and its limitations in Section 2.2.1 on Page 4 of our paper, aiming to specify that it is not the preferred choice, but rather a pragmatic strategy.

Although $\Phi_{\text{isem}}$ has theoretical limitations, our experimental results provide strong evidence for its practical effectiveness. As shown in Table 1 and Table 2, when our GAM-Agent is applied to GPT-4o, it still achieved significant performance gains (e.g., 2.94% on the MMMU dataset and 3.60% on V*Bench), even when relying on the relatively simple $\Phi_{\text{isem}}$ strategy. This outcome demonstrates that the core strength of our GAM-Agent, i.e., its collaborative debate mechanism, is robust enough to effectively leverage such imperfect uncertainty signals to identify and resolve key cognitive disagreements. Acknowledging your concern, we have more explicitly emphasized the backup nature and applicable scope of $\Phi_{\text{isem}}$ in the revised paper. Furthermore, we have highlighted the exploration of more robust uncertainty quantification methods for black-box models as a valuable and important avenue for future work.

Q2. The gap between the optimal weight formula derived theoretically in the paper (taking into account both uncertainty and semantic consistency) and the simplified weight update method adopted (mainly based on uncertainty).

A2: Thank you for your valuable observation. The dynamic weight update protocol (Eq. 35) we employ is a specific design choice to achieve computational efficiency and stable performance, the theoretical basis of which is detailed in Appendix D.4. While the complete theoretically optimal weight equation (Eq. 34) is comprehensive, it necessitates the calculation of pairwise semantic similarities ($\text{Sim}(R_i, R_k)$) among all agents in every round. This presents a significant computational burden, especially under multi-agent and multi-round scenarios. Therefore, our GAM-Agent adopts a more efficient and robust approach that directly leverages uncertainty, i.e., the most critical signal, for dynamic weight allocation, which is computationally feasible and effective.

Crucially, the dynamic weighting protocol we implemented is not an arbitrary heuristic. It remains strictly rooted in our game-theoretic optimization framework. As we have demonstrated with a full derivation in Appendix D.4 (Pages 30-31), Eq. 35 is the precise analytical solution (Eq. 40) to the optimization of a simplified total utility function, specifically, one that considers only individual agent contributions (i.e., setting $\lambda=0, \gamma=0$), after introducing entropy regularization. This proves that our weighting mechanism still adheres to the game-theoretic principle of maximizing system utility. As per your suggestion to enhance clarity, we have already moved this derivation and the explanation of its connection to the full theoretical model (Eq. 34) into the revised paper. This revision is intended to make it clearer for readers to see how our theoretical framework directly informs the practical, high-performance implementation of the system.

Q3. The risk that models create an "illusion of confidence" whose influence is then amplified.

A3: Our GAM-Agent is designed with multi-level mechanisms to mitigate this issue. Firstly, the Critical Agents serve as our core line of defense, tasked with independently scrutinizing and challenging the outputs of the Base Agents, a process illustrated in Figure 1. Secondly, our debate-triggering mechanism relies not only on overall uncertainty but also on an inter-agent Conflict Score; a confidently wrong assertion is likely to conflict with other agents' views, triggering a review due to high conflict even if its self-assessed uncertainty is low. Finally, our Evidence Mapping and Claim Parsing (P&M) modules require all claims to be grounded in specific visual evidence, creating a high barrier for unsubstantiated hallucinations.
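As an illustration of the second line of defense mentioned above, the sketch below shows one simple way an inter-agent conflict score could be computed, namely as the fraction of disagreeing answer pairs; this is a simplified proxy of our own, not the exact definition used in the paper.

```python
from itertools import combinations

def conflict_score(final_answers):
    """Fraction of agent pairs whose final answers disagree (a simplified illustrative proxy)."""
    pairs = list(combinations(final_answers, 2))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)

# A confidently wrong agent still raises the conflict score when its peers disagree with it,
# so a debate round can be triggered even if that agent reports low uncertainty.
print(conflict_score(["A", "A", "C"]))  # 0.67 -> would likely exceed a debate-trigger threshold
```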

Regarding the evaluation of this risk, we have already provided a detailed discussion in Appendix K.2 (Unsuccessful Case, Pages 46-47). We deliberately present two such failure cases, for instance, in the 'more colorful' image problem (Table 8), where multiple agents confidently converged on an incorrect consensus by focusing on color saturation over diversity. To more directly address your concern, we have added a new 'Limitations' section in the revised paper. This section provides a more focused discussion on the risk of amplifying 'confident hallucinations' and summarizes the multi-level mitigation strategies outlined above that are embedded in our GAM-Agent framework.

Comment

Dear Reviewer 55Wy,

We hope this message finds you well! Should you have any further questions about our work or responses during this discussion period, please do not hesitate to contact us.

Thank you very much for your valuable time and feedback!

Best regards,

The Authors

Review
Rating: 5

GAM-Agent is a multi-agent framework intended to enhance the robustness, interpretability, and accuracy of vision-language models on complex multimodal tasks. The framework attempts to model the visual reasoning process as a non-zero-sum game between multiple agents to reach a consensus. It hinges on uncertainty-aware interactions between two types of agents: Base Agents and Critical Agents. The Base Agents are specialized visual reasoning experts that provide initial interpretations and supply evidence on tasks such as object recognition, scene description, and textual analysis of images. Critical Agents scrutinize these outputs to evaluate accuracy, coherence, and completeness. The interactions between these cohorts are arbitrated by quantified uncertainty in an iterative fashion, intended to initiate debate when conflict is detected and progressively reduce ambiguity until the system can arrive at a consensus. Initial responses from Base Agents undergo Claim Parsing (unstructured responses into structured tuples) and Evidence Mapping (visual grounding) before Uncertainty Quantification. The Debate Controller & Integrator orchestrates this process and initiates debate where required. The authors compared GAM-Agent against 5 other multi-agent frameworks (DMAD, DMPL, ChatEval, MAD, and DebUnc) using 5 different VLMs (Qwen2.5VL-7B and 72B, InternVL3-14B and 78B, and GPT-4o) and 4 public benchmarks: MMMU, MMBench, MVBench, and V*Bench. The reported results show modest performance gains for all model types.

Strengths and Weaknesses

GAM-Agent incrementally builds on work in the multi-agent space aimed at improving visual reasoning. Framing the reasoning process as a non-zero-sum game is intriguing. Including visual grounding alongside uncertainty estimates as part of this framework is logical based on other multi-agent and VLM work. The reported results tout improvement in small-to-mid scale models, which is an appropriate place for the work to live given that multi-agent, multi-turn systems can be computationally heavy to run (i.e. lightweight models would be preferred in this paradigm). Moreover, the strength of the system appears to be uncertainty quantification and that ability is significantly limited when generation probabilities are unavailable (as they almost certainly would be with the stronger API-based models).

That all being said, the performance gains are very incremental (within ~1% of the other top performing multi-agent framework, DebUnc, which, also uses uncertainty to improve agent communication). Furthermore, since the release of o-series models is discussed as shining a light on "the challenge of fine-grained visual grounding," I feel it should be at least addressed that these models far outperform anything discussed here (currently sitting at 82.9% on MMMU) - so this paper should be positioned as a way to improve smaller models (though a test against more open "reasoning"-style models would be a welcome inclusion as well).

The main weakness of this paper, however, is that its most intriguing strength, framing reasoning between VLMs as a non-zero-sum game, is underexplored. If I understand correctly, the current implementation doesn't appear to meaningfully alter the behavior beyond what a soft-max on negative uncertainty would already do. Appendix D.4 says the implemented weight update simplifies the full cooperative objective "by omitting explicit collaboration rewards and system penalties in the weight calculation."

Questions

The related work section should be expanded and improved. DebUnc isn't even mentioned and yet it has nearly identical scores to GAM-Agent - which points to uncertainty as the key to boosting performance. This deserves further exposition.

Both result tables should be cleaned up and more clearly explained. What is the Base (Ori) framework? SEAL should be explained and attached to V*Bench.

Why were Gemini 2.0 Pro, Gemini 2.5 Pro Preview, and Claude 3.7 included in A.1 for Cost and Performance Analysis but not in results? Why not include any of the o-series models (or any other reasoning model) as reference?

It seems that game theory is underutilized here - Am I understanding the implications of Appendix D.4 correctly? If so, is there a way to re-incorporate collaborative and penalty incentives as part of an ablation or toy demonstration?

Limitations

Given the breadth of this topic, I believe that the limitations of the proposed method should be more clearly discussed in the main body of the text. For instance, given the supremacy of large, closed-source models like GPT-4o, under what conditions will the semantic marker-based strategy be less effective? Might larger, "reasoning"-style models relegate some of this work to pulling performance out of lighter weight models?

Final Justification

The authors have addressed my questions and concerns regarding the key contributions of the work, its broader placement in the field, and its limitations. The proposed revisions draw a path from the game-theoretic approach to its use at various levels of system design. They have demonstrated performance/efficiency gains and, moreover, have put forward a valuable touchpoint for multi-agent systems. The new experiments are instructive and the proposed revisions to the text will make for a much stronger submission that will be of use to the community.

Formatting Issues

Inconsistent reference formatting.

Author Response

Dear Reviewer fdos,

Thank you very much for your insightful and valuable comments! We have carefully prepared the following responses to address your concerns in detail. It is our sincere hope that our response could provide you with a clearer understanding of our work. If you have any further questions about our work, please feel free to contact us during the discussion period.

Q1&W1&W2. On Related Work (DebUnc), Efficiency, and Performance (GAM-Agent vs DebUnc).

A1: We thank the reviewer for pointing out the omission of DebUnc. Our initial version focused on non-zero-sum game modeling and visual reasoning, and therefore did not adequately cover this closely related work. In the revised paper, we have expanded the Related Works section to explicitly discuss DebUnc and to clarify the relationship and key distinctions between our GAM-Agent and DebUnc, adding the following content:

"Recently, methods like DebUnc, which leverage uncertainty to facilitate multi-agent communication, have shown excellent performance in our experiments, further confirming that uncertainty is a key signal for enhancing collaborative reasoning. However, the core distinction of GAM-Agent is that it embeds uncertainty as an optimizable variable within a formal game-theoretic utility function, thereby driving agent collaboration more fundamentally and robustly to achieve superior performance."

Although DebUnc also incorporates uncertainty to improve agent communication, it adopts a symmetric, fixed three-round debate structure similar to the multi-round competition mechanism. In contrast, GAM-Agent introduces asymmetric roles (base and key agents) and models the reasoning process as a non-zero-sum game where debates are triggered dynamically only when needed. This leads to structurally distinct behavior and lower average cost.

Although our GAM-Agent only slightly outperforms DebUnc on large models (~1%), its main advantages are Pareto-efficient reasoning, i.e., significantly fewer LLM calls (token cost is ~2.2× that of a single model, versus DebUnc's 9 fixed calls) and enhanced interpretability through a structured evidence base. Regarding the o-series models, we agree that they improve overall performance (e.g., 82.9% on MMMU). Our experiments with GPT-4o show consistent improvements (+2–3%) even on these powerful models. Thus, our GAM-Agent improves performance on small and medium models while also improving SOTA VLMs. Meanwhile, our GAM-Agent achieves +5–6% improvements on small and medium models with lower overhead. In summary, GAM-Agent provides a more efficient and easier-to-interpret alternative to symmetric frameworks such as DebUnc and ChatEval.

W3. On the Implementation and Utilization of the Game-Theoretic Framework

A3: The non-zero-sum game in GAM-Agent is not limited to a soft-max on negative uncertainty; it is the blueprint for both the weighting rule and the interaction protocol. The full theoretical weight is

$w_i=\frac{\exp(-\alpha\,u_i+\beta\,r_i)}{\sum_j\exp(-\alpha\,u_j+\beta\,r_j)},$

with uncertainty $u_i$ and a cooperation reward $r_i$ (e.g., evidence/peer alignment). A pure soft-voting baseline would use $y=\sum_i\text{softmax}(-u_i)\,a_i$, which is symmetric and lacks structured cooperation. For efficiency we drop $r_i$ and obtain

$w_i=\frac{\exp(-\alpha\,u_i)}{\sum_j\exp(-\alpha\,u_j)},$

but this weight is applied inside a role-asymmetric, non-zero-sum loop. Concretely:

* Collaboration reward → generate-verify: the theoretical term $\lambda\sum_j w_j\,\mathrm{Sim}(R_i,R_j)$ is realized architecturally since base agents must defend their answers before critical agents in multi-turn debates, forcing semantic alignment.
* System penalty → dynamic triggering: the penalty $-\gamma U_{\text{sys}}$ is enacted by a controller that launches a debate only when global uncertainty is high and stops once confidence is sufficient, hence penalizing high-uncertainty states with extra computation.

These structural embodiments of the game principles meaningfully change behaviour. Our ablation (Fig. 3) shows that removing the debate ("w/o Debate"), i.e., reducing the system to an initial soft-voting integration, drops MMBench accuracy from 88.80% to 84.52% (-4.28 pp). Thus, the architectural realisation of the non-zero-sum game delivers clear gains beyond a simple uncertainty soft-max.
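A small numerical sketch of the two weightings discussed above, with the cooperation reward $r_i$ taken as the mean pairwise similarity to peers; the α, β values and similarity numbers are illustrative only.

```python
import numpy as np

def simplified_weights(u, alpha=1.0):
    """w_i ∝ exp(-alpha * u_i): uncertainty-only weighting (Eq. 35-style)."""
    s = np.exp(-alpha * np.asarray(u, dtype=float))
    return s / s.sum()

def full_weights(u, sim_matrix, alpha=1.0, beta=0.5):
    """w_i ∝ exp(-alpha * u_i + beta * r_i), with r_i the mean similarity to peers (Eq. 34-style)."""
    sim = np.asarray(sim_matrix, dtype=float)
    n = len(u)
    r = (sim.sum(axis=1) - np.diag(sim)) / (n - 1)   # average similarity to the other agents
    s = np.exp(-alpha * np.asarray(u, dtype=float) + beta * r)
    return s / s.sum()

u = [0.2, 0.6, 0.4]                                   # per-agent uncertainties
sim = [[1.0, 0.3, 0.7],
       [0.3, 1.0, 0.2],
       [0.7, 0.2, 1.0]]                               # illustrative pairwise response similarities
print(simplified_weights(u))
print(full_weights(u, sim))  # the cooperation reward shifts weight toward agents that agree with confident peers
```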

Q2. What is the Base (Ori) framework? SEAL should be explained and attached to V*Bench.

A4: "Base (Ori)" stands for "Base Model's Original Performance", which is the score of a single vision-language model (e.g., Qwen2.5VL-7B) answering questions directly, without any multi-agent collaboration framework. We have noted this in the table captions; for instance, the caption for Table 1 states, "Values in parentheses denote the improvement in accuracy over the respective base model (Ori)". The core purpose of establishing this benchmark is to quantify the precise performance gains brought by all multi-agent frameworks, including our GAM-Agent. Since SEAL is a published method on V*Bench, we include its result in Table 2 as an important external reference point to gauge whether the performance level achieved by our GAM-Agent is competitive, particularly when equipped with top-tier models like GPT-4o. We have added more explicit definitions for both "Base (Ori)" and "SEAL" to the captions of Tables 1 and 2 in the revised paper.

Q3. Why are some methods only in A.1? Why not include the o-series models (or any other reasoning model)?

A5: To obtain fair and reproducible evaluations, we do not include o-series models since they are closed-source and change over time. Besides, the closed-source reasoning VLMs (i.e., o-series, Gemini, and Claude series models) do not return token-level logprobs for accurate uncertainty quantification, which is crucial for our GAM-Agent.

(In OpenRouter, among the GPT-series, only the 4o series returns logprobs.)

Moreover, Appendix A.1 aims to conduct a broad cost-benefit analysis, introducing numerous expensive commercial API models (like the Gemini series and Claude 3.7) as reference points for performance and price. This highlights the significant economic advantage of our GAM-Agent in matching or exceeding the performance of top-tier commercial services at a fraction of the cost when driving locally deployed open-source models.

Q4. Game theory is underutilized in the implications of Appendix D.4.

A6: The dynamic weighting protocol (Eq. 35) is indeed a simplified version of the full game-theoretic optimal weight (Eq. 34). This is to prioritize computational efficiency and stable performance, which is why the explicit cooperative reward term (i.e., Sim(Ri,Rk)Sim(R_i, R_k)) is omitted in our main experiments. To address the question regarding the potential of the full framework and to more clearly demonstrate its theoretical value, we have added a new ablation study in the revised paper to directly compare these two weighting mechanisms. The results of this experiment are summarized in the table below.

| Weighting Strategy | Acc. (%) | Deb. Rounds | Deb. Trig. (%) | Cost (tokens/inst.) |
| --- | --- | --- | --- | --- |
| Simplified Weights (Default, Eq. 35) | 88.80 | 1.76 | 65 | 2500 |
| Full Game-Theoretic Weights (Eq. 34) | 89.15 (+0.35) | 1.92 | 68 | 3150 |

The results in the above table show that applying the full game-theoretic weights (Eq. 34) does indeed lead to a further improvement in accuracy (+0.35%). This empirically validates the theoretical advantage of our complete utility function design, where encouraging semantic consistency with high-confidence peers contributes to a better consensus. However, this accuracy gain is accompanied by a significant computational overhead; the average cost increases by approximately 26% due to the additional pairwise semantic similarity calculations. We therefore maintain that the simplified weighting protocol (Eq. 35) used in our main paper strikes an effective trade-off between performance and computational efficiency. This new analysis further clarifies the rationale behind our designs.

L1. The limitations.

A7: We have added a new "Limitations" section to the revised paper to discuss the framework's boundaries more centrally. Regarding the failure conditions of the semantic-marker strategy ($\Phi_i^{\mathrm{sem}}$): its effectiveness diminishes under two conditions. First, when black-box models are fine-tuned to avoid uncertainty-related words (e.g., "possibly," "maybe") for a more decisive user experience. Second, when more advanced models produce "confident hallucinations," i.e., coherent but incorrect answers that are wrongly assigned a low uncertainty score. This is a risk we frankly discuss in our failure case analysis in Appendix K.2.

While a core value of our GAM-Agent is significantly improving the performance of small-to-mid scale large models, experimental results show that even top-tier models like GPT-4o gain consistent and significant performance improvements (from +2% to +4%). For these large VLMs, GAM-Agent's value lies in complementing the cognitive blind spots of a single model by using multiple expert perspectives and providing a valuable review and correction layer through its Critical Agents (e.g., Fact Checker, Logic Checker). Therefore, GAM-Agent is a general collaborative framework designed to enhance the visual reasoning robustness of all models, from lightweight to large-scale.

Comment

I would like to thank the authors for their detailed responses, proposed excerpts for addition, and pseudocode. The additional information has greatly enhanced my understanding of the work. I'd like to state a couple of key concerns and then respond to the rebuttal point by point.

I believe there is a great deal of value in highlighting game-theoretic approaches to the multi-agent paradigm. Non-zero-sum games especially have a lot to offer in this regard. I think framing visual reasoning among multiple agents as one example of this paradigm is clever, and the derivations in the appendix are a valuable resource for providing a blueprint for others to follow. However, in some cases, the framing overloads the paper and obscures the main findings, namely, that a game-theoretic approach (even if only partially realized through architectural choices) can improve performance and provide a path to optimization in multi-agent systems. This point has implications beyond visual reasoning and is worth highlighting.

To that end, I feel the tests on closed-source models are a distraction (perhaps a necessary one, but a distraction nonetheless). Several reviewers point out the lack of generation probabilities and the weak correlation of the semantic markers to true errors. Furthermore, comparing your findings to GPT-4o (70.7% on MMMU) is not the same as comparing to more reasoning-focused (referred to in my review as "o-series") models like OpenAI's o3 model (82.9% on MMMU). There are several places in the rebuttal where these models are referred to interchangeably, which I feel is incorrect. And since you point out, "… a core value of our GAM-Agent is significantly improving the performance of small-to-mid scale large models", I would focus on that strength and the clear contributions to the underlying theory rather than small gains that may be washed away with the next model release.

Specific points: Q1&W1&W2. On Related Work (DebUnc), Efficiency, and Performance (GAM-Agent vs DebUnc). I appreciate the additions to the Related Works section and the effort to highlight the difference between GAM-Agent and DebUnc. I feel this is critical because uncertainty is at the heart of this method's success. The key advantage should be differentiated in the main text as you point it out here, namely, "Pareto-efficient reasoning." Then there must be clear lines drawn to the cost analysis to underscore this efficiency.

W3. On the Implementation and Utilization of the Game-Theoretic Framework. I very much appreciate the authors' exposition of this nuanced point here. It should be made clear in the main body of the text and be the focal point of the submission.

Q2. What is the Base (Ori) framework? SEAL should be explained and attached to V*Bench. I appreciate the addition of more explicit definitions for both "Base (Ori)" and "SEAL" to the captions of Tables 1 and 2 in the revised paper. The figures and tables should also be improved for readability and clarity.

Q3. Why are some methods only in A.1? Why not include the o-series models (or any other reasoning model)? "To obtain fair and reproducible evaluations, we do not include o-series models since they are closed-source and change over time. Besides, the closed-source reasoning VLMs (i.e., the o-series, Gemini, and Claude series models) do not return token-level logprobs for accurate uncertainty quantification, which is crucial for our GAM-Agent." This is an example of where I am confused between the authors' use of GPT-4o with semantic tokens and the OpenAI o3 reasoning model, which does appear to be available within OpenRouter. Not having a comparison to any reasoning-style model, when they have become such a fixture of the landscape over the past year, is a clear weakness of this work as it is currently constructed (I recognize the mismatch between research time-scales and the current pace of model releases, but it does need to be addressed directly in the text).

Q4. Game theory is underutilized in the implications of Appendix D.4. I would like to thank the authors for this added work. It is informative and instructive to see the trade-off in accuracy vs. cost in support of the authors' design decision. It should be included in the main body of the text.

L1. The limitations. The additions to the limitations section are very welcome and should include an explicit reference to the failure case analysis (K.2), as opposed to something like "additional failure cases/limitations are discussed in the appendix".

In conclusion, I feel there are a number of points in this work that would be valuable to the community. However, the framing and manuscript need revision to fully realize this potential. As Reviewer og8v points out, the required revisions are non-trivial and the readability is critical to properly highlight the contributions, placement, and limitations of the work.

I welcome any additional feedback to address these points.

Comment

Q3: On the Inclusion of o-series and Reasoning-Style Models in the Main Results

A4: Thank you for highlighting the importance of including reasoning-style models. To address this, we evaluated three o-series models (o1, o3, o4-mini) on MMMU (Val) using the same prompts and decoding settings as our other experiments. As these APIs do not expose token-level log-probabilities, we followed Section 2.2.1 and used our semantic-marker–based uncertainty estimator to replace the log-prob–based module.

(For clarity, the o-series models are distinct from GPT-4o; we report them separately.)

| Model | Base (MMMU-Val) | +GAM-Agent (MMMU-Val) | Improvement |
|---|---|---|---|
| o1 | 78.2 | 82.3 | +4.1 |
| o3 | 82.9 | 85.8 | +2.9 |
| o4-mini | 81.6 | 84.9 | +3.3 |

GAM-Agent consistently improves all tested o-series models by +2.9 to +4.1 points, indicating that our framework generalizes to reasoning-style models even without token-level log-probs. The semantic-marker–based uncertainty module is effective in this setting. We have included these results in the revised manuscript and will open-source all associated evaluation code (including prompts, thresholds, and parsing scripts) for transparency and reproducibility.
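
For transparency, the sketch below shows the general shape of a semantic-marker-based estimator; the marker lexicon, the density scaling, and the saturation constant are illustrative, and our actual estimator in Section 2.2.1 differs in details.

```python
import re

# Illustrative hedging lexicon; the paper's actual marker set may differ.
HEDGE_MARKERS = [
    "possibly", "maybe", "might", "could be", "likely",
    "unclear", "not sure", "appears to", "seems", "uncertain",
]


def semantic_marker_uncertainty(response: str) -> float:
    """Map the density of hedging markers in a response to a score in [0, 1]."""
    text = response.lower()
    hits = sum(len(re.findall(r"\b" + re.escape(marker) + r"\b", text))
               for marker in HEDGE_MARKERS)
    n_words = max(len(text.split()), 1)
    # Saturating density: a handful of hedges already signals low confidence.
    return min(1.0, 10.0 * hits / n_words)


print(semantic_marker_uncertainty("The cap is black."))                      # 0.0
print(semantic_marker_uncertainty("It might be black, but it is unclear."))  # high
```

This also clarifies the limitation discussed under L1: a model fine-tuned to avoid such markers, or one that hallucinates fluently, will receive a deceptively low score from this kind of estimator.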


Q4: Game theory is underutilized in the implications of Appendix D.4.

A5: Thank you for pointing this out. We agree that the implications of the full game-theoretic weighting deserve prominent placement in the main paper.

In the paper, we have (i) moved the key derivation and ablation from Appendix D.4 into the Method/Experiments sections; (ii) added a side-by-side figure/table comparing the full versus simplified weighting (e.g., accuracy, average debate rounds, trigger rate, and tokens/instance), making the accuracy–cost trade-off explicit; (iii) tied these results directly to our design rationale, clarifying how the non-zero-sum incentives are instantiated architecturally, and why we adopt the simplified rule as the Pareto-efficient default; and (iv) included a short discussion on when the full scheme is preferable (e.g., accuracy-critical settings despite higher cost).

We believe surfacing these elements could make the game-theoretic contribution clearer and more impactful for readers. Thank you again for the constructive guidance.


L1: On the Limitations Section.

Thank you for suggesting an explicit reference to Appendix K.2. In the revised manuscript, we've updated the “Limitations” section to cite K.2 directly, discussing where the semantic marker-based strategy falters (e.g., models avoiding uncertainty terms or producing confident hallucinations). Building on Reviewer og8v's feedback, we've also emphasized success cases where base models fail but GAM-Agent succeeds—such as V*Bench sa_13337.jpg, where the base model can't resolve visual ambiguity (e.g., poor lighting obscuring a cap's color) but GAM-Agent integrates multi-agent debate to correctly identify "black." For balance, we highlight failure cases where GAM-Agent struggles, like MuirBench multi-image tasks with conflicting cues, as detailed in Table B of K.2. These changes will ensure clarity, transparency, and readability by connecting limitations to concrete examples.

This tweak makes the "base model can't, we can" contrast clearer without extra length, and we will apply similar improvements across all relevant examples for consistency. If you would like us to refine other sections further, please let us know.

Comment

Thank you for the detailed feedback. We appreciate the opportunity to clarify our design choices, evaluation scope, and the implications of our results.


Q1: On Related Work, Efficiency, and Performance (GAM-Agent vs DebUnc):

A1: Thank you for your insightful and detailed feedback, which clarified key areas for improvement. We fully agree that uncertainty modeling is central to multi-agent visual reasoning, and we appreciate the suggestion to make GAM-Agent’s Pareto-efficient reasoning advantage over DebUnc explicit—both in performance and computational efficiency. In the revised manuscript, we: (1) elevate Pareto-efficient reasoning as a core contribution in the Introduction and Method; (2) tighten the link between accuracy gains and cost savings (e.g., GAM-Agent’s ~2.2× lower token cost relative to DebUnc’s fixed 9 calls); and (3) clarify how GAM-Agent achieves more with fewer resources than symmetric frameworks such as DebUnc. We also emphasize the broader applicability of our non-zero-sum design to other multi-agent systems, so readers can immediately see both its theoretical and practical significance. We believe these revisions make the contributions clearer and more impactful.


W3. On the Implementation and Utilization of the Game-Theoretic Framework

A2: Thank you for this valuable suggestion. We agree that the game-theoretic core should be visible in the main paper and serve as a focal point rather than remain implicit in architectural choices. In the revision, we: (i) make explicit the game-theoretic mapping in the Method section—roles (Base vs. Critical) → asymmetric players; uncertainty- and evidence-aware utilities; a controller-triggered debate that embodies a non-zero-sum interaction; (ii) add a concise derivation contrasting the full weighting (Eq. 34) and the simplified rule (Eq. 35), and explain how collaboration rewards/penalties are realized architecturally (e.g., a verify-then-defend loop; dynamic debate triggering as a cost penalty); (iii) add a pipeline figure with game-theoretic annotations (players, signals, utilities, update loop) and Algorithm 1 for round-by-round updates; and (iv) move the ablation comparing Eq. 34 vs. Eq. 35 (accuracy, average rounds, trigger rate, tokens per instance) into the Experiments section, explicitly linking it to Pareto-efficient reasoning.

To improve readability, we keep formal proofs in the appendix, retain minimal notation in the main text, and make the utility–mechanism link explicit (uncertainty → weights, evidence alignment → cooperation reward, debate cost/termination → penalties). These changes clarify how the non-zero-sum perspective directly drives the design and the observed efficiency–accuracy trade-offs.
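
As a procedural illustration of the controller-triggered debate described above, the sketch below gives one possible round-by-round structure; the agent interfaces, the disagreement test, the thresholds, and the round cap are simplified placeholders rather than our exact Algorithm 1.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AgentOutput:
    claim: str          # structured claim
    evidence: str       # supporting evidence
    uncertainty: float  # estimated uncertainty in [0, 1]


BaseAgent = Callable[[str, List[AgentOutput]], AgentOutput]
CriticalAgent = Callable[[List[AgentOutput]], str]


def run_debate(base_agents: List[BaseAgent],
               critical_agents: List[CriticalAgent],
               question: str,
               max_rounds: int = 3,
               uncertainty_threshold: float = 0.4) -> List[AgentOutput]:
    """Uncertainty-aware debate loop (illustrative).

    Base Agents produce claim/evidence/uncertainty; Critical Agents (e.g., fact
    and logic checkers) return critiques. Debate continues only while claims
    disagree or uncertainty stays high, realizing the debate-cost penalty as an
    early-stopping rule.
    """
    history: List[AgentOutput] = []
    outputs = [agent(question, history) for agent in base_agents]

    for _ in range(max_rounds):
        disagreement = len({o.claim for o in outputs}) > 1
        high_uncertainty = max(o.uncertainty for o in outputs) > uncertainty_threshold
        if not (disagreement or high_uncertainty):
            break  # confident consensus -> no further rounds are triggered

        critiques = [checker(outputs) for checker in critical_agents]
        history = outputs + [AgentOutput(claim=c, evidence="critique", uncertainty=0.0)
                             for c in critiques]
        outputs = [agent(question, history) for agent in base_agents]

    return outputs
```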


Q2. What is the Base (Ori) framework? SEAL should be explained and attached to V*Bench.

A3: Thank you for the helpful feedback on table clarity and explicit definitions. In the revision, we have:

Defined “Base (Ori)” at first mention and in table captions as the single-model baseline—the underlying VLM answering directly without any multi-agent collaboration or debate, under the same prompts and decoding settings as our methods for a fair comparison.

Explained “SEAL” as a published baseline associated with VBench. We include SEAL only on VBench where it is defined and reported, using its official evaluation protocol/numbers where applicable. This anchors SEAL “to” V*Bench as requested and avoids implying applicability on benchmarks where SEAL is not defined.

We have updated the captions of Tables 1 and 2 and the surrounding text accordingly. To further improve accessibility, we also standardized table formatting and annotations (consistent column headers, “(Ori)” legend, benchmark source notes, bolding of best results, and footnotes for external baselines). We believe these changes make the experimental setup and comparisons more transparent and easier to interpret.

Example caption snippet (now included): "(Ori): performance of the underlying VLM without any multi-agent framework. SEAL: published baseline evaluated on V*Bench; shown only on V*Bench."


Due to character limits, please refer to our responses to Q3, Q4, and L1 in the following comment.

Comment

Dear Reviewer fdos,

With only one day left in the discussion phase, we kindly invite you to review our detailed rebuttal and follow-up clarifications.
Thank you again for your time and valuable feedback.

The Authors

Comment

I sincerely thank the authors for engaging so fully in the discussion and for the additional experiment on the reasoning-style models. The authors have addressed all my questions. I believe the additional work and the proposed revisions to the text will make for a much stronger submission that will be of use to the community. As a result, I am pleased to increase my rating pending these changes.

Comment

Dear Reviewer fdos,

Thank you very much for your thoughtful engagement during the discussion. We greatly appreciate your additional questions, your careful review of our new experiments with reasoning-style models, and your willingness to raise your rating.

We will incorporate every agreed-upon revision and will post a brief change-log in an Official Comment within the next 24 hours.
Should any further questions arise, please don’t hesitate to let us know.

Thank you again for your time and support!

Best regards,
The Authors

Comment

Dear Reviewer fdos,

Thank you once again for your constructive engagement and for raising your recommendation. In our revised manuscript, we have incorporated all the discussed changes to strengthen the paper. Below is a summary of the key revisions made:

1. Regarding Related Work and Positioning:

  • Revised Related Works: We have expanded the "Related Works" section to include a substantive discussion on DebUnc. This revision clarifies the key distinctions and better highlights the novelty of GAM-Agent's asymmetric, game-theoretic approach.
  • Emphasized Pareto Efficiency: In the Introduction and Method sections, we have elevated "Pareto-efficient reasoning" as a core contribution. We now explicitly link our architectural design to its significant performance-per-cost advantage over symmetric, fixed-round frameworks like DebUnc.

2. Regarding the Game-Theoretic Framework's Exposition:

  • Made Game-Theoretic Links Explicit: In the Method section, we have clarified the mapping between our framework's architecture and its game-theoretic foundations, making the design rationale more transparent.
  • Integrated Key Ablation into Main Paper: The ablation study that compares the full game-theoretic weighting (Eq. 34) with our simplified rule (Eq. 35) has been moved from the appendix into the main Experiments section. This directly illustrates the accuracy-cost trade-off central to our design.
  • Added New Figures and Algorithms: We have added a new, annotated pipeline figure and Algorithm 1 to the main text to visually and procedurally detail the GAM-Agent process.

3. Regarding New Experimental Results:

  • Incorporated "o-series" Model Evaluation: We have integrated the new results from evaluating GAM-Agent on reasoning-style models (o1, o3, o4-mini) into a new table in the main Experiments section, demonstrating consistent gains (+2.9% to +4.1%) and the framework's generalizability.
  • Clarified Model Terminology: Throughout the manuscript, we have ensured a clear distinction is made between GPT-4o and the "o-series" models to prevent ambiguity.

4. Regarding Manuscript Clarity and Transparency:

  • Improved Table and Figure Clarity: We have updated all tables and figures, adding explicit definitions for terms like "Base (Ori)" and "SEAL" in captions to improve readability and ensure comparisons are clear.
  • Added a Dedicated Limitations Section: We have incorporated a new "Limitations" section into the main paper. This section candidly discusses the boundaries of our semantic-marker-based uncertainty estimation and explicitly cites the detailed failure case analysis in Appendix K.2.

We are confident these changes have substantially improved the manuscript's clarity, rigor, and impact.

Best regards, The Authors

Final Decision

This paper proposes GAM-Agent, a multi-agent framework designed to enhance the robustness, interpretability, and accuracy of vision-language models on complex multimodal tasks. The reviewers find the paper novel, principled, meaningful, and effective. I agree with the reviewers and recommend accepting the paper.