MIRAGE: Assessing Hallucination in Multimodal Reasoning Chains of MLLM
Abstract
Reviews and Discussion
This paper introduces MIRAGE, a benchmark designed to isolate and evaluate reasoning-induced hallucinations in multimodal large language models (MLLMs), distinguishing them from perception errors—a gap overlooked by existing benchmarks. MIRAGE uses multi-granular metrics (accuracy, factuality, and hallucination score) to assess reasoning failures, revealing that MLLMs struggle particularly with spatial reasoning involving complex visual relationships. To address these issues, the authors propose Logos, a method combining curriculum reinforcement fine-tuning and collaborative hint inference to improve logical consistency in model outputs. Logos reduces hallucinations and sets a baseline on MIRAGE, which will be publicly released.
Strengths and Weaknesses
Strengths:
- The paper is the first to cleanly disentangle reasoning-induced hallucinations from the perception errors that dominate prior multimodal benchmarks, motivating a fresh study of logical robustness in MLLMs.
- The proposed Logos framework lifts a 7B Qwen2.5-VL model to nearly match much larger 72B systems. Ablations show each component's contribution, offering a concrete starting point for future work.
- MIRAGE offers 1,329 carefully curated questions covering seven reasoning taxonomies and provides three complementary metrics, making it a well-rounded benchmark.
Weaknesses:
- The dataset contains only single-image questions; no multi-image, video or temporal reasoning is tested, a limitation the authors acknowledge themselves.
- Factuality, hallucination typing, and LHS all hinge on other large language models' judgments, raising concerns about circularity, bias, and stability across evaluator versions.
- The authors do not provide the rationale behind picking the seven distinct taxonomies for their dataset or mention which model is fine-tuned for their Logos method.
Questions
Questions:
- The paper states that the dataset was "constructed through rigorous selection of seven distinct taxonomies, including geometry, algebraic, arithmetic, scientific, spatial reasoning, and statistical reasoning," but never explains why exactly these seven categories were chosen.
- Section 5 introduces Logos but does not specify which exact model it fine-tunes. The authors should provide this missing information to ensure Logos's reproducibility.
- The authors should expand on the reliability of LLM-as-a-Judge.
Limitations
Yes.
Justification for Final Rating
Thank you to the authors for the detailed and thoughtful rebuttal. The responses have comprehensively addressed all of the weaknesses I raised. The new evidence, particularly the validation of the evaluation metrics, has substantially strengthened the paper. Given that my primary concerns have been fully addressed, I am increasing my score.
Formatting Issues
N/A
Thanks for your insightful suggestions. We hope this response can address your concerns.
Weakness 1: The dataset contains only single-image questions; no multi-image, video or temporal reasoning is tested.
Thank you for this valuable suggestion regarding the scope of our benchmark. We agree that multi-image and temporal reasoning are important future directions for hallucination research, and we acknowledge that the absence of multi-image, video, and temporal reasoning data is a limitation of our paper, as stated in Appendix G (Limitations). Our primary goal with MIRAGE was to first establish a foundational, high-quality benchmark for reasoning-chain hallucinations in the prevalent single-image context. This deliberate focus allows for a deep and controlled analysis of a core challenge in MLLMs. For scenarios that can be adapted, we propose a practical method of concatenating multiple images into a single composite image and annotating object indices, effectively converting a multi-image problem into a format compatible with the current benchmark. Looking ahead, we are committed to extending MIRAGE with dedicated multi-image and video data in future work to explore hallucinations in more complex temporal and multi-view scenarios.
Weakness 2 and Question 3: LLM-as-a-judge raises concerns about circularity, bias, and stability across evaluator versions.
This is a critical point, and we appreciate the opportunity to elaborate on our comprehensive, multi-pronged strategy to ensure the reliability and validity of our LLM-based evaluation. Our approach is built on three pillars.
First is mitigating circularity. To prevent self-enhancement bias, we intentionally use an older, isolated model version (gpt-4o-2024-08-06) as our primary evaluator, ensuring it has not been trained on the models or data it is evaluating. For open-source evaluators, we designed highly structured, specialized prompts (see Appendix) to constrain the evaluation and minimize potential bias.
Second is human-in-the-loop verification to double-check the validity of LLM-as-a-judge. We do not rely solely on automated metrics. We conducted manual double-checking for our key metrics (factuality and LHS) on representative models. The results demonstrate a strong alignment between our automated scores and human expert judgment. Specifically, for factuality scores in the following table, the automated scores closely match the manually verified scores, confirming their accuracy.
| Model | Step factuality | Claim factuality |
|---|---|---|
| Gemini-2-flash | 47.8 | 47.4 |
| Gemini-2-flash (manual) | 48.5 | 47.9 |
| Qwen2.5-VL-7B | 34.7 | 31.7 |
| Qwen2.5-VL-7B (manual) | 33.9 | 31.3 |
For LHS, we analyze the absolute difference between automated and manual LHS scores. The results show that the metric is highly stable, with over 75% of samples having a negligible difference of less than 0.1.
| Model | 25th pct | 50th pct | 75th pct | 90th pct |
|---|---|---|---|---|
| Gemini-2-flash | 0.02 | 0.04 | 0.07 | 0.11 |
| Qwen2.5-VL-7B | 0.04 | 0.05 | 0.09 | 0.13 |
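For clarity, the percentile statistics above can be computed with a short script over paired automated and manual scores. The sketch below is illustrative only; the arrays hold hypothetical values, not our actual evaluation data:

```python
import numpy as np

# Hypothetical paired scores: automated LHS vs. manually verified LHS per sample.
automated_lhs = np.array([0.82, 0.64, 0.91, 0.55, 0.73])
manual_lhs    = np.array([0.80, 0.69, 0.90, 0.60, 0.74])

# Absolute per-sample difference between the two scorings.
abs_diff = np.abs(automated_lhs - manual_lhs)

# Percentiles reported in the table above (25th, 50th, 75th, 90th).
for p in (25, 50, 75, 90):
    print(f"P{p}: {np.percentile(abs_diff, p):.2f}")
```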
In addition, we further validate our framework by computing the correlation between all metrics in Table 5. The high correlation scores across the board indicate strong internal consistency and robustness.
We believe this rigorous, three-tiered validation process effectively addresses concerns about circularity and stability, ensuring that our findings are reliable. We will add a summary of this validation strategy to the main paper.
Weakness 3 and Question 1: The authors do not provide the rationale behind picking the seven distinct taxonomies for their dataset.
Thank you for this question. Our selection of the seven reasoning taxonomies was a principled decision aimed at ensuring comprehensive coverage of both broad, general-purpose reasoning and deep, domain-specific skills. For general reasoning capabilities, to evaluate skills applicable to everyday scenarios, we included tasks requiring spatial reasoning (object locations, attributes) and statistical reasoning (interpretation of charts and plots). For domain-specific reasoning capabilities, to probe more complex, multi-step logical deductions, we incorporated challenging problems from mathematics (geometry, algebra), science (physics, chemistry, biology), and formal logic (IQ-test style problems).
This structured approach, which balances breadth and depth, is well-aligned with the categorization frameworks used in recent comprehensive surveys on multimodal reasoning [1, 2], confirming the relevance of our chosen taxonomies.
[1] Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey, arXiv 2025
[2] Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models, arXiv 2025
Weakness 3 and Question 2: Section 5 introduces Logos but does not specify which exact model it fine-tunes.
Thank you for pointing out this omission. We apologize for not including these details in the main text. For Logos-7B, the pretrained model is Qwen2.5-VL-7B-Instruct; for Logos-3B, it is Qwen2.5-VL-3B-Instruct. The visual encoder is frozen to avoid catastrophic forgetting of visual perception ability. For training data, we collect 13K mathematical questions of K12-level difficulty and $\sim$1K text-only math questions from LIMO. The batch size is 128. For each training sample, the number of rollout samples G is 8 by default. The initial learning rate is $1 \times 10^{-6}$, and both a warmup strategy and a cosine learning-rate scheduler are adopted to stabilize training. We optimize Logos for 10 epochs with AdamW in each stage. The number of CRFT stages is set to 1, as discussed in the ablation study. Benefiting from the filtration mechanism in CRFT and ORF, the total training time of Logos-7B is less than 24 hours, and Logos-3B trains even faster. We will add these crucial implementation details to the main paper for clarity and reproducibility; a complete list of hyperparameters is provided in Appendix B.
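To make these settings easier to scan, the hyperparameters above can be summarized in a configuration sketch like the one below; the key names are illustrative (hypothetical), and the authoritative list remains Appendix B:

```python
# Illustrative training configuration for Logos-7B, summarizing the values above.
# Key names are hypothetical; they are not the exact keys of our codebase.
logos_7b_config = {
    "base_model": "Qwen2.5-VL-7B-Instruct",
    "freeze_visual_encoder": True,      # avoid catastrophic forgetting of perception
    "train_data": "13K K12-level multimodal math + ~1K text-only math (LIMO)",
    "batch_size": 128,
    "rollouts_per_sample": 8,           # G in the paper
    "learning_rate": 1e-6,
    "lr_schedule": "warmup + cosine decay",
    "optimizer": "AdamW",
    "epochs_per_stage": 10,
    "num_crft_stages": 1,
}
```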
Thank you to the authors for the detailed and thoughtful rebuttal. The responses have comprehensively addressed all of the weaknesses I raised. The new evidence, particularly the validation of the evaluation metrics, has substantially strengthened the paper. Given that my primary concerns have been fully addressed, I am increasing my score.
Thank you for your prompt and insightful review; we truly appreciate your help in shaping the paper for the better.
Thank you again!
Authors of Submission 11966
This paper introduces MIRAGE, a new benchmark designed to assess and diagnose reasoning-induced hallucinations in Multimodal Large Language Models (MLLMs). Unlike existing benchmarks, MIRAGE focuses on scenarios where MLLMs correctly perceive visual inputs but fail in their logical reasoning. The contributions of the paper include:
- MIRAGE Benchmark: a diagnostic benchmark comprising 1,329 carefully designed questions aimed at isolating reasoning hallucinations. It provides multi-level annotations—including final answers, intermediate reasoning steps, and ground-truth reasoning chains—which enable detailed tracking of hallucination propagation. In addition to accuracy and factuality metrics, MIRAGE introduces a comprehensive LLM Hallucination Score (LHS) to assess model performance.
- Empirical Analysis and Insights: The paper presents an extensive analysis revealing that model scale, data scale, and training stages impact logical, fabrication, and factual hallucinations. It highlights that current MLLMs show limited improvement in spatial hallucinations, indicating weak visual reasoning capabilities. Additionally, it identifies correlations between question types and distinct hallucination patterns.
- Hallucination Mitigation Method: To address the identified challenges, the authors propose Logos, a baseline method that combines curriculum reinforcement fine-tuning (CRFT) and collaborative hint inference (CHI). Logos is shown to reduce logical hallucinations and improve answer accuracy.
Strengths and Weaknesses
Strengths
- The paper is well-structured and easy to follow. The problem statement is clearly articulated, and the motivation for MIRAGE is well-justified by highlighting the limitations of existing benchmarks.
- MIRAGE introduces three levels of evaluation metrics: accuracy (overall answer correctness), factuality (correctness of intermediate steps and claims), and LLMs Hallucination Score (assessment of the entire reasoning chain). This multi-level approach allows for a more precise tracking and diagnosis of hallucination propagation.
- The paper proposes a cost-effective automated annotation framework combined with human-AI collaborative verification for creating ground-truth reasoning chains. This approach addresses the challenge of obtaining detailed annotations for complex reasoning tasks, which could be followed for future works.
- The paper conducts thorough experiments on MIRAGE, providing valuable insights into the factors influencing different hallucination types, such as model scale, data scale, and training stages.
- The authors state that MIRAGE will be publicly available, and they provide detailed experimental setups, including the prompts used for construction and evaluation in the appendix.
Weaknesses
- The paper could provide a more detailed breakdown of the initial data sources used for MIRAGE, beyond "publicly available datasets and questions from Internet", and elaborate on the inherent difficulty of this raw data. While some examples are provided (Figures 12-18), a more explicit discussion on data characteristics and how they contribute to reasoning difficulty would be beneficial.
- The paper explicitly states that Logos does not significantly reduce spatial or factuality hallucinations, only logical and fabrication hallucinations. This suggests a limitation in the proposed mitigation method's effectiveness across all identified hallucination types.
- While Logos shows promising results on MIRAGE and MathVista, its performance on a broader range of multimodal reasoning tasks and datasets not specifically designed for reasoning hallucinations could be further explored to establish its wider applicability.
- While Logos's combination of techniques is novel, the individual components (GRPO, difficulty curriculum learning, Online Reward Filtration) are established concepts. Some recent works have already used these components in LLM/MLLM reasoning tasks, such as "DAPO: An Open-Source LLM Reinforcement Learning System at Scale" and "Efficient Reinforcement Finetuning via Adaptive Curriculum Learning".
- Although the ablation experiments in the appendix already point out the effectiveness of each individual component, the paper could strengthen its originality claim by further highlighting how the combination and synergy of these components specifically address the unique challenges of reasoning hallucinations in MLLMs beyond what individual components or simpler combinations might achieve.
Questions
- Could the authors provide a more detailed breakdown of the "publicly available datasets and questions from Internet" that constitute the initial 18K dataset for MIRAGE? Specifically, listing the names of these public datasets would enhance clarity and allow for better understanding of the data's diversity and characteristics.
- The paper mentions using an LLM to extract intermediate steps and claims, and comparing them with ground-truth steps and claims for factuality evaluation. What measures are in place to ensure that this evaluation accurately assesses factuality, especially for questions where multiple valid reasoning paths could exist? If a model generates a logically sound but different sequence of intermediate steps than the predefined ground truth, how does the current evaluation methodology account for this to avoid penalizing correct alternative solutions as "hallucinations" or "incorrect" steps? Is there any manual double-checking of the step and claim factuality evaluations to mitigate such issues?
- Could the authors provide a more in-depth analysis of Logos's performance in mitigating spatial hallucinations compared to train-free methods like self-reflection and SFT-based methods? Specifically, what are the observed differences in their effectiveness (or lack thereof) for spatial hallucinations, and what specific characteristics of spatial reasoning errors might explain why Logos performs similarly to or differently from these other approaches in this particular area?
Limitations
Although the authors discuss some limitations in the appendix, Logos mainly combines existing techniques but lacks a clear analysis of why this combination should work in practice.
Justification for Final Rating
Most of my concerns have been addressed in the rebuttal. The authors clarified the dataset sources, improved the explanation of evaluation and method novelty, and also provided additional experiments demonstrating generalizability. I think the paper is acceptable and raised my score.
Formatting Issues
No
Thanks for your insightful suggestions. We hope this response can address your concerns.
Weakness 1: detailed breakdown of the initial data sources
We agree that transparency of data sources is crucial. Our strategy is to curate a diverse dataset from high-quality, specialized sources, each chosen to target a specific reasoning skill:
- Mathematical Reasoning: We sourced questions from MathVista and MathVision, which are renowned for their high-quality, complex geometric diagrams and mathematical problems.
- Logical Reasoning: To assess formal logic, we drew from the official question banks of national civil service examinations, which feature rigorously designed IQ-style test questions.
- Scientific Reasoning: We utilized questions from prestigious science exams (e.g., USA Biology Olympiad), which require domain-specific reasoning in science subjects.
- Spatial Reasoning: Our data comes from established traffic-related datasets and the synthetic SuperCLEVR dataset, which emphasize the reasoning of complex spatial relationships.
We ensure the inherent difficulty of the initial data by combining source selection with empirical validation. For public benchmarks (e.g., MathVision), we confirm their challenge by observing that even powerful MLLMs (e.g., Qwen2.5-VL-7B) achieve relatively low accuracy (25.4%), which provides a strong empirical indicator of their difficulty. For crawled data, we curate questions from sources renowned for their complexity, e.g., Physics/Biology Olympiads and IQ tests, which are by design created to probe the limits of logical deduction. This approach ensures the initial data poses a meaningful challenge to current MLLMs. We will list these sources in the revision.
Weakness 2: Limitation in Logos's effectiveness across all identified hallucination types.
This is a key finding of our work, and we appreciate the opportunity to elaborate on why Logos shows different performance on the various hallucination types. The behavior stems from two distinct challenges:
Inherent Gaps in Base Models (spatial hallucination): Our analysis in Table 1 and 2 reveals that even top-performing base models have a fundamental weakness in spatial reasoning. Scaling model or data size does not significantly reduce this type of error. Since reinforcement learning in CRFT works by rewarding and reinforcing correct behaviors that the model can already sample, it cannot effectively teach a capability that is fundamentally absent. If the base model rarely generates a correct spatial reasoning path, there are few positive examples for Logos to amplify.
The Nature of RL vs. factual knowledge (factuality hallucination): For strong base models, the remaining factual errors are often subtle and stem from inherent biases in their pre-training data. The goal of Logos's RL framework is to optimize the policy for generating correct reasoning chains, not to inject new factual knowledge. If all sampled reasoning paths from the base model contain the same ingrained factual error (e.g., a slightly incorrect constant), Logos lacks a ground-truth signal to correct it.
Hence, these findings reveal a crucial insight: tackling spatial and factual hallucinations may require architectural innovations or new data paradigms beyond the scope of post-training methods. These results also suggest that future work might explore better pretraining data, spatially enhanced network designs, and advanced frameworks to mitigate such hallucinations.
Weakness 3: broader range of multimodal reasoning datasets.
To show the generalizability of Logos beyond MIRAGE, we evaluate Logos-7B on three general-purpose benchmarks: MathVista, MathVision, and MathVerse. As shown in the table, Logos-7B not only surpasses its strong base model but also outperforms concurrent state-of-the-art methods, achieving the highest average score. This shows that Logos, while mitigating hallucinations, also enhances reasoning abilities, leading to strong performance on general academic benchmarks.
| Method | MathVista | MathVision | MathVerse | Avg |
|---|---|---|---|---|
| Qwen2.5-VL-7B | 68.2 | 25.4 | 47.9 | 47.1 |
| R1-OneVision-7B | 64.1 | 19.7 | 46.4 | 46.8 |
| OpenVLThinker-7B | 70.2 | 25.3 | 47.9 | 47.8 |
| MM-EUREKA | 73.0 | 26.9 | 50.3 | 50.0 |
| Logos-7B (Ours) | 72.3 | 29.8 | 52.5 | 51.5 |
Weakness 4: Discuss Logos with DAPO and ADARFT
Thank you for raising these concurrent works. We analyze them and present a comparison of the key differences:
Comparison with DAPO: The primary distinction lies in the policy gradient loss design. DAPO introduces several strategies (e.g., clip-higher, token-level loss) optimized for text-only math problems, which often feature very long reasoning chains. Our experiments found these strong regularizations were less effective and could lead to overfitting in the multimodal context, where reasoning chains are typically more concise (see Appendix D.5). Logos employs a simpler, more robust sample-level policy gradient loss that proved more effective for multimodal training, as shown in Table 8.
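To make the distinction concrete, the sketch below contrasts sample-level and token-level aggregation of a per-token policy-gradient surrogate. It is an illustrative simplification under our own naming, not the exact loss code of Logos or DAPO:

```python
import torch

def aggregate_loss(token_losses, masks, sample_level=True):
    """Contrast sample-level vs. token-level aggregation of a clipped
    policy-gradient surrogate. `token_losses` and `masks` have shape
    (num_samples, max_len); masks are 1 on valid response tokens.
    Illustrative only, not the exact Logos/DAPO implementations."""
    if sample_level:
        # Sample-level (as in Logos): average within each response,
        # then average across responses, so every sample counts equally.
        per_sample = (token_losses * masks).sum(dim=1) / masks.sum(dim=1).clamp(min=1)
        return per_sample.mean()
    # Token-level (as in DAPO): pool all valid tokens, so long responses
    # contribute proportionally more to the gradient.
    return (token_losses * masks).sum() / masks.sum().clamp(min=1)

# Example with two responses of different lengths (hypothetical values):
tl = torch.tensor([[0.2, 0.3, 0.0], [0.1, 0.4, 0.5]])
m  = torch.tensor([[1., 1., 0.], [1., 1., 1.]])
print(aggregate_loss(tl, m, sample_level=True))   # mean of per-response means
print(aggregate_loss(tl, m, sample_level=False))  # mean over all valid tokens
```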
Comparison with ADARFT: The key difference is the curriculum learning mechanism. ADARFT uses a static offline difficulty annotation and greedy strategy to select training batches. This approach does not guarantee that all data is utilized effectively. Logos employs a more dynamic hybrid online-offline framework. ORF ensures each batch is filled with high-value samples, while offline difficulty thresholding stages the curriculum effectively. This hybrid approach maximizes data utility and leads to better performance. To provide direct empirical evidence, we compare Logos (w/o CHI) against these methods on MIRAGE and MathVista. Logos demonstrates superior performance on both.
| Method | MIRAGE | MathVista |
|---|---|---|
| DAPO | 30.0 | 69.6 |
| ADARFT | 34.0 | 70.1 |
| Logos (w/o CHI) | 35.7 | 71.9 |
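As an illustration of the online reward filtration (ORF) idea described above, the following sketch drops prompts whose rollout rewards carry no learning signal. The criterion and helper name are hypothetical simplifications, not the exact ORF implementation:

```python
def orf_filter(batch_rewards, min_var=1e-6):
    """Illustrative sketch of an online reward filtration step: keep only
    prompts whose G rollout rewards are not all identical, since all-correct
    or all-wrong groups yield zero group-normalized advantage and thus no
    learning signal. Hypothetical helper, not the exact Logos implementation."""
    kept = []
    for idx, rewards in enumerate(batch_rewards):   # rewards: list of G scalars
        mean = sum(rewards) / len(rewards)
        var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
        if var > min_var:
            kept.append(idx)
    return kept

# Example: the second prompt (all rollouts correct) is filtered out.
print(orf_filter([[1, 0, 1, 0], [1, 1, 1, 1], [0, 0, 1, 0]]))  # [0, 2]
```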
Weakness 5: Strengthening the originality claim by highlighting how the combination and synergy of these components address the unique challenges.
The originality of Logos lies precisely in the synergy between its two core components, CRFT and CHI, which work together to solve a two-part problem in reasoning. The first part, addressed by CRFT, is that an MLLM must learn how to reason: CRFT uses a curriculum-based reinforcement learning approach to transform a standard MLLM into a reasoning-capable model, teaching it to generate coherent, step-by-step thought processes. The second part, addressed by CHI, is that a reasoning model needs guidance on what to reason about for a specific problem: CHI provides dynamic hints about both the problem-solving structure (h_topic) and the key information to focus on (h_question).
Crucially, these components are mutually dependent and reinforcing. An MLLM without CRFT lacks the basic ability to follow CHI's guidance. Conversely, a CRFT-enhanced model without CHI's targeted hints still struggles with complex problems. It is this integrated system, a capable reasoning engine combined with intelligent, dynamic guidance, that allows Logos to directly and substantially reduce complex errors such as logical hallucinations (from 64.7% to 49.3%) and fabrication hallucinations (from 25.5% to 15.6%). This holistic, synergistic framework is the core innovation of our work.
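To illustrate how the two hint types could be combined at inference time, here is a minimal, hypothetical sketch of a CHI-style prompt template; the exact template used by Logos may differ:

```python
def build_chi_prompt(question, h_topic, h_question):
    """Illustrative sketch of collaborative hint inference (CHI): the hint
    about the problem-solving structure (h_topic) and the hint about key
    entities/conditions (h_question) are prepended to the query. The exact
    template used by Logos may differ."""
    return (
        f"Hint (problem-solving structure): {h_topic}\n"
        f"Hint (key information to focus on): {h_question}\n"
        f"Question: {question}\n"
        "Please reason step by step and give the final answer."
    )
```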
Question 2: What measures are in place to ensure that this evaluation accurately assesses factuality? Is there any manual double-checking of the step and claim factuality evaluations?
For the first question: our evaluation is based on the principle that, while the explanatory text of a reasoning chain may vary, the core intermediate factual and numerical results must be consistent across all valid solutions. To accommodate reasoning variations, our evaluation script includes a "REASONABLE" tag (Fig. 9), which treats a step as correct if it is semantically equivalent to the ground truth, even when it does not exactly match the reference reasoning path. This ensures we credit valid alternative reasoning paths.
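As a simplified illustration of how such judge labels could be aggregated into a step-level factuality score: the "REASONABLE" tag comes from Fig. 9, while the other label names and the aggregation itself are hypothetical:

```python
def step_factuality(judge_labels):
    """Compute a step-level factuality score from per-step judge labels.
    Steps tagged CORRECT or REASONABLE (semantically equivalent to the
    ground truth, even if phrased differently) both count as factual.
    Illustrative sketch; the actual evaluation script may differ."""
    if not judge_labels:
        return 0.0
    factual = sum(label in {"CORRECT", "REASONABLE"} for label in judge_labels)
    return factual / len(judge_labels)

# Example: three steps judged, one flagged as hallucinated.
print(step_factuality(["CORRECT", "REASONABLE", "HALLUCINATED"]))  # 0.666...
```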
For the second question: We conducted manual checking to validate our factuality scores. As the table below shows, the automated scores for two representative models align closely with those from three human experts, confirming the reliability of our evaluation protocol.
| Model | Step factuality | Claim factuality |
|---|---|---|
| Gemini-2-flash | 47.8 | 47.4 |
| Gemini-2-flash (manual) | 48.5 | 47.9 |
| Qwen2.5-VL-7B | 34.7 | 31.7 |
| Qwen2.5-VL-7B (manual) | 33.9 | 31.3 |
Question 3: In-depth analysis of Logos in mitigating spatial hallucinations compared to training-free and SFT methods
We conducted new experiments to compare Logos-7B against other methods specifically on spatial hallucination. The results in the following table are striking: while mitigating spatial deficits remains a significant challenge, Logos is the only method that shows a slight improvement, whereas all other methods, including training-free and SFT-based approaches, actually degrade performance. The reasons are two-fold. First, training-free methods (e.g., Self-Reflection) cannot introduce new knowledge; by forcing the base model, which already lacks spatial reasoning ability, to adhere to rigid prompting rules, they can amplify existing errors. Second, SFT-based methods (e.g., R1-OneVision) are often trained on datasets dominated by other domains (e.g., mathematics), which can cause the model to overfit to mathematical reasoning patterns at the expense of its already weak spatial capabilities. In contrast, Logos avoids these pitfalls: the RL approach enhances reasoning without the overfitting risk of SFT, while our CHI component provides targeted hints (h_topic and h_question) that guide the model on difficult spatial problems. Therefore, Logos is the only method that avoids performance degradation.
| Model | Spatial Hallucination (%) |
|---|---|
| Qwen2.5-VL-7B | 33.4 |
| +Reflection | 37.4 |
| +VIC | 36.7 |
| R1-OneVision-7B | 35.6 |
| Logos-7B | 29.9 |
Thank you for your detailed response. I think the paper is acceptable, most of my concerns have been addressed, and I will raise my score.
Thank you for your prompt and insightful review; we truly appreciate your help in shaping the paper for the better.
Thank you again!
Authors of Submission 11966
In this paper, the authors propose the MIRAGE benchmark to isolate reasoning hallucinations via questions where inputs are correctly perceived but reasoning errors persist. The authors also propose Logos to mitigate these hallucinations. Multiple experiments were conducted to justify the value of both the proposed benchmark and the mitigation method.
Strengths and Weaknesses
Strengths
- The proposed MIRAGE benchmark provides a more fine-grained measure of reasoning-induced hallucinations in multimodal settings by ruling out perception-induced hallucinations. It offers a more precise measure of MLLMs' capability.
- Both the data collection process of MIRAGE and the proposed Logos for hallucination mitigation make sense and are technically sound to me.
- Extensive experiments and analysis were conducted to justify the proposed benchmark and approach.
- Writing is good and easy to follow.
Weaknesses
- For Table 6, any idea why several of the metrics got worse results for the row "+h_question"?
- Typo in Line 298: "benefiting" should be "Benefiting".
Questions
Please refer to the weakness section and provide more discussions on the performance degradation.
Limitations
Yes
Justification for Final Rating
The rebuttal addressed my main concerns and thus I will keep my original positive rating.
Formatting Issues
NA
Thanks for your insightful suggestions. We hope this response can address your concerns.
Weakness 1: For Table 6, any idea why several of the metrics got worse results for the row "+h_question"?
Based on the description, we assume that the reviewer compares CRFT + h_question only with CRFT + h_topic only in Table 6. This observation points to the nuanced and complementary roles that different hint types play within our Logos framework. The performance difference between +h_question and +h_topic stems from their distinct functions:
Specifically, the primary role of h_question is to guide the model to identify key entities and starting conditions within the question. It excels at grounding the model's initial focus but does not provide a complete, step-by-step problem-solving template or structure. In contrast, h_topic provides a high-level problem-solving schema or template, offering a clear, structured formulation for the entire reasoning process and ensuring logical completeness from start to finish. This difference leads to a performance gap across metrics: CRFT + h_question achieves better final accuracy (+0.3) than CRFT + h_topic, but its metrics for intermediate steps (i.e., F_step and LHS) are slightly lower (-0.4 and -0.01, respectively). Our Logos model enhanced by CRFT is already optimized to generate complete and precise reasoning chains. When this strong base model is augmented with h_topic, it directly benefits from the clear structural guidance, making it easier to construct a perfectly formed reasoning chain. This synergy explains why +h_topic achieves slightly better scores on metrics like F_step and LHS compared to +h_question, and it further validates the design of the full CHI setup, which combines both h_topic and h_question to achieve balanced improvement across all dimensions.
Crucially, we want to emphasize that CRFT + h_question still significantly outperforms the CRFT only baseline across all metrics. This confirms that guiding the model's initial attention with h_question is a valuable and effective strategy.
Finally, we appreciate you drawing attention to this table. In reviewing it, we identified a typo in the final row ("+full CHI"), which should have all checkmarks as it represents the complete Logos framework. We will correct this in the revision to ensure clarity. We will also add the above analysis of hint functionalities to the paper to further clarify these results.
Weakness 2: Typo of "Benefiting".
Thank you for pointing this out. We apologize for this and other potential typographical errors. We will perform a thorough proofread of the entire manuscript to polish the writing in the final version.
Thanks for the rebuttal. My main concerns are addressed and thus I will keep my original positive rating.
Thank you for your prompt and insightful review; we truly appreciate your help in shaping the paper for the better.
Thank you again!
Authors of Submission 11966
This paper introduces MIRAGE, the first benchmark specifically designed to evaluate multimodal reasoning hallucinations in Multimodal Large Language Models (MLLMs). By ensuring correct perception inputs, MIRAGE isolates reasoning errors, offering a clearer view into model failures. It features multi-level evaluation metrics, including accuracy, factuality, and a novel LLM hallucination score for comprehensive assessment. The authors conduct an in-depth analysis of hallucination patterns across question types, identifying spatial reasoning as a major weakness in current MLLMs. To address this, they propose Logos, a baseline method that integrates curriculum-based reinforcement fine-tuning and collaborative hint inference to promote logically consistent reasoning. Logos demonstrates reduced hallucinations and improved answer accuracy, indicating its potential for guiding future improvements.
Strengths and Weaknesses
Strengths:
- Dataset contribution with detailed reasoning-chain annotations. The effort of benchmarking many existing methods is impressive.
- The proposed Logos improved the baseline Qwen2.5-VL.
Weaknesses:
- The amount of data is very limited.
- There are plenty of datasets that already have reasoning chains, like ScienceQA, MathVista, etc. Table 1 is not accurate, or please provide more explanation.
- The contribution is limited other than the dataset.
- The authors mention hallucination types, but they are not properly defined.
Questions
Please refer to the weakness section.
Limitations
Not addressed.
Justification for Final Rating
I appreciate the clarifications and justifications of the authors. However, my concerns still remain: 1. the amount of data in the dataset; 2. the unique contribution of this paper.
I have decided to change my rating from "reject" to "borderline reject".
Formatting Issues
N/A
Thanks for your constructive reviews. We hope this response can address your concerns.
Weakness 1: The amount of data is very limited.
| Dataset | Annotation | Size |
|---|---|---|
| OmniBench | Answers, Description | 1,142 |
| R-Bench-m | Answers | 665 |
| MME-CoT | Answers, Step, Claim, Description | 1,130 |
| MIRAGE | Answers, Reasoning Chain, Step, Claim, Hint, Description | 1,329 |
Thank you for your valuable feedback on our submission. We appreciate your insights and the opportunity to clarify several aspects of our work.
Regarding the dataset size, we acknowledge the comparison to existing benchmarks and recognize that many reasoning datasets in the literature typically have around 1,000 examples. However, MIRAGE has been specifically designed to offer a more diverse and comprehensive set of annotations, going beyond just the size of the dataset. As referenced in the latest survey [1], our dataset provides significantly more granular annotations, such as ground-truth answers, reasoning chains, hints, step-by-step claims, and detailed descriptions, which provide a richer context for evaluating reasoning tasks.
In contrast to previous works that primarily focus on answer-level evaluation, MIRAGE supports a multi-level evaluation approach, including assessments at the final answer, intermediate results, and reasoning chain levels. This additional annotation and evaluative capability are integral to improving the robustness of reasoning models by explicitly addressing reasoning hallucinations. The unique emphasis on these detailed annotations is intended to enhance the quality of the reasoning process rather than merely increase the dataset size.
We believe that this holistic approach to reasoning hallucination evaluation makes MIRAGE a valuable contribution to the community, as it provides a more comprehensive and rigorous framework for assessing the true reasoning abilities of models, not just their final answers.
[1] Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey, arXiv 2025
Weakness 2: There are plenty of datasets that already have reasoning chains, like ScienceQA, MathVista, etc. Table 1 is not accurate, or please provide more explanation.
Thank you for your thoughtful feedback on the comparison of MIRAGE with other datasets. We would like to provide further clarification to address the concerns raised.
Firstly, regarding the point that datasets like ScienceQA and MathVista include reasoning chains, we would like to clarify that, as indicated in the concurrent survey [1], these datasets do not provide the level of detailed reasoning chains that MIRAGE does. Specifically:
1. MathVista does not contain ground-truth reasoning chains. The focus of MathVista is primarily on final answers, without explicitly providing detailed, step-by-step reasoning processes.
2. ScienceQA includes only high-level lecture hints but lacks comprehensive, fine-grained reasoning chains. The reasoning chains in ScienceQA are not sufficiently detailed or structured to qualify as ground-truth reasoning chains for the purpose of evaluation. The "lecture" annotations provided are simplistic hints, rather than the multi-step, comprehensive chains that MIRAGE offers.
In contrast, MIRAGE stands out by providing rich and multi-level annotations—including ground-truth answers, reasoning chains, step-by-step claims, and descriptions—enabling a thorough assessment of reasoning processes across multiple levels. This distinction makes MIRAGE more suitable for evaluating reasoning hallucinations, as it allows for a detailed analysis of the intermediate steps in the reasoning process, beyond just the final answer.
Moreover, MIRAGE is explicitly designed for hallucination assessment in multimodal reasoning tasks, a specific focus that sets it apart from the aforementioned datasets, which primarily judge the correctness of final answers. This emphasis on hallucination detection at various levels (e.g., final answer, intermediate steps, and reasoning chain) provides a more comprehensive framework for evaluating the performance of reasoning models, making MIRAGE uniquely positioned to address the challenges of reasoning accuracy and hallucinations.
To clarify the difference from previous works, we will add more benchmarks (including MathVista and ScienceQA) to the comparison in the revision.
Weakness 3: The contribution is limited other than the dataset.
Thank you for your insightful feedback. We would like to clarify that the contributions of our paper extend well beyond the MIRAGE dataset. While the dataset is a central component of our work, we also introduce several other key contributions that significantly advance the understanding and mitigation of reasoning hallucinations in MLLMs. These contributions are outlined as follows:
- Insights from Experimental Results on MLLMs. Beyond the dataset, our experimental results reveal several critical insights into how MLLMs handle reasoning tasks: 1) We find that model scale, data scale, and training stages significantly influence hallucinations related to logic, fabrication, and factuality. This highlights the complexities of scaling MLLMs and the challenges they face with reasoning errors; 2) More notably, our experiments show that scaling alone does not improve spatial hallucinations, where models misinterpret spatial relationships. This suggests that current MLLMs exhibit weak visual reasoning capabilities and cannot simply overcome hallucinations by increasing model size or training data. These findings are crucial for guiding future work in hallucination mitigation strategies for MLLMs.
- The Logos Training Framework. Our third contribution is the proposed Logos training framework, which aims to reduce multimodal logical hallucinations and improve answer accuracy. By leveraging curriculum reinforcement fine-tuning and collaborative hint inference, Logos provides a novel training paradigm that enhances model reasoning capabilities. This framework addresses key issues identified in our experiments, specifically targeting logical and factual hallucinations, and is a promising approach for future MLLM research.
We believe that these contributions—ranging from the dataset to experimental insights and a novel training framework—collectively offer significant advancements to the MLLM research community.
Weakness 4: The author mentioned hallucination types, but not properly defined.
Thank you for your feedback regarding the definition of hallucination types. We apologize for the oversight and appreciate the opportunity to clarify the terminology used in our paper.
To address this, we provide the following clear definitions for each hallucination type, as outlined in the table below:
| Hallucination Type | Definition |
|---|---|
| Spatial Hallucination | Errors in reasoning about spatial relationships, shapes, or complex visual operations. |
| Logical Hallucination | Errors in logical consistency or reasoning, even when surface-level facts are correct. |
| Factuality Hallucination | Factually incorrect claims about scientific principles or established knowledge in input data. |
| Context Hallucination | Inconsistencies between intermediate reasoning steps and final predictions. |
| Fabrication Hallucination | Entirely invented values, entities, or relationships not in input data or real world. |
These definitions aim to provide a clear understanding of the different types of hallucinations we discuss in our work. For further clarification, detailed descriptions of each hallucination type can be found in Appendix A and Table 7 of the paper. Meanwhile, we will also mention the hallucination type definition in the revision of the main paper.
Thank you for the rebuttal and clarifications! They address my concerns and questions. I will take that into account in my final rating, after seeing all of the comments and discussions.
Thank you for your constructive review and timely response; we truly appreciate your help in shaping the paper for the better.
Thank you again!
Authors of Submission 11966
This paper presents MIRAGE, a benchmark that isolates reasoning-induced hallucinations in multimodal LLMs, and Logos, a method to mitigate them. The dataset's fine-grained annotations and analysis provide new insights into model weaknesses, and the experiments show clear improvements. Although concerns about dataset size and evaluation were noted, the rebuttal addressed them convincingly. Overall, this is a timely and valuable contribution. I recommend acceptance.