PaperHub
6.4 / 10
Poster · 5 reviewers
Ratings: 4, 4, 4, 4, 4 (min 4, max 4, std 0.0)
Confidence: 4.4
Novelty: 2.6 | Quality: 3.0 | Clarity: 3.2 | Significance: 3.0
NeurIPS 2025

Hallucination at a Glance: Controlled Visual Edits and Fine-Grained Multimodal Learning

OpenReview | PDF
Submitted: 2025-05-07 | Updated: 2025-10-29

Abstract

Keywords
MLLMs

Reviews and Discussion

Review
Rating: 4

The paper introduces a lightweight, training-free framework for detecting hallucinations in multimodal large language models (MLLMs), particularly vision-language models. The proposed method, Single-shot Hallucination Detection (SHD), leverages spatial attention maps and token-level entropy to determine whether the generated text is grounded in the input image. Rather than relying on costly retraining or dataset-specific tuning, SHD performs hallucination detection in a single forward pass using only internal model signals. The authors evaluate SHD across six hallucination types and multiple state-of-the-art MLLMs, showing competitive or superior performance compared to more complex baselines. The work contributes a novel, interpretable, and efficient technique for improving the reliability of multimodal LLM outputs in real-world settings.

Strengths and Weaknesses

Quality: The proposed method, SHD (Single-shot Hallucination Detection), is a well-engineered, training-free solution that leverages intrinsic model signals—namely token entropy and cross-modal attention—to detect hallucinations.

The experiments are extensive, covering six different hallucination categories and multiple state-of-the-art MLLMs (e.g., MiniGPT4, LLaVA, Otter). The authors use a recently proposed benchmark (POPE) and ablation studies to validate individual components.

The paper reports competitive results against several strong baselines, including detect-then-rank models, supervised discriminators, and gradient-based techniques—despite SHD requiring no additional tuning or training.

Clarity: The paper is generally well-organized and clearly written. The motivation for a lightweight, interpretable hallucination detector is well-argued in the introduction.

The authors provide intuitive visualizations of the entropy maps and cross-attention distributions to support interpretability.

Each of the six hallucination categories is explained with clarity, and examples are included for illustration.

Significance: Hallucination detection in multimodal LLMs is an increasingly important area, particularly as vision-language models see growing deployment in safety-critical domains.

The proposed method has practical significance: being lightweight, interpretable, and model-agnostic, it can be applied in real-world applications without the need for costly retraining.

The ability to detect hallucinations in a single forward pass makes SHD particularly relevant for latency-sensitive systems and scalable inference.

Originality: The work is original in its framing of hallucination detection as a signal derived from token entropy and cross-attention, without the need for ground-truth labels or external discriminators.

Unlike many prior works that rely on retraining or multiple forward passes (e.g., question answering consistency checks, multi-step verification), SHD provides a conceptually novel and computationally efficient approach.

Questions

  1. How well does SHD generalize to hallucination detection tasks beyond the POPE benchmark and short-form VQA-style prompts? For instance, how would it perform in document-grounded generation, image captioning, or complex reasoning scenarios involving multiple modalities?
  2. How does SHD handle borderline or ambiguous cases? Are there examples where the model flags correct outputs as hallucinations (false positives)?
  3. Even a qualitative comparison or brief discussion in related work would help position SHD among broader hallucination mitigation strategies.

Limitations

Partially.

The authors include a limitations section that briefly acknowledges the dataset scope (POPE) and the reliance on internal model signals (token entropy, cross-modal attention). However, several critical aspects are under-addressed and would benefit from deeper reflection:

Suggestions for Improvement

Bias Amplification & Model Dependence: Since SHD depends entirely on internal attention and entropy signals, which are products of the underlying model's learned biases, the method may inherit or even reinforce spurious correlations and visual/textual biases. A discussion on how this might affect fairness or performance across diverse demographic attributes, image styles, or non-English content would be valuable.

Deployment Misuse or Overtrust: SHD is lightweight and attractive for production settings, but it may create overconfidence in model safety if not properly calibrated. The authors should caution against using SHD as a standalone truth evaluator, especially in sensitive domains (e.g., healthcare, legal AI), and emphasize the need for human oversight.

Scope of Applicability: The method has only been evaluated on static images and short-form QA. Authors should note that applying SHD to other modalities (e.g., video, audio, document OCR) may not yield reliable results without adaptation or retraining, even if SHD is training-free in its current form.

Formatting Concerns

None

Author Response

Overall Response

Thanks for the thoughtful and constructive feedback! In response, we expanded the MED-Real set to 50 real-world samples, and included results on BLIP-3o to demonstrate architectural generality. We also clarified our hallucination mitigation strategy, feature-level supervision design, and benchmark construction process. We will include all clarifications and results in the revised version.

Response to W1: Scale of MED-Real and Generalization to Real-World Benchmarks

We appreciate the reviewer’s suggestion to strengthen claims of real-world generalizability with more robust evaluation.

1. Expanded MED-Real evaluation: While the original MED-Real set contained 35 real-world image pairs, we have now expanded it to 50 examples by adding 15 newly curated pairs. The table below shows the accuracy progression with increasing real-image samples:

| Model | acc(15) | acc(35) | acc(50) |
| --- | --- | --- | --- |
| Qwen2-VL-7B | 66.67 | 82.86 | 78.00 |
| Qwen2-VL-7B (ours) | 73.33 | 82.86 | 80.00 |
| Qwen2.5-VL-3B | 46.67 | 54.29 | 52.00 |
| Qwen2.5-VL-3B (ours) | 53.33 | 57.14 | 56.00 |
| Qwen2.5-VL-7B | 60.00 | 68.57 | 66.00 |
| Qwen2.5-VL-7B (ours) | 73.33 | 74.29 | 74.00 |
| LLaVA-1.6-7B | 46.67 | 34.29 | 38.00 |
| LLaVA-1.6-7B (ours) | 60.00 | 45.71 | 50.00 |
| LLaMA3.2-11B-VL | 26.67 | 40.00 | 36.00 |
| LLaMA3.2-11B-VL (ours) | 46.67 | 51.43 | 50.00 |

Across all model families, MED-finetuning results in clear performance gains. The improvements are consistent as the real-world sample set grows, which suggests that the model is not simply overfitting to synthetic distribution patterns.

2. General benchmark validation (Table 3, Page 8): To further validate generalizability, we report results on a range of diverse real-image benchmarks, including POPE, MMStar, BLINK, and MME. These datasets contain non-synthetic, real-world image distributions and diverse tasks. For example:

  • Qwen2.5-VL-7B (Ours) improves:
    • POPE: 96.29 → 97.52
    • MME: 1685.14 → 1701.87
    • BLINK Visual Corr: 33.72 → 37.79

These results show that MED improves real-world performance even on unrelated tasks, indicating robust, transferable fine-grained reasoning capabilities — not simply domain overfitting.

We will include this clarification and the expanded table in the revision to emphasize MED’s impact on generalizable multimodal understanding.

Response to W2: Hallucination Risk from Qwen-VL and Gemini

We thank the reviewer for raising the important concern about potential hallucinations from Qwen2.5-VL-72B and Gemini in our data generation pipeline. We fully agree that vision-language models are not immune to hallucination, which is why our pipeline is explicitly designed to minimize and isolate such errors through two key strategies: task decomposition and multi-stage filtering.

1. Task decomposition: minimizing hallucination through controlled prompts: Instead of relying on the model to generate free-form captions or edits, we break the data generation into small, grounded subtasks:

  • Qwen2.5-VL is only asked to revise or extend existing captions, based on concrete visual prompts (e.g., “add a red ball”).
  • Edited captions are generated with strong contextual guidance, including the original caption and the specified edit type.
  • Difference descriptions are produced by a separate text-only model (Qwen3-32B) to avoid compounding visual noise.

This decomposition ensures that each model step is tightly scoped and grounded in the image, significantly reducing hallucination risks.

2. Multi-stage filtering: detecting and removing faulty samples: To further control quality, we apply multiple filtering layers:

  • Qwen2.5-VL first evaluates raw image-caption pairs via a structured quality rubric, discarding low-alignment or low-editability samples.
  • After editing, we use CLIP-based similarity filtering to ensure visual differences are minimal and semantically plausible.
  • Finally, we perform manual verification on 1,000 image pairs, covering all 11 edit types, to identify and eliminate rare hallucination patterns and refine prompt templates. According to the manual verification results, we designed heuristic rules to filter out low-quality data.

Through this process, we ensure that both Qwen-VL and Gemini are used only within narrow, controllable contexts, with human- and CLIP-based safeguards in place. We will add clarification and provide statistics (e.g., instruction rejection rates) in the revised version to further support the robustness of our pipeline.
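To make these two strategies concrete, the following schematic Python sketch shows one pass of the pipeline as described above. Every helper (`quality_score`, `revise_caption`, `edit_image`, `clip_similarity`, `describe_difference`) and both thresholds are hypothetical placeholders for the corresponding model calls and settings, not the released implementation.

```python
def build_med_sample(image, caption, edit_type,
                     quality_score, revise_caption, edit_image,
                     clip_similarity, describe_difference,
                     quality_min=0.8, clip_min=0.7):
    # Stage 1: structured quality rubric on the raw image-caption pair.
    if quality_score(image, caption) < quality_min:
        return None  # discard low-alignment or low-editability samples

    # Stage 2: tightly scoped subtasks -- a concrete edit instruction plus a
    # revised caption conditioned on the original caption and the edit type.
    instruction, edited_caption = revise_caption(image, caption, edit_type)
    edited_image = edit_image(image, instruction)

    # Stage 3: CLIP-based similarity filter keeps edits minimal and plausible.
    if clip_similarity(image, edited_image) < clip_min:
        return None

    # Stage 4: a separate text-only model writes the difference description,
    # comparing the two captions in isolation from visual noise.
    difference = describe_difference(caption, edited_caption)
    return edited_image, edited_caption, difference
```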

Response to W3: Meta-Text Artifacts and Benchmarking Noise

We thank the reviewer for raising the concern about meta-text phrases such as “in the second caption” or “is described”, which could introduce subtle benchmarking noise or false negatives.

We clarify that such phrasing appears only in intermediate textual difference descriptions generated during Step 3 of our data construction pipeline (see Section 3.2).

Before creating the final MED benchmark, we perform a systematic transformation to ensure questions are fully visually grounded:

  • All contrastive difference sentences are rewritten into QA-style prompts that refer exclusively to visual content, not to how things are described in text.
  • Each QA item includes four answer choices, and we perform manual verification to ensure:
    • Only one answer is visually correct.
    • Distractors are visually plausible but wrong.
    • No options rely on or reference the structure of captions or linguistic cues like “caption” or “mention”.

This transformation ensures that no meta-textual language appears in the final benchmark, and that model evaluation remains image-grounded. We will make this clarification explicit in the final version and include an illustrative before/after example to demonstrate how these artifacts are resolved.
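To illustrate the target format of that transformation, here is a schematic sketch of an image-grounded multiple-choice item; `rewrite_as_visual_question` and `make_distractors` are hypothetical helpers standing in for the rewriting and manual-verification steps described above.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class MEDBenchItem:
    question: str        # phrased purely in terms of visual content
    choices: List[str]   # exactly four options
    answer_idx: int      # index of the single visually correct option

def to_benchmark_item(difference_sentence: str,
                      rewrite_as_visual_question: Callable[[str], Tuple[str, str]],
                      make_distractors: Callable[[str, int], List[str]]) -> MEDBenchItem:
    # Rewrite e.g. "in the second caption, a red ball is described" into a
    # visual question plus its correct answer, then add plausible distractors.
    question, correct = rewrite_as_visual_question(difference_sentence)
    choices = [correct] + make_distractors(correct, 3)
    return MEDBenchItem(question=question, choices=choices, answer_idx=0)
```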

Response to W4: Broader Evaluation Across Model Architectures

We thank the reviewer for highlighting the importance of evaluating across diverse model families.

While our primary evaluations focused on Qwen, LLaVA, and LLaMA-style vision-language models, we emphasize that our method is model-agnostic and readily applicable to other architectures. To demonstrate this, we additionally applied our fine-grained supervision method to BLIP-3o, a strong open-source model architecturally distinct from the LLaVA/Qwen family.

| Model | avg | obj | attr | scene | spatial | action | part | count | differ | compar | neg | univ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BLIP-3o-8B | 44.24 | 46.15 | 50.00 | 69.23 | 50.00 | 37.50 | 31.25 | 44.44 | 41.67 | 31.58 | 50.00 | 42.86 |
| BLIP-3o-8B (ours) | 50.90 | 38.46 | 62.50 | 46.15 | 42.86 | 62.50 | 37.50 | 66.67 | 50.00 | 52.63 | 35.71 | 57.14 |

The finetuned BLIP-3o model shows significant improvements across many dimensions, especially in attribute, action, counting, and difference reasoning. This supports our claim that the proposed fine-grained visual edit supervision enhances a model’s reasoning ability regardless of its backbone architecture.

In total, we evaluate and improve models from four diverse model families:

  • Qwen series (VL-7B / VL-2.5-7B / 3B)
  • LLaVA series (1.6-Vicuna)
  • LLaMA-Vision (3.2-11B)
  • BLIP-3o (Transformer with Querying Transformer head)

We will include this extended evaluation in the revised version to reinforce the generality and adaptability of our approach across architectures.

Response to W5: Image Quality and Clarity

Thank you for the valuable feedback. In the revised version, we will provide higher-resolution images and add reference captions to enhance clarity and interpretability.

Response to Q1: Feature Consistency Loss and Hallucination Reduction

Thank you for the insightful question. Our feature consistency loss reduces hallucinations by improving visual representations prior to decoding, not by modifying the autoregressive decoding itself.

Specifically, the loss encourages stable and semantically aligned embeddings for minimally edited image pairs, which enhances the fidelity of the input to the language decoder. This grounding helps reduce hallucinations during text generation by ensuring that the decoder is conditioned on more reliable visual features.
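For intuition, here is a minimal PyTorch-style sketch of how we read this training signal (an interpretation of Equation (7) in the paper, not the released training code); `image_encoder`, `text_decoder`, `target_ids`, and `pad_id` are illustrative placeholders.

```python
import torch.nn.functional as F

def med_sft_loss(image_encoder, text_decoder, x, x_edit, target_ids, pad_id):
    """Caption loss on the feature difference of a minimally edited image pair."""
    feat = image_encoder(x)            # embeddings of the original image, I_theta(x_i)
    feat_edit = image_encoder(x_edit)  # embeddings of the edited image, I_theta(x_hat_i)
    diff = feat - feat_edit            # difference representation conditioning the decoder

    # Teacher-forced next-token prediction of the supervised difference description.
    logits = text_decoder(diff, target_ids[:, :-1])  # (batch, seq_len - 1, vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        target_ids[:, 1:].reshape(-1),
        ignore_index=pad_id,
    )
```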

As shown in Table 3, even updating only the vision encoder leads to improved performance on perception-heavy tasks (e.g., Count, Coarse), highlighting that better visual representations alone significantly reduce errors. Full gains are achieved with joint tuning, but the primary hallucination mitigation stems from the representation level, not decoding.

In summary, the feature consistency loss acts at the representation level, ensuring the visual embeddings are robust and well-grounded. This improved representation serves as a more reliable foundation for the standard autoregressive decoder, which then produces more accurate and less hallucinatory text as a downstream effect. We hope this clarifies the interaction. We thank the reviewer again for the valuable feedback.

Comment

I appreciate the authors for addressing my concerns; I have no additional questions about the paper.

Comment

Thank you for your time and valuable feedback! We sincerely appreciate your effort in reviewing our work and are grateful for your positive assessment. We will carefully incorporate your insights to further refine the final manuscript.

Comment

Dear Reviewer K4PJ,

Your expert feedback is crucial to refining this work. While we fully understand the discussion period may pose challenges for your schedule, we would value the chance to clarify any final points with you prior to its conclusion on Aug 8.

We hope we've been able to address your questions and concerns so far. We would be glad to address any further concerns you may have, and we will try our best to clarify promptly.

Thank you again for your feedback and comments; they were really helpful!

Warm Regards, Authors of Submission #9063

Review
Rating: 4

The paper points out that current MLLMs struggle with fine-grained visual differences, which can lead to inaccurate predictions or hallucinations. To address this, the authors designed an automated data generation pipeline, yielding the Micro Edit Dataset (MED). This dataset comprises over 50K image-text pairs across 11 fine-grained edit categories, specifically designed to highlight subtle visual changes. The authors then validate the effectiveness of the dataset with different MLLMs such as Qwen2.5-VL. The results demonstrate that the dataset not only enhances MLLM performance on the dedicated MED benchmark but also improves performance on other VL tasks.

Strengths and Weaknesses

Strengths

  1. The identification of existing MLLMs' weakness in recognizing subtle visual changes is significant. Current SOTA MLLMs remain considerably behind human-level recognition abilities (e.g., GPT-4.1-2025-04-14 at 54% versus human performance at 95%), highlighting a crucial direction for future MLLM optimization.
  2. The dataset generation pipeline is well-designed, incorporating various filtering and editing techniques to ensure the production of a high-quality dataset.
  3. The authors rigorously validated the MED across diverse MLLMs, utilizing both the MED benchmark and other VL benchmarks. This extensive validation confirms the dataset's utility in enhancing MLLM recognition capabilities.

Weakness

  1. Some details are not clear. See Questions.

Questions

  1. I am curious about the scalability of the Micro Edit Dataset (MED). If the dataset size were continuously scaled up, would performance improvements persist?
  2. The authors employed LoRA for fine-tuning the model. What would be the impact of not using LoRA? Additionally, the hyper-parameters vary significantly across different base models. What might explain this discrepancy?
  3. I noticed that CLIP loss is incorporated into all presented equations. However, most existing MLLMs do not typically integrate CLIP loss during SFT. Is CLIP loss indeed utilized in the paper's SFT process?
  4. The paper claims to use a feature-level consistency loss. How is this implemented within the proposed method? If I understand correctly, the SFT loss only requires the model to distinguish differences between two input images.

Limitations

yes

Justification for Final Rating

After careful consideration, I will maintain my score as "borderline accept".

Formatting Concerns

no

Author Response

Overall Response

Thanks so much for the time and thoughtful feedback! Based on these comments, we conducted detailed scalability experiments and efficiency comparisons between LoRA and full fine-tuning. We also clarified the role of CLIP loss, hyperparameter settings, and the implicit feature-level consistency enforced by our SFT objective. These updates confirm the scalability, efficiency, and soundness of our method, and we will incorporate them into the final version if the paper is accepted.

Response to Q1: Dataset Scalability of MED

We thank the reviewer for the valuable question regarding the scalability of the Micro Edit Dataset (MED).

To assess this, we trained models on progressively larger subsets of MED, ranging from 10k to 50k samples. The results are shown in the table below:

| Data size | avg | obj | attr | scene | spatial | action | part | count | differ | compar | neg | univ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10k | 40.00 | 53.85 | 25.00 | 46.15 | 57.14 | 18.75 | 31.25 | 44.44 | 58.33 | 36.84 | 50.00 | 28.57 |
| 20k | 42.42 | 53.85 | 25.00 | 53.85 | 42.86 | 37.50 | 50.00 | 50.00 | 50.00 | 31.58 | 35.71 | 42.86 |
| 30k | 43.64 | 46.15 | 56.25 | 61.54 | 61.54 | 37.50 | 31.25 | 27.78 | 33.33 | 33.33 | 35.71 | 50.00 |
| 40k | 47.88 | 53.85 | 56.25 | 38.46 | 50.00 | 50.00 | 43.75 | 50.00 | 50.00 | 36.84 | 42.86 | 57.14 |
| 50k | 51.51 | 57.14 | 56.25 | 53.84 | 42.86 | 46.67 | 43.75 | 55.56 | 50.00 | 47.37 | 50.00 | 64.29 |

The results show a consistent improvement in performance as the dataset size increases. The model trained with the full 50k MED samples achieves the highest average accuracy and shows stronger capability in nearly all categories. This demonstrates that MED supports scalable learning and that its fine-grained design continues to provide useful supervision as more data is added.

Response to Q2: Impact of Using LoRA and Hyperparameter Differences

We thank the reviewer for the thoughtful question regarding the use of LoRA and the variation in hyperparameters across base models.

1. LoRA vs. full-parameter fine-tuning: In our experiments, we chose to apply LoRA for all models due to its strong efficiency-performance trade-off. As shown below, LoRA not only significantly reduces training time, but also achieves comparable—or in most cases, nearly identical—performance to full fine-tuning:

| Model | Fine-tuning Method | Time (50k data) | Performance Impact |
| --- | --- | --- | --- |
| Qwen2.5-VL-7B | LoRA | 6 hours | Similar |
| Qwen2.5-VL-7B | Full | 29 hours | |
| LLaVA-1.6-7B | LoRA | 13 hours | Similar |
| LLaVA-1.6-7B | Full | 57 hours | |
| LLaMA3.2-11B-VL | LoRA | 8 hours | Similar |
| LLaMA3.2-11B-VL | Full | 70+ hours | |

These results suggest that for fine-grained visual understanding, most of the necessary capability can be adapted through low-rank adaptation. This may be because such tasks primarily require adjustments in higher-layer representations, rather than full re-training of the model backbone.

2. Hyperparameter variations across models: Regarding the hyperparameter differences across base models, the variations largely reflect architecture-specific tuning practices. Our implementation follows the LLaMA-Factory framework, and the learning rate, batch size, and warmup ratios we use are based on commonly adopted settings within that community for each model family. The variation arises from differences in:

  • Model scale (e.g., 3B vs 7B vs 11B),
  • Tokenization behavior and sequence lengths,
  • Pretraining objectives and convergence speeds.

We will clarify this in the appendix and include the exact configurations for each model.

Response to Q3: Role of CLIP Loss vs. SFT Loss

We thank the reviewer for the close reading and for raising this important point.

To clarify: although our paper discusses CLIP loss (Equations 2 and 4) as part of the general pretraining framework for MLLMs, we do not use the CLIP loss during the supervised fine-tuning (SFT) stage proposed in our method.

The presentation of the CLIP-style loss is meant to formally contextualize the limitations of existing training objectives (e.g., Eq. 2, 4, and 6) and to motivate the need for a task-aligned loss for difference reasoning. As explained in Section 4.2 and 4.3, models trained under these conventional objectives struggle with downstream fine-grained difference detection, due to both noise in supervision and objective mismatch.

Our proposed SFT method addresses this by replacing the standard objective with a targeted difference-aware training loss. The actual SFT objective used in our method is shown in Equation (7):

$$\hat{R}_{\text{SFT}}(\theta) = \frac{1}{|\mathcal{D}_{\text{edit}}|} \sum_{(x_i, \hat{x}_i, S_{\phi}(t_i, \hat{t}_i)) \in \mathcal{D}_{\text{edit}}} \left[ l_{\text{cap}}\left(Z_{\theta}[I_{\theta}(x_i) - I_{\theta}(\hat{x}_i)],\ S_{\phi}(t_i, \hat{t}_i)\right) \right].$$

This loss does not include any CLIP-style component, and is purely designed to supervise the model to describe visual differences between two images in a contrastive, semantically grounded way. We will revise the text to make it more explicit that CLIP loss is discussed only to frame prior training regimes and is not used in our final supervised fine-tuning pipeline.

Response to Q4: Feature-Level Consistency Loss Implementation

We thank the reviewer for the careful reading of our manuscript and for raising this insightful question. We appreciate the opportunity to clarify the role of the feature-level consistency in our proposed method.

Your understanding is correct: the supervised fine-tuning (SFT) loss requires the model to distinguish and describe the differences between two input images. Importantly, the so-called feature-level consistency loss is not introduced as an explicit auxiliary loss term, but rather is implicitly embedded in the design of the SFT loss itself. As shown in Equation (7) of our paper, the loss is defined as:

$$\hat{R}_{\text{SFT}}(\theta) = \frac{1}{|\mathcal{D}_{\text{edit}}|} \sum_{(x_i, \hat{x}_i, S_{\phi}(t_i, \hat{t}_i)) \in \mathcal{D}_{\text{edit}}} \left[ l_{\text{cap}}\left(Z_{\theta}[I_{\theta}(x_i) - I_{\theta}(\hat{x}_i)],\ S_{\phi}(t_i, \hat{t}_i)\right) \right].$$

Here:

  • $I_{\theta}(x_i)$ and $I_{\theta}(\hat{x}_i)$ denote the feature-level embeddings of the original and edited images, respectively, produced by the image encoder $I_{\theta}$.
  • Their difference, $I_{\theta}(x_i) - I_{\theta}(\hat{x}_i)$, is passed to the text decoder $Z_{\theta}$, which is trained to generate the correct description $y_i$ of the visual change.

This design implicitly enforces feature-level consistency: for the model to succeed in generating a correct difference description, the encoder must produce embeddings where fine-grained, semantically meaningful edits in the image space correspond to predictable and structured changes in feature space. Thus, the model is encouraged to:

  • Produce stable visual representations for unedited content.
  • Encode fine-grained differences in a way that can be directly mapped to textual explanations.

This implicit regularization improves the model's fine-grained reasoning ability and reduces hallucination, as shown in our experimental results across both synthetic and real benchmarks. We will clarify this point more explicitly in the revised version.
Comment

Thanks for providing such an informative response. It has addressed my concerns. Therefore, I will keep my rating as borderline accept.

Comment

Thank you for your time and valuable feedback! We sincerely appreciate your effort in reviewing our work and are grateful for your positive assessment. We will carefully incorporate your insights to further refine the final manuscript.

Comment

Dear Reviewer PV47,

Your expert feedback is crucial to refining this work. While we fully understand the discussion period may pose challenges for your schedule, we would value the chance to clarify any final points with you prior to its conclusion on Aug 8.

We hope we've been able to address your questions and concerns so far. We would be glad to address any further concerns you may have, and we will try our best to clarify promptly.

Thank you again for your feedback and comments; they were really helpful!

Warm Regards, Authors of Submission #9063

Review
Rating: 4

This paper works on fine-grained visual reasoning in Multimodal Large Language Models (MLLMs), particularly their tendency to hallucinate or miss small but meaningful visual differences (e.g., object presence, count, spatial changes). The authors identify two key limitations: insufficient training data and weak alignment objectives.

To address these, they propose:

  1. Micro Edit Dataset (MED): A large-scale dataset (~50K image-text pairs) of minimally edited image pairs and aligned captions spanning 11 fine-grained semantic edit types.
  2. Controlled Visual Editing Pipeline: Generates image pairs using editing models like Gemini Flash 2.0, and captions via Qwen-VL and Qwen3-32B.
  3. Micro Edit Detection (MED) Benchmark: A 165-question benchmark across the same 11 edit types, designed to evaluate sensitivity to small differences.
  4. Feature Consistency Regularization: A fine-tuning objective that encourages stable visual embeddings for minor edits, thereby improving semantic alignment.

They fine-tune open-source models (e.g., Qwen2-VL-7B, LLaVA) and show significant improvements on the MED benchmark and generalization to other tasks (e.g., POPE, MMStar, BLINK). Their best model (Qwen2.5-VL-7B) matches or surpasses GPT-4o and Claude 3.7 Sonnet on fine-grained reasoning tasks.

Strengths and Weaknesses

Strengths:

  1. Well-motivated and timely problem: Hallucination in MLLMs is a known but under-addressed issue, and the paper identifies a specific failure mode and addresses it directly.
  2. High-quality dataset creation: the MED dataset is carefully constructed using various tools for editing, filtering, and alignment. The dataset and benchmark can be a solid contribution.

Weaknesses:

  1. Fine-tuning with a specific learning objective on a specific dataset is expected to yield improvements, but it is not clear how the fine-tuned models perform on general benchmarks.
  2. The MED-Real set only has 35 image pairs, which might be too small to verify real-world applicability.
  3. Other families of open-source models should be evaluated, e.g., BLIP-2 / BLIP-3.

Questions

See weaknesses.

Limitations

Yes

Justification for Final Rating

Authors addressed some of my concerns in the rebuttal.

Formatting Concerns

N/A

Author Response

Overall Response

Thanks for the insightful feedback! Based on these comments, we have added a new experiment on BLIP-3o, expanded the MED-Real set to 50 samples, and included 4 general benchmarks (POPE, MMStar, BLINK, MMVP) to demonstrate the generalizability and robustness of our method. We also clarified the concerns regarding overfitting, dataset size, and model diversity. We will include all these discussions and results in the final version if the paper is accepted.

Response to W1: Generalizability Beyond MED

We thank the reviewer for raising the concern about generalization. As shown in Table 2 (Page 8) and the table below, our fine-tuned models consistently improve across diverse general benchmarks such as POPE, MMStar, BLINK, and MMVP.

For instance, Qwen2.5-VL-7B improves in average score from 55.47 to 58.52, and MME from 1685.14 to 1701.87, demonstrating clear gains beyond MED. These results indicate that our fine-grained tuning approach enhances not only task-specific performance but also general multimodal reasoning, including coarse/fine perception, hallucination mitigation, and counting.

| Model | Pope | Coarse | Fine | Visual_Sim | Visual_Corr | Count | MMVP | Ave | MME |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2-VL-7B | 92.50 | 71.21 | 48.24 | 51.11 | 30.23 | 55.83 | 31.33 | 54.35 | 1679.52 |
| Qwen2-VL-7B (Fine-tuned) | 96.27 | 73.92 | 46.16 | 51.85 | 33.72 | 59.17 | 32.67 | 56.25 | 1681.27 |
| Qwen2.5-VL-7B | 96.29 | 73.95 | 57.35 | 49.63 | 33.72 | 50.00 | 27.33 | 55.47 | 1685.14 |
| Qwen2.5-VL-7B (Fine-tuned) | 97.52 | 75.97 | 59.36 | 51.85 | 37.79 | 59.17 | 28.00 | 58.52 | 1701.87 |
| LLaVA-V1.6-7B | 95.56 | 58.28 | 31.93 | 51.11 | 21.51 | 45.83 | 28.67 | 47.56 | 1441.89 |
| LLaVA-V1.6-7B (Fine-tuned) | 97.39 | 56.74 | 35.13 | 48.14 | 24.42 | 49.17 | 30.00 | 48.71 | 1420.57 |
| LLaMA-3.2-11B | – | 69.03 | 48.94 | 43.70 | 20.93 | 44.17 | 26.00 | 42.13 | 1421.71 |
| LLaMA-3.2-11B (Fine-tuned) | – | 72.60 | 47.21 | 45.93 | 19.19 | 50.00 | 28.00 | 43.82 | 1430.67 |

Response to W2: MED-Real Set Size and Generalization Validation

We thank the reviewer for pointing out the limited size of the original MED-Real set (35 image pairs). While this subset already shows meaningful performance differences across models, we agree that further validation would be very helpful.

To address this, we augmented the real-world evaluation with an additional 15 minimally edited image pairs, bringing the total to 50 samples. This allows us to test real-world generalization at multiple scales (15, 35, 50). The results are as follows:

| Model | acc@15 | acc@35 | acc@50 |
| --- | --- | --- | --- |
| Qwen2 | 66.67 | 82.86 | 78.00 |
| Qwen2 (Ours) | 73.33 | 82.86 | 80.00 |
| Qwen2.5-3B | 46.67 | 54.29 | 52.00 |
| Qwen2.5-3B (Ours) | 53.33 | 57.14 | 56.00 |
| Qwen2.5-7B | 60.00 | 68.57 | 66.00 |
| Qwen2.5-7B (Ours) | 73.33 | 74.29 | 74.00 |
| LLaVA | 46.67 | 34.29 | 38.00 |
| LLaVA (Ours) | 60.00 | 45.71 | 50.00 |
| LLaMA3.2 | 26.67 | 40.00 | 36.00 |
| LLaMA3.2 (Ours) | 46.67 | 51.43 | 50.00 |

These results indicate:

  • Consistent performance boost from our fine-tuning across all sizes and models.
  • No overfitting to synthetic edits — models trained on MED also generalize well to real-world minimal differences, as explained in the Response to W1.

We will include the expanded 50-sample MED-Real set and analysis in the revised appendix to strengthen our empirical claims.

Response to W3: Inclusion of Other Open-Source Models (e.g., BLIP-2 / BLIP-3)

We appreciate the reviewer’s suggestion to evaluate additional model families beyond Qwen, LLaVA, and LLaMA.

To address this, we include the newest model from the BLIP series, BLIP-3o, in our MED benchmark evaluation. The table below shows its performance across all 11 semantic edit types, both before and after fine-tuning on our dataset:

| Model | Avg | Obj | Attr | Scene | Spatial | Action | Part | Count | Differ | Compar | Neg | Univ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BLIP-3o-8B | 44.24 | 46.15 | 50.00 | 69.23 | 50.00 | 37.50 | 31.25 | 44.44 | 41.67 | 31.58 | 50.00 | 42.86 |
| BLIP-3o-8B (Ours) | 50.90 | 38.46 | 62.50 | 46.15 | 42.86 | 62.50 | 37.50 | 66.67 | 50.00 | 52.63 | 35.71 | 57.14 |

These results show:

  • Consistent performance gains from our fine-tuning approach.
  • Improved reasoning on complex types such as comparison, counting, and action.

We will add these results to the revised paper and appendix to further demonstrate the broad applicability of our method across model families.

Comment

Dear Reviewer driM,

Thank you so much for your valuable comments and for taking time to review our rebuttal! While we fully understand the discussion period may pose challenges for your schedule, we would value the chance to clarify any final points with you prior to its conclusion on Aug 8.

We hope we've been able to address your questions and concerns so far. We would be glad to address any further concerns you may have, and we will try our best to clarify promptly.

Thank you again for your feedback and comments; they were really helpful!

Warm Regards, Authors of Submission #9063

Review
Rating: 4

This paper targets a core issue in Large Vision-Language Models (LVLMs): fine-grained understanding of image content. First, the authors show that even some of the most advanced LVLMs, such as GPT-4o, still have issues with detail understanding, and they group these errors into a range of types. They then construct a dataset for fine-grained image learning via their proposed pipeline, which uses image editing to subtly modify an image into a near-identical pair and asks questions about the original and edited images. The authors further argue that models trained to describe the differences between two similar images do not perform well on the downstream description task. Motivated by this observation, they propose a new supervision training method to overcome this problem, so that the model can learn to discriminate fine-grained differences. The authors also propose a new loss function that includes a contrastive loss to train the LVLMs and show performance superiority over other methods.

Strengths and Weaknesses

Weaknesses:

  1. The data generation pipeline involves too many models: open-source and closed-source, LVLMs and LLMs, spanning different platforms, and including both text generation and image generation models. The whole pipeline is quite complicated, which makes the work hard to replicate because so many models are involved.
  2. Though I personally do not have any problem with leveraging synthetic data to train models, I still doubt the quality of the proposed datasets. I think the authors should release more quantitative results to evaluate their prompt template and instruction quality.

Strength:

  1. For the MED benchmark, this work manually verified and categorized the question types. The authors also examine data contamination across the splits.
  2. This work does not simply train downstream models on the curated dataset; it also investigates the generalization problem and minimizes the performance gap on downstream tasks by proposing a new training method. The authors suggest that the performance gap of existing LVLMs might be caused by noisy training data, which follows a binomial probability distribution over whether a good-quality description of the paired images is given, and they therefore propose a clean supervision training pipeline to mitigate such noise.
  3. The curated dataset is quite valuable to the research community.

Questions

  1. In the data generation process, the pipeline uses a QwenVL model to generate instructions and to filter images from the raw dataset. Then the authors use the Gemini model to generate image pairs from the generated instructions. Since the instruction generation process is very important for the quality of the generated images in the proposed MED dataset, did you consider any other quality control methods for the instructions? From my understanding, QwenVL is indeed very powerful but far from perfect; a Qwen2.5-VL-72B model can also hallucinate while generating instructions.

  2. The authors use another data filtering module powered by CLIP models to verify the quality of the generated images against the original images. However, from my experience with many image generative models, semantic filtering alone is not enough for generated-image quality control. In fact, the authors should consider introducing some basic image quality evaluation method to control quality (e.g., SSIM). Furthermore, the authors only run a CLIP embedding similarity check between the original image and the edited image, not between the edited image and the instruction. Why is that?

Limitations

Yes

Justification for Final Rating

I am satisfied with the response and recommend accepting this paper. Though I have considered raising my rating, I don't feel this paper quite reaches a 5 (accept). So, I will maintain my score of 4.

Formatting Concerns

No

Author Response

Overall Response

Thanks for the insightful feedback! Based on these comments, we clarified the modular design of our pipeline, addressed concerns about hallucination and instruction quality, and added quantitative metrics (e.g., MATTR, SSIM) to support our claims. We also highlighted the roles of Qwen3 and manual verification in ensuring data quality. All clarifications will be incorporated into the revised version.

Response to W1: Complexity of the Data Generation Pipeline

We thank the reviewer for raising this important concern regarding the complexity and reproducibility of our data generation pipeline. While our framework may appear to involve a large number of models at first glance, we would like to clarify that the core pipeline relies primarily on just two key components:

  1. Qwen2.5-VL-72B — used consistently for instruction generation, caption refinement, and caption alignment.
  2. Gemini Flash 2.0 — used for high-fidelity image editing with minimal semantic changes.

To ensure quality and robustness, we decompose the generation process into several small, modular steps, each focusing on a simple task well within the capability of the model employed. This modular design is intentional: rather than requiring complex, open-ended generation from a single model (which risks hallucination), each model performs only tightly-scoped operations, such as:

  • Revising captions only when elements are visibly present in the image.
  • Generating editing instructions based on predefined semantic categories (e.g., spatial, object, attribute).
  • Producing updated captions conditioned on known differences.

This design minimizes the cognitive load on the model and enhances output reliability.

Additionally, several measures support reproducibility:

  • All prompt templates and model configurations have been released (in the supplementary material).
  • Our dataset construction relies on publicly available APIs or open-source models, and all scripts for leveraging those toolkits are provided.
  • Each pipeline stage is deterministic and documented in detail (see Section 3.2).

We believe this strategy offers a principled balance between model efficiency, data quality, and reproducibility. We will clarify this point further in the revision to dispel the impression of an overly entangled system.

Response to Q1: Instruction Quality and Hallucination Control in Qwen2.5-VL

We appreciate the reviewer’s concern regarding potential hallucinations in instruction generation using Qwen2.5-VL-72B. While no vision-language model is entirely immune to hallucination, our pipeline is explicitly designed to mitigate and control such risks through two key strategies: task decomposition and multi-stage filtering.

1. Task decomposition: keeping model outputs focused and grounded: Rather than asking the model to generate complex captions or instructions in one pass, we break the process into atomic and visually grounded subtasks, each scoped narrowly enough to reduce hallucination likelihood:

  • In Step 1, Qwen2.5-VL is only asked to revise or extend an existing caption, conditioned on a clearly visible target object or attribute and the original image. This minimizes hallucination by anchoring generation on visual evidence.
  • In Step 2, when generating the edited caption, the model receives the full original caption, the edit type (e.g., “change in spatial relation”), and the edited image. This instruction framing ensures that only relevant semantic changes are included.
  • In Step 3, difference descriptions are generated by a separate text-only LLM (Qwen3-32B), which compares the original and edited captions in isolation from visual noise, further reducing compounding hallucinations.

2. Multi-stage filtering: catching imperfect instructions or captions: Beyond carefully-scoped prompts, we apply multiple steps of filtering to ensure quality and remove any hallucinated or inconsistent outputs:

  • In the initial image filtering stage, Qwen2.5-VL-72B evaluates each image-caption pair using a structured scoring rubric that checks for caption clarity, visual relevance, and editability. Only high-quality samples are retained.
  • After image editing and caption generation, we apply CLIP-based similarity filtering between the original and edited images to ensure that visual changes are minimal and semantically plausible.
  • For the final dataset, we manually verify 1,000 randomly sampled image pairs (spanning all 11 edit types). This human check helps catch rare but systematic hallucinations and refine our instruction prompt templates accordingly.

Through this design, we ensure that Qwen2.5-VL is always solving a constrained, visually grounded task rather than performing free-form generation. We will clarify this design in the revised version and include additional metrics (e.g., instruction acceptance rates, manual error rates) to support our filtering effectiveness.

Response to W2: Evaluating Prompt Template and Instruction Quality

We appreciate the reviewer’s concern regarding the quality of the proposed synthetic dataset, especially the prompt templates and instructions.

  1. Quantitative metrics for text diversity: To further validate textual quality, we calculated standard diversity metrics on our instructions:

    • Moving Average TTR (MATTR, window size=50, ours): 0.7393
    • Moving Average TTR (MATTR, window size=50, llava-instruct-150k): 0.7834

    These scores indicate rich lexical diversity and low redundancy in instruction and prompt phrasing for a VQA dataset (a minimal sketch of the MATTR computation appears at the end of this response).

  2. Instruction quality control via Qwen3 and manual verification: In our original data generation pipeline (Section 3.2), we explicitly employed Qwen3-32B, a strong text-only LLM, to refine and optimize edit instructions and difference descriptions. These instructions were rewritten for clarity, specificity, and semantic correctness. Furthermore, we conducted manual verification on 1,000 samples spanning all 11 edit categories to ensure alignment between images, captions, and difference descriptions, reducing systematic errors.

  3. Strong downstream performance: The consistent improvements achieved across multiple model families (Qwen, LLaVA, LLaMA) and benchmarks demonstrate the practical value of the curated data.

We will include these quantitative metrics and clarify the role of Qwen3 and manual validation in the revised version to strengthen transparency around instruction quality.
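As a reference for the diversity numbers reported in point 1, a minimal sketch of the MATTR computation (window size 50) is given below; the simple regex tokenizer is our assumption, not necessarily the tokenization used for the reported scores.

```python
import re

def mattr(text: str, window: int = 50) -> float:
    # Moving Average Type-Token Ratio: mean type/token ratio over sliding windows.
    tokens = re.findall(r"[a-z']+", text.lower())
    if len(tokens) <= window:
        return len(set(tokens)) / max(len(tokens), 1)
    ttrs = [len(set(tokens[i:i + window])) / window
            for i in range(len(tokens) - window + 1)]
    return sum(ttrs) / len(ttrs)

# Example: mattr(" ".join(instructions), window=50) over the concatenated instruction texts.
```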

Response to Q2: On Image Quality Filtering and Use of CLIP vs. SSIM

We appreciate the reviewer’s insightful suggestions regarding image quality control.

Our current pipeline adopts CLIP-based embedding similarity—a practice inspired by [1]—to ensure that edited images remain semantically close to the originals. CLIP excels at capturing high-level conceptual consistency, which aligns well with our goal of fine-grained semantic editing.

To complement this, we computed SSIM on our 50K image pairs to assess low-level visual similarity:

SSIM Statistics (on MED Dataset): Mean: 0.5011 | Std: 0.2028 | Min: -0.0134 | Max: 0.9839

These results confirm that most edits are structurally subtle, as intended.

We chose CLIP over SSIM because:

  • SSIM is known to struggle with localized edits on uniform backgrounds [2].
  • CLIP is more robust to superficial visual changes and focuses on semantic-level alignment.

As for instruction-image alignment, the instructions are already used as generation inputs to Gemini and further validated via manual checks on 1,000+ samples. While CLIP-text similarity could be added, our current design already enforces semantic alignment both procedurally and manually.

We will add the SSIM results to the appendix and clarify this in the revision.
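For concreteness, here is a minimal sketch of the two similarity checks discussed above; it assumes the Hugging Face `transformers` CLIP implementation and `scikit-image` for SSIM, which are illustrative choices rather than the exact toolkit used for the paper.

```python
import numpy as np
import torch
from PIL import Image
from skimage.metrics import structural_similarity
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(img_a: Image.Image, img_b: Image.Image) -> float:
    # Cosine similarity between CLIP image embeddings (semantic-level check).
    inputs = processor(images=[img_a, img_b], return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float(feats[0] @ feats[1])

def ssim_score(img_a: Image.Image, img_b: Image.Image, size=(256, 256)) -> float:
    # Structural similarity on grayscale versions (low-level check).
    a = np.array(img_a.convert("L").resize(size))
    b = np.array(img_b.convert("L").resize(size))
    return float(structural_similarity(a, b, data_range=255))
```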

[1] Tong S, Liu Z, Zhai Y, et al. Eyes wide shut? exploring the visual shortcomings of multimodal llms[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 9568-9578.

[2] Ghazouali S E, Michelucci U, Hillali Y E, et al. CSIM: A Copula-based similarity index sensitive to local changes for Image quality assessment[J]. arXiv preprint arXiv:2410.01411, 2024

Comment

Thank you for the response and hard work. While most of my concerns are addressed, I have a follow-up question regarding the CLIP similarity check: how do you decide the threshold of the CLIP score for the similarity check?

Comment

Dear Reviewer WxCP,

Thank you so much for your valuable comments and for taking time to review our rebuttal! While we fully understand the discussion period may pose challenges for your schedule, we would value the chance to clarify any final points with you prior to its conclusion on Aug 8.

We hope we've been able to address your questions and concerns so far. We would be glad to address any further concerns you may have, and we will try our best to clarify promptly.

Thank you again for your feedback and comments; they were really helpful!

Warm Regards, Authors of Submission #9063

Comment

Thank you for your kind acknowledgment! We're glad our responses clarified your concerns.

Regarding your follow-up question on CLIP-score thresholds, we clarify two distinct approaches:

  1. For MED-Bench, we follow Tong et al. [1] in using a 95% similarity threshold.

  2. For MED dataset construction, we conducted additional ablation studies during rebuttal to optimize the data quality-scale trade-off. Our methodology and key findings are summarized below:

Additional Ablation Study: Threshold (δ) vs. Performance

We trained identical Qwen2.5-VL-7B models on MED subsets filtered at different CLIP-similarity thresholds (δ), evaluating accuracy across 12 categories. Results are summarized below (also in W2 response for Reviewer WxCP):

| δ | Data size | avg | obj | attr | scene | spatial | action | part | count | differ | compar | neg | univ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| δ=0.7 | 50k | 51.51 | 57.14 | 56.25 | 53.84 | 42.86 | 46.67 | 43.75 | 55.56 | 50.00 | 47.37 | 50.00 | 64.29 |
| δ=0.7 | random-12k | 41.21 | 53.85 | 43.75 | 53.85 | 42.86 | 62.50 | 18.75 | 22.22 | 33.33 | 31.58 | 50.00 | 50.00 |
| δ=0.8 | 36k | 46.67 | 53.84 | 37.50 | 46.15 | 21.43 | 43.75 | 31.25 | 61.11 | 50.00 | 42.11 | 50.00 | 78.57 |
| δ=0.8 | random-12k | 42.42 | 61.54 | 31.25 | 53.85 | 28.57 | 37.50 | 37.50 | 33.33 | 41.67 | 36.84 | 64.86 | 50.00 |
| δ=0.9 | 12k | 44.85 | 46.15 | 50.00 | 46.15 | 35.71 | 43.75 | 43.75 | 38.89 | 50.00 | 31.58 | 57.14 | 57.14 |
  • Higher thresholds (δ≥0.8) improved per-sample alignment (e.g., δ=0.9 achieved 78.6% in universal tasks) but reduced data size (12k samples), causing significant drops in compositional tasks like spatial reasoning (-7.2% vs δ=0.7) and counting (-16.7%).
  • Lower thresholds (δ<0.7) introduced noisy pairs, degrading performance across all categories.
  • δ=0.7 (50k samples) delivered peak average accuracy (51.5%) by:
    • Preserving critical diversity for complex reasoning (e.g., +18.8% in attribute binding vs δ=0.8)
    • Maintaining scale advantages (random 12k subset at δ=0.7 underperformed by 10.3%)

This threshold maximizes robustness while avoiding artificial scale limitations. We will detail this analysis in Section 4.2 of the revised manuscript, and we appreciate your insightful query. We welcome any additional questions should you wish to discuss this further!

[1] Tong S, Liu Z, Zhai Y, et al. Eyes wide shut? exploring the visual shortcomings of multimodal llms[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 9568-9578.

Comment

Thank you for your response. I have no follow-up questions and would consider raising my rating.

Comment

Thank you so much for your time and valuable feedback. We truly appreciate your consideration in raising the rating! Your suggestions will be carefully incorporated to further improve our manuscript in the final version. We'd be happy to discuss anything further :)

Review
Rating: 4

This paper tackles hallucinations in large VLMs, a problem of high interest in LM research. The authors hypothesize the root of hallucinations of large VLMs is inadequate training data requiring fine-grained visual understanding. The authors propose the Micro Edit Dataset, which provides both a training and benchmarking portion for improving and evaluating performance on fine-grained visual differences. The authors perform extensive fine-tuning on their dataset based on risk minimization over hallucination. They show that the SFT procedure results in higher performance across a number of standard VLM benchmarks.

Strengths and Weaknesses

Strengths: The problem is clearly of high importance, and the visual quality of the paired images seems high. While not totally realistic visually, the resulting performance gains seem clear enough to validate the pipeline as a component to be built upon in larger VLM fine-tuning pipelines. The formulation of SFT as empirical risk minimization for hallucination makes sense and results in nice performance gains.

Weaknesses: There are artifacts of the text-based difference caption generation in multiple of the presented examples. For instance, the texts include phrases like "are not mentioned," "is described," "in the second caption." This raises questions about potential false negatives during benchmarking due to selecting the next best choice that describes an image instead of a caption.

As the dataset is a main contribution of the paper, more extensive justification of design choices (e.g., similarity thresholding, failure cases, ablations on backbone LLMs) would help strengthen the usefulness of the MED.

Some of the images are qualitatively obviously synthetic. The performance on the MED-Real set is much higher than analogous performance on the synthetic set (admittedly the real set is small).

Questions

Can the authors comment on why performance seems to be higher by a large margin on the MED-Real set vs the main MED benchmark? Are there any particular reasons stemming from the candidate data distribution, or is this an artifact of the model training domain? In this case, how can it be verified that the performance gains from fine-tuning on the MED were not primarily due to bridging the domain gap between real and synthetic?

Can the authors comment on the quite frequent artifacts of the texts referring to samples as texts? Will this not introduce false negatives in benchmarked models?

Can the authors demonstrate qualitative or quantitative ablations over the data curation pipeline, showing why the specific similarity thresholding, etc., is necessary to curate the best dataset?

Overall I have a positive impression of this work, but addressing my questions would encourage me to improve my rating. Thanks!

Limitations

Yes.

Justification for Final Rating

I maintain my positive review. I maintain concern about some of my initially raised points, as there were many text artifacts presented in the paper, and the revised data was not shared for qualitative evaluation. Additionally, while the expansion of real-image evaluation is appreciated, 50 images is still not a very large sample size. I recognize these concerns are difficult to address during a rebuttal period, and I remain positive overall about the contribution and core idea of the work.

Formatting Concerns

None.

Author Response

Overall Response

Thanks for the insightful feedback! Based on these comments, we conducted new ablation studies (on similarity threshold and model scale), expanded the MED-Real set, and clarified benchmark construction procedures to ensure robustness and generalization. We address all concerns in detail below and will incorporate these updates into the final version if the paper is accepted.

Response to W1: Text Artifacts in Difference Descriptions and Benchmark Reliability

We thank the reviewer for pointing out the phrasing artifacts such as “is described”, “are not mentioned”, or “in the second caption”, which appear in some of our difference descriptions.

We want to clarify that these phrases originally appeared in intermediate contrastive descriptions produced during Step 3 of our data construction pipeline (Section 3.2). However, before finalizing the benchmark question-answer format, we apply a systematic transformation of these descriptions into visually grounded natural language questions. Specifically:

  • We rewrite contrastive captions into QA-style prompts, phrased purely in terms of visual content, not textual description.
  • All questions in the MED benchmark are then presented with four answer choices, and we manually verify every question-answer set (Section 3.3) to ensure that:
    • Exactly one option is visually correct.
    • The other options are plausible but visually incorrect.
    • No options rely on meta-knowledge of captions or textual artifacts.

This human verification step is essential to prevent the kind of false negatives the reviewer is concerned about — e.g., cases where a model might “correctly” describe the image but get penalized due to linguistic mismatches in the caption text.

We will add clarification and examples in the final version showing how we resolve these caption artifacts during the transition from raw difference sentences to benchmark QA pairs, ensuring evaluation is image-grounded and artifact-free.

Response to W2: Design Choices and Ablations in MED Construction and Q3: Impact of Similarity Threshold and Backbone Scaling

We thank the reviewers for emphasizing the importance of justifying our dataset design, especially regarding similarity thresholding and backbone model selection. We have now included two additional sets of ablation studies to strengthen this point:

1. Similarity threshold (δ) ablation: We compare model performance when training on subsets of MED with different CLIP-based similarity thresholds (δ). Results show a clear trend: stricter thresholds (e.g., δ = 0.8 or 0.9) yield better per-sample quality, but smaller data sizes limit overall performance. Meanwhile, δ = 0.7 with full 50k samples yields the best average accuracy, balancing quality and data scale.

| δ | Data size | avg | obj | attr | scene | spatial | action | part | count | differ | compar | neg | univ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| δ=0.7 | 50k | 51.51 | 57.14 | 56.25 | 53.84 | 42.86 | 46.67 | 43.75 | 55.56 | 50.00 | 47.37 | 50.00 | 64.29 |
| δ=0.7 | random-12k | 41.21 | 53.85 | 43.75 | 53.85 | 42.86 | 62.50 | 18.75 | 22.22 | 33.33 | 31.58 | 50.00 | 50.00 |
| δ=0.8 | 36k | 46.67 | 53.84 | 37.50 | 46.15 | 21.43 | 43.75 | 31.25 | 61.11 | 50.00 | 42.11 | 50.00 | 78.57 |
| δ=0.8 | random-12k | 42.42 | 61.54 | 31.25 | 53.85 | 28.57 | 37.50 | 37.50 | 33.33 | 41.67 | 36.84 | 64.86 | 50.00 |
| δ=0.9 | 12k | 44.85 | 46.15 | 50.00 | 46.15 | 35.71 | 43.75 | 43.75 | 38.89 | 50.00 | 31.58 | 57.14 | 57.14 |

2. Backbone scaling ablation: To test the general applicability across model scales, we fine-tuned Qwen2.5-VL-3B and 7B on the same data. As shown below, both 3B and 7B variants benefit significantly from our SFT strategy, demonstrating that MED’s effectiveness is robust across different backbone sizes.

| Model | Avg | Obj | Attr | Scene | Spatial | Action | Part | Count | Differ | Compar | Neg | Univ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-VL-3B | 33.94 | 38.46 | 37.50 | 53.85 | 21.43 | 25.00 | 43.75 | 27.78 | 16.67 | 26.32 | 35.71 | 50.00 |
| Qwen2.5-VL-3B (Ours) | 38.79 | 46.15 | 43.75 | 38.46 | 35.71 | 43.75 | 43.75 | 38.89 | 16.67 | 42.11 | 28.57 | 42.86 |
| Qwen2.5-VL-7B | 39.74 | 53.85 | 50.00 | 38.46 | 42.86 | 12.50 | 18.75 | 44.44 | 50.00 | 26.32 | 42.86 | 57.14 |
| Qwen2.5-VL-7B (Ours) | 51.61 | 57.14 | 56.25 | 53.84 | 42.86 | 46.67 | 43.75 | 55.56 | 50.00 | 47.37 | 50.00 | 64.29 |

Together, these ablations validate the core design choices behind MED—highlighting how both threshold calibration and model scalability are accounted for. We will include these new results and discussion in the revised version.

Response to W3: Synthetic Visual Quality and Real Set Performance Gap

We thank the reviewer for pointing out the visual quality difference between synthetic images and the MED-Real subset. While some synthetic samples may appear artificial upon close inspection, we ensured semantic fidelity through CLIP-based filtering with a cosine similarity threshold > 0.95. The average CLIP similarity gap between synthetic and real pairs is < 5%, suggesting the visual differences are minimal in embedding space.

As shown below, the performance gap between MED and MED-Real is not uniformly large, especially for LLaVA and LLaMA models:

| Model | MED (Avg) | MED-Real (Acc %) |
| --- | --- | --- |
| Qwen2-VL-7B | 38.48 | 82.86 |
| Qwen2-VL-7B (Ours) | 47.55 | 82.86 |
| Qwen2.5-VL-7B | 39.74 | 68.57 |
| Qwen2.5-VL-7B (Ours) | 51.61 | 74.29 |
| LLaVA-V1.6-vicuna-7B | 31.04 | 34.29 |
| LLaVA-V1.6-vicuna-7B (Ours) | 40.44 | 45.71 |
| LLaMA-3.2-11B-Vision-Instruct | 34.71 | 40.00 |
| LLaMA-3.2-11B (Ours) | 40.92 | 51.43 |

These results suggest that our synthetic dataset generalizes well to real-world data. The high real-set performance of Qwen models may be partially due to pretraining exposure to Visual Genome–style content, which is consistent with the source of MED-Real. However, for other models like LLaVA and LLaMA, the real/synthetic performance is comparable, indicating no evidence of domain overfitting.

Response to Q1: MED vs. MED-Real Performance and Domain Concerns

We appreciate the reviewer’s question on the observed performance gap between the synthetic MED benchmark and the MED-Real set. We address this from three perspectives:

  1. Performance gap is not uniformly large: As we noted in our response to W3, while Qwen models show higher performance on MED-Real (possibly due to pretraining exposure), models like LLaVA and LLaMA exhibit only minor differences between MED and MED-Real. This suggests that the synthetic nature of MED is not the primary limitation for those models.

  2. Improvement is not just from bridging the synthetic–real gap: Our fine-tuned models improve not only on MED, but also on unrelated, real-world general benchmarks that do not involve synthetic data. As shown below (see Table 2, Page 8, for full results):

    | Model | Ave | MME |
    | --- | --- | --- |
    | Qwen2-VL-7B | 54.35 | 1679.52 |
    | Qwen2-VL-7B (Ours) | 56.25 | 1681.27 |
    | Qwen2.5-VL-7B | 55.47 | 1685.14 |
    | Qwen2.5-VL-7B (Ours) | 58.52 | 1701.87 |
    | LLaVA-V1.6-7B | 47.56 | 1441.89 |
    | LLaVA-V1.6-7B (Ours) | 48.71 | 1420.57 |
    | LLaMA-3.2-11B | 42.13 | 1421.71 |
    | LLaMA-3.2-11B (Ours) | 43.82 | 1430.67 |

    These results confirm that our method improves fine-grained perception capabilities broadly, rather than simply adapting models to synthetic artifacts.

  3. Careful design to avoid domain artifacts: To further avoid domain leakage:

    • The MED-Real set is strictly held out from any fine-tuning.
    • Our instruction pipeline includes both open-domain instructions and domain-agnostic edit types.
    • We apply manual verification to ensure semantic correctness and diversity in both real and synthetic settings.

Together, these points support that our performance gains reflect true improvements in fine-grained reasoning, not merely domain adaptation. We will clarify this in the final version.

Response to Q2: Benchmark Robustness to Text-Based Artifacts

We understand the reviewer’s concern that textual phrasing in the dataset might cause false negatives—for example, if a model selects the most accurate visual description, but the correct answer is tied to wording about the captions themselves.

We emphasize that such risks are fully mitigated through two key measures:

  1. All benchmark questions are rewritten to remove references to text or caption structure.

  2. Manual verification ensures that each question is answerable by observing the image alone, without needing to reason about how something was “described” or “mentioned.”

    In short, although textual artifacts may appear in internal steps of our pipeline, they are completely eliminated from the final evaluation format. We will clarify this in the revision and include an example in the appendix illustrating this transformation.

Comment

Dear Reviewer ysLX,

Your expert feedback is crucial to refining this work. While we fully understand the discussion period may pose challenges for your schedule, we would value the chance to clarify any final points with you prior to its conclusion on Aug 8.

We hope we've been able to address your questions and concerns so far. We would be glad to address any further concerns you may have, and we will try our best to clarify promptly.

Thank you again for your feedback and comments; they were really helpful!

Warm Regards, Authors of Submission #9063

Comment

Thank you to the authors for their detailed analysis of my concerns. I believe that clarifying the performance gap across domains and the ablations on design choices will strengthen the presentation of the final paper. The expanded analysis for Reviewer K4PJ is also appreciated.

Comment

Thank you for your time and valuable feedback! We sincerely appreciate your effort in reviewing our work and are grateful for your positive assessment. Your insights have helped strengthen the paper, and we will incorporate them to further refine the final manuscript.

Final Decision

Summary: Introduces MED (50K minimally edited pairs) + MED-Bench, and an SFT feature-consistency objective to improve fine-grained difference detection and reduce hallucinations. Also gives gains on POPE/MMStar/BLINK/MME vs strong baselines.

Strengths: timely focus on fine-grained hallucination, carefully filtered dataset/benchmark with manual checks, and empirical gains with released code/prompts/configs.

Weaknesses: complex multi-model pipeline with an attendant reproducibility burden; small real-image set and initially narrow model coverage.

(d) clear dataset+benchmark contribution and an effective lightweight training signal that measurably reduces hallucination across tasks/models.

(e) Discussion/rebuttal summary: reviewers pressed on pipeline complexity, instruction/annotation quality, model diversity, and caption artifacts; authors responded to address many concerns.