DocVXQA: Context-Aware Visual Explanations for Document Question Answering
Abstract
Reviews and Discussion
The paper introduces DocVXQA, a method that generates visual explanations (in the form of a mask) highlighting the parts of a document that are relevant for OCR-free document question answering. DocVXQA builds on the Pix2Struct model and learns a mask that, when combined with the input image, must lead the document QA model to produce the same answer, while optimizing for mask minimality. A prior is also provided by ColPali (which yields relevance scores for text-image matching) to improve the mask. Experiments are conducted on the DocVQA dataset using a fine-tuned Pix2Struct model. The method is evaluated with accuracy and ANLS (to measure answer quality) and pixel ratio (to measure mask minimality). A user study is also conducted.
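As a schematic reading of this summary (not the paper's exact formulation; the symbols and weights below are illustrative), the learned mask can be thought of as trading off three terms:

```latex
% Schematic only -- symbols and weighting are illustrative, not the paper's notation.
% M: predicted relevance mask, X: document image, Q: question, A: target answer,
% f: the DocVQA model (Pix2Struct), R: ColPali text-image relevance scores.
\mathcal{L}(M) =
\underbrace{\mathcal{L}_{\mathrm{QA}}\!\left(f(M \odot X, Q),\, A\right)}_{\text{sufficiency: masked input must still yield the answer}}
+ \lambda_{\mathrm{min}} \underbrace{\lVert M \rVert_{1}}_{\text{minimality}}
+ \lambda_{\mathrm{ctx}} \underbrace{\mathcal{L}_{\mathrm{TI}}(M, R)}_{\text{context prior from ColPali}}
```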
update after rebuttal
The rebuttal clarifies several things and answers my questions. I am now more inclined to recommend acceptance of the paper.
Questions to Authors
NA
Claims and Evidence
The explanations provided by the model are useful for users to verify the correctness of an answer by comparing it to the explanations. However, they do not genuinely explain the inner workings of the original model, Pix2Struct. Instead, they may give persuasive narratives that do not necessarily reflect the model's decision-making process, potentially leading to misleading interpretations. In the reviewer's opinion, the final model therefore cannot be considered fully "self-explanatory". The reviewer recommends adjusting the wording throughout the paper to prevent any misinterpretation regarding the model's explanatory capabilities. This concern is further supported by the design of the user study, which focuses on users' subjective satisfaction with the explanations rather than evaluating whether the explanations genuinely reveal the model's reasoning process or help users better understand the model (see "What I Cannot Predict, I Do Not Understand: A Human-Centered Evaluation Framework for Explainability Methods", Colin et al., NeurIPS 2022).

The statement at the end of the introduction ("our method is inherently model-agnostic, allowing seamless integration with various DocVQA architectures") is not adequately supported by the experiments. Providing empirical evidence would make this claim more convincing.
The encoder of Pix2Struct is applied to original documents and to masked ones. Since it was pretrained only on original documents, the reviewer suspects that a domain gap occurs when it is applied to the masked ones, hence the low utility scores when only the sufficiency and minimality losses are included. Still, as illustrated in this ablation, the answer is generally visible, so it should be easy to produce the correct one, even without all the context that is integrated back thanks to the ColPali prior.
The paper does not explain why the decoder is kept frozen during training, while other parts (e.g., the encoder) are learned. What is the motivation here? Were other choices explored?
Methods and Evaluation Criteria
The proposed metrics of Table 2 do not seem sufficient to fully evaluate the method. For instance, using ColPali to obtain more context-aware explanations will increase the pixel ratio without increasing utility. From this set of metrics alone, it looks as though the integration of ColPali is detrimental, even though it is reasonable and well motivated.
Theoretical Claims
There are no theoretical claims
Experimental Design and Analysis
Yes.
Supplementary Material
Briefly read
Relation to Existing Literature
NA
Missing Essential References
The related work section is rather well written, except that the organization of Sections 2.2 and 2.3 is unclear, with some overlap that appears arbitrary. A clearer distinction or restructuring would improve the flow of ideas.
Other Strengths and Weaknesses
Strengths:
- The paper's goal of explaining document VQA models is sound and important for the community.
- The proposed model is reasonable, the loss components are well motivated.
- The paper is rather well written and motivated
- The user-study was appreciated.
Weaknesses:
- The proposed model yields results that are very close to the ColPali + Pix2Struct baseline. The empirical improvement is marginal, while the method is much more complex (and has to be trained).
Other Comments or Suggestions
- The mandatory “Impact Statement” section is missing
- The itemizations L.294-309 and L.437-400 do not follow the template format, as the bullets are positioned in the margin.
Typos:
- “are are” L.99
- “aspect the ratios” L.73
- “i. e.,” L138
- “Right” L.877
- L.284, a space is missing before the citation
- The use of bold in Table 1 is inconsistent
- “Section 4” L.314 (x2) - “Section 4.3” L.377 - “Section 4.3” L.409 → Please revise LaTeX labels / references.
- In Table 2, mismatch between “I” and “TI”
- “were each shown”, L.428
The authors sincerely appreciate Reviewer Y52T’s constructive feedback. We are pleased that the reviewer recognized the importance of explainability in Document VQA and the strong motivation behind our loss components. We also appreciate the positive remarks on the clarity and coherence of our writing and the value of our user study.
We have carefully addressed all the points raised in the following, and we are grateful for the detailed identification of typos and missing content, which will be corrected in the revised manuscript.
-
"Self-explanatory" clarification: We clarify that our intention is not to suggest that we reveal the DocVQA model internal workings. Rather, we use "self-explanatory" following established usage in explainable AI literature (Mersha et al., Neurocomputing'24), (Gautam, Boubekki et al., NeurIPS'22), describing models inherently designed to produce interpretable outputs during inference (L158–162, left). This contrasts with post-hoc methods, which retrospectively assess relevance between inputs and outputs on frozen models. We fully accept the reviewer’s recommendation and will clearly specify that self-explanatory refers exclusively to "the model’s capacity to produce interpretable outputs by design," not exposing internal workings.
-
Model-agnostic claim is not supported: As we also commented in response to reviewer ovki, our claim of a model-agnostic nature relies on the independent, modular design of the explanation mechanism. Although currently integrated with Pix2Struct, its components, such as a learnable mask head, three loss terms encoding the explainability principles, and a postprocessing pipeline, are architecture-independent. To empirically validate this, we are currently integrating our approach with Donut (Kim et al., ECCV'22), and results will be provided in the final manuscript.
-
The encoder was pretrained only on original documents, so the reviewer suspects a domain gap when it is applied to masked ones; with S+M, the answer is generally visible, so it should be easy to produce the correct one, even without the ColPali prior: The reviewer points out low utility scores in the sufficiency and minimality ablation (S+M) despite visible answers in masked documents. We respectfully disagree. Although answers are sometimes visible, this is not consistently true. The domain gap noted by the reviewer is also reduced by finetuning on both original and masked inputs. We identify aggressive masking as the primary factor lowering utility scores, enforcing minimality at the cost of sufficiency. This lack of context impairs predictions, highlighting the importance of integrating the ColPali prior through token interactions (Table 2).
-
The paper does not explain why the decoder is kept frozen: We freeze the decoder to preserve its pretrained text-generation capabilities and maintain stable training outputs. Since the decoder is pretrained on original, unmasked documents, jointly finetuning it with the encoder and mask head introduces distributional shifts, causing instability. Instead, we focus the learning effort on the encoder and the new mask head. Note that we did try unfreezing the decoder, but it led to unstable optimization, slower convergence, and reduced overall performance. We will add these details to the final manuscript.
-
Using ColPali to have more context-aware explanations will increase pixel ratio without increasing utility: Table 2 shows that using only the sufficiency and minimality losses (S+M) yields an accuracy of 0.19 and an ANLS of 0.36. Integrating ColPali through token interactions (S+M+TI) significantly improves these metrics to an accuracy of 0.38 and an ANLS of 0.54. Although ColPali increases the pixel ratio, this reflects the preservation of essential contextual cues rather than irrelevant areas. Practically, this additional context is critical for robust predictions, accuracy, and mask-learning guidance. Qualitative human evaluations further confirm that ColPali-enhanced explanations are clearer and more context-aware, supporting a balanced trade-off between minimality and sufficiency.
-
The proposed model yields results that are very close to the ColPali + Pix2Struct baseline: Figure 3 illustrates that relying solely on quantitative metrics is insufficient for a comprehensive evaluation. Although our results appear quantitatively similar to ColPali+Pix2Struct, they are qualitatively better. Following the reviewer’s suggestion, we conducted a human preference study comparing masks from both methods. In total, 12 participants evaluated 21 randomly selected question-answer pairs (252 trials). Our method was preferred in 163 trials (64.7%; 95% CI [58.4%, 70.6%], p << 0.001), with all participants (12/12) favoring our approach overall. Thus, our method is close to ColPali+Pix2Struct quantitatively, but it offers significantly more compact and interpretable explanations. This new study will be added to the revised manuscript.
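For reference, these preference numbers are consistent with a standard two-sided binomial test against a 50% chance level; the minimal sketch below (assuming SciPy is available) is only an illustration of that check, not the study's analysis code:

```python
from scipy.stats import binomtest

# 163 of 252 trials preferred the proposed method over ColPali+Pix2Struct.
result = binomtest(k=163, n=252, p=0.5, alternative="two-sided")

print(f"preference rate: {163 / 252:.1%}")        # about 64.7%
ci = result.proportion_ci(confidence_level=0.95)  # Clopper-Pearson interval
print(f"95% CI: [{ci.low:.1%}, {ci.high:.1%}]")   # close to the reported [58.4%, 70.6%]
print(f"two-sided p-value: {result.pvalue:.1e}")  # well below 0.001
```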
Thank you for the interesting rebuttal which clarifies several things and answers my questions. I am now more inclined to recommend acceptance of the paper.
We thank reviewer Y52T again for their valuable feedback, which we believe will help improve the quality and clarity of our paper. We are grateful for the time invested in reviewing the manuscript and the rebuttal, and appreciate the updated score and the recommendation that it be accepted.
- This paper presents DocVXQA, a novel self-explainable framework for document question answering that learns to provide context-aware visual explanations. It builds on the Pix2Struct model and incorporates a learnable mask to enhance transparency. The approach is based on the information bottleneck principle and integrates additional pretrained models for better generalization. Experimental results show the effectiveness of the method in terms of both accuracy and interpretability.
- The research question is important as it addresses the lack of transparency in current DocVQA models, which is crucial in high-stakes domains.
- The key methodological innovation is the use of a learnable mask to generate context-aware explanations, which is formulated based on the information bottleneck principle.
- The main findings are that the proposed method outperforms baseline techniques in terms of both accuracy and interpretability, and that the generated explanations are more context-aware and human-understandable.
update after rebuttal
Dear authors,
These questions were written by a real human, not generated by an LLM. If you feel they are LLM-generated and choose not to answer them, then I have to respect your decision. But I will not change my comments or my evaluation.
Questions to Authors
(1) The paper could discuss the limitations of the proposed method and potential future research directions in more detail. How does the proposed method handle complex or ambiguous questions? What are the potential limitations of the proposed method in handling different types of documents?
(2) The paper could provide more examples of the generated explanations to better illustrate the effectiveness of the method.
(3) The paper could discuss the potential applications of the proposed method in different domains. What are the potential applications of the proposed method in different domains? How does the proposed method handle different types of data and tasks?
(4) The paper could provide more details on the scalability of the proposed method. How does the proposed method handle noisy or incomplete data? What are the potential challenges in deploying the proposed method in practice?
(5) The paper could provide more details on the relationship between the proposed method and other self-explainable models (e.g., Dib-x). How does the proposed method compare to other state-of-the-art methods in terms of performance and interpretability?
Claims and Evidence
(1) The proposed DocVXQA framework is a novel approach to visually self-explainable document question answering. (2) The use of the information bottleneck principle to formulate explainability principles is a theoretically sound approach. (3) The integration of additional pretrained models to enhance the generalizability of the explanations is a practical and effective strategy. (4) The extensive experimental evaluations, including human evaluation, provide strong evidence of the effectiveness of the method.
Methods and Evaluation Criteria
Please read Claims and Evidence.
Theoretical Claims
Please read Claims and Evidence.
Experimental Design and Analysis
Please read Claims and Evidence.
Supplementary Material
The supplementary material was read.
Relation to Existing Literature
N/A
Missing Essential References
N/A
Other Strengths and Weaknesses
N/A
Other Comments or Suggestions
N/A
First, it introduces DocVXQA, a novel self-explainable framework that not only answers questions about documents but also provides visual explanations highlighting relevant regions that justify the answers. Second, it quantitatively formulates explainability principles (sufficiency and minimality) as explicit learning criteria using an information bottleneck approach, creating a balance between contextual relevance and representation efficiency. Third, it proposes an end-to-end training strategy that maintains compatibility with existing DocVQA architectures while adding minimal architectural changes.
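The information-bottleneck formulation mentioned here can be sketched, in generic IB notation (a textbook-style reading, not the paper's exact objective), as compressing the masked document while preserving answer-relevant information:

```latex
% Generic information-bottleneck sketch (not the paper's exact objective).
% Z = M \odot X is the masked document, A the answer, Q the question.
\min_{M} \; \underbrace{I(X;\, Z)}_{\text{minimality (compression)}}
\;-\; \beta\, \underbrace{I(Z;\, A \mid Q)}_{\text{sufficiency (answer-relevant information)}},
\qquad Z = M \odot X .
```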
Final recommendation and Justification (post rebuttal)
The authors have provided substantial details in response to my comments, particularly by incorporating new datasets to address concerns around generalization. I am changing my rating to weak accept and expect that the authors include these additional details and results in the final version.
Questions to Authors
- How does the method perform on documents with complex layouts or multiple tones of writing?
- What are the computational costs compared to baseline approaches?
- Can the approach be extended to other vision-language understanding tasks?
- How robust is the method to variations in document image quality?
- What are the limitations in handling highly complex or densely populated documents?
Claims and Evidence
The submission makes several key claims:
- DocVXQA provides context-aware visual explanations for document question answering, going beyond simply highlighting answer locations.
- The framework delivers explanations that are both contextually sufficient yet representation-efficient, creating a balance between comprehensive context and minimizing irrelevant information.
- The method outperforms baseline explanation techniques in both utility (accuracy/ANLS) and explainability (clarity and conciseness).
- Human evaluators prefer DocVXQA's explanations for their context-awareness and clarity.
Methods and Evaluation Criteria
The proposed methods make sense for explaining document question answering. The approach balances showing enough information to understand the answer while keeping explanations simple. The testing approach has some weaknesses: the authors should have tested with more people and different types of documents. Also, the large drop in accuracy when using explanations suggests this might not work well in real-world settings where getting the right answer is critical.
Theoretical Claims
The paper makes no theoretical claims
Experimental Design and Analysis
- The paper fails to justify why they use different thresholds for their method compared to previous approaches. This inconsistency raises questions about fair comparison methodology.
- The paper lacks a proper ablation study showing how ColPali and Pix2Struct perform individually, making it difficult to assess the actual contribution of their proposed integration.
- There is a confusing presentation in the results tables where higher Accuracy is stated as better, yet the baseline numbers in the top row appear to outperform many of the tested approaches.
- The authors inexplicably use different evaluation metrics than those presented in the original ColPali and Pix2Struct papers, hindering direct comparison with established benchmarks and raising concerns about cherry-picking favorable metrics.
- The results section is particularly difficult to follow since none of the tables are properly referenced in the text, leaving readers to guess the relationships between claims and supporting evidence.
- The paper fails to demonstrate generalizability by not evaluating performance on standard tasks used in previous works, such as Screenshot Parsing Pretraining, AI2D, and Screen2Words, which were core evaluation benchmarks in related literature.
Supplementary Material
I reviewed the supplementary material, which provides valuable additional insights on several aspects of the work. The appendix details the implementation specifics, including a thorough explanation of the postprocessing methodology that enhances the clarity and interpretability of the visual explanations. It also contains helpful visualizations of the mask generation and refinement process, clearly illustrating how the system progressively improves its explanations. The hyperparameter settings are well-documented, enabling reproducibility of the results. Particularly useful were the additional qualitative results demonstrating the effects of token interactions, which help clarify how the model avoids overfitting to answer regions and generates more context-aware explanations.
Relation to Existing Literature
DocVXQA builds upon several key research areas. It extends beyond simple attention visualization techniques (Chefer et al., 2021, https://arxiv.org/abs/2103.15679) by developing contextually sufficient explanations rather than just highlighting answer locations. The token interaction approach connects to knowledge distillation in vision-language models (Faysse et al., 2024, https://arxiv.org/abs/2407.01449). While innovative, the work misses opportunities to relate their challenges with mutual information estimation to similar challenges in explainable AI research and would benefit from more rigorous human evaluation protocols (Doshi-Velez & Kim, 2017, https://arxiv.org/abs/1702.08608).
Missing Essential References
I am not aware of any missing references.
Other Strengths and Weaknesses
Strengths:
The paper's strengths lie in its novel approach to a significant problem in document AI - the lack of transparency in document visual question answering. The paper effectively identifies and addresses the limitations of conventional attention-based explanations that merely highlight answer regions without proper context. The quantitative formulation of explainability principles (sufficiency and minimality) provides a sound theoretical foundation.
Weaknesses:
First, the performance degradation from unmasked (56% accuracy) to masked models (38% accuracy) represents a substantial trade-off that undermines real-world applicability. Second, the heavy reliance on post-processing (background removal, bounding boxes) obscures how much improvement comes from the learning approach versus these enhancement techniques. Fourth, the human evaluation lacks statistical rigor with 26 participants evaluating only 10 examples each.
Other Comments or Suggestions
Typo on Line 99: redundant "are".
The authors thank Reviewer 9NT1 for their feedback. We appreciate the recognition of our approach’s novelty in addressing the challenge of transparency in DocVQA systems, and the acknowledgment that our approach provides a strong theoretical foundation through the quantitative formulation of explainability principles. Below, we address all of 9NT1’s concerns.
-
What is the performance on documents with complex layouts or multiple tones of writing? What are the limitations in handling highly complex or densely populated docs?: In the first case, we found that our model still performs well and produces good explanations. In densely populated documents, our model can provide fragmented explanations, which are hard to interpret. See qualitative results here: https://shorturl.at/nstj0. We will discuss these results in the final manuscript.
-
What are the computational costs compared to baseline approaches?: Our method requires additional training compared to post-hoc approaches. However, at inference time, the additional computational cost is only the forward pass through the small mask-proposal head (591,361 parameters). All baseline comparisons are performed with respect to the Pix2Struct base model with 282M parameters, aside from ColPali, which itself uses 2.9B parameters.
-
Can the method be extended to other vision-language understanding tasks? Why use different metrics than ColPali and Pix2Struct?: The masking-based method can readily be generalized to other vision-language tasks. Nevertheless, the primary goal of our work is to provide explanations tailored to the DocVQA task, rather than to make Pix2Struct itself explainable. Regarding the metrics, we clarify that our specific focus is on DocVQA, so retrieval metrics such as NDCG (used in ColPali) are not applicable. We adopt the standard DocVQA metrics that are also used by Pix2Struct for QA tasks.
-
What is the relation of your work with mutual information estimation?: We appreciate the reviewer’s reference to Doshi-Velez and Kim (2017). Our core motivation closely corresponds with both the data-driven methodology for uncovering interpretability factors (Section 4.1) and the functionally-grounded evaluation outlined in Section 3.3. We concur with the suggestion of adopting an application-grounded evaluation framework; we plan to apply our methodology to domain-specific DocVQA tasks in future research.
-
How robust is the method to variations in document image quality, and why is there a degradation of performance on masked images?: SPDocVQA contains some images of poor quality, and our model is robust to these variations; some qualitative results are here: https://shorturl.at/mqbzg. Regarding the performance degradation from unmasked to masked models: the trade-off between accuracy and explainability is a well-documented phenomenon (Crook et al., REW'23; Bell et al., ICPS'22), and several approaches have been proposed to balance this trade-off (Luo et al., BJR Open'19; Mori et al., ESE'23). As illustrated in Fig. 4, lowering the threshold decreases explanation quality but improves utility. This shows that the observed performance gap is an inherent constraint of explainability rather than a fundamental limitation of our approach. We also want to clarify that 38% is the accuracy on masked images; on clear images, the accuracy is 51%.
-
Why use different thresholds, and why is there a heavy reliance on post-processing?: We adjust the thresholds so that each technique operates at its optimal setting, ensuring a fair comparison between competing methods. The post-processing steps are primarily intended to enhance the human interpretability of the masks and do not significantly affect the core performance of our model. As shown in Figure 4, the observed gains are primarily attributable to the underlying learning approach rather than to the post-processing.
-
The paper lacks a proper ablation study showing how ColPali and Pix2Struct perform individually: We would like to clarify that it is impossible to evaluate each technique's individual impact, as they are designed to be used only in combination for XDocVQA.
-
The baseline numbers in the top row appear to outperform many of the tested approaches: We would like to clarify that these are not baselines but rather upper bounds (unmasked, no explanation). We will make this distinction clear in the paper.
-
Human evaluation lacks statistical rigor with 26 participants evaluating only 10 examples: It is challenging to include more documents because we would have to redo the study, but we have increased the number of participants to 42 and found that the results remain consistent, as seen in this table:
| Method | Context ↑ | Clarity ↑ |
|---|---|---|
| Raw Attn (0.25) | 2.90 ± 0.48 | 2.75 ± 0.42 |
| Raw Attn (0.50) | 2.26 ± 0.54 | 2.34 ± 0.60 |
| ColPali+Pix2Struct | 3.97 ± 0.25 | 3.02 ± 0.42 |
| Ours (0.7) | 4.49 ± 0.26 | 3.56 ± 0.41 |
The rebuttal does not adequately address several key concerns—most notably point 3 on generalization and robustness to other vision-language tasks. Critical experimental gaps need to be addressed in the paper (points 1, 2, 5, 9). Additionally, the writing still needs work for clarity and flow (points 4, 6, 7, 8), affecting the readability of the core contributions.
While a major revision is likely needed for acceptance, the paper could still make progress by running experiments on other vision-language tasks, and adding essential qualitative and ablation results to better support their claims. The paper could be improved by adding datasets like InfographicVQA and PFL-DocVQA (ovki-1). Additionally, the model-agnostic claim remains unsubstantiated (ovki-2), since all evaluations are still centered on a single backbone (Pix2Struct). Adding results from other backbones would strengthen the paper.
We thank reviewer 9NT1 for the continued engagement and feedback. Below, we provide detailed responses to the remaining concerns.
-
The writing still needs work for clarity and flow: In the final version, we will thoroughly revise the manuscript to address this concern. This will include restructuring paragraphs to ensure a more logical progression of ideas, refining the language and transitions to enhance readability, and incorporating new experimental results and insights that emerged during the rebuttal process.
-
The paper could still make progress by running experiments on other vision-language tasks: We deliberately restrict our study to DocVQA, a downstream task within the broader domain of Document Intelligence (DI), which also includes document layout analysis, OCR, summarization, etc. Our focus on DocVQA is driven by its natural compatibility with saliency-map-based explanation methods. In this task, highlighting semantically relevant regions directly facilitates the question-answering process and enhances interpretability in a manner that is both intuitive and practically effective. Although the proposed approach could, in principle, be extended to other DI or vision-language tasks, each of these introduces distinct challenges that demand task-specific adjustments, especially in the design of the token-interaction loss. Designing a unified explainability framework that accommodates the diverse contextual and semantic characteristics of multiple downstream tasks represents a significant and separate research effort. Accordingly, we intentionally confine our scope to DocVQA to provide a more focused and rigorous contribution. Future work could build upon our findings to generalize across the broader vision-language landscape. A further reason for our choice lies in DocVQA's practical significance. It is already being deployed in high-stakes domains such as banking, healthcare, and public administration, where automated systems are used to extract information from documents with the goal of minimizing human intervention. In such critical settings, ensuring the trustworthiness and interpretability of model outputs is of immediate and pressing importance.
-
The paper could be improved by adding datasets like InfographicVQA and PFL-DocVQA: Following this suggestion, we performed additional experiments with new datasets. First, PFL-DocVQA (Tito et al., ICDAR'24), where despite the more complex layouts and varying document visual quality (low resolution), our method performs consistently. Table 1 presents results on this benchmark. Compared to the baseline, our method achieves a significantly higher Accuracy (0.43) and ANLS score (0.66). In addition, the Pixel Ratio is drastically reduced to 0.22, indicating that our model identifies and relies on a substantially smaller subset of the input for its predictions. Although the overall utility is lower than the non-explainable unmasked performance, this trade-off is consistent with our design goal of prioritizing interpretability and minimal-evidence reasoning. Some qualitative results using this dataset can be found here: https://anonymous.4open.science/r/ICML_Rebuttal3-68DE/README.md. Second, we experimented with InfographicVQA (Mathew et al., WACV'22); this dataset is also challenging, as it contains infographic data with very complex layouts, charts, and tables. However, we found that our model still performs well, as shown here: https://anonymous.4open.science/r/RebuttalICML2-4D0C/README.md. The full results will be added to the final manuscript.

| Method | Mask Threshold | ACC | ANLS | Pixel Ratio |
|---|---|---|---|---|
| Pix2Struct (Unmasked) | -- | 0.80 | 0.92 | 1.00 |
| Ours (Unmasked) | -- | 0.57 | 0.79 | 1.00 |
| ColPali+Pix2Struct | 0.5 | 0.28 | 0.39 | 0.18 |
| Ours | 0.7 | 0.43 | 0.66 | 0.22 |

Table 1: Results on PFL-DocVQA
-
Adding results from other backbones would strengthen the paper: In response to this concern, and in line with the suggestion from Reviewer ovki-2, we conducted additional experiments using Donut as an alternative backbone to Pix2Struct, in order to evaluate the model-agnostic nature of our approach. It is important to highlight that Donut and Pix2Struct differ significantly in how they process input: Pix2Struct renders the question directly onto the image, whereas Donut tokenizes the question and uses it to condition the decoder during generation. As a result, we made the necessary method-specific adjustments, modifying our mask head to also take the encoded question as input. Given the limited time during the rebuttal phase, the experiment is still running, but we have already obtained preliminary results that support the model-agnostic capability of our method. These can be found at this link: https://anonymous.4open.science/r/RebuttalICML4-065C/README.md. We are continuing to train and evaluate the model, and we will include the full results in the final manuscript.
This paper proposes DocVXQA, a novel self-explainable framework for Document Visual Question Answering (DocVQA), designed to not only answer questions from document images but also provide context-aware visual explanations via learned relevance heatmaps. The core contribution lies in integrating explainability criteria—namely sufficiency, minimality, and context-awareness—directly into the training process, using an additional mask-generation module guided by external pretrained vision-language retrieval models. Experimental evaluations on the DocVQA dataset demonstrate that DocVXQA effectively balances interpretability and prediction performance, surpassing several baselines in both quantitative metrics and human evaluations.
Questions to Authors
-
The experiments are performed only on the DocVQA dataset, so I cannot confirm the generalizability of your approach. Additional experiments on diverse datasets are needed.
-
The proposed approach was evaluated using only one model (Pix2Struct), so the claim of being "model-agnostic" is not supported. Additional experiments with other architectures would help validate this claim.
-
The performance drop (56%→51%) introduced by interpretability constraints is not sufficiently explained or analyzed. Additional analysis or experiments addressing this issue would strengthen the paper.
-
The choice of a fixed parameter (top-k=3 regions) in mask selection lacks experimental justification. Providing sensitivity analysis for different values of k is recommended.
Claims and Evidence
Most claims in the submission are supported by clear and convincing evidence. Specifically, the authors convincingly demonstrate that their proposed DocVXQA framework improves interpretability for document question-answering through both quantitative (accuracy, ANLS metrics, pixel ratio) and qualitative (human evaluation) experiments.
However, two key claims lack sufficient evidence:
-
The paper claims that the proposed method is inherently model-agnostic, capable of integrating easily into various DocVQA architectures. However, the authors only demonstrate this with the Pix2Struct architecture, providing no experimental validation on other models. Without such validation, this claim remains theoretical and unsupported by evidence.
-
The authors implicitly present DocVXQA as a general framework suitable for broader document visual question-answering scenarios. However, all experimental evaluations were performed on a single dataset (DocVQA). Without evaluation on multiple datasets or diverse document styles, the evidence provided does not sufficiently support the claim of broad generalizability.
Methods and Evaluation Criteria
The proposed methods and evaluation criteria are generally appropriate for the problem. The mask-generation strategy and the use of sufficiency, minimality, and context-aware loss terms effectively address the interpretability objectives in Document VQA. The evaluation metrics (accuracy, ANLS, pixel ratio, and human evaluation) are suitable for measuring both interpretability and prediction quality.
However, the evaluation is limited by relying solely on the DocVQA dataset, raising questions about the method's generalizability to other document types or scenarios. Expanding evaluations to additional diverse datasets would strengthen confidence in the broader applicability of the approach.
Theoretical Claims
The paper does not present complex theoretical proofs. The theoretical framing provided—such as the formulation of the explainability objectives based on the Information Bottleneck principle and mutual information—is straightforward, well-established, and reasonably applied. I reviewed these theoretical formulations and found them conceptually sound, without identifying any significant issues.
Experimental Design and Analysis
The experiments presented are generally valid and sound. The selection of evaluation metrics (accuracy, ANLS, pixel ratio) and comparison baselines (raw attention, attention rollout, Grad-CAM, ColPali-based retrieval) are appropriate for the DocVQA task and clearly aligned with the authors' interpretability objectives.
However, several issues are noted:
All experiments were conducted on only one dataset (DocVQA). The absence of experiments on additional datasets raises concerns about the generalizability of the results.
The authors observed a performance drop (56% → 51%) due to interpretability constraints but did not provide sufficient analyses or discussions exploring the underlying reasons or possible approaches to alleviate this trade-off. Addressing this point explicitly would further strengthen the experimental analysis.
Additionally, the authors use a fixed post-processing parameter (selecting the top-k=3 regions) for generating the explanation masks. However, the rationale behind choosing k=3 is not clearly explained or experimentally justified. Including sensitivity analyses or experiments to demonstrate how different values of k affect the interpretability and accuracy would strengthen the validity of this choice.
Supplementary Material
The supplementary material includes implementation details (such as mask-generation architecture and hyperparameters) and additional qualitative examples. These details effectively support and clarify the methodological decisions and analyses presented in the main paper.
Relation to Existing Literature
The paper relates clearly to existing literature on document VQA and explainable AI, specifically extending OCR-free approaches (e.g., Pix2Struct) and self-explainable methods based on information theory. Unlike post-hoc techniques (e.g., Grad-CAM), this work integrates interpretability directly into model training, contributing a novel, practical approach to improve transparency in document understanding.
Missing Essential References
All essential related works appear to be appropriately cited and discussed in the paper.
Other Strengths and Weaknesses
The main strengths and weaknesses have been discussed in earlier sections. In summary, the paper's key strengths include its clear originality in integrating interpretability into document VQA and its strong motivation addressing real-world transparency needs.
However, there are four main weaknesses that should be addressed:
Experiments were conducted on only one dataset (DocVQA), limiting evidence of generalizability to other datasets or document scenarios.
The proposed method was evaluated using only one model architecture (Pix2Struct), which leaves the "model-agnostic" claim unsupported by empirical evidence.
The observed accuracy drop (from 56% to 51%) resulting from enhanced interpretability lacks thorough analysis or discussion.
The choice of a fixed parameter (Top-k=3 regions) in post-processing lacks experimental justification or sensitivity analysis to confirm its optimality.
Addressing these issues would strengthen the overall contribution and reliability of the findings.
Other Comments or Suggestions
Fix minor typos, such as:
- Line 073: "preserve aspect the ratios" (should be "preserve the aspect ratios")
- Line 099: "models are are reaching good performances" (duplicate "are")
- Line 103: "text is extrtacted with OCR" (should be "extracted")
The authors sincerely appreciate Reviewer ovki’s insightful comments. We are pleased that the reviewer recognized the paper’s key strengths, mentioning its originality in integrating interpretability into DocVQA, its effective balance between interpretability and prediction performance, and the "conceptual soundness" of our theoretical formulations. We have carefully considered all suggestions and made a dedicated effort to address each point thoroughly.
-
I cannot confirm the generalizability of your approach. Additional experiments on diverse datasets are needed: We have extended our evaluation beyond the standard DocVQA benchmark SPDocVQA (Mathew et al., WACV'21) by incorporating two additional datasets: InfographicVQA (Mathew et al., WACV'22) and PFL-DocVQA (Tito et al., ICDAR'24). These datasets pose greater challenges due to their complex and heterogeneous visual-textual structures, providing a more rigorous test of our model’s generalization capabilities. Some qualitative results on InfographicVQA are available here: https://anonymous.4open.science/r/RebuttalICML2-4D0C/README.md. The new results demonstrate our model’s ability to handle these visually complex documents. As seen in the examples, our approach yields promising results, indicating strong potential for generalization. We will provide the full results on these new datasets in the final manuscript.
-
The proposed approach was evaluated using only one model (Pix2Struct), so the claim of being "model-agnostic" is not supported: As we also commented in response to reviewer Y52T, our claim of a model-agnostic nature relies on the independent, modular design of the explanation mechanism. Although currently integrated with Pix2Struct, its components, such as a learnable mask head, three loss terms encoding the explainability principles, and a postprocessing pipeline, are architecture-independent. To empirically validate this, we are currently integrating our approach with Donut (Kim et al., ECCV'22), and results will be provided in the final version of the paper.
-
The performance drop (56%→51%) introduced by interpretability constraints is not sufficiently explained or analyzed: It can be attributed to the additional double-loop process employed by our approach. Even when the mask is set to all ones, so that the masked input is identical to the original image, the model still performs two passes through the encoder: one for mask generation and another for the final prediction. Consequently, during backpropagation, gradients from both passes are accumulated. In theory, if both passes were identical, this would be equivalent to a single pass. However, in practice, non-linearities, normalization layers, and other factors introduce slight variations between the two passes. These differences can cause over-accumulation or interference in the gradients, perturbing the weight updates relative to the original training regime. Such deviations, along with potential numerical instabilities from the extra pass, likely contribute to the modest performance drop observed from 56% to 51%.
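To make the double-pass argument concrete, the toy sketch below (module names and sizes are invented for the example, not the actual DocVXQA code) shows how a shared encoder accumulates gradients from both the mask-generation pass and the prediction pass, even with the decoder frozen:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real components (purely illustrative).
encoder = nn.Linear(16, 16)    # shared encoder, used in both passes
mask_head = nn.Linear(16, 16)  # small mask-proposal head
decoder = nn.Linear(16, 4)     # decoder kept frozen
for p in decoder.parameters():
    p.requires_grad_(False)

x = torch.randn(2, 16)              # "document" features
target = torch.randint(0, 4, (2,))  # "answer" labels

# Pass 1: encode the original input and propose a soft mask.
mask = torch.sigmoid(mask_head(encoder(x)))
# Pass 2: encode the masked input and predict with the frozen decoder.
logits = decoder(encoder(mask * x))

loss = nn.functional.cross_entropy(logits, target)
loss.backward()

# The encoder's gradient mixes contributions from both passes, which is the
# interference the rebuttal points to, even when the mask is close to all ones.
print(encoder.weight.grad.norm())
```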
-
The choice of a fixed parameter (top-k = 3 regions) in mask selection lacks experimental justification. Providing sensitivity analysis for different values of k is recommended: Following the reviewer’s suggestion, we conducted a sensitivity analysis on the hyperparameter k that determines the number of regions selected during mask postprocessing. The results are summarized in the table below. As the table indicates, increasing k leads to gradual improvements in both accuracy and ANLS, but it also results in a higher pixel ratio, which reflects a less minimal explanation. In terms of balancing utility and minimality, our experiments suggest that a small value of k (such as the default k = 3) offers a good trade-off; such values yield utility metrics comparable to higher settings while keeping the highlighted area compact. We will include the complete sensitivity analysis table in the Appendix of the revised manuscript.
| k | ACC | ANLS | Pixel_Ratio |
|---|---|---|---|
| 1 | 0.36 | 0.52 | 0.21 |
| 2 | 0.38 | 0.53 | 0.23 |
| 3 | 0.38 | 0.54 | 0.24 |
| 4 | 0.39 | 0.55 | 0.25 |
| 5 | 0.39 | 0.55 | 0.26 |
| 8 | 0.40 | 0.56 | 0.28 |
| 10 | 0.41 | 0.57 | 0.29 |
| 15 | 0.42 | 0.58 | 0.31 |
| 20 | 0.43 | 0.58 | 0.32 |
| 50 | 0.45 | 0.59 | 0.38 |
| 100 | 0.45 | 0.60 | 0.40 |
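For context, one common way to realize such a top-k region selection is to keep the k largest connected components of the thresholded heatmap; the rough sketch below illustrates that idea under assumed implementation details (function and parameter names are illustrative, not the authors' postprocessing code):

```python
import numpy as np
from scipy import ndimage

def keep_top_k_regions(heatmap: np.ndarray, threshold: float = 0.7, k: int = 3) -> np.ndarray:
    """Binarize a relevance heatmap and keep only its k largest connected regions."""
    binary = heatmap >= threshold
    labeled, num_regions = ndimage.label(binary)  # connected-component labeling
    if num_regions <= k:
        return binary
    sizes = np.bincount(labeled.ravel())[1:]      # component sizes, skipping background (label 0)
    keep = np.argsort(sizes)[::-1][:k] + 1        # labels of the k largest components
    return np.isin(labeled, keep)

# Example: random heatmap, keep the 3 largest regions above a 0.7 threshold.
mask = keep_top_k_regions(np.random.rand(64, 64), threshold=0.7, k=3)
print(f"pixel ratio of the kept mask: {mask.mean():.2f}")
```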
Thank you for the additional experiments. The explanation of the performance drop and the K sensitivity analysis are well addressed. The validation for the model-agnostic claim would benefit from complete quantitative results. Therefore I will maintain my current score of weak acceptance.
We thank reviewer ovki for the feedback. In this response, we present further results on a new dataset added during this second phase of the rebuttal. We also address ovki's remaining concern about the model-agnostic nature of our approach.
-
Additional experiments on diverse datasets are needed: In the first phase of the rebuttal we showed some results with InfographicVQA (Mathew et al., WACV'22). In this phase, we conducted further experiments using a new dataset, PFL-DocVQA (Tito et al., ICDAR'24). This dataset presents increased challenges due to its more complex document layouts and varied visual quality, such as low-resolution images. Despite these difficulties, our method demonstrates consistent performance. Table 1 presents results on the PFL-DocVQA benchmark. Compared to the best-performing baseline, our approach achieves significantly higher Accuracy (0.43) and ANLS (0.66), highlighting its improved explainability through more precise textual alignment. Moreover, the Pixel Ratio is notably reduced to 0.22, suggesting that our model identifies and relies on a considerably smaller subset of the input, reinforcing its ability to perform minimal-evidence reasoning. While the overall utility is lower than that of the unmasked, non-explainable setting, this trade-off aligns with our design goal of enhancing interpretability. These results underscore the effectiveness of our method in producing concise, human-interpretable rationales while maintaining meaningful predictive performance. Some qualitative results using this dataset can be found here: https://anonymous.4open.science/r/ICML_Rebuttal3-68DE/README.md

| Method | Mask Threshold | ACC | ANLS | Pixel Ratio |
|---|---|---|---|---|
| Pix2Struct (Unmasked) | -- | 0.80 | 0.92 | 1.00 |
| Ours (Unmasked) | -- | 0.57 | 0.79 | 1.00 |
| ColPali+Pix2Struct | 0.5 | 0.28 | 0.39 | 0.18 |
| Ours | 0.7 | 0.43 | 0.66 | 0.22 |

Table 1: Results on PFL-DocVQA
-
Additional experiments with other architectures would help validate this claim: In response to the remaining concern about the model-agnostic nature of our approach, we are conducting additional experiments using Donut (Kim et al., ECCV'22) as an alternative backbone to Pix2Struct, in order to validate our claim. It is important to highlight that Donut and Pix2Struct differ significantly in how they process input: Pix2Struct renders the question directly onto the image, whereas Donut tokenizes the question and uses it to condition the decoder during generation. As a result, we made the necessary method-specific adjustments, modifying our mask head to also take the encoded question as input. Given the limited time during the rebuttal phase, the experiment is still running, but we have already obtained preliminary results that support the model-agnostic capability of our method. These can be found at this link: https://anonymous.4open.science/r/RebuttalICML4-065C/README.md. We are continuing to train and evaluate the model, and we will include the full results in the final manuscript.
The LLM-generated review by FXJM was ignored by the authors and the AC.
The other three reviewers were more positive than negative. Reviewer ovki provided a substantive review. He asked similar questions, wanting experiments on more backbones and datasets. The authors provided these in the rebuttal, and he was satisfied on all counts.
Y52T was initially leaning towards rejection, but raised his score to weak accept after the rebuttal. He had concerns that the explanation model was not truly self-explanatory, or faithful to the inference model. The rebuttal addressed this well by clarifying terminology. He had a question about the performance drop on masked images, as did some of the others.
9NT1 provided a very thorough review, and maintained his rating of weak reject. The authors provided two rounds of responses with explanations and experimental results, which addressed most of his concerns. His primary issue is whether the method generalizes to other vision-language tasks, which is not a concern raised by the others. The AC does not agree that the method needs to work on other tasks in order to be published, and recommends acceptance.