PaperHub
Overall rating: 5.0/10 (Rejected; 4 reviewers; lowest 3, highest 6, std. dev. 1.2)
Individual ratings: 6, 6, 5, 3
Confidence: 4.0 | Correctness: 2.5 | Contribution: 2.3 | Presentation: 2.5
ICLR 2025

A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding

OpenReview | PDF
Submitted: 2024-09-26 | Updated: 2025-02-05

Abstract

Keywords

LLM, DocAI, Visually Rich Document Understanding, KIE

Reviews and Discussion

Review (Rating: 6)

This paper introduces Interleaving Layout and Text in a Large Language Model (LayTextLLM) for document understanding, which projects each bounding box to a single embedding and interleaves it with text, efficiently avoiding long-sequence issues while leveraging the autoregressive traits of LLMs. The authors also propose a tailored, completely self-supervised pre-training task, Layout-aware Next Token Prediction, that enhances the alignment between layout and textual data. LayTextLLM significantly improves performance on downstream document-related VQA and KIE tasks.

Strengths

1. This work is the first to employ a unified embedding approach that interleaves spatial layouts directly with textual data within an LLM.
2. The authors propose two tailored training tasks, (1) Layout-aware Next Token Prediction and (2) Shuffled-OCR Supervised Fine-tuning, to better elicit the model's generalizability on downstream tasks.
3. The experimental results are sound.

Weaknesses

1. The method adopts a P-LoRA structure, which was introduced by the existing work InternLM-XComposer2, limiting the novelty of this work.
2. The analysis of the impact of SLP training on the effectiveness of the model remains incomplete.

Questions

1. For the different training tasks mentioned, such as Layout-aware Next Token Prediction, how many tokens of training data were used for each?
2. Why does the chosen LoRA structure work as expected? Do other LoRA variants have the same effect, and how were the ablation trials conducted?
3. Is there any analysis examining the impact on the Llama models' ability to understand language after incorporating spatial information?

Ethics Concerns

N/A

Comment

Thank you for taking the time to review our paper. We sincerely appreciate the reviewer’s recognition of our work, the thoughtful feedback, and the opportunity to address the concerns raised.

W1 - P-LoRA limits the novelty: The core contribution of this work lies in the SLP module, which effectively unifies layout and text, along with the tailored LNTP and SSFT training tasks, both of which were well received by reviewers. In contrast, the impact of P-LoRA on performance is relatively minor. As shown in Table 4, comparing the second and third rows reveals that adding P-LoRA increases the average accuracy by only 0.8% for KIE and VQA tasks. This improvement is significantly smaller than that from introducing SLP, as is evident when comparing the first and second rows of Table 4.

W2 - analysis of SLP is incomplete:

  • We will add a new experiment analyzing the impact of SLP on the model's language ability; see the response to Q3 for details.
  • We will provide a clearer description of the ablation studies in Section 4.6 to emphasize the effectiveness of SLP. Specifically, the first row of Table 4 represents the standard Llama2 model without SLP. This setup uses the same training data as the second row, including datasets such as DDD, Layout-aware SFT, DocVQA, InfoVQA, ChartQA, FUNSD, CORD, SROIE, and VisualMRC. Both rows are trained with standard SFT (no LNTP and no SSFT). The key distinction between the two rows lies in their inputs: the second row integrates layout bounding-box information using SLP, whereas the first does not. This comparison underscores the impact of SLP on document understanding performance.

Q1 token statistics: The token statistics for LNTP and SSFT using the Llama2 tokenizer are summarized below and will be included in the updated version.

| Dataset | CDIP | Docbank |
|---|---|---|
| LNTP (tokens) | 1,077,079,640 | 4,722,635,632 |

| Dataset | DocVQA | ChartQA | InfoVQA | VisualMRC | Funsd | CORD | SROIE | DDD | Layout-aware SFT |
|---|---|---|---|---|---|---|---|---|---|
| SSFT (tokens) | 17,621,621 | 1,053,052 | 11,544,753 | 1,624,822 | 1,024,236 | 1,140,453 | 1,201,376 | 148,145,979 | 212,643,094 |
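For reference, per-dataset token counts of this kind can be reproduced with a short script along the lines of the hedged sketch below; the tokenizer checkpoint and the toy inputs are illustrative placeholders, and actual dataset loading is omitted.

```python
# Hedged sketch of counting training tokens with the Llama2 tokenizer.
# The checkpoint name and the toy text list are placeholders, not the real training data.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def count_tokens(texts):
    """Sum the tokenized length of every sample in a dataset split."""
    return sum(len(tokenizer(t, add_special_tokens=False).input_ids) for t in texts)

print(count_tokens(["Invoice No. 12345", "Total: 169.80"]))  # replace with real training samples
```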

Q2 why P-LoRA works: As noted in the response to W1, while P-LoRA is effective, its impact is not particularly significant. In fact, in experiments conducted on an in-house dataset, we observed that when the training set is sufficiently large, P-LoRA becomes dispensable, as shown below. We believe that the effectiveness of parameter-efficient fine-tuning methods such as LoRA, P-LoRA, prompt tuning, and P-tuning diminishes as the training set size increases. In such cases, fully supervised fine-tuning emerges as the most effective approach.

| Model | F-score |
|---|---|
| LayTextLLM with P-LoRA | 61.16 |
| LayTextLLM w/o P-LoRA | 61.20 |

Q3 impact on language ability after incorporating SLP: Good question. We conducted an additional experiment to evaluate the language understanding ability of the models after integrating spatial information. Specifically, we measured perplexity on the Wikitext test set for the baseline and LayTextLLM models. Perplexity serves as an indicator of language understanding, with higher values suggesting a potential decline in language ability relative to the baseline (LLaMA2-7B). The results are shown in the table below (an illustrative sketch of this kind of evaluation is given after the discussion).

| Model | Perplexity |
|---|---|
| Llama2-7B | 7.4793 |
| LayTextLLM_zero | 11.0252 |
| LayTextLLM_vqa | 25.0701 |
| LayTextLLM_all | 12.56 |
  • LayTextLLM_zero: shows low perplexity, indicating that spatial information can be integrated with little effect on language modeling.
  • LayTextLLM_vqa: highest perplexity, likely due to the DocVQA and InfoVQA datasets, which emphasize document reasoning and shift the focus from general language to document-specific tasks.
  • LayTextLLM_all: perplexity very similar to LayTextLLM_zero and the baseline, again demonstrating that SLP has little impact on language ability.
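The sketch below shows one way such a Wikitext perplexity check could be run with a sliding window; the model identifier, the Wikitext configuration, and the window size are assumptions for illustration rather than the exact evaluation setup used here.

```python
# Hedged sketch of measuring Wikitext perplexity with a sliding window.
# The model name, dataset config, and window size are illustrative assumptions.
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # baseline; swap in a fine-tuned checkpoint to compare
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
).eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids

window, nll_sum, n_tokens = 2048, 0.0, 0
for start in range(0, ids.size(1), window):
    chunk = ids[:, start:start + window].to(model.device)
    if chunk.size(1) < 2:
        continue  # need at least one predicted token
    with torch.no_grad():
        loss = model(chunk, labels=chunk).loss  # mean NLL over the chunk's shifted tokens
    nll_sum += loss.item() * (chunk.size(1) - 1)
    n_tokens += chunk.size(1) - 1

print("perplexity:", math.exp(nll_sum / n_tokens))
```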

We will add this experiment in the updated version.

We kindly request your acknowledgement of our reply. We would greatly appreciate it if you would consider improving the rating. We look forward to your response.

Comment

I looked into the author's response, and my score remains unchanged.

Comment

Thanks for reviewing our response. If there is anything further we can provide to address your concerns or improve the score, please let us know; we would be happy to do so.

Review (Rating: 6)

Traditional methods for document understanding usually rely on the text extracted by OCR and spatial layout information, but they have limitations when dealing with long text sequences. This paper proposes a method named LayTextLLM, which reduces the sequence length problem by mapping each bounding box to a single embedding and interleaving it with the text, while taking advantage of the autoregressive characteristics of LLM. Two training tasks are proposed: layout-aware next token prediction and supervised fine-tuning of shuffled OCR, to enhance the generality of the model. In Key Information Extraction (KIE) and Visual Question Answering (VQA) tasks, LayTextLLM significantly outperforms the existing state-of-the-art methods. In the zero-shot scenario, the performance of the KIE task is improved by 27.0% and that of the VQA task is improved by 19.8%.

Strengths

  1. The paper proposes the adoption of a unified embedding method in LLM for the first time, interleaving the spatial layout with the text data.
  2. By reducing the sequence length and making full use of the autoregressive characteristics, the performance of document understanding is significantly improved.
  3. Comprehensive experimental results and ablation studies are provided, demonstrating the effectiveness of the proposed method.

Weaknesses

Compared with the existing VLM papers, the paradigm is actually quite similar. Although it is not highly innovative, it offers a new perspective for handling OCR document understanding problems in this field. Another point is that perhaps the number of baselines for comparison seems a bit small. I wonder if there are more and newer works related to Text + Coordinates for comparison. Additionally, there is a slight error in the bolded text in Table 1.

Questions

Please refer to the weaknesses.

Comment

Thanks for reviewing our paper. We appreciate the reviewer’s acknowledgment of our contributions, including the SLP module, input length reduction, and performance improvements. We are grateful for the thoughtful feedback and take the opportunity to address the concerns raised.

Paradigm is quite similar: We acknowledge that our approach follows the standard pretraining and fine-tuning paradigm; however, we introduce innovative modifications to the model architecture, i.e., SLP, enabling us to design tailored pretraining (LNTP) and fine-tuning (SSFT) tasks. These novel tasks, appreciated by reviewers mYNU and p4cT, are recognized as innovative and as enhancing the generalizability of LLMs.

Compared with more recent Text+Coordinate methods: According to the ICLR guidelines, authors need not compare with papers published after July 1, 2024, or with non-peer-reviewed works such as those on arXiv. However, we prioritize a thorough evaluation and will include the most recent relevant methods. While we found no new Text+Coordinate methods, we identified the Visual+Text+Coordinate work DocLayLLM [1] and will add it to Table 3 in the updated version.

| Model | DocVQA | VisualMRC | Avg | FUNSD | CORD | SROIE | Avg |
|---|---|---|---|---|---|---|---|
| Text+Coordinate | | | | | | | |
| LayTextLLM_all | 77.2 | 41.7 | 59.5 | 81.0 | 82.5 | 96.1 | 86.5 |
| Visual+Text+Coordinate | | | | | | | |
| DocLayLLM_llama2 | 72.83 | 55.04 | 63.94 | 78.67 | 70.81 | 83.27 | 77.58 |

[1] Liao, W., Wang, J., Li, H., Wang, C., Huang, J. and Jin, L., 2024. DocLayLLM: An Efficient and Effective Multi-modal Extension of Large Language Models for Text-rich Document Understanding. arXiv preprint arXiv:2408.15045.

Bold-font error in Table 1: Thank you for highlighting the error in the bold font; we will correct it in the next version.

We kindly request your acknowledgement of our reply and welcome further discussion of your questions and concerns. We would greatly appreciate it if you would consider improving the rating. We look forward to your response.

Comment

Thank you to the author for the patient response. I have carefully reviewed the replies from the other reviewers. I will maintain my score unchanged, but I hold a negative attitude toward the acceptance of this paper.

Comment

Thank you for your feedback and for taking the time to carefully review our responses and the comments from the others. We appreciate your thoughtful consideration of our work. If there are any specific concerns or points of clarification that you would like to discuss further, we would be more than happy to address them in greater detail. Your insights are valuable to us, and we are committed to improving our work based on constructive feedback.

Review (Rating: 5)

This paper presents LayTextLLM, which interleaves layout and text in an LLM for document understanding. There are two key ideas:

  • Spatial layout projector (SLP), which projects the coordinates of each bbox (i.e., [x_min, y_min, x_max, y_max]) derived by an off-the-shelf OCR module into a d-dimensional embedding, so that the layout information can be fed to the LLM and learned together with the other text tokens via (layout-aware) next token prediction (for illustration, a minimal sketch of such a projector is given after this summary).
  • Shuffled-OCR supervised fine-tuning (SSFT), which randomly shuffles the order of the OCR-derived text in a certain proportion of examples to mitigate the bias that OCR engines typically output text bboxes in a top-to-bottom, left-to-right order.

Experiments on DocVQA, InfoVQA, and KIE tasks show that LayTextLLM outperforms or achieves comparable performance to baselines, while experiencing a significant performance drop on the ChartQA dataset.
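A minimal PyTorch sketch of what such a projector could look like is shown below: a single learned linear map from four normalized coordinates to one embedding in the LLM's hidden space. The module name, the hidden size, and the use of a plain linear layer are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a spatial layout projector: one bounding box -> one embedding.
# Assumptions (not from the paper): a plain nn.Linear projection and hidden_size=4096.
import torch
import torch.nn as nn

class SpatialLayoutProjector(nn.Module):
    def __init__(self, hidden_size: int = 4096):
        super().__init__()
        self.proj = nn.Linear(4, hidden_size)  # [x_min, y_min, x_max, y_max] -> d-dim embedding

    def forward(self, boxes: torch.Tensor) -> torch.Tensor:
        # boxes: (num_boxes, 4), coordinates normalized to [0, 1]
        return self.proj(boxes)  # (num_boxes, hidden_size)

# Each box embedding would be interleaved with the text-token embeddings of its OCR span.
slp = SpatialLayoutProjector()
layout_embeddings = slp(torch.tensor([[0.12, 0.05, 0.48, 0.09]]))  # shape: (1, 4096)
```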

Strengths

  • LayTextLLM demonstrates that mapping the coordinates to a single bbox embedding can achieve strong performance on document understanding while mitigating long-sequence issues.
  • The results tables (e.g., Table 1) show that LayTextLLM achieves performance improvements over several baseline methods on DocVQA, InfoVQA, and KIE tasks.

Weaknesses

  1. The performance of the model heavily relies on the quality of the OCR engine, which poses a risk of significant performance drop, especially for handwritten documents or low-resource languages where OCR performance can be subpar. Moreover, incorporating an OCR module can negatively impact inference latency, making real-time or high-throughput applications challenging.

  2. As indicated in Table 3, LayTextLLM_zero and even LayTextLLM_all demonstrate inferior performance on Document-Oriented VQA compared to LayoutLLM (Luo et al., 2024). Additionally, Table 8 reveals that LayTextLLM shows a substantial drop in performance on the ChartQA dataset. Specifically, OCR-free methods generally achieve scores between 50-59.9, whereas the best performance achieved by LayTextLLM_all is only 35.4.

  3. The ablation study requires a clearer description of the variants involved. The first line in Table 4 lacks clarity—does it refer to a setting that uses naive coordinate-based pretraining (e.g., normalized [x1, y1, x2, y2]) with the proposed SSFT strategy? If not, it would be beneficial to include such a variant to demonstrate the effectiveness of the Spatial Layout Projector.

Questions

(a) Are the reported numbers for LayoutLLM_CoT in Table 3 inaccurate? The current numbers seem to correspond to Vicuna 1.5 7B. Would it be more appropriate to report the results based on Llama2-7B for a fairer comparison?

(b) Why does LayTextLLM show significantly inferior performance on the ChartQA dataset, even compared to OCR-free methods?

Comment

Thank you for taking time to review our paper. We thank the reviewer for the thoughtful feedback and are grateful for the opportunity to address the concerns raised.

W1 - error accumulation and latency overhead due to the OCR engine:

  • OCR engine error accumulation: We acknowledge the potential for error accumulation in OCR engines. However, current MLLMs excel more at reasoning than at recognition, as shown by the lower KIE task performance in Table 1 and by the new results below comparing against a recent SOTA MLLM. We observed that many MLLM errors stem from their limited text recognition capabilities, resulting in inaccuracies such as misreading "169.80" as "169.78." Our method outperforms them on KIE tasks by a large margin, again showing the efficacy of OCR-based methods for VRDU.
| Model | FUNSD | CORD | SROIE | Avg |
|---|---|---|---|---|
| LayTextLLM_all | 63.0 | 95.9 | 95.3 | 84.7 |
| InternVL2-8B | 63.4 | 75.3 | 79.8 | 72.8 |
  • OCR throughput: Off-the-shelf OCR modules use lightweight text detection and recognition models with tens of millions of parameters, far smaller than MLLMs or visual encoders such as ViT (400M parameters). Our model uses only bounding-box and text tokens (no visual tokens), making the OCR overhead negligible for real-world inference latency.

W2-performance drop compared to LayoutLLM and in ChartQA:

  • Lower accuracy compared to LayoutLLM: LayoutLLM outperforms because it incorporates visual input alongside OCR text and bounding boxes, but at a higher computational cost (visual tokens + OCR text tokens). Our approach seeks a lightweight alternative, reducing sequence length with efficient representations that involve only OCR text and bounding boxes, and aiming to surpass other OCR-based methods (Table 2).

  • Inferior performance on ChartQA: there are two reasons:

    • (1) In Appendix E, we note that LayTextLLM, using only text and layout inputs, struggles with visual reasoning tasks like ChartQA, which require cues such as length, size, color, and area (Figure 7). As a result, it underperforms compared to MLLMs with visual inputs. Previous OCR-based methods often omit ChartQA results due to its challenging characteristics.

    • (2) We use exact match accuracy for consistency across datasets, though ChartQA's standard metric is relaxed accuracy (within 5% tolerance). Re-evaluated results (relaxed accuracy) are shown below. While LayTextLLM remains less competitive than some OCR-free methods due to ChartQA's visual challenges, it achieves results comparable to MLLMs (e.g., LayTextLLM_all: 43.2 vs. InternVL: 45.6). These updates will appear in the revised version.

| Model | Relaxed Accuracy | Exact Match Accuracy |
|---|---|---|
| LayTextLLM_zero | 28.1 | 22.8 |
| LayTextLLM_vqa | 30.1 | 23.4 |
| LayTextLLM_all | 43.2 | 35.4 |
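For clarity, a minimal sketch of the relaxed-accuracy metric described above (5% relative tolerance for numeric answers, exact match otherwise) is given below; it mirrors the metric definition rather than any particular evaluation script.

```python
# Hedged sketch of ChartQA-style relaxed accuracy: numeric predictions count as
# correct within a 5% relative tolerance; non-numeric answers fall back to exact match.
def relaxed_match(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    try:
        pred, gold = float(prediction), float(target)
    except ValueError:
        return prediction.strip().lower() == target.strip().lower()
    if gold == 0:
        return pred == 0
    return abs(pred - gold) / abs(gold) <= tolerance

assert relaxed_match("102", "100")          # within 5% of the numeric target
assert not relaxed_match("110", "100")      # outside the tolerance
assert relaxed_match(" Blue ", "blue")      # non-numeric: normalized exact match
```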

W3 - ablations are unclear: The first row in Table 4 represents the standard Llama2 model without any layout information, taking only OCR text as input. This setup uses the same training data as the second row, which includes all VQA and KIE datasets. The key difference between the two rows lies in the model inputs: the second row integrates layout bounding-box information via SLP, while the first does not. Both rows are trained with standard supervised fine-tuning (no LNTP and no SSFT). The comparison between the first two rows highlights the effectiveness of SLP. Notably, we excluded normalized [x1, y1, x2, y2] coordinates from the first row because their inclusion significantly degraded performance in our preliminary experiments. This decline may result from the excessively long sequences and the requirement for a larger-scale LLM, as discussed in lines 448–449. We will add a clearer description of the ablation study in the updated version.

Q1 - reported number error: Thank you for pointing this out. We chose to present LayoutLLM_CoT (Vicuna-7B) because it has a higher score. However, we agree that a fairer comparison is necessary. We will therefore update the results using Llama2-7B, as shown below, and this will be reflected in the updated version (the original, higher numbers are given in brackets and will not appear in the updated version).

| Model | DocVQA | VisualMRC | Avg | FUNSD | CORD | SROIE | Avg |
|---|---|---|---|---|---|---|---|
| LayoutLLM_CoT | 74.2 (74.2) | 55.7 (55.7) | 64.9 (64.9) | 78.6 (79.9) | 62.2 (63.1) | 70.9 (72.1) | 70.5 (71.7) |

Q2 - inferior performance on ChartQA: please refer to Appendix E and our response to Weakness 2.

We kindly request your acknowledgement of our reply. We would greatly appreciate it if you would consider improving the rating. We look forward to your response.

Comment

Dear Reviewer mYNU,

We sincerely appreciate the time and effort you've dedicated to reviewing our work. As the discussion period is drawing to a close, we kindly request your acknowledgment of our reply. We value your insights and would be grateful if you could join the discussion to clarify any remaining questions or concerns. Your input is highly valued, and we would greatly appreciate it if you could consider improving the evaluation after reviewing our responses.

Thank you very much for your consideration.

Sincerely, The Authors

Comment

Thanks for the response. I'd like to keep my rating.

Comment

Thank you for your feedback. We appreciate your thoughtful consideration of our work. If there are any specific concerns or points of clarification that you would like to discuss further, we would be more than happy to address them in greater detail. Your insights are valuable to us, and we are committed to improving our work based on constructive feedback.

Review (Rating: 3)

The paper uses one token (generated by a small network called SLP) to replace the four-coordinate representation of location for standard document understanding tasks. With such compression, the model can accept more tokens and a longer context than a vanilla LLM. The method shows better results than some document understanding MLLMs.

Strengths

  • The empirical performance shows an improvement on downstream document understanding tasks.

Weaknesses

Overall, the paper proposes an empirical trick to compress four localization tokens into one token; however, the contribution is thin and the research problem is not valid to me (without further considering visual features).

(updated in Nov 20)

  • Research Problem Validity: The core research problem studied in this paper appears incomplete without incorporating visual features. This concern is supported by two key pieces of evidence: a) The evolution of document understanding models shows that pure OCR-text-only approaches (like LayoutLM [1]) were quickly superseded by multimodal approaches (LayoutLMv2, LayoutLMv3 [2, 3]), demonstrating the crucial importance of incorporating visual information in document understanding tasks. b) Recent findings from MMMU-PRO [4] demonstrate that even with PERFECT text information, VQA accuracy improves by only a marginal 0.5%. This empirically proves that text-only approaches are insufficient for comprehensive document understanding, regardless of text quality.

  • Scope and Contribution: While I acknowledge the value of theoretical work, this paper reads more like an appendix to comprehensive document understanding rather than a standalone contribution. The presented approach seems to address only a narrow subset of document understanding challenges, without adequately justifying why visual features are excluded, especially given the strong empirical evidence supporting the necessity of visual information.

  • Broader Impact and Audience: For the broad ICLR audience, the paper offers limited insights and contributions. The work might be more suitable for venues specifically focused on document understanding, where the technical details of OCR processing would find a more relevant audience. Moreover, given the clear trend towards multimodal approaches in the field, this text-only approach may be of limited practical value to the research community.

[1] LayoutLM: Pre-training of Text and Layout for Document Image Understanding

[2] LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding

[3] LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking

[4] MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

Questions

See weakness.

Comment

Thank you for taking time to review our paper. We thank the reviewer for the thoughtful feedback and are grateful for the opportunity to address the concerns raised.

First of all, our contributions are not a trick; they were well received, with reviewers mYNU, ZrGX, and p4cT praising the introduction of SLP for unifying layout and text and acknowledging the improved performance. The LNTP and SSFT tasks were highlighted as innovative, and the input-length reduction further validated our approach.

Research Problem Validity:

  • Without incorporating visual features: the reviewer has misunderstood the paper. The work acknowledges the importance of incorporating visual features (lines 906–908) and recognizes this as a limitation, as discussed in Appendix E. The key point of the work is that it takes a different approach, focusing on spatial information and textual data rather than full visual embeddings. OCR-based methods are useful as a practical and computationally efficient solution. Experimental results demonstrate significant improvements over MLLMs on tasks such as KIE (27.2% improvement), as well as in our newly added comparison with the most recent SOTA MLLM, InternVL2, shown below. These results suggest that spatial-textual integration can achieve competitive performance without directly relying on raw visual features.
| Model | FUNSD | CORD | SROIE | Avg |
|---|---|---|---|---|
| LayTextLLM_all | 64 | 96.5 | 95.8 | 85.4 |
| InternVL2-8B | 63.42 | 75.35 | 79.82 | 72.86 |
  • MMMU-PRO findings:
    • The findings are based on MMMU-PRO, a benchmark that intentionally filters out questions answerable directly from text (Section 2.2 of the MMMU-PRO paper), which explains the similar performance with and without OCR. Their experiment is designed to emphasize the challenging nature of their benchmark (focusing on questions that cannot be answered without visual input) rather than to prove that text-only approaches are insufficient for document understanding. In real-world applications, many VRDU tasks can be effectively addressed using text and layout alone. Thus, the conclusions drawn from MMMU-PRO are not entirely applicable to the scope of our work, which focuses on leveraging text and layout information efficiently for practical document understanding scenarios.

    • The findings from MMMU-PRO are conditioned on using large 70B open-source models or other competitive proprietary models. However, when OCR is combined with smaller models, significant performance improvements are observed. This underscores the value of LayTextLLM, which provides an efficient solution that leverages OCR and spatial-text integration to achieve strong results, even without relying on massive model sizes.

Scope and Contribution:

  • Contribution: The contribution is substantial and well received by the reviewers. Reviewers mYNU, ZrGX, and p4cT praised the introduction of SLP, which unifies layout information and text into a cohesive sequence representation. All reviewers acknowledged the improved performance achieved by our approach. Specifically, the tailored LNTP and SSFT tasks were highlighted as innovative and effective by Reviewers mYNU and p4cT, while Reviewer ZrGX appreciated the reduction in input length, further validating the impact of our contributions.

  • Scope: The scope of this work is broad and significant for the VRDU field, encompassing key tasks such as VQA and KIE. Prominent datasets such as DocVQA, FUNSD, and SROIE, widely used to evaluate both previous high-profile models (e.g., the LayoutLM series) and the most recent text-centric MLLMs/LLMs (e.g., InternVL, Qwen2-VL, TextMonkey), highlight its relevance. Our method achieves improved performance (appreciated by all reviewers) on all these datasets, again demonstrating the contribution of the work.

  • Exclusion of vision: As clarified in the response to Research Problem Validity, the work recognizes the importance of visual features in document understanding but adopts a different approach, focusing on spatial and textual information instead of full visual embeddings. The importance of visual information is acknowledged, and we have noted plans to explore its integration in future work (see Appendix E, lines 906–908).

Broader Impact and Audience: Our submission aligns with the ICLR track (our primary area for the submission is: applications to computer vision, audio, language, and other modalities). Our work specifically unifies layout tokens and text tokens, significantly enhancing the application of LLMs to document understanding, making it well suited for ICLR's audience.

We kindly request your acknowledgement of our reply and welcome further discussion of your questions and concerns. We would greatly appreciate it if you would consider improving the rating. We look forward to your response.

Comment

Dear Reviewer XPL9,

As the discussion period comes to an end, we kindly request your acknowledgment of our response and remain open to further discussion should you have any additional concerns.

Thank you very much for your consideration.

Sincerely, The Authors

Comment

I maintain my initial assessment and cannot recommend this paper for acceptance due to two major concerns: limited contribution and questionable experimental design.

The primary issue lies in the paper's positioning and experimental methodology. The targeted benchmarks already achieve excellence (>95% accuracy) using conventional non-LLM approaches like LayoutLMv3. The lower performance of VLMs is not indicative of their inherent capabilities, but rather stems from the unfair comparison between zero-shot VLM inference and fully fine-tuned document understanding models.

The critical methodological flaw lies in the experimental setup. While the authors describe their comparison with InternVL2-8B as "zero-shot," it is worth noting that datasets like FUNSD and SROIE were indeed included in the VLM training process, albeit as part of a broader training set. However, this general exposure differs significantly from targeted, task-specific fine-tuning. The authors' proposed model benefits from specialized OCR-centric fine-tuning, which naturally yields better performance than models that have only seen these tasks as part of general training. This creates an imbalanced evaluation framework that undermines the validity of the performance comparisons.

Comment

Contribution: Our contribution is a novel method to integrate layout information into LLMs by interleaving layout tokens with text tokens, unifying all inputs into a single sequence. We also propose two tailored training tasks that significantly enhance the performance of OCR-based approaches in VRDU tasks. Our method achieves substantial improvements over existing OCR-based state-of-the-art approaches, such as DocLLM and ICL-D3IE, demonstrating a large performance margin (Table 2). All contributions are well-received and recognized by other reviewers.

Experimental Setup: Our experimental setup ensures fairness. In Tables 1, 2, and 3, datasets used in SFT are marked with an asterisk (*). The improvements mentioned in the Abstract (line 24) are based on the "zero-shot" setting for KIE (Table 1: LayTextLLM_zero 67.6 vs. TextMonkey+ 40.4) and the SFT setting for VQA (Table 1: LayTextLLM_vqa 59.7 vs. TextMonkey+ 47.7), as all VQA data is used in the SFT of these MLLMs.

At submission time, InternVL2-8B was excluded from the comparison to comply with the ICLR policy on contemporaneous work. As you mentioned, "datasets like FUNSD and SROIE were indeed included in the VLM training process, albeit as part of a broader training set," which indicates that models like InternVL2 benefit from pretraining on these benchmark datasets (though not through SFT). In contrast, LayTextLLM does not (i.e., LayTextLLM_zero), further highlighting the strong performance of the proposed method.

Comment

Dear Reviewers,

We sincerely appreciate the time and effort you've dedicated to reviewing our work. We are grateful for the recognition of the novelty and performance improvements acknowledged by the reviewers. We have carefully reviewed all the feedback and addressed the concerns in our rebuttal. Based on this feedback, we will make the revisions to improve the manuscript.

As the discussion period is drawing to a close, we kindly request your acknowledgment of our reply. We are open to further discussions to clarify any remaining questions or concerns. We would greatly appreciate it if you could consider improving the evaluation after reviewing our responses.

Thank you very much for your consideration.

Sincerely, The Authors

AC Meta-Review

This paper introduces LayTextLLM, a method for incorporating document layout into LLMs. It uses: (1) a Spatial Layout Projector (SLP) to embed bounding box coordinates from OCR output, enabling layout-aware next token prediction; and (2) Shuffled-OCR supervised fine-tuning (SSFT) to reduce OCR-induced positional bias.

The central contribution of this work is the SLP module, which effectively integrates layout and text. However, as noted in the related work section, Perot et al. (2023) use formats such as “HARRISBURG 78|09” to represent OCR text and corresponding layout, while He et al. (2023) use “[x_min, y_min, x_max, y_max]” to encode layout information. It is unclear whether fine-tuning with LoRA using these existing formats would yield comparable performance. If so, the primary advantage of the proposed method would be token savings achieved through coordinate embedding. Reviewer ZrGX raised a similar concern.

Additional Comments on Reviewer Discussion

Reviewers expressed concerns regarding novelty, the limited number of baselines for comparison, the exploration of LoRA variants, and the need for more detailed ablation studies assessing spatial understanding before and after LoRA fine-tuning. The authors have partially addressed these concerns.

Thank you for flagging XPL9. This feedback has been considered in the decision.

Final Decision

Reject