Thank you for your thoughtful feedback and appreciate the recognition of our paper’s contributions and novelty. We are grateful for the opportunity to address the concerns raised.

W1-Model backbone: We implemented LayTexLLM using Llama2-7b, consistent with previous OCR-based methods like DocLLM, which also use Llama2-7b. We also replicated the results of the coor-as-tokens scheme using Llama2-7b for consistency. Noting the LayoutLLM model utilizes Llama2-7b and Vicuna 1.5 7B, which is fine-tuned from Llama2-7b. Therefore, for the majority of our comparisons, the models are based on the same or similar LLM backbones, allowing for a fair comparison between approaches.

Other MLLMs use backbones like Qwen-VL, Internlm, and Vicuna, which are models with at least 7 billion parameters, excluding the visual encoder. Thus, we can say the comparison is fair, at least in terms of model parameters. We will explicitly mention this in the updated version.

W2-OCR engine: We use word-level OCR from the respective datasets to ensure a fair comparison, except for the ChartQA dataset, where no OCR is provided (and we have mentioned this in lines 277-278). We will explicitly mention this in an updated version.

W3-Confusion of terminology:

Sorry for the confusion caused. Here are explanations:

Pretrain and SFT Clarification: In Section 4.1, the terms "pretrain" and "SFT" refer to LNTP and SSFT. We will revise this section to avoid confusion.
Zero-shot and Supervised Results: The term "zero-shot" refers to a model trained using SSFT only with Document Dense Description (DDD) and Layout-aware SFT data, as used in LayoutLLM. "Supervised" indicates that the model is trained using SSFT with DDD, Layout-aware SFT data, and the training sets of downstream datasets such as DocVQA and FUNSD. This terminology aligns with LayoutLLM, and we will clarify this in the updated version.
Asterisk Notation: An asterisk (*) is used to indicate whether the corresponding training set of a downstream dataset is included in the training of a specific model. This notation facilitates a fair interpretation of experimental results for the reader.

W4-Statement about improvement: We'll tone down the phrasing to accurately reflect this comparison and highlight our improvements in relation to OCR-based methods.

W5-LNTP: We acknowledge that LNTP resembles the regular language modeling objective. We'll tone down the presentation to clarify that the added SLP and P-LoRA parameters are updated using standard language modeling loss.

Q1-claim of autoregressive vs. docllm: We had a brief discussion in line 297 – line 299. Here we elaborated this in detail, and add them in the updated version:

Disentangled Attention: DocLLM uses a disentangled attention to process spatial and textual modalities separately (using differerent QK weight) before integrating them. This independent handles of spatial information from document layouts, unlike traditional autoregressive models that process inputs sequentially (use the same suite of weights). In contrast, LayTextLLM interleaves bounding box tokens with text, unifying both modalities in a single sequence through an autoregressive approach.
Block Infilling Objective: Unlike standard autoregressive models that predict the next token based only on preceding text, DocLLM uses block infilling to predict missing text blocks based on both preceding and succeeding context. This deviates from leveraging the inherent autoregressive nature of traditional LLMs which solely relies on preceding tokens.
Impact on Performance: As demonstrated in Table 2. when compared using the same training dataset, LayTextLLM significantly outperforms DocLLM.

Q2-encode x and y: We considered using positional encodings that directly encode (x, y) positions to address the artifact issue. However, to fully leverage the LLM parameters and maintain simplicity, we avoided encoding (x, y) positions, as it could complicate the model. Instead, we focused on balancing LLM reuse with necessary adjustments, which led us to propose SSFT.

Q3-model backbone and generalization of the method: We implemented LayTexLLM using Llama2-7b as our LLM backbone, in line with prior OCR-based methods like DocLLM and LayoutLLM.

Our method is model-agnostic. In our in-house KIE test, we evaluated the performance of the Baichuan2 7b and Qwen2 7b models. The results showed that incorporating the SLP layer improved performance for both models compared to not using it.

Model	w/o SLP	With SLP
Baichuan2 7B	0.7464	0.7738
Qwen2 7B	0.754	0.7858

Q4-using PLORA: The motivation for using P-LoRA is due to the concern of having too few learnable parameters. P-LoRA is applied in each layer, but the main contribution still comes from SLP.

Q5-LNTP loss: Yes, the bounding box tokens are used as inputs but not targets. Our objective is to understand the bounding boxes, not to generate them. Therefore, it is unnecessary to compute a loss for the bounding box tokens.

Also, we tested in a in-house KIE dataset, finding that including bounding box as targets (using string like ''[1,20,10,30]'') during LNTP drops the downstream performance.

Q6-typo: We will fix those typos in the updated version.

Q7-polys and asterisk: The term 'polys' will be replaced with 'coordinates.' An asterisk (*) indicates if the training set of a downstream dataset is included in the training of a specific model. This notation ensures a fair interpretation of the experimental results for the reader.

Limitations: Yes, bolding and italic texts should be included as visual cues, which will be updated in the updated version.

We would appreciate it if you could improve your rating if all concerns been addressed and we look forward to your response.