PaperHub
Overall rating: 5.0/10 (Rejected; 4 reviewers; lowest 3, highest 6, std. dev. 1.2)
Individual ratings: 6, 6, 5, 3
Confidence: 4.0 | Correctness: 2.5 | Contribution: 2.3 | Presentation: 2.5
ICLR 2025

A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding

OpenReview | PDF
Submitted: 2024-09-26 | Updated: 2025-02-05

Abstract

Keywords

LLM, DocAI, Visually Rich Document Understanding, KIE

Reviews and Discussion

Review (Rating: 6)

This paper introduces Interleaving Layout and Text in a Large Language Model (LayTextLLM) for document understanding, which projects each bounding box to a single embedding and interleaves it with text, efficiently avoiding long-sequence issues while leveraging the autoregressive traits of LLMs. The authors also propose a tailored, completely self-supervised pre-training task, Layout-aware Next Token Prediction, that enhances the alignment between layout and textual data. LayTextLLM significantly improves performance on downstream document-related VQA and KIE tasks.

Strengths

1. This work is the first to employ a unified embedding approach that interleaves spatial layouts directly with textual data within an LLM.
2. The authors propose two tailored training tasks, (1) Layout-aware Next Token Prediction and (2) Shuffled-OCR Supervised Fine-tuning, to better elicit the model's generalizability on downstream tasks.
3. The experimental results are sound.

Weaknesses

1. The method adopts a P-LoRA structure, which was introduced by the existing work InternLM-XComposer2, limiting the novelty of this work.
2. The analysis of the impact of SLP training on the effectiveness of the model remains incomplete.

Questions

1. For the different training tasks mentioned, such as Layout-aware Next Token Prediction, how many tokens of training data were used for each?
2. Why does the chosen LoRA structure work as expected? Do other LoRA variants have the same effect, and how were the ablation trials conducted?
3. Is there any analysis examining the impact on the Llama models' ability to understand language after incorporating spatial information?

Ethics Concerns

N/A

Comment

Thank you for taking the time to review our paper. We sincerely appreciate the reviewer’s recognition of our work, the thoughtful feedback, and the opportunity to address the concerns raised.

W1 - P-LoRA limits the novelty: The core contribution of this work lies in the SLP module, which effectively unifies layout and text, along with the tailored LNTP and SSFT training tasks, both of which were well received by reviewers. In contrast, the impact of P-LoRA on performance is relatively minor. As shown in Table 4, comparing the second and third rows reveals that adding P-LoRA increases the average accuracy by only 0.8% for KIE and VQA tasks. This improvement is significantly smaller than that from introducing SLP, as is evident when comparing the first and second rows of Table 4.

W2 - analysis of SLP is incomplete:

  • We will add a new experiment analyzing the impact of SLP on the model's language ability; see the response to Q3 for details.
  • We will provide a clearer description of the ablation studies in Section 4.6 to emphasize the effectiveness of SLP. Specifically, the first row of Table 4 represents the standard Llama2 model without SLP. This setup uses the same training data as the second row, including datasets such as DDD, Layout-aware SFT, DocVQA, InfoVQA, ChartQA, FUNSD, CORD, SROIE, and VisualMRC. Both rows are trained with standard SFT (no LNTP and no SSFT). The key distinction between the two rows lies in their inputs: the second row integrates layout bounding-box information using SLP, whereas the first does not. This comparison underscores the impact of SLP on document understanding performance.

Q1 token statistics: The token statistics for LNTP and SSFT using the Llama2 tokenizer are summarized below and will be included in the updated version.

| Dataset | CDIP | Docbank |
|---|---|---|
| LNTP (tokens) | 1,077,079,640 | 4,722,635,632 |

| Dataset | DocVQA | ChartQA | InfoVQA | VisualMRC | Funsd | CORD | SROIE | DDD | Layout-aware SFT |
|---|---|---|---|---|---|---|---|---|---|
| SSFT (tokens) | 17,621,621 | 1,053,052 | 11,544,753 | 1,624,822 | 1,024,236 | 1,140,453 | 1,201,376 | 148,145,979 | 212,643,094 |
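For reference, per-dataset token counts of this kind can be reproduced with a short script along the lines of the hedged sketch below; the tokenizer checkpoint and the toy inputs are illustrative placeholders, and actual dataset loading is omitted.

```python
# Hedged sketch of counting training tokens with the Llama2 tokenizer.
# The checkpoint name and the toy text list are placeholders, not the real training data.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def count_tokens(texts):
    """Sum the tokenized length of every sample in a dataset split."""
    return sum(len(tokenizer(t, add_special_tokens=False).input_ids) for t in texts)

print(count_tokens(["Invoice No. 12345", "Total: 169.80"]))  # replace with real training samples
```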

Q2 why P-LoRA works: As noted in the response to W1, while P-LoRA is effective, its impact is not particularly significant. In fact, in experiments conducted on an in-house dataset, we observed that when the training set is sufficiently large, P-LoRA becomes dispensable, as shown below. We believe that the effectiveness of parameter-efficient fine-tuning methods such as LoRA, P-LoRA, prompt tuning, and P-tuning diminishes as the training set size increases. In such cases, fully supervised fine-tuning emerges as the most effective approach.

| Model | F-score |
|---|---|
| LayTextLLM with P-LoRA | 61.16 |
| LayTextLLM w/o P-LoRA | 61.20 |

Q3 impact on language ability after incorporating SLP: Good question. We conducted an additional experiment to evaluate the language understanding ability of the models after integrating spatial information. Specifically, we measured perplexity on the Wikitext test set for the baseline and LayTextLLM models. Perplexity serves as an indicator of language understanding, with higher values suggesting a potential decline in language ability relative to the baseline (LLaMA2-7B). The results are shown in the table below (an illustrative sketch of this kind of evaluation is given after the discussion).

| Model | Perplexity |
|---|---|
| Llama2-7B | 7.4793 |
| LayTextLLM_zero | 11.0252 |
| LayTextLLM_vqa | 25.0701 |
| LayTextLLM_all | 12.56 |
  • LayTextLLM_zero: shows low perplexity, indicating that spatial information can be integrated with little effect on language modeling.
  • LayTextLLM_vqa: highest perplexity, likely due to the DocVQA and InfoVQA datasets, which emphasize document reasoning and shift the focus from general language to document-specific tasks.
  • LayTextLLM_all: perplexity very similar to LayTextLLM_zero and the baseline, again demonstrating that SLP has little impact on language ability.
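The sketch below shows one way such a Wikitext perplexity check could be run with a sliding window; the model identifier, the Wikitext configuration, and the window size are assumptions for illustration rather than the exact evaluation setup used here.

```python
# Hedged sketch of measuring Wikitext perplexity with a sliding window.
# The model name, dataset config, and window size are illustrative assumptions.
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # baseline; swap in a fine-tuned checkpoint to compare
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
).eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids

window, nll_sum, n_tokens = 2048, 0.0, 0
for start in range(0, ids.size(1), window):
    chunk = ids[:, start:start + window].to(model.device)
    if chunk.size(1) < 2:
        continue  # need at least one predicted token
    with torch.no_grad():
        loss = model(chunk, labels=chunk).loss  # mean NLL over the chunk's shifted tokens
    nll_sum += loss.item() * (chunk.size(1) - 1)
    n_tokens += chunk.size(1) - 1

print("perplexity:", math.exp(nll_sum / n_tokens))
```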

We will add this experiment in the updated version.

We kindly request your acknowledgement of our reply. We would greatly appreciate it if you would consider improving the rating. We look forward to your response.

Comment

I looked into the author's response, and my score remains unchanged.

Comment

Thanks for reviewing our response. If there is anything further we can provide to address your concerns or improve the score, please let us know; we would be happy to do so.

Review (Rating: 6)

Traditional methods for document understanding usually rely on the text extracted by OCR and spatial layout information, but they have limitations when dealing with long text sequences. This paper proposes a method named LayTextLLM, which reduces the sequence length problem by mapping each bounding box to a single embedding and interleaving it with the text, while taking advantage of the autoregressive characteristics of LLM. Two training tasks are proposed: layout-aware next token prediction and supervised fine-tuning of shuffled OCR, to enhance the generality of the model. In Key Information Extraction (KIE) and Visual Question Answering (VQA) tasks, LayTextLLM significantly outperforms the existing state-of-the-art methods. In the zero-shot scenario, the performance of the KIE task is improved by 27.0% and that of the VQA task is improved by 19.8%.

Strengths

  1. The paper proposes the adoption of a unified embedding method in LLM for the first time, interleaving the spatial layout with the text data.
  2. By reducing the sequence length and making full use of the autoregressive characteristics, the performance of document understanding is significantly improved.
  3. Comprehensive experimental results and ablation studies are provided, demonstrating the effectiveness of the proposed method.

Weaknesses

Compared with the existing VLM papers, the paradigm is actually quite similar. Although it is not highly innovative, it offers a new perspective for handling OCR document understanding problems in this field. Another point is that perhaps the number of baselines for comparison seems a bit small. I wonder if there are more and newer works related to Text + Coordinates for comparison. Additionally, there is a slight error in the bolded text in Table 1.

Questions

Please refer to the weaknesses.

Comment

Thanks for reviewing our paper. We appreciate the reviewer’s acknowledgment of our contributions, including the SLP module, input length reduction, and performance improvements. We are grateful for the thoughtful feedback and take the opportunity to address the concerns raised.

Paradigm is quite similar: We acknowledge that our approach follows the standard pretraining and fine-tuning paradigm; however, we introduce innovative modifications to the model architecture, i.e., SLP, enabling us to design tailored pretraining (LNTP) and fine-tuning (SSFT) tasks. These novel tasks, appreciated by reviewers mYNU and p4cT, are recognized as innovative and as enhancing the generalizability of LLMs.

Compared with more recent Text+Coordinate methods: According to the ICLR guidelines, authors need not compare with papers published after July 1, 2024, or with non-peer-reviewed works such as those on arXiv. However, we prioritize a thorough evaluation and will include the most recent relevant methods. While we found no new Text+Coordinate methods, we identified the Visual+Text+Coordinate work DocLayLLM [1] and will add it to Table 3 in the updated version.

| Model | DocVQA | VisualMRC | Avg | FUNSD | CORD | SROIE | Avg |
|---|---|---|---|---|---|---|---|
| Text+Coordinate | | | | | | | |
| LayTextLLM_all | 77.2 | 41.7 | 59.5 | 81.0 | 82.5 | 96.1 | 86.5 |
| Visual+Text+Coordinate | | | | | | | |
| DocLayLLM_llama2 | 72.83 | 55.04 | 63.94 | 78.67 | 70.81 | 83.27 | 77.58 |

[1] Liao, W., Wang, J., Li, H., Wang, C., Huang, J. and Jin, L., 2024. DocLayLLM: An Efficient and Effective Multi-modal Extension of Large Language Models for Text-rich Document Understanding. arXiv preprint arXiv:2408.15045.

Bold-font error in Table 1: Thank you for highlighting the error in the bold font; we will correct it in the next version.

We kindly request your acknowledgement of our reply and welcome further discussion of your questions and concerns. We would greatly appreciate it if you would consider improving the rating. We look forward to your response.

Comment

Thank you to the author for the patient response. I have carefully reviewed the replies from the other reviewers. I will maintain my score unchanged, but I hold a negative attitude toward the acceptance of this paper.

Comment

Thank you for your feedback and for taking the time to carefully review our responses and the comments from the others. We appreciate your thoughtful consideration of our work. If there are any specific concerns or points of clarification that you would like to discuss further, we would be more than happy to address them in greater detail. Your insights are valuable to us, and we are committed to improving our work based on constructive feedback.

Review (Rating: 5)

This paper presents LayTextLLM, which interleaves layout and text in an LLM for document understanding. There are two key ideas:

  • Spatial layout projector (SLP), which projects the coordinates of each bbox (i.e., [x_min, y_min, x_max, y_max]) derived by an off-the-shelf OCR module into a d-dimensional embedding, so that the layout information can be fed to the LLM and learned together with the other text tokens via (layout-aware) next token prediction (for illustration, a minimal sketch of such a projector is given after this summary).
  • Shuffled-OCR supervised fine-tuning (SSFT), which randomly shuffles the order of the OCR-derived text in a certain proportion of examples to mitigate the bias that OCR engines typically output text bboxes in a top-to-bottom, left-to-right order.

Experiments on DocVQA, InfoVQA, and KIE tasks show that LayTextLLM outperforms or achieves comparable performance to baselines, while experiencing a significant performance drop on the ChartQA dataset.
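A minimal PyTorch sketch of what such a projector could look like is shown below: a single learned linear map from four normalized coordinates to one embedding in the LLM's hidden space. The module name, the hidden size, and the use of a plain linear layer are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a spatial layout projector: one bounding box -> one embedding.
# Assumptions (not from the paper): a plain nn.Linear projection and hidden_size=4096.
import torch
import torch.nn as nn

class SpatialLayoutProjector(nn.Module):
    def __init__(self, hidden_size: int = 4096):
        super().__init__()
        self.proj = nn.Linear(4, hidden_size)  # [x_min, y_min, x_max, y_max] -> d-dim embedding

    def forward(self, boxes: torch.Tensor) -> torch.Tensor:
        # boxes: (num_boxes, 4), coordinates normalized to [0, 1]
        return self.proj(boxes)  # (num_boxes, hidden_size)

# Each box embedding would be interleaved with the text-token embeddings of its OCR span.
slp = SpatialLayoutProjector()
layout_embeddings = slp(torch.tensor([[0.12, 0.05, 0.48, 0.09]]))  # shape: (1, 4096)
```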

Strengths

  • LayTextLLM demonstrates that mapping the coordinates to a single bbox embedding can achieve strong performance on document understanding while mitigating long-sequence issues.
  • The results tables (e.g., Table 1) show that LayTextLLM achieves performance improvements over several baseline methods on DocVQA, InfoVQA, and KIE tasks.

Weaknesses

  1. The performance of the model heavily relies on the quality of the OCR engine, which poses a risk of significant performance drop, especially for handwritten documents or low-resource languages where OCR performance can be subpar. Moreover, incorporating an OCR module can negatively impact inference latency, making real-time or high-throughput applications challenging.

  2. As indicated in Table 3, LayTextLLM_zero and even LayTextLLM_all demonstrate inferior performance on Document-Oriented VQA compared to LayoutLLM (Luo et al., 2024). Additionally, Table 8 reveals that LayTextLLM shows a substantial drop in performance on the ChartQA dataset. Specifically, OCR-free methods generally achieve scores between 50-59.9, whereas the best performance achieved by LayTextLLM_all is only 35.4.

  3. The ablation study requires a clearer description of the variants involved. The first line in Table 4 lacks clarity—does it refer to a setting that uses naive coordinate-based pretraining (e.g., normalized [x1, y1, x2, y2]) with the proposed SSFT strategy? If not, it would be beneficial to include such a variant to demonstrate the effectiveness of the Spatial Layout Projector.

Questions

(a) Are the reported numbers for LayoutLLM_CoT in Table 3 inaccurate? The current numbers seem to correspond to Vicuna 1.5 7B. Would it be more appropriate to report the results based on Llama2-7B for a fairer comparison?

(b) Why does LayTextLLM show significantly inferior performance on the ChartQA dataset, even compared to OCR-free methods?

Comment

Thank you for taking time to review our paper. We thank the reviewer for the thoughtful feedback and are grateful for the opportunity to address the concerns raised.

W1 - error accumulation and latency overhead due to the OCR engine:

  • OCR engine error accumulation: We acknowledge the potential for error accumulation in OCR engines. However, current MLLMs excel more at reasoning than at recognition, as shown by the lower KIE task performance in Table 1 and by the new results below comparing against a recent SOTA MLLM. We observed that many MLLM errors stem from their limited text recognition capabilities, resulting in inaccuracies such as misreading "169.80" as "169.78." Our method outperforms them on KIE tasks by a large margin, again showing the efficacy of OCR-based methods for VRDU.
| Model | FUNSD | CORD | SROIE | Avg |
|---|---|---|---|---|
| LayTextLLM_all | 63.0 | 95.9 | 95.3 | 84.7 |
| InternVL2-8B | 63.4 | 75.3 | 79.8 | 72.8 |
  • OCR throughput: Off-the-shelf OCR modules use lightweight text detection and recognition models with tens of millions of parameters, far smaller than MLLMs or visual encoders such as ViT (400M parameters). Our model uses only bounding-box and text tokens (no visual tokens), making the OCR overhead negligible for real-world inference latency.

W2-performance drop compared to LayoutLLM and in ChartQA:

  • Lower accuracy compared to LayoutLLM: LayoutLLM outperforms because it incorporates visual input alongside OCR text and bounding boxes, but at a higher computational cost (visual tokens + OCR text tokens). Our approach seeks a lightweight alternative, reducing sequence length with efficient representations that involve only OCR text and bounding boxes, and aiming to surpass other OCR-based methods (Table 2).

  • Inferior performance on ChartQA: there are two reasons:

    • (1) In Appendix E, we note that LayTextLLM, using only text and layout inputs, struggles with visual reasoning tasks like ChartQA, which require cues such as length, size, color, and area (Figure 7). As a result, it underperforms compared to MLLMs with visual inputs. Previous OCR-based methods often omit ChartQA results due to its challenging characteristics.

    • (2) We use exact match accuracy for consistency across datasets, though ChartQA's standard metric is relaxed accuracy (within 5% tolerance). Re-evaluated results (relaxed accuracy) are shown below. While LayTextLLM remains less competitive than some OCR-free methods due to ChartQA's visual challenges, it achieves results comparable to MLLMs (e.g., LayTextLLM_all: 43.2 vs. InternVL: 45.6). These updates will appear in the revised version.

| Model | Relaxed Accuracy | Exact Match Accuracy |
|---|---|---|
| LayTextLLM_zero | 28.1 | 22.8 |
| LayTextLLM_vqa | 30.1 | 23.4 |
| LayTextLLM_all | 43.2 | 35.4 |
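For clarity, a minimal sketch of the relaxed-accuracy metric described above (5% relative tolerance for numeric answers, exact match otherwise) is given below; it mirrors the metric definition rather than any particular evaluation script.

```python
# Hedged sketch of ChartQA-style relaxed accuracy: numeric predictions count as
# correct within a 5% relative tolerance; non-numeric answers fall back to exact match.
def relaxed_match(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    try:
        pred, gold = float(prediction), float(target)
    except ValueError:
        return prediction.strip().lower() == target.strip().lower()
    if gold == 0:
        return pred == 0
    return abs(pred - gold) / abs(gold) <= tolerance

assert relaxed_match("102", "100")          # within 5% of the numeric target
assert not relaxed_match("110", "100")      # outside the tolerance
assert relaxed_match(" Blue ", "blue")      # non-numeric: normalized exact match
```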

W3 - ablations are unclear: The first row in Table 4 represents the standard Llama2 model without any layout information, taking only OCR text as input. This setup uses the same training data as the second row, which includes all VQA and KIE datasets. The key difference between the two rows lies in the model inputs: the second row integrates layout bounding-box information via SLP, while the first does not. Both rows are trained with standard supervised fine-tuning (no LNTP and no SSFT). The comparison between the first two rows highlights the effectiveness of SLP. Notably, we excluded normalized [x1, y1, x2, y2] coordinates from the first row because their inclusion significantly degraded performance in our preliminary experiments. This decline may result from the excessively long sequences and the requirement for a larger-scale LLM, as discussed in lines 448–449. We will add a clearer description of the ablation study in the updated version.

Q1 - reported number error: Thank you for pointing this out. We chose to present LayoutLLM_CoT (Vicuna-7B) because it has a higher score. However, we agree that a fairer comparison is necessary. We will therefore update the results using Llama2-7B, as shown below, and this will be reflected in the updated version (the original, higher numbers are given in brackets and will not appear in the updated version).

| Model | DocVQA | VisualMRC | Avg | FUNSD | CORD | SROIE | Avg |
|---|---|---|---|---|---|---|---|
| LayoutLLM_CoT | 74.2 (74.2) | 55.7 (55.7) | 64.9 (64.9) | 78.6 (79.9) | 62.2 (63.1) | 70.9 (72.1) | 70.5 (71.7) |

Q2 - inferior performance on ChartQA: please refer to Appendix E and our response to Weakness 2.

We kindly request your acknowledgement of our reply. We would greatly appreciate it if you would consider improving the rating. We look forward to your response.

Comment

Dear Reviewer mYNU,

We sincerely appreciate the time and effort you've dedicated to reviewing our work. As the discussion period is drawing to a close, we kindly request your acknowledgment of our reply. We value your insights and would be grateful if you could join the discussion to clarify any remaining questions or concerns. Your input is highly valued, and we would greatly appreciate it if you could consider improving the evaluation after reviewing our responses.

Thank you very much for your consideration.

Sincerely, The Authors

Comment

Thanks for the response. I'd like to keep my rating.

Comment

Thank you for your feedback. We appreciate your thoughtful consideration of our work. If there are any specific concerns or points of clarification that you would like to discuss further, we would be more than happy to address them in greater detail. Your insights are valuable to us, and we are committed to improving our work based on constructive feedback.

Review (Rating: 3)

The paper uses one token (generated by a small network called SLP) to replace the four-coordinate representation of location for standard document understanding tasks. With such compression, the model can accept more tokens and a longer context than a vanilla LLM. The method shows better results than some document understanding MLLMs.

Strengths

  • The empirical performance shows an improvement on downstream document understanding tasks.

Weaknesses

Overall, the paper proposes an empirical trick to compress four localization tokens into one token; however, the contribution is thin and the research problem is not valid to me (without further considering visual features).

(updated in Nov 20)

  • Research Problem Validity: The core research problem studied in this paper appears incomplete without incorporating visual features. This concern is supported by two key pieces of evidence: a) The evolution of document understanding models shows that pure OCR-text-only approaches (like LayoutLM [1]) were quickly superseded by multimodal approaches (LayoutLMv2, LayoutLMv3 [2, 3]), demonstrating the crucial importance of incorporating visual information in document understanding tasks. b) Recent findings from MMMU-PRO [4] demonstrate that even with PERFECT text information, VQA accuracy improves by only a marginal 0.5%. This empirically proves that text-only approaches are insufficient for comprehensive document understanding, regardless of text quality.

  • Scope and Contribution: While I acknowledge the value of theoretical work, this paper reads more like an appendix to comprehensive document understanding rather than a standalone contribution. The presented approach seems to address only a narrow subset of document understanding challenges, without adequately justifying why visual features are excluded, especially given the strong empirical evidence supporting the necessity of visual information.

  • Broader Impact and Audience: For the broad ICLR audience, the paper offers limited insights and contributions. The work might be more suitable for venues specifically focused on document understanding, where the technical details of OCR processing would find a more relevant audience. Moreover, given the clear trend towards multimodal approaches in the field, this text-only approach may be of limited practical value to the research community.

[1] LayoutLM: Pre-training of Text and Layout for Document Image Understanding

[2] LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding

[3] LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking

[4] MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

Questions

See weakness.

Comment

Thank you for taking time to review our paper. We thank the reviewer for the thoughtful feedback and are grateful for the opportunity to address the concerns raised.

First of all, our contributions are not a trick; they were well received, with reviewers mYNU, ZrGX, and p4cT praising the introduction of SLP for unifying layout and text and acknowledging the improved performance. The LNTP and SSFT tasks were highlighted as innovative, and the input-length reduction further validated our approach.

Research Problem Validity:

  • Without incorporating visual features: the reviewer has misunderstood the paper. The work acknowledges the importance of incorporating visual features (lines 906–908) and recognizes this as a limitation, as discussed in Appendix E. The key point of the work is that it takes a different approach, focusing on spatial information and textual data rather than full visual embeddings. OCR-based methods are useful as a practical and computationally efficient solution. Experimental results demonstrate significant improvements over MLLMs on tasks such as KIE (27.2% improvement), as well as in our newly added comparison with the most recent SOTA MLLM, InternVL2, shown below. These results suggest that spatial-textual integration can achieve competitive performance without directly relying on raw visual features.
| Model | FUNSD | CORD | SROIE | Avg |
|---|---|---|---|---|
| LayTextLLM_all | 64 | 96.5 | 95.8 | 85.4 |
| InternVL2-8B | 63.42 | 75.35 | 79.82 | 72.86 |
  • MMMU-PRO findings:
    • The findings are based on MMMU-PRO, a benchmark that intentionally filters out questions answerable directly from text (Section 2.2 of the MMMU-PRO paper), which explains the similar performance with and without OCR. Their experiment is designed to emphasize the challenging nature of their benchmark (focusing on questions that cannot be answered without visual input) rather than to prove that text-only approaches are insufficient for document understanding. In real-world applications, many VRDU tasks can be effectively addressed using text and layout alone. Thus, the conclusions drawn from MMMU-PRO are not entirely applicable to the scope of our work, which focuses on leveraging text and layout information efficiently for practical document understanding scenarios.

    • The findings from MMMU-PRO are conditioned on using large 70B open-source models or other competitive proprietary models. However, when OCR is combined with smaller models, significant performance improvements are observed. This underscores the value of LayTextLLM, which provides an efficient solution that leverages OCR and spatial-text integration to achieve strong results, even without relying on massive model sizes.

Scope and Contribution:

  • Contribution: The contribution is substantial and well received by the reviewers. Reviewers mYNU, ZrGX, and p4cT praised the introduction of SLP, which unifies layout information and text into a cohesive sequence representation. All reviewers acknowledged the improved performance achieved by our approach. Specifically, the tailored LNTP and SSFT tasks were highlighted as innovative and effective by Reviewers mYNU and p4cT, while Reviewer ZrGX appreciated the reduction in input length, further validating the impact of our contributions.

  • Scope: The scope of this work is broad and significant for the VRDU field, encompassing key tasks such as VQA and KIE. Prominent datasets such as DocVQA, FUNSD, and SROIE, widely used to evaluate both previous high-profile models (e.g., the LayoutLM series) and the most recent text-centric MLLMs/LLMs (e.g., InternVL, Qwen2-VL, TextMonkey), highlight its relevance. Our method achieves improved performance (appreciated by all reviewers) on all these datasets, again demonstrating the contribution of the work.

  • Exclusion of vision: As clarified in the response to Research Problem Validity, the work recognizes the importance of visual features in document understanding but adopts a different approach, focusing on spatial and textual information instead of full visual embeddings. The importance of visual information is acknowledged, and we have noted plans to explore its integration in future work (see Appendix E, lines 906–908).

Broader Impact and Audience: Our submission aligns with the ICLR track (our primary area for the submission is: applications to computer vision, audio, language, and other modalities). Our work specifically unifies layout tokens and text tokens, significantly enhancing the application of LLMs to document understanding, making it well suited for ICLR's audience.

We kindly request your acknowledgement of our reply and welcome further discussion of your questions and concerns. We would greatly appreciate it if you would consider improving the rating. We look forward to your response.

Comment

Dear Reviewer XPL9,

As the discussion period comes to an end, we kindly request your acknowledgment of our response and remain open to further discussion should you have any additional concerns.

Thank you very much for your consideration.

Sincerely, The Authors

Comment

I maintain my initial assessment and cannot recommend this paper for acceptance due to two major concerns: limited contribution and questionable experimental design.

The primary issue lies in the paper's positioning and experimental methodology. The targeted benchmarks already achieve excellence (>95% accuracy) using conventional non-LLM approaches like LayoutLMv3. The lower performance of VLMs is not indicative of their inherent capabilities, but rather stems from the unfair comparison between zero-shot VLM inference and fully fine-tuned document understanding models.

The critical methodological flaw lies in the experimental setup. While the authors describe their comparison with InternVL2-8B as "zero-shot," it is worth noting that datasets like FUNSD and SROIE were indeed included in the VLM training process, albeit as part of a broader training set. However, this general exposure differs significantly from targeted, task-specific fine-tuning. The authors' proposed model benefits from specialized OCR-centric fine-tuning, which naturally yields better performance than models that have only seen these tasks as part of general training. This creates an imbalanced evaluation framework that undermines the validity of the performance comparisons.

Comment

Contribution: Our contribution is a novel method to integrate layout information into LLMs by interleaving layout tokens with text tokens, unifying all inputs into a single sequence. We also propose two tailored training tasks that significantly enhance the performance of OCR-based approaches in VRDU tasks. Our method achieves substantial improvements over existing OCR-based state-of-the-art approaches, such as DocLLM and ICL-D3IE, demonstrating a large performance margin (Table 2). All contributions are well-received and recognized by other reviewers.

Experimental Setup: Our experimental setup ensures fairness. In Tables 1, 2, and 3, datasets used in SFT are marked with an asterisk (*). The improvements mentioned in the Abstract (line 24) are based on the "zero-shot" setting for KIE (Table 1: LayTextLLM_zero 67.6 vs. TextMonkey+ 40.4) and the SFT setting for VQA (Table 1: LayTextLLM_vqa 59.7 vs. TextMonkey+ 47.7), as all VQA data is used in the SFT of these MLLMs.

At submission time, InternVL2-8B was excluded from the comparison to comply with the ICLR policy on contemporaneous work. As you mentioned, "datasets like FUNSD and SROIE were indeed included in the VLM training process, albeit as part of a broader training set," which indicates that models like InternVL2 benefit from pretraining on these benchmark datasets (though not through SFT). In contrast, LayTextLLM does not (i.e., LayTextLLM_zero), further highlighting the strong performance of the proposed method.

Comment

Dear Reviewers,

We sincerely appreciate the time and effort you've dedicated to reviewing our work. We are grateful for the recognition of the novelty and performance improvements acknowledged by the reviewers. We have carefully reviewed all the feedback and addressed the concerns in our rebuttal. Based on this feedback, we will make the revisions to improve the manuscript.

As the discussion period is drawing to a close, we kindly request your acknowledgment of our reply. We are open to further discussions to clarify any remaining questions or concerns. We would greatly appreciate it if you could consider improving the evaluation after reviewing our responses.

Thank you very much for your consideration.

Sincerely, The Authors

AC Meta-Review

This paper introduces LayTextLLM, a method for incorporating document layout into LLMs. It uses: (1) a Spatial Layout Projector (SLP) to embed bounding box coordinates from OCR output, enabling layout-aware next token prediction; and (2) Shuffled-OCR supervised fine-tuning (SSFT) to reduce OCR-induced positional bias.

The central contribution of this work is the SLP module, which effectively integrates layout and text. However, as noted in the related work section, Perot et al. (2023) use formats such as “HARRISBURG 78|09” to represent OCR text and corresponding layout, while He et al. (2023) use “[x_min, y_min, x_max, y_max]” to encode layout information. It is unclear whether fine-tuning with LoRA using these existing formats would yield comparable performance. If so, the primary advantage of the proposed method would be token savings achieved through coordinate embedding. Reviewer ZrGX raised a similar concern.

Additional Comments on Reviewer Discussion

Reviewers expressed concerns regarding novelty, the limited number of baselines for comparison, the exploration of LoRA variants, and the need for more detailed ablation studies assessing spatial understanding before and after LoRA fine-tuning. The authors have partially addressed these concerns.

Thank you for flagging XPL9. This feedback has been considered in the decision.

Final Decision

Reject