PaperHub
Overall rating: 6.4/10 · Poster · 4 reviewers
Reviewer scores: 4, 4, 5, 3 (mean 4.0; min 3, max 5, std 0.7)
Confidence
Novelty 2.5 · Quality 2.5 · Clarity 2.8 · Significance 2.5
NeurIPS 2025

Table2LaTeX-RL: High-Fidelity LaTeX Code Generation from Table Images via Reinforced Multimodal Language Models

OpenReview · PDF
Submitted: 2025-04-27 · Updated: 2025-10-29

Abstract

Keywords
table recognition, latex generation

Reviews & Discussion

Review
Rating: 4

This paper addresses the task of table image to LaTeX code generation by a reinforced MLLM framework VSGRPO: fine-tuning an MLLM on a collected large-scale table-to-LaTeX dataset, and introducing a dual-reward reinforcement learning strategy based on GRPO with visual reward (CW-SSIM) and structure reward (TEDS-Structure). Experiments show that the method achieves SOTA performances, especially on complex tables.

Strengths and Weaknesses

Strength:

  1. The writing is clear and easy to follow.

  2. The paper introduces a reinforced MLLM framework VSGRPO to table2LaTeX, with a dual-reward reinforcement strategy, which is a novel and interesting application to leverage MLLM’s ability to address complex table LaTeX generation.

  3. The experimental section is well-designed, validating the effectiveness compared to existing methods, especially on complex tables. The ablation study validates the data selection strategies, the reward components, and the necessity of SFT.

Weakness:

  1. Need more experiments to support the effectiveness of this method. E.g., closed-source models (GPT-4o) with a careful prompting strategy may achieve better results than the proposed method.

  2. Need a clear explanation of why the visual reward and structure reward are chosen. (Is it sufficient for evaluating the quality of generated tables? Are there any other ways to judge quality?)

Questions

  1. How is GPT-4o used for evaluation? Can a careful prompting strategy achieve better results in Tables 1-3? E.g., specify the possible complexities in the table (multirow, multicolumn) in the prompt, give examples, or iteratively use GPT-4o to correct the generated LaTeX answers?

  2. Are tables with annotations (e.g. bold, red mark, color highlight) considered in the dataset and evaluation? How do they perform with the proposed method?

Limitations

yes

Final Justification

My concerns are addressed, including the effectiveness of this method as shown by the comparison with GPT-4o's current abilities. However, the authors have not provided a clear explanation of the design (why the visual reward and structure reward are chosen). Given this, I keep my score.

Formatting Issues

N/A

Author Response

We appreciate your thorough review and useful comments, particularly your highlighting that our method is a "novel and interesting application" and that our writing is "clear and easy to follow". Please find our responses below.

Weakness 1: More experiments are needed to demonstrate the method's effectiveness, potentially comparing it against closed-source models like GPT-4o with optimized prompting strategies.

Thank you for pointing this interesting question out. In the original evaluation of GPT-4o, we used the following prompt:

"<image>\nConvert this table to LaTeX. 
Please output only the LaTeX code between \\begin{tabular} and \\end{tabular}, without any additional text or explanations."

As suggested, we then provided a more detailed prompt tailored to the table image-to-LaTeX task, explicitly indicating the potential complexity of the table structure and including an example LaTeX output. The specific prompt is as follows:

example_latex = r"""
Here is an example of a valid LaTeX table:
\begin{tabular}{|c|c|c|}\hline
Item & Quantity & Price \\
\hline
Apple & 2 & \$1.00 \\
Banana & 3 & \$0.75 \\
\hline
\end{tabular}
""".strip()
prompt_text = f"""{question}
Please extract only the LaTeX table between \\begin{{tabular}} and \\end{{tabular}} from the image.
The table may contain multi-row or multi-column cells, such as \\multirow or \\multicolumn.
Ensure the output has correct alignment and consistent row/column structure.
Do not include explanations or any extra text — only return valid LaTeX code for the table.
{example_latex}"""

However, this did not lead to improved performance. We believe this is because the task itself is clearly defined, and the initial prompt was already sufficient for GPT-4o to understand the objective.

Furthermore, we explored an iterative refinement strategy. Specifically, we fed the model the target table image, the initial LaTeX output, and the corresponding rendered image. The model was then asked to compare the rendered image with the target image—if no errors were detected, it would return the original LaTeX; otherwise, it would correct the LaTeX and return the updated version.

The specific prompt is as follows:

prompt_text = f"""The first image is the target table that should be recognized.
The second image shows the rendered LaTeX table from the previous prediction.

Here is the previous LaTeX prediction:
\\begin{{tabular}}{{...}}
{prediction_latex.split("\\begin{tabular}")[-1].split("\\end{tabular}")[0].strip()}
\\end{{tabular}}

Please compare the rendered prediction with the target image.
If the LaTeX code is already accurate, return it unchanged.
Otherwise, correct the LaTeX code.

Please output only the LaTeX code between \\begin{{tabular}} and \\end{{tabular}}, without any additional text or explanations."""
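For illustration, a minimal sketch of this refinement loop is given below. It is not the paper's released code: `render_latex_to_image`, `build_refinement_prompt` (the prompt shown above), and `call_gpt4o` are hypothetical helpers standing in for our LaTeX-rendering and API-calling utilities.

```python
# Minimal sketch of the iterative-refinement evaluation (illustrative only).
# render_latex_to_image, build_refinement_prompt, and call_gpt4o are assumed
# helpers, not released code.

def refine_latex(target_image_path: str, initial_latex: str, max_rounds: int = 1) -> str:
    latex = initial_latex
    for _ in range(max_rounds):
        rendered_path = render_latex_to_image(latex)   # compile the current prediction
        prompt_text = build_refinement_prompt(latex)   # the comparison prompt shown above
        latex = call_gpt4o(images=[target_image_path, rendered_path], prompt=prompt_text)
    return latex
```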

Table R1 presents the performance of different prompting strategies. The iterative refinement approach yields a marginal improvement in CW-SSIM and achieves a perfect Compile Ratio of 1.0. However, slight declines are observed in TEDS and TEDS-Structure metrics. We attribute this to the model focusing on visual similarity during the correction step, while lacking the ability to explicitly compare LaTeX structural syntax, which may have affected textual structure alignment.

Table R1: Ablation experiments on the effectiveness of different prompts for GPT-4o.

| Complex-Dataset | CW-SSIM | Compile ratio | TEDS | TEDS-Structure |
|---|---|---|---|---|
| GPT4o-original prompt | 0.4747 | 0.9917 | 0.7745 | 0.5865 |
| GPT4o-detailed prompt | 0.4691 | 0.9860 | 0.7479 | 0.5565 |
| GPT4o-iterative refinement | 0.4799 | 1.0 | 0.7673 | 0.5857 |

Weakness 2: Are tables with annotations (e.g. bold, red mark, color highlight) considered in the dataset and evaluation? How do they perform with the proposed method?

To further validate the effectiveness of our method as suggested by the reviewer, we additionally evaluated it on two publicly available table datasets: FinTabNet [1], which comprises real-world, complex scientific and financial tables with detailed structural annotations, and SynthTabNet [2], which includes Chinese financial announcement tables. A human evaluation was conducted to assess the visual similarity between the LaTeX-rendered images generated by our method and those produced by existing state-of-the-art MLLMs, using the original table images as ground truth.

We randomly sampled 100 images from the validation set. For each sample, the ground truth table image is displayed alongside the model's predicted table image, with their positions randomly shuffled and labels hidden. Multiple human assessors independently voted on which image was more visually similar to the target, with the final outcome determined by majority vote. As shown in Tables R2 and R3, our method consistently demonstrates competitive or superior performance across both datasets.
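For concreteness, a minimal sketch of this blind voting protocol is shown below, under the reading that all candidate predictions for a sample are shown together in shuffled order next to the ground-truth image; `Assessor.pick_most_similar` is a hypothetical stand-in for the annotation interface, and the sketch is illustrative rather than the exact tooling we used.

```python
# Minimal sketch of the blind, majority-vote human evaluation (illustrative only).
import random
from collections import Counter

def evaluate_sample(target_image, candidates: dict, assessors) -> str:
    """candidates: model name -> rendered table image; returns the winning model."""
    votes = []
    for assessor in assessors:
        order = list(candidates.items())
        random.shuffle(order)                          # hide which model produced which image
        images = [img for _, img in order]
        picked = assessor.pick_most_similar(target_image, images)  # index of the chosen image
        votes.append(order[picked][0])                 # map the choice back to its model
    return Counter(votes).most_common(1)[0][0]         # majority vote decides the sample
```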

Table R2: Results of human evaluation on FinTabNet [1].

| Model | QwenVL-VSGRPO | QwenVL-SFT | Internvl3-78b | QwenVL2.5-72b | GPT4o |
|---|---|---|---|---|---|
| Vote | 77 | 50 | 14 | 13 | 2 |

Table R3: Results of human evaluation on SynthTabNet [2].

| Model | QwenVL-VSGRPO | QwenVL-SFT | Internvl3-78b | QwenVL2.5-72b | GPT4o |
|---|---|---|---|---|---|
| Vote | 72 | 47 | 12 | 8 | 4 |

Reference

[1] Zheng, Xinyi, et al. “Global Table Extractor (GTE): A Framework for Joint Table Identification and Cell Structure Recognition Using Visual Context.” 2021 IEEE Winter Conference on Applications of Computer Vision (WACV) (2020): 697-706.

[2] Hou, Qiyu, et al. “Synthesizing Realistic Data for Table Recognition.” IEEE International Conference on Document Analysis and Recognition (2024).

Question 1: For Tables 1-3, how is GPT-4o evaluated? Could a refined prompting strategy, such as specifying table complexities (e.g., multirow, multicolumn), providing examples, or using iterative correction, yield better LaTeX generation results?

See above, in "Weakness 1".

Question 2: Are tables with annotations (e.g. bold, red mark, color highlight) considered in the dataset and evaluation? How do they perform with the proposed method?

See above, in "Weakness 2".

Comment

Thanks to the authors for their response. My concerns are well addressed, and I will keep my score.

Comment

We sincerely thank the reviewer for the thoughtful feedback and are pleased that the concerns have been satisfactorily addressed. We appreciate the reviewer’s decision to uphold the acceptance recommendation.

Review
Rating: 4

The authors address the task of table image to LaTeX code generation in this paper. To overcome the challenges of large table sizes, deeply nested structures, and semantically rich content, they propose a framework based on multimodal large language models. More specifically, they first collect large-scale table–LaTeX pairs from the Internet and then fine-tune a pre-trained MLLM on the dataset. Afterwards, they introduce a dual-reward RL strategy for post-training, using a structure-level reward and a visual fidelity reward. In the experiments, the authors show that their method achieves state-of-the-art performance compared to the baselines.

Strengths and Weaknesses

Strengths:

  1. This paper studies a practical task (table image to LaTeX code) that is rarely explored before.
  2. By scraping the LaTeX source files and data cleaning, the authors collect a large-scale Table2LaTeX dataset, which would be beneficial to the research community.
  3. Through SFT and VSGRPO, the obtained code generation models are comparable to or exceed commercial tools, general VLMs, and expert VLMs on multiple quantitative metrics, offering an economical and efficient solution for the task.

Weaknesses:

  1. One of the biggest concerns is the limited technical contribution. Although the task is under-explored, the proposed solution is the standard LLM fine-tuning and RL process, except for the reward design.
  2. There are very few qualitative figures in this paper. Without sufficient qualitative results, it's difficult to have an intuitive understanding of the model performance.
  3. The effectiveness of the two introduced rewards is not significant, as shown in Tab. 6.

Questions

  1. In terms of the reward design, the reward is set to 1 if the similarity metric exceeds a pre-defined threshold, and 0 otherwise. The question is: why not directly use the similarity value as the reward? Intuitively, continuous rewards are better than binary rewards in this task. Forcibly binarizing similarity values seems to make the model unable to distinguish which of two predictions that are both very similar to the GT is better.
  2. As shown in Tab. 7, the performance of the "w/o SFT" model is significantly worse than that of the full version. However, there is another difference between these two settings. More specifically, according to line 169, the number of training samples used in the "w/o SFT" setting is only 5,936, which is much smaller than the full version. Therefore, the performance discrepancy in Tab. 7 may be caused by the volume of data and does not explain the necessity of SFT. Could the authors confirm this?
  3. The performance improvement brought by the proposed VSGRPO seems marginal according to Tab. 6. It would be great if the authors could have a discussion about this.
  4. In the Abstract and Introduction sections, the authors highlight the challenges of Table2LaTeX: large sizes, deeply nested structures, and semantically rich content. However, in the Method section, there seem no customized techniques for these challenges, leaving readers feeling that they cannot echo the challenges mentioned earlier after reading.

Limitations

yes

Final Justification

The rebuttal contains detailed and thorough clarifications, addressing most of my initial concerns. I wish the authors could resolve the unfair setting for ablation studies and provide more qualitative results. I am therefore willing to raise my score to borderline accept.

Formatting Issues

None

Author Response

Dear Reviewer, thank you for your response, your continued engagement, and your commitment to improving our paper! Please find our responses below.

Weakness 1: The main concern is the limited technical novelty, as the proposed LLM fine-tuning and RL solution is standard, with the only notable contribution being the reward design.

Thank you for acknowledging the practical value of our task and the effectiveness of our method. However, we respectfully disagree with the assessment that our work lacks technical contribution. Notably, all three other reviewers explicitly recognized the novelty and technical strength of our approach:

  • Reviewer A7cA stated: “The proposed approach is novel.”

  • Reviewer 6PYg commented: “The idea of combining visual and structural reward is interesting.”

  • Reviewer e8fZ noted: “Novel and interesting application to leverage MLLM’s ability to address complex table LaTeX generation.”

These consistent positive remarks affirm our view that the paper presents meaningful and original contributions.

  1. Revealing MLLM Capability and Proposing an Effective Solution

The task is underexplored, and one of our contributions lies in demonstrating that MLLMs can effectively handle table image to LaTeX generation—a task that is highly structured, syntactically complex, and not traditionally studied in the vision-language domain. We establish a complete and scalable MLLM-based training pipeline, incorporating large-scale supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). This pipeline shows that MLLMs, when properly guided, can directly generate high-quality LaTeX code, revealing their potential in structured code generation tasks previously considered beyond their typical scope.

  2. Technically Novel Reward Design for RL Fine-tuning

From a technical perspective, our proposed Visual-Structural Guided RL (VSGRPO) framework offers a novel and practical advancement. Unlike existing RL strategies that rely on rule-based textual matching, which are often brittle or ineffective for long, complex outputs like LaTeX, we propose a dual-reward system:

  • A visual reward, which evaluates the fidelity of the generated LaTeX by rendering and comparing it to the ground-truth image—providing a more direct and perceptually aligned quality measure.
  • A structural reward, which ensures syntactic and layout consistency in the generated code.

This dual-reward mechanism (sketched below) not only improves performance on our target task but also has the potential to generalize to other vision-to-code generation problems, such as web UI generation or document formatting.
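As an illustration of how the two signals can be combined per rollout, a minimal sketch is given below. `render_latex_to_image`, `cw_ssim`, and `teds_structure` are assumed helpers standing in for the rendering pipeline and the CW-SSIM / TEDS-Structure metrics, and the thresholds and the simple sum are placeholders rather than the exact configuration used in the paper.

```python
# Minimal sketch of the dual-reward computation in VSGRPO (assumed helpers,
# placeholder thresholds; not the released implementation).

def dual_reward(pred_latex: str, gt_image, gt_latex: str,
                vis_thresh: float = 0.6, struct_thresh: float = 0.8) -> float:
    try:
        pred_image = render_latex_to_image(pred_latex)   # compile the predicted LaTeX
    except RuntimeError:
        return 0.0                                       # non-compilable code earns no reward

    # Visual reward: CW-SSIM between the rendered prediction and the ground-truth image.
    r_visual = 1.0 if cw_ssim(pred_image, gt_image) >= vis_thresh else 0.0

    # Structural reward: TEDS-Structure between predicted and ground-truth table structure.
    r_struct = 1.0 if teds_structure(pred_latex, gt_latex) >= struct_thresh else 0.0

    return r_visual + r_struct
```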

Weakness 2: There are very few qualitative figures in this paper.

Thank you for pointing this out. We agree that qualitative results are important for providing intuitive insights into model behavior.

In our paper, we divided the evaluation into simple, medium, and complex tables, and provided four representative sets of qualitative visualizations in the Appendix, covering all three levels of table complexity. These examples were selected to illustrate typical error patterns among the compared methods, helping readers better understand the nature of improvements introduced by our approach.

We acknowledge that four examples are not sufficient to comprehensively showcase all aspects of model performance. However, due to strict page limitations, including a larger number of qualitative examples in the main paper was not feasible.

That said, we would like to emphasize that our evaluation is not solely reliant on qualitative examples. We have also included:

  • Extensive quantitative comparisons across multiple benchmarks.

  • A scientifically designed and rigorously conducted human evaluation.

Together, these three forms of evaluation—quantitative, qualitative, and human—consistently demonstrate the advantages of our proposed method over existing baselines.

Weakness 3: The effectiveness of the introduced two rewards is not significant as shown in Tab. 6.

Thank you for your feedback. We emphasize that the visual and structural rewards are crucial for improving model performance, both quantitatively and perceptually.

First, as shown in Table 6, our dual-reward mechanism significantly improves CW-SSIM during reinforcement learning, boosting scores by 3.5 and 2 points over no RL and text-only rewards respectively. It is important to note that even a 2-point increase in CW-SSIM is non-trivial and translates into visibly better table rendering quality, as confirmed by human visual inspection.

Second, to validate these gains from a human perspective, we conducted a blind human evaluation on 100 randomly selected complex table samples. Annotators compared the generated images to ground-truth images in randomized order, without knowing the source model. As summarized in Table R1, the dual-reward VSGRPO model was preferred 83 times, significantly outperforming the single-reward variants (Visual-only: 59, Structure-only: 44).

These results confirm that combining visual and structural rewards leads to more faithful and visually accurate LaTeX table generation.

Table R1: Results of human evaluation comparing the two rewards.

| Model | QwenVL-VSGRPO | QwenVL-Visual | QwenVL-Structure |
|---|---|---|---|
| Vote | 83 | 59 | 44 |

Question 1: Directly using the continuous similarity value as a reward would be more effective.

Thank you for this insightful comment. We indeed explored this direction by conducting an ablation study comparing continuous rewards (based on CW-SSIM values) with binary rewards (based on thresholding) using the Qwen2.5-VL-3B model. The results are summarized in Table R2:

Table R2: Ablation experiments on the effectiveness of continuous reward values.

| Complex-Dataset | CW-SSIM | Compile ratio | TEDS | TEDS-Structure |
|---|---|---|---|---|
| QwenVL2.5-VSGRPO-Continuous | 0.6056 | 0.9861 | 0.9253 | 0.8726 |
| QwenVL2.5-VSGRPO-Binary | 0.6145 | 0.9917 | 0.9270 | 0.8721 |

As shown, the continuous reward leads to slightly inferior performance. One possible explanation, supported by [1], is that higher reward variance can lead to stronger learning signals in RL. Binary rewards, especially when carefully thresholded, may create sharper gradients and more decisive feedback, which can help the model escape suboptimal regions in the reward landscape. In contrast, continuous rewards, being bounded and smoother, might lack this reinforcement strength. This remains an open research question, and we believe further investigation into the design of reward functions could be a promising direction for future work.
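To make the variance argument concrete, the toy computation below contrasts the within-group reward variance and the GRPO-style group-relative advantages, (r − mean) / std, obtained from the same four rollouts under a continuous CW-SSIM reward versus a thresholded binary one. The similarity values and the 0.62 cutoff are illustrative only, not the values used in our experiments.

```python
# Toy illustration (not from the paper): binary rewards yield much larger
# within-group reward variance than continuous CW-SSIM rewards for rollouts
# of similar quality.
import statistics

def group_stats(rewards):
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    advantages = [(r - mean) / (std + 1e-8) for r in rewards]
    return std ** 2, advantages

continuous = [0.58, 0.61, 0.63, 0.66]                      # similar rollouts, similar scores
binary = [1.0 if r >= 0.62 else 0.0 for r in continuous]   # thresholded (illustrative cutoff)

print(group_stats(continuous))   # variance ~0.00085: weak separation between rollouts
print(group_stats(binary))       # variance 0.25: much sharper separation in raw reward
```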

Reference

[1] Razin, Noam, et al. “What Makes a Reward Model a Good Teacher? An Optimization Perspective.” arXiv preprint arXiv:2503.15477 (2025).

Question 2: The "w/o SFT" model's worse performance in Tab. 7 might be due to a smaller training sample size (5,936 vs. full version), rather than solely indicating SFT's necessity.

We would like to clarify that the same number of training samples was used in both the "with SFT" and "without SFT" settings. As stated in Line 169, “we select 5,936 complex tables from the training dataset as the training set for VSGRPO.” This exact subset was used consistently across both settings during the reinforcement learning (RL) stage.

Therefore, the difference between the two methods in Table 7 reflects only the presence or absence of the SFT stage. The performance gap thus directly highlights the importance of SFT in preparing the model for effective RL fine-tuning.

We also note that fewer samples are used in RL than in SFT due to the higher computational complexity of the RL process.

Question 3: The performance gains from VSGRPO in Tab. 6 appear marginal; a discussion from the authors on this would be beneficial.

See above, in "Weakness 3".

Question 4: The Abstract and Introduction emphasize Table2LaTeX's challenges—large, nested, and semantically rich tables—yet the Method section lacks specific techniques to address them, creating a disconnect for the reader.

Thank you for pointing out this concern. We understand the reviewer’s impression and would like to clarify that the methodological components in our work are in fact carefully designed to address the core challenges highlighted in the Abstract and Introduction—namely, large table sizes, deeply nested structures, and semantically rich content.

As the reviewer noted, this is a practically valuable but underexplored task, and part of our contribution lies in articulating its unique difficulties. To effectively tackle these, we built a training pipeline that combines the strengths of Multimodal Large Language Models (MLLMs) with domain-specific enhancements:

  • Large-scale supervised fine-tuning (SFT) on diverse and richly annotated table–LaTeX pairs, enabling the model to generalize across various table structures and adapt to the specifics of the Table2LaTeX task. This step is critical for equipping the model with robustness in handling a wide range of real-world table layouts.

  • Reinforcement Learning with Complex Examples: During RL, we intentionally select complex tables—those with large size, deep nesting, and rich semantics—as training samples. This design ensures the model is directly optimized for the most challenging aspects of the task, and our data-selection ablation shows that it achieves superior performance compared with alternative selection strategies.

Comment

Thanks for the detailed response. The rebuttal has addressed some of my concerns. However, I still feel that my concerns about Question 2 are not well addressed. My question about the "w/ SFT" and "w/o SFT" settings is that they differ in two ways. (1) The training strategies are different. I understand this is what the authors want to compare to show the effectiveness of SFT. (2) The dataset sizes are also different. The w/o SFT (i.e., only VSGRPO) setting has a training set of 5,936 samples, which is significantly smaller than that of the "w/ SFT" setting. Therefore, this experiment does not necessarily indicate the importance of SFT, but rather the importance of data volume.

In addition, I still think that qualitative comparison is important for the task, and at least include more in the appendix.

Comment

Thank you very much for the constructive feedback. We are glad to hear that some of your concerns have been addressed.

Regarding the question on the effectiveness of SFT, we acknowledge your important observation that the difference in dataset size between the "w/ SFT" and "w/o SFT" settings could confound the interpretation of the results. You are absolutely right that the observed performance gains may reflect the combined effect of both training strategy and data volume. A more rigorous and accurate conclusion, which we will adopt in the final version, is that SFT on a large-scale dataset is necessary and beneficial for improving performance. We appreciate your insight and will revise our claims accordingly to reflect this nuance.

We also agree with your point that qualitative comparisons are crucial for this task. We will include more qualitative results and detailed analyses in the appendix of the final version to provide a richer and more comprehensive view of the model’s behavior.

We believe these issues relate to presentation clarity rather than core methodological flaws, and we appreciate your suggestions, which will help us significantly improve the final version. Thank you again for your thoughtful and helpful review.

Comment

Dear Reviewer,

We would also like to further clarify the motivation and design behind our training pipeline. Our task focuses on adapting a general-purpose multimodal large language model (MLLM) to a specialized table-to-LaTeX generation task, which exhibits a substantial domain gap from typical vision-to-language tasks such as image captioning or VQA. Given this gap, it is not surprising that supervised fine-tuning (SFT) on large-scale table–LaTeX paired data plays a critical role in adapting the MLLM to this new domain and serves as a strong initialization for downstream alignment.

As shown in Table 6 of the main paper, SFT alone produces highly competitive results, demonstrating its effectiveness. Building on this, reinforcement learning (RL) on a much smaller dataset further improves performance, indicating that RL can effectively refine and align the model beyond what is achieved with SFT alone. However, due to the high computational cost of RL, especially when using large models and fine-grained feedback signals, it is infeasible to apply RL at the same scale as SFT. Our goal is not to isolate the effect of SFT versus RL under equal conditions, but rather to highlight the value of the full pipeline—SFT followed by RL—for achieving strong performance.

We hope this additional explanation helps fully clarify the rationale behind our design choices and the interpretation of our results. We sincerely thank you again for your thoughtful and constructive feedback, and we are happy to discuss further if there are any additional questions or concerns.

Review
Rating: 5

The paper proposes Table2LaTeX-RL: a novel algorithm that trains a multimodal LM to generate faithful LaTeX code from images of structured tables. The detailed workflow is:

  • Collecting a large-scale Table2LaTeX corpus (≈1.2 M image–code pairs) by extracting LaTeX/table pairs from arXiv papers.
  • Performing supervised fine-tuning on a vision–language LLM with the Table2LaTeX dataset.
  • Using VSGRPO, a novel RL strategy, to further fine-tune the LLM. The RL signal combines two rewards: (1) a visual reward that matches the original table image with the image rendered from the generated LaTeX, calculated with CW-SSIM; (2) a structural-match reward that encourages alignment of LaTeX structure, measured by TEDS-Structure. A group of candidate responses is generated for each input query, and the average advantage of the group is used as the baseline against which RL optimisation is performed.

The extensive experiments compare the VSGRPO-trained model with both general-purpose multimodal LLMs and commercial/open-source LaTeX “snipping” tools. The results indicate that, over both visual fidelity (CW-SSIM) and structural accuracy (TEDS-Structure) metrics, the VSGRPO model outperforms all baselines. An ablation study further shows that each of the two RL reward signals contributes significant gains, with the combination yielding the best performance.

Strengths and Weaknesses

Strength:

  • The paper addresses an important topic related to processing structured tabular data at scale by converting them to LaTeX, which could have important applications in domains such as finance/trading.
  • The idea of combining visual and structural rewards is interesting and addresses a core challenge in modeling tabular data: complex, heterogeneous structure. By adding a visual reward, the method also handles cases where the table structure is not standard, i.e., not a nice data-frame-style column-row structure. Although prior work has explored generation of structured data from images [1,2], this paper seems to be the first to extend this to tabular data / LaTeX code.
  • The dataset of LaTeX code/table pairs will be a helpful contribution.
  • The benchmark metrics/datasets and the ablation study are to the point and comprehensive.

Weakness:

  • The paper does acknowledge LATTE [1] and even reports LATTE's numbers on an external benchmark (Table 3), with a footnote explaining that the scores were recalculated from the authors' released outputs. However, it offers no architectural discussion or ablation that contrasts why dual-reward RL outperforms LATTE's iterative-refinement strategy, which limits insight into VSGRPO's motivation and the claim of its novelty over existing visual-RL methods.
  • Although the Related Work section cites several vision-only table-structure models such as [2], none of those systems are re-implemented or evaluated in the experiment tables, leaving a gap in understanding how the proposed method stacks up against mature CV pipelines.
  • The training and evaluation corpora are drawn only from arXiv paper PDFs, and 94% of the training set consists of simple black-and-white tables. As a result, the model's robustness to scanned documents or colored backgrounds, such as financial data/earnings reports, is unknown.

[1] Jiang, Nan, et al. "LATTE: Improving Latex Recognition for Tables and Formulae with Iterative Refinement." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 39. No. 4. 2025.

[2] Nassar, Ahmed, et al. "Tableformer: Table structure understanding with transformers." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.

Questions

  • Could the authors provide architectural discussion or ablation studies that contrast why dual-reward RL outperforms LATTE’s iterative-refinement strategy, including evidence clarifying VSGRPO’s motivation and novelty?
  • Could the authors also evaluate standard vision-only table-structure models in the experiment tables, providing results that demonstrate how the proposed method compares to mature computer-vision pipelines? Or the authors should explain why such methods do not fit in this context.
  • I am curious whether the model trained on arXiv tables can generalize well to tables from other domains, such as finance reports. A case study on a real finance-table benchmark would provide interesting insights, for example on a random subset of FinTabNet.

Limitations

yes

Final Justification

Although the authors have satisfactorily clarified the LATTE comparison and demonstrated cross‐domain generalization, they have not provided any empirical results against established CV‐based baselines. Without at least one representative TSR model (direct LaTeX or HTML→LaTeX) in the evaluation, the relative benefit of the proposed dual-reward RL remains unquantified. Until this gap is addressed, I will maintain my current rating.

Formatting Issues

No major formatting issues found.

Author Response

We thank the reviewer for their comments, especially for highlighting that our paper could have "important application" and that our idea is "interesting". Below, we address the issues raised in the review:

Weakness 1: The paper cites LATTE[1] and reports its results, but lacks analysis comparing dual-reward RL to LATTE’s approach, limiting insight into VSGRPO’s motivation and novelty.

Thank you for raising this important point.

First, we would like to clarify that LATTE has not released its full implementation, particularly the code for the iterative refinement module. This component depends on a separately curated dataset, which is also not publicly available. As a result, we are unable to precisely reimplement LATTE or conduct a direct comparison beyond referencing the limited results provided in their paper.

Second, from a methodological perspective, our approach and LATTE differ fundamentally in design philosophy and execution. Our method adopts the LLM/MLLM paradigm by establishing an end-to-end pipeline encompassing large-scale supervised fine-tuning (SFT), and reinforcement fine-tuning (RFT) for the table-to-LaTeX task. In contrast, LATTE centers around an iterative post-refinement strategy, which is applied after initial LaTeX generation.

In other words, our goal is to embed high-quality LaTeX generation capabilities directly into the model, enabling accurate outputs at inference time without needing further correction. LATTE, on the other hand, relies on a separate, explicitly designed refinement stage to improve the quality of the initial outputs, which introduces an additional computational overhead and complexity during deployment.

These distinctions reflect different design goals: our method aims for integrated accuracy and efficiency, while LATTE emphasizes post-hoc quality improvement through refinement.

Weakness 2: Though cited, vision-only models like [2] aren’t evaluated, leaving unclear how the proposed method compares to established CV approaches.

Thank you for bringing this up. Our method differs fundamentally from traditional computer vision (CV) approaches by leveraging the text generation capabilities of Large Language Models (LLMs) rather than relying on rule-based decoding pipelines.

Traditional CV solutions like TableFormer primarily target HTML table reconstruction, which benefits from a relatively rigid and sequential syntax. These methods typically decompose the problem into two steps: first detecting the table layout, then predicting the cell content, following the HTML structure where each cell is enclosed in <td> </td> tags. The deterministic nature of HTML allows such systems to adopt rule-based strategies for both structure and content prediction.

In contrast, our task focuses on LaTeX table generation, which presents significantly greater complexity. LaTeX syntax is more flexible and less strictly structured than HTML, making it challenging to apply rule-based layout parsing or sequential tagging approaches. As such, we propose to treat table-to-LaTeX as a generation task, enabling LLMs to directly produce high-fidelity LaTeX code in a more holistic and end-to-end fashion. This not only aligns better with the open-ended nature of LaTeX syntax but also takes advantage of the strong reasoning and generation capabilities of modern MLLMs.

Weakness 3: Training on mostly simple arXiv tables (94%) limits insight into the model’s robustness to scanned or colored documents.

As rightfully requested by the reviewer, we have used two additional publicly available table datasets to further validate the effectiveness of our method: FinTabNet [3], which consists of real-world and complex scientific and financial tables with detailed structure annotations, and SynthTabNet [4], a dataset of Chinese financial announcements. We conducted a human evaluation to compare the visual similarity between the LaTeX-rendered images generated by our method and those from existing state-of-the-art MLLMs, against the ground truth table images.

We randomly selected 100 samples from the validation set. For each sample, the ground truth table image and the model's predicted table image were displayed side-by-side. The display order was randomly shuffled and labels were hidden to ensure an unbiased assessment. Multiple human assessors independently cast their votes for the image most visually similar to the target, with the majority vote determining the final result. As shown in Tables R1 and R2, our method also demonstrates consistently strong competitiveness against these baselines.

Table R1: Results of human evaluation on FinTabNet [3].

| Model | QwenVL-VSGRPO | QwenVL-SFT | Internvl3-78b | QwenVL2.5-72b | GPT4o |
|---|---|---|---|---|---|
| Vote | 77 | 50 | 14 | 13 | 2 |

Table R2: Results of human evaluation on SynthTabNet [4].

| Model | QwenVL-VSGRPO | QwenVL-SFT | Internvl3-78b | QwenVL2.5-72b | GPT4o |
|---|---|---|---|---|---|
| Vote | 72 | 47 | 12 | 8 | 4 |

Reference

[1] Jiang, Nan, et al. "LATTE: Improving Latex Recognition for Tables and Formulae with Iterative Refinement." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 39. No. 4. 2025.

[2] Nassar, Ahmed, et al. "Tableformer: Table structure understanding with transformers." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.

[3] Zheng, Xinyi, et al. “Global Table Extractor (GTE): A Framework for Joint Table Identification and Cell Structure Recognition Using Visual Context.” 2021 IEEE Winter Conference on Applications of Computer Vision (WACV) (2020): 697-706.

[4] Hou, Qiyu, et al. “Synthesizing Realistic Data for Table Recognition.” IEEE International Conference on Document Analysis and Recognition (2024).

Question 1: Can the authors provide analysis or ablation to show why dual-reward RL outperforms LATTE and clarify VSGRPO’s motivation and novelty?

See above, in "Weakness 1".

Question 2: Can the authors evaluate vision-only models or explain why they’re unsuitable to clarify how their method compares to mature CV pipelines?

See above, in "Weakness 2".

Question 3: Can the model trained on arXiv tables generalize to other domains like finance, e.g., via case studies on FinTabNet?

See above, in "Weakness 3".

Comment

Thank you for the detailed rebuttal. Your clarification on the LATTE comparison and the additional evidence on generalization beyond arXiv tables are helpful. However, I remain concerned about the absence of comparisons with CV-based systems.

  • Baseline capability. There exists at least one modern CV baseline that adopts an end-to-end approach and generates LaTeX directly [1]. The authors’ definition of “end-to-end” vs. “two-step” also feels oversimplified. Systems that jointly learn structure and content within a single neural model and decode a markup sequence in one pass still fit the conventional meaning of end-to-end, even if the architecture uses multiple heads or loss terms.
  • Feasibility of a fair benchmark. Even when a baseline outputs HTML, it is practical to convert HTML→LaTeX (e.g., with Pandoc). While some stylistic fidelity may be lost, the same structure/visual-fidelity metrics (e.g., TEDS-Structure, CW-SSIM) remain applicable to the converted LaTeX.
  • Value of analysis. Even if such CV pipelines underperform after conversion, an empirical comparison and a short analysis of the failure modes would provide useful insight and help substantiate the claimed advantages and novelty of your design.

I suggest including at least one representative CV baseline in the final version (direct LaTeX output where available, or HTML→LaTeX conversion otherwise). I will keep my rating for now.

[1] Kayal, Pratik, et al. “Tables to LaTeX: Structure and Content Extraction from Scientific Tables.” 2022. (arXiv preprint).

Comment

Thank you very much for your constructive feedback and for maintaining the acceptance rating. We appreciate your suggestion regarding comparison with CV-based pipelines.

In response, we conducted additional experiments to evaluate such baselines. Specifically, we examined TabtoTex [1], a method that directly generates LaTeX from table images. However, we found it unsuitable for our benchmark due to its inherently limited output length, which prevents it from handling complex tables with longer LaTeX sequences.

We also evaluated TFLOP [2], a more recent and capable CV-based system. TFLOP reformulates table structure recognition as a text-region pointing task, leveraging layout cues to predict structural tags and associated content. Using TFLOP's API, we first converted our test set into HTML format. Since Pandoc showed poor fidelity in converting complex HTML tables—particularly those with multi-level headers—we employed BeautifulSoup, a widely used HTML parsing library, to implement a more reliable HTML-to-LaTeX conversion pipeline.
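A minimal sketch of such an HTML-to-LaTeX conversion with BeautifulSoup is shown below. It illustrates the general approach under the assumption of simple tables (no rowspan/colspan or styling handling); it is not the exact pipeline behind the numbers in Table R1.

```python
# Minimal sketch of an HTML -> LaTeX tabular conversion with BeautifulSoup
# (illustrative only; ignores rowspan/colspan and styling, which is exactly
# where such conversions lose structure on complex tables).
from bs4 import BeautifulSoup

def html_table_to_latex(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find_all("tr")
    lines = [" & ".join(cell.get_text(strip=True) for cell in row.find_all(["td", "th"]))
             + r" \\" for row in rows]
    n_cols = max(len(row.find_all(["td", "th"])) for row in rows)
    header = r"\begin{tabular}{" + "c" * n_cols + "}"
    return "\n".join([header, r"\hline", *lines, r"\hline", r"\end{tabular}"])
```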

Despite these efforts, the CV pipeline demonstrated poor performance on our benchmark for complex tables. As shown in Table R1, the converted outputs suffer from significant style and structure degradation. Specifically:

  • During HTML generation, TFLOP frequently reduces complex table layouts into flat, cell-by-cell structures, losing hierarchical and spanning information.

  • Formula content is often mishandled or lost entirely.

  • The HTML-to-LaTeX conversion further compounds this information loss, especially for tables with intricate formatting.

Nonetheless, in simpler cases with fewer structural variations, the CV pipeline yields moderately high CW-SSIM scores, indicating some level of visual consistency. Due to time constraints, we were unable to conduct a more comprehensive evaluation and deeper analysis across a wider range of CV-based systems. We plan to include a broader and more detailed empirical comparison, along with an analysis of failure modes, in the final version. We believe that this preliminary comparison already highlights the limitations of existing CV-based approaches in handling structurally rich and semantically dense tables, and further supports the advantages of our proposed method.

Table R1: Results in comparison with the CV pipeline.

| Complex-Dataset | CW-SSIM | TEDS-Structure |
|---|---|---|
| TFLOP [2] | 0.4869 | 0.5708 |
| QwenVL2.5-VSGRPO | 0.6145 | 0.9218 |

Reference

[1] Kayal, Pratik, et al. “Tables to LaTeX: Structure and Content Extraction from Scientific Tables.” 2022. (arXiv preprint).

[2] Khang, Minsoo, and Teakgyu Hong. “TFLOP: Table Structure Recognition Framework with Layout Pointer Mechanism.” arXiv preprint arXiv:2501.11800 (2024).

Comment

I appreciate the additional experiments, which are very insightful in showcasing the limitations of the traditional CV approach. I will increase my rating to 5. Also, I encourage the authors to include these baselines in the final version, as well as brief statistics on training/inference time of different methods.

Comment

Dear reviewer,

Thank you for the follow-up and for recognizing the value of our additional experiments. We appreciate your decision to raise the rating to 5 and will incorporate all of your valuable suggestions in the final version.

Review
Rating: 3

This paper studies the task of generating publication-ready LaTeX code for tables from the visual image of the table. They solve this problem by fine-tuning a pre-trained MLLM on a table-to-LaTeX dataset, with a newly proposed dual-reward RL method based on GRPO. Instead of directly using the outcome reward, they design a structure-level reward particularly tailored to LaTeX code and also a visual fidelity reward from rendered outputs. They show the effectiveness of their approach on their own collected test dataset, which is processed from LaTeX tables in arXiv papers.

Strengths and Weaknesses

Strengths

  1. Generating publication-ready LaTeX source code from visual table input is of important application value, and how to improve the understanding of structured table images while jointly generating code is an underexplored research problem.
  2. The proposed approach is novel in that they design two task-specific rewards: 1. a structure-level reward for the LaTeX code; 2. a visual fidelity reward from rendered outputs. This dual reward provides richer signals than a simple outcome reward.

Weaknesses

  1. The experimental comparison is not entirely fair: the training and test datasets are built from the same source, namely LaTeX tables from arXiv. It is unfair to fine-tune the proposed models, e.g., Intern2-VL-VSGRPO and Qwen2.5-VL-VSGRPO, on this training set and then compare them with other baselines on in-domain test sets.
  2. Instead of independent votes in the human evaluation table, it would be better to compare using VSGRPO vs. not using VSGRPO and report the winning rate.
  3. VSGRPO is not compared with existing RL approaches, e.g., GRPO. It is important to show the effectiveness of the newly proposed framework over existing, simpler approaches.

Questions

  1. How about just using the outcome reward? Does it also bring similar gains? E.g., Intern2-VL-1B-GRPO vs. Intern2-VL-1B-VSGRPO.

Limitations

The main concerns are: 1. lacking comparison with other RL frameworks; 2. only verifying on an in-domain test set, which is unfair.

Formatting Issues

NA

Author Response

Dear Reviewer, thank you for your thorough review and helpful suggestions! We also appreciate you mentioning that our paper has "important application value" and our approach is "novel". Please find our responses and changes below.

Weakness 1: The results are biased since both training and test data come from arXiv tables, making it unfair to fine-tune models like Intern2-VL-VSGRPO and compare them with baselines on in-domain tests.

We appreciate the reviewer’s thoughtful concern regarding evaluation fairness and the potential in-domain bias stemming from training and testing on LaTeX tables sourced from arXiv. In response, we have extended our evaluation to two additional publicly available out-of-domain table datasets: FinTabNet [1], which consists of real-world, complex scientific and financial tables with detailed structure annotations, and SynthTabNet [2], a dataset of Chinese financial announcements. Both datasets, primarily derived from financial documents and reports, are markedly out-of-domain relative to arXiv LaTeX tables: they differ in content domain, table style, layout complexity, visual noise, and even the presence of colored cells, providing a challenging and comprehensive testbed for evaluating model generalization.

Unlike the arXiv-based LaTeX tables, these datasets are typically provided in HTML or image format without LaTeX source code. Therefore, CW-SSIM scores cannot be computed, as they require rendering images from ground-truth LaTeX for comparison. To address this, we conducted a human evaluation study comparing the visual fidelity of LaTeX-rendered tables generated by our method against state-of-the-art MLLMs, using the original table images as ground-truth references.

We randomly selected 100 samples from the validation set and displayed the ground truth table image at the top and the models' predicted table images below it, side by side, with the order randomly shuffled and labels hidden. Multiple human assessors independently voted on which image appears most visually similar to the target, with the final outcome determined by majority vote. As shown in Tables R1 and R2, the results of this human study demonstrate that our method consistently achieves superior visual quality across both datasets, indicating its strong generalization capability to tables from different domains and formats. These findings provide further evidence of the robustness and practical applicability of our approach beyond the arXiv domain.

Table R1: Results of human evaluation on FinTabNet.

| Model | QwenVL-VSGRPO | QwenVL-SFT | Internvl3-78b | QwenVL2.5-72b | GPT4o |
|---|---|---|---|---|---|
| Vote | 77 | 50 | 14 | 13 | 2 |

Table R2: Results of human evaluation on SynthTabNet.

| Model | QwenVL-VSGRPO | QwenVL-SFT | Internvl3-78b | QwenVL2.5-72b | GPT4o |
|---|---|---|---|---|---|
| Vote | 72 | 47 | 12 | 8 | 4 |

Reference

[1] Zheng, Xinyi et al. “Global Table Extractor (GTE): A Framework for Joint Table Identification and Cell Structure Recognition Using Visual Context.” 2021 IEEE Winter Conference on Applications of Computer Vision (WACV) (2020): 697-706.

[2] Hou, Qiyu, et al. “Synthesizing Realistic Data for Table Recognition.” IEEE International Conference on Document Analysis and Recognition (2024).

Weakness 2: The human evaluation should directly compare using VSGRPO vs. not using VSGRPO and report the winning rate.

Thank you for this insightful suggestion. We agree that a direct comparison between our method with and without VSGRPO provides a clearer understanding of its contribution. To this end, we conducted a focused human evaluation on 100 randomly selected samples, comparing the full model (with VSGRPO) against its ablated variant (without VSGRPO). The results are as follows:

  • VSGRPO was preferred in 52 cases.

  • The variant without VSGRPO was preferred in 18 cases.

  • 30 cases were judged as indistinguishable in quality.

These results demonstrate a clear preference for the VSGRPO-enhanced version. In addition, as part of our out-of-domain evaluation (detailed in response to the previous comment), we also performed comparisons between the two variants on external datasets (FinTabNet and SynthTabNet). The outcomes consistently showed that the VSGRPO-enhanced model produces LaTeX-rendered table images with higher visual similarity to the ground truth, further validating its effectiveness.

Weakness 3: VSGRPO isn’t compared with existing RL methods like GRPO, making it hard to show its advantage.

We appreciate the reviewer’s suggestion and would like to clarify that we have included a direct comparison with GRPO as a baseline in our paper. Specifically, the original GRPO framework utilizes a rule-based verifier to assess the quality of generated text against the ground truth. In the context of table generation, this would involve comparing the generated LaTeX code with the ground-truth LaTeX using a manually designed verifier. However, unlike typical question-answering tasks where the ground truth is relatively short and correctness is easier to verify, LaTeX code is often long, structurally complex, and challenging to evaluate using exact matching or heuristic rules. To adapt GRPO to our task fairly and meaningfully, we replaced the original rule-based verifier with a TEDS-Structure-based verifier, which is better suited to assessing structural and visual similarity between table representations. The performance of this adapted GRPO variant is reported in Table 6 as Qwen2.5-VL-3B-GRPO-TEDS-Structure. As shown in the results, our proposed VSGRPO significantly outperforms this GRPO baseline, highlighting the effectiveness of our framework in handling the structured nature and generation complexity of LaTeX tables.

Question 1: Does using only the outcome reward yield similar gains, e.g., Intern2-VL-1B-GRPO vs. VSGRPO?

See above, in"Weakness 3".

Comment

Dear Reviewer,

As the discussion period is coming to a close, we wanted to kindly follow up. We have provided a comprehensive response to your comments and hope that we have adequately addressed your concerns. If you have any further questions or would like to discuss any aspect in more detail, we would be very happy to continue the discussion.

Thank you again for your valuable feedback.

Comment

We appreciate the opportunity for this discussion. As the reviewer-author discussion comes to a close, we note that the reviewer has not yet responded to our original rebuttal. We believe we have carefully addressed the concerns raised and have provided clarifications and additional evidence where needed. We respectfully request that the Area Chair carefully review our responses and supporting materials when making the final decision.

Final Decision

This paper tackles the underexplored problem of converting table images to publication-ready LaTeX code. The authors do this with an MLLM trained via a novel dual-reward RL strategy (visual fidelity + structural alignment).

The application, the dataset, and the idea of combining the two reward signals work well. There are some concerns about limited novelty beyond the reward design and reliance on in-domain (mostly arXiv tables) evaluation. There are also concerns about insufficient comparisons to existing RL and CV-based baselines and limited qualitative analysis. As such, the paper remains a borderline case.