PaperHub
Overall score: 4.9 / 10
Decision: Rejected · 4 reviewers
Reviewer ratings: 3, 4, 2, 2 (min 2, max 4, std 0.8)
ICML 2025

TableMaster: A Recipe to Advance Table Understanding with Language Models

OpenReview · PDF
Submitted: 2025-01-11 · Updated: 2025-06-18
TL;DR

We analyze the challenges of table understanding with language models and propose a recipe and comprehensive framework to address them.

Abstract

Keywords
Table Understanding · Table Reasoning · Natural Language Processing · Language Model

Reviews and Discussion

Official Review
Rating: 3

The paper presents TableMaster, a framework aimed at improving table understanding with large language models. The authors identify four key challenges in table-based reasoning: (1) difficulty in locating target data (LLMs struggle to find relevant parts of large tables), (2) deficiency in table semantics (tabular data lacks rich semantic context), (3) numerical inaccuracies in textual reasoning (LMs make arithmetic errors), and (4) semantic inflexibility in symbolic reasoning (code-based reasoning lacks adaptability).

To address these issues, TableMaster integrates several techniques, including (1) Table-of-focus construction to extract relevant table portions, (2) Table verbalization to add descriptive context, (3) Program-aided reasoning for better numerical handling, and (4) Table normalization and text-guided symbolic reasoning to enhance structured processing.

The framework dynamically switches between textual and symbolic reasoning, adapting to the query at hand. Experiments demonstrate that TableMaster achieves state-of-the-art performance on WikiTQ (78.13% accuracy with GPT-4o-mini) and other datasets.
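For concreteness, the staged flow described above can be sketched as follows; this is a minimal illustration, where the `llm` callable, every prompt string, and `run_sandboxed` are hypothetical placeholders, not the authors' implementation:

```python
# Minimal sketch of the staged flow; the `llm` callable, every prompt string,
# and `run_sandboxed` are hypothetical placeholders, not the authors' code.
def table_master(table: str, question: str, llm) -> str:
    # Table-of-focus: keep only rows/columns relevant to the question.
    focus = llm(f"Keep only the rows and columns relevant to '{question}':\n{table}")
    # Table verbalization: add descriptive semantic context.
    verbal = llm(f"Describe this subtable in plain prose:\n{focus}")
    # Reasoning-strategy assessment: route between textual and symbolic reasoning.
    strategy = llm(f"Reply 'textual' or 'symbolic' for this question: {question}")
    if strategy.strip().lower() == "symbolic":
        # Program-aided reasoning for reliable numerical handling.
        code = llm(f"Write Python over this table to answer '{question}':\n{focus}")
        return run_sandboxed(code)  # hypothetical sandboxed executor
    return llm(f"Table:\n{focus}\nContext: {verbal}\nQuestion: {question}")
```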

========AFTER REBUTTAL========

Many thanks for the authors' response. After reading all the reviews and responses, I would like to keep my initial scores, and I recommend that the authors include the updates mentioned in the rebuttal phase in the next version of the paper.

Questions for Authors

  1. In the second sub-figure of Figure 2(a), there is a relatively obvious upward trend for both GPT-4o and GPT-3.5 from medium-sized to large-sized tables. Is there any explanation for this phenomenon?
  2. According to the task formulation in Section 4.1, the given table T does not explicitly contain row or column headers. Are they ignored, or treated the same way as ordinary cell values in the implementation? In Section 4.2, the authors claim that they extract the top headers and key columns (lines 306--307) and use them for column lookup. If that is the case, the components of a table should be detailed in the task formulation.

Claims and Evidence

  1. The claim that TableMaster effectively enhances table reasoning in LMs is well-supported by empirical results (Table 1) with superior performance over prior approaches (e.g., Chain-of-Table, Binder, PoTable).
  2. The validity of the challenges for table understanding is demonstrated by the empirical evidence in Figure 2 and analyzed in Section 3. For Figure 2(b), the authors formulate the input of verbalized tables as the original table plus LLM-generated narrative text based on information from the table itself. I wonder whether the same observation holds for tables with originally-attached textual context (e.g., the FinQA dataset). Would the improvements be larger or smaller, compared with Figure 2(b)?
  3. The usefulness of each component of TableMaster is backed by ablation studies (Table 2); for the reasoning component in particular, accuracy drops by 4.28% when textual reasoning is removed and by 2.03% when symbolic reasoning is removed.
  4. The paper claims that TableMaster generalizes across models (GPT-4o-mini, Llama 3, GPT-3.5-Turbo), which is convincingly demonstrated by consistent improvements across these models.

Generally, all major claims are clearly supported by empirical evidence in the paper.

Methods and Evaluation Criteria

Evaluation datasets: The paper evaluates on WikiTQ (QA), TabFact (fact verification), and FetaQA (free-form QA), which are widely used benchmarks, making the comparison valid.

Metrics and baselines: The study uses accuracy (WikiTQ, TabFact) and BLEU/ROUGE (FetaQA), which are appropriate for the tasks. TableMaster is also compared against appropriate baselines such as Binder, Chain-of-Table, and PoTable, ensuring a fair evaluation. Ablation studies were also conducted to demonstrate the necessity of each component.

Overall, the evaluation is solid, but it would be helpful to include some error analyses (e.g., case studies where TableMaster fails).

Theoretical Claims

The paper mainly focuses on empirical evaluation and does not really involve theoretical proofs. Here are some points that could be improved.

  • The task formulation (Section 4.1) should be written more formally, with clear definitions of the input and output and their forms (see also Question 2 above).
  • The adaptive reasoning mechanism is intuitive, but the paper would benefit from a deeper analysis of conditions where adaptive reasoning outperforms static approaches.

Experimental Design and Analysis

The experimental setup is generally well-structured with multiple datasets, baselines, and ablation studies.

For the validity of the challenges, Figure 2 provides visual analyses of model performance under different conditions (e.g., the effect of table size and numerical complexity).

One limitation is that the paper does not include detailed failure-case analyses. An error analysis could help explain why TableMaster fails on certain queries.

Supplementary Material

The appendix contains detailed experimental settings, dataset descriptions, and additional analyses.

The authors also provide open-source code, though it would be better to attach a detailed README file :)

The table-of-focus re-construction algorithm is provided in Appendix H, but it would be better to integrate it into the main text.

Relation to Prior Work

The paper appropriately situates itself relative to previous work in LLM-based table understanding, and relates to common LLM techniques such as Chain-of-Thought and Program-of-Thought.

Unlike fine-tuned models, TableMaster adapts general LLMs without retraining, making it widely applicable.

The idea of table verbalization is related to Table-to-Text generation but is differently applied in this work.

Missing Key References

The references and reviewing of existing works generally look good to me. I cannot think of any important work that is missing from the citations.

Other Strengths and Weaknesses

N/A, see detailed comments above.

Other Comments or Suggestions

N/A, see detailed comments above.

Author Response

We sincerely appreciate your thoughtful review and address your concerns below:


[W1] Verbalization for originally-attached textual context

We conducted ablation experiments on the originally-attached textual context in FinQA [1], using two GPT models for end-to-end direct inference:

Method                 Accuracy
GPT-4o-mini            50.7
GPT-4o-mini w/o text   38.9
GPT-4o                 63.1
GPT-4o w/o text        50.8

There is a noticeable performance drop in FinQA when the originally-attached textual context is removed. This drop is significant because the textual context in FinQA provides a lot of necessary information needed to answer questions. In our TableMaster experimental setting, we assume the table context is complete, and therefore, table verbalization is used to enhance information that is not as crucial.

[W2] Deeper analysis of conditions where adaptive reasoning outperforms static approaches

As stated in Appendix J ("Analysis of Adaptive Reasoning"), textual reasoning generally performs better than symbolic reasoning due to its natural chain-of-thought, while symbolic reasoning is more effective for complex computations. Based on the reasoning strategy assessment, the LLM can dynamically select the most suitable approach for table understanding, for example choosing symbolic reasoning when faced with a large table requiring complex computation. Moreover, adaptive reasoning is more efficient than static approaches like self-consistency, as it only requires a single sample.
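For illustration, the selection step can be sketched as follows (a minimal sketch with assumed prompt wording; `run_python` stands in for a sandboxed executor and is not our released code):

```python
# Illustrative sketch of the adaptive choice: one assessment call routes the
# question to a single reasoning pass. `run_python` is a hypothetical executor.
def reason_adaptively(table: str, question: str, llm) -> str:
    verdict = llm(
        "Does answering this question require multi-step arithmetic or "
        f"aggregation over many rows? Reply yes or no.\nQuestion: {question}"
    )
    if verdict.strip().lower().startswith("yes"):
        # Symbolic path: generate and execute a program for the computation.
        program = llm(f"Write Python over this table to answer:\n{table}\nQ: {question}")
        return run_python(program)
    # Textual path: natural chain-of-thought over the (verbalized) table.
    return llm(f"Answer step by step.\n{table}\nQ: {question}")
```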

[W3] Obvious trend for both GPT-4o and GPT-3.5 to improve from medium-sized to large-sized tables

The impact of table size on LLM table understanding is most pronounced with respect to the number of rows. We hypothesize that in a long table with many rows, context information becomes sparse because headers appear only at the top. Additionally, similar information is repeated many times, making it more difficult for the LLM to understand the table and the specific meaning of each data row. When comparing each base model with its corresponding TableMaster variant, we observe that TableMaster slows the decline by constructing a table-of-focus.

[W4] Task formulation of table T

The formulation is indeed somewhat vague. Initially, we treated row and column headers the same way as ordinary cell values. The structural information is contained in the table, and after structure extraction, table T can be represented as:

T_{m \times n} = \begin{bmatrix}
H_0 & H_1 & H_2 & \cdots & H_n \\
K_1 & C_{1,1} & C_{1,2} & \cdots & C_{1,n} \\
K_2 & C_{2,1} & C_{2,2} & \cdots & C_{2,n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
K_m & C_{m,1} & C_{m,2} & \cdots & C_{m,n}
\end{bmatrix}
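For illustration only (names and the grid layout are assumptions, not our released code), this structure maps naturally onto a simple container:

```python
# Illustrative mapping of the formulation above into Python (names are
# assumptions): H are the top headers, K the key column, C the value cells.
from dataclasses import dataclass

@dataclass
class StructuredTable:
    headers: list[str]       # H_0, H_1, ..., H_n (H_0 labels the key column)
    key_column: list[str]    # K_1, ..., K_m, the subject of each row
    cells: list[list[str]]   # cells[i][j] corresponds to C_{i+1, j+1}

def from_grid(grid: list[list[str]]) -> StructuredTable:
    # The raw grid treats headers and key values like ordinary cells;
    # structure extraction separates them out.
    return StructuredTable(
        headers=grid[0],
        key_column=[row[0] for row in grid[1:]],
        cells=[row[1:] for row in grid[1:]],
    )
```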

[W5] Error analysis

We conducted a comprehensive error case study during the development of TableMaster. The main causes of error fall into three categories: inaccurate subtable extraction, suboptimal reasoning-strategy selection, and mistakes in textual or symbolic reasoning. We believe these issues stem from the inherent limitations of LLMs: TableMaster is designed to enhance the base table-understanding ability of LLMs, yet there remain upper bounds set by their fundamental capabilities.

We will add a detailed README to the code, and incorporate your suggestions in the revision. Thank you!


[1] FinQA: A Dataset of Numerical Reasoning over Financial Data, EMNLP 2021.

Official Review
Rating: 4

The authors introduce TABLEMASTER, a recipe and comprehensive framework that integrates multiple solutions to overcome the obstacles in table understanding. The obstacles are:

  1. difficulty in locating target data

  2. deficiency in table semantics

  3. numerical inaccuracies in textual reasoning

  4. semantic inflexibility in symbolic reasoning

The authors' approach uses multiple LLM calls per table to break the tabular-understanding problem down into stages and simpler subquestions.

In the first stage, structure is extracted from the table.

In the second stage, a variety of subtasks occur. The question is analyzed and, depending on the result, code is used to compute the necessary numerical results. The table is 'verbalized', i.e., translated into a short semantic paragraph. Subtables are conditionally extracted. Candidate row indices are searched, and an information-estimation query attempts to predict whether the question is answerable given the available data.

Example of a 'verbalized table': 'The table provides a list of nominations and results for the artist Leona Lewis and her songs "Bleeding Love" and "Spirit" over the years 2007, 2008, and 2009. In 2007, Leona Lewis won for her work, and specifically for the song "Bleeding Love". Moving on to 2008, Leona Lewis continued her winning streak with multiple wins for her work and the song "Bleeding Love". Additionally, she was nominated for the song "Spirit" in the same year. In 2009, Leona Lewis was nominated for her work and won for both "Bleeding Love" and her other songs. Overall, Leona Lewis had a successful run with multiple wins and nominations for her music over the years.'
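Conceptually, producing such a paragraph is a single generation call, along the lines of this minimal sketch (the prompt wording is my assumption, not the paper's actual prompt):

```python
# Sketch of table verbalization as a single LLM call; the prompt wording is an
# assumption, not the paper's actual prompt.
def verbalize(table_markdown: str, llm) -> str:
    return llm(
        "Rewrite the following table as a short narrative paragraph, "
        "explicitly naming each entity, year, and outcome:\n" + table_markdown
    )
```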

Via this multi-stage approach, the authors systematically break down TableQA for the LLM, and thereby achieve superior performance compared to baseline methods, which tend to attempt problems directly with single or few-stage prompting.

UPDATE AFTER REBUTTAL: My feeling about this work, both before and after the rebuttal, is that it deserves to be accepted. I am disappointed that the reviewers who voted to reject this work did not engage with the authors during the rebuttal period; this paper is large and contains many experiments, which I think led to some confusion about which results were present. If you peruse the authors' rebuttals of the most critical reviews, you can see that much of what the critical reviewers ask for is already in the work. I am increasing my score in the hope that this work is accepted. Best of luck to the authors.

Questions for Authors

I have no questions.

Claims and Evidence

The core claims seem valid, and the evidence adequate to support them.

Methods and Evaluation Criteria

The methods and evaluation criteria are standard, and seem to be implemented in a standard way, which is good.

Theoretical Claims

I reviewed no theoretical claims.

Experimental Design and Analysis

These are comprehensive and well-documented; I do have one suggestion, however. Figure 2 nicely illustrates the failure modes of LLMs with long and non-normalized tables, but it does not show how TableMaster fares; could a link to the relevant appendix material be added?

Supplementary Material

The authors are to be commended on their exceptional appendix, which is extensive and well-documented.

The linked codebase README, however, is almost entirely empty; please update it to include all necessary information to reproduce at least some of the experiments described in the paper.

Relation to Prior Work

The authors have done a good job of situating their work in the broader literature.

Missing Key References

https://arxiv.org/pdf/2403.19318 is highly relevant to this work, but I don't believe it is cited; the authors should consider referencing it.

Other Strengths and Weaknesses

In general, I think this paper is a worthwhile contribution to the literature. It is comprehensive and well-documented. The method is straightforward, which is good. The experimental results are adequate, but could be improved by including more baselines.

My main objection is that the limitations section in A.1 is a bit slender; the authors only briefly mention the limitation of their "Table Peek" method, namely that it is bounded by the context window. The method will miss information on realistic, large tables (for a modest example, see https://d-nb.info/1280804238/34), and this is borne out by the experiments in Appendix F. The authors also do not conduct an extensive study of the number of tokens consumed by their method, which relies on many calls to SOTA LLMs. It will be slow and expensive compared to baselines, some of which, like https://arxiv.org/pdf/2403.19318, rely only on 8B models.

Another limitation they do not discuss is that the key column is expected to contain meaningful values instead of IDs (per the "structure" prompt); real-world key columns are often IDs without semantically interpretable values. The method is also limited by the size of the context window (10k rows in the authors' experiments, though for real tables this bound would depend on the number of columns as well).

Minor: in Sec. 5.1, there seems to be a broken link: "Tables are encoded in Markdown format before being input into language models, with or without addresses, depending on the specific case ??."

Minor: The link to the codebase is in the conclusion, which is not a standard place for new information; please move it to the abstract and duplicate the reference in the introduction.

Other Comments or Suggestions

I have no other comments or suggestions.

Author Response

We sincerely appreciate your thoughtful review and address your concerns below:


[W1] How TableMaster fares when conducting table normalization.

The datasets TabFact, WikiTQ, and FetaQA all contain clean, normalized tables, which is why we did not need to apply table normalization to them. As we state in Appendix B ("Impact of Noisy Tables"), we constructed a setting for the table normalization task and evaluated baseline performance for the challenge analysis. We therefore propose table normalization as the part of the recipe intended for real-world (wild) tables in non-ideal cases.

[W2] Concern about Efficiency

We acknowledge that TableMaster may not be efficient when all of its solutions are used for table understanding, and we conduct a comprehensive analysis of efficiency in Appendix G. However, in this paper we aim to present TableMaster as a general recipe for table understanding. For different downstream tasks or application scenarios, TableMaster can be adapted accordingly; certain components or steps can be removed to balance accuracy and efficiency case by case. Overall, we view this paper as presenting a recipe for future LLM table-understanding frameworks, rather than a rigid, unmodifiable method.

[W3] Concern about Table Peek and Information Missing

As mentioned in [W2], Table Peek is a trade-off between efficiency and accuracy. We provide a detailed analysis of this in Appendix F ("Performance Analysis Under Different Table Peek Sizes").

[W4] Limitation of the key column being expected to contain meaningful values

This is not a critical aspect of the design. At the beginning of our research, we found that including the key column as a subject was more natural and beneficial for subsequent table verbalization. We also observed that when LMs select columns to extract a subtable, they often ignore the key column (which typically serves as the subject of a row), yet we may need to filter rows based on the subject's information for the question, so the key column containing the subject must be included and selected. Therefore, we first extract the key column and include it as part of the information for subsequent steps. If there is no meaningful key column, such as one containing only IDs, this does not significantly impact TableMaster in theory.

[W5] Broken Link

This actually refers to Appendix L.2, “Case Study of TableMaster.” Thank you for pointing out this error.

We will update the README in the codebase, cite the paper you mentioned, and incorporate your suggestions in the revision. Thank you!


Official Review
Rating: 2

The paper introduces TableMaster, a comprehensive framework designed to improve language models' ability to understand tabular data. The authors identify four key challenges in table understanding: difficulty locating target data, deficiency in table semantics, numerical inaccuracies in textual reasoning, and semantic inflexibility in symbolic reasoning. To address these issues, TableMaster extracts relevant table content, verbalizes it with enriched semantic context, and implements adaptive reasoning that dynamically switches between textual and symbolic approaches based on query requirements. The framework demonstrates significant effectiveness, achieving 78.13% accuracy on the WikiTQ dataset using GPT-4o-mini, which surpasses existing baselines.

Questions for Authors

No questions.

Claims and Evidence

The authors identify four key challenges in table understanding and propose solutions through their TableMaster framework. However, while they report overall performance improvements on datasets like WikiTQ, the paper lacks detailed analysis demonstrating how effectively each specific challenge is addressed by their approach.

Methods and Evaluation Criteria

The proposed TableMaster framework presents a reasonable approach to table understanding, though it faces two limitations:

  1. Its structure extraction component primarily accommodates regular relational and semi-structured tables, while struggling to effectively handle more complex irregular table structures that contain multiple headers or nested hierarchical relationships.
  2. A notable concern with the TableMaster workflow is its reliance on multiple prompt calls to process a single question, yet the paper lacks comparative analysis of token efficiency across methods. This omission leaves readers unable to evaluate whether the additional computational overhead and token consumption from these multiple LLM calls is sufficiently justified by the reported performance improvements.

Theoretical Claims

No theoretical contribution was presented by the authors.

Experimental Design and Analysis

The paper only reports results on two public datasets: WikiTQ and TabFact. Some additional experiments on FeTaQA are included in the Appendix.

Supplementary Material

Supplementary materials are included.

Relation to Prior Work

The paper exhibits limited novelty as its core methodological contributions have already been established in prior research. For example, the extraction of sub-tables from raw tables to enhance understanding was previously introduced in works such as [1, 2], while the integration of textual and symbolic reasoning approaches was thoroughly explored in [3]. These existing publications have already addressed similar challenges and proposed comparable solutions for table understanding with large language models, raising questions about the paper's original contributions to the field.

[1] Large language models are versatile decomposers: Decompose evidence and questions for table-based reasoning. SIGIR 2023.

[2] Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding. ICLR 2024.

[3] Rethinking Tabular Data Understanding with Large Language Models. NAACL 2024.

Missing Key References

No missing key references were identified.

Other Strengths and Weaknesses

No other strengths or weaknesses.

Other Comments or Suggestions

No other comments.

Author Response

We sincerely appreciate your thoughtful review and address your concerns below:


[W1] Lack of detailed analysis demonstrating how each specific challenge is addressed.

TableMaster is a recipe-style framework for table understanding. We provide an analysis of each component of TableMaster in Section 2 ("Challenges in Table Understanding", Figure 2), Section 5.3 (ablation study), and the Appendix. We would appreciate your revisiting these sections for further clarification.

[W2] Handle more complex table structures.

TableMaster is designed as a general recipe for table understanding. When dealing with hierarchical tables, the framework needs to be adapted accordingly. Our initial design focuses on relational tables, and we have long been aware of the challenges in adapting to more complex table structures. Fortunately, in our recent follow-up work on TableMaster, we introduce a relational-table converter that splits complex tables into several relational subtables for multi-table understanding. Specifically, we use o1 to generate relational tables from complex tables.
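As a rough illustration of this converter (the prompt wording and the JSON output contract are assumptions, not the follow-up work's actual implementation):

```python
# Hedged sketch of the relational-table converter described above; the prompt
# wording and JSON output contract are assumptions, not the actual implementation.
import json

def to_relational_subtables(complex_table: str, llm) -> list[str]:
    out = llm(
        "Split this hierarchical table into flat relational subtables. "
        "Return a JSON list of markdown tables.\n" + complex_table
    )
    return json.loads(out)  # one markdown string per relational subtable
```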

We have added experiments evaluating TableMaster on the hierarchical table QA dataset HiTab [1]. All experiments are conducted with GPT-4o. MultiCoT [2] is a version of Chain-of-Table that works on multiple tables; both MultiCoT and TableMaster are tested on the same extracted relational tables. E5 [4], the SOTA on HiTab, is designed specifically for complex tables.

Method                                            Accuracy
After converting to relational tables:
- MultiCoT (original [3])                         64.0
- MultiCoT (optimized prompt)                     70.0
- MultiCoT (optimized prompt + verbalized table)  73.5
- TableMaster                                     74.2
Direct:
- E5 [4]                                          77.3

We observe that TableMaster outperforms Chain-of-Table when dealing with hierarchical tables. However, it still lags behind E5 due to information lost during the table conversion process. We acknowledge this limitation and are continuing to adapt TableMaster for better complex-table understanding.

[W3] Lack of comparative analysis of token efficiency across methods.

We conduct both theoretical and empirical efficiency analyses in Appendix G. TableMaster is designed to prioritize understanding accuracy, which may introduce some inefficiency; however, one can select a subset of its components to trade accuracy for efficiency, as discussed in Appendix G. Specifically, compared to Chain-of-Table [3], our use of SQL and header selection is somewhat more efficient than constructing tables through an operation chain.

[W4] Limited datasets of experiments.

We follow the evaluation protocols of several prior works [3, 5], which report their performance on these three datasets. We have also added experiments on HiTab [1] (see [W2] above) and FinQA [6] (below).

Method              Accuracy
GPT-4o-mini         50.7
GPT-4o              63.1
TableMaster (4m)    66.4 (+15.7)
TableMaster (4o)    70.9 (+6.9)

The table shows that our method substantially improves the base models' table understanding ability on FinQA.

[W5] Limited novelty.

In this paper, we propose a comprehensive framework for general table understanding, addressing multiple perspectives, including the four key solutions outlined in the paper. Many prior works focus on specific aspects of table understanding and use complex methods. For example, Chain-of-Table only constructs a sub-table. MixSC integrates textual and symbolic reasoning, but it requires self-consistency and voting over 10 samples, which adds computational cost. In contrast, we use adaptive reasoning to achieve good results and conduct a detailed analysis of the two reasoning approaches in Appendix J, an area where no prior work has offered similar insights. We also provide many additional insights in the Appendix. Therefore, we believe our contribution is not limited, but provides a broader perspective and deeper analysis of table understanding for LMs.

We will incorporate your suggestions in the revision. Thank you!


  1. HiTab: A Hierarchical Table Dataset for Question Answering and Natural Language Generation, ACL 2022.
  2. MultiCoT, GitHub: https://github.com/CYQIQ/MultiCoT
  3. Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding. ICLR 2024.
  4. E5: Zero-shot Hierarchical Table Analysis using Augmented LLMs via Explain, Extract, Execute, Exhibit, and Extrapolate, NAACL 2024.
  5. Large language models are versatile decomposers: Decompose evidence and questions for table-based reasoning. SIGIR 2023.
  6. FinQA: A Dataset of Numerical Reasoning over Financial Data, EMNLP 2021.
Reviewer Comment

Thank you for the clarification. Several of my concerns regarding W2/W4 have been addressed. I suggest incorporating the important insights either into the main content or providing a clear guide from the main content to the appendix. Based on these improvements, I have adjusted my scores accordingly. Additionally, I believe the paper would benefit significantly from establishing a clearer boundary between its contributions and prior studies.

Author Comment

Thank you for your constructive comment and for the improvement of the score!


We will incorporate the valuable insights into the main content and ensure a clear link between the main text and the appendix.

Regarding the contribution relative to prior studies (particularly clarifying the distinction between our contributions and previous work), we have summarized the following points:

  1. Overall Framework:
    TableMaster is a comprehensive recipe for general table understanding with language models. It addresses multiple perspectives, including the four key solutions outlined in the paper. Many prior works [1, 2, 3, 4] focus only on specific aspects of table understanding and employ complex methods that are not essential. Rather than a concrete method, TableMaster is better understood as a flexible framework, or recipe, that benefits various downstream table understanding tasks.

  2. Challenges and Solutions:
    In Section 2 and the Appendix, our paper conducts a deeper analysis of the challenges that language models face in table understanding. We analyze four key characteristics of tabular data (Structured, Intensive, Concise, Numerical) and identify four corresponding challenges:

    • Difficulty in Locating Target Data
    • Deficiency of Table Semantics
    • Numerical Inaccuracy in Textual Reasoning
    • Semantic Inflexibility in Symbolic Reasoning.

    Based on these challenges, we propose four corresponding solutions. In contrast, previous work has focused on only one aspect of these challenges, often proposing complex methods that miss the essence of table understanding for language models.

  3. General Subtable Extraction or Symbolic Reasoning:
    Most previous work has focused on subtable extraction or symbolic reasoning [1, 2, 4]. While these methods have achieved some success, they are relatively complex and inefficient, and often suffer from information loss when constructing a subtable. In TableMaster, we take a more general but effective approach, using simple and efficient LLM-based column selection and SQL-based row selection (see the sketch after this list). This is combined with table-of-focus reconstruction to mitigate the impact of information loss, a technique not previously explored in prior work.

  4. Table Verbalization:
    Previous work has not identified and addressed the challenge of Deficiency of Table Semantics, which creates difficulties in table understanding. While Table Verbalization (or Table2Text) has been a traditional task, we identify that this pre-task can enhance table understanding in cases of Deficiency of Table Semantics, a challenge not previously explored in prior research.

  5. Adaptive Reasoning:
    As language models evolve, their chain-of-thought textual reasoning ability has improved. However, most prior methods still primarily focus on symbolic reasoning. In this paper, we identify the pros and cons of both symbolic and textual reasoning. While MixSC [3] integrates both reasoning types, it requires self-consistency and voting over 10 samples, which significantly increases computational cost. In contrast, we use adaptive reasoning to achieve strong results. Additionally, we provide a detailed analysis of these two reasoning approaches in Appendix J, offering insights into how to effectively combine textual and symbolic reasoning—paving the way for future research in both table understanding and symbolic reasoning.
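As referenced in point 3 above, here is a minimal sketch of the column-selection plus SQL-based row-selection step (the prompts, the table name `t`, and the wiring are simplified assumptions, not our released implementation):

```python
# Minimal sketch of LLM-based column selection plus SQL-based row selection.
# The prompts and the table name `t` are simplified assumptions.
import sqlite3

def table_of_focus(conn: sqlite3.Connection, columns: list[str], question: str, llm):
    # Column lookup: ask the model which columns the question needs.
    keep = llm(f"From columns {columns}, list those needed for: {question}")
    selected = [c for c in columns if c in keep] or columns  # fall back to all columns
    # Row selection: ask the model for a WHERE clause, then execute it.
    where = llm(f"Write a SQL WHERE clause over columns {columns} for: {question}")
    query = f"SELECT {', '.join(selected)} FROM t WHERE {where}"
    return conn.execute(query).fetchall()
```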

In summary, we believe our contributions are valuable, broad, and distinct from previous studies. Our method can serve as a baseline or framework that can be adapted to various downstream scenarios in industry. Our work not only represents a step forward but also provides a foundation for future language models in table understanding—a reflection on past progress and a new starting point.


[1] Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding. ICLR 2024.

[2] Large language models are versatile decomposers: Decompose evidence and questions for table-based reasoning. SIGIR 2023.

[3] Rethinking Tabular Data Understanding with Large Language Models. NAACL 2024.

[4] PoTable: Programming Standardly on Table-based Reasoning Like a Human Analyst. arXiv.

Official Review
Rating: 2

This paper presents TableMaster, a framework enhancing LLMs' table understanding. It addresses four key challenges: data localization, semantic deficiency, numerical inaccuracies, and inflexible symbolic reasoning. TableMaster integrates table-of-focus, verbalization, program-aided reasoning, and adaptive reasoning to balance textual and symbolic reasoning dynamically. Experiments on WikiTQ and TabFact show state-of-the-art performance, significantly surpassing baselines.

Questions for Authors

  1. All experiments in this paper were conducted on OpenAI's closed-source models. How does TableMaster perform on other open-source LLMs? Is it equally effective?

  2. How does TableMaster perform on more challenging benchmarks, such as BIRD [1] and TableBench [2]? Can it maintain its leading performance?

Reference

[1] Can LLM Already Serve as A Database Interface? A Big Bench for Large-Scale Database Grounded Text-to-SQLs

[2] Tablebench: A comprehensive and complex benchmark for table question answering

Claims and Evidence

No.

The authors identify four key challenges in table understanding:

  1. Difficulty in locating target data
  2. Deficiency in table semantics
  3. Numerical inaccuracies in textual reasoning
  4. Semantic inflexibility in symbolic reasoning.

The authors claim that TableMaster integrates multiple solutions to address specific challenges in data processing.

However, the paper does not provide sufficient experimental evidence or comprehensive analysis to demonstrate significant improvements on these difficulties. Improved performance on TabFact and WikiTQ alone does not substantiate these claims; more detailed experimental results are necessary.

Furthermore, TabFact and WikiTQ do not adequately represent the challenges in question.

For instance, the difficulty in locating target data primarily concerns long-context hallucination, which is not a feature of the relatively small tables in these datasets (TabFact & WikiTQ). In contrast, the BIRD [1] dataset presents a more significant challenge due to its length. Moreover, the results in the appendix indicate that as table size increases, the performance of TableMaster noticeably declines, and the method does not show a significantly reduced decline compared to other methods.

Similarly, while TabFact focuses on fact-based questions, it does not emphasize numerical computation, unlike datasets such as FinQA [2] and TableBench [3], which present clear challenges in numerical reasoning.

Reference

[1] Can LLM Already Serve as A Database Interface? A Big Bench for Large-Scale Database Grounded Text-to-SQLs

[2] Finqa: A dataset of numerical reasoning over financial data

[3] Tablebench: A comprehensive and complex benchmark for table question answering

Methods and Evaluation Criteria

Yes

Theoretical Claims

N/A

Experimental Design and Analysis

Yes. The presentation of the experimental results raises several concerns, particularly the use of outcomes sourced directly from other studies and the presence of numerous unreported values. This may compromise the perceived completeness of the experiments. Notably, the performance of GPT-3.5 on the WikiTQ dataset reported for the related MixSC [1] method is 73.6. This result, which is not included in the main findings of this study, is significantly higher than that of the method proposed in this paper, which achieves 68.21.

Reference

[1] Rethinking Tabular Data Understanding with Large Language Models

Supplementary Material

Yes. The authors uploaded their experiment code.

Relation to Prior Work

This study focuses on enhancing the capabilities of large language models in understanding tabular data.

Missing Key References

No

Other Strengths and Weaknesses

Strengths:

  1. The authors integrate textual and program-based symbolic reasoning to enhance the model's comprehension capabilities, an intriguing approach.

  2. The paper employs ablation studies to quantify the contributions of each module in TableMaster, providing robust support for design decisions.

Weaknesses:

  1. The selected datasets do not effectively illustrate the primary challenges of table understanding proposed by the authors, nor is further analysis included, resulting in a lack of substantial evidence.

  2. The experimental results omit the performance of a typical baseline method, MixSC; moreover, the reported results fall short of those achieved by MixSC.

  3. The pipeline design incorporates various strategies similar to those in existing works, lacking sufficient innovation.

  4. The identified challenges of table understanding can hardly be regarded as a valid contribution on their own.

  5. Lack of comparative analysis against domain-specific models for table understanding, such as TableLLM [1] and TableGPT2 [2].

Reference

[1] TableLLM: Enabling Tabular Data Manipulation by LLMs in Real Office Usage Scenarios

[2] TableGPT2: A Large Multimodal Model with Tabular Data Integration

Other Comments or Suggestions

  1. It is recommended to thoroughly review the details of the paper. Section 5.1 currently ends with "??" symbols.

  2. It is advisable to relocate the experimental results of FetaQA from the appendix to the main body of the paper.

Author Response

We sincerely appreciate your thoughtful review and address your concerns below:


[W1] Concern about the difficulty in locating target data.

There are still large tables in TabFact & WikiTQ (tables with 518 rows or 10k+ tokens). The BIRD [1] dataset is essentially a text-to-SQL task for multi-table database retrieval, where generating SQL queries can already solve the problem effectively without many of the solutions proposed in TableMaster (adaptive reasoning and verbalization), so the settings differ somewhat. In Table 5, the performance of TableMaster actually declines less than that of other methods. We compute the differences explicitly below:

Method                        Small (<2k)   Medium (2k ~ 4k)   Large (>4k)
Binder [2]                    56.54         26.13 (-30.41)     6.41 (-19.72)
Dater [3]                     62.50         42.34 (-20.16)     34.62 (-7.72)
Chain-of-Table [4]            68.13         52.25 (-15.88)     44.87 (-7.38)
TableMaster (gpt-3.5-turbo)   69.01         58.00 (-11.01)     56.73 (-1.27)
TableMaster (gpt-4o-mini)     78.71         70.50 (-8.21)      70.19 (-0.31)

(Parenthesized values are the drop relative to the previous size bucket.)

[W2] Concern about emphasizing numerical computation.

We have added experiments on FinQA [5], a dataset involving many numerical computations:

Method              Accuracy
GPT-4o-mini         50.7
GPT-4o              63.1
TableMaster (4m)    66.4 (+15.7)
TableMaster (4o)    70.9 (+6.9)

The table shows that our methods significantly improve the base model's table understanding ability in FinQA.


[W3] Experimental results of the MixSC method.

While the MixSC method achieves 73.6, it uses self-consistency and samples 10 times, which adds computational cost, so a direct comparison is unfair. Our method uses adaptive reasoning to select one reasoning strategy (either symbolic or textual reasoning) and samples only once. We conduct a comprehensive analysis of reasoning methods in Appendix J. In Table 8, our method achieves 77.46 using self-consistency (5+5), a setting comparable to MixSC's (73.6).
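To make the cost difference concrete, the two sampling regimes can be contrasted roughly as follows (a simplified sketch; `build_prompt` is a hypothetical helper, not our released code):

```python
# Rough cost comparison: self-consistency samples many chains and votes, while
# adaptive reasoning makes one routing call plus one answer call.
# `build_prompt` is a hypothetical helper, not part of our released code.
from collections import Counter

def self_consistency(question: str, table: str, llm, n: int = 10) -> str:
    answers = [llm(build_prompt("mixed", question, table)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]  # majority vote over ~n LLM calls

def adaptive(question: str, table: str, llm) -> str:
    route = llm(f"Reply 'symbolic' or 'textual' for: {question}")
    return llm(build_prompt(route, question, table))  # two LLM calls in total
```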


[W4] Lack of comparative analysis of domain-specific models.

TableMaster is a general framework for table understanding. It can be adapted to work with any language model, unlike pretraining-based methods that improve understanding during training. Our method focuses on being training-free; related directions are discussed in the related work (Section 2).


[W5] Limited contribution.

In this paper, we propose a comprehensive framework for general table understanding, addressing multiple perspectives, including identifying four key challenges and proposing four key solutions. Many prior works focus on specific aspects of table understanding and use complex methods. For example, Chain-of-Table only constructs a sub-table. MixSC integrates textual and symbolic reasoning, but it requires self-consistency and voting over 10 samples, adding computational cost. In contrast, we use adaptive reasoning to achieve good results and conduct a detailed analysis of the two reasoning approaches in Appendix J, an area where no prior work has offered similar insights. Instead of focusing on a single perspective, we conduct experiments and analysis across many perspectives and provide many additional insights in the Appendix. Therefore, we believe our contribution is not limited but offers a broader perspective and a deeper analysis of table understanding for large language models.


[W6] All experiments in this paper were conducted on OpenAI's closed-source models.

We conducted experiments with Llama-3.1-70B on the WikiTQ and TabFact datasets, reported in Table 1.

We will incorporate your suggestions in the revision. Thank you!


  1. Can LLM Already Serve as A Database Interface? A Big Bench for Large-Scale Database Grounded Text-to-SQLs, Arxiv.
  2. Binding Language Models in Symbolic Languages, ICLR 2023.
  3. Large language models are versatile decomposers: Decompose evidence and questions for table-based reasoning. SIGIR 2023.
  4. Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding. ICLR 2024.
  5. FinQA: A Dataset of Numerical Reasoning over Financial Data, EMNLP 2021.
Final Decision

The paper focuses on table understanding, aiming to resolve known difficulties such as target-data localization and beyond. Reviewers agree that the paper proposes valuable novelties, but they also share concerns about its differentiation from existing works. We encourage the authors to incorporate the comments to improve the draft.