PaperHub

Rating: 6.3/10 · Poster · 4 reviewers (min 6, max 7, std 0.4)
Individual ratings: 7, 6, 6, 6
Average confidence: 3.5
COLM 2025

Improving Table Understanding with LLMs and Entity-Oriented Search

OpenReview · PDF
Submitted: 2025-03-19 · Updated: 2025-08-26
TL;DR

We introduce an entity-oriented search method to enhance table understanding in LLMs, reducing preprocessing and achieving state-of-the-art results.

Keywords

table understanding, llm

Reviews and Discussion

Review (Rating: 7)

This paper explores LLM-based table understanding and proposes an entity-oriented approach. The method leverages LLMs to analyze tables and construct a graph, which is subsequently used for entity matching through a graph query language. This process aims to simplify the table understanding process. Experimental results on the WikiTableQuestions and TabFact datasets show that the proposed method outperforms previous approaches.

Reasons to Accept

  • The paper is clearly written and well-structured, making it easy to follow.
  • Using a graph-query language to facilitate table understanding is a new idea and seems reasonable.
  • The ablation study effectively highlights the importance of the individual components in the proposed method.

Reasons to Reject

  • Limited performance improvement: The results in Table 1 show only marginal improvements over baseline methods. For example, TUNES-GPT3.5 demonstrates only a 0.2 point improvement over TabSQLify on WikiTQ and a 0.5 point improvement over DP&PYAGENT.
  • Lacking justification for using graph representation: The paper’s argument for using a graph-based representation and graph query language for table understanding is not fully convincing to me. Since the performance difference is not very large, why do the proposed graph-based methods offer significant advantages over SQL-based methods or more complex symbolic languages, such as the one employed in Binder?

Questions To Authors

  • It would be better if the LLM query counts (lines 263 to 266) were included in Table 1, which would better illustrate the inference cost reduction.
  • What's the error type distribution before applying TUNES? Currently, Section 5.3 mainly discusses the error types for TUNES, but there's no clear evidence of how it improves over the original LLM.
  • I'm curious about how recent long-thinking models like DeepSeek-R1 or GPT-O4 perform compared with the proposed methods.
Comment

We sincerely thank the reviewer for recognizing the novelty and effectiveness of our proposed approach.

Below, we address the key concerns raised:

We would like to clarify a possible misunderstanding: TUNES-GPT3.5 achieves a 3.0-point improvement over DP&PYAGENT on WikiTQ, not 0.5 as mentioned.

We understand your main concern that our method does not show significant improvements in all settings. However, our approach consistently shows improvement across all configurations (with and without Chain-of-Thought prompting) and with four different large language models (LLMs), demonstrating its effectiveness and generalizability.

Additionally, TUNES w/ CoT significantly outperforms state-of-the-art baselines such as CHAIN-OF-TABLE and DP&PYAGENT on 4 different LLMs, with statistical significance (p < 0.01) on both the WikiTQ and TabFact datasets, while requiring 3×–18× fewer LLM calls. TUNES without CoT achieves only competitive performance, but it requires 6× to 36× fewer LLM calls than the baselines, resulting in substantial computational savings.

Second, regarding the justification for using a graph-based method: our entity-oriented search is a novel approach that has not been previously explored in the context of table understanding. Beyond achieving state-of-the-art results, we believe that pioneering a new direction offers unique value precisely because it departs from conventional approaches. Such methodological diversity broadens the range of techniques the community considers, rather than converging on a single paradigm.

Questions To Authors

It would be better if the LLM query counts (lines 263 to 266) were included in Table 1, which would better illustrate the inference cost reduction.

Thank you for the suggestion. We will move the query count information (lines 263–266) into Table 1 for better clarity in the camera-ready version (if accepted).

What's the error type distribution before applying TUNES? Currently, Section 5.3 mainly discusses the error types for TUNES, but there's no clear evidence of how it improves over the original LLM.

Our error analysis is tailored to TUNES, whose components are explicitly defined. In contrast, the original LLMs operate as black boxes, making a comparable error breakdown infeasible.

I'm curious about how recent long-thinking models like DeepSeek-R1 or GPT-O4 perform compared with the proposed methods.

GPT-O4 is not publicly available for reproducible evaluation, and due to computational constraints, we were unable to run the full DeepSeek-R1 model. Moreover, both models are extremely large, making it challenging to ensure a fair comparison and draw reliable insights. Additionally, our results (Table 1) already demonstrate that stronger base LLMs consistently lead to better performance when paired with TUNES, suggesting that models like GPT-O4 or DeepSeek-R1 would likely yield even stronger results under our framework.

To provide a fair and feasible comparison within our compute budget, we report the performance of DeepSeek-R1-Distill-LLaMA-8B [1], a distilled version of DeepSeek-R1 designed to retain strong reasoning ability while being more lightweight and resource-efficient. Below is its performance on the TabFact dataset:

| Method | Score |
|--------|-------|
| CHAIN-OF-TABLE + LLaMA-3.1-8B-Instruct [CoT] | 49.6 |
| DP&PYAGENT + LLaMA-3.1-8B-Instruct [SC] | 63.8 |
| TUNES + LLaMA-3.1-8B-Instruct | 68.1 |
| TUNES + LLaMA-3.1-8B-Instruct [CoT] | 71.9 |
| DeepSeek-R1-Distill-LLaMA-8B | 58.3 |
| TUNES + DeepSeek-R1-Distill-LLaMA-8B | 71.1 |
| TUNES + DeepSeek-R1-Distill-LLaMA-8B [CoT] | 72.9 |

These results show that DeepSeek-R1-Distill-LLaMA-8B even outperforms CHAIN-OF-TABLE, and that TUNES, when combined with DeepSeek-R1-Distill-LLaMA-8B, significantly enhances the performance of the underlying LLM. It performs only slightly below TUNES + LLaMA-3.1-8B-Instruct [CoT]. Overall, TUNES built on DeepSeek-R1-Distill-LLaMA-8B demonstrates consistent performance gains over other LLaMA-3.1-8B-Instruct-based methods and achieves a new SOTA score of 72.9.

[1] Deepseek-r1: Incentivizing reasoning capability in LLMs via reinforcement learning (Guo et al., 2025)

Comment

Thanks for the detailed response and additional experiments. I have raised my rating since the proposed methods can offer much better efficiency without losing performance.

Review (Rating: 6)

The paper proposed an entity-oriented search method to improve table understanding with LLMs. This approach focuses on table entities and pioneers the use of a graph query language for table understanding. Experiments show that the approach achieved a new state-of-the-art performance on standard benchmarks WikiTableQuestions and TabFact.

Reasons to Accept

The proposed approach focuses on table entities and pioneers the use of a graph query language for table understanding, establishing a new research direction. Experiments show that the approach achieved a new state-of-the-art performance on two standard benchmarks.

Reasons to Reject

  1. It is better to show the evaluation of the entity identification module. While there is an error analysis that shows the error rate is 4%, we cannot rely on it, because the total number of instances for the error analysis was not mentioned at all. If the error analysis tries to indicate the performance, the authors should first show the statistics of the dataset for the analysis.
  2. From the description in Sec. 3.2, it is not clear how the results from the full-text and semantic search and those from the graph search are combined together to be used in later modules.
  3. It is better to clearly describe how the authors fixed the value for various hyperparameters in Sec. 4.
  4. Without CoT, the performance differences of the proposed approach from the previous SOTA model are rather small. While the authors criticized CoT as computationally intensive and resource-demanding, they combined their approach with CoT, finally yielding better performances. However, it is not clear whether the differences are really significant without any significance test.

Questions To Authors

Could you show another ablation when both full-text and semantic search are removed?

Comment

We sincerely thank the reviewer for recognizing the novelty and effectiveness of our proposed approach, which leverages a graph query language to push the boundaries of what LLMs can achieve in table-based question answering. We hope this new direction inspires subsequent works in the field.

Below, we address the key concerns raised:

  1. It is better to show the evaluation of the entity identification module. While there is an error analysis that shows the error rate is 4%, we cannot rely on it, because the total number of instances for the error analysis was not mentioned at all. If the error analysis tries to indicate the performance, the authors should first show the statistics of the dataset for the analysis.

As mentioned in lines 322–323, we report the types of errors across each component of TUNES with Llama-3.1-70B-Instruct on the WikiTableQuestions dataset. The statistics of this dataset are provided in Appendix C, as referenced in line 224.

  2. From the description in Sec. 3.2, it is not clear how the results from the full-text and semantic search and those from the graph search are combined together to be used in later modules.

As described in lines 201–205, we combine the results by taking the union of two sources: (1) the top-K relevant entities retrieved using a weighted sum of full-text and semantic search scores, and (2) the entities and attributes returned from Cypher query execution.
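For concreteness, here is a minimal sketch of this union step with hypothetical names (`combine_candidates`, `alpha`, `k`); it illustrates the described logic and is not the authors' implementation:

```python
from typing import Dict, Set


def combine_candidates(
    fulltext_scores: Dict[str, float],   # entity id -> full-text search score
    semantic_scores: Dict[str, float],   # entity id -> semantic similarity score
    cypher_results: Set[str],            # entities/attributes from Cypher execution
    k: int = 5,                          # top-K cutoff (placeholder value)
    alpha: float = 0.5,                  # blend weight (placeholder value)
) -> Set[str]:
    # Weighted sum of the two retrieval scores for every candidate entity.
    entities = set(fulltext_scores) | set(semantic_scores)
    blended = {
        e: alpha * fulltext_scores.get(e, 0.0)
        + (1 - alpha) * semantic_scores.get(e, 0.0)
        for e in entities
    }
    # Top-K entities by blended score, unioned with the Cypher results.
    top_k = set(sorted(blended, key=blended.get, reverse=True)[:k])
    return top_k | cypher_results
```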

  3. It is better to clearly describe how the authors fixed the value for various hyperparameters in Sec. 4.

We thank the reviewer for the suggestion. In our experiments, we did not tune these hyperparameters extensively; instead, we used a fixed set of values that already achieved SOTA performance.

Regarding hyperparameter choices:

  • LLM temperature: We set the temperature to 0 to encourage deterministic outputs, ensuring reproducibility across runs. The only exception is Cypher code generation, where we use a slightly higher temperature of 0.4 to allow for mild variation, as there can be multiple valid ways to express the same query.

  • Parameter h (number of table rows input to the LLM): A small value of h helps reduce token consumption. A few representative rows might be sufficient for the LLM to understand the structure of the table.

  • Regarding the retrieval component, our method introduces one parameter per retrieval strategy (three in total), each controlling the number of candidate entities passed to the LLM. More candidates give the LLM fuller context but may reduce performance by introducing noise (an illustrative configuration sketch follows below).
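To make these settings concrete, below is a hypothetical configuration sketch; only the two temperature values are stated in the response above, and the remaining numbers are placeholders rather than the paper's actual fixed values:

```python
# Illustrative TUNES-style configuration (hypothetical names, placeholder values).
config = {
    "llm_temperature": 0.0,     # deterministic outputs for reproducibility
    "cypher_temperature": 0.4,  # mild variation for Cypher code generation
    "h_rows": 3,                # placeholder: table rows shown to the LLM (parameter h)
    "top_k_fulltext": 5,        # placeholder: candidates from full-text search
    "top_k_semantic": 5,        # placeholder: candidates from semantic search
    "top_k_graph": 5,           # placeholder: candidates from graph search
}
```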

  4. Without CoT, the performance differences of the proposed approach from the previous SOTA model are rather small. While the authors criticized CoT as computationally intensive and resource-demanding, they combined their approach with CoT, finally yielding better performances. However, it is not clear whether the differences are really significant without any significance test.

Efficiency without CoT:

Our approach, TUNES without CoT, requires 6× to 36× fewer LLM calls than the baselines, resulting in significant computational savings. Despite this efficiency, TUNES still achieves competitive performance across four different LLMs, demonstrating its generalizability and effectiveness.

Effectiveness with CoT:

When combined with CoT, TUNES requires 3×–18× fewer LLM calls (see lines 256–263 for more details), while significantly outperforming state-of-the-art baselines such as CHAIN-OF-TABLE and DP&PYAGENT, with statistical significance (p < 0.01) on both the WikiTQ and TabFact datasets.

We present below the results of one-sided McNemar significance tests (evaluating whether CoT-based TUNES is better than the baselines) across various LLMs and datasets:

| CoT-based TUNES Model | Baseline Model | WikiTQ p-value | TabFact p-value |
|-----------------------|----------------|----------------|-----------------|
| TUNES-GPT-3.5-turbo | DP&PYAGENT-GPT-3.5-turbo | 1.25 × 10⁻⁵ | 0.0058 |
| TUNES-GPT-3.5-turbo | Chain-of-Table-GPT-3.5 | 2.03 × 10⁻²⁷ | 0.0071 |
| TUNES-GPT-4o-mini | DP&PYAGENT-GPT-4o-mini | 0.0076 | 0.0081 |
| TUNES-GPT-4o-mini | Chain-of-Table-GPT-4o-mini | 4.61 × 10⁻²² | 4.77 × 10⁻⁹ |
| TUNES-Llama-3.1-70B-Instruct | DP&PYAGENT-Llama-3.1-70B-Instruct | 4.92 × 10⁻¹⁶ | 0.0049 |
| TUNES-Llama-3.1-70B-Instruct | Chain-of-Table-Llama-3.1-70B-Instruct | 1.16 × 10⁻¹⁶ | 0.0098 |
| TUNES-Llama-3.1-8B-Instruct | DP&PYAGENT-Llama-3.1-8B-Instruct | 0.0094 | 2.55 × 10⁻¹⁵ |
| TUNES-Llama-3.1-8B-Instruct | Chain-of-Table-Llama-3.1-8B-Instruct | 0.0050 | 4.35 × 10⁻⁷⁴ |
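For reference, a one-sided exact McNemar test of this kind can be computed from paired per-example correctness as in the sketch below (illustrative code, not the authors' evaluation script):

```python
# One-sided exact McNemar test: under H0, the discordant pairs where only one
# system is correct follow Binomial(b + c, 0.5); a small p-value indicates
# the first system (TUNES) wins significantly more often than the baseline.
from scipy.stats import binomtest


def mcnemar_one_sided(tunes_correct, baseline_correct):
    pairs = list(zip(tunes_correct, baseline_correct))
    b = sum(t and not s for t, s in pairs)  # TUNES right, baseline wrong
    c = sum(s and not t for t, s in pairs)  # baseline right, TUNES wrong
    return binomtest(b, b + c, p=0.5, alternative="greater").pvalue
```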

We will clarify these points further in the camera-ready version (if accepted).

Questions To Authors

Could you show another ablation when both full-text and semantic search are removed?

When both full-text and semantic search are ablated, the accuracy drops to 68.1.

Comment

Thank you for the results of the significance test and the detailed responses. I will raise my score.

Review (Rating: 6)

This paper presents TUNES, a novel approach for table understanding that addresses the limitations of existing methods, such as heavy reliance on preprocessing and lack of contextual information. TUNES leverages semantic similarities between questions and table data, as well as implicit relationships among table cells, to enhance contextual clarity and reduce preprocessing needs. It also pioneers the use of a graph query language (Cypher) for improved reasoning over tables. Experimental results demonstrate that TUNES achieves state-of-the-art performance on standard benchmarks, and the authors plan to extend its application to more complex downstream tasks.

Reasons to Accept

  1. The paper presents an effective and well-designed pipeline for table understanding using LLMs, with experimental results showing that the proposed method consistently outperforms strong baseline approaches.
  2. The paper is generally clear and well-written, making complex ideas accessible and easy to follow.

Reasons to Reject

  1. The motivation is unclear. For example, the claim regarding the "unpredictable nature of the content and formatting within table cells" is vague and lacks specificity.
  2. The proposed method appears to be more of an engineering solution rather than a technical innovation. It involves three retrieval strategies and numerous hyperparameters, but it is unclear whether the choice of hyperparameters would significantly impact the results in practical use.
  3. The experiments are only conducted on two public datasets.
Comment

We sincerely thank the reviewer for acknowledging the effectiveness of our proposed method.

Below, we address the key concerns raised:

  1. The motivation is unclear. For example, the claim regarding the "unpredictable nature of the content and formatting within table cells" is vague and lacks specificity.

We would like to clarify that we did provide specific examples and discussion in lines 38–42. For instance, in a column labeled “address”, one cell might contain a full address (e.g., “123 Main St, Springfield, 12345”), while another might only contain the city name (“Springfield”) or be blank. Our intention was to highlight that such inconsistency in cell content formatting can hinder the performance of query-based approaches, such as SQL-based methods, where LLMs are expected to match or query exact cell values.
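As a toy illustration of this brittleness (hypothetical data; the paper's method uses semantic rather than substring matching):

```python
# Cells in an "address" column with inconsistent formatting (toy data).
cells = ["123 Main St, Springfield, 12345", "Springfield", ""]

# Exact-value matching, as an SQL-style query would perform, misses the full address.
exact_hits = [c for c in cells if c == "Springfield"]
# A looser, entity-oriented comparison still recovers it.
loose_hits = [c for c in cells if "springfield" in c.lower()]

print(exact_hits)  # ['Springfield']
print(loose_hits)  # ['123 Main St, Springfield, 12345', 'Springfield']
```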

  2. The proposed method appears to be more of an engineering solution rather than a technical innovation. It involves three retrieval strategies and numerous hyperparameters, but it is unclear whether the choice of hyperparameters would significantly impact the results in practical use.

Our proposed entity-oriented search is a novel approach that has not been previously explored in the context of table understanding. It captures strong inter-cell relationships and structural constraints by modeling them as coherent entities in a graph. In addition, we pioneer the use of a graph query language (Cypher). Beyond achieving SOTA results on the evaluated benchmarks, this innovation pushes the boundaries of what LLMs can achieve in table understanding, and we believe this new direction will inspire subsequent work in the field.
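For readers unfamiliar with Cypher, a minimal sketch of the entity-graph idea follows; the node label and property names are hypothetical, and the paper's actual graph design may differ:

```python
# Each table row becomes an entity node whose properties are its column values;
# Cypher then retrieves entities by attribute instead of by exact cell lookup.
row = {"name": "Springfield", "state": "Illinois", "population": 114394}

create_stmt = "CREATE (e:Entity {name: $name, state: $state, population: $population})"
match_stmt = "MATCH (e:Entity) WHERE e.state = 'Illinois' RETURN e.name, e.population"

# With the official neo4j Python driver, these could be executed as, e.g.:
#   with driver.session() as session:
#       session.run(create_stmt, **row)
#       records = session.run(match_stmt)
```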

Regarding the hyperparameters of the retrieval component, our method introduces only one parameter per retrieval strategy (three in total), each controlling the number of candidate entities passed to the LLM. More candidates give the LLM fuller context but may reduce performance by introducing noise. In our experiments, we did not tune these hyperparameters extensively; instead, we used a fixed set of values that already achieved SOTA performance.

  3. The experiments are only conducted on two public datasets.

Our choice to evaluate on WikiTQ and TabFact follows related work [1]. Moreover, WikiTQ and TabFact are widely used and particularly well-suited for comprehensive table-understanding evaluation. The questions are not generated from predefined templates but are hand-crafted by users, resulting in significant linguistic diversity and realism. They cover a range of domains and require various operations such as table lookup, aggregation, superlatives, arithmetic (addition, subtraction, multiplication, counting), joins, and unions. They also include diverse question types: comparative, superlative, fact verification, content-related searches, and table position-related searches. The tables themselves vary in length, have non-fixed headers, and contain empty or noisy cells. While additional datasets could further validate our approach, the current results already establish its merit.

Additionally, we conducted a comprehensive evaluation across multiple settings, including TUNES with and without Chain-of-Thought prompting, an ablation study, and four different large language models (LLMs), further supporting the effectiveness and generalizability of the proposed method.

[1] Rethinking Tabular Data Understanding with Large Language Models (Liu et al., NAACL 2024)

Comment

Thank you for your response.

Regarding point 1, I believe the phrase "unpredictable nature" could be more clearly expressed as "inconsistency" or "variety." Since the objective is not to predict any aspect of the table content, the term "unpredictable nature" may be misleading.

Regarding point 2, after revisiting the paper, I recognize the novelty of introducing a graph query language (Cypher) for entity-oriented search. However, I have an additional concern about the process of converting tables to graphs. The example table in Figure 2 has a straightforward structure that is easy to parse. How does your approach handle unstructured or non-relational tables, such as those with complex header hierarchies? Such tables are quite common in web pages and documents.

Regarding point 3, thank you for your clarification. Nevertheless, I still believe that including experiments on a broader range of table-based tasks or datasets would significantly strengthen the technical soundness of the paper.

Overall, I have raised my score accordingly.

Comment

Thank you for your detailed comments.

  1. We will revise the term “unpredictable nature” to “inconsistency” in the camera-ready version (if accepted).

  2. Thank you for acknowledging our novel contribution. Regarding the additional concern, in cases with complex headers, the primary key identification remains unchanged. The features of each entity are determined by concatenating its hierarchical headers (e.g., header_level1 + header_level2...). We will clarify this further in the camera-ready (if accepted).
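A short sketch of that concatenation rule (illustrative; the separator and function name are hypothetical):

```python
def flatten_headers(levels, sep=" / "):
    # Concatenate hierarchical header levels (e.g., header_level1 + header_level2)
    # into a single feature name; empty levels are skipped.
    return sep.join(h for h in levels if h)


flatten_headers(["Population", "2020 Census"])  # -> 'Population / 2020 Census'
```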

Comment

We would like to kindly ask whether any concerns remain regarding our response. If you find it satisfactory and have no additional concerns, we would greatly appreciate your considering a revision of your score. Your time and consideration are greatly valued. Thank you.

Review (Rating: 6)

This paper introduces an entity-oriented search method to solve table-based QA with LLMs. Unlike existing approaches that rely heavily on preprocessing and keyword matching, this method leverages semantic similarities and implicit relationships between table cells. The authors pioneer the use of a graph query language for table understanding and demonstrate state-of-the-art performance on the WikiTableQuestions and TabFact benchmarks.

Reasons to Accept

This paper presents a novel entity-oriented approach that advances table understanding by moving beyond preprocessing and keyword matching. It introduces a graph query language for table analysis—establishing a new research direction in the field. The method effectively captures semantic similarities and implicit cell relationships, demonstrating superior performance on standard benchmarks WikiTableQuestions and TabFact with state-of-the-art results.

Reasons to Reject

  1. While the paper claims its method's superiority in reducing computationally intensive chain-of-thought reasoning, it lacks experimental evidence demonstrating this computational efficiency difference.

  2. The paper overlooks relevant prior work on retrieval-based methods for table understanding, specifically Chen et al.'s "TableRAG: Million-token table understanding with language models" (NeurIPS 2024), which should have been included in the baselines or related work section for proper contextualization.

Comment

We sincerely thank the reviewer for recognizing the novelty and effectiveness of our proposed approach, which leverages a graph query language to push the boundaries of what LLMs can achieve in table-based question answering. We hope this new direction inspires subsequent works in the field.

Below, we address the key concerns raised:

... It lacks experimental evidence demonstrating this computational efficiency difference.

We would like to clarify that our paper did provide quantitative evidence of computational efficiency. Specifically, the average time for performing both semantic search and full-text search is very small, only 0.06 seconds per query on a CPU. Thus, in TUNES, almost all of the running time is spent on prompting LLMs, and our approach requires 3× to 18× fewer LLM calls compared to the baselines. Please refer to lines 256–263 for more detailed results.

The paper overlooks relevant prior work ... Chen et al.'s TableRAG (NeurIPS 2024), which should have been included in the baselines or related work section.

We appreciate the reviewer highlighting this work. We acknowledge that TableRAG is a relevant baseline, and we will include it in the camera-ready version (if accepted).

Its inclusion further emphasizes the novelty of our approach. While TableRAG encodes table cells independently, our approach captures stronger inter-cell relationships and structural constraints by modeling them as coherent entities in a graph.

Additionally, our pioneering use of a graph query language (Cypher) enables more effective retrieval of relevant information. On the WikiTQ benchmark, using the same base LLM (GPT-3.5-turbo), TUNES achieves notably higher accuracy (64.9) than TableRAG (57.0), validating the effectiveness of our approach.

Comment

I have read others' comments and the authors' response. I have raised my score.

Final Decision

The authors propose TUNES, an entity-oriented search method for table-based QA which uses the Cypher graph query language for table understanding. They leverage semantic similarities and implicit relationships between table cells, achieving SOTA performance on two benchmarks while requiring fewer LLM calls than existing methods.

Reviewers initially raised concerns about (i) a lack of computational-efficiency evidence and a missing TableRAG baseline comparison, (ii) insufficient evaluation of the entity identification component and an unclear result-combination strategy, and (iii) limited performance improvements and insufficient statistical significance testing. The authors satisfactorily addressed these concerns during the rebuttal period with computational-efficiency quantification, a TableRAG comparison, statistical significance testing, and additional experiments with DeepSeek-R1-Distill demonstrating consistent improvements.

Overall, the paper makes good contributions with its graph-based approach and demonstrates meaningful empirical progress with computational efficiency gains (albeit with modest performance improvements in some configurations).