ICLR 2024 · Rejected
Average rating: 3.3/10 (4 reviewers; individual ratings 3, 3, 1, 6; min 1, max 6, std. dev. 1.8)
Average confidence: 3.5

Structure-Rich Text Benchmark for Knowledge Inference Evaluation

Submitted: 2023-09-22 · Updated: 2024-02-11
TL;DR

A benchmark evaluating LLMs' ability to understand structure-rich input texts.

Abstract

Keywords
Language Model · Structured Text · Benchmark · Evaluation · Dataset

Reviews and Discussion

Review 1 (Rating: 3)

This paper constructs a benchmark for LLMs (Large Language Models) composed of a structure-rich and syntactically rigorous corpus, with the purpose of evaluating the ability to infer knowledge from small structured texts and their construction rules.

The contributions of this paper are:

  1. evaluating the LLMs' ability to understand and manipulate structure-rich texts
  2. presenting a taxonomy for structured texts and designing structure-related tasks for each class
  3. presenting 32 structure-specific tasks and 2512 samples for the benchmark dataset
  4. testing 4 LLMs and presenting the results under various metrics for assessment

Strengths

  1. proposed a benchmark composed of structure-rich data, such as Tree, Tabular, JSON, YAML, XML, Markdown, Org, LaTeX and Python

Weaknesses

  1. The motivation of the paper is not clear. No references are cited in the Background and Motivation section. It seems no previous work focused on this aspect.
  2. The related work mentioned in the paper does not justify why understanding structure-rich data is important.
  3. The data construction method is quite simple, just regular expressions + simple rules (a sketch of what such a pipeline might look like follows this list).
  4. This paper compares 4 LLMs on the proposed benchmark, using GPT-4 as the baseline; the other 3 models are Minimax, Spark from Xunfei, and Ernie from Baidu. There are no tables or figures to summarize the experimental results.
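
As a concrete, hedged illustration of a "regular expression + simple rules" construction pipeline: the sketch below is an illustrative guess at the style of pipeline being criticized, not the authors' actual code; the toy document, the rule, and all field names are hypothetical.

```python
import re

# A toy Markdown document standing in for one benchmark input.
markdown_doc = """# Title
## Setup
Install the package.
## Usage
Run the script.
"""

# Rule: every level-2 heading yields one QA pair asking for that section's name.
headings = re.findall(r"^## (.+)$", markdown_doc, flags=re.MULTILINE)
qa_pairs = [
    {
        "input": markdown_doc,
        "question": f"What is the name of section {i + 1} (excluding the document title)?",
        "answer": heading,
    }
    for i, heading in enumerate(headings)
]

for pair in qa_pairs:
    print(pair["question"], "->", pair["answer"])
```

Even a small rule like this yields QA pairs whose answers are verifiable by construction; the reviewer's point is that such a pipeline keeps the difficulty ceiling of the benchmark rather low.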

Questions

  1. Can you list some references on the importance of understanding structure-rich data? What experiments did they do? What are the metrics?
  2. What is the novelty of the proposed paper compared to previous work?

Review 2 (Rating: 3)

This paper proposes a benchmark for evaluating the knowledge inference ability of LLMs. In particular, the benchmark consists of 2512 QA pairs, covering JSON, YAML, XML, Markdown, Org, LaTeX, Python, etc. Given an input in one of the aforementioned languages, the LLMs are asked to generate an answer that responds to a paired question posed in natural language. The authors conduct experiments using 4 LLMs and find that current LLMs show poor performance on the benchmark.
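
To make the QA format concrete, here is a hypothetical sample in the spirit of the description above; the field names, the toy JSON input, and the exact-match check are assumptions for illustration only, not the paper's actual schema or metrics.

```python
# Hypothetical benchmark sample: a structure-rich input, a natural-language
# question, and a reference answer. The schema is assumed for illustration only.
sample = {
    "format": "JSON",
    "input": '{"team": {"name": "alpha", "members": [{"id": 1}, {"id": 2}, {"id": 3}]}}',
    "question": "How many members does the team named 'alpha' have?",
    "reference_answer": "3",
}

def exact_match(prediction: str, reference: str) -> bool:
    """A simple exact-match score; the paper's actual metrics are not reproduced here."""
    return prediction.strip().lower() == reference.strip().lower()

print(exact_match("3 ", sample["reference_answer"]))  # True
```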

Strengths

A comprehensive benchmark that can evaluate LLMs' ability to understand and reason over structured texts.

Weaknesses

  1. The experiments and analyses are somewhat shallow. In particular, the authors simply compare performance while giving no discussion. I expect at least a more in-depth analysis, for example, discussion organized around their categories, the possible reasons why LLMs fail on these structured texts, and how this could be improved.

  2. The presentation is poor and can be further improved. I suggest the authors clearly state what kind of benchmark they are constructing in their introduction - I did not understand it until Section 3, where I realized it was a QA format. Also, there are no references in Section 1, even though most statements need supporting evidence/references.

Also, I would like to remind the authors that it is not proper to put most of their key results in the Appendix (with no discussion or analysis given!) - reviewers are NOT asked to read the appendix.

Questions

Please see the weaknesses above.

Review 3 (Rating: 1)

The authors present a new benchmark to probe the abilities of LLMs to generate different types of structured texts, including JSON, XML, YAML, Markdown, LaTeX, and the AST of Python source code.
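
For context on the last item, one plausible way to obtain a structured AST target from Python source is the standard-library ast module; whether the paper uses this exact representation is an assumption.

```python
import ast

# Turn a small Python snippet into a nested, structure-rich AST rendering.
source = "def add(a, b):\n    return a + b\n"
tree = ast.parse(source)
print(ast.dump(tree, indent=2))  # the indent argument requires Python 3.9+
```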

Strengths

This paper presents a new benchmark dataset containing ~2500 question-answer pairs.

Weaknesses

This work is not very novel and unlikely to be interesting.

Questions

Why wasn't GPT-4 used with prompt engineering as well? Moreover, Table 2 and Table 5 in Appendix A.2 suggest that prompt engineering actually decreased performance. Why is that?

Review 4 (Rating: 6)

This paper curates a benchmark to test LLMs' capabilities in understanding and generating structure-rich text such as JSON, YAML, Markdown, and Python. It first categorizes structure-rich text into four representative classes and then crafts specific tasks in each category.

Strengths

  • The benchmark is of great practical value because understanding and generating structure-rich text is very useful in many real-life scenarios, yet this capability has not been evaluated rigorously in the literature.
  • The benchmark encompasses a wide range of structure-rich text.

Weaknesses

  • The taxonomy covering widely used structure-rich text is nice, but it would be much better if it were further extended to explain the underlying principles for designing the specific tasks for each kind of structure-rich text. This would provide insight into what underlying capabilities are being evaluated (e.g., understanding recursive paths should be required for many kinds of structure-rich text) and hints on what should/can be improved in future work.
  • Only closed-source LLMs are used for evaluation. Results from open-source models should be included for better reproducibility.

Questions

Python files are collected from the Internet, while files from the other text classes are created procedurally (i.e., they are synthetic). I wonder how much it would cost to use real-world files for all text classes? It seems that most of the human effort goes into curating the tasks given the files.

AC Meta-Review

The paper presents an automatically acquired benchmark for evaluating knowledge inference in LLMs.

Why not a higher score

Reviewers note shallow experimentation and poor overall presentation. In general, the reviewers were confused.

Why not a lower score

n/a

Final Decision

Reject