PaperHub
COLM 2024 · Poster
Average rating: 7.0 / 10 (5 reviewers; individual ratings 8, 6, 6, 7, 8; min 6, max 8, std 0.9)
Average confidence: 3.8

FABLES: Evaluating faithfulness and content selection in book-length summarization

Submitted: 2024-03-23 · Updated: 2024-08-26
TL;DR

In this paper, we conduct the first large-scale human evaluation of faithfulness and content selection on LLM-generated summaries of fictional books.

Abstract

While long-context large language models (LLMs) can technically summarize book-length documents (> 100K tokens), the length and complexity of the documents have so far prohibited evaluations of input-dependent aspects like faithfulness. In this paper, we conduct the first large-scale human evaluation of faithfulness and content selection on LLM-generated summaries of fictional books. Our study mitigates the issue of data contamination by focusing on summaries of books published in 2023 or 2024, and we hire annotators who have fully read each book prior to the annotation task to minimize cost and cognitive burden. We collect FABLES, a dataset of annotations on 3,158 claims made in LLM-generated summaries of 26 books, at a cost of $5.2K USD, which allows us to rank LLM summarizers based on faithfulness: CLAUDE-3-OPUS significantly outperforms all closed-source LLMs, while the open-source MIXTRAL is on par with GPT-3.5-TURBO. An analysis of the annotations reveals that most unfaithful claims relate to events and character states, and they generally require indirect reasoning over the narrative to invalidate. While LLM-based auto-raters have proven reliable for factuality and coherence in other settings, we implement several LLM raters of faithfulness and find that none correlates strongly with human annotations, especially with regard to detecting unfaithful claims. Our experiments suggest that detecting unfaithful claims is an important future direction not only for summarization evaluation but also as a testbed for long-context understanding. Finally, we move beyond faithfulness by exploring content selection errors in book-length summarization: we develop a typology of omission errors related to crucial narrative elements and also identify a systematic over-emphasis on events occurring towards the end of the book. We release FABLES to spur further research on the evaluation of book-length summarization.
Keywords
Faithfulness, Content Selection, Book-length Summarization, Human Evaluation

Reviews and Discussion

Review
Rating: 8

This paper presents work on long-form summarization (of books), focusing on evaluating the quality of summaries generated by various LLMs. This is a well-written paper with interesting analysis. The experimental results show a lot of room for improvement in LLM summarization of books. The annotated data would also be released, providing a valuable resource for future work.

Reasons to Accept

  • comprehensive evaluation and analysis
  • valuable dataset to be released to the community
  • provides pointers on where LLMs/long summary models could be improved

Reasons to Reject

  • nothing obvious

Questions for the Authors

  • how sure are you that the annotators on UpWork had read the books carefully enough to be able to judge the quality of the summaries?
  • would you have any way to estimate the recall of the claim decomposition step? It seems that one of the main errors identified later is omission of details. Could that be related to recall of the claim extraction phase?
  • did I understand correctly that there is just one annotator per book? Did you consider multiple annotations per book to get a sense of agreement or an estimate of the difficulty of the task for humans?

Typos: The color names in the heading of Table 6 don't match the colors (at least for me).

Author Response

We appreciate the reviewer's insightful comments and would like to address their concerns below. First, the reviewer wonders about the annotation quality:

Annotation quality control

Great questions! The reviewer is correct that we only hire one annotator per book, as in practice it is difficult to find 2-3 people who have read the same book. That said, we believe that our annotators had not only read the books but also carefully conducted the task, as the process of gathering evidence and giving comments (in addition to the T/F label) forces them to frequently revisit the book.

We perform two additional small-scale analysis experiments that demonstrate the high quality of our dataset: (1) inter-annotator agreement on a subset of claims where we had access to another annotator who also read the book, and (2) self-consistency of annotations (i.e., how often a single annotator assigns the same label to claims with the same semantic content generated by different models). We will add a detailed description of these experiments to the next version of our paper.

Inter-annotator agreement: For two books in our dataset, we were able to hire another annotator who had also read those books and provided overlapping annotations. This process resulted in 115 re-annotated claims, and we find that the agreement rate between the original annotator and the new annotator is 91.30%, with Cohen’s Kappa of 0.621 (p < .0001), indicating substantial agreement. Unfortunately, re-annotating the entire dataset with multiple annotators is prohibitively expensive, costing around $200-$250 per book and requiring ~10 annotation hours.
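
For readers who want to reproduce this kind of check on the released annotations, the computation is straightforward; below is a minimal sketch assuming the overlapping judgments are available as two parallel lists of True/False labels (the function name and toy data are illustrative, not taken from FABLES).

```python
# Minimal sketch of the agreement statistics above, given two parallel lists
# of True/False claim judgments from two annotators (illustrative names/data).
from sklearn.metrics import cohen_kappa_score

def agreement_stats(labels_a, labels_b):
    """Return (raw agreement rate, Cohen's kappa) for two annotators."""
    assert len(labels_a) == len(labels_b) and labels_a, "need parallel, non-empty label lists"
    raw = sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)
    return raw, float(cohen_kappa_score(labels_a, labels_b))

# Toy example (not the real FABLES annotations):
raw, kappa = agreement_stats([True, True, False, True], [True, False, False, True])
print(f"raw agreement = {raw:.1%}, Cohen's kappa = {kappa:.3f}")
```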

Self-consistency: For each book, an annotator analyzed five summaries, each generated by a different model. To assess self-consistency (intra-annotator agreement), we randomly selected five books and compared the annotations made on the first and last summaries (as per annotation order) for claims with the same semantic content. Consistent labels would suggest that annotators maintained a stable judgment throughout the process. Out of 127 claims from the first summaries, 46 have semantically equivalent claims in the last summaries, and all 46 of these claims were consistently labeled.
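
For clarity, a minimal sketch of this self-consistency check follows, assuming claim pairs have already been matched for semantic equivalence (the matching step itself is not shown; names and toy data are illustrative).

```python
# Minimal sketch of the self-consistency (intra-annotator agreement) check,
# given pairs of labels for semantically equivalent claims from the first and
# last summaries; matching of equivalent claims is assumed to be done already.
def self_consistency(matched_pairs):
    """Fraction of matched claim pairs that received the same True/False label."""
    if not matched_pairs:
        return float("nan")
    return sum(first == last for first, last in matched_pairs) / len(matched_pairs)

# Toy example: 46 matched pairs, all labeled consistently -> 100%
pairs = [(True, True)] * 40 + [(False, False)] * 6
print(f"self-consistency = {self_consistency(pairs):.0%}")
```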

Recall of claim decomposition

The Reviewer wonders about the recall of the claim decomposition step. As this concern is shared with Reviewer FiuX, we kindly refer the Reviewer to our response to Reviewer FiuX.

Comment

Thank you for the additional information. This will be important information to include in the next version of the paper.

Review
Rating: 6

The paper addresses an important bottleneck in the current research on long-input understanding and summarization: evaluation of faithfulness. The authors report a large-scale human annotation experiment where summaries produced by a diverse set of SOTA models were evaluated with respect to truthfulness using established methodology (claim-level evaluation). The results showed that a substantial portion of LLM-generated summaries have unfaithful claims. A follow-up experiment to see whether LLM-powered methods could match human-level faithfulness detection indicated a major shortcoming of such methods: it appears that all automatic methods tend to always evaluate summaries as faithful. The authors observe that unfaithfulness errors often require multi-hop reasoning to identify. In addition to the main results, the paper also provides an examination of the types of content errors that automatic summaries contain. Specifically, one frequent error is omission of essential information. The authors promise to release the dataset of annotations.

Reasons to Accept

  • Addresses a very pressing problem in the current research on long-input summarization.
  • Explores a diversity of methods for automatic evaluation.
  • Evaluates a good diversity of SOTA summarization models.
  • Dataset of human annotations, performed with a clean methodology.

Reasons to Reject

  • A more in-depth analysis of where and why the automatic evaluation methods fail would be welcome. One valuable insight is that unfaithful claims often require multi-hop reasoning to be detected: it would be useful to estimate the percentage of cases where this is the case. The authors note themselves that the sample size of unfaithful claims is very small -- so it would probably be possible to even do the in-depth analysis manually.
Author Response

We would like to thank the reviewer for their insightful feedback and address the concerns mentioned in the review below. The reviewer asks about why our automatic evaluation metric fails:

"A more in-depth analysis of where and why the automatic evaluation methods fail would be welcome. One valuable insight is that unfaithful claims often require multi-hop reasoning to be detected: it would useful to estimate the percentage of cases where this is the case. The authors note themselves that the sample size of unfaithful claims is very small -- so it would probably be possible to even do the in-depth analysis manually."

This is an insightful point. As we show in Table 3, 50.2% of unfaithful claims require multi-hop (or indirect) reasoning to be invalidated. Following the reviewer’s suggestion, we look at predictions from our auto-rater experiment (on seven books) for which Claude 3 and GPT-4 Turbo incorrectly mark a false claim as true, and we annotate the type of reasoning required to verify these claims. The results in the table below show that ~75% of the failure cases require multi-hop reasoning over the book, which is significantly higher than the 62.8% observed in the overall data distribution for these seven books and suggests that our auto-raters struggle with multi-hop reasoning. We will include a more in-depth discussion and analysis in the next version of our paper.

| Reasoning Type | Claude 3 False Positives (28 examples) | GPT-4 Turbo False Positives (37 examples) | False Positives Common to Both Models (24 examples) |
|---|---|---|---|
| Indirect | 75.0 | 73.0 | 75.0 |
| Direct | 14.3 | 10.8 | 12.5 |
| Subjective | 7.1 | 10.8 | 8.3 |
| Extra Info | 3.6 | 5.4 | 4.2 |
Comment

Thanks for this more detailed analysis! This is a valuable data point for the understanding of faithfulness evaluation, and probably worth including in the updated version of the paper.

Review
Rating: 6

The paper presents a new dataset for evaluating faithfulness and content selection in book-length summarization. The annotations are collected from the Upwork platform and are based on summaries generated by five different LLMs: GPT-3.5-Turbo, GPT-4, GPT-4-Turbo, Mixtral, Claude-3-Opus. The authors compare the human annotation results with several LLM-based auto-raters, and show that none of them correlates strongly with human annotations. The authors also develop a typology that covers error types beyond unfaithfulness and provide a corresponding analysis.

Reasons to Accept

  1. The dataset could be a good benchmark for faithful book-length summarization task.

Reasons to Reject

  1. Some important details are missing from the paper about the dataset:
  • How do you choose the books? Do you select ones from diverse domains?
  • How do you evaluate and control the quality of the annotations?
  2. In Table 3 in Section 3, how is the taxonomy created? What is the definition of each claim type and reasoning type? The difference between indirect reasoning, subjective reasoning, and extra info reasoning is not very clear to me, at least from the examples presented.

  3. The paper presents a very useful benchmark, with some interesting findings. It would be more insightful if the authors could point out some research directions based on the findings.

Author Response

We appreciate the detailed feedback and would like to address the reviewer's concerns.

Choosing the books

We did not select books ourselves; rather, we hired annotators who self-report books they have read that match our criteria (>2023 fiction), as noted in Section 2. That said, our dataset has multiple genres (see Table 7).

Annotation quality control

Great point! We perform two small-scale analyses to show the annotations’ high quality: (1) inter-annotator agreement, and (2) self-consistency. We will add them to the next version of our paper.

Inter-annotator agreement: For two books, we hired another annotator who had also read them, resulting in 115 claims with overlapping annotations. The agreement rate between the two annotators is 91.30%, with Cohen’s Kappa of 0.621 (p < .0001), indicating substantial agreement (https://aclanthology.org/J08-4004.pdf). Unfortunately, we cannot do this for the entire dataset, as finding multiple workers who have read the same book is challenging and expensive, costing ~$250 per book and 10 annotation hours.

Self-consistency: We randomly selected five books and compared the annotations made on two of the summaries for semantically similar claims. Consistent labels would suggest that labels were not arbitrarily assigned. Out of all 127 claims in the first set of summaries, 46 had semantically equivalent claims in the second set, and all 46 claims were consistently labeled.

Claim taxonomy

The taxonomy was created via manual coding of collected annotations and aligns with prior work (https://arxiv.org/pdf/2303.01432). See Table 15 for details. In short, indirect reasoning is multi-hop reasoning over different parts of the book, subjective reasoning calls for subjective judgment (usually related to themes), and extra info requires meta information about the book (e.g., publication date). We will clarify Table 3 in the next version.

Future directions

We thank the reviewer for pointing this out and will add it in the next version. One future direction is to address content omissions via planning (e.g., generate important arcs, characters, and relationships, and use them as a basis for the summary). Another is to build more reliable auto-raters, as our work shows that existing ones are insufficient. Once auto-raters are improved, we can also use their signals to fine-tune LLMs to become better summarizers (similar to factuality tuning in https://openreview.net/pdf?id=WPZ2yPag4K).

Comment

Dear Reviewer AyeL,

With the discussion period ending soon, could you please confirm whether our rebuttal has addressed your concerns? Your feedback is valuable for improving our research's clarity and quality.

Thank you.

Comment

Thanks for the response. It addressed most of my concerns, thus I increased the score to 6: Marginally above acceptance threshold.

Review
Rating: 7
  • This work presents a detailed evaluation of LLM-generated summaries of long texts (full books).
  • The evaluation covers a human annotation for faithfulness of 3000+ individual claims in the summaries.
  • The paper compares summaries of different LLMs, and contributes to a qualitative understanding of weaknesses and common error types.
  • It provides mixed/negative results for automatic (LLM-based) faithfulness ratings, and connects this to abstractiveness and missing context, highlighting limitations to this paradigm in summarisation evaluation.
  • The authors promise to release the faithfulness annotations (though not the books themselves), which could be a valuable resource for the community.

Reasons to Accept

  • The paper is well-written, clearly structured, concise, makes well motivated methodological choices grounded in recent prior work (summarisation-related, faithfulness-related, claim-extraction related, ...), and presents a research journey with lots of qualitative insights and interesting detail.
  • The annotation is cleverly designed with a focus on books that annotators are already familiar with.
  • The error analysis, in particular in connection with the auto-rating is extensive and in-depth.
  • I enjoyed reading this manuscript.

Reasons to Reject

  • There is little methodological innovation in this work - it fully focuses on evaluation.
  • While the paper paints a great picture of the status quo in terms of weaknesses, I miss a constructive path forward. The auto-rater path was explored, and its limits for this use case are made clear (which is great), but I miss concrete answers to the question "How can this be improved?"
  • Annotation quality is not cross-checked (from what I can tell) - e.g. via duplicate annotations and checking the agreement between annotators. I can see that this may not be straightforward since it requires multiple readers per book. But it would help shed some light on how reliable the faithfulness ratings are - without this annotation quality is unclear.

Questions for the Authors

  • Did you validate that the annotators have read the book, or thoughts on how that could be validated? It's unclear to me how reliable self-reporting is in this context.
  • Could you address recall / coverage of the extracted claims? (e.g. in the example of Fig 2, the summary contains a part about a romantic relationship, which isn't extracted as a claim).
  • Unclear: claim vs subclaim-level annotation / multiple annotations per claim. Can you clarify this?
  • Are there any qualitative observations on summaries of the different models that correlate with faithfulness?
  • 0-shot prompting LLMs for faithfulness classification may not work well out of the box, may depend on the prompt, etc. Do you have a sense for how important that is, and whether that changes the relative performance? Otherwise, what can we readers take away from these potentially very prompt-dependent relative numbers?
  • Did you obtain duplicate annotations for some books / summaries? What is the agreement?
  • Did you observe systematic differences across books? I.e. how much of the variation in faithfulness comes from different models, and how much from different books?
Author Response

We thank the reviewer for the insightful feedback and address their concerns below.

Limited methodological innovation

We believe evaluation is crucial and complements the development of new methods. The field lacks robust benchmarks for assessing long-context LLMs with extremely long inputs due to data contamination and the challenge of evaluating tasks requiring familiarity with entire books. Our work aims to fill this gap.

Future directions

The reviewer is concerned about how to improve the auto-rater. Please see our response to Reviewer AyeL.

Annotation quality control

As this concern is shared with Reviewer BatK, we kindly refer the reviewer to our response to Reviewer BatK due to the space limit.

Recall of claim decomposition

In Figure 2, the omitted part is a generic statement at the beginning of the summary, but the romantic relationship between the characters is indicated in claims 4, 7, 9, 12, 13, 14, and 18. For a detailed analysis of recall, please see our response to Reviewer FiuX.

Claim vs subclaim-level annotation

In footnote 8, the phrase “claims sometimes contain multiple subclaims” refers to claims making multiple statements. During coding, we consider all subclaims, so a single claim may be assigned multiple labels to ensure each aspect is classified.

Correlation of Qualitative Observations with Faithfulness

This question is relevant to Section 5, where we find that the least faithful models tend to be more generic. In addition, while all models omit important information according to annotators, Claude 3 Opus (most faithful) is noted for making the fewest omissions.

Different Prompting

Unfortunately, with book-length inputs, few-shot prompting is impossible with the evaluated models. As for the wording of the prompt, we conducted further experiments requesting (1) the answer only, (2) the answer and then an explanation, and (3) an explanation and then the answer. We observe only a ~1% difference in accuracy across these prompts.
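
For illustration, the three wording variants could be implemented with templates along the following lines; these are paraphrases written for this response, not the exact prompts used in our experiments.

```python
# Illustrative zero-shot prompt templates for the three wording variants
# (answer only, answer then explanation, explanation then answer).
# Paraphrased for exposition; not the exact prompts from the paper.
PROMPT_VARIANTS = {
    "answer_only": (
        "Based on the book text above, is the following claim true or false?\n"
        "Claim: {claim}\n"
        "Answer with 'True' or 'False' only."
    ),
    "answer_then_explanation": (
        "Based on the book text above, is the following claim true or false?\n"
        "Claim: {claim}\n"
        "First answer 'True' or 'False', then explain your reasoning."
    ),
    "explanation_then_answer": (
        "Based on the book text above, is the following claim true or false?\n"
        "Claim: {claim}\n"
        "Explain your reasoning step by step, then answer 'True' or 'False'."
    ),
}

def build_prompt(variant: str, claim: str) -> str:
    """Fill a (hypothetical) claim into the chosen prompt variant."""
    return PROMPT_VARIANTS[variant].format(claim=claim)
```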

Difference Across the Books

We measured the percentage of incorrect claims across books. Most variation we observed was due to the models; for example, Claude 3 Opus consistently performs well. There is also some variation across books, but it’s unclear how much of it is due to the difference in book content. Due to space constraints in the rebuttal, we are unable to include a detailed table of results by book. We will add it along with a relevant discussion in the next version of the paper.

Review
Rating: 8

This paper introduces a human-evaluated dataset of LLM-generated summaries of book-length documents (>100K tokens). Humans manually evaluated the faithfulness of summaries by checking the truthfulness of automatically extracted claims (using GPT-4). Further, the evaluation covers typical issues of LLM-generated summaries by analyzing the content of the summaries, i.e., omission and chronological mistakes are evaluated. The paper introduces a taxonomy of such common mistakes. The dataset covers roughly 3K claim-level annotations across 26 texts. The authors additionally implemented automatic raters by prompting an LLM in a zero-shot manner to verify a single claim given evidence from the book.

Reasons to Accept

This paper presents a useful study on current LLM capabilities regarding faithfulness. The presented results are useful information for the community, and the provided dataset offers a useful analysis, which might improve future generations of LLMs. In general, it is very valuable to be aware of current capabilities and common mistakes.

Reasons to Reject

Intermediate steps for human evaluation are automatically performed, especially the claim extraction, which might also introduce errors. The paper would benefit from an analysis of the correctness and completeness of claims extracted from the summary (I could not find that in the paper). Apart from that, there are no obvious reasons to reject this paper.

Author Response

Thank you to the reviewer for the valuable comments and insightful feedback. We would like to address the concerns raised in the review below.

"Intermediate steps for human evaluation are automatically performed, especially the claim extraction, which might also introduce errors."

Great point! While our paper contains a discussion of the high precision of the claim extraction in Section 2, we did not include a discussion of recall. In response, we closely analyze the extracted claims on a subset of 20 summaries (371 sentences, 450 extracted claims), from which we conclude that the claim extraction process is not responsible for the omissions that our annotators observe. We will include details about this experiment in the next version of our paper.

Recall analysis: For each of the 20 summaries, we manually evaluate the quality of the extracted claims against the content of each summary. Calculating recall is difficult since it is unclear what granularity to calculate it against (e.g., sentences, clauses). 3.8% of the 371 sentences in the 20 summaries were omitted in extracted claims. Of these, 85.7% were generic statements, and 14.3% were minor details. We also qualitatively observe a small percentage of omitted details at the sub-sentential level (e.g., clauses), none of which impacts the narrative. These omissions can be broadly categorized into two types:

  1. Generic statements lacking substantive content. For instance, “The narrative unfolds with intrigue, danger, and treacherous encounters” appears in the summary but is omitted in extracted claims. Note that this sentence only addresses things already covered by other extracted claims in a generic way, so omitting it has few consequences.

  2. Insignificant details that contribute little to the narrative. For instance, “Altha, a 17th-century woman, stands trial unjustly accused of witchcraft due to her remarkable healing abilities which are misunderstood by her village” appears in the summary, but “misunderstood by her village” is omitted in the claims. However, this is only a minor detail with little impact on the narrative.

Importantly, we confirm that none of these discrepancies between the content of the summaries and claims were reasons for criticism of omissions, chronological errors, and factuality issues in the annotators’ summary-level free-form comments. In the next version of the paper, we will run this qualitative check on all summaries and their extracted claims.
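
For concreteness, the reported percentages correspond roughly to the following counts (back-of-the-envelope arithmetic for this response, not figures quoted directly from the paper).

```python
# Rough counts behind the percentages above: ~14 of 371 summary sentences were
# not covered by extracted claims, of which ~12 are generic statements and ~2
# are minor details (illustrative arithmetic only).
total_sentences = 371
omitted = round(0.038 * total_sentences)   # ~14 sentences
generic = round(0.857 * omitted)           # ~12 generic statements
minor = omitted - generic                  # ~2 minor details
print(omitted, generic, minor)             # -> 14 12 2
```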

Comment

Thank you for this qualitative information. Including this information in the next version of the paper is very valuable. Thanks for your great work and providing all those insights, I really enjoyed reading the paper.

Final Decision

The paper tackles the evaluation of faithfulness and content selection in LLM-generated summaries of books. It introduces a valuable human-annotated dataset, providing a benchmark for assessing LLM performance on this task. The paper offers detailed insights into the strengths and weaknesses of current LLMs, particularly concerning faithfulness and content errors. It also highlights the limitations of LLM-based auto-raters for evaluating faithfulness.

Reviewer Consensus:

Reviewers agree that this is a well-written and insightful paper, praising its clarity, structure, and in-depth analysis. The creation of the dataset is recognized as a significant contribution to the field.

Points of Discussion:

One reviewer questioned the methodological novelty, while others commended the well-motivated design choices. Concerns about annotation reliability were raised, but the authors addressed these by providing evidence of high inter-annotator agreement and self-consistency. Reviewers also suggested that the paper include concrete proposals for addressing the identified weaknesses in LLM summarization.

Overall Assessment and Recommendation:

Despite minor disagreements, the paper is viewed favorably. It addresses a critical need in the field and provides a solid foundation for future research. The authors' detailed rebuttal effectively addresses concerns, further strengthening the paper's contributions.

Based on the assessments and responses, I recommend acceptance. This paper is a valuable contribution to the field and will likely stimulate further research on evaluating and improving long-form summarization models.