PaperHub

Rating: 7.2/10 (Poster; 5 reviewers; min 6, max 8, std 0.7)
Individual ratings: 7, 8, 7, 6, 8
Confidence: 3.8 · Correctness: 3.4 · Contribution: 3.4 · Presentation: 3.6
NeurIPS 2024

Long-form factuality in large language models

OpenReview · PDF
Submitted: 2024-04-25 · Updated: 2024-11-06
TL;DR

We propose a dataset, evaluation method, and metric for benchmarking long-form factuality in large language models.

Abstract

Keywords

natural language processing, machine learning, factuality

Reviews and Discussion

Review (Rating: 7)

This paper introduces LongFact, a dataset consisting of 2,280 questions across 38 topics, designed to assess the factuality of long-form answers generated by large language models. The authors also propose a new evaluation method called the Search-Augmented Factuality Evaluator (SAFE), which utilizes large language models and Google search to verify the accuracy of individual facts within long responses. Additionally, the paper introduces the F1@K metric, which is used to measure the precision and recall of factual accuracy in the model's responses.

Strengths

  1. The workload of the paper is substantial, including the construction of the dataset, evaluation methods, and assessment metrics.
  2. The paper introduces the comprehensive LongFact dataset, which covers 38 topics, providing broad domain coverage that enhances the depth and breadth of the evaluation.
  3. The SAFE method reduces the cost and time of manual evaluation.

Weaknesses

  1. The layout of this paper could be improved. Many significant pieces of information have been relegated to the appendix, which may result in key details and data being overlooked by readers at first glance. For instance, the SAFE method, one of the contributions of this paper, should be prominently discussed in the main body of the text. However, it is only briefly mentioned in passing, which does not do justice to its importance.
  2. The paper frequently cites the FactScore method as "such as Min et al. (2023)" without specifying which aspects were used. The author should at least summarize the key elements adopted, rather than requiring readers to refer to the original paper for details.
  3. Table 16 shows repeated queries for each fact, which contradicts the claim in the "Rating Individual Facts" section that including retrieved results in prompts prevents duplicates. The author needs to address this inconsistency.

Questions

The author mentions that SAFE corrected 76% of the discrepancies with human annotations. Does this suggest that the annotation results from Min et al. (2023) are flawed? If so, does the reported 72% agreement rate with data annotated by Min et al. (2023) still hold significance?

Limitations

The results in Table 2 should be differentiated, such as by bolding the best outcomes.

Author Response

Thank you for your careful reading and thoughtful reviews. Let us address your comments below.

The layout of this paper could be improved. Many significant pieces of information have been relegated to the appendix...

Thanks for the suggestion. We agree and will move some more information on the procedure of SAFE and its comparison with human annotators to Sections 3 and 4 of the main body.

The paper frequently cites the FactScore method as "such as Min et al. (2023)" without specifying which aspects were used. The author should at least summarize the key elements adopted, rather than requiring readers to refer to the original paper for details.

Thanks for the feedback. We will clarify those aspects in the revision. The specific aspects we refer to in the FActScore paper (Min et al. 2023) are:

  • Line 48, 136, 155, 704, 914, 1042, 1725: FActScore human annotations from crowdsourced human annotators
  • Line 93, 829: the step of decomposing the long-form response into atomic facts in FActScore
  • Line 621, 966: the FActScore metric (which is essentially average precision) in FActScore’s Section 3.1
  • Line 713, 832, 848: the FActScore data on biography
  • Line 789, 806: the FActScore label categories for atomic facts: "Supported", "Not-supported", and "Irrelevant" (FActScore Section 3.3)

Table 16 shows repeated queries for each fact, which contradicts the claim in the "Rating Individual Facts" section that including retrieved results in prompts prevents duplicates. The author needs to address this inconsistency.

Thanks for catching this! Our prompt seeks to prevent duplicates but is indeed not infallible, because the model may not fully follow these instructions. We've made this clearer by adjusting the language in the "Rating Individual Facts" section to "discourage the model from issuing duplicates". Of course, there are engineering methods (other than prompt tuning) that can prevent duplicates, including resampling when encountering duplicate queries and increasing the temperature for query generation (the current temperature is 0.1, as noted in Footnote 18).
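For illustration, here is a minimal sketch of the resampling idea (not the SAFE codebase; `generate_search_query` is a hypothetical helper that prompts the model for one query at the given temperature):

```python
# Minimal sketch of resampling on duplicate queries (illustrative, not the paper's code).
def sample_unique_query(claim, prior_queries, generate_search_query,
                        max_retries=3, base_temp=0.1, retry_temp=0.7):
    """Return one fact-checking query for `claim` that is not already in `prior_queries`."""
    query = generate_search_query(claim, prior_queries, temperature=base_temp)
    retries = 0
    while query in prior_queries and retries < max_retries:
        # Resample at a higher temperature to encourage a different query.
        query = generate_search_query(claim, prior_queries, temperature=retry_temp)
        retries += 1
    return query
```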

The author mentions that SAFE corrected 76% of the discrepancies with human annotations. Does this suggest that the annotation results from Min et al. (2023) are flawed? If so, does the reported 72% agreement rate with data annotated by Min et al. (2023) still hold significance?

Thank you for this interesting question about the annotations from Min et al. (2023). We believe that:

  • SAFE's 76% win rate on discrepancies doesn't mean crowd-sourced human annotations are flawed, because we manually checked a random subset of 50 individual facts (35 "Supported" and 15 "Not Supported") on which SAFE and human annotators agreed, and 48 (96%) were rated correctly.
  • Since human annotations are not flawed, SAFE's 72% agreement rate holds significance, because the agreement rate was calculated over a large number of individual facts (16,011; Figure 4, Lines 124-133), meaning that SAFE is as good as crowd-sourced human annotations on these agreements.

The results in Table 2 should be differentiated, such as by bolding the best outcomes.

Thanks for the suggestion. We will make this change in the revision.

Comment

Thanks for the author's response. I still have some questions.

  1. When the author used SAFE for evaluation, they employed Google search. For each query, how many search results were used to conduct fact verification?

  2. Although the author states that large language models do not always follow human instructions, looking at the sample tables 16-18 provided by the author, there are too many repeated queries. I'm somewhat concerned whether your query generation strategy is truly effective.

  3. When the author used an LLM agent for fact annotation, they re-annotated the results where SAFE and human raters disagreed by using Google search. What was the purpose of this? Was it because the annotations of these inconsistent results were incorrect? The author denies this. Can you explain this?

Comment

Thanks for the further questions. Please find our answers below.

When the author used SAFE for evaluation, they employed Google search. For each query, how many search results were used to conduct fact verification?

When using SAFE for evaluation, we returned 3 search results per query from the Serper API. We discussed this in Appendix C.1 Line 799 and ablated it in Appendix C.6 Lines 881-896 and Figure 13. We are happy to make this clearer in the revised paper by adding this number to the section that first presents SAFE in the main paper (Section 3).

Although the author states that large language models do not always follow human instructions, looking at the sample tables 16-18 provided by the author, there are too many repeated queries. I'm somewhat concerned whether your query generation strategy is truly effective.

First, there are many other atomic claims whose fact-checking queries don’t have duplicates. One example is that for the question "Who is Antoni Gaudí?", the fact-checking queries from SAFE for the claim "The Sagrada Família is perhaps Antoni Gaudí's most famous work" are:

  • Is the Sagrada Família considered Antoni Gaudí's most famous work?
  • What are some other notable works by Antoni Gaudí besides the Sagrada Família?
  • Antoni Gaudí Sagrada Família famous work evidence
  • Did Antoni Gaudí complete the Sagrada Família church before his death?
  • Antoni Gaudí Sagrada Família completion date

Second, there are indeed some claims for which there are repeated fact-checking queries, like the ones we showed in Tables 16 and 18. Our insights are twofold:

  • On the one hand, prompting models with better instruction-following capabilities and rewording the prompt to emphasize diversity may reduce the duplication of queries.
  • On the other hand, some claims can be verified with fewer than 5 queries, in which case the model (or a human) may not be able to, or even need to, come up with that many different queries. In this case, the model may use duplicated queries since the SAFE pipeline forces the model to use exactly 5 queries. Empirically, we see that claims that only contain objective words can be fact-checked more easily with less diverse queries, whereas claims with subjective judgements (like "famous" in the above example and "significant" in main text Figure 3, between Lines 105 and 106) need more diverse queries to collect evidence from multiple angles for the LLM to finally make a judgment.

Inspired by the above observations, an idea to improve SAFE is to add an option for the LLM to stop generating more queries when it determines that the collected evidence is enough to judge whether the claim is supported. Right now we instruct the LLM to issue 5 search queries using the instruction "To do this, you are allowed to issue ONE Google Search query that you think will allow you to find additional useful evidence." We do this regardless of whether the already-collected evidence is enough to make the final annotation. The change should reduce cost and focus resources on more-difficult claims.
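For illustration, a minimal sketch of this early-stopping idea (a possible future variant, not the current SAFE pipeline; `generate_query`, `run_search`, `evidence_is_sufficient`, and `rate_claim` are hypothetical LLM/search helpers):

```python
# Sketch of the proposed early-stopping variant of the query loop (illustrative only).
def rate_claim_with_early_stop(claim, generate_query, run_search,
                               evidence_is_sufficient, rate_claim, max_queries=5):
    evidence = []
    for _ in range(max_queries):
        query = generate_query(claim, evidence)      # one Google Search query per step
        evidence.extend(run_search(query))
        # Stop issuing queries once the LLM judges the collected evidence sufficient.
        if evidence_is_sufficient(claim, evidence):
            break
    return rate_claim(claim, evidence)               # "Supported" / "Not Supported"
```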

Thanks again for bringing this up and inspiring us to further improve our method. We will also make the above clarifications in our manuscript.

When the author used an LLM agent for fact annotation, they re-annotated the results where SAFE and human raters disagreed by using Google search...

In Section 4, we indeed manually annotated facts where SAFE and human raters from Min et al. 2023 disagreed. We did this because we wanted to answer the question of "when SAFE and human raters disagree, is SAFE usually correct (and therefore the human raters were incorrect), or are the human raters usually correct (and therefore SAFE was incorrect)?" We believe that answering this question helps us understand how SAFE performs as an automated factuality annotator relative to crowdsourced humans.

In this experiment, we assumed that our annotations ("researcher + full-internet access") were ground-truth labels, since we believe that these expert-level annotations with more available resources and (presumably) more carefulness are more representative of true factuality relative to crowdsourced humans. As stated in our previous response, however, this does not mean that all of the results from Min et al. 2023 are in question; we only note that when SAFE and the annotations from Min et al. 2023 disagree (which could indicate that a fact is difficult to verify), SAFE is often correct. Indeed, the crowdsourced annotations from Min et al. 2023 are only "incorrect" relative to our definition of factuality for these disagreement cases. Since there is still a 72% agreement rate and we saw extremely few facts (2/50 in our sampled batch, or 4%) where both crowdsourced human annotators and SAFE would return incorrect annotations, we posit that almost all of these 72% of the 16k facts were indeed correctly annotated.

Please let us know if we can clarify this further. Thanks again for the inspiring questions.

Comment

Thank you for all the responses. I believe this is an interesting work, and I will maintain my positive scores.

Review (Rating: 8)

In this work, the authors investigate the evaluation of LLMs’ factuality in long-form generation. Specifically, they first introduce LongFact, a multi-topic benchmark for assessing long-form factuality. In LongFact, they use GPT-4 to generate questions about specific concepts or objects from given topics. Additionally, they design a search-augmented factuality evaluator (SAFE), which evaluates the LM’s output by breaking down the generation into individual claims and then using the Google Search API to verify the correctness of these claims. They also conduct experiments with popular LLMs and find that larger models exhibit better factuality in long-form generation.

Strengths

Originality: The authors propose a new benchmark for evaluating long-form generation. Unlike previous studies, LongFact encompasses multiple topics. The SAFE evaluation pipeline introduces the use of the Google Search API to estimate the correctness of claims, a novel approach not previously employed.

Quality: This paper explores both the effectiveness and the cost of evaluation compared to human annotators, providing strong support for the value of the proposed methods. The error analysis offers valuable insights into their method and helps the reader better understand the evaluation pipeline.

Clarity: The paper is well-structured and easy to follow, with detailed explanations provided.

Significance: This paper tackles the challenging task of evaluating LLM factuality in long-form generation. The datasets are released, and the results are easily reproducible, benefiting the research community.

Weaknesses

  1. The overall concept is similar to previous work that decomposes long-form responses into claims, such as FactScore, which diminishes the novelty. Although this work covers more topics and uses Google instead of Wikipedia in the evaluation pipeline, the similarity remains.

  2. The claim that “LLM agents can be better factuality annotators than humans” seems somewhat exaggerated. The human annotators do not have internet access and could only label based on the corresponding Wikipedia page, which is not a practical setup.

Questions

  1. I notice that current benchmarks for long-form factuality, including LongFact and others, typically consist of questions that ask for information or explanations about a subject, such as individuals or concepts (typically extracted from Wikipedia). The dependencies between sentences in the LM’s responses in these benchmarks are generally simple. Can SAFE handle responses where more complex inter-sentence dependencies exist? Consider the following question from ELI5 as an example:

Query Title: What were different decades in the 1800s like?

Explanation: Every decade of the 20th century can be loosely quantified in a general theme (the roaring 20s, the flower power 60s, the weird clothing styles/music of the 80s) and each one is so different from the last. What kinds of themes were common between decades in the 19th century, and were they as distinctively different from each other as those in the 20th century?

Candidate Answer: In the early 1800's there was a radical change of fashion and lifestyle for men. The high breeches, stockings and tricorne hats one associates with the era of the French and American revolutions disappeared with the genesis of modern fashion which emphasised modesty in clothing, eloquence in speech and the most refined manners and hygiene. This was the birth of the "dandy" culture which still thrives and is very much predominant today.

In this candidate answer, the claim “This was the birth of the ‘dandy’ culture” is supported by all the previous claims. Could SAFE handle this?

  2. It seems that the idea of constructing LongFact and the quality estimation method SAFE could be used to produce synthetic data for enhancing LM’s factuality. Do the authors plan to explore this direction?

Limitations

The authors effectively summarize the limitations of this work. Most of these limitations are not trivial to address, and I agree with the authors that the current design represents a good trade-off between cost and performance.

Author Response

Thank you for your careful reading and thoughtful reviews. Let us address your comments below.

The overall concept is similar to previous work that decomposes long-form responses into claims, such as FactScore, which diminishes the novelty.

We agree that our evaluation method SAFE shares similarity with FActScore in decomposing a long-form response into facts and checking precision against external knowledge sources. Our work, however, still makes these contributions:

  • We propose LongFact to evaluate long-form factuality in open domains. To our knowledge, this is the first multi-topic dataset for open-domain factuality evaluation.
  • SAFE with multi-step reasoning: Figure 3 on Page 4 shows that SAFE carefully reasons through different aspects of a fact to arrive at its final conclusion.
  • We propose F1@K as the long-form factuality metric that takes into account both precision and recall. To our knowledge, this is the first long-form factuality metric that can account for recall of model responses.
  • We evaluate long-form factuality using the LongFact benchmark and the F1@K metric, allowing us to gain more insights into the impacts of scaling and RLHF on the long-form factuality of models (Appendix E.2 and E.4, Line 1047-1070 and Line 1079-1103).

The claim that "LLM agents can be better factuality annotators than humans" seems somewhat exaggerated. The human annotators do not have internet access and could only label based on the corresponding Wikipedia page, which is not a practical setup.

We agree that the capabilities of human annotators in FActScore are limited when they only have access to the provided Wikipedia page. In Appendix A.4 Figure 8 (Lines 587-599), we show that 22 out of 81 incorrect ratings were due to missing information in Wikipedia pages. On one hand, this demonstrates the advantage of using Google Search rather than a predetermined knowledge source, which is a strength of SAFE we want to highlight; on the other hand, human annotators more often fail for the other reasons shown in the pie chart, mostly reasoning issues or carelessness. Even if we rule out the 22 incorrect ratings caused by missing information, human annotators still give incorrect ratings in 59% of the 100 disagreements, much higher than SAFE's 24% (Appendix A.3 Figure 7, Lines 552-572).

I notice that current benchmarks for long-form factuality, including LongFact and others, typically consist of questions that ask for information or explanations about a subject, such as individuals or concepts (typically extracted from Wikipedia). The dependencies between sentences in the LM’s responses in these benchmarks are generally simple. Can SAFE handle responses where more complex inter-sentence dependencies exist? Consider the following question from ELI5 as an example:

Thanks for noting this and providing a detailed example. We agree that some responses may be harder to decompose into atomic facts than others. We ran SAFE with gpt-3.5-turbo and Serper API, and saw the sentence "This was the birth of the "dandy" culture which still thrives and is very much predominant today." broken down into three self-contained atomic facts:

  • The radical change of fashion and lifestyle for men in the early 1800s was the birth of the "dandy" culture.
  • The culture emphasizing modesty in clothing, eloquence in speech, refined manners, and hygiene still thrives.
  • The "dandy" culture that originated in the early 1800s is predominant today.

These three atomic facts adequately capture the main claims of the sentence and are self-contained, incorporating the relevant information from the preceding claims. All three are rated as "Supported" by SAFE.

It seems that the idea of constructing LongFact and the quality estimation method SAFE could be used to produce synthetic data for enhancing LM’s factuality. Do the authors plan to explore this direction?

Thanks for the proposal. We agree that synthetic data generation for factuality improvement is an exciting direction for research and production. Because we still see a few failure cases of agents like SAFE (Appendix A.3, starting from Line 551), we believe it is worth further improving the agents' quality so that they can generate higher-quality synthetic data for improving autoraters and models, and we thus leave this to future research.

Comment

Thanks for the detailed response. I will keep my positive score.

Review (Rating: 7)

This paper focuses on the open-domain long-form factuality problem of large language models. It proposes a benchmark called LongFact, which consists of more than 2k prompts across 38 domains, along with an agent-based factuality detection system called SAFE and a new metric to measure factuality: F1@K. Experiments show that SAFE has satisfactory factual-error detection ability that surpasses crowd-sourced annotators.

Strengths

This paper addresses a critical problem and offers substantial contributions. I believe the released benchmark alongside the SAFE system could be practically useful in real-world applications.

Weaknesses

It is foreseeable that the performance of SAFE could surpass that of crowd-sourced annotators, given that crowd-sourced annotations often lack high quality. I think it would be better to compare the SAFE with expert annotators.

Questions

Refer to the weakness.

Limitations

Refer to the weakness.

Author Response

Thank you for your careful reading and thoughtful reviews. Let us address your comments below.

It is foreseeable that the performance of SAFE could surpass that of crowd-sourced annotators, given that crowd-sourced annotations often lack high quality. I think it would be better to compare the SAFE with expert annotators.

We agree that crowd-sourced annotations often lack high quality, which is why we used our researcher annotations as a proxy of expert-level human annotations for the 100 disagreements between crowd-sourced annotations and SAFE in Section 4 (starting from Line 123). Appendix A.3 (starting from Line 551, especially Figure 7) shows SAFE’s causes of failure, which essentially shows in which cases and how often SAFE is inferior to expert humans. For our purposes, we believe that SAFE outperforming crowd-sourced humans indicates that SAFE can serve as a strong autorater for factuality. We believe that improving the quality of SAFE past expert-level manual annotation is a challenging yet important area for future exploration.

Review (Rating: 6)

This paper introduces a novel benchmark, LongFact, for evaluating the factual accuracy of long-form responses generated by large language models (LLMs). It proposes a method named SAFE (Search-Augmented Factuality Evaluator) to automatically assess the factuality of these responses. SAFE uses an LLM to decompose a long-form response into individual facts and checks each fact's accuracy through Google Search. The authors introduce F1@K, a metric that balances the precision and recall of supported facts to quantify long-form factuality. Empirical results show that SAFE outperforms human annotators in terms of cost and accuracy. The study benchmarks thirteen language models across four model families, finding that larger models generally produce more factual responses.

Strengths

  • Innovative Benchmark: The creation of LongFact, a comprehensive benchmark specifically designed for long-form factuality, addresses a significant gap in existing evaluation methods that primarily focus on short-form responses.

  • Automated Evaluation Method: The development of SAFE, which leverages LLMs and Google Search for fact-checking, provides a scalable and cost-effective alternative to human annotations, significantly reducing evaluation costs while maintaining high accuracy.

  • New Metric: The introduction of F1@K as a metric to balance precision and recall in evaluating long-form factuality is a notable contribution, offering a more nuanced measure of a model's performance.

Weaknesses

  • Reliance on Google Search: The dependence on Google Search as the primary source for fact-checking may introduce biases and limitations. The retrieved information may also contain inaccuracies or hallucinations, especially for topics that are not well-covered or are subject to misinformation online.

  • Limited Comparison of LLMs: The paper would benefit from a more comprehensive comparison of human agreement scores using different LLMs within the SAFE framework, including open-source models like LLaMA2. This would provide a clearer evaluation of the factuality assessment capabilities of various LLMs.

  • Insufficient Ablation Studies: Conducting more ablation studies would strengthen the paper, such as comparing human agreement scores using different search engines or searching across various sources. This would help understand how different fact-checking methods impact the accuracy and reliability of the evaluations.

  • Generalization to Other Domains: The paper focuses primarily on open-domain factuality. It remains unclear how well the proposed methods generalize to specialized domains such as medicine or law, where factual accuracy is critical and more nuanced.

Questions

Could you please provide more details of re-annotation on the randomly-sampled subset of 100 individual facts?

Limitations

Yes

Author Response

Thank you for your careful reading and thoughtful reviews. Let us address your comments below. We hope these results will help clarify our work, and we will update our manuscript to include them.

Reliance on Google Search

Google Search indeed may have the limitations you mentioned (Lines 281-290). We chose it because it allows coverage of a broad range of topics without high cost, and it is probably the best automated way to estimate the factuality of a claim. Predetermined sources such as Wikipedia may not contain all relevant information needed for open-domain questions (Line 607). SAFE can also be easily tailored to search over more restricted, predetermined sources by adding "site:xxx" to the query or by replacing Serper with another API that queries professional databases.
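For illustration, a minimal sketch of such a scope restriction, implemented purely as query-string manipulation (the `restrict_query` helper is hypothetical, not part of the released code):

```python
# Minimal sketch of restricting SAFE's search scope via a "site:" postamble.
def restrict_query(query, site=None):
    """Append a site restriction so the search engine only returns pages from `site`."""
    return f"{query} site:{site}" if site else query

# Example: the Wikipedia-only ablation described below.
restricted = restrict_query("Antoni Gaudí Sagrada Família completion date",
                            site="https://en.wikipedia.org")
```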

Limited Comparison of LLMs

We’ve run the following experiment.

On a random subset of 20 questions within the FActScore biography dataset, we compare Mixtral 8x7B with gpt-3.5-turbo as the LLM in SAFE. We manually examine the 196 differently-rated claims among all 658: Mixtral wins 1/3 while gpt-3.5-turbo wins 2/3. Both LLMs occasionally fail to find the right result when interacting with search. The main reason Mixtral falls short is its weaker ability to properly revise a claim to be self-contained. For example, for the question "Tell me a bio of Jonathan Tucker", Mixtral revised the claim "Justified aired in 2015" to "The television series that Jonathan Moss Tucker appeared in aired in 2015", thus losing its exact subject. This suggests Mixtral may have weaker instruction-following capabilities, which could be improved by prompt tuning or further fine-tuning.

Insufficient Ablation Studies

We agree the search engine is a crucial part of SAFE and should be ablated. Because we do not have easy access to other engines or databases, we ablate the search scope between no-scope (the current setup with no restrictions) and Wikipedia-only (adding the postamble "site:https://en.wikipedia.org" to queries). On a random subset of 20 questions from the FActScore biography dataset, we manually examine the 146 differently-rated claims among all 657. We found the following two cases where the difference is attributable to search scope:

  1. When no-scope wins: No-scope finds the right information on the open web when Wikipedia doesn’t contain such information.
  2. When Wikipedia-only wins: The no-scope query returns irrelevant information on the open web and causes the rater to rate the claim as "Not Supported", while the Wikipedia-only query finds the right information.

Case 1 appears three times more often than Case 2, suggesting that searching over the open web is more helpful in providing the LLM with the right information. The advantage should be even greater for open-domain questions, for which Wikipedia contains even less information. Meanwhile, Case 2 indicates that the interaction between the LLM and the search engine should be further improved in future research to more accurately find the right information when it exists.

Generalization to Other Domains

Our benchmarking results on a set of 250 randomly-sampled LongFact-Object prompts already contain law and medicine (Appendix E.6.18, E.6.19, E.6.25). Nevertheless, we conduct an additional study on 6 law and medicine questions:

  • What does the EB-5 Immigrant Investor Program under US Immigration Law entail?
  • Can you provide some information about the UN Convention on the Law of the Sea?
  • What is the purpose of the I-765 Application for Employment Authorization in US Immigration Law?
  • What is the Orphan Drug Act?
  • What is the role of the protein Fibroblast Growth Factor 21 in regulating metabolism?
  • What are the primary functions of the neuromodulatory system in the human brainstem?

We prompt gpt-3.5-turbo for responses, and use SAFE with gpt-3.5-turbo and Serper to rate claims. We then manually examine the correctness of each rating.

Overall, SAFE gives the same rating as us (researcher + full internet access) on 79.7% (106/133) of law claims and 94.9% (130/137) of medicine claims. We find that in more nuanced domains, it is more challenging for SAFE to (1) break down a long sentence into claims and (2) revise a claim to be self-contained than it is to rate whether a claim is supported by search results: over 90% of the incorrect ratings fall into the former two categories.

We find that SAFE is still able to reason over search results to judge whether a claim is supported. A successful example is that SAFE rates "FGF21 is produced in response to metabolic stresses" as "Supported" after reasoning over the search result "Increased FGF21 expression in Non-Alcoholic Fatty Liver Disease might be a response to metabolic stress...". There are also failures such as:

  • Not breaking down the sentence into a proper claim: One of the claims the LLM extracts from the sentence "For example, if a refugee named Maria was granted asylum in the U.S. on January 1, 2021…" is "Maria is a refugee".
  • Revised claim not self-contained: One of the claims the LLM extracts from the sentence "FGF21 levels are elevated in conditions such as obesity, type 2 diabetes, and fatty liver disease" is "FGF21 levels are elevated".

Therefore, for more nuanced domains, SAFE can be improved by using LLMs with stronger reasoning capabilities to better generate self-contained claims, not necessarily to better interact with tools like search engines.

Could you please provide more details of re-annotation on the randomly-sampled subset of 100 individual facts?

Thanks for allowing us to clarify our annotation process. We used our own "researcher + full internet access" annotations to generate ground-truth labels (Appendix A.10) and inspected the annotations of both FActScore and SAFE as well as the steps within SAFE (Appendix A.3, A.4). The researcher annotations (ground-truth label) were derived from manually searching on the full Internet, including but not limited to Wikipedia. Please let us know if we can clarify any other details about this.

Review (Rating: 8)

This work proposes a 4-step pipeline for the automatic evaluation of long-form answers using LLMs. Given a long-form answer, their pipeline involves (1) splitting the answer into individual facts, (2) decontextualizing each fact, (3) determining each fact's relevance to the original question, and (4) verifying each fact against retrieval results from Google Search.
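For concreteness, a minimal sketch of this four-step pipeline (not the authors' released implementation; all helpers such as `split_into_facts` and `google_search` are hypothetical LLM/search calls):

```python
# High-level sketch of the four-step pipeline described above (illustrative only).
def safe_evaluate(question, response, split_into_facts, decontextualize,
                  is_relevant, google_search, rate_with_evidence):
    ratings = {}
    for fact in split_into_facts(response):                 # (1) split answer into facts
        fact = decontextualize(fact, response)               # (2) make each fact self-contained
        if not is_relevant(fact, question):                  # (3) relevance to the question
            ratings[fact] = "Irrelevant"
            continue
        evidence = google_search(fact)                       # (4) retrieve evidence
        ratings[fact] = rate_with_evidence(fact, evidence)   # "Supported" / "Not Supported"
    return ratings
```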

In addition to proposing this pipeline, the authors also develop a set of 2,280 information-seeking prompts (automatically generated with GPT-4) for evaluation. The authors then demonstrate the efficacy of their evaluation pipeline by comparing its judgments against those of human annotators and find high (~75%) agreement. Furthermore, the authors find that, in these disagreements, their pipeline's judgments were usually correct and the annotator judgments were flawed (also ~75% of the time).

Finally, using their pipeline, the authors propose a metric for evaluating the quality of a long-form response that is based on F1 over individual facts in the generated answer. Instead of relying on a gold set of individual facts for each question, their metric evaluates recall using an "ideal number of individual facts in the answer for a given question", which is set as a hyperparameter of the evaluation metric.
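For concreteness, a minimal sketch of such an F1@K-style score, assuming precision is the fraction of supported facts in the response and recall is the number of supported facts capped at the hyperparameter K (a hedged reconstruction, not the paper's verbatim definition):

```python
# Sketch of an F1@K-style score over individual facts (assumed formulation).
def f1_at_k(num_supported, num_facts, k):
    if num_facts == 0 or num_supported == 0:
        return 0.0
    precision = num_supported / num_facts   # supported facts / all facts in the response
    recall = min(num_supported / k, 1.0)    # supported facts relative to the "ideal" K
    return 2 * precision * recall / (precision + recall)

# Example: 40 supported facts out of 50, with K = 64 expected facts -> ~0.70.
print(f1_at_k(40, 50, 64))
```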

Strengths

This work addresses an important open problem in LLM evaluation: automatic evaluation of long-form answers. The proposed evaluation pipeline is accurate (as demonstrated through human evaluation) and cheap (20x cheaper than relying on humans). The proposed dataset, as well, will be a great resource for future work.

While the proposed F1 evaluation metric may not comprehensively evaluate all aspects of individual fact recall in long-form QA, it still represents a suitable and scalable alternative. See Weaknesses below for discussion.

Weaknesses

One concern is with the "ideal number of individual facts in the answer for a given question" hyperparameter. While this does seem to be a suitable alternative to relying on a gold set of facts, such a value is highly question- and user-dependent. While the results demonstrate that performance rankings are robust to this hyperparameter selection, there is some concern about how the metric treats models that have been trained to produce more concise answers versus more verbose answers.

Intuitively, it seems that fact relevance (not just the binary prediction, but some more granular score) is also an important consideration when evaluating recall. Recall over a wide range of correct, somewhat relevant facts may be less desirable than recall over a small number of highly relevant facts.

Questions

Regarding the weaknesses noted above, I'm curious whether, if this evaluation / metric / dataset is released as a benchmark or leaderboard, it could be gamified by systems that are trained or prompted to produce a specific number of relevant facts (matching K in the evaluation metric). Something along the lines of the precision vs. number-of-facts-recalled curve in the appendix seems to be a more suitable way to evaluate models, testing their ability to generate responses with high F1 over different long-form answer lengths.

Limitations

Yes

Author Response

Thank you for your careful reading and thoughtful reviews. Let us address your comments below.

One concern is with the "ideal number of individual facts in the answer for a given question" hyperparameter...

We agree that the desired number of supported facts K is user-dependent, and the optimal value may vary across models and questions. Our recall metric favors longer responses because of our focus on long-form factuality:

  • Modern LLMs are often trained to produce more-verbose responses because of the reward signals that come from automated or human evaluations [1, 2, 3, 4]. This may be because humans often favor longer responses because these responses provide richer information, especially for open-ended questions (such as the ones in LongFact).
  • The selection of K may look arbitrary, but as you mentioned, the performance rankings are mostly robust to the value of K, and the tunability of K provides researchers the flexibility of deciding how many facts they want the response to include.

References:

[1] Yann Dubois, Balázs Galambosi, Percy Liang, Tatsunori B. Hashimoto. Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators. 2024.

[2] Ryan Park, Rafael Rafailov, Stefano Ermon, Chelsea Finn. Disentangling Length from Quality in Direct Preference Optimization. 2024.

[3] Weizhe Yuan, Ilia Kulikov, Ping Yu, Kyunghyun Cho, Sainbayar Sukhbaatar, Jason Weston, Jing Xu. Following Length Constraints in Instructions. 2024.

[4] Tianhao Wu, Weizhe Yuan, Olga Golovneva, Jing Xu, Yuandong Tian, Jiantao Jiao, Jason Weston, Sainbayar Sukhbaatar. Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge. 2024.

Intuitively, it seems that fact relevance (not just the binary prediction, but some more granular score) is also an important consideration when evaluating recall...

Thanks for the insight. We agree that rather than evaluating recall based on the number of (somewhat) relevant facts within an LLM's response relative to an expected number of facts K, it would be ideal to evaluate recall based on the relevance between the facts in an LLM's response and a golden ground-truth set of facts that covers the most important aspects. That said, we consider our recall metric valid because:

  • The relevance of facts in an LLM’s response can be seen as a function of instruction-following capabilities [5], so we follow previous works [6, 7] and purely focus on the factuality of claims that are at least somewhat relevant.
  • Why there isn't a core set of highly-relevant facts as ground truth: We presented the LongFact benchmark without a core set of highly-relevant facts because this set would be highly subjective and therefore not straightforward to create. It would be difficult to objectively decide which facts are more important than others for open-domain questions. For this reason, we instead presented "recall up to K facts" as a surrogate metric for recall. We agree, however, that designating such core sets of facts for LongFact prompts could be a valuable direction for future work.

References:

[5] Zhiyuan Zeng, Jiatong Yu, Tianyu Gao, Yu Meng, Tanya Goyal, Danqi Chen. Evaluating Large Language Models at Evaluating Instruction Following. ICLR 2024.

[6] Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, Hannaneh Hajishirzi. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. EMNLP 2023.

[7] Katherine Tian, Eric Mitchell, Huaxiu Yao, Christopher D. Manning, Chelsea Finn. Fine-tuning Language Models for Factuality. ICLR 2024.

Regarding weaknesses noted above, I'm curious whether, if this evaluation / metric / dataset is released as a benchmark or leaderboard, if it could be gamified by systems that are trained or prompted to produce a specific number of relevant facts (matching K in the evaluation metrics)?

Thanks for noting this. Indeed, in Lines 291-297 we discussed how our metric and dataset could be gamified by simply outputting repeated factual claims. Our assumption is that LLM outputs don’t contain repeated facts, which can be satisfied by (1) prompting reasonably-trained LLMs whose outputs are rarely repetitive, and/or (2) adding a unified deduplication step to remove duplicate facts. We thus view this as an engineering problem that can be solved by future research. Additionally, even if one provides K relevant facts to achieve maximum recall, these facts must also be factual to achieve maximum precision that would aggregate into a full score of 100% for F1@K.
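For illustration, a minimal sketch of such a deduplication step; the response does not specify a method, so this assumes simple normalized exact matching (an embedding- or LLM-based matcher would be a natural extension):

```python
# Sketch of a deduplication pass over extracted facts (assumed normalization scheme).
def deduplicate_facts(facts):
    seen, unique = set(), []
    for fact in facts:
        key = " ".join(fact.lower().split())  # normalize case and whitespace
        if key not in seen:
            seen.add(key)
            unique.append(fact)
    return unique
```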

...the precision vs # of facts recalled curve in the appendix seems to be a more suitable way...

Our F1@K metric and the precision vs. number-of-facts curves in Figure 14 have slightly different focuses. The F1@K metric essentially asks "is the model able to provide K relevant and factual claims about the given topic?" The curves, on the other hand, ask "if the model had to give K facts, how many of them would be factual?" The curves may provide richer information on the long-form factuality of models, but they are harder to measure because:

  • It is difficult to control the number of facts in the response. As explained in Appendix D.4 - Footnote 25, the model may have different understandings on what constitutes an individual fact, so we only ask for a given number of sentences. Also, as explained in Footnote 26, the model needs to have a strong instruction-following capability to be able to precisely respond at a given length. Even so, the length of the response is only a proxy for the number of facts that it should contain.
  • Obtaining the data to create these curves requires significantly more time and resources since it requires sampling and evaluating multiple responses per question. It is unclear whether this additional computational cost would be desirable.

Because of these reasons, we proposed F1@K in order to use a single number as a condensed metric for long-form factuality.

Comment

Hi all, the author-reviewer discussion for this paper is extended to Aug 16 11:59pm ET for authors to reply to ethics reviews. -- PCs

Final Decision

The paper introduces a 4-step pipeline designed to automatically evaluate long-form answers using large language models (LLMs). The pipeline involves splitting questions into individual facts, decontextualizing these facts, assessing their relevance, and verifying their accuracy using Google search results. The work is notable for creating a benchmark, LongFact, aimed specifically at long-form factuality, filling a gap in existing evaluation methods that typically focus on short-form responses. The authors also introduce a metric, F1@K, which balances precision and recall.

However, there are several concerns regarding the novelty and approach of this work. While practical, reliance on Google Search for fact-checking introduces potential biases and the risk of inaccuracies, particularly for topics that are not well-covered or prone to misinformation. Moreover, the paper's approach bears significant resemblance to existing methods such as FactScore, which similarly decomposes long-form responses into claims, raising questions about the originality of the proposed method. The claim that LLM agents can outperform human annotators in factuality assessments also seems overstated, particularly given that human annotators in the study were limited to labeling based on Wikipedia pages without internet access.

It would also be worthwhile, and more interesting, to compare Google Search vs. LLM-powered search engines such as Perplexity or you.com, or even Google's summary feature. I also suggest that the authors check "Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation" (Liu et al., 2022) and its line of work, as they already proposed and leveraged the salience protocol called Atomic Content Units (ACUs).