FacTool: Factuality Detection in Generative AI - A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios
We introduce FacTool, a tool-augmented factuality detection framework that can effectively detect diverse factual errors generated by LLMs.
Abstract
Reviews and Discussion
This paper describes a factuality evaluation framework comprising five steps: claim extraction, query generation, tool querying, evidence collection, and agreement verification. The main novelty of the proposed framework is that it allows the incorporation of external tools such as Google Search, Google Scholar, and code interpreters. It can be applied to multiple tasks (QA, math, code generation, scientific literature reviewing). The authors construct a dataset covering the four tasks mentioned earlier and compare the proposed framework against the baseline, Self-Check.
Strengths
Reasons To Accept
- The studied topic is important, and the proposed framework is a valuable contribution to the community
- Extensive analysis covering multiple tasks, although some interesting content is in the appendix
Weaknesses
Although the described framework makes sense to me, the authors rely heavily on ChatGPT to implement their framework (even if the function is just to check whether list A is a subset of list B); this may hinder other researchers from reproducing their results and incur additional costs.
Questions
N/A
Thanks for the comment!
Although the described framework makes sense to me, the authors rely heavily on ChatGPT to implement their framework (even if the function is just to check whether list A is a subset of list B); this may hinder other researchers from reproducing their results and incur additional costs.
We are currently working on integrating FacTool with Llama2 and various other open-source models. Detailed results are:
| Methods | Tasks | Foundation Model | Claim-Level Acc. | Claim-Level Recall | Claim-Level Prec. | Claim-Level F1 | Response-Level Acc. | Response-Level Recall | Response-Level Prec. | Response-Level F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| FACTOOL | Knowledge-QA | llama2-70b-chat | 73.99 | 93.49 | 77.07 | 84.49 | 52.00 | 78.26 | 48.65 | 60.00 |
| FACTOOL | Code | llama2-70b-chat | 55.75 | 50.00 | 76.00 | 60.32 | 55.75 | 50.00 | 76.00 | 60.32 |
| FACTOOL | Math | llama2-70b-chat | 86.62 | 100.00 | 86.62 | 92.83 | 47.00 | 100.00 | 47.00 | 63.95 |
| FACTOOL | Scientific | llama2-70b-chat | 96.77 | 81.82 | 100.00 | 90.00 | 98.00 | 80.00 | 100.00 | 88.89 |
We note that the error detection ability of FacTool powered by Llama-2-70b-chat is slightly constrained due to Llama-2-70b's limited coding capabilities (Llama-2-70b scores only 29.9% on the HumanEval benchmark, in contrast to ChatGPT-3.5's 48.1%).
I have read the author's response and other reviews. Thanks for providing additional results using open-source models.
Dear reviewer,
We are glad to hear your positive comments! We would greatly appreciate it if you could consider raising the review score if you find our response helpful!
This work proposes a tool to detect factual errors in generated text which can be used across multiple domains. The authors claim that existing fact-checking methods 1.) are limited and task-specific and 2.) use disjoint pieces of text as claims, which might lose contextual information, especially in long-form answers. They propose custom modularized pipelines for factuality error detection for knowledge-based QA, arithmetic problem solving, language-to-code generation, and scientific literature summarization. At a high level, they first use a claim extraction module to state claims in a task-dependent manner. Then they use query generation to test the validity of these claims against retrieved evidence, a calculator, an interpreter, etc., depending on the task. Finally, they collate the results to evaluate the factuality of the LLM response.
Strengths
1.) The authors develop a one-stop factuality tool which can be useful for the community. 2.) The authors have invested in extensive prompt development for claim extraction in various use cases. 3.) They do an extensive analysis of their system on four tasks - KBQA, code generation, scientific literature survey and arithmetic problems
Weaknesses
1.) The technical contributions are somewhat limited and mostly incremental. 2.) The intermediate question generation module to validate any claims made by the LLM against external evidence has been proposed in other works (https://arxiv.org/pdf/2210.08726.pdf, https://arxiv.org/pdf/2309.11495.pdf). The novelty as far as I understand is mostly in claim generation, which is different for each task because the inputs are no longer just declarative statements but also things like code snippets, questions, etc. 3.) The authors stretch the definition of factuality to new tasks like code generation which are not necessarily considered "factual". For instance, there can be multiple ways to write a logical code snippet and not necessarily a universal way for it to be deemed as a fact. 4.) The evaluations mostly focus on different language models and not existing methods for factual error detection like RARR or SELF-REFINE, which could have been used for at least KBQA.
Questions
1.) Could you please describe how the baseline method Self-Check(*) was set up? Does it use some existing code or were the prompts developed by the authors? 2.) In the question generation module for the code generation task, how will the model know what the right corner-case inputs are for unit testing the piece of code? 3.) In Appendix C (Fig 6) some characters in the prompt are all-caps; was there any specific reason for that? Out of curiosity, does standard casing yield bad performance?
Thanks for the comment!
In Appendix C (Fig 6) some characters in the prompt are all-caps; was there any specific reason for that? Out of curiosity, does standard casing yield bad performance?
We follow the same intuition as in https://github.com/openai/evals, which uses capitalization or brackets to emphasize certain parts of the prompt.
Thanks for the comment!
Could you please describe how the baseline method Self-Check(*) was set up? Does it use some existing code or were the prompts developed by the authors?
The prompts were developed by the authors, based on the intuition of Self-Refine. Self-Refine has been acknowledged as a strong baseline for LLMs to refine their own responses. Recall that the first step of Self-Refine is to generate an output, which is then fed back into the same LLM to obtain feedback. This is coherent with our focus, factuality detection. Therefore, we utilize the first step of Self-Refine as our baseline and name it "Self-Check".
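A minimal sketch of how such a self-checking call can be framed is shown below; the prompt wording and the `query_llm` helper are illustrative assumptions, not the exact prompt used in the paper.

```python
# Illustrative sketch of a Self-Check style baseline: the same LLM that produced
# a response is asked to judge the factuality of that response.
# `query_llm` is a hypothetical helper wrapping a chat-completion API call.

def self_check(query_llm, prompt: str, response: str) -> str:
    check_prompt = (
        "You are given a prompt and a response generated by a language model.\n"
        f"Prompt: {prompt}\n"
        f"Response: {response}\n"
        "Does the response contain any factual errors? Answer 'True' if the "
        "response is factual and 'False' otherwise, then briefly explain."
    )
    return query_llm(check_prompt)
```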
Thanks for the comment!
The evaluations mostly focus on different language models and not existing methods for factual error detection like RARR or SELF-REFINE which could have been used for at least KBQA.
As mentioned in the response to Weakness 2, RARR focuses on text editing, which is different from our focus on factuality detection. Also, the workflow of RARR does not include a claim extraction process, which is very different from our focus on providing fine-grained claim-level information to users. Thus, we did not include RARR as a baseline.
On the other hand, we included 'Self-Check', essentially the first step of 'Self-Refine', as our baseline. It's important to note that our focus is on 'detection' rather than 'refinement'. Therefore, we use the first step of 'Self-Refine' as our baseline and refer to it as 'Self-Check'.
Thanks for the comment!
The authors stretch the definition of factuality to new tasks like code generation which are not necessarily considered "factual". For instance, there can be multiple ways to write a logical code snippet and not necessarily a universal way for it to be deemed as a fact.
We agree that there can be multiple ways to write a logical code snippet. In fact, this argument is consistent with our framework since we are adopting an execution-based approach to identify whether a code snippet is factual or not. As defined in our paper, the factuality of a code snippet is grounded in an execution-based approach to code evaluation. This approach measures the correctness of generated code by executing it against test case inputs and comparing its output to the expected (or golden) output.
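As a rough illustration of this execution-based approach, here is a minimal sketch; it assumes the test inputs and expected outputs are already available (in FacTool they are synthesized by an LLM), and the helper names are illustrative.

```python
# Minimal sketch of execution-based factuality checking for code: a snippet is
# considered factual if, for every test input, executing it reproduces the
# expected (golden) output. Sandboxing and timeouts are omitted for brevity.

def check_code_factuality(code: str, func_name: str, test_cases) -> bool:
    namespace = {}
    exec(code, namespace)              # define the generated function
    func = namespace[func_name]
    for inputs, expected in test_cases:
        try:
            if func(*inputs) != expected:
                return False           # output mismatch -> factual error
        except Exception:
            return False               # runtime error -> factual error
    return True

# Hypothetical generated snippet and test cases:
snippet = "def add(a, b):\n    return a + b"
print(check_code_factuality(snippet, "add", [((1, 2), 3), ((0, 0), 0)]))  # True
```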
Thanks for the comment!
The intermediate question generation module to validate any claims made by the LLM against external evidence has been proposed in other works (https://arxiv.org/pdf/2210.08726.pdf, https://arxiv.org/pdf/2309.11495.pdf). The novelty as far as I understand is mostly in claim generation, which is different for each task because the inputs are no longer just declarative statements but also things like code snippets, questions, etc.
We would like to emphasize three main differences between FacTool and RARR:
(i) FacTool provides versatile claim extraction, enabling not only the provision of fine-grained factuality information that helps users more easily detect potential factual errors, but also the integration of factuality detection across various domains and scenarios. In contrast, RARR supports neither fine-grained factuality nor multi-task factuality.
(ii) FacTool offers a more intuitive and straightforward agreement verification process. After evidence collection, we leverage the strong reasoning capabilities of LLMs by directly feeding them the collected evidence, eliminating the need for extra steps in relevance matching (see the sketch after this list). RARR requires an extra step to parse webpages and uses a relevance-matching algorithm to find the most relevant snippets, which is more complicated compared to FacTool.
(iii) FacTool focuses on “factuality detection”, i.e., detecting factual errors from the generated text from LLMs, while RARR focuses on “text editing”, i.e., editing text that may contain hallucinations.
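For concreteness, a minimal sketch of this direct verification step; the prompt wording and the `query_llm` helper are illustrative assumptions, not FacTool's exact prompt.

```python
# Illustrative sketch of agreement verification: collected evidence snippets are
# passed directly to the LLM together with the claim, and the LLM judges whether
# the evidence supports or refutes the claim. `query_llm` is a hypothetical
# wrapper around a chat-completion API.

def verify_claim(query_llm, claim: str, evidence_snippets: list) -> str:
    evidence = "\n".join(f"- {e}" for e in evidence_snippets)
    prompt = (
        f"Claim: {claim}\n"
        f"Evidence:\n{evidence}\n"
        "Based only on the evidence above, is the claim factually correct? "
        "Answer 'True' or 'False' and give a one-sentence justification."
    )
    return query_llm(prompt)
```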
As for https://arxiv.org/pdf/2309.11495.pdf , this work was completed after FacTool (and actually cites a pre-print of the FacTool paper).
Thanks for the comment!
The technical contributions are somewhat limited and mostly incremental.
We want to emphasize the novelty of our paper:
(i) Motivation: Given that a wider range of tasks now faces an increasing risk of containing factual errors when handled by generative AI, we aim to provide a unified framework for detecting factual errors across various tasks and domains.
(ii) A Unified Framework for Multi-Task and Multi-Domain Scenarios: FacTool presents a unified tool-augmented framework for detecting factual errors across diverse tasks and domains, including KBQA, code generation, math, and scientific literature review.
(iii) An automatic factuality evaluator for modern chatbots across different scenarios: FacTool is essentially an automatic factuality evaluator that can be used to evaluate the factuality of modern chatbots across various scenarios. By utilizing FacTool, we maintain a factuality leaderboard for chatbots. We believe FacTool can facilitate the development of modern LLMs by helping developers detect potential factual errors generated by their LLMs.
(iv) A user-friendly API and ChatGPT plugin support: We have made significant efforts to ensure that FacTool is user-friendly and supports a ChatGPT plugin, enabling both technical and general users to benefit from our research.
Thanks for the comment!
In the question generation module for the code generation task, how will the model know what the right corner-case inputs are for unit testing the piece of code?
The test case inputs are generated automatically by LLMs through prompt-based learning. We acknowledge that, at times, the variety of generated test cases can be limited, as mentioned in Figure 16 of Appendix D. However, we believe that further prompt engineering and advancements in more sophisticated LLMs can mitigate this issue.
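As an illustration of how such test-case synthesis can be prompted, here is a minimal sketch; the prompt wording and the `query_llm` helper are our illustrative assumptions, not the exact prompt in the paper.

```python
# Illustrative sketch of prompting an LLM to synthesize test-case inputs for a
# generated function. `query_llm` is a hypothetical chat-completion wrapper.
import json

def generate_test_inputs(query_llm, problem: str, signature: str, n: int = 3):
    prompt = (
        f"Problem description: {problem}\n"
        f"Function signature: {signature}\n"
        f"Propose {n} diverse test inputs for this function, including corner "
        "cases. Return them as a JSON list of argument lists, nothing else."
    )
    return json.loads(query_llm(prompt))
```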
Dear reviewer,
We look forward to your helpful response and feedback! We are happy to provide further clarifications if you have any more questions.
The paper proposes FacTool, a tool-augmented framework for detecting factual errors in texts generated by large language models (LLMs) like ChatGPT. The key ideas are:
- Connects the concept of "tool use" by modern LLMs with factuality detection, using tools like search engines, code interpreters, etc to gather evidence and fact-checking for diverse tasks like QA, code, math, and scientific review.
- Proposes a 5-step process: claim extraction, query generation, tool querying, evidence collection, agreement verification. Uses LLMs for claim extraction, query generation, and verifying claims.
- Evaluate FACTOOL on 4 tasks - KB-QA, code, math, and scientific review. Shows it outperforms baselines like self-checking and supervised models on accuracy.
- Uses FACTOOL to evaluate modern chatbots - finds GPT-4 has highest weighted accuracy, while supervised Vicuna-13B underperforms on math, code, and scientific review.
Strengths
- The paper intertwines the notion of "tool use" in modern LLMs with factuality detection. It showcases the versatile application of LLMs for auxiliary tasks, such as claim extraction and query generation, in the fact-checking pipeline.
- An exhaustive evaluation was undertaken to assess the fact-checking capabilities of LLM responses across four distinct tasks and five datasets.
- The authors have made the codebase publicly available and introduced a ChatGPT plugin for enhanced accessibility.
Weaknesses
- The novelty of the approach appears limited. While the paper employs LLMs to generate search queries and refers to search engines as tools, this methodology isn't drastically different from prior evidence-based fact-checking methods that use search engines for evidence retrieval.
- The study lacks comparative baselines using non-LLM models or previous LLM-based systems for evidence-based fact-checking. This omission makes it challenging to ascertain the specific advantages of incorporating LLMs into the fact-checking process, or of referring to search engines as tools.
- Despite the paper's emphasis on fact-checking, notable fact-checking datasets like FEVER and SciFact are absent from the evaluation.
Questions
- In the "Related Work" section, could you provide a detailed comparison between your approach and the RARR presented by Gao et al. in 2022a? It would be beneficial to understand the distinctiveness of your work in relation to theirs.
- Were any experiments conducted using previous LLMs or non-LLM-based fact-checking systems within the framework of your experimental settings?
Were any experiments conducted using previous LLMs or non-LLM-based fact-checking systems within the framework of your experimental settings?
Please refer to the response of Weakness 2.
Thanks for the comment!
In the "Related Work" section, could you provide a detailed comparison between your approach and the RARR presented by Gao et al. in 2022a? It would be beneficial to understand the distinctiveness of your work in relation to theirs. Response:
We would like to emphasize three main differences between FacTool and RARR:
(i) FacTool provides versatile claim extraction, enabling not only the provision of fine-grained factuality information that helps users more easily detect potential factual errors, but also the integration of factuality detection across various domains and scenarios. In contrast, RARR supports neither fine-grained factuality nor multi-task factuality.
(ii) FacTool offers a more intuitive and straightforward agreement verification process. After evidence collection, we leverage the strong reasoning capabilities of LLMs by directly feeding them the collected evidence, eliminating the need for extra steps in relevance matching. RARR requires an extra step to parse webpages and uses a relevance-matching algorithm to find the most relevant snippets, which is more complicated compared to FacTool.
(iii) FacTool focuses on “factuality detection”, i.e., detecting factual errors from the generated text from LLMs, while RARR focuses on “text editing”, i.e., editing text that may contain hallucinations.
We will update the related work section in our final revision.
Thank you for the comment!
Despite the paper's emphasis on fact-checking, notable fact-checking datasets like FEVER and SciFact are absent from the evaluation.
In our paper, we focus on the growing challenge of factual errors generated by models in generative AI like ChatGPT. To this end, we have constructed a dataset based on the natural distribution of errors generated by these models. Fact-checking datasets like FEVER and SciFact do not reflect the natural distribution of errors generated by LLMs, so we did not focus on these datasets. Additionally, both FEVER and SciFact provide claims, which does not align with our task setting: offering an end-to-end system capable of delivering detailed, claim-level factuality information from long-form context.
Thank you for the comment!
The study lacks comparative baselines using non-LLM models or previous LLM-based systems for evidence-based fact-checking. This omission makes it challenging to ascertain the specific advantages of incorporating LLMs into the fact-checking process, or referring to search engines as tools.
Regarding previous LLM-based systems
We are currently working on integrating FacTool with Llama2 and various other open-source models. Detailed results are:
| Methods | Tasks | Foundation Model | Claim-Level Acc. | Claim-Level Recall | Claim-Level Prec. | Claim-Level F1 | Response-Level Acc. | Response-Level Recall | Response-Level Prec. | Response-Level F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| FACTOOL | Knowledge-QA | llama2-70b-chat | 73.99 | 93.49 | 77.07 | 84.49 | 52.00 | 78.26 | 48.65 | 60.00 |
| FACTOOL | Code | llama2-70b-chat | 55.75 | 50.00 | 76.00 | 60.32 | 55.75 | 50.00 | 76.00 | 60.32 |
| FACTOOL | Math | llama2-70b-chat | 86.62 | 100.00 | 86.62 | 92.83 | 47.00 | 100.00 | 47.00 | 63.95 |
| FACTOOL | Scientific | llama2-70b-chat | 96.77 | 81.82 | 100.00 | 90.00 | 98.00 | 80.00 | 100.00 | 88.89 |
We note that the error detection ability of FacTool powered by Llama-2-70b-chat is slightly constrained due to Llama-2-70b's limited coding capabilities (Llama-2-70b scores only 29.9% on the HumanEval benchmark, in contrast to ChatGPT-3.5's 48.1%).
Regarding non-LLM-based systems
It is challenging to find a non-LLM approach capable of performing the task defined here: the fine-grained assessment of factuality in long texts from various scenarios. Previous methods often focused on specific aspects. For example, FEVER judges the factuality of a given claim, a simpler task than the one defined here. Developing a non-LLM method from scratch to conduct fine-grained assessment of factuality in long texts from various scenarios would require a substantial engineering effort. This effort could be so significant that it might be considered a separate project.
Thank you for the comment!
The novelty of the approach appears limited. While the paper employs LLMs to generate search queries and refers to search engines as tools, this methodology isn't drastically different from prior evidence-based fact-checking methods that use search engines for evidence retrieval.
FacTool's fact-checking methodology differs from prior works that use search engines for evidence retrieval in evidence-based fact-checking through its versatile claim extraction module. This module provides fine-grained factuality information for long-form text across various domains and scenarios, enabling a unified framework for factuality detection across different scenarios.
Classical evidence-based fact-checking methods [1, 2, 3, 4] typically follow a three-step pipeline: document retrieval, sentence retrieval, and claim verification. This pipeline is designed to address scenarios where the claims are already given. In contrast, FacTool is capable of fact-checking long-form texts in which claims are not explicitly given.
More recent prior works such as RARR [5] utilize search engines for evidence retrieval but also do not employ a claim extraction process to collect detailed factuality information. Other concurrent works like FactScore [6] employ a claim extraction module to extract atomic facts from long-form texts but are limited to the Knowledge-Based Question Answering (KBQA) task. Our work expands the definition of factuality beyond the KBQA task and does not solely focus on using search engines for evidence retrieval. FacTool is designed to be versatile and capable of handling multiple tasks, integrating various tools such as search engines, Python executors, and Google Scholar, as demonstrated in our paper. This comprehensive integration enables FacTool to tackle a wider spectrum of factuality issues in generative AI (GAI), overcoming the limitations of existing works that are focused on single tasks or specific scenarios.
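A high-level sketch of this multi-tool integration is given below; every name here is an illustrative placeholder supplied by the caller, not the actual FacTool API.

```python
# Illustrative sketch of dispatching extracted claims to task-specific tools and
# verifying each claim against the collected evidence. The `tools` mapping and
# `verify_with_llm` callable are hypothetical placeholders, e.g.
# {"kbqa": google_search, "code": run_test_cases,
#  "math": run_calculator, "scientific": query_google_scholar}.

def factool_pipeline(task: str, claims: list, tools: dict, verify_with_llm):
    collect_evidence = tools[task]   # pick the evidence-collection tool per task
    return [verify_with_llm(claim, collect_evidence(claim)) for claim in claims]
```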
References:
[1] Bekoulis et al. A review on fact extraction and verification
[2] Thorne et al. FEVER: a large-scale dataset for fact extraction and VERification
[3] Zhong et al. Reasoning Over Semantic-Level Graph for Fact Checking
[4] Zhou et al. GEAR: Graphbased Evidence Aggregating and Reasoning for Fact Verification
[5] Gao et al. Rarr: Researching and revising what language models say, using language models
[6] Min et al. FACTSCORE: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation
Thanks for the author's response.
- Although the author states that their work is different from the prior arts with the claim extraction module, it's hard to agree that simply adding such a module is novel.
- Thanks for adding new experiments with Llama-2 70b. For non-LLM methods, there is open-source code including both claim extraction (https://github.com/titipata/detecting-scientific-claim) and a fact-checking pipeline (https://github.com/dwadden/multivers)
- If it's important to distinguish detecting the factual error in generated text rather than other text, is it better to phrase the main task of this paper as hallucination detection rather than fact-checking? As fact-checking is commonly known as a task to check facts in text no matter if it's machine-generated or human-written.
- Thanks for the details comparison with previous work RARR.
If it's important to distinguish detecting the factual error in generated text rather than other text, is it better to phrase the main task of this paper as hallucination detection rather than fact-checking? As fact-checking is commonly known as a task to check facts in text no matter if it's machine-generated or human-written.
The title of our paper, FacTool: Factuality Detection in Generative AI - A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios, highlights our focus on detecting diverse factual errors generated by generative AI (LLMs). The task that we focus on is "Factuality Detection in Generative AI across Multi-Task and Multi-Domain Scenarios".
Thanks for adding new experiments with Llama-2 70b. For non-LLM methods, there is open-source code including both claim extraction (https://github.com/titipata/detecting-scientific-claim) and a fact-checking pipeline (https://github.com/dwadden/multivers)
Thanks for acknowledging our effort in adding new experiments with Llama-2 70b, and thanks for providing the references. However, we believe that neither https://github.com/titipata/detecting-scientific-claim nor https://github.com/dwadden/multivers is suitable for the main objective of our paper: factuality detection in Generative AI across multi-task and multi-domain scenarios.
Upon trying to run the open-source code, we found that neither of the repos is suitable for our purpose, as detailed below:
Regarding https://github.com/titipata/detecting-scientific-claim
The claim extraction setting of https://github.com/titipata/detecting-scientific-claim is entirely different from ours. According to their paper [1], their claim extraction task is defined as predicting whether a “sentence” in a paragraph is a “claim” or not, which is a “binary classification” task. However, in our paper:
(i) We define a “claim” as either an atomic fact, a code snippet, a math calculation, or a tuple (paper title, year, authors) extracted from long-form text. This is clearly different from simply a “sentence” as defined in [1].
(ii) Also, our claim extraction task is to extract all verifiable claims within the generated text. This is a “generative” task, not a “binary classification” task as defined in [1].
Regarding https://github.com/dwadden/multivers
The scientific claim verification task defined in [2] is: given a claim and a collection of candidate abstracts that may contain evidence relevant to that claim, a system must predict a fact-checking label for the claim. This is different from our task, in which claims are not provided. The task of factuality detection in Generative AI requires a pipeline that can provide fine-grained claim-level factuality WITHOUT given claims.
Moreover, directly applying the MultiVerS model for agreement verification is impractical, as it is trained on specific datasets (HealthVer, COVIDFact, SCIFACT), and its effectiveness on general domains and tasks like KBQA is unknown. It is very likely that further fine-tuning would be required. Given the superior capabilities of open-source LLMs, we believe that fine-tuning open-source LLMs would be a better choice than using non-LLM-based methods.
“Versatile claim extraction” is highly non-trivial and was very challenging before the era of LLMs; the limitations of non-LLM methods in achieving this task underscore its complexity. LLM-based methods enable claim extraction through customizable prompts, offering a generalizable solution across various scenarios. In contrast, non-LLM-based methods are limited to each specific scenario: significant effort is needed to annotate a dataset and train a claim extractor, which still has limited generalization ability.
[1] Achakulvisut et al. Claim Extraction in Biomedical Publications Using Deep Discourse Model and Transfer Learning
[2] Wadden et al. MULTIVERS: Improving scientific claim verification with weak supervision and full-document context
Although the author states that their work is different from the prior arts with the claim extraction module, it's hard to agree that simply adding such a module is novel.
First of all, we appreciate your acknowledgment of our versatile claim extraction module as a differentiating factor for FacTool from prior art. However, our framework's innovation extends beyond “simply” adding this module. As mentioned in our response to weakness 1 and question 1, besides our “versatile claim extraction” module, FacTool's factuality detection framework distinguishes itself in two other key areas compared to prior art:
(i) FacTool simplifies the agreement verification process. After gathering evidence, we utilize the advanced reasoning abilities of LLMs to directly process the collected evidence, thereby avoiding the complexities of relevance matching that systems like RARR entail.
(ii) FacTool facilitates the integration of various tools into the factuality detection pipeline. Unlike prior art, which typically focuses on a single task and scenario, FacTool is a more versatile approach capable of factuality detection across multiple scenarios and tasks.
Dear reviewer,
We look forward to your helpful response and feedback! We are happy to provide further clarifications if you have any more questions.
We thank the reviewers for their helpful comments and feedback. We appreciate the reviewers' positive comments, and we have responded to each weakness and question. Mainly, we:
(i) Add experiments on FacTool powered by Llama-2-70b-chat.
(ii) Emphasize our novelty and contributions compared to previous works.
(iii) Explain the design of the Self-Check baseline more clearly.
(iv) Clarify the definition of factuality for code generation and its factuality detection process.
We believe that FacTool makes a significant contribution toward factuality detection in multi-domain and multi-task scenarios, which is highly important in the era of generative AI, given that a wider range of tasks now faces an increasing risk of containing factual errors when handled by generative AI.
This paper presents a multi-task and multi-domain framework for detecting factual errors in LLMs. The concept of factual error is broadened to four domains, including more distant settings like code generation where "factuality" corresponds to executability and correctness. The paper presents an approach to address factual errors in these settings via decomposition, query generation (involving search or execution), evidence collection, and agreement verification. The results in this paper are strong compared to past "self-check" frameworks that don't use the same setup.
The reviewers found the paper to have extensive experiments and a well-engineered system. They generally agreed that the expansion of the notion of fact-checking and its connection with tool use would be helpful. The paper is clearly written and the framework it contributes is a potentially useful piece of scaffolding for the community.
There is no critical weakness in this paper. However, this paper ended up on the borderline given the final score guidance suggested by ICLR. Unfortunately, although the scores all advocate for (marginal) acceptance, the critiques of the paper do ring true.
The main critiques are about limited contribution/novelty and comparison to other systems.
When considering the 4 domains explored here, the differences outweigh the similarities. For example, KB-QA requires search to verify claims, whereas code requires executing source code and comparing to input-output examples. KB-QA has a notion of fact decomposition, whereas code blocks are treated as atomic units. These are technically in the same framework, but the level of conceptual unification here is low.
Moreover, these types of systems have been extensively considered in past work. They have not all necessarily been unified in a single system, but stepwise verification of math problems has been extensively explored (e.g., "Let's Verify Step By Step" from OpenAI). For text factuality, the framework of decomposition, retrieval, and aggregation is explored in prior work that the authors discuss, like RARR, as well as in FActScore (https://arxiv.org/pdf/2305.14251.pdf). There is no comparison to these specialized frameworks for the individual tasks in this benchmark. As a result, the main contribution feels like a commoditizing and open-sourcing of existing techniques.
Why not a higher score
See review. Mainly the lack of novelty and lack of comparison to other similar systems keep this from being stronger. The reviewer endorsements don't really sound like 6s to me.
Why not a lower score
N/A
Reject