FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios
We introduce FacTool, a tool augmented factuality detection framework that can effectively detect diverse factual errors generated by LLMs.
Abstract
Reviews and Discussion
The paper proposes FacTool, a multi-task and multi-domain framework for detecting factual errors in LLM-generated texts. The core idea of FacTool is to provide a set of tools that can be used for factuality detection across different domains and tasks within a uniform framework. In this paper, FacTool is used to evaluate the factuality of a diverse set of chatbots. FacTool consists of five components: 1) claim extraction, 2) query generation, 3) tool querying, 4) evidence collection, and 5) agreement verification, which are applied in this order (see Figure 1) for the different domains and tasks under consideration. Components 1) and 2) use ChatGPT with specific prompt instructions. Component 3) is based on a Google Search API for candidate evidence retrieval. The task of component 5) is to label each extracted claim as true or false, taking into account the retrieved evidence candidates. All these steps have to be tailored to the specific tasks under consideration. Four tasks are considered: knowledge-based QA, code generation, mathematical reasoning, and scientific literature review. The prompts used for the components are developed and evaluated using a number of task-relevant third-party datasets. The evaluation results suggest that the approach is effective.
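For concreteness, a minimal sketch of how these five components might be chained is shown below. The function names and placeholder bodies are hypothetical illustrations, not the authors' released implementation, which uses prompted ChatGPT calls and a Google Search API as described above.

```python
# Hypothetical sketch of the five-stage pipeline summarized above; the helper
# bodies are naive placeholders, not the authors' prompted-LLM / search-API code.

def extract_claims(response: str) -> list[str]:
    # Stage 1 placeholder: a real system would prompt an LLM with a
    # claim-extraction instruction; here each sentence is treated as one claim.
    return [s.strip() for s in response.split(".") if s.strip()]

def generate_queries(claim: str) -> list[str]:
    # Stage 2 placeholder: a real system would prompt an LLM to write search queries.
    return [claim]

def query_tool(query: str) -> list[str]:
    # Stage 3 placeholder: a real system would call e.g. a Google Search API
    # and return evidence snippets.
    return []

def verify_claim(claim: str, evidence: list[str]) -> bool:
    # Stage 5 placeholder: a real system would prompt an LLM to judge the claim
    # against the evidence; here we use naive substring matching.
    return any(claim.lower() in snippet.lower() for snippet in evidence)

def factool_style_check(response: str) -> dict[str, bool]:
    """Chain the stages: extract claims, build queries, query tools,
    collect evidence (Stage 4), then verify each claim."""
    verdicts = {}
    for claim in extract_claims(response):
        evidence: list[str] = []
        for query in generate_queries(claim):
            evidence.extend(query_tool(query))
        verdicts[claim] = verify_claim(claim, evidence)
    return verdicts
```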
Reasons to Accept
- Interesting approach for factuality detection using a zero-shot/few-shot approach
- Good results achieved
Reasons to Reject
- Claim extraction and the framework are only evaluated on a small set of LLMs (basically GPT-4, ChatGPT, and Flan-T5-XXL).
- It seems unfair to use ChatGPT as part of FacTool, because it might cause some positive bias for ChatGPT compared to non-OpenAI models in Table 5. On the other hand, why is GPT-4 so much better?
- The whole approach seems to be a pipeline, but the concrete interaction between components is unclear. Are there any problems with error propagation?
- Missing qualitative discussion in the main part of the paper.
Questions for the Authors
- See above.
- Can you explain the extreme differences in the results of self-check and FacTool for the scientific task in Table 4?
We thank the reviewer for the thoughtful review. The following is our response to the major concerns:
Limited LLM evaluation scope
We thank the reviewer for the thoughtful suggestion. While the initial evaluation focused on three prominent models (GPT-4, ChatGPT, and Flan-T5-XXL), this was a deliberate choice to validate the framework on diverse architectures, covering both OpenAI's GPT family and Google's model family. The framework is designed to be model-agnostic, and the selected models span different scales and training paradigms, providing a reasonable foundation for demonstrating generalizability. Future work can easily extend the evaluation to additional models using the established framework. We will update the paper with results for newer models (including DeepSeek-R1 and Qwen) in our next version. Our current experiments show that, even when not powered by state-of-the-art models, FacTool is still a very effective hallucination detection approach.
Potential bias from using ChatGPT in FacTool components
This concern about positive bias toward ChatGPT is valid and important. However, our empirical results provide some evidence that mitigates this concern. Specifically, if bias were the primary factor, we would expect consistent advantages for GPT-4/ChatGPT across all tasks (KB-QA, math, code, scientific). Instead, we see varied performance patterns that suggest FacTool is detecting genuine differences rather than evaluation artifacts. Notably, Bard actually outperforms GPT-4 on mathematical reasoning tasks, which demonstrates that the evaluation framework does not inherently favor OpenAI models. This cross-model performance variation indicates the framework is capturing authentic differences in factual accuracy. Additionally, GPT-4's superior performance on most tasks aligns with common knowledge about its enhanced reasoning capabilities compared to earlier models. To further address this concern, future iterations could use a model-agnostic implementation based on open-source alternatives or ensemble approaches, though the current results suggest bias may not be a primary confounding factor.
Pipeline error propagation concerns
Error propagation is indeed a consideration in any multi-stage pipeline. Our framework includes several mitigation strategies: (1) multiple evidence sources are collected to reduce single-point failures, (2) the agreement verification step cross-validates claims against multiple pieces of evidence, and (3) each component is designed with robustness in mind. While some error propagation is inevitable, our empirical results suggest the overall system maintains reliability across diverse tasks and domains. We discuss potential failure cases in the appendix of the paper, including potential errors in LLM reasoning and tool errors.
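As an illustration of mitigation strategy (2), one plausible way to aggregate per-evidence judgments is a simple majority vote, sketched below; `llm_judge` is a hypothetical callable wrapping a prompted LLM verdict, and the actual aggregation in the paper may differ.

```python
from typing import Callable

def cross_validate(claim: str, evidence_snippets: list[str],
                   llm_judge: Callable[[str, str], bool]) -> bool:
    # Judge the claim against each snippet independently, then aggregate,
    # so a single noisy retrieval result is less likely to flip the verdict.
    votes = [llm_judge(claim, snippet) for snippet in evidence_snippets]
    return sum(votes) > len(votes) / 2  # majority of snippets must support the claim
```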
Missing qualitative discussion in the main part of the paper
We provide a detailed analysis in the appendix of the paper due to page limitations. We thank the reviewer for the suggestion and will do our best to move some of the qualitative discussion from the appendix to the main part of the paper.
Thank you for your clarifying answers. I will keep my score.
The paper introduces FacTool, a task- and domain-agnostic framework designed to detect factual errors in texts generated by LLMs. The framework leverages tools such as Google Search, Google Scholar, code interpreters, and even LLMs themselves to gather evidence about the factuality of the generated content. It consists of five main components: claim extraction, query generation, tool querying, evidence collection, and agreement verification. The authors evaluate FacTool across four different tasks: knowledge-based QA, code generation, mathematical reasoning, and scientific literature review, and compare its performance with self-check baselines. The results demonstrate that FacTool outperforms the baselines in detecting factual errors across all scenarios.
Reasons to Accept
- FacTool is a relatively general framework
- Wide exploration of tools
Reasons to Reject
- Limited novelty in individual components: Though the integration of tools into a unified framework is somewhat innovative, each individual component of FacTool lacks sufficient novelty and relies on methods akin to existing approaches. This makes the paper seem like a mere engineering assembly rather than a work that proposes or solves a novel research problem. The framework also seems simply like 1+1=2, which means readers cannot obtain any new insights from the "unified" framework.
- Bias in tool outputs not discussed: The use of external tools introduces the possibility of bias in the collected evidence. For example, search engine results can be influenced by various factors, which may compromise the objectivity of the factuality assessment.
- Computational and resource constraints: The tool-augmented approach may demand substantial computational resources and time, especially when dealing with large volumes of generated text or complex queries. This could undermine the practicality and scalability of the framework in real-world applications.
- Insufficient experiments: The experiments appear inadequate as the models used seem somewhat outdated. There is no comparison with the latest models, such as DeepSeek-R1 and other reasoning models.
- Prompt dependency: FacTool is based on a prompt-driven implementation, which heavily relies on the model's ability to follow instructions and is highly sensitive to the prompt itself.
- Overclaim: The paper claims to be “the first to propose and connect the concept of ‘tool use’ with ‘factuality detection’”. However, numerous prior works have already employed tools to perform a variety of tasks, including factuality detection (see [1]).
Typo: In line 70, “scientific” should be corrected to “science”.
Missing references: [1] Vykopal et al., Generative Large Language Models in Automated Fact-Checking: A Survey.
Questions for the Authors
N/A
We appreciate the reviewer's feedback. The following is our response to the major concerns:
Limited novelty in individual components
While individual components like claim extraction and evidence retrieval exist separately, FacTool's key novelty lies in creating a unified, multi-task framework that works across diverse domains (KB-QA, code, math, scientific literature). The survey paper [3] shows that prior work focuses on single tasks - for example, most of the 69 surveyed papers address only one specific fact-checking task, with fact verification and fake news detection being the most prominent (53 out of 69 papers), but typically confined to single domains. FacTool's contribution is demonstrating how tool augmentation can be systematically applied across multiple domains with domain-specific adaptations.
Bias in tool outputs not discussed
We acknowledge that search engines may have ranking biases. We discuss the potential errors of "conflicting evidence" in our failure analysis in Appendix B.2.
Computational and resource constraints
While tool augmentation does increase computational overhead, this trade-off is often worthwhile for high-stakes applications where factuality is crucial. The paper demonstrates that the framework can be implemented with different LLM backends (ChatGPT, GPT-4, Llama) to balance cost and performance. The modular design also allows selective tool usage based on resource constraints. With the development of future LLMs, we anticipate the effectiveness and efficiency of FacTool will be significantly improved.
Insufficient experiments with outdated models
The experiments use ChatGPT and GPT-4, which were state-of-the-art at the time of writing. While newer reasoning models like DeepSeek-R1 have emerged, the framework's modular design means it can readily incorporate these newer models. The core contribution - the systematic tool augmentation approach - remains valid regardless of the underlying LLM. We will update the paper with results for newer models (including DeepSeek-R1 and Qwen) in our next version.
Prompt dependency
Indeed, this is a legitimate concern for any prompt-based approach. Our paper provides detailed prompts in the appendix and shows consistent performance across different tasks, suggesting the prompting strategy is relatively robust. The framework could be enhanced by incorporating more sophisticated prompting techniques or fine-tuning approaches.
Overclaim about connecting "tool use" with "factuality detection"
We thank the reviewer for providing the reference. After reviewing the survey paper, we want to respectfully clarify the contribution of FacTool based on the context of the survey. The survey shows that external knowledge augmentation is significantly underexplored in fact-checking, with only five out of 69 surveyed papers employing external sources. Most prior work uses prompting (without tools) or fine-tuning approaches, with knowledge-augmented strategies being identified as a key future direction.
While tools have been used for various NLP tasks, FacTool appears (to the best of our knowledge) to be among the first to systematically design a tool-augmented framework specifically for multi-domain factuality detection. The survey categorizes approaches into prompting, fine-tuning, and augmentation with external knowledge, noting that techniques incorporating external knowledge remain underexplored. FacTool's systematic integration of domain-specific tools (Google Search for KB-QA, code interpreters for programming, Google Scholar for scientific claims) represents a novel contribution to this underexplored area. The framework's significance lies not just in using tools, but in demonstrating how different tools can be systematically selected and applied across diverse factuality detection tasks, providing a template for future tool-augmented fact-checking systems.
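To make the domain-to-tool pairing described above concrete, a minimal hypothetical dispatch table is sketched below; the wrapper functions are placeholders and do not correspond to the released FacTool API.

```python
# Hypothetical domain-to-tool dispatch; each wrapper would call the real tool
# (search API, interpreter, calculator, Google Scholar) in practice.
def search_web(claim: str) -> list[str]:        # KB-QA: web snippets as evidence
    return []

def run_code_tests(claim: str) -> list[str]:    # code: execute against synthetic tests
    return []

def check_arithmetic(claim: str) -> list[str]:  # math: re-evaluate extracted calculations
    return []

def search_scholar(claim: str) -> list[str]:    # scientific: verify cited titles/authors/years
    return []

TOOL_BY_DOMAIN = {
    "kb_qa": search_web,
    "code": run_code_tests,
    "math": check_arithmetic,
    "scientific": search_scholar,
}

def collect_evidence(claim: str, domain: str) -> list[str]:
    return TOOL_BY_DOMAIN[domain](claim)
```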
Typo
We thank the reviewer for pointing out the typo; we will fix it in the next version of the paper.
[3] Vykopal et al., Generative Large Language Models in Automated Fact-Checking: A Survey.
Thank you very much for the detailed response. While you claim that "most of the 69 surveyed papers address only one specific fact-checking task," there are actually specialized solutions tailored to different domains based on their unique characteristics and challenges, which are not totally covered by the single survey. Attempting to unify these diverse approaches into a single framework, especially in a way that may not be particularly elegant or natural, does not necessarily constitute a novel contribution to the field.
Furthermore, the response notes that "Most prior work uses prompting (without tools) or fine-tuning approaches, with knowledge-augmented strategies being identified as a key future direction." However, there is already a substantial body of work utilizing tools such as retrievers for fact verification, which appears to be overlooked in your analysis:
[1] Loki: An Open-Source Tool for Fact Verification
[2] Survey on factuality in large language models: Knowledge, retrieval and domain-specificity
[3] Language models hallucinate, but may excel at fact verification
[4] Semantic Verification in Large Language Model-based Retrieval Augmented Generation
We thank the reviewer for the thoughtful feedback. We would like to respectfully address your points.
Regarding your first point about specialized solutions for different domains, we agree that there are indeed domain-specific fact-checking approaches tailored to unique characteristics and challenges. Specifically, we recognize existing domain-specific fact-checking work, including RARR [1] and FActScore [2], which we discussed in our paper. The statement that "Most prior work uses prompting (without tools) or fine-tuning approaches, with knowledge-augmented strategies being identified as a key future direction" was inferred from the survey [3]. We agree with your point that there are existing works utilizing tools such as retrievers for fact verification, including the references you mentioned. We will include more related work in the next version of the paper.
We respectfully understand your concern about the elegance of attempting to unify diverse approaches into a single framework. We believe our contribution demonstrates that a systematic, modular approach can effectively handle diverse factuality verification needs across domains. FacTool provides a principled foundation that allows domain-specific adaptations while maintaining a consistent 5-step verification methodology (claim extraction, query generation, tool querying, evidence collection, and verification). The framework's strength comes from its systematicity and flexibility: it maintains systematic verification principles while allowing each domain to utilize the most appropriate tools (mathematical claims naturally utilize calculators, while scientific citations employ scholarly databases) based on their specific characteristics.
Looking forward, we acknowledge that FacTool might appear pipelined from a generalist agent perspective, but we believe it served as an important foundational step. With emerging reinforcement learning technologies, LLMs can learn to call specific tools more dynamically, building upon the systematic foundation we established. We believe this is a promising future direction, and we genuinely appreciate this valuable discussion and look forward to continued progress in this important research area.
[1] Gao et al., RARR: Researching and Revising What Language Models Say, Using Language Models
[2] Min et al., FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation
[3] Vykopal et al., Generative Large Language Models in Automated Fact-Checking: A Survey.
The paper introduces a framework, FacTool, to identify factual errors in the generations of LLMs on a variety of tasks. FacTool consists of five steps and is able to operate on long-form generations without fixed evidence provided, utilizing the instruction-following ability of the LLM along with a set of external tools (e.g., the Google Search API). The paper evaluates a set of modern models using this framework.
Reasons to Accept
- The paper is well-organized and easy to follow.
- The experimental setup is clear with a rich number of details.
- The paper identifies the existing challenges and explains how the proposed framework addresses them.
Reasons to Reject
- It is valuable to have an integrated framework to identify factual errors in LLM’s generations in a wide range of domains, but the novelty of individual components in FacTool seems to be limited.
- In Section 6.1, there is a lack of baseline to compare against the proposed claim extraction method (e.g., [1]).
- In Section 6.2, for KB-QA and Scientific, a baseline using RAG with a local datastore might be needed to showcase the benefits of using this agentic framework. For evaluating code and math, adding a stronger baseline would make the results more convincing as well (e.g., a minimal pipeline to execute generated code).
- Showing the transferability of FacTool on other datasets is important especially because math and coding generations can be dependent on the prompts and vary in format and style. However, there are no such experiments shown. Even in Section 6.3, the prompts were taken from the same set of datasets.
- The number of prompts evaluated in Section 6.3 seems too small, especially for domains other than KB-QA.
We thank the reviewer for the thoughtful review. We appreciate your recognition of our paper's organization, clarity, and comprehensive experimental details. We would like to respectfully address your concerns:
The novelty of individual components in FacTool seems to be limited
We respectfully emphasize that FacTool represents one of the earliest works that pioneered the integration of external tools to systematically address hallucination in LLMs. While individual components may seem familiar now, at the time of this work, the unified framework combining:
- Multi-domain claim extraction via LLM prompting
- Tool-augmented evidence collection
- Systematic verification across diverse tasks (KB-QA, code, math, scientific)
was genuinely novel. This approach has since become the de facto standard for addressing hallucination, validating our early vision. The fact that tool-augmented approaches are now considered the most effective method for combating hallucination demonstrates the impact and foresight of our work.
Missing baselines for claim extraction and RAG comparisons
We acknowledge these suggestions would strengthen the paper. We also note that:
- For claim extraction, at the time of our work, there were limited comparable automated claim extraction methods that worked across our diverse domains. Existing methods were typically task-specific (e.g., only for summarization or QA), making fair comparison challenging.
- For RAG baselines, our focus was on demonstrating the effectiveness of dynamic tool use rather than static retrieval, as tools provide real-time verification capabilities that static datastores cannot match. Static RAG systems would struggle with dynamic content like code execution results or real-time mathematical calculations.
- For the code execution baseline, we actually do compare against execution-based evaluation (HumanEval), but our approach goes beyond a simple pass/fail check by generating diverse test cases and using majority voting to identify subtle errors that fixed test suites might miss (see the sketch below).
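To illustrate (under assumptions, since this is not our exact implementation), the sketch below assumes the generated snippet defines a function named `solution` and that an LLM has proposed a list of `(input, expected_output)` test cases; the claim is judged correct if a majority of those cases pass. Sandboxing and timeouts are omitted for brevity.

```python
def run_case(candidate_code: str, test_input: tuple, expected) -> bool:
    # Execute the candidate snippet and check one synthetic test case.
    namespace: dict = {}
    exec(candidate_code, namespace)            # assumes the snippet defines `solution(...)`
    return namespace["solution"](*test_input) == expected

def verdict_by_voting(candidate_code: str, test_cases: list) -> bool:
    # Pass/fail on a majority of LLM-generated test cases, rather than a single
    # fixed suite; a crash on any case simply counts as a failure for that case.
    passes = 0
    for test_input, expected in test_cases:
        try:
            passes += run_case(candidate_code, test_input, expected)
        except Exception:
            pass
    return passes > len(test_cases) / 2
```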
Limited transferability experiments and small sample sizes
We would like to respectfully emphasize that our experiments conducted across four distinct domains (KB-QA, code, math, scientific) already demonstrate significant generalization across domains and scenarios. We emphasize that our dataset comprises over 400 responses across tasks, containing nearly 1,000 claims. This significantly surpasses related works like RARR [1] and SelfCheckGPT [2], which are limited to 150 and 238 samples respectively, and focus on only a single KB-QA task.
Furthermore, we clarify that our work primarily focuses on providing a tool for evaluation, not on benchmarking the factuality of modern LLM chatbots - the chatbot evaluation in Section 6.3 is more of an additional demonstration of FacTool's utility rather than the core contribution.
[1] RARR: Researching and Revising What Language Models Say, Using Language Models, Gao et al.
[2] SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models, Manakul et al.
Thanks to the authors for the response!
Regarding a baseline on claim extraction: I agree that existing methods were domain-specific, but these domains are within the scope of what the proposed method claims to address (i.e., KB-QA), which would still make the comparison fair. The paper also evaluates its own method on a specific domain (i.e., summarization with RoSE [1]).
Regarding a baseline on code execution: I might have misunderstood, but HumanEval is the evaluation method that you treat as ground-truth instead of a baseline.
Regarding limited transferability: I agree that the domain coverage of the proposed framework has been shown in the paper. However, within each domain (e.g., code), whether the proposed method can function across different datasets is never shown in the paper.
[1] Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation, Liu et al.
We thank the reviewer for the thoughtful feedback and clarifications. We appreciate the opportunity to address these important points.
Regarding baselines for claim extraction:
Thank you for this excellent suggestion. You're absolutely right that we use RoSE, though our application slightly differs from its original purpose. While the original RoSE paper benchmarked whether summarization models generate summaries that "contain" ACUs, our task evaluates whether claim extraction models can accurately "extract" the actual ACUs from a given text. While the task is slightly different, we agree that each sentence from summaries generated by text summarization models (e.g., BRIO [1]) can be treated roughly as a "generated" ACU for comparison, and we can then follow a similar evaluation strategy to the one described in our paper. We will include results comparing BRIO [1] against our existing baselines (FlanT5, ChatGPT, GPT-4) in our updated paper.
Regarding baselines for code execution:
Thank you for highlighting this confusion in our presentation. We apologize for the lack of clarity. Our current FacTool method is already execution-based and evaluated on the execution-based HumanEval benchmark. FacTool employs a synthetic unit test library generation system (illustrated in Figure 3) to detect potential factual errors in code. We agree with the reviewer that a simple executability check would serve as a valuable baseline to identify basic coding errors. We will include results comparing against this simpler execution baseline in our updated paper.
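A minimal version of the executability-check baseline we have in mind might look like the following (a hedged sketch; sandboxing and timeouts are again omitted):

```python
def is_executable(candidate_code: str) -> bool:
    # Baseline check: does the generated snippet parse and run without raising?
    # This catches syntax errors and obvious runtime failures but not logical bugs,
    # which is why test-based verification is expected to be stricter.
    try:
        compile(candidate_code, "<generated>", "exec")
        exec(candidate_code, {})
        return True
    except Exception:
        return False
```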
Regarding transferability across datasets:
We appreciate the reviewer's thoughtful suggestion. While FacTool's main contribution lies in its versatility and multi-domain capabilities, the reviewer raises a valid point about demonstrating transferability within individual domains. To address this, we will add experiments on MBPP [2] within the code domain to show that our method generalizes across different datasets within the same domain. These results will be included in our updated paper.
[1] Liu et al., BRIO: Bringing Order to Abstractive Summarization
[2] Austin et al., Program Synthesis with Large Language Models
Thank you again for these constructive suggestions that will strengthen our evaluation and overall paper.
Thanks to the authors for the response! I am looking forward to seeing the new baselines added to the paper. I have raised my score in response to the discussion about the novelty of the paper.
This paper proposes a tool-augmented factuality-checking method for AI-generated content. By leveraging domain-specific tools, the proposed method can effectively check factuality across different domains and tasks. To evaluate the proposed method, the authors also built a benchmark. Finally, the authors evaluated a couple of SOTA LLMs with the proposed method across different scenarios.
quality
The paper looks sound to me. The way the proposed method evaluates generated output is very similar to how a human would evaluate it, so it is not surprising that it works better than a simple self-check.
clarity
This paper is overall clearly written and easy to follow.
originality
I'm not overly familiar with the literature on this topic, but according to the authors this is the first work that integrates "tool use" for factuality detection.
significance
I'd imagine that the ultimate goal of such a method is to be a standalone factuality-checking tool, or to be integrated into LLM applications to control generation quality. But the cost would be fairly high either way, which may limit the practical usefulness of the proposed work.
Reasons to Accept
- paper is clearly written and easy to follow
- proposed method is reasonable and works well
Reasons to Reject
- the proposed method seems costly, with uncertain practical usability
We appreciate the reviewer's feedback. While we acknowledge that computational cost is a valid consideration, we would like to emphasize several key points:
- High-stakes applications: In domains like healthcare, finance, and scientific research, the cost of factual errors far exceeds the computational cost of verification.
- Selective deployment: FacTool can be strategically deployed for critical outputs rather than every generation.
- Improving efficiency: As one of the first works in this space, future optimizations and more efficient tool integration methods will naturally follow.
- Open-source alternatives: Our experiments with FacTool powered by Llama show competitive results, demonstrating that open-source models can effectively power the framework. As open-source models become more capable, the cost of deploying FacTool will continue to decrease significantly.
I appreciate the authors’ rebuttal and will keep my scores
We appreciate the reviewers' thoughtful responses and that the reviewers agree that FacTool is a general and versatile framework for factuality detection. FacTool employs a tool-augmented approach to target factual errors, enabling specific handling of diverse types of factual errors. The reviewers' recognition that tool-augmented approaches have become standard validates our early vision. FacTool's framework provides an ideal foundation for RL-based improvements. Specifically, with RL + Tool integration:
- Models can learn when and which tools to call based on the claim type.
- Fine-grained factuality scores from FacTool can guide policy learning for more reliable generation.
- RL can optimize tool selection strategies, learning that mathematical claims need calculators while citations require scholarly databases. The modular pipeline allows targeted improvements at each stage while maintaining interpretability.
This integration potential demonstrates how our tool-augmented approach enables future enhancements rather than constraining them: models can evolve from following prescribed verification steps to autonomously determining optimal verification strategies.
We respectfully emphasize that FacTool, as one of the earlier works in tool-augmented factuality detection, aimed to establish a unified framework rather than optimize individual components. While subsequent works have provided stronger baselines for specific tasks, our contribution lies in pioneering the tool-augmented paradigm that has since become the dominant approach for addressing hallucinations. FacTool demonstrated the feasibility and effectiveness of systematic tool integration across multiple domains - a vision validated by widespread adoption of similar approaches throughout the field. As a foundational framework, we humbly believe that FacTool established principles that continue to guide how we approach factuality detection in LLMs.
The paper presents FacTool, a framework for factuality evaluation of LLM outputs using tools. The framework includes claim extraction, query generation, tool execution and verification. The work presents the framework, together with experiments, results, data and code.
Reasons to accept:
- The work is sound and has already shown significant impact since it was published on arXiv.
- The open source software included with the work is also impactful and used by many.
- Long-form factuality is still an important problem for LLMs today, and this work is useful to drive progress on this.
Reasons to reject:
- The work is not cutting-edge research: early versions appeared two years ago (though it may not have been published at a conference before).