CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing
We present a new framework enabling LLMs to verify and correct their output by interacting with tools, highlighting the importance of external feedback for continuous self-improvement.
Abstract
Reviews and Discussion
The paper proposes CRITIC, a framework in which LLMs self-correct their own outputs by interacting with external tools. The authors conduct experiments on question answering, program synthesis, and toxicity reduction, showing that CRITIC consistently improves the performance of LLMs.
Strengths
- The paper is written clearly and easy to follow.
- I found the comparison with ReAct interesting, i.e., the role of parametric knowledge vs. language feedback.
Weaknesses
- I'm not convinced the CRITIC framework is novel enough to count as a contribution. The idea of using natural language feedback [1, 2, 3, 4] that guides LMs in revising their responses is pretty old, as is the idea of using tools [5, 6]. I agree the authors provide a nice unifying framework and some new downstream tasks (e.g. toxicity with PerspectiveAPI), but these don't seem to pass the bar for ICLR.
- The authors don't compare with other frameworks endowing LMs with self-correction and tool use, like the ones listed above.
- I think the claim that tool use "mimic[s] human thinking and behavior" is overblown. Humans think and work with tools very differently, typically not through a text-only interface.
[1] https://arxiv.org/abs/2204.14146
[2] https://arxiv.org/abs/2303.11366
[3] https://arxiv.org/abs/2212.08073
[4] https://arxiv.org/abs/2303.16749
[5] https://arxiv.org/abs/2207.14502
[6] https://arxiv.org/abs/2302.04761
[7] https://openai.com/blog/function-calling-and-other-api-updates
Questions
How does the paper compare with other frameworks endowing LMs with self-correction and tool use?
References
[1] Shinn, Noah, et al. "Reflexion: Language agents with verbal reinforcement learning." Thirty-seventh Conference on Neural Information Processing Systems. 2023.
[2] Madaan, Aman, et al. "Self-refine: Iterative refinement with self-feedback." Thirty-seventh Conference on Neural Information Processing Systems. 2023.
[3] Chen, Xinyun, et al. "Teaching large language models to self-debug." arXiv preprint arXiv:2304.05128 (2023).
[4] Yao, Shunyu, et al. "React: Synergizing reasoning and acting in language models." arXiv preprint arXiv:2210.03629 (2022).
[5] Chen, Wenhu, et al. "Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks." arXiv preprint arXiv:2211.12588 (2022).
[6] Saunders, William, et al. "Self-critiquing models for assisting human evaluators." arXiv preprint arXiv:2206.05802 (2022).
[7] Bai, Yuntao, et al. "Constitutional ai: Harmlessness from ai feedback." arXiv preprint arXiv:2212.08073 (2022).
[8] Campos, Jon Ander, and Jun Shern. "Training language models with language feedback." ACL Workshop on Learning with Natural Language Supervision. 2022.
[9] Chen, Angelica, et al. "Improving code generation by training with natural language feedback." arXiv preprint arXiv:2303.16749 (2023).
[10] Haluptzok, Patrick, Matthew Bowers, and Adam Tauman Kalai. "Language models can teach themselves to program better." arXiv preprint arXiv:2207.14502 (2022).
[11] Welleck, Sean, et al. "Generating Sequences by Learning to Self-Correct." The Eleventh International Conference on Learning Representations. 2023.
[12] Kadavath, Saurav, et al. "Language models (mostly) know what they know." arXiv preprint arXiv:2207.05221 (2022).
[13] Ni, Ansong, et al. "Lever: Learning to verify language-to-code generation with execution." International Conference on Machine Learning. PMLR, 2023.
[14] Huang, Jie, et al. "Large language models cannot self-correct reasoning yet." arXiv preprint arXiv:2310.01798 (2023).
[15] Stechly, Kaya, Matthew Marquez, and Subbarao Kambhampati. "GPT-4 Doesn't Know It's Wrong: An Analysis of Iterative Prompting for Reasoning Problems." arXiv preprint arXiv:2310.12397 (2023).
[16] Valmeekam, Karthik, Matthew Marquez, and Subbarao Kambhampati. "Can Large Language Models Really Improve by Self-critiquing Their Own Plans?." arXiv preprint arXiv:2310.08118 (2023).
[17] Schick, Timo, et al. "Toolformer: Language models can teach themselves to use tools." arXiv preprint arXiv:2302.04761 (2023).
[18] Lu, Ximing, et al. "Quark: Controllable text generation with reinforced unlearning." Advances in neural information processing systems 35 (2022): 27591-27609.
Experiments Comparing CRITIC with Other Self-Correction and Tool-Use Frameworks
The authors don't compare with other frameworks endowing LMs with self-correction and tool use, like the ones listed above.
In fact, CRITIC has been compared with the most advanced self-correction and tool-use baselines across multiple tasks in the original paper:
- Firstly, the works you mentioned include methods for self-correction based on natural language feedback [1-3, 6-10]. These are akin to the "CRITIC w/o Tool" baseline implemented and compared on many tasks in our paper, which relies on self-correction without using external tools.
- Moreover, in the QA task, CRITIC is even compared with a concurrent work, Self-Refine [2], as shown in Table 5. We also compared it with the Self-Correct [11] method that trains an additional corrector in the Toxicity Reduction task. We have conducted detailed comparisons and discussions on related baselines in all experiments and in Appendix C.1.
- Additionally, you mentioned tool-use works like Toolformer [17] and many others [3,4]. We have actually compared against the most advanced tool-use methods in each domain, such as the search-engine-based ReAct [4] in the QA tasks, PoT [5] with a code interpreter in the mathematical program synthesis tasks, and Quark [18] and Self-Correct [11], which utilize the Perspective API, in toxicity reduction.
We have ensured fairness in tool use and self-correct settings for all baselines across all these comparative experiments covering a wide diversity of tasks. These comprehensive comparisons across various tasks, LLMs, and model sizes collectively demonstrate the generality and effectiveness of CRITIC.
Revising the phrase “allows LLMs to mimic human thinking and behavior”
I think the claim that tool use "mimic[s] human thinking and behavior" is overblown. Humans use think and work with tools very differently, typically not through a text-only interface.
We sincerely appreciate your feedback regarding the wording used here!
Our intended message is that CRITIC is inspired by the process of human interaction with tools, such as a programmer compiling and running their own code, receiving feedback, reflecting on issues, and going through the debugging process. We also agree that humans may exhibit a much richer inner monologue and behavior when using tools and reflecting, and language is indeed a limited interface through which an LLM interacts with the world. In response to your feedback, we will revise this phrase to "allows for human-like verify-then-correct trajectories", and we hope this addresses your concern.
In light of these clarifications and revisions, we kindly request that you consider increasing the review score.
Thank you for thoroughly reviewing our work! We have carefully considered all your concerns and addressed them in the following comments.
CRITIC Compared to Existing Works on Self-Correction and Tool-use
The idea of using natural language feedback [1, 2, 3, 4] that guides LMs in revising their responses is pretty old, as is the idea of using tools [5,6]. I agree the authors provide a nice unifying framework and some new downstream tasks (e.g. toxicity with PerspectiveAPI), but these don't seem to pass the bar for ICLR.
Thank you for raising these concerns! Here we discuss why we believe CRITIC is novel, and more than a trivial blend of existing works on self-correction [1, 2, 3] and tool use [4,5]. While these are crucial, fast-moving areas of LLM research, CRITIC offers unique findings, an integrated framework, and valuable insights, making it a distinct contribution.
The relevant papers in these two fields you mentioned actually have very different research perspectives and stances from CRITIC:
Intrinsic Self-Correction with NL Feedback
Works such as Self-Critique [6], CAI [7], Reflexion [1], Self-Refine [2], and others [8-10] you mentioned prompt or train language models to correct their own results. In contrast, our study is the first to demonstrate that such a "Self-Verification and Self-Correction" approach is remarkably unreliable across diverse tasks and various LLMs (Sec. 4 and Appendix D.1). Specifically, modest improvements or even deterioration are observed universally when self-correcting without external feedback. Consequently, CRITIC emphasizes the importance of feedback from external interactions for the consistent self-improvement of LLMs. The proposed framework is general and has proven effective across multiple tasks. These reflections on the unreliable self-correction of LLMs and the crucial findings on the importance of external feedback can be important learnings for the community, as recognized by Reviewer [rWYL].
The Unreliability of Self-correction
Moreover, CRITIC further delves into the core reason behind the unreliability of self-verification from the perspective of uncertainty estimation, as shown in Appendix D.1. Essentially, the models are incapable of accurately identifying "what they know" (i.e., LLMs don't know what they know) [12] without relying on external tools. Therefore, without the aid of oracle verification (employed in many contemporary works such as Reflexion [1] and Self-Refine [2]), self-correction might surprisingly deteriorate performance for many tasks, even worsening the initial answer (as demonstrated in Table 2, 3 under CRITIC w/o Tool, and in Table 8 under Self-Refine).
As noted by Reviewer [dDfr], a nice recent follow-up to CRITIC, titled "LLMs cannot self-correct reasoning yet" [14], extends our study (and cites CRITIC numerous times). It further validates and expands our findings on the unreliability of Self-Verify and Self-Correct using GPT-4 in more settings on many reasoning tasks.
Tool-Use
Another category of related work is Tool-Augmented LMs, such as ReAct [4], PoT [5], and Toolformer [17]. We would like to clarify that these works significantly differ from CRITIC as they focus on tool learning. To the best of our knowledge, none of them consider using tool interaction as faithful feedback to iteratively improve the model, which is the key component of CRITIC.
In conclusion, CRITIC is the first to unveil the unreliability of self-verification and self-correction across diverse tasks and LLMs of various families and sizes. By highlighting the challenges LLMs encounter in self-verification [12] and self-correction [6,7], our goal is to rectify any potential overestimations of these LLMs' abilities within the research community [1-3]. By emphasizing that feedback from external tool interaction is crucial for consistent self-improvement of LLMs, we hope our findings provide valuable insights and encourage further exploration and enhancement of self-improving LLMs.
Thanks for the detailed response, I appreciate engaging with all the points I raised.
However, I remain somewhat unconvinced that using external feedback counts as a novelty. The idea of using external feedback in combination with language feedback was used before (e.g. [1, 2]). Moreover, the need to rely on external tools is generally a limitation: it means that LLM capabilities cannot scale up unless new tools are available. It seems that at least some methods can overcome it in certain settings. I can imagine a paper focusing on this limitation (similar to [3]), but that would require a change of the story in the paper.
[1] https://arxiv.org/abs/2303.16749
Thank you for your additional comment and for taking the time to engage with us on this matter! We appreciate your perspective and understand that you remain unconvinced about the novelty of using external feedback. However, we would like to continue addressing your concerns and clarify the unique contributions of CRITIC.
The novelty of using tool-interaction for feedback
The idea of using external feedback in combination with language feedback was used before (e.g. [1, 2]).
We agree that the concept of using external feedback in combination with language feedback has appeared in prior studies. However, CRITIC's approach to this feedback is distinct and novel. The cited works [1, 2] utilize task completion oracle information (i.e., ground truth) as external feedback. In contrast, CRITIC empowers LLMs to verify their output's truthfulness, feasibility, and safety through tool interaction without relying on task oracle feedback; for instance, an LLM can access web knowledge via Google to validate its internal knowledge and make further corrections. Additionally, CRITIC is the first to explicitly highlight the limitations of self-correction and proposes an integrated framework for self-correction based on the LLM's automatic interaction with external tools. This comprehensive and practical approach, alongside the explicit study of self-correction limitations, represents a novel contribution to the field.
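For concreteness, below is a minimal sketch of such a verify-then-correct trajectory with tool feedback. The `critic_loop` helper, the prompts, and the stopping check are illustrative assumptions, not the paper's exact prompts or interface.

```python
# Illustrative sketch of a verify-then-correct loop with tool feedback.
# `llm` is assumed to be a text-completion callable and each tool a callable
# returning textual feedback; prompts and names are hypothetical.

def critic_loop(llm, task_input, tools, max_iters=4):
    output = llm(f"Task: {task_input}\nAnswer:")  # initial generation
    for _ in range(max_iters):
        # Gather external feedback by interacting with tools
        # (e.g., search snippets, interpreter output, toxicity scores).
        feedback = "\n".join(tool(task_input, output) for tool in tools)
        critique = llm(
            f"Task: {task_input}\nAnswer: {output}\n"
            f"Tool feedback:\n{feedback}\n"
            "Is the answer plausible and supported? If not, explain what is wrong."
        )
        if "no issue" in critique.lower():
            break  # verification passed; keep the current answer (hypothetical convention)
        output = llm(  # correction conditioned on the critique
            f"Task: {task_input}\nPrevious answer: {output}\n"
            f"Critique:\n{critique}\nRevised answer:"
        )
    return output
```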
The necessity and advantage of using external tools
Moreover, the need to rely on external tools is generally a limitation: it means that LLM capabilities cannot scale up unless new tools are available.
We also agree that the requirement for external tools can be viewed as a limitation, as it indeed means that LLM capabilities may not scale to new tasks without new tools. However, this realistically reflects the current state and limitations of LLMs, which in turn motivated us to introduce the unified and general CRITIC framework for integrating feedback from tool interaction. At the same time, using external tools can be advantageous: it allows LLMs to leverage the continuous evolution of toolsets and resources for reliable self-improvement, since these tools often encapsulate domain-specific expertise and data not inherently available to LLMs.
It seems that at least some methods can overcome it in certain settings.
We believe that the necessity of external tools is not something that can be easily overcome in certain settings, even for the strongest LLMs. Let's consider an extreme scenario where a “future method” enables GPT-10 to robustly handle all scenarios and generate correct facts with a 99% accuracy rate. However, due to the potential risks associated with the 1% error rate, humans still cannot trust the results and achieve trustworthy automation. In such cases, we believe that the CRITIC framework, which drives AI to autonomously use external tools or interact with the environment for self-verification and correction, is still necessary. This might even be a key pathway to achieve the “Scalable Oversight” vision [4].
The focus of the paper
I can imagine a paper focusing on this limitation (similar to [3]), but that would require a change of the story in the paper.
We appreciate your suggestion to change the paper's focus to the limitations of LLMs without external tools, similar to the cited follow-up work of CRITIC [3]. However, we believe that the current focus of the paper, which highlights the limitations of self-correction and proposes a novel framework to combine self-correction and external tool interaction, is also a valuable contribution to the field. And we have made a concerted effort to clearly communicate this focus throughout the paper.
In light of these clarifications, we kindly request that you reconsider the novelty and contributions of CRITIC. We are also open to further discussions and feedback to improve our work.
References
[1] Chen, Angelica, et al. "Improving code generation by training with natural language feedback." arXiv preprint arXiv:2303.16749 (2023). https://arxiv.org/pdf/2303.16749.pdf
[2] Shinn, Noah, et al. "Reflexion: Language agents with verbal reinforcement learning." Thirty-seventh Conference on Neural Information Processing Systems. 2023. https://arxiv.org/pdf/2303.11366.pdf
[3] Huang, Jie, et al. "Large language models cannot self-correct reasoning yet." arXiv preprint arXiv:2310.01798 (2023). https://openreview.net/forum?id=IkmD3fKBPQ
[4] Bowman, Samuel R., et al. "Measuring progress on scalable oversight for large language models." arXiv preprint arXiv:2211.03540 (2022). https://arxiv.org/pdf/2211.03540
Thanks for the response. After carefully reading our discussion as well as discussions with other reviewers I decided to keep my score.
Dear Reviewer hutJ,
We deeply appreciate the time and effort you've dedicated to reviewing our paper! In the spirit of the season, we wish you a Happy Thanksgiving!
Best Regards, The Authors
This paper proposes a framework called CRITIC to progressively validate and revise the output based on feedback from tools. Six different external tools are used, including a knowledge base, code interpreter, text APIs, Wiki, calculator, and search engine. Evaluations are done on free-form question answering, mathematical program synthesis, and toxicity reduction. CRITIC was shown to have superior performance on these benchmarks compared to strong baselines including CoT, Self-Consistency, ReAct, and PoT.
Strengths
- LLM tool use is a very timely research topic, and using external feedback for the self-improvement of LLMs is an important research area. This paper covers a wider range of tools compared to many prior works, which typically employ a single type of tool.
- The results are rather strong with universal improvements across most tasks evaluated with several different model families and sizes.
- The ablation against CRITIC w/o Tool shows the importance of external feedback from Tools, which is an important learning for the community.
- The paper is very well written and is easy to understand with comprehensive comparisons to strong baselines.
Weaknesses
Error analysis is missing on what are the failure modes after using Tools for feedback.
Questions
- It is unclear how important each Tool is to each task. Such analysis will provide further insight into where the improvements come from.
- The authors used different sampling configurations for different tasks: e.g., p=0.9 was used for Section 4.3, which differs from the p=0.5 used in Sections 4.1 and 4.2.
3. Error Analysis on Mathematical Program Synthesis tasks
On Mathematical Program Synthesis tasks, to offer readers a more comprehensive understanding of the specific corrections made by CRITIC and the specific benefits derived from tool feedback (also mentioned by reviewer ASEM), we carried out a manual statistical analysis of the types of corrections made by CRITIC on the GSM8k full test set (1319 samples).
Specifically, we identified four different categories of initial program errors: syntax errors, runtime errors, unreasonable outputs (such as irrational negative values), and other intrinsic reasoning errors. We calculated the accuracy of the initial PoT (Init), and CRITIC for each type of error. The settings for corrections are consistent with the non-oracle setting in the original paper, with up to four rounds of correction. The statistics are presented in the following table:
| Error Type | Init (Count) | Init (Acc) | CRITIC (Count) | CRITIC (Acc) |
|---|---|---|---|---|
| Intrinsic Error | 281 (77.4%) | 0.0 | 206 (71.8%) | 26.7 |
| Unreasonable Output | 61 (16.8%) | 0.0 | 26 (9.1%) | 57.4 |
| Syntax Error | 17 (4.7%) | 0.0 | 11 (3.8%) | 35.3 |
| Runtime Error | 4 (1.1%) | 0.0 | 3 (1.0%) | 25.0 |
| All Init Errors | 363 | 0.0 | 246 (85.7%) | 32.2 |
| Wrong Correction | - | 100.0 | 41 (14.3%) | 95.7 |
As can be seen in the table:
- (1) The majority of error types in the initial PoT responses are intrinsic reasoning errors (77.4%), such as misunderstanding the question or omitting conditions. The initial responses also exhibit a relatively high proportion (16.8%) of unreasonable output errors, while syntax and runtime errors are less frequent but not absent (5.8%).
- (2) CRITIC has a high success rate in correcting unreasonable output and syntax errors (57.4% and 35.3% respectively). However, the correction rate for intrinsic errors, for which reliable feedback cannot be obtained, is relatively low (26.7%). Overall, CRITIC reduces errors in the initial erroneous samples by 32.2% in a non-oracle setting.
- (3) Notably, while CRITIC has corrected a substantial number of errors in the initial PoT, the last row of the table above shows a 4.3% decrease in CRITIC's accuracy on originally correct outputs. As a result, the error modes after tool feedback also include 14.3% wrong corrections.
We will incorporate these results into the paper in subsequent updates.
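For readers who want to reproduce a similar analysis, the sketch below shows one way such a categorization could be approximated automatically before manual inspection. The `categorize_program` helper and the negative-value heuristic are assumptions for illustration and do not reproduce our exact annotation procedure, which combined rules with manual checks.

```python
import traceback

def categorize_program(program_src: str):
    """Roughly bucket a generated GSM8k program into the four error types
    discussed above (illustrative sketch; actual annotation was partly manual)."""
    try:
        code = compile(program_src, "<generated>", "exec")
    except SyntaxError:
        return "syntax_error"
    env = {}
    try:
        exec(code, env)  # the generated program is assumed to set `answer`
    except Exception:
        traceback.print_exc()
        return "runtime_error"
    answer = env.get("answer")
    if answer is None or (isinstance(answer, (int, float)) and answer < 0):
        return "unreasonable_output"  # e.g., a negative count or a missing result
    return "needs_manual_check"  # runs fine: either correct or an intrinsic reasoning error
```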
Discussing The Role of Each Tool in Different Tasks
It is unclear how important each Tool is to each task. Such analysis will provide further insight into where the improvements come from.
Thank you for your suggestion! We will incorporate an analysis of the significance of different tools in various scenarios within the CRITIC paper.
To be specific, in knowledge-intensive tasks such as commonsense QA (AmbigNQ and TriviaQA) and multi-hop knowledge reasoning tasks like HotpotQA, the leading role is played by web tools: CRITIC primarily utilizes Wikipedia page browsing and Google snippets, as demonstrated by the numerous case studies in Appendix E.1. For mathematical program synthesis tasks, external knowledge is usually not required, and a code interpreter can serve the same function as a calculator; in these experiments, the external feedback comes from error messages and execution results of the interpreter, as shown in the cases in Appendix E.2.
Clarifying Sampling Config Variations
The authors used different sampling config for the experiments for different tasks: e.g. p=0.9 was used for section 4.3 which is different from p=0.5 in 4.1 and 4.2.
CRITIC applies p=0.9 in the toxicity reduction task, diverging from the p=0.5 used in the other tasks. This decision follows the standard set by the RealToxicityPrompts benchmark [2] and is consistent with the methodology employed by all baseline studies [3], ensuring a fair comparison.
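To make the settings concrete, they could be written as the following mapping (assuming p refers to nucleus-sampling top-p; the parameter naming is illustrative):

```python
# Per-task nucleus sampling settings referenced above (illustrative naming).
SAMPLING_CONFIG = {
    "free_form_qa":       {"top_p": 0.5},  # Sec. 4.1
    "program_synthesis":  {"top_p": 0.5},  # Sec. 4.2
    "toxicity_reduction": {"top_p": 0.9},  # Sec. 4.3, following RealToxicityPrompts [2]
}
```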
References
[1] Shao, Zhihong, et al. "Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval-Generation Synergy." arXiv preprint arXiv:2305.15294 (EMNLP 2023).
[2] Gehman, Samuel, et al. "Realtoxicityprompts: Evaluating neural toxic degeneration in language models." arXiv preprint arXiv:2009.11462 (2020).
[3] Welleck, Sean, et al. "Generating sequences by learning to self-correct." arXiv preprint arXiv:2211.00053 (2022).
Thank you for your thorough review! We appreciate your acknowledgment of CRITIC's importance, clarity, novelty, effectiveness, and consistent performance gains. We're grateful for your recommendation to accept our paper.
Adding Detailed Error Analysis on QA and Mathematical Program Synthesis Tasks
Error analysis is missing on what are the failure modes after using Tools for feedback.
Thank you for your constructive suggestion! Here, we point out that the appendix of the original paper actually includes extensive case studies on failure modes of all tasks.
Moreover, in light of your suggestion, we have included new Error Analyses for the QA and Mathematical Program Synthesis tasks, as illustrated below.
1. Success and Failure Case Studies
Due to the limitation of space in the main text, we provide a representative success and failure case analysis for each task in Appendix E in the original paper, where typical failure modes include evidence search failure, reasoning errors for QA tasks, and verification or correction errors in GSM8k.
2. Error Analysis on QA Tasks
Following your suggestion, in order to further understand the failure modes after using tools for feedback, we randomly selected 100 cases from the HotpotQA task, and manually annotated and analyzed the error types for both the initial CoT and CRITIC. The results are as follows:
| Error Type | Explanations | CoT | CRITIC |
|---|---|---|---|
| Hallucination | Hallucination, including misinterpreting evidence or inconsistencies | 36% | 7% |
| Reasoning Error | Incorrect logical reasoning | 5% | 10% |
| Irrelevant Response | Answering a question that was not asked | 9% | 7% |
| Insufficient Evidence (Refusal) | Refusal to answer the question due to lack of sufficient evidence | 2% | 12% |
| Undefined Answer | Providing an empty answer or failing to derive an answer | 18% | 5% |
| Incorrect Correction | CRITIC wrongly altered the correct initial CoT answer | - | 10% |
| Label Ambiguity (FN) | The prediction is correct but not matching the label | 20% | 37% |
| Incorrect Label (FN) | The dataset answer is incorrectly labeled | 9% | 10% |
| Outdated Label (FN) | The dataset answer label is outdated | 0% | 2% |
As depicted in the table:
- (1) CRITIC can significantly reduce hallucinations (36% vs. 7%), but not all of them. Even after utilizing CRITIC, hallucinations persist due to the inability to find useful evidence via a search engine or misunderstanding the evidence. This is illustrated in Appendix E.
- (2) Most errors after applying CRITIC arise from reasoning mistakes, refusal to answer, and incorrect corrections. The refusal to answer occurs when CRITIC can't find enough evidence to support a response, which we consider an expected behavior to maintain truthfulness.
- (3) In fact, CRITIC has effectively helped us identify a large number of label ambiguities, label errors, and outdated labels in the HotpotQA dataset (49% of CRITIC's error samples). These false negatives (FN) indicate a certain bias when evaluating different methods on free-form QA tasks using automatic metrics like EM / F1. This has motivated subsequent research to design a more reliable LLM-based evaluation for QA tasks [1].
Thank you authors for adding the detailed error analysis on QA tasks, and my concerns are fully addressed.
Dear Reviewer rWYL,
Thanks for your support! We are very happy that our revision and response addressed your concerns, and thank you again for your constructive feedback! And we wish you a Happy Thanksgiving!
Best Regards, The Authors
The paper proposes a framework called CRITIC that enables large language models (LLMs) to self-verify and self-correct their outputs by interacting with external tools. The authors demonstrate the effectiveness of CRITIC in improving the performance of LLMs across multiple tasks, including free-form question answering, mathematical program synthesis, and toxicity reduction. The paper highlights the importance of external feedback in promoting the ongoing self-improvement of LLMs.
Strengths
- The paper introduces a novel framework, CRITIC, which addresses the limitations of LLMs by allowing them to verify and correct their outputs through interaction with external tools.
- The authors provide comprehensive evaluations of CRITIC on different tasks and datasets, demonstrating its consistent performance improvement over baseline methods.
- The paper highlights the crucial role of external feedback in the self-improvement of LLMs and emphasizes the unreliability of LLMs in self-verification.
Weaknesses
- I think this is a good paper. The motivation is strong: utilizing external feedback to enhance the model's ability. However, some recent studies [1] reported that large language models cannot self-correct themselves. I acknowledge that [1] did not involve external tools, which is different from CRITIC's setting, and that it appeared after CRITIC and so does not necessarily have to be included, but it would be more comprehensive to include a discussion of these new studies in such a fast-moving field.
- How much are the additional costs? Since calling external tools costs money, the authors should report the cost for each experiment.
- In Appendix C.2, an important work, Active-Prompt [2], should be included, which applies uncertainty estimation to chain-of-thought prompting.
[1] Large Language Models Cannot Self-Correct Reasoning Yet
[2] Active Prompting with Chain-of-Thought for Large Language Models
Questions
How much are the additional costs? Since calling external tools costs money, the authors should report the cost for each experiment.
Dear Reviewer dDfr,
Thank you for your comprehensive and detailed review of our paper! We sincerely appreciate your thorough and well-considered suggestions, which will help us make our work stronger! We're grateful for your recommendation to accept our paper!
Including a Discussion with Follow-up Works
I think this is a good paper. The motivation is strong: utilizing external feedback to enhance the model's ability. However, some recent studies [1] reported that large language models cannot self-correct themselves. I acknowledge that [1] did not involve external tools, which is different from CRITIC's setting and it is a paper after CRITIC which is not necessarily be included, but it would be more comprehensive to include a discussion with these new studies in such a fast-moving field.
Sure, and thanks for your kind words! There are numerous recent studies, such as [1] that you mentioned; it is actually a follow-up to CRITIC and kindly cites the CRITIC paper numerous times. These papers provide additional experiments on tasks like reasoning [1], graph coloring [2], and planning [3], further validating CRITIC's findings on the unreliability of LLM self-correction and offering additional insights [1,2,3]. Thank you for your very thoughtful suggestion; we will include discussions of these new studies in our subsequent updates.
Discussing Tool-Use Cost for Each Experiment
How much are the additional costs? Since calling external tools costs money, the authors should report the cost for each experiment.
Nice suggestion! We will include a new paragraph in the Appendix discussing the cost of tool use; in fact, all of the tools we use are free:
- For QA tasks, as mentioned in the footnote in Sec. 4.1, we built a Google web crawler to collect Google Search results and web pages, and employ a caching mechanism for web search, storing about 9GB of search results from January to April 2023 during our experiments (a simplified sketch of this caching idea is shown after this list). This part of the code is separately open-sourced anonymously at https://anonymous.4open.science/r/llm-agent-web-tools. The search-engine results in the paper are all obtained using this code. In addition, we will also open-source all caches after the anonymous review period ends, to ensure stability, fairness, and reproducibility of our results.
- For Mathematical program synthesis tasks, we use a local code interpreter, which is free of charge.
- For toxicity reduction tasks, we adopt PERSPECTIVE API [4] kindly provided by Google, which is also free.
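As an illustration of the caching mechanism mentioned above, a simplified sketch might look like the following; the cache layout and the `fetch_fn` argument are assumptions and do not correspond exactly to the open-sourced implementation.

```python
import hashlib
import json
import os

CACHE_DIR = "search_cache"  # hypothetical location for cached search results

def cached_search(query: str, fetch_fn):
    """Return search results for `query` from a local cache, fetching and
    storing them on a miss (simplified sketch of the crawler's cache)."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    key = hashlib.sha256(query.encode("utf-8")).hexdigest()
    path = os.path.join(CACHE_DIR, f"{key}.json")
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)  # cache hit: no live request needed
    results = fetch_fn(query)    # cache miss: crawl the live search results
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False)
    return results
```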
Adding Active-Prompt to Uncertainty Estimation
In Appendix C.2, an important work active-prompt [2] should be included, which applies uncertainty estimation to chain-of-thought prompting.
Thank you for your suggestion! We concur that Active-Prompt is a successful application of uncertainty estimation in the context of demonstration selection. In response to your request, we have referenced this paper in Appendix C.2 in our paper.
In light of these clarifications and revisions, we kindly request that you consider increasing the review score.
References
[1] Huang, Jie, et al. "Large language models cannot self-correct reasoning yet." arXiv preprint arXiv:2310.01798 (2023).
[2] Stechly, Kaya, Matthew Marquez, and Subbarao Kambhampati. "GPT-4 Doesn't Know It's Wrong: An Analysis of Iterative Prompting for Reasoning Problems." arXiv preprint arXiv:2310.12397 (2023).
[3] Valmeekam, Karthik, Matthew Marquez, and Subbarao Kambhampati. "Can Large Language Models Really Improve by Self-critiquing Their Own Plans?." arXiv preprint arXiv:2310.08118 (2023).
Dear Reviewer dDfr, we greatly appreciate your time in reviewing our response. Having submitted the rebuttal and revised our paper, we are eager to know if you have any remaining concerns. Since the reviewer-author discussion phase will end soon, we would greatly appreciate your support and valuable feedback before the deadline on Nov. 22.
Dear authors,
Thanks for your response. I have raised my score because my concerns have been well addressed. Thanks
This paper introduces CRITIC, a method for improving the outputs of language models by leveraging external feedback from various tools. The idea is to generate an initial output with the language model and then refine this output using feedback from an external tool, such as a Python interpreter, search engines, or toxicity detection APIs. Notably, this approach relies solely on in-context learning without the need for specialized training. Results across various tasks, including question answering, mathematical reasoning, and toxicity reduction, show that CRITIC improves over baselines.
Strengths
- The experiments demonstrate the effectiveness of the proposed approach across a diverse set of tasks, indicating its potential to significantly improve the performance of large language models (LLMs).
- Utilizing external feedback as a means of improving LLM outputs is practical. The simplicity of the approach is a plus, as it facilitates widespread application.
Weaknesses
- The primary concern with this work is its novelty. Several studies have previously demonstrated that external feedback can be instrumental in correcting LLM outputs. In fact, there is existing work within each domain addressed in this paper, such as Self-Correct ([1], using external APIs), Self-Ask ([2], employing a search engine), and Self-Debug ([3], via a Python interpreter). Notably, Self-Debug and Self-Ask have a striking resemblance to CRITIC but are not referenced.
- The settings that rely on an oracle are somewhat idealistic, and detract from the core message of the paper. It may be more appropriate to move these results to an appendix (as done by other works) to facilitate a clearer understanding.
[1] Welleck, Sean, Ximing Lu, Peter West, Faeze Brahman, Tianxiao Shen, Daniel Khashabi, and Yejin Choi. "Generating Sequences by Learning to Self-Correct." In The Eleventh International Conference on Learning Representations. 2022.
[2] Press, Ofir, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, and Mike Lewis. "Measuring and narrowing the compositionality gap in language models." arXiv preprint arXiv:2210.03350 (2022).
[3] Chen, Xinyun, Maxwell Lin, Nathanael Schärli, and Denny Zhou. "Teaching large language models to self-debug." arXiv preprint arXiv:2304.05128 (2023).
Questions
- The emphasis in Table 1 seems inconsistent. For instance, the AmbigNQ EM score of 50.0 is highlighted for Text-Davinci-003, but it is not the highest. Is this a bug or am I missing something?
- Regarding the GSM task in a non-oracle setting, it appears that feedback from the interpreter is limited to syntactic correctness. Given the improvements, it suggests that many of the programs were initially syntactically wrong. Is this the case?
References
[1] Shinn, Noah, et al. "Reflexion: Language agents with verbal reinforcement learning." Thirty-seventh Conference on Neural Information Processing Systems. 2023.
[2] Madaan, Aman, et al. "Self-refine: Iterative refinement with self-feedback." Thirty-seventh Conference on Neural Information Processing Systems. 2023.
[3] Chen, Xinyun, et al. "Teaching large language models to self-debug." arXiv preprint arXiv:2304.05128 (2023).
[4] Yao, Shunyu, et al. "React: Synergizing reasoning and acting in language models." arXiv preprint arXiv:2210.03629 (2022).
[5] Chen, Wenhu, et al. "Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks." arXiv preprint arXiv:2211.12588 (2022).
[6] Saunders, William, et al. "Self-critiquing models for assisting human evaluators." arXiv preprint arXiv:2206.05802 (2022).
[7] Bai, Yuntao, et al. "Constitutional ai: Harmlessness from ai feedback." arXiv preprint arXiv:2212.08073 (2022).
[8] Campos, Jon Ander, and Jun Shern. "Training language models with language feedback." ACL Workshop on Learning with Natural Language Supervision. 2022.
[9] Chen, Angelica, et al. "Improving code generation by training with natural language feedback." arXiv preprint arXiv:2303.16749 (2023).
[10] Haluptzok, Patrick, Matthew Bowers, and Adam Tauman Kalai. "Language models can teach themselves to program better." arXiv preprint arXiv:2207.14502 (2022).
[11] Welleck, Sean, et al. "Generating Sequences by Learning to Self-Correct." The Eleventh International Conference on Learning Representations. 2023.
[12] Kadavath, Saurav, et al. "Language models (mostly) know what they know." arXiv preprint arXiv:2207.05221 (2022).
[13] Ni, Ansong, et al. "Lever: Learning to verify language-to-code generation with execution." International Conference on Machine Learning. PMLR, 2023.
[14] Huang, Jie, et al. "Large language models cannot self-correct reasoning yet." arXiv preprint arXiv:2310.01798 (2023).
[15] Stechly, Kaya, Matthew Marquez, and Subbarao Kambhampati. "GPT-4 Doesn't Know It's Wrong: An Analysis of Iterative Prompting for Reasoning Problems." arXiv preprint arXiv:2310.12397 (2023).
[16] Valmeekam, Karthik, Matthew Marquez, and Subbarao Kambhampati. "Can Large Language Models Really Improve by Self-critiquing Their Own Plans?." arXiv preprint arXiv:2310.08118 (2023).
[17] Schick, Timo, et al. "Toolformer: Language models can teach themselves to use tools." arXiv preprint arXiv:2302.04761 (2023).
[18] Pan, Liangming, et al. "Automatically correcting large language models: Surveying the landscape of diverse self-correction strategies." arXiv preprint arXiv:2308.03188 (2023).
[19] Kim, Geunwoo, et al. "Language models can solve computer tasks." Thirty-seventh Conference on Neural Information Processing Systems. 2023.
Adding Additional Error Analysis on the Role of Interpreter Feedback (Q2)
Regarding the GSM task in a non-oracle setting, it appears that feedback from the interpreter is limited to syntactic correctness. Given the improvements, it suggests that many of the programs were initially syntactically wrong. Is this the case?
Thanks for raising this issue! The feedback from the interpreter is not solely about syntactic correctness. As mentioned in Sec. 4.2 and demonstrated in the cases in Appendix E.2, CRITIC leverages both "error messages and execution results", enabling it to fix syntax errors, correct runtime errors (e.g., the timeout error in Listing 9 in Appendix E.2), and self-reflect on unreasonable reasoning steps and outputs (e.g., Listing 7 in Appendix E.2). Notably, in non-oracle settings, we do not apply correction only to answers where the interpreter fails; by default, we allow LLMs to verify and then correct if necessary, and we "stop if the executed result remains unchanged for two consecutive revisions", as detailed in Sec. 4.2.
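To illustrate the kind of feedback involved, here is a simplified sketch that collects both error messages and execution results and applies the stop-when-unchanged rule; the `llm_correct` helper and the exact stopping condition are illustrative assumptions, not the paper's implementation.

```python
import traceback

def run_program(src: str):
    """Execute a candidate program, returning (result, feedback_text)."""
    env = {}
    try:
        exec(src, env)  # the generated program is assumed to set `answer`
        result = env.get("answer")
        return result, f"Execution succeeded. Result: {result!r}"
    except Exception:
        return None, "Execution failed with:\n" + traceback.format_exc(limit=1)

def correct_with_interpreter(llm_correct, question, program, max_iters=4):
    """Verify-then-correct with interpreter feedback, stopping once the
    executed result is unchanged across consecutive revisions (sketch)."""
    results = []
    for _ in range(max_iters):
        result, feedback = run_program(program)
        results.append(result)
        if len(results) >= 2 and results[-1] == results[-2]:
            break  # result stabilized across consecutive revisions
        program = llm_correct(question, program, feedback)  # revise using feedback
    return program
```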
Moreover, to offer readers a more comprehensive understanding of the specific corrections made by CRITIC and the specific benefits derived from tool feedback, we carried out a manual statistical analysis of the types of corrections made by CRITIC on the GSM8k full test set (1319 samples).
Specifically, we identified four different categories of initial program errors: syntax errors, runtime errors, unreasonable outputs (such as irrational negative values), and other intrinsic reasoning errors. We calculated the accuracy of the initial PoT (Init), and CRITIC for each type of error. The settings for corrections are consistent with the non-oracle setting in the original paper, with up to four rounds of correction. The statistics are presented in the following table:
| Error Type | Init (Count) | Init (Acc) | CRITIC (Count) | CRITIC (Acc) |
|---|---|---|---|---|
| Intrinsic Error | 281 (77.4%) | 0.0 | 206 (83.7%) | 26.7 |
| Unreasonable Output | 61 (16.8%) | 0.0 | 26 (10.6%) | 57.4 |
| Syntax Error | 17 (4.7%) | 0.0 | 11 (4.5%) | 35.3 |
| Runtime Error | 4 (1.1%) | 0.0 | 3 (1.2%) | 25.0 |
| All Init Errors | 363 | 0.0 | 246 | 32.2 |
As shown in the above table:
- (1) The majority of error types in the initial PoT responses are intrinsic reasoning errors (77.4%), such as misunderstanding the question or omitting conditions. The initial responses also exhibit a relatively high proportion (16.8%) of unreasonable output errors, while syntax and runtime errors are less frequent but not absent (5.8%).
- (2) CRITIC has a high success rate in correcting unreasonable output and syntax errors (57.4% and 35.3% respectively). However, the correction rate for intrinsic errors, for which reliable feedback cannot be obtained, is relatively low (26.7%). Overall, CRITIC reduces errors in the initial erroneous samples by 32.2% in a non-oracle setting.
We hope our responses adequately address all your concerns! We greatly appreciate your time and effort in reviewing our paper, and your constructive feedback!
Diving into the Unreliability of Self-Correction
CRITIC further delves into the core reason behind the unreliability of self-verification and self-correction from the perspective of uncertainty estimation, as shown in Appendix D.1. Essentially, language models are incapable of accurately identifying "what they know" (i.e., LLMs don't know what they know) [12] without relying on external tools. Therefore, without the aid of oracle verification (employed in many contemporary works such as Reflexion [1] and Self-Refine [2]), self-correction might surprisingly deteriorate performance on many tasks, and even incorrectly modify many initial answers (as demonstrated in Tables 2 and 3 under CRITIC w/o Tool, and in Table 8 under Self-Refine).
- As noted by Reviewer [dDfr], a nice recent follow-up work to CRITIC, titled "LLMs cannot self-correct reasoning yet" [14] extends the study of CRITIC (which cites CRITIC numerous times). It further validates and expands our findings on the unreliability of Self-Verify and Self-Correct in CRITIC using GPT-4 in more settings on many reasoning tasks.
- Another line of recent research [15, 16] experiments on graph coloring [15] and planning [16], demonstrating that "LLMs are in fact very poor at verifying solutions" and that "iterative prompting can help when there is an external provably correct verifier in the loop". These follow-ups further illustrate the significance and foresight of CRITIC's findings on the unreliability of self-verification and self-correction and the value of introducing external tool verification.
In conclusion, CRITIC is the first to unveil the unreliability of self-verification and self-correction across diverse tasks and LLMs of various families and sizes. By highlighting the challenges LLMs encounter in self-verification [12] and self-correction [6,7], our goal is to rectify any potential overestimations of these LLMs' abilities within the research community [1-3]. By emphasizing that feedback from external tool interaction is crucial for consistent self-improvement of LLMs, we hope our findings provide valuable insights and encourage further exploration and enhancement of self-improving LLMs.
Moreover, we have discussed these related works and offered a thorough comparison to the latest research in Appendix C. We will also expand the discussion to include recent follow-up works [18], such as the “LLMs cannot self-correct reasoning” paper [14] as suggested by reviewer [dDfr].
Adjusting the Placement of Oracle Settings for Enhanced Clarity
The settings that rely on an oracle are somewhat idealistic, and detract from the core message of the paper. It may be more appropriate to move these results to an appendix (as done by other works) to facilitate a clearer understanding.
In fact, the oracle setting in CRITIC was established and included in the main text with reference to previous work (see Table 1 in [11]). It's worth mentioning that contemporaneous works to CRITIC such as Reflexion [1], Self-Refine [2], and RCI [19] all used the Oracle setting (i.e., trial-and-error) for their main results. Since you suggest that it may be more appropriate to move these results, we will adjust the placement of the Oracle setting in subsequent updates, hoping this will make things clearer for you.
Addressing Table 1 Emphasis Typos (Q1)
The emphasis in Table 1 seems inconsistent. For instance, the AmbigNQ EM score of 50.0 is highlighted for Text-Davinci-003, but it is not the highest. Is this a bug or am I missing something?
Thank you for pointing out this nuance in Table 1. We acknowledge the discrepancy in highlighting the AmbigNQ scores, despite it not being the highest. This was indeed an oversight and not a bug. We have corrected this in an updated version. Additionally, we have meticulously reviewed all the tables to ensure no further inconsistencies are present. We appreciate your rigorous attitude and constructive feedback.
Dear Reviewer ASEM,
Thank you for your comprehensive and detailed review of our paper!
We sincerely appreciate your thorough and well-considered suggestions, which will help us make our work stronger.
Novelty and Contributions in Comparison to Previous and Contemporary Works
Several studies have previously demonstrated that external feedback can be instrumental in correcting LLM outputs. In fact, there is existing work within each domain addressed in this paper, such as Self-Correct ([1], using external APIs), Self-Ask ([2], employing a search engine), and Self-Debug ([3], via a Python interpreter). Notably, Self-Debug and Self-Ask have a striking resemblance to CRITIC but are not referenced.
Thank you for raising these concerns! We agree that there are numerous works related to CRITIC, given the importance and rapid development of LLM self-correction and tool-use research. However, we would like to address potential misconceptions about CRITIC and its contributions relative to these related works, because CRITIC's stance and findings differ significantly from them:
Intrinsic Self-Verification and Self-Correction
These works, including Self-Critique [6], CAI [7], Reflexion [1], Self-Refine [2], and others [8-10], prompt or train language models to correct their own results. In contrast, our study is the first to demonstrate that such a "Self-Verification and Self-Correction" approach is remarkably unreliable across diverse tasks and various LLMs. Specifically, modest or even deteriorated performance is observed universally when self-correcting without external feedback (Sec. 4 and Appendix D.1). Consequently, CRITIC emphasizes the importance of feedback from external interactions for the consistent self-improvement of LLMs. The proposed framework is general and has proven effective across multiple tasks. These reflections on the unreliable self-correction of LLMs and the crucial findings on the importance of external feedback can be important learnings for the community, as recognized by Reviewer [rWYL].
- Self-Correct [11] uses the PERSPECTIVE API to build toxicity reduction pairs for training a corrector. Despite this, its focus isn't on tool feedback but on creating value-improving pairs to train an additional corrector. It doesn't use external feedback in math reasoning or constraint generation, primarily employing an intrinsic self-correction mechanism. It is mainly designed for smaller models like GPT-2, which have limited in-context learning capabilities. In contrast, our approach, CRITIC, iteratively optimizes outputs based on various external tool feedback, rather than solely relying on the model's own judgment. Additionally, CRITIC is a plug-and-play approach applicable to black-box LLMs, which further distinguishes it. We also provide an analysis and comparison to Self-Correct in our original paper's related works section and Sec 4.3's experiments.
- You also referenced Self-Debug [3], which, notably, debuted concurrently with the first version of CRITIC. Self-Debug emphasizes code tasks, proposing self-explanation "without any feedback on code correctness or error messages" (intrinsic feedback) and applying unit tests similar to LEVER [13]. In contrast, results from CRITIC show that self-correction without external feedback is unreliable, and CRITIC addresses general challenges of truthfulness, trustworthiness, and toxicity using various tools like Google and external APIs. In light of your valuable feedback, we'll include a comparison with Self-Debug in our updated paper.
Tool-Use: Tool Augmented LMs
Another category of closely related work is Tool-Augmented LMs, such as ReAct [4], PoT [5], and Toolformer [17]. We would like to clarify that these works significantly differ from CRITIC as they focus on tool learning. To the best of our knowledge, none of them consider using tool interaction as faithful feedback to iteratively improve the LLMs, which is the key component of CRITIC.
- You also mentioned a "striking resemblance" to the Self-Ask paper, which, to our understanding, also does not involve self-correction with external feedback. In fact, Self-Ask "uses a search engine to answer sub-questions" and does not focus on verifying and correcting answers. However, we believe this work falls within the broader scope of RAG and will reference it in the related works section.
Thanks for taking the time to respond! Let me reiterate my concerns and also rephrase my understanding of your responses.
Weakness 1: Novelty
The primary concern with this work is its novelty. Several studies have previously demonstrated that external feedback can be instrumental in correcting LLM outputs. In fact, there is existing work within each domain addressed in this paper, such as Self-Correct ([1], using external APIs), Self-Ask ([2], employing a search engine), and Self-Debug ([3], via a Python interpreter). Notably, Self-Debug and Self-Ask have a striking resemblance to CRITIC but are not referenced.
- Self-Correct
"its focus isn't on tool feedback but on creating value-improving pairs to train an additional corrector."
This doesn't seem to be correct?
From Self-Correct: "As the value function, we use the Perspective API score, v(y) ∈ [0, 1], which measures the toxicity of the completed sequence." They use the word "value function," but essentially use the score from the API to decide if the improvements have any benefit.
Further:
We use additional fine-grained information from the toxicity API as natural language feedback. Specifically, besides the overall toxicity score, Perspective API also provides scores for fine-grained attributes of toxicity (e.g., identity attack, profanity, flirtation, etc.). At training time, we compare the attribute scores from a hypothesis and its selected correction, and use the attribute with the largest decrease as natural language feedback (e.g. "decrease toxicity in profanity"). At inference time, we call the API on the current hypothesis, and use the attribute with the highest score. Here we use the API at inference time, which is not required in our previous experiments.
- Self-Debug:
Self-Debug emphasizes code tasks, proposing self-explanation "without any feedback on code correctness or error messages" (intrinsic feedback)
This also doesn't seem to be precise. From Self-Debug v1, Figure 1: "The code explanation along with the execution results constitute the feedback message, which is then sent back to the model to perform more debugging steps..." Please also see Section 3 and the examples in the Appendix.
So, I wonder if the novelty claim holds.
Regarding concurrency with Self-Debug: Self-Debug came out on April 11, 2023, so as per ICLR reviewer guidelines it is not concurrent work (only work posted after May 28, 2023 is concurrent). Please see the guidelines here. I am not aware of when CRITIC came out, and I do not want to look it up either. I'll let the AC decide on this.
Weakness 2: Including Oracle results in the main paper
The settings that rely on an oracle are somewhat idealistic,
Thanks for moving the Oracle results to the Appendix.
It's worth mentioning that contemporaneous works to CRITIC such as Reflexion [1], Self-Refine [2], and RCI [19] all used the Oracle setting (i.e., trial-and-error) for their main results.
This is incorrect, at least for some of the works. For example, [2] (link) places the results from the Oracle setting in the Appendix. On this note,
CRITIC is the first to unveil the unreliability of self-verification and self-correction across diverse tasks and LLMs of various families and sizes.
This statement may be toned down because [2], in the version posted on May 25th, includes a discussion on this on Page 5 and also presents initial conjectures for why models may fail to reason (lack of good quality feedback).
Nevertheless, this weakness is addressed by moving the results to the Appendix.
Question: Nature of feedback from the interpreter
Thanks for the analysis!
- How is "Unreasonable Output" determined? Is it a hard-coded rule that the output cannot be negative or an irrational number?
- From the analysis: The majority of error types in the initial PoT responses are intrinsic reasoning errors (77.4%), such as misunderstanding the question or omitting conditions. To clarify, there is no way to catch these errors with tools, right?
Dear Reviewer ASEM, We hope that our revised paper and the above responses address your concerns. If there are any unresolved issues, please let us know. We are more than happy to answer any further questions you may have.
Dear Reviewer ASEM,
Thank you once again for your review of our paper! We hope that our responses have adequately addressed all your concerns. As the deadline is approaching, we encourage you to reach out if you have any additional questions before the reviewer-author discussion period ends. We are more than happy to address any further questions or concerns you may have!
Best Regards,
The Authors
Thank you very much for your quick reply and for letting us know your remaining concerns!
Further Discussion on Novelty
1. Self-Correct
"its focus isn't on tool feedback but on creating value-improving pairs to train an additional corrector."
This doesn't seem to be correct?
From Self-Correct: "As the value function, we use the Perspective API score, v(y) ∈ [0, 1], which measures the toxicity of the completed sequence." They use the word "value function," but essentially use the score from the API to decide if the improvements have any benefit.
We agree that Self-Correct uses the Perspective API in one of its tasks, but we maintain our assertion because it does not use external feedback in math reasoning or constraint generation, primarily employing an intrinsic self-correction mechanism. The focus of this paper is not on feedback based on tool-interaction, but on "creating value-improving pairs to train an additional corrector.”
Unlike Self-Correct, we have validated the effectiveness of self-correction with external tool-interaction on diverse tasks and LLMs through extensive experiments. Moreover, we have demonstrated that self-correction based on external feedback does not require retraining an additional corrector and can be taught with few-shot in-context learning, making it applicable to black-box LLMs. Furthermore, we provide an analysis and comparison to Self-Correct in our original paper's related works section and Section 4.3's experiments.
2. Self-Debug:
Self-Debug emphasizes code tasks, proposing self-explanation "without any feedback on code correctness or error messages" (intrinsic feedback)
This also doesn't seem to be precise. From Self-Debug v1: Figure 1: The code explanation along with the execution results constitute the feedback message, which is then sent back to the model to perform more debugging steps.... Please also see Section 3 and the examples in the Appendix.
So, I wonder if the novelty claim holds.
We clarify that this statement is from the abstract of Self-Debug v1 [2], which you also referenced. Thanks for your comment; upon review, despite Self-Debug focusing on code tasks and CRITIC concentrating on mathematical reasoning program synthesis, we find no substantial difference in their core method of using program execution results as feedback.
However, it's crucial to note that CRITIC was independently developed from January to April 2023, separate from Self-Debug. In fact, Self-Debug and CRITIC were published within approximately a month of each other, and Self-Debug is also under review at ICLR’24. While Self-Debug focuses on code tasks, CRITIC tackles broader challenges of truthfulness, trustworthiness, and toxicity, utilizing various tools such as Google and external APIs. Through extensive experimentation, we've demonstrated significant performance improvements offered by CRITIC across different base LLMs in tasks like QA, Math reasoning, and toxicity reduction. In light of your feedback, we will include a citation to this paper in our revision.
Including Oracle results in the main paper
It's worth mentioning that contemporaneous works to CRITIC such as Reflexion [1], Self-Refine [2], and RCI [19] all used the Oracle setting (i.e., trial-and-error) for their main results.
This is incorrect, at least for some of the works. For example, [2] (link) places the results from the Oracle setting in the Appendix. On this note,
CRITIC is the first to unveil the unreliability of self-verification and self-correction across diverse tasks and LLMs of various families and sizes.
This statement may be toned down because [2], in the version posted on May 25th, includes a discussion of this on page 5 and also presents initial conjectures for why models may fail to reason (lack of good-quality feedback).
Nevertheless, this weakness is addressed by moving the results to the Appendix.
Thank you for acknowledging the concern addressed, but we need to clarify some potential misunderstandings.
It's indeed correct that Self-Correct [1], Reflexion [3], and RCI [4] use Oracle in their main results. Furthermore, the first version of Self-Refine (link) reported Oracle results for math reasoning, and this is also verifiable through its code.
We noted that after CRITIC's release, Self-Refine moved its Oracle results to the appendix and added the discussion you mentioned in its updated version of May 25th. We cannot confirm whether the authors of Self-Refine read CRITIC before making these modifications, but the added discussion you mentioned is indeed post-CRITIC. This affirms our statement that "CRITIC is the first to unveil the unreliability of self-verification and self-correction across diverse tasks and LLMs of various families and sizes."
Answer to the Question: Nature of feedback from the interpreter
How is "Unreasonable Output" determined? Is it a hard-coded rule that the output cannot be negative or an irrational number?
Indeed, our error categorization process combines rules and manual inspection. Given the large volume of data, a full manual analysis would be quite time-consuming. After manually analyzing the GSM8k data, we observed that almost all ground truths are positive; therefore, we flagged outputs with negative values as unreasonable. While negative values don't always signify unreasonableness, this rule was determined based on the characteristics of the data.
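For concreteness, the rule-based part of this categorization boils down to a simple filter like the sketch below; the helper name `is_unreasonable` and the sample records are hypothetical illustrations rather than our exact analysis script, and the remaining error types are categorized by manual inspection.

```python
# Illustrative sketch of the rule used to flag "Unreasonable Output" in the
# GSM8k error analysis: nearly all GSM8k ground truths are positive, so
# negative (or non-numeric) predictions are treated as unreasonable here.
def is_unreasonable(predicted_answer) -> bool:
    try:
        value = float(predicted_answer)
    except (TypeError, ValueError):
        return True  # non-numeric program outputs cannot be valid GSM8k answers
    return value < 0  # negative answers are flagged as unreasonable

# Hypothetical example records from an error-analysis pass
samples = [{"prediction": -12}, {"prediction": 42.0}, {"prediction": "None"}]
flagged = [s for s in samples if is_unreasonable(s["prediction"])]
print(f"{len(flagged)} of {len(samples)} predictions flagged as unreasonable")
```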
Please note that this filtering process is solely for the convenience of error analysis; it doesn't imply that CRITIC can only identify negative values through tool-interaction feedback. We have also observed that CRITIC can catch unreasonable decimals or numbers that are inconsistent with the problem statement. However, these instances are relatively uncommon in the GSM8k dataset and are difficult to filter out with simple rules.
If you believe there are areas in our error analysis that could be improved, please don't hesitate to suggest them. We can make rapid improvements before the discussion deadline.
From the analysis: The majority of error types in the initial PoT responses are intrinsic reasoning errors (77.4%), such as misunderstanding the question or omitting conditions. To clarify, there is no way to catch these errors with tools, right?
While a code interpreter executing the original program cannot directly catch these intrinsic errors, we would not go so far as to say that "there is no way to catch these errors with tools." In fact, we can follow the CRITIC approach and leverage tool-use results for simulated reflection [6], or employ code-based self-verification [7], to partially identify and resolve these errors. These follow-up works further validate the promise of allowing LLMs to interact with tools for self-verification and correction, as proposed by CRITIC.
We sincerely thank you for your thoughtful feedback and hope our response has addressed your concerns appropriately.
References
[1] Welleck, Sean, et al. "Generating sequences by learning to self-correct."
[2] Chen, Xinyun, et al. "Teaching large language models to self-debug." https://openreview.net/forum?id=KuPixIqPiq
[3] Shinn, Noah, et al. "Reflexion: Language agents with verbal reinforcement learning."
[4] Kim, Geunwoo, et al. "Language models can solve computer tasks."
[5] Madaan, Aman, et al. "Self-refine: Iterative refinement with self-feedback."
[6] Wang, Xingyao, et al. "MINT: Evaluating LLMs in multi-turn interaction with tools and language feedback." https://arxiv.org/pdf/2309.10691.pdf
[7] Zhou, Aojun, et al. "Solving challenging math word problems using GPT-4 Code Interpreter with code-based self-verification." https://arxiv.org/pdf/2308.07921.pdf
We agree that Self-Correct uses the Perspective API in one of its tasks, but we maintain our assertion because it does not use external feedback in math reasoning or constrained generation, primarily employing an intrinsic self-correction mechanism.
Essentially, Self-Correct did tool-driven correction for one task, and CRITIC does this for multiple tasks. I agree with this, but I don't know if this is grounds for novelty.
We find that, although Self-Debug focuses on code tasks while CRITIC concentrates on program synthesis for math reasoning, there is no substantial difference in their core method of using program execution results as feedback.
If I am reading this right, Self-Debug and CRITIC have the same method but focus on different tasks. I also agree with this statement.
This discussion reinforces my concerns regarding novelty: the ideas in CRITIC are already present in Self-Correct and Self-Debug. However, I agree that the idea is extended to a broader range of tasks in CRITIC. I will consult with other reviewers during the remainder of the discussion period and try to understand why they think the idea is novel.
Finally, regarding the dates and arXiv versions: my job is to judge the version submitted to ICLR, and my feedback is based solely on the submitted/revised version. I will let the AC/SAC judge the arXiv version of CRITIC when they make a decision.
Thanks
Dear Reviewer ASEM,
We sincerely appreciate your active response and letting us know your thoughts!
We are delighted to have had a vibrant discussion with you about CRITIC, especially its relationship with Self-Correct and Self-Debug. While we reached many agreements, we acknowledge a few unresolved debates. As we're unable to participate in further discussions due to procedural constraints, allow us to summarize our viewpoint and underscore CRITIC's contributions.
We agree with your observation that before CRITIC, Self-Correct had already implemented "tool-driven correction" in a toxicity task (more precisely, by training additional correctors). Specifically, although it did not introduce tools in math or constrained generation tasks and relied solely on intrinsic feedback from the model, it indeed used the Perspective API to train a GPT-2 corrector in the toxicity task and also used the API at inference time. We completely agree with these facts.
In fact, when we proposed CRITIC, we indeed took Self-Correct as an important reference work and used it as a baseline for analysis and comparison in our experiments (Sec. 4.3). Based on these points of consensus, we emphasize the multi-faceted contributions of CRITIC as follows:
- Firstly, we propose a unified CRITIC framework that integrates different tools and tasks into one framework, and we design a series of new prompting methods that teach black-box LLMs to self-verify and self-correct through interaction with tools. This gives LLMs reliable self-correction capabilities through in-context learning, without relying on extra training (as you summarized in the Review); a minimal sketch of this verify-then-correct loop is given after this list.
- Secondly, we conduct comprehensive experiments across distinct tasks, demonstrating significant performance improvements offered by CRITIC across different LLMs and model sizes (as acknowledged by you and all other reviewers).
- Furthermore, we are the first to discover the unreliability of LLMs in self-correction through systematic experiments. We delve deeper into the main reason for the instability of self-correction from the perspective of uncertainty estimation, finding that LLMs struggle to "know what they know" without external feedback (Appendix D.1). Therefore, we emphasize that feedback from external tool interaction is crucial for the consistent self-improvement of LLMs. (In response to this contribution, you mentioned that Self-Refine proposed a similar viewpoint in its May revision. We pointed out that this revision came later than the publication of CRITIC and did not receive any subsequent opposition.)
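To make the first point concrete, below is a minimal sketch of the verify-then-correct loop with a Python interpreter as the tool, as used for mathematical program synthesis. The helper names (`llm_generate`, `execute_program`), the prompt strings, and the stopping heuristic are simplified placeholders rather than the exact few-shot prompts from the paper.

```python
import contextlib
import io

def execute_program(program: str) -> str:
    """Run a candidate Python program and return its stdout, or the error message."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(program, {})  # sandboxing omitted for brevity
        return buffer.getvalue().strip()
    except Exception as exc:
        return f"Execution error: {exc}"

def critic_loop(question: str, llm_generate, max_iters: int = 3) -> str:
    """Generate an initial program, then alternately critique it (using execution
    feedback from the interpreter) and revise it via in-context prompting."""
    program = llm_generate(f"Write a Python program that answers:\n{question}")
    for _ in range(max_iters):
        feedback = execute_program(program)
        critique = llm_generate(
            f"Question: {question}\nProgram:\n{program}\n"
            f"Execution result: {feedback}\n"
            "Is this answer plausible? If not, explain what is wrong."
        )
        # Crude stopping heuristic for this sketch only
        if "plausible" in critique.lower() and "not plausible" not in critique.lower():
            break
        program = llm_generate(
            f"Question: {question}\nPrevious program:\n{program}\n"
            f"Execution result: {feedback}\nCritique: {critique}\n"
            "Write an improved program."
        )
    return program
```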
Regarding Self-Debug, we reaffirm that CRITIC and Self-Debug are completely independent, concurrent works published on arXiv within a month of each other, and we emphasize the unique contributions of CRITIC mentioned above. We also agree with your suggestion to leave the relationship between CRITIC and Self-Debug to the AC's discretion.
Once again, we express our heartfelt appreciation for your meticulous review and constructive suggestions, which have significantly improved our paper. It's fair to say that you are one of the most responsible and proactive reviewers we have encountered in this submission process! We wish you a happy Thanksgiving!
Wish you all the best,
The Authors
Author Response to all Reviewers
Firstly, we would like to extend our deepest gratitude to the reviewers for their comprehensive reviews and valuable feedback!
Positive Remarks
We are glad to receive positive comments on our paper, including:
- The recognition of our research topic on Tool-Use and self-correction as being "very timely and important" [rWYL] and the appreciation of the strong motivation behind our work [dDfr].
- The acknowledgment of the merits of our proposed CRITIC method, which is seen as a valuable contribution [dDfr, rWYL] that is "practical and simple" [ASEM], with an "interesting comparison" [hutJ].
- The agreement on the significance of our paper in emphasizing the unreliability of LLMs in self-verification and self-correction [dDfr, rWYL], and the crucial role of external feedback [dDfr, rWYL], which is deemed as "an important learning for the community" [rWYL].
- The positive feedback on the clarity of our writing [dDfr, rWYL, hutJ] and our comprehensive experiments covering various downstream tasks and different model families and sizes [ASEM, dDfr, rWYL], which are found to show universal improvements [dDfr, rWYL].
Our intention with this paper is to unveil the unreliability of LLM self-verification [1] and self-correction [2, 3] across diverse tasks and LLMs of various sizes and families, and to highlight the importance of external feedback. Our goal is to rectify potential overestimations of these abilities of LLMs within the research community [4, 5, 6]. By emphasizing the importance of feedback from external tool interaction for the consistent self-improvement of LLMs, we aim to provide valuable insights that can stimulate further exploration and development in the field of self-improving LLMs!
Revisions Based on Feedback
We also highly value the constructive suggestions and have made the following revisions to improve the clarity of our work:
- We have included a detailed error analysis for QA and Mathematical Program Synthesis tasks (Appendix D.2) [ASEM, rWYL].
- We have added a new discussion on the cost of tool-use for each experiment (Appendix D.3) and have open-sourced web tools at https://anonymous.4open.science/r/llm-agent-web-tools [dDfr].
- We have included a new discussion about the latest contemporary and follow-up works of CRITIC (Sec. 2 and Appendix C.1) [dDfr].
- We have discussed the role of each tool in different tasks (Appendix D.4) [rWYL].
- We have adjusted the placement of the oracle settings for enhanced clarity, addressed the emphasis typos in Table 1 [ASEM], and fixed the phrasing issue regarding "mimic human" [hutJ].
We have responded to each reviewer individually and have made necessary revisions accordingly. We hope that our responses and revisions effectively address your concerns, and we would greatly appreciate it if you would consider revising your score!
References
[1] Kadavath, Saurav, et al. "Language models (mostly) know what they know." arXiv preprint arXiv:2207.05221 (2022).
[2] Saunders, William, et al. "Self-critiquing models for assisting human evaluators." arXiv preprint arXiv:2206.05802 (2022).
[3] Bai, Yuntao, et al. "Constitutional ai: Harmlessness from ai feedback." arXiv preprint arXiv:2212.08073 (2022).
[4] Shinn, Noah, et al. "Reflexion: Language agents with verbal reinforcement learning." Thirty-seventh Conference on Neural Information Processing Systems. 2023.
[5] Madaan, Aman, et al. "Self-refine: Iterative refinement with self-feedback." Thirty-seventh Conference on Neural Information Processing Systems. 2023.
[6] Kim, Geunwoo, et al. "Language models can solve computer tasks." Thirty-seventh Conference on Neural Information Processing Systems. 2023.
The authors introduce CRITIC, an approach aimed at improving the performance of LLMs by incorporating feedback from external tools. This method starts with generating a response using LLMs, which is then refined using input from external sources, including Python interpreters, search engines, or toxicity detection APIs. This refinement relies exclusively on in-context learning. The efficacy of CRITIC is demonstrated across various tasks, including question answering, mathematical problem-solving, and toxicity reduction.
Reviewers ASEM, dDfr, and hutJ have raised concerns regarding the novelty and differentiation from other works that utilize feedback from external tools. Nonetheless, considering the completeness and thoroughness of the work, I recommend an accept (poster) rating.
Why not a higher score
As previously mentioned, several reviewers continue to express concerns regarding novelty.
Why not a lower score
I appreciate the thoroughness of the study regarding the use of feedback from external tools for improving LLMs.
Accept (poster)