PaperHub
Average rating: 5.3/10 (withdrawn; 4 reviewers)
Ratings: 5, 5, 3, 8 (min 3, max 8, std 1.8)
Average confidence: 3.5
ICLR 2024

Corex: Pushing the Boundaries of Complex Reasoning through Multi-Model Collaboration

Submitted: 2023-09-19 · Updated: 2024-03-26
TL;DR

We introduce Corex, a suite of strategies designed to enhance the capabilities of LLMs in complex task-solving, with a pivotal focus on advancing multi-model collaboration.

Abstract

Keywords
Large Language Models · Complex Reasoning · Multi-Model Collaboration

Reviews and Discussion

Review (Rating: 5)

This paper proposes Corex, a framework to promote collaborations among LLMs as agents for solving complex reasoning problems. Corex is composed of three collaboration paradigms, Debate, Review, and Retrieve modes. In Debate mode, LLM agents are divided into two groups with one judge agent. The agents will start interactive discussions and modify their predictions. The refined predictions will be presented to the judge agent for decision. In Review mode, two agents will be involved, where one agent serves as the primary agent, and the other agent reviews the prediction and provides feedback. Finally, for the Retrieve mode, the retriever agent will examine all the candidates generated by other agents by providing confidence scores. Corex is evaluated on eighteen datasets of four reasoning categories, and demonstrates its effectiveness. Detailed analyses are conducted with respect to different LLMs, the number of rounds for interactions, and efficiency.

Strengths

  • This paper proposes multi-model collaborations for solving complex reasoning problems, which is an interesting and attractive direction for the current community.
  • A fairly diverse set of results across several benchmarks is provided, with several LLMs explored. The analysis is detailed and provides some insights into the proposed method. I especially like the analysis for cost-effectiveness.
  • The paper is generally well-written and organized, and most of the content is clear to me. Code is provided for reproducibility.

Weaknesses

  • The proposed three modes are not that novel to me. There are many existing works that have already explored or at least share similar ideas to the three modes. I only list several representative works here. For Debate mode, [1] already explored this setting. For Review mode, [2] has a similar method. And finally for Retrieve mode, [3] also shares similar ideas. I don't think Section 2 Related Works is well-written either. It only lists related works without thoroughly discussing the relatedness and differences. Also, I don't see why "External knowledge and tool utilization" is that related to this work.
  • The performance improvement is not that consistent. It seems that in most cases, Corex-Debate and Corex-Review-NL do not perform that well. Instead, Corex-Review-Code and Corex-Retrieve seem to be better. I think it demonstrates the advantage of using PL, which is a well-acknowledged fact in the community. The paper also does not provide explanations or understandings of why these methods work well or not. I think it is important to have [1,2,3] as baselines and better explain why Corex works as a paradigm of multi-model collaboration.

[1] Du, Yilun, et al. "Improving Factuality and Reasoning in Language Models through Multiagent Debate." arXiv preprint arXiv:2305.14325 (2023).

[2] Madaan, Aman, et al. "Self-refine: Iterative refinement with self-feedback." arXiv preprint arXiv:2303.17651 (2023).

[3] Yang, Chengrun, et al. "Large language models as optimizers." arXiv preprint arXiv:2309.03409 (2023).

Questions

  1. The notations and methods are somehow confusing in Section 3.1. For example, what is the decision-making process h mentioned in Section 3.1? What does k mean in the first paragraph of Section 3.1?

  2. There are some notable performance gaps among the Corex variants, such as Corex-Review-Code vs. other variants for GSM-Hard in Table 2 and for Repeat Copy in Table 4. Any intuition or explanations for this?

  3. Have you tried the collaborations of different LLMs in the same mode? For example, using GPT-3.5-turbo, GPT-4, and Claude-Instant-1.2 as different agents for Retrieve mode. Do you think there exists an issue of diversity when using the same LLMs as agents?

  4. It would be great to see the results for open-sourced models such as Llama.

  5. How to prove that Corex enhances the "factuality, faithfulness, and reliability"? Any case study or evaluation on this?

Comment

Q4: It would be great to see the results for open-sourced models such as Llama.

This suggestion aligns with our initial intention, as we aimed to achieve the performance of strong commercial models through the collaboration of open-source models. Moreover, this aspect was indeed explored in section A 'LIMITATIONS AND BROADER IMPACTS' of our paper: The primary challenge in incorporating open-source models is their performance constraints.

Corex is specifically adapted to enhance the capabilities of weaker models (e.g., GPT-3.5-Turbo) to collaboratively surpass those of stronger ones (e.g., GPT-4). However, we posit that there is a lower bound to the competencies necessary for collaboration. A typical example is the coding ability; the relatively weaker coding capabilities of open-source models will significantly limit modes like review-code. Moreover, the process of collaboration involves intensive information exchange, demanding a certain level of long-text understanding from the models, a threshold that current open-source models have yet to attain. Nevertheless, with the ongoing advancements in open-source large models, we believe there is potential in the future to apply them within the Corex framework.

Q5: How to prove that Corex enhances the "factuality, faithfulness, and reliability"? Any case study or evaluation on this?

  1. Commonsense as a Measure: The performance in commonsense reasoning tasks inherently tests factuality and faithfulness. The improvement in these tasks, as evidenced by our experimental results, directly reflects enhancement in these aspects. The ability of Corex to arrive at accurate conclusions (e.g., by writing reviews and fixing errors during debate) serves as a testament to its enhanced factuality and faithfulness.
  2. Reliability: In the motivation of our work, we discuss the unreliability issues in reasoning processes and the challenges, such as coding in the PAL scenarios. Our Review mode is specifically designed to mitigate the generation of erroneous reasoning chains and buggy code. The effectiveness demonstrated in our evaluations reflects the enhanced reliability of the system.

While it is challenging to evaluate these aspects quantitatively (for example, faithfulness is difficult to measure due to the lack of appropriate reference sets for reasoning chains), direct observation of the experimental results provides an indication of improvement. Previous works [9, 10] also use similar empirical approaches to demonstrate factuality.

Moreover, we will incorporate additional cases into the revision. As a preview, using Review-Code for GSM8k as an example, among all errors, 136 cases (10% of the total sample) were due to incorrect code in the original PAL. After applying the review mode, this number was reduced to 74 cases (5.6% of the total sample), showing an increase in reliability through NL2Code + task delegation.

Again, we deeply appreciate your suggestions, which have contributed to the improvement and refinement of our paper. We hope that our responses can address your concerns.

References

[9] Improving factuality and reasoning in language models through multiagent debate., arXiv:2305.14325

[10] Examining the inter-consistency of large language models: An in-depth analysis via debate

Comment

Thanks a lot for your valuable feedback! Here we first discuss the weaknesses you mentioned, and then address your questions.

About Novelty Issues: In response to your comment about the novelty of Corex's three modes and their comparison to existing works, we offer the following clarifications:

  1. Debate Mode: We introduce a format combining group discussions to balance factuality and diversity (discussed in section 3.1). The term 'debate' might have led to misconceptions of similarity with previous works. However, our approach distinctively uses a combination of collective intelligence and a judging mechanism to enhance decision-making, which is a novel application in this context.
  2. Review Mode: Corex's Review mode is designed as a task-agnostic framework. Unlike [2], we encourage each participant to incorporate external insights and make incremental modifications on top of their predecessor's refinement. This mode also considers both NL and code scenarios. This approach has shown superior results compared to strong baselines, and recent works [4,5,6] have also demonstrated that methods based solely on self-correction have limited effectiveness.
  3. Retrieve Mode: Unlike [3], our work does not focus on prompt optimization. The strategy we employ involves generating candidates and then scoring them based on the faithfulness of their reasoning processes to select the best answer. To the best of our knowledge, this approach, prioritizing the value of reasoning chains over majority voting mechanisms, has not been explored before in similar contexts.

In summary, we hold that Corex represents a set of novel approaches in the realm of reasoning and problem-solving.

Performance Improvement Issues: The performance improvement is closely linked to the specific scenarios each algorithm is designed for. For instance, in the Review mode, using programming languages (PL) inherently offers an advantage in solving mathematical problems. We have discussed the advantage of Corex in using PL in our response to your Q2.

Regarding Other Works: For references [1,2] you mentioned, the evaluations heavily depend on specially designed prompts, and the absence of official implementations for all the benchmarks we covered prevents a fair comparison (while Corex does not require many curated prompts). Furthermore, in our experiments such as semi-structured understanding with long texts, GPT-3.5-Turbo becomes inoperable for these methods because their long meta-instructions, prompts, and demonstrations exceed the context window.

As for reference [3], it was an oversight on our part as it was completed in the same month as our work. We plan to include it in the next revision. This work focuses on prompt optimization and meta-prompt design, whereas our focus is on reasoning tasks, making the relevance limited.

Additionally, since our entire framework is task-agnostic, we chose to compare it with CoT-SC(10, 20 ...) and ComplexCoT(10, 20 ...), which have significantly higher costs than our methods. This comparison is meant to demonstrate Corex's efficiency and effectiveness through multi-model collaborations, considering the broader context of its application across various tasks and settings.

References

[1] Du, Yilun, et al. "Improving Factuality and Reasoning in Language Models through Multiagent Debate." arXiv preprint arXiv:2305.14325 (2023).

[2] Madaan, Aman, et al. "Self-refine: Iterative refinement with self-feedback." arXiv preprint arXiv:2303.17651 (2023).

[3] Yang, Chengrun, et al. "Large language models as optimizers." arXiv preprint arXiv:2309.03409 (2023).

[4] Large Language Models Cannot Self-Correct Reasoning Yet, arXiv: 2310.01798

[5] Can Large Language Models Really Improve by Self-critiquing Their Own Plans?, arXiv: 2310.08118

[6] GPT-4 Doesn't Know It's Wrong: An Analysis of Iterative Prompting for Reasoning Problems, arXiv: 2310.12397

Comment

For your questions:

Q1: The notations and methods are somehow confusing in Section 3.1. For example, what is the decision-making process h mentioned in Section 3.1? What does k mean in the first paragraph of Section 3.1?

We apologize for the confusion caused by the notations. In Section 3.1, the decision-making process 'h' refers to the formalized representation of the model's process of generating and refining reasoning chains. As for the term 'k', it denotes the number of agents involved in the collaborative processes. The goal was to establish a unified notation system that encompasses the three modes. We will strive to improve the notations in subsequent revisions of our work.

Q2: There are some notable performance gaps among the Corex variants, such as Corex-Review-Code vs. other variants for GSM-Hard in Table 2 and for Repeat Copy in Table 4. Any intuition or explanations for this?

Thank you for your detailed observation of our experimental results. The performance gap you've noticed is primarily due to the use of NL2Code + task delegation, which helps alleviate the limitations of LLMs in handling large-number computations [7,8], as mentioned in the related work section (External Knowledge & Tool Utilization). To further elucidate this, let us provide an example from the GSM-Hard task:

James decides to run 1793815 sprints 1793815 times a week. He runs 60 meters each sprint. How many total meters does he run a week?

Large models struggle with directly computing such large numbers. However, with the help of code, we can ensure accurate computations:

def solution():
    sprints_per_day = 1793815
    days_per_week = 1793815
    meters_per_sprint = 60
    total_sprints = sprints_per_day * days_per_week
    total_meters = total_sprints * meters_per_sprint
    result = total_meters
    return result

Nevertheless, the NL2Code process can introduce errors in understanding the problem statement or in the program itself, as discussed in section 3.2. Corex helps rectify these errors, making the Python solutions more reliable and thereby yielding further enhanced outcomes.

The same rationale applies to the Repeat Copy task. Using code to generate strings reduces uncertainty, enhancing the performance of Corex-Review-Code in comparison to other variants.
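To make this concrete, below is a toy, self-contained sketch of how a PAL-style program removes that uncertainty for a Repeat Copy-style request such as "repeat the word cat three times, separated by spaces". The instance and the code are our own illustrative example, not taken from the benchmark or from the paper's prompts:

def solution():
    word = "cat"
    times = 3
    # Building the string programmatically avoids the token-by-token
    # uncertainty an LLM faces when emitting long or repetitive outputs.
    result = " ".join([word] * times)
    return result  # "cat cat cat"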

Q3: Have you tried the collaborations of different LLMs in the same mode? For example, using GPT-3.5-turbo, GPT-4, and Claude-Instant-1.2 as different agents for Retrieve mode. Do you think there exists an issue of diversity when using the same LLMs as agents?

Thank you for your interest in the collaboration of different models. In our analysis (section 5.2), we specifically discussed the synergies between different models in the Debate and Retrieve modes.

For diversity, in Debate, as stated in section 3.1, we have utilized group discussions to avoid the lack of diversity seen in previous works where a single model might dominate. We believe that using the same LLMs as agents in this context presents more diversity compared to prior work.

In the Review mode, we enhance model diversity by utilizing external insights to help models identify and correct errors, which offers higher diversity compared to self-correction approaches in previous works. Regarding the Retrieve mode, our analysis demonstrated that using a mix of stronger and weaker models as candidates yielded similar results, indicating that the performance of using the same agents and different agents was comparable. This suggests that regardless of whether the models produce diverse outputs, our method exhibited stable performance.

References

[7] PaL: Program-aided language models., arXiv:2211.10435

[8] Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks

Review (Rating: 5)

This paper proposes Corex - a suite of three general strategies for multi-agent collaboration: Debate, Review and Retrieve. The strategies aim to improve the factuality, faithfulness, and reliability of the reasoning process. The Corex strategies are evaluated on 18 tasks from 4 categories, showing improvement over the baselines. Further analysis shows that the debate method converges after a small number of rounds, that the method fares better in terms of effectiveness/efficiency trade-off than other methods, and that the impact of the model size changes depending on the model role.

Strengths

S1) The coverage of the related work in the paper is very extensive and seems complete.

S2) The paper lists a lot of challenges, which are indeed relevant, including misunderstanding the question, or generating a faulty reasoning process. Using multiple LLMs, treated as agents, is an exciting direction to explore to address these challenges.

S3) The paper is overall relatively well-written and easy to follow.

S4) The method should be relatively easy to reproduce based on the details provided in the paper.

S5) The results are informative as they cover a large set of tasks and categories. The additional analysis sheds light on the method's behavior and computational efficiency.

Weaknesses

W1) Contribution unclear - It is clear that the Corex variants fare better against the baselines, but it is less clear what the main contribution of Corex is. All three components are based on ideas that are present in prior work, and Corex does not integrate the components. So methodologically, it is indeed only a suite of what has been done before. Further complicating the contribution, the abstract and the introduction mention (often vaguely) a number of issues, including "the limitations of its internal representation", "limitations in solving reasoning tasks", "unreliable answers", "think outside the box", "prevalent obstacles". The three issues specifically listed in Figure 1 - wrong calculation, misunderstanding the relationship between variables, and code failing to accurately reflect the problem statement - are actionable, but there is no analysis on whether the improved performance of Corex has a qualitative impact on any of these issues. The desired features reviewed in Table 1 (e.g., reference free, multiple LLMs) are again different from what the introduction was arguing. In the other sections, aspects like factuality, task-agnosticity, and reliability are argued for, but again, there was no experiment to validate these claims.

W2) Comparison to baselines - the paper compares against one set of baselines in Table 1, then another set of baselines is referred to in the method section (e.g., Du et al. 2023 for the Debate module), and then the results and the analysis focus on general approaches like CoT and SC. It is unclear why the other baselines from Table 1 and prior works that designed the individual components are not included in the evaluation. This is even more important because Corex diverges from prior works to design these components differently (e.g., the Debate component), and it is important to know if this different approach fares better or worse, and why.

W3) Originality - the proposed method reads like a more complete combination of heuristics compared to prior work, but these heuristics are already present in recent methods. In that sense, it is unclear what is the methodological delta between this method and prior work. This novelty gap is further blurred by the absence of a clear problem statement in section 1, and the lack of direct comparison to related work in section 2.

W4) Premise - The overall premise of Corex is also confusing. If LLMs cannot reason reliably (as stated in section 1), then what makes these same LLMs suddenly able to reason in Corex? Moreover, in some of the Corex variants, like the Debate module, it is unclear what reasoning means exactly - because the design here explicitly opts for majority voting to suppress reasoning. The Retriever compares different chains of thought, but whether these are scored based on their reasoning soundness is not clear.

W5) Result takeaways - The results often say "our method" but in fact Corex is a multitude of methods, whose probability of outperforming the baselines is generally around 50% (e.g., table 5 has 4 Corex variants and three baselines). Moreover, the best Corex variant is largely unstable over the tasks, though the Debate one is typically the weakest, while the Retrieve and Review-code are usually performing better. The results need a discussion that dives deeper into these distinctions in performance across tasks.

Minor:

  • Retrieve is not a paradigm
  • Footnote 3 - not clear what is meant by the "nature of commonsense tasks" - prior works have used code representations to also address these

Questions

Q1) What exact problem(s) Corex is trying to solve, and how is this evaluated in the paper?

Q2) Why are the baselines from Table 1 and other works that propose similar modules for debating, retrieval, and review not included in the evaluation?

Q3) What is the main novelty of Corex compared to prior work?

Q4) How do the authors explain the differences in the Corex variant performance across the tasks?

Comment

Questions

Q1) What exact problem(s) Corex is trying to solve, and how is this evaluated in the paper?

We try to address the limitations of LLMs that rely solely on internal representations, often leading to unreliable responses in complex reasoning tasks. We introduce a collaborative approach inspired by human social interactions (diverging from some fixed role allocations). This method enhances models' general reasoning abilities through model collaborations like idea exchange and bug fixing. Corex's effectiveness is evaluated across 18 tasks in 4 categories, demonstrating notable improvements in AI reasoning compared with previous strong baselines.

Q2) Why are the baselines from Table 1 and other works that propose similar modules for debating, retrieval, and review not included in the evaluation?

The purpose of setting up Table 1 was to underscore the features like task-agnostic, reference-free, etc., of our method. Evaluating the methods listed in Table 1 on a wide range of benchmarks we covered poses significant challenges. For instance, PHP is specific to mathematical tasks, and CoK, in principle, is only effective for commonsense reasoning and requires additional configurations of knowledge bases. For MAD and ToT, apart from the results reported in the original preprints, we lack "official" prompts to perform other tasks, which makes it difficult to ensure a fair comparison.

Additionally, in tasks like semi-structured understanding (which requires reading long text plus a table), GPT-3.5-Turbo becomes unusable for these methods because their long meta-instructions, prompts, and demonstrations exceed the context window.

Given these challenges, we opt to use baselines like CoT-SC(10, 20, ...) and ComplexCoT(10, 20, ...) for our comparisons in main experiments and analysis. These baselines are more broadly applicable and offer a direct insight into the effectiveness of our methods.

Q3) What is the main novelty of Corex compared to prior work?

In the discussions above, we have already highlighted the Contributions and Methodological Delta. Now, let's discuss the novelty of Corex from a systemic perspective: The primary novelty of Corex is its pioneering introduction of multi-model collaboration into the realm of reasoning, inspired by human behavior in solving complex problems. Furthermore, Corex is distinguished by its task-agnostic framework, which supports a reference-free approach applicable to a wide array of reasoning tasks. This contrasts with prior works that are often tailored to specific problem types or heavily dependent on predefined knowledge bases, references, or prompts. Corex introduces a set of paradigms that are considerably more flexible and versatile.

Q4) How do the authors explain the differences in the Corex variant performance across the tasks?

We have briefly discussed the performance of the Corex variants on page 8; a more detailed analysis was constrained by the page limit. We'll add additional analysis and interpretations of the performance in the next revision. Some conclusions can be reached intuitively from the experimental results.

For instance, integrating coding with review significantly boosts performance in mathematical tasks, particularly in complex ones like GSM-Hard. This enhancement is attributed to "bug-fixing" in review reasoning/coding processes for mathematical problems. On the other hand, Debate mode is mainly effective in commonsense reasoning tasks, where interactive deliberation helps in navigating through varied perspectives and reasoning chains.

Meanwhile, the Retrieve mode, which involves selecting from a set of candidates, shows a relatively stable performance across different tasks. This stability is due to its method of filtering and choosing the most appropriate answer from a pre-defined set of options.

Thanks again for your valuable feedback and constructive comments to make our paper better! If there are specific aspects you are particularly interested in, we are open to addressing further inquiries to elucidate more details.

Comment

Thank you for the detailed review that will help improve our paper significantly! We first discuss the weaknesses you mentioned, and then we'll answer your questions.

About Contribution & Methodological Delta

We respectfully disagree with the viewpoint that our work lacks novelty. Since you did not specify the particular prior works for comparison, we will illustrate our novel contributions using references cited in our paper:

  1. Debate Mode: Our approach to debate distinctly differs from previous works [1]. As elaborated in section 3.1, we have introduced a group discussion format that uniquely balances factuality and diversity. This novel design is crafted to tackle specific challenges unaddressed by earlier methods, e.g., the risk of monopolization and wrong consensus.
  2. Review Mode: Corex's Review mode stands out as a task-agnostic framework, which diverges notably from methods represented by reference [2]. Our model encourages participants to not only integrate external insights but also make incremental modifications on their predecessor’s effort, considering both NL and code scenarios. This approach has demonstrated superior results over strong baselines, and it addresses limitations found in methods based purely on self-correction, as evidenced by recent works [3,4,5]. The integration of external insights, coupled with the flexibility of handling multiple contexts, underlines the novelty of our design.
  3. Retrieve Mode: The core strategy in this mode is to generate candidate answers and score them based on their reasoning processes, thereby selecting the best answer. This focus on evaluating the value of reasoning chains, rather than relying on majority voting mechanisms (i.e., self-consistency), is a novel direction in the field to our knowledge.

These three components collectively underscore the originality of Corex, not only in their individual functionalities but also in how they synergistically enhance the overall reasoning capabilities of LLMs.

Comparison to baseline

The baselines selected for comparison in our study, such as PAL, ComplexCoT, and CoT-SC, are robust, widely used, and highly recognized within the community. Comparing Corex with these baselines provides a clear and intuitive understanding of the effectiveness of our methods. As for the other works mentioned in Table 1, we address the rationale for their exclusion in our response to Q2.

About the Premise

To address your concerns regarding the reasoning capabilities of LLMs as utilized in Corex, it's important to clarify that while LLMs inherently possess reasoning abilities, they are prone to errors when confronted with complex problems (like performing CoT, and the degeneration-of-thought problem discussed in our background setting). Corex is designed as an approach to enhance this reasoning capacity through collaborations. In our framework, LLMs act as different agents, each contributing external insights that collectively enable them to solve problems that they might not solve individually.

Nature of commonsense tasks: Sorry for the confusion. What we intend to express is that commonsense reasoning tasks like StrategyQA, CommonsenseQA, and ARC-C cannot be solved through the NL2Code + Task delegation approach. So we do not include Review-Code for them.

References

[1] Du, Yilun, et al. "Improving Factuality and Reasoning in Language Models through Multiagent Debate." arXiv preprint arXiv:2305.14325 (2023).

[2] Madaan, Aman, et al. "Self-refine: Iterative refinement with self-feedback." arXiv preprint arXiv:2303.17651 (2023).

[3] Large Language Models Cannot Self-Correct Reasoning Yet, arXiv: 2310.01798

[4] Can Large Language Models Really Improve by Self-critiquing Their Own Plans?, arXiv: 2310.08118

[5] GPT-4 Doesn't Know It's Wrong: An Analysis of Iterative Prompting for Reasoning Problems, arXiv: 2310.12397

Review (Rating: 3)

This paper introduces a few strategies to make use of multi-agent communication for complex task-solving. Corex consists of three different strategies including debate, review, and retrieve. The authors experiment with OpenAI GPT as multiple agents and use the proposed methods to let the agents collaborate to solve tasks including math problems, commonsense reasoning, and symbolic reasoning, obtaining improved performance compared to CoT and some of its variants.

Strengths

  1. The idea of using multi-agent collaboration to solve complex tasks is well-motivated and is a promising direction.
  2. The paper is in general well written and easy to follow.
  3. The experimental results show some improvement upon the compared baselines.

Weaknesses

  1. The paper lacks technical novelty. Multi-agent debate for reasoning and complex task-solving has already been explored; the "review" method is also quite similar to recent work on self-reflection and self-refinement; and the "retrieve" method is intuitively very similar to RAG (retrieval-augmented generation), with the main difference being its incorporation into the multi-agent framework using one agent as the retriever.

  2. The proposed components including debate / review / retrieve are not conceptually very much related. Instead they seem to be distinct methods. The authors also use them separately in the experiments without combining or integrating them into a single framework.

  3. The performance improvement over stronger baselines such as CoT-SC(10) is not very significant. Also, it is unshown whether variants such as CoT-SC(20/30) will lead to different conclusions. Maybe it's because the CoT-SC(10) baseline consumes a similar number of tokens with OpenAI's API? But the authors did not show the total tokens consumed by different methods. And if so, the comparison with CoT would not be fair enough.

  4. The manuscript lacks analysis of when (or for which kind of tasks) one of the methods among debate / review / retrieve outperforms the others and why it is the case. Adding some analysis about this question would bring more insights to the manuscript.

Questions

See the above weaknesses for questions.

Comment

Thanks for carefully reading our paper and detailed reviews!

About Technical Novelty

Multi-agent debate for reasoning and complex task-solving

Our approach to multi-agent debate for reasoning is fundamentally different from previous works [1,2]. In Corex, we introduced a format that combines group discussions to balance factuality and diversity, as detailed in section 3.1. This novel design addresses several observed challenges in multi-agent collaborations:

  1. Context Length Limitations: Previous debates in LLMs are hindered by the inability to fully capture the entire debate process within context length limitations. Our method overcomes this by "distributing" the debate to different sets of players. Our finer-grained debates also prevent meaningless dialogues that can arise when all models participate simultaneously.
  2. Reliability of Consensus: While debates tend to converge to single final answers, these outcomes are not always correct due to the potential for incorrect consensus or prevalent biases. Our method mitigates this by incorporating mechanisms that critically evaluate the debate process and outcomes.
  3. Risk of Monopolization: Given the performance disparities among various LLMs, there exists a risk of stronger models dominating the debate. Our approach addresses this by ensuring a balanced participation of different models, preventing any single model from monopolizing the debate.

The use of the term 'debate' might have led to misconceptions regarding similarity to previous ones. However, our approach is distinctively characterized by the integration of collective intelligence and a specialized judging mechanism.
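For illustration only, the group-discussion structure can be sketched as follows. The callables, grouping, and round count below are placeholders we introduce for this response, not the paper's actual implementation:

def corex_debate(question, agents, judge, rounds=2):
    # 'agents' are callables (question, peer_chains) -> reasoning chain (str);
    # 'judge' is a callable (question, group_chains) -> final answer.
    # These signatures are illustrative placeholders.
    mid = len(agents) // 2
    groups = [agents[:mid], agents[mid:]]  # two discussion groups
    group_chains = []
    for group in groups:
        chains = [agent(question, []) for agent in group]  # initial chains
        for _ in range(rounds):  # intra-group discussion rounds
            chains = [agent(question, chains) for agent in group]
        group_chains.append(chains)
    # The judge aggregates the refined chains from both groups and decides.
    return judge(question, group_chains)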

the "review" method is also quite similar to recent work on self-reflection and self-refinement

Review mode distinguishes itself from recent work on self-reflection and self-refinement in several key aspects:

  1. Integration of External Insights: Unlike the approach in [3], Corex encourages each participant in the Review mode to incorporate external insights. This integration allows for a broader perspective and mitigates the echo chamber effect often seen in self-refinement processes [4].
  2. Incremental Modifications: Participants make incremental modifications on top of their predecessor’s work. This iterative process enables a cumulative improvement of solutions, going beyond the limitations of self-correcting methods.
  3. Consideration of More Scenarios: Review mode takes into account both NL and code, enhancing its applicability in a variety of tasks, especially those involving calculations.

Review has shown superior results compared to previous strong baselines, and recent works [4,5,6] have also demonstrated that methods based solely on self-correction have limited effectiveness.
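As a minimal sketch of this review chain (the callables below are placeholders we use for illustration, not the paper's implementation):

def corex_review(question, primary, reviewers):
    # 'primary' is a callable question -> draft (an NL chain or a program);
    # each reviewer is a callable (question, draft) -> revised draft.
    draft = primary(question)
    for reviewer in reviewers:
        # Each reviewer adds external feedback and makes incremental
        # modifications on top of the predecessor's revision.
        draft = reviewer(question, draft)
    return draft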

"retrieve" method is intuitively very similar to RAG

While there may be some similarities in motivation between our "retrieve" method and RAG, as is common with all retrieval-based methods, the actual implementation of our approach is distinctively different. Our strategy involves generating a range of candidate answers and then utilizing scoring to select the best answer. This method prioritizes the value of the reasoning chains, diverging from the more common majority voting mechanisms in LLM reasoning. To the best of our knowledge, this specific approach, focusing on the evaluation and selection of reasoning chains in a retrieval context, has not been explored before in similar contexts. This aspect of our method contributes to its novelty as well as effectiveness, setting it apart from traditional RAG methods.
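A minimal sketch of this candidate-scoring step, with placeholder callables (not the paper's implementation), would be:

def corex_retrieve(question, agents, retriever):
    # Each agent is a callable question -> (reasoning_chain, answer);
    # 'retriever' is a callable (question, reasoning_chain) -> score.
    candidates = [agent(question) for agent in agents]
    # Select by the scored quality of the reasoning chain,
    # not by majority vote over the answers.
    scored = [(retriever(question, chain), answer) for chain, answer in candidates]
    best_score, best_answer = max(scored, key=lambda pair: pair[0])
    return best_answer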

References

[1] Improving factuality and reasoning in language models through multiagent debate., arXiv:2305.14325

[2] Examining the inter-consistency of large language models: An in-depth analysis via debate

[3] Self-refine: Iterative refinement with self-feedback. arXiv:2303.17651.

[4] Can Large Language Models Really Improve by Self-critiquing Their Own Plans?, arXiv: 2310.08118

[5] Large Language Models Cannot Self-Correct Reasoning Yet, arXiv: 2310.01798

[6] GPT-4 Doesn't Know It's Wrong: An Analysis of Iterative Prompting for Reasoning Problems, arXiv: 2310.12397

Comment

Relation among the Components

The proposed components including debate / review / retrieve are not conceptually very much related.

The conceptual design of our components stems from a problem-solving perspective, inspired by human methods of addressing complex issues. These components are not isolated methods but are interrelated aspects of strategies that mirror how humans exchange ideas and scrutinize each other's calculations or reasoning processes. Debate emulates the human process of exchanging and challenging ideas, where different perspectives are debated to reach a more accurate and comprehensive understanding. Review reflects the human practice of critically examining and refining ideas; this component focuses on assessing and improving upon the outputs generated by other methods or models. Retrieve is similar to how humans seek out information from various sources; this component involves sourcing multiple possible solutions and choosing the most appropriate one.

The authors also use them separately in the experiments without combining or integrating them into a single framework.

The conceptual framework of Corex intentionally separates the components (like humans typically can't simultaneously engage in a debate and correct each other's mistakes for refinement).

However, the versatility of our approach allows for strategic selection and recommendation of methods based on specific scenarios, enabling a form of integration at the application level. For instance, through careful routing design, each method, being task-agnostic, can contribute to a collective problem-solving process. This approach mimics human problem-solving skills where different methods are employed at different stages or aspects of a task, and can be extended to other scenarios like robotics planning and social simulations.

Performance Concerns

The performance improvement over stronger baselines such as CoT-SC(10) is not very significant.

Due to the page limit constraints, we focused on including stronger baselines like CoT-SC(10, 20, ...) and ComplexCoT in our experimental section, which may have resulted in the performance improvement appearing less significant. It is worth noting that the cost of implementing Corex is less than half that of these strong baselines.

It is unshown whether variants such as CoT-SC(20/30) will lead to different conclusions & The authors did not show the total tokens consumed by different methods.

Thanks for paying attention to cost-effectiveness, which is an advantage of Corex compared with methods like CoT-SC and Complex-CoT. In both the main text and the appendix of our paper, we have conducted additional analysis that you might have overlooked. Specifically, in section 5.3, we present comparative experiments between Corex and various configurations of CoT-SC (5, 10, 20, 40, 80) and ComplexCoT (10, 20, 40). Furthermore, in the appendix section C.2, Figures 11 and 12 showcase additional comparative experiments with equivalent configurations.

Due to the high experimental costs, we selected one task each from the categories of math, commonsense, and symbolic reasoning. The results are visually presented in these figures, offering an intuitive presentation without overburdening the tables. We would appreciate it if you could reconsider and review these sections. Additionally, the total tokens consumed by different methods, which you are interested in, are also presented in these analyses, displayed on a log scale.

Further Analysis

The manuscript lacks analysis of when (or for which kind of tasks) one of the methods among debate / review / retrieve outperforms the others and why it is the case

Thank you for your interest in delving deeper into the efficacy of our methods. We have briefly discussed the performance of these methods in the upper part of page 8, though a detailed analysis was constrained by the page limit (We will follow your advice to add more analysis for more insights in the revision.).

To summarize, some conclusions can be drawn from our experimental results. For instance, in the Review mode, integrating code significantly enhances performance in mathematical tasks, particularly in complex ones like GSM-Hard. Debate mode mainly shows effectiveness in commonsense reasoning. The Retrieve mode, due to its nature of selecting from candidates, exhibits relatively stable performance across various tasks.

Thank you again for the comments and suggestions, we hope our responses can address your concerns! If you have specific points of interest, we welcome further inquiries and are happy to provide more detailed explanations.

Review (Rating: 8)

The main idea of this paper is to use multiple LLMs as if they were autonomous agents and let them interact with a prompting strategy that is structured into the Debate, Review, and Retrieve stages. The multiple LLMs collaborate using those modes of interaction to enhance the factuality, faithfulness, and reliability of the final answers. The approach has been evaluated on multiple benchmarks including mathematical reasoning, symbolic reasoning, commonsense reasoning, and semi-structured reasoning. The collaborative approach is compared to existing prompting strategies such as CoT and self-consistency approaches.

Strengths

  • The idea of collaborative language models is very interesting and fairly novel.
  • Using the debate structure to guide the interactions is novel and effective.
  • The experiments are conducted on a large and varied set of benchmarks with different tasks.
  • The analysis is interesting and insightful.

Weaknesses

  • The notation is a bit unclear in some places. In the very beginning explaining the debate, what is k? Please denote this when you explain c^i_t. c_i is the viewpoint or one step of the reasoning chain? Or both? Please make it explicit and use one term consistently.

  • I expected the collaboration of multiple models to be based on 5 different LLMs. The paper uses GPT 3 and 4 and Claude. It was not clear in the paper how the 5 different opinions were solicited. Are you using different temperatures or obtaining multiple samples from one LLM and looking at them as different heterogeneous agents? I see in the experiments that you use different LLMs to play the judge roles but was not sure if that is enough to have a real heterogeneous setting with multiple agents.

Questions

See above.

Comment

Deep thanks for the positive review! We are very encouraged that you found our method to be novel and interesting.

The notation is a bit unclear in some places. In the very beginning explaining the debate, what is k? Please denote this when you explain c^i_t. c_i is the viewpoint or one step of the reasoning chain? Or both? Please make it explicit and use one term consistently.

We apologize for any confusion caused by our notations. In the context of Corex, the term 'k' refers to the number of agents involved in the collaborations. The notation 'c^i_t' represents the text (e.g., a reasoning chain) generated by the i-th agent at time t during the debate process. Our intention is to create a unified notation system for the three modes in Corex. We will take your advice to further improve the notations and presentation of our method.
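Schematically, with k agents A_1, ..., A_k and a question q, each agent i produces c^i_t = A_i(q, c^1_{t-1}, ..., c^k_{t-1}) at round t, and the final decision is made over the last-round chains. (This is a simplified illustration for this response rather than the precise formulation.)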

I expected the collaboration of multiple models to be based on 5 different LLMs. The paper uses GPT 3 and 4 and Claude. It was not clear in the paper how the 5 different opinions were solicited. Are you using different temperatures or obtaining multiple samples from one LLM and looking at them as different heterogeneous agents? I see in the experiments that you use different LLMs to play the judge roles but was not sure if that is enough to have a real heterogeneous setting with multiple agents.

Thank you for your inquiry regarding the multi-model collaboration aspect. We will address your concerns in the following points:

  1. Regarding how the 5 different opinions were solicited, we employ different meta instructions to enable the models to assume distinct roles (some cases shown in Table 21). For problem-solving, we follow the prompts from existing works (Appendix F) to ensure a fair comparison.
  2. Concerning the use of temperature, we do not adjust the temperature during the generation process. This decision was made to ensure the reproducibility and stability of the results. The specifics of these settings are elaborated in Appendix B: Implementation Details of our paper.
  3. As for the heterogeneous settings in Debate, our choice to vary the judge's role for analysis is founded on two reasons. First, the judge in the debate process plays a pivotal role as a "Hub," taking on the critical task of information aggregation when discrepancies arise among other participants. Second, altering other models in different teams would create an overly complex array of combinations for analysis and introduce too much randomness, potentially reducing the reliability of the analysis. Therefore, we focus our analysis on the LLMs playing the judge role to derive the most direct and reliable conclusions.

Thanks again for all your suggestions to make our paper better!

Comment

We thank all the reviewers for their insightful comments. Below is a summary of our clarifications according to the collective opinions of all reviewers.

Novelty and Methodological Delta

Some reviewers have raised novelty concerns. We would like to point out, firstly, that using multi-model collaboration to solve reasoning problems is quite pioneering. Secondly, regarding the direct comparison of different components with existing methods, we list them as follows:

  1. Debate Mode (from Lg3j): While the method's name might suggest similarity to previous efforts, we've innovated beyond [1,2] in Debate Mode. By introducing a group discussion format, we balance factuality and diversity, as detailed in section 3.1. This design specifically addresses challenges like monopolization and incorrect consensus, moving beyond the limitations of prior models.
  2. Review Mode (from aQai, Lg3j): Corex's Review mode offers a task-agnostic approach that significantly diverges from [3]. It allows participants to integrate external insights and incrementally modify predecessors' work, applicable in both NL and code scenarios. We find it to be robust and effective. This approach not only outperforms strong baselines but also overcomes the limitations in self-correction methods [4,5,6], representing an advancement not seen in previous work.
  3. Retrieve Mode (from aQai): In this mode, we generate and score candidate answers based on the soundness of their reasoning, selecting the best one. While it may sound similar to RAG, the actual implementation of our approach is distinctively different: it emphasizes evaluating reasoning chains rather than relying on majority voting mechanisms, introducing a new direction in the field.

Experiments and Baselines

Experimental results

(from reviewer aQai, kCUS)

In addition to CoT-SC(10), section 5.3 of our paper presents comparative experiments of Corex with various configurations of CoT-SC (5, 10, 20, ... , 80) and ComplexCoT (10, .. , 40). Moreover, Appendix C.2, featuring Figures 11 and 12, includes further comparative experiments under equivalent configurations. These experiments demonstrate Corex's performance improvement when compared to strong baselines. Furthermore, they highlight Corex's cost-effectiveness, a topic discussed in both the analysis section and the appendix of our paper.

Baselines

(from reviewer kCUS)

Table 1 was designed to showcase Corex's features such as being task-agnostic and reference-free, and to emphasize LLMs' interactive capabilities. However, evaluating methods from Table 1 is very challenging. For example, PHP addresses mathematical tasks, and Chain-of-Knowledge is suited for commonsense reasoning and needs extra knowledge base configurations. For MAD and ToT, the lack of official prompts for diverse tasks impedes fair comparisons.

Consequently, we chose more universally applicable baselines like CoT-SC(10, 20, ...) and ComplexCoT(10, 20, ...) for our evaluation.

For other questions and detailed explanations, we have addressed them in the responses provided to each reviewer. We welcome further discussion and any additional suggestions you may have.

References

[1] Improving factuality and reasoning in language models through multiagent debate., arXiv:2305.14325

[2] Examining the inter-consistency of large language models: An in-depth analysis via debate

[3] Self-refine: Iterative refinement with self-feedback. arXiv:2303.17651.

[4] Large Language Models Cannot Self-Correct Reasoning Yet, arXiv: 2310.01798

[5] Can Large Language Models Really Improve by Self-critiquing Their Own Plans?, arXiv: 2310.08118

[6] GPT-4 Doesn't Know It's Wrong: An Analysis of Iterative Prompting for Reasoning Problems, arXiv: 2310.12397