An Examination on the Effectiveness of Divide-and-Conquer Prompting in Large Language Models
Abstract
Reviews and Discussion
The paper presents a theoretical and empirical analysis of the divide-and-conquer (DaC) prompting strategy in the context of large language models (LLMs). The authors argue that DaC can improve the performance of LLMs on specific tasks, particularly those involving repetitive sub-tasks or deceptive content. The paper provides a theoretical framework to identify tasks where DaC prompting can be advantageous and supports these claims with experimental results from two case studies: large integer arithmetic and fact verification.
Strengths
- The theoretical claims are backed by empirical evidence from two distinct case studies, providing a comprehensive view of DaC's utility in both arithmetic and natural language processing tasks.
- The study addresses a significant challenge in applying LLMs to tasks that require long solution paths or involve complex and deceptive content. The findings have practical implications for prompt engineering in LLMs.
Weaknesses
- While the paper makes a strong case for DaC in specific tasks, it does not explore its applicability in a broader range of tasks or domains, which might limit the generalizability of the findings.
- At present, it is difficult to claim that the idea of the divide-and-conquer prompting strategy is novel, as there are many prompting strategies for LLM in-context learning. In addition, the paper does not compare with recent studies such as Graph of Thoughts and Program of Thoughts.
- The theoretical analysis, while comprehensive, may be challenging for readers without a strong background in computational complexity theory, potentially reducing the accessibility of the paper.
Questions
- How does DaC prompting compare with the latest prompting techniques or methods in terms of performance and computational efficiency?
- Could the authors provide further insights into the robustness of DaC prompting and discuss potential sources of error or failure in the context of their case studies?
Dear Reviewer,
We appreciate your detailed feedback on our paper, "An Examination on the Effectiveness of Divide-and-Conquer Prompting in Large Language Models." Below, we address your concerns and provide clarifications:
Generalizability of Findings
Reviewer Concern: "The paper does not explore DaC's applicability to a broader range of tasks or domains, limiting the generalizability of findings."
Response: We acknowledge the importance of exploring broader applicability. However, our study intentionally focuses on two representative and challenging domains (arithmetic and fact verification) to ensure the theoretical and empirical analysis is both rigorous and interpretable. While expanding the scope is valuable, the current scope aligns with our research questions and highlights tasks where DaC excels. We explicitly outline the conditions under which DaC is beneficial in Section 4.3, laying a foundation for future studies to expand this work.
Novelty of DaC Prompting
Reviewer Concern: "It is difficult to claim that the idea of divide-and-conquer prompting is novel, as many prompting strategies exist. The paper does not compare with recent studies, such as Graph of Thoughts and Program of Thoughts."
Response: Our contribution is not limited to proposing DaC as a strategy but rather providing a theoretical framework that formalizes its utility and limitations. This framework, to the best of our knowledge, is the first to rigorously characterize DaC's expressive power relative to other strategies like Chain-of-Thought and Least-to-Most. Additionally, we emphasize the efficiency and robustness of DaC in tasks with parallel sub-tasks and intermediate errors. We note that comparisons with Graph of Thoughts and Program of Thoughts are indeed valuable and will be added to the discussion in future iterations.
Accessibility of Theoretical Analysis
Reviewer Concern: "The theoretical analysis may be challenging for readers without a strong background in computational complexity theory."
Response: We strive to balance technical depth and accessibility. To assist readers with less background, we provide intuitive explanations alongside formal proofs (e.g., in Sections 4.1 and 4.2). Furthermore, we include a summary table (Table 1) and diagrams (Figure 2) to visually clarify key insights. In future revisions, we are committed to enhancing accessibility by adding simplified summaries in an appendix.
Comparison with Latest Prompting Techniques
Reviewer Question: "How does DaC compare with the latest prompting techniques or methods in terms of performance and computational efficiency?"
Response: Thank you for raising this important question. As detailed in Tables 2 and 3, DaC outperforms other prompting strategies like Chain-of-Thought (CoT) and Least-to-Most (LtM) in specific scenarios, particularly for tasks with parallel sub-tasks and susceptibility to intermediate errors.
We acknowledge that recent methods such as Graph of Thoughts (GoT) are promising in other contexts. However, GoT is not applicable to the tasks discussed in our paper. GoT is designed for tasks requiring exploratory or search-based reasoning (e.g., planning or optimization), where multiple reasoning paths are evaluated and combined. In contrast, the tasks we address—such as large integer arithmetic and fact verification—are deterministic and involve decomposable sub-tasks that do not require such search-based exploration.
Robustness and Failure Sources
Reviewer Question: "Could the authors provide further insights into the robustness of DaC prompting and discuss potential sources of error or failure in the context of their case studies?"
Response: Robustness is a key advantage of DaC prompting, as demonstrated by our experimental results. In the fact verification task, DaC achieves significantly higher recall than other strategies such as CoT and LtM (Table 3), ensuring it effectively identifies factual inconsistencies. Similarly, in the hallucination detection task (Table 2), DaC consistently outperforms baselines by reducing intermediate errors, which are more prevalent in strategies relying on sequential reasoning.
The robustness of DaC stems from its explicit decomposition of tasks into independent sub-tasks, which mitigates error propagation—a key challenge for sequential approaches like CoT. Additionally, DaC’s reduced decoding context window size (Proposition 4.4) helps limit the accumulation of intermediate hallucinations or inaccuracies.
Our results further highlight that DaC excels in tasks involving parallel sub-tasks (e.g., large integer multiplication), while it is less impactful for tasks without such characteristics (e.g., integer addition). This aligns with our theoretical analysis and confirms the robustness of DaC within its applicable scope.
Thank you for your response. However, I stand by my assessment, as the concerns I raised remain valid.
The paper examines the utility of the divide-and-conquer (DaC) prompting strategy and summarizes two conditions under which DaC leads to performance gains when a task satisfies both of them. The paper also lists three applicable tasks and three inapplicable tasks, and conducts validation experiments on two of them: large integer arithmetic and fact verification.
Strengths
- The paper theoretically analyzes DaC and proposes two conditions that a task needs to satisfy for DaC to bring a performance improvement.
- The paper selects two tasks (large integer arithmetic and fact verification) for experiments that validate the theoretical analysis.
Weaknesses
- The concept in Condition 2 is too general. What does it mean to be subject to hallucinations or intermediate errors? Most tasks arguably suffer from this problem; can the definition be made more specific?
- Among the applicable and inapplicable tasks, the granularity of the six tasks is completely different. The paper conducts experiments on two tasks, large integer arithmetic and fact verification, and Integer Multiplication and Integer Addition both belong to large integer arithmetic. However, the remaining four tasks are not proposed at the same granularity, and the reasons for selecting the six tasks should be described in detail. Besides the three applicable tasks, what other tasks are applicable?
- How is answer extraction performed on the LLM outputs to compute the experimental statistics? Can the 100% scores on some metrics in Table 3 be analyzed in detail?
- DaC generally performs better on Integer Addition as model capacity increases (GPT-3.5 -> GPT-4). At the same time, on the HaluEval dataset, the baselines differ greatly in performance between the two LLMs, while DaC performs similarly on GPT-4 as on GPT-3.5. The influence of model scale on DaC needs further exploration.
Questions
- The writing and presentation of the paper need further improvement. There are some missing symbols and wrong words in line 33, line 197, line 343, line 436, etc.
- The publication venues in the references need to be corrected, e.g., Zhou et al., 2022.
Dear Reviewer,
We appreciate your thoughtful feedback on our paper and the constructive suggestions. Below, we address your comments and provide clarifications to demonstrate the robustness and contributions of our work.
Clarification on Condition 2
Reviewer Concern: "The concept in condition 2 is too general. What does it mean to be subject to hallucinations or intermediate errors, and I think most tasks suffer from this problem. Can this definition be more specific?"
Response: Thank you for pointing this out. Condition 2 specifically refers to tasks where intermediate errors accumulate during sequential reasoning processes, as commonly observed in Chain-of-Thought prompting. While hallucinations or intermediate errors are general issues, their impact is significantly amplified in tasks involving long reasoning chains or compositional sub-tasks. We will update the manuscript to clearly define the characteristics of tasks prone to these issues and provide further examples.
Granularity of Applicable and Inapplicable Tasks
Reviewer Concern: "The granularity of the 6 tasks is completely different. Integer Multiplication and Integer Addition belong to large integer arithmetic, but the remaining four tasks are not at the same granularity. The reasons for selecting these six tasks should be described in detail."
Response: Thank you for pointing out the differences in task granularity. This variation is intentional and designed to illustrate the breadth and applicability of our theoretical framework. Specifically:
Granularity Differences: Tasks like Integer Multiplication and Integer Addition are subcategories of arithmetic operations, chosen to demonstrate the theoretical distinctions between tasks that satisfy the proposed conditions and those that do not. In contrast, broader tasks like Fact Verification or Multi-round QA represent different problem domains to clarify the boundaries of DaC's effectiveness.
Purpose of Granularity Variation: By including tasks of varying granularity, we aim to:
- Showcase how the theoretical framework generalizes across domains.
- Provide actionable guidance for practitioners on the types of tasks where DaC is expected to excel (e.g., those with parallel sub-tasks or error-prone intermediate steps).
Task Selection Rationale: The six tasks were chosen to validate both the applicability and limitations of DaC:
- Applicable Tasks: Integer Multiplication, Fact Verification, and Hallucination Detection represent tasks that align well with DaC's properties and satisfy both proposed conditions.
- Inapplicable Tasks: Integer Addition, Multi-round QA, and Planning highlight cases where DaC is less effective due to the absence of parallel sub-tasks or limited task complexity.
Answer Extraction and Metrics Analysis
Reviewer Concern: "In this paper, how is the answer extraction of the LLMs to make experimental statistics? Can the 100% on some metrics in Table 3 be analyzed in detail?"
Response: Thank you for this question. The answer extraction process in our experiments involves a two-step approach:
- Answer Generation: LLMs first generate outputs for sub-tasks, which are then merged to form the final output (e.g., a factual consistency decision or arithmetic result).
- Binary Decision Extraction: To simplify evaluation, we prompt the LLM to provide a binary decision (Yes or No) based on the final output, ensuring consistent and comparable results across all methods.
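For concreteness, the following is a minimal sketch of this two-step extraction; the `call_llm` helper and the prompt wording are illustrative assumptions, not the exact implementation used in the paper:

```python
# Hypothetical sketch of the two-step extraction; call_llm is a placeholder
# for the actual LLM API call (e.g., GPT-3.5/GPT-4) and is an assumption.
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # replace with a real LLM API call

def extract_binary_decision(task_input: str) -> bool:
    # Step 1: answer generation -- the model produces a free-form final answer
    # (for DaC, this is the merged result of the sub-task outputs).
    final_answer = call_llm(f"{task_input}\nGive your final answer.")

    # Step 2: binary decision extraction -- ask for an explicit Yes/No so that
    # all prompting methods are scored on the same, comparable label.
    decision = call_llm(
        "Based on the answer below, reply with exactly 'Yes' or 'No'.\n"
        f"Answer: {final_answer}"
    )
    return decision.strip().lower().startswith("yes")
```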
Regarding the 100% precision and recall observed in Table 3: This phenomenon reflects the tendency of certain baseline methods (e.g., CoT-SC and LtM) to overwhelmingly favor one class (e.g., Yes or No) in their outputs, irrespective of the actual content. While this leads to perfect scores for one metric (e.g., precision or recall), it also results in poor overall performance due to imbalanced predictions.
In contrast, DaC explicitly verifies sub-tasks independently, producing balanced outputs and consistently high performance across metrics. This characteristic is particularly evident in its superior F1 scores, as shown in Table 3, which better capture the overall effectiveness of the method.
Impact of GPT versions on DaC
Reviewer Concern: "DaC generally performs better in Integer Addition as the model capacity increases (GPT-3.5->GPT-4). At the same time, on the HaluEval dataset, baselines have a large difference in the performance of the two LLMs, while the DaC method has the similar performance as GPT3.5 in GPT-4. "
Response: Thank you for highlighting this point. The observed differences in performance across tasks and models stem from the nature of the datasets used in our experiments:
HaluEval Dataset: As noted, HaluEval is constructed from hallucinated content generated by GPT-3.5. Consequently, its inherent alignment with GPT-3.5’s output style naturally results in performance differences when applied to GPT-4. This is because GPT-4 interprets and processes the same hallucinated inputs differently, reflecting model-specific behaviors and improvements.
Integer Arithmetic Dataset: In contrast, the dataset for integer arithmetic tasks (e.g., multiplication and addition) is randomly generated, ensuring task uniformity across models. As a result, the performance difference between GPT-3.5 and GPT-4 is relatively minor for DaC, as the task characteristics remain consistent and independent of the LLM’s prior biases.
This highlights an important property of DaC: its robustness to dataset variability. While hallucination-based datasets like HaluEval may expose model-specific tendencies, DaC maintains consistent performance on structured tasks with deterministic properties, as shown in the integer arithmetic results.
Writing and Presentation Improvements
Reviewer Concern: "The writing and presentation of the paper need further improvement. There are some missing symbols and wrong words in line 33, line 197, line 343, line 436, etc. The publications of the references need to be corrected."
Response: We appreciate the detailed feedback on presentation issues. We will thoroughly proofread the manuscript to correct missing symbols, typos, and unclear phrasing. Additionally, we will ensure that all references, such as Zhou et al., 2022, are properly formatted and complete.
Dear authors,
Thanks for your response. Some questions and responses are still confusing to me. Specifically:
- Clarification on Condition 2
Please make specific corrections; it is still unclear how the authors will define this more precisely.
- Granularity of Applicable and Inapplicable Tasks
From the authors' response, the justification for these six task settings is still unclear. Arithmetic also includes multiplication, division, remainder, etc., and QA covers more than multi-round QA; what kind of QA task is suitable for DaC? Do the authors intend to claim that DaC suits a diverse range of tasks, or only specific ones?
- Answer Extraction and Metrics Analysis
It is not clear whether the experimental settings of the baselines are fair. For CoT-SC and ToT, the detailed prompt settings should be provided in the appendix.
- Granularity of Applicable and Inapplicable Tasks
I think the authors should investigate relevant hallucination detection benchmarks related to GPT-4 to enhance the credibility of the paper.
How Tree of Thought (ToT) Is Used
Tree of Thought (ToT):
Approach: ToT uses a structured, tree-like reasoning process where the model evaluates multiple candidate solutions through branching and explanation, and then rates them to select the best answer.
Implementation in Our Work:
- Providing Options: For each task, we present the model with N possible answers or solutions (e.g., "The document and summary are consistent" or "The document and summary are not consistent").
- Generating Explanations: The model is prompted to provide a detailed explanation for each candidate answer, offering insights into its reasoning process.
- Evaluating Candidates: Using the explanations generated, the model is further prompted to rate each candidate answer based on factors such as logical consistency, alignment with evidence, or correctness.
- Selecting the Final Answer: The answer with the highest rating is selected as the final output.
This approach allows ToT to explore and evaluate multiple reasoning paths, ensuring that the chosen solution is well-supported by explanations and aligned with the evidence.
Options Prompt:
Here are several possible answers regarding the consistency between the document and summary:
- The document and summary are consistent.
- The document and summary are not consistent.
Please provide detailed explanations for each answer.
Rating Prompt:
Based on the explanations provided, rate each candidate answer from 1 to 5, where 5 represents the most consistent and logical answer. Explain your ratings.
Final Selection: The answer with the highest rating is chosen as the output.
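A minimal sketch of this options-explain-rate-select procedure is given below; the `call_llm` helper and the exact prompt wording are illustrative assumptions rather than the verbatim prompts used in our experiments:

```python
# Hypothetical sketch of the ToT-style procedure described above; call_llm is
# a placeholder assumption for the underlying LLM API.
import re

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # replace with a real LLM API call

def tot_consistency_check(document: str, summary: str) -> str:
    candidates = [
        "The document and summary are consistent.",
        "The document and summary are not consistent.",
    ]
    # 1) Generating explanations: ask for a detailed justification of each
    #    candidate answer.
    explanations = [
        call_llm(
            f"Document: {document}\nSummary: {summary}\n"
            f"Candidate answer: {cand}\n"
            "Provide a detailed explanation of why this answer could be correct."
        )
        for cand in candidates
    ]
    # 2) Evaluating candidates: rate each candidate from 1 to 5 based on its
    #    explanation (logical consistency, alignment with evidence).
    ratings = []
    for cand, expl in zip(candidates, explanations):
        reply = call_llm(
            f"Candidate answer: {cand}\nExplanation: {expl}\n"
            "Rate this answer from 1 to 5, where 5 is the most consistent and "
            "logical. Reply with a single number."
        )
        match = re.search(r"[1-5]", reply)
        ratings.append(int(match.group()) if match else 1)
    # 3) Selecting the final answer: return the highest-rated candidate.
    return candidates[ratings.index(max(ratings))]
```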
Rebuttal on Hallucination Detection Benchmarks
Reviewer Concern: "I think the authors should investigate relevant hallucination detection benchmarks related to GPT-4 to enhance the credibility of the paper."
Response: Thank you for this valuable suggestion. We acknowledge the importance of benchmarking to validate experimental results. However, we would like to clarify that the primary goal of this paper is not to benchmark hallucination detection capabilities across LLMs, but rather to explore under what conditions Divide-and-Conquer (DaC) prompting is effective or less advantageous.
Focus of Our Study:
Our work is centered on analyzing DaC’s theoretical advantages and demonstrating its practical effectiveness in tasks with specific characteristics (e.g., parallelizable subtasks, susceptibility to intermediate errors). While hallucination detection is used as one of our case studies, our goal is to showcase how DaC can influence detection performance rather than compare LLMs on hallucination benchmarks.
Existing Benchmarks:
Many existing hallucination detection benchmarks for GPT-4 and other LLMs focus on measuring the hallucination rate of LLMs on diverse datasets. These studies are typically aimed at evaluating the hallucination tendencies of different models, rather than the accuracy of hallucination detection itself. As a result, these benchmarks do not directly align with our goal of evaluating DaC’s role in improving detection accuracy.
Relevant Work on DaC and Credibility:
There is related work (Cui et al., 2024) demonstrating how DaC can enhance the credibility of generated content. For example, some studies use DaC to structure reasoning or refine generated outputs, which helps mitigate hallucinations. These findings indirectly support our use of DaC for hallucination detection tasks, as shown in our experiments. We will include a discussion of these studies in the related work section to strengthen the connection between DaC and hallucination detection benchmarks.
Future Work:
We agree that integrating hallucination detection benchmarks could further enhance the credibility of our results. While it is outside the scope of this paper, we plan to explore how DaC performs on standardized hallucination benchmarks in future studies, with a focus on its ability to improve detection accuracy across diverse datasets.
Cui, Wendi, et al. 2024. "Divide-Conquer-Reasoning for Consistency Evaluation and Automatic Improvement of Large Language Models." Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track.
Thanks for your explanation. Based on the authors' answers, the experimental settings and task selection of the paper still need to be improved, and the main paper was not revised during the rebuttal process to correct errors and improve the presentation. So I maintain my score.
Rebuttal on Clarification on Condition 2
To better illustrate Condition 2, we split it into two sub-conditions:
Condition 2.1 (Parallel Subtasks):
The task must involve subtasks that can be processed independently and in parallel, without requiring intermediate results from other subtasks to proceed. Specifically, a task satisfies this condition if:
- Subtasks are Independent: Each subtask can be solved without relying on the intermediate outputs or reasoning paths of other subtasks.
- Subtasks are Homogeneous or Structurally Similar: The subtasks share a common structure or computational requirements, enabling consistent treatment by the model.
Condition 2.2 (Susceptibility to Intermediate Errors):
The task must be prone to intermediate errors that arise during sequential reasoning. These errors typically originate from subtask solution errors: mistakes made during the resolution of individual sub-tasks, which then propagate through the reasoning chain and affect the final solution.
Rebuttal on Granularity of Tasks
Reviewer Concern: "The granularity of the 6 tasks is completely different. Integer Multiplication and Integer Addition belong to large integer arithmetic, but the remaining four tasks are not at the same granularity."
Response: Thank you for this observation. We understand the concern regarding granularity differences among the selected tasks. However, our primary consideration when choosing tasks was not the granularity but rather the scenario characteristics that align with our theoretical framework. Specifically:
Scenario-based Selection:
Our goal was to evaluate Divide-and-Conquer (DaC) prompting across tasks that represent distinct types of dependencies between subtasks:
- Multi-round QA and Planning are tasks where subtasks are highly dependent on each other, requiring sequential reasoning. These tasks are included to demonstrate scenarios where DaC is less effective.
- Verification and Consistency Evaluation are tasks where subtasks are generally independent, making them ideal for DaC. These tasks inherently align better with our theoretical conditions, as subtask independence is a core requirement for effective parallelism.
Granularity as a Secondary Factor:
While it may appear that tasks like Verification or Consistency Evaluation are finer-grained compared to Planning, this is not the driving factor behind their selection. Instead, these tasks naturally exhibit fewer dependencies between subtasks, making them suitable for evaluating DaC’s strengths. The difference in granularity is a byproduct of selecting tasks that align with or challenge DaC's theoretical conditions, not the primary criterion.
Implications for Subtask Dependencies:
Tasks like Verification and Consistency Evaluation typically involve subtasks that are independently verifiable (e.g., verifying sub-claims or evaluating individual data points), reducing the likelihood of interdependencies. In contrast, Planning or Multi-round QA often require iterative steps where each subtask depends on the outcome of previous steps, violating the independence condition critical for DaC's success.
How CoT-SC Is Used
Chain-of-Thought with Self-Consistency (CoT-SC):
Approach: CoT-SC extends the standard Chain-of-Thought prompting by introducing self-consistency through sampling multiple reasoning paths and aggregating the results. This method leverages the diversity of LLM outputs to improve reliability.
Implementation in Our Work:
- Sampling Multiple Answers: We use the standard CoT prompting method to sample N candidate answers for a given task. Each answer represents an independent reasoning chain produced by the model.
- Consensus-Based Selection: To determine the final answer, we prompt the LLM with the following instruction, which asks it to choose the most consistent answer based on all sampled results:
experts have tried to judge whether there is any contradiction between the following document and the summary: <document and summary prompt> Based on their results, please answer me: Is there any contradiction between the document and the summary? Here are their results: <list of sampled answers>
The model then selects the answer that reflects the consensus across the sampled outputs. This approach is particularly useful for tasks like fact verification or consistency evaluation, where aggregating multiple independent judgments can improve robustness and reliability.
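Below is a minimal sketch of this sample-then-aggregate procedure; the `call_llm` helper, the temperature setting, and the prompt wording (which mirrors but does not reproduce the prompt quoted above) are illustrative assumptions:

```python
# Hypothetical sketch of CoT-SC as described above; call_llm is a placeholder
# assumption, sampled at a non-zero temperature to obtain diverse chains.
def call_llm(prompt: str, temperature: float = 0.7) -> str:
    raise NotImplementedError  # replace with a real LLM API call

def cot_sc_contradiction_check(document: str, summary: str, n: int = 5) -> str:
    cot_prompt = (
        f"Document: {document}\nSummary: {summary}\n"
        "Is there any contradiction between the document and the summary? "
        "Let's think step by step."
    )
    # Sample n independent reasoning chains (the self-consistency step).
    sampled_answers = [call_llm(cot_prompt) for _ in range(n)]
    # Aggregate: ask the model for the consensus answer over the sampled results.
    consensus_prompt = (
        "Experts have tried to judge whether there is any contradiction between "
        f"the following document and the summary:\nDocument: {document}\n"
        f"Summary: {summary}\nBased on their results, please answer: Is there "
        "any contradiction between the document and the summary? "
        "Here are their results:\n" + "\n".join(sampled_answers)
    )
    return call_llm(consensus_prompt, temperature=0.0)
```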
This paper analyzes the utility of the divide-and-conquer prompting strategy and asks on which kinds of tasks this strategy has advantages. The paper provides a theoretical analysis of the divide-and-conquer prompting strategy to identify the specific tasks where DaC prompting can bring a performance boost with a theoretical guarantee. The paper presents two case studies where the experimental results align with the theoretical analysis.
Strengths
- The paper mainly focuses on the theoretical view of a prompting strategy and tries to find experimental results that align with the theoretical results, which should be encouraged.
Weaknesses
- The presentation of the paper should be improved. For example, what is the main difference between ToT and DaC? From the description in the paper, it looks like DaC prompts LLMs to perform subtask decomposition first and then solve each subtask, while ToT does not distinguish between subtasks and general intermediate reasoning steps and just performs a tree-based reasoning process. Also, from Figure 2, DaC seems to solve tasks with a homogeneous decomposition; is there any situation in which the depth of each subtask is not the same?
- The proof part is also a little bit confusing to me. For example, Theorem 4.1 seems trivial: all IO prompts can be seen as special cases of DaC prompts, which means IO prompts are a subset of DaC prompts; then the set of problems IO prompts can solve is also a subset of the problems DaC prompts can solve, so we must have S(IO) ⊆ S(DaC).
Questions
Please refer to the weaknesses part.
Dear Reviewer,
We sincerely thank you for your thoughtful and constructive feedback on our work. We deeply appreciate your positive remarks about our theoretical focus and alignment of experiments with theoretical results. Below, we address your comments and aim to clarify the aspects you found confusing.
Difference Between ToT and DaC
Reviewer Concern: "From the description of the paper, it looks like DaC prompts LLMs to perform subtask decomposition first and then solve each subtask, while ToT does not distinguish between subtasks and general intermediate reasoning steps, just performing a tree-based reasoning process. From Figure 2, DaC seems to solve tasks with homogeneous decomposition. Is there any situation in which the depth of each subtask is not the same?"
Response: Thank you for highlighting this important distinction. Your understanding is correct: DaC explicitly separates task decomposition, subtask resolution, and solution merging, while ToT interweaves intermediate reasoning with search-based exploration. The key differences are:
- Task Characteristics: DaC excels in problems with homogeneous parallel subtasks (e.g., large integer arithmetic or fact verification), where subtasks are well-defined and largely independent.
- Heterogeneous Subtasks: DaC is also applicable when subtasks vary in complexity or "depth." For example, in hierarchical document summarization, some paragraphs may require deeper reasoning or multiple decomposition levels. In such cases, DaC can recursively decompose tasks until the subtasks are manageable.
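To make the contrast concrete, here is a minimal sketch of the decompose-solve-merge pattern that characterizes DaC, including optional recursive decomposition for sub-tasks of uneven depth; the `call_llm` helper and prompt wording are illustrative assumptions, not the prompts used in the paper:

```python
# Hypothetical sketch of DaC prompting: explicit decomposition, independent
# sub-task resolution, then merging. call_llm and all prompts are assumptions.
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # replace with a real LLM API call

def dac_solve(task: str, max_depth: int = 2) -> str:
    # Stage 1: decomposition -- split the task into independent sub-tasks.
    subtasks = call_llm(
        f"Decompose the following task into independent sub-tasks, one per line:\n{task}"
    ).splitlines()

    # Stage 2: resolution -- solve each sub-task separately; recurse when a
    # sub-task is still too complex (this handles sub-tasks of uneven depth).
    solutions = []
    for sub in (s.strip() for s in subtasks if s.strip()):
        too_complex = call_llm(
            f"Can this sub-task be solved directly in one step? Answer Yes or No.\n{sub}"
        ).strip().lower().startswith("no")
        if too_complex and max_depth > 0:
            solutions.append(dac_solve(sub, max_depth - 1))
        else:
            solutions.append(call_llm(f"Solve this sub-task:\n{sub}"))

    # Stage 3: merging -- combine the sub-task solutions into the final answer.
    return call_llm(
        f"Original task: {task}\nSub-task solutions:\n" + "\n".join(solutions)
        + "\nMerge these solutions into the final answer."
    )
```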
Theorem 4.1 and Its Implications
Reviewer Concern: "Theorem 4.1 seems trivial to me as all IO prompts can be seen as special cases of DaC prompts, so there must be S(IO) ⊆ S(DaC)."
Response: Thank you for raising this point. We agree that the conclusion that IO prompting is a subset of DaC prompting (i.e., S(IO) ⊆ S(DaC)) is indeed trivial and intuitive. However, the primary contribution of Theorem 4.1 lies not in this subset relationship but in proving the existence of a gap between IO and DaC in terms of expressive power (i.e., S(DaC) \ S(IO) is non-empty, hence S(IO) ⊊ S(DaC)), which is far from trivial.
- Non-Trivial Contribution: By leveraging computational complexity theory, we demonstrate that DaC can solve problems in NC1 (e.g., 2-BSI) that are provably outside the scope of IO prompting. This result establishes a clear and rigorously defined gap between the two paradigms, providing a theoretical guarantee of DaC's superiority in handling certain tasks.
- Significance of the Gap: This gap is critical for identifying tasks where DaC can outperform IO prompting, such as the multiplication and addition tasks in our discussion.
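One way to state the separation formally, using the S(·) notation from this discussion (a paraphrase of the claim above, not the paper's verbatim theorem statement):

```latex
% Paraphrase of the separation claim, where S(.) denotes the set of problems a
% fixed-depth log-precision Transformer can solve under a given prompting scheme.
\[
  S(\mathrm{IO}) \subseteq S(\mathrm{DaC}),
  \qquad
  \exists\, P \in \mathrm{NC}^1 \ (\text{e.g., 2-BSI}):\;
  P \in S(\mathrm{DaC}) \setminus S(\mathrm{IO})
  \;\Longrightarrow\;
  S(\mathrm{IO}) \subsetneq S(\mathrm{DaC}).
\]
```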
This paper investigates the effectiveness of Divide-and-Conquer (DaC) Prompting. The author first provides a theoretical framework that can analyze how the divide-and-conquer strategy expands the expressive power of a fixed-depth log-precision Transformer. In this context, the author presents some conditions under which DaC has advantages compared to other prompting strategies. The author empirically validates this theory on tasks such as Large Integer Multiplication, Hallucination Detection, and Article-level Fact Verification, which require very long reasoning paths or contain deceptive contents.
Strengths
The authors conducted a theoretical analysis of the Divide-and-Conquer prompting method, which is helpful for understanding the reasoning capabilities of large language models (LLMs) and is timely.
Providing the conditions under which DaC brings a performance boost has practical application value.
Weaknesses
In the task decomposition stage of Algorithm 1 and Algorithm 2, hallucinations or intermediate errors also seem to occur. Based on the current version of the paper, it is not clear whether the errors in CoT stem from the subtask solutions rather than the subtask decomposition. I suggest conducting further statistical analysis on the types of errors in CoT and DaC.
Considering that the applicability of DaC may not be as broad as CoT and that the reasoning process may require more tokens, the potential impact of this paper might not be as significant as Feng et al. 2023. Therefore, this work appears to be somewhat incremental.
Questions
How do the token usage and inference time of DaC compare to those of CoT?
Dear Reviewer,
Thank you very much for your thoughtful and constructive feedback. We greatly appreciate your recognition of the theoretical contributions of our work and its practical value in understanding and applying Divide-and-Conquer (DaC) prompting. Below, we address the specific weaknesses and questions you raised.
Error Analysis in CoT and DaC
Reviewer Concern: "In the task decomposition stage of Algorithm 1 and Algorithm 2, hallucinations or intermediate errors also seem to occur. It is not clear whether the errors in CoT stem from the subtask solutions rather than the subtask decomposition. I suggest conducting further statistical analysis on the types of errors in CoT and DaC."
Response: Thank you for this insightful suggestion. While our current analysis primarily focuses on overall performance metrics, we agree that a deeper statistical breakdown of errors would provide more clarity. Specifically:
In CoT, errors predominantly arise from the propagation of inaccuracies across intermediate steps, which is compounded by the sequential nature of reasoning. This often leads to hallucinated intermediate results impacting the final outcome. In DaC, errors in task decomposition are possible, but they are generally confined to the boundaries of sub-tasks and do not propagate across the entire reasoning process. This containment reduces the overall impact of individual errors.
Applicability and Token Usage
Reviewer Concern: "Considering that the applicability of DaC may not be as broad as CoT and that the reasoning process may require more tokens, the potential impact of this paper might not be as significant as Feng et al. 2023."
Response: We appreciate your perspective and agree that DaC has a narrower scope compared to CoT, as it is particularly suited to tasks with parallel sub-tasks or those prone to intermediate errors. However, we believe this narrower applicability is balanced by DaC’s robustness and efficiency in handling these specific tasks:
- Task Suitability: DaC is designed for scenarios where sub-tasks are well-defined and largely independent. For such tasks (e.g., large integer multiplication, fact verification), DaC reduces error propagation and achieves significant performance gains, as shown in Tables 2 and 3.
- Token Usage: While DaC may require additional tokens for task decomposition and merging, the overall token usage is comparable to CoT due to shorter context lengths during sub-task resolution. We will include a detailed comparison of token usage and inference time between DaC and CoT in our revisions to provide clarity on this point.
Token Usage and Inference Time
Reviewer Question: "How do the token usage and inference time of DaC compare to those of CoT?"
Response: In terms of token usage:
DaC requires additional tokens for sub-task decomposition and solution merging, but it reduces the average decoding context window size during sub-task resolution.
In terms of inference time:
DaC’s inference time is influenced by the parallelism of sub-task processing. When parallel execution is supported, DaC can achieve faster inference compared to CoT for tasks with large numbers of independent sub-tasks. However, for tasks requiring sequential reasoning, CoT remains more efficient.
This paper investigates the Divide-and-Conquer (DaC) prompting strategy for LLMs, combining theoretical and empirical analyses. It introduces a framework to identify tasks where DaC prompting enhances performance, particularly for tasks involving long reasoning paths, repetitive subtasks, or deceptive content. The findings are validated with experiments on large integer multiplication, hallucination detection, and fact verification, demonstrating DaC's advantages under specific conditions.
The reviewers mentioned several critical points about the paper, which need further investigation:
- The reviewers raised concerns about the generalizability of the findings. DaC is one of many prompting techniques, and the paper should also study and compare against those techniques.
- Robustness and failure analysis are missing and need more investigation. In the rebuttal, the authors attempted to address these concerns.
- The rationale behind selecting the six tasks needs to be presented properly, as their granularity seems different. The authors attempted to justify this as well; however, the response does not seem convincing.
- One reviewer also thinks that the contribution is incremental, given that Feng et al. 2023 already presented similar studies.
- The writing also needs to be improved; two reviewers pointed this out. The proofs presented in the paper are also confusing.
Additional Comments from Reviewer Discussion
Two reviewers participated in the discussion. The authors attempted to clarify the concerns regarding applicability, error analysis, theoretical contribution, task selection strategy, etc. However, it seems both reviewers decided to maintain their previous scores.
Reject