BEATS: Optimizing LLM Mathematical Capabilities with BackVerify and Adaptive Disambiguate based Efficient Tree Search
Reviews and Discussion
The paper presents BEATS, a novel approach to improving the mathematical problem-solving abilities of large language models (LLMs). It introduces a method that combines enhanced prompting, tree search with pruning, and a back-verification technique. BEATS claims significant improvements, particularly with the Qwen2-7B model, outperforming benchmarks such as GPT-4 on the MATH dataset.
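For readers less familiar with this family of methods, the loop described in this summary can be sketched in a few lines. The sketch below is purely illustrative: the function name, prompt strings, and pruning/termination rules are our assumptions rather than the paper's actual implementation, and `llm` stands for any text-completion client.

```python
from typing import Callable, List, Optional

def beats_solve(problem: str,
                llm: Callable[[str], str],
                max_depth: int = 5,
                branch: int = 3) -> Optional[str]:
    """Illustrative BEATS-style loop: disambiguate the problem, expand a
    depth-limited reasoning tree, then back-verify candidate answers."""
    clarified = llm(f"Rewrite this problem so it is unambiguous:\n{problem}")
    frontier: List[List[str]] = [[]]      # partial chains of reasoning steps
    candidates: List[List[str]] = []
    for _ in range(max_depth):            # depth cap acts as one pruning rule
        next_frontier: List[List[str]] = []
        for path in frontier:
            for _ in range(branch):       # sample several next steps per node
                step = llm(f"Problem: {clarified}\n"
                           f"Steps so far: {' '.join(path)}\n"
                           "Give the next step, or state the final answer:")
                if "final answer" in step.lower():   # termination signal
                    candidates.append(path + [step])
                else:
                    next_frontier.append(path + [step])
        frontier = next_frontier
    # Back-verification: accept the first candidate the model confirms
    # when the answer is checked against the original problem.
    for path in candidates:
        verdict = llm(f"Problem: {problem}\nProposed solution: {' '.join(path)}\n"
                      "Is this solution correct for the problem? Answer yes or no:")
        if verdict.strip().lower().startswith("yes"):
            return path[-1]
    return None
```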
Strengths
- BEATS uses a unique tree search strategy with pruning and back-verification to streamline reasoning paths and verify answers, improving both accuracy and efficiency.
- Empirical results across multiple datasets (MATH, GSM8K, SVAMP, etc.) show notable improvements over existing methods.
- The inclusion of a question-disambiguation component helps clarify ambiguous problems, potentially reducing errors.
Weaknesses
- The disambiguation component, though effective, adds extra steps to the inference phase, potentially hurting efficiency in real-time applications.
- The paper would benefit from a more detailed discussion of the limitations of the proposed method and of potential directions for future work, such as the impact of training data on performance.
- Further discussion of how the pruning limits affect the accuracy-versus-computation trade-off would add valuable insight.
Questions
- How do the diversity and quality of the training data influence the performance of BEATS, particularly in edge cases or on complex problems?
This paper studies mathematical reasoning from the perspective of prompt engineering. The authors highlight issues of suboptimal prompts, high costs, and ineffective verification, and propose a tree-search-based prompt-engineering method. Experiments show that the proposed method outperforms existing methods by a clear margin.
Strengths
- The challenges identified by the authors are reasonable and can inspire future research. The proposed method combines techniques that successfully alleviate these problems.
- The experimental results are promising. The proposed method significantly improves the performance of each base model compared to the comparison methods.
- This paper is well-written and organized.
Weaknesses
- The novelty of this paper is somewhat limited. For example, back verification has already been proposed in [1], the heuristic pruning rules, e.g., Rule (3), are commonly used in math reasoning, and tree-based search methods [2] are not new either.
- The inference cost of each method should be reported. Since SFT and zero-shot methods usually require a single inference pass while the proposed method requires multiple samplings, the current comparison is unfair.
- The experimental results require deeper discussion. For example, the authors mention an issue with "ambiguous problem statements" and introduce a prompt engineering method to address it. However, there is insufficient explanation of how having the LLM rewrite the problem itself resolves this issue, and there is no comparison between the original and rewritten versions to demonstrate the effectiveness of the LLM. Additionally, if the LLM can rewrite the problem on its own, why can't it directly solve the problem?
[1] Large Language Models are Better Reasoners with Self-Verification. EMNLP (Findings) 2023: 2550-2575
[2] Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B: A Technical Report. arXiv preprint arXiv:2406.07394, 2024.
Questions
Please also refer to the weaknesses section.
- The overall framework is based on prompt engineering, which relies heavily on the capability of the underlying LLM. Can the proposed method deliver equally significant performance improvements on Olympiad-level math reasoning datasets, e.g., AIME?
This work investigates both prompt-based and search-based methods to enhance the mathematical reasoning abilities of large language models. The authors improve traditional search-based methods by pruning the search tree using carefully crafted prompts. A disambiguation prompt clarifies the original problem, while two additional prompts guide reasoning steps and determine search termination. Different pruning strategies are tailored to each type of prompt. The authors also introduce a self-correction mechanism called back-verification, where LLMs validate answer candidates by concatenating them with the original problem. The method’s effectiveness is evaluated across 5 math reasoning benchmarks.
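As a concrete reading of the back-verification step described above, here is a minimal sketch; the prompt wording and function signature are our assumptions, not the paper's.

```python
from typing import Callable

def back_verify(problem: str,
                candidate: str,
                llm: Callable[[str], str]) -> bool:
    """Back-verification as the summary describes it: concatenate the
    candidate answer with the original problem and ask the model whether
    the pair is consistent."""
    prompt = (f"{problem}\n"
              f"Claimed answer: {candidate}\n"
              "Take the claimed answer as given and check it against every "
              "condition stated in the problem. Reply 'consistent' only if "
              "all conditions hold, otherwise reply 'inconsistent'.")
    return llm(prompt).strip().lower().startswith("consistent")
```

A check of this form adds one model call per candidate answer, which is cheap relative to the tree expansion itself.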
Strengths
- The paper presents a novel approach that combines tree search with back-verification and adaptive disambiguation to enhance the mathematical reasoning capabilities of large language models (LLMs).
- Ablation studies are conducted to assess the impact of key components in the proposed method, focusing on the contributions of the disambiguation and back-verification modules.
- The pruning in the tree search effectively reduces the problem search space, improving computational efficiency.
Weaknesses
- The proposed approach lacks substantial novelty.
- The selection of baselines for comparison in search-based methods is not sufficiently justified. Zhang et al. [1] use MCTS with LLaMA3 8B (which is also used in this paper) to enhance mathematical reasoning in LLMs, achieving 96.66% accuracy on GSM8K and 58.24% on MATH, which is significantly higher than the results of this approach.
- Although an ablation study on the BackVerify component is included, comparisons with other verification methods are lacking. For instance, the ReST paper [2] evaluates the impact of different verifiers on performance, but similar evaluations are absent in this work.
- While pruning tree search is a key contribution of the paper, there is no experimental analysis of the extent to which the pruning strategy reduces search time. Additionally, comparing total inference time with other search-based methods is essential to substantiate the advantages of the pruning approach (a sketch of such instrumentation follows the references below).
References:
- [1] Zhang, D., Huang, X., Zhou, D., Li, Y., & Ouyang, W. (2024). Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B. arXiv preprint arXiv:2406.07394.
- [2] Zhang, D., Zhoubian, S., Hu, Z., Yue, Y., Dong, Y., & Tang, J. (2024). ReST-MCTS: LLM Self-Training via Process Reward Guided Tree Search. arXiv preprint arXiv:2406.03816.
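On the cost point raised in the last weakness, the missing numbers would be straightforward to collect. A minimal sketch, assuming the search routine accepts the model as a callable; the wrapper's name and interface are our assumptions:

```python
import time
from typing import Callable

class CountingLLM:
    """Wraps a model-call function and records how many calls were made
    and how long they took, so a search with pruning enabled can be
    compared against the same search with pruning disabled."""
    def __init__(self, llm: Callable[[str], str]):
        self.llm = llm
        self.calls = 0
        self.seconds = 0.0

    def __call__(self, prompt: str) -> str:
        start = time.perf_counter()
        out = self.llm(prompt)
        self.seconds += time.perf_counter() - start
        self.calls += 1
        return out
```

Reporting `calls` and `seconds` per problem, once with pruning enabled and once without, would directly substantiate or refute the claimed efficiency gains.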
Questions
- How do the authors verify that the disambiguation prompt effectively resolves ambiguous problem statements? Although the ablation study indicates that this prompt improves final performance, a more detailed analysis is needed. For instance, do all problems correctly solved without the disambiguation prompt remain correct when it is applied? (A sketch of such a check follows this list.)
- Which version of GPT-4 is used for evaluation? If the results are referenced from OpenAI papers or technical blogs, please provide the appropriate citations.
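The first question above could be settled with a direct flip-rate measurement. A sketch, where `solve(question, use_disambiguation)` and the dataset keys are assumed interfaces rather than anything from the paper:

```python
from typing import Callable, Dict, List

def disambiguation_flip_rate(problems: List[Dict[str, str]],
                             solve: Callable[[str, bool], str]) -> float:
    """Of the problems solved correctly WITHOUT the disambiguation prompt,
    return the fraction that become wrong WHEN it is enabled."""
    flipped = solved = 0
    for item in problems:                      # assumed keys: 'question', 'answer'
        base = solve(item["question"], False)  # disambiguation off
        if base.strip() == item["answer"].strip():
            solved += 1
            if solve(item["question"], True).strip() != item["answer"].strip():
                flipped += 1                   # a correct solve was broken
    return flipped / solved if solved else 0.0
```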
This paper presents BEATS, a framework that enhances mathematical problem-solving in language models by introducing targeted prompting strategies that guide the model through a step-by-step approach to decompose complex problems. Furthermore, BEATS incorporates a tree search mechanism, enabling exploration of each decision step individually, which helps refine solutions iteratively. The experiments demonstrate a significant performance increase on standard benchmarks.
Strengths
The proposed method showcases two key features—disambiguation and back-verification—that notably enhance the model's reasoning process, as confirmed by the ablation study. Disambiguation helps clarify problem statements at each reasoning step, reducing the likelihood of misinterpretation, while back-verification provides a robust mechanism to cross-check each solution against previous steps. Together, these techniques improve benchmark performance by a substantial margin.
Weaknesses
- The paper combines existing approaches, such as tree search and reflective reasoning techniques, but falls short of introducing transformative new methods. While effective, the design lacks substantial innovation in handling complex reasoning beyond prior approaches.
- A significant issue lies in the increased computational cost introduced by the extra steps, including disambiguation and back-verification. Although these steps improve accuracy, their contribution to computational overhead is not quantified, making it difficult to assess the overall efficiency.
- Despite mentioning computational challenges in the introduction, the paper lacks a thorough analysis of the actual cost implications. The pruning technique within the tree search is minimalistic, relying on basic conditions to halt expansion rather than addressing cost at a fundamental level.
- Some areas of the paper, particularly Section 2.3, contain formatting issues such as duplicated author names.
Questions
Could the authors provide more details on the computational trade-offs involved?
I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.