Guided Stream of Search: Learning to Better Search with Language Models via Optimal Path Guidance
Abstract
Reviews and Discussion
This paper introduces Guided Stream of Search (GSoS), which integrates optimal solutions into the self-generation process of LLMs to improve their search and planning capabilities. The main contribution is extending the existing Stream of Search (SoS) approach so that optimal solutions are incorporated into the self-generation process in a progressive manner. GSoS uses unsuccessful search trajectories as context when integrating each intermediate action from the optimal solution, producing high-quality search trajectories that are then used for SFT. GSoS is evaluated on a search benchmark and outperforms both SFT and RL baselines on seen as well as unseen targets.
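For concreteness, here is a minimal sketch of how such guided trajectories might be assembled; the helper functions are hypothetical placeholders used to illustrate the progressive-integration idea summarized above, not the authors' actual implementation.

```python
def search_until_failure(model, state):
    """Placeholder: let the model explore from `state` and return its
    unsuccessful search steps as a list of text lines."""
    return [f"Exploring from {state}: dead end"]

def apply_action(state, action):
    """Placeholder: apply one action from the optimal solution to the state."""
    return f"{state} | {action}"

def build_guided_trajectory(model, problem, optimal_actions):
    """Progressively splice each optimal action into the model's own (failed)
    search context, yielding one guided training trajectory for SFT."""
    trajectory = [f"Problem: {problem}"]
    state = problem
    for action in optimal_actions:
        # Keep the model's failed exploration as context for the next optimal step.
        trajectory.extend(search_until_failure(model, state))
        # Then take the next action from the optimal solution and record it.
        state = apply_action(state, action)
        trajectory.append(f"Take {action}, reaching {state}")
    return "\n".join(trajectory)

# Toy usage (no real model is needed for this sketch):
print(build_guided_trajectory(None, "numbers [2, 3, 4], target 24", ["3 * 4 = 12", "12 * 2 = 24"]))
```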
Strengths
[+] The overall presentation and structure are well-organized. The introduction, preliminary, and method sections are well-written. The threads are easy to follow.
[+] The results and analysis of the experiments are detailed and comprehensive. The authors provide extensive experimental results and analyze them in detail. In my opinion, the paper's empirical results and analysis are sound.
[+] All code and hyperparameters are open-sourced for reproducibility.
[+] I believe this method has potential applications for larger problems and more advanced models. Augmenting search and planning trajectories could be a crucial step in training models like o1.
Weaknesses
[-] The evaluation benchmark is not convincing to me. It appears that this benchmark can easily be formulated as a classical search problem, making the use of an LLM unnecessary. I think the authors should consider testing their framework on a more complex benchmark.
[-] It's doubtful that the unseen targets in Countdown can be considered a valid evaluation of generalization, given the high similarity between the supposedly different tasks in the dataset.
[-] The backbone model, GPT-2, is somewhat outdated. Additionally, I could not find an explanation for choosing GPT-2 over other models.
[-] On a minor note, I do not observe any planning capability (i.e., the ability to plan ahead before acting) from this method or within the benchmark, despite its repeated emphasis in the paper.
Questions
- How does the performance of GSoS compare with other state-of-the-art algorithms in terms of search and planning capabilities? Can leading search and planning algorithms be transferred to this benchmark and evaluated?
- What is the reason behind choosing GPT-2 as the backbone model? Is it possible to replicate the experiments with more advanced open-source models?
This work explores how to leverage optimal solutions to enhance the search and planning abilities of language models. The authors propose Guided Stream of Search (GSoS), which seamlessly incorporates optimal solutions into the self-generation process in a progressive manner, producing high-quality search trajectories for training. GSoS can significantly enhance the search and planning abilities of language models on Countdown, a simple yet challenging mathematical reasoning task.
Strengths
- The paper is well-written and easy to follow.
- The proposed method is simple and intuitive.
- Good experimental results and detailed analysis on the Countdown benchmark.
Weaknesses
- The experiments are conducted on only a single benchmark. There are many other datasets that require complex reasoning; at least one of them, such as LogiQA2, should be investigated.
- The authors use a 250M-parameter model for the experiments, which is quite small. For complex planning and reasoning, larger language models should be considered.
- How does the method compare to the following simple baseline? For a given query, sample plenty of trajectories from the model and construct a DAG from the sampled trajectories; then sample different types of search paths from the DAG for training. (A sketch of this baseline is given below.)
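To make the suggested baseline concrete, below is a minimal sketch of one possible implementation; the state representation, graph construction, and random-walk path sampling are illustrative assumptions, not taken from the paper.

```python
import random
from collections import defaultdict

def build_dag(trajectories):
    """Merge sampled trajectories (lists of hashable states) into a graph,
    stored as an adjacency map from each state to its successor states."""
    edges = defaultdict(set)
    for traj in trajectories:
        for src, dst in zip(traj, traj[1:]):
            edges[src].add(dst)
    return edges

def sample_path(edges, start, max_depth=32):
    """Sample one search path by random walk over the merged graph
    (one possible path-sampling scheme; BFS/DFS orderings would also work)."""
    path, node = [start], start
    for _ in range(max_depth):
        successors = list(edges.get(node, []))
        if not successors:
            break
        node = random.choice(successors)
        path.append(node)
    return path

# Toy usage with string states:
dag = build_dag([["root", "a", "b"], ["root", "a", "c"], ["root", "d"]])
print(sample_path(dag, "root"))
```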
Questions
See weaknesses
The authors introduce GSoS, a method to improve the planning and reasoning capabilities of language models by integrating optimal solutions within search processes. Unlike prior approaches that rely solely on self-generated, often suboptimal search trajectories, GSoS incorporates optimal solutions progressively, guiding the model toward more structured search trajectories. These trajectories are distilled through SFT, which, combined with subsequent RL training, enhances performance on the planning task Countdown.
Strengths
- The paper presents a simple and intuitive approach for improving planning tasks in LLMs by incorporating optimal solutions into the trajectory generation process, which enhances the quality of generated trajectories and overall training outcomes.
- The paper is clearly written, making the proposed method and experimental findings accessible.
Weaknesses
- A key baseline—using SFT with the optimal solutions (BC)—is missing. While the authors discuss BC's limitations on unseen tasks, including it in the evaluation would provide a more comprehensive comparison, especially since the main contribution of this approach is incorporating optimal solutions into the data construction process.
- The proposed approach is only validated on a single testbed, Countdown, which may leave readers questioning its generalizability to other planning tasks. Including an additional testbed, such as those from Beyond A* [1], would strengthen the paper's claims, particularly as this work builds on and seeks to improve upon SoS (Gandhi et al., 2024).
Minor Issue:
- Line 111: The purpose of transforming through a series of operations to obtain is unclear, as already contains both input and output states?
[1] Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping.
Questions
Please refer to weaknesses.
This paper proposes Guided Stream of Search (GSoS), a novel method that combines the optimal path and the search trajectories of a search scenario into a single sequence, which is used as a training instance for LLMs to acquire better planning and search performance. The authors conduct experiments on Countdown, a mathematical reasoning benchmark whose branching factor at each search step grows quadratically with the number of inputs. The experimental results demonstrate the effectiveness of GSoS, especially with RL operating at the operation level.
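To make the quadratic branching factor concrete, here is a small illustrative snippet that enumerates candidate moves at a single Countdown step under one plausible counting convention (unordered pairs of remaining numbers, four operations, integer division only when exact); this counting is an assumption for illustration, not the paper's exact formulation.

```python
from itertools import combinations

def candidate_moves(numbers):
    """Enumerate candidate results at one Countdown step: pick two remaining
    numbers and combine them with +, -, *, or / (exact division only)."""
    moves = []
    for a, b in combinations(numbers, 2):
        moves += [a + b, a * b, abs(a - b)]
        if b != 0 and a % b == 0:
            moves.append(a // b)
        elif a != 0 and b % a == 0:
            moves.append(b // a)
    return moves

# The number of candidate moves grows roughly quadratically with the input size:
for n in (4, 8, 16):
    print(n, len(candidate_moves(list(range(1, n + 1)))))
```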
Strengths
- This paper studies complex reasoning and planning of LLMs, which is an important topic in LLM research.
- The idea of integrating more exploratory trajectory segments into the context of the optimal subgoal makes sense, as it steers LLMs to learn to pivot to the optimal path.
- Setting up the RL training on the operation level effectively accelerates the learning process, which is supported by comparison experiments.
Weaknesses
Please refer to the Questions listed below.
Questions
- When conditioning the optimal path on a partial exploration path, is this equivalent to a self-reflection process (where the self-reflection succeeds after a single reflection trial)? If so, what is the novelty of GSoS over an RL reflection-tuning method, or RL fine-tuning with Chain-of-Hindsight [1]?
- Furthermore, have the authors tried compiling the trajectories by exploring more than one non-subgoal node before each subgoal node, and ablating the effect against trajectories containing only one non-subgoal node ahead of each corresponding subgoal node?
[1] Liu et al., Chain of Hindsight Aligns Language Models with Feedback. ICLR 2024.
- The effectiveness of GSoS is only demonstrated on one benchmark. The proposed method should be benchmarked on more scenarios to demonstrate its superiority.
- In Lines 192-194, it is claimed that "Fine-tuning on these trajectories may lead to significant changes in the model’s weights, potentially degrading its search and planning abilities. Therefore, it is crucial to explore methods for effectively integrating optimal solutions to produce trajectories that maintain both high likelihood and quality." It would be beneficial if the authors provided more experimental support for why direct fine-tuning degrades the search and planning abilities. Specifically, if this is supported by the main experiments where GSoS outperforms SoS, additional qualitative analysis and case studies are needed for a direct comparison between GSoS and SoS, and it would be helpful to provide cases where GSoS + fine-tuning succeeds while SoS + fine-tuning fails.
- In Lines 306-307, it is stated that this holds "even when multi-step returns with GAE are used for training the value function." It would be beneficial if the authors could show the experiments that verify this claim. (The standard GAE estimator is restated after these questions for reference.)
- In Line 5 of Algorithm 2: what is M(y|x)?
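For reference when checking the GAE-related claim above, the standard generalized advantage estimator (Schulman et al., 2016) is restated below with generic notation; none of these symbols are taken from the paper.

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad \hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\,\delta_{t+l}, \qquad \hat{R}_t = \hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} + V(s_t),$$

where $\hat{R}_t$ is the multi-step return typically used as the value-function target.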