PaperHub
Average score: 7.3/10 (Poster · 4 reviewers; min 6, max 9, std 1.1)
Individual scores: 6, 7, 7, 9
Confidence: 4.0
COLM 2025

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

OpenReview · PDF
Submitted: 2025-03-20 · Updated: 2025-08-26
TL;DR

An RL framework to train LLMs for interleaved reasoning and retrieval

Abstract

Keywords
reasoning, retrieval, reinforcement learning

Reviews and Discussion

Review
6

In this paper, the authors propose a novel training pipeline that applies reinforcement learning directly to a RAG system, treating answer correctness as the outcome reward. It demonstrates some technical contributions and serves as one of the earliest works exploring RL with LLMs beyond mathematical reasoning. Yet, this work is highly inspired by DeepSeek-R1 and fails to carefully discuss the necessity of employing reinforcement learning without any RAG-like SFT.

Reasons to Accept

  1. Serving as one of the earliest works exploring RL with LLMs in retrieval-based QA, this paper demonstrates some technical contribution and provides useful experience to the community.
  2. The writing is easy to follow and the experimental results are good.

Reasons to Reject

I think the paper is highly influenced by DeepSeek-R1 and fails to address some more important aspects, including but not limited to:

  1. Is it really necessary to perform reinforcement learning directly, without any cold-start SFT? Note that the instruct-version model may also not have been trained on data that invokes a search engine. How do these two approaches perform differently?
  2. Following point 1, a baseline that uses either rejection sampling or knowledge distillation for trajectory collection, followed by SFT or DPO training, is missing.
Comment

We sincerely thank the reviewer for the constructive feedback, which has strengthened our work. Please find our point-by-point responses below:

  • Is it necessary to perform RL without any cold-start SFT data? We agree with the reviewer that incorporating an intermediate cold-start supervised fine-tuning (SFT) stage could potentially improve final performance. However, SFT relies on large-scale, high-quality annotated trajectories of search-and-reasoning interactions, which are costly and difficult to obtain—posing significant challenges to scalability. In this work, we show that such intermediate trajectories can instead be acquired automatically through outcome-only reinforcement learning. The resulting RL-trained model can then be used to generate synthetic data for future cold-start SFT. We leave the systematic exploration of this direction to future work.

  • Instruct-version model may also not be trained on data invoked with a search engine. How do these two fashions perform differently? You are right. The general instructed LLMs are typically not trained on interleaved reasoning and search engine invocation data, and their performance can indeed be improved with more targeted supervision. In this work, we demonstrate that both base and instructed LLMs can learn to perform interleaved reasoning and search behavior through outcome-driven reinforcement learning. Moreover, the resulting RL-trained models can be used to generate high-quality trajectories with explicit search engine usage, which can serve as synthetic data for future supervised fine-tuning.

  • Baseline which performs rejection sampling or knowledge distillation for trajectory collection is missing. In this work, our primary objective is to propose a reinforcement learning (RL)-based method for training large language model (LLM) agents capable of interleaving reasoning and search. While knowledge distillation from a larger teacher model is a possible direction, it introduces additional supervision signals that may lead to unfair comparisons. That said, we agree with the reviewer that rejection sampling offers a reasonable and relevant baseline for comparison. Accordingly, we have included this baseline using both Qwen2.5-3B-it and Qwen2.5-7B-it. Specifically, we generate five candidate responses per training prompt from the Search-R1 dataset and select those that lead to correct final answers. These selected trajectories are then used to construct a new training set that retains the same multi-turn LLM–search engine interaction rollout mechanism proposed in Search-R1. We refer to this variant as Search-R1 (Rejection Sampling). The updated results are presented below, showing that Search-R1 with RL consistently outperforms Search-R1 with rejection sampling across both model sizes.
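
For concreteness, here is a minimal sketch of the rejection-sampling construction just described (the result tables follow). The helpers `rollout_with_search` and `extract_answer` are hypothetical stand-ins for the multi-turn LLM–search rollout and answer extraction, not part of the released code.

```python
# Hedged sketch: build an SFT set via rejection sampling over multi-turn
# search rollouts. `rollout_with_search` and `extract_answer` are assumed
# helper functions, not the authors' implementation.

def exact_match(pred: str, gold: str) -> bool:
    normalize = lambda s: " ".join(s.lower().strip().split())
    return normalize(pred) == normalize(gold)

def build_rejection_sampling_set(examples, rollout_with_search, extract_answer,
                                 n_candidates=5):
    """Sample n_candidates trajectories per prompt and keep only those whose
    final answer is correct; the kept trajectories form the SFT training set."""
    kept = []
    for ex in examples:                                     # ex: {"prompt": ..., "gold": ...}
        for _ in range(n_candidates):                       # five candidates per prompt
            trajectory = rollout_with_search(ex["prompt"])  # full multi-turn text
            pred = extract_answer(trajectory)               # content of <answer>...</answer>
            if pred is not None and exact_match(pred, ex["gold"]):
                kept.append({"prompt": ex["prompt"], "target": trajectory})
    return kept
```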

Qwen2.5-7b-it

| Method | NQ | TriviaQA | PopQA | HotpotQA | 2wikimultihopQA | Musique | Bamboogle | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Direct | 0.1343 | 0.4075 | 0.1402 | 0.1831 | 0.2502 | 0.0314 | 0.1200 | 0.1810 |
| CoT | 0.0481 | 0.1851 | 0.0539 | 0.0915 | 0.1106 | 0.0215 | 0.2320 | 0.1061 |
| IRCoT | 0.2240 | 0.4775 | 0.3009 | 0.1331 | 0.1486 | 0.0715 | 0.2240 | 0.2257 |
| Search-o1 | 0.1507 | 0.4429 | 0.1307 | 0.1873 | 0.1757 | 0.0583 | 0.2960 | 0.2059 |
| RAG | 0.3490 | 0.5847 | 0.3924 | 0.2990 | 0.2348 | 0.0579 | 0.2080 | 0.3037 |
| SFT | 0.3183 | 0.3538 | 0.1208 | 0.2173 | 0.2586 | 0.0662 | 0.1120 | 0.2067 |
| RL w.o. search | 0.2700 | 0.5370 | 0.1990 | 0.2370 | 0.2920 | 0.0720 | 0.2930 | 0.2714 |
| Search-R1 (Rejection Sampling) | 0.3604 | 0.5922 | 0.3797 | 0.3310 | 0.2958 | 0.1233 | 0.3548 | 0.3482 |
| Search-R1 (RL) | 0.3925 | 0.6103 | 0.3965 | 0.3700 | 0.4142 | 0.1456 | 0.3680 | 0.3853 |

Qwen2.5-3b-it

| Method | NQ | TriviaQA | PopQA | HotpotQA | 2wikimultihopQA | Musique | Bamboogle | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Direct | 0.1058 | 0.2879 | 0.1075 | 0.1491 | 0.2442 | 0.0199 | 0.0240 | 0.1341 |
| CoT | 0.0227 | 0.0324 | 0.0045 | 0.0213 | 0.0208 | 0.0024 | 0.0000 | 0.0149 |
| IRCoT | 0.1110 | 0.3117 | 0.2002 | 0.1636 | 0.1713 | 0.0666 | 0.2400 | 0.1806 |
| Search-o1 | 0.2382 | 0.4723 | 0.2617 | 0.2211 | 0.2180 | 0.0538 | 0.3200 | 0.2550 |
| RAG | 0.3485 | 0.5441 | 0.3866 | 0.2551 | 0.2256 | 0.0472 | 0.0800 | 0.2696 |
| SFT | 0.2490 | 0.2923 | 0.1036 | 0.1857 | 0.2478 | 0.0443 | 0.1120 | 0.1764 |
| RL w.o. search | 0.2100 | 0.4490 | 0.1710 | 0.2080 | 0.2750 | 0.0600 | 0.1920 | 0.2236 |
| Search-R1 (Rejection Sampling) | 0.2942 | 0.4879 | 0.3324 | 0.2396 | 0.2327 | 0.0588 | 0.2097 | 0.2650 |
| Search-R1 (RL) | 0.3410 | 0.5451 | 0.3784 | 0.3244 | 0.3193 | 0.1027 | 0.2640 | 0.3250 |

Comment

Thanks to the authors for their efforts in clarification and for the additional results. I think the newly updated results are strong.

Comment

Thank you very much for your thoughtful feedback and for recognizing the strength of the newly updated results! We're glad to hear that our clarifications addressed your concerns effectively. We truly appreciate your consideration and would be grateful if you could consider reflecting this in your final score. We are happy to continue the discussion if you have any other questions!

Comment

Hi Reviewer cUvm,

As we approach the discussion deadline, we wanted to sincerely thank you again for your thoughtful feedback. We're grateful that you found the newly updated results strong, and we hope this addresses your concerns effectively.

If possible, we would deeply appreciate it if your more positive view could be reflected in your final score. Of course, we remain happy to discuss any remaining questions you might have before the deadline.

Thank you again for your time and consideration!

Review
7

The paper trains LLMs with reinforcement learning (RL) to interleave chain-of-thought reasoning and search. It prompts the model to use special tokens to enclose reasoning, search, and final answer sections. Search results are provided to the model as an additional section, and loss during the RL phase is masked on search results, as these cannot be influenced by the model. RL uses either proximal policy optimization (PPO) or group relative policy optimization (GRPO), with a simple exact-match reward function on the result. The setup is evaluated on 7 question-answering (QA) datasets with two Qwen-2.5 model variants (3B and 7B) in base and instruct versions. The results show clear improvements over RAG-only and reasoning-only baselines.
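
To make the summarized setup concrete, here is a minimal sketch of the interleaved rollout and the rule-based exact-match reward. It assumes a `generate_until` function that returns decoded text up to and including the stop tag it hit and a `search` function wrapping the retriever; both names, the prompt wording, and the turn budget are illustrative assumptions, not the paper's implementation.

```python
import re

# Hedged sketch of the interleaved reasoning-and-search rollout and the
# exact-match outcome reward. `generate_until` and `search` are assumed
# interfaces; tag names follow the paper's prompt format.

def rollout(question, generate_until, search, max_turns=4):
    context = f"Question: {question}\n"
    for _ in range(max_turns):
        segment = generate_until(context, stop=["</search>", "</answer>"])
        context += segment
        if "</answer>" in segment:                 # final answer produced, stop
            break
        query = re.search(r"<search>(.*?)</search>", segment, re.DOTALL)
        if query:                                  # retrieved tokens are loss-masked in RL
            docs = search(query.group(1).strip())
            context += f"<information>{docs}</information>\n"
    return context

def outcome_reward(rollout_text, gold_answer):
    """Rule-based reward: exact match on the content of <answer>...</answer>."""
    m = re.search(r"<answer>(.*?)</answer>", rollout_text, re.DOTALL)
    if m is None:
        return 0.0
    normalize = lambda s: " ".join(s.lower().strip().split())
    return 1.0 if normalize(m.group(1)) == normalize(gold_answer) else 0.0
```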

Reasons to Accept

  • Very reasonable overall setup with positive results
  • Experiments look mostly sound, with multiple strong baselines
  • Mostly clear writing

Reasons to Reject

  • The evaluation is relatively limited (two small variants of a single LLM, just the QA task)
  • The paper claims the training is stable, but doesn't actually show results for multiple runs
  • The training of the base vs. instruct variant seems a little inconsistent (base wins for 7B but instruct wins for 3B, the paper doesn't comment on this)
  • I personally don't like the name grab of R1; the model is not based on DeepSeek and the experiments cover only QA

(Most of these were cleared by the author response, hence my updated score, see comment below)

Questions to Authors

  • In Algorithm 1, why do you explicitly force the model to decode "my action is not correct" -- you can't know that at this point, right?
  • Any intuition on how this will work with larger models (30B, 70B) and models of a different family?

Minor comments:

  • Fig. 1 deserves a more detailed commentary and should be referenced from the text (I didn't find any reference)
  • The sentence "An illustration of the rollout proces..." (pg. 4) is unclear, did you mean to add "respectively" at the end?
  • The text in Sect. 3.3 starts by implying a single-turn process, whereas the prompt clearly shows it's multi-turn – please rephrase.
  • The sentence "For R1, we train..." on pg. 6 is unclear.
  • I know it's because you get better results with the 7B model, but it feels like base vs. instruct is not so prominent to warrant a space in Table 2. It's also confusing because we don't know which baseline is "base" and which is "instruct".
  • You say in Sect. 5.3 that the training reward decreases in the first 100 steps. The chart looks more like it stagnates or grows slightly.
  • What's a "valid search" in Fig. 2? Does it just mean the number of times the model produces <search></search> tokens, or do you check if the search gets any results?
  • When you refer to Fig. 3 in Sect. 5.4, you should mention it's in the appendix (and perhaps stress the link to Table 4 more).
Comment

We sincerely thank the reviewer for the constructive feedback, which has strengthened our work. Please find our point-by-point responses below:

  • The evaluation is only on two small variants of a single LLM and just QA task. Thank you for the valuable comments. In response, we have extended our evaluation to include additional LLMs—specifically, a 32B-scaled model and LLaMA-type models—as well as long-form generation tasks beyond QA. The results show that Search-R1 consistently achieves strong performance across different model architectures and a broader range of task types, demonstrating its generalizability and robustness.

Results on other LLMs

Qwen2.5-32B

| Method | NQ | TriviaQA | PopQA | HotpotQA | 2wikimultihopQA | Musique | Bamboogle | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Direct Inference | 0.2169 | 0.5576 | 0.1915 | 0.2361 | 0.2661 | 0.0517 | 0.1440 | 0.2377 |
| CoT | 0.2252 | 0.5584 | 0.1936 | 0.2592 | 0.2954 | 0.0806 | 0.5040 | 0.3023 |
| IRCoT | 0.3058 | 0.6074 | 0.3382 | 0.3616 | 0.4150 | 0.1812 | 0.5280 | 0.3910 |
| Search-o1 | 0.2202 | 0.5364 | 0.1615 | 0.1764 | 0.0366 | 0.0604 | 0.3920 | 0.2262 |
| RAG | 0.3742 | 0.6177 | 0.4089 | 0.3230 | 0.2446 | 0.0736 | 0.2240 | 0.3237 |
| SFT | 0.3668 | 0.5186 | 0.1682 | 0.2606 | 0.2728 | 0.0993 | 0.1520 | 0.2626 |
| Search-R1 | 0.4922 | 0.6686 | 0.4769 | 0.4524 | 0.4546 | 0.2305 | 0.5565 | 0.4760 |

Llama3.2-3B

| Method | NQ | TriviaQA | PopQA | HotpotQA | 2wikimultihopQA | Musique | Bamboogle | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Direct Inference | 0.1391 | 0.3682 | 0.1238 | 0.1218 | 0.1066 | 0.0153 | 0.0640 | 0.1341 |
| CoT | 0.2462 | 0.4866 | 0.1655 | 0.0510 | 0.0827 | 0.0057 | 0.0240 | 0.1517 |
| IRCoT | 0.3626 | 0.5655 | 0.4282 | 0.2376 | 0.2359 | 0.0719 | 0.2080 | 0.3014 |
| Search-o1 | 0.1075 | 0.2034 | 0.0929 | 0.1319 | 0.1168 | 0.0348 | 0.1760 | 0.1233 |
| RAG | 0.3172 | 0.5510 | 0.3371 | 0.2339 | 0.1179 | 0.0343 | 0.0640 | 0.2365 |
| SFT | 0.3197 | 0.3411 | 0.1220 | 0.2062 | 0.2571 | 0.0641 | 0.1200 | 0.2043 |
| Search-R1 | 0.3567 | 0.5776 | 0.3778 | 0.3143 | 0.2330 | 0.0902 | 0.3065 | 0.3223 |

Long-form generation task

| Method | Qwen2.5-3b (ASQA) | Qwen2.5-3b (ELI5) | Qwen2.5-3b (Avg) | Qwen2.5-7b (ASQA) | Qwen2.5-7b (ELI5) | Qwen2.5-7b (Avg) | Qwen2.5-14b (ASQA) | Qwen2.5-14b (ELI5) | Qwen2.5-14b (Avg) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| direct | 0.2513 | 0.1988 | 0.2250 | 0.3033 | 0.2012 | 0.2523 | 0.2889 | 0.1990 | 0.2439 |
| RAG | 0.3011 | 0.1927 | 0.2469 | 0.3170 | 0.2018 | 0.2594 | 0.2848 | 0.1932 | 0.2390 |
| R1 | 0.4244 | 0.2747 | 0.3495 | 0.4367 | 0.2795 | 0.3581 | 0.4442 | 0.2784 | 0.3613 |
| Search-R1 (PPO) | 0.4801 | 0.2607 | 0.3704 | 0.4709 | 0.2558 | 0.3633 | 0.4421 | 0.2601 | 0.3511 |
| Search-R1 (GRPO) | 0.4920 | 0.2716 | 0.3818 | 0.5043 | 0.2747 | 0.3895 | 0.5008 | 0.2729 | 0.3868 |

  • No results to show that training is stable. We repeated the experiment three times and observed that the training reward curves are highly consistent, exhibiting only minor variance. Additionally, the standard deviation of the final performance on the test set is only 0.008, indicating the stability and reproducibility of our training process.

  • Comparison between base and instruct. Our key observations are as follows: (1) Instruct-tuned LLMs exhibit stronger initial performance and converge more quickly during training; (2) Both base and instruct models ultimately reach similar training reward levels. However, due to the lack of transparency regarding the pretraining and supervised fine-tuning (SFT) data used for Qwen2.5 3B and 7B, it is difficult to make definitive conclusions about their final performance differences. If the SFT data includes reasoning or tool-calling demonstrations, this would naturally give instruct models an advantage in reinforcement learning, potentially leading to better final performance compared to base models. We will clarify this in the revised manuscript.

  • The name of R1. We will modify the name to “R1-style RL” according to your suggestion.

  • Algorithm 1 (force decoding). This issue typically occurs when the LLM fails to properly enclose a query or final answer within the expected special tokens. To mitigate this, we insert a prompt that subtly suggests the previous generation may be incorrect, encouraging the LLM to engage in self-reflection and revise its output accordingly (a minimal sketch of this retry handling follows this list).

  • Minor comments on Figure 1, page 4, Sect. 3.3, page 6, Table 2, and Sect. 5.4. We will revise them according to your suggestions.

  • What is “valid search”? A valid search is counted when the search engine is successfully called via the LLM-generated special tokens (a counting sketch follows this list).
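
A minimal sketch of the retry handling and the “valid search” counting referenced in the two bullets above. The retry message wording (beyond the “my action is not correct” fragment quoted from Algorithm 1) and the function names are illustrative assumptions.

```python
import re

# Hedged sketch: retry when the LLM closes neither <search> nor <answer>, and
# count "valid searches". `generate_until` is an assumed decoding helper that
# returns text including the stop tag it hit.

RETRY_HINT = ("\nMy action is not correct. I should enclose the search query in "
              "<search></search> or the final answer in <answer></answer>.\n")

def generate_step(context, generate_until):
    """Decode one segment; if neither tag is closed, append a self-reflection
    hint and let the model retry once."""
    segment = generate_until(context, stop=["</search>", "</answer>"])
    if "</search>" not in segment and "</answer>" not in segment:
        context = context + segment + RETRY_HINT
        segment = generate_until(context, stop=["</search>", "</answer>"])
    return context + segment

def count_valid_searches(rollout_text):
    """A search counts as valid when a well-formed <search>...</search> span is
    emitted, i.e., the search engine actually gets called."""
    return len(re.findall(r"<search>.*?</search>", rollout_text, re.DOTALL))
```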

Comment

Thank you for your extensive response and for adding results with additional LLMs and tasks. Assuming these changes get incorporated to the final paper, I've updated my score to 7.

Comment

Thank you very much for your thoughtful follow-up and for updating your score. We're glad the additional results and clarifications addressed your concerns. We will make sure that all the discussed changes are properly incorporated into the final version of the paper.

Review
7

This paper proposes to incorporate retrieved tokens from external search engines in the rollout stage of reinforcement learning (PPO and GRPO). It also introduces a "loss masking" strategy to exclude retrieved tokens from model optimization, aiming to stabilize the RL training process.

  • Overall Quality: 4 / 5
  • Clarity: 3.5 / 5
  • Originality: 3.5 / 5
  • Significance: 3.5 / 5

Reasons to Accept

  1. The proposed method is novel and well-elaborated.
  2. The paper is well-written, despite formatting issues.
  3. This paper provides a new approach, other than RAG and tool-using, to utilize searching for reasoning.
  4. The experimental results on multiple QA benchmarks demonstrate the effectiveness of the proposed method.

Reasons to Reject

  1. The implementation is preliminary. E.g., the simple training template and the rule-based reward model.
  2. Some questions and concerns to clarify. Please refer to Questions To Authors.
  3. Format issues of the paper.
    • The paper style (\usepackage[xxx]) is not "submission" but "preprint" or "final". Thus, there are no line numbers to refer to.
    • Table 2/3/4: The captions should be below the tables.

Questions to Authors

  1. About "Retrieved Tokens Loss Masking":
    • Q1.1: To clarify, the retrieved tokens ("<information>d</information>") are masked out during training, while all of them actively serve as the input of the policy model $\pi_\theta$ during the rollout stage (as in Algorithm 1, Line 6), right?
    • Q1.2: Although the loss masking strategy stabilizes the training (Section 5.4), it encourages the model to learn to answer the question without any retrieved information (which is masked/skipped). Thus, the training process does not enhance the information retrieval ability of the model if Loss Masking is applied, right?
  2. Equation 2 and Equation 3: It looks better to replace "$\sum_{t=1:\,I(y_t)=1}^{|y|} \min(...)$" with "$\sum_{t=1}^{|y|} I(y_t)\,\min(...)$" (the two forms are written out after this list).
  3. Section 3 (Algorithm 1):
    • Q3.1: According to the algorithm, the rollout sequence may end up without a final answer (wrapped by <answer> </answer>). What is the ratio of such sequences, and how does this case affect the training results?
    • Q3.2: What does the Parse function exactly do? At least it will remove "<search>" and "</search>" tokens in $y_b$ and extract tokens between them, right?
    • Q3.3: Where is the prompt in Table 1 placed? Is it at the beginning of $x$ in Algorithm 1? Will the input query $x$ change during the rollout stage of the current question?
  4. About the evaluation:
    • Q4.1: During inference, does Search-R1 still work like the rollout stage, where the model generates search queries and calls search engines before answering?
    • Q4.2: Some datasets in the experiment have provided context for searching. Hence, it seems unnecessary to call external search engines (as in Search-R1) to solve those tasks.
  5. Will the rollout data, training code, and trained model be open-source?
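
Rendered for readability, the two summation styles contrasted in point 2 denote the same retrieved-token-masked sum, sketched below assuming the inner term is the standard PPO clipped surrogate with probability ratio $r_t(\theta)$ and advantage $\hat{A}_t$, and with $I(y_t)=1$ marking LLM-generated (non-retrieved) tokens; this is a notation sketch, not the paper's exact Equation 2/3.

```latex
% Notation sketch: the indexed-sum form (left) and the indicator form (right)
% denote the same sum over LLM-generated tokens.
\sum_{t=1:\, I(y_t)=1}^{|y|}
  \min\!\Big(r_t(\theta)\,\hat{A}_t,\;
             \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\hat{A}_t\Big)
\;=\;
\sum_{t=1}^{|y|} I(y_t)\,
  \min\!\Big(r_t(\theta)\,\hat{A}_t,\;
             \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\hat{A}_t\Big)
```
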
Comment

We appreciate your insightful feedback and believe it has significantly strengthened our manuscript. We have carefully addressed each of your comments as detailed below:

  • The implementation is preliminary. Given that reinforcement learning for interleaved reasoning and search in LLM agents remains underexplored, we intentionally begin with a clean and simple setting to establish a strong foundation. It is encouraging to observe that Search-R1 proves to be a simple yet effective RL methodology in this context. We view this as a promising initial point, and future works can explore more complex instructions and advanced reward designs (e.g., neural reward models).

  • Format. We are using the “submission” option but, apologies, did not enable “\ifcolmsubmission”. We will fix it accordingly.

  • Q1.1. Yes, they serve as inputs for policy rollout.

  • Q1.2. To clarify, when computing the token-level loss, we mask out the retrieved tokens and only include the LLM-generated tokens in the loss calculation. However, the logits for these generated tokens are still conditioned on the retrieved tokens, as the retrieved content is provided as input during optimization. This setup ensures that while the model leverages retrieved information to inform its generation, it is explicitly trained to write effective queries and perform reasoning over the retrieved content, rather than memorizing external information (a minimal sketch of this masking appears after this list).

  • Q2. Thank you for the comments. We will make modifications accordingly.

  • Q3.1. We conducted a study on Search-R1 using both Qwen2.5-3B and Qwen2.5-7B. The ratio of such sequences is 1.32% and 4.98%, respectively. Notably, this ratio is influenced by the maximum action budget B: as B increases, the ratio tends to decrease. A larger B also improves final performance, but at the cost of efficiency, as it leads to longer rollouts.

  • Q3.2. You are right. It will remove the <search> </search> tokens and extract the query in between (see the sketch after this list).

  • Q3.3. Yes, the prompt in Table 1 is placed at the beginning of x. The input query x will not be changed during rollout.

  • Q4.1. Yes, Search-R1 conducts interleaved reasoning and search engine calling during inference.

  • Q4.2. In our experiments, we focus on the challenging open-domain setting where only the question is provided, without any additional context. This requires the LLMs to actively retrieve relevant information on their own. We will clarify this setup in the revised manuscript for better transparency.

  • Q5. Sure, all the resources will be open-sourced.
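
Minimal sketches of the Parse step (Q3.2) and the retrieved-token loss mask (Q1.2) referenced above. Tokenization and tag handling are illustrative assumptions, not the released implementation.

```python
import re

# Hedged sketches: (a) Parse -- strip <search></search> and return the query;
# (b) a token-level mask that zeros out the retrieved <information> span so only
# LLM-generated tokens enter the PPO/GRPO loss, while all tokens still condition
# the policy's logits.

def parse_search_query(segment: str):
    """Remove the <search> </search> tokens and return the query in between."""
    m = re.search(r"<search>(.*?)</search>", segment, re.DOTALL)
    return m.group(1).strip() if m else None

def retrieved_token_mask(tokens):
    """mask[t] = 1 for LLM-generated tokens, 0 for tokens inside the
    <information>...</information> span appended by the environment."""
    mask, inside = [], False
    for tok in tokens:
        if tok == "<information>":
            inside = True
        mask.append(0 if inside else 1)
        if tok == "</information>":
            inside = False
    return mask

# Example: ["<search>", "capital", "of", "France", "</search>",
#           "<information>", "Paris", "is", "the", "capital", "</information>",
#           "<answer>", "Paris", "</answer>"]
# -> mask = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1]
```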

Comment

Thanks for the reply. This work looks solid to me. Please include the mentioned points in the revision.

Comment

Thank you for your feedback! We appreciate your positive assessment and will incorporate the mentioned points into the revised version.

Review
9

The paper presents a framework for including RAG (particularly search using search engines) during the RL phase of LLM training. The paper, titled Search-R1, discusses the details of this and the challenges overcome. In this framework, the authors generate (multiple) search queries during step-by-step reasoning with real-time retrieval and generate rewards for them for RL training. It models the search engine as part of the environment and applies retrieved-token masking to stabilize PPO/GRPO training.

The paper uses Qwen2.5 model family to run their experiments. On seven QA datasets (mix of in-domain and out-of-domain) the authors show that Search-R1 improves performance by 41% (Qwen2.5-7B) and 20% (Qwen2.5-3B) over various RAG baselines under the same setting.

The paper also offers experimental insights into RL optimization methods, LLM choices, and response length dynamics in retrieval-augmented reasoning.

Reasons to Accept

I think this paper is a pretty strong paper. Being able to model the search environment during RL has been of big importance for all practical LLM systems.

1/ This approach gets strong empirical gains -- it achieves a 20%-41% EM improvement over RAG baselines

2/ They introduce new tokens to structure the response. The generations are structured with <think></think>, <search></search>, <information></information>, and <answer></answer> tokens for clear multi-turn reasoning

3/ They add stability to the RL integration by retrieved-token masking. This boosts performance (e.g., EM jumps from 0.343 to 0.431 on 7B)

4/ This approach seems to generalize across instruction-tuned and non-instruction-tuned models

5/ The paper does thorough ablations comparing PPO vs. GRPO, study of response length dynamics, valid-search behavior

Reasons to Reject

1/ Would have been interesting to see any performance gains on bigger models (Llama models?) to see if the gains hold there as well

2/ It would be interesting to see how performance varies based on choice of knowledge base (search vs wikipedia vs knowledge graph)

Comment

We appreciate your insightful feedback, which has significantly strengthened our manuscript. We address each comment below.

  • Performance on bigger models (llama models). Thank you for your feedback. In response to your request, we have added results on both the Qwen2.5-32B-Base and LLaMA3.2-3B-Base models.

Qwen2.5-32B-Base

| Method | NQ | TriviaQA | PopQA | HotpotQA | 2wikimultihopQA | Musique | Bamboogle | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Direct Inference | 0.2169 | 0.5576 | 0.1915 | 0.2361 | 0.2661 | 0.0517 | 0.1440 | 0.2377 |
| CoT | 0.2252 | 0.5584 | 0.1936 | 0.2592 | 0.2954 | 0.0806 | 0.5040 | 0.3023 |
| IRCoT | 0.3058 | 0.6074 | 0.3382 | 0.3616 | 0.4150 | 0.1812 | 0.5280 | 0.3910 |
| Search-o1 | 0.2202 | 0.5364 | 0.1615 | 0.1764 | 0.0366 | 0.0604 | 0.3920 | 0.2262 |
| RAG | 0.3742 | 0.6177 | 0.4089 | 0.3230 | 0.2446 | 0.0736 | 0.2240 | 0.3237 |
| SFT | 0.3668 | 0.5186 | 0.1682 | 0.2606 | 0.2728 | 0.0993 | 0.1520 | 0.2626 |
| Search-R1 | 0.4922 | 0.6686 | 0.4769 | 0.4524 | 0.4546 | 0.2305 | 0.5565 | 0.4760 |

Llama3.2-3B-Base

| Method | NQ | TriviaQA | PopQA | HotpotQA | 2wikimultihopQA | Musique | Bamboogle | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Direct Inference | 0.1391 | 0.3682 | 0.1238 | 0.1218 | 0.1066 | 0.0153 | 0.0640 | 0.1341 |
| CoT | 0.2462 | 0.4866 | 0.1655 | 0.0510 | 0.0827 | 0.0057 | 0.0240 | 0.1517 |
| IRCoT | 0.3626 | 0.5655 | 0.4282 | 0.2376 | 0.2359 | 0.0719 | 0.2080 | 0.3014 |
| Search-o1 | 0.1075 | 0.2034 | 0.0929 | 0.1319 | 0.1168 | 0.0348 | 0.1760 | 0.1233 |
| RAG | 0.3172 | 0.5510 | 0.3371 | 0.2339 | 0.1179 | 0.0343 | 0.0640 | 0.2365 |
| SFT | 0.3197 | 0.3411 | 0.1220 | 0.2062 | 0.2571 | 0.0641 | 0.1200 | 0.2043 |
| Search-R1 | 0.3567 | 0.5776 | 0.3778 | 0.3143 | 0.2330 | 0.0902 | 0.3065 | 0.3223 |

The results demonstrate that Search-R1 consistently outperforms strong baseline methods across various model sizes (3B, 7B, 32B) and architectures (Qwen2.5 and LLaMA3.2), highlighting its robustness and generalizability.

  • How performance varies based on the choice of knowledge base or search engine. This is a great point. We conduct additional experiments to study how the choice of search engine (type of retriever + knowledge source) impacts both the training and inference performance of Search-R1.

Impact during training. We evaluate four search engines: (a) Random noise, (b) BM25 + Wikipedia, (c) E5 (ANN) + Wikipedia, and (d) E5 (Exact match) + Wikipedia. Results are shown in the table below. Key findings include: (1) Training with stronger retrievers (e.g., E5 (Exact) and E5 (ANN)) leads to more stable reinforcement learning and better final performance. (2) In contrast, using weaker retrievers (e.g., Random and BM25) significantly limits the final model performance.

| Search Engine | NQ | TriviaQA | PopQA | HotpotQA | 2wikimultihopQA | Musique | Bamboogle | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| random | 0.2366 | 0.4941 | 0.1771 | 0.2170 | 0.2688 | 0.0583 | 0.2339 | 0.2408 |
| BM25 | 0.3413 | 0.6070 | 0.3217 | 0.4043 | 0.3703 | 0.1370 | 0.2800 | 0.3517 |
| E5 (ANN) | 0.4676 | 0.6214 | 0.3662 | 0.3723 | 0.2872 | 0.1374 | 0.4000 | 0.3789 |
| E5 (Exact) | 0.4806 | 0.6379 | 0.4571 | 0.4328 | 0.3820 | 0.1957 | 0.4240 | 0.4300 |

Impact during inference. We further evaluate how models trained with different search engines generalize across various inference-time retrievers. The results (table below) show: (1) Search-R1 exhibits strong generalization across retrievers: even when trained with a specific search engine, it performs reasonably well when tested with others. (2) More powerful retrievers at inference time—particularly Google Search (via API)—consistently yield the best results, underscoring the importance of high-quality retrieval in downstream tasks.

| Train / Test Retriever | bm25 | e5 ANN | e5 flat | Google Search |
| --- | --- | --- | --- | --- |
| random | 0.0317 | 0.0317 | 0.0317 | 0.0317 |
| BM25 | 0.2434 | 0.1587 | 0.2593 | 0.5397 |
| E5 ANN | 0.2698 | 0.1693 | 0.2540 | 0.6032 |
| E5 Exact | 0.2487 | 0.1958 | 0.2646 | 0.6032 |
| avg | 0.2540 | 0.1746 | 0.2593 | 0.5820 |

Comment

Thanks for the clarifications to my questions.

Comment

Thank you for your thoughtful review and support of our work! We appreciate your engagement with our work.

Final Decision

The authors study applying PPO and GRPO to a fairly standard multi-hop LLM system. Overall, the authors find that their approach is successful at teaching small Qwen models to improve, learning entirely (as far as I can tell) from final-answer rewards.

There's a very long history of learning from final-answer rewards for these types of systems via rejection sampling, with no obvious first such work. However, this was raised and addressed during the review discussion period. This is a CRUCIAL baseline that has to be added in the paper, since every other baseline is an unadapted system that does not even get to learn from any feedback! Other related approaches include learning from comparisons among rollouts (LeReT; Hsu et al., 2024), prompt optimization, or self-bootstrapped finetuning of such agents/systems (which is conceptually akin to the rejection sampling baseline here). There's also a long line of work on tuning these systems in more elaborate ways, like MDR, IRRR, Baleen, etc., that is simply absent from the paper's discussions.

Overall, and given the unanimous vote to accept by the reviewers, I also recommend acceptance, assuming the authors will surface the baseline (rejection sampling finetuning, or filtered behavior cloning) and the related work discussions in their camera ready.