PaperHub
6.1 / 10
Poster · 4 reviewers
Reviewer ratings: 4, 4, 2, 3 (min 2, max 4, std 0.8)
ICML 2025

C-3PO: Compact Plug-and-Play Proxy Optimization to Achieve Human-like Retrieval-Augmented Generation

OpenReview · PDF
Submitted: 2025-01-21 · Updated: 2025-07-24

Abstract

Keywords
Retrieval-Augmented Generation

Reviews and Discussion

Review
Rating: 4

This paper presents a novel proxy-centric framework for addressing the alignment challenge in RAG systems. The key innovation lies in its introduction of a lightweight multi-agent system that mediates between retrievers and LLMs without requiring modifications to either component. The framework is inspired by human search behavior and implements three specialized agents that work collaboratively to optimize the RAG pipeline. The key technical contributions of C-3PO include a proxy-centric alignment architecture that maintains plug-and-play flexibility, an efficient multi-agent system design, and a tree-structured rollout approach for multi-agent reinforcement learning that enables effective reward credit assignment. Through extensive experimentation in both in-domain and out-of-distribution scenarios, the authors demonstrate that their approach significantly enhances RAG performance while maintaining generalization capabilities across different retrievers and LLMs.

Questions for Authors

The key points have been covered in the previous sections. I have no additional questions that would substantially impact my evaluation of this work.

Claims and Evidence

The authors make several key claims that are well-supported by the presented evidence:

  • Claim 1: The proposed proxy-centric framework (C-3PO) effectively bridges retrievers and LLMs while maintaining plug-and-play flexibility.

The authors provide detailed technical descriptions in Sections 1 and 2 that clearly distinguish their approach from existing methods, supporting this claim. They demonstrate the plug-and-play flexibility through extensive experiments in both in-domain and out-of-distribution scenarios (Tables 1 and 2). The evidence appears convincing, as they test with unseen retrievers and LLMs to validate generalization capabilities.

  • Claim 2: The tree-structured rollout mechanism and Monte Carlo credit assignment effectively optimize multi-agent coordination.

The authors provide a theoretical foundation in Section 5 with a detailed mathematical formulation. The effectiveness of this approach is empirically validated through comprehensive ablation studies in Section 6.4, which quantitatively demonstrate its advantages over alternatives. The design is well-motivated, and the consistent performance improvements across different experimental settings further strengthen this claim.

  • Claim 3: The human-inspired multi-agent collaborative system enhances RAG performance.

The evidence for this claim is particularly strong in Section 6.6 and Appendix C, where the authors demonstrate the effectiveness of their approach through in-context learning experiments. Notably, C-3PO-ICL shows impressive performance even without any training, outperforming many baselines from Tables 1 and 2. The detailed case studies and comprehensive analysis across different tasks and scenarios provide convincing support for the benefits of the multi-agent collaborative approach.

Methods and Evaluation Criteria

  • Methods: The proposed proxy-centric framework makes sense as it addresses the key challenge of aligning retrievers and LLMs without modification. The multi-agent design mimicking human search behavior is intuitive and well-motivated. The use of MARL with the proposed tree-structured rollout is appropriate for optimizing multiple agents towards the system-level objectives. The lightweight design ensures practical applicability while maintaining effectiveness.

  • Evaluation criteria: The evaluation is comprehensive and well-structured. The authors conduct extensive experiments across a diverse range of datasets, including three single-hop datasets (NQ, PopQA, TriviaQA) and three multi-hop datasets (HotpotQA, 2WikiMultihopQA, MuSiQue). The inclusion of FreshQA and MultiHop-RAG as out-of-distribution test sets further demonstrates the model's robustness and adaptability. Furthermore, the authors evaluate C-3PO's plug-and-play and generalization capabilities by testing with previously unseen retrievers and LLMs. This comprehensive evaluation protocol provides strong evidence for the framework's versatility and practical applicability in real-world settings.

Theoretical Claims

This paper does not make formal theoretical claims requiring rigorous proofs.

Experimental Design and Analysis

The experimental design and analyses in this paper are thorough and well-executed. The authors conduct comprehensive experiments across a diverse range of datasets, including both single-hop and multi-hop benchmarks, which effectively validates the model's capability to handle tasks of varying complexity.

Particularly noteworthy is their extensive evaluation of out-of-distribution (OOD) generalization across three dimensions: OOD datasets (FreshQA and MultiHop-RAG), different retrieval systems (from Contriever to Google Search), and various LLM servers (from Qwen to GPT-4). This comprehensive OOD evaluation protocol strongly supports their claims about the framework's plug-and-play capability and generalization ability.

The ablation studies are systematic and well-designed. The authors thoroughly examine both the training paradigm and collaborative strategies, providing clear insights into each component's contribution. The comparison of different fixed strategies particularly helps understand the model's behavior. Furthermore, the efficiency analysis comparing both performance and inference cost across different methods demonstrates practical considerations for real-world deployment.

Supplementary Material

I have reviewed the supplementary material. The supplementary material includes well-organized implementation code with clear documentation and setup instructions.

Relation to Broader Scientific Literature

This work makes meaningful connections to several important research directions in the broader scientific literature:

First, the work builds upon and extends retrieval-augmented generation (RAG) research. While previous works mainly focus on modifying either retrievers (e.g., REPLUG) or LLMs (e.g., Self-RAG, Auto-RAG), this paper proposes a novel perspective of using a lightweight proxy for alignment, which provides a more practical and efficient solution.

Second, the tree-structured rollout mechanism for multi-agent reinforcement learning builds upon classic MARL literature. This work presents a solution by introducing Monte Carlo credit assignment with tree-structured exploration, advancing the field of multi-agent coordination.

Essential References Not Discussed

After a thorough review of the paper's citations and related work section, I did not identify any essential references that are missing from the discussion. The citation coverage appears complete and up-to-date, providing adequate context for understanding the paper's contributions and positioning in the broader research landscape.

Other Strengths and Weaknesses

Strengths

  1. The proxy-centric alignment framework is innovative, offering a practical solution that enhances RAG systems without modifying existing components. This approach significantly reduces deployment barriers while maintaining strong performance.
  2. The multi-agent collaborative system design is elegant and well-motivated, effectively mimicking human search behavior through specialized agents. The lightweight implementation (0.5B/1.5B parameters) demonstrates impressive efficiency.
  3. The training methodology combining MARL with tree-structured rollout and Monte Carlo credit assignment is technically sound and novel, effectively addressing the complex challenge of multi-agent optimization.
  4. The empirical validation is remarkably comprehensive, demonstrating strong performance across both in-domain scenarios and out-of-distribution settings (datasets, retrievers, and LLMs), convincingly validating the framework's effectiveness and generalization capability.

Weaknesses

  1. While the current evaluation is comprehensive across in-domain and out-of-distribution settings, testing on more challenging benchmarks like Humanity's Last Exam (HLE) would further validate the model's capabilities on highly complex reasoning tasks.
  2. The training paradigm currently relies on seed data collection. While this is a practical approach, exploring the possibility of from-scratch RL training (similar to recent advances such as Deepseek-R1) could provide interesting insights into more general training strategies, though this is beyond the scope of the current work.

Other Comments or Suggestions

The paper is well-written and clearly structured. The authors have done a thorough job in presenting their ideas and experimental results. The figures and tables are informative and well-organized. I would encourage the authors to explore the framework's capabilities on more challenging tasks (such as HLE) and to investigate its potential for broader applications.

Author Response

Dear Reviewer stM2,

Thank you for your thoughtful review and constructive suggestions. We particularly appreciate your recommendations about extending our evaluation to more challenging benchmarks and exploring alternative training strategies. These insights will help strengthen our work. We would like to address your suggestions in detail:

W1

Thank you for this valuable suggestion about testing on more challenging benchmarks. We agree that evaluation on complex reasoning tasks is crucial for validating our framework's capabilities.

We have conducted additional experiments on Humanity's Last Exam (HLE) text-only questions using Google as the retriever (an out-of-domain search engine for C-3PO):

| LLM | Method | n docs | HLE (text) |
| --- | --- | --- | --- |
| Deepseek-R1 | - | - | 8.6 |
| o3-mini (high) | - | - | 14 |
| Qwen2.5-72B-Instruct | Vanilla LLM | - | 4.85 |
| Qwen2.5-72B-Instruct | Vanilla RAG | 10 | 5.35 |
| Qwen2.5-72B-Instruct | C-3PO | 10 | 6.46 |
| Qwen2.5-72B-Instruct | C-3PO-Planning | 10 | 6.84 |

The results show that:

  • C-3PO improves performance by 1.11% over vanilla RAG (6.46 vs. 5.35)
  • C-3PO-Planning further improves performance by 1.49% over vanilla RAG (6.84 vs. 5.35)
  • These improvements demonstrate our framework's effectiveness even on highly challenging reasoning tasks with out-of-domain retrieval

We will include these results in our revised manuscript to provide a more comprehensive evaluation of our framework's capabilities.

W2

Thank you for this insightful suggestion about exploring from-scratch RL training. We agree that this direction, similar to Deepseek-R1's approach, is very interesting and could potentially lead to more general training strategies for multi-agent systems.

While our current warm-up approach helps ensure stable and smooth training in the multi-agent setting, we believe exploring from-scratch training could:

  • Reduce dependency on seed data collection
  • Potentially discover novel agent interaction patterns
  • Lead to more generalizable training strategies

We will include this as an important direction for future research. The challenge of balancing exploration and stability in from-scratch multi-agent RL training presents an exciting opportunity for advancing the field.

We sincerely appreciate your valuable suggestions that have helped us identify important directions for both immediate improvements and future research. Your feedback about evaluation on challenging benchmarks has already led to meaningful additional results. We will incorporate these improvements in our revised manuscript.

Reviewer Comment

Thanks for your detailed responses. All my concerns have been addressed. I have decided to maintain my score in favor of acceptance.

Author Comment

Thank you very much for your positive feedback and support. We greatly appreciate your time and consideration.

Review
Rating: 4

This paper proposes a proxy-centric framework that enhances communication between retrievers and Large Language Models (LLMs) through a lightweight multi-agent system named C-3PO. Unlike the vanilla RAG framework, the proposed framework incorporates multiple specialized LLM agents to manage different stages of the pipeline:

  1. Reasoning Router Agent: Evaluates the complexity of the query to determine whether retrieval and reasoning are required. For simple queries, the process proceeds directly to the Information Filter Agent. For complex queries, the system enters a planning mode, engaging all agents collaboratively.
  2. Information Filter Agent: Processes and extracts relevant information from the retrieved data.
  3. Decision Maker Agent: Identifies the optimal action during the planning mode.

To train the framework, the authors propose a tree-structured rollout mechanism for credit assignment, addressing the issue of sparse rewards, and utilize a PPO training objective. Experiments conducted on multiple QA datasets across various RAG systems, including those with retriever tuning or LLM tuning, demonstrate significant improvements in performance.
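For concreteness, my reading of the described control flow is roughly the following minimal sketch (all function names are hypothetical placeholders for the three proxy agents, the retriever, and the frozen LLM, not the authors' code):

```python
# Minimal sketch of the described C-3PO control flow.
# route / retrieve / filter_docs / decide / answer are hypothetical placeholders.

def c3po_answer(question, retriever, llm, proxy):
    strategy = proxy.route(question)                   # Reasoning Router Agent
    if strategy == "no_retrieval":
        return llm.answer(question)                    # rely on the LLM's own knowledge
    if strategy == "retrieval":
        docs = retriever.retrieve(question)
        evidence = proxy.filter_docs(question, docs)   # Information Filter Agent
        return llm.answer(question, evidence)
    # "planning" mode: iterate until the Decision Maker chooses to answer
    evidence = []
    while True:
        action = proxy.decide(question, evidence)      # Decision Maker Agent
        if action.kind == "search":
            docs = retriever.retrieve(action.query)
            evidence += proxy.filter_docs(question, docs)
        else:                                          # action.kind == "answer"
            return llm.answer(question, evidence)
```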

Questions for Authors

  1. To better assess the efficiency and the cost of the framework, could you elaborate on the average number of 8B LLM forward passes (from the additional agents) required for each task?

  2. In Table 3, the [Planning] module shows limited improvement on 2Wiki, PopQA, and M-RAG, while demonstrating significant improvement on FQA compared to the [Retrieval] module. Could you provide insights into this discrepancy?

  3. Do you think the designed multi-agent framework could be applied to broader tasks beyond QA? For example, tasks in [3]. If not, what adjustments would need to be made?

[3] Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models. NIPS 2024.

Claims and Evidence

Yes

Methods and Evaluation Criteria

A minor concern arises regarding the generalization capability beyond QA tasks. The current agent functions and pipeline appear to be QA-oriented, and the evaluation datasets are exclusively focused on QA tasks. It would be beneficial either to explicitly position this work as specific to QA or to extend the evaluation to a broader range of tasks to demonstrate the framework's versatility and applicability beyond question answering.

Theoretical Claims

Yes

Experimental Design and Analysis

For the RAG baseline involving LLM fine-tuning, the use of Qwen2 to control variables raises concerns about reproducibility. To ensure fairness and simplicity, I think a more straightforward baseline could simply employ a retriever with an instruction-tuned Qwen2-7B server. Instruction tuning is a standard and widely accessible approach compared to the custom fine-tuning proposed in this work, making it a more practical and reproducible baseline for evaluation.

Supplementary Material

Yes. I have reviewed the necessary appendix sections to gain a comprehensive understanding of the work.

Relation to Broader Scientific Literature

This work proposes a multi-agent cooperative framework and training method, which extends beyond QA tasks. Its modular design and tree-structured rollout approach offer potential for broader applications with customizable agents and pipelines.

Essential References Not Discussed

No

Other Strengths and Weaknesses

Strengths

  1. The use of multi-agent systems to handle complex tasks is a highly sought-after approach, and this paper presents a well-designed framework with significant performance improvements.
  2. The paper is well-written and easy to follow, making it accessible to a broad audience.

Weaknesses

  1. The designed agent functionality and pipeline appear to be overly specific to QA tasks, limiting the framework's generalizability to other applications.
  2. The reported improvements come at a significant cost, including the computational and resource overhead of training these customized agents and the increased complexity during inference.

Despite the inclusion of an Inference Efficiency Analysis to highlight performance trade-offs, the comparison baseline is somewhat outdated and relies on costly methods (e.g., query rewriting). Recent works have focused on more efficient single-dimension improvements for RAG (e.g., reranking [1], drafting [2]), which were omitted in the main experiments and the Efficiency Analysis.

[1] RankRAG: Unifying context ranking with retrieval-augmented generation in llms. NIPS 2024.

[2] Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting. ICLR 2025.

Other Comments or Suggestions

No.

Author Response

Dear Reviewer Tndf,

Thank you for your thorough and constructive review. We would like to address each of your concerns in detail:

W1

Thank you for raising this important issue about the framework's generalizability. We would like to clarify several aspects:

  1. Design Philosophy:
  • Our proxy-centric alignment is inspired by human interaction patterns in knowledge-intensive tasks, where information gathering and reasoning are fundamental operations.
  • The core design focuses on how the proxy can align retrievers and LLMs through planning and reasoning to collect information, rather than being strictly QA-specific.
  2. Preservation of LLM Capabilities:
  • Importantly, our framework does not fine-tune the LLM, preserving its general capabilities (e.g., writing, summarization).
  • The agents serve as information gathering and coordination proxies, which are inherently applicable to various knowledge-intensive tasks beyond QA.

We will include this as an important direction for future work, while maintaining that the current design principles are fundamentally task-agnostic.

W2

Thank you for raising this important issue. We would like to address your concerns from multiple aspects:

  1. Training Efficiency:
  • Our approach does not introduce significant additional training overhead compared with standard PPO.
  • Traditional RL methods typically require sampling multiple independent trajectories per question in parallel, whereas our approach may reuse partial trajectories during rollout. In the SGLang inference system, the reused context can be efficiently cached for subsequent inference. Algorithmically, our approach simply redistributes this sampling effort from the question level to the action level, maintaining a similar computational budget.
  2. Inference Efficiency:
  • Figure 3 shows our method does not introduce substantial inference latency.
  • This efficiency is achieved through our Decision Maker, which dynamically allocates optimal strategies to balance computation and performance.
  • Figure 5 shows how different strategies evolve during RL iterations, providing transparency into our method's adaptation.
  3. Regarding Baseline Comparisons:
  • While works [1,2] are not yet open-sourced for faithful reproduction, we have included another reranking work in Tables 2/3.
  • We chose QueryRewriting as our efficiency baseline due to its parameter efficiency (1.5B) and consistent stability across scenarios.

We appreciate these suggestions and will incorporate the related works [1,2] to better position our work.

Q1

Thank you for this detailed question about computational efficiency. Let us break down the number of forward passes required for each strategy:

  1. Empirical Evidence:
  • Figure 6 provides detailed distributions of inference depths across different datasets
  • Figure 3 shows the inference latency of C-3PO compared to baselines
  2. Forward Passes by Strategy:
  • [No Retrieval]: 1 proxy pass + 1 LLM pass
  • [Retrieval]: 2 proxy passes + 1 LLM pass
  • [Planning]: 2 LLM passes + variable proxy passes (distribution shown in Figure 6)
  • Note that the proxy agents are lightweight (0.5B/1.5B) while the LLM can be 7B/72B

This strategic allocation of computational resources allows us to maintain efficiency and achieve superior performance. The actual number of forward passes is optimized for each specific query rather than using a fixed number for all cases.
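For concreteness, a rough back-of-the-envelope comparison of per-query compute under the parameter scales quoted above (0.5B proxy vs. a 72B server LLM); these are illustrative numbers only, not measured latency, and the planning trace length is a hypothetical example:

```python
# Rough per-query compute comparison, proportional to (parameters x forward passes).
# Assumes a 0.5B proxy and a 72B server LLM; ignores sequence-length differences.

PROXY_B, LLM_B = 0.5, 72.0   # parameter counts in billions

strategies = {
    "no_retrieval": (1, 1),   # (proxy passes, LLM passes)
    "retrieval":    (2, 1),
    "planning_ex":  (4, 2),   # hypothetical planning trace with 4 proxy passes
}

for name, (proxy_passes, llm_passes) in strategies.items():
    cost = proxy_passes * PROXY_B + llm_passes * LLM_B
    overhead = proxy_passes * PROXY_B / (llm_passes * LLM_B)
    print(f"{name:13s} relative cost {cost:6.1f}  proxy overhead {overhead:.1%}")
```

Even in the illustrative planning case, the proxy contributes only a small fraction of the total compute relative to the server LLM calls.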

Q2

Thank you for this insightful observation. The [Planning] strategy, while powerful, involves collecting additional information that may introduce more noise and potentially mislead the LLM. Meanwhile, for many RAG datasets, a well-crafted query combined with effective filtering ([Retrieval] in our C-3PO) might suffice, especially when search engines can retrieve the necessary information in a single pass. This suggests that the optimal strategy depends on the alignment between dataset and proxy capabilities rather than following a one-size-fits-all approach.

Q3

Thank you for this thoughtful question about extending our framework beyond QA tasks. While our C-3PO framework is specifically designed for knowledge-intensive tasks where proxy-centric alignment between retrieval and LLM components is crucial, the tasks in [3] primarily focus on logical reasoning that may not heavily rely on external knowledge. For such pure logical reasoning tasks, our retrieval-oriented multi-agent system might offer limited benefits in its current form.

However, we believe our framework could be adapted for logical reasoning tasks by:

  • Combining a reasoning verification agent
  • Integrating training approaches similar to Deepseek-R1 for pure reasoning tasks

We appreciate this suggestion as it opens up interesting directions for future research.

We sincerely appreciate your detailed review and thoughtful questions. We believe addressing these points has helped strengthen our paper. We look forward to your further feedback.

Review
Rating: 2

The paper proposes C-3PO, which introduces a multi-agent system that optimizes retrieval, query generation, and information filtering. It uses multi-agent reinforcement learning (MARL) with tree-structured rollout and Monte Carlo credit assignment. Experiments show that C-3PO significantly enhances RAG performance across in-domain and out-of-distribution datasets, demonstrating its plug-and-play flexibility and strong generalization capabilities.

Questions for Authors

See above comments.

Claims and Evidence

Most of the claims in this paper are supported by the evidence.

Some issues:

  • The paper does not compare against certain related baselines that also use tree-based rollout [3] or multi-agent training for RAG [1, 2], making it unclear how C-3PO improves over existing methods.
  • There is no detailed analysis of the role of each agent in the system. While Table 3 may provide some insights, the experimental setup is unclear, and it is not explicitly explained what each row represents.
  • The performance gain of tree-structured rollout over standard reinforcement learning appears marginal in Figure 2, raising concerns that the proposed approach may be overly complex without substantial benefits.

References:

[1] Chen, Yiqun, et al. "Improving Retrieval-Augmented Generation through Multi-Agent Reinforcement Learning." arXiv preprint arXiv:2501.15228 (2025).

[2] Shao, Zhihong, et al. "Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy." arXiv preprint arXiv:2305.15294 (2023).

[3] Jiang, Jinhao, et al. "RAG-Star: Enhancing Deliberative Reasoning with Retrieval Augmented Verification and Refinement." arXiv preprint arXiv:2412.12881 (2024).

Methods and Evaluation Criteria

The choice of datasets is reasonable.

However, it is unclear why EM/F1/Accuracy scores were not used as the final performance metrics, given that they are widely adopted in prior work (numerous references support this). It is recommended to at least provide numbers on one or more of these metrics.

Theoretical Claims

n/a

Experimental Design and Analysis

See above sections for details.

Supplementary Material

All parts.

Relation to Broader Scientific Literature

This paper proposes an online RL training method that is reasonable and has some novelty. However, the added complexity raises concerns about whether the performance gains justify the additional computational cost.

Essential References Not Discussed

RAG with multi-agent systems:

Chen, Yiqun, et al. "Improving Retrieval-Augmented Generation through Multi-Agent Reinforcement Learning." arXiv preprint arXiv:2501.15228 (2025).

Shao, Zhihong, et al. "Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy." arXiv preprint arXiv:2305.15294 (2023).

Zhu, Junda, et al. "ATM: Adversarial Tuning Multi-agent System Makes a Robust Retrieval-Augmented Generator." arXiv preprint arXiv:2405.18111 (2024).

Other Strengths and Weaknesses

Strengths:

  • Clear Modular Design for Multi-Agent Collaboration

  • Strong Performance on RAG Tasks

  • Detailed prompt format and implementation details provided

Weaknesses:

  • Additional studies using alternative metrics (e.g., EM/F1) and inference efficiency analysis would strengthen the empirical results.

  • The method should be tested on a wider range of LLM APIs and local models to assess its generalizability across different deployment settings.

  • Including additional baselines that use tree-based rollout or multi-agent training would provide a more comprehensive comparison.

Other Comments or Suggestions

N/A

Author Response

Dear Reviewer B7TK,

We sincerely appreciate your thorough review. We have carefully addressed each of your concerns below:

Issue 1 & W3

We appreciate the mentioned related works, and would like to clarify several important points:

  1. First, we acknowledge the importance of these works. We will cite these related works and incorporate detailed discussions in our revised version.
  2. Regarding the timeline and reproducibility of mentioned related works:
  • [1] was published on Jan 25, 2025, two days after the ICML abstract submission deadline of Jan 23, 2025.
  • [3] was published on Dec 17, 2024, and can be considered concurrent work.
  • For [2] and [3], despite their relevance, the absence of publicly available implementations makes faithful reproduction challenging.
  3. Our evaluation has covered three major baseline categories (retriever/LLM fine-tuning and intermediate approaches) across 6 in-domain and 2 OOD datasets, demonstrating thorough effectiveness and generalization.

Issue 2

We apologize for any confusion. While the experimental setup for each row in Table 3 was presented in Lines 418-426, we would like to provide further clarification:

  • [No Retrieval]: Relies solely on LLM's inherent knowledge
  • [Retrieval]: Employs single retrieval-filter loop
  • [Planning]: Utilizes multi-step reasoning

The full C-3PO system's ability to adaptively select strategies leads to robust performance across different datasets.

Issue 3

We appreciate the reviewer's careful examination of our tree-structured rollout. We would like to provide additional clarification:

  1. Regarding the performance gains:
  • We observe substantial gains on challenging tasks such as Musique, HQA, and PopQA.
  • While our method enhances agent decision-making instead of directly answering the question, the performance ceiling ultimately depends on the LLM (which remains frozen in C-3PO). Our approach still outperforms many recent methods that fine-tune LLMs (e.g., Auto-RAG), as shown in Tables 1/2.
  2. On complexity concerns:
  • We would like to emphasize that our tree-structured rollout does not introduce additional computational overhead compared to standard RL.
  • Traditional RL methods typically require multiple independent sampling trajectories per question in parallel, whereas our approach may reuse partial trajectories during rollout. In the SGLang inference system, reused context can be efficiently cached. Algorithmically, our approach redistributes these sampling efforts from the question level to the action level, enabling more efficient credit assignment in multi-agent systems through expectation-based reward distribution.
  • We can also control the breadth/depth of the tree to balance exploration and cost (Eq. 4), making it flexible for various computational budgets.

We believe the clarifications show that our approach offers meaningful improvements and maintains computational efficiency.

Eval Criteria

We appreciate the reviewer's suggestion regarding evaluation metrics. We would like to clarify our choice of metrics and provide additional results:

  1. Limitations of EM metrics:
  • Through our preliminary studies, we observed that rule-based metrics such as EM can be unreliable, especially when using frozen LLMs that may express correct answers in varied formats.
  • Inaccurate rewards from strict rule-based matching could potentially harm the RL training.
  2. The choice of LLM-based evaluation:
  • Recent benchmarks such as FreshQA and HLE (Humanity's Last Exam) have increasingly adopted LLM-based evaluation due to its ability to capture semantic correctness beyond EM.
  • In our human verification process, we found that Qwen2-72B-instruct demonstrates high accuracy in assessment, making it a more reliable source for both evaluation and RL rewards.
  3. Additional EM Results:
  • To address this concern, we provide partial EM results below (due to the character limit; the full table appears in our response to W4 from Reviewer TgQU).
| Methods | 2Wiki | HQA | Musique | NQ | PopQA | TQA | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Standard | 26.1 | 41.3 | 31 | 52.1 | 38.1 | 73.8 | 43.73 |
| Auto-RAG | 44.7 | 41.3 | - | 43.8 | 39.2 | 72.1 | 48.22 |
| C-3PO-0.5B | 60.5 | 61.1 | 50.1 | 65.9 | 52.7 | 80.3 | 61.76 |
| C-3PO-1.5B | 63.7 | 63 | 54.8 | 67.7 | 53.8 | 82 | 64.16 |

W1

Regarding inference efficiency, we have already presented a detailed analysis in Section 6.5 and Figure 3, which shows that C-3PO achieves the best performance-efficiency trade-off.

W2

We appreciate the suggestion and would like to clarify that our evaluation already covers a diverse range of LLMs across different scales and types, such as Qwen2-7B, Qwen2-72B, Llama3.3-70B, and GPT4o-mini (commercial API), as shown in Tables 1/2. While we acknowledge that testing on other commercial APIs like Claude and o1 would be interesting, the significant costs make such extensive evaluation prohibitively expensive.

We sincerely thank you for your detailed comments and hope our responses have adequately addressed your concerns.

Review
Rating: 3

The paper proposes C-3PO, a plug-and-play multi-agent system used to enhance the alignment of retrievers and LLMs in RAG systems. Specifically, C-3PO consists of three LLM agents: a reasoning router designed to determine the reasoning strategy for a specific question, an information filter agent used to identify relevant documents from retrieved ones, and a decision maker agent designed to determine the optimal action based on the current state. To optimise these agents, the paper trains them with reinforcement learning and proposes a simple tree-structured rollout approach for robust on-policy learning, in which the reward is computed by enumerating the possible reasoning strategies for each question. Experimental results on both in-domain and out-of-domain datasets validate the effectiveness of the proposed C-3PO.

Questions for Authors

Please see the questions in above sections.

Claims and Evidence

The claims are well-supported by the experimental results.

Methods and Evaluation Criteria

The proposed method is solid. However, the paper relies on an LLM (see Appendix D.2) to evaluate the generated answers, which raises concerns about potential biases and reliability. It is unclear which LLM is used for evaluation and how different LLMs would affect the results. The paper should also provide results on existing QA evaluation metrics, such as Exact Match (EM) and F1-score, to offer a more standardized and quantitative assessment of the answers.

Theoretical Claims

There is no theoretical analysis in the paper.

Experimental Design and Analysis

The experimental design appears reasonable and well-structured.

Supplementary Material

I have reviewed all the Appendices.

Relation to Broader Scientific Literature

Existing works that leverage an intermediate component to bridge the gap between retrievers and LLMs focus on optimising a single task in isolation, which may lead to suboptimal performance. The paper proposes C-3PO to facilitate seamless communication between retrievers and LLMs.

Essential References Not Discussed

The following iterative/adaptive RAG models are missing from the paper:

  1. Trivedi, Harsh, et al. "Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions." ACL 2023.
  2. Jiang, Zhengbao, et al. "Active retrieval augmented generation." EMNLP 2023.
  3. Su, Weihang, et al. "DRAGIN: Dynamic Retrieval Augmented Generation based on the Information Needs of Large Language Models." ACL 2024.
  4. Jeong, Soyeong, et al. "Adaptive-rag: Learning to adapt retrieval-augmented large language models through question complexity." NAACL 2024.

Other Strengths and Weaknesses

Strengths:

  1. The paper is well-written and easy to follow.
  2. The proposed C-3PO seems novel. Experimental results on six in-domain datasets and two out-of-domain datasets validate the effectiveness of the proposed C-3PO.
  3. Ablation studies are conducted to verify the effectiveness of each component.

Weaknesses:

  1. The proposed tree-structured rollout method incurs high computational cost, as it requires exploring all possible reasoning trajectories for each question. This exhaustive search significantly increases the training overhead, limiting its practicality.

  2. The paper states that it employs a warm-up phase to train the multi-agent system. Despite some descriptions of the supervised warm-up phase, the details remain unclear. Although the appendix provides some additional information, it does not fully explain the specifics of the training process, including the training data and training methodology.

  3. The introduction of the evaluation metrics should be moved from Appendix to the main paper.

  4. A major concern is the use of LLM for evaluation, raising questions about bias and reliability. It is unclear why conventional QA metrics such as Exact Match and F1 are not reported.

Other Comments or Suggestions

No.

Author Response

Dear Reviewer TgQU,

Thank you for your thorough and constructive review of our paper. We would like to address each of your concerns in detail:

W1

We appreciate your concern about computational efficiency. We would like to clarify that our tree-structured rollout does not introduce additional computational overhead compared to standard RL:

  • Traditional RL methods typically require sampling multiple independent trajectories per question in parallel, whereas our approach may reuse partial trajectories during rollout. In the SGLang inference system, the reused context can be efficiently cached for subsequent inference. Algorithmically, our approach simply redistributes this sampling effort from the question level to the action level, maintaining a similar computational budget.

  • The tree structure actually provides several advantages:

    • It enables more systematic exploration of the action space
    • It allows for expectation-based credit assignment
    • It reduces the variance in training compared to random sampling
  • We can also control the breadth and depth of the tree to balance between exploration and computational cost (Eq. 4), making it flexible for different computational budgets.

Therefore, while our approach may appear computationally intensive at first glance, it actually offers a more structured and efficient way to explore the action space within the same computational constraints as traditional RL methods.
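To illustrate the idea, a deliberately simplified sketch of expectation-based credit assignment over a rollout tree is given below (a toy illustration under simplified assumptions, not our actual training code or the exact formulation of Eq. 4):

```python
# Simplified illustration of Monte Carlo credit assignment over a tree-structured
# rollout: internal nodes are agent decisions, leaves carry terminal rewards
# (e.g., answer correctness), and each action is credited with the expected
# reward of the subtree it leads to.

def assign_credit(node):
    """Return the expected terminal reward of `node`; annotate each child edge
    (action) with the expected reward of the subtree it leads to."""
    if not node["children"]:               # leaf: terminal reward observed
        return node["reward"]
    values = []
    for action, child in node["children"].items():
        v = assign_credit(child)
        node.setdefault("action_value", {})[action] = v   # credit for this action
        values.append(v)
    return sum(values) / len(values)       # expectation over sampled branches

# Toy rollout tree for one question: the router chooses among three strategies,
# and the planning branch expands into two sampled sub-trajectories.
tree = {"children": {
    "no_retrieval": {"children": {}, "reward": 0.0},
    "retrieval":    {"children": {}, "reward": 1.0},
    "planning":     {"children": {
        "subquery_A": {"children": {}, "reward": 1.0},
        "subquery_B": {"children": {}, "reward": 0.0},
    }},
}}
assign_credit(tree)
# tree["action_value"] -> {"no_retrieval": 0.0, "retrieval": 1.0, "planning": 0.5}
```

Because sibling branches share their prefix, the sampled trajectories can be reused across actions, which is the redistribution of sampling effort described above.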

W2

We apologize for any confusion regarding the warm-up phase. We would like to clarify several key points:

  • As mentioned in Section 5.2 and Appendix A.2, we collect seed data through rejection sampling from Qwen2-72b-instruct, specifically gathering 2 correct solutions for each question.

  • The detailed training hyper-parameters are provided in Table 4.

  • To further validate the effectiveness, we conducted comparative experiments between C-3PO-RL and C-3PO-ICL in Table 8. These results demonstrate the feasibility of our warm-up strategy.

We hope these clarifications address your concerns about the warm-up phase implementation.

W3

We agree that the evaluation metrics deserve more prominence in the main text. We will move a concise version of the evaluation metrics from Appendix D.2 to Section 6.1, making these important details more accessible while maintaining the paper's flow and readability.

W4

Thank you for raising this important point about evaluation methodology. We would like to address this concern from multiple aspects:

  1. Additional EM Results:

We have conducted additional experiments using the EM metric. The results show that on the EM metric, C-3PO still achieves significant improvements over all baselines:

| Methods | 2Wiki | HQA | Musique | NQ | PopQA | TQA | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Direct | 36.4 | 36.8 | 17.5 | 44.1 | 25.1 | 73.4 | 38.88 |
| Standard | 26.1 | 41.3 | 31 | 52.1 | 38.1 | 73.8 | 43.73 |
| REPLUG | 25.2 | 39.8 | 24 | 43.2 | 37.7 | 74.3 | 40.7 |
| Self-RAG | - | - | - | 41.7 | 40.5 | 74.9 | 52.36 |
| InstructRAG | 45.9 | - | - | 51.6 | 40.9 | 75.6 | 53.5 |
| Auto-RAG | 44.7 | 41.3 | - | 43.8 | 39.2 | 72.1 | 48.22 |
| ReRanker | 29.8 | 37.6 | 19.4 | 47.6 | 20.7 | 73.3 | 38.06 |
| QueryRewrite | 42.9 | 47.3 | 44.5 | 60.6 | 40.3 | 79.1 | 52.45 |
| SKR-KNN | 38.6 | 54.8 | 37.7 | 56.2 | 38.6 | 73.5 | 49.9 |
| SlimPLM | - | - | 19.8 | 57.6 | - | 76.4 | 51.26 |
| C-3PO-0.5B | 60.5 | 61.1 | 50.1 | 65.9 | 52.7 | 80.3 | 61.76 |
| C-3PO-1.5B | 63.7 | 63 | 54.8 | 67.7 | 53.8 | 82 | 64.16 |
  2. Limitations of Traditional Rule-Based Metrics:
  • Through our preliminary studies, we found that rule-based metrics such as EM can be unreliable, especially when working with frozen LLMs that may express correct answers in unpredictable formats.
  • These inaccurate rewards from strict rule-based matching could potentially harm reinforcement learning training.
  3. Adoption of LLM-based Evaluation:
  • Recent prominent benchmarks such as FreshQA and Humanity's Last Exam (HLE) increasingly adopt LLM-based evaluation to capture semantic correctness beyond exact matching.
  • This trend reflects the community's recognition of the limitations of traditional metrics for complex QA tasks.
  4. Reliability of Our Evaluation:

We conducted rigorous human verification of Qwen2-72B-instruct's evaluation capabilities.

We understand the importance of using standardized metrics. However, we believe that combining both traditional and LLM-based evaluation provides a more comprehensive assessment of model performance. We appreciate this feedback and have enhanced our evaluation section accordingly.
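For reference, the kind of strict matching we refer to is sketched below (a simplified SQuAD-style EM check, not our exact evaluation script); it illustrates how a semantically correct but verbose answer from a frozen LLM can score zero:

```python
# Simplified SQuAD-style Exact Match check; illustrates why strict matching can
# penalize semantically correct but differently phrased answers.
import re
import string

def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)       # drop articles
    return " ".join(text.split())                      # collapse whitespace

def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    return any(normalize(prediction) == normalize(g) for g in gold_answers)

print(exact_match("Paris", ["Paris"]))                             # True
print(exact_match("The capital of France is Paris.", ["Paris"]))   # False
```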

References Not Discussed

We sincerely thank you for suggesting these valuable references. We will incorporate these citations and related discussions in our revised manuscript to better position our work within the RAG literature.

We thank you again for your time and effort in reviewing our paper. We believe that addressing these concerns has helped strengthen our work, and we hope our responses have satisfactorily addressed your questions. We look forward to your further feedback.

Final Decision

The paper proposes C-3PO, a proxy-centric framework that facilitates communication between retrievers and LLMs through a lightweight multi-agent system. The reviewers found the method to be interesting and reasonable, yielding strong empirical performance. Some reviewers raised concerns including discussion with similar (and concurrent) works, evaluation metrics, and LLM-based evaluation, which were addressed by the authors quite comprehensively in the rebuttal.

Minor note: Some references cite the arXiv version instead of the conference proceedings version (e.g., RankRAG is a NeurIPS'24 paper and InstructRAG is an ICLR'25 paper); these should be updated.