PaperHub
Average score: 7.3/10 (Poster · 4 reviewers; min 6, max 9, std 1.1)
Individual scores: 6, 7, 7, 9
Confidence: 4.0
COLM 2025

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

OpenReview · PDF
Submitted: 2025-03-20 · Updated: 2025-08-26
TL;DR

An RL framework to train LLMs for interleaved reasoning and retrieval

Abstract

Keywords
reasoning, retrieval, reinforcement learning

Reviews and Discussion

Review
6

In this paper, the authors propose a novel training pipeline that applies reinforcement learning directly to a RAG system, treating answer correctness as the outcome reward. It demonstrates some technical contributions and serves as one of the earliest works exploring RL with LLMs beyond mathematical reasoning. Yet, this work is highly inspired by DeepSeek-R1 and fails to carefully discuss the necessity of employing reinforcement learning without any RAG-like SFT.

Reasons to Accept

  1. Serving as one of the earliest works exploring RL with LLMs in retrieval-based QA, this paper demonstrates some technical contribution and provides useful experience to the community.
  2. The writing is easy to follow and the experimental results are good.

Reasons to Reject

I think the paper is highly influenced by DeepSeek-R1 and fails to address some more important aspects, including but not limited to:

  1. Is it really necessary to perform reinforcement learning directly, without any cold-start SFT? Note that the instruct-version model may also not have been trained on data that invokes a search engine. How do these two approaches perform differently?
  2. Following point 1, a baseline that uses either rejection sampling or knowledge distillation for trajectory collection, followed by SFT or DPO training, is missing.
Comment

We sincerely thank the reviewer for the constructive feedback, which has strengthened our work. Please find our point-by-point responses below:

  • Is it necessary to perform RL without any cold-start SFT data? We agree with the reviewer that incorporating an intermediate cold-start supervised fine-tuning (SFT) stage could potentially improve final performance. However, SFT relies on large-scale, high-quality annotated trajectories of search-and-reasoning interactions, which are costly and difficult to obtain—posing significant challenges to scalability. In this work, we show that such intermediate trajectories can instead be acquired automatically through outcome-only reinforcement learning. The resulting RL-trained model can then be used to generate synthetic data for future cold-start SFT. We leave the systematic exploration of this direction to future work.

  • Instruct-version model may also not be trained on data invoked with a search engine. How do these two fashions perform differently? You are right. The general instructed LLMs are typically not trained on interleaved reasoning and search engine invocation data, and their performance can indeed be improved with more targeted supervision. In this work, we demonstrate that both base and instructed LLMs can learn to perform interleaved reasoning and search behavior through outcome-driven reinforcement learning. Moreover, the resulting RL-trained models can be used to generate high-quality trajectories with explicit search engine usage, which can serve as synthetic data for future supervised fine-tuning.

  • Baseline which performs rejection sampling or knowledge distillation for trajectory collection is missing. In this work, our primary objective is to propose a reinforcement learning (RL)-based method for training large language model (LLM) agents capable of interleaving reasoning and search. While knowledge distillation from a larger teacher model is a possible direction, it introduces additional supervision signals that may lead to unfair comparisons. That said, we agree with the reviewer that rejection sampling offers a reasonable and relevant baseline for comparison. Accordingly, we have included this baseline using both Qwen2.5-3B-it and Qwen2.5-7B-it. Specifically, we generate five candidate responses per training prompt from the Search-R1 dataset and select those that lead to correct final answers. These selected trajectories are then used to construct a new training set that retains the same multi-turn LLM–search engine interaction rollout mechanism proposed in Search-R1. We refer to this variant as Search-R1 (Rejection Sampling). The updated results are presented below, showing that Search-R1 with RL consistently outperforms Search-R1 with rejection sampling across both model sizes.
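
For concreteness, here is a minimal sketch of the rejection-sampling construction just described (the result tables follow). The helpers `rollout_with_search` and `extract_answer` are hypothetical stand-ins for the multi-turn LLM–search rollout and answer extraction, not part of the released code.

```python
# Hedged sketch: build an SFT set via rejection sampling over multi-turn
# search rollouts. `rollout_with_search` and `extract_answer` are assumed
# helper functions, not the authors' implementation.

def exact_match(pred: str, gold: str) -> bool:
    normalize = lambda s: " ".join(s.lower().strip().split())
    return normalize(pred) == normalize(gold)

def build_rejection_sampling_set(examples, rollout_with_search, extract_answer,
                                 n_candidates=5):
    """Sample n_candidates trajectories per prompt and keep only those whose
    final answer is correct; the kept trajectories form the SFT training set."""
    kept = []
    for ex in examples:                                     # ex: {"prompt": ..., "gold": ...}
        for _ in range(n_candidates):                       # five candidates per prompt
            trajectory = rollout_with_search(ex["prompt"])  # full multi-turn text
            pred = extract_answer(trajectory)               # content of <answer>...</answer>
            if pred is not None and exact_match(pred, ex["gold"]):
                kept.append({"prompt": ex["prompt"], "target": trajectory})
    return kept
```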

Qwen2.5-7b-it

| Method | NQ | TriviaQA | PopQA | HotpotQA | 2wikimultihopQA | Musique | Bamboogle | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Direct | 0.1343 | 0.4075 | 0.1402 | 0.1831 | 0.2502 | 0.0314 | 0.1200 | 0.1810 |
| CoT | 0.0481 | 0.1851 | 0.0539 | 0.0915 | 0.1106 | 0.0215 | 0.2320 | 0.1061 |
| IRCoT | 0.2240 | 0.4775 | 0.3009 | 0.1331 | 0.1486 | 0.0715 | 0.2240 | 0.2257 |
| Search-o1 | 0.1507 | 0.4429 | 0.1307 | 0.1873 | 0.1757 | 0.0583 | 0.2960 | 0.2059 |
| RAG | 0.3490 | 0.5847 | 0.3924 | 0.2990 | 0.2348 | 0.0579 | 0.2080 | 0.3037 |
| SFT | 0.3183 | 0.3538 | 0.1208 | 0.2173 | 0.2586 | 0.0662 | 0.1120 | 0.2067 |
| RL w.o. search | 0.2700 | 0.5370 | 0.1990 | 0.2370 | 0.2920 | 0.0720 | 0.2930 | 0.2714 |
| Search-R1 (Rejection Sampling) | 0.3604 | 0.5922 | 0.3797 | 0.3310 | 0.2958 | 0.1233 | 0.3548 | 0.3482 |
| Search-R1 (RL) | 0.3925 | 0.6103 | 0.3965 | 0.3700 | 0.4142 | 0.1456 | 0.3680 | 0.3853 |

Qwen2.5-3b-it

| Method | NQ | TriviaQA | PopQA | HotpotQA | 2wikimultihopQA | Musique | Bamboogle | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Direct | 0.1058 | 0.2879 | 0.1075 | 0.1491 | 0.2442 | 0.0199 | 0.0240 | 0.1341 |
| CoT | 0.0227 | 0.0324 | 0.0045 | 0.0213 | 0.0208 | 0.0024 | 0.0000 | 0.0149 |
| IRCoT | 0.1110 | 0.3117 | 0.2002 | 0.1636 | 0.1713 | 0.0666 | 0.2400 | 0.1806 |
| Search-o1 | 0.2382 | 0.4723 | 0.2617 | 0.2211 | 0.2180 | 0.0538 | 0.3200 | 0.2550 |
| RAG | 0.3485 | 0.5441 | 0.3866 | 0.2551 | 0.2256 | 0.0472 | 0.0800 | 0.2696 |
| SFT | 0.2490 | 0.2923 | 0.1036 | 0.1857 | 0.2478 | 0.0443 | 0.1120 | 0.1764 |
| RL w.o. search | 0.2100 | 0.4490 | 0.1710 | 0.2080 | 0.2750 | 0.0600 | 0.1920 | 0.2236 |
| Search-R1 (Rejection Sampling) | 0.2942 | 0.4879 | 0.3324 | 0.2396 | 0.2327 | 0.0588 | 0.2097 | 0.2650 |
| Search-R1 (RL) | 0.3410 | 0.5451 | 0.3784 | 0.3244 | 0.3193 | 0.1027 | 0.2640 | 0.3250 |

Comment

Thanks to the authors for their efforts in clarification and for the additional results. I think the newly updated results are strong.

Comment

Thank you very much for your thoughtful feedback and for recognizing the strength of the newly updated results! We're glad to hear that our clarifications addressed your concerns effectively. We truly appreciate your consideration and would be grateful if you could consider reflecting this in your final score. We are happy to continue the discussion if you have any other questions!

Comment

Hi Reviewer cUvm,

As we approach the discussion deadline, we wanted to sincerely thank you again for your thoughtful feedback. We're grateful that you found the newly updated results strong, and we hope this addresses your concerns effectively.

If possible, we would deeply appreciate it if your more positive view could be reflected in your final score. Of course, we remain happy to discuss any remaining questions you might have before the deadline.

Thank you again for your time and consideration!

Review
7

The paper trains LLMs with reinforcement learning (RL) to interleave chain-of-thought reasoning and search. It prompts the model to use special tokens to enclose reasoning, search, and final answer sections. Search results are provided to the model as an additional section, and loss during the RL phase is masked on search results, as these cannot be influenced by the model. RL uses either proximal policy optimization (PPO) or group relative policy optimization (GRPO), with a simple exact-match reward function on the result. The setup is evaluated on 7 question-answering (QA) datasets with two Qwen-2.5 model variants (3B and 7B) in base and instruct versions. The results show clear improvements over RAG-only and reasoning-only baselines.
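
To make the summarized setup concrete, here is a minimal sketch of the interleaved rollout and the rule-based exact-match reward. It assumes a `generate_until` function that returns decoded text up to and including the stop tag it hit and a `search` function wrapping the retriever; both names, the prompt wording, and the turn budget are illustrative assumptions, not the paper's implementation.

```python
import re

# Hedged sketch of the interleaved reasoning-and-search rollout and the
# exact-match outcome reward. `generate_until` and `search` are assumed
# interfaces; tag names follow the paper's prompt format.

def rollout(question, generate_until, search, max_turns=4):
    context = f"Question: {question}\n"
    for _ in range(max_turns):
        segment = generate_until(context, stop=["</search>", "</answer>"])
        context += segment
        if "</answer>" in segment:                 # final answer produced, stop
            break
        query = re.search(r"<search>(.*?)</search>", segment, re.DOTALL)
        if query:                                  # retrieved tokens are loss-masked in RL
            docs = search(query.group(1).strip())
            context += f"<information>{docs}</information>\n"
    return context

def outcome_reward(rollout_text, gold_answer):
    """Rule-based reward: exact match on the content of <answer>...</answer>."""
    m = re.search(r"<answer>(.*?)</answer>", rollout_text, re.DOTALL)
    if m is None:
        return 0.0
    normalize = lambda s: " ".join(s.lower().strip().split())
    return 1.0 if normalize(m.group(1)) == normalize(gold_answer) else 0.0
```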

Reasons to Accept

  • Very reasonable overall setup with positive results
  • Experiments look mostly sound, with multiple strong baselines
  • Mostly clear writing

Reasons to Reject

  • The evaluation is relatively limited (two small variants of a single LLM, just the QA task)
  • The paper claims the training is stable, but doesn't actually show results for multiple runs
  • The training of the base vs. instruct variant seems a little inconsistent (base wins for 7B but instruct wins for 3B, the paper doesn't comment on this)
  • I personally don't like the name grab of R1; the model is not based on DeepSeek and the experiments cover only QA

(Most of these were cleared by the author response, hence my updated score, see comment below)

Questions to Authors

  • In Algorithm 1, why do you explicitly force the model to decode "my action is not correct" -- you can't know that at this point, right?
  • Any intuition on how this will work with larger models (30B, 70B) and models of a different family?

Minor comments:

  • Fig. 1 deserves a more detailed commentary and should be referenced from the text (I didn't find any reference)
  • The sentence "An illustration of the rollout proces..." (pg. 4) is unclear, did you mean to add "respectively" at the end?
  • The text in Sect. 3.3 starts by implying a single-turn process, whereas the prompt clearly shows it's multi-turn – please rephrase.
  • The sentence "For R1, we train..." on pg. 6 is unclear.
  • I know it's because you get better results with the 7B model, but it feels like base vs. instruct is not so prominent to warrant a space in Table 2. It's also confusing because we don't know which baseline is "base" and which is "instruct".
  • You say in Sect. 5.3 that the training reward decreases in the first 100 steps. The chart looks more like it stagnates or grows slightly.
  • What's a "valid search" in Fig. 2? Does it just mean the number of times the model produces <search></search> tokens, or do you check if the search gets any results?
  • When you refer to Fig. 3 in Sect. 5.4, you should mention it's in the appendix (and perhaps stress the link to Table 4 more).
Comment

We sincerely thank the reviewer for the constructive feedback, which has strengthened our work. Please find our point-by-point responses below:

  • The evaluation is only on two small variants of a single LLM and just QA task. Thank you for the valuable comments. In response, we have extended our evaluation to include additional LLMs—specifically, a 32B-scaled model and LLaMA-type models—as well as long-form generation tasks beyond QA. The results show that Search-R1 consistently achieves strong performance across different model architectures and a broader range of task types, demonstrating its generalizability and robustness.

Results on other LLMs

Qwen2.5-32B

| Method | NQ | TriviaQA | PopQA | HotpotQA | 2wikimultihopQA | Musique | Bamboogle | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Direct Inference | 0.2169 | 0.5576 | 0.1915 | 0.2361 | 0.2661 | 0.0517 | 0.1440 | 0.2377 |
| CoT | 0.2252 | 0.5584 | 0.1936 | 0.2592 | 0.2954 | 0.0806 | 0.5040 | 0.3023 |
| IRCoT | 0.3058 | 0.6074 | 0.3382 | 0.3616 | 0.4150 | 0.1812 | 0.5280 | 0.3910 |
| Search-o1 | 0.2202 | 0.5364 | 0.1615 | 0.1764 | 0.0366 | 0.0604 | 0.3920 | 0.2262 |
| RAG | 0.3742 | 0.6177 | 0.4089 | 0.3230 | 0.2446 | 0.0736 | 0.2240 | 0.3237 |
| SFT | 0.3668 | 0.5186 | 0.1682 | 0.2606 | 0.2728 | 0.0993 | 0.1520 | 0.2626 |
| Search-R1 | 0.4922 | 0.6686 | 0.4769 | 0.4524 | 0.4546 | 0.2305 | 0.5565 | 0.4760 |

Llama3.2-3B

| Method | NQ | TriviaQA | PopQA | HotpotQA | 2wikimultihopQA | Musique | Bamboogle | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Direct Inference | 0.1391 | 0.3682 | 0.1238 | 0.1218 | 0.1066 | 0.0153 | 0.0640 | 0.1341 |
| CoT | 0.2462 | 0.4866 | 0.1655 | 0.0510 | 0.0827 | 0.0057 | 0.0240 | 0.1517 |
| IRCoT | 0.3626 | 0.5655 | 0.4282 | 0.2376 | 0.2359 | 0.0719 | 0.2080 | 0.3014 |
| Search-o1 | 0.1075 | 0.2034 | 0.0929 | 0.1319 | 0.1168 | 0.0348 | 0.1760 | 0.1233 |
| RAG | 0.3172 | 0.5510 | 0.3371 | 0.2339 | 0.1179 | 0.0343 | 0.0640 | 0.2365 |
| SFT | 0.3197 | 0.3411 | 0.1220 | 0.2062 | 0.2571 | 0.0641 | 0.1200 | 0.2043 |
| Search-R1 | 0.3567 | 0.5776 | 0.3778 | 0.3143 | 0.2330 | 0.0902 | 0.3065 | 0.3223 |

Long-form generation task

| Method | Qwen2.5-3b (ASQA) | Qwen2.5-3b (ELI5) | Qwen2.5-3b (Avg) | Qwen2.5-7b (ASQA) | Qwen2.5-7b (ELI5) | Qwen2.5-7b (Avg) | Qwen2.5-14b (ASQA) | Qwen2.5-14b (ELI5) | Qwen2.5-14b (Avg) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| direct | 0.2513 | 0.1988 | 0.2250 | 0.3033 | 0.2012 | 0.2523 | 0.2889 | 0.1990 | 0.2439 |
| RAG | 0.3011 | 0.1927 | 0.2469 | 0.3170 | 0.2018 | 0.2594 | 0.2848 | 0.1932 | 0.2390 |
| R1 | 0.4244 | 0.2747 | 0.3495 | 0.4367 | 0.2795 | 0.3581 | 0.4442 | 0.2784 | 0.3613 |
| Search-R1 (PPO) | 0.4801 | 0.2607 | 0.3704 | 0.4709 | 0.2558 | 0.3633 | 0.4421 | 0.2601 | 0.3511 |
| Search-R1 (GRPO) | 0.4920 | 0.2716 | 0.3818 | 0.5043 | 0.2747 | 0.3895 | 0.5008 | 0.2729 | 0.3868 |

  • No results to show that training is stable. We repeated the experiment three times and observed that the training reward curves are highly consistent, exhibiting only minor variance. Additionally, the standard deviation of the final performance on the test set is only 0.008, indicating the stability and reproducibility of our training process.

  • Comparison between base and instruct. Our key observations are as follows: (1) Instruct-tuned LLMs exhibit stronger initial performance and converge more quickly during training; (2) Both base and instruct models ultimately reach similar training reward levels. However, due to the lack of transparency regarding the pretraining and supervised fine-tuning (SFT) data used for Qwen2.5 3B and 7B, it is difficult to make definitive conclusions about their final performance differences. If the SFT data includes reasoning or tool-calling demonstrations, this would naturally give instruct models an advantage in reinforcement learning, potentially leading to better final performance compared to base models. We will clarify this in the revised manuscript.

  • The name of R1. We will modify the name to “R1-style RL” according to your suggestion.

  • Algorithm 1 (force decoding). This issue typically occurs when the LLM fails to properly enclose a query or final answer within the expected special tokens. To mitigate this, we insert a prompt that subtly suggests the previous generation may be incorrect, encouraging the LLM to engage in self-reflection and revise its output accordingly (a minimal sketch of this retry handling follows this list).

  • Minor comments on Figure 1, page 4, Sect. 3.3, page 6, Table 2, and Sect. 5.4. We will revise them according to your suggestions.

  • What is “valid search”? A valid search is counted when the search engine is successfully called via the LLM-generated special tokens (a counting sketch follows this list).
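
A minimal sketch of the retry handling and the “valid search” counting referenced in the two bullets above. The retry message wording (beyond the “my action is not correct” fragment quoted from Algorithm 1) and the function names are illustrative assumptions.

```python
import re

# Hedged sketch: retry when the LLM closes neither <search> nor <answer>, and
# count "valid searches". `generate_until` is an assumed decoding helper that
# returns text including the stop tag it hit.

RETRY_HINT = ("\nMy action is not correct. I should enclose the search query in "
              "<search></search> or the final answer in <answer></answer>.\n")

def generate_step(context, generate_until):
    """Decode one segment; if neither tag is closed, append a self-reflection
    hint and let the model retry once."""
    segment = generate_until(context, stop=["</search>", "</answer>"])
    if "</search>" not in segment and "</answer>" not in segment:
        context = context + segment + RETRY_HINT
        segment = generate_until(context, stop=["</search>", "</answer>"])
    return context + segment

def count_valid_searches(rollout_text):
    """A search counts as valid when a well-formed <search>...</search> span is
    emitted, i.e., the search engine actually gets called."""
    return len(re.findall(r"<search>.*?</search>", rollout_text, re.DOTALL))
```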

Comment

Thank you for your extensive response and for adding results with additional LLMs and tasks. Assuming these changes get incorporated to the final paper, I've updated my score to 7.

Comment

Thank you very much for your thoughtful follow-up and for updating your score. We're glad the additional results and clarifications addressed your concerns. We will make sure that all the discussed changes are properly incorporated into the final version of the paper.

Review
7

This paper proposes to incorporate retrieved tokens from external search engines in the rollout stage of reinforcement learning (PPO and GRPO). It also introduces a "loss masking" strategy to exclude retrieved tokens from model optimization, aiming to stabilize the RL training process.

  • Overall Quality: 4 / 5
  • Clarity: 3.5 / 5
  • Originality: 3.5 / 5
  • Significance: 3.5 / 5

Reasons to Accept

  1. The proposed method is novel and well-elaborated.
  2. The paper is well-written, despite formatting issues.
  3. This paper provides a new approach, other than RAG and tool-using, to utilize searching for reasoning.
  4. The experimental results on multiple QA benchmarks demonstrate the effectiveness of the proposed method.

Reasons to Reject

  1. The implementation is preliminary. E.g., the simple training template and the rule-based reward model.
  2. Some questions and concerns to clarify. Please refer to Questions To Authors.
  3. Format issues of the paper.
    • The paper style (\usepackage[xxx]) is not "submission" but "preprint" or "final". Thus, there are no line numbers to refer to.
    • Table 2/3/4: The captions should be below the tables.

Questions to Authors

  1. About "Retrieved Tokens Loss Masking":
    • Q1.1: To clarify, the retrieved tokens ("<information>d</information>") are masked out during training, while all of them actively serve as the input of the policy model $\pi_\theta$ during the rollout stage (as in Algorithm 1, Line 6), right?
    • Q1.2: Although the loss masking strategy stabilizes the training (Section 5.4), it encourages the model to learn to answer the question without any retrieved information (which is masked/skipped). Thus, the training process does not enhance the information retrieval ability of the model if Loss Masking is applied, right?
  2. Equation 2 and Equation 3: It looks better to replace "$\sum_{t=1:\,I(y_t)=1}^{|y|} \min(...)$" with "$\sum_{t=1}^{|y|} I(y_t)\,\min(...)$" (the two forms are written out after this list).
  3. Section 3 (Algorithm 1):
    • Q3.1: According to the algorithm, the rollout sequence may end up without a final answer (wrapped by <answer> </answer>). What is the ratio of such sequences, and how does this case affect the training results?
    • Q3.2: What does the Parse function exactly do? At least it will remove "<search>" and "</search>" tokens in $y_b$ and extract tokens between them, right?
    • Q3.3: Where is the prompt in Table 1 placed? Is it at the beginning of $x$ in Algorithm 1? Will the input query $x$ change during the rollout stage of the current question?
  4. About the evaluation:
    • Q4.1: During inference, does Search-R1 still work like the rollout stage, where the model generates search queries and calls search engines before answering?
    • Q4.2: Some datasets in the experiment have provided context for searching. Hence, it seems unnecessary to call external search engines (as in Search-R1) to solve those tasks.
  5. Will the rollout data, training code, and trained model be open-source?
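
Rendered for readability, the two summation styles contrasted in point 2 denote the same retrieved-token-masked sum, sketched below assuming the inner term is the standard PPO clipped surrogate with probability ratio $r_t(\theta)$ and advantage $\hat{A}_t$, and with $I(y_t)=1$ marking LLM-generated (non-retrieved) tokens; this is a notation sketch, not the paper's exact Equation 2/3.

```latex
% Notation sketch: the indexed-sum form (left) and the indicator form (right)
% denote the same sum over LLM-generated tokens.
\sum_{t=1:\, I(y_t)=1}^{|y|}
  \min\!\Big(r_t(\theta)\,\hat{A}_t,\;
             \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\hat{A}_t\Big)
\;=\;
\sum_{t=1}^{|y|} I(y_t)\,
  \min\!\Big(r_t(\theta)\,\hat{A}_t,\;
             \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\hat{A}_t\Big)
```
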
Comment

We appreciate your insightful feedback and believe it has significantly strengthened our manuscript. We have carefully addressed each of your comments as detailed below:

  • The implementation is preliminary. Given that reinforcement learning for interleaved reasoning and search in LLM agents remains underexplored, we intentionally begin with a clean and simple setting to establish a strong foundation. It is encouraging to observe that Search-R1 proves to be a simple yet effective RL methodology in this context. We view this as a promising initial point, and future works can explore more complex instructions and advanced reward designs (e.g., neural reward models).

  • Format. We are using the “submission” option but, apologies, did not enable “\ifcolmsubmission”. We will fix it accordingly.

  • Q1.1. Yes, they serve as inputs for policy rollout.

  • Q1.2. To clarify, when computing the token-level loss, we mask out the retrieved tokens and only include the LLM-generated tokens in the loss calculation. However, the logits for these generated tokens are still conditioned on the retrieved tokens, as the retrieved content is provided as input during optimization. This setup ensures that while the model leverages retrieved information to inform its generation, it is explicitly trained to write effective queries and perform reasoning over the retrieved content, rather than memorizing external information (a minimal sketch of this masking appears after this list).

  • Q2. Thank you for the comments. We will make modifications accordingly.

  • Q3.1. We conducted a study on Search-R1 using both Qwen2.5-3B and Qwen2.5-7B. The ratio of such sequences is 1.32% and 4.98%, respectively. Notably, this ratio is influenced by the maximum action budget B: as B increases, the ratio tends to decrease. A larger B also improves final performance, but at the cost of efficiency, as it leads to longer rollouts.

  • Q3.2. You are right. It will remove the <search> </search> tokens and extract the query in between (see the sketch after this list).

  • Q3.3. Yes, the prompt in Table 1 is placed at the beginning of x. The input query x will not be changed during rollout.

  • Q4.1. Yes, Search-R1 conducts interleaved reasoning and search engine calling during inference.

  • Q4.2. In our experiments, we focus on the challenging open-domain setting where only the question is provided, without any additional context. This requires the LLMs to actively retrieve relevant information on their own. We will clarify this setup in the revised manuscript for better transparency.

  • Q5. Sure, all the resources will be open-sourced.
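
Minimal sketches of the Parse step (Q3.2) and the retrieved-token loss mask (Q1.2) referenced above. Tokenization and tag handling are illustrative assumptions, not the released implementation.

```python
import re

# Hedged sketches: (a) Parse -- strip <search></search> and return the query;
# (b) a token-level mask that zeros out the retrieved <information> span so only
# LLM-generated tokens enter the PPO/GRPO loss, while all tokens still condition
# the policy's logits.

def parse_search_query(segment: str):
    """Remove the <search> </search> tokens and return the query in between."""
    m = re.search(r"<search>(.*?)</search>", segment, re.DOTALL)
    return m.group(1).strip() if m else None

def retrieved_token_mask(tokens):
    """mask[t] = 1 for LLM-generated tokens, 0 for tokens inside the
    <information>...</information> span appended by the environment."""
    mask, inside = [], False
    for tok in tokens:
        if tok == "<information>":
            inside = True
        mask.append(0 if inside else 1)
        if tok == "</information>":
            inside = False
    return mask

# Example: ["<search>", "capital", "of", "France", "</search>",
#           "<information>", "Paris", "is", "the", "capital", "</information>",
#           "<answer>", "Paris", "</answer>"]
# -> mask = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1]
```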

Comment

Thanks for the reply. This work looks solid to me. Please include the mentioned points in the revision.

Comment

Thank you for your feedback! We appreciate your positive assessment and will incorporate the mentioned points into the revised version.

Review
9

The paper presents a framework for including RAG (particularly search using search engines) during the RL phase of LLM training. The paper, titled Search-R1, discusses the details of this and the challenges overcome. In this framework, the authors generate (multiple) search queries during step-by-step reasoning with real-time retrieval and generate rewards for them for RL training. It models the search engine as part of the environment and applies retrieved-token masking to stabilize PPO/GRPO training.

The paper uses Qwen2.5 model family to run their experiments. On seven QA datasets (mix of in-domain and out-of-domain) the authors show that Search-R1 improves performance by 41% (Qwen2.5-7B) and 20% (Qwen2.5-3B) over various RAG baselines under the same setting.

The paper also offers experimental insights into RL optimization methods, LLM choices, and response length dynamics in retrieval-augmented reasoning.

Reasons to Accept

I think this paper is a pretty strong paper. Being able to model the search environment during RL has been of big importance for all practical LLM systems.

1/ This approach gets strong empirical gains -- it achieves a 20%-41% EM improvement over RAG baselines

2/ They introduce new tokens to structure the response. The generations are structured with <think></think>, <search></search>, <information></information>, and <answer></answer> tokens for clear multi-turn reasoning

3/ They add stability to the RL integration by retrieved-token masking. This boosts performance (e.g., EM jumps from 0.343 to 0.431 on 7B)

4/ This approach seems to generalize across instruction-tuned and non-instruction-tuned models

5/ The paper does thorough ablations comparing PPO vs. GRPO, study of response length dynamics, valid-search behavior

Reasons to Reject

1/ Would have been interesting to see any performance gains on bigger models (Llama models?) to see if the gains hold there as well

2/ It would be interesting to see how performance varies based on choice of knowledge base (search vs wikipedia vs knowledge graph)

Comment

We appreciate your insightful feedback, which has significantly strengthened our manuscript. We address each comment below.

  • Performance on bigger models (llama models). Thank you for your feedback. In response to your request, we have added results on both the Qwen2.5-32B-Base and LLaMA3.2-3B-Base models.

Qwen2.5-32B-Base

| Method | NQ | TriviaQA | PopQA | HotpotQA | 2wikimultihopQA | Musique | Bamboogle | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Direct Inference | 0.2169 | 0.5576 | 0.1915 | 0.2361 | 0.2661 | 0.0517 | 0.1440 | 0.2377 |
| CoT | 0.2252 | 0.5584 | 0.1936 | 0.2592 | 0.2954 | 0.0806 | 0.5040 | 0.3023 |
| IRCoT | 0.3058 | 0.6074 | 0.3382 | 0.3616 | 0.4150 | 0.1812 | 0.5280 | 0.3910 |
| Search-o1 | 0.2202 | 0.5364 | 0.1615 | 0.1764 | 0.0366 | 0.0604 | 0.3920 | 0.2262 |
| RAG | 0.3742 | 0.6177 | 0.4089 | 0.3230 | 0.2446 | 0.0736 | 0.2240 | 0.3237 |
| SFT | 0.3668 | 0.5186 | 0.1682 | 0.2606 | 0.2728 | 0.0993 | 0.1520 | 0.2626 |
| Search-R1 | 0.4922 | 0.6686 | 0.4769 | 0.4524 | 0.4546 | 0.2305 | 0.5565 | 0.4760 |

Llama3.2-3B-Base

| Method | NQ | TriviaQA | PopQA | HotpotQA | 2wikimultihopQA | Musique | Bamboogle | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Direct Inference | 0.1391 | 0.3682 | 0.1238 | 0.1218 | 0.1066 | 0.0153 | 0.0640 | 0.1341 |
| CoT | 0.2462 | 0.4866 | 0.1655 | 0.0510 | 0.0827 | 0.0057 | 0.0240 | 0.1517 |
| IRCoT | 0.3626 | 0.5655 | 0.4282 | 0.2376 | 0.2359 | 0.0719 | 0.2080 | 0.3014 |
| Search-o1 | 0.1075 | 0.2034 | 0.0929 | 0.1319 | 0.1168 | 0.0348 | 0.1760 | 0.1233 |
| RAG | 0.3172 | 0.5510 | 0.3371 | 0.2339 | 0.1179 | 0.0343 | 0.0640 | 0.2365 |
| SFT | 0.3197 | 0.3411 | 0.1220 | 0.2062 | 0.2571 | 0.0641 | 0.1200 | 0.2043 |
| Search-R1 | 0.3567 | 0.5776 | 0.3778 | 0.3143 | 0.2330 | 0.0902 | 0.3065 | 0.3223 |

The results demonstrate that Search-R1 consistently outperforms strong baseline methods across various model sizes (3B, 7B, 32B) and architectures (Qwen2.5 and LLaMA3.2), highlighting its robustness and generalizability.

  • How performance varies based on the choice of knowledge base or search engine. This is a great point. We conduct additional experiments to study how the choice of search engine (type of retriever + knowledge source) impacts both the training and inference performance of Search-R1.

Impact during training. We evaluate four search engines: (a) Random noise, (b) BM25 + Wikipedia, (c) E5 (ANN) + Wikipedia, and (d) E5 (Exact match) + Wikipedia. Results are shown in the table below. Key findings include: (1) Training with stronger retrievers (e.g., E5 (Exact) and E5 (ANN)) leads to more stable reinforcement learning and better final performance. (2) In contrast, using weaker retrievers (e.g., Random and BM25) significantly limits the final model performance.

| Search Engine | NQ | TriviaQA | PopQA | HotpotQA | 2wikimultihopQA | Musique | Bamboogle | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| random | 0.2366 | 0.4941 | 0.1771 | 0.2170 | 0.2688 | 0.0583 | 0.2339 | 0.2408 |
| BM25 | 0.3413 | 0.6070 | 0.3217 | 0.4043 | 0.3703 | 0.1370 | 0.2800 | 0.3517 |
| E5 (ANN) | 0.4676 | 0.6214 | 0.3662 | 0.3723 | 0.2872 | 0.1374 | 0.4000 | 0.3789 |
| E5 (Exact) | 0.4806 | 0.6379 | 0.4571 | 0.4328 | 0.3820 | 0.1957 | 0.4240 | 0.4300 |

Impact during inference. We further evaluate how models trained with different search engines generalize across various inference-time retrievers. The results (table below) show: (1) Search-R1 exhibits strong generalization across retrievers: even when trained with a specific search engine, it performs reasonably well when tested with others. (2) More powerful retrievers at inference time—particularly Google Search (via API)—consistently yield the best results, underscoring the importance of high-quality retrieval in downstream tasks.

| Train / Test Retriever | bm25 | e5 ANN | e5 flat | Google Search |
| --- | --- | --- | --- | --- |
| random | 0.0317 | 0.0317 | 0.0317 | 0.0317 |
| BM25 | 0.2434 | 0.1587 | 0.2593 | 0.5397 |
| E5 ANN | 0.2698 | 0.1693 | 0.2540 | 0.6032 |
| E5 Exact | 0.2487 | 0.1958 | 0.2646 | 0.6032 |
| avg | 0.2540 | 0.1746 | 0.2593 | 0.5820 |

Comment

Thanks for the clarifications to my questions.

Comment

Thank you for your thoughtful review and support of our work! We appreciate your engagement with our work.

Final Decision

The authors study applying PPO and GRPO to a fairly standard multi-hop LLM system. Overall, the authors find that their approach is successful at teaching small Qwen models to improve, learning entirely (as far as I can tell) from final-answer rewards.

There's a very long history of learning from final-answer rewards for these types of systems via rejection sampling, with no obvious first such work. However, this was raised and addressed during the review discussion period. This is a CRUCIAL baseline that has to be added in the paper, since every other baseline is an unadapted system that does not even get to learn from any feedback! Other related approaches include learning from comparisons among rollouts (LeReT; Hsu et al., 2024), prompt optimization, or self-bootstrapped finetuning of such agents/systems (which is conceptually akin to the rejection sampling baseline here). There's also a long line of work on tuning these systems in more elaborate ways, like MDR, IRRR, Baleen, etc., that is simply absent from the paper's discussions.

Overall, and given the unanimous vote to accept by the reviewers, I also recommend acceptance, assuming the authors will surface the baseline (rejection sampling finetuning, or filtered behavior cloning) and the related work discussions in their camera ready.