[W5] Minor issues:

Thank you for pointing these out, we have modified the notations as per your suggestion in the updated version. We would also like to clarify that while we sample n=32 responses for each query, it can be used to create (32 choose 2) pairs, of which we randomly sample 10 for training the LLM.

[Q1] Could the authors elaborate the definition of em and the method for generating the sentence embedding from the model ? Which part of Figure 1 corresponds to this process?

To extract embeddings for a query using , we first process the input query through the policy model . We use the embedding of the last token in the query as the representation for the query. The embedding is then used as input to the subsequent bandit algorithm. Thank you for raising this and we have added this clarification in Appendix A.1. We do not show the step converting each query into an embedding Fig. 2 for easier visualization, but we highlight it in Sec 3.2 and Appendix A.1 lines 863-866 (updated pdf).

[Q2] Could the authors provide insights into why Random RM selection often outperforms the Offline RM Ensemble and why Sequential RM selection surpasses Random RM selection?

The RM Ensemble combines the output scores of multiple RMs, some of which may be noisy (used in an out-of-distribution setting) and provide conflicting signals (see Fig 5., Sec 5. Lines 473-495 in the updated pdf), overall making it less effective for training. In contrast, Random RM selection uses only one RM at a time, avoiding these conflicts. While Random RM selection may not always pick the optimal RM for a query, the absence of conflicting signals generally leads to a cleaner training signal compared to the noisy ensemble approach. We have added this clarification in lines 493-495 of Sec. 5 in our revised paper.

Sequential RM selection systematically explores different RMs in a structured manner, ensuring that each RM contributes to the training process over time. This exploration reduces the risk of over-relying on poor-performing RMs for extended periods, providing a more balanced and diverse training signal. On the other hand, Random RM selection explores the RMs without such structure, which means it might choose multiple poorly performing RMs consecutively, leading to suboptimal training signals early on and potentially degrading overall performance.