PaperHub
Overall rating: 5.0 / 10 — Rejected (4 reviewers)
Ratings: 3, 5, 6, 6 (min 3, max 6, std 1.2)
Confidence: 3.3
Correctness: 2.8 · Contribution: 2.5 · Presentation: 2.8
ICLR 2025

Sample Efficient Alignment for LLMs

OpenReview | PDF
Submitted: 2024-09-27 · Updated: 2025-02-05

Abstract

Keywords: rlhf, online dap, llm alignment, sample efficiency

Reviews and Discussion

Official Review
Rating: 3

This paper frames the problem of aligning LLMs with pairwise preferences as a contextual dueling bandit problem. The authors propose a method that adapts a well-known and effective bandit algorithm, Thompson sampling (TS), to LLMs. This is not trivial, because TS requires maintaining and sampling from a belief over reward functions. To address this, the authors provide tractable approximations. They evaluate their proposed approach empirically on LLMs of various sizes, and provide comparisons to related work.

Strengths

  • The contextual dueling perspective is interesting, insightful, and to the best of my knowledge, new.
  • The paper does a good job of covering the most relevant related literature, and how it relates to the proposed approach.
  • The description is very limited, with no details in the supplementary material, but it is possible that the testbed they develop to evaluate different approaches is a valuable contribution in its own right.

Weaknesses

  • The writing could be improved significantly; parts of the paper are unclear, imprecise, and many details are omitted and should at least be present in the supplementary materials. A non-exhaustive list of examples:
    • line 144: "based on a binary stochastic feedback z" -> z is never referred to anywhere outside of this line as far as I see
    • line 196: "offline RL" -> I think the authors might be conflating offline RL with offline preference collection. RLHF typically uses online RL. PPO is an online algorithm.
    • I struggle to make sense of Figure 3, for example I fail to understand how (d) is meant to differ from (b).
  • The posterior approximation scheme (Eq. 9) is not well motivated. I can see it is computationally attractive, but why do we believe this is a reasonable way to approximate the posterior? A cursory glance at the papers mentioned around Eq. 9 did not help. Since this is one of the central components of the proposed approach, I would expect a more rigorous justification here.
  • The experiments are comprehensive, but I struggle to take them at face value. Unfortunately, few details are provided, neither in the main text, nor in the supplementary materials.
    • There are design decisions involved in setting up baselines, and these are not discussed.
    • I can't reconcile any of the curves in Figure 6 with the curves in Figure 5.
  • Some of the metrics are problematic, as I see it. The "online win rate" (line 387) would make a model that deliberately samples poor responses look great.

Questions

Concretely, in what ways does SEA differ from APL? How do you explain the large increase in sample efficiency?

Comment

W4: The "online win rate" would make a model that deliberately samples poor responses look great.

The online win rate measures the average quality of the two responses that we select to query the oracle (e.g., humans) for preference labeling. It is calculated as $\frac{1}{2}\left(\mathbb{P}(y_1 \succ y_\mathrm{ref}) + \mathbb{P}(y_2 \succ y_\mathrm{ref})\right)$, which is the average probability that the selected responses $y_1$ and $y_2$ are better than a reference response $y_\mathrm{ref}$.

If a model deliberately selects poor responses, the online win rate would decrease significantly, as the responses $y_1$ and $y_2$ would likely fail to beat the reference response.
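For concreteness, below is a minimal sketch of how this metric could be computed from per-query oracle win probabilities; the function name and its inputs are illustrative and not part of the paper's codebase.

```python
import numpy as np

def online_win_rate(p_y1_beats_ref, p_y2_beats_ref):
    """Average probability that the two responses selected for oracle
    labeling beat a fixed reference response (illustrative helper)."""
    return 0.5 * (np.mean(p_y1_beats_ref) + np.mean(p_y2_beats_ref))

# If a model deliberately selected poor responses, these probabilities
# would be small and the metric would drop accordingly.
print(online_win_rate([0.10, 0.20], [0.15, 0.10]))  # -> 0.1375
```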


Q1: In what ways does SEA differ from APL? How do you explain the large increase in sample efficiency?

APL does not explicitly model the uncertainty of the reward, but relies solely on two heuristics to actively filter datasets for DPO training: 1) use predictive entropy to filter for prompts that are uncertain; 2) use the DPO implicit reward to filter for paired responses with a large reward margin (please see lines 927-938 for more details on APL).

In contrast, SEA maintains an estimate of the epistemic uncertainty of the reward, which is then used for Thompson sampling to actively explore more informative responses to query the oracle. SEA explores in a more principled way, building on the theoretical intuitions behind Thompson sampling from prior works [4,5].
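To illustrate the contrast, the following sketch shows the spirit of Thompson-sampling-based pair selection with an ensemble posterior; `reward_ensemble` is assumed to be a list of scoring callables, and all names are hypothetical rather than SEA's actual implementation.

```python
import random

def select_pair_ts(prompt, candidates, reward_ensemble):
    """Draw one reward function from the (approximate) posterior for each
    slot and pick the candidate that the drawn function scores highest."""
    head1 = random.choice(reward_ensemble)  # first posterior sample
    head2 = random.choice(reward_ensemble)  # second posterior sample
    y1 = max(candidates, key=lambda y: head1(prompt, y))
    rest = [y for y in candidates if y is not y1] or candidates
    y2 = max(rest, key=lambda y: head2(prompt, y))
    return y1, y2  # the pair sent to the preference oracle
```

The informativeness of the queried pair comes from the posterior spread: responses on which the ensemble members disagree are more likely to be selected under different draws.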


Thank you again for your detailed feedback, which helps us to improve the overall quality of our paper. We hope our responses and revised paper address any remaining concerns. Please do let us know if you have any further questions, as well as what would be needed for a score improvement.


[1] Gao, L., Schulman, J., & Hilton, J. (2023, July). Scaling laws for reward model overoptimization. In International Conference on Machine Learning (pp. 10835-10866). PMLR.

[2] Lakshminarayanan, B., Pritzel, A., & Blundell, C. (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in neural information processing systems, 30.

[3] Dwaracherla, V., Asghari, S. M., Hao, B., & Van Roy, B. (2024). Efficient exploration for LLMs. arXiv preprint arXiv:2402.00396.

[4] Wu, H., & Liu, X. (2016). Double thompson sampling for dueling bandits. Advances in neural information processing systems, 29.

[5] González, J., Dai, Z., Damianou, A., & Lawrence, N. D. (2017, July). Preferential bayesian optimization. In International Conference on Machine Learning (pp. 1282-1291). PMLR.

Comment

Thank you very much for your detailed feedback and suggestions. We have revised the paper with new experimental results and improved writing.

Note: We've marked fdaL next to relevant changes in the revised paper to make it easier for you to navigate.


W1: The writing could be improved.

Thank you for pointing this out. We have comprehensively polished the paper, and we believe it is now clearer than the first version. In particular, we have:

  • polished the introduction by providing more background (lines 62-73) and a concise summary of our work (lines 75-85).
  • updated Figure 2 to make the overall picture of LLM alignment clearer.
  • provided more details about our experimental settings, especially the baselines (Appendices C and D).
  • reorganized the structure to improve the flow of the paper's presentation, while keeping the methods and results intact.

We address your specific questions below:

W1.1: line 144: "based on a binary stochastic feedback z" -> z is never referred to anywhere outside of this line as far as I see

A1.1: $z$ was briefly referred to in Figure 2 of the original PDF, but we admit that it was not clear. In the revised PDF, we have updated Figure 2 to be clearer, and we now refer to $z_t$ in multiple places: lines 155 and 100, and in the updated Figure 2.

W1.2: line 196: "offline RL" -> I think the authors might be conflating offline RL with offline preference collection. RLHF typically uses online RL. PPO is an online algorithm.

A1.2: Yes, we agree that PPO is an online algorithm! In our context, the online/offline classification depends on whether we treat the preference oracle (e.g., humans) or the learned reward model as the environment. We take the former perspective because we believe the "true" environment for AI alignment problems should be humans, while learned reward models only serve as proxies (cf. the updated Figure 2). This is also the motivation for us to study sample-efficient alignment, because querying humans is expensive (but presumably more reliable than learned reward models [1]). Moreover, we think RLHF is more like offline model-based RL, and we have added some discussion in Appendix B of the revised PDF.

W1.3: I struggle to make sense of Figure 3, for example, I fail to understand how (d) is meant to differ from (b).

A1.3: We updated Figure 3 in the revised PDF to highlight the differences between (b) and (d). In particular, (b) refers to methods that apply DAP (e.g., DPO) algorithms iteratively or in an online manner, but still with passive exploration. In (d), we not only learn the policy online but also learn an uncertainty-aware reward model to enable active exploration, which we empirically show to improve sample efficiency significantly.


W2: Why is the posterior approximation scheme reasonable?

We would like to clarify that the choice of posterior approximation scheme is orthogonal to our proposed framework (i.e., Algorithm 1). Approximating the posterior with deep ensembles is a widely adopted technique for deep neural networks [2]. Consistent with common practice, we obtain the ensembles by training multiple models with different initializations. To enhance the diversity of the ensembles, we include an explicit regularization term that anchors each ensemble member to its initialization, following the work of [3]. Additionally, alternative posterior approximation methods are compatible with our framework and could potentially further improve the results.
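A minimal sketch of this scheme is shown below, assuming the reward heads are linear layers on top of frozen backbone features; `AnchoredEnsemble`, `feature_dim`, and `anchor_coef` are illustrative names, and the exact regularization in the paper (following [3]) may differ in detail.

```python
import copy
import torch
import torch.nn as nn

class AnchoredEnsemble(nn.Module):
    """Ensemble of reward heads, each regularized toward its own random
    initialization to keep the members diverse (illustrative sketch)."""

    def __init__(self, feature_dim: int, n_heads: int = 4, anchor_coef: float = 1.0):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(feature_dim, 1) for _ in range(n_heads)])
        # Frozen copies of the initial parameters serve as per-member anchors;
        # kept in a plain list so they are excluded from the optimizer.
        self.anchors = [copy.deepcopy(h).requires_grad_(False) for h in self.heads]
        self.anchor_coef = anchor_coef

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # One reward per head: shape (n_heads, batch).
        return torch.stack([h(features).squeeze(-1) for h in self.heads])

    def anchor_penalty(self) -> torch.Tensor:
        # L2 distance of each member from its initialization, added to the loss.
        total = sum(
            ((p - a) ** 2).sum()
            for head, anchor in zip(self.heads, self.anchors)
            for p, a in zip(head.parameters(), anchor.parameters())
        )
        return self.anchor_coef * total
```

Sampling a single head from such an ensemble then plays the role of drawing one reward function from the approximate posterior.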


W3: More details are needed for the experiment setup.

Thank you for this valuable suggestion. We have added a review of all baseline methods in Appendix C, and detailed hyperparameter configurations in lines 1001-1006.

I can't reconcile any of the curves in Figure 6 with the curves in Figure 5.

As shown in Table 1, Variant-3 is the original SEA method. Therefore, the curve labeled "3" in Figure 6 (left) is the same as the curve labeled "SEA" in the first subplot (Pythia 1B x DPO) of Figure 5. We have added an explanation at line 985 to make this clearer.

Comment

Dear Reviewer fdaL,

Thank you once again for your constructive feedback. We would like to kindly remind you that we have made comprehensive improvements to the writing, incorporating your suggestions, and added more details on the baseline setup in the revised paper.


As the discussion period is coming to a close in two days, we look forward to your response and would be happy to address any further comments or questions you may have.

Best,

The Authors

Official Review
Rating: 5

The paper considers the problem of aligning LLMs to humans via online interaction in a sample-efficient manner that minimizes the overall number of human annotations consumed. It proposes a practical algorithm and, using the TL;DR dataset, shows that it can achieve extremely high win rates compared against the baseline alignment algorithm.

Strengths

  • The paper builds upon the direction of aligning LLMs that considers the epistemic uncertainty of the reward function, utilizing the variance to guide the model to collect preference data in directions that maximally obtain information.
  • The algorithm is sound and theoretically solid. They build upon well-performing methods for uncertainty estimation using the ensemble reward model approach.
  • The results shown are promising (although see weaknesses for caveats).

Weaknesses

  • The main weakness of this paper is the lack of significant experimentation. The paper only considers the summarization task with the TL;DR dataset. It does not consider more diversity in tasks/datasets and does not use some of the most common tasks in the LLM literature (e.g., reasoning, code, general knowledge). Even within summarization, it does not consider more than one dataset. Improvement on a single dataset is typically very limiting.

Questions

  • Do you have results that can showcase this method on a variety of tasks and many different datasets? How generalizable is this method (in terms of how well it works)?

Comment

Thank you very much for recognizing the contributions of our work, and for your constructive feedback about adding experiments on more datasets.

Note: We've marked z8J2 next to relevant changes in the revised paper to make it easier for you to navigate.


Q1: Do you have results that can showcase this method on a variety of tasks and many different datasets?

We conducted additional experiments with the UltraFeedback dataset (widely used in LLM alignment) and evaluated our method and baselines on the well-established AlpacaEval 2.0 benchmark. We employ the state-of-the-art direct optimizer (SimPO) and compare the offline, online, and SEA variants. The final performance after consuming the same query budget is shown in the following table:

| Model | Length-Controlled Win Rate | Win Rate |
|---|---|---|
| Llama-3-8b-Instruct Offline SimPO | 42.96 | 35.45 |
| Llama-3-8b-Instruct Online SimPO | 41.68 | 36.76 |
| Llama-3-8b-Instruct SEA SimPO | 47.40 | 41.10 |

The results indicate that our method generalizes to aligning LLMs on more challenging tasks, and still shows sample-efficiency advantages over the offline and passively online baselines.

Please see the revised paper for the learning curves (Figure 8), as well as explanations of the setting (Appendix D.2) and a discussion of the results (Appendix E.3).


Thank you again for your suggestion to expand the experimentation, which makes our empirical results more convincing. We hope our responses and new experiments address any remaining concerns. Please do let us know if you have any further questions, as well as what would be needed for a score improvement.

Comment

Dear Reviewer z8J2,

Thank you for your valuable feedback. We would like to kindly remind you that we have included additional experiments to validate the generalizability of our method (Figure 8) in the revised paper.


As the discussion period is coming to a close in two days, we look forward to your response and would be happy to address any further comments or questions you may have.

Best,

The Authors

Official Review
Rating: 6

This paper considers exploration via approximate TS in LLM alignment.

Strengths

This paper is nicely written and well-executed. The details are well-presented.

Weaknesses

This paper is a bit incremental and the idea is well expected. The contextual bandit formulation is well-known and the implementation of TS using ensemble is pretty straightforward.

There are multiple key questions this paper fails to answer:

  1. Why does an ensemble on top of DPO make sense, and how good is the uncertainty estimation? This is not a standard supervised learning problem. The paper lacks a way to directly measure it. I do not think ensemble + X can quantify uncertainty in every problem.

  2. The reward model is too small (0.4B) and the base model Pythia family is outdated such that the empirical result may become useless in a practical setting. It is very common that the benefit of active exploration could disappear when the base model and reward model become much stronger.

  3. The baseline numbers are not comparable with other literature since the authors use their own oracle reward model. There are tons of implementation details that could be hidden for the other baselines. I think it would be good to have an apples-to-apples comparison with the same setting as the original DPO paper.

Questions

see above.

Comment

W3: Baseline numbers are not comparable with other literature.

DPO uses GPT-4-0314 as a judge, while we use Skywork-Reward-Llama-3.1-8B as the reward oracle; DPO aligns the pretrained SFT model from https://huggingface.co/CarperAI/openai_summarize_tldr_sft, while we use three scales of the Pythia family trained by [2]. These differences make our results not directly comparable to the numbers in their paper.

However, we have faithfully reproduced DPO's results (referred to as offline DPO in our paper) by following the huggingface/trl implementation. Our source code can be found here, and will be open-sourced after the reviewing period.

By implementing all baselines carefully in a single codebase, we have tried our best to make sure all comparisons are apples-to-apples: different algorithms use exactly the same configurations except their algorithm-specific ones, are trained on the same dataset, and are evaluated using the same metric.

Regarding your concern that the reported metric is only computed with the selected oracle reward model, we hope our additional experimental results (see our response to W2) on the AlpacaEval 2.0 benchmark are more convincing.


Thank you again for your feedback, which helps us to improve the quality of our paper. We hope our responses and revised paper address any remaining concerns. Please do let us know if you have any further questions, as well as what would be needed for a score improvement.


[1] Lakshminarayanan, B., Pritzel, A., & Blundell, C. (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in neural information processing systems, 30.

[2] Huang, S., Noukhovitch, M., Hosseini, A., Rasul, K., Wang, W., & Tunstall, L. (2024). The N+ Implementation Details of RLHF with PPO: A Case Study on TL; DR Summarization. arXiv preprint arXiv:2403.17031.

Comment

Thank you very much for your detailed feedback and suggestions. We have revised the paper with new experimental results (to hopefully address weaknesses 2 & 3).

Note: We've marked RtS4 next to relevant changes in the revised paper to make it easier for you to navigate.


W0: The idea is a bit incremental.

To our knowledge, there is no existing work that discusses the contextual dueling bandit formulation of LLM alignment in both the E&E (explore & exploit) and BAI (best arm identification) settings and proposes solution algorithms for both. One contribution of this work is to make the two settings clear for the LLM alignment community, which can hopefully guide practitioners to select the appropriate exploration algorithm for their setting.

Although TS is well known for tackling bandit problems, applying TS to LLM alignment has not been explored and is non-trivial: Algorithm 1 (page 6) is simply intractable for LLMs. Another contribution is to leverage existing tools, including ensembles for posterior estimation, DPO for policy optimization, etc., to practically implement this intractable algorithm. We also show that the theoretically inspired TS algorithms demonstrate strong empirical results, which validates that properly designed exploration strategies can enhance the sample efficiency of LLM alignment. This is of practical value to LLM developers looking to save annotation cost.


W1: Why does the ensemble on top of DPO make sense?

Ensemble and DPO play separate roles in our method, and we are not simply "combining them to get good results". We adopt an ensemble to model the posterior of the reward models, which helps us to conduct TS for exploration. This is for (posterior) reward learning. On the other hand, we employ Direct Alignment from Preferences (DAP) methods to optimize the LLMs directly from human preferences. This is for policy learning, and DPO is only one of many optimizers to achieve this goal.

Our idea is to decompose the (intractable) TS algorithm into (posterior) reward learning and policy learning, such that they collectively provide a good approximation to TS for effective exploration. The ensemble is for reward model learning, which is converted into a standard supervised learning problem via the BT model assumption (please see Eq. 5 and Eq. 8). Hence we can expect the ensemble to be effective in capturing the epistemic uncertainty [1].
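As a concrete illustration of this supervised reduction, the standard Bradley-Terry negative log-likelihood over preference pairs takes the form sketched below; this is a generic sketch, and the paper's exact Eq. 5/8 (e.g., with the anchoring regularizer added) may differ in detail.

```python
import torch
import torch.nn.functional as F

def bt_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry preference loss: -log sigmoid(r(y_w) - r(y_l)).
    Applied per ensemble member (each seeing the same labeled pairs but
    starting from a different initialization), this turns posterior reward
    learning into ordinary supervised learning."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```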


W2: The RM is small and Pythia is outdated. The benefits of active exploration could disappear when the base model and reward model become much stronger.

We choose a small backbone for RMs due to computational efficiency. In fact, the 0.4B model serves as a representation extractor, on top of which an ensemble of MLPs is applied to model the uncertainty. If we use a stronger RM, which presumably gives better representation, we can expect the reward learning to be more accurate. We believe this should not degrade the benefits of active exploration.

We follow recent work [2] to employ different scales of the Pythia models to study the alignment problem on the TL;DR task. We admit that Pythia is a bit outdated, but we think it is still a reasonable way to study the algorithm's properties (rather than pushing state-of-the-art performance). That being said, we still conducted additional experiments with the Llama-3-8B-Instruct model on the UltraFeedback dataset and evaluated it on AlpacaEval 2.0, which is a common setting in the alignment literature. We summarize the results below:

| Model | Length-Controlled Win Rate | Win Rate |
|---|---|---|
| Llama-3-8b-Instruct Offline SimPO | 42.96 | 35.45 |
| Llama-3-8b-Instruct Online SimPO | 41.68 | 36.76 |
| Llama-3-8b-Instruct SEA SimPO | 47.40 | 41.10 |

The results indicate that active exploration is still beneficial when the base model is stronger and the task becomes more complex. Please see the updated PDF for the learning curves (Figure 8), as well as explanations of the setting (Appendix D.2) and a discussion of the results (Appendix E.3).

Comment

Dear Reviewer RtS4,

Thank you once again for your constructive feedback. We would like to kindly remind you that we have added new experiments on Llama-3-8B models, demonstrating promising results on the AlpacaEval 2.0 benchmark (Figure 8). Additionally, we have provided the source code for reproducibility and described our efforts to ensure fair comparisons with baselines.


As the discussion period is coming to a close in two days, we look forward to your response and would be happy to address any further comments or questions you may have.

Best,

The Authors

Official Review
Rating: 6

This paper studies online alignment for LLMs. It formulates LLM alignment as a contextual dueling bandit problem and proposes a Thompson sampling-like algorithm for both the regret minimization and best arm identification settings. For the challenges encountered when implementing this algorithm, including the model constraints and the intractable operations of sampling from a posterior and taking the argmax action, the paper proposes corresponding solutions to enable real-world application. Finally, the authors build a learning system for studying methods in online LLM alignment and compare the proposed method with available baselines in experiments at different scales. The proposed method shows superior sample efficiency compared with the baselines.

Strengths

  1. The paper studies the LLM alignment problem from an online view. It is reasonable to formulate the problem as a contextual dueling bandit problem.
  2. The paper is inspired by the classic Thompson sampling algorithm to actively explore the preference function and proposes a corresponding version suitable for real-world application.
  3. The paper builds a learning system for studying methods in online LLM alignment. This provides efficient tools for the community on this research line.

Weaknesses

  1. In Line 383, the authors claim they omit the baseline SELM since it shares a very similar algorithmic design with XPO. This is generally not an acceptable reason for excluding a baseline comparison.

Questions

Please see the last part.

Comment

Thank you very much for recognizing the applicability of our method to different real-world scenarios, and the value of our learning system.

Note: We've marked cF2f next to relevant changes in the revised paper to make it easier for you to navigate.


W1: Claiming SELM is similar to XPO is not acceptable for excluding a baseline comparison.

We would like to clarify that we exclude SELM from the comparison mainly because it is in fact theoretically the same as XPO. Both propose to bias the DPO optimization towards a region with over-estimated rewards (or value functions), thus achieving optimism in the face of uncertainty for exploration. They are independent works released concurrently (XPO on 31 May 2024; SELM on 29 May 2024).

However, SELM's implementation includes offline preference data for training (even though our setting is online alignment), which makes it incompatible with our settings. On the other hand, XPO's practical algorithm faithfully follows the online alignment protocol. Therefore, we choose to adopt XPO as our baseline and reproduce it in our codebase for a fair comparison.

We have revised the paper to make this clearer (line 955). We also benchmark our learning system against huggingface/trl to showcase the computational efficiency of our implementation (~2.5x faster), which we believe is a major contribution to the alignment community for speeding up online LLM alignment training (Appendix F).


Thank you again for your feedback, which helps us to clarify the baselines we have chosen. We hope our responses and revised paper address any remaining concerns. Please do let us know if you have any further questions, as well as what would be needed for a score improvement.

Comment

Dear Reviewer cF2f,

Thank you once again for your constructive feedback. We would like to kindly remind you that we have incorporated clear explanations of the differences between XPO and SELM (line 955) and improved the paper by adding additional experimental results (Figure 8) and learning system benchmarks (Appendix F).


As the discussion period is coming to a close in two days, we look forward to your response and would be happy to address any further comments or questions you may have.

Best,

The Authors

Comment

We thank all reviewers for their constructive feedback, and we have responded to each reviewer individually. We have also uploaded a Paper Revision including additional results and illustrations:

  • Figure 8 & Table 2 (Page 21): Evaluation results on AlpacaEval 2.0 of models trained with prompts from the UltraFeedback dataset (Reviewers z8J2, RtS4, fdaL).
  • Appendix F (Page 21): Reviewers cF2f and fdaL recognize the value of our online alignment training system (Section 5.1). We further conduct rigorous benchmarking to compare our codebase with huggingface/trl on online DPO training and show a ~2.5x latency improvement.
  • Appendix A (Page 16): We add a detailed algorithm as a complement to Algorithm 1 in the main text, to enhance clarity (Reviewer fdaL).
  • Appendix B (Page 17): We discuss extensively how our bandit formulation is related to single-step preference-based reinforcement learning, explaining that RLHF is an instance of offline model-based RL (Reviewer fdaL).
  • Appendices C, D (Pages 17-20): We review all the baseline methods and provide details on the settings (Reviewer fdaL); we also clarify the reason why SELM is not compared (Reviewer cF2f).
  • Figure 2 (Page 3): We updated this figure to better illustrate how alignment as CDB differs from (offline) RLHF and DAP, to improve clarity (Reviewer fdaL).

Comment

Dear Reviewers,

Thank you again for your valuable comments and suggestions, which are really helpful for us. We have posted responses to the proposed concerns and included additional experiment results.

We completely understand that this is quite a busy period, so we would deeply appreciate it if you could take some time to provide further feedback on whether our responses resolve your concerns. If there are any other comments, we will try our best to address them.

Best,

The Authors

评论

Dear Reviewers,

We hope this message finds you well. As we approach the final day of the extended reviewing period, we wanted to kindly follow up on the rebuttals we provided to your thoughtful reviews.

We carefully addressed each of your questions within the first week to the best of our ability. We have been looking forward to your responses for the past two weeks, but unfortunately, we have not yet received any follow-up or further discussion.

We sincerely look forward to your additional feedback and will do our best to address any remaining concerns within the limited time available.

Thank you again for your consideration. We understand how busy this period can be, and your dedication to the review process means a great deal.

Best regards,
The Authors

AC Meta-Review

This paper introduces Sample-Efficient Alignment (SEA), a novel algorithm to align large language models (LLMs) with human preferences in a sample-efficient manner. The problem is formulated in a contextual dueling bandit framework, and the SEA algorithm leverages Thompson sampling with deep ensembles for uncertainty estimation. The authors present extensive experiments across three LLM scales and three preference learning algorithms, demonstrating the proposed method's superior sample efficiency compared to baseline methods. The work also provides a practical learning system for online alignment research.

Reviewers raised concerns about limited task diversity, inadequate baseline comparisons (particularly the omission of SELM), and incremental novelty. While the authors added experiments (e.g., UltraFeedback and AlpacaEval 2.0) and improved the writing, reviewers found these changes insufficient to address broader generalization, weak comparability, and a lack of rigor. Presentation issues were also noted, although some partial improvements were made in the revisions. Based on the overall consensus, therefore, I recommend that the authors address all the concerns and submit to the next suitable venue once the necessary modifications are incorporated.

Additional Comments from Reviewer Discussion

See above.

Final Decision

Reject