WizardArena: Post-training Large Language Models via Simulated Offline Chatbot Arena
Abstract
Reviews and Discussion
This paper introduces an offline WizardArena built in a two-step process. It selects diverse and hard prompts from the lmsys-chat-1m dataset as a test set and divides the remaining samples into nine parts for the training set. A judge model based on LLAMA3-70B-Chat is then constructed with a designed prompt. With the self-constructed WizardArena and three state-of-the-art models, the performance of the WizardLM-beta model gradually improves over three iterative learning stages. The analysis shows consistency between the offline WizardArena evaluation and online/GPT-4 evaluations.
Strengths
- The authors made an attempt to construct an offline chatbot arena, which is a significant contribution.
- The performance of the constructed WizardLM-beta model consistently improves after three iterative learning stages.
Weaknesses
The method descriptions in Sections 3 and 4 overlap within limited space, resulting in unclear implementation details. This severely impacts the soundness of the paper. Below are some specific issues and shortcomings.
Regarding the data construction:
- The data processing steps need to be clarified. For instance, what are illegal conversations? Also, the lmsys-chat-1m includes both single-turn and multi-turn dialogues, but the handling process for multi-turn dialogues, including Arena testing data construction, is not clear.
- There might be redundancy when constructing the Diverse Subset and Hard Subset.
- How were the different stages of data (20.4K, 19.3K, 17.6K) obtained, as mentioned in lines 248-250?
- During the construction of the pair-judge training set, did all four models participate in generating outputs for each candidate training sample? And were the best outputs obtained from comparisons among the four models?
Regarding the experimental results:
- The data in Table 2 needs to be clarified regarding its source model. How many models' results were compared (are the 15 models from Table 1 involved)?
- The effects of DPO and PPO are similar, as shown in Table 1 and Figure 5. Are their differences limited to the use of different reinforcement learning algorithms? In Figure 1, PPO seems to involve more rankings.
Regarding the analysis:
- The comparison in Table 3 is unfair. The methods (e.g., IFD and INSTAG) use existing data for selection, while the pair-judge includes external advanced model outputs as learning samples, making it a synthetic data selection. These two are not comparable due to the involvement of external information in the pair-judge method.
- Table 4 shows the consistency between GPT-4 and Llama3-70 as the judge, which is good. However, it does not provide a complete comparison of all models; are the rest of the models consistent with the models displayed?
Questions
See weakness.
Limitations
They mentioned the limitations in the conclusion.
Thank you for your meticulous review, insightful questions, and the time you spent reviewing our work. We sincerely apologize for any confusion this may have caused. In the revised version of the paper, we will provide more detailed and clearer implementation details in Sections 3 and 4. Below, we address the specific issues you have highlighted:
Weaknesses-1. Regarding the data construction:
Question-1.1: The data processing steps need to be clarified. For instance, what are illegal conversations?
We employed the same method as LMSYS-Chat-1M to filter out illegal conversations. Specifically, we utilized the OpenAI Moderation API [1] to tag all conversations. If the model classifies the content as potentially harmful, the tag is set to True. If any instruction within a conversation is marked as unsafe by the model, we categorize the entire conversation as illegal and subsequently filter it out.
[1] https://platform.openai.com/docs/guides/moderation
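For clarity, a minimal illustrative sketch of this filtering step is shown below. It assumes the current `openai` Python SDK and a role/content conversation format; the helper name is illustrative rather than our released code.

```python
# Illustrative sketch only (assumes openai>=1.0 Python SDK; reads OPENAI_API_KEY from the environment).
from openai import OpenAI

client = OpenAI()

def is_illegal(conversation: list[dict]) -> bool:
    """Return True if any user instruction in the conversation is flagged as potentially harmful."""
    for turn in conversation:
        if turn.get("role") != "user":
            continue
        result = client.moderations.create(input=turn["content"]).results[0]
        if result.flagged:
            return True
    return False

# Keep only conversations in which no instruction is flagged.
# safe_conversations = [c for c in conversations if not is_illegal(c["conversation"])]
```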
Question-1.2: There might be redundancy when constructing the Diverse Subset and Hard Subset.
Thank you very much for raising these questions. In the revised version of the paper, we will merge the sections on constructing the diversity and hard test sets in Sections 3 and 4 into a single section and provide a more detailed and clearer description.
Question-1.3: How were the different stages of data (20.4K, 19.3K, 17.6K) obtained, as mentioned in lines 248-250?
For the i-th round of DPO data, the four models—WizardLM-SFT-Ii, Command R+, Qwen1.5-72B-Chat, and OpenChat-3.5—engaged in head-to-head pairwise battles. The response with the highest number of wins was designated as "Choose," while the one with the fewest wins was labeled as "Reject." Data where the gap between the Choose and Reject scores fell within the threshold K (i.e., gap ≤ 1) were filtered out. Meanwhile, as the WizardLM-SFT-Ii model evolved and became more powerful, approximately 10k to 14k of the initial 30k seed data were eventually removed in each round. This process resulted in three rounds of DPO training data pairs (D2: 20.4k, D5: 19.3k, D8: 17.6k).
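For illustration, the sketch below shows one way such <Choose, Reject> pairs could be derived from pairwise-battle win counts with the gap filter applied; the data structures and the `judge` callback are illustrative assumptions, not our exact implementation.

```python
from itertools import combinations

def build_dpo_pair(responses: dict[str, str], judge, gap_threshold: int = 1):
    """responses maps model name -> response; judge(a, b) returns the winning response text or None for a tie."""
    wins = {name: 0 for name in responses}
    for a, b in combinations(responses, 2):          # head-to-head pairwise battles
        winner = judge(responses[a], responses[b])
        if winner == responses[a]:
            wins[a] += 1
        elif winner == responses[b]:
            wins[b] += 1
    best = max(wins, key=wins.get)                   # most wins  -> "Choose"
    worst = min(wins, key=wins.get)                  # fewest wins -> "Reject"
    if wins[best] - wins[worst] <= gap_threshold:    # filter out near-tied samples (gap <= K)
        return None
    return {"chosen": responses[best], "rejected": responses[worst]}
```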
Question-1.4: During the construction of the pair-judge training set, did all four models participate in generating outputs for each candidate training sample? And were the best outputs obtained from comparisons among the four models?
No. During the SFT stage, data from battles where WizardLM-β-Ii was defeated by competing models were selected, and the best-performing response from the winning model was chosen as the final response. For DPO and the Reward Model, pairwise battles were conducted among WizardLM-β-SFT-Ii, Command R+, Qwen1.5-72B-Chat, and OpenChat-3.5. The response from the model with the highest number of wins was designated as "Chosen", while the response with the fewest wins was labeled as "Rejected", thereby constructing the <choose, reject> data pair.
Weaknesses-2. Regarding the experimental results:
Question-2.1: The data in Table 2 needs to be clarified regarding its source model. How many models’ results were compared (are the 15 models from Table 1 involved)?
Thank you very much for your careful review. The consistency metrics in Table 2 are calculated based on the results of the 15 models listed in Table 1 across LMSYS-ChatBot-Arena, MT-Bench, WizardArena, and Arena-Hard-V1.0. Future revisions of the paper will emphasize the source of these models in detail.
Question-2.2: The effects of DPO and PPO are similar, as shown in Table 1 and Figure 5. Are their differences limited to the use of different reinforcement learning algorithms? In Figure 1, PPO seems to involve more rankings.
Yes. In our study, DPO and PPO achieved similar performance on both WizardArena and MT-Bench. As illustrated in Figure 2, PPO is involved in more ranking processes mainly because we apply PPO for post-training on top of both SFT and SFT+DPO, whereas DPO is trained only on top of SFT.
Weaknesses-3. Regarding the analysis:
Question-3.1: The comparison in Table 3 is unfair (e.g., IFD and INSTAG) ...
We greatly appreciate your insightful observations and valuable questions. To ensure fairness in comparison, the Pair-judge method in our paper involved only battles between WizardLM-β-SFT-I0 and Command R+ as the reference model, focusing on data where WizardLM-β-SFT-I0 was defeated, without incorporating information from other advanced models. Both IFD and INSTAG methods selected instructions based on calculated instruction complexity and utilized the corresponding responses from Command R+. Thus, the responses for Pair-judge battles, IFD, and INSTAG were all derived from Command R+. Consequently, this ensures that the comparison of different data selection methods in Table 5 is fair. We will further emphasize this in future versions of the paper.
Question-3.2: Table 4 shows the consistency between GPT-4 and Llama3-70 as the judge, which is good. However, it does not provide a complete comparison of all models; are the rest of the models consistent with the models displayed?
We greatly appreciate your valuable suggestion. Table 9 in the newly uploaded WizardArena_rebuttal PDF presents the ELO rankings for 16 models evaluated using Llama3-70B-Chat and GPT-4 as judge models in WizardArena-Mix. Using the GPT-4 judge's ELO as the reference benchmark, the Spearman correlation coefficient between the Llama3-70B-Chat judge and the GPT-4 judge is 97.42%, and the Human Agreement with 95% CI is 95.58%. The overall average consistency between the two judge models is 96.50%. Consequently, employing Llama3-70B-Chat as a cost-effective evaluation model achieves high consistency with GPT-4, ensuring the reliability of evaluation and training with WizardArena.
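As a minimal illustration of how the rank-correlation part of this consistency metric is computed (the ELO numbers below are placeholders, not our actual results):

```python
from scipy.stats import spearmanr

# Placeholder ELO scores for the same set of models under the two judges (not our actual values).
elo_gpt4_judge = [1397, 1250, 1180, 1120, 1050, 980]
elo_llama3_judge = [1380, 1262, 1175, 1105, 1061, 975]

rho, _ = spearmanr(elo_gpt4_judge, elo_llama3_judge)
print(f"Spearman correlation between the two judges: {rho:.4f}")
```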
Thank you for your reply.
I have read the response and viewed the revised PDF uploaded by the authors; they have addressed most of my concerns about the results. However, I'm still confused about the data construction part (Q1.2, Q1.3, Q1.4).
For example, in Q1.2, I said there might be redundancy when constructing the Diverse Subset and Hard Subset, which means: are any samples in both the Diverse Subset and the Hard Subset?
In Q1.3, I don't understand "Meanwhile, as the WizardLM-SFT-Ii model evolved and became more powerful, approximately 10k to 14k of the initial 30k seed data were eventually removed in each round."
In Q1.4, you said, "For DPO and the Reward Model, pairwise battles were conducted among WizardLM-β-SFT-Ii, Command R+, Qwen1.5-72B-Chat, and OpenChat-3.5. The response from the model with the highest number of wins was designated as "Chosen," while the response with the fewest wins was labeled as "Rejected," thereby constructing the <choose, reject> data pair." If the comparison models' responses are sampled, what is their level of participation? How many samples from Command R+ are selected? How many from WizardLM-β-SFT-Ii are selected?
Regarding the comparison in Table 3, I mean the samples in IFD and INSTAG have only one question and one pre-existing answer; in your method, one sample generates multiple answers and the best one is selected, which I would call a synthetic method.
Anyway, thank you for more experiments and results.
Dear Reviewer B73Q,
We would like to thank you for engaging so thoroughly with both our paper and the rebuttal. We hope the following addresses all remaining concerns.
Problems-1: For example, in Q1.2, I said there might be redundancy when constructing the Diverse Subset and Hard Subset, which means are any samples both in the Diverse Subset and Hard Subset?
Thank you for your insightful question. There is no redundancy between the Diverse and Hard Subsets. First, we filter out instructions shared between the Hard and Diverse Subsets. Then, we employ the MinHashLSH technique (with a threshold of 0.4 and 128 permutation functions) for data deduplication between the Hard and Diverse Subsets. Subsequently, the Hard Subset excludes instructions among the top-2 matches in semantic similarity (using the gte-large-en-v1.5 model) with the Diverse Subset. From the initial 10k data, 7.4k remain, and the top 1,000 instances with the highest difficulty are then selected to construct the Hard Subset. We will emphasize this aspect more clearly in future versions of our paper.
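A minimal sketch of this deduplication pipeline is shown below; it assumes the `datasketch` and `sentence-transformers` libraries, and the toy data, cosine cut-off, and model loading flags are illustrative assumptions rather than our exact code.

```python
from datasketch import MinHash, MinHashLSH
from sentence_transformers import SentenceTransformer, util

# Toy inputs for illustration only.
diverse_subset = ["Explain quantum entanglement simply.", "Write a haiku about autumn."]
hard_candidates = ["Explain quantum entanglement simply, please.", "Prove that sqrt(2) is irrational."]

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf8"))
    return m

# Step 1: MinHashLSH deduplication against the Diverse Subset (threshold 0.4, 128 permutations).
lsh = MinHashLSH(threshold=0.4, num_perm=128)
for i, text in enumerate(diverse_subset):
    lsh.insert(f"div-{i}", minhash(text))
hard_stage1 = [t for t in hard_candidates if not lsh.query(minhash(t))]

# Step 2: exclude candidates that are semantically too close to the Diverse Subset
# (gte-large-en-v1.5 embeddings; the 0.9 cosine cut-off is an illustrative choice).
encoder = SentenceTransformer("Alibaba-NLP/gte-large-en-v1.5", trust_remote_code=True)
div_emb = encoder.encode(diverse_subset, convert_to_tensor=True, normalize_embeddings=True)
hard_emb = encoder.encode(hard_stage1, convert_to_tensor=True, normalize_embeddings=True)
similarity = util.cos_sim(hard_emb, div_emb)
hard_subset_pool = [t for t, row in zip(hard_stage1, similarity) if row.max().item() < 0.9]
```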
Problems-2: In Q1.3, I don't understand "Meanwhile, as the WizardLM-SFT-Ii model evolved and became more powerful, approximately 10k to 14k of the initial 30k seed data were eventually removed in each round."
Thank you very much for raising this issue, and we sincerely apologize for any inconvenience caused. The filtering principle in our paper is as follows: when constructing DPO data, if the four models (WizardLM-β-SFT-Ii, Command R+, Qwen1.5-72B-Chat, OpenChat-3.5) exhibit comparable performance on some data, resulting in Score<Choose> - Score<Reject> ≤ 1, those data are filtered out.
"Meanwhile, as the WizardLM-SFT-Ii model evolved and became more powerful, approximately 10k to 14k of the initial 30k seed data were eventually removed in each round." This sentence means:
In the initial stage, WizardLM-β-SFT underperforms the other models (Command R+, Qwen1.5-72B-Chat, OpenChat-3.5) on some specific data, leading to its response being labeled as "Reject" and the best response from the other models as "Choose" for DPO data construction. As WizardLM-β-SFT becomes more powerful through iterative training (WizardLM-β-SFT-I1 -> WizardLM-β-SFT-I3), it becomes comparable to the other models on these data, leading to Score<Choose> - Score<Reject> ≤ 1, and consequently these data are filtered out. As a result, the proportion of filtered data increases across iterations when constructing DPO data.
Overall, in each round of 30k data, the first round filtered out 9.6k, the second round filtered out 10.7k, and the third round filtered out 12.4k when constructing DPO data. Consequently, each round left D2 with 20.4k, D5 with 19.3k, and D8 with 17.6k data.
Problems-3: Regarding the comparison of Table 3, I mean the sample in IFD and INSTAG only have one question and one answer that has existed; in your methods, one sample will generate more answers and select the best one, which I would say is a synthetic method.
Thank you for your insightful question. In the data selection strategy for Table 3 in our paper, to ensure a fair comparison, the Pair-judge Battle method only conducts battles between WizardLM-β-SFT-I0 and Command R+. The data where WizardLM-β-SFT-I0 loses are selected, with the corresponding responses taken from Command R+. Additionally, the responses for instructions selected by IFD and INSTAG are also derived from Command R+, rather than the original existing responses.
As a result, the responses for Pair-judge Battle, IFD, and INSTAG all originate from Command R+, ensuring fairness in the comparison. Moreover, the data selection strategy employed by Pair-judge Battle is superior to those used in IFD and INSTAG, highlighting its effectiveness.
It is important to note that Pair-judge Battle is a data selection strategy, not a synthetic method.
Problems-4: In Q1.4, you said, "For DPO and the Reward Model, pairwise battles were conducted among WizardLM-β-SFT-Ii, Command R+, Qwen1.5-72B-Chat, and OpenChat-3.5. The response from the model with the highest number of wins was designated as "Chosen," while the response with the fewest wins was labeled as "Rejected," thereby constructing the <choose, reject> data pair." If the comparison models' responses are sampled, what is their level of participation? How many samples from Command R+ are selected? How many from WizardLM-β-SFT-Ii are selected?
We greatly appreciate your thorough consideration of this valuable issue.
The table below summarizes the sources of Choose and Reject responses during the DPO data construction (When the scores of responses from multiple models are equal, we randomly sample one response from these models as the Choose or Reject.).
Command R+ was selected for 9.5k, 8.8k, and 7.8k Choose responses across the three rounds, totaling 26.1k (45.5% of the total 57.3k). The corresponding Reject responses were 1.0k, 0.9k, and 0.9k, totaling 2.8k (4.9% of the total 57.3k).
WizardLM-β-SFT was selected for 1.6k, 1.9k, and 2.5k Choose responses across the three rounds, totaling 6.0k (10.5% of the total 57.3k), with corresponding Reject responses of 8.5k, 6.4k, and 4.2k, totaling 19.1k (33.3% of the total 57.3k). This indicates that as WizardLM-β-SFT improved through iterative training, the number of Choose responses increased while Reject responses decreased.
Command R+ and Qwen1.5-72B-Chat, which have comparable performance on WizardArena-Mix, show that Command R+ accounted for 45.5% of the total DPO Choose responses (26.1k of 57.3k), while Qwen1.5-72B-Chat accounted for 38.4% (22.0k of 57.3k). Similarly, Command R+ contributed 4.9% of the total Reject responses (2.8k of 57.3k), while Qwen1.5-72B-Chat contributed 7.3% (4.2k of 57.3k).
| i-th round DPO | Command R+ | Qwen1.5-72B-Chat | OpenChat-3.5 | WizardLM-β-SFT | Total |
|---|---|---|---|---|---|
| DPO Choose Response | | | | | |
| DPO-I1-Choose | 9.5k | 7.9k | 1.4k | 1.6k | 20.4k |
| DPO-I2-Choose | 8.8k | 7.5k | 1.1k | 1.9k | 19.3k |
| DPO-I3-Choose | 7.8k | 6.6k | 0.7k | 2.5k | 17.6k |
| DPO-Total-Choose | 26.1k | 22.0k | 3.2k | 6.0k | 57.3k |
| DPO Reject Response | | | | | |
| DPO-I1-Reject | 1.0k | 1.4k | 9.5k | 8.5k | 20.4k |
| DPO-I2-Reject | 0.9k | 1.5k | 10.5k | 6.4k | 19.3k |
| DPO-I3-Reject | 0.9k | 1.3k | 11.2k | 4.2k | 17.6k |
| DPO-Total-Reject | 2.8k | 4.2k | 31.2k | 19.1k | 57.3k |
We sincerely hope that the clarification could address your concerns. Please let us know if you have further questions, and thank you again for your review!
Respectfully,
Paper 9817 Authors.
Thank you for your clarification. The table you showed is much clearer. I will increase the score to 5 due to your efforts to address my concerns in the rebuttal period, but there are too many parts of the original version that need to be improved, as I mentioned, as well as other reviewers' comments.
If the paper is accepted, I suggest you add the results to the final version.
Dear Reviewer B73Q,
We sincerely appreciate your engaging thoroughly with both our paper and the rebuttal, and your raising of the score. We are also deeply grateful for the significant time and effort you dedicated to the review process, as well as your professional comments and valuable feedback on our work.
In the revised version of our paper, we will add the relevant discussions and results to enhance its content and facilitate further research within the LLM community.
Once again, we genuinely thank you for the thoughtful consideration you have dedicated to our work.
Best regards,
Paper 9817 Authors.
Question-1.5: The lmsys-chat-1m includes both single-turn and multi-turn dialogues, but the handling process for multi-turn dialogues, including Arena testing data construction, is not clear.
We describe the processing of multi-turn dialogue data in detail, covering the following aspects:
- Testing Data Construction:
For the diversity test set, we concatenated all instructions from each multi-turn dialogue in LMSYS-Chat-1M into a single string. MinHashLSH was employed for deduplication, followed by generating 1024-dimensional text embeddings using the gte-large-en-v1.5 model. These embeddings were then reduced to two dimensions using t-SNE and clustered into 500 categories via the K-Means algorithm. Two samples were selected from each category to construct the 1k diversity test set.
For the hard test set, we used GPT-4-1106-preview to score each instruction on a scale of 1 to 10 in the multi-turn dialogues, based on the scoring prompt detailed in Appendix B. The scores were averaged to determine the overall difficulty of the multi-turn dialogues. Dialogues were then ranked by their average difficulty scores, and the top 1000 most challenging dialogues were selected as the hard test set.
- Training Data Construction:
For constructing the SFT, DPO, and Reward Model datasets, Llama3-70B-Chat was used to score each instruction in the multi-turn dialogues, and these scores were then averaged. Dialogues where WizardLM-β underperformed were included in the SFT training set. The highest and lowest average scoring dialogues were used to construct the <Choose, Reject> pairs for DPO and the Reward Model.
- Evaluation Phase:
Llama3-70B-Chat was used to score each instruction in the multi-turn dialogues in WizardArena. These scores were averaged to reflect the model's overall performance across the dialogues, which was then compared to other models to calculate the ELO score.
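For reference, a minimal sketch of the standard Elo update used to turn such battle outcomes into rankings is given below; the K-factor, initial ratings, and example scores are illustrative assumptions, not our exact settings.

```python
def update_elo(ratings: dict[str, float], model_a: str, model_b: str,
               score_a: float, k: float = 32.0) -> None:
    """score_a is 1.0 if model_a wins, 0.0 if it loses, 0.5 for a tie."""
    ra, rb = ratings[model_a], ratings[model_b]
    expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
    ratings[model_a] = ra + k * (score_a - expected_a)
    ratings[model_b] = rb + k * ((1.0 - score_a) - (1.0 - expected_a))

# Example: turn averaged judge scores for one dialogue into a battle outcome, then update Elo.
ratings = {"WizardLM-beta": 1000.0, "Command R+": 1000.0}
avg_score = {"WizardLM-beta": 7.2, "Command R+": 8.1}   # averaged Llama3-70B-Chat judge scores (placeholders)
outcome = 1.0 if avg_score["WizardLM-beta"] > avg_score["Command R+"] else \
          0.5 if avg_score["WizardLM-beta"] == avg_score["Command R+"] else 0.0
update_elo(ratings, "WizardLM-beta", "Command R+", outcome)
```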
Dear Reviewer B73Q,
We would like to thank you for your detailed reviews. We genuinely appreciate the time and thoughtful consideration you have dedicated to our work.
As the discussion period is coming to an end, we would be grateful if you could let us know whether our response has addressed your concerns or if you still have any other questions. We look forward to receiving any additional feedback you may have.
We would be happy to have a follow-up discussion or address any additional comments. Thank you once again for your valuable contributions to our work.
Respectfully,
Paper 9817 Authors.
Hello Reviewer,
The author has submitted a response to your comments. Whether or not it addresses your concerns, it would be greatly appreciated if you could acknowledge that you have reviewed the reply.
This paper proposes a Simulated Chatbot Arena named WizardArena to efficiently evaluate and train large language model (LLM) chatbots without human intervention. WizardArena is based on Elo rankings similar to LMSys Chatbot Arena but replaces human judges with powerful open-source LLMs, e.g. Llama-3. For evaluation, the authors collect a small set of diverse instructions with various difficulties as the test set. By evaluating multiple open-source LLMs, the WizardArena can produce Elo rankings that align closely with the LMSys Chatbot Arena. For training, the authors propose iterative battles using multiple LLMs to generate high-quality SFT and RL training data. Experimental results show that the multi-round generated training data can align LLMs to be chatbots efficiently and effectively.
Strengths
- The paper is well-motivated and the proposed method is useful in training and evaluating LLM chatbots efficiently.
- The proposed WizardArena is effective in both training and evaluating LLMs based on the experimental results.
- The paper is well-organized and easy to follow.
Weaknesses
- The iterative training data generation needs multiple training and generation rounds. The training rounds incorporate SFT, DPO, or PPO and the generation round uses multiple powerful chatbots for generating reference responses. It can be costly and hard to tune, e.g. how to choose training algorithms, the reference model, the generation rounds, etc.
- Since the method uses a powerful LLM as the judge (Llama3-70b), it is questionable whether the judge can still correctly evaluate models that are more capable than the judge. This can affect the evaluation and the training data generation when we want to scale the proposed framework to obtain more powerful LLMs.
Questions
N/A
Limitations
N/A
Dear Reviewer, we thank you for your valuable comments and the time you spent reviewing our work!
Please find below a detailed discussion of the points you have raised:
Weaknesses-1: The iterative training data generation needs multiple training and generation rounds... It can be costly and hard to tune, e.g. how to choose training algorithms, the reference model, the generation rounds, etc.
Thank you very much for your valuable questions. Below, we provide detailed answers concerning the choice of training strategies, reference models, and the number of generation rounds.
- Training algorithm choice:
Table 5 (lines 316-328) of our paper discusses the selection of training strategies for the first round of three batches of data. In the updated Table 4 in the newly uploaded WizardArena_rebuttal PDF, continuing with DPO or PPO training on top of SFT resulted in significant improvements in the WizardArena-Mix ELO score by 135 and 142 points, and in MT-Bench scores by 0.37 and 0.31 points, respectively, outperforming SFT-only training (SFT+SFT).
Further training with PPO after SFT+DPO yielded a modest 0.05-point increase on the MT-bench, but a notable 21-point improvement in the WizardArena-Mix ELO score. Specifically, on WizardArena-Mix, the SFT+DPO+PPO strategy outperformed the SFT+SFT+SFT, SFT+SFT+DPO, SFT+SFT+PPO, SFT+DPO+DPO, and SFT+PPO+DPO strategies, with ELO score increases of 44, 15, 12, 10, and 6 points, and MT-Bench score improvements of 0.12, 0.08, 0.07, 0.02, and 0.06, respectively.
Therefore, employing the SFT+DPO+PPO training strategy in iterative training achieved relatively optimal results for the first round of three batches of data. It shows that continuously applying RL training strategies on top of SFT can further enhance the model's intrinsic capabilities. Consequently, our study adopted the SFT+DPO+PPO training strategy in each round for iterative training.
- Reference model choice:
The core idea of our paper is to enhance the weaker aspects of the WizardLM-β model by learning from the strengths of different reference models through the Judge-Pair Battle method. If a model’s generated response is selected as training data following the Judge-pair Battle with other reference models and subsequently improves WizardLM-β's performance, then that model can be considered a reference model.
For instance, in the first round of SFT data, OpenChat-3.5 contributed 4.3k data, Qwen1.5-72B-Chat provided 7.1k data, and Command R+ contributed 8.6k data. Generally, larger model capacities correspond to a greater proportion of the data. Additionally, Table 5 in the newly uploaded WizardArena_rebuttal PDF demonstrates that adding Qwen1.5-72B-Chat and OpenChat-3.5 on top of Command R+ significantly enhances WizardLM-β's performance on WizardArena-Mix (ELO +32). Therefore, the choice of Command R+, Qwen1.5-72B-Chat, and OpenChat-3.5 as reference models in our study is both reasonable and effective.
- Generation round choice:
Our study conducted a total of three rounds of iterative training. By the fourth round, the improvement in WizardLM-β's ELO score on WizardArena began to slow, approaching performance saturation. As shown in Table 6 in the newly uploaded WizardArena_rebuttal PDF, WizardLM-β-I4's ELO score on WizardArena-Mix increased by only 7 points compared to WizardLM-β-I3. This is primarily because WizardLM-β-I3 had already evolved to a high level of performance, approaching the capabilities of Command R+ and Qwen1.5-72B-Chat, which resulted in a gradual decrease in the availability of effective training data. Therefore, our paper employed three rounds of iterative training.
Weaknesses-2: Since the method uses a powerful LLM as the judge (Llama3-70b), it is questionable whether the judge can still correctly evaluate models that are more capable than the judge. This can affect the evaluation and the training data generation when we want to scale the proposed framework to obtain more powerful LLMs.
The following discussion elaborates on the use of Llama3-70B-Chat as the judge model to evaluate more capable models than the judge from three perspectives:
- Due to time constraints and the cost of manual annotation, we randomly selected a subset of WizardArena-Mix, comprising 50 diverse samples and 50 hard samples, totaling 100 test cases. We chose four models: WizardLM-β-PPO-I3, GPT-4o, GPT-4-1106-Preview, and Claude 3.5 Sonnet, with WizardLM-β-PPO-I3 serving as the Battle model and the others as reference models. We used Llama3-70B-Chat and a human annotator for anonymous evaluations. The results, presented in Table 7 in the newly uploaded WizardArena_rebuttal PDF, demonstrate a high level of consistency between the outcomes of Llama3-70B-Chat and human evaluations.
- As shown in Table 2 in the newly uploaded WizardArena_rebuttal PDF, we added more advanced models (i.e., GPT-4o, GPT-4-1106-Preview, Claude 3.5 Sonnet) to WizardArena and conducted battles with the other models, using Llama3-70B-Chat as the judge model to calculate ELO scores. The rankings align closely with those observed in the LMSYS ChatBot Arena.
- Learning from more advanced models during the training stages with Llama3-70B-Chat as the judge model: Table 8 in the newly uploaded WizardArena_rebuttal PDF illustrates the performance impact of utilizing more advanced models in battles against WizardLM-β-7B when Llama3-70B-Chat is used as the judge model. Leveraging these models improved the ELO score from 875 to 1265, outperforming battles with the original reference models.
In conclusion, Llama3-70B-Chat has proven to be a highly reliable and scalable tool for evaluating more powerful models, making it an effective solution for advanced model assessment scenarios. Therefore, Llama3-70B-Chat as the judge model can correctly evaluate models that are more capable than the judge.
Dear Reviewer tJho,
We would like to thank you for your detailed reviews. We genuinely appreciate the time and thoughtful consideration you have dedicated to our work.
As the discussion period is coming to an end, we would be grateful if you could let us know whether our response has addressed your concerns or if you still have any other questions. We look forward to receiving any additional feedback you may have.
We would be happy to have a follow-up discussion or address any additional comments. Thank you once again for your valuable contributions to our work.
Respectfully,
Paper 9817 Authors.
Thanks for the response! I have read all reviews and the corresponding author responses. These comments are helpful and address most of my concerns, which can improve the quality of the manuscript if included in the revision. So I changed my score from 5 to 6.
I am still interested in whether a weak judge model can provide guidance to improve a strong student model, or perhaps whether a model like Llama3 70B can judge its own responses and self-improve in this multi-round training framework. I think it would be promising to apply the authors' approach in this direction.
Dear Reviewer tJho,
Thank you for your thoughtful consideration and insightful questions. Due to time constraints, we will explore these valuable questions in detail in future versions of our paper—"whether a weak judge model can provide guidance to improve a strong student model" and "whether llama3 70b can judge its own responses and self-improving in this multi-round training framework."
We also sincerely appreciate your engaging thoroughly with both our paper and the rebuttal, and your raising of the score. We are deeply grateful for the significant time and effort you dedicated to the review process. Your professional comments provide invaluable guidance, significantly contributing to the enhancement of our work. We will include the relevant discussions in future versions of our paper to facilitate further research within the LLM community.
Once again, we genuinely thank you for the thoughtful consideration you have dedicated to our work.
Respectfully,
Paper 9817 Authors.
Hello Reviewer,
The author has submitted a response to your comments. Whether or not it addresses your concerns, it would be greatly appreciated if you could acknowledge that you have reviewed the reply.
This paper introduces WizardArena, an AI-based simulated offline chatbot arena that generates high-quality synthetic data and uses a fine-tuned Llama 70B as a judge model for automated evaluation. This approach significantly reduces the labor and time costs of post-training large language models while maintaining consistency with LMSYS Chatbot Arena's evaluation results. Experimental results show that this method significantly improves model performance across multiple benchmarks, providing a reliable solution for the efficient training and evaluation of LLMs.
Strengths
- WizardArena introduced an AI-based offline chatbot arena for generating high-quality synthetic data and automated evaluation.
- It utilized a fine-tuned Llama 70B as a judge model, significantly reducing reliance on human evaluators and lowering costs.
- The paper demonstrated that models fine-tuned with WizardArena's synthetic data showed significant performance improvements across multiple benchmarks.
Weaknesses
- The approach of this work is essentially to create pairs of data for reinforcement learning using a game of multiple models. However, similar work, such as HH data, dedicated extensive sections to explaining the data generation process and providing sample demonstrations and analyses. This is lacking in this paper.
- The paper mainly focuses on the evaluation capability of the model fine-tuned on synthetic data when assessing other weaker models. However, it lacks a comparison with GPT-4, which would have been relevant and valuable for a more comprehensive evaluation.
Questions
- In your study, you have chosen to evaluate the performance of a fine-tuned Llama 70B model. Have you considered including a comparative analysis with the GPT-4 model to enhance the credibility of your findings?
- While employing the fine-tuned Llama 70B judge model for automated evaluation reduces costs and enhances efficiency, it might not capture all the nuances and subjective assessments that human evaluators offer. Introducing some human evaluations or qualitative analyses could make the results more convincing.
Limitations
Overall, the motivation behind this paper is strong and addresses issues that are of significant interest to the research community. The experiments are solid.
Dear Reviewer, thank you for your valuable comments and the time spent reviewing our work! Your feedback is invaluable for improving the quality and competitiveness of our paper.
Please find below a detailed discussion of the points you have raised:
Weaknesses-1: The approach of this work is to create pairs of data for reinforcement learning using a game of multiple models. However, similar work, such as HH data, dedicated extensive sections to explaining the data generation process... This is lacking in this paper.
The primary distinctions between our study and the HH Data approach are as follows:
- Our study employs open-source models (i.e., Llama3-70B-Chat) as the judge model. The Judge-pair Battle method is used to select high-quality data where WizardLM-β underperforms, which then serves as the training data for SFT. Additionally, preference data pairs are constructed for DPO and Reward Model training. In contrast, HH Data depends on crowdsourcing for manual annotation, which is time-consuming, costly, and lacks scalability. Our method is more cost-effective, scalable, and efficient.
- Our study introduces an offline WizardArena, which aligns closely with the online LMSYS ChatBot Arena. The Judge-pair Battle method is employed to identify high-quality SFT training data, making it inherently suitable for generating high-quality preference data pairs and supporting iterative SFT-DPO-PPO training. Conversely, HH Data utilizes the Reject Sampling strategy, where the model produces multiple responses, and collects human feedback data for preference modeling. RLHF is then applied to train a relatively helpful and harmless AI assistant.
Further versions of our paper will provide a more detailed description of the response generation process and sample analysis and will cite and discuss HH data in more detail.
Weaknesses-2: The paper mainly focuses on the evaluation capability of the model fine-tuned on synthetic data... However, it lacks a comparison with GPT-4, which would have been relevant and valuable for a more comprehensive evaluation.
We sincerely appreciate your valuable suggestions. The updated Table 2 in the newly uploaded WizardArena_rebuttal PDF presents a comparative analysis of WizardLM-β's performance against other prominent models, including GPT-4o, GPT-4-1106-Preview, and GPT-4-0613. The results show that GPT-4o achieves the highest rank in WizardArena-Mix with an ELO score of 1397, maintaining its superior performance. However, as WizardLM-β undergoes SFT-DPO-PPO iterative training via the Judge-Pair Battle method, the performance gap between WizardLM-β and GPT-4o progressively narrows.
Specifically, the ELO score gap decreased from 524 points (873 vs. 1397) to 123 points (1274 vs. 1397), and eventually to a 60-point difference compared to GPT-4-0613. This shows that iterative training through the Judge-pair Battle method can markedly enhance a model's ability to handle hard and diverse tasks, while steadily improving its performance in weaker areas. (Note: the WizardArena-Mix ELO scores may exhibit slight fluctuations as new models are added.)
Question-1: In your study, you have chosen to evaluate the performance of a fine-tuned Llama 70B model. Have you considered including a comparative analysis with the GPT-4 model to enhance the credibility of your findings?
Yes. In Table 4 in our paper (lines 309-315), we present the evaluation results of using GPT-4 as the judge model in WizardArena. Taking LMSYS ChatBot Arena as a reference benchmark, GPT-4 as the judge model in WizardArena achieves a Spearman coefficient of 95.81%, the Human Agreement (95% CI) of 95.65%, the Differentiation (95% CI) of 86.84%, and an overall average consistency of 92.77%. When Llama3-70B-Chat is employed as the judge model, the Spearman coefficient is 98.32%, the Human Agreement (95% CI) is 96.54%, the Differentiation (95% CI) is 83.18%, and the overall consistency is 92.68%. These findings indicate that WizardArena exhibits a high level of consistency with LMSYS ChatBot Arena, thereby ensuring the reliability and accuracy of the offline WizardArena.
Furthermore, when using GPT-4 and Llama3-70B-Chat as the judge model in WizardArena, the Spearman coefficient reaches 95.81%, with the Human Agreement (95% CI) at 88.46%, and the overall average consistency at 92.14%. However, due to the substantial cost associated with GPT-4, this study employs Llama3-70B-Chat as a more cost-effective alternative in WizardArena.
Question-2: While employing the fine-tuned Llama 70B judge model for automated evaluation reduces costs and enhances efficiency, it might not capture all the nuances and subjective assessments that human evaluators offer. Introducing some human evaluations or qualitative analyses could make the results more convincing.
We sincerely appreciate your valuable suggestions. Due to time constraints and manual annotation costs, we randomly selected a subset of WizardArena-Mix, comprising 100 diverse and 100 hard samples, totaling 200 test cases. We chose four models for evaluation: WizardLM-β-PPO-I3, OpenChat-3.5, Command R+, and Qwen1.5-72B-Chat. WizardLM-β-PPO-I3 served as the reference model, while the others acted as battle models. The evaluations were conducted using Llama3-70B-Chat and professional human annotators, with the win/loss/tie results presented in Table 3 in the newly uploaded WizardArena_rebuttal PDF.
Specifically, when using Llama3-70B-Chat as the Judge Model, the win rates for WizardLM-β-PPO-I3 against Command R+, Qwen1.5-72B-Chat, and OpenChat-3.5 were 34.1%, 41.3%, and 79.7% respectively. When evaluated by human annotators, the win rates for WizardLM-β-PPO-I3 against Command R+, Qwen1.5-72B-Chat, and OpenChat-3.5 were 31.8%, 37.7%, and 82.1% respectively. Therefore, the high consistency between human evaluations and Llama3-70B-Chat further confirms the reliability and accuracy of Llama3-70B-Chat as the judge model in WizardArena.
Dear Reviewer cCkS,
We would like to thank you for your detailed reviews. We genuinely appreciate the time and thoughtful consideration you have dedicated to our work.
As the discussion period is coming to an end, we would be grateful if you could let us know whether our response has addressed your concerns or if you still have any other questions. We look forward to receiving any additional feedback you may have.
We would be happy to have a follow-up discussion or address any additional comments. Thank you once again for your valuable contributions to our work.
Respectfully,
Paper 9817 Authors.
Hello Reviewer,
The author has submitted a response to your comments. Whether or not it addresses your concerns, it would be greatly appreciated if you could acknowledge that you have reviewed the reply.
Dear Reviewer cCkS,
We would like to thank you for your detailed reviews. We genuinely appreciate the time and thoughtful consideration you have dedicated to our work.
Since the discussion period is coming to an end today, we would be grateful if you could let us know whether our response has addressed your concerns or if you still have any other questions.
We look forward to receiving your post-rebuttal rating and any additional feedback you may have. Please feel free to reach out if you have any suggested modifications or require further information.
Thank you once again for your valuable contributions to our work.
Best regards,
Paper 9817 Authors.
Thank you for your response, it addressed some of my concerns. I’ve also reviewed the feedback from other reviewers, and overall, I believe this is a borderline paper, and I tend to accept it. Therefore, I’m inclined to maintain my current score.
The paper proposes a new framework for improving large language models (LLMs) post-training through a simulated environment called WizardArena. This environment aims to avoid the costly and time-consuming manual interventions typically required for training and evaluating chatbots.
Strengths
- This paper proposes a new offline dataset for both training and evaluation.
- The presentation of this paper is well organised, including nice figures.
Weaknesses
- Training Data Quality and Transparency. The authors should provide more clarity regarding the quality of the training dataset. As this is a resource paper, the quality of these datasets is the most important part. The supplementary materials contain only 100 examples, which seems insufficient for a comprehensive evaluation. For instance, one example provided from `nips_code_WizardArena/dpo_train/dpo_train/data/sample_data.json` is:
{"conversations": [{"from": "human", "value": "Hiya!"}, {"from": "gpt", "chosen": "Hello! How can I assist you today?", "reject": " Hello! How can I assist you today?"}]}
It is unclear what models can learn from such simple examples.
- The supplementary materials lack proper organization and documentation. Many folders are empty, and there is no README file or documentation explaining how to use the provided resources. As this is a resource paper, clear instructions and explanations are crucial for the research community to effectively utilize the proposed methods.
- Multilingual. A significant portion of the training examples appears to be multilingual, as evidenced by samples in the `nips_code_WizardArena/dpo_train/dpo_train/data/sample_data.json` file. For example:
{"conversations": [{"from": "human", "value": "\u041f\u0440\u0438\u0434\u0443\u043c\u0430\u0439 \u0441\u043c\u0435\u0448\u043d\u043e\u0435 \u043f\u043e\u0437\u0434\u0440\u0430\u0432\u043b\u0435\u043d\u0438\u0435 \u0441 \u0434\u043d\u0435\u043c \u0440\u043e\u0436\u0434\u0435\u043d\u0438\u044f \u0434\u043b\u044f \u043c\u0443\u0436\u0447\u0438\u043d\u044b \u0441 \u0438\u043c\u0435\u043d\u0435\u043c \u0410\u043b\u0435\u043a\u0441\u0430\u043d\u0434\u0440 \u0438 \u0444\u0430\u043c\u0438\u043b\u0438\u0435\u0439 \u0414\u043e\u0440\u043e\u0448, \u0432 \u0441\u0442\u0438\u0445\u043e\u0442\u0432\u043e\u0440\u043d\u043e\u0439 \u0444\u043e\u0440\u043c\u0435."}, ...
However, the paper doesn't seem to address or describe this multilingual aspect of the training data. A discussion would enhance the paper's completeness.
Questions
N/A
Limitations
N/A
Dear Reviewer, we thank you for your valuable comments and the time you spent reviewing our work! Your professional feedback provides valuable guidance for writing a more comprehensive and competitive paper.
Please find below a detailed discussion of the points you have raised:
Weaknesses-1: Training Data Quality and Transparency. The authors should provide more clarity regarding the quality of the training dataset. As this is a resource paper, the quality of these datasets is the most important part. The supplementary materials contain only 100 examples, which seems insufficient for a comprehensive evaluation....
We sincerely appreciate your attention to our work and your careful and responsible review. We greatly apologize for any inconvenience caused. Regarding the issues related to the data and code, due to strict legal regulations within our company, we are unable to publicly upload the internal code and data without prior review, as doing so could result in severe legal consequences. We kindly ask for your understanding in this matter.
Therefore, the `nips_code_WizardArena/dpo_train/dpo_train/data/sample_data.json` provided in the appendix is derived from our initial early model generation phase, which did not involve any filtering or post-processing, and thus includes some low-quality and multilingual data. However, it is not our final training dataset. We employed Llama3-70B-Chat to score the model responses and filtered out the data when the score difference between "Choose" and "Reject" was below the threshold K (i.e., K <= 1). Additionally, we utilized Polyglot [1] for language-category labeling to exclude non-English data. The final training data in the paper were primarily obtained through judge-pair battles with multiple advanced models.
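For clarity, a minimal sketch of this language-labeling step is shown below; it assumes the Polyglot library cited in [1], and the confidence threshold is an illustrative choice rather than our exact setting.

```python
from polyglot.detect import Detector

def is_english(text: str, min_confidence: float = 90.0) -> bool:
    """Label the language of an instruction and keep only confidently English data."""
    try:
        lang = Detector(text, quiet=True).language  # quiet=True suppresses unreliable-detection warnings
    except Exception:
        return False                                # undetectable text is treated as non-English
    return lang.code == "en" and lang.confidence >= min_confidence

# english_only = [sample for sample in raw_samples if is_english(sample["instruction"])]
```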
Furthermore, we have submitted the code and data for company review, and once approval is granted, we will immediately open-source the entire code and data to support further research in the LLM community. We appreciate your continued interest and once again apologize for any inconvenience caused.
Weaknesses-2: The supplementary materials lack proper organization and documentation. Many folders are empty, and there is no README file or documentation explaining how to use the provided resources. As this is a resource paper, clear instructions and explanations are crucial for the research community to effectively utilize the proposed methods.
We sincerely appreciate your valuable suggestions. Once we receive company approval for open-source release, we will promptly release our data and code, accompanied by well-organized and comprehensive documentation. This will include detailed implementation descriptions of each training and testing step to enable the research community to effectively utilize our proposed methods. We look forward to your continued interest and thank you for supporting our work.
Weaknesses-3: Multilingual. A significant portion of the training examples appears to be multilingual, as evidenced by samples in the nips_code_WizardArena/dpo_train/dpo_train/data/sample_data.json file.
LMSYS-Chat-1M includes over 150 languages, with English, Portuguese, and Chinese being the most prevalent. However, LMSYS-ChatBot Arena does not offer multilingual leaderboards, instead focusing on individual languages. To thoroughly assess the performance of our proposed algorithm in a multilingual context, we selected one language (i.e., Chinese) for detailed evaluation. Due to time constraints, we constructed a test set of 500 multilingual instances based on the Offline Diverse & Hard WizardArena test set described in Section 4.1. This test set comprises 250 diverse and 250 hard samples, sourced from LMSYS-Chat-1M and WildChat[2], with strict deduplication between training and test sets.
For training, we randomly selected 30k original Chinese corpus samples and divided them into three equal parts: SFT 10k, DPO 10k, and PPO 10k. After conducting Judge-pair Battles with Command R+, Qwen1.5-72B-Chat, and OpenChat-3.5, these sets were reduced to SFT 9.1k, DPO 8.4k, and Reward Model 8.1k. The SFT, DPO, and PPO models were then trained for one round, with the results summarized in Table 1 in the newly uploaded WizardArena_rebuttal PDF. The Spearman correlation between WizardArena-CH and LMSYS-ChatBot-Arena-CH reached 98.68%, the Human Agreement with a 95% confidence interval (CI) was 96.45%, and the Differentiation with a 95% CI was 91.17%, indicating a high consistency of 95.43% between the two and thus demonstrating the accuracy and reliability of WizardArena-CH.
Furthermore, after training SFT, DPO, and PPO using the Judge-pair Battle method, the ELO score of WizardLM-β on WizardArena-Mix-CH increased from 808 to 1288 (+480), surpassing OpenChat-3.5 (+20) and approaching GPT-3.5-Turbo-0613. This outcome suggests that the Judge-pair Battle method is also highly effective for multilingual tasks. In future versions of our paper, we will further supplement additional multilingual tasks.
[1] https://github.com/aboSamoor/polyglot
[2] Lin B Y, Deng Y, Chandu K, et al. WILDBENCH: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild[J]. arXiv preprint arXiv:2406.04770, 2024.
Dear Reviewer vqCh,
We would like to thank you for your detailed reviews. We genuinely appreciate the time and thoughtful consideration you have dedicated to our work.
As the discussion period is coming to an end, we would be grateful if you could let us know whether our response has addressed your concerns or if you still have any other questions. We look forward to receiving any additional feedback you may have.
We would be happy to have a follow-up discussion or address any additional comments. Thank you once again for your valuable contributions to our work.
Respectfully,
Paper 9817 Authors.
Hello Reviewer,
The author has submitted a response to your comments. Whether or not it addresses your concerns, it would be greatly appreciated if you could acknowledge that you have reviewed the reply.
Thank you for your response.
While the dataset seems to be an important contribution to this work, the code and full datasets are not currently available for review. The 100 samples provided appear to have serious quality issues.
I wonder if this paper might be more suitable for the NeurIPS 2024 Datasets and Benchmarks Track, which requires that "A key criterion is accessibility: datasets should be available and accessible" [1].
I am inclined to maintain my score, with a low confidence rating.
Reference:
[1] https://neurips.cc/Conferences/2024/CallForDatasetsBenchmarks
Dear Reviewer vqCh,
We sincerely appreciate your attention to our work. We are very eager to make our methods, entire dataset, and code available for review and to the open-source community. However, due to the company's recently stricter open-source policy, the release of our data and code must undergo a review process to ensure compliance. Therefore, the sample data provided in the appendix is derived from our initial early model generation phase, which did not involve any filtering or post-processing, and thus includes some low-quality and multilingual data. However, it is not our final training dataset.
Furthermore, we have been working hard to actively advance the company's review process. To help researchers reproduce our work, we will release in advance the detailed method and relevant prompts, training hyperparameters, the training- and test-data construction process, as well as the ablation studies and models mentioned in our paper, to enhance the transparency and effectiveness of our work. Once approved by the company, we will immediately release the entire codebase and data to support further research in the LLM community. We kindly ask for your understanding in this matter and welcome your continued interest. We look forward to discussing the implementation details of our paper with you and once again apologize for any inconvenience caused.
In the paper, we propose both the offline WizardArena and the innovative Pair-judge Battle method for model post-training, emphasizing generalization and effectiveness rather than focusing solely on the dataset. For the training scenario, we simulate iterative arena Pair-judge battles among various state-of-the-art models on a large scale of instruction data, subsequently leveraging the battle results to continually enhance the target model in both supervised fine-tuning and reinforcement learning. For the evaluation scenario, WizardArena can efficiently predict accurate performance rankings among different models based on an offline test set. Experimental results demonstrate that our WizardArena aligns closely with the online human-based LMSys Chatbot Arena, and that our models, trained iteratively with the Pair-judge Battle method, exhibit significant performance improvements during the SFT, DPO, and PPO stages.
We also sincerely appreciate your engaging thoroughly with both our paper and the rebuttal. We are deeply grateful for the significant time and effort you dedicated to the review process, as well as your professional comments and valuable feedback on our work.
We would be happy to have a follow-up discussion or address any additional comments. We look forward to discussing any implementation details of our paper with you and welcome your continued interest. Thank you once again for your valuable contributions to our work.
Best regards,
Paper 9817 Authors.
We express our gratitude to all reviewers for their thorough evaluations. We have included the supplementary experiments in the newly uploaded WizardArena_rebuttal PDF as follows.
- Table 1 shows the Chinese ELO ranking results of 19 models on LMSYS ChatBot Arena-CH, WizardArena-Diverse-CH, WizardArena-Hard-CH, and WizardArena-Mix-CH (Diverse & Hard). (Weaknesses-3, Reviewer vqCh)
- Table 2 shows the updated ELO ranking results of 26 models on LMSYS ChatBot Arena EN, MT-Bench, Offline-Diverse, Offline-Hard, and Offline-Mix (Diverse & Hard). We add some advanced models (i.e., GPT-4o, GPT-4-1106-Preview, Claude 3.5 Sonnet). (Weaknesses-2, Reviewer cCkS and Reviewer tJho)
- Table 3 shows the win/tie/loss counts of WizardLM-β-PPO-I3 against {Command R+, Qwen1.5-72B-Chat, OpenChat-3.5} evaluated by the Llama3 70B Chat judge and a human judge. (Question-2, Reviewer cCkS)
- Table 4 explores alignment strategies for models in the SFT and RL stages. We utilize three slices of data for SFT, DPO, and PPO training in the first round. (Weaknesses-1, Reviewer tJho)
- Table 5 shows the WizardArena Elo of WizardLM-β-7B-SFT-I1 under different battle modes. (Weaknesses-1, Reviewer tJho)
- Table 6 shows the performance of WizardLM-β trained over different numbers of rounds on WizardArena-Mix. (Weaknesses-1, Reviewer tJho)
- Table 7 shows the win/tie/loss counts of WizardLM-β-PPO-I3 against {GPT-4o, GPT-4-1106-Preview, Claude 3.5 Sonnet} evaluated by the Llama3 70B Chat judge and a human judge. (Weaknesses-2, Reviewer tJho)
- Table 8 shows the performance impact of employing more advanced models to battle with WizardLM-β-7B-I0 at different stages. (Weaknesses-2, Reviewer tJho)
- Table 9 shows the consistency between Llama3-70B-Chat and GPT-4 as judge models on WizardArena-Mix with 16 models. (Question-3.2, Reviewer B73Q)
We appreciate the positive feedback regarding the remarkable performance of our WizardArena and are very excited about future work building on our model and ArenaLearning! We look forward to further in-depth discussions.
This paper proposes a simulated chatbot arena named WizardArena to evaluate and train LLMs efficiently. In WizardArena, an LLM serves as a judge, so there is little human involvement to avoid the costly and time-consuming manual efforts. The authors further propose using iterative battles between multiple LLMs to generate SFT and RL training data for training.
The reviewers do not have significant concerns about the method itself. The primary concern is that the code and datasets are not currently available for review. The authors tried to respond to this issue during the rebuttal.
The concepts of the paper are not entirely new; the effectiveness of using an LLM for evaluation and further improvement is well known. The novel part of the paper is that no previous work has used LLM evaluation to generate a complete leaderboard ranking and demonstrated that it aligns with human evaluation.