IDGen: Item Discrimination Induced Prompt Generation for LLM Evaluation
Abstract
Reviews and Discussion
This paper proposes a method of generating prompts for evaluating large language models such that the prompts are dynamic and allow for showing meaningful performance gaps between different language models. The authors show that the generated data is more challenging and discriminative than prior datasets.
Strengths
- Work is very timely and addresses a major issue in how we can better evaluate LLMs which are continuously improving and saturating existing benchmarks.
- Good to see that the generated prompts are indeed harder than baseline datasets - this should indicate that the prompts are challenging enough to provide decent signal on a language model's capabilities.
- Experimented with many SOTA models and compared with several baseline datasets.
Weaknesses
The main weakness of this work is that much of the pipeline relies on prompting language models to modify seed data. This means that the performance of the language model plays a huge role in the quality of the resulting data. Given that the pipeline seems to have many different steps, each of these steps can introduce errors since LLMs are not fully reliable. It then becomes crucial to have a way of verifying that the generated questions are of high quality. There's also a concern that the ground truth answers might not be entirely accurate. The authors mention both of these issues as limitations.
Questions
- If a particular language model is used to generate data using the proposed method, is there any bias where that model will perform better at solving those problems? For example, if Claude generates the prompt set, will the prompt set be easier for Claude than GPT?
- Is the data generation done for a set of language models or for each individual language model? In other words, are the prompts being dynamically changed with respect to a single language model's response or all language model responses? Specifically, Section 2.2 says that the method "rephrases the question based on the response from the LLM" - which LLM is this statement referring to?
- Are there any experiments to verify the robustness of each individual step in the pipeline? It seems like the current experiments are meant to verify the final output of the pipeline, not the in-between steps.
Limitations
The authors mention the most-important limitations of this work.
Q: If a particular language model is used to generate data using the proposed method, is there any bias where that model will perform better at solving those problems? For example, if Claude generates the prompt set, will the prompt set be easier for Claude than GPT?
A: Thank you for this insightful question. In our experiments, we find that on data generated by Hunyuan, Hunyuan does not necessarily perform as well as potentially stronger models such as GPT-4 and Claude 3; conversely, on data generated by GPT-4, there are cases where Hunyuan performs better than GPT-4. We cannot entirely rule out the bias you mention, but even if it exists, its impact would not be decisive. We have also taken additional measures to mitigate potential biases.
In our paper, we try to minimize this potential bias as much as possible. On the one hand, we pay great attention to the usability of the generated data: during data production, models participate in usability checks of the generated data, and for the final data we also hire experts to conduct checks. The experts judge the usability of the questions from a human-preference perspective to determine whether the questions are reasonable, and the inspection process includes sampling and re-inspection to ensure the accuracy of the judgments. The inspection results show that the evaluation data obtained by our method have satisfactory usability, so we believe these data can be used to evaluate the performance of LLMs. On the other hand, to avoid any suspicion, we do not allow the model that generates the data to participate in the evaluation: in the experiments, the calculation of the evaluation results does not include Hunyuan, to avoid as much as possible any potential bias in Hunyuan-generated data. The evaluation results also confirm that our data can effectively distinguish the performance of existing LLMs.
Q: Is the data generation done for a set of language models or for each individual language model? In other words, are the prompts being dynamically changed with respect to a single language model's response or all language model responses? Specifically, Section 2.2 says that the method "rephrases the question based on the response from the LLM" - which LLM is this statement referring to?
A: For data production, we use only a single language model, i.e., Hunyuan-standard. For the automated usability check of mathematical questions, to improve data usability, we use Hunyuan-standard and Hunyuan-pro to perform checks separately and cross-validate the usability of the data. For all other steps, we use only Hunyuan-standard.
We input the large model's response into the prompt for generating questions, and the large model responsible for generating questions will produce new questions based on the prompt. Therefore, the prompt does not dynamically change with the model's response; it simply incorporates the model's response. While we use Hunyuan, other large models can also be utilized for this purpose.
In Section 2.2, in the sentence "rephrases the question based on the response from the LLM," the LLM refers to Hunyuan (Hunyuan-standard). In footnote 2 on page 3 of the article, we mention, "Unless otherwise specified, all data in this document are generated by Hunyuan (Hunyuan-standard), which is a Large Language Model developed by Tencent." Therefore, we simply refer to it as "the LLM" there.
Q: Are there any experiments to verify the robustness of each individual step in the pipeline? It seems like the current experiments are meant to verify the final output of the pipeline, not the in-between steps.
A: Your question is indeed very insightful. We have not verified the robustness of each independent step; if necessary, we can incorporate relevant experiments in future versions of our work. Theoretically, each individual step in our process is robust.
From a global perspective, the data we utilize is both reliable and stable. Our dataset comprises Chinese and English components. For the Chinese data, we reference the work of TencentLLMEval[1], which has been stably applied in various business scenarios. For the English data, we employ the seed data from Self-instruct[2], which is extensively used in academia. Additionally, during the data production process, we set the temperature of the LLMs to 0 and use fixed prompts to guide the LLM in data production.
For each data point, we implement several measures when calling the LLM API: retrying failed calls up to a fixed number of times using a counter, capturing and handling exceptions, and pausing for three seconds before resubmitting a request after a failure. These measures increase the probability of successful API calls. Furthermore, we run validity checks on the input data of each step to improve the robustness of the data production process.
[1] Xie, Shuyi, et al. "TencentLLMEval: A hierarchical evaluation of real-world capabilities for human-aligned LLMs." arXiv preprint arXiv:2311.05374 (2023).
[2] Wang, Yizhong, et al. "Self-instruct: Aligning language models with self-generated instructions." arXiv preprint arXiv:2212.10560 (2022).
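As an illustration, here is a minimal sketch of the retry-and-pause logic described above; the function and constant names are hypothetical and not taken from the paper.

```python
import time

MAX_RETRIES = 3          # assumed retry budget; the paper does not state the exact number
RETRY_PAUSE_SECONDS = 3  # pause before resubmitting, as described above

def call_llm_with_retries(call_fn, prompt):
    """Call an LLM API via `call_fn` (a hypothetical client wrapper),
    retrying on failure with a short pause between attempts."""
    last_error = None
    for _ in range(MAX_RETRIES):
        try:
            return call_fn(prompt)
        except Exception as err:             # capture and handle exceptions
            last_error = err
            time.sleep(RETRY_PAUSE_SECONDS)  # pause for three seconds before retrying
    raise RuntimeError(f"LLM call failed after {MAX_RETRIES} attempts") from last_error
```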
Thank you for the response. I would be interested in seeing results if a panel of LLMs was used as the evaluator for this method in order to reduce bias. I am also still curious about ways to verify the individual steps within this method. As such, I will be retaining my original score.
Thank you for your reply. We will include the relevant content you mentioned in the subsequent version of the paper.
The paper proposes a prompt synthesis framework for evaluating LLMs that aims to accurately reflect the abilities of different Large Language Models. The authors develop two models to estimate the discriminative power and difficulty of evaluation questions. The study presents "instruction gradient" and "response gradient" methods that exploit rule sets to generalize questions.
Strengths
The paper focuses on the generation of a large number of queries and corresponding answers on general language and mathematical topics. They have released a set of over 3000 questions for LLM evaluation. Their proposed metrics (discrimination index and difficulty score) show significant improvement in the quality of the benchmark datasets.
Weaknesses
Although the paper addresses a crucial research area in the scope of LLM evaluation, the study falls short in several ways. The textual flow is difficult to follow. Many of the introduced concepts are not properly described or cited to prior work. These issues limited the reviewability of this study.
Questions
- The proposed methods, "instruction gradient" and "response gradient," are not properly described in the manuscript. The authors should describe the working procedure of these methods in detail in the main manuscript, as they are the centerpiece of the whole question generation process.
- "Generalizing questions from seed data based on the "instruction gradient" restricts the diversity and confines the content to specific topics" is unclear. Consider explaining.
- In Section 2.3 - Assessing the Usability of General Text Questions: How is the assessment done? Is it done manually with human input, or by an automatic process/model?
- In Section 2.3 - CoT Check for Mathematical Questions: "we use Hunyuan to assess the reasonableness of the question, which successfully identifies the unreasonableness of the problem and corrects it based on the assessment process." How can it be ensured that the model successfully identifies the unreasonableness? Provide a theoretical/experimental study.
- In Section 2.4 - Acquiring reference answers: lines 133-136, are the answers scored by human participants?
- In Section 2.4 - Acquiring reference answers: line 140, what is meant by a "collective voting mechanism"? Please explain clearly.
- In Section 2.5 - lines 148-149, what are "label discrimination indexes"? a. In line 149, "the prompt includes four features": how did you select these features? Provide some analysis. b. In lines 162-164, how did you select the threshold values (e.g., "Low" means less than or equal to 0.1, "High" means values greater than 0.25, etc.)? c. In line 168, "discrimination level label ranging from 0-3": is this range acquired by observations, or have you performed some analyses on the score expressions?
- In Equation 4, what does the "score" mean? Is it the evaluation score depicted in Table 1? a. If you are using the same "score" to calculate the difficulty score and the discrimination indexes, does that mean a question is more difficult if it is more discriminative?
Limitations
While this proposed method is understood to work on general text questions fairly well, mathematical questions are the weakest part of this study.
Q: In equation 4, what does the "score" mean? Is it the evaluation score that is depicted in Table 1? a. If you are using the same "score" to calculate the difficulty score and the discrimination indexes, does that mean a question is more difficult if it is more discriminative?
A: Thank you for your question.
Yes, the score here refers to the score in Table 1.
a. It is the same score, but we consider the difficulty score only a reference for discriminability: a high difficulty score does not necessarily mean the question is more discriminative. For example, for a question with a maximum score of 3, if both evaluation scores are 0, then by the formula its difficulty score is 3 and its discrimination score is 0, i.e., the question is very difficult, but since none of the LLMs can answer it correctly, it is not discriminative. If instead the evaluation scores are 0 and 3, the difficulty score is 1.5 and the discrimination score is 1, indicating that the question can effectively distinguish the level of the LLMs.
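For concreteness, here is a minimal sketch consistent with the two worked examples above, assuming difficulty = max_score − mean(scores) and discrimination = (max(scores) − min(scores)) / max_score; this is our reading of the example, not necessarily the exact formulas in the paper.

```python
def difficulty_and_discrimination(scores, max_score=3.0):
    """Toy computation matching the worked example above (assumed formulas)."""
    mean = sum(scores) / len(scores)
    difficulty = max_score - mean
    discrimination = (max(scores) - min(scores)) / max_score
    return difficulty, discrimination

print(difficulty_and_discrimination([0, 0]))  # (3.0, 0.0): very hard, not discriminative
print(difficulty_and_discrimination([0, 3]))  # (1.5, 1.0): separates the two LLMs
```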
Q: While this proposed method is understood to work on general text questions fairly well, mathematical questions are the weakest part of this study.
A: Thank you for your affirmation of our work on general text questions. For mathematical questions, on the one hand, prompting large models to generate new questions easily leads to unusable questions, which is difficult to handle; on the other hand, generating discriminative mathematical questions is itself a significant challenge. We have nevertheless done substantial work on mathematical questions, which we explain here.
1. We have proposed data generalization strategies that can effectively improve the discrimination and difficulty of questions. In the "Instruction Gradient," for mathematical questions, we propose 8 generalization strategies to guide data generalization. Experiments show that the questions generated by these strategies can effectively distinguish the capabilities of existing LLMs.
2. In practical evaluation scenarios, the difficulty of generalizing mathematical questions lies in the usability of the generated questions. This paper focuses on solving the problem of low usability of generalized mathematical questions and designs a usability checking mechanism for them. On the one hand, we design a CoT procedure for checking the usability of mathematical questions; it guides the LLM to check usability from the perspectives of concepts, logical relationships, solvability, and condition completeness, largely eliminating conceptual errors, logical contradictions, violations of common sense, missing conditions, and unsolvable questions. On the other hand, we effectively revise or discard unusable generated data through multi-model, multi-round iterative checks. Specifically, we use two different LLMs (Hunyuan-standard and Hunyuan-pro in our paper) to judge the usability of each question with the above CoT method. For unusable questions, we revise the question according to the CoT judgment and check again, iterating until both LLMs judge the question as usable or the maximum number of iterations is reached, at which point the question is retained or discarded, respectively. Through this mechanism, the generated mathematical questions have satisfactory usability.
3. The released discrimination estimation model and difficulty estimation model can quickly judge the quality of mathematical questions. In training these models, we introduce a large number of mathematical questions with difficulty and discrimination annotations. The resulting models can effectively and quickly estimate the discrimination and difficulty of mathematical questions. We make the models public to facilitate community research and use.
4. We release a batch of mathematical questions generated by LLMs (specifically Hunyuan-standard and Hunyuan-pro in our paper) with accurate reference answers. We used the Hunyuan model to generate mathematical questions with high discrimination, covering 32 question types including calculus, function properties, and arithmetic operations. In addition, we hired experts in mathematics to check and correct the reference answers to ensure their accuracy.
First, I would like to thank the authors for thoroughly addressing all my questions. The paper tackles an important problem, but its structure does not flow naturally, making it quite difficult to follow. This explains why I had so many initial questions. While the authors did a good job responding to these questions, the paper should be revised to be clearer from the start. It's challenging to fix these issues as an afterthought. Therefore, I will maintain my initial score.
We appreciate your reply and will revise the paper to make it more concise. We will pay special attention to the issues you mentioned and polish the writing carefully. We have given detailed explanations in our answers to the questions you mentioned, and hope that it will help readers better understand the paper. We sincerely hope that our score can be revised.
Q: In section 2.4 - Acquiring reference answers: line 140, what is meant by a "collective voting mechanism"? Please explain clearly.
A: Thank you for raising this issue; we appreciate the opportunity to provide a more detailed explanation. For mathematical questions, we want to select high-quality responses as reference answers wherever possible. However, unlike for general text questions, it is difficult to design a scoring standard for mathematical responses that conforms to human preference. Related work [1] studies the theoretical basis of collective voting and discusses the impact of different voting methods on social welfare. Inspired by this, we introduce a "collective voting mechanism" that selects reference answers by comparing and voting among multiple responses.
We present multiple anonymous responses to the voting LLMs simultaneously, and each voting LLM casts a vote for the response it considers best. The response with the highest number of votes is used as the reference answer. If there is a tie, we randomly select one of the tied responses as the reference answer and mark the question. Despite our efforts to enhance the usability of reference answers, the selected reference answer may still occasionally be incorrect.
To further improve the accuracy of reference answers for mathematical questions, we hire mathematics experts to check and correct the reference answers. The results of this manual review are used as the final reference answers.
[1] Sen, Amartya. Collective choice and social welfare. Harvard University Press, 2018.
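A minimal sketch of the voting step described above; `voter_fns` stands in for the voting LLMs and is a hypothetical interface, not the paper's actual code.

```python
import random
from collections import Counter

def select_reference_answer(responses, voter_fns):
    """Majority vote over anonymous candidate responses.

    Each element of `voter_fns` is a hypothetical callable wrapping one voting
    LLM; it returns the index of the response it judges best. Returns the
    chosen response and whether the vote was tied (tied questions are flagged
    for later review).
    """
    votes = Counter(voter(responses) for voter in voter_fns)
    top = max(votes.values())
    winners = [idx for idx, count in votes.items() if count == top]
    tied = len(winners) > 1
    chosen = random.choice(winners) if tied else winners[0]
    return responses[chosen], tied
```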
Q: In section 2.5 - lines 148-149, what are "label discrimination indexes"? a. In line 149, "the prompt includes four features" - How did you select these features? Provide some analysis. b. In lines 162-164, how did you select the threshold values? (e.g., "Low" means less than or equal to 0.1, "High" means values greater than 0.25, etc.). c. In line 168, "discrimination level label ranging from 0-3" - Is this range acquired by observations? Or have you performed some analyses on the score expressions?
A: Thank you for your question. The label discrimination indexes are the labels mapped from the discrimination indexes that you mention in sub-question b; we will improve the presentation here.
a. The four features we select accompany each sample: the question, its corresponding category, the mean length of that category, and the length ratio. These features are important and provide a meaningful reference for understanding the discrimination of a question.
Question: The question is the most direct and key feature. The model needs to understand the question itself. Without the question, it is impossible to determine the type of information provided.
Category: The discrimination of questions in different categories is usually different. For example, questions in the mathematics category may have different discrimination levels compared to those in the entertainment category. Category information helps us assign appropriate discrimination levels to questions.
Mean length of the category: Considering the difficulty levels across different categories, the average length of questions within a category can indicate the complexity of the questions in that category. Generally, categories with longer answers may involve more complex questions, while categories with shorter answers may involve simpler questions. Therefore, by comparing the average lengths of different categories, we can gain a rough understanding of the question's difficulty, which serves as an important reference for its discrimination.
Length ratio: From the perspective of varying difficulties across categories, the length ratio can help us understand the complexity of the question compared to the average question in its category. A higher length ratio may mean higher difficulty, and a lower length ratio may mean lower difficulty. By analyzing the length ratio, we can better understand the relative ranking of the difficulty and discrimination of the question in its category.
b. The thresholds are estimated from the distribution of evaluation data at the scale of roughly 100,000 samples.
c. 0-3 are the four levels to which we map the discrimination indexes. This division into levels is not unique; it is simply for the convenience of observing data with different discrimination, and we could equally divide it into two levels, etc.
Q: In section 2.3 - CoT Check for Mathematical Questions: "we use Hunyuan to assess the reasonableness of the question, which successfully identifies the unreasonableness of the problem and corrects it based on the assessment process." - How can it be ensured that the model successfully identifies the unreasonableness? Provide a theoretical/experimental study.
A: For mathematical questions, we indeed cannot guarantee that the generated questions can always be usable. However, through our proposed inspection mechanism, we can greatly eliminate the problems of conceptual errors, logical contradictions, violations of common sense, missing conditions, and unsolvable questions.
"The model successfully identifies the unreasonableness" we methion here is an explanation of the case study in Figure 2. In Figure 2, we identified the unreasonable part of the question based on the designed CoT, indicating that the question is unusable and needs to be discarded or modified.
For mathematical questions, we apply the "instruction gradient" to generalize the questions and design a question-checking mechanism to check the usability of the generated questions. On the one hand, we design a CoT procedure that starts from the concepts, judges the logical consistency among different parts, evaluates the solvability of the question, and finally rechecks the question and steps, gradually guiding the LLM to reason about the usability of the question. On the other hand, we conduct multi-turn iterative checks with two different LLMs to ensure the usability of the generated questions as much as possible. Specifically, the two LLMs independently judge the usability of the question via the CoT procedure. When at least one LLM judges the question as unusable, the question is revised according to the judgment given by that LLM (if both LLMs consider the question unusable, we designate one LLM's judgment for the revision), and the revised question is checked again. Only when both LLMs consider the question usable does the iterative inspection end and the question enter production. If the maximum number of iterations is reached and an LLM still judges the question as unusable, the question is discarded.
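A minimal sketch of this iterative check, with hypothetical callables standing in for the two checker LLMs and the reviser; the actual prompts, CoT content, and iteration cap are those given in the paper and appendix.

```python
MAX_ITERATIONS = 3  # assumed cap; the paper's exact value may differ

def iterative_usability_check(question, checkers, reviser):
    """Check a generated math question with two LLMs and revise until both accept.

    `checkers` are two hypothetical callables applying the CoT check, each
    returning (is_usable, feedback); `reviser` rewrites the question from the
    feedback of one rejecting LLM. Returns the usable question, or None (discard).
    """
    for attempt in range(MAX_ITERATIONS):
        verdicts = [checker(question) for checker in checkers]
        if all(usable for usable, _ in verdicts):
            return question                    # both LLMs accept: keep the question
        if attempt == MAX_ITERATIONS - 1:
            break                              # out of iteration budget: discard
        _, feedback = next(v for v in verdicts if not v[0])
        question = reviser(question, feedback) # revise per the rejecting LLM's judgment
    return None
```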
Q: In section 2.4 - Acquiring reference answers: lines 133-136, are the answers scored by human participants?
A: For the selection of reference answers for general text questions, we use Hunyuan (Hunyuan-standard) to score the responses.
Using LLM responses to the instructions as reference answers is relatively common in the data generation field. For example, the Alpaca dataset [1] uses GPT-3.5 (text-davinci-003) to provide responses to questions as reference answers, and the instruction tuning with GPT-4 work [2] uses GPT-4 to answer Chinese questions and serve as reference answers.
Despite this, we hope to improve the quality of reference answers as much as possible. Inspired by [3], for general text questions we provide seven evaluation criteria: Safety (0-30 points), Correctness (0-10 points), Relevance (0-10 points), Comprehensiveness (0-10 points), Readability (0-20 points), Richness (0-10 points), and Humanization (0-10 points). We are inclined to believe that responses with higher scores are of higher quality. We call multiple LLMs, including Hunyuan, GPT-4, GPT-4-Turbo, Wenxin 4, and Qwen, to respond to the instructions, then use Hunyuan to score these responses and select the highest-scoring response as the reference answer.
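A minimal sketch of this selection step, with the criterion maxima listed above; `score_fn` is a hypothetical wrapper around the scoring LLM, not an API from the paper.

```python
# Maximum points per criterion, as listed above.
CRITERIA_MAX = {
    "Safety": 30, "Correctness": 10, "Relevance": 10, "Comprehensiveness": 10,
    "Readability": 20, "Richness": 10, "Humanization": 10,
}

def pick_reference_answer(responses, score_fn):
    """Return the candidate response with the highest total criterion score.

    `score_fn(response)` is a hypothetical callable wrapping the scoring LLM;
    it returns a dict mapping each criterion to the points awarded.
    """
    def total(response):
        scores = score_fn(response)
        # Clip each criterion to its maximum before summing.
        return sum(min(scores.get(c, 0), cap) for c, cap in CRITERIA_MAX.items())

    return max(responses, key=total)
```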
We further involve humans in checking the usability of the answers. We select 150 generated general text questions and obtain reference answers in the aforementioned manner. We ask evaluators to score the selected reference answers according to the evaluation criteria in Table 1 of the paper. We remove 15 questions that none of the models answer correctly (these questions might be too difficult; since all models answer incorrectly, answer selection is not meaningful for them). The results show that the usability rate of the reference answers reaches 84.7%, which is higher than the highest correct rate among the alternative reference answers, Wenxin 4 (78.8%). This indicates that the answer selection criterion can ensure the usability of the answers.
[1] Taori, Rohan, et al. "Stanford alpaca: An instruction-following llama model." (2023): 6.
[2] Peng, Baolin, et al. "Instruction tuning with gpt-4." arXiv preprint arXiv:2304.03277 (2023).
[3] Liu, Yilun, et al. "Automatic instruction optimization for open-source llm instruction tuning." arXiv preprint arXiv:2311.13246 (2023).
Q: The proposed methods - “Instruction gradient” and “response gradient” are not properly described in the manuscript. Authors should write the working procedure of these methods in detail in the main manuscript, as these are the centerpiece of the whole question generation process.
A: Thanks for your question. We explain "instruction gradient" and "response gradient" in footnote 1 on the second page of the manuscript; these are the names we give to our methods. The process of generating generalized questions from seed data is analogous to forward propagation: the LLM generates responses to the questions, and this process is pushed further forward. Based on these responses (treated as information or knowledge), new questions are generated again; this backward push can be compared to backpropagation. We therefore adopted the term "gradient": we name the process of generating questions based on seed data the "instruction gradient" and the process of generating generalized questions based on LLM responses the "response gradient."
We further detail the working procedure in the caption of Figure 1 in the manuscript. Our procedure is as follows. First, we collect a batch of seed data and divide it into mathematical and general text categories. Next, we apply the "instruction gradient" to both types of questions; the specific generalization strategies differ between the two types because of the different question forms, and we provide the core generalization strategies in Table 5 of the appendix. We have the LLM (Hunyuan-standard in our paper) rewrite the seed data according to these strategies to obtain new questions. For general text questions, we can further apply the "response gradient": first obtain the LLM's response to the question, and then ask new questions based on the content of the response; the prompt for this process is shown in Table 7 of the appendix. For mathematical questions, after generating questions with the "instruction gradient," we focus on the usability of the questions: we design a CoT check, use multiple models (Hunyuan-standard and Hunyuan-pro in our paper) to judge usability, and revise or discard questions based on the inspection results. The specific CoT content is shown in Table 9 of the appendix.
Q: "Generalizing questions from seed data based on the "instruction gradient" restricts the diversity and confines the content to specific topics" - is unclear. Consider explaining.
A: Thank you for your suggestion. We provide further explanation for this part: the content generated by generalizing seed data through the "instruction gradient" stays relatively close to the topic of the seed data.
To make the generalized evaluation data more diverse, on the one hand, we can ensure the overall diversity of the evaluation data through the diversity of seed data. On the other hand, we enhance the diversity of questions through the "response gradient." For example, for the question "How can NLP technology be used to detect and prevent the spread of fake news?", using the instruction gradient for generalization, we can obtain a new question "List three specific methods to detect and prevent the spread of fake news using NLP technology and explain their principles," which still revolves around the original question for expansion or transformation. To address this, we consider discarding the original question and using the LLM-generated response as information or knowledge. At this point, we only generate questions based on a piece of text, and the questions may become more interesting based on the content of the response. In the above example, we may generate a new question "What NLP tasks are typically addressed by fact-checking and source analysis techniques?"
Q: In section 2.3 - Assessing the Usability of General Text Questions: How is the assessment done? Is it done manually with human input? Or by an autonomic process/model?
A: This part explains the usability check for general text data, which is implemented automatically by the LLM. We propose four criteria that we believe are important for general text questions: safety, neutrality, integrity, and feasibility. Here, safety refers to the absence of explicit, politically sensitive, or violent content in the question; neutrality refers to no bias or racial discrimination in the instructions; integrity refers to sufficient information provided to clarify the task; and feasibility refers to instructions within the AI system's capability range.
We use the LLM (specifically, Hunyuan) to score general text questions based on these four criteria. Questions that do not receive a perfect score are considered unusable. For general text questions, the incidence of unusability occurs less frequently, so we discard questions deemed unusable without modifying them.
In the experiment, we manually annotate the generated general text questions and find that the usability reaches 94.0%.
The paper introduces a novel framework for evaluating Large Language Models (LLMs) based on Item Discrimination (ID) theory, which generates adaptive, high-quality prompts to effectively differentiate model performance. Key contributions include a dynamic evaluation set that evolves with LLM advancements, a self-correcting mechanism for prompt precision, and models to estimate prompt discrimination and difficulty. The authors validate their framework by testing it on five state-of-the-art models and release a dataset of over 3,000 prompts to aid further research, demonstrating enhanced challenge and discrimination over previous methods.
Strengths
The paper proposes a novel prompt generation method to produce more challenging evaluation data. The paper is well-structured and clearly written. The methodology and evaluation criteria are explained clearly, making the paper accessible to a broad audience.
Weaknesses
The paper only used one LLM (Hunyuan) to generalize data and did not verify whether the proposed method generalizes to other LLMs. It is debatable whether using test data generated by an LLM to evaluate the performance of LLMs has practical value. The paper lacks validation of the effectiveness of the machine-generated test set, such as comparing its metrics with those of other human-annotated datasets. The paper also lacks an analysis of the diversity of the data used to produce the test set.
Questions
The concerns are included in the weaknesses.
Limitations
The authors have identified some limitations; however, there are additional ones that I have raised in the weaknesses.
Q: The paper only used one LLM (Hunyuan) to generalize data and did not verify whether the proposed method can generalize to other LLMs.
A: Thank you for your question about our paper. Our proposed method is designed for existing LLMs and is not limited to a particular model. Work on using LLMs to automatically generate data often selects only one LLM for data generation; for example, WizardLM [1] uses gpt-3.5-turbo to generate instruction data, and Self-instruct [2] uses GPT-3 to generate instruction data.
We apply our proposed method to some other LLMs, such as GPT-4-turbo (gpt-4-turbo-2024-04-09) and Qwen (Qwen-max), using the same small batch of seed data, manually scoring the models' responses to calculate discrimination indexes and mapping them to the four discrimination levels. The experimental results are shown in the table below. They show that the models differ in effectiveness and that using more powerful models may generate higher-quality data. This also confirms the limitation mentioned in the conclusion of our paper: our framework relies on the performance of the underlying large model.
| Model | Amount | Low | Relatively Low | Relatively High | High |
|---|---|---|---|---|---|
| Seed_data | 50 | 45 | 0 | 4 | 1 |
| Hunyuan | 50 | 29 | 8 | 8 | 5 |
| Qwen | 50 | 28 | 13 | 6 | 3 |
| GPT-4-turbo | 50 | 21 | 5 | 10 | 14 |
Q: It is debatable whether using test data generated by an LLM to evaluate the performance of LLMs has practical value.
A: The issue you mention is a very deep one. On the one hand, if the data produced by a model is not controlled and filtered, it can have a significant negative impact on the model [1], so a good filtering mechanism is crucial to the effectiveness of model-generated data. On the other hand, if manually annotated data lacks a good filtering mechanism, it can also significantly affect model training, as shown by work such as LIMA [2]. However, a single evaluation often requires tens of thousands of evaluation items to fully measure the capabilities of large models; manually writing questions is too costly and slow. It is therefore necessary to use LLM-produced test data in conjunction with manually written questions to evaluate the capabilities of large models.
Therefore, this paper proposes an automated approach for constructing high-quality evaluation data, with two main contributions: (1) globally, it explores how to ensure the diversity of the data, such as through the classification of seed data and the diversity of generation methods; (2) for each data point, a very effective usability check mechanism is designed. This process is reusable and will be fully open-sourced. The proposed method has also been validated in an actual production environment, helping to stably and comprehensively improve the performance of the production model, which confirms its effectiveness in both experimental and production settings.
[1] Shumailov, Ilia, et al. "AI models collapse when trained on recursively generated data." Nature 631.8022 (2024): 755-759.
[2] Zhou, Chunting, et al. "Lima: Less is more for alignment." Advances in Neural Information Processing Systems 36 (2024).
Q: The paper lacks validation of the effectiveness of the machine-generated test set, such as comparing its metrics with those of other human-annotated datasets.
A: Thank you for your suggestion. We address it from the perspectives of usability, production efficiency, and cost to supplement the missing comparison.
Usability: Human-annotated datasets are not necessarily all usable and often contain errors; they also require repeated checking and review to reach a high level of usability (e.g., above 95%). The usability of the questions in our generated data reaches 94% (based on human-annotated results), which is satisfactory for evaluation data.
Production efficiency: In this paper, checking a machine-generated question takes 2-5 API calls, with an average time of about 20 seconds per question. In contrast, manual writing takes about 5 minutes per question and is subject to fatigue effects.
Cost: In this paper, generating and checking a question involves about 9k tokens of input and output, costing approximately $0.03, whereas human-annotated datasets are comparatively expensive.
Q: The paper lacks an analysis of the diversity of the data used to produce the test set.
"A:" Thanks for your suggestion on our paper. We appreciate your suggestion and provide a response to this issue, supplementing the explanation of the diversity of the data.
We ensure the diversity of the seed data through a rich variety of categories. Our seed data consists of Chinese and English parts. The Chinese data draws on the work of TencentLLMEval, which includes 6 primary categories and 61 secondary categories. The English data uses the seed data from Self-instruct, which contains 175 different task types.
In terms of methods, we design diversified generalization strategies including "Instruction Gradient" and "Response Gradient" to promote the generation of diversified questions.
In the actual production process, we filter out similar questions. Following Self-instruct, we remove data samples with a ROUGE-L score greater than 0.7 against existing samples.
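A minimal sketch of this similarity filter using the `rouge-score` package (the paper does not specify the implementation; Chinese text would additionally need word segmentation before ROUGE over whitespace tokens is meaningful).

```python
from rouge_score import rouge_scorer  # pip install rouge-score

def filter_similar_questions(questions, threshold=0.7):
    """Keep a question only if its ROUGE-L F1 against every kept question is <= threshold."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    kept = []
    for question in questions:
        similar = any(
            scorer.score(prev, question)["rougeL"].fmeasure > threshold for prev in kept
        )
        if not similar:
            kept.append(question)
    return kept
```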
Dear Reviewers,
the authors have provided an extensive rebuttal for all reviewers. Please have a look at their responses to see if they address your concerns. If you have further questions, you can now start a discussion with the authors.
Your AC
The main concerns raised by the reviewers are
- a lack of practical applicability
- textual flow of the paper and lack of proper explanation of terms and missing references
- lack of comparison to human annotations

The rebuttal from the authors was quite extensive, and they assure that the method has actually been tested in a production environment. They also promise to incorporate all the suggestions regarding writing and including references. I also find their response as to why they did not compare with human annotations convincing, as this would be much more expensive than the experiments performed for the paper. Therefore, while I believe that the paper can be improved, I do not think there are major issues that stand in the way of acceptance.