COLM 2024 · Poster · 3 reviewers
Ratings: 6 / 4 / 5 (average 5.0/10; min 4, max 6, std 0.8) · Average confidence: 4.0

Instruction Mining: Instruction Data Selection for Tuning Large Language Models

Submitted: 2024-03-22 · Updated: 2024-08-26
TL;DR

We present InstructMining, a method for selecting high-quality instruction-following data for finetuning large language models.

Abstract

Keywords

data-centric machine learning, large language models, generative models, language model finetuning

Reviews and Discussion

Official Review (Rating: 6)

The paper proposes a method for automatically selecting high-quality instruction-following data for finetuning large language models. The method combines data assessment and selection processes, utilizing natural language indicators to evaluate data quality and identify the optimal subset for finetuning language models. The paper also discusses the double descent phenomenon observed in generative language models during finetuning.

Reasons to Accept

Strengths:

  1. The paper provides empirical evidence to support the effectiveness of the proposed method through experiments on various datasets and benchmarks. The results demonstrate state-of-the-art performance and enhanced finetuning efficiency.
  2. The paper addresses the challenge of selecting high-quality datasets for instruction tuning in language models, which is a crucial aspect of improving model performance in interacting with humans.

Reasons to Reject

Weaknesses:

  1. While the paper presents results comparing the proposed method with other state-of-the-art models, a more detailed comparative analysis with a wider range of existing methods could provide additional insights into the method's performance and effectiveness.
Comment

Thank you for your review and your thoughtful feedback, and sorry for the delay in responding.

We have results for the LIMA-7b model (LLaMA-2-7B finetuned with LIMA data) as well as results from [1]. OpenLLM benchmark scores are shown in the table below.

| Model | Data size | loss@self-instruct | loss@mt-bench | OpenLLM Avg | ARC | HellaSwag | MMLU | TruthfulQA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LIMA-7b | 1000 | 0.9947 | 0.7604 | 0.5533 | 0.5563 | 0.8009 | 0.4371 | 0.4190 |
| InstructMining-7b | 1000 | 0.9576 | 0.7114 | 0.5625 | 0.5589 | 0.7877 | 0.4299 | 0.4736 |
| 15% Alpaca-7b [2] | * | * | * | 0.5637 | 0.5742 | 0.8068 | 0.4640 | 0.4095 |

Reference: [1] From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning

Official Review (Rating: 4)

This paper proposes selecting high-quality instruction-tuning data to better fine-tune LLMs. This is an important problem, and the proposed methodology is very intuitive: quality sometimes beats quantity. The authors develop an instantiation of the methodology by defining instruction quality and applying an existing method for selection. The experimental results show remarkable performance.

Reasons to Accept

  • This paper studies an important problem and the proposed methodology is very intuitive.
  • The paper is well-written and easy to follow.
  • The experimental results look interesting.

Reasons to Reject

  • Only LLaMA-based LLMs are evaluated. Evaluating more models (e.g., Mistral) would help justify that the proposed method generalizes well across different LLMs.
  • Hypothesis 1, from my perspective, basically says that a validation-set loss/performance should be adopted to evaluate training-data quality. This is common practice in hyperparameter tuning, and selecting a subset of training data can itself be viewed as a kind of hyperparameter tuning, so the formulation is not very novel.
  • Also, following this "validation set loss" idea, isn't the double descent phenomenon somewhat expected?
  • Breaking down the loss into a small set of pre-defined indicators via a linear combination seems overly simple.
  • This paper itself doesn't propose a new selection method; instead, it applies an existing method, BLENDSEARCH.
评论

Thanks for your thoughtful feedback, and sorry for the late response. Please find our responses to your concerns below.


Q.1 Only LLaMA-based LLMs are evaluated; more models (e.g., Mistral) would help justify that the proposed method generalizes well for different LLMs.

Here are ablation results for Falcon-7B-based models.

| Model | Data | Loss@mt-bench | Loss@self-instruct |
| --- | --- | --- | --- |
| Falcon-7B | Dolly-selected 1k | 1.1344 | 1.2780 |
| Falcon-7B | Dolly-random 1k | 1.2396 | 1.3723 |
| Falcon-7B | Dolly-all | 1.3028 | 1.3960 |

We also obtained results for Mistral-7B-based models.

| Model | Data | Loss@mt-bench | Loss@self-instruct |
| --- | --- | --- | --- |
| Mistral-7B | Dolly-selected 1k | 1.0528 | 1.2483 |
| Mistral-7B | Dolly-random 1k | 1.1107 | 1.3365 |
| Mistral-7B | Dolly-all | 1.1494 | 1.3824 |

We finetuned these models using the same 1,000 selected Dolly samples that we used in the paper to finetune LLaMA-2-7b.


Q.2 Adopting validation-set loss/performance to evaluate training-data quality is not very novel.

We conducted multiple experiments using various validation sets to learn the relationship between the indicators and validation-set loss. As a result, we derived a linear function that can be applied to each individual data sample for data selection. We were the first to apply this approach to the instruction tuning task, rather than simply using validation loss to evaluate data quality.
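
For concreteness, a minimal sketch of the kind of rule this yields, assuming a log-linear form (the exact indicator set and fitted coefficients are given in the paper; I_1, ..., I_k are placeholders for the natural language indicators):

```latex
% Sketch (assumed form): L(D_val) is the validation loss after finetuning on
% candidate data D, I_1..I_k are natural language indicator values, and the
% beta coefficients are estimated by least squares over many finetuning runs.
\log L(D_{\mathrm{val}}) \approx \beta_0 + \sum_{i=1}^{k} \beta_i \, I_i(D) + \epsilon
```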


Q.3 Following this "validation set loss" idea, isn't the double descent phenomenon somewhat expected?

To our knowledge, we are the first to report the double descent phenomenon in instruction tuning.


Q.4 Breaking down the loss into a small set of pre-defined indicators via a linear combination seems overly simple?

Our proposed InstructMining framework can incorporate other functional assumptions or methods to obtain the quality function. In our paper, we use the least squares method to derive a linear function because it provides a clearer understanding of the resulting function; the function type is just one aspect of our approach. We also demonstrate potential linear relationships between certain metrics and losses.


Q.5 The paper applies an existing method, BLENDSEARCH.

We are the first to utilize BLENDSEARCH for data selection in instruction tuning; it serves as the search component of our method.
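
As a rough sketch of how BLENDSEARCH fits in, here is an outline using FLAML, whose tune module uses BlendSearch as its default search strategy per the FLAML docs; `finetune_and_eval` is a hypothetical stand-in for a real finetuning-plus-evaluation trial, and the exact API details should be checked against the installed FLAML version:

```python
from flaml import tune

def finetune_and_eval(config):
    # Hypothetical trial: take the top-k samples under the learned quality
    # rule, finetune, and return the validation loss. A dummy convex curve
    # stands in here for the real loss-vs-datasize relationship.
    k = config["subset_size"]
    return {"val_loss": (k - 2500) ** 2 / 1e6 + 0.9}

analysis = tune.run(
    finetune_and_eval,
    config={"subset_size": tune.lograndint(lower=500, upper=10000)},
    metric="val_loss",
    mode="min",
    num_samples=20,  # budget of finetuning trials
)
print(analysis.best_config)  # subset size with the lowest observed loss
```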


On the novelty of our paper: we propose a framework for data selection in instruction tuning that helps improve model performance, and our method is the first to combine these components (indicator-based quality estimation and BLENDSEARCH) for data selection in instruction tuning.

Official Review (Rating: 5)

The paper introduces INSTRUCTMINING, a novel method that automates the selection of high-quality instruction-following data for fine-tuning large language models (LLMs). This innovation addresses the significant challenge of manually selecting instructional data, which is both time-consuming and costly.

Reasons to Accept

The proposed method is demonstrated to enhance the performance of LLMs on standard benchmarks. This indicates a substantial improvement over existing methods. INSTRUCTMINING requires significantly less data for training, reducing both training time and computational costs, which is crucial given the resources required for training LLMs.

Reasons to Reject

Firstly, I'm somewhat skeptical about the necessity of the least squares method. Could the authors include additional ablation studies, such as simply combining individual indicator functions to evaluate the quality of each instruction dataset? Moreover, the authors should compare their method against some baselines, like employing the ChatGPT reward score [1], or using perplexity-based metrics [2].

[1] AlpaGasus: Training a Better Alpaca with Fewer Data
[2] From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning

Comment

Thanks for your review. Please find our responses to your concerns below, and apologies for the delay in responding.


Q.1 The necessity of the least squares method, and additional ablation studies (e.g., simply combining individual indicator functions to evaluate the quality of each instruction dataset).

We use the least squares method to estimate the parameters of the multivariate function because of its simplicity, explanatory power, and adaptability to various data types. Beyond the specific parameters reported in the paper, the process of finding a usable quality function is itself significant, and the least squares method could be replaced by other estimation approaches.
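
As a minimal sketch of this estimation step (the indicator names, dimensions, and toy data below are hypothetical, not the paper's actual measurements):

```python
import numpy as np

# Toy stand-in: each row of `indicators` holds natural language indicator
# values (e.g., length, perplexity, reward score) measured for one finetuning
# run's training set; `val_loss` is the validation loss that run achieved.
rng = np.random.default_rng(0)
indicators = rng.normal(size=(129, 4))  # 129 runs x 4 indicators (made-up sizes)
true_beta = np.array([0.2, -0.1, 0.05, 0.3])
val_loss = np.exp(0.7 + indicators @ true_beta + 0.01 * rng.normal(size=129))

# Least squares fit of a log-linear rule: log(loss) ~ beta0 + beta . I.
X = np.column_stack([np.ones(len(indicators)), indicators])
beta, *_ = np.linalg.lstsq(X, np.log(val_loss), rcond=None)

def estimated_quality(sample_indicators: np.ndarray) -> float:
    """Lower predicted loss => higher estimated quality for a single sample."""
    return -float(beta[0] + beta[1:] @ sample_indicators)
```

Any other fitting method (e.g., regularized or nonlinear regression) could be dropped in at the `lstsq` call; least squares is simply the most transparent choice.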


Q.2 Compare the method against some baselines.

Regarding AlpaGasus [1]: using ChatGPT to score data at large scale is expensive, and one advantage of the method proposed in our paper is that it is less costly. We have results for the LIMA-7b model (LLaMA-2-7B finetuned with LIMA data) as well as results from [2]; OpenLLM benchmark scores are shown in the table below.

| Model | Data size | loss@self-instruct | loss@mt-bench | OpenLLM Avg | ARC | HellaSwag | MMLU | TruthfulQA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LIMA-7b | 1000 | 0.9947 | 0.7604 | 0.5533 | 0.5563 | 0.8009 | 0.4371 | 0.4190 |
| InstructMining-7b | 1000 | 0.9576 | 0.7114 | 0.5625 | 0.5589 | 0.7877 | 0.4299 | 0.4736 |
| 15% Alpaca-7b [2] | * | * | * | 0.5637 | 0.5742 | 0.8068 | 0.4640 | 0.4095 |

We also compared our model with the LIMA model using LLM-as-a-judge. Results are listed in the table below; the adjusted win rate counts each tie as half a win (0.4875 + 0.4062/2 = 0.6906).

| Model | Win Rate | Lose Rate | Tie Rate | Adjusted Win Rate |
| --- | --- | --- | --- | --- |
| InstructMining-7b (vs LIMA-7b) | 0.4875 | 0.1063 | 0.4062 | 0.6906 |

References:

  • [1] AlpaGasus: Training a Better Alpaca with Fewer Data
  • [2] From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning
Final Decision

This paper proposes Instruct Mining, an approach to identifying high-quality instruction tuning data instances. The idea is to use a model to select high-quality instructions as a function of features of the language (e.g., length). By selecting fine-tuning data under this model, good performance can be realized with less data.

Overall, this is a nice contribution in the space of (pre-training) data-centric LLM work, showing the critical importance of data quality.

[At least one review was discounted during the decision process due to quality]