TeG-Instruct: Towards Premium Instruction-Tuning Data via Text-Grounded Task Design
Abstract
Reviews and Discussion
This paper proposes a method for generating instruction-tuning data grounded in a text corpus, aiming to address problems such as hallucination. The resulting dataset contains 270k examples. The generation process has two stages: (1) collect seed examples with the help of LLMs; (2) train language models to generate more examples and filter out low-quality ones. The method performs well on the test set proposed in this paper (the TeG-Instruct test set) and on several other benchmarks.
Strengths
- This paper proposes a text-grounded instruction-tuning dataset with up to 270k examples, and the method is scalable.
- The method trains a discriminator to filter out low-quality data, which can alleviate the problem of hallucination.
- The motivation of this paper is intuitive.
Weaknesses
- The contribution of this paper is somewhat limited, and it is not clearly distinguished from [1] and [2].
- There are some concerns regarding the evaluation of the proposed method and dataset. For the automated evaluation, using the test set generated in this paper, TeG-Instruct, as the testing dataset might be somewhat unfair. Additionally, reporting results on more benchmarks would strengthen the paper.
[1] "Self-Alignment with Instruction Backtranslation". Li et al. [2] "Longform: Optimizing instruction tuning for long text generation with corpus extraction." Köksal et al.
Questions
- What is the difference between this work and other text-grounded instruction-tuning dataset generation methods like [1] and [2]?
- How does the evaluation demonstrate that the problem of hallucination is alleviated?
- Are there any ablation experiments demonstrating that the (Instruction, Input, Output) format proposed in the paper is better than the (Instruction, Output) format, particularly for instruction tuning? (A sketch contrasting the two formats follows the references below.)
[1] "Self-Alignment with Instruction Backtranslation". Li et al. [2] "Longform: Optimizing instruction tuning for long text generation with corpus extraction." Köksal et al.
This paper presents TeG-Instruct, a scalable method for collecting instruction-tuning data that derives the instruction, input, and output from given human-written text. First, a meta-training set is built by leveraging a well-aligned teacher model, specifically ChatGPT. The meta-training set is then used to train two LMs: one generates the instruction, input, and output given documents, and the other evaluates the generated content as a binary classification problem. Experiments with Llama2-7B on the collected 270k samples, both in-domain and out-of-domain, indicate the effectiveness of the proposed dataset. A comprehensive automatic and human analysis is also provided to showcase the quality of the constructed TeG-Instruct.
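As a concrete illustration of the generate-then-filter stage described in this summary, here is a minimal sketch assuming Hugging Face Transformers, a causal-LM task generator, and a sequence-classification discriminator; the model paths, prompt format, keep-threshold, and label convention are illustrative assumptions, not the authors' released pipeline.

```python
# Minimal sketch of the second-stage "generate then filter" loop summarized above.
# Model paths, prompt format, the 0.5 keep-threshold, and the "class 1 = good"
# label convention are all assumptions for illustration.
import torch
from transformers import (AutoModelForCausalLM, AutoModelForSequenceClassification,
                          AutoTokenizer)

GEN_PATH = "path/to/task-generator"        # LM fine-tuned on the seed set (hypothetical path)
DISC_PATH = "path/to/task-discriminator"   # binary classifier fine-tuned on the seed set (hypothetical path)

gen_tok = AutoTokenizer.from_pretrained(GEN_PATH)
generator = AutoModelForCausalLM.from_pretrained(GEN_PATH)
disc_tok = AutoTokenizer.from_pretrained(DISC_PATH)
discriminator = AutoModelForSequenceClassification.from_pretrained(DISC_PATH, num_labels=2)

def design_task(document: str) -> str:
    """Generate an (instruction, input, output) triple grounded in `document`."""
    prompt = f"Document:\n{document}\n\nDesign a task grounded in the document above:\n"
    ids = gen_tok(prompt, return_tensors="pt").input_ids
    out = generator.generate(ids, max_new_tokens=512, do_sample=True, top_p=0.9)
    return gen_tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

def keep(document: str, task: str, threshold: float = 0.5) -> bool:
    """Keep a generated task only if the discriminator rates it as high quality."""
    enc = disc_tok(document, task, return_tensors="pt", truncation=True, max_length=2048)
    with torch.no_grad():
        probs = discriminator(**enc).logits.softmax(-1)
    return probs[0, 1].item() >= threshold

corpus = ["A human-written text passage ..."]  # placeholder documents
dataset = []
for doc in corpus:
    task = design_task(doc)
    if keep(doc, task):
        dataset.append({"document": doc, "task": task})
```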
Strengths
- The proposed method is scalable and the released dataset is large-scale.
- The paper is generally well-written and organized. Almost everything is clear to me.
- Thorough automatic and human evaluations over a range of instruction-tuning datasets on various downstream tasks. The analysis of the curated TeG-Instruct is also comprehensive.
Weaknesses
- I'm not so convinced by the experimental results. It is not surprising that TeG-Instruct achieves the best performance on the in-domain tests, while the out-of-domain performance is not that satisfactory, which is understandable given the different output distributions. Instead, what is the performance on some general benchmarks aside from the in-domain and out-of-domain evaluations, such as LongForm and AlpacaEval?
- Not a strict weakness, just wondering why it is called "text-grounded task design"? I'm a little confused by the term "task design"; in my view, "task" usually refers to well-defined task formats, such as QA, IE, etc.
Questions
- Could you further explain the score used in post-processing?
- What are the threshold values used in post-processing, and how are they decided? Is the same threshold used for the two similarity calculations?
- What is the detailed setting for human evaluation? For example, how many people are involved in the annotation process, and what are their backgrounds? Since the number of examples for evaluation is relatively small, i.e., 50, how reliable is the human evaluation?
This article proposes TeG-Instruct, which uses a specially trained generator to generate training samples for instruction tuning and trains a discriminator to filter the samples generated by the generator. This method can generate more training samples for instruction tuning at a lower cost.
Strengths
The introduced method can generate more instruction-tuning samples at a lower cost.
Weaknesses
- The paper states in Figure 1 that human-written text contains noise. However, this is a rare case; if the authors consider it a common problem, then all human-annotated tasks would be pointless. Moreover, the authors use examples generated by ChatGPT as training samples to train a Llama-2-7B model that generates new training samples, and I would expect the samples generated this way to contain even more noise.
- The paper states in Figure 1 that the lack of an input field is detrimental to an LLM's generalization capability on diverse tasks. This claim is not verified in the paper, not supported by other work, and not tested with ablation experiments.
- The method shows almost no improvement on out-of-domain datasets compared to other baselines, and the in-domain experiment is almost meaningless because all other baselines are effectively out-of-domain with respect to the dataset proposed in this paper, which makes the comparison unfair. In addition, the paper does not report the amount of training data used for the other baselines.
Questions
N/A
The paper introduces TeG-Instruct, a paradigm for collecting instruction-tuning data. The procedure consists of two steps: first, well-aligned LLMs are leveraged to generate seed data from both a document view and a task view; then, the seed data is used to train two LMs as a task generator and a task discriminator. The paper provides both human and automatic evaluation of the generated TeG-Instruct dataset, demonstrating that TeG-Instruct is of high quality and improves both in-domain and out-of-domain instruction-following abilities of LLMs.
Strengths
- The paper provides extensive and thorough experimentation on data quality. In particular, the human evaluation of the intrinsic quality of the instruction-tuning datasets is valuable.
- The instruction-tuning datasets will be released and will be valuable to the community.
Weaknesses
- The improvement from the dataset seems incremental compared with previous datasets and methods, especially in terms of downstream evaluation on out-of-domain data.
- The data creation process is complicated, and many of its rules are not clearly specified.
Questions
It is not entirely clear how the TeG-Instruct seed data was constructed. It was mentioned that both the document-view and task-view data in the first stage were manually checked. What is the manual review process? Also, what is the similarity threshold in the post-processing for removing hallucinated examples?
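For context, the question above concerns a filter of roughly this shape: drop generated examples whose output is not sufficiently supported by the source document. The following is a minimal sketch assuming an off-the-shelf sentence-embedding model and cosine similarity; the actual similarity measure and threshold are exactly what the question asks the authors to specify.

```python
# Minimal sketch of a similarity-based hallucination filter of the kind the
# question refers to. The embedding model, cosine-similarity metric, and 0.6
# threshold are illustrative assumptions, not the paper's actual post-processing rule.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def is_grounded(document: str, output: str, threshold: float = 0.6) -> bool:
    """Return True if the generated output is similar enough to its source document."""
    doc_emb, out_emb = embedder.encode([document, output], convert_to_tensor=True)
    return util.cos_sim(doc_emb, out_emb).item() >= threshold

examples = [{"document": "...", "instruction": "...", "input": "...", "output": "..."}]
filtered = [ex for ex in examples if is_grounded(ex["document"], ex["output"])]
```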
The paper introduces TeG-Instruct, a new method for generating instruction-tuning data at a lower cost. The main novelty lies in the combination of a generator trained on seed data from well-aligned LLMs and a discriminator that filters out low-quality samples. Experiments are conducted on both in-domain and out-of-domain data. On the positive side, the proposed approach appears to be scalable, and the released dataset could be valuable to the community. However, the reviewers raised several serious concerns. First, the novelty of this work seems relatively limited compared to prior work on instruction tuning with similar motivations (see reviews 5ypF and q6hF for details, and review 5ypF for references). Second, the positive in-domain results (obtained on TeG-Instruct's own test set) are hardly surprising, while the out-of-domain results are relatively mixed. Finally, using TeG-Instruct's own test set to compare the paper's approach against prior work is somewhat unfair. The authors did not respond to the reviewers to address these concerns. In light of these limitations, I recommend rejecting this paper.
Why not a higher score
The reviewers express concerns regarding the limited novelty of the paper and the use of test data generated by the authors' own approach to evaluate their work against prior research.
Why not a lower score
N/A
Reject