PaperHub
Rating: 6.8 / 10 (Poster · 4 reviewers · min 5, max 8, std 1.1)
Individual scores: 5, 8, 7, 7
Confidence: 4.0 · Correctness: 3.5 · Contribution: 3.3 · Presentation: 3.5
NeurIPS 2024

MAmmoTH2: Scaling Instructions from the Web

OpenReview · PDF
Submitted: 2024-05-13 · Updated: 2024-11-19
TL;DR

We introduce a scalable approach to harvest 10M high-quality instruction data from web corpus for fine-tuning language models, significantly boosting their reasoning performance without costly human annotation or GPT-4 distillation.

Keywords

large language models, instruction tuning, reasoning

Reviews & Discussion

Review (Rating: 5)

The paper introduces MAmmoTH2, a novel approach to instruction tuning for large language models (LLMs) by harvesting naturally existing instruction data from the web. The authors develop a three-step pipeline (recall, extract, refine) to collect 10 million high-quality instruction-response pairs without relying on costly human annotation or GPT-4 distillation. Fine-tuning LLMs with this dataset significantly improves performance on reasoning benchmarks. The MAmmoTH2-Plus model, further tuned on public instruction datasets, achieves state-of-the-art results on multiple benchmarks.

Strengths

  • Demonstrates a cost-effective way to collect large-scale, high-quality instruction data from the web.
  • Significant performance gains on reasoning benchmarks, with MAmmoTH2 models outperforming existing models.
  • Comprehensive evaluation across multiple benchmarks, showing robust improvements.

Weaknesses

  • The approach primarily combines existing methods (data recall, extraction, refinement) rather than introducing fundamentally new concepts or techniques.
  • More explicit comparison with prior work is needed to highlight the unique contributions and differences of this approach.
  • The quality and diversity of the collected data heavily depend on the web sources, which may introduce biases or inconsistencies.

Questions

  • How does MAmmoTH2 compare directly with other methods that use synthetic or human-annotated data in terms of data quality and model performance?
  • What measures were taken to ensure the quality and relevance of the extracted Q-A pairs from the web?
  • How does the model address potential biases in the web-sourced data, and what steps were taken to mitigate these biases?

Limitations

The authors address some limitations of their approach, such as the dependency on web data quality and the challenges in maintaining the diversity and relevance of the instruction data. However, a more detailed discussion on potential biases introduced by web data and the ethical implications of using such data could strengthen the paper.

Author Response

We thank the reviewer for the positive feedback on our cost-effective approach, significant performance gains, and comprehensive evaluation!

"Novelty of approach"

Our method's novelty lies in its unique pipeline for mining naturally existing instruction data at scale, offering a new paradigm for creating large-scale, high-quality instruction datasets without costly human annotation or GPT-4 distillation. To the best of our knowledge, ours is the first paper to formally study the impact of automatically scaling up SFT data, especially on reasoning tasks. Reviewer yEcJ, Reviewer k2fV, and Reviewer Jcxm have acknowledged our approach as novel, simple, and effective.

"More explicit comparison with prior work"

In Tables 2 and 3, we provide a comprehensive comparison with prior works, encompassing general pre-trained base LLMs, general instruction-tuned models, and reasoning-specific models such as Deepseek Math, Intern-Math, and Rho-1-Math. Notably, we also include a comparison with Llama-3-8B-Instruct, which utilizes 10 million human-annotated instructions. Our model demonstrates superior reasoning performance to these existing models.

"Web sources have biases and inconsistencies. What steps were taken to mitigate these biases?"

We've taken several steps to mitigate biases and ensure data quality:

  • Using diverse seed data across multiple domains
  • Employing multiple LLMs in the refinement stage
  • Implementing a three-step pipeline (recall, extract, refine) to improve data quality

"How does MAmmoTH2 compare directly with other methods that use synthetic or human-annotated data?"

  • We are pioneers in developing a method for scaling up instruction tuning via data synthesis.
  • Our approach outperforms models trained on human-annotated datasets of similar size (e.g., Llama-3-8B-Instruct) on reasoning tasks while matching performance on general tasks, suggesting comparable or higher quality for certain tasks.

"What measures were taken to ensure the quality and relevance of the extracted Q-A pairs?"

We use a multi-step process to ensure the quality and relevance of our data:

  • Careful selection of seed data and websites
  • LLM-based extraction of relevant Q-A pairs (an illustrative prompt sketch follows this list)
  • Refinement step to improve formatting and add missing explanations
  • Human evaluation of a sample set (as shown in Figure 6)
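
For illustration only, the extraction and refinement calls above might use prompt templates shaped like the following. These are hypothetical sketches, not the paper's actual prompts (which the authors point to in the appendix/repository):

```python
# Hypothetical prompt templates for the extract/refine steps; illustrative
# only -- the authors' actual prompts live in the paper's appendix/repository.
EXTRACT_PROMPT = """Below is a web document. Extract every self-contained
question together with its answer as a numbered list of Q:/A: pairs.
Skip advertisements, navigation text, and incomplete items.

Document:
{document}"""

REFINE_PROMPT = """Below is a question-answer pair extracted from the web.
Fix formatting issues and, if the answer lacks intermediate reasoning,
rewrite it with complete step-by-step explanations. Do not change the
final answer unless it is clearly wrong.

Question: {question}
Answer: {answer}"""
```
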
Comment

Thanks for your rebuttal, I have no more questions.

Review (Rating: 8)

The paper proposes a 3-stage pipeline to harvest extra-large-scale instruction data from the pre-training web corpus to enhance LLM reasoning, which involves 1) recalling relevant documents, 2) extracting instruction-response pairs using an LLM, and 3) refining the extracted pairs by completing the intermediate reasoning steps using an LLM.

The paper

  • proposes an effective pipeline to synthesize large-scale high-quality instruction data, especially for reasonable prompts and reliable answers;
  • empirically validates the effectiveness of scaling up instruction data for reasoning tasks;
  • builds MAmmoTH2-Plus models, achieving performance superior to or comparable with previous SotA on various reasoning datasets;
  • provides an extra-large-scale instruction dataset for reasoning tasks, WebInstruct, as a unique public data resource;
  • conducts extensive ablation studies, providing many insights like:
    • SFT loss is better than LM loss (at least when evaluated on QA tasks);
    • refining extracted instruction pairs by completing the intermediate reasoning steps is significantly helpful;
    • using multiple LLMs to refine the instruction data is usually better than a single LLM;
    • “Education” data (exam-style) are usually better than “Forum” data (discussion-style) (at least when evaluated on QA tasks);
    • even benchmarks conventionally thought very relevant might conflict with each other (GSM & MATH in Table 5), implying limited generalization of LLMs.

Strengths

  • The scaling effect of instruction data is an important empirical question. The paper is the first to scale instruction data to 10M pairs, showing the feasibility and effectiveness of scaling up instruction data (for reasoning tasks).
  • Synthesis of high-quality prompts and answers is important for further data augmentation but rather under-explored. The paper finds an effective method to synthesize reasonable prompt and relatively reliable answers by harvesting from web corpora.
  • MAmmoTH2-Plus models achieve performance superior to or comparable with previous SotA on various reasoning datasets.
  • Extensive experiments are conducted on various base models and especially diverse challenging reasoning benchmarks, instead of easy ones with limited scope (e.g. many benchmarks similar to GSM8K), convincingly validating the method's effectiveness.
  • Many insightful and useful observations in ablation studies (as mentioned in the summary).
  • The paper is generally well written to be clear and detailed.

Weaknesses

  • It needs further consideration whether training on WebInstruct is compatible with, or a necessary addition to, existing training pipelines for achieving the best final performance (on reasoning tasks). The paper achieves its best performance (MAmmoTH2-Plus) with a 2-stage instruction tuning on pre-trained models but doesn't involve continual pre-training, which should be rather important for models' reasoning abilities, as shown by works like DeepSeek-Math. Pre-training and RL are out of this work's scope, but it would be better to further clarify the impacts of 1) continual pre-training, 2) training on WebInstruct, 3) final fine-tuning on additional instruction datasets, and their combinations.
    • Table 7 shows the performance on reasoning benchmarks of applying 2/3/2+3 on Mistral-7B/Mixtral-8x7B. But the comparison might be a little unfair: the domains of the “Public Datasets” are wider than those of WebInstruct, since they include the code generation dataset Code-Feedback, while the benchmarks only involve mathematical and scientific reasoning in natural language. This might underestimate the performance of “Public Datasets”, given the possible conflict between code generation and reasoning in natural language. It might be better to remove Code-Feedback from the “Public Datasets” when comparing with WebInstruct.
    • To consider 1) continual pre-training, it may be infeasible to conduct yourselves, but a possible workaround could be to make full use of the resources provided by DeepSeek-Math: DeepSeekMath-7B is continually pre-trained from DeepSeek-Coder-Base-v1.5. By comparing performances on reasoning benchmarks of applying 2/3/2+3 on DeepSeek-Coder-Base-v1.5/DeepSeekMath-7B, along with the two models themselves, a more comprehensive study of the impacts of these training stages can be done.
  • Table 7 shows that, for the strong Mixtral-8x7B, the gains of adding WebInstruct to “Public Datasets” are marginal, implying that the effect of WebInstruct for strong base models might be limited.

After rebuttal and discussion

The authors resolved most concerns and validated that MAmmoTH2 can efficiently substitute continual pre-training in the standard SotA pipeline. The limitation is that MAmmoTH2 fails to combine with continual pre-training to effectively push forward the upper limit.

I decide to change my score to 8.

Questions

Suggestions:

  • The refinement step is important, and the current setting can be seen as distillation from strong models (Mixtral-8×22B and Qwen-72B). The method could be more promising if it could help self-improvement/weak-to-strong generalization. I highly recommend adding experiments that train Mixtral-8×22B, Qwen-72B, or stronger models in future versions.

Confusions:

  • Are training data sizes in experiments for Table 5 controlled to be comparable?
  • What does the Data Source “Base” mean in Table 5?

Limitations

The limitations of this work are acceptable and the authors point out potential directions to address the limitations for future works.

Author Response

We thank the reviewer for the positive feedback on our effective pipeline, our large-scale instruction dataset for reasoning tasks, and the many useful insights from extensive experiments.

"Compatibility with existing continual-training pipelines and impact investigation"

We appreciate this valuable suggestion, though continual training is beyond our project's scope. Our focus was on improving reasoning performance through scaled instruction tuning. To address this point, we fine-tuned DeepSeek-Math-Base-7B on WebInstruct and additional public datasets. Results show that WebInstruct can further significantly improve DeepSeekMath (which has already been continually pre-trained on math documents). After fine-tuning on additional public SFT data, our model achieves comparable performance on math reasoning and higher performance on other reasoning benchmarks, demonstrating compatibility with existing continual training pipelines. We will add these additional results and discussions in the revision.

| Model | TheoremQA | MATH | GSM8K | GPQA | MMLU-S | BBH | ARC-C | AVG |
|---|---|---|---|---|---|---|---|---|
| Deepseek Math 7B Base | 25.3 | 34.0 | 64.2 | 29.2 | 56.4 | 59.5 | 67.8 | 48.1 |
| + WebInstruct | 30.1 | 38.2 | 70.5 | 33.3 | 59.5 | 61.8 | 76.1 | 52.8 |
| + Additional SFT | 31.5 | 45.2 | 80.2 | 35.2 | 60.5 | 62.0 | 76.4 | 55.8 |
| Deepseek Math 7B Instruct | 23.7 | 44.3 | 82.9 | 31.8 | 59.3 | 55.4 | 70.1 | 52.5 |
| Mistral 7B Base | 19.2 | 11.2 | 36.2 | 24.7 | 50.1 | 55.7 | 74.2 | 38.8 |
| + WebInstruct | 29.0 | 36.7 | 68.4 | 32.4 | 62.4 | 58.6 | 81.7 | 52.8 |
| + Additional SFT | 29.2 | 45.0 | 84.7 | 36.8 | 64.5 | 63.1 | 83.0 | 58.0 |

"Self-improvement and weak-to-strong generalization"

We agree that exploring self-improvement and weak-to-strong generalization would be valuable. We'll consider experiments with Mixtral-8×22B, Qwen-72B, or stronger models in future work.

"Public Datasets domains wider than WebInstruct"

This is not accurate. We evaluated code generation (HumanEval, MBPP) and general chat benchmarks (MT-Bench, AlpacaEval 2.0, Arena Hard) in Table 3. The additional PLUS data training aims to make our models more general and capable of tasks beyond reasoning.

“Confusions”

For Table 5, we train all models for the same number of steps; we will clarify this. “Base” in Table 5 refers to the base model's performance. We'll make this clearer in the table caption or legend.

Comment

Thanks to the authors for your clarifications! However, I still have some concerns as below.

"Compatibility with existing continual-training pipelines and impact investigation"

I understand that you focus on scaled instruction tuning. However, from a holistic perspective, it is meaningful to know what components are necessary for a SotA end-to-end pipeline. Despite your new experiments, we still don't know the comparisons between:

  • CPT + SFT vs. CPT + Scaled SFT + SFT (e.g., DeepSeekMath-7B-Base + Additional SFT vs. DeepSeekMath-7B-Base + WebInstruct + Additional SFT) -- it is possible that adding WebInstruct shows few gains, similar to Table 7.
  • the settings above vs. corresponding ones without CPT (e.g., substituting DeepSeek-Coder-Base-v1.5 for DeepSeekMath-7B-Base in the experiments above and comparing all results together) -- it is possible that WebInstruct could efficiently substitute for the DeepSeekMath corpus without damaging performance.

"Public Datasets domains wider than WebInstruct"

I am talking about results from Table 7, where you didn't evaluate coding tasks but only reasoning tasks, instead of Table 3/6. I consider it slightly unfair because Additional SFT contains code-related data, which might damage its reasoning performance.

Comment

Thanks so much for your follow-up question! We appreciate your constructive comments!

To some extent, you share some concerns with Reviewer k2fV. Would it make sense to you if we added the following results based on DeepSeek:

  • (a) Deepseek Coder V1.5 (base model)
  • (b) Deepseek Math (base model + CT/Recall)
  • (c) Deepseek Coder V1.5 + WebInstruct (base model + Extract + Refine): to verify whether scaled-up SFT could be a more cost-effective way than traditional CT.
  • (d) Deepseek Math + WebInstruct (CT/Recall + Extract + Refine)
  • (e) Deepseek Math + WebInstruct + Additional SFT (CT/Recall + Extract + Refine + Additional SFT)
  • (f) Deepseek Math + Additional SFT (CT/Recall + Additional SFT)
  • (g) Deepseek Math + WebInstruct (without Refine) + Additional SFT (CT/Recall + Extract + Additional SFT)

If that makes sense to you, we will try to add the results during the discussion period.

Thank you again for the constructive comments!

Comment

Thanks for your efforts in the rebuttals to all reviews!

These experiments are good! And it would be best if you could also try: (priority descending)

  1. Deepseek Coder V1.5 + WebInstruct + Additional SFT (base model + Extract + Refine + Additional SFT) -- to verify whether scaled-up SFT could be a more cost-effective way than traditional CT in an end-to-end way
  2. Deepseek Coder V1.5 + Additional SFT -- to see the improvement brought by WebInstruct on Deepseek Coder V1.5

I would consider improving my score if scaled-up SFT could effectively substitute for traditional CT and/or push forward the upper limit of the standard SotA pipeline.

Review (Rating: 7)

This paper proposes an approach to automatically harvest large-scale instruction data from pre-training corpora for reasoning tasks. The main steps include: (1) Recall: training a fastText model to recall relevant documents from the pre-training corpus, similar to DeepSeekMath; (2) Extract: using open-source models with few-shot prompting to extract question-answer pairs from the recalled documents; (3) Refine: prompting open-source models to remove noise, adjust formats, and complete the reasoning process for the extracted question-answer pairs.

Using this method, the authors harvested 10 million instruction data and trained MAmmoTH2 models. Without relying on closed-source models, MAmmoTH2 achieves excellent performance on various reasoning tasks.
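
For concreteness, here is a minimal Python sketch of how such a recall-extract-refine flow could be wired up. The fastText usage mirrors the recall stage as described; `llm`, the training-file name, and both prompts are hypothetical placeholders rather than the authors' actual implementation:

```python
import fasttext

def llm(prompt: str) -> str:
    # Hypothetical stand-in for a call to an open-weights chat model
    # (e.g., served locally); replace with your inference client.
    raise NotImplementedError

# Step 1 (Recall): a supervised fastText classifier, trained on seed
# instruction-like documents (positives) vs. random web text (negatives),
# scores every document in the pre-training corpus. "seed_labels.txt" is a
# hypothetical file of "__label__pos ..." / "__label__neg ..." lines.
recall_model = fasttext.train_supervised(input="seed_labels.txt")

def recall(documents, threshold=0.5):
    kept = []
    for doc in documents:
        # fastText predicts on single lines, so newlines are flattened.
        labels, probs = recall_model.predict(doc.replace("\n", " "))
        if labels[0] == "__label__pos" and probs[0] >= threshold:
            kept.append(doc)
    return kept

# Step 2 (Extract): few-shot prompt an LLM to pull out the Q-A pairs
# that naturally occur in a recalled document.
def extract(document: str) -> str:
    return llm(f"Extract all question-answer pairs from this document:\n\n{document}")

# Step 3 (Refine): prompt an LLM to fix formatting and complete the
# intermediate reasoning steps of each extracted pair.
def refine(qa_pair: str) -> str:
    return llm(f"Rewrite the answer with complete step-by-step reasoning:\n\n{qa_pair}")
```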

Strengths

  • Method: The motivation is clear, and the idea of automatically extracting instruction data from web data is novel, simple, and scalable.
  • Experiments: The experiments and evaluations are comprehensive and achieve good results.
  • Well Written: The paper is very easy to understand.
  • Reproducibility: The authors have open-sourced part of the corpus, models, and evaluation scripts to ensure the reproducibility of the results.

Weaknesses

  1. Effectiveness: I wonder if the WebInstruct approach can further improve the performance of state-of-the-art domain models. For example, DeepSeekMath achieved good results by only training on recalled documents and fine-tuning on high-quality data (MATH: DeepSeekMath-7B-Instruct 46.8% vs. MAmmoTH2-7B-Plus's 45.0%). Moreover, since the models have already been trained on SFT data, comparing only the few-shot performance is not comprehensive enough. I suggest also comparing the performance of the Plus version trained with high-quality "additional instruction datasets" for most of the experiments. Consider supplementing the following results:

    • Recall + Plus: Directly train on the 18M recalled documents and fine-tune a Plus version to verify if the "extract + refine" steps have significant benefits.
    • Recall + Extract + Plus: Directly train on the extracted QA (Fig.5, Extracted QA) with LM/SFT loss and fine-tune a Plus version to verify the benefits of the refine step.
    • In Fig.5, I also recommend reporting the performance after fine-tuning the Plus version for SFT loss vs. LM Loss.
  2. Lack of method details:

  • For example, the code for the recall stage and the prompts used for extraction and refinement could be included in the repository or appendix.
  • In Sec. 5.1, I suggest explicitly defining the SFT Loss to help more readers understand it clearly. By "SFT Loss", I understand the authors mean "masking the loss of instruction input", right?
  3. Scalability:

    • The effectiveness of WebInstruct constructed using small models is unknown for larger models; moreover, this approach is difficult to apply to models with hundreds of billions of parameters due to high inference costs.
    • During refinement, the model generates missing explanations. Have you observed and quantified the hallucination phenomenon? If present, such incorrect reasoning processes could negatively impact model training, e.g., by increasing hallucination/bias, especially if the corpus is used for larger models.
  4. Minor points:

    • Some citations are missing for baselines in Table 2, e.g., Gemma, Abel, and Rho-1.
    • How can the WebInstruct approach be extended to more general domains? What other issues need to be addressed?
    • A concurrent work, Jiuzhang3.0 [1], is quite similar in motivation and method. It would be better to discuss and compare with it. What are the advantages and issues of MAmmoTH2 compared to Jiuzhang3.0?

[1] Zhou, Kun, et al. "JiuZhang3.0: Efficiently Improving Mathematical Reasoning by Training Small Data Synthesis Models." arXiv preprint arXiv:2405.14365 (2024).

Questions

See weaknesses.

Limitations

The authors have addressed most of the limitations. The limitations section can be further improved by referring to the weaknesses part.

Author Response

Thank you for your positive feedback on our work's clarity, novelty, and comprehensive experiments!

“Additional results for the effectiveness of WebInstruct”

We've included early-stage results using Qwen-1.5-1.8B to demonstrate the usefulness of our "extraction" and "refinement" steps:

| Model | MATH | TheoremQA | ARC-C |
|---|---|---|---|
| Qwen-1.8B Base | 10.1 | 11.1 | 50.11 |
| Recall | 11.26 | 12.38 | 49.06 |
| Recall + Extract | 14.82 | 13.25 | 51.19 |
| Recall + Extract + Refine (WebInstruct) | 17.18 | 14.87 | 53.83 |

These results clearly show the benefits of each step in our pipeline and align with the motivation of the suggested experiments. We will include these results in the Appendix.

“Method details”

  • We'll add the extraction/refinement prompts to the repository and appendix for transparency.
  • We'll explicitly define SFT Loss in Sec. 5.1. Yes, it refers to masking the loss on the instruction input.
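
For readers who want the confirmed definition in code, below is a minimal sketch of the masked objective, assuming the common PyTorch convention of marking ignored label positions with -100; the tensor shapes and `prompt_len` are illustrative:

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, input_ids: torch.Tensor, prompt_len: int):
    """Cross-entropy on response tokens only; instruction tokens are masked."""
    # Standard causal-LM shift: the logits at position t predict token t+1.
    shift_logits = logits[:, :-1, :]   # (batch, seq_len - 1, vocab)
    labels = input_ids[:, 1:].clone()  # (batch, seq_len - 1)
    # Mask the instruction: prompt positions contribute no loss.
    labels[:, : prompt_len - 1] = -100
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,  # dropping this mask recovers the plain LM loss
    )
```

Training without the masking line corresponds to the "LM loss" setting compared in Sec. 5.1.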

“Scalability”

The core idea of WebInstruct is to mine naturally existing high-quality instructions from the web. Compared with the hundreds of billions of tokens used for continual training, our approach is more cost-effective. Compared with traditional SFT datasets of only hundreds of thousands of examples, WebInstruct with 10M examples is more scalable, without requiring any human annotation.

“Quantified Hallucination”

We quantified hallucination through the human error analysis in Figure 6. Our case study reveals that the harvested instruction tuning dataset is generally accurate, with a low error rate: 78% of examples improved after refinement, and only 10% introduced hallucinations. Future work includes developing more advanced methods (e.g., training a filtering/reward model) to select the least-hallucinated questions. We will strengthen this discussion.

“Minor points”

  • We'll add missing citations for baselines in Table 2 (e.g., Gemma, Abel, and Rho-1).
  • We'll discuss potential extensions to more general domains and associated challenges.
  • Thanks for pointing out the concurrent work JiuZhang3.0; we will discuss it! Both methods leverage synthetic data generation to enhance reasoning capability, but we extract and refine naturally existing instructions from the web rather than synthesizing new ones. Our core claim is that high-quality SFT data for reasoning naturally exist on the web, and our contribution is a simple yet effective approach to harvesting this data.
Comment

Thank you to the authors for supplementing the results and replying. However, my main concern has not been addressed, namely whether the (b) extract and (c) refine processes introduced by MAmmoTH2 provide significant benefit to, or are necessary for, the existing data pipeline. As mentioned in my previous comments, DeepSeekMath achieved higher results using only (a) recall and (d) SFT (MATH: DeepSeekMath-7B-Instruct 46.8% vs. MAmmoTH2-7B-Plus's 45.0%). Therefore, I once again suggest that the authors design controlled experiments and provide the following results (for models >=7B) to demonstrate whether (b) and (c) are necessary:

  1. (a) Recall + (d) SFT: Directly train on the 18M recalled documents and fine-tune a Plus version to verify if the "(b) extract + (c) refine" steps have significant benefits.
  2. (a) Recall + (b) Extract + (d) SFT: Directly train on the extracted QA (Fig.5, Extracted QA) with LM/SFT loss and fine-tune a Plus version to verify the benefits of the (c) refine step.
  3. In Fig.5, I also recommend reporting the performance after fine-tuning the Plus version for SFT loss vs. LM Loss.
Comment

Thank you for the follow-up!

We really appreciate your constructive comments!

To some extent, you share concerns similar to those of Reviewer Jcxm. We added some additional results based on the DeepSeek Math 7B base. Would it make sense to you if we added the following results based on DeepSeek:

  • (a) Deepseek Coder V1.5 (base model)
  • (b) Deepseek Math (base model + CT/Recall)
  • (c) Deepseek Coder V1.5 + WebInstruct (base model + Extract + Refine): to verify whether scaled-up SFT could be a more cost-effective way than traditional CT.
  • (d) Deepseek Math + WebInstruct (CT/Recall + Extract + Refine)
  • (e) Deepseek Math + WebInstruct + Additional SFT (CT/Recall + Extract + Refine + Additional SFT)
  • (f) Deepseek Math + Additional SFT (CT/Recall + Additional SFT)
  • (g) Deepseek Math + WebInstruct (without Refine) + Additional SFT (CT/Recall + Extract + Additional SFT)

If that makes sense to you, we will try to add the results during the discussion period.

Thank you again for the constructive comments!

Comment

Thank you for your response. To demonstrate the necessity of the extract and refine steps, I believe you only need to conduct the two experiments mentioned in my previous reply: 1. (a) Recall + (d) SFT and 2. (a) Recall + (b) Extract + (d) SFT.

I think the experiments you planned on DeepSeekMath cannot prove the effectiveness of the extract step because the corpus recalled by MAmmoTH2 is different from that of DeepSeekMath. Therefore, I suggest using your 18M recalled documents as the "Recall" corpus for the experiments.

Comment

Thanks to the authors for the high-quality rebuttal and to Reviewer k2fV for the timely review.

I agree that the necessity of the extract and refine steps should be demonstrated, considering them a different way to process web corpora than direct training.

From the perspective of web corpora processing, I also agree that it would be necessary to conduct

  1. (a) Recall with 18M documents + (d) SFT
  2. (a) Recall with 18M documents + (b) Extract + (d) SFT

Besides, I would suggest conducting

  3. (a) Recall with 18M documents + (b) Extract + (c) Refine + (d) SFT (i.e. WebInstruct + Additional SFT) -- to see the necessity of Refine.

These experiments should be done based on DeepSeekCoder-V1.5 instead of DeepSeekMath to make a fairer comparison on the exact same CPT/Recall corpus.

Comment

Thank you both!

We will try to add the following experiments based on the DeepSeekCoder-V1.5 model:

  • (a) Recall with 18M documents + (d) SFT: to see the effect of WebInstruct compared with traditional recall/CPT.
  • (a) Recall with 18M documents + (b) Extract + (d) SFT: to isolate the effect of the refinement step.
  • (a) Recall with 18M documents + (b) Extract + (c) Refine + (d) SFT (i.e. WebInstruct + Additional SFT): the current pipeline in the paper.
  • (d) SFT only: to see the improvement brought by WebInstruct on Deepseek Coder V1.5

Do these four setups make sense to you both?

Comment

Thank you for your response. These experiments make sense to me as well, and I think they are crucial for our understanding of the pipeline proposed in the paper. I look forward to seeing your results!

Comment

Thanks to the authors for the timely response!

These experiments about the DeepSeekCoder-V1.5 model seem great to me!

I understand that direct training on 18M documents might be compute-intensive and hard to complete in a few days.

Considering the concerns of us two reviewers together, I would give a final score of 6/7/8 if 0/1/2 of our concerns are resolved.

Comment

We appreciate both reviewers' insightful comments and have conducted all the required ablation studies before the end of the discussion period (bingo!). Our experiments, summarized in the table below, demonstrate the effectiveness and efficiency of our proposed pipeline.

Experimental Setup

  • Base: DeepSeek Coder V1.5 7B
  • (a) MAmmoTH2's recall documents: 18M documents, 28B (14B tokens * 2 epochs)
  • (a') DeepSeek Math's CT corpus: 500B tokens (120B math + others for multiple epochs)
  • (b) Extracted instruction-response pairs from (a): 7B (3.4B tokens * 2 epochs)
  • (c) Refined instruction-response pairs from (b): 10B (5B tokens * 2 epochs)
  • (d) MAmmoTH2 Additional public SFT: 2B (1B tokens * 2 epochs)
  • (d') DeepSeek Math SFT: ~1B tokens
All evaluations are held-out:

| # | Setting | Model | #Train Tokens | TheoremQA | MATH | GSM8K | MMLU-S | BBH | ARC-C | AVG |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Base | DeepSeek Coder v1.5 7B | - | 18.3 | 22.3 | 47.9 | 47.0 | 53.5 | 62.4 | 41.9 |
| 2 | Base + (a) | - | 28B | 23.5 | 30.3 | 60.3 | 53.3 | 55.5 | 69.4 | 48.7 |
| 3 | Base + (a) + (b) + (c) | MAmmoTH2-DS | 10B | 27.8 | 33.8 | 64.0 | 56.9 | 58.5 | 72.8 | 52.3 |
| 4 | Base + (a') | DeepSeek Math Base | 500B | 25.3 | 34.0 | 64.2 | 56.4 | 59.5 | 67.8 | 51.2 |

All held-out except GSM and MATH:

| # | Setting | Model | #Train Tokens | TheoremQA | MATH | GSM8K | MMLU-S | BBH | ARC-C | AVG |
|---|---|---|---|---|---|---|---|---|---|---|
| 5 | Base + (d) | - | 2B | 23.5 | 37.2 | 77.5 | 52.0 | 59.8 | 66.9 | 52.8 |
| 6 | Base + (a) + (d) | - | 30B | 27.2 | 39.2 | 79.2 | 55.6 | 60.3 | 71.5 | 55.5 |
| 7 | Base + (a) + (b) + (d) | - | 9B | 27.3 | 38.6 | 78.5 | 54.2 | 60.5 | 70.4 | 54.9 |
| 8 | Base + (a) + (b) + (c) + (d) | MAmmoTH2-DS-Plus | 12B | 30.1 | 43.8 | 80.1 | 59.5 | 61.0 | 73.2 | 58.0 |
| 9 | Base + (a') + (d') | DeepSeek Math Instruct | 501B | 23.7 | 44.3 | 82.9 | 59.3 | 55.4 | 70.1 | 56.0 |

Key Findings

  1. Cost-Effectiveness: Our pipeline achieves superior overall performance compared to DeepSeek Math models, both before (Row 3 vs Row 4) and after additional SFT (Row 8 vs Row 9), while using significantly fewer tokens. While DeepSeek Math models show slightly higher results on math-specific benchmarks, our approach demonstrates better performance on a broader range of reasoning tasks, including STEM-related benchmarks.
  2. “Extraction Step” Efficiency: The "Extract" step in our pipeline leads to a more cost-effective approach, using fewer tokens while maintaining comparable performance (Row 6 vs Row 7).
  3. Refinement Importance: The "Refine" step proves to be crucial, significantly improving answer quality by adding missing explanations and chains of thought, resulting in substantially better performance (Row 7 vs Row 8).

We believe these comprehensive experiments address the reviewers' concerns and further underscore the merits of our approach, especially the “extract” and “refine” steps.

We really appreciate the reviewers' feedback, which has led to these valuable insights and a stronger demonstration of our pipeline's effectiveness. Feel free to let us know if you have further comments!

Review (Rating: 7)

This paper proposes a method to synthesize instruction tuning data at scale from the pretraining web corpus. The proposed method first recalls relevant documents from the corpus, and then extracts QA pairs, and finally refines the extracted QA pairs with an LLM. The synthesized instruction data proves to be helpful in enhancing the model’s reasoning abilities compared with instruction tuning data from other sources.

Strengths

  1. The proposed method is novel and effective.
  2. The authors conduct extensive experiments to demonstrate that it’s possible to synthesize tuning data from unsupervised text corpus to build strong LLMs that outperform models trained with data collected in existing paradigms.
  3. The paper is well-written and easy to follow. The code and data are released, which will serve as high-quality resources for research and building strong LLMs.

Weaknesses

The paper lacks a discussion of and comparison with the related work “Self-alignment with Instruction Backtranslation” (Li et al., ICLR'24), which also synthesizes instruction tuning data from an unlabeled corpus.

Questions

LLMs are used in the “extract” and “refine” steps in the proposed pipeline for generating and editing instruction tuning data. Will the choice of LLMs introduce bias into the synthesized data (especially compared with distillation-based methods)?

Limitations

The authors have discussed the limitations in Appendix H and societal impacts in Appendix I.

Author Response

Thank you for your positive feedback on our work's novelty, comprehensive experiments, and clear writing!

“Lack a discussion and comparison with Humpback [1]”

Thanks for the note! Humpback does not release its implementation, data, or models, which makes replication and a head-to-head comparison difficult. Fundamentally, our approach differs from Humpback:

  • Humpback synthesizes instructions by backtranslating existing documents. We focus on mining naturally existing instruction-response pairs from the web rather than generating new instructions.
  • Our additional extraction step makes the corpus significantly less redundant and improves its quality (see the newly added results in response to Reviewer k2fV).
  • The "Refine" step further enhances instruction quality.

We will add more detailed discussions of Humpback in the related work.

“Will the choice of LLMs introduce bias into the synthesized data (especially compared with distillation-based methods)?”

  • Thanks for the question! It's important to note that our approach is not distillation in the traditional sense. The LLMs in our pipeline are used solely for extraction and refinement of existing data, not for generating new instructions. MAmmoTH2 essentially learns from a cleaner version of raw web data rather than distilling knowledge from other models.
  • We acknowledge that the choice of LLM could influence the accuracy of extraction. To address this, we chose two open-source models, Mixtral and Qwen, known for their strong performance and different training approaches. This diversity helps to balance out potential biases from any single model.
  • Compared to distillation methods, our approach potentially reduces bias by preserving naturally occurring instructions from diverse web sources, while cleaning and structuring them for more effective learning.
Comment

Thank you for the response and it has addressed my concerns. I increased my score to 7.

Comment

We appreciate Reviewers Jcxm and k2fV's insightful comments and have conducted all the required ablation studies before the end of the discussion period (bingo!). Our experiments, summarized in the table below, demonstrate the effectiveness and efficiency of our proposed pipeline.

Experimental Setup

  • Base: DeepSeek Coder V1.5 7B
  • (a) MAmmoTH2's recall documents: 18M documents, 28B (14B tokens * 2 epochs)
  • (a') DeepSeek Math's CT corpus: 500B tokens (120B math + others for multiple epochs)
  • (b) Extracted instruction-response pairs from (a): 7B (3.4B tokens * 2 epochs)
  • (c) Refined instruction-response pairs from (b): 10B (5B tokens * 2 epochs)
  • (d) MAmmoTH2 Additional public SFT: 2B (1B tokens * 2 epochs)
  • (d') DeepSeek Math SFT: ~1B tokens
All evaluations are held-out:

| # | Setting | Model | #Train Tokens | TheoremQA | MATH | GSM8K | MMLU-S | BBH | ARC-C | AVG |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Base | DeepSeek Coder v1.5 7B | - | 18.3 | 22.3 | 47.9 | 47.0 | 53.5 | 62.4 | 41.9 |
| 2 | Base + (a) | - | 28B | 23.5 | 30.3 | 60.3 | 53.3 | 55.5 | 69.4 | 48.7 |
| 3 | Base + (a) + (b) + (c) | MAmmoTH2-DS | 10B | 27.8 | 33.8 | 64.0 | 56.9 | 58.5 | 72.8 | 52.3 |
| 4 | Base + (a') | DeepSeek Math Base | 500B | 25.3 | 34.0 | 64.2 | 56.4 | 59.5 | 67.8 | 51.2 |

All held-out except GSM and MATH:

| # | Setting | Model | #Train Tokens | TheoremQA | MATH | GSM8K | MMLU-S | BBH | ARC-C | AVG |
|---|---|---|---|---|---|---|---|---|---|---|
| 5 | Base + (d) | - | 2B | 23.5 | 37.2 | 77.5 | 52.0 | 59.8 | 66.9 | 52.8 |
| 6 | Base + (a) + (d) | - | 30B | 27.2 | 39.2 | 79.2 | 55.6 | 60.3 | 71.5 | 55.5 |
| 7 | Base + (a) + (b) + (d) | - | 9B | 27.3 | 38.6 | 78.5 | 54.2 | 60.5 | 70.4 | 54.9 |
| 8 | Base + (a) + (b) + (c) + (d) | MAmmoTH2-DS-Plus | 12B | 30.1 | 43.8 | 80.1 | 59.5 | 61.0 | 73.2 | 58.0 |
| 9 | Base + (a') + (d') | DeepSeek Math Instruct | 501B | 23.7 | 44.3 | 82.9 | 59.3 | 55.4 | 70.1 | 56.0 |
| 10 | Base + (a') + (a) + (b) + (c) + (d) | MAmmoTH2-DS-Math-Plus | 512B | 31.5 | 45.2 | 80.2 | 60.5 | 62.0 | 76.4 | 59.3 |
| 11 | Base + (a') + (d) | - | 502B | 28.8 | 43.4 | 80.6 | 60.1 | 62.2 | 72.5 | 57.9 |

Key Findings

  1. Cost-Effectiveness: Our pipeline achieves superior overall performance compared to DeepSeek Math models, both before (Row 3 vs Row 4) and after additional SFT (Row 8 vs Row 9), while using significantly fewer tokens. While DeepSeek Math models show slightly higher results on math-specific benchmarks, our approach demonstrates better performance on a broader range of reasoning tasks, including STEM-related benchmarks.
  2. “Extraction Step” Efficiency: The "Extract" step in our pipeline leads to a more cost-effective approach, using fewer tokens while maintaining comparable performance (Row 6 vs Row 7).
  3. Refinement Importance: The "Refine" step proves to be crucial, significantly improving answer quality by adding missing explanations and chains of thought, resulting in substantially better performance (Row 7 vs Row 8).

We believe these comprehensive experiments address the reviewers' concerns and further underscore the merits of our approach, especially the “extract” and “refine” steps.

We really appreciate the reviewers' feedback, which has led to these valuable insights and a stronger demonstration of our pipeline's effectiveness. Feel free to let us know if you have further comments!

Comment

Thank you for providing the results. My concerns regarding the individual roles of "extract" and "refine" have been thoroughly addressed, leading me to increase the soundness score from 3 to 4.

The experimental results show that after SFT, the "extract" step resulted in a slight performance decrease (55.5 to 54.9), while the "refine" step contributed to a significant performance improvement (55.5 to 58.0). Although these gains might be influenced by distilling larger models (e.g., Mixtral-8×7B and Qwen-72B) and may have scalability limitations, I believe this method has potential, especially for training smaller models. Therefore, I maintain a score of 7 and recommend accepting this paper.

Comment

Thanks for your follow-up!

First, the "extract" step produces far fewer tokens (3.4B vs. 14B) than the recalled documents while still achieving comparable performance, which shows that the per-token value is much improved.

Second, we suspect the slight decrease may be due to the fact that we started from a coder model (i.e., DeepSeek Coder) instead of a general LLM. The coder model might need to be trained on more natural-text tokens to reach better performance. A potential way to verify this claim is to conduct a similar experiment on Mistral-7B or to train DeepSeek Coder for more epochs on the extracted QA (e.g., 4-5 epochs). We might not have time to finish this experiment before the end of the discussion period, but we will definitely include the results in the next revision.

Thank you again for your comments!

Comment

Thanks for your reply!

These results are great but do not resolve all my concerns.

Could you compare:

  1. Base + (a) + (b) + (c) + (d) vs. Base + (a’) + (d), to exclude the difference in SFT datasets;
  2. Base + (a) + (b) + (c) + (d) vs. Base + (a’) + (a) + (b) + (c) + (d), to see the effect of WebInstruct plus DeepSeekMath continual pre-training?
Comment

Thank you for the prompt response! For 2 (Base + (a) + (b) + (c) + (d) vs. Base + (a’) + (a) + (b) + (c) + (d)), we reported the numbers in our previous response and have added them back into the table above (Row 10). It shows that our pipeline can be further combined with existing CT-SFT pipelines (e.g., DeepSeek Math) to improve performance.

For (1), we will set up the experiment now, and hopefully we can get the results before the end of the discussion period!

Comment

Dear Reviewer Jcxm,

We added the two additional rows you requested:

  • Base + (a) + (b) + (c) + (d) vs. Base + (a’) + (d), to exclude the difference in SFT datasets (Row 8 vs Row 11);
  • Base + (a) + (b) + (c) + (d) vs. Base + (a’) + (a) + (b) + (c) + (d), to see the effect of WebInstruct plus DeepSeekMath continual pre-training (Row 8 vs Row 10).

Please let us know if you have further comments!

Comment

Thanks for your reply! I've updated my official review!

Final Decision

The authors introduce MAmmoTH2, a method for instruction tuning LLMs by harvesting instruction-response pairs from naturally occurring web data. Their method follows a three-step pipeline (recall, extract, refine) and synthesizes a large dataset of 10 million high-quality instruction pairs without relying on costly human annotations or GPT-4 distillation. The results demonstrate that MAmmoTH2-trained models outperform other LLMs on reasoning benchmarks, achieving competitive performance with fewer resources.

The paper presents a novel, scalable, and cost-effective method for generating instruction tuning data that significantly improves performance on reasoning tasks. While the method combines existing techniques like document recall, QA extraction, and refinement, the integration of these steps in a pipeline to scale instruction data is innovative and addresses the costly reliance on human annotations or large closed-source models. The authors conduct comprehensive experiments and ablations.

However, while the paper is generally strong, a few areas warrant improvement before publication. These issues do not detract from the paper's overall value, but incorporating the results already presented in the author discussion into the paper would make the work much stronger.

Strengths:

  1. The pipeline introduced in this paper demonstrates a novel, cost-efficient method for generating large-scale instruction data from naturally existing web sources without human annotation or expensive closed-source model assistance.

  2. The authors present extensive experiments and ablation studies, clearly demonstrating the effectiveness of each step (recall, extract, refine) in the pipeline. The model shows impressive gains on reasoning tasks.

  3. The authors implement multiple mechanisms to ensure the quality of the data, including using diverse web sources and performing human evaluations, which show a low hallucination rate (only 10%). The multi-step refinement process, employing different LLMs, also reduces the likelihood of bias.

Suggested Revisions (also discussed by the authors and reviewers): First, thanks to all the reviewers for participating enthusiastically and constructively in suggesting additional experiments to improve the paper.

  1. Several reviewers highlighted the need for more explicit comparisons with related works, such as Humpback and JiuZhang3.0. While the rebuttal addresses some of these concerns, it would be valuable to add these comparisons to the final draft, especially in terms of effectiveness and scalability.

  2. As requested in the reviews, adding to the draft the ablation studies comparing models trained only on recalled documents with those that underwent the full extract and refine steps would be fantastic. The refinement step appears to be crucial, but further explanation of how the refined instructions improve performance (e.g., more detailed chains of thought, error correction) would enhance understanding.