PaperHub

COLM 2024 · Poster · 4 reviewers
Overall rating: 6.5/10 (ratings 7, 7, 6, 6; min 6, max 7, std 0.5)
Average confidence: 4.0

Does your data spark joy? Performance gains from domain upsampling at the end of training

Submitted: 2024-03-23 · Updated: 2024-08-26
TL;DR

We show that changing your data mix to upsample domains at the end of training can lead to performance gains and provides a tool for characterizing the utility of individual datasets.
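
To make the intervention concrete, here is a minimal Python sketch (not the authors' code) of a two-phase sampling schedule: batches are drawn from a baseline mix for most of training, and the per-domain sampling weights switch to upsample domain-specific data for the final fraction of the run. The domain names, proportions, and the 20% switch point are illustrative placeholders rather than the paper's actual mix.

```python
import random

# Hypothetical domain names and proportions, for illustration only; the paper's
# actual datasets and mixing ratios are given in its tables.
BASELINE_MIX = {"web_crawl": 0.70, "code": 0.15, "math": 0.05, "flan": 0.02, "other": 0.08}
UPSAMPLED_MIX = {"web_crawl": 0.40, "code": 0.25, "math": 0.20, "flan": 0.05, "other": 0.10}

def domain_weights(step: int, total_steps: int, upsample_fraction: float = 0.2) -> dict:
    """Per-domain sampling weights at a given training step: the baseline mix for
    the first (1 - upsample_fraction) of training, the upsampled mix afterwards."""
    if step < (1.0 - upsample_fraction) * total_steps:
        return BASELINE_MIX
    return UPSAMPLED_MIX

def sample_domain(step: int, total_steps: int) -> str:
    """Choose which domain to draw the next training batch from."""
    weights = domain_weights(step, total_steps)
    domains, probs = zip(*weights.items())
    return random.choices(domains, weights=probs, k=1)[0]

# Example: the mix switches for the final 20% of a 100k-step run.
print(domain_weights(50_000, 100_000))  # baseline mix
print(domain_weights(90_000, 100_000))  # upsampled mix
```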

Keywords: pretraining data, data mix, data interventions

Reviews and Discussion

Review
Rating: 7

The paper addresses the important research question of how to build and evaluate the unlabeled dataset (mix) for LLM pretraining at scale. The authors propose a technique, upsampling data closer to the target domain towards the end of pretraining, and show that even with a smaller pretraining corpus similar or better results on common benchmarks can be achieved. Additionally, with meaningful ablation experiments, they show a cheap(er) way to conduct experiments about dataset building for LLM pretraining.

Reasons to Accept

Sound and interesting experimental results, showing both how to improve performance without additional pretraining cost (measured in FLOPs) and a new, cheaper way to experimentally assess the influence of different subsets of the pretraining data.

Reasons to Reject

There is not enough discussion about the relation between domain upsampling and domain-specific fine-tuning. It is known that adapting an LM to the domain of the target task, even with unlabeled data, is beneficial [1]. It seems that there are similar effects at work when upsampling unlabeled data of the target domain towards the end of pretraining.

[1] Gururangan et al., Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks, ACL 2020 (https://aclanthology.org/2020.acl-main.740)

Questions for the Authors

Table 4: Your model trained on 0.5T tokens outperforms your model trained on 1T tokens for HumanEval and Programming, but not the others. What could be the reason for that?

Author Response

There is not enough discussion about the relation between domain upsampling and domain-specific fine-tuning. It is known that adapting an LM to the domain of the target task, even with unlabeled data, is beneficial [1]. It seems that there are similar effects at work when upsampling unlabeled data of the target domain towards the end of pretraining.

Thank you for pointing out this connection; we will add discussion about the relationship of our results to [1]. There are important differences: (1) Our work does domain upsampling during pretraining while [1] does domain adaptation with an already annealed model. We think that a very interesting direction of future work is to understand if there are any advantages to doing this with a non-annealed model. (2) While their work aims at domain adaptation, our work specifically targets a general boost in model pretraining performance.

Table 4: Your model trained on 0.5T tokens outperforms your model trained on 1T tokens for HumanEval and Programming, but not the others. What could be the reason for that?

We are investigating what is going on with this scaling of HumanEval. Our current hypothesis is that at this FLOP scale with this percentage of code, HumanEval flatlines and you need to put in many more code FLOPS in order to see more emergent code performance. The performance boost observed with domain upsampling on the model trained for 1T tokens also fits into this hypothesis (i.e. the performance eventually increases if you add more code FLOPS).

Comment

Thank you for your reply.

I can see that your work aims for a broad performance boost, whereas an approach targeting the adaptation to a specific domain does not care about anything else than the target domain.

I am not so sure I understand what you mean by an annealed vs. non-annealed model. As I understand it, you fix your pretraining budget in your work and then decide what percentage of that budget is used for upsampling certain domains. You could see [1] as a radical variant of this: upsampling a formerly unseen target-domain dataset from 0% to 100% and radically downsampling everything else to 0%. The pretraining budget, of course, would be the combination of (in the case of [1]) RoBERTa's original pretraining PLUS the domain adaptation. The scope of your work is certainly different, but the underlying mechanisms look very similar. You say yourself that the similarity of the upsampled domains is probably responsible for the performance gains on different benchmarks.

I encourage you to explain these connections and differences carefully (also to other works focussing on domain adaptation like [2] and [3] brought up by reviewer nKUf) and am otherwise confident that it is justified to keep my high score as is.

Comment

We're using the terms annealed vs. non-annealed model to refer to whether we are starting with a model whose learning rate has been decayed vs. changing the data mix in the middle of the learning rate schedule. For example, if you start with RoBERTa or a Llama-2 model, the checkpoint is at the end of the learning rate decay, which we refer to as an annealed model, and continued pretraining therefore typically begins with learning rate warm-up. In our case, we have access to the checkpoints before decay, which we refer to as a non-annealed model.

We have found that when working with non-annealed models, large domain shifts such as switching to only the domain specific datasets can harm performance. (As discussed in our response to Reviewer nKUf, we found in early experiments that completely removing Common Crawl resulted in a 17 pp drop in the language understanding category of the Gauntlet. We will be adding this experiment to the appendix.) In contrast, continued pretraining with annealed models has been successful with just domain specific data as in [1].
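
To make this distinction concrete, here is a small sketch with assumed schedule values (not the authors' training configuration). With a standard warm-up-plus-cosine schedule, the non-annealed setting changes the data mix at a step where the learning rate is still partway through its decay, whereas the annealed setting starts from a checkpoint at the end of the schedule, so continued pretraining warms the learning rate up again from near zero.

```python
import math

def cosine_lr(step: int, total_steps: int, peak_lr: float = 3e-4,
              min_lr: float = 3e-5, warmup_steps: int = 2000) -> float:
    """Single-cycle linear warm-up followed by cosine decay, as is common in LLM pretraining."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

TOTAL_STEPS = 100_000
SWITCH_STEP = int(0.8 * TOTAL_STEPS)  # data mix changes at 80% of training

# Non-annealed setting (this paper): the data mix changes at SWITCH_STEP while the
# learning rate continues along the same, partially decayed schedule.
print(f"LR when the mix switches (non-annealed): {cosine_lr(SWITCH_STEP, TOTAL_STEPS):.2e}")

# Annealed setting (continued pretraining from a released checkpoint, as in [1]):
# the starting checkpoint sits at the end of its schedule (LR near min_lr), so a new
# run on domain data begins with a fresh warm-up from near zero.
print(f"LR at the end of the original schedule (annealed): {cosine_lr(TOTAL_STEPS, TOTAL_STEPS):.2e}")
```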

We will add text to explain the connections and differences carefully as requested by the reviewers.

Review
Rating: 7

In this paper, the authors introduce a novel, yet simple, technique for improving the training efficiency of models by upsampling high quality data samples at the end of training. While they also investigate finding more optimal mixtures of existing pre-training corpora (they identify this as being “one of the biggest challenges to pretraining LLMs”), one can think of their new strategy as a cheaper alternative to simply trying to find a one-size-fits-all ideal (and scaled) pre-training mixture.

The results are impressive. They show that 20% domain upsampling (the way domains are selected is still somewhat subjective, and seems in their case to be tailored towards mathematical reasoning) outperforms Llama-2 at the 7B scale while being trained on 1 trillion fewer tokens. They are careful to point out and check several possible confounding factors, for example, checking that their improved data mixture alone does not suffice to produce the same improvement (it doesn't, based on the ablations they performed). At first, I worried that too much focus was being put on specific tasks such as MMLU and GSM8K, although the results are quite compelling on the more comprehensive Gauntlet set.

In general, I found the paper easy to follow and read. Most of my criticisms are of a minor nature, which I detail below.

Reasons to Accept

  • a new “intervention” technique for improving training efficiency and model quality through upsampling high-quality data at the end of training. Given the simplicity of the technique and the promising empirical results, I expect this to inspire other work in this area.

  • several interesting auxiliary results that arise from their experiments, such as new insights into compute-friendly data mixtures on C4 (ones that improve empirically over Llama-2).

Reasons to Reject

  • the results are somewhat narrow by only comparing against Llama-2 and OpenLlama. I would like to see this motivated more, and I'm left wondering if comparing against more recent models would make more sense (e.g., models with public data mixtures, such as Olmo).

Questions for the Authors

  • Did you look at language modeling performance, as opposed to just end-task performance, after applying your intervention method (e.g., does this have any known impact on perplexity)? I'm still expecting that somewhere something must be going wrong, and do worry about all of this being too overly tailored to MMLU and GSM8K (which are both datasets known to have serious issues).

  • One alternative to your approach, which has become more popular recently (e.g., in the most recent version of Olmo) is incorporating instruction data into the pre-training regime. I'd be curious to hear how you would think about this in relation to your method.

Author Response

The results are somewhat narrow by only comparing against Llama-2 and OpenLlama. I would like to see this motivated more and I'm left wondering if comparing against more recent models would make more sense (e.g., models with public data mixtures, such as Olmo).

The comparison to Llama-2 and OpenLlama is primarily to show that our original data mix is a reasonable baseline. The important comparison for domain upsampling is to the identical model without the domain upsampling intervention. This was done so that the architecture, data processing, etc. are identical to isolate the effects of the upsampling procedure. The main motivation for including the Llama-2 comparison is that, at the time of submission, it was the state-of-the-art open, dense model trained at different scales.

We will add comparisons to Olmo to this table for the camera ready.

Did you look at language modeling performance, as opposed to just end-task performance, after applying your intervention method (e.g., does this have any known impact on perplexity)? I'm still expecting that somewhere something must be going wrong, and do worry about all of this being too overly tailored to MMLU and GSM8K (which are both datasets known to have serious issues).

Given that perplexity is a function of your data mix proportion, examining the impact on perplexity before and after a change in your data mix is tricky because it is unclear which data mix you should use to evaluate perplexity after the change. In such a situation, downstream evaluations provide a consistent framework to compare model quality regardless of data mix.
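
A small numerical sketch of this point, using made-up per-domain loss values: corpus-level perplexity is the exponential of the mix-weighted average loss, so the same model scores differently depending on which evaluation mix is chosen.

```python
import math

# Made-up per-domain average losses (nats/token) for a single model.
nll_per_domain = {"web_crawl": 2.9, "code": 1.6, "math": 2.2}

def mixture_perplexity(nll: dict, eval_mix: dict) -> float:
    """Perplexity under a given evaluation mix: exp of the mix-weighted mean NLL."""
    avg_nll = sum(eval_mix[d] * nll[d] for d in eval_mix)
    return math.exp(avg_nll)

baseline_mix = {"web_crawl": 0.80, "code": 0.15, "math": 0.05}
upsampled_mix = {"web_crawl": 0.40, "code": 0.35, "math": 0.25}

# The same model gets a different "perplexity" depending on which mix it is scored on,
# so comparing checkpoints before and after a mix change requires picking a reference mix.
print(round(mixture_perplexity(nll_per_domain, baseline_mix), 1))   # ~14.4
print(round(mixture_perplexity(nll_per_domain, upsampled_mix), 1))  # ~9.7
```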

We agree that data interventions could be overly tailored to specific benchmarks like MMLU and GSM8K which is why we included evaluations across a broad range of 30+ downstream evaluations contained in the Gauntlet. We observed that the domain upsampling intervention improved almost all of these domains.

One alternative to your approach, which has become more popular recently (e.g., in the most recent version of Olmo) is incorporating instruction data into the pre-training regime.

We have incorporated instruction data by including FLAN in the pretraining data mix (it is one of the domain specific datasets).

Comment

Thank you for the comments and feedback. I was satisfied with your feedback and will continue to keep my high score where it is.

Review
Rating: 6

This paper investigates the impact of data recipes during the LLM pretraining stage. The study yields two key findings: (1) By upsampling domain-specific data, the authors demonstrate that their model achieves better performance on several benchmarks while using fewer training tokens compared to the Llama-2 model of the same scale. (2) Conducting domain upsampling near the end of pretraining (at 80% completion) produces promising results and offers an efficient approach to examining the relationships between specific pretraining data and downstream tasks.

Reasons to Accept

  • The paper aims to address an important but often overlooked problem: the training data mixture of Large Language Models (LLMs).
  • The paper is well-structured and clearly presented, with a strong motivation behind the research. The experiments and results are carefully designed and executed, providing valuable insights. The authors also demonstrate transparency by offering a closer look at commonly used pretraining data.
  • Positive empirical results.

Reasons to Reject

  • Significance: The findings of the paper seem to be limited to the observation that "the models improve on the upsampled domains," which is not particularly surprising. Maybe the authors can help me better understand the significance of the conclusions.
  • Scientific value: Comparing MPT and Llama-2 (which is not trained on FLAN, etc.) may not be the most appropriate approach for a controlled experiment. I suggest comparing two models of the same implementation, data processing etc., with and without the proposed upsampling and data processing techniques.
  • Acknowledgement of existing works: [1] proposed a similar approach to the one presented in the submission (section 4.2) and was available on arXiv about 4 months before the submission deadline. The authors should cite and discuss this work to provide a more comprehensive context. Additionally, [2] appears to reach a different conclusion, suggesting that upsampling certain domains can negatively impact general capabilities. It would be great if the authors can include a discussion about it.
  • It remains unclear whether the observed performance gain can be attributed solely to domain upsampling, given that the FLAN dataset, which is often used in finetuning, is included in the pretraining process.
  • Pretraining typically aims to provide a strong foundation for various capabilities that can be further refined through finetuning, rather than primarily focusing on achieving strong performance on downstream tasks. The proposed approach seems to blur the line between pretraining and finetuning by including FLAN in pretraining. It is unclear whether such a model can still benefit from instruction tuning using datasets like FLAN or OpenOrca, at a similar level to many other models. An important comparison would be to evaluate the instruction-tuned versions of the proposed method against other LLM baselines.

[1] https://arxiv.org/abs/2312.06550
[2] https://arxiv.org/abs/2402.10171

Author Response

The findings of the paper…understand the significance of the conclusions.

Our results show that our technique improves not only the domains related to the upsampled datasets but also almost all domains measured across a broad spectrum of 30+ evaluations (listed in the appendix).

The fundamental challenge of pretraining data is choosing dataset proportions to get a desired balance of capabilities. This typically is an extremely large space to search over and experiments with each mix at scale are prohibitively expensive. Previous work has tried to deal with this problem by performing experiments at smaller scales and hoping the results translate; however, this approach has not been publicly shown to work for challenging emergent benchmarks.

The primary significance of our work is thus to provide an alternative way to characterize the impact of data on emergent benchmarks that is less expensive than pretraining the full model, i.e. changing the data mix during the final 10-20% of training. Our result provides the community with a new, less expensive paradigm for characterizing their data.

Comparing MPT and Llama-2…upsampling and data processing techniques.

This is in fact what we did: our domain upsampling experiments compared to a baseline model of the same implementation, etc. with and without the proposed upsampling technique (see Table 6, first two columns). The comparison to Llama-2 in the plots was simply to show that our baseline data mix was reasonable.

Acknowledgement of existing works

The difference between our work and [1] is that rather than simply training with multiple phases of the data mix, we run experiments to compare how such a mix performs relative to the baseline and how settings like the amount of training in the domain-upsampled phase affect performance. The work in [2] is focused on context-length extension, which is a confounding factor when comparing experiments, and their experiments are also performed on an annealed model. We will add citations to both works and a discussion of these differences.

It remains unclear whether the observed performance…in the pretraining process.

FLAN was included in both the baseline and domain upsampled models to control for this.

It is unclear whether such a model…similar level to many other models.

This is an interesting question, and we will work on running the experiment to check if there is a significant difference finetuning using OpenOrca before and after domain upsampling.

Comment

Thanks for clarifying my confusion. I have increased my score from 3 to 6.

Review
Rating: 6

The paper explores domain continual pre-training for LLMs. While the concept of domain continual pre-training is not new, the authors study real LLMs, whereas most empirical papers have only explored small language models like BERT, whose generalization to LLMs is questionable. The authors explore boosting the domain performance of an LLM by applying domain continual pre-training at the end of pre-training. Unlike existing continual pre-training work, the continual pre-training corpus here is a mix of domain and non-domain tokens, termed "domain upsampling": oversampling the domain-specific datasets relative to the web data during the final phase of pre-training.

They claim the optimal upsampling duration is 10-20% of the total training budget, based on results on a variety of benchmarks beating open-source models like LLaMA. Another interesting side result from their experiments is that a predefined mix used during general pre-training performs almost on par with LLaMA with half the training tokens.

Reasons to Accept

  • The paper explores an important problem that has not received much attention in the LLM literature, while results from traditional LMs might not generalize to LLMs. The experiments are generally well done and well motivated.
  • They demonstrate how domain upsampling can be used as a cost-effective tool to make the LLM learn a specialized capability.
  • The paper is well-written, with clear motivation, methodology, and extensive analysis of results across different experimental settings and domain tasks such as code and math.

Reasons to Reject

  • The choice of datasets and mixing proportions relies on heuristics. More principled approaches could potentially yield better results. Though it is understandable given that these experiments are quite expensive to perform.
  • There are no baselines in this work, such as [1], which uses a more principled approach by training a much smaller proxy model, though not at LLM scale.

Please see comments for specific points on the weakness.

Questions for the Authors

Can the authors justify why no comparison was made with [1] (cited in the paper), which uses a smaller proxy model and hence incurs little cost, unlike other methods in the space? Previous continual pre-training works like [2, 3] have exclusively used domain data, while the authors use a mix of domain + non-domain data, with 35% domain data during this step. Is there any justification for this proportion? While I understand that these experiments are expensive, given the other works, it's not clear what the main message of the paper is.

[1] Xie, Sang Michael, et al. "DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining." Thirty-seventh Conference on Neural Information Processing Systems. 2023.

[2] Xie, Yong, Karan Aggarwal, and Aitzaz Ahmad. "Efficient continual pre-training for building domain specific large language models." arXiv preprint arXiv:2311.08545 (2023).

[3] Wu, Chaoyi, et al. "PMC-LLaMA: toward building open-source language models for medicine." Journal of the American Medical Informatics Association (2024).

Author Response

The choice of datasets and mixing proportions relies on heuristics…which use more principled approach by training a much smaller proxy model though not at LLM scale.

Our primary goal in this work was to get signal on challenging, emergent benchmarks (like MMLU, GSM8K, and HumanEval) which require experiments at scale to get above random results. Current principled methods like DoReMi have only been explored at smaller scales and have not demonstrated predictive power on these benchmarks. Performing experiments to understand how DoReMi performs at scale would be another experiment that would cost on the order of hundreds of thousands of dollars (i.e. training a model to 5e22 FLOPS). Because of this expense, we have primarily focused on proving it is possible to see significant improvements in these benchmarks by changing the data mix during training rather than extensively exploring alternative methods that require you to do full pretraining runs. In essence, while these principled methods are cheap to implement, validating their efficacy at scale is expensive.

Having said this, making our results publicly available is valuable for the community because it opens up opportunities for others to explore more principled directions in a way that’s relatively inexpensive. New proposals for data mixes can be evaluated for efficacy by changing the data mix during the final part of training, and we ourselves have done several science experiments in the paper trying to understand some of the knobs in the data proportions that impact benchmark performance. This includes the impact of removing math datasets and changing the percentage of training dedicated to domain upsampling. (We are also exploring more principled methods for future work.) The important takeaway from this paper is the new regime in which we can do experiments for pretraining language models at the end of training and still get signal on the benchmarks we care about.

Previous continual pre-training works like [2, 3] have exclusively used domain data while the authors use a mix of domain + non-domain data with 35% of domain data during this step. Is there any justification for this proportion?

This was done in order to avoid large domain shifts in the data mix. Early on, we tried to completely remove common crawl, and this change resulted in a 17 percentage point drop in the language understanding category of the Gauntlet.
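
To illustrate how large such a shift can be, here is a sketch with hypothetical proportions (not the paper's actual mix): removing a dominant source and renormalizing multiplies every remaining domain's weight, which is the kind of abrupt distribution change described above.

```python
# Hypothetical baseline proportions (not the paper's actual mix), illustrating why
# dropping a dominant source such as Common Crawl is a large distribution shift.
baseline = {"common_crawl": 0.60, "code": 0.15, "math": 0.05, "books": 0.10, "other": 0.10}

def renormalize_without(mix: dict, dropped: str) -> dict:
    """Remove one domain and rescale the remaining weights to sum to 1."""
    remaining = {d: w for d, w in mix.items() if d != dropped}
    total = sum(remaining.values())
    return {d: w / total for d, w in remaining.items()}

print(renormalize_without(baseline, "common_crawl"))
# roughly {'code': 0.375, 'math': 0.125, 'books': 0.25, 'other': 0.25}:
# every remaining domain is upweighted by about 2.5x relative to what the model
# saw for most of pretraining.
```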

Comment

I understand that it is hard to compare these different decisions as they need a lot of computation budget.

"This was done in order to avoid large domain shifts in the data mix. Early on, we tried to completely remove common crawl, and this change resulted in a 17 percentage point drop in the language understanding category of the Gauntlet." Can you please add a section on this in appendix? This would be useful for the readers and community at large to learn from your experiences.

Comment

Can you please add a section on this in appendix? This would be useful for the readers and community at large to learn from your experiences.

Yes, we will include these results in the appendix.

Final Decision

All reviewers agree that this paper addresses an important and often overlooked problem, presenting a novel and simple technique for improving the training efficiency of LLMs with impactful results. The results are solid, and the paper is well-written with strong motivation. Therefore, we would like to accept this paper at COLM. Please follow the reviewers' comments and suggestions from both the reviews and follow-up discussions. This includes providing a more in-depth discussion of existing work and comparisons to Olmo, etc.