CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training
Abstract
Reviews and Discussion
This work proposes a novel data selection method, FRHOzen (Frozen Reducible Hold Out Loss), which leverages an empirical Bayes-inspired approach to derive a simple and computationally efficient selection criterion based on the relative loss values of two auxiliary models. The authors provide an empirical evaluation of FRHOzen on two language modeling tasks: (1) selecting data from C4 for domain adaptation, evaluated on Books, and (2) selecting data from C4 for a suite of downstream multiple-choice question answering tasks.
Strengths
1. They analyze a family of loss-based approaches for targeted selection of pre-training data, propose a simple approach that outperforms existing methods, and provide some preliminary evidence of favorable scaling properties.
2. They analyzed and compared the computational cost of FRHOzen with that of similar methods and demonstrated its computational efficiency.
Weaknesses
1. I think this work does not match its stated objective of selecting optimal subsets of data for language model pre-training. The evaluation is done by fine-tuning an already-trained LM on downstream data, which does not show that the selection works for LM pre-training. In fact, a pre-trained LM usually should not be specifically optimized for a particular downstream task (with respect to its domain-specific data).
2. In addition, the evaluation is based on OLMo, which is a decoder-only LM. Existing pipelines do not fine-tune decoder-only LMs directly; usually SFT is needed for prompt alignment. I doubt the evaluation results are applicable to real scenarios.
3. The assumption of a pre-set budget n (in Algorithm 1) is less valid, and the optimal total n is hard to predict during pre-training.
Questions
1. Could you provide more details on how you fine-tune OLMo using downstream data? What is the loss function?
2. What is the unit of a data point x in Equation 6?
3. Is the selected optimal data used in the fine-tuning phase or the pre-training phase of training?
4. What is the difference between data points and sequences (Section 4.2, Line 215)?
Limitations
As the authors have stated comprehensively in the Limitations section, I have no further comments on this part.
We first want to thank the reviewer for their positive comments about how our simple method outperforms alternatives, shows evidence of favorable scaling, and is computationally efficient. We think these are all important strengths of the paper, and the other reviewers largely agreed.
Now we will address each of the weaknesses and questions raised in the review.
Weaknesses
- We think there is a major misunderstanding here of our methodology. We are not evaluating a finetuned model, and we are not using any pretrained models from prior work. FRHOzen is a method for filtering a pre-training corpus to select high-quality data for training from scratch. We do use a small amount of downstream data as a guide to help us define "high-quality data" for our selection procedure. In particular, looking at Algorithm 1, we first pretrain and then finetune small auxiliary models in lines 1 and 2. These models are then used in lines 3 and 4 to create a new dataset S of high-quality data from the pre-training corpus, and we then pre-train models from scratch on S. Thus, all results in the paper are for models pre-trained from scratch on S in various settings. For example, Figure 1a shows how this yields a pre-training corpus that is much better on a suite of 8 downstream tasks for a 1.2B model.
- Again, we think there is a major misunderstanding here. We use the OLMo codebase, but we do not use the OLMo pre-trained model; we train models from scratch. We should note that the models we pre-train from scratch could indeed be subsequently finetuned with SFT, but that is beyond the scope of the paper for now.
- We do not agree here. n can be seen as the compute budget we have for pre-training (it is directly proportional to FLOPs when using, e.g., Chinchilla scaling to set the model size). A user who is attempting to train a model can set it according to their particular budget constraints.
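To make the pipeline concrete, here is a minimal sketch of the selection step (lines 3 and 4 of Algorithm 1): score each sequence by its loss under the marginal auxiliary model minus its loss under the downstream-finetuned conditional model, and keep the top n. The function and variable names are illustrative, not from our codebase:

```python
import heapq

def frhozen_select(corpus, marginal_loss, conditional_loss, n):
    """Keep the n sequences with the largest loss reduction:
    loss under the marginal auxiliary model (pre-trained on generic
    data only) minus loss under the conditional auxiliary model
    (also finetuned on a small amount of downstream data)."""
    scored = ((marginal_loss(x) - conditional_loss(x), i)
              for i, x in enumerate(corpus))
    top = heapq.nlargest(n, scored)  # largest gap = most downstream-relevant
    return [corpus[i] for _, i in top]
```

The selected subset S is then used to pre-train the target model from scratch; the auxiliary models are only ever used for these forward-pass scores.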
Questions
- Again we want to reiterate that we are not proposing a finetuning method. However, we understand there may be some confusion since we do finetune the small auxiliary conditional model. For that finetuning, we simply use the standard next-token prediction loss.
- We are not sure we understand the question. In Equation 6, the x_i are sequences sampled from the unfiltered pre-training corpus, and we evaluate their log-likelihood under various language models (i.e., the sum of the per-token log-probabilities across the sequence, or equivalently the negated sum of the per-token losses).
- The selected data (S in algorithm 1) is used for pre-training, i.e. training models from scratch. All reported results are for these models that are trained from scratch.
- Indeed, datapoints and sequences are the same thing. Since we consider language modeling, each datapoint is a sequence of tokens; in our case, each datapoint is a 512-token chunk of tokenized text from C4.
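To make the scoring in the answers above concrete, here is a minimal sketch (with illustrative names, not from our codebase) of how a single sequence could be scored from its per-token losses under the two auxiliary models; since the sequence log-likelihood is the negated sum of per-token losses, the score reduces to a loss gap:

```python
def sequence_score(marginal_token_losses, conditional_token_losses):
    """FRHOzen-style score for one sequence: total (summed per-token)
    loss under the marginal auxiliary model minus total loss under the
    downstream-finetuned conditional model. A larger gap means the
    conditional model assigns the sequence a relatively higher likelihood."""
    assert len(marginal_token_losses) == len(conditional_token_losses)
    return sum(marginal_token_losses) - sum(conditional_token_losses)

# Example with a 4-token sequence (real sequences are 512 tokens):
# sequence_score([3.0, 2.5, 4.0, 3.5], [2.0, 2.5, 3.0, 3.0]) -> 2.5
```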
We hope that this clarifies any misunderstandings and that you will reconsider your review and raise your score. If there are any lingering questions, do not hesitate to post them and we will try to clear them up.
Hello! We just wanted to ping the reviewer since the discussion period is almost over and you have not yet responded to our rebuttal. We think there are some misunderstandings in the original review that the rebuttal can clear up. We also urge the reviewer to look at the other reviews and the discussion with the other reviewers to see the general positive consensus among them.
Please take a look and let us know if your assessment of the paper has changed.
Thanks for the response. I'm going through the paper again in depth to make my final decision. Please give me some time.
Thank you for your response, which helped clear up my misunderstanding. I now believe this work is above the acceptance threshold. I still have some more questions.
1. In Algorithm 1 (page 5), n is the selected-data sample budget, not the computation budget, right? In your response above, you said n can be seen as the compute budget. If n is the data budget, then according to scaling laws (both Chinchilla and OpenAI use power laws), one usually has to pre-define C or N, and then D, when planning a large model pre-training run.
2. Although it has been shown in the general response that the method does not overfit to the tasks (e.g., the Books domain in the paper), I was wondering whether this work could generalize to selecting samples from multiple domains in one pass. As illustrated in LLaMA (Open and Efficient Foundation Language Models), the training data comes from 7 different domains (Wikipedia, CC, Books, etc.). Particularly for large models (13B, 65B), pre-training data should be more general.
3. According to the response, the length of a data point is 512 tokens. How does this hyper-parameter affect the results?
Thanks for engaging and for raising your score! Below we respond to the new questions raised in your comment.
- Right, this is a good point. In our setup, we fix the size of the target model and then just scale the size of the dataset instead of scaling the dataset and model at the same time. We do this in part to try to isolate the contribution of the dataset distribution separate from the model scale. Note that in all prior work on scaling laws, the dataset is held fixed, so it is not immediately clear how those scaling laws will apply when we change the underlying data distribution by aggressive filtering with FRHOzen. But we totally agree that this is a great direction for future work, and we are actively working on this question now, stay tuned!
- Yes, this is also a good point and similar to one raised by reviewer q13t. We view the distinction of "domains" as a somewhat arbitrary classification of the URLs that the data originated from. Since FRHOzen operates at the level of individual datapoints, we can just pass data from all the "domains" we have access to through FRHOzen and get a filtered dataset out. But we totally agree that we need to conduct larger-scale experiments beyond C4 to prove this out (although C4 itself does have a decent diversity of sources in it). Again we leave this to future work, but are actively looking into it now, so stay tuned for follow-up work.
- Another good question. So far we have held this context length fixed, but we will look into ablating it in the future.
Hopefully this clears up these last remaining questions, and thanks again for engaging in the discussion!
This work proposes a simple, intuitive approach to data selection based on an empirical Bayes formulation, comparing the likelihood assigned to a candidate training sample by a model trained on a base distribution with that of a model trained on the base distribution plus a smaller sample of high-quality target (test) data of interest. The method is scalable in that it requires only standard training computation plus additional forward passes, and empirical results suggest it enjoys scale-transfer properties, where small-scale experiments can be used to select data for larger training runs.
Strengths
- The setup is easy to follow and intuitive. A method that is theoretically principled from a Bayesian perspective while still feasible in practice is rare :]
- The connection to related algorithms is well detailed, including complementarity and the different tradeoffs against the contemporary RHOLoss.
- The use of the actual conditional-marginal loss gap is a strength of this method over DSDM and other influence-function techniques, which have been demonstrated to generalize poorly to realistic scenarios, for both computational reasons and broken assumptions.
- The scaling results (the headline figure, but the last section...) are impressive. They suggest that this simple method is worth further empirical exploration and expenditure of compute in the future.
Weaknesses
- The domain transfer experiment should be broadened. Does the result of Figure 2 only hold this favorably for the Project Gutenberg Books downstream target? It is possible that improvements are only strong when there exist very distinct sub-distributions in the prior that match the target. It would improve this empirical section if the authors considered more than one downstream target distribution.
- (More minor; academic constraints assumed) Only one model architecture and one scale-extrapolation test setting are considered. It would be more convincing if another model family were considered, along with a few more scales, especially beyond 1B, as some trends in O(100M) models change dramatically beyond a few billion parameters.
Questions
- A diversity term seems reasonably easy to incorporate. Did the authors experiment with a diversity regularizer term or step of any sort?
- Can we explain why conditional-only worsens as the pre-sampling efficiency constraint tau is relaxed? (Figure 2)
Limitations
- See weakness.
We first want to thank the reviewer for their thorough review and very positive assessment of our paper. In particular, they highlight the intuitive and principled algorithm, the connections to related work, the improvement over influence functions, and the impressive scaling results.
In the rest of this response we will address the weaknesses and questions raised in the review.
Weaknesses
1a. Yes, we agree that the Books task is somewhat of a toy task. We present it since it is very low-noise and clearly conveys the potential of the algorithm (we should also note that likelihood is often a good proxy for downstream tasks; see e.g. [1][2] and references therein). For a more real-world analysis, we present the same experiment for a suite of 8 downstream multiple-choice tasks in Figure 4; the curves are slightly less clean due to the noisy nature of multiple-choice accuracy evaluations, but the general trend is the same. This mixture of 8 target distributions shows that the method can be useful across many tasks at once by leveraging the downstream data as a more generic definition of "high-quality" data rather than targeting the specific tasks. Of course, it would be great to expand this to even more tasks, and we look forward to future work attempting to do this.
1b. As a step in this direction, we add a new experiment to test whether the 1.2B models that target our suite of 8 downstream tasks generalize to other downstream tasks that are unrelated to the selection process. We select a suite of 6 tasks (mostly from GLUE/SuperGLUE) related to natural language understanding and find that even the model trained on 8x less data selected by FRHOzen outperforms the model trained on randomly selected data by about 2 points. The results are attached as a table in the PDF and in markdown in the global response; we will update the paper accordingly.
- We totally agree that the scale is relatively small, but unfortunately pre-training beyond 1B to a ~7B model is beyond our computational constraints right now. We hope that publication of the work can encourage those with more compute to attempt such scaling.
[1] Huang et al., 2024, https://arxiv.org/abs/2404.09937
[2] Ruan et al., 2024, https://arxiv.org/abs/2405.10938
Questions
- In keeping with the Bayesian formulation, it is actually computationally difficult to solve the full diversity issue posed by general subset selection. We did experiment with different approximations (e.g., RHO) that attempt to somewhat incorporate diversity, and found that they perform worse. We agree that figuring out how to best incorporate a notion of diversity is a great direction for future work, but doing so with extra terms in the objective function itself is beyond the scope of this paper.
- This is a good question. One hypothesis is that when we use conditional-only we are essentially selecting for datapoints that have low loss under the conditional model. There are (roughly) two types of datapoints where this is true: (1) data that is relevant for the downstream task, and (2) data that is just easy for a language model (e.g., data that is highly repetitive). By selecting with the conditional model only, we mix up both kinds of data, so selecting too aggressively can start to hurt by selecting too much of type (2). In contrast, FRHOzen explicitly focuses the selection on type (1) and not type (2).
We hope that this clarifies any misunderstandings and we encourage the reviewer to increase their score or confidence if we have resolved their concerns or to let us know otherwise so we may try to clear up any remaining confusion.
I appreciate the authors' thorough response to my initial review. While I am happy with my strong score of 7 as is, I reiterate to the review pool that I believe the work is high quality, the method is simple and well motivated, and the experimental results are promising; I thus recommend it for acceptance.
I think that future research would benefit from exploring the authors' speculation on question 2 regarding why FRHOzen's selection signal seems more useful than the conditional-only method. Separating easy tokens from useful tokens is certainly at the crux of all data selection and valuation work in language modeling (throwing out trash tokens is just the table stakes), and so mechanistic analysis of why a method works at such a fine-grained level is just as important as increased scale and scope of the training experiments.
Thanks for the kind words and indeed we agree that this is a promising direction for future work!
The paper presents FRHOzen, a new data selection method for targeted pre-training of language models, which uses an empirical Bayes-inspired approach to derive a simple, efficient selection criterion based on the relative loss values of two auxiliary models. Evaluated on tasks such as domain adaptation from C4 to Books and multiple-choice question answering, FRHOzen consistently outperforms training on eight times more randomly selected data. It also scales effectively across model sizes, with data selected by 150 million parameter models yielding improvements when used to train a 1.2 billion parameter model.
Strengths
- This paper is well-written. The comparison and discussion are very sound, e.g., Section 4.2 on computational cost.
- The proposed method is quite effective -- outperforms training on 8x as much randomly selected data.
- The method's effectiveness in data selection is transferable across models of different sizes, making it scalable.
Weaknesses
- Section 2.1 (Bayesian Data Selection) lacks a rigorous derivation. This part can be considered an intuitive account of Bayesian optimization, but it does not constitute a strict derivation.
- The acceleration effect reported in this paper is quite significant, but the experimental setting uses Books as D_{down} and tests on the Books held-out set. In real-world scenarios, the downstream dataset used for selection and testing should be more general, such as a combination of multiple corpora or multiple end tasks. I hope the authors can confirm that the acceleration effect applies to more general datasets.
Questions
I am curious: when the model is larger (for example, at 7B), will the acceleration still be as significant (8x)?
Limitations
Yes.
We first want to thank the reviewer for their positive assessment of our paper. In particular, they highlight the clarity, the discussion of computational costs, the effectiveness of the method, and the scalability of the approach.
In the rest of this response we will address the weaknesses and questions raised in the review.
Weaknesses
- We are not sure we understand the point being raised here. Section 2.1 presents the main objective that we attempt to optimize and then two equalities (Bayes rule and introducing a prior). If any step here in particular is troubling, please let us know and we are happy to explain it further or improve the exposition.
- Indeed, we present the Books task as more of a didactic example than a real-world application, to show how the method can be applied. That is why we also consider a suite of 8 downstream multiple-choice tasks (results in Figure 1a, Figure 3, and Figure 4). And the new experiment presented in the global response shows that this generalizes to 6 more tasks. We agree that future work could apply the method even more broadly, but we think this is a reasonable proof of concept for a conference paper.
Questions
- We totally agree that this is interesting! But unfortunately training a 7B model is beyond our computational constraints right now. We hope that publication of the work can encourage those with more compute to attempt such scaling.
We hope that this clarifies any misunderstandings and we encourage the reviewer to increase their score or confidence if we have resolved their concerns or to let us know otherwise so we may try to clear up any remaining confusion.
Thank you for your response.
- It seems that formula (1) lacks mention of the model, as we cannot speak of the probability of one dataset given another dataset. Therefore, formula (1) refers to the probability of D_down given the model trained on a dataset. However, the model itself changes with the given training data and is not fixed, which makes the Bayesian part hard for me to accept. Could the authors elaborate further on this?
- My concern is not about the generalization of downstream tasks; what I mean is that, in addition to validating the acceleration on specific training data (Books), it is necessary to verify the acceleration on mixtures of pretraining data domains, as in [1].
[1] DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining
Thanks for the quick response!
- Thanks for the clarification; now we see where the miscommunication is. In Equations 1 and 2 we are referring to the marginal likelihoods we get when marginalizing out the model parameters. We then explicitly introduce the model parameters in Equation 3 (which is a straight equality from Equation 2). We can see how this was perhaps not well presented and could be confusing. We are happy to change the presentation so that the model parameters appear in every equation (as they do from Equation 3 onwards). Would that satisfy your concern?
- Again, thanks for the clarification. As we said before, the Books example is meant to be didactic. When thinking about practical applications, we likely want to focus on performance on downstream tasks. Take the DoReMi paper that you cite as an example: in that paper, Figure 2 presents their main results, which report downstream task accuracy. They propose a method for data selection that relies on mixing "domains", where "domains" are a user-defined way to bucket the datapoints based on where they come from. You can view this as a data selection method with very coarse features (i.e., a one-hot feature of which domain a datapoint comes from). FRHOzen can also be used to improve downstream task performance, but we do this using fine-grained features so that we can select at the datapoint level, rather than relying on coarse user-defined notions of domain. This seems beneficial, since we get 8x efficiency improvements in downstream accuracy compared to 2.6x for DoReMi (although the settings are not directly comparable, since (1) they operate at a larger scale than we are able to, (2) they use different data/tasks, and (3) we use data from only a single one of their "domains"). We agree that an interesting direction for future work would be to scale FRHOzen up to larger settings where the data comes from more diverse sources than C4, but there is no a priori reason that our method would not work, since it operates on a per-datapoint level with no need to define domains.
Thanks again for engaging with us. We hope this clarifies things so that you feel you can increase your score or your confidence. Please let us know if you have any more questions!
This paper proposes a data selection method that can improve the performance of language models on downstream tasks. The method uses two auxiliary models: one pretrained on the pretraining dataset, and another finetuned from that pretrained model on the downstream task. The method then selects the data with the largest difference in loss between the two models. The authors formulate the method using Bayesian reasoning and show that it amounts to maximizing the likelihood of sequences that will appear in the downstream task. The authors conduct experiments using a 150M-parameter model and show that the selection transfers to a 1.2B model. Using the selected data, the model achieves better performance on tasks even when pretrained on far fewer tokens.
Strengths
- The method is novel and intuitive. The idea of selecting data based on the difference in loss between two models is interesting and easily understood. The method is also well-motivated and well-explained in the paper.
- The method is well-formulated. It can be viewed as maximizing the posterior likelihood of the downstream sequences.
- The authors have thoroughly discussed how the method relates to other data selection methods.
- The experiments support the claims very well. The method is shown to be effective in improving the performance of the model on downstream tasks. Moreover, the method seems to reduce the computational cost of pretraining, because the model can be pretrained on far fewer tokens when using the selected data.
Weaknesses
- The paper aims at "pre-training", but the models tested are relatively small (150M and 1.2B). It is unclear whether the method can scale to larger models (e.g., 10B, 100B). The authors should test the method on larger models to show its scalability.
- In order to select pretraining data that improves downstream performance, the method requires knowing the downstream tasks ahead of time. This is a limitation because, in practice, we might not always know the downstream tasks when pretraining the model. And people generally want the pretrained model to be versatile, not just good at a few tasks.
- It is not very clear what will happen to downstream tasks that are not included in the data selection process. It is unclear whether there are side effects of the data selection process on unknown tasks.
- As the authors mention, "In particular, the FRHOzen objective no longer encourages the selection of a diverse dataset". So a lack of diversity and overfitting might be a concern when using the selected data.
- From a probability-based view, the method is effective in improving the likelihood of the downstream sequences. However, it does not explain or guarantee anything about RLHF/preference-based fine-tuning, and it is unclear how it will impact the model's safety and robustness.
Questions
- What will happen to downstream tasks that are not included in the data selection process? Will the model perform worse on those tasks?
- How do you choose the downstream tasks? Any criteria?
- Do you only consider maximum likelihood for the downstream tasks? What about other downstream finetuning methods, such as RLHF, DPO, etc.?
Limitations
Limitations with respect to approximations, data diversity, and computational cost are discussed.
We first want to thank the reviewer for their thorough review and largely positive comments. In particular, they highlight that the method is novel, intuitive, well-formulated, well situated with respect to related work, and has strong experimental results.
In the rest of this response we will address the weaknesses and questions raised in the review.
Weaknesses
- We totally agree that the scale is relatively small, but unfortunately training a 10B model is beyond our computational constraints right now. We hope that the publication of the work will encourage those with more compute to attempt such scaling.
- This is a reasonable concern, and indeed we may not always know all of the downstream tasks. We would like to point out that this is a larger issue that faces the entire field of language model training. It is not clear how we should define the desired behaviors of a language model or how we should evaluate performance in general, and it is an active area of research. One primary approach these days is to use a suite of downstream tasks. Our goal with this work is to show how this notion of quality can be leveraged to perform data selection. Moreover, there is no limit on the downstream tasks that can be used as targets in D_down. In our experiments we use a suite of 8 tasks, but we hypothesize that this method could scale to larger suites of tasks (of course, this would have to be proven in future work). We also add a new experiment (see the general response and attached PDF) showing that the data selected using our suite of 8 tasks also improves performance by a similar amount on 6 novel tasks.
- This is highly related to point 2 above, but also a reasonable concern. Note that we do not use the evaluation data directly, but we do use training sets derived from the same tasks. We also add a new experiment to test whether the 1.2B models that target our suite of 8 downstream tasks generalize to other downstream tasks that are unrelated to the selection process. We select a suite of 6 tasks (mostly from GLUE/SuperGLUE) related to natural language understanding and find that even the model trained on 8x less data selected by FRHOzen outperforms the model trained on randomly selected data by about 2 points. The results are attached as a table in the PDF and in markdown in the global response; we will update the paper accordingly. Thanks for raising this issue so that we could add this experiment!
- Yes, we agree that a lack of diversity is definitely a worry with this method (as we raise in the paper). However, as we also point out, related existing ideas for how to maintain diversity have serious computational problems and just don't work as well. We hope that future work can uncover whether this is really a problem at scale, or whether starting from enormous web-scraped data (which is inherently diverse) and not selecting too aggressively means that this is not a practical issue.
- Again, we agree that likelihood is merely a proxy for things we may care about downstream, like reward functions or safety, but we think it is a reasonable proxy. For some further evidence that likelihood is often a good proxy, see e.g. [1][2] and references therein. Moreover, using likelihood importantly facilitates the Bayesian analysis that yields a highly efficient selection method. It is not immediately clear how to target other metrics with a similar method, but it is definitely an interesting direction for future work.
[1] Huang et al., 2024, https://arxiv.org/abs/2404.09937
[2] Ruan et al., 2024, https://arxiv.org/abs/2405.10938
Questions
- See point 3 above and the added experiment.
- We chose the suite of downstream tasks following prior work (OLMo). But in general, this is up to the user. Our paper is focused on presenting the methodology, but this methodology is very flexible (which is why we tried to show it on two very distinct targets: multiple choice evals and books).
- Yes, we only consider likelihood as the metric for the tasks when performing selection. This follows directly from our Bayesian derivation. It is possible that a similar method could be derived with a different objective, but it is not clear how the Bayesian machinery would work when there is an unknown reward function we are targeting. An interesting direction for future work!
We hope that this clarifies any misunderstandings and we encourage you to increase your score if we have resolved your concerns or to let us know otherwise so we may try to clear up any remaining confusion.
Thanks to all the reviewers for their constructive comments. We hope we have resolved any misunderstandings.
One experiment suggested by the reviews (particularly reviewers yqth and ZLEF) was to test whether the data selected by FRHOzen for downstream tasks generalizes to new downstream tasks that are unrelated to the conditioning set. To this end, we evaluated the final checkpoints of the 1.2B models from Figure 1a on a new suite of 6 downstream natural language understanding tasks, and we report the results in the table below (and in the attached PDF). We find that the FRHOzen data does generalize, indicating that it is not overfit to the tasks but is picking up on generic notions of data quality for natural language understanding. Even with 8x less data, the FRHOzen models outperform randomly selected training data by nearly 2 points.
| Method | COPA | RTE | CommitmentBank | SST-2 | CommonsenseQA | Social IQa | Average |
|---|---|---|---|---|---|---|---|
| Random (24B tokens) | 69.2 | 49.1 | 43.2 | 46.9 | 33.9 | 42.6 | 47.5 |
| FRHOzen (6B tokens) | 70.2 | 51.4 | 41.6 | 55.8 | 35.6 | 44.3 | 49.8 |
| FRHOzen (3B tokens) | 67.8 | 50.1 | 46.0 | 55.8 | 32.5 | 43.5 | 49.3 |
This paper presents FRHOzen (Frozen Reducible Hold Out Loss), a novel data selection method for targeted pre-training of language models. The approach uses an empirical Bayes-inspired approach to derive a computationally efficient selection criterion based on the relative loss values of two auxiliary models, reminiscent of the Moore and Lewis (2010) cross-entropy difference method. The method is evaluated on two language modeling tasks: (1) selecting data from C4 for domain adaptation to Books, and (2) selecting data from C4 for a suite of downstream multiple-choice question answering tasks.
Strengths The paper is novel, intuitive, and well-motivated, leading to solid empirical results that consistently outperform training on randomly selected data.
Weaknesses The techniques are limited to small models (up to 1.2B parameters) and a limited set of downstream tasks.