Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale
ProX uses small language models to refine large scale pre-training data via program generation, significantly boosting pre-training models' performance and efficiency across various benchmarks and model scales.
Abstract
Reviews and Discussion
The authors introduce a framework called PROX for refining data quality in large language model (LLM) pre-training. This framework claims to address limitations in traditional, heuristic-based data-cleaning methods, which often lack flexibility and are labor-intensive.
Strengths
The advantage of this paper is its novelty in transforming data cleaning into a programmable task, which shifts away from traditional, rigid heuristic-based methods. By treating data refinement as a series of programmatically generated actions, the framework enables language models to apply tailored operations to each example, such as string normalization and noise filtering, with high precision.
Weaknesses
- The first limitation is that the amount of pre-training data used in the authors' experiments is insufficient, falling far short of what would be needed to fully support their claims. It’s highly possible that, with more extensive training, the advantages of the proposed method might diminish or even disappear.
- The second limitation is that using a model—especially a smaller one—for data selection introduces issues like hallucinations, bias, and the omission of less common information during the pre-training phase. These issues are difficult to resolve and can become deeply embedded in the foundation model. In fact, pre-training should ideally be a robust modeling of the real world, and relying on a smaller model for data refinement may skew this representation, undermining the model’s ability to capture a true and comprehensive understanding of diverse real-world contexts.
Questions
Each time the pre-trained model is updated, the entire pre-training dataset needs to be reprocessed, which is highly GPU-intensive. Do the authors have any alternative solutions to address this computationally expensive process?
Dear Reviewer L4Dm,
Thank you for your valuable time and effort. We are excited that you recognize the "novelty" of our method and the effectiveness of ProX in "transforming data cleaning into a programming task", which is very encouraging for us. However, there might be a few misunderstandings that we would like to clarify. To begin with, our 1.7B model is pre-trained on up to 50B tokens, which already exceeds the Chinchilla-optimal token budget, ensuring sufficient training data for our experiments under academic budget constraints. Additionally, we have carefully designed our method to reduce issues such as hallucinations, bias, and the omission of common details relative to other data synthesis methods. We describe these two points in detail below:
W1: Insufficient data for training experiments.
Response: Thank you for raising concerns regarding the adequacy of pre-training data volume in our experiments. We address these concerns from several perspectives:
- Sufficient Training Relative to Chinchilla Optimal Points: All from-scratch-trained ProX models in our experiments (up to 1.7B) are pre-trained with a token count exceeding the Chinchilla optimal point, as shown in Table 7. For reference, here is a comparison of ProX with the Chinchilla optimal points:
| Model Size | Trained Tokens (ProX, ours) | Chinchilla Optimal Tokens [1] |
|---|---|---|
| 0.3B | 26.2B | ~6B |
| 0.7B | 26.2B | ~14B |
| 1.7B | 52.2B | ~34B |
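(For transparency, the Chinchilla-optimal column above follows the commonly used ~20-tokens-per-parameter rule of thumb derived from [1]; this is an approximation rather than the paper's exact fitted formula.)

```python
# Quick sanity check of the Chinchilla-optimal column, using the common
# ~20 tokens-per-parameter approximation (a simplification of [1]).
for params_in_b in (0.3, 0.7, 1.7):
    print(f"{params_in_b}B params -> ~{20 * params_in_b:.0f}B tokens")
# 0.3B -> ~6B, 0.7B -> ~14B, 1.7B -> ~34B
```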
We believe this token volume is sufficient for robust pre-training. Also, given the extensive experiments across various pre-training corpora (also highlighted as strengths by Reviewers TCQo, 4aQX, and 1aQb), we believe ProX's impact has been thoroughly and appropriately demonstrated.
- Strong and Stable Downstream Performance with Increased Training: In terms of downstream task performance, we observe no decay as model size or training tokens increase; please refer to Table 3, Figure 5, and Figure 6. Beyond rule-based methods, we also compared ProX with competitive baselines trained on significantly larger datasets:
- Cosmo-1.8B: A 1.8B model trained on approximately 180B tokens.
- OLMo-1B: A model from AI2, trained on over 1 trillion tokens.
- ShearedLlama-1.3B: A pruned version of Llama-2-7B (1.3B parameters), with an additional 50B tokens of training. Even disregarding its prior Llama-2 training, this 50B alone matches the total training tokens of our ProX models.
As shown in Figure 6 and discussed on lines 365–377 (on page 7), ProX achieves comparable or superior performance to these extensively trained models, highlighting its efficiency rather than a limitation.
- Focus on Efficient Training: Efficiency is crucial for LLM development. Our experiments demonstrate that ProX-curated data enables models of varying sizes (from 0.3B to 1.7B and even up to 7B) to achieve performance comparable to models trained with up to 20 times the compute cost. Please kindly refer to Figure 1, Figure 6, Table 5, and Figure 8. Furthermore, as shown in our training dynamics (Figure 4), ProX consistently outperforms baselines throughout training. Thus, we believe ProX ensures efficient training without compromising performance, making it particularly valuable for resource-constrained settings.
In summary, we believe our findings actually substantiate ProX's ability to improve pre-training efficiency and deliver competitive performance, even at lower computational costs. This approach is particularly advantageous for achieving high-quality pre-training on limited budgets.
[1] Training Compute-Optimal Large Language Models, https://arxiv.org/abs/2203.15556
W2: Hallucination and other issues of model-based data selection and refining.
The second limitation is that using a model—especially a smaller one—for data selection introduces issues like hallucinations, bias, and the omission of less common information during the pre-training phase. These issues are difficult to resolve and can become deeply embedded in the foundation model.
Response: It is important to note that these challenges are not unique to ProX’s approach. Any method leveraging models for data engineering is likely to encounter similar issues of hallucination and bias to some extent. Nevertheless, model-based data engineering remains the leading trend in the field, as demonstrated by numerous studies (e.g., LLaMA-3[2], Qwen-2[3], DCLM[4], FineWeb-Edu[5], Phi[6]) that have successfully employed model-based methods for data engineering tasks.
At the same time, model-based selection methods are being increasingly discussed and researched, with many works investigating model-based data selection for pre-training. Furthermore, recent top-tier industry reports on large language models [2,3] indicate that model-based data selection and filtering methods have been used in developing new-generation models. While specific details in these reports remain undisclosed, we believe that ProX provides a step forward in advancing this area.
Unlike other synthesis methods based on LLMs, ProX aims to mitigate hallucination and bias through a structured approach: Program-Based Revision. ProX modifies data by executing predefined programs on the original text, ensuring that changes remain grounded in the original content. In comparison, methods such as Cosmopedia[7], which generate content from scratch, or rephrasing techniques that create new versions based on the original text, inherently carry a higher risk of introducing hallucinations. By relying on programmatic revisions, ProX substantially reduces the likelihood of generating erroneous or biased content.
In addition, ProX follows a careful approach when collecting supervised fine-tuning (SFT) data, ensuring that tasks are designed to remain within the capabilities of smaller models (see Section 2.2, lines 145–155). ProX's effectiveness has been validated through extensive benchmarking. As shown in Table 4, small models achieve over 80% accuracy on document-refining tasks and over 75% on chunk-refining tasks, which aligns with the "high precision" mentioned in your review as ProX's strengths. Furthermore, our error analysis (please refer to our response to Reviewer 4aQX, W1) indicates that the majority of errors occur when the predefined program is not executable. In such cases, ProX defaults to retaining the original documents to preserve the integrity of the original data distribution as much as possible.
Thus, we believe that ProX represents a relatively robust model-based approach with significantly fewer hallucinations, delivering substantial benefits even with smaller models. We greatly value your feedback and welcome any additional suggestions you may have for further improving ProX.
[2] The Llama 3 Herd of Models, https://arxiv.org/abs/2407.21783
[3] Qwen2.5-Coder Technical Report, https://arxiv.org/pdf/2409.12186
[4] Datacomp-lm: In search of the next generation of training sets for language models, https://arxiv.org/abs/2406.11794
[5] The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale, https://arxiv.org/abs/2406.17557
[6] Textbooks are all you need, https://arxiv.org/abs/2306.11644
[7] Cosmopedia, https://github.com/huggingface/cosmopedia
Q1:
Each time the pre-trained model is updated, the entire pre-training dataset needs to be reprocessed, which is highly GPU-intensive. Do the authors have any alternative solutions to address this computationally expensive process?
Response: Thank you for your question. We would like to clarify that all ProX refining models are fixed once fine-tuned and are designed for refining the domain they were trained on. Regarding your concern about "pre-trained model updates", it seems there may be some misunderstanding. We do not update the refining models once they are finalized, as they are not tied to the updates of the pre-trained models. Instead, they remain stable and suitable for continuous use.
To address your concern further, while we did conduct additional experiments to explore how the size of the refining model impacts the refining effect, these experiments do not affect the fact that our final refining models are fixed and can theoretically operate as long-term solutions within their respective domains. Please kindly refer to ProX's framework design (see Section 2, especially our illustration in Figure 2), as at no point have we claimed that the entire dataset needs to be reprocessed due to updates in pre-trained models.
Additionally, as suggested by Reviewer 1aQb, if more efficient or higher-performing models become available in the future, ProX is flexible enough to incorporate them. Such models could be easily fine-tuned to refine data, resulting in a better-quality corpus. This adaptability highlights the strength and versatility of the ProX approach.
Finally, regarding your comment on GPU intensiveness, we believe it is also important to consider this issue in the context of the overall computational budget. For example, when launching the training of a large-scale model such as a 70B parameter model, the GPU inference cost of using a relatively small 0.3B refining model to improve data quality is, we believe, a reasonable trade-off that can be justified given the improvements in pre-training data quality.
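As a rough illustration of this trade-off, the sketch below uses the standard ~6ND training and ~2ND inference FLOPs approximations (an assumption for illustration, not figures from the paper) to estimate the relative overhead of running a 0.3B refining model over the same tokens used to train a 70B model:

```python
# Back-of-the-envelope estimate using the common ~6*N*D (training) and ~2*N*D
# (inference) FLOPs approximations; illustrative only, not numbers from the paper.
refine_params, train_params = 0.3e9, 70e9
overhead_ratio = (2 * refine_params) / (6 * train_params)
print(f"refining inference ~= {overhead_ratio:.2%} of pre-training compute")  # ~0.14%
```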
Please let us know if you have any further questions so that we can provide additional clarifications to help you finalize the assessment and rating of our paper.
Thanks for your response. I have decided to raise my score to 5.
Thank you for your valuable feedback. As the ICLR public discussion phase will be ending in a few days, we want to check if our previous response has addressed your concerns. Thank you again for your time and insights!
Dear Reviewer L4Dm,
Thank you for your timely response and for acknowledging our efforts by raising the score. We truly appreciate that our recent comments have addressed some of your concerns.
However, as a score of 5 is still considered negative, we believe you may still have some remaining concerns. Should you have any further concerns or suggestions, please do not hesitate to let us know! We are more than willing to address them in the remaining time.
Thank you once again for your valuable feedback and thoughtful consideration of our work!
Dear Reviewer L4Dm,
We sincerely appreciate all the valuable feedback provided and wish you a Happy Thanksgiving!
As we have already provided detailed responses to address the concerns raised, we kindly ask if there are any remaining issues or questions that require further clarification during this period. Moreover, we would greatly appreciate it if you could re-evaluate the soundness, contributions, and other aspects of our work based on the responses we have submitted.
Best regards, The Authors.
The authors propose ProX, a method for data filtering and synthetic data generation for improving the quality of large pretraining corpora. ProX proposes to fine-tune a small language model to produce a “program”, which is a sequence of one or more filtering or normalization operations applied at the document and line level. These programs are synthesized and executed for each document in the corpus, which results in a smaller higher quality subset. The authors demonstrate that ProX yields significant improvements in pretraining and domain-specific continual pretraining in both accuracy and training efficiency across numerous baselines.
Strengths
- The paper introduces ProX, which is a novel method to improve training data quality using small language models to synthesize and execute short data refinement programs, rather than using static rules or large model data synthesis.
- Framing data filtering and synthetic data generation as a programming task that can be performed by small models is a creative contribution that has not been explored in previous work.
- Prox-D and Prox-D+C outperform all heuristic and model based data selection baselines and improve pretraining efficiency. The empirical validation is comprehensive, testing the method across multiple model sizes (350M to 1.7B), different pre-training corpora (RedPajama-V2, C4, FineWeb), and domain-specific applications (OpenWebMath).
- The method shows significant gains in accuracy and efficiency for domain specific continual pretraining, with up to a 20x reduction in compute. The effectiveness of synthesizing example-specific filtering programs shows especially here, as the method can synthesize domain-specific filtering rules.
- Although there is additional computational overhead from running inference of the refining model to generate the programs, this rapidly decreases as a proportion of the total pretraining FLOPs as the pretrained model gets larger. The method shows a 67% FLOPs reduction in pretraining to a given accuracy for a model as small as 1.7B.
- The paper is well structured and clearly written, with nice looking and clear figures to explain the method and results.
Weaknesses
- The paper is missing a qualitative analysis of the programs generated by ProX. It could benefit from:
- Examples of wrong programs generated by ProX and an analysis of the common failure modes
- Statistics on the complexity of the programs (e.g. distribution of function calls per document)
- An analysis of how program quality varies across different domains
- A discussion of potential safety issues since programs may modify data in unintended or harmful ways
- The paper could benefit from comparison to simpler baselines, like binary classifiers (e.g. fastText) for quality filtering.
- The paper could explore expanding the space of programs, for example by inclusion of additional operations like text transformations, or composing multiple operations.
Questions
Did you compare Prox-D against an n-gram based filtering method like fastText? DataComp-LM (Li et al.) find that fastText-based quality filtering was the best performing method they tried, outperforming the Gopher and C4 filtering rules, which are also used as baselines in this paper.
W3: Further exploration of more operations
Response: Thank you for this suggestion. Expanding the space of programs is indeed a promising direction. In line with your comment, we believe incorporating additional refining operations, such as text transformations (e.g., reformatting or regular-expression-based rewriting) and the composition of multiple operations, would be valuable. ProX is currently our exploratory step toward this goal, and we are delighted to further broaden its utility and functionality across various large-scale corpora in an open-sourced way based on ProX's current framework.
We hope the detailed analysis and the updated experimental results can address your concerns!
Thank you very much for the follow ups and fantastic work! I will retain the score -- I think this work is great.
Dear Reviewer 4aQX,
Thank you for recognizing our work! We are delighted to see your appreciation for ProX's novelty and effectiveness in improving data quality. At the same time, we are happy to provide further explanation regarding the qualitative analysis and the comparison with classifier-based filtering methods.
W1: About more ProX generated program analysis and discussion.
Response: Thank you for the constructive suggestion. We have conducted an analysis of ProX-generated programs along with some statistics below, based on 100,000 randomly sampled ProX programs.
As shown in the table, the failure ratio for both refining stages and both domains is very low (< 0.5%), which further demonstrates that ProX's refining tasks are well-suited for these small models.
| Domain | Failure Ratio (doc-level) | Failure Ratio (chunk-level) | Complexity (avg. function calls, chunk-level) |
|---|---|---|---|
| General Domain | 0.04% | 0.36% | 3.7 |
| Math Domain | 0.06% | 0.11% | 2.7 |
Regarding common failure modes, we present the two most frequent failure cases of ProX's programs below; most failures occur because the generated programs are incomplete or cannot be executed:
- Repeated output (or Empty output):
Document:
[004] P: 114 1. The problem statement, all variables and given/known data Mercury is poured into a U-tube as in Figure P15.18a....Basically I don't understand why you would know to set the two volumes equal to each other? How do you know the volumes are the same?
......[too long to show]
[007] Related Discussions Mechanical Engineering 6 Introductory Physics Homework 0 General Engineering 1 Introductory Physics Homework 2 Introductory Physics Homework 2
ProX Program:
remove_lines(start=1, end=1)
remove_lines(start=6, end=6)
remove_lines(start=7, end=7)
remove_lines(start=7, end=7)
remove_lines(start=7, end=7)
remove_lines(start=7, end
- The target of string removal / line removal can be non-existent:
Document:
...
[195]18. Sathyamoorthi, C. R., Mbekomize, C., Mapharing, M., & Selinkie, P. (2018). The Impact of Corporate Governance on Working Capital Management Efficiency: Evidence from the Listed Companies in the Consumer Services Sector in Botswana. International Journal of Economics and Finance, 10, 135. https://doi.org/10.5539/ijef.v10n12p135
[196]19. Vu, T. M. T., Tran, C. Q., Doan, D. T., & Le, T. N. (2020). Determinants of Capital Structure: The Case in Vietnam. Journal of Asian Finance, Economics, And Business, 7(9), 159-168. https://doi.org/10.13106/jafeb.2020.vol7.no9.159
...
ProX Program:
# Analysis: this `source_str` can not be found in the original text
normalize(source_str="https://doi.org/10.13106/jafeb.2020.vol6.no2.53", target_str="")
We have also updated our manuscript to include more analysis and discussion of ProX in Appendix F.3.
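For readers interested in how the fallback behavior described in our error analysis can be realized, below is a minimal illustrative sketch (a simplification, not the exact ProX implementation) of an executor for chunk-level programs like those shown above; both failure modes are caught and the original text is retained. The parsing details and the 0-based inclusive line indices are assumptions made for this sketch.

```python
import re

def _remove_lines(lines, start, end):
    # Drop lines[start .. end]; this sketch assumes 0-based inclusive indices.
    return lines[:start] + lines[end + 1:]

def _normalize(text, source_str, target_str):
    if source_str not in text:
        # Mirrors failure mode 2 above: the source string may not exist in the text.
        raise ValueError("source_str not found in chunk")
    return text.replace(source_str, target_str)

def apply_chunk_program(chunk: str, program: str) -> str:
    """Execute a ProX-style chunk-refining program; keep the original chunk on any failure."""
    lines = chunk.splitlines()
    try:
        for call in program.strip().splitlines():
            m = re.fullmatch(r"remove_lines\(start=(\d+), end=(\d+)\)", call.strip())
            if m:
                lines = _remove_lines(lines, int(m.group(1)), int(m.group(2)))
                continue
            m = re.fullmatch(r'normalize\(source_str="(.*)", target_str="(.*)"\)', call.strip())
            if m:
                lines = _normalize("\n".join(lines), m.group(1), m.group(2)).splitlines()
                continue
            # Truncated or malformed calls (failure mode 1 above) land here.
            raise ValueError(f"unparseable call: {call!r}")
        return "\n".join(lines)
    except Exception:
        return chunk  # fall back to retaining the original document text
```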
W2 & Q1: Comparison with fastText filtering, such as the fastText classifier used in DCLM
Response: We appreciate your constructive feedback! As you suggested, we have included fastText as a baseline for Section 3.2 (see Table 2 and the updated table below). To ensure a fair comparison, we trained the fastText classifier on the same training data used for our ProX document-level refining models: all documents labeled with drop_doc() are treated as negative samples, while those labeled with keep_doc() are treated as high-quality samples. We trained the fastText baseline from scratch using the same configuration as the other runs in Table 2.
| Method | ARC-C | ARC-E | CSQA | HellaSwag | MMLU | OBQA | PiQA | SIQA | WinoG | SciQ | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Raw | 26.1 | 44.3 | 29.7 | 39.1 | 27.3 | 29.2 | 66.9 | 39.0 | 52.0 | 67.4 | 42.1 |
| Rule-based Methods | 25.2 | 46.8 | 32.6 | 39.6 | 27.2 | 29.0 | 66.5 | 39.4 | 52.4 | 69.2 | 42.8 |
| fastText | 26.9 | 49.9 | 29.5 | 39.0 | 28.5 | 31.8 | 64.7 | 39.6 | 52.1 | 70.4 | 43.2 |
| ProX-D | 26.6 | 49.7 | 30.1 | 40.5 | 29.4 | 30.4 | 66.3 | 39.0 | 51.2 | 71.6 | 43.5 |
| ProX-D+C | 26.4 | 51.9 | 30.9 | 42.4 | 29.4 | 31.6 | 67.9 | 40.0 | 52.2 | 73.5 | 44.6 |
Our findings are very similar to those of the DCLM paper you mentioned in the review: fastText-based filtering is very powerful, outperforming rule-based methods and coming very close to our document-level refining (ProX-D). However, it still shows a clear gap (1.4% on average) to our two-stage refining (ProX-D+C). We have incorporated this result into our revised version, as updated in lines 243-250, Table 2, Figure 4, and Table 11. Please kindly refer to it for further details.
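For completeness, here is a minimal sketch of how such a fastText quality classifier can be trained from the keep_doc()/drop_doc() labels described above; the file path and hyperparameters are illustrative placeholders rather than the exact configuration used:

```python
import fasttext  # pip install fasttext

# Each training line looks like "__label__keep <text>" or "__label__drop <text>",
# derived from whether the ProX document-level program used keep_doc() or drop_doc().
model = fasttext.train_supervised(
    input="quality_train.txt",  # placeholder path
    lr=0.1,
    epoch=5,
    wordNgrams=2,  # bigram features, commonly used for quality filters
)

labels, probs = model.predict("Some candidate pre-training document ...")
keep = labels[0] == "__label__keep" and probs[0] > 0.5  # illustrative threshold
```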
Thank you for your feedback and encouraging words! We really appreciate it! In our final version, we will polish the paper further to incorporate the valuable insights gained from the rebuttal discussions. Thank you again!
Authors
The paper "Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale" introduces ProX, a framework leveraging small language models to refine large-scale pre-training data through program generation. ProX aims to enhance the quality of training data by executing fine-grained operations on individual examples. The authors demonstrate significant improvements in pre-training efficiency and model performance across various benchmarks and model sizes. Notably, ProX can achieve remarkable gains in domain-specific continual pre-training tasks, as the authors show with OpenWebMath.
Strengths
- Novel Chunk-Level Rewriting Technique: A novel and fine-grained approach that moves beyond binary keep/dismiss judgments, allowing deterministic edits to individual chunks of text. This method is validated through robust ablation studies, proving its efficacy in improving pre-training data quality.
- Efficient Use of Compute: By leveraging small language models (0.3B parameters) for data refinement, ProX achieves substantial performance improvements with lower computational costs. This efficiency is especially noteworthy in domain-specific tasks like OpenWebMath, where ProX yields remarkable accuracy gains.
- Comprehensive Ablations: The authors tested ProX across multiple model sizes (0.3B to 1.7B) and pre-training corpora (C4, RedPajama-V2, FineWeb), with significant performance enhancements demonstrated consistently. The results show that even smaller refining models can produce high-quality pre-training data.
- Important Practical Implications: The framework offers a scalable solution to pre-training data refinement, which can be particularly valuable in scenarios where human expert intervention is impractical. ProX’s deterministic chunk-level edits contribute to a more nuanced and flexible data curation process.
Weaknesses
- Base Dataset Selection and Novelty: The exclusive reliance on older datasets like C4 and RedPajama limits the relevance of the findings. While the results on OpenWebMath are impressive, it would be more compelling to see ProX applied to modern datasets such as FineWeb-edu or DCLM-base, which have undergone extensive refinement.
- Document-Level Refinement: The document-level component of ProX appears to be a straightforward application of FineWeb-edu’s filtering techniques, reducing its novelty. The primary innovation lies in the chunk-level refinement, which could be more explicitly distinguished from the document-level methods.
- Evaluation Methodology: The emphasis on zero-shot performance may not fully capture the advantages of ProX, especially since the refining models are not fine-tuned for instruction-following tasks. Few-shot performance with higher n might provide more insightful comparisons.
- Exclusion of Base Model Training FLOPs: The paper excludes the computational cost (approximately 5.3e19 FLOPs) for training the base model used to create the refining model. This is significant, as it constitutes nearly half the total compute, raising concerns about the true efficiency gains. The authors should have explored using an existing pre-trained model to substantiate ProX’s cost-effectiveness.
- Context Window Considerations: Given that ProX needs to refine long documents, the context window of the refining model is a critical factor. The paper does not provide sufficient details on how the model handles long contexts, which could impact its practical utility.
Questions
- Generalizability to Highly Refined Datasets: How would ProX perform when applied to datasets that are already highly refined, such as FineWeb-edu or DCLM-base? Are there diminishing returns when starting with higher-quality corpora?
- Evaluation Choices: Why did the authors prioritize zero-shot performance for their evaluation? Wouldn't few-shot performance offer a more nuanced understanding of ProX’s impact, especially for instruction-following tasks?
- Compute Cost of Refining Model Training: Why did the authors not also use an existing pre-trained model for the refining task? This could have significantly reduced the compute cost. Additionally, what measures were taken to address the long context requirements of creating refinement programs for lengthy documents?
Dear Reviewer 1aQb,
We're glad you appreciated ProX's computational efficiency and our extensive ablation studies. To address your concerns, we made additional efforts, especially by applying ProX to FineWeb-Edu and including few-shot evaluation results. Please refer to our detailed reply below:
W1 & Q1: Pre-training corpus selection
The exclusive reliance on older datasets like C4 and RedPajama limits the relevance of the findings. While the results on OpenWebMath are impressive, it would be more compelling to see ProX applied to modern datasets such as FineWeb-edu or DCLM-base, which have undergone extensive refinement.
This is a great question, and we sincerely thank you for this valuable comment! Below, we present the latest results of ProX applied to Fineweb-Edu. We trained two models using the exact same settings as in Table 2. In simple terms, applying ProX to Fineweb-Edu results in a performance boost across all 10 downstream benchmarks, with an average improvement of +1.3% for the 0.7B model, and 0.8% for the 1.7B model.
We believe this demonstrates that ProX can further enhance the quality of even the most recent high-quality datasets, including those that have already undergone extensive refinement and very aggressive data filtering.
| Model Size | Data | ARC-C | ARC-E | CSQA | HellaSwag | MMLU | OBQA | PiQA | SIQA | WinoG | SciQ | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.7B | Fineweb-Edu | 30.3 | 58.7 | 29.0 | 42.0 | 30.4 | 31.8 | 67.7 | 38.1 | 50.4 | 73.3 | 45.2 |
| | Fineweb-Edu-ProX | 31.2 | 59.5 | 30.2 | 44.1 | 30.7 | 32.8 | 69.2 | 38.8 | 50.8 | 77.3 | 46.5 |
| 1.7B | Fineweb-Edu | 36.7 | 65.1 | 32.0 | 52.9 | 33.6 | 32.8 | 72.5 | 40.3 | 53.5 | 82.3 | 50.2 |
| | Fineweb-Edu-ProX | 37.3 | 66.6 | 34.4 | 53.1 | 34.1 | 35.8 | 72.1 | 40.1 | 52.6 | 83.5 | 51.0 |
W2: Lack of Novelty of document-level methods
Response: We appreciate this feedback and acknowledge the significant contributions of FineWeb-Edu. Our work in ProX unifies document-level and chunk-level operations under a single framework to provide a novel perspective: viewing data cleaning as a case-by-case program generation process. This comprehensive approach enables ProX to handle both high-level document filtering and detailed string-level normalization - capabilities that previous methods could not achieve simultaneously. In the revised version, we will clearly clarify that most of the document-level contributions stem from FineWeb-Edu to avoid any potential misunderstanding, while emphasizing that our key contribution lies in this unified programmatic approach to data cleaning. We are grateful to the reviewer for pointing this out and will ensure proper attribution and clear differentiation of our contributions.
Thank you very much for the detailed answer. I consider the matter resolved.
Thank you for your thorough and constructive feedback. We will carefully incorporate your suggestions and integrate the new results into the final version. We are pleased that our response has addressed your concerns, and we would be grateful if you would consider increasing the score to support our work. Please let us know if you need any additional clarification!
W3 & Q2: Choices of evaluation setting
Response: We chose to present zero-shot evaluation in Experiment 1 (Table 1 and Figure 4) mainly following the settings used in FineWeb's ablation experiments. We find that this evaluation maintains a very stable performance curve as training tokens accumulate, and it is very time-efficient for fast evaluation given our extensive pre-training experiments (20+ final runs, with hundreds of intermediate checkpoints). We also acknowledge that it would be interesting to see ProX's effect under few-shot evaluation, so we have added the 5-shot evaluation results below for your reference:
| Model | ARC-C | ARC-E | CSQA | HellaSwag | MMLU | OBQA | PIQA | SIQA | WinoG | SciQ | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Raw | 25.5 | 50.3 | 33.2 | 39.9 | 27.8 | 29.2 | 67.8 | 38.7 | 52.4 | 71.5 | 43.6 |
| Rule (best FineWeb rules) | 26.2 | 50.9 | 34.1 | 41.8 | 27.8 | 29.2 | 66.8 | 40.5 | 52.0 | 72.8 | 44.2 |
| ProX-D | 29.1 | 55.7 | 35.6 | 41.8 | 29.4 | 29.2 | 66.8 | 38.3 | 51.3 | 77.0 | 45.4 |
| ProX-D+C | 27.2 | 59.9 | 38.3 | 42.8 | 29.7 | 31.4 | 67.1 | 40.3 | 50.2 | 75.8 | 46.3 |
As shown above, ProX produces the best performance while the rule-based method stays below ProX-D, just as the zero-shot evaluation indicates. Also, we find that not all benchmarks perform better with few-shot prompts than zero-shot: for example, we do not observe a clear performance boost on HellaSwag, MMLU, PIQA, and WinoGrande under the 5-shot configuration. Similar observations are reported in recent works [1,2], where 0-shot HellaSwag and 0-shot WinoGrande show performance very close to their 5-shot counterparts.
We have updated these results and summarized our findings in the evaluation setup section (Appendix D.1).
[1]OpenELM: An Efficient Language Model Family with Open Training and Inference Framework, https://arxiv.org/pdf/2404.14619
[2]Scaling Data-Constrained Language Models, https://arxiv.org/abs/2305.16264, NeurIPS 2023
W4 & Q3: the alternative to using existing models to reduce FLOPs
Response: We appreciate this suggestion. We would like to clarify that using trained-from-scratch models instead of existing models is intentional and serves important scientific purposes. While existing industry models might demonstrate superior performance, using them would compromise our study's scientific rigor since we cannot verify their pre-training data composition or determine if they were pre-trained on refined operations like programs.
Using our own pre-trained small model ensures data quality consistency across all model scales and maintains experimental transparency. This approach allows us to demonstrate that small models can effectively improve larger models' training corpora. While we used this setup for scientific validation, we believe even smaller in-house models designed for refinement could yield significant improvements in practice.
W5 & Q3: ProX's approach to handling long documents.
Response: All refining models are fine-tuned with a max sequence length of 2048, which is kept the same as the pre-trained model reported in Appendix B.3, line 1345-1348.
Moreover, to handle very long documents:
- During doc-level refining, we simply truncate the doc length to 2048, which is applied by many concurrent works [3,4]. Also, as analyzed in Section 4.1, the average token lengths of all of the 4 corpora are below 2000. So we believe this is an acceptable engineering solution.
- For chunk-level refining, we present our chunk-splitting pseudo code in Appendix A.4. We split the document line by line and aggregate lines until the current chunk exceeds the max window size; we set the chunk window to 1,500 tokens.
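For clarity, here is a minimal sketch of this line-wise chunking (a simplification of the pseudo code in Appendix A.4; `count_tokens` stands in for the actual tokenizer):

```python
def split_into_chunks(document, count_tokens, max_chunk_tokens=1500):
    """Aggregate lines until the current chunk exceeds the token window, then start a new one."""
    chunks, current_lines, current_tokens = [], [], 0
    for line in document.splitlines(keepends=True):
        current_lines.append(line)
        current_tokens += count_tokens(line)
        if current_tokens > max_chunk_tokens:
            chunks.append("".join(current_lines))
            current_lines, current_tokens = [], 0
    if current_lines:
        chunks.append("".join(current_lines))
    return chunks
```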
[3] The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale, https://arxiv.org/abs/2406.17557
[4] Arctic-SnowCoder: Demystifying High-Quality Data in Code Pretraining, https://arxiv.org/abs/2409.02326
We hope updated results together with analysis can address your concerns! Should there be a need for further clarification to assist in finalizing your assessment, please do not hesitate to inform us.
This paper presents Programming Every Example (PROX), a framework that redefines language model pre-training by treating data refinement as a programming task. PROX allows even small language models (as few as 0.3B parameters) to perform fine-grained data operations, achieving performance comparable to human experts. Models trained with PROX-curated data outperform those trained on original or traditionally curated data by more than 2% across ten benchmarks. It is effective with different model sizes and corpora, demonstrating significant potential in domain-specific tasks and reducing computational costs.
Strengths
A highly complete paper, and I appreciate it along with your hard work.
- The paper is clearly articulated with intuitively designed charts that help readers easily understand the problems being addressed, the research motivation, and the methods employed.
- It is detailed in scope, including appendices with textual explanations of document/block-level programming, algorithm flowcharts, prompts used, pretraining details, baselines, downstream task introductions, and case analyses, all of which are comprehensive.
- The experiments are rich in content, especially the appendices, and feature beautifully crafted charts.
- The field this paper focuses on is crucial to the LLM community -- "how to improve the quality of pretraining data."
Weaknesses
- This paper raises several concerns for me:
The methodology can be summarized as follows: it utilizes LLAMA's annotated pairs (doc, program) to fine-tune a small model Prox with approximately 0.3B parameters, serving as a proxy. Based on a vast corpus of pre-trained documents, Prox generates Python function calls to conduct document-level encoding (discard or retain) and chunk-level encoding (discard, retain, normalize).
- a. Firstly, while the goal of the paper is to balance data processing efficiency and enhance data quality, the introduction of Prox as a proxy to call Python functions might add extra computational cost, potentially undermining the practicality of this approach. A comparative analysis of time overhead with other methods would strengthen the argument for the effectiveness of this approach.
- b. Furthermore, if direct application of Python functions achieves the document-/chunk-level heuristic optimizations as proposed, how does this alternative compare to Prox in terms of performance? After all, the processes of discarding or retaining documents and chunks, and normalizing chunks (such as top menus, navigation bars, buttons, HTML elements, links, and footers), can be addressed using existing Python rules (as referenced with C4, Gopher, and RedPajama).
- c. Additionally, how do the document- and chunk-level programming rules differ from existing heuristic rules, particularly in relation to the above mentioned processes (at paragraph b) and education scores, considering related research such as [1] and [2] have already proposed similar concepts?
- The experimental results in the paper need further development. In Table 2, only a 750M LLAMA model is used for pretraining on 26B data, while the minimal data volume and model size in standard practice are at least a 1.3B model and 30B data. I believe Prox could demonstrate more significant improvements in in-context learning (ICL) capabilities with larger data scales and model sizes.
- Some experimental comparisons require clarification. For example, in Table 5, the comparison on domain-specific data tasks should not be between CPT and the BASE model. Instead, it should be between randomly sampled data and Prox-refined data of equivalent volume size, with two BASE models trained from scratch.
- The current paper only examines mathematics as a specific domain. Expanding the scope to include more vertical fields such as finance, education, or medicine would be beneficial.
[1] Gunasekar, S., Zhang, Y., Aneja, J., Mendes, C. C. T., Giorno, A. D., Gopi, S., Javaheripi, M., Kauffmann, P., de Rosa, G., Saarikivi, O., Salim, A., Shah, S., Behl, H. S., Wang, X., Bubeck, S., Eldan, R., Kalai, A. T., Lee, Y. T., and Li, Y. Textbooks are all you need, 2023.
[2] Alexander Wettig, Aatmik Gupta, Saumya Malik, and Danqi Chen. QuRating: Selecting high-quality data for training language models. In International Conference on Machine Learning (ICML), 2024.
Questions
Please Check Weakness.
Details of Ethics Concerns
No ethics review needed.
W1.b Methodology Comparison with Static Rule-based Methods
if direct application of Python functions achieves the document-/chunk-level heuristic optimizations as proposed, how does this alternative compare to ProX in terms of performance?
Response: Thank you for the insightful question regarding the role of direct Python heuristics in comparison to ProX. To clarify, ProX’s mechanism differs fundamentally from static rule-based methods in terms of flexibility and adaptability to individual data samples. Here, we highlight the key differences and summarize our experimental comparisons:
(1) Methodological Differences: ProX fundamentally differs from static rule-based methods, such as traditional Python heuristics or rules used in Gopher, C4, and FineWeb. While rule-based methods rely on a fixed set of operations applied globally to all data samples, ProX dynamically generates case-by-case refining programs tailored to the content and context of each document or chunk. This flexibility enables ProX to adapt to diverse noise patterns and data characteristics, allowing it to selectively retain valuable information and perform nuanced operations that static rules cannot achieve.
(2) Empirical Performance Evidence: By applying ProX's case-by-case program generation to pre-training data initially refined by existing global fixed rules, we demonstrate significant performance improvements in downstream tasks. As shown in Table 2, ProX achieves an average boost of more than 2% over purely rule-based methods. Additionally, in Section 3.2, we compare ProX with both heuristic approaches (e.g., Gopher, C4, FineWeb) and state-of-the-art model-based data selection methods (e.g., QuRating[ICML'24] and MATES[Neurips'24]). These results demonstrate the effectiveness of integrating case-by-case adaptability into data refinement, validating the fundamental difference between ProX and traditional approaches.
In summary, while traditional Python heuristics offer a baseline approach to data filtering and normalization, ProX’s adaptive, model-driven approach delivers a measurable and meaningful performance advantage. This advantage is particularly evident in cases requiring high-quality, context-aware data refinement, where ProX’s flexibility and dynamic operation generation are critical.
W1.c Difference between document- and chunk-level programming and existing heuristic rules and methods
how do the document- and chunk-level programming rules differ from existing heuristic rules?
Response: Please kindly refer to our response to W1.b.
how do the document- and chunk-level programming rules differ from ......, and education scores, considering related research such as [1] and [2] have already proposed similar concepts?
Response: Phi [1] mainly prompts the model (GPT-4) to assign an "educational value" to each document in order to filter low-quality samples from code data. QuRating [2] proposes different quality-ranking dimensions, trains and prompts 1.3B models to judge document quality along these dimensions, and finally trains the model on data of mixed quality. These practices focus only on filtering whole documents, which is already covered by our document-level programming.
Plus, we believe that document quality is also related to its formatting. Please refer to Appendix A.1 for our mixed scoring prompts and the comparison results with Fineweb-Edu (see Figure 6). As the experiments show, these two criteria can be easily learned by very small models, and ProX-curated data outperforms data selected in the QuRating style by a clear margin (+2.9%).
Moreover, our chunk-level programming can refine documents at a finer granularity. We enable models to generate chunk-level programs that can remove certain lines, remove meaningless strings, and normalize noisy patterns. These operations are entirely different from those in Phi and QuRating. We also present real case studies in Appendix F.2, Case Studies (Tables 34-35).
W2: Further development of experimental results
In Table 2, only a 750M LLAMA model is used for pretraining on 26B data, while the minimal data volume and model size in standard practice are at least a 1.3B model and 30B data. ProX could demonstrate more significant improvements in in-context learning (ICL) capabilities with larger data scales and model sizes.
Response: Thank you for raising concerns regarding the scalability of our approach to larger model sizes and corpora, e.g., a 1.3B model and 30B of data. However, we would like to clarify that we have already explored such settings in Section 3.3, where we conducted experiments demonstrating the effectiveness of ProX with up to a 1.7B model trained on 50B+ tokens. In short:
- Evidence of Scalability with Larger Models and Corpora: As shown in Figure 6, we conducted experiments with a 1.7B model trained on more than 50B tokens from 4 different corpora. These results directly validate that ProX maintains its effectiveness when scaled to larger model sizes and data volumes, providing strong evidence that our method scales well with increased computational and data resources.
- Competitive Performance Against Models Trained with More Data: Additionally, we compared ProX with models trained on significantly larger datasets, such as InstructionLM-1.3B trained on 100B tokens and COSMO-1.8B trained on 180B tokens. These models utilize more extensive computational resources and high-quality pretraining corpora, yet ProX was able to match or even surpass their performance at a lower computational cost. This comparison highlights ProX's efficiency and suggests that it can achieve competitive results even against models trained with considerably more data. (see Figure 6, and line-364 to line-377)
We believe the current results reported in Section 3.3 and Figure 6 provide clear evidence of ProX’s scalability and competitive performance, even as both model and data scales increase.
W3: Need clarification of continual pre-training results.
Some experimental comparisons require clarification. For example, in Table 5, the comparison on domain-specific data tasks should not be between CPT and the BASE model. Instead, it should be between randomly sampled data and ProX-refined data of equivalent volume size, with two BASE models trained from scratch.
Response: In all of our continual pre-training experiments (please refer to Table 5), we compare ProX to both the base models and models trained on randomly sampled data, just as you suggested. Notably, ProX outperforms all these baselines under equivalent training configurations; compared to the models trained on randomly sampled data, the gains are +2.9% for TinyLlama, +3.3% for Llama-2-7B, +6.2% for CodeLlama-7B, and +4.4% for Mistral-7B. Please refer to the table below:
| Model | Method | GSM8K | MATH | SVAMP | ASDiv | MAWPS | TAB | MQA | MMLU STEM | SAT MATH | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| TinyLlama | random | 6.2 | 4.8 | 22.3 | 36.2 | 47.6 | 19.3 | 11.6 | 20.7 | 25.0 | 21.5 |
| | ProX | 9.0 | 5.6 | 23.8 | 41.9 | 56.9 | 22.2 | 15.6 | 26.8 | 31.2 | 25.7 |
| Llama-2-7B | random | 29.6 | 13.6 | 49.2 | 61.9 | 78.4 | 36.3 | 31.9 | 40.5 | 43.8 | 42.8 |
| | ProX | 30.6 | 16.8 | 50.2 | 63.7 | 79.3 | 37.3 | 40.1 | 43.8 | 53.1 | 46.1 |
| CodeLlama-7B | random | 31.1 | 14.8 | 51.4 | 62.1 | 81.2 | 33.6 | 30.4 | 40.5 | 43.8 | 43.2 |
| | ProX | 35.6 | 17.6 | 55.8 | 67.9 | 82.7 | 41.3 | 38.9 | 42.6 | 62.5 | 49.4 |
| Mistral-7B | random | 44.4 | 19.2 | 65.2 | 69.6 | 88.4 | 46.6 | 43.1 | 50.8 | 65.6 | 54.8 |
| | ProX | 51.0 | 22.4 | 64.9 | 72.9 | 89.2 | 49.8 | 53.0 | 54.2 | 75.0 | 59.2 |
We will adjust the presentations further if you have any further suggestions.
W4: Applying ProX to other domains.
The current paper only examines mathematics as a specific domain. Expanding the scope to include more vertical fields such as finance, education, or medicine would be beneficial.
We appreciate the suggestion regarding domain expansion. We chose mathematics because it offers a well-established infrastructure with extensive datasets and clear evaluation metrics, making it an ideal testbed for our study. More importantly, mathematics serves merely as a demonstrative case to validate that our method can effectively extend beyond general pre-training to specific domains. While expanding to fields like finance, education, or medicine would be valuable, such expansion would require significant resources for corpus development and evaluation frameworks. This is beyond our current focus on developing and validating a generic method for improving model performance by improving corpus quality, and thus remains as future work.
Dear Reviewer TCQo,
Thank you for the careful review. We're glad you appreciated the clarity of our presentation, the detailed appendices, and our comprehensive experiments. Your recognition of our work is very encouraging. We truly appreciate your insightful questions and have taken immediate steps to address them during rebuttal. In summary, we have tried our best to: add the FLOPs computation analysis, and add new experimental results, e.g., experiments on Fineweb-Edu (+0.9% on 1.7B). We describe these results in detail below:
W1.a Computation Cost Analysis
The introduction of ProX as a proxy to call Python functions might add extra computational cost, ... A comparative analysis of time overhead with other methods would strengthen the argument for the effectiveness of this approach.
Response: Thank you for highlighting the importance of a controlled comparative analysis of the effectiveness and computational costs of different methods. In the paper, we have employed one of the most commonly used metrics for computing overhead—Compute FLOPs—and provided a detailed analysis in Section 4.2, Figure 8.
Comparison with vanilla pre-training: Figure 8 demonstrates that ProX significantly reduces FLOPs while achieving similar downstream performance, with reductions of up to 40% in total FLOPs for the 1.7B model. To provide further clarity, we have attached the detailed FLOP values copied from Figure 8 below for your reference in tabular form:
| Model Size | Train FLOPs | Inference FLOPs | Total FLOPs | Avg. downstream Performance |
|---|---|---|---|---|
| 1.7B (Vanilla) | 2.26e20 | - | 2.26e20 | 42.8 |
| 1.7B (w/ ProX) | 1.13e20 | 2.15e19 | 1.35e20 (↓40%) | 42.9 |
Comparison with existing data selection methods: We also provide a quantified analysis of the computational overhead of different model-based methods. Specifically, we have now included total FLOPs estimated for the methods compared in Table 3, referencing data from the MATES paper, as well as FLOPs calculated for our own approach, ProX. Note that the total FLOPs include both the training FLOPs for the base models and the inference FLOPs for enhancing data quality.
Please also see the following table for reference:
| Method | # TOTAL FLOPs * 1e19 | 0-shot | 2-shot | # Win |
|---|---|---|---|---|
| Model Architecture: Pythia-410M | ||||
| Random | 6.4 | 42.7 | 43.8 | 0 / 8 |
| DSIR (Neurips'23) | 6.4 | 42.5 | 43.7 | 1 / 8 |
| DsDm (ICML'24) | 10.7 | 43.4 | 44.1 | 0 / 8 |
| QuRating (ICML'24) | 26.4 | 43.5 | 44.6 | 0 / 8 |
| MATES (Neurips'24) | 8.1 | 44 | 45 | 0 / 8 |
| ProX (ours) | 13.2 | 46.2 | 47.5 | 7 / 8 |
| Model Architecture: Pythia-1B | ||||
| Random | 17.7 | 44.7 | 45.4 | 0 / 8 |
| MATES (Neurips'24) | 20 | 45.8 | 46.4 | 1 / 8 |
| ProX (ours) | 21.9 | 46.8 | 48.4 | 7 / 8 |
From this comparison, ProX shows slightly higher overall FLOPs (train FLOPs + extra FLOPs) than MATES, yet is much more efficient than QuRating, while consistently achieving much higher downstream performance (+2.5% over MATES, +2.9% over QuRating for the 1B experiments).
Additionally, as the base model size scales up (from Pythia-410M to Pythia-1B), the proportion of FLOPs dedicated to pre-training becomes more dominant, while the relative overhead for data quality enhancement via ProX diminishes (as we analyzed in Section 4.2 and illustrated in Figure 8). This advantage is largely due to our very small model design, where the inference cost is much lower compared to the actual training cost of larger models (e.g., 1.3B models used in QuRating).
Therefore, based on these findings and analysis, we believe that applying ProX to corpus refining for larger model pre-training is promising, offering superior data quality improvements with manageable computational costs compared to existing methods.
Thank you for your valuable feedback! We have carefully responded to your concerns by providing a detailed analysis of compute overhead, clarifying experimental results, and updating the manuscript. As the ICLR public discussion phase will be ending in a few days, we want to check if these responses have addressed your concerns.
As for the overall reception, Reviewer 4aQX believes our paper is strong and should be accepted; Reviewer 1aQb has also mentioned that their concerns have been resolved. We hope our responses have similarly addressed your questions and would greatly appreciate it if you could take a moment to review and finalize your assessment and rating of our paper.
Thank you again for your time and thoughtful insights!
Dear Reviewer TCQo,
Thank you once again for your time and effort in reviewing our submission. We sincerely appreciate your valuable feedback and have worked diligently to address all your concerns, including clarifying our experiments, developing a compute overhead analysis, and revising the manuscript accordingly.
To address your main concerns, especially regarding ProX's efficiency and effectiveness, we would like to draw your attention to the following results and updates:
- FLOPs comparison under similar downstream performance: ProX saves more FLOPs when the base model size grows, up to 40% overall FLOPs saving.
- Updated comparison with other data selection methods: ProX shows only slightly higher overhead than MATES, but yields a >2% performance boost in the 0-shot/2-shot evaluation results.
- Updated results on Fineweb-Edu: When combined with Fineweb-Edu, ProX further improves performance by approximately 1.0% over this highly refined dataset, using a 1.7B model with 50B tokens of training. This demonstrates ProX's remarkable effectiveness, highlighting its ability to enhance even already optimized datasets.
- Updated 5-shot evaluation results: ProX shows very consistent improvement on downstream tasks (>2%).
As the rebuttal period has been extended by a week, we humbly and kindly request your attention to our response. Your assessment is critical to the improvement of our work, and we deeply value your insights. Moreover, we remain fully committed to providing any clarifications or additional analyses you might find necessary to help re-evaluate our paper. Your understanding and guidance would mean a lot to us, and we remain hopeful for your response during this extended period.
Best Regards,
The Authors
Dear Reviewer TCQo,
We sincerely appreciate all the valuable feedback provided and wish you a Happy Thanksgiving!
As we have already provided detailed responses to address the concerns raised, we kindly ask if there are any remaining issues or questions that require further clarification during this period.
Moreover, we would greatly appreciate it if you could re-evaluate the overall score, soundness, and other aspects of our work based on the responses we have submitted.
Best Regards,
The Authors
Thank you for your detailed response. I respect your hard work and dedication.
Due to recent commitments, I delayed discussing with the authors and was unable to provide a timely rebuttal, for which I sincerely apologize.
As an expert in this field with considerable experience conducting mature experiments, I find it hard to agree that the method presented in this paper holds practical feasibility and value in real-world production.
Here is a summary of the method described in the paper (please correct me if I'm inaccurate):
The paper initially segments a document into chunks, then uses an LLM as an agent to query for the necessary preprocessing functions for the current document/chunk (Step 1). Next, the document/chunk is processed using the preprocessed Python function (Step 2), and finally, all refined documents undergo pretraining.
Based on this, I have the following concerns and questions:
- Why not directly use the preprocessed Python function to process these documents? This is the standard practice for pre-training data processing in all major companies, proven to be sufficiently effective time and again, unless new preprocessing rules or quality signal-based processing are introduced.
- how do the document processing functions described by the authors differ from existing pre-training technologies such as C4, Gopher, RedPajama, dolma, and fineweb, which already cover the discarding or retaining of documents /chunks, and the normalization of chunks (such as top menus, navigation bars, buttons, HTML elements, links, and footers)?
- Why is ProX faster than directly using Python functions for preprocessing? I suspect that the reported training FLOPs in the rebuttal only reflect the speed of querying documents/chunks with ProX (Step 1), whereas processing the entire data flow involves the combined time overhead of both Step 1 and Step 2.
- In the comparison experiments on larger models and corpora, did the authors use the same base model, differing only in the ProX method and competing strategies for dataset selection?
Given these unresolved concerns, I have decided to retain my current rating and skepticism.
5. Question about the experimental configuration of ProX's experiments on larger models and corpora.
In the comparison experiments on larger models and corpora, did the authors use the same base model, differing only in the ProX method and competing strategies for dataset selection?
Thank you for this additional question. Yes, we ensured a fair comparison in these experiments: the only difference lies in whether or not ProX was applied, while the base model parameters and pretraining computational budget were kept consistent.
As shown in our from-scratch experiments above, all models were 1.7B in size, trained on the same dataset (over 50B tokens), with ProX consistently delivering improvements across corpora of varying quality.
Additionally, we compared our approach with advanced models of similar size, such as TinyLlama-1.1B-3T, InstructionLM-1.3B, and Cosmo-1.8B. These models utilized significantly more computational resources for large-scale data synthesis and/or training. Evaluations on the same benchmarks show that our method achieves comparable performance with substantially lower computational cost, highlighting the efficiency and effectiveness of ProX.
If you have any specific inquiries or require additional information, please do not hesitate to share them. We will always try our best to address them, and help you re-evaluate our work!
Best Regards,
The Authors.
Dear Reviewer TCQo,
We would like to further explain several concerns in detail.
how do the document processing functions described by the authors differ from existing pre-training technologies such as C4, Gopher, RedPajama, dolma, and fineweb, which already cover the discarding or retaining of documents /chunks, and the normalization of chunks (such as top menus, navigation bars, buttons, HTML elements, links, and footers)?
In fact, we would like to emphasize that the "preprocessing functions" you mention are all manually designed and rely on character matching or regular expressions to achieve the "retaining of documents/chunks and the normalization of chunks" you describe. They are therefore fixed and static, incapable of adapting to the specific characteristics of a given document.
ProX, on the other hand, provides a higher-level framework that allows the model to dynamically decide which lines, which problematic content within certain lines, or which specific strings need to be cleaned. This approach is entirely dynamic and flexible, enabling automatic, case-by-case refinement of each document rather than relying on a set of fixed functions. We believe there may be some misunderstanding here, and we would like to stress this point.
Lastly, we want to emphasize that ProX's dynamically generated function calls build on top of the established pre-training technologies you mention, achieving significant further improvements.
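To make the static-versus-dynamic contrast concrete, here is a minimal sketch (our illustration, not code from the paper): a fixed rule set treats every document identically, while a ProX-style pipeline asks a small refining model for a document-specific program and then executes it. The `refiner` and `execute` callables are hypothetical placeholders for the refining model and the program interpreter.

```python
# Illustrative sketch only: contrasts fixed heuristic rules with
# per-document programs emitted by a small refining model.

FIXED_RULES = [
    lambda doc: len(doc.split()) >= 50,            # static length threshold
    lambda doc: "lorem ipsum" not in doc.lower(),  # static blacklist match
]

def static_pipeline(docs):
    """Every document is judged by the same predefined rules."""
    return [d for d in docs if all(rule(d) for rule in FIXED_RULES)]

def prox_style_pipeline(docs, refiner, execute):
    """Each document gets its own model-generated cleaning program.

    `refiner(doc)` stands in for the small LM that emits a program string
    (e.g. 'drop_doc()' or 'remove_lines(0, 3)'); `execute(doc, program)`
    stands in for the interpreter of that program and returns None when
    the document should be dropped.
    """
    refined = []
    for doc in docs:
        program = refiner(doc)           # tailored to this specific document
        result = execute(doc, program)
        if result is not None:
            refined.append(result)
    return refined
```

The key point is that the program itself, rather than a fixed yes/no rule, is generated per document, so the same framework covers document filtering, line removal, and string normalization.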
Due to recent commitments, I delayed discussing with the authors and was unable to provide a timely rebuttal, for which I sincerely apologize.
We understand this. However, in case you have not yet had the bandwidth to review the large-scale experimental results in the paper, we would like to share the improvements ProX achieved on all the datasets you mentioned, such as C4, RedPajama, FineWeb, and FineWeb-Edu, which have already gone through the necessary preprocessing functions.
In fact, the results clearly demonstrate how ProX, with its model-based automated refining programs, can further enhance corpus quality beyond existing techniques. We do not deny the contributions of preprocessing functions; rather, we offer a more advanced approach to refinement, and we hope you can appreciate this perspective.
| Datasets | 1.7B model on >50B tokens |
|---|---|
| RedPajama | 46.0 |
| RedPajama w/ ProX | 48.0 (+2.0) |
| C4 | 45.5 |
| C4 w/ ProX | 48.4 (+2.9) |
| FineWeb | 47.4 |
| FineWeb w/ ProX | 49.8 (+2.4) |
| FineWeb-Edu | 50.1 |
| FineWeb-Edu w/ ProX | 51.0 (+0.9) |
We hope these clarifications provide you with a clearer understanding and alignment regarding ProX.
3. Baseline experiments
Why not directly use the preprocessed Python function to process these documents?
We believe this point has already been addressed in our earlier response (specifically, our response to W1.b). We would like to draw your attention to the fact that we conducted exactly the comparative experiments you mention in the very first version of our submission.
In fact, the methods you refer to are precisely the primary baselines compared in our paper! We kindly ask you to revisit and , which present the results of the first set of experiments in the paper. To thoroughly demonstrate the effectiveness of ProX, we included several rule-based baselines (preprocessing Python functions), such as those used in industry-standard pipelines like C4, Gopher, and FineWeb (which you acknowledged in your review). The results clearly show that ProX significantly outperforms all of these baselines.
For more details on these rules (Python functions), here is a breakdown of the specific preprocessing rules we compared against:
- Document-level quality heuristics, including minimum and maximum text length, the number of bullet points, and other features (implemented as Python functions), were used to filter documents.
- Line- and string-level cleaning rules were applied, such as removing citations, placeholder strings like "lorem ipsum," JavaScript content, and bracketed text.
- Repetition-based quality signals, such as the ratio of duplicated characters or the proportion of short lines in a document, were also implemented.
We believe these comparisons demonstrate that ProX outperforms traditional preprocessing Python functions. The key difference lies in ProX's ability to dynamically generate a tailored cleaning program for each document: unlike static preprocessing functions, ProX adapts its operations to each document through model-generated function calls rather than relying on rigid, predefined rules. We hope this explanation draws your attention to the results of our first set of experiments and clarifies that ProX has been thoroughly compared with, and shown to improve upon, these baseline methods.
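For illustration, the following is a simplified sketch of what such static, rule-based filters typically look like (our own simplification in the spirit of the C4/Gopher/FineWeb heuristics; the threshold values are placeholders, not the exact settings used as baselines in the paper):

```python
def passes_rule_filters(doc: str,
                        min_words: int = 50,
                        max_words: int = 100_000,
                        max_bullet_ratio: float = 0.9,
                        max_dup_char_ratio: float = 0.2,
                        max_short_line_ratio: float = 0.67) -> bool:
    """Fixed heuristics applied identically to every document (illustrative thresholds)."""
    words = doc.split()
    lines = [ln for ln in doc.split("\n") if ln.strip()]
    if not words or not lines:
        return False
    # Document length bounds.
    if not (min_words <= len(words) <= max_words):
        return False
    # Fraction of lines that look like bullet points.
    bullets = sum(ln.lstrip().startswith(("-", "*", "\u2022")) for ln in lines)
    if bullets / len(lines) > max_bullet_ratio:
        return False
    # Character duplication: share of characters sitting in repeated lines.
    seen, dup_chars = set(), 0
    for ln in lines:
        if ln in seen:
            dup_chars += len(ln)
        seen.add(ln)
    if dup_chars / len(doc) > max_dup_char_ratio:
        return False
    # Proportion of very short lines (menus, buttons, navigation text).
    short = sum(len(ln.split()) <= 3 for ln in lines)
    if short / len(lines) > max_short_line_ratio:
        return False
    # Simple blacklist of boilerplate strings.
    return "lorem ipsum" not in doc.lower()
```

Every threshold here is global: the same numbers are applied to a news article, a forum thread, and a code snippet alike, which is exactly the rigidity that per-document programs are meant to remove.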
4. Question about ProX's efficiency
Why is ProX faster than directly using Python functions for preprocessing? I suspect that the training FLOPs reported in the rebuttal only reflect the cost of querying documents/chunks with ProX (Step 1), whereas processing the entire data flow involves the combined time overhead of both Step 1 and Step 2.
We believe there may have been a misunderstanding. It is important to clarify that at no point, in any version of our submission, did we claim that ProX is faster than traditional preprocessing methods.
Instead, our focus has always been on emphasizing that the data refined by ProX enables the base model to achieve better performance with less training computation. Even when discussing FLOPs, we highlight that while ProX does incur some additional cost for generating programs with language models, this overhead is small and yields significantly better outcomes for the trained model under the same training budget. This is clearly reflected in the training dynamics curves presented in and .
Here is your original comment from your earlier review of our submission:
Firstly, while the goal of the paper is to balance data processing efficiency and enhance data quality, the introduction of Prox as a proxy to call Python functions might add extra computational cost, potentially undermining the practicality of this approach. A comparative analysis of time overhead with other methods would strengthen the argument for the effectiveness of this approach.
To clarify why ProX achieves a balance between data-processing efficiency and quality enhancement, we have compared ProX with other LM-based data selection methods in terms of FLOPs. Higher FLOPs generally correlate with greater computational cost, and our results demonstrate that ProX achieves better model performance with comparable or lower FLOPs (please refer to our earlier response or the updated analysis in our paper).
Additionally, given your concern about processing time, we revisited the execution time of Step 2 (executing the generated preprocessing functions). On average, executing ProX programs takes approximately 0.004 seconds per document. Leveraging multi-node and multi-process parallelism, the total processing time for refining the RedPajama dataset (62.5B tokens) was about 500 seconds. This is negligible compared to the overall cost of data preparation and model training.
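As a rough sanity check on these numbers (the per-document token count and the worker count below are our assumptions, not figures reported in the paper): at roughly 500 tokens per document, 62.5B tokens correspond to about 125M documents; at 0.004 s each, that is about 500,000 CPU-seconds, which drops to roughly 500 seconds of wall-clock time when spread over on the order of a thousand worker processes. A minimal sketch of such an embarrassingly parallel Step 2 could look like this (`refine_one` is a stand-in, not the actual executor):

```python
from multiprocessing import Pool

def refine_one(doc_and_program):
    """Stand-in for executing one generated program on one document
    (measured above at roughly 0.004 s per document on average)."""
    doc, program = doc_and_program
    # ... interpret and apply the generated program here ...
    return None if program.strip() == "drop_doc()" else doc

def refine_corpus(pairs, workers=64):
    """Documents are independent, so wall-clock time shrinks roughly
    linearly with the number of worker processes (and nodes)."""
    with Pool(processes=workers) as pool:
        results = pool.map(refine_one, pairs, chunksize=1024)
    return [doc for doc in results if doc is not None]

if __name__ == "__main__":
    demo = [("keep me", ""), ("drop me", "drop_doc()")]
    print(refine_corpus(demo, workers=2))  # -> ['keep me']
```

Because each document is processed independently, the wall-clock cost of Step 2 scales down almost linearly with the number of processes and nodes.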
Dear Reviewer TCQo,
We really appreciate you getting back to us. We understand that this is a busy period for everyone and you may not have had much time to review and respond to our revised paper rebuttal due to other commitments, and we greatly respect your expertise.
However, we believe your concerns may stem from several misunderstandings or misalignments in interpretation. We will clarify all your concerns again, and eagerly look forward to your response and reassessment in the remaining period.
1. Question about ProX's practical feasibility
As an expert in this field with considerable experience conducting mature experiments, I find it hard to agree that the method presented in this paper holds practical feasibility and value in real-world production.
We respectfully disagree with the notion that ProX, as a model-based processing approach, lacks value.
- ProX represents a trend toward model-based data processing aimed at improving pre-training data quality. This approach aligns with prior efforts in the field, such as model-based data filtering (recently widely explored in industrial work such as Llama-3, Qwen, and FineWeb) and other model-based data selection methods (e.g., QuRating [ICML '24], MATES [NeurIPS '24], DsDm [ICML '24]), demonstrating its practicality and effectiveness.
- ProX utilizes a much smaller model for processing the corpus, specifically a 300M model rather than a 3B or larger model. This efficiency makes it comparable to a BERT-large-level model in terms of computational cost, while still achieving significant results in obtaining verifiably high-quality text.
- In downstream experiments, ProX has demonstrated substantial performance improvements. Beyond the 750M model mentioned, the benefits extend to larger models, such as the 1.7B and 7B models. Both Reviewer 4aQX and Reviewer 1aQb have recognized the value of these improvements.
Reviewer 4aQX:
The method shows significant gains in accuracy and efficiency for domain specific continual pretraining, with up to a 20x reduction in compute
Reviewer 1aQb:
By leveraging small language models (0.3B parameters) for data refinement, ProX achieves substantial performance improvements with lower computational costs
The framework offers a scalable solution to pre-training data refinement, which can be particularly valuable in scenarios where human expert intervention is impractical.
Given these experimental results—across varying model sizes, domains, and corpora—we believe not only in the potential of ProX but also in the broader promise of model-based processing methods for acquiring high-quality corpora. From a data quality perspective, employing a 0.3B model for data cleaning and quality enhancement is justifiable, as high-quality data significantly contributes to improved model performance.
2. Overview of ProX pipeline.
The paper first segments a document into chunks, then uses an LLM as an agent to query for the necessary preprocessing functions for the current document/chunk (Step 1).
Respectfully, we would like to point out:
- We primarily use a 0.3B, very small language model, rather than a large language model (LLM). The process begins with document-level refining, where the model determines whether to retain a document. For retained documents, we then divide them into chunks and perform chunk-level normalization (e.g., deleting or replacing specific strings).
- To further clarify, this small language model directly generates the function calls, including the necessary input parameters such as line numbers, specific patterns to be removed, or normalized patterns to replace them. Notably, we do not rely on invoking a fixed function. We discuss this design comprehensively in ; please kindly check.
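To make this concrete, here is a hypothetical sketch of how such model-generated function calls could be executed (the function names `drop_doc`, `remove_lines`, and `normalize`, and the string-based grammar, are our illustration rather than the exact interface used in the paper):

```python
import re

def execute_program(doc: str, program: str):
    """Interpret a tiny function-call grammar emitted by the refining model.
    Returns the cleaned document, or None if the document should be dropped."""
    lines = doc.split("\n")
    for call in program.splitlines():
        call = call.strip()
        if call == "drop_doc()":            # document-level decision
            return None
        m = re.fullmatch(r"remove_lines\((\d+),\s*(\d+)\)", call)
        if m:                               # line-level deletion
            i, j = int(m.group(1)), int(m.group(2))
            lines = lines[:i] + lines[j:]
            continue
        m = re.fullmatch(r'normalize\("(.*)",\s*"(.*)"\)', call)
        if m:                               # string-level normalization
            old, new = m.group(1), m.group(2)
            lines = [ln.replace(old, new) for ln in lines]
    return "\n".join(lines)

# Example: a program generated for one specific noisy document.
doc = "Home | About | Subscribe\nDeep learning scales with data.\nFollow us on Twitter!"
program = 'remove_lines(0, 1)\nnormalize("Follow us on Twitter!", "")'
print(execute_program(doc, program))  # nav-bar line deleted, social-media boilerplate blanked out
```

Because the arguments (line indices, target strings, replacements) are generated per document, the same few primitives can express very different cleaning behavior from one page to the next.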
We sincerely thank all the reviewers for their insightful comments and constructive feedback.
We are delighted that the reviewers recognized the significance of our results and findings, as well as appreciated the extensive experiments we conducted (Reviewer TCQo, 4aQX, 1aQb), the detailed insights and clear presentation of our work (Reviewer TCQo, 4aQX, 1aQb), the non-trivial technical contributions of ProX to the community (Reviewer TCQo, 1aQb, L4Dm), and the notable performance and efficiency improvements ProX achieved in pre-training and continual pre-training (Reviewer TCQo, 4aQX, 1aQb). This positive feedback is incredibly encouraging for us.
We have carefully considered the concerns and questions raised by each reviewer and have provided detailed responses to each one separately. Furthermore, based on the valuable feedback, we have made revisions to the paper, including the following updates (all updates are highlighted in purple for clarity):
- Included a total FLOPs analysis in to demonstrate ProX's acceptable extra FLOPs compared with QuRating and MATES. (Reviewer TCQo)
- Included FastText-classifier-based filtering as one of our baselines in , , and . While it serves as a strong baseline, it still lags behind the ProX methods. (Reviewer 4aQX)
- Added experimental results in by applying ProX to an additional, highly refined, top-tier-quality pre-training corpus, FineWeb-Edu, to further showcase the effectiveness of ProX. (Reviewer 1aQb)
- Added few-shot evaluation results in () to confirm that models trained on ProX-curated data exhibit even stronger performance on downstream language tasks. (Reviewer TCQo and Reviewer 1aQb)
- Conducted error analysis in and , which highlights the robustness of the refining models while also identifying areas for future improvement. (Reviewer 4aQX)
Once again, we would like to thank all the reviewers for their time and efforts in helping us improve the paper! If you need any further explanations, please do not hesitate to let us know.
Dear Reviewers,
Thank you once again for your valuable comments and suggestions, which are really helpful! We have carefully responded to each of your comments and provided additional experimental results to further clarify our points.
We understand that this is a particularly busy time, and we deeply appreciate it if you could take a moment to review our responses and let us know if they adequately address your concerns. Should there be any additional comments, we will do our best to address them.
Best regards,
The Authors
We sincerely thank you for recognizing the importance of the problem addressed by ProX. As Reviewer TCQo mentioned, "the field this paper focuses on is crucial to the LLM community." Several reviewers appreciated the extensive experiments, detailed methodology, and significant results presented in ProX:
- Reviewer TCQo: "The experiments are rich in content, especially the appendices."
- Reviewer 4aQX: "The method shows significant gains in accuracy and efficiency for domain-specific continual pretraining, with up to a 20x reduction in compute." and "The empirical validation is comprehensive."
- Reviewer 1aQb: "Significant performance enhancements demonstrated consistently. The results show that even smaller refining models can produce high-quality pre-training data."
We deeply appreciate your thoughtful comments and constructive feedback as the discussion period concludes. Thank you for your time and effort!
While we respectfully disagree with Reviewer TCQo's perspective that "ProX may lack practical feasibility and value in real-world production" (we believe this may stem from some misalignment, which we have addressed in detail), this also highlights a critical factor: the need to emphasize how fundamentally different our approach is from static heuristic rules. Specifically, our method leverages models to automatically and dynamically generate customized programs for each document, setting it apart from the simpler baselines that TCQo is concerned about. Moreover, Reviewer 1aQb explicitly stated in their review that "The framework offers a scalable solution to pre-training data refinement, which can be particularly valuable in scenarios where human expert intervention is impractical."
We believe we have thoroughly clarified and addressed Reviewer TCQo's concerns in our responses.
Your feedback has also provided invaluable insights for improving our work:
- Thanks to Reviewer TCQo, we recognize the necessity of conducting a more explicit analysis of computational overhead. Beyond the extensive analysis already provided in of the paper, we have prominently included FLOPs comparisons in for a head-to-head comparison with existing data selection methods.
- We acknowledge the need for more comprehensive comparisons. In response, we introduced a stronger rule-based FastText filter baseline, and ProX still outperforms this baseline. (Reviewer 4aQX)
- We applied ProX to further optimize a state-of-the-art, highly refined dataset, achieving significant improvements even at the 0.7B and 1.7B scales. This reinforces our belief in ProX's effectiveness for enhancing data quality. (Reviewer 1aQb)
- The discussion of the data case study was another valuable suggestion. We believe it further enhances our presentation by illustrating how ProX works, how it compares to other methods, and potential avenues for future exploration. (Reviewer 1aQb and Reviewer L4Dm)
To conclude, these suggestions will significantly strengthen our paper, and we are excited to incorporate them into the revised version. Thank you again for your detailed and constructive feedback!
This paper tackles the vital issue of enhancing pre-training data quality in the large language model (LLM) community. The authors introduce ProX, a small model with approximately 0.3 billion parameters, fine-tuned on (document, program) pairs annotated by Llama. ProX generates Python function calls for document-level and chunk-level refining to determine which data to retain or discard.
Reviewers have raised concerns about the feasibility of applying ProX in real-world scenarios due to its computational cost. To strengthen their argument, the authors should provide concrete evidence of ProX's effectiveness compared to other LLM corpus filters, such as DCLM, which is known for its superior performance. This could involve comparing models trained on datasets filtered by ProX against models trained on the same corpus as outlined in the DCLM paper.
Additional comments from the reviewer discussion
The authors conduct extensive experiments to address the reviewers' concerns; however, two reviewers remain critical of the paper due to the aforementioned weaknesses. I believe the manuscript would benefit significantly if the authors addressed these issues.
Reject