PaperHub
Overall rating: 4.5/10 | Decision: Rejected | 4 reviewers
Individual ratings: 1, 3, 6, 8 (min 1, max 8, std 2.7)
Confidence: 3.8 | Correctness: 2.5 | Contribution: 2.8 | Presentation: 3.3
ICLR 2025

ZIP-FIT: Embedding-Free Data Selection via Compression-Based Alignment

OpenReview | PDF
Submitted: 2024-09-27 | Updated: 2025-02-05
TL;DR

use gzip to select optimal data for code and autoformalization

Abstract

Keywords

data centric machine learning, autoformalization, large language models, reasoning

Reviews and Discussion

Official Review
Rating: 1

This paper introduces an innovative, embedding-free data selection method for efficient fine-tuning of large language models. Drawing inspiration from gzip compression techniques, the authors propose utilizing Normalized Compression Distance as a metric to filter and prune fine-tuning datasets. The authors conduct a comparative analysis with prior embedding-free methods, originally designed for filtering pre-training datasets, on Autoformalization and Python coding tasks.

Strengths

(1) Problem Significance: The authors tackle a crucial problem in low-resource settings, addressing the challenge of fine-tuning data selection without relying on GPU-intensive, embedding-based methods. This is a highly relevant and impactful research direction.

(2) Innovative Filtering Criterion: The authors' inspiration from gzip compression methods has led to the proposal of a novel and intriguing selection criterion. This approach is not only interesting but also demonstrates out-of-the-box thinking, making it a notable contribution to the field.

Weaknesses

(1) Inadequate Baselines: The authors propose a data selection method for model alignment, but only compare it with prior works such as DSIR and D4, which were primarily designed for data selection during the pre-training phase. A more comprehensive literature review on data pruning methods for model alignment is lacking, including embedding-based methods [1], LLM response metrics [2], gradient-based metrics [3], quality metrics judged by LLMs [4], and inference loss on evaluation sets [5].

(2) Evaluation Metrics: The authors primarily use test-data cross-entropy loss as the evaluation metric; the results are thus not surprising, given that the data selection method uses the test data to anchor the selection criteria. However, the authors do not compare their results with widely accepted metrics in the research community for the studied downstream tasks, such as:

(a). Autoformalization: proof success rates on miniF2F [6,7]

(b). Python coding: functionality pass rates (pass@k on HumanEval) based on unit-tests [8,9]

(3) Clarifications on Motivation: In Section 2.3, the authors argue that n-grams fail to capture syntactic or structural relationships within the data, while hypothesizing that gzip does. However, this hypothesis is not supported by theoretical or empirical evidence, weakening the motivation for the proposed approach. It is also not examined whether the proposed approach is better or worse than high-resource methods, such as embedding-based methods.

References:

[1] DEFT-UCS: Data Efficient Fine-Tuning for Pre-Trained Language Models via Unsupervised Core-Set Selection

[2] From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning

[3] LESS: Selecting Influential Data for Targeted Instruction Tuning

[4] Alpagasus: Training a better alpaca with fewer data

[5] Instruction Mining: Instruction Data Selection for Tuning Large Language Models

[6] Autoformalization with Large Language Models

[7] LEGO-Prover: Neural Theorem Proving with Growing Libraries

[8] Evaluating Large Language Models Trained on Code

[9] Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

Questions

(1) Could the authors provide additional evidence to support the claim that gzip is effective in capturing syntactic and structural relationships in textual sequences?

(2) Would the authors be able to demonstrate the effectiveness of their approach using evaluation metrics beyond cross-entropy test loss, and compare it to relevant baselines, such as those mentioned earlier?

(3) Could you provide more insight into why D4 was excluded from the code generation experiments, and specifically how it affected model performance?

Details of Ethics Concerns

Dear AC/SAC/PC,

The authors recently replied to my review, characterizing it as ad hominem attacks. I do clarify here that my review is based on the authors' work and not directed at the authors personally (I do not know the authors' identities and did not eagerly seek them out).

Regarding the questionable evaluation benchmark, I do believe that it is due to possible errors in the authors' evaluation pipeline. I stand in line with the published Gemma results (Google), where [1] states 20.1% and [2] states 17.7%. Such deviations (<3%) are normally considered minor differences in evaluation setups. However, the authors' reported 6.1% clearly cannot be reconciled with these figures and falls well outside what I consider normal. This leads to my doubt about whether experiments and evaluations were properly conducted. If the authors cannot reproduce the results of prior work, I believe the best mitigation would be to reach out to the corresponding contacts from [1,2] and possibly use another model for which such results can be reproduced.

Based on my reasons, and the fact that in their recent discussions the authors stated that "You've impugned our research integrity. These ad hominem attacks are not conducive to better research.", I kindly request that the ethics review board be involved in this paper's review process.

Best, Reviewer CtCN

[1] Gemma 2: Improving Open Language Models at a Practical Size

[2] https://ai.google.dev/gemma/docs/model_card_2

Comment

We thank the reviewer for their detailed feedback. CtcN notes that our work tackles "a crucial problem in low-resource settings" and presents an "innovative filtering criterion" that demonstrates "out-of-the-box thinking".

For calibration purposes, we’d like to note that the ICLR 2025 rubric differs slightly from previous similar conferences. For example:

  • To indicate "Accept", the NeurIPS 2024 rubric says to use 7 whereas the ICLR 2025 rubric says to use 8
  • To indicate "Strong Accept", the NeurIPS 2024 rubric says to use 9 whereas the ICLR 2025 rubric says to use 10

We answer specific questions below:

Would the authors be able to demonstrate the effectiveness of their approach using evaluation metrics beyond cross-entropy test loss?

During the rebuttal period, we moved beyond cross-entropy loss to evaluate ZIP-FIT using Pass@1 on the HumanEval benchmark using Gemma2-2B:

Data Selection Method | Pass@1 (%)
Pre-trained Gemma2-2B, 4-bit quantized | 6.09
ZIP-FIT (LZ4) | 12.19
ZIP-FIT (gzip) | 11.58
DSIR | 9.14
D4 | 6.09

ZIP-FIT improves Pass@1 scores on HumanEval over baselines:

  1. ZIP-FIT with LZ4 achieves 12.19% Pass@1 on HumanEval, outperforming both DSIR (9.14%) and D4 (6.09%) baselines
  2. With gzip, ZIP-FIT reaches 11.58%, still maintaining superior performance
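
For readers unfamiliar with the metric, pass@k is conventionally computed with the unbiased estimator from the HumanEval paper ([8] in the review above): with n generated samples per problem, of which c pass the unit tests,

$$\text{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right],$$

so pass@1 is simply the average per-problem success rate (this is the standard definition; the exact estimator behind the table above is not spelled out in this thread).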

Another reviewer suggested that we explore additional compression algorithms beyond gzip. In case these results might interest you, we found that at minimum compression levels, both gzip and LZ4 achieve the strongest Pass@1 scores (11.58% and 12.19%), significantly outperforming the base model (6.09%, dashed line). Performance systematically degrades with increased compression across all algorithms, suggesting that aggressive compression removes valuable alignment signals. Detailed results can be found in Figure 8 in our Appendix.

Could the authors provide additional evidence to support the claim that gzip is effective in capturing syntactic and structural relationships?

Thank you for this question. We have removed this theoretical claim from our paper to focus on empirical results. What we can demonstrate empirically is that compression-based alignment correlates strongly with model performance (R² = 0.90, Figure 3) and leads to significant improvements in downstream tasks (12.19% Pass@1 on HumanEval vs 6.09% baseline).

The theoretical connection between compression algorithms and structural relationships in data is an interesting open question. While our results show that compression-based selection works well in practice, developing a formal theory of why certain compression algorithms are more effective for specific tasks remains valuable future work.

Could you provide more insight into why D4 was excluded from the code generation experiments?

We apologize for any confusion in our presentation. Figure 5 (now Figure 2) includes our D4 results. Additionally, as shown in Table 1, we evaluate D4 on code generation (achieving 6.09% Pass@1 on HumanEval, no improvement over the pretrained model). Our revision better reflects these comparisons.

Regarding the suggested baselines ([1]-[5]): We are currently running comparisons with newer baselines (LESS, SHED) and will update our rebuttal with these results as soon as we have them.

Comment
  1. Regarding pretraining: I understand the resource limitations of compute, thus I am not requesting a focus on pre-training but rather offering a suggestion.

  2. Regarding metrics: Your main paper still only demonstrates results with test loss and not downstream benchmark results.

  3. Gemma2-2B performance: [7] clearly states that the released model achieves 20.1%. I do not understand how your baseline results are so poor. I also do not understand how your finetuning method (quantized or full precision, LoRA or full parameter, which is not disclosed in the paper) could affect the evaluation scores of the released pretrained baseline model. Either your evaluation setup is wrong, or the "None (Pre-trained Gemma2-2B)" you reported refers to a randomly initialized model? Clearly both are unacceptable. Also, the reported results (even 12.19%) are far from the SOTA research landscape of similar models.

  4. "Resource expensive" baselines: Since finetuning usually requires fewer training resources and places higher emphasis on data quality, I do firmly believe that comparing with prior relevant works is a must.

  5. Generalization: Since the proposed method does not seem to be domain-specific, I suggest that the method also be demonstrated on general conversational datasets, evaluated against metrics such as instruction following and MTBench.

The authors' responses (especially in regards to Gemma2-2B performance) to my questions have raised my doubts about whether experiments and evaluations were properly conducted, and whether the authors are even familiar with relevant works on LLMs for Coding or Autoformalization (or even SOTA LLM research) at all. I am thus downgrading my score.

Comment

Thank you for highlighting these concerns. It's clear there has been a breakdown in communication regarding our experimental setup, particularly with Gemma2-2B's evaluation.

To clarify:

  1. Our use of quantized LoRA via unsloth was only for the rebuttal experiments, not in the original paper.
  2. We acknowledge this created confusion by introducing a new variable that wasn't properly explained.
  3. We should have explicitly compared against the published baseline (20.1% Pass@1) from [7].

We are immediately:

  1. Running evaluations with the standard pre-trained Gemma2-2B to validate against published results.
  2. Documenting our evaluation pipeline in detail.
  3. Will post an update with these results as soon as they are available.

We appreciate your patience as we work to provide accurate and transparent comparisons.

Comment

Thank you for your response. I appreciate the effort spent on trying to address my concerns regarding the soundness of your work. I do think that the authors are tackling an important problem, and I appreciate the simplicity of the proposed approach. However, given the current state of the submitted manuscript (and the limited time in the rebuttal phase for the authors to make changes), I do not think the current work is ready for publication. I sincerely suggest that the authors improve their work and submit an improved version to future conferences.

Specifically, below are my suggestions:

(1) The authors proposed a low-resource, embedding-free method. Such methods are perhaps more suitable for model pretraining (continued pretraining) on large data corpora, where resources are a concern. Perhaps the alignment phase is not the best showcase of the work's full potential. I suggest possibly focusing on the continued pretraining phase.

(2) I suggest the authors directly focus on downstream benchmark metrics as opposed to test loss. Recent work, even on pretraining (such as Gemma2), reports benchmark performance, such as pass rates for HumanEval and miniF2F scores. These metrics should be reported and used for comparison in the main paper.

(3) If the authors still would like to focus on the alignment stage (such as finetuning), I would suggest comparing with more "resource expensive" methods (such as embedding-based and gradient-based), as I have pointed out in my Official Review. Also, since the proposed method does not seem to be domain-specific, I suggest (similar to Reviewer vFcf) that the method also be demonstrated on general conversational datasets, evaluated against metrics such as instruction following [1] and MTBench [2].

(4) Your reported numbers for Gemma2-2B on pass@1 for HumanEval do not match the reported numbers in [7]. In Table 13 of [7], Gemma2-2B is reported to achieve 20.1% pass@1 on HumanEval. I also suggest demonstrating with stronger and larger code base models (7B/15B), such as [3,4], on better coding fine-tuning datasets [5,6], for example. The current reported numbers are too far from the SOTA research landscape.

Unfortunately, based on the current state of the paper and authors responses, I could not raise my score.

References

[1] Instruction-Following Evaluation for Large Language Models

[2] Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

[3] StarCoder 2 and The Stack v2: The Next Generation

[4] DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

[5] Magicoder: Empowering Code Generation with OSS-Instruct

[6] WizardCoder: Empowering Code Large Language Models with Evol-Instruct

[7] Gemma 2: Improving Open Language Models at a Practical Size

Comment

Thank you for your continued feedback. We would like to address several points:

  1. Regarding the suggestion to focus on pre-training: While we appreciate this suggestion, conducting meaningful pre-training experiments would require massive computational resources that are simply not available to most research teams, including ours. Instead, we deliberately focused on fine-tuning as it provides a controlled experimental setting where we can rigorously validate ZIP-FIT's effectiveness. This choice also aligns with our goal of developing methods that are accessible to researchers with limited compute resources, enabling reproducibility and broader adoption in the community.
  2. Regarding benchmark metrics: As shown in our previous response, we have moved beyond test loss in our evaluation. Our HumanEval Pass@1 results directly demonstrate ZIP-FIT's effectiveness on standard downstream benchmarks, with ZIP-FIT (LZ4) achieving 12.19% compared to the baseline 6.09%.
  3. Regarding the Gemma2-2B performance discrepancy: Our current experiments use quantized LoRA fine-tuning via unsloth for resource efficiency, which explains the lower baseline numbers compared to [7]. We are currently running full fine-tuning experiments with Gemma2-2B to provide direct comparisons with the reported 20.1% Pass@1. Notably, even in our resource-constrained setup, ZIP-FIT doubles the Pass@1 performance, demonstrating its effectiveness at improving model performance regardless of the fine-tuning approach.
  4. Regarding more "resource expensive" baselines: We are currently running comparisons with LESS [3] and will have these results available shortly. Our original focus was on developing a method that could improve model performance while maintaining accessibility to researchers with limited computational resources. ZIP-FIT's strong initial results (doubling Pass@1 on HumanEval) demonstrate that effective data selection is possible without expensive compute. The upcoming comparisons with LESS will provide direct evidence of how our resource-efficient approach compares to more computationally intensive methods.
Comment

None of my concerns have been addressed by the authors. This includes:

  • metrics: the main paper still only demonstrates results with test loss and not downstream benchmark results. Current papers, even on pretraining, no longer use test loss as a reliable method for comparison.
  • adequate comparison with prior work: since finetuning usually requires fewer training resources and places higher emphasis on data quality, I do firmly believe that comparing with prior relevant works is a must.
  • generalization: the proposed method does not seem to be domain-specific; the method should also be demonstrated on non-coding tasks and general conversational datasets, evaluated against metrics such as instruction following and MTBench.

Most importantly, the current draft as well as the discussion contains errors:

  • Table 1 in Appendix D: the authors state that Gemma2-2B achieves 6.1% pass@1 on HumanEval, whereas a number of works [1,2] report figures of 20.1%.

The paper in its current state is unacceptable. Per the authors' suggestion, I further align my score with other ML conferences:

  • NeurIPS: 1: Trivial or wrong
  • ICML: 2: Strong Reject: For instance, a paper with major technical flaws, and/or poor evaluation, limited impact, poor reproducibility and mostly unaddressed ethical considerations.

I do consider that the authors are working on an important problem and propose an interesting solution. However, the paper in its current state has poor evaluation without reliable metrics, inadequate comparison with prior work, and concerns about the generalization of the proposed general-purpose method. The current paper and the authors' responses contain errors; thus I have unaddressed ethical concerns about whether the evaluations were properly conducted.

I maintain my score of strong rejection.

[1] Gemma 2: Improving Open Language Models at a Practical Size

[2] https://evalplus.github.io/leaderboard.html

Comment

You've impugned our research integrity. These ad hominem attacks are not conducive to better research.

Authors state Gemma2-2B to have 6.1% pass@1 on HumanEval, where a number of works [1,2] report figures of 20.1%.

Your belief that our evaluations are incorrect is refuted by this open HuggingFace issue, https://huggingface.co/google/gemma-2b/discussions/53, in which 3 independent parties (not including us) state that they too are unable to obtain the claimed Gemma pass@k score of 20.1%.

Additionally, your second citation (EvalPlus leaderboard) does not contain Gemma2-2B results, making this criticism unfounded.

Comment

Thank you for raising your concerns. I do believe this discussion to be no longer productive. I have updated my review and flagged for ethics review.

Official Review
Rating: 3

This paper proposes ZIP-FIT, an efficient, embedding-free method for selecting high-quality, domain-specific fine-tuning data for language models (LMs). Prior methods often rely on computationally expensive neural embeddings or classifiers to filter aligned datasets, while those based on N-gram similarity may lack the structural depth needed for complex tasks like code generation. In contrast, ZIP-FIT leverages gzip compression to evaluate data alignment with target domains, based on the idea that compression algorithms encode information in a way similar to neural networks. The ZIP-FIT approach eliminates the need for LM forward passes to obtain embeddings, making it efficient and particularly suitable for low-resource environments. Experimental results show that ZIP-FIT outperforms prior data selection methods, such as DSIR and D4, as measured by test loss.

Strengths

  • This paper is well-presented and well-motivated.
  • Studying computation-efficient methods for data selection in LLM instruction fine-tuning is a promising research direction.
  • The proposed ZIP-FIT is intuitive and easy to follow.
  • The proposed approach bypasses the need for LLM forward computation to obtain embeddings, making it computationally efficient.
  • The presented experimental results seem promising.

Weaknesses

  • [Major] The proposed method seems very simple and straightforward; using a gzip-style method to embed data appears to be a relatively standard approach.
  • [Major] All experimental results are based on test loss, which may not be very reliable. It would be essential to conduct evaluations on some standard benchmarks, such as HumanEval and MBPP for code evaluation, to demonstrate the scores the model can achieve.
  • It is unclear how the proposed ZIP-FIT compares to prior, more complex data selection methods in terms of both running speed and final model quality (e.g., [1]), aside from deduplication approaches like D4.
  • [Minor] The paper seems to have been written somewhat in a rush; the figure quality of Figure 2 does not seem to be very high.

[1] https://arxiv.org/abs/2405.00705

Questions

As specified in the "Weaknesses" section:

  • What is the score of the fine-tuned LLM using ZIP-FIT on benchmarks like HumanEval and PubMedQA compared to LLMs fine-tuned without using ZIP-FIT?
  • How does ZIP-FIT compare to prior method like https://arxiv.org/abs/2405.00705 in terms of both running time and final model score?
Comment

We thank the reviewer for their detailed feedback. 39ck notes that the paper is “well-presented and well-motivated” and that our “results seem promising”.

For calibration purposes, we’d like to note that the ICLR 2025 rubric differs slightly from previous similar conferences. For example:

  • To indicate "Accept", the NeurIPS 2024 rubric says to use 7 whereas the ICLR 2025 rubric says to use 8
  • To indicate "Strong Accept", the NeurIPS 2024 rubric says to use 9 whereas the ICLR 2025 rubric says to use 10

We answer specific questions below:

The proposed method seems very simple and straightforward; using a gzip-style method to embed data appears to be a relatively standard approach

We respectfully disagree that simplicity is a limitation. The simplicity of our approach is one of its key strengths and a surprising aspect of our contributions. In machine learning, approaches that achieve strong results with minimal complexity should be highly valued. ZIP-FIT demonstrates this principle by:

  • Requiring zero hyperparameter tuning: Unlike complex embedding or gradient-based methods, ZIP-FIT just works out of the box
  • Minimizing implementation complexity: No need for GPU infrastructure, embedding models, or careful architectural choices
  • Running 65.8% faster than DSIR: Demonstrates that simpler can be more efficient

The machine learning community has repeatedly shown that when simple methods match or exceed complex ones, they should be preferred (e.g., linear probing vs full fine-tuning, kNN-prompt vs prompt tuning). ZIP-FIT follows this principle - delivering strong performance through an approach that any practitioner can implement and understand.
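
As a rough illustration of how little machinery is involved, here is a minimal sketch of NCD-based selection with gzip; the helper names, the min-over-targets aggregation, and the hard top-k cutoff are illustrative assumptions rather than the paper's exact selection rule:

```python
import gzip

def ncd(x: str, y: str) -> float:
    """Standard normalized compression distance, computed here with gzip."""
    cx = len(gzip.compress(x.encode("utf-8")))
    cy = len(gzip.compress(y.encode("utf-8")))
    cxy = len(gzip.compress((x + y).encode("utf-8")))
    return (cxy - min(cx, cy)) / max(cx, cy)

def select_aligned(candidates: list[str], targets: list[str], k: int) -> list[str]:
    """Keep the k candidate examples with the lowest NCD to their closest target example."""
    scored = sorted(candidates, key=lambda c: min(ncd(c, t) for t in targets))
    return scored[:k]
```

No GPU, model forward pass, or hyperparameter search is involved; the only cost is the compression calls themselves.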

What is the score of the fine-tuned LLM using ZIP-FIT on benchmarks like HumanEval and PubMedQA compared to LLMs fine-tuned without using ZIP-FIT?

At your request, we evaluated ZIP-FIT on HumanEval with a 4-bit quantized Gemma2-2B (due to time/compute constraints), comparing various configurations against baseline methods (top 1M tokens):

Data Selection Method | Pass@1 (%)
Pre-trained Gemma2-2B, 4-bit quantized | 6.09
ZIP-FIT (LZ4) | 12.19
ZIP-FIT (gzip) | 11.58
DSIR | 9.14
D4 | 6.09

ZIP-FIT improves Pass@1 scores on HumanEval over baselines:

  1. ZIP-FIT with LZ4 achieves 12.19% Pass@1 on HumanEval, outperforming both DSIR (9.14%) and D4 (6.09%) baselines
  2. With gzip, ZIP-FIT reaches 11.58%, still maintaining superior performance

Another reviewer suggested that we explore additional compression algorithms beyond gzip. In case these results might interest you, we found that at minimum compression levels, both gzip and LZ4 achieve the strongest Pass@1 scores (11.58% and 12.19%, respectively), significantly outperforming the base model (6.09%, dashed line). Performance systematically degrades with increased compression across all algorithms, suggesting that aggressive compression removes valuable alignment signals. Detailed results can be found in Figure 8 in our Appendix.

Regarding PubMedQA, we intentionally focused on code generation and formal mathematics - domains where syntactic structure is crucial and data selection has clear practical impact. In these domains, ZIP-FIT's strong performance (12.19% Pass@1 on HumanEval) demonstrates its effectiveness where it matters most. While exploring effectiveness on more variable data remains important future work, success in structured domains like code generation represents a meaningful practical contribution. We acknowledge that evaluating on biomedical domains would be valuable future work to demonstrate ZIP-FIT's generalizability across different domains.

How does ZIP-FIT compare to prior methods like https://arxiv.org/abs/2405.00705 in terms of both running time and final model score?

We are currently running comparisons with these newer baselines (LESS, SHED) and will update our rebuttal with these results as soon as they are available.

the figure quality of Figure 2 does not seem to be very high

Thank you for pointing this out. Please see our revised submission for higher quality figures.

Comment

Dear Reviewer 39ck,

May we ask if you could respond to our comments? In our response, we've addressed concerns about simplicity and included additional HumanEval evaluations. Please let us know if you have other questions or concerns. Thank you!

Best regards,

The Authors

Comment

Dear Reviewer 39ck,

We sincerely appreciate your thoughtful feedback and have updated our manuscript with several improvements addressing your concerns:

  1. In response to your suggestion about exploring different metrics, we have added Pass@1 results for HumanEval using a 4-bit quantized Gemma2-2B in our Appendix. These results demonstrate that ZIP-FIT's effectiveness is robust across different evaluation metrics, showing significant improvements over the baseline (12.19% Pass@1 vs 6.09% baseline).

  2. We have addressed the figure quality issues you identified, particularly for Figure 2.

  3. Regarding the comparison with SHED [https://arxiv.org/abs/2405.00705]: We invested significant effort (approximately three full-time work days) attempting to implement and evaluate against this baseline. Despite assistance from the SHED authors (whom we sincerely thank), we have not been able to implement their method, as they no longer had access to their original cluster environment. We continue working on this comparison and hope to include these results before the extension deadline.

If these experiments and improvements don't fully address your concerns, we welcome your guidance on additional analyses that would be helpful. We hope you'll consider raising your score in light of these improvements and our ongoing efforts to provide comprehensive comparisons.

Best regards,

The Authors

Comment

Dear Reviewer 39ck,

We greatly appreciate your detailed initial feedback which has helped us significantly improve our work. We wanted to respectfully follow up one final time regarding our previous responses and new results. We believe we have comprehensively addressed all concerns raised in your initial review:

  1. You noted "All experimental results are based on test loss" - We have now conducted extensive HumanEval evaluations, where ZIP-FIT achieves 18.86% Pass@1 with full fine-tuning (vs 15.24% baseline) and 12.19% with QLoRA. Specifically, here are our comprehensive results, where all methods were evaluated under identical training settings using the same number of training tokens (top 1M):

    Fine-tuning | Data Selection Method | Pass@1 (%) | Pass@10 (%) | Selection Time
    None | None: Pre-trained Gemma2-2B | 15.24 | 38.81 | -
    Full FT | ZIP-FIT | 18.86 | 41.78 | 32s
    Full FT | LESS | 18.06 | 40.19 | 19h
    Full FT | DSIR | 17.98 | 44.27 | 97s
    Full FT | D4 | 14.37 | 40.66 | 7h 40m
    None | None: Pre-trained Gemma2-2B (4-bit quantized) | 6.09 | - | -
    QLoRA | ZIP-FIT | 12.19 | - | 32s
    QLoRA | DSIR | 9.14 | - | 97s
    QLoRA | D4 | 6.09 | - | 7h 40m
  2. You asked about comparison with more complex methods like LESS - As shown in the table above, ZIP-FIT outperforms LESS while being significantly faster (32s vs 19h).

  3. You noted figure quality issues - We have improved all figures in our revised submission.

Given these substantial improvements directly addressing your main concerns, we would be grateful if you would consider revisiting your assessment of our work.

Thank you again for your time and thoughtful feedback throughout this process.

Best regards,

The Authors

Official Review
Rating: 6

This paper introduces ZIP-FIT, an embedding-free data selection method leveraging gzip compression to measure the alignment between training and target domains. Unlike existing approaches that rely on neural embeddings, ZIP-FIT uses a computationally efficient compression-based alignment metric, enabling faster data selection while maintaining high relevance to the target task. Empirical evaluations demonstrate ZIP-FIT’s superiority over baselines DSIR and D4 in AutoFormalization and code generation tasks, achieving significantly faster convergence and lower cross-entropy loss with reduced computational costs. ZIP-FIT’s promise lies in its scalability and effectiveness, particularly in low-resource settings, where traditional embedding-based methods may be impractical.

Strengths

  1. ZIP-FIT’s embedding-free approach is a refreshing deviation from common embedding-based methods, offering a novel solution by leveraging gzip compression. The concept of using normalized compression distance (NCD) as an alignment metric is insightful and could inspire future research in embedding-free methodologies for various data selection tasks.
  2. The empirical results support the claims, showing that ZIP-FIT achieves faster convergence and better performance than established methods. The experiments were conducted on both AutoFormalization and code generation tasks, demonstrating ZIP-FIT's versatility across different domains.
  3. The paper is well-structured, with a clear exposition of the algorithm, experimental setup, and results. The figures effectively illustrate the performance benefits of ZIP-FIT.
  4. ZIP-FIT could represent a significant advancement in data selection for machine learning, particularly in computationally constrained environments. Its potential to optimize model fine-tuning with minimal resource requirements makes it highly applicable for real-world use cases, especially in domain-specific and low-resource applications.

Weaknesses

  1. While ZIP-FIT achieves excellent results on the tasks tested, its reliance on gzip compression may limit its effectiveness in complex semantic domains where relationships are nuanced and less compressible. Embedding-free approaches, while efficient, may not be ideal for tasks that require deep semantic understanding or complex syntactic relationships.

Questions

  1. Could you provide further insights into how ZIP-FIT might perform with data that have higher variability and diverse syntactic structures, such as conversational datasets?
  2. Can you clarify the theoretical basis for using gzip compression over other compression methods that might exploit redundancy differently? Would alternative compression algorithms affect the performance of ZIP-FIT?
Comment

We thank the reviewer for their detailed feedback. vFcf notes that the paper is “a refreshing deviation from common embedding-based methods,” and that our method is a “significant advancement in data selection for machine learning”.

For calibration purposes, we’d like to note that the ICLR 2025 rubric differs slightly from previous similar conferences. For example:

  • To indicate "Accept", the NeurIPS 2024 rubric says to use 7 whereas the ICLR 2025 rubric says to use 8
  • To indicate "Strong Accept", the NeurIPS 2024 rubric says to use 9 whereas the ICLR 2025 rubric says to use 10

Can you clarify the theoretical basis for using gzip compression over other compression methods that might exploit redundancy differently? Would alternative compression algorithms affect the performance of ZIP-FIT?

This is an excellent theoretical question that opens several interesting research directions! What properties of a compression algorithm make it optimal for data selection? Do different target tasks (e.g., code generation vs mathematical proofs) benefit from different compression approaches? While we chose gzip initially due to its widespread availability, our rebuttal period experiments suggest the choice of compression algorithm materially impacts performance.

During the rebuttal period, we conducted additional experiments comparing gzip with other compression methods (zstd and LZ4) at various compression levels on the HumanEval benchmark, as shown in Figure 8 in the Appendix. Looking at our results, we observe that LZ4 actually achieves the best Pass@1 performance (12.19%) at the lowest compression level, followed by gzip at 11.58%. This performance advantage of LZ4 persists until about 0.2 normalized compression level, after which its performance degrades more rapidly than gzip. While our original implementation used gzip, these results suggest LZ4 with low compression settings might be preferable for code generation tasks. We thank the reviewer for the suggestion of comparing different compression algorithms.

We revised our discussion to acknowledge this finding and note the potential benefits of using different compression algorithms. The significant variation in performance across compression levels and algorithms also provides interesting insights into how compression characteristics affect data selection quality.
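
To make the comparison concrete, swapping the compressor (and its level) is the only change needed in the NCD computation. The sketch below is illustrative; the third-party lz4 and zstandard packages and the level ranges noted in the comments are assumptions about a typical setup, not the exact grid behind Figure 8:

```python
import gzip

import lz4.frame   # third-party: pip install lz4 (assumed dependency)
import zstandard   # third-party: pip install zstandard (assumed dependency)

def compressed_size(data: bytes, algo: str, level: int) -> int:
    """Length of `data` after compression with the chosen algorithm and level."""
    if algo == "gzip":   # compresslevel 1-9
        return len(gzip.compress(data, compresslevel=level))
    if algo == "lz4":    # frame-format compression_level, roughly 0-16
        return len(lz4.frame.compress(data, compression_level=level))
    if algo == "zstd":   # levels 1-22
        return len(zstandard.ZstdCompressor(level=level).compress(data))
    raise ValueError(f"unknown algorithm: {algo}")

def ncd(x: bytes, y: bytes, algo: str = "gzip", level: int = 1) -> float:
    """Normalized compression distance under the selected compressor."""
    cx, cy, cxy = (compressed_size(d, algo, level) for d in (x, y, x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)
```

The alignment scoring and top-k selection stay unchanged; only the compressor behind the distance varies.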

Another reviewer requested that we evaluate ZIP-FIT on the HumanEval benchmark using Pass@k, comparing against baseline methods. In case these results might interest you, we found that ZIP-FIT improves Pass@1 scores on HumanEval over baselines:

Data Selection Method | Pass@1 (%)
Pre-trained Gemma2-2B, 4-bit quantized | 6.09
ZIP-FIT (LZ4) | 12.19
ZIP-FIT (gzip) | 11.58
DSIR | 9.14
D4 | 6.09
  1. ZIP-FIT with LZ4 achieves 12.19% Pass@1 on HumanEval, outperforming both DSIR (9.14%) and D4 (6.09%) baselines
  2. With gzip, ZIP-FIT reaches 11.58%, still maintaining superior performance

Could you provide further insights into how ZIP-FIT might perform with data that have higher variability and diverse syntactic structures, such as conversational datasets?

While ZIP-FIT may have limitations with highly variable data, we intentionally focused on code generation and formal mathematics - domains where syntactic structure is crucial and data selection has clear practical impact. In these domains, ZIP-FIT's strong performance (12.19% Pass@1 on HumanEval) demonstrates its effectiveness where it matters most. While exploring effectiveness on more variable data remains important future work, success in structured domains like code generation represents a meaningful practical contribution.

Comment

Dear Reviewer vFcf,

May we ask if you could respond to our comments? In our response, we've explored different compression algorithms and included additional HumanEval evaluations. Please let us know if you have other questions or concerns. Thank you!

Best regards,

The Authors

Comment

Dear Reviewer vFcf,

We appreciate your thoughtful review and constructive questions. We've made several updates to our manuscript that directly address your inquiries:

  1. Regarding your question about compression algorithm choice: We've added Figure 8 in the Appendix comparing different compression algorithms (gzip, LZ4, and zstd) across various compression levels. The results are quite interesting:

    • At minimum compression levels, both gzip and LZ4 achieve the strongest Pass@1 scores (11.58% and 12.19%, respectively), significantly outperforming the base model (6.09%)
    • Performance systematically degrades with increased compression across all algorithms, suggesting that aggressive compression removes valuable alignment signals
  2. These findings have led us to revise our discussion section to acknowledge that while our initial implementation used gzip for its widespread availability, LZ4 with low compression settings might be preferable for code generation tasks.

  3. We've also included comprehensive HumanEval Pass@1 results comparing ZIP-FIT against baseline methods, which demonstrate the robustness of our approach.

If this was not the experiment that you were looking for, please let us know so that we can correct course.

Otherwise, we hope that you'll consider raising your score in light of these improvements.

Best regards,

The Authors

Official Review
Rating: 8

The paper introduces a new data selection mechanism based on text compression distances. The concept of using compression methods for deep learning follows several modern practical results and theoretical motivations suggesting that language modeling is fundamentally a form of text compression. The method's conceptual simplicity combined with strong empirical results make it stand out as a modern way of filtering for aligned data.

优点

The paper is concise, sound, well written, and the experimental section shows promise for the method, especially with regard to other embedding-free methods.

The conceptual simplicity combined with the empirical results of the method is an especially strong point of the work.

Weaknesses

Ideally, it would be shown how the size of n (i.e., the number of samples from the target domain p) influences the performance of the method. If it is possible to pick n just sufficiently large, it would greatly improve the computational efficiency of the method for large target datasets.

Experiments in other domains would be really nice to better demonstrate the generalization capabilities of the method. Possibly there is data that is not well-suited to compression, and accordingly to ZIP-FIT, or where the data's compression factor varies too much between samples?

Questions

Minor comments

Figure 3, page 5:
The color bar is labeled "Gzip Alignment" instead of "ZIP-FIT-Alignment" from Algorithm 1; it may be confusing to readers.

Figure 3, page 5, line 231:
Please mention also in the figure caption that the test loss is calculated on ProofNet data.

Comment

We thank the reviewer for their thoughtful feedback. 1UKb notes that our method represents "a modern way for filtering aligned data" and highlights the "conceptual simplicity combined with strong empirical results" as key strengths.

For calibration purposes, we’d like to note that the ICLR 2025 rubric differs slightly from previous similar conferences. For example:

  • To indicate "Accept", the NeurIPS 2024 rubric says to use 7 whereas the ICLR 2025 rubric says to use 8
  • To indicate "Strong Accept", the NeurIPS 2024 rubric says to use 9 whereas the ICLR 2025 rubric says to use 10

We address the specific points raised:

Ideally, it would be shown how the size of n (i.e., number of samples from the target domain) influences the performance of the method.

During the rebuttal period, we conducted preliminary experiments varying the number of target domain samples (n) from HumanEval and evaluating Pass@1 on the test set:

Number of Target Examples (n) | Pass@1 (%)
n=83 | 12.19
n=40 | 14.63
n=20 | 9.14

With n = 40 (less than half of our original n = 83), ZIP-FIT maintains strong performance at 14.63% Pass@1 on HumanEval. However, further reducing to n = 20 shows performance drops to 9.14% Pass@1, suggesting a lower bound on the required number of target examples. While these results are preliminary, they indicate ZIP-FIT can be effective with a relatively small number of target examples, making it practical for many real-world applications. We plan to include a comprehensive analysis of this efficiency-performance trade-off in future work.

Performance across different domains:

While ZIP-FIT may have limitations with highly variable data, we intentionally focused on code generation and formal mathematics - domains where syntactic structure is crucial and data selection has clear practical impact. In these domains, ZIP-FIT's strong performance (12.19% Pass@1 on HumanEval) demonstrates its effectiveness where it matters most. While exploring effectiveness on more variable data remains important future work, success in structured domains like code generation represents a meaningful practical contribution.

Figure improvements:

Thank you for catching these inconsistencies. Our revision now explicitly mentions the ProofNet test loss calculation in Figure 3's caption.

These changes will improve clarity and maintain consistency throughout the paper.

Another reviewer requested that we evaluate ZIP-FIT on the HumanEval benchmark using Pass@k, comparing against baseline methods. In case these results might interest you, we found that ZIP-FIT improves Pass@1 scores on HumanEval over baselines:

Data Selection Method | Pass@1 (%)
Pre-trained Gemma2-2B, 4-bit quantized | 6.09
ZIP-FIT (LZ4) | 12.19
ZIP-FIT (gzip) | 11.58
DSIR | 9.14
D4 | 6.09

ZIP-FIT improves Pass@1 scores on HumanEval over baselines:

  1. ZIP-FIT with LZ4 achieves 12.19% Pass@1 on HumanEval, outperforming both DSIR (9.14%) and D4 (6.09%) baselines
  2. With gzip, ZIP-FIT reaches 11.58%, still maintaining superior performance
Comment

Dear Reviewer 1UKb,

May we ask if you could respond to our comments? In our response, we've explored the sample size requirements and included additional HumanEval evaluations. Please let us know if you have other questions or concerns. Thank you!

Best regards,

The Authors

Comment

We thank all reviewers for their thoughtful feedback. For calibration purposes, we’d like to note that the ICLR 2025 rubric differs slightly from previous similar conferences. For example:

  • To indicate "Accept", the NeurIPS 2024 rubric says to use 7 whereas the ICLR 2025 rubric says to use 8
  • To indicate "Strong Accept", the NeurIPS 2024 rubric says to use 9 whereas the ICLR 2025 rubric says to use 10

The reviewers highlight several strengths of our work:

  • "A refreshing deviation from common embedding-based methods" [vFcf]
  • "Method's conceptual simplicity combined with strong empirical results make it stand out" [1UKb]
  • "Tackles a crucial problem in low-resource settings" [CtcN]
  • "Well-presented and well-motivated" [39ck]

The main concerns raised were:

  • Evaluation beyond cross-entropy loss [CtcN, 39ck]
  • Additional baselines [CtcN, 39ck]
  • Justification for compression choice [vFcf]
  • Performance on diverse data types [1UKb, vFcf]
  • Impact of target domain sample size on performance [1UKb]

During the rebuttal period, we conducted additional experiments to address these concerns:

  • We evaluated ZIP-FIT on HumanEval, showing that it doubles Pass@1 performance (12.19%) compared to a pre-trained Gemma2-2B (6.09%) and outperforms DSIR (9.14%) and D4 (6.09%).
  • We performed a comprehensive analysis of different compression algorithms, demonstrating that compression parameter choice significantly impacts performance. The results validate our approach while providing insights into how compression characteristics affect data selection quality.
  • We performed initial experiments evaluating the effect of the size of the target domain on performance, which suggests we can maintain or even improve performance while significantly reducing the number of target samples required, directly addressing computational efficiency concerns for large target datasets.
  • We acknowledge limitations with highly variable data as discussed in section 8 of our paper and look forward to exploring this in future work.
  • We are currently running comparisons with additional baselines (LESS and SHED) and will update our rebuttal with these results as soon as we have them.

We also improved clarity by:

  • Moving Figure 5 (now Figure 2) to page 2
  • Adding ProofNet test specification to Figure 3’s (now Figure 4) caption
  • Improving quality of Figure 2 (now Figure 3)
Comment

During the review process, two reviewers (Reviewers 39ck and CtcN) had concerns regarding the evaluation metrics (only using cross-entropy test loss) and requested additional evaluation experiments, particularly pass@k metrics on HumanEval and proof success rates on miniF2F:

  • Reviewer 39ck: All experimental results are based on test loss, which may not be very reliable. It would be essential to conduct evaluations on some standard benchmarks, such as HumanEval and MBPP for code evaluation, to demonstrate the scores the model can achieve.

  • Reviewer CtcN: The authors primarily use test data cross-entropy loss as the evaluation metric. However, the authors do not compare their results with widely accepted metrics in the research community for the studied downstream tasks

During the discussion phase, the authors reported to several reviewers a pass@1 HumanEval score for Gemma2-2B of 6.1%. The authors' reported scores have huge discrepancies with what the Gemma 2 technical report [1] states (20.1%) as well as what the EvalPlus Leaderboard [2] reports (25%). When Reviewer CtcN questioned the authors, the authors responded with:

Our current experiments use quantized LoRA fine-tuning via unsloth for resource efficiency, which explains the lower baseline numbers compared to [7]. We are currently running full fine-tuning experiments with Gemma2-2B to provide direct comparisons with the reported 20.1% Pass@1.

Reviewer CtcN responded with:

[7] clearly states the released model to have results of 20.1%. I do not understand how your baseline results are so poor. I also do not understand how your finetuning method quantized or full precision, LoRA or full parameter (which is not disclosed in the paper) could affect the baseline released pretrained models evaluation scores.

As Reviewer CtcN, I question the credibility of the authors' evaluation results. The authors' responses have raised my doubts about whether experiments and evaluations were properly conducted, and whether the authors are familiar with relevant works on LLMs for Coding or Autoformalization. Due to the authors' responses, I have downgraded my rating for the paper. I also summarize the issue above and highlight it here.

[1] Gemma 2: Improving Open Language Models at a Practical Size

[2] https://evalplus.github.io/leaderboard.html

Comment

Thank you for highlighting these concerns. It's clear there has been a breakdown in communication regarding our experimental setup, particularly with Gemma2-2B's evaluation.

To clarify:

  1. Our use of quantized LoRA via unsloth was only for the rebuttal experiments, not in the original paper.
  2. We acknowledge this created confusion by introducing a new variable that wasn't properly explained.
  3. We should have explicitly compared against the published baseline (20.1% Pass@1) from [7].

We are immediately:

  1. Running evaluations with the standard pre-trained Gemma2-2B to validate against published results.
  2. Documenting our evaluation pipeline in detail.
  3. Will post an update with these results as soon as they are available.

We appreciate your patience as we work to provide accurate and transparent comparisons.

Comment

Can you explain how your training setup (and any related variable) has anything to do with the evaluation results you obtained for a released model?

By "Running evaluations with the standard pre-trained Gemma2-2B", can you explain what non-standard pretrained Gemma2-2B was used to obtain 6.1% pass@1 on HumanEval?

Comment

If there were any errors in the evaluation pipeline, it would be more constructive to acknowledge and address them directly, rather than sharing unrelated information hoping to confuse the reviewers.

Comment

In response to requests from Reviewers CtcN and 39ck for comparisons against more computationally expensive baselines, we have completed additional experiments comparing ZIP-FIT against a recent resource-intensive baseline (LESS) using full parameter fine-tuning on Gemma2-2B. All methods were evaluated under identical training settings using the same number of training tokens (top 1M):

Fine-tuning | Data Selection Method | Pass@1 (%) | Pass@10 (%) | Selection Time
None | None: Pre-trained Gemma2-2B | 15.24 | 38.81 | -
Full FT | ZIP-FIT | 18.86 | 41.78 | 32s
Full FT | LESS | 18.06 | 40.19 | 19h
Full FT | DSIR | 17.98 | 44.27 | 97s
Full FT | D4 | 14.37 | 40.66 | 7h 40m
None | None: Pre-trained Gemma2-2B (4-bit quantized) | 6.09 | - | -
QLoRA | ZIP-FIT | 12.19 | - | 32s
QLoRA | DSIR | 9.14 | - | 97s
QLoRA | D4 | 6.09 | - | 7h 40m

For transparency and comprehensive evaluation, we include both our new full parameter fine-tuning results and our previous QLoRA results to demonstrate ZIP-FIT's effectiveness across different computational settings.

Key findings:

  1. With full parameter fine-tuning, ZIP-FIT achieves competitive Pass@1 (18.86%) and Pass@10 (41.78%) scores compared to resource-intensive methods
  2. ZIP-FIT maintains significantly faster selection times (32s) compared to LESS (19h) and DSIR (97s)
  3. Even with QLoRA fine-tuning, ZIP-FIT shows meaningful improvements over baselines while maintaining efficiency

These results demonstrate that ZIP-FIT can match or exceed the performance of more computationally expensive methods while maintaining its core advantage of efficiency.

Best regards,

The Authors

AC Meta-Review

The paper introduces ZIP-FIT, a data selection framework that utilizes gzip compression to measure alignment between potential training data and target tasks, aiming to enhance language model performance in specific domains. The authors reported strong results (by test losses) on two downstream tasks. The method is also scalable, thanks to efficient data compression techniques to align train-test data distributions. However, some major concerns remain with the current work: (1) Task specificity: while effective on tasks like autoformalization and Python code generation, the applicability of ZIP-FIT to other domains remains to be explored. (2) The major experimental results are based on test loss, which may not be very reliable; the authors should provide final results in terms of downstream task performance.

Additional Comments on Reviewer Discussion

During the discussion between reviewers and authors, the following points were raised:

  • Application to downstream tasks: the current approach is mainly applied to two tasks, autoformalization and Python code generation. The motivation for choosing these tasks was not clear, as the method was proposed as a task-agnostic data selection method that can be applied to different tasks. The limited experiments make it hard to judge whether the method works only in these domains or in others as well.
  • Evaluation results are based on test losses: while losses can indicate some level of performance, they are not completely reliable for demonstrating real task-specific performance. Code generation tasks should be evaluated by the correctness of code (compilable / executable / passing tests); losses are highly unreliable for these tasks. Pass@k was reported during the rebuttal for code generation (though only on basic and rather standard coding tasks from the HumanEval benchmark; should a better benchmark be used to test the train-test alignment?)

There was further discussion between reviewers and authors about inconsistent Gemma2 results. While there are some major result gaps, I did not take this issue into consideration in my final evaluation, as result replication is still an open topic in the research community and it is possible to obtain different replicated results.

Final Decision

Reject