PaperHub
5.5 / 10
Poster · 3 reviewers
Reviewer ratings: 3, 3, 3 (min 3, max 3, std 0.0)
ICML 2025

Chameleon: A Flexible Data-mixing Framework for Language Model Pretraining and Finetuning

OpenReview | PDF
Submitted: 2025-01-24 · Updated: 2025-07-24
TL;DR

We introduce a new, efficient data-mixing framework based on leverage scores that boosts LM generalization, can flexibly adapt to new domains, and can be applied in finetuning.

Abstract

Keywords
Data mixture, LLMs, leverage score, pretraining, finetuning

Reviews and Discussion

Review
Rating: 3

This manuscript introduces a flexible data-mixing framework for LLMs that uses kernel ridge leverage scores (KRLS) computed from domain embeddings learned by a proxy model. It quantifies each domain's representativeness and interdependence within the embedding space, and then uses this information to generate a weighted data mixture for both pretraining and finetuning. Comprehensive experiments and ablation studies demonstrate practical benefits on both perplexity and downstream tasks, outperforming existing methods such as DoReMi and DoGE with a small computational overhead.

Post Rebuttal

Most of my concerns are properly addressed. I would like to raise my score.

Questions for the Authors

  1. Could you provide a more detailed runtime analysis and complexity comparison of your method with baseline methods?

  2. Can you elaborate on the theoretical insights behind why KRLS are effective for domain reweighting?

  3. Could you discuss in more detail how the computed domain weights transfer across different datasets and model scales?

Claims and Evidence

Overall, most claims are supported by empirical evidence. However, some of the claims are not supported by clear and convincing evidence.

  1. The authors claim that the method is computationally efficient. However, the submission does not include a detailed runtime analysis or complexity comparison with baseline methods.

  2. The novelty of the manuscript lies in the ability to quantify each domain's representativeness and interdependence in the learned embedding space. However, the evidence provided does not clearly demonstrate how reliable or robust these kernel ridge leverage scores are across different scenarios. More analysis or visualization of these scores and their correlation with performance would strengthen this claim.

  3. The authors claim that the computed domain weights transfer directly to new data and different model scales without needing to retrain the proxy model. Although the provided experimental results are promising, the evidence could be more convincing if more experiments or detailed ablation studies were provided to thoroughly evaluate this transferability under a wider range of conditions.

  4. The paper would be strengthened by an in-depth theoretical analysis explaining why kernel ridge leverage scores are particularly effective for domain reweighting.

Methods and Evaluation Criteria

The proposed methods and evaluation criteria align well with the problem of efficient data mixing for LLM training. The use of kernel ridge leverage scores computed from learned domain embeddings makes sense for quantifying domain representativeness and interdependence. Perplexity and downstream task performance are standard metrics for evaluating language model training.

Theoretical Claims

I have checked the theoretical proof in the appendix, which establishes the equivalence between the ridge leverage scores computed in the feature space and those obtained via kernel ridge regression when using a linear kernel. The proof, presented as Lemma A.1, correctly applies standard matrix identities to show that the diagonal elements of the corresponding hat matrices are equal. While some intermediate steps are missing, I did not find any logical issues.
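
For reference, the key step in such an argument is presumably the standard push-through identity, stated here in generic notation that may differ from Lemma A.1's: for a feature matrix $X \in \mathbb{R}^{n \times d}$ and linear kernel $K = XX^\top$,

$$X\,(X^\top X + \lambda I_d)^{-1} X^\top \;=\; (XX^\top + \lambda I_n)^{-1} XX^\top \;=\; K\,(K + \lambda I_n)^{-1},$$

so the diagonal entries of the two hat matrices, i.e. the ridge leverage scores in feature space and their kernelized counterparts, coincide.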

Beyond this lemma, most theoretical claims are supported by empirical evidence and ablation studies rather than fully formalized proofs. Overall, the proofs that were provided are correct, and no major issues were found.

Experimental Design and Analysis

Yes, the experimental design and analyses are sound. However, I would suggest that the authors include a runtime analysis to validate the claim that the proposed method is computationally efficient.

Supplementary Material

I have reviewed every part of the supplementary material.

Relation to Broader Scientific Literature

The key contributions of the paper are well situated within the broader scientific literature, building upon and extending several established ideas, including (1) kernel methods and leverage scores; (2) data selection and mixing in LLM training; and (3) domain adaptation and representation learning.

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

Strengths:

  1. The paper presents a novel method that leverages the kernel ridge leverage score to reweight data domains, which is a combination of ideas from data mixing, transfer learning, and efficient training of LLMs.

  2. The proposed work holds practical significance, as it is important for scaling LLM training without retraining expensive proxy models.

  3. The extensive empirical results indicate improvements in both pretraining and finetuning, which validates the practical value of the proposed method.

  4. The writing is clear and observation-driven, which makes it easy for readers to follow along and engage with the authors' approach to the problem.

Weaknesses:

  1. While the paper claims computational efficiency, it lacks a detailed runtime analysis or formal complexity comparison with baseline methods.

  2. The theoretical justification behind using KRLS is largely supported by empirical evidence rather than formal proofs.

  3. The clarity regarding the transferability of domain weights across diverse datasets and model sizes could be improved with more detailed ablation studies and analysis.

Other Comments or Suggestions

Please see the Weaknesses section.

Author Response

We thank the reviewer for their valuable feedback and address all remaining concerns below:

Q1. Runtime analysis and complexity comparison

Obtaining the embeddings $x_i,\, i=1,\ldots,k$ requires a single forward pass for each $a \in B_i$ through the proxy $h_{\theta_p}(a)$; inference is fast as the proxy is a small model. Computing (KRLS) involves inverting a matrix of size $k \times k$ in $\mathcal{O}(k^3)$, which is computationally cheap since datasets typically have a small number $k$ of domains. We do not add any overhead in proxy training. In contrast, DoGE requires per-domain gradient computation at each iteration, and DoReMi runs inference of the reference model for perplexity comparisons.
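
As a rough illustration of this pipeline (a minimal sketch, not the paper's implementation: the helper names `proxy_forward` and `domain_batches`, the per-domain averaging, and the linear kernel with regularization `lam` are assumptions here):

```python
import numpy as np

def domain_embeddings(proxy_forward, domain_batches):
    # One cheap forward pass per sample a in B_i through the small proxy,
    # then (assumed here) a per-domain average to get one embedding x_i.
    return np.stack([
        np.mean([proxy_forward(a) for a in batch], axis=0)
        for batch in domain_batches
    ])  # shape (k, d), with k = number of domains

def kernel_ridge_leverage_scores(X, lam=1e-3):
    # KRLS with a linear kernel: diag(K (K + lam*I)^{-1}).
    # K is only k x k, so the O(k^3) inverse is negligible for small k.
    K = X @ X.T
    k = K.shape[0]
    return np.diag(K @ np.linalg.inv(K + lam * np.eye(k)))
```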

For a straightforward comparison, we report GPU hours below. DoReMi and DoGE incur over 10% of the base model training cost, while Chameleon reduces it to under 2%. These savings are particularly impactful for academic labs with limited computational resources.

Table jzfM-1: Runtime comparison.

| Method | GPU Hours |
| --- | --- |
| DoReMi | 7.4 |
| DoGE | 6.3 |
| Chameleon | 0.8 |
| 684M base model | 56 |

See more details on the FLOPs computations in Q2 of Reviewer ubpq.

Q2. Theoretical insights for the effectiveness of KRLS in domain reweighting

We use kernel ridge leverage scores (KRLS) to determine domain weights. KRLS is a well-established tool in data analysis. It quantifies the influence or importance of data points [Alaoui & Mahoney, 2015]. This property is leveraged in machine learning for tasks like density estimation [Pauwels et al., 2018] and novelty detection [Ducharlet et al., 2024; Lasserre & Pauwels, 2019].

The inverse KRLS is proportional to the Christoffel function value [Pauwels et al., 2018]. This relationship provides additional theoretical justification for our approach. Christoffel functions (Eq. (1) in [Pauwels et al., 2018]) precisely characterize the local density of the data distribution in the feature space, where higher values indicate denser regions.

We compute the score $S_\lambda(D_i)$ of domain $i$ using Eq. (KRLS) on page 4. During pretraining, assigning higher sampling probability to domains with low KRLS (and thus high $S_\lambda^{-1}$ / Christoffel value) upweights high-density data regions, which are most influential on base LMs' performance [1]. LLM finetuning aims to specialize on a novel, specific task, requiring the model to learn distinctive features not fully captured during pretraining, so we instead prioritize the domains with high $S_\lambda$. Section 3.2 converts either $S_\lambda^{-1}$ or $S_\lambda$ into a probability distribution $\alpha$ by applying softmax normalization.
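
A minimal sketch of this score-to-weight conversion (the softmax direction follows the description above; the temperature parameter is an assumption, and the exact normalization in Section 3.2 may differ):

```python
import numpy as np

def mixture_weights(krls_scores, mode="pretrain", temperature=1.0):
    # Pretraining: softmax of S_lambda^{-1} upweights low-KRLS (high-density) domains.
    # Finetuning:  softmax of S_lambda upweights high-KRLS (distinctive) domains.
    s = np.asarray(krls_scores, dtype=float)
    logits = (1.0 / s if mode == "pretrain" else s) / temperature
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()                 # domain sampling probabilities alpha
```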

We will revise Section 3.2 to explicitly link the data mixing goal to KRLS and inverse KRLS, grounding it in statistical learning theory. We will make this discussion self-contained within the main text, incorporating analysis from Appendix A.

Q3. More detail on how domain weights transfer across datasets and model scales

  • Transfer across model sizes: Prior works (e.g., DoReMi, DoGE, RegMix) have shown that domain weights transfer well across model scales. To further validate this, we trained 1.2B models on SlimPajama and found that weights from an 82M proxy model effectively transfer to both 684M and 1.2B models. Notably, Chameleon achieves even greater improvements on larger models, highlighting its scalability.

Table jzfM-1: PPL with 1.2B model

| Domain | Uniform | DoReMi | DoGE | Chameleon | RegMix |
| --- | --- | --- | --- | --- | --- |
| Arxiv | 6.30 | 7.09 | 7.07 | 6.33 | 10.61 |
| Book | 28.25 | 32.66 | 27.83 | 24.63 | 27.55 |
| CC | 31.19 | 29.96 | 28.11 | 26.95 | 24.70 |
| C4 | 34.74 | 33.05 | 31.06 | 29.58 | 31.94 |
| Github | 2.91 | 3.03 | 3.07 | 2.94 | 4.08 |
| Stackexchange | 6.01 | 6.44 | 5.80 | 5.76 | 9.54 |
| Wikipedia | 8.65 | 7.93 | 10.88 | 9.03 | 20.08 |
| Average PPL | 16.86 | 17.17 | 16.26 | 15.03 | 18.36 |

Table jzfM-2: Downstream accuracy with 1.2B model

| Task | Uniform | DoReMi | DoGE | Chameleon | RegMix |
| --- | --- | --- | --- | --- | --- |
| ARC-E | 39.4 | 41.2 | 41.9 | 42.4 | 43.0 |
| COPA | 64.0 | 66.0 | 63.0 | 61.0 | 66.0 |
| HellaSwag | 27.5 | 27.7 | 28.2 | 28.4 | 27.6 |
| Lambada | 17.9 | 17.3 | 18.7 | 21.6 | 20.7 |
| LogiQA | 22.0 | 24.0 | 22.0 | 21.2 | 20.7 |
| MultiRC | 57.2 | 57.2 | 57.2 | 57.2 | 56.9 |
| OpenBookQA | 15.0 | 13.6 | 13.8 | 16.4 | 17.4 |
| PiQA | 61.5 | 61.9 | 61.8 | 63.8 | 58.7 |
| QQP | 36.8 | 36.8 | 36.9 | 36.9 | 36.8 |
| RACE | 26.0 | 26.7 | 27.8 | 29.1 | 28.4 |
| SciQ | 69.7 | 68.3 | 69.0 | 72.6 | 72.0 |
| SocialIQA | 36.2 | 36.5 | 35.9 | 37.2 | 36.1 |
| WinoGrande | 52.8 | 49.6 | 48.9 | 51.5 | 50.0 |
| Average | 40.5 | 40.5 | 40.4 | 41.5 | 41.1 |

  • Transfer across datasets: We conduct ablation studies on the Pile (Section 4.2). Specifically, we retrain a proxy model on the Pile for reference. As shown in "Domain weights on the Pile", the weights from a proxy trained on the Pile (the blue column) align with the weights transferred from proxies trained on SlimPajama at various sizes (the other columns), confirming Chameleon's robust transferability.

[1] Mallen et al. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. ACL (2023).

Reviewer Comment

Thank you for the response. Most of my concerns are properly addressed.

Review
Rating: 3

This paper introduces a new data mixing framework for language model pretraining and finetuning, wherein the mixing weights for different domains are constructed from a domain affinity matrix generated via kernel functions on domain embeddings. This domain matrix can naturally be transformed into domain weights for pretraining (i.e., emphasizing broader and different domains) and finetuning (emphasizing similar domains). The key advantage of this framework is that it does not rely extensively on training proxy models, and can thereby be seen as a low-cost alternative to existing data mixing frameworks.

Questions for the Authors

Refer to Strengths & Weaknesses

Claims and Evidence

The paper is well-supported by rigorous numerical analysis.

Methods and Evaluation Criteria

Yes

Theoretical Claims

There are some theoretical results in the Appendix, which I briefly reviewed.

Experimental Design and Analysis

The experiments are meaningful and sound.

Supplementary Material

I briefly reviewed the supplementary material, focusing on the theoretical results and the intuition behind the method.

Relation to Broader Scientific Literature

The paper adds to the data mixing literature, with specific focus on compute-efficiency and adaptation to new tasks.

Essential References Not Discussed

The paper comprehensively covers the primary literature.

Other Strengths and Weaknesses

Strengths:

  • The paper is well-written and timely, and presents rigorous experiments.

Weaknesses:

  • The reasoning behind obtaining domain weights from the KRLS is unclear. Is there some theoretical relationship or connection that can be derived to show why pretraining weights as designed, or finetuning weights as designed, are appropriate? I appreciate that the authors have provided some intuition but it would be beneficial to get more insight.
  • There doesn't seem to be a significant performance improvement from the mixing law; indeed, the major selling point is that it achieves competitive performance at orders of magnitude lower cost. It would be useful to emphasize this point further, especially in the numerical results, by breaking down all cost calculations (e.g., in domain transfer).
  • It would be useful to include Data Mixing Laws as a baseline, at least for the experiments on generalization [1].

[1] Ye, Jiasheng, et al. "Data mixing laws: Optimizing data mixtures by predicting language modeling performance." arXiv preprint arXiv:2403.16952 (2024).

Other Comments or Suggestions

N/A

Author Response

We thank the reviewer for their valuable feedback and address all remaining concerns below:

Q1. Reasoning behind obtaining domain weights from the KRLS

We use kernel ridge leverage scores (KRLS) to determine domain weights. KRLS is a well-established tool in data analysis. It quantifies the influence or importance of data points [Alaoui & Mahoney, 2015]. This property is leveraged in machine learning for tasks like density estimation [Pauwels et al., 2018] and novelty detection [Ducharlet et al., 2024; Lasserre & Pauwels, 2019].

The inverse KRLS is proportional to the Christoffel function value [Pauwels et al., 2018]. This relationship provides additional theoretical justification for our approach. Christoffel functions (Eq. (1) in [Pauwels et al., 2018]) precisely characterize the local density of the data distribution in the feature space, where higher values indicate denser regions.

We compute the score $S_\lambda(D_i)$ of domain $i$ using Eq. (KRLS) on page 4. During pretraining, assigning higher sampling probability to domains with low KRLS (and thus high $S_\lambda^{-1}$ / Christoffel value) upweights high-density data regions, which are most influential on base LMs' performance [1]. LLM finetuning aims to specialize on a novel, specific task, requiring the model to learn distinctive features not fully captured during pretraining, so we instead prioritize the domains with high $S_\lambda$. Section 3.2 converts either $S_\lambda^{-1}$ or $S_\lambda$ into a probability distribution $\alpha$ by applying softmax normalization.

We will revise Section 3.2 to explicitly link the data mixing goal to KRLS and inverse KRLS, grounding it in statistical learning theory. We will make this discussion self-contained within the main text, incorporating analysis from Appendix A.

Q2. Computational cost breakdown

Chameleon's main computational cost comes from 1) proxy training and 2) embedding extraction, with proxy training being dominant. In our setting, training an 82M proxy model requires $10^{17}$–$10^{18}$ FLOPs, while DoReMi and DoGE take longer to converge, leading to 5-10x higher costs (see line 299, "Stability and Practicality"). Embedding extraction requires only $10^{15}$ FLOPs (<1% of proxy training). Importantly, Chameleon avoids proxy retraining when domains change, incurring only embedding extraction costs. In contrast, DoReMi and DoGE incur the full FLOPs of retraining their proxy models.

Our method is also significantly cheaper in GPU hours, see Response to Q1 of Reviewer jzfM for more details and complexity analysis.

Beyond efficiency, Chameleon is also more stable, making it resource-efficient in practical use. Unlike DoReMi and DoGE, which are sensitive to hyperparameters, Chameleon remains robust, see Q2 of Reviewer jX7F for more details.

Lastly, we note that Chameleon shows favorable accuracy behaviour on larger models as well, as shown in our additional 1.2B model experiments (see Q3 of Reviewer jzfM).

Q3. Data Mixing Laws

We first provide discussions and then present empirical comparisons.

Data Mixing Laws derive domain weights by leveraging scaling laws of training steps, model sizes, and data mixtures to predict the performance of large models trained on diverse data from small-scale training. This requires training multiple small proxy models with varying domain weights, making it more computationally expensive than ours, which trains just one proxy model.

We use their reported domain weights to train a 684M model on SlimPajama. Since their weights are optimized with the Pile as the target, they may be suboptimal for SlimPajama. However, given the alignment of their objectives and the overlap in data sources, we consider the comparison meaningful.

Chameleon outperforms Data Mixing Laws in both perplexity and downstream tasks at a fraction of the cost. Data Mixing Laws' FLOPs are calculated for 4 different proxy sizes and 20 separate mixtures; our cost is two orders of magnitude lower.

Table ubpq-1: PPL comparison with Data Mixing Laws

| Domain | Data Mixing Laws | Chameleon |
| --- | --- | --- |
| Arxiv | 7.55 | 8.31 |
| Book | 45.06 | 39.23 |
| CC | 44.21 | 40.11 |
| C4 | 45.79 | 42.59 |
| Github | 4.01 | 4.20 |
| Stackexchange | 7.96 | 7.94 |
| Wikipedia | 16.20 | 13.90 |
| Avg PPL | 24.40 | 22.31 |
| # Domains Over Uniform | 4/7 | 4/7 |
| FLOPs | $5.36 \times 10^{19}$ | $1.36 \times 10^{17}$ |

Table ubpq-2: Downstream accuracy comparison with Data Mixing Laws

| Task | Data Mixing Laws | Chameleon |
| --- | --- | --- |
| ARC-E | 34.5 | 37.8 |
| COPA | 59.0 | 61.9 |
| HellaSwag | 27.4 | 27.0 |
| Lambada | 14.7 | 15.1 |
| LogiQA | 26.0 | 22.6 |
| MultiRC | 57.2 | 57.2 |
| OpenBook | 25.2 | 14.4 |
| PiQA | 58.5 | 60.5 |
| QQP | 36.8 | 39.2 |
| RACE | 26.4 | 26.5 |
| SciQ | 57.2 | 64.3 |
| Social IQA | 36.1 | 35.7 |
| WinoGrande | 48.4 | 52.1 |
| Average | 39.0 | 39.6 |

[1] Mallen et al. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. ACL (2023).

[2] Parmar et al. Data, data everywhere: A guide for pretraining dataset construction. ACL 2024.

Review
Rating: 3

The authors propose a method for data sampling in the pretraining and finetuning of language models. Their idea is to train a classifier, extract the classifier's middle-layer word embeddings for each domain in the training data, and then apply matrix factorization to obtain a scalar weight for each domain. These weights are plugged into two heuristic equations to obtain the probabilities for sampling from the domains during LM training.

It is empirically shown that the algorithm is faster than the baselines, and can be used without retraining the classifier when new data is added to the training. It is also shown that the algorithm can be used during finetuning.

=================

UPDATE: I updated my review score.

Questions for the Authors

See above

Claims and Evidence

The claims on speed need further explanations, see the section below

Methods and Evaluation Criteria

Yes

Theoretical Claims

Some of them

Experimental Design and Analysis

Most of them

Supplementary Material

Some parts, those mentioned in the paper

Relation to Broader Scientific Literature

Builds upon existing literature, primarily DoGE

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

Strengths:

  • The method is simple, intuitive, and easy to implement.
  • The paper is well written, and the experiments are well organized.
  • The topic is very relevant and timely.

Weaknesses:

  • The core idea is just a heuristic (in the authors' words): pretraining needs "broadly shared semantic structures" and finetuning needs "distinct and unique data" to "highlight domain-specific characteristics". To me, the statements above are just vague justifications for what empirically works.
  • In my opinion the performance improvements are virtually non-existent compared to DoGE. The main distinction lies in speed. The authors might argue that when new data is added, the performance improvement is more tangible. But if we put speed aside and retrain the baseline proxy networks, then again the only distinction is speed. In general the proxy model is relatively small, and its training data is a fraction of the entire training data. How many GPU hours are needed to train the proxy networks across the models? Does speeding up the training of this network make any significant difference in energy consumption? How often is retraining of the proxy network needed, and is it needed at all?

I would be happy to revise my score if authors give me convincing answers.

Other comments:

Please don't force the reviewers to read your appendix.

Other Comments or Suggestions

See above

Author Response

We thank the reviewer for their valuable feedback and address all remaining concerns below:

Q1. Theoretical motivation

We use kernel ridge leverage scores (KRLS) to determine domain weights. KRLS is a well-established tool in data analysis. It quantifies the influence or importance of data points [Alaoui & Mahoney, 2015]. This property is leveraged in machine learning for tasks like density estimation [Pauwels et al., 2018] and novelty detection [Ducharlet et al., 2024; Lasserre & Pauwels, 2019].

The inverse KRLS is proportional to the Christoffel function value [Pauwels et al., 2018]. This relationship provides additional theoretical justification for our approach. Christoffel functions (Eq. (1) in [Pauwels et al., 2018]) precisely characterize the local density of the data distribution in the feature space, where higher values indicate denser regions.

In our context, we compute the score $S_\lambda(D_i)$ of domain $i$ using Eq. (KRLS) on page 4. During pretraining, assigning higher sampling probability to domains with low KRLS (and thus high $S_\lambda^{-1}$ / Christoffel value) upweights high-density data regions, which are most influential on base LMs' performance [1]. LLM finetuning aims to specialize on a novel, specific task, requiring the model to learn distinctive features not fully captured during pretraining, so we instead prioritize the domains with high $S_\lambda$. Section 3.2 converts either $S_\lambda^{-1}$ or $S_\lambda$ into a probability distribution $\alpha$ by applying softmax normalization.

We will revise the phrasings in Section 3.2 to explicitly connect the data mixing goal to the mathematical properties of KRLS and inverse KRLS, clarifying its foundation in theoretical principles from statistical learning rather than just empirical heuristics. We will make this discussion self-contained within the main text, drawing upon the analysis currently in Appendix A.

Q2. Impact of our computational efficiency

The computational cost associated with determining the domain mixture via proxy training is non-negligible. Table jX7F-1 below reports the required GPU (H100) hours for our experiments in Tab. 2 in the paper. Compared to DoReMi and DoGE, which add over 10% to base model training costs, we reduce computational overhead to less than 2% of final training cost. This reduction is crucial for academic labs and smaller-scale training.

Table jX7F-1: GPU hours for universal generalization experiments.

| Method | GPU Hours |
| --- | --- |
| DoReMi | 7.4h |
| DoGE | 6.3h |
| Chameleon | 0.8h |
| 684M base model | 56h |

Even for larger base models, the computational cost reported is often an optimistic lower bound for the baselines, since DoReMi and DoGE require extensive hyperparameter tuning. It has been shown that DoReMi's weights are unstable or difficult to reproduce [2; Fan et al., 2024b] and that DoGE's approximations make it more sensitive to the learning rate [Kang et al., 2024b]. We also noticed that DoGE is extremely sensitive to its Bregman coefficient $\mu$, as shown in Table jX7F-2, where we report domain weights and the validation PPL in the last line. Small variations in $\mu$ drastically change domain weights and degrade validation PPL, necessitating repeated validation on base models. This sensitivity contradicts the goal of data mixing methods: weights should transfer reliably to large models without costly grid searches.

Table jX7F-2: DoGE's weights are highly sensitive to $\mu$.

| Domain | $\mathbf{\mu=0.05}$ | $\mu=0.01$ | $\mu=0.1$ |
| --- | --- | --- | --- |
| Arxiv | 0.041 | 0.210 | 0.222 |
| Book | 0.078 | 0.025 | 0.069 |
| CC | 0.268 | 0.052 | 0.068 |
| C4 | 0.283 | 0.025 | 0.050 |
| Github | 0.059 | 0.021 | 0.378 |
| Stackexchange | 0.230 | 0.649 | 0.103 |
| Wikipedia | 0.041 | 0.019 | 0.110 |
| Avg PPL of 124M model | 24.97 | 25.45 | 26.73 |

In contrast, Chameleon is stable across training steps, model sizes, $\lambda$, and sample counts (Tables 10, 11). This means our method can produce promising domain weights without repeated validation, significantly reducing overall costs for users.

Another key aspect, as the reviewer pointed out, is the cost of incorporating new data sources. Our data-centric approach requires only inference to obtain new embeddings and recompute KRLS, whereas proxy optimization-based methods like DoReMi and DoGE necessitate full retraining and additional tuning.

Lastly, we further validate performance improvement by training 1.2B models. Chameleon demonstrates gains in both perplexity and downstream task accuracy (see Q3 for Reviewer jzfM).

[1] Mallen et al. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. ACL (2023).

[2] Parmar et al. Data, data everywhere: A guide for pretraining dataset construction. ACL 2024.

Final Decision

The manuscript introduces a data-mixing framework leveraging kernel-based scores for data curation in both pretraining and finetuning. The proposed method demonstrates promising empirical results, often at a significantly reduced computational cost compared to existing methods like DoReMi and DoGE.

The review process identified several areas that the authors should address to further strengthen the manuscript for its camera-ready version. A primary concern across multiple reviewers was the need for a more detailed exposition of a principled understanding underpinning the approach, specifically regarding why KRLS is effective for domain reweighting and the intuitive connection between KRLS scores and data mixing strategies. I will not detail additional dimensions for improvement (e.g., better visualizations of KRLS scores or various ablations), but overall incorporating reviewers’ feedback will strengthen the grounding and empirical validity of the work and further clarify the advantages of the proposed method.