PaperHub
Overall score: 7.8/10
Decision: Poster (4 reviewers)
Ratings: 4, 5, 4, 6 (min 4, max 6, std 0.8)
Confidence: 3.5
Novelty: 3.0 | Quality: 2.5 | Clarity: 2.8 | Significance: 2.8
NeurIPS 2025

Neural Networks for Learnable and Scalable Influence Estimation of Instruction Fine-Tuning Data

OpenReview | PDF
Submitted: 2025-05-06 | Updated: 2025-10-29
TL;DR

We replace expensive language models with cheap neural networks to estimate the value of data, thereby saving significant computational costs while maintaining performance.

Abstract

Keywords
Influence Estimation, Data Valuation, Data Attribution

Reviews and Discussion

Official Review
Rating: 4

This paper proposes a method named NN-CIFT, which efficiently estimates the influence of data in instruction fine-tuning through a small Influence Network. Traditional influence function calculations are costly, especially on large datasets, and do not scale. NN-CIFT alleviates this problem by calculating the influence values of a small amount of training data using the original influence function, training a small neural network on this data, and then using this network to estimate the influence values of the remaining data.
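To make the reviewed procedure concrete, below is a minimal PyTorch sketch of the two steps described above. This is not the authors' implementation: `influence_fn` is a hypothetical stand-in for any expensive LLM-based influence function (e.g. DELIFT or LESS), and the sentence embeddings `emb` are assumed to be precomputed.

```python
import torch
import torch.nn as nn

def fit_influence_network(emb, influence_fn, seed_frac=0.05, epochs=200):
    """Sketch of NN-CIFT's two steps: (1) score a small seed subset with the
    expensive influence function; (2) fit a tiny MLP on those pairwise values
    so that cheap forward passes can replace influence_fn for all other pairs.

    emb: (n, d) tensor of precomputed sentence embeddings.
    influence_fn(i, j) -> float: hypothetical stand-in for an LLM-based
    influence function.
    """
    n, d = emb.shape
    seed = torch.randperm(n)[: max(2, int(seed_frac * n))].tolist()

    # Step 1: ground-truth influence values on the small seed subset only.
    pairs = torch.stack([torch.cat([emb[i], emb[j]]) for i in seed for j in seed])
    targets = torch.tensor([influence_fn(i, j) for i in seed for j in seed])

    # Step 2: train the small Influence Network on concatenated pair embeddings.
    net = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(pairs).squeeze(-1), targets)
        loss.backward()
        opt.step()
    return net
```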

Strengths and Weaknesses

Strengths:

  • A significant reduction in time cost is achieved across multiple influence functions.
  • The proposed method applies to various downstream tasks.
  • A smaller model achieves excellent results: the Influence Network is only 0.0007% the size of the LLM, yet its average MSE is only 0.067, maintaining good estimation capability.
  • Extensive experiments are conducted on three models and three tasks to verify the method's effectiveness.

Weaknesses:

  1. The reliance on the embedding model is not included in the cost analysis. Although the authors explain that the cost of the embedding model is excluded, the model is indispensable in training/inference, and its cost still needs to be considered in deployment.
  2. Although the learning process of NN-CIFT is effective, the paper lacks an in-depth theoretical explanation of why simple neural networks can generalize influence estimation. It only offers an explanation from the perspective of the sparsity of the empirical distribution, which is somewhat intuitive.
  3. The innovation lies in engineering optimization. Although the method has practical value, it leans towards acceleration at the engineering level rather than fundamentally advancing the theory of influence functions.

Questions

  1. How strongly bound is the Influence Network to the embedding model? If the embedding model is changed, can the estimation ability and downstream performance of NN-CIFT still be maintained?
  2. Why choose a single-layered MLP? Have you tried deeper networks? This might improve the quality of the influence estimation.
  3. Can the Influence Network be trained on one task and directly used under another task/data distribution?

Limitations

yes

Formatting Issues

NA

Author Response

We appreciate the reviewer for pointing out that NN-CIFT achieves "a significant reduction in time" while "maintaining a good estimation capability". Below we address the concerns and questions.

the cost of the embedding model is not included, it is indispensable in training/inference

We agree that the cost of the embedding model is indispensable to the training of the InfluenceNetwork, but it is considered an offline cost, as mentioned in lines 158-161 of the manuscript.

[this work] explains from the perspective of the sparsity of the empirical distribution ... this method has practical value, it is more inclined towards accelerating optimization ... rather than fundamentally advancing the theory of influence functions.

While we do not theoretically motivate the reason why influence values are generally sparse, our work is one of the first works to show that influence values are sparse (Section 4.4, specifically Figure 3 and lines 202-204), and we present a methodology that exploits this fact. We will clarify this in our conclusion/future works section in the final version of the paper.

Q1. How strongly bound is the Influence Network to the embedding model?

| Embedding Model | Q1 | Q2 | Q3 | Q4 |
| --- | --- | --- | --- | --- |
| BAAI/bge-large-en-v1.5 | 0.051 | 0.084 | 0.074 | 0.084 |
| Qwen/Qwen3-Embedding-0.6B | 0.076 | 0.087 | 0.087 | 0.089 |
| intfloat/e5-mistral-7b-instruct | 0.026 | 0.083 | 0.084 | 0.083 |
| Snowflake/snowflake-arctic-embed-l-v2.0 | 0.077 | 0.087 | 0.087 | 0.088 |

Table 12: Varying embedding models and their corresponding MSE values (averaged across 5 runs) between estimated influence values and ground truth influence values, with 5% selected data. These results are for learning DELIFT influence values. As shown, NN-CIFT is invariant to the embedding model selected, and is able to effectively estimate the influence values.

Please refer to Table 12 above -- we show that NN-CIFT is invariant to the choice of the embedding model. As the MSE between the predicted influence values and ground truth influence values remains small, we posit the downstream performance of NN-CIFT will be maintained no matter the choice of embedding model.
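For readers who wish to reproduce this style of ablation, here is a hypothetical sketch using the sentence-transformers library (model names follow Table 12; `train_and_eval` is an assumed callable covering NN-CIFT training plus the held-out MSE computation, which the rebuttal does not spell out):

```python
# Hypothetical ablation sketch, not the authors' code: re-embed the data with
# each candidate model, retrain the Influence Network, and compare MSE against
# ground-truth influence values.
from sentence_transformers import SentenceTransformer

EMBEDDING_MODELS = [
    "BAAI/bge-large-en-v1.5",
    "Qwen/Qwen3-Embedding-0.6B",
    "intfloat/e5-mistral-7b-instruct",
    "Snowflake/snowflake-arctic-embed-l-v2.0",
]

def embedding_ablation(texts, train_and_eval):
    """train_and_eval(embeddings) -> mse is an assumed stand-in for the
    NN-CIFT training loop plus held-out MSE evaluation."""
    for name in EMBEDDING_MODELS:
        emb = SentenceTransformer(name).encode(texts, convert_to_tensor=True)
        print(f"{name}: MSE = {train_and_eval(emb):.3f}")
```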

Q2. Why choose a single-layered MLP?

We address this in Appendix A.1, where we evaluate the InfluenceNetwork with varying sizes. For easy viewing, we plot the number of parameters versus the MSE and show that the MSE remains very low for larger and smaller networks alike.
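As an illustration of such a size sweep (a hypothetical sketch, not the paper's experiment code; `train_and_eval` is an assumed training-and-evaluation callable):

```python
import torch.nn as nn

def make_influence_network(in_dim, width, depth):
    """Build an MLP of varying capacity over concatenated pair embeddings."""
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, width), nn.ReLU()]
        d = width
    layers.append(nn.Linear(d, 1))
    return nn.Sequential(*layers)

def capacity_sweep(in_dim, train_and_eval):
    # Report parameter count against held-out MSE for each configuration.
    for depth in (1, 2, 3):
        for width in (64, 256, 1024):
            net = make_influence_network(in_dim, width, depth)
            n_params = sum(p.numel() for p in net.parameters())
            print(f"depth={depth} width={width} params={n_params} "
                  f"mse={train_and_eval(net):.3f}")
```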

Q3. Can the Influence Network be trained on one task and directly used under another task/distribution?

This is a great question -- we believe that the limitation of the InfluenceNetwork is that it is not robust to distribution shifts in the data. However, because NN-CIFT is lightweight, it can be retrained quickly to adapt to distribution shifts as well. We will add this discussion to the limitations section in the final version of our paper.


Overall, we thank Reviewer LcV6 for their time and thoughtful consideration. If we have addressed the concerns, we would greatly appreciate seeing this reflected in an increase in the scores.

Comment

Hello Reviewer LcV6, thank you for your time and efforts in reviewing our work. Please let us know if we have addressed your concerns, or of any additional questions you have. We would love the opportunity to resolve anything that is unclear!

Comment

Hello Reviewer LcV6, as we are approaching the end of the rebuttal period, we'd appreciate the opportunity to resolve any of your additional questions and concerns in hopes of revised scores. Thank you again for your review!

Comment

Hello Reviewer LcV6, thank you for acknowledging our rebuttal. With the end of the rebuttal period, we would like to request the chance to address any concerns you might have with our work in the hopes of making our work clear, concise, and rigorous. Thank you again for reviewing our work.

Official Review
Rating: 5

The paper proposes the use of a small surrogate model to directly predict the influence between pairs of training points. The surrogate model is trained using the "actual" influence between a subset of the original training data. Experiments show that filtering and re-training using the "actual" influence and the surrogate model's predictions give comparable results, while reducing computation by 75% to 99%.

Strengths and Weaknesses

Strengths

  • the idea is straightforward, fills a missing gap in the literature, and extensive empirical results across different influence functions and tasks show the benefit of the surrogate model.
  • the method can reduce computational cost, potentially adding only minimal complexity

Weaknesses

  • it would be useful to justify the choice of the particular BAAI embedding - why was it chosen, and does it generally work regardless of the underlying influence function being approximated?
  • it will be useful to give guidance on when the proposed method should or should not be used - does it work better for some influence functions than others?

Questions

  • how sensitive is the training of the surrogate model to hyperparameters? if its training can be plug-and-chug, it can be a lot easier to use, but if there needs to be experimentation, then the human engineering time will dominate that of the potential computational savings
  • can the method be viewed as denoising noisy influence estimates?

Limitations

yes

Final Justification

The authors performed experiments to show that the proposed method is not sensitive to the choice of embedding model, so that it can be used as-is on a variety of influence function methods without modification.

Formatting Issues

none

Author Response

We appreciate the reviewer for pointing out that NN-CIFT "fills a missing gap in the literature". Below, we address the concerns and questions.

Justify the choice of the particular BAAI embedding ... does it generally work regardless of the underlying influence function being approximated?

| Embedding Model | Q1 | Q2 | Q3 | Q4 |
| --- | --- | --- | --- | --- |
| BAAI/bge-large-en-v1.5 | 0.051 | 0.084 | 0.074 | 0.084 |
| Qwen/Qwen3-Embedding-0.6B | 0.076 | 0.087 | 0.087 | 0.089 |
| intfloat/e5-mistral-7b-instruct | 0.026 | 0.083 | 0.084 | 0.083 |
| Snowflake/snowflake-arctic-embed-l-v2.0 | 0.077 | 0.087 | 0.087 | 0.088 |

Table 12: Varying embedding models and their corresponding MSE values (averaged across 5 runs) between estimated influence values and ground truth influence values, with 5% selected data. These results are for learning DELIFT influence values. As shown, NN-CIFT is invariant to the embedding model selected and is able to effectively estimate the influence values.

We chose the BAAI embedding model because previous works (like DELIFT (SE)) relied on the same embedding model. Please refer to Table 12 above -- it shows that NN-CIFT is invariant to the embedding model. We will add these results to the final version of the paper.

guidance on when the proposed method should or should not be used

NN-CIFT can be used to replace any language-model based influence function that is later used to select a subset of data on which to fine-tune a language model. NN-CIFT's Step 1 assumes that the small subset of samples is representative of the rest of the dataset, hence it is difficult to say how well NN-CIFT will perform with distribution shifts. However, because NN-CIFT is lightweight, it can be retrained quickly to adapt to distribution shifts as well. We will add this analysis to the limitations section in the final version of our paper.

does it work better for some influence functions?

Not necessarily -- we choose to evaluate with SelectIT, DELIFT, DELIFT (SE) and LESS to showcase the wide applicability of NN-CIFT. SelectIT uses model confidence, DELIFT uses model performance, DELIFT (SE) uses embedding model similarity, and LESS uses model gradients to inform data valuation. NN-CIFT adapts to each case without modifications to the methodology.

Q1. how sensitive is the training of the surrogate model to hyperparameters?

We show these results in Figure 2 and Appendix A. Figure 2 shows that small subsets of data are sufficient for learning the InfluenceNetwork, and Appendix A shows that small networks are sufficient for learning to estimate influence values with high accuracy. Hence, the InfluenceNetwork does not require significant experimentation to achieve good performance.

Q2. can the method be viewed as denoising noisy influence estimates?

Very interesting point -- it depends on what we consider "noisy influence estimates". As mentioned in Section 4.4, the InfluenceNetwork essentially learns to model the outliers (where the influence values are very high and very low). The InfluenceNetwork can definitely smooth over influence values from non-interesting data, if we consider them noisy. We will add this note to the conclusion/future works section of our final paper.


Overall, we thank Reviewer 35kX for their time and thoughtful consideration. If we have addressed the concerns, we would greatly appreciate seeing this reflected in an increase in the scores.

Comment

Hello Reviewer 35kX, thank you for your time and efforts in reviewing our work. Please let us know if we have addressed your concerns, or of any additional questions you have. We would love the opportunity to resolve anything that is unclear!

Comment

Hello Reviewer 35kX, as we are approaching the end of the rebuttal period, we'd appreciate the opportunity to resolve any of your additional questions and concerns in hopes of revised scores. Thank you again for your review!

Comment

Hello Reviewer 35kX, with the end of the rebuttal period, we would like to request the chance to address any additional concerns you might have with our work in the hopes of making our work clear, concise, and rigorous. Thank you again for reviewing our work.

Comment

I appreciate the authors' additional experiments to show that the performance is not dependent on the choice of embedding model, the clarification that the training of the surrogate model is not sensitive to the choice of hyperparameters, and that the method applies as-is to a variety of influence function methods. Given this, I raise my score.

Comment

Thank you Reviewer 35kX -- we appreciate you increasing your score from a 4 to a 5 and your guidance for improving our work. Please let us know if there are any other concerns we can address.

Official Review
Rating: 4

This paper proposes using small neural networks—referred to as the InfluenceNetwork—to estimate influence values, thereby reducing the computational cost of calculating influence functions.

Strengths and Weaknesses

Strengths:

The idea is interesting, and this paper is easy to follow.

Weaknesses:

The comparison with baselines is currently insufficient to convincingly demonstrate the advantages of the proposed method. In particular, several existing approaches have been specifically designed to relax the computational burden of influence function estimation, such as the method introduced in [1], which leverages stochastic approximations to reduce the overhead. Including such methods in the empirical comparison would provide a more comprehensive and fair assessment of the proposed approach’s relative efficiency and accuracy.

Moreover, the paper does not clearly report the training time of the model or provide sufficient details to estimate it reliably. Without a clear accounting of the total computational cost—including both model training and influence estimation—it becomes challenging to evaluate the claimed gains in time efficiency. These limitations make it difficult for readers to assess the practical benefits of the proposed method, especially in real-world scenarios where computational resources are constrained.

To strengthen the contribution, the authors should include relevant baselines that target efficiency improvements and report precise runtime metrics for both training and inference stages.

[1] Tracing model outputs to the training data.

Questions

This paper is overall very interesting and presents a promising direction with potentially impactful contributions. The proposed method is well-motivated and addresses an important problem. However, as noted in the weaknesses, the evaluation of efficiency remains incomplete. The current comparison lacks key baselines that are specifically designed to improve the computational efficiency of influence estimation. Additionally, important details such as training time and overall computational overhead are either missing or insufficiently discussed. A more thorough and transparent analysis of efficiency—both in terms of runtime and resource usage—would significantly strengthen the empirical validation of the proposed approach and better support the claimed advantages.

Limitations

Yes

Final Justification

I'm convinced by the author's response.

Formatting Issues

NA

Author Response

We thank the reviewer for mentioning NN-CIFT "presents a promising direction with potentially impactful contributions" and that our paper is "easy to follow". Below we address the concerns and questions.

several existing approaches have been specifically designed to relax the computational burden of influence function estimation.

Great point -- these works, such as the one mentioned, find optimal ways to estimate gradients, to which influence estimation is a downstream application. With such a process, we run the risk of overly lossy influence estimation, as more information is lost at each step. NN-CIFT, on the other hand, compresses information in one shot. Hence, we do not compare our method against these baselines. Furthermore, for the particular baseline the reviewer has mentioned ([1]), there is no public code available.

the paper does not clearly report the training time of the model or provide sufficient details to estimate it reliably.

Table 6 contains the computational costs for influence estimation: this includes the time taken for language-model based influence estimation, as well as the InfluenceNetwork training time. We do not report downstream model fine-tuning time (after the subset of data is selected) because that is not directly related to our main contribution. Our work focuses on reducing the time for influence estimation. Also, in Table 6, the caption breaks down how each baseline's cost was measured. We'd be more than happy to clarify any questions here!


Overall, we thank Reviewer AAYF for their time and thoughtful consideration. If we have addressed the concerns, we would greatly appreciate seeing this reflected in an increase in the scores.

Comment

Hello Reviewer AAYF, thank you for your time and efforts in reviewing our work. Please let us know if we have addressed your concerns, or of any additional questions you have. We would love the opportunity to resolve anything that is unclear!

Comment

Hello Reviewer AAYF, as we are approaching the end of the rebuttal period, we'd appreciate the opportunity to resolve any of your additional questions and concerns in hopes of revised scores. Thank you again for your review!

Comment

Hello Reviewer AAYF, thank you for increasing your score from a 3 to a 4 -- we appreciate your guidance for improving our work. Please let us know if there are any other concerns we can address.

Official Review
Rating: 6

This paper introduces a novel algorithm for estimating the relative importance of data samples for LLM finetuning. This algorithm NN-CIFT (Neural Networks for Efficient Instruction Fine-Tuning) trains a 2-layer “Influence Network” on vector embeddings to then select a subset of samples from a given dataset. The Influence Network learns to predict a pairwise “influence” value i.e. similarity score between two data points. Their approach leads to good quality and speed performance over competitive alternative approaches.
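To illustrate the selection step this summary describes, here is a minimal sketch (not the authors' code) of how a trained Influence Network could score candidates with cheap forward passes. Averaging predicted influence over a reference set is an assumption made here for brevity; the paper's actual selection objective may differ (e.g. a submodular criterion).

```python
import torch

def select_top_k(net, cand_emb, ref_emb, k):
    """Score each candidate by its mean predicted influence over a reference
    set, then keep the k highest-scoring candidates (illustrative aggregation
    only)."""
    scores = []
    with torch.no_grad():
        for c in cand_emb:  # c: (d,) embedding of one candidate point
            pairs = torch.cat([c.expand(ref_emb.size(0), -1), ref_emb], dim=1)
            scores.append(net(pairs).mean())
    return torch.topk(torch.stack(scores), k).indices
```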

Strengths and Weaknesses

Strengths:

  • The paper is well written and the results are novel. This seems like a promising approach for efficient dataset selection for finetuning.
  • The authors show results across multiple finetuning datasets and multiple models (and model sizes from 1.5B - 22B). This is very important for showing how their approach generalizes across conditions
  • The authors do rigorous benchmarking with multiple metrics (e.g. ROUGE, vector embeddings, LLM-as-a-Judge)

Weaknesses:

  • As a ML practitioner, I am interested in the question of how to get to equal quality with fewer data samples during finetuning. This approach seems capable of answering this question, however the results in Tables 3-6 don’t address this directly. Rather, they just show that NN-CIFT is slightly better than alternative approaches/methods, and that it can achieve similar performance to SelectIT, LESS, DELIFT and DELIFT (SE) 77-99% of the time (which is impressive).

Questions

Q1: Line 30: The authors state the DELIFT (SE) compares the similarity of sentence embeddings between pairs of data, and that this is computationally expensive. Is this a minor comment, or something deeper? Please explain.

Q2: When you write “‘+NN-CIFT’ indicates using NN-CIFT to estimate influence values computed from the corresponding method’s influence function”, does this mean that in the example of “DELIFT+NN-CIFT,” DELIFT was used in “Step 1” on a very small subset of data, and then this subset was used to train the NN-CIFT influence network?

Q3: A lot of the same data points from Table 4 and Table 5 were calculated in the original DELIFT paper (e.g. DELIFT (SE) baseline ICL ROUGE score for MixInstruct). Were these values recalculated in your paper? It seems that the values are not necessarily identical.

Q4: The cost estimates in Table 1 don’t include the initial step of using the full sized LLMs to output influence values of a very small subset of data. Can you address this? Also, how small are these subsets? Have you explored the effects of dataset size in this first step?

Q5: In Appendix Figure 4, it is surprising that there are no scaling trends as you increase model size. Why doesn’t MSE decrease appreciably as the model size increases? Can you address this in the paper? It seems like there should be a relationship between influence network size and training data size (as in Figure 2 of the main text). Maybe this has to do with model capacity?

Limitations

Yes, the authors have addressed the limitations of their work.

Very minor comments:

  • It would be very helpful to bold and underline top values in Tables 3-5. It is hard to inspect by eye where NN-CIFT leads to strong improvements across the board
  • Line 34: “outlines the expenses” → maybe “efficiency considerations” instead?
  • Line 24, line 134 and elsewhere: the brackets around citations are missing
  • Line 421 in Appendix: “InfluenceNetwork’s” → Influence Networks
  • Line 453 in Appendix: “Tables ??”

Final Justification

The authors have adequately addressed my concerns. I have increased my score from 5 to 6 as a result.

Formatting Issues

NA

Author Response

We appreciate the reviewer for pointing out that NN-CIFT "generalizes across conditions" and is evaluated with "rigorous benchmarking with multiple metrics". Below, we address the concerns and questions.

Tables 3-6 don't address [the question of how to get equal quality with fewer data samples]. Rather, they just show that NN-CIFT is slightly better than alternative approaches/methods, and that it can achieve similar performance to [baselines] 77-99% of the time.

There is a slight misunderstanding here. The baselines SelectIT, LESS, DELIFT and DELIFT (SE) are influence functions that use large language models for influence estimation. Our method, NN-CIFT, replaces the language model with a small neural network. Tables 3-6 show that the small neural network in NN-CIFT is able to perform similarly to the language model-based baselines. Moreover, the 77-99% metric refers to the costs that were saved. In other words, NN-CIFT uses 77-99% less time while achieving similar performance. Finally, in Section 5, we show that selecting just 30% of data is able to achieve very similar performance to the full dataset -- showcasing how to achieve equal quality with fewer data samples.

Q1. DELIFT (SE) compares the similarity of sentence embeddings between pairs of data, and that this is computationally expensive.

This is a minor comment on the computational expensiveness of a baseline method.

Q2. in the example of "DELIFT+NN-CIFT", [was] DELIFT used in "Step 1" on a very small subset of data, and then this subset was used to train the NN-CIFT influence network?

Yes, that is correct! We show that using 5% of the training data is effective to learn to estimate influence values using a neural network.

Q3. Were [DELIFT and DELIFT (SE)] values recalculated in your paper?

Yes. Because their code base was available online, we reproduced their experiments and recalculated the values in our paper. The models and datasets are slightly different between the DELIFT paper and our paper, hence the values are slightly different as well. We choose more recent models, and we choose a more diverse set of tasks for datasets (MixInstruct is general instruction tuning, Alpaca is alignment, and MMLU is a knowledge-based task).

Q4.a The cost estimates in Table 1 don't include the initial step of using the full-sized LLMs to output influence values of a very small subset of data.

Table 1 is more for demonstration purposes. Table 6 contains the end-to-end time costs for training the network. We will add this note in Table 1 in the final version.

Q4.b how small are these subsets? Have you explored the effects of dataset size in this first step?

The subsets are 5% of the entire dataset. We do indeed explore the effects of the dataset size in Figure 2. We see that with larger subsets of data, the accuracy of the InfluenceNetwork does not change, hence 5% is sufficient and optimal.

Q5. Why doesn't MSE decrease appreciably as the model size increases?

A larger model would be able to choose better points and can better estimate the information present in a data point, but this doesn't change the fact that influence values are in a small range and do not vary drastically, as analyzed in Section 4.4. Hence, it is easy to use a neural network to learn a function to estimate influence values with high accuracy, regardless of the size of the language model.

Finally, thank you to the reviewer for the grammatical suggestions and stylistic changes. We will incorporate them into the final version of the paper.


Overall, we thank Reviewer R2g1 for their time and thoughtful consideration. If we have addressed the concerns, we would greatly appreciate seeing this reflected in an increase in the scores.

Comment

Hello Reviewer R2g1, thank you for your time and efforts in reviewing our work. Please let us know if we have addressed your concerns, or of any additional questions you have. We would love the opportunity to resolve anything that is unclear!

Comment

Hello Reviewer R2g1, as we are approaching the end of the rebuttal period, we'd appreciate the opportunity to resolve any of your additional questions and concerns in hopes of revised scores. Thank you again for your review!

Comment

Hello Reviewer R2g1, with the end of the rebuttal period, we would like to request the chance to address any additional concerns you might have with our work in the hopes of making our work clear, concise, and rigorous. Thank you again for reviewing our work.

Final Decision

This paper presents a method that uses a small surrogate neural network to approximate computationally expensive influence functions for data selection in LLM finetuning. The primary contribution is an empirical demonstration that this approach can reduce the time cost of data valuation while preserving the downstream performance of models finetuned on the selected data.

All reviewers are in agreement on the paper's acceptance based on extensive empirical validation across different models and tasks, and general applicability. The initial reviews raised several points for clarification, questioning the method's dependence on a specific embedding model and seeking more clarity on the end-to-end cost analysis and the choice of baselines. The reviewers were satisfied with the authors' response, which included new experiments that addressed the main concerns.

For the camera-ready version, I recommend the authors integrate the clarifications and new results from their rebuttal into the main body of the paper. This includes: (1) the new experimental results demonstrating the method's robustness to different embedding models, (2) the clarification of the end-to-end cost analysis and the justification for the chosen baselines, and (3) the expanded discussion on the method's limitations, particularly concerning distribution shifts.

The key reason for my recommendation is the paper's practical utility: it offers a simple, general, and validated solution to the expensive problem of data selection for LLM finetuning, making it highly relevant to a broad audience of practitioners.