HyperINF: Unleashing the HyperPower of Schulz's Method for Data Influence Estimation
We propose HyperINF, an efficient and accurate influence function approximation which leverages the hyperpower method, specifically Schulz's iterative algorithm.
Abstract
Reviews and Discussion
This paper utilizes influence functions for estimating the importance of training data points. The goal of the proposed method (HyperINF) is data selection for finetuning LLMs and VLMs. Experiments include a mislabeled data detection task with RoBERTa-large, Llama2-7B finetuning on four reasoning datasets, and a VLM (CLIP ViT-large, Llama2-7B). On mislabeled data detection, the proposed method significantly outperforms the baselines. On the finetuning data selection tasks (LLM, VLM), the proposed method provides competitive performance.
HyperINF approximates the Hessian using generalized Fisher Information Matrix and uses Schulz's method for matrix inverse approximation. Compared to baselines (LISSA, DataInf), the paper shows strong convergence guarantees at higher dimensions.
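For readers unfamiliar with the core iteration, here is a minimal NumPy sketch of Schulz's method for approximating a matrix inverse. This is an illustrative reconstruction from the method's standard form, not the authors' implementation; the damped Gram matrix used as the test input is an assumed stand-in for a Fisher-style Hessian approximation.

```python
import numpy as np

def schulz_inverse(A, num_iters=30):
    """Approximate A^{-1} with Schulz's hyperpower iteration
    X_{k+1} = X_k (2I - A X_k), which converges quadratically
    when the spectral radius of (I - A X_0) is below 1."""
    n = A.shape[0]
    # Standard safe initializer: guarantees convergence for nonsingular A,
    # since the eigenvalues of A X_0 then lie in (0, 1].
    X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    I = np.eye(n)
    for _ in range(num_iters):
        X = X @ (2.0 * I - A @ X)
    return X

# Toy example: a damped Gram matrix, the general shape of Fisher-style
# Hessian approximations (assumed setup, not the paper's exact one).
rng = np.random.default_rng(0)
G = rng.standard_normal((200, 50))
A = G.T @ G / 200 + 0.1 * np.eye(50)  # SPD; damping keeps it well-conditioned
X = schulz_inverse(A)
print(np.linalg.norm(X @ A - np.eye(50)))  # residual should be tiny
```

Because the loop uses only matrix multiplications, it maps well onto GPU hardware, which is part of the efficiency argument made for the method.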
Reasons to Accept
- The proposed method provides strong results on the mislabeled data detection task, and competitive performance on data selection for LLM finetuning (LoRA and dense) and VLM finetuning.
- HyperINF provides stronger convergence guarantees at higher dimensions compared to baselines.
- The paper is very well written, and covers a lot of additional results in the Appendix.
Reasons to Reject
- The proposed method (HyperINF) performs well but at a significant compute and memory cost compared to some of the baselines (e.g., DataInf). Notably, the performance of DataInf is often comparable to HyperINF. I think the paper can benefit from a discussion about efficiency, and its trade-off with performance.
- The results for LLM and VLM finetuning were reported using a single 7B model. It's unclear how the proposed method (HyperINF) scales to other model sizes. Since the experiments are based on finetuning dataset, I believe it might be feasible to run experiments with additional model sizes.
- There is considerable variance in relative performance between HyperINF and baselines across finetuning datasets (Table 3). Additional discussion might be needed. Interestingly, the results are stronger for VLM data selection.
Questions for Authors
Minor comments:
- Section 5.3 title incorrectly states the data selection for VLM 'pretraining', but the results are reported for selection of instruction-finetuning data. The abstract and introduction accurately reflect the experiments.
- The paper misses a relevant prior work (Choe et al., 2024). I understand the focus of this work is the use of influence functions for LLM finetuning, whereas Choe et al. is more focused on attribution to pretraining data. However, there is considerable overlap in methodology.
- The paper cites the same paper (Grosse et al., 2023) three separate times.
References
- Choe et al., 2024. What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions. https://arxiv.org/abs/2405.13954.
We thank the reviewer for their valuable feedback and constructive suggestions! We address each concern below:
Q1: Comparison between DataInf and HyperINF
According to Table 1, we acknowledge that HyperINF incurs higher complexity in dense finetuning compared to DataInf, while it can achieve better efficiency in LoRA finetuning with a large LoRA rank. We also note that DataInf can incur a large approximation error according to Fig. 1, which leads to suboptimal empirical results on LLM training (Fig. 2, Table 2, Table 3, Table 4).
Nevertheless, we acknowledge that DataInf is an efficient strategy for large-scale dense finetuning compared to other iterative approximation algorithms. We will add this discussion to the updated version.
Q2: Model size scalability
Due to resource constraints, our current experiments are conducted only on 7B models. However, the theoretical and complexity analysis generalizes to models of any scale: (1) Schulz's method converges independently of dimension (Appendix D); (2) the GFIM approximation improves efficiency for LoRA finetuning on larger models (e.g., 13B).
We agree with the reviewer that adding the empirical results across various model scales can strengthen the current paper. We will add the experimental results on 1B LLMs in the updated version of the manuscript.
Q3: Task-dependent Data Selection Impact
We thank the reviewer for bringing up the interesting discussion!
The improvements from data selection methods vary across tasks (Table 3, Table 4), likely due to differences in data quality across the training datasets. For example, on Hellaswag and LogiQA, all data selection methods achieve only comparable or marginally improved performance over random selection, perhaps because the data quality within those training sets is already quite high. Also, sample-level data selection tends to extract similar high-quality datapoints, which risks duplication or even overfitting if the selected subset is repeated for many epochs.
We will add the discussion in the updated version of the manuscript.
Q4: Minor comments
- We will correct the title to "Data Selection for VLM Instruction-Tuning".
- We thank the reviewer for the pointer! We will add the work in the related work section to acknowledge methodological overlaps.
- We will fix the redundant citations to Grosse et al. (2023) in the updated version.
Thanks for your response. Just a few follow-up comments:
Model size scalability: I am not entirely sure I understand the resource constraint here for >7B models. Is memory the constraint? If so, can you report memory use to complement the runtime analysis presented in the Appendix?
Task-dependent data selection impact: I am looking forward to seeing more discussion on this. I would also recommend including a comparison between LoRA and dense finetuning settings. For instance, for LogiQA, your method does show significant gains under dense finetuning but not with LoRA. Would the dataset quality argument still hold true?
Thanks for the follow-up questions!
Q1: Model size scalability
The main resource constraint for >7B models is GPU accessibility, since the corresponding experiments would require more time and be more expensive. As for memory usage, we report it (in MiB) below, across different LoRA ranks, to complement the runtime analysis presented in the appendix:
| Method | | | | | |
|---|---|---|---|---|---|
| LiSSA | 12010 | 13540 | 16598 | 22718 | 34954 |
| DataInf | 12010 | 13540 | 16600 | 22720 | 34956 |
| HyperINF | 12010 | 13540 | 16600 | 22720 | 34956 |
As the table shows, there is no significant difference in memory usage across methods.
Q2: Task-dependent data selection impact
We appreciate the reviewer for bringing up the follow-up discussion!
In dense fine-tuning, we use the gradients from the last layer of the model to select data points. Compared to the LoRA modules, the last layer may carry more sufficient and relevant information [1] for this specific task, leading to the selection of more "influential" data points and thus potentially better performance. Therefore, this does not contradict the data quality argument above.
We will add the discussion in the updated version of the manuscript.
[1] ShortGPT: Layers in Large Language Models are More Redundant Than You Expect
The proposed HyperINF efficiently approximates influence functions using Schulz’s method and GFIM, enabling accurate data attribution for large models with low memory cost and strong empirical results across tasks.
Reasons to Accept
- HyperINF introduces the use of Schulz's hyperpower method combined with the Generalized Fisher Information Matrix (GFIM) to provide highly accurate and computationally efficient influence function approximation.
- Across diverse and challenging benchmarks, including LLM/VLM fine-tuning and mislabeled data detection, HyperINF shows GPU-accelerated convergence, constant memory scaling, and low-rank efficiency.
Reasons to Reject
GFIM-based low-rank Hessian approximations rely on i.i.d. zero-mean gradient assumptions, which may not always hold in real-world settings.
Questions for Authors
Can HYPERINF be adapted to settings where only predictions or losses (not gradients) are accessible, such as API-based LLMs?
Q1: Theoretical Limitations on Real-world Settings
We thank the reviewer for this insightful observation about the i.i.d. zero-mean gradient assumptions in our GFIM-based approach. We acknowledge that these are indeed strong theoretical assumptions that merit careful discussion.
- Theoretical Justification: The zero-mean gradient assumption becomes increasingly valid as models approach convergence during training, where the expected gradient approaches zero. For the i.i.d. assumption, while individual gradients may exhibit some correlation, our empirical analysis in Figures 19-20 (Appendix J.3) demonstrates that gradient matrices maintain near-full column rank across diverse settings, suggesting sufficient linear independence for our method to be effective.
- Empirical Robustness: Importantly, our extensive experiments across LLM/VLM fine-tuning and mislabeled data detection demonstrate that HyperINF performs robustly even when these assumptions are violated in practice. This suggests that while the theoretical analysis relies on these assumptions, the practical performance is more forgiving of assumption violations.
- When Assumptions Hold Better: The assumptions are more likely to be satisfied in: (1) later stages of training when models are near convergence, (2) well-regularized models, and (3) scenarios with sufficient data diversity. We observe that HyperINF performs particularly well in these settings.
- Comparison to Alternatives: We note that existing influence function methods also rely on similar or stronger assumptions (e.g., convexity assumptions in classical IF methods), and our empirical results show significant improvements over these baselines even in challenging real-world scenarios.
Q2: Adapting to the black-box (API-based) LLMs
Our method builds on the traditional influence function formulation [1], which cannot be directly applied to black-box APIs, since it requires gradient computation with respect to internal model parameters that are inaccessible.
However, we appreciate this important practical question given the prevalence of API-based LLMs. A promising direction could be gradient-free estimation, e.g., zeroth-order approximations of the required gradients. We leave this to future research.
[1] Understanding Black-box Predictions via Influence Functions. Koh and Liang, 2017.
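To make the gradient-free direction mentioned above concrete, below is a minimal sketch of a generic two-point zeroth-order gradient estimator. This is only an illustration on a toy loss, not part of HyperINF; for API-based models, the perturbations would have to target whatever inputs the query interface exposes (e.g., prompts or embeddings) rather than internal weights.

```python
import numpy as np

def zo_gradient(loss_fn, theta, mu=1e-3, num_samples=200, seed=0):
    """Two-point zeroth-order gradient estimate (SPSA/random-direction style):
    g ~ E_u[ (L(theta + mu*u) - L(theta - mu*u)) / (2*mu) * u ],
    which needs only loss evaluations -- no backpropagation."""
    rng = np.random.default_rng(seed)
    g = np.zeros_like(theta)
    for _ in range(num_samples):
        u = rng.standard_normal(theta.shape)  # random probe direction
        g += (loss_fn(theta + mu * u) - loss_fn(theta - mu * u)) / (2.0 * mu) * u
    return g / num_samples

# Sanity check on a quadratic loss 0.5*||x||^2, whose true gradient is x itself.
theta = np.array([1.0, -2.0, 0.5])
g = zo_gradient(lambda x: 0.5 * np.dot(x, x), theta, num_samples=500)
print(g)  # approximately [1.0, -2.0, 0.5], up to Monte Carlo noise
```

The estimate is unbiased up to O(mu^2) smoothing error but has Monte Carlo variance, which is precisely why applying such estimators at LLM scale remains an open research question.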
Thanks for your additional thoughts on my questions, and I believe the current rating is sufficient.
This paper is well-written and easy for readers to follow. It proposes an original and novel method for efficiently and accurately approximating influence functions, an important topic for us to understand model interpretability and data quality. The results show high effectiveness and the theoretical analysis shows clear efficiency improvement compared to existing methods.
Reasons to Accept
- The influence function is a principled way to evaluate training data's effect on model performance, but the high computational cost of its matrix inversion has limited its application to LLMs. The Schulz method-based approximation in this paper is theoretically sound and practically effective.
- The authors conducted comprehensive evaluations on large-scale VLM and LLM models, showing the scalability of the proposed method, even on dense finetuning. The results are convincing compared to existing methods.
Reasons to Reject
- The experimental results shown in Section 5 are indirect. It would be better if the authors could show some direct results, for example the test loss after fine-tuning with and without a subset of examples, or some qualitative results on the detected "highly influential" examples, like Figures 1 & 2 in Koh & Liang. Although such results can be very noisy, they would give readers a better sense of how good the proposed method's approximation actually is.
We thank the reviewer for their valuable feedback! We provide the sample analysis as below:
Q1: Direct results demonstration
We thank the reviewer and agree that providing some "highly influential" examples is valuable. For the Mislabeled Data Detection task, we therefore provide the top-ranked data point (i.e., high-quality data, whose label is correct) and the bottom-ranked data point (i.e., low-quality data, whose label is corrupted to be incorrect) from HyperINF on the MRPC task.
- Top-ranked: {'sentence1': 'Microsoft , which acquired Virtual PC from Connectix in February , said a fix for the problem is not around the corner .', 'sentence2': 'Virtual PC , which Microsoft acquired from Connectix Corp. in February , will not run on the G5 , the company said .', 'label': 0}. Label 0 denotes that the two sentences are not semantically equivalent; this label is correct, not corrupted.
- Bottom-ranked: {'sentence1': '" We acted because we saw the existing evidence in a new light , through the prism of ... Sept . 11th . "', 'sentence2': '" We acted because we saw the evidence in a dramatic new light , through the prism of our experience on 9 / 11 . "', 'label': 0}. The label should be 1, since the two sentences are semantically equivalent, but it was corrupted to 0. HyperINF succeeds in detecting this mislabeled data point.
We will provide a comprehensive sample analysis in our revised version.
The authors focus on addressing the computational costs associated with data attribution by replacing the Hessian with a generalized Fisher Information Matrix. The paper then shows through several experiments (mislabel detection, data selection, multimodal fine-tuning) to back up the stability and performance of HyperINF.
Quality and clarity: Overall, the paper is well structured, and I can understand the core contributions the authors intend to make. I appreciate the work the authors put in to produce experimental results and to benchmark previous methods. Originality and significance: I am concerned there may be limited novelty due to overlap with previous literature, and a lack of evidence to strongly disentangle it on both the theoretical and experimental side.
Reasons to Accept
The authors present and show experimental results for another method to approximate the Hessian. This could provide another interesting tool in the data influence toolbox for researchers in this field.
Reasons to Reject
The paper outlines how it differentiates from previous work, but I think there isn't enough separation on the theoretical or experimental side. The paper aims to address the limitations of DataInf's O(d^2) approximation error, but it's not clear that HyperINF is much better (e.g., dense fine-tuning only occurs on the final layer, and it's not clear this is the right experimental setting since d is related to the rank of the weight updates). Additionally, substituting the Schulz method for Sherman-Morrison may actually detract from a desirable trait of DataInf, namely that it replaces an iterative method with a closed-form solution.
Questions for Authors
- Are there confidence intervals for the performance of methods in Table 3 and Table 4? Without these, it's hard to understand what the percentage improvements translate to (and the significance of the results).
- LoRA was done with r=64, but does this represent a common LoRA implementation? (e.g., the original LoRA paper goes from rank 1 to 64.) Can you justify this decision or provide results for lower ranks? Similarly, dense fine-tuning is only applied to the last layer; should this experiment apply full fine-tuning to the same layers that use the LoRA approximation?
I will consider moving my review up if the authors can better explain how HyperINF overcomes limitations in datainf's approximation errors with respect to the rank of the weight updates.
We thank the reviewer for the feedback and insightful questions! We provide corresponding clarifications as follows:
Q1: Comparison between DataINF and HyperINF
We care not only about efficiency but also about the accuracy of the Hessian inverse approximation. We agree with the reviewer that DataInf has the best theoretical efficiency among the considered methods, including HyperINF, in the dense-finetuning context. However, we refer the reviewer to Fig. 1, which demonstrates that DataInf and LiSSA can both incur large approximation errors as the scale of tunable parameters and datasets increases. In comparison, HyperINF exhibits better convergence. On efficiency, we note that with LoRA fine-tuning (i.e., the main scenario considered in DataInf), HyperINF improves the theoretical complexity with respect to the LoRA rank (Table 1). Additionally, HyperINF relies on matrix multiplications, which further improves empirical efficiency on modern GPUs (Fig. 7).
Q2: Confidence intervals for large experiments
Table 3 and Table 4 present results on Llama-7B and VLM (CLIP + Llama-7B) tuning, so the large computational cost prevents us from repeating these experiments several times. However, we have provided confidence intervals for all smaller-scale experiments on RoBERTa-large (Table 2), which gives a concrete understanding of the benefit of HyperINF.
Q3: Implementation on different LoRA ranks
We agree with the reviewer that more complete experiments across all LoRA ranks would bring better understanding. However, because of compute limits, we chose the largest rank (r=64), which may lead to larger accuracy improvements (according to [1]) and therefore more observable differences in downstream performance. To study the impact of LoRA ranks, we provide results for lower ranks on smaller-scale experiments with RoBERTa-large and the GLUE benchmark (i.e., the settings of the original LoRA paper [2]). We refer the reviewer to Figures 2, 3, and 5 for more details.
Q4: Clarification on dense finetuning
For the dense-finetuning experiments, we select data using the gradients from the last layer, then tune all model parameters on the selected datapoints. We use only the last-layer gradients as they are the most impactful according to [3].
We hope our answers resolve most of your concerns. If you have any further questions, don’t hesitate to contact us. Your valuable suggestions have helped us improve the manuscript!
[1] A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA
[2] LoRA: Low-Rank Adaptation of Large Language Models
[3] ShortGPT: Layers in Large Language Models are More Redundant Than You Expect
Thanks for the response! I've updated my score to an accept.
Thanks for raising the score and we do appreciate your valuable suggestions!
In this paper the authors propose a memory-efficient approach to data attribution by approximating influence functions using a generalized Fisher information matrix and Schulz's method. They evaluate the approach across a variety of practical benchmarks, including data selection for LLM fine-tuning and mislabeled data detection, where the approach performs competitively with or outperforms baselines.
Reviewers noted that the paper is well-written and demonstrated strong results through a comprehensive evaluation, as well as stronger convergence guarantees than other approaches. Reviewers raised concerns about the generalization of the approach to other model scales (experiments limited to 7B LLMs), or in real-world settings where the zero-mean gradient assumption used to justify approximating the Hessian with the GFIM may not be realistic. However, reviewers agreed that the paper was worthy of acceptance despite these limitations.