PaperHub
5.5/10
Rejected · 4 reviewers
Ratings: 3, 3, 8, 8 (min 3, max 8, std. dev. 2.5)
Confidence: 4.3
Correctness: 3.3
Contribution: 3.3
Presentation: 3.0
TL;DR

We efficiently scale the influence-function-based training data attribution to recent LLMs and their massive training datasets.

Abstract

Keywords
data attribution, influence functions, LLMs, interpretability

Reviews and Discussion

Official Review
Rating: 3

This paper presents a novel large-scale training data attribution method called LOGRA (Low-Rank Gradient Approximation), which aims to enhance the efficiency of influence functions on large models and datasets. The core contribution of the paper is the use of low-rank projections on input and output activations, significantly reducing the computational complexity and memory overhead of Kronecker product calculations. By performing low-rank projections directly on input and output activations, LOGRA effectively approximates the weight gradients, making influence functions scalable to larger models and datasets. To validate LOGRA's performance, the authors also introduce an open-source software package, LOGIX, and conduct experiments on models like GPT2-XL, Pythia-1.4B, and Llama3-8B-Instruct, demonstrating that LOGRA outperforms existing methods in terms of memory and computational efficiency.

Strengths

  • Contribution: The paper introduces an innovative low-rank gradient approximation method, LOGRA, which applies low-rank projections to input and output activations, significantly improving the efficiency of influence functions on large models and datasets. LOGRA’s approach is novel, not only reducing the complexity of Kronecker products but also broadening the applicability of influence functions through low-rank projections. Compared to existing influence function methods like EKFAC and Arnoldi IF, LOGRA's strategy in gradient computation is distinct, enabling more efficient training data attribution on large models and datasets.
  • Comprehensive Experiments: The authors conduct extensive experiments across various large models, including GPT2-XL, Pythia-1.4B, and Llama3-8B-Instruct. These experiments cover different model scales and datasets, thoroughly demonstrating LOGRA’s advantages in computational and memory efficiency while validating its reliability in influence function attribution accuracy. The experimental design is well-structured, and the results are clear, providing strong evidence of LOGRA's effectiveness.

Weaknesses

  • Limited Applicability to Specific Layer Types: LOGRA’s low-rank projection mainly applies to linear layers and QKV-generating linear layers, while its effectiveness is limited in other types of layers, such as convolutional layers. It is recommended to explore how the low-rank approximation strategy could be extended to a wider variety of layer types, improving LOGRA's applicability and performance across different models.
  • Potential Impact of Low-Rank Projections on Attribution Accuracy: The paper lacks a detailed discussion of the potential negative impact of low-rank projections on attribution accuracy, as well as how to balance efficiency with accuracy. It is recommended to include a quantitative analysis of the approximation errors under different low-rank dimensions, which would help readers better understand how LOGRA’s attribution accuracy changes with varying ranks.
  • Lack of Discussion on Dimensionality Reduction Efficiency: LOGRA employs PCA and random projection for dimensionality reduction, yet the paper does not discuss the efficiency challenges associated with these methods in detail. While PCA effectively preserves information, its covariance matrix computation and eigen-decomposition can be computationally intensive in high-dimensional scenarios, potentially impacting LOGRA’s overall runtime efficiency. On the other hand, random projection is faster but often requires ensemble techniques to mitigate randomness, introducing additional computational overhead. It is recommended to include a discussion on the efficiency trade-offs of these methods to help readers better understand LOGRA's runtime considerations.

Questions

  1. How can LOGRA's applicability be extended to cover more types of layers?

  2. How does low-rank projection impact attribution accuracy, and can it be quantified?

  3. Could you provide the running time and more technical details of PCA and random projection?

Comment

We appreciate your reviewing effort. Here, we attempt to address your concerns as best as we can.


Limited Applicability to Specific Layer Types

In L174, we noted that most popular layers in neural networks, including convolutional and linear layers, can be formulated as 2D matrix multiplication. In Section 4.1, we indeed performed LDS and brittleness experiments with ResNet-9, which includes many convolutional layers.


Potential Impact of Low-Rank Projections on Attribution Accuracy

In Section 4.1, we quantitatively compare EK-FAC (which does not use low-rank gradient projections) and our proposed method LoGra. We show that LoGra obtains comparable performance to EK-FAC, despite being an order of magnitude more efficient (please see Table 1). In our updated manuscript, we will run additional LDS experiments on the effect of LoGra's rank.


Lack of Discussion on Dimensionality Reduction Efficiency

Thanks for your comment. To compute the covariance matrix for PCA, we ran one additional training epoch to collect forward/backward activations for all training examples. The eigendecomposition of this covariance matrix is computationally efficient - in our experiments with large-scale models like Llama-3 8B, this step required less than one minute to complete. Furthermore, since the eigenvectors are only used for the initialization of our LoGra encoder/decoder, this eigendecomposition step is a one-time computation that does not need to be repeated. Lastly, while it might be appealing to ensemble the scores when using random projections, we note that our results in Section 4.1 did not use any ensembles and got comparable performance with the PCA.
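For concreteness, here is a minimal sketch (our illustration, not the LogIX code; names and shapes are placeholders) of the PCA-style initialization described above: accumulate the covariance of one side's activations during the extra logging epoch, eigendecompose it, and keep the top-r eigenvectors as that side's projection.

```python
import torch

def pca_projection(activations: torch.Tensor, r: int) -> torch.Tensor:
    """activations: (num_tokens, n) forward (or backward) activations collected
    during the extra logging epoch; returns an (n, r) projection matrix."""
    cov = activations.T @ activations / activations.shape[0]  # (n, n) covariance
    eigvals, eigvecs = torch.linalg.eigh(cov)                 # eigenvalues in ascending order
    return eigvecs[:, -r:]                                    # keep the top-r eigenvectors

# Usage sketch: initialize one side of the LoGra encoder/decoder for a layer.
acts = torch.randn(10_000, 256)        # stand-in for logged input activations
P_in = pca_projection(acts, r=64)      # (256, 64)
```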

Thanks again for your review, and we hope our rebuttal has addressed your concerns and helps you evaluate our work more positively!

Comment

Thank you for your response; however, it did not fully address my concerns, so I will maintain my original score.

Limited Applicability to Specific Layer Types

Line 174 briefly mentions that most neural network layers, such as linear and convolutional layers, can be reduced to matrix multiplication. However, this statement does not sufficiently cover other mainstream layer types with learnable parameters. To enhance the generalizability and theoretical soundness of LOGRA, I suggest the authors analyze whether the low-rank projection strategy applies to other commonly used layers with learnable parameters, such as convolutional layers, normalization layers (e.g., LayerNorm and BatchNorm), recurrent layers (e.g., LSTM and GRU), and sparsely connected fully connected layers.

For convolutional layers, a more detailed explanation of how LOGRA leverages the unique structural properties, such as sparse connections and shared weights, would be beneficial. Similarly, for normalization layers with learnable scaling and shifting parameters, it would be helpful to clarify if LOGRA can effectively handle them, given their typically small parameter sizes. For recurrent layers, due to the complexity and interdependence of weight matrices, discussing whether LOGRA requires any specialized adaptations, such as block-wise handling, would strengthen the analysis. Sparse fully connected layers, which have unique weight matrix structures, also merit consideration.

Potential Impact of Low-Rank Projections on Attribution Accuracy

I have noticed that LoGra obtains comparable performance to EK-FAC, despite being an order of magnitude more efficient. However, it is recommended to include a quantitative analysis of the approximation errors under different low-rank dimensions.

Lack of Discussion on Dimensionality Reduction Efficiency

Could you provide a table with more details? To make the discussion more comprehensive, could you provide additional details on the runtime for each step of the reduction process, the specific parameters (e.g., target rank) used for PCA and random projection, and the models tested in your experiments? A comparison of runtime and accuracy trade-offs between PCA and random projection would also help clarify their practical implications.

Comment

Thanks for your response.


Limited Applicability of Specific Layer Types

Thanks for your suggestion. Indeed, LoGra is mostly designed for layers that involve matrix multiplication such as linear, convolutional, and RNN/LSTM layers.

In detail, the number of parameters in normalization layers is typically much lower than in linear/convolutional layers (if the hidden dimension is $n$, the number of parameters in normalization layers is $O(n)$, whereas it is $O(n^2)$ for linear/conv layers). Therefore, normalization layers typically incur substantially lower memory/compute costs in influence computation as well as gradient projection.

For convolutional layers, the weight tensor shape is $(d_{out}, d_{in}, k_h, k_w)$, where $d_{in/out}$ are the in/out channel dimensions and $k_{h/w}$ are the kernel height/width. Typically, the weight is reshaped into a 2D matrix of shape $(d_{out}, d_{in}\times k_h \times k_w)$. In addition, we note that weights in typical recurrent networks (e.g., RNN, LSTM, GRU) also have 2D shapes. As long as the weight matrix is 2D, we can apply Equations (4) & (5) and achieve LoGra gradient projection. We will make this clearer in our final manuscript.
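To make the reshaping concrete, below is a minimal PyTorch sketch (our illustration, not the LogIX implementation; the shapes, random projections, and fake backprop signal are placeholders) of a LoGra-style Kronecker-factored projection for a convolutional layer, obtained by unfolding its input into patches:

```python
import torch
import torch.nn.functional as F

d_out, d_in, k_h, k_w = 64, 32, 3, 3      # conv weight shape (d_out, d_in, k_h, k_w)
rank_in, rank_out = 8, 8                  # illustrative LoGra projection ranks

# Kronecker-factored projections; random init stands in for the PCA init.
P_in = torch.randn(d_in * k_h * k_w, rank_in)
P_out = torch.randn(d_out, rank_out)

x = torch.randn(1, d_in, 16, 16)                         # one input example
patches = F.unfold(x, kernel_size=(k_h, k_w))[0]         # (d_in*k_h*k_w, L) unfolded input
grad_out = torch.randn(1, d_out, 14, 14).flatten(2)[0]   # (d_out, L) stand-in output-gradient

# Project activations and output-gradients first, then form the small
# (rank_out x rank_in) projected gradient; the full (d_out x d_in*k_h*k_w)
# weight gradient is never materialized.
proj_in = P_in.T @ patches              # (rank_in, L)
proj_out = P_out.T @ grad_out           # (rank_out, L)
projected_grad = proj_out @ proj_in.T   # (rank_out, rank_in)
```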


Potential Impact of Low-Rank Projections on Attribution Accuracy

To address the reviewer's concern, we ran additional LDS experiments with varying projection dimensions for LoGra-PCA and present the results below:

| Dataset | Rank 32 | Rank 64 | Rank 128 |
|---------|---------|---------|----------|
| MNIST   | 0.2901  | 0.2935  | 0.2860   |
| CIFAR   | -       | 0.1331  | 0.1698   |
| Wiki    | -       | 0.2698  | 0.3274   |

We observed that increasing the projection dimension was helpful for achieving higher linear data-modeling scores in the CIFAR+ResNet and Wiki+GPT2 experiments. In contrast, the best LDS score on MNIST was obtained with rank 64. Based on this result, we suspect that more complex tasks typically require higher projection dimensions for better attribution accuracy. Noting that LoGra enables higher projection dimensions more efficiently than naive gradient projection (e.g., TRAK), we believe this further corroborates the benefit of LoGra. We thank the reviewer for suggesting this interesting experiment.


Lack of Discussion on Dimensionality Reduction Efficiency

We provided throughput and GPU memory usage of LoGra-random for our Llama3-8B-Instruct+OpenWebText experiments in Table 1. While we understand the reviewer's concern, we want to point out that there are several limitations in providing a fair comparison of the memory/compute efficiency of each sub-stage in LoGra, for the following reasons.

For LoGra-PCA, we need to go through two epochs of logging stages: (1) forward/backward covariance extraction for PCA initialization and (2) gradient logging + projected Hessian computation. On the other hand, in LoGra-random, we can skip the first stage of forward/backward covariance extraction and directly start with the second stage. As you may notice, the projected Hessian computation is by default accompanied by gradient logging, for which we implemented asynchronous data IO for efficiency. Hence, it is hard to disentangle FLOPs/memory usage for gradient logging and the projected Hessian computation in the second stage.

Moreover, the memory usage for stage 1 and stage 2 in LoGra-PCA can be different, and thus we may use different batch sizes for different stages for maximal efficiency. Furthermore, if we also consider the projection dimension, we generally consume less GPU memory with a lower projection dimension, meaning that we can use larger batch sizes when using lower projection dimensions. At the same time, Table 1 shows that throughput can be sensitive to the batch size. As such, there are multiple components that interact with each other, which makes a fully fair comparison difficult. If you have any suggestions on improving our writing for this aspect, we would welcome it!


If you have other questions or concerns about our work, please let us know. Thanks for the engagement :)

Comment

Limited Applicability of Specific Layer Types

The clarification regarding LoGra's focus on layers involving matrix multiplication is helpful. However, I suggest expanding the discussion on normalization layers to provide a more comprehensive perspective on LoGra's applicability. This would address potential concerns about generality across diverse layer types.

Trade-Offs Between Projection Ranks, Performance, and Efficiency

While the results are helpful, my original intention was to see a more detailed quantitative analysis of the approximation errors under different low-rank dimensions, specifically the trade-offs between Figure 4 and Table 1 metrics across varying projection ranks.

Comment

Thanks for the comment.


Limited Applicability of Specific Layer Type

Following your suggestion, we will clarify in our final manuscript that LoGra is mainly designed for layers/modules that involve matrix multiplication (e.g., linear and convolutional layers), given that a vast majority of parameters in most large networks are from them.


Trade-Offs Between Projection Ranks, Performance, and Efficiency

We want to note that the result in our previous comment actually shows the effect of varying low-rank dimensions for our LDS experiments in Figure 4.

Official Review
Rating: 3

The paper studies training data attribution using influence functions. The authors employ a popular approximation of the Gauss-Newton Hessian called Kronecker-Factored Approximate Curvature (K-FAC).

The first contribution is the method LoGra, which calculates projected training gradients, with twofold advantages. First, thanks to the Kronecker structure, the projection is orders of magnitude cheaper to compute than the usual "dense" projectors. Second, the lowered dimension of the gradients allows one to precompute them and store them on disk. Once the training gradients are precomputed, one can calculate the influences for a given test point in real time.
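To illustrate why the query step is cheap once projected gradients are precomputed (a sketch under simplifying assumptions, not the paper's code; `G`, `H`, and `g_test` are placeholder names), a test-point query reduces to a small linear solve plus one matrix-vector product:

```python
import torch

def influence_scores(G: torch.Tensor, H: torch.Tensor, g_test: torch.Tensor) -> torch.Tensor:
    """G: (N, r) projected training gradients loaded from disk,
    H: (r, r) damped Hessian/FIM approximation in the projected space,
    g_test: (r,) projected gradient of the test point."""
    precond = torch.linalg.solve(H, g_test)   # H^{-1} g_test in the projected space
    return G @ precond                        # one influence score per training example
```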

The second major contribution is a reproducible implementation of LoGra using PyTorch's forward/backward hooks. The codebase allows easy incorporation of gradient projections into existing models.

The authors address the important question of scalability of influence functions and make their application more accessible. Below I mention a few issues.

  1. While reading the paper, I often found myself having to refer to the previous papers to clarify some of the formulas. I think an average reader would benefit from self-contained notation: what is Kronecker product, what is Kronecker product of vectors, background for K-FAC approximation of the Hessian, and what is Hessian itself (the Gauss-Newton one).

  2. The details of Hessian calculation have been omitted. I do not understand how this calculation was performed, in particular, how do the authors manage to calculate the gradients token-wise.

  3. In Equation (5), should not $x_t \otimes D x_o$ be a matrix? However, the left-hand side (LHS) is a vector. The same applies to Equation (6). I believe everything would be correct if the vec(⋅) operation were removed; am I right?

  4. I believe there are some limitations in the interpretation of Lemma 1 that could be mentioned. First, Assumption 1 mentions a 'prompt,' so I assume the authors had language modeling in mind for this result. In that case, g_{tr} is the mean over tokens in the training set, and g_{te} is the mean over generated tokens. The assumption that they follow the same distribution is impractical, as training and generated texts are typically very different. Additionally, the variance is calculated per token, while g_{tr} and g_{te} are both averages over multiple dependent tokens. The second issue I see may be debatable. The 'cut-off' arises due to the normalization $e_j \mapsto \sqrt{\lambda_j} e_j$. Without this normalization, the cut-off would be the opposite: the top eigenvectors would receive smaller coefficients. Why is one interpretation better than the other? I understand that the motivation was to normalize the coefficients so that $E c_j^2 = 1$. However, I want to point out that this does not imply all $c_j$ are bounded. For example, the distribution of $c_j$ could have heavy tails. Otherwise, this would mean that the gradients lie in a low-dimensional subspace. As noted in [1] at the end of Section 4.1, this is not always the case, and when it is, it means that the training is slow. See also Section 5.5 in [2].

[1] An Investigation into Neural Net Optimization via Hessian Eigenvalue Density

[2] Efficient Sketches for Training Data Attribution and Studying the Loss Landscape https://arxiv.org/pdf/2402.03994

Strengths

  • reproducible codebase
  • extensive experiments

Weaknesses

  • lack of details regarding computation of Hessian
  • Lemma 1 has limitations
  • choice of k was not addressed (projection dimension)

Questions

How does one choose the projection dimension in practice? What was it in the experiments?

Comment
    • We note that both $x_t$ and $Dx_o$ are vectors, and thus their Kronecker product is also a vector. More specifically, the Kronecker product of an m-dimensional vector and an n-dimensional vector yields an mn-dimensional vector. We believe there is no error in our equations.

I always assumed that for two vectors, $x \otimes y = x y^{\top}$ is a matrix. Suppose you are right; does it mean that the Kronecker product of two matrices is a 4-dimensional tensor?

UPDATE: I see where I'm wrong, that checks out.

    • To answer this question, let's first define the "unit" of the example. In our language modeling experiments, we adopted each "sequence" as our unit of the example.

This appears to be very different from the conventional GNH. Is there any comparison, and where in the paper do you define what your GNH is? The traditional GNH comes from a local quadratic approximation of the KL divergence, which is also averaged over all tokens. Can your GNH have a similar interpretation?

UPDATE: also note the difference between the FIM and the empirical FIM. You would have to generate whole sequences $\hat{y}$ if you view them as a unit.

Comment

Thanks for your prompt response! Here, we provide further clarifications to your follow-up questions.


Kronecker Product

In general, the Kronecker product between $x\in R^{a_1\times a_2\times\cdots\times a_k}$ and $y\in R^{b_1\times b_2\times\cdots\times b_k}$ results in $z\in R^{a_1b_1\times a_2b_2\times\cdots\times a_kb_k}$. In our Equation (5), $x_{i,t}\in R^{n_i}$ and $Dx_{o,t}\in R^{n_o}$, so the RHS is in $R^{n_i n_o}$.
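A quick numerical check of the vector case (our snippet using `torch.kron`, unrelated to the paper's codebase):

```python
import torch

x, y = torch.randn(3), torch.randn(5)
print(torch.kron(x, y).shape)      # torch.Size([15]): two vectors give an mn-dimensional vector
A, B = torch.randn(2, 3), torch.randn(4, 5)
print(torch.kron(A, B).shape)      # torch.Size([8, 15]): two matrices give a matrix
```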


Gauss-Newton Hessian

First, we would like to note that the GNH matrix is commonly derived as a Hessian approximation to the linearized network [1, 2]. On the other hand, the FIM is defined as the covariance of the score function, which can also be seen as a quadratic approximation to the KL divergence under certain conditions, such as a cross-entropy loss function. To serve as an approximation to the GNH in autoregressive language/sequence modeling, the FIM can be defined as a sum over tokens (for a detailed discussion, see footnote 2, page 15 in [3]). We will include this discussion in a new appendix section, along with the basic definitions of the GNH and the Kronecker product, in our final manuscript.

We experimented with both the empirical and the true FIM, and noticed that they surprisingly yield similar results in terms of Pearson correlation and linear data-modeling scores.


Please let us know if you have further questions!

[1] Park et al., Trak: Attributing model behavior at scale. ICML, 2023

[2] Martens, New Insights and Perspectives on the Natural Gradient Method. JMLR, 2020

[3] Grosse et al., Studying Large Language Model Generalization with Influence Functions. Preprint, 2023

Comment

Thanks for your reviewing effort! We try our best to address the raised questions and concerns in our response. We hope this helps you evaluate our work more positively :)


Q1. While reading the paper, I often found myself having to refer to the previous papers to clarify some of the formulas. I think an average reader would benefit from self-contained notation: what is Kronecker product, what is Kronecker product of vectors, background for K-FAC approximation of the Hessian, and what is Hessian itself (the Gauss-Newton one).

A1. Thanks for the suggestion. We acknowledge that some parts of our manuscript may not be easily digestible, especially for readers who are not familiar with the Hessian and Kronecker product. In our final manuscript, we will include a separate appendix section that introduces notations and basics of the Gauss-Newton Hessian and the K-FAC approximation of it.


Q2. The details of Hessian calculation have been omitted. I do not understand how this calculation was performed, in particular, how do the authors manage to calculate the gradients token-wise.

A2. To answer this question, let's first define the "unit" of the example. In our language modeling experiments, we adopted each "sequence" as our unit of example. This choice was motivated by the fact that we want to identify the most influential sequences instead of tokens in our experiments. Therefore, the gradient is also computed at the sequence level instead of the token level (this can also be inferred from Equation (5)). For implementation details, we computed per-sequence gradients using PyTorch hooks.
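As a minimal sketch of hook-based per-sequence gradient computation (our illustration for a single linear layer, not the LogIX API; layer sizes and names are placeholders):

```python
import torch
import torch.nn as nn

layer = nn.Linear(16, 8, bias=False)
cached = {}

def fwd_hook(module, inputs, output):
    cached["x"] = inputs[0].detach()                 # (batch, seq_len, 16) input activations

def bwd_hook(module, grad_input, grad_output):
    dxo = grad_output[0].detach()                    # (batch, seq_len, 8) output-gradients
    # Summing token-wise outer products yields the per-sequence weight gradient.
    cached["per_seq_grad"] = torch.einsum("bto,bti->boi", dxo, cached["x"])

layer.register_forward_hook(fwd_hook)
layer.register_full_backward_hook(bwd_hook)

x = torch.randn(4, 10, 16, requires_grad=True)       # 4 sequences of 10 tokens
layer(x).sum().backward()
print(cached["per_seq_grad"].shape)                  # torch.Size([4, 8, 16])
```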

Next, we adopt the Fisher information matrix approximation of the Hessian. For convex loss functions, such as cross entropy and mean-squared error, the Hessian equals the Gauss-Newton Hessian. Furthermore, for exponential family negative log-likelihood functions, such as cross entropy, the Gauss-Newton Hessian is equivalent to the Fisher Information Matrix (FIM). We refer the reviewer to [1] for a more technical justification. Hence, we compute the block-diagonal FIM by simply averaging the outer product of each training sequence gradient following the definition of FIM. We would like to highlight that this Hessian approximation with Fisher is common in the literature [2, 3, 4].
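In code, this Hessian approximation amounts to something like the following sketch (ours, not the paper's implementation; the damping value is a placeholder):

```python
import torch

def fim_block(projected_grads: torch.Tensor, damping: float = 1e-5) -> torch.Tensor:
    """projected_grads: (num_sequences, r) per-sequence (projected) gradients for one block.
    Returns the damped (r, r) block of the block-diagonal FIM."""
    n, r = projected_grads.shape
    fim = projected_grads.T @ projected_grads / n     # average of per-sequence outer products
    return fim + damping * torch.eye(r)               # damping, as is standard for influence functions
```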


Q3. In Equation (5), should $x_t \otimes Dx_o$ not be a matrix? However, the left-hand side (LHS) is a vector. The same applies to Equation (6). I believe everything would be correct if the vec(⋅) operation were removed; am I right?

A3. We note that both $x_t$ and $Dx_o$ are vectors, and thus their Kronecker product is also a vector. More specifically, the Kronecker product of an m-dimensional vector and an n-dimensional vector yields an mn-dimensional vector. We believe there is no error in our equations.


Q4.1. I believe there are some limitations in the interpretation of Lemma 1 that could be mentioned. First, … The assumption that they follow the same distribution is impractical, as training and generated texts are typically very different. Additionally, the variance is calculated per token, while g_{tr} and g_{te} are both averages over multiple dependent tokens.

A4.1. We first remind the reviewer that gradients for our language model experiments are computed at the sequence level. The model output is sampled from $p(x|\theta^*)$, and $\theta^*$ is obtained by solving $\arg\min_\theta L(D_{tr}|\theta)$. Combined with the fact that pre-trained language models have seen a variety of "prompts" from web-scale data, we believe that there is a meaningful overlap between the training data distribution and the model output distribution. However, we also agree with the reviewer that these distributions may not perfectly overlap.

Q4.2. Coefficient distribution and normalization. A4.2. We agree with the reviewer that while $E[c_i^2]\approx 1$ holds in our derivation, its practical implication can be obfuscated, especially when $c_i$ follows a heavy-tailed distribution. In fact, we suspect that heavy-tailed noise in the coefficients can be one potential root cause of the failure cases in Section 4. In general, we believe studying gradient projection in influence functions under heavy-tailed noise would be an interesting future research direction.


If there is anything that can help you evaluate our work more positively, please let us know.

[1] Martens, James. New insights and perspectives on the natural gradient method. JMLR, 2020.

[2] Bae et al. If influence functions are the answer, then what is the question? NeurIPS, 2022.

[3] Kwon et al. DataInf: Efficiently estimating data influence in LoRA-tuned LLMs and diffusion models. ICLR, 2024.

[4] Park et al. TRAK: Attributing model behavior at scale. ICML, 2023.

Comment

Thanks a lot for the clarifications!

My question about the GNH in the case where you treat each sequence as a single "unit" still stands.

Let's say we only consider the case of classification, hence the GNH and FIM are equivalent, and both approximate the KL (in the sense that they define the quadratic form that approximates the KL). If you treat a whole sequence as a unit, this will be a different KL than what is usually used in proximal methods. In fact, it will be a rather intractable version of the KL, since it would require calculating $p(y)$ for all $y$, which grows exponentially with sequence length. It seems to me that in that case the approximation by PBRF will break, if the divergence in PBRF is calculated per token [1, 2]. By the way, I was also wondering why a comparison to PBRF is not included in your evaluations.

Q1: If you treat the whole sequence as a "unit", then to calculate the FIM we need to take the expectation of $\nabla \ell(\hat{y} | x)$. What is $x$ in that case?

Q2: In your response you mention that the empirical FIM and the FIM behave similarly. Do you mean that you actually implemented the case where you generate whole sequences for the calculation of the FIM? How did you do it? Usually the FIM is approximated by the empirical FIM in the realizable case. I don't think that at the sequence level you can treat it as realizable.

Q3. I also looked more closely at the evaluations in Section 4. In the GPT2 case, random LoGra generally performs no worse than, or even better than, PCA LoGra. Does it mean that some assumptions of Lemma 1 can be broken? Lemma 1 seems to suggest that PCA is the right thing to do.

[1] Bae et al. If influence functions are the answer then what is the question

[2] Grosse et al. Studying large language model generalization with influence functions

Comment

Thanks for the response! In our comment below, we attempt to address Q1 & Q2 together, and Q3 separately.


Q1: If you treat the whole sequence as a "unit", then to calculate the FIM we need to take the expectation of $\nabla \ell (\hat{y}|x)$. What is $x$ in that case?

Q2: In your response you mention that the empirical FIM and the FIM behave similarly. Do you mean that you actually implemented the case where you generate whole sequences for the calculation of the FIM? How did you do it? Usually the FIM is approximated by the empirical FIM in the realizable case. I don't think that at the sequence level you can treat it as realizable.

A. We again note that the FIM is defined as the covariance of the score function, and it coincides with the KL divergence under specific conditions. In our work, we computed the FIM following its definition (i.e., the covariance of the score function), instead of the KL interpretation.

More formally, the loss function (or negative log likelihood) of each sequence is defined as follows:

$$L(x|\theta) = -\log p(x|\theta) = -\sum_{t=1}^T \log p(x_t \mid x_{1:t-1};\theta)$$

Therefore, when we take each sequence as a unit, the gradient for each unit corresponds to the sum of token-wise gradients. For the FIM computation, we directly average the outer product of each sequence gradient. To compute the true FIM, we sample the next token at each time step from the model output distribution and perform a Monte-Carlo approximation over many training sequences.
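In symbols (our paraphrase of the procedure above, not an equation taken from the paper), the sampled ("true") FIM is approximated as

$$F \;\approx\; \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta L(\hat{x}^{(i)}|\theta)\,\nabla_\theta L(\hat{x}^{(i)}|\theta)^{\top}, \qquad \hat{x}^{(i)}_t \sim p(\,\cdot \mid x^{(i)}_{1:t-1};\theta),$$

whereas the empirical FIM plugs in the observed tokens $x^{(i)}_t$ in place of the sampled ones.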


Q3. I also looked more closely at the evaluations in Section 4. In the GPT2 case, random LoGra generally performs no worse than, or even better than, PCA LoGra. Does it mean that some assumptions of Lemma 1 can be broken? Lemma 1 seems to suggest that PCA is the right thing to do.

A. We discussed this issue in L338-347. The implication of Lemma 1 is that we can expect better influence estimations when we successfully keep the larger components of the Hessian/FIM. To make PCA compatible with LoGra, we adopt the KFAC approximation of the FIM (L238-239). However, unlike naive MLPs (without weight sharing) or convolutional networks, for which specialized KFAC techniques exist, the Transformer architecture lacks a specialized KFAC approximation. Therefore, there is an increased chance that our Kronecker-product estimation of the FIM is inaccurate in the first place, and thus the components identified as large under this inaccurate estimation may not actually be the larger components. Exploring a specialized KFAC for the Transformer architecture would be an interesting future research problem.


We hope our comment addressed some of your concerns. Let me know if you have any other questions!

Comment

To compute the true FIM, we sample the next token at each time step from the model output distribution and perform a Monte-Carlo approximation over many training sequences.

If you sample the next token, then the gradients have to be calculated per token. If you calculate the whole gradient, then the whole sequence needs to be sampled. What you are describing is just some conditioner; I don't see how you can interpret it as the FIM. If you claim that the two things are the same empirically, I would love to see that comparison. The paper also needs to describe precisely what your Hessian is in the main part and how you motivate this choice.

It looks like there are too many issues in this paper. I'll lower the score for now; maybe it will change later after discussion with the reviewers and AC.

Comment

Thanks for your comment.


True FIM

If you sample next-token, then the gradients have to be calculated per-token. If you calculate the whole gradient, then the whole sequence needs to be sampled. What you are describing is just some conditioner, I don't see how you can interpret it as FIM.

I believe sampling the next token and computing a sequence-level FIM are not at odds. Admittedly, the definitions of the (Gauss-Newton) Hessian and the FIM can be confusing in the context of language modeling. Sampling the whole sequence to compute a sequence-level FIM, as the reviewer suggested, indeed makes sense if we understand language models as density models. However, we note that the main motivation for using the FIM in this work is to approximate the Hessian (or its Gauss-Newton version) in the original influence function, and that the loss landscape (which is related to the Hessian) is determined by the next-token prediction loss for each training sequence $x$. In this context, we are rather interested in the conditional FIM, as noted in footnote 2 of [1] (page 15). Noting that the model makes its prediction at each time step conditioned on all previous tokens, we decompose $\log p(\hat{y}|x)$ into $\sum_{t=1}^T \log p(x_t \mid x_{1:t-1})$ (this answers your previous Q1).

[1] Grosse et al. Studying large language model generalization with influence functions. Preprint, 2023.

Official Review
Rating: 8

The paper proposes a method to scale up the computation of influence functions for large models such as LLMs. The key idea is to use the gradient structure in backpropagation to perform effective gradient projection without materializing the gradient or projection matrices. This idea is supported by its link to the effect of the damping factor commonly used in influence function computation: the projection can be viewed as a hard form of damping that zeros out the influence contributions of components discarded by the projection. Experiments show that the proposed method can scale up the computation of influence functions for large models, with over ~7000x throughput improvement compared to EKFAC, acceptable accuracy loss, and a trade-off in storage (40x more space needed). In addition, the authors also develop a plug-and-play Python package for this method that can be easily injected into existing LLMs.

Strengths

  1. The paper addresses a significant practical challenge (scalability) in understanding large language models through influence functions
  2. The technical innovation (LoGra) is well-motivated and theoretically grounded by connecting to the damping effect in influence functions
  3. Impressive empirical results showing major efficiency gains (7000x throughput) while maintaining accuracy
  4. Strong practicality focus with the development of LoGix software that can be easily integrated into existing training pipelines
  5. Comprehensive experiments including both quantitative metrics and qualitative analysis of the results

Weaknesses

  1. The failure cases in qualitative analysis (especially with Pythia-1.4B) could be analyzed more thoroughly to better understand the limitations (e.g. is it because of the projection or not)
  2. While the theoretical connection to damping is interesting, its practical implications could be explained more clearly
  3. The storage requirement tradeoff (3.5TB for LoGra vs 89GB for EKFAC) deserves more discussion on practical implications

Questions

  1. Could you elaborate on why the method performs notably worse on Pythia-1.4B compared to other models? What characteristics of the model might be responsible?
  2. The storage requirements are significantly higher for LoGra compared to EKFAC. Could you discuss this tradeoff more explicitly in the draft?
  3. The theoretical connection between gradient projection and damping suggests an interpretation of what information is being preserved/discarded. Could this insight be used to make better choices about projection strategies or explain the failure cases?

Overall, the paper makes a clear and significant contribution to scaling up influence functions for large language models, with impressive empirical results and a strong practical focus. The theoretical grounding adds depth, while the open-source implementation increases immediate impact potential. Despite some areas that could use additional analysis, this represents progress in making influence functions practical for modern deep learning.

Comment

We appreciate your valuable and positive feedback. Here, we attempt to address your questions.


Q1. Pythia-1.4B Failure Cases

In Appendix A.3 of the original manuscript, we presented several hypotheses to explain the suboptimal qualitative results, particularly those obtained with the Pythia-1.4B model. To facilitate the review process, we would like to reiterate the relevant excerpt from the appendix:

Influence functions tend to give a high score for the example that contributes most to decreasing (test) loss at the current weight [1]. At the same time, it is also hypothesized that different layers learn different concepts at different stages of training [2]. Combining these two facts, when interpreting influence analysis results, we need to think about which features the model is most likely learning at the current weight. Here, we specifically discuss two factors: training data quality and training steps. First, if the training data quality is low, then there would be a lot of features (e.g., random email address) that are frequent enough in the training dataset to be considered as learnable patterns. In other words, even though these features look redundant to humans, they may still be useful for decreasing loss from the model perspective. Second, many LLMs are only pretrained for a single epoch, or under-trained to their pretraining dataset. That being said, redundant features from the first point would likely still remain as learning-worthy features at the end of training and are captured by influence functions. In sum, we hypothesize that as the model is well-trained on a high-quality dataset, influence functions would capture more similar data to the query LLM output.


Q2. Storage Cost

In L391-400 (and footnote 2), we discussed the trade-off between storage cost and memory/compute efficiency. In particular, a simple cost analysis based on hourly AWS rates for GPUs and storage in footnote 2 shows that trading off storage cost for compute/memory efficiency is largely favorable for practitioners in most cases. If you have any further suggestions on making this discussion clearer, we would greatly appreciate it!


Q3. Theoretical Interpretation of Damping

Indeed, our theoretical analysis shows that (overly) aggressive gradient projection can result in significant information loss and an increased likelihood of failure cases. In addition, our theoretical analysis shows that, for a fixed projection dimension, retaining larger components of the gradient is generally preferable. This insight motivated us to design the PCA initialization scheme for LoGra, which leverages the Kronecker-factored approximate curvature (KFAC) approximation of the Hessian, as described in lines 234-246 of the manuscript. Lastly, we investigated the effectiveness and limitations of the PCA initialization in LoGra in the small-scale quantitative experiments of Section 4.1. In L338-347, we discussed that the Transformer architecture lacks a specialized KFAC approximation for the Hessian, which led us to use random initialization in our large-scale qualitative experiments.


We again greatly appreciate your reviewing effort. If you have any suggestions that could make our paper stronger, do not hesitate to let us know!

[1] Bae et al., If influence functions are the answer, then what is the question? NeurIPS, 2022.

[2] Chen et al., Which layer is learning faster? A systematic exploration of layer-wise convergence rate for deep neural networks. ICLR, 2023.

Official Review
Rating: 8

Training data attribution (TDA) attempts to quantify the contribution of training points to the prediction obtained for a test point. However, these methods do not scale well to large models and datasets. The authors propose an efficient gradient projection strategy that can be adopted with gradient-based TDA methods to scale up these approaches. They show improvements in throughput, reductions in memory usage, and applicability to multi-billion-parameter LLMs.

Strengths

The authors of this paper have identified one of the central problems in TDA methods and addressed it in a neat way. The paper provides sufficient background on gradient-based TDA methods and is really well written. The key trick seems to be moving the projection operator from the gradients to the activations, which reduces the memory and storage complexity tremendously.

The authors have also performed several clearly justified experiments with MLP, ResNet and GPT2 models. The results are in line with the original claims: EKFAC seems to be the only method that is close to or outperforms the proposed method, but it has much lower throughput and higher memory usage during influence computation (from Table 1).

Weaknesses

I just have a few comments for the authors to consider:

  1. The order complexities listed in sections 2 and 3 can be hard to process since the constants are hidden. It is not clear if I can compare two order complexities directly. Is there any way to make this more concrete and easily comparable?
  2. I would recommend adding compute and memory complexity for the methods compared in Figure 4 to get a more complete picture.
  3. Please note why quantitative evaluations are not possible for accuracy measures in section 4.2.

Questions

Please see weaknesses.

Comment

Thanks for your positive and valuable feedback. Here, we attempt to address your comments:


The order complexities listed in sections 2 and 3 can be hard to process since the constants are hidden. It is not clear if I can compare two order complexities directly. Is there any way to make this more concrete and easily comparable?

To address this exact issue, we provided a concrete comparison using one of the models in our experiment (i.e., Llama3-8B) in lines 199-201: "To clearly see this benefit, given the model/projection sizes of 8B/4k, we note that projection matrix sizes are about 1GB and 128TB respectively for LoGra and naive projection."
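As a rough back-of-the-envelope check of the quoted 128 TB figure (our arithmetic, assuming a dense fp32 projection matrix):

$$\underbrace{8\times 10^{9}}_{\#\text{params}} \times \underbrace{4096}_{\text{proj. dim } k} \times \underbrace{4\ \text{bytes}}_{\text{fp32}} \approx 1.3\times 10^{14}\ \text{bytes},$$

which is on the order of the quoted 128 TB, whereas LoGra only stores small per-layer Kronecker factors whose size scales as $(n_i + n_o)\,r$ per layer rather than $n_i n_o k$, which is what brings the total down to roughly 1 GB.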


I would recommend adding compute and memory complexity for the methods compared in Figure 4 to get a more complete picture.

We appreciate your thoughtful feedback. In the final manuscript, we will include a detailed analysis in the Appendix covering both the computational and memory complexity of influence calculations, as well as the gradient projection requirements for each method.


Please note why quantitative evaluations are not possible for accuracy measures in section 4.2.

The quantitative evaluations proposed in Section 4.1 would require retraining each model at least 500 times, which presents two significant challenges. First, we lack access to the (pre-)training datasets for two of the three models examined in Section 4.2 (GPT2-XL and Llama3-8B). Second, even if we had access to these datasets, the computational resources required to retrain such large models 500 times would be prohibitively expensive and impractical.


If you have any further concerns, feel free to let us know. We appreciate your reviewing effort again!

AC Meta-Review

The paper considers gradient-based training data attribution in large language models. More specifically, the authors introduce a method called LOGRA that relies on an influence function based on gradient projection. The method is compared to a range of baselines, including the recently proposed EKFAC, which is orders of magnitude more costly computationally. The results reported in the paper are convincing. The authors also provide a theoretical interpretation of the method. Finally, the paper is clear and is accompanied by software with PyTorch hooks.

Unfortunately, despite all these qualities, the paper is missing essential methodological details. Indeed, the paper does not provide the technical details about how the projection matrices are obtained (and hyperparameters set), nor the Hessian computation in the resulting influence function. These are key technical details that should be covered in the paper, even when code is made available to facilitate reproducibility.

Additional Comments on Reviewer Discussion

There was no consensus on this paper. Moreover, while reviewers were prompted to discuss the pros and cons in more detail, they did not engage further. Nevertheless, the rebuttal phase included some extensive discussions between authors and reviewers (e.g., on the method used to compute the Hessian). After carefully reading the paper, the reviews, and the rebuttal, I have to conclude that despite a great number of qualities, I agree that the omission of key technical details as flagged by one of the reviewers is problematic. Hence, I cannot recommend acceptance of this paper.

Final Decision

Reject