PaperHub
6.3/10 · Poster · 3 reviewers
Scores: 3, 3, 4 (min 3, max 4, std 0.5)
ICML 2025

A Versatile Influence Function for Data Attribution with Non-Decomposable Loss

Submitted: 2025-01-23 · Updated: 2025-07-24

Abstract

Keywords

influence function · data attribution

Reviews and Discussion

Review
Score: 3

This paper proposes the Versatile Influence Function (VIF), which extends influence functions to handle non-decomposable loss functions, thus broadening their application in machine learning models. Unlike conventional approaches limited to decomposable losses, VIF can be directly applied to any model trained with complex losses like contrastive or ranking losses, without needing retraining. Utilizing automatic differentiation, VIF simplifies the computation process and is applicable to non-convex models. Experiments on various tasks, including Cox regression and network analysis, demonstrate VIF's effectiveness and significant computational efficiency gains, up to 1000 times faster than brute-force methods. Additionally, VIF integrates well with efficient inverse Hessian approximations, further enhancing its scalability and performance in large neural networks. Despite assuming convex loss functions in its derivation, VIF opens new possibilities for data attribution in complex models.

Questions for Authors

Question: Is the VIF proposed in this article effective for any loss function?

Claims and Evidence

Claim-1: VIF provides a general form of influence function that can be directly applied to machine learning models trained with any non-decomposable loss.

Evidence: The article demonstrates how VIF generalizes from M-estimators to general non-decomposable losses through theoretical derivations (such as Theorem 3.6 and proofs in the appendix). Furthermore, experimental results are provided showing that in tasks such as Cox regression, node embeddings, and list-wise learning to rank, the influences calculated by VIF closely match those obtained through brute-force leave-one-out retraining.

Claim-2: VIF offers a new avenue for data attribution concerning non-decomposable losses, opening up opportunities for data analysis applications across broader fields.

Evidence: The article showcases the effectiveness of VIF through multiple case studies, including Cox regression for survival analysis, node embeddings in network analysis, and list-wise learning to rank in information retrieval. These case studies demonstrate the broad application potential of VIF.

Methods and Evaluation Criteria

The Versatile Influence Function (VIF) method and its evaluation criteria proposed in the article are reasonable for addressing data attribution problems associated with non-decomposable losses and are applicable to current issues and application scenarios. Below is a detailed analysis:

Proposed Method:

  • VIF Approach: VIF is a general form of influence function designed to overcome the limitation of traditional influence functions that only apply to decomposable loss functions. By integrating automatic differentiation techniques with inverse Hessian matrix approximations, VIF can be directly applied to machine learning models trained with any non-decomposable loss, enabling the estimation of the impact of data points on model parameters without the need for retraining.
  • Acceleration Techniques: The article discusses how to use Conjugate Gradient (CG) and LiSSA to speed up VIF calculations, which is particularly important for improving computational efficiency, especially when dealing with large-scale neural network models.
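The weight-perturbation construction can be made concrete on a toy problem. The sketch below is illustrative only: the formula VIF_i = H^{-1}[∇L(θ̂; w = 1) − ∇L(θ̂; w with w_i = 0)] is one reading of the finite-difference-in-weights idea, not the paper's exact equation. For a decomposable squared loss, the gradient difference is exactly the per-sample gradient, so this construction recovers the classical influence function:

```python
import numpy as np

# Illustrative sketch (assumption: VIF_i = H^{-1} * [grad L at w=1 minus grad L
# with w_i=0]). For the decomposable weighted loss
# L(theta, w) = sum_j w_j * 0.5 * (theta * x_j - y_j)^2,
# this finite difference reduces to the classical per-sample influence.

rng = np.random.default_rng(0)
x = rng.normal(size=20)
y = 2.0 * x + 0.1 * rng.normal(size=20)

def grad(theta, w):
    # gradient of the weighted loss w.r.t. the scalar parameter theta
    return np.sum(w * (theta * x - y) * x)

theta_hat = np.sum(x * y) / np.sum(x * x)   # minimizer of the unweighted loss
H = np.sum(x * x)                           # Hessian of the loss at theta_hat

w_full = np.ones_like(x)
vif = np.empty_like(x)
for i in range(len(x)):
    w_drop = w_full.copy()
    w_drop[i] = 0.0
    vif[i] = (grad(theta_hat, w_full) - grad(theta_hat, w_drop)) / H

# classical influence: per-sample gradient times inverse Hessian
classical_if = (theta_hat * x - y) * x / H
assert np.allclose(vif, classical_if)
```

The point of the sketch is that nothing in the weight-perturbation step requires the loss to decompose, which is what lets the same recipe apply to non-decomposable losses.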

Evaluation Criteria:

  • Benchmark Datasets: The authors selected the METABRIC dataset for experimental validation of Cox regression models and further tested them on neural network models. These choices reasonably reflect VIF's performance across different levels of model complexity.
  • Performance Metrics: The article uses Pearson correlation coefficients to measure the similarity between VIF and other methods (such as brute-force leave-one-out), which is an effective way of assessing consistency and reliability in model outputs.
  • Runtime Analysis: In addition to accuracy, the article also examines the runtime of each method. Results show that CG- and LiSSA-accelerated VIF not only performs comparably to the original VIF and brute-force leave-one-out, but also saves memory and time in large models, further demonstrating its practical value.

Reasonableness Analysis:

  • Applicability: The VIF method has been proven to effectively apply to various scenarios, including Cox regression in survival analysis, node embeddings in social networks, and list-wise learning to rank in information retrieval, showcasing its broad applicability.
  • Validation of Effectiveness: Through theoretical derivation and empirical experiments, the article verifies the effectiveness of VIF. Particularly, in specific cases (e.g., M-estimation), VIF can accurately recover classical influence functions; for more complex non-decomposable losses (such as Cox regression), VIF also shows good approximation performance.
  • Potential for Improvement: Although VIF performs excellently in many aspects, its assumption that the loss function is convex may limit its effectiveness in certain practical applications. Therefore, future work might explore how to adapt to more complex model structures to enhance its practicality.

Theoretical Claims

The theoretical claims and their proofs presented in the article are generally correct, with a logically rigorous connection between each step.

Experimental Design and Analysis

In reviewing this article, I examined the effectiveness of the experimental design and analysis. Below is a detailed analysis:

Experimental Design and Analysis

  • Experimental Setup: The article selects three different application scenarios: Cox regression, node embeddings, and listwise learning-to-rank. For each scenario, appropriate benchmark datasets were chosen for experimental validation, such as the METABRIC and SUPPORT datasets for Cox regression, Zachary's Karate network for node embeddings, and Delicious and Mediamill datasets for listwise learning-to-rank.

  • Performance Evaluation: Performance was measured by calculating Pearson correlation coefficients to assess the similarity between results obtained by the VIF method and brute-force leave-one-out retraining. The results show that in most cases, the VIF method achieves outcomes close to those of brute-force retraining, especially in Cox regression tasks (with a Pearson correlation coefficient of 0.997 on the METABRIC dataset). In node embedding tasks, although the performance of the VIF method is not as strong as brute-force retraining (Pearson correlation coefficients of 0.407 versus 0.419 on the Karate network), considering its significant improvement in computational efficiency (several orders of magnitude faster), this slight performance drop is acceptable.

  • Application of Acceleration Techniques: The article explores how to use Conjugate Gradient (CG) and LiSSA to accelerate VIF calculations, with experimental verification conducted on Cox regression models using the METABRIC dataset. These acceleration techniques not only improve calculation speed but also conserve memory and time resources in neural network models.
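To make the CG acceleration concrete: solving Hv = g with conjugate gradients needs only Hessian-vector products, never H^{-1} itself. A generic, illustrative sketch (a random positive-definite matrix stands in for the model Hessian; this is not the authors' code):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
d = 50
A = rng.normal(size=(d, d))
H = A @ A.T + np.eye(d)    # stand-in for a damped, positive-definite Hessian
g = rng.normal(size=d)     # stand-in for the gradient (difference) in an IF/VIF query

# CG only needs matrix-vector products, so H never has to be inverted or even
# materialized -- in a real model, matvec would be an autodiff HVP routine.
hvp = LinearOperator((d, d), matvec=lambda v: H @ v)
v, info = cg(hvp, g)

assert info == 0                          # converged
assert np.allclose(H @ v, g, atol=1e-3)   # v approximates H^{-1} g
```

This is why CG scales to neural networks: the memory cost is that of a few parameter-sized vectors rather than a parameter-squared matrix.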

Overall, the experimental design of this article is reasonable and effective, supporting its theoretical claims.

Supplementary Material

The authors submitted their code.

Relation to Existing Literature

The key contribution of this article is the introduction of a new method called Versatile Influence Function (VIF), which extends the application of traditional influence functions to scenarios involving non-decomposable losses. This contrasts with previous data attribution methods that were limited to decomposable losses, such as M-estimation. By drawing on application ideas from complex models like Cox regression and utilizing automatic differentiation tools in modern machine learning libraries for efficient computation, VIF opens up new research directions and technical approaches in the field of data attribution.

Essential References Not Discussed

This article primarily focuses on data attribution for non-decomposable losses, but does not mention recent advances in applying influence functions to graph neural networks.

Other Strengths and Weaknesses

The VIF proposed in this article is a second-order method, which implies a relatively high computational complexity.

Other Comments or Suggestions

The overall structure of the article is clear and logically sound. However, future work could consider adding experimental validation on different types of non-convex models to broaden the applicability of the VIF method. It is also recommended to further explore and specify the practical techniques needed for applying VIF to large neural networks in real-world scenarios. Additionally, fixing remaining typos would improve the professionalism and readability of the paper, although specific examples are not listed in this review.

Author Response

We thank the reviewer for the positive feedback and further suggestions. We address your comments in detail below.

Although VIF performs excellently in many aspects, its assumption that the loss function is convex may limit its effectiveness in certain practical applications. Therefore, future work might explore how to adapt to more complex model structures to enhance its practicality.

In our current paper, we have heuristically extended the proposed method to non-convex models and experimented with a few non-convex settings, such as the neural-network-based Cox model and the node embedding model. However, we agree that further exploring the adaptation of VIF to more complex model structures is important future work.

This article primarily focuses on the issue of data attribution for non-decomposable losses but does not mention recent advancements in the application of influence functions within the fields of graph neural networks.

Thanks for the suggestion. We have further explored the literature and found two relevant papers that adapt influence functions for graph neural networks, which can be viewed as special cases of non-decomposable losses. Chen et al. (2023) developed an influence function specifically for the Simplified Graph Convolution (SGC) model, a linearized graph neural network. Wu et al. (2023) proposed a machine unlearning method for graph neural networks based on the influence function, where the influence function is adapted to account for the graph dependency among samples. In comparison to these methods, our approach has a more general formulation and can be broadly applied to a variety of non-decomposable losses.

References

Chen, Z., Li, P., Liu, H., & Hong, P. (2023). Characterizing the Influence of Graph Elements. In The Eleventh International Conference on Learning Representations.

Wu, J., Yang, Y., Qian, Y., Sui, Y., Wang, X., & He, X. (2023). GIF: A general graph unlearning strategy via influence function. In Proceedings of the ACM Web Conference 2023 (pp. 651-661).

The VIF proposed in this article is a second-order method, which implies a relatively high computational complexity.

The proposed VIF is indeed a second-order method. However, we have shown that we can leverage computational tricks used for the conventional IF, such as CG and LiSSA, to accelerate the computation. In future work, we will explore integrating more advanced computational tricks, such as EKFAC (Grosse et al., 2023) or dimension reduction (Schioppa et al., 2022), into the proposed VIF.

References

Grosse, R., Bae, J., Anil, C., Elhage, N., Tamkin, A., Tajdini, A., ... & Bowman, S. R. (2023). Studying large language model generalization with influence functions. arXiv preprint arXiv:2308.03296.

Schioppa, A., Zablotskaia, P., Vilar, D., & Sokolov, A. (2022). Scaling up influence functions. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 36, No. 8, pp. 8179-8186).

Is the VIF proposed in this article effective for any loss function?

The proposed VIF is in principle applicable to any loss function covered by Definition 3.1, which is a very general form of loss function. However, the practical effectiveness of VIF (in terms of its approximation to the LOO) may vary from case to case, as our theoretical guarantees are limited to certain assumptions.

Review
Score: 3

Update after author discussion period

Thanks a lot for all the discussion below, both for correcting my mistake with the asymptotics and for addressing some of my concerns. I've updated my score to vote for an accept, since I think the extra experiments now address the accuracy of VIF more clearly and I think the theory backs up its performance as well. I still have some concerns about the reliance on metrics like coefficient of correlation, which is why I've only gone up to a 3/5 on the score.

I just had a few extra small comments on the theory that seem very fixable to me:

  1. On the boundedness of $J$: to assume that $e^{\hat\theta^T X_i}/n$ is of smaller order than $C$, I think that $n$ needs to be sufficiently large. I would recommend just adding this statement to the theorem. Also, I'm not sure that this theorem holds deterministically for a large enough $n$. Couldn't we have random data that makes $\hat\theta$ huge, thus driving up $e^{\hat\theta^T X_i}/n$? I think constraining $\hat\theta$ to a compact set would alleviate this. But right now, only $\theta^*$ is constrained to be within some compact set, which doesn't deterministically guarantee that $\hat\theta$ will be bounded.
  2. Also on the boundedness of $J$: the fact that $E[I(Y \geq t) e^{\theta^T X}] \geq C > 0$ I think only follows because of the truncation time (assumption (3)) assumed in the theorem. This just took me a minute to sort through, so it could be worth making this explicit.
  3. Sorry, I might have missed the assumption that the Cox model was well-specified! I think this should go directly into the theorem statement, since it's an important assumption.

Best, 1uWn

Original review summary before author discussion period

This paper proposes a new method to approximate leave-one-out (LOO) model retraining. While influence functions (IFs) are widely used to approximate LOO, the IF commonly found in the machine learning literature only immediately applies to models fitted with loss functions that "decompose" across training examples (i.e., $\sum_{i=1}^n \ell(\theta, x_i)$). The paper notes that various models do not meet this assumption, such as Cox regression or models fit with contrastive losses. So, the paper proposes the versatile influence function (VIF), an approximation to the IF that is exactly equal to the IF when the loss is decomposable across training examples. In one case (Cox regression) where previous authors have derived an exact form of the IF, the authors prove that the error between the VIF and IF is $O(1/n)$, where $n$ is the number of training examples. In experiments, the authors show that the value of the VIF correlates with the value of the IF.

给作者的问题

My main questions are:

  1. Why does Theorem 3.10 tell us the VIF is good given the issues raised above?
  2. Why does measuring the Pearson correlation coefficient tell us that the VIF is good?
  3. What are the takeaways from the case studies?

Thank you!

论据与证据

I think the central point of the paper, which is that the VIF is a good approximation to LOO retraining, is not currently well supported by the paper. I've broken my concerns into two areas below:

Theoretical Results: The main theoretical result backing up the accuracy of the VIF comes from Theorem 3.10, which says that for Cox regression under regularity assumptions, the difference between the VIF and IF is $O_p(1/n)$. I have some questions about the proof of this result described below. But, more importantly, I think this result actually suggests that the VIF is a bad approximation to the IF. While $O(1/n)$ sounds small, it's important to understand the overall scale of the problem. In particular, the VIF is intended to be an approximation to the IF, which in turn is an approximation to the exact LOO parameters (i.e., the model parameters refit with one datapoint left out). So Theorem 3.10 shows us that VIF's error to the exact LOO parameters is at least $O(1/n)$. But there's typically a much simpler approximation to the LOO parameters that is also $O_p(1/n)$ in error -- the model parameters fit with the full dataset. That is, $\|\hat\theta - \hat\theta_{-i}\| \approx \|VIF_i - \hat\theta_{-i}\|$.

I want to emphasize this is typically true -- for example, see Wilson et al. (2020) Theorem 14. It's possible that for the specific types of non-decomposable models considered here, the LOO parameters vary more. But I think this needs to be demonstrated in the current paper.

A secondary concern is that it doesn't seem like the error for the $i$th datapoint, $VIF_i - IF_i$, being $O_p(1/n)$ is uniform in $i$. This makes the theory less useful in practice. For example, the VIF could be arbitrarily inaccurate for any particular training point; the theory would just tell us that, if we were to hypothetically gather more data, the accuracy of all existing training samples would improve. I think this should be discussed, or it should be clarified in the proof that the error rate really is uniform over $i$. It's not straightforward to prove such uniform convergence, so I don't think this should be majorly held against the paper as long as it's clearly explained.

Empirical Results The experiments are broken into two sections: quantitative and then two qualitative case studies. I didn't feel like either fully told the story of why the VIF is a good approximation.

Quantitative results: the quantitative results only show the Pearson correlation coefficient ($\rho$) between the VIF and exact LOO. They show that the correlation is often high ($>0.9$), except for the node embedding model. But why does a high $\rho$ tell us that the VIF is working well? $\rho$ will be high if there is a perfect linear relationship between $VIF$ and $LOO$ -- so, say, $LOO = 823742 \cdot VIF + 8176234718942893472834$. But certainly in that case we wouldn't say that VIF is a good approximation to LOO -- the error would be horrible. One of the purposes of the IF is to rank datapoints in order of importance, and $\rho$ being high does suggest that the VIF probably gets this ordering correct. But in that case, the results should just show the error in the actual ordering. Or I think the paper needs to justify why $\rho$ being high is sufficient for the ordering to be correct. So, overall, I don't think measuring $\rho$ backs up the idea that the VIF is accurate.
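The reviewer's point is easy to verify numerically. The sketch below reuses the reviewer's slope but a smaller offset (chosen to stay within float64 precision): a perfect linear relationship gives $\rho = 1$ no matter how large the absolute error is.

```python
import numpy as np

rng = np.random.default_rng(0)
vif = rng.normal(size=100)
# Exact linear relationship with a large slope and offset (a smaller offset
# than the reviewer's example, to stay within float64 precision):
loo = 823742.0 * vif + 1e9

rho = np.corrcoef(vif, loo)[0, 1]
assert np.isclose(rho, 1.0)               # correlation is perfect...
assert np.abs(vif - loo).mean() > 1e8     # ...while the absolute error is huge
```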

Qualitative results: The paper presents a "case study" for using the VIF in Cox regression and a node embedding model with a contrastive loss. I found these to be too brief to really help understand at a qualitative level what the VIF was doing. In the (censored) Cox regression example, Table 3 shows, for two different test points, a few pieces of information about the top-5 VIF-scoring training points. The information given is the cosine similarity between the training point and the test point, the amount of time the point was observed for, and whether or not the point was censored. I didn't understand why this tells us anything about whether VIF is doing anything useful. The paper notes that the table shows results that "align with domain knowledge." But there's no additional information about what this domain knowledge is or how well these findings align. I think a lot more detail needs to be given in these case studies.

A. Wilson, M. Kasy, and L. Mackey. Approximate Cross-validation: Guarantees for Model Assessment and Selection. AISTATS. 2020.

方法与评估标准

I think the datasets used do make sense. I have some concerns about the evaluation criteria that are detailed in the Claims and Evidence section.

理论论述

I only skimmed the derivation of the VIF; I actually thought the derivation didn't need to be as long as it was (see suggestions below).

I did read the proof of Theorem 3.10 carefully, and I have a few comments below:

  1. Around line 872 "by applying Taylor expansion to terms in Eq. (18) and the boundedness" -- I think this needs much more detail. E.g., what terms? Expansion in which variables? To what order of expansion? Etc.
  2. Line 880: "note that $J(t, \hat\theta, Z_i)$ is bounded". I don't see why $J$ is bounded. $J$ contains the term $1/(s^{(0)}(t;\hat\theta) - e^{\theta^T X_i}/n)$. For a given $t$, $s^{(0)}(t;\hat\theta)$ is a constant. Unless it's a negative constant, there will exist some $X_i, \hat\theta$ such that $e^{\hat\theta^T X_i}/n$ gets arbitrarily close to $s^{(0)}(t;\hat\theta)$. This makes it seem like $J$ is not bounded.
  3. Line 881: "Given the boundedness and the consistency of $\hat\theta$" -- this needs to be made an explicit assumption; the theorem doesn't say anything about $\hat\theta$.
  4. Around line 743: "A consistent estimate for [the information matrix] is given by $\nabla^2_\theta L(\hat\theta)/n$." This isn't true. This is only correct if the model is well-specified, which hasn't been assumed here.

实验设计与分析

Yes, the experimental setup seems reasonable.

补充材料

Yes, I read the entire supplement; my comments on the supplement are scattered throughout other sections.

与现有文献的关系

Computing influence functions is a growing area in the machine learning literature. As the authors note, most work focuses on decomposable losses, where there is one term in the loss corresponding to each datapoint. However, non-decomposable losses are not uncommon in machine learning -- I thought the current paper's example of contrastive losses was particularly broadly applicable. I think applying influence functions to such models is an important area to study, so this paper is focused on an important and understudied problem.

遗漏的重要参考文献

I think the paper is missing two important references:

  1. Basu et al. (2021) show that LiSSA (a method proposed in the current paper to approximate the Hessian inverse needed to compute the VIF) works very poorly for large models (i.e., exactly when it is needed). But most new papers on influence functions seem to miss this point because Koh and Liang (2017) used LiSSA. I really think no one should be using LiSSA to approximate influence functions at this point, and I would suggest it's taken out of the current paper completely to avoid influencing future authors in this space.
  2. Ghosh et al. (2020) also tackles the problem of applying influence functions to "non-decomposable" losses. Their results are exact, but are focused on models with latent variables.

References

  • Samyadeep Basu, Philip Pope, Soheil Feizi. Influence Functions in Deep Learning Are Fragile. ICLR. 2021
  • Soumya Ghosh, William T. Stephenson, Tin D. Nguyen, Sameer K. Deshpande, Tamara Broderick. Approximate Cross-Validation for Structured Models. NeurIPS. 2020.

其他优缺点

Strengths: I think the idea of a finite difference approximation to resolve the issues of applying IFs to non-decomposable losses is an interesting one! I actually thought there was an even more direct approach to deriving the VIF: noting that for decomposable losses, the loss is linear in each of the weights $w_i$: $\sum_i w_i \ell(\theta, x_i)$. So when setting one weight to zero, all of the other terms in the difference in gradients in Eq. 10 cancel out. Then the VIF can be thought of as assuming the same thing holds for other models. Anyway, I thought this was a cool idea.

Weaknesses

  1. I think the early discussion in the paper has misunderstood what M-estimators are. In general, an M-estimator is just an estimator that minimizes some loss. There's no requirement that it decompose among different datapoints, which is what the text implies.
  2. The citation to Huber & Ronchetti (~line 231) should reference a specific result and not an entire book
  3. The introduction implies that the current method overcomes the challenge of non-convexity: "secondly, for non-convex models... to overcome these challenges, we propose the Versatile Influence Function". I think this is misleading because the VIF doesn't solve this problem.
  4. What's the compute time required for the given algorithm? It seems like it might be $O(n)$ per datapoint (e.g., looking at the equation near the bottom of p. 14, the gradients for Cox regression seem to each take $O(n)$ to compute), for a total of $O(n^2)$ time across all datapoints. This sounds expensive, so this would be good to discuss in the main text.

其他意见或建议

Nothing not covered above!

作者回复

We thank the reviewer for the detailed review and feedback. We have prepared a thorough response to each comment, but have to omit some of them due to the strict character limit. However, we could further provide the omitted response once the reviewer replies to us.

Claims and Evidence

Theoretical Results ... (Theorem 3.10) suggests that the VIF is a bad approximation to the IF ... That is, $\|\hat\theta - \hat\theta_{-i}\| \approx \|VIF_i - \hat\theta_{-i}\|$.

We clarify that the result in Theorem 3.10 is meaningful even if $\|\hat\theta - \hat\theta_{-i}\| = O(1/n)$. Note that $VIF_i$ or $IF_i$ themselves are not directly approximating $\hat\theta_{-i}$. Instead, the IF (or VIF) provides an approximation of $\hat\theta_{-i}$ through $\hat\theta_{-i} \approx \hat\theta - \frac{1}{n-1} IF_i$ (or $\hat\theta_{-i} \approx \hat\theta - \frac{1}{n-1} VIF_i$). See page 7 of Reid and Crepeau (1985), which shows "$\hat{I}_i \approx (n-1)(\hat\beta - \hat\beta_{-i})$", equivalent to $IF_i \approx (n-1)(\hat\theta - \hat\theta_{-i})$ in our notation.

Let's denote $\hat\theta_{-i}^{IF} \triangleq \hat\theta - \frac{1}{n-1} IF_i$ and $\hat\theta_{-i}^{VIF} \triangleq \hat\theta - \frac{1}{n-1} VIF_i$. Theorem 3.10 suggests that $\|\hat\theta_{-i}^{IF} - \hat\theta_{-i}^{VIF}\| = O(1/n^2)$, which is still non-trivial even if $\|\hat\theta - \hat\theta_{-i}\| = O(1/n)$.
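The scaling convention being discussed can be sanity-checked on the simplest M-estimator, the sample mean under squared loss, where $IF_i = (n-1)(\hat\theta - \hat\theta_{-i})$ holds exactly rather than approximately. A minimal illustrative sketch (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
n = len(x)

theta_hat = x.mean()                          # M-estimator for squared loss
theta_loo = (n * theta_hat - x) / (n - 1)     # all leave-one-out means at once
if_vals = x - theta_hat                       # classical influence function of the mean

# IF_i = (n-1) * (theta_hat - theta_loo_i) holds exactly for the mean,
# so theta_loo_i = theta_hat - IF_i / (n-1), matching the scaling above.
assert np.allclose(if_vals, (n - 1) * (theta_hat - theta_loo))
assert np.allclose(theta_loo, theta_hat - if_vals / (n - 1))
```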

In our paper's appendix, we have a relevant discussion regarding the significance of the results of Theorem 3.10 between lines 782 and 786, where we showed that $IF_i = \Omega(1)$, so the result $\|VIF_i - IF_i\| = O(1/n)$ is significant. We will further elaborate on the implication for the parameter approximation in our revised paper based on the clarification above.

We further empirically investigated the errors with varying sample size $n$ using subsets of the METABRIC dataset. In the first figure, we show that $\|VIF_i - IF_i\|$ decreases roughly as $O(1/n)$, while $\|IF_i\|$ fluctuates at a constant level for larger $n$. This aligns with our discussion in appendix lines 782-786. In the second figure, we plot $\|\hat\theta - \hat\theta_{-i}\|$, $\|\hat\theta_{-i}^{VIF} - \hat\theta_{-i}\|$, and $\|\hat\theta_{-i}^{VIF} - \hat\theta_{-i}^{IF}\|$ for varying $n$. While $\|\hat\theta - \hat\theta_{-i}\|$ indeed decreases roughly as $O(1/n)$, both $\|\hat\theta_{-i}^{VIF} - \hat\theta_{-i}\|$ and $\|\hat\theta_{-i}^{VIF} - \hat\theta_{-i}^{IF}\|$ are much smaller than $\|\hat\theta - \hat\theta_{-i}\|$ and decrease roughly as $O(1/n^2)$.

Secondary concern: it doesn't seem like the error ... is uniform in $i$.

We acknowledge that our theoretical result is not uniform in $i$ and will make this clear in our revised paper. However, the empirical error aligns well with our theoretical result on average, as shown in our response to the previous point.

Quantitative results: the quantitative results only show the Pearson correlation coefficient between the VIF and exact LOO.

First, we clarify that the Pearson correlation coefficient with LOO is a commonly used evaluation metric in the data attribution literature, including both the seminal paper by Koh and Liang (2017) and a more recent paper published at ICLR 2025 (Wang et al., 2025). This is because, for most data attribution applications, the scaling constant and the offset do not matter much.

Moreover, for the Cox model, we empirically observe that the influence predicted by VIF does align almost perfectly with the exact LOO, as can be seen in this figure.

References

Wang, et al. Capturing the Temporal Dependence of Training Data Influence. ICLR 2025.

Qualitative results: The paper presents a "case study" ... too brief to really help understand at a qualitative level what the VIF was doing.

Thanks for the suggestion. We will include more details to make the takeaways clearer. Regarding the case study on Cox model, one takeaway is that, for the model to predict a long survival time for a test data point, the most influential training data points are either 1) data points that are similar to this test data point AND have long event time, or 2) data points that are dissimilar to this test data point AND have short event time AND the event has occurred. By domain knowledge, we meant our knowledge about Cox model.

Theoretical Claims

We double-checked the correctness of our proof in response to all of the reviewer's questions. We will provide details in the follow-up discussion.

Essential References Not Discussed

We implemented LiSSA as a proof of concept for the VIF extension. We will discuss its limitations. We will also discuss Ghosh et al.

Other Strengths And Weaknesses

We will revise our statements per the reviewer's suggestions. One note: our notion of M-estimator follows the textbook Asymptotic Statistics by van der Vaart.

Reviewer Comment

Apologies, I didn't realize that "Official Comments" weren't viewable by the authors! Reposting as a rebuttal comment here

Thank you for the detailed and clear response. I've outlined a few additional questions / comments below:

Scaling in Theorem 3.10

You are completely right; I totally missed how the IF was being scaled. Sorry about that! I agree that Theorem 3.10 very much backs up the claim of VIF having good accuracy.

Just a couple suggestions along these lines to avoid others making this (avoidable) mistake:

  1. I would add some discussion putting these results into the context of other theoretical results in the literature. I think making the point that the error implied by Theorem 3.10 is on the same order as the IF approximation to $\hat\theta_{-i}$ really strengthens the discussion.
  2. Not to excuse my not noticing the scaling, but it could be worth dropping the scaling by a factor of $n$. I believe that scaling the IF up by $n$ is common in the robust statistics literature from the 70s and 80s, but not scaling it by $n$ is more common in the ML literature. E.g., I don't think Koh and Liang (2017) ever have a formula where they say "and here is the influence function: ...", but their writing implies it without the scaling by $n$. Eq. (5) in Koh et al. (2019) is a more explicit definition of "the IF" which doesn't use the scaling. Anyway, I would put this at a personal preference level, as long as the scaling is called out.

Use of Pearson's Coefficient of Correlation

I still am not sure $\rho$ is the right measure here. I agree that it's somewhat standard in the literature, but at the same time I don't think that means we can't do better. Just as an example, I see a lot of papers in the ML literature about IFs that compare against LiSSA because it's standard. But (as above), LiSSA has some bad problems!

If the goal is to say that VIF and LOO might be arbitrarily off in scale but have the same rank ordering, I think we should just directly measure the error in rank ordering. Or if there's some other downstream goal, we should measure that. Basically, I don't think that $\rho$ is really measuring anything concrete. For example, $\rho = 0.95$ is much higher than $\rho = 0.8$. But it seems plausible to me that the $\rho = 0.8$ case could have a better rank order. See below for some code that shows this. It shows that two identical rank orders can produce very different $\rho$.
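One direct rank-agreement measure is Kendall's tau. The sketch below (illustrative, not the reviewer's code) applies a monotone distortion to some synthetic LOO scores: the rank order is preserved exactly (tau = 1) while Pearson's rho drops noticeably below 1.

```python
import numpy as np
from scipy.stats import kendalltau, pearsonr

rng = np.random.default_rng(0)
loo = rng.uniform(0.0, 5.0, size=1000)   # synthetic stand-in for exact LOO scores
vif = np.log(loo + 1e-3)                 # monotone distortion: same rank order

rho, _ = pearsonr(vif, loo)
tau, _ = kendalltau(vif, loo)

assert tau > 0.9999   # rank order is exactly preserved...
assert rho < 0.95     # ...yet Pearson's rho is visibly below 1
```

This illustrates the reviewer's suggestion: if rank ordering is what matters downstream, a rank correlation (or a direct ordering error) measures it, while Pearson's rho conflates ordering with linearity.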

In any case, I think the figures linked in the response do a good job of directly showing the quantitative accuracy of VIF, and alleviate a lot of my concerns about it not being fully evaluated. I think including these results would strengthen the results in the paper.

Theoretical results

Looking forward to the extra discussion! My main concerns are the four issues listed in my review.

Essential References

This sounds great.

M-estimators

This is a small thing, but if we're looking at Eq (5.1) in van der Vaart, I would still say that M-estimators cover the examples in the current paper. van der Vaart writes the loss as

$$\sum_{i=1}^n m_\theta(X_i).$$

But there's no requirement that the $X_i$ be independent. So, e.g., if we're defining a loss over pairs of nodes in a graph, with each node represented by $Z_j$, then each $X_i$ could be $X_i = (Z_{i_1}, Z_{i_2})$. So I still think the paper is defining M-estimators in an overly restrictive way.
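To spell that out (my own sketch; $E$ is a hypothetical edge set and $m_\theta$ any pairwise term, e.g. a contrastive loss over node embeddings), the graph loss

$$L(\theta) = \sum_{(i_1, i_2)\in E} m_\theta(Z_{i_1}, Z_{i_2})$$

is exactly of the form $\sum_i m_\theta(X_i)$ with $X_i = (Z_{i_1}, Z_{i_2})$; the $X_i$ share nodes and are therefore dependent, but nothing in Eq (5.1) forbids that.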

References

Pang Wei Koh, Kai-Siang Ang, Hubert H. K. Teo, and Percy Liang. On the Accuracy of Influence Functions for Measuring Group Effects. NeurIPS. 2019.

import numpy as np
import matplotlib.pyplot as plt

def compute_rank_err(vif, loo):
    # argsort of argsort turns values into 0-based ranks (no-ties case);
    # a single argsort would give the sorting permutation, not the ranks
    ranks_vif = np.argsort(np.argsort(vif))
    ranks_loo = np.argsort(np.argsort(loo))
    return np.abs(ranks_vif - ranks_loo).sum() / len(ranks_vif)

def compute_norm_err(vif, loo):
    return np.abs(vif - loo).sum() / len(vif)

def compute_rho(vif, loo):
    # Pearson correlation of the raw values; applied to ranks instead,
    # this same formula would give Spearman's rho
    cov = np.cov(vif, loo)
    return cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])


# Upper and lower limits of exact LOO values; totally arbitrary
upper = 5.0
lower = 0.0

seed = 12345
N = 1000
rng = np.random.default_rng(seed)
noises = np.logspace(-8, 3, 1000)

rank_errs = np.empty(len(noises))
norm_errs = np.empty(len(noises))
rhos = np.empty(len(noises))

for nn, noise in enumerate(noises):
    # Generate some random exact LOO values, then add
    #  some noise to simulate the error between VIF and LOO
    loo = rng.uniform(lower, upper, size=N)
    vif = loo + rng.uniform(-noise, noise, size=N)
    
    rank_errs[nn] = compute_rank_err(vif, loo)
    norm_errs[nn] = compute_norm_err(vif, loo)
    rhos[nn] = compute_rho(vif, loo)

ax = plt.gca()
ax.plot(noises, rank_errs, label='Rank Error')
ax.plot(noises, norm_errs, label='Norm Error')
ax.set_ylabel('Rank error or norm error')
ax.set_xlabel('Noise')
ax.set_xscale('log')

ax2 = ax.twinx()
ax2.set_xscale('log')
ax2.plot(noises, rhos, label='Rho', c='g')
ax2.set_ylabel('Rho')

ax2.legend()
ax.legend()
plt.show()
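And if a correlation coefficient is kept at all, a rank-based one at least measures ordering directly. A minimal numpy sketch (the helper name `spearman_rho` is mine; it assumes no ties, where one would otherwise use average ranks):

```python
import numpy as np

def spearman_rho(x, y):
    # Spearman's rho is Pearson correlation computed on the ranks.
    # argsort of argsort turns values into 0-based ranks (no-ties case).
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))
```

Unlike Pearson's $\rho$ on raw values, this is invariant to any strictly monotone rescaling of either argument, so identical rank orders always give exactly 1.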
Author Comment

Dear Reviewer,

Thank you for your reply! Regarding your suggestions, we will definitely include our rebuttal response into our final paper. Please see below for the clarification of the 4 questions about theoretical results.

Around line 872 "by applying Taylor expansion to terms in Eq. (18) and the boundedness" -- I think this needs much more detail. E.g., what terms? Expansion in which variables? To what order of expansion? Etc.

We apply the first-order Taylor expansion of the function $1/x^2$ at $x = s^{(0)}(t;\theta)$, which yields $$\frac{1}{\left[S_n^{(0)}(t;\theta)\right]^2} = \frac{1}{\left[s^{(0)}(t;\theta)\right]^2} - \frac{2}{\left[s^{(0)}(t;\theta)\right]^3}\left(S_n^{(0)}(t;\theta) - s^{(0)}(t;\theta)\right) + o\left(\left|S_n^{(0)}(t;\theta) - s^{(0)}(t;\theta)\right|\right).$$ Since $X$, $\mathcal{B}$, and $\tau$ are bounded, there exists a constant $C > 0$ such that $\inf_{t\in[0,\tau],\,\theta\in\mathcal{B}} s^{(0)}(t;\theta) = E\left(I(Y \geq t)\exp(\theta^\top X)\right) \geq C$, which implies that the denominators in the expansion are uniformly bounded away from zero. Therefore, the first term in Eq. (18) satisfies $$\sup_{t\in[0,\tau],\,\theta\in\mathcal{B}} \left|\frac{1}{\left[S_n^{(0)}(t;\theta)\right]^2} - \frac{1}{\left[s^{(0)}(t;\theta)\right]^2}\right| \lesssim \sup_{t\in[0,\tau],\,\theta\in\mathcal{B}} \left|S_n^{(0)}(t;\theta) - s^{(0)}(t;\theta)\right| = o_p(1).$$ Similarly, by applying the first-order Taylor expansion to the bivariate function $y/x^3$ at the point $(x, y) = (s^{(0)}(t;\theta), s^{(1)}(t;\theta))$, we can show that the third term in Eq. (18) is also $o_p(1)$. Combining these results, we obtain the uniform convergence $$\sup_{t\in[0,\tau],\,\theta\in\mathcal{B}} \left\| J_n(t;\theta, Z_i) - J(t;\theta, Z_i) \right\| = o_p(1).$$
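As an illustrative numerical check of the first step (not part of the proof; the expansion point and values below are arbitrary choices of ours), the first-order Taylor expansion of $1/x^2$ behaves as claimed:

```python
# First-order Taylor expansion of f(x) = 1/x^2 around x = a:
#   1/x^2 ≈ 1/a^2 - (2/a^3) * (x - a), with remainder O((x - a)^2).
a = 2.0
x = 2.01
exact = 1.0 / x**2
approx = 1.0 / a**2 - (2.0 / a**3) * (x - a)
err = abs(exact - approx)  # on the order of (x - a)^2, not (x - a)
```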

Line 880: "note that $J(t, \hat{\theta}, Z_i)$ is bounded" -- I don't see why $J$ is bounded. $J$ contains the term $1/\left(s^{(0)}(t;\hat{\theta}) - e^{\hat{\theta}^\top X_i}/n\right)$. For a given $t$, $s^{(0)}(t;\hat{\theta})$ is a constant. Unless it's a negative constant, there will exist some $X_i, \hat{\theta}$ such that $e^{\hat{\theta}^\top X_i}$ gets arbitrarily close to $s^{(0)}(t;\hat{\theta})$. This makes it seem like $J$ is not bounded.

Under our setting, the denominator $\left(s^{(0)}(t;\hat{\theta}) - e^{\hat{\theta}^\top X_i}/n\right)$ in $J(t, \hat{\theta}, Z_i)$ is indeed bounded away from zero by a positive constant for large $n$. Specifically, since $X$, $\mathcal{B}$, and $\tau$ are bounded, there exists a constant $C > 0$ such that $\inf_{t\in[0,\tau],\,\theta\in\mathcal{B}} s^{(0)}(t;\theta) = E\left(I(Y \geq t)\exp(\theta^\top X)\right) \geq C$. Moreover, the term $e^{\hat{\theta}^\top X_i}/n = O(1/n)$ is of a smaller order, so $s^{(0)}(t;\theta) - \exp(\theta^\top X_i)/n$ remains bounded below by a positive constant for large $n$. Therefore, $J(t;\theta, Z_i) = O(1)$ uniformly over $t\in[0,\tau]$ and $\theta\in\mathcal{B}$. We have added this clarification to the proof.

Line 881: "Given the boundedness and the consistency of $\hat{\theta}$" -- this needs to be made an explicit assumption; the theorem doesn't say anything about $\hat{\theta}$.

Around line 743: "A consistent estimate for the information matrix is given by $\nabla_\theta^2 L(\hat{\theta})/n$." This isn't true. This is only correct if the model is well-specified, which hasn't been assumed here.

The following responds to both the comment on line 881 and the comment around line 743. We would like to clarify that our analysis of the approximation error is conducted under the assumption that the Cox model is correctly specified. This is also the scenario in which the exact form of the empirical influence function $\mathrm{IF}(i)$ was derived in Reid & Crepeau (1985). Regarding line 881, we also note that the statistical properties of the maximum partial likelihood estimator under the Cox model have been well studied (Cox, 1975). In particular, under conditions (1)–(4) and uninformative censoring (as stated in Theorem A.2), the estimator $\hat{\theta}$ is consistent and the empirical information matrix $\nabla_\theta^2 L(\hat{\theta})/n$ is a consistent estimator of the information matrix. We have updated the manuscript to include this clarification and added a corresponding citation to support this point.

We revised line 881 as below: To bound the fourth term $I_4$, we use the Lipschitz continuity of $J(t;\theta, Z_i)$ in $\theta$, which follows from the boundedness of $X$, $\mathcal{B}$, and $\tau$. Combined with the consistency of $\hat{\theta}$, i.e., $\hat{\theta} = \theta^* + o_p(1)$, under the regularity conditions of Theorem A.2 (Cox, 1975), we have $\sup_{t\in[0,\tau]}\|J(t;\hat{\theta}, Z_i) - J(t;\theta^*, Z_i)\| = o_p(1)$, and it thereby follows that $I_4 = o_p(1)$.

References

Cox, D. R. (1975). Partial likelihood. Biometrika, 62(2), 269–276.

Review
4

The paper proposes a method called Versatile Influence Function (VIF) for data attribution in machine learning models. Traditional influence functions require the loss function to be separable (decomposable) into individual data points, limiting their application. The authors extend the influence function to handle more complex loss functions involving multiple data points at once (non-decomposable losses), such as those found in Cox regression, contrastive loss, and learning-to-rank problems. VIF approximates the influence of each data point using auto-differentiation, making it easy to apply without manually deriving influence functions for each loss. Experiments conducted on three different scenarios demonstrate that VIF closely matches the results from leave-one-out retraining, while being significantly faster.
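For context, the decomposable-loss recipe that VIF generalizes can be sketched in a few lines of numpy (a toy least-squares example of our own, not the paper's method; for a quadratic loss the leave-one-out refit is exact, so the only approximation error comes from reusing the full-data Hessian):

```python
import numpy as np

# Toy decomposable setting: least squares, L(theta) = 0.5 * sum_i (x_i @ theta - y_i)^2
rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

theta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # minimizer of L
H = X.T @ X  # Hessian of the total loss

def loo_delta(i):
    # First-order (influence-function) estimate of theta_{-i} - theta_hat:
    # a Newton step that drops sample i's gradient but reuses the full Hessian.
    g_i = (X[i] @ theta_hat - y[i]) * X[i]  # gradient of sample i's loss at theta_hat
    return np.linalg.solve(H, g_i)

# Exact leave-one-out refit for comparison
i = 7
mask = np.ones(n, dtype=bool)
mask[i] = False
theta_loo = np.linalg.solve(X[mask].T @ X[mask], X[mask].T @ y[mask])
```

Here `theta_hat + loo_delta(i)` agrees with `theta_loo` up to a correction that shrinks with $n$; the paper's contribution is making an analogous first-order estimate available when the loss does not decompose per sample.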

Questions for Authors

Nothing.

Claims and Evidence

The claims in the paper are supported by clear and convincing evidence. The authors claim that the proposed Versatile Influence Function (VIF) closely approximates the leave-one-out retraining results for non-decomposable losses, which is demonstrated clearly through experiments on multiple tasks (Cox regression, node embedding, learning-to-rank). They also claim that VIF is significantly faster, which is supported by the runtime comparisons provided.

Methods and Evaluation Criteria

Yes, the proposed method and evaluation criteria make sense for the problem addressed in this paper. The authors propose VIF to generalize the influence function to handle non-decomposable losses, which fits well with the motivation of extending influence functions to more complex learning scenarios. The chosen benchmark tasks (Cox regression, node embedding, and learning-to-rank) represent practical examples of non-decomposable loss scenarios, effectively demonstrating the usefulness of the proposed method.

Theoretical Claims

The authors made several theoretical analyses in the paper. While I did not thoroughly check the proof of the theorem in detail, the arguments made by the authors in the theorem proofs make sense. In the case of non-decomposable loss functions, many properties about the loss function can still be computed efficiently using several decomposition techniques.

Experimental Design and Analysis

The authors evaluated their method (VIF) on three tasks: Cox regression, node embedding, and learning-to-rank, which are appropriate examples of non-decomposable loss scenarios. The choice of comparing the VIF results with the leave-one-out retraining method (brute force) also makes sense. However, I would suggest the authors explain the properties of the datasets in more detail. Evaluating on a wide range of dataset sizes (small to large) is also suggested. The datasets used in the paper are relatively small; it would be good to see how the performance changes as the size of the dataset grows.

Supplementary Material

Yes. I briefly checked the supplementary materials, particularly in dataset and experiment details.

Relation to Prior Literature

The key contributions clearly build on previous literature, extending traditional influence functions from decomposable losses (Koh & Liang, 2017) to more complex, non-decomposable losses. The authors relate their method (VIF) explicitly to classical results in robust statistics and practical ML problems such as Cox regression, node embedding, and learning-to-rank.

Essential References Not Discussed

Nothing as far as I know

Other Strengths and Weaknesses

Overall, the authors propose a solid contribution, making influence functions work for generic non-decomposable losses. The proposal is backed by both theoretical grounds and empirical verification.

Other Comments or Suggestions

The abstract is too long.

Author Response

We appreciate the reviewer for the positive feedback and further suggestions.

We will revise our abstract to make it more succinct.

Following the reviewer's suggestion, we have also included more experiments on a larger-scale survival analysis dataset, RR-NL-NHP (Kvamme et al., 2019), with 16,000 training samples and 5,000 test samples. We ran experiments on both the linear Cox model and the neural-network-based Cox model. The performance of VIF is shown below.

| Model | Linear | Neural Network |
| --- | --- | --- |
| Pearson Correlation | 0.9997 | 0.3619 |

As can be seen from this preliminary result, on the larger dataset VIF still achieves almost perfect performance on the linear Cox model, and non-trivial performance in the more challenging neural network case (due to the inherent randomness in neural network training).

We plan to include a couple of more larger-scale datasets for the other task setups too into our final version of paper.

References

Kvamme, H., Borgan, Ø., & Scheel, I. (2019). Time-to-event prediction with neural networks and Cox regression. Journal of Machine Learning Research, 20(129), 1–30.

Final Decision

This paper proposes a data attribution method called Versatile Influence Function (VIF) that extends influence functions to non-decomposable loss functions. Most natural loss functions are decomposable, so this is an uncommon but interesting problem, since non-decomposable losses do appear in several important applications and settings, including contrastive and ranking problems. The connection of VIF to auto-differentiation is important because it makes the method straightforward to apply with existing infrastructure.

The authors presented strong results and participated in detailed discussions with the reviewers, who were convinced that the paper is theoretically solid and interesting.

Overall a solid contribution to a somewhat narrow but important problem.