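PaperHub
Overall rating: 4.5/10
Decision: Rejected · 4 reviewers
Ratings: 5, 5, 5, 3 (lowest 3, highest 5, standard deviation 0.9)
Confidence: 3.0
Correctness: 2.5 · Contribution: 2.3 · Presentation: 2.3
ICLR 2025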

Outlier Gradient Analysis: Efficiently Identifying Detrimental Training Samples for Deep Learning Models

Submitted: 2024-09-27 · Updated: 2025-02-05

Keywords: data-centric learning, detrimental sample trimming, training sample influence

Reviews and Discussion

Review
Rating: 5

This manuscript highlights the decisive role of the gradient with respect to model parameters, $\nabla_\theta \ell(z;\theta)$, in influence estimation. Building on this insight, the authors propose using outlier detection algorithms, such as iForest, on sample gradients as an alternative to influence estimation, thereby avoiding the high computational cost of calculating the inverse Hessian matrix. They demonstrate the effectiveness of their method in tasks such as noisy label correction, data selection for NLP, and data identification for LLMs.

Strengths

  1. This manuscript is well-written.
  2. The idea of using outlier detection algorithms as an alternative for influence estimation is innovative.
  3. The proposed method is time efficient.

Weaknesses

  1. While the authors experimentally demonstrate the effectiveness of using outlier detection methods for influence estimation, identifying a sample as an outlier does not necessarily indicate that it is harmful to the validation loss or has a negative influence score. Some meaningful outliers may represent rare but valuable patterns within the training set and should not be excluded.
  2. The experimental results on data selection for fine-tuning NLP models using LiSSA and DataInf may not fully reveal their performance. Although the authors compute influence for these methods using only the training set for fair comparison, reliable influence estimation typically requires a validation set, according to the definition of influence functions. Furthermore, methods that do not require a validation set, such as Self-LiSSA, demonstrate similar performance to outlier detection methods.
  3. Since the experimental settings primarily follow those of DataInf, it is unclear why text-to-image generation tasks were not included in the experiments. In DataInf, the AUC and Recall for class detection in text-to-image generation tasks are reported as 0.865 and 0.315, respectively. Compared to sentence transformation and math problems, which approach 1.000 for both AUC and Recall, text-to-image tasks appear more challenging and could better highlight the advantages of the proposed method.

Questions

  1. Regarding Figure 1-H, since the model is an MLP and we are computing the gradient with respect to the parameter space, $\nabla_\theta \ell(z;\theta)$, why are the plotted features only two-dimensional?
Comment

Dear Reviewer A4t7,

Thank you for your review; we are grateful for your feedback and insights. We discuss the concerns raised below:

  • Outlyingness of Sample Gradients:
    • The reviewer mentioned that some meaningful outliers may represent rare but valuable patterns within the training set and should not be excluded. While we agree with the reviewer's general sentiment, if some patterns occur in the training set but are not present in the validation set (i.e., distribution/concept shift), these patterns can be regarded as detrimental, and removing these samples should lead to performance improvement.
    • We would also like to clarify and emphasize that our method is not removing outlier samples (which at times can be beneficial to training), but outlier gradient samples (which we show can negatively affect model performance). Our approach is backed by the intuition from the influence function formulation of Eq. 1, which also utilizes sample gradient influence in determining whether the sample is beneficial or detrimental to training.
  • Adapting to Validation Set: Thank you, we had undertaken experiments in the current version of the paper (Appendix C.8, Table 12) where we utilize and adapt the validation set for our outlier gradient analysis method for the same reasons suggested by the reviewer. Since we can utilize any outlier algorithm for outlier gradient analysis, we employ the OneClassSVM semi-supervised outlier algorithm to adapt it to the validation set (which provides supervision information) along with the training set. We employed the distribution shift experimental framework from [1] on the Folktables ACS-Income dataset where the training and test distribution are either time shifted, location shifted, or both time and location shifted. Our results indicate that our outlier analysis approach outperforms the other methods across all three distribution shift settings (please refer to Table 12 in Appendix for these results).
  • Text-to-Image Tasks:
    • While we wanted to undertake these experiments, unfortunately, the DataInf authors [2] have not released their benchmarks for the text-to-image tasks using diffusion models (see codebase here [3]). It was fairly non-trivial for us to undertake this comparison as a number of custom subjects and data splits are used in these experiments and are not available at the moment. However, we will aim to add these experiments as the benchmarks are released (in future preprint versions of our paper) and work with the DataInf authors to obtain more information regarding these tasks.
    • We would also like to emphasize that the main advantage of our work is the high computational efficiency obtained via the equivalent transformation to outlier analysis of the gradient space. Empirically, our method can achieve better or similar performance compared to other baselines but much more efficiently (as our results also show).
  • 2-D Gradient Features for MLP: Thank you, we can provide more information regarding this. Basically, for the simple 2-D half moons dataset, our MLP architecture is as follows: we first have a hidden linear 2x2 layer with two neurons (stacked with ReLU activation) whose output passes to another 2x2 hidden linear layer with two neurons (stacked with ReLU activation). Then, the output from this second layer goes to another linear 2x1 output layer with the sigmoid function applied for generating predictions. As this final layer has 2 parameters, we compute the sample gradient with respect to this output layer for ease of visualization as the gradient will be two-dimensional here too. Where visualization is not an issue, gradients from more layers can be combined/concatenated as well (as done in other experiments in our paper).
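As a rough illustration of why the plotted gradient is two-dimensional, the sketch below computes the per-sample gradient with respect to the 2x1 output layer only. It assumes bias-free linear layers and a binary cross-entropy loss, neither of which is stated in the thread, and the weights and the sample are hypothetical values chosen for illustration.

```python
import math

def relu(v):
    return max(0.0, v)

def linear(weights, inputs):
    # One row of weights per output unit; bias terms are omitted (assumption).
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]

def last_layer_gradient(w1, w2, w_out, x, y):
    """Return (p, grad): predicted probability and dLoss/dw_out for one sample."""
    h1 = [relu(v) for v in linear(w1, x)]   # first 2x2 hidden layer + ReLU
    h2 = [relu(v) for v in linear(w2, h1)]  # second 2x2 hidden layer + ReLU
    logit = sum(w * h for w, h in zip(w_out, h2))  # 2x1 output layer
    p = 1.0 / (1.0 + math.exp(-logit))
    # With sigmoid + binary cross-entropy (assumed loss), dLoss/dlogit = p - y,
    # hence dLoss/dw_out = (p - y) * h2, a 2-D vector.
    return p, [(p - y) * h for h in h2]

# Hypothetical weights and sample, chosen only for illustration.
identity = [[1.0, 0.0], [0.0, 1.0]]
p, grad = last_layer_gradient(identity, identity, [0.5, -0.5], x=[1.0, 2.0], y=1)
print(len(grad))  # 2
```

Because the output layer has two weights, each training sample maps to one point in a 2-D gradient space, which is what makes the scatter plot possible without dimensionality reduction.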

References:

  1. Chhabra, Anshuman, et al. "What Data Benefits My Classifier? Enhancing Model Performance and Interpretability through Influence-Based Data Selection." ICLR (2024).
  2. Kwon, Yongchan, et al. "DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models." ICLR (2024).
  3. https://github.com/ykwon0407/DataInf
Comment

Thank you to the authors for the detailed response. I still have the following concerns:

  1. Outlier Gradient Samples
    • First, I would like to clarify any potential misunderstanding from my review. When I referred to meaningful outliers, I was specifically pointing to potentially beneficial outlier gradient samples, rather than outlier samples in general.
    • As there is no theoretical guarantee that harmful samples are more likely to exhibit higher gradient norms compared to beneficial ones, it remains possible that some outlier gradient samples may be beneficial.
  2. Gradient Features
    • Thank you for the explanation regarding Figure 1. However, it raises another question: which experiments in this manuscript use gradients with respect to the last layer versus the full model? While Appendix B.2.6 states that “in most cases, we directly utilize the gradients obtained from the last layer of the model being considered,” this point remains unclear.
    • Additionally, what is the difference between using gradients from the last layer and the full model? How was this design choice made?
    • I believe that providing more direct empirical evidence, such as a histogram, illustrating how gradient norms differ between harmful and beneficial samples in a more realistic experimental setting (e.g., CIFAR-10N on ResNet) would significantly strengthen the argument. This would better demonstrate that analyzing outlier gradients is a practical approach to identifying harmful samples.
Comment

Dear Reviewer A4t7,

Thank you for your engagement, we are very grateful. Please find our responses below:

  1. Outlier Gradient Samples: We apologize for misunderstanding your statement, thank you for your clarification. We understand the reviewer's concern as there is no strict theoretical guarantee. However, we believe trimming a small number of outlier gradient samples leads to improved empirical performance (as our results also indicate). Moreover, this link between gradient outlyingness and sample benefit/detriment has been implicitly observed in some past work [1,2]. Also, as per the reviewer's excellent suggestion we undertake the proposed histogram experiment on CIFAR-10N / ResNet, which also points to this being an effective strategy for augmenting model performance (please see more details below).
  2. Details Regarding Layers for Influence Computation:
    • Thank you for the concrete question. We primarily utilize the last layers in experiments, following most prior work in this domain [3,4,5,6,7]. More specifically, (a) for ResNet we utilize the last fully connected layer, (b) for RoBERTa, similar to DataInf, we utilize all the attention layers (as they are LoRA fine-tuned), and (c) for Llama-2-13B, which has too many network parameters and layers, we only utilize the last attention layers for gradient computation. We will add these details to the revision.
    • To answer the reviewer's question, figuring out which layers are better suited for computing gradients for influence estimation is an ongoing research area (as gradients are used in the original influence formulation as well). Different papers reach different conclusions: for instance, [8] posits that the first few layers are better for language models, but this is in contrast with other work that finds using the full network or middle layers is better [9]. This is a challenging problem for influence estimation, especially as model sizes continue to increase. Most work currently uses the last few layers to balance computational efficiency and performance, but it remains an open research direction to assess the optimal layer(s) for a particular model/task (and doing so would be out of the scope of our current research focus). We will definitely add all this discussion to the revision to help readers further understand the research landscape.
  3. Gradient Norm Histogram (CIFAR-10N):
    • Thank you for the excellent recommendation regarding the experiment. We utilized our stored gradients for the three different CIFAR-10N noise settings for the ResNet model and plotted the histogram where the x-axis denotes the L1-norm gradient values (L2 norm results are similar). These results are present here: https://anonymous.4open.science/r/iclr-2025-D13D/norm1_all.png.
    • Note that the Aggregate noise setting has the lowest noise rate (9.03%), Random has a higher rate (17.23%) and the highest noise is for the Worst setting (40.21%). Clearly, as the noise rate increases, the number of detrimental samples should increase too. If our hypothesis is valid, we should see more samples with higher norm values for gradients of Worst, then fewer for Random, and then fewest for Aggregate. As our histogram result above shows, this is indeed the case. We would like to once again thank the reviewer for their suggestion and will add this experiment to the paper revision.
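The comparison described above can be sketched in a few lines: compute the L1 norm of each stored per-sample gradient, bin the norms, and compare how much mass sits in the high-norm tail for each noise setting. The gradient values, setting names, bin edges, and threshold below are toy placeholders, not the authors' CIFAR-10N data or results.

```python
def l1_norm(grad):
    return sum(abs(g) for g in grad)

def histogram(values, bin_edges):
    """Count values in half-open bins [edge_i, edge_{i+1}); the last bin is closed."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for i in range(len(counts)):
            lo, hi = bin_edges[i], bin_edges[i + 1]
            is_last = i == len(counts) - 1
            if lo <= v < hi or (is_last and v == hi):
                counts[i] += 1
                break
    return counts

def tail_fraction(values, threshold):
    """Fraction of samples whose norm exceeds the threshold."""
    return sum(v > threshold for v in values) / len(values)

# Toy per-sample gradients for two hypothetical noise settings.
toy_grads = {
    "lower_noise": [[0.1, -0.1], [0.2, 0.0], [0.1, 0.1], [1.5, -1.0]],
    "higher_noise": [[0.1, -0.1], [1.2, 1.1], [2.0, -0.5], [1.5, -1.0]],
}
edges = [0.0, 1.0, 2.0, 3.0]
for name, grads in toy_grads.items():
    norms = [l1_norm(g) for g in grads]
    print(name, histogram(norms, edges), tail_fraction(norms, 1.0))
```

Under the authors' hypothesis, the setting with more label noise should place a larger share of samples in the high-norm bins; the toy data here is constructed to show that pattern and carries no evidential weight.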

References:

  1. Kim, SungYub, Kyungsu Kim, and Eunho Yang. "GEX: A flexible method for approximating influence via Geometric Ensemble." NeurIPS (2024).
  2. Bejan, Irina, Artem Sokolov, and Katja Filippova. "Make Every Example Count: On the Stability and Utility of Self-Influence for Learning from Noisy NLP Datasets." EMNLP (2023).
  3. Pruthi, Garima, et al. "Estimating training data influence by tracing gradient descent." NeurIPS (2020).
  4. Koh, Pang Wei, and Percy Liang. "Understanding black-box predictions via influence functions." ICML (2017).
  5. Lee, Donghoon, et al. "Learning augmentation network via influence functions." CVPR (2020).
  6. Han et al. "Explaining Black Box Predictions and Unveiling Data Artifacts through Influence Functions." ACL (2020).
  7. Chen, Ruizhe, et al. "Fast model debias with machine unlearning." NeurIPS (2024).
  8. Yeh, Chih-Kuan, et al. "First is better than last for language data influence." NeurIPS (2022).
  9. Grosse, Roger, et al. "Studying large language model generalization with influence functions." arXiv preprint arXiv:2308.03296 (2023).

Kind Regards,

Authors.

Review
Rating: 5

Summary: This paper studies the outlier detection problem using influence function. By connecting the influence functions with outlier gradient detection, the authors propose a Hessian-free approach to avoid the computational burden of calculating the inverse of Hessian matrix, which is claimed to benefit large-scale machine learning applications. Some theoretical analyses are provided to support the claim. Moreover, demonstrative examples are provided to show the effectiveness of the proposed method, which is further evaluated on real-world datasets.

Strengths

  • The computational improvement brought by Hessian free outlier detection is a good contribution.
  • The performance improvement is satisfactory. According to the experimental results, the proposed method surpasses most of the baseline methods.
  • Various models and tasks are considered in the experiment.

Weaknesses

  • The writing of this paper does not meet the standard of an ICLR paper. There are too many unprofessional and unnatural words, which makes the paper not easy to comprehend.
  • The authors claim that the proposed method is Hessian-free, thus benefiting large-scale machine learning. However, the experiments only consider small-scale datasets.
  • The title of this paper is confusing. Outlier analysis seems to be related to anomaly detection or OOD detection; however, label noise is another problem, which contains noise on Y, not data X. So, what is the general goal of the proposed method? It would be helpful to clarify it.
  • It is not clear why the proposed method can use the gradient to replace the Hessian; more intuitive explanations are needed.

Questions

Please see the weaknesses part.

Comment

Dear Reviewer PgCg,

Thank you for your review, we appreciate your feedback and advice. Below we discuss the weaknesses listed:

  • Writing: We apologize for any lack of clarity. Could we request the reviewer to provide us with instances where writing can be improved and we shall make those changes right away in the revision? We would like to take this opportunity to learn from the reviewer and any other excellent papers; thank you.
  • Small-scale Datasets: Thank you for your point. For fair comparison and due to multiple experiment runs, as some influence functions are costly to compute (for instance, the exact Hessian takes cubic time in the worst case), we resorted to standard datasets used in the literature for influence functions (i.e. those used in computationally efficient influence function work such as DataInf [1]). However, we also undertook experiments with a subset of ImageNet in the current version of our paper; these are provided in Appendix C.6, Table 10. As can be seen from these results, outlier gradient analysis is tied as the top performer while being much more computationally efficient than the other approach achieving the same performance (i.e. DataInf). As an academic group, we have limited computational resources and are unable to afford repeated experiments on industry-scale datasets. We hope the reviewer will consider this point as well. In future work, we can aim to experiment with larger datasets wherever possible.
  • Title and Paper Goal: We apologize for the lack of clarity. Our main contribution in this work lies in uncovering the fact that detrimental samples (for instance, noisily labeled samples) that negatively affect model performance appear as outliers in the gradient space. We do this by crafting a very general hypothesis (Hypothesis 3.2), by drawing inspiration from how influence functions estimate whether training samples are beneficial or detrimental. The general nature of Hypothesis 3.2 results in wide applicability, as the extensive experiments we undertake on vision models, NLP models, and LLMs show. Moreover, no other work has explicitly shown this potential link (via influence functions) between gradient outlyingness and sample benefit/detriment, although this has been implicitly observed in some past work [2,3]. Note that any outlier algorithms can be used to detect the outlying gradients (we use 3 approaches in this work). This contribution and aim forms the basis for our title. However, based on the above explanation if the reviewer would like us to make any adjustments to the title for clarity, we are happy to do so.
  • Gradients and Hessian: Thank you for the question! We would like to clarify that our work is not replacing the Hessian with gradient terms. However, there are a few reasons why our approach is able to attain performance gains in detecting detrimental training samples for deep learning models without relying on the Hessian (and only using gradient terms):
    • First, our work is grounded in only the detrimental data detection task, and is not a general-purpose influence function. However, for the detrimental data identification task, we know that detrimental samples are fewer than beneficial samples (due to ERM). Moreover, in the general influence function formulation (Eq. 1) we know that for each sample $z_j$ whose influence is being estimated, the Hessian term $H_{\theta}^{-1}$ remains the same and only the third term $\nabla_{\theta} \ell(z_j)$ changes with $z_j$. Therefore, the effect of being a detrimental sample should show up in the gradient space and, as there are only a minority of such samples, they should appear as outliers. In this scenario, the Hessian does not contribute meaningful information (i.e. curvature) and the gradient space itself provides a good estimate for detrimental sample detection.
    • Second, in the original influence function formulation (Eq. 1), inverting the Hessian $H_{\theta}^{-1}$ presupposes that the loss function is strictly convex. This does not hold for deep learning models, which are non-convex, and here the Hessian can lead to inaccurate estimates. However, the gradient terms are still valid in the non-convex case, so utilizing them can lead to accurate estimations, especially when the problem setting is restricted (such as detrimental sample detection).
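The argument in the first point can be made concrete by writing out the three factors. The paper's Eq. 1 is not reproduced in this thread, so the form below is an assumption: it follows the standard influence-function expression of Koh and Liang, with $V$ the validation set, $n$ the number of training samples $z_i$, and a sign convention that may differ from the paper's.

```latex
\mathcal{I}(z_j)
  = -\underbrace{\Big(\sum_{v \in V} \nabla_{\theta}\, \ell(v)\Big)^{\top}}_{\text{first term: same for all } z_j}
     \;\underbrace{H_{\theta}^{-1}}_{\text{second term: same for all } z_j}\;
     \underbrace{\nabla_{\theta}\, \ell(z_j)}_{\text{third term: varies with } z_j},
\qquad
H_{\theta} = \frac{1}{n} \sum_{i=1}^{n} \nabla_{\theta}^{2}\, \ell(z_i).
```

Read this way, the first two factors form a fixed row vector, so any difference in influence between two training samples enters only through their gradients $\nabla_{\theta} \ell(z_j)$. Whether that justifies dropping the Hessian altogether is the point the reviewers contest.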

References:

  1. Kwon, Yongchan, et al. "DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models." ICLR (2024).
  2. Kim, SungYub, Kyungsu Kim, and Eunho Yang. "GEX: A flexible method for approximating influence via Geometric Ensemble." NeurIPS (2024).
  3. Bejan, Irina, Artem Sokolov, and Katja Filippova. "Make Every Example Count: On the Stability and Utility of Self-Influence for Learning from Noisy NLP Datasets." EMNLP (2023).
Review
Rating: 5

This paper introduces Outlier Gradient Analysis, a novel approach for identifying detrimental training samples that avoids the computationally intensive inverse Hessian calculation typically associated with influence functions. By reframing influence estimation as an outlier detection problem in the gradient space, the authors develop a framework for scalable, Hessian-free detection of detrimental samples across diverse model types. They validate this approach on both synthetic datasets and real-world applications, including vision (CIFAR datasets), NLP (GLUE tasks), and fine-tuning for large language models (LLMs), where it achieves performance on par with or surpassing other influence-based methods.

Overall, if certain theoretical and experimental clarifications are provided, this paper presents a practical approach to identifying detrimental samples in a data-centric framework, effectively bypassing the computational limitations of influence functions for large models. I would be willing to increase the score if the author response addresses the concerns.

Strengths

  1. This paper is well-written and easy-to-follow.
  2. The proposed method offers a significant computational advantage by eliminating the need for the Hessian matrix, which is especially important for deep learning models.
  3. The method is adaptable across various domains, including vision and NLP, and shows promising results in fine-tuning tasks for LLMs.
  4. Extensive experiments are conducted and the empirical results are competitive.

Weaknesses

  1. The differences between the proposed approach and previous gradient-based, Hessian-free methods for detecting detrimental samples, such as Gradient Tracing [1], need to be discussed in better detail. Currently, the two approaches appear very similar.
  2. Most experiments appear to be conducted on datasets with artificially corrupted labels, which may not fully capture the complexity of real-world noise. It would be valuable to see the proposed method tested on real-world noisy datasets to verify its robustness in more realistic conditions. For instance, visualizing the most detrimental samples identified within a large-scale, real-world dataset like ImageNet could better demonstrate the method's practical effectiveness and provide more convincing evidence of its utility.

[1] Pruthi, Garima, et al. "Estimating training data influence by tracing gradient descent." Advances in Neural Information Processing Systems 33 (2020): 19920-19930.

Questions

  1. How does the proposed method handle scalability challenges with LLMs where gradient dimensionality can be extremely high?
Comment

Dear Reviewer NLkn,

Thank you for your insights and detailed review, we appreciate it. Below we discuss some of the concerns raised:

  • Discussion on Hessian-free methods vs Outlier Gradient: Thank you for pointing this out. We will add more details about Gradient Tracing (also known as TracIn) [1] as well as other Hessian-free methods and compare with our approach in the revision. At a high level, Gradient Tracing simply takes the inner product of the first and third gradient terms in the influence function formulation, removing the Hessian from the influence function of Eq. 1, that is, $\langle \sum_{v \in V}\nabla_{\theta} \ell(v), \nabla_{\theta} \ell(z_j) \rangle$. On the other hand, outlier gradient analysis employs an outlier detection algorithm $\mathcal{A}$ to operate on the gradient space of samples to detect detrimental samples, i.e. $\mathcal{A}(\sum_{v \in V}\nabla_{\theta} \ell(v), \nabla_{\theta} \ell(z_j))$. Our formulation is more general as a result.
  • Other Datasets with Real-world Noise: Thank you for the concrete suggestion for improvement. Despite significant computational overhead, we had run experiments with ImageNet in Appendix C.6, although these use the same setting of artificial label noise (i.e. just flipping the sample labels to introduce noise). Here too (as Table 10 in Appendix C.6 shows), outlier gradient analysis is one of the top-performing baselines. Moreover, we would also like to mention that the CIFAR-10N and CIFAR-100N datasets [2] are not artificial label noise datasets, as the labels for these were obtained from multiple crowdsourced workers via Amazon Mechanical Turk. Hence, this is actually mimicking a scenario with real-world label noise introduced via human annotation.
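To make the contrast between the two scoring rules tangible, the sketch below computes both on toy gradient vectors. The inner-product score follows the expression quoted in the response. The outlier scorer is a deliberately simple stand-in (distance to the coordinate-wise median) rather than iForest or any algorithm used in the paper, and it looks only at training gradients. All function names and data are hypothetical.

```python
import math
import statistics

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def inner_product_scores(val_grads, train_grads):
    """Hessian-free score per training sample: <sum_v grad(v), grad(z_j)>."""
    dim = len(val_grads[0])
    g_val = [sum(g[d] for g in val_grads) for d in range(dim)]
    return [dot(g_val, g) for g in train_grads]

def outlier_scores(train_grads):
    """Stand-in detector: Euclidean distance to the coordinate-wise median."""
    dim = len(train_grads[0])
    center = [statistics.median(g[d] for g in train_grads) for d in range(dim)]
    return [math.dist(g, center) for g in train_grads]

def trim_top_k(scores, k):
    """Indices of the k highest-scoring samples, i.e. candidates for removal."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])

# Toy gradients: four samples near the origin and one far away.
train_grads = [[0.1, 0.0], [0.0, 0.1], [-0.1, 0.0], [0.0, -0.1], [3.0, 3.0]]
val_grads = [[1.0, 0.0], [0.0, 1.0]]

print(inner_product_scores(val_grads, train_grads))
print(trim_top_k(outlier_scores(train_grads), k=1))  # [4]
```

The inner-product rule ranks samples by alignment with the summed validation gradient, whereas the outlier rule ranks them by how atypical their gradient is relative to the rest of the training set. The two can agree on a sample such as index 4 here, but they need not in general.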

References:

  1. Pruthi, Garima, et al. "Estimating training data influence by tracing gradient descent." NeurIPS (2020).
  2. Wei, Jiaheng, et al. "Learning with Noisy Labels Revisited: A Study Using Real-World Human Annotations." ICLR (2022).
Comment

I thank the authors for their response. While it partially addresses my concerns, it appears that the "detrimental samples" referenced in the title are primarily associated with noisy labels. However, the approach does not seem to account for hard samples or out-of-distribution (OOD) samples, which may limit its broader applicability. Additionally, the distinction between the proposed method and TracIn remains unclear. Although the authors provide some high-level explanations, the primary difference seems to be the use of an additional classification algorithm in this paper, whereas TracIn relies on inner products and simple ranking. This makes the novelty and appeal of the proposed approach less compelling.

Overall, I have decided to lower my score, though I may reconsider and increase it back if the authors can provide additional insights or clarifications to address the concerns raised.

Comment

Dear Reviewer NLkn,

Thank you for your engagement, we appreciate it! We have provided clarifications to the concerns raised, and aim to alleviate any misunderstandings:

  • OOD/Distribution Shift Detrimental Samples: We would like to clarify that our approach (and setting) is not limited to noisy samples. In fact, we had undertaken experiments for distribution shift in the current version of the paper (Appendix C.8, Table 12) by employing the distribution shift experimental framework from [1], where the training and test distributions are either time shifted, location shifted, or both time and location shifted. Here, we utilized the semi-supervised OneClassSVM outlier detection algorithm (to adapt to the validation set) for outlier gradient analysis on the Folktables ACS-Income dataset, and it can be seen that our approach outperformed all the other methods across all three distribution shifts (please refer to Table 12), indicating that its applicability extends beyond noisy label scenarios.

  • Comparison with TracIn: As a minor note, all Hessian-free methods only operate on the gradient space, and hence will utilize some function that operates on the gradients. However, the contributions/focus of our paper/method and TracIn are distinct; we provide more details below:
    • Our main contribution in this work lies in uncovering the fact that detrimental samples that negatively affect model performance appear as outliers in the gradient space. We do this by crafting a very general hypothesis (Hypothesis 3.2), by drawing inspiration from how influence functions estimate whether training samples are beneficial or detrimental. This results in wide applicability across models, as the extensive experiments we undertake on vision models, NLP models, and LLMs show. Moreover, no other work has explicitly shown this potential link (via influence functions) between gradient outlyingness and sample benefit/detriment, although this has been implicitly observed in some past work [2,3].
    • On the other hand, TracIn [4] was conceived from the original influence function (Eq 1 in our paper) by simply removing the Hessian from the formulation. Unlike our method which is backed by the equivalence transformation to gradient space outlier analysis, TracIn makes a rather simplistic assumption and as a result, is not very performant (although faster than traditional influence methods). Compared to TracIn, our distinct and major contribution is that detrimental samples can be detected highly accurately (same or better performance than existing methods) while being much more computationally efficient than standard methods. As described in the paper, the reasons for why this should work are non-trivial and novel, and this is most likely the reason why even though TracIn was published 4 years ago in 2020, no follow-up work has observed this connection and utilized it effectively until our paper.
    • Another point of difference is that our method can easily extend to higher order gradients if at all needed (e.g. second order gradients can simply be concatenated and we can analyze the outlier space accordingly), but doing so might not work for TracIn as it is explicitly utilizing the influence formulation (Eq. 1) which is based on the first order Taylor expansion. For TracIn, it is also not clear if taking the inner product of higher-order gradients leads to a meaningful formulation for influence.

References:

  1. Chhabra, Anshuman, et al. "What Data Benefits My Classifier? Enhancing Model Performance and Interpretability through Influence-Based Data Selection." ICLR (2024).
  2. Kim, SungYub, Kyungsu Kim, and Eunho Yang. "GEX: A flexible method for approximating influence via Geometric Ensemble." NeurIPS (2024).
  3. Bejan, Irina, Artem Sokolov, and Katja Filippova. "Make Every Example Count: On the Stability and Utility of Self-Influence for Learning from Noisy NLP Datasets." EMNLP (2023).
  4. Pruthi, Garima, et al. "Estimating training data influence by tracing gradient descent." NeurIPS (2020).


Thank you once again, we are happy to discuss more if the reviewer has questions.

Have a great day!

--Authors.

Review
Rating: 3

Identifying detrimental samples is one of the core tasks in data-centric machine learning. While there exist various approaches to estimating the influence of data points on model performance, influence functions present a unique advantage in that they do not require re-training to assess data influence. Unfortunately, computing influence functions requires calculating the inverse of the Hessian matrix, limiting their applicability to over-parameterized deep models. This paper draws a connection between influence functions and outlier detection on the gradient space. The authors then hypothesize that outlier gradient analysis can be utilized as a Hessian-free proxy for influence functions. The preliminary experimental results on linear and non-linear synthetic datasets validate the effectiveness of gradient-based outlier detection in identifying mislabeled detrimental samples. The efficacy of the outlier gradient-based detrimental sample selection is further demonstrated in various noise regimes and in combination with Large Language Model training as well.

Strengths

  • The paper explores an important research problem: identifying detrimental training samples. The proposed method dramatically reduces the computational cost of detrimental sample selection by circumventing the need to calculate the Hessian and is more effective than methods that rely on Hessian approximation.

  • Although the proposed method mostly relies on an existing outlier detection algorithm, the authors do uncover a previously overlooked application of existing algorithms.

  • The proposed method is demonstrated to be effective across various domains and tasks. However, since the major benefit of the proposed method appears to be its computational efficiency, I would have appreciated more experiments and analyses regarding its application to LLMs. More details on this are elaborated in the Weaknesses section.

Weaknesses

  • As mentioned in the Strengths section, the method is a simple re-interpretation of existing methods. While I do not think this alone is a reason to accept or reject a paper, the authors currently do not present sufficient theoretical analysis or support for why the outlier detection algorithm works in this specific context. They do discuss conceptually (on a very superficial level) how these two research areas resemble each other, but their hypotheses lack the theoretical analysis needed to be convincing. Shedding light on why and how the first-order gradient is enough to anticipate the influence of training samples would be a much more significant contribution than a series of empirical results, which can easily be curated and which do no more than repeatedly showcase a new potential application of an existing algorithm.
  • One of the major traits of Large Language Models is their transferability and generalization to various downstream tasks. Can this method also be used to predict how they would fare across various downstream tasks?
  • It seems counter-intuitive that a method that uses only first-order information would be more accurate than methods that at least partially incorporate second-order information via approximation. Do the authors have any insight into why this could be the case?
  • Is the proposed method only effective at identifying distribution shifts and outliers induced by label noise? Can it identify covariate shifts in the input data?

Questions

Please refer to the Weaknesses section.

Comment

Dear Reviewer yKaF,

Thank you for your detailed review. We really appreciate your feedback and insights. With regard to the concerns raised, we offer some counterpoints for discussion:

  • Re-interpretation of existing methods:

    • Our main contribution in this work lies in uncovering the fact that detrimental samples (for instance, noisily labeled samples) that negatively affect model performance appear as outliers in the gradient space. We do this by crafting a very general hypothesis (Hypothesis 3.2), drawing inspiration from how influence functions estimate whether training samples are beneficial or detrimental. Owing to the general nature of Hypothesis 3.2, it is non-trivial to prove for all classes of learning models. However, this generality results in wide applicability, as our extensive experiments on vision models, NLP models, and LLMs show. Moreover, no other work has explicitly shown this potential link (via influence functions) between gradient outlyingness and sample benefit/detriment, although it has been implicitly observed in some past work [1,2]. While any outlier detection algorithm can be used to detect the outlying gradients (we use 3 approaches in this work), we do not feel that this contribution constitutes a re-interpretation of existing outlier analysis approaches, as our focus is entirely different.
    • The reviewer also mentioned that shedding light on why and how the first-order gradient is enough to anticipate the influence of training samples would be a much more significant contribution. However, this is equivalent to the question "why are influence functions suitable for detecting detrimental samples?", as even the original influence function formulation is based on a first-order Taylor expansion to estimate model performance without retraining. For complex models, higher-order extensions might provide more accurate estimates; however, they also require more computational effort, so first-order information is used in practice. In our paper, we provide a re-interpretation of influence functions for detrimental data detection, which significantly reduces the computational cost for large models.
  • First-order vs Higher-order gradients: Thank you for the question. There are a few cases where first-order estimates can be better than estimates that use the Hessian, which contains the second-order derivatives of the loss function.

    • First, we would like to emphasize that our work is grounded in only the detrimental data detection task, and is not a general-purpose influence function. However, for the detrimental data identification task, we know that detrimental samples are fewer than beneficial samples (due to ERM). Moreover, in the general influence function formulation (Eq. 1) we know that for each sample $z_j$ whose influence is being estimated, the Hessian term $H_{\theta}^{-1}$ remains the same and only the third term $\nabla_{\theta} \ell(z_j)$ changes with $z_j$. Therefore, the effect of being a detrimental sample should show up in the gradient space and as there are only a minority of such samples, they should appear as outliers. In this scenario, the Hessian does not contribute meaningful information (i.e. curvature) and the gradient space itself provides a good estimate for detrimental sample detection.
    • Second, in the original influence function formulation (Eq. 1), inverting the Hessian $H_{\theta}^{-1}$ implies that the loss function is strictly convex. This does not hold for deep learning models which are non-convex and here, the Hessian can lead to inaccurate estimates. However, the gradient terms are still valid in the non-convex case, so utilizing them can lead to accurate estimations, especially when the problem setting is restricted (such as detrimental sample detection).
  • Other LLM tasks: Thank you again for the great suggestion. Unfortunately, data valuation for LLMs is an evolving research area, and the field currently lacks exhaustive benchmarks. Due to the generative nature of LLMs, it is non-trivial to assess the influence of samples without labeled benchmarks for influential prompt identification. This is why we use the LLM benchmarks from prior work [3] in this paper. We plan to extend the analysis to more general downstream tasks in future work.

  • Covariate Shifts: Thank you for the excellent point. The current version of the paper already includes covariate shift experiments (Appendix C.8, Table 12), which employ the distribution shift experimental framework from [4], where the training and test distributions are time shifted, location shifted, or both time and location shifted. Here, we used the semi-supervised OneClassSVM outlier detection algorithm (to adapt to the validation set) for outlier gradient analysis on the Folktables ACS-Income dataset and found that our approach outperformed the other methods across all three covariate shifts (please refer to Table 12).
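
For readers following the first-order vs. higher-order argument above, the structure being appealed to can be written out. This is a sketch that assumes Eq. 1 follows the standard influence function form (sign and scaling conventions vary across papers); $V$ denotes a validation set, $n$ the number of training samples, and $v$ is shorthand introduced here, not notation from the paper.

$$
\mathcal{I}(z_j) \;=\; -\Big(\sum_{z \in V} \nabla_{\theta}\, \ell(z)\Big)^{\top} H_{\theta}^{-1}\, \nabla_{\theta}\, \ell(z_j),
\qquad
H_{\theta} \;=\; \frac{1}{n}\sum_{i=1}^{n} \nabla_{\theta}^{2}\, \ell(z_i)
$$

$$
\mathcal{I}(z_j) \;=\; v^{\top}\, \nabla_{\theta}\, \ell(z_j),
\qquad
v \;=\; -H_{\theta}^{-1} \sum_{z \in V} \nabla_{\theta}\, \ell(z)
\quad \text{(using the symmetry of } H_{\theta}\text{)}
$$

Under this form, $v$ does not depend on $j$, so the influence score is a fixed linear function of the per-sample gradient $\nabla_{\theta} \ell(z_j)$. This only restates the observation that the Hessian term is shared across samples; by itself it does not establish that detrimental samples appear as outliers in the gradient space.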

Comment

References:

  1. Kim, SungYub, Kyungsu Kim, and Eunho Yang. "GEX: A flexible method for approximating influence via Geometric Ensemble." NeurIPS (2024).
  2. Bejan, Irina, Artem Sokolov, and Katja Filippova. "Make Every Example Count: On the Stability and Utility of Self-Influence for Learning from Noisy NLP Datasets." EMNLP (2023).
  3. Kwon, Yongchan, et al. "DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models." ICLR (2024).
  4. Chhabra, Anshuman, et al. "'What Data Benefits My Classifier?' Enhancing Model Performance and Interpretability through Influence-Based Data Selection." ICLR (2024).
AC Meta-Review

The paper builds on the idea of influence functions but uses only gradient information to detect so-called "detrimental" samples in the training set, by applying an off-the-shelf outlier detection algorithm, iForest, in the gradient space. The toy example showcases some benefit of the proposed method over the original influence function. Moreover, experimental results show that the proposed method achieves good results on noisy label correction and NLP models at low computational cost. Although the method seems to be effective to some extent, it is unclear when and under what conditions it would work. Simply finding outliers among the gradients seems to be a very general thing to do, and the paper does not fully justify the approach. For example, as Reviewer PgCg pointed out, it is not clear why the Hessian term stays the same during detection or why such a formulation holds in non-convex deep learning scenarios. Hence, the current paper seems to propose an empirical method without full justification, and the decision is Reject.

Additional Comments on Reviewer Discussion

All reviewers except yKaF actively engaged in the rebuttal process. They shared the opinion that the proposed method has some practical and empirical value, but agreed that a sounder theoretical justification should be provided.

NLkn asked for a more convincing distinction from a previous method, TracIn, and still had reservations after the rebuttal. Regarding the authors' statement that "our distinct and major contribution is that detrimental samples can be detected with high accuracy (achieving similar or better performance than existing methods) while being much more computationally efficient than standard methods," NLkn commented that this seems to highlight empirical improvements rather than a clear theoretical difference.

PgCg argued that the authors' comments on "why the Hessian term stays the same during detection and why such formulation holds under non-convex deep learning scenarios" were not convincing.

Final Decision

Reject