PaperHub

Score: 7.8/10 · Poster · NeurIPS 2025 · 4 reviewers
Ratings: 5, 4, 5, 5 (min 4, max 5, std 0.4) · Confidence: 3.5
Novelty: 3.3 · Quality: 3.3 · Clarity: 3.5 · Significance: 3.0

KAIROS: Scalable Model-Agnostic Data Valuation

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29
TL;DR

A principled and scalable method for model-agnostic data valuation

Abstract

Data valuation techniques quantify each training example's contribution to model performance, providing a principled basis for data cleaning, acquisition, and selection. Existing valuation methods remain inadequate: model-based techniques depend on a single fitted model and inherit its biases, while algorithm-based approaches like Data Shapley scale poorly due to their need to train multiple models. Recent work has proposed model-agnostic alternatives based on Wasserstein distance between the training set and a clean reference set, but exact computation is expensive and approximations often misrank examples. We introduce KAIROS, a model-agnostic framework that values examples by their contribution to the Maximum Mean Discrepancy (MMD) between the training set and a clean reference distribution. Unlike Wasserstein methods, MMD admits a closed-form solution that requires no approximations and is scalable to large datasets. Additionally, KAIROS enables efficient online valuation: adding a new batch of $m$ examples requires only $O(mN)$ computation to update all scores, compared to $O(N^2)$ in prior work where $N$ is the training set size. Empirical evaluations on noise, mislabeling, and poisoning benchmarks show that KAIROS consistently outperforms state-of-the-art baselines in both accuracy and runtime. On ImageNet, KAIROS achieves up to 15 $\times$ speedup over the fastest baseline while maintaining superior data valuation quality. Our results demonstrate that model-agnostic methods can match or exceed model-based approaches in performance while scaling to large datasets.
Keywords
data valuation, mmd, model-agnostic, data-centric

Reviews and Discussion

Review (Rating: 5)

This paper tackles the problem of data valuation, which, in the context of this paper, is the problem of determining how influential a datapoint is to the performance of an ML model. This paper takes a "model-agnostic" approach, and does not consider any specific model. Instead, the paper measures the distance of the training data $Q$ to an idealized, clean test distribution $P$ using an integral probability metric (IPM). The intuition is that if the IPM is smaller, then any model will do better on test data. Specifically, the paper computes an influence function (IF): the influence of the point $x$ is the derivative evaluated at $\epsilon=0$:

$$\frac{d}{d\epsilon}\,\mathrm{IPM}\big((1-\epsilon)Q + \epsilon\delta_x,\ P\big)$$

Previous work (LAVA) has taken this approach as well. But the paper notes that these approaches use IPMs that are ill-suited for the task in that (1) they require approximations to be made to the IPM computation, and (2) the IPM computation is not unique. The paper recognizes that if the maximum mean discrepancy (MMD) is used as the IPM, neither of these problems apply. The paper shows how the IF of the MMD can be computed efficiently as more and more data is added to the test set, which is not true of previous work in this space. The paper also provides separate IF calculations for the un-labeled setting (i.e., the data are just covariates $x$) and the labeled setting (the data are covariates $x$ and labels $y$). The paper runs experiments to see if the proposed method can discover corrupted data and identify points that are/aren't valuable for model accuracy. In most cases, the proposed method outperforms existing methods, and where it does not outperform other methods, its accuracy is very close.
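To make the construction above concrete, the following is a minimal sketch of an influence score of this kind, assuming the IPM is the squared MMD with a Gaussian kernel; the closed-form derivative is our own derivation for illustration and may differ from the paper's exact IF by sign or rescaling conventions. The finite-difference check at the end only confirms that the closed form matches the definition as a derivative in $\epsilon$.

```python
import numpy as np

def gaussian_kernel(A, B, bw):
    # Pairwise Gaussian kernel k(a, b) = exp(-||a - b||^2 / (2 * bw^2)).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bw ** 2))

def mmd2(Q, P, bw):
    # Biased (V-statistic) estimate of MMD^2 between empirical samples Q and P.
    return (gaussian_kernel(Q, Q, bw).mean()
            - 2 * gaussian_kernel(Q, P, bw).mean()
            + gaussian_kernel(P, P, bw).mean())

def influence_mmd2(Q, P, bw):
    # d/d(eps) of MMD^2((1 - eps) Q + eps delta_x, P) at eps = 0, for every x in Q.
    k_QQ = gaussian_kernel(Q, Q, bw)
    k_QP = gaussian_kernel(Q, P, bw)
    return 2 * (k_QQ.mean(axis=1) - k_QQ.mean()
                - k_QP.mean(axis=1) + k_QP.mean())

rng = np.random.default_rng(0)
P = rng.normal(size=(100, 5))   # clean reference sample
Q = rng.normal(size=(200, 5))   # training sample ...
Q[:20] += 3.0                   # ... with a corrupted slice far from the reference

bw = 1.0
print("MMD^2 estimate:", mmd2(Q, P, bw))
scores = influence_mmd2(Q, P, bw)

# Finite-difference sanity check on the first training point.
def mmd2_mix(eps, x):
    # MMD^2 of the mixture (1 - eps) Q + eps delta_x against P, expanded by hand.
    A = gaussian_kernel(Q, Q, bw).mean()
    B = gaussian_kernel(x[None], Q, bw).mean()
    C = 1.0                     # k(x, x) for a Gaussian kernel
    D = gaussian_kernel(Q, P, bw).mean()
    E = gaussian_kernel(x[None], P, bw).mean()
    F = gaussian_kernel(P, P, bw).mean()
    return ((1 - eps) ** 2 * A + 2 * eps * (1 - eps) * B + eps ** 2 * C
            - 2 * ((1 - eps) * D + eps * E) + F)

eps = 1e-4
fd = (mmd2_mix(eps, Q[0]) - mmd2_mix(0.0, Q[0])) / eps
print(scores[0], fd)            # the two numbers should agree to roughly 1e-3
```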

Strengths and Weaknesses

Strengths

  • The paper is generally well-written
  • The overall idea of this paper is very similar to previous work. But, by providing careful analysis of the mathematical details behind previous work, the paper is able to recognize where improvements can be made. I think this is a great type of contribution to the field.
  • The idea of doing model-independent data valuation is fairly new, and I think this paper makes a good contribution to it, and thus derives an algorithm with some potentially interesting properties in the space of data valuation
  • In an experiment (Fig 8), the proposed method is much faster than similar competitors as the number of times data is added grows large. In fact, the experiment is so compelling that it made me realize I wasn't following how LAVA (previous work) worked! The paper mostly summarizes the main insight of KAIROS over LAVA as not needing particular approximations due to their choice of IPM. But this figure makes it seem like there are additional insights that allow KAIROS to run in constant time. I would really highlight the compute time differences as well.
  • This is minor, but the paper restates its theoretical results in the appendix before proving them -- this makes reading and checking them so much easier.

Weaknesses

Overall I think this is a pretty solid paper. My two main issues with it are its overall motivation and the practicality of the labeled side of the algorithm. I see the other issues raised below as less important.

Overall motivation

The introduction argues that model-agnostic data valuation is really important. I didn't follow the arguments here. The paper cites a growing legal need to have auditable algorithms. I agree with this claim. It then seems to immediately conclude that "[to bridge] the gap between commercial pressures and regulatory demands has spurred interest in new classes of model-agnostic valuation techniques". I wasn't following this logic. Why do we need model-agnostic techniques? It seems like legal or commercial data purchasing decisions are going to be made with a particular model (e.g., I believe the OpenAI v. New York Times legal case cited earlier in the introduction is about a specific model -- ChatGPT). So why do we need model-agnostic tools here? I think some more time should be dedicated to this argument. Without a clearer rationale, I still think the paper has some academic interest as a technical improvement over existing work. But I don't see a clear impact of the paper.

A smaller point about motivation comes from Figure 1. Fig 1 notes that previous work (LAVA) seeks to approximate the exact leave-one-out (LOO) difference under the Wasserstein-1 metric when datapoints are dropped out. KAIROS seeks to estimate the exact LOO in terms of the maximum mean discrepancy (MMD). Fig 1 shows that KAIROS does a much better job approximating the MMD LOO differences than LAVA does approximating the Wasserstein-1 LOO differences. Why is this a sign that KAIROS is doing a good thing? Maybe we actually wanted to approximate the Wasserstein-1 LOO differences, and LAVA is as close as we can get to doing that. Here's a (kind of facetious) version of this: I have a fabulous new algorithm that's faster than KAIROS and does a better job of approximating its distance. The algorithm is "return the number 5," and the distance that it approximates is "5".

Practicality of algorithm

It's not clear to me that the label-dependent version of the algorithm (E-MCMD, where we observe not only covariates $x$ but also labels $y$) is practically viable. I have two concerns:

  1. In practice, when applying to labeled data $(x,y)$, the method requires an estimate of the true distribution $P(y\mid x)$. But isn't the whole reason I want to acquire more data in a supervised setting (i.e., where we have labels $y$) that we can't estimate $P(y\mid x)$ very well? So what's the practical value of this method? This holds doubly in settings where we're trying to use IFs to detect corrupted data — if one can fit a good model for $P$ with the corrupted labels, why bother trying to clean the labels?
  2. The equation given for E-MCMD in the generic case (Eq 8) and then specialized to classifiers (Eqs 9 and 10) doesn't seem to depend on the distribution of the training data $Q$. How can it be that this method gives reasonable estimates without a dependence on $Q$? I would imagine that if $Q=P$, we should get very different estimates of data value than if $Q$ is wildly different from $P$.

To be clear, I don't see any such issues with the label-free version of the algorithm.

Theoretical claims

A few somewhat minor issues about the strength of the theoretical claims:

  1. "[theorem 1 shows that] pruning or down-weighting points with large influence scores provably tightens an out-of-distribution error bound" -- I don't agree with this summary of Thm 1. Thm 1 shows that decreasing the MMD and E-MCMD decreases an out-of-distribution error bound. But this isn't the same thing as removing high influence points. A high influence point means that, if we think of attaching a weight ϵ\epsilon to each datapoint, there is some decreased ϵ\epsilon' such that using weight ϵ\epsilon' would decrease MMD / E-MCMD. Pruning a datapoint is equivalent to taking ϵ=0\epsilon' = 0. But these aren't the same thing.
  2. "Symmetry [from Proposition 2] ensures that, for finite samples, points making the same marginal contribution to the MMD receive identical influence, yielding fair rankings" -- this seems to imply that Prop 2 shows KAIROS will give equal scores to datapoints that contribute the same amount. I don't think this is what Prop 2 shows. Prop 2 shows that the finite sample estimate of the IFs correctly ranks finite sample contributions to the finite sample estimate of the MMD. Why should a practitioner care that the finite sample estimate MMD^\widehat{MMD} will be correctly reflected by IF? MMD^\widehat{MMD} is noisy and could be completely erroneous; I would argue that a practitioner would only at all care about decreasing the actual MMD. And Prop 2 says nothing about IFs preserving the order of the actual MMD.

Unclear notation

  • $n$ and $N$ are used throughout the introduction and abstract without definition.
  • I don't think $k$ was ever stated to be the kernel of anything; it's just said it's a "kernel"
  • "we use IF() to denote the rescaled version of [influence functions]" but this scaling wasn't defined -- most of these are kind of obvious, but I actually don't know what this scaling is supposed to be.

References

There were a few places where I thought the references could have been better done:

  1. Line 105 cites the entirety of a textbook for a result -- please cite a specific result / page / theorem. I tried looking briefly through Chapter 20 (the source of the only mention of influence functions in the index) because I wasn't aware of a result this general. I couldn't find anything that jumped out.
  2. A key result that makes the issues with LAVA clear is just referenced as [57], which is an entire book on optimal transport. This doesn't help readers understand where this result came from, unless they want to read the entire textbook!
  3. The whole paragraph from lines 120-126 needs clearer references for some of its claims. E.g., where does the fact that the $\alpha > 1$ decay doesn't hold for Wasserstein-1 come from? I think this is a really key paragraph that has most of the theoretical insights driving the paper, so I really think it's important to get this part right.

Questions

  1. Why is the algorithm practical in the labeled setting? In particular, why would someone use this algorithm if they already have access to (an estimate of) $P(y|x)$?
  2. Why does the labeled setting not seem to depend on $Q$? Or is the dependence a little hidden?
  3. Why do we want model-agnostic data valuation?

Limitations

Yes

Final Justification

After reading the replies and the other reviews, I still think this is a strong paper that will do well at NeurIPS; I've kept my score as a 5 and have more confidence that this is a well-researched and interesting paper!

I specifically appreciate the proposal to add an experiment showing exactly how a noisy $P(y|x)$ is good enough for KAIROS to do a good job; I think this will be a good addition to the paper. I also think the proposed rewordings and clarifications will strengthen the paper as well.

Best, cX5q

Formatting Issues

None

Author Response

We thank the reviewer for their detailed review, insightful comments, and positive evaluation of our paper. We appreciate that the reviewer recognized our novel mathematical analysis, computational efficiency and empirical results.

W1 & Q3: Why do we need model-agnostic techniques? Why would legal and commercial decisions be made with respect to model-agnostic techniques?

Thank you for this question. We clarify the commercial and legal motivations for model-agnostic data valuation.

Commercial incentives: When training large models, there are two important reasons why data valuation matters:

  • Large models are often trained on noisy web-crawled data containing duplicates [10] and data poisoning attacks [11]. Therefore, it is essential that the data is cleaned before training.
  • At the same time, companies also pay substantial costs for high-quality datasets (OpenAI-Time Magazine, Shutterstock deals [1][2]). In this case, it is crucial to assess the value of the data being bought.

Why model-agnostic? Model training takes months and costs millions of dollars [3], so it is reasonable to assume the largest models are trained only once (or there is a large cost to train a model again). This means data valuation must happen before training begins. Traditional methods like Data Shapley are impractical because they require training multiple models. Other model-based approaches likewise require at least one training run before they can perform data valuation. Therefore, companies need model-agnostic approaches for data valuation.

Legal incentives: EU AI Act Article 10 [4] requires that training data be "relevant, sufficiently representative, and to the best extent possible, free of errors and complete" and the Act mandates assessment of "the availability, quantity and suitability of the data sets that are needed". Importantly, these requirements apply to the data itself, not to any specific model trained on it. This data-centric approach makes sense because datasets are commonly shared and reused across different AI applications and organizations. Model-agnostic methods are therefore essential for regulatory compliance.

W2: Why is KAIROS approximating MMD LOO better than LAVA approximating Wasserstein LOO a good thing?

Thanks for this thoughtful question. Model-agnostic methods define influence based on distributional distances, and both Wasserstein and MMD are valid choices. Our key contribution is providing a method that faithfully approximates its stated objective. LAVA faces inherent challenges with Wasserstein approximation due to the non-unique dual potentials and regularization bias, leading to deviations from true LOO rankings. KAIROS offers exact computation of MMD-based influence, providing users with transparent and reliable valuations. This transparency is particularly important in applications like data markets where stakeholders need to understand exactly what the valuations represent and how they were computed. We acknowledge that if someone specifically requires Wasserstein distance for their application, LAVA would be the appropriate choice. However, our empirical evaluations show that MMD-based data valuation generally works well across diverse tasks, and KAIROS consistently achieves superior performance compared to existing methods.

W3 & Q1: If we use validation data to estimate P(y|x), what is the point of data valuation when the goal is often to estimate P(y|x)?

Thank you for this important question. The validation set is generally small (300 samples in our experiments), so P(y|x) estimates are noisy. However, this empirically seems to provide sufficient signal to identify corrupted training points. After removing these points, we train on the cleaned, much larger training set to get better P(y|x) estimates and higher accuracy.

We demonstrate this on CIFAR-10 feature noise: A model trained only on validation data achieves 72% test accuracy, while training on the full training set after removing the bottom 20% (identified by KAIROS) achieves 88.5% test accuracy. Even selecting just the top 300 training samples (matching validation size) gives 87.3% accuracy, demonstrating that KAIROS improves sample quality beyond what the small validation set can provide. We will add this discussion and results to the final paper.
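For concreteness, the evaluation protocol described above might look roughly like the sketch below; `kairos_scores` is a hypothetical placeholder for the actual KAIROS scoring routine (here it merely ranks points by distance to the validation feature mean), and the logistic-regression classifier and the direction of the ranking are our assumptions, not details from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def kairos_scores(X_train, X_val):
    # Hypothetical placeholder: lower score = more valuable under this convention.
    # Here we simply rank training points by distance to the validation feature mean.
    return np.linalg.norm(X_train - X_val.mean(axis=0), axis=1)

def accuracy(model, X, y):
    return (model.predict(X) == y).mean()

def run_protocol(X_train, y_train, X_val, y_val, X_test, y_test):
    # Baseline: train only on the small clean validation set.
    base = LogisticRegression(max_iter=1000).fit(X_val, y_val)

    # Cleaning: drop the 20% of training points ranked least valuable,
    # then train on the remaining (much larger) training set.
    scores = kairos_scores(X_train, X_val)
    keep = scores <= np.quantile(scores, 0.8)
    cleaned = LogisticRegression(max_iter=1000).fit(X_train[keep], y_train[keep])

    return accuracy(base, X_test, y_test), accuracy(cleaned, X_test, y_test)
```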

W4 & Q2: The equation for influence of E-MCMD (Eq 8) does not seem to depend on training distribution Q. Is the dependence hidden?

Thank you for this insightful question. The dependence on Q appears in two ways:

First, in our derivation of this influence (Appendix B.4, Line 634), there is a Q-dependent constant term. We omit this term in practice because it only affects absolute magnitudes, not relative rankings. However, the absolute value of the influence (that determines whether a small perturbation increases or decreases the E-MCMD) does depend on Q directly.

Second, the expected E-MCMD influence at any point $x$, $E_{y \sim Q(y|x)}[\mathrm{IF}_{\mathrm{EMCMD}}(x,y;P,Q)]$, depends on Q's conditional distribution. This means that when Q is noisier (more corrupted labels), more points will have high negative influence values. This allows KAIROS to detect label corruption as shown in our experiments.

W5: Description of Theorem 1

Thank you for pointing this out. We will change the wording to: "Under mild regularity assumptions, the expected train–validation loss gap is bounded above by the sum of the marginal MMD and conditional E-MCMD. Consequently, removing a point which decreases this distance provably tightens the out-of-distribution error bound for any learning algorithm. The influence is a first-order approximation of the effect of removing a point, suggesting that removing points with large influence scores could decrease the out-of-distribution error.”

W6: Symmetry property (Proposition 2)

Thank you for this question. We will clarify that points making equal contributions to $\widehat{\mathrm{MMD}}$ receive identical $\widehat{\mathrm{IF}}$ scores. Since the true MMD is unknown in practice and $\widehat{\mathrm{MMD}}$ has established convergence bounds to the MMD [5], this consistency property is meaningful. For the true MMD and IF, defining "equal contribution" would require circular reference to the influence function itself, making such a statement trivial. We will revise our wording to reflect our claim correctly.

W7: Unclear Notation

We will make changes in the final revision to make the notation clear - defining $n$, $N$, and the kernel $k$, and clarifying that the rescaling is up to additive and positive multiplicative constants.

W8: References

Thank you for pointing this out. We will cite specific pages and results in the final version.

  1. For Line 105, the $O(\frac{1}{n^2})$ error follows from a Taylor expansion of $d(P, Q_\varepsilon)$ around $\varepsilon = 0$. The second-order term is $O(\varepsilon^2)$. For leave-one-out with $n$ samples, $\varepsilon = \frac{1}{n-1}$, giving an $O(\frac{1}{n^2})$ error. We will cite Chapter 20, Page 291 of the textbook [6], which shows a similar expansion.

  2. For the non-uniqueness of Kantorovich potentials, we will cite Page 256, Remark 10.30 of the book [7]. We will also cite [8] and Remark 2.3 in [12] which discuss this in detail. Further, we will add this concrete example to illustrate non-uniqueness:

Consider $P = \frac{1}{2}\,\mathrm{Uniform}(0,1) + \frac{1}{2}\,\mathrm{Uniform}(10,11)$ and $Q = \frac{1}{2}\,\mathrm{Uniform}(2,3) + \frac{1}{2}\,\mathrm{Uniform}(12,13)$, where $\mathrm{Uniform}(a,b)$ denotes the uniform distribution on the interval $[a,b]$. Both of the following functions are optimal dual solutions for Wasserstein-1 (a short numerical check of this example is sketched after this list):

  • f₁(x) = x
  • f₂(x) = x if x < 5, f₂(x) = 10-x if 5 ≤ x < 7, f₂(x) = x-4 if x ≥ 7

This demonstrates that optimal Kantorovich potentials are not unique.

  3. We will also make the following clarifications in lines 122-126:
  • Change "Crucially, this benign decay does not hold for most IPMs" to "Crucially, this benign decay need not hold for most IPMs."
  • Add a reference to Pages 39-40 of [9], which provides an example where $f^{*} - f^{*}_{\varepsilon}$ decays at a rate with exponent less than 1. In our case, this corresponds to a uniform distribution $P$ with $Q$ and $Q_\varepsilon$ being the pushforward measures under the potentials $|x|$ and $\max(|x|, \varepsilon^{\frac{1}{d}})$. The analysis shows that $O(f^{*} - f^{*}_{\varepsilon}) = \varepsilon^{\frac{d+2}{2d}}$, a rate with exponent less than 1 when $d > 2$.
  • For the claim "In the Wasserstein-1 case, $f^{*}$ corresponds to a Kantorovich potential, which is non-unique," we will provide the specific citations discussed in point 2 above to establish that Kantorovich potentials need not be unique, which may lead to non-deterministic influence values for LAVA.
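As a numerical sanity check on the non-uniqueness example in point 2 (the sampling setup below is ours), both candidate potentials are 1-Lipschitz and attain the same value of the Kantorovich-Rubinstein dual objective $\mathbb{E}_{Q}[f] - \mathbb{E}_{P}[f]$, which equals $W_1(P, Q) = 2$ here since each half of $P$ is translated by 2 to reach the corresponding half of $Q$:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_P(n):
    # P = 1/2 Uniform(0,1) + 1/2 Uniform(10,11)
    lo = np.where(rng.random(n) < 0.5, 0.0, 10.0)
    return lo + rng.random(n)

def sample_Q(n):
    # Q = 1/2 Uniform(2,3) + 1/2 Uniform(12,13)
    lo = np.where(rng.random(n) < 0.5, 2.0, 12.0)
    return lo + rng.random(n)

# The two candidate Kantorovich potentials; both are 1-Lipschitz.
def f1(x):
    return x

def f2(x):
    return np.where(x < 5, x, np.where(x < 7, 10 - x, x - 4))

P, Q = sample_P(1_000_000), sample_Q(1_000_000)

# Kantorovich-Rubinstein dual objective E_Q[f] - E_P[f]; its optimum is W1(P, Q) = 2.
for name, f in [("f1", f1), ("f2", f2)]:
    print(name, round(f(Q).mean() - f(P).mean(), 2))
# Both print approximately 2.0, so both potentials attain the optimal dual value.
```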

[1] Reuters. Time, OpenAI sign multi-year content deal, June 2024. Published 27 Jun 2024.

[2] Sherwood News. AI deals to make up a third of Shutterstock's revenue by 2027, June 2024.

[3] Cottier, Ben, et al. "The rising costs of training frontier AI models." arXiv preprint arXiv:2405.21015 (2024).

[4] Ortigosa, Adrián Palma. "Data and Data Governance (Article 10)." The EU Regulation on Artificial Intelligence: A Commentary. Wolters Kluwer Italia, 2025.

[5] Gretton, Arthur, et al. "A kernel two-sample test." Journal of Machine Learning Research 13.1 (2012).

[6] Van der Vaart, A. W. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.

[7] Villani, Cédric, et al. Optimal Transport: Old and New, volume 338. Springer, 2008.

[8] Staudt, Thomas, Shayan Hundrieser, and Axel Munk. "On the uniqueness of Kantorovich potentials." SIAM Journal on Mathematical Analysis 57.2 (2025).

[9] Letrouit, Cyril. "Lectures on quantitative stability of optimal transport." (2025).

[10] Lee, Katherine, et al. "Deduplicating training data makes language models better." arXiv preprint arXiv:2107.06499 (2021).

[11] Carlini, Nicholas, et al. "Poisoning web-scale training datasets is practical." 2024 IEEE Symposium on Security and Privacy (SP). IEEE, 2024.

[12] Del Barrio, Eustasio, and Jean-Michel Loubes. "Central limit theorems for empirical transportation cost in general dimension." (2019).

Comment

Hi,

The AC just wanted me to point out for clarity that I updated my response in the "Final Justification" section of my review. Overall, thanks for the replies, and I think these changes will make the paper even stronger!

Best, cX5q

Comment

Thank you for your careful review and for your positive feedback on the paper. We appreciate your reply and are glad the changes strengthen the paper.

Review (Rating: 4)

The authors consider the problem of model-agnostic data valuation via distributional influence functions, which approximate the leave-one-out error with error inversely proportional to the squared number of training samples. In contrast to previous work based on KL divergence or integral probability metrics like the Wasserstein-1 distance, which do not admit tractable influence functions, the authors consider the MMD, which they show admits a closed-form influence function, bypassing explicit computation of the distributional distance. They also show that the influence function is symmetric, for fair rankings, and satisfies a density separation property, meaning there exists a single global threshold that separates the region where the validation distribution dominates from the region where the training distribution dominates. They further adapt the framework to corruptions in the label space by also considering the maximum conditional mean discrepancy, taking the net discrepancy as a convex combination of the MMD on the input space and the MCMD on the labels, so that the resulting influence function is also a convex combination, which upper bounds the train-validation loss gap. They also show a theoretical link between this net discrepancy and model performance as a measure of generalization. They further show the method is efficient in the online setting, since the complexity of updating the influence scores at any time is the number of samples seen so far times the batch size. The authors verify its effectiveness empirically on small-scale classification datasets (including both image and text datasets).

Strengths and Weaknesses

Strengths:

  1. The contribution of using the MMD metric and thereby deriving a closed-form influence function that satisfies the desired properties and is efficient to implement is nice, since the previous works on KL or integral probability metrics like Wasserstein-1 all rely on approximations and ignore the critic shift term.
  2. The analysis is easily adaptable to label-space corruptions, which is also interesting and useful.
  3. The connection of the net discrepancy to generalization is also an interesting insight.
  4. The empirical results show that their method can effectively rank noisy samples and is better than the other baselines for both feature noise and label-space corruptions, especially when the percentage of inspected data is low. Their method also works under adversarial attacks, as shown in Figure 4.
  5. The analysis of test accuracy when the least/most valuable points are removed on the STL/IMDB datasets also supports the effectiveness of their method.

Weaknesses:

  1. It would have been better if the authors had also covered somewhat larger-scale datasets (like ImageNet).
  2. There are works like FSR [2] that do not require a clean reference set, although they dynamically update one based on the model. It might be interesting to see a comparison, since FSR was also proposed for the corrupted-label setup (I am unsure whether this would be a fair comparison, but I wanted to mention comparisons with methods along these lines, as well as with other model-agnostic sample reweighting methods like MAPLE [1]).

[1] Model Agnostic Sample Reweighting for Out-of-Distribution Learning. ICML 2022. [2] Learning Fast Sample Re-weighting Without Reward Data. CVPR 2021.

Questions

Please see the weaknesses section.

Limitations

Yes, they have mentioned in Section 5.

Final Justification

I would like to maintain my current rating. Although the authors have done a good job comparing with MAPLE, it would have been better if the effectiveness of their method could be demonstrated on somewhat larger datasets to better understand the performance difference, e.g., by fine-tuning stronger pretrained models.

Formatting Issues

None.

Author Response

We thank the reviewer for their valuable review and positive assessment of our contributions. We are glad the reviewer appreciated the novelty of our closed-form MMD-based influence function, its theoretical properties, and empirical effectiveness, particularly under noisy and adversarial conditions.

W1: It would have been better if the authors had covered somewhat larger-scale datasets

Thank you for your suggestion! The current baselines in the paper are challenging to run on larger datasets, due to runtime (model-based methods) or memory consumption issues (LAVA). However, we conduct additional experiments comparing KAIROS with SAVA (a scalable version of LAVA) on the ImageNet dataset, which contains 1.28M training examples and 1000 classes. We use a ResNet50 encoder to extract feature representations from ImageNet, and we use an A100 GPU with 40GB VRAM for computing data values.

Our results demonstrate that KAIROS achieves better performance and efficiency on this large-scale dataset. In particular, when facing a large number of classes, the pair-wise conditional Wasserstein distance computation in SAVA becomes expensive, leading to a 15x longer runtime than KAIROS. Please refer to our response to W1 & Lim1 of Reviewer UAd2 for the detailed results and analysis.

W2: On differences and comparison on sample reweighting methods like FSR and MAPLE

Thank you for suggesting this comparison. FSR and MAPLE are model-based sample reweighting methods that differ fundamentally from KAIROS. Unlike KAIROS, they do not compute data values but rather learn reweighting functions for specific training objectives. While they can be applied to corrupted label detection as you mentioned, KAIROS offers broader applicability across tasks such as detecting harmful fine-tuning data and data poisoning attacks. These methods are also not model-agnostic and inherit the limitations of model-based approaches that we discussed earlier. To provide a comprehensive evaluation, we conducted experiments comparing KAIROS with MAPLE on CIFAR-10 feature and label noise detection tasks. We use the target labels as group labels for MAPLE since no explicit group labels are available for our datasets. We report the AUC of the fraction of covered corrupted data vs the fraction of inspected data (we will include full plots in the final paper, but images are not allowed in rebuttals):

Table 6: AUC for CIFAR-10 feature noise and label noise tasks

Method                          AUC Feature Noise   AUC Label Noise
Data OOB                        0.727               0.784
KNN Shapley                     0.723               0.751
LAVA                            0.837               0.529
KAIROS (Gaussian)               0.857               0.791
MAPLE                           0.347               0.828
Maximum possible (theoretical)  0.900               0.900

The results agree with the reviewer’s intuition. MAPLE performs the best on label noise detection but has very poor performance on feature noise detection. This illustrates that while reweighting methods can be effective for specific corruption types like label noise, KAIROS provides consistent performance across diverse tasks including feature noise, label noise, adversarial attacks, and harmful fine-tuning detection. We will include this comparison in the revised paper.

Review (Rating: 5)

The paper proposes a model-agnostic method for data valuation based on an MMD-based distributional influence score. The method is efficient due to a closed-form expression for data values based on the kernel trick, and is compatible with a streaming batch implementation.

Strengths and Weaknesses

Pros

  • The paper's demonstration of the deficiency of other methods in approximating LOO is a valuable contribution.
  • The method seems computationally efficient and streaming compatible.

Cons

  • Experimental performance is good, although the evaluation is fairly basic and could be improved.
  • The method still depends on training a classifier for label conditional MMD so is not technically model-free.
  • The method is limited to classification tasks and assumes a clean reference dataset is available to compute MMD

Questions

  • How does this method compare against more recent data attribution methods for large models like Trak, Logra?
  • How robust is the method to distribution shift between training and validation set? Distributional measures may have issues in this scenario where there is no clean reference available. How robust is the method to a noisy or biased reference set?
  • How to choose the kernel and bandwidth in practice for different datasets? How to balance $\lambda$? Is the median heuristic always a sensible choice?
  • How would the method perform on more nuanced issues like selection bias, concept drift, or data redundancy?
  • How does this work differ from and related to "Data Distribution Valuation", which also uses MMD for data valuation? https://arxiv.org/abs/2410.04386

Limitations

The experiments could be improved. There needs to be a comparison with more recent and competitive baselines such as https://arxiv.org/abs/2406.01130. Larger models and more realistic datasets need to be evaluated to show the robustness of the proposed method.

Final Justification

I appreciate the authors' response. Based on the additional experimental results and comments, I raise my score to accept.

Formatting Issues

N/A

Author Response

We thank the reviewer for their constructive and insightful feedback, and appreciate their positive evaluation of our theoretical analysis and computational efficiency.

W1 & Lim1: Experimental performance is good, although the evaluation is fairly basic and could be improved. There needs to be a comparison with more recent and competitive baselines such as SAVA.

Thank you for the feedback! SAVA was primarily built to address the memory consumption issue of LAVA, making it scalable to larger datasets. Therefore, we compare KAIROS with SAVA on ImageNet, which has 1.28M training samples and 1000 classes. We use a ResNet50 encoder to extract feature representations from ImageNet, and we use an A100 GPU with 40GB VRAM for computing data values. We report the AUC of the fraction of covered corrupted data vs the fraction of inspected data (we will include full plots in the final paper, but images are not allowed in rebuttals):

Table 2: KAIROS and SAVA on ImageNet

Feature Noise Detection

Method    Runtime       AUC
KAIROS    7 min 56 s    0.869
SAVA      1 hr 58 min   0.817

Label Noise Detection

Method    Runtime       AUC
KAIROS    7 min 52 s    0.861
SAVA      1 hr 58 min   0.484

Our results show KAIROS outperforms SAVA in both efficiency and effectiveness. Since SAVA is built based on LAVA, similar to the results in Figures 2 and 3, KAIROS outperforms SAVA in both experiment settings, especially in label noise detection.

The efficiency improvement stems from KAIROS's closed-form solution, making it more suitable for batch-based GPU acceleration compared to Sinkhorn computations. In contrast, SAVA (and LAVA) needs to compute pairwise conditional Wasserstein distances between P(x|y) for every pair of classes, requiring $\frac{1000 \times 999}{2}$ computations, thus leading to a high runtime. Specifically, the computation of pairwise conditional Wasserstein distances takes more than 80% of the overall runtime.

W2: The method still depends on training a classifier for label conditional MMD so is not technically model-free.

Our method is model-agnostic with respect to the training data being valued. We never train models on the training set. We only train a single classifier on the validation set (3% of the training size in our experiments) to estimate P(y|x). This preserves the core benefits of model-agnostic approaches: 1) avoiding multiple model retrainings (unlike Data Shapley or Data-OOB) and 2) enabling data preprocessing before training the downstream model. Further, this makes our approach significantly faster than model-based baselines (Figure 7, page 9).

W3a: The method is limited to classification tasks.

The conditional component (E-MCMD) currently targets classification, but our framework applies broadly to many other tasks which do not have a target (y), such as image poisoning attacks and harmful LLM fine-tuning data detection (Figure 4). The marginal MMD component works for any data type. We acknowledge the classification limitation for labeled tasks and identify regression extension as future work (Section 5).

W3b & Q2a: This method assumes a clean reference set. Distributional measures may have issues in this scenario where there is no clean reference available. How robust is the method to a noisy or biased reference set?

Thank you for this important question. For data valuation, one needs to measure utility with respect to something. In our case, we assume access to a small sample from the target distribution, which we call the validation (reference) dataset. This assumption is shared with LAVA. Further, model-based methods also value data based on performance on the validation set which is generally assumed to be representative of the test distribution.

We test the robustness of KAIROS and other baselines to a noisy reference set. We conduct additional experiments by adding noise to the validation set for the feature noise CIFAR-10 task. We consider two settings where we randomly corrupt 3% and 7% of the validation samples respectively.

Table 3: Robustness to Noisy Reference Set - AUC Performance

Method        No Validation Noise   3% Validation Noise   7% Validation Noise
Data OOB      0.727                 0.711                 0.711
KNN Shapley   0.723                 0.705                 0.705
LAVA          0.837                 0.635                 0.601
KAIROS        0.857                 0.857                 0.856

Our results show that KAIROS is robust and maintains original performance in the presence of noisy reference sets. Moreover, KAIROS outperforms both model-based and model-agnostic baselines across different noise levels.

Q1: How does this method compare against more recent data attribution methods for large models like Trak, Logra?

Thank you for this question. TRAK and LOGRA are model-based attribution methods that require training models (TRAK often trains multiple models, while LOGRA provides optimizations to make influence functions more efficient for large models). In contrast, KAIROS is model-agnostic and requires no training. We test KAIROS against these methods on feature-noise detection and label-noise detection tasks using CIFAR-10 with the same experimental setup described in the paper. Since we cannot share plots in the rebuttal, we report the AUC of the curves below. We will include the full plots in the final version.

Table 4: AUC performance comparison on CIFAR-10 feature and label noise tasks

Method        Feature Noise   Label Noise
Data OOB      0.727           0.784
KNN Shapley   0.723           0.751
LAVA          0.837           0.529
KAIROS        0.857           0.791
TRAK          0.638           0.743
LOGRA         0.565           0.663

For both experiments, KAIROS achieves the highest AUC compared to baselines, including TRAK and LOGRA, particularly in the feature noise detection setting.

Q2b: How robust is the method to distribution shift between training and validation set?

Thank you for this question. KAIROS is designed for distribution shift between training and validation distributions. The method measures how training points from noisy distribution Q contribute to distance from clean validation distribution P. All our experiments involve such shifts where training sets and validation sets come from different distributions.

Q3: How to choose the kernel and bandwidth in practice for different datasets? How to balance λ? Is the median heuristic always a sensible choice?

We agree that kernel and bandwidth selection is an important practical consideration. The median heuristic has proven effective across all our experimental tasks and is widely adopted in MMD literature [2-4]. We also experiment with a polynomial kernel of degree 2, which demonstrates excellent performance. Please refer to the response to W1 of reviewer g4SV for detailed experimental results. For the balancing parameter λ, we set it to align the variance of the feature and label components in Equation 12 (see Appendix D). While the median heuristic works well in practice, we acknowledge that adaptive kernel selection could be beneficial for some applications and identify it as future work (Section 5). We will add results comparing different kernel choices to the revised paper.
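As an illustration of the bandwidth selection step, here is one common form of the median heuristic (a sketch under our own choices of subsampling and kernel parameterization; the exact variant used in the paper may differ):

```python
import numpy as np

def median_heuristic_bandwidth(X, max_points=2000, seed=0):
    # Gaussian-kernel bandwidth set to the median pairwise Euclidean distance,
    # computed on a random subsample to keep the quadratic cost manageable.
    rng = np.random.default_rng(seed)
    if len(X) > max_points:
        X = X[rng.choice(len(X), size=max_points, replace=False)]
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    return float(np.median(d[np.triu_indices(len(X), k=1)]))

features = np.random.default_rng(1).normal(size=(500, 64))
print(median_heuristic_bandwidth(features))
```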

Q4: How would the method perform on more nuanced issues like selection bias, concept drift, or data redundancy?

Thank you for suggesting this direction. We explore selection bias where subgroups are under-represented in training data. We use the ACS Income dataset from WhyShift [1], which predicts income from demographic attributes. This dataset exhibits geographic bias where some states like Puerto Rico (PR) are severely under-represented while others like California (CA) are over-represented. Liu et al. [1] also show that models trained predominantly on CA data fail to generalize to PR. We simulate selection bias by creating a training set with 80% CA and 20% PR samples (1000 total), while our reference validation set contains balanced 50% CA and 50% PR samples (300 total). We evaluate how well KAIROS and baselines identify points from the under-represented group (PR). Data attribution methods should assign high influence to under-represented samples. We report the AUC of the fraction of under-represented samples detected vs top k fraction of data chosen.

Table 5: Selection Bias Detection Performance

Method        AUC
LAVA          0.546
KNN Shapley   0.494
Data OOB      0.326
KAIROS        0.855

The results show that KAIROS considerably outperforms both model-agnostic and model-based baselines. We will add these results to the final version of our paper.

For data redundancy, KAIROS rankings remain stable when datasets contain duplicates. We empirically verified this by duplicating the training dataset up to 10 times and confirming that relative rankings are preserved, demonstrating robustness to redundant data.

Q5: How does this work differ from and related to Data Distribution Valuation?

Data Distribution Valuation values entire distributions (or datasets) by computing MMD between each dataset and a reference (or uniform average of all datasets). Our work addresses a fundamentally different problem: quantifying individual datapoints' contributions to the MMD between training and validation distributions via influence functions. This measures how removing a single point changes the distributional distance, not the MMD between that point and the reference distribution. We will include a detailed comparison in the appendix of the final version.

[1] Liu et al. "On the need for a language describing distribution shifts: Illustrations on tabular datasets." NeurIPS 2023

[2] Shekhar et al. "A permutation-free kernel two-sample test." NeurIPS 2022

[3] Gretton et al. "A kernel two-sample test." JMLR 2012

[4] Doran et al. "A Permutation-Based Kernel Conditional Independence Test." UAI 2014

Comment

I appreciate the authors' response. Based on the additional experimental results and comments, I raise my score to accept.

Comment

Thank you for your thoughtful review and for engaging with our rebuttal and additional results. We greatly appreciate your constructive feedback and the updated assessment.

Review (Rating: 5)

The paper proposes a framework for assessing the value of individual training examples without relying on a specific model. It introduces a closed-form influence function based on Maximum Mean Discrepancy, which approximates leave-one-out utility and avoids the computational overhead of retraining models. KAIROS offers theoretical guarantees such as symmetry and density separation, and supports efficient online updates. Empirical evaluations show that KAIROS performs competitively across various tasks, including noise detection and data pruning, and achieves notable runtime improvements in the online setting.

Strengths and Weaknesses

Strengths:

  • The paper introduces a novel model-agnostic data point valuation method that addresses a major weakness of the previous LAVA approach, which introduces a bias of order O(dv log(1/v)) that may cause inaccurate rankings.
  • The method leverages Maximum Mean Discrepancy and the kernel trick to derive a closed-form influence score, providing both conceptual clarity and practical simplicity.
  • The approach is supported by theoretical guarantees. It is proved that the MMD provides an upper bound on the validation loss, and the method guarantees fairness and density separation.
  • The effectiveness of the approach is supported by experiments as well.
  • In the online setting, the method achieves significantly higher computational efficiency due to its closed-form structure.

Weaknesses:

  • The performance of KAIROS might depend on kernel choice.
  • Although KAIROS is highly efficient in the online setting, its computational complexity in the offline setting is comparable to that of LAVA.

Questions

None.

Limitations

Yes.

Final Justification

I maintain my original evaluation.

Formatting Issues

None.

Author Response

We thank the reviewer for their thorough review and strong support of our work. We appreciate your recognition of our theoretical analysis, empirical validation, and computational efficiency.

W1: The performance of KAIROS might depend on kernel choice.

Thank you for this question. Any MMD-based method depends on kernel choice, and we acknowledge this consideration for practitioners. To address concerns of sensitivity to the kernel, we conducted additional experiments using a polynomial kernel of degree 2 on the feature-noise CIFAR-10 task. We report the AUC of the fraction of covered corrupted data vs the fraction of inspected data (we will include full plots in the final paper, but images are not allowed in rebuttals):

Table 1: AUC for detecting feature noise in CIFAR-10

Method                          AUC
Data OOB                        0.727
KNN Shapley                     0.723
LAVA                            0.837
KAIROS (Gaussian)               0.857
KAIROS (Polynomial)             0.856
Maximum possible (theoretical)  0.900

The results show that KAIROS with polynomial kernels achieves nearly identical performance to Gaussian kernels (0.856 vs 0.857) and outperforms all baselines. Further, as mentioned in Section 5, adopting learned kernels that adapt to specific tasks is a promising direction for future research.

W2: Although KAIROS is highly efficient in the online setting, its computational complexity in the offline setting is comparable to that of LAVA.

Thank you for recognizing KAIROS's superior efficiency in online settings, where we achieve O(BN) complexity compared to LAVA's O(N²) (where B is the batch size and N is the dataset size). Regarding the offline setting, while both methods have O(N²) theoretical complexity, there are important distinctions: LAVA achieves O(N²) only through the Sinkhorn approximation, which introduces bias and influence values that may be non-deterministic (Section 3.1). In contrast, KAIROS provides exact MMD influence computation without approximation while maintaining the same complexity. Furthermore, our empirical results (Figure 7) demonstrate that KAIROS achieves faster actual runtime than LAVA even in the offline setting.
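For intuition on how such online updates can be organized, here is a sketch using an illustrative squared-MMD influence score (our own simplified construction, not the paper's exact formula or update rule): cached per-point kernel row sums are refreshed when a batch of B new points arrives, so each update costs O(B(N + |P|)) kernel evaluations rather than recomputing the full N x N kernel matrix.

```python
import numpy as np

def gaussian_kernel(A, B, bw=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bw ** 2))

class OnlineMMDInfluence:
    # Maintains, for every training point seen so far, the illustrative score
    #   2 * ( mean_j k(x_i, x_j) - mean_{j,l} k(x_j, x_l)
    #       - mean_p k(x_i, p)  + mean_{j,p} k(x_j, p) )
    # and refreshes all scores after a batch of B points in O(B * (N + |P|)) kernel evaluations.
    def __init__(self, P, bw=1.0):
        self.P, self.bw = P, bw
        self.X = np.empty((0, P.shape[1]))
        self.rowsum_QQ = np.empty(0)    # sum_j k(x_i, x_j) for each stored point i
        self.rowmean_QP = np.empty(0)   # mean_p k(x_i, p) for each stored point i

    def add_batch(self, B):
        k_new_old = gaussian_kernel(B, self.X, self.bw) if len(self.X) else np.zeros((len(B), 0))
        k_new_new = gaussian_kernel(B, B, self.bw)
        self.rowsum_QQ = np.concatenate([
            self.rowsum_QQ + k_new_old.sum(axis=0),         # old rows gain new columns
            k_new_old.sum(axis=1) + k_new_new.sum(axis=1),   # rows for the new points
        ])
        self.rowmean_QP = np.concatenate([
            self.rowmean_QP, gaussian_kernel(B, self.P, self.bw).mean(axis=1)])
        self.X = np.vstack([self.X, B])

    def scores(self):
        N = len(self.X)
        mean_QQ = self.rowsum_QQ.sum() / N ** 2
        mean_QP = self.rowmean_QP.mean()
        return 2 * (self.rowsum_QQ / N - mean_QQ - self.rowmean_QP + mean_QP)

rng = np.random.default_rng(0)
P = rng.normal(size=(100, 5))
tracker = OnlineMMDInfluence(P)
for _ in range(3):                        # three incoming batches of 50 points each
    tracker.add_batch(rng.normal(size=(50, 5)))
print(tracker.scores()[:5])               # scores for the first five points seen
```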

Comment

Thank you for the response. I have no further questions.

Final Decision

The paper proposes a framework for determining how influential individual training examples are to overall model performance. The work introduces influence function based on Maximum Mean Discrepancy and their approach offers theoretical guarantees such as symmetry and density separation. Experimental results demonstrate competitive performance across various tasks, including noise detection and data pruning.

Strengths:

  1. In the online setting, the method achieves significantly higher computational efficiency and notable runtime improvements due to its closed-form structure.
  2. The work addresses a major weakness in the existing approach, LAVA.

Weaknesses:

  1. The evaluation is fairly basic and could be improved, e.g., by including large-scale datasets.
  2. In the offline setting, its computational complexity is comparable to that of existing methods.

Overall: The reviewers were very positive about this paper and remained positive post-rebuttal. The contribution is novel, comes with theoretical guarantees, and has good performance in the online setting.