Capturing the Temporal Dependence of Training Data Influence
We introduce data value embedding, a novel framework for real-time data attribution that approximates the trajectory-specific leave-one-out (LOO) error.
Abstract
Reviews and Discussion
This paper studies data influence in settings where the training data are not necessarily permutation invariant; permutation invariance is commonly assumed in traditional data influence estimation methods. The authors propose the concept of the trajectory-specific leave-one-out (LOO) error and further propose data value embedding to approximate the trajectory-specific LOO error. Fast computation is achieved via a simple dot product between the data value embedding and the gradient of a given test point.
Strengths
Many machine learning algorithms, especially those for supervised learning, typically assume the training data points are permutation invariant. Under this assumption, one can split the training data for both training and validation using strategies such as k-fold CV, leave-one-out CV, etc. As the authors mention, many modern machine learning training paradigms, especially for foundation models, violate this assumption. The goal of this paper is to capture the dependence of data influence on the optimization trajectory during training. The authors propose the trajectory-specific leave-one-out (LOO) error, and efficient computation is developed as well. Overall, the paper addresses an interesting and timely problem.
Weaknesses
The problem considered in the paper is certainly interesting. However, it is important to make clear the type of CV considered in the paper. In supervised learning, CV such as LOO CV is mainly used to estimate model performance, e.g., prediction error. The quantity considered in this paper instead measures the loss change with versus without a particular data point. In this sense, it is perhaps more sensible to call it the influence of the point rather than an error. The authors may reconsider the terminology to avoid confusion. If the term LOO error remains, it is important to clarify how it differs from the traditional CV error.
Related to the previous point, in the traditional sense of CV error or LOO error, we look for a small CV error for model selection, etc. In the sense of LOO error in this paper, the error is quantified as the model's loss change on validation data when a training data point is removed from the training set. However, a large change in the loss does not necessarily imply a big difference in model performance with versus without that particular training point, whereas the goal is to evaluate the performance of a model trained on the data without that training point. More discussion and justification are needed. To address this issue, this reviewer suggests that the authors:
- Explicitly compare and contrast their LOO error definition with traditional CV error
- Discuss the implications of using loss change rather than model performance metrics
- Provide examples or experiments showing how their LOO error relates to changes in model performance
Questions
- How would the proposed data influence handle outliers in training data? For example, if one particular training point were an error or undesirable outlier, would the proposed influence indicate the outlier being very important since it may have big loss differences? That can be undesirable or misleading. It would be helpful if the authors could: a) conduct an experiment specifically examining the behavior of their method on datasets with injected outliers; b) discuss potential mitigation strategies if outliers are indeed found to have outsized influence; c) compare how their method handles outliers versus existing influence estimation techniques.
- Using numerical examples, the authors show that the early and late regions have high influence and can achieve similar performance improvements compared to using data from the entire process. This is quite interesting but also raises a lot of questions. Intuitively, the high influence of the early region is because one starts with nothing, so the information content of the early region is high. After that, the added information slows down but builds up toward the improvements in the late region. Thus, it appears to this reviewer that one cannot simply use the early and late regions for the learning process without the middle region. More clarification and discussion on this is necessary.
- How would the proposed influence measure work for training on time series data with seasonal and temporal dependence?
- Although the quantification of the temporal dynamics of training is interesting, the usefulness of this needs to be further explained. Can it help to detect undesirable outlier training points? Or enable more effective learning using a subset of the training data?
Q [What would be the data influence scores for outliers in the training data?] “How would the proposed data influence handle outliers in training data? ... It would be helpful if the authors could: a) conduct an experiment specifically examining the behavior of their method on datasets with injected outliers; b) discuss potential mitigation strategies if outliers are indeed found to have outsized influence; c) compare how their method handles outliers versus existing influence estimation techniques.”
A We appreciate the reviewer's important question about handling outliers. We would like to clarify that our method naturally accounts for outliers through validation-based influence scoring. While outliers may indeed have large gradient magnitudes during training, their influence scores are determined by their impact on validation loss. Detrimental outliers typically result in negative influence scores, as they harm model performance on the validation set. This allows our method to naturally distinguish between beneficial high-influence points and harmful outliers.
In Appendix E.2.1, we directly evaluated Data Value Embedding’s ability to detect outliers through extensive experiments on mislabeled data detection. In these experiments, we compared our approach against multiple baselines. Our method achieved strong performance in detecting mislabeled examples while demonstrating more stable results compared to retraining-based methods, as evidenced by the lower standard deviation in performance. Unlike retraining-based methods that can be unstable due to the stochastic nature of model training, our approach provides more consistent results. Moreover, by considering the entire training trajectory rather than just the final model state, our method offers more reliable outlier detection than influence functions.
We have revised the main paper to make the mislabeled data detection experiment more noticeable to the readers. Thanks for the great question!
Q [Clarification on Figure 1(b)] “Using numerical examples, the authors show that the early and late regions have high influence and can achieve similar performance improvements compared to using data from the entire process. This is quite interesting but also raises a lot of questions. Intuitively, the high influence of the early region is because one starts with nothing, so the information content of the early region is high. After that, the added information slows down but builds up toward the improvements in the late region. Thus, it appears to this reviewer that one cannot simply use the early and late regions for the learning process without the middle region. More clarification and discussion on this is necessary.”
A We thank the reviewer for the question regarding our Figure 1. We would like to clarify an important point about Figure 1(b): we are not training the model using only data from the early and late regions while skipping the middle region. What differs between the columns shown in Figure 1(b) is the training phases during which we apply a computationally expensive online data selection algorithm.
Specifically, we adapt the online data selection algorithm from Fan et al. (2024). For each training iteration where selection is applied, we: (1) sample a candidate pool of training points larger than the desired batch size $B$, (2) compute the gradient cosine similarity between each candidate point and a randomly sampled validation batch, and (3) select the $B$ points with the highest similarity scores to form the next training batch. This process is computationally expensive, requiring additional computation at each selection step. Our results demonstrate that we can achieve comparable performance improvements by strategically applying this selection process only during the early (first 2000 iterations) and late (after iteration 20000) phases of training, while using simple random batch sampling during the middle phase.
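For concreteness, here is a minimal sketch of one such selection step, assuming a PyTorch model; the function and variable names (`select_batch`, `candidates`, `val_batch`) are illustrative, not the exact implementation used in the paper:

```python
import torch
import torch.nn.functional as F

def select_batch(model, loss_fn, candidates, val_batch, batch_size):
    """Sketch of one online selection step adapted from Fan et al. (2024):
    keep the candidate points whose gradients align best with a validation batch."""
    # Gradient of the loss on a randomly sampled validation batch.
    model.zero_grad()
    loss_fn(model(val_batch["x"]), val_batch["y"]).backward()
    val_grad = torch.cat([p.grad.flatten() for p in model.parameters()])

    scores = []
    for x, y in candidates:  # candidate pool, larger than batch_size
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        g = torch.cat([p.grad.flatten() for p in model.parameters()])
        # Cosine similarity between this per-example gradient and the validation gradient.
        scores.append(F.cosine_similarity(g, val_grad, dim=0))

    # Form the next training batch from the best-aligned candidates.
    top = torch.topk(torch.stack(scores), k=batch_size).indices
    return [candidates[i] for i in top]
```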
Your intuition about the importance of different training phases aligns well with our findings! The early phase is indeed critical for establishing good initialization, while the late phase is important for final model refinement. Our results suggest that during the middle phase, when the model is in a relatively stable learning regime, the selection of high-quality training batches may not be necessary - random sampling suffices.
This insight has significant practical implications: we can allocate computationally expensive data selection efforts to the phases where they matter most, substantially reducing computational overhead without compromising model performance. We have revised the manuscript and expanded the details and discussion in Section 5.3 and Appendix E.3.2 to make our message clearer to readers.
Fan, Simin, Matteo Pagliardini, and Martin Jaggi. "DOGE: Domain Reweighting with Generalization Estimation." ICML 2024.
Q “How would the proposed influence measure work for training on time series data with seasonal and temporal dependence?”
A Thanks for the interesting question! Our current Data Value Embedding focuses on the standard (non-sequential) setting. Extending it to handle time series data would require several adaptations. One possibility would be to organize training data into overlapping time blocks, similar to the approach taken by Zhang et al. (2024). Rather than treating each time point in isolation, we can consider blocks of consecutive observations. For a time series point $z_t$, we can compute its influence based on how it affects the model's predictions through each of the temporal context windows that $z_t$ belongs to. We would then need to develop an aggregation mechanism for the influence scores computed from each of these temporal context windows.
We agree that the extensions to time series data are interesting future works. Thanks again for raising such an interesting discussion!
Zhang, Yizi, et al. "TimeInf: Time Series Data Contribution via Influence Functions." arXiv preprint arXiv:2407.15247 (2024).
Q [Usefulness of Data Value Embedding] “Although the quantification of the temporal dynamics of training is interesting ... Can it help to detect undesirable outlier training points? Or enable more effective learning using a subset of the training data?”
A We appreciate the reviewer's question about the practical usefulness of our method. Broadly speaking, training data attribution techniques are useful for several critical aspects of AI development: (1) enhancing model interpretability and debugging capabilities, (2) enabling principled data curation and selection, and (3) providing solutions for emerging challenges in AI copyright and fair compensation. For the specific tasks of bad data detection and data selection the reviewer mentioned, we have conducted both experiments in Appendix E.2.1.
Bad data detection. As demonstrated in our mislabeled data detection experiments (Appendix E.2.1), our method achieves strong performance in identifying bad training points compared to existing approaches. See our response to "What would be the data influence scores for outliers in the training data?" above for details.
Data selection. During the rebuttal period, we conducted additional experiments to evaluate the performance of Data Value Embedding as well as a collection of existing data attribution algorithms for the task of data selection, under the same setting as our mislabeled data detection experiment. The results have been added to Appendix E.2.1 and demonstrate that our method substantially outperforms all baseline methods across every selection budget! The superior performance of Data Value Embedding is attributed to its unique ability to capture both data quality and temporal interactions during training. Retraining-based methods (Data Shapley, Empirical Influence Functions, Datamodels) show limited effectiveness due to the high variance introduced by Monte Carlo sampling and learning stochasticity. While the influence function and TRAK do not require model retraining, their performance is constrained by assumptions that often do not hold in practice, such as model convergence and strong convexity. KNN-Shapley provides stable valuation results; however, it assigns similar scores to similar data points, potentially reducing diversity among the selected data subset. In contrast, Data Value Embedding considers both data characteristics and temporal ordering in training, allowing similar data points to receive different scores based on when they appear in the training sequence. This temporal awareness helps maintain dataset diversity while identifying valuable samples.
Furthermore, our method's ability to track how data influence evolves throughout training provides unique insights for tasks like curriculum learning and online data selection, which can be interesting for future works.
Thanks for the positive assessment!
Q [Terminology of LOO & Comparison between LOO influence and LOO CV] “... it is important to make clear the type of CV considered in the paper ... Explicitly compare and contrast their LOO error definition with traditional CV error”
A Thank you for this insightful comment about terminology. We agree that "LOO influence" more precisely describes our metric than "LOO error," and we have revised the terminology throughout the paper. We also occasionally use the term "LOO score" in the paper.
Traditional LOOCV and our LOO influence serve fundamentally different purposes. LOOCV estimates model generalization by measuring prediction performance on held-out data, serving as a model selection and evaluation technique. Lower CV errors indicate better generalization. In contrast, LOO influence quantifies how individual training points impact the model's behavior on validation data, measuring the contribution of specific training examples. Higher absolute LOO influence values indicate more impactful data points.
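For concreteness, the two quantities can be written side by side (the notation here is illustrative rather than the paper's exact formulation): LOOCV averages held-out losses to evaluate a model, while LOO influence measures the validation-loss change caused by removing a single training point $z$:

```latex
\mathrm{LOOCV} \;=\; \frac{1}{N}\sum_{i=1}^{N} \ell\left(z_i;\, \theta_{-i}\right)
\qquad \text{vs.} \qquad
\mathrm{Infl}(z) \;=\; L\left(\theta_{-z};\, D_{\mathrm{val}}\right) - L\left(\theta;\, D_{\mathrm{val}}\right),
```

where $\theta_{-i}$ (resp. $\theta_{-z}$) denotes the model trained with $z_i$ (resp. $z$) removed.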
We have added an explicit discussion in Appendix A.1 to prevent confusion. The key distinctions we highlight include: (1) their different objectives (model evaluation vs. data importance), (2) their interpretation, and (3) their computational approaches (N separate models vs. counterfactual analysis).
Thank you for helping us improve the clarity of our paper!
Q [Loss vs other performance metrics] “However, a large change in the loss does ... (1) Discuss the implications of using loss change rather than model performance metrics (2) Provide examples or experiments showing how their LOO error relates to changes in model performance”
A Thank you for this thoughtful comment about the relationship between loss changes and model performance! We agree that this deserves more discussion and will address both points.
Why use validation loss instead of other metrics?
- (1) Loss is one of the most widely used performance metrics in deep learning and serves as the direct optimization objective during training, making it a natural choice for data influence quantification. This is a common choice in most papers in the field. In this work, we likewise focus on computing the LOO influence for the validation loss, as it is a widely accepted proxy for language model performance.
- (2) Loss values often provide richer information than more human-interpretable metrics like accuracy. Consider, for example, measuring influence through classification correctness on validation data, where outcomes are binary (0 for incorrect, 1 for correct). The resulting LOO influence scores would be limited to {-1, 0, 1} for every training data point, providing far less information than the continuous values obtained from loss calculations.
- (3) The framework of Data Value Embedding supports performance metrics beyond validation loss. The derivation in Appendix C.1 shows we can approximate the parameter difference $\theta_{-z} - \theta$, allowing us to estimate $f(\theta_{-z}) - f(\theta)$. This means that for any differentiable performance metric $f$, we can approximate its LOO influence through $f(\theta_{-z}) - f(\theta) \approx \nabla f(\theta)^\top (\theta_{-z} - \theta)$. While we can also evaluate non-differentiable metrics like classification accuracy by computing $f(\theta_{-z})$ directly, this approach is less efficient when computing the LOO influence for all training points $z$, as it requires forward prediction on large models $N$ times, where $N$ is the training data size, which is much slower than simply taking the dot product against each data value embedding. Therefore, in this work, we focus on approximating the LOO influence for the validation loss (see the sketch after this list).
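As a hedged illustration of the efficiency argument in point (3), the following sketch computes the LOO influence of every training point on an arbitrary differentiable metric with a single backward pass plus one matrix-vector product; the names (`loo_influence_scores`, `project`) are ours, and we assume the test gradient is mapped into the same random-projection space as the stored embeddings:

```python
import torch

def loo_influence_scores(dve, model, metric_fn, test_batch, project):
    """dve: stored data value embeddings, shape [num_train_points, d].
    project: maps a full model gradient into the same d-dimensional
    random-projection space used when building the embeddings."""
    model.zero_grad()
    metric_fn(model, test_batch).backward()  # gradient of the chosen metric f
    f_grad = torch.cat([p.grad.flatten() for p in model.parameters()])
    # grad_f(theta)^T (theta_{-z} - theta), batched over all training points.
    return dve @ project(f_grad)             # one influence score per point
```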
Additional experiment: We have conducted an additional fidelity check experiment under the same setting as Section 5.1, where we assess the correlation between LOO influence scores computed from validation loss and the ground-truth LOO score for classification accuracy. In this figure, we observe strong positive correlations (Spearman correlation of 0.763 and 0.735) between our computed influence scores and ground-truth LOO accuracy changes in both (a) single-epoch removal and (b) all-epochs removal settings. Looking at the scatter plots, we can observe a key advantage of using loss over accuracy: while multiple data points often share the same LOO accuracy scores due to the discrete nature of accuracy metrics, our loss-based influence scores provide more fine-grained distinctions between these points. Despite this discretization effect in the ground-truth accuracy measurements, the strong correlation coefficients and clear monotonic trend in both settings demonstrate that our loss-based influence scores effectively capture changes in model performance measured by accuracy.
Thanks for your clarification on the differences between the traditional LOOCV and the proposed LOO influence. This helps to avoid similar confusion for readers.
Given the thoughtful responses given by the authors, I improved my score from 6 to 8. I think this is a solid paper overall.
Dear Reviewer tZN9,
Thank you again for your positive assessment and valuable feedback on our work. Regarding your questions about Data Value Embedding's outlier detection capabilities, we have conducted additional analyses to provide a more comprehensive response. In addition to our quantitative experiments on mislabeled data detection and data selection (Appendix E.2.1), we have also examined real-world examples from the Pile dataset that receive the most negative influence scores from Data Value Embedding (Appendix E.5). These examples demonstrate our method's ability to detect subtle forms of problematic data. For instance, we identified code snippets containing mainly variable declarations or configuration settings that, while syntactically valid, provide minimal learning value and potentially introduce noise. We also found examples of text with poor formatting (missing punctuation and paragraph breaks) and mathematical problems with repetitive structures (frequently beginning with "What is...") that could bias the model's generation patterns.
These findings demonstrate that our method can identify not just mislabeled data, but also more diverse cases that could subtly degrade model performance. This complements our quantitative evaluation in Appendix E.2.1 and provides concrete evidence of how our method handles outliers in real-world, large-scale training settings. We believe these examples provide valuable insights into the types of training data that might impede model learning.
The paper proposes a novel data influence estimation method, Data Value Embedding, that captures temporal dependencies in data influence during model training. It addresses limitations of traditional influence functions by considering the order and timing of data exposure, enabling efficient real-time computation of influence scores in modern, non-convergent training settings, such as those used for large language models (LLMs).
Strengths
- The motivation of this paper is clear: modern training paradigms—especially for foundation models using stochastic algorithms and non-convergent, multi-stage curricula—are sensitive to data ordering.
- It introduces a computationally efficient embedding method, making it feasible to apply influence estimation to large-scale models without retraining.
- Empirical results demonstrate high fidelity in data influence estimation and reveal nuanced phases in training that inform efficient data selection strategies.
- It provides insights into the training dynamics of foundation models: a very brief high-influence region at the start, a much longer low-influence basin, and a region in the later training stage with gradually increasing influence, returning to a high level.
Weaknesses
- Limited Real-World Validation: While the paper’s experiments demonstrate high fidelity on small datasets like MNIST and reduced subsets of larger datasets (e.g., 1% of the Pile), the method may not have been fully validated on more challenging, real-world datasets. This leaves questions about its robustness and scalability when applied to diverse, large-scale data used in production.
- Potential Overhead in Implementation: While the method reduces some computational costs, it still requires considerable storage and processing resources, particularly for storing per-sample gradient information and data value embeddings. For truly large models, such as those with billions of parameters, this may limit its practical utility without further optimizations.
Questions
Could this method be applied to a closed-source model?
Details of Ethics Concerns
No
We would like to thank the reviewer for the positive assessment!
Q [Robustness and scalability when applied to industrial-scale dataset] “While the paper’s experiments demonstrate high fidelity on small datasets like MNIST and reduced subsets of larger datasets (e.g., 1% of the Pile), ... when applied to diverse, large-scale data used in production.”
A We appreciate the reviewer's comment on the scalability and robustness of Data Value Embedding on large-scale datasets.
Scalability: Our method is designed to scale efficiently to large models with minimal computational overhead. As detailed in Section 4.3 and Appendix C.9, the additional computation required is significantly lower than regular training costs: the costs of the random projection step and of the data value embedding computation are both small fractions of the cost of standard training (see Appendix C.9 for the exact complexity expressions). Moreover, our method offers substantial efficiency improvements over existing approaches: the most efficient implementation of the influence function still requires extra compute equivalent to an additional full training run to recompute gradients on the final model. While our experimental scale was constrained by academic computing resources, our theoretical complexity analysis demonstrates that Data Value Embedding can efficiently scale to any model size that fits within available computational resources.
Robustness: We acknowledge the challenge of extending the ground-truth comparison experiments in Section 5.1 to other datasets and models, especially for larger-scale settings. Computing ground-truth LOO scores for large models is computationally infeasible, as it requires multiple complete training runs. This is a fundamental challenge faced by all research in this field. While direct ground-truth comparison is infeasible for larger models, we provide multiple indirect validations of our method's effectiveness at scale. In our mislabeled data detection experiments and data selection experiments (Appendix E.2.1) using ResNet18 on CIFAR10, our method achieves competitive or superior performance compared to existing baselines. Furthermore, our qualitative analysis (Section 5.4) shows that Data Value Embedding successfully identifies semantically relevant training examples for GPT2 on Wikitext-103, while baseline methods like influence functions sometimes select irrelevant data. While these experiments cannot definitively prove that our method perfectly approximates TSLOO in large-scale settings, they demonstrate that our approach can reliably distinguish data quality and influence for large-scale settings, which is the primary goal of data attribution.
Q [Storage requirements] “Potential Overhead in Implementation: ... its practical utility without further optimizations.”
A We sincerely appreciate the reviewer's thoughtful comments on the storage requirements. While our method does require storage for gradient information and embeddings, we would like to emphasize two key points regarding its practical efficiency:
(1) Storage requirements are balanced by computational benefits: Our method trades increased storage requirements for substantial computational savings. In modern computing environments, disk storage is generally more cost-effective compared to high-performance GPU computing resources. While storage costs are non-trivial, they represent a one-time investment that enables real-time data attribution capabilities - allowing immediate computation of influence scores for new test points without requiring model retraining or access to the original training data. This is particularly valuable in production environments where computational efficiency and quick response times are crucial.
(2) Dataset-level attribution aligns with practical needs while reducing storage: In many practical applications, stakeholders are primarily interested in dataset-level rather than individual-level attribution (e.g., valuing contributions from different data providers). For these cases, as discussed in Appendix C.10, a simple extension to the derivation in Appendix C.1 shows that we can aggregate embeddings by data source, computing a single embedding that captures the collective influence of all data points from the same source. This approach not only aligns with real-world use cases but also significantly reduces storage requirements, as we only need to store one embedding per data source rather than per data point.
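A minimal sketch of this aggregation, assuming per-point embeddings are stored in a matrix and each point carries a source ID (names illustrative):

```python
import torch

def source_level_embeddings(dve, source_ids, num_sources):
    """dve: per-point data value embeddings, shape [num_points, d].
    source_ids: LongTensor of shape [num_points] mapping points to sources.
    By linearity, a source's embedding is the sum of its points' embeddings,
    so only one d-dimensional vector per source needs to be kept."""
    agg = torch.zeros(num_sources, dve.shape[1], dtype=dve.dtype)
    agg.index_add_(0, source_ids, dve)  # sum per-point embeddings by source
    return agg
```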
Q “Could this method be applied to a closed-source model?”
A Thank you for this interesting question!
Our current algorithm is primarily designed for ML developers/trainers who control the model training process. ML developers can integrate data value embedding into their training pipeline and, at deployment, provide real-time attribution scores for any user query. This capability opens up many new possibilities, e.g., for model interpretability, model debugging, and developing fair compensation mechanisms for the current AI copyright debates, where revenue can be distributed to data providers based on their measured contributions to model performance.
If we consider the scenario where a third party wants to analyze the training data influence, our method's applicability to closed-source scenarios depends on the level of access available:
(1) In scenarios with access to model checkpoints (white-box setting with access to model parameters), our method can potentially be adapted using the approach described in Section C.10 ("Approximating data value embeddings from checkpoints alone"). Specifically, we can compute gradients for each training point at each available checkpoint, assuming one gradient update step between checkpoints. While this approximation may not capture the fine-grained dynamics of the original training process, it still enables data attribution analysis when only checkpoint access is available.
(2) However, in a black-box setting where only API access is available, our current method is not directly applicable as it requires gradient information. In fact, this limitation is shared by most existing data attribution methods, including influence functions, which also require access to model weights. Currently, we are not aware of any (model-dependent) data attribution techniques that can work with only API access to the model. We agree with the reviewer that this is an interesting question! An interesting direction for future work would be extending our approach to this setting using zero-order optimization techniques, which could potentially enable data attribution in these highly restricted scenarios.
Dear Reviewer Sfva,
Thank you again for your positive assessment and valuable feedback on our work!
To follow up on your question about Data Value Embedding's use cases: while the method cannot be directly applied to closed-source models (just as other existing data attribution techniques), we believe its value for understanding training dynamics, data quality, and data influence remains significant. Our comprehensive experiments demonstrate this value through multiple analyses: studying training dynamics (Section 5.3), identifying influential training examples (Section 5.4), and detecting mislabeled data for targeted data selection (Appendix E.2.1). To further validate our method's practical usefulness, we have examined real-world examples from the Pile dataset that received the most negative influence scores from Data Value Embedding (examples are shown here and have been added to Appendix E.5). For instance, we identified code snippets containing mainly variable declarations or configuration settings that, while syntactically valid, provide minimal learning value and potentially introduce noise. We also found examples of text with poor formatting (missing punctuation and paragraph breaks) and mathematical problems with repetitive structures (frequently beginning with "What is...") that could bias the model's generation patterns.
These findings demonstrate that our method can identify not just mislabeled data, but also more diverse cases that could subtly degrade model performance. This complements our quantitative evaluation in Appendix E.2.1 and provides concrete evidence of how our method can identify bad data. We believe these examples provide valuable insights into the types of training data that might impede model learning.
Thank you for your detailed response. I appreciate the effort put into addressing my primary concerns. As a result of these improvements, I have decided to revise my evaluation score upward.
Thank you so much for increasing the score! We are grateful for your feedback and very positive assessment of our work!
This paper develops computational tools to quantify the influence of individual data points in a training run, assuming the influences are sensitive to the data order. The authors formalize the idea of the trajectory-specific leave-one-out error (TSLOO), accounting for the sequence of training data. To address the computational cost of TSLOO, the authors propose an efficient approximation of TSLOO using a Gauss-Newton approximation. Using this technique, the authors reveal distinct stages of model training, such as a high-impact warmup phase, a low-impact basin, and a gradual ascent, and provide an explanation.
Strengths
Quantifying data influence is an important task and is crucial for active sample selection. However, existing methods such as the influence function do not account for the order of samples' arrival, making them unsuitable for the vast majority of stochastic algorithms. This paper addresses an important research gap.
This paper is mostly well-written. The mathematical ideas and their intuitions are clearly presented and easy to understand.
The discovery of stages of sample influence is potentially a significant contribution to the wider machine learning community. It reveals how SGD utilizes each individual sample at different stages of training and highlights the importance of selecting data at the right time.
Weaknesses
Probably due to the lack of a dedicated related work section, it is not clear where the authors' contribution begins. For example, has TSLOO been studied before, or is it a novel concept proposed by the authors? In Section 2, the first paragraph seems to suggest that this is the authors' proposal. However, a later sentence says, "while the technique of unrolled differentiation Hara et al., 2019 explicitly aims to approximate TSLOO ..." It seems the idea of TSLOO already exists in earlier works. If this is true, the authors should cite existing works when introducing TSLOO at the beginning of Section 2.
Similarly, the approximation in Section 3.1 was also used by Hara et al., 2019 and is "well-established" in the literature. This makes me wonder how the proposed work is positioned among the existing literature. For example, compared to Hara et al., 2019, what is the methodological innovation? If everything before Data Value Embedding (DVE) is part of the literature review, the authors should make it clear.
DVEmb in Theorem 2 is an interesting idea. However, the interpretation of this quantity is unclear to me. For example, on line 247, "this expression suggests that similar training points encountered in later iterations may have a stronger impact on the data influence score of earlier training points." I am not sure I understand this statement. The points z* are hand-picked by the user, so why would users care about the influence scores of earlier training points? Please provide a clearer explanation of this statement and why a practitioner should care about this interpretation.
On line 252, "if z' is identical or highly similar to z*, their gradients will be closely aligned, leading to a significant change in DVE..." I don't understand the meaning of "change" here. I guess the authors are talking about tracking the influence of samples over the training iterations. However, this setting isn't made clear. I suggest the authors state the context and "set the scene" before explaining the interpretation of Theorem 2.
Questions
- I suggest a dedicated related work section to clearly delineate the authors' novel contributions from existing concepts in the literature (e.g., Hara et al., 2019).
- Judging from lines 127 to 134, I suppose the authors consider learning for just one epoch? In the second epoch, the counterfactual sample will again lead to a difference between theta and theta'. Do the authors consider multiple epochs? And if it's just one, could the authors explain why this is sufficient or how it might generalize to multiple epochs?
- Line 148, isn't the Hessian H a function of the parameters?
- Line 239, it isn't clear at a glance where the product of Hessians goes. I recommend the authors write briefly about the derivation in Appendix C.3.
- Line 435, "This observation aligns with the well-established effect ..." Please cite, as it will also help a general reader (like me) to understand the background of model warm-up/generalization.
- Figure 4, "the y-axis represents the influence score of selected data points on the model at each checkpoint." How are these data points selected? Clarifying this would help with reproducibility as well as the representativeness of the phenomenon.
- Also, what is the "green curve" mentioned on line 442?
Details of Ethics Concerns
N/A
Thanks for the very positive comments about our work!
Q [Clarification on conceptual and technical contributions] “Probably due to the lack of a dedicated related work section, ... when introducing TSLOO at the beginning of Section 2. Similarly, the approximation in Section 3.1 was also used by Hara et al., 2019 and is "well-established" in the literature. ... If everything before Data Value Embedding (DVE) is a part of literature review, the authors should make it clear.”
A For “has TSLOO been studied before or is it a novel concept proposed by the authors? …”
The basic mathematical formulation of trajectory-specific LOO was indeed first introduced as 'SGD-influence' by Hara et al. (2019). However, our work significantly extends this concept by providing the first formal treatment of trajectory-specific LOO as a fundamental framework for understanding data influence in modern deep learning. While Hara et al. (2019) focused primarily on data cleansing applications, we motivate TSLOO by highlighting why traditional permutation-invariant LOO assumptions fundamentally break down in the context of large-scale neural networks and multi-stage training curricula. Section 2 thus serves dual purposes - providing necessary background while discussing why TSLOO is more suitable for data attribution in the era of foundation models. We thank the reviewer for this observation and have revised Section 2 to better distinguish the background material from our own insights. We have cited Hara et al. (2019) when introducing TSLOO.
For “... the approximation in Section 3.1 was also used by Hara et al., 2019 and is "well-established" in the literature.”
The reviewer is correct that Section 3.1 provides preliminaries (as highlighted in the section title). While "unrolled differentiation" was first developed by Hara et al. (2019) for data cleansing, similar mathematical expressions have emerged independently in various domains, particularly in continual learning theory. However, despite its theoretical elegance, this approach has seen limited practical adoption compared to influence functions, primarily due to its substantial computational demands. For instance, Hara et al. (2019) could only analyze the final epoch of SGD training and were restricted to small model architectures. Our core technical contribution begins in Section 3.2, where we introduce data value embedding, a novel framework that makes the approximation from Section 3.1 computationally feasible for large-scale models. This enables, for the first time, trajectory-specific LOO analysis at the scale of foundation models.
Q [Related work section] “I suggest a dedicated related work section to clearly delineate authors' novel contributions from existing concepts in the literature (e.g., Hara et al., 2019).”
A We thank the reviewer for the great suggestion! We have an extended related work section in Appendix A, but we completely agree that incorporating a focused discussion in the main text would enhance readability and help readers better understand the context and novelty of our contributions. Following your suggestion, we have added the following paragraph to the end of Section 2.
Data attribution methods primarily fall into two categories: LOO-based methods and Shapley value-based methods. While Shapley value-based methods (Ghorbani and Zou, 2019) offer elegant theoretical interpretation, they typically require expensive model retraining, which limits their practical applicability. As a result, LOO-based methods such as influence functions (Koh and Liang, 2017) have gained more attention due to their computational efficiency. However, many studies have demonstrated that influence functions can be highly unreliable when applied to deep learning models (Basu et al., 2020; Bae et al., 2022; Epifano et al., 2023). In this work, we argue that TSLOO provides a more appropriate attribution framework for deep learning, particularly in the context of foundation models. Various research communities have independently explored Taylor expansion-based techniques for approximating TSLOO for different purposes (Hara et al., 2019; Zou et al., 2021; Evron et al., 2022; Wu et al., 2022; Wu et al., 2024; Ding et al., 2024). However, practical adoption has been hindered by computational demands. In this work, we propose a new method that overcomes the computational bottlenecks in approximating TSLOO for large-scale models.
Q [Interpretation of Theorem 2] “DVEmb in Theorem 2 is an interesting idea. However, the interpretation of this quantity is unclear to me. For example, on line 247, "this expression suggests that similar training points encountered in later iterations may have a stronger impact on the data influence score of earlier training points." I am not sure I understand this statement. ... on line 252, "if z' is identical or highly similar to z*, their gradients will be closely aligned, leading to a significant change in DVE..." I don't understand the meaning of "change" here. ...”
A Theorem 2 serves two key purposes.
(1) Computational Efficiency: The theorem provides the theoretical foundation for our backward algorithm to efficiently compute data value embeddings. By expressing the embedding at step $t$ recursively in terms of the embeddings at later steps $t' > t$, we can compute embeddings sequentially from the final iteration backward, significantly reducing computational costs compared to a direct implementation of Equation 2.
(2) Training Dynamics Insights: Theorem 2 also provides crucial insights into how training data points interact with each other during model training. This is valuable for future research in designing training curricula or understanding training dynamics. Specifically, when two points $z$ and $z'$ have similar gradients, the inner product $\nabla \ell(z')^{\top} \nabla \ell(z)$ becomes large, indicating a strong interaction between these points. Consider a scenario where we are training a language model and encounter a data point $z$ about "quantum computing" early in training. Theorem 2 suggests that this data point's influence on the final model depends not just on the training stage at which it appears, but also on what the model sees later in training. If many similar "quantum computing" examples appear in later iterations, the influence of the original point $z$ on the final model may diminish! This is intuitive because later examples could have taught the model similar concepts even without $z$. Conversely, if $z$ remains one of the few examples of its kind throughout training, it maintains a higher influence.
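To make the recursion concrete, here is a hedged sketch of the backward pass in the random-projection space; the shapes, the Gauss-Newton substitution for the Hessian, and the omitted batch-mean scaling are simplifications of the full algorithm, and all names are illustrative:

```python
import torch

def data_value_embeddings(grads, lrs):
    """grads[t]: projected per-sample gradients of batch t, shape [B_t, d].
    lrs[t]: learning rate at step t. Returns one embedding per training point;
    a point's influence on a test loss is then the dot product of its
    embedding with the projected test gradient."""
    T = len(grads)
    d = grads[0].shape[1]
    M = torch.eye(d)                 # running product of (I - lr_k * H_k) over k > t
    dve = [None] * T
    for t in reversed(range(T)):
        # Embedding of each point z in batch t: lr_t * g_z^T M (row-wise).
        dve[t] = lrs[t] * grads[t] @ M
        # Fold step t into the running product, using a Gauss-Newton-style
        # approximation of the Hessian built from the per-sample gradients.
        H_t = grads[t].T @ grads[t]
        M = (torch.eye(d) - lrs[t] * H_t) @ M
    return dve
```

This also makes the interaction effect visible: a later batch with gradients similar to $\nabla \ell(z)$ shrinks $M$ along that direction, reducing the embedding (and hence the influence) of the earlier point $z$.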
We appreciate the reviewer's thoughtful comments and have revised Section 4.1 accordingly to improve clarity.
Q “Line 148, isn't the Hessian H a function of the parameters?”
A Thanks for the catch! We have fixed the notation.
Q “Line 239, it isn't clear at a glance where the product of Hessians goes. I recommend the authors write briefly about the derivation in Appendix C.3.”
A Thanks! We have modified the theorem statement and relevant paragraphs accordingly.
Q “Line 435, "This observation aligns with the well-established effect ..." Please cite, as it will also help a general reader (like me) to understand the background of model warm up/generalization.”
A Thanks for the suggestion! We have added citations accordingly.
Q “Figure 4, " the y-axis represents the influence score of selected data points on the model at each checkpoint." How are these data points selected?”
A Each curve represents the average influence score across all data points from a specific iteration range. For instance, the blue curve shows the average influence scores for all data points that appeared during iterations 1000-2000 (High-impact Warmup Phase). We have updated the manuscript to make this detail more visible.
Q “Also, what is the "green curve" mentioned on line 442?”
A Thanks for the catch! It is supposed to be “red curve” and we have fixed the manuscript.
Q “Judging from lines 127 to 134, I suppose the authors consider learning for just one epoch? In the second epoch, the counterfactual sample will again lead to a difference between theta and theta'. Do the authors consider multiple epochs? And if it's just one, could the authors explain why this is sufficient or how it might generalize to multiple epochs?”
A When training for multiple epochs, each appearance of a data point contributes to the model's learning trajectory. We can capture this by treating each epoch's occurrence as a distinct influence event, where removing a data point from any single epoch or from all epochs creates different counterfactual training scenarios. As discussed in Appendix C.10, we can compute a data point's total influence by summing its data value embeddings across all epochs where it appears (this follows from a simple extension of the derivation in Appendix C.1).
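A hedged one-line sketch of this aggregation (names illustrative; `dve_per_epoch[e]` holds the embeddings computed for epoch e):

```python
def total_influence(dve_per_epoch, point_id, test_grad):
    """All-epochs removal score: sum the point's data value embeddings across
    every epoch it appears in, then dot with the projected test gradient."""
    return sum(dve[point_id] @ test_grad for dve in dve_per_epoch)
```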
In Section 5.1, we validate data value embedding in both the "single-epoch removal" and "all-epochs removal" scenarios, where the latter corresponds exactly to the scenario the reviewer mentions (removing a data point from all epochs simultaneously). The results in Figure 3(c-d) as well as Appendix E.2 demonstrate that our method accurately approximates the ground-truth LOO in this case as well.
Thanks for the reply to my comments and for clarifying my confusion about the distinctions between this work and existing works. I will keep my score.
Thanks again for your very positive assessment and valuable feedback on our work!
This paper presents a new method called "Data Value Embedding" to approximate the trajectory-specific Leave-One-Out (LOO) error for individual training samples, which encapsulates the cumulative influence of training samples along a training trajectory. It gives some interesting insights into how the influence of training batches changes during a training process. The authors also provide several tips for reducing the computational cost of the method in practice.
Strengths
The method introduces a novel concept by capturing data influence in a trajectory-specific manner rather than assuming permutation invariance, which is a common limitation of conventional influence estimation methods. It outlines its assumptions and derives an approximation error bound, lending theoretical credibility to the approach. The approach is designed for computational efficiency, including several techniques to reduce the memory and computational cost. It enables identification of high-value data points at different stages of training, allowing practitioners to curate datasets more effectively.
Weaknesses
The method is explicitly tailored to SGD and is not readily applicable to other popular optimizers like Adam. Although using SGD as a proxy is discussed, this limitation restricts the method's applicability to a broader range of models. The evaluation against ground truth focuses on specific datasets and model types (e.g., MNIST, MLP) due to the computational cost, which may limit the generalizability of the findings. Several assumptions are made in this paper, such as model layer independence and learning rate scheduling, which might not be satisfied in practice and could lead to reliability issues in such circumstances.
Questions
- In Line 169, the loss change is estimated by the first-order Taylor expansion, which indicates that $\theta$ and $\theta'$ should be close; however, when a sample is more important, $\theta$ and $\theta'$ will be more dissimilar. How can this approximation be valid in such a case?
- Is there any trajectory pattern of the Hessian observed during the training process?
- If the samples with negative influence scores are removed, will a model achieve better performance?
Q [Ground-truth comparison experiment only on specific datasets and model types?] “The evaluation against ground truth focuses on specific datasets and model types ... generalizability of the findings.”
A We acknowledge the challenge of extending the ground-truth comparison experiments in Section 5.1 to other datasets and models, especially for larger-scale settings. Computing ground-truth LOO scores for large models is computationally infeasible, as it requires multiple complete training runs. This is a fundamental challenge faced by all research in this field. Additional experiments: During the rebuttal period, we conducted an additional experiment using a small CNN architecture (2 convolutional layers followed by a linear layer) trained on MNIST to demonstrate our method's effectiveness beyond simple MLPs. As shown in this figure, we observe strong correlations between our estimated influence scores and ground-truth LOO scores in both (a) single-epoch removal (Spearman correlation: 0.818) and (b) all-epochs removal (Spearman correlation: 0.682) settings. This validates our method's effectiveness across different architectures. We sincerely thank the reviewer; this additional result has been added to Appendix E.2.
Indirect validations for large-scale settings. While direct ground-truth comparison is infeasible for larger models, we provide multiple indirect validations of our method's effectiveness at scale. In our mislabeled data detection experiments and data selection experiments (Appendix E.2.1) using ResNet18 on CIFAR10, our method achieves competitive or superior performance compared to existing baselines. Furthermore, our qualitative analysis (Section 5.4) shows that Data Value Embedding successfully identifies semantically relevant training examples for GPT2 on Wikitext-103, while baseline methods like influence functions sometimes select irrelevant data.
Q “In Line 169, the loss change is estimated by the first-order Taylor expansion, ... How can this approximation be valid in such a case?”
A We agree with the reviewer that in hypothetical scenarios where some data points are extremely influential (e.g., having exceptionally large gradients), the first-order Taylor approximation could break down, as $\theta$ and $\theta'$ would become significantly different. However, in modern deep learning practice, particularly for foundation models, such scenarios are unlikely because learning rates are generally very small. In addition, gradient clipping is a commonly used technique in foundation model pretraining. These mechanisms effectively prevent any single data point from causing dramatic parameter changes.
Furthermore, our empirical validation supports the effectiveness of this approximation in practical ML training scenarios: (1) In addition to the high correlation with ground-truth TSLOO scores in Section 5.1 and Appendix E.2, Data Value Embedding achieves strong performance on the standard benchmarks of mislabeled data detection and data selection in Appendix E.2.1. (2) In the qualitative evaluation in Section 5.4, our method successfully identifies training data that are semantically correlated with the test data point.
Overall, while we acknowledge the theoretical possibility raised by the reviewer, in training setups typical of modern deep learning, our approximation provides a practical and reliable way to assess data influence.
Q “Is there any trajectory pattern of the Hessian observed during the training process?”
A Computing and storing the full Hessian matrix is computationally intractable for modern neural networks, as it requires $O(p^2)$ memory and computation, where $p$ is the number of parameters. Therefore, following standard practice in the optimization literature [1, 2], we approximate the Hessian using the Generalized Gauss-Newton (GGN) matrix $H_t \approx \sum_{i \in \mathcal{B}_t} g_i g_i^{\top}$, where $g_i$ denotes the per-sample gradient of batch $\mathcal{B}_t$ at step $t$. The effectiveness of this approximation is validated through our fidelity experiments in Section 5.1, where we demonstrate strong correlation with ground-truth LOO scores despite using the GGN approximation.
While tracking the full spectrum of $H_t$ remains challenging, we can gain insights by analyzing gradient norm trajectories. Under the assumption of approximately orthogonal gradients within batches (which is reasonable given the high dimensionality of the parameter space), the eigenvalues of $H_t$ are primarily determined by the squared per-sample gradient norms. As shown in this figure (obtained under the same setting as in Section 5.3), we observe that gradient norms start large during early training and gradually stabilize, leading to corresponding changes in the eigenvalues of $H_t$. This evolution directly shapes the influence dynamics through the product term $\prod_{t' > t} (I - \eta_{t'} H_{t'})$ in our data value embedding formulation, as we discussed in Section 5.3 and Appendix E.3.1.
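To spell out the orthogonality argument (a short derivation under the stated assumption, with notation matching the GGN approximation above):

```latex
H_t = \sum_{i \in \mathcal{B}_t} g_i g_i^{\top},
\qquad
g_i^{\top} g_j \approx 0 \ (i \neq j)
\;\Longrightarrow\;
H_t\, \hat{g}_j = \sum_{i \in \mathcal{B}_t} g_i \left(g_i^{\top} \hat{g}_j\right)
= \lVert g_j \rVert^2\, \hat{g}_j,
\qquad \hat{g}_j := g_j / \lVert g_j \rVert,
```

so each per-sample gradient direction is approximately an eigenvector of $H_t$ with eigenvalue $\lVert g_j \rVert^2$.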
Thank you for raising this interesting question about Hessian trajectories! While our current work focuses on developing an efficient method for capturing temporal dependencies in data influence, a detailed analysis of Hessian dynamics during training is an important research direction for sure. There are interesting recent works specifically focused on this topic [3, 4], which provide valuable insights into the evolution of loss geometry during neural network training.
[1] Martens, James, and Roger Grosse. "Optimizing neural networks with kronecker-factored approximate curvature." International conference on machine learning. PMLR, 2015.
[2] Martens, James. "New insights and perspectives on the natural gradient method." Journal of Machine Learning Research 21.146 (2020): 1-76.
[3] Wang, Zixuan, Zhouzi Li, and Jian Li. "Analyzing sharpness along GD trajectory: Progressive sharpening and edge of stability." Advances in Neural Information Processing Systems 35 (2022): 9983-9994.
[4] Song, Minhak, and Chulhee Yun. "Trajectory Alignment: Understanding the Edge of Stability Phenomenon via Bifurcation Theory." Advances in Neural Information Processing Systems 36 (2024).
Q “If the samples with negative influence scores are removed, will a model achieve better performance?”
A Thanks for the insightful question! We have conducted additional experiments to evaluate the performance of Data Value Embedding as well as a collection of existing data attribution algorithms for the task of data selection, under the same setting as our mislabeled data detection experiment. The results have been added to Appendix E.2.1 and demonstrate that our method substantially outperforms all baseline methods across every selection budget! The superior performance of Data Value Embedding is attributed to its unique ability to capture both data quality and temporal interactions during training. Retraining-based methods (Data Shapley, Empirical Influence Functions, Datamodels) show limited effectiveness due to the high variance introduced by Monte Carlo sampling and learning stochasticity. While the influence function and TRAK do not require model retraining, their performance is constrained by assumptions that often do not hold in practice, such as model convergence and strong convexity. KNN-Shapley provides stable valuation results; however, it assigns similar scores to similar data points, potentially reducing diversity among the selected data subset. In contrast, Data Value Embedding considers both data characteristics and temporal ordering in training, allowing similar data points to receive different scores based on when they appear in the training sequence. This temporal awareness helps maintain dataset diversity while identifying valuable samples.
Thanks for the authors' comprehensive response. I think the additional experimental results are quite interesting and demonstrate the effectiveness of the proposed method. I will keep my score recommending acceptance of the paper.
We sincerely thank you for taking the time to review our rebuttal. We are very grateful for the endorsement of our work!
We thank the reviewer for the very positive assessment of our paper!
Q [Assumption of SGD] “The method is explicitly tailored to SGD ...”
A We appreciate the reviewer raising this important point. While our theoretical framework is derived from SGD, we would like to emphasize two key aspects:
First, the assumption about SGD is significantly less restrictive than traditional methods like influence functions, which require strong convexity of the loss function and convergence to a global minimum. Our empirical results strongly support this: despite training models with Adam, our method successfully identifies training samples that are most relevant to the test data (Section 5.4) and achieves strong performance in mislabeled data detection and data selection benchmarks (Appendix E.2.1). These results demonstrate that our approach remains effective even when applied to models trained with SGD's variants.
Second, using SGD-based analysis as a proxy for SGD’s variants is a well-established approach in the literature, particularly in data attribution [1,4], data selection [2,3], and data difficulty estimation [5]. This widespread adoption is because SGD's simpler update rules make it more amenable to theoretical analysis while still capturing the essential dynamics of optimization. The empirical success of this approach across various domains suggests its validity as an analytical tool.
That said, we acknowledge that extending the framework of Data Value Embedding to Adam would be valuable future work. The current formulation strikes a balance between theoretical rigor and empirical feasibility, as demonstrated by our experimental results.
[1] Pruthi, Garima, et al. "Estimating training data influence by tracing gradient descent." NeurIPS 2020
[2] Fan, Simin et al. "DOGE: Domain Reweighting with Generalization Estimation." ICML 2024
[3] Yang, Yu, et al. "Towards sustainable learning: Coresets for data-efficient deep learning." ICML 2023
[4] Nguyen, Elisa, et al. "A Bayesian perspective on training data attribution." NeurIPS 2023.
[5] Paul, Mansheej, et al."Deep learning on a data diet: Finding important examples early in training." NeurIPS 2021
Q “Several assumptions ... such as model layer independence and learning rate scheduling, which might not be satisfied ...”
A We appreciate the reviewer's attention to the assumptions in our work.
Layer Independence: This assumption originates from the natural gradient descent literature [1] and has been widely adopted in influence function research [2, 3, 4], consistently demonstrating its effectiveness in practice. The assumption enables tractable analysis while preserving the essential characteristics of deep neural networks, as intra-layer interactions typically dominate cross-layer effects. This approximation has proven particularly valuable in developing practical algorithms for large-scale models.
Learning Rate Scheduling: Our theoretical analysis derives error bounds for decaying learning rate schedules with a small maximum rate. Though foundation model pretraining may use different schedules, these schedules share key properties with our theoretical assumptions: small magnitudes and systematic decay. Our empirical results confirm that the method's effectiveness persists even when the theoretical conditions are relaxed.
While these assumptions may not perfectly align with all real-world scenarios, they effectively approximate the conditions observed in our experiments. To validate this, we provide multiple empirical evaluations beyond theoretical analysis. For instance, our method demonstrates strong performance in detecting mislabeled data using ResNet18 on CIFAR10 (Appendix E.2.1), achieving comparable or better results than established baselines. The qualitative analysis (Section 5.4) shows that our method successfully identifies semantically relevant training examples, while baseline approaches sometimes select irrelevant data.
We acknowledge that these assumptions could be further relaxed in future work. However, our current framework provides valuable theoretical insights while maintaining practical effectiveness, as evidenced by our comprehensive empirical results. The success of our method in real-world applications, despite potential deviations from theoretical assumptions, suggests that these approximations capture the essential aspects of data influence in deep learning systems.
[1] Martens, James, and Roger Grosse. "Optimizing neural networks with kronecker-factored approximate curvature." ICML 2015
[2] Grosse, Roger, et al. "Studying large language model generalization with influence functions." arXiv 2023
[3] Kwon, Yongchan, et al. "DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models." ICLR 2024
[4] Choe, Sang Keun, et al. "What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions." arXiv 2024
A general remark about the comparison between Data Value Embedding and Influence Function
We appreciate the reviewer's thoughtful comments on the assumptions we made. In addition to our response above, we would like to additionally highlight key differences between Data Value Embedding and Influence Functions that demonstrate the relative advantages of our approach:
Assumptions:
- While Data Value Embedding assumes SGD training with decaying learning rates (a common practice), Influence Functions require much stronger conditions of model convergence and strong convexity. These assumptions are rarely satisfied in deep learning, particularly for foundation models that often undergo just one training epoch.
- The layer-independence assumption and random projection techniques used in Data Value Embedding have precedent in efficient Influence Function implementations (e.g., LoGRA [1]). However, Data Value Embedding leverages these tools more effectively by integrating them into the training loop.
Conceptual and Computational Advantages:
- Data Value Embedding explicitly captures temporal dependencies and the evolution of data influence throughout training, while Influence Functions only consider the final model state. As a result, Influence Functions always assign identical influence scores to duplicate data points regardless of when they appear in training, missing crucial temporal effects.
- As demonstrated in Section 5.2, Data Value Embedding achieves over 15× faster computation compared to the most efficient Influence Function implementation (for the offline computation stage), while using significantly less GPU memory (0.84GB vs 63.6GB).
- Our experiments show that Data Value Embedding consistently outperforms Influence Function in both quantitative metrics (Section 5.1) and qualitative assessments (Section 5.4), where Influence Functions sometimes identify semantically irrelevant training examples.
These comparisons demonstrate that while Data Value Embedding makes certain assumptions to enable scalable computation, these assumptions are either more appropriate for deep learning applications or required by Influence Function's implementation as well, while offering substantial benefits in terms of computational efficiency and capturing temporal dependencies in data influence estimation.
[1] Choe, Sang Keun, et al. "What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions." arXiv 2024
Quantifying data influence is an important task. However, existing methods are permutation invariant, making them unsuitable for the vast majority of stochastic algorithms for nonconvex optimization. This paper addresses an important research gap.
The paper is mostly well-written. The mathematical ideas and their intuitions are clearly presented.
The discovery of stages of sample influence is potentially a significant contribution to the wider machine learning community. It reveals how SGD utilizes each individual sample at different stages of the training and highlights the importance of selecting data at the right time.
There is consensus among reviewers that this paper should be accepted for publication.
Minor remark: I was surprised that in the SGD equations, the learning rate was not normalized by the batch size (as done for instance in Hara et al). I suppose this normalization can be incorporated into the learning rate but it would be good to add a remark about this.
Additional Comments on Reviewer Discussion
Authors addressed all reviewer comments.
Accept (Oral)