Efficient Ensembles Improve Training Data Attribution
Abstract
Reviews and Discussion
The paper proposes ways to efficiently simulate building ensembles of models from scratch, using dropout and LoRA, for the problem of training data attribution (TDA). Through empirical studies they show the efficacy of their approach (comparable LDS for less training time) over from-scratch training. The paper also provides some theoretical analysis to justify their approach in terms of lower squared error loss.
Strengths
- Easy to peruse
- Covers related work reasonably well
- Experimental results are promising
Weaknesses
- Limited originality
- Small models and datasets
- Only one TDA method experimented on for LoRA
- Weak reasoning in the theoretical section to justify their approach
Questions
In general, I do not find this paper too exciting. The application of Dropout and LoRA in this context is nice, but somewhat incremental from an originality standpoint in my opinion. Also, since the main claim of the paper is efficiency, I would have liked to see experiments on larger datasets and with larger models, since they are considering generative tasks. The arguments in the theory section are also weak. The Case 1 argument, that the within-group covariance is smaller than the individual predictor variances, is not completely justified, but I could still buy it. However, for Case 2, the claim that the within-group covariance is not too much larger than the covariance between individual predictors is hard to accept, since the whole point of the paper is efficiency, and having a large K implies building many independent models, defeating the purpose. Hence, anyone using this method would want a low K, which implies that for Case 2 the performance of the proposed solutions could be considerably worse.
All in all the paper is reasonable, but limited in the above mentioned ways.
We thank the reviewer for the comments and address them below.
Weakness 1: Limited originality
To the best of our knowledge, this paper is the first study to propose avoiding fully independent training to achieve better efficacy-efficiency tradeoffs in TDA ensembling, while also discovering the intriguing connections among dropout, LoRA fine-tuning, and ensemble methods in the context of TDA. Going beyond the traditional (naive) ensemble is a surprising finding.
The methods we propose are well-grounded in theoretical and methodological foundations. Dropout Ensemble is motivated by the fact that Dropout was originally designed as an efficient approximation of ensemble learning, and LoRA Ensemble is directly inspired by LoRA fine-tuning as an efficient alternative to full training.
Most importantly, we would like to emphasize that in addition to our contribution on connecting Dropout/LoRA and efficient ensembling, our methods demonstrate significant performance improvements and practical applicability, making them valuable for real-world scenarios.
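For readers less familiar with the mechanics, here is a minimal sketch of the Dropout Ensemble idea in Python: attribution scores from a single trained model are averaged over several dropout-perturbed passes. The `attribution_fn` callable and its signature are hypothetical placeholders for any per-example TDA scorer, not the paper's implementation.

```python
import torch


def dropout_ensemble_scores(model, attribution_fn, query_batch, train_batch, n_masks=10):
    """Average attribution scores over several dropout-perturbed passes of one trained model.

    `attribution_fn(model, query_batch, train_batch)` is a hypothetical callable that
    returns a tensor of per-training-example scores (grad-dot, influence, TRAK, ...).
    A full implementation would fix one dropout mask per ensemble member; here we simply
    keep dropout layers active so each call samples fresh masks.
    """
    was_training = model.training
    model.train()  # activates nn.Dropout layers at attribution time
    per_mask_scores = [attribution_fn(model, query_batch, train_batch) for _ in range(n_masks)]
    model.train(was_training)  # restore the original mode
    return torch.stack(per_mask_scores).mean(dim=0)
```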
Weakness 2 and Question 1: Small models and datasets
We would like to highlight that our experiment settings in the original submission already cover models with tens of millions of parameters and two generative settings, i.e., the MusicTransformer (13.3M) setting and the GPT-2 (124M) setting (added below). These experiments already involve large modern architectures and complex datasets. As shown in the table below, on GPT-2 + WikiText-2, Dropout Ensemble without independently retrained models achieves performance comparable to Naive Ensemble with multiple independently retrained models. Additional experiments are provided in Appendix K.
| D (dropout masks) \ I (independent models) | 1 | 5 | 10 |
|---|---|---|---|
| 0 (naive ensemble) | 0.084 | 0.174 | 0.205 |
| 5 | 0.161 | / | / |
| 10 | 0.190 | / | / |
(GPT-2 + WikiText-2)
Quantitative evaluation of data attribution for large-scale generative models is challenging since most of the metrics (e.g., LDS, brittleness, ...) require a large number of model retrainings to obtain the ground truth. We note that our experimental setups already match the largest setups in most recent papers [1][2] in the field of data attribution.
Weakness 3: Only one TDA method experimented on for LoRA
LoRA is designed for large transformer-based models, so we only perform experiments on the transformer architecture. Other than MusicTransformer, we also show the results on GPT in Figure 8 of Appendix K.
Weakness 4 and Question 2: Theory
In our theoretical analysis, we did not claim that the within-group covariance is never significantly larger than the covariance between individual predictors. On the contrary, we acknowledged in line 525 that our advantage will diminish if this covariance gap is large. However, our goal in Case 2 was to illustrate that our method can help reduce the error significantly, by a factor of 1/K. When we say a large K is needed for good empirical performance, we mean a relatively large number (such as 10 in our experiments). In practice, this error reduction helps maintain competitive performance while providing a drastic efficiency gain, as reflected in our experiments with very small K <= 5. Such results show that the covariance gap is reasonably controlled. Specifically, for the MLP model on MNIST, a small K = 5 with Dropout Ensemble can outperform the naive ensemble with 25 models and save 80% of the training time cost.
In reality, there is always a trade-off between performance and efficiency for any ensemble method. One key advantage of our approach is that it provides flexibility in balancing them, allowing users to adjust K based on their specific requirements. Smaller K can be used in scenarios where efficiency is the primary concern. On the other hand, when users can afford a larger K (which is not required and may not be the most efficient trade-off depending on computational constraints), the performance improves further. Finally, our Case 1 also shows that, when there is a fixed budget on how much model training can be afforded in total, our method can make a very effective trade-off by using cheap dropout passes and still achieve a performance gain.
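As a hedged illustration of the K/D trade-off discussed above (a simplified, variance-only view in our own notation, not the paper's exact derivation): suppose each of K independently trained models contributes D cheap ensemble members (dropout masks or LoRA fine-tunes), each member estimator has variance \(\sigma^2\), members within a group have correlation \(\rho_w\), and different groups are treated as independent. Then

```latex
\operatorname{Var}\!\left(\frac{1}{KD}\sum_{k=1}^{K}\sum_{d=1}^{D} X_{k,d}\right)
  = \frac{\sigma^2}{KD} + \frac{(D-1)\,\rho_w\,\sigma^2}{KD}
  \;\xrightarrow{\;D\to\infty\;}\; \frac{\rho_w\,\sigma^2}{K}
```

so the excess variance over a fully independent ensemble of the same size is governed by the within-group correlation and shrinks with 1/K, which is the factor referred to above.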
References
[1] Park, S. M., Georgiev, K., Ilyas, A., Leclerc, G., & Madry, A. (2023). TRAK: Attributing Model Behavior at Scale. Proceedings of the 40th International Conference on Machine Learning, Proceedings of Machine Learning Research 202:27074-27113. https://proceedings.mlr.press/v202/park23c.html
[2] Bae, J., Lin, W., Lorraine, J., & Grosse, R. B. Training Data Attribution via Approximate Unrolling. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.
Dear Reviewer DqZM,
Thank you for your valuable feedback to our submission! We have carefully addressed your comments in our rebuttal and we would greatly appreciate it if you could review it and consider any further discussion. We look forward to hearing any additional thoughts you may have.
Thanks for your response, however, unfortunately my main concerns still remain and hence I will keep my recommendation.
Thank you for conducting experiments with GPT-2; however, I believe even this model is too small to justify the use of LoRA. The model you have trained on is basically the size of BERT-base and can be fine-tuned on a single GPU without LoRA. Your other model, MusicTransformer, is around 10M parameters, which is sometimes the size of a LoRA adapter itself, not even for the biggest models but for mid-size models such as GPT-3. I understand TDA can be expensive, but given that the point of this paper is efficiency using LoRA etc., showcasing it on bigger models where LoRA is typically used is, in my opinion, imperative.
Still only a single TDA method is experimented on for LoRA.
Regarding the theory, my point is that for one of the main cases (Case 2) it does not justify your approach. You also say "In summary, when , can still outperform if the within-group covariance is small.", suggesting that this can be reasonable, which in general it will not be. I think the theory is not that deep, and the space may be better utilized for additional experiments and discussion.
Regarding novelty my claim is not that this has been done before, but that I do not consider applying TDA techniques using LoRA etc. a significant conceptual jump.
Thank you for the follow-up
larger experiment settings
The focus of this paper is efficient ensembling (both Dropout and LoRA) for TDA methods, rather than improving the efficiency of current gradient-based TDA methods beyond ensembling.
The quantitative evaluation of TDA methods can be time-consuming, and recent works [1][2] that employ quantitative evaluation (e.g., LDS) evaluate on at most GPT-2-level models. Thus, we believe GPT-2 (124M) and the other experiment settings with tens of millions of parameters are already comprehensive enough to demonstrate the effectiveness of the proposed efficient ensemble methods.
further clarification about the theory part
Our experimental approach mainly follows Case 1 in the theoretical analysis because running the second step, i.e., Dropout or LoRA, is much cheaper than the first step, i.e., training a new model. Therefore, we can easily obtain a large D, and our advantage comes from increasing D. In practice, we do not restrict ourselves to KD = I. We provide the second case for a more complete picture of the theoretical understanding.
We agree that our theoretical analysis is not as deep as that of a pure theory paper, which our paper is not. Therefore, we included it as an analysis section rather than as the main contribution. Following your suggestion, we will consider shortening it by moving some content to the appendix and leaving space for more experimental results.
References
[1] Park, S. M., Georgiev, K., Ilyas, A., Leclerc, G., & Madry, A. (2023). TRAK: Attributing Model Behavior at Scale. Proceedings of the 40th International Conference on Machine Learning, Proceedings of Machine Learning Research 202:27074-27113. https://proceedings.mlr.press/v202/park23c.html
[2] Bae, J., Lin, W., Lorraine, J., & Grosse, R. B. Training Data Attribution via Approximate Unrolling. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.
The paper is about training data attribution (TDA) and specifically gradient-based methods like influence functions and TRAK. These tend to benefit from ensembling scores across multiple models, which has the downside of requiring i) performing multiple training runs, ii) storing multiple sets of weights, and iii) performing the same computation multiple times. The authors propose two ideas to reduce these costs: 1) averaging TDA values calculated using multiple dropout-masked models, and 2) averaging scores across multiple LORA fine-tuned models.
Both techniques offer some type of averaging effect that empirically improves the scores, similar to the standard ensembling approach, but potentially with reduced training time and storage space requirements. For the dropout idea, the authors also propose a trick to reduce computation time. The comparisons with standard ensembling are a bit mixed, but these are interesting ideas.
Strengths
- The problem setting and related work is well written and easy to follow.
- The two main ideas presented here are reasonable and offer solid options to reduce training time and storage space, which is important for large-scale models.
- For the dropout ensemble, the trick for TRAK of only using dropout masking for the terms was a nice idea to further reduce computation/serving time (end of section 3.3). Based on the results in Figure 5, it seems to work well in practice (although I have a question about that below).
- The main results for the dropout ensemble show that it introduces helpful variability, which given enough masks can provide a similar effect to ensembling independently trained models (Figure 3). Although the TRAK alternatives (influence functions, grad-dot, grad-cos) are much worse in an absolute sense, it's nice to see that they benefit from the dropout ensemble as well (Figure 4).
- The LORA ensemble results mostly seem encouraging as well, reflecting that a little extra training cost + a few more parameters can yield large LDS improvements. I have a couple questions about these results as well, but this is enough to establish LORA ensembling as a viable approach.
- The theory is a nice addition, but perhaps a bit disconnected because the authors don't characterize the salient distributional properties in their experiment (e.g., how correlated scores are within a dropout ensemble vs between independent models).
Weaknesses
Some assorted observations from throughout the paper:
- I can't tell how the authors justify the performance improvements quoted in the abstract and later in the paper. Can they make this logic explicit in the paper? It seems unlikely they generalize across settings or apply to both ensembling methods, and I don't know if the authors are controlling for accuracy (LDS) when making these claims.
- The authors mention one baseline that they fail to compare to, which is using multiple checkpoints from a single training run. I recognize that it would be cumbersome to add at this point, and that tuning that baseline might be necessary (e.g., which epochs to use), but it would have been nice to include because it reduces training time in a similar fashion. Can you explain why it was not included?
- From the abstract and introduction, I got the impression the authors might completely eliminate the need for multiple training runs. Unfortunately this is not the case: both methods still rely on multiple independently trained models to achieve similar performance to naive ensembling. Do you think your results convincingly show that we can't get good enough LDS scores with a single training run, even with a high number of dropout masks or LoRA adapters? This seems implied by the asymptoting performance as the number of dropout masks/LoRA fine-tunes grows, and might merit some discussion.
- The Figure 3 results seem to reflect that for a given serving cost, dropout ensembling is worse than naive ensembling: we can see this through the first orange point vs the fourth blue point. These figures are actually somewhat deceptive because their x-axis represents the number of independently trained models, rather than serving time or FLOPs. They effectively plot accuracy vs training time, which is the most favorable comparison for this method because it squeezes performance out of each independently trained model. Figure 5 shows serving time comparisons but omits a curve for naive ensembling... I believe the authors should be more transparent about this result, for example by including all the same plots shown for LoRA ensembling (even if it's only in the appendix).
- In Figure 5, it looks like 3/4 plots show something quite surprising: that for a given number of dropout masks, their serving optimization for TRAK achieves not just lower serving time but better LDS scores? That's surprising and might merit some discussion. What does that tell us about which parts of TRAK truly require ensembling, and do you have intuition for why?
- For the LoRA ensemble, I don't understand how it lets you reduce serving time - how would you get around having to backprop through the network for each set of adapters? The authors mention this on line 268 with no explanation, and reduced serving time seems to be reflected in Figure 6b. This seems like an important aspect of the method to describe in the main text.
- Why are most of the experiments performed on toy datasets? Would it have been harder to use CIFAR-10 instead of CIFAR-2? I noticed there are some more realistic settings in appendix K, why aren't they included in the main text? (For the results related to dropout ensembling, these last results seem to have the same issue I mentioned above with worse LDS for equal serving time.)
- Why are there no comparisons between dropout masks and LoRA ensembling? This seems important to include.
- For the theory, it would have been nice to characterize the salient distributional properties in your experiments. For example, I believe the finding that you can't get arbitrarily good LDS scores with a large number of dropout masks and a single model reflects that one of the assumptions is incorrect (the one on line 483).
Questions
While reading the paper I found it hard to reason about the relative importance of training, serving and storage costs, and I wonder if it would help to specify which setting you're targeting. For example, if it's large LMs or diffusion generative models, which costs matter most? Major LM providers are arguably more concerned about serving costs than training costs, so a discussion of your intended setting seems important. What both your methods have in common is that they reduce training and storage costs, but I'm not sure they provide a Pareto improvement in the accuracy / serving-time trade-off.
For dropout ensembling, how important is it that the original model was trained with dropout?
For LORA ensembling, do you think you could get meaningful LDS improvements and reduced serving costs by fine-tuning only a few layers, perhaps late in the network?
We thank the reviewer for the comments and address them below.
Weakness 1: Performance improvements in the abstract
The improvements in the abstract are claimed with performance (LDS) controlled to be the same. They are the maximum improvements ("up to") of each type of cost that can be achieved separately. Please see the following places in our paper for the settings where these improvements are achieved.
- Dropout Ensemble (line 405-406 in the updated paper): 80% for training and space cost.
- LoRA Ensemble (line 424-425 in the updated paper): training/serving/space by 78.4%, 60.5% and 79.6%.
Weakness 2: Comparison with using multiple checkpoints from a single training run.
Thank you for pointing this out. We have noticed that using multiple checkpoints from a single training run could also be a baseline (mentioned in TRAK [1]). We include some results in Table 2 in Appendix G.2. Dropout Ensemble could still lead to significantly better LDS.
Weakness 3: Using independently trained model and diminishing return
Empirically, the asymptotic behavior and diminishing returns occur for both efficient ensemble methods and naive ensembles. Experiments from our paper and TRAK [1] both demonstrate this phenomenon. Our experiments show that efficient ensembling can be used with only one independently trained model and achieve much better TDA efficacy. A good combination of the number of independently trained models and the number of dropout masks/LoRA fine-tunes forms the Pareto frontier.
Weakness 4 & Question 1: The overall computational cost (training+serving) of Dropout Ensemble.
We provide the table below (newly added as Table 8 in the PDF) to demonstrate that the training cost dominates the computational cost, i.e., the ratio of training cost over serving cost for one ensemble model is large. We also provide a figure (newly added as Figure 9 in the PDF) with performance (LDS) against the total computational cost (training+serving time) to show the improvement of the Dropout Ensemble over the naive independent ensemble. From the figure, it’s clear that Dropout Ensemble could bring better performance under the same computational cost in every experiment setting. The forward-only variant could further improve the performance and reduce time cost, especially for larger models.
| Experiment Setting | Training/Serving time ratio | Training/Serving time ratio (forward-only) |
|---|---|---|
| MNIST-10 + MLP | 40x | 200x |
| CIFAR-2 + MLP | 60x | 450x |
| CIFAR-2 + ResNet9 | 25x | 208x |
| MAESTRO + MusicTransformer | 4.2x | 42x |
Note that although the computational time cost may be influenced by the implementation of model training script and data attribution method, all of our comparisons across different TDA methods are under the same settings and are thus fair.
It is worth noting that the training cost and the serving cost are both for the TDA process, which are defined respectively in Section 3.2. The computational cost of TDA should consider both costs together.
Weakness 5: The competitive performance of Dropout Ensemble (forward only).
Thank you for pointing this out. We acknowledge that this phenomenon is counter-intuitive, and we do not have a thorough understanding of it.
A potential explanation for this phenomenon is that the term (Q) affects the performance "substantially", as shown in the TRAK [1] paper's ablation study. Focusing on mitigating the effect of randomness on this term while caching the other terms could improve the TDA performance effectively and efficiently.
It is worth noting that this phenomenon may be closely related to some open questions existing in the TRAK algorithm. For instance, empirical evidence indicates that a particular approach to ensembling models, known as the "term-wise" ensemble in the TRAK paper, plays a critical role in TRAK's superior performance. However, the underlying reasons for this remain insufficiently understood.
A thorough understanding of this particular phenomenon pointed out by the reviewer may require answers to the open questions existing in TRAK, which go beyond the scope of this paper. We will try to further investigate it in future work. We have added a brief discussion about this phenomenon in our updated paper in Appendix P.
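For concreteness, the forward-only idea can be sketched as follows, consistent with the description that only the Q term is recomputed under dropout while the projected gradients are cached from the original model. The score expression below is a simplified TRAK-style form and the variable names are ours, not the paper's code.

```python
import numpy as np


def forward_only_dropout_trak(phi_train, phi_test, q_per_mask, lam=1e-3):
    """Simplified TRAK-style scores with the forward-only dropout trick.

    phi_train  : (n_train, d) projected training gradients, cached from the base model.
    phi_test   : (n_test, d)  projected test gradients, also cached from the base model.
    q_per_mask : (n_masks, n_train) forward-pass term (e.g., one minus the correct-class
                 probability) recomputed under each dropout mask -- the only per-mask cost.
    """
    d = phi_train.shape[1]
    kernel_inv = np.linalg.inv(phi_train.T @ phi_train + lam * np.eye(d))
    base_scores = phi_test @ kernel_inv @ phi_train.T   # (n_test, n_train), computed once
    q_avg = q_per_mask.mean(axis=0)                     # average the cheap per-mask term
    return base_scores * q_avg[None, :]
```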
Weakness 6: The improvement over serving cost of LoRA Ensemble.
For LoRA Ensemble, we only use the parameters of the LoRA adapters to calculate the TDA scores, instead of the full parameters of the original model. This is the main reason why the serving cost is reduced. We have added a sentence to make this clear in the main text (lines 265-269).
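A hedged sketch of this point: per-example attribution gradients are taken only with respect to the adapter parameters. Selecting parameters by the substring "lora" in their names is an illustrative assumption about how the adapters are registered, not the paper's implementation.

```python
import torch


def lora_only_grad(model, loss):
    """Flattened gradient of `loss` w.r.t. LoRA adapter parameters only.

    The adapter parameter count is typically a tiny fraction of the full model,
    which is what shrinks the per-example gradient (serving) cost.
    """
    lora_params = [p for n, p in model.named_parameters() if "lora" in n.lower()]
    grads = torch.autograd.grad(loss, lora_params)
    return torch.cat([g.reshape(-1) for g in grads])
```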
References
[1] Park, S. M., Georgiev, K., Ilyas, A., Leclerc, G., & Madry, A. (2023). TRAK: Attributing Model Behavior at Scale. Proceedings of the 40th International Conference on Machine Learning, Proceedings of Machine Learning Research 202:27074-27113. https://proceedings.mlr.press/v202/park23c.html
Weakness 7: Larger experiment settings and the usage of CIFAR-2.
We use CIFAR-2 since we closely follow the experiment settings stated in the TRAK paper [1]. This also allows an easier and fair result comparison with naive ensembling used by TRAK. Quantitative evaluation of data attribution of large scale generative models is a challenging task since most of the metrics (e.g., LDS, brittleness, …) need a large number of model retraining to get the ground truth. The experiment in Appendix K contains sparser results because of the difficulty of the evaluation.
Weakness 8: Comparison between LoRA and Dropout Ensemble
A comparison between dropout and LoRA Ensemble can be found in Figure 9 (d) (newly added to the PDF). In terms of LDS, Dropout Ensembles can yield slightly more improvement compared with LoRA Ensembles while both ensembles can significantly outperform naive ensembles for transformer-based large models by achieving a better efficacy-efficiency tradeoff.
Dropout and LoRA Ensemble have different trade-offs regarding training and serving costs. The suitability of them may vary depending on the specific applications.
Weakness 9: The possibility to get arbitrarily good LDS scores.
We clarify that, in the theory section, we aim to study the difference between 2-step ensemble and 1-step ensemble estimators, rather than the 2-step ensemble estimators compared to the oracle attribution result (LDS=1). And our theoretical result does not suggest that the 2-step estimator can achieve arbitrarily good LDS.
The perfect LDS=1 is very challenging to achieve. In our case, the error represents the gap between these two ensemble methods (2-step and 1-step), and the error being 0 doesn’t imply LDS=1.
The assumption in line 483 follows a common Bayesian setup [2], where we assume model weights from different training runs converge to the same posterior distribution, given that the prior distribution of weights and the hyperparameters stay the same. We agree that in practice the converged model weights can be quite different due to the complexity of model training, but we think the assumption itself is reasonable for theoretical analysis.
Another perspective that might be related to your concern is that we never assumed independence between the estimators. That is why we carefully chose our terminology to be i.d. instead of i.i.d. After all, the checkpoints share the training algorithm and data points. We hypothesize that the diminishing returns of ensembling could come from this dependence. We hope this clarification is helpful in addressing your concern. We are happy to provide more explanation if you have any further questions.
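A textbook identity consistent with this hypothesis: for n identically distributed estimators with common variance \(\sigma^2\) and average pairwise correlation \(\rho\) (not assumed independent),

```latex
\operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
  = \sigma^2\left(\frac{1-\rho}{n} + \rho\right)
  \;\xrightarrow{\;n\to\infty\;}\; \rho\,\sigma^2
```

so dependence between ensemble members alone is enough to produce a performance floor and diminishing returns as the ensemble grows.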
Question 2: The sensitivity to dropout training.
We clarify that during the model training phase in our experiments, we largely followed public reference implementations, where the MLP and ResNet are indeed trained WITHOUT dropout while the MusicTransformer is trained with dropout. For all models, we inserted dropout with a fixed probability of 0.1 during the data attribution phase.
While tuning dropout probability or using different dropout training strategies may lead to further improvement of Dropout Ensemble, our consistent experiment results on diverse settings show that the proposed method does not require careful hyperparameter search to achieve superior performance.
Question 3: Fine-tuning only a few layers for LoRA Ensemble
The number of LoRA parameters is typically much smaller than the total number of model parameters. As a result, further reducing the number of LoRA layers may lead to very limited differences in terms of serving cost. While it is an interesting question, we deem it beyond the core questions of interest in this paper and leave this exploration to future work.
References
[1] Park, S. M., Georgiev, K., Ilyas, A., Leclerc, G., & Madry, A. (2023). TRAK: Attributing Model Behavior at Scale. Proceedings of the 40th International Conference on Machine Learning, Proceedings of Machine Learning Research 202:27074-27113. https://proceedings.mlr.press/v202/park23c.html
[2] Hernández-Lobato, J. M., & Adams, R. (2015). Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks. In International Conference on Machine Learning (pp. 1861-1869). PMLR.
Thanks to the authors for their response to our feedback. A couple thoughts:
Performance improvements. Sounds good, makes sense that these are the best-case improvements that can be achieved independently.
Comparison with using multiple checkpoints. Thanks for adding this comparison. It would ideally make sense to show this result in more settings and in a more equalized setup - your current lowest serving cost for dropout (10) is 2x the highest serving cost for multiple checkpoints (5).
Serving time for dropout ensemble. I understand that dropout ensembling will win when you control for training cost, this is clear. What's missing is an evaluation of how it performs compared to naive ensembling with equal serving costs. This is not shown in Figure 3 (unlike in Figure 6), and you omitted the curve that would have shown this in Figure 5.
Dropout (forward only). Thanks, adding some discussion of this point to the paper would be helpful.
LORA ensemble reduced serving time. Thanks, clarifying this key detail about your implementation seems important. It also raises the question of how well it would perform if you also accounted for the original model parameters, that seems like a natural ablation to include.
CIFAR-2. Sounds good.
Comparison between LORA and dropout ensemble. I can't understand why you would only show this for a single subplot in a single supplementary figure, comparing your two main proposals seems like it should be a main result in the paper.
Theory. Thanks for your clarifications. Again, it would have been a good idea to empirically assess the key properties in your theoretical setup.
Sensitivity to dropout training. Thanks for these clarifications.
Fine-tuning only a few layers for LORA. Sounds good.
Overall, I still find some of the experiments and their presentation unsatisfactory, so I'll keep my score at marginally above acceptance.
Comparison with using multiple checkpoints.
Thanks for the advice. We add an additional column to Appendix G.2 Table 2. Under the same serving cost for dropout (10) and intermediate checkpoints (10), Dropout Ensemble performs better than intermediate checkpoints.
Serving time for dropout ensemble
We add a new Figure 11 in Appendix Q with the performance (LDS) against the total computational cost (training + serving time) to show the serving cost overhead of the Dropout Ensemble over the naive independent ensemble. The results show that Dropout Ensemble (forward-only) and LoRA Ensemble could still outperform the naive ensemble if only serving cost is considered. Dropout Ensemble has a larger serving time cost overhead.
LORA ensemble reduced serving time.
Thank you for pointing this out. Accounting for the original model parameters may involve performance improvement and serving cost overhead at the same time. While it is an interesting question, we deem it beyond the core questions of interest in this paper and leave this exploration to future work.
Comparison between LoRA and Dropout Ensemble
LoRA is designed for large transformer-based models, so we only perform experiments on the transformer architecture. Due to the limited time of rebuttal and the large computational cost for transformer-based models, we will include more transformer-based experiment settings in the future.
Dear Reviewer KDDK,
Thank you for your valuable feedback to our submission! We have carefully addressed your comments in our rebuttal and we would greatly appreciate it if you could review it and consider any further discussion. We look forward to hearing any additional thoughts you may have.
Thanks for following up on a couple items from my last response. A couple thoughts:
- For the comparison with using multiple checkpoints: it's helpful that your lowest budget for dropout ensembling now matches your highest budget for multiple checkpoints - we at least now have one fair point of comparison. Again, it would help to show this in more settings since it's one of the only existing baselines. It would also make more sense to show in the main text, not in Appendix G.
- For dropout ensemble's serving time comparison, including this result in the paper is an improvement. I notice that Figure 11 is roughly analogous to Figure 6 for LORA ensembling, which is useful. However, I don't understand why the result is only shown in Appendix Q (page 23) when you have a main text result (Figure 5) that already shows serving time comparisons. As mentioned in my review and previous response, you omitted the naive ensemble baseline from this figure - please also include the curves there.
- For the LORA ensembling method: accounting for all the parameters seems like the natural first implementation of this idea, not an interesting alternative version that's beyond the scope of this work. It would be a better paper if you included those results as well, otherwise other researchers will have to check how much performance is left on the table by your trick of ignoring the original parameters.
- Comparison between LORA and dropout ensembling: got it, you can only compare your two proposals when using transformer architectures. Perhaps it would have made sense to design more of your experiments to use transformers then.
I'll keep my score at marginal accept for the time being.
Thank you for the follow-up.
We are glad to see that the new results make sense to you and are seen as an improvement to our paper. We agree that some important experiment results (such as multiple checkpoints and serving cost in Appendix G.2/P/Q) should be reorganized and moved to the main body of the paper. We will do that in our final draft.
We will also consider adding 1-2 new experiment settings, especially a LoRA ablation study that accounts for the original parameters, to help readers understand the comparison between our design and a more naive baseline.
It has been previously observed that combining training data attribution scores from independently trained models produces better attributions. However, training and deploying an ensemble of models can be very expensive. In this work, the authors propose the use of dropout and fine tuning (specifically through LORAs) to create ensemble members in a way that incurs far less computation at training time.
The authors demonstrate empirically that this provides an improvement in performance and training time tradeoffs on a variety of models and tasks and over a variety of training data attribution methods. They then perform a theoretical analysis of their method to identify regimes in which their method is expected to perform well.
Strengths
The method is straightforward to understand and implement since it relies primarily on the use of widely available modules (e.g. dropout). The method is flexible and can be applied across a wide variety of settings. The authors demonstrate that their method can perform attribution at a similar level to the naive independent ensemble method with far less training.
Weaknesses
If I understand the method correctly, it should generally be expected that, while training time significantly improves using this method, serving time would be more expensive. If this is the case then it should be made clearer in the paper, ideally with some quantification of this cost.
While the authors perform experiments across a variety of methods, models, tasks, and metrics, the full grid of possible evaluations is not complete. For example, a comparison between the LoRA-based and dropout-based methods appears to be missing from the main body of the paper.
Questions
On line 126-127 you state that training and test sets are identical in the supervised case. However from my experience a validation split is usually excluded from the training set. Could you clarify?
You show the trade-off between performance and training time for your method and independently trained ensembles. Could you also discuss the trade-off between serving time and performance when compared with independently trained ensembles?
We thank the reviewer for the comments and address them below.
Weakness 1 & Question 2: The limitation of efficient ensemble methods
Thank you for pointing out this important aspect. We agree that discussing the limitations of our methods is valuable, and we have included a detailed quantitative table in Appendix L, demonstrating the computation and space overhead. Additionally, we have provided further discussion in the main text to highlight these points in Section 3.3.
We provide the table below (newly added as Table 8 in the PDF) to demonstrate that the training cost dominates the computational cost, i.e., the ratio of training cost over serving cost for one ensemble model is large. We also provide a figure (newly added as Figure 9 in the pdf) with performance (LDS) against the total computational cost (training+serving time) to show the improvement of the Dropout Ensemble over the naive independent ensemble. From the figure, it’s clear that Dropout Ensemble could bring better performance under the same computational cost in every experiment setting. The forward-only variant could further improve the performance and reduce time cost, especially for larger models.
| Experiment Setting | Training/Serving time ratio | Training/Serving time ratio (forward-only) |
|---|---|---|
| MNIST-10 + MLP | 40x | 200x |
| CIFAR-2 + MLP | 60x | 450x |
| CIFAR-2 + ResNet9 | 25x | 208x |
| MAESTRO + MusicTransformer | 4.2x | 42x |
Note that although the computational time cost may be influenced by the implementation of the model training script and the data attribution method, all of our comparisons across different TDA methods are under the same settings and are thus fair.
Weakness 2: Comparison between lora and dropout based methods
A comparison between Dropout and LoRA Ensemble can be found in Figure 9 (d) (newly added to the paper in Appendix M). In terms of LDS, Dropout Ensembles yield slightly better improvement compared with LoRA Ensembles and both ensembles can significantly outperform naive ensembles for transformer-based large models by achieving a better efficacy-efficiency tradeoff.
Dropout and LoRA Ensemble have different trade-offs regarding training and serving costs. The suitability of them may vary depending on the specific applications.
Question 1: training and test set in supervised case
We clarify that these notations refer to the data SPACE rather than the datasets. In lines 126-127 of our original draft, we only say that the training data and test data are typically sampled from the same space; the sampled training and test datasets will still be different.
Thank you for your response. I appreciate the added analysis on serving time (it's interesting that LoRA ensembles are that efficient at serving) and the comparison between dropout and LoRA. Also appreciate the clarification about data spaces and datasets.
We thank the reviewer again for the review and for acknowledging our work!
The authors focus on improving gradient-based methods for data-centric learning. They propose two ensemble strategies as efficient alternatives to the naive independent ensemble approach. Experiments on public datasets are conducted to demonstrate the efficacy and efficiency of the proposed strategies.
Strengths
- This paper is well-organized, with a clear and logical structure that makes it easy to follow and understand.
- The proposed strategies are technically sound and could significantly reduce the training time and space costs compared to the naive independent ensemble method.
Weaknesses
- The authors propose two strategies, DROPOUT ENSEMBLE and LORA ENSEMBLE. However, it is unclear which might perform better, or what their respective advantages are in different scenarios.
- Given that the computational time of TracIn [1] is also quite low, the comparative advantages of DROPOUT ENSEMBLE over TracIn, as well as the reasons for them, are not well stated.
- The idea of LORA ENSEMBLE appears similar to GEX [2]. It could be helpful to include GEX in the related work section and discuss the differences and potential advantages of LORA ENSEMBLE.
- There are no experiments provided to evaluate the proposed strategies on data-centric tasks (for example, mislabel detection). Understanding their performance in this context would be valuable.
- There are some typos. For example,
  - Line 183 on Page 4: indepedently
  - Line 252 on Page 5: fowrad-only
  - Legends in Fig. 4: Naive Ensmeble
[1] Garima Pruthi, Frederick Liu, Satyen Kale, and Mukund Sundararajan. "Estimating training data influence by tracing gradient descent." Advances in Neural Information Processing Systems 33 (2020): 19920-19930.
[2] SungYub Kim, Kyungsu Kim, and Eunho Yang. "GEX: A flexible method for approximating influence via Geometric Ensemble." Advances in Neural Information Processing Systems 36 (2024).
Questions
Please see the weakness part.
We thank the reviewer for the comments and address them below.
Weakness 1: Comparison between Dropout and LoRA Ensemble.
A comparison between Dropout and LoRA Ensemble can be found in Figure 9 (d) (newly added to the paper in Appendix M). In terms of LDS, Dropout Ensembles yield slightly better improvement compared with LoRA Ensembles and both ensembles can significantly outperform naive ensembles for transformer-based large models by achieving a better efficacy-efficiency tradeoff.
Dropout and LoRA Ensemble have different trade-offs regarding training and serving costs. The suitability of them may vary depending on the specific applications.
Weakness 2: Comparison between Dropout Ensemble and TracIN.
Thank you for pointing out the connection. Dropout Ensemble is also applicable to TracIn. The Grad-Dot result in Figure 4 can be seen as a special, efficient case of TracIn-CP, and Dropout Ensemble improves its performance as well. The result in Table 2 (Appendix G.2) also shows that Dropout Ensemble can further improve the result when using intermediate checkpoints. Additionally, Dropout Ensemble does not require the additional space cost for the checkpoints needed by TracIn.
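For context on the relationship mentioned above, here is a minimal sketch of checkpoint-summed gradient-dot scoring in the spirit of TracIn-CP; with a single checkpoint and unit weight it reduces to plain Grad-Dot. The helper names and the `loss_fn` interface are hypothetical stand-ins, not the paper's code.

```python
import torch


def tracin_cp_score(checkpoints, loss_fn, train_example, test_example, lrs=None):
    """Checkpoint-summed gradient-dot score in the spirit of TracIn-CP.

    `checkpoints` is a list of models saved during training; `loss_fn(model, example)`
    returns a scalar loss. With one checkpoint and unit weight, this is plain Grad-Dot.
    """
    lrs = lrs or [1.0] * len(checkpoints)
    score = 0.0
    for model, lr in zip(checkpoints, lrs):
        params = [p for p in model.parameters() if p.requires_grad]
        g_train = torch.autograd.grad(loss_fn(model, train_example), params)
        g_test = torch.autograd.grad(loss_fn(model, test_example), params)
        score += lr * sum((a * b).sum() for a, b in zip(g_train, g_test))
    return score
```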
Weakness 3: LORA ENSEMBLE vs GEX
Thank you for pointing out this relevant work that we missed. GEX leverages geometric ensembles to collect the ensemble models. The advantages of LoRA Ensemble over GEX lie in all three types of cost outlined in our paper.
- Training cost: the geometric ensemble needs to change the learning rate during the ensemble model collection process, which means that it cannot use pre-trained checkpoints of large models.
- Serving cost: LoRA Ensemble only uses the parameters of the LoRA adapter to calculate the attribution score, while GEX still uses the full parameters.
- Space cost: all the ensemble models collected by the geometric ensemble need to be stored, which costs more space compared to LoRA Ensemble (which only saves the LoRA adapters).
Weakness 4: More experiments on downstream data-centric tasks
Thank you for suggesting this experiment; it is indeed an insightful addition to evaluate the methods on downstream data-centric tasks such as noisy label detection. As shown in the PDF, we have included new quantitative results in Appendix O, demonstrating how Dropout Ensemble improves noisy label detection performance as evaluated by the AUC metric.
Weakness 5: Typos
Thank you for pointing this out. We have fixed all of them in our updated paper.
Dear Reviewer 8sXq,
Thank you for your valuable feedback to our submission! We have carefully addressed your comments in our rebuttal and we would greatly appreciate it if you could review it and consider any further discussion. We look forward to hearing any additional thoughts you may have.
Thank you for your responses, which address some of my concerns. While it is commendable that the experiments demonstrate the effectiveness of the two proposed methods, the underlying rationale remains unclear for W1, W2, and W3. Therefore, I will maintain my current score.
Thank you for the follow-up.
It’s good to see that you find the experiments demonstrate the effectiveness of the two proposed methods. Could you elaborate more on your further questions for W1, W2, and W3? We are happy to clarify them if there is still anything unclear.
- W1: DROPOUT ENSEMBLE vs. LORA ENSEMBLE
- W2: DROPOUT ENSEMBLE vs. TracIn
- W3: LORA ENSEMBLE vs. GEX
In Section 4.1, you mentioned evaluation metrics for both TDA efficacy and TDA efficiency. While the advantages of the two proposed methods in TDA efficiency are straightforward to understand, the reasons behind their advantages in TDA efficacy remain unclear. Could you please explain the underlying rationale?
Additionally, there are still typos in the revised part of the manuscript. For instance, Line 215 on Page 4. Please double-check the text.
Thank you.
Thank you for the elaboration. We would like to clarify that our paper's claims are always on the efficacy-efficiency trade-off (see for example, Lines 024 - 026 in our abstract: "These strategies significantly reduce training time (up to 80%), serving time (up to 60%), and space cost (up to 80%) while maintaining similar attribution efficacy to the naive independent ensemble.").
We compare the efficiency gain when achieving the same efficacy, which can also be translated to efficacy gain with the same efficiency. While we measure both efficacy and efficiency metrics, our claims on the advantage of our method are always about the better trade-off curves as shown in our plots. We hope this clarifies the confusion.
For W1-3, we have made thorough explanations in previous replies. If you have any specific questions, we are more than happy to further address them.
Thank you for pointing out the typos; we will fix them in our final draft.
A number of training data attribution (TDA) methods have been developed recently to quantify how training data points influence particular model predictions. However, they have a clear trade-off between computational complexity and accuracy. The authors propose to improve the accuracy of “gradient-based methods” (that are computationally efficient) to improve this trade-off. In particular, they attempt to replace naive ensembling methods with LoRA and Dropout ensembling and show results.
Strengths
The authors have motivated their problem well and seen the need for developing better ensembling methods based on simple techniques as opposed to naive ensembling which could be retraining or fine-tuning models fully. They base their development on the TRAK approach which proposed ensembling after using random projections of gradients to counter the computational expense of determining attribution scores.
The authors also differentiate between training, serving and space cost. They also use some practical tricks, such as in lines 236-239, where they only use the dropout models to calculate Q values, whereas the randomly projected gradients are determined based on the original models.
Improvements are seen in performance, and good utility is demonstrated for the dropout case. I did not check the theory but it sounds reasonable, at least in Case 1.
Weaknesses
These are some comments that the authors may consider:
- While the methods do improve performance, they do not help scale TRAK to larger models since TRAK approach has a huge storage cost. This also seems to lead the authors to test with smallish models.
It's not clear how the LoRA approach compares with the dropout approach. Is there any way to compare them with each other? Also, LoRA has only been demonstrated for the MusicTransformer model.
- If the authors really want to demonstrate this with large scale generative models, they should consider language transformer models (even smallish ones like 1B or so parameters). I can understand the practical infrastructure difficulties but this must be spelled out. What are the bottlenecks for a larger scale evaluation?
- The limitations of the approach must be spelled out (increase in space cost). You can consider adding this in a dedicated section on limitations of the work.
- Showing some qualitative examples that show the test data and attributions for the naive and the proposed ensemble methods will be useful.
Overall this seems like a simple extension to TRAK like methods so it scores a bit low on novelty, so this must be compensated by thoroughness in experimentation and presentation.
Questions
Please see weaknesses.
We thank the reviewer for the comments and address them below.
Weakness 1: The storage cost of TRAK and test of our method on larger settings
The storage cost of TRAK comes from two sources. The first source is the checkpoints of independent retrained models. Another source comes from the caching of projected gradients.
Efficient ensemble methods could reduce the cost from the first source significantly (up to 80%). Some experiments are stated in Figure 3 and Figure 6(d).
The second source is small when the number of data points is not too large. Tricks such as block-wise matrix multiplication and data filtering can be applied to further reduce the memory cost of the cached projected gradients. While developing these tricks is out of the scope of this paper, they are orthogonal to ensembling and can be straightforwardly combined with our efficient ensembles.
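As an illustration of the kind of block-wise computation mentioned above (the function name and chunking scheme are our assumptions, not the paper's implementation), the Gram matrix of cached projected gradients can be accumulated chunk by chunk so the full gradient matrix never has to reside in memory:

```python
import numpy as np


def blockwise_gram(gradient_chunks, d):
    """Accumulate the d x d Gram matrix of projected gradients chunk by chunk.

    `gradient_chunks` is any iterable yielding (block_size, d) arrays, e.g. memory-mapped
    files written during serving, so the full (n_train, d) matrix is never held in memory.
    """
    gram = np.zeros((d, d))
    for block in gradient_chunks:
        gram += block.T @ block
    return gram
```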
Additionally, we have applied Dropout Ensemble to the GPT-2 (124M) + WikiText-2 dataset, which is a much larger setting. As shown in the table below, Dropout Ensemble without independently retrained models achieves performance comparable to Naive Ensemble with multiple independently retrained models.
| D (dropout masks) \ I (independent models) | 1 | 5 | 10 |
|---|---|---|---|
| 0 (naive ensemble) | 0.084 | 0.174 | 0.205 |
| 5 | 0.161 | / | / |
| 10 | 0.190 | / | / |
(GPT-2 + WikiText-2)
Weakness 2: Comparison between Dropout and LoRA Ensemble
A comparison between Dropout and LoRA Ensemble can be found in Figure 9 (d) (newly added to the paper in Appendix M). In terms of LDS, Dropout Ensembles yield slightly better improvement compared with LoRA Ensembles and both ensembles can significantly outperform naive ensembles for transformer-based large models by achieving a better efficacy-efficiency tradeoff.
Dropout and LoRA Ensemble have different trade-offs regarding training and serving costs. The suitability of them may vary depending on the specific applications.
LoRA is designed for large transformer-based models, so we only perform experiments on the transformer architecture. Other than MusicTransformer, we also show the results on GPT in Figure 8 of Appendix K.
Weakness 3: Challenges and experiment on large scale generative models
We have added a new experiment on GPT-2 (124M) + WikiText-2 dataset as shown in the reply to Weakness 1.
Quantitative evaluation of data attribution for large-scale generative models is a challenging task since most of the metrics (e.g., LDS, brittleness, ...) require a large number of model retrainings to obtain the ground truth.
Weakness 4: Limitation of the approaches
Thank you for pointing out this important aspect. We agree that discussing the limitations of our ensemble methods is valuable, and we have included a detailed quantitative table in Appendix L, demonstrating the computation and space overhead. Additionally, we have provided further discussion in the main text to highlight these points in Section 3.3. More empirical results are presented in Appendix M for computation overhead and Figure 6(d) for space overhead.
Weakness 5: qualitative examples
Thank you for suggesting this experiment—it is indeed an insightful addition to visually demonstrate the effectiveness of our ensemble method. As shown in the PDF, we have included new qualitative examples in Appendix N, demonstrating how Dropout Ensemble improves the TDA results without requiring additional training costs.
Dear Reviewer SJ8M,
Thank you for your valuable feedback to our submission! We have carefully addressed your comments in our rebuttal and we would greatly appreciate it if you could review it and consider any further discussion. We look forward to hearing any additional thoughts you may have.
Dear Reviewers,
We thank all reviewers for their detailed and thoughtful feedback. We are pleased to see that reviewers found our work
- shows good performance and solid improvement (SJ8M, 8sXq, s12W, KDDK, DqZM);
- is practically useful (SJ8M, s12W, KDDK);
- is well-presented (8sXq, KDDK, DqZM).
We have comprehensively addressed the reviews and improved our draft; here are the major updates.
- Overhead on serving cost: We have added a new Table 8 in the PDF to show that training cost dominates the total cost. We also added a new Figure 9 in the PDF with performance (LDS) against the total computational cost (training+serving time) to show the improvement of the Dropout Ensemble over the naive independent ensemble.
- Comparison between LoRA and Dropout Ensemble: In the newly added Figure 9(d), both LoRA and Dropout Ensembles are significantly better than naive ones while Dropout Ensemble is slightly better than LoRA. We note that Dropout and LoRA Ensemble have different trade-offs regarding training and serving costs and the suitability of them may vary depending on the specific applications.
- Scaling to larger experiment settings: Some additional experiments in Appendix K and a newly added experiment on GPT-2 (124M) show that the efficient ensemble methods can be scaled to larger settings.
For more detailed, point-by-point responses, please refer to our individual replies to each reviewer.
I have read all the materials of this paper, including the manuscript, appendix, comments, and responses. Based on the collected information from all reviewers and my personal judgment, my recommendation for this paper is to reject. No objection was raised against the reject recommendation by the reviewers who participated in the internal discussion.
Research Question
This paper studies the sample influence problem based on influence functions.
Motivation
The authors claim that recent research has shown that augmenting gradient-based methods with ensembles of multiple independently trained models can achieve significantly better attribution efficacy (I did not find solid evidence on this point); however, such approaches are impractical for very large-scale applications.
Philosophy
The authors tackle the above independent training with correlated training, i.e., variants derived from a single training run.
Techniques
The authors propose two strategies, DROPOUT and LORA ensemble. I would like to point out that DROPOUT is very similar to the concurrent work [1] and LORA is similar to GEX [2]. In general, the proposed techniques are straightforward.
Experiments
- Personally, I do not think LDS is a proper metric for data attribution. Data attribution has no ground-truth value; in such cases, a downstream task is usually employed for evaluation. LDS relies on multiple simulations (the original authors suggest 100-500 models; in this paper, the authors use 50 models) for correlation fitting. First, the result is not deterministic and is highly sensitive to the selected subsets or runs. Second, it is not appropriate for large-scale applications, which are targeted by this paper.
- In the confidential message to the AC, I agree that the authors do not need to conduct experiments on industrial-level datasets and models. Still, the datasets used in this paper are relatively small, being subsets of the complete original datasets.
- The analysis of training time and space savings is incorrect. The authors claim that the proposed ensemble can significantly reduce the training time cost while achieving the same level of attribution efficacy. However, a post-evaluation analysis tied to performance does not count, since the performance cannot be obtained in advance.
Figure
The visualization of figures can be further improved.
Theoretical Analysis
I don't believe that every paper requires a theoretical analysis, especially if the analysis is disconnected from the proposed method; it may even detract from the paper's overall value. The current theoretical analysis is not well connected to the two proposed methods.
Others
- I suggest the authors follow high-quality papers, which have interesting and practical implications. I understand that the criteria for "high quality" vary according to different research tastes.
- Since two strategies are proposed in this paper, a natural question that needs to be addressed is under what conditions one is better than the other.
[1] Revisit, Extend, and Enhance Hessian-Free Influence Functions.
[2] GEX: A flexible method for approximating influence via Geometric Ensemble.
Additional Comments from Reviewer Discussion
No objection from reviewers who participated in the internal discussion was raised against the reject recommendation.
Reject