REDUCR: Robust Data Downsampling Using Class Priority Reweighting
Abstract
Reviews and Discussion
This paper introduces a new method to perform online data selection. The proposed method aims to improve the worst-class performance while maintaining overall performance. The method precisely estimates per-class performance and puts larger weights on the losses of samples from poorly performing classes. Experiments show some improvement in worst-class accuracy compared with several baselines.
Strengths
- This paper identifies an important and interesting problem in existing methods, where the worst-class performance is overlooked.
- This paper presents a simple solution to solve the complex max-min optimization in Eq. (3).
- This paper conducts extensive experiments and ablation studies to validate the effectiveness of the proposed method.
Weaknesses
- Regarding the class-irreducible loss, it is not well justified that the amortized model can be a good approximation of the class-irreducible loss model.
- Regarding the class-irreducible loss, training a separate model for each class can be computationally prohibitive on large datasets. Datasets with larger class spaces, such as CIFAR-100 and ImageNet, are missing from the experiments.
- In Eq. (6-7), the meaning of y and c, and their relation, need further clarification.
- Why can the proposed method avoid selecting datapoints with noisy labels?
- What if a clean validation set is not accessible?
- Some related works are missing from the discussion and comparison, such as [1,2].
[1] Heteroskedastic and imbalanced deep learning with adaptive regularization
[2] Robust long-tailed learning under label noise
Questions
see Weaknesses
We thank the reviewer for their feedback and comments. We’re glad that the reviewer appreciates the importance of the problem, the simplicity of our solution, and the extensive experiments that we have conducted to confirm the effectiveness of REDUCR.
To address the reviewer's questions:
- We have added a more detailed explanation of the amortized class-irreducible loss approximation to the manuscript in Appendix A.3. Our method performs well across a variety of text and image datasets, and our ablation studies show that the approximate model is important for REDUCR’s performance.
- Our work builds a solid foundation for robust online batch selection; classification with a large number of classes is an important future direction that we would like to pursue (we openly recognise this in the final section of the original submission). We agree with the reviewer that training these models can sometimes be expensive, even though the models themselves can be small. The many-class setting needs to be considered very carefully and is beyond the scope of this work. Regarding the datasets: we direct the reviewer to Tables 1 and 2 in the updated manuscript, in which we report strong results on two NLP datasets that have uses in other applications; for example, [1] uses a model trained on the MNLI dataset to quantify uncertainty in large language models. We believe that we have enough empirical evidence to substantiate our approach.
- We have updated Section 2 of the manuscript to further clarify this. Each datapoint x has an associated label y, whilst c is an arbitrary class; when y == c, the label of the datapoint is equal to the class c.
- In Section 4.4 we discuss how REDUCR uses the class-irreducible loss model to help identify which points are learnable. If a point has a noisy label, it will likely have a high loss under the class-irreducible loss model, resulting in a low selection score, so it is not selected for the small batch b_t. The reviewer can find a detailed discussion of these ideas in Sections 3 and 4 of [2].
- We do not specifically consider this within our setting; however, in our experiments on the Clothing1M dataset only a small clean validation dataset was available. A combination of the noisy training data and the small clean validation dataset is used to train the class-irreducible loss model, whilst the small clean validation dataset alone is used to calculate the class-holdout loss term. Our strong empirical results on Clothing1M suggest that only a small clean validation dataset is required, and only for the class-holdout loss term, whilst the class-irreducible loss model can be trained on unclean data. It is practical to hand-label a small validation dataset such as this, so we did not investigate this setting further in this work.
- We thank the reviewer for drawing these works to our attention and have added them to our introduction and related work sections.
Our updated manuscript and the accompanying responses aim to address your observations and recommendations. Given our responses to the questions and your acknowledgment of the importance of the problem, the simplicity of our method, and the extensive empirical evaluations we have conducted to confirm the effectiveness of our method, we respectfully ask that the reviewer reconsider the reject recommendation. We welcome any further thoughts and observations.
[1] Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation (https://openreview.net/pdf?id=VD-AYtP0dve)
[2] Prioritized training on points that are learnable, worth learning, and not yet learnt (https://arxiv.org/pdf/2206.07137.pdf)
Dear reviewer 3z1f,
Thank you again for your time and effort reviewing this paper. The discussion phase is ending in less than 12 hours. We would be grateful for your constructive feedback on our rebuttal. We are open to making further modifications based on your suggestions.
Best regards,
Authors
Thank you for your response. I have thoroughly reviewed the authors' reply, as well as other reviews.
The authors' feedback has addressed some of my queries; however, several significant concerns persist:
The feasibility of the proposed method remains unverified. As highlighted in the review, the computational expense involved in training the Class-Irreducible Loss Models is significant and cannot be overlooked. Furthermore, the experiments conducted in the current paper are limited to datasets with only a handful of classes, specifically CIFAR-10 and Clothing1M.
The comparative analysis lacks the inclusion of numerous methods for learning with noisy labels (with or without the consideration of class imbalance). The results presented in the current paper do not demonstrate the empirical superiority of the proposed method.
In summary, I believe that the current paper requires considerable enhancements prior to publication.
We thank the reviewer for the comments and engagement in our work. However, we disagree with their response to our detailed rebuttal. The reviewer once again asserts that REDUCR is not practical without referencing or commenting upon any arguments we made in our initial rebuttal. The reviewer argues that the experiments lack suitable comparisons for learning with noisy labels despite the inclusion of the RHO-Loss baseline. RHO-Loss is specifically designed and tested to perform well in settings with noisy labels as detailed in sections of the paper we directed the reviewer to in our original rebuttal. The reviewer seems to dismiss the empirical superiority of the proposed method based on the apparent lack of suitable baselines for one of the five datasets we have tested our method on. In conclusion, the reviewer has simply restated their original assertions without addressing any of the arguments we made in our original rebuttal or reviewing the changes to our manuscript.
This work proposes an online batch selection algorithm called REDUCR to preserve the worst-class generalization performance. Extensive experiments on multiple datasets show the superiority of the proposed method.
Strengths
- Clear presentation and easy-to-follow writing.
- Extensive evaluation on multiple datasets with two tasks.
Weaknesses
Unclear motivation
- Clothing1M is not a proper dataset for evaluating the effect of batch selection on worst-class accuracy, since it contains noisy labels as well. The performance drop of other baselines may be due to label noise rather than class imbalance.
- Loss-based batch selection baselines (e.g., Loshchilov et al. (2015)) prefer to select high-loss examples. They will therefore automatically select worst-class examples first, as those exhibit higher loss (i.e., worse generalization).
- I think the reason these baselines fail on Clothing1M is label noise, since noisy examples tend to exhibit higher loss and are thus easy to select [a][b].
[a] Co-teaching: Robust Training of Deep Neural Networks with Extremely Noisy Labels. NeurIPS, 2018
[b] Meta-Query-Net: Resolving Purity-Informativeness Dilemma in Open-set Active Learning. NeurIPS, 2022
Limited practicality
- Although the authors provide an efficiency analysis with respect to training steps (in Fig 3), this algorithm might be less practical since it takes time to select the batch b_t from B_t by solving the minimax problem at every time step t.
- The authors should provide a GPU-time analysis compared to random batch selection to demonstrate the practicality of this algorithm.
Questions
How is the batch b_t selected exactly? All the formulations for selection scores are for a single datapoint, as the authors assume the small batch to be a single datapoint in Sec 4.2. Could you elaborate on how the "batch" selection exactly works (line 6 in Alg 1)? With batch selection, I think the selection should consider the relationships between examples to minimize Eq. (3), so the selection algorithm should differ from the single-point case.
We would like to thank the reviewer for the feedback. We are glad the reviewer recognises the importance of the problem, the simplicity of our solution, and the extent of our experiments to further justify the method. In addition, we would like to direct the reviewer to our additional experimental results in the updated manuscript (see Tables 1 and 2) where we have added further results on an NLP dataset. We believe that this further strengthens our results.
The reviewer raises two points in their analysis of the paper’s weaknesses. Firstly, they argue that Clothing1M is an improper dataset for evaluation because the baselines will perform worse due to label noise. Whilst we agree with the reviewer’s analysis that (Loshchilov et al., 2015) will be affected by the label noise, this method still outperforms RHO-Loss [2], which is specifically designed and tested in the noisy-label setting (see Figure 1, Figure 3, and Section 4.3 in [2]). As such, RHO-Loss is a strong baseline in the noisy-label setting, yet surprisingly it is outperformed by (Loshchilov et al., 2015) in terms of worst-class test accuracy (see Figure 3). Because of this, we respectfully disagree with the reviewer’s assessment that our work has unclear motivation on the basis of one dataset that is prevalent in the most relevant literature. REDUCR’s strong performance on a dataset with noisy labels and class imbalance only further supports the practicality of our method for real-world applications.
Nevertheless, it is important to consider noisy labels carefully. We thank the reviewer for the references and have cited [3] ([a]) in the related work section of our updated manuscript due to its use of multiple models to identify and subselect non-noisy data.
Secondly, we would like to correct the reviewer’s comment that REDUCR solves the min-max optimization problem at each timestep: as per Algorithm 1, we interleave gradient updates with weight updates, solving the min-max problem over the course of model training. In regards to a comparison of GPU run times: REDUCR consistently outperforms the random-selection (uniform) baseline in terms of worst-class and average test accuracy, no matter the GPU run time. REDUCR matches the best mean performance across runs of the uniform baseline in half the number of training steps. None of the REDUCR experiments took double the run time of the uniform baseline; the largest disparity was observed on the MNLI dataset, where the uniform baseline took 20 hrs to complete and REDUCR took 34 hrs on the same hardware, and on this dataset REDUCR matches the mean performance of the uniform baseline around 100k training steps earlier in training, as seen in Figure 6. For these reasons, we are confident about the practicality of REDUCR as an algorithm.
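For intuition only, here is a self-contained sketch of the interleaved multiplicative-weights dynamics; the constants and the simulated per-class losses are illustrative assumptions, not the paper's exact update rule (see Algorithm 1 for that).

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, eta, steps = 5, 0.1, 200
w = np.full(num_classes, 1.0 / num_classes)  # class priority weights

for t in range(steps):
    # Stand-in for the class-holdout losses observed at step t; class 3
    # is made persistently harder to illustrate the dynamics.
    class_losses = rng.normal(1.0, 0.1, num_classes)
    class_losses[3] += 0.5
    # One multiplicative-weights step, interleaved with (not replacing)
    # the gradient update on the selected batch b_t.
    w = w * np.exp(eta * class_losses)
    w /= w.sum()

print(np.round(w, 3))  # the weight mass concentrates on the hardest class
```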
In response to the reviewer's question: batch selection is specified in Line 6 of the algorithm (which we have clarified further; please see the updated manuscript). We select the batch b_t by taking the individual points with the top-k scores in each batch B_t. This is a common approach in the literature (see [1]), with sampling being another potential choice (see [2]). The relationship between examples is an interesting avenue for future work, and we thank the reviewer for highlighting this; however, it is beyond the scope of this work, and our empirical results show that even without this consideration, our method achieves strong empirical performance.
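A minimal sketch of this top-k rule (the function name is hypothetical):

```python
import torch

def select_small_batch(scores, k):
    # Line 6 of Algorithm 1, as clarified above: keep the k points of the
    # large batch B_t with the highest selection scores to form b_t.
    return torch.topk(scores, k).indices

# Illustration: select 32 points out of a large batch of 320.
b_t_indices = select_small_batch(torch.randn(320), k=32)
```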
As we have addressed the reviewer's questions (and updated the manuscript), clarified the motivation for including the Clothing1M dataset, and addressed the practical application of our algorithm, we kindly ask the reviewer to re-evaluate their score.
[1] Soren Mindermann et al.; Prioritized training on points that are learnable, worth learning, and not yet learnt. ICML, 2022 (https://arxiv.org/pdf/2206.07137.pdf)
[2] Ilya Loshchilov et al.; Online batch selection for faster training of neural networks. arXiv preprint arXiv:1511.06343
[3] Co-teaching: Robust training of deep neural networks with extremely noisy labels. NeurIPS, 2018
Dear reviewer CKWX,
Thank you again for your time and effort reviewing this paper. The discussion phase is ending in less than 12 hours. We would be grateful for your constructive feedback on our rebuttal. We are open to making further modifications based on your suggestions.
Best regards,
Authors
Thanks for the authors' response. I've read all the responses, the other reviewers' comments, and the revised paper. However, I'm still not convinced by the motivation, since the reason why the selection algorithms fail may be intertwined with noisy labels. Also, I still think the class imbalance problem can be naturally solved by existing loss-based selection algorithms. Although the authors provide some ad hoc explanation of time complexity, it is not comprehensive. I will keep my score. Thanks.
We thank the reviewer for their comments. The reviewer’s answers directly contradict the detailed evidence we have provided. For example, the RHO-Loss baseline is designed and tested to not fail in settings with label noise and yet performs poorly in comparison with REDUCR on the Clothing1M dataset. Despite a thorough explanation of these ideas, the reviewer reiterates their original point that label noise is why the selection algorithms fail (which has nothing to do with our algorithm).
In this paper, the authors propose an approach called REDUCR for online batch selection problem. REDUCR improves existing online batch selection approach RHO-Loss by directly optimizing the worst-class generalization performance.
Strengths
- The paper is written well and easy to follow. All the figures and tables are of high-quality.
- A comprehensive discussion with related works has been provided.
- Empirical studies show the proposed approach can achieve superior worst-class test accuracy, though this result is not surprising since the proposed approach directly optimizes the worst-class generalization performance.
Weaknesses
- The contribution and novelty of this paper are limited. Compared with the existing work RHO-Loss, the only difference is that the proposed approach directly optimizes the worst-class generalization performance, while RHO-Loss optimizes the average generalization performance. Other aspects (e.g., techniques for inducing selection scores and approximating the class-irreducible loss model) are the same.
- It is not clear why the model induced from Eq. (8) can approximate the so-called class-irreducible loss model. They are totally different models from my perspective.
- The proposed approach improves worst-class test accuracy, but sacrifices the overall average test accuracy.
Questions
- See Weaknesses above.
We appreciate the reviewer for recognizing the strengths of our paper, particularly highlighting the empirical studies that demonstrate superior worst-class test accuracy, aligning with the primary objective of our paper.
- However, we respectfully disagree with the reviewer's assessment of the novelty of our paper. To substantiate our stance, we emphasize the novelty of the problem under investigation: to our knowledge, robust online batch selection has not been formally defined or researched previously. Additionally, we underscore the distinctiveness of the derived acquisition score (its dependency on classes and on the class-holdout loss, whose presence is necessary for superior performance, as clearly demonstrated in Figure 4), which differs substantially from the previous RHO-Loss. The algorithm itself is novel, relying on multiplicative weights (MW) updates, leading to noticeable design differences, as well as differences in empirical performance, compared to prior approaches. In summary, our paper is novel in terms of 1) the problem formulation; 2) the class-specific selection score; and 3) the proposed algorithm.
- To clarify why this is a suitable approximation, we have updated the manuscript to explain our reasoning further (please see Appendix A.3). While we tried other approximations for this term, the up-weighting resulted in the most consistently stable training, and we stand by our updated empirical evaluation of REDUCR in Tables 1 and 2, which highlights our strong performance on vision and text datasets. Based on our experimental results, there is no statistically significant decline, or "sacrifice," in the average performance, as indicated in Tables 1 and 2. In particular, for Clothing1M and QQP, the average test accuracy even slightly improved, potentially due to distributional shifts. Therefore, we are uncertain about the specific aspect the reviewer is alluding to. Could the reviewer please elaborate on this in case of disagreement?
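As a rough illustration of the up-weighting idea only (this exact form and the value of up_weight are our assumptions for the sketch; the actual scheme is given by Eq. (8) and Appendix A.3):

```python
import torch
import torch.nn.functional as F

def amortized_irreducible_step(model, optimizer, xs, ys, c, up_weight=5.0):
    # Train a single amortized model whose loss up-weights samples of
    # class c, standing in for a separately trained class-irreducible
    # loss model for class c.
    losses = F.cross_entropy(model(xs), ys, reduction="none")
    weights = torch.ones_like(losses)
    weights[ys == c] = up_weight
    loss = (weights * losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```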
- Even if there were a drop in average performance, we note that more often than not there is a trade-off between average and worst-case performance when it comes to robust learning/optimization algorithms, as evidenced in seminal papers such as [1] (see Table 1). In our case, we are pleasantly surprised by the strong average performance of our algorithm (since we do not optimize for this), considering the significant gains in worst-case performance. This gain is noteworthy, and we believe it is a strength and not a weakness of our approach.
We have addressed all of the reviewer's questions. We firmly assert that there is no evidence to justify rejecting our paper (given the provided strengths and our updated manuscript) and kindly request that the reviewer re-evaluate the acceptance decision or provide additional substantiation for the provided score.
[1] Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization (https://arxiv.org/pdf/1911.08731.pdf)
Thanks for the authors' responses. However, my concerns still exist. I maintain my opinion that the proposed approach is not novel enough to be regarded as a non-trivial extension of the existing approach RHO-Loss, considering the high quality bar of ICLR. And it is not well justified that the model induced from Eq. (8) can approximate the so-called class-irreducible loss model, although such an implementation trick may lead to good task performance. So, I would like to keep my score.
We thank the reviewer for their comments. The initial review provided only three short bullet points as the paper’s weaknesses, the last of which directly contradicted results that are evident even in a single pass through our manuscript. The response to our detailed rebuttal also lacks any detail, references, or specific commentary as to how our proposed changes and explanations are unsatisfying. In conclusion, this review has not met our high expectations of ICLR.
Dear reviewer sdUG,
Thank you again for your time and effort reviewing this paper. The discussion phase is ending in less than 12 hours. We would be grateful for your constructive feedback on our rebuttal. We are open to making further modifications based on your suggestions.
Best regards,
Authors
We appreciate all the comments and feedback from the reviewers on our paper. Since our submission, we have conducted additional experiments on the QQP NLP dataset and have corrected the reported results for the MNLI dataset (see Tables 1 and 2). For both datasets, REDUCR outperforms all the baselines in terms of worst-class test accuracy and matches the best-performing baseline in terms of average test accuracy.
Dear reviewers,
Thank you for your work to review this paper. I want to remind you that the author-reviewer discussion period is closing at the end of Wednesday Nov 22nd (AOE).
The authors have provided responses to your comments. Please take the time to review whether your concerns have been adequately addressed and engage with the authors on their responses.
Sincerely,
AC