PaperHub
Score: 7.0/10 · Poster · ICML 2025
3 reviewers; scores: 3, 4, 4 (min 3, max 4, std 0.5)

Retraining with Predicted Hard Labels Provably Increases Model Accuracy

OpenReview · PDF
Submitted: 2025-01-16 · Updated: 2025-07-24
TL;DR

We theoretically show that retraining a model with its predicted hard labels can improve its accuracy when the given labels are noisy, and empirically demonstrate that retraining significantly improves label DP training at no extra privacy cost.

Abstract

The performance of a model trained with noisy labels is often improved by simply *retraining* the model with its *own predicted hard labels* (i.e., $1$/$0$ labels). Yet, a detailed theoretical characterization of this phenomenon is lacking. In this paper, we theoretically analyze retraining in a linearly separable binary classification setting with randomly corrupted labels given to us and prove that retraining can improve the population accuracy obtained by initially training with the given (noisy) labels. To the best of our knowledge, this is the first such theoretical result. Retraining finds application in improving training with local label differential privacy (DP), which involves training with noisy labels. We empirically show that retraining selectively on the samples for which the predicted label matches the given label significantly improves label DP training at no extra privacy cost; we call this consensus-based retraining. For example, when training ResNet-18 on CIFAR-100 with $\epsilon=3$ label DP, we obtain more than $6$% improvement in accuracy with consensus-based retraining.
Keywords
Retraining · Predicted Labels · Hard Labels · Label Noise · Label DP

Reviews and Discussion

Review
Score: 3

This paper investigates the benefits of retraining a model using its own predicted hard labels in scenarios where training data contains noisy labels.

There are two strategies for retraining the model:

  • Full Retraining: The model is retrained on the entire dataset using its own predicted hard labels.
  • Consensus-Based Retraining: Only samples for which the model's predicted label matches the original noisy label are used for retraining.
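
For concreteness, here is a minimal sketch of the label-selection rules behind the two strategies (an illustration with hypothetical arrays, not code from the paper):

```python
import numpy as np

def full_retraining_labels(y_pred):
    # Full retraining: every sample is kept, relabeled with the
    # initial model's predicted hard label.
    return np.arange(len(y_pred)), y_pred

def consensus_retraining_labels(y_pred, y_given):
    # Consensus-based retraining: keep only the samples whose predicted
    # label matches the given (noisy) label; their labels are unchanged.
    idx = np.flatnonzero(y_pred == y_given)
    return idx, y_given[idx]

# Tiny illustration with hypothetical labels
y_given = np.array([1, 0, 1, 1, 0])   # noisy labels provided for training
y_pred  = np.array([1, 1, 1, 0, 0])   # initial model's hard predictions
print(consensus_retraining_labels(y_pred, y_given))  # indices [0, 2, 4]
```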

The paper provides a rigorous theoretical analysis showing that full retraining with predicted hard labels can improve a model's population accuracy. In a linearly separable binary classification setting with randomly flipped labels, the authors derive error bounds and sufficient conditions under which retraining is beneficial.

The authors also conduct extensive experiments on datasets such as CIFAR-10, CIFAR-100, and AG News Subset (a language dataset). The results show that both full retraining and consensus-based retraining enhance model performance, with consensus-based retraining providing the most significant improvements.

update after rebuttal

The detailed response has resolved my concerns. Thus, I raise my score after the rebuttal.

Questions for Authors

  1. In scenarios where the initial model has low accuracy, how does the retraining process behave? Addressing this question could clarify the robustness of your method and whether it remains effective when the initial model is weak.
  2. Regarding the training details (Lines 1439–1440), why must the number of gradient steps and the initial learning rate be chosen based on the performance of the baseline method? Are the retraining methods particularly sensitive to these hyperparameters?
  3. What are the noise rates corresponding to different values of $\epsilon$?
  4. The authors explicitly state that the forward correction algorithm is applied in the initial training stage for the experiments in Table 5. However, what loss function is used in the initial training stage for the experiments in Tables 1, 2, 4, and 6? Is it the standard cross-entropy loss? Clarifying this would improve the reproducibility of the reported results.

Claims and Evidence

The paper supports its main claims with a combination of rigorous theoretical analysis and extensive empirical validation.

However, there are some aspects where the evidence is less complete:

  • The theoretical analysis focuses on full retraining under a uniform label noise model, while the consensus-based retraining, which empirically shows superior performance, lacks a corresponding theoretical analysis.
  • The experiments are conducted on moderate-scale datasets, so the scalability and generalizability of the approach to larger or more complex settings (e.g., experiments on the ImageNet dataset) remain to be further explored.

Methods and Evaluation Criteria

The proposed methods, namely full retraining and consensus-based retraining, make sense to tackle the challenges of learning with noisy labels and label differential privacy. Additionally, benchmark datasets like CIFAR-10, CIFAR-100, CIFAR-100N, and AG News Subset are widely recognized.

Theoretical Claims

Main theorems (Theorem 4.1, Theorem 4.2, Theorem 4.8, Theorem 4.9) are checked.

Experimental Designs or Analyses

The benchmark datasets are widely recognized in the community, and the experimental designs are reasonable. But some experimental settings remain unclear (see questions).

Supplementary Material

I have reviewed the problem setting part, proof part, experimental details part, and experiment on the real-world dataset (CIFAR-100N) part.

Relation to Broader Scientific Literature

The key contributions of the paper relate to two broader scientific literatures:

  • Learning with Noisy Labels: There is a lot of work on training models in the presence of noisy labels, which often involves robust loss functions or noise-correction techniques. The paper contributes to this literature by offering the first theoretical guarantees showing that full retraining with predicted hard labels can provably improve model accuracy under uniform label noise.
  • Label Differential Privacy (DP): In the context of privacy-preserving machine learning, label DP has emerged as an important concept. Prior works have proposed various noise-injection mechanisms (such as randomized response) to ensure privacy for sensitive label information. This paper shows that retraining methods (full retraining and consensus-based retraining) can enhance the model's performance without additional privacy costs.

Essential References Not Discussed

All essential related works are cited or discussed in the paper.

Other Strengths and Weaknesses

This paper investigates the benefits of retraining a model using its own predicted hard labels for label differential privacy (DP) and provides theoretical analysis. However, it has several limitations.

First, the theoretical analysis is confined to binary classification using linear models. Consequently, the derived results and error bounds are limited in scope and may not extend to practical scenarios where many tasks involve multiclass classification and complex nonlinear models. In real-world applications, sufficiently powerful nonlinear models can potentially memorize all the noisy labels. As a result, the model's predicted hard labels would simply replicate the noisy labels, rendering full retraining ineffective. This limitation suggests that while the theoretical contributions are valuable for understanding retraining in controlled settings, their applicability to more complex, realistic models remains questionable.

Second, although consensus-based retraining shows superior performance empirically, the paper does not provide a corresponding theoretical framework to analyze its behavior or guarantees.

Third, the effectiveness of retraining is heavily dependent on the accuracy of the initial model's predictions. In scenarios where the initial model performs poorly, the retraining process might not yield significant improvements.

Other Comments or Suggestions

The meaning of $\epsilon$ is not explained in the introduction, yet it appears in both the abstract and the conclusion of the introduction. Readers who are not familiar with Label Differential Privacy (DP) may be confused. It would be beneficial to provide an intuitive explanation of $\epsilon$ in the introduction to enhance clarity and accessibility.

Author Response

Thanks for your review and questions! We address your concerns below.

Other Strengths And Weaknesses:

1. "First, the theoretical analysis is…remains questionable.":

  • We agree that our analysis on linear models for binary classification will not fully explain what happens in the case of non-linear models for multi-class classification, and we don’t intend to oversell the scope of our theoretical results. But we believe it is valuable as a first step; after all, ours is the first work to analyze retraining with hard labels in any setting. Moreover, we believe that some of our proof ideas could be useful even in the analysis of non-linear models. For instance, the proof technique of constructing dummy predicted labels that match the actual predicted labels with high probability (see lines 307-319 left column) should be useful in general, because the issue of dependence of each predicted label on the entire training set is universal regardless of the model type.

  • Regarding your point about complex models perfectly fitting noisy labels, we completely agree. And that is why, for such expressive models, it is important to apply (both in theory and practice) some kind of regularization when training them with noisy labels; e.g., $\ell_2$ regularization, early stopping, etc. Applying regularization is reasonable in scenarios such as label DP, where we already know that the labels will be noisy.

2. "Second, although consensus-based retraining…its behavior or guarantees." / first bullet point under Claims And Evidence: Agreed. We have admitted this limitation in Section 6, and plan to analyze consensus-based retraining in the future. Please note that the analysis of full retraining is itself pretty non-trivial (main technical challenges have been discussed after Thm. 4.8) and interesting in our opinion.

We do acknowledge the above two weaknesses. However, it is usually very difficult to perfectly align theoretical analysis with practical settings, and it is common to analyze simplified settings. So we believe these weaknesses do not fundamentally undermine the significance of our work.

3. "Third, the effectiveness of retraining…yield significant improvements." Indeed, retraining should intuitively only be beneficial when the initial model’s predictions are more accurate than the given (noisy) labels used to train the initial model. We have discussed/demonstrated this in several parts of the paper – Fig. 1 (see its caption), Tables 3 and 7 (these are on real data), and the comment on the range of $n$ after Remark 4.10 (specifically, regarding the lower bound on $n$). Moreover, in Appendix J & Table 10, we did an ablation study with and without a validation set. The initial model is naturally weaker w/o a validation set (due to overfitting); despite this, retraining is still beneficial but the gains are less than those with a val set. This observation is not surprising.

Second bullet point under Claims And Evidence: As mentioned in Section 6, performing larger experiments is left for future work. While we didn't have time to train on ImageNet from scratch during the rebuttal period, we ran experiments on the DomainNet dataset (available in TensorFlow), which has 345 classes and is much larger than CIFAR. We did linear probing (due to lack of time) with features extracted from a ResNet-50 pretrained on ImageNet. The setup is similar to our full fine-tuning experiments. DomainNet results:

| $\epsilon$ | Baseline | Full RT | Consensus-based RT |
| --- | --- | --- | --- |
| 3 | $23.60 \pm 0.92$ | $29.23 \pm 1.03$ | $\mathbf{36.30} \pm 0.75$ |
| 4 | $48.25 \pm 0.05$ | $52.10 \pm 0.10$ | $\mathbf{57.40} \pm 0.20$ |

So even here RT (especially, consensus RT) yields large gains.

Questions For Authors:

1. Please see the response to weakness 3 above (especially the last two sentences about the ablation).

2. They need not be chosen based on the baseline’s performance. We did this to avoid any further hyper-parameter tuning based on retraining – to demonstrate that retraining is not very sensitive to hyper-parameters. If one were to optimize the hyper-parameters based on retraining’s performance as well, the gains would only increase.

3. If randomized response (RR) is used as the baseline, then with $C$ classes and for $\epsilon$-label DP, each sample receives its true label $y$ w.p. $\frac{e^{\epsilon}}{e^{\epsilon} + C-1}$ and some other label $y'$ w.p. $\frac{1}{e^{\epsilon} + C-1}$ for all $y' \neq y$ (this has been explained in lines 177-181 left column). If the method of Ghazi et al. (2021) is used, then their first stage is RR (so the same as before), but the noise level of subsequent stages depends on the performance of the previous stage's model.
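
For concreteness, here is a minimal sketch of randomized response for $\epsilon$-label DP with $C$ classes (an illustration, not the code used in the paper):

```python
import numpy as np

def randomized_response(y_true, num_classes, epsilon, seed=0):
    # Keep the true label w.p. e^eps / (e^eps + C - 1); otherwise output
    # one of the other C - 1 labels uniformly at random.
    rng = np.random.default_rng(seed)
    C = num_classes
    p_keep = np.exp(epsilon) / (np.exp(epsilon) + C - 1)
    flip = rng.random(len(y_true)) > p_keep
    offsets = rng.integers(1, C, size=flip.sum())   # shift by 1..C-1 to avoid the true label
    y_noisy = y_true.copy()
    y_noisy[flip] = (y_true[flip] + offsets) % C
    return y_noisy

labels = np.arange(10) % 5                           # hypothetical labels from 5 classes
print(randomized_response(labels, num_classes=5, epsilon=3.0))
# e.g., CIFAR-100 (C = 100) at eps = 3: each label survives w.p. about 0.17
print(np.exp(3) / (np.exp(3) + 99))
```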

4. Standard cross-entropy loss was used; we’ll mention this in the next version. Thanks for pointing this out!

We hope to have resolved your concerns and we're happy to discuss further. If you’re satisfied, we sincerely hope you will raise your score!

Reviewer Comment

Thanks for your detailed rebuttal and extra experiments. My concerns have been resolved, so I have decided to raise my recommendation score.

Author Comment

Thanks for raising your score! We’ll add the extra experiments (and important clarifications from the rebuttal) in the next version.

Review
Score: 4

The authors theoretically analyze retraining in a linearly separable binary classification problem and show that it can improve model accuracy relative to the initial training in the presence of label noise. They show that retraining is particularly helpful at high levels of label noise. The paper then empirically shows that the proposed consensus-based retraining works better than plain full retraining.

Update after rebuttal

After reading all the reviews carefully and considering the additional effort made by the authors, I decided to raise my score from 3 to 4. I think this is an excellent paper.

Questions for Authors

  • How do you relate the retraining technique to the problem of memorization of noisy samples? That is a well-known problem in the noisy-labels literature, and I am afraid that retraining could worsen the memorization effect. Can you provide an empirical analysis of the memorization effect when using retraining? [a,b,c]

  • What happens when we increase the number of gradient steps? Does the gap between the accuracy achieved with and without retraining decrease? Is there a point at which, if we train the model for X steps, retraining lowers the accuracy? Maybe this would be an interesting ablation study.

  • You use the baseline in Ghazi et al. 2021. Which objective function do you use to train your neural networks? I assume you use the cross-entropy. However, you did not study how the performance of your algorithm would change by changing the baseline or the objective function. This could raise questions about the general validity of retraining. Could you study the performance with these changes?

[a] Arpit, D., Jastrzębski, S., Ballas, N., Krueger, D., Bengio, E., Kanwal, M. S., ... & Lacoste-Julien, S. (2017, July). A closer look at memorization in deep networks. In International conference on machine learning (pp. 233-242). PMLR.

[b] Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. ICLR 2017.

[c] Liu, S., Niles-Weed, J., Razavian, N., & Fernandez-Granda, C. (2020). Early-learning regularization prevents memorization of noisy labels. Advances in neural information processing systems, 33, 20331-20342.

Claims and Evidence

The claims are almost all clear and convincing.

  • The main claim for which the clarity could be improved is the specification (especially in the abstract) that they theoretically analyze a binary classification problem.
  • In line 123 you claim that your work is on the fully supervised setting. Isn't the label noise scenario considered weakly-supervised?

Methods and Evaluation Criteria

The proposed evaluation criteria make sense for the problem considered.

Theoretical Claims

I checked the theoretical claims superficially and they seem correct and well written.

Experimental Designs or Analyses

I would have preferred to see a comparison with other algorithms that perform classification with label noise, but I only see a minor result on the combination of retraining and forward correction. I think that a wider comparison would help in understanding whether the contribution of this paper is mainly theoretical or whether there is also a possible advancement for state-of-the-art techniques. I don't understand why the authors did not share the code. This raises concerns about the reproducibility of their results.

Supplementary Material

The code is not provided. The appendix is well written.

Relation to Broader Scientific Literature

The paper's contributions are incremental, as the retraining technique is well known. However, the theoretical analysis is interesting and novel in my opinion.

Essential References Not Discussed

I am not aware of important related work that is not cited in the paper.

Other Strengths and Weaknesses

Strengths

  • The paper is well written and even though it is theoretically heavy, it can be easily read by non-experts
  • The related work section is very useful
  • The experimental results reinforce the theoretical claims

Weaknesses

  • No code
  • No comparison with other techniques for classification with label noise (apart from Forward correction)
  • No theoretical analysis or comments for the multi-class classification problem

Other Comments or Suggestions

  1. In line 165, at the beginning of p. 4, I would prefer the authors to use $\cdot$ instead of ".".
Author Response

Thanks for your review and great questions! We address your questions/concerns below.

Claims And Evidence:

  1. We will clarify "binary" in the abstract.
  2. Here we simply meant a setting where we have labels for all samples - to distinguish it from the setting of self-training where we are not given labels for all the samples. We’ll clarify this.

Experimental Designs Or Analyses / first two weaknesses:

  • Regarding comparisons with other noise-robust methods, please note that we are not claiming retraining is a SOTA general-purpose noise-robust method (see lines 100-103 left column). We are just advocating it as a straightforward post-processing step that can be applied on top of vanilla training or a noise-robust training method. In case it wasn’t clear, Table 5 shows results wherein initial training (baseline) was done with forward correction applied to the method of Ghazi et al. 2021, and retraining was done on top of this. Please also see our response to your third question (under Questions For Authors) below, where we show that retraining is beneficial as a post-processing step even when using a noise-robust loss function instead of the usual cross-entropy loss. Moreover, it’s not straightforward to apply many existing noise-robust methods to sophisticated label DP mechanisms (such as Ghazi et al.); retraining is very easy to apply in contrast.

  • We didn’t release the code because at the time of submission, we didn’t obtain our organization's approval to release it. We weren't sure if code can be shared in the rebuttal because the email on rebuttal instructions didn't mention anything about code. We will release the code upon paper acceptance.

No theoretical analysis or comments for the multi-class case (Weakness 3): Extending our analysis to the multi-class case is left for future work. Here is a starting point: in the case of $C$ classes, the labels $y_i$'s will be $C$-dimensional one-hot vectors, the ground truth $\Theta^{\ast}$ will be a $C \times d$ matrix (features $x_i$'s are still $d$-dimensional vectors, but $y_i$'s need to be defined appropriately in terms of $\Theta^{\ast}$ and the $x_i$'s) and our predictor $\hat{\Theta} = \frac{1}{n} \sum_i y_i x_i^T$ will also be in $\mathbb{R}^{C \times d}$.
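
For instance, with this predictor, a natural multi-class analogue of the sign rule (a sketch of the extension, to be verified in a full analysis) would be

$$\hat{\Theta} = \frac{1}{n} \sum_{i=1}^{n} y_i x_i^{T} \in \mathbb{R}^{C \times d}, \qquad \hat{y}(x) = \arg\max_{c \in \{1, \dots, C\}} \big(\hat{\Theta} x\big)_c,$$

with retraining defined analogously using the predicted one-hot labels.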

Questions For Authors:

1. Yes, memorization of noisy labels with powerful models is an issue. And if initial training is done naively, retraining may exacerbate this issue. That is why in almost all our experiments (except in Appendix J), we assume access to a clean validation set; please also see footnote 5 for the practical version of this assumption. This prevents the model from heavily memorizing. Moreover, as we show in Tables 3 & 7, the accuracy of the predicted (= given) labels on the consensus set is much higher than the accuracy of both the predicted and given labels on the full set. This shows that regulated initial training is effective at avoiding memorization. Further, as shown in Appendix J, even in the absence of a validation set, retraining is still beneficial but the gains are less – this is expected because the initial model’s performance is degraded due to more memorization/overfitting here.

2. Indeed, the benefit of retraining decreases when initial training is done for a larger number of steps. We studied this in Appendix J – here we don’t have a validation set and trained blindly for 100 epochs. Due to more overfitting here, the gains of retraining are lower than the corresponding gains with a validation set where we stopped at 40 epochs. If we train for even longer, the initial model will heavily memorize the noisy labels and this will probably render retraining ineffective.

3. Yes, we used the cross-entropy (CE) loss; we’ll state this in the next version. Our baseline for AG News is actually randomized response (see lines 425-426 left column) to demonstrate the generality of retraining w.r.t. the baseline. Further, in Table 5, our baseline is forward correction applied to the method of Ghazi et al. 2021. So we do have results with other baselines. And based on your suggestion, we performed experiments with the noise-robust symmetric CE loss function proposed in [1] (1k+ citations) instead of the vanilla CE loss. In their loss (eq. 7 of [1]), we set $\alpha=0.8$ and $\beta=0.2$. Here are the results for CIFAR-100 w/ ResNet-34.

| $\epsilon$ | Baseline | Full RT | Consensus-based RT |
| --- | --- | --- | --- |
| 4 | $37.07 \pm 2.03$ | $38.17 \pm 2.03$ | $\mathbf{43.20} \pm 1.77$ |
| 5 | $53.10 \pm 0.54$ | $53.40 \pm 0.33$ | $\mathbf{56.13} \pm 0.25$ |

Thus, consensus RT yields meaningful gains even with the loss function of [1].
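
For reference, here is a minimal PyTorch-style sketch of a symmetric CE loss in the spirit of eq. 7 of [1] (an illustration only; the clamping constants are implementation details and may differ from the official code of [1]):

```python
import torch
import torch.nn.functional as F

def symmetric_ce(logits, labels, num_classes, alpha=0.8, beta=0.2):
    # alpha * CE + beta * reverse CE; the one-hot targets are clamped
    # before the log so that log(0) becomes a finite constant.
    ce = F.cross_entropy(logits, labels)
    probs = F.softmax(logits, dim=1).clamp(min=1e-7, max=1.0)
    one_hot = F.one_hot(labels, num_classes).float().clamp(min=1e-4, max=1.0)
    rce = (-(probs * torch.log(one_hot)).sum(dim=1)).mean()
    return alpha * ce + beta * rce

# usage sketch with random tensors
logits = torch.randn(8, 100)
labels = torch.randint(0, 100, (8,))
print(symmetric_ce(logits, labels, num_classes=100))
```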

We hope to have resolved your concerns and we are happy to discuss further. If you’re satisfied with our answers, we sincerely hope you will raise your score!


[1]: Wang, Yisen, et al. "Symmetric cross entropy for robust learning with noisy labels." ICCV 2019.

Reviewer Comment

Thank you for the answers. I will keep my score as it is.

Author Comment

Thanks for your reply.

We are adding some new results on a bigger dataset to show that retraining is effective when applied on top of label noise correcting methods. Specifically, we show results when the baseline is forward correction and backward correction (from Patrini et al. 2017, cited in the paper) applied to the first stage of Ghazi et al. 2021 (similar to Table 5 in the paper); these results are in (A) and (B) below, respectively. For comparison, in (C) below we also show results when the baseline is just Ghazi et al. 2021 (i.e., no correction is applied). These results are on the DomainNet dataset (available in TensorFlow), which has 345 classes and is much larger than CIFAR. We did linear probing (using cross-entropy loss) with features extracted from a ResNet-50 pretrained on ImageNet. The setup is similar to our full fine-tuning experiments.

(A) Baseline = Forward Correction (Patrini et al. 2017) + Ghazi et al. 2021:

| $\epsilon$ | Baseline | Full RT | Consensus-based RT |
| --- | --- | --- | --- |
| 3 | $31.23 \pm 0.56$ | $33.30 \pm 0.65$ | $\mathbf{36.07} \pm 0.78$ |
| 4 | $58.50 \pm 0.08$ | $58.63 \pm 0.12$ | $\mathbf{61.80} \pm 0.08$ |

(B) Baseline = Backward Correction (Patrini et al. 2017) + Ghazi et al. 2021:

| $\epsilon$ | Baseline | Full RT | Consensus-based RT |
| --- | --- | --- | --- |
| 3 | $30.17 \pm 0.61$ | $31.47 \pm 0.74$ | $\mathbf{35.03} \pm 0.78$ |
| 4 | $56.63 \pm 0.37$ | $56.80 \pm 0.37$ | $\mathbf{60.47} \pm 0.46$ |

(C) Baseline = Ghazi et al. 2021 (no correction):

| $\epsilon$ | Baseline | Full RT | Consensus-based RT |
| --- | --- | --- | --- |
| 3 | $23.60 \pm 0.92$ | $29.23 \pm 1.03$ | $\mathbf{36.30} \pm 0.75$ |
| 4 | $48.25 \pm 0.05$ | $52.10 \pm 0.10$ | $\mathbf{57.40} \pm 0.20$ |

As expected, forward and backward correction lead to better initial model performance (compared to no correction). The main thing to note however is that consensus-based RT yields significant gains even with forward and backward correction, consistent with our earlier results. Thus, consensus-based RT is a very effective post-processing step for improving learning with noisy labels. (It is worth noting that for $\epsilon=3$, consensus-based RT leads to similar accuracy with and without noise correction.)

We hope you will take these extra results into consideration.

Review
Score: 4

The paper gives a theoretical treatment of when learning with predicted hard labels is more beneficial than learning with the original noisy labels.

Questions for Authors

  1. line 072, Figure 1: what's the classifier used to get the result? (linear, MLP, or the one in Eqn. 4.3/4.8?)

  2. line 217: why is "u" needed? It seems a bit redundant and does not seem to play a key role.

  3. Eqn. 4.3 & 4.8: my biggest concern is why the classifier takes the form $\hat{\theta} = \frac{1}{n} \sum_{i=1}^n y_i x_i$; it does not correspond to any standard classifier. Also, what is the multi-class version of it?

  4. Theorem 4.6 (minimax lower bound): I am aware of three (non-parametric) lower bounds in [1-3], so I would like to know the position of this lower bound in the literature.

  5. Theorem 4.6 also applies to the "retraining classifier" $\theta_1$ in Eqn. 4.8; therefore, predicted hard labels do not provide a gain in terms of rate/sample complexity. Then is the benefit of using the hard labels only in terms of the constants?

[1] T Tony Cai and Hongji Wei. Transfer learning for nonparametric classification: Minimax rate and adaptive classifier. The Annals of Statistics, 49(1):100–128, 2021.

[2] Hyungki Im and Paul Grigas. Binary classification with instance and label dependent label noise. arXiv preprint arXiv:2306.03402, 2023.

[3] Yilun Zhu, Jianxin Zhang, Aditya Gangrade, and Clayton Scott. Label Noise: Ignorance Is Bliss. In Advances in Neural Information Processing Systems, 2024

Claims and Evidence

Yes, the claims were proved.

Methods and Evaluation Criteria

Overall makes sense to me, though I am not quite sure why the "label DP" setup is considered; it seems to me that a standard label noise setup would suffice.

Theoretical Claims

I've skimmed through the proofs, but have not checked the details.

Experimental Designs or Analyses

Yes.

Supplementary Material

I've skimmed through the proofs and read the experimental setups.

Relation to Broader Scientific Literature

The benefit of using predicted hard labels has been studied empirically; the theoretical treatment is new.

Essential References Not Discussed

Essential references are included.

(Optional) There are some related supplementary references; see below.

Other Strengths and Weaknesses

Overall I enjoyed reading the paper. The theoretical treatment of hard labels is new to me, and I think it is a good contribution to the literature. I have some concerns at this point; I will be happy to read the authors' comments and re-assess my review.

My biggest theoretical concerns are:

  1. The form of the classifier considered: $\hat{\theta} = \frac{1}{n} \sum_{i=1}^n y_i x_i$; it does not correspond to any standard classifier. (At first glance, I would expect ERM or logistic regression type.)

I look forward to seeing an experimental setup that is more aligned with the theoretical setting:

  1. A data simulation that corresponds exactly to the theoretical setting, e.g., a 2-dimensional mixture of two Gaussians (using the exact form of the classifier from the theory).
  2. I think it is also possible to align the CIFAR experiments with the theoretical setting, e.g., use a pretrained NN to extract features, then apply a linear classifier on top of them (aka "linear probing" in self-supervised learning).

These should provide stronger evidence to the theory.

Other Comments or Suggestions

  1. Notation: in the label noise literature, people usually use $\tilde{y}$ to denote the noisy label, rather than $\hat{y}$; $\hat{y}$ is usually used to denote the label predicted by the classifier. This is a bit confusing.
  2. line 197: "perfect separable setting" is used, but I don't think the setting is separable, because a Gaussian has infinite support, so the positive and negative classes overlap. Do you mean that the Bayes decision boundary is linear?
  3. Eqn. 4.4: the notation $P(\mathrm{sign}(\langle x, \theta\rangle) \neq y)$ is overloaded; here it integrates over $x$, while in Theorem 4.1 it is conditioned on $x$.
Author Response

Thanks for the review and great questions!

(A) Label DP setting. We focused on this because it’s not clear how to apply existing noise-robust techniques on top of existing label DP mechanisms, while retraining is a simple post-processing step. For example, as mentioned in lines 365-366 (right column), it’s not obvious how to apply forward correction to the second stage of Ghazi et al. 2021.

(B) Form of classifier $\hat{\theta} = \sum_i y_i x_i$. This is a simplification of the least squares solution (LSS), obtained by removing the empirical covariance matrix’s inverse. The way to analyze the LSS would be to bound the deviation of the empirical covariance matrix from the population covariance matrix (which shrinks as $n \to \infty$), and then analyze with the features pre-multiplied by the covariance matrix’s inverse. This would just make the math more tedious without adding any meaningful insights. Also, as we wrote around eq. 4.3, our classifier is similar to kernel methods with the inner product kernel, and it has been used in the highly-cited work of Carmon et al. 2019.
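
To illustrate this classifier and full retraining concretely, here is a toy simulation sketch (not from the paper) using a plain GMM $x = y\mu + z$ with $z \sim \mathcal{N}(0, I_d)$ and uniform label flips, a simplification of the modified GMM in eq. 4.1; all constants are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, p = 100, 2000, 0.4                 # dimension, samples, label-flip probability
mu = 2.0 * np.ones(d) / np.sqrt(d)       # arbitrary mean direction with ||mu|| = 2

# Plain GMM data (simplification of eq. 4.1) and uniformly flipped labels
y = rng.choice([-1, 1], size=n)
X = y[:, None] * mu + rng.standard_normal((n, d))
y_noisy = y * np.where(rng.random(n) < p, -1, 1)

theta_init = (y_noisy[:, None] * X).mean(axis=0)   # eq. 4.3-style averaging classifier
y_pred = np.sign(X @ theta_init)                   # predicted hard labels
theta_rt = (y_pred[:, None] * X).mean(axis=0)      # eq. 4.8-style full retraining

# Estimate population accuracy on fresh samples
y_test = rng.choice([-1, 1], size=50000)
X_test = y_test[:, None] * mu + rng.standard_normal((50000, d))
for name, th in [("initial", theta_init), ("retrained", theta_rt)]:
    print(name, (np.sign(X_test @ th) == y_test).mean())
```

In this regime (high label noise, $n \gg d$), the retrained classifier is typically better aligned with $\mu$ and hence more accurate, in line with the theory.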

(C) Experimental setup more aligned with theory setting

  1. Setting of Fig. 1 corresponds to our theory setting; see Appendix A.

  2. We did linear probing (LP) for CIFAR-100 with features extracted from a ResNet-50 pretrained on ImageNet. The setup is similar to our full fine-tuning (FT) experiments; we omit details here due to lack of space. Results:

| $\epsilon$ | Baseline | Full RT | Consensus-based RT |
| --- | --- | --- | --- |
| 3 | $55.26 \pm 0.19$ | $60.97 \pm 0.21$ | $\mathbf{63.37} \pm 0.26$ |
| 4 | $64.83 \pm 0.39$ | $66.67 \pm 0.33$ | $\mathbf{67.83} \pm 0.37$ |

So even here RT (especially, consensus RT) yields good gains. Note that LP performs much better than full FT; this is often the case when training with noise due to less overfitting with LP.

(D) Line 197: perfectly separable setting. Our modified GMM setting (eq. 4.1) is separable. As discussed in lines 172-176 (right column), $\theta^{*} = \mu$ separates the data perfectly.

We’ll fix/clarify notational ambiguities.

(E) Questions for Authors

1. The ones in eqs. 4.3 & 4.8. Also see Appendix A.

2. We introduced $u$ so that the margin of a data point $x$ along $\mu$ is not the same for all points. As explained in lines 172-175 (right column), $|\langle x, \mu\rangle| = (1+u)\|\mu\|^2 \geq \|\mu\|^2$. If there were no $u$, all the points would have the same margin. We agree that from the analysis perspective, $u$ is not very important.

3. For the binary case, see (B) above. Even in the multi-class case, something like this has been studied in reference [A] (see Section 2.2). Specifically, for $C$ classes, the labels $y_i$'s will be $C$-dimensional one-hot vectors, the ground truth $\Theta^{\ast}$ will be a $C \times d$ matrix (features $x_i$'s are still $d$-dimensional, but $y_i$'s need to be defined appropriately in terms of $\Theta^{\ast}$ and the $x_i$'s), and our predictor $\hat{\Theta} = \frac{1}{n} \sum_i y_i x_i^T$ will also be in $\mathbb{R}^{C \times d}$.

4. These lower bounds are in much more general settings than ours and so they are weaker than ours. In [1], our setting corresponds to $n_p = 0, n_q = n$. Per Definition 2 of [1], $\beta \leq 1$, and as per the paragraph after Remark 3, $\alpha \beta \leq d$. Now as per Thm 3.2, the lower bound on the error is effectively $n^{-O(\frac{1+\alpha}{d})}$. So when $\alpha \ll d$, this lower bound yields a much worse sample complexity than our result in Thm 4.6. In [2], the lower bound on the error (Thm 2) doesn’t reduce with $n$, so even if there are infinite samples, we can’t get 0 error in the worst case. As for [3], the lower bound on the error (Thm 1) also has a non-diminishing term depending on $\epsilon$. In the special case of $\epsilon = 0$ (or $\epsilon$ being small), there is a $n^{-1/2}$ dependence but no dependence on the dimension (or a related quantity). But their upper bound in Thm 2 does have a dependence on a VC dimension-like quantity as expected, so their lower bound is probably loose w.r.t. dimension.

5. Yes, Thm. 4.6 also applies to the retraining (RT) classifier. If you see Remark 4.10, the min. # of samples $n$ needed for RT to be better is more than $d/(1-2p)^2$, i.e., the lower bound of Thm. 4.6. And as discussed after Remark 4.10, this requirement on $n$ is probably tight (modulo log factors) because we can only hope the RT classifier to be better if the accuracy of the labels with which it is trained – namely, the initial model’s predicted labels – is more than $(1-p)$ (= accuracy of the noisy labels with which the initial model is trained); this requires at least $d/(1-2p)^2$ samples (per Thm. 4.6). So yes, RT can’t improve the sample complexity beyond $d/(1-2p)^2$.

We hope to have resolved your concerns and are happy to discuss further. If you’re satisfied, we hope you will raise your score!


[A]: "Theoretical Insights Into Multiclass Classification: A High-dimensional Asymptotic View", Thrampoulidis et al., 2020

Reviewer Comment

Thank you for the detailed response.

It would be nice to see a more comprehensive discussion of the lower bound in the next version of the paper (can be a formalized version of the response). Given the limited exploration of lower bounds in the existing label noise literature, I think it is a good addition.

Regarding the debate on "linear separability," my understanding is that distributions are separable if and only if their supports do not overlap. However, this difference in perspective is minor, and I am comfortable moving forward despite differing views.

Overall, the authors have adequately addressed my concerns, and I anticipate the next version of the manuscript will provide further clarity. I recommend acceptance (and have raised the score from 3 to 4).

Author Comment

Thank you for raising your score! Yes, we will add a discussion on the lower bounds in the next version and we agree, it’ll be a good addition. Thanks for pointing out these papers! We’ll also clarify what we mean by separability and add the extra experiments.

Final Decision

This is a case where the authors' response has resulted in two reviewers increasing their original scores, so the overall paper scores are now 4, 4, 3. The responses have addressed the concerns, and the reviewers have increased their scores with the condition that the authors will include the additional results (e.g., discussions and formalization of the lower bounds, and additional experimental results) in the camera-ready. I recommend acceptance.