Learning from Noisy Labels via Conditional Distributionally Robust Optimization

审稿意见

评分: 6置信度: 32024-07-12

This paper studies the problem of learning from noisy labels by using conditional distributionally robust optimization (CDRO) to estimate true label posterior. The authors formulate the problem as minimizing the worst-case risk within a distance-based ambiguity set centered around a reference distribution and derive upper bounds for the worst-case risk.

优点

Learning from noisy labels is an important and practical topic.
This paper provides rigorous theoretical analyses of the generalization bounds for the worst-case risk.
The authors offer a guideline for balancing robustness and model fitting by deriving the optimal value for the Lagrangian multiplier.

缺点

In section 3.1, the theoretical analyses are only conducted in the case of binary classification while multi-class classification is more common in practice.
In lines 192-193 and 198, the authors make assumptions about the function $\mathcal{T}$ , however, the convexity or the concavity of $\mathcal{T}$ may be too strict.
According to Table 2, the performance improvement of the proposed method is not significant across four real datasets.

问题

Can the theoretical analyses be extended to the multi-classification case?
In Algorithm 1 line 1, does the procedure of "Warm up classifiers $\psi^{(1)}$ and $\psi^{(2)}$ " affect the generation of pseudo-empirical distribution? What if the warm-up is not good enough for pseudo-empirical distribution?
The intuition and rationale of using the CDRO framework to improve the performance of learning from noisy labels are not clear, please explain them.

局限性

No, the authors do not mention the limitations of their work.

作者回复

2024-08-07

Thank you for reviewing our manuscript. We appreciate your thoughtful comments and suggestions. We will carefully incorporate the necessary revisions to address your feedback in the new version of the manuscript. Below, we highlight our responses to each of your comments.

Notably, building on our initial development on binary classes, we have now extended our theoretical results to the multi-class scenario. This extension provides a comprehensive understanding of our approach's applicability across different classification settings.
- We extend the optimal action result in Theorem 3.1 to the multi-class scenario by identifying the extreme points of the corresponding linear programming problem.
- Theorem 3.2 from the initial submission extends naturally to the multi-class scenario using analogous proof techniques as those employed for the binary case.
- For brevity, we have omitted the detailed multi-class results but are prepared to include the extended proofs in the revised manuscript. The specific results for Theorems 3.1 and 3.2 can be found in the Official Comment.
Our theoretical results are based on the assumption that the function $\mathcal{T}$ is convex or concave with respect to its arguments, rather than the network parameters or the input data. In the context of our experiments, we utilize the cross-entropy loss function $\ell$ . This results in $\ell$ being the logarithmic function, which is known to be concave in its argument. Therefore, our theoretical framework is aligned with the properties of the functions used in our experiments.
The discrepancy may arise because real datasets often have more complex noise generation processes compared to the simplified models used for noise estimation. To address this, we have employed advanced methods for estimating noise transition matrices to better reflect the intricate noise patterns observed in real data. The results using these improved transition matrix estimation techniques are detailed in Table 4 of the attached PDF.
Answers to the questions
- (i) Yes, our theoretical analyses can indeed be extended to multi-class classification scenarios, as previously discussed. In response to your feedback, we will refine the presentation in our revision to explicitly address and integrate these multi-class scenarios, which will ensure a clearer and more comprehensive treatment of the subject.
- (ii) About the warm up stage
  - Thank you for this thoughtful question. In Algorithm 1, the procedure of ``warming up classifiers $\psi^{(1)}$ and $\psi^{(2)}$ " is intended to stabilize the classifiers before they generate pseudo-empirical distributions. In our experiments, we follow established practices [E] and use 30 warm-up epochs, selecting the model with the highest validation accuracy during this phase.
  - Our results, shown in Figure 1 of the attached PDF, indicate that the model tends to overfit, especially under higher noise rates. This observation suggests that the results presented in our paper are not derived under the most optimal warm-up conditions, which may affect the quality of the pseudo-empirical distributions.
  - To rigorously assess the impact of the warm-up stage, we have now conducted experiments with varying numbers of warm-up epochs (10, 20, 30, 40) on both our method and baseline approaches that also rely on warm-up. These results are presented in Figure 2 of the attached PDF, demonstrating the effects of different warm-up durations on performance.
  - After the warm-up phase, classifiers $\psi^{(1)}$ and $\psi^{(2)}$ continue to be updated using the proposed method (as outlined in Line 8 of Algorithm 1), and the pseudo-empirical distribution is constructed using these updated classifiers. This approach ensures that the classifiers are continuously refined, which helps in mitigating any initial limitations from the warm-up stage and improves the overall reliability of the pseudo-empirical distributions.\
- (iii) About the intuition and rationale of using the CDRO framework
  - The CDRO framework addresses the challenge of noisy labels by focusing on model robustness in the presence of potential misspecifications. Many existing methods estimate the true label posterior $P^*_{y|x,\tilde{y}}$ using a model denoted $P_{y|x,\tilde{y}}$ . However, this estimated posterior can deviate from the true distribution, known as model misspecification [H].
  - By Bayes’s theorem, the true label posterior is proportional to the noise transition probability $P^*_{\tilde{y}|x, y}$ . Existing work [F-G] shows that with large sample sizes, the estimated noise transition matrix converges to the true $P^*_{\tilde{y}|x, y}$ asymptotically, and the Wasserstein distance between them converges to zero. Therefore, we use a Wasserstein ball $\Gamma_\epsilon(P_{\tilde{y}|x, y})$ centered around the estimated posterior (referred to as the ``reference probability distribution") to measure and mitigate misspecification, as defined in Equation (2) in our paper.
  - In this context, the Wasserstein distance is employed to measure the misspecification of estimated true label posterior. When the sample size $n$ is sufficiently large, the true distribution $P^*_{\tilde{y}|x, y}$ lies within $\Gamma_\epsilon(P_{y|x,\tilde{y}})$ . Thus, by minimizing the worst-case risk over $\Gamma_\epsilon(P_{y|x,\tilde{y}})$ , we can minimize an upper bound of the risk based on the underlying true label posterior.
  - Our approach does not aim to precisely estimate the noise transition matrix or the true posterior but instead focuses on minimizing the worst-case risk over this Wasserstein ball. This method robustly trains the classifier despite potential misspecifications, which serves as a valuable complementary approach to traditional methods that directly apply the estimated posterior.

2024-08-12

Thanks for your responses. I am still wondering if the warm-up is not good enough for generating pseudo-empirical distribution, will the proposed method work robustly? Or will the warm-up guarantee a good pseudo-empirical distribution?

评论- Regarding the warm-up phase in generating a pseudo-empirical distribution

2024-08-13

Thank you for highlighting the concerns regarding the warm-up phase in generating a pseudo-empirical distribution. We appreciate your continued feedback.

The warm-up alone cannot guarantee a good pseudo-empirical distribution. However, our algorithm is inherently designed to accommodate inaccurate pseudo-empirical distribution in the warm-up phase. Therefore, even if the warm-up is not good enough, our method can still work robustly, as shown in Figure 2 of the attached PDF.

Below, we provide a detailed explanation of why our approach remains effective, even when the warm-up models are not good enough.

The classifiers $\psi^{(1)}$ and $\psi^{(2)}$ are continuously updated as the algorithm proceeds, and the pseudo-empirical distribution is then constructed based on these updated classifiers. Therefore, even if the warm-up is not good enough stage, as our algorithm keeps refining the classifier, it will help mitigate any initial bias from the warm-up stage and enhancing the overall reliability of the pseudo-empirical distributions.
In addition, the way we construct the pseudo-empirical distribution also enables a more robutst estimation. More speciically, we leverage the approximated true label posterior, $P_{y|x,\tilde{y}}$ , to assign robust pseudo labels and subsequently construct the pseudo-empirical distribution. Instead of simply assigning the label corresponding to the highest probability as the pseudo label [a][b], our approach takes into account both the highest and second-highest predicted probabilities. We assign a pseudo label only if the ratio of these probabilities exceeds a specified threshold. This strategy ensures that pseudo labels are given only to instances with high confidence, effectively filtering out uncertain data when the warm-up models may not be sufficiently accurate.

Here we also provide additional empirical results to demonstrate the robustness of our approach to the imperfect warm-up models. As shown in Plots (a), (c), and (e) of Figure 1 in the attached PDF, the model is underfitted with a 10-epoch warm-up phase and overfitted with a 40-epoch warm-up phase, especially at higher noise rates, suggesting that the warm-up models are not good enough. To assess the quality of the generated pseudo-empirical distributions in these scenarios, we present the average accuracies of the robust pseudo-labels selected using the proposed method during training in Tables 1-3. Specifically, as shown in Table 1, when the noise ratio is high and the model is warmed up for only 10 epochs, the initial average accuracy of the robust pseudo-labels is $89.41_{\pm 3.66}$ , reflecting a less reliable warm-up model. However, accuracy increases to $97.72_{\pm 2.57}$ by the final epoch, demonstrating the robustness of the proposed method in constructing the pseudo-empirical distribution.

Thank you once again for your response. In preparing a revised manuscript, we plan to include detailed comments that address the issue of the warm-up phase to provide greater clarity. We hope our replies have sufficiently addressed all the concerns raised. Should you require any additional details, we would be happy to provide them. We are open to further discussions and ready to clarify any remaining questions or concerns you may have.

References

[a] Tanaka, Daiki, et al. "Joint optimization framework for learning with noisy labels." CVPR (2018).

[b] Han, Jiangfan, Ping Luo, and Xiaogang Wang. "Deep self-learning from noisy labels." ICCV (2019).

评论- Tables 1-3

2024-08-13

Table 1: Average accuracy of the robust pseudo-labels on CIFAR-10 (IDN-HIGH) with varying number of warm-up epochs in the training process.

	epoch 10	epoch 20	epoch 30	epoch 40	epoch 50	epoch 60	epoch 70	epoch 80	epoch 90	epoch 100	epoch 110	epoch 120
warm up 10 epochs	$89.41_{\pm 3.66}$	$93.05_{\pm 3.82}$	$93.08_{\pm 4.11}$	$93.95_{\pm 3.44}$	$93.65_{\pm 2.60}$	$95.16_{\pm 1.52}$	$95.83_{\pm 1.65}$	$96.32_{\pm 2.72}$	$96.18_{\pm 2.60}$	$96.54_{\pm 1.89}$	$96.96_{\pm 2.63}$	$97.72_{\pm 2.57}$
warm up 20 epochs	$95.08_{\pm 1.65}$	$95.85_{\pm 2.05}$	$97.32_{\pm 1.40}$	$97.32_{\pm 1.20}$	$96.54_{\pm 1.19}$	$97.58_{\pm 1.07}$	$97.40_{\pm 0.99}$	$97.66_{\pm 0.96}$	$97.93_{\pm 0.99}$	$97.97_{\pm 1.55}$	$97.92_{\pm 1.12}$	$98.15_{\pm 3.50}$
warm up 30 epochs	$96.23_{\pm 1.35}$	$95.61_{\pm 1.41}$	$96.61_{\pm 1.12}$	$96.67_{\pm 1.59}$	$97.38_{\pm 1.10}$	$96.62_{\pm 1.25}$	$97.40_{\pm 1.15}$	$97.56_{\pm 1.04}$	$97.64_{\pm 1.49}$	$97.63_{\pm 1.35}$	$97.73_{\pm 1.66}$	$97.67_{\pm 2.48}$
warm up 40 epochs	$95.81_{\pm 1.79}$	$96.10_{\pm 1.96}$	$96.80_{\pm 1.73}$	$96.62_{\pm 1.25}$	$96.87_{\pm 1.69}$	$97.01_{\pm 1.30}$	$96.96_{\pm 0.89}$	$97.57_{\pm 1.32}$	$97.60_{\pm 1.53}$	$97.51_{\pm 1.28}$	$97.72_{\pm 1.29}$	$98.00_{\pm 1.82}$

Table 2: Average accuracy of the robust pseudo-labels on CIFAR-10 (IDN-MID) with varying number of warm-up epochs in the training process.

	epoch 10	epoch 20	epoch 30	epoch 40	epoch 50	epoch 60	epoch 70	epoch 80	epoch 90	epoch 100	epoch 110	epoch 120
warm up 10 epochs	$94.18_{\pm 2.22}$	$94.45_{\pm 3.49}$	$95.21_{\pm 3.46}$	$94.99_{\pm 4.12}$	$95.53_{\pm 3.12}$	$96.36_{\pm 2.11}$	$96.26_{\pm 2.67}$	$96.83_{\pm 2.29}$	$97.28_{\pm 2.86}$	$97.49_{\pm 2.20}$	$97.47_{\pm 2.88}$	$98.18_{\pm 2.83}$
warm up 20 epochs	$96.81_{\pm 1.04}$	$97.33_{\pm 1.45}$	$97.12_{\pm 0.96}$	$97.73_{\pm 1.06}$	$98.18_{\pm 1.03}$	$97.99_{\pm 0.96}$	$98.24_{\pm 0.93}$	$98.38_{\pm 1.00}$	$98.32_{\pm 1.28}$	$98.43_{\pm 1.18}$	$98.63_{\pm 1.21}$	$98.55_{\pm 0.44}$
warm up 30 epochs	$97.01_{\pm 1.56}$	$96.91_{\pm 1.21}$	$97.35_{\pm 1.12}$	$97.49_{\pm 1.36}$	$97.63_{\pm 1.43}$	$97.55_{\pm 0.97}$	$97.65_{\pm 1.35}$	$97.77_{\pm 1.25}$	$98.00_{\pm 1.16}$	$97.84_{\pm 1.44}$	$98.29_{\pm 1.32}$	$98.57_{\pm 1.40}$
warm up 40 epochs	$96.42_{\pm 1.22}$	$97.18_{\pm 1.48}$	$97.23_{\pm 0.89}$	$97.43_{\pm 1.09}$	$97.22_{\pm 1.22}$	$97.61_{\pm 1.03}$	$97.87_{\pm 1.16}$	$97.74_{\pm 1.51}$	$97.84_{\pm 0.99}$	$98.34_{\pm 1.09}$	$97.89_{\pm 1.12}$	$98.51_{\pm 2.54}$

Table 3: Average accuracy of the robust pseudo-labels on CIFAR-10 (IDN-LOW) with varying number of warm-up epochs in the training process.

	epoch 10	epoch 20	epoch 30	epoch 40	epoch 50	epoch 60	epoch 70	epoch 80	epoch 90	epoch 100	epoch 110	epoch 120
warm up 10 epochs	$96.73_{\pm 1.29}$	$97.15_{\pm 1.34}$	$96.83_{\pm 1.43}$	$97.78_{\pm 1.24}$	$98.21_{\pm 0.98}$	$98.80_{\pm 0.67}$	$98.65_{\pm 0.63}$	$98.87_{\pm 0.55}$	$99.29_{\pm 0.83}$	$99.22_{\pm 0.83}$	$99.37_{\pm 1.12}$	$99.59_{\pm 0.09}$
warm up 20 epochs	$98.29_{\pm 0.72}$	$98.74_{\pm 0.71}$	$99.08_{\pm 0.79}$	$99.05_{\pm 0.58}$	$99.10_{\pm 0.43}$	$99.32_{\pm 0.90}$	$99.25_{\pm 0.62}$	$99.33_{\pm 0.48}$	$99.41_{\pm 0.75}$	$99.62_{\pm 0.92}$	$99.44_{\pm 1.00}$	$99.69_{\pm 0.46}$
warm up 30 epochs	$97.79_{\pm 1.31}$	$98.82_{\pm 0.86}$	$99.16_{\pm 0.56}$	$99.02_{\pm 0.79}$	$99.16_{\pm 0.80}$	$99.21_{\pm 0.62}$	$99.21_{\pm 0.73}$	$99.23_{\pm 0.88}$	$99.22_{\pm 1.00}$	$99.44_{\pm 0.80}$	$99.43_{\pm 0.83}$	$99.48_{\pm 0.31}$
warm up 40 epochs	$97.84_{\pm 1.28}$	$98.47_{\pm 1.16}$	$98.72_{\pm 0.70}$	$98.90_{\pm 0.67}$	$98.73_{\pm 0.85}$	$98.96_{\pm 0.76}$	$99.01_{\pm 0.68}$	$99.23_{\pm 0.93}$	$99.17_{\pm 1.13}$	$99.32_{\pm 0.93}$	$99.30_{\pm 1.11}$	$99.33_{\pm 0.34}$

2024-08-13

Thanks for your detailed explanations and the experimental results, which addressed my concerns about generating the pseudo-empirical distribution. I have thoroughly reviewed the comments from other reviewers and the corresponding responses. I find that most of my concerns have been addressed by the authors. I appreciate the efforts you have made in the rebuttal, and I decide to increase my score to 6.

2024-08-14

Thank you for thoroughly reviewing our explanations and the experimental results. We are glad that our responses have addressed your concerns regarding the generation of the pseudo-empirical distribution. We also appreciate your careful attention to the feedback from other reviewers. Your decision to raise the score to 6 is greatly valued, and we’re grateful for your support and constructive feedback throughout this process.

审稿意见

评分: 7置信度: 42024-07-13

This paper studied the issue of potential misspecification of estimated true label posterior in learning from noisy labels. To alleviate the impact of this issue, it formulated learning from crowds as a conditional distributionally robust optimization problem, where a robust pseudo-empirical distribution is used as a reference probability distribution. Experiments on multiple crowdsourcing datasets verified the effectiveness of the proposed method.

优点

The proposed methodology has a solid theoretical foundation.
The writing is very clear and easy to understand.
The performance of the proposed method is very promising in the experiments.

缺点

The focused problem, i.e., the potential misspecification of estimated true label posterior, has not been defined and measured. And it is not very clear to what extent the proposed method has solved the impact of this problem.
The contribution of this work is about "learning from noisy labels", while the context of this work seems limited to "learning from crowds". They are not consistent, since "learning from noisy labels" also includes the case with one annotator. However, this work didn't discuss and test on the case with one annotator.
Although the way to construct a robust pseudo-empirical distribution has a theoretical motivation, it seems similar to the pseudo-labeling methods [1,2] in learning with noisy labels. What are the differences and advantages of the proposed way and the pseudo-labeling methods?
As we know, in learning with noisy labels or learning from crowds, the estimation of noise transition probabilities is very important, while the way to approximate noise transition probabilities in this work is heuristics. Why not use those thoroughly studied methods, e.g., [3,4,5], or more advanced instance-dependent transition matrix estimation methods [6,7]？

[1] Joint optimization framework for learning with noisy labels. CVPR 2018

[2] Deep Self-Learning From Noisy Labels. ICCV 2019

[3] Deep learning from crowdsourced labels: Coupled cross-entropy minimization, identifiability, and regularization. ICLR 2023

[4] Learning from noisy labels by regularized estimation of annotator confusion. CVPR 2019

[5] Deep learning from crowds. AAAI 2018

[6] Label correction of crowdsourced noisy annotations with an instance-dependent noise transition model. NeurIPS 2023

[7] Transferring annotator- and instance-dependent transition matrix for learning from crowds. TPAMI 2024

问题

What {(·, ·)}^p means in Line 133?
There is a little abuse of notations. ψ means the classifier in Line 88, while it represents the predicted probabilities in Line 190.
How to address the annotation sparse problem when approximating noise transition probabilities?

局限性

See above.

作者回复

2024-08-07

Thank you for reviewing our manuscript. We appreciate your thoughtful comments and suggestions. We will carefully incorporate the necessary revisions to address your feedback in the new version of the manuscript. Below, we highlight our responses to each of your comments.

1. About the focused problem of our paper

In the problem of learning with noisy labels, many existing approaches use various algorithms to estimate the true label posterior $P^*\_{y|x,\tilde{x}}$ , typically with a model denoted as $P_{y|x,\tilde{y}}$ . However, the estimated posterior can deviate from the true underlying distribution, a phenomenon known as model misspecification [H].
By Bayes’s theorem, the true label posterior is proportional to the noise transition probability $P^*_{\tilde{y}|x, y}$ . Existing works [F-G] demonstrate that the estimated noise transition matrix asymptotically converges to the true $P^*_{\tilde{y}|x, y}$ under certain conditions. According to the Vitali convergence theorem [I], the associated $L_1$ Wasserstein distance converges to zero in this situation. Therefore, we consider a Wasserstein ball $\Gamma_\epsilon(P_{y|x,\tilde{y}})$ centered around the estimated true label posterior (referred to as the "reference probability distribution"), as defined in Equation (2) in our paper. In this context, the Wasserstein distance is used to measure the misspecification of estimated true label posterior. When the sample size $n$ is sufficiently large, the true distribution $P^*_{\tilde{y}|x, y}$ will lie within $\Gamma_\epsilon(P_{y|x,\tilde{y}})$ . Thus, by minimizing the worst-case risk over $\Gamma_\epsilon(P_{y|x,\tilde{y}})$ , we can effectively minimize an upper bound on the risk based on the true label posterior.
Note that our work does not aim to precisely estimate the noise transition matrix or alleviate the misspecification of estimated true label posteriors. Instead, we focus on robustly training a classifier despite misspecification by considering the worst-case risk within a Wasserstein ball. Therefore, our work complements existing methods that rely directly on the estimated true label posterior, and thus, providing a robust alternative that accounts for potential inaccuracies, as shown in Table 4 of the attached PDF.

2. Learning from noisy labels v.s. Learning from crowds

Thank you for highlighting this issue. The theoretical framework presented in our paper is applicable to both single-annotator ( $R=1$ ) and multiple-annotator ( $R>1$ ) scenarios. In our experiments, we generate a total of $R$ annotators and then randomly select one annotation per instance from these $R$ annotators. This approach underscores that our focus is on "learning from noisy labels" instead of "learning from crowds" as the latter implies a specific emphasis on aggregating multiple annotations per instance.
To thoroughly evaluate the scenario with a single annotator, we have now set $R=1$ and conducted additional experiments on both the CIFAR10 and CIFAR100 datasets. The results of these experiments are detailed in Table 2 of the attached PDF. We will incorporate these results into our paper to address this scenario comprehensively in the forthcoming revision.

3. About the pseudo-empirical distribution

The pseudo-labeling methods discussed in [1-2] rely solely on the highest predicted probability for each instance, assigning the label corresponding to this highest probability as the pseudo label. In contrast, our approach considers both the highest and the second-highest predicted probabilities. We assign a pseudo label only if the ratio of these probabilities exceeds a specified threshold. This strategy ensures that pseudo labels are assigned only to instances with high confidence, effectively filtering out uncertain data. The accuracies of the pseudo labels throughout the training process are detailed in Figures 3-4 in Appendix B.2 of our paper, which demonstrates the effectiveness of our robust approach.

4. About estimation of noise transition probabilities

Our work does not focus on precisely estimating the noise transition matrix; thus, we employ a straightforward estimation method in our experiments for simplicity. However, our approach is versatile and can be integrated with various methods for estimating the noise transition matrix or the true label posterior. We have conducted additional experiments using more advanced transition matrix estimation methods, and the results are presented in Table 4 of the attached PDF. As shown in Table 4, incorporating these advanced estimation methods significantly improves the test accuracies of our proposed approach, which highlights the robustness and adaptability of our method.

5. Answers to the questions

Thank you for pointing out the typo. It should indeed be ${c(\cdot,\cdot)}^p$ . We appreciate your attention to detail and will correct this error in the revised version of the paper.
Thank you for highlighting this. We will clarify this in the revision. Specifically, we use $\psi:\mathcal{X}\rightarrow \mathcal{S}^{K-1}$ to denote the predicted probabilities, where $\mathcal{S}^{K-1}$ represents the $(K-1)$ -dimensional simplex. Consequently, the classifier is defined as $\max_{j \in [K]} \psi(\mathbf{x})_j$ .
In our experiments, we generate $R=5, 10, 30, 50, 100$ annotators and select a single annotation per instance for training. As demonstrated in Figure 1 of our paper, our proposed method consistently outperforms the baselines, even as annotation sparsity increases with the total number of annotators. This robust performance highlights the effectiveness of our approach in handling varying levels of annotation density.

2024-08-13

I have read through the comments of other reviewers and the corresponding authors' responses. I thank the authors for the detailed response. It has addressed most of my concerns. My remaining concerns are about the analysis of how to solve the sparse annotation problem in the proposed method:

The noise transition estimation method in section 3.2 which achieves the estimation via frequency counting, in my opinion, is not robust to typical sparse annotation cases. In typical cases, the annotators may only label a small number of instances, while the small data for each annotator will make the estimation very inaccurate especially when the class space is large, even the data with small losses will be less than the original data. And, the inaccuracy of the estimated noise transition may further influence the accuracy of pseudo-empirical distribution.
As mentioned in the response, when the sample size $n$ is sufficiently large, the true distribution $P^*_{\tilde{y}|x, y}$ will lie within $\Gamma_\epsilon(P_{y|x,\tilde{y}})$ . Is the sparse annotation problem will influence it?
Although the experiments with different levels of annotation sparsity (5-100 annotators) have been conducted in CIFAR10 (Fig.1), the setting seems not very typical. For example, in the real-world CIFAR10N dataset (10 classes), there are 747 annotators, each labeling 201 instances on average; in the CIFAR100N dataset (100 classes), there are 519 annotators, each labeling 96 instances on average; in the LabelMe dataset (8 classes), there are 59 annotators, each labeling 47 instances on average. I think the severe annotation sparsity may influence the performance improvement of the proposed method in real-world datasets.

评论- Regarding the impact of sparse annotation

2024-08-13

Thank you for raising these important points regarding the noise transition estimation method and the impact of sparse annotation. In preparing a revised manuscript, we plan to include additional comments to highlight these issues.

We acknowledge that the frequency-counting method for noise transition estimation may face challenges when the number of labeled instances per annotator is small. However, our method has already accounted for the potential misspecification of the estimated true label posterior and exhibits tolerance and robustness. Specifically,
- By the nature of our design, the proposed algorithm can accommodate imperfect transition estimation by distributionally robust optimization (i.e., Eq. (2) in our paper). For a less accurate estimated true label posterior, we can choose a larger $\epsilon$ in the uncertainty set $\Gamma_\epsilon(P_{y|x,\tilde{y}})$ to tolerate the inaccuracy.
- Additionally, the way we construct the pseudo-empirical distribution also enables a more robust estimation of it. More specifically, we leverage the approximated true label posterior, $P_{y|x,\tilde{y}}$ , to assign robust pseudo labels and subsequently construct the pseudo-empirical distribution. Instead of simply assigning the label corresponding to the highest probability as the pseudo label [a][b], our approach takes into account both the highest and second-highest predicted probabilities. We assign a pseudo label only if the ratio of these probabilities exceeds a specified threshold. This strategy ensures that pseudo labels are given only to instances with high confidence, effectively filtering out uncertain data.
- Empirically, as shown in Figure 4 on page 25 of our paper, when we increase the total number of annotators (thus increasing annotation sparsity), the accuracy of the selected pseudo labels remains high (around 95% even at high noise rates). This further demonstrates the robustness of our method in generating a reliable pseudo-empirical distribution.
On the other hand, we truly appreciate your suggestion, which can further enhance the quality of this work. We will add a limitation section to discuss this issue, and add additional experiments to incorporate the following possible solutions:
- One possible approach is to employ regularization techniques to mitigate the impact of small sample sizes by smoothing the estimates and reducing sensitivity to outliers. For instance, the theories in [a] are established under incomplete labeling paradigm.
- Another approach is to incorporate subgroup structures for the annotators using a multidirectional separation penalty (MDSP) [b-c].
- Additionally, as mentioned in Remark 3.4 of our paper, the estimation of the true label posterior $P^*_{y|x, \tilde{y}}$ is not limited to Bayes’s rule alone. We can also directly model $P^*_{y|x, \tilde{y}}$ by aggregating the data and noisy label information by maximizing the $f$ -mutual information gain as in [d].
Your concern about whether sparse annotation affects the true distribution $P^*_{\tilde{y}|x, y}$ is indeed valid and insightful. Our theoretical framework assumes that, given a sufficiently large sample size $n$ , the true distribution will be captured within $\Gamma_\epsilon(P_{y|x,\tilde{y}})$ . In finite sample settings, sparse annotation can impact the accuracy of this estimation. In this case, we choose a larger $\epsilon$ in the uncertainty set to incorporate the potential misspecifications. Specifically, according to the proof of New Theorem 3.2, the $\epsilon$ in the uncertainty set should be taken in $(1,1/K)$ for $K$ -classification problem. Therefore, we can select a larger $\epsilon$ (closer to $1/K$ rather than 0) if the estimated true label posterior $P^*_{y|x, \tilde{y}}$ is not sufficiently precise.
Thanks for your suggestions. Regarding the real-world datasets, the results using the frequency-counting method for noise transition estimation are presented in Table 2 of our paper, and Table 4 of the PDF attached to our rebuttal. Notably, our method consistently outperforms other baselines, especially when more advanced noise transition matrix estimation methods are incorporated (i.e., Table 4 of the attached PDF). Moreover, to further integrate your feedback, we will also incorporate the solutions to the sparse annotation scenario mentioned in the second bullet point and conduct additional experiments on both real-world datasets and the CIFAR10 dataset with sparser annotations. Due to time constraints, we will try to share the experimental results in the Official Comment in 24 hours.

In summary, we appreciate your continued feedback and recognize the challenges posed by sparse annotation in noise transition estimation. Our revised manuscript will incorporate additional comments and clarifications to address these concerns. Thank you again for your constructive feedback.

评论- References

2024-08-13

[a] Ibrahim, Shahana, Tri Nguyen, and Xiao Fu. "Deep learning from crowdsourced labels: Coupled cross-entropy minimization, identifiability, and regularization." ICLR (2023).

[b] Tang, Xiwei, Fei Xue, and Annie Qu. "Individualized multidirectional variable selection." Journal of the American Statistical Association (2021).

[c] Xu, Qi, et al. "Crowdsourcing Utilizing Subgroup Structure of Latent Factor Modeling." Journal of the American Statistical Association (2024).

[d] Cao, Peng, et al. "Max-mig: an information theoretic approach for joint learning from crowds." ICLR (2019).

评论- Additional Experimental Result (1)

2024-08-14

We conducted additonal experiments to address the annotation sparsity issue. In particular, we incorporated regularization techniques (specifically, GeoCrowdNet (F) and GeoCrowdNet (W) penalties) into our method. We then compared the results against those obtained using the traditional frequency-counting approach to estimate the noise transition matrices.
For these experiments, we generated three groups of annotators with average labeling error rates of approximately 26%, 34%, and 42%, labeled as IDN-LOW, IDN-MID, and IDN-HIGH, respectively. These groups represent low, intermediate, and high error rates, allowing us to evaluate the robustness of our method under varying levels of noise. Due to time constraint, we generate $R=200$ annotators for each group. However, in the revised manuscript, we plan to use a larger $R$ to further validate our findings and strengthen the results.
Table 1 presents the performance of our proposed method on the CIFAR10 ( $R=200$ ), CIFAR10N, and LabelMe datasets, comparing the outcomes when different approaches are used to estimate the noise transition matrices. In addition, Tables 2–5 displays the average accuracies of the robust pseudo-labels generated by our method in the training process. These pseudo-labels play a crucial role in constructing the pseudo-empirical distribution.

Table 1: Accuracies of learning the CIFAR10 ( $R=200$ ), CIFAR10N, and LabelMe datasets with different noise transition matrix estimation methods.

	CIFAR10: IDN-LOW	CIFAR10: IDN-MID	CIFAR10: IDN-HIGH	CIFAR10N	LabelMe	Animal10N
Ours + frequency-counting	$86.01_{\pm 0.67}$	$85.48_{\pm 0.58}$	$85.07_{\pm 0.59}$	$88.07_{\pm 0.34}$	$83.35_{\pm 1.16}$	$82.35_{\pm 0.34}$
Ours + GeoCrowdNet (F) penalty	$90.89_{\pm 0.21}$	$90.27_{\pm 0.46}$	$89.25_{\pm 0.63}$	$88.30_{\pm 0.13}$	$86.20_{\pm 0.48}$	$83.12_{\pm 0.42}$
Ours + GeoCrowdNet (W) penalty	$90.99_{\pm 0.42}$	$90.23_{\pm 0.27}$	$89.42_{\pm 0.29}$	$87.81_{\pm 0.12}$	$83.32_{\pm 0.51}$	$82.41_{\pm 0.04}$

评论- Additional Experimental Result (2)

2024-08-14

Table 2: Average accuracy of the robust pseudo-labels on CIFAR-10(IDN-LOW, $R=200$ ) in the training process.

	epoch 30	epoch 40	epoch 50	epoch 60	epoch 70	epoch 80	epoch 90	epoch 100	epoch 110	epoch 120
Ours + frequency-counting	$97.08_{\pm 1.30}$	$97.51_{\pm 0.95}$	$97.79_{\pm 0.98}$	$98.51_{\pm 1.03}$	$98.56_{\pm 0.96}$	$98.72_{\pm 1.03}$	$98.50_{\pm 1.09}$	$98.97_{\pm 1.65}$	$98.89_{\pm 1.24}$	$99.13_{\pm 0.43}$
Ours + GeoCrowdNet (F) penalty	$98.35_{\pm 1.21}$	$99.46_{\pm 0.94}$	$99.18_{\pm 1.14}$	$99.47_{\pm 1.26}$	$99.66_{\pm 0.85}$	$99.54_{\pm 1.13}$	$99.50_{\pm 1.10}$	$99.32_{\pm 1.04}$	$99.52_{\pm 1.05}$	$99.40_{\pm 0.81}$
Ours + GeoCrowdNet (W) penalty	$98.05_{\pm 1.15}$	$98.36_{\pm 1.01}$	$98.86_{\pm 0.93}$	$98.80_{\pm 0.78}$	$98.91_{\pm 0.91}$	$98.62_{\pm 0.95}$	$98.92_{\pm 0.71}$	$98.87_{\pm 0.84}$	$98.83_{\pm 1.13}$	$99.34_{\pm 0.91}$

Table 3: Average accuracy of the robust pseudo-labels on CIFAR-10(IDN-MID, $R=200$ ) in the training process.

	epoch 30	epoch 40	epoch 50	epoch 60	epoch 70	epoch 80	epoch 90	epoch 100	epoch 110	epoch 120
Ours + frequency-counting	$96.39_{\pm 1.13}$	$96.51_{\pm 1.51}$	$96.48_{\pm 1.19}$	$96.63_{\pm 1.63}$	$96.69_{\pm 1.66}$	$96.79_{\pm 1.44}$	$96.91_{\pm 1.49}$	$97.03_{\pm 1.30}$	$96.83_{\pm 1.34}$	$97.34_{\pm 2.81}$
Ours + GeoCrowdNet (F) penalty	$97.91_{\pm 1.65}$	$98.75_{\pm 1.25}$	$99.33_{\pm 1.11}$	$98.92_{\pm 1.52}$	$99.51_{\pm 0.71}$	$98.89_{\pm 0.98}$	$99.63_{\pm 0.71}$	$98.89_{\pm 1.26}$	$97.99_{\pm 1.43}$	$99.27_{\pm 2.24}$
Ours + GeoCrowdNet (W) penalty	$97.20_{\pm 1.62}$	$97.86_{\pm 1.57}$	$98.02_{\pm 0.97}$	$98.43_{\pm 0.94}$	$98.11_{\pm 1.16}$	$97.81_{\pm 0.95}$	$98.26_{\pm 1.31}$	$98.52_{\pm 1.04}$	$98.01_{\pm 0.96}$	$99.63_{\pm 1.28}$

Table 4: Average accuracy of the robust pseudo-labels on CIFAR-10(IDN-HIGH, $R=200$ ) in the training process.

	epoch 30	epoch 40	epoch 50	epoch 60	epoch 70	epoch 80	epoch 90	epoch 100	epoch 110	epoch 120
Ours + frequency-counting	$93.68_{\pm 1.98}$	$94.12_{\pm 1.91}$	$94.79_{\pm 2.07}$	$95.19_{\pm 1.53}$	$94.97_{\pm 2.14}$	$94.69_{\pm 2.14}$	$95.02_{\pm 1.44}$	$95.03_{\pm 2.11}$	$95.31_{\pm 1.43}$	$95.75_{\pm 1.98}$
Ours + GeoCrowdNet (F) penalty	$95.69_{\pm 1.70}$	$97.06_{\pm 1.36}$	$97.45_{\pm 1.38}$	$97.28_{\pm 1.23}$	$96.89_{\pm 1.52}$	$97.45_{\pm 1.50}$	$97.42_{\pm 1.18}$	$97.12_{\pm 1.54}$	$97.41_{\pm 1.65}$	$98.69_{\pm 0.61}$
Ours + GeoCrowdNet (W) penalty	$95.85_{\pm 1.59}$	$97.60_{\pm 1.37}$	$97.68_{\pm 1.15}$	$97.52_{\pm 1.35}$	$97.34_{\pm 1.31}$	$97.71_{\pm 1.45}$	$97.81_{\pm 1.27}$	$97.21_{\pm 1.36}$	$97.8_{\pm 1.12}$	$98.16_{\pm 2.07}$

Table 5: Average accuracy of the robust pseudo-labels on CIFAR-10N in the training process.

	epoch 30	epoch 40	epoch 50	epoch 60	epoch 70	epoch 80	epoch 90	epoch 100	epoch 110	epoch 120
Ours + frequency-counting	$93.85_{\pm 1.21}$	$97.31_{\pm 1.08}$	$97.25_{\pm 0.87}$	$97.34_{\pm 0.84}$	$97.35_{\pm 1.17}$	$97.52_{\pm 1.28}$	$97.87_{\pm 1.06}$	$97.71_{\pm 1.44}$	$98.06_{\pm 1.43}$	$97.76_{\pm 1.23}$
Ours + GeoCrowdNet (F) penalty	$96.18_{\pm 2.24}$	$97.79_{\pm 1.61}$	$97.53_{\pm 1.32}$	$98.21_{\pm 1.47}$	$98.28_{\pm 1.17}$	$98.25_{\pm 1.37}$	$98.39_{\pm 1.13}$	$98.61_{\pm 1.69}$	$98.41_{\pm 0.93}$	$98.68_{\pm 2.08}$
Ours + GeoCrowdNet (W) penalty	$95.19_{\pm 1.21}$	$97.06_{\pm 1.63}$	$96.72_{\pm 1.00}$	$97.68_{\pm 1.14}$	$97.38_{\pm 1.02}$	$97.54_{\pm 1.01}$	$97.97_{\pm 1.13}$	$97.82_{\pm 1.15}$	$97.90_{\pm 0.90}$	$98.20_{\pm 0.90}$

2024-08-14

I appreciate the authors' efforts in answering the comments and improving their work. All my concerns have been addressed. Believing the promised changes will be done in the revised version, I decided to increase my score to 7.

2024-08-14

Thank you for taking the time to review our responses and for recognizing the work we've put into addressing your comments. We’re glad that our efforts have resolved your concerns. Your decision to increase the score is very encouraging, and we will continue to refine the manuscript to ensure that all feedback is thoroughly incorporated.

审稿意见

评分: 6置信度: 42024-07-15

This work addresses learning from noisy annotations by using conditional distributionally robust optimization (CDRO). To account for variability in estimating the true label posteriors, the authors propose an approach that minimizes the maximum expected risk with respect to a probability distribution within a distance-based ambiguity set centered around a reference distribution (the posterior distribution of the true labels). Deriving the dual problem, the authors are able to provide upper bounds for risk. Moreover they derive generalization bounds for the upper bound of the risk. Additionally, a closed-form expression for empirical robust risk and the optimal Lagrange multiplier is provided. An analytical solution for dual robust risk is found for loss functions of a particular form. Starting from the analytical solution and introducing a robust pseudo-label collection algorithm. Experiments are performed on CIFAR-10 and CIFAR-100 introducing on synthetic and on the real-world datasets CIFAR-10N, CIFAR-100N, LabelME, animal-10N.

优点

The proposed approach is really interesting, novel and principled, with a sound theoretical foundation.
The paper is clearly written, making the concepts and methodology understandable.
The authors provide a generalization bound for the upper bound of the risk function they define.

缺点

Some results in the experiments are not fully convincing me, I would like to hear the authors feedback:

Regarding Figure 1. How can the accuracy change so little with an increasing number of annotators? According to Figure 1 in [7], the noise rate of aggregated labels decreases significantly as the number of annotators increases, even for high initial noise rates. How is it possible that the accuracy in Figure 1 remains almost constant despite the increase in annotators?
Why is the performance of the ResNet34 model on the clean CIFAR-100 dataset so low? The overall performances for CIFAR 100 and CIFA10-N seem lower than those reported in ProMix [1] (Tables 1 and 2) or SOP [3] (Tables 1 and 3). Although the noise settings differ, the model trained on clean data should exhibit higher performance on CIFAR 100. Similarly, the results for Co-teaching method appear inconsistent with those in other studies.

I believe a better justification for the chosen baselines is needed. For example, since Co-teaching, using two networks and which is designed for scenarios with a single label per sample is included, why not include more SOTA methods such as ProMix [1], Divide-Mix [2], and SOP [3]? I am also curious why these methods were not included. Additionally, for methods that aggregate labels, considering other approaches like IWMV [4], or those that train models on soft labels, such as IAA [5] and soft-labels average [6], would be beneficial. While not asking the authors to include all these methods, I would appreciate a clearer justification for the choice of baselines.
In my opinion, the limitations are not fully discussed. Please see the Limitations section for further details.

[1] Wang, Haobo, et al. "Promix: Combating label noise via maximizing clean sample utility." arXiv preprint arXiv:2207.10276 (2022). [2] Li, Junnan, Richard Socher, and Steven CH Hoi. "Dividemix: Learning with noisy labels as semi-supervised learning." arXiv preprint arXiv:2002.07394 (2020). [3] Liu, Sheng, et al. "Robust training under label noise by over-parameterization." International Conference on Machine Learning. PMLR, 2022. [4] Li, Hongwei, and Bin Yu. "Error rate bounds and iterative weighted majority voting for crowdsourcing." arXiv preprint arXiv:1411.4086 (2014). [5] Bucarelli, M. S., Cassano, L., Siciliano, F., Mantrach, A., & Silvestri, F. (2023). Leveraging inter-rater agreement for classification in the presence of noisy labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 3439-3448). [6] Collins, Katherine M., Umang Bhatt, and Adrian Weller. "Eliciting and learning with soft labels from every annotator." Proceedings of the AAAI conference on human computation and crowdsourcing. Vol. 10. 2022. [7] Wei, Jiaheng, et al. "To aggregate or not? learning with separate noisy labels." Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2023.

问题

What loss function is used in the experiments, namely what function $\mathcal{Tau}$ is used in the loss?

局限性

Partially. some possible improvements are shortly discussed in the conclusions.

Some limitations:

An overlooked limitation is the need for $P_j(x,\tilde{y}y)$ , that is the posterior $P(Y=j| \mathbf{X}= \mathbf{x}, \mathbf{\tilde{Y}} = \mathbf{\tilde{y}})$ to define the optimal action. This requires the posterior distribution, which can only be obtained after training the model for some epochs. The number of epochs necessary depends on the dataset's characteristics and how quickly the model overfits, but the sensitivity to this factor is not discussed. Also the number of samples for the $\mathcal{D}^{*}_0$ dataset are not specified.
Another limitation that is not mentioned is the fact that the method needs to classifiers.

作者回复

2024-08-07

Thank you for reviewing our manuscript. We appreciate your thoughtful comments and suggestions. We will carefully incorporate the necessary revisions to address your feedback in the new version of the manuscript. Below, we highlight our responses to each of your comments.

1. About the feedback on initial experiment results

About accuracy with an increasing number of annotators.
- While we generate $R=5, 10, 30, 50, 100$ annotators, we randomly select only one annotation per instance for the training dataset to assess the algorithms in a partial labeling setting with sparse annotations. In contrast, Figure 1 in [7] shows the noise rates of the aggregated labels when all the $R$ labels are provided.
- To further evaluate model performance with varying numbers of annotations per instance, we use $R = 30$ annotators and randomly select $l = 1, 3, 5, 7, 9$ labels from these $R$ annotators for each instance. The noise rates of the majority vote labels are provided below. The test accuracies of the proposed method and other annotation aggregation methods are shown in Table 1 of the attached PDF.

	$l=1$	$l=3$	$l=5$	$l=7$	$l=9$
CIFAR10: IDN-LOW	0.20	0.09	0.04	0.02	0.01
CIFAR10: IDN-MID	0.38	0.26	0.16	0.10	0.07
CIFAR10: IDN-HIGH	0.51	0.42	0.31	0.25	0.20

Performances of some baselines.
- We followed the experimental settings used in [A-B], which differ from those in ProMix or SOP. Specifically, all models in our paper are trained on CIFAR100 for 150 epochs with a batch size of 128. In contrast, according to their source code, ProMix is trained on CIFAR100 for 600 epochs with a batch size of 256, and SOP is trained for 300 epochs. Additionally, ProMix and SOP employ further data augmentations, such as AutoAugment [C] and mixup augmentation [D], which we did not use in our setting.
- The discrepancy in the results for the Co-teaching method compared to other studies is due to differences in the noisy label generation method used in our paper. Specifically, for each instance, we generate $R$ instance-dependent annotators and randomly select one noisy annotation from these $R$ annotations.

2. Justification for the chosen baselines

Our method addresses learning from noisy annotations, especially with potential misspecifications in estimated true label posteriors. We select baselines that either directly use estimated transition matrices or true label posteriors (MBEM, CrowdLayer, TraceReg, Max-MIG, CoNAL). We also consider baselines that aggregate labels in various ways (CE (MV), CE (EM), DoctorNet, CCC). Since our theoretical framework applies to both single-annotator and multiple-annotator scenarios, we include baselines designed for single noisy labels (LogitClip), especially methods that use two networks (Co-teaching, Co-teaching+, CoDis), given that our method utilizes two networks to serve as priors for each other.
The results for the proposed method and the baselines were obtained using simple data augmentations. As a result, we did not include baselines such as ProMix [1], Divide-Mix [2], and SOP [3], which use additional data augmentations. However, our method is compatible with these augmentations and can be adapted to incorporate them. We have now applied mixup augmentation to our method and some other baselines, with results shown in Table 3 of the attached PDF. For a fair comparison, all results in Table 3 are based on training ResNet18 for 120 epochs, except for ProMix, which was trained for 300 epochs. This is fewer epochs than those used in the original papers for DivideMix and ProMix.
Methods that aggregate labels.
- We have selected several baselines that utilize aggregated labels, including majority voting (“CE (MV)”), the EM algorithm (“CE (EM)”), MBEM, CoNAL, DoctorNet, and Max-MIG.
- The algorithms proposed in IWMV [4], IAA [5], and soft-labels average [6] require multiple annotations per instance to estimate labels or the agreement matrix, which makes them unsuitable for sparse labeling scenarios. Additionally, IAA [5] assumes a common transition matrix for all annotators, which differs from the noisy annotation generation process described in our paper.
- To compare with these label aggregation methods, we have now conducted additional experiments by randomly selecting $l=3,5,7,9$ annotations for each instance from the $R=30$ annotators. The results are displayed in Table 1 of the attached PDF.

3. Question about the loss function

We use cross-entropy loss for the loss function $\ell$ , meaning that $\mathcal{T}$ is the log function, which is concave in its argument. To meet the required conditions on $\mathcal{T}(\psi(x))$ , we clip the predicted probabilities $\psi(x)$ to the range $[0.01, 1 - 0.01]$ to ensure that $\mathcal{T}(\psi(x))$ remains bounded.
Thank you for your suggestion. We will include details about the type of network used in the experiments in the main paper when we prepare the revision.

4. About limitations

Thank you for raising this point. We use 30 warmup epochs in our experiments, as adopted from existing works [E]. The learning dynamics of the algorithms are shown in Figure 1 of the attached PDF. It can be observed that the model already overfits label noise with 30 epochs. To assess the sensitivity to warm-up epochs, we have now conducted experiments with varying warm-up epochs on our method and the baselines that also employ a warm-up stage, as shown in Figure 2 of the attached PDF. The results indicate that performance can be further improved with an appropriate number of warm-up epochs.
The number of samples in $\mathcal{D}^*_0$ is not predetermined. After the warm-up stage, we include an instance in $\mathcal{D}^*_0$ if its predicted probability exceeds 0.5.
Thank you for pointing this out. We will address this limitation in our paper when preparing the revision.

2024-08-13

Thank you for addressing all my concerns. Regarding my initial comment, I realize now what might have caused my confusion. What is the difference between using 50 annotators with identical noise transition matrices and providing one label per sample, versus the traditional approach where each sample has a single noisy label and noise is modeled by a transition matrix?

I appreciate the supplementary material provided in the PDF attached to your rebuttal, as it significantly enhances the paper's quality and clarity, particularly in the experimental section, which was previously a bit weak. I recommend including this material in the final submission in the case the paper is accepted. The additional details, especially Table 1 with experiments involving multiple annotators and Table 4 with various methods for estimating the noise transition matrix, are particularly valuable (this is especially relevant as it could eliminate the need for the warm-up stage in your approach, right?)

2024-08-13

Thank you for your thoughtful feedback and for taking the time to review the supplementary material. I'm glad to hear that it has helped clarify the paper, particularly in the experimental section.

Regarding the difference between using $R=50$ $R = 50$ annotators and the traditional single noisy label scenario:
- If the $R=50$ annotators generates labels with identical noise transition matrices, and each annotator provides only one label per sample, the data will be distributed the same way as in the traditional approach where each sample has a single noisy label and noise is modeled by a transition matrix. In this context, the noisy data can be considered independent and identically distributed (iid) random variables. In particular, let $P^{(r)}(\tilde{y}|x,y):=T(\tilde{y}|x,y)$ for $r\in[R]$ represent the identical transition matrix, and let $\tilde{y}_i$ denote the one label for instance $x_i$ provided by annotator $r_i$ . By the law of total probability, we have: $P(x\_i,\tilde{y}\_i)=\sum\_{y\in[K]}P(x\_i,y)\sum\_{r\in[R]}P(r\_i=r)P^{(r)}(\tilde{y}\_i|x\_i,y)=\sum\_{y\in[K]}P(x\_i,y)T(\tilde{y}\_i|x\_i,y).$ Thus, we can set $R=1$ in our method and only estimate one transiton matrix to approximate the true label posterior, which is then used to construct the pseudo-empirical distribution in our algorithm.
- In our experiments, we study the more challenging setting, where we generate $R$ annotators with different instance-dependent transition matrices using Algorihtm 2 from [a], as described in Appendix B.1 of our paper. Each annotator is referred to as an IDN- $\tau$ annotator if its mislabeling ratio is upper bounded by $\tau$ . For example, for $R=50$ , we generate the following groups of annotators:
  - IDN-LOW. 18 IDN-10% annotators, 18 IDN-20% annotators, 14 IDN-30% annotators;
  - IDN-MID. 18 IDN-30% annotators, 18 IDN-40% annotators, 14 IDN-50% annotators;
  - IDN-HIGH. 18 IDN-50% annotators, 18 IDN-60% annotators, 14 IDN-70% annotators.
  Then we randomly select one noisy label for each instance. In this case, we approximate the true label posterior by estimating $R$ transition matrices, which allows us to incorporate the labeling expertise of the annotators. As shown in Table 1 below, when annotators generate noisy annotations with different transition matrices, even if only one label is provided per instance, the results are better when estimating $R$ different transition matrices, compared to ignoring the annotator information and estimating only one transition matrix.
  
  Table 1: Average accuracies of learning the CIFAR-10 dataset with $R=50$ annotators.

	IDN-LOW	IDN-MID	IDN-HIGH
estimate $R=50$ transition matrices	$86.94_{\pm 0.33}$	$85.17_{\pm 0.26}$	$84.09_{\pm 0.49}$
estimate ONE transition matrix (ignore the annotator information)	$86.58_{\pm 0.32}$	$84.23_{\pm 0.31}$	$82.33_{\pm 0.36}$

Regarding your observation about Table 4:
- Yes, you're right —- the additional methods for estimating the noise transition matrix could indeed make the warm-up stage unnecessary. For instance, if we use the GeoCrowdNet method in Table 4 of the attached PDF to estimate the transition matrices, we can initialize the transition matrices with the identity matrix and then simultaneously train the classifier and the transition matrices.

Thank you again for your valuable insights and recommendations.

References

[a] Xia, Xiaobo, et al. "Part-dependent label noise: Towards instance-dependent label noise." NeurIPS (2020).

评论- Increasing my score

2024-08-14

In light of the thoughtful rebuttal and the productive discussions with me and the other reviewers, I have decided to raise my score from "bordearline accept" to "weak accept", trusting that the author will integrate these insights effectively.

2024-08-14

Thank you for deciding to raise the score. We sincerely appreciate your insightful feedback and suggestions. We will carefully implement the necessary revisions to incorporate these insights into the updated manuscript. Your support and confidence in our work are greatly appreciated.

评论- New theoretical results in the multi-class case.

2024-08-07

Here we present the detailed theoretical results for the multi-class case.

We extend Theorem 3.1 to the multi-class case by considering the specific loss function $\ell(\psi(\mathbf{x}),\mathrm{y})=\sum\_{j=1}^{K}\mathbf{1}(\mathrm{y}=j)\{1-\psi(\mathbf{x})\_j\}$ , which represents the misclassification probability. The following New Theorem 3.1 characterizes the optimal action for single data in the multi-class scenario and can be used to construct the pseudo empirical distribution.
- [New Theorem 3.1] Consider the loss function $\ell(\psi(\mathbf{x}),\mathrm{y})=\sum\_{j=1}^{K}\mathbf{1}(\mathrm{y}=j)\{1-\psi(\mathbf{x})\_j\}$ , where $\psi(\mathbf{x})\_{j}$ represents the $j$ -th component of $\psi(\mathbf{x})$ . Let $P^{(j)}$ denote the $j$ -th largest element of $\{P\_j(\mathbf{x},\tilde{\mathbf{y}})\}_{j=1}^{K}$ . Then, the optimal action $\psi^\star$ is given as below.
  - If $\frac{1}{K}\ge \frac{1}{k^*}\sum_{j=1}^{k^*}P^{(j)}-\frac{1}{k^*}\varrho(\epsilon)$ for all $k^*\in[K-1]$ , then $\psi^\star_j=\frac{1}{K}$ for $j\in[K]$ .
  - If there exists some $k_0\in[K-1]$ , $\frac{1}{k_0}\sum_{j=1}^{k_0}P^{(j)}-\frac{1}{k_0}\varrho(\epsilon)>\frac{1}{K}$ , and $\frac{1}{k_0}\sum_{j=1}^{k_0}P^{(j)}-\frac{1}{k_0}\varrho(\epsilon)\ge\frac{1}{k^*}\sum_{j=1}^{k^*}P^{(j)}-\frac{1}{k^*}\varrho(\epsilon)$ for all $k^*\in[K-1]$ , then $\psi^{\star(j)}=\frac{1}{k_0}$ for $j\in[k_0]$ and $\psi^{\star(j)}=0$ for $j=k_0+1,\ldots,K$ .
  In particular, if $P^{(1)}\ge\max\{\frac{1}{K}+\varrho(\epsilon),P^{(2)}+\varrho(\epsilon)\}$ , then the optimal action is given as: $\psi^{\star(1)}=1$ and $\psi^{\star(j)}=0$ for $j=2,\ldots, K$ .
The optimal value for the Lagrangian multiplier $\gamma^*$ , along with the closed-form expression for the robust risk in the multi-class scenario, is presented in New Theorem 3.2.
- [New Theorem 3.2] For $i\in[K]$ and $j\in[K]$ , we let $P\_{i,j}:= P\_j(\mathbf{x}\_i, \widetilde{\mathbf{y}}\_i)\triangleq P(\mathrm{Y}=j|\mathbf{x}\_i, \widetilde{\mathbf{y}}\_i)$ and $\psi\_{i,j}\triangleq\psi(\mathbf{x}\_i)\_j$ . Additionally, for $\psi(\mathbf{x}\_i)$ , we sort $\{\psi\_{i,1},\ldots,\psi\_{i,K}\}$ in an decreasing order, denoted $\psi^{(1)}\_i\ge\ldots\ge\psi^{(K)}\_i$ . Moreover, in the multi-classification case, we denote $\hat{\mathfrak{R}}\_{\epsilon}=\inf\_{\gamma\ge 0}[\gamma\epsilon^p+\frac{1}{n}\sum\_{i=1}^{n}\sum\_{j=1}^{K}P\_{i,j}\max\ {\mathcal{T}(1-\psi\_{i,1})-\gamma\kappa^p,\ldots,\mathcal{T}(1-\psi\_{i,j}),\ldots,\mathcal{T}(1-\psi_{i,K})-\gamma\kappa^p\}],$
  
  $\widehat{\mathfrak{R}}=\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{K}P_{i,j}\mathcal{T}(1-\psi_{i,j}).$
  
  Let $\alpha\_{i,j}\triangleq\mathcal{T}(1-\psi\_{i}^{(K)})-\mathcal{T}(1-\psi\_{i,j})$ for $i\in[n]$ and $j\in[K]$ . We sort $\{\alpha\_{i,j}\}\_{i\in[n],j\in[K]}$ in an decreasing order, denoted $\alpha^{(1)}\ge\ldots\ge\alpha^{(nK)}$ ; and correspondingly, the $P\_{i,j}$ 's with the associated indexes are denoted $P^{(1)},\ldots,P^{(nK)}$ . We define $s^*$ as follows. If $\frac{1}{n}P^{(1)}<\varrho(\epsilon)<\frac{1}{n}\sum\_{t=1}^{nK}P^{(t)}$ , then there exists $s^*\in\{2,\ldots,nK\}$ such that $\frac{1}{n}\sum\_{t=1}^{s}P^{(t)}< \varrho(\epsilon)$ for $s< s^*$ , and $\frac{1}{n}\sum\_{t=1}^{s}P^{(t)}\ge \varrho(\epsilon)$ for $s\ge s^*$ . If $\varrho(\epsilon)\le\frac{1}{n}P^{(1)}$ , we take $s^*=1$ . If $\varrho(\epsilon)\ge\frac{1}{n}\sum\_{t=1}^{nK}P^{(t)}$ , we take $s^*=n+1$ . The optimal Lagrange multiplier is then given by $\gamma^\star\triangleq\alpha^{(s^*)}/\kappa^p$ , and the robust risk is then expressed as
  $(s^*>1)+o(\frac{1}{n})\alpha^{(s^*)}.$$$

作者回复

2024-08-07

Dear Reviewers,

Thank you for your thoughtful feedback and for the time you dedicated to evaluating our paper. We deeply appreciate your insights and constructive comments. We are pleased to hear that you recognized the strengths and contributions of our paper, which we would like to recap as follows.

Our paper addresses a significant and practical topic and proposes a novel and principled CDRO framework to tackle the challenge of potential misspecifications in the estimated true label posterior (Reviewer 7dsE, 62sF).
We provide rigorous theoretical analyses on the upper bound for the worst-case risk and the optimal action for constructing the pseudo-empirical distribution, which serve as valuable guidelines for designing our algorithm (Reviewers 7dsE, xLiU, 62sF).
The proposed method demonstrates promising performance in our experiments. The results also validate the effectiveness of our approach and its practical utility (Reviewer xLiU).
We appreciate the feedback on the clarity and ease of understanding of our paper (Reviewers 7dsE, xLiU). We will carefully prepare a revised manuscript to fully address reviewers' comments and suggestions to ensure our presentation to remain accessible and clear.

We have thoroughly reviewed each of your queries, concerns, and remarks. In response, we have prepared a one-page PDF detailing additional experimental results, which are designed to address your points comprehensively. The references cited in our responses are listed at the end of the document. For your convenience, the following summary highlights the key updates:

We appreciate your valuable insights into emphasizing the intuition and rationale behind the CDRO framework (Reviewer xLiU, 62sF). In response, we have added a more comprehensive explanation of the motivation for using a Wasserstein ball-based uncertainty set $\Gamma_\epsilon(P_{\mathrm{y}|\mathbf{x},\tilde{\mathbf{y}}})$ to address potential misspecifications.
Your suggestions to justify the chosen baselines (Reviewer 7dsE), examine the impact of the number of warm-up epochs (Reviewers 7dsE, 62sF), and explore different estimation methods for the noise transition matrices (Reviewers xLiU, 62sF) are invaluable. When preparing a revision, well incorporate these recommendations to better highlight the strengths and robustness of our approach.
We have extended our theoretical results to address the multi-class case, as questioned by Reviewer 62sF. The updated theoretical insights are detailed in the Official Comment. When preparing the revised manuscript, we will update the presentation to cover both the binary and multi-class cases comprehensively, and include the new proofs to support these results. This enhancement will ensure a thorough understanding of our approach across different classification scenarios.
We appreciate your constructive feedback on improving the clarity and presentation of our paper (Reviewers 7dsE, xLiU). We have carefully considered your suggestions and will implement effective revisions in the updated manuscript to enhance its readability and effectiveness.

We believe that our responses have thoroughly addressed all the concerns raised. However, should you require any additional details, justifications, or further results, we are more than willing to provide them to ensure all aspects are comprehensively covered.

References

[A] Ibrahim, Shahana, Tri Nguyen, and Xiao Fu. "Deep Learning From Crowdsourced Labels: Coupled Cross-Entropy Minimization, Identifiability, and Regularization." ICLR (2023).

[B] Yang, Shuo, et al. "Estimating Instance-dependent Bayes-label Transition Matrix using a Deep Neural Network." ICML (2022).

[C] Cubuk, Ekin D., et al. "Autoaugment: Learning augmentation strategies from data." CVPR (2019).

[D] Zhang, Hongyi, et al. "mixup: Beyond Empirical Risk Minimization." ICLR (2018).

[E] Zheng, Songzhu, et al. "Error-bounded correction of noisy labels." ICML (2020).

[F] Khetan, Ashish, Zachary C. Lipton, and Animashree Anandkumar. "Learning From Noisy Singly-labeled Data." ICLR (2018).

[G] Guo, Hui, Boyu Wang, and Grace Yi. "Label correction of crowdsourced noisy annotations with an instance-dependent noise transition model." NeurIPS (2023).

[H] Grace, Y. Yi. "Statistical analysis with measurement error or misclassification" Springer (2016).

[I] Peskir, Goran. "Vitali convergence theorem for upper integrals." Proc. Funct. Anal. IV (1993)

最终决定Accept (poster)

2024-09-25

This work proposes a conditional distributionally robust optimization approach to tackle the crowdsourced noisy label learning problem.

The authors and reviewers had concrete discussions in the rebuttal phase. As a result, the reviewers were satisfied with the paper, given that the additional experiments and clarifications in the rebuttal phase will be integrated into the camera-ready version.

Overall, the reviewers all agree that the topic addressed in this work has practical significance. The reviewers are also happy with the fact that the proposed method is accompanied with rigorous risk analyses. During the rebuttal, the authors made substantial efforts to address the reviewers’ major comments. Two notable points include (1) a new theoretical result on the multi-class cases and (2) substantial experiments with new baselines/regularization, e.g., using GeoCrowdNet’s regularization to enhance the proposed method. These efforts were appreciated by the reviewers.

Learning from Noisy Labels via Conditional Distributionally Robust Optimization

摘要

评审与讨论

优点

缺点

问题

局限性

优点

缺点

问题

局限性

1. About the focused problem of our paper

2. Learning from noisy labels v.s. Learning from crowds

3. About the pseudo-empirical distribution

4. About estimation of noise transition probabilities

5. Answers to the questions

优点

缺点

问题

局限性

1. About the feedback on initial experiment results

2. Justification for the chosen baselines

3. Question about the loss function

4. About limitations