PaperHub
ICLR 2024 · Rejected · 4 reviewers
Average rating: 6.0/10 (scores: 5, 8, 6, 5; min 5, max 8, std 1.2)
Average confidence: 3.5

Imprecise Label Learning: A Unified Framework for Learning with Various Imprecise Label Configurations

Submitted: 2023-09-22 · Updated: 2024-02-11

Keywords
Imprecise Label Learning; Partial Label Learning; Noisy Label Learning; Semi-Supervised Learning; Expectation-Maximization

Reviews & Discussion

Review (Rating: 5)

The article addresses the challenge of learning with imprecise labels in machine learning tasks, such as noisy or partial labels. Traditional methods often struggle with multiple forms of label imprecision. The authors introduce a novel framework named Imprecise Label Learning (ILL) that serves as a unified approach to handle various imprecise label scenarios. ILL employs the expectation-maximization (EM) technique, viewing precise labels as latent variables and focusing on the entire potential label distribution. The framework demonstrates adaptability to different learning setups, including partial label learning and noisy label learning. Remarkably, ILL outperforms existing techniques designed for imprecise labels, establishing itself as the first integrated approach for such challenges.

Strengths

  1. The article offers a comprehensive solution to the prevalent challenge of imprecise annotations, enhancing the adaptability and applicability of machine learning models.
  2. The inclusion of experimental results across multiple settings provides empirical evidence of the framework's robustness and superior performance.

Weaknesses

  1. The article's innovation is limited: treating ground-truth labels or the Bayes label distribution as latent variables and using variational inference for approximation is already a common approach in weakly supervised learning [1-2], which suggests that the presented techniques may not be as novel as claimed.

    [1] Xu, N., Qiao, C., Geng, X., & Zhang, M. L. (2021). Instance-dependent partial label learning. Advances in Neural Information Processing Systems, 34, 27119-27130.

    [2] Yao, Y., Liu, T., Gong, M., Han, B., Niu, G., & Zhang, K. (2021). Instance-dependent label-noise learning under a structural causal model. Advances in Neural Information Processing Systems, 34, 4409-4420.

  2. The article's treatment in Section 3.2, where "P(S|X, Y; θ) reduces to P(S|Y)", does not hold under the instance-dependent PLL setting. When the generation process of the candidate labels S is feature-dependent [1,2], the simplification presented in the article may not be universally applicable and may overlook specific nuances of instance-dependent partial label learning.

    [1] Xu, N., Qiao, C., Geng, X., & Zhang, M. L. (2021). Instance-dependent partial label learning. Advances in Neural Information Processing Systems, 34, 27119-27130.

    [2] Xu, N., Liu, B., Lv, J., Qiao, C., & Geng, X. (2023). Progressive purification for instance-dependent partial label learning. In International Conference on Machine Learning (pp. 38551-38565). PMLR.

  3. Some important baselines should be compared, such as [1] in PLL and [2,3] in NLL.

    [1] Wu, Dong-Dong, et al. "Revisiting consistency regularization for deep partial label learning." International Conference on Machine Learning. PMLR, 2022.

    [2] Jiang, Lu, et al. "Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels." International conference on machine learning. PMLR, 2018.

    [3] Han, Bo, et al. "Co-teaching: Robust training of deep neural networks with extremely noisy labels." Advances in neural information processing systems 31 (2018).

  4. The article lacks a detailed exposition of the derivation of the loss functions for the three imprecise annotation configurations stemming from Equation 5, potentially leaving readers without a clear understanding of the underlying methodology.

  5. The article contains a typographical error in the last sentence of the second paragraph on page 6: "However, our framework is much simpler and more concise as shown in ??."

Questions

  1. Could the authors provide a more comprehensive derivation of the loss functions for the three imprecise annotation configurations derived from Equation 5, to ensure clarity and thorough understanding for the readers?

  2. Given that the method of using ground-truth labels or Bayes label distribution as latent variables coupled with variational inference in weakly supervised learning is highlighted in prior works, how does the presented framework distinguish itself or advance beyond these existing approaches in terms of innovation or application?

Comment

Thanks for your valuable feedback. In fact, we found that most of the weaknesses mentioned here have already been discussed in the Appendix with detailed results. Here, we provide a more detailed explanation and point out the related sections in our Appendix.

  1. "The article's innovation is limited, as the approach of considering ground-truth labels or Bayes label distribution as latent variables and using variational inference for approximation in weakly supervised learning is already a common method[1-2], which suggests that the presented techniques may not be as novel as claimed."

Thank you for mentioning these related works; we have included them in our related work. Indeed, our contribution is certainly not to present the first interpretation of variational inference on the latent ground-truth labels, but to present the first complete EM view without any approximation or assumption on the data, so it can robustly handle different types of imprecise labels and even a mixture of them. For example, in [1], an extra prior distribution is needed in the variational approximation, and the method applies only to partial labels. In [2], an encoder-decoder-style architecture is also required to capture the distribution of the latent variable for noisy labels.

Beyond the novelty in unifying existing settings on PLL, SSL, and NLL, our framework can be easily implemented in different settings and achieve SOTA results with faster convergence. We hope that our contributions can be understood and valued in this way.

[1] Xu, N., Qiao, C., Geng, X., & Zhang, M. L. (2021). Instance-dependent partial label learning. Advances in Neural Information Processing Systems, 34, 27119-27130.

[2] Yao, Y., Liu, T., Gong, M., Han, B., Niu, G., & Zhang, K. (2021). Instance-dependent label-noise learning under a structural causal model. Advances in Neural Information Processing Systems, 34, 4409-4420.
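To make the complete-EM view concrete, here is a minimal, hypothetical sketch (our illustration, not the authors' code) of one E/M update in which the precise label is latent and the imprecise information I enters only through a likelihood p(I|y):

```python
import numpy as np

def em_update(probs, p_info_given_y):
    """One EM-style step for learning with imprecise labels.

    probs:          model class probabilities p(y|x), shape (C,)
    p_info_given_y: likelihood p(I|y) of the observed imprecise
                    information I for each possible true label y, shape (C,)

    E-step: posterior over the latent true label, q(y) ∝ p(I|y) p(y|x).
    M-step objective: expected NLL under q, treating q as a fixed target.
    """
    joint = p_info_given_y * probs
    q = joint / joint.sum()             # E-step posterior q(y)
    loss = -np.sum(q * np.log(probs))   # M-step: E_q[-log p(y|x)]
    return q, loss

# Partial-label example: candidate set {0, 1} encoded as a 0/1 mask,
# so the posterior renormalizes the prediction over the candidates.
q, loss = em_update(np.array([0.5, 0.3, 0.2]), np.array([1.0, 1.0, 0.0]))
```

No prior over the latent label or auxiliary encoder-decoder is assumed here; the posterior is computed in closed form from the model's own prediction, which is the property the response emphasizes.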

  1. "The article's treatment in section 3.2, where "P(S|X, Y ; θ) reduces to P(S|Y ),...instance-dependent partial label learning."

Thanks for mentioning the instance-dependent situation. In fact, we think there is a slight misunderstanding here. As mentioned in the last paragraph of Section 3.1, P(I|X, Y; θ) depends on the nature of the imprecise information I. If I is not instance-dependent, it can be largely ignored. If I is instance-dependent, it should be maintained and optimized as a supervised term. In fact, in Section 3.2, P(I|X, Y; θ) is indeed maintained for noisy label learning and semi-supervised learning. Its performance is verified on instance-dependent noisy label learning as shown in Table 3 and Table 15. For instance-dependent partial label learning as in [1, 2], it should also be maintained, as P(S|X, Y; θ). Here we provide a comparison of our method to [1, 2] on the benchmark of instance-dependent partial label learning. We follow the training recipe in [2] to train our method and report the average accuracy over 3 runs.

| Method | MNIST | Kuzushiji-MNIST | Fashion-MNIST | CIFAR-10 | CIFAR-100 |
| --- | --- | --- | --- | --- | --- |
| VALEN [1] | 99.03 | 90.15 | 96.31 | 92.01 | 71.48 |
| RCR [3] | 98.81 | 90.62 | 96.64 | 86.11 | 71.07 |
| PiCO | 98.76 | 88.87 | 94.83 | 89.35 | 66.30 |
| POP [2] | 99.28 | 91.09 | 96.93 | 93.00 | 71.82 |
| Ours | 99.19 | 91.35 | 97.01 | 93.86 | 72.43 |

The discussions and results on instance-dependent partial label learning are included in the revised Appendix D.3.2.

[1] Xu, N., Qiao, C., Geng, X., & Zhang, M. L. (2021). Instance-dependent partial label learning. Advances in Neural Information Processing Systems, 34, 27119-27130.

[2] Xu, N., Liu, B., Lv, J., Qiao, C., & Geng, X. (2023). Progressive purification for instance-dependent partial label learning. In International Conference on Machine Learning (pp. 38551-38565). PMLR.

[3] Wu, D. et al. (2022). Revisiting Consistency Regularization for Deep Partial Label Learning, ICML.

Comment
  1. "Some important baselines should be compared, such as [1] in PLL and [2,3] in NLL."

Thanks for mentioning these important baselines.

In fact, our initial submission already included the comparison and discussion with [1] for PLL in Appendix D.3.2. The comparison with [3] for NLL is also already shown in Table 15 and Table 16 of Appendix D.5.2. So these baselines are not missing.

We did not include a comparison to MentorNet [2] in Table 3, since our method already outperforms SOP, which is stronger than that approach. For clarity, we provide the comparison with MentorNet here and have included its results in Table 16:

| Method | Clothing1M | WebVision |
| --- | --- | --- |
| MentorNet | 66.17 | 63.00 |
| Co-Teaching | 69.20 | 63.60 |
| SOP | 73.50 | 76.60 |
| Ours | 74.02 | 79.37 |

[1] Wu, Dong-Dong, et al. "Revisiting consistency regularization for deep partial label learning." International Conference on Machine Learning. PMLR, 2022.

[2] Jiang, Lu, et al. "Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels." International conference on machine learning. PMLR, 2018.

[3] Han, Bo, et al. "Co-teaching: Robust training of deep neural networks with extremely noisy labels." Advances in neural information processing systems 31 (2018).

  1. "The article lacks detailed exposition on the derivation process of the loss functions for the three imprecise annotations configurations stemming from equation 5, potentially leaving readers without a clear understanding of the underlying methodology." Could the authors provide a more comprehensive derivation of the loss functions for the three imprecise annotations configurations derived from equation 5 to ensure clarity and thorough understanding for the readers?

The detailed derivations for the mixed imprecise label setting are indeed presented in Appendix C of the initial submission, but we would like to add more details for the other settings to make them clearer. Note that we omitted the derivations for PLL, SSL, and NLL in the main paper because they are actually quite straightforward, obtained by substituting I, X, Y in Equation 5, as shown in the revised Appendix C.
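As an illustration of how straightforward these substitutions are, the following hypothetical sketch (our reading of the response, not the paper's code) shows that the E-step posterior has the same form in all three settings; only the likelihood p(I|y) changes:

```python
import numpy as np

def posterior(probs, p_info_given_y):
    """q(y) ∝ p(I|y) p(y|x): the E-step shared by all settings."""
    q = p_info_given_y * probs
    return q / q.sum()

C = 4
probs = np.array([0.1, 0.2, 0.3, 0.4])  # current model prediction p(y|x)

# PLL: I is a candidate set S, so p(S|y) ∝ 1[y ∈ S].
q_pll = posterior(probs, np.array([1.0, 1.0, 0.0, 1.0]))

# SSL (unlabeled sample): I carries no label information, so p(I|y) is
# constant in y and the posterior falls back to the model prediction.
q_ssl = posterior(probs, np.ones(C))

# NLL: I is an observed noisy label s; p(s|y) is a column of a noise
# transition matrix T (symmetric noise with rate 0.3 as an example).
noise = 0.3
T = np.full((C, C), noise / (C - 1))
np.fill_diagonal(T, 1.0 - noise)
s = 2                                   # the observed (possibly wrong) label
q_nll = posterior(probs, T[:, s])
```

The symmetric noise matrix and the 0/1 candidate mask are illustrative choices; the substitution pattern, not the specific likelihoods, is the point.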

  1. "The article contains typographical errors in the last sentence of the 2-nd paragraph on page 6, "However, our framework is much simpler and more concise as shown in ??.""

We apologize for this oversight and appreciate your attention to detail. We have fixed this error in the revised paper.

  1. "Given that the method of using ground-truth labels or Bayes label distribution as latent variables coupled with variational inference in weakly supervised learning is highlighted in prior works, how does the presented framework distinguish itself or advance beyond these existing approaches in terms of innovation or application?"

This might be a misunderstanding of our contribution. Our contribution lies in unification: ILL advances beyond existing approaches by providing a more unified and versatile solution for handling various types of imprecise labels. More importantly, our proposed framework can adapt to the mixture of imprecise labels robustly without specific and additional design, as shown in the results of Table 4 and Table 5, where no prior art can achieve this. Our framework can also extend to broader scenarios, such as multiple-instance learning, without modification. We think this unified view and the extensive robust results our method presents are significant and novel, and distinguish us from previous methods.

We hope that our response helps the reviewer better understand our work and consider a better rating accordingly :) If you have other questions, please do not hesitate to let us know.

Comment

We deeply appreciate the time and effort invested in reviewing our work and providing valuable feedback. Since the deadline for the discussion period is approaching and we haven't heard back from you, here's a summary of our earlier response to the concerns raised:

  • Weaknesses 1, 2, and 3. We believe these issues were substantially addressed in our initial submission. To alleviate any confusion and misunderstanding, we have provided additional discussions and analyses in our earlier response. Moreover, we have conducted further experiments in the instance-dependent partial label learning setting. These experiments demonstrate the versatility and significance of our proposed method, showcasing superior performance without any additional design. We can move the corresponding content into the main paper if necessary.
  • Weakness 4. We have included the detailed derivations in our revised appendix. This is intended to offer a clear and straightforward understanding of our EM formulation.

We are open to further discussions if there are any aspects that remain unclear. We look forward to your continued feedback and sincerely hope that you can improve the rating based on these.

Comment

Thank you very much for your great efforts in addressing my questions. Your response has addressed most of my concerns with a substantial amount of experiments. I appreciate the use of a unified framework to tackle various weakly supervised learning settings in this paper. However, the approach of treating true labels as latent variables and employing variational inference for learning, in my opinion, still lacks significant innovation. Therefore, I decided to increase the score by one point.

Comment

We deeply appreciate the increased rating. In terms of the EM formulation, we totally agree that previous methods exist that treat the true labels as latent variables and even utilize variational inference as we do. However, note that for our EM formulation, the significant difference and innovation lie in that we do not require any assumption on the probability, e.g., conditional independence between Y and I or X (with our NFA modeling), whereas previous methods usually do. This flexibility allows our approach to generalize to many different settings and types of data. We hope we have made this point clearer. Thanks again for your efforts in improving this work.

Review (Rating: 8)

This paper proposes a unified framework for multiple weakly supervised learning settings, including noisy labels, partial labels, and multiple label candidates. The unified framework can be described by the formulation in Eq. 5, and multiple learning problems can be solved under it. The proposed framework shows good performance on all of those problems.

Strengths

The exploration of multiple learning problem settings involving imprecise labels holds significant importance for researchers. It is intriguing to observe the presentation of a unified perspective on these diverse problems.

The unified framework is sound and effective.

The proposed method shows good performance in multiple learning settings. The experiments on performance comparison are complete and convincing. Most recent SOTA methods are included as baselines.

Weaknesses

  1. It is good to see that the proposed method can be seamlessly combined with data augmentation techniques, but it would be helpful to examine the model's performance without them. It seems that the method's performance is sensitive to the quality of data augmentation, yet not all kinds of tasks can easily benefit from data augmentation. An ablation study would be helpful.

  2. There are also some other related works on unifying multiple problem settings of weakly/imprecisely supervised learning. Some discussion of this topic could improve the paper. For example: [1] Centroid Estimation With Guaranteed Efficiency: A General Framework for Weakly Supervised Learning, TPAMI; [2] Weakly Supervised AUC Optimization: A Unified Partial AUC Approach, arXiv:2305.14258.

  3. The presentation can be further improved. There are typos and errors in the paper, e.g., missing punctuation, broken cross-references to the figures, etc.

Questions

See above.

Comment

Thanks for your recognition of the significance of our work and your valuable suggestions. Our response is as follows.

  1. "It is good to see the proposed method can be seamlessly combined with the data augmentation techniques, but it would be helpful to examine the model's performance without data augmentation techniques. It seems that the method's performance is sensitive to the quality of data augmentation, but not all kinds of tasks can be easily benefit from data augmentation. An ablation study would be helpful."

Thanks for this valuable suggestion. We totally understand your point that strong augmentation might be difficult to apply, such as on text or audio data. However, strong augmentation, proposed in FixMatch [1], has been demonstrated to be critical for achieving effective performance in SSL. For PLL, many baseline methods also use strong augmentations. For NLL, strong augmentation is less important than in the other settings. Thus, we provide an ablation study of strong augmentation for PLL and NLL:

| PLL | CIFAR-10, partial ratio 0.5 | CIFAR-100, partial ratio 0.1 |
| --- | --- | --- |
| PiCO | 93.58 | 69.91 |
| Ours | 95.91 | 74.00 |
| PiCO (w/o strong aug) | 91.78 | 66.43 |
| Ours (w/o strong aug) | 94.53 | 72.69 |

| NLL | CIFAR-10, noise ratio 0.8 | CIFAR-100, noise ratio 0.8 |
| --- | --- | --- |
| SOP | 94.00 | 63.30 |
| Ours | 94.31 | 66.46 |
| SOP (w/o strong aug) | 66.85 | 36.60 |
| Ours (w/o strong aug) | 93.56 | 65.89 |

For PLL, strong augmentation indeed affects the performance considerably for both the baseline PiCO and our method. For NLL, removing strong augmentation has less effect on our method but has a detrimental effect on SOP. We have included the above ablation study in Appendix D.7.
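For context, the strong-augmentation term under discussion follows the FixMatch recipe cited above. A minimal sketch, assuming hard pseudo-labels and a confidence threshold as in FixMatch (the function name and the threshold value are illustrative, not from the paper):

```python
import numpy as np

def consistency_loss(probs_weak, probs_strong, threshold=0.95):
    """FixMatch-style strong-augmentation consistency term.

    probs_weak / probs_strong: class probabilities for the weakly- and
    strongly-augmented views of the same samples, shape (N, C).
    A hard pseudo-label from the weak view supervises the strong view,
    counted only where the weak-view confidence exceeds `threshold`.
    """
    pseudo = probs_weak.argmax(axis=1)           # hard pseudo-labels
    mask = probs_weak.max(axis=1) >= threshold   # confident samples only
    if not mask.any():
        return 0.0
    nll = -np.log(probs_strong[np.arange(len(pseudo)), pseudo])
    return float(nll[mask].mean())

loss = consistency_loss(
    np.array([[0.97, 0.03], [0.60, 0.40]]),   # weak views
    np.array([[0.50, 0.50], [0.90, 0.10]]),   # strong views
)
```

Only the first sample passes the confidence gate here, so the term reduces to the cross-entropy of the strong view against the weak view's pseudo-label for that sample.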

  1. "There are also some other related works on unifying multiple problem settings of weakly/imprecise supervised learning. Some discussions on this topic can improve this paper. For example, [1] Centroid Estimation With Guaranteed Efficiency: A General Framework for Weakly Supervised Learning, TPAMI [2] Weakly Supervised AUC Optimization: A Unified Partial AUC Approach, arxiv 2305.14258"

Thanks for mentioning these related works; we have added a discussion of them in the related work section of the revised version.

  1. "The presentation can be further improved. There are typos and errors in the paper, e.g., missing punctuations, broken cross references to the figures, etc."

Thanks for pointing this out. We are sorry for the errors and typos in the paper and the confusion caused. We have fixed the errors and typos we found and repaired the broken references in the revised paper.

We hope our response can resolve your concerns.

Comment

Thanks again for acknowledging the contributions of our work. We hope our previous response could effectively resolve your concerns. If you have any further questions or require additional clarification, please do not hesitate to let us know. We are willing to engage in further discussion for clarification.

Review (Rating: 6)

This paper proposes a unified framework for handling various learning problems with imprecise label configurations. Previous studies have succeeded in dealing with individual forms of label imprecision, but their methods often differ significantly. These methods are tailored to specific forms of imprecise labels, whereas in practical applications annotations can be very complex and may involve multiple coexisting imprecise label configurations. Applying previous methods to situations where, for example, noisy labels and partial labels occur simultaneously can therefore be challenging. To address this problem, the authors present a different perspective, treating the provided imprecise label information as imposing deterministic or statistical constraints on the applicable true labels. The model is then trained to maximize the probability of the given imprecise information. The authors demonstrate the strong performance of their method through comparative experiments on multiple datasets, showcasing the adaptability of the ILL framework in handling a mixture of imprecise label configurations.

Strengths

  1. The paper proposes a unified framework called imprecise label learning (ILL) for handling various configurations of imprecise labels. Compared to previous methods, this framework does not require specific designs for each imprecise label configuration. Instead, it models the imprecise label information using Expectation-Maximization (EM) and treats precise labels as latent variables. This unified framework can adapt to settings involving partial label learning, semi-supervised learning, noisy label learning, and their combinations, demonstrating strong adaptability and flexibility.

  2. Through experiments, it has been demonstrated that ILL outperforms existing setting-specific techniques in handling imprecise labels. It achieves robust and effective performance in various challenging settings, including partial label learning, semi-supervised learning, noisy label learning, and their combinations. This indicates that the framework possesses excellent performance and wide applicability in handling imprecise labels.

  3. The work presented in this paper provides insights for further research in the field of imprecise label learning, unleashing the full potential of imprecise label learning in more complex and challenging scenarios where obtaining precise labels is difficult. This has significant implications for solving real-world problems with inaccurate labels.

Weaknesses

The author proposes a new framework for unified learning with imprecise labels, which can learn from any type of imprecise label. However, the experimental data in the current article are obtained from balanced and relatively small datasets, lacking sufficient experimental evaluation. At the same time, the article does not discuss the computational complexity or scalability of this framework. Although the article mentions the limitations of some previous methods in dealing with specific forms of imprecise labels, it does not provide a detailed discussion on the scalability of this framework in large-scale datasets or complex scenarios.

Moreover, in terms of the presentation of the paper, the author's description of existing methods is not clear enough. The paper uses a large number of formulas and tables for description, lacking visual explanations. The comparison of the experimental results also appears vague and unclear.

Questions

The author proposed a unified framework based on imprecise label learning, demonstrated through experiments the good performance of the unified framework in three different imprecise label scenarios, and its superiority in mixed imprecise label learning tasks compared to current methods capable of handling mixed imprecise labels. The main significance of the unified framework lies in providing a portable and scalable method for addressing various imprecise label tasks, where the use of EM to uniformly model imprecise label information can be extended to multiple scenarios. However, challenges such as potential local optima and computational complexity in the EM method still need to be addressed.

Furthermore, the author noted in the conclusion that experimental data were obtained from relatively small and balanced datasets and that designing different loss functions according to different scenarios was necessary when designing the model. This further limits the portability of this framework. In particular, the handling of probabilistic models for various imprecise label information has a significant impact on the effectiveness of the method, and we must reconsider model design solutions when dealing with different data and tasks.

Comment
  1. "The experimental data in the current article are obtained from balanced and relatively small datasets, lacking sufficient experimental evaluation. At the same time, the article does not discuss the computational complexity or scalability of this framework. Although the article mentions the limitations of some previous methods in dealing with specific forms of imprecise labels, it does not provide a detailed discussion on the scalability of this framework in large-scale datasets or complex scenarios."

Thank you for your valuable feedback.

We compare our method with previous baselines on the standard benchmarks of each setting, which are commonly used in prior works. In NLL, Clothing1M and WebVision are indeed large-scale datasets with realistic instance-dependent noise. Additionally, to address your concern about large-scale datasets, we also provide ImageNet results of our method for semi-supervised learning, which we did not include in the main paper before:

| # Labels | 100k | 400k |
| --- | --- | --- |
| FixMatch | 43.66 | 32.28 |
| FlexMatch | 41.85 | 31.31 |
| FreeMatch | 40.57 | - |
| SoftMatch | 40.52 | 29.49 |
| Ours | 39.41 | 28.03 |

As for the imbalanced and open-set settings, which are beyond the scope of our research purpose of unifying SSL, PLL, and NLL, we would like to point out that applying our framework to these settings is as challenging as applying existing previous ones, because learning on various restricted X is a different problem from learning with various imprecise Y. The study of transferring to restricted data distributions requires a large amount of additional work and is out of the scope of our current study, whose main focus is the unification of learning with different imprecise labels rather than different data. These two settings have been known to be challenging in the ML community for years, and they could perhaps not be solved in one framework. We intend to address them in future work.

Regarding the computational complexity, since our method is quite simple and the loss function is derived in closed form, our algorithm is usually faster than the baselines. We provide a runtime analysis to verify this here:

| Setting | Algorithm | CIFAR-100 Avg. Runtime (s/iter) |
| --- | --- | --- |
| SSL | FreeMatch | 0.2157 |
| SSL | Ours | 0.1146 |
| PLL | PiCO | 0.3249 |
| PLL | Ours | 0.2919 |
| NLL | SOP | 0.1176 |
| NLL | Ours | 0.1021 |

Note that the runtime is averaged over all training iterations; thus, the performance gap is significant. We have included the runtime analysis in the revised Appendix D.8.

  1. "Moreover, in terms of the presentation of the paper, the author's description of existing methods is not clear enough. The paper uses a large number of formulas and tables for description, lacking visual explanations. The comparison of the experimental results also appears vague and unclear."

We appreciate your feedback on the presentation of our paper. We recognize the importance of clear descriptions, especially of the existing, complicated methods. Please understand that this is mainly because we intend to provide a comprehensive evaluation across many different settings to show that our method can cover various imprecise labels. To improve the clarity of the experimental results, we have included an overview visual comparison with a bar plot in the revised Appendix D.2 to demonstrate the significance of our method. The results are computed as the average accuracy over the different settings for each dataset. They show that the proposed method generally achieves performance better than or comparable to recent SOTA baselines.

Comment
  1. "The author proposed a unified framework based on imprecise label learning, ... However, challenges such as potential local optima and computational complexity in the EM method still need to be addressed."

Thank you for acknowledging the significance of our proposed unified framework and for this thoughtful discussion:

  • Local optima: Our main contribution is to propose a unified framework and achieve SOTA performance. We agree that potential local optima is important to fully understand our ILL framework, which can be left for future work. This will include exploring regularization techniques and efficient optimization methods such as progressively determining the labeling of EM to enhance the framework's performance and applicability.
  • Computational complexity: The computational complexity of the EM method can be handled in linear time by using the NFA representation of the imprecise label information I, as we mentioned in Appendix C.3. In fact, since PLL, NLL, and SSL have closed-form derivations from the proposed formulation, the computational complexity is simplified to O(1). As shown in our last response, our method introduces negligible computation for the settings studied. For other complex settings, such as multiple-instance learning, the time complexity would be linear in the input length, which we will address in our future work.
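To illustrate why an NFA representation keeps the computation linear in the input length, consider multiple-instance learning with the bag-level information "at least one instance is positive". A two-state forward pass (a hypothetical sketch, not the paper's implementation) computes this probability in O(T):

```python
import numpy as np

def prob_at_least_one_positive(p):
    """P(bag is positive) via a 2-state NFA forward pass.

    p: per-instance positive probabilities, shape (T,).
    States: 0 = "no positive seen so far", 1 = "a positive has been seen".
    The recursion visits each instance once, so the cost is linear in T.
    """
    alpha = np.array([1.0, 0.0])          # start in state 0
    for pi in p:
        alpha = np.array([
            alpha[0] * (1.0 - pi),        # remain in state 0
            alpha[0] * pi + alpha[1],     # enter or stay in state 1
        ])
    return float(alpha[1])

prob = prob_at_least_one_positive(np.array([0.5, 0.5]))  # equals 1 - 0.5 * 0.5
```

The same forward recursion generalizes: any imprecise-label constraint expressible as a small automaton yields a likelihood computable in one linear pass.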
  1. "Furthermore, the author noted in the conclusion that experimental data were obtained from relatively small and balanced datasets and that designing different loss functions according to different scenarios was necessary when designing the model. This further limits the portability of this framework. In particular, the handling of probabilistic models for various imprecise label information has a significant impact on the effectiveness of the method, and we must reconsider model design solutions when dealing with different data and tasks."

Thank you for highlighting the concerns regarding the portability of our framework. It is indeed important to study different data distributions, such as imbalanced data. We acknowledge that when faced with restricted data such as imbalanced or open-set distributions, extra model design is indeed needed. But please be aware that this holds true for all previous works. Solving these restricted data scenarios can themselves become interesting research directions, such as noisy label learning on imbalanced data [1] or partial label learning on imbalanced data [2], which are beyond the scope of our current work, whose focus is on imprecise labels.

More importantly, however, for imprecise label tasks beyond the settings studied in this work, no model redesign is needed to deal with them. All we need to do is define the NFA, as shown in Appendix C.4 for complicated scenarios, and apply it in our proposed EM formulation. For settings with simple NFA representations, we can similarly derive closed-form loss functions. This is left as future work.

[1] Shyamgopal Karthik et al. Learning From Long-Tailed Data With Noisy Labels.

[2] Feng Hong et al. Long-Tailed Partial Label Learning via Dynamic Rebalancing.

We hope the above response can resolve your concerns and answer your questions.

Comment

Thank you again for your insightful suggestions regarding the improvement and expansion of our work.

In our previous response, we hope we addressed your concerns regarding large-scale experiments and the analysis of time complexity. As for the discussion of potential local optima in the EM algorithm, we did not encounter any convergence issues across the various settings examined in our study. However, we acknowledge the importance of this aspect and agree that incorporating regularization and conducting a theoretical analysis of this topic would be valuable future work. Furthermore, extending our methodology to imbalanced or other specific data distributions presents intriguing directions for future research.

We are open to further discussions and would be more than willing to address any additional questions or clarifications you might have. Your continued feedback is greatly appreciated, and we look forward to any further insights from you.

Review (Rating: 5)

The author proposes a new general framework for learning with imperfect labeling information in the multi-class classification setting, called the Imprecise Label Learning (ILL) framework. The framework works in an expectation-maximization fashion and assumes that imperfect label information I is provided alongside each instance X, while Y is a latent variable. The author shows how to adapt the general ILL framework to different previously considered settings with imperfect information: partial label learning (PLL), semi-supervised learning (SSL), noisy label learning (NLL), and mixed configurations (in the appendix). The framework is compared against many popular baselines that focus on specific configurations, using artificial benchmarks created from CIFAR-10/100 and additional datasets (with even more experiments in different settings provided in the appendix). The proposed framework achieves strong results, often beating all the baselines in the comparison.

Strengths

  1. The paper has strong motivation, creating a unified framework for imprecise labels in multi-class classification that can handle different settings of imperfect label information.
  2. The general framework is nicely rooted in EM.
  3. The proposed framework achieves strong results in empirical comparison.
  4. The empirical comparison includes many different settings, and for each setting, the proposed method is compared with a large number of SOTA baselines.

Weaknesses

  1. The work gives a general framework outline in the main paper and focuses on the loss functions used for different settings (PLL, SSL, NLL), but I am missing the important pieces needed to get a full picture of the proposed approach. It seems to me more like an outline than a concrete solution that can be implemented in different ways (which the authors mention in the paper). Unfortunately, the main paper is very sparse in details about the actual implementation of many of its elements.

  2. More details can be found in Appendix C and D. It is unclear to me whether the NFA from Appendix C.3 is used in the main paper or not. In Appendix D, all modifications to the training are mentioned without explanation or motivation.

  3. NIT: Broken reference

    However, our framework is much simpler and more concise as shown in ??

Questions

In some experiments, the fully supervised model (with correct label information) gets worse results than the ILL framework; in fact, ILL beats the fully supervised model by a lot. At the same time, other solutions never do that, so the natural question is whether there are any other differences between the supervised model and ILL than the application of EM/a different loss. What is the authors' hypothesis as to why it achieves better performance?

Details Of Ethics Concerns

N/A

Comment
  1. "In Appendix D, all modifications to the training are mentioned without explanation and motivation."

We are sorry for this confusion. Again, please understand that this is mainly due to the space limit. We would like to provide more explanation and discussion here.

In fact, the two techniques we mention in Appendix D.1 are common techniques in each setting. For example, in SSL, strong-weak augmentation is an important strategy widely used in existing works such as FixMatch and FlexMatch [1,2]. Thus, adopting strong-weak augmentation is important for achieving better performance in SSL. The situation is similar in the PLL setting (PiCO [3] also uses strong augmentation). Strong-weak augmentation and the entropy loss are also adopted in SOP [4] for NLL. However, we found these techniques are less important for our formulation of NLL. We provide a brief ablation study on the entropy loss for SSL, and on both techniques for NLL and PLL, to support the discussion above.

| SSL | CIFAR-100, 200 labels | STL-10, 40 labels |
| --- | --- | --- |
| Ours | 22.06 | 11.09 |
| Ours (w/o entropy loss) | 22.41 | 11.23 |

From the ablation on the entropy loss for SSL, we can see that the entropy loss has only a trivial effect on performance.

| PLL | CIFAR-10, partial ratio 0.5 | CIFAR-100, partial ratio 0.1 |
| --- | --- | --- |
| PiCO | 93.58 | 69.91 |
| Ours | 95.91 | 74.00 |
| PiCO (w/o strong aug) | 91.78 | 66.43 |
| Ours (w/o strong aug) | 94.53 | 72.69 |
| Ours (w/o entropy loss) | 95.87 | 73.75 |

From the ablation on the entropy loss and strong augmentation for PLL, we observe that strong augmentation is important for both PiCO and our method to achieve better performance, while the entropy loss has minimal effect on our method.

| NLL | CIFAR-10, noise ratio 0.8 | CIFAR-100, noise ratio 0.8 |
| --- | --- | --- |
| SOP | 94.00 | 63.30 |
| Ours | 94.31 | 66.46 |
| SOP (w/o strong aug) | 66.85 | 36.60 |
| Ours (w/o strong aug) | 93.56 | 65.89 |
| SOP (w/o entropy loss) | 93.04 | 62.85 |
| Ours (w/o entropy loss) | 94.16 | 66.12 |

From the ablation on NLL, we can see that SOP's performance depends much more heavily on these techniques, especially strong augmentation. Removing them has a far smaller detrimental effect on the performance of our method.
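For concreteness, the strong-weak augmentation strategy discussed above can be sketched as a FixMatch-style [1] consistency objective: the prediction on the weakly augmented view provides a pseudo-label that supervises the strongly augmented view, masked by a confidence threshold. This is a minimal NumPy sketch with illustrative names (`consistency_loss`, the 0.95 threshold), not our actual training code.

```python
import numpy as np

def softmax(logits):
    # numerically stable softmax over the class axis
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def consistency_loss(weak_logits, strong_logits, threshold=0.95):
    """FixMatch-style objective: the weak-view prediction yields a hard
    pseudo-label that supervises the strong view, masked by confidence."""
    probs_w = softmax(weak_logits)
    conf = probs_w.max(axis=1)          # confidence of the weak-view prediction
    pseudo = probs_w.argmax(axis=1)     # hard pseudo-label
    mask = conf >= threshold            # keep only confident samples
    probs_s = softmax(strong_logits)
    ce = -np.log(probs_s[np.arange(len(pseudo)), pseudo] + 1e-12)
    return (ce * mask).sum() / max(mask.sum(), 1)
```

Lowering the threshold admits less confident pseudo-labels into the loss, which is the quantity-quality trade-off studied in works such as SoftMatch.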

We have included the above ablation study in Appendix D.7.

[1] Kihyuk Sohn et al. FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence

[2] Yidong Wang et al. FlexMatch: Boosting Semi-Supervised Learning with Curriculum Pseudo Labeling.

[3] Haobo Wang et al. PiCO: Contrastive Label Disambiguation for Robust Partial Label Learning.

[4] Sheng Liu et al. Robust Training under Label Noise by Sparse Over-parameterization.

  1. "NIT: Broken reference"

We apologize for the oversight regarding the broken reference and appreciate your attention to this detail. We have carefully reviewed and corrected all references in the revised version of the paper to ensure they are complete and accurate.

  1. "In some experiments, the fully supervised model...why it achieves better performance?"

We appreciate your observation regarding the performance of the ILL framework compared to fully supervised models in Table 1 of PLL. Our hypothesis is that the EM consistency term of ILL may lead to better generalization and robustness. Similar observations can also be found in SSL works such as SoftMatch [1].

[1] Hao Chen, et al. SoftMatch: Addressing the Quantity-Quality Trade-off in Semi-supervised Learning.

If you think our response resolved your concerns and questions, please consider increasing the rating.

Comment

Thanks for your time and effort in reviewing our work. We hope our earlier response has resolved your concerns. We would like to clarify that our proposed method is not merely an outline of previous methods; rather, it introduces its own novelty, especially in its ability to generalize across various settings. As the deadline for the discussion period is approaching, if any aspect of our response or the method itself remains unclear, please do not hesitate to reach out. We are more than willing to provide further clarification.

Comment

I thank the authors for the exhaustive response and apologize for joining the discussion so late, but there is a lot to process in terms of other reviewers' comments and authors' responses to their reviews, which I've read carefully. Below I include some additional comments:

Presentation: While the new additions to the manuscript are welcome and improve the paper's readability, and the authors' comments made the paper clearer for me, I believe the problem of the confusing presentation remains and was not sufficiently addressed. What I believe is needed is a clear outline of what is calculated, and how, for each derived setting (PLL, NLL, SSL) and in the general case, presented for example in the form of pseudocode. While the authors include the source code, which is nicely structured and formatted, I believe the paper itself should give a clear explanation of the procedure that was implemented.

Novelty: I cannot evaluate the novelty of the approach in comparison to the previous works, as I do not know some of them.

Empirical evaluation: I believe the evaluation, based mostly on artificial datasets created from CIFAR-10/100, is sufficient for this work. It would be interesting to see how the closed forms for PLL, NLL, and SSL compare to the general approach using the NFA.


For the moment, I'm keeping my score as it is.

Comment

We appreciate your valuable feedback and are sorry for the confusion caused. We hope the response below makes our method clearer.

  1. "The work ... it seems to me more like the outline than a concrete solution that can be implemented in different ways (what authors mention In the paper)."

Sorry for the confusion caused. Our work is certainly not just an outline; it offers a novel design and concrete solutions that achieve promising results. We intended to demonstrate the universality and generality of the proposed method, so we had to include many different settings in the paper, which may have obscured some details of our proposed framework. Here, we highlight our contributions more clearly:

  • At a conceptual level, we propose a unified EM framework that accommodates any imprecise labels by treating the imprecise information $I$ abstractly. We demonstrate its effectiveness in various settings, including PLL, SSL, NLL, and mixtures of them.
  • At a solution level, thanks to the unified formulation (Eq. 5), the formulations and algorithms we derive for each setting are clearly different from previous baselines. The framework can indeed be implemented in different ways when handling different types of $I$, but in general all instantiations share the same EM formulation, obtained by replacing $I$ with the actual imprecise labels.
  • Comparison with existing baselines: previous baselines in each individual setting can (usually) be treated as simplified versions of our framework. This is probably why our method looks like an outline of previous methods, while it is certainly not. On the benchmarks for each setting, our unified framework achieves promising results. More importantly, without changing the framework, our method handles more complicated scenarios robustly, where previous specifically designed methods fall short in performance, as shown in Table 4.
  • Extension to broader scenarios: our framework naturally extends to more imprecise configurations beyond the settings discussed in the main paper. This is achieved mainly through the NFA modeling of the imprecise label $I$: an NFA can represent each form of imprecise label setting, and the proposed EM framework can be run on the trellis of the NFA with linear time complexity. We omit the NFA details in the main paper because the studied settings have rather simple NFAs, so their formulations can be derived in closed form.
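As a minimal illustration of the shared EM formulation (Eq. 5), the E-step restricts the current model's prediction to the labelings permitted by $I$ and renormalizes. The sketch below uses hypothetical names (`e_step_posterior`, `allowed_mask`) and is not the code from our submission.

```python
import numpy as np

def e_step_posterior(model_probs, allowed_mask):
    """Sketch of the E-step posterior P(Y | I, X; theta_t).

    model_probs:  (N, C) class probabilities from the current model theta_t.
    allowed_mask: (N, C) 0/1 mask of labels permitted by the imprecise
        information I: the candidate set in PLL; all-ones for unlabeled data
        in SSL/NLL, which recovers the model's own soft prediction."""
    masked = model_probs * allowed_mask
    return masked / masked.sum(axis=1, keepdims=True)
```

The per-setting losses then differ only in which mask $I$ induces, which is why the same formulation covers PLL, SSL, NLL, and their mixtures.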
  1. "Unfortunately, the main paper is very sparse in details about the actual implementation of many of its elements."

Thank you for your feedback. We are sorry if we missed any details in the main paper, as we had to move a significant amount of detail to the appendix due to the space limit. Please understand that our intention was to present the framework in a manner that highlights its adaptability and generalizability across different settings.

In the revised version of the paper, we have included more concrete and detailed derivations of how the framework is applied in various scenarios in Appendix C, thereby providing a more comprehensive understanding of its practical implementation. The source code is also included in our submission for reproducibility. We hope the above revision provides more details of the proposed framework and its application to each setting.

  1. "More details can be found in Appendix C and D. It is unclear to me when NFA from Appendix C.3 is used in the main paper or not."

Thanks for pointing out the NFA. Our idea is that any imprecise label information $I$ can be represented as an NFA, and the trellis expanded from the NFA can be used to compute the expectation over $P(Y|I,X;\theta^t)$ in the EM formulation, as shown in Appendices C.1 and C.2. In general, the NFA representation of $I$ encodes all possible labelings imposed by $I$, on which we can conduct the EM. The benefit of designing the NFA representation is to demonstrate the adaptability of the proposed method: any type of imprecise label can be represented as an NFA and thus accommodated in our framework.

In the settings studied in this paper, the NFA representation of PLL has the candidate labels as transition paths and each sample as a state. The NFAs of NLL and SSL simply have all labels as transition paths and each sample as a state, since the imprecise label information $I$ here does not constrain the labeling. Because these settings have relatively simple NFA representations, we can directly derive the closed-form formulation of $P(Y|I,X;\theta^t)$ from Equation 5, as we do in the main paper. The framework can easily be extended to other settings, such as multiple-instance learning or learning with counts/proportions, by constructing a more complex NFA for each case and actually running a forward-backward algorithm on it to compute the EM steps.
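A generic forward-backward pass over such a label trellis can be sketched as follows. This is a minimal NumPy sketch with hypothetical names (`trellis_posteriors`, `emit`, `allowed`), assuming per-step emission probabilities from the model, a uniform start distribution, and a 0/1 transition matrix expanded from the NFA; it is not the paper's implementation.

```python
import numpy as np

def trellis_posteriors(emit, allowed):
    """Forward-backward over an NFA trellis.

    emit[t, s]:    model probability of label state s at step t.
    allowed[s, r]: 1 if the NFA permits the transition s -> r, else 0.
    Returns gamma[t, s] = P(state s at step t | NFA constraints),
    computed in time linear in the number of steps."""
    T, S = emit.shape
    alpha = np.zeros((T, S))
    beta = np.zeros((T, S))
    alpha[0] = emit[0]                  # uniform start distribution absorbed
    for t in range(1, T):               # forward pass
        alpha[t] = emit[t] * (alpha[t - 1] @ allowed)
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):      # backward pass
        beta[t] = allowed @ (emit[t + 1] * beta[t + 1])
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)
```

When `allowed` is all-ones (no constraint, as in SSL/NLL), the posterior reduces to the per-step normalized model prediction, which matches the closed-form derivation in the main paper.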

We have explicitly included the above discussion in the NFA section in Appendix C.6 and hope it makes the NFA interpretation clearer.

Comment

We thank all reviewers for their effort in reviewing our paper and for their suggestions for improving it. We fully understand the workload and burden of reviewing our work, since it relates extensively to many areas: PLL, SSL, NLL, etc. Therefore, we have made detailed responses to each reviewer and significant revisions, along with the source code to reproduce our results.

As a general response, we would like to highlight the following aspects.

Our contributions

We would like to further clarify our contributions and distinguish our method from prior art here:

  • The significance and novelty of our method lie mainly in the "unification" framework for learning with any imprecise labels. We are aware that many previous works present similar interpretations of weak supervision by treating the ground truth as a latent variable, as mentioned by Reviewer yf7f. However, our method is distinguished from prior art in that we use the complete EM formulation for every setting we study, including PLL, NLL, SSL, and mixtures of them.
  • Our method thus generalizes to many imprecise label settings. However, it is certainly not an outline of prior art, as suggested by Reviewer AiQR. The connection with NFAs lets our framework generalize beyond the settings studied in this paper, to multiple-instance learning, learning with counts, etc. This connection distinguishes us from previous works and summarizes different settings in one framework.
  • Besides, because the NFAs for the settings studied in this paper simplify, we can derive closed-form loss functions, as we do in the main paper, with $O(1)$ time complexity instead of actually running the NFA. This is the main reason the NFA part appears in the Appendix.
  • We demonstrate the effectiveness and robustness of our method on extensive benchmarks for each individual setting and for mixture settings, including large-scale datasets. No prior art handles these settings universally.

Discussion on related work

We would like to thank each reviewer for providing additional related works that make this paper more comprehensive, as it indeed relates to many previous efforts in each research area. That being said, please forgive us if we have missed any references; we are trying our best to discuss more. All the references suggested by reviewers have been added in the revised version. We also understand that the judgment of our contributions might be influenced by related work, but to date our conclusion is that, even with the additional related work discussed, ours is still the first effort to unify the imprecise label settings in one framework, with simple implementations and robust performance, especially in the practical mixture settings.

Lack of implementation details

Finally, some reviewers may have found that certain details are missing from the main paper and presented only in the appendix. We apologize for this inconvenience: since this work has so much to cover (unifying settings, explaining the framework's contribution, and detailed experiments in different settings), we cannot include more in the main paper due to the space limit. On the one hand, the computation and implementation details are included in the appendix, and we hope they are now clearer; on the other hand, most of the implementations are fairly straightforward, as they all stem from the same EM framework. We believe the presentation has been improved according to your precious comments, and we will keep polishing and improving the paper.

Comment

Dear Reviewers and Authors,

The discussion phase ends soon. Please check all the comments, questions, and responses and react appropriately.

This is extremely important for this paper as it has received very extreme ratings.

Thank you!

Best, AC for Paper #3986

AC Meta-Review

The paper unifies learning with imprecise labels (e.g., partial labels or noisy labels) by framing the problem as maximum likelihood estimation over latent ground truth labels.

As pointed out by reviewers, the use of such a framework is not very surprising; nevertheless, presenting a unified framework in one paper is a very welcome contribution. As such, however, derivations and references to related methods need to be done very carefully, without any confusion. This is unfortunately not the case for this paper. The derivations contain many gaps, and particular steps are not always properly motivated. Some symbols and concepts are not well explained, requiring the reader to consult the referenced papers. This should not be the case for a paper published at a top machine learning conference. The paper presents only a very general description of the approach without giving details of the final methods introduced by the authors, leaving the reader with many questions. The structure of the paper could also be improved. For example, the authors could shorten the related-work section by omitting the details of existing methods, extend the derivation of their unified framework for each considered task, and make connections to related work after obtaining the final algorithms. In this way, the authors would gain additional space to carefully derive and motivate their methods.

Examples of confusing steps:

  • It is not clear why $\log P(X; \theta)$ is not included in the second line of (5).
  • $Y$ is defined as a random vector of latent training ground-truth labels, but in Eq. 6 it seems to be a single label. The same concerns $X$: it should be a set of training examples. The jump to the last equation is hard to understand, as it is not clear what the relation is between $X$ and both $A_{s}(\mathbf{x})$ and $A_{w}(\mathbf{x})$. Similar comments apply to (7) and (8).
  • The derivation in (11) also seems to have several problems; for example, it is not clear how $\log P(I \mid X, Y; \theta)$ could be extracted from the sum over $Y$, since the probability is conditioned on $Y$. Moreover, the sentence "we replace $Y$, $X$, and $S$ with $y$, $x$, and $s$" again mixes vectors of labels/instances with particular labels/instances.

Why Not A Higher Score

Despite the relatively high average, the reviewers have raised several critical comments. As AC, I therefore checked the paper myself and found several serious gaps, and I recommend rejecting the paper.

Why Not A Lower Score

N/A

Final Decision

Reject