Imprecise Label Learning: A Unified Framework for Learning with Various Imprecise Label Configurations
Abstract
Reviews and Discussion
This paper introduces imprecise label learning (ILL), a framework that unifies learning with various imprecise label configurations, such as partial label learning, semi-supervised learning, noisy label learning, and mixtures of these settings. The authors propose an EM-based method with closed-form learning objectives to handle the problem.
Strengths
● The problem of learning with imprecise labels is of great importance, and the idea of unifying different settings is interesting, impressive, and instructive.
● The paper is well-written.
● While some details are omitted in the main text, the comprehensive appendix furnishes ample information. Such thorough work is highly appreciated.
Weaknesses
Given the extensive content, the main text of the paper omits certain details, which makes some parts, e.g., Section 3.2, not immediately straightforward to follow.
Questions
NA
Limitations
NA
We thank the reviewer for the effort spent reviewing this paper and for acknowledging our work. We address the concerns as follows.
Given the extensive content, the main text of the paper omits certain details, which makes some parts, e.g., Section 3.2, not immediately straightforward to follow.
Thanks for pointing this out. Due to the space limit, we had to move the derivation details of Section 3.2 into the Appendix. We will try to make this clearer in our revised paper.
If you find our revision and response helpful, please consider raising your score to better support our work.
Thanks for the response, and I maintain my original score.
Thank you again for your efforts and suggestions on this paper!
This paper introduces a framework that unifies various imprecise label configurations, with EM-based modeling of the imprecise label information. The framework is demonstrated to be adaptable to partial label learning, semi-supervised learning, and noisy label learning, as well as combinations of the above. The experimental results show that the framework surpasses existing methods in various settings.
Strengths
- This paper proposes a unified framework covering various imprecise label learning settings, reducing the need for separate designs and solutions for each type of label imprecision.
- Promising performance is achieved in individual settings and mixture settings with the unified framework.
- The proposed method is highly versatile and can be applied to the setting of a mixture of imprecise labels with robust performance.
- The framework demonstrates scalability on larger and more complex datasets.
Weaknesses
- The implementation of EM over all possible labelings may increase the computation time.
- More related works need to be discussed. The authors consider the ground-truth or Bayes label distribution as the latent variable and leverage variational inference for estimation. I am not sure this strategy is novel enough in the field of variational inference. I suggest the authors add more related works to highlight the novelty of their technique.
- The author should give an explanation of why utilizing an EM framework optimizes the variational lower bound. What is its advantage?
Questions
- Can the framework be extended to handle other forms of weak supervision (such as imbalanced noisy label learning) and how?
Limitations
- Computational efficiency of EM, regarding the weakness above. It would be better if the authors provided (theoretical or empirical) analyses regarding the computation of the algorithm.
We thank the reviewer for the effort spent reviewing this paper and for acknowledging our work. We address the concerns as follows.
The implementation of EM over all possible labelings may increase the computation time.
For all the weak supervision settings studied in this paper, our method introduces negligible additional computation time. This is because, for all the weak supervision types considered in this paper, the latent ground-truth variables are conditionally independent given the weak supervision, so the E-step factorizes over instances and has a closed form. A runtime analysis can be found in Table 21 of our Appendix, where the proposed method is usually faster than previous SOTA methods.
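To make the closed-form nature concrete, here is a minimal sketch (our own illustration under the conditional-independence assumption, not the paper's actual implementation) of the per-sample E-step for the partial-label case, where the posterior is simply the model's prediction renormalized over the candidate set:

```python
import torch.nn.functional as F

def partial_label_posterior(logits, candidate_mask):
    """Closed-form per-sample E-step for partial labels.

    logits:         (batch, num_classes) model outputs
    candidate_mask: (batch, num_classes), 1 for candidate labels, 0 otherwise

    Under conditional independence, the posterior over the latent true label
    is just the predicted distribution restricted to the candidate set and
    renormalized, so no enumeration of joint labelings over the dataset is needed.
    """
    probs = F.softmax(logits, dim=-1) * candidate_mask
    return probs / probs.sum(dim=-1, keepdim=True).clamp_min(1e-12)
```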
More related works need to be discussed. The authors consider the ground-truth or Bayes label distribution as the latent variable and leverage variational inference for estimation. I am not sure this strategy is novel enough in the field of variational inference. I suggest the authors add more related works to highlight the novelty of their technique.
Thanks for pointing this out. As discussed in our paper (lines 68-78), an EM solution for weak supervision has long been desired in the community, and there are earlier attempts and relevant methods.
However, we would like to highlight that our method is the first that is fully based on EM, without other model or loss designs, and that universally handles multiple and mixed types of weak supervision.
This distinguishes us from previous works.
We have also added the following relevant works on VI and EM for weak supervision to the discussion in our paper. Per the discussion above, these methods do not rely on EM itself to achieve reasonable performance.
[1] Xu, N., Qiao, C., Geng, X., & Zhang, M. L. (2021). Instance-dependent partial label learning. Advances in Neural Information Processing Systems.
[2] Garg, A., Nguyen, C., Felix, R., Do, T.T. and Carneiro, G., 2023. Instance-dependent noisy label learning via graphical modelling. In Proceedings of the IEEE/CVF winter conference on applications of computer vision.
[3] Elhassouny, A. and Idbrahim, S., 2022, December. Deep learning with noisy labels: Learning True Labels as Discrete Latent Variable. In 2022 14th International Conference on Computational Intelligence and Communication Networks.
The author should give an explanation of why utilizing an EM framework optimizes the variational lower bound. What is its advantage?
The EM algorithm provides a robust method for handling latent variables by decomposing the maximization of the likelihood function into an E-step and an M-step. The EM algorithm also has a provable convergence property. Besides, the EM framework is inherently designed to optimize the variational lower bound, as we show in Appendix C.1.
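For completeness, the standard argument reads as follows (in generic notation, not tied to the exact symbols of Eq. 5):

$$\log P(X, I; \theta) = \mathbb{E}_{Q(Y)}\!\left[\log \frac{P(X, I, Y; \theta)}{Q(Y)}\right] + \mathrm{KL}\big(Q(Y)\,\|\,P(Y \mid X, I; \theta)\big) \;\geq\; \mathbb{E}_{Q(Y)}\!\left[\log P(X, I, Y; \theta)\right] - \mathbb{E}_{Q(Y)}\!\left[\log Q(Y)\right].$$

The E-step sets $Q(Y) = P(Y \mid X, I; \theta_t)$, which makes the bound tight at the current parameters; the M-step maximizes the bound over $\theta$. Each EM iteration therefore cannot decrease the marginal log-likelihood, which is the convergence property mentioned above.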
There are also possible improvements over EM, such as ECM [1] and PX-EM [2], which could be incorporated into our method for further gains. Given the scope of this study, this is left as future work.
[1] X. L. Meng and D. B. Rubin. Maximum likelihood via the ECM algorithm: a general framework. Biometrika, 1993.
[2] C. Liu, D. B. Rubin, and Y. N. Wu. Parameter Expansion to Accelerate EM: The PX-EM Algorithm. Biometrika, 1998.
Can the framework be extended to handle other forms of weak supervision (such as imbalanced noisy label learning) and how?
Yes. The proposed unified framework can be easily extended to other weak supervision settings. In this study, we mainly consider three types of weak supervision, where the imposed ground-truth label is conditionally independent given the weak supervision. Our framework can be extended to similar settings without any change, such as positive-unlabeled learning.
As for imbalanced noisy label learning, it is a configuration regarding changes in the label distribution itself rather than in the weak supervision. While our method is not specifically designed for such settings (a change in the label distribution), we showed in Table 3 that our method achieves comparable results on imbalanced noisy label datasets (Clothing1M). For more imbalanced noisy settings, we provide an additional comparison here, following the experimental setting of [1], where we introduce a 5:1 head-to-tail class imbalance into noisy CIFAR-10 and CIFAR-100 with a noise ratio of 0.2, as shown below.
| Method | CIFAR-10 (noise ratio 0.2, 5:1 imbalance) | CIFAR-100 (noise ratio 0.2, 5:1 imbalance) |
|---|---|---|
| CE | 87.1 | 64.9 |
| DivideMix | 93.9 | 65.0 |
| ELR+ | 88.2 | 59.6 |
| ULC | 95.0 | 75.5 |
| Ours | 94.8 | 73.5 |
From the results, we can see that previous methods specifically designed for noisy label learning (DivideMix and ELR+) usually present large performance degradation in imbalanced settings, whereas our method presents performance comparable to the method specifically designed for this setting (ULC), demonstrating the extensibility of our method. Our method can also be combined with other designs or techniques for handling imbalanced data for further improvements.
[1] Huang, Y., Bai, B., Zhao, S., Bai, K. and Wang, F., 2022, June. Uncertainty-aware learning against label noise on imbalanced datasets. In Proceedings of the AAAI Conference on Artificial Intelligence.
Computational efficiency of EM, regarding the weakness above. It would be better if the authors provided (theoretical or empirical) analyses regarding the computation of the algorithm.
We have included an empirical analysis of computational efficiency in our Appendix (Table 21). The results demonstrate that our proposed method presents better efficiency compared to previous SOTA methods.
If you find our revision and response helpful, please consider raising your score to better support our work. Also, please let us know if there are further questions.
The author has addressed my concerns, and I maintain my original decision to accept.
Thanks for the constructive suggestions!
The article addresses the challenge of learning with imprecise labels in machine learning tasks, such as noisy or partial labels. Traditional methods often struggle with multiple forms of label imprecision. The authors introduce a novel framework named Imprecise Label Learning (ILL) that serves as a unified approach to handle various imprecise label scenarios. ILL employs the expectation-maximization (EM) technique, viewing precise labels as latent variables and focusing on the entire potential label distribution. The framework demonstrates adaptability to different learning setups, including partial label learning and noisy label learning.
Strengths
- The article is well-written, with clear and concise language that effectively conveys the main ideas and contributions of the research. Also, comprehensive derivations of the loss functions for the three imprecise annotation configurations, derived from Equation 5, are given, which ensures clarity and thorough understanding for readers.
- The article offers a comprehensive solution to the prevalent challenge of imprecise annotations, enhancing the adaptability and applicability of machine learning models.
- The inclusion of experimental results across multiple settings provides empirical evidence of the framework's robustness and superior performance.
Weaknesses
- The article's innovation is limited, as the approach of considering ground-truth labels or the Bayes label distribution as latent variables and using variational inference for approximation in weakly supervised learning is already a common method [1-2], which suggests that the presented techniques may not be as novel as claimed.
[1] Xu, N., Qiao, C., Geng, X., & Zhang, M. L. (2021). Instance-dependent partial label learning. Advances in Neural Information Processing Systems, 34, 27119-27130.
[2] Yao, Y., Liu, T., Gong, M., Han, B., Niu, G., & Zhang, K. (2021). Instance-dependent label-noise learning under a structural causal model. Advances in Neural Information Processing Systems, 34, 4409-4420.
- Some important baselines should be compared, such as [1,2] in SSL.
[1] Nguyen, Khanh-Binh, and Joon-Sung Yang. "Boosting Semi-Supervised Learning by bridging high and low-confidence predictions." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
[2] Schmutz, Hugo, Olivier Humbert, and Pierre-Alexandre Mattei. "Don’t fear the unlabelled: safe semi-supervised learning via debiasing." The Eleventh International Conference on Learning Representations. 2022.
Questions
- Given that the method of using ground-truth labels or Bayes label distribution as latent variables coupled with variational inference in weakly supervised learning is highlighted in prior works, how does the presented framework distinguish itself or advance beyond these existing approaches in terms of innovation or application?
- Could you provide a detailed analysis explaining why the unified framework demonstrates superiority over specifically designed approaches?
Limitations
N/A
We thank the reviewer for the time spent reviewing this paper and for the suggestions on relevant works. We address the concerns raised as follows.
The article's innovation is limited...which suggests that the presented techniques may not be as novel as claimed.
Thanks for mentioning these relevant works. We need to emphasize that, although there are existing methods that share a similar variational approximation for handling weak supervision (as we also discussed in our paper), there are fundamental differences and improvements in our method:
- The algorithm/loss level: We provide a new perspective, deriving the loss objectives directly from the EM view of various kinds of weak supervision. Most existing works use EM/ELBO to explain their proposed methods, which means their methods do not originate from EM and usually rely on other sophisticated designs. Instead, our framework only uses the derived loss functions, without other sophisticated designs, while achieving comparable performance.
- The general nature of our framework: Ours is the first framework that can naturally and universally handle mixtures of various weak supervision settings.
In summary, most weak supervision methods can be explained/understood from the EM perspective; however, our method is naturally derived from and relies only on EM, and is applicable to various settings. We will include the mentioned works in the discussion of our revised paper.
[1] Xu, N., Qiao, C., Geng, X., & Zhang, M. L. (2021). Instance-dependent partial label learning. Advances in Neural Information Processing Systems, 34, 27119-27130.
[2] Yao, Y., Liu, T., Gong, M., Han, B., Niu, G., & Zhang, K. (2021). Instance-dependent label-noise learning under a structural causal model. Advances in Neural Information Processing Systems, 34, 4409-4420.
Some important baselines should be compared, such as [1,2] in SSL.
Thanks for mentioning these relevant baselines. We omitted some baseline comparisons in the paper mainly due to the space limit. More baselines for all settings can be found in our Appendix. For the mentioned SSL baselines, we provide a brief comparison of error rates below:
| Method | CIFAR-100 400 labels | STL-10 40 labels |
|---|---|---|
| [1] | 46.12 | 28.66 |
| [2] | 44.98 | 29.42 |
| Ours | 38.24 | 20.19 |
Note that this comparison is based on the TorchSSL benchmark, instead of USB. We will include these baseline comparisons in our revised paper.
[1] Nguyen, Khanh-Binh, and Joon-Sung Yang. "Boosting Semi-Supervised Learning by bridging high and low-confidence predictions." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
[2] Schmutz, Hugo, Olivier Humbert, and Pierre-Alexandre Mattei. "Don’t fear the unlabelled: safe semi-supervised learning via debiasing." The Eleventh International Conference on Learning Representations. 2022.
Given that the method of using...beyond these existing approaches in terms of innovation or application?
As discussed earlier, our method distinguishes itself from existing approaches in the following aspects:
- Earlier attempts usually require additional assumptions (lines 68-71) to handle a single type of weak supervision, whereas ours does not, and it naturally and universally handles various types of weak supervision.
- Unlike most previous methods, our proposed method is naturally derived from EM and does not require other sophisticated designs (model architecture, extra loss functions, etc.). This demonstrates the scalability of our method.
- The proposed method naturally extends to mixed weak supervision settings and also other weak supervision settings not studied in the paper. We need to highlight that this is the first method that handles mixed weak supervision settings robustly using a unified framework.
Could you provide a detailed analysis explaining why the unified framework demonstrates superiority over specifically designed approaches?
Thank you for this insightful question. Our unified framework demonstrates superiority over specifically designed approaches due to several key factors:
- Simplicity and generalizability: Unlike many existing methods that require complex and specifically tailored architectures or additional loss functions for each type of weak supervision, our approach leverages a unified EM-based framework that is naturally derived without needing additional sophisticated design elements. This simplicity allows for a more straightforward implementation while still achieving competitive or superior performance across different types of weak supervision settings.
- Versatility in handling mixed supervision: Many existing approaches are tailored to handle only a single type of weak supervision, often requiring additional assumptions or modifications to the model when dealing with different forms of supervision. In contrast, our framework is the first to naturally and universally handle mixed types of weak supervision without requiring any additional assumptions or modifications. This versatility makes our method more robust and adaptable to various real-world scenarios where mixed supervision is common.
- Scalability and robustness: The unified nature of our framework ensures that it can be easily extended to new and unseen weak supervision settings without needing to redesign or retrain the model from scratch. This scalability is a significant advantage over specifically designed approaches, which often struggle to maintain performance when extended to different or more complex weak supervision scenarios.
- Theoretical foundation: Our method is firmly grounded in the Expectation-Maximization (EM) algorithm, providing a solid theoretical foundation that ensures the method's validity and robustness.
If you find our revision and response helpful, please consider raising your score to better support our work.
Thanks for your reply, which addressed my concerns and I decided to raise my score.
The paper provides a unified view on various imprecise-label learning frameworks, such as semi-supervised, partial-label, or noisy-label learning, through the lens of the expectation-maximization algorithm. In addition to unifying these existing setups, EM naturally allows treating combinations of the above setups. Experiments show that the proposed method compares favourably with existing methods specialized to just one setup.
Strengths
The proposed method generalizes to a wide range of settings and performs on par with or better than more specialized algorithms in most of the evaluations.
Weaknesses
- Neither the abstract nor the introduction makes it clear that the paper is concerned purely with multi-class classification with deterministic labels y = f(x).
- This stands in contrast to the critique in l. 70 regarding competing methods, "[...] they usually require additional assumptions and approximations on the imprecise information for learnability": Assuming deterministic labels is a very helpful simplification for noisy labels, as, together with some upper bound on the noise rate, it restores identifiability of the model.
- A serious problem with the writing of the paper is that, for an attempt at introducing a probabilistic model that can then be used in EM, it does not actually write down the actual probabilistic models it considers. In this sentence (l. 158), the paper is extremely vague:
If I represents partial labels, then P(Y|I) would have non-zero value over the candidate labels, and be 0 elsewhere. When I represents a set of noisy labels, P(Y|I) would represent the distribution of the true labels, given the noisy labels. When I does not contain any information, i.e., unlabeled data, Y can take any value.
for partial labels, is the underlying assumption that P(Y|I) is uniform over all candidates? How about for unlabeled data?
for noisy labels, you either need to assume a fixed noise model, or a parametric one whose parameters are part of θ.
l. 170: "Note that P(X; θ) is omitted from Eq. (5) since P(X) does not rely on θ." Why? Is this a new assumption? Earlier in the paper, θ was introduced as the modelling parameter of the generic joint distribution: "Let P(X, I; θ) represent a parametric form for the joint distribution of X and I." The footnote claims "the actual parameters θ may apply only to some component such as P(Y|X; θ) of the overall distribution", but to me, "may" here means that in some situations such a restriction is possible, whereas I guess the intended meaning is that the model is always supposed to be restricted in this way, with P(X) not depending on θ?
l. 174: "For independent instances setting again": this reads as if the previous section had situated the paper in the independent-instance setting, whereas this is the first time the topic comes up.
The property of the second term log P(I|X, Y; θ) is dependent on the nature of imprecise label I. If I contains information about the true labels Y, such as the actual labels or the label candidates, it can be reduced to P(I|Y), i.e., the probability of I is no longer dependent on X or θ and thus can be ignored from Eq. (5).
This is not correct; the mere fact that I contains information about Y does not allow the probability to be simplified.
Equation (6) seems to appear out of nowhere: even following the derivations in C.2, A_w(x) and A_s(x) just appear out of thin air in these equations.
l. 223: "Things become more complicated here since the noisy labels $\hat{Y}$ do not directly reveal the true information about $Y$". I'm not sure what this sentence is trying to say. $\hat{Y}$ should have some information about $Y$, otherwise learning is impossible; of course, it doesn't have the full information, but neither do partial labels, so I don't see how this changes the situation compared to the preceding paragraphs.
The (unreferenced in the main paper) section D.7 in the appendix claims
Since the settings studied in this work has loss functions derived as close-form from Eq. (5), the time complexity can be viewed as O(1). Thus our method in general present faster runtime without complex design such as contrastive loss.
This argument doesn't make any sense. Solving a system of linear equations can be written in closed form, yet this is not an O(1) operation.
Overall, after reading the paper, I have almost no idea what the proposed method actually does. Equations (6), (7), (8) contain data augmentations, which I suspect may be important for attaining the performance reported in the paper(?), yet they are not really discussed as part of the proposed method. There are numerous inaccuracies, gaps, and I think even some errors (not fundamental, I think, but in the way the writing describes the math). It is certainly possible that I am misunderstanding something here, but as I see it right now, the paper is not in a shape in which it should be published.
Typos/grammar: l. 87: "our proposed method generalise and subsumes"; l. 148: "we consider all possible labeling along".
Questions
See Weaknesses.
Limitations
The authors adequately concede in Section 5 that the scalability of the algorithm hasn't been verified yet; however, sentences like
While there exist earlier attempts on generalized or EM solutions [...] thus presenting limited scalability on practical settings.
(l. 68-71) seem to suggest that the presented algorithm has this not-yet-verified scalability.
We thank the reviewer for the detailed and informative review. We recognize several of the concerns expressed by the reviewer and agree that addressing these will indeed enhance the paper. We have attempted to do so below. We hope this is satisfactory.
Neither abstract...
The setting studied in this paper is multi-class classification, as is common for almost all reported work on weak supervision, especially baselines covered in this paper.
This stands in contrast...it restores identifiability of the model.
Please note that the model is not dependent on the assumption of a unique label. In the general case we are given a label vector. For unique labels, this vector is one-hot. For uncertain labels, it will be dense. The loss computation remains the same in both cases, and the EM solution still holds.
- Specifically, for noisy labels, label noise can be captured by a noise-transition matrix (Eq. 8). The noisy label is simply generated from the true label through this transition matrix. The EM solution works with the noisy labels to infer the true labels and the model parameters under the transition matrix. The framework holds regardless of whether the label vector is one-hot or dense.
- Even with one-hot labels the model is non-identifiable. Consider any permutation matrix: permuting the label classes (and the transition matrix accordingly) yields the same likelihood, i.e., the solution is only determined up to a permutation. Consequently, the model is not identifiable. This holds regardless of whether the label vector is one-hot or dense.
A serious...vague:
As explained above, we model P(X, I; θ) in Eq. 4. The missing data here are the true labels Y, as is evident from Eq. 4 and 5. This level of description fully specifies the EM problem, although for the specific update rules the form of the parameterization becomes important to state. In our case, we use the decomposition of Eq. 5, wherein we model P(Y|X; θ) using neural nets, leading to the solutions of Eq. 6-10.
For partial labels...
The EM solution generalizes to any P(Y|I).
Specifically, we study two types of partial labels: instance-independent and instance-dependent ones. The label candidates of the first follow the uniform assumption; the second follow an instance-level generation strategy. On both types of partial labels, our method presents superior performance (Tab 1 and 9).
unlabeled data…
For unlabeled data, the distribution over Y is also learned (Eq. 7), as in the partial label case. Our method performs robustly with different amounts of unlabeled data (Tab 2). Note that our solution does not preclude the use of a known or assumed prior (e.g., uniform), in which case this distribution would not be learned.
noisy labels…
For noisy labels, the distribution of the noisy label given the true label is modeled through a noise transition matrix, which is learned, as explained in Eq. 8 and 9.
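As a rough illustration of how a learnable transition matrix can enter the noisy-label likelihood (our own generic forward-correction sketch, not the paper's exact Eq. 8-9; the softmax row parameterization and initialization are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoiseTransition(nn.Module):
    """Learnable class-conditional noise model: row y gives P(noisy label | true label y)."""

    def __init__(self, num_classes):
        super().__init__()
        # Unconstrained logits, initialized near the identity so the model
        # starts from an (almost) noise-free assumption.
        self.logits = nn.Parameter(torch.eye(num_classes) * 4.0)

    def forward(self, clean_probs):
        # clean_probs: (batch, C) classifier estimate of P(Y | X)
        T = F.softmax(self.logits, dim=-1)   # (C, C), each row on the simplex
        return clean_probs @ T               # P(noisy | X) = sum_y P(y | X) P(noisy | y)

# usage sketch: negative log-likelihood of the observed noisy labels
# noisy_probs = transition(F.softmax(model(x), dim=-1))
# loss = F.nll_loss(torch.log(noisy_probs + 1e-12), noisy_targets)
```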
l. 170: Note that P (X; θ) is omitted...assumption?
The primary component of the model learns the mapping from X to Y, since this is not a generative model for X. As such, P(X) need not be modeled. This assumption is common in the training of discriminative models.
P(X, I; θ) in turn may be decomposed into components such as P(I|X; θ) and P(X), which permits modeling the distributions of only some components of the joint. The "may" merely includes this possibility.
This is not correct...simplified.
As stated in the paper, whether P(I|X, Y; θ) reduces to P(I|Y) depends on the nature of I. If I is directly derivable from Y, i.e., I = g(Y) for a deterministic g, then the reduction holds, e.g., for the actual labels or the label candidates. We have also considered other settings where I also depends on X or on learnable parameters, where this is not reducible, as in Tab 3, 9, and 15. E.g., for noisy labels, it is treated as a (learned) noise transition model.
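For concreteness, the reduction can be stated as follows (generic notation; g here is a hypothetical deterministic map from labels to annotations):

$$I = g(Y) \;\Longrightarrow\; P(I \mid X, Y; \theta) = \mathbb{1}\!\left[\, I = g(Y) \,\right],$$

which involves neither $X$ nor $\theta$, so it contributes only a constant to the objective in Eq. (5) and can be dropped; when $I$ additionally depends on $X$ or on learnable parameters (e.g., an instance-dependent noise model), the term must be kept and modeled.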
Equation (6) seems to...equations.
A_s(x) and A_w(x) represent strong and weak augmentations, respectively, and are actually defined early in the paper before Eq. 1 (L101-102). Besides, in C.2, going from Eq. 13 to 14, what we did is replace the random variables with the actual data notation to make it a true loss function. If the reviewer thinks this is still vague, we will try to make it clearer in our revised paper.
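For readers unfamiliar with this convention, here is a generic sketch of how a weak/strong augmentation pair typically enters such a cross-entropy term (a FixMatch-style consistency loss for illustration only, not the paper's exact Eq. 6; the function and argument names are ours):

```python
import torch
import torch.nn.functional as F

def weak_strong_consistency_loss(model, x_weak, x_strong, candidate_mask=None):
    """Cross-entropy between a pseudo-target computed on the weak view A_w(x)
    and the prediction on the strong view A_s(x)."""
    with torch.no_grad():
        target = F.softmax(model(x_weak), dim=-1)        # pseudo-target from the weak view
        if candidate_mask is not None:                   # optionally restrict to candidate labels
            target = target * candidate_mask
            target = target / target.sum(dim=-1, keepdim=True).clamp_min(1e-12)
    log_probs = F.log_softmax(model(x_strong), dim=-1)   # prediction on the strong view
    return -(target * log_probs).sum(dim=-1).mean()
```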
l. 174: For independent instances...up
The typical setting is where the data may be viewed as a collection of independent instances. In other settings (e.g., streaming), the labels of some instances may bear statistical dependence on those of other instances. Our framework permits both settings, and our experiments include both instance-independent and instance-dependent settings. We will try to make this clearer in the revised paper.
l. 223 Things become more...paragraphs.
In the noise-free case, I presents uncorrupted information about Y. E.g., it might state that a collection of instances includes a particular label, without identifying the specific instances that do. While the label information provided is incomplete, what is provided is true.
In the noisy case, an instance labeled as $\hat{y}$ does not imply that its label is actually $\hat{y}$. We will clarify this in Sec 4.3.
Equations (6), (7), (8) contain data...method.
As mentioned in the paper, these are common techniques in this field. Besides, we have an ablation on strong-weak augmentation in Tab 19 of the Appendix. Removing strong-weak augmentation leads the baselines to a significant performance drop, whereas the effect on ours is much smaller.
This argument doesn't … O(1) operation.
Thanks to the reviewer for pointing out this error. We were trying to convey that for closed-form solutions, once the sufficient statistics are gathered, the subsequent estimation has a fixed complexity, whereas for iterative solutions this is not true. However, the actual computation for gathering statistics is the same as in other models. We will make this clearer in our revised paper.
We hope our above response addresses the reviewer’s concerns. If you have further concerns, please do not hesitate to let us know:)
Here we address the reviewer's other concerns, given the limited space in the rebuttal.
The footnote...P(Y|X;\theta)?
Thanks for pointing this out. In the noisy label setting, this is not the case: the noise transition model also depends on θ (and, in the instance-dependent case, on X), as we discussed above.
Typos/grammar: l. 87...
Thanks for pointing out these typos. We have fixed them in our revised paper.
L68 - 71 seem...scalability.
The "scalability" in the limitation section is meant for larger datasets such as ImageNet-1K that we have not investigated due to the heavy computation cost. However, our method indeed demonstrates scalability on various settings where the runtime advantage compared to previous SOTA methods is shown in Table 21.
Dear Reviewer Z7AB,
As the discussion period comes to an end, we would appreciate it if you could review our response, and we hope it helps resolve your concerns. Please do not hesitate to let us know if you have any further questions. Your insights have been invaluable, and we thank you sincerely for your time and effort.
Best, Authors.
Identifiability: If the model is non-identifiable, in particular, if you could put an arbitrary permutation matrix on the labels, then no model could hope to learn something useful, no? So you generally need to put some assumptions/inductive priors on the noise transition matrix; e.g., if labels are deterministic, then I think having the largest entry in each row be the diagonal restores identifiability, but that condition wouldn't be sufficient with randomized labels.
The property of the second term is dependent on the nature of imprecise label I. If I contains information about the true labels Y, such as the actual labels or the label candidates, it can be reduced to P(I|Y), i.e., the probability of I is no longer dependent on X or θ and thus can be ignored from Eq. (5).
To clarify, what I was objecting to here was the writing of the paper, not the statement that P(I|X, Y; θ) can be reduced to P(I|Y) in some settings: the paper writes "If I contains information about the true labels Y", which is a much weaker condition than what you stated in your rebuttal, "if I is directly derivable from Y". (This phrase is even repeated later in the paper: "Since S contains the information about the true label Y...")
Note that this step also implicitly assumes that Y is a one-hot vector; otherwise, if it has non-zero weight for labels from two different bags, I is no longer a measurable function of Y.
I don't think eq. (4) models P(X, I; θ); this equation is just the maximum-likelihood principle and the law of total probability; it doesn't actually tell me anything new about the model. In fact, I think this goes together with my later point. One of the modeling assumptions is that P(X) does not depend on θ, i.e., the parameters are not used to model the distribution over X.
You are right, A_w(x) and A_s(x) are mentioned earlier in the paper; however, I don't think they are actually defined; they appear in a sentence discussing the PiCO method:
It optimizes the cross-entropy (CE) loss between the prediction of the augmented training sample Aw(x) and the disambiguated labels s. PiCO learns a set of class prototypes from the features associated with the same pseudo-targets. A contrastive loss, based on MOCO [67], is employed to better learn the feature space, drawing the projected and normalized features zw and zs of the two augmented versions of data Aw(x) and As(x) closer.
This does not actually give any information about what A_w(x) and A_s(x) are. More importantly, though, the way I read the paper, this is a description of a trick used to improve performance.
But then in 3.1, you try to derive a general framework from first principles, which is then applied in 3.2 to these various settings. This is what happens in C.2. In the main paper, however, it is not equation (14) that is presented, but (6), which contains these additional augmentations. There is no mention of these in the preceding Section 3 or the derivation in C.2; no reason is given why two different augmentations should be used in this cross-entropy formulation. I'm still not quite sure what s actually is. If I have four classes and a candidate set containing, say, two of them, then according to the text, P(Y|s) would be something like a vector that is zero on the classes outside the candidate set and uniform (in the instance-independent case) on the candidates???
In the noise-free case, I presents uncorrupted information about Y. E.g., it might state that a collection of instances includes a particular label, without identifying the specific instances that do. While the label information provided is incomplete, what is provided is true. In the noisy case, an instance labeled as $\hat{y}$ does not imply that its label is actually $\hat{y}$. We will clarify this in Sec 4.3.
I still don't understand this. In the noise-free, partial label case, the observation tells you some distribution over possible labels. In the noisy-label case, the observation also gives you a distribution over labels. In both cases, you do get some probabilistic information about the true label (assuming an underlying deterministic labeling).
I don't really see the novelty in the described approach. If I come upon an incomplete labelling scenario for the first time, what would be the advantage of me reading this paper, compared to just reading about EM in general? Maybe I'm just missing something fundamental that the other reviewers seemed to get (note my low confidence score).
Thank you for your response. We will address the issues below:
a. Please note that mixture distributions are in general not identifiable. This is not unique to our model. Widely used and very popular models such as mixtures of multinomials, Gaussian mixtures, HMMs, and even unconstrained factor analysis (which permits us to permute bases) are all provably non-identifiable. Nonetheless, they are perfectly good for modeling data distributions, computing likelihoods, or making predictions. This remains true of our model as well. Regardless of the nature of the labels, identifiability is not a requirement for our model, nor indeed for EM estimation of any latent-variable model.
b. Point taken about the presentation -- we will rephrase "if I contains information about Y". In the rest of the paper we have actually clearly separated cases where I depends only on Y, and cases where it depends on both X and Y. If I = g(Y), the statement still holds regardless of whether Y is one-hot. All that is required is that I is not a function of X. For instance, Y could be a probability vector, and I could just be Y itself (i.e., the information given is the actual uncertain label vector), in which case P(I|Y) is just a unit-volume delta at Y.
c. P(X, I; θ) is being parameterized as P(I|X; θ) P*(X), where P*(X) is the true probability distribution of X. This is a perfectly valid decomposition and maximum-likelihood estimation still holds. One does not need to model every component of a distribution, e.g., if a particular component of the distribution is known (or, as in our case, not relevant to the model), it would be redundant to model it. As for whether Eq. 4 holds by itself, we refer the reviewer to "Maximum Likelihood from Incomplete Data via the EM Algorithm" by Dempster, Laird, and Rubin, where the EM algorithm was originally laid out. Please observe that the algorithm is described entirely in terms of generic PDFs, similar to Eq. 4, and specific formulae are only used when specific examples of EM are elucidated. The EM algorithm is always specified in terms of PDFs of observed and unobserved variables, and model estimation with unobserved variables. In our case this variable is Y.
d. In lines 101 and 104, A_w(x) and A_s(x) are defined as augmented data samples. We also point to citation [13] as where this form of augmentation is introduced. Thus, we have briefly described the variables, assigned them symbols, and referred the reader to the specific citation from which this notation and description have been drawn.
e. Imprecise labels can be of many kinds. Eq. 14 specifically refers to the partial-label case where the labels belong to a set, which indeed is as the reviewer interprets it. So, for instance, given a picture, we could be informed "this is either a dog or a cat". But imprecise labels can be of other kinds. E.g., given a collection of pictures, one may be informed that "this collection includes 3 pictures of cats". This information is true, but it is imprecise because it does not mention which three of the pictures are cats. This does not fall under the partial label condition. The information is true -- there are indeed three pictures of cats, we just don't know which ones. So the label is not noisy.
In the noisy label case, we may be informed "this is a picture of a cat", but sometimes this information is wrong; it is actually a table.
The challenge is that each of these settings is different and traditionally they have been handled with different algorithms. The algorithms are specialized -- so, for instance, if we had a collection of data instances, each with partial labels, we would employ a partial-label algorithm from the literature. However, if you were to add to this same dataset a small bag of instances with the information "this bag has 3 cats", that algorithm would not apply to this bag. We provide a generic formulation that allows us to consider all kinds of imprecise labels under a single umbrella. So, for instance, the above scenario could be handled.
We understand that the reviewer may be unfamiliar with this problem framework, but we assure them that it is indeed a genuine challenge in the literature, and as the other reviewers recognize, our unifying approach is not only novel, but also a significant contribution.
We hope we have answered the reviewer's doubts, and we are happy to provide more explanations.
The paper tackles learning with imprecise labels and demonstrates strong empirical results, which several reviewers appreciated. The unified framework is a valuable contribution, offering an approach to handling various types of supervision.
While some concerns were raised about the novelty of applying the EM algorithm and certain theoretical aspects, the authors’ clarifications addressed many of these points. However, questions remain about how clearly the framework’s novelty is distinguished from prior work.
One of the reviewers got confused about the presentation of the mathematical material, and I agree that some details could have been clearer. As an example, the type of the model's output is inconsistent in the text. While the model is defined as a direct mapping from inputs to labels, the paper explicitly states that it outputs predicted probabilities, which would require a mapping to a distribution over possible labels, not a single label.
Considering the strengths of the framework and the potential for refinement, I am tending towards acceptance though.