Rating: 6.3 / 10

Decision: Rejected · 4 reviewers

Min 3 · Max 8 · Std 2.0

Average confidence: 3.3

ICLR 2024

Can Class-Priors Help Single-Positive Multi-Label Learning?

Biao Liu,Jie Wang,Ning Xu,Xin Geng


Submitted: 2023-09-22 · Updated: 2024-02-11

Keywords: Multi-label Learning, Single-Positive Multi-Label Learning

Reviews and Discussion

Review

Rating: 6 · Confidence: 3 · 2023-10-27

This study introduces an approach for single-positive multi-label learning (SPMLL). The authors present a class-priors estimation technique that aims to align the estimated class-priors with the actual class-priors as training progresses. In addition, an unbiased risk assessment tool is introduced, which is based on the estimated class-priors, and a generalization error bound is provided. Testing on ten MLL benchmark datasets has been conducted to evaluate the performance of this method in comparison to other SPMLL techniques.

Strengths

The paper provides theoretical guarantees regarding the convergence of the estimated class priors to ground-truth class priors. Additionally, it claims that the risk minimizer corresponding to the proposed risk estimator will approximately converge to the optimal risk minimizer on fully supervised data. These theoretical insights enhance the credibility of the proposed framework.

Compared with other SPMLL techniques, the experimental results are quite favorable, both in terms of average precision and in predicting the class priors. The attention maps also look quite promising.

Weaknesses

The paper seeks to develop a method for SPMLL, and it's evident that similar efforts have been made in other studies, focusing on the "single-positive" label approach. Yet, the rationale for opting for the "single-positive" label remains somewhat vague. In real-life situations, it's common to encounter missing labels, but the count of observed labels for each sample can differ, not always being restricted to one. For this study to truly make a difference, it should be benchmarked not just against other SPMLL strategies but also against traditional multi-label learning models that consider multiple positive labels for each sample during training (not just testing). The assumption that each sample contains precisely one positive label for training seems out of sync with practical scenarios.

The paper should discuss the potential limitations or challenges in generalizing CRISP to different domains or types of data. Real-world scenarios can vary widely, and the effectiveness of the proposed framework in diverse contexts should be explored.

Questions

This is not a question but please explain in the paper how this sentence is related to Section 3.1 "Note that a method is risk-consistent if the method possesses a classification risk estimator that is equivalent to R(f) given the same classifier (Mohri et al., 2012)."

After rebuttal:

Thank you for your response to my initial comments and for conducting the additional comparisons that I suggested. It is encouraging to see that your approach demonstrates superior performance when compared to the additional MLML methods. In light of these new findings, I am pleased to adjust my evaluation of your paper. I am increasing my score by one point.

2023-11-15

For Questions:

This is not a question but please explain in the paper how this sentence is related to Section 3.1 "Note that a method is risk-consistent if the method possesses a classification risk estimator that is equivalent to R(f) given the same classifier (Mohri et al., 2012)."

The statement regarding risk consistency does not directly relate to the symbol definitions for the multi-label learning (MLL) problem presented in Section 3.1, but it sets the stage for the upcoming explanation of how our method adheres to this desirable property. Risk consistency is crucial in weakly supervised problems, where the absence of complete ground-truth labels precludes direct optimization of the expected risk $\mathcal{R}(h)$ as stated in Equation (1).

Risk consistency addresses this by transforming the expected risk $\mathcal R(h)$ into an optimizable risk $\mathcal R_{consistent}(h)$ , which can be optimized using the available weak supervision. For any classifier $h$ , it ensures that $\mathcal R(h)=\mathcal R_{consistent}(h)$ . Our proposed CRISP method is designed to adhere to this crucial property. By introducing this concept early in Section 3.1, we aim to clarify the theoretical underpinnings that make CRISP applicable in the weakly supervised settings of SPMLL.

We will ensure that the revised manuscript clarifies the relevance of risk consistency to the definitions presented in Section 3.1 and articulates its importance in the broader context of weakly supervised learning.

We have incorporated your suggestions into our revisions and hope that they meet your approval. We are open to any further questions or requests for additional information.

2023-11-20

Thank you for your thoughtful feedback. Your insights have been invaluable in enhancing the quality of our paper.

2023-11-15

Predictive performance of each compared method on MLL datasets in terms of Ranking Loss (mean ± std). The best performance is highlighted in bold (the smaller the better).

	Image	Scene	Yeast	Corel5k	Mirflickr	Delicious
LL-R	0.346±0.072	0.155±0.021	0.227±0.001	0.114±0.001	0.123±0.003	0.129±0.002
LL-Cp	0.329±0.041	0.148±0.017	0.215±0.000	0.114±0.003	0.124±0.003	0.160±0.001
LL-Ct	0.327±0.019	0.180±0.038	0.238±0.001	0.115±0.001	0.124±0.002	0.160±0.000
Crisp	0.164±0.027	0.112±0.021	0.164±0.001	0.113±0.001	0.118±0.001	0.122±0.000

Predictive performance of each compared method on MLL datasets in terms of One Error (mean ± std). The best performance is highlighted in bold (the smaller the better).

	Image	Scene	Yeast	Corel5k	Mirflickr	Delicious
LL-R	0.597±0.084	0.490±0.054	0.436±0.087	0.715±0.006	0.342±0.016	0.543±0.041
LL-Cp	0.629±0.043	0.450±0.051	0.240±0.000	0.731±0.016	0.357±0.016	0.490±0.028
LL-Ct	0.616±0.019	0.574±0.074	0.552±0.097	0.726±0.022	0.375±0.012	0.475±0.019
Crisp	0.325±0.026	0.311±0.047	0.227±0.004	0.646±0.006	0.295±0.009	0.402±0.003

In summary, while closely related to MLML, SPMLL poses new challenges with less supervision. Our experiments verify that when compared to MLML techniques, our approach designed specifically for SPMLL performs the best.

[1] Shankar, V., Roelofs, R., Mania, H., Fang, A., Recht, B., & Schmidt, L. (2020, November). Evaluating machine accuracy on imagenet. In International Conference on Machine Learning (pp. 8634-8644). PMLR.

[2] Kim, Y., Kim, J. M., Akata, Z., & Lee, J. (2022). Large loss matters in weakly supervised multi-label classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 14156-14165).

The paper should discuss the potential limitations or challenges in generalizing CRISP to different domains or types of data. Real-world scenarios can vary widely, and the effectiveness of the proposed framework in diverse contexts should be explored.

In our study, we have gone beyond the four large-scale image datasets commonly used in previous SPMLL research. We have conducted extensive experiments on multiple multi-label learning (MLL) datasets covering a diverse range of domains, including images, biology, and text. Specifically, the datasets used in our experiments include Image, Scene, Yeast, Corel5k, Mirflickr, and Delicious, each with a different number of examples, features, and classes.

Our experimental results have consistently shown that CRISP outperforms the existing methods across these varied datasets. This not only demonstrates the effectiveness of our method but also substantiates its robust generalization capability across different domains and types of data. For instance, CRISP was able to effectively handle the high dimensionality of the Delicious text dataset with 500 features and 983 classes as well as the biological dataset Yeast with its distinct feature space and class structure.

We believe these results offer strong evidence of the versatility and adaptability of the CRISP framework. In the revised manuscript, we will include a more detailed discussion of the potential limitations and challenges when applying CRISP to different domains, along with a comprehensive analysis of the experimental results across these varied datasets.

2023-11-15

We appreciate your detailed review and constructive feedback on our manuscript. Your comments have been crucial in improving our research. Regarding your questions, I would like to provide the following explanations:

For Weaknesses:

The paper seeks to develop a method for SPMLL, and it's evident that similar efforts have been made in other studies, focusing on the "single-positive" label approach. Yet, the rationale for opting for the "single-positive" label remains somewhat vague. In real-life situations, it's common to encounter missing labels, but the count of observed labels for each sample can differ, not always being restricted to one. For this study to truly make a difference, it should be benchmarked not just against other SPMLL strategies but also against traditional multi-label learning models that consider multiple positive labels for each sample during training (not just testing). The assumption that each sample contains precisely one positive label for training seems out of sync with practical scenarios.

Thank you for your insightful comments. I would like to clarify the significance and challenges associated with the SPMLL problem.

Firstly, the SPMLL problem has been extensively studied due to its practical relevance in scenarios with many instances and large label spaces. In reality, a large number of instances inherently possess multiple labels. However, accurately annotating every label for each instance is a challenging and labor-intensive task. To enhance efficiency, annotators often annotate only a single positive label for each instance, considerably reducing the burden of annotation. A prime example of this is the ImageNet dataset, which, while being a single-label dataset, contains samples that inherently possess multiple labels [1].

Secondly, due to the constraint of having only a single positive label per sample, SPMLL presents a more challenging scenario compared to the Multi-Label Learning with Missing Labels (MLML) problem where each example may have multiple positive labels, and the remaining labels are missing. Our supplementary experiments demonstrate that directly applying the state-of-the-art MLML methods [2] to SPMLL results in suboptimal performance. This underscores the necessity for developing specialized approaches tailored for SPMLL. Furthermore, our proposed method is also seamlessly adaptable to the MLML problem, showcasing its versatility.

Predictive performance of each compared method on four MLIC datasets in terms of mAP (mean ± std). The best performance is highlighted in bold (the larger the better).

	VOC	COCO	NUSWIDE	CUB
LL-R	87.784±0.063	70.078±0.008	48.048±0.074	18.966±0.022
LL-Cp	87.466±0.031	70.460±0.032	48.000±0.077	19.310±0.164
LL-Ct	87.054±0.214	70.384±0.058	47.930±0.010	19.012±0.097
Crisp	89.820±0.191	74.640±0.219	49.996±0.316	21.650±0.178

Predictive performance of each compared method on MLL datasets in terms of Average Precision (mean ± std). The best performance is highlighted in bold (the larger the better).

	Image	Scene	Yeast	Corel5k	Mirflickr	Delicious
LL-R	0.605±0.058	0.714±0.035	0.658±0.006	0.268±0.002	0.625±0.001	0.296±0.004
LL-Cp	0.595±0.031	0.735±0.028	0.700±0.000	0.259±0.004	0.621±0.007	0.251±0.007
LL-Ct	0.600±0.012	0.669±0.052	0.629±0.007	0.258±0.004	0.619±0.004	0.253±0.004
Crisp	0.749±0.037	0.795±0.031	0.758±0.002	0.304±0.003	0.628±0.003	0.319±0.001

Review

Rating: 3 · Confidence: 4 · 2023-10-31

This paper focuses on Single-Positive multi-label learning and proposes a framework named CRISP. CRISP estimates the class-priors, and an unbiased risk estimator is derived based on the estimated class-priors. This paper tries to guarantee the estimated class-priors converging to the ground-truth class-priors. Finally, this paper tries to show the effectiveness of CRISP by extensive experiments.

Strengths

a. Extensive experiments. I appreciate that this paper provides extensive experiments to show the effectiveness of the proposed method.

b. Nice originality. I am not sure whether this work is the first to focus on the class prior in SPMLL, but it is an interesting track.

Weaknesses

a. The writing of this paper needs to be improved. Specifically, more analysis and descriptions of Theorem 4.2 are necessary.

b. I am concerned about the time cost of the proposed method.

Questions

My main concerns are the following questions:

a. It is mentioned that "This unrealistic assumption will introduce severe biases into the pseudo-labels, further impacting the training of the model supervised by the inaccurate pseudo-labels". What are the biases? It is necessary to provide more discussions to enrich your motivations.

b. The key of the proposed methods is the threshold. How do you get the optimal threshold in practice, i.e. how do you implement eq.2?

c. What is the time cost of the proposed method? Please discuss more about the time cost of the optimal threshold and the entire method in theory and experiments.

d. Theorem 4.2 tries to present the convergence of the empirical risk minimizer, but it seems that the empirical risk minimizer does not converge to the true risk minimizer. Please provide more analysis of Theorem 4.2 and more discussions about every component in the upper bound.

2023-11-15

For Questions:

It is mentioned that "This unrealistic assumption will introduce severe biases into the pseudo-labels, further impacting the training of the model supervised by the inaccurate pseudo-labels". What are the biases? It is necessary to provide more discussions to enrich your motivations.

The biases referred to in our paper arise from the discrepancy between the assumed uniform distribution of class-priors and the actual distribution in real-world data. Typically, the class distribution in real-world scenarios is imbalanced, with some classes being more prevalent than others. When pseudo-labels are generated under the assumption of equal class-priors, classes with a naturally lower occurrence rate are overrepresented, while those with a higher occurrence rate are underrepresented.

The biases resulting from the unrealistic assumption of identical class-priors for pseudo-label generation can indeed lead to the production of incorrect pseudo-labels. This issue is pivotal, as the subsequent model training is supervised by these inaccurate pseudo-labels, which can compound the initial bias into a significant performance degradation.

The key of the proposed methods is the threshold. How do you get the optimal threshold in practice, i.e. how do you implement eq.2?

In practice, to determine the optimal threshold, we conduct an exhaustive search across the set of outputs generated by the function $f^j$ for each class. For instance, for a given class $j$ , and a set of instances $\boldsymbol{x}_1, \boldsymbol{x}_2, \boldsymbol{x}_3$ in our dataset, we compute the corresponding outputs $z_1 = f^j(\boldsymbol{x}_1), z_2 = f^j(\boldsymbol{x}_2), z_3 = f^j(\boldsymbol{x}_3)$ .

The optimal threshold $\hat{z}$ is then selected by identifying the value of $z \in \{z_1, z_2, z_3\}$ that minimizes the objective function specified in Equation (2):

$$\hat{z} = \arg\min_{z \in \{z_1, z_2, z_3\}} \left( \frac{\hat{q}_j(z)}{\hat{q}_j^p(z)} + \frac{1+\tau}{\hat{q}_j^p(z)}\left( \sqrt{\frac{\log(4/\delta)}{2n}} + \sqrt{\frac{\log(4/\delta)}{2n_j^p}} \right) \right)$$

This approach ensures that we find the threshold that minimizes the expression in Equation (2) across all available output values of the function $f^j$.
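
The exhaustive search described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. It assumes that $\hat{q}_j(z)$ is the fraction of all instances whose score $f^j(\boldsymbol{x})$ is at least $z$, and that $\hat{q}_j^p(z)$ is the same fraction computed over the instances annotated with class $j$. The function name `select_threshold` and the default values of `tau` and `delta` are hypothetical.

```python
import math


def select_threshold(all_scores, pos_scores, tau=0.1, delta=0.1):
    """Exhaustive search for the threshold minimizing the Eq. (2) objective.

    all_scores: outputs f^j(x) on all n instances.
    pos_scores: outputs f^j(x) on the n_j^p instances annotated with class j.
    Assumption: q_hat(z) and q_hat_p(z) are the fractions of scores >= z.
    """
    n, n_p = len(all_scores), len(pos_scores)
    # Confidence term shared by every candidate threshold.
    conf = math.sqrt(math.log(4 / delta) / (2 * n)) + math.sqrt(
        math.log(4 / delta) / (2 * n_p)
    )
    best_z, best_val = None, math.inf
    for z in all_scores:  # candidates are the observed outputs z_1, ..., z_n
        q = sum(s >= z for s in all_scores) / n
        q_p = sum(s >= z for s in pos_scores) / n_p
        if q_p == 0:  # objective is undefined here, so skip this candidate
            continue
        val = q / q_p + (1 + tau) / q_p * conf
        if val < best_val:
            best_z, best_val = z, val
    return best_z, best_val
```

Written this way, the search costs $O(n^2)$ per class; sorting the scores once and using cumulative counts would bring it down to $O(n \log n)$.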

What is the time cost of the proposed method? Please discuss more about the time cost of the optimal threshold and the entire method in theory and experiments.

Please see the response for Weakness 2.

Theorem 4.2 tries to present the convergence of the empirical risk minimizer, but it seems that the empirical risk minimizer does not converge to the true risk minimizer. Please provide more analysis of Theorem 4.2 and more discussions about every component in the upper bound.

Please see the response for Weakness 1.

We have incorporated your suggestions into our revisions and hope that they meet your approval. We are open to any further questions or requests for additional information.

2023-11-15

We are deeply grateful for your thoughtful review and insightful suggestions. Your feedback has greatly helped us to improve the quality of our manuscript. Regarding your questions, I would like to provide the following explanations:

For Weaknesses:

The writing of this paper needs to be improved. Specifically, more analysis and descriptions of Theorem 4.2 are necessary.

Theorem 4.2 provides an upper bound on the difference between the classifier obtained by the empirical risk minimizer $\hat{f}_{sp}$ and the classifier corresponding to the true risk minimizer $f^*$ . This bound consists of four terms, which can be categorized into two groups: the Rademacher complexity terms and the terms of order $O(1/\sqrt{n})$ .
1. Terms of Order $O(1/\sqrt{n})$ : As the sample size $n$ grows, these terms reduce, indicating that $\hat{f}_{sp}$ converges to the performance of $f^*$ within a margin of error that becomes progressively smaller with more data.
2. Rademacher Complexity Terms: The Rademacher Complexity terms in the theorem quantify the complexity of the hypothesis space. These complexity terms denote an intrinsic error bound that persists regardless of sample size, reflecting the capacity of the function class we are choosing from. This intrinsic error is a fundamental aspect of the learning problem and remains even in a fully supervised scenario [1].

Together, these components suggest that as we gather more data, the classifier $\hat f_{sp}$ becomes increasingly accurate, drawing nearer to the performance of the classifier $f^*$ within an acceptable error tolerance. The empirical risk for $\hat{f}_{sp}$ is thus expected to converge to the true risk $R(f^*)$ .

[1] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. Adaptive computation and machine learning. MIT Press, 2012. ISBN 978-0-262-01825-8.

I am concerned about the time cost of the proposed method.

Thank you for your comment regarding the computational efficiency of our proposed method. We understand the importance of time cost in real-world applications and have addressed this concern in two ways in our additional experiments.

Firstly, we have recorded the time spent on estimating class priors within each epoch and found it to be a relatively minor portion of the total training time. As shown in the table below, the time for class-priors estimation is short compared to the overall training time for an epoch, so our method remains practical for larger datasets.

Dataset	Class-prior estimation time per epoch (min)	Total training time per epoch (min)
VOC	0.24	2.19
COCO	3.47	27.29
NUS	6.4	49.09
CUB	0.45	3.89
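
To make "a relatively minor portion" concrete, the share of each epoch spent on class-prior estimation can be computed directly from the timings in the table above. The snippet below is only an arithmetic illustration; the dictionary layout and the helper name `estimation_share` are hypothetical.

```python
# Per-epoch timings in minutes, copied from the table above:
# (class-prior estimation time, total training time of one epoch).
timings = {
    "VOC": (0.24, 2.19),
    "COCO": (3.47, 27.29),
    "NUS": (6.4, 49.09),
    "CUB": (0.45, 3.89),
}


def estimation_share(estimation_min, epoch_min):
    """Fraction of one epoch spent on class-prior estimation."""
    return estimation_min / epoch_min


shares = {name: estimation_share(*t) for name, t in timings.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

On these four datasets the share comes out at roughly 11% to 13% of an epoch.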

Secondly, to further enhance the speed of our algorithm, we have experimented with updating the class priors every few epochs instead of every single one. The variant of our method, denoted as CRISP-3EP, updates the priors every three epochs and our experiments show that this results in a negligible loss in performance, as evidenced by the close metrics between CRISP and CRISP-3EP.

	VOC	COCO	NUS	CUB
CRISP	89.820±0.191	74.640±0.219	49.996±0.316	21.650±0.178
CRISP-3EP	89.077±0.251	73.930±0.399	49.463±0.216	19.450±0.389
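
The update schedule behind a variant such as CRISP-3EP can be sketched as a training loop that refreshes the class-prior estimate only every few epochs. This is a minimal sketch under stated assumptions, not the authors' code: `estimate_priors` and `train_one_epoch` are hypothetical callbacks standing in for the real estimation and training steps, and estimating at epoch 0 is also an assumption.

```python
def train_with_periodic_prior_updates(
    num_epochs, update_every, estimate_priors, train_one_epoch
):
    """Train while refreshing the class-prior estimate every `update_every` epochs.

    estimate_priors() returns the current prior estimate, and
    train_one_epoch(priors) runs one epoch with those priors. Both are
    hypothetical placeholders for the real steps.
    Returns the number of times the priors were estimated.
    """
    priors = None
    num_estimations = 0
    for epoch in range(num_epochs):
        if epoch % update_every == 0:  # epochs 0, k, 2k, ...
            priors = estimate_priors()
            num_estimations += 1
        train_one_epoch(priors)  # reuses the most recent estimate
    return num_estimations
```

With `update_every=1` the priors are re-estimated every epoch, and with `update_every=3` the estimation cost is paid in only one epoch out of three.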

We will include this discussion in the limitations section in the revised manuscript. We believe these insights could be beneficial to researchers and practitioners using our method in the future.

2023-11-20

Dear Reviewer,

We would like to express our gratitude for your helpful comments and suggestions, which are of great importance for improving this work.

We have made every effort to address the questions you raised and to improve the paper accordingly. We would like to double-check that we have addressed all your concerns, so please let us know if you have any additional questions. Thank you.

Best wishes,

Authors.

2023-11-21

Dear Reviewer,

Thank you for your valuable and detailed feedback on our manuscript. In response to your concerns, we have made a sincere effort to address and clarify the key issues.

In our previous response, we have focused on providing a more comprehensive analysis of Theorem 4.2, offering a clearer explanation of its components and implications for the classifier's accuracy. We have also elaborated on the method for determining the optimal threshold, including additional experiments to assess the time efficiency of this process, thereby ensuring its practical applicability in larger datasets. Additionally, in response to your observations regarding certain unclear expressions, we have provided more detailed explanations to ensure clarity and comprehensiveness.

While we believe that these revisions address the key issues you have highlighted, we remain open to further guidance and suggestions. Thank you once again for your thoughtful guidance and support.

Best wishes,

Authors.

Review

Rating: 8 · Confidence: 3 · 2023-10-31

The target problem of this paper is called single-positive multi-label learning. This is a weakly supervised version of the multi-label classification scenario, where each instance is annotated with only one of the positive labels. The other labels do not necessarily mean negative, but can potentially be positive or negative. The paper proposes a method called CRISP, which alternately updates the class prior estimate and the multi-label classifier. Theoretically, the paper shows that the estimated class-prior will converge to the ground-truth class-prior with enough training samples and provides an estimation error bound for the proposed empirical risk estimator. Experiments show CRISP works better than other methods.

Strengths

Estimation error bound is provided for class prior estimation and for the empirical risk estimator.
Empirically, class-prior prediction is more accurate compared with others.
Multi-label prediction performance is often best for the proposed CRISP method.

Weaknesses

The paper is motivated by the observation that previous methods have a strong assumption that class priors are assumed to be uniform. It would be interesting to see if the proposed method is still advantageous when class priors are uniform. It would enhance the paper's significance if the authors could demonstrate whether their proposed method retains its advantages even under the condition of uniform class priors.
It seems to me that the problem setting of SPMLL is a special case of "Multi-Label Ranking From Positive and Unlabeled Data" (CVPR 2016). I wonder if these general methods can be used as a baseline (and if not, what are the weaknesses of using these more general methods?)

Questions

In addition to the points I wrote in the "Weaknesses":

It would be helpful to explicitly write out the definition of the absolute loss function and the derivation in Eq. 5.
I wasn't sure if we end up with an unbiased estimator (even with access to the ground truth class prior), because we have the additional absolute operator in the latter half of Eq. 7. It would be helpful if the paper can clarify.

2023-11-15

Predictive performance of each compared method on MLL datasets in terms of Ranking Loss (mean ± std). The best performance is highlighted in bold (the smaller the better).

	Image	Scene	Yeast	Corel5k	Mirflickr	Delicious
LL-R	0.346±0.072	0.155±0.021	0.227±0.001	0.114±0.001	0.123±0.003	0.129±0.002
LL-Cp	0.329±0.041	0.148±0.017	0.215±0.000	0.114±0.003	0.124±0.003	0.160±0.001
LL-Ct	0.327±0.019	0.180±0.038	0.238±0.001	0.115±0.001	0.124±0.002	0.160±0.000
Crisp	0.164±0.027	0.112±0.021	0.164±0.001	0.113±0.001	0.118±0.001	0.122±0.000

Predictive performance of each compared method on MLL datasets in terms of One Error (mean ± std). The best performance is highlighted in bold (the smaller the better).

	Image	Scene	Yeast	Corel5k	Mirflickr	Delicious
LL-R	0.597±0.084	0.490±0.054	0.436±0.087	0.715±0.006	0.342±0.016	0.543±0.041
LL-Cp	0.629±0.043	0.450±0.051	0.240±0.000	0.731±0.016	0.357±0.016	0.490±0.028
LL-Ct	0.616±0.019	0.574±0.074	0.552±0.097	0.726±0.022	0.375±0.012	0.475±0.019
Crisp	0.325±0.026	0.311±0.047	0.227±0.004	0.646±0.006	0.295±0.009	0.402±0.003

In conclusion, while MLML methods could be applied to SPMLL, our experiments validate that our approach designed specifically for the SPMLL problem outperforms these more general techniques.

[1] Kim, Y., Kim, J. M., Akata, Z., & Lee, J. (2022). Large loss matters in weakly supervised multi-label classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 14156-14165).

For Questions:

It would be helpful to explicitly write out the definition of the absolute loss function and the derivation in Eq. 5.

Thank you for your valuable suggestion to explicitly define the absolute loss function and provide the derivation for Eq. 5. We acknowledge that these additions will enhance the clarity and comprehensiveness of our manuscript.

The absolute loss function used in our framework is defined as $\ell(f^j(\boldsymbol x), y_j) = |f^j(\boldsymbol x) - y_j|$. Eq. 5 is derived as:

$$
\begin{aligned}
\mathcal{R}(f) &= \sum_{\boldsymbol y} p(\boldsymbol y)\, \mathbb{E}_{\boldsymbol x \sim p(\boldsymbol x \vert \boldsymbol y)} \left[ \sum_{j=1}^c y_j \ell(f^j(\boldsymbol x), 1) + (1 - y_j) \ell(f^j(\boldsymbol x), 0) \right] \\
&= \sum_{j=1}^c \Big( p(y_j = 1)\, \mathbb{E}_{\boldsymbol x \sim p(\boldsymbol x \vert y_j = 1)}\left[ \ell(f^j(\boldsymbol x), 1) \right] + p(y_j = 0)\, \mathbb{E}_{\boldsymbol x \sim p(\boldsymbol x \vert y_j = 0)}\left[ \ell(f^j(\boldsymbol x), 0) \right] \Big) \\
&= \sum_{j=1}^c \Big( p(y_j = 1)\, \mathbb{E}_{\boldsymbol x \sim p(\boldsymbol x \vert y_j = 1)}\left[ 1 - f^j(\boldsymbol x) \right] + (1 - p(y_j = 1))\, \mathbb{E}_{\boldsymbol x \sim p(\boldsymbol x \vert y_j = 0)}\left[ f^j(\boldsymbol x) \right] \Big) \\
&= \sum_{j=1}^c \Big( p(y_j = 1)\, \mathbb{E}_{\boldsymbol x \sim p(\boldsymbol x \vert y_j = 1)}\left[ 1 - f^j(\boldsymbol x) \right] + \mathbb{E}_{\boldsymbol x \sim p(\boldsymbol x)}\left[ f^j(\boldsymbol x) \right] - p(y_j = 1)\, \mathbb{E}_{\boldsymbol x \sim p(\boldsymbol x \vert y_j = 1)}\left[ f^j(\boldsymbol x) \right] \Big) \\
&= \sum_{j=1}^c \Big( p(y_j = 1)\, \mathbb{E}_{\boldsymbol x \sim p(\boldsymbol x \vert y_j = 1)}\left[ 1 - f^j(\boldsymbol x) \right] + \mathbb{E}_{\boldsymbol x \sim p(\boldsymbol x)}\left[ f^j(\boldsymbol x) \right] - p(y_j = 1)\, \mathbb{E}_{\boldsymbol x \sim p(\boldsymbol x \vert y_j = 1)}\left[ f^j(\boldsymbol x) - 1 + 1 \right] \Big) \\
&= \sum_{j=1}^c \Big( 2 p(y_j = 1)\, \mathbb{E}_{\boldsymbol x \sim p(\boldsymbol x \vert y_j = 1)}\left[ 1 - f^j(\boldsymbol x) \right] + \left( \mathbb{E}_{\boldsymbol x \sim p(\boldsymbol x)}\left[ f^j(\boldsymbol x) \right] - p(y_j = 1) \right) \Big).
\end{aligned}
$$

We ensure that these elements will be meticulously detailed in the revised version of the manuscript.
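
The derivation of Eq. 5 above can be checked numerically for a single class $j$. The sketch below compares the plain absolute-loss risk with the rewritten form on a finite sample, with expectations replaced by sample averages and $p(y_j = 1)$ replaced by the empirical positive rate. The function names are hypothetical, and the scores are assumed to lie in $[0, 1]$ so that $\ell(f, 1) = 1 - f$ and $\ell(f, 0) = f$.

```python
def risk_direct(samples):
    """Mean absolute loss |f - y| over (score, label) pairs."""
    return sum(abs(f - y) for f, y in samples) / len(samples)


def risk_rewritten(samples):
    """Last line of the Eq. 5 derivation, evaluated on the same sample.

    Needs only the positive scores, the mean score, and the class prior.
    Assumes the sample contains at least one positive example.
    """
    n = len(samples)
    pos_scores = [f for f, y in samples if y == 1]
    prior = len(pos_scores) / n  # empirical p(y_j = 1)
    pos_term = 2 * prior * sum(1 - f for f in pos_scores) / len(pos_scores)
    mean_score = sum(f for f, _ in samples) / n  # empirical E_x[f^j(x)]
    return pos_term + (mean_score - prior)


# Toy data: (score f^j(x), ground-truth label y_j).
samples = [(0.9, 1), (0.2, 0), (0.6, 1), (0.4, 0)]
assert abs(risk_direct(samples) - risk_rewritten(samples)) < 1e-12
```

The rewritten form never touches the negative labels, which is what makes it usable when only positive annotations and a class-prior estimate are available.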

2023-11-15

Thank you for taking the time to review the paper and providing valuable feedback. I appreciate your efforts in ensuring the quality of the research. Regarding your concerns, I would like to provide the following explanations:

For Weaknesses:

The paper is motivated by the observation that previous methods have a strong assumption that class priors are assumed to be uniform. It would be interesting to see if the proposed method is still advantageous when class priors are uniform. It would enhance the paper's significance if the authors could demonstrate whether their proposed method retains its advantages even under the condition of uniform class priors.

Thank you for your insightful query regarding the performance of our proposed method in scenarios where class priors are uniform. We agree that evaluating our method under this condition would provide a more comprehensive understanding of its robustness and versatility.

However, unlike multi-class single-label datasets where the number of instances per class can be artificially balanced, multi-label datasets often exhibit complex label correlations that make such balancing more challenging. Existing multi-label datasets typically do not have uniform class priors, and due to the label correlations, balancing the occurrence of one class may inadvertently affect the distribution of other classes, making it impractical to achieve uniformity across all classes simultaneously [1].

This inherent imbalance in multi-label settings underscores the importance of considering class-prior imbalance in the design of SPMLL methods. Our approach is particularly suited to address these real-world scenarios where uniform class priors are not the norm.

[1] Wu, T., Huang, Q., Liu, Z., Wang, Y., & Lin, D. (2020). Distribution-balanced loss for multi-label classification in long-tailed datasets. In Proceedings of the 16th European Conference on Computer Vision (ECCV 2020) (pp. 162-178). Springer International Publishing.

It seems to me that the problem setting of SPMLL is a special case of "Multi-Label Ranking From Positive and Unlabeled Data" (CVPR 2016). I wonder if these general methods can be used as a baseline (and if not, what are the weaknesses of using these more general methods?)

I agree that the problem setting of SPMLL is a special case of "Multi-Label Ranking From Positive and Unlabeled Data" (also known as multi-label learning with missing labels, MLML). We acknowledge the reviewer's suggestion to compare against MLML methods. As recommended, we have added experiments with the state-of-the-art MLML approach [1] on SPMLL datasets. Our proposed method still achieves superior performance, as shown in the results below:

Predictive performance of each compared method on four MLIC datasets in terms of mAP (mean ± std). The best performance is highlighted in bold (the larger the better).

	VOC	COCO	NUSWIDE	CUB
LL-R	87.784±0.063	70.078±0.008	48.048±0.074	18.966±0.022
LL-Cp	87.466±0.031	70.460±0.032	48.000±0.077	19.310±0.164
LL-Ct	87.054±0.214	70.384±0.058	47.930±0.010	19.012±0.097
Crisp	89.820±0.191	74.640±0.219	49.996±0.316	21.650±0.178

Predictive performance of each compared method on MLL datasets in terms of Average Precision (mean ± std). The best performance is highlighted in bold (the larger the better).

	Image	Scene	Yeast	Corel5k	Mirflickr	Delicious
LL-R	0.605±0.058	0.714±0.035	0.658±0.006	0.268±0.002	0.625±0.001	0.296±0.004
LL-Cp	0.595±0.031	0.735±0.028	0.700±0.000	0.259±0.004	0.621±0.007	0.251±0.007
LL-Ct	0.600±0.012	0.669±0.052	0.629±0.007	0.258±0.004	0.619±0.004	0.253±0.004
Crisp	0.749±0.037	0.795±0.031	0.758±0.002	0.304±0.003	0.628±0.003	0.319±0.001

2023-11-15

I wasn't sure if we end up with an unbiased estimator (even with access to the ground truth class prior), because we have the additional absolute operator in the latter half of Eq. 7. It would be helpful if the paper can clarify.

Thank you for your query regarding the unbiasedness of our estimator, particularly concerning the incorporation of the absolute value operator in Eq. 7. In our revision, we will clarify that the property of risk consistency with the absolute loss function is contingent upon the condition that $E_{x\sim p(x)}[f^j(x)] \geq p(y_j=1)$ . This condition ensures that the expected output of our model for class $j$ is not less than the ground truth prior probability of that class being the positive label.

The absolute value serves two purposes in our formulation. First, when $E_{x\sim p(x)}[f^j(x)] < p(y_j=1)$, it forces the expected value $E_{x\sim p(x)}[f^j(x)]$ to increase, which preserves the risk consistency of the risk estimator. Second, it aligns the model's output class prior with the true class prior, which ties directly to the initial motivation of our paper: removing the assumption of uniform class priors made by previous methods.
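The first purpose can be illustrated with a small sketch. Eq. 7 is not reproduced in this thread, so the code below assumes, purely for illustration, a penalty of the form |mean prediction − class prior| for a single class; the function names `prior_gap_penalty` and `penalty_subgradient`, the toy predictions, and the step size are all hypothetical and are not taken from the paper.

```python
def prior_gap_penalty(preds, prior):
    """Absolute gap between the mean prediction for one class and its class prior."""
    return abs(sum(preds) / len(preds) - prior)


def penalty_subgradient(preds, prior):
    """Subgradient of the penalty with respect to each individual prediction."""
    gap = sum(preds) / len(preds) - prior
    sign = (gap > 0) - (gap < 0)  # -1, 0, or +1
    return [sign / len(preds)] * len(preds)


# Toy case (assumed values): the mean prediction 0.2 is below the prior 0.5.
preds = [0.1, 0.2, 0.3]
prior = 0.5

grad = penalty_subgradient(preds, prior)

# One plain gradient-descent step with an assumed step size.
step = 0.3
updated = [p - step * g for p, g in zip(preds, grad)]

print(prior_gap_penalty(preds, prior), prior_gap_penalty(updated, prior))
```

Under these assumptions every subgradient entry is negative when the mean prediction is below the prior, so a descent step raises the predictions and shrinks the gap, which matches the behaviour described above.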

We hope that our revisions have addressed all of your concerns, but please let us know if there is anything else we can do to improve the manuscript. We would be happy to answer any additional questions or provide any further information you may need.

Comment

2023-11-22

Thank you for answering my questions and for the additional experiments. Currently I do not have any follow-up questions. Since I have no remaining concerns, I plan to raise my score by one step.

2023-11-23

Thank you for your insightful feedback and suggestions. Your guidance has been invaluable in enhancing the quality of this work.

Review

Rating: 8 Confidence: 3 2023-11-01

The paper provides a novel class-priors estimator and unbiased risk estimator for the single-positive multi-label learning task. The estimator comes with a convergence guarantee. The paper also shows that the proposed method leads to strong performance on various tasks.

Strengths

Strong empirical performance
The proposed method is theoretically principled with a convergence guarantee
The unbiased risk estimator is simple and intuitive

Weaknesses

Clarity of writing. I found the problem setting to be unclear until I finished section 4. One suggestion would be to add more details about the setup in the preliminary setting e.g. the absolute loss function, before diving into deriving the estimators. Also, changing the order by deriving the risk estimator before the class-priors, may provide a better motivation on why we need to estimate the class-priors.

Questions

The algorithm relies on iteratively estimating the class-prior from f and then using it to update f. Is there a possible failure mode?
Would it be possible to extend this type of estimator to a different loss than the absolute loss?

2023-11-15

For Weaknesses:

Clarity of writing. I found the problem setting to be unclear until I finished section 4. One suggestion would be to add more details about the setup in the preliminary setting e.g. the absolute loss function, before diving into deriving the estimators. Also, changing the order by deriving the risk estimator before the class-priors, may provide a better motivation on why we need to estimate the class-priors.

We appreciate your feedback on the clarity of our paper's writing.

In response to your insightful comments, we will incorporate a more detailed exposition of the setup in the preliminary section, including an in-depth discussion of the absolute loss function. This will be done to lay a clearer foundation before we proceed to the derivation of the estimators.

Furthermore, we will adjust the order of the sections as you recommended. By deriving the risk estimator prior to introducing the class-priors, we aim to provide stronger motivation and a more logical progression for the need to estimate class-priors. This restructuring is intended to not only clarify the necessity of class-priors estimation within our proposed method but also to facilitate a more intuitive understanding of the method's overall framework.

We believe that these changes will significantly improve the clarity of the paper and will ensure that the problem setting is comprehensible early in the reading. We are committed to making the necessary revisions to ensure that the final manuscript meets the high standards of clear and logical academic writing.

For Questions:

The algorithm relies on iteratively estimating the class-prior from f and then using it to update f. Is there a possible failure mode?

Our approach initially treats all unknown labels as negative to warm up the model, which is a common practice in existing SPMLL methods. This warm-up step provides a stable starting point, yielding a reasonably effective model before our proposed Crisp method is applied. Once Crisp is applied, we iteratively refine the model through the class-prior estimation technique. In our extensive experimental evaluations, we did not encounter instances of failure.

Would it be possible to extend this type of estimator to a different loss than the absolute loss?

In our proposed framework, the risk estimator can indeed be extended to loss functions beyond the absolute loss, specifically to symmetric loss functions. A symmetric loss function $l$ has the property that, for any prediction $f^j(x)$, the sum of the loss for a positive label and the loss for a negative label is constant, i.e., $l(f^j(x),+1)+l(f^j(x),0)=C$, where $C$ is a constant.
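The symmetry condition above can be checked numerically. The following minimal sketch assumes predictions in the open interval (0, 1) and the labels +1 and 0 used in the response; the helper names and the evaluation grid are illustrative, and the comparison with binary cross-entropy is added here only as a contrast and is not part of the authors' reply.

```python
import math


def absolute_loss(f, y):
    """Absolute loss |f - y| for a prediction f and a label y in {0, 1}."""
    return abs(f - y)


def bce_loss(f, y):
    """Binary cross-entropy, used here only as a non-symmetric contrast."""
    return -(y * math.log(f) + (1 - y) * math.log(1 - f))


def symmetric_sum(loss, f):
    """l(f, +1) + l(f, 0); constant in f if and only if the loss is symmetric."""
    return loss(f, 1) + loss(f, 0)


grid = [0.1, 0.25, 0.5, 0.75, 0.9]  # assumed evaluation points
abs_sums = [symmetric_sum(absolute_loss, f) for f in grid]
bce_sums = [symmetric_sum(bce_loss, f) for f in grid]

print(abs_sums)  # the same value at every grid point
print(bce_sums)  # varies with f
```

For the absolute loss the sum is $|f-1|+|f-0| = 1$ whenever $f \in [0,1]$, so $C = 1$, whereas the cross-entropy sum $-\log f - \log(1-f)$ changes with $f$ and therefore does not satisfy the condition.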

AC Meta-Review

2023-12-09

The paper claims that the class-priors of the categories differ, and it aims to align the estimated class-priors with the actual class-priors as training progresses. However, the paper is mostly based on empirical study without theoretical guarantee. The key to the proposed method is the threshold, and the technical contribution of this paper is too limited. Reviewers also find that Theorem 4.2 tries to present the convergence of the empirical risk minimizer, but it seems that the empirical risk minimizer does not converge to the true risk minimizer. I think the authors did not address these key concerns.

Why not a higher score

N/A.

Why not a lower score

N/A.

Final decision: Reject

2024-01-16

Reject