PaperHub
Rating: 5.3 / 10 · Decision: Rejected · 3 reviewers
Individual ratings: 6, 5, 5 (min 5, max 6, std 0.5)
Confidence: 3.7
ICLR 2024

Partial Optimal Transport for Open-set Semi-supervised Learning

OpenReview · PDF
Submitted: 2023-09-19 · Updated: 2024-02-11

Abstract

Keywords
Open-set problem, Optimal transport, Semi-supervised learning, Out-of-distribution detection

Reviews and Discussion

Official Review
Rating: 6

The paper considers an open-set semi-supervised learning where there potentially are “outliers” in the unlabeled data distribution. The paper provides a novel loss function inspired by Partial optimal transport to handle OOD detection and demonstrates the effectiveness and robustness of the proposed method on multiple datasets.

Strengths

  1. Strong empirical performance
  2. Various ablations suggest that the method is robust and has lower computation time than other baselines.
  3. A connection to optimal transport is intuitive

Weaknesses

  1. Lack of clarity in writing. I found it hard to understand the main idea of the paper until page 6. The authors mention in the abstract/introduction that a mass score function (MSF) measures the likelihood of unlabeled samples being outliers, yet the paper does not explain how this is related to OT/POT, and it's not clear to me how OT is beneficial to the OSSL task. The following sentence was helpful for me to understand the idea: "we can utilize the transport mass as a reliable OOD score, where a sample with a smaller value of mass score function tends to be an OOD sample". However, it only appears on page 6. It would be nice to provide something like this earlier in the paper, together with a clear problem setting.

  2. Many definitions and acronyms are used before being defined (see questions)

  3. The definition of the distribution in Equation (8) does not seem mathematically valid: by adding a factor of k, the sum of the probability mass is greater than 1, so it is not a valid probability distribution.

I am willing to increase the score if these issues are addressed.

Questions

  1. OSR is not defined; MSR is mentioned before it is defined in Section 2.2.
  2. Section 4.1: the distributions L and U are not defined.
  3. Section 4.1, "the features of these d-dimensional samples": do you mean the features or the samples are d-dimensional?
  4. The notation in Equation (7) is not clear. Does $T\mathbf{1}_{\mathcal{L}} \leq \mathcal{L}$ mean point-wise less than or equal to?
  5. In Algorithm 3, L_x and L_u are not defined in the main text.
  6. What do the numbers 50, 100, 500 refer to in Table 5?
  7. “magnituWde” -> “magnitude” ?
Comment

Thanks for your valuable comments, which help us improve the manuscript.

W1: The clarity of writing.

We will reinforce the connection between MSF and OT in the introduction and explain how the MSF is computed. Additionally, we will add the following description of MSF to the introduction to improve the presentation.

"MSF is calculated through the optimal transport plan of partial optimal transport, where the MSF for an unlabeled sample is determined by the sum of its mass transport to the labeled distribution. Samples with higher scores are more likely to be identified as ID samples."

W2: Many definitions and acronyms are used before being defined.

Q1. OSR means open-set recognition. We will revise the presentation in the updated manuscript.

Q2. The formal definitions of $\mathcal{L}$ and $\mathcal{U}$ are shown in Equation 8. We should have introduced them earlier, in Section 4.1.

Q3. Yes. We should say "The feature of each sample is d-dimensional".

Q4. Yes. All inequalities in Eq.7 mean point-wise less than or equal to.

Q5. The definitions of L_x and L_u can be found in Section A.1. We will add the corresponding descriptions to Algorithm 3.

Q6. The numbers (50/100/400) denote the number of labeled samples for each class. We will explain this in the header of Table 5.

Q7. Yes, it is a typo.

W3: The distribution in Eq. 8 is not mathematically valid.

We acknowledge that calling $\mathcal{U}$ a distribution is inappropriate; rather, it is the k-fold of a distribution. Since our paper focuses on modeling OOD detection in OSSL, this definition remains valid within our technical framework. In the revised version, we will use the term "k-fold unlabeled distribution" instead of "unlabeled distribution" to describe $\mathcal{U}$.

Comment

The author has addressed my concern and I will be increasing my score.

Official Review
Rating: 5

The paper tackles the open-set semi-supervised learning (OSSL) challenge, specifically aiming to frame the treatment of out-of-distribution data (OOD) as a partial optimal transport (POT) problem. It introduces a mass score function (MSF) designed to evaluate the likelihood of a sample being an outlier during training. Additionally, the paper presents an OOD loss, allowing conventional semi-supervised learning methods to be adapted for OSSL scenarios via end-to-end training. The authors compare their proposed method against MTCF, T2T, and OpenMatch, on CIFAR10, CIFAR100, and Imagenet-30, showing superior performance.

Strengths

  • Semi-supervised learning is a significant area of research in machine learning, aiming to enhance performance by effectively utilizing both labeled and unlabeled data.

  • The OOD angle used in the paper makes it interesting to a broader audience.

  • Incorporating (partial) optimal transport as a framework is a novel and innovative aspect of this work.

Weaknesses

  • Respectfully, the novelty of the method is limited and the paper overclaims novelty.

    • For instance, one main contribution of this paper is the introduction of the "novel" MSF score. The score function essentially corresponds to what is commonly referred to as "barycentric projection," a concept well-documented in both classical and contemporary optimal transport (OT) theory literature (for reference, please see sources such as Ambrosio et al.). In this context, it is more appropriate to state that the paper utilizes classical concepts from OT theory to address new application challenges. The sentence “we devise a new score function” is more or less misleading.
  • The parameter k, which controls the amount of redundancy, plays a crucial role in the methodology presented in the paper. Varying the value of k leads to significant variations in the OOD detection results. It would enhance the paper's quality to delve into the process of determining this value. Specifically, the paper could explore methods for assessing how much data should be classified as outliers before running the algorithm.

  • Some implementation details and important ablation studies are missing from the paper. For instance, the utilized batch size and the effect of having a small batch size (which presumably reduces the performance of the proposed method) are missing from the paper.

  • The rationale behind the decision to use (10) instead of the original constraint (7), i.e., enforcing all mass from $\mathcal{L}$ to be transported to a subset of $\mathcal{U}$, is not well presented. Couldn't the unlabeled data be missing an entire class? In that case, the missing classes in $\mathcal{L}$ must be destroyed, i.e., not transported, and the constraints in (7) would allow that. I believe this can easily happen in mini-batch training.

  • Some of the very relevant references are missing from the paper:

    • Rizve, M.N., Kardan, N. and Shah, M., 2022, October. Towards realistic semi-supervised learning. In European Conference on Computer Vision (pp. 437-455). Cham: Springer Nature Switzerland.
    • Xu, R., Liu, P., Zhang, Y., Cai, F., Wang, J., Liang, S., Ying, H. and Yin, J., 2020. Joint Partial Optimal Transport for Open Set Domain Adaptation. In IJCAI (pp. 2540-2546).
    • Yang, Yucheng, Xiang Gu, and Jian Sun. "Prototypical Partial Optimal Transport for Universal Domain Adaptation." (2023).

Questions

  • For Algorithm 2, in the line for the OOD score, shouldn't the formula be $\mathrm{Score}_{\mathcal{U}} = \mathbf{T}^{\top}\mathbf{1}_n$?

  • The transportation cost is set to the cosine distance. The definition $d(x,y) = 1 - \mathrm{Cosine}(x,y)$ is only a true metric if $x, y \in \mathbb{S}^{d-1}$, i.e., $x$ and $y$ are unit vectors. Is your backbone returning unit vectors? Even if that is the case, and for the sake of mathematical rigor, I suggest adhering to the Euclidean distance, which is equivalent to the cosine distance when $x$ and $y$ are unit vectors and is still sensible when they are not!

Comment

Thanks for your thoughtful and constructive feedback.

W1: The novelty of the method is limited and the paper overclaims novelty.

We acknowledge that barycentric projection is a concept in the OT literature, and we recognize that the functionality of the proposed mass score function can be mathematically interpreted through barycentric projection.

Having clarified this, we would like to emphasize that we did not claim ownership of the concept. Our primary contribution lies in modeling the OOD detection problem under the OSSL setting as a POT problem, and in training it within the SSL setting in an end-to-end manner. The Mass Score Function (MSF) is just one facet of the contributions of this paper, aiding in assessing the likelihood of a test input being an OOD sample.

In summary, our claim is that the proposal of the problem modeling and its associated techniques are novel in OOD detection under OSSL settings. We value your suggestions, and we agree that utilizing the concept of barycentric projection for explaining the proposed score function would enhance mathematical rigor, clarify intuitions, and improve presentation.

W2: The issue regarding the parameter k.

In the ablation study, we reported the performance under different values of k. Our findings reveal that our algorithm is robust to the hyperparameter k. Consequently, we use a consistent setting of k = 2 across all experiments in Tables 1 to 3, consistently achieving excellent results despite varying proportions of ID samples. To further assess the robustness of k, we conducted additional experiments with k = 5 and k = 10.

k        1.5    2.0    2.5    5.0    10.0
Acc      92.3   92.3   90.1   92.1   92.0
AUROC    98.8   99.6   99.0   97.8   97.7

The results demonstrate that the method is robust to the value of k over a wide range.

W3: Implementation details and ablation study on batch size.

In the revised version, we will include a description of the experiment settings. Our experimental setup is as follows:

  • Batch size B = 64.
  • Relative batch size for unlabeled data µ = 2 in all experiments, for fair comparison.
  • 1024 iterations per epoch, for a total of 512 epochs.
  • Optimizer: SGD with Nesterov momentum 0.9.
  • Learning rate η = 0.03 with a cosine annealing schedule.

A batch size of 64 with µ = 2 is relatively small for semi-supervised learning.
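
As a rough illustration of these settings (an assumption-laden sketch, not the authors' code), the optimizer and schedule could be set up in PyTorch as follows; `model` is a placeholder backbone:

```python
import torch

model = torch.nn.Linear(128, 10)  # placeholder backbone for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.03,
                            momentum=0.9, nesterov=True)
# cosine annealing over 512 epochs x 1024 iterations per epoch
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=512 * 1024)
```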

We compare our method with other SSL and OSSL methods at smaller batch sizes on the CIFAR100 dataset.

Batch size    64              32              16
              Acc    AUROC    Acc    AUROC    Acc    AUROC
FixMatch      78.2   57.3     75.7   58.2     71.8   64.0
OpenMatch     72.3   87.0     69.8   85.0     66.2   84.7
POT           77.6   87.5     77.0   86.5     73.4   78.0

For a detailed analysis of the experimental results, please refer to W4.

W4: Label missing problem in mini-batch training.

The situation you have highlighted is indeed a reality. With small batch sizes, $\mathcal{L}$ and $\mathcal{U}$ may not sufficiently cover all classes, posing challenges for OOD detection. However, semi-supervised training runs for a large number of iterations, and occasional subpar detection performance in certain mini-batches has minimal impact on the overall performance.

Notably, with a batch size of 32, our method shows a slight decline in OOD detection performance but outperforms FixMatch in close-set accuracy. Conversely, with a batch size of 16, there is a noticeable drop in OOD detection, yet our method resiliently maintains commendable close-set accuracy.

W5: Relevant references are missing.

We will incorporate a discussion of the relevant papers in the Related Work section.

References [2-3] address Open-Set Domain Adaptation problems using POT. Reference [2] uses the mean cost of transport to control the proportion of POT, while Reference [3] provides an approximate estimation of the transport ratio in POT. Compared to these existing POT-based methods [2-3], our method stands out by not requiring prior knowledge about the transport ratio of POT. Reference [1] proposes a pseudo-label-based approach (TRSSL) to tackle SSL in an open-world setting; TRSSL aims to assign labels to outliers at test time, which is a different semi-supervised setting.

Q1: The formula of the score function.

Thanks for pointing it out. We agree it should be $\mathbf{T}^{\top} \mathbf{1}_{n}$. We will address this in the revised version.

Q2: The use of cosine distance.

In our experiment, we normalize the vector before calculating the cosine distance. However, our method is also applicable when the OT cost is represented by the Euclidean distance. We conducted experiments on CIFAR10 using Euclidean distances, with 50 labeled samples for each class.

         Euclidean    Cosine
Acc      92.5         92.3
AUROC    99.6         99.6

The results indicate that our proposed method is a universal approach, effective across different metrics.
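
For reference, a small NumPy sketch of the two cost choices compared above (illustrative only; on unit vectors the squared Euclidean distance equals 2(1 - cosine similarity), which is why the two behave similarly):

```python
import numpy as np

def cost_matrix(x, y, metric="cosine"):
    """Pairwise transport costs between feature sets x (n, d) and y (m, d)."""
    if metric == "cosine":
        xn = x / np.linalg.norm(x, axis=1, keepdims=True)
        yn = y / np.linalg.norm(y, axis=1, keepdims=True)
        return 1.0 - xn @ yn.T                      # d(x, y) = 1 - cos(x, y)
    # Euclidean distance via ||x||^2 + ||y||^2 - 2 x.y
    sq = (x**2).sum(1)[:, None] + (y**2).sum(1)[None, :] - 2 * x @ y.T
    return np.sqrt(np.maximum(sq, 0.0))
```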

Comment

We would like to express our sincere appreciation for your valuable comments on our manuscript. We have carefully considered each of your suggestions and provided detailed responses. As the deadline for the rebuttal is approaching, we want to ensure if our responses adequately address your concerns. We are open to any further discussion.

Thank you again for your time and consideration.

Comment

I thank the authors for their rebuttal and for addressing my comments. Respectfully, I find the response to W4 insufficient, since it lacks a logical explanation for moving from the broader formulation, Eq. (7), which addresses the issue, to the more restricted approach, Eq. (10), which suffers from the raised issue. Conducting an ablation study on this aspect would have been beneficial. Additionally, if the performance of the method is better with approach (10) than with (7), providing intuition on why this is the case would help the reader. Lastly, while the provided experiment on the parameter k is absolutely needed, it only partially shows the effect of this parameter, as presumably different values should be used for larger or smaller batch sizes.

Overall, I think the paper has merit, but many design choices still seem ad-hoc without a strong rationale behind them. Hence, I adhere to my original evaluation of the paper.

Comment

Our design primarily aims at enhancing the robustness of OOD detection in OSSL. We treat the labeled distribution as a baseline distribution and keep the weights of all samples in $\mathcal{L}$ equal, thus preserving the information of the distribution and enhancing the robustness of OOD detection. To put it more intuitively, the hyperparameter k in (10) exhibits better robustness than the hyperparameter s in (7).

We conducted the following experiment: in (10), we set k to 10, which is equivalent to setting the transport ratio s to 0.1 in (7). Due to time limitations, we compared the results of the two methods after 65536 iterations, and we will update the experimental results later.

Official Review
Rating: 5

This paper focuses on studying the problem of Open-Set Semi-Supervised Learning (OSSL). The authors present a novel framework that transforms the OSSL problem into the Partial Optimal Transport (POT) problem. The authors aim to leverage the benefits of POT to detect the OOD samples. Empirically, POT achieves competitive performance on various benchmarks.

Strengths

  • This paper is straightforward and well-written. It is quite easy to follow.
  • The paper solves Open-Set Semi-Supervised Learning (OSSL), an important ML problem in practice.
  • Empirical results demonstrate that POT can achieve SOTA results on several benchmarks.

Weaknesses

  • Based on the description provided, it is possible that the article's approach could be categorized as an auxiliary OOD classifier approach, similar to methods such as MTCF, T2T, and OpenMatch. What is the difference between the proposed method and them? A more detailed discussion may be required.
  • The author's explanation for why POT is more effective at detecting OOD is not adequately provided.
  • In similar settings, Partial Optimal Transport (POT) has also found applications, such as in Open-Set Domain Adaptation and Positive-Unlabeled Learning. The authors should consider discussing the connections and distinctions between their work and the research presented in these articles [1-3]. What are the strengths of POT for Open-Set Semi-Supervised Learning? Are there any special designs for Open-Set Semi-Supervised Learning compared with other tasks, such as PU learning and Open-Set Domain Adaptation?
  • The authors should offer an explanation for why the FixMatch algorithm yields better results compared to certain Open-Set Semi-Supervised Learning (OSSL) methods.
  • Many experiment details are missing from the paper, such as the specific parameter settings for FixMatch and the implementation specifics of the T2T algorithm.
  • Table 3 lacks some of the comparative algorithms present in Table 1.
  • There is an inconsistency in the notation of the k in Algorithm 3.
  • On page 8 in the experimental section, $L_{ood}$ --> $\lambda_{ood}$.
  • What is "graph" at the end of Subsection 2.2?

[1] Partial Optimal Transport with Applications on Positive-Unlabeled Learning, NeurIPS 2020.

[2] Joint Partial Optimal Transport for Open Set Domain Adaptation, IJCAI 2020.

[3] Prototypical Partial Optimal Transport for Universal Domain Adaptation, AAAI 2023.

Questions

Please see the weakness for details.

Comment

Thanks for your insightful comments, which help us improve the manuscript.

W1: Differences from other OSSL methods.

Our approach focuses on directly utilizing the distribution information. Specifically, we use POT between labeled and unlabeled samples to train a binary OOD classifier, and we treat OSSL as a multi-task learning problem.

In contrast, T2T and OpenMatch build their OOD classifiers using b (where b is the number of classes) one-vs-all classifiers, relying on labeled data and incorporating unlabeled data to improve the classifier performance. However, these methods exhibit a noticeable performance decline, especially when dealing with a large number of classification categories.

Similar to our methodology, MTCF employs a binary OOD classifier but utilizes a curriculum learning framework for model training. MTCF applies Otsu thresholding to select unlabeled samples, potentially leading to error accumulation that may impact the overall model performance.

We will add corresponding discussions to the manuscript in a revised version.

W2: Why POT is more effective at detecting OOD is not adequately provided.

In the third paragraph of Sec 4.3, we explored the effectiveness of POT. To complement our methodology, we will incorporate the following details into the Methods section.

In the OSSL context, we assume that ID samples in the unlabeled dataset and the labeled samples are drawn from a latent ID distribution, while OOD samples in the unlabeled dataset are drawn from a latent OOD distribution. In this setting, the optimal transport cost between ID samples in the unlabeled dataset and labeled samples tends to be low.

The cost of transporting unit mass between ID samples (including labeled samples and ID samples in the unlabeled dataset) is small. In POT, only a fraction of the mass in the unlabeled distribution can be transported. To minimize the overall transport cost, ID samples in the unlabeled dataset therefore receive larger transport mass. This characteristic aids in identifying OOD samples based on the transport mass assigned to each individual sample.
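
As an illustrative toy check of this intuition (synthetic Gaussian features and the Python Optimal Transport library's partial OT solver; a sketch under these assumptions, not the paper's experiment): unlabeled ID samples, being cheap to reach from the labeled cluster, receive nearly all of the transported mass, while distant OOD samples receive almost none.

```python
import numpy as np
import ot
import ot.partial

rng = np.random.default_rng(0)
feat_l   = rng.normal(0.0, 1.0, size=(64, 8))   # labeled ID cluster
feat_id  = rng.normal(0.0, 1.0, size=(64, 8))   # unlabeled ID samples
feat_ood = rng.normal(6.0, 1.0, size=(64, 8))   # unlabeled OOD samples, far away
feat_u   = np.vstack([feat_id, feat_ood])

a = np.full(64, 1.0 / 64)      # labeled marginal, total mass 1
b = np.full(128, 2.0 / 128)    # assumed k = 2 "fold" unlabeled marginal, total mass 2
C = ot.dist(feat_l, feat_u)    # squared Euclidean cost matrix
T = ot.partial.partial_wasserstein(a, b, C, m=a.sum())

scores = T.sum(axis=0)         # transport mass received by each unlabeled sample
print(scores[:64].mean(), scores[64:].mean())   # ID mean is much larger than OOD mean
```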

W3: Discussion about relevant references.

Reference [1] proposes a partial optimal transport method for positive-unlabeled learning; it requires the proportion of positive samples π in the unlabeled set, which is often unavailable in real open-set scenarios. References [2-3] address Open-Set Domain Adaptation problems using partial optimal transport. Reference [2] uses the mean cost of transport to control the proportion of partial optimal transport, which is not robust in the real world. Reference [3] provides an approximate estimation of the transport ratio in partial optimal transport.

Strengths of POT for OSSL: As outlined in W1, our method presents a simpler and more effective approach to detect outliers in OSSL using POT.

Special designs for OSSL: Existing POT-based methods [1-3] rely on an estimate of the transport ratio. However, in the context of OSSL, it is impractical to estimate the transport ratio for each mini-batch. To address this challenge, we introduce redundant mass for partial optimal transport, controlled by a robust parameter. This approach removes the need to estimate the transport ratio in every iteration, enhancing the stability and efficiency of OOD detection.

W4: Why the FixMatch algorithm yields better results.

In the OSSL scenario, while FixMatch may not generate accurate pseudo-labels for OOD unlabeled samples, it nonetheless retains the capability to generate high-quality pseudo-labels for ID unlabeled samples. Consequently, the close-set accuracy remains minimally affected by the presence of OOD samples.

W5: Lack of experiment details.

In the revised version, we will include a description of the experiment settings.

We reproduce our baselines using their official code, following the same parameter settings as the original papers. For a fair comparison, we unify the semi-supervised learning parameters as follows:

  • Each experiment is run on a single NVIDIA A100 GPU.
  • Batch size B = 64.
  • Relative batch size for unlabeled data µ = 2 in all experiments, for fair comparison.
  • 1024 iterations per epoch, for a total of 512 epochs.
  • Optimizer: SGD with Nesterov momentum 0.9.
  • Learning rate η = 0.03 with a cosine annealing schedule.

W6: Baseline on Imagenet30.

We supplemented additional experiments and obtained the following results.

            SSL         OSSL
            FixMatch    MTCF    T2T     OpenMatch    POT
Acc         91.7        86.4    88.8    89.6         90.3
AUROC       45.1        93.8    55.7    96.4         97.4

Compared to OSSL methods, POT achieves the SOTA performance on the Imagenet30 dataset in terms of inlier accuracy and AUROC.

W7-W9: Issues in the paper.

Thank you very much for pointing out the issues. We will address them in the revised version.

Comment

The authors partly answer my questions. I'll keep my rating.

Comment

Thank you for investing your time in reviewing our response. We are delighted to learn that we have addressed some of your questions. We eagerly anticipate further in-depth discussions with you and would be immensely grateful if you could share any remaining concerns you may have.

Comment

Thank you, reviewers, for your work in evaluating this submission. The reviewer-author discussion phase takes place from Nov 10-22.

If you have any remaining questions or comments regarding the rebuttal or the responses, please express them now. At the very least, please acknowledge that you have read the authors' response to your review.

Thank you, everyone, for contributing to a fruitful, constructive, and respectful review process.

AC

Comment

Based on the valuable feedback from the reviewers, we have revised and submitted our paper. If the issues raised by the reviewers have not been adequately addressed or if new concerns have emerged, we welcome further in-depth discussions. Your feedback is highly important to us.

AC Meta-Review

This paper considers open-set semi-supervised learning, where the unlabeled data may contain novel categories that were not present in the labeled data. The authors propose a new method that integrates the open-set recognition problem into a partial optimal transport formulation. The empirical results suggest that the proposal effectively filters out the out-of-distribution data and improves the performance of SSL.

This work addresses an important problem, and the authors' introduction of partial optimal transport as a mathematical tool is reasonable and could have a positive impact on this type of research problem. Nevertheless, there are still some concerns regarding the novelty of the proposal and the overall presentation. Partial optimal transport has been applied to positive-unlabeled learning and open-set domain adaptation, which are highly related areas to semi-supervised learning, and POT's ability to support OOD detection, which the current manuscript emphasizes, has already been validated in those fields. Considering the field of Open-Set Semi-Supervised Learning itself, what advantages does POT have over other OOD detection methods, and why is it able to serve the SSL scenario with limited labeled data? The current manuscript provides limited insight into these important questions.

Overall, the authors have provided a POT-based Open-Set SSL method and demonstrated good empirical performance. We encourage the authors to further explore the underlying reasons or theoretical results to increase the impact of this work on the community. Therefore, I recommend rejecting this paper for now.

Why not a higher score

The novelty of the proposal is limited. The detection of out-of-distribution data in Open-Set Semi-Supervised Learning and the ability of POT to perform OOD detection are already established conclusions in related fields. A more appropriate statement for this work would be that it extends the application of POT to the SSL field. The work itself does not provide much analysis or theoretical results on why POT succeeds in Open-Set SSL, which may limit its impact on the Semi-Supervised Learning area. There are also some concerns regarding the overall presentation and the discussion of relevant literature. I recommend that the authors take the suggestions provided by the reviewers to further enhance this paper.

Why not a lower score

N/A

Final Decision

Reject