PaperHub
Rating: 6.0 / 10 · Poster · 4 reviewers (min 6, max 6, std 0.0)
Individual ratings: 6, 6, 6, 6
Confidence: 3.5 · Correctness: 2.5 · Contribution: 2.5 · Presentation: 2.8
ICLR 2025

Re-Evaluating the Impact of Unseen-Class Unlabeled Data on Semi-Supervised Learning Model

OpenReview · PDF
Submitted: 2024-09-20 · Updated: 2025-02-11
TL;DR

Re-Evaluating the Impact of Unseen-Class Unlabeled Data on Semi-Supervised Learning Model

Abstract

Keywords
Safe Semi-Supervised Learning, Unseen-Class Unlabeled Data

Reviews and Discussion

Review
6

This paper proposes an evaluation framework for safe semi-supervised learning (SSL) with unseen-class data. The paper observes that previous works fix the total number of seen- and unseen-class samples, which may not reflect the true influence of unseen-class data. Therefore, the authors fix the number of seen-class samples and vary the quantity of unseen-class data to investigate its influence. Extensive experiments are performed with thorough analysis.

Strengths

  • The identified shortcomings of previous evaluation approaches are new to the SSL field.
  • Many experiments are performed, providing a solid and comprehensive evaluation for robust SSL.
  • The experimental analysis is also very comprehensive.
  • The paper is well written and easy to understand.

Weaknesses

  • My main concern is the practicality of the framework. The main purpose of the paper is to test the influence of increasing amounts of unseen-class data on the classification performance on seen classes. The proposed metrics are mainly first-order statistics of the classification performance, which capture how performance changes as more unseen-class data is added. However, the main interest of SSL with unseen-class data in real-world applications may not lie in first-order statistics. In real applications, we care about the performance at a given amount of unseen-class data rather than the performance change with more data, although many papers provide figures with different amounts of unseen-class data. Therefore, although the framework is valid for different sizes of unseen-class data, its practicability for real applications may be limited.
  • Only CIFAR datasets were used for experiments. Since previous SSL benchmarks provide more datasets from different modalities, such as text or tabular data, it is beneficial to consider more text and tabular datasets.
  • The standard deviations of the experimental results should also be included.
  • Some typos should be checked. For example, in Eq. (1), it should be argmin. In Line 431, it should be Table 5.

Questions

  • Are the experimental results in tables the average accuracies with multiple seeds?
  • Is C_ib the ratio between unseen and seen data?
  • Is each entry in Table 6 the mean of several C, i.e., the mean of each row?
Comment
  1. Tabular dataset:
    • Forest: A structured tabular dataset, used to investigate the impact of unseen class data in non-image modalities.
| Method | r=0 | r=0.2 | r=0.4 | r=0.6 | r=0.8 | r=1.0 | R_slope | GM | BAD | WAD | P_{AD≥0} |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MeanTeacher | 0.619 | 0.617 | 0.618 | 0.619 | 0.620 | 0.619 | 0.001 | 0.004 | 0.006 | -0.009 | 0.600 |
| FreeMatch | 0.591 | 0.603 | 0.602 | 0.593 | 0.595 | 0.607 | 0.007 | 0.033 | 0.062 | -0.041 | 0.600 |
| VAT | 0.609 | 0.613 | 0.614 | 0.611 | 0.613 | 0.611 | 0.000 | 0.008 | 0.019 | -0.014 | 0.600 |
| UASD | 0.605 | 0.604 | 0.606 | 0.598 | 0.598 | 0.603 | -0.004 | 0.016 | 0.023 | -0.038 | 0.600 |
| MTCF | 0.590 | 0.592 | 0.594 | 0.596 | 0.596 | 0.598 | 0.007 | 0.013 | 0.010 | 0.001 | 1.000 |
| OpenMatch | 0.615 | 0.617 | 0.610 | 0.612 | 0.613 | 0.615 | -0.001 | 0.011 | 0.011 | -0.031 | 0.800 |
| Fix_A_Step | 0.605 | 0.604 | 0.610 | 0.607 | 0.612 | 0.616 | 0.010 | 0.022 | 0.032 | -0.018 | 0.600 |
| CAFA | 0.604 | 0.602 | 0.604 | 0.602 | 0.611 | 0.603 | 0.003 | 0.012 | 0.041 | -0.038 | 0.400 |
| PiModel | 0.593 | 0.595 | 0.593 | 0.599 | 0.597 | 0.601 | 0.007 | 0.015 | 0.029 | -0.009 | 0.600 |
| PseudoLabel | 0.610 | 0.603 | 0.606 | 0.607 | 0.604 | 0.603 | -0.004 | 0.011 | 0.010 | -0.032 | 0.400 |
| ICT | 0.610 | 0.613 | 0.607 | 0.611 | 0.611 | 0.612 | 0.001 | 0.008 | 0.019 | -0.029 | 0.600 |
| MixMatch | 0.608 | 0.613 | 0.612 | 0.616 | 0.604 | 0.610 | -0.002 | 0.019 | 0.033 | -0.061 | 0.600 |
| FixMatch | 0.577 | 0.580 | 0.580 | 0.579 | 0.584 | 0.586 | 0.008 | 0.016 | 0.027 | -0.006 | 0.800 |
| FlexMatch | 0.584 | 0.591 | 0.591 | 0.590 | 0.586 | 0.593 | 0.004 | 0.016 | 0.036 | -0.018 | 0.600 |
| SoftMatch | 0.592 | 0.587 | 0.591 | 0.592 | 0.593 | 0.600 | 0.008 | 0.015 | 0.035 | -0.022 | 0.800 |
Comment

Q4: Some typos should be checked. For example, in Eq. (1), it should be argmin. In Line 431, it should be Table 5.

A4: Thank you for pointing out the typos. We have carefully reviewed the manuscript to correct these errors. Specifically, we have updated Eq.(1) to use "argmin" as suggested, and we have corrected the reference to "Table 5" in Line 431. We appreciate your attention to these details and have ensured that the manuscript is free from such typographical errors in the revised version.

Q5: Are the experimental results in tables the average accuracies with multiple seeds?

A5: Thank you for your question. Yes, the experimental results presented in the tables are the average accuracies obtained from multiple runs with different random seeds. Specifically, we conducted the experiments using seeds 0, 1, and 2, and the reported results are the average of these runs. This approach helps ensure the robustness and reliability of the results.
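The per-seed aggregation described above can be sketched as follows (an illustrative snippet; `aggregate` is our name, and the three accuracies are example values for seeds 0, 1, and 2):

```python
# Illustrative sketch of averaging accuracies over random seeds, as the
# rebuttal describes; names and numbers are ours, not from the paper.
from statistics import mean, stdev

def aggregate(acc_by_seed):
    """Return (mean, sample standard deviation) over per-seed accuracies."""
    return mean(acc_by_seed), stdev(acc_by_seed)

m, s = aggregate([0.677, 0.668, 0.664])  # seeds 0, 1, 2
print(f"{m:.3f} ± {s:.3f}")  # prints 0.670 ± 0.007
```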

Q6: Is C_ib the ratio between unseen and seen data?

A6: Thank you for your question. C_ib is not the ratio between unseen and seen data. C_ib, or the imbalance factor, controls the distribution of samples within the unseen classes themselves. Specifically, it determines how the sample size for each class varies, creating different degrees of class imbalance within the unseen classes.

As mentioned earlier, when C_ib is small, the earlier unseen classes will have fewer samples, and later classes will have more, leading to higher class imbalance. When C_ib is larger, the sample sizes across the unseen classes become more balanced.

It is important to note that C_ib does not directly refer to the ratio between unseen and seen data but instead to the imbalance within the unseen classes. We have enhanced the description of C_ib in the revised manuscript to make this clearer.
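To make the role of the imbalance factor concrete, here is a hedged sketch; the exponential-decay formula below is a common long-tailed-learning convention and our assumption, not a formula taken from the paper:

```python
# Hedged sketch of how an imbalance factor such as C_ib could shape the
# per-class sample counts of the K unseen classes (formula is assumed).
def class_counts(n_max, K, c_ib):
    # Class k gets n_max * c_ib ** ((K - 1 - k) / (K - 1)) samples: the
    # first class is scaled by c_ib (rare), the last by 1 (frequent), and
    # c_ib -> 1 makes all classes balanced.
    return [round(n_max * c_ib ** ((K - 1 - k) / (K - 1))) for k in range(K)]

print(class_counts(1000, 5, 0.01))  # strongly imbalanced: [10, 32, 100, 316, 1000]
print(class_counts(1000, 5, 0.20))  # milder: [200, 299, 447, 669, 1000]
```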

Q7: Is each entry in Table 6 the mean of several C, i.e. the mean of each row?

A7: Thank you for your question. F_avg represents the mean of each row, while A_avg represents the mean of each column. We have enhanced the description of each entry in Table 6 in the revised manuscript to make this clearer.

Comment
| Method | C_ib=0.01 | C_ib=0.02 | C_ib=0.05 | C_ib=0.10 | C_ib=0.20 |
|---|---|---|---|---|---|
| PseudoLabel | 0.660±0.014 | 0.660±0.017 | 0.663±0.014 | 0.658±0.015 | 0.662±0.013 |
| PiModel | 0.647±0.012 | 0.649±0.008 | 0.642±0.010 | 0.646±0.008 | 0.642±0.013 |
| FlexMatch | 0.710±0.019 | 0.703±0.021 | 0.714±0.010 | 0.706±0.017 | 0.717±0.014 |
| UDA | 0.620±0.035 | 0.647±0.019 | 0.634±0.029 | 0.615±0.043 | 0.619±0.032 |
| SoftMatch | 0.730±0.007 | 0.731±0.014 | 0.725±0.009 | 0.721±0.011 | 0.698±0.033 |
| VAT | 0.681±0.016 | 0.682±0.013 | 0.682±0.012 | 0.678±0.012 | 0.675±0.019 |
| FreeMatch | 0.730±0.013 | 0.723±0.016 | 0.674±0.067 | 0.670±0.063 | 0.677±0.029 |
| ICT | 0.622±0.010 | 0.621±0.011 | 0.621±0.009 | 0.622±0.009 | 0.622±0.008 |
| UASD | 0.618±0.008 | 0.618±0.008 | 0.617±0.008 | 0.618±0.008 | 0.617±0.009 |
| MTCF | 0.722±0.014 | 0.720±0.011 | 0.722±0.016 | 0.721±0.011 | 0.722±0.013 |
| CAFA | 0.647±0.008 | 0.650±0.008 | 0.643±0.011 | 0.648±0.013 | 0.646±0.014 |
| OpenMatch | 0.613±0.068 | 0.613±0.074 | 0.620±0.077 | 0.650±0.006 | 0.639±0.007 |
| Fix_A_Step | 0.685±0.038 | 0.638±0.066 | 0.632±0.060 | 0.633±0.064 | 0.628±0.066 |

We have updated the experimental results mentioned above in the Appendix of the revised manuscript.

Comment
| Method | C_i=5 (near) | C_i=6 (near) | C_i=7 (near) | C_i=8 (near) | C_i=9 (near) | C_i=5 (far) | C_i=6 (far) | C_i=7 (far) | C_i=8 (far) | C_i=9 (far) |
|---|---|---|---|---|---|---|---|---|---|---|
| PseudoLabel | 0.648±0.014 | 0.649±0.015 | 0.650±0.015 | 0.650±0.016 | 0.648±0.008 | 0.614±0.015 | 0.618±0.019 | 0.626±0.012 | 0.621±0.012 | 0.623±0.013 |
| PiModel | 0.638±0.009 | 0.642±0.007 | 0.639±0.008 | 0.637±0.001 | 0.622±0.008 | 0.621±0.010 | 0.624±0.007 | 0.625±0.008 | 0.621±0.007 | 0.624±0.005 |
| FlexMatch | 0.656±0.027 | 0.640±0.015 | 0.682±0.013 | 0.662±0.010 | 0.658±0.034 | 0.610±0.059 | 0.586±0.055 | 0.619±0.042 | 0.606±0.049 | 0.616±0.043 |
| UDA | 0.627±0.041 | 0.646±0.019 | 0.627±0.037 | 0.608±0.017 | 0.620±0.017 | 0.644±0.017 | 0.640±0.021 | 0.640±0.018 | 0.643±0.016 | 0.633±0.021 |
| SoftMatch | 0.654±0.030 | 0.665±0.067 | 0.725±0.006 | 0.687±0.013 | 0.671±0.043 | 0.646±0.027 | 0.650±0.022 | 0.639±0.031 | 0.644±0.032 | 0.642±0.033 |
| VAT | 0.654±0.003 | 0.632±0.016 | 0.657±0.021 | 0.623±0.026 | 0.606±0.028 | 0.564±0.007 | 0.565±0.012 | 0.577±0.018 | 0.568±0.007 | 0.581±0.008 |
| FreeMatch | 0.655±0.068 | 0.610±0.021 | 0.709±0.024 | 0.628±0.075 | 0.668±0.038 | 0.642±0.068 | 0.633±0.061 | 0.641±0.050 | 0.639±0.065 | 0.637±0.063 |
| ICT | 0.623±0.009 | 0.619±0.011 | 0.622±0.009 | 0.608±0.005 | 0.611±0.015 | 0.597±0.004 | 0.596±0.007 | 0.598±0.007 | 0.597±0.006 | 0.597±0.007 |
| UASD | 0.617±0.010 | 0.616±0.088 | 0.618±0.007 | 0.619±0.007 | 0.618±0.009 | 0.617±0.006 | 0.618±0.009 | 0.618±0.009 | 0.618±0.009 | 0.617±0.007 |
| MTCF | 0.703±0.006 | 0.685±0.026 | 0.697±0.014 | 0.689±0.026 | 0.723±0.007 | 0.632±0.014 | 0.573±0.013 | 0.619±0.010 | 0.596±0.056 | 0.596±0.025 |
| CAFA | 0.639±0.003 | 0.639±0.005 | 0.638±0.013 | 0.634±0.016 | 0.630±0.019 | 0.606±0.021 | 0.597±0.038 | 0.592±0.025 | 0.612±0.017 | 0.613±0.008 |
| OpenMatch | 0.636±0.023 | 0.546±0.038 | 0.648±0.015 | 0.566±0.027 | 0.642±0.024 | 0.607±0.047 | 0.559±0.006 | 0.569±0.046 | 0.629±0.046 | 0.622±0.050 |
| Fix_A_Step | 0.630±0.053 | 0.630±0.053 | 0.638±0.035 | 0.634±0.038 | 0.630±0.026 | 0.693±0.022 | 0.684±0.019 | 0.693±0.025 | 0.691±0.016 | 0.693±0.022 |
Comment

Q3: The standard deviations of the experimental results should also be included.

A3: Thank you for your suggestion. We agree that including standard deviations is important to provide a more comprehensive evaluation of the experimental results. We have added the standard deviations and the updated results are shown in the table below:

| Method | r_u=0 | r_u=0.2 | r_u=0.4 | r_u=0.5 | r_u=0.6 | r_u=0.8 | r_u=1.0 |
|---|---|---|---|---|---|---|---|
| PseudoLabel | 0.677±0.010 | 0.668±0.010 | 0.664±0.011 | 0.660±0.011 | 0.660±0.010 | 0.660±0.011 | 0.654±0.013 |
| PiModel | 0.667±0.011 | 0.651±0.011 | 0.644±0.009 | 0.637±0.011 | 0.636±0.016 | 0.636±0.008 | 0.630±0.007 |
| FlexMatch | 0.770±0.016 | 0.717±0.017 | 0.702±0.017 | 0.704±0.006 | 0.704±0.009 | 0.686±0.015 | 0.638±0.005 |
| MixMatch | 0.731±0.013 | 0.708±0.019 | 0.708±0.017 | 0.706±0.015 | 0.701±0.017 | 0.697±0.016 | 0.676±0.019 |
| UDA | 0.645±0.025 | 0.621±0.022 | 0.604±0.030 | 0.601±0.029 | 0.591±0.023 | 0.590±0.024 | 0.552±0.023 |
| SoftMatch | 0.764±0.016 | 0.710±0.026 | 0.689±0.012 | 0.697±0.032 | 0.692±0.018 | 0.682±0.014 | 0.607±0.064 |
| VAT | 0.710±0.005 | 0.677±0.015 | 0.663±0.008 | 0.663±0.012 | 0.657±0.014 | 0.651±0.017 | 0.639±0.007 |
| FreeMatch | 0.760±0.019 | 0.723±0.012 | 0.665±0.058 | 0.635±0.015 | 0.608±0.012 | 0.605±0.060 | 0.440±0.041 |
| ICT | 0.618±0.010 | 0.619±0.011 | 0.621±0.012 | 0.620±0.011 | 0.621±0.012 | 0.619±0.014 | 0.621±0.013 |
| UASD | 0.618±0.010 | 0.619±0.009 | 0.617±0.008 | 0.616±0.007 | 0.617±0.008 | 0.618±0.008 | 0.618±0.008 |
| MTCF | 0.772±0.012 | 0.743±0.009 | 0.731±0.017 | 0.723±0.010 | 0.725±0.015 | 0.716±0.015 | 0.692±0.015 |
| CAFA | 0.652±0.010 | 0.652±0.014 | 0.640±0.014 | 0.653±0.010 | 0.641±0.015 | 0.642±0.012 | 0.640±0.016 |
| OpenMatch | 0.713±0.010 | 0.606±0.075 | 0.586±0.032 | 0.595±0.033 | 0.579±0.027 | 0.517±0.030 | 0.473±0.044 |
| Fix_A_Step | 0.662±0.081 | 0.634±0.061 | 0.615±0.062 | 0.582±0.095 | 0.555±0.030 | 0.585±0.013 | 0.509±0.013 |
| Method | C_n=1 | C_n=2 | C_n=3 | C_n=4 | C_n=5 |
|---|---|---|---|---|---|
| PseudoLabel | 0.648±0.008 | 0.652±0.012 | 0.663±0.009 | 0.664±0.011 | 0.668±0.010 |
| PiModel | 0.622±0.008 | 0.628±0.006 | 0.640±0.014 | 0.642±0.014 | 0.651±0.011 |
| FlexMatch | 0.658±0.034 | 0.644±0.015 | 0.689±0.008 | 0.699±0.009 | 0.715±0.019 |
| UDA | 0.620±0.017 | 0.599±0.010 | 0.594±0.035 | 0.621±0.014 | 0.621±0.022 |
| SoftMatch | 0.671±0.043 | 0.662±0.047 | 0.724±0.011 | 0.715±0.021 | 0.710±0.026 |
| VAT | 0.606±0.028 | 0.627±0.034 | 0.671±0.021 | 0.676±0.016 | 0.677±0.015 |
| FreeMatch | 0.668±0.038 | 0.639±0.054 | 0.724±0.001 | 0.688±0.027 | 0.723±0.012 |
| ICT | 0.611±0.015 | 0.613±0.008 | 0.615±0.010 | 0.619±0.011 | 0.619±0.011 |
| UASD | 0.618±0.009 | 0.620±0.009 | 0.617±0.009 | 0.619±0.007 | 0.619±0.009 |
| MTCF | 0.723±0.007 | 0.704±0.016 | 0.737±0.009 | 0.736±0.012 | 0.743±0.009 |
| CAFA | 0.642±0.013 | 0.645±0.016 | 0.642±0.010 | 0.643±0.009 | 0.648±0.013 |
| OpenMatch | 0.642±0.024 | 0.634±0.049 | 0.633±0.031 | 0.561±0.035 | 0.606±0.075 |
| Fix_A_Step | 0.674±0.026 | 0.647±0.037 | 0.631±0.030 | 0.625±0.055 | 0.632±0.061 |
Comment
  1. Text dataset:
    • AGNews: A widely-used text dataset in natural language processing, included to assess how unseen class data affects semi-supervised models in text-based tasks.
| Method | r=0 | r=0.2 | r=0.4 | r=0.5 | r=0.6 | r=0.8 | r=1.0 | R_slope | GM | BAD | WAD | P_{AD≥0} |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PiModel | 0.971 | 0.881 | 0.887 | 0.933 | 0.889 | 0.939 | 0.944 | 0.005 | 0.209 | 0.465 | -0.449 | 0.666 |
| UDA | 0.858 | 0.781 | 0.845 | 0.889 | 0.927 | 0.902 | 0.890 | 0.086 | 0.253 | 0.442 | -0.384 | 0.500 |
| PseudoLabel | 0.960 | 0.865 | 0.917 | 0.850 | 0.893 | 0.916 | 0.868 | -0.047 | 0.212 | 0.430 | -0.670 | 0.500 |
| FixMatch | 0.852 | 0.765 | 0.882 | 0.804 | 0.810 | 0.797 | 0.903 | 0.040 | 0.292 | 0.584 | -0.779 | 0.500 |
| FlexMatch | 0.734 | 0.823 | 0.957 | 0.803 | 0.891 | 0.874 | 0.867 | 0.108 | 0.380 | 0.881 | -1.543 | 0.500 |
| SoftMatch | 0.757 | 0.919 | 0.915 | 0.925 | 0.857 | 0.798 | 0.843 | 0.000 | 0.362 | 0.809 | -0.678 | 0.500 |
| MeanTeacher | 0.964 | 0.942 | 0.967 | 0.941 | 0.955 | 0.941 | 0.941 | -0.018 | 0.071 | 0.142 | -0.260 | 0.500 |
| UASD | 0.960 | 0.970 | 0.919 | 0.959 | 0.958 | 0.955 | 0.959 | -0.002 | 0.070 | 0.400 | -0.255 | 0.500 |

Compared to tabular data, text modality data is relatively more affected by unseen classes. We attribute this to the greater difficulty in feature representation and the higher complexity of text-based tasks. Consistent with previous experimental findings, UASD is relatively less affected by unseen classes compared to other semi-supervised methods, demonstrating stronger robustness against unseen classes.

These additional datasets aim to validate the effectiveness of our experimental setup in various domains, showcasing its practicality. We have incorporated the above experimental results and analysis into the revised manuscript.

Comment
  • Letter: Another tabular dataset included to further explore the influence of unseen class data on semi-supervised models.
| Method | r=0 | r=0.2 | r=0.4 | r=0.6 | r=0.8 | r=1.0 | R_slope | GM | BAD | WAD | P_{AD≥0} |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CAFA | 0.732 | 0.726 | 0.724 | 0.720 | 0.720 | 0.726 | -0.007 | 0.020 | 0.029 | -0.028 | 0.200 |
| PiModel | 0.726 | 0.719 | 0.716 | 0.715 | 0.717 | 0.710 | -0.012 | 0.021 | 0.008 | -0.037 | 0.200 |
| PseudoLabel | 0.719 | 0.717 | 0.718 | 0.718 | 0.715 | 0.714 | -0.004 | 0.009 | 0.005 | -0.015 | 0.400 |
| ICT | 0.728 | 0.729 | 0.729 | 0.730 | 0.727 | 0.730 | 0.000 | 0.005 | 0.012 | -0.015 | 0.800 |
| MixMatch | 0.680 | 0.680 | 0.685 | 0.680 | 0.677 | 0.678 | -0.003 | 0.011 | 0.026 | -0.024 | 0.400 |
| FixMatch | 0.740 | 0.718 | 0.708 | 0.689 | 0.679 | 0.688 | -0.056 | 0.110 | 0.044 | -0.110 | 0.200 |
| FlexMatch | 0.749 | 0.735 | 0.724 | 0.723 | 0.728 | 0.724 | -0.020 | 0.044 | 0.024 | -0.070 | 0.200 |
| SoftMatch | 0.726 | 0.712 | 0.707 | 0.711 | 0.717 | 0.715 | -0.005 | 0.027 | 0.026 | -0.072 | 0.400 |
| MeanTeacher | 0.739 | 0.738 | 0.738 | 0.737 | 0.738 | 0.740 | 0.000 | 0.004 | 0.011 | -0.002 | 0.400 |
| FreeMatch | 0.724 | 0.614 | 0.615 | 0.620 | 0.614 | 0.611 | -0.079 | 0.181 | 0.025 | -0.549 | 0.400 |
| VAT | 0.735 | 0.729 | 0.726 | 0.725 | 0.721 | 0.721 | -0.013 | 0.023 | 0.000 | -0.029 | 0.000 |
| UASD | 0.723 | 0.726 | 0.725 | 0.723 | 0.725 | 0.731 | 0.005 | 0.012 | 0.029 | -0.011 | 0.600 |
| MTCF | 0.706 | 0.607 | 0.544 | 0.542 | 0.529 | 0.512 | -0.172 | 0.332 | -0.011 | -0.495 | 0.000 |
| OpenMatch | 0.738 | 0.723 | 0.718 | 0.719 | 0.718 | 0.716 | -0.017 | 0.033 | 0.007 | -0.075 | 0.200 |
| Fix_A_Step | 0.760 | 0.747 | 0.740 | 0.738 | 0.736 | 0.733 | -0.024 | 0.045 | -0.006 | -0.062 | 0.000 |

Based on the results from the two tables above, it can be observed that compared to image modality data, tabular data is less affected by unseen classes, and semi-supervised models demonstrate greater robustness on tabular data. We attribute this primarily to the simplicity of tabular data structure and the more effective feature representation. Moreover, similar to CIFAR, ImageNet, and CUB, ICT and UASD are relatively less affected by unseen classes compared to other semi-supervised methods, demonstrating stronger robustness against unseen classes.

Comment

Q2: Only CIFAR datasets were used for experiments. Since previous SSL benchmarks provide more datasets from different modalities, such as text or tabular data, it is beneficial to consider more text and tabular datasets.

A2: In response to your concern, we would like to clarify that, in addition to CIFAR-10 and CIFAR-100, we have conducted experiments on a broader range of datasets to enhance the credibility and generalizability of our results. These include:

  1. Image dataset:
    • ImageNet-100: A widely-used benchmark dataset providing a large-scale and realistic evaluation, used to analyze the influence of unseen class data on semi-supervised models.
| Method | r=0 | r=0.2 | r=0.4 | r=0.5 | r=0.6 | r=0.8 | r=1.0 | R_slope | GM | BAD | WAD | P_{AD≥0} |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fix-A-Step | 0.261 | 0.261 | 0.260 | 0.264 | 0.257 | 0.264 | 0.255 | -0.003 | 0.017 | 0.040 | -0.070 | 0.500 |
| FixMatch | 0.282 | 0.278 | 0.277 | 0.279 | 0.268 | 0.268 | 0.254 | -0.025 | 0.053 | 0.020 | -0.110 | 0.333 |
| FlexMatch | 0.291 | 0.291 | 0.291 | 0.287 | 0.284 | 0.281 | 0.285 | -0.009 | 0.023 | 0.019 | -0.040 | 0.500 |
| FreeMatch | 0.296 | 0.288 | 0.289 | 0.284 | 0.287 | 0.292 | 0.291 | -0.002 | 0.020 | 0.030 | -0.050 | 0.500 |
| MTCF | 0.094 | 0.094 | 0.096 | 0.096 | 0.099 | 0.102 | 0.098 | 0.006 | 0.027 | 0.136 | -0.106 | 0.666 |
| OpenMatch | 0.052 | 0.052 | 0.056 | 0.059 | 0.054 | 0.056 | 0.057 | 0.005 | 0.014 | 0.029 | -0.049 | 0.833 |
| UASD | 0.261 | 0.261 | 0.254 | 0.261 | 0.260 | 0.255 | 0.258 | -0.003 | 0.017 | 0.070 | -0.035 | 0.500 |
| ICT | 0.259 | 0.261 | 0.262 | 0.262 | 0.258 | 0.260 | 0.258 | -0.001 | 0.010 | 0.010 | -0.040 | 0.666 |
| PI-Model | 0.272 | 0.265 | 0.255 | 0.250 | 0.260 | 0.260 | 0.249 | -0.017 | 0.044 | 0.100 | -0.055 | 0.333 |
| Pseudo-Label | 0.258 | 0.253 | 0.262 | 0.257 | 0.261 | 0.263 | 0.258 | 0.004 | 0.018 | 0.045 | -0.050 | 0.500 |
| CAFA | 0.257 | 0.254 | 0.261 | 0.264 | 0.258 | 0.255 | 0.263 | 0.004 | 0.022 | 0.040 | -0.060 | 0.500 |
| SoftMatch | 0.295 | 0.293 | 0.286 | 0.285 | 0.284 | 0.270 | 0.280 | -0.020 | 0.040 | 0.050 | -0.069 | 0.166 |
| UDA | 0.240 | 0.246 | 0.248 | 0.256 | 0.255 | 0.252 | 0.256 | 0.014 | 0.034 | 0.080 | -0.015 | 0.666 |
| VAT | 0.269 | 0.265 | 0.260 | 0.269 | 0.269 | 0.262 | 0.263 | -0.004 | 0.022 | 0.090 | -0.035 | 0.500 |

Based on the results above, we can see that under ImageNet-100 the influence of adding unseen classes on SSL methods is smaller than on CIFAR-10 and CIFAR-100. Most SSL methods exhibit similar properties across CIFAR-10, CIFAR-100, and ImageNet-100: for example, ICT and UASD demonstrate robustness to the addition of unseen classes across all three datasets, while SoftMatch and FixMatch are relatively sensitive to it across all three datasets.

Additionally, similar to Tables 1 and 2 in the original submission, we also calculated five metrics to measure the degree to which semi-supervised learning models are affected by the addition of unseen classes on ImageNet-100, from both a global robustness and a local robustness perspective.
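As an illustration of how a slope-style metric could be computed from such a sweep, the sketch below fits an ordinary least-squares slope of accuracy against r, using the FixMatch accuracies reported for ImageNet-100; this is our assumption of the likely computation, and the result only lands near the reported R_slope of -0.025, since the paper's exact definition may differ:

```python
# Hedged sketch (assumed definition): R_slope as the least-squares slope
# of test accuracy regressed on the unseen-class ratio r.
def r_slope(rs, accs):
    n = len(rs)
    mr, ma = sum(rs) / n, sum(accs) / n
    num = sum((r - mr) * (a - ma) for r, a in zip(rs, accs))
    den = sum((r - mr) ** 2 for r in rs)
    return num / den

rs = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]
accs = [0.282, 0.278, 0.277, 0.279, 0.268, 0.268, 0.254]  # FixMatch, ImageNet-100
print(round(r_slope(rs, accs), 3))  # prints -0.026, near the reported -0.025
```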

  • CUB: A fine-grained classification dataset that introduces additional diversity, enabling the study of the effects of unseen class data in specific scenarios.
| Method | r=0 | r=0.5 | r=1.0 | R_slope |
|---|---|---|---|---|
| PI-Model | 0.487 | 0.487 | 0.507 | 0.001 |
| UASD | 0.668 | 0.668 | 0.668 | 0.000 |
| FixMatch | 0.500 | 0.504 | 0.494 | -0.005 |
| FlexMatch | 0.529 | 0.447 | 0.413 | -0.115 |
| Fix-A-Step | 0.394 | 0.391 | 0.393 | 0.000 |
| SoftMatch | 0.510 | 0.517 | 0.477 | -0.033 |
| VAT | 0.510 | 0.490 | 0.490 | -0.019 |
| FreeMatch | 0.509 | 0.448 | 0.410 | -0.098 |
| ICT | 0.473 | 0.473 | 0.473 | 0.000 |

Due to time and resource constraints, we only conducted experiments for r=0, r=0.5, and r=1. The results are shown in the table above. The observations are similar to CIFAR-10 and CIFAR-100: the performance of most methods decreases as unseen classes are added (e.g., FreeMatch and FlexMatch), while some methods exhibit robustness to the addition of unseen classes (e.g., UASD and ICT).

Comment

Q1: My main concern is the practicality of the framework. The main purpose of the paper is to test the influence of increasing amounts of unseen-class data on the classification performance on seen classes. The proposed metrics are mainly first-order statistics of the classification performance, which capture how performance changes as more unseen-class data is added. However, the main interest of SSL with unseen-class data in real-world applications may not lie in first-order statistics. In real applications, we care about the performance at a given amount of unseen-class data rather than the performance change with more data, although many papers provide figures with different amounts of unseen-class data. Therefore, although the framework is valid for different sizes of unseen-class data, its practicability for real applications may be limited.

A1: Thank you for your insightful comments; we completely understand your concern regarding the practicality of the framework. We agree that in real-world applications, the primary focus is often not on how performance changes as more unseen-class data is introduced, but rather on the performance at a specific amount of unseen-class data.

Indeed, fixing the size of the unlabeled set is a common approach in practical applications. In many real-world scenarios, we collect unlabeled data without knowing the exact proportion of unseen classes (i.e., the value of r). In such cases, the previous framework is indeed valuable.

However, we also believe that evaluating semi-supervised models under a single r value is not sufficient for understanding the robustness of the models. Our motivation for evaluating the impact of unseen classes in unlabeled data across different values of r is to provide a more comprehensive understanding of how these factors influence model performance under varied conditions. We aim to offer a fair and standard evaluation across multiple scenarios, which is important for understanding the broader applicability of semi-supervised learning models.

In the revised version, we have clarified that while both frameworks are valuable, they cater to different use cases: one for real-world applications where the size of the unlabeled set is fixed and another for assessing the general robustness of models under varying proportions of unseen class data.

Comment

Dear Reviewer wdK8,

Thank you for your time and effort in reviewing our work. We have carefully considered your detailed comments and questions, and we have tried to address all your concerns accordingly.

As the deadline for revising the manuscript is approaching, could you please go over our responses? If you find our responses satisfactory, we hope you could consider adjusting your initial rating. Please feel free to share any additional comments you may have.

Thank you!

Authors

Comment

I thank the authors for their efforts in providing the additional experiments. I will keep my score unchanged.

Comment

Dear Reviewer wdK8,

Thank you for your recognition and recommendation for our work. We are pleased that our response has successfully addressed all your comments. If there are remaining issues or questions regarding our paper, we would be more than happy to address them to further clarify our contributions.

Best regards,

Authors

Review
6

The paper proposes an evaluation framework and then evaluates whether unlabeled samples from unseen classes deteriorate the performance of semi-supervised learning methods on seen classes.

Strengths

S1. The motivation of the paper is valid and interesting.

S2. The proposed evaluation framework follows the principle of controlling variables by fixing the confounding factor, i.e., the proportion of unlabeled samples from seen classes. So, the proposed evaluation framework overcomes the shortcomings of current evaluation schemes.

S3. The influence of unlabeled samples from unseen classes is evaluated from five perspectives and five evaluation indices.

S4. Some representative methods are chosen for experiment and the experiment results can provide some guidance for understanding the working mechanism of existing methods and designing new ones.

Weaknesses

In order to more clearly present the findings of the paper, the presentation needs to be further improved; specifically:

W1. Since r (or r_u) is recorded in the tables in the experiment section, a precise definition of r is needed. To reduce ambiguity, the formal definition of r should be given based on the symbol system in Section 2.

W2. It is necessary to provide a table or brief description for each indicator, showing its range, interpretation of positive/negative values, and how it relates to model performance.

W3. To facilitate readers to intuitively assess the degree of impact and facilitate cross-dataset comparisons, the evaluation indicators in formulas (1)-(5) should be normalized if possible.

W4. More experiment details should be recorded, such as the names of seen and unseen classes, the number of labeled samples for each class, and the number of unlabeled samples for each class. If there is not enough space, put it in the appendix. Furthermore, the authors should provide a link to their code or dataset splits, if possible, to further enhance reproducibility.

W5. In addition, some clerical errors need to be corrected; specifically:

a. In line 072, ”unseen classes ---> “unseen classes (the quotation mark direction is wrong).

b. In line 141, Fig.2(b) ---> Fig.2(a).

c. In line 480-481, the table’s title should be above the table.

d. In the experiment, how are the integrals in formulas (1) and (2) calculated? It seems that the integral operation should be the accumulation operation.

e. In line 190, how can we obtain the “the expected accuracy”? In my opinion, it should be “the mean of ....”.

Questions

I hope the authors can clarify the following question.

Q1. As mentioned in lines 269-299, there are three experiments (generated by three random seeds) for each kind of dataset construction. The reported results therefore still carry high randomness, so it is difficult to accept the conclusions derived from them. To dispel this doubt, the authors should strengthen the experiments in the following aspects.

a. The number of datasets should be increased, e.g., 10 or more.

b. The diversity of datasets should also be increased, rather than being limited to image datasets.

c. For each dataset construction (labeled and unlabeled samples setting),  more experiments should be carried out, e.g., 10 times or more.

d. Some statistical significance tests should be introduced.

NOTE: I am willing to increase my rating to a 6 or 8, if the authors can dispel my doubts through more adequate experiments.

Comment

I have carefully read the author's response.

These responses have dispelled my doubts.

I will increase my score to 6.

Comment

Thank you for taking the time to carefully review our responses and for your valuable feedback. We greatly appreciate your thoughtful consideration and are glad that our responses were able to address your concerns.

We are truly grateful for your support and for increasing the score. Your constructive comments have been invaluable in improving the quality of our work.

Comment

Q10: As mentioned in lines 269-299, there are three experiments (generated by three random seeds) for each kind of dataset construction. The reported results therefore still carry high randomness, so it is difficult to accept the conclusions derived from them. To dispel this doubt, the authors should strengthen the experiments in the following aspects: (a) the number of datasets should be increased, e.g., to 10 or more; (b) the diversity of datasets should also be increased, rather than being limited to image datasets; (c) for each dataset construction (labeled and unlabeled sample setting), more experiments should be carried out, e.g., 10 times or more; (d) some statistical significance tests should be introduced.

A10: In the revised manuscript, we have addressed the reviewers' concerns as follows:

a. We expanded the number of datasets to include more datasets across various modalities, including image, text, and tabular data, to ensure a more comprehensive evaluation.

b. To increase diversity, we incorporated datasets from different domains rather than limiting the experiments to image datasets.

c. For each dataset construction, we conducted experiments over 10 repetitions to ensure reliability and robustness of the results.

d. We introduced statistical significance tests (p-values) to rigorously validate the experimental findings, further enhancing the credibility of the analysis.
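The manuscript's exact significance test is not reproduced here, so as a hedged illustration the sketch below runs a paired sign-flip permutation test on hypothetical per-repetition accuracy differences between r = 0 and r = 1 (all numbers are illustrative, not from the paper):

```python
# Hedged sketch of a paired significance test over repeated runs: flip the
# sign of each paired accuracy difference (acc at r=0 minus acc at r=1)
# exhaustively and count how often the mean shift is at least as extreme.
from itertools import product

def sign_flip_p_value(diffs):
    """Two-sided p-value for H0: mean paired difference is zero."""
    observed = abs(sum(diffs))
    hits = total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed - 1e-12:
            hits += 1
    return hits / total

diffs = [0.031, 0.024, 0.028, 0.035, 0.022, 0.027, 0.030, 0.026, 0.029, 0.033]
print(sign_flip_p_value(diffs))  # prints 0.001953125: the drop is systematic
```

A permutation test of this kind makes no normality assumption, which suits small numbers of repetitions.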

  1. Image dataset:
    • ImageNet-100: A widely-used benchmark dataset providing a large-scale and realistic evaluation, used to analyze the influence of unseen class data on semi-supervised models.
| Method | r=0 | r=0.2 | r=0.4 | r=0.5 | r=0.6 | r=0.8 | r=1.0 | R_slope | GM | BAD | WAD | P_{AD≥0} | p-value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fix-A-Step | 0.261 | 0.261 | 0.260 | 0.264 | 0.257 | 0.264 | 0.255 | -0.003 | 0.017 | 0.040 | -0.070 | 0.500 | 0.600 |
| FixMatch | 0.282 | 0.278 | 0.277 | 0.279 | 0.268 | 0.268 | 0.254 | -0.025 | 0.053 | 0.020 | -0.110 | 0.333 | 0.033 |
| FlexMatch | 0.291 | 0.291 | 0.291 | 0.287 | 0.284 | 0.281 | 0.285 | -0.009 | 0.023 | 0.019 | -0.040 | 0.500 | 0.039 |
| FreeMatch | 0.296 | 0.288 | 0.289 | 0.284 | 0.287 | 0.292 | 0.291 | -0.002 | 0.020 | 0.030 | -0.050 | 0.500 | 0.001 |
| MTCF | 0.094 | 0.094 | 0.096 | 0.096 | 0.099 | 0.102 | 0.098 | 0.006 | 0.027 | 0.136 | -0.106 | 0.666 | 0.028 |
| OpenMatch | 0.052 | 0.052 | 0.056 | 0.059 | 0.054 | 0.056 | 0.057 | 0.005 | 0.014 | 0.029 | -0.049 | 0.833 | 0.013 |
| UASD | 0.261 | 0.261 | 0.254 | 0.261 | 0.260 | 0.255 | 0.258 | -0.003 | 0.017 | 0.070 | -0.035 | 0.500 | 0.072 |
| ICT | 0.259 | 0.261 | 0.262 | 0.262 | 0.258 | 0.260 | 0.258 | -0.001 | 0.010 | 0.010 | -0.040 | 0.666 | 0.180 |
| PI-Model | 0.272 | 0.265 | 0.255 | 0.250 | 0.260 | 0.260 | 0.249 | -0.017 | 0.044 | 0.100 | -0.055 | 0.333 | 0.001 |
| Pseudo-Label | 0.258 | 0.253 | 0.262 | 0.257 | 0.261 | 0.263 | 0.258 | 0.004 | 0.018 | 0.045 | -0.050 | 0.500 | 0.541 |
| CAFA | 0.257 | 0.254 | 0.261 | 0.264 | 0.258 | 0.255 | 0.263 | 0.004 | 0.022 | 0.040 | -0.060 | 0.500 | 0.258 |
| SoftMatch | 0.295 | 0.293 | 0.286 | 0.285 | 0.284 | 0.270 | 0.280 | -0.020 | 0.040 | 0.050 | -0.069 | 0.166 | 0.012 |
| UDA | 0.240 | 0.246 | 0.248 | 0.256 | 0.255 | 0.252 | 0.256 | 0.014 | 0.034 | 0.080 | -0.015 | 0.666 | 0.001 |
| VAT | 0.269 | 0.265 | 0.260 | 0.269 | 0.269 | 0.262 | 0.263 | -0.004 | 0.022 | 0.090 | -0.035 | 0.500 | 0.035 |

Based on the results above, we can see that under ImageNet-100 the influence of adding unseen classes on SSL methods is smaller than on CIFAR-10 and CIFAR-100. Most SSL methods exhibit similar properties across CIFAR-10, CIFAR-100, and ImageNet-100: for example, ICT and UASD demonstrate robustness to the addition of unseen classes across all three datasets, while SoftMatch and FixMatch are relatively sensitive to it across all three datasets.

Comment

Q5: In line 072, ”unseen classes ---> “unseen classes (the quotation mark direction is wrong).

A5: Thank you for pointing out the typographical issue. We have corrected the reversed quotation mark before "unseen classes" in line 072 to ensure clarity and consistency in the revised manuscript. We appreciate your attention to detail.

Q6: In line 141, Fig.2(b) ---> FIg.2(a).

A6: Thank you for identifying the error. We have revised the reference in line 141 from "Fig.2(b)" to "Fig.2(a)" in the updated manuscript. We appreciate your careful review and feedback.

Q7: In line 480-481, the table’s title should be above the table.

A7: Thank you for pointing this out. We have adjusted the placement of the table's title in lines 480–481 to ensure it appears above the table, as per standard formatting conventions. We appreciate your thorough review.

Q8: In the experiment, how are the integrals in formulas (1) and (2) calculated? It seems that the integral operation should be the accumulation operation.

A8: Thank you for your question. Yes, the integrals in formulas (1) and (2) are computed as accumulation operations based on empirical values. We have provided further explanation of it in the revised manuscript.
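A sketch of such a discrete accumulation (here the trapezoidal rule over the evaluated r grid; the paper's exact discretization is our assumption), using the PseudoLabel accuracies from the r_u sweep reported earlier in this thread:

```python
# Sketch of replacing the integral in Eqs. (1)-(2) with a discrete
# accumulation over the sampled r values (trapezoidal rule; assumed form).
def trapezoid(rs, accs):
    """Approximate the integral of accuracy over r from sampled points."""
    area = 0.0
    for (r0, a0), (r1, a1) in zip(zip(rs, accs), zip(rs[1:], accs[1:])):
        area += 0.5 * (a0 + a1) * (r1 - r0)
    return area

rs = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]
accs = [0.677, 0.668, 0.664, 0.660, 0.660, 0.660, 0.654]  # PseudoLabel row
print(round(trapezoid(rs, accs), 4))  # prints 0.6633
```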

Q9: In line 190, how can we obtain the “the expected accuracy”? In my opinion, it should be “the mean of ....”.

A9: Thank you for pointing this out. You are correct; the term "expected accuracy" in line 190 refers to "the mean of accuracy values across different r values." We have revised the text accordingly to ensure clarity.

Comment

Q3: To facilitate readers to intuitively assess the degree of impact and facilitate cross-dataset comparisons, the evaluation indicators in formulas (1)-(5) should be normalized if possible.

A3: Thank you very much for your valuable suggestion. In our experiments across different datasets and factors, we have ensured consistent comparisons based on the unified definitions of formulas (1)-(5). This ensures that the evaluation metrics remain comparable across all experiments. As an exception, in Table 7 in the appendix, due to the small magnitude of the data, we scaled the values by a factor of 10 to better present the experimental results. This adjustment only affects the display of the values and does not alter the relative relationships between the metrics or the comparisons of model performance.

Q4: More experiment details should be recorded, such as the names of seen and unseen classes, the number of labeled samples for each class, and the number of unlabeled samples for each class. If there is not enough space, put it in the appendix. Furthermore, the authors should provide a link to their code or dataset splits, if possible, to further enhance reproducibility.

A4: Thank you for your suggestion. To enhance reproducibility, we have provided an anonymous link to our code at [https://anonymous.4open.science/r/RE-SSL-F034/README.md], which is based on [1] and [2].
We also include the following experimental details to address your concerns:

  • Regarding the number of unseen-class examples:
    For CIFAR-10, we select the first 5 classes as seen classes and the last 5 classes as unseen classes. From the training set of known classes (which contains 25,000 samples), we randomly select 100 samples to form the labeled set D_L. We then randomly select r_s × 25,000 samples from the remaining samples of the training set of known classes and put them into D_U^S. From the training set of unknown classes (which contains 25,000 samples), we randomly select r_u × 25,000 samples and put them into D_U^U. D_U^S and D_U^U together form the unlabeled set D_U.

    For CIFAR-100, we select the first 50 classes as seen classes and the last 50 classes as unseen classes. Other settings are the same as CIFAR-10.

  • Regarding the number of unseen-class categories:
    We fix the number of unseen-class examples by setting $r = 0.2$. Unlike the first setup, instead of randomly selecting 25,000 samples from the entire training set of unknown classes, we randomly select 25,000 samples from the last $C_n$ classes of the unknown classes and put them into $D_U^U$.

  • Regarding the indices of unseen classes:
    We fix the number of unseen-class examples by setting $r = 0.2$. Unlike the first and second setups, we randomly select $r \times 25{,}000$ samples from the $C_i$-th class of the unknown classes and put them into $D_U^U$.

  • Regarding the degrees of nearness of unseen classes:
    Similar to the third setup, instead of randomly selecting $r \times 25{,}000$ samples from the $C_i$-th class of the unknown classes, we randomly select $r \times 25{,}000$ samples from the $C_i$-th class of the external MNIST dataset and put them into $D_U^U$.

  • Regarding the label distribution in unseen classes:
    We determine the number of samples to be adopted from each unknown class to obtain $D_U^U$ based on $C_{ib}$.
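
For concreteness, the first setup above can be sketched in plain Python. This is a minimal illustration rather than the released code: the function name `build_splits`, the toy label list, and the guard on the seen-class sample size are our assumptions.

```python
import random

def build_splits(labels, seen_classes, unseen_classes,
                 n_labeled=100, r_s=1.0, r_u=0.2, seed=0):
    """First setup: keep the seen-class pools fixed and vary only r_u."""
    rng = random.Random(seed)
    seen, unseen = set(seen_classes), set(unseen_classes)
    seen_idx = [i for i, y in enumerate(labels) if y in seen]      # 25,000 for CIFAR-10
    unseen_idx = [i for i, y in enumerate(labels) if y in unseen]  # 25,000 for CIFAR-10
    rng.shuffle(seen_idx)
    D_L = seen_idx[:n_labeled]                        # labeled set D_L
    rest = seen_idx[n_labeled:]                       # remaining seen-class samples
    n_s = min(int(r_s * len(seen_idx)), len(rest))    # r_s x |seen pool| into D_U^S
    D_U_S = rng.sample(rest, n_s)
    n_u = int(r_u * len(unseen_idx))                  # r_u x |unseen pool| into D_U^U
    D_U_U = rng.sample(unseen_idx, n_u)
    D_U = D_U_S + D_U_U                               # unlabeled set D_U = D_U^S + D_U^U
    return D_L, D_U_S, D_U_U, D_U

# toy run: 10 classes with 100 samples each, first 5 seen / last 5 unseen
labels = [c for c in range(10) for _ in range(100)]
D_L, D_U_S, D_U_U, D_U = build_splits(labels, range(5), range(5, 10),
                                      n_labeled=20, r_s=0.5, r_u=0.2)
```

Varying only $r_u$ then changes $|D_U^U|$ while $D_L$ and $D_U^S$ stay fixed, which is the controlled-variable principle the rebuttal describes.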

We have incorporated these details into the appendix of the revised manuscript to improve clarity and reproducibility.
References:
[1] Jia, Lin-Han, et al. "LAMDA-SSL: Semi-supervised learning in python." arXiv preprint arXiv:2208.04610 (2022).
[2] Jia, Lin-Han, et al. "A Benchmark on Robust Semi-Supervised Learning in Open Environments." The Twelfth International Conference on Learning Representations. 2023.

Comment

Q1: Since $r$ (or $r_u$) is recorded in the tables in the experiment section, there is a precise definition for $r$. To reduce ambiguity, the formal definition of $r$ should be given based on the symbol system in Section 2.

A1: Thank you for your feedback and for pointing out the need for a precise definition of $r_u$ to reduce ambiguity. Below is the formal definition based on the symbol system in Section 2. In our setup:

  • $D_B^U$ represents the total set of unseen-class data; $D_U^U$ is sampled from $D_B^U$.
  • $r_u$ is the ratio used to control the sampling from $D_B^U$.

Formal Definition of $r_u$:
$r_u$ denotes the ratio of the selected unseen-class data $|D_U^U|$ to the total unseen-class data $|D_B^U|$: $r_u = \frac{|D_U^U|}{|D_B^U|}$, where $|D_U^U|$ is the size of the sampled unseen-class data and $|D_B^U|$ is the size of the total set of unseen-class data.

$r_u$ is used to control the degree to which unseen-class data is introduced into the unlabeled set $D_U$, thereby allowing us to evaluate the influence of unseen classes on the model's performance. A higher $r_u$ indicates a larger proportion of unseen-class data, while a lower $r_u$ reflects a scenario with fewer unseen-class samples.

We have incorporated the above in the revised manuscript.

Q2: It is necessary to provide a table or brief description for each indicator, showing its range, interpretation of positive/negative values, and how it relates to model performance.

A2: Thank you for your constructive suggestion. We agree that providing a table or a concise description for each indicator would help clarify their interpretation and relation to model performance. Below is a detailed description for each indicator:

| Indicator | Range | Interpretation | Relation to Model Performance |
| --- | --- | --- | --- |
| $R_{slope}$ | $(-\infty, \infty)$ | Indicates the overall trend of accuracy ($Acc(r)$) as unseen classes are added. Positive values imply increasing accuracy, while negative values suggest a decline. | A larger $R_{slope}$ (closer to 0 or positive) indicates better global robustness to unseen classes. |
| GM | $[0, \infty)$ | Measures the cumulative magnitude of accuracy deviations from the average accuracy ($\bar{ACC}$). Higher values imply larger fluctuations. | A smaller GM indicates higher stability and robustness to the addition of unseen classes. |
| WAD | $(-\infty, \infty)$ | Captures the largest negative accuracy drop between two adjacent $r$ values. | A larger WAD indicates better local robustness to unseen-class variations. |
| BAD | $(-\infty, \infty)$ | Reflects the maximum positive or minimal negative accuracy change between adjacent $r$ values. Positive values indicate improvements. | A larger BAD (positive or close to 0) indicates better performance in the best-case scenario. |
| $P_{AD \geq 0}$ | $[0, 1]$ | Represents the probability that adding unseen classes does not degrade performance ($AD \geq 0$). | A higher $P_{AD \geq 0}$ (close to 1) indicates better global robustness to unseen-class additions. |

We have incorporated the above table into Section C.1 of the revised manuscript to provide a concise summary of each indicator, as it effectively clarifies their range, interpretation, and significance for evaluating model performance.

Comment
  1. Text dataset:
    • AGNews: A widely-used text dataset in natural language processing, included to assess how unseen class data affects semi-supervised models in text-based tasks.
| Method | r=0 | r=0.2 | r=0.4 | r=0.5 | r=0.6 | r=0.8 | r=1.0 | $R_{slope}$ | GM | BAD | WAD | $P_{AD\ge 0}$ | p-value |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PiModel | 0.971 | 0.881 | 0.887 | 0.933 | 0.889 | 0.939 | 0.944 | 0.005 | 0.209 | 0.465 | -0.449 | 0.666 | 0.004 |
| UDA | 0.858 | 0.781 | 0.845 | 0.889 | 0.927 | 0.902 | 0.890 | 0.086 | 0.253 | 0.442 | -0.384 | 0.500 | 0.529 |
| PseudoLabel | 0.960 | 0.865 | 0.917 | 0.850 | 0.893 | 0.916 | 0.868 | -0.047 | 0.212 | 0.430 | -0.670 | 0.500 | 0.001 |
| FixMatch | 0.852 | 0.765 | 0.882 | 0.804 | 0.810 | 0.797 | 0.903 | 0.040 | 0.292 | 0.584 | -0.779 | 0.500 | 0.302 |
| FlexMatch | 0.734 | 0.823 | 0.957 | 0.803 | 0.891 | 0.874 | 0.867 | 0.108 | 0.380 | 0.881 | -1.543 | 0.500 | 0.001 |
| SoftMatch | 0.757 | 0.919 | 0.915 | 0.925 | 0.857 | 0.798 | 0.843 | 0.000 | 0.362 | 0.809 | -0.678 | 0.500 | 0.002 |
| MeanTeacher | 0.964 | 0.942 | 0.967 | 0.941 | 0.955 | 0.941 | 0.941 | -0.018 | 0.071 | 0.142 | -0.260 | 0.500 | 0.015 |
| UASD | 0.960 | 0.970 | 0.919 | 0.959 | 0.958 | 0.955 | 0.959 | -0.002 | 0.070 | 0.400 | -0.255 | 0.500 | 0.395 |

Compared to tabular data, text modality data is relatively more affected by unseen classes. We attribute this to the greater difficulty in feature representation and the higher complexity of text-based tasks. Consistent with previous experimental findings, UASD is relatively less affected by unseen classes compared to other semi-supervised methods, demonstrating stronger robustness against unseen classes.

These additional datasets aim to validate the effectiveness of our experimental setup in various domains, showcasing its practicality. We have incorporated the above experimental results and analysis into the revised manuscript.

Comment
  • Letter: Another tabular dataset included to further explore the influence of unseen class data on semi-supervised models.
| Method | r=0 | r=0.2 | r=0.4 | r=0.6 | r=0.8 | r=1.0 | $R_{slope}$ | GM | BAD | WAD | $P_{AD\ge 0}$ | p-value |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CAFA | 0.732 | 0.726 | 0.724 | 0.720 | 0.720 | 0.726 | -0.007 | 0.020 | 0.029 | -0.028 | 0.200 | 0.002 |
| PiModel | 0.726 | 0.719 | 0.716 | 0.715 | 0.717 | 0.710 | -0.012 | 0.021 | 0.008 | -0.037 | 0.200 | 0.002 |
| TemporalEnsembling | 0.242 | 0.446 | 0.628 | 0.693 | 0.691 | 0.694 | 0.437 | 0.886 | 1.017 | -0.006 | 0.800 | 0.001 |
| UDA | 0.719 | 0.718 | 0.714 | 0.715 | 0.719 | 0.718 | 0.000 | 0.010 | 0.020 | -0.016 | 0.400 | 0.085 |
| PseudoLabel | 0.719 | 0.717 | 0.718 | 0.718 | 0.715 | 0.714 | -0.004 | 0.009 | 0.005 | -0.015 | 0.400 | 0.032 |
| ICT | 0.728 | 0.729 | 0.729 | 0.730 | 0.727 | 0.730 | 0.000 | 0.005 | 0.012 | -0.015 | 0.800 | 0.141 |
| MixMatch | 0.680 | 0.680 | 0.685 | 0.680 | 0.677 | 0.678 | -0.003 | 0.011 | 0.026 | -0.024 | 0.400 | 1.000 |
| FixMatch | 0.740 | 0.718 | 0.708 | 0.689 | 0.679 | 0.688 | -0.056 | 0.110 | 0.044 | -0.110 | 0.200 | 0.003 |
| FlexMatch | 0.749 | 0.735 | 0.724 | 0.723 | 0.728 | 0.724 | -0.020 | 0.044 | 0.024 | -0.070 | 0.200 | 0.001 |
| SoftMatch | 0.726 | 0.712 | 0.707 | 0.711 | 0.717 | 0.715 | -0.005 | 0.027 | 0.026 | -0.072 | 0.400 | 0.001 |
| MeanTeacher | 0.739 | 0.738 | 0.738 | 0.737 | 0.738 | 0.740 | 0.000 | 0.004 | 0.011 | -0.002 | 0.400 | 0.177 |
| FreeMatch | 0.724 | 0.614 | 0.615 | 0.620 | 0.614 | 0.611 | -0.079 | 0.181 | 0.025 | -0.549 | 0.400 | 0.000 |
| VAT | 0.735 | 0.729 | 0.726 | 0.725 | 0.721 | 0.721 | -0.013 | 0.023 | 0.000 | -0.029 | 0.000 | 0.002 |
| UASD | 0.723 | 0.726 | 0.725 | 0.723 | 0.725 | 0.731 | 0.005 | 0.012 | 0.029 | -0.011 | 0.600 | 0.089 |
| MTCF | 0.706 | 0.607 | 0.544 | 0.542 | 0.529 | 0.512 | -0.172 | 0.332 | -0.011 | -0.495 | 0.000 | 0.001 |
| OpenMatch | 0.738 | 0.723 | 0.718 | 0.719 | 0.718 | 0.716 | -0.017 | 0.033 | 0.007 | -0.075 | 0.200 | 0.000 |
| Fix_A_Step | 0.760 | 0.747 | 0.740 | 0.738 | 0.736 | 0.733 | -0.024 | 0.045 | -0.006 | -0.062 | 0.000 | 0.001 |

Based on the results from the two tables above, it can be observed that compared to image modality data, tabular data is less affected by unseen classes, and semi-supervised models demonstrate greater robustness on tabular data. We attribute this primarily to the simplicity of tabular data structure and the more effective feature representation. Moreover, similar to CIFAR, ImageNet, and CUB, ICT and UASD are relatively less affected by unseen classes compared to other semi-supervised methods, demonstrating stronger robustness against unseen classes.

Comment
  • CUB: A fine-grained classification dataset that introduces additional diversity, enabling the study of the effects of unseen class data in specific scenarios.
| Method | r=0 | r=0.5 | r=1.0 | $R_{slope}$ | p-value |
| --- | --- | --- | --- | --- | --- |
| PI-Model | 0.487 | 0.487 | 0.507 | 0.001 | 0.422 |
| UASD | 0.668 | 0.668 | 0.668 | 0.000 | - |
| FixMatch | 0.500 | 0.504 | 0.494 | -0.005 | 0.859 |
| FlexMatch | 0.529 | 0.447 | 0.413 | -0.115 | 0.028 |
| Fix-A-Step | 0.394 | 0.391 | 0.393 | 0.000 | 0.183 |
| SoftMatch | 0.510 | 0.517 | 0.477 | -0.033 | 0.582 |
| VAT | 0.510 | 0.490 | 0.490 | -0.019 | 0.000 |
| FreeMatch | 0.509 | 0.448 | 0.410 | -0.098 | 0.052 |
| ICT | 0.473 | 0.473 | 0.473 | 0.000 | - |

Due to time and resource constraints, we only conducted experiments for $r=0$, $r=0.5$, and $r=1$. The results are shown in the table above. Experimental observations are similar to CIFAR-10 and CIFAR-100, where the performance of most methods decreases as unseen classes are added (e.g., FreeMatch and FlexMatch), while some methods exhibit robustness to the addition of unseen classes (e.g., UASD and ICT).

  1. Tabular dataset:
    • Forest: A structured tabular dataset, used to investigate the impact of unseen class data in non-image modalities.
| Method | r=0 | r=0.2 | r=0.4 | r=0.6 | r=0.8 | r=1.0 | $R_{slope}$ | GM | BAD | WAD | $P_{AD\ge 0}$ | p-value |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MeanTeacher | 0.619 | 0.617 | 0.618 | 0.619 | 0.620 | 0.619 | 0.001 | 0.004 | 0.006 | -0.009 | 0.600 | 0.455 |
| FreeMatch | 0.591 | 0.603 | 0.602 | 0.593 | 0.595 | 0.607 | 0.007 | 0.033 | 0.062 | -0.041 | 0.600 | 0.008 |
| VAT | 0.609 | 0.613 | 0.614 | 0.611 | 0.613 | 0.611 | 0.000 | 0.008 | 0.019 | -0.014 | 0.600 | 0.000 |
| UASD | 0.605 | 0.604 | 0.606 | 0.598 | 0.598 | 0.603 | -0.004 | 0.016 | 0.023 | -0.038 | 0.600 | 0.084 |
| MTCF | 0.590 | 0.592 | 0.594 | 0.596 | 0.596 | 0.598 | 0.007 | 0.013 | 0.010 | 0.001 | 1.000 | 0.001 |
| OpenMatch | 0.615 | 0.617 | 0.610 | 0.612 | 0.613 | 0.615 | -0.001 | 0.011 | 0.011 | -0.031 | 0.800 | 0.222 |
| Fix_A_Step | 0.605 | 0.604 | 0.610 | 0.607 | 0.612 | 0.616 | 0.010 | 0.022 | 0.032 | -0.018 | 0.600 | 0.048 |
| CAFA | 0.604 | 0.602 | 0.604 | 0.602 | 0.611 | 0.603 | 0.003 | 0.012 | 0.041 | -0.038 | 0.400 | 0.818 |
| PiModel | 0.593 | 0.595 | 0.593 | 0.599 | 0.597 | 0.601 | 0.007 | 0.015 | 0.029 | -0.009 | 0.600 | 0.022 |
| TemporalEnsembling | 0.594 | 0.598 | 0.604 | 0.597 | 0.600 | 0.598 | 0.003 | 0.014 | 0.031 | -0.034 | 0.600 | 0.002 |
| UDA | 0.609 | 0.608 | 0.606 | 0.606 | 0.609 | 0.604 | -0.003 | 0.010 | 0.015 | -0.023 | 0.200 | 0.024 |
| PseudoLabel | 0.610 | 0.603 | 0.606 | 0.607 | 0.604 | 0.603 | -0.004 | 0.011 | 0.010 | -0.032 | 0.400 | 0.000 |
| ICT | 0.610 | 0.613 | 0.607 | 0.611 | 0.611 | 0.612 | 0.001 | 0.008 | 0.019 | -0.029 | 0.600 | 0.455 |
| MixMatch | 0.608 | 0.613 | 0.612 | 0.616 | 0.604 | 0.610 | -0.002 | 0.019 | 0.033 | -0.061 | 0.600 | 0.172 |
| FixMatch | 0.577 | 0.580 | 0.580 | 0.579 | 0.584 | 0.586 | 0.008 | 0.016 | 0.027 | -0.006 | 0.800 | 0.007 |
| FlexMatch | 0.584 | 0.591 | 0.591 | 0.590 | 0.586 | 0.593 | 0.004 | 0.016 | 0.036 | -0.018 | 0.600 | 0.001 |
| SoftMatch | 0.592 | 0.587 | 0.591 | 0.592 | 0.593 | 0.600 | 0.008 | 0.015 | 0.035 | -0.022 | 0.800 | 0.783 |
Review
6

This paper studies the problem of safe semi-supervised learning. It finds that previous works are based on flawed evaluations that did not adhere to the principle of controlling variables: they fixed the size of the unlabeled data but changed the proportion of unseen classes, which also affects the proportion of seen classes. To deal with this problem, the paper proposes a re-evaluation framework that keeps the proportion of seen classes unchanged and adjusts only the proportion of unseen classes. Furthermore, the paper proposes five metrics to comprehensively evaluate the impact of unseen classes. Based on the proposed framework, extensive experiments are performed to evaluate the impact of unseen classes on various SSL models.

Strengths

  1. The paper does not provide any new method. It gives an insightful perspective on the previous studies and finds that these methods have issues accurately assessing the impact of unseen classes. The motivation is very clear and seems to be reasonable.

  2. Based on the above motivation, the paper proposes to re-evaluate previous safe SSL models by fixing the proportion of seen classes. Furthermore, the paper proposes five evaluation metrics to assess the impact of unseen classes on SSL models.

  3. Extensive experiments based on the proposed framework are performed with various safe SSL methods. These results provide a comprehensive and fair comparison among existing methods.

Weaknesses

The main contribution of this paper lies in proposing a more reasonable experimental setup; however, its major issue is the credibility of the experiments. The experiments are conducted solely on CIFAR-10 and CIFAR-100, which makes the results unconvincing. While I understand that past studies have primarily been conducted on these two datasets, given that this paper aims to make innovations from an experimental perspective, it should propose more practical experimental setups and conduct experiments on more realistic datasets.

Questions

There are five evaluation metrics. Which one is the most important?

Comment

Q2: There are five evaluation metrics. Which one is the most important?

A2: Thank you for your insightful question. All five proposed metrics are meaningful and contribute to a comprehensive evaluation of robustness, as they assess different aspects of model performance in the presence of unseen classes.

  • Global Robustness Metrics:

    • $R_{slope}$ (Eq. 1): Measures the overall trend of accuracy changes across $r$, capturing the global robustness of the model as unseen classes are added.
    • GM (Global Magnitude) (Eq. 2): Reflects the overall influence of unseen classes by integrating the magnitude of accuracy deviations from the average performance ($\bar{ACC}$).
    • $P_{AD \geq 0}$ (Eq. 5): Represents the probability that adding unseen classes does not degrade performance, offering a probabilistic global robustness perspective.
  • Local Robustness Metrics:

    • WAD (Worst-case Adjacent Discrepancy) (Eq. 3): Captures the maximum accuracy drop between two adjacent $r$ values, quantifying the local worst-case impact of unseen classes.
    • BAD (Best-case Adjacent Discrepancy) (Eq. 4): Highlights the best-case accuracy improvement or minimal drop between adjacent $r$ values, reflecting the local best-case performance changes.

Since these metrics evaluate robustness from complementary perspectives—global trends, local variations, and probabilistic considerations—they collectively provide a holistic view of model performance under varying conditions. Therefore, rather than prioritizing one metric, we recommend considering all five in combination to fully understand the robustness of semi-supervised models to unseen classes.
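
Under one natural reading of Eqs. (1)-(5), these indicators take only a few lines to compute. In the sketch below, $R_{slope}$ is a least-squares slope of $Acc(r)$ against $r$, GM sums absolute deviations from the mean accuracy, and $AD$ is the accuracy change between adjacent $r$ values divided by the step in $r$; the function name and the per-step normalisation are inferred from the reported tables rather than quoted from the paper's code.

```python
def robustness_metrics(rs, accs):
    """Five robustness indicators computed from an accuracy curve Acc(r)."""
    n = len(rs)
    mean_r = sum(rs) / n
    mean_acc = sum(accs) / n
    # R_slope (Eq. 1): least-squares slope of Acc(r) against r (global trend)
    r_slope = (sum((r - mean_r) * (a - mean_acc) for r, a in zip(rs, accs))
               / sum((r - mean_r) ** 2 for r in rs))
    # GM (Eq. 2): cumulative absolute deviation from the average accuracy
    gm = sum(abs(a - mean_acc) for a in accs)
    # AD: accuracy change between adjacent r values, normalised by the step in r
    ads = [(accs[i + 1] - accs[i]) / (rs[i + 1] - rs[i]) for i in range(n - 1)]
    wad = min(ads)                                  # WAD (Eq. 3): worst-case change
    bad = max(ads)                                  # BAD (Eq. 4): best-case change
    p_ad = sum(ad >= 0 for ad in ads) / len(ads)    # P_{AD>=0} (Eq. 5)
    return r_slope, gm, wad, bad, p_ad

# PiModel accuracies on AGNews (values from the rebuttal's AGNews table)
rs = [0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]
accs = [0.971, 0.881, 0.887, 0.933, 0.889, 0.939, 0.944]
r_slope, gm, wad, bad, p_ad = robustness_metrics(rs, accs)
# gm ~ 0.209, wad ~ -0.45, p_ad ~ 0.666, in line with the reported PiModel row
# (small gaps come from the rounding of the tabulated accuracies)
```

Running it on any row of the tables in this thread reproduces the reported GM, WAD, BAD, and $P_{AD\ge 0}$ up to that rounding, which is why we read the equations this way.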

We hope this explanation clarifies the importance of each metric and their role in the evaluation framework. Thank you for your question!

Comment
  1. Text dataset:
    • AGNews: A widely-used text dataset in natural language processing, included to assess how unseen class data affects semi-supervised models in text-based tasks.
| Method | r=0 | r=0.2 | r=0.4 | r=0.5 | r=0.6 | r=0.8 | r=1.0 | $R_{slope}$ | GM | BAD | WAD | $P_{AD\ge 0}$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PiModel | 0.971 | 0.881 | 0.887 | 0.933 | 0.889 | 0.939 | 0.944 | 0.005 | 0.209 | 0.465 | -0.449 | 0.666 |
| UDA | 0.858 | 0.781 | 0.845 | 0.889 | 0.927 | 0.902 | 0.890 | 0.086 | 0.253 | 0.442 | -0.384 | 0.500 |
| PseudoLabel | 0.960 | 0.865 | 0.917 | 0.850 | 0.893 | 0.916 | 0.868 | -0.047 | 0.212 | 0.430 | -0.670 | 0.500 |
| FixMatch | 0.852 | 0.765 | 0.882 | 0.804 | 0.810 | 0.797 | 0.903 | 0.040 | 0.292 | 0.584 | -0.779 | 0.500 |
| FlexMatch | 0.734 | 0.823 | 0.957 | 0.803 | 0.891 | 0.874 | 0.867 | 0.108 | 0.380 | 0.881 | -1.543 | 0.500 |
| SoftMatch | 0.757 | 0.919 | 0.915 | 0.925 | 0.857 | 0.798 | 0.843 | 0.000 | 0.362 | 0.809 | -0.678 | 0.500 |
| MeanTeacher | 0.964 | 0.942 | 0.967 | 0.941 | 0.955 | 0.941 | 0.941 | -0.018 | 0.071 | 0.142 | -0.260 | 0.500 |
| UASD | 0.960 | 0.970 | 0.919 | 0.959 | 0.958 | 0.955 | 0.959 | -0.002 | 0.070 | 0.400 | -0.255 | 0.500 |

Compared to tabular data, text modality data is relatively more affected by unseen classes. We attribute this to the greater difficulty in feature representation and the higher complexity of text-based tasks. Consistent with previous experimental findings, UASD is relatively less affected by unseen classes compared to other semi-supervised methods, demonstrating stronger robustness against unseen classes.

These additional datasets aim to validate the effectiveness of our experimental setup in various domains, showcasing its practicality. We have incorporated the above experimental results and analysis into the Appendix of the revised manuscript.

Comment
  • Letter: Another tabular dataset included to further explore the influence of unseen class data on semi-supervised models.
| Method | r=0 | r=0.2 | r=0.4 | r=0.6 | r=0.8 | r=1.0 | $R_{slope}$ | GM | BAD | WAD | $P_{AD\ge 0}$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CAFA | 0.732 | 0.726 | 0.724 | 0.720 | 0.720 | 0.726 | -0.007 | 0.020 | 0.029 | -0.028 | 0.200 |
| PiModel | 0.726 | 0.719 | 0.716 | 0.715 | 0.717 | 0.710 | -0.012 | 0.021 | 0.008 | -0.037 | 0.200 |
| PseudoLabel | 0.719 | 0.717 | 0.718 | 0.718 | 0.715 | 0.714 | -0.004 | 0.009 | 0.005 | -0.015 | 0.400 |
| ICT | 0.728 | 0.729 | 0.729 | 0.730 | 0.727 | 0.730 | 0.000 | 0.005 | 0.012 | -0.015 | 0.800 |
| MixMatch | 0.680 | 0.680 | 0.685 | 0.680 | 0.677 | 0.678 | -0.003 | 0.011 | 0.026 | -0.024 | 0.400 |
| FixMatch | 0.740 | 0.718 | 0.708 | 0.689 | 0.679 | 0.688 | -0.056 | 0.110 | 0.044 | -0.110 | 0.200 |
| FlexMatch | 0.749 | 0.735 | 0.724 | 0.723 | 0.728 | 0.724 | -0.020 | 0.044 | 0.024 | -0.070 | 0.200 |
| SoftMatch | 0.726 | 0.712 | 0.707 | 0.711 | 0.717 | 0.715 | -0.005 | 0.027 | 0.026 | -0.072 | 0.400 |
| MeanTeacher | 0.739 | 0.738 | 0.738 | 0.737 | 0.738 | 0.740 | 0.000 | 0.004 | 0.011 | -0.002 | 0.400 |
| FreeMatch | 0.724 | 0.614 | 0.615 | 0.620 | 0.614 | 0.611 | -0.079 | 0.181 | 0.025 | -0.549 | 0.400 |
| VAT | 0.735 | 0.729 | 0.726 | 0.725 | 0.721 | 0.721 | -0.013 | 0.023 | 0.000 | -0.029 | 0.000 |
| UASD | 0.723 | 0.726 | 0.725 | 0.723 | 0.725 | 0.731 | 0.005 | 0.012 | 0.029 | -0.011 | 0.600 |
| MTCF | 0.706 | 0.607 | 0.544 | 0.542 | 0.529 | 0.512 | -0.172 | 0.332 | -0.011 | -0.495 | 0.000 |
| OpenMatch | 0.738 | 0.723 | 0.718 | 0.719 | 0.718 | 0.716 | -0.017 | 0.033 | 0.007 | -0.075 | 0.200 |
| Fix_A_Step | 0.760 | 0.747 | 0.740 | 0.738 | 0.736 | 0.733 | -0.024 | 0.045 | -0.006 | -0.062 | 0.000 |

Based on the results from the two tables above, it can be observed that compared to image modality data, tabular data is less affected by unseen classes, and semi-supervised models demonstrate greater robustness on tabular data. We attribute this primarily to the simplicity of tabular data structure and the more effective feature representation. Moreover, similar to CIFAR, ImageNet, and CUB, ICT and UASD are relatively less affected by unseen classes compared to other semi-supervised methods, demonstrating stronger robustness against unseen classes.

Comment
  1. Tabular dataset:
    • Forest: A structured tabular dataset, used to investigate the impact of unseen class data in non-image modalities.
| Method | r=0 | r=0.2 | r=0.4 | r=0.6 | r=0.8 | r=1.0 | $R_{slope}$ | GM | BAD | WAD | $P_{AD\ge 0}$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MeanTeacher | 0.619 | 0.617 | 0.618 | 0.619 | 0.620 | 0.619 | 0.001 | 0.004 | 0.006 | -0.009 | 0.600 |
| FreeMatch | 0.591 | 0.603 | 0.602 | 0.593 | 0.595 | 0.607 | 0.007 | 0.033 | 0.062 | -0.041 | 0.600 |
| VAT | 0.609 | 0.613 | 0.614 | 0.611 | 0.613 | 0.611 | 0.000 | 0.008 | 0.019 | -0.014 | 0.600 |
| UASD | 0.605 | 0.604 | 0.606 | 0.598 | 0.598 | 0.603 | -0.004 | 0.016 | 0.023 | -0.038 | 0.600 |
| MTCF | 0.590 | 0.592 | 0.594 | 0.596 | 0.596 | 0.598 | 0.007 | 0.013 | 0.010 | 0.001 | 1.000 |
| OpenMatch | 0.615 | 0.617 | 0.610 | 0.612 | 0.613 | 0.615 | -0.001 | 0.011 | 0.011 | -0.031 | 0.800 |
| Fix_A_Step | 0.605 | 0.604 | 0.610 | 0.607 | 0.612 | 0.616 | 0.010 | 0.022 | 0.032 | -0.018 | 0.600 |
| CAFA | 0.604 | 0.602 | 0.604 | 0.602 | 0.611 | 0.603 | 0.003 | 0.012 | 0.041 | -0.038 | 0.400 |
| PiModel | 0.593 | 0.595 | 0.593 | 0.599 | 0.597 | 0.601 | 0.007 | 0.015 | 0.029 | -0.009 | 0.600 |
| PseudoLabel | 0.610 | 0.603 | 0.606 | 0.607 | 0.604 | 0.603 | -0.004 | 0.011 | 0.010 | -0.032 | 0.400 |
| ICT | 0.610 | 0.613 | 0.607 | 0.611 | 0.611 | 0.612 | 0.001 | 0.008 | 0.019 | -0.029 | 0.600 |
| MixMatch | 0.608 | 0.613 | 0.612 | 0.616 | 0.604 | 0.610 | -0.002 | 0.019 | 0.033 | -0.061 | 0.600 |
| FixMatch | 0.577 | 0.580 | 0.580 | 0.579 | 0.584 | 0.586 | 0.008 | 0.016 | 0.027 | -0.006 | 0.800 |
| FlexMatch | 0.584 | 0.591 | 0.591 | 0.590 | 0.586 | 0.593 | 0.004 | 0.016 | 0.036 | -0.018 | 0.600 |
| SoftMatch | 0.592 | 0.587 | 0.591 | 0.592 | 0.593 | 0.600 | 0.008 | 0.015 | 0.035 | -0.022 | 0.800 |
Comment

Q1: The main contribution of this paper lies in proposing a more reasonable experimental setup; however, its major issue is the credibility of the experiments. The experiments are conducted solely on CIFAR-10 and CIFAR-100, which makes the results unconvincing. While I understand that past studies have primarily been conducted on these two datasets, given that this paper aims to make innovations from an experimental perspective, it should propose more practical experimental setups and conduct experiments on more realistic datasets.

A1: Thank you for your valuable feedback and for emphasizing the importance of validating our approach on diverse and realistic datasets.

In response to your concern, we would like to clarify that, in addition to CIFAR-10 and CIFAR-100, we have conducted experiments on a broader range of datasets to enhance the credibility and generalizability of our results. These include:

  1. Image dataset:
    • ImageNet-100: A widely-used benchmark dataset providing a large-scale and realistic evaluation, used to analyze the influence of unseen class data on semi-supervised models.
| Method | r=0 | r=0.2 | r=0.4 | r=0.5 | r=0.6 | r=0.8 | r=1.0 | $R_{slope}$ | GM | BAD | WAD | $P_{AD\ge 0}$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Fix-A-Step | 0.261 | 0.261 | 0.260 | 0.264 | 0.257 | 0.264 | 0.255 | -0.003 | 0.017 | 0.040 | -0.070 | 0.500 |
| FixMatch | 0.282 | 0.278 | 0.277 | 0.279 | 0.268 | 0.268 | 0.254 | -0.025 | 0.053 | 0.020 | -0.110 | 0.333 |
| FlexMatch | 0.291 | 0.291 | 0.291 | 0.287 | 0.284 | 0.281 | 0.285 | -0.009 | 0.023 | 0.019 | -0.040 | 0.500 |
| FreeMatch | 0.296 | 0.288 | 0.289 | 0.284 | 0.287 | 0.292 | 0.291 | -0.002 | 0.020 | 0.030 | -0.050 | 0.500 |
| MTCF | 0.094 | 0.094 | 0.096 | 0.096 | 0.099 | 0.102 | 0.098 | 0.006 | 0.027 | 0.136 | -0.106 | 0.666 |
| OpenMatch | 0.052 | 0.052 | 0.056 | 0.059 | 0.054 | 0.056 | 0.057 | 0.005 | 0.014 | 0.029 | -0.049 | 0.833 |
| UASD | 0.261 | 0.261 | 0.254 | 0.261 | 0.260 | 0.255 | 0.258 | -0.003 | 0.017 | 0.070 | -0.035 | 0.500 |
| ICT | 0.259 | 0.261 | 0.262 | 0.262 | 0.258 | 0.260 | 0.258 | -0.001 | 0.010 | 0.010 | -0.040 | 0.666 |
| PI-Model | 0.272 | 0.265 | 0.255 | 0.250 | 0.260 | 0.260 | 0.249 | -0.017 | 0.044 | 0.100 | -0.055 | 0.333 |
| Pseudo-Label | 0.258 | 0.253 | 0.262 | 0.257 | 0.261 | 0.263 | 0.258 | 0.004 | 0.018 | 0.045 | -0.050 | 0.500 |
| CAFA | 0.257 | 0.254 | 0.261 | 0.264 | 0.258 | 0.255 | 0.263 | 0.004 | 0.022 | 0.040 | -0.060 | 0.500 |
| SoftMatch | 0.295 | 0.293 | 0.286 | 0.285 | 0.284 | 0.270 | 0.280 | -0.020 | 0.040 | 0.050 | -0.069 | 0.166 |
| UDA | 0.240 | 0.246 | 0.248 | 0.256 | 0.255 | 0.252 | 0.256 | 0.014 | 0.034 | 0.080 | -0.015 | 0.666 |
| VAT | 0.269 | 0.265 | 0.260 | 0.269 | 0.269 | 0.262 | 0.263 | -0.004 | 0.022 | 0.090 | -0.035 | 0.500 |

Based on the results above, we can see that on ImageNet-100, the addition of unseen classes influences the SSL methods less than on CIFAR-10 and CIFAR-100. Most SSL methods exhibit similar properties across CIFAR-10, CIFAR-100, and ImageNet-100. For example, ICT and UASD demonstrate robustness to the addition of unseen classes across all three datasets, while SoftMatch and FixMatch are relatively sensitive to the addition of unseen classes across all three datasets.

Additionally, similar to Tables 1 and 2 in the original submission, we also calculated the five metrics to measure the degree to which semi-supervised learning models are affected by the addition of unseen classes on ImageNet-100, from both a global robustness and a local robustness perspective.

  • CUB: A fine-grained classification dataset that introduces additional diversity, enabling the study of the effects of unseen class data in specific scenarios.
| Method | r=0 | r=0.5 | r=1.0 | $R_{slope}$ |
| --- | --- | --- | --- | --- |
| PI-Model | 0.487 | 0.487 | 0.507 | 0.001 |
| UASD | 0.668 | 0.668 | 0.668 | 0.000 |
| FixMatch | 0.500 | 0.504 | 0.494 | -0.005 |
| FlexMatch | 0.529 | 0.447 | 0.413 | -0.115 |
| Fix-A-Step | 0.394 | 0.391 | 0.393 | 0.000 |
| SoftMatch | 0.510 | 0.517 | 0.477 | -0.033 |
| VAT | 0.510 | 0.490 | 0.490 | -0.019 |
| FreeMatch | 0.509 | 0.448 | 0.410 | -0.098 |
| ICT | 0.473 | 0.473 | 0.473 | 0.000 |

Due to time and resource constraints, we only conducted experiments for $r=0$, $r=0.5$, and $r=1$. The results are shown in the table above. Experimental observations are similar to CIFAR-10 and CIFAR-100, where the performance of most methods decreases as unseen classes are added (e.g., FreeMatch and FlexMatch), while some methods exhibit robustness to the addition of unseen classes (e.g., UASD and ICT).

Comment

Dear Reviewer 244R,

Thank you for your time and effort in reviewing our work. We have carefully considered your detailed comments and questions, and we have tried to address all your concerns accordingly.

As the deadline for revising the manuscript is approaching, could you please go over our responses? If you find our responses satisfactory, we hope you could consider adjusting your initial rating. Please feel free to share any additional comments you may have.

Thank you!

Authors

Comment

Dear Reviewer 244R,

Thank you for your recognition and recommendation for our work. We are pleased that our response has successfully addressed all your comments. If there are remaining issues or questions regarding our paper, we would be more than happy to address them to further clarify our contributions.

Moreover, we gently remind you that in ICLR's scoring system, a score of 6 indicates a "marginally above the acceptance threshold", while a score of 8 represents "accept". If you believe our work merits acceptance, we hope you might consider giving us an "accept", as it would greatly support our work.

Best regards,

Authors

Review
6

The paper investigates the impact of unseen classes in unlabeled data on the performance of semi-supervised learning (SSL) models. It challenges the prevailing assumption that unseen classes are detrimental to SSL performance, highlighting flaws in previous assessment methods that altered the proportion of unseen and seen classes simultaneously. This paper adheres to the principle of controlling variables by keeping the proportion of seen classes constant while varying unseen classes across five dimensions. Through rigorous experimentation, the authors demonstrate that unseen classes do not inherently degrade SSL model performance; in some scenarios, they may even enhance it. The findings suggest a reevaluation of the role of unseen classes in SSL, emphasizing that their effects are more nuanced than previously thought.

Strengths

  1. This paper effectively identifies a flaw in previous methods for assessing the impact of unseen classes on SSL model performance. By fixing the size of the unlabeled dataset and adjusting the proportion of unseen classes, earlier approaches fail to control for variables appropriately. This adjustment alters the proportion of seen classes, meaning that any decrease in classification performance for seen classes may not be solely due to an increase in unseen class samples, but rather to a decrease in seen class samples. This clarification helps refine our understanding of how unseen classes affect SSL model performance.
  2. The experiments presented in this paper are thorough, as the authors control various variables to validate their claims from different perspectives. This comprehensive approach strengthens the credibility of their findings.
  3. The writing in this paper is clear and easy to understand, making it accessible to readers.

Weaknesses

  1. In the proposed five metrics, some formulas are not clearly defined. For instance, the meaning of $\hat{ACC}$ in Eq. 1 and $\bar{ACC}$ in Eq. 2 is not explicitly explained. Providing clearer definitions for these terms would enhance the understanding of the metrics and their implications in the context of the research.
  2. In the tables, it would be helpful to use bold text or dashed lines to highlight specific data points, emphasizing the advantages of certain algorithms under particular settings.

Questions

See Weaknesses

Comment

Q1: In the proposed five metrics, some formulas are not clearly defined. For instance, the meaning of $\hat{ACC}$ in Eq. 1 and $\bar{ACC}$ in Eq. 2 is not explicitly explained. Providing clearer definitions for these terms would enhance the understanding of the metrics and their implications in the context of the research.

A1: Thank you for your insightful comment regarding the clarity of the definitions for $\hat{ACC}$ in Eq. 1 and $\bar{ACC}$ in Eq. 2.

To clarify:

  1. $\hat{ACC}$ in Eq. 1 represents the empirical accuracy corresponding to $r$, i.e., the observed performance of the model on the test set for a given value of $r$.
  2. $\bar{ACC}$ in Eq. 2 represents the average accuracy of the model across all values of $r$, providing a summary of the model's overall performance. It serves as a reference point for computing the deviations (in terms of $|\hat{ACC} - \bar{ACC}|$) at different values of $r$.

We acknowledge that these terms were not explicitly explained in the original submission, and we have ensured they are clearly defined in the revised manuscript to enhance the clarity and accessibility of the work.

We appreciate your constructive feedback and hope this explanation resolves the issue.

Q2: In the tables, it would be helpful to use bold text or dashed lines to highlight specific data points, emphasizing the advantages of certain algorithms under particular settings.

A2: Thank you for your valuable suggestion regarding the presentation of data in the tables.

We agree that emphasizing specific data points using bold text or dashed lines can significantly enhance readability and make the advantages of certain algorithms under particular settings more apparent.

Specifically, we have used bold text to highlight the best-performing results for each metric and setting in the revised manuscript.

Comment

Dear Reviewer Mtt1,

Thank you for your time and effort in reviewing our work. We have carefully considered your detailed comments and questions, and we have tried to address all your concerns accordingly.

As the deadline for revising the manuscript is approaching, could you please go over our responses? If you find our responses satisfactory, we hope you could consider adjusting your initial rating. Please feel free to share any additional comments you may have.

Thank you!

Authors

Comment

Thank you for your response, which has resolved my concerns. After considering the feedback from other reviewers, I decide to maintain my score.

Comment

Dear Reviewer Mtt1,

Thank you for your recognition and recommendation for our work. We are pleased that our response has successfully addressed all your concerns. If there are remaining issues or questions regarding our paper, we would be more than happy to address them to further clarify our contributions.

Best regards,

Authors

Comment

We thank the reviewers for their thorough and constructive comments. We are glad that the reviewers agree that our motivation is clear and interesting (all the reviewers mentioned). Our motivation is new to the SSL field (Reviewer wdK8 mentioned). Reviewers also pointed out that our experimental evaluation is solid and comprehensive (Mtt1, 244R, wdK8), our experimental analysis is very comprehensive (wdK8), providing some guidance for understanding the working mechanism of existing methods and designing new ones (CS3Q), and our presentation is clear (Mtt1, wdK8).

Based on the reviewers' valuable feedback, we have conducted a number of additional experiments, which hopefully resolve the reviewers’ concerns. In this revised version, we have also updated the manuscript and Appendix, where we highlight modifications with blue color. The major additional experiments and improvements are as follows:

  1. We expanded the number of datasets to include more datasets across various modalities, including image, text, and tabular data, to ensure a more comprehensive evaluation.
  2. We provided a concise summary of each indicator with their range, interpretation, and significance for evaluating model performance.
  3. We have provided clearer definitions for several terms to enhance the understanding of the metrics.
  4. We have used bold text to highlight the best-performing results for each metric and setting.
  5. We conducted experiments over 10 times to ensure reliability and robustness of the results.
  6. We introduced statistical significance tests (p-values) to rigorously validate the experimental findings, further enhancing the credibility of the analysis.
  7. We have added the standard deviations and the updated results for more comprehensive evaluation of the experimental results.
  8. To enhance reproducibility, we have provided an anonymous link to our code at [https://anonymous.4open.science/r/RE-SSL-F034/README.md] and also added more experimental details. Our code will be public once our paper is accepted.
AC Meta-Review

This paper challenges the assumption that unseen classes in unlabeled data harm the performance of semi-supervised learning (SSL) models, arguing that prior assessments were flawed. The authors propose a re-evaluation of existing safe SSL models by fixing the proportion of seen classes and introduce five evaluation metrics to accurately assess the influence of unseen classes on SSL models, thereby addressing the shortcomings of current evaluation schemes. By controlling variables and adjusting only the proportion of unseen classes, the study demonstrates that unseen classes may not degrade SSL performance and can even enhance it under certain conditions.

The experiments conducted are extensive and thorough, providing a comprehensive analysis of various safe SSL methods, and the clear writing style makes the paper accessible to readers, enhancing the overall credibility of the findings. However, the reviewers also raised concerns about the credibility of the experiments, noting that conducting experiments only on CIFAR-10 and CIFAR-100 limits the scope and reduces the generalizability of the results.

Based on the overall review, the paper is recommended for acceptance.

Additional Comments from Reviewer Discussion

During the rebuttal period, the reviewers' opinions remained unchanged.

Final Decision

Accept (Poster)