PaperHub
Rating: 5.5 / 10 · Poster · 4 reviewers
Individual scores: 6, 5, 7, 4 (min 4, max 7, std 1.1)
Confidence: 4.0 · Correctness: 3.0 · Contribution: 2.5 · Presentation: 3.5
NeurIPS 2024

(FL)$^2$: Overcoming Few Labels in Federated Semi-Supervised Learning

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-11-06
TL;DR

Overcoming Few Labels in Federated Semi-Supervised Learning

Abstract

Federated Learning (FL) is a distributed machine learning framework that trains accurate global models while preserving clients' privacy-sensitive data. However, most FL approaches assume that clients possess labeled data, which is often not the case in practice. Federated Semi-Supervised Learning (FSSL) addresses this label deficiency problem, targeting situations where only the server has a small amount of labeled data while clients do not. However, a significant performance gap exists between Centralized Semi-Supervised Learning (SSL) and FSSL. This gap arises from confirmation bias, which is more pronounced in FSSL due to multiple local training epochs and the separation of labeled and unlabeled data. We propose $(FL)^2$, a robust training method for unlabeled clients using sharpness-aware consistency regularization. We show that regularizing the original pseudo-labeling loss is suboptimal, and hence we carefully select unlabeled samples for regularization. We further introduce client-specific adaptive thresholding and learning status-aware aggregation to adjust the training process based on the learning progress of each client. Our experiments on three benchmark datasets demonstrate that our approach significantly improves performance and bridges the gap with SSL, particularly in scenarios with scarce labeled data.
Keywords
Federated Learning, Semi-Supervised Learning, Federated Semi-Supervised Learning

Reviews and Discussion

Review
Rating: 6

This work addresses a very practical challenge against successful FL deployments, namely unlabeled data at FL clients. Furthermore, the problem is set in the regime of a low count of labeled samples at the server. The proposed solution to train a model in a semi-supervised manner includes (i) an adaptive confidence threshold for each client to pseudo-label more samples in the initial stages of training, (ii) updating the model by perturbing the weights and training them on high-confidence pseudo-labels, and (iii) aggregating the model weights through a learning-status-aware hyperparameter.

The results show the superiority of the proposed method against existing state-of-the-art methods for federated semi-supervised learning, and show why naive application of centralized semi-supervised methods is not cut out for the disjoint nature of server and clients in FL.

Strengths

  1. The paper is written very well. The flow of logic is mostly clear.
  2. The issue this paper is tackling is very important (especially low labeled sample count at the server), and the solution proposed is elegant (clearly states why and how the existing centralized semi-supervised methods are not enough, which brings a unique solution for the FL setting).
  3. Strong results with comprehensive experiments.

Weaknesses

  1. As expanded in Questions, some parts of the methodology are unclear.

As an example, why do we need Eq 10 (the unsupervised training objective) at the stage of adaptive thresholding? I thought we were just getting the confidence threshold for pseudo-labels and sending it to the server.

Another example where the methodology was slightly fuzzy is around line 193, "While we use client-specific adaptive threshold, we use a high fixed threshold to get high-confidence data samples". What if that fixed threshold is not set correctly? The goal of the dynamic client threshold was to avoid the sub-optimal results of a fixed threshold; wouldn't the same issues arise for these adversarial perturbations?

  2. The authors have tried two datasets: CIFAR10 and SVHN. I wonder how this translates to harder classification problems with more classes (where a few classes might not have representative samples at the server and many clients just do not have samples related to those classes at all), or to predicting classes or next words on natural language datasets.

Questions

  1. In Line 68, "Clients with lower learning status (e.g., whose models are less certain about their predictions) receive higher aggregation weights, ensuring their updates are more significantly reflected in the global model."

It's unclear why we need the above. Shouldn't low confidence indicate bad generalizability of the trained model? If yes, I didn't catch how the authors are preventing the model from learning on "wrong" input-output pairs.

  2. Related to Question #1, in what cases would the learning status be low? And what was the intuition behind giving high importance to those low-status clients during the aggregation? What if this then leads to some other clients getting a low learning status?

  3. Why does (FL)$^2$ need strongly augmented samples for SACR? What happens if we just use weakly-augmented samples instead?

  4. A minor suggestion: Figure 2 can benefit from numbers to show the flow of what happens after what. It was difficult to figure out where to start reading the diagram from.

Limitations

I do not see a limitations section. Although the authors mention that the limitations are mentioned in Conclusion, I am not sure what they are referring to (not having theoretical analysis does not sound like a limitation of the proposed method, more like potential future work).

Author Response

We greatly appreciate your thoughtful comments and feedback. In our revised manuscript, we intend to address these points as follows:

Clarification of methodology

About Eq 10

  • At each communication round, the selected clients individually calculate their adaptive thresholds based on their own unlabeled data using Eq 9. Once the adaptive threshold is determined, each client trains its local model according to Eq 10, which incorporates pseudo-labeling and consistency regularization using strongly augmented samples.
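
To make this step concrete, below is a minimal PyTorch-style sketch of the pseudo-labeling plus consistency loss described above, assuming the client-specific threshold `tau_c` has already been obtained from Eq 9; the function and augmentation names are illustrative placeholders rather than the paper's actual code.

```python
import torch
import torch.nn.functional as F

def client_unsupervised_loss(model, x_unlabeled, weak_aug, strong_aug, tau_c):
    """Sketch of an Eq-10-style loss: pseudo-label weakly augmented samples,
    keep those whose confidence clears the client-specific threshold tau_c,
    and enforce consistency on their strongly augmented counterparts."""
    with torch.no_grad():
        probs_weak = F.softmax(model(weak_aug(x_unlabeled)), dim=-1)
        conf, pseudo_labels = probs_weak.max(dim=-1)
        mask = (conf >= tau_c).float()  # 1 for confident samples, 0 otherwise

    logits_strong = model(strong_aug(x_unlabeled))
    per_sample = F.cross_entropy(logits_strong, pseudo_labels, reduction="none")
    return (per_sample * mask).mean()
```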

About fixed threshold in SACR

  • In Section 5.3, the ablation study reveals that applying an adaptive threshold for SACR can actually degrade performance, as seen in the CAT+SACR (All data) scenario. This occurs because applying SACR to wrongly pseudo-labeled samples leads to the generalization of erroneous samples. To mitigate this, we opted for a high fixed threshold to ensure that only high-confidence data samples, which are more likely to be correct, are utilized. Since we already employ an adaptive thresholding scheme (CAT) to address the limitations of a fixed threshold, we believe it is safe to use a fixed threshold specifically for SACR.
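
For reference, the following sketch shows what one SACR update on the high-confidence subset could look like, following the standard two-step SAM procedure; the perturbation radius `rho`, the assumption that `x_hc`/`y_hc` were pre-filtered with the high fixed threshold, and all names are illustrative rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def sacr_update(model, optimizer, x_hc, y_hc, rho=0.05):
    """One sharpness-aware step on samples (x_hc, y_hc) that already passed
    the high fixed confidence threshold; rho is the perturbation radius."""
    params = [p for p in model.parameters() if p.requires_grad]

    # 1) ascent direction: gradient of the loss at the current weights
    loss = F.cross_entropy(model(x_hc), y_hc)
    grads = torch.autograd.grad(loss, params)
    norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
    eps = [rho * g / norm for g in grads]

    # 2) move the weights to the approximate worst case inside an L2 ball of radius rho
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.add_(e)

    # 3) gradient of the loss at the perturbed weights (the sharpness-aware gradient)
    optimizer.zero_grad()
    F.cross_entropy(model(x_hc), y_hc).backward()

    # 4) restore the original weights, then step with the sharpness-aware gradient
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)
    optimizer.step()
```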

Motivation of the LSAA method (related to Questions 1, 2)

  • In centralized SSL, methods like FlexMatch and FreeMatch introduced the use of different thresholds for each class within a dataset and showed their effectiveness. The rationale is that different classes pose varying levels of learning difficulty, so lower thresholds are assigned to more challenging classes.
  • In the context of FSSL, the learning difficulty can vary across clients. This variation arises for two main reasons. First, since the server has access to only a small labeled dataset, clients whose data closely resembles the server's data will face lower learning difficulty, while those with more distinct data will encounter higher difficulty. Second, due to the non-iid distribution of data across clients, the learning difficulty naturally differs among them.
  • We propose LSAA to take into account the different learning difficulties of clients: LSAA assigns higher aggregation weights to clients with higher learning difficulty, enabling the global model to learn more effectively from these clients (see the sketch after this list). Figure 1 in the global response PDF highlights the effectiveness of LSAA. Not only does it achieve higher test accuracy compared to the fixed aggregation weights (CAT + SACR), but it also demonstrates higher pseudo-label accuracy. Additionally, LSAA consistently achieves the highest correct label ratio, indicating the percentage of correct pseudo labels among all unlabeled data, throughout the experiment. After 600 training rounds, LSAA also records the lowest wrong label ratio, representing the percentage of incorrect pseudo labels among all unlabeled data. These findings suggest that LSAA effectively reduces incorrect pseudo labels while increasing correct ones, thereby mitigating confirmation bias [1].
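
To illustrate the mechanism, here is a minimal sketch of how learning-status-aware aggregation weights could be formed and used in FedAvg-style averaging; the softmax-over-difficulty form, the `temperature` parameter, and all names are our own assumptions for illustration, and the paper's actual weighting formula may differ.

```python
import torch

def lsaa_weights(learning_status, temperature=1.0):
    """Aggregation weights that favor clients with LOWER learning status.
    learning_status: per-client scalars in [0, 1], e.g. average confidence
    on each client's unlabeled data (hypothetical softmax-over-difficulty form)."""
    status = torch.tensor(learning_status, dtype=torch.float32)
    difficulty = 1.0 - status                       # low status -> high difficulty
    return torch.softmax(difficulty / temperature, dim=0)

def aggregate(client_states, weights):
    """FedAvg-style weighted average of client model state_dicts."""
    keys = client_states[0].keys()
    return {k: sum(w * s[k] for w, s in zip(weights, client_states)) for k in keys}
```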

Questions

Why does (FL)$^2$ need strongly augmented samples for SACR? What happens if we just use weakly-augmented samples instead?

  • Thank you for your question regarding SACR. Building on previous work in centralized SSL and SemiFL, our method trains the client's local model using pseudo-labeling combined with consistency regularization (as detailed in Eq 10). In this approach, the pseudo-label, which is generated from the confidence of weakly augmented data, guides the training of its strongly augmented counterpart. This ensures the model produces consistent predictions across various perturbations of the same unlabeled data. Following this scheme, we aim to provide additional consistency regularization with SACR; thus, we used strongly augmented samples to calculate the loss.

A minor suggestion: Figure 2 can benefit from numbers to show the flow of what happens after what. It was difficult to figure out where to start reading the diagram from.

  • Thank you for your valuable suggestion to enhance the readability of our manuscript. We will make sure to update Figure 2 accordingly in our camera-ready version.

Limitations of (FL)$^2$:

Thank you for the feedback on the limitation. Our limitations are as follows, and we will address them in the camera-ready version.

  • CAT generates more pseudo-labels compared to fixed threshold methods, which in turn increases the time required for client training.
  • In SACR, additional computation is required due to the need for inference on unlabeled samples using perturbed models.
  • Our methodology relies on strong data augmentation, which may not be feasible for datasets such as sensor data.

[1] Arazo, Eric, et al. "Pseudo-labeling and confirmation bias in deep semi-supervised learning." 2020 International joint conference on neural networks (IJCNN). IEEE, 2020.

Comment

Thank you for your detailed answers. I would like to maintain my score.

Comment

Thank you so much for your valuable feedback and opinions on our work. We will further strengthen our final manuscript based on your suggestions. If you have any further concerns, please do not hesitate to leave a comment for us!

Review
Rating: 5

This paper focuses on the federated semi-supervised learning (FSSL) scenario, which is a more challenging problem in FL. There are two different scenarios in FL, labels-at-server and labels-at-clients, and this paper tackles the former. The authors attribute the gap between SSL and FSSL to confirmation bias. To diminish this bias, this paper proposes a client-specific adaptive threshold, a modified SAM objective, and learning status-aware aggregation as a new aggregation scheme. The experiments of (FL)$^2$ demonstrate improvement on two benchmark datasets.

Strengths

This paper is well-organized and easy to follow, with a clear contribution.

Weaknesses

  • Lack of motivation and insights. It is not clear how the performance is influenced by confirmation bias. Moreover, while the technologies discussed in this paper are not new, the argument that their particular combination can mitigate confirmation bias lacks persuasiveness.
  • The related work about labels-at-clients is outdated. Some new works should be included, e.g., [R1][R2].
  • The experiments of this paper are insufficient. E.g., how (FL)$^2$ tackles confirmation bias is not shown. This paper should add more ablation studies to prove the bias is decreased, rather than only reporting the performance; there are many possible reasons for the improvement.

Reference:
[R1] Li, Ming, et al. Class balanced adaptive pseudo labeling for federated semi-supervised learning. In CVPR, 2023.
[R2] Zhang, Yonggang, et al. Robust Training of Federated Models with Extremely Label Deficiency. In ICLR, 2024.

Questions

  • I went through the source code and found that the FedMatch implementation may be missing some components, e.g., parameter decomposition for disjoint learning and the K-dimensional tree for helper selection. Such an incomplete baseline is not convincing to compare with the proposed method.
  • Why is the performance of SemiFL on the SVHN dataset with Balanced IID and 250 labeled samples lower than with 40 samples? This is counterintuitive.
  • The performance of SemiFL on the CIFAR10 dataset with Unbalanced Non-IID and Balanced IID is 10.0. This seems to be a training problem.

Limitations

  • The datasets are limited; CIFAR-100 and FMNIST are also commonly used in FL research.
  • The limitations of this work are not well-discussed.

Author Response

We greatly appreciate your thoughtful comments and feedback. In our revised manuscript, we intend to address these points as follows:

Effect of (FL)$^2$ on confirmation bias

Thank you for your feedback. Since wrong pseudo-labels usually lead to confirmation bias [1], we evaluated pseudo-label accuracy, label ratio, correct label ratio, wrong label ratio, and C/W ratio in addition to test accuracy. We compared (FL)$^2$ against baseline methods using the SVHN dataset with 40 labels in a balanced IID setting, as illustrated in Figure 1 of the global response PDF.

A high pseudo-label accuracy indicates that the method produces reliable pseudo labels. A high correct label ratio suggests that the method supplies the model with a higher number of accurate labels. Conversely, a low wrong label ratio indicates that the model encounters fewer incorrect labels, which is crucial for minimizing confirmation bias [1]. Lastly, a high C/W ratio signifies that the model is exposed to more correct labels than incorrect ones, further helping to reduce confirmation bias.
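
A small sketch of how these diagnostics could be computed from pseudo-labels and ground-truth labels (for evaluation only) is given below; the names and exact definitions reflect our reading of the rebuttal and may differ slightly from those used in the global response.

```python
import numpy as np

def pseudo_label_diagnostics(pseudo, true, has_label):
    """pseudo/true: arrays of predicted and ground-truth classes;
    has_label: boolean array, True if the sample received a pseudo label
    (i.e., its confidence cleared the threshold)."""
    n = len(true)
    correct = has_label & (pseudo == true)
    wrong = has_label & (pseudo != true)

    return {
        "label_ratio": has_label.mean(),                   # share of data pseudo-labeled
        "pseudo_label_acc": correct.sum() / max(has_label.sum(), 1),
        "correct_label_ratio": correct.sum() / n,          # correct pseudo labels / all data
        "wrong_label_ratio": wrong.sum() / n,              # wrong pseudo labels / all data
        "cw_ratio": correct.sum() / max(wrong.sum(), 1),   # > 1 means more correct than wrong
    }
```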

We observed that (FL)$^2$ consistently outperforms the baseline SemiFL across all metrics. Notably, while SemiFL generates more incorrect labels (C/W ratio < 1), (FL)$^2$ produces twice as many correct labels compared to incorrect ones. Additionally, the wrong label ratio for (FL)$^2$ is approximately 30%, significantly lower than SemiFL's 45%. These results suggest that (FL)$^2$ effectively reduces the number of incorrect pseudo-labels while increasing the number of correct ones, thereby mitigating confirmation bias.

Furthermore, we can observe the effectiveness of each component of (FL)$^2$, namely CAT, SACR, and LSAA. Using CAT and SACR alone delivers better performance compared to the baseline in terms of all metrics. If we use CAT + SACR, pseudo-label accuracy increases, the correct label ratio increases, and the wrong label ratio decreases, which means we can reduce the confirmation bias. When LSAA is added, which gives (FL)$^2$, it achieves the best performance across all metrics. This suggests that the synergistic effect of CAT, SACR, and LSAA reduces confirmation bias effectively.

Novelty of (FL)$^2$

We appreciate your question regarding the technical novelty of (FL)$^2$. We would like to clarify (FL)$^2$'s novelty as follows:

Use of client-specific threshold (CAT) unlike FreeMatch

  • (FL)$^2$ can deal with non-iid settings by calculating specific adaptive thresholds for each client, while FreeMatch calculates a single global learning status. Because clients' data distributions vary in non-iid settings, each client's learning status can be different, making it necessary to estimate client-specific thresholds.
  • (FL)$^2$ mitigates the risk of overfitting by evaluating the learning status based on the entire dataset at a fixed point in time (see the sketch below). Since the local model repeatedly encounters the same data across multiple local epochs, using a running batch, as in FreeMatch, for estimating learning status might reinforce incorrect labels.
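
As an illustration of this difference, a sketch of the client-side threshold computation under these assumptions is shown below (one full pass over the client's unlabeled data at a fixed point in the round, no per-batch EMA); the scaling form is a hypothetical stand-in for the paper's Eq 9, and all names are placeholders.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def client_adaptive_threshold(model, unlabeled_loader, base_tau=0.95, device="cpu"):
    """Client-specific threshold from the client's FULL unlabeled set at a fixed
    point in time, rather than an EMA over running batches as in FreeMatch."""
    model.eval()
    confidences = []
    for x in unlabeled_loader:
        probs = F.softmax(model(x.to(device)), dim=-1)
        confidences.append(probs.max(dim=-1).values)
    learning_status = torch.cat(confidences).mean()   # average max-confidence
    return base_tau * learning_status.item()          # lower status -> lower threshold
```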

Introducing Learning Status-Aware Aggregation (LSAA)

  • We newly introduced LSAA, which adjusts the aggregation weights of clients' local models. While CAT gives lower thresholds to clients with low learning status so that they learn more from abundant unlabeled data, such information is not effectively reflected in the global model when fixed aggregation weights are used. LSAA gives more weight to clients with low learning status, so this information is reflected more in the global model.

Utilization of SAM under FSSL

  • We developed a novel SAM objective specific to the FSSL setting. We selectively apply the SAM objective to a small subset of high-confidence pseudo-labeled data, while using an adaptive threshold to incorporate a larger portion of unlabeled data.

Questions:

  • Thank you for your thorough review. We have re-implemented the missing components of FedMatch and conducted an evaluation on the updated implementation. The updated performance results, presented in Table 1 of the global response PDF, demonstrate that (FL)$^2$ consistently outperforms FedMatch across all settings.
  • Thank you for your thorough review of the experimental results. Our observations indicate that SemiFL is sensitive to the labeled dataset when the number of labeled samples is small. For instance, in the case of SVHN with 250 labels, two out of three runs failed after training 700 rounds, resulting in a low accuracy of around 20%. However, the one successful run achieved an accuracy of 90.6%. With 40 labels on SVHN, we did not observe training failure, and the average accuracy was 53.4%. Similarly, when testing CIFAR10 with just 10 labels, all runs of SemiFL failed completely, yielding an accuracy of only 10%. We also attempted to replicate these experiments using the official SemiFL repository with 10 labels and observed the same outcomes.
  • We truly appreciate your thoughtful suggestions. We will certainly update the related work on labels-at-clients in our future manuscript.

Limitations of (FL)$^2$:

  • CAT generates more pseudo-labels compared to fixed threshold methods, which in turn increases the time required for client training.
  • In SACR, additional computation is required due to the need for inference on unlabeled samples using perturbed models.
  • Our methodology relies on strong data augmentation, which may not be feasible for datasets such as sensor data.

[1] Arazo, Eric, et al. "Pseudo-labeling and confirmation bias in deep semi-supervised learning." 2020 International joint conference on neural networks (IJCNN). IEEE, 2020.

Comment

Thanks for the efforts and additional experiments. My concerns have been addressed; I would like to raise my score to borderline accept.

Comment

Thank you for your thoughtful review of our rebuttal. We are pleased that our responses have addressed your concerns.

We will further enhance our final manuscript based on your valuable feedback. Thank you once more for your helpful comments and feedback to improve our work.

Best, Authors.

Review
Rating: 7

This paper studies the federated semi-supervised learning (FSSL) problem. A significant gap between centralized semi-supervised learning and FSSL is found, attributed to confirmation bias. To address this issue, the current paper proposes a new FSSL algorithm by incorporating three new ideas, namely a client-specific adaptive threshold, sharpness-aware consistency regularization, and learning status-aware aggregation. Experimental results show that the proposed method significantly improves upon the performance of existing FSSL algorithms.

Strengths

  • This paper is very well-written. The proposed method and the underlying idea are clearly explained.

  • The idea of using client-specific adaptive thresholds is very neat and well-motivated by mitigating the confirmation bias.

  • The paper proposes the sharpness-aware consistency regularization to address the issue of generalizing to wrongly labeled data points when using sharpness-aware minimization.

  • The numerical experiments show that the proposed method outperforms the existing methods by a significant margin consistently.

  • The paper also numerically assesses the contribution of each component and studies the impact of incorrect pseudo-labels, providing further insight into the success of the proposed method.

Weaknesses

  • The proposed method is based on heuristic reasoning and lacks theoretical justification. It would be nice if the authors could provide some theoretical justification for some components of the proposed method.

  • The numerical studies are not that extensive; only two public datasets are used for benchmarking. It would be more convincing if the authors could provide more thorough comparison studies over different datasets across different settings.

Questions

  • How does the proposed method perform in Figure 1? Does it perform comparably to the centralized SSL method?

Limitations

The paper only briefly mentions the lack of theoretical formulation as one limitation of the work.

Author Response

We greatly appreciate your thoughtful comments and feedback. In our revised manuscript, we intend to address these points as follows:

Theoretical justification

Thank you for your feedback. We added more experiments to show that our method is robust in different settings. We plan to provide a theoretical analysis of our methodology in future work.

More experiments

Thank you for suggesting additional experiments on other public datasets and settings. In response, we conducted further experiments using the CIFAR-100, Fashion-MNIST, and AGNews datasets, and also introduced non-iid-0.1 settings for the CIFAR-10 and SVHN datasets. For detailed information and experimental results, please refer to the global response.

Questions:

Thank you for your feedback. Our accuracies are 38.9%, 81.5%, 83.2%, and 92.2% for 10, 40, 250, and 4000 labels, respectively. While the centralized FreeMatch method achieves 91.9% accuracy, our results, though still limited, represent a significant improvement. Specifically, our method outperforms the previous best-performing approach by a factor of 2.3X. This demonstrates that our approach narrows the gap between centralized and federated settings.

Limitations of (FL)$^2$:

Thank you for the feedback on the limitation. Our limitations are as follows, and we will address them in the camera-ready version.

  • CAT generates more pseudo-labels compared to fixed threshold methods, which in turn increases the time required for client training.
  • In SACR, additional computation is required due to the need for inference on unlabeled samples using perturbed models.
  • Our methodology relies on strong data augmentation, which may not be feasible for datasets such as sensor data.

Comment

Thank you very much for your detailed responses. Including theoretical justification would strengthen the paper. I would maintain my score.

Comment

Thank you again for the valuable suggestions and comments. If you have any remaining concerns, please let us know!

Review
Rating: 4

The paper proposes a new method for federated semi-supervised learning tasks where only the server has a small amount of labeled data. The paper combines 3 different methods to tackle the problem and claims to reduce the confirmation bias issue with the method proposed.

Strengths

The paper is well-written and gives established motivations.

Weaknesses

  • The method seems to be just a combination of FreeMatch and FlatMatch under the FL case.
  • It would be better to give more motivation for the LSAA method.
  • Only 2 datasets are used, while most existing SSL/FSSL methods use three. Based on the existing semi-supervised learning literature, it is quite common to at least also include the CIFAR100 dataset. It would also be better to incorporate other types of datasets (not only images) to show the robustness of the method.
  • The authors claim that the new method can effectively reduce confirmation bias, so it would be better to have experiments specifically designed to show this. It would also be better to show that the method can successfully label hard data compared to the baseline.

Questions

  • Section 5.3: I find it hard to understand why the comparison is made with 'correctly pseudo-labeled data'. First, in the semi-supervised case we do not know the labels, so there is no point in testing only on the correctly pseudo-labeled data; this can only induce bias. Second, if we compare CAT and CAT+SACR (all data), CAT alone performs consistently better. Doesn't this negate the idea of SACR?

Limitations

Same as above

Author Response

We greatly appreciate your thoughtful comments and feedback. In our revised manuscript, we intend to address these points as follows:

Novelty of (FL)$^2$

We appreciate your question regarding the difference between (FL)$^2$ and related works (FreeMatch and FlatMatch). We would like to clarify (FL)$^2$'s novelty as follows:

Different motivation with FlatMatch

  • The primary goal of FlatMatch is to bridge the gap between labeled and unlabeled data by addressing the differences in their loss landscapes. However, since (FL)$^2$ is designed for the labels-at-server scenario, we cannot utilize both unlabeled and labeled data at the same moment, so FlatMatch's methodology cannot be applied.

Use of client-specific threshold (CAT) unlike FreeMatch

  • (FL)$^2$ can deal with non-iid settings by calculating specific adaptive thresholds for each client, while FreeMatch calculates a single global learning status. Because clients' data distributions vary in non-iid settings, each client's learning status can be different, making it necessary to estimate client-specific thresholds.
  • (FL)$^2$ mitigates the risk of overfitting by evaluating the learning status based on the entire dataset at a fixed point in time. Since the local model repeatedly encounters the same data across multiple local epochs, using a running batch, as in FreeMatch, for estimating learning status might reinforce incorrect labels.

Introducing Learning Status-Aware Aggregation (LSAA)

  • We newly introduced LSAA, which adjusts aggregation weights of client local models.

Utilization of SAM under FSSL

  • We developed a novel SAM objective specific to the FSSL setting. We selectively apply the SAM objective to a small subset of high-confidence pseudo-labeled data, while using an adaptive threshold to incorporate a larger portion of unlabeled data.

Motivation of the LSAA method

  • The learning difficulty can vary across clients. First, since the server has access to only a small labeled dataset, clients whose data closely resembles the server's data will face lower learning difficulty, while those with more distinct data will encounter higher difficulty. Second, the learning difficulty naturally differs among clients due to the non-iid distribution of data across them.

  • We propose LSAA, which assigns higher aggregation weights to clients with higher learning difficulty, enabling the global model to learn more effectively from these clients. Figure 1 in the global response PDF highlights the effectiveness of LSAA. Not only does it achieve higher test accuracy compared to the fixed aggregation weights (CAT + SACR), but it also demonstrates higher pseudo-label accuracy. Additionally, LSAA consistently achieves the highest correct label ratio, indicating the percentage of correct pseudo labels among all unlabeled data, throughout the experiment. After 600 training rounds, LSAA also records the lowest wrong label ratio, representing the percentage of incorrect pseudo labels among all unlabeled data. These findings suggest that LSAA effectively reduces incorrect pseudo labels while increasing correct ones, thereby mitigating confirmation bias [1].

Effect of (FL)$^2$ on confirmation bias

Thank you for your feedback. Since wrong pseudo-labels usually lead to confirmation bias [1], we evaluated pseudo-label accuracy, label ratio, correct label ratio, wrong label ratio, and C/W ratio in addition to test accuracy. We compared (FL)$^2$ against baseline methods using the SVHN dataset with 40 labels in a balanced IID setting, as illustrated in Figure 1 of the global response PDF.

A high pseudo-label accuracy indicates that the method produces reliable pseudo labels. A high correct label ratio suggests that the method supplies the model with a higher number of accurate labels. Conversely, a low wrong label ratio indicates that the model encounters fewer incorrect labels, which is crucial for minimizing confirmation bias [1]. Lastly, a high C/W ratio signifies that the model is exposed to more correct labels than incorrect ones, further helping to reduce confirmation bias.

We observed that (FL)$^2$ consistently outperforms the baseline SemiFL across all metrics. Notably, while SemiFL generates more incorrect labels (C/W ratio < 1), (FL)$^2$ produces twice as many correct labels compared to incorrect ones. Additionally, the wrong label ratio for (FL)$^2$ is approximately 30%, significantly lower than SemiFL's 45%. These results suggest that (FL)$^2$ effectively reduces the number of incorrect pseudo-labels while increasing the number of correct ones, thereby mitigating confirmation bias.

Furthermore, we can observe the effectiveness of each component of (FL)$^2$, namely CAT, SACR, and LSAA. Using CAT and SACR alone delivers better performance compared to the baseline in terms of all metrics. If we use CAT + SACR, pseudo-label accuracy increases, the correct label ratio increases, and the wrong label ratio decreases, which means we can reduce the confirmation bias. When LSAA is added, which gives (FL)$^2$, it achieves the best performance across all metrics. This suggests that the synergistic effect of CAT, SACR, and LSAA reduces confirmation bias effectively.

Questions:

We want to clarify terms used in the manuscript.

  • CAT + SACR in Table 3 (Section 5.2): Our proposed method in Section 4.2. We apply SACR to only high-confidence samples.
  • CAT + SACR (all data) in Figure 3 (Section 5.3): SACR is applied to all of the pseudo labels generated by CAT. This is an ablation study to show why SACR should not be applied to all of the pseudo labels by CAT.
  • CAT + SACR (only correct pseudo-label) in Figure 3 (Section 5.3): SACR is only applied to correctly pseudo-labeled samples. This is for testing the upper bound of our proposed method, assuming we know the ground-truth labels.

[1] Arazo, Eric, et al. "Pseudo-labeling and confirmation bias in deep semi-supervised learning." 2020 International joint conference on neural networks (IJCNN). IEEE, 2020.

Comment

Thanks for more experiments and results. I still have some concerns regarding innovation and motivation. Also, I'm not entirely convinced by the authors' answer regarding the confirmation bias. The authors mentioned the pseudo-label accuracy and correct label ratio, which seem to be the same thing (the accuracy of pseudo labeling). However, the definition of confirmation bias proposed by the reference is about whether or not the model is overfitting to the wrong pseudo-labels. Therefore, how the proposed methods can mitigate confirmation bias still lacks persuasiveness. Therefore, I would maintain my score for now.

Comment

Thank you for your comment regarding motivation and innovation. We would like to further explain (FL)$^2$'s motivation and novelty.

Novelty of (FL)$^2$

We would like to further explain the novelty of (FL)$^2$ in detail.

Use of client-specific threshold (CAT) unlike FreeMatch

  • (FL)$^2$ provides a precise measure of the client's learning status rather than relying on an EMA-based estimation like FreeMatch, since (FL)$^2$ directly calculates the learning status using all unlabeled data.
  • (FL)$^2$ is computationally efficient. Instead of continuously updating the learning status every batch like FreeMatch, (FL)$^2$ only requires one pass of learning-status calculation over the unlabeled samples per communication round. Additionally, since we utilize the global pseudo-labeling scheme proposed in SemiFL, we must go through all the unlabeled data anyway to pseudo-label them. The additional computation for CAT fits naturally into the existing workflow.

Utilization of SAM under FSSL

  • We developed a novel SAM objective specific to the FSSL setting. We first applied the Sharpness-Aware Minimization (SAM) objective in an FSSL setting and discovered that applying it to all pseudo-labeled data degrades performance, even though SAM shows strong generalization ability across different tasks. Instead, we introduced a novel approach that selectively applies the SAM objective to a small subset of high-confidence pseudo-labeled data, while using an adaptive threshold to incorporate a larger portion of unlabeled data. Our ablation study (Table 2 of the paper), which emphasizes the importance of each component, demonstrates that this combination of techniques effectively reduces confirmation bias and achieves high performance.

Motivation of LSAA

We would like to further clarify the motivation of LSAA.

  • In a centralized SSL, methods like FlexMatch and FreeMatch introduced the use of different thresholds for each class within a dataset and showed their effectiveness. The rationale is that different classes pose varying levels of learning difficulty, so lower thresholds are assigned to more challenging classes—those with a lower learning status—to facilitate more effective learning from these harder classes.
  • In the context of FSSL, the learning difficulty can vary across clients. This variation arises for two main reasons. First, since the server has access to only a small labeled dataset, clients whose data closely resembles the server’s data will face lower learning difficulty, while those with more distinct data will encounter higher difficulty. Second, due to the non-iid distribution of data across clients, the learning difficulty naturally differs among them.
  • We propose LSAA to take into account the different learning difficulties of clients: LSAA assigns higher aggregation weights to clients with higher learning difficulty, enabling the global model to learn more effectively from these clients. In contrast, previous FSSL approaches did not account for these variations in learning difficulty and instead relied on fixed aggregation weights.

Comment

Thank you for your response to our rebuttal. We sincerely appreciate your effort and time in going through our rebuttals and raising your concern. We want to be more explicit in addressing your remaining concerns as follows:

Regarding the confirmation bias

As highlighted in previous work on Semi-Supervised Learning and other works that mention confirmation bias [2, 3, 4, 5, 6], one approach to reducing confirmation bias is to filter out noisy pseudo-labels and retain only the high-quality ones. For instance, FixMatch [2] claimed that it effectively reduced confirmation bias by generating higher-quality pseudo-labels. Similarly, Mean Teacher [3] suggested that improving target quality can help mitigate confirmation bias. Additionally, the SoftMatch [4] paper highlighted that incorrect pseudo-labels often contribute to the occurrence of confirmation bias. Based on these previous works, we believe that demonstrating a higher correct label ratio effectively shows how well our method filters out incorrect pseudo-labeled samples. Consequently, since (FL)$^2$ has a significantly lower rate of incorrect pseudo-labels, their influence on the unsupervised loss is much less pronounced compared to SemiFL. This suggests that the training signals propagated by these pseudo-labels are more reliable in (FL)$^2$ due to the lower contribution of noisy pseudo-labels; our unsupervised loss is dominated mainly by correctly pseudo-labeled samples compared to SemiFL.

Nevertheless, to further evaluate this, we designed and conducted an experiment to compare the level of overfitting of (FL)$^2$ and SemiFL to wrongly pseudo-labeled data in a scenario with extremely scarce labeled data. We measured this by inspecting Rsoft, RLoss, and RAccuracy. Rsoft is the average confidence on the wrongly pseudo-labeled data, a metric similar to the one measured in [1]. RLoss and RAccuracy denote the training loss and accuracy on the wrongly pseudo-labeled data, respectively. High Rsoft and RAccuracy and low RLoss indicate that the model is overfitting to the incorrectly pseudo-labeled data. The experiment used the CIFAR-10 dataset under a balanced IID setting with only 10 labeled samples at the server, keeping all hyperparameters the same as in the main experiments, except for the number of communication rounds, which we set to 500.

| Method | Label Ratio (%) | Rsoft (after softmax) | RLoss | RAccuracy (%) |
| --- | --- | --- | --- | --- |
| SemiFL | 100 | 99.85 | 0.0015 | 100 |
| (FL)$^2$ | 86.86 | 84.53 | 0.4008 | 87.09 |

The results clearly show that SemiFL overfits significantly to the wrongly pseudo-labeled data compared to (FL)$^2$. This demonstrates (FL)$^2$'s robustness against incorrect pseudo-labels, particularly in scenarios with extremely scarce labeled data at the server, outperforming SemiFL. Thus, we conclude that (FL)$^2$ is more effective in reducing confirmation bias in low-label scenarios.
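
For clarity, a minimal sketch of how these three quantities could be computed on a batch of wrongly pseudo-labeled samples (identified beforehand, for evaluation only) is shown below; the function and variable names, and the exact definitions, are our assumptions rather than the rebuttal's actual evaluation code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def overfitting_metrics(model, x_wrong, wrong_pseudo):
    """Rsoft, RLoss, RAccuracy on samples whose pseudo label disagrees with the
    ground truth; high Rsoft/RAccuracy and low RLoss suggest the model has
    memorized the incorrect pseudo labels."""
    logits = model(x_wrong)
    probs = F.softmax(logits, dim=-1)

    r_soft = probs.gather(1, wrong_pseudo.unsqueeze(1)).mean()       # avg confidence in the wrong label
    r_loss = F.cross_entropy(logits, wrong_pseudo)                   # training loss on the wrong labels
    r_acc = (logits.argmax(dim=-1) == wrong_pseudo).float().mean()   # agreement with the wrong labels
    return r_soft.item(), r_loss.item(), r_acc.item()
```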

Thank you again for your insightful suggestion to explore overfitting with incorrectly pseudo-labeled data. We will ensure that the results of this experiment are included in the camera-ready version.

  • [1] Arazo, Eric, et al. "Pseudo-labeling and confirmation bias in deep semi-supervised learning." 2020 International joint conference on neural networks (IJCNN). IEEE, 2020.
  • [2] Sohn, Kihyuk, et al. "Fixmatch: Simplifying semi-supervised learning with consistency and confidence." Advances in neural information processing systems 33 (2020): 596-608.
  • [3] Tarvainen, Antti, and Harri Valpola. "Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results." Advances in neural information processing systems 30 (2017).
  • [4] Chen, Hao, et al. "SoftMatch: Addressing the Quantity-Quality Tradeoff in Semi-supervised Learning." The Eleventh International Conference on Learning Representations.
  • [5] Nassar, Islam, et al. "All labels are not created equal: Enhancing semi-supervision via label grouping and co-training." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
  • [6] Zhang, Bowen, et al. "Flexmatch: Boosting semi-supervised learning with curriculum pseudo labeling." Advances in Neural Information Processing Systems 34 (2021): 18408-18419.

Comment

Thank you again for your critical and constructive feedback on our paper and rebuttals. If there is still anything that is not clear about our justification regarding the confirmation bias or the novelty/motivation of our work, we would be happy to provide more clarification.

If you have any further concerns or questions, please do not hesitate to let us know as it will be extremely important for our revised manuscript.

Thank you very much!

Author Response

More experiments

We conducted additional experiments using the CIFAR-100, Fashion-MNIST, and AGNews datasets, and introduced a non-iid-0.1 setting for the SVHN and CIFAR-10 datasets. Additionally, we performed an ablation study to determine the optimal rho value for Adaptive Sharpness-Aware Minimization (ASAM [1]). Through grid search, we identified rho=0.1 as the best-performing value for both the SVHN and CIFAR-10 datasets, leading to revisions in the main table (Table 1 in the global response PDF). Thanks to reviewer YhMN's feedback, we discovered and addressed missing parts in the FedMatch implementation, re-implemented it, and re-conducted the experiments.

The results for CIFAR-100 are presented in Table 1, while the results for Fashion-MNIST and AGNews are detailed in Table 2, both of which can be found in the PDF file of the global response.

CIFAR-100 For CIFAR-100, we used WideResNet-28x8 as in previous works [2,3]. (FL)$^2$ consistently outperforms the baselines across most configurations. Notably, in the 400-label, non-iid-0.3 setting, (FL)$^2$ achieved a 6.4% improvement over the baselines. We will add experiments on a non-iid-0.1 setting in our camera-ready version.

Fashion-MNIST For Fashion-MNIST, we used WideResNet-28x2 as for the SVHN and CIFAR-10 datasets. We compared our method with SemiFL, the previous state of the art. Because of the time constraint, we only compare with SemiFL, but we will add other baselines for the camera-ready version. With only 40 labeled samples, SemiFL failed in 3 out of 3 runs in the iid setting and 2 out of 3 runs in the non-iid-0.3 setting, resulting in ~10% accuracy. In the one run that did not fail in the non-iid-0.3 setting, SemiFL achieved 18.4% accuracy. In contrast, (FL)$^2$ trains successfully in all 3 runs in the non-iid-0.3 setting and in 2 out of 3 runs in the iid setting. In the one failed run in the iid setting, it yields ~10% accuracy. However, in the remaining iid runs, (FL)$^2$ achieved accuracies of 69% and 70.4%. In the non-iid-0.3 setting, (FL)$^2$ achieved an average accuracy of 63.2%. The results demonstrate that (FL)$^2$ is robust and effective even with minimal labeled data, particularly in a few-labels setting.

AGNews We randomly sampled 12,500 training samples per class (out of a total of 50,000 samples) and applied back-translation for strong augmentation, following the methodology outlined in the SoftMatch [4] paper. We used bert-base-uncased as the backbone model. We froze the BERT parameters and trained only the linear classifier's parameters, running the training for 20 epochs. Since the mixup loss cannot be applied to an NLP dataset, we conducted our comparison using SemiFL without the mixup loss. (FL)$^2$ significantly outperforms the baseline, achieving a 39.6% accuracy improvement in the iid setting and a 14.5% improvement in the non-iid setting. With only 20 labels provided, SemiFL exhibited considerable performance variance across experiments, with a standard deviation of 14.3 and 13.7 for the iid and non-iid-0.3 settings, respectively. In contrast, (FL)$^2$ consistently delivered stable results, with a standard deviation of 0.6 for the iid setting and 3.7 for non-iid-0.3.

More experiments on previously used datasets (CIFAR-10 and SVHN) We also introduced a new non-IID-0.1 setting to our previous experiments on the CIFAR-10 and SVHN datasets. In this non-IID-0.1 setting, (FL)$^2$ consistently outperformed all baseline methods across the datasets, showing 10% higher accuracy than the best-performing baseline on CIFAR-10 with 40 labels.

[1] Kwon, Jungmin, et al. "Asam: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks." International Conference on Machine Learning. PMLR, 2021.

[2] Wang, Yidong, et al. "FreeMatch: Self-adaptive Thresholding for Semi-supervised Learning." The Eleventh International Conference on Learning Representations. 2022.

[3] Diao, Enmao, Jie Ding, and Vahid Tarokh. "Semifl: Semi-supervised federated learning for unlabeled clients with alternate training." Advances in Neural Information Processing Systems 35 (2022): 17871-17884.

[4] Chen, Hao, et al. "SoftMatch: Addressing the Quantity-Quality Tradeoff in Semi-supervised Learning." The Eleventh International Conference on Learning Representations.

Final Decision

The paper proposes a novel heuristic algorithm for labels-at-server FL that empirically can improve significantly over previous state-of-the-art. The reviewers generally find the paper to be well written, but especially reviewer LSrU raises concerns about limited novelty and insufficient motivation for the methods, although these are rather unspecific. The latter of these is also addressed by the authors at length in their rebuttal.

Based on my own reading, I also find the paper well-written and easy to follow. The proposed method can provide significant boost especially in classification problems with fewer classes such as CIFAR-10 and SVHN, while its benefit in CIFAR-100 seems more limited. Still, I believe the paper would be a worthy addition to the literature and in line with the majority of the reviewers, recommend its acceptance.