PaperHub · ICLR 2024
Rating: 4.0/10 · Rejected (4 reviewers; scores 3, 3, 5, 5; min 3, max 5, std dev 1.0)
Average confidence: 4.0

Protecting Sensitive Data through Federated Co-Training

Submitted: 2023-09-22 · Updated: 2024-02-11
TL;DR

We propose a federated co-training approach to enhance privacy and train interpretable models from sensitive distributed data via the use of an unlabeled public dataset, achieving a high model quality, while preserving privacy to the highest level.

Abstract

Keywords

federated learning, co-training, federated semi-supervised learning

Reviews and Discussion

Review (Rating: 3)

The authors use federated co-training, in which local hard labels on a public unlabeled dataset are shared and aggregated into a consensus label. The server forms this consensus, and clients use it as pseudo-labels for the unlabeled dataset in their local training. For data protection, the idea is to integrate the XOR mechanism to achieve differential privacy over binary data. Finally, the authors provide empirical experiments on multiple datasets.

Strengths

  • The authors conducted many experiments.

Weaknesses

  • The idea seems to be just a combination of different modules.
  • The explicit motivation is unclear.
  • The writing logic should be improved.

Questions

What is the explicit motivation of this paper? Why is this idea important?

Abstract

  • The first half introduces some general and basic knowledge of federated learning, data privacy, applications, and co-training. The logic is not smooth. Several sentences have little relation to each other here. What is the explicit problem that you target here? What are the explicit problems of prior research?
  • The proposed idea seems to have better model quality and privacy. However, the ambiguous language confuses me. What is the definition of privacy or quality?

Introduction

  • The introduction mentions differential privacy, unlabeled datasets, and gradient-based methods. What are the problems of these research areas? Do the authors improve each of them?
  • The authors detailed what they have done. I am confused about why the combination (co-training, differential privacy, federated learning) is meaningful. What are your insights?

Construction

Section 3 combines prior research and the new idea. Section 4 starts with an introduction to differential privacy. Could the authors elaborate on what is new here? What research line do you follow?

Ethics Concerns Details

N/A

Comment

Dear reviewer UQKJ,

Thank you for your evaluation. We will address your questions point-by-point.

  • The main motivation for our work is to improve privacy in federated learning. As stated in the abstract and introduction, federated learning promises to protect data privacy by only sharing model parameters, but in fact model parameters reveal a lot about training data. Differential privacy mechanisms can be used to improve privacy, but their privacy-utility trade-off is not ideal. Therefore, improving privacy is a very relevant problem. We approach this problem by first noting that in many application scenarios where privacy is paramount, e.g., healthcare, large public unlabeled datasets exist. Our approach uses such an unlabeled dataset to exchange information between clients. Distributed distillation and its variants have shown that by sharing soft labels for such an unlabeled dataset we can train neural networks collaboratively. We show that sharing soft labels indeed improves privacy over sharing model parameters, but does not fully protect private data. By sharing hard labels instead, we are able to protect privacy to the highest degree - at least in terms of existing membership inference attacks - while maintaining the model quality of sharing model parameters and soft labels. We hope that this clarifies our motivation. We will improve the description of our motivation in the manuscript.

  • Model quality is canonically measured in terms of test accuracy as a proxy for the generalization error. Privacy is measured empirically by the likelihood of an attacker being able to conduct a successful membership inference attack (as described in Sec. 5 “Experimental Setup”, see also the probabilistic interpretation of ROC AUC), and it is determined theoretically via differential privacy (as described in Sec. 4).

  • As discussed in the first point, our motivation is to improve privacy while maintaining model quality in federated learning. We do so by using an unlabelled dataset.

  • Our insights are that sharing hard labels instead of model parameters or soft labels indeed improves privacy - which was to be expected - and surprisingly does not decrease model quality. Moreover, we find that sharing hard labels has the additional benefit of allowing us to train interpretable models, like decision trees and rule ensembles, in a federated way. We can even train ensembles of various local models, as we show in the general comment.

  • Our Section 3 explains the scenario, introduces the novel method FedCT and analyzes its convergence and communication complexity. Since privacy is the main motivation for our work, we show in Section 4 how a privacy guarantee based on differential privacy can be derived for FedCT. For this, we provide a novel sensitivity analysis (Proposition 2), which is required to provide a meaningful privacy guarantee.

Comment

Thanks for the response.

Comment

We appreciate the time and effort you took to evaluate our work. Should our explanation have addressed all your questions, we would kindly ask you to consider adjusting the score accordingly.

Comment

Thanks for the authors' efforts; I have re-read the paper. Sorry that I cannot improve my rating at this stage.

What is the explicit definition of privacy in this paper? For example, differential privacy has its definition. What is the exact meaning of improving privacy (improving differential privacy?) in your paper? Using some attacks for verification is not strong evidence for claiming improved privacy. Actually, the response is still ambiguous in my view.

Comment

Dear reviewer UQKJ,

We gladly explain the definitions for privacy used in our paper. As you are certainly aware, there are two general views on how to measure privacy: theoretical and empirical. We use both and have used the standard definitions. The motivation for using both, as mentioned in the introduction, is as follows.

The theoretical view rightfully claims that the only proof of privacy is a theoretical guarantee. Here, differential privacy (Dwork, 2006) is currently the de facto gold standard. Standard mechanisms that ensure differential privacy are the Laplace and Gaussian mechanisms, which add appropriate noise to the shared data. In our case, the shared data is binary and neither mechanism is applicable. Therefore, we use the XOR mechanism, which requires a novel sensitivity analysis to select the noise variance appropriately (our Proposition 2).
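For reference, the standard definition invoked here (Dwork, 2006) can be stated compactly; the notation below is generic and not copied from the manuscript.

```latex
% (epsilon, delta)-differential privacy (Dwork, 2006): a randomized mechanism M
% satisfies the definition if, for all neighboring datasets D ~ D' (differing in
% a single record) and every measurable set S of outputs,
\Pr\big[\mathcal{M}(D) \in S\big] \;\le\; e^{\varepsilon}\,\Pr\big[\mathcal{M}(D') \in S\big] + \delta
% Pure epsilon-DP is the special case delta = 0.
```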

A major downside of the theoretical view is that the privacy-utility trade-off in practice is often sub-optimal. As we state in our introduction, the actual privacy guarantees attainable in federated averaging with decent utility (measured in model quality) are basically vacuous (to the best of our knowledge, the group of Srini Devadas has made the most accurate analysis of differential privacy in federated learning, so far). The empirical view argues that, for exactly that reason, theoretical guarantees do not quantify actual privacy in a practically useful way. Instead, privacy can be assessed much more accurately by applying the most advanced attacks and testing how well they succeed. For example, Reza Shokri has developed several (versions of) toolkits that implement this strategy. We operationalize this empirical privacy in our vulnerability score, which is the ROC AUC of a large number of membership inference attacks against the shared data (model parameters for FedAvg, soft labels for DD, and hard labels for FedCT). We explained this in Section 5 of the manuscript and provide further details in Appendix C1. From the probabilistic interpretation of ROC AUC it follows that the vulnerability score measures the success probability of such attacks and thus provides an interpretable measure.
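To make the vulnerability score concrete, the following is a minimal sketch of an ROC-AUC-based score computed from membership-inference attack outputs. It is illustrative only: the function name, the attack scores, and the aggregation over attacks are assumptions, not the evaluation code described in Sec. 5 and Appendix C1.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def vulnerability(attack_scores: np.ndarray, is_member: np.ndarray) -> float:
    """ROC AUC of a membership-inference attack against the shared information.

    attack_scores : score assigned by the attacker to each candidate example
                    (higher = "more likely a training member").
    is_member     : 1 if the example is in the private training data, else 0.

    By the probabilistic interpretation of ROC AUC, this is the probability that
    a random member is scored above a random non-member; 0.5 is random guessing.
    """
    return roc_auc_score(is_member, attack_scores)

# Illustrative aggregation over several attacks (variables hypothetical):
# vul = float(np.mean([vulnerability(s, y) for s, y in attack_results]))
```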

In conclusion, both differential privacy and vulnerability are useful measures of privacy. Differential privacy guarantees that no matter the strategy or ingenuity of the attacker, they cannot infer anything about the data beyond the limits of the guarantee, but it might underestimate the actual privacy, even though we thoroughly analyzed the sensitivity to make the bound as tight as possible. At the same time, vulnerability quantifies how well protected private data is in practice, but it might overestimate privacy, since it does not account for yet undiscovered attacks.

While we cannot explain this fundamental aspect of measuring privacy in detail in the introduction, we will gladly include a paragraph in the appendix.

Review (Rating: 3)

The paper proposes an algorithm to protect the privacy of shared soft predictions on a public dataset, which are used to reach a consensus at the server in semi-supervised learning. In particular, the server owns an unlabeled dataset and clients own private labeled datasets. Each client uses its local model to infer on the public dataset, computes the soft predictions on the unlabeled data, and then sends this information to the server in a differentially private manner. The server aggregates this local knowledge into a consensus, which mitigates overfitting in local training.

Strengths

  1. The paper attempts to solve an important problem: semi-supervised federated learning.
  2. The paper provides a theoretical guarantee of privacy.
  3. Utilizing a different differential privacy mechanism (XOR) instead of the Gaussian mechanism is interesting.

Weaknesses

  1. The novelty is limited. Many prior studies have explored semi-supervised federated learning [1-3] and knowledge distillation [4-6]. The paper does not propose a new framework for FedSSL but just naively combines differential privacy techniques. The paper does not illustrate the difference between FedCT and those previous works.
  2. The motivation to protect the privacy of soft predictions is weak. Compared to raw data/feature maps, soft predictions are not likely to leak privacy.
  3. The paper is not well-organized and uses non-standard terminology, for example, the term 'semi-honest' in Section 4. Usually, we use 'honest-but-curious' to describe a server that cannot modify the local updates but tries to infer private information about clients.
  4. The baselines are not comprehensive. DP-FedAvg applies DP to the updated gradients while the proposed FedCT applies DP to the predictions, so they are not comparable. FedCT should be compared with other FedSSL methods [1-3], and attacks should be conducted to verify the effect of FedCT.
  5. The setting for data heterogeneity is not sufficient. Using beta = 2 for the non-i.i.d. setting is not adequate. Most papers set beta = 0.01 or 0.1 to create heterogeneous data partitions.

Reference:

[1] Diao E, Ding J, Tarokh V. SemiFL: Semi-supervised federated learning for unlabeled clients with alternate training[J]. Advances in Neural Information Processing Systems, 2022, 35: 17871-17884.

[2] Lin H, Lou J, Xiong L, et al. Semifed: Semi-supervised federated learning with consistency and pseudo-labeling[J]. arXiv preprint arXiv:2108.09412, 2021.

[3] Itahara S, Nishio T, Koda Y, et al. Distillation-based semi-supervised federated learning for communication-efficient collaborative training with non-iid private data[J]. IEEE Transactions on Mobile Computing, 2021, 22(1): 191-205.

[4] Li D, Wang J. Fedmd: Heterogenous federated learning via model distillation[J]. arXiv preprint arXiv:1910.03581, 2019.

[5] Zhu Z, Hong J, Zhou J. Data-free knowledge distillation for heterogeneous federated learning[C]//International conference on machine learning. PMLR, 2021: 12878-12889.

[6] Chen H, Vikalo H. The Best of Both Worlds: Accurate Global and Personalized Models through Federated Learning with Data-Free Hyper-Knowledge Distillation[J]. arXiv preprint arXiv:2301.08968, 2023.

Questions

  1. What is the main difference between FedCT and the previous FedSSL methods?

  2. What is the motivation to use the XOR mechanism instead of the Gaussian mechanism? What is the benefit?

  3. I don't understand the last discussion of interpretable models. Why are there only FedCT and Centralized in the experimental results?

Comment

Dear reviewer W2md,

We appreciate your detailed assessment of our work and we are happy that you find the problem relevant and our theoretical privacy analysis and the employed XOR mechanism interesting. We address your questions point-by-point.

Weaknesses:

W1: We are aware of the wide range of semi-supervised federated learning methods, but to the best of our knowledge, FedCT is novel in that it shares hard labels on a public unlabeled dataset. This improves privacy substantially while maintaining model quality. Moreover, it allows us to train interpretable models. Since sharing hard labels hasn’t been explored in the literature, achieving differential privacy is not straightforward. We employed the XOR mechanism for binary data, which required us to derive a novel bound on the sensitivity of classification algorithms. Regarding the papers you have referenced, [4] is not applicable since it assumes a public labeled dataset, [3] shares soft labels similar to distributed distillation, and [1],[2],[5],[6] share model parameters and therefore do not improve privacy over FedAvg at all.

W2: As our experiments show, the vulnerability of sharing soft labels to known membership inference attacks is lower than sharing model parameters, as one would expect, but still considerable. Since sharing hard labels achieves a model quality similar to sharing soft labels, its additional privacy is a clear advantage. Together with the fact that sharing hard labels allows us to train interpretable models, or mixed models (see the general comment), this means that FedCT has clear advantages over both sharing model parameters and soft labels.

W3: We used the term semi-honest as defined in [7] and used widely in cryptography and security, but we are happy to change the term to honest-but-curious in our manuscript.

W4: We have compared FedCT to model parameter sharing and soft label sharing in terms of accuracy and privacy, as well as the common differential privacy mechanism applied to federated averaging. Our empirical privacy evaluation measures, in terms of ROC AUC, the success of an attacker performing state-of-the-art membership inference attacks on the information that is shared - model parameters, soft labels, or hard labels. Therefore, the resulting vulnerability scores are comparable. We do not see how we can attach our method of sharing hard labels to other federated semi-supervised learning methods that share soft labels or model parameters. Could you please elaborate how you envision such experiments?

W5: We did not extensively explore scenarios with very strong heterogeneity. Setting α to 2 yields the class distribution shown in Figure 2. The impact of strong heterogeneity on the consensus is intricate and depends on factors such as the local learning algorithm, the model, the number of clients, and the robustness of the consensus mechanism. Investigating this is very interesting, but it falls beyond the scope of this work.

Figure 2: Class distribution for α = 2
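To illustrate how the Dirichlet parameter α shapes such a class distribution, here is a minimal sketch of the standard Dirichlet label-partitioning scheme used in FL experiments; the function and its names are illustrative assumptions, not the authors' experimental code.

```python
import numpy as np

def dirichlet_partition(labels: np.ndarray, n_clients: int, alpha: float,
                        rng: np.random.Generator) -> list[list[int]]:
    """Split example indices across clients with Dirichlet(alpha) class proportions.

    Small alpha (e.g. 0.01) yields highly skewed, heterogeneous clients;
    large alpha (e.g. 100) yields nearly uniform, homogeneous clients.
    """
    client_indices = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        # Proportions of class c assigned to each client.
        props = rng.dirichlet(alpha * np.ones(n_clients))
        cut_points = (np.cumsum(props) * len(idx)).astype(int)[:-1]
        for client, part in enumerate(np.split(idx, cut_points)):
            client_indices[client].extend(part.tolist())
    return client_indices
```

The setup described later in this discussion combines a fairly homogeneous part (large α) with a heterogeneous part (small α), which amounts to generating two such partitions and concatenating them per client.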

Questions:

Q1: Regarding the FedSSL methods you referenced, [1],[2],[5],[6] share model parameters and therefore do not improve privacy over FedAvg, [3] shares soft labels similar to distributed distillation, and [4] assumes a public labeled dataset and thus is not applicable to our scenario.

Q2: Since the information shared by clients in FedCT consists of binary matrices, we cannot apply the Gaussian or Laplace mechanism, but need a noise mechanism tailored to binary data. The XOR mechanism was recently developed for binary data and provides a differential privacy guarantee, given that we can quantify the sensitivity. Since we cannot use clipping, like in the Gaussian mechanism for federated averaging, determining the sensitivity is non-trivial. In our Proposition 2, we derive a novel bound on the sensitivity of on-average-leave-one-out-stable learning algorithms, a class that includes a wide range of machine learning methods.
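For context, the quantity that such a mechanism must be calibrated to is the standard notion of sensitivity from the differential privacy literature; the generic form is shown below, while the concrete norm for binary label matrices and the stability-based bound of Proposition 2 are in the paper and not reproduced here.

```latex
% Global sensitivity of a query f over neighboring datasets D ~ D'
% (differing in a single record); the noise level of a DP mechanism is
% calibrated to this quantity. For binary outputs, a Hamming-type
% distance is the natural choice of norm.
\Delta f \;=\; \max_{D \sim D'} \big\lVert f(D) - f(D') \big\rVert
```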

Q3: The reason why there are no baselines is simply that to the best of our knowledge, there are none. There exist several approaches that train decision trees in a semi-federated manner by computing the information gain in a distributed way (and protecting privacy via encryption)[8], but there is no method that trains local models collaboratively. The most informative baseline, thus, is centralized learning.

References:

[7] Goldreich, Oded. Foundations of cryptography: volume 2, basic applications. Cambridge university press, 2009.

[8] Truex, Stacey, et al. "A hybrid approach to privacy-preserving federated learning." Proceedings of the 12th ACM workshop on artificial intelligence and security. 2019.

Comment

Thank you for the response! Some of my questions have been clarified. Nevertheless, I still have one main concern:

As several reviewers question the motivation for using hard labels rather than soft labels, the authors claim that using hard labels provides higher privacy protection than soft labels while preserving similar performance on IID data. However, knowledge distillation (KD) techniques using soft labels in FL, such as FedMD [1], aim to mitigate the detrimental effect of data heterogeneity. This work ran only limited experiments on non-IID data, so it is hard to tell whether FedCT still achieves similar performance without using soft labels.

Reference:

[1] Li D, Wang J. Fedmd: Heterogenous federated learning via model distillation[J]. arXiv preprint arXiv:1910.03581, 2019.

Comment

Dear reviewer W2md,

Thank you for your response. We are glad that most of your questions have been clarified and we are happy to address your remaining concern. We agree that data heterogeneity is an important problem in federated learning. We find in our experiments that for label heterogeneity, label sharing (both soft, as in distillation, and our hard label sharing) requires that client predictions are not too far off, so that we can still achieve a meaningful consensus. In our experiments, we therefore designed local datasets to have a part that is fairly homogeneous (Dirichlet with large α), to ensure that each client has at least a little information about each class. The other part is heterogeneous (Dirichlet with smaller α). The same could in expectation be achieved with an intermediate Dirichlet parameter α, but we wanted to make sure data is of the desired form. We have used α = 100 for the fairly homogeneous part, and α = 2 for the more heterogeneous part (see the experimental setup in Sec. 5 and Appendix C.3, and our visualization of the data distribution in Fig. 2 in our previous reply). Since in many standard FL papers, stronger heterogeneity is assumed, we performed an additional experiment for this rebuttal where we use α = 0.01 for the heterogeneous part. The results in the following table show that FedCT still performs on par with both FedAvg and DD.

Dirichlet α | FedCT  | FedAvg | DD
α = 0.01    | 0.7981 | 0.7907 | 0.7903

The experiments in FedMD [1] on heterogeneous data are very interesting. They use a private labeled dataset and a public labeled dataset with related classes - in this case, CIFAR100, which contains 100 classes from 20 superclasses (bicycle, bus, and car are all vehicles). The task of local models is to predict the superclass, but client datasets only contain one class per superclass. Thereby, the datasets are heterogeneous wrt. the class labels, but homogeneous wrt. the superclasses. This also ensures that a meaningful consensus can be achieved. While we do not assume a labeled dataset as FedMD does, we evaluate FedCT in this scenario as well. We achieve a test accuracy of ACC = 0.5106, which is comparable to the test accuracy of roughly ACC = 0.5 that FedMD reports - note that this comparison favors FedMD, since FedMD uses supervised pre-training (transfer learning) on the labeled public dataset, which we do not. We will include these new results in the updated version of our manuscript.

Comment

I appreciate the authors' effort to implement additional experiments. More details need to be provided in the experimental results: what is the dataset? How many clients are in the system?

I understand the authors do not have much time to run experiments in the rebuttal period, but the work still has substantial room for improvement, including comprehensive experiments to evaluate the effect of data heterogeneity. I don't think a single experiment can illustrate the effect.

Therefore, I cannot raise my score at this stage.

Comment

Dear reviewer W2md,

Thank you for the swift reply. The setup is the same as in the original experiment on heterogeneous data. While we happily include additional experiments in our manuscript, we want to point out that data heterogeneity is not the focus of this paper.

Our focus is to improve privacy, which we have shown both theoretically and empirically. At the same time, we show that this improved privacy does not decrease utility. We have demonstrated this on 5 different datasets with homogeneous and heterogeneous data, for various privacy budgets of differential privacy, and different numbers of clients, and we additionally evaluated the approach on 5 different datasets for 4 different types of interpretable models. For this rebuttal, we furthermore conducted experiments with two different heterogeneous model ensembles, added PATE as a baseline, added an experiment with increased data heterogeneity, and evaluated our method in a novel heterogeneous setup on CIFAR100, outperforming FedMD even without labeled public data.

Comment

To address your question about the behavior on heterogeneous data, we already added the experiment with stronger heterogeneity, showing that also in this case soft- and hard-label sharing performs on par with model parameters sharing (FedAvg). We conjecture that as long as clients achieve a minimum performance on all data, a meaningful consensus can be formed and label sharing (both hard and soft labels) works well. To test this conjecture, we added another experiment where we investigate the performance in a pathological distribution (i.e., all data drawn with α = 0.01 and clients only observe a small subset of labels) in which the performance of all methods decreases, but label sharing approaches (both DD and FedCT) perform substantially worse than model parameters sharing (FedAvg), supporting our conjecture. You can find these results in Appendix C.4 of the revised manuscript.

Review (Rating: 5)

This paper proposes a federated learning framework, namely FedCT, to provide privacy-preserving collaborative learning by sharing “hard labels” for an unlabeled dataset. The authors theoretically demonstrate the convergence of FedCT. Besides, they propose an XOR mechanism to protect the privacy of the shared labels. Experiments on 7 datasets showcase the superiority of FedCT compared to some baselines.

Strengths

S1. The proposed FedCT is technically sound.

S2. This paper provides a sufficient theoretical analysis of convergence and privacy guarantees.

S3. The writing is generally good and easy to understand.

Weaknesses

W1. The contribution is trivial. There are a lot of federated learning frameworks based on sharing knowledge via a public unlabeled dataset, to name a few [1,2,3,4,5]. It seems that the only difference is that clients in FedCT upload “hard labels” while clients in other works share soft labels.

W2. The motivation is not clear. According to the basic idea of knowledge distillation, soft labels should potentially contain more useful information than hard labels. The authors are encouraged to clarify the reason for replacing soft labels with hard labels.

W3. More representative baselines are needed. As mentioned in W1, there are many similar works; it would be more convincing to conduct experiments to compare them.

W4. The appendix mentioned in the paper cannot be found; maybe it is just for me.

[1] FedMD: Heterogenous Federated Learning via Model Distillation.

[2] Heterogeneous Ensemble Knowledge Transfer for Training Large Models in Federated Learning

[3] Communication-Efficient and Model-Heterogeneous Personalized Federated Learning via Clustered Knowledge Transfer

[4] Model-contrastive federated learning

[5] Ensemble distillation for robust model fusion in federated learning

Questions

See Weaknesses. Addressing these weaknesses (especially W1 and W2) will improve the persuasiveness and quality of this paper.

Comment

Dear reviewer e9mW,

We appreciate your comprehensive assessment of our work and appreciate the favorable evaluation of the soundness and theoretical analysis of our approach. We will address your questions point-by-point.

W1. Thank you for the references. In our related work section, we decided to discuss only those federated semi-supervised learning methods that directly fit our scenario, due to space limitations. That is, distributed clients hold a private labeled dataset, have access to a public unlabeled dataset and aim at training models collaboratively without sharing either data or model parameters. Regarding your references, [1] assumes a public labeled dataset, [2, 4] require clients to share model parameters with the server, and [2] in addition assumes that the unlabeled dataset is only accessible by the server, [3] uses co-regularization for personalized federated learning, so we naturally compare to the non-personalized variant distributed distillation, and we discussed [5] in our related work section. We apologize for not having sufficiently clarified this. We will highlight the problem statement in our manuscript and add a discussion of more distantly related federated semi-supervised learning methods (including the ones you mentioned) in the appendix.

W2. The motivation for our paper is to maximize privacy in a federated learning scenario, where clients have access to an unlabeled public dataset. Indeed, sharing soft labels preserves privacy to a larger extent than sharing model parameters, as our experiments show for the first time, yet they still reveal a lot about private data (VUL > 0.6). As discussed in the introduction, we wanted to go one step further by sharing hard labels. Indeed, this substantially improves privacy protection, with an empirically optimal privacy (VUL ≈ 0.5). Of course, this is only optimal wrt. known membership inference attacks, which is why we developed the differentially private variant of FedCT. What was surprising to us is that this strong improvement in privacy comes at virtually no loss in accuracy. That is, we achieve a test accuracy comparable to FedAvg and Distributed Distillation, the latter sharing soft labels. We conjecture that in practice the advantage of sharing soft labels over hard labels in terms of model performance is negligible, yet the additional privacy risk is significant. Moreover, sharing hard labels allows us to train interpretable models as well, and even use different models at each client (please see our general comment for further details).

W3. We selected the most prominent baselines for each type of method, i.e., vanilla FedAvg as a representative for model sharing in homogeneous data scenarios, DP-FedAvg as its private variant, and distributed distillation for the sharing of soft labels (or co-regularization). For this rebuttal, we additionally compared FedCT to PATE, a model distillation method that can be adapted to our scenario. The results in Table 1 indicate that PATE is outperformed by FedCT and all other baselines in terms of test accuracy. As discussed above, we cannot compare to the approaches you referenced, since they are not applicable in our scenario.

Dataset      | FedCT  | PATE
FashionMNIST | 0.7658 | 0.6451
CIFAR10      | 0.7608 | 0.7039
Pneumonia    | 0.7478 | 0.7208
MRI          | 0.6274 | 0.6038
SVHN         | 0.8805 | 0.8721
Table 1: Average test accuracy (ACC) of FedCT and PATE for m = 5

W4. The appendix is available as a separate document in the supplementary material. For convenience, we also included a full document (paper and appendix) in the supplementary material. If you have any trouble accessing the supplementary material, please let us know and we will upload it to our anonymized repository.

Comment

I appreciate the authors' response. The clarification regarding the distinctions between this paper and the related works I mentioned in the review (W1) partially addresses my concerns. Concerning W2, it would be convincing if the authors provided a direct comparison between sharing soft labels and hard labels, illustrating factors such as performance and privacy ability (I could not find this comparison in the paper; please point it out if I missed it). As the "main" contribution of this paper appears to involve the conversion from soft labels to hard labels, which may be deemed trivial from a technical standpoint, the necessity of this conversion should be clearly articulated and supported by possible theoretical and empirical evidence.

Comment

Dear reviewer e9mW,

We are happy we could address most of your concerns. Regarding your concern W2, we will gladly clarify the distinction between hard and soft label sharing in the manuscript. To reiterate: As mentioned in the introduction, sharing hard labels instead of soft labels improves privacy and allows us to train interpretable models. Since hard labels carry less information than soft labels, this could in principle lead to a decrease in model quality. In our empirical evaluation (particularly Table 1), we show that we cannot observe any significant decrease in model quality - both soft and hard label sharing perform comparably to model parameter sharing (FedAvg). Note that we use distributed distillation as the most prominent representative of soft label sharing. At the same time, our experiments support the claim that sharing hard labels improves privacy: soft label sharing has an average vulnerability of 0.61, whereas sharing hard labels has a substantially reduced vulnerability of 0.52. We summarize these differences in the following table. We will include a polished version of the table together with a paragraph explaining these details in the appendix.

Shared Information  | Model Quality IID | Model Quality Non-IID | Privacy | Interpretable Models
Model Parameters    | ++                | ++                    | -       | -
Soft Labels         | ++                | +                     | +       | -
Hard Labels (FedCT) | ++                | +                     | ++      | ++
Table 1: Comparison of parameter, soft label, and hard label sharing in federated learning.

Note that both soft and hard label sharing requires an unlabeled public dataset.

Review (Rating: 5)

In order to achieve privacy in a federated learning setting, this paper proposes the use of a form of federated co-training called FEDCT, where local hard labels on the public unlabeled datasets are shared and aggregated into a consensus label. This consensus label is then used in training each local model. The paper analyzes the convergence of the proposed FEDCT and further develops a private version of FEDCT based on the XOR mechanism. The paper compares the proposed method with two baseline methods on several datasets for both iid and non-iid settings.

Strengths

  1. This paper considers the privacy issue in training machine learning models, which is a very important problem.

  2. This paper employs the membership inference attack to practically demonstrate the model's ability to resist real-world attacks.

  3. This paper conducts the convergence analysis for the proposed algorithm FEDCT.

Weaknesses

  1. This paper does not discuss the difference between the proposed method and PATE (Papernot et al., 2016). In PATE, each teacher model can be viewed as a local model, and a majority vote is also utilized to reach a consensus for learning from each other's knowledge. The key distinction is that PATE transfers this knowledge to a student model, while the method in this paper circulates the knowledge back to each teacher/local model. Without further discussion, it's challenging to ascertain whether the contribution of this paper is incremental when compared to PATE and DD (Bistritz et al., 2020).

  2. The experimental results are not promising. As depicted in Table 1, only on 1 out of 5 datasets does the proposed method outperform the baselines in terms of ACC, and on none of the datasets does the proposed method achieve the best privacy-utility trade-off, that is, the best ACC while at the same time having the best VUL compared to the baselines.

  3. This paper lacks clarity on the algorithm's scalability in relation to the number of clients, as it only reports the impact of the number of clients on one dataset, and the default setting of m (the number of clients) for the other datasets is only 5.

Questions

  1. How does FEDCT compare with PATE in terms of the underlying mechanism design?
  2. Besides differentially private distributed SGD (Xiao et al., 2022), are there any other related works that can be utilized as baselines for comparison with DP-FEDCT on the Privacy-Utility Trade-Off with Differential Privacy?
Comment

Dear reviewer NCnb,

Thank you for your detailed evaluation of our work, in particular that you highlighted the importance of the problem we tackle, our theoretical analysis, and privacy evaluation. We will address your questions point by point.

Questions:

(1) PATE is essentially a distillation algorithm with the goal of producing a single student model from an ensemble of teachers. To protect the private dataset the teachers have been trained upon, a Laplace mechanism is applied to their predictions - more precisely, their counts - before determining the consensus label. Thereby, a curious student cannot make inferences about the private training data. In both PATE and FedCT, hard labels are used to form a consensus. The fundamental differences are that (i) PATE is not a collaborative training algorithm and that (ii) PATE therefore is not concerned with protecting against an honest-but-curious server. Regarding point (i): In a nutshell, PATE trains all teachers until convergence once, uses their predictions to pseudo-label the public unlabeled data, and then trains the student on this pseudo-labeled dataset. Thus, teachers never improve. This fundamentally differs from FedCT where clients train models collaboratively by iteratively sharing hard labels to improve the pseudo-labeled dataset, which in turn improves local models, which again improves the pseudo-labeling. This difference also explains point (ii): In PATE, the student is separated from the teachers, so that the main privacy concern is protecting against a curious student. In FedCT, instead, all clients can be honest-but-curious, but their data needs to be protected not only from other clients, but also from an honest-but-curious server. Therefore, we cannot apply the Laplace mechanism that PATE uses, but instead have to protect the hard labels, which we achieve via the XOR mechanism.
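To illustrate point (i), here is a minimal sketch of such a single-round, PATE-style pipeline in which teachers never improve; the model interfaces and helper names are placeholders, not the code of the adapted baseline reported below.

```python
import numpy as np

def majority_vote(votes: list[np.ndarray]) -> np.ndarray:
    """Per-example majority label over the teachers' hard predictions."""
    stacked = np.stack(votes)  # shape: (n_teachers, n_examples)
    return np.apply_along_axis(
        lambda col: np.bincount(col.astype(int)).argmax(), 0, stacked)

def pate_style_baseline(clients, X_public, student):
    """Single-round distillation sketch: teachers are trained once and never improve.

    clients  : list of (model, (X_priv, y_priv)) pairs -- the teachers.
    X_public : public unlabeled examples visible to everyone.
    student  : the single model trained on the pseudo-labeled public data.
    """
    # 1. Each teacher is trained to convergence on its private data, once.
    for model, (X_priv, y_priv) in clients:
        model.fit(X_priv, y_priv)
    # 2. Teachers vote on the public data; the majority label becomes the pseudo-label.
    consensus = majority_vote([model.predict(X_public) for model, _ in clients])
    # 3. Only the separate student learns from the consensus; the teachers
    #    are discarded and receive no feedback.
    student.fit(X_public, consensus)
    return student
```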

(2) The privacy of distributed SGD is arguably lower than that of FedAvg, since sharing gradients allows not only for membership inference attacks, but even for data reconstruction attacks [2]. At the same time, FedAvg performs similarly to distributed SGD in the literature [3,4]. Therefore, we have used DP-FedAvg as a baseline in our experiments. Differential privacy mechanisms are in principle compatible with most federated learning approaches, including those designed for personalized federated learning [5,6]. It would be great to extensively evaluate FedCT in those scenarios, but this goes beyond the scope of this work.

Weaknesses:

(1) We have discussed PATE in our related work section under semi-supervised learning, since it performs distillation via multiple teachers, instead of collaborative learning. Consequently, PATE is concerned with protecting the privacy of consensus labels, not of teachers’ predictions. Distributed distillation, instead, collaboratively trains models and we naturally compared to it - in both distributed distillation and federated co-training, clients act both as teachers and students. Since PATE shares hard predictions of teachers, we can adapt it by letting each client be a teacher. Similar to the experiments reported in [1], each client trains a model to convergence and then makes predictions on the public unlabeled dataset. A consensus is formed from all predictions and used to pseudo-label the unlabeled dataset. A single student model is then trained on the pseudo-labeled dataset. We report the results with m = 5 clients in Table 1.

Dataset      | FedCT  | PATE
FashionMNIST | 0.7658 | 0.6451
CIFAR10      | 0.7608 | 0.7039
Pneumonia    | 0.7478 | 0.7208
MRI          | 0.6274 | 0.6038
SVHN         | 0.8805 | 0.8721
Table 1: Average test accuracy (ACC) of FedCT and PATE for m = 5

Note that in the original experiments, a much larger number of clients is required. Therefore, we also report results for m = 100 clients for two datasets in Table 2.

Dataset      | FedCT  | PATE
FashionMNIST | 0.7154 | 0.6318
Pneumonia    | 0.7269 | 0.6903
Table 2: Average test accuracy (ACC) of FedCT and PATE for m = 100

The results show that performing just a single round of training yields substantially worse results than collaborative training (both distributed distillation and our proposed FedCT). Therefore, the privacy-utility trade-off of PATE is substantially worse: it achieves lower test accuracy while its privacy mechanism does not protect against an honest-but-curious server.

Comment

(2) We would like to clarify that the goal of FedCT is not to outperform federated learning in terms of test accuracy, but to improve privacy. Naturally, any gain in privacy will typically come at a loss in accuracy. Therefore, our results are actually quite astonishing. We can achieve a test accuracy comparable to FedAvg with greatly improved privacy. That is, the ability of an attacker to distinguish a private training example from an arbitrary one is as good as random guessing (VUL = 0.5). An amazing side-effect of FedCT is that sharing hard labels not only improves privacy, but also allows us to train interpretable models, like decision trees and rule ensembles. This is not possible with FedAvg or sharing of soft labels, like in distributed distillation.

(3) We performed experiments for m=5 clients, which is typical in cross-silo scenarios that are ubiquitous in healthcare. We empirically evaluated the scalability in the number of clients on FashionMNIST in Figure 5 of our manuscript. Following your suggestion, we performed the same evaluation for the Pneumonia dataset as well and report the results in Figure 1. Similar to FashionMNIST, FedCT scales very well with the number of clients.

Figure 1: Test accuracy (ACC) of FEDCT and FEDAVG (FL) on Pneumonia with |U| = 200 for various numbers of clients m.

References:

[1] Papernot, Nicolas, et al. "Semi-supervised knowledge transfer for deep learning from private training data." arXiv preprint arXiv:1610.05755 (2016).

[2] Zhu, Ligeng, Zhijian Liu, and Song Han. "Deep leakage from gradients." Advances in neural information processing systems 32 (2019).

[3] McMahan, Brendan, et al. "Communication-efficient learning of deep networks from decentralized data." Artificial intelligence and statistics. PMLR, 2017.

[4] Mills, Jed, Jia Hu, and Geyong Min. "Faster Federated Learning with Decaying Number of Local SGD Steps." IEEE Transactions on Parallel and Distributed Systems (2023).

[5] Mansour, Yishay, et al. "Three approaches for personalization with applications to federated learning." arXiv preprint arXiv:2002.10619 (2020).

[6] T Dinh, Canh, Nguyen Tran, and Josh Nguyen. "Personalized federated learning with moreau envelopes." Advances in Neural Information Processing Systems 33 (2020): 21394-21405.

Comment

We thank the reviewers for their encouraging feedback and thorough evaluation. We are happy that our motivation, theoretical analysis, and breadth of experiments were appreciated and that the problem we tackle is considered highly relevant. We want to emphasize some points in this general comment addressing all the reviewers.

Motivation: The main motivation for our paper is to improve the privacy of current federated learning approaches while maintaining model quality. For that, we consider a classical FL scenario where clients hold a private local dataset. We additionally assume that they have access to a public unlabeled dataset. To improve privacy over existing methods, FedCT shares hard labels instead of model parameters or soft labels. Our empirical evaluation shows that sharing hard labels indeed improves privacy substantially over both model parameter and soft label sharing. Surprisingly, this comes at virtually no loss in model quality, since FedCT achieves a test accuracy similar to both model parameter and soft label sharing.
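To make the protocol concrete, the following is a minimal sketch of one FedCT-style round as summarized above: clients share hard labels on the public data, the server forms a majority-vote consensus, and each client retrains on its private data augmented with the pseudo-labeled public data. All names are illustrative, and details such as the consensus rule and stopping criterion follow this informal description rather than the actual implementation.

```python
import numpy as np

def fedct_round(clients, X_public):
    """One communication round of hard-label federated co-training (sketch).

    clients  : list of (model, X_priv, y_priv) triples; models expose fit/predict.
    X_public : shared public unlabeled dataset.
    """
    # Clients share only hard labels on the public data.
    hard_labels = np.stack([model.predict(X_public) for model, _, _ in clients])

    # The server aggregates them into a consensus pseudo-labeling (majority vote).
    consensus = np.apply_along_axis(
        lambda col: np.bincount(col.astype(int)).argmax(), 0, hard_labels)

    # Each client retrains locally on its private data plus the pseudo-labeled
    # public data; model parameters and soft labels never leave the client.
    for model, X_priv, y_priv in clients:
        model.fit(np.concatenate([X_priv, X_public]),
                  np.concatenate([y_priv, consensus]))
    return consensus
```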

Privacy Guarantee: While the improvement in empirical privacy is substantial, our empirical measurement can only consider known privacy attacks. To provide a privacy guarantee, we employ the differential privacy framework, but cannot use the common Gaussian or Laplace mechanism, since the shared information is binary. Therefore, we use the XOR mechanism. Unlike for the Gaussian mechanism for parameter sharing, which uses clipping, there exists no simple sensitivity bound. Therefore, we derive a novel bound on the sensitivity of classifiers that are on-average-leave-one-out stable (which includes a wide range of machine learning methods) in Proposition 2.
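As a rough illustration of randomizing binary labels before sharing, the sketch below XORs each bit of a one-hot label matrix with an independent Bernoulli noise bit. This is a deliberate simplification: the XOR mechanism used in the paper draws noise from a matrix-valued Bernoulli distribution calibrated via the sensitivity bound of Proposition 2, and the flip probability below is purely illustrative, not a calibrated privacy parameter.

```python
import numpy as np

def randomize_binary_labels(one_hot: np.ndarray, flip_prob: float,
                            rng: np.random.Generator) -> np.ndarray:
    """XOR a binary label matrix with independent Bernoulli(flip_prob) noise bits.

    one_hot   : (n_examples, n_classes) binary matrix of hard labels.
    flip_prob : bit-flip probability; in a real deployment this would be
                calibrated from the privacy budget and the sensitivity bound.
    """
    noise = rng.random(one_hot.shape) < flip_prob  # independent noise bits
    return np.logical_xor(one_hot.astype(bool), noise).astype(one_hot.dtype)
```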

Interpretable Models: While improving privacy is our main motivation for sharing hard labels, it has an additional benefit: we can use virtually any supervised learning method locally, including interpretable models, such as decision trees, rule ensembles, random forests, and gradient-boosted decision trees (XGBoost). In Table 2 in our manuscript we show that FedCT achieves a test accuracy close to a model trained on centralized data for 4 interpretable models and 5 benchmark datasets.

Mixed Model Types: As mentioned before, there is essentially no restriction on the type of model that can be used at each client. This also means we can use different models on different clients. For this rebuttal, we conducted an experiment where each client uses a different model. We report the results in the table below with decision trees (DT), random forests (RF), rule ensembles (RuleFit), gradient-boosted decision trees (XGBoost), and neural networks (MLP). These preliminary results show that using a diverse ensemble of models can further improve accuracy. It will be exciting to study this phenomenon in detail in the future.

Dataset      | C1      | C2      | C3      | C4      | C5      | ACC
BreastCancer | DT      | RF      | RuleFit | XGBoost | RF      | 0.95
BreastCancer | DT      | MLP     | RuleFit | XGBoost | RF      | 0.93
BreastCancer | XGBoost | XGBoost | XGBoost | XGBoost | XGBoost | 0.94
Comment

We thank the reviewers for their engagement and the valuable discussion. We would like to summarize the feedback and our respective changes and additional experiments. Reviewers agreed that the problem of improving privacy in federated learning is both interesting and highly relevant. Reviewers highlighted our theoretical convergence and privacy analysis and the practical privacy measurement in our experiment.

A major misunderstanding was the applicability of existing federated semi-supervised learning methods to our scenario. While we discussed closely related work in the submitted manuscript, we did not make clear why more distantly related works are not applicable in our scenario. We addressed this issue by adding a section discussing more distantly related semi-supervised approaches to the appendix. We also added an experiment comparing FedCT to the knowledge distillation approach PATE, showing that collaborative training via FedCT or distributed distillation substantially outperforms this classical distillation approach.

Another point of feedback was that our investigation of heterogeneous data was limited. To address this issue, we added an experiment with stronger heterogeneity, showing that also in this case soft- and hard-label sharing performs on par with model parameters sharing (FedAvg). We conjecture that as long as clients achieve a minimum performance on all data, a meaningful consensus can be formed and label sharing (both hard and soft labels) works well. To test this conjecture, we added another experiment where we investigate the performance in a pathological distribution (i.e., all data drawn with α = 0.01 and clients only observe a small subset of labels) in which the performance of all methods decreases, but label sharing approaches (both DD and FedCT) perform substantially worse than model parameters sharing (FedAvg), supporting our conjecture.

We added another experiment comparing FedCT to the distantly related method FedMD in a different heterogeneous scenario on CIFAR100. We initially did not include FedMD, since it requires a labeled public dataset and we assume an unlabeled one. In this additional experiment, data is heterogeneously distributed by class label, but homogeneously by superclass label. Here, FedCT slightly outperforms FedMD, despite FedMD being pre-trained on a labeled public dataset.

To highlight the flexibility of FedCT to train arbitrary models, we added an experiment where we train diverse ensembles of models with each client training a different model type. The results indicate that such ensembles can even improve performance over using a single model type.

To clarify the differences between model parameters sharing (classical federated learning, e.g., FedAvg), soft label sharing (e.g., distributed distillation), and hard label sharing (FedCT), we added a section in the appendix that elaborates the differences.

We will also make the following minor revisions as suggested by the reviewers. We will include a brief discussion of the differences between model parameters, soft- and hard-labels sharing in the main text, improve clarity in the introduction, and include a brief discussion on the rationale for using both theoretical privacy guarantees and a practical privacy measure in our paper.

AC Meta-Review

The paper presents a method for cross-silo federated learning where clients share information through predicting and aggregating labels of an unlabeled public data set.

Strengths: relevant problem, interesting approach, promising experimental results.

Weaknesses: while the paper presents interesting results, it leaves an uneasy feeling that the reader does not get a full picture of what is going on. I would suspect this to be behind the critical views of the reviewers, even though the reviews themselves cannot really verbalise this.

The paper seems very focused on providing evidence in support of the proposed method, rather than understanding of when and why it works. This raises many questions, including:

  1. The fact that co-training with very strong DP works so well suggests that label sharing has limited impact - given the small number of clients, it seems likely that even independent local training would be a very strong baseline.
  2. The privacy model is explained superficially. It is not at all clear what part of the training is targeted with the vulnerability assessment - the final models or the intermediate communications. What is the specific attack being used? ML Privacy Meter implements several. How are the evaluations from different models combined?
  3. Unlike FedAvg and other standard FL algorithms which output one global model, the proposed method outputs many models. It was not immediately clear from the paper how these are used in the evaluation. Do they converge to the same solution empirically?
  4. The experiments should be extended to cover more diverse set of settings, including more clients with less data per client, and comparing against independent local training.

Why not a higher score

I believe that while the paper presents interesting results for a relevant problem, it is lacking in providing a transparent explanation for these results and a thorough evaluation of the limitations of the method.

Why not a lower score

N/A

Final Decision

Reject