Noise-Aware Algorithm for Heterogeneous Differentially Private Federated Learning
We propose a noise-aware aggregation strategy for heterogeneous DPFL systems with untrusted servers to improve utility and convergence speed.
Abstract
Reviews and Discussion
This paper proposes a novel aggregation method, Robust-HDP, for local differential privacy (LDP) in federated learning (FL) to attain better utility and convergence with heterogeneous clients. More specifically, Robust-HDP can handle different privacy requirements on different clients and assign the best aggregation weights to clients whose updates carry different levels of LDP noise. Theoretical and empirical results verify the effectiveness of Robust-HDP.
Strengths
This work tries to address a practical and important issue in LDP that heterogeneous hardware and privacy requirements may significantly impact the utility and convergence of federated learning. The proposed Robust-HDP can handle the different noise levels in the uploaded models from different clients and assign the optimal weights to these models. The advantages are:
- Robust-HDP can evaluate the noise level via robust PCA without accessing each client's privacy requirements, which leads to better privacy protection. The experimental results show that this evaluation is accurate.
- Privacy and convergence are guaranteed by the theoretical analysis, while extensive experiments reveal the effectiveness of Robust-HDP.
- The paper is organized in a logical way, justifying each design choice in the methodology with both theoretical analysis and empirical evaluation. This makes the paper comprehensible and solid.
Weaknesses
Some possible points to further improve this paper are:
- A client with a larger dataset may end up with higher noise in its update. In this case, the client may be assigned a smaller weight, which may violate fairness requirements and lead to a biased global model. Though this paper mainly focuses on privacy, fairness issues could be discussed.
- The experimental results shown in the body are mainly on MNIST and FMNIST. Since CIFAR10 is more difficult and closer to a realistic setting, it would be preferable to show the ablation study results on CIFAR10 in the body. Also, the CIFAR10 results seem to be missing from the Appendix.
Questions
- I do not see how the data heterogeneity is simulated in the experiments. Could you provide more information about this part?
- Since the server can find the noise added to each local model update (via the extracted noise component), it can eliminate the noise and obtain a noise-free local model update. Does this violate the privacy requirements of LDP?
Thank you for your feedback and comments about our work. We are happy to answer your comments and questions:
Comment1: A client with a larger dataset may end up with higher noise.
A1: The equation for the noise variance can be found in eqs. 7, 5 and 3, where the dataset size appears twice with opposite effects. Hence the noise variance varies very slowly with the dataset size (we have experimentally confirmed this). This means that the noise variance of a client is mainly determined by its privacy parameter $\epsilon_i$ and batch size $b_i$. So if two clients have similar $\epsilon_i$ and $b_i$, the aggregation weights assigned to them by Robust-HDP are very close, even if their dataset sizes differ. In contrast, DPFedAvg assigns lower weights to clients with smaller dataset sizes. As all clients have the same privacy parameter in this case, it makes sense to assign larger weights to the clients with larger datasets (as DPFedAvg does) to improve utility. But what happens to fairness in the system? It has been shown that the accuracy of DP models drops much more for underrepresented subgroups [1]. Hence, in some scenarios, we expect Robust-HDP to improve fairness for clients with smaller datasets. For instance, the following results evaluate utility and fairness when DPFL is homogeneous and only the dataset size varies across clients:
[1]: E. Bagdasaryan et al., "Differential Privacy Has Disparate Impact on Model Accuracy", arXiv 2019.
- utility (average test accuracy); note that PFA and WeiAvg are equivalent when the privacy budget is uniform:

| Algorithm | Setting 1 | Setting 2 | Setting 3 | Setting 4 | Setting 5 |
|---|---|---|---|---|---|
| WeiAvg and PFA | 37.24 | 34.90 | 27.80 | 23.22 | 19.01 |
| DPFedAvg and minimum | 38.52 | 32.78 | 30.42 | 23.50 | 20.26 |
| Robust-HDP | 37.68 | 32.54 | 27.51 | 22.58 | 20.52 |
- performance fairness, measured by the std of clients' test accuracies, with the test accuracy of the client with the smallest dataset in parentheses:

| Algorithm | Setting 1 | Setting 2 | Setting 3 | Setting 4 | Setting 5 |
|---|---|---|---|---|---|
| WeiAvg and PFA | 4.24 (32.95) | 4.11 (30.68) | 4.95 (25.01) | 3.89 (27.84) | 5.70 (17.04) |
| DPFedAvg and minimum | 4.92 (34.69) | 4.71 (29.54) | 4.95 (25.41) | 6.78 (14.77) | 6.86 (11.36) |
| Robust-HDP | 4.77 (34.09) | 4.59 (34.91) | 4.66 (26.13) | 6.01 (15.34) | 3.89 (18.97) |
We observe that when the privacy budgets and batch sizes are homogeneous and the dataset sizes are heterogeneous, Robust-HDP results in more fairness in the system, as it assigns more balanced weights to clients in this case.
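To make the weighting concrete: if each client's update is an unbiased estimate corrupted by independent noise of variance $\sigma_i^2$, the variance-minimizing aggregation weights are proportional to $1/\sigma_i^2$. Below is a minimal sketch of this standard inverse-variance rule (our own illustration under that assumption; the function name and numbers are not from the paper's code):

```python
import numpy as np

def aggregation_weights(sigma2):
    """Variance-minimizing weights: w_i proportional to 1/sigma_i^2, summing to 1."""
    inv = 1.0 / np.asarray(sigma2, dtype=float)
    return inv / inv.sum()

# Clients with similar epsilon_i and b_i have similar noise variances,
# so they receive nearly uniform weights regardless of dataset size.
print(aggregation_weights([2.0, 2.1, 1.9, 2.0]))   # ~[0.25, 0.24, 0.26, 0.25]
print(aggregation_weights([0.5, 2.0, 8.0, 32.0]))  # heavily favors low-noise clients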
Comment2: The experimental results shown in the body are mainly on MNIST and FMNIST.
A2: Following your comment and those of other reviewers, we have added multiple experimental results on CIFAR10, including with an even larger model (ResNet-34), to evaluate Robust-HDP in more complex cases. The experimental results still show the superiority of Robust-HDP. Please see our answer A2 to reviewer fjuK for the results in these more complex settings.
Please continue to our second official comment below:
Comment3: Also, the CIFAR10 results seem to be missing from the Appendix.
A3: By mistake, we did not include the table of detailed CIFAR10 results in the appendix at submission time. We include it here:
| Algorithm | Dist1 | Dist2 | Dist3 | Dist4 | Dist5 | Dist6 | Dist7 | Dist8 | Dist9 |
|---|---|---|---|---|---|---|---|---|---|
| WeiAvg | 31.18 | 31.18 | 29.65 | 27.74 | 24.25 | 19.91 | 21.93 | 18.91 | 20.64 |
| PFA | 26.91 | 32.68 | 25.19 | 29.21 | 21.63 | 18.93 | 20.63 | 16.27 | 15.75 |
| DPFedAvg | 31.51 | 21.51 | 22.28 | 20.50 | 21.25 | 15.19 | 18.27 | 16.45 | 18.63 |
| minimum | 26.20 | 16.71 | 16.45 | 15.86 | 14.23 | 10.51 | 13.35 | 13.32 | 14.11 |
| Robust-HDP | 31.97 | 31.70 | 32.00 | 30.60 | 24.86 | 23.61 | 24.10 | 19.02 | 22.05 |
Q1: I do not see how the data heterogeneity is simulated in the experiments.
A4: In section A.1 of the appendix (DATASETS AND MODELS), we explain in detail how we distribute data across clients. We first split each class into several shards. Then, depending on the desired level of data heterogeneity, we set the maximum number of classes each client can hold samples from. This results in a uniform dataset size across clients. In cases where we want quantity shift, we instead use Dirichlet allocation. A sketch of the shard-based partition is given below.
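For illustration, here is a hypothetical sketch of such a shard-based partition. This is our reconstruction of the described procedure, not the authors' script; the names (`shard_partition`, `max_classes_per_client`) are ours:

```python
import numpy as np

def shard_partition(labels, n_clients=20, max_classes_per_client=8, seed=0):
    """Assign each client samples from at most `max_classes_per_client` classes
    by splitting every class into shards and dealing them out to its owners."""
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    # Each client draws the set of classes it may hold samples from.
    client_classes = [set(rng.choice(classes, size=max_classes_per_client,
                                     replace=False)) for _ in range(n_clients)]
    client_indices = [[] for _ in range(n_clients)]
    for c in classes:
        idx = rng.permutation(np.where(labels == c)[0])
        owners = [i for i in range(n_clients) if c in client_classes[i]]
        # Classes that no client drew are simply left out in this toy version.
        for owner, shard in zip(owners, np.array_split(idx, max(len(owners), 1))):
            client_indices[owner].extend(shard.tolist())
    return client_indices
```

For quantity shift, one would instead draw per-client sample counts from a Dirichlet distribution, as mentioned above.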
Q2: The server can eliminate the noise from the model and get a noise-free local model update.
A5: Based on the post-processing property of DP, it is always safe to perform arbitrary computations on the output of a differentially private mechanism, so it is fine to use the extracted noise matrix at aggregation time. However, our results show that using the denoised (low-rank) updates is not better than using the noisy updates weighted with the aggregation weights produced by Robust-HDP.
This paper studies federated learning with differential privacy guarantees. In particular, it focuses on the challenging problem where each client has a different privacy budget, which it may not want to share. The Robust-HDP algorithm is proposed, which features robust PCA in the aggregation step. Both theoretical and empirical evidence are provided.
Strengths
The problem under consideration is of great importance and well motivated in the paper. The writing is clear and empirical results are promising.
Weaknesses
In my view, the main novelty of this paper is the use of RPCA in the aggregation step. Its main purpose is to allow the (untrusted) server to estimate the optimal weights without knowledge of the clients' privacy budgets. However, I have a few concerns:
- The result stated in Lemma 1 seems to be independent of the use of RPCA. The bound in Lemma 1 only depends on the two approximation parameters. Does the use of RPCA lead to explicit forms of these parameters?
- I can understand that the update matrix can be decomposed into a low-rank signal component and a noise component, but why should one believe that the noise component is sparse?
- The authors also mention that the noise variance is large. Could the authors further clarify this point? I don't quite understand the sentence in the parentheses ("it is the noise power..."), and the corresponding terms in (3) and (5) look the same to me, at least when the dimension is large.
Questions
In addition to the above, I believe a definition of Var is needed, as I don't see how it maps a vector to a scalar (e.g. in (3) and (5)).
Thank you for your feedback and comments about our work. We are happy to answer your comments and questions:
Comment1: The result stated in Lemma 1 seems to be independent of the use of RPCA.
A1: Lemma 1 says that if RPCA can extract the noise matrix up to a multiplicative factor, then Robust-HDP can estimate the clients' noise variances up to a corresponding factor; in the ideal case both factors equal 1. Based on the theoretical results in the RPCA paper, as the rank of the signal matrix decreases and the additive noise matrix becomes sparser (we explain below why this is the case here), we get closer to the ideal case of extracting the noise matrix exactly. Therefore, the two parameters in the bound depend on the given matrix and on the RPCA algorithm. On the other hand, for our purpose of estimating the optimum weights, estimating the noise variances up to a common factor is sufficient (so for our goal, the factor does not need to be exactly 1). For instance, for MNIST and Dist8, we can see from Figure 4 that the ratio of estimated to true noise variance is very close to 1 for all 20 clients in the system. This means that for MNIST, where the update matrix has columns of dimension 28,938 (the number of model parameters), RPCA estimates the noise matrix with good precision, which consequently makes the variance estimates very close to their true values.
Now consider the case where the model dimension is much larger, e.g., ResNet18 (11,181,642 parameters) or ResNet34 (roughly 21M parameters). Running RPCA on the full update matrix then not only takes a long time, but may also result in lower precision in extracting the noise matrix. Hence, we proposed our approximation approach, which relies on two facts: 1. in our heterogeneous DPFL problem, all the noise components within a client's update are iid, because they were sampled and added by the same client during its local DP training; 2. we want to estimate the total noise variance of each client's update, which is the sum of the variances of these iid components. As the error of a sample-variance estimate decreases inversely proportional to the number of samples, even if we estimate the per-coordinate noise variance by running RPCA on a row-subsampled submatrix of the update matrix, we still obtain an approximation of the total noise variance up to a factor, which is sufficient for Robust-HDP to recover the optimum weights. Our multiple improved experimental results on CIFAR10 show that this approximation is indeed effective.
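A minimal sketch of this approximation idea (our own illustration, not the paper's implementation; we assume an `rpca` routine, e.g. principal component pursuit, returning the low-rank and residual parts):

```python
import numpy as np

def estimate_noise_variances(M, rpca, n_rows=10_000, seed=0):
    """Estimate each client's total noise variance from the update matrix M
    (one column per client, p rows) using RPCA on a random row-submatrix.

    `rpca` is an assumed helper with signature rpca(A) -> (L, N), A = L + N.
    """
    p = M.shape[0]
    rng = np.random.default_rng(seed)
    rows = rng.choice(p, size=min(n_rows, p), replace=False)
    _, N_hat = rpca(M[rows])                 # extract the noise component
    # Noise coordinates are iid within a column, so the per-coordinate sample
    # variance on the submatrix, rescaled by p, approximates the total noise
    # variance of the full update.
    sigma2_hat = N_hat.var(axis=0) * p
    w = 1.0 / sigma2_hat
    return sigma2_hat, w / w.sum()           # variances and aggregation weights
```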
Comment2: I can understand that the matrix can be decomposed into a low-rank signal component and a noise component, but why should one believe that the noise component is sparse?
A2: As shown in Figure 1, the noise variance of a client's update, i.e., its total noise power, heavily depends on the privacy parameter $\epsilon_i$ and batch size $b_i$. When both $\epsilon_i$ and $b_i$ are small, it grows fast; in the other regimes it takes relatively small values. The heterogeneity in clients' privacy parameters and batch sizes therefore creates large variation of the noise variance across clients, producing patterns similar to those observed in Figure 4 or Figure 10 (in the appendix). Only a few of the existing clients are likely to sit in the peak corner of Figure 1, so the pattern of noise variances is sparse: a few columns of the noise matrix have large norms and the rest have small norms. Since the elements of the noise matrix are iid with mean 0 (they are noise elements), the noise matrix itself can be thought of as sparse; more precisely, most of its columns hold elements of small magnitude while a few columns hold elements of relatively large magnitude (again, see the pattern of the norms of the 20 columns in Fig. 4).
Comment3: The authors also mention that the noise variance is large.
A3: Based on the definition above, this quantity is the noise variance in the whole model update of dimension $p$. It can be computed from eq. 7, which builds on either eq. 3 or eq. 5. In either case, the terms in equations 3 and 5 are large, mainly because $p$, the number of model parameters, is large (e.g., 11,181,642 for CIFAR10 or 28,938 for MNIST/FMNIST), while the clipping threshold and the DP noise scale are constants around 1. To get an idea of the resulting magnitudes, see the values in Figure 4 (left), which shows the true noise variance in each of the 20 clients' model updates.
Q1: I believe a definition of Var is needed, as I don't see how it maps a vector to a scalar (e.g. in (3) and (5)).
A4: For a vector $v = (v_1, \ldots, v_p)$ of dimension $p$ we have $\mathrm{Var}(v) = \sum_{j=1}^{p} (v_j - \bar{v})^2$, where $\bar{v} = \frac{1}{p}\sum_{j=1}^{p} v_j$. So it captures the total deviation of the coordinates of $v$ from their mean. Note that when the coordinates $v_j$ are iid and $p$ increases, $\mathrm{Var}(v)$ also increases.
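As a quick numerical illustration of the last point (a toy example of ours, with standard-normal noise coordinates):

```python
import numpy as np

rng = np.random.default_rng(0)
for p in (10**3, 10**5, 10**7):
    v = rng.normal(0.0, 1.0, size=p)        # iid noise coordinates
    var_v = np.sum((v - v.mean()) ** 2)     # Var(v) as defined above
    print(p, round(var_v))                  # grows roughly linearly in p
```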
This paper is about using DP in conjunction with FL (DPFL). The key insight is that when we have heterogeneity in DP (via different privacy requirements) and heterogeneity in FL (different types of devices with different amounts of memory), approaches that deal with only one of these types of heterogeneity are sub-optimal when both occur simultaneously.
This paper introduces a new algorithm that considers both of these concerns simultaneously, using a robust PCA like approach.
Strengths
- Identifies a problem that is important to solve.
- Presents an interesting solution.
- The simulation results are convincing, albeit only for small models.
Weaknesses
- Presentation is poor, with quite a few formatting errors --- please address these.
- I don't really see anything too meaningful in the theoretical results. The idea that the procedure converges is good, but the result is too much of a mess to really interpret. Can this be cleaned up?
- There are no simulation results where both types of heterogeneity are switched off (i.e., homogeneous settings). I would like to see these, and see whether your algorithm is outperformed in those cases.
- Simulations are on MNIST- and CIFAR10-size datasets/models. Time and time again, we have seen that insights at this scale are not generalizable. I would like to see at least one example on a bigger model, even if it is just fine-tuning.
Questions
- An interesting question is incentives. Does your approach lead users to lie about their memory constraints to get free extra privacy? It seems to me that this might be the case. Some suggested references for this include:
[1] Fallah, Alireza, et al. "Optimal and differentially private data acquisition: Central and local mechanisms." Operations Research (2023).
[2] Donahue, Kate, and Jon Kleinberg. "Model-sharing games: Analyzing federated learning under voluntary participation." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. No. 6. 2021.
[3] Kang, Justin, Ramtin Pedarsani, and Kannan Ramchandran. "The Fair Value of Data Under Heterogeneous Privacy Constraints." arXiv preprint arXiv:2301.13336 (2023).
[4] Karimireddy, Sai Praneeth, Wenshuo Guo, and Michael I. Jordan. "Mechanisms that incentivize data sharing in federated learning." arXiv preprint arXiv:2207.04557 (2022).
I would also like to see some additional simulations:
- Simulations where the privacy budgets and batch sizes are the same for all clients, to see if existing approaches win, and explain why.
- Larger-scale models in simulations.
Thank you for your feedback and comments about our work. We are happy to answer your comments and questions:
Q1: Does your approach lead users to lie about ...?
A1: The answer is no, because our algorithm does not operate based on clients' privacy parameters. Whether the server knows them or not does not affect Robust-HDP, since it operates by estimating the amount of noise in clients' model updates. However, WeiAvg/PFA do heavily depend on the server knowing the clients' privacy parameters. Due to this dependence, a client could lie about its privacy parameter and mislead a server running WeiAvg/PFA by pretending to be a less privacy-sensitive client, i.e., by reporting a larger $\epsilon$.
Q2: Simulations where the privacy budgets and batch sizes are uniform.
A2: We ran a set of experiments with different assumptions. Unlike before, where we assumed heterogeneous privacy budgets, heterogeneous batch sizes and uniform dataset sizes, we experimented with the following cases. As requested in your comments, all the experiments are run on CIFAR10 (instead of MNIST/FMNIST), which uses ResNet18 with 11,181,642 parameters:
Case 1: uniform batch size, heterogeneous privacy budgets and heterogeneous data. The results for a homogeneous batch size and heterogeneous privacy budgets on CIFAR10 still show the superiority of Robust-HDP:
| Algorithm | Dist1 | Dist2 | Dist3 | Dist4 | Dist5 | Dist6 | Dist7 | Dist8 | Dist9 |
|---|---|---|---|---|---|---|---|---|---|
| WeiAvg | 31.99 | 31.18 | 29.59 | 25.97 | 24.66 | 16.61 | 23.13 | 17.01 | 14.34 |
| PFA | 30.89 | 32.02 | 28.34 | 24.17 | 22.23 | 15.31 | 22.15 | 16.32 | 13.11 |
| DPFedAvg | 33.89 | 24.16 | 24.62 | 17.50 | 22.71 | 16.76 | 19.52 | 13.81 | 16.27 |
| Robust-HDP | 34.94 | 33.78 | 31.34 | 32.50 | 26.05 | 17.98 | 23.13 | 17.97 | 15.79 |
Case 2: heterogeneous batch sizes, uniform privacy budget (fixed to the mean of distributions 1, 3, 5 and 7), and heterogeneous dataset sizes (quantity shift):
| Algorithm | Setting 1 | Setting 2 | Setting 3 | Setting 4 | Setting 5 |
|---|---|---|---|---|---|
| WeiAvg and PFA | 35.86 | 33.50 | 29.21 | 24.49 | 18.40 |
| DPFedAvg and minimum | 37.00 | 32.89 | 29.32 | 23.06 | 19.14 |
| Robust-HDP | 37.45 | 34.93 | 29.78 | 23.15 | 19.54 |
Case 3: uniform privacy budget, uniform batch size and heterogeneous dataset sizes. This is regular homogeneous DPFL, for which DPFedAvg is designed. Based on equations 7, 5 and 3, the noise variance changes across clients only through their dataset sizes. However, the dataset size appears twice in eq. 7 with opposite effects, so the noise variance varies very slowly with it (we have experimentally confirmed this). Therefore, Robust-HDP (which solves problem 8) and WeiAvg (whose weights follow the uniform privacy budgets) assign essentially uniform aggregation weights to all clients, whereas DPFedAvg assigns larger weights to clients with larger datasets. As all clients have the same privacy budget in this case, it makes sense to assign larger weights to the clients with larger datasets (as DPFedAvg does) to improve utility. But what happens to fairness in the system? It has been shown that the accuracy of DP models drops much more for underrepresented subgroups [1]. Hence, in some scenarios, we expect Robust-HDP to improve fairness for clients with smaller datasets. The following results evaluate utility and fairness when DPFL is homogeneous and only the dataset size varies across clients. We report the std of clients' test accuracies and also the test accuracy of the client with the smallest dataset (in parentheses). A schematic comparison of the three weighting rules is sketched right after the reference below.
[1]: E. Bagdasaryan et al., "Differential Privacy Has Disparate Impact on Model Accuracy", arXiv 2019.
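To illustrate the contrast between the three weighting rules in case 3, here is a schematic comparison (our own toy example; the real algorithms involve more than these proportionalities, e.g., PFA's projection step):

```python
import numpy as np

def weights(rule, eps, n, sigma2):
    """Normalized aggregation weights under three schematic rules."""
    v = {"WeiAvg": eps,                       # proportional to privacy budget
         "DPFedAvg": n,                       # proportional to dataset size
         "Robust-HDP": 1.0 / np.asarray(sigma2)}[rule]
    v = np.asarray(v, dtype=float)
    return v / v.sum()

eps    = np.array([1.0, 1.0, 1.0, 1.0])       # uniform privacy budgets
n      = np.array([500, 1000, 2000, 4000])    # heterogeneous dataset sizes
sigma2 = np.array([1.00, 0.99, 1.01, 1.00])   # near-constant noise (case 3)
for rule in ("WeiAvg", "DPFedAvg", "Robust-HDP"):
    print(rule, weights(rule, eps, n, sigma2))
```

Only DPFedAvg's weights follow the dataset sizes here; WeiAvg and Robust-HDP both stay nearly uniform, which is why Robust-HDP behaves more fairly toward small-dataset clients.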
Please continue to our second official comment below:
- utility (average test accuracy); note that PFA and WeiAvg are equivalent when the privacy budget is uniform:

| Algorithm | Setting 1 | Setting 2 | Setting 3 | Setting 4 | Setting 5 |
|---|---|---|---|---|---|
| WeiAvg and PFA | 37.24 | 34.90 | 27.80 | 23.22 | 19.01 |
| DPFedAvg and minimum | 38.52 | 32.78 | 30.42 | 23.50 | 20.26 |
| Robust-HDP | 37.68 | 32.54 | 27.51 | 22.58 | 20.52 |
- performance fairness, measured by the std of clients' test accuracies, with the test accuracy of the client with the smallest dataset in parentheses:

| Algorithm | Setting 1 | Setting 2 | Setting 3 | Setting 4 | Setting 5 |
|---|---|---|---|---|---|
| WeiAvg and PFA | 4.24 (32.95) | 4.11 (30.68) | 4.95 (25.01) | 3.89 (27.84) | 5.70 (17.04) |
| DPFedAvg and minimum | 4.92 (34.69) | 4.71 (29.54) | 4.95 (25.41) | 6.78 (14.77) | 6.86 (11.36) |
| Robust-HDP | 4.77 (34.09) | 4.59 (34.91) | 4.66 (26.13) | 6.01 (15.34) | 3.89 (18.97) |
We observe that when the privacy budgets and batch sizes are homogeneous and the dataset sizes are heterogeneous, Robust-HDP results in more fairness in the system, as it assigns more balanced weights to clients in this case.
- Conclusion: Based on the three experimental cases above, when there is heterogeneity in either the privacy budgets or the batch sizes (or both), using Robust-HDP is beneficial, because in these cases the clients' noise variances are also heterogeneous (according to eqs. 7, 5 and 8). However, when both are uniform (i.e., regular homogeneous DPFL), the noise variance is almost fixed across clients, and it makes sense to use DPFedAvg to assign larger weights to clients with larger dataset sizes and improve utility.
Q3: Larger-scale models in simulations.
A3: We ran experiments on CIFAR10 with ResNet34 (roughly 21M parameters). Due to lack of time, we could run these experiments only with heterogeneous privacy budgets and heterogeneous batch sizes (privacy budgets sampled from Dist3), and for a limited number of rounds:
| Algorithm | Dist3 |
|---|---|
| WeiAvg | 18.26 |
| PFA | 15.49 |
| DPFedAvg | 15.47 |
| Robust-HDP | 22.67 |
The paper proposes a method for weighting client updates in differentially private (DP) federated learning (FL), where the clients can have differing local DP guarantees. The main idea is to have the server estimate the total noise level of each client update via robust PCA, and weight the clients in federated averaging accordingly, giving more weight to clients with less noisy updates. The authors then show experimentally that their weighting method improves model utility compared to existing alternatives in the presence of heterogeneous noise levels on the clients (differing DP guarantees, different subsampling fractions).
Strengths
- While there are existing works proposing to weight client updates according to heterogeneous privacy preferences, using estimated noise values directly instead of, e.g., simply epsilon values seems like a good idea.
- The authors show convergence results under some assumptions.
- The paper is mostly fairly easy to read.
- An effective way for optimising weights in fedavg to improve utility is an important research direction.
Weaknesses
- As the authors note, since the proposed method involves the server running robust PCA on the model updates, it can only scale via approximating the proposed method. It is not clear from the theory or from the provided empirical results how feasible the approximation approach actually is (in terms of approximation quality vs compute).
- E.g., end of Sec. 3.2: having the clients keep their noise addition mechanisms as well as epsilon budgets secret (which is one of the main assumptions used to motivate the mechanism-oblivious noise estimation method) is very much not standard in DP. I do not find the argument all that convincing: since the standard DP guarantees assume that the privacy budget is open information, I would very much think that a rational client would simply add enough noise to be comfortable with the amount of information it is sharing, instead of using less noise than they are actually okay with and trying to keep the privacy budget secret to get some unquantifiable amount of extra privacy.
- Parts of the presented theory, e.g., the long analysis of the connection between noise level and batch size, seem quite disconnected from the actually proposed method: as far as I can see, using robust PCA and then weighting the updates according to the estimated noise level does not rely on the batch-size analysis in any significant way, beyond getting some motivation for having differing noise levels on the clients.
Questions
- Sec. 1, on clients having to use small batch sizes due to memory constraints: I would think that using simple gradient accumulation alleviates the memory problem, and from Fig. 1 one could think that a quite modest increase in batch size suffices to reduce the noise level significantly. Is there some specific reason why this would not work for the case you consider? Framing the discussion around client memory limitations seems a bit weird.
- Sec. 3.3: "all the clients are training the same shared model on more or less similar data distributions". How sensitive is the rank of M to client heterogeneity?
- In eq. 3: any reasoning about when the final approximation is good?
- Please mention the neighbourhood relation explicitly in defining DP.
Minor comments/typos etc. (no need to comment)
- Sec. 4.2, RQ1 and the Fig. 2 caption have some broken references.
- Eq. (3): is Var the element-wise variance? Please mention this to avoid confusion; also, the notation changes somewhat confusingly between the text and the appendix, e.g. p vs d for the dimensionality.
Thank you for your feedback and comments.
Q1: framing the discussion around client memory limitations seems a bit weird.
A1: Even with gradient accumulation, the amount of DP noise in a client's update depends on the base batch size it uses: accumulation averages several noisy micro-batch gradients, each computed (clipped and noised) at the base batch size, whose DP noise depends on the base sampling ratio the client uses. Therefore, although the stochastic noise in the accumulated batch gradient decreases, the total DP noise induced in the model update of each client after its local epochs does not change compared to directly training with the base batch size (a toy sketch of this point follows the table below). Hence, heterogeneity in the base batch sizes results in heterogeneity in the clients' noise variances. This being said, consider the case when all clients use the same batch size but their privacy parameters are heterogeneous. Based on our analysis in eqs. 7, 5 and 3, this still results in heterogeneous noise variances, which makes WeiAvg a suboptimal solution. In contrast, Robust-HDP assigns aggregation weights based on the real noise variance in clients' model updates, instead of their privacy parameters (which might not even be shared with the server). Also, it does not matter to Robust-HDP whether the privacy parameters are known by the server or not, because it does not use them, so it gives clients more "freedom" not to share them. Results for homogeneous batch sizes and heterogeneous privacy budgets on CIFAR10 still show the superiority of Robust-HDP:
| Algorithm | Dist1 | Dist2 | Dist3 | Dist4 | Dist5 | Dist6 | Dist7 | Dist8 | Dist9 |
|---|---|---|---|---|---|---|---|---|---|
| WeiAvg | 31.99 | 31.18 | 29.59 | 25.97 | 24.66 | 16.61 | 23.13 | 17.01 | 14.34 |
| PFA | 30.89 | 32.02 | 28.34 | 24.17 | 22.23 | 15.31 | 22.15 | 16.32 | 13.11 |
| DPFedAvg | 33.89 | 24.16 | 24.62 | 17.50 | 22.71 | 16.76 | 19.52 | 13.81 | 16.27 |
| Robust-HDP | 34.94 | 33.78 | 31.34 | 32.50 | 26.05 | 17.98 | 23.13 | 17.97 | 15.79 |
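To illustrate the gradient-accumulation point above, here is a toy calculation (our own sketch; the per-micro-batch noise std is taken as $zc/b$ with illustrative constants, where $z$ is the noise scale, $c$ the clipping threshold, and $b$ the base batch size):

```python
import numpy as np

def total_dp_noise_std(n=10_000, b=32, epochs=2, z=1.0, c=1.0, accum=1):
    """Std of the total DP noise injected into the model update over local
    training. One noisy gradient is computed per micro-batch of size b;
    grouping `accum` micro-batches into one optimizer step changes neither
    the number of noise additions per epoch (n // b) nor their scale."""
    noisy_steps = (n // b) * epochs
    per_step_std = z * c / b
    return np.sqrt(noisy_steps) * per_step_std   # independent of `accum`

print(total_dp_noise_std(accum=1))   # no accumulation
print(total_dp_noise_std(accum=8))   # 8-way accumulation: same total DP noise
```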
Q2: How sensitive is the rank of M to client heterogeneity?
A2: We ran some experiments on MNIST where, instead of assigning all classes to all 20 clients, we limited the number of classes per client to at most 8, which results in more data heterogeneity. Note that RPCA only needs the matrix M to have "low rank" (not rank 1). So as long as the data heterogeneity across clients still yields a low-rank M, RPCA and consequently Robust-HDP can be used to improve utility, especially when the privacy budgets are small:
| Algorithm | Dist1 | Dist2 | Dist3 | Dist4 | Dist5 | Dist6 | Dist7 | Dist8 | Dist9 |
|---|---|---|---|---|---|---|---|---|---|
| WeiAvg | 90.24 | 83.43 | 89.34 | 88.28 | 87.43 | 77.86 | 81.45 | 76.73 | 79.72 |
| PFA | 82.98 | 88.69 | 82.14 | 85.67 | 81.07 | 77.44 | 71.90 | 65.43 | 69.14 |
| DPFedAvg | 88.45 | 81.76 | 80.72 | 80.04 | 79.17 | 75.30 | 82.60 | 70.82 | 78.39 |
| Robust-HDP | 87.66 | 88.60 | 86.8 | 88.60 | 83.20 | 80.19 | 83.05 | 78.40 | 81.14 |
Q3: In eq.3: any reasoning about when the final approximation is good?
A3: First, for a vector $v$ of dimension $p$ we have $\mathrm{Var}(v) = \sum_{j=1}^{p}(v_j - \bar{v})^2$, where $\bar{v}$ is the mean of the coordinates. From the values we had in our experiments, the first term in eq. 3 is less than 1 (and similarly, the corresponding factor in the second term is smaller than 1 too), because the clipping threshold is set to 3 in our experiments and the remaining factor is a small random quantity. On the other hand, the number of model parameters $p$ is large (e.g., 11,181,642 for CIFAR10 or 28,938 for MNIST/FMNIST), which makes the second term much larger than the first (look at the numbers in Figure 4). Hence, the approximation makes complete sense.
Q4: Please mention the neighbourhood relation explicitly in defining DP.
A4: Two datasets of client $i$ are adjacent if they differ in only one data sample (present in one dataset and absent in the other). Therefore, our notion of adjacency is sample-level.
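For completeness, plugging this relation into the standard definition gives (standard $(\epsilon_i, \delta_i)$-DP; notation ours):

```latex
% D_i \sim D_i': local datasets of client i differing in a single sample.
% A randomized mechanism \mathcal{M}_i is (\epsilon_i, \delta_i)-DP if, for
% all adjacent D_i \sim D_i' and all measurable output sets S,
\Pr[\mathcal{M}_i(D_i) \in S] \;\le\; e^{\epsilon_i} \, \Pr[\mathcal{M}_i(D_i') \in S] + \delta_i .
```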
Please continue to our second official comment below:
Q5: The Fig. 2 caption has some broken references.
A5: By mistake, we did not include the table of detailed CIFAR10 results in the appendix at submission time (hence the broken reference). We include it here:
| Algorithm | Dist1 | Dist2 | Dist3 | Dist4 | Dist5 | Dist6 | Dist7 | Dist8 | Dist9 |
|---|---|---|---|---|---|---|---|---|---|
| WeiAvg | 31.18 | 31.18 | 29.65 | 27.74 | 24.25 | 19.91 | 21.93 | 18.91 | 20.64 |
| PFA | 26.91 | 32.68 | 25.19 | 29.21 | 21.63 | 18.93 | 20.63 | 16.27 | 15.75 |
| DPFedAvg | 31.51 | 21.51 | 22.28 | 20.50 | 21.25 | 15.19 | 18.27 | 16.45 | 18.63 |
| minimum | 26.20 | 16.71 | 16.45 | 15.86 | 14.23 | 10.51 | 13.35 | 13.32 | 14.11 |
| Robust-HDP | 31.97 | 31.70 | 32.00 | 30.60 | 24.86 | 23.61 | 24.10 | 19.02 | 22.05 |
comment1: It is not clear from the theory or from the provided empirical results how feasible the approximation approach actually is.
A6: Lemma 1 says that if RPCA can extract the noise matrix up to a multiplicative factor, then Robust-HDP can estimate the clients' noise variances up to a corresponding factor; in the ideal case both factors equal 1. Based on the theoretical results in the RPCA paper, as the rank of the signal matrix decreases and the additive noise matrix becomes sparser, we get closer to the ideal case of extracting the noise matrix exactly. Therefore, the two parameters in the bound depend on the given matrix and on the RPCA algorithm. On the other hand, for our purpose of estimating the optimum weights, estimating the noise variances up to a common factor is sufficient (so for our goal, the factor does not need to be exactly 1). For instance, for MNIST and Dist8, we can see from Figure 4 that the ratio of estimated to true noise variance is very close to 1 for all 20 clients in the system. This means that for MNIST, where the update matrix has columns of dimension 28,938 (the number of model parameters), RPCA estimates the noise matrix with good precision, which consequently makes the variance estimates very close to their true values.
Now consider the case where the model dimension is much larger, e.g., ResNet18 (11,181,642 parameters) or ResNet34 (roughly 21M parameters). Running RPCA on the full update matrix then not only takes a long time, but may also result in lower precision in extracting the noise matrix. Hence, we proposed our approximation approach, which relies on two facts: 1. in our heterogeneous DPFL problem, all the noise components within a client's update are iid, because they were sampled from the same normal distribution and added by the same client during its local DP training; 2. we want to estimate the total noise variance of each client's update, which is the sum of the variances of these iid components. As the error of a sample-variance estimate decreases inversely proportional to the number of samples, even if we estimate the per-coordinate noise variance by running RPCA on a row-subsampled submatrix of the update matrix, we still obtain an approximation of the total noise variance up to a factor, which is sufficient for Robust-HDP to recover the optimum weights. Our multiple improved experimental results on CIFAR10 show that this approximation is indeed effective.
Dear reviewers,
As the author-reviewer discussion deadline is approaching, we would like to take this chance to thank you all for your comments and questions. The study and experiments we conducted to answer them deepened our understanding of our proposed algorithm and of why it works. We encourage each of you to read the other reviewers' questions and our answers to them as well, since your questions have been closely related. We sincerely appreciate the time and effort you have put into this draft.