Personalized Federated Learning via Feature Distribution Adaptation
Federated representation learning under a generative classifier for improved personalization under distribution shifts.
Abstract
Reviews and Discussion
Federated Learning (FL) combines data from multiple clients to train a global model but struggles with heterogeneous data. Personalized Federated Learning (PFL) creates individual models for each client, addressing this issue. Traditional methods face challenges with bias-variance trade-offs and rely on limited local data or costly techniques. This paper proposes pFedFDA, an algorithm that frames global representation learning as a generative modeling task. It adapts global generative classifiers to local feature distributions, improving performance in data-scarce settings.
Strengths
- Implementing personalized federated learning from a generative view looks interesting and promising.
- The paper is well-structured overall, but the notation could be improved.
- The mathematical proof looks good and sufficient.
Weaknesses
- Your method looks similar to traditional personalized federated learning: as in Algorithm 1, the shared backbone is just the weighted summation of the corresponding parts across all involved clients.
- You mention that your shared backbone is trained in a generative way, but I cannot grasp the core of it. Actually, some other works [1] use generative models (e.g., autoregression) as the backbone to implement personalized federated learning; what is the difference between your method and such baselines?
- Your global distribution parameters are weighted summations over all clients, so in my opinion, the global distribution may be biased due to data-size skewness. There are many ways to overcome this, such as [2]. Do you think such methods could be adopted to fix this data-skewness issue?
[1] Kou, W.B., Lin, Q., Tang, M., Xu, S., Ye, R., Leng, Y., Wang, S., Chen, Z., Zhu, G. and Wu, Y.C., 2024. pFedLVM: A Large Vision Model (LVM)-Driven and Latent Feature-Based Personalized Federated Learning Framework in Autonomous Driving. arXiv preprint arXiv:2405.04146. [2] Kou, W.B., Lin, Q., Tang, M., Wang, S., Zhu, G. and Wu, Y.C., 2024. FedRC: A Rapid-Converged Hierarchical Federated Learning Framework in Street Scene Semantic Understanding. arXiv preprint arXiv:2407.01103.
Questions
- "where Φ and H are the feasible sets of neural network and classifier parameters, respectively.", in this context, it looks better to replace "neural network" to "shared backbone", do you think so?
- The notaions sometimes looks confused, which could be improved.
- In table 5, it should be pFedFDA not pFedDFA, right?
- The datasets used look simple. Could you use more complex datasets, such as Cityscapes, CamVid, KITTI, ImageNet, etc., to verify your contributions?
Limitations
The proposed method over-relies on the local distribution, so it is difficult to handle the following cases: 1. a new client joins the system and its local distribution is far from the original global distribution and from the original distributions of all clients; 2. the local distributions of all clients depend on time (i.e., are dynamic).
W1 (Distinction with Prior PFL Methods) Our method falls under the personalization framework of shared representation learning with personalized classifiers (discussed in our related work, L85-107).
The distinction of our work is in our formulation of generative client classifiers. Notably, the formulation of the classification layer not only defines the final personalized model of each client, but also controls how the shared backbone parameters are trained. We further discuss the advantages of our generative classifiers in our response to W2 below.
W2 (Clarification of Generative Approach) In our work, the term generative refers to the construction of our client classifiers, which we obtain through Bayes' rule and an estimate of the joint distribution of latent representations and class labels, p(z, y). This is in contrast to prior representation-learning-based PFL methods, which learn linear classifiers in a discriminative manner, learning p(y | z) directly.
The selection of client classifiers is important not only in obtaining an accurate personalized model but also in shaping how the shared backbone is trained. Using a generative classifier on each client to train the shared backbone, we can promote alignment in client features while accounting for differences in class distributions.
To provide one final perspective on training the backbone with generative classifiers, by minimizing the cross-entropy error in model predictions, we are training the shared backbone to extract features according to the distribution defined by our generative model.
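To make this concrete, the following is a minimal numpy sketch (ours, with illustrative names, not the authors' released code) of a tied-covariance class-conditional Gaussian classifier head obtained through Bayes' rule; because the covariance is shared across classes, the resulting scores are linear in the features.

```python
import numpy as np

def generative_classifier_logits(z, means, cov, priors):
    """Class scores from a class-conditional Gaussian model with a tied
    covariance, derived via Bayes' rule: score_c(z) = log p(y=c) + log p(z|y=c),
    dropping terms constant across classes.

    z:      (n, d) latent features from the shared backbone
    means:  (C, d) per-class means
    cov:    (d, d) shared covariance
    priors: (C,)   class priors p(y=c)
    """
    W = np.linalg.solve(cov, means.T)                       # columns: Sigma^{-1} mu_c
    b = -0.5 * np.sum(means.T * W, axis=0) + np.log(priors)
    return z @ W + b                                        # linear in z (tied covariance)
```

Minimizing cross-entropy over such scores with respect to the backbone parameters is what trains the feature extractor to match the distribution defined by the generative model.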
In your first provided reference, the backbone is trained using an input-level generative task, either autoregressive next-pixel prediction or masked-pixel prediction. This technique for training large vision models is known as generative pre-training. We will edit our paper accordingly to make sure the usage of the generative terminology is clear.
W3 (Data Skewness in Generative Modelling) Indeed, the weighted aggregation of parameters in most FL algorithms can lead to a bias towards clients with more local data. However, this update rule is still widely used in many PFL settings (e.g., in FedRep/BABU/PAC/FedAvg to aggregate neural network params), and an unweighted aggregation may increase the influence of noisy estimates from data-scarce clients.
In our work, we address the bias of using the shared distribution estimates through a local-global interpolated estimate. This has some similarities to the second provided reference, which uses the relative Bhattacharyya distance between RGB distributions to determine aggregation weights. While a similar approach could be used to obtain our interpolation coefficient, we optimize this coefficient directly to maximize personalized model accuracy.
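As an illustration of this local-global interpolation (a sketch under our assumptions, not the authors' exact optimizer; `eval_accuracy` is a hypothetical callback that builds the classifier from the interpolated statistics and scores held-out local data, and a plain grid search stands in for the direct optimization of the coefficient):

```python
import numpy as np

def interpolate_statistics(local_mu, local_cov, global_mu, global_cov, beta):
    """Convex local-global combination of Gaussian sufficient statistics."""
    mu = beta * local_mu + (1.0 - beta) * global_mu
    cov = beta * local_cov + (1.0 - beta) * global_cov
    return mu, cov

def select_beta(local_stats, global_stats, eval_accuracy,
                candidates=np.linspace(0.0, 1.0, 11)):
    """Pick the coefficient whose interpolated classifier scores best on
    held-out local data (eval_accuracy is a hypothetical callback)."""
    return max(candidates, key=lambda b: eval_accuracy(
        *interpolate_statistics(*local_stats, *global_stats, b)))
```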
Q1/Q2/Q3 (Method Presentation) We appreciate the feedback on the presentation of our method, and we will incorporate these comments to improve the clarity of our work.
Q4 (Benchmark Datasets) The selected vision datasets (CIFAR10/100, EMNIST) have been adopted by many recent works [1, 2, 3] to measure the effect of data heterogeneity on PFL tasks. We have additionally leveraged a set of natural data corruptions [4] for our CIFAR benchmarks, to simulate the complexity introduced by real-world covariate shift. We include TinyImageNet as an additional reference point in a large-data setting, but we focus primarily on the more challenging scenarios involving data scarcity and covariate shift. We leave the extension of our method to semantic segmentation tasks to future work.
[1] Exploiting Shared Representations for Personalized Federated Learning (Collins et al., ICML 2021)
[2] FedBABU: Towards Enhanced Representation for Federated Image Classification (Oh et al., ICLR 2022)
[3] Personalized Federated Learning with Feature Alignment and Classifier Collaboration (Xu et al., ICLR 2023)
[4] Benchmarking Neural Network Robustness to Common Corruptions and Surface Variations (Hendrycks et al., ICLR 2019)
L1/L2 (Over-Reliance on Local Distribution Estimate) While our method utilizes the local distribution estimate in generating personalized models, our personalized classifiers are based on an interpolated distribution estimate, which enables them to leverage global knowledge to reduce the variance of their local classifier. Additionally, we note that recent representation-learning-based methods [1][2] only use local client data to estimate client classifiers, and based on our results in Tab. 3, we observe that our method is more robust to extreme data scarcity, where the local distribution estimate is most limited.
In response to your concern on new-client generalization, we run the following experiment, which can be found in Tab. 1 of the rebuttal PDF:
- We train each method on CIFAR10 Dir(0.5) using half of the total clients throughout training. At test time, we evaluate on these clients, as well as the second half of clients not seen at training time. We additionally evaluate new-client generalization performance under covariate shift, by corrupting the images of all new clients with each of the 10 corruptions considered in our paper.
- We observe that our new-client generalization performance is superior to the baseline FedAvgFT in all settings, with and without covariate shift.
Regarding the compatibility of pFedFDA with dynamic client data distributions, we think this is an interesting direction for future work, but in this paper we follow the experimental setup of our cited baselines in assuming that client data distributions are static.
I have read the rebuttal, and the authors have addressed my concerns properly.
This work introduces pFedFDA, a novel approach to personalized federated learning that conceptualizes global representation learning as a generative modeling task. Specifically, the method involves shared representation learning, guided by a generative classifier characterized by a low-variance global probability density. Additionally, it iteratively refines global parameters and employs a local-global interpolation technique to tailor these estimates to individual client distributions. The authors validate the effectiveness of the proposed solution through experiments on benchmark datasets, demonstrating its robustness and applicability.
Strengths
The paper presents an interesting and meaningful idea, complemented by a theoretical analysis of the bound on high probability estimation errors for the interpolated mean estimate, offering valuable insights and inspiration to the community.
The writing is clear and straightforward, with well-designed figures that effectively support and clarify the presented concepts.
Weaknesses
There are several concerns that need addressing:
1. In some cases, the performance improvements appear minimal. For instance, Table 2 shows that pFedFDA is outperformed significantly by FedBABU and pFedME.
2. In Table 3, the authors employ a Dirichlet distribution (Dir(0.5)) for experiments under extreme data scarcity. It would be beneficial to lower the Dirichlet value, perhaps to 0.1, to conduct a more extensive evaluation under greater data scarcity.
3. Table 5 indicates that the runtime of pFedFDA is longer than several other methods, including FedAvg, Ditto, and FedRoD, which calls into question the efficiency of the proposed solution.
4. In Equations 3 and 5, should there be a superscript 'c' on Σ to clarify the notation further?
Questions
please check "Weaknesses". I like the idea of this work and I am looking forward to the responses from authors.
Limitations
N/A
W1 (Performance Comparison of pFedFDA with Prior Works) We believe this might be a misread of our results. While the performance improvements are not as significant in Tab. 2, pFedFDA is still competitive (achieving top-2 performance in 5 of the 7 scenarios). In particular, it beats FedBABU in all 7 scenarios and beats pFedMe in 4 of the 7 scenarios. In the remaining 3 scenarios, pFedMe outperforms pFedFDA by .002, .004, and .004, whereas pFedFDA outperforms pFedMe in the other 4 scenarios by .002, .023, .03, and .074.
In the two columns in which our method is not within the top-2, CIFAR-10 Dir(0.1) and TinyImageNet Dir(0.1), we observe that the gap between all methods is smaller. This can be explained by the relative ease of the personalization tasks under those two scenarios. Notably, these scenarios have limited covariate shift between clients and the local data volumes are large enough for clients to build strong models even without collaboration.
If we consider these results alongside Tab. 1 and Tab. 3, our experiments indicate that pFedFDA is promising as a general strategy for real-world PFL settings where we may encounter challenges of covariate shift and data scarcity.
W2 (Setup for Data-Scarce Experiments) Thank you for your suggestion. We want to emphasize that in this experiment, the Dirichlet distribution governs only class imbalance and not data scarcity. The extreme data scarcity setup in Tab. 3 is similar to what is considered in [1], where each client is assigned exactly one mini-batch of samples. Thus, changing the parameter of the Dirichlet distribution will not change the extent of data scarcity. We will clarify this setup in the camera-ready version of the paper if accepted.
[1] Personalized Federated Learning with Feature Alignment and Classifier Collaboration (Xu et al. ICLR 2023)
W3 (On the Runtime of pFedFDA) We do observe that there is a common computational cost associated with interpolation-based methods, as the local training takes additional steps to update each interpolated model directly (APFL), or optimize the interpolation weight between candidate models (FedPAC and our pFedFDA).
Compared to [1], our interpolation task is more efficient, as we combine local and global models, rather than the models of each client.
To understand the overhead of pFedFDA in greater depth, we conduct an analysis of the run-time associated with each component of local training in Tab. 2 of the attached rebuttal document. Notably, our optimization of the interpolation parameters is responsible for most of the additional training time. In Tab. 3 of the attached rebuttal document, we compare the accuracy of our method when updating the interpolation parameter only once every several rounds instead of at every local update. While updating the interpolation coefficients every round results in the best performance, a less frequent update schedule achieves comparable average accuracy, so this strategy may be preferable in resource-constrained settings to improve the efficiency of our method.
W4 (On Notation Clarity in Equations 3 and 5) Thank you very much for your careful reading and detailed comments. This superscript is not required, as all classes share the same covariance matrix in our generative model (Line 178-179). Notably, this tied covariance assumption results in a linear decision boundary, allowing us to make a direct comparison to current works without altering the model architecture. We will make an effort to clarify this point in the revised paper.
Thank you for the rebuttal. After reviewing the responses to all the comments, I find that my initial concerns have largely been resolved except the performance improvement. Consequently, I will maintain my original rating.
Thanks for your response. We are glad to hear that your initial concerns have been largely resolved. In the notification email, it says "Consequently, I will maintain my original rating (Weak Accept)." Yet, in the system, it says 'Borderline accept'. Would you mind confirming your choice of rating in the system? Thank you very much!
The paper introduces a personalized Federated Learning (FL) method that adapts global generative classifiers to local feature distributions. The authors show that their method can handle complex distribution shifts for computer vision tasks.
Strengths
- The paper proposes a personalized FL method that uses a generative classifier by considering bias-variance trade-off.
- The authors conduct extensive experiments.
- The paper is well-structured and easy to understand.
Weaknesses
- Sharing the statistics of the generative classifier could increase privacy risks.
- The algorithm’s additional computational cost (in Algo.1 L8-9) could increase with larger local data. Calculating the local covariance matrix for larger data and the covariance inversion matrix for each beta value could be computationally intensive.
Questions
- In Fig. 1, the colors representing client 1 and client 2 are not easily distinguishable.
- In Eq. (4), wouldn’t the cost of the covariance matrix inversion be high? Could the server calculate it once and send it to the client to reduce the client’s computational cost, even if it increases communication cost?
- In Eq. (9), should “\in min” be replaced with “= \argmax”?
- If the client’s generative classifier’s mean and variance are sent to the server, wouldn’t the privacy risk increase compared to sending only the weight of the conventional linear classifier?
- If the generative classifier is not used and only the weight parameter from the conventional linear classifier is interpolated locally, could the performance be expected to be mid-range between FedAvgFT and pFedFDA?
- In experiments like Table 2, where there is less covariate shift between clients within the federation, FedPAC performs better. Could the degree of covariate shift within the federation be determined by looking at the difference in the generative classifier’s statistics coming from the client, and then choose the appropriate FL method?
Limitations
Please refer to the weaknesses section.
W1/Q4 (Privacy Concerns of Sharing Gaussian Sufficient Statistics) We appreciate the reviewer's comments and think it is important to discuss the privacy implications in our work.
The transmission of client feature statistics to the parameter server does not immediately raise high privacy risks, as these statistics are not calculated directly from client raw data but from their mappings into a latent space learned by a neural network. In addition, clients broadcast only the interpolated estimates of the mean and variance, and the local interpolation parameter is not shared with the server. Thus, a potential attacker does not have direct access to the true local statistics.
On the other hand, it is difficult to make any guarantees without using explicit privacy protection techniques, such as homomorphic encryption [1] or differential privacy (DP) [2]. Both of these techniques are compatible with our method. Notably, our results in the low-sample regime (Tab. 3) indicate that pFedFDA can tolerate noisy distribution estimates, which is promising for future integration with recent techniques [3,4,5] for efficient DP estimates of the mean and covariance.
[1] Privacy-Preserving Deep Learning via Additively Homomorphic Encryption (Phong et al., IEEE Transactions on Information Forensics and Security 2018)
[2] Personalized Federated Learning With Differential Privacy (Hu et al., IEEE Internet of Things Journal 2020)
[3] Mean Estimation with User-level Privacy under Data Heterogeneity (Cummings et al., NeurIPS 2022)
[4] Differentially Private Covariance Estimation (Amin et al., NeurIPS 2019)
[5] Differentially Private Covariance Revisited (Dong et al., NeurIPS 2022)
W2/Q2 (Computational Cost Associated with Covariance Estimation) The added cost of client covariance estimation is reasonable in comparison to the base computation of training a neural network. Additionally, we avoid the more expensive calculation of the inverse covariance by solving a least-squares problem instead (refer to L197), which has reduced complexity compared to a naive matrix inversion algorithm. In our local training, the local covariance matrix is estimated once, and the optimization of the interpolation parameter re-uses this estimate, only recomputing the least-squares solution for each value of beta.
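To illustrate the distinction with generic numpy code (placeholder shapes and data, not the paper's implementation): weights of the form Sigma^{-1} mu_c can be obtained by solving a linear system, so the covariance is never inverted explicitly, and the cached estimate can be reused for each candidate interpolation value.

```python
import numpy as np

d, C = 512, 10                                            # placeholder feature dim / classes
Sigma = np.cov(np.random.randn(1000, d), rowvar=False)    # stand-in covariance estimate
means = np.random.randn(C, d)                             # stand-in interpolated class means

W_inverse = np.linalg.inv(Sigma) @ means.T                # explicit inversion (avoided)
W_solve = np.linalg.solve(Sigma, means.T)                 # direct solve of Sigma @ W = means.T
W_lstsq, *_ = np.linalg.lstsq(Sigma, means.T, rcond=None) # robust to rank-deficient estimates
```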
In Tab. 2 of the attached rebuttal document, we measure the percentage of local training time associated with the base network training (forward/backward passes), the estimation of the local mean and covariance, as well as the optimization of the interpolation coefficient. We note that estimating the mean and covariance is less than 3% of the total runtime of each round.
From this observation, we run experiments in which the interpolation parameter is only estimated once every several rounds, as this is the primary overhead introduced in pFedFDA. While updating the interpolation coefficient every round results in the best performance, we observe similar accuracy with less frequent updates, which would substantially reduce the pFedFDA overhead. Detailed results for this study can be found in Tab. 3 of the attached rebuttal document.
Q1/Q3 (Figure 1 and Equation 9 Presentation) We appreciate the feedback and will improve upon the clarity of our work by adjusting the client color palette and making the notation of Eq. (9) consistent with the rest of the paper.
Q5 (Generative vs. Discriminative Classifier Interpolation) Thanks for the interesting question. We agree with the conjecture that our personalized classifier should perform better than interpolated discriminative classifiers.
At a high level, a generative approach is more advantageous in low-sample settings; but both will approach a similar accuracy if the local data volume is sufficient. Importantly, this is assuming they are using the same feature extractor parameters.
However, we would like to point out that our generative modeling approach influences not only the formulation of the final personalized classifiers but also the process of global representation learning.
In pFedFDA, clients train the feature extractor to minimize the cross-entropy loss of a generative classifier using global feature statistics (refer to Section 4.2). Intuitively, this loss pulls client features towards the global feature distribution. This allows clients to benefit more from model interpolation, as there is less bias incurred through incorporating global knowledge (i.e., we can use a smaller interpolation coefficient beta in Theorem 1).
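A schematic of this local objective under our reading of Section 4.2 (a PyTorch-style sketch with hypothetical names; the global statistics are treated as constants, so gradients update only the backbone):

```python
import torch
import torch.nn.functional as F

def local_backbone_step(backbone, batch, global_means, global_cov, log_priors, opt):
    """One local step: cross-entropy through a *fixed* generative classifier
    built from global feature statistics, pulling client features toward
    the global feature distribution."""
    x, y = batch
    z = backbone(x)                                      # (n, d) latent features
    W = torch.linalg.solve(global_cov, global_means.T)   # (d, C); no explicit inverse
    b = -0.5 * (global_means.T * W).sum(0) + log_priors  # (C,) class bias terms
    loss = F.cross_entropy(z @ W + b, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```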
If an alternative method using interpolated discriminative classifiers does not employ additional regularization to guide the representation learning, the performance may indeed drop below FedAvgFT on certain benchmarks. We can see a concrete example of this in the ablation study of FedPAC [6].
[6] Personalized Federated Learning with Feature Alignment and Classifier Collaboration (Xu et al. ICLR 2023)
Q6 (Selecting a Preferred Method for Varying Non-IID Settings) While dynamically selecting the personalization algorithm is interesting, it may be challenging in practice as the global feature extractor and estimated distributions evolve throughout training.
Without prior knowledge of client distributions, we think it makes sense to adopt pFedFDA as a generic one-size-fits-all approach, as it is robust to covariate shift, data scarcity, and is more efficient than FedPAC.
If additional resources are available, it may be reasonable to adopt a hybrid approach and interpolate the pFedFDA classifiers across all clients in the style of FedPAC after training has concluded.
Thank you for your feedback. I have reviewed the authors’ rebuttals to all the reviews, and most of my concerns have been addressed. Therefore, I would like to change my rating to 6 (weak accept).
This paper uses a class-conditional Gaussian model to formulate the latent representation of the global generative component; on top of this, a personalized federated learning algorithm, pFedFDA, is designed via federated distribution adaptation. The paper then proves a bound on the bias-variance trade-off of pFedFDA under the assumption of independent, Gaussian-distributed local dataset features, and presents the performance of pFedFDA under dataset scarcity and heterogeneity.
Strengths
- The introduction and related work are well-organized. The explanation of replacing the embedding module's global model with a generative one is convincing.
- This paper gives a solid proof on the analysis of bias-variance trade-off.
Weaknesses
- The experiments concentrate on CIFAR-10 and CIFAR-100, which are also the datasets where pFedFDA performs best (Tab. 2). Moreover, the robustness of pFedFDA against strong heterogeneity is not good, and this weakness remains a limitation of the paper.
- The assumption of Theorem 1 is very strong, yet the paper does not evaluate the long-tailedness or normality of the datasets used in the experiments. Also, the basic result of Theorem 1 concerns an upper bound on the local bias; there is no metric in the experiments to compare the bias-variance balance among the algorithms, especially in the scarcity experiments (Tab. 1, 3).
- This paper does not explain the connection between bias-variance balance and robustness against data scarcity.
Questions
- Can the Class-Conditional Gaussian Kernel represent the optimal Gaussian distribution?
- Please further explain the experiment in Tab. 1. How does it relate to the advantage of pFedFDA in bias-variance balancing? Also, other models (i.e., APFL, Ditto) show stable performance under dataset scarcity. How can this be explained?
- In Tab. 2, in the FedRep row, please double-check the value of the last cell, .145(0.4). Is that correct?
- Did you perform a normality test on the training sets of the agents? If not, can you give a further explanation of the feature distributions of the datasets to justify the assumption of Theorem 1?
Limitations
- The assumption of Theorem 1 is strong. It would be better to consider skewed, and even long-tailed, distributions. Based on such an analysis, the result in Tab. 3 could be extended to more general few-shot or one-shot learning cases with a modified version of pFedFDA.
- The design of pFedFDA is still not robust to heterogeneity and is not capable of better personalization.
W1 (Robustness to Data Heterogeneity) We note that while Tab. 1 and Tab. 3 are based on CIFAR datasets, these evaluations introduce the additional challenges of client covariate shift (via natural image corruptions) and data scarcity. Our strong performance in these settings indicates that pFedFDA is robust to realistic sources of covariate shift, in addition to quantity skew and prior probability shift introduced via Dirichlet-based data partitioning.
For more discussion of Tab. 2, please refer to our response W1 to reviewer WK31.
W2 (Gaussian Assumption) The application of the multivariate central limit theorem to feature representations is a common step in the analysis of neural networks, e.g., [1,2] study their relationship with Gaussian Processes. Moreover, it has been observed that the distribution of features from trained neural networks is well approximated by a class-conditional Gaussian, with a Gaussian discriminant analysis classifier having comparable accuracy to the softmax classifier used for training [3]. This has led to the widespread usage of class-conditional Gaussian feature space approximations in the literature on out-of-distribution detection [3, 4, 5].
[1] Deep Neural Networks as Gaussian Processes (Lee et al., ICLR 2018)
[2] Dropout as a Bayesian Approximation (Gal and Ghahramani, ICML 2016)
[3] A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks (Lee et al., NeurIPS 2018)
[4] A Simple Fix to Mahalanobis Distance for Improving Near-OOD Detection (Ren et al., 2021)
[5] Exploring the Limits of Out-of-Distribution Detection (Fort et al., NeurIPS 2021)
W2/W3 (Connection of Bias-Variance Tradeoff and Data Scarcity) Thank you very much for pointing this out. We will add further explanations on their connection in the camera-ready version if accepted. Intuitively, limited client data can lead to high variance and poor generalization in local models, encouraging collaboration with other clients for lower variance. However, collaborative estimates (e.g., FedAvg or local-global interpolation) introduce potential bias when client data is non-iid. Theorem 1 theoretically captures this intuition on the bias-variance tradeoff of interpolated estimates and their effect on generalization error, which is implicitly measured by client generalization accuracy in our experiments.
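As a toy numeric illustration of this tradeoff (ours, separate from Theorem 1): estimating a scalar mean from n = 5 local samples when a biased global estimate is available, the squared error of the interpolated estimator is minimized at an intermediate coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_local, mu_global, sigma, n = 0.0, 0.5, 1.0, 5   # scarce local data, biased global mean
betas = np.linspace(0.0, 1.0, 11)

# MSE of  beta * local_sample_mean + (1 - beta) * mu_global  over many trials.
mse = np.mean([
    [(b * rng.normal(mu_local, sigma, n).mean() + (1 - b) * mu_global - mu_local) ** 2
     for b in betas]
    for _ in range(20000)], axis=0)
for b, e in zip(betas, mse):
    print(f"beta={b:.1f}  MSE={e:.4f}")
```

Pure local estimation (beta = 1) pays the variance sigma^2 / n, pure global reliance (beta = 0) pays the squared bias, and the minimum lies in between; this is exactly the regime in which interpolation helps data-scarce clients.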
Q1 (Questions on the Class-Conditional Gaussian Distribution) Please refer to our response to W2 on the justification of the class-conditional Gaussian model.
Q2 (Setup of Tab. 1) In Tab. 1, we introduce real-world covariate shift between client distributions by corrupting the inputs of the first 50 clients with natural image alterations [6]. This simulates FL settings where clients collect data from different devices and environments, introducing input noise not present in curated benchmark datasets. For details on image corruption, see Appendix C.1.1. Comparing Tab. 1 and 2, we see this covariate shift significantly reduces the generalization performance of FL methods.
Since half of the clients' data no longer matches the CIFAR10 distribution, there's increased bias in using global knowledge (as discussed in W2).
We assess methods under varying data scarcity to evaluate their ability to navigate the bias-variance tradeoff discussed in W2.
Regarding stability in low-sample settings, our method shows no consistent variance advantage/disadvantage over APFL and Ditto, but has significantly higher average client accuracy.
[6] Benchmarking Neural Network Robustness to Common Corruptions and Perturbations (Hendrycks and Dietterich, ICLR 2019)
Q3 (Correction to Tab. 2) We thank the reviewer for their attention to detail. This was a typo and the entry .145(0.4) should instead read .145(.04).
Q4 (Gaussian Assumption in Practice) As discussed in W2, we use the class-conditional Gaussian assumption, common in prior literature for describing latent representation distributions in practical settings. The empirical success of our generative classifiers supports its applicability in a variety of FL scenarios. Based on these findings, we conjecture that normality testing is not a prerequisite for applying our method.
However, to provide a reference measure of Gaussianity in our experiments, we train FedAvg on CIFAR10 Dir(0.5) and perform the Henze-Zirkler normality test [7] on client features centered by class means, and observe that 89/100 clients follow a class-conditional Gaussian distribution at significance level p=0.05.
[7] A class of invariant consistent tests for multivariate normality (Henze and Zirkler, Communications in Statistics-Theory and Methods, 1990)
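For reference, one way to reproduce such a check (a sketch assuming pingouin's implementation of the Henze-Zirkler test [7]; the authors describe their procedure only as centering features by class means):

```python
import numpy as np
import pingouin as pg  # provides the Henze-Zirkler multivariate normality test

def class_centered_normality(features, labels, alpha=0.05):
    """Center each feature vector by its class mean, then test whether the
    pooled residuals are multivariate normal, i.e., whether the features are
    class-conditionally Gaussian with a tied covariance."""
    centered = features.astype(float).copy()
    for y in np.unique(labels):
        centered[labels == y] -= features[labels == y].mean(axis=0)
    return pg.multivariate_normality(centered, alpha=alpha)  # (hz, pval, normal)
```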
L1 (Theorem 1 and Alternative Distributions) For an explanation of the class-conditional Gaussian model and assumptions in Theorem 1, please refer to our response W2/W3.
We recognize that other models may perform better in some settings (as discussed in Appendix A), but we defer this optimization to future work. Based on our empirical results, the class-conditional Gaussian model appears reasonable for practical use.
L2 (Robustness to Heterogeneity and Personalization Performance) We apologize for any confusion and will make an effort to make the robustness of our method to data heterogeneity more clear.
In this work, we evaluated our method in the presence of various types of data heterogeneity, including quantity skew and prior probability shift (via Dirichlet-partitioning and sub-sampling), as well as covariate shift (via natural image corruptions [6]). Our empirical results indicate the ability of pFedFDA to generate personalized client models that are robust to heterogeneity, notably in the more challenging settings of data scarcity and client covariate shift.
Our rebuttal includes an additional experiment (Tab. 1) which shows that pFedFDA also generalizes well to new clients, even for clients with covariate shifts not seen at training time.
After careful reading, I think the authors have made themselves clear on the weaknesses (and questions), and I have updated my rating. However, it is still a regret that no other datasets are used in the scarcity/corruption experiments (which are crucial for the main idea).
We are glad that our response addressed most of your concerns, and appreciate the increase in your score.
We value your feedback that additional benchmarks in low-sample covariate shift settings would be beneficial. Due to resource constraints, we cannot provide new results on TinyImageNet at this time, but we intend to run additional ablations for the camera-ready version if accepted.
For this discussion, we have conducted additional experiments on the DIGIT-5 dataset [8]. The DIGIT-5 dataset consists of the original MNIST samples, as well as digit characters from SVHN, USPS, MNIST-M, and synthetic datasets. We consider an FL system where each client holds data from a single source dataset, similar to recent works [9, 10]. This established multi-domain evaluation provides an additional level of covariate shift not present in the original federated-MNIST dataset.
In the table below, we show the average (std) client accuracy for selected baselines. For each FL method, we also indicate the average accuracy improvement compared to local training. In line with the results in our main text, pFedFDA has the strongest performance in data-scarce settings and remains competitive even when the local data volume becomes more sufficient.
We appreciate the feedback from this discussion and hope the included results and intended ablations will help clarify the advantages of pFedFDA in handling the bias-variance tradeoff of personalized federated learning.
| DIGIT-5 % Training Samples | 25 | 50 | 75 | 100 | Avg. Improvement |
|---|---|---|---|---|---|
| Local | | | | | - |
| FedAvg | | | | | |
| FedAvgFT | | | | | |
| Ditto | | | | | |
| FedPAC | | | | | |
| pFedFDA | | | | | |
[8] Learning to Generate Novel Domains for Domain Generalization (Zhou et al., ECCV 2020)
[9] Federated Learning from Pre-Trained Models: A Contrastive Learning Approach (Tan et al., NeurIPS 2022)
[10] Rethinking Federated Learning with Domain Shift: A Prototype View (Huang et al., CVPR 2023)
This paper introduces pFedFDA, a personalized Federated Learning (pFL) method designed to address the issue of client heterogeneity in federated learning. pFedFDA combines global knowledge through server aggregation with local knowledge through client-specific training and distribution estimation, enhancing the model's personalized performance on each client. The authors propose an algorithm that efficiently generates personalized models, demonstrating significant improvements in data-scarce settings through extensive computer vision benchmarks. The paper is well-structured, and the writing is clear, making it easy to follow the authors' arguments and experimental results.
Strengths
1. The method of decomposing model training into shared representation learning and personalized classifier training, followed by adaptation to local feature distributions, is both innovative and promising for handling non-i.i.d. data in FL environments.
2. The paper provides strong empirical evidence through comprehensive experiments on various datasets, showcasing the superiority of pFedFDA in challenging distribution shift and data scarcity scenarios.
3. The paper is well-written, with a clear structure that logically progresses from the introduction of the problem to the presentation of the methodology and results.
Weaknesses
1. In this paper, the mean value and covariance are utilized as the characteristic distribution of the data. Have you explored other statistical measures to describe this distribution?
2. In the ablation study, the results obtained by calculating multiple β values were unexpectedly lower compared to using a single β value. Intuitively, multiple β values should provide a more comprehensive understanding of the data distribution and thus yield better results. However, the experimental findings do not support this expectation, and there is no clear explanation for this discrepancy.
3. Note that pFedFDA uses the same feature extractor for all clients but employs heterogeneous classifiers. A key feature of pFL is the heterogeneity of client models; pFedFDA is somewhat limited in this regard. Are there methods to handle heterogeneous feature extractors in pFL scenarios?
Questions
The main concerns and questions are listed in the weaknesses. Please provide answers to address these concerns.
Limitations
The authors adequately addressed the limitations.
W1 (Choice of Statistical Measures) We selected the mean and covariance as the class-conditional Gaussian model is uniquely defined by these parameters. While it would have been possible to communicate estimates of the inverse-covariance matrix, this adds unnecessary computation which we avoid in our implementation (refer to line 197).
If a distribution other than the class-conditional Gaussian is considered for modeling the feature distribution p(z | y), it may be reasonable to send different sufficient statistics or estimate higher moments.
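For concreteness, a minimal sketch (illustrative, not the released implementation) of the sufficient statistics a client estimates under the class-conditional Gaussian model:

```python
import numpy as np

def class_gaussian_statistics(features, labels, num_classes):
    """Per-class means plus one tied covariance: the sufficient statistics of
    the class-conditional Gaussian model with a shared covariance matrix."""
    n, d = features.shape
    means = np.zeros((num_classes, d))
    scatter = np.zeros((d, d))
    for c in range(num_classes):
        zc = features[labels == c]
        if zc.shape[0] == 0:
            continue                      # class absent locally; rely on global stats
        means[c] = zc.mean(axis=0)
        resid = zc - means[c]
        scatter += resid.T @ resid
    return means, scatter / max(n - 1, 1)
```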
W2 (Discussion of Beta Value Ablation) Thanks for the interesting question. Indeed, by using separate coefficients for the means and covariance, the interpolated classifier is more flexible towards fitting the local distribution. While we optimize beta using k-fold validation, this is done over the set of training features, so there is still potential for over-fitting. We chose to not optimize over a separate validation set to make our results comparable to existing approaches which do not require additional held-out data.
W3 (Limitations of pFedFDA Personalization) We would first like to point out that using a shared backbone is not necessarily a limitation, but a common feature of recent representation-learning methods for PFL (e.g., FedRep [1], FedBABU [2], FedPAC [3]). This parameter-sharing approach learns generalizable features and simplifies the personalization task of each client to the final classification layer.
Still, if model heterogeneity is desired (e.g., due to different device computation resources), pFedFDA could be extended to these settings by adopting an approach similar to FedProto[4]. Specifically, clients could collaborate on the means and covariance of the global feature distribution, without broadcasting or aggregating the backbone model. As discussed in Sec. 4.2, our generative classifier objective is similar to the regularization term and inference objective of FedProto, where we use an estimated covariance matrix rather than implicitly assuming the covariance to be the identity matrix.
[1] Exploiting Shared Representations for Personalized Federated Learning (Collins et al., ICML 2021)
[2] FedBABU: Towards Enhanced Representation for Federated Image Classification (Oh et al., ICLR 2022)
[3] Personalized Federated Learning with Feature Alignment and Classifier Collaboration (Xu et al., ICLR 2023)
[4] FedProto: Federated Prototype Learning across Heterogeneous Clients (Tan et al., AAAI 2022)
We would like to thank the reviewers for their detailed comments and feedback. We will revise the paper accordingly to further clarify our work and address the points brought up in these discussions.
In our attached rebuttal PDF, we have provided the following additional experimental results:
Tab. 1: Evaluation of method generalization to clients unseen at training (in response to reviewers Qz6b and BoVr).
Tab. 2: Analysis of system run-time corresponding to each component of local training (in response to reviewer 3xQF).
Tab. 3: Evaluation of pFedFDA with intermittent updates of the interpolation parameter (in response to reviewer 3xQF and WK31).
The paper proposes a personalized FL method that uses a generative classifier by considering the bias-variance trade-off. The paper is well-written, with a clear structure that logically progresses from the introduction to the methodology and results. In the rebuttal phase, all reviewers were satisfied with the authors' responses. I recommend acceptance of this submission.