FuseFL: One-Shot Federated Learning through the Lens of Causality with Progressive Model Fusion
This work identifies the cause of the low performance of one-shot FL and proposes FuseFL, which progressively trains and fuses DNN models in a bottom-up manner, reducing communication costs to an extremely low degree.
Abstract
Reviews and Discussion
This paper proposes FuseFL to address non-IID data in one-shot federated learning (OFL). Specifically, FuseFL is inspired by the lens of causality and fuses each layer of the global model step by step to learn invariant features. Extensive experiments validate that FuseFL achieves SOTA accuracy under various non-IID settings.
Strengths
S1: The proposed algorithm is novel and well-motivated.
S2: Extensive experiments validate the effectiveness of FuseFL.
Weaknesses
W1: Since FuseFL still communicates multiple rounds, I think it belongs to few-shot FL instead of one-shot FL. One-shot FL should have only one communication round. Besides, FedAvg (or other FL algorithms) with the same number of communication rounds should be compared as baselines to ensure a fair comparison.
W2: More experimental setups could be evaluated to further support the effectiveness of the proposed algorithm. For example, in a previous FL benchmark on non-IID data [1], the quantity-based label skew settings are challenging, but they are not tested in the current version. Also, in the scalability section, the authors only test up to 50 clients. I suggest the authors test on even more clients, e.g., 500 clients, to validate the scalability.
[1] Li, Qinbin, et al. "Federated learning on non-iid data silos: An experimental study." 2022 IEEE 38th international conference on data engineering (ICDE). IEEE, 2022.
Questions
Please see weaknesses. I think this paper is sound and interesting. I'd like to see more experiments to further verify the effectiveness of the proposed algorithm.
Limitations
Please see weaknesses.
We would like to thank the reviewer for their time in reviewing. We appreciate that you find the proposed algorithm novel and well-motivated, the experiments extensive, and FuseFL effective. Please see our detailed feedback on your concerns below.
Q1: Few-shot FL or one-shot FL.
Since FuseFL still communicates multiple rounds, I think it belongs to few-shot FL instead of one-shot FL. Besides, FedAvg (or other FL algorithms) with the same number of communication rounds should be compared as baselines to ensure a fair comparison.
Ans for Q1): Thanks for your important comments. Yes, FuseFL still communicates over a few rounds; in this sense, FuseFL belongs to few-shot FL. However, the communication cost of FuseFL is the same as other OFL methods, which is the main claim in the introduction (extremely low communication costs). We will highlight this difference in the introduction of the revised paper.
Furthermore, we conduct new experiments (a=0.1) with 2 and 4 communication rounds of the baseline methods, using their multi-round versions [2,8], to provide a fair comparison from the round perspective. Note that FedAvg uses 10 communication rounds in all experiments. Results show that, for the same number of communication rounds, FuseFL still outperforms all other baselines.
| Datasets | CIFAR-10 | SVHN |
|---|---|---|
| FedAvg R=10 | 23.93 | 31.65 |
| FedDF R=2 | 45.57 | 53.94 |
| DENSE R=2 | 63.08 | 58.13 |
| Ensemble | 57.5 | 65.29 |
| FuseFL K=2 | 70.85 | 76.88 |
| Datasets | CIFAR-10 | SVHN |
|---|---|---|
| FedAvg R=10 | 23.93 | 31.65 |
| FedDF R=4 | 49.16 | 56.29 |
| DENSE R=4 | 56.26 | 63.95 |
| Ensemble | 57.5 | 65.29 |
| FuseFL K=4 | 73.79 | 78.08 |
Q2: Experiment Issues.
For example, in a previous FL benchmark on non-IID data [1], the quantity-based label skew settings are challenging, which are not tested in the current version. Also, in the section of scalability, the authors only test up to 50 clients. I suggest the authors to test on even more clients, e.g. 500 clients, to validate the scalability.
Ans for Q2): Thanks for your valuable comments. Quantity-based label skew is another kind of heterogeneity, different from the Dirichlet-sampling-based non-IID simulation. Following your suggestions, we conduct new experiments with quantity-based label skew of #C=2 and 10 clients. The test accuracies are shown below, illustrating that #C=2 is a harder setting than the Dirichlet sampling, while FuseFL still outperforms the other baseline methods.
| Datasets | MNIST | FMNIST | CIFAR-10 |
|---|---|---|---|
| FedAvg | 23.58 | 23.56 | 21.28 |
| FedDF | 40.29 | 26.92 | 26.59 |
| F-DAFL | 41.59 | 28.45 | 29.83 |
| F-ADI | 42.17 | 31.29 | 30.25 |
| DENSE | 44.2 | 33.53 | 34.83 |
| Ensemble | 52.3 | 36.23 | 38.17 |
| FuseFL K=2 | 58.74 | 57.72 | 50.79 |
| FuseFL K=4 | 63.32 | 63.04 | 53.61 |
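For reference, the #C=2 quantity-based label skew above can be simulated with a short sketch like the following; the function name `quantity_label_skew` and its details are our illustration, not code from the paper or from benchmark [1].

```python
import random
from collections import defaultdict

def quantity_label_skew(labels, n_clients, c_per_client, seed=0):
    """Partition sample indices so each client draws from at most
    `c_per_client` classes (the #C=k quantity-based label skew)."""
    rng = random.Random(seed)
    classes = sorted(set(labels))
    # Each client picks c_per_client classes.
    client_classes = [rng.sample(classes, c_per_client) for _ in range(n_clients)]
    # Group sample indices by class.
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    # Record which clients hold each class.
    holders = defaultdict(list)
    for cid, chosen in enumerate(client_classes):
        for c in chosen:
            holders[c].append(cid)
    # Split each class round-robin among its holders; classes picked by
    # nobody fall back to a random client so no sample is dropped.
    partition = {cid: [] for cid in range(n_clients)}
    for c, idxs in by_class.items():
        owners = holders[c] or [rng.randrange(n_clients)]
        for i, idx in enumerate(idxs):
            partition[owners[i % len(owners)]].append(idx)
    return partition
```

With #C=2 each client sees only two labels, which is why this setting is considerably harder than moderate Dirichlet sampling.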
To validate the scalability of FuseFL, we follow the FL scales used in [2,3] in the original paper. Most OFL methods simulate fewer than 50 clients with ResNet and up to 100 with basic machine learning models [5,6,7]. Simulating 500 clients within OFL methods is challenging, as 500 local models need to be saved to conduct knowledge distillation or ensembling: loading 500 ResNets into GPU memory is difficult, and offloading and switching between GPU and CPU memory requires much time. To this end, we tried our best to simulate 100 clients to verify scalability, as follows. Results show that FuseFL outperforms all baseline methods except the Ensemble, which requires significant storage and computation overheads.
| Datasets | CIFAR-10 | SVHN |
|---|---|---|
| FedAvg | 14.9 | 22.99 |
| FedDF | 18.15 | 28.53 |
| F-DAFL | 20.33 | 33.15 |
| F-ADI | 21.56 | 34.29 |
| DENSE | 22.91 | 34.38 |
| Ensemble | 39.11 | 44.75 |
| FuseFL K=4 | 32.92 | 42.23 |
Reference
[1] Li, Qinbin, et al. "Federated learning on non-iid data silos: An experimental study." 2022 IEEE 38th international conference on data engineering (ICDE). IEEE, 2022.
[2] J. Zhang, C. Chen, B. Li, L. Lyu, S. Wu, S. Ding, C. Shen, and C. Wu. DENSE: Data-free one-shot federated learning. In NeurIPS 2022.
[3] R. Dai, Y. Zhang, A. Li, T. Liu, X. Yang, and B. Han. Enhancing one-shot federated learning through data and ensemble co-boosting. In The Twelfth International Conference on Learning Representations, 2024.
[4] Q. Li, B. He, and D. Song. Practical one-shot federated learning for cross-silo setting. In IJCAI 2021.
[5] C. E. Heinbaugh, E. Luz-Ricca, and H. Shao. Data-free one-shot federated learning under very high statistical heterogeneity. In The Eleventh International Conference on Learning Representations, 2023.
[6] Jhunjhunwala, D., Wang, S. and Joshi, G., 2024, April. FedFisher: Leveraging Fisher Information for One-Shot Federated Learning. In the International Conference on Artificial Intelligence and Statistics (pp. 1612-1620). PMLR.
[7] Towards Addressing Label Skews in One-Shot Federated Learning. In ICLR 2023.
[8] Ensemble distillation for robust model fusion in federated learning. In NeurIPS 2020.
I appreciate the authors' response. Considering other reviews, I raise the score to "weak accept".
Dear Reviewer #twUP,
We're grateful for your quick feedback during this busy period. We deeply appreciate your consideration in raising the score. We remain open and ready to delve into any more questions or suggestions you might have. Your constructive comments have significantly contributed to the refinement of our work.
Best regards and thanks
The authors identify the isolation problem as being the root cause of low accuracy in OFL. The isolation problem arises as the clients locally overfit to spurious features observed in their own local datasets in the absence of knowledge from other clients. To better learn invariant features, the authors propose progressive model fusion called FuseFL which fuses client local models block by block. Crucially, this can be done without increasing the total communication cost, however at the cost of additional rounds.
Strengths
- The authors address an important problem in OFL research in a novel way
- The problem and its solution is clearly presented
- The experimental evaluation is diverse, in particular, the authors also show the memory cost of fusion and experiments with heterogeneous models
Weaknesses
- An important baseline [1] is missing from the evaluations which has been shown to improve over the uniform ensemble considered as upper bound in this paper. The CoBoosting algorithm in [1] generates the weights of the ensemble simultaneously with synthetic data to transfer the knowledge to a single global model and is the recent SOTA. Notably, it is a data-free ensemble distillation method in contrast to Table 1.
- No evaluations are shown with higher heterogeneity levels, such as a = 0.01 or a = 0.05 which is a common setting in OFL papers [1,2,3]. This prevents a comprehensive comparison with prior work.
- The authors have not discussed important limitations of their work. In particular, the security aspect under multiple rounds of communication. OFL algorithms are recognised not just for their low communication costs but also their lower vulnerability to attacks, thanks to the single communication round. By introducing multiple rounds of communication, FuseFL increases vulnerability to attacks.
- Similarly, FuseFL appears to increase client side overheads by letting clients run training K times instead of once as in standard OFL. While the computation cost appears to decrease with K, it is nevertheless higher than client side overheads in standard OFL approaches which mostly add server side overheads. Hence, a comprehensive discussion of these limitations is expected to provide a complete picture of tradeoffs.
[1] R. Dai, Y. Zhang, A. Li, T. Liu, X. Yang, and B. Han. Enhancing one-shot federated learning through data and ensemble co-boosting. In The Twelfth International Conference on Learning Representations, 2024.
[2] C. E. Heinbaugh, E. Luz-Ricca, and H. Shao. Data-free one-shot federated learning under very high statistical heterogeneity. In The Eleventh International Conference on Learning Representations, 2023.
[3] Jhunjhunwala, D., Wang, S. and Joshi, G., 2024, April. FedFisher: Leveraging Fisher Information for One-Shot Federated Learning. In the International Conference on Artificial Intelligence and Statistics (pp. 1612-1620). PMLR.
Questions
The reviewer is positive about this work and is willing to increase the score provided the authors evaluate against Co-Boosting and include a higher heterogeneity level in their overall assessment. While a thorough empirical assessment of the security of the proposed approach is not deemed necessary, the paper must at least discuss the security aspects along with the client side overheads to better inform the readers of the inherent tradeoffs in FuseFL.
Limitations
No additional limitations than those discussed in the sections above. The authors have well addressed other limitations of their approach involving the memory footprint of the fused model, applicability to model heterogeneity, etc. through their empirical evaluations.
We would like to thank the reviewer for their time in reviewing. We appreciate that you find the problem interesting, important, and practical, our idea clear, and the performance great. Please see our detailed feedback on your concerns below.
Q1: Experiment Issues.
Important baseline [1] and higher heterogeneity levels.
Ans for Q1): Thanks for your important comments. We have conducted new experiments with CoBoosting [1] and FuseFL under higher heterogeneity levels (a=0.05). The new results are provided below and added to our revision. Results show that FuseFL outperforms CoBoosting and achieves larger improvements over the baselines. Note that for CIFAR-10, Ensemble outperforms all baselines but requires extremely large storage and computational costs.
| Datasets | MNIST | FMNIST | SVHN | CIFAR-10 | CIFAR-100 |
|---|---|---|---|---|---|
| FedAvg | 46.35 | 20.07 | 39.41 | 17.49 | 6.45 |
| FedDF | 80.73 | 44.73 | 60.79 | 37.53 | 16.07 |
| F-ADI | 80.12 | 42.25 | 56.58 | 36.94 | 13.75 |
| F-DAFL | 78.49 | 41.66 | 59.38 | 37.82 | 15.79 |
| DENSE | 81.06 | 44.77 | 60.24 | 38.37 | 16.17 |
| CoBoosting | 93.93 | 50.62 | 65.40 | 47.20 | 19.24 |
| Ensemble | 58.06 | 66.19 | 62.22 | 53.33 | 32.25 |
| FuseFL K=2 | 95.23 | 83.23 | 75.08 | 46.38 | 29.98 |
| FuseFL K=4 | 95.37 | 83.65 | 75.53 | 51.59 | 32.71 |
We also provide the following discussion of the differences between FuseFL and [1,2,3].
- Knowledge distillation based FL: Both CoBoosting [1] and FEDCVAE-KD [2] focus on exploiting knowledge distillation methods to improve global model performance, while our method focuses on how to aggregate models together. Thus, knowledge distillation is orthogonal to our method and may be utilized to enhance FuseFL. For example, one can run FuseFL first to obtain a fused global model, then use this model to conduct knowledge distillation to guide local model training with FuseFL once again.
- Average-based FL: FedFisher [3] focuses on how to better average models to obtain a better global model. Fisher information is utilized to identify element-wise averaging weights of parameters in different local models. FedFisher outperforms previous OFL methods [4]. However, FuseFL follows a different methodology that concatenates model blocks together instead of averaging. Nevertheless, we may consider utilizing Fisher information and average-based methods to enhance FuseFL.
More differences between FuseFL and average-based [3,4] and knowledge-distillation-based [1,2] OFL can be found in Table 1 and Appendix C of the original paper.
Q2: Security issues.
By introducing multiple rounds of communication, FuseFL increases vulnerability to attacks.
Ans for Q2): Thanks for your important comments. Yes, in this work we do not explicitly consider the security issue. However, the vulnerability of FuseFL to attacks will not be higher than that of previous multi-round FedAvg, which requires many more communication rounds to achieve the same model performance as FuseFL. For example, FedAvg may require more than 100 rounds to reach the 70% test accuracy that FuseFL achieves within a few rounds, which involves more communicated information and a higher possibility of attacks. And FuseFL has the same communication size as other OFL methods.
Nevertheless, while the communication size does not increase, FuseFL does require more communication rounds than other OFL methods. The main additional attack risk is the injection of adversarial modules. A possible solution is to detect and reject such malicious uploads, also through the lens of causality.
For model inversion or membership attacks, we can consider adding differential privacy or using noised samples with invariant features while preserving model performance.
Q3: Extra computation overheads.
Similarly, FuseFL appears to increase client-side overheads by letting clients run training K times instead of once as in standard OFL. While the computation cost appears to decrease with K, it is nevertheless higher than client-side overheads in standard OFL approaches which mostly add server-side overheads.
Ans for Q3): Thanks for your valuable comments. We apologize for the missing experiment descriptions of FuseFL. We have considered this extra computational cost of repeated local training. To address it, we decrease the local training epochs of each client by a factor of K. For example, FedAvg and other OFL methods require 200 epochs; with K=4, for training and merging each block in each round of FuseFL, the local training epochs are decreased to 50. Thus, the total number of training epochs of FuseFL is the same as other OFL methods. The motivation for this design can be traced to progressive freezing during DNN training [5]. We have added the description of this design in our revision.
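The constant total-epoch budget described above can be sketched as follows; `fusefl_schedule` is a hypothetical helper name of ours, not from the paper.

```python
def fusefl_schedule(total_epochs, k):
    """Evenly split a fixed local-epoch budget across K block rounds,
    so FuseFL's total local computation matches a one-shot baseline."""
    base, rem = divmod(total_epochs, k)
    # Hand any remainder to the earliest rounds so the budget is exact.
    return [base + 1 if r < rem else base for r in range(k)]

# A 200-epoch budget with K=4 yields 50 epochs per block round.
assert fusefl_schedule(200, 4) == [50, 50, 50, 50]
```

Whatever K is chosen, the schedule sums to the same budget, so the client-side computation stays comparable to standard OFL training.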
Reference
[1] R. Dai, Y. Zhang, A. Li, T. Liu, X. Yang, and B. Han. Enhancing one-shot federated learning through data and ensemble co-boosting. In The Twelfth International Conference on Learning Representations, 2024.
[2] C. E. Heinbaugh, E. Luz-Ricca, and H. Shao. Data-free one-shot federated learning under very high statistical heterogeneity. In The Eleventh International Conference on Learning Representations, 2023.
[3] Jhunjhunwala, D., Wang, S. and Joshi, G., 2024, April. FedFisher: Leveraging Fisher Information for One-Shot Federated Learning. In the International Conference on Artificial Intelligence and Statistics (pp. 1612-1620). PMLR.
[4] Q. Li, B. He, and D. Song. Practical one-shot federated learning for cross-silo setting. In IJCAI 2021.
[5] M. Raghu, J. Gilmer, J. Yosinski, and J. Sohl-Dickstein. Svcca: singular vector canonical correlation analysis for deep learning dynamics and interpretability. In NeurIPS 2017.
The reviewer thanks the authors for their responses. The authors have well addressed the concerns regarding higher heterogeneity level, missing baseline and computational cost of FuseFL. The reviewer will hence increase the score.
However, as was mentioned in the review, the reviewer highly recommends adding a discussion section describing the security vulnerabilities and possible mitigations. This should not be seen as a weakness but something that will enhance the quality of the paper.
Dear Reviewer #V6z4,
Thank you for your prompt response during this busy period. We deeply appreciate your consideration in raising the score. Regarding the security issues, we provide the following possible mitigations and add them to our revision:
- Adversarial attacks: Some malicious clients might upload adversarial or backdoored modules intended to misguide the aggregated model into generating incorrect or handcrafted predictions. For these attacks, a possible solution is to detect and reject such malicious uploads, also through the lens of causality. Specifically, images with invariant features can be fed into the uploaded modules to check whether the output features can be used to correctly classify the images;
- Model inversion or membership attacks: Some malicious clients or the server may conduct model inversion or membership attacks to recover clients' raw data, thus threatening user privacy. In this case, the learned modules can be protected with differential privacy to enhance security.
We remain open and ready to delve into any more questions or suggestions you might have. Your constructive comments have significantly contributed to the refinement of our work.
Best regards and thanks,
The authors provide a causality viewpoint on the data heterogeneity problem in the training of one-shot FedAvg. A block-fusing mechanism is designed to provide more information aggregation during training, and the results show a significant improvement on the common ResNet and CIFAR benchmarks.
Strengths
Pros:
- One-shot FL is an interesting and important problem. Tackling non-IID data is difficult but significant, and the approach could be further extended to more industrial scenarios.
- The idea is well-presented and the performance is great according to the provided verification.
Weaknesses
Cons:
- The cost of the multiple-block communication needs to be considered. In practice, there might be a huge number of blocks, so connections must be kept alive or the system must wait for block communications in practical applications.
- The explanation part needs to be more rigorous. Please use theorems or propositions for clear statements. I believe the current version cannot provide a convincing theoretical explanation from the causal viewpoint.
- Larger-scale datasets and models need to be considered. The introduction suggests a large scope for LLM FL training, but no experiments are designed for it.
Questions
- what is the definition of "block"/"features"? I think it will influence the results from the theoretical point of view.
- How does the system deal with scenarios where disconnections occur during training? Can it still work (e.g., dynamic topologies in the block communications)?
- Please show the error bars of the experiments, since the method does not perform significantly better than Ensemble FL in a few cases.
- I'm also interested in how the method works in Transformer-based frameworks, and which parts should be aggregated for these models. I also wonder if the method can work on prompts for LLMs.
- Please verify the non-IID data-generating processes and whether and how they align with the explanation part.
Limitations
N/A.
We would like to thank the reviewer for their time in reviewing. We appreciate that you find the problem interesting, important, and practical, our idea clear, and the performance great. Please see our detailed feedback on your concerns below. Some answers can be found in the global response due to limited space.
W1 & Q2: The cost of the multiple-block communication and the disconnection.
Ans for W1 & Q2): Thanks for indicating this practical problem. The synchronous waiting problem also exists in traditional FL methods like FedAvg and others. Given the number of clients N and the model size M, traditional FL methods require communication cost O(R × N × M), in which R is the number of communication rounds, normally ranging from 100 to 1000. However, our method only requires O(N × M) communication cost. For the K communication rounds in FuseFL, we can communicate more layers per round to reduce the number of blocks (corresponding to the different K values in Tables 2 and 4 of the original paper).
Considering disconnection, FuseFL needs to wait for the clients; this also happens in traditional FL. One possible solution is to continue the training process with the other clients. After reconnecting, a client loads the newest global model and continues training.
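The communication-cost comparison above can be illustrated with a minimal arithmetic sketch; the function names are ours, and block sizes are idealized as exactly 1/K of the model each.

```python
def fedavg_comm_cost(n_clients, model_size, rounds):
    """Multi-round FedAvg: every client uploads the full model each round,
    so total volume grows linearly with the number of rounds R."""
    return rounds * n_clients * model_size

def fusefl_comm_cost(n_clients, model_size, k):
    """FuseFL: each client uploads one block (~model_size / K) per round
    for K rounds, totaling a single full-model upload per client."""
    return k * n_clients * (model_size / k)

# With 10 clients, a 100 MB model, and R = 100 FedAvg rounds,
# FedAvg moves 100x more data than FuseFL at any K.
assert fedavg_comm_cost(10, 100, 100) == 100 * fusefl_comm_cost(10, 100, 4)
```

The K in `fusefl_comm_cost` cancels out, which is the sense in which FuseFL's total communication matches a one-shot upload despite using multiple rounds.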
W2 & Q1: The explanation part.
Ans for W2 & Q1): Thanks for your important comments. We will revise the paper to make the causality explanation clearer. Specifically, we will rewrite Lemma 3.1 to show how the locally learned spurious features influence the relevant mutual information quantities. A "block" means a single layer or several consecutive layers in a DNN; for example, one or two consecutive convolutional layers in a CNN can be seen as a block. Features refer to the outputs of blocks. For simplicity, the same symbol represents both the block on a client and the features output from it, as shown in Figure 1 of the original paper. Also, please refer to the further explanation in Q1 of Reviewer #mHC6 and the global response.
W3 & Q4: Large-scale models like LLMs.
Ans for W3 & Q4): Thanks for your important comments. To deploy transformer-based frameworks like current LLMs with the core idea of FuseFL in one-shot FL, we envision two methods here.
- Concat-and-freeze: similar to training ResNet in FuseFL, we can block-wisely train and collect the transformer blocks in each round; during local training, the output features of all transformer blocks are concatenated and fed into the subsequent layers. Due to the large resource consumption of pretraining, we do not evaluate this idea here.
- Averaging-and-freeze LoRA: here we consider fine-tuning scenarios with LoRA [1]. LoRA blocks can be seen as additional matrix mappings applied to the local Q/V attention and MLP layers; the output is the original feature plus the LoRA output. To use LoRA in FuseFL, we can follow the MoE style [2] or the averaging style [1]. Specifically, we average the LoRAs of different clients together, then freeze all LoRAs in each transformer block to fix the aggregated features in each communication round. The following table shows the performance of fine-tuning Llama2-7B with FuseFL versus few-shot FedAvg on the OpenFedLLM [3] benchmark (20 clients with Alpaca-GPT4 and MedAlpaca). Results show that FuseFL, with far fewer communication rounds, can outperform FedAvg with 50 communication rounds.
| Task | MMLU Overall | MMLU Humanity | MMLU Social Science | MMLU STEM | MMLU others | BBH |
|---|---|---|---|---|---|---|
| Base Model | 38.7 | 36.9 | 48.2 | 32.6 | 44.2 | 34.5 |
| FedAVG (50 rounds) | 41.3 | 38.9 | 45.9 | 34.8 | 46.9 | 38.9 |
| FuseFL (4 rounds) | 42.3 | 39.4 | 47.6 | 34.9 | 48.7 | 39.6 |
| Task | MedQA | MedMCQA |
|---|---|---|
| Base Model | 21.7 | 20.4 |
| FedAVG (50 rounds) | 29.5 | 33.3 |
| FuseFL (4 rounds) | 30.1 | 34.2 |
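The averaging-and-freeze LoRA idea can be sketched with NumPy as below; the function names are our illustration. Note that averaging the A and B factors separately (as in the FedAvg-of-LoRA style) is not identical to averaging the products B·A, which is one of the design choices involved.

```python
import numpy as np

def average_and_freeze_lora(client_loras):
    """Average per-client LoRA factors (A: r x d_in, B: d_out x r) for one
    transformer block, then mark the merged adapter as frozen so the
    aggregated features stay fixed in later rounds."""
    a_avg = np.mean([a for a, _ in client_loras], axis=0)
    b_avg = np.mean([b for _, b in client_loras], axis=0)
    return {"A": a_avg, "B": b_avg, "frozen": True}

def lora_forward(x, base_w, adapter):
    """Layer output = base linear mapping plus the merged LoRA update B(Ax)."""
    return base_w @ x + adapter["B"] @ (adapter["A"] @ x)
```

A frozen adapter is then excluded from the next round's local optimization, mirroring the progressive freezing used for ResNet blocks in FuseFL.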
Q3: Error bars.
Ans for Q3): Thanks for your important comments. We have added the error bars for the main experiments as follows and revised the paper. We report a subset of them here due to limited space.
| Dataset | FMNIST α=0.1 | FMNIST α=0.3 | FMNIST α=0.5 | CIFAR-10 α=0.1 | CIFAR-10 α=0.3 | CIFAR-10 α=0.5 |
|---|---|---|---|---|---|---|
| FedAvg | 41.69 ± 3.58 | 82.96 ± 2.82 | 83.72 ± 2.21 | 23.93 ± 3.06 | 27.72 ± 1.21 | 43.67 ± 1.84 |
| FedDF | 43.58 ± 3.39 | 80.67 ± 2.42 | 84.67 ± 1.95 | 40.58 ± 3.72 | 46.78 ± 1.52 | 53.56 ± 1.44 |
| Fed-DAFL | 47.14 ± 3.56 | 80.59 ± 2.42 | 84.02 ± 2.17 | 47.34 ± 2.82 | 53.89 ± 1.39 | 58.59 ± 1.15 |
| Fed-ADI | 48.49 ± 2.84 | 81.15 ± 2.29 | 84.19 ± 1.71 | 48.59 ± 3.18 | 54.68 ± 1.59 | 59.34 ± 1.23 |
| DENSE | 50.29 ± 3.15 | 83.96 ± 2.42 | 85.94 ± 1.55 | 50.26 ± 2.52 | 59.76 ± 1.82 | 62.19 ± 1.28 |
| Ensemble | 67.71 ± 1.94 | 87.25 ± 1.02 | 89.42 ± 0.57 | 57.5 ± 1.82 | 77.35 ± 1.21 | 79.91 ± 0.63 |
| FuseFL K=2 | 83.15 ± 1.35 | 89.94 ± 0.72 | 89.47 ± 0.45 | 70.85 ± 1.88 | 81.41 ± 1.09 | 84.34 ± 0.69 |
| FuseFL K=4 | 83.05 ± 1.56 | 84.58 ± 0.57 | 90.50 ± 0.49 | 73.79 ± 1.62 | 84.58 ± 0.92 | 81.15 ± 0.52 |
| FuseFL K=8 | 83.2 ± 1.39 | 88.57 ± 0.85 | 88.24 ± 0.52 | 70.46 ± 1.29 | 80.70 ± 1.25 | 74.99 ± 0.91 |
Reference
[1] LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition. In COLM 2024.
[2] LoRAMoE: Alleviate World Knowledge Forgetting in Large Language Models via MoE-Style Plugin. In ACL 2024.
[3] OpenFedLLM: Training Large Language Models on Decentralized Private Data via Federated Learning. In KDD 2024.
Thank you for your comprehensive explanations and diligent efforts during the rebuttal stage. Your responses have addressed some of my concerns.
Dear Reviewer #FVeg,
We're grateful for your feedback during this busy period. We remain open and ready to delve into any more questions or suggestions you might have until the last moment. Your constructive comments and questions have significantly enhanced the quality of our work.
Best regards and thanks
This paper proposes a method, FuseFL, that aims to perform one-shot FL by progressively fusing models from multiple clients. The motivation is that within each local client, a model might learn spurious features due to underlying spurious correlations, adversarial attacks, and shortcuts. However, at the global level, the fusion process would eliminate those spurious features and emphasize the core/invariant features. The authors propose a strategy of one-shot fusion of models from multiple clients, followed by experimental results. They also include arguments using mutual information over a Markov chain that use the Data-Processing Inequality. Experimental results show benefits over other methods in terms of model accuracy.
UPDATE: Increased soundness to 3, Rating to 7 Accept
Strengths
The paper addresses a very interesting problem at the intersection of federated learning and spurious correlations. It is an interesting observation that spurious correlations prevalent at the local model would go away at the global stage by appropriate fusion.
The idea of one-shot fusion is interesting. They have made an attempt to explain what is going on using information theory and information bottleneck principles.
Experiments compare their strategy with other strategies across several datasets.
Weaknesses
- I have some concerns about the clarity of the theoretical analysis here, which, if resolved, will increase my rating, because I otherwise liked the idea of the paper. I believe Nuisance is mathematically different from Spurious Features (you are assuming them to be the same here)?
Nuisance [2] is N which has no information about the true label Y. But, typically in spurious correlation papers, Rspu is something that has information about Y even though they are not causally related, e.g., in Waterbirds dataset, many waterbirds are photographed in front of water background. So, water background (B) naturally has a correlation with bird label (Y) and is not nuisance. But, it is not desirable to use B in the model since it performs poorly on minority groups. Here B is Rspu but not Nuisance.
The authors also write I(Rspu;H)=0 means spurious feature is not being used. This would also hold for nuisance but not spurious feature as in the example above. One wants the model to not mechanically use the background B. But, what it actually uses instead of the background can still have correlations with the model outputs, and hence mutual information can still be there.
I think what you call Rspu is actually Nuisance at a global level. You actually assume that for the global dataset, Rspu satisfies the definition of nuisance, i.e., being independent with Y, but is dependent at a local level?
Could you include a clear notation and definition section where you clarify your notations and independence assumptions on Rspu and other terms at global and local levels?
-
Experiments: In spurious correlation papers, one is often interested in worst-group/minority group accuracy, i.e., waterbirds on land background. Overall accuracy even decreases sometimes. Could you comment on this in the context of your experiments? How are the local accuracies?
-
There is an overreliance on causality literature including citing Pearl. However, I feel the strategy and the novelty are not that reliant on causality.
-
The fusion strategy is not properly described. Everything points to one equation (5) which is not described clearly.
Questions
See Weaknesses above.
Could you include a clear notation and definition section where you clarify your notations and independence assumptions on Rspu and other terms at global and local levels?
Is I(Hk; Rspu)=0 necessary or sufficient?
This is not really a causality paper. I believe the overreliance on causality is not necessary here.
The fusion strategy is not properly described. Everything points to one equation (5) which is not described clearly.
Limitations
Discussed above
We would like to thank the reviewer for taking the time to review our work. We appreciate that you find the problem we address, our observation, and our core idea very interesting. Please see our detailed feedback on your concerns below.
Q1: Clarity of theoretical analysis.
Rspu is something that has information about Y even though they are not causally related, e.g., in the Waterbirds dataset, many waterbirds are photographed in front of water backgrounds. So, the water background (B) naturally has a correlation with the bird label (Y) and is not a nuisance. But it is not desirable to use B in the model since it performs poorly on minority groups. Here B is Rspu but not Nuisance. The authors also write that I(Rspu;H)=0 means the spurious feature is not being used. This would also hold for a nuisance but not a spurious feature as in the example above. One wants the model to not mechanically use the background B. But what it actually uses instead of the background can still have correlations with the model outputs, and hence mutual information can still be present. I think what you call Rspu is actually a Nuisance at the global level. You actually assume that for the global dataset, Rspu satisfies the definition of nuisance, i.e., being independent of Y, but is dependent at the local level?
Ans for Q1): Thanks for your important comments on the theoretical analysis. Yes, what we call Rspu is actually a nuisance at the global level (the global dataset across all clients). Thus, on one local client, the spurious feature is independent of the label in the global dataset but dependent on it in the local dataset. I(Hk; Rspu)=0 is a sufficient condition for not overfitting to spurious correlations, because when I(Hk; Rspu)=0 is not satisfied and Hk captures some spurious features, one can adjust the final classifier to avoid spurious fitting [1,2,3]. Besides, our analysis in Section 3.2 and the empirical study in Section 3.3 align with your understanding.
We will revise the paper and include a section with clear definitions. Specifically, we will include the definitions of the local and global features and labels, and the local spurious features. Furthermore, we will rewrite Lemma 3.1 to make it clear in the federated learning scenario.
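As a compact sketch of the definitions promised above (our notation here is illustrative, not the paper's final formulation), the global-nuisance assumption and the sufficient condition for client k's features can be written as:

```latex
% R_spu is a nuisance globally but spuriously correlated on client k:
R_{\mathrm{spu}} \,\perp\!\!\!\perp\, Y
  \quad \text{under the global distribution } P_{\mathcal{D}},
\qquad
R_{\mathrm{spu}} \,\not\perp\!\!\!\perp\, Y
  \quad \text{under the local distribution } P_{\mathcal{D}_k}.

% Sufficient (but not necessary) condition for client k's representation H_k:
I(H_k;\, R_{\mathrm{spu}}) = 0 .
```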
Q2: Experiment issues.
In spurious correlation papers, one is often interested in worst-group/minority group accuracy, i.e., waterbirds on land background. Overall accuracy even decreases sometimes. Could you comment on this in the context of your experiments? How are the local accuracies?
Ans for Q2): Thanks for your important questions on the worst-group accuracy. In FL, the local accuracies refer to the accuracies on the clients' local datasets, which is common in personalized FL [4]. Specifically, the local test datasets have similar data distributions to the training datasets (according to the ratio of the labels). One goal of designing experiments with the backdoored FL CIFAR-10 datasets is to study the majority/minority accuracy. For a backdoored client, images with local spurious shapes and colors become the majority group on it (Figure 5 in the original paper), while the normal images become the minority group. We report the local/global accuracies below. Backdoored (BD) clients fit the handcrafted spurious features and thus have lower global accuracy than normal clients.
| Client | BD-0 | BD-1 | Normal-0 | Normal-1 | Normal-2 |
|---|---|---|---|---|---|
| Local Acc. | 100.0 | 100.0 | 99.7 | 99.9 | 100.0 |
| Global Acc. | 32.6 | 27.1 | 41.2 | 42.3 | 38.4 |
Q3: Overreliance on causality literature.
There is an overreliance on causality literature including citing Pearl. However, I feel the strategy and the novelty are not that reliant on causality.
Ans for Q3): Thanks for your valuable comments. Yes, our strategy does not explicitly adopt causal discovery to improve one-shot FL. Our motivation is to interpret the mechanism and the root causes of why previous one-shot FL fails through the lens of causality. We appreciate your kind suggestion and will highlight the main contribution in the revision.
Q4: Unclear fusion strategy.
Everything points to one equation (5) which is not described clearly.
Ans for Q4): Thanks for your question. Eq.(5) describes the chain structure of layers in DNNs: the output of each module serves as the input to the next module, so later blocks are trained on top of the features produced by the earlier (already fused and frozen) blocks.
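As a hypothetical illustration of this chain structure (a minimal sketch in plain numpy, not the authors' implementation; block sizes and the freezing flag are our own assumptions), a model can be viewed as composed blocks whose outputs feed the next block, with bottom-up blocks frozen once fused:

```python
import numpy as np

# Sketch of the chain structure in Eq.(5): a DNN as K composed blocks,
# h_k = f_k(h_{k-1}). In a FuseFL-style bottom-up schedule, block k is
# fused and frozen before block k+1 is trained on its fixed features.

rng = np.random.default_rng(0)

def make_block(d_in, d_out):
    """One linear+ReLU block with illustrative random weights."""
    W = rng.standard_normal((d_in, d_out))
    return lambda h: np.maximum(h @ W, 0.0)

dims = [8, 16, 16, 4]  # hypothetical feature dimensions per block
blocks = [make_block(dims[i], dims[i + 1]) for i in range(len(dims) - 1)]
frozen = [False] * len(blocks)  # fusion state, filled in bottom-up order

h = rng.standard_normal((2, dims[0]))  # a toy batch of two inputs
for f in blocks:   # forward pass: each block consumes the previous output
    h = f(h)

frozen[0] = True   # after round 1, the first fused block's features are fixed
```

The point of the sketch is only the dependency structure: because each block's input is the previous block's output, freezing the lower blocks fixes the feature distribution that upper blocks are trained on.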
References
[1] Class-Balanced Loss Based on Effective Number of Samples. In CVPR 2019.
[2] Learning debiased classifier with biased committee. In NeurIPS 2022.
[3] Towards last-layer retraining for group robustness with fewer annotations. In NeurIPS 2024.
[4] On Bridging Generic and Personalized Federated Learning for Image Classification. In ICLR 2022.
I have read the response.
Based on the proposed edits to the theoretical section, I increased my soundness score to 3, and also updated my rating to Accept 7.
Only one more thing: I feel that "through the Lens of Causality" is not well suited for the title of the paper since the authors also acknowledge that the over-reliance on causality is not needed and this is not a causality paper. Something else to highlight the information-bottleneck-style analysis will be better.
Dear Reviewer #mHC6,
We're grateful for your quick feedback, and we deeply appreciate your consideration in raising the score. Thank you for your suggestion; we will try to give a more suitable title from the perspective of the information bottleneck and invariant/spurious features.
We remain open and ready to delve into any further questions or suggestions you might have. Your constructive comments have significantly contributed to the refinement of our work.
Best regards and thanks,
In the paper: "causal" could be replaced with sequential or temporal
In the title: The role of information bottleneck/mutual information could be highlighted
Related works: once the problem is presented as a local-global tussle, other relevant works within federated learning that also examine this tussle should be acknowledged. For instance, the problem of local-global fairness seems to be quite aligned with this work, and a fusion approach is indeed quite valuable. Some suggested references: [1] Yahya H Ezzeldin, Shen Yan, Chaoyang He, Emilio Ferrara, and A Salman Avestimehr. Fairfed: Enabling group fairness in federated learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 7494-7502, 2023. [2] F. Hamman and S. Dutta, "Demystifying Local & Global Fairness Trade-offs in Federated Learning Using Partial Information Decomposition." International Conference on Learning Representations (ICLR 2024).
Good work!
Dear Reviewer #mHC6,
Thanks for providing these related works [1,2]. Removing local biased feature is an interesting idea to enhance the global fairness. We will add these discussions and related works into our revision. Also, we will modify the paper to highlight the information bottleneck/mutual information and the title. Thanks a lot for your suggestions.
Best regards and thanks,
References
[1] Yahya H Ezzeldin, Shen Yan, Chaoyang He, Emilio Ferrara, and A Salman Avestimehr. Fairfed: Enabling group fairness in federated learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 7494–7502, 2023.
[2] F. Hamman and S. Dutta, "Demystifying Local & Global Fairness Trade-offs in Federated Learning Using Partial Information Decomposition." International Conference on Learning Representations (ICLR 2024).
We sincerely thank all reviewers for taking the time to review our work. We appreciate that you find that we address a very interesting and important problem in FL (Reviewers #mHC6, #FVeg, #V6z4), that our observation is novel (Reviewer #mHC6), that our trials are significant and can be extended to more industrial scenarios (Reviewer #FVeg), that our idea and algorithm are interesting and novel (Reviewers #mHC6, #V6z4, #twUP), that the experiments are comprehensive and the algorithm is effective (Reviewers #mHC6, #V6z4, #twUP), and that the writing is clear (Reviewers #FVeg, #V6z4, #twUP).
Here, we provide an overview of responses to main questions for your convenience.
Q1: More justification on the theory part (Reviewer #mHC6 and #FVeg)
Overview of Answers for Q1): We provide more clarification for Q1 from Reviewer #mHC6: Rspu is actually a nuisance at the global level (the global dataset across all clients). We will revise the paper and include a section with clear definitions. Specifically, we will include the definitions of the local and global features and labels, and the local spurious features. Furthermore, we will rewrite Lemma 3.1 to make it clear in the federated learning scenario. We also add more explanations about the worst-group/minority-group accuracy (Reviewer #mHC6) and the definitions of "block"/"features" (Reviewer #FVeg) in the corresponding responses.
Q2: More experiments and extensions to more scenarios (Reviewer #FVeg, #V6z4 and #twUP)
Overview of Answers for Q2): Thanks for all the important comments from the reviewers. We summarize the additional experiments here.
- LLM extension (Reviewer #FVeg): We mainly consider the federated fine-tuning scenario with LoRA. To use LoRA in FuseFL, we can follow the MoE style [2] or the averaging style [1]. Specifically, we average the LoRAs from different clients, then freeze the averaged LoRAs in each transformer block to fix the aggregated features in each communication round. We follow the FL-LLM setting (20 clients with Alpaca-GPT4 and MedAlpaca) in the OpenFedLLM [3] benchmark to conduct limited-round FL. The results (Table in response to Reviewer #FVeg) show that FuseFL with far fewer communication rounds can outperform FedAvg with 50 communication rounds.
- Higher heterogeneity and a recent advanced baseline (Reviewer #V6z4): We provide more experiments with a=0.05 to compare FuseFL with other FL algorithms, and add a recent OFL baseline, CoBoosting, for comparison. Results (Table 2 in response to Reviewer #V6z4) show that FuseFL outperforms CoBoosting and achieves larger improvements than the other baseline methods.
- Multi-round baselines (Reviewer #twUP): For a fair comparison under the same number of communication rounds, we report the results (Tables 1 & 2 in response to Reviewer #twUP) of multi-round versions of the baseline methods. Results show that, given the same communication rounds, FuseFL still achieves the best performance among all baselines.
- Another label skew (Reviewer #twUP): We conduct new experiments with the quantity-based label skew of #C=2 and 10 clients. The test accuracies (Table 3 in response to Reviewer #twUP) illustrate that FuseFL outperforms the other baseline methods.
- More clients (Reviewer #twUP): To validate the scalability of FuseFL, we try our best to increase the number of clients to 100 for OFL methods. Results (Table 4 in response to Reviewer #twUP) show that FuseFL outperforms all baseline methods.
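The "averaging style" for LoRA mentioned in the first bullet can be sketched as follows (a minimal numpy illustration under our own assumptions: each client holds low-rank factors A_k, B_k for the same transformer block; dimensions and names are hypothetical, not from the paper):

```python
import numpy as np

# Hypothetical sketch of averaging clients' LoRA factors for one transformer
# block before freezing the block's aggregated features in a round.

rng = np.random.default_rng(0)
d, r = 6, 2  # hidden size and LoRA rank (illustrative values)

# Three clients, each with low-rank factors (A_k, B_k) for the same block.
clients = [(rng.standard_normal((d, r)), rng.standard_normal((r, d)))
           for _ in range(3)]

# Server-side averaging of the factors across clients.
A_avg = np.mean([A for A, _ in clients], axis=0)
B_avg = np.mean([B for _, B in clients], axis=0)

# The fused low-rank update that would be applied to the frozen base weight.
delta_W = A_avg @ B_avg
```

One design caveat worth noting: averaging the factors A and B separately is not identical to averaging the per-client products A_k B_k; which variant the authors use is not specified in this response, so the sketch above is only one plausible reading.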
Q3: Security issues (Reviewer #V6z4)
Overview of Answers for Q3): In this work, we do not explicitly consider security issues. However, FuseFL's vulnerability to attacks will not be higher than that of multi-round FedAvg, which requires many more communication rounds to achieve the same model performance as FuseFL. Moreover, FuseFL does not increase the communication costs.
Nevertheless, compared with other OFL approaches that use a single communication round, FuseFL increases the number of communication rounds. To this end, we propose several possible solutions to enhance the security of FuseFL:
- Adversarial attacks: we can detect and reject malicious uploads, also through the lens of causality.
- Model inversion or membership attacks: we can add differential privacy or use noised samples with invariant features while maintaining model performance.
Q4: Extra overheads (Reviewer #V6z4)
Overview of Answers for Q4): We have considered the extra computation introduced by the repeated local training of blocks. To address this, we decrease the local training epochs of each client proportionally to the number of fusion rounds. For example, FedAvg and the other OFL methods require 200 epochs; in FuseFL, for training and merging each block in each round, the local training epochs are decreased to 50. Thus, the total number of training epochs of FuseFL is the same as that of the other OFL methods.
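The epoch-budget argument can be checked with a short computation; the number of blocks (4) is our assumption, inferred from the 200-to-50 example in the response above:

```python
# Keeping FuseFL's total local compute equal to the OFL baselines' budget:
# with K blocks (one fused per round), each round trains for total/K epochs.
TOTAL_EPOCHS = 200   # budget used by FedAvg and other OFL methods (from the text)
K = 4                # assumed number of blocks/rounds, consistent with 200 -> 50

epochs_per_round = TOTAL_EPOCHS // K
total_fusefl_epochs = epochs_per_round * K  # equals the baselines' budget
```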
The paper identifies and handles a problem in federated learning where local clients overfit to spurious features. The reviewers found this view, specifically the connection between federated learning and causality, to be interesting and novel. There were some concerns regarding missing experiments, but these were mitigated in the rebuttal phase. There is a near consensus that the provided experiments comparing the method to SOTA baselines are thorough and present convincing empirical evidence of the value of the proposed method. The theoretical analysis was also appreciated by the reviewers, though they had comments on how to improve its clarity (see e.g. mHC6). Overall, the method seems novel and the empirical results convincing; this paper should be a good addition to NeurIPS.