PaperHub
6.1 / 10
Poster · 4 reviewers
Ratings: 3, 3, 3, 4 (min 3, max 4, std dev 0.4)
ICML 2025

Provably Near-Optimal Federated Ensemble Distillation with Negligible Overhead

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-07-24

Abstract

Keywords
Federated learning, ensemble distillation, data heterogeneity, generative adversarial network

Reviews and Discussion

Review
Rating: 3

This paper presents a near-optimal and practical client weighting method for federated ensemble distillation that leverages client discriminators trained with a server-distributed generator and local datasets, supported by rigorous theoretical analysis and experimental validation. The work has significant research value and application potential in the fields of federated learning and distributed machine learning.

Update after rebuttal

The authors answered my questions, and I have increased my score by 1 after reading the rebuttal.

Questions for Authors

  1. "We proposed the FedGO algorithm, which effectively addresses the challenge of client data heterogeneity." Please explain how your proposed method tackles the issue of client data heterogeneity?

Claims and Evidence

Yes. This paper provides theoretical analysis and experiments.

The paper provides rigorous mathematical derivations and theoretical analyses, such as the proofs of Theorem 3.4 and Theorem 3.6, which validate the correctness and optimality of the method. This theoretical rigor lays a solid foundation for practical applications of the method and demonstrates its high credibility and general applicability in addressing real-world problems.

Although the paper provides reliable theoretical guarantees for the proposed weighting method, there are still some flaws in the theory: Theorem 3.4 assumes convex functions, which does not align with the experimental setting (ResNet-18) of the paper. Perhaps the authors should consider a more practical assumption such as L-smoothness?

Methods and Evaluation Criteria

Yes.

But the theoretical assumptions are overly idealized: the GAN discriminator must accurately estimate the ratio of data distributions, which requires effective training of both the generator G and the discriminators D_k. Furthermore, if the GAN discriminator becomes too accurate, it may lead to unnecessary privacy breaches.

Theoretical Claims

I have not checked the proofs in the appendix.

Experimental Design and Analysis

Yes.

The experimental settings, models, and baselines are sound.

Supplementary Material

I have not reviewed the supplementary material.

Relation to Prior Literature

This paper is closely related to work on federated learning under client heterogeneity.

Missing Essential References

Not found yet.

Other Strengths and Weaknesses

Weakness: The research in this paper is effectively limited by the client-side data situation, because training a discriminator to represent a distribution requires an extremely high volume of client data. This is a cross-silo scenario, in which each client needs a significant amount of data to accurately represent its distribution.

Other Comments or Suggestions

No

Author Response

Comments on theoretical assumptions

We believe the reviewer's suggestion regarding L-smoothness may arise from a different interpretation of the convexity assumption. The reviewer seems to have interpreted the convexity assumption with respect to the model parameter θ, whereas our convexity assumption is in fact with respect to the model output ŷ.

While deep neural networks like ResNet-18 do not exhibit convexity with respect to θ, our theoretical results do not rely on convexity with respect to θ. Instead, we assume the convexity of the loss function with respect to ŷ, which is standard in the literature. We acknowledge that our explanation may have caused some confusion, and we will revise the statement in our paper to make this clearer.
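As a concrete illustration of this distinction (our own example, not a statement from the paper): the cross-entropy loss is convex in the output ŷ regardless of how non-convex the network is in its parameters θ,

\ell(\hat{y}, y) = -\sum_{c=1}^{C} y_c \log \hat{y}_c, \qquad \nabla^2_{\hat{y}}\, \ell(\hat{y}, y) = \mathrm{diag}\!\left(\frac{y_1}{\hat{y}_1^{2}}, \ldots, \frac{y_C}{\hat{y}_C^{2}}\right) \succeq 0,

so Jensen's inequality can be applied to a weighted average of client outputs \sum_k w_k \hat{y}_k without any convexity assumption on θ.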

Comments on GAN-based assumptions

We agree that, to theoretically guarantee the achievement of optimal single-model performance, the generator and discriminator must be properly trained. However, our experimental results in Appendix F.7 demonstrate that the proposed method is robust to the quality of the generator and discriminator. In particular, in Appendix F.7.1, we showed that FedGO even with a completely untrained generator outperforms FedDF in terms of both server test accuracy and ensemble test accuracy. Also, in Appendix F.7.2, we can see that FedGO still significantly outperforms baselines even when client discriminators were trained at only one-sixth of the main setting.

Next, regarding privacy leakage due to the provision of the discriminator, we have conducted a comprehensive privacy leakage analysis in Appendix G. Table 15 demonstrates that by incorporating local differential privacy (LDP), FedGO can guarantee a client-side privacy level comparable to FedAVG. Furthermore, as a measure to prevent excessive client distribution leakage, we adopted a simple four-layer CNN for the discriminator, as detailed in Appendix E.2 and F.7.3, and implemented the output activation using a double composite sigmoid function, restricting the discriminator’s output range to [sigmoid(0), sigmoid(1)].
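To make the output restriction concrete, here is a minimal sketch of such a bounded discriminator head (our illustration only, assuming a PyTorch-style implementation; the class name, the single linear layer, and the in_features argument are hypothetical and not taken from the paper):

import torch
import torch.nn as nn

class BoundedDiscriminatorHead(nn.Module):
    # Illustrative sketch, not the authors' code: a double composite sigmoid
    # sigmoid(sigmoid(z)) maps any real-valued logit into the interval
    # (sigmoid(0), sigmoid(1)) ≈ (0.5, 0.731), limiting how much of the client
    # data distribution the raw discriminator output can reveal.
    def __init__(self, in_features: int):
        super().__init__()
        self.fc = nn.Linear(in_features, 1)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(torch.sigmoid(self.fc(z)))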

Comments on client-side data limitations

Our experimental results demonstrate that, contrary to the reviewer's concern, FedGO is effective even with low-volume client datasets. We present experimental results not only for 20 clients but also for 100 clients in Appendix F, where each client has an average of only 250 data samples. Figure 7 shows that FedGO achieves significant performance improvements over existing baselines even in such a client data-deficient situation.

Comments on FedGO and data heterogeneity

Federated ensemble distillation algorithms leverage additional unlabeled datasets at the server to perform pseudo-labeling and knowledge distillation, thereby enhancing server model performance. However, in client data-heterogeneous situations where client distributions are highly diverse, the inference quality for a server unlabeled data point x varies significantly across clients. Using a fixed weighting function (e.g., uniform weighting) for pseudo-labeling can degrade the quality of pseudo-labels in proportion to the average discrepancy between the client average distribution p and each individual client distribution p_k (as summarized in Table 1 of our paper, specifically (1.1)).

Thus, research has focused on developing weighting methods that assign higher weights to clients with higher inference quality per data sample x. DaFKD is one such method. However, in large-client scenarios, its generalization bound becomes vacuous, and there has been little research on weighting methods that are theoretically guaranteed and robust to client data heterogeneity.

In this paper, we demonstrate in Theorem 3.4 that assigning weights in a specific manner ensures that pseudo-label performance remains independent of client data heterogeneity while providing the tightest existing generalization bound (as discussed in Table 1, specifically (1.3), and Definition 3.1). Furthermore, we show that the generalization bound for the server model trained on these pseudo-labels is expressed in terms independent of client data heterogeneity, proving that our ensemble distillation scheme is theoretically robust to client data heterogeneity.

We modeled this weighting function through client discriminators, as presented in Theorem 3.6, and implemented it via FedGO. Our experimental results across various settings confirm that FedGO is significantly more robust to client data heterogeneity compared to baseline algorithms.
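In equation form, the weighting described above can be summarized as follows (our paraphrase for readability, not a verbatim statement of Theorems 3.4 and 3.6): with n_k samples at client k, client model f_k, and average client distribution p = \sum_k (n_k/n) p_k, the pseudo-label at a server sample x is formed as

\bar{f}(x) = \sum_{k} w_k(x)\, f_k(x), \qquad w_k(x) \propto n_k\, p_k(x),

so clients whose local distribution actually covers x dominate its pseudo-label, and the pseudo-label quality no longer depends on the discrepancy between p and the individual p_k.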

We hope that this answers your question. We will emphasize this aspect further in the paper to enhance clarity.

Review
Rating: 3

FedGO, a method for federated ensemble distillation (FED) that optimally assigns weights to client predictions using client-trained discriminators, is theoretically justified by GAN principles. It mitigates client data heterogeneity. Experiments on image classification datasets show FedGO outperforms existing approaches in accuracy and convergence speed.

Questions for Authors

N/A

Claims and Evidence

The majority of the claims in the paper are well-supported by theoretical proofs and empirical results, but there are a few areas where the evidence is either incomplete or lacks robustness. Below, I assess the validity of key claims and identify potential issues.

  • While the paper does test both cases (with and without a server dataset), the results for the data-free setting are not extensively analyzed.
  • If a malicious client manipulates its discriminator outputs, can it bias the ensemble weights? The authors do not address this risk.
  • While Theorem 3.6 is theoretically sound, it assumes the existence of an optimal discriminator for each client. However, in practice, clients may not have sufficient data to train an optimal discriminator.

Methods and Evaluation Criteria

  • FedGO consistently outperforms existing methods (e.g., FedDF, FedGKD+, DaFKD) in accuracy and convergence speed across multiple datasets and data heterogeneity settings.

  • FedGO introduces minimal computational, communication, and privacy costs for clients, which is crucial for federated learning in practical settings.

  • The method assumes that a generator can be pretrained or trained collaboratively, but in many real-world FL scenarios, clients might lack sufficient training data or computational power to train discriminators efficiently. The use of off-the-shelf generators is promising, but their effectiveness for out-of-distribution client data needs more validation.

  • The experiments are well-structured, but the paper does not isolate the contributions of different aspects of FedGO (e.g., GAN-based weighting vs. ensemble distillation itself). Would a simpler weighting heuristic perform nearly as well?

Theoretical Claims

The paper provides a strong theoretical foundation, proving that the proposed weighting scheme leads to near-optimal ensemble model performance. The authors use results from GAN discriminator theory to derive optimal weight assignment for ensemble learning.

While the theoretical analysis is strong, the paper does not discuss practical deployment challenges, such as latency, scalability for larger client populations, or potential biases in GAN-based weighting. In addition, I am also curious how well this method would generalize to non-image tasks (e.g., NLP, healthcare, or IoT applications).

Experimental Design and Analysis

The method assumes that a generator can be pretrained or trained collaboratively, but in many real-world FL scenarios, clients might lack sufficient training data or computational power to train discriminators efficiently. The use of off-the-shelf generators is promising, but their effectiveness for out-of-distribution client data needs more validation.

Supplementary Material

I reviewed the theoretical analysis part.

Relation to Prior Literature

The literature study is comprehensive.

Missing Essential References

No

Other Strengths and Weaknesses

N/A

Other Comments or Suggestions

N/A

Author Response

Comments on data-free setting

We believe we have thoroughly examined both scenarios, with and without a server dataset. For the data-free setting, we conducted experiments using both an off-the-shelf generator and a generator trained via federated learning, and analyzed the results in Appendix F.5. Also, we analyzed the communication, privacy, and computational complexity of data-free FedGO in Appendix G.

Comments on security risks

In accordance with the reviewer's comment, we have conducted additional experiments where 5 and 10 out of 20 clients were Byzantine, outputting only the maximum value for the discriminator. The results showed that while the accuracy on the CIFAR-10 classification task under α = 0.05 was initially 72.35 ± 9.01, it dropped to 69.75 ± 5.05 with 5 Byzantine clients and 66.38 ± 4.97 with 10 Byzantine clients. Even in this extreme scenario where half of the participants were Byzantine, our method significantly outperformed all the baselines that did not utilize a discriminator. We will report this result in the final paper.

Comments on theoretical assumptions

As shown in the experimental results in Appendix F.1 and Appendix F.5, FedGO outperforms existing baselines even when clients have limited data. Specifically, the experiments were conducted with 100 clients, each having an average of only 250 data samples—an amount that is insufficient for training an optimal discriminator.

Comments on generator assumptions

For the case (G3) where a generator and a discriminator are trained using an FL approach, we have already addressed the reviewer’s concern by conducting experiments with a small number of training samples per client (250 images per client when there are 100 clients) in Appendix F.5 and by showing that additional client-side computational overhead is negligible compared to FedDF, which does not train a generator and a discriminator.

For the case (G2) where an off-the-shelf generator is used, we have also conducted experiments where the distribution of the generator differs from that of the client data. These results are presented in Table 5 of Section 4.2. Specifically, when the off-the-shelf generator was trained on ImageNet and the client datasets were CIFAR-10, CIFAR-100, or ImageNet100, the performance remained comparable to the case where the generator was trained on data matching the client distribution. This indicates that even when the off-the-shelf generator is trained on a different dataset from clients’ data, clients can still train their discriminators effectively.

We hope our response addresses your concerns. If we have misunderstood your question, please feel free to clarify.

Comments on experimental design

The simplest weighting method for ensemble distillation would be the uniform weighting that FedDF incorporates. The paper demonstrates that the improvement from FedAVG to FedDF stems from the benefit of ensemble distillation itself, while the improvement from FedDF to FedGO is attributed to GAN-based weighting. Additionally, extensive experiments were conducted by fixing the ensemble distillation process while varying the weighting method, effectively quantifying the contribution of weighting in Figure 2 of our main paper.

Comments on deployment challenges

  • Latency: We provided a comparison of the MFLOP counts between the baseline and FedGO algorithms in Table 16 of Appendix G. The comparison of MFLOP counts can serve as a proxy for latency comparison.
  • Scalability: We have already provided experimental results in Appendix F.1 and F.5 with a large-scale setup of 100 clients under various settings.
  • Potential biases in GAN-based weighting: In accordance with the reviewer’s comment, we have additionally conducted experiments with malicious clients and demonstrated the effectiveness of our weighting method in such a challenging scenario. We have reported the experimental results in the response to the second comment.
  • Generalization to non-image tasks: In accordance with the reviewer’s suggestion, we have additionally conducted experiments with a tabular healthcare dataset, confirming performance improvements over FedAVG and FedDF as shown in the table below. In this experiment, we used a total of four clients, all of whom participated in every communication round. Regarding NLP, FedDF and some GAN-based approaches have demonstrated promising results, indicating the strong generalization potential of our weighting method. This appears to be an interesting direction for future research, and we appreciate the suggestion.

\begin{array}{|l|c|c|} \hline & \alpha = 0.1 & \alpha = 0.05 \\ \hline \text{Central training} & \multicolumn{2}{c|}{36.21\pm0.15} \\ \hline \text{FedAVG} & 34.20\pm0.56 & 33.82\pm0.86 \\ \hline \text{FedDF} & 34.66\pm0.22 & 34.21\pm0.46 \\ \hline \text{FedGO} & 34.81\pm0.36 & 34.64\pm0.32 \\ \hline \end{array}

Reviewer Comment

I thank the authors for the detailed response. The majority of my concerns have been well addressed, so I will adjust my score accordingly.

Author Comment

Thank you for the invaluable review that greatly helped improve the quality of our paper. We also sincerely appreciate your positive assessment.

Review
Rating: 3

This paper, inspired by the theoretical results of Generative Adversarial Networks (GAN), proposes a weight assignment method for federated ensemble distillation. The method first trains the generator on the server side via a federated learning algorithm and trains the discriminator on the client side using a local dataset. Subsequently, the server assigns weights to each client based on the generated data samples (or unlabeled server datasets) and the outputs of the client-side discriminators to achieve near-optimal performance. The paper provides relevant theoretical proofs of the effectiveness of the method in a federated environment.

Questions for Authors

I'll reorganize all my queries here

  1. Pre-trained Generator Robustness: How robust is the proposed method when a pre-trained generator fails to accurately fit the client data distribution? Under such circumstances, do the theoretical guarantees you provided still hold, or are modifications needed to account for the discrepancy? Understanding this aspect is crucial for evaluating the method's reliability and generalizability when the underlying assumptions about the generator are not fully met.

  2. The authors claim that the communication burden of distributing the generator to the clients and uploading the discriminators from the individual clients to the server is negligible. However, if the generator and discriminators have many parameters, the communication cost of even a single transmission is non-negligible, and if the generator needs to be trained in a federated manner, multiple rounds of communication are required. There is therefore serious doubt that the additional communication overhead introduced is negligible.

  3. Has server-side resource consumption time (GPU hours) been considered in the data generation process?

  4. In the case of highly heterogeneous or long-tailed distributions, the generator may not be able to accurately estimate the distribution of data among clients. Since the computation of weights relies on the client data distributions, this may lead to a weight distribution that is biased toward a small number of clients; it is suggested that the authors use more complex data to validate the weight distribution method. Could you provide further insights on the limitations of your proposed method in scenarios with extreme client data heterogeneity? In such settings, what potential pitfalls might arise, and how does the performance of your method degrade? Clarification on this point would help assess the applicability of your method in more challenging, real-world scenarios.

  5. When the client data are extremely imbalanced or heterogeneous, the authors are requested to provide a rigorous mathematical proof, and the conditions under which it holds, of the claim that the expected loss of the ensemble model does not exceed the minimum possible loss of a single model, and to clarify whether such a conclusion can still be given in the presence of model heterogeneity.

  6. To further enhance the persuasiveness and completeness of the experiments, it is suggested that the authors add 2-3 relevant papers published in 2024 as baselines, on federated learning under data heterogeneity and federated ensemble learning.

Claims and Evidence

This paper proposes a provably near-optimal weighting method that utilizes client discriminators, which are trained using a server-distributed generator and local datasets. I believe that the claims made in the submission are well supported by evidence, but they need further refinement (see comments below).

Comments:

1. The authors claim that the communication burden of distributing the generator to the clients and uploading the discriminators from the individual clients to the server is negligible. However, if the generator and discriminators have many parameters, the communication cost of even a single transmission is non-negligible, and if the generator needs to be trained in a federated manner, multiple rounds of communication are required. There is therefore serious doubt that the additional communication overhead introduced is negligible.

2. Has server-side resource consumption time (GPU hours) been considered in the data generation process?

3. In the case of highly heterogeneous or long-tailed distributions, the generator may not be able to accurately estimate the distribution of data among clients. Since the computation of weights relies on the client data distributions, this may lead to a weight distribution that is biased toward a small number of clients; it is suggested that the authors use more complex data to validate the weight distribution method.

4. When the client data are extremely imbalanced or heterogeneous, the authors are requested to provide a rigorous mathematical proof, and the conditions under which it holds, of the claim that the expected loss of the ensemble model does not exceed the minimum possible loss of a single model, and to clarify whether such a conclusion can still be given in the presence of model heterogeneity.

5. To further enhance the persuasiveness and completeness of the experiments, it is suggested that the authors add 2-3 relevant papers published in 2024 as baselines, on federated learning under data heterogeneity and federated ensemble learning.

Methods and Evaluation Criteria

This paper proposes a federated ensemble distillation method (FedGO), based on generative adversarial network (GAN) theory, that generates pseudo-labels by dynamically assigning optimal weights through client-side discriminators to ensure the accuracy of collaborative learning; however, further improvements are still required.

Theoretical Claims

I carefully reviewed the theoretical claims presented in the paper, including the proofs provided in the main text and the appendix. I specifically verified the correctness of Theorems 3.2, 3.4, and C.1, as well as the supporting lemmas and intermediate steps. The authors have also provided clear explanations and references to existing theoretical results, which further support the validity of their claims.

Experimental Design and Analysis

To further enhance the persuasiveness and completeness of the experiments, it is suggested that the authors add 2-3 relevant papers published in 2024 as baselines, on federated learning under data heterogeneity and federated ensemble learning.

Supplementary Material

I reviewed the supplementary material, including the provided code. The code appears to be well-structured and reproducible, which facilitates the verification of the experimental results. The inclusion of code enhances the transparency and credibility of the paper.

Relation to Prior Literature

The paper makes a meaningful contribution to the field of federated learning (FL) and ensemble distillation by addressing client heterogeneity through a theoretically grounded weighting method inspired by GAN theory. It builds upon and extends prior works on federated distillation and ensemble learning, such as FedDF and FedGKD+, by introducing a provably near-optimal weighting strategy. The paper's experimental validation on benchmark datasets, along with its comparison to existing methods, further demonstrates its relevance and contribution to the existing literature.

Missing Essential References

The paper provides a thorough and comprehensive review of the relevant literature, citing the necessary and appropriate prior works that form the foundation of the study. The key contributions are contextualized with proper references to prior results in federated learning, ensemble distillation, and relevant theoretical frameworks. I did not identify any missing references that are critical for understanding the paper or its contributions.

Other Strengths and Weaknesses

Strength:

  1. The FedGO algorithm introduces a novel weighting method that utilizes client-trained discriminators to weight the ensemble models, which are trained based on data generated by the server generator. This approach enables more efficient model integration in the presence of heterogeneous data on the client side, thus improving the overall performance.

  2. This paper experimentally demonstrates that the proposed method achieves significant improvements over existing studies in terms of final performance and convergence speed on multiple image datasets.

Weakness:

It is necessary to further discuss the limitations of the proposed method in the case of extreme heterogeneity. It would also be valuable to elaborate on the robustness of the method when the pre-trained generator fails to adequately fit the client data distribution, and to clarify whether the theoretical guarantees still hold in such cases.

Other Comments or Suggestions

I did not find any obvious typos in the paper.

Author Response

Comments on pre-trained generator robustness

Our theoretical analysis already takes such a discrepancy into account: Theorem 3.6 says that our weighting method produces the optimal weight w_k^* for data points on supp(p) ∩ supp(p_g), where p is the average client data distribution and p_g is the generator's data distribution. Thus, as we mentioned in the last paragraph of Section 3.1, it is theoretically guaranteed that our proposed method works properly as long as the generator is capable of producing sufficiently diverse samples. However, empirically, we showed even stronger results in Appendix F.7.1: FedGO with a completely untrained generator still performs better than FedDF.
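For context, the standard GAN discriminator result that this argument builds on (Goodfellow et al., 2014) can be written as follows, where D_k^* denotes the optimal discriminator of client k trained to distinguish its local data from the generator's samples:

D_k^*(x) = \frac{p_k(x)}{p_k(x) + p_g(x)} \quad\Longrightarrow\quad \frac{D_k^*(x)}{1 - D_k^*(x)} = \frac{p_k(x)}{p_g(x)},

so on supp(p) ∩ supp(p_g) the odds of each discriminator output recover the corresponding client density up to the common factor p_g(x), which cancels when the weights are normalized across clients.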

Comments on communication overhead

We confirm that the additional communication cost of our method is indeed negligible, because (1) the number of parameters for the generator and the discriminator is much smaller than that of the classifier architecture, and (2) a small number of communication rounds is sufficient for training the generator and discriminator, i.e., only 5 rounds in our experiments. More specifically, in Appendix G, we showed that FedGO incurs only an additional 0.47% in communication overhead even in the most challenging scenario where the generator needs to be trained in a federated manner, i.e., case (G3) + (D2). Note that in scenarios where the generator and discriminator require a large number of parameters due to the input dimension, the number of parameters for the classifier will also be large, thereby ensuring that the relative overhead remains negligible.

Comments on server-side resource consumption

We reported the computational complexity in terms of the number of MFLOPs (rather than runtime) in Appendix G.2. The cost of the data generation process is already included in the MFLOPs. In cases where there is no unlabeled dataset on the server, we generated 25,000 images using the generator before the main FL stage. Once generated, this dataset is reused for distillation in every communication round. The image generation process takes only 3 seconds and is performed just once per experiment, making its time cost negligible.
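As a minimal sketch of this one-time generation step (our illustration, not the authors' code; the generator argument, latent_dim, and batch_size are hypothetical placeholders):

import torch

@torch.no_grad()
def build_distillation_set(generator, num_samples=25000, latent_dim=128, batch_size=500):
    # Generate the unlabeled distillation set once, before the main FL stage;
    # the same cached tensor is then reused for ensemble distillation in every
    # communication round, so the generation cost is a one-time expense.
    images = []
    for _ in range(num_samples // batch_size):
        z = torch.randn(batch_size, latent_dim)
        images.append(generator(z).cpu())
    return torch.cat(images)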

Comments on extreme data heterogeneity

We first would like to emphasize that our weighting method is proposed to address data heterogeneity, and our theoretical analysis already takes into account the discrepancy between the client data distribution and the generator distribution. To validate the effectiveness of our method in a highly heterogeneous setting, we considered the Dirichlet parameter α = 0.05, for which it is common for only two or three clients out of 20 to possess all the images of a particular class. The data distributions for α = 0.1 and α = 0.05 are visualized in Figures 5 and 6 in Appendix E.2. Additionally, to account for more complex data scenarios, we conducted experiments using the ImageNet100 dataset. Since ImageNet is sufficiently complex, we believe these experiments offer meaningful insights into FedGO's performance under realistic and challenging conditions.

Comments on theoretical guarantees under heterogeneity

The proposed method is designed to address data heterogeneity, and our theorems and their proofs already take into account possible imbalance or heterogeneity in the client data. For "model" heterogeneity, we assumed homogeneous model structures in Theorem 3.2, Corollary 3.3, and throughout all of our experiments. As discussed in Appendix H (Limitations), defining an optimal model ensemble becomes challenging when dealing with multiple hypothesis classes (i.e., multiple client model structures).

Comments on experimental completeness

We appreciate this valuable suggestion. To address this concern, we have implemented and tested the following baseline algorithms targeting data heterogeneity (published in 2024) under the main experimental setting in Section 4.1:

  1. FedUV (CVPR 2024)
  2. FedTGP (AAAI 2024)

As shown in the table below, FedGO outperforms both baselines across all settings. We will incorporate these results into the final version of our paper.

\begin{array}{|l|cc|cc|cc|} \hline & \multicolumn{2}{c|}{\text{CIFAR-10}} & \multicolumn{2}{c|}{\text{CIFAR-100}} & \multicolumn{2}{c|}{\text{ImageNet100}} \\ & \alpha = 0.1 & \alpha = 0.05 & \alpha = 0.1 & \alpha = 0.05 & \alpha = 0.1 & \alpha = 0.05 \\ \hline \text{FedUV} & 62.58 \pm 4.83 & 53.80 \pm 5.68 & 38.84 \pm 0.79 & 36.17 \pm 1.24 & 30.09 \pm 1.09 & 27.32 \pm 0.65 \\ \hline \text{FedTGP} & 61.16 \pm 6.98 & 61.51 \pm 7.78 & 39.58 \pm 0.10 & 36.56 \pm 0.11 & 29.21 \pm 1.13 & 26.34 \pm 1.02 \\ \hline \text{FedGO} & \mathbf{79.62} \pm 4.36 & \mathbf{72.35} \pm 9.01 & \mathbf{44.66} \pm 1.27 & \mathbf{41.04} \pm 0.99 & \mathbf{34.20} \pm 0.71 & \mathbf{31.70} \pm 1.55 \\ \hline \end{array}

Review
Rating: 4

The paper presents FedGO, a novel federated ensemble distillation method, aimed at addressing client data heterogeneity in federated learning. The authors propose a weighting method for ensemble distillation that is provably near-optimal by leveraging theoretical results from GANs. The method trains client-side discriminators using a generator distributed from the server, which allows the server to assign optimal weights to client predictions when generating pseudo-labels for unlabeled server data. The paper establishes theoretical guarantees for the proposed weighting scheme and demonstrates its effectiveness through experiments on image classification datasets (CIFAR-10, CIFAR-100, ImageNet100). FedGO significantly outperforms existing baselines in terms of accuracy and convergence speed while maintaining negligible communication and computational overhead.

Questions for Authors

NA

Claims and Evidence

Claim 1. Near-optimality of Proposed Weighting Scheme: The authors theoretically justify their weighting method using GAN-based results and provide generalization bounds to support its optimality.

Claim 2. Performance improvements: Experimental results provide strong empirical support that FedGO outperforms FedDF, DaFKD, and other baseline methods in terms of accuracy and convergence speed.

Important Limitation: The theoretical analysis — including the derivation of the optimal weighting functions and generalization bounds — is restricted to binary classification tasks. This limitation is underemphasized in the main text, yet all experiments are conducted on multi-class classification problems. While the empirical results are compelling, it remains unclear how well the theoretical results translate to the multi-class case. Clarifying or extending the theoretical framework to multi-class settings would strengthen the paper's claims considerably.

Methods and Evaluation Criteria

The paper uses well-established benchmark datasets (CIFAR-10, CIFAR-100, ImageNet100) and evaluation metrics (test accuracy of server model and communication efficiency) for FL. Comparisons with state-of-the-art baselines (FedDF, FedGKD+, and DaFKD) are provided.

Theoretical Claims

We reviewed the main theoretical claims, including Theorem 3.4 and Theorem 3.6, as well as the generalization bound in Theorem C.1. The derivations appear correct, though they were not carefully checked. Theorem 3.4 relies on the convexity of the loss function to invoke Jensen's inequality and uses knowledge of the true client distributions to derive the weighting function. Theorem 3.6 is a direct application of the standard GAN result from Goodfellow et al. (2014), mapping client-specific data densities to discriminator outputs via the odds function. The generalization bound in Theorem C.1 closely follows prior domain adaptation analyses.

Experimental Design and Analysis

The experimental setup is well-designed, with multiple datasets, varying levels of data heterogeneity, and different FL configurations. The paper presents:

  • Performance comparisons across baselines
  • Convergence speed comparisons
  • Ablation studies on different weighting methods
  • Experiments with different generator settings (pretrained vs. scratch-trained)
  • Analysis of overhead in terms of communication, privacy, and computational cost

The results consistently demonstrate that FedGO achieves superior performance with faster convergence and negligible overhead.

Supplementary Material

The provided source code was not reviewed.

Relation to Prior Literature

The paper builds on several key areas:

  • Federated Learning: It extends works such as FedAVG (McMahan et al., 2017) and FedDF (Lin et al., 2020) by improving model aggregation in heterogeneous settings.
  • Ensemble Distillation: Prior works like FedHKT (Deng et al., 2023) and DaFKD (Wang et al., 2023) explored ensemble distillation but lacked strong theoretical guarantees for weighting strategies.
  • GANs: The authors leverage insights from Goodfellow et al. (2014) on GAN discriminators, which is a novel contribution to FL.

By integrating ideas from these domains, FedGO represents a well-motivated and significant advancement in federated learning.

Missing Essential References

The references are satisfactory.

Other Strengths and Weaknesses

The paper presents rigorous theoretical analyses with provable guarantees, supported by strong empirical results.

Other Comments or Suggestions

NA

Author Response

Comment 1: Important Limitation: The theoretical analysis — including the derivation of the optimal weighting functions and generalization bounds — is restricted to binary classification tasks. This limitation is underemphasized in the main text, yet all experiments are conducted on multi-class classification problems. While the empirical results are compelling, it remains unclear how well the theoretical results translate to the multi-class case. Clarifying or extending the theoretical framework to multi-class settings would strengthen the paper's claims considerably.

Thank you for reviewing our paper. In the following, we have clarified the assumption of binary classification tasks in deriving the optimal weighting functions and generalization bounds.

The optimality of our proposed weighting function, based on Theorem 3.4 and Theorem 3.6, is not restricted to binary classification tasks. Theorem 3.4 only requires the convexity of the loss function, and Theorem 3.6 requires neither the convexity of the loss function nor binary classification.

Our generalization bound in Theorem C.1 indeed assumes binary classification tasks. However, we would like to highlight that obtaining a tight generalization bound is challenging even in binary classification tasks and remains an active area of research. For example, FedDF, Fed-ET, and DaFKD derived generalization bounds under binary classification assumptions, but their bounds either tend to degrade in data-heterogeneous settings or become vacuous in large-client scenarios. In contrast, our bound is tighter than those bounds, making it a significant contribution even within the binary classification framework. For the multi-class case, there are some existing results on obtaining generalization bounds in a single-model (non-federated learning) setup, but they remain limited and loose.

Final Decision

This paper proposes a method for federated ensemble distillation designed for client-heterogeneous environments in federated learning. The core idea is to use discriminators trained on each client with data generated from a server-distributed generator. The server then aggregates predictions by assigning weights based on these discriminators, thereby generating pseudo-labels for a server-side dataset. Theoretical justifications are provided for the weighting scheme, and experiments across several image classification tasks demonstrate gains over existing baselines.

The reviewers appreciated the paper's theoretical grounding. They also found the adaptation of GAN results to the federated setting interesting. In addition, the experimental setup was found to be strong, with well-chosen benchmarks, comprehensive comparisons, and ablation studies. Reviewers generally agreed that FedGO achieves better accuracy and faster convergence than prior methods, while maintaining low overhead. At the same time, several reviewers raised concerns about the limitations and assumptions of the proposed method. A shared concern was that the theoretical analysis is restricted to binary classification, despite all experiments being conducted in multi-class settings. Additionally, there were doubts about the robustness and scalability of the approach in real-world settings, particularly in the presence of extreme heterogeneity or insufficient client data to train effective discriminators. These major concerns were mostly addressed during the rebuttal period.