PaperHub
5.3 / 10
Rejected · 3 reviewers
Ratings: 6, 5, 5 (lowest 5, highest 6, std 0.5)
Confidence: 3.7
Correctness: 2.3
Contribution: 2.3
Presentation: 2.7
ICLR 2025

Uncertainty-Based Extensible Codebook for Discrete Federated Learning in Heterogeneous Data Silos

OpenReview · PDF
Submitted: 2024-09-27 · Updated: 2025-02-05

Abstract

Keywords
Federated learning, uncertainty, discretization

Reviews and Discussion

Official Review
Rating: 6

This paper proposes UEFL, which utilizes an extensible codebook to address feature shifts in FL. Numerical results demonstrate the superior performance of the proposed method in both multi-domain learning and out-of-distribution generation tasks.

Strengths

  1. Utilizing the codebook technique to address the feature shift in FL is innovative.

  2. The authors conducted comprehensive experiments on multiple datasets, demonstrating the performance improvement of the proposed method in both traditional FL tasks and out-of-distribution generalization tasks.

Weaknesses

  1. The algorithms presented in Section 3 do not align well with Algorithm 1. For instance, lines 226-228 are not thoroughly explained. It would be beneficial if the authors could provide additional details on the main pages.

  2. The number of training epochs for the baseline models is set to be the same as that for UEFL. However, I am uncertain about the fairness of this approach, as (1) the optimal number of training epochs for UEFL may not be optimal for the baseline models; (2) the communication and computation costs associated with UEFL and the baseline models differ, which makes the comparison somewhat unfair.

Questions

It would be advantageous for the authors to include convergence curves relating to communication, computation, and the number of rounds. This would provide a clearer illustration of the effectiveness of UEFL.

Comment

We appreciate the recognition of the novelty of our method and the robustness of our experimental design. Additionally, we value the insightful critique regarding the limitations of our work. In response, we address these issues below:

The algorithms presented in Section 3 do not align well with Algorithm 1. For instance, lines 226-228 are not thoroughly explained. It would be beneficial if the authors could provide additional details on the main pages.

Apologies for the confusion. Lines 226–228 (in the reviewed manuscript) explain the process of codebook augmentation in each iteration. Specifically, we identify data from heterogeneous distributions by evaluating uncertainty against a pre-established threshold. When such data are detected, the codebook is expanded with $v$ new codewords, initialized using K-means. These new codewords are accessible only to the corresponding heterogeneous clients, increasing the codebook size for the $k$-th client from $v_k$ to $v_k + v$.

We have updated the notation for the number of codewords to $v$ and included additional details in the revised manuscript for improved clarity (Lines 206–210).
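For concreteness, the sketch below illustrates this expansion step. It is a minimal illustration only: the `expand_codebook` helper, tensor shapes, and the K-means initialization call are assumptions, not the authors' implementation.

```python
# Minimal sketch of uncertainty-triggered codebook expansion (illustrative only).
# Assumes: `codebook` is an nn.Embedding of shape (v_k, d); `latents` are encoder
# outputs from the flagged heterogeneous clients; `v` new codewords are initialized
# with K-means, and only those clients may index into the new rows.
import torch
from sklearn.cluster import KMeans

def expand_codebook(codebook: torch.nn.Embedding, latents: torch.Tensor, v: int):
    d = codebook.embedding_dim
    # Initialize v new codewords from the heterogeneous clients' latent features.
    kmeans = KMeans(n_clusters=v, n_init=10).fit(latents.detach().cpu().numpy())
    new_words = torch.tensor(kmeans.cluster_centers_, dtype=codebook.weight.dtype)

    expanded = torch.nn.Embedding(codebook.num_embeddings + v, d)
    with torch.no_grad():
        expanded.weight[: codebook.num_embeddings] = codebook.weight  # shared codewords
        expanded.weight[codebook.num_embeddings :] = new_words        # silo-specific codewords
    return expanded  # codebook size grows from v_k to v_k + v
```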

The number of training epochs for the baseline models is set to be the same as that for UEFL. However, I am uncertain about the fairness of this approach, as (1) the optimal number of training epochs for UEFL may not be optimal for the baseline models; (2) the communication and computation costs associated with UEFL and the baseline models differ, which makes the comparison somewhat unfair

We selected the same number of training epochs as UEFL because UEFL requires 1 or 2 additional iterations, whereas other baselines converge earlier. The numbers reported in our tables represent the optimal performance achieved within these training epochs. For a fairer comparison, we also conducted all experiments under the official experimental setup, excluding the additional epochs from UEFL. The results, presented in the format Accuracy (Uncertainty), are as follows:

| Method | MNIST | FMNIST | GTSRB | CIFAR10 | CIFAR100 |
| --- | --- | --- | --- | --- | --- |
| FedAvg | 0.782 (0.261) | 0.801 (0.289) | 0.657 (0.645) | 0.618 (0.173) | 0.093 (1.74) |
| UEFL (Ours) | 0.920 (0.149) | 0.850 (0.167) | 0.942 (0.024) | 0.720 (0.022) | 0.326 (0.655) |

Owing to the stochastic process, there are slight differences compared to the previous results. However, the overall performance remains consistent, and the key finding remains unchanged: UEFL outperforms all baselines across all datasets. These results have been included in the appendix of our revised manuscript. (Lines 752-762)

Regarding communication and computation costs, our approach introduces only a small codebook, resulting in negligible memory and computational overhead. The detailed comparison with FedAvg is presented in the following table. For UEFL, we report the final model size with 256 codewords, which is the largest number across all datasets in our experiments. Specifically, the parameter count for the baseline FedAvg model is 14.991M, while UEFL increases slightly to 15.491M, reflecting a modest memory increment of just 3.34%. Similarly, the GPU runtime shows a minimal increase from 16.154ms to 16.733ms, representing only a 3.58% increase.

| Method | #Params (M) | CPU runtime (ms) | GPU runtime (ms) |
| --- | --- | --- | --- |
| FedAvg | 14.991 | 16.102 | 16.154 |
| UEFL (Ours) | 15.491 | 16.611 | 16.733 |

It would be advantageous for the authors to include convergence curves relating to communication, computation, and the number of rounds. This would provide a clearer illustration of the effectiveness of UEFL

Thank you for your suggestion. We have incorporated the convergence curves (MNIST and CIFAR10, FMNIST and GTSRB) for various metrics (including accuracy and uncertainty) on the MNIST, FMNIST, GTSRB, and CIFAR10 datasets into the revised manuscript. These curves clearly highlight the improvements brought by UEFL and effectively demonstrate its advantages. (Lines 825–882)

Regarding communication and computation costs, we have also included the aforementioned table in the revised manuscript. (Lines 427-431 and Lines 810-822)

Comment

Thank you for providing the rebuttal and the additional experiments. I find the use of the codebook in FL to be an insightful approach, which may inspire future research. As a result, I am maintaining my positive recommendation. Additionally, I encourage the authors to further improve the clarity of the manuscript in the next revision.

Comment

Thank you very much for your invaluable comments and for recognizing the merits of our approach. We are committed to further refining our manuscript to enhance its clarity and ensure it is more accessible and comprehensible to our readers. We sincerely appreciate your time, effort, and thoughtful insights.

Official Review
Rating: 5

This paper presents a method for federated learning, where the goal is to address data heterogeneity across clients and enhance generalization ability of federated learning model across different data distributions. Specifically, the method leverages a codebook of representations to improve accuracy and reduce prediction uncertainty in federated settings. The authors validate the proposed framework in multi-domain learning, and domain generalization in leave-one-domain-out experiments.

The proposed framework is as follows: for each client model, first encode the input data to latent features, extract code words via k-means clustering of these latent features, and then discretize the latent features using the nearest code word; predictions are subsequently made based on the coded (discretized) latent vectors. The framework also introduces an uncertainty-based adjustment mechanism, where clients with uncertainty above a certain threshold use a larger codebook with additional code words. The server model is updated as the average over all clients.
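For readers, a minimal sketch of the pipeline summarized above (encode, quantize to the nearest codeword, predict). All names, and the straight-through gradient trick, are illustrative assumptions rather than the paper's implementation.

```python
# Illustrative forward pass for codebook-based discretization (not the authors' code).
import torch

def discretize(latent: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    # latent: (batch, d); codebook: (num_codewords, d)
    dists = torch.cdist(latent, codebook)   # pairwise L2 distances to all codewords
    idx = dists.argmin(dim=1)               # nearest codeword per sample
    quantized = codebook[idx]
    # straight-through estimator so gradients still flow back to the encoder
    return latent + (quantized - latent).detach()

def client_forward(x, encoder, codebook, classifier):
    z = encoder(x)                  # latent features
    z_q = discretize(z, codebook)   # discretized (coded) latent vectors
    return classifier(z_q)          # prediction from the coded representation
```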

Strengths

The authors address an important challenge in federated learning to tackle data heterogeneity and enhance model generalization across different distributions.

The presentation of the paper is easy to follow.

Weaknesses

I have the following questions regarding the work:

  1. My first question is about the rationale behind the framework. Could the authors elaborate on how and why discretizing latent representations through a codebook helps achieve the goals of improved generalization and uncertainty reduction?

  2. Regarding the server model, could the authors clarify how the codebook of the server model is updated?

  3. Is there any convergence guarantee of Algorithm 1? Specifically, does the framework ensure that all clients will reach an uncertainty level below the threshold? Could the authors clarify the mechanism described in lines 322–323?

  4. Could the authors clarify why reducing uncertainty in prediction is a valid objective? If the model achieves high certainty on incorrect predictions, could this exacerbate challenges in improving model performance? Additionally, if reducing uncertainty is a primary goal, beyond the proposed method of expanding client codebook, have the authors considered training client models to the point of neural collapse [1]?

  5. Distribution shift is also addressed in other studies, such as [2,3]. It would strengthen the evaluation if the authors could expand their comparisons to include more related methods.

[1] Papyan, Vardan, X. Y. Han, and David L. Donoho. "Prevalence of neural collapse during the terminal phase of deep learning training." Proceedings of the National Academy of Sciences 117.40 (2020): 24652-24663.

[2] Nguyen, A. Tuan, Philip Torr, and Ser Nam Lim. "Fedsr: A simple and effective domain generalization method for federated learning." Advances in Neural Information Processing Systems 35 (2022): 38831-38843.

[3] Zhang, Hao, et al. "FedCR: Personalized federated learning based on across-client common representation with conditional mutual information regularization." International Conference on Machine Learning. PMLR, 2023.

Questions

It would be helpful if the authors could address my above questions.

Comment

Thank you for your thoughtful feedback, including your recognition of the motivation and presentation of our work. We also greatly appreciate your insightful critique regarding the limitations of our study. In response, we address these concerns as follows:

My first question is about the rationale behind the framework. Could the authors elaborate on how and why discretizing latent representations through a codebook helps achieve the goals of improved generalization and uncertainty reduction?

The rationale mainly comes from this previous work [1]. Theoretically, the inclusion of the discretization process offers two key advantages: (1) enhanced noise robustness, and (2) reduced underlying dimensionality. These benefits are demonstrated in the following two theorems.

Notation: $\boldsymbol{h}$ is the input vector, $\boldsymbol{h}\in\mathcal{H}\subseteq\mathcal{R}^m$. $L$ is the size of the codebook, $G$ is the number of segments, $q(\cdot)$ is the discretization process, and $\phi(\cdot)$ is any function (model). Given any family of sets $S=\{S_1, ..., S_K\}$ with $S_1, ..., S_K\subseteq\mathcal{H}$, we define $\phi_k^S$ by $\phi_k^S(\boldsymbol{h})=\mathbb{1}\{\boldsymbol{h}\in S_k\}\phi(\boldsymbol{h})$ for all $k\in[K]$, where $[K]=\{1, ..., K\}$. We denote by $(Q_k)_{k\in[L^G]}$ all the codewords.

Theorem 1 (with discretization): Let $S_k=\{Q_k\}$ for all $k\in[L^G]$. Then for any $\delta>0$, with probability at least $1-\delta$ over an i.i.d. draw of $n$ examples $(\boldsymbol{h}_i)_{i=1}^n$, the following holds for any $\phi:\mathcal{R}^m\rightarrow\mathcal{R}$ and all $k\in[L^G]$: if $|\phi_k^S(\boldsymbol{h})|\leq\alpha$ for all $\boldsymbol{h}\in\mathcal{H}$, then

$$
\left\|\mathbb{E}_{\boldsymbol{h}}[\phi_k^S(q(\boldsymbol{h}, L, G))]-\frac{1}{n} \sum_{i=1}^n\phi_k^S(q(\boldsymbol{h}_i, L, G))\right\|=\mathcal{O}\left(\alpha\sqrt{\frac{G\ln(L)+\ln(2/\delta)}{2n}}\right),
$$

where no constant is hidden in $\mathcal{O}$.

Theorem 2 (without discretization): Assume that $\|\boldsymbol{h}\|_2 \leq R_{\mathcal{H}}$ for all $\boldsymbol{h}\in\mathcal{H}\subseteq\mathcal{R}^m$. Fix $\mathcal{C}\in \arg\min_{\overline{\mathcal{C}}}\{|\overline{\mathcal{C}}|:\overline{\mathcal{C}}\subseteq\mathcal{R}^m, \mathcal{H} \subseteq \cup_{\boldsymbol{c}\in\overline{\mathcal{C}}}\mathcal{B}[\boldsymbol{c}]\}$, where $\mathcal{B}[\boldsymbol{c}]=\{ \boldsymbol{x}\in\mathcal{R}^m: \|\boldsymbol{x}-\boldsymbol{c}\|_2\leq R_{\mathcal{H}}/(2\sqrt{n}) \}$. Let $S_k=\mathcal{B}[\boldsymbol{c}_k]$ for all $k\in [|\mathcal{C}|]$, where $\boldsymbol{c}_k\in \mathcal{C}$ and $\cup_k\{\boldsymbol{c}_k\}=\mathcal{C}$. Then, for any $\delta>0$, with probability at least $1-\delta$ over an i.i.d. draw of $n$ examples $(\boldsymbol{h}_i)_{i=1}^n$, the following holds for any $\phi : \mathcal{R}^m \rightarrow \mathcal{R}$ and all $k\in[|\mathcal{C}|]$: if $|\phi_k^S(\boldsymbol{h})|\leq\alpha$ for all $\boldsymbol{h}\in\mathcal{H}$ and $|\phi_k^S(\boldsymbol{h})-\phi_k^S(\boldsymbol{h}')|\leq\varsigma_k\|\boldsymbol{h}-\boldsymbol{h}'\|_2$ for all $\boldsymbol{h}, \boldsymbol{h}'\in S_k$, then

$$
\left\|\mathbb{E}_{\boldsymbol{h}}[\phi_k^S(\boldsymbol{h})]-\frac{1}{n} \sum_{i=1}^n \phi_k^S(\boldsymbol{h}_i) \right\|=\mathcal{O}\left(\alpha\sqrt{\frac{m\ln(4\sqrt{nm})+\ln(2/\delta)}{2n}}+\frac{\overline{\varsigma_k}R_{\mathcal{H}}}{\sqrt{n}}\right),
$$

where no constant is hidden in $\mathcal{O}$ and $\overline{\varsigma_k}=\varsigma_k\left(\frac{1}{n}\sum_{i=1}^n\mathbb{1}\{\boldsymbol{h}_i\in \mathcal{B}[\boldsymbol{c}_k]\}\right)$.

Based on these two theorems, we can determine that the performance gap between training and test data is smaller when discretization is applied, due to the following two points:

  • There is an additional error term without discretization (i.e., $\frac{\overline{\varsigma_k}R_{\mathcal{H}}}{\sqrt{n}}$) in the bound of Theorem 2. This term disappears with discretization in the bound of Theorem 1, as the discretization process reduces the sensitivity to noise.
  • The discretization process reduces the underlying dimensionality term from $m\ln(4\sqrt{nm})$ without discretization (Theorem 2) to $G\ln(L)$ with discretization (Theorem 1). Since the number of discretization heads $G$ (e.g., $G$ is 1, 2, or 4 in our case) is always much smaller than the number of dimensions $m$, the inequality $G\ln(L)\leq m\ln(4\sqrt{nm})$ consistently holds; the worked example after this list illustrates the scale of the gap.
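For a rough sense of scale, plugging in illustrative values (the specific numbers $m=512$ and $n=10^4$ are assumptions chosen for illustration, not taken from the paper; $G=4$ and $L=256$ follow the settings mentioned in the discussion):

$$
G\ln(L) = 4\ln(256) \approx 22.2
\qquad\text{vs.}\qquad
m\ln(4\sqrt{nm}) = 512\,\ln\!\left(4\sqrt{10^4\cdot 512}\right) \approx 512\times 9.1 \approx 4.7\times 10^3 .
$$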

This is consistent with the experimental results presented in Section 4.1 of our manuscript. Specifically, with the discretization process, the mean accuracy improved from 0.834 to 0.907, accompanied by a reduction in uncertainty (Fig. 3a). The details are included in Lines 671-709 of the revised manuscript.

[1] Liu, Dianbo, et al. "Discrete-valued neural communication." Advances in Neural Information Processing Systems 34 (2021): 2109-2121.

Comment
  1. Regarding the server model, could the authors clarify how the codebook of the server model is updated?

Certainly. The server updates the codebook by computing the average of codewords across the clients that utilize those specific codewords.

Specifically, if the initial codebook size is 64, the server updates the codebook after the first iteration in the same manner as other model parameters—by computing the average across all client models, as all clients share the same codebook size initially. After the first iteration, if clients A, B, and C exhibit high uncertainty and require additional codewords, the codebook size increases to 128. In this case, the additional 64 codewords are used and updated only by clients A, B, and C. For the server, it computes the average across all clients for the first 64 codewords, while the latter 64 codewords are averaged using only the models from clients A, B, and C.

We elaborated on this point in Lines 260–261 of the revised manuscript.
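A minimal sketch of this aggregation rule follows, assuming each client reports a boolean mask over the codeword rows it actually uses; the function name, mask representation, and shapes are illustrative, not the authors' implementation.

```python
# Illustrative server-side aggregation: shared codewords are averaged over all
# clients, while extended codewords are averaged only over the clients that use them.
import torch

def aggregate_codebooks(client_codebooks, client_masks):
    # client_codebooks: list of (max_size, d) tensors (unused rows may be zero)
    # client_masks: list of boolean (max_size,) tensors, True where the client uses the row
    stacked = torch.stack(client_codebooks)          # (num_clients, max_size, d)
    masks = torch.stack(client_masks).float()        # (num_clients, max_size)
    counts = masks.sum(dim=0).clamp(min=1.0)         # how many clients use each codeword
    summed = (stacked * masks.unsqueeze(-1)).sum(dim=0)
    return summed / counts.unsqueeze(-1)             # per-codeword average over its users
```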

  1. Is there any convergence guarantee of Algorithm 1? Specifically, does the framework ensure that all clients will reach an uncertainty level below the threshold? Could the authors clarify the mechanism described in lines 322–323?

Empirically, we observed that with more training epochs, test accuracy converges while uncertainty continues to decrease (as shown in the training curves for accuracy and uncertainty). This inspired our use of uncertainty as the stopping criterion for our approach. Currently, we manually set thresholds for different datasets to ensure all clients reach an uncertainty level below the threshold, which supports the efficacy of our framework. For future work, we plan to explore dynamic threshold adjustment. Additionally, we introduced a soft constraint in our experiments: setting the maximum number of iterations to 5, ensuring the codebook does not expand indefinitely. Details are updated in Lines 315-316 of the revised manuscript.

Comment

Could the authors clarify why reducing uncertainty in prediction is a valid objective? If the model achieves high certainty on incorrect predictions, could this exacerbate challenges in improving model performance? Additionally, if reducing uncertainty is a primary goal, beyond the proposed method of expanding client codebook, have the authors considered training client models to the point of neural collapse [1]?

  • We adopted cross entropy as the objective loss function for classification as follows,

$$
\mathcal{L}_{task} = -\sum_{class} y \log p
$$

where pp is the model output, and yy is the ground truth label.

The uncertainty is evaluated as follows,

$$
e = -\sum_{class} p \log p
$$

So, when minimizing the cross-entropy loss, uncertainty is inherently minimized. The training curves for accuracy, uncertainty, and test loss also reveal a similar trend: as training progresses, accuracy improves while uncertainty and loss decrease. In these figures, the overall test loss is depicted, which includes both the cross-entropy loss and the codebook loss.

  • Yes, if uncertainty increases on incorrect predictions, model performance can degrade. However, our approach minimizes the overall uncertainty for the entire data silo by optimizing the objective loss function, which includes the cross-entropy loss. While uncertainty may increase for a few samples, the overall uncertainty across the dataset is effectively reduced, resulting in improved performance.
  • Thank you for the suggestion regarding neural collapse. We conducted experiments on MNIST and CIFAR-100. According to [1], when training continues until the training error reaches 0 (i.e., training accuracy exceeds 99.9% for MNIST/CIFAR-10), the Terminal Phase of Training (TPT) begins, during which neural collapse (NC) emerges. To validate this, we first conducted local training by training on all data together. We then extended these experiments to the federated learning setting. The detailed results are presented below:
| Method | MNIST | CIFAR100 |
| --- | --- | --- |
| zero-error epoch | 0.9375 | 0.3473 |
| last epoch | 0.9584 | 0.3482 |
| UEFL (Ours) | 0.9778 | 0.5074 |

From these experiments, we observed the following findings:

  1. The dropout layer must be removed; otherwise, the training accuracy cannot exceed 99.9% (e.g., the final training accuracy for MNIST is limited to 96% with dropout).
  2. Compared to local training, more training epochs are required to reach the TPT in federated learning. For example, on CIFAR-100 with data heterogeneity, federated learning requires 44 rounds of training (44 × 5 epochs), while local training achieves TPT in 38 epochs.
  3. Although neural collapse yields improved performance after the zero-error epoch, UEFL consistently outperforms it, especially on more complex datasets like CIFAR-100. This is partly because removing dropout layers for neural collapse increases the risk of overfitting.

The discussion has been included in Lines 786-809 of the revised manuscript.

Distribution shift is also addressed in other studies, such as [2,3]. It would strengthen the evaluation if the authors could expand their comparisons to include more related methods.

Thank you for the suggestion. In the reviewed manuscript, we included the results for FedSR [3], which are highlighted in blue in Lines 453 and 493 of the revised manuscript. Additionally, we conducted experiments on EMNIST-L, FMNIST, CIFAR-10, and CIFAR-100, following the experimental setup outlined in FedCR [2]. The detailed results are presented below and have been included in the revised manuscript, highlighted in red.

| Method | #Clients | EMNIST-L | FMNIST | CIFAR10 | CIFAR100 |
| --- | --- | --- | --- | --- | --- |
| FedAvg | 100 | 95.89 | 88.15 | 76.83 | 32.08 |
| FedSR | 100 | 86.22 | 85.55 | 61.47 | 40.82 |
| FedCR | 100 | 97.47 | 93.78 | 84.74 | 62.96 |
| UEFL (Ours) | 100 | 98.29 | 93.93 | 86.11 | 63.37 |

These results demonstrate that UEFL outperforms all baselines in addressing data heterogeneity and proves to be scalable to a larger number of clients. The results have been included in Lines 497-505 of the revised manuscript.

[1] Papyan, Vardan, X. Y. Han, and David L. Donoho. "Prevalence of neural collapse during the terminal phase of deep learning training." Proceedings of the National Academy of Sciences 117.40 (2020): 24652-24663.

[2] Zhang, Hao, et al. "FedCR: Personalized federated learning based on across-client common representation with conditional mutual information regularization." International Conference on Machine Learning. PMLR, 2023.

[3] Nguyen, A. Tuan, Philip Torr, and Ser Nam Lim. "Fedsr: A simple and effective domain generalization method for federated learning." Advances in Neural Information Processing Systems 35 (2022): 38831-38843.

Comment

Thank you for the detailed response. The additional experimental results have indeed addressed some of my questions. However, I have a few remaining concerns. Regarding the rationale behind the framework, I appreciate the reference to the previous work [1]. However, it appears that in [1], the theorem on the effectiveness of discretization is for the case of discretizing different components in one neural network. How would this theoretical guarantee extend to the federated learning case that's studied in this paper? Besides, in the proposed framework, the authors add more codewords for clients with higher uncertainty. According to Theorem 1, this increases G, leading to a worse error bound. This seems counterintuitive, as it seems to hurt the performance. Could the authors clarify this point?

[1] Liu, Dianbo, et al. "Discrete-valued neural communication." Advances in Neural Information Processing Systems 34 (2021): 2109-2121.

Comment

Thank you for your feedback. In response to your concerns, we address them as follows:

Regarding the rationale behind the framework, I appreciate the reference to the previous work [1]. However, it appears that in [1], the theorem on the effectiveness of discretization is for the case of discretizing different components in one neural network. How would this theoretical guarantee extend to the federated learning case that's studied in this paper?

While the experiments in [1] focus on communication among different components within a neural network, the theoretical analysis in [1] demonstrates that discretized representations in neural networks generally reduce the generalization gap between training performance and unseen test data; it is not limited to the specific setting of intra-network communication.

Building on this theoretical foundation, we extend the rationale to the federated learning setting, where discrete representations enhance generalization across different silos, leading to improved overall performance.

Besides, in the proposed framework, the authors add more codewords for clients with higher uncertainty. According to Theorem 1, this increases G, leading to a worse error bound. This seems counterintuitive, as it seems to hurt the performance. Could the authors clarify this point?

Theorem 1 suggests that models with discrete latent representations are more robust to noise and exhibit a reduced generalization gap for unseen data points that are close to the training distribution. This is achieved by reducing the complexity of the information being processed. While a larger codebook reduces the tightness of information bottlenecking and allows for more complex representations, it also improves the model's ability to fit the training data. Conversely, a smaller codebook enforces tighter bottlenecking, typically resulting in worse training performance but improved generalization. Consequently, as L and G (i.e., codebook size) increase, the generalization capability degrades, but training performance improves.

In the context of federated learning, we observed that when the model exhibits high uncertainty in a specific silo during prediction, it indicates poor training performance on that silo, which is undesirable. This suggests that the model has not fit well to the silo-specific data, resulting in suboptimal performance.

To address this, we introduce additional silo-specific codewords to enable the model to better capture the characteristics of data in that silo. This adjustment helps reduce the uncertainty for the silo in question, thereby improving the model's performance locally. The shared codebook, on the other hand, provides global generalization across all silos. However, since the generalization across all silos cannot fully address the variability in silo-specific data, high uncertainty may persist in certain silos.

By incorporating both shared and silo-specific codewords, we strike a balance: the shared codewords contribute to global generalization, while the silo-specific codewords improve the model's ability to fit local silo data. This combined approach helps reduce uncertainty and improve performance across all silos.

[1] Liu, Dianbo, et al. "Discrete-valued neural communication." Advances in Neural Information Processing Systems 34 (2021): 2109-2121.

Official Review
Rating: 5

The paper addresses the challenge of data heterogeneity in Federated Learning (FL) by proposing a novel method called Uncertainty-Based Extensible-Codebook Federated Learning (UEFL). The method introduces a dynamic codebook that maps latent features to discrete vectors, allowing for the extension of the codebook based on the uncertainty of data distributions across different data silos (clients). By iteratively evaluating the uncertainty of the global model's predictions, UEFL adds new codewords to the codebook to improve performance on previously unseen or highly divergent data distributions. The proposed method is tested on six datasets, showing significant improvements in both accuracy (3% to 22.1%) and uncertainty reduction (38.83% to 96.24%) over state-of-the-art approaches.

Strengths

  • The idea of using a dynamic, uncertainty-driven codebook to handle data heterogeneity in federated learning is innovative. By extending the codebook based on uncertainty, the method adapts to diverse and unseen data distributions more efficiently than traditional methods.

  • The approach of starting with a small, shared codebook and gradually expanding it based on data-specific uncertainties is efficient. This design helps in minimizing computational overhead while ensuring that the model can adapt to new distributions without retraining the entire model.

Weaknesses

  • In Table 1, the reported accuracy of FedAvg on CIFAR-100 is only 11%, which is surprisingly low. Typically, even with non-IID data, FedAvg achieves higher accuracy. This raises concerns about potential baseline suppression to exaggerate the performance of the proposed method. The authors need to clarify the experimental setup for FedAvg and ensure that the baseline methods are fairly represented. Without this clarification, the validity of the results is questionable.

  • The experiments seem to rely on pretrained models. It would be essential to clarify whether the proposed UEFL method still performs well when trained from scratch. Relying on pretrained models can bias results, especially if the pretrained weights are already well-optimized for certain datasets. The authors should include experiments that train the models from scratch to validate the robustness and generalizability of UEFL. Additionally, the 11% accuracy for FedAvg on CIFAR-100 with a pretrained model is puzzling and further raises questions about the validity of the reported results.

  • The paper primarily uses Monte Carlo Dropout for uncertainty estimation, which is a standard but limited approach. More advanced uncertainty quantification techniques (e.g., Deep Ensembles or Bayesian Neural Networks) could be explored or at least discussed as alternatives. The performance of UEFL might vary with different uncertainty estimation methods.

  • The paper lacks a comprehensive analysis of the computational complexity of the proposed method. Although the authors claim that the codebook remains compact, it would be beneficial to quantify the computational and communication overhead in both the training and inference phases. A comparison of the complexity of UEFL with other FL methods in terms of memory usage and time complexity is missing.

  • The paper mainly evaluates the method on six datasets, but does not provide a clear discussion of its scalability to larger numbers of clients. Given that FL is often applied to large-scale systems, it would be helpful to discuss how UEFL scales with increasing numbers of clients or more heterogeneous data distributions.

  • Although the paper discusses the adaptive nature of the codebook, it would be useful to include an ablation study that explores the impact of different initial codebook sizes and how the codebook grows over iterations.

Questions

  • The reported 11% accuracy for FedAvg on CIFAR-100 is unreasonably low, especially considering that the experiments used pretrained models. Could the authors clarify whether there were any specific preprocessing steps or hyperparameter choices that led to such low performance? Additionally, does UEFL still outperform FedAvg when both models are trained from scratch, without the benefit of pretraining?

  • How does the growth of the codebook over multiple iterations affect the overall communication overhead in FL? Does the increased size of the codebook lead to significant additional communication costs during model aggregation, and how is this handled?

  • How is the threshold for uncertainty (which triggers the expansion of the codebook) determined? Is this threshold set manually, or is it dynamically adjusted during training? Additionally, how sensitive is the method to this threshold?

  • How does UEFL perform when the data distributions across silos are extremely divergent (for example, different modalities of data)? Does the codebook continue to expand indefinitely, or is there a mechanism to prevent overgrowth?

Comment

We appreciate the recognition of the novelty and efficiency of our method. Additionally, we value the insightful critique regarding the limitations of our work. In response, we address these issues below:

In Table 1, the reported accuracy of FedAvg on CIFAR-100 is only 11%, which is surprisingly low. Typically, even with non-IID data, FedAvg achieves higher accuracy. This raises concerns about potential baseline suppression to exaggerate the performance of the proposed method. The authors need to clarify the experimental setup for FedAvg and ensure that the baseline methods are fairly represented. Without this clarification, the validity of the results is questionable.

The performance heavily depends on the experimental setup, particularly data heterogeneity. In our manuscript, for the CIFAR-100 dataset, we adopted the same setup as other datasets, introducing high data heterogeneity by rotating the images with large angles. To ensure fairness, we also systematically conducted experiments under an alternative setup with lower data heterogeneity.

To provide a detailed comparison with the baseline, we utilized a VGG16 backbone to test CIFAR-100 across multiple settings: (1) Local Training: all data trained together without a distributed setting; (2) FedAvg (w/o hete): CIFAR-100 split into 5 clients, each with 10,000 images, to evaluate FedAvg performance; (3) FedAvg (w/ hete): images for the 5 clients were rotated by -30°, -15°, 0°, 15°, and 30°, respectively, to introduce data heterogeneity, and FedAvg performance was evaluated; (4) UEFL (w/ hete): tested under the same heterogeneous setup. By presenting results across these progressively changing setups, from local training to heterogeneous federated learning, we demonstrate the validity of our experimental approach. The results are as follows:

| Training Strategy | Local Training | FedAvg (w/o hete) | FedAvg (w/ hete) | UEFL (w/ hete) |
| --- | --- | --- | --- | --- |
| From scratch | 0.3852 | 0.2447 | 0.0852 | 0.1062 |
| Pre-trained | 0.6604 | 0.6496 | 0.5005 | 0.5619 |

The accuracy of both local training and federated learning aligns with publicly reported numbers. Under lower data heterogeneity, both FedAvg and UEFL achieve higher accuracy, with UEFL consistently outperforming FedAvg in this experimental setup. The results are updated in Lines 714-730.
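For clarity, a minimal sketch of the rotation-based heterogeneity used in setups (3) and (4) above. The angles follow the description; the torchvision loader details, paths, and the final sharding comment are illustrative assumptions.

```python
# Illustrative construction of 5 heterogeneous CIFAR-100 clients via fixed rotations.
import torchvision
import torchvision.transforms as T
import torchvision.transforms.functional as TF

angles = [-30, -15, 0, 15, 30]                     # one rotation angle per client
client_datasets = []
for angle in angles:
    tf = T.Compose([
        T.Lambda(lambda img, a=angle: TF.rotate(img, a)),  # fixed per-client rotation
        T.ToTensor(),
    ])
    client_datasets.append(
        torchvision.datasets.CIFAR100(root="./data", train=True,
                                      download=True, transform=tf)
    )
# Each client would then train on its own 10,000-image shard of the rotated data.
```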

The experiments seem to rely on pretrained models. It would be essential to clarify whether the proposed UEFL method still performs well when trained from scratch. Relying on pretrained models can bias results, especially if the pretrained weights are already well-optimized for certain datasets. The authors should include experiments that train the models from scratch to validate the robustness and generalizability of UEFL. Additionally, the 11% accuracy for FedAvg on CIFAR-100 with a pretrained model is puzzling and further raises questions about the validity of the reported results.

For this point:

  • As mentioned in our manuscript, for grayscale datasets such as MNIST and FMNIST, where pretrained models are unavailable, we design a convolutional network consisting of three ResNet blocks and train it from scratch. This clarification has been highlighted in blue in the revised manuscript (Lines 335–337).

  • Additionally, we conducted experiments on CIFAR-100 under the aforementioned experimental setup, training from scratch without using pretrained weights. The detailed results are presented in the table above. UEFL continues to outperform the baseline when training from scratch.

Comment

The paper primarily uses Monte Carlo Dropout for uncertainty estimation, which is a standard but limited approach. More advanced uncertainty quantification techniques (e.g., Deep Ensembles or Bayesian Neural Networks) could be explored or at least discussed as alternatives. The performance of UEFL might vary with different uncertainty estimation methods.

Thanks for the suggestion. We evaluated Deep Ensembles by creating 5 ensembles to assess uncertainty. Here’s a comparison between Deep Ensembles and Monte Carlo Dropout. The results, presented in the format Accuracy (Uncertainty), are as follows:

| Method | MNIST | FMNIST | GTSRB | CIFAR10 | CIFAR100 |
| --- | --- | --- | --- | --- | --- |
| Monte Carlo Dropout | 0.920 (0.149) | 0.850 (0.167) | 0.942 (0.024) | 0.720 (0.022) | 0.326 (0.655) |
| Deep Ensemble | 0.926 (0.211) | 0.853 (0.289) | 0.940 (0.041) | 0.717 (0.043) | 0.331 (0.873) |

From the results, we observe that the accuracy when using Deep Ensembles is quite similar to Monte Carlo Dropout, apart from the stochastic variations. This is expected, as the accuracy is not directly impacted by the choice of uncertainty evaluation method. However, the uncertainty values for Deep Ensembles are higher than those for Monte Carlo Dropout, likely due to the use of only 5 ensembles for evaluation to reduce computational time.

In conclusion, while the accuracy is comparable, Deep Ensembles require significantly more computational resources due to the need to train multiple networks. Therefore, Monte Carlo Dropout is a more efficient and suitable choice for our approach. All the results have been included in the revised manuscript (Lines 765-783).
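For reference, a minimal sketch of the two uncertainty estimators being compared, both scored with the predictive entropy $e=-\sum p\log p$ discussed earlier. The number of stochastic passes, the ensemble size, and the function names are illustrative assumptions.

```python
# Illustrative uncertainty estimation: MC Dropout (T stochastic passes through one
# model) vs. a Deep Ensemble (M independently trained models), both scored by the
# entropy of the averaged predictive distribution.
import torch
import torch.nn.functional as F

def predictive_entropy(probs: torch.Tensor) -> torch.Tensor:
    return -(probs * probs.clamp(min=1e-12).log()).sum(dim=-1)  # e = -sum p log p

def mc_dropout_uncertainty(model, x, T=10):
    model.train()  # keep dropout layers active at inference time
    with torch.no_grad():
        probs = torch.stack([F.softmax(model(x), dim=-1) for _ in range(T)]).mean(0)
    return predictive_entropy(probs)

def deep_ensemble_uncertainty(models, x):  # e.g., 5 independently trained models
    with torch.no_grad():
        probs = torch.stack([F.softmax(m(x), dim=-1) for m in models]).mean(0)
    return predictive_entropy(probs)
```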

The paper lacks a comprehensive analysis of the computational complexity of the proposed method. Although the authors claim that the codebook remains compact, it would be beneficial to quantify the computational and communication overhead in both the training and inference phases. A comparison of the complexity of UEFL with other FL methods in terms of memory usage and time complexity is missing.

During the training phase, as the codebook size keeps increasing, the number of parameters varies dynamically. In contrast, during the inference phase, the number of parameters is fixed and depends on the final codebook size after augmentation. To provide a fair comparison, we present the worst-case scenario for our UEFL, with 256 codewords being the largest codebook size observed across all datasets after training, and compare it with FedAvg as follows:

| Method | #Params (M) | CPU runtime (ms) | GPU runtime (ms) |
| --- | --- | --- | --- |
| FedAvg | 14.991 | 16.102 | 16.154 |
| UEFL (Ours) | 15.491 | 16.611 | 16.733 |

Specifically, the parameter count for the baseline FedAvg model is 14.991M, while UEFL increases slightly to 15.491M, reflecting a modest memory increment of just 3.34%. Similarly, the GPU runtime shows a minimal increase from 16.154ms to 16.733ms, representing only a 3.58% increase. The results are updated in Lines 427-431 and Lines 810-822.

The paper mainly evaluates the method on six datasets, but does not provide a clear discussion of its scalability to larger numbers of clients. Given that FL is often applied to large-scale systems, it would be helpful to discuss how UEFL scales with increasing numbers of clients or more heterogeneous data distributions.

In the reviewed manuscript, we included the results for 50 clients of UEFL on Rotated MNIST and compared them with baselines. These results are presented in the following table and have been highlighted in Lines 486–495.

| Method | #Clients | M_0 | M_15 | M_30 | M_45 | M_60 | M_75 | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FedAvg | 50 | 77.9 | 95.9 | 96.9 | 97 | 96 | 81.2 | 90.8 |
| FedSR | 50 | 78.3 | 95.7 | 96.3 | 97.1 | 96 | 84 | 91.2 |
| FedIIR | 50 | 84 | 96.8 | 97.7 | 97.7 | 97.4 | 84.5 | 93 |
| UEFL (Ours) | 50 | 86.4 | 95.5 | 96.4 | 96.9 | 94.7 | 90.6 | 93.42 |

Additionally, we conducted further experiments for 100 clients on EMNIST-L, FMNIST, CIFAR10 and CIFAR100, following the experimental setup in FedCR [1]. The detailed results are provided below and have been included in the revised manuscript, highlighted in red (Lines 497-505).

| Method | #Clients | EMNIST-L | FMNIST | CIFAR10 | CIFAR100 |
| --- | --- | --- | --- | --- | --- |
| FedAvg | 100 | 95.89 | 88.15 | 76.83 | 32.08 |
| FedSR | 100 | 86.22 | 85.55 | 61.47 | 40.82 |
| FedCR | 100 | 97.47 | 93.78 | 84.74 | 62.96 |
| UEFL (Ours) | 100 | 98.29 | 93.93 | 86.11 | 63.37 |

These results demonstrate that UEFL is scalable to a larger number of clients.

[1] Zhang, Hao, et al. "FedCR: Personalized federated learning based on across-client common representation with conditional mutual information regularization." International Conference on Machine Learning. PMLR, 2023.

Comment

Although the paper discusses the adaptive nature of the codebook, it would be useful to include an ablation study that explores the impact of different initial codebook sizes and how the codebook grows over iterations.

In the reviewed manuscript, we included Figure 5a to illustrate the impact of different initial codebook sizes. To provide more clarity, we have included the detailed results in the following table and added them to Lines 939–971 of the revised manuscript.

| Initial codebook size | Perplexity | Uncertainty | Accuracy | Codebook growth |
| --- | --- | --- | --- | --- |
| 8 | 4.82 | 2.02 | 0.257 | 8 → 16 → 32 → 64 |
| 16 | 10.54 | 1.37 | 0.506 | 16 → 32 → 64 |
| 32 | 25.92 | 0.0615 | 0.944 | 32 → 64 → 128 |
| 64 | 26.79 | 0.0098 | 0.955 | 64 → 128 |
| 128 | 38.73 | - | 0.946 | 128 → 256 |
| 256 | 43.56 | - | 0.942 | 256 → 512 |

"-" denotes value close to 0. We observe that an initial codebook size of 64 achieves the best performance.

The reported 11% accuracy for FedAvg on CIFAR-100 is unreasonably low, especially considering that the experiments used pretrained models. Could the authors clarify whether there were any specific preprocessing steps or hyperparameter choices that led to such low performance? Additionally, does UEFL still outperform FedAvg when both models are trained from scratch, without the benefit of pretraining?

We have addressed this point above, as mentioned in Lines 427-431 and Lines 810-822 of the revised manuscript.

How does the growth of the codebook over multiple iterations affect the overall communication overhead in FL? Does the increased size of the codebook lead to significant additional communication costs during model aggregation, and how is this handled?

Specifically, our initial codebook size is 64, and we add 64 codewords per iteration, so each expansion adds 64×512 parameters, which is minimal compared to the overall network size. After augmentation to 256 codewords, the total number of parameters increases from 14.991M to 15.491M, reflecting a modest 3.34% increase, as discussed above.

How is the threshold for uncertainty (which triggers the expansion of the codebook) determined? Is this threshold set manually, or is it dynamically adjusted during training? Additionally, how sensitive is the method to this threshold?

We set this threshold manually and empirically based on the characteristics of different datasets, as the optimal threshold varies slightly across datasets. In this paper, we primarily focus on evaluating the efficacy of the overall framework, leaving the dynamic adjustment of the threshold as future work.

How does UEFL perform when the data distributions across silos are extremely divergent (for example, different modalities of data)? Does the codebook continue to expand indefinitely, or is there a mechanism to prevent overgrowth?

  • Our approach currently does not support cross-modality datasets (e.g., text and image data) because the server requires averaging client models, which is not feasible when base models differ for different modalities. For cross-modality datasets, the model aggregation process at the server would need to be customized. Thank you for the suggestion; we will consider this as future work. To demonstrate efficacy in scenarios with extremely divergent data distributions across silos, we experimented with the OfficeHome dataset, which contains 4 domains, each consisting of 65 categories: Art (A): Artistic images in the form of sketches, paintings, ornamentation, etc. Clipart (C): A collection of clipart images. Product (P): Images of objects without a background. Real-World (R): Images of objects captured with a regular camera. We followed the experimental setup from FedSR [2] to evaluate the domain generalization task. The results are as follows:
| Method | A | C | P | R | Average |
| --- | --- | --- | --- | --- | --- |
| FedAvg | 62.2 | 55.6 | 75.7 | 78.2 | 67.9 |
| FedSR | 65.3 | 57.3 | 76.2 | 77.8 | 69.1 |
| FedIIR | 64.3 | 56.6 | 77.2 | 78.4 | 69.2 |
| UEFL (Ours) | 62.0 | 57.2 | 79.7 | 78.3 | 69.3 |

UEFL consistently outperforms the baselines on domain generalization tasks for this highly divergent dataset.

  • In addition to the uncertainty condition:

$$
e_k \leq (1+\gamma) \min_{j\in \{1, 2, ..., K\}}(e_j), \quad \forall k,
$$

we introduced a soft constraint by capping the maximum number of iterations at 5. After 5 iterations, no additional codewords are added, even if the uncertainty criterion remains unsatisfied.
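A minimal sketch of this combined stopping rule (the uncertainty criterion plus the iteration cap); the function and variable names are illustrative assumptions, not the authors' code.

```python
# Illustrative check of whether another codebook-expansion iteration is needed.
def needs_expansion(uncertainties, gamma, iteration, max_iters=5):
    # uncertainties: per-client uncertainty values e_1, ..., e_K
    if iteration >= max_iters:          # soft constraint: no expansion after 5 iterations
        return []
    e_min = min(uncertainties)
    threshold = (1 + gamma) * e_min     # e_k <= (1 + gamma) * min_j e_j must hold for all k
    return [k for k, e_k in enumerate(uncertainties) if e_k > threshold]  # violating clients
```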

[2] Nguyen, A. Tuan, Philip Torr, and Ser Nam Lim. "Fedsr: A simple and effective domain generalization method for federated learning." Advances in Neural Information Processing Systems 35 (2022): 38831-38843.

Comment

Dear Reviewer 4szV,

Thank you for your thoughtful comments on our manuscript. We have carefully addressed all of your concerns in our rebuttal.

If you have any further questions or require additional clarification, please let us know, and we will promptly address them. Your feedback is invaluable for enhancing the quality of our work.

As the discussion session is nearing its conclusion and we have not yet received a response from you, we kindly request your timely feedback to ensure we have enough time to address any remaining issues. We sincerely appreciate your time and effort.

Best regards,

Comment

Dear Reviewer 4szV,

I hope this message finds you well. I am reaching out to kindly remind you that today is the final day for reviewers to post comments during the discussion phase.

We have carefully addressed all of your concerns in our rebuttal and are eager to ensure that we have fully clarified any outstanding points. If there are further concerns or questions, please let us know at your earliest convenience so that we can respond effectively.

Your feedback is invaluable to improving our manuscript, and we greatly appreciate the time and effort you have devoted to reviewing it.

Thank you once again for your thoughtful input, and we look forward to hearing from you.

Best regards,

AC Meta-Review

Post-rebuttal, the majority of the reviewers still judged the contribution of the paper as fair. I appreciate the effort the authors have put into their rebuttal and the updates to the draft, which are very substantial and helpful in the appendices, but in the end the paper unfortunately does not pass the bar for acceptance. I can only encourage them to further update the paper for resubmission, perhaps digging further into the theory that they currently cite from another paper.

Additional Comments from Reviewer Discussion

Many changes in the draft, especially in the appendix, at rebuttal time.

Final Decision

Reject