5.3

/10

Poster4 位审稿人

最低4最高6标准差0.8

3.8

置信度

正确性2.3

贡献度2.3

表达3.0

NeurIPS 2024

FedGTST: Boosting Global Transferability of Federated Models via Statistics Tuning

Evelyn Ma,Chao Pan,S. Rasoul Etesami,Han Zhao,Olgica Milenkovic

OpenReview PDF

提交: 2024-05-15更新: 2024-11-06

TL;DR

This paper presents FedGTST, a federated learning algorithm that improves global transferability using cross-client statistics, significantly outperforming existing methods.

摘要

关键词

Federated LearningTransfer Learning

评审与讨论

审稿意见

评分: 6置信度: 32024-06-24

In this paper, the authors proposed a method to address the problem of transfer learning under FL. Specifically, after a thorough theoretical analysis, the authors observe that the cross-client averaged Jacobian norm and its variance control the bound on the target loss. Therefore, they propose FedGTST to properly increase the norm while reduce its variance. The proposed method achieves promising results on several benchmark datasets.

优点

The structure and writting of the paper is clear.
The proposed method is effective, and the executed experiments are thorough.
The theoretical analysis is sound.

缺点

The adopted benchmark datasets are a bit easy. Since the authors already used ResNets in the experiments, comparing on more complex datasets, e.g., PACS, DomainNet would make the experimental results more convicing.
From the definition introduced in Section 3, only the classifier will be optimized during the federated communication. Is the proposed method also effective when the feature extractors are also optimized during FL?
Is the proposed method also applicable to settings when there exists mutliple source domains?

问题

Please refer to the Weaknesses section.

局限性

The authors have discussed the limitations of the proposed method.

作者回复

2024-08-07

We thank the reviewer for their comments. Our answers are as follows.

We additionally conducted experiments on DomainNet, and the results are presented below.

Following FedSR, we applied a leave-one-out strategy, where one domain is treated as the target domain and the other five are source domains.

The target domains listed in the table are: C (Clipart), I (Infograph), P (Painting), Q (Quickdraw), R (Real), and S (Sketch). |Model | C | I | P | Q | R | S | AVG | |--------|----------|-----------|----------|----------|----------|----------|------| | FedAVG | 59.3±0.7 | 16.5±0.9 | 44.2±0.7 | 10.8±1.8 | 57.2±0.8 | 49.8±0.4 | 39.6 | | FedSR | 61.0±0.6 | 18.6±0.4 | 45.2±0.5 | 13.4±0.6 | 57.6±0.2 | 51.8±0.3 | 41.3 | |FedGTST| 63.9±0.5 | 20.7±0.3 | 47.8±0.4 | 15.2±0.5 | 59.5±0.6 | 54.3±0.2 | 43.6 |

The proposed method is effective when the feature extractors are also optimized during FL.

In our setting, the classifier is not the only component optimized during federated communication. In addition to the classifier, the feature extractor is also updated. As stated in Line 152, "Let $h^*=g^*\circ f^*$ be an optimal solution for the source objective 2", which indicates that both the feature extractor f and the classifier g are optimized during federated pretraining.
Our current setting indeed assumes multiple source domains.

As mentioned in Lines 148–149, "...federated pretraining where the source domain is a union of the agents’ local domains," we use multiple domains for pretraining. Furthermore, in Lines 302–308 (Experiments), we explained the construction of the agents’ local domains: "We … to construct non-iid local domains". In Appendix G, Line 578, we reiterate, "For the source domains MNIST and CIFAR-10, we only allow each local client to have access to…".

Additionally, in our response to the reviewer’s question about adding results on DomainNet, our setting also involves letting clients have access to different domains, enabling multiple source domains.

We appreciate the reviewer’s feedback, and we are happy to discuss further with the reviewer to address any remaining questions or concerns.

2024-08-12

Thanks for the rebuttal. I would like to retain my original score.

2024-08-14

We thank the reviewer once again for their valuable comments and for going over our rebuttal response.

审稿意见

评分: 6置信度: 42024-07-09

This work addresses transferability of models pre-trained using Federated Learning (FL). Through a thorough theoretical analysis, the authors identify cross-client averaged Jacobian norm and cross-client Jacobian variance as a key factor influencing transferability. To retain good transferability, FedGTST controls cross-client averaged Jacobian norm through a guide norm determined by the server while the cross-client Jacobian variance is controlled through local regularisation at the clients based on the guide norm. Different from existing works, FedGTST prevents privacy leakage, and maintains communication efficiency by only communicating additional scalars.

优点

Thorough theoretical analysis quantifying loss on target domain in terms of the constituents from FL training.
Proposed algorithm is communication efficient, satisfying a crucial requirement of federated settings.
Proposed algorithm requires clients to transfer only scalar values in addition to locally updated parameters, preventing any further privacy leakage beyond sharing model parameters.
Empirical results cover different model sizes, networks with up to 100 clients and non-IID local domains, simulating diverse FL setups.

缺点

The variance of cross-client Jacobian appears to be a property of the non-IID local domains. For instance, when clients have IID data, the variance would be inherently low and thus naturally promote good transferability. Thus, when considering non-IID local domains, the proposed algorithm should be compared for transferability against advanced FL algorithms that address data heterogeneity and therefore would naturally control cross-client Jacobian variance, e.g. FedAdam, Scaffold and FedNova. FedAvg seems to be a weaker baseline.
While the theoretical analysis identifies cross-client Jacobian as a key influencing factor for transferability, intuitive explanations regarding the same are missing. Why does inflating norms help transferability? Does a higher or lower average norm during the initial stages affect the generalisation error of the FL pre-training and consequently transferability? If this is true, then the quest reverts back to learning good FL models under non-IID data instead of transferability. It would be great if authors could provide detailed insights/discussion regarding this.
Certain assumptions in the analysis seem too strong and could be impractical – i) one local step (assumption 4.2) and ii) Gradient Descent (GD) instead of Stochastic Gradient Descent (SGD) (assumption 4.2). FL hinges on the idea of executing multiple local steps to achieve communication efficiency. Furthermore, most standard FL/ML algorithms execute SGD due to the well known computationally expensive nature of GD which evaluates a single gradient on the entire local dataset. The authors should clarify the implications of loosening these two assumptions on the theoretical analysis.

Minor comments:

The paper has several typos making reading comprehension difficult:

Line 136: “does not communicating information that …”
Line 319: “We do not report results for … FedGTST …”
Line 8 in Algorithm: $\gamma_p^{unl, (k)}$
Line 523 (Incorrect reference): “Convergence results are plotted in Figure C”.

Convexity assumption is certainly not easy to meet in practice for today’s ML. Neural network loss functions are non-convex which are the heart of most of the modern ML.

The reviewer is happy to reconsider the rating once the authors address/clarify the above issues.

问题

Does the experimental setup consist of 1 local step or 1 local epoch (similarly in Table 5)?. The theoretical analysis assumes one local step, however the experiments mention one local epoch.
On the same lines, does the experimental evaluation use GD or SGD? Can the authors show empirical evaluations with SGD and multiple local epochs if not already?
What is the effect of removing the surrogate Jacobian norm evaluation with standard training? Can the authors show an ablation study where there are no clients computing $\gamma_p^{std, (k)}$ and clients only send $\gamma_p^{reg, (k)}$ ?

局限性

Missing important non-IID FL baselines (comment (1) in weaknesses).
Lacking discussion of cross-client average Jacobian norm and its relation to data heterogeneity (comment (2) in weaknesses).
Certain assumptions seem impractical (comment (3) in weaknesses).

作者回复

2024-08-07

We thank the reviewer for their comments. Our answers are below.

Additional Baseline. We discuss Scaffold, a baseline suggested by the reviewer (and will cite this and other suggested works). They are as follows:

10 clients: | | MNIST → MNIST-M (LeNet)| MNIST → MNIST-M (ResNet) | CIFAR10 → SVHN (LeNet)| CIFAR10 → SVHN (ResNet) | Average| |--------|-------|---------|------|------|------| | Scaffold | 75.6 ± 0.8 | 80.8 ± 0.3| 66.0 ± 0.5 | 71.1 ± 0.4| 73.3 | | FedGTST | 76.2 ± 0.9 | 82.3 ± 0.5| 70.1 ± 0.8 | 74.5 ± 0.3| 75.8 |
100 clients: | | MNIST → MNIST-M (LeNet)| MNIST → MNIST-M (ResNet) | CIFAR10 → SVHN (LeNet)| CIFAR10 → SVHN (ResNet) | Average| |--------|-------|---------|------|------|------| | Scaffold | 52.3 ± 0.5 | 63.1 ± 0.3| 45.5 ± 0.1 | 55.5 ± 0.3| 54.1 | | FedGTST | 57.5 ± 0.3 | 67.6 ± 0.2| 52.4 ± 0.1 | 63.1 ± 0.2| 60.2 |

Intuitive explanations for Theorem 2

Tuning Cross-Client Statistics Improves Transferability.
- Larger $|J_p|$ in initial stages enhances transferability. By preventing small averaged gradients, we prevent clients from becoming trapped in individual's local minima or overfitting locally. As training progresses, the averaged gradient norm naturally decreases due to convergence.
- Smaller $\sigma_p$ induces similarity across different domains. This helps avoid local overfitting and improves global transferability.
Clarification on Gradient Norm Behavior.

The RHS of the equation cannot be $-\infty$ . The averaged gradient norm $\|J_p\|$ in term 2 of the RHS is bounded by the smoothness parameter $\alpha$ . Specifically, the increment of local gradients is upper-bounded by $\|J_1 - J_2\| \leq \alpha \|x_1 - x_2\|$ .
Trade-Off Between Enlarging $\|J_p\|$ and Decreasing $\sigma$

A larger $|J_p|$ leads to more substantial local model updates, increasing variance $\sigma_{p+1}$ in subsequent rounds. Thus, it's essential to balance this trade-off.
Higher initial average Jacobian norms do not necessarily decrease generalization error.

Let's begin by comparing the definitions of generalization error and transferability as defined in our manuscript.
- Generalization error is defined as $L_{D_T}(h_{\text{pretrained}}) - L_{D_S}(h_{\text{pretrained}})$ .
- Transferability measure in our paper is $L_{D_T}(h_{\text{finetuned}})$ .
The key differences between these two concepts are: a) Whether $h$ is finetuned; b) Whether source error is subtracted.

From Theorem 2, we can see that cross-client statistics help the transferability defined by us by controlling its upper bound, and we provide an intuitive explanation above.

However, tuning of statistics does not necessarily decrease generalization error. Specifically:
1. $L_{D_T}(h_{\text{pretrained}})$ is not necessarily smaller, since even if the pretrained feature extractor $f_{\text{pretrained}}$ transfers well to the target domain, the pretrained classifier $g_{\text{pretrained}}$ , without any finetuning on the target domain, can result in a high $L_{D_T}(h_{\text{pretrained}})$ .
2. $L_{D_S}(h_{\text{pretrained}})$ is not necessarily larger, as tuning the statistics intuitively prevents clients from getting stuck in their individual local minima or overfitting, thus reducing cross-client source loss.
To summarize, tuning the statistics may lead to a higher $L_{D_T}(h_{\text{pretrained}})$ and a lower $L_{D_S}(h_{\text{pretrained}})$ , and therefore, does not decrease generalization error.

Assumptions Clarification
- The single-step assumption is commonly used in FL theoretical analyses - see FedSplit [2] and FEDAC [3].
  
  [2] Pathak, Reese, and Martin J. Wainwright. "FedSplit: An algorithmic framework for fast federated optimization." Advances in neural information processing systems 33 (2020): 7057-7066.
  
  [3] Yuan, Honglin, and Tengyu Ma. "Federated accelerated stochastic gradient descent." Advances in Neural Information Processing Systems 33 (2020): 5332-5344.
- GD assumption
  
  The reviewer's point is correct. However, GD can be extended to stochastic learning with batch sampling. The additional randomness in sampling would require incorporating the variance of batch sampling into the generalization bound. This variance term, being independent of the algorithm design, was omitted in our theoretical analysis.
Minor clarifications
- We appreciate the reviewer pointing out typo issues, and will make sure to fix them.
- Convexity assumptions are solely made for theoretical tractability, and such simplifications were also considered in SCAFFOLD [1], FedSplit [2], etc.
  
  [1] Karimireddy, Sai Praneeth, et al. "Scaffold: Stochastic controlled averaging for federated learning." International conference on machine learning. PMLR, 2020.
  
  [2] Pathak, Reese, and Martin J. Wainwright. "FedSplit: An algorithmic framework for fast federated optimization." Advances in neural information processing systems 33 (2020): 7057-7066.
By epoch in experiments, we mean step. In the theoretical part of the paper, we dealt with steps, while in practical evaluations, we referred to epoch. Sorry for the confusion, we will clarify this in the revision.
We use SGD for experimental evaluation.
Ablation Study: removing the surrogate Jacobian norm.

What the reviewer is asking is part of our additional baseline cross-examination involving Scaffold, which uses Jacobian regularization but does not consider surrogate Jacobian norm. Please see the results presented in the first bullet point.

We appreciate the reviewer’s feedback, and we are happy to discuss further to clarify any remaining questions.

2024-08-13

Thank you for your response. Based on the clarifications provided, I will adjust my rating to ‘borderline accept’.

However, the authors have still not convincingly responded to my main concern regarding algorithms that address non-IID local domains. It is unclear how solving for heterogeneity in FL relates or remains different from solving for transferability. Evaluation alone on Scaffold is insufficient to address this concern. Thus, my raised score remains at 5.

2024-08-13

Thank you for your useful technical feedback and for reconsidering your rating. We also appreciate your willingness to engage in further discussion - our goal with this response is to provide further clarifications rather than solicit score changes.

We understand your concern regarding the relationship between methods that handle heterogeneity in federated learning (FL) and those that aim to enhance transferability. We thought that we managed to answer it in the rebuttal, but maybe some helpful details were missing. The first thing to point out is the conceptual difference between the two approaches: FL on heterogeneous domains aims to boost the performance on the source domains (i.e., test domains that do not significantly differ from the training domains); transferable FL, on the other hand, focuses on boosting the performance on a target domain, which may differ significantly from the source training domains. Hence, you are right that both methods aim to address challenges with non-iid data, and it is true that transferability may potentially improve generalization across diverse local domains. But, to point it out again, the key difference is to achieve good performance on the target domain even if there is a drop in the performance on source domains (transferable FL) versus achieve good performance over the original source domains.

In terms of differences in the approach, and in connection to our method, we would like to mention that while reducing cross-client variance is indeed linked to better generalization, our approach differs from traditional heterogeneous FL by enforcing a large average Jacobian norm $\|J_p\|$ in the early stages of the information exchange process. A large value of $\|J_p\|$ may compromise the level of generalization across local domains by encouraging larger local updates and consequently higher cross-domain model variance. However, this also improves transferability by discouraging premature convergence to local optima.

2024-08-14

It would be interesting to see if there do actually exist cases where boosting transferability (reasonably) hurts source domain generalization or inversely, optimizing for source domain generalization (reasonably) hurts transferability.

I appreciate the author's efforts in providing detailed clarifications and thank them for the same.

2024-08-14

We appreciate the reviewer's insightful suggestion on showcasing examples where improving transferability might reasonably hurt generalization, or vice versa.

Below, we provide: 1) some current results that show that Fed-GTST improves transferability at a slight cost of source domain generalization; and 2) reference to existing literature related to cases where promoting transferability may not always "align" with generalization quality.

We report results in two columns: a) Generalization: averaged accuracy evaluated on the source testing domains, where the model is pre-trained on source training domains, and b) Transferability: the accuracy on the target domain, where the model is fine-tuned on the target domain after being pre-trained on the source training domains.

For MNIST → MNIST-M (100 clients, local models based on ResNet-18):

	avg acc. across Source Testing Domains	acc. on Target Domain
Scaffold	83.1 ± 0.2	63.1 ± 0.3
FedGTST	82.9 ± 0.5	67.6 ± 0.2

For CIFAR10 → SVHN (100 clients, local models based on ResNet-18):

	avg acc. across Source Testing Domains	acc. on Target Domain
Scaffold	78.4 ± 0.7	55.5 ± 0.3
FedGTST	76.7 ± 0.3	63.1 ± 0.2

We observe that the transferable FL method (FedGTST) demonstrates better transferability (second column), while its generalization performance (first column) is comparable to or slightly lower than that of the heterogeneous FL method (e.g., Scaffold). Conversely, Scaffold is optimized for strong generalization on source test domains but is outperformed by Fed-GTST in terms of target domain performance (which we use to measure transferability).

Additionally, adversarially robust models offer another example where improving transferability may come at the expense of generalization. As shown in [1], adversarially robust models tend to have better transferability. However, since these models are designed to perform well against adversarial manipulations or perturbations, they do not necessarily exhibit lower generalization error on "clean" test data. In this scenario, transferability may not have any influence on generalization or even work against it.

[1] Salman, H., Ilyas, A., Engstrom, L., Kapoor, A., and Madry, A. "Do Adversarially Robust ImageNet Models Transfer Better?" arXiv preprint arXiv:2007.08489, 2020.

2024-08-14

Thank you for sharing the results to better understand generalization vs transferability.

Since all my concerns have been addressed, I am happy to update my score to 'weak accept'.

2024-08-14

We thank the reviewer once again for their valuable feedback and for the constructive discussion.

审稿意见

评分: 5置信度: 42024-07-13

The paper presents FedGTST, a novel federated transfer learning approach that achieves transferability over the global learning domain and accurate assessment of the degree of transferability. The proposed method utilizes a new regularizer that encodes cross-client statistics and adjusts the local training process towards global transferability. This is achieved by subtractions of global norms of Jacobians communicated by the server. The authors also use two FL-specific factors, cross-client Jacobian variance and cross-client Jacobian norm as direct performance indicators. The proposed method only communicates Jacobian norms, which is a scalar, thus achieving reduced communication overhead and ensuring privacy. The authors also provide theoretical analysis and experimental results demonstrating the effectiveness of FedGTST over existing baselines.

优点

Overall this paper is well written and tecchnically sound. The authors present a detailed literature review of existing work related to Federated Transfer Learning and point out three major drawbacks: privacy leakage, local overfitting and huge communication bottleneck.

FedGTST has the following contributions:

The studied problem is novel and may have practical usage in real-world applications.
It uses a novel regularizer that focuses on the cross-client statistics to enhance the global transferability.
It introduces the cross Jacobian variance and norm as the direct indicator of the transferability.
It communicates only the scalar to reduce communication overhead and protect the privacy.

This paper also provides detailed and comprehensive analysis on the literature of Federated Transfer Learning, which eases the understanding for readers. Rigorous theoretical analysis and extensive experiments are also included to supports the proposed method.

缺点

Weaknesses

The non-iid setting is not common. Most of FL methods utilize the dirichlet sampling to simulate the Non-IID distribution [1-7].
Estimating the quality of the solution f* and using idea from the domain generalization is interesting. Ref [7] also considers measuring the local training through the lens of the domain gap, which can be added for discussion.
The hyper-parameters are not given, which reduces the reproducibility of this work.
The assumption 4.3 require the optimal learning rate, which may require delicate hyper-parameter tuning.
It would be great if the authors add conceptual explanations for the proposed method.

[1] Model-contrastive federated learning. In CVPR 2021. [2] No Fear of Heterogeneity: Classifier Calibration for Federated Learning with Non-IID Data. In NeurIPS 2021. [3] Federated Learning Based on Dynamic Regularization. In ICLR 2021 [4] Virtual Homogeneity Learning: Defending against Data Heterogeneity in Federated Learning. In ICML 2022. [5] DENSE: Data-Free One-Shot Federated Learning. In NeurIPS 2022. [6] FedScale: Benchmarking Model and System Performance of Federated Learning at Scale. In ICML 2022. [7] FedImpro: Measuring and Improving Client Update in Federated Learning. In ICLR 2024

问题

Please refer to the weaknesses.

局限性

Yes, the authors discussed the limitations of this work, including local computational cost and potentially loose bound.

作者回复

2024-08-07

We thank the reviewer for their comments. Our answers are as follows.

We agree with the reviewer that there are other non iid sampling methods that may be more frequently used than ours. We therefore implemented the Dirichlet sampling. Following the reviewer’s suggested references, we set the concentration parameter to 0.5. The results clearly show that our method still outperforms others even when client sampling strategies change.

Results for 10 clients: | Method | MNIST → MNIST-M (LeNet)| MNIST → MNIST-M (ResNet) | CIFAR10 → SVHN (LeNet) | CIFAR10 → SVHN (ResNet) | Average | |--------------|--------------|-------------|--------------|-------------|--------| | FedAVG | 74.1±0.6 | 82.2±0.3 | 65.2±0.4 | 72.8±0.9 | 73.5 | | FedSR | 75.5±0.8 | 82.0±0.2 | 66.1±0.5 | 72.1±0.3 | 73.9 | | FedIIR | 75.8±0.2 | 82.8±0.6 | 66.5±0.9 | 74.6±0.1 | 74.9 | | FedGTST| 77.3±0.8 | 82.7±0.4 | 70.8.0±0.7 | 75.2±0.2 |76.5 |
Results for 100 clients: | Method | MNIST → MNIST-M (LeNet)| MNIST → MNIST-M (ResNet) | CIFAR10 → SVHN (LeNet) | CIFAR10 → SVHN (ResNet) | Average | |--------------|--------------|-------------|--------------|-------------|--------| | FedAVG | 50.4±0.1 | 63.0±0.3 | 43.3±0.2 | 54.6±0.5 | 52.8 | | FedSR | 52.7±0.2 | 62.9±0.3 | 44.7±0.1 | 56.5±0.3 | 54.2 | | FedIIR | 54.2±0.4 | 64.1±0.1 | 47.4±0.4 | 58.7±0.2 | 56.1 | | FedGTST | 59.5±0.3 | 69.2±0.2 | 55.1±0.1 | 65.6±0.2 | 62.4 |

Thank you for your suggestion. We will add a discussion of [7] into the revision as follows. A recent work, FedImpro [7], considers the domain generalization problem over different clients in the FL setting. Specifically, FedImpro assumes that the server can maintain an estimate of the aggregated feature distribution, which is used as an extra term in the client gradient computation to promote lower gradient dissimilarity and thus better generalization ability. Unlike our approach, FedImpro does not use cross-client statistics explicitly during client updates.
We would like to point out that the hyperparameters are described in Lines 321-333. If the reviewer would like additional information, please do let us know.
The optimal learning rate in assumption 4.3 can be directly computed in practice. An assumption that we made in 4.3 is that we can know the exact value of $\alpha$ , as all other terms can be computed explicitly from the client statistics of round p-1. The assumption is only used for theoretical analysis. In practice we can always estimate $\alpha$ . This type of assumption is also adopted in other papers, such as [Haddadpour, 19].

[Haddadpour, 19] Farzin Haddadpour and Mehrdad Mahdavi. On the convergence of local descent methods in federated learning. arXiv preprint arXiv:1910.14425, 2019.
We appreciate the reviewer’s suggestion. A detailed conceptual explanation of Theorem 2 is follows:

Tuning Cross-Client Statistics Improves Transferability:
- Large $\|J_p\|$ Norm: A large $\|J_p\|$ intuitively enhances transferability. In the early stages of training, it is crucial to prevent the averaged gradient from becoming too small. Small local gradients can cause clients to become trapped in local minima or overfit. As training progresses toward its final stages, the averaged gradient norm will naturally decrease due to convergence.
- Small Cross-Client Gradient Variance: A small cross-client gradient variance improves transferability by inducing similarity across different domains. This helps avoid local overfitting and promotes better global transferability.
Clarification on Gradient Norm Behavior: It is impossible for the right-hand side (RHS) of the equation to be $-\infty$ . The averaged gradient norm $\|J_p\|$ in term 2 of the RHS cannot approach $\infty$ because it is bounded by the smoothness parameter $\alpha$ . Specifically, the increment of local gradients is upper-bounded by $\|J_1 - J_2\| \leq \alpha \|x_1 - x_2\|$ .
Trade-Off Between Enlarging $\|J_p\|$ and Decreasing $\sigma$ : A larger $\|J_p\|$ results in more substantial updates of the local models, which can intuitively lead to an increased variance $\sigma_{p+1}$ in subsequent rounds. Therefore, Theorem 2 highlights that when implementing the approach, it is essential to balance this trade-off—i.e., increase $\|J_p\|$ while keeping $\sigma$ small.

We appreciate the reviewer’s feedback, and we are happy to discuss further with the reviewer to clarify any remaining questions or concerns.

2024-08-12

Thanks for the clarification. I will keep my rating of 5.

2024-08-14

We appreciate the reviewer's valuable feedback once again and thank them for reviewing our rebuttal response.

审稿意见

评分: 4置信度: 42024-07-15

This work aims to improve the transferability of pre-trained models under federated learning. It established a cross-domain generalization error bound under federation. Motivated by this, the authors devise a regularization method leveraging cross-client statistics to minimize the generalization error upper bound. They evaluate the performance of the proposed method on several cross-domain transfer learning tasks.

优点

The reviewer appreciates the theoretical analysis provided in this paper, despite that this can only be achieved on a convex model. The authors did a good job of connecting these theoretical results to their method design.
The proposed method is more efficient in terms of communication cost than existing works for cross-domain generalization.

缺点

Some important experimental settings are missing. It is not clear that (1) how the pre-trained models on source dataset are obtained (from centralized or federated pre-training?), (2) the availability of target domain data or labels. These important details are absent from the main text, appendix and code description, making it hard to understand the context of this work. The reviewer can only infer the task from some fragment of information. For example, given the evaluated experiment on CIFAR10 -> SVHN, the reviewer assumes the target domain data and labels are available for finetuning. Based on this, the task of interest is more like an inductive transfer learning task rather than domain generalization.
If transfer learning under federation is the main focus of this work, then the datasets considered are too small for a comprehensive comparison of evaluated methods. For example, finetuning pre-trained models learned from ImageNet on downstream datasets like Stanford Cars, and Oxford Pets [1] are frequently used in the existing literature. Otherwise, it makes no sense to transfer learn features from a small dataset to the other small dataset.
In terms of the compared baselines, it could be better to consider existing transfer learning approaches, since it is a direct adaptation to federated learning.
There remains a gap between the generalization bound and transferability measurement. In fact, it remains not clear why cross-client statistics can be introduced as a transferability indicator.
From Algorithm 1, I can only guess the proposed method prevents local overfitting by preventing a very small Jacobian norm. If the proposed method works in such a way, then existing works use L2 norm [1] of feature map for regularization should also be considered as baselines.

问题

In addition to the above weakness points, here are the additional minor issues:

In the introduction, Line 46 to Line 74 seem too verbose when presenting the limitations of the existing methods. It is suggested to give a distilled summarization of these methods and defer the detailed description to the appendix or related work, and highlight the advantages of the proposed method. Besides, if these methods are not compared or simply focus on different tasks with this work, then it is not necessary to discuss them here.
Line 299 “We use two sets of source-domain dataset pairs for FTL” Do you mean “source-target” dataset?
Line 307 “we let each client have access to only part of the labels.” This indeed to some extent confusing. I suppose this intends to refer to a semi-supervised cross-domain generalization task with part of the “labels”. After checking Appendix G, I know each client holds a fractional of “classes” with full annotations.
Line 341, “Besides the significant performance game of FedGTST over baseline methods”. Do you mean performance “gain”?
Line 523, Figure C, this link is broken.
Line 524, “FedGTST does not only offer better transferability”. -> Remove “does”.
As mentioned in the weakness, to figure out the task and evaluation metric, I checked the code provided. Frustratingly, I did not find much useful instruction, and the title of the README file remains the title of another paper “On the Convergence of FedAvg on Non-IID Data”. For a submission targeting such a venue, the authors should at least provide a decent description of their code implementation and evaluation criteria for reproduction.
It is recommended to discuss existing works leveraging Jacobian matching in transfer learning.
Theoretical insights (Line 266) could be moved to Section 4.

局限性

Limitations are discussed sufficiently.

作者回复

2024-08-07

We thank the reviewer for their comments. Our answers are as below.

Experimental Setting:
1. Federated Pre-training was explained in the original manuscript. In Lines 148–149, we said: “...federated pretraining where the source domain is a union of the agents’ local domains.” Hence, we indeed used federated pretraining with the source domain comprising data from all local domains. In Appendix G, Line 579, we reiterated the claim: “For the source domains MNIST and CIFAR-10, we only allow each local client to have access to…”
2. Target domain data or labels are available. Target domain data and labels were indeed available, as stated in Lines 147 and 151. One can see from Equations (1) and (3) that the target loss is defined based both on the data and labels.
3. Yes, we are indeed performing inductive transfer learning.
Additional Dataset. We additionally run experiments on DomainNet, as suggested by another reviewer, to accommodate both requests. DomainNet is a widely used large dataset for both transfer learning and transferable federated learning, such as FedSR. Following FedSR we applied a leave-one-out strategy, where one domain is treated as the target domain and the other five are source domains. The target domains listed in the table are: C (Clipart), I (Infograph), P (Painting), Q (Quickdraw), R (Real), and S (Sketch). The results confirm that FedGTST still outperforms other methods from the literature. | Model | C | I | P | Q | R | S | AVG | |-|-|-|-|-|-|-|-| | FedAVG | 59.3±0.7 | 16.5±0.9 | 44.2±0.7 | 10.8±1.8 | 57.2±0.8 | 49.8±0.4 | 39.6 | | FedSR | 61.0±0.6 | 18.6±0.4 | 45.2±0.5 | 13.4±0.6 | 57.6±0.2 | 51.8±0.3 | 41.3 | | FedGTST | 63.9±0.5 | 20.7±0.3 | 47.8±0.4 | 15.2±0.5 | 59.5±0.6 | 54.3±0.2 | 43.6 |
Baseline clarification.
1. We indeed considered adaptations of existing transfer learning approaches to federated learning: FedSR, one of the baselines in our comparative study, is a direct adaptation of transfer learning. We will explain how FedSR performs this adaptation in the revision.
2. FedSR also represents an example of using L2 norm regularization on mapped features. Also, we would appreciate it if the reviewer could clarify what reference [1] is, as we do not believe it corresponds to the reference in our submission. Other examples of using L2 regularizers in transfer learning include Li et al., Explicit inductive bias for transfer learning with CNs, etc.
Transferability measure, theoretical bound and cross-client statistics.
- Theorem: As in Lines 157-158, we use the optimal target loss as the transferability performance measure. More precisely, in Theorem 2, Line 251, the LHS represents the actual transferability measure. Proper tuning of the cross-client statistics influences transferability by controlling the RHS, which is an upper bound on the transferability metric on the LHS.
- Intuition: Intuitively, a large averaged Jacobian norm in the early stages helps prevent clients from getting stuck in local minima or local overfitting, while a small cross-client Jacobian variance promotes similarity across domains, both of which enhance global transferability. Further clarifications regarding the Jacobian norm are: 1) the increment of the Jacobian is controlled as $\|J_1 - J_2\| \leq \alpha \|x_1 - x_2\|$ , ensuring that the second term on the RHS of Theorem 2 cannot be $\infty$ ; 2) An average Jacobian norm will naturally decrease in the final stages due to convergence.
We appreciate the reviewer pointing out the typos, phrasing and reorganization issues. We will correct these errors in the revised manuscript.
Regarding the README file, we sincerely apologize for this oversight. This was an unfortunate copy-paste error that we can easily correct since we always had a ready-to-use instruction manual for the code. Due to space limitation, we can only repeat some parts of our intended README file below:

Sample run command (i.e., 100 clients with resnet18):

python main.py --num_clients 100 –back_bone resnet18

Our evaluation metric is implemented in the code contained in main.py, Line 207.
Please see the discussion below regarding gradient matching in transfer learning, and challenges of applying it directly to federated learning.
- Discussion: Gradient matching in transfer learning are approaches of aligning gradients between source and target domains. The work Gradient Matching for Domain Generalization explores how matching gradients between the source and target domains can lead to better domain generalization by ensuring that the learned representations are robust to domain shifts. Fishr: Invariant Gradient Variances for Out-of-Distribution Generalization extends this idea by focusing on invariant gradient variances, which helps in maintaining generalization performance when faced with out-of-distribution samples. Understanding Hessian Alignment for Domain Generalization investigates the role of Hessian alignment in gradient-based methods, highlighting how aligning the Hessian matrices of the source and target domains can further improve generalization performance. All these and other suggested works will be cited in the revised manuscript.
- Challenges: 1) Privacy leakage. Such approaches violate data privacy because source clients are given access to the target domain. 2) Local overfitting. Clients perform training solely on their local models, representations, and labels, thus intuitively end up overfitting in the local domain.
- Our contribution: We ensure that our communication schemes prevent uncontrolled access to data, thereby maintaining data privacy, and that our approach enhances global transferability rather than focusing solely on local transferability.

We appreciate the reviewer’s feedback, and we are happy to provide additional clarifications.

作者回复

2024-08-07

Dear Reviewers and AC,

We would like to thank all reviewers for their careful reading and insightful suggestions on our paper. We have prepared a detailed, point-by-point response to each concern raised by the reviewers. Below, we summarize the major questions posed by multiple reviewers:

Theory: We have clarified our problem formulation, assumptions, and the intuition behind our theoretical results. Specific answers to the questions can be found in our detailed response below.
Simulation: We have clarified our experimental setup (i.e., hyperparameters), conducted new experiments on the DomainNet dataset, compared our methods with the Scaffold baseline, and used Dirichlet sampling to distribute data across different clients. In almost all new settings, our method still achieves state-of-the-art performance. Detailed performance metrics for each method are provided in the tables below.
Paper Presentation: We have corrected several typos. We appreciate the reviewers' attention to these details.

We hope our responses adequately address the reviewers' concerns. We are open to further discussion if there are any additional questions.

Best Regards, The Authors

评论- Paper discussion

2024-08-12

Hi all,

Thanks again for your reviews! If you haven't done so already, please respond to the rebuttals before the end of the author-reviewer discussion period. If you don't have any further questions for the authors then a simple "thank you for the rebuttal" would suffice.

All the best, Area Chair

评论- Paper Discussion is Going End

2024-08-14

Dear Reviewer kpC8,

Thanks for your review! The author-reviewer discussion period is going end. Does authors' rebuttal answer your questions? Could you please give a response?

All the best, AC

最终决定Accept (poster)

2024-09-25

FedGTST has the following contributions:

The studied problem is novel and may have practical usage in real-world applications.
It uses a novel regularizer that focuses on the cross-client statistics to enhance the global transferability.
It communicates only the scalar to reduce communication overhead and protect the privacy.
Overall this paper is well written and tecchnically sound. The authors present a detailed literature review of existing work related to Federated Transfer Learning.

Weakness:

Experiments: In the original version, the datasets are too small, the Non-IID setting is not commom, and the baseline is not sufficient. During the rebuttal, authors add experiments on a large dataset DomainNet, and use Dirichlet sampling for Non-IID setting. When considering non-IID local domains, the proposed algorithm should be compared for transferability against advanced FL algorithms that address data heterogeneity and therefore would naturally control cross-client Jacobian variance, e.g. FedAdam, Scaffold and FedNova. FedAvg seems to be a weaker baseline. Authors add a new experiments for Scaffold.
Certain assumptions in the analysis seem too strong and could be impractical –Gradient Descent (GD) instead of Stochastic Gradient Descent (SGD) (assumption 4.2). Most standard FL/ML algorithms execute SGD due to the well known computationally expensive nature of GD which evaluates a single gradient on the entire local dataset.