PaperHub
Rating: 5.5/10 · Rejected · 4 reviewers
Individual ratings: 5, 6, 6, 5 (min 5, max 6, std 0.5)
Confidence: 4.5 · Soundness: 2.5 · Contribution: 2.0 · Presentation: 2.5
ICLR 2025

Personalized Federated Fine-tuning for Heterogeneous Data: a Two-Level Low Rank Adaptation Approach

OpenReview · PDF
Submitted: 2024-09-28 · Updated: 2025-02-05
TL;DR

This paper introduces a two-level adaptation approach for federated foundation model fine-tuning with heterogeneous data.

Abstract

Keywords

Federated Learning, Low Rank Adaptation, Heterogeneous Data, Language Model, Foundation Model

Reviews and Discussion

Review — Rating: 5

This paper proposes a two-level low-rank adaptation approach to achieve personalized federated learning, addressing data heterogeneity in federated learning. Experiments on multiple datasets have shown the effectiveness of the proposed method.

Strengths

  1. The attempt to introduce LoRA into federated learning to exploit pretrained models is interesting to me.
  2. The paper is well organized, making it easy for readers to understand the background and the proposed method.

Weaknesses

  1. The authors claim that existing studies utilizing LoRA in federated learning mainly focus on HOMLoRA rather than personalized federated learning. However, I have found multiple studies that use the LoRA technique to achieve personalized federated learning [1-5]. Some of them also introduce a two-level adaptation idea [2], [5]. Can the authors include a detailed comparison with these studies and discuss the novelty of the proposed method?

  2. In the paper, the authors say that the proposed method is more suitable for resource-constrained devices. However, in the experiments, the authors only compare the trainable parameters. Can the authors provide a comparison of communication costs, as well as the FLOPs of the proposed methods? Moreover, can the authors provide some evidence that the proposed method can be executed on resource-constrained devices such as mobile phones?

  3. As the authors have discussed, HETLoRA uses a fixed rank initialization, which is independent of the data and may cause underfitting or overfitting issues. However, the proposed method also seems to use a pre-defined rank when adopting LoRA. Why does the proposed method not suffer from the same challenges as HETLoRA?

  4. Can the authors provide some ablation studies on the two-stage training strategy? I am interested to see whether the client-specific adapters can better find the local optimum on clients given the trained common adapter.

[1] pFedLoRA: Model-Heterogeneous Personalized Federated Learning with LoRA Tuning
[2] FDLoRA: Personalized Federated Learning of Large Language Model via Dual LoRA Tuning
[3] Personalized Wireless Federated Learning for Large Language Models
[4] SA-FedLora: Adaptive Parameter Allocation for Efficient Federated Learning with LoRA Tuning
[5] FedFMSL: Federated Learning of Foundations Models With Sparsely Activated LoRA

Questions

I would appreciate it and raise my score if the authors can address the questions listed in the weaknesses.

Comment

Q4. Can the authors provide some ablation studies on the two-stage training strategy? I am interested to see whether the client-specific adapters can better find the local optimum on clients given the trained common adapter.

A4. We conducted extensive experiments along the following three aspects to show that the two-stage training scheme is non-trivial and effective.

  • The effectiveness of bilevel optimization. To demonstrate the effectiveness of the two-level LoRA, we conducted the ablation study in Appendix F.2 of the submission. Instead of fine-tuning the two-level adapters via bilevel optimization, we update the common adapters and local adapters simultaneously. The text classification results in Figure 2 (page 19) of the original submission show that bilevel optimization actually improves the fine-tuning performance.
  • A two-level adapter is better than a one-level adapter. HOMLoRA is a baseline with a one-level adapter, i.e., the common adapter $B$ and $A$. Our experimental results on language understanding and generation show that it does not perform well on heterogeneous data. For example, in the text classification task, PF2LoRA outperforms HOMLoRA by 3.44% on CoLA, 21.58% on MNLI, 3.38% on SST-2, 14.38% on QQP, and 8.73% on QNLI. In addition, the two-level LoRA incurs negligible additional memory overhead; we count the trainable parameters for the different natural language tasks in Tables 7, 10, and 13 (see the parameter-counting sketch after this list). For example, HOMLoRA has 0.79 million trainable parameters using RoBERTa-large on the GLUE benchmark, while our algorithm has 0.99 million, only 0.22 million more.
  • A two-level adapter is better than the other baselines even with fewer trainable parameters. We further compared with other baselines (HOMLoRA has a one-level adapter) that have more trainable parameters than our algorithm; the results are shown in Table 5 of the original submission. Our method outperforms the other baselines by a large margin even with fewer trainable parameters.
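To make the parameter accounting above concrete, here is a minimal sketch of how such counts can be reproduced. The adapted-weight set is our assumption (query and value projections of all 24 RoBERTa-large layers, hidden size 1024), not something stated in this thread, so the totals only approximately match the reported 0.79M and 0.99M.

```python
# Minimal sketch: counting LoRA trainable parameters.
# Assumption (ours, not from the thread): adapters on the query and value
# projections of all 24 RoBERTa-large layers, hidden size d = 1024.

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """One adapter pair: B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_in + d_out)

d, layers, adapted_per_layer = 1024, 24, 2                  # W_q and W_v per layer
common = layers * adapted_per_layer * lora_params(d, d, 8)  # common adapter, rank r = 8
local  = layers * adapted_per_layer * lora_params(d, d, 2)  # client adapter, rank r~ = 2

print(f"one-level (HOMLoRA-like):  {common / 1e6:.2f}M")           # 0.79M
print(f"two-level (PF2LoRA-like): {(common + local) / 1e6:.2f}M")  # 0.98M
```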
Comment

Table 2. Learning rate settings for the RoBERTa model on the GLUE benchmark. For PF2LoRA, we use a slash to separate the two learning rates: the first is for the common adapter, and the second is for the client-specific adapter.

| Method | CoLA | MNLI | SST-2 | QQP | QNLI |
|---|---|---|---|---|---|
| pFedLoRA | $1.0\times10^{-3}$ | $1.0\times10^{-3}$ | $1.0\times10^{-3}$ | $1.0\times10^{-3}$ | $1.0\times10^{-3}$ |
| FDLoRA | $1.0\times10^{-3}$ | $2.0\times10^{-3}$ | $2.0\times10^{-3}$ | $1.0\times10^{-3}$ | $2.0\times10^{-3}$ |
| PF2LoRA | $2.0\times10^{-3}$ / $1.0\times10^{-4}$ | $1.0\times10^{-3}$ / $1.0\times10^{-3}$ | $1.0\times10^{-3}$ / $1.0\times10^{-3}$ | $1.0\times10^{-3}$ / $1.0\times10^{-3}$ | $1.0\times10^{-3}$ / $1.0\times10^{-3}$ |

Q2. In the paper, the authors say that the proposed method is more suitable for resource-constrained devices. However, in the experiments, the authors only compare the trainable parameters. Can the authors provide a comparison of communication costs, as well as the FLOPs of the proposed methods? Moreover, can the authors provide some evidence that the proposed method can be executed on resource-constrained devices such as mobile phones?

A2. We evaluated the total computational costs (FLOPs on 8 NVIDIA RTX A6000 GPUs) and communication costs in a single communication round for each algorithm on the GLUE benchmark. The results are summarized in Table 3 below. In our understanding, the communication cost is the total number of parameters that participate in the aggregation and distribution of parameters in federated learning, while the computational cost (FLOPs) per round is determined by the number of model parameters and the forward/backward propagation operations. As PF2LoRA requires computing a Hessian-vector product for hypergradient estimation, it incurs a slightly higher computational cost. However, the communication cost of PF2LoRA remains the same as that of HOMLoRA and Centralized LoRA, since the communicated parameters in PF2LoRA are only the global adapters, which have the same rank $r_k=8$ as in HOMLoRA and Centralized LoRA. In contrast, HETLoRA requires a higher parameter rank for high performance, resulting in increased communication costs. (A cross-check sketch follows Table 3.)

Table 3. Computational/Communication costs per communication round.

| Method | TFLOPs/round | Communication parameters/round |
|---|---|---|
| Centralized LoRA ($r_k=8$) | 258.40 | 0.30M |
| HOMLoRA ($r_k=8$) | 258.40 | 0.30M |
| Per-FedAvg-LoRA ($r_k=8$) | 908.00 | 0.30M |
| HETLoRA ($r_{max}=12, r_{min}=8$) | 272.60 | 0.35M |
| PF2LoRA ($r_k=8, \tilde{r}=2$) | 1202.40 | 0.30M |
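As a rough cross-check of the communication column, the per-round payload under this accounting is just the global adapter's parameter count. A sketch, assuming RoBERTa-base (12 layers, hidden size 768) with query and value projections adapted — both assumptions of ours:

```python
# Rough cross-check of Table 3's communication column: only the rank-r global
# adapter is exchanged, so the payload equals its parameter count.
# Assumptions (ours): RoBERTa-base, 12 layers, hidden size 768, W_q and W_v adapted.

def comm_params_millions(rank: int, d: int = 768, layers: int = 12, adapted: int = 2) -> float:
    return layers * adapted * rank * (2 * d) / 1e6

print(comm_params_millions(8))   # ~0.29M, in line with the ~0.30M reported for r_k = 8
print(comm_params_millions(12))  # ~0.44M, an upper bound for HETLoRA's ranks in [8, 12]
```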

Q3. As the authors have discussed, HETLoRA uses a fixed rank initialization, which is independent of the data and may cause underfitting or overfitting issues. However, the proposed method also seems to use a pre-defined rank when adopting LoRA. Why does the proposed method not suffer from the same challenges as HETLoRA?

A3. To clarify our perspective on the "automatic rank adaptation of PF2LoRA", as well as the reason HETLoRA fails, we have provided a theoretical analysis explaining why HETLoRA fails to learn the ground-truth ranks of clients, resulting in underfitting, in a multivariate linear regression example. We then conducted a synthetic experiment on personalized federated fine-tuning with two clients. The experimental results demonstrate that our algorithm can automatically adapt to the complexity of client data by learning the rank within the range $[r-\tilde{r}, r+\tilde{r}]$. Please refer to Appendix J in the revised version for details.

Comment

Q1. The authors claim that existing studies utilizing LoRA in federated learning mainly focus on HOMLoRA rather than personalized federated learning. However, I have found multiple studies that use the LoRA technique to achieve personalized federated learning [1-5]. Some of them also introduce a two-level adaptation idea [2], [5]. Can the authors include a detailed comparison with these studies and discuss the novelty of the proposed method?

[1] pFedLoRA: Model-Heterogeneous Personalized Federated Learning with LoRA Tuning
[2] FDLoRA: Personalized Federated Learning of Large Language Model via Dual LoRA Tuning
[3] Personalized Wireless Federated Learning for Large Language Models
[4] SA-FedLora: Adaptive Parameter Allocation for Efficient Federated Learning with LoRA Tuning
[5] FedFMSL: Federated Learning of Foundations Models With Sparsely Activated LoRA

A1. Thanks for your constructive suggestion. We compared with these related works and summarize the main differences from ours below.

pFedLoRA [1] designs a homogeneous small low-rank adapter to facilitate federated clients' heterogeneous local model training, with iterative training for global-local knowledge exchange. The local model and the low-rank adapters are trained iteratively within each communication round, and the small low-rank adapters are aggregated on the server to generate a global adapter. They neglect the dependency between the local and global models, and their local model is restricted to fully-connected layers.

FDLoRA [2], PFTT [3], and FedFMSL [5] all propose "dual LoRA tuning", i.e., a local LoRA module captures personalized knowledge while a global LoRA module learns general knowledge, and only the global LoRA modules participate in parameter aggregation and distribution. FDLoRA [2] divides the procedure into three independent stages: local training to update the local LoRA modules, federated learning to update the global LoRA modules, and fusion of the local and global modules. PFTT [3] introduces a personalized wireless federated fine-tuning method with low communication overhead, i.e., Personalized Federated Task Tuning (PFTT), which leverages global adapters and local LoRA modules to collaboratively fine-tune local LLMs. Similarly, FedFMSL [5] proposes two-stage training for the global and local LoRA, respectively, but refines the mixture of the two types of LoRA adapters by designing a new gate adapter that outputs the mixing ratio. Different from these "dual LoRA adapter" algorithms, we formulate personalized federated fine-tuning as a bilevel optimization and build the connection between the updates of the local adapters and the global adapters. In addition, our local adapters are designed as a lightweight structure, so the number of trainable parameters in these local adapters is negligible. Thus, our algorithm has even fewer trainable parameters than single-LoRA federated fine-tuning algorithms while achieving better performance, e.g., the results in Table 5 of the submission.

Different from the above algorithms, SA-FedLoRA [4] is a simulated-annealing-based federated fine-tuning algorithm with LoRA tuning. There are two stages, i.e., the initiating stage and the annealing stage. In the initiating stage, each client trains the entire pre-trained model with parameter regularization. In the annealing stage, high-rank LoRAs are allocated to the client model in the early heating phase, and the trainable parameters are dynamically reduced in the cooling phase by lowering the LoRA rank.

Due to the tight rebuttal period, we compare PF2LoRA with the first two algorithms, pFedLoRA [1] and FDLoRA [2], on the GLUE benchmark. For a fair comparison, we keep the number of trainable parameters the same across all baselines, i.e., 0.37M. We therefore fix the local and global ranks to 5 in pFedLoRA [1] and FDLoRA [2], while PF2LoRA keeps the rank settings from Section 5.1 of the submission, i.e., the global adapter with $r=8$ and the local adapters with $\tilde{r}=2$. We search for the best learning rates in the range $\{1.0\times 10^{-4}, 5.0\times 10^{-4}, 1.0\times 10^{-3}, 2.0\times 10^{-3}, 5.0\times 10^{-3}\}$; the learning rate choices are summarized in Table 2. The comparison test results are shown in Table 1. PF2LoRA outperforms the others on CoLA, MNLI, SST-2, and QNLI, while maintaining the same number of trainable parameters.

Table 1. RoBERTa-base results on the GLUE benchmark. We report "Matthews correlation" for CoLA and "Accuracy" for MNLI, SST-2, QQP, and QNLI.

| Method | CoLA ↑ | MNLI ↑ | SST-2 ↑ | QQP ↑ | QNLI ↑ |
|---|---|---|---|---|---|
| pFedLoRA | 49.12 | 91.05 | 94.71 | 94.28 | 93.35 |
| FDLoRA | 48.85 | 78.10 | 91.54 | 86.73 | 89.48 |
| PF2LoRA | 54.19 | 92.14 | 95.85 | 93.99 | 94.18 |
Comment

The authors have partially addressed my concerns, so I choose to raise my score.

Comment

Many thanks for the reviewer's reply. Please feel free to reach out if the reviewer has any further concerns, and we would be glad to discuss them.

Review — Rating: 6

Towards solving the data heterogeneity problem, the paper proposes PF2LoRA, which adopts two LoRA adapters in a personalized federated fine-tuning algorithm. On each client, one LoRA is kept private on-device for personalized learning and another is shared by all clients for federated fine-tuning. By applying this two-level low-rank adaptation and other optimizations, PF2LoRA achieves better performance than baselines.

Strengths

(1) The data heterogeneity problem is valuable for research in federated fine-tuning, and the paper designs a valid solution for LoRA fine-tuning on this topic. The idea of using two sets of LoRA is straightforward and easy to implement.

(2) The paper provides a theoretical justification of the proposed method. Also, the 'Detailed Implementation for Language Models' section is useful for reproducing the results.

(3) The paper is well-written and easy to follow. The reviewer gets a clear idea from the paper after reading it.

Weaknesses

(1) For the data heterogeneity problem itself, the reviewer believes that the key bottleneck in federated fine-tuning is heterogeneous computational resources, not data heterogeneity. If the clients have enough resources, they can simply apply homogeneous local LoRA ranks that are large enough and achieve better performance. It is the limited local computational resources that make the clients use LoRA, and perhaps adopt different LoRA ranks.

Therefore, the question here is whether it is realistic to adopt two LoRAs on the clients, which actually increases the cost. If we mainly consider the computational cost, there are some works that already optimize this problem [1] [2].

(2) The two-level optimization looks a little confusing. The goal of the local LoRA is to minimize the local loss function, and the goal of the global one is to learn a common adapter for all clients. Will this cause the global LoRA to not learn enough local knowledge, since the local LoRAs already overfit on local objectives? In other words, since the local LoRAs are not used to update the global model, what is the purpose of using these local LoRAs?

(3) The experiments seem insufficient, since only models up to the size of GPT-2 are used. Related works use models larger than 1B. Is there any reason to use smaller models?

(4) Some related works are missing:

[1] Bai, J., Chen, D., Qian, B., Yao, L., & Li, Y. (2024). Federated fine-tuning of large language models under heterogeneous language tasks and client resources. arXiv preprint arXiv:2402.11505.

[2] Wang, Z., Shen, Z., He, Y., Sun, G., Wang, H., Lyu, L., & Li, A. (2024). Flora: Federated fine-tuning large language models with heterogeneous low-rank adaptations. arXiv preprint arXiv:2409.05976.

[3] Chen, H., Zhang, Y., Krompass, D., Gu, J., & Tresp, V. (2024, March). Feddat: An approach for foundation model finetuning in multi-modal heterogeneous federated learning. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 38, No. 10, pp. 11285-11293).

Questions

Please refer to the weaknesses.

Comment

Q1. Whether it is realistic to adopt two LoRAs on the clients, which actually increases the cost. If we mainly consider the computational cost, there are some works that already optimize this problem [1] [2].

A1. We emphasize that the two-level adapters do not introduce excessive trainable parameters compared to other baselines. To verify this statement, we conducted comparison experiments and reported the corresponding performance and the number of trainable parameters in Section 5.3 of the submission.

Specifically, we increase the initial rank $r_k$ (from 8 to 12) for the baselines HOMLoRA and Per-FedAvg-LoRA in the text classification experiments. Note that HETLoRA uses a different rank initialization $r_{min}\leq r_k \leq r_{max}$ for each client $k$, so we count the average trainable parameters across clients. We can also control the number of trainable parameters by specifying $r_{min}$ and $r_{max}$; we set $r_{min}=5, r_{max}=12$ on the CoLA dataset and $r_{min}=8, r_{max}=12$ on the other four text classification datasets. Even when the other algorithms have more trainable parameters than our method, PF2LoRA still demonstrates the best performance. PF2LoRA, with negligible additional trainable parameters, significantly improves performance in personalized federated learning.

Q2. The two-level optimization looks a little confusing. The goal of the local LoRA is to minimize the local loss function, and the goal of the global one is to learn a common adapter for all clients. Will this cause the global LoRA to not learn enough local knowledge, since the local LoRAs already overfit on local objectives? In other words, since the local LoRAs are not used to update the global model, what is the purpose of using these local LoRAs?

A2. We respectfully disagree with you. We formalize our two-level adaptation framework for personalized federated fine-tuning as the following bilevel optimization problem:

$$
\begin{aligned}
&\min_{x}\ \Phi(x) := \frac{1}{M}\sum_{k=1}^{M} f_k\big(x, y^*_{k}(x)\big), \qquad &&(\text{UL})\\
&\ \text{s.t.}\ \ y^*_k(x) \in \arg\min_{y_k} f_k(x, y_k), \qquad &&(\text{LL})
\end{aligned}
$$

where the lower-level variable $y_k=\{D_k\in\mathbb{R}^{m\times \tilde{r}}, C_k\in\mathbb{R}^{\tilde{r}\times n}\}$ (with $0<\tilde{r}<r$, $1\leq k \leq M$) is the parameter of the $k$-th client-specific adapter, and the upper-level variable $x=\{B\in\mathbb{R}^{m\times r}, A\in\mathbb{R}^{r\times n}\}$ is the parameter of the common adapter. The goal of the lower-level problem is to find the best client-specific adapter given a common adapter, while the upper level searches for the best common adapter based on all optimal local adapters; this is therefore a nested optimization problem. Although the local adapters are not explicitly aggregated to update the global adapter, the update of the global adapter actually builds upon the local adapters. This can be verified by calculating the derivative of the objective function with respect to the upper-level variable, namely the hypergradient:

$$
\nabla\widehat{\Phi}_k(x^t_k) = \nabla_x f_{k}\big(x^t_k, y^{t+1}_{k}\big) - \alpha\, \nabla_{xy} f_{k}\big(x^t_k, y^{t}_k\big)\, \nabla_y f_{k}\big(x^t_k, y^{t+1}_{k}\big),
$$

The bilevel optimization helps find a meta-learner $x$ such that each client can adapt quickly to its individual task when integrating the common adapter $x$ with its local adapter $y_k$.
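For intuition, the hypergradient above can be evaluated without forming the mixed Hessian, using one extra Hessian-vector product. Below is a minimal PyTorch sketch on a toy quadratic objective; the objective `f` and all shapes are illustrative, not the paper's implementation.

```python
import torch

def f(x, y):
    # toy stand-in for a client objective f_k(x, y_k)
    return ((x - y) ** 2).sum() + 0.1 * (x * y).sum()

alpha = 0.01
x = torch.randn(5, requires_grad=True)    # common adapter (upper level)
y_t = torch.randn(5, requires_grad=True)  # client-specific adapter at step t

# lower-level SGD step: y_{t+1} = y_t - alpha * grad_y f(x, y_t)
y_next = (y_t - alpha * torch.autograd.grad(f(x, y_t), y_t)[0]).detach().requires_grad_(True)

# direct term grad_x f(x, y_{t+1}) and v = grad_y f(x, y_{t+1})
gx, v = torch.autograd.grad(f(x, y_next), [x, y_next])

# mixed second-order term via a Hessian-vector product:
#   grad_{xy} f(x, y_t) @ v  ==  grad_x <grad_y f(x, y_t), v>
gy_t = torch.autograd.grad(f(x, y_t), y_t, create_graph=True)[0]
hvp = torch.autograd.grad((gy_t * v.detach()).sum(), x)[0]

hypergrad = gx - alpha * hvp  # estimate of the gradient for the common adapter
```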

Q3. The experiments seem insufficient, since only models up to the size of GPT-2 are used. Related works use models larger than 1B. Is there any reason to use smaller models?

A3. In Section 5.2 of the submission, we deployed our algorithm on a large foundation model, GPT2-XL with 1.5 billion parameters, for the language generation task WebNLG under the same data setting. PF2LoRA still shows the best performance in terms of BLEU, MET, and ROUGE-L. This verifies that our algorithm performs well on both small and large language models.

Comment

Q4. Some related works are missing:

[1] Bai, J., Chen, D., Qian, B., Yao, L., Li, Y. (2024). Federated fine-tuning of large language models under heterogeneous language tasks and client resources. arXiv preprint arXiv:2402.11505.

[2] Wang, Z., Shen, Z., He, Y., Sun, G., Wang, H., Lyu, L., Li, A. (2024). Flora: Federated fine-tuning large language models with heterogeneous low-rank adaptations. arXiv preprint arXiv:2409.05976.

[3] Chen, H., Zhang, Y., Krompass, D., Gu, J., Tresp, V. (2024, March). Feddat: An approach for foundation model finetuning in multi-modal heterogeneous federated learning. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 38, No. 10, pp. 11285-11293).

A4. Thanks for your constructive suggestion. We carefully checked these references and compared them with ours. FlexLoRA [1] and FLoRA [2] are simple yet effective FL aggregation algorithms that enable the mixture of diverse LoRA weights across individual clients. FlexLoRA [1] constructs a full-size LoRA weight $W_g$ as $W_g = (\sum_i n_i W_i)/(\sum_i n_i)$, where $n_i$ is the size of the $i$-th client's local training dataset. It then decomposes $W_g$ by SVD, $\mathrm{SVD}(W_g) = U\Sigma V$, and client $i$ keeps the largest top-$r_i$ ranks for individual local training. FLoRA [2] aggregates the LoRA modules by stacking (denoted by $\oplus$) the clients' $A_i$ and $B_i$, i.e., $W_g=(B_0\oplus \cdots \oplus B_K)(A_0\oplus \cdots \oplus A_K)$.
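To make the two aggregation rules concrete, here is a minimal NumPy sketch of both; shapes and data are illustrative, and this is our reading of the formulas above rather than either paper's code.

```python
import numpy as np

def flexlora_aggregate(Bs, As, ns, r_client):
    """FlexLoRA-style: dataset-size-weighted full-matrix average, then SVD truncation."""
    W_g = sum(n * B @ A for n, B, A in zip(ns, Bs, As)) / sum(ns)
    U, S, Vt = np.linalg.svd(W_g, full_matrices=False)
    # the client keeps the top-r_client components as its new factors
    return U[:, :r_client] * S[:r_client], Vt[:r_client, :]

def flora_aggregate(Bs, As):
    """FLoRA-style: stack client factors along the rank dimension."""
    B_g = np.concatenate(Bs, axis=1)  # (m, sum_k r_k)
    A_g = np.concatenate(As, axis=0)  # (sum_k r_k, n)
    return B_g, A_g                   # B_g @ A_g == sum_k B_k @ A_k

rng = np.random.default_rng(0)
Bs = [rng.standard_normal((16, r)) for r in (4, 8)]
As = [rng.standard_normal((r, 24)) for r in (4, 8)]
B_g, A_g = flora_aggregate(Bs, As)
assert np.allclose(B_g @ A_g, Bs[0] @ As[0] + Bs[1] @ As[1])
```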

FedDAT [3] introduces dual adapters, i.e., a global adapter and local adapters. Each client initializes an individual local adapter and fine-tunes it with the frozen global adapter; afterward, local updates are executed for the global adapter. Mutual Knowledge Distillation (MKD) is used to transfer heterogeneous knowledge and general knowledge between the local and global adapters.

Our algorithm is clearly different from the above algorithms. Algorithms [1] and [2] mainly focus on the aggregation of heterogeneous ranks across clients. Algorithm [3] does not consider the low-rank structure of adapters, thus probably incurring high communication costs in the aggregation and distribution of parameters. They update the local and global adapters independently; instead, our algorithm formulates the global and local updates as a bilevel optimization, which means we take the dependency between local and global adapters into consideration. We have conducted an ablation study to verify the benefits of bilevel optimization compared to single-level optimization; the details can be found in Appendix F.2. We will cite these related works in our revised version.

Comment

Thanks for the rebuttal. The authors successfully addressed most of the concerns. I have increased my score.

Comment

Many thanks for your reply. We are open to discussing any further questions you may have.

Review — Rating: 6

This paper proposes a two-level low-rank adaptation method, named PF2LoRA, to address data heterogeneity and model personalization challenges in federated fine-tuning of foundation models. In PF2LoRA, all clients first collaboratively train a general adapter, followed by each client training a personalized adapter with a unique rank to suit their local tasks. Extensive experiments on natural language datasets demonstrate the superiority of the proposed method.

Strengths

  1. This paper is easy to read and understand.
  2. The experiments are extensive and validate the effectiveness of the method.

Weaknesses

  1. In line 176, the authors argue that fixed-rank initialization may lead to underfitting or overfitting issues. However, this claim lacks supporting evidence.
  2. In PF2LoRA, stage 1 requires all clients to jointly train an adapter with a rank $r$. Isn't this essentially a "fixed-rank initialization"? Why would this phase not also cause underfitting or overfitting issues?
  3. The authors argue that one of HETLoRA's weaknesses is the need to tune several hyperparameters. However, PF2LoRA also has multiple hyperparameters, such as $r$, $\tilde{r}$, and $\alpha$. I believe the number of hyperparameters isn't the key issue; rather, it's whether the hyperparameters are easy to adjust and how sensitive the method is to them. This paper lacks a discussion of these aspects.
  4. Based on my current understanding, PF2LoRA appears to be an incremental improvement over HETLoRA, transforming the training process into a two-stage approach. In line 196 (Eq. (2)), if we set $B_k A_k = BA + D_k C_k$, isn't this effectively training a different-rank adapter for each client as in HETLoRA? The main difference is that PF2LoRA separates the common component $BA$ for a two-stage training scheme, while HETLoRA trains $BA$ and $D_k C_k$ simultaneously.

Questions

Please see the weaknesses above.

Comment

Q1, Q2. In line 176, the authors argue that fixed-rank initialization may lead to underfitting or overfitting issues. However, this claim lacks supporting evidence. In PF2LoRA, stage 1 requires all clients to jointly train an adapter with a rank $r$. Isn't this essentially a "fixed-rank initialization"? Why would this phase not also cause underfitting or overfitting issues?

A1, A2. Thanks for your insightful question. To clarify our perspective on the "automatic rank adaptation of PF2LoRA", as well as the reason HETLoRA fails, we have provided a theoretical analysis explaining why HETLoRA fails to learn the ground-truth ranks of clients, resulting in underfitting, in a multivariate linear regression example. Due to its random rank initialization, HETLoRA may underestimate the initial rank compared to the ground-truth rank, i.e., $r_k^{init} \leq r_k^*$, and then it cannot converge to the ground-truth rank through its personalized fine-tuning and rank self-pruning. To verify this perspective, we conducted a synthetic experiment on personalized federated fine-tuning with two clients. The experimental results demonstrate that our algorithm can automatically adapt to the complexity of client data by learning the rank within the range $[r-\tilde{r}, r+\tilde{r}]$, while HETLoRA fails. Please refer to Appendix J in the revised version for details.
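As a small numerical illustration of the claimed range (our own construction, not from the paper): a rank-$\tilde{r}$ client correction $D_k C_k$ can either cancel directions of $BA$, lowering the effective rank, or add new ones, raising it.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, rt = 20, 30, 8, 2
B, A = rng.standard_normal((m, r)), rng.standard_normal((r, n))
W = B @ A                                  # common adapter update, rank r = 8

# a client correction that cancels the top-2 singular directions of BA
U, S, Vt = np.linalg.svd(W, full_matrices=False)
D, C = -U[:, :rt] * S[:rt], Vt[:rt, :]
print(np.linalg.matrix_rank(W + D @ C))    # 6  = r - r~

# a generic correction instead adds new directions
D2, C2 = rng.standard_normal((m, rt)), rng.standard_normal((rt, n))
print(np.linalg.matrix_rank(W + D2 @ C2))  # 10 = r + r~
```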

Q3. The authors argue that one of HETLoRA's weaknesses is the need to tune several hyperparameters. However, PF2LoRA also has multiple hyperparameters, such as $r$, $\tilde{r}$, and $\alpha$. I believe the number of hyperparameters isn't the key issue; rather, it's whether the hyperparameters are easy to adjust and how sensitive the method is to them. This paper lacks a discussion of these aspects.

A3. We performed a hyperparameter sensitivity analysis by sweeping over the learning rate $\alpha$ and the client rank $r$; the results are presented in Figure 5 in Appendix K of the revised version. In our setting, we require the local adapter to be lightweight, so the rank of the local adapters is always small, i.e., $\tilde{r}=2$. We therefore sweep over the local learning rate $\alpha$ and the rank of the common adapter, respectively. As shown in Figure 5(a), our algorithm is quite robust to the learning rate $\alpha$. Since the CoLA dataset is more challenging than the others, a larger rank helps improve model performance, but the performance remains almost the same once the rank exceeds 8, as shown in Figure 5(b). Our algorithm also exhibits high robustness on MNLI and SST-2.

Q4. Based on my current understanding, PF2LoRA appears to be an incremental improvement over HETLoRA, transforming the training process into a two-stage approach. In line 196 (Eq. (2)), if we set $B_k A_k = BA + D_k C_k$, isn't this effectively training a different-rank adapter for each client as in HETLoRA? The main difference is that PF2LoRA separates the common component $BA$ for a two-stage training scheme, while HETLoRA trains $BA$ and $D_k C_k$ simultaneously.

A4. We conducted extensive experiments along the following three aspects to show that the two-stage training scheme is non-trivial and effective.

  • The effectiveness of bilevel optimization. To demonstrate the effectiveness of the two-level LoRA, we conducted the ablation study in Appendix F.2 of the submission. Instead of fine-tuning the two-level adapters via bilevel optimization, we update the common adapters and local adapters simultaneously. The text classification results in Figure 2 (page 19) of the original submission show that bilevel optimization actually improves the fine-tuning performance.
  • A two-level adapter is better than a one-level adapter. HOMLoRA is a baseline with a one-level adapter, i.e., the common adapter $B$ and $A$. Our experimental results on language understanding and generation show that it does not perform well on heterogeneous data. For example, in the text classification task, PF2LoRA outperforms HOMLoRA by 3.44% on CoLA, 21.58% on MNLI, 3.38% on SST-2, 14.38% on QQP, and 8.73% on QNLI. In addition, the two-level LoRA incurs negligible additional memory overhead; we count the trainable parameters for the different natural language tasks in Tables 7, 10, and 13. For example, HOMLoRA has 0.79 million trainable parameters using RoBERTa-large on the GLUE benchmark, while our algorithm has 0.99 million, only 0.22 million more.
  • A two-level adapter is better than the other baselines even with fewer trainable parameters. We further compared with other baselines (HOMLoRA has a one-level adapter) that have more trainable parameters than our algorithm; the results are shown in Table 5 of the submission. Our method outperforms the other baselines by a large margin even with fewer trainable parameters.
Comment

Thanks for the rebuttal. The authors have addressed my concerns, and I decided to raise my score.

Comment

Thanks for your thoughtful review and willingness to increase the rating. We appreciate your constructive feedback and recognition of our efforts. We are also glad to discuss with you if you have further questions.

Comment

Dear Reviewer YEnm,

Thank you for reviewing our paper. We have carefully addressed your concerns regarding the "automatic rank adaptation of PF2LoRA", the hyperparameter sensitivity analysis, and the effectiveness of the two-stage training scheme. Please let us know whether our responses address your concerns. We appreciate your time and effort and are open to discussing any further questions you may have.

Review — Rating: 5

This paper proposes a two-level LoRA framework for foundation models to address data heterogeneity and personalization in FL, which contains a homogeneous LoRA for aggregation and a heterogeneous LoRA for personalization on each client.

Strengths

  1. This paper focuses on personalized federated fine-tuning of foundation models, which is an emerging and promising research direction.

  2. The paper conducts extensive experiments across various NLP tasks and detailed algorithms and codes are provided to support its reproducibility.

Weaknesses

  1. The motivations and applications of this paper are unclear. If the setting of this paper is similar to HETLoRA, focusing on employing different ranks to accommodate varying system capabilities, then the upper-level LoRA of PF2LoRA would be useless as long as there is a client whose computational capacity can only support a very low rank (e.g., $r=1$). Alternatively, if this paper aligns with the setting of HOMLoRA, then the need for different ranks for personalization is questionable. It might be more effective to simply fine-tune the aggregated upper-level LoRA further or maintain a uniform rank across the lower-level LoRAs.

  2. The paper asserts that their framework can overcome the HETLoRA’s limitation of fixed rank initialization that does not consider data by automatically adjusting ranks based on training data. However, the explanation of this automatic mechanism is insufficient, and it seems that the LoRA ranks in PF2LoRA are predefined and not truly data-dependent, contradicting the claims made, as observed in the experiments.

  3. Some related work is missing [1][2]. For example, it would be better if the paper could add a comparison analysis with [1] which emphasizes personalization using two LoRAs. Additionally, the experimental results on the GLUE benchmark reported in this paper are significantly lower than those in work [2], which used the same baseline and datasets, why does this happen?

[1] Yang, Y., et al. (2024). Dual-Personalizing Adapter for Federated Foundation Models. arXiv preprint arXiv:2403.19211.

[2] Zhang, Z., et al. (2023). Fedpetuning: When federated learning meets the parameter-efficient tuning methods of pre-trained language models. In Annual Meeting of the Association of Computational Linguistics 2023 (pp. 9963-9977). Association for Computational Linguistics (ACL).

Questions

See weaknesses.

Comment

We carefully checked paper [2], which investigates parameter-efficient tuning (PETuning) of pre-trained language models and develops a corresponding federated benchmark covering federated full fine-tuning (FedFT) and four representative PETuning methods: adapter tuning (FedAP), LoRA (FedLR), prefix tuning (FedPF), and BitFit (FedBF). Since our work focuses on federated LoRA, we mainly compare with FedLR. We summarize the comparison results on the GLUE benchmark in Table 1.

As you can see, our algorithm's performance is not lower than FedLR's. We note that FedLR uses the same rank, i.e., 8, for all clients, so it is effectively equivalent to the HOMLoRA baseline here. More importantly, the experimental settings differ. We list these differences as follows:

  • Local training steps. FedLR sets the number of local training steps to 1, while PF2LoRA sets it to 10. More local steps likely imply greater client drift, making convergence harder.
  • Training epochs. They set the training epochs to {MNLI: 30, SST-2: 60, QQP: 25, QNLI: 25}, while our setting is {MNLI: 1, SST-2: 2, QQP: 2, QNLI: 2}.
Comment

Q3. Some related work is missing [1][2]. For example, it would be better if the paper could add a comparison analysis with [1] which emphasizes personalization using two LoRAs. Additionally, the experimental results on the GLUE benchmark reported in this paper are significantly lower than those in work [2], which used the same baseline and datasets, why does this happen?

[1] Yang, Y., et al. (2024). Dual-Personalizing Adapter for Federated Foundation Models. arXiv preprint arXiv:2403.19211.

[2] Zhang, Z., et al. (2023). Fedpetuning: When federated learning meets the parameter-efficient tuning methods of pre-trained language models. In Annual Meeting of the Association of Computational Linguistics 2023 (pp. 9963-9977). Association for Computational Linguistics (ACL).

A3. Thanks for your constructive suggestion. FedDPA [1] also proposes a "dual low-rank adaptation" approach for personalized federated fine-tuning. Local adapters are designed to learn the personality of client data, while the global adapters aim to handle test-time tasks that may exhibit distributional shifts relative to individual client data. In every communication round, only the global adapters participate in parameter aggregation and distribution. Our algorithm builds on a similar idea, but the key difference lies in the optimization approach for the adapters. Considering the individual objective functions $f_k(\cdot)$ and $M$ clients, the objective function is

$$
\min_{A,B,\{C_k,D_k\}}\ \frac{1}{M}\sum_{k=1}^{M} f_k(A, B, C_k, D_k),
$$

where $A, B$ are the trainable parameters of the global adapters and $C_k, D_k$ are the trainable parameters of the local adapters. FedDPA updates the parameters of the local and global adapters alternately, but neglects the dependency between these adapters. Instead, we treat this problem as a nested optimization, namely bilevel optimization,

$$
\begin{aligned}
&\min_{A,B}\ \frac{1}{M}\sum_{k=1}^{M} f_k\big(A, B, C_k^*(A, B), D_k^*(A, B)\big),\\
&\ \text{s.t.}\ \ C_k^*(A, B),\, D_k^*(A, B) \in \arg\min_{C_k, D_k} f_k(A, B, C_k, D_k),
\end{aligned}
$$

Taking the derivative of the upper-level function with respect to the parameters of the global adapters by the chain rule (for clean notation, we denote $x=\{A, B\}$ and $y=\{C, D\}$), we obtain the gradient for the global adapters, i.e., the hypergradient:

$$
\nabla\widehat{\Phi}_k(x^t_k) = \nabla_x f_{k}\big(x^t_k, y^{t+1}_{k}\big) - \alpha\, \nabla_{xy} f_{k}\big(x^t_k, y^{t}_k\big)\, \nabla_y f_{k}\big(x^t_k, y^{t+1}_{k}\big),
$$

As you can see, our update for the global adapters explicitly accounts for the relationship with all the local adapters, rather than simply updating them independently.

We also provide convergence guarantees under standard bilevel optimization assumptions, i.e., the lower-level function is $\mu$-strongly convex, and the upper-level function is non-convex and $L_{f,1}$-smooth with an $L_{f,2}$-Lipschitz Hessian. Our algorithm requires $O(1/\epsilon^2)$ gradient or Hessian-vector product evaluations to find an $\epsilon$-stationary point. Details can be found in Theorem 6.2 (page 10 of the submission), and the proof is provided in Appendix I (page 24).

We then conducted experiments to compare with FedDPA on the heterogeneous GLUE benchmark. Specifically, we initialize the rank of the global and local adapters to 5, which makes the number of trainable parameters equal to that of the other baselines and ours. We then tune their global and local learning rates separately for the best validation performance. The learning rate is searched in the range $\{1.0\times 10^{-3}, 5.0\times 10^{-3}, 1.0\times 10^{-2}\}$; the best global learning rates are {CoLA: $5.0\times 10^{-3}$, MNLI: $5.0\times 10^{-3}$, SST-2: $5.0\times 10^{-3}$, QQP: $5.0\times 10^{-3}$, QNLI: $5.0\times 10^{-3}$}, and the best local learning rates are {CoLA: $5.0\times 10^{-3}$, MNLI: $1.0\times 10^{-3}$, SST-2: $1.0\times 10^{-3}$, QQP: $5.0\times 10^{-3}$, QNLI: $5.0\times 10^{-3}$}. The comparison results are shown in Table 1.

Table 1. RoBERTa-base results on the GLUE benchmark. We report "Matthews correlation" for CoLA and "Accuracy" for MNLI, SST-2, QQP, and QNLI. † means the results are from the original paper.

| Method | CoLA ↑ | MNLI ↑ | SST-2 ↑ | QQP ↑ | QNLI ↑ |
|---|---|---|---|---|---|
| HOMLoRA | 50.75 | 70.56 | 92.47 | 79.61 | 85.45 |
| FedDPA | 46.85 | 87.50 | 93.67 | 93.21 | 95.18 |
| FedLR† | - | 84.90 | 93.60 | 87.40 | 90.80 |
| PF2LoRA | 54.19 | 92.14 | 95.85 | 93.99 | 94.18 |
Comment

Q1. The motivations and applications of this paper are unclear. If the setting of this paper is similar to HETLoRA, focusing on employing different ranks to accommodate varying system capabilities, then the upper-level LoRA of PF2LoRA would be useless as long as there is a client whose computational capacity can only support a very low rank (e.g., $r=1$).

A1. We must clarify that the problem we consider is data heterogeneity, as stated in the title and abstract, not system heterogeneity. Under data heterogeneity in federated fine-tuning, our goal is to find personalized adapters for clients such that they can quickly adapt to their individual data.

HETLoRA is a well-known federated fine-tuning method that addresses data heterogeneity by assigning LoRA modules of varying ranks to different clients, achieved through rank self-pruning and sparsity-weighted aggregation to accommodate the diverse data complexities of clients. However, we observe that HETLoRA's random rank initialization strategy does not account for data complexity effectively, potentially underestimating a client's initial rank, i.e., $r_k < r_k^*$, which leads to underfitting.

To address this limitation, we propose a novel two-level low-rank adaptation algorithm. Our method uses bilevel optimization to update the lower-level LoRA with rank $\tilde{r}$ (the parameters of the local adapters) and the upper-level LoRA with rank $r$ (the parameters of the common adapter), such that each client can automatically learn a personalized rank within the range $[r-\tilde{r}, r+\tilde{r}]$, thereby covering a larger range of personalized ranks than HETLoRA. Our experiments on the GLUE benchmark (Table 5 in the submission) and on natural language generation (Table 13 in the submission) show that our algorithm achieves superior performance with fewer trainable parameters than HETLoRA.
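For concreteness, here is a minimal sketch of what a two-level LoRA layer might look like; this is our own illustrative module, and the initialization scheme and the set of adapted weights are assumptions rather than the paper's code.

```python
import torch
import torch.nn as nn

class TwoLevelLoRALinear(nn.Module):
    """Frozen pretrained weight + common adapter (B, A) + client adapter (D_k, C_k)."""

    def __init__(self, d_in: int, d_out: int, r: int = 8, r_tilde: int = 2):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)                     # pretrained, frozen
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)         # common (upper level)
        self.B = nn.Parameter(torch.zeros(d_out, r))
        self.C = nn.Parameter(torch.randn(r_tilde, d_in) * 0.01)   # client-specific (lower level)
        self.D = nn.Parameter(torch.zeros(d_out, r_tilde))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # effective update B A + D C has rank in [r - r~, r + r~]
        delta = self.B @ self.A + self.D @ self.C
        return self.base(x) + x @ delta.T

layer = TwoLevelLoRALinear(768, 768)
out = layer(torch.randn(4, 768))
# in federated training, only (B, A) would be aggregated across clients
```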

Q2. The paper asserts that their framework can overcome the HETLoRA’s limitation of fixed rank initialization that does not consider data by automatically adjusting ranks based on training data. However, the explanation of this automatic mechanism is insufficient, and it seems that the LoRA ranks in PF2LoRA are predefined and not truly data-dependent, contradicting the claims made, as observed in the experiments.

A2. To clarify the mechanism of the "automatic rank adaptation of PF2LoRA", as well as the reason HETLoRA fails, we first construct a multivariate linear regression example and provide a theoretical analysis demonstrating why our method can accurately learn the ground-truth rank whereas HETLoRA fails. We then conduct a synthetic experiment comparing the two algorithms in federated learning with two clients. The experimental results confirm that our algorithm can learn the ground-truth ranks for the two clients and converge to the optimal solution. In contrast, HETLoRA underestimates the initial rank of some clients due to its random rank initialization strategy, resulting in underfitting and suboptimal performance on those clients. Please refer to Appendix J in the revised version for details.

Comment

Dear Reviewer Ay7W,

Thank you for reviewing our paper. We have carefully addressed your concerns regarding the motivations of our paper, why our algorithm can overcome the limitations of HETLoRA, and comparison with related work. Please let us know if our responses address your concerns accurately. We appreciate your time and efforts and are open to discussing any further questions you may have.

Comment

Dear Reviewer Ay7W,

We sincerely thank you for taking the time to review our paper and providing valuable feedback. We have carefully addressed your concerns regarding the motivations of our paper, why our algorithm can overcome the limitations of HETLoRA, and comparison with related work. As we are approaching the end of the discussion period, please let us know if our responses address your concerns accurately. We appreciate your time and efforts and are open to discussing any further questions you may have.

Best,

Authors

Comment

Thank you to all the reviewers for taking the time to review our paper and provide valuable feedback. We have addressed each of your concerns individually and summarize the key changes made during the rebuttal phase below. Major modifications are highlighted in blue in Appendices J, K, and L of the revised paper.

  • To clarify our perspective on the "automatic rank adaptation of PF2LoRA", as well as the reason HETLoRA fails, we have provided a theoretical analysis explaining why HETLoRA fails to learn the ground-truth ranks of clients, resulting in underfitting, in a multivariate linear regression example. We then conducted a synthetic experiment on personalized federated fine-tuning with two clients. The experimental results demonstrate that our algorithm can automatically adapt to the complexity of client data by learning the rank within the range $[r-\tilde{r}, r+\tilde{r}]$. Please refer to Appendix J in the revised version for details.
  • We conducted extensive experiments to show the effectiveness of bilevel optimization for the local and common adapters. We verify it empirically from three aspects: (1) the effectiveness of bilevel optimization; (2) a two-level adapter is better than a one-level adapter; (3) a two-level adapter is better than the other baselines even with fewer trainable parameters.
  • We evaluated the total computational costs (FLOPs on 8 NVIDIA RTX A6000 GPUs) and communication costs in a single communication round for each algorithm on the GLUE benchmark. The results are summarized in Table 22 in Appendix L. They show that our algorithm keeps the same communication cost, with a slightly higher computational cost due to the Hessian-vector product calculation. We emphasize that our two-level LoRA introduces only negligible extra trainable parameters compared to one-level LoRA methods. Detailed statistics on the number of trainable parameters for various language tasks and models are provided in Tables 7, 10, and 13. Our algorithm still consistently outperforms the other baselines on the GLUE benchmark, even with fewer trainable parameters. For more details, please refer to Table 5.
  • We performed a hyperparameter sensitivity analysis by sweeping over the learning rate $\alpha$ and the client rank $r$. The experimental results are presented in Figure 5 in Appendix K. The performance curves show that our algorithm is quite robust to the learning rates and client ranks on most datasets except CoLA, which is more challenging and thus requires a larger client rank to fit.
  • We compared with related works and highlighted the differences between our algorithm and theirs. We empirically compare with FedDPA, FedLR, pFedLoRA, and FDLoRA on the GLUE benchmark. The results demonstrate that our algorithm shows superior performance on most GLUE datasets with the same number of trainable parameters.

Please let us know if you have additional concerns, and we are more than willing to address any further questions.

Best,

Authors

AC Meta-Review

The topic of this paper is about PFL on foundation models under data heterogeneity. The proposed method, named PF2LoRA, utilizes a two-level low-rank adaptation framework to fine-tune foundation models. This involves a global adapter to capture common patterns across clients and a local adapter for client-specific personalization, both optimized using bilevel optimization. The authors demonstrate their method's efficacy through experiments on natural language understanding and generation tasks. The paper also includes theoretical analysis and a robust discussion of its optimization strategy, which considers dependencies between global and local adapters. This paper addresses a timely and impactful problem, and most reviewers agreed that it was well-written. However, it could be further strengthened by more comprehensively discussing related work to better highlight its novelty, extending experiments to larger foundation models, and incorporating reviewer suggestions into the paper revision. While the paper demonstrates several merits, it ranks as a borderline submission, and the unresolved weaknesses limit its suitability for acceptance given the high standards of ICLR.

Additional Comments on Reviewer Discussion

The major points raised by the reviewers include the motivation and novelty of the methods, experimental validation (scalability and realistic settings), and the complexity of the proposed method. There were also clarification questions, such as the need for further explanation to support some claims in the manuscript regarding automatic rank adaptation. The authors provided a theoretical analysis and synthetic experiments to demonstrate that PF2LoRA could adapt ranks to data complexity. However, concerns remained about whether this adaptation was fully data-driven or predefined. The authors also added additional experimental results and provided an efficiency analysis in the rebuttal. While the authors clarified their motivations, the reviewers and I found that the novelty and distinction of PF2LoRA from prior methods were still insufficiently compelling, and it was unclear how the authors planned to address this issue in future revisions—the current revision does not resolve it. Although the authors provided ablation studies and theoretical justifications, the additional complexity did not clearly translate into significant advantages over simpler approaches.

Despite the authors' meaningful efforts to address reviewer concerns, the current manuscript does not sufficiently address the key limitations, leading to the final decision to reject the submission.

Final Decision

Reject