PaperHub
Rating: 6.6 / 10 (Poster; 5 reviewers; min 5, max 8, std 1.2)
Individual ratings: 8, 8, 6, 6, 5
Confidence: 3.6
Correctness: 2.8 · Contribution: 2.4 · Presentation: 2.8
ICLR 2025

Federated Residual Low-Rank Adaptation of Large Language Models

Submitted: 2024-09-23 · Updated: 2025-04-27

Abstract

Keywords
Large Language Model, Federated Learning, Parameter-Efficient Fine-Tuning

Reviews and Discussion

Review (Rating: 8)

In this paper, the author proposes a novel framework named federated residual low-rank adaptation to effectively fine-tune pretrained large language models in a privacy-preserving manner. The proposed method addresses both the intrinsic problem caused by the constrained parameter space and the extrinsic problem caused by the drift among clients. Extensive experiments on several datasets have shown the effectiveness of the proposed method.

Strengths

  1. The idea of enlarging the parameter space when fine-tuning large language models in federated learning is promising.
  2. The paper is well organized and the related works are thoroughly summarized.
  3. The discussion of the weaknesses of existing related works and the novelty of the proposed method is clear.
  4. The authors conduct comprehensive experiments to demonstrate the effectiveness of the proposed method, which is convincing.

Weaknesses

  1. In Eqs. (6) and (7), the author initializes the local low-rank matrices with the low-rank matrices reconstructed from the top singular values, while keeping the residual part as the global model. In my opinion, this makes the low-rank matrices retain most of the information in the pretrained model, while the information remaining in the global model can be significantly reduced. However, in FL, the global model should contain more generalized information from the pre-trained model, which is in contrast to the above parameter decoupling operation. So I am curious whether this operation can reduce the generalization capacity or convergence speed of FL?

  2. The motivation of the proposed low-rank matrix initialization lacks further discussion. Can the author provide a more detailed discussion about this, e.g., via landscape visualization?

  3. Why is the global model initialized with the residual matrix? It seems that directly using the pre-trained model parameters as the global model could provide a more generalized initialization during FL training. Can the author provide a more detailed discussion about this?

  4. In Eqs. (13) and (14), the author claims that by using residual accumulation, the model's fine-tuned space can be expanded. However, Eq. (14) seems to only give an upper bound for the summed matrices rather than increasing their lower bound. So it cannot guarantee that the rank of the summed matrices can be increased by such a residual accumulation operation. Can the author provide a more in-depth discussion about this?

  5. Actually, the standard PEFT by LoRA in FL can also be formulated as Eq. (13), with $\delta W^t$ as the global update of the LoRA parts at each global round. So what are the real differences between the standard method and the proposed FRLoRA?

Questions

The items listed in the weaknesses. If the author can address or partially address these problems, I will be pleased to improve my score.

Comment

We thank the reviewer for his/her constructive comments and provide our point-wise replies as follows.

Q1: Do Eqs. (6) and (7) result in information loss?

We completely AGREE that $\boldsymbol{A}_G^0$ and $\boldsymbol{B}_G^0$ retain the MOST information in the pretrained model. Actually, they are merged back into the global model (see Eq.(11)), and this does NOT result in information loss:

(1) Reinitialization at Each Round: At the start of each communication round, the low-rank matrices are reinitialized to $\boldsymbol{A}_G^0$ and $\boldsymbol{B}_G^0$, ensuring the critical information is always preserved.

(2) Final Integration: After training, $\boldsymbol{A}_G^0$ and $\boldsymbol{B}_G^0$ are merged back into the global model (Eq.(11)), fully restoring all retained information.
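For intuition, here is a minimal NumPy sketch of this split-and-merge, under our reading of the SVD-based initialization in Eqs. (5)-(7); the shapes are illustrative and this is not the authors' code.

```python
# Minimal sketch of the SVD-based split and lossless merge discussed above.
import numpy as np

d1, d2, r = 64, 48, 8
W0 = np.random.randn(d1, d2)                       # pretrained weight (toy scale)

U, S, Vt = np.linalg.svd(W0, full_matrices=False)  # W0 = U diag(S) Vt
B0 = U[:, :r] * np.sqrt(S[:r])                     # B_G^0: d1 x r
A0 = np.sqrt(S[:r])[:, None] * Vt[:r, :]           # A_G^0: r x d2
W_hat0 = W0 - B0 @ A0                              # frozen residual part

# Merging the low-rank factors back (as in Eq. (11)) recovers the pretrained
# weight exactly, so the decomposition itself loses no information.
assert np.allclose(W_hat0 + B0 @ A0, W0)
```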

Q2: Motivation of FRLoRA's initialization.

We clarify it as follows:

(1) The standard method, where $\boldsymbol{B}$ is set to 0 and $\boldsymbol{A}$ is initialized with Gaussian noise, often struggles with convergence. If FRLoRA is reinitialized to such matrices in each round, it will severely hinder the convergence of the local models, thereby degrading the knowledge in the residual update $\Delta \boldsymbol{W}^t$. We have added a visualization of the loss landscape in the revision (Appendix C.7, Figure 5). The results demonstrate the effectiveness of FRLoRA's initialization strategy in addressing the convergence challenges associated with standard initialization methods, enabling faster and more stable convergence.

(2) Moreover, under data heterogeneity, standard initialization causes inconsistent convergence due to the varying data distributions across clients, further exacerbating client drift. FRLoRA's reinitialization ensures that each round of local optimization begins with a consistent principal singular value space, helping to mitigate client drift. Our ablation results (Section 4.4) on different datasets, together with further analysis in the revision (Appendix C.5, Figure 3), confirm its effectiveness.

Q3: Why is the global model initialized with the residual matrix?

If the global model in Eq.(7) is initialized directly using the pretrained model, i.e., $\hat{\boldsymbol{W}}^0 = \boldsymbol{W}^0$, the model parameters $\widetilde{\boldsymbol{W}}^0$ in Eq.(11) at the initial stage become:

$$
\begin{aligned}
\widetilde{\boldsymbol{W}}^0 &= \hat{\boldsymbol{W}}^0 + \boldsymbol{B}^{0}_G \boldsymbol{A}^{0}_G \\
&= \boldsymbol{W}^0 + \boldsymbol{U}[:r]\sqrt{\boldsymbol{S}[:r]} \times \sqrt{\boldsymbol{S}[:r]} \boldsymbol{V}[:r] \\
&\neq \boldsymbol{W}^0
\end{aligned}
$$

This indicates that, at the initial stage, merging the initialized low-rank matrices back into the global model does not revert it to the pretrained model. Therefore, we should initialize the global model with residual matrices instead of the pretrained model, which ensures that $\widetilde{\boldsymbol{W}}^0$ and $\boldsymbol{W}^0$ remain consistent in the initial stage.

Comment

Q5: Differences between the standard method and FRLoRA in Eq. (13).

The standard LoRA in FL can NOT be formulated as Eq.(13). If we want to express the global update of the LoRA parts in the form of Eq.(13), it can only be written as:

$$
\begin{aligned}
\boldsymbol{A}_G^T &= \boldsymbol{A}_G^0 + \boldsymbol{A}_G^1 - \boldsymbol{A}_G^0 + \boldsymbol{A}_G^2 - \boldsymbol{A}_G^1 + \ldots + \boldsymbol{A}_G^T - \boldsymbol{A}_G^{T-1} \\
&= \boldsymbol{A}_G^0 + \Delta \boldsymbol{A}^1 + \Delta \boldsymbol{A}^2 + \ldots + \Delta \boldsymbol{A}^T
\end{aligned}
$$

$$
\begin{aligned}
\boldsymbol{B}_G^T &= \boldsymbol{B}_G^0 + \boldsymbol{B}_G^1 - \boldsymbol{B}_G^0 + \boldsymbol{B}_G^2 - \boldsymbol{B}_G^1 + \ldots + \boldsymbol{B}_G^T - \boldsymbol{B}_G^{T-1} \\
&= \boldsymbol{B}_G^0 + \Delta \boldsymbol{B}^1 + \Delta \boldsymbol{B}^2 + \ldots + \Delta \boldsymbol{B}^T
\end{aligned}
$$

We can observe that $\Delta\boldsymbol{A}^{t}$ and $\Delta\boldsymbol{A}^{t-1}$, as well as $\Delta\boldsymbol{B}^{t}$ and $\Delta\boldsymbol{B}^{t-1}$, are NOT independent. They share common terms, as the standard LoRA in FL only updates the low-rank matrices and uses them as the initialization for the next training round. Since these common terms can be merged, there is NO iterative accumulative effect as shown in Eq.(13). Therefore, the final fine-tuned global model can only be expressed as Eq. (4):

$$
\widetilde{\boldsymbol{W}}^T = \boldsymbol{W}^0 + \Delta\boldsymbol{W}^T = \boldsymbol{W}^0 + \boldsymbol{B}_G^T\boldsymbol{A}_G^T
$$

In contrast, the global update process of FRLoRA can be expressed as:

$$
\begin{aligned}
\widetilde{\boldsymbol{W}}^{T} &= \hat{\boldsymbol{W}}^T + \boldsymbol{B}^{0}_G \boldsymbol{A}^{0}_G \\
&= \hat{\boldsymbol{W}}^0 + \Delta\boldsymbol{W}^1 + \Delta\boldsymbol{W}^2 + \ldots + \Delta\boldsymbol{W}^T + \boldsymbol{B}^{0}_G \boldsymbol{A}^{0}_G \\
&= \boldsymbol{W}^0 - \boldsymbol{B}_G^0 \boldsymbol{A}_G^0 + \Delta\boldsymbol{W}^1 + \Delta\boldsymbol{W}^2 + \ldots + \Delta\boldsymbol{W}^T + \boldsymbol{B}^{0}_G \boldsymbol{A}^{0}_G \\
&= \boldsymbol{W}^0 + \Delta\boldsymbol{W}^1 + \Delta\boldsymbol{W}^2 + \ldots + \Delta\boldsymbol{W}^T \\
&= \boldsymbol{W}^0 + \boldsymbol{B}_G^1\boldsymbol{A}_G^1 - \boldsymbol{B}_G^0\boldsymbol{A}_G^0 + \boldsymbol{B}_G^2\boldsymbol{A}_G^2 - \boldsymbol{B}_G^0\boldsymbol{A}_G^0 + \ldots + \boldsymbol{B}_G^T\boldsymbol{A}_G^T - \boldsymbol{B}_G^0\boldsymbol{A}_G^0
\end{aligned}
$$

Given that FRLoRA reinitializes the low-rank matrices in each training round, $\Delta\boldsymbol{W}^{t}$ and $\Delta\boldsymbol{W}^{t-1}$ are independent and can NOT be merged. Additionally, FRLoRA directly applies these global updates to the global model. As a result, the global update of FRLoRA has an iterative accumulative effect, which allows global updates to occur in a higher-rank parameter space, thereby effectively capturing global knowledge.

Besides, $\Delta \boldsymbol{A}^1, \ldots, \Delta\boldsymbol{A}^T$ are all $r \times d_2$ matrices, and when summing $T$ such $r \times d_2$ matrices, the resulting matrix's rank will not exceed $r$. Although $\Delta \boldsymbol{W}^t$ is a $d_1 \times d_2$ matrix with rank $r$, when summing $T$ such matrices, the upper bound of the rank is $\min(rT, d_1, d_2)$, which can be greater than $r$.
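For intuition, a quick numerical check of this bound (random rank-$r$ updates with illustrative shapes; not the paper's code):

```python
# Summing T independent rank-r updates typically yields rank min(r*T, d1, d2) > r.
import numpy as np

rng = np.random.default_rng(0)
d1, d2, r, T = 64, 48, 4, 5

updates = [rng.standard_normal((d1, r)) @ rng.standard_normal((r, d2)) for _ in range(T)]
accumulated = sum(updates)

print(np.linalg.matrix_rank(accumulated))  # 20 = min(r*T, d1, d2) with overwhelming probability
```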

Comment

The authors have partially addressed my concern and I improved my score as a result. However, I still have some concerns about the theoretical results of residual accumulation.

Comment

Thank you for your prompt response. We will immediately address your remaining concerns, such as those regarding the lower bounds. We are highly committed to resolving all your concerns to your satisfaction.

Comment

As the discussion period draws to a close soon, we extend our sincere gratitude to you for the valuable time and insightful comments.

In our previous response, we have carefully studied your comments and made detailed responses summarized below:

  1. Clarified the issue of information loss in Eqs.(6) and (7).
  2. Provided further explanation on the motivation of FRLoRA's initialization.
  3. Explained why the global model should be initialized with the residual matrix.
  4. Provided further explanation on the residual accumulation in Eq.(13).
  5. Explained the differences between the standard method and FRLoRA in Eq.(13).

We sincerely hope our responses have effectively addressed your concerns. If you have any remaining questions or require further clarification, please do not hesitate to let us know, and we would be glad to provide further explanations.

Thank you again for your efforts in reviewing our work.

Comment

Q4: Explanation of residual accumulation.

We discuss this both theoretically and empirically.

(1) Theoretically, the upper bound on the rank of the global update parameter space in FRLoRA is $\min(rT, d_1, d_2)$, which is significantly larger than the upper bound $r$ in FedAvg. The iterative residual accumulation mechanism provides this higher upper bound, though we also acknowledge that it cannot strictly increase the rank as $T$ increases. However, a higher upper bound typically means a larger potential maximum value under the same conditions. This indicates that FRLoRA has the ability to explore a larger updating parameter space to capture more complex structures, allowing for better representation of the diverse knowledge learned from different clients.

(2) Empirically, we have provided results with different values of $r$ in our submitted paper (Table 10). The results are presented below:

| Method | GSM8K (r=16) | Math (r=16) | Avg. (r=16) | GSM8K (r=32) | Math (r=32) | Avg. (r=32) | GSM8K (r=64) | Math (r=64) | Avg. (r=64) |
|---|---|---|---|---|---|---|---|---|---|
| FedAvg | 32.67 | 4.64 | 18.65 | 34.95 | 4.48 | 19.71 | 37.45 | 5.38 | 21.41 |
| FedProx | 32.29 | 4.32 | 18.30 | 35.40 | 4.66 | 20.03 | 36.39 | 4.98 | 20.68 |
| SCAFFOLD | 32.97 | 4.70 | 18.84 | 35.78 | 5.08 | 20.43 | 32.37 | 4.64 | 18.50 |
| FedAvgM | 32.44 | 4.42 | 18.43 | 34.79 | 4.64 | 19.71 | 35.57 | 4.72 | 20.14 |
| FedAdagrad | 28.65 | 4.18 | 16.41 | 29.64 | 4.06 | 16.85 | 31.76 | 4.46 | 18.31 |
| FedYogi | 30.32 | 4.00 | 17.16 | 30.09 | 4.04 | 17.06 | 33.96 | 4.40 | 19.33 |
| FedAdam | 31.23 | 4.14 | 17.68 | 31.84 | 4.12 | 17.98 | 34.26 | 5.16 | 39.44 |
| FFA-LoRA | 25.17 | 3.60 | 14.38 | 28.05 | 3.78 | 15.91 | 31.00 | 4.50 | 17.75 |
| FRLoRA (Ours) | 39.57 | 5.60 | 22.58 | 44.27 | 5.22 | 24.74 | 45.56 | 6.88 | 26.22 |

As shown, FRLoRA ($r=16$) outperforms FedAvg ($r=16$) and even achieves higher performance than FedAvg ($r=64$). Besides, we observed that when training FRLoRA ($r=16$) on MetaMathQA after 100 rounds, the average rank of residual accumulation at each layer reached 82. These results strongly demonstrate the effectiveness of our method in expanding the parameter space of global updates.

Comment

Thanks for your valuable comment. In the following, we try our best to explain our theoretical analysis on the rank of the global update more clearly.

To begin with, we define the global update of FedAvg with LoRA as,

$$
\Delta \widetilde{\boldsymbol{W}}^T_{FedAvg} = \boldsymbol{B}_G^T\boldsymbol{A}_G^T.
$$

Here, $\boldsymbol{B}_G^T$ is a $d_1 \times r$ matrix and $\boldsymbol{A}_G^T$ is an $r \times d_2$ matrix. Based on Eq.(12), we have,

$$
\mathrm{rank}(\Delta \widetilde{\boldsymbol{W}}^T_{FedAvg}) \leq \min\big(\mathrm{rank}(\boldsymbol{B}_G^T), \mathrm{rank}(\boldsymbol{A}_G^T)\big) \leq r.
$$

Then, we define the global update of FRLoRA as,

$$
\begin{aligned}
\Delta \widetilde{\boldsymbol{W}}^{T}_{FRLoRA} &= \Delta\boldsymbol{W}^1 + \Delta\boldsymbol{W}^2 + \ldots + \Delta\boldsymbol{W}^T \\
&= (\boldsymbol{B}_G^1\boldsymbol{A}_G^1 - \boldsymbol{B}_G^0\boldsymbol{A}_G^0) + (\boldsymbol{B}_G^2\boldsymbol{A}_G^2 - \boldsymbol{B}_G^0\boldsymbol{A}_G^0) + \ldots + (\boldsymbol{B}_G^T\boldsymbol{A}_G^T - \boldsymbol{B}_G^0\boldsymbol{A}_G^0) \\
&= -T\boldsymbol{B}_G^0\boldsymbol{A}_G^0 + \boldsymbol{B}_G^1\boldsymbol{A}_G^1 + \boldsymbol{B}_G^2\boldsymbol{A}_G^2 + \ldots + \boldsymbol{B}_G^T\boldsymbol{A}_G^T.
\end{aligned}
$$

Obviously, $\Delta \widetilde{\boldsymbol{W}}^{T}_{FRLoRA}$ can be rewritten as,

$$
\Delta \widetilde{\boldsymbol{W}}^{T}_{FRLoRA} = [-T\boldsymbol{B}_G^0; \boldsymbol{B}_G^1; \boldsymbol{B}_G^2; \ldots; \boldsymbol{B}_G^T] \times [\boldsymbol{A}_G^0; \boldsymbol{A}_G^1; \boldsymbol{A}_G^2; \ldots; \boldsymbol{A}_G^T].
$$

Here, the size of $[-T\boldsymbol{B}_G^0; \boldsymbol{B}_G^1; \boldsymbol{B}_G^2; \ldots; \boldsymbol{B}_G^T]$ is $d_1 \times r(T+1)$, and the size of $[\boldsymbol{A}_G^0; \boldsymbol{A}_G^1; \boldsymbol{A}_G^2; \ldots; \boldsymbol{A}_G^T]$ is $r(T+1) \times d_2$.

Thus, we have,

$$
\mathrm{rank}(\Delta \widetilde{\boldsymbol{W}}^T_{FRLoRA}) \leq \min\big(r(T+1), d_1, d_2\big),
$$

and

$$
\mathrm{rank}(\Delta \widetilde{\boldsymbol{W}}^T_{FRLoRA}) \geq \mathrm{rank}(\Delta \widetilde{\boldsymbol{W}}^T_{FedAvg}).
$$

Finally, we note that $\mathrm{rank}(\Delta \widetilde{\boldsymbol{W}}^T_{FRLoRA})$ is greater than $\mathrm{rank}(\Delta \widetilde{\boldsymbol{W}}^T_{FedAvg})$ except when $\boldsymbol{B}_G^{0}\boldsymbol{A}_G^{0}, \boldsymbol{B}_G^1\boldsymbol{A}_G^1, \ldots, \boldsymbol{B}_G^{T}\boldsymbol{A}_G^{T}$ are completely linearly dependent.
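The factorization above can also be checked numerically with random stand-ins for the factors (illustrative shapes; not the authors' code):

```python
# Verify the block factorization of the FRLoRA accumulation and compare its
# numerical rank with a single B_G^T A_G^T product of the FedAvg form.
import numpy as np

rng = np.random.default_rng(1)
d1, d2, r, T = 64, 48, 4, 5

B = [rng.standard_normal((d1, r)) for _ in range(T + 1)]   # stand-ins for B_G^0..B_G^T
A = [rng.standard_normal((r, d2)) for _ in range(T + 1)]   # stand-ins for A_G^0..A_G^T

frlora_update = sum(B[t] @ A[t] - B[0] @ A[0] for t in range(1, T + 1))
block_B = np.hstack([-T * B[0]] + B[1:])                   # d1 x r(T+1)
block_A = np.vstack([A[0]] + A[1:])                        # r(T+1) x d2

assert np.allclose(frlora_update, block_B @ block_A)
print(np.linalg.matrix_rank(frlora_update))                # up to min(r(T+1), d1, d2), here 24
print(np.linalg.matrix_rank(B[T] @ A[T]))                  # at most r, as in the FedAvg form
```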

We sincerely hope this theoretical analysis has effectively addressed your concerns. If you have any remaining questions or require further clarification, please do not hesitate to let us know, and we would be glad to provide further explanations.

Comment

Thanks for your valuable comment. We kindly wanted to follow up to ask if our responses have satisfactorily resolved your concerns.

If you have any remaining questions or require further clarification, please do not hesitate to let us know, and we would be glad to provide further explanations.

Thank you again for your efforts in reviewing our work.

Comment

Thanks for your comments. But it seems that $\Delta \widetilde{\boldsymbol{W}}^T_{FedAvg}$ can also be decomposed into the sum of multiple low-rank matrices. That is, $\Delta \widetilde{\boldsymbol{W}}^T_{FedAvg} = \Delta\boldsymbol{W}^1 + \Delta\boldsymbol{W}^2 + \ldots + \Delta\boldsymbol{W}^T$, where $\Delta\boldsymbol{W}^t$ is the gradient of the low-rank branch during training at round $t$. Then the same conclusion as for $\Delta \widetilde{\boldsymbol{W}}^{T}_{FRLoRA}$ can be derived, I believe.

Comment

Thanks for your valuable comment. We further clarify it as follows:

According to your description, we redefine the global update of FedAvg with LoRA as,

$$
\begin{aligned}
\Delta \widetilde{\boldsymbol{W}}^T_{FedAvg} &= \boldsymbol{B}_G^T\boldsymbol{A}_G^T \\
&= \boldsymbol{B}_G^0\boldsymbol{A}_G^0 + (\boldsymbol{B}_G^1\boldsymbol{A}_G^1 - \boldsymbol{B}_G^0\boldsymbol{A}_G^0) + (\boldsymbol{B}_G^2\boldsymbol{A}_G^2 - \boldsymbol{B}_G^1\boldsymbol{A}_G^1) + \ldots + (\boldsymbol{B}_G^T\boldsymbol{A}_G^T - \boldsymbol{B}_G^{T-1}\boldsymbol{A}_G^{T-1}) \\
&= \boldsymbol{B}_G^0\boldsymbol{A}_G^0 + \Delta \boldsymbol{W}^1 + \Delta \boldsymbol{W}^2 + \ldots + \Delta \boldsymbol{W}^T.
\end{aligned}
$$

We can observe that $\Delta \boldsymbol{W}^{t} = \boldsymbol{B}_G^t\boldsymbol{A}_G^t - \boldsymbol{B}_G^{t-1}\boldsymbol{A}_G^{t-1}$ and $\Delta \boldsymbol{W}^{t-1} = \boldsymbol{B}_G^{t-1}\boldsymbol{A}_G^{t-1} - \boldsymbol{B}_G^{t-2}\boldsymbol{A}_G^{t-2}$ are NOT independent. We note that FedAvg only updates the low-rank matrices and uses them as the initialization for the next training round. Consequently, these terms can still be merged into a low-rank matrix, $\boldsymbol{B}_G^0\boldsymbol{A}_G^0 + \Delta \boldsymbol{W}^1 + \Delta \boldsymbol{W}^2 + \ldots + \Delta \boldsymbol{W}^T = \boldsymbol{B}_G^T\boldsymbol{A}_G^T$. Here, $\boldsymbol{B}_G^T$ is a $d_1 \times r$ matrix and $\boldsymbol{A}_G^T$ is an $r \times d_2$ matrix.

Thus, we have,

$$
\mathrm{rank}(\Delta \widetilde{\boldsymbol{W}}^T_{FedAvg}) \leq \min\big(\mathrm{rank}(\boldsymbol{B}_G^T), \mathrm{rank}(\boldsymbol{A}_G^T)\big) \leq r.
$$

In contrast, the global update of FRLoRA can be expressed as,

$$
\begin{aligned}
\Delta \widetilde{\boldsymbol{W}}^{T}_{FRLoRA} &= \Delta\boldsymbol{W}^1 + \Delta\boldsymbol{W}^2 + \ldots + \Delta\boldsymbol{W}^T \\
&= (\boldsymbol{B}_G^1\boldsymbol{A}_G^1 - \boldsymbol{B}_G^0\boldsymbol{A}_G^0) + (\boldsymbol{B}_G^2\boldsymbol{A}_G^2 - \boldsymbol{B}_G^0\boldsymbol{A}_G^0) + \ldots + (\boldsymbol{B}_G^T\boldsymbol{A}_G^T - \boldsymbol{B}_G^0\boldsymbol{A}_G^0) \\
&= -T\boldsymbol{B}_G^0\boldsymbol{A}_G^0 + \boldsymbol{B}_G^1\boldsymbol{A}_G^1 + \boldsymbol{B}_G^2\boldsymbol{A}_G^2 + \ldots + \boldsymbol{B}_G^T\boldsymbol{A}_G^T.
\end{aligned}
$$

Given that FRLoRA reinitializes the low-rank matrices in each training round and directly updates the global model using the updated global low-rank matrices, $\Delta \boldsymbol{W}^{t} = \boldsymbol{B}_G^t\boldsymbol{A}_G^t - \boldsymbol{B}_G^{0}\boldsymbol{A}_G^{0}$ and $\Delta \boldsymbol{W}^{t-1} = \boldsymbol{B}_G^{t-1}\boldsymbol{A}_G^{t-1} - \boldsymbol{B}_G^{0}\boldsymbol{A}_G^{0}$ are independent and can NOT be merged.

And $\Delta \widetilde{\boldsymbol{W}}^{T}_{FRLoRA}$ can be rewritten as,

$$
\Delta \widetilde{\boldsymbol{W}}^{T}_{FRLoRA} = [-T\boldsymbol{B}_G^0; \boldsymbol{B}_G^1; \boldsymbol{B}_G^2; \ldots; \boldsymbol{B}_G^T] \times [\boldsymbol{A}_G^0; \boldsymbol{A}_G^1; \boldsymbol{A}_G^2; \ldots; \boldsymbol{A}_G^T].
$$

Here, the size of $[-T\boldsymbol{B}_G^0; \boldsymbol{B}_G^1; \boldsymbol{B}_G^2; \ldots; \boldsymbol{B}_G^T]$ is $d_1 \times r(T+1)$, and the size of $[\boldsymbol{A}_G^0; \boldsymbol{A}_G^1; \boldsymbol{A}_G^2; \ldots; \boldsymbol{A}_G^T]$ is $r(T+1) \times d_2$.

Compared to FedAvg, FRLoRA extends the global update from two low-rank matrices, $\boldsymbol{B}_G^T$ and $\boldsymbol{A}_G^T$, to two high-rank matrices, $[-T\boldsymbol{B}_G^0; \boldsymbol{B}_G^1; \boldsymbol{B}_G^2; \ldots; \boldsymbol{B}_G^T]$ and $[\boldsymbol{A}_G^0; \boldsymbol{A}_G^1; \boldsymbol{A}_G^2; \ldots; \boldsymbol{A}_G^T]$.

Thus, we have,

$$
\mathrm{rank}(\Delta \widetilde{\boldsymbol{W}}^T_{FRLoRA}) \leq \min\big(r(T+1), d_1, d_2\big),
$$

and

$$
\mathrm{rank}(\Delta \widetilde{\boldsymbol{W}}^T_{FRLoRA}) \geq \mathrm{rank}(\Delta \widetilde{\boldsymbol{W}}^T_{FedAvg}).
$$

Finally, we note that $\mathrm{rank}(\Delta \widetilde{\boldsymbol{W}}^T_{FRLoRA})$ is greater than $\mathrm{rank}(\Delta \widetilde{\boldsymbol{W}}^T_{FedAvg})$ except when $\boldsymbol{B}_G^{0}\boldsymbol{A}_G^{0}, \boldsymbol{B}_G^1\boldsymbol{A}_G^1, \ldots, \boldsymbol{B}_G^{T}\boldsymbol{A}_G^{T}$ are completely linearly dependent.

Besides, we also recorded the rank of the global update for FedAvg (r=16) after training for 100 rounds on MetaMathQA. The average rank of $\Delta \widetilde{\boldsymbol{W}}^T_{FedAvg}$ at each layer is 16. In contrast, the average rank of $\Delta\widetilde{\boldsymbol{W}}^{T}_{FRLoRA}$ at each layer for FRLoRA (r=16) reached 82. This further proves that FRLoRA can achieve a global update with a higher rank.
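For reference, this is one way such a per-layer rank could be measured, via a tolerance-based numerical rank of the accumulated update; the threshold here is our illustrative choice, and the exact procedure is not specified in the thread.

```python
# Hedged sketch: estimate the numerical rank of an accumulated global update
# from its singular values with a relative threshold (threshold choice is ours).
import torch

def numerical_rank(delta_w: torch.Tensor, rel_tol: float = 1e-5) -> int:
    s = torch.linalg.svdvals(delta_w.float())
    return int((s > rel_tol * s.max()).sum())

# Averaging numerical_rank(...) over every LoRA-adapted layer gives a per-layer
# average rank, comparable to the figures quoted above.
```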

We sincerely hope our response has effectively addressed your concerns. If you have any remaining questions or require further clarification, please do not hesitate to let us know, and we would be glad to provide further explanations.

Comment

Thanks for your explanation. My concerns have been addressed and I am thereby raising my score. I hope you can include these clarifications in your future manuscript.

Comment

Thanks for your valuable time to respond to our feedback!

We are very happy to see that your concerns have been fully addressed :)

We will include these clarifications in our final manuscript. Thank you once again for reviewing our work.

Review (Rating: 8)

This paper explores the challenges of using LoRA in Federated Learning (FL) for fine-tuning large language models (LLMs) on non-IID data. The authors identify that LoRA in FL struggles to learn global knowledge due to two key issues: extrinsic client drift and an intrinsic constrained update space. The authors propose FRLoRA (Federated Residual Low-Rank Adaptation) to overcome these challenges. FRLoRA initializes the LoRA matrices $A$ and $B$ using the Singular Value Decomposition (SVD) of the initial weights $W_0$, while freezing the residual initial weight. During training, the locally updated LoRA matrices are aggregated and combined with the residual weight, creating the initial model for the next training round. This method addresses both client drift and limited update space, for which a theoretical justification is presented. The experiments on nine language model benchmarks show that FRLoRA consistently outperforms various baseline methods.

Strengths

Addressing Non-IID Challenges in FL with LoRA: FRLoRA tackles an important challenge in federated fine-tuning of LLMs by addressing both client drift and constrained parameter space through aggregated $\Delta W$ updates via SVD.

Theoretical Justification: The authors provide a simple theoretical justification of FRLoRA using a rank analysis of the parameter space, although the rank analysis may be far from practice.

Comprehensive Evaluation: The study provides extensive experiments across nine benchmarks, covering Natural Language Understanding and Generation, validating FRLoRA’s efficacy in terms of performance and communication cost.

Weaknesses

Insufficient Evidence on the Problem Formulation: The authors characterize the client drift using Figure 1 (c-d) but it is not convincing. To be specific, the norm of $\Delta W$ and the standard deviation can be simply reduced by other factors, e.g., the learning rate. Perhaps it would be better to (i) focus on a single round rather than the sequence of rounds and (ii) observe the distribution of cosine similarity rather than the standard deviation. It would also be helpful to analyze the other FL methods.

In addition, the authors claim that the constrained parameter space is one of the major problems in the naive method (FedAvg + LoRA). The simple rank analysis supports this claim. However, it is not thoroughly studied in the empirical analysis. Hypothetically, the constrained parameter space problem could be simply addressed by employing a large rank $r$.

Limitations in Non-IID Testing Scenarios: Experiments use a Dirichlet (0.5) Non-IID setting with five clients for binary classification, while NLG tests are performed on IID data. Broader Non-IID experiments would better showcase FRLoRA's robustness instead of just increasing the number of benchmarks.

Limited Comparison to Existing Works: The baselines used for comparison, such as FedYogi, FedAdam, and FedProx, do not integrate LoRA, making comparisons with these methods less relevant. Only FFA-LoRA offers a directly comparable baseline. Apparently, there is another LoRA method for FL, FlexLoRA (https://arxiv.org/abs/2402.11505, Feb 2024), in which a similar idea of using SVD per round is proposed. In addition, the idea of using the residual is analogous to the one in Chain of LoRA (https://arxiv.org/abs/2401.04151, Jan 2024), which needs to be discussed.

Questions


Does FRLoRA Fully Address the Intrinsic Challenge?: While FRLoRA expands the parameter space globally, it does not necessarily address the constrained optimization space at the local level. Can the expanded global space fully mitigate intrinsic issues without modifying local optimization constraints?

Catastrophic Forgetting at Low Ranks: Direct updates to $W_0$ in a low-rank context may risk catastrophic forgetting. Can FRLoRA's SVD initialization mitigate this risk, or could additional ablation studies on learning rates clarify this?

Comparison to Chain of LoRA’s Residual Framework: Both FRLoRA and Chain of LoRA use residual structures for fine-tuning. How does FRLoRA’s SVD-based initialization impact its effectiveness relative to the iterative approach of Chain of LoRA? Further analysis of residual efficacy could enhance the comparative evaluation.

More Diverse Data Heterogeneity: Is there an empirical study for NLG with severe heterogeneity?

Comment

We thank the reviewer for his/her constructive comments and provide our point-wise replies as follows.

Q1: The distribution of cosine similarity in a single round.

Thanks for your suggestion. We analyzed the cosine similarity of local updates, $\Delta\boldsymbol{W}_k^t$, across different clients in a single round for FedAvg (non-IID), FedAvg (IID), and FRLoRA (non-IID). All methods are trained for the same number of rounds, and the results have been included in the revision (Appendix C.5, Figure 3).

The results show that the cosine similarity between clients in FedAvg (IID) (see Figure 3 (b)) is distributed at significantly higher values than in FedAvg (non-IID) (see Figure 3 (a)), and FRLoRA (non-IID) (see Figure 3 (c)) exhibits a higher distribution compared to FedAvg (non-IID). This further demonstrates the effectiveness of FRLoRA in reducing client drift, promoting more consistent model convergence under data heterogeneity.
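For clarity, the pairwise similarities in such an analysis can be computed from the flattened client updates of a single round, e.g. (an illustrative helper, not the authors' analysis code):

```python
# Pairwise cosine similarity between the flattened local updates of K clients.
import torch
import torch.nn.functional as F

def pairwise_cosine(client_updates: list) -> torch.Tensor:
    flat = torch.stack([u.flatten() for u in client_updates])  # K x P
    flat = F.normalize(flat, dim=1)
    return flat @ flat.T                                       # K x K similarity matrix
```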

Q2: Empirical analysis of the constrained parameter space.

We have conducted this empirical analysis in our submitted paper (Table 10), and we present the results below.

| Method | GSM8K (r=16) | Math (r=16) | Avg. (r=16) | GSM8K (r=32) | Math (r=32) | Avg. (r=32) | GSM8K (r=64) | Math (r=64) | Avg. (r=64) |
|---|---|---|---|---|---|---|---|---|---|
| FedAvg | 32.67 | 4.64 | 18.65 | 34.95 | 4.48 | 19.71 | 37.45 | 5.38 | 21.41 |
| FedProx | 32.29 | 4.32 | 18.30 | 35.40 | 4.66 | 20.03 | 36.39 | 4.98 | 20.68 |
| SCAFFOLD | 32.97 | 4.70 | 18.84 | 35.78 | 5.08 | 20.43 | 32.37 | 4.64 | 18.50 |
| FedAvgM | 32.44 | 4.42 | 18.43 | 34.79 | 4.64 | 19.71 | 35.57 | 4.72 | 20.14 |
| FedAdagrad | 28.65 | 4.18 | 16.41 | 29.64 | 4.06 | 16.85 | 31.76 | 4.46 | 18.31 |
| FedYogi | 30.32 | 4.00 | 17.16 | 30.09 | 4.04 | 17.06 | 33.96 | 4.40 | 19.33 |
| FedAdam | 31.23 | 4.14 | 17.68 | 31.84 | 4.12 | 17.98 | 34.26 | 5.16 | 39.44 |
| FFA-LoRA | 25.17 | 3.60 | 14.38 | 28.05 | 3.78 | 15.91 | 31.00 | 4.50 | 17.75 |
| FRLoRA (Ours) | 39.57 | 5.60 | 22.58 | 44.27 | 5.22 | 24.74 | 45.56 | 6.88 | 26.22 |

(1) As shown, increasing the rank $r$ in FedAvg + LoRA does alleviate the constrained parameter space issue, leading to performance improvements. However, this improvement incurs larger communication overhead.

(2) Unlike simply increasing $r$, FRLoRA effectively performs global updates in a higher-rank space with the same $r$ as FedAvg. It can be observed that FRLoRA ($r=16$) achieves even higher performance than FedAvg ($r=64$), strongly demonstrating our claim. We have clarified this in the revision (Lines 1073-1074).

Q3: Limited non-IID scenarios.

We clarify it as follows:

(1) We would like to clarify that the datasets used in our experiments, i.e., FedAya, Fed-ChatbotIT, and Fed-WildChat, are INDEED real-world datasets with data heterogeneity, as suggested by FedLLM-Bench [1]. The results (Tables 3 and 4) strongly demonstrate the effectiveness of our method for NLG tasks under severe heterogeneity.

(2) We further investigate the impact of data heterogeneity on our method for NLU tasks. The results have been included in the revision (Appendix C.6, Figure 4), indicating that the accuracy of all methods increases as $\beta$ increases, and FRLoRA significantly outperforms the other methods at different values of $\beta$. Moreover, the improvement of FRLoRA over other methods is larger when $\beta$ is small, indicating the effectiveness of FRLoRA for NLU tasks under data heterogeneity.

Comment

Q4: Add comparison with FlexLoRA.

Thanks for your suggestion. We have compared our method against FlexLoRA in the revision (Tables 1, 2, 3, and 4). The results, as presented below, show that our method consistently outperforms FlexLoRA across nine different benchmarks including both NLG and NLU tasks.

Table 1: NLU Tasks

| Method | RTE | COLA | 20NG | QNLI |
|---|---|---|---|---|
| FlexLoRA | 70.28 | 62.56 | 65.98 | 90.03 |
| FRLoRA (Ours) | 75.81 | 64.80 | 69.41 | 91.10 |

Table 2: NLG Tasks with MetaMathQA and Alpaca-GPT4

| Method | MetaMathQA (GSM8K) | MetaMathQA (Math) | MetaMathQA (Avg.) | Alpaca-GPT4 (Vicuna) | Alpaca-GPT4 (MT-1) | Alpaca-GPT4 (MT-2) | Alpaca-GPT4 (MT-Avg.) | Alpaca-GPT4 (Avg.) |
|---|---|---|---|---|---|---|---|---|
| FlexLoRA | 34.09 | 4.31 | 19.20 | 7.884 | 4.561 | 2.012 | 3.286 | 4.435 |
| FRLoRA (Ours) | 44.27 | 5.22 | 24.74 | 8.044 | 4.775 | 2.481 | 3.635 | 4.733 |

Table 3: Real-world NLG Tasks with Fed-Aya

| Method | ar | en | es | fr | pt | ru | te | zh | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| FlexLoRA | 2.60 | 8.20 | 6.25 | 5.05 | 4.70 | 5.20 | 1.85 | 4.75 | 4.70 |
| FRLoRA (Ours) | 4.45 | 7.75 | 6.15 | 6.65 | 4.75 | 6.25 | 1.55 | 6.95 | 5.56 |

Table 4: Real-world NLG Tasks with Fed-ChatbotIT and Fed-WildChat

| Method | Fed-ChatbotIT (MT-1) | Fed-ChatbotIT (Vicuna) | Fed-ChatbotIT (Ref-GPT4) | Fed-ChatbotIT (Avg.) | Fed-WildChat (MT-1) | Fed-WildChat (Vicuna) | Fed-WildChat (Ref-GPT4) | Fed-WildChat (Avg.) |
|---|---|---|---|---|---|---|---|---|
| FlexLoRA | 4.17 | 7.02 | 5.40 | 5.53 | 4.88 | 7.91 | 5.78 | 6.19 |
| FRLoRA (Ours) | 4.31 | 7.49 | 5.62 | 5.80 | 4.64 | 8.24 | 7.00 | 6.63 |

Q5: Does FRLoRA fully address the intrinsic challenge?

We answer this question as follows:

(1) We agree that FRLoRA does not fully address the constrained optimization space at the local level. In this work, FRLoRA mainly focuses on addressing the intrinsic challenges from the global perspective. It expands the parameter space of global updates to a higher-rank space, allowing for a better representation of the diverse knowledge learned from different clients. Extensive experiments demonstrate its effectiveness.

(2) Since the global model is shared by all clients, we believe that local optimization can also benefit from the global expansion of the parameter space. We will explore this aspect in our future work.

Comment

Q6: Catastrophic forgetting at low ranks.

To directly measure catastrophic forgetting at low ranks, we conducted an exploratory experiment, i.e., fine-tuning CLIP/ViT-B32 on CIFAR-10 using FedAvg and FRLoRA. Specifically, the CIFAR-10 training set was divided into 10 clients based on a Dirichlet distribution with $\beta = 0.5$. The model was optimized using AdamW with a learning rate of 5e-5 and a batch size of 64. The fine-tuned models were evaluated on the CIFAR-100 test set in a zero-shot manner, and their performance on the CIFAR-10 test set was also reported. To measure the fine-tuning performance and catastrophic forgetting, we also report the zero-shot accuracy of the original CLIP/ViT-B32 on the test sets of CIFAR-10 and CIFAR-100. The results are presented below:

| Method | CIFAR-10 | CIFAR-100 |
|---|---|---|
| CLIP/ViT-B32 (Zero-shot) | 84.75 | 56.07 |
| FedAvg | 87.49 | 38.88 |
| FRLoRA | 94.72 | 40.32 |

It can be observed that fine-tuning LoRA on downstream tasks leads to forgetting. However, compared to FedAvg, FRLoRA achieves better zero-shot accuracy on CIFAR-100: FRLoRA (40.32) vs. FedAvg (38.88). This indicates that FRLoRA's initialization approach can partially mitigate the forgetting issue, as it ensures that local training is consistently performed in the principal singular space of the pre-trained model. Furthermore, this work primarily aims to achieve effective federated fine-tuning rather than to address forgetting. Our focus lies in improving model performance on downstream tasks, thereby facilitating the practical application of LLMs in real-world scenarios. The extensive experimental results presented in this paper, including the findings on CIFAR-10, provide strong evidence that our method achieves this objective.
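For reference, the Dirichlet-based split mentioned above is commonly implemented by drawing per-class client proportions from a Dirichlet($\beta$) distribution; the following is our illustrative sketch, not the authors' code.

```python
# Hedged sketch: partition sample indices across clients with per-class
# Dirichlet(beta) proportions, a common way to simulate label heterogeneity.
import numpy as np

def dirichlet_partition(labels, num_clients, beta, seed=0):
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        proportions = rng.dirichlet(np.full(num_clients, beta))
        splits = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, splits)):
            client_indices[client].extend(part.tolist())
    return client_indices
```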

Q7: Discussion and comparison with Chain of LoRA’s (COLA) residual framework.

(1) Although FRLoRA and COLA share a similar idea, there are differences in their residual frameworks. COLA directly merges the updated low-rank matrices $\boldsymbol{B}_G^t\boldsymbol{A}_G^t$ back into the model in each round, whereas FRLoRA only uses the residual updates $\boldsymbol{B}_G^t\boldsymbol{A}_G^t - \boldsymbol{B}_G^0\boldsymbol{A}_G^0$ to update the parameters of the model.

(2) If we apply the residual framework of COLA in FL with FRLoRA's SVD-based initialization, then Eq.(9) should be rewritten as: $\Delta\boldsymbol{W}^{t} = \boldsymbol{B}_G^t\boldsymbol{A}_G^t$

And after the global update, we need to decompose the new weights of the global model into a low-rank structure using Eqs.(5)-(7), and use them to initialize the low-rank matrices. Actually, this is FRLoRA-v3, a variant in our ablation study (Section 4.4). We present the results as follows.

| Method | NLU (RTE) | NLU (20NG) | MetaMathQA (GSM8K) | MetaMathQA (Math) | MetaMathQA (Avg.) | Fed-WildChat (MT-1) | Fed-WildChat (Vicuna) | Fed-WildChat (Ref-GPT4) | Fed-WildChat (Avg.) |
|---|---|---|---|---|---|---|---|---|---|
| FRLoRA + COLA | 58.62 | 63.19 | 39.08 | 5.01 | 22.04 | 4.43 | 8.03 | 6.34 | 6.26 |
| FRLoRA (Ours) | 75.81 | 69.41 | 44.27 | 5.22 | 24.74 | 4.64 | 8.24 | 7.00 | 6.63 |

We can observe that its results are significantly worse than ours, primarily because the frequent SVD decompositions cause oscillations in the parameter space, which leads to unstable convergence. Additionally, it also incurs extra training time for large models like LLaMA-2-7B, as shown below.

| Model | Time (s) |
|---|---|
| LLaMA-2-7B | 197.82 × T |

References

[1] Ye, Rui, et al. "FedLLM-Bench: Realistic Benchmarks for Federated Learning of Large Language Models." arXiv preprint arXiv:2406.04845 (2024).

Comment

Thanks for the further analysis. As most of my concerns have been addressed, I increase my score from 6 to 7.

Comment

Thanks for your valuable time to respond to our feedback!

We are very happy to see that your concerns have been fully addressed :)

Review (Rating: 6)

The paper presents Federated Residual Low-Rank Adaptation (FRLoRA), a federated learning method that improves LoRA for LLMs. FRLoRA overcomes challenges like constrained parameter spaces and client drift by updating in a higher-rank parameter space and reinitializing local low-rank matrices based on the principal singular values of pre-trained weights. Experiments show that FRLoRA consistently outperforms existing federated learning methods across benchmarks in NLU and NLG tasks.

Strengths

  1. The paper introduces a unique adaptation of LoRA for federated learning with residual low-rank updates and reinitialization in the principal singular space.

  2. Extensive experiments on multiple benchmarks confirm FRLoRA’s robust improvements across NLU and NLG tasks in various federated settings.

Weaknesses

  1. The reinitialization of local low-rank matrices and the use of SVD may introduce significant computational overhead, particularly for large-scale models. An in-depth analysis of this aspect or potential optimizations would provide added value to the paper.

  2. Although the method shows performance improvements overall, in Table 3, the proposed approach does not appear to dominate across all evaluations. A similar pattern is observed in Table 4, indicating that while the method is strong, it may not consistently outperform existing baselines in certain subcases.

  3. While the paper briefly touches on SVD stability, it lacks a detailed analysis of potential impacts and does not explore alternative methods for scenarios where SVD might lead to convergence issues. Additional insights on this aspect could reinforce the robustness and general applicability of the approach.

Questions

Q1: Could the authors provide further analysis on the computational impact of using SVD for reinitialization in terms of time and memory requirements?

Q2: Could the authors provide further analysis on the behaviour in Tables 3 and 4?

Q3: How does FRLoRA handle scenarios where singular value decomposition may be unstable? Are there alternative initialization strategies?

Comment

We thank the reviewer for his/her constructive comments and provide our point-wise replies as follows.

Q1: Time and memory of SVD computation.

We have conducted an analysis of the time and memory requirements for the SVD computation. The results are summarized below:

| Model | Time (s) | Peak Memory (MB) |
|---|---|---|
| RoBERTa-base | 1.48 | 0.25 |
| LLaMA-2-7B | 197.82 | 0.65 |

As shown, the computational cost of SVD is minimal in terms of memory, with peak memory consumption remaining under 1 MB for both models. While the time cost scales with model size, SVD is performed only ONCE, rendering its overall impact negligible compared to the whole training phase. The results have been included in the revision (Appendix C.4 Table 15).
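As a point of reference, a one-off timing of this decomposition can be taken as follows (our sketch; the exact measurement setup behind Table 15 is not described in the thread):

```python
# Illustrative timing of the one-off SVD used for initialization.
import time
import torch

def time_svd(weight: torch.Tensor) -> float:
    start = time.perf_counter()
    torch.linalg.svd(weight.float(), full_matrices=False)
    return time.perf_counter() - start

# Summing time_svd(w) over every LoRA-adapted weight matrix gives the total
# initialization time for a given model.
```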

Q2: Further analysis on Tables 3 and 4.

Unlike NLU tasks, NLG tasks involve different datasets for training and testing. For instance, in Table 4, the model is trained on Fed-ChatbotIT but evaluated on three benchmarks: MT-Bench, Vicuna, and Ref-GPT4. Due to variations in data distributions or task-specific characteristics across benchmarks, certain subtask metrics may not achieve the best performance. However, the GOAL of fine-tuned LLMs is to achieve generalized performance across diverse downstream tasks. Consequently, average task performance holds greater importance than local advantages in individual subtasks, especially considering that real-world data distributions are typically heterogeneous and uncontrollable. FRLoRA's performance aligns MORE closely with these practical requirements.

Q3: Discussion about SVD.

We answer your question as follows:

(1) In FRLoRA, SVD is employed ONLY during the initialization phase to decompose the pre-trained weights into a low-rank structure. In our experiments across NINE different benchmarks, SVD has been empirically proven to be stable when applied to these weight matrices, as they are generally well-conditioned. Additionally, SVD is performed only ONCE during the initialization, which minimizes the risk of instability.

(2) For extreme cases, such as with ill-conditioned or noisy weight matrices, SVD may become unstable. Although such scenarios are RARE, we can replace standard SVD with randomized SVD or matrix sketching techniques, which provide more stable decompositions in challenging conditions. Furthermore, incorporating regularization methods like Tikhonov regularization or performing truncated SVD can help stabilize the decomposition process when facing instability. These alternative strategies offer flexibility in handling unstable SVD scenarios, ensuring robust and stable initialization in extreme cases.
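As one concrete example of the alternatives mentioned in (2), a randomized truncated decomposition is available directly in PyTorch; the sketch below is illustrative, and the oversampling/iteration parameters are our choices rather than anything specified in the paper.

```python
# Illustrative randomized low-rank factorization as a drop-in alternative to a
# full SVD when initializing the low-rank factors and the residual weight.
import torch

def randomized_lowrank_init(weight: torch.Tensor, r: int):
    U, S, V = torch.svd_lowrank(weight.float(), q=r, niter=4)  # approximate top-r factors
    B0 = U * S.sqrt()                       # d1 x r
    A0 = S.sqrt().unsqueeze(1) * V.T        # r x d2
    residual = weight.float() - B0 @ A0     # frozen residual part
    return A0, B0, residual
```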

Comment

I appreciate the dedicated work that the author invested in the rebuttal. My concerns have been addressed, and I will keep my positive rating.

Comment

Thanks for your valuable time to respond to our feedback!

We are very happy to see that your concerns have been fully addressed :)

Review (Rating: 6)

This paper introduces a method known as FRLoRA (Federated Residual Low-Rank Adaptation) to address the problem of Parameter-Efficient Fine-Tuning (PEFT) for Large Language Models (LLMs) in Federated Learning (FL) scenarios. FRLoRA updates global model parameters by introducing a residual low-rank matrix product and reinitializes local low-rank matrices in each training round to mitigate client drift. The paper compares FRLoRA with existing FL baseline methods across multiple datasets in natural language understanding and generation tasks, demonstrating its consistent superiority and validating the effectiveness of the proposed FRLoRA.

Strengths

  1. FRLoRA effectively addresses the significant impact of data heterogeneity on PEFT in FL scenarios by accumulating the product of residual low-rank matrices, enabling the global model to learn more comprehensive knowledge.
  2. By reinitializing local low-rank matrices at each training round, FRLoRA alleviates client drift issues, enhancing the convergence and performance of the model.
  3. Extensive experiments validate the performance improvements of FRLoRA over existing methods across various natural language processing tasks.

Weaknesses

  1. The updates derived from averaging the uploaded matrices A and B at the server differ from the averaged updates across clients, and this discrepancy is likely to amplify in heterogeneous data scenarios. [Improving LoRA in Privacy-preserving Federated Learning]
  2. Although the paper proposes extending the parameter space through residual updates, it does not clearly specify whether the rank of the model parameters is strictly improved after each residual update, lacking further explanation on how this mechanism operates, particularly regarding its impact on model rank and representational capacity.

Questions

See Weaknesses

Comment

We thank the reviewer for his/her constructive comments and provide our point-wise replies as follows.

Q1: Discrepancy between the uploaded matrices and the averaged updates.

We have conducted an exploratory experiment by fixing the $\boldsymbol{A}$ matrix in FRLoRA, with the results presented below.

| Method | NLU (RTE) | NLU (20NG) | MetaMathQA (GSM8K) | MetaMathQA (Math) | MetaMathQA (Avg.) | Fed-WildChat (MT-1) | Fed-WildChat (Vicuna) | Fed-WildChat (Ref-GPT4) | Fed-WildChat (Avg.) |
|---|---|---|---|---|---|---|---|---|---|
| FFA-LoRA | 68.69 | 66.88 | 28.05 | 3.78 | 15.91 | 4.81 | 7.99 | 5.88 | 6.22 |
| FRLoRA + FFA-LoRA | 72.06 | 67.58 | 34.85 | 4.79 | 19.82 | 4.52 | 7.87 | 5.98 | 6.12 |
| FRLoRA (Ours) | 75.81 | 69.41 | 44.27 | 5.22 | 24.74 | 4.64 | 8.24 | 7.00 | 6.63 |
FRLoRA (Ours)75.8169.4144.275.2224.744.648.247.006.63

While this variant outperforms FFA-LoRA, its performance is inferior to ours, as fixing $\boldsymbol{A}$ greatly restricts the learning capacity. This finding shows that while averaging the uploaded matrices $\boldsymbol{A}$ and $\boldsymbol{B}$ at the server differs from the averaged updates across clients, it can still serve as a suitable tradeoff between discrepancy and performance in contrast to FFA-LoRA.
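For clarity, the variant above simply keeps all LoRA $\boldsymbol{A}$ factors frozen during local training, e.g. (the parameter naming follows common PEFT conventions and is our assumption, not the authors' code):

```python
# Freeze every LoRA "A" factor so that only the "B" factors are trained locally,
# mimicking the FFA-LoRA-style variant discussed above.
def freeze_lora_A(model):
    for name, param in model.named_parameters():
        if "lora_A" in name:
            param.requires_grad_(False)
```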

Q2: Further explanation on the residual update mechanism.

We explain this from the following two aspects:

(1) In our method, the rank of the model parameters is NOT strictly improved after each residual update. Instead, the iterative accumulative effect of these residual updates (Eq.(13)), when applied to the global model’s parameters, expands the parameter space of global updates to a higher-rank space. This mechanism inherently enhances the model’s representational capacity, enabling a better representation of the diverse knowledge learned from different clients.

(2) Besides, we have provided results with different values of $r$ in our submitted paper (Table 10). The results are presented below:

| Method | GSM8K (r=16) | Math (r=16) | Avg. (r=16) | GSM8K (r=32) | Math (r=32) | Avg. (r=32) | GSM8K (r=64) | Math (r=64) | Avg. (r=64) |
|---|---|---|---|---|---|---|---|---|---|
| FedAvg | 32.67 | 4.64 | 18.65 | 34.95 | 4.48 | 19.71 | 37.45 | 5.38 | 21.41 |
| FedProx | 32.29 | 4.32 | 18.30 | 35.40 | 4.66 | 20.03 | 36.39 | 4.98 | 20.68 |
| SCAFFOLD | 32.97 | 4.70 | 18.84 | 35.78 | 5.08 | 20.43 | 32.37 | 4.64 | 18.50 |
| FedAvgM | 32.44 | 4.42 | 18.43 | 34.79 | 4.64 | 19.71 | 35.57 | 4.72 | 20.14 |
| FedAdagrad | 28.65 | 4.18 | 16.41 | 29.64 | 4.06 | 16.85 | 31.76 | 4.46 | 18.31 |
| FedYogi | 30.32 | 4.00 | 17.16 | 30.09 | 4.04 | 17.06 | 33.96 | 4.40 | 19.33 |
| FedAdam | 31.23 | 4.14 | 17.68 | 31.84 | 4.12 | 17.98 | 34.26 | 5.16 | 39.44 |
| FFA-LoRA | 25.17 | 3.60 | 14.38 | 28.05 | 3.78 | 15.91 | 31.00 | 4.50 | 17.75 |
| FRLoRA (Ours) | 39.57 | 5.60 | 22.58 | 44.27 | 5.22 | 24.74 | 45.56 | 6.88 | 26.22 |

As shown, FRLoRA ($r=16$) outperforms FedAvg ($r=16$) and even achieves higher performance than FedAvg ($r=64$). Besides, we observed that when training FRLoRA ($r=16$) on MetaMathQA after 100 rounds, the average rank of residual accumulation at each layer reached 82. These results strongly demonstrate the effectiveness of our method in expanding the parameter space of global updates.

Comment

As the discussion period draws to a close soon, we extend our sincere gratitude to you for the valuable time and insightful comments.

In our previous response, we have carefully studied your comments and made detailed responses summarized below:

  1. Conducted an empirical analysis of the discrepancy between the uploaded matrices and the averaged updates.
  2. Provided further explanation on the residual update mechanism.

We sincerely hope our responses have effectively addressed your concerns. If you have any remaining questions or require further clarification, please do not hesitate to let us know, and we would be glad to provide further explanations.

Thank you again for your efforts in reviewing our work.

Comment

Thanks for your valuable comment. We kindly wanted to follow up to ask if our responses have satisfactorily resolved your concerns.

If you have any remaining questions or require further clarification, please do not hesitate to let us know, and we would be glad to provide further explanations.

Thank you again for your efforts in reviewing our work.

Comment

Thank you for your response. My concerns have been addressed and I raised my rating.

Comment

Thanks for your valuable time to respond to our feedback!

We are very happy to see that your concerns have been fully addressed :)

Review (Rating: 5)

This paper introduces FRLoRA, a novel Federated Learning (FL) method that addresses limitations in global knowledge learning by combining residual low-rank adaptation with periodic recalibration of local models. FRLoRA expands the effective parameter space during global updates and mitigates client drift by reinitializing local model components with principal components from the pre-trained model. Extensive experiments across various natural language processing tasks demonstrate FRLoRA's superior performance compared to existing FL methods.

Strengths

The paper is clearly presented. The proposed method is sound. Experiment results are comprehensive and validate the quality improvement of the proposed method.

Weaknesses

This paper presents a novel Federated Learning method, FRLoRA, which aims to improve global knowledge learning. However, the paper's practical value is limited due to two main weaknesses:

Lack of a compelling use case: The authors fail to clearly motivate the need for their proposed method within a real-world application scenario. While the technical contributions may be sound, the lack of a clear practical context diminishes the paper's impact and relevance.

High communication cost: Despite claims of efficiency, the method introduces significant communication overhead, hindering its practicality in real-world deployments, which many LoRA users care about.

Questions

Can the authors further explore and demonstrate the method's effectiveness in a specific application domain and address the communication cost limitations to enhance its practical value?

Comment

We thank the reviewer for his/her constructive comments and provide our point-wise replies as follows.

Q1: Lack of a compelling real-world use case.

We would like to clarify that the datasets used in our experiments, i.e., FedAya, Fed-ChatbotIT, and Fed-WildChat, are INDEED derived from real-world scenarios and suggested by FedLLM-Bench [1]. Their results in Tables 3 and 4 demonstrate consistently strong performance of FRLoRA across different real-world NLG tasks, providing both a compelling use case and solid empirical evidence of our method's effectiveness in real-world application scenarios.

Q2: High communication cost.

The communication cost analysis presented in the submitted paper only shows the total communication cost (Cost.Total) in Setting 2. This may lack key information and be misleading about the communication overhead of our method. In the revision (Appendix C.3, Table 14), we have provided a more detailed communication cost analysis, as shown below.

| Method | Cost.Active (MB) | Cost.Inactive (MB) | Cost.Total (MB) | Vicuna | MT-1 | MT-2 | MT-Avg | Avg. |
|---|---|---|---|---|---|---|---|---|
| Setting 1: Full Participation | | | | | | | | |
| FFT-based FL | 13476 × 20 | 0 | 269520 | - | - | - | - | - |
| FedAvg | 8.388 × 20 | 0 | 167.76 | 8.039 | 4.833 | 2.458 | 3.645 | 4.743 |
| FRLoRA (Ours) | 8.388 × 20 | 0 | 167.76 | 8.125 | 4.984 | 2.806 | 3.895 | 4.952 |
| Setting 2: Partial Participation | | | | | | | | |
| FFT-based FL | 13476 × 2 | 0 | 26952 | - | - | - | - | - |
| FedAvg | 8.388 × 2 | 0 | 16.77 | 7.925 | 4.650 | 2.025 | 3.346 | 4.486 |
| FRLoRA (Ours) | 8.388 × 2 | 4.194 × 18 | 92.26 | 8.044 | 4.775 | 2.481 | 3.635 | 4.733 |

(1) In Setting 1, where all clients are involved in training every round, FRLoRA incurs NO additional overhead, i.e., FRLoRA (167.76 MB) vs. FedAvg (167.76 MB).

(2) In Setting 2, where only a subset of clients participates in training every round, FRLoRA adds only 75.49MB of total overhead (Cost.Inactive), all of which stems from parameter synchronization for inactive clients. Importantly, FRLoRA indeed does NOT increase the communication cost per client (Only 4.194 MB for inactive clients). As the communication channels between each client and the server are independent and parallel, Cost.Inactive does NOT affect overall communication efficiency.

(3) Besides, we further present the performance in Setting 1, where FRLoRA still achieves better performance than FedAvg, indicating that the performance improvement in Setting 2 is NOT attributable to Cost.Inactive.
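For intuition, the per-client, per-round payload discussed in (2) is simply the size of the transmitted LoRA factors; a rough estimate can be computed as follows (the layer count, rank, adapted modules, and precision below are illustrative assumptions, not an attempt to reproduce the exact 8.388 MB figure, which depends on the adapted modules and numeric precision used).

```python
# Rough estimate of the LoRA payload a single client uploads per round.
def lora_payload_mb(num_layers, d_model, rank, adapted_modules_per_layer=2,
                    bytes_per_param=4):
    params_per_module = 2 * rank * d_model          # A (r x d) plus B (d x r)
    total_params = num_layers * adapted_modules_per_layer * params_per_module
    return total_params * bytes_per_param / 1e6     # MB (decimal convention)

print(lora_payload_mb(num_layers=32, d_model=4096, rank=16))  # ~33.55 under these assumptions
```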

References

[1] Ye, Rui, et al. "FedLLM-Bench: Realistic Benchmarks for Federated Learning of Large Language Models." arXiv preprint arXiv:2406.04845 (2024).

Comment

As the discussion period draws to a close soon, we extend our sincere gratitude to you for the valuable time and insightful comments.

In our previous response, we have carefully studied your comments and made detailed responses summarized below:

  1. Clarified that the datasets we used include real-world use cases.
  2. Provided a detailed analysis of communication costs to comprehensively demonstrate the communication efficiency of our method.

We sincerely hope our responses have effectively addressed your concerns. If you have any remaining questions or require further clarification, please do not hesitate to let us know, and we would be glad to provide further explanations.

Thank you again for your efforts in reviewing our work.

Comment

Thanks for your valuable comment. We kindly wanted to follow up to ask if our responses have satisfactorily resolved your concerns.

If you have any remaining questions or require further clarification, please do not hesitate to let us know, and we would be glad to provide further explanations.

Thank you again for your efforts in reviewing our work.

Comment

Thanks for your valuable comment. As the discussion period draws to a close soon, we kindly wanted to follow up to ask if our responses have satisfactorily resolved your concerns.

If you have any remaining questions or require further clarification, please do not hesitate to let us know, and we would be glad to provide further explanations.

Thank you sincerely for your time and effort in reviewing our work.

Comment

Dear reviewers and meta-reviewers,

We appreciate all reviewers for their valuable comments and suggestions. We've revised our manuscript based on reviewers' comments as follows:

  1. For Reviewer RHQ7, we have revised the analysis of communication costs, which can now be found in Appendix C.3, Table 14 due to page limitations.

  2. For Reviewer vtoo, we have included an analysis of SVD computation in Appendix C.4, Table 15.

  3. For Reviewer Acqq, we have added a clarification in Lines 1073-1074.

  4. For Reviewer Acqq, we have conducted an analysis of client drift by cosine similarity in Appendix C.5, Figure 3.

  5. For Reviewer Acqq, we have explored the impact of data heterogeneity on our method for NLU tasks in Appendix C.6, Figure 4.

  6. For Reviewer Acqq, we have added the results of FlexLoRA in Tables 1, 2, 3, and 4.

  7. For Reviewers Acqq and Ddf1, we have added an analysis of the impact of FRLoRA's initialization in Appendix C.7, Figure 5.

The changes have been highlighted in blue in the revised paper. Please see below for our responses to each reviewer. If you have any further questions or suggestions, please feel free to share them on OpenReview.

AC Meta-Review

This paper introduces FRLoRA, a novel approach designed to overcome the challenges of applying LoRA in a Federated Learning (FL) environment for acquiring global knowledge. Specifically, FRLoRA tackles the issues of "extrinsic client drift" and "intrinsic constrained update space." Extensive experiments across nine benchmarks demonstrate FRLoRA's effectiveness in achieving strong performance while minimizing communication costs. To further enhance the paper, please include the reviewers' suggestions especially detailed analysis of the proposed method in the revised version.

Additional Comments from the Reviewer Discussion

Reviewers have requested a more in-depth analysis of the proposed algorithm including communication cost, SVD computation cost, impact over client drift, impact of FRLoRA's initialization, etc. The authors have addressed these points by providing additional experimental results and offering more detailed explanations within the revised manuscript.

Final Decision

Accept (Poster)

Public Comment

I would like to bring to the authors' attention that this work cites an arXiv preprint (https://arxiv.org/abs/2407.20557) which has been removed due to plagiarism. The original work can be found on Google Scholar: https://scholar.google.com/scholar?cluster=577696854221514582&hl=en&as_sdt=0,47. Although ICLR 2025 has already passed, I am writing in the hope that the authors can request an update to the camera-ready and that the program committee allows for a revision.