PaperHub
4.9 / 10
Rejected · 4 reviewers
Ratings: min 1, max 4, std 1.1
Individual ratings: 3, 1, 3, 4
ICML 2025

Communication Efficient Federated Learning via Model-Agnostic Projection Adaptation

OpenReview · PDF
Submitted: 2025-01-19 · Updated: 2025-06-18

Abstract

Keywords

Federated Learning · Low-Rank Adaptation · Communication Efficiency · Subspace Optimization

Reviews and Discussion

Review (Rating: 3)

The authors propose a new method called MAPA for parameter-efficient federated fine-tuning. The main advantages of MAPA are that it does not depend on the architecture, unlike other LoRA-based methods, and it reduces the computational and memory costs while providing a better or comparable performance. The authors provide a convergence guarantee of the proposed method under common assumptions with a decaying learning rate. The experimental results show the superiority of the proposed method.

Questions for the Authors

  1. I have found the propositions and the same reconstruction error with a smaller number of parameters solid. However, I question if it is practical with first-order methods, i.e., SGD, etc. What I mean is that I agree that the reconstruction error given in Defn 3.3 holds for the best A and B parameter set. However, it might be the case that we cannot find A and B with just gradient-based optimization (what actually happens in our setting). In that case, one might claim that having more parameters may find better A and B with a practically lower reconstruction error. Do authors have an explanation for this or an experimental result supporting it?
  2. Shouldn't Theorem 4.3 be a high-probability bound due to the JL lemma? If so, I would suggest stating that the bound holds with high probability (e.g., with probability at least $1 - T\epsilon$).
  3. In experiments, how are these models pre-trained? Since we compare fine-tuning techniques, sharing how the models are actually pre-trained is important.

Claims and Evidence

I think the authors may want to clarify their claim about architecture independence; I couldn't fully follow it. At first, I thought their method was applicable to all types of neural-net layers (CNNs, etc.), unlike LoRA, which applies only to linear layers. If this is the case (I can see that you use some CNN architectures in the experiments), I wonder how it is happening.

Methods and Evaluation Criteria

Yes, it looks good overall. My questions are in the weaknesses and questions section.

Theoretical Claims

I didn't follow the proofs of the Propositions, but they seem to make sense. I skimmed the proof of the convergence analysis; it seemed reasonable to me.

Experimental Design and Analysis

The experiments sound good overall. I have a few questions, which can be found in the questions and weaknesses section.

Supplementary Material

I skimmed the theoretical proofs; they seem reasonable. I skimmed the other parts as well.

Relation to Existing Literature

The proposed method seems to improve the federated fine-tuning in communication efficiency while providing better or comparable performance. Compared to the previous literature, I have found the explanations and theoretical analysis good. The experimental results are superior to the selected baselines.

Missing Important References

I think one missing piece in the intuition behind the single reshaped-matrix update/factorization idea is a discussion of other work using it in a centralized setting. If this has not been done before in a centralized setting, then the authors' contribution may apply to an even wider setting, i.e., the centralized one. If similar ideas exist in the centralized LoRA literature, mentioning them in the related work would be good. Please correct me if I am mistaken.

Other Strengths and Weaknesses

  • Strengths:
  1. The paper is written very clearly. I appreciate the authors' presentation.
  2. The proposed method's intuition is well-explained.
  3. The theoretical guarantee is a plus considering many works in the literature lack it.
  4. The experiments show the superiority of the proposed method compared to the baselines in terms of communication efficiency and training quality.
  • Weaknesses:
  1. I don't fully understand what is wrong with layer-by-layer separation (let's say we select some fixed $k$ for all layers) in terms of architecture dependence. Couldn't the proposed method (first reshaping the matrix and separating A and B) be applied layer by layer for any architecture type, following a similar approach? Do the authors have an ablation experiment where they apply their technique layer by layer to the models and compare with the current version?
  2. The model backbone is updated at every round, as in Eq. (1). This may create the following problem: what makes LoRA advantageous is that at the end of fine-tuning, we have a small number of parameters that can easily be merged with or separated from the original model. Here, if we update the model backbone and initialize different A matrices at every iteration, we will not have a low-rank representation of the fine-tuned part in the end. Yes, we can separate the fine-tuned part ($\sum_t A_t \bar{B}_t$), but it will take memory of size $d$, unlike the low storage cost of LoRA parameters.
  3. There are many newer techniques in Federated LoRA literature. I would expect a comparison with a few more recent and solid FL LoRA baselines.
  4. I think the forward pass of the proposed method should be slower than in the other LoRA methods. In the proposed one, $\Delta W_i$ in every layer $i$ is a full-dimensional matrix. However, in the LoRA versions, it is split so as to have $2 \times d \times q$ parameters instead of a large $d^2$ (here, $d$ and $q$ represent the full and LoRA dimensions within a layer). Can the authors elaborate on this?

Other Comments or Suggestions

  1. In Fig. 2, why are the methods compared in a centralized setting? To my knowledge, for example, FFA-LoRA ([SLLD'24] in ICLR 2024) also solves the exact aggregation problem (given updates $A_i$, $B_i$ from clients, how to aggregate them is a question), which is specific to the federated setting.
  2. I think there is a typo in the proof where $e_t^i$ is defined. I guess it should be just the negation of the written expression.
Author Response

1. Architecture independence

You are correct. Our approach flattens the gradients and factorizes them in a matrix form rather than directly factorizing the parameters. This gradient-based factorization is independent of specific architectural details, making it applicable to any model.

2. Centralized setting

We discussed this issue in response 5 to reviewer NJUL and presented the experimental results in response 1 to reviewer FKCh.

3. Layer-wise factorization vs. Entire model factorization

While layer-wise factorization is possible, it introduces the following practical issues:

  1. Memory Overhead: To reduce model communication $k$-fold, global factorization of $W^{d}$ results in $W^{k \times d/k} = A^{k \times 1} B^{1 \times d/k}$, while layer-wise factorization of an $n$-layer model, given layer $i$'s parameters $W_i^{d_i}$, yields $W_i^{k \times d_i/k} = A_i^{k \times 1} B_i^{1 \times d_i/k}$ and requires storing $n$ separate $A_i$, increasing memory overhead $n$-fold (a toy parameter-count sketch follows this list).

  2. Architecture Constraints: In global model factorization, we can essentially choose any arbitrary compression rate. In contrast, layer-wise factorization faces limitations, as choosing $k$ larger than the layer size is impossible. As shown experimentally in response 3 to reviewer FKCh, global factorization enables effective fine-tuning even at 10k-fold compression for a model with 357M parameters, whereas individually compressing a 1024-parameter layer 10k-fold is not possible.

  3. Suboptimal Performance: Lastly, global factorization outperforms layer-wise factorization, since the layer-wise variant allocates roughly equal expression budgets $d_i/k$ across layers regardless of their gradient magnitudes. This leads to suboptimal communication budgeting, unlike global factorization, which allocates a higher expressivity budget to high-magnitude gradients and higher compression to less informative ones.
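
To make the memory-overhead comparison in item 1 concrete, below is a minimal Python sketch that counts factor entries under assumed layer sizes and an assumed reshaping factor $k$; the sizes and names are illustrative assumptions, not the paper's actual configuration.

```python
# Toy parameter-count comparison for the memory-overhead point in item 1 above.
# Layer sizes and k are hypothetical, for illustration only.
layer_sizes = [1_000_000, 4_000_000, 2_000_000]   # d_i per layer (assumed)
d = sum(layer_sizes)                              # total model size
k = 1_000                                         # reshaping / compression factor

# Global factorization: one A (k x 1) and one B (1 x d/k) for the whole model.
global_factors = k * 1 + 1 * (d // k)

# Layer-wise factorization: each layer keeps its own A_i (k x 1) and B_i (1 x d_i/k).
layerwise_factors = sum(k * 1 + 1 * (d_i // k) for d_i in layer_sizes)

print(f"global     : {global_factors:,} factor entries (one A stored)")
print(f"layer-wise : {layerwise_factors:,} factor entries ({len(layer_sizes)} separate A_i stored)")
```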

Additional experiments on QNLI and SST2 illustrate this suboptimality:

| Model | SST2 Acc | Round@80% | QNLI Acc | Round@80% |
|---|---|---|---|---|
| Layered$_{1k}$ | 95.94 | 10 | 88.98 | 18 |
| MAPA$_{1k}$ | 96.41 | 7 | 92.58 | 15 |
| Layered$_{10k}$ | 92.27 | 12 | 82.42 | 29 |
| MAPA$_{10k}$ | 94.92 | 9 | 90.86 | 19 |

4. Backbone is updated

You are correct. Unlike PEFT methods such as LoRA, MAPA does not inherently reduce the number of model parameters, as the full model backbone is updated at each communication round. Instead, MAPA primarily reduces gradient communication overhead in FL, as minimizing communication overhead is typically more critical than reducing parameter storage in this setting.

5. Comparison with more recent baselines

Many FL LoRA studies address broader challenges than communication. We used FA-LoRA as a central baseline due to its communication focus and conceptual similarity. Per your suggestion, we added LoRA and SA-LoRA to our baselines for the LLM fine-tuning comparisons (see response 3 to reviewer FKCh).

6. Forward pass cost

LoRA adds computational overhead in the forward pass by including additional low-rank adaptation layers. The computation $y = Wx + BAx$ incurs complexity:

  • Frozen parameters: $O(d^2)$
  • LoRA layers: two multiplications, $Ax$ and $B(Ax)$: $O(2dq)$

Thus, LoRA’s total forward-pass complexity is $O(d^2 + 2dq)$. In contrast, MAPA applies low-rank factorization only during the backward pass, leaving the forward-pass complexity unchanged at $O(d^2)$.
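
As a rough illustration of the complexity argument above, the snippet below counts per-layer multiply-adds for assumed sizes $d$ and $q$ (illustrative values, not the paper's).

```python
# Rough per-layer multiply-add counts for the complexities stated above.
# d (layer width) and q (LoRA rank) are illustrative assumptions.
d, q = 1024, 8

flops_plain = d * d               # y = Wx
flops_lora  = d * d + 2 * d * q   # y = Wx + B(Ax): Ax and B(Ax) each cost d*q
flops_mapa  = d * d               # MAPA's forward pass is unchanged

print(f"plain / MAPA forward: {flops_plain:,} MACs")
print(f"LoRA forward        : {flops_lora:,} MACs (+{2 * d * q:,})")
```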

7. Why is it centralized in Fig. 2?

In Fig. 2, we used a single-client setup to isolate the intrinsic performance impact of gradient compression, independent of data heterogeneity or client sampling in FL. Thus, any differences reflect compression effectiveness alone.

8. Typo

Thank you; we fixed it.

9. As B is not the optimal point, can having more parameters in A lead to better convergence?

In MAPA, updates follow $W_t = W_{t-1} + AB$, where $A$ is fixed within each round. Therefore, computing the gradient with respect to $B$ amounts to linearly projecting $\nabla W$ onto the subspace spanned by $A$, and SGD can reliably solve this linear projection problem to find a near-optimal $B$.
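
A minimal single-round sketch of this projection view, with assumed shapes and a random $A$; all names and sizes here are illustrative, not the paper's implementation.

```python
import numpy as np

# Single-round sketch: A is a random k x 1 matrix fixed within the round, and a
# gradient step on B equals projecting the reshaped gradient onto span(A).
rng = np.random.default_rng(0)
d, k = 4096, 64                        # illustrative sizes
grad = rng.standard_normal(d)          # flattened gradient of the whole model

G = grad.reshape(k, d // k)            # reshape into a k x (d/k) matrix
A = rng.standard_normal((k, 1))        # regenerated every round
A /= np.linalg.norm(A)

B_grad = A.T @ G                       # gradient w.r.t. B: projection of G onto span(A)
delta_W = (A @ B_grad).reshape(-1)     # reconstructed low-rank update A B
```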

10. Theorem 4.3

We thank the reviewer for this insightful point. The JL lemma provides a probabilistic guarantee of low-distortion embeddings. As such, to be entirely rigorous, our convergence statement can be made a high-probability result by union-bounding the (small) failure probability $\delta$ over the $T$ rounds:

  • At each round, the distortion is $\le \epsilon$ with probability $\ge 1 - \delta$.
  • By a union bound, this holds for all $T$ rounds simultaneously with probability $\ge 1 - T\delta$.
  • Conditional on that event, the same inequalities apply, and the convergence proof remains identical. We fully agree with you and will strengthen the theorem statement via the presented arguments (a one-line formalization follows this list).
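
A one-line LaTeX sketch of the union-bound step above, with $E_t$ denoting the (assumed) event that the round-$t$ projection has distortion at most $\epsilon$:

```latex
% Union-bound sketch; E_t is the event that round t's projection distorts by at most \epsilon.
\[
\Pr\Big[\bigcap_{t=1}^{T} E_t\Big]
  = 1 - \Pr\Big[\bigcup_{t=1}^{T} E_t^{c}\Big]
  \ge 1 - \sum_{t=1}^{T} \Pr\big[E_t^{c}\big]
  \ge 1 - T\delta .
\]
```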

11. Pre-trained Models

The models in the original experiments were trained from scratch. For LLM fine-tuning, we used the HuggingFace FacebookAI/roberta-large checkpoint.

Review (Rating: 1)

The paper proposes Model-Agnostic Projection Adaptation (MAPA), an approach to reduce communication overhead in FL. MAPA improves upon existing low-rank adaptation (LoRA) methods by factorizing the entire model parameter space as a single matrix, as opposed to decomposing layers independently. This model-agnostic approach allows for flexible balancing of communication and accuracy by randomly regenerating the reconstruction matrix (one of the two matrices) in each round.

Questions for the Authors

  1. Why not use an unbiased gradient approach?
  2. Assess performance on fine-tuning of LLMs.
  3. What is really new in the algorithm?

Claims and Evidence

I have issues agreeing with the stated contributions. The authors state as their first contribution the idea of applying LoRA at the model level instead of layer by layer. The original LoRA paper already casts the idea in this way, so I don't see any difference. It is true that in the implementation they apply it layer by layer, but these are experimental and implementation details.

It is also unclear whether the convergence analysis is new. The convergence of LoRA has already been analyzed. The proposed algorithm makes one of the two matrices random, but I don't see why this requires substantially different proofs.

The proposed algorithm is efficient for small networks, but I have scalability doubts. The experiments have been conducted only on small datasets. I wonder what would happen for fine-tuning of an LLM.

Methods and Evaluation Criteria

Not quite. They should evaluate fine-tuning of LLMs. It is easy to create FL settings based on standard fine-tuning data.

Theoretical Claims

I'm unsure what is new in the convergence results and proofs. I wonder why standard SGD/FL techniques don't apply. One possible argument would be the handling of random reconstruction matrices, but this can be viewed as stochasticity in the gradient computation. It seems that the assumption of unbiased gradients would imply convergence. As a result, one only needs to show that the gradient estimators are unbiased. There is no reason to believe they are not, and the proof should not be that hard.

Experimental Design and Analysis

I have checked the main body. The issues are discussed above.

Supplementary Material

I read appendices A and B.

Relation to Existing Literature

I'm unclear about the contribution statements. The work has major overlap with the LoRA paper (and its offspring).

Missing Important References

none

Other Strengths and Weaknesses

Discussed above.

Other Comments or Suggestions

None

Author Response

1. Key contributions

Thank you for raising this important point. To clarify, there are two fundamentally different strategies for leveraging low-rank structures in optimization:

  1. Low-Rank Parameterization
  2. Low-Rank Gradient Projection

MAPA explicitly utilizes the latter strategy, whereas LoRA and its variants follow the first approach.

Why must LoRA be applied layer-wise? Low-rank parameterization methods like LoRA inherently depend on layer-wise decomposition, as the reparameterization must preserve each layer's input/output dimensions to maintain forward-pass compatibility. LoRA decomposes each layer's weight matrix $W$ individually as $h = (W + BA)x = Wx + BAx$. Treating the entire model's parameters as a single matrix violates this compatibility due to nonlinear activations and differing layer dimensions. The LoRA paper acknowledges this constraint (Hu et al., LoRA, Page 4, Section 4).

Given a model-level LoRA factorization of $W^{I \times O}$ into $A^{I \times r}$ and $B^{r \times O}$, where $I$ and $O$ are the input and output dimensions of the model and $r$ is the factorization rank, the forward pass reduces to $y = BA(x)$, which does not express any nonlinearity of the network.

In contrast, MAPA employs gradient factorization rather than parameter factorization. By applying low-rank constraints directly on the gradient instead of the parameters, MAPA reduces the gradient size while fully preserving model capacity. This approach is not constrained by layer-wise decomposition or model architecture since factorization occurs after computing gradients via standard forward/backward passes. Further discussion on gradient factorization literature appears in response 5 to reviewer NJUL.
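
For intuition, here is a minimal Python contrast between the two strategies described above; all shapes, ranks, and names are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

# (1) Low-rank parameterization (LoRA-style): each layer's forward pass is
#     reparameterized, so the decomposition must match that layer's dimensions.
d_in, d_out, r = 64, 64, 4
W = rng.standard_normal((d_out, d_in))     # frozen pre-trained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable low-rank factors
B = np.zeros((d_out, r))
x = rng.standard_normal(d_in)
h = W @ x + B @ (A @ x)                    # h = Wx + BAx, applied per layer

# (2) Low-rank gradient projection (MAPA-style): the model is left untouched;
#     the flattened gradient of the whole model is reshaped and factorized
#     after a standard forward/backward pass.
per_layer_grads = [rng.standard_normal(64 * 64), rng.standard_normal(300)]
g = np.concatenate(per_layer_grads)        # architecture-agnostic flat gradient
k = 16
g = np.pad(g, (0, (-len(g)) % k))          # pad so the reshape is even
G = g.reshape(k, -1)                       # one matrix for the entire model
A_rand = rng.standard_normal((k, 1))       # regenerated each round
B_msg = A_rand.T @ G                       # compact message sent to the server
```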

2. Theoretical contribution

Our convergence analysis extends standard federated SGD proofs [3,4] by incorporating a random projection (via the JL lemma) that introduces distortion $\epsilon$. This affects both the descent direction and the update variance. Unlike works that assume a fixed subspace or no gradient projection, we rigorously track how random, time-varying subspaces influence FL convergence. When $\epsilon = 0$, MAPA becomes FedAvg. For $\epsilon > 0$, we add a factor $(\epsilon + \beta + \epsilon\beta)$ but retain the same $\mathcal{O}(1/\sqrt{T})$ rate.
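
Schematically, under an assumed reading of the statement above (not the paper's exact theorem), the distortion enters the FedAvg-style bound only through a constant factor:

```latex
% Assumed schematic form; C collects the usual smoothness/variance constants.
\[
\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\,\big\|\nabla F(w_t)\big\|^2
  \;\le\; \big(1 + \epsilon + \beta + \epsilon\beta\big)\,\frac{C}{\sqrt{T}},
\qquad \epsilon = 0 \;\Rightarrow\; \text{the FedAvg rate is recovered.}
\]
```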

We will revise the manuscript to emphasize this distinction in our convergence analysis.

3. Fine-tuning of LLMs on larger datasets

Based on your feedback, we conducted fine-tuning experiments of RoBERTa-large on five large datasets of GLUE tasks. We evaluated MAPA, alongside LoRA, FA-LoRA, and SA-LoRA [2].

The 1st Table below compares the number of trainable parameters and communication load per round for each baseline.

The 2nd Table summarizes the results of fine-tuning, in which communication efficiency is evaluated by the number of rounds and the total communication needed to reach 80% accuracy, and the 3rd Table presents the results for centralized LLM fine-tuning.

The experiments used base code from [12], following the experimental setup and parameters from [2], for 300 FL rounds.

References are located in response 4 to reviewer Wq8k.


1st Table:

| Method | # Train Param | # Com. Param / Round |
|---|---|---|
| LoRA | 1.83M | 0.78M |
| FFA-LoRA | 1.44M | 0.39M |
| SA-LoRA | 1.83M | 0.39M |
| MAPA$_{d/1k}$ | 357M | 0.36M |
| MAPA$_{d/10k}$ | 357M | 35.70K |
| MAPA$_{d/100k}$ | 357M | 3.57K |
| MAPA$_{d/1m}$ | 357M | 357 |

2nd Table, FL fine-tuning:

| Model | SST2 Acc | SST2 Round | SST2 Total | QNLI Acc | QNLI Round | QNLI Total | RTE Acc | RTE Round | RTE Total | MNLIm Acc | MNLIm Round | MNLIm Total | MNLImm Acc | MNLImm Round | MNLImm Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LoRA | 84.86 | 36 | 28.08M | 91.72 | 85 | 66.30M | 86.62 | 180 | 140.40M | 87.41 | 86 | 67.08M | 87.34 | 82 | 63.96M |
| FA-LoRA | 94.15 | 44 | 17.16M | 91.63 | 76 | 29.64M | 57.28 | - | - | 85.92 | 76 | 29.64M | 86.46 | 213 | 83.07M |
| SA-LoRA | 95.41 | 19 | 7.41M | 91.04 | 55 | 21.45M | 70.01 | - | - | 89.44 | 29 | 11.31M | 85.49 | 126 | 49.14M |
| MAPA$_{d/1k}$ | 96.79 | 5 | 1.78M | 93.14 | 11 | 3.93M | 87.91 | 23 | 8.21M | 88.90 | 17 | 6.07M | 88.26 | 22 | 7.85M |
| MAPA$_{d/10k}$ | 96.10 | 5 | 178.50K | 92.57 | 8 | 285.60K | 89.57 | 23 | 821.10K | 88.81 | 18 | 642.60K | 87.43 | 25 | 892.50K |
| MAPA$_{d/100k}$ | 95.53 | 5 | 17.85K | 89.24 | 7 | 24.99K | 84.38 | 24 | 85.68K | 85.04 | 20 | 71.40K | 84.60 | 29 | 103.53K |
| MAPA$_{d/1m}$ | 90.37 | 7 | 2.50K | 80.09 | 34 | 12.14K | 57.04 | - | - | 72.46 | - | - | 37.76 | - | - |

3rd Table, centralized:

| Model | SST2 Acc | SST2 Round | SST2 Total | QNLI Acc | QNLI Round | QNLI Total | MNLI Acc | MNLI Round | MNLI Total |
|---|---|---|---|---|---|---|---|---|---|
| LoRA | 95.23 | 51 | 39.78M | 88.20 | 111 | 86.58M | 85.23 | 132 | 102.96M |
| FFA-LoRA | 87.50 | 48 | 18.72M | 68.05 | - | - | 86.48 | 66 | 25.74M |
| SA-LoRA | 94.69 | 110 | 42.90M | 88.20 | 111 | 43.29M | 86.02 | 62 | 24.18M |
| MAPA$_{d/1k}$ | 95.47 | 9 | 3.21M | 92.58 | 15 | 5.36M | 86.80 | 37 | 13.21M |
| MAPA$_{d/10k}$ | 94.61 | 8 | 0.28M | 90.86 | 19 | 0.68M | 85.00 | 38 | 1.36M |
| MAPA$_{d/100k}$ | 79.38 | - | - | 83.83 | 18 | 64.26K | 75.47 | - | - |
| MAPA$_{d/1m}$ | 58.52 | - | - | 56.56 | - | - | 37.81 | - | - |

Overall, it can be seen that MAPA has the potential to enhance fine-tuning performance in centralized training too.

Reviewer Comment

Thanks for providing the answers. I have no further questions and comments.

Author Comment

Thank you very much for acknowledging our response, and we are pleased that all your concerns have been addressed.

We would greatly appreciate it if you could update your score accordingly.

Best regards, Authors of Paper 2961


Edit:

Dear Reviewer FKCh,

We thank you so much for taking the time to review our paper and for your helpful suggestions.

The author-reviewer discussion period ends soon. With the time ticking, we are getting very anxious. We did our best to provide answers to the questions and concerns you raised, including conducting extra experiments. You indicated you have no further questions and comments. We thank you for your prompt response.

May we respectfully request that you reevaluate your score, unless you have further issues? We would be grateful.

Best Wishes - authors

Review (Rating: 3)

This paper aims to improve communication efficiency in federated learning by proposing a new parameter factorization method. The proposed method is evaluated on seven public datasets and shows improved performance.

Questions for the Authors

Please see the weaknesses section.

Claims and Evidence

The claims are supported by method design and experimental validations.

Methods and Evaluation Criteria

The proposed method and evaluation make sense in general but lack some comparison.

Theoretical Claims

The theoretical claims and proofs look correct.

Experimental Design and Analysis

The experimental design and analysis are sound in general.

Supplementary Material

The supplementary material provides more details and looks good.

Relation to Existing Literature

This paper contributes to the general federated learning community.

Missing Important References

A work with a similar idea needs to be discussed.

Jeong, Wonyong, and Sung Ju Hwang. "Factorized-fl: Personalized federated learning with parameter factorization & similarity matching." Advances in Neural Information Processing Systems 35 (2022): 35684-35695.

Other Strengths and Weaknesses

Strengths

  • Improving communication is an important topic in federated learning.

  • The motivation for improving the LoRA-based method is well demonstrated.

  • The proposed method shows improvements in both communication and performance.

Weaknesses

  • The proposed method approximates the updates of all layers by adjusting matrix B only, which may harm the model’s ability to explore richer subspaces.

  • The design of single-vector factorization shares a similar idea with [1], which needs to be included in the discussion and experimental comparison.

  • It is not clear how good the convergence bound is compared with the FedAvg convergence bound, nor what the practical implications of this convergence analysis are.

[1] Jeong, Wonyong, and Sung Ju Hwang. "Factorized-fl: Personalized federated learning with parameter factorization & similarity matching." Advances in Neural Information Processing Systems 35 (2022): 35684-35695.

Other Comments or Suggestions

NA

Author Response

1. Only updating B

Thank you for highlighting this concern. Indeed, relying solely on $B$ limits subspace exploration, as seen in FA-LoRA's performance decline, SA-LoRA [2], and Figure 7. MAPA addresses this by randomizing $A$ each round, promoting diverse subspaces. Figure 7 shows that fixing $A$ at low ranks severely degrades accuracy, whereas randomizing $A$ maintains performance. We further verified this advantage in response 2 to reviewer NJUL.

2. Comparison with [1]

We appreciate your mentioning Factorized-FL. Below are key distinctions alongside comparative experiments under the same setup:

The key difference is that Factorized-FL applies factorization on layer-wise parameters, whereas MAPA factorizes the gradient of the entire model. In response to similar questions, we previously elaborated on why gradient (response 1, reviewer FKCh) and model-level factorization (response 3, reviewer JjTg) can outperform layer-wise parameter factorization.

Although both methods use rank-1 factorization, in Factorized-FL rank 1 is a hyperparameter that needs tuning, or increasing for larger models, to avoid limiting representation capacity. In MAPA, rank 1 is inherent and does not restrict model capacity; instead, the reshaping factor $k$ determines the compression rate. Consequently, Factorized-FL's communication per round is constrained by the model dimensions, while MAPA can compress gradients to arbitrary degrees independent of architecture dimensions.

Factorized-FL is similar to a rank-1 LoRA architecture, with a sparse bias matrix, initialized as zero, replacing LoRA's frozen fine-tuned parameter. LoRA imposes strict regularization to preserve the pre-trained parameters, whereas Factorized-FL employs softer regularization, allowing updates when necessary.

Factorized-FL emphasizes personalized FL by sharing one vector globally and keeping the other client-specific. To directly compare factorization effectiveness with MAPA, one could share both vectors globally. However, as noted by the authors of [1] (Page 6, Personalized Weight Averaging), sharing both vectors significantly increases the communication load, adversely affecting efficiency. Below, we highlight this fact by comparing global model training on CIFAR-10 and SVHN under IID and non-IID splits. "Com@X%" indicates the total communication needed to reach X% of FedAvg's final accuracy:

| Method | CIFAR10 Com@80% | CIFAR10 Com@90% | CIFAR10-N Com@80% | CIFAR10-N Com@90% | SVHN Com@80% | SVHN Com@90% | SVHN-N Com@80% | SVHN-N Com@90% | Com/Round |
|---|---|---|---|---|---|---|---|---|---|
| FedAvg | 305.85 | 407.80 | 326.24 | 652.48 | 183.51 | 244.68 | 285.46 | 509.75 | 20.39GB |
| Factorized-FL | 182.50 | 292.00 | 200.75 | 310.25 | 127.75 | 182.50 | 146.00 | 219.00 | 18.25GB |
| MAPA$_{2k}$ | 0.32 | - | 0.94 | - | 0.32 | 0.79 | 0.56 | - | 0.78MB |
| MAPA$_{16k}$ | 0.08 | 0.18 | 0.23 | 0.45 | 0.08 | 0.18 | 0.12 | 0.27 | 6.25MB |
| MAPA$_{40k}$ | 3.84 | 8.64 | 10.88 | 21.12 | 3.84 | 8.64 | 5.76 | 13.12 | 0.32GB |

3. Theorem 4.3

We apologize for the confusion regarding our convergence result. Our convergence bound matches FedAvg's and recovers it as a special case: when the reconstruction error is zero ($\epsilon = 0$), MAPA reduces exactly to FedAvg with the tightest convergence bound. For $\epsilon \neq 0$, the bound introduces a modest constant factor $(\epsilon + \beta + \epsilon\beta)$ due to compressed-update distortion. Nevertheless, MAPA maintains the same asymptotic rate $\mathcal{O}(1/\sqrt{T})$ as FedAvg under standard assumptions (smoothness, bounded variance) [3,4]. Practically, this means MAPA might require slightly more rounds at higher compression, yet the total communication cost to achieve the target accuracy decreases significantly, allowing training with substantially reduced overhead.

4. References

[1] Jeong, W. and Hwang, S.J. "Factorized-FL: Personalized Federated Learning with Parameter Factorization & Similarity Matching."
[2] Guo, P. et al. "Selective Aggregation for Low-Rank Adaptation in Federated Learning."
[3] Yu, H. et al. "Parallel Restarted SGD with Faster Convergence and Less Communication."
[4] Kim, D.-Y. et al. "Achieving Lossless Gradient Sparsification via Mapping to Alternative Space in Federated Learning."
[5] Denil, M. et al. "Predicting Parameters in Deep Learning."
[6] Li, C. et al. "Measuring the Intrinsic Dimension of Objective Landscapes."
[7] Gressmann, F. et al. "Improving Neural Network Training in Low Dimensional Random Bases."
[8] Aghajanyan, A. et al. "Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning."
[9] Hameed, M.G.A. et al. "ROSA: Random Subspace Adaptation for Efficient Fine-Tuning."
[10] Zhao, J. et al. "Galore: Memory-Efficient LLM Training by Gradient Low-Rank Projection."
[11] Zhao, H. et al. "SEPARATE: A Simple Low-Rank Projection for Gradient Compression."
[12] Kuang, W. et al. "Federatedscope-LLM: A Comprehensive Package for Fine-Tuning LLMs in Federated Learning."

Reviewer Comment

Thank you for your response and clarifications! I don’t have any further questions and will keep my score as it is.

Review (Rating: 4)

This paper proposes Model-Agnostic Projection Adaptation (MAPA), which improves LoRA and FA-LoRA in federated learning (FL) by treating the entire model update as a single matrix rather than using layer-wise factorization. This approach enhances computational and communication efficiency while maintaining accuracy. MAPA introduces round-wise randomization of the reconstruction matrix to avoid suboptimal solutions and balance communication and accuracy. Unlike FA-LoRA, which uses a fixed A, MAPA regenerates A each round, enabling better parameter space exploration and preventing suboptimal convergence. Additionally, MAPA reduces memory and computational overhead compared to LoRA, ensuring greater efficiency in FL settings.

Questions for the Authors

The idea of unified low-rank space adaptation for fine-tuning is quite interesting. It seems like this approach could be useful not only in federated learning (FL) but also in traditional centralized fine-tuning. What makes this method particularly beneficial in FL settings? Have similar ideas been explored in centralized ML?

Claims and Evidence

The main claim in this paper is:

  • The proposed MAPA treats the entire model's weights as a single matrix and uses a unified low-rank space ($\Delta W$ and $A$) for low-rank adaptation fine-tuning in FL.

The authors also claim that MAPA:

  • Reduces communication costs compared to existing methods.
  • Improves convergence through randomization of the reconstruction matrix.

The empirical experiments conducted on various benchmark datasets and tasks mostly support these claims. However, I have two concerns regarding the evidence:

  • In some experimental settings, MAPA does not consistently outperform certain baselines in terms of convergence accuracy.
  • The paper lacks ablation studies to fully analyze the impact of the MAPA factorization process.

Methods and Evaluation Criteria

This paper uses popular benchmark models and datasets in FL for the empirical experiments, which makes sense. My only two concerns are:

  • Lack of ablation studies: While MAPA introduces randomized reconstruction matrices, the paper does not provide sufficient ablation experiments to isolate the impact of this randomization on convergence. A comparison between fixed vs. randomized reconstruction matrices would help clarify the exact benefits of the approach.
  • FA-LoRA comparison: The comparison between FA-LoRA and MAPA should be extended and further elaborated. The authors could provide more context on FA-LoRA and explain why there is a performance gap between the two methods in certain settings.

Theoretical Claims

I checked the convergence proof; it makes sense to me.

Experimental Design and Analysis

The proposed method and evaluation criteria are mostly appropriate, but the paper lacks ablation studies and a deeper analysis of MAPA's improvements. Additional, more detailed ablation studies would strengthen the claims.

Supplementary Material

No

Relation to Existing Literature

Not sure

Missing Important References

Not Available

Other Strengths and Weaknesses

One notable strength of this paper is that the MAPA framework may not necessarily limited to FL, its model-agnostic factorization approach could be useful in regular centralized ML as well.

Other Comments or Suggestions

  1. Some experimental comparisons (e.g., with FA-LoRA) could be further elaborated.
  2. Conduct an ablation study to validate the proposed model-agnostic factorization.
Author Response

1. MAPA does not consistently outperform certain baselines

Thank you for your careful observation. We want to emphasize that all the results provided in Table 2, together with the additional experiments conducted during this rebuttal, show that MAPA consistently outperforms the baselines in both communication and performance.

The results shown in the top row of Figure 5 do not account for communication load; there we compare only in terms of global rounds. Once communication load is taken into account, as shown in Table 2, we always outperform the baselines in performance per unit of communication.

 

2. The paper lacks ablation studies

Thank you for your constructive feedback. We initially provided our ablation studies on the effect of matrix rank on training (Figure 7) and the importance of fixed vs. fresh matrix A (Figure 6). Considering your comments, we additionally extended our studies on:

1. Fixed vs. fresh (randomization) of matrix A

To elaborate on the effectiveness of randomization, additional experiments on MNIST and CIFAR10 are presented here, showing accuracy across ranks from $2^0$ to $2^{13}$; they clearly highlight the advantage of randomization, especially at lower ranks (a minimal code sketch of the two settings follows the table below). Moreover, a discussion of the importance of randomization in training is located in response 1 to reviewer Wq8k.

| Method-Dataset \ Rank | $2^0$ | $2^1$ | $2^2$ | $2^3$ | $2^4$ | $2^5$ | $2^6$ | $2^7$ | $2^8$ | $2^9$ | $2^{10}$ | $2^{11}$ | $2^{12}$ | $2^{13}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FrozenA-MNIST | 7.93 | 9.43 | 9.83 | 16.86 | 19.36 | 42.09 | 69.94 | 81.57 | 92.85 | 95.17 | 96.46 | 96.91 | 97.84 | 97.86 |
| FreshA-MNIST | 72.21 | 83.0 | 91.00 | 93.05 | 96.14 | 96.93 | 97.48 | 97.56 | 97.75 | 97.78 | 97.83 | 97.74 | 97.79 | 97.76 |
| FrozenA-CIFAR10 | 12.46 | 13.69 | 16.72 | 19.13 | 21.64 | 20.99 | 27.35 | 31.07 | 40.23 | 47.28 | 54.0 | 63.36 | 67.26 | 68.77 |
| FreshA-CIFAR10 | 51.53 | 55.02 | 57.95 | 61.37 | 63.82 | 65.5 | 66.5 | 69.2 | 68.62 | 69.02 | 68.31 | 68.34 | 68.71 | 68.59 |
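
Below is a minimal Python sketch of the two ablation settings compared above: "FrozenA" reuses one random reconstruction matrix for all rounds, while "FreshA" regenerates it every round. Shapes, round count, and the gradient source are illustrative assumptions, not the experimental setup.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, rounds = 4096, 64, 5

def compressed_update(G, A):
    """Project the reshaped gradient onto span(A) and reconstruct the update."""
    B = A.T @ G
    return A @ B

A_frozen = rng.standard_normal((k, 1))            # fixed once, reused every round
for t in range(rounds):
    G = rng.standard_normal((k, d // k))          # stand-in for the round-t gradient
    upd_frozen = compressed_update(G, A_frozen)   # same subspace each round
    A_fresh = rng.standard_normal((k, 1))         # new subspace each round
    upd_fresh = compressed_update(G, A_fresh)
```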

2. Effect of rank in LLM fine-tuning

We study the effect of the MAPA rank across four different orders of magnitude, alongside LoRA baselines, in communication-efficient LLM fine-tuning. The results are presented in the 2nd Table of response 3 to reviewer FKCh.

3. Experiments on layer-wise vs. model-level factorization

Additionally, during the rebuttal, we conducted further experiments on LLM fine-tuning across various MAPA ranks, clarifying the trade-off between communication and performance, and we conducted an ablation study on layer-wise vs. model-level factorization (see response 3 to reviewer JjTg).

If concerns remain, please specify any additional ablation studies you recommend. We remain committed to conducting further experiments.

 

3. The comparison between FA-LoRA and MAPA

Methodologically, MAPA:

  1. Factorizes gradients, not parameters (response 1, reviewer FKCh).
  2. Uses a randomized $A$ instead of a fixed $A$, as shown in our ablations and in response 1 to reviewer Wq8k.
  3. Operates at the model level rather than layer by layer (response 3, reviewer JjTg).

We further validated these claims via additional GLUE fine-tuning experiments against FA-LoRA and other baselines (response 3, reviewer FKCh).

 

4. Centralized fine-tuning

Following your advice, we tested MAPA in a centralized setup and observed substantial gains over other baselines (3rd table in response 3, reviewer FKCh).

 

5. Have similar ideas been explored in centralized ML?

The literature on low-rank gradient factorization in deep learning starts with:

  • [5] shows the inherent low-rank structure of gradients.
  • [6] examined intrinsic dimensionality by identifying the lowest-dimensional fixed random subspace enabling model convergence. Subsequent works [7–11] expanded on these concepts by training NNs within randomly generated gradient subspaces.

Although these approaches show the efficacy of low-rank gradient factorization, they suffer from extensive memory overhead, as they represent the gradient as a single vector $G^{d}$, where $d$ is the number of model parameters, resulting in considerable memory usage to construct the random transformation $A^{d \times m}$.

MAPA significantly differs from prior approaches by reshaping gradients before factorization. This simple yet effective modification achieves roughly a $k$-fold reduction in computation and $k^2$-fold lower memory usage without compromising performance, as supported by our theoretical and empirical analyses (Appendices H and C.5). Additional discussion comparing gradient vs. parameter factorization appears in response 1 to reviewer FKCh.
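
To visualize the shape difference described above, here is a tiny Python sketch comparing the random matrices the two approaches must construct; all sizes are illustrative assumptions, and the exact savings depend on the configuration.

```python
# Shape comparison for the random projection/reconstruction matrices above.
# d, k, and m are illustrative assumptions, not the paper's configuration.
d = 1_000_000          # total number of model parameters
k = 1_000              # MAPA's reshaping factor
m = d // k             # target compressed dimension

# Prior subspace methods: gradient kept as one d-dim vector G^d, projected with
# a dense random matrix A^{d x m}.
entries_dense_A = d * m

# MAPA: gradient reshaped into a k x (d/k) matrix, reconstructed with A^{k x 1}.
entries_mapa_A = k * 1

print(f"dense A^(d x m): {entries_dense_A:,} entries to store/generate")
print(f"MAPA  A^(k x 1): {entries_mapa_A:,} entries")
```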

References are located in response 4 to reviewer Wq8k.

 

6. What makes this method particularly beneficial in FL?

A primary challenge in FL is mitigating communication overhead. Our MAPA directly addresses this via low-rank gradient factorization integrated with efficient communication. While highly beneficial in FL, gradient reductions offer limited advantages in centralized settings, where gradient communication isn't required.

Final Decision

This work studies communication efficient Federated Learning. It proposes a method that improves LoRA and FA-LoRA by treating the entire model update as a single matrix rather than using layer-wise factorization.

The final recommendations are quite mixed (4, 3, 3, 1). Reviewers recognize the importance of the problem and most reviewers recognize the soundness of techniques.

There are a few common questions raised by the reviewers: limited theoretical contribution (Wq8k, FKCh); novelty / related works (Wq8k, FKCh, JjTg); larger datasets and more ablations (FKCh, JjTg).

We appreciate that the authors added discussions, new datasets, and new experiments in the rebuttal. However, given the common concerns and the sizable changes expected, we recommend that the authors resubmit the work to a future venue with the improvements.

In addition, JjTg raised a question regarding why not layer-by-layer separation. After reviewing the rebuttal, the AC does not see a fundamental problem that is not fixable. We suggest that the authors provide more discussion and justification in a future version.