/10

Poster3 位审稿人

最低3最高4标准差0.5

ICML 2025

Ferret: Federated Full-Parameter Tuning at Scale for Large Language Models

Yao Shu,Wenyang Hu,See-Kiong Ng,Bryan Kian Hsiang Low,Fei Yu

提交: 2025-01-23更新: 2025-07-24

TL;DR

We propose Ferret, the first first-order FL method with shared randomness to enable scalable full-parameter tuning of LLMs across decentralized data sources while maintaining competitive model accuracy.

摘要

关键词

Large Language ModelsFederated Full-Parameter TuningScalabilityTheoretical Guarantees

评审与讨论

审稿意见

评分: 42025-03-13

This work proposes a large-scale federated full parameter tuning framework for LLM, Ferret, which mainly exploits the first-order method of shared randomness. The proposed method mainly consists of three steps: using first-order methods for efficient local updates, projecting these updates into a low-dimensional space, and reconstructing local updates from this low-dimensional space. The method achieves reduced communication overhead and efficient full parameter updates. The article provides comprehensive theoretical and experimental evidence for this method.

给作者的问题

Q1: I am more interested in fine-tuning LLM by LoRA, what is the distinguished feature of first-order methods rather than low-rank fine-tuning?

Q2: I find that this work has a preliminary published workshop version. It could be better to provide the improvement and difference to the original version.

论据与证据

The proposed method is clearly explained and extensive experiments are performed.

方法与评估标准

The proposed method is reasonable. The dataset and model used are reasonable.

理论论述

I have checked the theoretical analysis, no major problems are found.

实验设计与分析

The design of experiments are generally comprehensive. But, in most cases, the authors only compare with FedZO, FedKSeed and FedAvg, the other baselines are not fully compared.

补充材料

I have checked the supplementary materials with additional proof, detailed experiment settings and additional experiments.

与现有文献的关系

This work considers a first-order fine-tuning for LLM in FL on the basis of zero-order methods.

遗漏的重要参考文献

The related work is solid enough, this work is most related to FedZO and FedKSeed.

其他优缺点

S1: The theoretical analysis of this paper is comprehensive.

S2: Overall, this work is well-written and easy to follow.

W1: Some prompt-tuning and LoRA-tuning baselines are only compared in Table 2. I think larger model fine-tuning in Table 3, communication and computation cost in the following tables are worth to compared with these baselines.

W2: In Table 4 and 5, the cost of Ferret is much greater than FedAvg. Why does this phenomenon happen?

W3: FL setting is significant, should moved to the main paper and give more detailed explanations. As well as ablation study, it is crucial for the effectiveness of the proposed methods.

W4: The privacy of exchanging updated weights could provide a more detailed analysis.

其他意见或建议

N/A

作者回复

2025-04-01

We thank Reviewer CfcA for recognizing the comprehensive theoretical analysis, solid related work, and clarity of our paper. We would like to address your concerns below:

W1: Performance Comparison with LoRA methods.

Clarification on Scope: Our work focuses on full-parameter fine-tuning of LLMs. This choice is motivated by the goal of achieving better performance, as PEFT methods like LoRA may not consistently reach the same performance ceiling [2,3]. We primarily compare Ferret against other full-parameter federated methods (FedZO, FedKSeed, FedAvg).
Comparison with FedIT (LoRA baseline): We appreciate the suggestion to compare Ferret with PEFT methods regarding performance, communication, and computation costs. We have provided additional results on FedIT (LoRA-tuning, rank=8, alpha=16) and present the results below (Table-R2 and R3).
- The results show that while FedIT offers lower computational costs, it incurs significantly higher communication costs compared to Ferret. More importantly, Ferret maintains a strong performance close to FedAvg and outperforms FedIT by a large margin.
- Due to the limited rebuttal period, we have prioritized the FedIT (LoRA-tuning) baseline. We commit to including results for other PEFT methods in our final manuscript.

Table-R2: Comparison of Computational and Communication Costs against FedIT

Model	Computational Cost (Overall Sec.)		Communication Cost (# params.)
	LLaMA-3B	LLaMA2-7B	LLaMA-3B	LLaMA2-7B
FedIT	$3.9$	$4.5$	$4.2\times 10^6$	$6.6\times 10^6$
Ferret	$30.3$	$97.2$	$7.8\times10^3$	$6.4\times10^3$

Table-R3: Performance comparison against FedIT

Algorithm	CodeAlpaca		GSM8K
	LLaMA2-7B	LLaMA2-13B	LLaMA2-7B	LLaMA2-13B
FedIT	$4.66 \pm 0.18$	$6.10 \pm 0.18$	$30.31 \pm 0.29$	$13.46 \pm 0.34$
FedZO	$4.58 \pm 0.26$	$6.19 \pm 0.32$	$30.41 \pm 0.31$	$13.63 \pm 0.34$
FedKSeed	$8.33 \pm 0.98$	$10.70 \pm 0.47$	$28.26 \pm 3.60$	$33.67 \pm 1.15$
FedAvg	$\mathbf{15.41} \pm 0.43$	$\mathbf{14.68} \pm 0.26$	$\mathbf{38.30} \pm 0.40$	$\mathbf{39.82} \pm 0.17$
Ferret (ours)	$\underline{12.10} \pm 0.47$	$\underline{11.84} \pm 0.91$	$\underline{36.10} \pm 1.18$	$\underline{34.50} \pm 1.42$

[2] Federated Full-Parameter Tuning of Billion-Sized Language Models with Communication Cost under 18 Kilobytes

[3] Empirical Analysis of the Strengths and Weaknesses of PEFT Techniques for LLMs

W2: Computational cost of Ferret compared to FedAvg

This is correct and represents an intentional design trade-off. The increased computational cost in Ferret stems directly from the gradient projection technique employed to significantly reduce communication costs.
This trade-off (higher local computation for lower communication) is common in communication-efficient FL algorithms, including the baseline FedKSeed. However, Ferret significantly optimizes this trade-off: compared to FedKSeed, Ferret achieves a $\sim6\times$ reduction in computational overhead while providing substantial communication savings over FedAvg.

W3: The presentation regarding the FL setting

Thank you for this constructive suggestion. We will move the FL setup and key ablations into the main paper in the final version.

W4: Privacy analysis.

We agree that formal privacy analysis would be valuable. As noted in Sec. 6, we are encouraged to do so for future research.

Q1: distinguished feature of first-order methods compared to LoRA

If we understand correctly, we believe you are referring to the comparison between full-parameter fine-tuning (like FedAvg or our Ferret using first-order optimization) and low-rank fine-tuning (like FedIT, also typically using first-order optimization). The distinguished features are:

Full-parameter fine-tuning updates all model parameters, but FedIT only updates a small set of low-rank adapter parameters.
Standard full-parameter methods (like FedAvg) communicate all parameter updates, leading to high cost. LoRA communicates only the adapter updates (still large, as shown in Table-R2). Our method, Ferret, applies projection technique after ful-parameter gradient computation to significantly reduce communication.
As supported by existing literature [2,3] and our results (Table-R3), full-parameter fine-tuning generally achieves higher performance ceilings compared to PEFT methods, especially on complex tasks.

Q2

We understand your interest but are unable to address this question due to the anonymity policy during the review process. Thank you for your understanding.

We hope these responses and the additional experimental results effectively address your concerns and clarify the contributions and positioning of Ferret. We welcome any further questions.

审稿人评论

2025-04-03

Thanks for your response, which addressed my concerns. I have raised my score.

作者评论

2025-04-03

Dear Reviewer CfcA,

We are very happy that our response has addressed your concerns and improved your opinion about our work! We will include those valuable discussions and additional results in the final version.

Best regards,

Authors

审稿意见

评分: 32025-03-14

This paper introduces Ferret, a method for full-parameter tuning of large language models (LLMs) in federated learning. It primarily addresses the challenge of communication overhead by combining the strengths of first-order optimization (efficient computation and fast convergence) and zeroth-order optimization (reduced communication overhead). The method utilizes shared randomness to project updates into a low-dimensional space, effectively reducing communication costs.

给作者的问题

论据与证据

Yes.

方法与评估标准

Yes.

理论论述

I have conducted a rough review of the entire proof but may have overlooked some details.

实验设计与分析

In general, the experimental results are sufficient and convincing.

补充材料

I have conducted a rough review of the entire material but may have overlooked some details.

与现有文献的关系

See the Strengths And Weaknesses below for details.

遗漏的重要参考文献

其他优缺点

Strengths:

The paper provides rigorous theoretical analyses to support Ferret's effectiveness, showcasing its advantages over existing methods in terms of computational efficiency, communication overhead, and convergence speed.
The experimental results are sufficient and convincing.
The proposed method and the addressed challenge are both meaningful and novel.

Weaknesses:

Reconstruction Error: While Ferret reduces communication overhead, the reconstruction of updates from low-dimensional projections may introduce some error, particularly for complex tasks. This is evident in the slightly lower performance of Ferret compared to FedAvg on the CodeAlpaca and GSM8K datasets.
Despite reducing communication costs, Ferret still requires substantial computational resources, especially for larger models such as LLaMA2-13B. The authors are encouraged to explore ways to optimize computational efficiency in future research.

其他意见或建议

作者回复

2025-04-01

We sincerely thank Reviewer oeGQ for the positive evaluation, particularly recognizing our theoretical analyses, experimental results, and the novelty of our approach. We address the reviewer's concerns below:

W1: Reconstruction Error

We clarify that the slight performance difference compared to FedAvg is an expected consequence of the gradient compression inherent in Ferret, which enables massive communication savings. We highlight three key points:

Our reconstruction error is theoretically bounded (Thm 2) and can be reduced by increasing $K$ . This allows practitioners to balance performance and communication overhead based on their resource constraints (analysis in Appx. C.5).
While the baseline FedKSeed also suffers from reconstruction error due to gradient projection, Ferret achieves more accurate reconstruction, leading to superior empirical performance, as validated in our experiments.
The minimal performance difference (Ferret vs. FedAvg) on CodeAlpaca and GSM8K (Table 3) is vastly outweighed by the $10^6\times$ reduction in communication cost (Table 4&5). This reflects our intentional design choice to prioritize communication efficiency, which we believe is crucial for the scalable and practical deployment of FL systems.

We will explicitly discuss this trade-off and include these discussions in the revised manuscript to further strengthen our paper.

W2: Computational Resources

We appreciate the reviewer's suggestion regarding computational optimization. We acknowledge that while Ferret significantly reduces communication overhead, it requires additional computational costs, especially for large models like LLaMA2-13B. We believe this is an area for future work, and we would like to clarify that:

Compared with the relevant communication-efficient baseline, FedKSeed, Ferret already achieves a significant ( $\sim6\times$ ) reduction in computational cost.
We have already incorporated optimizations like Reconstruction w/o Inversion and Block-Wise Reconstruction (Sec. 3.2) to mitigate this computational overhead during projection and reconstruction steps.
We recognize that further improvements are possible. We plan to explore techniques such as quantization on gradients or adaptive projection based on gradient sparsity in future research to further reduce the computational burden while preserving communication efficiency.

Thank you again for the constructive feedback. We will enhance the discussion on these trade-offs in the final manuscript. We hope this response clarifies our contributions and addresses the reviewer's concerns effectively.

审稿意见

评分: 32025-03-14

In this work, the authors proposed Ferret, a federated learning method for efficient full-parameter fine-tuning of LLMs, combining first-order optimization with random projection-based dimensionality reduction. It uses shared randomness to reconstruct local updates at the server, significantly reducing communication overhead. The authors provide theoretical guarantees on unbiased reconstruction and convergence, alongside experiments demonstrating reduced communication and computational costs compared to existing methods.

给作者的问题

Please refer to my comments above.

论据与证据

One of my major concerns of this work is about the technical novelty. I like this paper as a whole, especially given the interesting topic and extensive theories, however, I feel the technical novelty is somewhat overclaimed. Although the authors emphasize the unique challenges of federated full-model fine-tuning of large language models, the proposed random projection method resembles existing techniques (e.g., FetchSGD). Furthermore, the block-wise reconstruction shares conceptual similarity with SparseGPT (Frantar et al.), though SparseGPT is not cited. Specifically,

The method's projection-based approach is intuitive and theoretically grounded. However, similar methods, particularly FetchSGD (ICML 2020), already utilize random projections for federated optimization.
The block-wise reconstruction technique introduced, although useful, closely mirrors existing weight reconstruction ideas (e.g., SparseGPT), yet SparseGPT is neither cited nor compared.

方法与评估标准

I am a bit confused about the federated setup of experiments in this work. How many clients in total are used per round for the proposed method and baselines on each dataset?

According to the paper, "In each round of federated learning, 5% of clients were randomly selected to participate." How many clients in total did you use?
In addition, you mentioned "Due to the compelling efficiency of our method, we set the total number of communication rounds to 12 for the NI dataset and 20 for Dolly-15K for Ferret." So the proposed method may not leverage some clients on NI dataset? (12 $\times$ 5% = 60%) And for Dolly-15K it is highly likely to happen as well? (otherwise each sampling of clients has to be perfectly non-overlapping)

理论论述

The provided theoretical analyses on unbiasedness and convergence are rigorous and sound, representing a significant strength of the paper.

实验设计与分析

Although memory footprint analysis is provided in the appendix, the proposed method (Ferret) incurs significantly higher GPU memory costs compared to the zeroth-order optimization method (FedKSeed). Given federated learning's typical deployment constraints on resource-limited devices, this large discrepancy raises concerns about the practicality and optimality of the proposed trade-off between memory usage and communication efficiency.
LlaMA / LlaMA-2 families are too dated in 2025 and I strongly suggest that the authors consider evaluation on more SOTA LLMs such as Qwen-2.5 and LlaMA-3 families.

补充材料

I briefly checked the mathematical proofs (not thorough and may overlook details) as well as additional results.

与现有文献的关系

The absence of a discussion with SparseGPT, despite methodological resemblance in block-wise reconstruction approaches, weakens the credibility of the claimed novelty.

遗漏的重要参考文献

Frantar et al., "SparseGPT: Massive Language Models Can be Accurately Pruned in One-shot," ICML, 2023.

其他优缺点

S1. I found the research topic of full-model LLM finetuning under FL interesting and prompt.

S2. Effective empirical demonstration of communication efficiency and fast convergence.

S3. Solid theoretical analyses on unbiasedness and error bounds.

S4. The paper is well-written and easy to follow.

W1. Insufficient technical novelty compared to existing projection-based FL methods.

W2. The block-wise reconstruction is conceptually similar to that of SparseGPT while it is not cited and discussed.

W3. Practical concerns (memory and computational complexity) are inadequately addressed.

W4. LLMs utilized in the experiments are somehow too outdated.

其他意见或建议

i think this paper has many merits such as interesting research idea, extensive theories, and nice results. My major concerns are about the technical novelty, and some evaluations of the proposed method. Hence, I will be fully open to the rebuttal, i.e., other reviewers' comments and discussion, and will adjust my score accordingly.

作者回复

2025-04-01

We thank the reviewer for the thorough evaluation and constructive feedback. We address the points raised below:

Claims And Evidence

We appreciate the reviewer's feedback on the comparison to FetchSGD. While both methods use dimensionality reduction, Ferret's technical approach and goal are fundamentally different, establishing its novelty, especially for full-parameter LLM tuning. FetchSGD uses Count Sketch (based on hashing coordinate indices) primarily to enable server-side state management (momentum, error accumulation) for a subsequent biased Top-K sparse update. Ferret, conversely, uses shared random vectors v to directly project the entire first-order local update vector Δ via dot products (Eq. 6), determined through convex optimization (Eq. 4). Crucially, Ferret aims to reconstruct an approximation of the full, dense update Δ̃ (Eq. 7) for aggregation, vital for maintaining accuracy in full-parameter LLM tuning. This contrasts sharply with FetchSGD's goal of facilitating a sparse update. Furthermore, Ferret's design targets an unbiased reconstruction (Thm. 1), avoiding the explicit error accumulation required by FetchSGD's biased sparsification step. Ferret's novelty lies in being the first approach to uniquely combine efficient first-order local updates with shared randomness for reconstructing dense updates, specifically optimized (e.g., block-wise reconstruction) for the unique scale and demands of federated full-parameter LLM tuning.
Thanks for the SparseGPT suggestion (citation will be added). SparseGPT uses block-wise processing on model weights for pruning. Ferret applies it to federated updates (Δ) solely to improve computational efficiency during the dense update reconstruction step (Eq. 7 -> Eq. 8), making it scalable for LLMs. Ferret's novelty lies in using this block-wise strategy specifically for scaling update reconstruction in our federated tuning context, distinct from model pruning.

We acknowledge the conceptual similarities, but we emphasize that our method Ferret is technically distinct and novel based on our comparison above. In our revision, we promise to cite the SparseGPT paper, add a detailed discussion regarding the block-wise reconstruction, and also highlight the difference between FetchSGD and Ferret.

Methods And Evaluation Criteria:

We clarify our FL setting follows the previous literature (FedKSeed), utilizing 738 clients on Natural-Instruction and 200 clients on Dolly-15K datasets in the FL system.
Yes, some clients may not be leveraged for training the FL model, as we have fewer communication rounds. In each round, we independently and randomly sample 5% of clients for both NI dataset and Dolly-15K dataset to participate. As shown in Figure 2, Ferret converges rapidly (similar to FedAvg), reaching a point where additional training with more clients yield diminishing returns. We think this might be due to the nature of this data source used.

Experimental Designs Or Analyses

Memory Footprint: We clarify that our primary focus is communication efficiency in a standard distributed data setting where clients can perform backpropagation. While Ferret currently uses a standard SGD optimizer (implying typical client-side memory requirements for backpropagation), we acknowledge that our memory usage could be potentially reduced by integrating memory-efficient optimizers like those in [1] for future works. [1] Full Parameter Fine-tuning for Large Language Models with Limited Resources.
Additional Experiments: Follwoing the reviewer's constructive suggestion, we conducted additional experiments on CodeAlpaca and GSM8K using Llama3-8B and Qwen2.5-7B. The results below demonstrate Ferret's consistant effectiveness, achieving near FedAvg performance with significantly reduced communication overhead across models and tasks.

Table-R3: Performance comparison on Llama3-8B and Qwen2.5-7B models

Algorithm	Alpaca		GSM8K
	Llama3-8B	Qwen2.5-7B	Llama3-8B	Qwen2.5-7B
FedKSeed	$5.73 \pm 1.26$	$9.14 \pm 0.32$	$7.79 \pm 1.36$	$23.84 \pm 1.19$
FedZO	$16.66 \pm 0.50$	$11.76 \pm 0.38$	$37.44 \pm 0.11$	$28.04 \pm 0.13$
FedAvg	$\mathbf{19.88} \pm 0.67$	$\mathbf{17.47} \pm 0.49$	$\mathbf{45.48} \pm 0.51$	$\mathbf{43.86} \pm 0.36$
Ferret	$\underline{19.59} \pm 0.66$	$\underline{14.64} \pm 0.74$	$\underline{45.07} \pm 0.78$	$\underline{38.28} \pm 1.70$

Response to W1 & W2.

Please see our response under Claims And Evidence above.

Response to W3 & W4.

Please see our response under Experimental Designs Or Analyses above.

We sincerely thank Reviewer C3NM for the valuable feedback and are encouraged by the positive remarks on our research idea, theories, and results. We hope our clarification and additional results effectively address your concerns and increase your opinions of our work. We welcome any further questions.

审稿人评论

2025-04-05

I thank the authors for their detailed rebuttal and appreciate the additional experiments and clarifications provided. The additional experiments using more recent LLMs address my initial concern about using outdated models, and thus I have increased my score accordingly. However, I remain concerned about two issues that need further clarification:

Technical Novelty Concerns: Although the authors clarified differences from FetchSGD and SparseGPT, their explanations did not fully address the core issue of novelty. For example, the block-wise reconstruction used in Ferret still mirrors the method employed by SparseGPT. While the authors differentiate their application (fine-tuning versus pruning), mathematically, the underlying reconstruction process remains substantially similar. Fine-tuning and pruning objectives, absent the pruning masks, are fundamentally analogous optimization problems. I understand assessing a work's technical novelty can be highly subjective and tricky, and that's why I have increased my score regardless of this concern and intend this feedback to further enhance our discussion.
Memory Footprint and Practicality: The authors acknowledge the significantly higher memory costs of Ferret compared to zero-order optimization methods but suggest the potential use of efficient first-order optimizers to mitigate this. I reviewed the cited reference [1] and noted that these optimizers typically yield only modest memory reductions (in single-digit percentage ranges, at most 10%). Given that Ferret incurs substantially greater memory costs, even after such optimizations, the method would still exhibit a significant memory overhead. Additionally, placing the memory footprint analysis solely in the appendix undermines transparency regarding the critical trade-off between performance and resource constraints, which is highly relevant for practical federated learning deployments. I strongly recommend moving this analysis into the main manuscript to facilitate a clearer assessment of Ferret's practicality against e.g., zero-order methods.

[1] Full Parameter Fine-tuning for Large Language Models with Limited Resources.

作者评论

2025-04-06

Thank you once again for your detailed feedback, constructive engagement, and for raising your score. We sincerely appreciate this opportunity to further elaborate on the technical novelty and memory footprint of Ferret, addressing the remaining points you've helpfully raised.

Technical Novelty Concerns

We understand your perspective regarding the mathematical similarity of block-wise processing when viewed in isolation. We do acknowledge that block-wise decomposition, as a technique, has been employed before, notably in SparseGPT. We commit to ensuring SparseGPT is appropriately credited for its use of this technique in our revision.

However, we respectfully argue that Ferret's novelty (on block-wise design) lies not in the invention of this technique, but from its unique integration within a novel FL framework to efficiently reconstruct gradient updates in a FL setting, and its novel theoretical results (not presented in SparseGPT). To clarify, we have proved that (in Prop. 1) the block-wise reconstruction reduces computational complexity, and (in Prop. 2) the reconstruction error can be minimized by allocating # random seeds according to the gradient norm of each block. Our Prop. 2 is the foundments of our novel design on adaptively allocating the number of random seeds for each block, and its emprical success is validated in Fig. 9 in Appx. C.6.

We would like to emphasize that the block-wise design is only one part of Ferret. Ferret's overall novelty lies in the its whole FL framework and rigorous theoretical analyses: the first first-order FL approach with shared randomness, which uses novel random update projection and reconstruction to significantly enhances the scalability of FL full-parameter tuning of LLMs while maintaining competitive model accuracy.

Memory Footprint and Practicality

We appreciate the opportunity to address your concerns regarding memory footprint here:

It is important to clariy that our method Ferret does not incur additional memory cost compared to the standard first-order method (FedAvg).
We acknowledge your point regarding zeroth-order methods, which inherently offer lower memory footprints. This reflects a trade-off in optimization: zeroth-order methods reduce memory but often require more steps to converge, may reach lower final accuracy, and can be less stable compared to first-order methods, especially for complex models (e.g., Llama3-8B). Ferret is designed for improved scalability where retaining the potential benefits of first-order gradient information (e.g., faster convergence, higher accuracy) is desirable.
We understand you viewed reference [1] (introducing LOMO) and noted concerns about the extent of memory savings. We would like to respectfully clarify the results of Table 1 in [1] shows LOMO has a $\sim70\\%$ total memory reduction (from 51.99 GB to 14.58 GB with activation checkpointing), while the zeroth-order method FedKSeed has a slightly lower memory reduction ( $\sim60\\%$ shown in Table 9 in our paper). We believe this demonstrates a promising path to dramatically reduce the memory footprint of Ferret. This could make first-order FL methods more memory efficient and practical.
We thank the reviewer for the suggestion of moving the footprint analysis into the main paper to facilitate a clearer assessment. We will do so in the final version.

Thank you again for your constructive engagement and valuable feedback throughout this discussion period. We have found this discussion very helpful and are committed to incorporating these clarifications to strengthen the final paper. We hope our responses have successfully addressed all your concerns and improved your opinion of our work.

最终决定Accept (poster)

2025-05-01

Paper studies federated full finetuning of LLM which has significant computational and communication-overhead challenges. This paper combines techniques form federated first-order optimization and zeroth-order optimization to reduce the communication cost of finetuning LLMs in this federated setting. Paper is well written, and the experimental results are convincing in terms of reduction in communication cost. Reviewers also appreciated the additional experimental results provided during the rebuttal. However, reviewers noted that there is limited technical novelty in the proposed methods, further even though their method reduces communication cost, it may increase the memory cost at the clients when compared to some of the baseline.