PaperHub
Overall: 6.8 / 10 · Poster · 4 reviewers (min 3, max 5, std 0.8)
Ratings: 5, 5, 3, 4 · Confidence: 4.5
Originality 2.8 · Quality 2.8 · Clarity 3.3 · Significance 2.3
NeurIPS 2025

LLM at Network Edge: A Layer-wise Efficient Federated Fine-tuning Approach

OpenReview · PDF
Submitted: 2025-05-09 · Updated: 2025-12-04

Abstract

Fine-tuning large language models (LLMs) poses significant computational burdens, especially in federated learning (FL) settings. We introduce Layer-wise Efficient Federated Fine-tuning (LEFF), a novel method designed to enhance the efficiency of FL fine-tuning while preserving model performance and minimizing client-side computational overhead. LEFF strategically selects layers for fine-tuning based on client computational capacity, thereby mitigating the straggler effect prevalent in heterogeneous environments. Furthermore, LEFF incorporates an importance-driven layer sampling mechanism, prioritizing layers with greater influence on model performance. Theoretical analysis demonstrates that LEFF achieves a convergence rate of $\mathcal{O}(1/\sqrt{T})$. Extensive experiments on diverse datasets demonstrate that LEFF attains superior computational efficiency and model performance compared to existing federated fine-tuning methods, particularly under heterogeneous conditions.
Keywords
federated learning · large language model · fine-tuning

Reviews and Discussion

Review (Rating: 5)

LEFF presents a novel layer-wise federated fine-tuning approach for LLMs at the edge, effectively addressing computational/data heterogeneity. The work is technically sound with strong empirical/theoretical contributions, though minor revisions would enhance reproducibility.

Strengths and Weaknesses

Strengths

  • Innovative Methodology: Layer-wise adaptation with importance sampling and distillation-based compression elegantly solves straggler issues in heterogeneous FL.
  • Theoretical Rigor: The $\mathcal{O}(1/\sqrt{T})$ convergence proof under standard FL assumptions (Theorem 1) is a key contribution.
  • Comprehensive Validation: Extensive tests on GLUE/E2E NLG across models (DeBERTaV3, GPT-2), heterogeneity levels (α=0.05-50), and client scales (8-40 devices) demonstrate superiority over baselines.
  • Practical Relevance: Reduces client compute while maintaining near-full-tuning accuracy – critical for edge deployment.

Weaknesses:

  • Statistical Reporting: Lack of error bars/standard deviations (Tables 1-3) weakens empirical claims.
  • Overhead Analysis: Client computation costs and GPU memory pressure are unquantified.
  • Societal Impacts: No discussion of environmental costs from server operations or dual-use risks of efficient edge fine-tuning.

Questions

  1. Could you add standard deviations/confidence intervals to demonstrate result stability?
  2. What are the measured client-side trainable parameters and GPU memory costs compared to FedAvg?

Limitations

Yes.

Final Justification

Thanks for the detailed response, which has fully resolved my concerns. Although other reviewers rightly raised concerns about the evaluation scope (e.g., models and baselines) and proxy dataset, I believe the authors’ new SOTA experiments and robustness analysis have convincingly addressed these points. Given the paper’s significant technical innovation and now-strengthened validation, I will raise my score by +1 and recommend acceptance.

Formatting Issues

No significant formatting concerns were identified in the paper during my review.

Author Response

(For clarity in our response, we refer to the reviewer's points using the following convention: W.x (Weaknesses), Q.x (Questions), and L.x (Limitations).)

We sincerely thank the reviewer for their valuable and constructive feedback. Your comments have helped us to significantly improve the empirical rigor and ethical considerations of our paper. We have grouped our responses to address your main points regarding empirical validation and societal impact.


(W.1, W.2, Q.1, Q.2) On Empirical Rigor: Result Stability and Client-Side Costs

A key theme of your feedback (W.1, W.2, Q.1, Q.2) was the need for more rigorous quantification of our results' stability and the client-side overhead. We agree this is crucial and have conducted detailed new analyses to address these points.

1. Statistical Stability of Results (W.1, Q.1)

To demonstrate the stability of our findings, we have re-run all experiments three times with different random seeds and calculated the standard deviations for all metrics. The results confirm that our method is stable and the performance gains are consistent.

Due to space constraints, we present a representative table of standard deviations below and will include the complete set of tables in a new Appendix. The key finding is that the standard deviations are consistently small across all tasks and methods, confirming the reliability of our claims. Our method, LEFF, shows stability comparable to or better than the baselines.

| Method | CoLA | SST-2 | MRPC | STS-B | QQP | MNLI | QNLI | RTE | BLEU | NIST | METEOR | ROUGE | CIDEr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FedAvg | 1.52 | 0.71 | 0.88 | 0.45 | 0.31 | 0.53 | 0.62 | 1.21 | 0.0041 | 0.1512 | 0.0028 | 0.0033 | 0.0805 |
| FedBitFit | 1.89 | 0.95 | 1.02 | 0.82 | 0.55 | 0.98 | 0.77 | 1.53 | 0.0053 | 0.2144 | 0.0045 | 0.0049 | 0.1231 |
| FedLoRA | 1.72 | 0.88 | 1.15 | 0.75 | 0.49 | 1.05 | 0.81 | 1.68 | 0.0058 | 0.2516 | 0.0041 | 0.0051 | 0.1189 |
| SLoRA | 1.45 | 0.81 | 0.95 | 0.68 | 0.42 | 0.85 | 0.71 | 1.35 | 0.0049 | 0.1832 | 0.0035 | 0.0039 | 0.0948 |
| LEFF | 1.38 | 0.75 | 0.91 | 0.51 | 0.35 | 0.62 | 0.68 | 1.25 | 0.0045 | 0.1604 | 0.0031 | 0.0035 | 0.0882 |

2. Client-Side Overhead Analysis (W.2, Q.2)

We have performed a detailed analysis of client-side trainable parameters and peak GPU memory usage across various models. The results highlight LEFF's unique and superior efficiency profile.

| Model | Algorithm | Trainable Params | Peak Memory (GB) |
|---|---|---|---|
| DeBERTaV3-Base | FedAvg | 85,648,130 | 3.841 |
| DeBERTaV3-Base | FedLoRA | 1,340,930 | 2.644 |
| DeBERTaV3-Base | FedBitFit | 102,914 | 2.198 |
| DeBERTaV3-Base | LEFF | 7,681,538 | 2.136 |
| DeBERTaV3-Base | SLoRA | 1,340,930 | 2.644 |
| DeBERTaV3-Large | FedAvg | 303,363,074 | 9.361 |
| DeBERTaV3-Large | FedLoRA | 3,557,378 | 6.660 |
| DeBERTaV3-Large | FedBitFit | 272,386 | 5.488 |
| DeBERTaV3-Large | LEFF | 13,649,922 | 3.005 |
| DeBERTaV3-Large | SLoRA | 3,557,378 | 6.660 |
| GPT2 | FedAvg | 85,056,000 | 3.358 |
| GPT2 | FedLoRA | 811,008 | 2.476 |
| GPT2 | FedBitFit | 102,144 | 2.206 |
| GPT2 | LEFF | 7,089,408 | 1.719 |
| GPT2 | SLoRA | 811,008 | 2.476 |
| GPT2-Large | FedAvg | 708,390,400 | 15.548 |
| GPT2-Large | FedLoRA | 4,055,040 | 9.739 |
| GPT2-Large | FedBitFit | 508,160 | 8.318 |
| GPT2-Large | LEFF | 19,680,000 | 3.240 |
| GPT2-Large | SLoRA | 4,055,040 | 9.739 |
| Llama-3.1-8B | FedAvg | OOM | OOM |
| Llama-3.1-8B | FedLoRA | 20,971,520 | 46.868 |
| Llama-3.1-8B | FedBitFit | No Bias | No Bias |
| Llama-3.1-8B | LEFF | 743,452,672 | 29.881 |
| Llama-3.1-8B | SLoRA | 20,971,520 | 46.868 |

Our analysis reveals two key insights:

  • Lowest GPU Memory Pressure: LEFF drastically reduces peak GPU memory—for instance, achieving a 79% reduction on GPT2-Large compared to FedAvg and 66% compared to FedLoRA. This is possible because LEFF's architecture selectively loads only the necessary layers into GPU memory, whereas other PEFT methods must load the entire frozen model, resulting in much higher memory footprints.
  • Optimal Performance-Efficiency Trade-off: LEFF deliberately retains more trainable parameters than methods like FedLoRA. This maintains greater model expressiveness, leading to superior task performance. Thus, LEFF strikes an optimal balance: it achieves the lowest memory cost for deployment on constrained devices while preserving enough capacity for high-accuracy fine-tuning.

These quantitative results confirm LEFF's superior design for practical, resource-constrained FL. We will integrate these findings into a new appendix.


(W.3) On Societal Impacts

We thank you for raising the important ethical considerations of environmental cost and dual-use risks (W.3). We will add a dedicated section to discuss these.

  • Environmental Cost: We acknowledge the server-side computation in LEFF is a necessary trade-off. However, by enabling efficient fine-tuning on numerous distributed, low-power edge devices, our approach reduces the barrier to entry and can lessen the overall system's reliance on large, consistently energy-intensive data centers for training.
  • Dual-Use Risks: We recognize this concern. A key strength of LEFF's federated architecture is that the central server acts as a natural control point. This allows for the implementation of safeguards like client vetting, anomaly detection, and monitoring model updates to mitigate misuse—a governance feature absent in purely decentralized or local fine-tuning scenarios.

We hope these detailed analyses and clarifications fully address your concerns. We are confident that incorporating these additions will significantly strengthen the paper.

Review (Rating: 5)

This paper introduces Layer-wise Efficient Federated Fine-tuning (LEFF). The core idea is to let each client fine-tune only a subset of the LLM's layers, with the number of layers chosen based on the client's computational capacity. To optimize this process, the server uses an importance-based sampling strategy to assign the most impactful layers to clients. For the layers that a client does not train, the server creates a compressed version using knowledge distillation. Experiments are extensive and results are promising.

Strengths and Weaknesses

Strengths:

  • The idea of dynamically allocating layers to clients for finetuning has been explored before but compressing the rest is novel as far as I know.
  • Well-motivated and interesting problem.
  • Experiments are extensive and datasets/methods have variety, supporting the validation of the proposed approach.
  • Theoretical analysis is provided.

Weaknesses:

  • Distillation to create compressed layers for each client every round seems computationally expensive.
  • Experiment results lack client-side efficiency metrics even though paper claims efficiency.
  • How the dataset used in knowledge distillation is constructed is critical and may not be practical, as ideally it should be representative of the clients' data.

Questions

  • The theoretical analysis mainly states that the performance floor depends on how good the compression is, right?
  • How does the computational cost at the server scale as the number of clients increases above 40 (which is a low number for practical settings)?
  • How does the proxy data used in knowledge distillation impact performance?

Limitations

Yes

Final Justification

My concerns/questions about the efficiency on server and client-side are addressed. Additional experimental results provided with more extensive analysis based on other reviewers' feedback are promising. I increase my recommendation from 4 to 5.

Formatting Issues

I have not noticed any major issues.

Author Response

(For clarity in our response, we refer to the reviewer's points using the following convention: W.x (Weaknesses), Q.x (Questions), and L.x (Limitations).)

We thank the reviewer for their insightful questions, which have helped us to clarify the efficiency, scalability, and theoretical underpinnings of our work. We have grouped our responses thematically.


(W.2) On Client-Side Efficiency: New Quantitative Results

You rightly pointed out the need for quantitative client-side efficiency metrics to substantiate our claims (W.2). To address this, we have conducted new experiments measuring peak GPU memory and the number of trainable parameters. The results below demonstrate LEFF's significant advantages.

| Model | Algorithm | Trainable Params | Peak Memory (GB) |
|---|---|---|---|
| DeBERTaV3-Base | FedAvg | 85,648,130 | 3.841 |
| DeBERTaV3-Base | FedLoRA | 1,340,930 | 2.644 |
| DeBERTaV3-Base | FedBitFit | 102,914 | 2.198 |
| DeBERTaV3-Base | LEFF | 7,681,538 | 2.136 |
| DeBERTaV3-Base | SLoRA | 1,340,930 | 2.644 |
| DeBERTaV3-Large | FedAvg | 303,363,074 | 9.361 |
| DeBERTaV3-Large | FedLoRA | 3,557,378 | 6.660 |
| DeBERTaV3-Large | FedBitFit | 272,386 | 5.488 |
| DeBERTaV3-Large | LEFF | 13,649,922 | 3.005 |
| DeBERTaV3-Large | SLoRA | 3,557,378 | 6.660 |
| GPT2 | FedAvg | 85,056,000 | 3.358 |
| GPT2 | FedLoRA | 811,008 | 2.476 |
| GPT2 | FedBitFit | 102,144 | 2.206 |
| GPT2 | LEFF | 7,089,408 | 1.719 |
| GPT2 | SLoRA | 811,008 | 2.476 |
| GPT2-Large | FedAvg | 708,390,400 | 15.548 |
| GPT2-Large | FedLoRA | 4,055,040 | 9.739 |
| GPT2-Large | FedBitFit | 508,160 | 8.318 |
| GPT2-Large | LEFF | 19,680,000 | 3.240 |
| GPT2-Large | SLoRA | 4,055,040 | 9.739 |
| Llama-3.1-8B | FedAvg | OOM | OOM |
| Llama-3.1-8B | FedLoRA | 20,971,520 | 46.868 |
| Llama-3.1-8B | FedBitFit | No Bias | No Bias |
| Llama-3.1-8B | LEFF | 743,452,672 | 29.881 |
| Llama-3.1-8B | SLoRA | 20,971,520 | 46.868 |

This new analysis reveals two key insights:

  1. Lowest Memory Footprint: LEFF drastically reduces peak GPU memory—for instance, achieving a 79% reduction on GPT2-Large compared to FedAvg and 66% compared to FedLoRA. This is possible because LEFF's architecture selectively loads only the necessary layers into GPU memory. In contrast, other PEFT methods must load the entire frozen model, resulting in much higher memory footprints.
  2. Optimal Efficiency-Performance Trade-off: While achieving the lowest memory cost, LEFF deliberately retains more trainable parameters than methods like FedLoRA. This maintains greater model expressiveness, leading to superior task performance. LEFF thus strikes an optimal balance, delivering the minimal resource cost for client deployment while preserving enough capacity for high-accuracy fine-tuning.

We will add this detailed analysis to the appendix of our paper.


(W.1, Q.2) On Server-Side Cost and Scalability

We appreciate your important questions regarding server-side computational costs and scalability (W.1, Q.2). Our framework is designed to be highly efficient and scalable.

1. Low Per-Round Distillation Cost (for W.1)

The server-side knowledge distillation is computationally inexpensive, as its total workload is minimized by three key factors:

  • Amortized Cost via Caching: The server does not perform a separate distillation for every client. Instead, if multiple clients request the compression of the exact same set of layers, the server performs this task only once. The result is then cached and distributed to all relevant clients, drastically reducing the total number of required distillation processes per round (see the sketch after this list).
  • Incremental Fine-tuning (Warm Start): The process is not training from scratch. The distilled "student" model from round t-1 serves as a warm start for round t. Since the global "teacher" model evolves gradually, the student only needs a brief fine-tuning to catch up.
  • Small Target Model: By definition, the distillation operates on a "student" model that is significantly smaller than the full model (controlled by r), making each individual training process inherently fast.
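To make the caching and warm-start logic above concrete, here is a minimal sketch. The `distill_fn` callback, the `(start, end)` block key, and the class layout are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the caching + warm-start logic described above.
# `distill_fn` and the block-key layout are illustrative assumptions.

class DistillationCache:
    """Caches distilled ("student") replacements for frozen layer blocks."""

    def __init__(self):
        # Maps a frozen-block key, e.g. (start, end), to its distilled student.
        self._students = {}

    def get(self, block_key, teacher, distill_fn):
        # If several clients request the same frozen block this round,
        # the distillation runs only once and the result is reused.
        if block_key not in self._students:
            self._students[block_key] = distill_fn(teacher, block_key, init=None)
        return self._students[block_key]

    def refresh(self, teacher, distill_fn):
        # Warm start: at the next round, each cached student is briefly
        # fine-tuned against the updated global (teacher) model instead of
        # being distilled from scratch.
        for block_key, student in self._students.items():
            self._students[block_key] = distill_fn(teacher, block_key, init=student)
```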

2. High Scalability with Increasing Clients (for Q.2)

The server's computational overhead scales efficiently for large-scale deployments, as neither of its main tasks grows linearly with the number of clients:

  • Layer Importance Calculation: This is performed only once per round, requiring a single forward and backward pass on the global model. Its cost is therefore constant regardless of the number of clients (e.g., 100 or 10,000).
  • Distillation Workload: As explained above, the caching mechanism decouples the distillation workload from the client count. The total cost is tied to the number of unique compression requests, which is significantly smaller than the total number of clients in a large-scale setting. This ensures the distillation overhead scales sub-linearly, making the entire framework highly scalable.

(W.3, Q.1, Q.3) On the Proxy Dataset and Theoretical Guarantees

Your questions about the proxy dataset (W.3, Q.3) and the interpretation of our theory (Q.1) are spot-on. These two aspects are deeply connected.

1. Proxy Dataset Robustness: To empirically address your concern, we ran a new ablation study on the E2E NLG task, showing that LEFF is highly robust to the choice of proxy data.

  1. Robustness to Data Distribution: Our method is robust to the choice of proxy data because we distill functional representations, not task-specific knowledge. As described in Section 3.3, our objective is to match the intermediate hidden states and attention matrices. This process preserves general representational capabilities and is far less sensitive to the specific domain of the input data (a sketch of this objective is given below).
  2. Practical Availability: Since our method only requires unlabeled text, readily available public corpora (e.g., C4, Wikipedia) are sufficient. In scenarios where external data is prohibited, the server can use the global model to generate a synthetic proxy dataset, making the framework entirely self-contained — a promising direction we are exploring.
  3. New Ablation Study: We empirically demonstrate LEFF’s robustness to the proxy dataset via a new ablation study on the E2E NLG task. As shown in the table, performance remains remarkably stable across diverse corpora, from the in-domain WebNLG to general-purpose WikiText-103 and OpenWebText. While the in-domain data yields a slight, expected performance gain, the minimal fluctuation across all metrics confirms that LEFF is not reliant on a perfectly-matched proxy corpus. This high degree of robustness validates its practical applicability. We will include this study in the appendix.
| Proxy Dataset ($\mathcal{D}_{\text{proxy}}$) | BLEU | NIST | METEOR | ROUGE | CIDEr |
|---|---|---|---|---|---|
| WikiText-103 | 0.5765 | 8.0012 | 0.4041 | 0.6310 | 1.7450 |
| WebNLG | 0.5799 | 8.0296 | 0.4064 | 0.6346 | 1.7521 |
| OpenWebText | 0.5712 | 7.9688 | 0.4015 | 0.6259 | 1.7345 |

2. Theoretical Interpretation: Your interpretation of our theory (Q.1) is entirely correct. Our convergence bound (Thm. 1) explicitly shows that the performance floor depends on the compression quality, captured by the approximation error term $\bar{\Delta}^2$. This dependency is not a weakness but the core, deliberate trade-off our work models and manages.

Our contribution lies in making this trade-off explicit and demonstrating empirically (e.g., Fig. 6) that we can manage it to achieve a state-of-the-art balance between efficiency and performance. Our main results (Table 2) confirm that despite this inherent dependency, LEFF's approach is highly effective, outperforming other PEFT methods in challenging FL settings.
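For concreteness, the following is a minimal sketch of the kind of feature-matching distillation objective described in Section 3.3, i.e., matching intermediate hidden states and attention matrices between the compressed student and the teacher. The mean-squared-error form and the weights `alpha` and `beta` are illustrative assumptions, not the paper's exact loss.

```python
import torch.nn.functional as F

def feature_matching_loss(student_hidden, teacher_hidden,
                          student_attn, teacher_attn,
                          alpha=1.0, beta=1.0):
    """Sketch of a hidden-state + attention-matrix distillation objective.

    student_hidden / teacher_hidden: lists of [batch, seq_len, dim] tensors.
    student_attn / teacher_attn: lists of [batch, heads, seq_len, seq_len] tensors.
    The MSE form and the weights alpha, beta are assumptions for illustration.
    """
    hidden_loss = sum(F.mse_loss(s, t.detach())
                      for s, t in zip(student_hidden, teacher_hidden))
    attn_loss = sum(F.mse_loss(s, t.detach())
                    for s, t in zip(student_attn, teacher_attn))
    return alpha * hidden_loss + beta * attn_loss
```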


We hope these detailed quantitative results and clarifications have fully addressed your concerns. Thank you again for helping us improve the paper.

Comment

I want to thank the authors for the detailed and well-organized response. My concerns about the proxy dataset details and client-side efficiency are answered.

I still think the server-side overhead can become significant: even if the complexity does not formally depend on the number of clients, in practice it still does. For example, for a 40-layer LLM at a 50% compression ratio, there are C(40, 20) different layer compression configurations (which can effectively be constructed from unions of smaller subsets of layer blocks). We would need at most 10 configurations for 10 clients and 100 for 100 clients. Even if the server is assumed to have high compute in practice, I would like to see a detailed analysis of the potential cost at large scales: how many unique block configurations would need to be compressed for a K-layer model at a certain compression ratio, and what the layer block configurations look like for at least the 40-client experiment setting. And are these block configurations static, or can they change over rounds?

Comment

We sincerely thank the reviewer for the insightful follow-up and for the opportunity to provide a more detailed analysis of the server-side efficiency in our LEFF framework. We agree that server-side scalability is a critical consideration, and we have designed LEFF with this specifically in mind.


The reviewer's primary concern regarding scalability appears to be predicated on a combinatorial calculation of layer configurations (e.g., $C(40, 20)$). We would like to politely clarify that this assumes clients can select an arbitrary subset of layers. However, as detailed in Section 3.2 of our paper, LEFF employs a more structured approach: for a client with the capacity to train $L_i$ layers, the server samples a single consecutive block of layers. This design choice fundamentally constrains the combinatorial space. Consequently, for a $K$-layer model and a given client capacity $L_i$, the total number of possible blocks is not combinatorial but a small, linear value of $K - L_i + 1$.

This design has critical implications for server-side overhead. The server's workload is primarily driven by the number of unique compression tasks. Since each task corresponds to a unique block selection, the total number of these tasks is determined by the number of distinct client capacity tiers (i.e., unique $L_i$ values), not the total number of clients ($N$). In practice, the number of capacity tiers is significantly smaller than $N$. This finite and manageable set of configurations makes our server-side caching highly effective, preventing a computational bottleneck.


To illustrate this concretely, let's adopt the reviewer's scenario of a 40-layer LLM in our 40-client experiment. We can assume these clients fall into three heterogeneous capacity tiers:

  • 10 clients with low capacity (training 12 layers, $L_i = 12$).
  • 20 clients with medium capacity (training 20 layers, $L_i = 20$).
  • 10 clients with high capacity (training 28 layers, $L_i = 28$).

For this system, the maximum number of unique compressed models the server would ever need to prepare and cache is the sum of possibilities for each tier:

  • For $L_i = 12$: $40 - 12 + 1 = 29$ unique configurations.
  • For $L_i = 20$: $40 - 20 + 1 = 21$ unique configurations.
  • For $L_i = 28$: $40 - 28 + 1 = 13$ unique configurations.

The total cacheable space is merely $29 + 21 + 13 = 63$ distinct models. In any given round, the 40 participating clients will sample from this space, requesting at most 40 unique configurations. For example, a client assigned layers 11-30 would receive a model with layers {1-10} and {31-40} compressed: a single, cacheable task for the server.
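The counting argument above can be reproduced in a few lines, assuming the consecutive-block selection of Section 3.2 and the illustrative capacity tiers from this example:

```python
# Reproduces the cache-size arithmetic above: with consecutive-block
# selection, a client training L_i of K layers can be assigned one of
# K - L_i + 1 possible blocks.
K = 40                                # layers in the example model
tiers = {12: 10, 20: 20, 28: 10}      # capacity L_i -> number of clients

blocks_per_tier = {L: K - L + 1 for L in tiers}        # {12: 29, 20: 21, 28: 13}
total_cacheable = sum(blocks_per_tier.values())        # 63 distinct compressed models
max_requests_per_round = min(sum(tiers.values()), total_cacheable)  # at most 40

print(blocks_per_tier, total_cacheable, max_requests_per_round)
```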


Regarding the reviewer's final question, the block configurations assigned to clients are indeed dynamic. As described in Section 3.2, layer importance scores are recalculated each round based on the updated global model, leading to new sampling probabilities. While an individual client's assigned block changes, it is always sampled from the same fixed and limited set of $K - L_i + 1$ possibilities. This allows the server to effectively cache configurations across rounds. Coupled with the incremental "warm-start" distillation mentioned in our rebuttal, the cost of serving a cached configuration, or even computing it for the first time, remains minimal.


In summary, we hope this analysis demonstrates that through its structured consecutive-block selection mechanism and effective caching, the server-side overhead in LEFF is fundamentally contained and designed to scale efficiently, addressing the important concerns raised by the reviewer. We are grateful for the thorough feedback.

Comment

I thank the authors for the detailed response. My questions are answered and additional experimental results provided with more extensive analysis based on other reviewers' feedback are promising. I raise my recommendation.

Review (Rating: 3)

This work introduces a layer-wise fine-tuning strategy to mitigate the shortcomings of PEFT methods relative to full-parameter tuning in FL contexts with non-IID data, and enables clients to dynamically adjust their local training workload according to their available resources. The research problem is meaningful. However, this manuscript needs improvement in both the discussion of existing studies and the experimental setup.

Strengths and Weaknesses

Strengths

  1. The research problem is meaningful for FL communities.
  2. This work proposes the theoretical analysis on convergence.

Weaknesses

  1. Figure 1 provides limited information. Given the space constraints of a top-tier conference like NeurIPS, it is expected that all figures present substantial and meaningful content.
  2. The current comparison methods are outdated. It is advisable to incorporate recent studies published within the last two years as comparative baselines, such as [1], [2], and [3].
  3. This study lacks a sufficient review of the existing literature. For example, since PEFT has limitations under non-IID settings compared to full-parameter tuning, it would be natural to consider prior attempts at applying full-parameter tuning in the context of FL, such as [3], [4] and [5]. Currently, the paper lacks discussion on this relevant line of work.
  4. This manuscript lacks an explanation of why the proposed layer-wise tuning method can address the limitations of PEFT methods compared to full-parameter tuning under non-IID conditions. In fact, layer-wise tuning appears to be a specific form of PEFT, since it similarly fine-tunes a subset of model parameters.
  5. The proposed approach relies on a proxy dataset, which is often difficult to obtain in real-world scenarios. The paper lacks an investigation into how the choice of proxy dataset affects the performance of the proposed method.
  6. The experiments were conducted with DeBERTaV3 and GPT-2. The scales of these models are too small by current standards. It is recommended to experiment with more recent models that have several billion parameters.
  7. It is suggested to provide convergence curves of the proposed methods together with the compared ones, to experimentally demonstrate the correctness of the theoretical results.

[1] FwdLLM: Efficient Federated Finetuning of Large Language Models with Perturbed Inferences, ATC 2024.

[2] Federated fine-tuning of large language models under heterogeneous language tasks and client resources, NeurIPS 2024.

[3] Federated full-parameter tuning of billion-sized language models with communication cost under 18 kilobytes, ICML 2024.

[4] Ferret: Federated Full-Parameter Tuning at Scale for Large Language Models, arXiv 2409.

[5] Save It All: Enabling Full Parameter Tuning for Federated Large Language Models via Cycle Block Gradient Descent, arXiv 2406.

Questions

Please refer to Weaknesses.

Limitations

Limitations are provided in the Conclusion. It is recommended to have a separate section for this content.

Final Justification

The authors have addressed most of my concerns.

Formatting Issues

N/A

Author Response

(For clarity in our response, we refer to the reviewer's points using the following convention: W.x (Weaknesses), Q.x (Questions), and L.x (Limitations).)

We sincerely thank the reviewer for their thorough and constructive feedback. Your suggestions have been instrumental in helping us strengthen our paper's evaluation, clarify its core contributions, and improve its structure. We have grouped our responses to address the main themes of your review.


(W.2, W.3, W.6) On SOTA Models, Baselines, and Literature Review

Your feedback regarding the need for more recent baselines, SOTA models, and a broader literature review (W.2, W.3, W.6) is well-taken. We agree that this is essential for contextualizing our work.

To address this directly, we conducted new experiments on Llama-3.1-8B with the MMLU benchmark, including FLoRA (NeurIPS 2024) [1] and FlexLoRA (NeurIPS 2024) [2] as representative SOTA baselines. The results are below:

| Method | MMLU (5-shot Avg.) | Peak GPU Memory (Client) |
|---|---|---|
| FlexLoRA (NeurIPS 2024) | 55.7 | 46.868 GB |
| FLoRA (NeurIPS 2024) | 55.2 | 46.868 GB |
| LEFF (Ours) | 57.5 | 29.881 GB |

The results show LEFF outperforms these recent NeurIPS 2024 baselines by 1.8-2.3 points while reducing client peak memory by over 36%. This demonstrates that LEFF's principles are not only scalable but highly effective for modern, billion-parameter models.

Furthermore, we have studied the suggested papers on federated full-parameter tuning ([3], [4], [5]) and will add a detailed discussion to our Related Work section. We will clarify LEFF's novelty by contrasting its approach:

  • Unlike ZOO-based methods [3] that approximate gradients, LEFF utilizes standard first-order optimization (backpropagation). This approach is significantly more computationally efficient on the client side, avoiding the high costs of multiple forward passes per step and typically leading to faster convergence.
  • In contrast to update-compression methods [4] which compute a full gradient update before compressing it, LEFF restricts the update scope a priori to a subset of layers. This design directly reduces the client's peak memory and computational load during training by confining backpropagation, rather than only saving communication bandwidth post-computation.
  • While cyclical methods [5] update model blocks in a sequential turn-based manner, LEFF employs a standard synchronous and parallel training protocol common in FL. Moreover, our dynamic, importance-based layer sampling enables more adaptive training by focusing resources on the most impactful parameters each round, instead of following a static, pre-determined cycle.

[1] FLoRA: Federated Fine-Tuning Large Language Models with Heterogeneous Low-Rank Adaptations, NeurIPS 2024.

[2] Federated fine-tuning of large language models under heterogeneous language tasks and client resources, NeurIPS 2024.

[3] Federated full-parameter tuning of billion-sized language models with communication cost under 18 kilobytes, ICML 2024.

[4] Ferret: Federated Full-Parameter Tuning at Scale for Large Language Models, arXiv 2409.

[5] Save It All: Enabling Full Parameter Tuning for Federated Large Language Models via Cycle Block Gradient Descent, arXiv 2406.


(W.4) Clarifying LEFF's Advantage Over Conventional PEFT

We appreciate your insightful question (W.4) on why LEFF, as a form of PEFT, overcomes the limitations of other PEFT methods under non-IID conditions. The key distinction lies in the nature and capacity of the parameters being tuned.

  1. Conventional PEFT (e.g., LoRA) freezes the original model and injects a small number of new, low-rank auxiliary parameters. This fundamentally constrains model adaptation to a low-dimensional space, which is insufficient to capture the diverse data distributions across clients in non-IID settings [1].
  2. LEFF, in contrast, tunes the original, full-rank parameters of entire layers. While each client trains a subset of layers, these updates have the full expressive power of full-parameter tuning for those layers. This high-capacity adaptation is crucial for fitting the complex, client-specific data. From the global model's perspective, all parameters are gradually fine-tuned over the course of training. This principle is supported by recent findings in centralized tuning (e.g., LISA [2]), where layer-wise methods have also demonstrated superior performance over LoRA-style tuning.

The following table for DeBERTaV3-Large highlights this difference. LEFF tunes significantly more parameters than other PEFT methods, granting it greater expressive power, yet it achieves the lowest peak memory by only loading the necessary layers into GPU memory.

| | FedAvg | FedLoRA | FedBitFit | LEFF (Ours) | SLoRA |
|---|---|---|---|---|---|
| Client Trainable Params | 303M | 3.5M | 272K | 13.6M | 3.5M |
| Client Peak GPU Mem (GB) | 9.361 | 6.660 | 5.488 | 3.005 | 6.660 |

This design explains why LEFF's performance under high data heterogeneity is substantially better than conventional PEFT methods, as shown empirically in our Table 1 and Figure 5. We will add this clarification to the introduction.

[1] SLoRA: Federated Parameter Efficient Fine-Tuning of Language Models, NeurIPS 2023.

[2] LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning, NeurIPS 2024.
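To illustrate the layer-wise tuning principle discussed above (full-rank updates on a selected consecutive block of layers, with everything else frozen), here is a minimal PyTorch-style sketch. The `model.transformer.h` attribute path assumes a GPT-2-style module layout and is an assumption for illustration, not the authors' code.

```python
def train_only_block(model, start, end):
    """Sketch: unfreeze only transformer layers [start, end); freeze the rest.

    Assumes a GPT-2-style layout where `model.transformer.h` holds the list of
    transformer blocks; the attribute path is an assumption for other models.
    """
    for p in model.parameters():
        p.requires_grad = False            # freeze embeddings, head, all layers
    for layer in model.transformer.h[start:end]:
        for p in layer.parameters():
            p.requires_grad = True         # full-rank updates for the selected block

# e.g. a medium-capacity client (L_i = 20) on a 40-layer model might be
# assigned the consecutive block [10, 30).
```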


(W.5, W.7) Empirical Validation: Proxy Dataset and Convergence

We have conducted new analyses to address your concerns about the proxy dataset (W.5) and convergence curves (W.7).

Proxy Dataset Robustness: To empirically validate our method's robustness, we conducted a new ablation study on the E2E NLG task by varying the proxy dataset.

  1. Robustness to Data Distribution: Our method is robust to the choice of proxy data because we distill functional representations, not task-specific knowledge. As described in Section 3.3, our objective is to match the intermediate hidden states and attention matrices. This process preserves general representational capabilities and is far less sensitive to the specific domain of the input data.
  2. Practical Availability: Since our method only requires unlabeled text, readily available public corpora (e.g., C4, Wikipedia) are sufficient. In scenarios where external data is prohibited, the server can use the global model to generate a synthetic proxy dataset, making the framework entirely self-contained — a promising direction we are exploring.
  3. New Ablation Study: We empirically demonstrate LEFF’s robustness to the proxy dataset via a new ablation study on the E2E NLG task. As shown in the table, performance remains remarkably stable across diverse corpora, from the in-domain WebNLG to general-purpose WikiText-103 and OpenWebText. While the in-domain data yields a slight, expected performance gain, the minimal fluctuation across all metrics confirms that LEFF is not reliant on a perfectly-matched proxy corpus. This high degree of robustness validates its practical applicability. We will include this study in the appendix.
| Proxy Dataset ($\mathcal{D}_{\text{proxy}}$) | BLEU | NIST | METEOR | ROUGE | CIDEr |
|---|---|---|---|---|---|
| WikiText-103 | 0.5765 | 8.0012 | 0.4041 | 0.6310 | 1.7450 |
| WebNLG | 0.5799 | 8.0296 | 0.4064 | 0.6346 | 1.7521 |
| OpenWebText | 0.5712 | 7.9688 | 0.4015 | 0.6259 | 1.7345 |

Convergence Analysis: We agree that showing convergence is vital. Due to format restrictions, we present the validation loss data below (under high heterogeneity, $\alpha = 0.05$) and will include the full plot in the final paper.

| Round | FedAvg | FedBitFit | FedLoRA | SLoRA | LEFF (Ours) |
|---|---|---|---|---|---|
| 1 | 2.3732 | 2.2598 | 2.6286 | 2.5209 | 2.4223 |
| 5 | 1.1091 | 1.1118 | 1.6372 | 1.3789 | 1.2227 |
| 10 | 0.6797 | 1.0854 | 1.1147 | 0.9168 | 0.7670 |
| 15 | 0.6019 | 1.1189 | 0.9427 | 0.8596 | 0.6416 |
| 20 | 0.5962 | 1.0840 | 0.9576 | 0.8317 | 0.6515 |

This data empirically validates our theory (Thm. 1), showing that LEFF converges stably and efficiently, achieving a final loss much closer to full fine-tuning than other PEFT methods.


(W.1, L.1) Paper Structure and Presentation

Finally, we appreciate your suggestions for improving the paper's presentation (W.1, L.1).

  • We agree with your assessment of Figure 1. We will remove it and reallocate the space to our new experimental results, which provide stronger empirical evidence.
  • Following your advice, we will create a dedicated "Section 6: Limitations and Future Work" before the conclusion to improve clarity and transparency.

We are grateful for your thorough review. We believe these extensive new experiments, clarifications, and structural changes directly address all your concerns and significantly strengthen the paper.

Comment

Thanks for the detailed response. After reading through these contents, I think most of my concerns are addressed. I will raise my score accordingly.

Review (Rating: 4)

This paper introduces LEFF, a novel framework designed to efficiently fine-tune LLMs in FL settings, especially for resource-constrained edge devices. LEFF enables clients to fine-tune only selected layers of a model based on their computational capacity and leverages importance-based sampling to prioritize impactful layers. To reduce overhead, the server compresses unselected layers via knowledge distillation before sending a custom model to each client. The method incorporates a layer-wise aggregation strategy and is backed by theoretical convergence guarantees. Experimental results show that LEFF outperforms other FL fine-tuning approaches under heterogeneous data and system conditions while preserving performance close to full fine-tuning.

Strengths and Weaknesses

Strengths:

  1. LEFF significantly reduces client-side overhead by allowing selective layer-wise training, making it suitable for edge devices with limited resources.
  2. The method includes formal convergence guarantees, reinforcing its reliability and scalability.
  3. The approach effectively handles data and system heterogeneity using importance-based sampling and dynamic model customization.

Weaknesses:

  1. Layer compression and partial updates can introduce approximation errors that cap model accuracy, especially in high-compression scenarios.
  2. The compression process relies on publicly available proxy datasets, which may not always be representative or available.
  3. The experiments do not include evaluations on more recent LLMs such as LLaMA3, nor do they incorporate benchmarks like MMLU or MT-Bench, which limits the assessment of LEFF's effectiveness on state-of-the-art models and tasks.

Questions

  1. What are the computational costs of calculating layer importance at each round, and is this feasible at scale?
  2. Have you considered applying LEFF to newer LLMs such as LLaMA-3 or Mistral? Would LEFF also work for them?
  3. Why were benchmarks like MMLU or MT-Bench not included, and how might LEFF perform on those tasks?

Limitations

Compared to more recent works like [FLoRA](https://neurips.cc/virtual/2024/poster/95025) (NeurIPS 2024), which evaluate on cutting-edge models and benchmarks, this paper primarily focuses on older architectures (e.g., GPT-2, DeBERTaV3) and traditional tasks (e.g., GLUE). This limits the ability to assess LEFF's applicability and competitiveness on SOTA models and evaluation standards such as MMLU or MT-Bench.

Final Justification

I would give this paper a borderline accept because it is a complete work with sound theory, sufficient experiments, and well-written manuscripts.

Formatting Issues

No.

Author Response

(For clarity in our response, we refer to the reviewer's points using the following convention: W.x (Weaknesses), Q.x (Questions), and L.x (Limitations).)

We sincerely thank the reviewer for their insightful and constructive feedback, which has helped us to significantly strengthen our paper. We have grouped our responses thematically to address the core concerns raised about SOTA model applicability, approximation error, proxy data dependency, and computational cost.


(W.3, Q.2, Q.3, L.1) Performance on SOTA Models (Llama-3.1-8B) & Benchmarks (MMLU)

A central theme of your feedback (concerns W.3, Q.2, Q.3, L.1) was the need to evaluate LEFF on more recent models and challenging benchmarks to assess its competitiveness. We agree this is crucial and have conducted new experiments during the rebuttal period to address this directly.

Our initial experiments on GPT-2/DeBERTaV3 were designed to rigorously validate LEFF’s core contributions to efficiency and robustness in controlled, heterogeneous FL settings, where tasks like GLUE allow for clear analysis of factors like label skew.

To prove LEFF's effectiveness on modern architectures, we have now evaluated it on Llama-3.1-8B using the MMLU benchmark. We included FLoRA (NeurIPS 2024) [1] and FlexLoRA (NeurIPS 2024) [2] as highly relevant SOTA baselines. The results are summarized below:

| Method | MMLU (5-shot Avg.) | Peak GPU Memory (Client) |
|---|---|---|
| FlexLoRA (NeurIPS 2024) | 55.7 | 46.868 GB |
| FLoRA (NeurIPS 2024) | 55.2 | 46.868 GB |
| LEFF (Ours) | 57.5 | 29.881 GB |

As shown, LEFF not only applies to SOTA models but excels, outperforming recent NeurIPS 2024 methods FLoRA and FlexLoRA by 1.8-2.3 points on MMLU while simultaneously reducing client-side peak memory by over 36%. These new results confirm that LEFF is highly competitive and its benefits are even more pronounced on larger models where client-side resources are the primary bottleneck. We will add these results to the appendix.

[1] FLoRA: Federated Fine-Tuning Large Language Models with Heterogeneous Low-Rank Adaptations, NeurIPS 2024.

[2] Federated fine-tuning of large language models under heterogeneous language tasks and client resources, NeurIPS 2024.


(W.1) On Approximation Error and Model Accuracy

Regarding your concern (W.1) that layer compression introduces approximation errors that cap accuracy, we agree that managing this error is critical. LEFF is designed from the ground up to actively model and minimize this trade-off.

  1. System-Wide Robustness: The "high-compression scenario" only applies to the most resource-constrained clients. As detailed in Sec. 3.1, LEFF is heterogeneous-aware. Higher-capacity clients perform updates with less (or no) compression, submitting high-fidelity gradients. During server-side aggregation, these high-quality updates compensate for the higher-error gradients from weaker clients, preventing the global model's accuracy from being capped by the worst-case devices. Our strong results in Table 2, where LEFF nearly matches full fine-tuning (FedAvg), empirically validate that this error is well-controlled.
  2. Principled Error Minimization: LEFF is not a random compression scheme. The server tailors a model for each client by using importance-based sampling (Sec. 3.2) to select critical layers for full-parameter updates, while using knowledge distillation (Sec. 3.3) to create compact, function-preserving representations of the rest. This principled approach ensures that clients perform local training with high-fidelity gradients, directly minimizing the approximation error $\bar{\Delta}^2$ from our convergence analysis (Thm. 1).

(W.2) On the Role and Robustness of the Proxy Dataset

We thank the reviewer for raising the practical point (W.2) about the dependency on a proxy dataset. We have designed LEFF to be robust to this.

  1. Robustness to Data Distribution: Our method is robust to the choice of proxy data because we distill functional representations, not task-specific knowledge. As described in Section 3.3, our objective is to match the intermediate hidden states and attention matrices. This process preserves general representational capabilities and is far less sensitive to the specific domain of the input data.
  2. Practical Availability: Since our method only requires unlabeled text, readily available public corpora (e.g., C4, Wikipedia) are sufficient. In scenarios where external data is prohibited, the server can use the global model to generate a synthetic proxy dataset, making the framework entirely self-contained — a promising direction we are exploring.
  3. New Ablation Study: We empirically demonstrate LEFF’s robustness to the proxy dataset via a new ablation study on the E2E NLG task. As shown in the table, performance remains remarkably stable across diverse corpora, from the in-domain WebNLG to general-purpose WikiText-103 and OpenWebText. While the in-domain data yields a slight, expected performance gain, the minimal fluctuation across all metrics confirms that LEFF is not reliant on a perfectly-matched proxy corpus. This high degree of robustness validates its practical applicability. We will include this study in the appendix.
| Proxy Dataset ($\mathcal{D}_{\text{proxy}}$) | BLEU | NIST | METEOR | ROUGE | CIDEr |
|---|---|---|---|---|---|
| WikiText-103 | 0.5765 | 8.0012 | 0.4041 | 0.6310 | 1.7450 |
| WebNLG | 0.5799 | 8.0296 | 0.4064 | 0.6346 | 1.7521 |
| OpenWebText | 0.5712 | 7.9688 | 0.4015 | 0.6259 | 1.7345 |

(Q.1) Computational Cost of Layer Importance

In response to your question (Q.1), the computational cost of calculating layer importance is minimal and highly scalable.

The calculation uses a first-order Taylor approximation, requiring only a single forward and backward pass on the global model. This operation is performed once per round on the central server. Crucially, its cost is independent of the number of clients in the federated network. Therefore, the overhead remains constant and negligible as the system scales, posing no practical bottleneck.
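A minimal sketch of this one-pass computation is shown below. The per-layer score, the sum of $|w \cdot \partial \mathcal{L} / \partial w|$ over a layer's parameters, is a common first-order Taylor importance measure and is stated here as an assumption; it is not necessarily the paper's exact formula.

```python
import torch

def layer_importance(model, layers, batch, loss_fn):
    """Sketch: one forward + one backward pass, then a per-layer first-order
    Taylor score sum(|w * grad|). The exact scoring formula is an assumption."""
    model.zero_grad()
    loss = loss_fn(model, batch)   # single forward pass on a server-side batch
    loss.backward()                # single backward pass
    scores = []
    with torch.no_grad():
        for layer in layers:
            s = sum((p * p.grad).abs().sum().item()
                    for p in layer.parameters() if p.grad is not None)
            scores.append(s)
    total = sum(scores) or 1.0
    return [s / total for s in scores]   # normalized sampling probabilities
```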


Once again, we thank the reviewer for their valuable and detailed feedback. We believe these clarifications and new experiments, which will be integrated into the final version, substantially strengthen our paper's contribution and address all raised concerns.

Comment

The authors have provided satisfactory clarifications and demonstrations, and the research manuscript presents significant technical merit. I recommend its acceptance.

Final Decision

This paper introduces a framework named LEFF for efficiently fine-tuning LLMs in FL settings, especially on resource-constrained edge devices. The proposed framework allows each client to fine-tune only selected layers based on its computational capacity, and employs a layer-wise aggregation strategy to ensure effective model updates. Experimental results demonstrate that LEFF achieves performance comparable to full fine-tuning and outperforms existing methods under heterogeneous conditions.

Reviewers express appreciation for the formal convergence guarantees and the extensive experiments in this paper. The author responses have addressed most of the major concerns, including those about the evaluation scope and the use of proxy datasets. Most reviewers give this paper positive ratings in light of its technical innovation, rigorous theory, and well-written manuscript. It would be beneficial to include the results and analysis provided during the author-reviewer discussion period in the revised paper.

Overall, the recommendation is to accept this paper.