/10

Poster4 位审稿人

最低2最高3标准差0.4

ICML 2025

NTK-DFL: Enhancing Decentralized Federated Learning in Heterogeneous Settings via Neural Tangent Kernel

Gabriel Thompson,Kai Yue,Chau-Wai Wong,Huaiyu Dai

OpenReview PDF

提交: 2025-01-23更新: 2025-07-24

摘要

关键词

Federated LearningDecentralized Federated LearningNeural Tangent Kernel

评审与讨论

审稿意见

评分: 22025-03-03

The paper proposes NTK-DFL, a decentralized federated learning method that uses the neural tangent kernel (NTK) to mitigate the effect of data heterogeneity in the clients. The authors prove the convergence guarantee of the proposed method and show that it outperforms existing methods such as DFedAvg in the numerical experiments.

update after rebuttal

As the authors claim, the proposed method might be effective in some settings, but I feel that such situations are rare in FL scenarios. Regarding novelty, I acknowledge that this paper is the first work to extend NTK-FL to decentralized scenarios, but the algorithm is very similar to the central one. I think this paper is borderline, and I have decided to maintain my score.

给作者的问题

What is the main difficulty in extending NTK-FL to the decentralized setting?
I would like to see a clear comparison of the communication and computational costs between the proposed method and the baselines.

论据与证据

Claims are supported by theoretical results or numerical experiments.

方法与评估标准

The proposed method incurs a high communication cost because clients send Jacobian matrices to their neighbors. This is undesirable since communication cost is often a bottleneck in decentralized federated learning.

理论论述

I did not rigorously check the correctness of the proofs.

实验设计与分析

The experimental setting is reasonable.

补充材料

There seems to be no supplementary material.

与现有文献的关系

The proposed method is a decentralized extension of NTK-FL proposed by Yue et al. While the extension appears straightforward, the theoretical analysis may be non-trivial.

遗漏的重要参考文献

Nothing in particular

其他优缺点

strength
1. The proposed method provably mitigates the effect of data heterogeneity in the clients.
2. The numerical experiments are detailed.
weakness
1. Communication cost is larger than existing methods since clients send Jacobian matrices to neighbors.
2. The computational cost at each client is also large.
3. Novelty is limited since the proposed method is a straightforward extension of NTK-FL to the decentralized setting.

其他意见或建议

$\bar{w}_i^{(k)}$ at stage 1 in Fig. 1 seems to be a typo of $w_i^{(k)}$ .

作者回复

2025-04-01

Responses to Reviewer Zq37

"The proposed method incurs a high communication cost because clients send Jacobian matrices to their neighbors. This is undesirable since communication cost is often a bottleneck in decentralized federated learning."

We acknowledge the reviewer’s concern regarding communication overhead. Indeed, NTK-DFL inherently transmits more data per communication round compared to more traditional approaches such as FedAvg, due to the sharing of Jacobian matrices rather than gradients or weights alone. However, we emphasize that the number of communication rounds remains an important practical metric, particularly in scenarios involving:

Large per-round communication latency, where fewer rounds can significantly reduce overall training time (e.g., compression or encryption of weights/gradients, heavy preprocessing of input data).
Limited device availability, in which fewer rounds allow more efficient training when devices are intermittently available.
High bandwidth applications, where ample network bandwidth (e.g., gigabit home internet) can accommodate a large data volume for each communication round, making the number of communication rounds the dominant factor in training efficiency.
Synchronization delays, where each round must wait for all devices to complete computation, with the slowest device bottlenecking progress, thus making the number of communication rounds an important factor.

We do not claim to be a silver bullet for federated learning. Instead, we provide an alternative method for when minimizing communication rounds is critical. NTK-DFL can also be deployed in conjunction with gradient-based approaches, reducing the number of communication rounds in the early stages of training.

In addition, for application scenarios where NTK-DFL is more appropriate while communication overhead is a major concern, we had also provided several mitigation tactics in the submitted manuscript to improve communication efficiency, including datapoint subsampling and compression techniques, e.g., sparsification and quantization. These tactics have shown substantial reductions in communication costs, allowing for trade-offs between expressiveness and overhead (see Appendix D.2). Also, please see Response 3 to Reviewer Qo5R for a description of how we decoupled communication and computation overhead from scaling with the number of neighbors.

"Novelty is limited since the proposed method is a straightforward extension of NTK-FL to the decentralized setting."

While our method builds on NTK-FL, we are the first to extend it to the decentralized federated learning (DFL) setting, where the lack of a central aggregator makes the adaptation nontrivial; i.e., not all FL approaches work in DFL. Surprisingly, we also found that combining NTK-based weight evolution with decentralized model averaging yields an exciting synergy: it promotes beneficial inter-client variance and markedly improves generalization in heterogeneous settings. Lastly, we propose strategies for overhead minimization such as Jacobian batching, data subsampling, and the use of a clustered topology that are not present in NTK-FL.

"What is the main difficulty in extending NTK-FL to the decentralized setting?"

Please see Response 4 to Reviewer Qo5R.

"I would like to see a clear comparison of the communication and computational costs between the proposed method and the baselines."

Please see Response 2 to Reviewer oV9C for a clear comparison of communication cost versus the test accuracy for NTK-DFL and DFedAvg, the best-performing baseline. Regarding computational cost, computing the Jacobian in the NTK-based approach “does not incur additional client computational overhead compared to [D]FedAvg, since calculating the Jacobian tensor enjoys the same computation efficiency as computing aggregated gradients” (Yue et al., 2022). While the decentralized setting requires each client to compute Jacobians for all neighbors, adopting a clustered topology (as described in Response 3 to Reviewer Qo5R) prevents both the computational and communication burden of the Jacobians from scaling with the number of neighbors. Lastly, analogous to local epochs in gradient-based approaches, our runtime for weight evolution scales as $O(t)$ , where $t$ represents the number of local evolution steps."

"at stage 1 in Fig. 1 seems to be a typo of $w_i^{(k)}$ ."

We thank the reviewer for their careful attention to detail. We have fixed the notation.

References

Kai Yue, Richeng Jin, Ryan Pilgrim, Chau-Wai Wong, Dror Baron, and Huaiyu Dai. Neural tangent kernel empowered federated learning. In Proceedings of the 39th International Conference on Machine Learning, Jul 2022.

审稿意见

评分: 32025-03-13

This paper studies decentralized federated learning and leverages the neural tangent kernel to improve performance and convergence under heterogeneous settings. The proposed method is evaluated on three public datasets and shows improved performance.

给作者的问题

What are the specific challenges when applying the NTK to the decentralized FL setting compared to FL? Please further elaborate.
Is this method applicable to different network topologies? In addition, it may not be true in the real-world to have topology changes every communication round, what would be the results with a fixed topology?

论据与证据

The claims are supported by method design and experimental validations.

方法与评估标准

The proposed method and evaluation make sense in general but lack some comparisons regarding the computational cost and different topologies.

理论论述

The theoretical claims and proofs seem to be correct.

实验设计与分析

The experimental design and analysis are sound in general.

补充材料

The supplementary material provides more details and looks good.

与现有文献的关系

This paper contributes to the general federated learning community.

遗漏的重要参考文献

其他优缺点

Strength

Decentralized federated learning with statistical heterogeneity is a challenging topic.
The theoretical analysis shows better convergence than DFedAvg.
The proposed method shows better performance than the compared methods.

Weakness

The literature discussed in this paper was mainly published around 2022; more recent literature should be included in the discussion.
The motivation for leveraging NTK in DFL needs to be justified.
The proposed method needs to calculate the Jacobians for each neighbor, which suffers a significant computational cost.
The network topologies need to be clearly specified, and different topologies (e.g., line, ring) should be explored in experiments.

其他意见或建议

作者回复

2025-04-01

Responses to Reviewer Qo5R

"More recent literature should be included in the discussion."

We thank the reviewer for this suggestion. In the originally submitted manuscript, we had included comparisons with DFedSAM (Shi et al., 2023), which is, to our knowledge, the most recent, relevant DFL baseline for our comparisons. Additionally, we plan to update our manuscript by referencing more recent DFL papers such as Yuan et al. (2024) and Guo et al. (2025) throughout the introduction and related work sections. We also plan to cite recent personalized FL works for context but do not include experimental comparisons due to differing problem settings.

"The motivation for leveraging NTK in DFL needs to be justified."

Our motivation for leveraging NTK in DFL arises directly from the challenge of statistical heterogeneity, which is even more pronounced in DFL than in centralized FL. The expressiveness of NTK-based approaches offers a natural solution to address this heterogeneity, making them particularly suitable for decentralized settings. This motivation was highlighted in the introduction of the original manuscript, where we discussed the previously demonstrated robustness of NTK-based methods to statistical heterogeneity in centralized settings (Yu et al., 2022; Yue et al., 2022).

"The proposed method needs to calculate the Jacobians for each neighbor, which suffers a significant computational cost."

We share the reviewer’s concern regarding the computational cost of calculating Jacobians for each neighbor. To mitigate this issue, we had outlined several tactics in Appendix D of the original manuscript to reduce overall computation, including data subsampling and Jacobian batching.

To further address this concern, we plan to offer another mitigation strategy. When clients are arranged in a clustered topology—where all clients in a cluster share a common aggregated weight—each client only needs to compute a single set of Jacobians per round. This significantly reduces per-client computational cost. After we ran additional experiments, we saw that NTK-DFL maintains its test accuracy under this topology, demonstrating that this optimization does not degrade performance. Additionally, the weight evolution step can be delegated to a single representative client within each cluster, further alleviating the computational burden on the remaining clients. Together, these strategies decouple computation and communication overhead from the number of neighbors. We plan to add a detailed description of this approach in Appendix D.2.

"What are the specific challenges when applying the NTK to the decentralized FL setting compared to FL?"

The unique challenges introduced by applying the NTK to decentralized FL are as follows. First, there is increased communication overhead, as clients communicate with their neighbors rather than a central server. Second, computational costs rise, since clients must compute Jacobians with respect to their own weights, as well as each of their neighbors’ weights. The original approach by Yue et al. (2022) uses the full batch of client data each round, which can induce memory constraints. To address these challenges, we had introduced in the submitted manuscript Jacobian batching and datapoint subsampling to reduce memory and communication overhead, finding that NTK-DFL is resilient to aggressive compression.

“The network topologies need to be clearly specified”

We had specified the specific baseline topology that we used for experiments in the original manuscript under Network Topologies:

“A sparse, time-variant $\kappa$ -regular graph with $\kappa=5$ was used as the standard topology for experimentation…”

"Is this method applicable to different network topologies? […] what would be the results with a fixed topology?"

In the original manuscript, we had studied the performance of NTK-DFL across the ring topology, d-regular topology, and Erdos-Renyi random topology in Figure 8 of Appendix B. We also studied the effect of various levels of topological sparsity in Figure 3. Lastly, we had performed an experiment on a fixed d-regular topology in Figure 11 from Appendix B and compared it with DFedAvg (the next best performing baseline). Both models suffered slight performance degradation, though NTK-DFL still outperforms DFedAvg. We point out that the time-varying topology has been used in recent DFL work, e.g., Shi et al. (2023).

References

Guo, P. et al. Chapter 13 - enhancing mri reconstruction with cross-silo federated learning. 2025.
Shi, Y. et al. Improving the model consistency of decentralized federated learning. ICML, 2023.
Kai Yue, et al. Neural tangent kernel empowered federated learning. ICML, 2022.
Yuan, L., et al. Decentralized federated learning: A survey and perspective. IEEE IOT Journal. 2024.
Yu, et al. Convexifying federated learning using bootstrapped neural tangent kernels. Neurips, 2022.

审稿人评论

2025-04-07

Thank you for the detailed clarifications and explanations. Most of my concerns have been addressed. I appreciate the efforts the authors have made to reduce the computational cost. However, I am still concerned that the cost remains significantly higher compared to the baseline methods. I have increased my rating score and encourage the authors to provide a clearer comparison of the computational cost with alternative approaches.

审稿意见

评分: 32025-03-14

The paper proposes to integrate NTK training in DFL. Numerical results show NTK-DFL achieves higher accuracy under the same level of communication round.

给作者的问题

None.

论据与证据

The claims are well supported.

方法与评估标准

The benchmark datasets and the algorithms are relevant for comparison.

理论论述

The theorem 4.5 looks correct.

实验设计与分析

Authors should compare communication overhead in bits (or MB) for all benchmark algorithms.

Let's consider three scenarios.

Scenario 1: DFL regime. Client $i$ only shares the $d$ -dimensional gradient of model parameters with the neighbors in each round. Communication complexity on each edge is $O(d)$ .
Scenario 2: NTK-DFL regime. Client $i$ shares the Jacobian $J_{i,j}^{(k)}$ with the neighbor $j$ in each round. Communication complexity from $i$ to $j$ is $O(d N_i d_2)$
Scenario 3: Data sharing regime. Client $i$ shares all its data with neighbor $j$ at the beginning of training. Communication complexity from $i$ to $j$ is $O(N_i d_2)$ .

In terms of communication complexity, 2 is worse than 3. In terms of convergence, 2 is worse than 1, according to Theorem 4.5. One might argue that 2 is better than 3 because Jacobian sharing might protect privacy after the Jacobian sparsification techniques discussed in supplementary material E.

Therefore, it is hard to imagine a scenario where 2 has any benefit in terms of communication overhead. To show 2 is better than 1 and 3 in terms of communication complexity, authors should show acc vs. communication bit curve instead of acc vs. communication round curve for all benchmark algorithms. The results are only seen in Figure 16 in the supplementary material, which does not show an advantage of 2 over 1.

Also, in the implementation, it seems that the authors use the eigen-decomposition of $H$ to evaluate the exponential map. This is very unconventional in deep learning, where people usually use gradient descent. A running time comparison would be needed to inform the practitioners.

补充材料

I have briefly read the supplementary material.

与现有文献的关系

NTK is an interesting theoretical topic in deep learning literature, and so is DFL.

遗漏的重要参考文献

None.

其他优缺点

None.

其他意见或建议

There are some typos in Algorithm 4: Weight Evolution.

作者回复

2025-04-01

Responses to Reviewer oV9C

"It is hard to imagine a scenario where [the NTK-DFL regime] has any benefit in terms of communication overhead."

Please see Response 1 to Reviewer Zq37 for a detailed discussion.

“Authors should show acc vs. communication bit curve instead of acc vs. communication round curve for all benchmark algorithms.”

We note that communication rounds are also an important metric in certain cases (e.g., encoding delays, high-bandwidth applications, etc.) as outlined in Response 1 to Reviewer Zq37. Taking into account the reviewer’s suggestion, we plan to add a figure in Appendix E comparing communication volume across baseline algorithms. We plan to update Figure 16 in Appendix E to include a comparison of communication volume with other baselines. We provide a clear comparison of communication costs between our method and the next best-performing baseline in the table below.

Comparison of NTK-DFL and DFedAvg convergence across communication volume thresholds.

Threshold Communication Volume (MB)	Test Accuracy NTK-DFL	Test Accuracy DFedAvg	Comm. Rounds NTK-DFL	Comm. Rounds DFedAvg
10	79.2	81.2	4	28
20	82.4	83.1	9	58
30	83.5	84.3	14	87
40	84.2	84.8	19	116
50	84.6	85.2	24	146

With the original compression methods from Appendix E and a clustered topology (see Response 3 to Reviewer Qo5R), NTK-DFL performs comparably to DFedAvg, as illustrated in the table above. More specifically, NTK-DFL suffers a small 0.5-1% test accuracy degradation in exchange for converging in 6x fewer communication rounds. DFedAvg is shown with a regular topology but performs similarly to a clustered one ( $\< 0.1$ % difference in test accuracy at 1000 comm. rounds). Only the best-performing baseline, DFedAvg, is shown; results for other baselines will be added in the updated manuscript.

"In the implementation, it seems that the authors use the eigen-decomposition of $H$ to evaluate the exponential map. This is very unconventional in deep learning, where people usually use gradient descent. A running time comparison would be needed to inform the practitioners."

No, eigendecomposition on the matrix $H$ is not required. Instead, the weights are evolved using a differential equation solver (Chen, 2018) according to the more general differential equation: $\frac{d}{dt} f\left(\mathbf{X}_i; \bar{\mathbf{w}}_j^{(k, t)}\right) = -\eta \mathbf{H}_j^{(k)} \nabla_f \mathcal{L}.$ To perform weight evolution, the client evolves the initial residual matrix over a series of timesteps specified by the user. For user-specified timesteps, the loss at that time is found using the evolved residual. Then, the best-performing weights are calculated and selected for the next communication round. Analogous to local epochs in gradient-based federated learning methods, the runtime scales as $O(t)$ , where $t$ is the number of local evolution steps in the linearized NTK region. This process had been described in depth in Appendix C of the original manuscript.

"There are some typos in Algorithm 4: Weight Evolution."

We thank the reviewer for noticing these typos. We will correct the notation and relevant equations.

References

Chen, R. (2018). Torchdiffeq: PyTorch Implementation of Differentiable ODE Solvers. https://github.com/rtqichen/torchdiffeq. Accessed Mar 26, 2025

审稿人评论

2025-04-05

Thank authors for the detailed response and the additional experiment results! The new table does not show the advantage of NTK-DFL over DFedAvg. However, they are still informative for practitioners. Thus, I increase the rating to acknowledge the effort. In the future, I feel matrix sketching might be useful to further reduce communication costs.

审稿意见

评分: 32025-03-18

The paper proposes NTK-DFL, a decentralized federated learning method that uses the Neural Tangent Kernel (NTK) for weight evolution, replacing SGD with Jacobian-based updates. It integrates per-round parameter averaging and final model averaging to address statistical heterogeneity. Experiments show the improved accuracy of the proposed method in heterogeneous settings (α = 0.1), and robustness across datasets (Fashion-MNIST, FEMNIST, MNIST) and topologies. A theoretical convergence bound is provided, showing dependence on local iterations (T) and spectral gap.

给作者的问题

Why do model variance and final test accuracy on Fashion-MNIST show positive correlations? can you provide more detailed discussion or analysis for that relation? how can we define moderate or severe dissimilarity between client weights?

论据与证据

The claims—faster convergence, heterogeneity resilience, and robustness—are supported by empirical results and theoretical bounds (Theorem 4.5). The evidence is clear, with detailed experiments across multiple settings. However, the proposed methods have several weaknesses. First, NTK-based approaches have explicit constraints in modeling neural networks (linear layers), which makes the proposed method hard to incorporate into modern architectures (CNN, Transformers). Second, sharing prior distribution (true labels for all local data samples) can incur potential privacy concerns. The claim that the proposed method reduces the number of communication rounds to reach a target accuracy could mislead readers into assuming it improves overall communication efficiency. In reality, it transmits significantly more data per round than naive FedAvg. For a fair comparison, the method should be evaluated based on the total communication overhead required to achieve the target accuracy, rather than just the number of rounds.

方法与评估标准

The NTK-DFL method is appropriate for DFL, targeting heterogeneity via NTK and averaging. Evaluation of standard datasets (Fashion-MNIST, FEMNIST, MNIST) with Dirichlet-based heterogeneity and global test accuracy as a metric aligns with FL research.

理论论述

I reviewed Theorem 4.5 and Corollary 4.6 proofs (Appendix F) without a fully rigorous check, but they appear mathematically correct under the stated assumptions (Lipschitz continuity, bounded variance, NTK error bound).

实验设计与分析

The design—comparing NTK-DFL to baselines across heterogeneity levels, topologies, and sparsity—is sound. However, not the accuracy over communication rounds, but over total communication overheads is more appropriate for the fair comparison.

补充材料

I reviewed Appendices A–F.

与现有文献的关系

First NTK-based DFL: Claims to be the first to apply NTK-based weight evolution in a decentralized setting, achieving 4.6x fewer communication rounds than baselines in heterogeneous settings.

遗漏的重要参考文献

I believe the key papers are discussed in the main papers.

其他优缺点

Technical novelty is limited because the NTK-based update rule has been proposed in previous works and its validity for mitigating data heterogeneity also has been shown.

其他意见或建议

As mentioned in previous sections, it would be better to compare the accuracy over the total communication overheads for the fair comparison. It would be good to include the results of the incorporation of several compression techniques (sparsification) with other baselines.

作者回复

2025-04-01

Responses to Reviewer cjuK

“NTK-based approaches have explicit constraints in modeling neural networks (linear layers), which makes the proposed method hard to incorporate into modern architectures (CNN, Transformers).”

We appreciate this important point. Indeed, our current fully-connected model was chosen specifically to study NTK-DFL’s effects clearly without additional complexity. However, several recent works have successfully extended NTK frameworks to CNNs, ResNets, and Transformers (Yang, 2019). Incorporating these approaches represents a promising avenue to generalize our method to modern neural architectures. We will clarify this future direction explicitly in the conclusion of the revised manuscript.

“... possibly making use of NTK methods suited for modern architectures (Arora et al., 2019; Yang, 2019).”

“Sharing prior distribution (true labels for all local data samples) can incur potential privacy concerns.”

We agree with the reviewer that sharing true labels across clients could introduce privacy risks, however, the introduced risks are comparable to those introduced by gradient updates in FedAvg-style methods. We had actually tested privacy leakage due to sharing true labels in our submitted manuscript with a reconstruction-attack analysis under different mitigation strategies (Fig. 17, Appendix E). We found that under compression alone, reconstructions are significantly noisy, and when combined with random Gaussian projection, reconstruction becomes effectively impossible.

“The claim that the proposed method reduces the number of communication rounds to reach a target accuracy could mislead readers into assuming it improves overall communication efficiency… For a fair comparison, the method should be evaluated based on the total communication overhead required to achieve the target accuracy, rather than just the number of rounds.”

We will update the manuscript to explicitly highlight communication trade-offs in NTK-DFL. Specifically, while our method significantly reduces the number of rounds needed to reach target accuracy, each round involves higher communication overhead due to Jacobian exchange. Please see Response 2 to Reviewer oV9C for a clear comparison of communication cost versus the test accuracy for NTK-DFL and DFedAvg, the best-performing baseline. For a detailed description of settings where the communication rounds are an important metric, see Response 1 to Reviewer Zq37.

“Technical novelty is limited because the NTK-based update rule has been proposed in previous works and its validity for mitigating data heterogeneity also has been shown.”

We will again kindly direct the reviewer to our Response 2 to Reviewer Zq37.

“It would be good to include the results of the incorporation of several compression techniques (sparsification) with other baselines.”

We plan to update Figure 15 to include a comparison of compression techniques with other baselines. Notably, gradient-based DFL methods perform significantly worse under the aggressive compression used in NTK-DFL. For example, DFedAvg reaches only ~55% test accuracy when both sparsification and quantization are applied, although it maintains better performance when subjected to quantization alone. Conversely, NTK-DFL is able to withstand severe quantization and compression while only sacrificing 1-2% performance degradation over the same number of communication rounds.

“Why do model variance and final test accuracy on Fashion-MNIST show positive correlations? Can you provide more detailed discussion or analysis for that relation? How can we define moderate or severe dissimilarity between client weights?”

We thank the reviewer for raising this insightful question about the relationship between model variance and test accuracy, as well as the criteria for defining weight dissimilarity. Model averaging benefits from a diverse set of client solutions that generalize better when averaged together than any single model. Furthermore, weight dissimilarity can be quantified in many ways. For example, in Sahu et al. (2018), the norm of the weight difference is used as a metric to tune the FL updates. A mathematical formulation of “moderate” vs. “severe” dissimilarity between client weights may be grounds for a paper itself, and therefore we leave it to future work. We plan to update the end of the section “Gains Due to Final Model Aggregation” to include the discussion of this phenomenon based on our response above.

References

Anit Kumar Sahu et al. (2018). Federated Optimization in Heterogeneous Networks.
Greg Yang. (2019). Tensor Programs I: Wide Feedforward or Recurrent Neural Networks of Any Architecture are Gaussian Processes. 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

最终决定Accept (poster)

2025-05-01

The paper applies the NTK in FL, to guide the client models via a synergy of NTK evolution and FL aggregation. It is shown that the proposed method can scale well through different datasets, topologies and heterogeneities.