Distributed Event-Based Learning via ADMM
Our distributed learning method reduces communication cost, handles non-i.i.d. data distributions, ensures fast convergence, and is robust to failures, outperforming standard baselines.
Abstract
Reviews and Discussion
This paper introduces an event-triggered distributed learning method using ADMM to reduce communication in federated learning (FL) while handling non-i.i.d. data distributions.
The key contributions claimed include:
- A communication-efficient approach that reduces the number of message exchanges using an event-based trigger.
- A convergence analysis for convex and nonconvex settings, with accelerated rates in the convex case.
- Robustness to communication failures, both theoretically and experimentally.
- Empirical validation using MNIST and CIFAR-10, showing up to 35% communication savings over baselines like FedAvg, FedProx, SCAFFOLD, and FedADMM.
The approach is interesting and contributes to reducing communication in distributed learning, but the empirical evaluation is limited to small-scale image classification datasets.
===== After rebuttal =====
Thanks for the authors' responses. My questions and concerns have been addressed, and I am inclined to accept this submission, so I have increased my score.
Questions to Authors
See the other comments.
Claims and Evidence
- Claim: "Our method reduces communication while remaining agnostic to data distribution."
- Partially valid: These are general advantages of FL methods, and it's unclear whether the event-triggered mechanism offers a significant advantage beyond existing FL techniques.
- Claim: "We ensure convergence by setting a time-varying [communication threshold]."
- Unclear in practice: The authors do not provide a practical tuning strategy for this threshold, which is critical for real-world implementations.
- Claim: "Our work is the first to analyze communication failures in event-triggered ADMM."
- Partially valid: While the paper contributes to understanding failures in event-based ADMM, prior work in randomized communication reduction and partial client participation should be acknowledged.
Methods and Evaluation Criteria
- The event-triggered ADMM approach is reasonable, but the paper lacks discussion of prior ADMM-based FL methods and SDE-based convergence analyses.
- Missing Related Work:
- FL methods using SDE/OPE tools:
- Liang, J., Han, Y., Li, X., Liu, Y., Zhang, Y., & Lin, Z. (2022). Asymptotic behaviors of projected stochastic approximation: A jump diffusion perspective. NeurIPS 2022. (Jump diffusion perspective)
- Deng, W., Zhang, Q., Ma, Y. A., Song, Z., & Lin, G. (2024). On Convergence of Federated Averaging Langevin Dynamics. UAI 2024. (Federated Averaging Langevin Dynamics)
- Glasgow, M. R., Yuan, H., & Ma, T. (2022). Sharp bounds for federated averaging (local SGD) and continuous perspective. AISTATS 2022. (Sharp bounds for FedAvg)
- Federated ADMM studies:
- Swaroop, S., Khan, M. E., & Doshi-Velez, F. (2025). Connecting Federated ADMM to Bayes. ICLR 2025 (to appear).
- Chen, Y., Blum, R. S., & Sadler, B. M. (2022). Communication efficient federated learning via ordered ADMM in a fully decentralized setting. CISS 2022.
- Experimental limitations:
- Only small-scale datasets (MNIST, CIFAR-10) are tested.
- The paper ignores asynchrony and stragglers, which are critical in real-world FL.
Theoretical Claims
- Theoretical results rely on strong convexity:
- The claimed accelerated rates are only valid under strong convexity, limiting applicability to deep learning models.
- The nonconvex case analysis (Theorem 2.3) provides only a slow rate, which does not demonstrate a significant advantage over existing FL methods.
- The analysis is for deterministic optimization, while practice favors stochastic optimization.
- Interpretability issue in Algorithm 1:
- A key term in the algorithm is not well explained.
- Question: Could the authors clarify how this term fits into the computation and why it is needed?
Experimental Design and Analysis
- Communication savings are demonstrated, but scalability is unclear:
- The method is tested on small networks. What happens when scaling to hundreds or thousands of clients?
- There is no study on network delays or heterogeneous computing capabilities.
- No clear study on the benefits of event-triggered communication:
- The results show communication savings, but is there a trade-off in terms of final model performance?
- Question: Could the authors summarize the practical benefits of event-triggered communication?
Supplementary Material
The appendix contains proof details and additional experimental setup and results. I did not check very carefully, though the proofs seem to be correct.
Relation to Existing Literature
- The paper does not discuss some prior work on ADMM variants for FL, in particular:
- ICLR 2025 (Swaroop et al.): Connections between Federated ADMM and Bayesian inference.
- CISS 2022 (Chen et al.): Ordered ADMM for fully decentralized FL.
- Misses FL literature using SDE-based analysis:
- UAI 2024 (Deng et al.): Federated Averaging Langevin Dynamics.
- NeurIPS 2022 (Liang et al.): OPE/SDE tools in FL.
- The novelty claim on communication failures seems overstated, since prior work already discusses robust FL under partial participation. The authors should highlight the differences between their setting and this prior work.
Missing Important References
See previous report for missing references.
Other Strengths and Weaknesses
Strengths:
- The idea of event-triggered ADMM is interesting and could be useful for communication-limited FL settings.
- The paper provides some convergence guarantees, though they rely on strong convexity assumptions.
- The communication savings are demonstrated empirically.
Weaknesses:
- Experiments are too simple (MNIST, CIFAR-10) and may not generalize.
- Some references are missing (SDE, ADMM in FL).
- Unclear benefits of event-triggered communication beyond saving messages.
Other Comments or Suggestions
- Clarify the interpretation of the trigger term in Algorithm 1.
- Provide a practical tuning strategy for the communication threshold.
- Discuss how this approach generalizes to asynchronous/decentralized FL settings.
We thank the reviewer for their thorough review and constructive feedback. Below, we address the specific points raised.
- Justification of Experiment Design and Generalization:
The point of our manuscript is to present a distributed optimization/learning algorithm that is both communication-efficient and robust, even under heterogeneous data distributions (e.g., each agent having one digit in MNIST, shown in Fig. 8). The experiments were designed to demonstrate the feasibility of our method in these scenarios. Additionally, we show that our approach scales to larger networks, with examples using 100 clients to train a CNN on CIFAR-10 (Fig. 3) and 50 agents that communicate over a graph (Fig. 12). Our theoretical result (Corollary 2.2) guarantees that the method can indeed scale to large networks. Numerical experiments align with these theoretical predictions, highlighting that our approach is effective even in the presence of 100 agents or more.
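For readers who want a concrete picture of this split, the following is a minimal sketch (not the authors' experiment code, which is described in App. G) of a one-digit-per-agent partition; the function name and the dummy labels are illustrative only.

```python
import numpy as np

def partition_by_label(labels, num_agents=10):
    """Give agent k the indices of all samples whose label equals k.

    This reproduces the extreme non-i.i.d. split mentioned above
    (each agent sees a single MNIST digit), assuming the number of
    agents equals the number of distinct labels.
    """
    labels = np.asarray(labels)
    return {k: np.flatnonzero(labels == k) for k in range(num_agents)}

# Toy usage with random labels standing in for the MNIST targets.
rng = np.random.default_rng(0)
dummy_labels = rng.integers(0, 10, size=60_000)
partition = partition_by_label(dummy_labels)
print({k: len(idx) for k, idx in partition.items()})  # roughly 6000 samples per agent
```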
- Missing References:
Thank you for bringing these papers to our attention. We will incorporate them into the related work section of the revised manuscript.
The paper by Swaroop et al. (2025) is very recent, so it was not included in our initial submission. Regarding Chen et al. (2022), which discusses the ordered ADMM variant, we cite other relevant works related to our ADMM variant. We will include Chen et al. (2022) in the revised manuscript.
We will also add references to alternative convergence techniques, such as Liang et al. (2022) and Deng et al. (2024). Additionally, Glasgow et al. (2022) prove that FedAvg is suboptimal under heterogeneity, which aligns with the findings of (Li, 2020c) that we already cite; we will include this reference as well.
- Communication vs Performance Trade-off:
Our framework allows explicit trade-offs between communication load and solution accuracy, which we demonstrate through experiments in Fig. 8, 9, 11, 12 (App. G). These curves clearly show the relationship between the communication threshold and the resulting model accuracy.
- Benefits of Event-triggered Communication:
Our framework, based on ADMM with event-triggered communication, not only saves communication but also ensures convergence under heterogeneous data distributions. This is not guaranteed by many other federated learning methods. We also provide a bounded error guarantee, which can be controlled by a single threshold value. Our numerical examples (see, e.g., Fig. 3, as well as 8, 9, 11, 12) highlight that the event-triggered communication strategy requires much less communication for a given target accuracy than a vanilla communication strategy that randomly communicates among neighbors (in the context of our ADMM-based approach).
Clarifications
- Interpretation of the Term in Algorithm 1:
The term defined in line 189 encapsulates the combined effect of the local primal and dual updates. Its change essentially quantifies the deviation of the local state from the last communicated state and serves as the key signal for triggering communication (see (2)).
- Practical Tuning Strategy for the Communication Threshold:
Thank you for raising this point. We discuss a tuning strategy in line 347, where we suggest a time-varying schedule for the threshold; convergence under this schedule is further analyzed in App. F. Alternatively, the communication threshold can be actively tuned based on feedback from the system, as in [1]. A toy sketch of the trigger rule with an assumed decaying threshold is given below.
[1] Cummins, M., Er, G. D., Muehlebach, M. “Controlling Participation in Federated Learning with Feedback” (arXiv:2411.19242).
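To make the two clarifications above concrete, the snippet below sketches a generic event-trigger check with an assumed geometrically decaying threshold; the names (local_state, last_sent, delta0, rho) and the specific schedule are illustrative and not taken from the paper.

```python
import numpy as np

def threshold(k, delta0=1.0, rho=0.95):
    """Illustrative time-varying threshold: geometric decay over rounds k."""
    return delta0 * rho**k

def should_communicate(local_state, last_sent, k):
    """Trigger a transmission only when the local state has drifted far
    enough from the state that was last communicated to the neighbors."""
    return np.linalg.norm(local_state - last_sent) > threshold(k)

# Toy usage: the agent transmits only on rounds where the drift exceeds the threshold.
rng = np.random.default_rng(0)
last_sent = np.zeros(5)
for k in range(20):
    local_state = last_sent + 0.1 * rng.standard_normal(5)  # stand-in for a primal/dual update
    if should_communicate(local_state, last_sent, k):
        last_sent = local_state.copy()  # "send" and remember the communicated state
```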
- Generalization to Decentralized Settings:
We demonstrate in App. G (see Fig. 11 and Fig. 12) that our algorithm operates over a network of agents, supporting decentralized FL systems.
- Covering Asynchrony, Network Delays and Heterogeneous Computing Capabilities:
We already discuss the adaptability of our approach to asynchronous systems at the end of the manuscript (line 412). While initially framed as a synchronous method, event-based communication naturally extends to asynchronous settings, enabling robust operation in real-world scenarios with varying network reliability and synchronization.
We also propose to model the effect of stragglers, network delays and heterogeneous computing capabilities as communication drops, which are effectively managed via our reset strategy. Proposition 2.1, Corollary 2.2 and Theorem 4.1 provide theoretical guarantees for convergence under these conditions.
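As a purely illustrative toy model of this statement (not the paper's algorithm), one can simulate dropped messages as Bernoulli losses and resynchronize all agents periodically; the drop probability and reset period below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
num_agents, dim, drop_prob, reset_every = 20, 5, 0.3, 10

global_state = np.zeros(dim)
local_views = np.zeros((num_agents, dim))  # each agent's copy of the global state

for k in range(100):
    global_state += 0.01 * rng.standard_normal(dim)  # stand-in for a server update
    delivered = rng.random(num_agents) > drop_prob   # each message drops with prob. drop_prob
    local_views[delivered] = global_state            # only delivered messages update the copies
    if (k + 1) % reset_every == 0:
        local_views[:] = global_state                # periodic reset: full resynchronization

staleness = np.linalg.norm(local_views - global_state, axis=1)
print(f"max staleness after 100 rounds: {staleness.max():.4f}")
```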
We believe we have addressed all the questions and provided the necessary clarifications. We kindly request raising our paper’s score based on the responses provided. If there are further questions or additional clarifications needed, please let us know.
The authors present an event-based distributed agent framework that leverages a relaxation of the Alternating Direction Method of Multipliers (ADMM) to provide reduced communication cost.
Questions to Authors
Most of my theoretical concerns were answered in the previous segments. However, I could not find the code to replicate the experiments in the supplied submission. Is there any reason why the authors did not include the associated artefacts?
Claims and Evidence
The authors make sufficient claims to support their idea and also list its limitations, which is notable. Overall, I feel that the evidence provided is enough to verify the authors' claims.
Methods and Evaluation Criteria
The method is compared against a suite of recent methods, and the comparison seems sufficient given the application at hand.
Theoretical Claims
The theoretical claims are sound, but to follow them fully one has to delve into the appendix. I feel that critical claims and ideas should be present in the main text.
Experimental Design and Analysis
Overall, the experimental design is acceptable; however, I would appreciate an ablation study, as well as experiments with data drawn from biased/shifting distributions, to see how robust the framework is under such conditions.
Supplementary Material
I reviewed very little of it; it is too long.
Relation to Existing Literature
The paper reduces the communication overhead in a distributed system when local models undergo substantial transformation. It is heavily related to convex safe zones [1], and it would be great for the authors to compare against similar literature, which is currently missing from the text.
[1] Distributed query monitoring through convex analysis: Towards composable safe zones, Garofalakis et al.
Missing Important References
I feel that the paper lacks discussion of the broader topic mentioned above, but other than that the authors have included a sufficient amount of references.
Other Strengths and Weaknesses
While the paper proposes a significant reduction in communication cost, it is still missing important guarantees required in such a system. The authors have listed most of them, but I feel that not catering for bad actors and gradient poisoning attacks limits the system's applicability.
It would be great if the authors discussed potential methods to mitigate these concerns.
Other Comments or Suggestions
Nothing of note.
We appreciate the time and effort invested in evaluating our work. Below, we address each of the points raised.
- Robustness to Biased/Shifting Distributions:
Our experiments already use heterogeneous datasets with inherent bias and distribution shifts among agents (e.g., each agent having one digit in MNIST, shown in Fig. 8 and other cases in Fig. 3,9,11 and 12).
- Relevance to Safe Zone Design:
We acknowledge the relevance of reducing communication overhead through local conditions in distributed systems. We will incorporate the corresponding reference and the following discussion in the revised version of our manuscript:
Our approach similarly monitors local models through local constraints (our communication event definition, see (2) in the paper), which collectively guarantee the global condition that the overall error remains bounded. While thresholding can be seen as a subset of such local constraints, our methodology is directly inspired by the send-on-delta (SoD) concept (Miskowicz, 2006) and the event-based control literature, where communication is triggered by significant state changes.
- Bad Actors and Gradient Poisoning Attacks:
Our method is compatible with existing techniques that mitigate issues related to bad actors and gradient poisoning. Addressing all possible challenges in federated learning is beyond the scope of a single approach. However, our event-based methodology can be integrated with robust aggregation or anomaly detection methods [1,2] to improve security without compromising communication efficiency. We will add these methods into our discussion as possible compatible solutions.
[1] Yin, D., Chen, Y., Kannan, R. & Bartlett, P. (2018). Byzantine-Robust Distributed Learning: Towards Optimal Statistical Rates. Proceedings of the 35th International Conference on Machine Learning, 80:5650-5659
[2] Pillutla K., Kakade S. M. and Harchaoui Z. (2022) "Robust Aggregation for Federated Learning," IEEE Transactions on Signal Processing, 70:1142-1154
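For illustration, one such robust aggregation rule is the coordinate-wise median of Yin et al. (2018); the sketch below is a generic example of how it could replace a plain average of the communicated updates, not a component of the submitted method.

```python
import numpy as np

def coordinate_wise_median(updates):
    """Aggregate client updates by taking the per-coordinate median, which
    tolerates a minority of arbitrarily corrupted (Byzantine) updates."""
    return np.median(np.stack(updates, axis=0), axis=0)

# Toy example: 8 honest updates near 1.0 and 2 poisoned updates.
rng = np.random.default_rng(2)
honest = [np.ones(4) + 0.01 * rng.standard_normal(4) for _ in range(8)]
poisoned = [100.0 * np.ones(4) for _ in range(2)]
print(coordinate_wise_median(honest + poisoned))  # stays close to the honest value 1.0
```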
We thank the reviewer for bringing this up.
- Code in the Submission:
All details related to our implementation can be found in App. G. We will release our code if the manuscript is accepted; however, we are happy to share the current version of the code upon request.
We believe we have addressed all points and agree to add further discussions to the final version. We kindly request raising our paper’s score based on our responses. If further clarifications are needed, we are happy to provide them and refine our work.
This paper proposes an ADMM-style algorithm with event-triggered communication for minimizing the sum of a smooth, possibly nonconvex function and a closed, proper, convex regularizer subject to linear equality constraints. This general problem formulation subsumes the typical consensus optimization framework, which enables one to apply the proposed algorithm to the typical federated learning setup (parameter server) as well as decentralized communication topologies.
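For readers less familiar with ADMM, the summary above refers to the following standard template (notation illustrative; the paper's exact constraint matrices and splitting may differ):

```latex
% Generic composite problem with linear coupling:
%   f smooth and possibly nonconvex, g closed, proper, and convex.
\[
  \min_{x,\,z}\; f(x) + g(z)
  \quad \text{s.t.} \quad A x + B z = c .
\]
% Consensus special case, which recovers the parameter-server FL setup:
%   each agent i keeps a local copy x_i that must agree with the global variable z.
\[
  \min_{x_1,\dots,x_N,\,z}\; \sum_{i=1}^{N} f_i(x_i)
  \quad \text{s.t.} \quad x_i = z, \qquad i = 1,\dots,N .
\]
```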
Theoretical guarantees are provided using a Lyapunov analysis. In the strongly convex setting, the method achieves an accelerated linear convergence rate with a dependence on the square root of the condition number, up to a neighbourhood of the minimizer that depends on the threshold for triggering communication as well as the periodic "resets" required for handling dropped messages. In the nonconvex case, a fast O(1/k) rate is proven. The proposed method involves exactly solving nested optimization problems ("argmin"); in practice, this is replaced with a few stochastic gradient steps.
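As a hedged illustration of the type of guarantee being described (the constants and exact form below are not taken from the paper):

```latex
% Accelerated linear rate up to a neighbourhood, in the strongly convex case:
%   kappa is the condition number, c an absolute constant, and the O(.) term
%   collects the error induced by the trigger threshold and the reset period.
\[
  \|x^{k} - x^{\star}\|^{2}
  \;\lesssim\;
  \Bigl(1 - \tfrac{c}{\sqrt{\kappa}}\Bigr)^{k} \|x^{0} - x^{\star}\|^{2}
  \;+\; \mathcal{O}(\Delta) .
\]
```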
Empirical results are provided for training an MLP on MNIST, a ConvNet on CIFAR-10, and linear regression with L1 regularization in the appendix, comparing against established federated learning baselines.
Questions to Authors
N.A.
Claims and Evidence
All claims are well supported, theoretical results are the primary contribution and are relatively impressive. Empirical results are convincing, with sufficient details provided in the appendix.
Methods and Evaluation Criteria
The main claim is to reduce the number of required communication rounds; thus, focusing on validation accuracy of downstream tasks as a function of communication rounds is appropriate.
Theoretical Claims
I coarsely went through the proofs in the appendix, which look OK.
Experimental Design and Analysis
Yes, I checked all experiments, including those in the appendix. All appear sound.
Supplementary Material
Yes, all of it.
Relation to Existing Literature
Paper is well situated within the broader literature on federated learning.
Missing Important References
Nothing major, but the approach to handling dropped packets is a bit brutish. Essentially, exact synchronization is needed to mitigate the effect of dropped packets, and the frequency of this synchronization is reflected in the residual error of the convergence bounds. There are already tools in the gossip literature for dealing with dropped packets by tracking running sums (Hadjicostis et al., IEEE TAC 2016). I am curious whether such approaches can be combined with the proposed method to better handle dropped packets. As can be seen in Figure 10 on page 33 of the Appendix, dropped packets require significant synchronization to establish fine-grained convergence.
Hadjicostis et al., “Robust distributed average consensus via exchange of running sums,” IEEE TAC, 2016.
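For context, the running-sum mechanism of Hadjicostis et al. (2016) works roughly as sketched below (a rough illustration of the idea, not of the submission): the sender transmits cumulative sums, so any packet that does get through implicitly delivers the content of all previously dropped ones.

```python
import numpy as np

class RunningSumLink:
    """Rough sketch of the running-sum trick for lossy links: the sender
    transmits its cumulative sum, and the receiver differences consecutive
    successful receptions, so dropped packets lose no mass in the long run."""

    def __init__(self):
        self.sent_total = 0.0      # sender-side cumulative sum of everything sent
        self.last_received = 0.0   # receiver-side last successfully received sum

    def send(self, value, delivered):
        self.sent_total += value   # accumulate even if the packet drops
        if not delivered:
            return 0.0
        increment = self.sent_total - self.last_received
        self.last_received = self.sent_total
        return increment           # recovers all mass since the last success

rng = np.random.default_rng(3)
link, received = RunningSumLink(), 0.0
values = rng.random(50)
for v in values:
    received += link.send(v, delivered=rng.random() > 0.4)  # 40% drop rate
print(f"sent {values.sum():.3f}, received {received:.3f}")  # gap is only the trailing drops
```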
Other Strengths and Weaknesses
Well-written paper with solid results, convincing empirical experiments, a well-documented experimental setup in the appendix, and several additional ablations in the appendix.
Other Comments or Suggestions
N.A.
We thank the reviewer for their constructive feedback and recognition of our work’s strengths. We reviewed (Hadjicostis et al. 2016), and their approach using running sums offers an interesting alternative to periodic synchronization. While our method ensures strong theoretical guarantees, exploring tools from the gossip literature could be a valuable direction for future work to further mitigate communication drops. We appreciate the reviewer’s insights and are happy to provide further clarifications if needed.
This paper introduces event-based communication triggering in ADMM to reduce communication cost. Compared to other optimization algorithms in federated learning, the communication cost is reduced by 35% or more. Three reviewers found the paper technically sound, and all agreed to accept with scores of 3, 4, 4. All reviewers acknowledged reading the rebuttal response. In particular, reviewer 4Bk5 gave a detailed review with constructive feedback and raised their score from 3 to 4 after the discussion. I would encourage the authors to incorporate the feedback into their next version.
Reviewers also mentioned that they are more familiar with federated learning than with ADMM and distributed optimization in general. In my opinion, the proposed method is a rather agnostic improvement on ADMM, not necessarily tied to FL. In fact, as some of the reviewers also allude to, I would strongly encourage the authors to discuss the application of the proposed algorithm in practical federated learning: for example, its applicability to cross-device or cross-silo FL, and whether the algorithm can handle stragglers and partial participation. These are important practical considerations, and the authors can probably find relevant discussions in Advances and Open Problems in Federated Learning and A Field Guide to Federated Optimization.