PaperHub
Rating: 6.2/10 · Poster · 5 reviewers (min 6, max 7, std dev 0.4)
Individual ratings: 6, 6, 7, 6, 6
Confidence: 2.6 · Correctness: 3.0 · Contribution: 2.8 · Presentation: 2.6
NeurIPS 2024

Weight for Robustness: A Comprehensive Approach towards Optimal Fault-Tolerant Asynchronous ML

OpenReview | PDF
Submitted: 2024-05-14 · Updated: 2025-01-16
TL;DR

A novel fault-tolerant framework and strategies that, for the first time, achieve optimal convergence rate in asynchronous machine learning against Byzantine failures.

Abstract

Keywords
stochastic convex optimization, byzantine robust learning, online convex optimization

Reviews & Discussion

Review (Rating: 6)

The paper addresses the challenges of Byzantine-robust training in asynchronous distributed machine learning systems, aiming to enhance efficiency amid massive parallelization and heterogeneous compute resources, by introducing a novel weighted robust aggregation framework. The authors quantify the difficulty in asynchronous scenarios by considering the number of Byzantine updates, which is more natural than the standard measure of the number of Byzantine workers. They identify the need to utilize weighted aggregators rather than standard ones for asynchronous Byzantine problems. Toward this end, they extend the robust aggregation framework to include weights, and develop appropriate (weighted) rules and a meta-aggregator. Finally, they combine the weighted robust framework with a recent double momentum mechanism, leveraging its unique features to achieve, for the first time, an optimal convergence rate in asynchronous Byzantine ML.

Strengths

This paper proposes a new solution for Byzantine-robust training in asynchronous distributed machine learning systems, with rigorous research methods and reliable experimental results. The paper has a reasonable structure, coherent logic, and clear expression. The research findings have significant implications for asynchronous distributed machine learning. The authors combine the weighted robust framework with a recent double momentum mechanism, leveraging its unique features to achieve an optimal convergence rate for the first time in asynchronous Byzantine ML. They adopt a method similar to the attention mechanism, assigning different weights to different parts.

Weaknesses

The experiments are insufficient: why not compare against the weighting methods mentioned in the article? The paper quantifies the difficulty in asynchronous scenarios by considering the number of Byzantine updates, which is more natural than the standard measure of the number of Byzantine workers, but the experimental results do not substantiate this point.

Questions

The paper introduces a novel weighted robust aggregation framework to address the challenges of Byzantine-robust training in asynchronous distributed machine learning systems, but there is no detailed explanation of how the weight coefficients of the framework are calculated.

Limitations

The authors did not discuss limitations in the paper. The work generally does not pose potential negative societal impacts, such as privacy issues, algorithmic bias, or the risk of technology misuse.

Author Response

Thank you for your valuable feedback. We wish to address some points raised to provide further clarification and insight into our work:

Regarding the Weighted Approach Experiment:

We would like to clarify that we addressed the comparison with weighting methods in Figure 2 of our paper. This figure presents a comparison between weighted and non-weighted robust aggregators, demonstrating that the weighted approach outperforms the non-weighted one, which is equivalent to using the fraction of Byzantine workers. In addition, we have conducted further experiments for various values of the fraction of Byzantine iterations with a fixed number of Byzantine workers, to evaluate the impact of the fraction of Byzantine iterations. Our findings demonstrate that as the proportion of Byzantine iterations increases, training performance degrades. This can be seen in Figure 4 of the experiments we attach in the rebuttal PDF.

Regarding the Weight Coefficients:

The weight coefficients in our framework are determined based on the total number of updates of each worker, a factor known in an asynchronous system (see lines 259-267 and Algorithm 2). These weight coefficients align elegantly with the asynchronous dynamics by prioritizing workers with more frequent updates.
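As a rough illustration of this prioritization (a minimal sketch, not the paper's Algorithm 2; the function and variable names here are hypothetical), per-worker update counts can be normalized into aggregation weights:

```python
import numpy as np

def weights_from_update_counts(update_counts):
    """Turn per-worker update counts into normalized aggregation weights.

    Workers that have contributed more updates receive proportionally
    larger weights, matching the asynchronous dynamics described above.
    """
    counts = np.asarray(update_counts, dtype=float)
    return counts / counts.sum()

# Hypothetical example: worker 0 sent 10 updates, worker 1 sent 2, worker 2 sent 8.
print(weights_from_update_counts([10, 2, 8]))  # [0.5 0.1 0.4]
```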

Regarding the Limitations:

Thank you for pointing this out. We will include this discussion in the final version of the paper.

The paper does not introduce additional negative social impacts beyond the standard Byzantine synchronous settings. In fact, our work enables us to enhance the robustness of asynchronous training.

In terms of memory consumption, asynchronous systems without Byzantine workers typically require the parameter server to maintain memory for a single worker's output along with the global model. However, to ensure robustness against failures, our method requires storing the latest outputs of all workers in addition to the global model. This results in increased memory consumption, which scales as $O(dm)$, where $m$ is the number of workers and $d$ is the dimensionality of the model.

Regarding time complexity, the use of a robust aggregation procedure requires additional computation (e.g., applying the coordinate-wise median aggregator requires an additional $O(dm\log(m))$ complexity per round). Note that such additional computation also arises in synchronous Byzantine scenarios. In asynchronous Byzantine-free scenarios, these additional processing costs are absent.
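For concreteness, here is a minimal sketch of a weighted coordinate-wise median (one natural weighted variant of the aggregator mentioned above; the paper's exact rule may differ). Sorting each coordinate dominates the cost, which is where the $O(dm\log(m))$ per-round overhead comes from:

```python
import numpy as np

def weighted_coordinatewise_median(outputs, weights):
    """Weighted coordinate-wise median of m worker outputs in dimension d.

    outputs: (m, d) array holding the latest stored output of each worker.
    weights: (m,) nonnegative weights summing to 1.
    Each coordinate is sorted in O(m log m), for O(d m log m) in total.
    """
    outputs = np.asarray(outputs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    m, d = outputs.shape
    agg = np.empty(d)
    for j in range(d):
        order = np.argsort(outputs[:, j])     # sort workers by coordinate j
        cum = np.cumsum(weights[order])       # accumulated weight in sorted order
        k = np.searchsorted(cum, 0.5)         # first worker reaching half the mass
        agg[j] = outputs[order[k], j]
    return agg
```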

It is worth noting that our approach scales similarly to standard approaches for the synchronous Byzantine case. Nevertheless, recall that the advantage of the asynchronous setting is that slow workers may not become bottlenecks in the training process, which enhances overall efficiency and flexibility.

The additional memory overhead and time complexity, compared to the asynchronous Byzantine-free scenario, represent a trade-off for enhanced robustness, as they enable the system to apply robust aggregation rules and more effectively mitigate the influence of Byzantine workers. This trade-off is justified by achieving an optimal convergence rate and improved performance of the training process in the presence of adversarial behavior.

Review (Rating: 6)

This work introduces a weighted robust aggregation framework and extends traditional robust aggregators to asynchronous Byzantine environments. By integrating the proposed framework with a recent double momentum mechanism, the authors achieve optimal convergence rates in asynchronous Byzantine ML under the convex setting.

Strengths

  • The authors address an important problem of Byzantine robustness in asynchronous distributed environments.
  • Novel combination of $\mu^2$-SGD with weighted robust aggregation to achieve optimal convergence rates for the first time. In contrast to existing works, the obtained convergence rate is independent of data dimensionality and matches the rates of asynchronous settings in the absence of Byzantine workers.
  • The paper is well written and well contextualized w.r.t. existing work in both the synchronous and asynchronous Byzantine literature.

Weaknesses

  • Empirical evaluations are limited, spanning only a single dataset.
  • Currently, only the convex scenario is considered.

Questions

While a vast variety of attacks have been designed for the synchronous case, they may not be as impactful in the asynchronous case. While the authors adapt Little and Empire to the asynchronous scenario, are there other attacks specific to asynchronous settings? Can the authors comment on the significance of evaluated attacks in the asynchronous case?

Limitations

Empirical assessments should be conducted on at least one more dataset to show the effectiveness of the proposed approach. The paper could benefit from restructuring to retain at least the important experimental details in the main paper, such as details on the evaluated attacks. Currently, all experimental details except the name of the dataset have been deferred to the Appendix.

Author Response

Thank you for your valuable feedback. We wish to address some points raised to provide further clarification and insight into our work:

Regarding the Empirical Evaluation:

We agree that including experiments on additional datasets would strengthen the empirical validation of our proposed approach. In response, we have conducted experiments on the CIFAR-10 dataset, complementing our existing results from the MNIST dataset. We believe these additional experiments provide a more comprehensive evaluation of our approach and highlight its practicality. Furthermore, we will ensure that important experimental details are retained in the main paper in the final version.

Regarding the Convex Setting:

The convex case is important for two main reasons: It captures classical ML problems like linear and logistic regression, and it serves as a theoretical testbed for analyzing and developing novel ideas and algorithms in various machine-learning scenarios. These new algorithms often induce novel heuristics that are very useful in the context of general ML problems, including non-convex ones and specifically in Deep Learning scenarios. Indeed, this approach is prevalent and has led to many important contributions in ML. Two prominent examples are the well-known Adagrad and Adam algorithms, which were conceived and analyzed under the convexity assumption. Other influential examples are the many works on convex synchronous/asynchronous training, and works on Byzantine-resilient algorithms for convex problems, see e.g.:

  • Dan Alistarh, Zeyuan Allen-Zhu, and Jerry Li. Byzantine stochastic gradient descent. Advances in Neural Information Processing Systems, 31, 2018. https://arxiv.org/pdf/1803.08917
  • Minghong Fang, Jia Liu, Neil Zhenqiang Gong, and Elizabeth S Bentley. Aflguard: Byzantine robust asynchronous federated learning. In Proceedings of the 38th Annual Computer Security Applications Conference, pages 632–646, 2022. https://arxiv.org/pdf/2212.06325

For the asynchronous Byzantine problem, the convex framework enables the development of a more intuitive and naturally weighted scheme. It provides a foundational basis for extending these findings to non-convex settings, which is a challenging and ongoing research area. In fact, extending our weighted scheme to the non-convex setting requires less straightforward analysis and necessitates a careful adaptation to accommodate the stochastic error inherent in non-convex scenarios. For further details, please refer to lines 259-271 and the Future Work section of our paper.

Please note that in our work, we empirically examine the ideas developed in the non-convex setting, specifically focusing on non-convex deep learning problems. Our empirical study demonstrated the practicality and usefulness of our approach in deep learning scenarios, complementing our theoretical findings. Additionally, we conducted further experiments using the CIFAR-10 dataset to validate the effectiveness of our methods in a more complex, non-convex machine learning task.

Regarding the Attacks Adaptation to the Asynchronous Case:

The attacks of Little and Empire are designed to hide under the variance among the workers' outputs. In asynchronous settings, delays can increase this variance, providing an opportunity for Byzantine workers to blend in more effectively with honest outputs. Therefore, we believe these attacks may also be effective in the asynchronous setting. Moreover, we have adapted these attacks to directly target the weighted robust aggregation rules that we utilize in our algorithmic approach, thus making them more comparable to the synchronous case. Nevertheless, attacks specific to asynchronous settings remain underexplored, and further investigation is necessary to better understand their impact.
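For intuition, a schematic sketch of a Little-style attack in the spirit of Baruch et al.'s "A Little Is Enough" (the weighted, asynchronous adaptation described above differs in detail): the Byzantine update is shifted a small multiple of the empirical standard deviation away from the honest mean, so it stays within the spread of honest outputs:

```python
import numpy as np

def little_style_attack(honest_updates, z=1.0):
    """Craft a Byzantine update that hides inside the honest variance.

    honest_updates: (m, d) array of honest workers' gradients/updates.
    z: number of standard deviations to shift; a small z keeps the
       malicious vector statistically close to honest outputs.
    """
    mu = honest_updates.mean(axis=0)
    sigma = honest_updates.std(axis=0)
    return mu - z * sigma  # biased, yet hard to flag as an outlier

# In asynchronous runs, delays inflate sigma, giving the attacker more room to hide.
```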

Comment

The reviewer thanks the authors for their clarifying responses and additional experiments. After considering the responses as well as the reviews from other reviewers, I believe my original assessment accurately reflects the contributions of the paper, hence, I will maintain my score.

Review (Rating: 7)

The paper introduces the Asynchronous Robust $\mu^2$-SGD algorithm, designed to address the challenges of Byzantine-robust training in asynchronous distributed machine learning (ML) systems. The method extends the standard $\mu^2$-SGD by incorporating a robust weighted aggregation framework that mitigates the effects of delayed updates and faulty, potentially malicious contributions from distributed workers. This novel approach claims to achieve an optimal convergence rate for the first time in an asynchronous Byzantine environment, enhancing fault tolerance and performance optimization in such systems.

Strengths

  • Innovative Approach to Fault Tolerance: The paper successfully introduces a novel robust aggregation framework that extends existing methods to accommodate asynchronous dynamics and Byzantine faults. This framework allows for the application of robust aggregators and meta-aggregators to their weighted versions, effectively handling the challenges posed by asynchronous updates and unreliable data sources.
  • Clear Presentation and Insightful Analysis: The paper provides a clear presentation, offering good insights into the differences between synchronous and asynchronous settings, which is crucial for understanding the impact and novelty of the proposed method. The authors provide detailed proofs, and the preliminary results presented look promising, indicating potential for significant contributions to the field.

Weaknesses

While I am not an expert in this specific area of asynchronous Byzantine environments, I recognize the significant value this paper offers to the NeurIPS audience. Its innovative approaches to fault tolerance in distributed ML systems are both relevant and timely (such as in distributed federated learning, where updates can be fuzzy or even malicious). Nevertheless, there are a few areas where further clarification could enhance the paper's impact and applicability.

  • Assumptions on Byzantine Behavior: The paper introduces an interesting perspective by suggesting that it is more natural to consider the fraction of Byzantine updates rather than the fraction of Byzantine workers in asynchronous environments. While this approach aligns well with the dynamic nature of such systems, it raises questions about whether alternative, potentially simpler solutions could be explored. For instance, could the parameter server utilize mechanisms like monitoring update frequencies to identify and mitigate the impact of Byzantine workers? Further exploration into whether these alternative approaches might offer comparable robustness with potentially reduced complexity could enrich the discussion and provide a more comprehensive understanding of the trade-offs involved.
  • Impact on System Metrics: The paper could benefit from a discussion on how the proposed method impacts other critical system metrics, such as the additional memory consumption required by the parameter server and the overall impact on training speed. Understanding these implications could help in assessing the practical viability of implementing this method in real-world systems.
  • Temporal Distribution of Byzantine Updates: A minor yet intriguing question arises regarding the assumption that Byzantine updates are uniformly distributed over time. What if these updates are not uniformly distributed but are instead concentrated within specific periods? Exploring the effects of such scenarios could provide deeper insights into the robustness of the proposed method under different types of Byzantine behaviors.

Questions

See weakness section.

Limitations

No potential negative societal impact.

Author Response

Thank you for your valuable feedback. We wish to address some points raised to provide further clarification and insight into our work:

Regarding the Assumptions on Byzantine Behavior: “... it raises questions about whether alternative, potentially simpler solutions could be explored”

This is a very good question. Our goal was to develop the simplest approach for which we could draw provable guarantees for the general Byzantine asynchronous setting, and our approach indeed ensures such guarantees. It is an important open question to understand whether a simpler approach can work in this case, and we believe that this is an interesting future direction.

Monitoring update frequencies can potentially assist. For example, in an extreme case where a dominant worker contributes more than half of the updates over some stretch of iterations, that worker can be identified as honest and might serve as an honest reference for anomaly detection. While this can lead to a simple approach for this specific case, we believe the guarantees would coincide with ours. In less extreme cases, however, relying solely on update frequencies to detect outliers becomes more challenging within the general Byzantine framework. This framework assumes only that a majority of iterations are honest, without prior knowledge of additional characteristics of Byzantine updates. The generality of this framework allows us to develop an approach applicable to a wide range of scenarios. Nevertheless, this broad applicability makes it difficult to set definitive criteria for identifying anomalous behavior based on frequency alone. For example, a Byzantine worker could adapt their behavior to mimic honest patterns, making identification more difficult. Leveraging additional prior knowledge might be useful in developing simpler approaches, and this is future work to investigate.

We will add this discussion to the final version of the paper.

Regarding the Impact on System Metrics:

Thank you for pointing this out. We will include this discussion in the final version of the paper.

Regarding memory requirement and computational complexity, our approach scales similarly to standard approaches for the synchronous Byzantine case. Nevertheless, recall that the advantage of the asynchronous setting is that slow workers may not become bottlenecks in the training process, which enhances overall efficiency and flexibility.

Compared to the asynchronous case without Byzantine workers, where the parameter server only needs to maintain memory for a single worker's output along with the global model, our method requires storing the latest outputs of all workers in addition to the global model to ensure robustness against failures. This results in increased memory consumption, which scales as $O(dm)$, where $m$ is the number of workers and $d$ is the dimensionality of the model. Regarding time complexity, the use of a robust aggregation procedure requires additional computation (e.g., applying the coordinate-wise median aggregator requires an additional $O(dm\log(m))$ complexity per round). Note that such additional computation also arises in synchronous Byzantine scenarios. In asynchronous Byzantine-free scenarios, these additional processing costs are absent.

The additional memory overhead and time complexity are a trade-off for enhanced robustness, as they enable the system to apply robust aggregation rules and more effectively mitigate the influence of Byzantine workers. This trade-off is justified by achieving an optimal convergence rate and improved performance of the training process in the presence of adversarial behavior.

Regarding the Temporal Distribution of Byzantine Updates:

We would like to clarify that our analysis does not assume uniform updates for Byzantine workers. The generality of our framework enables us to handle scenarios where Byzantine iterations are concentrated within specific periods, as long as the majority of iterations are honest, in the sense that $t_{honest} \geq (1-\lambda)t,\ \forall t \in [T]$.

In our experiments, we used a uniform distribution of Byzantine updates primarily for simplicity. However, we acknowledge that concentrated Byzantine iterations can indeed cause severe degradation in performance. Such scenarios can significantly increase the honest delays, potentially degrading the overall convergence. We will include this discussion on the behavior of Byzantine updates and their implications in the final version of our paper.

Comment

Thanks for the response. It addressed most of my concerns. Regarding the discussion of Byzantine updates, I found that the experiments still use a typical setup of M Byzantine workers among N total workers, where M < N/2. Since the paper has emphasized the significance of Byzantine updates, it would be helpful to experiment with extreme cases where all workers are Byzantine but fewer than half of their total updates are Byzantine. I have no other issues with the paper and would like to keep my score.

Comment

Dear reviewer,

Thanks again for your positive feedback. We appreciate your suggestion to experiment with extreme scenarios, and agree that it could further support our proposed approach and findings. We will incorporate this into the final version of the paper.

Review (Rating: 6)

The paper addresses the challenges of Byzantine-robust training in asynchronous distributed machine learning systems. The authors propose a novel weighted robust aggregation framework to mitigate the effects of delayed updates and enhance fault tolerance in such environments. They incorporate a recent variance-reduction technique to achieve optimal convergence rates in asynchronous Byzantine settings. The methodology is validated through empirical and theoretical analysis, demonstrating its effectiveness in optimizing performance and fault tolerance.

Strengths

  • The introduction of a weighted robust aggregation framework adapted to asynchronous settings is a significant contribution. This innovation allows for more efficient handling of Byzantine faults, which are critical in distributed systems.
  • The paper provides both empirical and theoretical validation of the proposed methodology. This dual approach strengthens the credibility of the results and demonstrates the practical applicability of the proposed techniques.
  • Achieving an optimal convergence rate in asynchronous Byzantine environments is a noteworthy accomplishment. The incorporation of a recent variance-reduction technique adds further value to the research.
  • The paper provides a thorough review of existing approaches to Byzantine-robust training, both in synchronous and asynchronous settings. This contextualization helps in understanding the advancements made by the current work.

Weaknesses

  • Complexity of Methodology: While the proposed framework is innovative, it is also quite complex. The introduction of weighted aggregators and the double momentum mechanism might be challenging for practitioners to implement and integrate into existing systems.
  • Limited Practical Examples: The paper lacks sophisticated experiments that demonstrate the effectiveness of the framework in practical settings.
  • Dependency on Assumptions: The effectiveness of the proposed method relies heavily on several assumptions, such as bounded delays and the fraction of Byzantine updates. These assumptions might not hold in all practical settings, potentially limiting the method's applicability.

Questions

  • While your theoretical analysis is robust, have you identified any discrepancies or limitations when translating these theoretical results into practical implementations? How sensitive is the proposed methodology to the assumptions made (e.g., bounded delays, fraction of Byzantine updates)? What happens if these assumptions are violated in real-world scenarios? An additional experiment should help.
  • How does the proposed framework scale with an increasing number of workers, or with more or fewer Byzantine workers? Are there any scalability issues or bottlenecks that need to be addressed?

Limitations

N/A

Author Response

Thank you for your valuable feedback. We wish to address some points raised to provide further clarification and insight into our work:

Regarding the Complexity of Methodology:

While our approach may not be trivial to code, conceptually it can be implemented as a combination of several quite simple modules, which simplifies the implementation. Indeed, this is how we translated our method into rather simple Python code. To assist in implementing our methods, we will share our code with the community, including a detailed implementation of the proposed framework. It is worth noting that many widely used optimization algorithms, such as the Adam optimizer, also involve intricate components (momentum and per-feature learning rates) but are popular in practice due to their effectiveness.

Regarding the Practical Examples:

We have conducted additional experiments using the CIFAR-10 dataset, in addition to our earlier experiments on MNIST. These experiments further validate the robustness and performance of our proposed methods in more complex and diverse settings.

The results from the CIFAR-10 experiments align closely with those observed on MNIST, showcasing the consistency and effectiveness of our framework across different types of data. We believe these additional experiments provide a more comprehensive evaluation of our approach and highlight its practicality.

Regarding the Dependency on Assumptions:

We would like to clarify that while our analysis assumes a bounded delay model for theoretical analysis, our algorithmic approach does not rely on this assumption. This assumption is used primarily to facilitate the mathematical analysis and to derive performance guarantees under controlled conditions. In practice, our method can handle varying levels of delay, including scenarios with significant delays. As demonstrated in Figure 1 of the paper, we tested our approach in a setting with heavy delays, and the results show that it performs robustly even under these challenging conditions.

Regarding the assumption of bounded Byzantine iterations: apart from the weighted CTMA, which relies on knowledge of the parameter $\lambda$ (the fraction of Byzantine iterations), our approach with weighted double momentum, as well as robust aggregation rules like the weighted geometric median or the weighted coordinate-wise median, does not require knowing the fraction of Byzantine iterations. These methods can perform effectively without this additional knowledge. While knowing the fraction of Byzantine iterations is beneficial for achieving theoretical optimality, our proposed methods remain robust and practical in the absence of this information.
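To illustrate an aggregation rule that needs no knowledge of $\lambda$, here is a minimal Weiszfeld-style sketch of a weighted geometric median (a standard construction; the paper's exact weighted rule may differ):

```python
import numpy as np

def weighted_geometric_median(outputs, weights, iters=100, eps=1e-8):
    """Approximate argmin_x sum_i w_i * ||x - outputs[i]|| by Weiszfeld iterations.

    Requires no knowledge of the Byzantine fraction; robustness comes
    from downweighting points far from the current estimate.
    """
    outputs = np.asarray(outputs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    x = np.average(outputs, axis=0, weights=weights)  # start at the weighted mean
    for _ in range(iters):
        dists = np.linalg.norm(outputs - x, axis=1)
        inv = weights / np.maximum(dists, eps)        # inverse-distance reweighting
        x_new = (inv[:, None] * outputs).sum(axis=0) / inv.sum()
        if np.linalg.norm(x_new - x) < eps:
            break
        x = x_new
    return x
```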

Additionally, it is quite standard in the synchronous literature to assume knowledge of the fraction of Byzantine workers or to have a reliable approximation of this parameter. In scenarios where such an assumption is not made, alternative solutions often rely on knowing the bound of the stochastic variance, which can be less practical and more challenging to estimate than the fraction of Byzantine workers. This adaptation to the fraction of Byzantine iterations is intended to be reasonable and practical, given that a parallel assumption is widely accepted in the literature on synchronous systems.

Regarding the Further Empirical Evaluation:

We have conducted further experiments for different values of the fraction of the Byzantine iterations and additional experiments on the CIFAR-10 dataset, complementing our existing results from the MNIST dataset. These experiments provide a more comprehensive evaluation of our approach and highlight its practicality.

Regarding the Scalability of Our Approach:

Thank you for pointing this out. We will include this discussion in the final version of the paper.

Regarding memory requirement and computational complexity, our approach scales similarly to standard approaches for the synchronous Byzantine case. Nevertheless, recall that the advantage of the asynchronous setting is that slow workers may not become bottlenecks in the training process, which enhances overall efficiency and flexibility.

Compared to the asynchronous case without Byzantine workers, where the parameter server only needs to maintain memory for a single worker's output along with the global model, our method requires storing the latest outputs of all workers in addition to the global model to ensure robustness against failures. This results in increased memory consumption, which scales as $O(dm)$, where $m$ is the number of workers and $d$ is the dimensionality of the model. Regarding time complexity, the use of a robust aggregation procedure requires additional computation (e.g., applying the coordinate-wise median aggregator requires an additional $O(dm\log(m))$ complexity per round). Note that such additional computation also arises in synchronous Byzantine scenarios. In asynchronous Byzantine-free scenarios, these additional processing costs are absent.

Also, it is worth noting that the number of Byzantine workers, whether increasing or decreasing, does not affect the scalability of our approaches for a fixed number of workers $m$.

Overall, the additional memory overhead and time complexity are a trade-off for enhanced robustness, as they enable the system to apply robust aggregation rules and more effectively mitigate the influence of Byzantine workers. This trade-off is justified by achieving an optimal convergence rate and improved performance of the training process in the presence of adversarial behavior.

Comment

I appreciate the additional information and experiments they have conducted in response to my comments. The authors have addressed my concerns regarding the complexity of the methodology, practical examples, dependency on assumptions, further empirical evaluation, and scalability.

Regarding the dependency on assumptions, I am now more convinced that their methods can be effective without precise knowledge of the fraction of Byzantine iterations. The comparison with standard assumptions in the literature is also a valid point.

Given the authors' efforts to address my concerns and the additional evidence provided, I maintain my original score as the paper still exhibits novelty in its approach. However, I recommend that the authors incorporate the discussed points and the new experiments into the final version of the manuscript to enhance its clarity and robustness.

Review (Rating: 6)

The authors investigate asynchronous training of convex models based on a parameter server in the Byzantine failure setting, evaluating their method on MNIST.

Strengths

  • The authors' method achieves the optimal convergence rate, which they claim prior work in asynchronous Byzantine ML has not achieved.
  • The authors extend the standard measure considered in the Byzantine failure literature (the number of Byzantine workers) to a setting more natural for asynchronicity by considering the number of Byzantine updates.

Weaknesses

  • The authors do not motivate scenarios in which asynchronous convex optimization is relevant or applied in practice.

Questions

  • Which convex model did you implement and are you evaluating in Section 5 and Appendix D?

Limitations

  • The proposed method and associated guarantees do not apply to non-convex models

Author Response

Thank you for your valuable feedback. We wish to address some points raised to provide further clarification and insight into our work:

Regarding the Convex Setting:

The convex case is important for two main reasons: It captures classical ML problems like linear and logistic regression, and it serves as a theoretical testbed for analyzing and developing novel ideas and algorithms in various machine-learning scenarios. These new algorithms often induce novel heuristics that are very useful in the context of general ML problems, including non-convex ones and specifically in Deep Learning scenarios. Indeed, this approach is prevalent and has led to many important contributions in ML. Two prominent examples are the well-known Adagrad and Adam algorithms, which were conceived and analyzed under the convexity assumption. Other influential examples are the many works on convex synchronous/asynchronous training, and works on Byzantine-resilient algorithms for convex problems, see e.g.:

  • Dan Alistarh, Zeyuan Allen-Zhu, and Jerry Li. Byzantine stochastic gradient descent. Advances in Neural Information Processing Systems, 31, 2018. https://arxiv.org/pdf/1803.08917
  • Minghong Fang, Jia Liu, Neil Zhenqiang Gong, and Elizabeth S Bentley. Aflguard: Byzantine robust asynchronous federated learning. In Proceedings of the 38th Annual Computer Security Applications Conference, pages 632–646, 2022. https://arxiv.org/pdf/2212.06325

For the asynchronous Byzantine problem, the convex framework enables the development of a more intuitive and naturally weighted scheme. It provides a foundational basis for extending these findings to non-convex settings, which is a challenging and ongoing research area. In fact, extending our weighted scheme to the non-convex setting requires less straightforward analysis and necessitates a careful adaptation to accommodate the stochastic error inherent in non-convex scenarios. For further details, please refer to lines 259-271 and the Future Work section of our paper.

Please note that in our work, we empirically examine the ideas developed in the non-convex setting, specifically focusing on non-convex deep learning problems. Our empirical study demonstrated the practicality and usefulness of our approach in deep learning scenarios, complementing our theoretical findings. Additionally, we conducted further experiments using the CIFAR-10 dataset to validate the effectiveness of our methods in a more complex, non-convex machine learning task.

Regarding the Experimental Setup:

As with other optimization algorithms like Adagrad, which were analyzed in convex settings and employed in non-convex scenarios, we demonstrated the performance of our approach in a non-convex context. We used a 2-layer convolutional network designed for the MNIST dataset. The network includes two convolutional layers (20 filters and 50 filters, both 5x5), followed by ReLU activation and 2x2 max pooling. This is followed by a fully connected layer with 50 units, batch normalization, and a final fully connected layer with 10 units. Similarly, for the new CIFAR-10 experiments, we applied an analogous architecture.
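A minimal PyTorch sketch of the described network, assuming 1x28x28 MNIST inputs; applying ReLU and pooling after each convolution and the exact placement of batch normalization are assumptions about the description above, not confirmed details:

```python
import torch
import torch.nn as nn

class MnistCNN(nn.Module):
    """Two 5x5 conv layers (20 and 50 filters), each with ReLU and 2x2
    max pooling, then FC(50) with batch norm and a final FC(10)."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 20, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(20, 50, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
        )
        # 28x28 -> 24x24 -> 12x12 -> 8x8 -> 4x4, so 50 * 4 * 4 = 800 features.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(50 * 4 * 4, 50), nn.BatchNorm1d(50), nn.ReLU(),
            nn.Linear(50, 10),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = MnistCNN()
print(model(torch.randn(64, 1, 28, 28)).shape)  # torch.Size([64, 10]); batch of 64 as in the rebuttal
```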

Thank you for highlighting this. We will add further details about the setup and implementation of our experiments in the final version of the paper to enhance the clarity of the experimental configuration.

Author Response

Dear reviewers,

We appreciate your valuable feedback and have included additional results on the CIFAR-10 dataset. CIFAR-10 is a more complex dataset consisting of 60,000 32x32 color images in 10 classes, with 6,000 images per class. These results demonstrate the effectiveness of our proposed approaches in a more complex and challenging scenario.

We have used the same setup as described for the MNIST dataset in Appendix D, with a similar architecture: two convolutional layers (20 filters and 50 filters, both 5x5), followed by ReLU activation and 2x2 max pooling. This is followed by a fully connected layer with 50 units, batch normalization, and a final fully connected layer with 10 units. Our experiments were conducted on an NVIDIA GeForce RTX 3090 GPU using the PyTorch framework. The results were averaged over three random seeds for robustness, with each worker computing gradients based on a batch size of 64.

Our results on CIFAR-10 align with those obtained on the MNIST dataset, confirming the robustness of our proposed approaches and demonstrating performance improvements across different scenarios.

We have also conducted further experiments for different values of the fraction of the Byzantine iterations with a fixed number of Byzantine workers (see Figure 4 in the attached PDF) to demonstrate the impact of the proportion of Byzantine iterations on model performance within our approaches.

Reference:

  • CIFAR-10 dataset: Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The CIFAR-10 dataset. Online: http://www.cs.toronto.edu/~kriz/cifar.html, 55(5), 2014.
Final Decision

Based on the current reviews and follow-up discussions, I recommend acceptance of this paper. Here is a summary of the main comments on the paper. The strengths outweigh the weaknesses, and the authors tried to address reviewers' concerns during the rebuttal.

Strengths

  • Novel problem formulation considering the interplay between asynchronous aggregation in distributed ML and byzantine robustness.
  • The algorithm is supported by theoretical analysis that gives convergence guarantees.

Weaknesses

  • Analysis is limited to convex settings.
  • Experiments are limited to the MNIST dataset. In the rebuttal, the authors did provide additional results on CIFAR-10.