Thank you for your valuable feedback. We wish to address some points raised to provide further clarification and insight into our work:

Regarding the Assumptions on Byzantine Behavior: “... it raises questions about whether alternative, potentially simpler solutions could be explored”

This is a very good question. Our goal was to develop the simplest approach for which we could draw provable guarantees for the general Byzantine asynchronous setting, and our approach indeed ensures such guarantees. It is an important open question to understand whether a simpler approach can work in this case, and we believe that this is an interesting future direction.

Monitoring update frequencies can potentially assist, for example, in an extreme case where we have a dominant worker who appears more than half times in some of the iterations, then this worker can be identified as an honest worker and might serve as an honest reference for anomaly detection. While this can lead to a simple approach for this specific case, we believe the guarantees will coincide with ours. In less extreme cases, however, relying solely on update frequencies to detect outliers becomes more challenging within the general Byzantine framework. This framework assumes only a majority of iterations are honest, without prior knowledge of additional characteristics of Byzantine updates. The generality of this framework allows us to develop an approach applicable to a wide range of scenarios. Nevertheless, this broad applicability makes it difficult to set definitive criteria for identifying anomalous behavior based on frequency alone. For example, a Byzantine worker could adapt their behavior to mimic honest patterns, making identification more difficult. Leveraging additional prior knowledge might be useful in developing simpler approaches, and this is a future work to investigate.

We will add this discussion to the final version of the paper.

Regarding the Impact on System Metrics:

Thank you for pointing this out. We will include this discussion in the final version of the paper.

Regarding memory requirement and computational complexity, our approach scales similarly to standard approaches for the synchronous byzantine case. Nevertheless, recall that the advantage of the asynchronous setting is that slow workers may not become bottlenecks in the training process, which enhances overall efficiency and flexibility.

Compared to the asynchronous case without Byzantine workers, the parameter server only needs to maintain memory for a single worker's output along with the global model. However, to ensure robustness against failures, our method requires storing the latest outputs of all workers in addition to the global model. This results in increased memory consumption, which scales with , where represents the workers’ number and represents the dimensionality of the model. Regarding time complexity, the use of a robust aggregation procedure requires an additional computation (e.g. applying the coordinatewise median aggregator requires an additional complexity per round). Note that such additional computation also arises in synchronous Byzantine scenarios. In asynchronous Byzantine-free scenarios, these additional processing costs are absent.

The additional memory overhead and time complexity is a trade-off for enhanced robustness, as it enables the system to apply robust aggregation rules and more effectively mitigate the influence of Byzantine workers. This trade-off is justified by achieving an optimal convergence rate and an improved performance of the training process in the presence of adversarial behavior.

Regarding the Temporal Distribution of Byzantine Updates:

We would like to clarify that our analysis does not assume uniform updates for Byzantine workers. The generality of our framework enables us to handle scenarios where Byzantine iterations are concentrated within specific periods, as long as the majority of iterations are honest, in the sense that .

In our experiments, we used a uniform distribution of Byzantine updates primarily for simplicity. However, we acknowledge that concentrated Byzantine iterations can indeed cause severe degradation in performance. Such scenarios can significantly increase the honest delays, potentially degrading the overall convergence. We will include this discussion on the behavior of Byzantine updates and their implications in the final version of our paper.