PaperHub
Rating: 4.6 / 10 (Rejected, 7 reviewers)
Lowest: 3 · Highest: 6 · Standard deviation: 1.0
Individual ratings: 3, 6, 5, 5, 3, 5, 5
Average confidence: 3.9
ICLR 2024

Stress Testing Byzantine Robustness in Distributed Learning

OpenReview · PDF
Submitted: 2023-09-16 · Updated: 2024-02-11

Abstract

Keywords
Byzantine robustness · distributed machine learning · attacks

Reviews and Discussion

Official Review
Rating: 3

The authors propose a model of the optimal Byzantine adversary and derive a novel long-term attack strategy called JUMP based on the simplified version of the proposed model. Experiments on two image classification tasks show that JUMP performs better than or comparably to existing attacks.

Strengths

The paper is not hard to understand. The presented adversary model (P) looks reasonable to me. Experimental results show that JUMP performs better than existing attacks on degrading model accuracy.

Weaknesses

Despite the strengths, there are also some concerns.

  1. Although the adversary model (P) looks reasonable, it seems oversimplified in Section 3. Specifically, after simplifications (i) and (ii) in Section 3, the adversary vector is restricted to be collinear with the average of the true gradients. Moreover, consider that in the FOE attack (Xie et al., 2020), the adversary vector is $-\epsilon$ times the average of the true gradients. Although I understand that the hyper-parameter $\epsilon$ in FOE is a pre-fixed constant, while $\lambda_t$ in JUMP is obtained by solving the optimization problem $(P_l)$, the novelty of JUMP is limited due to the oversimplification. It seems that simplification (ii) simplifies the optimization problem but greatly weakens the generality of the proposed model. The reason why $a_t$ is restricted to be collinear with the average of the true gradients is not well specified. I would greatly appreciate it if the authors could comment on this.

  2. The proposed attack requires much more information than existing attacks. Other than the clients' local gradients (or momentum) that are required in ALIE (Baruch et al., 2019) and FOE (Xie et al., 2020), the proposed attack JUMP also assumes that the adversary has access to the global honest loss $\mathcal{L}_\mathcal{H}(\cdot)$ and the robust aggregation rule $F(\cdot)$ on the server. The extra assumption needs to be specified. Moreover, could the proposed method JUMP deal with the case where there is randomness in the robust aggregation $F(\cdot)$ and the aggregated result is not a deterministic value?

  3. JUMP seems to have a much higher computation cost than existing attacks such as ALIE and FOE. However, there is neither a theoretical analysis of the time complexity nor the experimental results of the running time in the main text.

Questions

Please comment on my concerns above.

Comment

We thank you for your comments and address your concerns below. We hope that you re-evaluate our paper in this light, and we are happy to answer any further questions.

Although the adversary model (P) looks reasonable, it seems oversimplified in section 3

We agree that the simplifications may seem to degrade the power of the adversary. In fact, we dedicate Section 4 to explaining why these simplifications *only marginally* degrade the strength of the attacker. For example, regarding the collinearity constraint brought up by the reviewer, we show in Section 4.1 that removing this constraint does not improve the attack by much, since the main ingredient of the attack is to overshoot minima, i.e., to move in the same direction as the average honest gradient, which is allowed by our collinearity constraint.
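
For concreteness, a sketch of the constraint under discussion, in the notation of the review ($\bar{g}^H_t$ denoting the average honest gradient; the paper's exact notation may differ):

$$ a_t \;=\; \lambda_t \, \bar{g}^H_t, \qquad \lambda_t \in \mathbb{R}, $$

whereas FOE fixes $\lambda_t \equiv -\epsilon$ for a pre-chosen constant $\epsilon > 0$; in Jump, $\lambda_t$ is obtained by solving $(P_\ell)$ and may in particular be positive and large, which is what enables overshooting.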

The proposed attack requires much more information than existing attacks

We acknowledge that our attack is open-box and has full knowledge of the system and defense. This is currently stated clearly in sections 1.1 and 2. The purpose of our paper is not to propose a realistic or black-box attack. Rather, we (i) formulate and study the strategy followed by the optimal attack (i.e., solving (P)) in sections 2 and 4, respectively, and (ii) propose a tractable attack which retains most of the strength of the worst-case attack by building on the same strategy, as explained in Section 4. We recall that the aforementioned strategy consists in overshooting minima by amplifying the direction of the average honest gradient, at the right moment of training by being aware of the loss landscape. Furthermore, based on our results, we provide researchers with a tool to stress test the strength of their defenses in the worst case, using one attack, since Byzantine-robust defenses (Allouah et al. 2023, Karimireddy et al. 2021, 2022, Farhadkhani et al. 2022) theoretically claim to withstand any attack. For example, even with the full knowledge of our attack, the SOTA defenses all perform well under low data heterogeneity, which actually confirms existing theory in the i.i.d. setting (Farhadkhani et al. 2022).

Besides, our attack does not need to know all the data of honest clients; it only needs to know the average of honest gradients (see Algorithm 1) and some proxy data from the same (global) distribution. While this is not a concern in this paper, in practice this can be done by eavesdropping and data reconstruction attacks (e.g., gradient inversion). We will emphasize this point further in the paper.

Finally, Shejwalkar & Houmansadr (2021) analyze state-of-the-art attacks which possess the same level of knowledge as Jump, i.e., knowledge of the system, honest updates and global distribution. Our work belongs to the same line of research.

JUMP seems to have a much higher computation cost than existing attacks such as ALIE and FOE. However, there is neither a theoretical analysis of the time complexity nor the experimental results of the running time in the main text

We provide an analysis in Section A of the computational cost and also its trade-off with the power of the attack for various segment lengths. Notably, the greedy version of Jump ($\tau=1$) has a reasonable runtime, especially when considering that future benchmarks will only need to run Jump instead of several other attacks. We could not include this discussion in the main body due to lack of space, and because we admittedly focus on a worst-case unbounded adversary.

Clarifying remarks

Our attack is not simply an optimization of FOE. We show in Section 4.2, among other factors, that allowing the attack vector to be in the same direction as the true gradient is essential to Jump, and is a key reason for its superiority over FOE. Also, directly maximizing the honest loss grants adversaries the ability to circumvent the defense at the right moment during training. Finally, Jump enables adversaries to plan ahead when $\tau>1$ and maximize the training loss across iterations, which is not possible with FOE and other attacks.

Comment

Thank you for the detailed response. However, the concerns listed below have not been adequately addressed.

  1. My concern about the oversimplification of the model (P) remains. The simplified adversary model $(P_l)$ loses many possibilities of potential attack strategies. Moreover, the greedy version ($\tau=1$) of Jump is quite close to the existing attack FOE, while the run time significantly increases when $\tau$ increases, as shown in Section A.

  2. The proposed attack requires knowledge of the workers' average gradient, the server's defense strategy, and some proxy data of the global distribution, which is unrealistic. The authors claim that they provide researchers with a tool to stress test the strength of their defenses. However, the contribution is quite limited since the attack used in the test can rarely be carried out in the real world.

Given the reasons above, my rating remains.

Official Review
Rating: 6

The paper proposes JUMP, a new adaptive attack strategy for Byzantine-robust learning algorithms. The key idea is to formulate the attack problem as an optimization problem and solve it with off-the-shelf solvers. This problem is highly challenging to solve directly, so some simplifications are made. Experiments show that JUMP can significantly outperform existing attacks.

Strengths

  1. The proposed algorithm is significantly better than baselines.
  2. The optimization problem formulation is new and novel given most existing attacks are heuristics-based.

Weaknesses

This is an interesting work for which I did not find major weaknesses; a few minor points could be improved.

  1. The JUMP optimization problem seems to assume realized gradient sampling, which might be a bit strong in practice; it could possibly be formulated as a robust or expectation optimization problem without assuming the noise is realized.
  2. More datasets in experiments can always help validate the algorithm more.

Questions

Could the authors clarify whether the same gradient sampling noise is used in the solver and in the true algorithm update, or whether different realizations are used?

Comment

We thank you for your encouraging comments and address your concerns below. We hope that you re-evaluate our paper in this light, and we are happy to answer any further questions.

Could the authors clarify whether the same gradient sampling noise is used in the solver and in the true algorithm update, or whether different realizations are used?

In the results shown in our paper, the attack uses the same noise realizations as those of the honest workers. This is in line with our threat model, where the adversary has full knowledge, including which batches are used by honest workers. Interestingly, even with this knowledge, the SOTA defenses all perform well under low data heterogeneity, which confirms existing theory in the i.i.d. setting (Farhadkhani et al. 2022). We will include a discussion on this point in the paper.

The JUMP optimization problem seems to assume realized gradient sampling, which might be a bit strong in practice; it could possibly be formulated as a robust or expectation optimization problem without assuming the noise is realized.

We appreciate your advice on loosening our assumption by formulating our problem as a robust optimization problem without assuming the noise is realized. We agree this could yield more realistic attacks, but we recall that our goal is to propose a worst-case attack by a truly Byzantine adversary, i.e., one that has full knowledge and controls all malicious workers. Indeed, theoretical works in our line of research (Allouah et al. 2023, Karimireddy et al. 2021, 2022, Farhadkhani et al. 2022) claim to defend against such a worst-case adversary, yet we argue that they do not actually implement the latter in their validation experiments.

Official Review
Rating: 5

This paper proposes an untargeted attack on FL called the Jump attack, which aims to indiscriminately reduce the accuracy of the FL global model. The Jump attack first computes the average of honest gradients and scales it by $\lambda$ to obtain the final malicious gradient that all the malicious clients (attackers) then send to the server. Jump computes $\lambda$ at certain intervals, e.g., in each FL round or after every $\tau$ rounds. The paper shows how the Jump attack can overshoot the good minima of a simple non-convex problem and force the model to a bad minimum, while other attacks only slow down the convergence of the model to the good local minima. The experimental results show that the attack outperforms current SOTA attacks for certain FL settings.

Strengths

  • Jump attack seems very simple and potentially effective
  • Section 4 is very useful and something that I haven’t seen in other works

Weaknesses

  • Experimental evaluation is insufficient and unfair to SOTA attacks
  • It is not clear what is the solver used in the Jump attack
  • Some of the relevant attacks are not evaluated
  • Some very relevant works are not cited

Questions

Jump attack is quite smart and I like the idea primarily because it might be very easy to implement in practice. However I have multiple concerns about the current draft.

  • Shejwalkar & Houmansadr (2021) also use something similar to Jump's malicious gradient in (4); e.g., in equation (6) of the Shejwalkar & Houmansadr (2021) attack, if we substitute $\eta_t = (1 - \lambda_t)$ while $p_t = -\bar{g}^H_t$, we get the Jump attack. Given this, I think the authors should discuss the distinction between the two attacks more explicitly.
  • Experimental evaluation is on very small settings: total number of clients is 12 which is very small for the FL settings that are vulnerable to model poisoning in practice [1]. This does not discount the utility of the work, but authors should be clear in terms of the goals of the work and where it is useful, especially positioning it with the conclusions of [1] so that readers will understand how to use this work.
  • Next, I think the experiments performed are not fair: Jump attack uses the knowledge of the robust aggregation but the attacks it is compared with are aggregation-agnostic attacks; authors should provide a fair comparison, i.e., compare with aggregation-tailored attacks of Shejwalkar & Houmansadr (2021).
  • I did not understand what is the solver used in the work and how it solved the Jump’s objective; please add this to the main paper.
  • Experimental setup is an important part of conveying the utility of the work. Please check [3] for details and consider more practical settings, e.g., challenging datasets, more clients, etc.

[1] Shejwalkar et al., Back to the Drawing Board: A Critical Evaluation of Poisoning Attacks on Production Federated Learning, IEEE S&P 2022
[2] Fang et al., Local Model Poisoning Attacks to Byzantine-Robust Federated Learning, USENIX 2019
[3] Khan et al., On the Pitfalls of Security Evaluation of Robust Federated Learning, IEEE S&P WSP 2023

Comment

We thank you for your encouraging comments and address your concerns below. We hope that you re-evaluate our paper in this light, and we are happy to answer any further questions.

Shejwalkar & Houmansadr (2021) also use something similar to Jump's malicious gradient in (4); e.g., in equation (6) of the Shejwalkar & Houmansadr (2021) attack, if we substitute $\eta_t = (1 - \lambda_t)$ while $p_t = -\bar{g}^H_t$, we get the Jump attack. Given this, I think the authors should discuss the distinction between the two attacks more explicitly.

We have included in Section 4.2 a comparison with the attacks developed by Shejwalkar & Houmansadr (2021).

There, among other factors, we explain that a key difference is that Jump allows the perturbation vector to be in the *same direction* as the average honest gradient, in order to overshoot minima. For Shejwalkar & Houmansadr's (2021) attacks, because of the optimization scheme they use, the adversary can only try to deviate from the direction of the average honest gradient (the scale factor is positive).
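
For concreteness, and assuming equation (6) of Shejwalkar & Houmansadr (2021) has the additive form implied by the reviewer's substitution, the algebra reads:

$$ \bar{g}^H_t + \eta_t\, p_t \;\overset{p_t = -\bar{g}^H_t,\ \eta_t = 1-\lambda_t}{=}\; \bar{g}^H_t - (1-\lambda_t)\, \bar{g}^H_t \;=\; \lambda_t\, \bar{g}^H_t, $$

so a positive scale factor $\eta_t \ge 0$ corresponds to $\lambda_t \le 1$, i.e., the malicious vector can never amplify the honest direction beyond the honest gradient itself, which is exactly the overshooting capability that Jump adds.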

Furthermore, Shejwalkar & Houmansadr (2021) focus on the i.i.d. setting, which is arguably not the norm in real-world scenarios, and do not extensively investigate the heterogeneous case. In our work, we consider different levels of heterogeneity, using Dirichlet sampling. In terms of threat model, the strongest one considered by Shejwalkar & Houmansadr (2021) also assumes full knowledge. Yet, their attacks only leverage the knowledge of the aggregation and honest updates, while our attack also utilizes the global honest loss function. We will also include this discussion in the paper.

Experimental evaluation is on very small settings: total number of clients is 12 which is very small for the FL settings that are vulnerable to model poisoning in practice [1]. This does not discount the utility of the work, but authors should be clear in terms of the goals of the work and where it is useful, especially positioning it with the conclusions of [1] so that readers will understand how to use this work.

We acknowledge that our experiments are not large-scale, since our work is directed towards researchers who need strong attacks to verify their theoretical worst-case Byzantine robustness claims. We will emphasize this point further in the paper.

Besides, regarding how challenging the datasets are, we use heterogeneous variants of MNIST and CIFAR-10, where the extreme heterogeneity scenario is challenging and is common in Byzantine-robustness works (Allouah et al. 2023, Karimireddy et al. 2022). In any case, note that the more challenging the task, the harder it is for the defender and not the attacker, which is why we focus on moderate heterogeneity, or low fraction of Byzantine workers.

Next, I think the experiments performed are not fair: Jump attack uses the knowledge of the robust aggregation but the attacks it is compared with are aggregation-agnostic attacks; authors should provide a fair comparison, i.e., compare with aggregation-tailored attacks of Shejwalkar & Houmansadr (2021).

We believe that it would be inadequate to compare Jump to aggregation-tailored attacks of Shejwalkar & Houmansadr (2021), given that these attacks choose perturbation vectors depending on the defense, unlike Jump. Now, to automatically choose a perturbation vector for any aggregation rule, Shejwalkar & Houmansadr (2021) propose to simulate a smaller FL system and evaluate aggregation-tailored attacks with all possible perturbation vectors, then pick the best. However, it is unclear how this can be implemented outside the i.i.d. setting they were intended for, since reproducing the data heterogeneity of the original system is not straightforward. We will include this discussion in the paper.

For completeness, **we updated our paper with results** of our experiments with the aggregation-tailored attack of Shejwalkar & Houmansadr (2021) that uses the inverse unit perturbation vector as an example (see Section 4.2). The main change in our results of Table 1 (Section 5) is that when $f=3$, using Krum with NNM, our attack becomes comparable to the aggregation-tailored attack (``AGRT'' in Table 1). Otherwise, our conclusions still hold.

Comment
  1. Comparison with SH2021:
  • Thanks for adding the discussions and comparisons with SH2021.
  • About the evaluation setup comparison: I agree that i.i.d. FL is not practical and I appreciate that the current work sticks to the non-i.i.d. setting. However, it is more difficult to attack i.i.d. settings than to attack non-i.i.d. settings, as the latter allows the attacker much more leeway to craft malicious vectors while remaining undetected. Hence, if an attack works in i.i.d. settings, it will definitely work in non-i.i.d. settings.
  • SH2021 also consider partial knowledge attacks where the attacker does not use knowledge of honest updates. Nevertheless, if you really want to stress-test FL (as your next answer mentions) you should consider both full and partial knowledge attacks as in SH21. Although I think this is generally not needed as it is close to impossible for attackers to access honest updates or losses (check [1] in the references I provided before).
  2. Evaluation setup:
  • The current work is definitely very useful. But IMO, it will be much more useful if you increase the total number of clients from 13/15/17 to 1000 for MNIST or CIFAR10 and consider the cross-device setting (along with cross-silo). In my experience, running FedAvg or similar algorithms on small datasets like MNIST or CIFAR10 with 1000 clients is possible with minimal resources, e.g., a single 1080ti GPU; you can even run FEMNIST with 3400 clients.
  • The extreme heterogeneity you consider is Dirichlet with $\alpha=0.1$, which is great only if you have a very large number of clients. With the small number of clients considered here, this distribution will not be extremely heterogeneous.
  • I agree that to stress-test defenses, one should make settings in favor of the attacker, and that is exactly why I suggest you consider larger numbers of clients and cross-device settings.
  3. Fairness of comparisons:
  • AGRT of SH2021 are not defense dependent; at best, they are dataset dependent, e.g., it is suggested to use the inverse standard deviation for CIFAR10.
  • The root of the unfairness is the fact that the JUMP attack uses all the available information, including honest local losses, the defense, etc., while the baseline attacks do not; e.g., ALIE, Min-max, and Mimic are defense-agnostic, and LF and SF are very weak in general.
  • Emulating FL locally as in SH21: you know that the data is Dirichlet distributed with some $\alpha$. You can use the same to emulate FL locally. Thanks for adding new comparisons, but please find the correct perturbation as discussed above.

My concerns 4 and 5 in the weaknesses of my original review remain unaddressed.

Comment

We thank you for your response and follow up on your comments below. We hope that you re-evaluate our paper in this light.

About the evaluation setup comparison: I agree that i.i.d. FL is not practical and I appreciate that the current work sticks to the non-i.i.d. setting. However, it is more difficult to attack i.i.d. settings than to attack non-i.i.d. settings, as the latter allows the attacker much more leeway to craft malicious vectors while remaining undetected.

We agree with both statements. This is why we focus in the main paper on a moderate heterogeneity setting ($\alpha=1$), which is both practical and challenging to attackers. In addition, we have results of experiments on low, moderate, and high data heterogeneity in the appendix.

Nevertheless, if you really want to stress-test FL (as your next answer mentions) you should consider both full and partial knowledge attacks as in SH21.

We stress test Byzantine robustness in distributed learning against an open-box adversary. We agree that the partial knowledge setting is interesting on its own, but is out of the scope of our work.

But IMO, it will be much more useful if you increase the total number of clients from 13/15/17 to 1000 for MNIST or CIFAR10, consider cross-device setting (along with cross-silo).

We thank the reviewer for their suggestion. We agree that a larger scale experiment can be an interesting addition. However, we chose to focus on FL settings that are reasonably easy for the defender, and where theoretical defenses have tight guarantees. In contrast, cross-device settings, where client subsampling and local steps are used, have considerably weaker and fewer theoretical guarantees.

The extreme heterogeneity you consider is Dirichlet with $\alpha=0.1$, which is great only if you have a very large number of clients. With the small number of clients considered here, this distribution will not be extremely heterogeneous.

For the experiments we show, most clients have only one label in their dataset, which we believe is an extreme form of heterogeneity.
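
As an illustration (a minimal sketch, assuming the common per-client Dirichlet label-proportion convention; the paper's exact partitioning code may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, num_clients, num_classes = 0.1, 10, 10
# One Dirichlet draw per client over the class proportions.
proportions = rng.dirichlet([alpha] * num_classes, size=num_clients)
# Share of each client's data taken up by its single most frequent class:
# with alpha = 0.1 this is usually dominated by one or two classes.
print(proportions.max(axis=1).round(2))
```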

I agree that to stress-test defenses, one should make settings in favor of attacker, and that is exactly why I suggest you consider larger numbers of clients and cross-device settings.

For clarity, we would like to reemphasize that our goal is to stress test the worst-case guarantees of Byzantine-robust defenses, and thus we chose our FL settings to be reasonably *in favor of the defender* (not the attacker): low fraction of Byzantine workers, moderate heterogeneity, SOTA defenses for non-i.i.d. settings, etc.

AGRT of SH2021 are not defense dependent; at best, it is dataset dependent.

We believe that there is a misunderstanding: in Section IV.B, Shejwalkar & Houmansadr (2021) propose using different optimization formulations depending on the aggregation used. This led us to think that aggregation-tailored attacks are indeed defense-dependent.

Also, in Section VI.C, Shejwalkar & Houmansadr (2021) utilize knowledge of the server's aggregation to choose perturbation vectors. Figure 2 also shows that even with the same dataset, different aggregations sometimes have different best perturbations (e.g., on CIFAR10, Krum's best perturbation vector is the inverse unit vector, but Trimmed-mean's best perturbation is the standard deviation).

The root of unfairness is the fact that JUMP attack uses all the available information, including honest local losses, defense, etc., while the baseline attacks do not, e.g., ALIE, Min-max, Mimic are defense-agnostic, LF, SF are very weak in general.

We have added results for AGRT, which has full knowledge, and is still outperformed by Jump. Besides, the fact that previous attacks have not utilized the open-box adversary's information is arguably not ``unfairness''. A major contribution of our work is precisely on how to leverage the full information to break defenses via the overshooting strategy of Jump.

Comment

Emulating FL locally as in SH21: you know that the data is Dirichlet distributed with some $\alpha$. You can use the same to emulate FL locally. Thanks for adding new comparisons, but please find the correct perturbation as discussed above.

While an open-box attacker could know the distribution of the data, it would be unreasonable to assume that the defender (e.g., server) knows this information. In turn, we note that the method shown in the original paper of Shejwalkar & Houmansadr (2021), which assumes an i.i.d. setting, is possible in practice; the attacker can access a corrupted worker's data, which is representative of the global distribution given the i.i.d. assumption.

Moreover, the data distribution obtained with Dirichlet sampling depends on the number of workers, which can be problematic for the method proposed by the reviewer to extend the perturbation vector search method of SH2021. Once again, finding a correct perturbation vector does not seem straightforward in the non-i.i.d. setup.

In any case, we still showed the performance of AGRT with the unit perturbation vector, which we saw performed well with MinMax and MinSum, the results of which use the same perturbation vector.

My concerns 4 and 5 in weakness in original review remain unaddressed.

Regarding concern 5, we have justified our experimental setup in our previous response: we chose our FL settings to be reasonably in favor of the defender (not the attacker): low fraction of Byzantine workers, moderate heterogeneity, SOTA defenses for non-i.i.d. settings, no client subsampling/local steps, etc.

Regarding concern 4, we apologize that our answer was not included in our initial response by error. Here is our response:

I did not understand what is the solver used in the work and how it solved the Jump’s objective.

In Section 3.2, we discuss appropriate nonlinear solvers for our tasks. We recall that the constraints of Problem $(P_\ell)$ are non-differentiable in general due to robust aggregation. This motivates the use of derivative-free nonlinear solvers such as Powell's (Powell, 1964), Nelder-Mead's (Nelder & Mead, 1965), and Differential Evolution (Storn & Price, 1997). In fact, we have tested all the aforementioned methods on MNIST and our toy two-dimensional tasks. Both cases showed that Powell's is most effective, while Differential Evolution can obtain better solutions but is much slower than Powell's, so we use Powell's in our main experiments. Powell's method, strictly speaking Powell's conjugate direction method, minimizes the objective function by a bi-directional search along each search vector, successively. The bi-directional line search along each search vector can be done by golden-section search, or Brent's method when the search space is unbounded. The algorithm iterates until no significant improvement is made. We will add a detailed description in the paper.
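
As a rough illustration of how such a derivative-free solver can be invoked (a minimal sketch, not the paper's implementation: `segment_loss` is a hypothetical placeholder that simulates the $\tau$ robust-aggregation and update steps of a segment and returns the resulting honest loss):

```python
import numpy as np
from scipy.optimize import minimize

def solve_segment(segment_loss, tau, lambda_init=None):
    """Choose the tau scaling factors of one segment by maximizing the
    simulated honest loss with Powell's derivative-free method."""
    x0 = np.zeros(tau) if lambda_init is None else np.asarray(lambda_init, dtype=float)
    # Powell's method uses no gradients, so the non-differentiable robust
    # aggregation hidden inside segment_loss poses no problem.
    result = minimize(lambda lam: -segment_loss(lam), x0, method="Powell")
    return result.x  # lambda_{(ell-1)*tau + 1}, ..., lambda_{ell*tau}
```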

We will be happy to answer any further questions.

Official Review
Rating: 5

A new Byzantine attack on federated learning is introduced, that purports to significantly improve upon prior techniques.

Strengths

Interesting problem space, a significant number of numerical experiments (although I do have questions regarding whether the experiments properly contextualise the performance as the number of Byzantine agents grows, as the primary comparisons top out at 29% compromised), and well-incorporated toy models that have been appropriately used to help explain key dynamics.

Weaknesses

The style and quality of the writing is to me a significant issue, as it makes it difficult to parse the authors' intent at times (examples of some points of concern in the earlier sections are raised below). I also hold concerns relating to the nature of the contributions and the claimed performance increases that the authors have achieved, especially as the level of outperformance drops in the results reported in the appendices.

The following is a set of general issues/questions I have with specific parts of the paper, as well as a commentary on areas where the issues regarding the writing style and lack of specificity in the written content cause interpretation issues (with particular focus placed upon the abstract and introduction).

General questions:

  • The concept of robust aggregation processes is to filter out likely adversaries. If all adversaries are equal, then detecting this cluster would likely be a trivial inclusion into robust aggregation. How would your work perform if you were unable to impose uniformity among the adversary update vectors?
  • How reasonable is the assumption regarding the model dimension? Obviously this is a worst-case assumption, but if the adversary only has access to the local information on the set of compromised agents, does this assumption hold in practice? If an attacker has enough information to know the updates of all agents (both compromised and uncompromised) then they've compromised the central update server, which you've assumed is secure. This seems like an inherent contradiction in the threat model. I will note though that I believe I have seen this in other byzantine works, so this may not be an issue specific to your work.
  • You note on page 5 that this approach works for any variant of robust D-SGD - but how do these modifications influence the performance of your technique?
  • Could you explain the segment length in additional detail? At the moment, problem $P_l$ implies that the attacker only acts every $\tau$ steps - but this would appear to be contradicted by the loss subfigure from Figure 1.
  • One of the reasons that JUMP is claimed to work is that it can force trajectory jumps over global minima - however this would seem to be an observation that would be very sensitive to the largest allowable trajectory jumps, and yet this doesn't seem to be a property that has been explored.
  • How would your technique perform in the context of a) more agents, or b) when the number of agents is fixed, but the proportion of byzantine agents increased?
  • Why does Table 1 show results that are only positive for your technique, whereas the "full numerical results" from Appendix C2 reveal ranges of setups where your technique is outperformed. This feels as if the main body contents have been pruned to show your technique in the best possible light. Could you explain the choices of content included in Table 1 vs C2?

Abstract:

  • "consists in ensuring" - generally the idiomatic follow on from "consists" would be of, but more broadly there's no need for consists within this sentence. "Byzantine robustness in distributed learning ensures that distributed...."
  • The first two sentences imply that there are other kinds of workers other than Byzantine workers, yet the idea of what a worker is or how they're involved in the process is not defined.
  • "so far" implies that they've been defined within this work, whereas "to date" more clearly contextualises this statement relative to historical work.
  • "critical points" - undefined and unspecified, relies on a familiarity with the field that the reader may not have. I'm assuming that this is convergence points of the learning process, however this isn't clear.
  • "is a solution to a simplified form of the optimal adversary's problem, it is very powerful" - what is the problem? A form in what way? What is powerful a measure of? This sentence doesn't have the contextual scaffolding for anyone to understand it. Moreover the framing is so convoluted that it's impossible to parse intent and meaning here - the only interpretation that would make sense would be that the authors are trying to say that JUMP may be a solution to a simplified problem, but this problem has a high degree of transferability to more complex spaces.
  • "even the greedy version" - greedy in what context? What is the non-greedy version of JUMP?
  • While I'm not familiar with "accuracy damage" as a metric, the framing of a doubling being going from 66% to 50% vs 81% is again framed in a fashion where the impact of this work is impenetrable.

Introduction:

  • What is this new generation of models, relative to traditional machine learning? An introduction like this would imply the use of new architectures like Transformers, which I'm assuming aren't actually going to be present here.
  • Typically if you're going to introduce an acronym like DSGD, then that implies that the whole item is a proper noun, so then the in text version should be Distributed SGD or Distributed-SGD, rather than distributed SGD.
  • "causes safety and reliability issues" - these issues aren't guaranteed, but it can cause such issues.
  • In the framing of the introduction, it's not clear if Byzantine workers are malicious or benign (as in, if the manipulation has intent behind it), and if they're potentially collaborative.
  • "consistently circumvents it to worse regions of the loss landscape, where it leverages data heterogeneity to converge to poor models" The subject of the "it" is ambiguous; the involvement of data hereogeneity doesn't seem to add anything to the sentence; the idea of what a poor model is is vague and unspecified; and generally considering that this is an attack on a single, collaboratively derived model then it should be model singular rather than pluarl.
  • "segment length that JUMP uses" - what segment length? Unspecified and unclear.

Section 2:

  • Eqn 2 contextualises $\mathcal{H}$ as the full set of weight updates. Equation P [also, why is this equation P?] introduces $i \notin \mathcal{H}$ - but by this definition, the elements $i \notin \mathcal{H}$ aren't incorporated into the algorithm at all.

Section 4 and appendices:

  • The blue/red vectors on the dense blue/green contours are quite difficult to parse visually - a problem that is even worse in the figures contained within the appendices.

Section 5:

  • In table 1 it's only implicit that this is referring to accuracy, and accuracy is not even mentioned until 4 lines into the table caption, which is certainly a choice.
  • The table talks about heterogeneity - which has been unspecified up to this point (there are 5 prior references on pages 1 and 2, and all of these prior references are incredibly abstract). This issue stems from the positioning of Table 1, and how it precedes the content that is used to describe it. It's very easy to read Table 1 as a part of the 'key reasons of jump superiority' subsection.
  • Table 1 doesn't include detail over the number of experiments that were used to capture the standard deviation (or range, who knows, because these details aren't specified). [See issue above re: positioning]
  • The table isn't contextualised in terms of how many agents are being used in total. [See issue above]
  • For the above 3 dot points, the minimum information required to interpret a table really should be included in the table caption.
  • "Powerful" is ambiguous, and being "significantly more powerful" or "comprable" is ambiguous. It is possible to perform a hypothesis test to calculate the likelihood that JUMP produces a stronger deleterious impact upon the accuracy than the other experiments - this would be far more meaningful than the current framing.

Appendix C-2, Table 6 - if you're going to omit all other comparisons, NaN as your superiority metric could probably just be replaced with a dash as well.

Questions

See above.

Comment

You note on page 5 that this approach works for any variant of robust D-SGD - but how do these modifications influence the performance of your technique?

This is a good question. First, note that the SOTA defense in Byzantine ML is robust D-SGD with local momentum with pre-aggregation (Allouah et al. 2023, Karimireddy et al. 2022), so our experimental results are shown for this defense. We also test on other variants in Appendix C.2 with less or no local momentum, where we explain that our conclusions regarding the performance of Jump still hold.

Could you explain the segment length in additional detail? At the moment, problem $P_l$ implies that the attacker only acts every $\tau$ steps - but this would appear to be contradicted by the loss subfigure from Figure 1.

We explain the steps of the Jump attack in Section 3.2. The segment length $\tau$ is the size of each subproblem $P_\ell$. In other words, we split the whole training process of $T$ iterations into $\lceil \frac{T}{\tau} \rceil$ segments of length $\tau$. Each time we solve a subproblem $P_\ell$, we obtain $\lambda_{(\ell-1)\tau+1}, \ldots, \lambda_{\ell\tau}$, which are applied to training iterations $(\ell-1)\tau+1, \ldots, \ell\tau$. Hence, after we solve all subproblems, we obtain $\lambda_{1}, \ldots, \lambda_{T}$, which are used to generate the Byzantine vectors for the whole training process.
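
Schematically (a minimal sketch of the scheduling only; `solve_subproblem` stands in for the nonlinear solver applied to $P_\ell$ and is a hypothetical placeholder):

```python
import math

def jump_schedule(T, tau, solve_subproblem):
    """Split T training iterations into ceil(T / tau) segments and solve one
    subproblem per segment, collecting lambda_1, ..., lambda_T."""
    lambdas = []
    for ell in range(1, math.ceil(T / tau) + 1):
        start = (ell - 1) * tau + 1
        end = min(ell * tau, T)
        # Returns lambda_start, ..., lambda_end for this segment.
        lambdas.extend(solve_subproblem(start, end))
    return lambdas
```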

One of the reasons that JUMP is claimed to work is that it can force trajectory jumps over global minima - however this would seem to be an observation that would be very sensitive to the largest allowable trajectory jumps, and yet this doesn't seem to be a property that has been explored.

This is a very good question. An *optimal* jump is indeed sensitive to the largest allowable trajectory jumps and to the loss landscape. Furthermore, the quality of the landing point influences the effectiveness of our attack, as we explain in Section 4. However, note that the learning rate used is tuned on the Byzantine-free case, and thus is not too large. In fact, we have tested with multiple learning rates, but the strategy followed by Jump, and the optimal attack obtained by solving Problem (P), does not change. This is because we use nonlinear programming (Powell's method in our experiments) to find an optimal jump strategy, including finding the optimal point at which to make a wise and precise jump, as explained in Section 4.

How would your technique perform in the context of a) more agents, or b) when the number of agents is fixed, but the proportion of byzantine agents increased?

To showcase the strength of our attack, we focus our experiments on small-scale tasks and a low fraction of Byzantine workers. Indeed, the more complex the task, and the larger the Byzantine fraction, the harder it is for the defender and not the attacker. This is why, given that we propose an attack, we focus on this harder setting for the attacker. Finally, we recall that we already conducted a larger number of experiments, for different levels of heterogeneity, on different datasets, and Byzantine fractions.

Clarifying remarks

``critical points'' are also called stationary/saddle points in non-convex optimization, and refer to model parameters where the gradient of the loss is zero.

``greedy'' refers to the attack obtained by setting $\tau$, the segment length, to $1$ in Jump (see Section 3.2). When $\tau>1$, the attack plans malicious gradients for the next $\tau$ iterations, which makes it a non-greedy approach.

``accuracy damage'' refers to the decrease in accuracy (i.e., accuracy with the best existing attack minus accuracy with our attack) relative to the Byzantine-free baseline. We will make these definitions clear in the paper.
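
For instance, under one reading consistent with the numbers quoted from the abstract (81% Byzantine-free accuracy, 66% under the best existing attack, 50% under Jump, as cited in the reviewer's comment above):

$$ \text{damage}_{\text{existing}} = 81\% - 66\% = 15 \text{ points}, \qquad \text{damage}_{\text{Jump}} = 81\% - 50\% = 31 \text{ points}, $$

i.e., roughly a doubling of the accuracy damage.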

Comment

Thank you for your detailed response to my somewhat unwieldy review - and I apologise for the comment re model dimensions, which from memory I was using instead as a mental shorthand for both the fraction of Byzantine workers and the number of total workers (a poor mental shorthand, I will admit).

I will state though that the idea of the attacker still requiring access to the average of the honest gradients still seems like it would necessitate a level of access that would negate the need for an attacker to introduce Byzantine workers in the first place. Obviously white box threat models in a range of contexts permit strong degrees of access, but I would still contend that typically a white-box attacker doesn't imply so much access that the attack itself isn't needed. However, I will also acknowledge that my knowledge of attacks against federated learning is weaker than for other adversarial contexts, and I may be raising issue with a point that is well accepted within the Byzantine robustness community.

While I appreciate the authors work, the time they placed into their rebuttal, and the promises to add additional detail to the paper, based upon the manuscript as it currently stands I maintain my review, although I also will definitely spend more time digesting your responses to the other reviewers over this period.

Comment

We thank you for your response and follow up on your comments below. We hope that you re-evaluate our paper in this light.

I will state though that the idea of the attacker still requiring access to the average of the honest gradients still seems like it would necessitate a level of access that would negate the need for an attacker to introduce Byzantine workers in the first place.

We believe that accessing the average of honest gradients does not negate the need for Byzantine attacks. The goal of the latter is to hurt the accuracy of the model during training. Thus, only accessing the honest average gradient, without utilizing it, does not fulfill this goal. It may violate privacy and enable data reconstruction, which can make Byzantine attacks more powerful, but does not directly impact training accuracy.

Comment

We appreciate your response and detailed comments. We will carefully incorporate your writing advice in the paper. We believe we address your concerns below and hope that you re-evaluate our paper in this light. We are happy to answer any further questions.

Why does Table 1 show results that are only positive for your technique, whereas the "full numerical results" from Appendix C2 reveal ranges of setups where your technique is outperformed. This feels as if the main body contents have been pruned to show your technique in the best possible light. Could you explain the choices of content included in Table 1 vs C2?

We recall that the results in the appendices essentially convey the same message as in Section 5, where JUMP outperforms existing attacks across the vast majority of scenarios. It is worth mentioning that the results for $\alpha = 10$ (i.e., low data heterogeneity) are slightly different than those of Section 5 because of the low data heterogeneity. Indeed, in the latter regime, the considered defenses all have excellent accuracies across all attacks, which is predicted by the theory in i.i.d. settings when using local momentum (Farhadkhani et al., 2022). In addition, since real-world federated learning datasets are typically non-i.i.d., we chose to show the results on CIFAR-10 under the moderate heterogeneity setting in the table of the main paper.

The concept of robust aggregation processes is to filter out likely adversaries. If all adversaries are equal, then detecting this cluster would likely be a trivial inclusion into robust aggregation. How would your work perform if you were unable to impose uniformity among the adversary update vectors?

First, we agree that Byzantine gradients could be detected by having the aggregation eliminate identical inputs. Please note that it is standard in existing attacks and their implementation to have the Byzantine gradients be identical (Allouah et al. 2023, Karimireddy et al. 2021, 2022, Farhadkhani et al. 2022), since it is generally understood that adding small noise can make the malicious gradients nonidentical and still remain in a bad region of the space. However, such a defense (eliminating dense clusters) does not have theoretical guarantees and can easily be fooled in the presence of data heterogeneity, e.g., Byzantine gradients can gather around points they want to remove, which would be devastating if there is considerable heterogeneity. Also, in our threat model, we assume that one adversary controls several workers, or equivalently that Byzantine workers collude. This makes the problem clear, embodies the worst case, and is also standard in Byzantine Machine Learning works (Allouah et al. 2023, Karimireddy et al. 2021, 2022, Farhadkhani et al. 2022).

How reasonable is the assumption regarding the model dimension? Obviously this is a worst-case assumption, but if the adversary only has access to the local information on the set of compromised agents, does this assumption hold in practice?

We are unsure of what the reviewer meant by ``model dimension''. We assume that they refer to the threat model.

We acknowledge that our attack is open-box and has full knowledge of the system and defense. This is currently stated clearly in sections 1.1 and 2. The purpose of our paper is not to propose a realistic or black-box attack. Rather, we (i) formulate and study the strategy followed by the optimal attack (i.e., solving (P)) in sections 2 and 4, respectively, and (ii) propose a tractable attack which retains most of the strength of the worst-case attack by building on the same strategy, as explained in Section 4. We recall that the aforementioned strategy consists in overshooting minima by amplifying the direction of the average honest gradient, at the right moment of training by being aware of the loss landscape. Furthermore, based on our results, we provide researchers with a tool to stress test the strength of their defenses in the worst case, using one attack, since Byzantine-robust defenses (Allouah et al. 2023, Karimireddy et al. 2021, 2022, Farhadkhani et al. 2022) theoretically claim to withstand any attack. For example, even with the full knowledge of our attack, the SOTA defenses all perform well under low data heterogeneity, which actually confirms existing theory in the i.i.d. setting (Farhadkhani et al. 2022).

Besides, our attack does not need to know all the data of honest clients; it only needs to know the average of honest gradients (see Algorithm 1) and some proxy data from the same (global) distribution. While this is not a concern in this paper, in practice this can be done by eavesdropping and data reconstruction attacks (e.g., gradient inversion). We will emphasize this point further in the paper.

Official Review
Rating: 3

This paper studies Byzantine attacks in distributed learning. The authors first restrict the search space of Byzantine attacks. Based on this restriction, they propose an omniscient Byzantine attack called JUMP. In particular, JUMP solves a non-convex optimization problem that aims to maximize the global losses over the next several epochs. Extensive experiments validate the performance of the proposed JUMP.

Strengths

  • The idea of performing stress testing on Byzantine defenses is interesting.
  • The evaluation is thorough. The authors compare their method against different baselines. The ablations are helpful.
  • The writing is clear and easy to follow.

Weaknesses

  • The proposed JUMP attack assumes that Byzantine clients have access to all data of honest clients. This assumption is almost impossible to hold in real-world applications. I understand that the authors try to explore the worst-case behavior of Byzantine attacks. Given the unpractical assumption of the proposed attack, I still doubt whether these worst-case results make any sense.
  • JUMP needs to repeatedly compute the gradients of all honest clients, which is computationally expensive. The computation complexity also increases linearly with the number of honest clients and segment length.
  • Based on the two aforementioned reasons, though the proposed JUMP attack demonstrates high attack effectiveness, I doubt whether the improvement is meaningful given the unrealistic assumptions and restrictions: access to the data of all honest clients, knowledge of the Byzantine defenses, and high computation cost.

Questions

Please refer to the weaknesses.

Comment

We thank you for your encouraging comments and address your concerns below. We hope that you re-evaluate our paper in this light, and we are happy to answer any further questions.

The proposed JUMP attack assumes that Byzantine clients have access to all data of honest clients. This assumption is almost impossible to hold in real-world applications. I understand that the authors try to explore the worst-case behavior of Byzantine attacks. Given the unpractical assumption of the proposed attack, I still doubt whether these worst-case results make any sense.

We acknowledge that our attack is open-box and has full knowledge of the system and defense. This is currently stated clearly in sections 1.1 and 2. The purpose of our paper is not to propose a realistic or black-box attack. Rather, we (i) formulate and study the strategy followed by the optimal attack (i.e., solving (P)) in sections 2 and 4, respectively, and (ii) propose a tractable attack which retains most of the strength of the worst-case attack by building on the same strategy, as explained in Section 4. We recall that the aforementioned strategy consists in overshooting minima by amplifying the direction of the average honest gradient, at the right moment of training by being aware of the loss landscape.

Furthermore, based on our results, we provide researchers with a tool to stress test the strength of their defenses in the worst case, using one attack, since Byzantine-robust defenses (Allouah et al. 2023, Karimireddy et al. 2021, 2022, Farhadkhani et al. 2022) theoretically claim to withstand any attack. For example, even with the full knowledge of our attack, the SOTA defenses all perform well under low data heterogeneity, which actually confirms existing theory in the i.i.d. setting (Farhadkhani et al. 2022).

Besides, our attack does not need to know all the data of honest clients; it only needs to know the average of honest gradients (see Algorithm 1) and some proxy data from the same (global) distribution. While this is not a concern in this paper, in practice this can be done by eavesdropping and data reconstruction attacks (e.g., gradient inversion). We will emphasize this point further in the paper.

JUMP needs to repeatedly compute the gradients of all honest clients, which is computationally expensive. The computation complexity also increases linearly with the number of honest clients and segment length.

The greedy version of Jump ($\tau=1$, which is the one we recommend) does not need to recompute honest gradients. It only needs to know the average of honest gradients in the current round, which is common among existing attacks.

We provide an analysis in Section A of the computational cost and also its trade-off with the power of the attack for various segment lengths. Notably, the greedy version of Jump ($\tau=1$) has a reasonable runtime, especially when considering that future benchmarks will only need to run Jump instead of several other attacks. We could not include this discussion in the main body due to lack of space, and because we admittedly focus on a worst-case unbounded adversary.

Comment

Thank you for the detailed responses.

As the authors claimed in the response, the recommended greedy version of the JUMP attack is effective and does not violate any privacy principle.

However, in this case, the proposed JUMP attack becomes an omniscient version of IPM attack proposed in [1]. And it is well-known that the omniscient version of different attacks can greatly improve their effectiveness [2]. Therefore, the novelty is quite limited.

Therefore, I still maintain my rating.

[1] Xie, Cong, Oluwasanmi Koyejo, and Indranil Gupta. "Fall of empires: Breaking byzantine-tolerant sgd by inner product manipulation." Uncertainty in Artificial Intelligence. PMLR, 2020.

[2] Shejwalkar, Virat, and Amir Houmansadr. "Manipulating the byzantine: Optimizing model poisoning attacks and defenses for federated learning." NDSS. 2021.

Comment

Thank you for your response. We argue below that our work is novel and delivers important contributions and insights.

However, in this case, the proposed JUMP attack becomes an omniscient version of IPM attack proposed in [1].

Our attack is not simply an omniscient version of IPM; the design is different, and we give insights on why our attack works better. We show in Section 4.2, among other factors, that allowing the attack vector to be in the *same direction* as the true gradient is essential to Jump, and is a key reason for its superiority over IPM. Also, directly maximizing the honest loss grants adversaries the ability to circumvent the defense at the right moment during training. Finally, Jump enables adversaries to plan ahead when $\tau > 1$ and maximize the training loss across iterations, which is not possible with IPM and other attacks.

And it is well-known that the omniscient version of different attacks can greatly improve their effectiveness [2].

While we agree with the general statement above, for any attack and scenario, we believe that it does not negate the need to study worst-case attacks in the specific problem we consider. Our attack Jump improves over existing attacks, including [2], and we deliver insights on why Jump is superior, i.e., the overshooting strategy inherited from the optimal attack (see Section 4). In this sense, our work is novel both because we propose a new and stronger attack (which will benefit research on defenses as well), and also explain how it works in the same way as the optimal attack.

Official Review
Rating: 5

This paper presents an attack in the distributed setting, in which different workers provide the gradients of their data to the server that aggregates them to train the model.

The main question of the paper is to study how robust the aggregated model is to so-called Byzantine attackers who might send arbitrary values instead of their true gradients.

The paper focuses on a special type of attack, which has 3 properties (called simplifications) in how it attacks the protocol:

  1. they limit the Byzantine attackers to share the same vector of gradients.
  2. they limit the adversary's vector to be co-linear with (i.e., to be a multiplicative factor of) the average honest gradient.
  3. finally, they break the number of rounds $T$ of the protocol into intervals of length $\tau$, and let the adversary optimize (i.e., maximize) the average loss (on the non-corrupted parties) for each interval.

With the above simplifications (that make the attack feasible to optimize), the paper is able to launch attacks, experiment with their power in comparison with certain other attacks, and show that their attack does better in maximizing the loss.

Strengths

The paper studies an important question.

The unification of some of the previous attacks as done in Section 4.2 is interesting, though this framework is clearly not capturing all attacks on distributed learners.

Weaknesses

The main weakness of the paper is that its attack is not really "adaptive". Namely, with the assumptions made about the attack's behavior (to make it feasible), it is rather easy to detect the Byzantine workers (e.g., Properties 1 and 2 make the adversary's shared values detectable).

I think at this stage the standard of how attacks/defenses are studied in robust learning is higher than the early years, and (rather trivially) being not adaptive is a major weakness.

You claim that your attack breaks current defenses, but certain defenses in the distributed setting come with proofs (e.g., those based on robust aggregation that use robust statistics, and/or bagging, which even comes with certification). So, how can you "break" a provable defense?

Questions

Regarding the 2nd point in the "Strengths" section above: does your formulation of previous attacks in Section 4.2 really capture all previous attacks? It seems not to be the case, e.g., attack of https://proceedings.mlr.press/v97/mahloujifar19a.html Please clarify.

Comment

We thank you for your encouraging comments and address your concerns below. We hope that you re-evaluate our paper in this light, and we are happy to answer any further questions.

You claim that your attack breaks current defenses, but certain defenses in the distributed setting come with proofs (e.g., those based on robust aggregation that use robust statistics, and/or bagging, which even comes with certification). So, how can you "break" a provable defense?

This is a very good question. The state-of-the-art convergence theory (Allouah et al. 2023, Karimireddy et al. 2022) under data heterogeneity only shows convergence to a neighborhood of a stationary point/minimum, where the radius of the neighborhood is proportional to a bound on gradient heterogeneity. Typically, these bounds may be extremely loose, and there is a considerable gap between the theoretical claims and empirical performance. On the other hand, in the homogeneous setting where the convergence neighborhood size can be made arbitrarily small (Farhadkhani et al. 2022), even our strong attack does not significantly hinder SOTA defenses, which in turn confirms existing theory.
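
Schematically, and only as an assumed illustration of the shape of such guarantees (the symbols $\varepsilon_{\mathrm{opt}}$ and $\zeta^2$ are ours, not the paper's), results of this type read:

$$ \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\,\big\|\nabla \mathcal{L}_{\mathcal{H}}(\theta_t)\big\|^2 \;\le\; \varepsilon_{\mathrm{opt}}(T) \;+\; \mathcal{O}\!\Big(\tfrac{f}{n}\,\zeta^2\Big), $$

where $\varepsilon_{\mathrm{opt}}(T)$ vanishes with more iterations, $f/n$ is the fraction of Byzantine workers, and $\zeta^2$ bounds the gradient heterogeneity across honest workers; an attack can thus push the model anywhere inside this heterogeneity-dependent neighborhood without contradicting the proofs.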

The unification of some of the previous attacks as done in Section 4.2 is interesting. Though this framework is clearly not capturing all attacks on distributed learners.

We believe that there is a misunderstanding: we do not claim to unify existing attacks. In Section 4.2, we simply describe existing attacks in a generic way. It is true, however, that Jump captures several attacks as special cases, e.g., FOE and SF, but this is only a byproduct. We will make this clear in the paper.

The main weakness of the paper is that its attack is not really "adaptive". Namely, with the assumptions made about the attack's performance (to make it feasible), it is rather easy to detect the Byzantine workers (e.g., Properties 1 and 2 make the adversary's shared values detectable).

First, we agree that Byzantine gradients could be detected by having the aggregation eliminate identical inputs. Please note that it is standard in existing attacks and their implementations to have the Byzantine gradients be identical (Allouah et al. 2023, Karimireddy et al. 2021, 2022, Farhadkhani et al. 2022), since it is generally understood that adding small noise can make the malicious gradients non-identical while still remaining in a bad region of the space. However, such a defense (eliminating dense clusters) does not have theoretical guarantees and can easily be fooled in the presence of data heterogeneity: e.g., Byzantine gradients can gather around points they want to remove, which would be devastating if there is considerable data heterogeneity.
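As a concrete illustration of this point, the snippet below (ours, not from the paper; the noise scale and names are arbitrary) turns one common malicious vector into per-worker, non-identical submissions with negligible change to where they lie in the space.

```python
import numpy as np

def noised_byzantine_copies(common_vector, num_byzantine, noise_std=1e-4, seed=0):
    """Illustrative only: perturb one shared malicious vector so a defense that
    drops exact duplicates no longer fires, while each copy stays essentially
    in the same (adversarially chosen) region of the space."""
    rng = np.random.default_rng(seed)
    return [common_vector + noise_std * rng.standard_normal(common_vector.shape)
            for _ in range(num_byzantine)]
```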

Second, we acknowledge that strong attacks ought to be adaptive, as are Jump and the other mentioned attacks. This is not the only advantage of our attack. As we explain in Section 4, among other factors, the main advantage of Jump is its ability to overshoot minima. This requires planning, which is arguably a stronger form of adaptivity, and is enabled by collinearity as well as the maximization of the loss across time within a certain segment length.

Clarifying remarks

We would like to clarify that the simplifications made to obtain Jump are not properties. We essentially go from a natural Byzantine attack formulation in (P) to a simpler optimization formulation. In Section 4, we show that each simplification only marginally reduces the strength of the "strongest" attack obtained by solving (P) directly.

Comments

Thanks for the responses. I keep my score, and still encourage the authors to look into the adaptive nature of their attack. Ideally, one wants to argue that an attack is "universal" and can be applied to every scheme, to understand the limitations of what is possible (e.g., see the paper "Universal Multi-Party Poisoning Attacks"). Simply saying that adding noise makes things indistinguishable is not formal enough.

Comments

We thank the reviewer for their response. We recall that we do not claim "universality", nor do we claim to subsume existing attacks, although the mentioned work seems interesting. Rather, we develop a **stress test**; that is, a worst-case attack that validates theoretical Byzantine robustness claims. Naturally, it is only interesting to test our attack against defenses that come with theoretical guarantees and that do not fail in trivial scenarios, which is why we do not consider the defense based on eliminating redundant inputs (as we explain in our initial response).

Comments

When you say "worst-case attack", it naturally means "adaptive" and hence "universal". These are very close notions. In other words, if your attack cannot be guaranteed to work against every scheme, then it is not "worst case" (and not universal / fully adaptive either).

Review
5

This paper considered the problem of designing Byzantine attack schemes for distributed learning systems, where the goal is to manipulate the workers' feedback to the server to maximize the training loss function. The authors proposed a new adaptive attack scheme (i.e., the attacker has full knowledge of the model and aggregation method in the distributed learning system) called Jump. The main idea of Jump is to restrict the solution space of the Byzantine attack problem to a rank-1 solution that is colinear with the average gradient of the honest workers. The authors empirically showed the effectiveness of the Jump method on CNNs with the MNIST and CIFAR-10 datasets.

Strengths

  1. The authors proposed a new adaptive Byzantine attack scheme by simplifying the original Byzantine attack problem.

  2. The authors demonstrated that it outperforms several well-known baseline methods.

Weaknesses

  1. The Byzantine attack and defense problem in distributed learning systems is a well-studied problem. This paper's novelty and contributions to this area are limited.

  2. Some key references in this area are missing.

  3. Experimental studies are inadequate.

Please see the detailed comments and questions below.

Questions

  1. The Byzantine attack and defense problem in distributed learning has been studied extensively in recent years and there are many variants of this problem (e.g., by allowing the server to have an auxiliary dataset to boost trust [R1]). This paper remains focused on the most basic and standard problem setting, hence the novelty of this paper is limited in this sense.

[R1] X. Cao, M. Fang, J. Liu, and N. Gong, FLTrust: Byzantine-robust Federated Learning via Trust Bootstrapping, in Proc. NDSS 2021.

  2. This paper simplifies the problem by restricting the solution space to rank-1 vectors that are co-linear with the average of honest workers. Although this idea is interesting, this paper missed many ideas and methods developed in recent years, e.g., by treating the original attack problem as a bilevel optimization problem and employing more sophisticated bilevel optimization algorithms. It remains unclear how well the simplified strategy in this paper will perform when compared to these methods. Also, how about other simplified and restricted solution spaces, e.g., rank-2, rank-3, etc.?

  3. Although the authors conducted extensive numerical studies, most of the experiments were conducted on CNNs with MNIST and CIFAR-10, which are now considered relatively easy datasets. Also, the authors only compared MinMax and MinSum in the adaptive attack category. The authors should conduct more experiments with other adaptive attack schemes.

  4. Also related to the previous bullet, the solution quality of the Jump method clearly depends on the nonlinear optimization solver. The authors only used Powell's method. This paper can benefit from testing and comparing more nonlinear solvers, and analyzing the impact of different nonlinear solvers on the proposed Jump method.

Details of Ethics Concerns

N/A

Comments

We thank you for your encouraging comments and address your concerns below. We hope that you re-evaluate our paper in this light, and we are happy to answer any further questions.

This paper remains focused on the most basic and standard problem setting, hence the novelty of this paper is limited in this sense.

We acknowledge that the Byzantine attack and defense problem in distributed learning systems is a well-studied problem. However, given the lack of works investigating worst-case Byzantine attacks in the standard Byzantine ML problem, we believe that our contribution is valuable to the community without the need to explore other variants of the problem. We will add the reference mentioned by the reviewer in our paper.

This paper simplifies the problem by restricting the solution space to rank-1 vectors that are co-linear with the average of honest workers. Although this idea is interesting, this paper missed many ideas and methods developed in recent years, e.g., by treating the original attack problem as a bilevel optimization problem and employing more sophisticated bilevel optimization algorithms. It remains unclear how well the simplified strategy in this paper will perform when compared to these methods. Also, how about other simplified and restricted solution spaces, e.g., rank-2, rank-3, etc.?

We recall that the approach leading to the simplified optimization problem of Jump is principled and does not necessitate exploring, e.g., rank-two solution spaces. Indeed, in Section 4, we show that the successive simplifications do not significantly degrade the power of the attack.

The solution quality of the Jump method clearly depends on the nonlinear optimization solver. The authors only used Powell's method. This paper can benefit from testing and comparing more nonlinear solvers, and analyzing the impact of different nonlinear solvers on the proposed Jump method.

In Section 3.2, we discuss appropriate nonlinear solvers for our tasks. We recall that the constraints of Problem (P_ℓ) are in general non-differentiable due to robust aggregation. This motivates the use of derivative-free non-linear solvers such as Powell's (Powell, 1964), Nelder-Mead's (Nelder & Mead, 1965), and Differential Evolution (Storn & Price, 1997). In fact, we have tested all the aforementioned methods on MNIST and our two-dimensional tasks. Our experiments have shown that Powell's method is the most effective, while Differential Evolution can obtain better solutions but is much slower than Powell's, so we use Powell's in our main experiments. We will add a detailed description in the paper.
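For concreteness, the sketch below shows how a derivative-free solver such as Powell's (here via SciPy) could be used to search for the scalar factor. The callables `aggregate` and `honest_loss_after_step` are placeholders for the robust aggregation rule and the loss probe, not the paper's actual interfaces.

```python
import numpy as np
from scipy.optimize import minimize

def search_attack_factor(avg_honest, honest_grads, aggregate, honest_loss_after_step,
                         lam0=1.0):
    """Illustrative sketch: choose the scalar `lam` so that the robustly aggregated
    update maximizes the honest loss. `aggregate` and `honest_loss_after_step` are
    assumed callables; Powell's method is derivative-free, which matches the
    non-differentiable robust aggregation discussed above."""
    def objective(x):
        lam = float(np.ravel(x)[0])
        byz = lam * avg_honest                          # colinear Byzantine vector
        update = aggregate(list(honest_grads) + [byz])  # robust aggregation of all inputs
        return -honest_loss_after_step(update)          # maximize loss == minimize its negative

    result = minimize(objective, x0=np.array([lam0]), method="Powell")
    return float(np.ravel(result.x)[0])
```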

Although the authors conducted extensive numerical studies, most of the experiments were conducted on CNNs with MNIST and CIFAR-10, which are now considered relatively easy datasets.

We use heterogeneous variants of MNIST and CIFAR-10, where the extreme heterogeneity scenario is challenging and is common in Byzantine-robustness works (Allouah et al. 2023, Karimireddy et al. 2022). In any case, note that the more challenging the task, the harder it is for the defender, not the attacker, which is why we focus on moderate heterogeneity or a low fraction of Byzantine workers.

AC Meta-Review

This paper proposes a novel algorithm, JUMP, to address the problem of Byzantine attacks in distributed learning. Despite the important problem setting and potentially interesting observations, several concerns raised by the reviewers were not resolved; as a consequence, this promising submission is not ready to be accepted in its current form.

Why Not a Higher Score

The reviewers' concerns were not resolved, and a consensus was reached among the reviewers.

Why Not a Lower Score

N/A

Final Decision

Reject