This work studies distributed model training with a parameter server; contributions are theoretical:

Assumptions: -smooth convex objectives, stochastic gradient estimates, heterogeneous worker distributions, bounded difference in expected gradients of local workers' objectives at a global minima, denoted .
Main result: Given a fixed set of communication rounds with the parameter server, worker machines, and a budget of local gradient queries per worker in each communication round, this work presents a simple method for improving local SGD. Given -smooth convex objectives with stochastic gradients and heterogeneous worker distributions, the local-SGD variant proposed in this work achieves more favorable convergence than mini-batch SGD (where the stochastic gradient queries are used to compute a larger mini-batch gradient at each worker). The proposed method converges in rounds, whereas mini-batch SGD converges in rounds.
Method: The proposed method extends anytime GD (Cutkosky, ICML'19) to the distributed local-sgd parameter-server setting. In short, regular local-sgd with local steps, but where stochastic gradients are computed at an exponential moving average (ema) of the model parameters. The parameter server averages both the primal and ema parameters, and sends back the exact average to each worker at the end of each round.