PaperHub

ICML 2025 (Poster) · Overall rating: 4.9/10
4 reviewers · scores: 3, 2, 4, 2 (min 2, max 4, std 0.8)

Beyond Communication Overhead: A Multilevel Monte Carlo Approach for Mitigating Compression Bias in Distributed Learning

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-25

Abstract

Distributed learning methods have gained substantial momentum in recent years, with communication overhead often emerging as a critical bottleneck. Gradient compression techniques alleviate communication costs but involve an inherent trade-off between the empirical efficiency of biased compressors and the theoretical guarantees of unbiased compressors. In this work, we introduce a novel Multilevel Monte Carlo (MLMC) compression scheme that leverages biased compressors to construct statistically unbiased estimates. This approach effectively bridges the gap between biased and unbiased methods, combining the strengths of both. To showcase the versatility of our method, we apply it to popular compressors, like Top-$k$ and bit-wise compressors, resulting in enhanced variants. Furthermore, we derive an adaptive version of our approach to further improve its performance. We validate our method empirically on distributed deep learning tasks.
Keywords
Distributed Learning, Compressed Gradients, Multilevel Monte Carlo

Reviews and Discussion

Review (Rating: 3)

The paper introduces a Multilevel Monte Carlo compression scheme that leverages biased compressors to construct unbiased gradient estimates. The proposed approach aims to combine the empirical efficiency of biased compressors (Top-k, bitwise compression) with the theoretical guarantees of unbiased methods.

Questions for Authors

  1. As acknowledged in the paper, the MLMC approach trades bias for increased variance, which might impact performance in some scenarios, such as setups with a small number of machines. Is this the reason why, with 4 machines, the performance gain is not that large?
  2. Can the authors provide the per-iteration time cost for the different schemes?
  3. Is MLMC in Figure 3 based on Algorithm 2 or 3?

Claims and Evidence

yes

Methods and Evaluation Criteria

yes

Theoretical Claims

yes

Experimental Design and Analysis

yes

Supplementary Material

no

Relation to Broader Scientific Literature

satisfactory

Essential References Not Discussed

none

Other Strengths and Weaknesses

Strengths:

  1. The paper is well-written and easy to follow.
  2. The paper introduces an innovative way to bridge biased and unbiased compression techniques.
  3. The authors provide thorough theoretical analysis of their method, including detailed proofs of unbiasedness and variance bounds.

Weaknesses: The paper does not provide explicit per-iteration time cost comparisons.

Other Comments or Suggestions

none

Author Response

We thank the reviewer for their positive evaluation of our paper and for the constructive feedback. Below, we address the questions raised:

1. Scaling with number of machines

You are right that the performance gains from MLMC grow with the level of parallelization. When using only 4 machines, variance reduction is not as effective, so the gains are naturally smaller. However, as shown in our scalability plots (e.g., Figure 1 ($M=4$) compared to Figure 2 ($M=32$)), MLMC achieves significant speedups, smaller errors, and improved communication efficiency in large-scale distributed settings.

This is a key strength of our method. It retains the empirical efficiency of biased compressors while providing the unbiasedness and theoretical guarantees needed for scalable, stable learning in high-parallelism regimes. Moreover, higher parallelization induces a stronger variance-reduction effect, which further improves our MLMC estimator.

We will clarify this in the revised paper to better highlight the scaling advantages of MLMC.

2. Computational overhead

That is a good point. The computational overhead of MLMC is comparable to that of standard methods such as Top-k and AdaGrad, and is often negligible relative to the overall training time. Prior works similarly devote little attention to computational overhead, since it is typically negligible compared to the overhead introduced by communication.

Specifically, using Top-$k$, for example, incurs $O(d\log(k))$ computational complexity per iteration and per machine, while our adaptive MLMC method (Alg. 3) incurs $O(d\log(d))$, which is a very small difference in practical scenarios.

In more detail, using Top-$k$ requires finding the $k$ largest elements, which costs $O(d\log(k))$ in each iteration and for each machine. In contrast, our adaptive MLMC method (Alg. 3) with Top-$k$ requires sorting the vector first, costing $O(d\log(d))$, and computing the probabilities and constructing the MLMC estimator, which costs $O(d)$ in total (for computing the norm of the vector, similar to AdaGrad, and for picking the $l$-th largest element, which costs $O(1)$ since the vector is sorted). However, you are right that this is worth discussing to improve clarity, and we will add it to our paper.
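
To make the cost comparison concrete, here is a minimal, hypothetical sketch of an adaptive MLMC estimator built from a Top-$l$ ladder (level $l$ keeps the $l$ largest-magnitude entries, level $0$ is the zero vector, level $d$ is the identity). The level probabilities below are simply proportional to the per-level difference magnitudes, a stand-in for the optimized distribution of Alg. 3; the point is only that one sort plus one linear pass suffices.

```python
import numpy as np

def adaptive_mlmc_topk(v, rng):
    # Levels l = 1..d are Top-l compressors, so g^l - g^{l-1} is exactly the
    # l-th largest-magnitude entry of v and all level differences come from a
    # single sort. The probabilities here are proportional to those magnitudes
    # (a stand-in for the paper's optimized distribution).
    d = v.shape[0]
    order = np.argsort(-np.abs(v))        # O(d log d): sort once
    diffs = np.abs(v[order])              # ||g^l - g^{l-1}||, l = 1..d
    p = diffs / diffs.sum()               # O(d): level probabilities
    l = rng.choice(d, p=p)                # sample one level
    est = np.zeros_like(v)
    est[order[l]] = v[order[l]] / p[l]    # (g^l - g^{l-1}) / p_l
    return est                            # unbiased: sum_l p_l * est_l = v

rng = np.random.default_rng(0)
v = rng.standard_normal(50)
avg = np.mean([adaptive_mlmc_topk(v, rng) for _ in range(100000)], axis=0)
print("max deviation from v:", np.max(np.abs(avg - v)))  # small: the estimator is unbiased
```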

3. Clarification on Figure 3

Thank you for pointing this out. The results in Figure 3 correspond to Algorithm 2. We will clarify this in the figure caption and text to avoid confusion.

4. Experiments

We ran new experiments, including NLP experiments using BERT on SST-2. These are anonymously available at "https://anonymous.4open.science/r/ICML2025MLMC-5346".

Thank you again for your constructive feedback!

Reviewer Comment

I thank the authors for the response and providing the new comments. I like how the paper combines the empirical efficiency of biased compressors with theoretical guarantees of unbiased ones (and goes beyond importance sampling). The response clarified my concerns. I will maintain my score.

Author Comment

Dear reviewer,

We thank you sincerely for your comment and for acknowledging the novelty and contribution of our paper beyond importance sampling.

Review (Rating: 2)

The work proposes to use Multilevel Monte Carlo (MLMC) in distributed learning to mitigate the issues that arise in the analysis of biased compressors. The work introduces a novel MLMC compression scheme that leverages biased compressors to construct statistically unbiased estimates.

Questions for Authors

See Claims And Evidence section.

Claims and Evidence

The paper in general is written pretty well.

Major:

  1. Unfortunately, I have concerns about the paper. The authors highlighted the possibility of using an interesting mechanism. However, the proof presented in Section 3.4 (Parallelization), second paragraph, is not complete. The MLMC estimator is unbiased and satisfies the assumption hidden in Line 117, but nothing has been analyzed about its variance in terms of Assumption 2.2. For me, the proof is a bit artificial, and the authors should elaborate much more on the analysis. Essentially, I don't like the fact that the original estimator is replaced by another one while nothing has been said about its variance.

  2. I'm pretty skeptical that if we use $L=2$, with the identity mapping at level $L=2$ and the TopK ($k=1$) compressor at level $l=1$, the constant $\alpha=k/d$ does not come into the rate (2) at all. However, the authors claim: "Note that since our MLMC gradient estimates are unbiased, a similar error bound to Eq. (2) holds"

  3. It's great that you have auxiliary Lemmas, but please formulate and prove the convergence theorem in detail (either for convex or non-convex case).

Minor:

  1. Please use the notation in Line 135 to highlight the estimator of $\nabla f_i$. Your notation is slightly overloaded because you have both $f_i(x)$ and $f_i(x,z)$.

  2. Please elaborate more on the fact that rate (2) is optimal for the convex setting in terms of the rate of Stochastic Gradient Descent.

  3. Please, if Assumption 3.2 is too restrictive, use "Ahmed Khaled, Othmane Sebbouh, Nicolas Loizou, Robert M. Gower, and Peter Richtarik, Unified Analysis of Stochastic Gradient Methods for Composite Convex and Smooth Optimization, arXiv:2006.11573, 2020"

  4. Please elaborate more on the rates for the non-convex case. The rates from $1/T$ to $1/\sqrt{T}$ depend on the assumptions and methods.

  5. "This way, although the compressed gradients can be biased, their MLMC estimators are always unbiased."

I can take the expectation of (5) and get line 187, right column. But whether the estimator is biased or not will depend on $X^L$. (Please rephrase.)

  1. Please be more concrete in Definition 3.1 and specify that $C^i$, $i \in [L]$, $i \ne L$, can be any compressor (1) or (2).

  2. Please add experiments in the convex setting (quadratics or logistic regression), selecting the step-size according to theory, to demonstrate the correctness of your method.

Methods and Evaluation Criteria

In the applied-optimization sense, there are experiments with training ResNet-18. In a more restrictive setting that tries to eliminate humans from the loop during training, there are no such experiments.

Theoretical Claims

Appendix A, D.

Experimental Design and Analysis

Yes. In the Applied Optimization sense, methods sound good.

Supplementary Material

Yes. Appendix and Source code.

Relation to Broader Scientific Literature

The proposed methodology is interesting and has serious potential to revolutionize how we think about ways to mitigate problems with the analysis of biased compressors.

Essential References Not Discussed

All related works are properly introduced and utilized.

Other Strengths and Weaknesses

The paper is well written, but it requires more elaborate work on the theory and minor polishing.

Other Comments or Suggestions

No.

Ethics Review Concerns

None.

Author Response

We thank the reviewer for the thoughtful and encouraging feedback, and for pointing out areas where the theoretical analysis could be clarified. We respond to each concern below and will revise the paper accordingly.

1. Variance Analysis and Convergence Guarantee

We appreciate your observation regarding the variance and convergence analysis.

Regarding the variance, we presented the full derivations of the variance of the MLMC estimators as part of the calculation of the optimal probability distribution (which is optimized to minimize the variance) in Appendices B, C, and D (see Eqs. (33), (44), and (55)).

For a more practical example, in Lemma 3.6 we show that the variance of our MLMC estimator in the exponential distribution case (see Assumption 3.5) is given by $O(1/(rs))$, where $r$ is the exponential decay rate of the vector's elements, which is better than the $O(d/s)$ variance of rand-$k$ when $1/r < d$ (i.e., when the decay rate is sufficient, which is the more interesting scenario). Note that when the vector is nearly uniform, we have $1/r \approx d$ and the variances will be comparable in this case, as expected.

Regarding convergence, please note that since our MLMC gradient estimators are unbiased by construction, convergence follows from the standard SGD convergence analysis, where the only difference is the additional variance introduced by the compression, as we state in lines 186-200. However, we agree that making these derivations explicit would strengthen the presentation, and we will add this to the paper as you suggested to make it clearer.

Specifically, the formal convergence theorem for Alg. 2 and Alg. 3 will hold under Assumptions 2.1 (smoothness) and 2.2 (bounded variance) and will guarantee similar error bounds (for the convex and nonconvex cases) as the ones in Theorem 2.1 and Eq. (2), only with $\sigma_{comp}+\sigma$ in place of $\sigma$, where $\sigma_{comp}$ depends on the compressor and thus on $\alpha^l,\ l\in[L]$ (see e.g. Eq. (60) in Appendix D).

We formalize the convergence theorem as follows.

Theorem (convex case). Under Assumptions 2.1 (smoothness) and 2.2 (bounded variance), Alg. 2 (nonadaptive MLMC compression) guarantees the following error bound: $O\!\left(\frac{1}{T}+\frac{\sigma_{comp}+\sigma}{\sqrt{MT}}\right)$.

Although the proof follows very similarly to that of the standard SGD convergence theorem, we will formalize it and add it to the paper for completeness. A similar theorem and proof follow for the nonconvex case, and we will add them as well.

2. Bounds dependence on compression constants

We believe the reviewer's concern refers to the apparent disappearance of the level-specific compression constants (i.e., $\alpha_{t,i}^l$) in the final rate. This is a good point. As mentioned in the previous point, these constants appear in the variance introduced by compression, i.e., in $\sigma_{comp}$ (see e.g. Eq. (60) in Appendix D), which enters the final convergence rate. We agree with the reviewer's suggestion to add this to the main paper to make it clearer, and we will incorporate this in the final version.

3. Minor Points

We thank the reviewer for the helpful suggestions regarding notation, definitions, and related clarity issues. We will revise the manuscript accordingly to improve clarity.

  • You are correct that the unbiasedness of our MLMC estimator depends on the highest level, $L$. Since we define $C^L(v):=v$, i.e., there is no compression at the highest level, our MLMC method produces unbiased estimates of the true gradient.

  • Regarding the convergence rate of convex and non-convex SGD, we formalized the assumptions in our paper, but we will elaborate more on this and on the optimality of the bounds to improve clarity as you suggested.

  • Regarding eliminating humans from the loop in experiments, we ran additional NLP experiments using the AdamW optimizer, which employs an adaptive learning rate and alleviates the need for extensive learning rate tuning. The results are anonymously available at "https://anonymous.4open.science/r/ICML2025MLMC-5346".

Thank you again for your constructive feedback!

Review (Rating: 4)

This paper introduces a novel Multilevel Monte Carlo (MLMC) compression scheme that leverages biased compressors to construct statistically unbiased estimates. The proposed algorithm effectively bridges the gap between biased and unbiased methods, combining the strengths of both. The empirical results show that the proposed algorithm outperforms the baselines. Theoretical analysis shows that the proposed algorithm can reduce the variance incurred by the compression.

Questions for Authors

Please refer to the comments of the proposed method and experimental designs above, and try to resolve my concerns.

Claims and Evidence

The claims made in the submission are supported by clear and convincing evidence.

Methods and Evaluation Criteria

Most of the proposed methods make sense for the problem.

For the proposed methods:

I think there should be some discussion about the implementation of the proposed algorithm in a real-world distributed environment. I understand that the hardware resources are limited and the experiments in this paper seem to be simulations. However, a discussion of the implementation is still necessary. I have one concern: for Algorithm 3, the adaptive probability distribution requires the calculation of the compression at all levels, hence incurring heavy computation overhead if the number of levels is large.

Theoretical Claims

I've skimmed the proofs and they seem correct to me.

Experimental Design and Analysis

For the experiments, I have some concerns:

  1. The experiments are very small-scale for distributed training. I would recommend CIFAR-100 or even larger datasets such as ImageNet.

  2. The experiments are limited to CV models. I would recommend adding some NLP (transformer) experiments.

  3. Although not covered by the theoretical analysis of convergence, I would like to see some experiments on how the proposed compressor works with the Adam (actually AdamW) optimizer.

  4. All the experiments only show accuracy vs. #Gbit communicated. I strongly recommend to add plots of accuracy vs. steps, so that we could see the gap between the compressed methods and the optimal (final) accuracy of full-precision SGD.

Supplementary Material

I've skimmed the proofs and they seem correct to me.

Relation to Broader Scientific Literature

There is nothing related to the broader scientific literature.

Essential References Not Discussed

The references look good to me.

Other Strengths and Weaknesses

Overall, the idea seems very interesting and makes sense. My major concerns are about the experiments.

Other Comments or Suggestions

Please refer to the comments of the proposed method and experimental designs above.

Author Response

We thank the reviewer for the positive evaluation of our contributions, including the novelty of our MLMC compression scheme and the theoretical analysis. We address the concerns below and will incorporate these improvements into the final version.

1. Implementation

We appreciate the reviewer's suggestion to discuss real-world implementation. In our experiments, since hardware is limited, we used Python's multiprocessing package to run multiple processes in parallel, each representing a different machine. This setup closely simulates a real-world parallel optimization scheme running on multiple machines. We will add this to the paper.
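
For illustration, a minimal sketch of this kind of multiprocessing simulation is given below; the least-squares objective, the Top-k compressor, and all constants are stand-ins rather than our actual training code.

```python
import numpy as np
import multiprocessing as mp

def top_k(g, k):
    # Keep only the k largest-magnitude entries (a biased compressor).
    out = np.zeros_like(g)
    idx = np.argpartition(np.abs(g), -k)[-k:]
    out[idx] = g[idx]
    return out

def worker(conn, A, b, k):
    # One simulated "machine": receive the model, send back a compressed
    # local gradient, until a None shutdown signal arrives.
    while True:
        x = conn.recv()
        if x is None:
            break
        conn.send(top_k(A.T @ (A @ x - b), k))

if __name__ == "__main__":
    M, d, k, lr, T = 4, 50, 5, 1e-3, 200
    rng = np.random.default_rng(0)
    shards = [(rng.standard_normal((20, d)), rng.standard_normal(20)) for _ in range(M)]
    pipes = []
    for A, b in shards:
        parent, child = mp.Pipe()
        mp.Process(target=worker, args=(child, A, b, k), daemon=True).start()
        pipes.append(parent)
    x = np.zeros(d)
    for _ in range(T):                       # server loop
        for conn in pipes:
            conn.send(x)                     # broadcast the model
        x -= lr * np.mean([conn.recv() for conn in pipes], axis=0)
    for conn in pipes:
        conn.send(None)                      # shut the workers down
    print("final loss:", sum(0.5 * np.linalg.norm(A @ x - b) ** 2 for A, b in shards))
```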

2. Computational Overhead

That is a good point. However, the adaptive sampling step in Algorithm 3 does not introduce significant additional computational overhead. For example, in the case of Top-$k$ or $s$-top-$k$, we only need to compute each $\Delta_{t,i}^l$ (which is some norm) once per iteration, similar to what optimizers like AdaGrad already do when computing the full gradient norm. Furthermore, these norms are computed over disjoint segments, since $\sqrt{\Delta_{t,i}^l} = \|g_{t,i}^l - g_{t,i}^{l-1}\|$, which is equivalent to the absolute value of the $l$-th largest element of $v_{t,i}$ (in Top-$k$) or the norm of the segment of length $s$ with the $l$-th largest norm (in $s$-top-$k$). Therefore, the total computational cost is identical to that of computing the norm of the full gradient, as is done in existing adaptive methods like AdaGrad. A similar smart computation of the probabilities can be done for other compressors. For example, in bit-wise compressors, $g_{t,i}^l - g_{t,i}^{l-1}$ corresponds to the sign bit and the $l$-th information bit.

Moreover, the number of compression levels does not have to be linear in the dimension of the compressed entity, but could be logarithmic (which is less general, but still works). This is the "classical" case of MLMC [1] in which the quality of the "levels" (which is inversely correlated with the extent of compression in our case) increases exponentially and thus induces a logarithmic number of levels.

Also, the computational overhead (even when calculating multiple compressions) is usually negligible compared to the overhead introduced by communication, which is the main motivation behind this work and prior works. This has been discussed extensively in prior work [2,3].

3. Experiments

We acknowledge that our current experiments are on modest-sized vision tasks, due to limited hardware. Although these are the standard benchmarks used in prior works, we agree that broader empirical validation is beneficial. We ran additional experiments. The results are available at "https://anonymous.4open.science/r/ICML2025MLMC-5346". We recommend downloading the repository. It includes two folders, as we elaborate below:

BERT Top-k:

We evaluated our adaptive MLMC method using Top-k compression on the SST-2 benchmark using BERT finetuning. We used the AdamW optimizer. The folder includes 2 files showcasing the accuracy vs. #Gbit communicated, and accuracy vs. iteration (#steps), both for M=4 machines and for k={0.01n, 0.05n, 0.1n, 0.5n}. We evaluated our MLMC method against EF21-SGDM, Top-k, Rand-k, and SGD (we keep this terminology, for clarity, with a slight abuse of notation, but note that the underlying optimizer for all is AdamW).

As is evident from these plots, our MLMC method enjoys the fastest convergence for the same #Gbit communicated, and it achieves convergence similar to that of (uncompressed) SGD while still performing better than the other methods for the same number of steps. These results (in addition to the ones in the paper) demonstrate a strong advantage and efficiency of our method compared to existing methods, both in communication efficiency and convergence rate, across vastly different tasks such as CV and NLP.

ResNet CIFAR-10 Top-k:

We evaluated our adaptive MLMC method using Top-k compression on CIFAR-10 using ResNet-18 against EF21-SGDM, Top-k, Rand-k, and (uncompressed) SGD. The folder includes 4 files showcasing the accuracy vs. #Gbit and accuracy vs. iteration (#steps), both for M=4 and M=32 machines and for k={0.001n, 0.005n, 0.01n, 0.05n}.

Our method enjoys a significant advantage over comparable methods in terms of communication efficiency, convergence speed, and final accuracy, and this advantage grows with the number of machines. In the accuracy vs. iteration plots, uncompressed SGD eventually surpasses all methods, as expected, although our method remains comparable when compression is not too extreme, while the other compression methods still experience performance degradation.

Thank you again for your constructive feedback!

[1] Giles, "Multilevel Monte Carlo methods", 2013.

[2] Konecny, "Federated learning: Strategies for improving communication efficiency", 2018.

[3] Wang, "A field guide to federated optimization", 2021.

Review (Rating: 2)

The article presents a new compression method that uses the MLMC algorithm to turn biased compressors into unbiased ones.

Questions for Authors

Claims and Evidence

The claims in the paper are correct and verified.

Methods and Evaluation Criteria

The proposed methods are proved under generally accepted assumptions on the target function. The algorithms are validated on the ResNet + CIFAR-10 problem, which is common.

Theoretical Claims

The proofs and facts appear to be correct. But I'm left unclear about one thing that seems like it should definitely be clarified: how do these approaches differ from importance sampling (see Sec. 2.2 of Beznosikov et al.)? It looks like all these compressors can be reduced to a simpler form. Let me provide examples:

Bit-wise compressors: Here, the difference $g^l - g^{l-1}$ is used, and from the proposed compressor it follows that $C(g) = \frac{1}{p^l}(-1)^{b_0} b_l 2^{-l}$, meaning that with probability $\sim 2^{-l}$, the $l$-th bit, multiplied by $2^l$, is sent. Essentially, we assign weights from the simplex to all bits and sample them non-uniformly, but according to some prior distribution $p$.

TopK: Similarly, we send the coordinate $j$ with probability $\sim |g_j|$, as follows from formula (11). It's not entirely clear why it's written so complicatedly, as it is essentially equivalent to $p_j = |g_j| / \|g\|_1$. Again, we assign weights from the simplex to each coordinate and sample the coordinate according to $p$, resulting in a regular unbiased compressor; we simply use the distribution $p$ instead of a uniform one. According to my calculations, for such a compressor $\omega = \sum 1/p_j$. And I don't understand why it is a new approach.

Maybe I'm wrong! I think it's important to explain!

Experimental Design and Analysis

  1. Basic ResNet18+CIFAR10 training gives more than 90 percent on the test set. Such experiments do not make sense; we have lost 20% of quality for all operators. It seems that if we use less aggressive compression, MLMC will lose to TopK.

  2. Please add to the comparison operators that perform importance sampling of coordinates.

  3. Still, ResNet+CIFAR10, although a classic setup, is outdated. It would be interesting to see heavier tasks that require real distributed computation (ResNet on CIFAR trains on a laptop in a few hours). I recommend, for example, BERT training or Llama fine-tuning.

Supplementary Material

I briefly checked

Relation to Broader Scientific Literature

The authors propose a new way of compression. It does not make a big breakthrough either way. Moreover, the applicability of these approaches and the extent of their difference from existing ones are questionable.

Essential References Not Discussed

All references necessary for understanding the article are provided.

Other Strengths and Weaknesses

I can't recommend for acceptance just yet.

Other Comments or Suggestions

Author Response

We thank the reviewer for the thoughtful and detailed feedback. Below, we address each concern and clarify the relationship between our MLMC framework and IS.

1. MLMC vs. Importance Sampling

We thank the reviewer for the insightful observation regarding the similarity between our MLMC construction and importance sampling (IS) in specific settings. That is an excellent point! We agree that in certain simple cases (bit-wise, Top-k) the MLMC estimator indeed reduces to an IS-like scheme, as you correctly pointed out. However, we respectfully argue that MLMC is not merely an instance of IS, but rather a significantly more general and natural framework for constructing unbiased estimators from biased compressors.

In fact, IS can be viewed as a special case of MLMC, where sampling is performed non-uniformly over coordinates, as you have stated. MLMC provides a systematic multilevel hierarchy over increasingly accurate (less compressed) estimators, and forms unbiased estimates by applying Monte Carlo sampling over the differences between successive levels. This telescoping structure is particularly well-suited to biased compressors, where compression is naturally available at varying levels of fidelity.
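
For concreteness, here is a minimal, generic sketch of the single-sample MLMC telescoping construction over an arbitrary ladder of compressors (the paper's Algorithms 2/3 add distributed averaging and the adaptive level distribution, so this illustrates the principle rather than the exact method). The coarse-to-fine rounding ladder in the example is hypothetical and chosen only to show that nothing coordinate-wise is required.

```python
import numpy as np

def mlmc_estimate(g, compressors, probs, rng):
    # compressors = [C^0, ..., C^L] with increasing fidelity and C^L = identity;
    # probs is a distribution over levels 1..L. The telescoping sum gives
    #   E[ghat] = C^0(g) + sum_l p_l * (C^l(g) - C^{l-1}(g)) / p_l = C^L(g) = g.
    L = len(compressors) - 1
    l = rng.choice(np.arange(1, L + 1), p=probs)
    return compressors[0](g) + (compressors[l](g) - compressors[l - 1](g)) / probs[l - 1]

def rounding_level(l):
    # Coarse-to-fine rounding: level l snaps entries to a grid of spacing 2^{-l}.
    return lambda g: np.round(g * 2 ** l) / 2 ** l

rng = np.random.default_rng(0)
ladder = [rounding_level(l) for l in range(4)] + [lambda g: g]   # C^4 = identity
probs = np.array([0.4, 0.3, 0.2, 0.1])                           # favor cheap levels
g = rng.standard_normal(32)
avg = np.mean([mlmc_estimate(g, ladder, probs) for _ in range(50000)], axis=0)
print("max bias:", np.max(np.abs(avg - g)))                      # close to zero
```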

Importantly, MLMC offers several advantages beyond IS:

  • It can be applied immediately to any sequence of biased compressors, with no need for any manual design of coordinate-level sampling probabilities or of the structure of the communicated entity, which IS requires. For Top-$k$, e.g., IS uses $\frac{1}{p_l}\cdot g_l$ w.p. $p_l |g_l|$ to achieve the same result, and while this is straightforward in this case, it might not be as straightforward or even feasible for more complex compressors, as we elaborate below. Also, please note that this intuition regarding IS with Top-$k$ was enabled by MLMC.

  • It is compatible with complex structured compressors that do not admit a coordinate-wise decomposition, and where IS is not naturally defined. For example, ECUQ [1] and Round-to-Nearest (RTN) [2,3] involve structured quantization (e.g., entropy constraints, grid-based rounding) for which the MLMC framework does not naturally decompose into an IS-like scheme. In such cases, it is unclear whether or how suitable IS can be defined, whereas MLMC applies seamlessly. MLMC enables these compressors to be used in a principled way to construct unbiased estimators, with automatic adaptation over compression levels.

  • From a practical perspective, MLMC is also more flexible and intuitive: one can simply define a sequence of biased compressors with increasing accuracy, and MLMC provides a plug-and-play mechanism for building an unbiased gradient estimate without needing to manually tune the probabilities or derive the communicated entity.

We will revise the manuscript to clarify these points and include a detailed discussion comparing MLMC and IS, including when they coincide and when MLMC provides a strictly richer modeling framework. We thank you again for raising this important point.

2. Experiments

We acknowledge the reviewer's concern regarding the ResNet18+CIFAR-10 setting. We chose this standard benchmark to align with prior works (e.g., EF21). Regarding accuracy, reaching 90% requires the use of Adam, LR scheduling, and more, which we did not employ, as our focus was to isolate compression effects.

We fully agree that larger-scale settings will better demonstrate the advantages of our method, and we ran additional NLP experiments. The results are available at "https://anonymous.4open.science/r/ICML2025MLMC-5346".

BERT Top-k:

(See repository). We evaluated our adaptive MLMC method using Top-k compression on the SST-2 benchmark using BERT finetuning. We used the AdamW optimizer. The folder includes 2 files showcasing the accuracy vs. #Gbit communicated, and accuracy vs. iteration (#steps), both for M=4 machines and for k={0.01n, 0.05n, 0.1n, 0.5n}. We evaluated our MLMC method against EF21-SGDM, Top-k, Rand-k, and SGD (we keep this terminology, for clarity, with a slight abuse of notation, but note that the underlying optimizer for all is AdamW).

As is evident from these plots, our MLMC method enjoys the fastest convergence for the same #Gbit, and it achieves convergence similar to that of (uncompressed) SGD while still performing better than the other methods for the same number of steps. These results (in addition to the ones in the paper) demonstrate a strong advantage and efficiency of our method compared to existing methods, even for less aggressive compression, both in communication efficiency and convergence rate, across different tasks such as CV and NLP.

Thank you again for your constructive feedback!

[1] Dorfman et al., “DoCoFL: Downlink Compression for Cross-Device Federated Learning,” ICML 2023.

[2] Gupta et al., “Quantization Robust Federated Learning for Efficient Inference on Heterogeneous Devices,” TMLR 2017.

[3] Dettmers et al., “LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale,” NeurIPS 2022.

Reviewer Comment

IS

As I said in the review, I don't see much difference from IS. The authors' response did not add anything new. There is no compressor in the paper that is not IS. Are there any at all? Moreover, let us look at lines 220-230 (right): if there are compressors that are MLMC but not IS, then we cannot do that (lines 220-230), we have to compute 2 compression operators instead of one, and we cannot say anything about efficiency. Am I right?

Experiments on ResNet

I ran a simple experiment with EF21 with a momentum of 0.9, a step size of 0.01, and compression of 1%, and easily reached an accuracy of 85%.

In any case, experiments where the accuracy of the final result is 10-20% worse than what can be obtained by simple methods without heavy tuning look strange. To me it reads like reporting: "all methods are bad, but ours is the best among the worst!"

BERT Top-k:

Thank you! But I can't open the link. I've tried several times and I don't understand why. Are the results of these experiments the same as on ResNet? Is the quality of the training close to good results?

None of my questions were addressed in the authors' response. Therefore, I maintain my opinion of rejection.

Author Comment

Dear reviewer,

Thank you for responding to our rebuttal, and for engaging in this discussion with us.

IS vs. MLMC

We reiterate that our MLMC method strictly generalizes IS, i.e. there are compressors for which IS is not naturally defined while our MLMC method works seamlessly. We provide the following examples:

Round-to-Nearest (RTN) compression [1,2]: this method quantizes each element by rounding it to the nearest level on a fixed grid, whose spacing is controlled by a quantization step-size. Namely, given a vector $w$, its RTN compression $\tilde{w}$ is given by $\tilde{w} = \delta\cdot \mathrm{clip}(\mathrm{round}(w/\delta),-c,c)$, where the function "round" rounds each element to its nearest integer, and the quantization step-size is $\delta=\frac{2c}{2^b-1}$, with $b$ typically $1,2,3,4,\dots$. A smaller $b$ corresponds to more aggressive compression. No natural IS interpretation exists here.
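
A small sketch of this compressor is given below; it is one possible reading of the formula above (with $c$ taken as a value-domain clipping threshold, set to $\max|w|$ by default), and the exact clipping convention in [1,2] may differ.

```python
import numpy as np

def rtn(w, b, c=None):
    # Round-to-Nearest following the formula above: delta = 2c / (2^b - 1) and
    # w_tilde = delta * clip(round(w / delta), ...). Here the clipping is read
    # as keeping reconstructed values inside [-c, c], with c = max|w| by
    # default; the referenced works may use a slightly different convention.
    if c is None:
        c = np.max(np.abs(w))
    delta = 2 * c / (2 ** b - 1)
    return delta * np.clip(np.round(w / delta), -c / delta, c / delta)

w = np.random.default_rng(0).standard_normal(8)
for b in (1, 2, 4, 8):
    print(b, float(np.linalg.norm(rtn(w, b) - w)))   # error shrinks as b grows
```

An MLMC-RTN ladder can then be formed by taking $C^l$ to be RTN with increasing $b$ and the top level equal to the identity, plugged into the telescoping construction sketched earlier.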

Entropy-Constrained Uniform Quantization (ECUQ) [3]: this compressor works by efficiently finding the largest number of uniformly spaced quantization levels for a given vector such that the entropy of the quantized vector (after applying entropy encoding like Huffman coding) stays within a specified bandwidth budget. No natural IS interpretation exists here either.

Interestingly, the IS interpretation of MLMC compression seems to arise for sparsification-based compressors, like Top-k and bit-wise compression, but it does not hold for quantization-based compressors, like RTN or ECUQ.

We ran additional experiments on BERT fine-tuning with SST-2 comparing RTN compression with our MLMC method (with RTN-based compression). The levels of our MLMC-RTN are defined by $b$, which appears in $\delta$ and determines the quantization step-size (i.e., the extent of compression). We provide test accuracy vs. number of steps and test accuracy vs. #Gbit communicated plots for varying levels of $b$ (and hence, compression). See the link below.

These experiments demonstrate that our MLMC method achieves better final accuracy, faster convergence, and better communication efficiency, even though the difference $g^l - g^{l-1}$ is now not trivial (as it is in, e.g., MLMC-Top-k).

Moreover, regarding the efficiency of $g^l - g^{l-1}$: during our experiments, we also calculated the average sampled MLMC level (which we denote in the paper by $l$ and which is equivalent to $b$ in the MLMC-RTN case), and it turns out that the average level sampled is around $b\sim 1.2$, i.e., $g^l - g^{l-1}$ includes only 1-2 distinct values on average. This makes sense since, by construction, the probability of sampling lower levels (more aggressive compression) is higher than that of higher levels, and this is consistent with classic MLMC methods. This implies that our method mostly samples lower levels (which are much cheaper to communicate) and few higher levels, but it utilizes this information very efficiently to mitigate bias and achieve superior performance across all criteria (accuracy, convergence, and communication efficiency), as our experiments show.

Specifically, for additional clarity, we also provide a graph comparing our MLMC-RTN method with RTN with $b=2$. Even though the average level of MLMC-RTN is $b=1.2$ (compared to $b=2$ for regular RTN), and it is thus more communication-efficient, it also achieves better accuracy and convergence.

We thank you again for pointing out the connection to IS! We promise to include these new experiments and a discussion of the connection between IS and our MLMC method in the paper.

ResNet

We thank you for taking the time to run this. Our results on ResNet in our specific setting are consistent with the results obtained in previous work; see, e.g., the right-most plots of Fig. 13 and Fig. 15 in [4] (EF21), which report similar test accuracy.

In any case, our NLP experiments achieve good results (more than 90%), and this is a harder setting, which demonstrates that our method works and achieves better performance across different tasks.

Link

The link works for us. It needs time to load (or maybe a different browser). In any case, we created a new anonymous repository with the results here "https://anonymous.4open.science/r/ICML2025_2-98B2/", and a dropbox with the results (anonymized account, cannot be traced back to us) here, just in case: https://www.dropbox.com/scl/fo/lmlnm9i4m51cqs185j3wh/AM6WqDl_DTLR4PI5W3dBX1I?rlkey=wr845klp30qd9ghkqmry3krhy&st=uedxj7v8&dl=0

[1] Gupta et al., “Quantization Robust Federated Learning for Efficient Inference on Heterogeneous Devices,” TMLR 2017.

[2] Dettmers et al., “LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale,” NeurIPS 2022.

[3] Dorfman et al., “DoCoFL: Downlink Compression for Cross-Device Federated Learning,” ICML 2023.

[4] Richtárik et al. "EF21: A New, Simpler, Theoretically Better, and Practically Faster Error Feedback", NeurIPS 2021.

Final Decision

This paper introduces a novel Multilevel Monte Carlo (MLMC) compression scheme that leverages biased compressors to construct statistically unbiased estimates. The proposed algorithm effectively bridges the gap between biased and unbiased methods, combining the strengths of both. The empirical results show that the proposed algorithm outperforms the baselines. Theoretical analysis shows that the proposed algorithm can reduce the variance incurred by the compression.

The paper received somewhat mixed reviews, with two reviewers leaning towards acceptance (scores 4 and 3), and two reviewers leaning toward rejection (scores 2 and 2). I have read the reviews, rebuttals, discussion, and skimmed through the paper, too.

However, I propose acceptance, although I believe the paper can benefit from incorporating all meaningful feedback from the reviewers, which will require a major revision. I trust the authors will do so, thoroughly. I think the idea of the paper makes sense, but it is important to spend a good amount of space showing how the new approach gives rise to new compressors that were not considered in the literature before (and not merely showing that the MLMC approach can lead to compressors that are not IS).

More remarks:

  • I have seen a similar multilevel approach applied to SVRG before; this paper should be cited.
  • I believe a more detailed comparison to EF-BV is needed, since this work considers a class of compressors that interpolates between biased and unbiased ones (which is similar in spirit to what the authors aim to do here with the MLMC approach). These compressors retain some strength of the biased counterparts, yet also partially benefit from unbiasedness. Can your approach recover these compressors? Can you obtain new compressors which do not belong to the EF-BV class? I think it is important to distinguish yourself from this work (although clearly your generation method is different, I am wondering about the end result).
  • Yet another important missing citation is Horvath & Richtarik, A Better Alternative to Error Feedback for Communication-Efficient Distributed Learning, arXiv:2006.11077. Just like your work, this paper tried to transform a biased compressor into an unbiased one, maintaining some benefits of the biased compressor, and yet benefiting from unbiasedness of the induced compressor. How do your compressors compare to these?
  • Since you generate unbiased compressors in the end, you can use more powerful methods than EF, such as DIANA, ADIANA, Marina and Dasha. These methods rely on unbiased compressors and they are orders of magnitude better than EF21, for example, in theory, and reach the performance of biased compressors in practice (e.g., Marina with the PermK compressor does; see Szlendak et al, Permutation compressors for provably faster distributed nonconvex optimization, ICLR 2022). How do your methods compare to, say, Marina with the PermK compressor?
  • When you mention asynchronous training in the intro, it may be worthwhile mentioning the Shadowheart SGD method (Tyurin et al; Shadowheart SGD: Distributed asynchronous SGD with optimal time complexity under arbitrary computation and communication heterogeneity, NeurIPS 2024) - since this is the first parallel SGD method in terms of time complexity which caters to arbitrary computation and communication times. It utilizes unbiased compressors, and uses asynchronous communication. It was shown to be the optimal asynchronous method in terms of time complexity; and none of the methods you cite can beat it. It is relevant since your work is on the topic of communication compression.
  • When mentioning local updates in the intro, you may wish to mention the ProxSkip method of Mishchenko et al (ProxSkip: Yes! Local gradient steps provably lead to communication acceleration! Finally!, ICML 2022) -- since this is the first paper showing that local updates, when performed appropriately, lead to provable communication acceleration. Interestingly, further improvement in comm complexity can be obtained by designing appropriate communication compression strategies; and this was recently shown in TAMUNA (Condat et al, TAMUNA: Doubly Accelerated Distributed Optimization with Local Training, Compression, and Partial Participation). The first work showing such double acceleration was Condat et al, Provably doubly accelerated federated learning: the first theoretically successful combination of local training and compressed communication, arXiv:2210.13277.

AC