We thank the reviewer for the kind review! Please find below our point-by-point response regarding your feedback:

1, The author claim that BitDelta shows the potential redundancy of information added during fine-tuning. However, this is not a new finding and almost all PEFT approaches (for example LoRA) are based on this fact.

We agree that the redundant fine-tune information angle is not new, hence our analogy that LoRA enforces structure during training. However, we believe there is nontrivial novelty in using this idea to accurately quantize the weight delta post training to 1-bit for full-parameter fine-tuned models, and successfully translating this reduced memory consumption to a >10x wall clock speedup in multi-tenant settings.

It seems that the paper completely missed the full fine-tuning costs and just measured the memory/latency of the serving step. However, I would suggest to have “apple to apple” comparisons and compare the fine-tuning+serving cost (including memory and runtime) of “full-fine tuning+1bit optimization+serving” against “fine-tuning+serving” in PEFT schemes (like LoRA).

We respectfully disagree with the statement's premise -- to clarify, we do not fine-tune our own models, and instead target existing popular SOTA fine-tuned models on platforms like HuggingFace, as this setting is where our methodology’s downstream applications are most relevant. As such, the effective cost of such models is amortized over many people around the world. If we were in a different setting and needed to fine-tune our own models from scratch (eg. in a niche domain), then we would agree that a full apples-to-apples comparison including the cost of fine-tuning would be more appropriate.

However, the reviewer raises an interesting point and we plan to clarify this in the final manuscript to ensure the value proposition of our work is more accurately understood.

In Table 6, the paper shows higher accuracy for FP+delta compared to GPTQ. Again, I would rather like to see memory Vs. accuracy tradeoff in such comparisons as a function of number of clients. The fact is FP+delta does not have the same memory as GPTQ (please current me if I am wrong).

The reviewer is correct in that has a different memory footprint than . The crossover point would be about 5 models. Serving separate quantized models is mainly relevant in low-batch (low number of clients) and low-memory settings. However, as shown in Table 6, BitDelta can also be applied to quantized base models, which is a viable solution in such settings. For example, when serving 3 models, it is preferable to represent them as one 8-bit base model plus three 1-bit deltas, instead of three separate 4-bit models, in terms of both accuracy and memory.

Nonetheless, BitDelta is not intended to be useful in this regime (low-batch + low-memory), and the quantization ablation mainly serves to show the robustness of the method in terms of accuracy. However, the remark that outperforms may mislead readers to overgeneralize, which we will address in the final manuscript.

It would be nice to have a performance model for the latency of the decoding as a function of client numbers. This is also missed and needs to be include as this is the main claim of the paper.

The End-to-End Latency section (Figure 5) addresses decoding speed as a function of batch size (number of clients). We're more than happy to provide additional results if this is not what the reviewer is expecting.