PaperHub
Overall score: 6.3/10 · Poster · 4 reviewers
Individual ratings: 6, 8, 3, 8 (min 3, max 8, std 2.0)
Average confidence: 4.0 · Correctness: 3.0 · Contribution: 2.5 · Presentation: 3.0
ICLR 2025

OATS: Outlier-Aware Pruning Through Sparse and Low Rank Decomposition

Submitted: 2024-09-23 · Updated: 2025-05-17

Abstract

The recent paradigm shift to large-scale foundation models has brought about a new era for deep learning that, while highly successful in practice, has also been plagued by prohibitively expensive costs in terms of high memory consumption and compute. To mitigate these issues, there has been a concerted effort in post-hoc neural network pruning techniques that do not require costly retraining. Despite the considerable progress being made, existing methods often exhibit a steady drop in model performance as the compression increases. In this paper, we present a novel approach to compressing large transformers, coined OATS, that compresses the model weights by approximating each weight matrix as the sum of a sparse matrix and a low-rank matrix. Prior to the decomposition, the weights are first scaled by the second moment of their input embeddings, so as to ensure the preservation of outlier features recently observed in large transformer models. Without retraining, OATS achieves state-of-the-art performance when compressing large language models, such as Llama-3 and Phi-3, and vision transformers, such as Google's ViT and DINOv2, by up to $60\%$, all while speeding up the model's inference on a CPU by up to $1.37\times$ compared to prior pruning methods.
Keywords
network pruning, low-rank, compression, sparsification, large language models, outlier features

Reviews and Discussion

Review
Rating: 6

In this paper, the authors present a novel approach to compressing large transformers, coined OATS, that utilizes the second moment information in the input embeddings to decompose the model weights into a sum of sparse and low-rank matrices. The authors also conduct extensive experiments showing that OATS is able to consistently outperform the prior state-of-the-art across multiple benchmarks and compression rates while also improving on speed-up.

Strengths

1. This paper proposes a novel method for compressing large transformers that utilizes the second moment of the input embeddings to approximate the model’s weight matrices as a sum of a sparse matrix and a low-rank matrix.

2. Extensive experiments on recent large language models demonstrate the effectiveness of the proposed method, which also generalizes well to vision transformers.

Weaknesses

Please see the questions.

Questions

  1. The author has clarified the differences between Wanda and OATS. However, an ablation study would further illustrate the impact of the low-rank term on performance. How much does the low-rank term contribute to the performance boost compared to Wanda alone?
  2. The hardware speedup on GPU with N:M sparsity patterns can be shown and discussed in Section 3.4.
  3. The paper lacks details on how the sparsity pattern is defined for a matrix composed of a sparse matrix S plus a dense matrix L, especially within the context of N:M sparsity and structured pruning. Providing more detailed explanations would enhance clarity.
  4. The baseline methods are all originally designed for LLMs. It would be valuable to compare against pruning metrics specifically designed for ViTs. This would present a stronger baseline to effectively showcase the proposed method’s advantages.
  5. The motivation behind the choice of outlier information for the sparse term S is not entirely clear. Given that there are various methods for selecting the sparse components, such as using magnitude-based or gradient-based criteria, it would be helpful to know if the authors experimented with these or other selection methods.
  6. Given that most pruning studies report results on 70B-parameter models, has the author tested their method on larger language models, such as Llama2-70B or the newer Llama3-70B?
Comment

The baseline methods are all originally designed for LLMs. It would be valuable to compare against pruning metrics specifically designed for ViTs. This would present a stronger baseline to effectively showcase the proposed method’s advantages.

We appreciate the reviewer’s suggestion and we have added the following paragraph to Appendix A.1 directly addressing this issue:

There are a number of pruning approaches that have been specifically catered towards pruning vision transformers [1,2,3,4,5,6]. However, as much of the pruning literature developed on vision transformers involved models of much smaller scale than the large language models employed in this study, almost all of the prominent pruning algorithms require some form of training on the model parameters. As OATS was designed to require no training, OATS and the aforementioned pruning algorithms would not be comparable.

The motivation behind the choice of outlier information for the sparse term S is not entirely clear. Given that there are various methods for selecting the sparse components, such as using magnitude-based or gradient-based criteria, it would be helpful to know if the authors experimented with these or other selection methods.

We share the same concerns as the reviewer that the choice of using the matrix D may not be the optimal metric when determining the sparse term. We avoided utilizing gradient-based approaches as methods like OATS, DSNOT, SparseGPT, and Wanda can prune the layers through effectively a single forward pass. By requiring gradient information, it would increase the computational requirements of the pruning algorithm. In response to the reviewer, we have, however, extended our ablation studies in Section 3.3 to include experiments where the sparse term S is determined via magnitude pruning instead. These results are presented in a new section, Appendix A.5, titled “Magnitude-Based Pruning for the Sparse Component,” and are included in the table below for the reviewer’s convenience:

| Outlier Scaling | MMLU (↑) | Zero-shot (↑) | Perplexity (↓) |
| --- | --- | --- | --- |
| Low-Rank Term Only | 65.22 | 71.01 | 12.49 |
| Both Terms (OATS) | 65.84 | 70.71 | 11.50 |

We have also included experiments in a new subsection Appendix A.3 titled “Using a Robust Scaling Matrix” where the scaling matrix D utilizes a robust measurement of the second moments instead. We have included the new subsection below for the reviewer’s convenience:

To explore whether the scaling matrix D is truly related to the outlier information, we run the following two experiments:

  1. scaling by the square root of the features' second moments, as is currently done in OATS.
  2. scaling by the median of the features' absolute values (computed along batch and sequence dimensions):

$$D_{robust} = \mathrm{median}(|X|)$$

The second experiment estimates the square root of the second moment of features in a manner that is robust (insensitive) to outliers akin to the Median Absolute Deviation estimator from the robust statistics literature [7]. The results of the two experiments are presented in the table below:

| Scaling Matrix | MMLU (↑) | Zero-shot (↑) | Perplexity (↓) |
| --- | --- | --- | --- |
| $D_{robust}$ | 55.54 | 65.77 | 18.59 |
| $D$ | 59.99 | 68.41 | 15.18 |

The findings show that using the robust scaling method results in significantly worse performance. Hence, the scaling matrix D that is sensitive to the outlier features and captures their scale leads to better compression.
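To make the comparison concrete, here is a minimal NumPy sketch (our illustration, not the authors' code) of how the two diagonal scalings could be computed from a hypothetical calibration activation tensor `X` of shape (batch, sequence, d_in); the `eps` constant is an assumption added for numerical stability:

```python
import numpy as np

def scaling_diagonals(X, eps=1e-8):
    """Return the diagonals of D (second-moment scaling) and D_robust (median scaling).

    X: hypothetical calibration activations of shape (batch, seq, d_in).
    eps: assumed small constant to keep the later division by D well-defined.
    """
    X2d = X.reshape(-1, X.shape[-1])             # flatten batch and sequence dimensions
    d = np.sqrt(np.mean(X2d ** 2, axis=0))       # square root of each feature's second moment
    d_robust = np.median(np.abs(X2d), axis=0)    # median absolute value of each feature
    return d + eps, d_robust + eps

# Column-wise scaling of a weight matrix W (d_out x d_in), i.e. W @ diag(d):
# W_scaled = W * d[None, :]
```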

[1] Chen, T., Cheng, Y., Gan, Z., Yuan, L., Zhang, L., & Wang, Z. (2021). Chasing Sparsity in Vision Transformers: An End-to-End Exploration.

[2] Yu, S., Chen, T., Shen, J., Yuan, H., Tan, J., Yang, S., Liu, J., & Wang, Z. (2022). Unified Visual Transformer Compression.

[3] Yu, F., Huang, K., Wang, M., Cheng, Y., Chu, W., & Cui, L. (2022). Width and Depth Pruning for Vision Transformers.

[4] Yu, L., & Xiang, W. (2023). X-Pruner: eXplainable Pruning for Vision Transformers.

[5] Zhu, M., Han, K., Tang, Y., & Wang, Y. (2021). Visual Transformer Pruning.

[6] Chavan, A., Shen, Z., Liu, Z., Liu, Z., Cheng, K.-T., & Xing, E. (2022). Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space.

[7] Huber, P. J. (1981). Robust statistics.

Comment

Given that most pruning studies report results on 70B-parameter models, has the author tested their method on larger language models, such as Llama2-70B or the newer Llama3-70B?

Thank you for this great suggestion. In response, we have run Llama-3 70B experiments for OATS and the pruning benchmarks and have updated the manuscript to include those experiments. These results are included in the table below for your convenience:

| Compression | Method | MMLU (↑) | Zero-Shot (↑) | Perplexity (↓) |
| --- | --- | --- | --- | --- |
| 0% | Dense | 79.63 | 75.27 | 2.68 |
| 30% | SparseGPT | 78.28 | 75.07 | 3.24 |
| | Wanda | 79.15 | 75.19 | 3.28 |
| | DSNoT | 79.00 | 75.54 | 3.27 |
| | OATS | 78.47 | 75.24 | 3.07 |
| 40% | SparseGPT | 76.29 | 74.63 | 3.99 |
| | Wanda | 77.16 | 74.10 | 4.08 |
| | DSNoT | 77.70 | 74.29 | 4.10 |
| | OATS | 77.89 | 74.88 | 3.68 |
| 50% | SparseGPT | 72.47 | 73.17 | 5.27 |
| | Wanda | 72.04 | 72.85 | 5.38 |
| | DSNoT | 72.76 | 72.91 | 5.58 |
| | OATS | 74.79 | 73.30 | 4.78 |

We would like to thank the reviewer again for asking many insightful questions that have led to the improvement of our work. Should the reviewer find it fitting, we would be appreciative of any potential reconsideration of the score.

Comment

Dear authors,

Thank you for considering my comments and for the well-prepared rebuttal. I will keep my original positive score.

Best wishes,

Comment

Dear Reviewer rXsS,

As the discussion period draws to a close, we would like to sincerely thank the reviewer for their participation during the discussion, for complimenting the preparedness of our rebuttal, and for their overall positive assessment of our work.

Best,

The Authors.

Comment

We would like to first thank the reviewer for spending the time to provide feedback, and engaging with the authors through several questions.

The author has clarified the differences between Wanda and OATS. However, an ablation study would further illustrate the impact of the low-rank term on performance. How much does the low-rank term contribute to the performance boost compared to Wanda alone?

We thank the reviewer for the suggestion and we have added Appendix A.7 titled “Gap Between OATS and Wanda” highlighting the exact gaps for each performance metric between Wanda and OATS. This addition aims to quantify the impact of the low-rank term on performance. For the reviewer’s convenience, we have also included the relevant table below.

| Model | Compression | MMLU (↑) | Zero-Shot (↑) | Perplexity (↓) |
| --- | --- | --- | --- | --- |
| Phi-3 Mini | 30% | +1.21% | +0.82% | -0.44 |
| | 40% | +1.60% | +1.24% | -1.06 |
| | 50% | +5.42% | +3.38% | -2.05 |
| Phi-3 Medium | 30% | +0.97% | -0.01% | -0.43 |
| | 40% | +1.65% | +1.45% | -0.79 |
| | 50% | +2.52% | +2.43% | -1.07 |
| Llama-3 8B | 30% | +1.55% | +0.71% | +0.20 |
| | 40% | +2.13% | +1.64% | -0.50 |
| | 50% | +6.63% | +2.44% | -1.49 |
| Llama-3 70B | 30% | -0.68% | +0.05% | -0.17 |
| | 40% | +0.73% | +0.78% | -0.40 |
| | 50% | +2.74% | +0.45% | -0.60 |

The hardware speedup on GPU with N:M sparsity patterns can be shown and discussed in Section 3.4.

The experiments conducted in Section 3.4 with N:M sparsity were designed as exploratory investigations to compare pruning algorithms with 2:4 sparsity against OATS under the same compression rate (but sparser N:M). Unfortunately, current NVIDIA GPUs only support 2:4 sparsity patterns, preventing us from benchmarking speed-ups on a GPU. However, consistent with prior works like [1,2], which also explore sparser N:M patterns, our goal is to demonstrate that models can achieve higher sparsity while maintaining performance. Through our results, we hope to incentivize companies to support sparser N:M patterns paving the way for greater speed-ups in the future.

The paper lacks details on how the sparsity pattern is defined for a matrix composed of a sparse matrix S plus a dense matrix L, especially within the context of N:M sparsity and structured pruning. Providing more detailed explanations would enhance clarity.

We thank the reviewer for highlighting a point of unclarity in our paper. Once we have the sparse plus low-rank decomposition, we store the compressed weight matrix as three separate matrices: a sparse matrix coinciding with the sparse term, and two matrices coinciding with the low-rank factorization of the low-rank term $LD^{-1}$. We have included the following sentence at the end of Section 2.3 of the manuscript to improve clarity:

The original weight matrix is replaced with three matrices: the sparse matrix $SD^{-1}$, and two matrices coinciding with the low-rank factorization of $LD^{-1}$.
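For illustration, a minimal sketch (ours, not the paper's code) of how a forward pass could use the three stored matrices, where `S_prime` plays the role of the sparse matrix $SD^{-1}$ and `A`, `B` are hypothetical low-rank factors with $AB \approx LD^{-1}$; all shapes below are assumptions:

```python
import numpy as np
from scipy.sparse import csr_matrix

def compressed_matvec(S_prime, A, B, x):
    """Approximate y = W x using the stored sparse matrix S D^{-1} and factors A @ B = L D^{-1}."""
    return S_prime @ x + A @ (B @ x)   # one sparse matvec plus two thin dense matvecs

# Hypothetical shapes, for illustration only.
d_out, d_in, r = 1024, 1024, 64
mask = np.random.rand(d_out, d_in) < 0.05                   # ~5% nonzero entries
S_prime = csr_matrix(np.random.randn(d_out, d_in) * mask)   # stands in for S D^{-1}
A = np.random.randn(d_out, r)                               # low-rank factors of L D^{-1}
B = np.random.randn(r, d_in)
y = compressed_matvec(S_prime, A, B, np.random.randn(d_in))
```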

[1] Yin, L., Wu, Y., Zhang, Z., Hsieh, C.-Y., Wang, Y., Jia, Y., Pechenizkiy, M., Liang, Y., Wang, Z., & Liu, S. (2024). Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity.

[2] Sun, W., Zhou, A., Stuijk, S., Wijnhoven, R., Nelson, A. O., Li, H., & Corporaal, H. (2021). DominoSearch: Find layer-wise fine-grained N:M sparse schemes from dense neural networks.

Review
Rating: 8

This paper proposes a relatively novel model compression technique aimed at Transformer architectures, in which weight matrices are approximated as a sum of a sparse and a low-rank matrix. To control for the (previously documented) outlier feature problem, weights are also scaled by the second moment of their input.

Strengths

The method proposed by this paper is well-explained and well-justified. The actual practical algorithm is easy to follow.

Real-time speedup is shown in the CPU setting.

The experiments cover a range of model sizes (3.8-14B parameters). I especially liked seeing results on fairly small models, since those may be harder to compress.

Section 5 is an interesting way to look at the problem, which I think can lead to interesting further work in the direction of interpretability.

Weaknesses

The choice of the rank ratio parameter could have been better explored (in particular, looking at multiple architectures/tasks).

Typos:
Line 18: “approximating each weight” -> “approximating each weight matrix”
Line 142: “the activations are calculated through a calibration set that is propagated through the compressed layers” - should be uncompressed?

Questions

It would be nice to see a bigger hyperparameter selection section with more architectures considered.

Comment

We would first like to thank the reviewer for providing us with valuable feedback and for highlighting concerns with regard to typos:

Line 18: “approximating each weight” -> “approximating each weight matrix”

Thank you for pointing this out, we have since updated the abstract to correct this typo.

Line 142: “the activations are calculated through a calibration set that is propagated through the compressed layers” - should be uncompressed?

This is actually not a typo and the choice of calculating the input activations through the compressed layers was a deliberate decision based on prior pruning algorithms that did the same (SparseGPT, Wanda, and DSNOT).

It would be nice to see a bigger hyperparameter selection section…

To provide a better picture of how the hyperparameters impact OATS, we have included the results for additional configurations for the Llama-3 8B model and the Phi-3 Mini model in Appendix A.6 titled “Additional Hyperparameter Tests for OATS”. We have included the table below for the reviewer’s convenience:

| Model | Compression | Rank Ratio | MMLU (↑) | Zero-Shot (↑) | Perplexity (↓) |
| --- | --- | --- | --- | --- | --- |
| Phi-3 Mini | 30% | 0.1 | 68.70 | 71.65 | 10.24 |
| | | 0.2 | 68.02 | 71.81 | 10.21 |
| | | 0.3 | 69.28 | 72.07 | 10.28 |
| | 40% | 0.1 | 65.75 | 69.94 | 11.57 |
| | | 0.2 | 65.84 | 70.71 | 11.50 |
| | | 0.3 | 66.81 | 70.54 | 11.60 |
| | 50% | 0.1 | 57.96 | 67.37 | 15.48 |
| | | 0.2 | 59.12 | 68.02 | 15.13 |
| | | 0.3 | 58.68 | 68.63 | 15.47 |
| Llama-3 8B | 30% | 0.1 | 63.62 | 68.99 | 9.35 |
| | | 0.2 | 63.09 | 69.54 | 9.09 |
| | 40% | 0.1 | 61.44 | 68.23 | 9.23 |
| | | 0.2 | 61.97 | 68.43 | 9.09 |
| | 50% | 0.1 | 56.46 | 65.33 | 10.85 |
| | | 0.2 | 56.07 | 65.51 | 10.70 |

Unfortunately, due to computational constraints, we were unable to perform a more extensive grid search over the hyperparameters for each model. However, given that all Phi-3 experiments utilized a rank ratio of 0.25 and all Llama-3 experiments utilized a rank ratio of 0.3, we believe that this is a testament to the robustness of OATS' performance to its hyperparameters.

with more architectures considered.

We agree with the reviewer and we also believe that our experiments would benefit from including larger models. Thus, we have run experiments on Llama-3 70B for OATS and its pruning benchmarks, which we have included in the manuscript and below for the reviewer’s convenience:

| Compression | Method | MMLU (↑) | Zero-Shot (↑) | Perplexity (↓) |
| --- | --- | --- | --- | --- |
| 0% | Dense | 79.63 | 75.27 | 2.68 |
| 30% | SparseGPT | 78.28 | 75.07 | 3.24 |
| | Wanda | 79.15 | 75.19 | 3.28 |
| | DSNoT | 79.00 | 75.54 | 3.27 |
| | OATS | 78.47 | 75.24 | 3.07 |
| 40% | SparseGPT | 76.29 | 74.63 | 3.99 |
| | Wanda | 77.16 | 74.10 | 4.08 |
| | DSNoT | 77.70 | 74.29 | 4.10 |
| | OATS | 77.89 | 74.88 | 3.68 |
| 50% | SparseGPT | 72.47 | 73.17 | 5.27 |
| | Wanda | 72.04 | 72.85 | 5.38 |
| | DSNoT | 72.76 | 72.91 | 5.58 |
| | OATS | 74.79 | 73.30 | 4.78 |

We again would like to sincerely thank the reviewer for their comments and we remain available for discussion should the reviewer have any more inquiries prior to the discussion deadline. If the reviewer finds appropriate, the authors would be appreciative of any further reconsiderations of the score.

Comment

I thank the reviewers for the clarification and additional experiments, and agree with the analysis that the hyperparameter search results provide evidence for the method's robustness.

Regarding Line 142 (compressed vs uncompressed weights in computing inputs). Then it would be helpful to clarify this in Algorithm 2.

Comment

Thank you for the additional suggestion of editing the Algorithm 2 bubble. We have now added “Layer Inputs Propagated through Prior Compressed Layers” to the algorithm bubble when describing the layer inputs that are used to calculate the scaling matrix D.

I especially liked seeing results on fairly small models, since those may be harder to compress…It would be nice to see… more architectures considered.

We wanted to add that we have since run additional experiments on the Qwen 2.5 3B Instruct model to provide an even better understanding of how OATS performs on a wide range of different LLM architectures. We have included the results in a new section, Appendix A.8, titled Qwen 2.5 Experiments and included the table below for the reviewer’s convenience:

| Compression | Method | MMLU (↑) | Zero-Shot (↑) | Perplexity (↓) |
| --- | --- | --- | --- | --- |
| 0% | Dense | 65.99 | 68.49 | 11.02 |
| 30% | SparseGPT | 65.65 | 67.91 | 11.55 |
| | Wanda | 65.46 | 68.08 | 11.66 |
| | DSNoT | 65.65 | 68.21 | 11.67 |
| | OATS | 65.36 | 68.74 | 11.45 |
| 40% | SparseGPT | 63.04 | 67.64 | 12.56 |
| | Wanda | 61.88 | 67.14 | 12.89 |
| | DSNoT | 62.26 | 67.42 | 12.91 |
| | OATS | 64.30 | 68.76 | 12.31 |
| 50% | SparseGPT | 57.43 | 64.36 | 14.92 |
| | Wanda | 55.39 | 64.10 | 16.27 |
| | DSNoT | 55.78 | 64.77 | 16.43 |
| | OATS | 58.78 | 65.74 | 14.91 |

We thank the reviewer again for their continued engagement with us. We remain available for discussion should the reviewer have any other suggestions that would lead to improvements for our work and its score.

Comment

Dear Reviewer PLVG,

As the discussion period wraps up, we would like to offer our final thanks to the reviewer for responding to our comments, for providing additional suggestions beyond their initial review, and for their positive evaluation of our work.

Best,

The Authors.

Review
Rating: 3

This paper introduces OATS, a novel compression technique for large-scale transformers that combines sparse and low-rank approximations to reduce memory and compute costs without requiring costly retraining. OATS scales model weights by the second moment of input embeddings to preserve essential outlier features in transformers, ensuring model performance is maintained during compression. This approach addresses the typical performance degradation seen with increasing compression in existing pruning methods.

Strengths

  • Low-rankness plus sparsity is a good fit for compressing LLMs without retraining.
  • By separating the model into a sparse and a low-rank part, the approximation error can be theoretically reduced as the two parts can compensate for each other.
  • This paper provides measurements for practical speedups on CPUs with existing sparse computation frameworks.

Weaknesses

  • The novelty is limited. The combination of low-rankness and sparsity is an old topic that has been explored for many years [R1, R2]. Applying the well-established approximation techniques to decompose/compress the large matrices in LLMs has little technical contribution. Besides, compressing DNN models using low-rank and sparse decomposition has already been well explored in [R3]. This paper just scales it to larger models and matrices. Authors are encouraged to specify the unique difference from existing approaches and why this difference is also unique for LLMs.
  • The proposed Truncated SVD and Threshold strategies to achieve low-rankness and sparsity are too trivial. It is unknown how to decide the rank and number of zeroes. Besides, the order of applying SVD and thresholding has a significant impact on the approximation errors. Authors are encouraged to clearly explain why using such decomposition strategies.
  • This paper claims "outlier information." However, I have not seen any analysis or explanation for the "outlier information," and the proposed solution is not related to the "outlier information." Instead, this paper seems to directly apply the pruning approaches proposed in Wanda. The authors are encouraged to provide explanations of why the diagonal matrix is related to "outlier information" and why it is good for compression.
  • Many works have been proposed to compress LLMs with low-rankness and sparsity [R4-R6]. The authors have not presented the main differences among them and the unique contributions that stand out from those works.
  • Even though theoretical analysis may not have a practical guarantee of accuracy, the authors are encouraged to provide one.
  • The paper presentation could be improved, especially the math equations.

[R1] Sparse and Low-Rank Matrix Decompositions, Forty-Seventh Annual Allerton Conference, 2009.

[R2] Godec: Randomized low-rank & sparse matrix decomposition in noisy case. ICML 2011.

[R3] On compressing deep models by low rank and sparse decomposition, CVPR 2017.

[R4] SLoPe: Double-Pruned Sparse Plus Lazy Low-Rank Adapter Pretraining of LLMs.

[R5] LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation

[R6] SLiM: One-shot Quantized Sparse Plus Low-rank Approximation of LLMs.

Questions

See the Weaknesses. Additionally, why use the inverse transformation to reach the compressed weight? What is the actual speedup when using N:M sparsity on GPUs?

Ethics Concerns Details

N/A

Comment

N:M Speed-Up (Part 5/5)

[Reviewer's Comment]: What is the actual speedup when using N:M sparsity on GPUs?

The experiments conducted in Section 3.4 with N:M sparsity were designed as exploratory investigations to compare pruning algorithms with 2:4 sparsity against OATS under the same compression rate (but sparser N:M). Unfortunately, current NVIDIA GPUs only support 2:4 sparsity patterns, preventing us from benchmarking speed-ups on a GPU. However, consistent with prior works like [1, 2], which also explore sparser N:M patterns, our goal is to demonstrate that models can achieve higher sparsity while maintaining performance. Through our results, we hope to incentivize companies to support sparser N:M patterns paving the way for greater speed-ups in the future.

[1] Yin, L., Wu, Y., Zhang, Z., Hsieh, C.-Y., Wang, Y., Jia, Y., Pechenizkiy, M., Liang, Y., Wang, Z., & Liu, S. (2024). Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity.

[2] Sun, W., Zhou, A., Stuijk, S., Wijnhoven, R., Nelson, A. O., Li, H., & Corporaal, H. (2021). DominoSearch: Find layer-wise fine-grained N:M sparse schemes from dense neural networks.

We would like to again thank the reviewer for deeply engaging with our paper and for providing numerous valuable comments that have allowed us to improve the paper. Given the comments and additions to the paper, should the reviewer find it appropriate, we would be grateful for any potential reconsideration of our score.

Comment

Additional Related Works and Overall Presentation (Part 4/5)

[Reviewer's Comment]: Many works have been proposed to compress LLMs with low-rankness and sparsity [R4-R6]. The authors have not presented the main differences among them and the unique contributions that stand out from those works.

We sincerely apologize for the oversight in not citing important and relevant works [R4, R6], and we greatly appreciate the reviewer for bringing these to our attention. We have corrected this error, and the following paragraphs outline the additions and adjustments made to address it.

To highlight and compare [R4] with OATS, we have included the following paragraph in Appendix A.1:

[New Section | Page 19]: In [1], the authors propose SLOPE, a novel method for accelerating the pre-training phase of LLMs by incorporating N:M sparsity and adding low-rank components to the model weights to enhance model capacity. Similar to OATS, SLOPE leads to a sparse plus low-rank structure in the model’s weight matrices; however, the low-rank terms are introduced during the final phase of pre-training and are actively trained on the model loss function. In contrast, OATS is designed as a lightweight method to accelerate inference. OATS does not require any training or fine-tuning, but instead approximates pre-trained weight matrices by solving the Robust PCA problem.

For [R6], we thank the reviewers for bringing this concurrent work to our attention. We were, unfortunately, unable to cite the paper since it was posted publicly on arXiv only after the submission deadline for ICLR. We have now included the following paragraph in Appendix A.1 to highlight its differences with OATS:

[New Section | Page 19]: An independent and concurrent work with OATS proposes SLIM [2], a novel pipeline that combines pruning and quantization. To restore lost performance from compression, SLIM derives a low-rank term using singular-value thresholding and adopts a scaling technique akin to OATS. However, instead of the $L^2$ norm, SLIM utilizes the average absolute value across the batch and sequence dimensions. As a further deviation from OATS, SLIM does not perform an alternating thresholding algorithm. Instead, they perform a single quantization and pruning step to initialize the quantized and sparse term, followed by a single singular value thresholding step to establish the low-rank term.

Regarding [R5], we previously cited it under the “Structured Pruning and Low-Rank Adaptation” paragraph in the Related Works section. We have included the paragraph below for the reviewer’s convenience:

[Page 10]: Recent works, such as LoSparse, LoRAPrune, and APT, propose variations of applying structured pruning on the weights while incorporating a low-rank adapter that is trained via gradient descent. These are markedly different than OATS, which does not employ any fine-tuning with low-rank adapters, nor does it perform structured pruning (but rather a sparse plus low-rank decomposition which can be thought of as a combination of structured and unstructured pruning).

[Reviewer's Comment]: The paper presentation could be improved, especially the math equations.

We thank the reviewer for raising their concerns and we have provided an additional annotation to the following equation in Section 2.4 when defining the rank ratio, $\kappa = \frac{r(d_{out} + d_{in})}{(1-\rho)\,d_{out} \cdot d_{in}}$, clarifying that the numerator represents the number of parameters in the low-rank term and that the denominator represents the total number of nonzero parameters in the compressed layer.
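As a quick numeric illustration of this definition (with assumed values, not dimensions taken from the paper), the rank-ratio equation can be solved for the rank r:

```python
# Hypothetical example: choose the rank r that realizes a target rank ratio kappa.
d_out, d_in = 4096, 4096   # assumed layer dimensions
rho, kappa = 0.5, 0.3      # assumed compression rate and rank ratio

# kappa = r * (d_out + d_in) / ((1 - rho) * d_out * d_in)  =>  solve for r
r = round(kappa * (1 - rho) * d_out * d_in / (d_out + d_in))
print(r)  # 307 -> the low-rank factors then hold roughly 30% of the retained parameter budget
```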

[Reviewer's Comment]: Additionally, why use inverse transformation to reach the compressed weight?

We thank the reviewer for raising a potential point of unclarity in our work. In OATS, the alternating thresholding algorithm returns a sparse plus low-rank decomposition that approximates $L + S \approx WD$. From this equation, if one wanted a sparse plus low-rank approximation of the original weight matrix, they would need to multiply by $D^{-1}$ on the right. To improve clarity, we have edited lines 136-137 to instead read:

…which gives a sparse plus low-rank approximation $WD \approx S + L$. OATS then applies the inverse transformation to reach the final compressed weight:

$$W_{compressed} := (L + S) D^{-1}.$$
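To tie the scaling, decomposition, and inverse transformation together, here is a minimal NumPy re-implementation sketch of the pipeline described above (illustrative only, not the authors' code); `d` is the diagonal of D, while `r`, `k`, and `n_iter` are assumed rank, sparsity, and iteration budgets:

```python
import numpy as np

def oats_like_compress(W, d, r, k, n_iter=40):
    """Sketch of the scale -> alternate-threshold -> unscale pipeline described above.

    W: (d_out, d_in) weight matrix; d: diagonal of the scaling matrix D;
    r: target rank; k: number of nonzeros kept in S; n_iter: alternating steps.
    """
    WD = W * d[None, :]                  # scale columns: W @ diag(d)
    S = np.zeros_like(WD)
    L = np.zeros_like(WD)
    for _ in range(n_iter):
        # singular-value thresholding: best rank-r approximation of WD - S
        U, sig, Vt = np.linalg.svd(WD - S, full_matrices=False)
        L = (U[:, :r] * sig[:r]) @ Vt[:r]
        # hard thresholding: keep the k largest-magnitude entries of WD - L
        R = WD - L
        thresh = np.partition(np.abs(R), -k, axis=None)[-k]
        S = np.where(np.abs(R) >= thresh, R, 0.0)
    # inverse transformation back to the original weight space: (L + S) @ diag(d)^{-1}
    return (L + S) / d[None, :]

# e.g. W_hat = oats_like_compress(W, d, r=64, k=int(0.4 * W.size))
```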

[1] Mozaffari, M., Yazdanbakhsh, A., Zhang, Z., & Mehri Dehnavi, M. (2024). SLoPe: Double-Pruned Sparse Plus Lazy Low-Rank Adapter Pretraining of LLMs.

[2] Mozaffari, M., & Mehri Dehnavi, M. (2024). SLiM: One-shot Quantized Sparse Plus Low-rank Approximation of LLMs.

Comment

Similarity with Wanda (Part 3/5)

[Reviewer Comment]: Instead, this paper seems to directly apply the pruning approaches proposed in Wanda.

Thank you for this comment. We wish to clarify that the Wanda algorithm is strictly for pruning dense weight matrices into sparse matrices. Unlike OATS, there is no sparse and low-rank decomposition or alternating thresholding being done. Throughout the paper, we have emphasized that the scaling done by OATS is inspired by Wanda, with an additional paragraph in the Related Works section specifically dedicated to highlighting that OATS would reduce to Wanda, in the specific case where the rank ratio is 0.

To further illustrate the differences between OATS and Wanda, we have included a new section Appendix A.7 titled “Performance Gap Between OATS and Wanda”, that highlights the exact gap between the two compression approaches for each performance metric. We have included the table below for the reviewer’s convenience:

| Model | Compression | MMLU (↑) | Zero-Shot (↑) | Perplexity (↓) |
| --- | --- | --- | --- | --- |
| Phi-3 Mini | 30% | +1.21% | +0.82% | -0.44 |
| | 40% | +1.60% | +1.24% | -1.06 |
| | 50% | +5.42% | +3.38% | -2.05 |
| Phi-3 Medium | 30% | +0.97% | -0.01% | -0.43 |
| | 40% | +1.65% | +1.45% | -0.79 |
| | 50% | +2.52% | +2.43% | -1.07 |
| Llama-3 8B | 30% | +1.55% | +0.71% | +0.20 |
| | 40% | +2.13% | +1.64% | -0.50 |
| | 50% | +6.63% | +2.44% | -1.49 |
| Llama-3 70B | 30% | -0.68% | +0.05% | -0.17 |
| | 40% | +0.73% | +0.78% | -0.40 |
| | 50% | +2.74% | +0.45% | -0.60 |
Comment

Additional Ablations for the OATS Algorithm (Part 2/5)

[Reviewer Comment]: It is unknown how to decide the rank and number of zeroes.

All Llama-3 experiments use a rank ratio of 0.3, and all Phi-3 experiments use a rank ratio of 0.25, demonstrating the robustness of OATS to the rank ratio parameter.

A related and valid question is why the same rank ratio can be applied across different layers of the model. We hypothesize that the weight matrices possess meaningful low-rank subspaces of roughly similar dimensions across layers. While we are actively pursuing a theoretical investigation to better understand this phenomenon, we believe that such an analysis falls outside the scope of the current paper. Evidence for the existence of a meaningful low-rank subspace is presented in a concurrent submission to the ICLR conference [1], which examines the spectral properties of weight matrices through the lens of Random Matrix Theory. Specifically, the study demonstrates that the spectrum includes a small number of outlier eigenvalues, whose removal significantly degrades performance.

[Reviewer Comment]: Besides, the order of applying SVD and thresholding has a significant impact on the approximation errors.

We thank the reviewer for providing a thought-provoking remark about whether the order utilized by OATS is the optimal order to be used. To address this question, we have run additional experiments and added the following subsection in Appendix A.4 titled “Switching the Order of Thresholding”:

[New Section | Page 20]: OATS opts to perform the singular-value thresholding first followed by the hard thresholding similar to [2]. However, one might consider whether the alternative order could lead to faster convergence or a better approximation. Presented in the table below is an extension of the ablation studies presented in Section 3.3, reporting the performance of OATS where the hard-thresholding is performed first:

| First Thresholding Operation | MMLU (↑) | Zero-shot (↑) | Perplexity (↓) |
| --- | --- | --- | --- |
| Hard-Thresholding | 65.51 | 70.54 | 11.72 |
| Singular Value Thresholding (OATS) | 65.84 | 70.71 | 11.50 |

While the performance still remains competitive, across all performance metrics, the switched order falls short of matching the original order presented in the Algorithm.
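For readers who prefer code, a compact sketch (ours, not the authors' implementation) of a single alternating step with a flag that reproduces the switched order used in this ablation; `WD` is the scaled weight matrix, and `r`, `k` are assumed rank and sparsity budgets:

```python
import numpy as np

def alternating_step(WD, S, L, r, k, svt_first=True):
    """One alternating iteration; svt_first=False reproduces the switched-order ablation."""
    def svt(M):                                    # best rank-r approximation of M
        U, sig, Vt = np.linalg.svd(M, full_matrices=False)
        return (U[:, :r] * sig[:r]) @ Vt[:r]

    def hard(M):                                   # keep the k largest-magnitude entries of M
        t = np.partition(np.abs(M), -k, axis=None)[-k]
        return np.where(np.abs(M) >= t, M, 0.0)

    if svt_first:                                  # OATS order: SVT, then hard thresholding
        L = svt(WD - S)
        S = hard(WD - L)
    else:                                          # switched order from the ablation above
        S = hard(WD - L)
        L = svt(WD - S)
    return S, L
```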

[Reviewer Comment]: This paper claims "outlier information" in this paper. However, I have not seen any analysis or explanation for the "outlier information," and the proposed solution is not related to the "outlier information" … The authors are encouraged to provide explanations of why the diagonal matrix is related to "outlier information" and why it is good for compression."

We would like to thank the reviewer for this wonderful inquiry. In response, we have included a new subsection titled “Using a Robust Scaling Matrix” to Appendix A.3, which includes additional experiments exploring whether the matrix D is truly capturing the outlier information, and if that is the reason why it is good for compression:

[New Section | Page 20]: To explore whether the scaling matrix D is truly related to the outlier information, we run the following two experiments:

  1. scaling by the square root of the features' second moments, as is currently done in OATS.
  2. scaling by the median of the features' absolute values (computed along batch and sequence dimensions):

$$D_{robust} = \mathrm{median}(|X|)$$

The second experiment estimates the square root of the second moment of features in a manner that is robust (insensitive) to outliers akin to the Median Absolute Deviation estimator from the robust statistics literature [3]. The results of the two experiments are presented in the table below:

| Scaling Matrix | MMLU (↑) | Zero-shot (↑) | Perplexity (↓) |
| --- | --- | --- | --- |
| $D_{robust}$ | 55.54 | 65.77 | 18.59 |
| $D$ | 59.99 | 68.41 | 15.18 |

The findings show that using the robust scaling method results in significantly worse performance. Hence, the scaling matrix D that is sensitive to the outlier features and captures their scale leads to better compression.

The reason why the scaling is good for compression has also been touched on in prior works like DSNoT and Wanda. Both papers utilize the same scaling matrix as OATS and both reasoned that the scaling is needed to steer the algorithm away from pruning weights that are in an outlier channel.

[1] https://openreview.net/forum?id=MmWkNmeDNE

[2] Zhou, T., & Tao, D. (2011). GoDec: randomized low-rank & sparse matrix decomposition in noisy case.

[3] Huber, P. J. (1981). Robust statistics.

Comment

OATS Novelty and Differences (Part 1/5)

We appreciate the reviewer’s critical assessment of our paper and we will attempt to address each of the reviewer’s concerns.

[Reviewer Comment]: The combination of low-rankness and sparsity is an old topic that has been explored for many years [R1, R2].

We apologize for having missed the seminal works [1,2] during our literature search and have fixed the manuscript appropriately to cite the former when Robust PCA is presented in the paper, and the latter when presenting the alternating thresholding algorithm. We want to clarify that the novelty associated with our work is the effectiveness of Robust PCA for compressing large transformer models once an outlier scaling is applied -- not the Robust PCA problem itself. When introducing the Robust PCA problem, we did cite two seminal works on the problem, a more widely cited work by the same authors of [1] titled Rank-Sparsity Incoherence for Matrix Decomposition and a work by Candes et al. titled Robust principal component analysis? (2009).

[Reviewer Comment]: Besides, compressing DNN models using low-rank and sparse decomposition has already been well explored in [R3]. This paper just scales it to larger models and matrices. Authors are encouraged to specify the unique difference from existing approaches and why this difference is also unique for LLMs.

We again apologize for not having cited this important and relevant work and we thank the reviewer for pointing it out. To rectify our mistake, we have now added the following paragraph to Appendix A.1, providing an in-depth explanation of the differences between the two algorithms and why such differences are unique to LLMs:

[New Section | Page 18 ]: [3] introduced a method for sparse and low-rank decomposition of CNNs, including AlexNet and GoogLeNet, by solving the following optimization problem:

$$\min_{S, L \in \mathbb{R}^{d_{out}\times d_{in}}} \|Y - (S+L)X\|_2^{2} \;\; \text{s.t.} \;\; \|W - (S + L)\|_F^2 \leq \gamma,\ \text{Rank}(L)\leq r,\ \|S\|_0 \leq k$$

where $Y = WX$. In contrast, OATS employs a different approach, solving:

$$\min_{S, L \in \mathbb{R}^{d_{out} \times d_{in}}} \| W - S - L \|_F^2 \;\; \text{s.t.} \;\; \text{Rank}(L) \leq r, \; \|S\|_0 \leq k.$$

A key distinction between these methods lies in their objectives: the former directly minimizes reconstruction error, while OATS adopts a simpler formulation. One might question why not follow the approach of minimizing reconstruction error. As noted in DSNoT [4], pruning methods that prioritize minimizing reconstruction error can degrade model performance in large transformers, particularly in the presence of outlier features. Their findings highlight the importance of avoiding pruning weights within outlier channels. Since feature outliers are a phenomenon unique to large transformer models [5], this issue would not have been relevant to the work of [3], which predates the transformer era.

[Reviewer Comment]: The proposed Truncated SVD and Threshold strategies to achieve low-rankness and sparsity are too trivial.

We acknowledge the reviewer's point that OATS is a simple algorithm. However, this simplicity is precisely what makes OATS appealing. It demonstrates that a conceptually simple approach, which has not been widely applied to large-scale transformer models, can outperform more complex pruning methods.

[1] Chandrasekaran, V., Sanghavi, S., Parrilo, P. A., & Willsky, A. S. (2009). Sparse and Low-Rank Matrix Decompositions.

[2] Zhou, T., & Tao, D. (2011). GoDec: randomized low-rank & sparse matrix decomposition in noisy case.

[3] Yu, X., Liu, T., Wang, X., & Tao, D. (2017). On Compressing Deep Models by Low Rank and Sparse Decomposition.

[4] Zhang, Y., Zhao, L., Lin, M., Yunyun, S., Yao, Y., Han, X., Tanner, J., Liu, S., & Ji, R. (2024). Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs.

[5] Dettmers, T., Lewis, M., Belkada, Y., & Zettlemoyer, L. (2024). LLM.int8(): 8-bit matrix multiplication for transformers at scale.

Comment

Dear reviewer, we would greatly appreciate it if you could take a moment to read our rebuttal and provide any further feedback to the improvements made in response to your comments. This would give us enough time to respond should there be any further changes that the reviewer thinks are needed to improve the paper and its score. We understand the demanding nature of the review process and appreciate the time and effort that the reviewer is dedicating to this task.

Comment

Dear reviewer, we apologize for the repeated emails; however, as the deadline for editing the manuscript is tomorrow, we wanted to ensure that all the reviewer’s concerns have been addressed, especially given that their initial review was quite critical compared to the others. We would greatly appreciate it if the reviewer could provide us with some feedback so that we could take further action should the reviewer have any more concerns prior to the deadline. We want to thank the reviewer again for their in-depth review and reiterate that we do very much appreciate the comments and suggestions that were made in the original review.

Comment

Dear reviewer BUra, as we await your reply, we would like to bring to your attention that we have re-formatted, added a few clarifying sentences, and bolded key sentences in our replies addressing the reviewer’s comments. We hope that these changes make our responses easier to read for the reviewer. Please do not hesitate to reach out on the discussion page if there is anything that we, the authors, can do to ameliorate the reviewer’s process of reading/replying to our comments. We thank the reviewer and look forward to hopefully hearing back from them before the discussion deadline.

Review
Rating: 8

The paper introduces OATS (Outlier-Aware Pruning Through Sparse and Low-Rank Decomposition), a method designed to compress large transformer models without the need for retraining. The central concept involves representing weight matrices as the sum of a sparse and low-rank matrix.

The authors propose an iterative alternating thresholding technique to compute the joint sparse and low-rank decomposition of a matrix. They also focus on preserving outliers by scaling weights according to input embeddings prior to decomposition. The results indicate that OATS performs effectively on both language and vision tasks, outperforming other pruning methods proposed for transformers across various compression levels and tasks. Additionally, OATS offers speed improvements on the CPU.

Strengths

  • The concept of compressing a model as a sum of a sparse and low-rank matrix is very promising. Unlike most prior methods, which focus on one approach, OATS leverages both to potentially enhance performance.

  • OATS is retraining-free, which is crucial for practical applications where even a single backpropagation pass can be computationally prohibitive.

  • The framework has been tested on state-of-the-art models like Llama and ViT, demonstrating competitive performance.

  • The Alternating Thresholding technique in OATS heuristically finds an effective combination of low-rank and sparse components and accommodates different sparsity patterns.

  • By scaling the weight matrix W with a diagonal rescaling matrix D, OATS emphasizes outliers, enabling outlier-aware compression that avoids performance degradation.

Weaknesses

One concern is that the method relies on multiple calls to truncated SVD, which can be computationally intensive. Specifically, finding the top-$r$ singular values of an $m \times n$ matrix has a time complexity of $O(mnr)$. Given that compression speed is a significant factor for practical applications, it would be helpful if the authors could clarify the time complexity and wall-clock time spent on the compression process of the overall algorithm. This would offer a more concrete understanding of its practicality.
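As a rough illustration of where this cost appears, the rank-r factorization in each iteration could be obtained either from a full SVD that is then truncated, or from a partial routine such as scipy.sparse.linalg.svds that only extracts the top-r singular triplets and stays closer to the O(mnr) regime mentioned here. This is a sketch with assumed shapes, not the authors' implementation:

```python
import numpy as np
from scipy.sparse.linalg import svds

m, n, r = 4096, 4096, 128                     # hypothetical layer shape and target rank
A = np.random.randn(m, n)

# Full SVD then truncation: roughly O(m * n * min(m, n)), wasteful when r << min(m, n).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_r_full = (U[:, :r] * s[:r]) @ Vt[:r]

# Partial SVD of only the top-r singular triplets: closer to the O(mnr) regime.
U_r, s_r, Vt_r = svds(A, k=r)
A_r_partial = (U_r * s_r) @ Vt_r
```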

Questions

One potential extension of this work could involve incorporating quantization into the proposed framework. Although this addition may be a long shot, integrating quantization could make the model a unified approach to transformer compression. Could the authors provide insights on whether quantization can be integrated with their current framework, or if not, what are the main challenges?

Comment

By scaling the weight matrix W with a diagonal rescaling matrix D, OATS emphasizes outliers, enabling outlier-aware compression that avoids performance degradation.

We thank the reviewer for highlighting a strength with our work and we wanted to mention that we have further elaborated on this strength with a new section Appendix A.3 titled “Using a Robust Scaling Matrix”, included below, that empirically shows that the sensitivity to outliers is needed for good compression:

To explore whether the scaling matrix D is truly related to the outlier information, we run the following two experiments:

  1. scaling by the square root of the features' second moments, as is currently done in OATS.
  2. scaling by the median of the features' absolute values (computed along batch and sequence dimensions):

$$D_{robust} = \mathrm{median}(|X|)$$

The second experiment estimates the square root of the second moment of features in a manner that is robust (insensitive) to outliers akin to the Median Absolute Deviation estimator from the robust statistics literature [1]. The results of the two experiments are presented in the table below:

| Scaling Matrix | MMLU (↑) | Zero-shot (↑) | Perplexity (↓) |
| --- | --- | --- | --- |
| $D_{robust}$ | 55.54 | 65.77 | 18.59 |
| $D$ | 59.99 | 68.41 | 15.18 |

The findings show that using the robust scaling method results in significantly worse performance. Hence, the scaling matrix D that is sensitive to the outlier features and captures their scale leads to better compression.

[1] Huber, P. J. (1981). Robust statistics.

We would like to give thanks to the reviewer for their detailed comments, careful inspection of our work, and foresight in extending OATS to quantization. Should the reviewer find it appropriate, the authors would always be appreciative of any additional reconsideration of the score.

Comment

The authors would like to extend their gratitude to the reviewer for highlighting the strengths of our paper, raising valid concerns about the runtime of OATS, and inquiring about follow-ups in regards to extending the algorithm to the quantization setting.

Given that compression speed is a significant factor for practical applications, it would be helpful if the authors could clarify the time complexity and wall-clock time spent on the compression process of the overall algorithm. This would offer a more concrete understanding of its practicality.

We thank the reviewer for the suggestion and in response have included a new section Appendix A.2 titled ”Time Complexity and Wall-Clock Time for OATS” which we have included below for the reviewer’s convenience:

The time complexity for OATS is $\mathcal{O}(LN\alpha)$, where $L$ is the number of transformer blocks, $N$ is the number of iterations, and

$$\alpha = \max_{W} \; d^{W}_{out} \cdot d^{W}_{in} \cdot r^{W}$$

where the max is taken over the weight matrices, $W \in \mathbb{R}^{d^{W}_{out} \times d^{W}_{in}}$, in a transformer block and $r^{W}$ is the rank of the low-rank term for that weight matrix. The value $\alpha$ represents the time complexity needed to perform the singular value thresholding in OATS.

The table below reports the wall-clock time (in seconds) needed to perform a single iteration of the alternating threshold algorithm for a single transformer block for the different models that were compressed. All experiments utilized a single NVIDIA A40 with 48GB of GPU memory.

| Phi-3 Mini (3.8B) | Phi-3 Medium (14B) | Llama-3 8B | Llama-3 70B |
| --- | --- | --- | --- |
| 8.85 | 26.02 | 17.10 | 152.80 |

While OATS does require more wall-clock time than prior pruning algorithms, in practice, model compression would only need to be performed once before deployment. This trade-off is therefore worthwhile given the substantial performance improvements, particularly on more challenging tasks like MMLU. Furthermore, like prior pruning algorithms, compressing the layers within a single transformer block can be done in parallel. For example, the time needed per transformer block of Llama-3 70B can be reduced to 71.10 seconds by compressing in parallel across four NVIDIA A40 GPUs.

The total wall-clock time can also be reduced by lowering the number of OATS iterations. Presented in the Table below is an exploratory experiment compressing Llama-3 70B by 50% with a rank ratio of 0.3 with only 20 iterations. Even with only a quarter of the iterations, OATS is still able to outperform all prior pruning algorithms across all performance metrics.

| MMLU (↑) | Zero-shot (↑) | Perplexity (↓) |
| --- | --- | --- |
| 74.02 | 73.41 | 4.95 |

One potential extension of this work could involve incorporating quantization into the proposed framework. Although this addition may be a long shot, integrating quantization could make the model a unified approach to transformer compression. Could the authors provide insights on whether quantization can be integrated with their current framework, or if not, what are the main challenges?

We share the same interest as the reviewer in integrating quantization with OATS. The formulation proposed in our work is flexible and can indeed accommodate that. However, one of the associated challenges is deciding which terms in the sparse and low-rank decomposition should be quantized. Depending on that decision, one would obtain one of the following three optimization problems:

  1. $\min_{S_Q, L_Q} \; \|S_Q + L_Q - W\| \;\; \text{s.t.} \;\; \|S_Q\|_0 \leq k$, $S_Q$ is quantized, $\text{rank}(L_Q)\leq r$, $L_Q$ is quantized

  2. $\min_{S_Q, L} \; \|S_Q + L - W\| \;\; \text{s.t.} \;\; \|S_Q\|_0 \leq k$, $S_Q$ is quantized, $\text{rank}(L)\leq r$

  3. $\min_{S, L_Q} \; \|S + L_Q - W\| \;\; \text{s.t.} \;\; \|S\|_0 \leq k$, $\text{rank}(L_Q)\leq r$, $L_Q$ is quantized

We are currently investigating the three different approaches to determine which would ultimately lead to the best-unified compression approach.

Comment

Dear reviewer, we would greatly appreciate it if you could take a moment to read our rebuttal before the deadline as it would give us enough time to construct a detailed response should there be any lingering concerns or issues that could be resolved to improve our work and its score. We are grateful for all the time and energy that the reviewer has dedicated to the demanding nature of the review process and we would like to reiterate our thanks to the reviewer.

Comment

I appreciate the authors' thorough explanations and efforts to address the questions and concerns raised. I believe OATS presents an interesting contribution to the community, with many promising directions for future research stemming from it. I will maintain my score and advocate for the acceptance of this paper.

Comment

Dear Reviewer Gg5U,

As the discussion period comes to a close, we would like to express our sincere gratitude to the reviewer for engaging with us during the discussion period, for the reviewer’s recognition of OATS as an interesting contribution to the community with many promising directions for future research, and for advocating for the acceptance of our work into the conference. Thank you.

Best,

The Authors.

Public Comment

I hope this message finds you well.

I have a question regarding the CPU throughput tests presented in Section 3.4 of your paper. Specifically, I was wondering if any hardware acceleration techniques, such as compression or other methods, were used during these tests. If not, in scenarios where the total data volume of the compressed matrices exceeds that of the original weight matrix (S, U, V matrices > weight matrix), what strategies would you suggest for improving throughput?

Thank you for your time, and I look forward to your insights.

Best regards

Comment

Thank you for showing interest in our work. Beyond compressing the weights according to OATS, we did not perform any other compression techniques. However, we did utilize Neural Magic's DeepSparse Engine to benchmark the throughput through the sparse model, which utilizes additional techniques to leverage sparsity for speed-up [1]. Two simple solutions to further reduce the memory footprint associated with the unstructured sparse matrix are to either induce more (unstructured) sparsity or to replace unstructured with structured sparsity – both of which OATS is able to do over prior unstructured pruning techniques. However, beyond OATS, another option would be to incorporate quantization, which is a direction that we are actively exploring.

[1] https://github.com/neuralmagic/deepsparse

AC Meta-Review

Summary

The paper presents OATS (Outlier-Aware Pruning Through Sparse and Low-Rank Decomposition), a technique developed to compress large transformer models and reduce memory and compute costs without requiring costly retraining. This method is based on decomposing the weight matrices into a combination of a sparse matrix and a low-rank matrix. OATS enhances model compression by scaling the weights of transformer models according to the second moment of input embeddings, a strategy designed to retain crucial outlier features. This method effectively maintains model performance, addressing the common issue of performance degradation observed with higher compression levels in conventional pruning techniques. The authors demonstrate that OATS effectively compresses transformer models for both language and vision tasks, surpassing the performance of other pruning methods across various tasks and compression levels. Additionally, OATS provides notable speed enhancements when deployed on CPUs.

Strengths

The reviewers unanimously highlighted several strengths of the proposed framework:

  • OATS eliminates the need for retraining, which is essential for practical scenarios where even a single backpropagation pass can be computationally expensive.
  • The method has been evaluated on cutting-edge architectures such as Llama and ViT on a large range of model sizes, solidly demonstrating that it can compete with recent benchmarks in terms of performance.
  • The paper includes data on practical speed improvements on CPUs when using current frameworks for sparse computation.

Weaknesses

The reviewers brought up several core weaknesses:

  • Some very relevant prior works are missing, some of which are conceptually identical to the proposed work, e.g., Yu et al., "On compressing deep models by low rank and sparse decomposition," CVPR 2017.
  • The choice of rank ratio parameter and why it should be fixed across different layers, as opposed to considering dynamic ranks across different layers, e.g., as done in the concurrent work [1].
  • The lack of computational analysis for OATS.
  • Poor placement of the paper in the landscape of current literature.

[1] "Dynamic Low-Rank Sparse Adaptation for Large Language Models," in Proc. Thirteenth Int. Conf. Learning Representations, 2024, under review. [Online]. Available: https://openreview.net/forum?id=oXh0939Zzq

Conclusion

The majority of the reviewers evaluated this paper positively. However, Reviewer BUra (Rating: 3, Confidence: 5) raised several critical issues about the submitted work, with which I largely agree, particularly the omission of critical prior works that are conceptually similar to the current submission. The authors have addressed most of the concerns raised by Reviewer BUra, which unfortunately, Reviewer BUra did not acknowledge. While I share some of Reviewer BUra's reservations, I believe that in light of the authors' rebuttal, the paper's strengths outweigh its weaknesses. Considering this, I view the paper as marginally above the acceptance threshold and recommend acceptance.

Additional Comments from the Reviewer Discussion

Despite my efforts to engage the reviewers in a discussion during the review period to reach a consensus on the paper’s merits and shortcomings, there was no participation in any discussion.

Reviewer BUra raised significant concerns about the paper, to which the authors responded adequately. In my view, their responses partially address the issues highlighted by Reviewer BUra. Unfortunately, the reviewer did not acknowledge the authors' rebuttal. While I agree with some of Reviewer BUra's criticisms, considering the authors' rebuttal and aligning with the majority of the reviewers, I believe the strengths of the paper outweigh its shortcomings. Furthermore, given the significance and relevance of the research topic, I assess the paper to be above the acceptance threshold and vote for its acceptance.

Final Decision

Accept (Poster)