Q3: Whether Wanda+DST can achieve better fully fine-tuned accuracy compared to that of Wanda?

A3: Thank you for this professional inquiry. The answer is yes and DST is orthogonal to LoRA fine-tuning. As evidenced in the following table, after using DST to fine-tune sparse LLMs pruned by Wanda, further performance improvements can be achieved by consequently using LoRA fine-tuning. This yields better results than simply fine-tuning the sparse network pruned by Wanda.

WikiText-2 perplexity performance for using LoRA to fine-tune 50% sparse LLaMA-7B | Sparsity | 0.5 | | ----------------------- | ---- | | Wanda | 7.26 | | Wanda+LoRA | 6.87 | | Wanda+DST | 7.12 | | Wanda+DST+LoRA | 6.76 |

Q4: How many random seeds are used throughout the experiments?

A4: We use one random seed in our submission. To demonstrate the robustness of our approach, we have added the results with different calibration sets under five random seeds in Appendix A.2. The variance across random seeds is very low, suggesting the robustness of DST, corroborating its efficacy as a tool in fine-tuning sparse LLMs. For your convenience, we list the results as below.

WikiText validation perplexity for pruning LLaMA-V1 and LLaMA-V2 models at 60% sparsity. We report the mean and standard deviation under 5 random seeds. | Method | LLaMA-V1-7B | LLaMA-V1-13B | LLaMA-V2-7B | LLaMA-V2-13B | | --------------- | ----------------- | ---------------- | ----------------- | ---------------- | | Dense | 5.68 (0.00) | 5.09 (0.00) | 5.47 (0.00) | 4.88 (0.00) | | SparseGPT | 10.42(0.04) | 8.43 (0.02) | 10.14 (0.03) | 7.88 (0.01) | | w. DST | 9.64 (0.03) | 7.73 (0.02) | 9.68 (0.03) | 7.57 (0.01) | | Wanda | 10.69 (0.01) | 8.75 (0.01) | 10.79 (0.01) | 8.40 (0.01) | | w. DST | 10.22 (0.01) | 8.46 (0.01) | 10.59 (0.01) | 8.18 (0.01) |

Q5: Why is LLM-pruner not included in the baselines while N:M structured pruning is included?

A5: Thanks for your question. The primary goal of this paper is weight pruning, i.e., removing individual weights, including unstructured pruning and n:m sparsity. LLM-pruner itself is a structured pruning, i.e., completely removing entire channels and attention heads, which is not directly comparable to the unstructured pruning approaches. Also due to this reason, both SparseGPT and Wanda do not include LLM-pruner as their baselines.

Your time and effort in reviewing our paper are genuinely appreciated. If there are any additional questions or points that require clarification, we would be more than delighted to engage in further discussions.