PaperHub
6.6/10
Poster · 4 reviewers
Lowest: 3 · Highest: 4 · Std dev: 0.5
Scores: 4, 4, 3, 3
ICML 2025

MoE-SVD: Structured Mixture-of-Experts LLMs Compression via Singular Value Decomposition

OpenReview · PDF
Submitted: 2025-01-09 · Updated: 2025-08-16
TL;DR

In this paper, we present a novel training-free compressor for MoE LLMs that uses SVD to break experts into smaller matrices, sharing and trimming them to save memory and speed up inference.

Abstract

Keywords
Mixture of Experts · Efficient Large Language Models · Low-Rank Decomposition · Network Sparsity

Reviews and Discussion

Official Review
Rating: 4

This paper decomposes expert layers into low-rank matrices to reduce parameter counts and memory demands in MoE LLMs. The key innovations include a selective decomposition strategy based on sensitivity metrics and a low-rank matrix sharing and trimming scheme. The authors claim MoE-SVD achieves significant compression and faster inference on models like Mixtral, Phi-3.5, DeepSeek, and Qwen2.

Questions for Authors

See the weaknesses.

Claims and Evidence

The claims seem generally supported by the experimental results, and ablation studies further validate the contributions of the strategy.

Methods and Evaluation Criteria

This paper presents experiments and comparisons on multiple language-task benchmarks, following previous LLM compression studies such as SVD-LLM and Wanda. The evaluation uses standard datasets and metrics, and it would be nice to see some evaluations on more complex tasks.

Theoretical Claims

MoE-SVD mainly illustrates some properties of MoE models under SVD decomposition via empirical observations, which are quite interesting to me. However, it lacks theoretical analysis to back up the empirical results.

Experimental Design and Analysis

The experimental designs appear sound. The paper compares MoE-SVD with several existing compression methods and conducts ablation studies to evaluate the impact of different components.

Supplementary Material

I reviewed Sections A and B in the supplementary material.

Relation to Prior Literature

The paper adequately discusses related work, including general LLM compressors and MoE-specific compression methods. The introduction is especially interesting and carefully lays out the potential challenges in previous work.

Essential References Not Discussed

To my knowledge, most critical references have been discussed within the paper. I suggest the authors cite the published versions of previous work rather than the arXiv versions.

Other Strengths and Weaknesses

Overall, the paper presents an interesting empirical finding regarding the sensitivity and similarity of MoE LLM decomposition, introduces a method that is easy to implement and reproduce while achieving good inference acceleration, and provides comprehensive experiments on multiple MoE models and benchmarks. However, I have several concerns:

  • The similarity of decomposition matrices is an important aspect. It would be great if the authors could provide more solid examples across multiple MoE LLMs or include a theoretical discussion to strengthen their findings.
  • The trimming of the U matrix appears to have a substantial effect on performance. I encourage the authors to explore more nuanced approaches.
  • There are minor typos (e.g., the caption in Figure 1). Additionally, the method comparison in Table 1 is overly complex, making it difficult for readers to follow. A clearer presentation would enhance readability.

Other Comments or Suggestions

I recommend the authors double-check the line spacing across the paper.

Author Response

Thanks for the valuable feedback and recognition. We have tried our best to address all concerns over the last few days. Please see our responses to your concerns and questions below, one by one.


Q1: More Examples of Matrix Similarity Across MoE Models

A1:

(1) We've conducted additional analysis on Phi-3.5-MoE and DeepSeekMoE (within its inherent expert groups) that shows similar patterns of V-matrix redundancy across architectures. These models exhibit 0.92 and 0.81 average CKA similarity among individual V-matrices, confirming that this property generalizes across different MoE architectures.

(2) Table 3 provides empirical validation that our V-matrix sharing approach works effectively across multiple architectures, confirming that matrix similarity is a general property of MoE models rather than specific to Mixtral.

(3) The theoretical basis for V-matrix redundancy relates to how MoE models are trained. Since all experts process similar input distributions but specialize in different aspects, their output projections (captured by V-matrices) share substantial structure while their internal representations (captured by U-matrices) diverge more significantly.
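A minimal sketch of how such a pairwise CKA check over expert V-matrices could be implemented is shown below; the `linear_cka` and `v_matrix_similarity` helpers, the truncation rank, and the random toy experts are illustrative assumptions, not the authors' code, and "V" here simply denotes the right singular vectors from the SVD.

    import torch

    def linear_cka(x, y):
        # Linear CKA (Kornblith et al., 2019) between two matrices with the
        # same number of rows; columns are mean-centered first.
        x = x - x.mean(dim=0, keepdim=True)
        y = y - y.mean(dim=0, keepdim=True)
        num = (y.T @ x).norm(p="fro") ** 2
        den = (x.T @ x).norm(p="fro") * (y.T @ y).norm(p="fro")
        return num / den

    def v_matrix_similarity(expert_weights, rank):
        # expert_weights: list of (d_out, d_in) weight matrices from one MoE layer.
        # "V" means the right singular vectors; the paper's U/V naming may differ.
        vs = []
        for w in expert_weights:
            _, _, vh = torch.linalg.svd(w, full_matrices=False)
            vs.append(vh[:rank].T)  # (d_in, rank) truncated V-matrix
        pairs = [linear_cka(vs[i], vs[j])
                 for i in range(len(vs)) for j in range(i + 1, len(vs))]
        return torch.stack(pairs).mean()  # average pairwise CKA

    # Toy usage with random "experts"; real MoE expert weights would be loaded instead.
    experts = [torch.randn(256, 512) for _ in range(8)]
    print(v_matrix_similarity(experts, rank=64))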

Q2: Refining U-Matrix Trimming Strategy

A2:

(1) Table 6 shows perplexity trade-offs on Mixtral-8×7B: 4.86 (no trimming) → 5.98 → 7.11 → 10.25 as k decreases (i.e., as trimming becomes more aggressive). k=2 balances performance and speed at high compression.

(2) This value aligns with the standard top-k routing mechanism in MoE architectures, which typically activates 2 experts per token. By retaining 2 U-matrices, we maintain consistency with the model's inherent routing structure.

(3) We further propose an Adaptive k Selection framework to improve U-matrix trimming:

$$k_l^* = \arg\min_k \{\mathcal{L}(k) + \lambda \cdot \mathcal{Q}(k)\}$$

where $\mathcal{L}(k)$ represents the estimated performance loss from trimming to k matrices, $\mathcal{Q}(k)$ denotes the parameter count, and $\lambda$ controls the tradeoff. The performance loss is approximated using the information coverage of retained matrices:

$$\mathcal{L}(k) \approx 1 - \frac{\sum_{i=1}^k f_i \cdot \sigma_i}{\sum_{i=1}^N f_i \cdot \sigma_i}$$

where $f_i$ is the expert sampling frequency and $\sigma_i$ is the sum of singular values, jointly capturing the expert's contribution to model performance.

Using Automatic k Determination, k is computed automatically during compression based on calibration data statistics, without manual tuning. For each layer, the algorithm calculates the marginal utility of increasing k and stops when additional U-matrices provide diminishing returns relative to their parameter cost. Our experiments with Mixtral-8×7B demonstrate the effectiveness of this approach:

| Compression | Method | WikiText-2 PPL | Runtime (Tokens/sec) |
| --- | --- | --- | --- |
| 0% (Original) | - | 3.98 | 87.7 |
| 40% | Fixed k=2 | 6.74 | 109.8 |
| 40% | Adaptive k | 6.53 | 107.5 |
| 40% | U-matrix merging | 6.83 | 112.3 |
| 60% | Fixed k=2 | 13.52 | 156.1 |
| 60% | Adaptive k | 12.91 | 143.5 |
| 60% | U-matrix merging | 13.30 | 158.4 |

  • Layer-specific k Distribution: Analysis of the automatically determined k values reveals an intuitive pattern: early and late layers (which our sensitivity metric identifies as more critical) receive higher k values (typically 2-3), while middle layers receive lower values (typically 1-2). This aligns with our understanding of information flow in transformer architectures and confirms that the algorithm captures meaningful layer-wise differences.

These results show that adaptive k selection consistently outperforms fixed k=2, with Adaptive k providing the best performance by preserving more information from the original expert space, particularly at higher compression ratios, while U-matrix merging offers the best tradeoff between runtime and performance.
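For concreteness, a minimal sketch of the automatic k determination described above is given below; the function name `select_k`, the normalized parameter-cost term, and the toy inputs are illustrative assumptions rather than the exact implementation.

    import torch

    def select_k(freqs, singular_value_sums, lam=0.05):
        # freqs: per-expert sampling frequency f_i from calibration data.
        # singular_value_sums: per-expert sum of singular values sigma_i.
        # Picks k minimizing L(k) + lam * Q(k), where L(k) is the information
        # coverage loss above and Q(k) is a normalized parameter-count proxy.
        contrib = freqs * singular_value_sums
        order = torch.argsort(contrib, descending=True)
        total = contrib.sum()
        n = contrib.numel()
        best_k, best_obj = 1, float("inf")
        for k in range(1, n + 1):
            coverage = (contrib[order[:k]].sum() / total).item()
            obj = (1.0 - coverage) + lam * (k / n)
            if obj < best_obj:
                best_k, best_obj = k, obj
        return best_k

    # Toy example for one layer with 8 experts.
    f = torch.tensor([0.30, 0.22, 0.15, 0.10, 0.09, 0.06, 0.05, 0.03])
    sigma = torch.tensor([9.1, 8.4, 7.9, 7.5, 7.2, 6.8, 6.6, 6.3])
    print(select_k(f, sigma, lam=0.05))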

Q3: Minor Typos and Readability of Table 1

A3:

(1) We'll correct the typo "TThese" in Figure 1's caption and conduct a thorough proofreading of the entire manuscript to ensure clarity and consistency in the revision.

(2) Table 1 will be redesigned for improved readability. We'll use a clearer layout with better spacing and more consistent formatting to highlight the key differences between methods.

(3) We'll ensure that method descriptions are concise yet precise, using consistent terminology throughout the table and the rest of the paper.


Finally, we hope our response addresses your concerns, and we thank the reviewer again for the helpful comments.

Official Review
Rating: 4

This paper introduces MoE-SVD, a new compression framework specifically designed for MoE LLMs. Specifically, they first decompose experts into low-rank matrices via SVD. In particular, they selectively decompose the expert layers based on sensitivity metrics.

Thanks for the authors' detailed responses. All my concerns have been satisfactorily addressed, and I lean toward voting for acceptance.

Questions for Authors

1. The paper proposes sharing a single V-matrix across all experts. Have you explored more flexible sharing schemes, such as clustering similar experts and sharing V-matrices within clusters rather than globally?

2. For the U-matrix trimming, you use a fixed value of k=2. What is the justification for this particular value?

Claims and Evidence

The claims made in this paper are generally supported by experimental evidence.

Methods and Evaluation Criteria

The evaluation across language modeling (perplexity on WikiText-2, PTB, C4) and reasoning tasks (accuracy on seven common sense benchmarks) provides a comprehensive picture of model performance across different capabilities.

Theoretical Claims

The paper doesn't contain formal proofs but does provide theoretical justifications for its approaches.

Experimental Design and Analysis

The experimental designs appear sound and comprehensive. The authors evaluate their method across multiple model architectures and compression ratios. The ablation studies also analyze the contribution of each component.

Supplementary Material

I reviewed all the supplementary material, which contains additional experimental results and implementation details.

Relation to Prior Literature

The paper positions its contributions well within the broader literature on LLM compression and MoE optimization.

Essential References Not Discussed

None

Other Strengths and Weaknesses

Strengths:

1. The paper identifies and addresses the failure of standard SVD methods when applied to MoE architectures. Their sensitivity metric provides an automated and principled way to identify which expert layers to decompose.

2. The matrix sharing and trimming approach effectively balances parameter reduction with performance preservation.

Weaknesses:

  1. The absolute performance degradation at higher compression ratios (e.g., 60%) is still significant.

  2. The theoretical justification for the combined sensitivity metric could be strengthened with more analysis of how the three components interact.

  3. The methodology for selecting the number of U-matrices to trim could be more principled rather than using a fixed value of k=2.

Other Comments or Suggestions

Size and layout of some tables and figures need minor refinement.

Author Response

Thank you so much for the constructive comments and recognition. Please see our responses below:


Q1: Performance Degradation at Higher Compression Ratios

A1:

(1) MoE-SVD outperforms alternatives at all compression levels. At 60% compression, MoE-SVD reaches 13.52 perplexity on WikiText-2, while ASVD and SVD-LLM exceed 10,000.

(2) Lightweight LoRA fine-tuning (MoE-SVD†) mitigates degradation at 1% of full training cost. At 50% compression, it improves accuracy from 0.43 to 0.49, recovering 9.5% of the original performance.

(3) Performance loss at high compression ratios is a general phenomenon; MoE-SVD significantly improves the trade-off curve vs. existing approaches.

(4) Task-specific analysis shows reasoning tasks like ARC-e maintain performance at 60% compression, while MathQA shows higher sensitivity.

Q2: Theoretical Justification for the Sensitivity Metric

A2:

(1) Our sensitivity metric integrates three complementary components with theoretical foundations:

(a) Sampling frequency ($f_i$): expert utilization rate determined by the router.

(b) Principal rank ($p_i$): effective dimensionality of the expert's weight matrix from matrix approximation theory.

(c) Activation outliers ($a_i$): functional distinctiveness of experts.

(2) Information-Theoretic Interpretation of the Sensitivity Metric: We have derived our sensitivity metric ($S_L = \sum_i f_i \cdot p_i \cdot a_i$) from an information-theoretic perspective; please see A2 in our response to Reviewer Cmem.

We show that under certain assumptions, our metric approximates the expected information loss from decomposition:

$$E[I(W_i; Y|X)] - E[I(\tilde{W}_i; Y|X)] \propto f_i \cdot p_i \cdot a_i$$

where $I(W_i; Y|X)$ represents the mutual information between the expert weights $W_i$ and the model output $Y$ given the input $X$.

(3) Table 5 validates that combining these components ($f_i \cdot p_i \cdot a_i$) achieves optimal performance (8.67 perplexity) vs. individual components (9.27-12.65 perplexity).

Q3: Fixed k=2 in U-Matrix Trimming and more principled methodology for K selection

A3:

(1) Table 6 shows perplexity trade-offs on Mixtral-8×7B: 4.86 (no trimming) → 5.98 → 7.11 → 10.25 as k decreases (i.e., as trimming becomes more aggressive). k=2 balances performance and speed at high compression.

(2) This value aligns with the standard top-k routing mechanism in MoE architectures, which typically activates 2 experts per token. By retaining 2 U-matrices, we maintain consistency with the model's inherent routing structure.

(3) We further propose an Adaptive k Selection framework to improve U-matrix trimming:

$$k_l^* = \arg\min_k \{\mathcal{L}(k) + \lambda \cdot \mathcal{Q}(k)\}$$

where $\mathcal{L}(k)$ represents the estimated performance loss from trimming to k matrices, $\mathcal{Q}(k)$ denotes the parameter count, and $\lambda$ controls the tradeoff. The performance loss is approximated using the information coverage of retained matrices:

$$\mathcal{L}(k) \approx 1 - \frac{\sum_{i=1}^k f_i \cdot \sigma_i}{\sum_{i=1}^N f_i \cdot \sigma_i}$$

where $f_i$ is the expert sampling frequency and $\sigma_i$ is the sum of singular values, jointly capturing the expert's contribution to model performance. Using Automatic k Determination, k is computed automatically during compression based on calibration data statistics, without manual tuning. For each layer, the algorithm calculates the marginal utility of increasing k and stops when additional U-matrices provide diminishing returns relative to their parameter cost. Our experiments with Mixtral-8×7B demonstrate the effectiveness of this approach:

| Compression | Method | WikiText-2 PPL | Runtime (Tokens/sec) |
| --- | --- | --- | --- |
| 0% (Original) | - | 3.98 | 87.7 |
| 40% | Fixed k=2 | 6.74 | 109.8 |
| 40% | Adaptive k | 6.53 | 107.5 |
| 40% | U-matrix merging | 6.83 | 112.3 |
| 60% | Fixed k=2 | 13.52 | 156.1 |
| 60% | Adaptive k | 12.91 | 143.5 |
| 60% | U-matrix merging | 13.30 | 158.4 |

Q4: Alternative V-Matrix Sharing Strategies

A4:

Yes. For DeepSeek-MoE and Qwen2 (which have inherently organized expert groups), we actually employ a clustering strategy that follows the architecture's grouping, as sketched below. We will emphasize this point and add more analysis in the revision.
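A minimal sketch of one possible within-cluster sharing scheme is given below; the greedy clustering routine, the CKA-based similarity matrix input, and the threshold value are illustrative assumptions, not the paper's implementation.

    import torch

    def greedy_v_clusters(sim, threshold=0.8):
        # sim: (E, E) symmetric similarity matrix (e.g., pairwise CKA between
        # expert V-matrices). Experts whose similarity to a cluster anchor
        # exceeds the threshold join that cluster; each cluster would then
        # share the anchor expert's V-matrix instead of a single global one.
        n = sim.shape[0]
        assigned = [False] * n
        clusters = []
        for anchor in range(n):
            if assigned[anchor]:
                continue
            members = [anchor]
            assigned[anchor] = True
            for j in range(anchor + 1, n):
                if not assigned[j] and sim[anchor, j] >= threshold:
                    members.append(j)
                    assigned[j] = True
            clusters.append(members)
        return clusters

    # Toy usage: 4 experts, where experts 0/1/3 have highly similar V-matrices.
    sim = torch.tensor([[1.00, 0.92, 0.60, 0.88],
                        [0.92, 1.00, 0.55, 0.90],
                        [0.60, 0.55, 1.00, 0.58],
                        [0.88, 0.90, 0.58, 1.00]])
    print(greedy_v_clusters(sim))  # [[0, 1, 3], [2]]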

Q5: About size and layout

A5:

We will carefully revise the size and layout of the tables and figures in the revision. Thanks.


Finally, we hope our response addresses your concerns, and we thank the reviewer again for the helpful comments.

Official Review
Rating: 3

The paper presents a new compression method (MoE-SVD) for Mixture-of-Experts. The framework introduces a selective decomposition strategy and employs low-rank matrix sharing and trimming. Comprehensive experiments on models like Mixtral, Phi-3.5, DeepSeek, and Qwen2 demonstrate that MoE-SVD achieves faster inference, outperforming other compression methods.

Questions for Authors

Please refer to the weaknesses above.

Claims and Evidence

The central claims are supported by experiments across multiple models and tasks. Table 2 shows MoE-SVD’s superior perplexity and accuracy over baselines, and Figure 4 validates speedup. The ablation studies (Tables 4–6) further substantiate design choices.

However, the term "minimal performance loss" is ambiguous for higher compression ratios (e.g., 60% compression reduces average accuracy from 0.63 to 0.37 on Mixtral).

Methods and Evaluation Criteria

The evaluation criteria, including perplexity on language modeling datasets and accuracy on reasoning tasks, are appropriate and align with standard benchmarks in the field.

Theoretical Claims

The paper does not present theoretical claims.

Experimental Design and Analysis

The experimental design is sound and thorough. The authors evaluate MoE-SVD on multiple models and datasets, including Mixtral, Phi-3.5, and DeepSeek, and provide detailed ablation studies to isolate the contributions of individual components.

Supplementary Material

I have reviewed the supplementary material. The appendix includes code, implementation details, and extended analyses.

Relation to Prior Literature

This work already contains detailed comparisons of the literature on SVD, pruning, and MoE compression.

Essential References Not Discussed

None

Other Strengths and Weaknesses

Strengths

  1. The paper presents a novel and effective approach to compressing MoE LLMs, addressing key challenges like decomposition collapse and matrix redundancy.

  2. The experimental results are comprehensive and demonstrate clear improvements over existing methods.

  3. The framework is practical, requiring no additional training and being hardware-independent.

Weaknesses

  1. The theoretical underpinnings of the method are not explored in depth, which could limit understanding of its generalizability.

  2. Clarify "minimal performance loss" in the abstract relative to baselines vs. absolute metrics.

  3. While MoE and SVD are widely adopted in existing approaches, the novelty of this paper remains limited and requires further enhancement.

Other Comments or Suggestions

Typos: "TThese" in the Figure 1 caption.

Author Response

Thank you very much for your detailed and constructive feedback. We have tried our best to address all concerns over the last few days. Please see our responses below, one by one:


Q1: About Theoretical Justification

A1:

(1) Theoretical Foundation of Decomposition Sensitivity: We have formalized the relationship between expert sensitivity and model performance by analyzing how SVD decomposition affects information flow in MoE architectures. In Section 3.2, we present a detailed theoretical analysis that demonstrates why initial and final layers are particularly sensitive to decomposition due to their role in translating between embedding space and MoE-specific representations. This explains the empirical observations in Figure 1 (left), where decomposing these layers leads to disproportionate performance degradation.

(2) Information-Theoretic Interpretation of the Sensitivity Metric: We have derived our sensitivity metric ($S_L = \sum_i f_i \cdot p_i \cdot a_i$) from an information-theoretic perspective; please see A2 in our response to Reviewer Cmem.

We show that under certain assumptions, our metric approximates the expected information loss from decomposition:

$$E[I(W_i; Y|X)] - E[I(\tilde{W}_i; Y|X)] \propto f_i \cdot p_i \cdot a_i$$

where $I(W_i; Y|X)$ represents the mutual information between the expert weights $W_i$ and the model output $Y$ given the input $X$.

(3) Matrix Redundancy Theory: In Section 3.3, we have added theoretical analysis explaining why V-matrices exhibit higher redundancy than U-matrices in MoE architectures. This analysis builds on prior work in Similarity of Neural Network Representations Revisited (Kornblith et al., 2019), showing that the shared output space constraints in MoE models naturally lead to similar output transformations (V-matrices) while maintaining diverse input transformations (U-matrices) for specialization.

(4) Error Bounds for Matrix Sharing: We have derived error bounds for our V-matrix sharing approach based on matrix approximation theory. For an expert weight $W_i = U_i \Sigma_i V_i^T$ with a rank-$r$ approximation $W_i^{r} = U_i^{r}\Sigma_i^{r}(V_i^{r})^T$, sharing the V-matrix introduces additional error bounded by $\|W_i - U_i\Sigma_i V_s^T\|_F^2 \le \|W_i - W_i^r\|_F^2 + \|\Sigma_i\|_F^2 \cdot \|V_i^T - V_s^T\|_F^2$, where $V_s$ is the shared V-matrix selected based on router sampling frequency as defined in Equation 5 of the paper. This bound becomes tighter as the similarity between V-matrices increases, aligning with our empirical observations of high CKA similarity (average > 0.8) between expert V-matrices.

(5) Compression-Performance Tradeoff Analysis: We have added a theoretical framework that characterizes the relationship between compression ratio, parameter efficiency, and model performance. This analysis explains why selective decomposition based on sensitivity metrics achieves a better Pareto frontier than uniform compression approaches. We derive performance bounds that predict deterioration patterns at different compression ratios, validating our experimental findings in Table 2.

Q2: About minimal performance loss

A2:

(1) The term "minimal" is comparative rather than absolute. We will revise our abstract: " For 40-60% compression, MoE-SVD achieves .... while maintaining relative performance superior to other compression methods." in the revision.

(2) Our claim of "minimal performance loss" is relative to other compression methods at the same compression ratio. As shown in Table 2, at 20% compression, MoE-SVD maintains 92% of the original model's average accuracy (0.58 vs. 0.63), while competing methods maintain less than 85%.

Q3: Novelty Relative to Prior Work

A3:

(1) In Table 1 and Appendix D, we have already compared and analyzed MoE-SVD to provide insights into why our approach succeeds where standard SVD methods (ASVD, SVD-LLM) fail on MoE architectures.

(2) Our V-matrix sharing and U-matrix trimming strategies represent a novel approach to exploiting the unique redundancy patterns in MoE models. We emphasize that these observations and designs are not present in existing MoE and SVD methods.

(3) Compared to expert pruning methods (MoE-Compression, MoE-I²), our approach maintains the sparse activation mechanism while making each expert more efficient. This avoids the significant performance drops associated with expert elimination (e.g., 23% accuracy drop when pruning 25% of experts in Mixtral-8×7B).

(4) MoE-SVD requires no retraining and is hardware-independent, making it more practical for real-world deployment than methods requiring extensive fine-tuning or specialized hardware.


Q4: Typos

A4: Thanks for the suggestion. We will fix "TThese" and double-check and revise all typos in the revision.


Finally, we genuinely hope that our explanations and efforts can improve the overall evaluation of our work. We thank the reviewer again for the helpful comments.

Official Review
Rating: 3

This paper introduces MoE-SVD, a decomposition-based compression approach specifically designed for Mixture of Experts (MoE) Large Language Models (LLMs). Leveraging Singular Value Decomposition (SVD), the method reduces parameter redundancy and memory requirements without requiring additional training. The authors propose selective decomposition using sensitivity metrics, employing a shared V-matrix across experts and trimming U-matrices through top-k selection. Experiments conducted on various MoE models such as Mixtral, Phi-3.5, DeepSeek, and Qwen2 demonstrate a 60% compression ratio and 1.5× faster inference speed with minimal performance degradation.

Questions for Authors

See Other Strengths And Weaknesses

Claims and Evidence

Yes

Methods and Evaluation Criteria

Figure 3 lacks clear explanations regarding the depicted increases and baselines. Clarifying the baseline used and explicitly describing the nature of the improvements observed would significantly enhance reader comprehension.

Theoretical Claims

Yes. The $a_i$ in Equation 4 is still unclear to me even after checking the supplementary material.

Experimental Design and Analysis

The improvements shown in evaluation results are not particularly significant. While the proposed approach successfully reduces memory and parameters, the extent of performance gain or maintenance could be more convincingly demonstrated.

Supplementary Material

Yes. I checked all supplementary material.

Relation to Prior Literature

This paper provides an interesting and practical solution for compressing MoE architectures in LLMs.

Essential References Not Discussed

No.

Other Strengths and Weaknesses

Strengths:

Quantitative Parameter Reduction Analysis: The paper presents a thorough quantitative analysis in Section 3.3, particularly highlighting the efficiency of parameter reduction via V-matrix sharing and U-matrix trimming.

Comprehensive Evaluation: The experimental results are extensive, covering essential aspects such as generalizability, scalability, and multiple ablation studies. This robust evaluation contributes significantly to understanding the efficacy of the proposed method.

Clarity and Readability: The paper is well-organized, clearly written, and accessible, which helps readers easily grasp the concepts and methodology.

Weaknesses:

Ambiguity Regarding the Shared Matrix V: It remains unclear whether the shared V-matrix is universal across different models and tasks or sensitive to specific models and tasks. Addressing the generalizability or specificity of this shared matrix explicitly would clarify the practical implications of the method.

Other Comments or Suggestions

See Other Strengths And Weaknesses

Author Response

Thanks for the valuable feedback. We have tried our best to address all concerns over the last few days. If the reviewer finds our response adequate, we would really appreciate it if the reviewer would consider raising the score. Please see our responses below, one by one:


Q1: Explanation of Figure 3

A1: We clarify that we already provide detailed explanations in Section 3.2 [Lines 220-239], Decomposable Expert Layer Analysis. Figure 3 shows our decomposition-strategy results across different compression ratios, visualizing which expert layers are selected for decomposition at each ratio (the best strategy according to Equation 4) and revealing how compression targets different layers as the ratio increases. We will revise the figure caption and add clarifications in Section 3.3 in the revision.

Q2: About $a_i$ in Equation 4

A2: As detailed in Section 3.2 [Lines 207-211], Appendix D.1 [Lines 850-859], and the Python pseudo-code in Algorithm 1 [Lines 980-985], $a_i$ signifies the activation outliers for the $i$-th expert, defined as the outlier ratio within the expert's weight matrix. Outliers are identified as values exceeding a predefined threshold, which is set as a multiple of the mean value of the matrix. Mathematically, the outlier ratio is calculated as:

$$a_i = \frac{\sum_{a \in A_i} \mathbb{I}(|a| > \tau \cdot \operatorname{Mean}(|A_i|))}{|A_i|}$$

where $|A_i|$ denotes the total number of activations for the $i$-th expert, $\operatorname{Mean}(|A_i|)$ is the mean absolute value of these activations, and $\tau$ is a user-defined threshold. This metric highlights the presence of outlier activations indicative of the expert's contribution to the model's capacity.

Algorithm: Python pseudo-code for $a_i$.

    import torch

    # Assumed inputs (not shown in the original snippet): `activations` is a
    # list of per-expert activation tensors (length `num_experts`) collected on
    # calibration data, `f_i` and `r_i` are the per-expert sampling frequencies
    # and principal ranks (p_i in the paper), and `tau` is the outlier threshold.

    # Compute activation outliers (a_i): fraction of activations whose
    # magnitude exceeds tau times the mean absolute activation.
    a_i = torch.zeros(num_experts, device=device)
    for i, activation in enumerate(activations):
        mean_abs_act = torch.mean(torch.abs(activation))
        outliers = torch.sum(torch.abs(activation) > tau * mean_abs_act)
        a_i[i] = outliers / activation.numel()

    # Compute the final layer sensitivity metric S_L = sum_i f_i * r_i * a_i
    S_L = torch.sum(f_i * r_i * a_i)
    

Q3: About Performance Improvements

A3:

(1) As shown in Table 2, at a 20% compression ratio MoE-SVD achieves an average accuracy of 0.58 across reasoning tasks, compared to 0.51 for ASVD, 0.44 for SVD-LLM, and 0.33 for MC-SMoE. This represents a 23% reduction in performance degradation relative to the original model (0.63). At higher compression ratios (50-60%), where other methods experience catastrophic collapse (e.g., ASVD and SVD-LLM reach perplexities >10,000 on WikiText-2), MoE-SVD maintains reasonable performance with a perplexity of 13.52 on WikiText-2.

(2) The real-world utility is validated by significant efficiency gains: a 1.5-1.8× inference speedup on Mixtral-8×7B at 60% compression, making deployment feasible on resource-constrained devices.

(3) Ablation studies in Tables 4-6 confirm that each component of our approach contributes meaningfully to its success, with selective decomposition, V-matrix sharing, and U-matrix trimming all playing crucial roles in balancing efficiency and performance.

Q4: Generalizability of the Shared V-Matrix

A4:

The shared V-matrix is not universal but dynamically determined for each model architecture. To demonstrate this, we conducted comprehensive layer-wise CKA similarity analysis across different MoE architectures:

Layer-wise Analysis Within Models: the table below shows the layer-wise CKA similarity patterns for U- and V-matrices within three diverse MoE architectures:

| Model | Layer Position | U-matrix Similarity | V-matrix Similarity |
| --- | --- | --- | --- |
| Phi-3.5-MoE | Early (1-10) | 0.29 | 0.78 |
| Phi-3.5-MoE | Middle (11-20) | 0.36 | 0.86 |
| Phi-3.5-MoE | Final (21-32) | 0.31 | 0.74 |
| DeepSeekMoE-16B | Early (1-8) | 0.21 | 0.75 |
| DeepSeekMoE-16B | Middle (9-18) | 0.26 | 0.82 |
| DeepSeekMoE-16B | Final (19-28) | 0.25 | 0.71 |
| Qwen2-57B-A14B | Early (1-8) | 0.18 | 0.77 |
| Qwen2-57B-A14B | Middle (9-18) | 0.23 | 0.83 |
| Qwen2-57B-A14B | Final (19-28) | 0.16 | 0.73 |

These findings provide strong empirical evidence that our approach of dynamically selecting V-matrices based on layer-specific router statistics is well-justified.


Finally, we hope our response addresses your concerns, and we thank the reviewer again for the helpful comments. We genuinely hope that our explanations and efforts can improve the overall evaluation of our work.

Final Decision

The paper proposes a method to sparsify the MoE architecture for faster inference without losing much in terms of performance. The method is very similar to well-known merging techniques where a compressed representation of experts is stored by computing the best rank-k decomposition, which is nothing but performing SVD and choosing the top-k singular vectors. The only novelty is in using a shared V-matrix for all the experts, after performing SVD on all the experts, to further compress the model. There is some indirect justification for this: all experts are similar, and the V-matrix of each expert is similar to its model weights, in terms of the CKA metric defined in previous work. Even then, this is a pure heuristic and lacks theoretical grounding. Reviewers V7cs, Cmem, and cMac have pointed this out as a concern, and the authors have justified it to some extent. The empirical results show improvements over baselines. I suggest the authors carefully address all the concerns raised by the reviewers, and ground the motivations of the proposed method, in the final version of their paper.