Thank you for the instant reply, and we sincerely appreciate that! Your current comments are critical to improve the clarity and readability of the manuscript, and to strengthen the paper logic and theoretical foundation. We'd like to clarify more clearly, and we would update it to revised manuscript.

Question 1: Why does delta magnitude in Equation 2 indicate task-specific importance? & on what basis delta magnitude in Equation 2 is treated as a reliable proxy for task specificity

As per your suggestion, we provide additional explanation for the statement "It is critical to derive the DSR as task-specific and higher DSR scores play a more important role".

We first define the above notations:

, parameter vector; row is .
, loss of the current task.
, parameters before fine-tuning; after.
.
, input sample at token ; .

The fine-tuning dynamics are described by:

where the is gradient at step , is the learning rate, is integration variable, and is total steps. Hence, is large if the gradient keep projecting on -th row significantly in training. We perform a second-order Taylor expansion by varying , plug , and obtain:

where is the Hessian matrix of the loss () at the starting point , and is its -th diagonal element (curvature along the -th row). Substantial loss drop thus needs a non-trivial . Combining the above and Equations 2 and 3 in the paper, we get:

Thus, DSR score is proportional to both and , meaning that DSR score can denote the task specific importance. And we would add this derivation into the Appendix of the newest manuscript.

And similar conclusion has been proven in paper[1], which was presented for model merge, while its findings and conclusions also inspired us. In paper[1], the authors illustrated that removing rows (or entire matrices) whose deltas (directly computing parameter delta between different LLMs) are small hardly hurts accuracy, whereas pruning those with large deltas severely degrades performance. Paper[2] discovered that some tasks only require low-dimensional changes in pre-trained LMs, and updating extra parameters can hurt due to noise and interference. However, our work did not concentrate merely on the advanced method, the major contribution of this paper is that we validated the universe of reasoning tasks' mutual benefits and conflicts in current SFT paradigm, which have been often overlooked in previous research. Based on the comprehensive empirical study, we proposed a set of fine-grained strategies for both mix-up and continual SFT, to give some analysis insights to this field.

[1] DARE the Extreme: Revisiting Delta-Parameter Pruning For Fine-Tuned Models, ICLR 2025, spotlight.

[2] Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. ACL 2021.