We appreciate your thoughtful comments. Below, we address the key points raised across the reviews.

W1: They only attempt pretraining and finetuning-based approaches and that too with only two models.

A1: We were indeed constrained by computational resources, particularly for continual pretraining and instruction pretraining models. But to address model variety, we conducted experiments with two math SFT datasets on Llama 3.1-8B. The results (shown in Section 3.3), confirm our findings: math-specific SFT improves performance on MPS benchmarks but fails to generalize to broader reasoning tasks.

	MPS	MR	Logical	STEM	CS	Symbolic	Agent
Llama 3.1-8B	49.3	31.8	25.4	44.5	54.6	57.5	38.8
Math-COT SFT (Llama 3.1)	50.3	23.6	24.0	40.8	52.7	55.8	41.1
Math-POT SFT (Llama 3.1)	51.3	20.3	24.1	43.8	52.5	57.4	39.5
MR: Math (excluding problem-solving). CS: Commonsense

W2: The authors should try to examine methods beyond just continual pretraining and fine-tuning. For example, they should try reinforcement learning approaches

A2: This is a valid point. While RL methods could provide more insights, the pipeline for mathematical reasoning is still under development in the open-source community, with numerous loss variants (e.g., PPO[1], DPO[2], GRPO[3]) and debates around training methods and reward models[4,5]. So we think generalizing RL training warrants a separate, focused study. And this area is beyond the scope of our work, where we primarily focus on the most commonly used methods for MPS.

W3: This paper is mostly focused on “correlations” as opposed to the “causations” as to why certain training approaches and datasets succeed or fail at improving a model’s generalizability

A3: The observation about correlations versus causations is well-taken. In response, we conducted an embedding analysis (Figure 4) of training datasets and benchmarks, showing that WebInstruct (for instruction pretraining) and OpenWebMath (for continual pretraining) have greater overlap with diverse benchmarks than MetaMath (for instruction tuning), highlighting the role of topical diversity in generalization. Additionally, Figure 2 (Section 3.3) shows that benchmarks with queries more similar to GSM8K/MATH tend to see greater improvements, helping to explain varied patterns across benchmarks. We hope these analyses provide some “causations” on why certain training datasets succeed or fail across benchmarks.

Q1: Are the findings from your paper model agnostic? If not, what kinds of models benefit from this sort of training and with sorts of datasets?

A1: Fine-grained generalization depends on model architecture and weights, however, we think the high-level finding. Our findings are not entirely model agnostic. Experiments show that models like Mistral-7B and Llama 3.1-8B respond differently to math SFT training. And compare to the general-purpose model like Mistral-7B, specialized models like DeepSeek-Coder benefit boarder from math-specific datasets in STEM reasoning. This suggests that generalization also depends on the alignment between the model's architecture and pretraining corpus.

Q2:What kinds of other training-based approaches could be done to improve the score?

A2: Incorporating RL or hybrid methods may further enhance general reasoning abilities, and we see great potential for exploring these in future work.

Q3: If there is any correlation to the specific skills that an LLM has after continual pretraining which is why the model shows an improvement in general reasoning tasks? Can you explain further as to why certain training methods with their corresponding datasets succeed or fail at improving an LLM’s general reasoning ability?

A3: We think the main difference between three training paradigms lies in the datasets used rather than the training methods, as all three employ the same next-token prediction loss. To avoid repetition, we refer to our response to W3 where we highlighted the role of data diversity in enabling generalization and the correlation between query similarity to GSM8K/MATH and benchmark performance improvements.

[1] Schulman J, et al. Proximal policy optimization algorithms[J]. arXiv preprint arXiv:1707.06347, 2017.

[2] Rafailov R, et al. Direct preference optimization: Your language model is secretly a reward model[J]. Advances in Neural Information Processing Systems, 2024, 36.

[3] Shao Z, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models[J]. arXiv preprint arXiv:2402.03300, 2024.

[4] Tang, Yunhao, et al. "Understanding the performance gap between online and offline alignment algorithms." arXiv preprint arXiv:2405.08448, 2024.

[5] Pan, Sarah, et al. "Let's Reinforce Step by Step." arXiv preprint arXiv:2311.05821, 2023.