PaperHub
Average rating: 4.3 / 10 (Rejected; 4 reviewers)
Individual ratings: 6, 3, 3, 5 (min 3, max 6, std dev 1.3)
Average confidence: 3.5
Correctness: 2.5 | Contribution: 2.0 | Presentation: 2.8
ICLR 2025

Math for AI: On the Generalization of Learning Mathematical Problem Solving

OpenReview · PDF
Submitted: 2024-09-24 · Updated: 2025-02-05

Abstract

Keywords
Large Language Models · Mathematical Reasoning · Reasoning Generalization

Reviews and Discussion

Review (Rating: 6)

This paper investigates whether training on mathematical problem-solving (MPS) enhances LLMs' general reasoning capabilities. Through comprehensive empirical analysis of three training approaches (continual pretraining, instruction pretraining, and instruction tuning), the authors evaluate performance across seven reasoning domains. The study implements both two-stage and mix-data training strategies to assess how math-related training integrates with general model development. Their findings show that while continual pretraining on mathematical texts can improve broad reasoning capabilities, other approaches primarily enhance only MPS-specific performance, indicating that most current math datasets don't effectively support general reasoning enhancement. This work provides insights into the "math for AI" hypothesis and highlights the need for better data sources to develop general reasoning capabilities.

Strengths

  1. The research topic is novel and valuable. Humans can enhance their overall reasoning abilities by learning mathematics and practicing problem-solving. Inspired by this, the paper explores whether training AI models to improve their mathematical problem-solving skills could similarly enhance their broader general reasoning capabilities. By distinguishing between the "AI for math" and "math for AI" perspectives, this work provides a new evaluation perspective.
  2. Given the increasing focus on the complex reasoning abilities of LLMs and the scarcity of high-quality reasoning datasets, constructing large-scale, high-quality mathematical reasoning datasets and training LLMs would require enormous resources. Therefore, researching "math for AI" to enhance AI's reasoning capabilities beyond mathematical problem-solving aligns with the goals of AI development and can improve data utilization efficiency.
  3. The study presents a thorough experimental framework comparing three distinct training approaches. For continual pretraining, the researchers utilized the Rho-Math and DeepSeekMath models, while instruction pretraining was conducted using the MAmmoTH2 model. Instruction tuning experiments employed Math-COT SFT and Math-POT SFT. Through rigorous evaluation, the study yielded several significant findings.

Weaknesses

  1. The technical contributions seem to be limited. While the paper conducts extensive research across several models and datasets, it fails to establish a universal framework for evaluating AI's general reasoning capabilities in the context of "math for AI". This limitation reduces the study's potential impact on future assessments of AI's general reasoning abilities.
  2. In the section "WHAT OTHER DATA SOURCES CONTRIBUTE TO REASONING," the paper does not provide effective methodologies or meaningful insights for constructing datasets that could enhance AI's general reasoning capabilities.
  3. The classification of tasks relies solely on input-output format distinctions, without implementing deeper similarity detection or filtering mechanisms for the actual data content. This oversight could lead to data leakage between MPS and other tasks. The potential overlap between mathematical and general reasoning datasets might result in misleading conclusions about reasoning skill transfer.

Questions

  1. In the main experimental results in Table 1, part (3) (instruction tuning on MPS datasets), why does CoT perform better than PoT on the MPS results, and why does the mixed method perform much better than the 2-stage method?
  2. Can you provide specific examples to demonstrate how mathematical training can improve or deteriorate different reasoning abilities, which can clarify the limitations of mathematical training and offer a qualitative analysis of the reasoning patterns developed through different training methods?
  3. What data do you think would really improve LLM's ability to reason over a wide range of tasks? How can this generalized reasoning capability be better assessed automatically?
Comment

Thanks for your thoughtful comments! We address the questions below:

W1: While the paper conducts extensive research across several models and datasets, it fails to establish a universal framework for evaluating AI's general reasoning capabilities in the context of "math for AI".

A1: Reasoning is a broad and complex topic, and, as far as we know, there is no consistent or universally agreed-upon method in the literature to evaluate LLM reasoning. In this paper, we use 16 benchmarks that are commonly adopted for evaluating reasoning. We believe that establishing a universal framework for reasoning evaluation is a valuable topic for future work.

W2: the paper does not provide effective methodologies or meaningful insights for constructing datasets that could enhance AI's general reasoning capabilities.

A2: Our paper evaluates existing methods for their generalization to reasoning, while constructing better methods and datasets is beyond its scope and worth future study.

W3: The paper does not implement deeper similarity detection or filtering mechanisms for the actual data content. This oversight could lead to data leakage between MPS and other tasks. The potential overlap between mathematical and general reasoning datasets might result in misleading conclusions about reasoning skill transfer.

A3: To address concerns about task classification and data leakage, we computed the is_similarity [1] metric between MMLU-STEM and GSM8K/MATH questions, which helps quantify the overlap between each subset of MMLU-STEM and GSM8K/MATH. Notably, is_similarity is False for 100% of MMLU-STEM samples, indicating no data leakage between MMLU-STEM and GSM8K/MATH.

Q1: In the main experimental results in Table 1, part (3) (instruction tuning on MPS datasets), why does CoT perform better than PoT on the MPS results, and why does the mixed method perform much better than the 2-stage method?

A1: We want to clarify that our evaluation across all benchmarks adopts a CoT format without calling Python interpreters, since CoT serves as a more general and flexible format across a variety of benchmarks, and keeping a consistent evaluation method allows us to fairly analyze the impact of different training methods. For PoT training, our primary goal was to investigate the effect of code-based training data on reasoning tasks. Thus, it is not surprising that CoT training outperforms PoT in our setup, as CoT aligns better with the evaluation format. Previous works that report higher performance for PoT on MPS tasks typically do so by using Python interpreters during inference. Regarding why the mixed training method performs better than the 2-stage method on MPS tasks, we hypothesize this is due to the second stage using the general-purpose UltraChat dataset, which likely causes the model to forget some of the MPS skills learned during the first stage, whereas mixed training avoids this forgetting issue.

Q2: Can you provide specific examples to demonstrate how mathematical training can improve or deteriorate different reasoning abilities, which can clarify the limitations of mathematical training and offer a qualitative analysis of the reasoning patterns developed through different training methods?

A2: In the revised Appendix D, we provide examples from logical reasoning tasks, demonstrating the detailed outputs across different training methods. We find that Mistral-7B and DeepSeek-Coder are more prone to logical errors, such as inconsistencies or premature assertions in reasoning. Models trained on math-related data, such as DeepSeekMath and MAmmoTH2, exhibit fewer such errors, suggesting that structured mathematical datasets not only improve mathematical performance but also enhance logical reasoning generalization, enabling more coherent and reliable conclusions.

Q3: What data do you think would really improve LLM's ability to reason over a wide range of tasks? How can this generalized reasoning capability be better assessed automatically?

A3: We think that incorporating large-scale datasets spanning diverse domains, such as science, humanities, social sciences, engineering, logical reasoning, and coding, may help improve LLMs' reasoning abilities. In short, we believe that data diversity is the key, but measuring and creating extremely diverse data remains an open challenge. As shown in Figure 4, the training data studied in our paper only admits limited diversity. Evaluating generalized reasoning capability automatically requires comprehensive assessments across diverse benchmarks. While there is no consensus yet on evaluation methods or benchmark selection, we included a wide range of benchmarks in this paper. We recognize the need for future research to develop a unified reasoning suite for more standardized and comprehensive evaluation.

[1] Zhang G, et al. Map-neo: Highly capable and transparent bilingual large language model series[J]. arXiv preprint arXiv:2405.19327, 2024.

Comment

Thanks for your response. I appreciate the effort to address the concerns. Below are two main points I would like to further discuss:

Regarding Weakness 2: While the paper focuses on evaluating the generalization to reasoning, I believe that simply benchmarking LLMs may not be sufficient. Many studies have already evaluated LLMs and observed significant performance drops under perturbed settings. Hence, the limited generalization ability of LLMs is not entirely surprising. I think what is more valuable is to investigate why such phenomena occur or to uncover critical insights, such as identifying specific factors limiting generalization or data characteristics that could enhance it. These insights could provide concrete guidance for future improvement, making the evaluation more impactful beyond just reporting performance.

Regarding Weakness 3: Thank you for clarifying the use of the is_similarity metric to address potential data leakage. However, I would like to better understand how this metric works. Is it a binary classification method, or does it estimate the similarity for each pair of questions and then apply a threshold to determine overlap?

Comment

Thank you for your thoughtful follow-up! We appreciate the opportunity to discuss further. We look forward to addressing the two main points you’ve raised.

Regarding W2: While the paper focuses on evaluating the generalization to reasoning, I believe that simply benchmarking LLMs may not be sufficient. Many studies have already evaluated LLMs and observed significant performance drops under perturbed settings. Hence, the limited generalization ability of LLMs is not entirely surprising. I think what is more valuable is to investigate why such phenomena occur or to uncover critical insights, such as identifying specific factors limiting generalization or data characteristics that could enhance it.

A: While many studies have already evaluated LLMs and observed significant performance drops under various settings, our focus is on a different problem: understanding the effect of learning mathematical problem solving (MPS). This is particularly relevant given that it is common practice to train models on a significant amount of MPS data and highlight their performance through MPS scores. To the best of our knowledge, no previous work has studied such a generalization setting in a carefully controlled manner. Instead, we observe that many practical developers actually hypothesize that incorporating MPS data into general training could enhance reasoning abilities, much like how humans improve reasoning skills through mathematical practice.

Regarding the point of investigating why such phenomena occur, we fully agree that this is a very valuable study. In the revised PDF, we have made an attempt by analyzing specific factors influencing generalization, such as data diversity and alignment with benchmarks. In particular, we conducted an embedding analysis of queries from benchmarks and training datasets (Figure 4 in the revision). Datasets like WebInstruct (instruction pretraining) and OpenWebMath (continual pretraining) show greater overlap with diverse benchmarks, suggesting that their broader topical coverage enhances generalization. In contrast, MetaMath (instruction tuning) contains queries heavily concentrated in math-related areas, limiting its ability to generalize beyond math-specific tasks. Additionally, we observed that benchmarks with queries more similar to GSM8K/MATH tend to exhibit greater improvements (Figure 2 in the revision), indicating that generalization is closely tied to the alignment between benchmarks and math-related training datasets. These insights indicate the importance of data characteristics, such as diversity, in driving generalization.
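As a concrete illustration of this kind of analysis, below is a minimal sketch of a query-embedding overlap plot; the embedding model ("all-MiniLM-L6-v2"), the t-SNE projection, and the function name are illustrative assumptions rather than the exact setup behind Figure 4.

```python
# Illustrative sketch of a query-embedding overlap analysis (in the spirit of
# Figure 4). The embedding model and t-SNE projection are assumptions, not the
# authors' exact setup.
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

model = SentenceTransformer("all-MiniLM-L6-v2")

def plot_query_overlap(queries_by_source):
    """queries_by_source: dict mapping a dataset/benchmark name to a list of query
    strings, e.g. {"MetaMath": [...], "WebInstruct": [...], "GSM8K": [...]}."""
    names, texts = [], []
    for name, queries in queries_by_source.items():
        names += [name] * len(queries)
        texts += queries
    emb = model.encode(texts)                            # one embedding per query
    xy = TSNE(n_components=2, random_state=0).fit_transform(emb)
    for name in queries_by_source:
        idx = [i for i, n in enumerate(names) if n == name]
        plt.scatter(xy[idx, 0], xy[idx, 1], s=4, label=name)
    plt.legend()
    plt.show()
```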

Regarding W3: How this is_similarity metric works. Is it a binary classification method, or does it estimate the similarity for each pair of questions and then apply a threshold to determine overlap?

A: The metric follows the approach in [1] to determine whether two queries are similar based on their lengths and the edit distance between them. For queries Q1 and Q2, the rule is defined as follows:

\text{is\_Similar}(Q_1, Q_2) = \begin{cases} \text{True}, & \min(|Q_1|, |Q_2|) \geq 15 \ \land\ \text{editDist}(Q_1, Q_2) < 0.1 \times \min(|Q_1|, |Q_2|), \\ \text{True}, & \min(|Q_1|, |Q_2|) < 15 \ \land\ Q_1 = Q_2, \\ \text{False}, & \text{otherwise}. \end{cases}

Here, |Q| denotes the length of query Q, and editDist(Q1, Q2) is the edit distance between Q1 and Q2, i.e., the minimum number of character edits needed to transform one query into the other.

This metric identifies whether two queries are sufficiently similar, considering both their length and textual differences. For longer queries (at least 15 characters), a small edit distance relative to the shorter query indicates similarity; for shorter queries, an exact match is required. An output of True means the queries are similar according to the above rules, while False means they are not.
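For concreteness, a minimal Python sketch of this rule is given below; the Levenshtein implementation is an assumption, since [1] does not prescribe a specific edit-distance algorithm.

```python
# Sketch of the is_Similar rule above (assumption: standard Levenshtein edit distance).
def edit_dist(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i] + [0] * len(b)
        for j, cb in enumerate(b, start=1):
            curr[j] = min(prev[j] + 1,                    # deletion
                          curr[j - 1] + 1,                # insertion
                          prev[j - 1] + (ca != cb))       # substitution
        prev = curr
    return prev[len(b)]

def is_similar(q1: str, q2: str) -> bool:
    shorter = min(len(q1), len(q2))
    if shorter >= 15:
        return edit_dist(q1, q2) < 0.1 * shorter
    return q1 == q2
```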

[1] Zhang G, Qu S, Liu J, et al. Map-neo: Highly capable and transparent bilingual large language model series[J]. arXiv preprint arXiv:2405.19327, 2024.

Review (Rating: 3)

This paper studies whether improvements in LLMs' mathematical reasoning capabilities can generalize to broader reasoning domains. Specifically, it examines various training approaches (continual pre-training and instruction tuning) on math-related data and determines whether enhanced mathematical performance transfers to tasks like symbolic reasoning, logical reasoning, and commonsense reasoning. This investigation addresses the "math for AI" hypothesis, which posits that training on mathematical reasoning could enhance LLMs' general reasoning capabilities.

The authors conduct controlled experiments comparing three training paradigms: (1) continual pre-training on mathematical text, (2) instruction pre-training on diverse QA pairs, and (3) instruction tuning on mathematical problem-solving datasets. They evaluate each approach's generalization to multiple non-mathematical reasoning domains. Their key finding suggests that only continual pre-training demonstrates meaningful generalization, showing improvements in 3 out of 5 non-mathematical domains.

Strengths

The paper studies an interesting and important question in the development of LLMs. Pushing LLMs' capabilities on math datasets is indeed an increasingly popular research topic, and this paper presents a timely study.

The experiments consider multiple training paradigms for improving math reasoning and evaluate the generalization performance on an array of datasets other than math.

The paper is well-written and easy to follow.

Weaknesses

My major concerns are that 1) the categorization of non-math datasets and the interpretation of results might be problematic, and 2) the conclusions are probably overstated.

First, regarding the non-math dataset categorization, the paper classifies reasoning tasks into logical reasoning, STEM reasoning, commonsense reasoning, and symbolic reasoning. I think STEM reasoning should not be juxtaposed with the other categories. STEM reasoning interweaves mathematical reasoning, logical reasoning, and domain-specific STEM knowledge. It is not a good choice to treat it as a distinct reasoning category.

More importantly, the STEM reasoning evaluation is based on GPQA and MMLU-STEM. MMLU-STEM is heavily intertwined with mathematical reasoning. MMLU-STEM contains a portion of problems requiring mathematical computation (like physics and finance problems which require solving some equations; Sprague et al., 2024). It is difficult to distinguish whether performance improvements indicate generalization or the improvements are still about mathematical abilities.

Furthermore, the paper's main claim that "continual pretraining generally improves non-math reasoning" (line 327) is drawn from improvements in 3 out of 5 domains. If we exclude STEM reasoning, this claim becomes questionable. Even considering all domains, showing improvement in 3 out of 5 areas probably does not suggest "significant" generalization, especially given the large performance degradation in commonsense and agent reasoning tasks.

Additionally, the paper could benefit from more in-depth analysis. While it presents performance numbers and draws broad conclusions, it does not provide more in-depth insights. For instance, if the paper claims major improvements on specific tasks, it could analyze which specific types of data points show improvement and provide potential explanations for these gains.

[Sprague et al., 2024]: To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning

Questions

See weakness

Comment

Thanks for your detailed comments and valuable suggestions. We address the questions below:

Q1: The categorization of non-math datasets and the interpretation of results might be problematic.

A1: This is a valid point and we appreciate your suggestion on the organization of the results! We acknowledge the reviewer’s concern about treating STEM reasoning as a distinct category. To address this concern, we have taken the following steps to clarify the distinctions between MMLU-stem and math reasoning:

  • Excluding Math-Related Subjects in MMLU-stem: To better distinguish STEM reasoning from math reasoning, we excluded math-related datasets within MMLU-stem and recalculated the evaluation results. These updated results are presented in the revised Table 3, showing that our conclusions remain consistent even after this adjustment.
  • Measuring Question Similarity: To further differentiate MMLU-STEM from math reasoning, we computed the minimum edit distance between each question in GSM8K/MATH and MMLU-STEM. Following [1], we regard two examples as similar when the minimum edit distance is smaller than 5% of the example length. Notably, is_similarity is False for 100% of MMLU-STEM samples, indicating very low similarity between MMLU-STEM and GSM8K questions. This further supports that MMLU-STEM encompasses broader reasoning skills distinct from math reasoning.

We also acknowledge that performing a discrete categorization may be both challenging and unnecessary. Instead, we have added Figure 2 (and Lines 371–398) in the revised paper, where we report results on all benchmarks, ranked by the query similarity between the benchmark data and GSM8K/MATH. Using DocMath in Figure 2 as a threshold to separate math and non-math tasks, we observe that continual pretraining and instruction pretraining consistently outperform instruction tuning that uses Math-COT SFT data. We also find that continual pretraining with DeepSeekMath generally provides greater improvements on non-math tasks compared to MAmmoTH2, although both occasionally result in performance drops.
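As an illustration of how such a similarity ranking could be computed, a minimal sketch follows; the embedding model and the max-then-mean aggregation are assumptions and not necessarily the procedure behind Figure 2.

```python
# Illustrative sketch of ranking benchmarks by query similarity to GSM8K/MATH.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def rank_benchmarks_by_math_similarity(math_queries, benchmark_queries):
    """math_queries: list of GSM8K/MATH questions.
    benchmark_queries: dict mapping a benchmark name to a list of its questions."""
    math_emb = model.encode(math_queries, normalize_embeddings=True)
    scores = {}
    for name, queries in benchmark_queries.items():
        q_emb = model.encode(queries, normalize_embeddings=True)
        sims = q_emb @ math_emb.T          # cosine similarities (unit-norm vectors)
        scores[name] = float(sims.max(axis=1).mean())
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```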

Q2: Furthermore, the paper's main claim that "continual pretraining generally improves non-math reasoning" is drawn from improvements in 3 out of 5 domains….. Even considering all domains, showing improvement in 3 out of 5 areas probably does not suggest "significant" generalization, especially given the large performance degradation in commonsense and agent reasoning tasks.

A2: Thank you for the suggestion! We agree that the conclusions need to be expressed more precisely and carefully. Additionally, as pointed out by the reviewer, the conclusions also depend on how we categorize the domains. Therefore, as discussed in the response to Q1, we believe it is clearer to directly present the results across all benchmarks without relying on subjective categorization. Accordingly, we have revised the conclusions to state: There are no approaches that demonstrate consistent generalization for non-mathematical tasks, while continual pretraining and instruction pretraining show better generalization compared to instruction tuning. Between continual pretraining and instruction pretraining, continual pretraining generally achieves higher gains than instruction pretraining when it is effective, but both approaches result in performance drops on some benchmarks. We have updated the expression of these conclusions in multiple sections of the paper to ensure clarity and precision.

Q3: While it presents performance numbers and draws broad conclusions, it does not provide more in-depth insights. For instance, if the paper claims major improvements on specific tasks, it could analyze which specific types of data points show improvement and provide potential explanations for these gains.

A3: Thank you for the suggestion! We have added two analyses to help readers understand some of the phenomena:

  • We have analyzed the query distribution of each benchmark and training dataset, as shown in Figure 4 in the revision. Our findings indicate that WebInstruct (the instruction pretraining dataset used for MAmmoTH2) and OpenWebMath (an example continual pretraining dataset) exhibit more overlap with many benchmarks, compared with MetaMath (the instruction tuning dataset), which mainly overlaps with the math benchmarks. As such, we believe that data diversity contributes to the effectiveness on generalization.
  • In the revised version of the paper, we have also provided a clearer organization of the results in Figure 2 (Section 3.3), where we demonstrate that, in general, when benchmark queries are more similar to GSM8K/MATH queries, the overall improvements tend to be higher. This observation helps to explain some of the varied patterns observed across different benchmarks.

[1] Zhang G, et al. Map-neo: Highly capable and transparent bilingual large language model series[J]. arXiv preprint arXiv:2405.19327, 2024.

Comment

Thank you for your response and updates, which make the points clearer. However, I still have some concerns.

Regarding Q1: The categorization of non-math datasets and the interpretation of results might be problematic.

I appreciate the authors' efforts on removing the math-related subjects, but my point is not about the math-related ones at the surface level. MMLU-STEM has many subfields, like statistics and machine learning, which involve arithmetic calculation (https://huggingface.co/datasets/TIGER-Lab/MMLU-STEM/viewer/default/test). I think it is fundamentally hard to ablate math from this dataset.

Also, I don't think STEM reasoning is a standalone reasoning problem type. Arguably, it is not as clear and pure as math or logical reasoning.

Regarding the results and claims

I appreciate the authors' effort in making the interpretation more precise, but I feel the current set of results somewhat lacks clear trends/takeaways, especially given the unclear categorization.

To this end, I am inclined to keep my score unchanged.

Review (Rating: 3)

This paper investigates the generalization of learning mathematical problem-solving for LLMs. It explores three training strategies: continual pretraining on mathematical text, instruction pretraining on diverse QA pairs, and instruction tuning on MPS datasets. Through experiments across seven reasoning domains, it finds that continual pretraining on raw mathematical text can improve performance on most non-MPS reasoning tasks, while other methods have limited generalization. It also conducts a pilot study on non-MPS SFT datasets with less promising results.

Strengths

  1. The experimental results suggest only continual pretraining on MPS data is useful for general reasoning tasks. This conclusion may be valuable to those who want to improve the reasoning capability of LLMs.
  2. The experiments are well designed.
  3. The experiments are abundant.

Weaknesses

  1. This work lacks novelty. It only conducts several experiments and arrives at a conclusion that aligns with my assumption.
  2. The result analysis is not deep enough. Readers may want to know why only continual pretraining works. Are there any clues from the mechanistic interpretability perspective on why some instruction pretraining works while some does not, and why all instruction tuning fails?

Questions

Deeper analysis is needed to see whether there are any insightful findings.

Comment

We appreciate your thoughtful comments. Below, we address the key points raised across the reviews.

W1: This work lacks novelty. It only conducts several experiments and arrives at a conclusion that aligns with my assumption.

A1: We would be grateful if the reviewer could share references to other works that have drawn similar conclusions to substantiate the comment, which would help us refine our research.

W2: The result analysis is not deep enough. Are there any clues from the mechanistic interpretability perspective on why some instruction pretraining works while some does not, and why all instruction tuning fails?

A2: This is a good point. We think the main difference among continual pretraining, instruction pretraining, and instruction tuning as studied here lies in the corresponding datasets rather than the specific training methods: all three methods use the same next-token prediction loss. To respond to the reviewer's concern, we have analyzed the query distribution of each benchmark and training dataset, as shown in Figure 4 in the revised PDF. Our findings indicate that WebInstruct (the instruction pretraining dataset used for MAmmoTH2) and OpenWebMath (an example continual pretraining dataset) exhibit more overlap with many benchmarks, compared with MetaMath (the instruction tuning dataset), which mainly overlaps with the math benchmarks. As such, we think this additional analysis helps explain why all instruction tuning fails.
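As a concrete illustration of the shared objective, a minimal sketch of a standard causal-LM loss is given below; this is generic training code, not the authors' implementation.

```python
# Illustrative sketch (assumption: standard causal-LM objective): continual
# pretraining, instruction pretraining, and instruction tuning all minimize the
# same next-token prediction loss; only the training text differs.
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab_size); input_ids: (batch, seq_len)
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()
    return F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)),
                           shift_labels.view(-1))
```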

Comment

Thanks for your response. For 1, the conclusion "continual pretraining and instruction pretraining outperform instruction tuning, with continual pretraining often yielding greater gains when effective" seems to be a very common assumption for most of the audience. For 2, the data analysis makes sense to me. Overall, I am inclined to keep my score unchanged.

Review (Rating: 5)

This paper determines whether training on step-by-step mathematical explanations can improve a model’s general problem-solving ability. Specifically, the authors test three different training strategies: continual pretraining, instruction pretraining, and instruction tuning with different math problem-solving datasets. For all three methods, the authors consider a two-stage training approach, where they first train a model with a math dataset and then train with UltraChat. Additionally, for the third approach, the authors mix the math dataset with UltraChat and also benchmark the performance. They find that although these training methods improve performance in mathematical reasoning, only continual pre-training with mathematical texts can enhance a model’s general problem-solving ability. The authors then go on to examine whether training on other non-mathematical problem-solving datasets could perhaps improve a model’s general reasoning ability but demonstrate that these too fall short.

Strengths

I found the paper to have strong clarity and the authors have chosen an interesting problem of trying to examine which datasets and training approaches could improve an LLM’s general reasoning ability. From this, the authors did a thorough job of examining the effect of different training strategies with different datasets. They did an exhaustive job of benchmarking various mathematical and non-mathematical reasoning datasets with the post-trained LLMs. Because of this, all of the claims are very well-supported.

I also liked how the authors examined the effect of training on other step-by-step explanations of datasets and determined if that had any effect on improving the non-mathematical capability. This showed that this is truly a difficult problem to solve.

Weaknesses

The major weakness in this paper is that they only attempt pretraining and finetuning-based approaches and that too with only two models. I am not sure if this is because the authors had limited computational resources, but if not, it would be interesting to examine the performance of LLMs with varying mathematical/reasoning capabilities and not just Mistral-7B and DeepSeek-Coder.

I also think that the authors should try to examine methods beyond just continual pretraining and fine-tuning. For example, they should try reinforcement learning approaches as well which have demonstrated improved reasoning capabilities with models such as GPT-o1-preview.

Finally, this paper is mostly focused on “correlations” as opposed to the “causations” as to why certain training approaches and datasets succeed or fail at improving a model’s generalizability when trained on mathematical problem-solving datasets. For this paper to have a greater impact, the authors should elaborate on the existing discussions as to why certain training methods and datasets do not improve an LLM’s general reasoning ability.

Questions

Based on the limitations section, I have the following questions:

Are the findings from your paper model agnostic? If not, what kinds of models benefit from this sort of training, and with what sorts of datasets?
What kinds of other training-based approaches could be done to improve the score? If you look at Figure 2 of this paper: https://arxiv.org/pdf/2401.00812, the authors have a more fine-grained taxonomy of the specific skills that training with code has helped LLMs with. Hence, I was wondering whether there is any correlation to specific skills (i.e., task decomposition/planning) that an LLM has after continual pre-training, which is why the model shows an improvement in general reasoning tasks? You might have to manually inspect the LLM responses for this.
Can you explain further why certain training methods with their corresponding datasets succeed or fail at improving an LLM's general reasoning ability?

Comment

We appreciate your thoughtful comments. Below, we address the key points raised across the reviews.

W1: They only attempt pretraining and finetuning-based approaches and that too with only two models.

A1: We were indeed constrained by computational resources, particularly for the continual pretraining and instruction pretraining models. To address model variety, we conducted experiments with two math SFT datasets on Llama 3.1-8B. The results (shown in Section 3.3) confirm our findings: math-specific SFT improves performance on MPS benchmarks but fails to generalize to broader reasoning tasks.

| Model | MPS | MR | Logical | STEM | CS | Symbolic | Agent |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Llama 3.1-8B | 49.3 | 31.8 | 25.4 | 44.5 | 54.6 | 57.5 | 38.8 |
| Math-COT SFT (Llama 3.1) | 50.3 | 23.6 | 24.0 | 40.8 | 52.7 | 55.8 | 41.1 |
| Math-POT SFT (Llama 3.1) | 51.3 | 20.3 | 24.1 | 43.8 | 52.5 | 57.4 | 39.5 |

MR: Math (excluding problem-solving). CS: Commonsense.

W2: The authors should try to examine methods beyond just continual pretraining and fine-tuning. For example, they should try reinforcement learning approaches

A2: This is a valid point. While RL methods could provide more insights, the pipeline for mathematical reasoning is still under development in the open-source community, with numerous loss variants (e.g., PPO [1], DPO [2], GRPO [3]) and debates around training methods and reward models [4, 5]. We therefore think that the generalization of RL training warrants a separate, focused study, and this area is beyond the scope of our work, which primarily focuses on the most commonly used methods for MPS.

W3: This paper is mostly focused on “correlations” as opposed to the “causations” as to why certain training approaches and datasets succeed or fail at improving a model’s generalizability

A3: The observation about correlations versus causations is well-taken. In response, we conducted an embedding analysis (Figure 4) of training datasets and benchmarks, showing that WebInstruct (for instruction pretraining) and OpenWebMath (for continual pretraining) have greater overlap with diverse benchmarks than MetaMath (for instruction tuning), highlighting the role of topical diversity in generalization. Additionally, Figure 2 (Section 3.3) shows that benchmarks with queries more similar to GSM8K/MATH tend to see greater improvements, helping to explain varied patterns across benchmarks. We hope these analyses provide some “causations” on why certain training datasets succeed or fail across benchmarks.

Q1: Are the findings from your paper model agnostic? If not, what kinds of models benefit from this sort of training, and with what sorts of datasets?

A1: Fine-grained generalization depends on model architecture and weights; however, we think the high-level findings hold. That said, our findings are not entirely model agnostic. Experiments show that models like Mistral-7B and Llama 3.1-8B respond differently to math SFT training, and compared to a general-purpose model like Mistral-7B, specialized models like DeepSeek-Coder benefit more broadly from math-specific datasets in STEM reasoning. This suggests that generalization also depends on the alignment between the model's architecture and pretraining corpus.

Q2: What kinds of other training-based approaches could be done to improve the score?

A2: Incorporating RL or hybrid methods may further enhance general reasoning abilities, and we see great potential for exploring these in future work.

Q3: Is there any correlation to the specific skills that an LLM has after continual pretraining that explains why the model shows an improvement in general reasoning tasks? Can you explain further why certain training methods with their corresponding datasets succeed or fail at improving an LLM's general reasoning ability?

A3: We think the main difference between the three training paradigms lies in the datasets used rather than the training methods, as all three employ the same next-token prediction loss. To avoid repetition, we refer to our response to W3, where we highlighted the role of data diversity in enabling generalization and the correlation between query similarity to GSM8K/MATH and benchmark performance improvements.

[1] Schulman J, et al. Proximal policy optimization algorithms[J]. arXiv preprint arXiv:1707.06347, 2017.

[2] Rafailov R, et al. Direct preference optimization: Your language model is secretly a reward model[J]. Advances in Neural Information Processing Systems, 2024, 36.

[3] Shao Z, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models[J]. arXiv preprint arXiv:2402.03300, 2024.

[4] Tang, Yunhao, et al. "Understanding the performance gap between online and offline alignment algorithms." arXiv preprint arXiv:2405.08448, 2024.

[5] Pan, Sarah, et al. "Let's Reinforce Step by Step." arXiv preprint arXiv:2311.05821, 2023.

Comment

We thank all the reviewers for their comments! We have revised the PDF to reflect the reviewers' comments (the revised contents are highlighted), and responded to each reviewer separately in the respective thread. Here we summarize the main revisions of the manuscript.

  1. We conducted additional experiments using two types of math-related SFT datasets on Llama 3.1-8B. The results are updated in section 3.3 (Reviewer 8XNX).
  2. We conducted a benchmark-level analysis to examine the relative changes in performance across tasks for Math-COT SFT, MAmmoTH2, and DeepSeekMath. This result excludes the influence of subjective task categorization, providing a clearer and more objective view of the performance (Figure 2; Reviewers 8XNX and xCdK).
  3. We analyzed the query distribution of the math-related training datasets used for the three training paradigms and all benchmarks, providing insights into dataset overlap and its impact on generalization (Figure 4; Reviewers 8XNX, t1uJ, xCdK, and oTYK).
Comment

Dear Reviewers,

Thank you for your efforts reviewing this paper. Can you please check the authors' response and see if your concerns have been addressed? Please acknowledge you have read their responses. Thank you!

AC Meta-Review

Summary:

This paper investigates the "math for AI" hypothesis, which posits that training on mathematical reasoning data could enhance LLMs' general reasoning capabilities. It conducts experiments using three training strategies on different datasets: (1) continual pre-training on mathematical text, (2) instruction pre-training on diverse QA pairs, and (3) instruction tuning on mathematical problem-solving datasets. It evaluates each strategy’s generalization to multiple “non-mathematical” reasoning tasks (note there are some questions regarding the categorization of the tasks). One key finding is that no approaches demonstrate consistent generalization for non-mathematical tasks, while continual pretraining and instruction pretraining show better generalization compared to instruction tuning.

Strengths:

  1. Reviewers generally agree that it is an interesting and timely study that tries to examine which datasets and training approaches could improve an LLM’s general reasoning ability.

  2. Reviewers also think that the paper has conducted comprehensive experiments to examine the effect of different training strategies with different datasets.

Weaknesses:

A major issue pointed out in nearly all reviews, in one way or another, is the lack of deep insights, e.g., why certain results are observed and how to further improve general reasoning abilities. It would greatly improve the paper if the authors could expand discussions on why certain training methods and datasets do not improve an LLM's general reasoning ability. Also, the conclusions and take-away messages on dataset diversity and alignment with downstream tasks do not provide insights that are novel or interesting enough.

Additional Comments from the Reviewer Discussion

All comments in the original reviews that were already addressed by the authors during the rebuttal period did not impact my final decision.

Here is a summary of (partly) resolved comments. The standing issues after the rebuttal period are summarized as weaknesses in the previous field.

  1. Reviewer 8XNX mentioned the use of only two models as one limitation, and the authors provided the results of one additional model and explained the reason for choosing these models during the rebuttal period.

  2. Reviewer 8XNX suggested experimenting with RL, which the authors believe is out of scope in this work.

  3. Reviewer xCdK mentioned the categorization of tasks is problematic, and the authors provided explanations and further experiments, which I think only partly resolve the original comment.

Final Decision

Reject