PaperHub
7.3 / 10
Poster · 4 reviewers
Ratings: 4, 5, 4, 5 (min 4, max 5, std 0.5)
Confidence: 3.8
Novelty: 3.0 · Quality: 3.0 · Clarity: 3.0 · Significance: 3.0
NeurIPS 2025

Atomic Thinking of LLMs: Decoupling and Exploring Mathematical Reasoning Abilities

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29
TL;DR

We have decoupled the math atomic capabilities of large language models and explored their interaction relationships in mathematical reasoning tasks.

Abstract

Keywords
Large Language Models · Mathematical Reasoning · Atomic Thinking

Reviews and Discussion

Review (Rating: 4)

This study creates a benchmark that investigates LLMs' atomic-level math reasoning performance by dividing the capability into fields (algebra, geometry, analysis, and topology) and logical reasoning abilities (conceptual understanding, forward reasoning with formal language, and backward reasoning with counterexamples). The evaluation of a wide range of math LLMs reveals several insights, such as the strengths and weaknesses of each LLM and the knowledge transfer characteristics between fields and reasoning types.

Strengths and Weaknesses

It is a novel approach to categorize math capability into sub-types, create the benchmark, and evaluate each LLM's performance. In addition, the evaluation results show several insights that will provide important clues for improving the performance of mathematical LLMs in the future.

However, there are some weaknesses and points to be improved.

  • While the findings are interesting, they are not surprising. For example, we often find knowledge transfer between closely related fields (e.g., https://neurips.cc/virtual/2024/poster/93733). In addition, LLMs are not good at geometry because they cannot process visual information.
  • In the training for atomic capability interaction, only one model (Qwen2.5-Math-Instruct-7B) is used for the experiment. At least two models are needed to establish the universality of the findings.
  • VLMs are not included within the scope of the analysis. In particular, geometry needs visual information to fully grasp the problems.
  • The Supplementary zip file could not be unzipped due to some errors.

Questions

See "Strengths And Weaknesses".

Limitations

yes

Final Justification

Although there are some concerns (e.g., VLMs are not a target of the analysis for geometry, and major open/closed-source LLMs are not used for the atomic capability interaction experiments), I think that the core motivation and the current results are sufficient for borderline acceptance.

Formatting Issues

No

Author Response

Response to Reviewer uVY9

We greatly value your feedback and the important points you raised. We hope our explanations have sufficiently clarified the concerns you mentioned.

While the findings are interesting, they are not surprising. For example, we often find knowledge transfer between closely related fields. In addition, LLMs are not good at geometry because they cannot process visual information.

Thank you very much for your valuable suggestion. We carefully reviewed the paper you referenced and compared it with our own work. While both studies explore the knowledge transfer capabilities of LLMs across different fields, it is important to highlight that our research is not solely focused on transfer learning.

The core motivation and contribution of our work lies in proposing a new paradigm of atomic thinking for mathematical models, and in investigating the disentanglement of distinct atomic abilities as well as their interaction effects. Our findings go beyond field similarity-based transfer; we also observed phenomena such as complementary knowledge transfer between disparate fields, catastrophic forgetting of higher-level field knowledge, and the fundamental role of conceptual understanding as a foundational ability that supports reasoning in all other fields. These insights offer valuable inspiration for future training and inference strategies in mathematical reasoning models. Regarding knowledge in the geometry field, we plan to further explore how richer multimodal information (e.g., visual or spatial inputs) may enhance model performance in mathematical reasoning.

In the training for atomic capability interaction, only one model (Qwen2.5-Math-Instruct-7B) is used for the experiment. At least two models are needed to establish the universality of the findings.

We deeply appreciate your valuable suggestion. The Qwen series includes a variety of open-source models across different parameter scales, offering both base and instruct versions, which makes it particularly suitable for research purposes. Among them, Qwen2.5-math-instruct demonstrated excellent performance in our atomic ability disentanglement experiments and thus became a focal model for our further investigation.

Due to time and resource constraints, we regret that we were unable to extend this discovery to a broader set of models. In response to your suggestion, we have now included InternLM2-math-plus-7B as an additional baseline for training and evaluation, aiming to further strengthen our claims. Given the limited time during the rebuttal period, we selected only a few key training and evaluation scenarios. The results, shown in the table below, are consistent with our observations on Qwen2.5-math-instruct and provide a strong complement to our previous findings.

Model | Algebra (Low) | Algebra (High) | Analysis (Low) | Analysis (High) | Geometry (Low) | Geometry (High) | Topology (Low) | Topology (High)
InternLM2-math-plus-7b | 49.2 | 35.9 | 33.0 | 31.4 | 41.9 | 41.5 | 27.2 | 37.0
InternLM2-train-Algebra | 48.1 (-1.1) | 37.8 (+1.9) | 35.8 (+2.8) | 33.6 (+2.2) | 43.1 (+1.2) | 42.1 (+0.6) | 27.8 (+0.6) | 37.9 (+0.9)
InternLM2-train-Geometry | 48.6 (-0.6) | 36.3 (+0.4) | 32.2 (-0.8) | 30.5 (-0.9) | 44.3 (+2.4) | 44.0 (+2.5) | 28.7 (+1.5) | 38.8 (+1.8)

VLMs are not included within the scope of the analysis. In particular, geometry needs visual information to fully grasp the problems.

We sincerely appreciate your suggestion. In this work, our primary focus was to propose a novel paradigm centered on mathematical atomic abilities, conduct systematic evaluations of atomic ability disentanglement, and analyze interaction effects among abilities across a range of mathematical models. We believe these insights offer meaningful guidance for future model development and training strategies. As for visual understanding, we agree that it can be regarded as a new atomic ability relevant to mathematical problem-solving. We are incorporating this perspective into our future work, with the goal of further enriching and expanding the taxonomy of atomic abilities. We warmly welcome more researchers to engage with and contribute to this ongoing exploration, and together advance the development of mathematical reasoning in large language models.

The Supplementary zip file could not be unzipped due to some errors.

We are sorry to hear about the issue you encountered. We have repeatedly tested the supplementary file and confirmed that it can be downloaded and opened successfully. We suspect this may be related to an OpenReview system issue. Nonetheless, we promise that we will open-source relevant data and code, and actively encourage broader participation in this line of research.

Comment

Thank you for the authors' response.

Although there are some concerns (e.g., VLMs are not a target of the analysis for geometry, and major open/closed-source LLMs are not used for the atomic capability interaction experiments), I think that the core motivation and the current results are sufficient for borderline acceptance.

Comment

We sincerely appreciate your thoughtful feedback and suggestions. We would like to further clarify our approach in the hope of addressing your concerns.

First, we agree that exploring VLMs, which may have stronger capabilities in geometric reasoning, is a promising and valuable direction. While this is beyond the immediate scope of our current work, it is an active area of our future research, and we look forward to further discussions and inspiration from the community in this regard.

In addition, in this study, we have focused on widely recognized and influential open-source model families, Qwen and InternLM, to investigate atomic capability interactions. Our experiments have yielded rich analyses and insightful findings, which we believe offer transferable insights for training strategies across other models.

Finally, we commit to incorporating all clarifications, revisions, and additional experimental results discussed during the rebuttal phase into the final camera-ready version, ensuring the completeness and quality of the paper. Thank you again for your constructive comments.

Review (Rating: 5)

The paper presents an approach that focuses on fundamental atomic capability units to enhance mathematical reasoning in LLMs. The authors study field-specific abilities across four mathematical fields and logical abilities at different reasoning levels. The authors also develop training and evaluation datasets for each of the atomic mathematical units. The paper presents an extensive comparative analysis of different types of math reasoning LLMs and the supervised models the authors developed, taking various factors into consideration.

Strengths and Weaknesses

The paper proposes the paradigm of atomic thinking, which is intertwined with the cognitive abilities humans use while solving mathematical problems. Although current LLMs owe their mathematical reasoning capabilities to large-scale training data and implicit step-by-step supervision, this paper breaks the reasoning task into two levels: field atomic units and logical reasoning based on three core reasoning capabilities. The coverage of experiments is vast, and the paper analyzes the performance of the models in terms of field and level of difficulty. The paper also discusses transfer capabilities across fields and provides empirical evidence for them. Training and evaluation datasets for each of the atomic capability units have been developed, which is a significant contribution to the field.

The authors do not mention the criteria for selecting the baseline LLMs. The construction of the dataset is not fully explained, which makes things unclear. It is not clear what the authors mean by Level 1 and Level 2 problems. Although the paper relies on automatic evaluation metrics, a thorough human evaluation is required, as a math problem can be solved in multiple ways using different kinds of mathematical axioms and concepts. It should be mentioned how the difficulty level is determined for a problem.

Questions

  1. The authors should mention the criteria for selecting the baseline LLMs, both the open-source and commercial ones.
  2. The data construction process should be detailed with examples in the appendix.
  3. Human evaluation should be conducted for this task.

Limitations

yes

Final Justification

The authors have satisfactorily answered my queries regarding the choice of baseline LLMs, and problem types based on difficulty levels.

Formatting Issues

No

Author Response

Response to Reviewer 6cMW

Thank you very much for your feedback and valuable suggestions. We hope our response adequately addresses your concerns.

Response to weaknesses

The authors do not mention the criteria for selecting the baseline LLMs.

As noted on Page 5, Lines 180–182, we have explained the rationale behind our model selection. Specifically, we chose proprietary models with superior training and architectural designs for mathematical tasks, and evaluated them alongside state-of-the-art models currently available. We also took into account models of varying parameter scales, considering both those widely adopted in industry and academia, to enable a more comprehensive assessment of the atomic abilities of current mathematical models.

The construction of the dataset is not fully explained which makes things unclear.

We apologize for the confusion. During the data construction phase, we first gathered a large volume of mathematical test data to support robust evaluation. For field-specific data, we primarily adopted a classification-based annotation strategy. We re-mapped the original subfield labels from the collected datasets to our four-field schema (algebra, geometry, analysis, topology). For data without clear subfield annotations or with ambiguous labels, we designed a keyword-based matching strategy, further supplemented by LLM-as-judge to classify the data into the appropriate field.
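For illustration, the sketch below shows a minimal version of this kind of routing step: re-map existing subfield labels, fall back to keyword matching, and defer ambiguous cases to an LLM judge. The keyword lists, the re-mapping table, and the llm_judge_field helper are hypothetical placeholders for this sketch, not the pipeline actually used in the paper.

```python
# Minimal sketch of the field-annotation routing described above. Keyword lists,
# the subfield re-mapping table, and llm_judge_field() are hypothetical
# placeholders, not the authors' actual implementation.

SUBFIELD_TO_FIELD = {"linear algebra": "algebra", "calculus": "analysis",
                     "plane geometry": "geometry", "point-set topology": "topology"}

FIELD_KEYWORDS = {
    "algebra":  ["polynomial", "equation", "matrix", "group", "ring"],
    "geometry": ["triangle", "circle", "angle", "area", "volume"],
    "analysis": ["limit", "derivative", "integral", "series", "continuity"],
    "topology": ["open set", "compact", "homeomorphism", "connected", "metric space"],
}

def llm_judge_field(problem: str) -> str:
    """Placeholder for an LLM-as-judge call (assumed interface, not specified in the paper)."""
    raise NotImplementedError

def classify_field(problem: str, original_label: str = "") -> str:
    # 1) Re-map clear subfield labels from the source datasets.
    if original_label.lower() in SUBFIELD_TO_FIELD:
        return SUBFIELD_TO_FIELD[original_label.lower()]
    # 2) Keyword-based matching for unlabeled or ambiguous items.
    text = problem.lower()
    scores = {f: sum(kw in text for kw in kws) for f, kws in FIELD_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    if scores[best] > 0:
        return best
    # 3) Remaining ambiguous cases go to the LLM judge.
    return llm_judge_field(problem)
```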

In contrast, the construction process for logical reasoning ability data followed a different approach. For conceptual understanding, we created cloze-style questions by masking keywords in natural language descriptions of mathematical definitions, theorems, and formulas. For forward reasoning, we focused on multi-step reasoning tasks involving formalized mathematical language. We collected mathematical problems originally expressed or annotated in formal language and used formal language transformation tools to convert natural language inputs into formal representations for reasoning. For backward reasoning, we collected a set of mathematical propositions and judgments that are particularly suitable for reasoning via counterexamples or proof by contradiction, and curated them with corresponding reasoning steps and justifications.
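To make the conceptual-understanding construction concrete, here is a minimal sketch of turning a definition into a cloze question by masking a chosen keyword; the masking rule and the sample definition are assumptions for illustration, not the paper's exact procedure.

```python
# Illustrative sketch of cloze-question construction for conceptual understanding:
# mask a chosen keyword in a natural-language definition. The masking rule and the
# sample definition are assumptions, not the authors' exact procedure.
import re

def make_cloze(definition: str, keyword: str) -> dict:
    """Replace each occurrence of `keyword` with a blank and keep it as the answer."""
    question = re.sub(re.escape(keyword), "_____", definition, flags=re.IGNORECASE)
    return {"question": question, "answer": keyword}

item = make_cloze(
    "A mapping whose inverse relation is itself a mapping is called an inverse mapping.",
    "inverse mapping",
)
# item["question"] == "A mapping whose inverse relation is itself a mapping is called an _____."
```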

It is not clear what the authors mean by Level 1 and Level 2 problems.

We apologize for any confusion caused by our task difficulty labeling. The Level 1 and Level 2 distinctions correspond to low-difficulty and high-difficulty tasks within each field-level atomic ability, respectively.

Although the paper relies on automatic evaluation metrics, a thorough human evaluation is required as a math problem can be solved in multiple ways using different kinds of mathematical axioms and concepts.

Thank you for raising this point.

We did conduct a human evaluation on a sample of the data during the initial phase of our experiments in order to verify the accuracy of our automatic evaluation metrics. To do this, we selected two models, Gemini and Qwen, and sampled 20 problems from each dataset for manual review. Furthermore, since the majority of the problems are calculation-based, the final answers have a relatively fixed format. This, combined with our constraint that the answer must be in the \boxed{} format, greatly simplified the validation process and made it easier to check for accuracy.
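For reference, a simplified sketch of this kind of boxed-answer check is shown below; it is our own minimal approximation, not the authors' actual evaluation script, and it does not handle nested braces inside \boxed{...}.

```python
# Simplified sketch of boxed-answer checking (an approximation, not the authors'
# evaluator): extract the last \boxed{...} from a model output and compare it to
# the reference after light normalization. Nested braces are not handled.
import re

def extract_boxed(output: str):
    matches = re.findall(r"\\boxed\{([^{}]*)\}", output)
    return matches[-1].strip() if matches else None

def is_correct(output: str, reference: str) -> bool:
    normalize = lambda s: s.strip().rstrip(".").replace(" ", "")
    pred = extract_boxed(output)
    return pred is not None and normalize(pred) == normalize(reference)

print(is_correct(r"... so it must be \boxed{24} exactly.", "24"))  # True
```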

The table below shows the number of cases (out of 20) where the automatic evaluation aligned with the human evaluation for each model and dataset:

Model | Algebra-l1 | Algebra-l2 | Geometry-l1 | Geometry-l2 | Analysis-l1 | Analysis-l2 | Topology-l1 | Topology-l2 | Attr. | Def. | Forward Rea.
Qwen2.5-Math-Instruct-7B | 20/20 | 20/20 | 20/20 | 20/20 | 20/20 | 20/20 | 19/20 | 19/20 | 18/20 | 20/20 | 20/20
Gemini2.5-pro | 20/20 | 20/20 | 20/20 | 20/20 | 19/20 | 20/20 | 20/20 | 20/20 | 19/20 | 20/20 | 20/20

It should be mentioned how the difficulty level is determined for a problem.

The difficulty assignment was based primarily on the source of the data (e.g., middle school vs. university-level problems) and any original difficulty tags associated with the dataset. For example, a problem originating from elementary school math would naturally be categorized as low-difficulty. In many cases, our collected datasets included original difficulty labels, which we also used as a reference to guide our difficulty classification.

Response to Questions

The authors should mention the criteria of selection of the baseline LLMs both the open-source and commercial ones.

Thank you for your thoughtful suggestions. We have further elaborated on our baseline selection criteria in the response to Weakness 1, and we will provide additional clarification in the revised version of the paper to help readers better understand our choices.

The details of the data construction should be detailed with examples in the appendix.

We also appreciate your feedback on the data construction process. As discussed in the response to Weakness 2, we have now included additional details regarding our data annotation pipeline. We provide the data and code in our supplementary materials for review. Below, we present several example data entries (in simplified form) for illustration. We plan to include more comprehensive data samples in the Appendix of the revised version. Furthermore, we commit to open-sourcing our dataset to facilitate better transparency and encourage broader participation and research based on our work.

Example 1

Atom: Field Atom Ability
Task: Algebra (Level 1)
Question: "What is the greatest common divisor of all of the members of the set containing all numbers that are the product of four consecutive positive integers?"
Solution: "These numbers are all of the form $n(n+1)(n+2)(n+3)$. Taken $\pmod 4$, there will be one number of each residue, so one of the numbers will be divisible by 2 and another divisible by 4, leaving the product divisible by 8. Similarly, one of the numbers will be $0 \pmod 3$, so the product will be divisible by 3. The GCD must therefore be divisible by $3 \cdot 8 = 24$. It also must be less than or equal to the smallest number in the set, $1 \cdot 2 \cdot 3 \cdot 4 = 24$, so it must be $\boxed{24}$ exactly."
Final_answer: "24"

Example 2

Atom: Logic Atom Ability
Task: Conceptual Understanding
Question: "Let $f: S \to T$ be a mapping. Let $f^{-1} \subseteq T \times S$ be the inverse of $f$: $f^{-1} := \{(t, s) : f(s) = t\}$. Let $f^{-1}$ itself be a mapping: $\forall y \in T: (y, x_1) \in f^{-1} \land (y, x_2) \in f^{-1} \implies x_1 = x_2$ and $\forall y \in T: \exists x \in S: (y, x) \in f$. Then $f^{-1}$ is called the ''' '''."
Solution: "Definition: Inverse Mapping"
Final_answer: "inverse mapping of $f$"

Human evaluation should be conducted for this task.

Thank you for your suggestion! We have provided the supplementary experimental results above. We promise to add this discussion in our revised manuscripts.

Comment

Thanks for the detailed response to the issues I raised. If the responses are included in the paper, it will make the paper more sound.

Comment

We sincerely thank you for the valuable feedback and recognition, which are greatly encouraging to us. We are committed to incorporating all clarifications and revisions discussed during the rebuttal phase into the final camera-ready version of the paper to enhance its soundness.

Review (Rating: 4)

This paper introduces "atomic thinking," a new paradigm for analyzing the mathematical reasoning abilities of Large Language Models (LLMs). The authors argue that current models, often trained on long and complex problems, may be memorizing reasoning chains rather than truly understanding fundamental concepts. To address this, they propose decoupling mathematical ability into two dimensions: field-specific capabilities (Algebra, Geometry, Analysis, Topology) and logical capabilities (conceptual understanding, forward reasoning with formal language, and backward reasoning with counterexamples). The study finds that models generally excel in algebra and analysis but struggle with geometry and topology. Furthermore, it highlights that strong conceptual understanding significantly boosts other reasoning abilities, suggesting a shift in training strategy from "question drilling" towards a more foundational, concept-based approach.

Strengths and Weaknesses

Strengths

  • Clarity: The paper is well-written and organized. The core idea of "atomic thinking" is introduced and motivated clearly, and Figure 1 provides an excellent visual summary of the paper's entire scope and key findings. The research questions are explicitly stated in the introduction and systematically answered in the analysis sections, making the paper easy to follow and understand.

  • Significance & Originality: In my opinion, the paper's primary strength lies in its novel and insightful framing of LLM mathematical reasoning. As we see models achieve near-superhuman scores on some benchmarks, it's becoming crucial to ask how they are solving these problems. This paper provides a structured methodology for doing just that. Moving the focus from "can it solve the problem?" to "what fundamental skills does it possess?" is a significant step forward.

  • Quality: The empirical evaluation is comprehensive and of high quality. The authors benchmark a diverse and modern set of LLMs, including powerful proprietary models like OpenAI o1, Deepseek-R1, and Gemini 2.5-pro, which lends credibility and relevance to their findings. The methodology for creating the datasets is systematic, and the analysis of the results is thorough, supported by clear tables and illustrative case studies.

Weaknesses

  • While the proposed decoupling is a major strength, the choice of what constitutes an "atom" is debatable. Fields like "Algebra" or "Geometry" are vast and could be considered composites of many smaller, more fundamental skills (e.g., symbolic manipulation, spatial transformation, etc.). The authors briefly justify their choice to avoid over-complicating the interaction analysis, but the framework's claims to "atomicity" could be softened.

Questions

  1. The paper defines "field capabilities" like Algebra and Geometry as atomic units. However, these are themselves complex domains. Could you elaborate on the process for selecting this level of granularity? Were finer-grained decompositions (e.g., breaking algebra into "equation solving" and "proof by induction") considered, and what were the primary trade-offs that led to the current four-field structure?

  2. The interaction experiments (Tables 3-5) offer great insights but are based on fine-tuning a single model, Qwen2.5-Math-7B. How confident are you that these interaction effects (e.g., the foundational role of algebra and conceptual understanding) are fundamental principles of mathematical reasoning in LLMs, rather than artifacts of the specific model's architecture or pre-training? Would you expect similar results if you fine-tuned a model from a different family, such as Llama?

Limitations

In Section 6, "Conclusion, limitation, and future directions," the authors explicitly state that they "have not explored more advanced strategies to stimulate a specific atomic capability, such as curriculum learning or reinforcement learning." This is a fair and important acknowledgment.

Final Justification

While I understand that there are currently no math-instruct models in the LLaMA series, I believe there are several available math-instruct models based on LLaMA. I would appreciate seeing more experiments included in the camera-ready version.

Overall, this is still an impressive contribution, and I will maintain my initial score.

Formatting Issues

no issues

Author Response

Response to Reviewer ezVn

We sincerely appreciate your insightful feedback and constructive suggestions. We hope our clarifications have resolved the issues you raised.

While the proposed decoupling is a major strength, the choice of what constitutes an "atom" is debatable. Fields like "Algebra" or "Geometry" are vast and could be considered composites of many smaller, more fundamental skills (e.g., symbolic manipulation, spatial transformation, etc.). The authors briefly justify their choice to avoid over-complicating the interaction analysis, but the framework's claims to "atomicity" could be softened.

We sincerely apologize for the confusion caused by our categorization of atomic abilities. In fact, the current division into four major field-level atomic abilities—algebra, geometry, analysis, and topology—was made for the following reasons.

First, we consulted with domain experts in mathematics (individuals holding at least a Ph.D degree in mathematics) and determined that adopting the four universally recognized fields within the mathematical community would provide a complete and non-overlapping classification framework. Second, while each of these fields indeed contains numerous subfields—we initially considered over ten field categories—such fine-grained divisions introduced several challenges. Classifying data into highly specific subfields made data collection significantly more difficult, as data availability for certain fields was extremely limited, resulting in imbalances across different categories and insufficient samples for effective training and evaluation.

Moreover, many of these fine-grained subfields show minimal or no interaction with one another. For instance, equation solving in algebra and spatial relations in geometry may involve completely disjoint abilities. As a result, any attempt to analyze interactions across such isolated subfields would demand exponentially more experimental resources, while the insights gained may be marginal, leading to wasted effort.

The paper defines "field capabilities" like Algebra and Geometry as atomic units. However, these are themselves complex fields. Could you elaborate on the process for selecting this level of granularity? Were finer-grained decompositions (e.g., breaking algebra into "equation solving" and "proof by induction") considered, and what were the primary trade-offs that led to the current four-field structure?

We apologize again for the confusion. As explained in our response to Weakness 1, the field-level classification we adopted is based on expert consultation and conforms to widely accepted categorizations in mathematics. In the data collection process, we re-mapped the original subfield labels to our four major fields. This was accomplished either through keyword matching or with the assistance of LLM-as-judge-based classification. As also discussed in our earlier response, this coarser field-level categorization allows us to explore the interaction effects between atomic abilities more efficiently, while minimizing training and computational overhead.

The interaction experiments (Tables 3-5) offer great insights but are based on fine-tuning a single model, Qwen2.5-Math-7B. How confident are you that these interaction effects (e.g., the foundational role of algebra and conceptual understanding) are fundamental principles of mathematical reasoning in LLMs, rather than artifacts of the specific model's architecture or pre-training? Would you expect similar results if you fine-tuned a model from a different family, such as Llama?

We deeply appreciate your valuable suggestion. The Qwen series includes a variety of open-source models across different parameter scales, offering both base and instruct versions, which makes it particularly suitable for research purposes. Among them, Qwen2.5-math-instruct demonstrated excellent performance in our atomic ability disentanglement experiments and thus became a focal model for our further investigation.

In response to your suggestion, we have now included InternLM2-math-plus-7B as an additional baseline for training and evaluation, aiming to further strengthen our claims. We did not choose Llama as the training backbone because there is no math-instruct model in the Llama series. Given the limited time during the rebuttal period, we selected only a few key training and evaluation scenarios. The results, shown in the table below, are consistent with our observations on Qwen2.5-math-instruct and provide a strong complement to our previous findings.

Model | Algebra (Low) | Algebra (High) | Analysis (Low) | Analysis (High) | Geometry (Low) | Geometry (High) | Topology (Low) | Topology (High)
InternLM2-math-plus-7b | 49.2 | 35.9 | 33.0 | 31.4 | 41.9 | 41.5 | 27.2 | 37.0
InternLM2-train-Algebra | 48.1 (-1.1) | 37.8 (+1.9) | 35.8 (+2.8) | 33.6 (+2.2) | 43.1 (+1.2) | 42.1 (+0.6) | 27.8 (+0.6) | 37.9 (+0.9)
InternLM2-train-Geometry | 48.6 (-0.6) | 36.3 (+0.4) | 32.2 (-0.8) | 30.5 (-0.9) | 44.3 (+2.4) | 44.0 (+2.5) | 28.7 (+1.5) | 38.8 (+1.8)

Limitations:

In Section 6, "Conclusion, limitation, and future directions," the authors explicitly state that they "have not explored more advanced strategies to stimulate a specific atomic capability, such as curriculum learning or reinforcement learning." This is a fair and important acknowledgment.

In this work, we focus primarily on the atomic ability thinking paradigm, a novel perspective on mathematical reasoning, and present a series of exploratory experiments. Building on our current findings, we hope to pursue more precise training strategies for activating atomic abilities in future work. For example, after discovering interactions among field-level atomic abilities, we could apply a curriculum learning approach, gradually introducing different field abilities layer by layer—starting with foundational support fields and progressing to more advanced ones—to achieve better generalization across fields.

Additionally, we plan to explore reinforcement learning-based methods that reward the model for correctly understanding and applying mathematical concepts. This could improve the model’s reasoning patterns, especially in forward and backward reasoning scenarios.

Comment

Dear Authors,

Thank you for your thorough clarification and valuable feedback. I have no further questions and appreciate the strength of your work.

While I understand that there are currently no math-instruct models in the LLaMA series, I believe there are several available open-sourced math-instruct models based on LLaMA. I would appreciate seeing more experiments included in the camera-ready version.

Overall, this is still an impressive contribution, and I will maintain my initial score.

Comment

We sincerely appreciate your recognition and constructive suggestions! Due to rebuttal time constraints, we were unable to add interaction experiments on more open-source models. We promise to include more experiments in the camera-ready version to fully strengthen our claims and improve the soundness of our work.

Review (Rating: 5)

This paper introduces a novel paradigm termed "Atomic Thinking" to evaluate and understand the mathematical reasoning abilities of Large Language Models (LLMs) at a more granular level. The authors argue that current end-to-end evaluation methods fall short of revealing whether LLMs genuinely master mathematical concepts or merely memorize problem-solving patterns. Inspired by human cognition, which decomposes complex problems into fundamental components, this work decouples mathematical intelligence into two primary dimensions: (1) field-specific atomic abilities, covering algebra, geometry, analysis, and topology, and (2) logical atomic abilities, which include conceptual understanding, formal forward reasoning, and counterexample-driven backward reasoning.

To support this framework, the paper contributes specialized training and evaluation datasets for each atomic capability. Through extensive experiments on a wide array of state-of-the-art open-source and proprietary LLMs, the study provides a detailed performance landscape. A key contribution lies in the interaction experiments, where the authors fine-tune models on one atomic ability to measure its impact on others.

The main findings are insightful: LLMs generally exhibit stronger performance in algebra and analysis but struggle with geometry and topology. Crucially, the paper reveals that even top-tier models have significant deficiencies in backward reasoning (i.e., constructing counterexamples). It also establishes that foundational skills, particularly conceptual understanding and algebraic abilities, play a pivotal role in boosting performance across other, more abstract domains. These contributions offer a new lens for analyzing LLM cognition and provide actionable insights for developing more efficient and generalizable training strategies for mathematical reasoning.

Strengths and Weaknesses

Strengths:

The "Atomic Thinking" framework is a clever and highly original contribution. It moves beyond standard end-to-end benchmarks to offer a more insightful, fine-grained analysis of what LLMs actually learn in mathematics. The experimental design is robust, covering a wide range of relevant models and thoughtfully decomposing skills. The paper's most significant contribution comes from its interaction experiments, which reveal fascinating, non-obvious relationships between different mathematical abilities—for example, showing that training on algebra can boost geometry skills more effectively than training on geometry itself. These findings are not only interesting but also provide practical guidance for developing better training strategies.

Weaknesses:

1) The paper's core idea is to "decouple" skills, but this is practically impossible to do perfectly. An algebra task might still rely on conceptual understanding, for example. The authors should acknowledge this limitation more directly and discuss how the inherent overlap between skills might affect their conclusions. It doesn't invalidate the work, but it's an important nuance to address.

2) The claims about ability transfer (Tables 3, 4, 5) are based on a single training run. Given the stochastic nature of fine-tuning, these results could be due to chance. The conclusions would be far more convincing if the authors ran each experiment multiple times (e.g., with different seeds) and reported means and standard deviations. Without this, it's hard to be certain that the observed performance gains are statistically significant.

3) The finding that models do better on hard topology problems than easy ones is intriguing but left underdeveloped. The authors suggest a data distribution mismatch, which is a reasonable hypothesis, but they don't provide any evidence to back it up. A few concrete examples of the problems in question or a quick analysis of relevant training data would make this point much stronger and less speculative.

Questions

  1. The "decoupling" of skills is central to your work, but it's hard to do perfectly. How did you control for skill overlap when creating your datasets? For example, how did you isolate "forward reasoning" from "conceptual understanding"?
  2. Your interaction experiments (Table 3, etc.) are based on single training runs. Can you run these experiments multiple times with different seeds and report the mean and standard deviation? Without this, it's difficult to know if the results are just due to chance. This is my main concern.
  3. The result where models are better at hard topology problems than easy ones is interesting. Can you provide some examples of these problems? Or, even better, show some evidence from training data that might explain this strange result? Right now, it feels too speculative.

Limitations

Yes, the authors have included a "Conclusion, limitation, and future directions" section (Section 6) that acknowledges some limitations of their work, such as not exploring more advanced training strategies like curriculum or reinforcement learning. This is a good start. However, the discussion could be improved by also explicitly addressing the key limitations raised in my review, namely:

  1. The inherent difficulty of achieving "pure" decoupling of atomic abilities and its potential impact on the experimental conclusions.
  2. The lack of statistical validation for the interaction experiments due to single-run evaluations, which affects the robustness of the transfer learning claims.

Adding these points would provide a more complete picture of the work's current scope and strengthen the paper by demonstrating a thorough self-assessment.

Final Justification

This is a technically solid and well-executed paper that offers meaningful insights with potential impact in the area of AI in education.

Formatting Issues

No.

Author Response

Response to Reviewer Hb5o

We are grateful for your thoughtful comments and suggestions. We trust that our detailed response has helped to clarify your concerns.

Weakness 1: The paper's core idea is to "decouple" skills, but this is practically impossible to do perfectly. An algebra task might still rely on conceptual understanding, for example. The authors should acknowledge this limitation more directly and discuss how the inherent overlap between skills might affect their conclusions. It doesn't invalidate the work, but it's an important nuance to address.

We sincerely apologize for the confusion caused by the atomic ability classification schema presented in our paper. We would like to clarify that the field-level atomic abilities and the logical reasoning atomic abilities represent two different perspectives for categorizing mathematical atomic abilities. Specifically, we propose four key field-level atomic abilities from a field-centric perspective, and three core logical reasoning abilities from a cognitive skill-centric perspective. Since these categorizations are based on different classification criteria, our primary concern is to ensure that each categorization is comprehensive and non-overlapping within its own dimension, rather than strictly non-overlapping across dimensions. For example, it is natural that algebraic tasks may rely on foundational conceptual understanding. We promise to clarify this classification schema in our camera ready version.

In each dimension of our classification, we strive to cover all essential abilities required in mathematics while minimizing overlap between abilities within the same schema. For a more detailed explanation regarding the decoupling of abilities, please refer to our response to Question 1.

Weakness 2: The claims about ability transfer (Tables 3, 4, 5) are based on a single training run. Given the stochastic nature of fine-tuning, these results could be due to chance. The conclusions would be far more convincing if the authors ran each experiment multiple times (e.g., with different seeds) and reported means and standard deviations. Without this, it's hard to be certain that the observed performance gains are statistically significant.

We greatly appreciate your suggestion. Indeed, conducting more experiments with different random seeds is an effective way to validate our hypothesis. Due to time constraints during the rebuttal phase, we have conducted a supplementary set of experiments on a representative subset of tasks using five different random seeds, and we report the average performance and standard deviation. The results, shown in the table below, provide strong statistical evidence supporting our hypothesis, demonstrating that the observed effects are not due to random fluctuation from a single run.

Model | Algebra (Low) | Algebra (High) | Analysis (Low) | Analysis (High) | Geometry (Low) | Geometry (High) | Topology (Low) | Topology (High)
Qwen-base | 80.5 | 65.2 | 67.7 | 66.5 | 52.1 | 53.4 | 52.1 | 53.4
Qwen-train-Algebra | 80.2 ± 0.4 | 69.7 ± 0.5 | 75.8 ± 0.6 | 71.5 ± 0.7 | 65.7 ± 0.3 | 57.5 ± 0.6 | 56.0 ± 0.5 | 62.3 ± 0.9
Qwen-train-Geometry | 79.6 ± 0.5 | 68.1 ± 0.4 | 69.8 ± 0.6 | 59.5 ± 0.8 | 57.3 ± 0.6 | 56.0 ± 0.5 | 52.8 ± 0.4 | 60.9 ± 0.3
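For clarity, the "±" entries above are simply per-seed means and standard deviations; a minimal aggregation sketch (with placeholder accuracies, not the actual runs) would look like this:

```python
# Minimal sketch of turning five per-seed accuracies into a "mean ± std" entry.
# The accuracy values below are placeholders, not the actual experimental results.
import statistics

def mean_std(per_seed_scores):
    mean = statistics.mean(per_seed_scores)
    std = statistics.pstdev(per_seed_scores)  # population std; use stdev() for the sample estimate
    return f"{mean:.1f} ± {std:.1f}"

print(mean_std([80.0, 80.5, 79.9, 80.3, 80.4]))  # -> "80.2 ± 0.2"
```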

Weakness 3: The finding that models do better on hard topology problems than easy ones is intriguing but left underdeveloped. The authors suggest a data distribution mismatch, which is a reasonable hypothesis, but they don't provide any evidence to back it up. A few concrete examples of the problems in question or a quick analysis of relevant training data would make this point much stronger and less speculative.

Thank you for the very helpful feedback. To provide a more direct comparison as you suggested, we can examine two problems that are of the same type—both require calculating a specific numerical value.

Example 1: An "Easy" L1 Problem (Incorrect Result).
Question: Find the coefficient of $x^{18}$ in the expansion of $(1+x^5+x^7)^{20}$.
Correct Answer: 0. Model's Answer: 369922 (incorrect).
Analysis of Model's Reasoning: The correct solution relies on the algebraic insight that the equation $5j + 7k = 18$ has no non-negative integer solutions for $j$ and $k$. Because of this, it is impossible to form an $x^{18}$ term, so its coefficient must be 0. However, the model's output suggests a different approach was taken, one that seems to show a misunderstanding of the multinomial theorem. It appears the model used a flawed computational heuristic instead of identifying the key algebraic constraint.
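The key constraint in Example 1 is easy to verify directly; the brute-force check below (our own illustration, not part of the paper) confirms that no exponent combination yields an $x^{18}$ term:

```python
# Illustrative check (not from the paper): a term x^18 in (1 + x^5 + x^7)^20 needs
# non-negative integers j, k with 5j + 7k = 18 and j + k <= 20.
solutions = [(j, k) for j in range(21) for k in range(21)
             if 5 * j + 7 * k == 18 and j + k <= 20]
print(solutions)  # [] -> no such term exists, so the coefficient of x^18 is 0
```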

Example 2: A "Hard" L2 Problem (Correct Result).
Question: Eleanor has 100 marbles, each of which is black or gold. The ratio of the number of black marbles to the number of gold marbles is $1:4$. How many gold marbles should she add to change this ratio to $1:6$?
Correct Answer: 40. Model's Answer: 40 (correct).
Analysis of Model's Reasoning: The model performed well on this problem, perhaps because it can be successfully deconstructed into a sequence of more straightforward algebraic steps. The model correctly:

  1. Set up and solved a system of equations for the initial state.
  2. Formulated a new equation for the final state.
  3. Solved for the final answer accurately.
This process shows a strength in applying structured, step-by-step logic.
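For reference, the structured solution in Example 2 reduces to two short algebraic steps: from 100 marbles in a $1:4$ ratio, there are $\tfrac{1}{5} \cdot 100 = 20$ black and $80$ gold marbles; adding $g$ gold marbles and requiring $\frac{20}{80+g} = \frac{1}{6}$ gives $80 + g = 120$, hence $g = 40$.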

A Possible Explanation for the Performance Gap

  1. These examples might offer a more detailed explanation for the "data distribution mismatch" hypothesis. It seems the issue may not be "difficulty" itself, but rather the underlying problem-solving paradigm.
  2. It appears that many L1 problems tend to require non-obvious computational insights. Such a problem functions like a computational puzzle, where success depends on spotting a specific "trick" or constraint. The model's failure on such problems could indicate that its training was less focused on this puzzle-like style.
  3. In contrast, many L2 problems seem to be solvable with structured, formal reasoning. This style rewards a methodical approach: translating words into a formal system (equations) and then executing a standard procedure. The model's success here and on other L2 problems involving formal proofs or definitions could suggest its training was more aligned with this style.
  4. To conclude, the performance gap might be better understood if we think about the problem sets in a different way. Instead of "easy vs. hard," it could be a distinction between "computational puzzles" and "structured reasoning problems."

Question 1: The "decoupling" of skills is central to your work, but it's hard to do perfectly. How did you control for skill overlap when creating your datasets? For example, how did you isolate "forward reasoning" from "conceptual understanding"?

The issue you raised is indeed very important. As we mentioned in the paper when analyzing the nature of atomic thinking, solving a mathematical problem often requires the coordination of multiple atomic abilities. The correctness of an answer typically reflects the interplay of various abilities. This is especially true for more fundamental abilities, such as conceptual understanding, which are likely to manifest across a wide range of problems.

Therefore, when constructing our dataset, the goal was not to ensure that each mathematical problem reflects only a single atomic ability, but rather to highlight a primary ability. In other words, if a model lacks a specific ability, it will struggle to solve problems designed to emphasize that ability. For instance, in tasks requiring forward reasoning, conceptual understanding certainly helps (which aligns with our findings on interaction effects), but if the model lacks forward reasoning ability, it will still fail to solve these complex multi-step reasoning problems effectively.

Question 2: Your interaction experiments (Table 3, etc.) are based on single training runs. Can you run these experiments multiple times with different seeds and report the mean and standard deviation? Without this, it's difficult to know if the results are just due to chance. This is my main concern.

Thank you again for your insightful suggestion. We have included the new experimental results as part of our response to Weakness 2. Due to the time constraints, we only selected a few representative cases for this supplementary experiment, and we commit to including more comprehensive results in the revised version of the paper.

Question 3: The result where models are better at hard topology problems than easy ones is interesting. Can you provide some examples of these problems? Or, even better, show some evidence from training data that might explain this strange result? Right now, it feels too speculative.

Thank you for your valuable advice! We hope the examples provided in the response to weakness 3 above help to address your concerns. If it would be beneficial, we promise to summarize this analysis in a chart or table for inclusion in the revised paper.

Comment

We are deeply grateful for the insightful comments and feedback, which we truly appreciate. In the final camera-ready version, we will carefully integrate all the clarifications and improvements discussed during the rebuttal process to further enhance the quality of the paper.

Comment

Thank you for your detailed response, which has addressed my main concerns. This is a solid piece of work and I hope the authors will incorporate feedback from other reviewers to further improve the paper in the camera-ready version.

Comment

Dear Reviewers and Area Chair,

As we approach the end of the discussion phase, we would like to express our sincere gratitude for your time, effort, and insightful comments on our paper. Your constructive comments and suggestions have been invaluable in helping us improve our work.

In our work, we focus on atomic thinking in LLMs' mathematical reasoning. We decouple the four field atomic abilities and three logical reasoning atomic abilities required by LLMs for mathematical reasoning, and evaluate a series of advanced open-source and commercial models. In addition, we conduct exploratory experiments and analyze the interactions between different atomic abilities, which provides many interesting discoveries and inspiration for future mathematical LLM training. We sincerely appreciate the unanimous recognition from the reviewers, whose initial scores are 5/4/4/4.

During the rebuttal process, we have strived to ensure that we addressed all the concerns raised by the reviewers, and have provided corresponding clarifications and supplementary experiments as suggested. We are more than happy to know that our rebuttals have significantly addressed the concerns of reviewers, which is reflected in the positive responses that all reviewers have given to our paper. All clarifications, revisions, and supplementary experiments during the discussion phase will be updated into the camera-ready version of the paper.

Your unanimous recognition of the value and contribution of our research motivates us to strive for further excellence. Once again, thank you for your dedication and the pivotal role you play in maintaining the high standards of the NeurIPS conference.

Best regards,

Authors of Manuscript 28853

Final Decision

This paper introduces a novel paradigm termed “Atomic Thinking” to evaluate and understand the mathematical reasoning abilities of Large Language Models (LLMs) at a more granular level. Experiments on both closed- and open-source LLMs reveal insightful findings, such as the fine-grained weaknesses of each model under the “Atomic Thinking” paradigm. Overall, the paper is professionally presented.

For the revision, the authors should incorporate clarifications on some key terms, address concerns regarding decoupling control, provide more data construction examples, include human evaluation, and conduct a deeper investigation into the findings—for instance, the better performance on hard topology problems.